iris.pandas

Provide conversion to and from Pandas data structures.

See also: https://pandas.pydata.org/

iris.pandas.as_cube(pandas_array, copy=True, calendars=None)

Convert a Pandas Series/DataFrame into a 1D/2D Iris Cube.

Parameters:
  • pandas_array (pandas.Series or pandas.DataFrame) – The Pandas object to convert.

  • copy (bool, default=True) –

    Whether to copy pandas_array, or to create array views where possible. Provided in case of memory limit concerns.

    Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned Cube is independent of the input pandas data and to be consistent with pandas v3 behaviour.

  • calendars (dict, optional) – A dict mapping a dimension to a calendar. Required to convert datetime indices/columns.

Return type:

Cube

Notes

This function will copy your data by default.

Examples

as_cube(series, calendars={0: cf_units.CALENDAR_360_DAY})
as_cube(data_frame, calendars={1: cf_units.CALENDAR_STANDARD})

Since this function converts to/from a Pandas object, laziness will not be preserved.

Deprecated since version 3.3.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_cubes(), which offers richer dimensional intelligence.
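Since as_cube() is deprecated in favour of as_cubes(), a minimal migration sketch may help. The Iris calls are shown as comments because they require an Iris installation; the pandas-side preparation is identical for both functions:

```python
import pandas as pd

# The same Series works with both functions; only the Iris call changes.
series = pd.Series([300.0, 301.0, 302.0], name="air_temperature")

# Legacy (deprecated since 3.3.0):
#     cube = iris.pandas.as_cube(series)
# Replacement: as_cubes() returns a CubeList, so take the first element:
#     cube = iris.pandas.as_cubes(series)[0]
```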

iris.pandas.as_cubes(pandas_structure, copy=True, calendars=None, aux_coord_cols=None, cell_measure_cols=None, ancillary_variable_cols=None)

Convert a Pandas Series/DataFrame into n-dimensional Iris Cubes, including dimensional metadata.

The index of pandas_structure is used to generate the Cube dimension(s) and DimCoords. Other dimensional metadata may span multiple dimensions, depending on how the column values vary with the index values.

Parameters:
  • pandas_structure (pandas.Series or pandas.DataFrame) – The Pandas object to convert.

  • copy (bool, default=True) –

    Whether the Cube data is a copy of the pandas_structure column, or a view of the same array. Arrays other than the data (coords etc.) are always copies. This option is provided to help with memory size concerns.

    Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned Cube is independent of the input pandas data and to be consistent with pandas v3 behaviour.

  • calendars (dict, optional) – Calendar conversions for individual date-time coordinate columns/index-levels e.g. {"my_column": cf_units.CALENDAR_360_DAY}.

  • aux_coord_cols (list of str, optional) – Names of columns to be converted into AuxCoord objects.

  • cell_measure_cols (list of str, optional) – Names of columns to be converted into CellMeasure objects.

  • ancillary_variable_cols (list of str, optional) – Names of columns to be converted into AncillaryVariable objects.

Returns:

One Cube for each column not referenced in aux_coord_cols/cell_measure_cols/ancillary_variable_cols.

Return type:

CubeList

Notes

A DataFrame using columns as a second data dimension will need to be ‘melted’ before conversion. See the Examples for how.

dask.dataframe.DataFrame are not supported.

Since this function converts to/from a Pandas object, laziness will not be preserved.
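Because dask.dataframe objects are rejected, they must be materialised into plain pandas before conversion. A minimal sketch, using a hypothetical ensure_pandas helper and a duck-type check (dask collections expose a compute() method, plain pandas objects do not):

```python
import pandas as pd

def ensure_pandas(df):
    # Hypothetical helper: materialise a dask DataFrame (or Series) into
    # plain pandas before passing it to as_cubes(). Dask collections
    # expose a compute() method; plain pandas objects do not.
    if hasattr(df, "compute"):
        df = df.compute()
    return df

# A plain pandas DataFrame passes through unchanged.
plain = ensure_pandas(pd.DataFrame({"air_temperature": [300, 301]}))
```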

Examples

>>> from iris.pandas import as_cubes
>>> import numpy as np
>>> from pandas import DataFrame, Series

Converting a simple Series :

>>> my_series = Series([300, 301, 302], name="air_temperature")
>>> converted_cubes = as_cubes(my_series)
>>> print(converted_cubes)
0: air_temperature / (unknown)         (unknown: 3)
>>> print(converted_cubes[0])
air_temperature / (unknown)         (unknown: 3)
    Dimension coordinates:
        unknown                             x

A DataFrame, with a custom index becoming the DimCoord :

>>> my_df = DataFrame({
...     "air_temperature": [300, 301, 302],
...     "longitude": [30, 40, 50]
...     })
>>> my_df = my_df.set_index("longitude")
>>> converted_cubes = as_cubes(my_df)
>>> print(converted_cubes[0])
air_temperature / (unknown)         (longitude: 3)
    Dimension coordinates:
        longitude                             x

A DataFrame representing two 3-dimensional datasets, including a 2-dimensional AuxCoord :

>>> my_df = DataFrame({
...     "air_temperature": np.arange(300, 312, 1),
...     "air_pressure": np.arange(1000, 1012, 1),
...     "longitude": [0, 10] * 6,
...     "latitude": [25, 25, 35, 35] * 3,
...     "height": ([0] * 4) + ([100] * 4) + ([200] * 4),
...     "in_region": [True, False, False, False] * 3
... })
>>> print(my_df)
    air_temperature  air_pressure  longitude  latitude  height  in_region
0               300          1000          0        25       0       True
1               301          1001         10        25       0      False
2               302          1002          0        35       0      False
3               303          1003         10        35       0      False
4               304          1004          0        25     100       True
5               305          1005         10        25     100      False
6               306          1006          0        35     100      False
7               307          1007         10        35     100      False
8               308          1008          0        25     200       True
9               309          1009         10        25     200      False
10              310          1010          0        35     200      False
11              311          1011         10        35     200      False
>>> my_df = my_df.set_index(["longitude", "latitude", "height"])
>>> my_df = my_df.sort_index()
>>> converted_cubes = as_cubes(my_df, aux_coord_cols=["in_region"])
>>> print(converted_cubes)
0: air_temperature / (unknown)         (longitude: 2; latitude: 2; height: 3)
1: air_pressure / (unknown)            (longitude: 2; latitude: 2; height: 3)
>>> print(converted_cubes[0])
air_temperature / (unknown)         (longitude: 2; latitude: 2; height: 3)
    Dimension coordinates:
        longitude                             x            -          -
        latitude                              -            x          -
        height                                -            -          x
    Auxiliary coordinates:
        in_region                             x            x          -

Pandas uses NaN rather than masking data. The converted Cube can be masked in downstream user code :

>>> my_series = Series([300, np.nan, 302], name="air_temperature")
>>> converted_cube = as_cubes(my_series)[0]
>>> print(converted_cube.data)
[300.  nan 302.]
>>> converted_cube.data = np.ma.masked_invalid(converted_cube.data)
>>> print(converted_cube.data)
[300.0 -- 302.0]

If the DataFrame uses columns as a second dimension, pandas.melt() should be used to convert the data to the expected n-dimensional format :

>>> my_df = DataFrame({
...     "latitude": [35, 25],
...     0: [300, 301],
...     10: [302, 303],
... })
>>> print(my_df)
   latitude    0   10
0        35  300  302
1        25  301  303
>>> my_df = my_df.melt(
...     id_vars=["latitude"],
...     value_vars=[0, 10],
...     var_name="longitude",
...     value_name="air_temperature"
... )
>>> my_df["longitude"] = my_df["longitude"].infer_objects()
>>> print(my_df)
   latitude  longitude  air_temperature
0        35          0              300
1        25          0              301
2        35         10              302
3        25         10              303
>>> my_df = my_df.set_index(["latitude", "longitude"])
>>> my_df = my_df.sort_index()
>>> converted_cube = as_cubes(my_df)[0]
>>> print(converted_cube)
air_temperature / (unknown)         (latitude: 2; longitude: 2)
    Dimension coordinates:
        latitude                             x             -
        longitude                            -             x
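The calendars parameter maps a named index level (or column) to a calendar when converting date-time values. A minimal sketch of the pandas-side preparation; the Iris call is shown as a comment since it needs an Iris installation, and cf_units.CALENDAR_STANDARD is the cf_units constant referred to in the parameter description:

```python
import pandas as pd

# A Series indexed by date-times; naming the index level "time" lets it
# be referenced in the calendars mapping.
times = pd.date_range("2000-01-01", periods=3, freq="D")
series = pd.Series([300.0, 301.0, 302.0],
                   index=pd.Index(times, name="time"),
                   name="air_temperature")

# With Iris installed, the time index would be converted like so:
#     cubes = iris.pandas.as_cubes(
#         series, calendars={"time": cf_units.CALENDAR_STANDARD})
```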

iris.pandas.as_data_frame(cube, copy=True, add_aux_coords=False, add_cell_measures=False, add_ancillary_variables=False)

Convert a Cube to a pandas.DataFrame.

dim_coords and data are flattened into a long-style DataFrame. Other aux_coords, cell_measures, ancillary_variables and attributes may optionally be added as additional DataFrame columns.

Parameters:
  • cube (Cube) – The Cube to be converted to a pandas.DataFrame.

  • copy (bool, default=True) –

Whether the pandas.DataFrame is a copy of the Cube data. This option is provided to help with memory size concerns.

Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned DataFrame is independent of the input Cube data and to be consistent with pandas v3 behaviour.

  • add_aux_coords (bool, default=False) – If True, add all aux_coords (including scalar coordinates) to the returned pandas.DataFrame.

  • add_cell_measures (bool, default=False) – If True, add cell_measures to the returned pandas.DataFrame.

  • add_ancillary_variables (bool, default=False) – If True, add ancillary_variables to the returned pandas.DataFrame.

Returns:

A DataFrame with Cube dimensions forming a MultiIndex.

Return type:

DataFrame

Warning

  1. This documentation is for the new as_data_frame() behaviour, which is currently opt-in to preserve backwards compatibility. The default legacy behaviour is documented in pre-v3.4 documentation (summary: limited to 2-dimensional Cube, with only the data and dim_coords being added). The legacy behaviour will be removed in a future version of Iris, so please opt-in to the new behaviour at your earliest convenience, via iris.Future:

    >>> iris.FUTURE.pandas_ndim = True
    

    Breaking change: to enable the improvements, the new opt-in behaviour flattens multi-dimensional data into a single DataFrame column (the legacy behaviour preserves 2 dimensions via rows and columns).

  2. Where the Cube contains masked values, these become numpy.nan in the returned DataFrame.

  3. If the copy parameter is explicitly set to True or False, a DeprecationWarning is raised, as this parameter will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned DataFrame is independent of the input Cube data and to be consistent with pandas v3 behaviour.

Notes

dask.dataframe.DataFrame are not supported.

A MultiIndex DataFrame is returned by default. Use reset_index() to return a DataFrame without MultiIndex levels; pass inplace=True to modify the existing DataFrame and preserve the object reference.

Cube data dtype is preserved.

Since this function converts to/from a Pandas object, laziness will not be preserved.

Examples

>>> import iris
>>> from iris.pandas import as_data_frame
>>> import pandas as pd
>>> pd.set_option('display.width', 1000)
>>> pd.set_option('display.max_columns', 1000)

Convert a simple Cube:

>>> path = iris.sample_data_path('ostia_monthly.nc')
>>> cube = iris.load_cube(path)
>>> df = as_data_frame(cube)
>>> print(df)
... 
                                          surface_temperature
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000             301.659271
                              0.833333             301.785004
                              1.666667             301.820984
                              2.500000             301.865234
                              3.333333             301.926819
...                                                       ...
2010-09-16 00:00:00  4.444450 355.833313           298.779938
                              356.666656           298.913147
                              357.500000                  NaN
                              358.333313                  NaN
                              359.166656           298.995148

[419904 rows x 1 columns]

Using add_aux_coords=True maps AuxCoord and scalar coordinate information to the DataFrame:

>>> df = as_data_frame(cube, add_aux_coords=True)
>>> print(df)
... 
                                          surface_temperature  forecast_period forecast_reference_time
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000             301.659271                0     2006-04-16 12:00:00
                              0.833333             301.785004                0     2006-04-16 12:00:00
                              1.666667             301.820984                0     2006-04-16 12:00:00
                              2.500000             301.865234                0     2006-04-16 12:00:00
                              3.333333             301.926819                0     2006-04-16 12:00:00
...                                                       ...              ...                     ...
2010-09-16 00:00:00  4.444450 355.833313           298.779938                0     2010-09-16 12:00:00
                              356.666656           298.913147                0     2010-09-16 12:00:00
                              357.500000                  NaN                0     2010-09-16 12:00:00
                              358.333313                  NaN                0     2010-09-16 12:00:00
                              359.166656           298.995148                0     2010-09-16 12:00:00

[419904 rows x 3 columns]

To add netCDF global attribute information to the DataFrame, add a column directly to the DataFrame:

>>> df['STASH'] = str(cube.attributes['STASH'])
>>> print(df)
... 
                                          surface_temperature  forecast_period forecast_reference_time       STASH
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000             301.659271                0     2006-04-16 12:00:00  m01s00i024
                              0.833333             301.785004                0     2006-04-16 12:00:00  m01s00i024
                              1.666667             301.820984                0     2006-04-16 12:00:00  m01s00i024
                              2.500000             301.865234                0     2006-04-16 12:00:00  m01s00i024
                              3.333333             301.926819                0     2006-04-16 12:00:00  m01s00i024
...                                                       ...              ...                     ...         ...
2010-09-16 00:00:00  4.444450 355.833313           298.779938                0     2010-09-16 12:00:00  m01s00i024
                              356.666656           298.913147                0     2010-09-16 12:00:00  m01s00i024
                              357.500000                  NaN                0     2010-09-16 12:00:00  m01s00i024
                              358.333313                  NaN                0     2010-09-16 12:00:00  m01s00i024
                              359.166656           298.995148                0     2010-09-16 12:00:00  m01s00i024

[419904 rows x 4 columns]

To return a DataFrame without a MultiIndex use reset_index(). Optionally use inplace=True keyword to modify the DataFrame rather than creating a new one:

>>> df.reset_index(inplace=True)
>>> print(df)
... 
                       time  latitude   longitude  surface_temperature  forecast_period forecast_reference_time       STASH
0       2006-04-16 00:00:00 -4.999992    0.000000           301.659271                0     2006-04-16 12:00:00  m01s00i024
1       2006-04-16 00:00:00 -4.999992    0.833333           301.785004                0     2006-04-16 12:00:00  m01s00i024
2       2006-04-16 00:00:00 -4.999992    1.666667           301.820984                0     2006-04-16 12:00:00  m01s00i024
3       2006-04-16 00:00:00 -4.999992    2.500000           301.865234                0     2006-04-16 12:00:00  m01s00i024
4       2006-04-16 00:00:00 -4.999992    3.333333           301.926819                0     2006-04-16 12:00:00  m01s00i024
                     ...       ...         ...                  ...              ...                     ...         ...
419899  2010-09-16 00:00:00  4.444450  355.833313           298.779938                0     2010-09-16 12:00:00  m01s00i024
419900  2010-09-16 00:00:00  4.444450  356.666656           298.913147                0     2010-09-16 12:00:00  m01s00i024
419901  2010-09-16 00:00:00  4.444450  357.500000                  NaN                0     2010-09-16 12:00:00  m01s00i024
419902  2010-09-16 00:00:00  4.444450  358.333313                  NaN                0     2010-09-16 12:00:00  m01s00i024
419903  2010-09-16 00:00:00  4.444450  359.166656           298.995148                0     2010-09-16 12:00:00  m01s00i024

[419904 rows x 7 columns]

To retrieve a Series from df DataFrame, subselect a column:

>>> df['surface_temperature']
0         301.659271
1         301.785004
2         301.820984
3         301.865234
4         301.926819
            ...
419899    298.779938
419900    298.913147
419901           NaN
419902           NaN
419903    298.995148
Name: surface_temperature, Length: 419904, dtype: float32
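The masked-to-NaN behaviour described in warning 2 above can be sketched with plain NumPy. This is an equivalent of what as_data_frame does when flattening masked Cube data, not the actual Iris implementation:

```python
import numpy as np

# A masked array standing in for masked Cube data.
data = np.ma.masked_array([300.0, 301.0, 302.0], mask=[False, True, False])

# Masked entries surface as NaN in the returned DataFrame, equivalent to:
filled = data.filled(np.nan)
```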

iris.pandas.as_series(cube, copy=True)

Convert a 1D cube to a Pandas Series.

Parameters:
  • cube (Cube) – The cube to convert to a Pandas Series.

  • copy (bool, default=True) –

    Whether to make a copy of the data. Defaults to True. Must be True for masked data.

    Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned Series is independent of the input Cube data and to be consistent with pandas v3 behaviour.

Return type:

Series

Notes

This function will copy your data by default. If you have a large array that cannot be copied, make sure it is not masked and use copy=False.

Since this function converts to/from a Pandas object, laziness will not be preserved.

Deprecated since version 3.4.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_data_frame(), which offers improved multi dimension handling.
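A minimal migration sketch from the deprecated as_series(). The Iris calls are shown as comments since they require an Iris installation; the pandas squeeze() step is what recovers a Series from the one-column DataFrame that as_data_frame() returns for a 1D cube:

```python
import pandas as pd

# Legacy (deprecated since 3.4.0):
#     series = iris.pandas.as_series(cube)
# Replacement, with iris.FUTURE.pandas_ndim = True:
#     df = iris.pandas.as_data_frame(cube)
#     series = df.squeeze(axis="columns")

# squeeze() on a one-column DataFrame returns its single column as a Series:
df = pd.DataFrame({"air_temperature": [300.0, 301.0, 302.0]})
series = df.squeeze(axis="columns")
```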