iris.pandas
Provide conversion to and from Pandas data structures.
Tags: topic_interoperability
See also: https://pandas.pydata.org/
- iris.pandas.as_cube(pandas_array, copy=True, calendars=None)[source]
  Convert a Pandas Series/DataFrame into a 1D/2D Iris Cube.
  - Parameters:
    pandas_array (pandas.Series or pandas.DataFrame) – The Pandas object to convert.
    copy (bool, default=True) – Whether to copy pandas_array, or to create array views where possible. Provided in case of memory limit concerns.
      Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned Cube is independent of the input pandas data and to be consistent with pandas v3 behaviour.
    calendars (dict, optional) – A dict mapping a dimension to a calendar. Required to convert datetime indices/columns.
  - Return type: Cube
Notes
This function will copy your data by default.
Examples

    as_cube(series, calendars={0: cf_units.CALENDAR_360_DAY})
    as_cube(data_frame, calendars={1: cf_units.CALENDAR_STANDARD})
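The copy-versus-view distinction that the deprecated copy parameter controlled can be sketched with plain pandas/NumPy, no Iris required. This is an illustrative sketch, not Iris code; Series.to_numpy is standard pandas and the variable names are made up:

```python
import numpy as np
import pandas as pd

series = pd.Series(np.array([300.0, 301.0, 302.0]), name="air_temperature")

# copy=True: an independent array; mutating it leaves the Series untouched
independent = series.to_numpy(copy=True)
independent[0] = -1.0

# copy=False: pandas MAY return a view sharing memory with the Series
maybe_view = series.to_numpy(copy=False)

print(series.iloc[0])  # still 300.0: only the independent copy was mutated
```

Always copying (the future behaviour) trades a little memory for the guarantee that the two objects cannot alias each other.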
Since this function converts to/from a Pandas object, laziness will not be preserved.
Deprecated since version 3.3.0: This function is scheduled for removal in a future release, being replaced by
iris.pandas.as_cubes(), which offers richer dimensional intelligence.
- iris.pandas.as_cubes(pandas_structure, copy=True, calendars=None, aux_coord_cols=None, cell_measure_cols=None, ancillary_variable_cols=None)[source]
  Convert a Pandas Series/DataFrame into n-dimensional Iris Cubes, including dimensional metadata.
  The index of pandas_structure will be used for generating the Cube dimension(s) and DimCoord. Other dimensional metadata may span multiple dimensions, based on how the column values vary with the index values.
  - Parameters:
    pandas_structure (pandas.Series or pandas.DataFrame) – The Pandas object to convert.
    copy (bool, default=True) – Whether the Cube data is a copy of the pandas_structure column, or a view of the same array. Arrays other than the data (coords etc.) are always copies. This option is provided to help with memory size concerns.
      Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned Cube is independent of the input pandas data and to be consistent with pandas v3 behaviour.
    calendars (dict, optional) – Calendar conversions for individual date-time coordinate columns/index-levels, e.g. {"my_column": cf_units.CALENDAR_360_DAY}.
    aux_coord_cols (list of str, optional) – Names of columns to be converted into AuxCoord objects.
    cell_measure_cols (list of str, optional) – Names of columns to be converted into CellMeasure objects.
    ancillary_variable_cols (list of str, optional) – Names of columns to be converted into AncillaryVariable objects.
  - Returns:
    One Cube for each column not referenced in aux_coord_cols / cell_measure_cols / ancillary_variable_cols.
  - Return type: CubeList
Notes
A DataFrame using columns as a second data dimension will need to be ‘melted’ before conversion. See the Examples for how.
dask.dataframe.DataFrame is not supported.
Since this function converts to/from a Pandas object, laziness will not be preserved.
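The Examples below demonstrate aux_coord_cols; the other two column-role parameters work the same way. A pandas-only sketch of a frame laid out for all three roles (the column names are hypothetical, and the as_cubes call itself, which needs Iris installed, is shown as a comment):

```python
import pandas as pd

# A long-format frame whose index will become the Cube dimension.
df = pd.DataFrame({
    "longitude": [30, 40, 50],
    "air_temperature": [300.0, 301.0, 302.0],  # becomes the Cube data
    "quality_flag": [0, 1, 0],                 # candidate for aux_coord_cols
    "cell_area": [1.0, 1.0, 1.0],              # candidate for cell_measure_cols
    "obs_count": [5, 4, 6],                    # candidate for ancillary_variable_cols
}).set_index("longitude")

# With Iris available, each role would be assigned like so:
# cubes = as_cubes(df,
#                  aux_coord_cols=["quality_flag"],
#                  cell_measure_cols=["cell_area"],
#                  ancillary_variable_cols=["obs_count"])
```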
Examples
    >>> from iris.pandas import as_cubes
    >>> import numpy as np
    >>> from pandas import DataFrame, Series
Converting a simple Series:

    >>> my_series = Series([300, 301, 302], name="air_temperature")
    >>> converted_cubes = as_cubes(my_series)
    >>> print(converted_cubes)
    0: air_temperature / (unknown)         (unknown: 3)
    >>> print(converted_cubes[0])
    air_temperature / (unknown)            (unknown: 3)
        Dimension coordinates:
            unknown                            x
A DataFrame, with a custom index becoming the DimCoord:

    >>> my_df = DataFrame({
    ...     "air_temperature": [300, 301, 302],
    ...     "longitude": [30, 40, 50]
    ... })
    >>> my_df = my_df.set_index("longitude")
    >>> converted_cubes = as_cubes(my_df)
    >>> print(converted_cubes[0])
    air_temperature / (unknown)            (longitude: 3)
        Dimension coordinates:
            longitude                          x
A DataFrame representing two 3-dimensional datasets, including a 2-dimensional AuxCoord:

    >>> my_df = DataFrame({
    ...     "air_temperature": np.arange(300, 312, 1),
    ...     "air_pressure": np.arange(1000, 1012, 1),
    ...     "longitude": [0, 10] * 6,
    ...     "latitude": [25, 25, 35, 35] * 3,
    ...     "height": ([0] * 4) + ([100] * 4) + ([200] * 4),
    ...     "in_region": [True, False, False, False] * 3
    ... })
    >>> print(my_df)
        air_temperature  air_pressure  longitude  latitude  height  in_region
    0               300          1000          0        25       0       True
    1               301          1001         10        25       0      False
    2               302          1002          0        35       0      False
    3               303          1003         10        35       0      False
    4               304          1004          0        25     100       True
    5               305          1005         10        25     100      False
    6               306          1006          0        35     100      False
    7               307          1007         10        35     100      False
    8               308          1008          0        25     200       True
    9               309          1009         10        25     200      False
    10              310          1010          0        35     200      False
    11              311          1011         10        35     200      False
    >>> my_df = my_df.set_index(["longitude", "latitude", "height"])
    >>> my_df = my_df.sort_index()
    >>> converted_cubes = as_cubes(my_df, aux_coord_cols=["in_region"])
    >>> print(converted_cubes)
    0: air_temperature / (unknown)         (longitude: 2; latitude: 2; height: 3)
    1: air_pressure / (unknown)            (longitude: 2; latitude: 2; height: 3)
    >>> print(converted_cubes[0])
    air_temperature / (unknown)            (longitude: 2; latitude: 2; height: 3)
        Dimension coordinates:
            longitude                          x            -          -
            latitude                           -            x          -
            height                             -            -          x
        Auxiliary coordinates:
            in_region                          x            x          -
Pandas uses NaN rather than masking data. A converted Cube can be masked in downstream user code:

    >>> my_series = Series([300, np.nan, 302], name="air_temperature")
    >>> converted_cube = as_cubes(my_series)[0]
    >>> print(converted_cube.data)
    [300.  nan 302.]
    >>> converted_cube.data = np.ma.masked_invalid(converted_cube.data)
    >>> print(converted_cube.data)
    [300.0 -- 302.0]
If the DataFrame uses columns as a second dimension, pandas.melt() should be used to convert the data to the expected n-dimensional format:

    >>> my_df = DataFrame({
    ...     "latitude": [35, 25],
    ...     0: [300, 301],
    ...     10: [302, 303],
    ... })
    >>> print(my_df)
       latitude    0   10
    0        35  300  302
    1        25  301  303
    >>> my_df = my_df.melt(
    ...     id_vars=["latitude"],
    ...     value_vars=[0, 10],
    ...     var_name="longitude",
    ...     value_name="air_temperature"
    ... )
    >>> my_df["longitude"] = my_df["longitude"].infer_objects()
    >>> print(my_df)
       latitude longitude  air_temperature
    0        35         0              300
    1        25         0              301
    2        35        10              302
    3        25        10              303
    >>> my_df = my_df.set_index(["latitude", "longitude"])
    >>> my_df = my_df.sort_index()
    >>> converted_cube = as_cubes(my_df)[0]
    >>> print(converted_cube)
    air_temperature / (unknown)            (latitude: 2; longitude: 2)
        Dimension coordinates:
            latitude                           x            -
            longitude                          -            x
- iris.pandas.as_data_frame(cube, copy=True, add_aux_coords=False, add_cell_measures=False, add_ancillary_variables=False)[source]
  Convert a Cube to a pandas.DataFrame.
  dim_coords and data are flattened into a long-style DataFrame. Other aux_coords, cell_measures, ancillary_variables and attributes may be optionally added as additional DataFrame columns.
  - Parameters:
    cube (Cube) – The Cube to be converted to a pandas.DataFrame.
    copy (bool, default=True) – Whether the pandas.DataFrame is a copy of the Cube data. This option is provided to help with memory size concerns.
      Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned DataFrame is independent of the input cube data and to be consistent with pandas v3 behaviour.
    add_aux_coords (bool, default=False) – If True, add all aux_coords (including scalar coordinates) to the returned pandas.DataFrame.
    add_cell_measures (bool, default=False) – If True, add cell_measures to the returned pandas.DataFrame.
    add_ancillary_variables (bool, default=False) – If True, add ancillary_variables to the returned pandas.DataFrame.
  - Returns:
    A DataFrame with Cube dimensions forming a MultiIndex.
  - Return type: pandas.DataFrame
Warning
This documentation is for the new as_data_frame() behaviour, which is currently opt-in to preserve backwards compatibility. The default legacy behaviour is documented in pre-v3.4 documentation (summary: limited to a 2-dimensional Cube, with only the data and dim_coords being added). The legacy behaviour will be removed in a future version of Iris, so please opt-in to the new behaviour at your earliest convenience, via iris.Future:

    >>> iris.FUTURE.pandas_ndim = True

Breaking change: to enable the improvements, the new opt-in behaviour flattens multi-dimensional data into a single DataFrame column (the legacy behaviour preserves 2 dimensions via rows and columns).
Where the Cube contains masked values, these become numpy.nan in the returned DataFrame.
If the copy parameter is explicitly set to True or False, a DeprecationWarning is raised, as this parameter will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned DataFrame is independent of the input cube data and to be consistent with pandas v3 behaviour.
Notes
dask.dataframe.DataFrame is not supported.
A MultiIndex DataFrame is returned by default. Use reset_index() to return a DataFrame without MultiIndex levels. Use inplace=True to preserve the memory object reference.
Cube data dtype is preserved.
Since this function converts to/from a Pandas object, laziness will not be preserved.
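The MultiIndex-to-columns step described above is plain pandas and can be sketched without Iris; the coordinate names here are illustrative:

```python
import pandas as pd

# A long-style frame with Cube-like dimensions as MultiIndex levels.
idx = pd.MultiIndex.from_product(
    [[0, 1], [10.0, 20.0]], names=["time", "latitude"]
)
df = pd.DataFrame({"surface_temperature": [300.0, 301.0, 302.0, 303.0]}, index=idx)

# reset_index() turns the MultiIndex levels into ordinary columns;
# inplace=True would instead modify df in place, avoiding a new object.
flat = df.reset_index()
print(list(flat.columns))  # ['time', 'latitude', 'surface_temperature']
```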
Examples
    >>> import iris
    >>> from iris.pandas import as_data_frame
    >>> import pandas as pd
    >>> pd.set_option('display.width', 1000)
    >>> pd.set_option('display.max_columns', 1000)
Convert a simple Cube:

    >>> path = iris.sample_data_path('ostia_monthly.nc')
    >>> cube = iris.load_cube(path)
    >>> df = as_data_frame(cube)
    >>> print(df)
                                              surface_temperature
    time                latitude  longitude
    2006-04-16 00:00:00 -4.999992 0.000000             301.659271
                                  0.833333             301.785004
                                  1.666667             301.820984
                                  2.500000             301.865234
                                  3.333333             301.926819
    ...                                                       ...
    2010-09-16 00:00:00  4.444450 355.833313           298.779938
                                  356.666656           298.913147
                                  357.500000                  NaN
                                  358.333313                  NaN
                                  359.166656           298.995148

    [419904 rows x 1 columns]
Using add_aux_coords=True maps AuxCoord and scalar coordinate information to the DataFrame:

    >>> df = as_data_frame(cube, add_aux_coords=True)
    >>> print(df)
                                              surface_temperature  forecast_period  forecast_reference_time
    time                latitude  longitude
    2006-04-16 00:00:00 -4.999992 0.000000             301.659271                0      2006-04-16 12:00:00
                                  0.833333             301.785004                0      2006-04-16 12:00:00
                                  1.666667             301.820984                0      2006-04-16 12:00:00
                                  2.500000             301.865234                0      2006-04-16 12:00:00
                                  3.333333             301.926819                0      2006-04-16 12:00:00
    ...                                                       ...              ...                      ...
    2010-09-16 00:00:00  4.444450 355.833313           298.779938                0      2010-09-16 12:00:00
                                  356.666656           298.913147                0      2010-09-16 12:00:00
                                  357.500000                  NaN                0      2010-09-16 12:00:00
                                  358.333313                  NaN                0      2010-09-16 12:00:00
                                  359.166656           298.995148                0      2010-09-16 12:00:00

    [419904 rows x 3 columns]
To add netCDF global attribute information to the DataFrame, add a column directly to the DataFrame:

    >>> df['STASH'] = str(cube.attributes['STASH'])
    >>> print(df)
                                              surface_temperature  forecast_period  forecast_reference_time       STASH
    time                latitude  longitude
    2006-04-16 00:00:00 -4.999992 0.000000             301.659271                0      2006-04-16 12:00:00  m01s00i024
                                  0.833333             301.785004                0      2006-04-16 12:00:00  m01s00i024
                                  1.666667             301.820984                0      2006-04-16 12:00:00  m01s00i024
                                  2.500000             301.865234                0      2006-04-16 12:00:00  m01s00i024
                                  3.333333             301.926819                0      2006-04-16 12:00:00  m01s00i024
    ...                                                       ...              ...                      ...         ...
    2010-09-16 00:00:00  4.444450 355.833313           298.779938                0      2010-09-16 12:00:00  m01s00i024
                                  356.666656           298.913147                0      2010-09-16 12:00:00  m01s00i024
                                  357.500000                  NaN                0      2010-09-16 12:00:00  m01s00i024
                                  358.333313                  NaN                0      2010-09-16 12:00:00  m01s00i024
                                  359.166656           298.995148                0      2010-09-16 12:00:00  m01s00i024

    [419904 rows x 4 columns]
To return a DataFrame without a MultiIndex, use reset_index(). Optionally use the inplace=True keyword to modify the DataFrame rather than creating a new one:

    >>> df.reset_index(inplace=True)
    >>> print(df)
                           time  latitude   longitude  surface_temperature  forecast_period  forecast_reference_time       STASH
    0       2006-04-16 00:00:00 -4.999992    0.000000           301.659271                0      2006-04-16 12:00:00  m01s00i024
    1       2006-04-16 00:00:00 -4.999992    0.833333           301.785004                0      2006-04-16 12:00:00  m01s00i024
    2       2006-04-16 00:00:00 -4.999992    1.666667           301.820984                0      2006-04-16 12:00:00  m01s00i024
    3       2006-04-16 00:00:00 -4.999992    2.500000           301.865234                0      2006-04-16 12:00:00  m01s00i024
    4       2006-04-16 00:00:00 -4.999992    3.333333           301.926819                0      2006-04-16 12:00:00  m01s00i024
    ...                     ...       ...         ...                  ...              ...                      ...         ...
    419899  2010-09-16 00:00:00  4.444450  355.833313           298.779938                0      2010-09-16 12:00:00  m01s00i024
    419900  2010-09-16 00:00:00  4.444450  356.666656           298.913147                0      2010-09-16 12:00:00  m01s00i024
    419901  2010-09-16 00:00:00  4.444450  357.500000                  NaN                0      2010-09-16 12:00:00  m01s00i024
    419902  2010-09-16 00:00:00  4.444450  358.333313                  NaN                0      2010-09-16 12:00:00  m01s00i024
    419903  2010-09-16 00:00:00  4.444450  359.166656           298.995148                0      2010-09-16 12:00:00  m01s00i024

    [419904 rows x 7 columns]
To retrieve a Series from the df DataFrame, subselect a column:

    >>> df['surface_temperature']
    0         301.659271
    1         301.785004
    2         301.820984
    3         301.865234
    4         301.926819
                 ...
    419899    298.779938
    419900    298.913147
    419901           NaN
    419902           NaN
    419903    298.995148
    Name: surface_temperature, Length: 419904, dtype: float32
- iris.pandas.as_series(cube, copy=True)[source]
  Convert a 1D cube to a Pandas Series.
  - Parameters:
    cube (Cube) – The cube to convert to a Pandas Series.
    copy (bool, default=True) – Whether to make a copy of the data. Defaults to True. Must be True for masked data.
      Deprecated since version 3.15.0: The ‘copy’ parameter is deprecated and will be removed in a future release. This function will always make a copy of the data array, to ensure that the returned Series is independent of the input cube data and to be consistent with pandas v3 behaviour.
  - Return type: pandas.Series
Notes
This function will copy your data by default. If you have a large array that cannot be copied, make sure it is not masked and use copy=False.
Since this function converts to/from a Pandas object, laziness will not be preserved.
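Why masked data forces a copy can be sketched with NumPy/pandas alone: pandas has no mask concept, so masked points must be materialised as NaN in a freshly allocated array. This is an illustrative sketch, not Iris internals:

```python
import numpy as np
import pandas as pd

masked = np.ma.masked_array([300.0, 301.0, 302.0], mask=[False, True, False])

# .filled() allocates a new array with masked slots replaced by NaN,
# so the resulting Series cannot be a view of the original data.
series = pd.Series(masked.filled(np.nan), name="air_temperature")
print(series.isna().tolist())  # [False, True, False]
```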
Deprecated since version 3.4.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_data_frame(), which offers improved multi-dimension handling.