ENH: IO support for R data files with C extension #41386
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Copyright (c) 2013-2020 Evan Miller (except where otherwise noted) | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in | ||
all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like | |
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>` | ||
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`; | ||
binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>` | ||
binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>` | ||
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>` | ||
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`; | ||
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`; | ||
|
@@ -5906,6 +5907,307 @@ respective functions from ``pandas-gbq``. | |
|
||
Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__. | ||
|
||
|
||
.. _io.rdata: | ||
|
||
R data format | ||
------------- | ||
|
||
.. _io.rdata_reader: | ||
|
||
Reading R data | ||
'''''''''''''' | ||
|
||
.. versionadded:: 1.3.0 | ||
|
||
The top-level function ``read_rdata`` reads the native serialization formats | ||
of the R language and environment. For .RData and its synonymous shorthand, .rda, | ||
which can hold multiple R objects, the method returns a ``dict`` of ``DataFrames``. | ||
For .rds files, which contain only a single R object, the method returns a ``dict`` | ||
holding a single ``DataFrame``. | ||
|
||
.. note:: | ||
|
||
Since any R object can be saved in these types, this method will only return | ||
data.frame objects or objects coercible to data.frames including matrices, | ||
tibbles, and data.tables. | ||
|
||
For more information on R's serialization data types, see the docs on the `rds`_ | ||
and `rda`_ data formats. | ||
|
||
.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS | ||
|
||
.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save | ||
|
||
For example, consider the following data.frames generated in R using environmental | ||
data samples from US EPA, UK BGCI, and NOAA public data: | ||
|
||
.. code-block:: r | ||
|
||
ghg_df <- data.frame( | ||
gas = c("Carbon dioxide", "Methane", "Nitrous oxide", | ||
"Fluorinated gases", "Total"), | ||
year = c(2018, 2018, 2018, 2018, 2018), | ||
emissions = c(5424.88150213288, 634.457127078267, 434.528555376666, | ||
182.782432461777, 6676.64961704959), | ||
row.names = c(141:145), | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
saveRDS(ghg_df, file="ghg_df.rds") | ||
|
||
plants_df <- data.frame( | ||
plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes", | ||
"Pteridophytes", "Pteridophytes"), | ||
status = c("Data Deficient", "Extinct", "Not Threatened", | ||
"Possibly Threatened", "Threatened"), | ||
count = c(398, 65, 1294, 408, 1275), | ||
row.names = c(16:20), | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
saveRDS(plants_df, file="plants_df.rds") | ||
|
||
sea_ice_df <- data.frame( | ||
year = c(2016, 2017, 2018, 2019, 2020), | ||
mo = c(12, 12, 12, 12, 12), | ||
data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"), | ||
region = c("S", "S", "S", "S", "S"), | ||
extent = c(8.28, 9.48, 9.19, 9.41, 10.44), | ||
area = c(5.51, 6.23, 5.59, 6.59, 6.5), | ||
row.names = c(1012:1016), | ||
stringsAsFactors = FALSE | ||
) | ||
|
||
saveRDS(sea_ice_df, file="sea_ice_df.rds") | ||
|
||
save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda") | ||
|
||
With ``read_rdata``, you can read the above .rds or .rda files: | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata") | ||
file_path = os.path.abspath(rel_path) | ||
|
||
.. ipython:: python | ||
|
||
rds_file = os.path.join(file_path, "ghg_df.rds") | ||
ghg_df = pd.read_rdata(rds_file)["r_dataframe"].tail() | ||
ghg_df | ||
|
||
rda_file = os.path.join(file_path, "env_data_dfs.rda") | ||
env_dfs = pd.read_rdata(rda_file) | ||
{k: df.tail() for k, df in env_dfs.items()} | ||
|
||
To ignore the rownames of a data.frame, use the option ``rownames=False``: | ||
|
||
.. ipython:: python | ||
|
||
rds_file = os.path.join(file_path, "plants_df.rds") | ||
plants_df = pd.read_rdata(rds_file, rownames=False)["r_dataframe"].tail() | ||
plants_df | ||
|
||
|
||
To select specific objects in an .rda file, pass a list of names into ``select_frames``: | ||
|
||
.. ipython:: python | ||
|
||
rda_file = os.path.join(file_path, "env_data_dfs.rda") | ||
env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"]) | ||
Is this a good pattern? Should it default to (a) reading a single frame if there is one or (a) reading a single frame if available Think about how this compares to In fact, should there be a class-based interface that would allow multiple reads from the same file, similar to
Originally, I had output generate either a single DataFrame for For vast majority, especially in R packages, only a single named data frame is ever stored in .Rdata or .rda files. Given rdata IO may not be as popular as the Excel IO module, for class-based reader interface, maybe we should gauge this need by users for a potential future PR?
I think that the
Hmmm...per IO docs under specifying sheets, if you pass a list of sheets or
yeah for |
||
env_dfs | ||
|
||
To read from a file-like object, pass the object as the ``path_or_buffer`` argument: | ||
|
||
.. ipython:: python | ||
|
||
rds_file = os.path.join(file_path, "plants_df.rds") | ||
with open(rds_file, "rb") as f: | ||
    plants_df = pd.read_rdata( | ||
        f, | ||
        file_format="rds", | ||
    )["r_dataframe"] | ||
|
||
plants_df | ||
|
||
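The "path or buffer" convention used above can be sketched in plain Python. The helper below, ``read_all_bytes``, is hypothetical and is not part of pandas or of this PR; it only illustrates how a reader might accept either a file path or an open binary handle. It may also suggest why ``file_format`` is passed explicitly for a buffer: a handle has no file extension to infer the format from (an assumption, not something the PR states).

```python
import io
import os
import tempfile


def read_all_bytes(path_or_buffer):
    """Hypothetical helper: return raw bytes from a path or a binary file-like object."""
    if hasattr(path_or_buffer, "read"):
        # Already an open handle (file object, BytesIO, ...): read it directly.
        return path_or_buffer.read()
    # Otherwise treat the argument as a filesystem path.
    with open(path_or_buffer, "rb") as handle:
        return handle.read()


content = b"example bytes"

# Write a throwaway file so the path branch can be exercised.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    tmp_path = tmp.name

try:
    from_path = read_all_bytes(tmp_path)
    with open(tmp_path, "rb") as f:
        from_handle = read_all_bytes(f)
    from_memory = read_all_bytes(io.BytesIO(content))
finally:
    os.remove(tmp_path)
```

All three calls return the same bytes, which is the behavior a caller would expect from a ``path_or_buffer`` style argument.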
To read from a URL, pass the link directly to the method: | ||
|
||
.. ipython:: python | ||
|
||
url = ("https://github.com/hadley/nycflights13/" | ||
"blob/master/data/airlines.rda?raw=true") | ||
|
||
airlines = pd.read_rdata(url, file_format="rda") | ||
airlines | ||
|
||
To read from an Amazon S3 bucket, point to the storage path. Note that R data encoded | ||
in anything other than UTF-8 is currently not supported, as this example shows: | ||
|
||
.. code-block:: ipython | ||
|
||
In [608]: ghcran = pd.read_rdata("s3://public-r-data/ghcran.Rdata") | ||
... | ||
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 45: invalid continuation byte | ||
|
||
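The decoding failure above can be reproduced without any R file. This is a minimal sketch using a made-up byte string; it assumes, for illustration only, that the offending text was written in Latin-1, which the PR does not state.

```python
# Illustrative only: this byte string is made up, not taken from ghcran.Rdata.
raw = b"caf\xe9 au lait"  # the text "cafe au lait" with an accented e, encoded as Latin-1

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xe9 opens a three-byte UTF-8 sequence, but the next byte (a space)
    # is not a continuation byte, so strict decoding fails.
    message = str(err)

# Decoding with the encoding the text was actually written in succeeds.
text = raw.decode("latin-1")

print(message)
print(text)
```

The error carries the same "invalid continuation byte" reason as the traceback above, which is typical of single-byte encodings being read as UTF-8.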
Also, if an R data file does not contain any data frame object, a parsing error | ||
will occur: | ||
|
||
.. code-block:: ipython | ||
|
||
In [608]: rds_file = os.path.join(file_path, "env_data_non_dfs.rda") | ||
This seems to be missing a call to read_rds
Good catch. Will fix. |
||
... | ||
LibrdataParserError: Invalid file, or file has unsupported features | ||
|
||
|
||
.. _io.rdata_writer: | ||
|
||
Finally, note that R's ``Date`` (without a time component) translates to | ||
``datetime64`` in pandas. R's date/time type, ``POSIXct``, which can | ||
carry varying timezones, translates to UTC time in pandas. For example, in R, | ||
the following data sample from an .rda file shows date/time in the 'America/Chicago' local | ||
timezone: | ||
|
||
.. code-block:: r | ||
|
||
load("ppm_df.rda") | ||
tail(ppm_df, 5) | ||
date decimal_date monthly_average deseasonalized num_days std_dev_of_days unc_of_mon_mean | ||
612 2020-12-16 17:42:25 2020.958 414.25 414.98 30 0.47 0.17 | ||
613 2021-01-16 05:17:31 2021.042 415.52 415.26 29 0.44 0.16 | ||
614 2021-02-15 15:00:00 2021.125 416.75 415.93 28 1.02 0.37 | ||
615 2021-03-18 01:42:28 2021.208 417.64 416.18 28 0.86 0.31 | ||
616 2021-04-17 12:17:31 2021.292 419.05 416.23 24 1.12 0.44 | ||
|
||
In pandas, the conversion shows the adjustment in hours to UTC: | ||
|
||
.. ipython:: python | ||
|
||
r_dfs = pd.read_rdata(os.path.join(file_path, "ppm_df.rda")) | ||
r_dfs["ppm_df"].tail() | ||
|
||
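The size of the UTC adjustment can be checked by hand with the standard library. This sketch does not call ``read_rdata``; it takes two of the local timestamps printed by R above and applies fixed offsets for 'America/Chicago' (UTC-6 in standard time, UTC-5 in daylight time). Using fixed offsets instead of a timezone database is a simplification made here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for America/Chicago: UTC-6 in standard time (CST),
# UTC-5 in daylight time (CDT).
CST = timezone(timedelta(hours=-6), "CST")
CDT = timezone(timedelta(hours=-5), "CDT")

# Two of the local timestamps from the R output above.
december = datetime(2020, 12, 16, 17, 42, 25, tzinfo=CST)
april = datetime(2021, 4, 17, 12, 17, 31, tzinfo=CDT)

# Shift both to UTC; the wall-clock time moves by 6 and 5 hours respectively.
december_utc = december.astimezone(timezone.utc)
april_utc = april.astimezone(timezone.utc)

print(december_utc.isoformat())  # 2020-12-16T23:42:25+00:00
print(april_utc.isoformat())     # 2021-04-17T17:17:31+00:00
```

The winter and spring rows shift by different amounts because daylight saving time is in effect for the April timestamp.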
Writing R data | ||
'''''''''''''' | ||
|
||
.. versionadded:: 1.3.0 | ||
|
||
The method :func:`~pandas.core.frame.DataFrame.to_rdata` will write a DataFrame | ||
into R data files (.RData, .rda, and .rds). | ||
|
||
For a single DataFrame in rds format, pass a file path or buffer to the method: | ||
|
||
.. ipython:: python | ||
|
||
plants_df.to_rdata("plants_df.rds") | ||
|
||
For a single DataFrame in RData or rda format, pass a file path or buffer to the method | ||
and optionally give it a name: | ||
|
||
.. ipython:: python | ||
|
||
ghg_df.to_rdata("ghg_df.rda", rda_name="ghg_df") | ||
|
||
While RData and rda types can hold multiple R objects, this method currently | ||
only supports writing out a single DataFrame. | ||
|
||
You can also write to a buffer and read back its content (be sure to change the default | ||
``gzip`` compression to ``compression=None``): | ||
|
||
.. ipython:: python | ||
|
||
with BytesIO() as b_io: | ||
    env_dfs["sea_ice_df"].to_rdata( | ||
        b_io, | ||
        file_format="rda", | ||
        index=False, | ||
        compression=None, | ||
    ) | ||
    print( | ||
        pd.read_rdata( | ||
            b_io.getvalue(), | ||
            file_format="rda", | ||
            rownames=False, | ||
            compression=None, | ||
        )["pandas_dataframe"].tail() | ||
    ) | ||
|
||
While the DataFrame index does not map to R rownames, by default ``index=True`` | ||
writes it out as a named column, or as multiple columns for a MultiIndex. | ||
|
||
.. ipython:: python | ||
|
||
ghg_df.rename_axis(None).to_rdata("ghg_df.rds") | ||
|
||
pd.read_rdata("ghg_df.rds")["r_dataframe"].tail() | ||
|
||
To ignore the index, use ``index=False``: | ||
|
||
.. ipython:: python | ||
|
||
ghg_df.rename_axis(None).to_rdata("ghg_df.rds", index=False) | ||
|
||
pd.read_rdata("ghg_df.rds")["r_dataframe"].tail() | ||
|
||
By default, these R serialized files are compressed with the gzip, bzip2, | ||
or xz algorithm. Similar to R, the default ``compression`` type in this method | ||
is "gzip" or "gz". Notice the size difference between the compressed and uncompressed files: | ||
|
||
.. ipython:: python | ||
|
||
plants_df.to_rdata("plants_df_gz.rds") | ||
plants_df.to_rdata("plants_df_bz2.rds", compression="bz2") | ||
plants_df.to_rdata("plants_df_xz.rds", compression="xz") | ||
plants_df.to_rdata("plants_df_non_comp.rds", compression=None) | ||
|
||
os.stat("plants_df_gz.rds").st_size | ||
os.stat("plants_df_bz2.rds").st_size | ||
os.stat("plants_df_xz.rds").st_size | ||
os.stat("plants_df_non_comp.rds").st_size | ||
|
||
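The same size comparison can be sketched with the standard library alone, independent of ``to_rdata``. The payload below is made up and deliberately repetitive, so the ratios will not match real .rds files; the ``sniff`` helper is hypothetical and only shows how the three containers can be told apart by their leading bytes.

```python
import bz2
import gzip
import lzma

# Made-up, repetitive payload standing in for serialized data frame bytes.
payload = b"Pteridophytes,Threatened,1275\n" * 1000

compressed = {
    "gzip": gzip.compress(payload),
    "bz2": bz2.compress(payload),
    "xz": lzma.compress(payload),  # lzma defaults to the .xz container
}

# Collect byte counts, with the uncompressed size as the baseline.
sizes = {"none": len(payload)}
sizes.update({name: len(data) for name, data in compressed.items()})

# Leading "magic" bytes that identify each container format.
MAGIC = {
    "gzip": b"\x1f\x8b",
    "bz2": b"BZh",
    "xz": b"\xfd7zXZ\x00",
}


def sniff(data):
    """Hypothetical helper: name the compression matching the leading bytes, else None."""
    for name, magic in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


for name, size in sizes.items():
    print(name, size)
```

On very small inputs the container overhead can outweigh the savings, so a compressed file is not guaranteed to be smaller than the uncompressed one.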
Like other IO methods, ``storage_options`` is supported for writing to remote storage platforms: | ||
|
||
.. code-block:: ipython | ||
|
||
ghg_df.to_rdata( | ||
"s3://path/to/my/storage/pandas_df.rda", | ||
storage_options={"user": "xxx", "password": "???"} | ||
) | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
os.remove("ghg_df.rds") | ||
os.remove("ghg_df.rda") | ||
os.remove("plants_df.rds") | ||
os.remove("plants_df_gz.rds") | ||
os.remove("plants_df_bz2.rds") | ||
os.remove("plants_df_xz.rds") | ||
os.remove("plants_df_non_comp.rds") | ||
|
||
Once exported, the single DataFrame can be read or loaded in R: | ||
|
||
.. code-block:: r | ||
|
||
plants_df <- readRDS("plants_df.rds") | ||
plants_df | ||
plant_group status count | ||
16 Pteridophytes Data Deficient 398 | ||
17 Pteridophytes Extinct 65 | ||
18 Pteridophytes Not Threatened 1294 | ||
19 Pteridophytes Possibly Threatened 408 | ||
20 Pteridophytes Threatened 1275 | ||
|
||
load("ghg_df.rda") | ||
|
||
mget(list=ls()) | ||
$ghg_df | ||
gas year emissions | ||
141 Carbon dioxide 2018 5424.8815 | ||
142 Methane 2018 634.4571 | ||
143 Nitrous oxide 2018 434.5286 | ||
144 Fluorinated gases 2018 182.7824 | ||
145 Total 2018 6676.6496 | ||
|
||
.. _io.stata: | ||
|
||
Stata format | ||
|
@@ -5961,6 +6263,7 @@ outside of this range, the variable is cast to ``int16``. | |
115 dta file format. Attempting to write *Stata* dta files with strings | ||
longer than 244 characters raises a ``ValueError``. | ||
|
||
|
||
.. _io.stata_reader: | ||
|
||
Reading from Stata format | ||
|