ENH: IO support for R data files with C extension #41386

Closed
wants to merge 51 commits into from
Changes from 31 commits
d1d3e4f
ENH: Add IO support for R data files with pandas.read_rdata and DataF…
ParfaitG Apr 11, 2021
de848dd
Fix rebase issues in whatsnew and type style in frame.py
ParfaitG Apr 11, 2021
3379fa1
Fix skipif logic for test params, move package checks, add to test_api
ParfaitG Apr 11, 2021
966cb78
Refactor from built-in filter, add encoding to subprocess and locale …
ParfaitG Apr 12, 2021
22c7ade
Fix tests for OS newline and mypy, mark xfail, use default mode in io…
ParfaitG Apr 12, 2021
8b1aa9c
Added needed test skips and fixed io docs ref in whatsnew
ParfaitG Apr 12, 2021
41f817f
Merge remote-tracking branch 'upstream/master' into rdata_io
ParfaitG Apr 12, 2021
2341dff
Remove rscript implementation from code, tests, and docs
ParfaitG Apr 14, 2021
1f8f033
Merge remote-tracking branch 'upstream/master' into rdata_io
ParfaitG Apr 14, 2021
a5983e0
Fix duplicate entry in ci dep yaml
ParfaitG Apr 14, 2021
e78bf6e
Refactor to handle binary content, add datetime notes in docs
ParfaitG Apr 16, 2021
1475281
Merge remote-tracking branch 'upstream/master' into rdata_io
ParfaitG Apr 16, 2021
7e0c152
Merge remote-tracking branch 'upstream/master' into rdata_io
ParfaitG Apr 19, 2021
bd7dde6
ENH: IO support for R data files with C extension
ParfaitG May 8, 2021
140ea04
Move C src files to _libs directory
ParfaitG May 8, 2021
1ef9e9a
Adjust C src files to conform to cpplint
ParfaitG May 8, 2021
770b810
Fix C src warnings raised as compiled errors
ParfaitG May 9, 2021
d2f3746
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 9, 2021
5ce5c05
Remove pyreadr listing in yml files and docs
ParfaitG May 9, 2021
f9a23cd
Fix C src warnings, syntax, and add unix_iconv.h
ParfaitG May 9, 2021
952889f
Fix docstring issue and add mac_iconv.h
ParfaitG May 10, 2021
4a0cf89
Adjust Cython scripts to fix write rdata for Windows, revert Mac iconv
ParfaitG May 11, 2021
83bc859
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 11, 2021
09f2005
Remove quotes in include iconv.h line of C source
ParfaitG May 11, 2021
a9da74a
Add liconv to extra_link_args for Mac OS build
ParfaitG May 11, 2021
40862c5
Slight fix to liconv in extra_link_args for Mac OS build
ParfaitG May 11, 2021
6396819
Adjust rdata include_dirs for libiconv on Mac OS
ParfaitG May 11, 2021
749a04e
Add library_dirs to find libiconv on Mac OS
ParfaitG May 11, 2021
e862057
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 11, 2021
f5ab7cd
Resolve rdata extension name for compilation
ParfaitG May 12, 2021
e9852e9
Adjust include and library dir in rdata extension for Mac OS
ParfaitG May 12, 2021
a496381
Adjust docs, tests, and code re dtypes and pickling
ParfaitG May 15, 2021
5867742
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 15, 2021
94d7f20
Add compression test, adjust test to fit 32-bit OS, and Mac condition…
ParfaitG May 15, 2021
01c0807
Add gzip skip in new test for < PY 3.8
ParfaitG May 15, 2021
f5f2e99
Add try/except for encoding, fix S3 read in tests and docs, and reduc…
ParfaitG May 16, 2021
ab06b2b
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 16, 2021
7299ee5
Replace integer for float in timestamps to fit 32-bit limit
ParfaitG May 17, 2021
6a35bfa
Use C long long for large timevalue to work on 32 and 64-bit
ParfaitG May 17, 2021
7b35651
Adjust timestamps in test to work on 32 and 64-bit machines
ParfaitG May 17, 2021
0ab02ec
Add skip for 32-bit in dtypes test
ParfaitG May 17, 2021
fa3dbc1
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 17, 2021
a51f8de
Adjust rdata section of user_guide/io.rst docs
ParfaitG May 19, 2021
835e998
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG May 19, 2021
0e3dc79
Fix merge conflicts
ParfaitG Jun 21, 2021
67613aa
Adjust setup.py per comments
ParfaitG Jun 21, 2021
dd26eb9
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG Jun 21, 2021
dc56c82
Remove conda prefix condition for mac in setup.py
ParfaitG Jun 21, 2021
c0f6c68
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG Jun 23, 2021
a56cf38
Remove extraneous lines
ParfaitG Jun 24, 2021
dde183b
Merge remote-tracking branch 'upstream/master' into rdata_c
ParfaitG Jun 24, 2021
19 changes: 19 additions & 0 deletions LICENSES/LIBRDATA_LICENSE
@@ -0,0 +1,19 @@
Copyright (c) 2013-2020 Evan Miller (except where otherwise noted)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
303 changes: 303 additions & 0 deletions doc/source/user_guide/io.rst
@@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>`
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
@@ -5906,6 +5907,307 @@ respective functions from ``pandas-gbq``.

Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__.


.. _io.rdata:

R data format
-------------

.. _io.rdata_reader:

Reading R data
''''''''''''''

.. versionadded:: 1.3.0

The top-level function ``read_rdata`` reads the native serialization types
of the R language and environment. For .RData and its synonymous shorthand, .rda,
which can hold multiple R objects, the method returns a ``dict`` of ``DataFrames``.
For .rds files, which contain only a single R object, the method returns a ``dict``
holding a single ``DataFrame``.

.. note::

Since any R object can be saved in these types, this method will only return
data.frame objects or objects coercible to data.frames including matrices,
tibbles, and data.tables.

For more information on R serialization data types, see the docs on the `rds`_
and `rda`_ data formats.

.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save

For example, consider the following data.frames generated in R using environmental
data samples from US EPA, UK BGCI, and NOAA public data:

.. code-block:: r

ghg_df <- data.frame(
gas = c("Carbon dioxide", "Methane", "Nitrous oxide",
"Fluorinated gases", "Total"),
year = c(2018, 2018, 2018, 2018, 2018),
emissions = c(5424.88150213288, 634.457127078267, 434.528555376666,
182.782432461777, 6676.64961704959),
row.names = c(141:145),
stringsAsFactors = FALSE
)

saveRDS(ghg_df, file="ghg_df.rds")

plants_df <- data.frame(
plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes",
"Pteridophytes", "Pteridophytes"),
status = c("Data Deficient", "Extinct", "Not Threatened",
"Possibly Threatened", "Threatened"),
count = c(398, 65, 1294, 408, 1275),
row.names = c(16:20),
stringsAsFactors = FALSE
)

saveRDS(plants_df, file="plants_df.rds")

sea_ice_df <- data.frame(
year = c(2016, 2017, 2018, 2019, 2020),
mo = c(12, 12, 12, 12, 12),
data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"),
region = c("S", "S", "S", "S", "S"),
extent = c(8.28, 9.48, 9.19, 9.41, 10.44),
area = c(5.51, 6.23, 5.59, 6.59, 6.5),
row.names = c(1012:1016),
stringsAsFactors = FALSE
)

saveRDS(sea_ice_df, file="sea_ice_df.rds")

save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda")

With ``read_rdata``, you can read the .rds and .rda files above:

.. ipython:: python
:suppress:

rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata")
file_path = os.path.abspath(rel_path)

.. ipython:: python

rds_file = os.path.join(file_path, "ghg_df.rds")
ghg_df = pd.read_rdata(rds_file)["r_dataframe"].tail()
ghg_df

rda_file = os.path.join(file_path, "env_data_dfs.rda")
env_dfs = pd.read_rdata(rda_file)
{k: df.tail() for k, df in env_dfs.items()}

To ignore the rownames of the data.frame, use the option ``rownames=False``:

.. ipython:: python

rds_file = os.path.join(file_path, "plants_df.rds")
plants_df = pd.read_rdata(rds_file, rownames=False)["r_dataframe"].tail()
plants_df


To select specific objects in .rda, pass a list of names into ``select_frames``:

.. ipython:: python

rda_file = os.path.join(file_path, "env_data_dfs.rda")
env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])
Contributor

Is this a good pattern? Should it default to

(a) reading a single frame if there is one
(b) raising and telling the user that a specific frame name is required?

or

(a) reading a single frame if available
(b) reading the "first" frame where first is based on either the order in the rdata file or alphabetically

Think about how this compares to read_excel which supports reading multiple frames from the same file?

In fact, should there be a class-based interface that would allow multiple reads from the same file, similar to ExcelReader?

Contributor Author

Originally, I had output generate either a single DataFrame for .rds types or dict of DataFrames for .RData/.rda types. So the file format type decided the structure. Earlier, jreback was confused by this, so now I simplified to always return a dict of DataFrames for any R data file type. Because rds types carry no named objects unlike rda types, the default r_dataframe is used as key for single-item dict. But rda object names will map into dict keys.

For vast majority, especially in R packages, only a single named data frame is ever stored in .Rdata or .rda files. Given rdata IO may not be as popular as the Excel IO module, for class-based reader interface, maybe we should gauge this need by users for a potential future PR?

Contributor

I think that the read_... method should be simple and return a DataFrame. Users who need to interact with complicated R data files should be directed to use the lower-level class-based reader. This is the model that is used in StataReader and IIRC in ExcelFile.

Contributor Author

Hmmm...per IO docs under specifying sheets, if you pass a list of sheets or None, read_excel returns a dict of DataFrames without specifying ExcelFile. This is the model used here for read_rdata. Also, read_html only returns a list of DataFrames. I thought StataReader and SASReader were more for reading files incrementally with chunks and any meta data.

Contributor

yeah for read_hdf we do this by forcing the user to have a key. are these always an ordered list? or a keyed list?
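
The defaults debated in this thread can be compared with a small sketch. It assumes a reader that always returns a dict of DataFrames, as the PR does at this commit. The helper `pick_frame` and the sample frames are hypothetical names used only for illustration; they are not part of pandas or of this PR.

```python
import pandas as pd


def pick_frame(frames, name=None):
    """Return one DataFrame from a dict of frames (hypothetical helper).

    With a single frame, return it; with several, require an explicit name.
    """
    if name is not None:
        return frames[name]
    if len(frames) == 1:
        return next(iter(frames.values()))
    raise ValueError(
        f"{len(frames)} frames found; pass one of {sorted(frames)} as name"
    )


# Stand-ins for what a dict-returning reader might hand back.
rds_like = {"r_dataframe": pd.DataFrame({"count": [398, 65]})}
rda_like = {
    "ghg_df": pd.DataFrame({"year": [2018]}),
    "sea_ice_df": pd.DataFrame({"year": [2020]}),
}

single = pick_frame(rds_like)                     # only one frame: returned directly
chosen = pick_frame(rda_like, name="sea_ice_df")  # several frames: name required
```

This corresponds to the "read a single frame if there is one, otherwise raise and ask for a name" option; returning the first frame instead would replace the `raise` with a lookup of the first key.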

env_dfs

To read from a file-like object, pass the object as the ``path_or_buffer`` argument:

.. ipython:: python

rds_file = os.path.join(file_path, "plants_df.rds")
with open(rds_file, "rb") as f:
plants_df = pd.read_rdata(
f,
file_format="rds",
)["r_dataframe"]

plants_df

To read from a URL, pass the link directly to the method:

.. ipython:: python

url = ("https://github.com/hadley/nycflights13/"
"blob/master/data/airlines.rda?raw=true")

airlines = pd.read_rdata(url, file_format="rda")
airlines

To read from an Amazon S3 bucket, point to the storage path. Note that R data
encoded in anything other than UTF-8 is currently not supported, as this example shows:

.. code-block:: ipython

In [608]: ghcran = pd.read_rdata("s3://public-r-data/ghcran.Rdata")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 45: invalid continuation byte
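
The same kind of error can be reproduced with plain Python bytes, independent of the reader. This is a minimal sketch, assuming the offending data was written in Latin-1; the sample string is made up for illustration and is not taken from the file above.

```python
# "café au lait" encoded as Latin-1, where é is the single byte 0xe9.
raw = b"caf\xe9 au lait"

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as err:
    # 0xe9 starts a multi-byte UTF-8 sequence, but the next byte does not continue it.
    error_message = str(err)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
```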

Also note that if an R data file does not contain any data frame object, a
parsing error is raised:

.. code-block:: ipython

In [608]: rds_file = os.path.join(file_path, "env_data_non_dfs.rda")
Contributor

This seems to be missing a call to read_rds

Contributor Author

Good catch. Will fix.

...
LibrdataParserError: Invalid file, or file has unsupported features


.. _io.rdata_writer:

Finally, note that R's ``Date`` type (which has no time component) translates to
``datetime64`` in pandas. R's date/time type, ``POSIXct``, which can carry
varying time zones, translates to UTC time in pandas. For example, in R,
the following data sample from an .rda file shows date/time values in the
'America/Chicago' local time zone:

.. code-block:: r

load("ppm_df.rda")
tail(ppm_df, 5)
date decimal_date monthly_average deseasonalized num_days std_dev_of_days unc_of_mon_mean
612 2020-12-16 17:42:25 2020.958 414.25 414.98 30 0.47 0.17
613 2021-01-16 05:17:31 2021.042 415.52 415.26 29 0.44 0.16
614 2021-02-15 15:00:00 2021.125 416.75 415.93 28 1.02 0.37
615 2021-03-18 01:42:28 2021.208 417.64 416.18 28 0.86 0.31
616 2021-04-17 12:17:31 2021.292 419.05 416.23 24 1.12 0.44

In pandas, the conversion shifts these values by the corresponding number of hours to UTC:

.. ipython:: python

r_dfs = pd.read_rdata(os.path.join(file_path, "ppm_df.rda"))
r_dfs["ppm_df"].tail()
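
The size of that shift can be checked with standard pandas time zone methods. This sketch does not call ``read_rdata``; it takes two timestamps from the R output and assumes they are wall-clock times in 'America/Chicago'.

```python
import pandas as pd

# Two of the local timestamps from the R output (first and last rows).
local = pd.Series(
    pd.to_datetime(["2020-12-16 17:42:25", "2021-04-17 12:17:31"])
).dt.tz_localize("America/Chicago")

# Convert to UTC: +6 hours during standard time, +5 during daylight saving time.
utc = local.dt.tz_convert("UTC")
utc_strings = utc.dt.strftime("%Y-%m-%d %H:%M:%S").tolist()
```

The December value moves forward six hours and the April value five, because daylight saving time is in effect in April.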

Writing R data
''''''''''''''

.. versionadded:: 1.3.0

The method :func:`~pandas.core.frame.DataFrame.to_rdata` will write a DataFrame
into R data files (.RData, .rda, and .rds).

To write a single DataFrame as an rds file, pass a file path or buffer to the method:

.. ipython:: python

plants_df.to_rdata("plants_df.rds")

To write a single DataFrame as an RData or rda file, pass a file path or buffer
to the method and optionally give the object a name:

.. ipython:: python

ghg_df.to_rdata("ghg_df.rda", rda_name="ghg_df")

While RData and rda types can hold multiple R objects, this method currently
only supports writing out a single DataFrame.

You can also write to a buffer and read back its content (be sure to override the
default ``gzip`` compression with ``compression=None``):

.. ipython:: python

with BytesIO() as b_io:
env_dfs["sea_ice_df"].to_rdata(
b_io,
file_format="rda",
index=False,
compression=None,
)
print(
pd.read_rdata(
b_io.getvalue(),
file_format="rda",
rownames=False,
compression=None,
)["pandas_dataframe"].tail()
)

While the DataFrame index does not map to R rownames, with the default ``index=True``
it is written as a named column, or as multiple columns for a MultiIndex.

.. ipython:: python

ghg_df.rename_axis(None).to_rdata("ghg_df.rds")

pd.read_rdata("ghg_df.rds")["r_dataframe"].tail()
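
As an analogy only, ``DataFrame.reset_index`` shows what "index written as columns" looks like for a MultiIndex. This sketch assumes the writer produces a similar layout; it does not claim that ``to_rdata`` is implemented this way, and the sample values are abbreviated from the data above.

```python
import pandas as pd

df = pd.DataFrame(
    {"emissions": [5424.88, 634.46]},
    index=pd.MultiIndex.from_tuples(
        [("Carbon dioxide", 2018), ("Methane", 2018)], names=["gas", "year"]
    ),
)

# reset_index() moves each index level into its own named column.
flat = df.reset_index()
flat_columns = list(flat.columns)
```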

To ignore the index, use ``index=False``:

.. ipython:: python

ghg_df.rename_axis(None).to_rdata("ghg_df.rds", index=False)

pd.read_rdata("ghg_df.rds")["r_dataframe"].tail()

These R serialized types are typically compressed with the gzip, bzip2, or xz
algorithm. As in R, the default ``compression`` type in this method is "gzip"
(or "gz"). Notice the size difference between the compressed and uncompressed files:

.. ipython:: python

plants_df.to_rdata("plants_df_gz.rds")
plants_df.to_rdata("plants_df_bz2.rds", compression="bz2")
plants_df.to_rdata("plants_df_xz.rds", compression="xz")
plants_df.to_rdata("plants_df_non_comp.rds", compression=None)

os.stat("plants_df_gz.rds").st_size
os.stat("plants_df_bz2.rds").st_size
os.stat("plants_df_xz.rds").st_size
os.stat("plants_df_non_comp.rds").st_size
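
The same three algorithms are available in the Python standard library, so the effect can be sketched without any R data. The payload below is a made-up, highly repetitive byte string; real R data files will compress by different ratios.

```python
import bz2
import gzip
import lzma

# Repetitive sample payload (30 bytes repeated 500 times).
payload = b"Pteridophytes,Threatened,1275\n" * 500

sizes = {
    "gzip": len(gzip.compress(payload)),
    "bz2": len(bz2.compress(payload)),
    "xz": len(lzma.compress(payload)),  # lzma defaults to the .xz container
    "none": len(payload),
}

for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")
```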

Like other IO methods, ``to_rdata`` accepts ``storage_options`` for writing to remote storage platforms:

.. code-block:: ipython

ghg_df.to_rdata(
"s3://path/to/my/storage/pandas_df.rda",
storage_options={"user": "xxx", "password": "???"}
)

.. ipython:: python
:suppress:

os.remove("ghg_df.rds")
os.remove("ghg_df.rda")
os.remove("plants_df.rds")
os.remove("plants_df_gz.rds")
os.remove("plants_df_bz2.rds")
os.remove("plants_df_xz.rds")
os.remove("plants_df_non_comp.rds")

Once exported, the single DataFrame can be read or loaded in R:

.. code-block:: r

plants_df <- readRDS("plants_df.rds")
plants_df
plant_group status count
16 Pteridophytes Data Deficient 398
17 Pteridophytes Extinct 65
18 Pteridophytes Not Threatened 1294
19 Pteridophytes Possibly Threatened 408
20 Pteridophytes Threatened 1275

load("ghg_df.rda")

mget(list=ls())
$ghg_df
gas year emissions
141 Carbon dioxide 2018 5424.8815
142 Methane 2018 634.4571
143 Nitrous oxide 2018 434.5286
144 Fluorinated gases 2018 182.7824
145 Total 2018 6676.6496

.. _io.stata:

Stata format
@@ -5961,6 +6263,7 @@ outside of this range, the variable is cast to ``int16``.
115 dta file format. Attempting to write *Stata* dta files with strings
longer than 244 characters raises a ``ValueError``.


.. _io.stata_reader:

Reading from Stata format