pandas-dev · ParfaitG · Apr 11, 2021 · Apr 11, 2021 · Apr 11, 2021 · Apr 12, 2021
diff --git a/LICENSES/LIBRDATA_LICENSE b/LICENSES/LIBRDATA_LICENSE
@@ -0,0 +1,19 @@
+Copyright (c) 2013-2020 Evan Miller (except where otherwise noted)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
@@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
     binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
     binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
+    binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>`
     binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
     binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
     binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
@@ -5906,6 +5907,307 @@ respective functions from ``pandas-gbq``.
 
 Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__.
 
+
+.. _io.rdata:
+
+R data format
+-------------
+
+.. _io.rdata_reader:
+
+Reading R data
+''''''''''''''
+
+.. versionadded:: 1.3.0
+
+The top-level function ``read_rdata`` reads the native serialization types of
+the R language and environment. For .RData and its synonymous shorthand, .rda,
+which can hold multiple R objects, the function returns a ``dict`` of ``DataFrames``.
+For .rds files, which contain only a single R object, the function returns a
+``dict`` holding a single ``DataFrame``.
+
+.. note::
+
+   Because any R object can be saved in these formats, this method returns only
+   data.frame objects or objects coercible to data.frames, including matrices,
+   tibbles, and data.tables.
+
+For more information on R serialization data types, see the docs on the `rds`_
+and `rda`_ data formats.
+
+.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS
+
+.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save
+
+For example, consider the following data.frames generated in R from environmental
+data samples from US EPA, UK BGCI, and NOAA public data:
+
+.. code-block:: r
+
+   ghg_df <- data.frame(
+     gas = c("Carbon dioxide", "Methane", "Nitrous oxide",
+             "Fluorinated gases", "Total"),
+     year = c(2018, 2018, 2018, 2018, 2018),
+     emissions = c(5424.88150213288, 634.457127078267, 434.528555376666,
+                   182.782432461777, 6676.64961704959),
+     row.names = c(141:145),
+     stringsAsFactors = FALSE
+   )
+
+   saveRDS(ghg_df, file="ghg_df.rds")
+
+   plants_df <- data.frame(
+     plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes",
+                     "Pteridophytes", "Pteridophytes"),
+     status = c("Data Deficient", "Extinct", "Not Threatened",
+                "Possibly Threatened", "Threatened"),
+     count = c(398, 65, 1294, 408, 1275),
+     row.names = c(16:20),
+     stringsAsFactors = FALSE
+   )
+
+   saveRDS(plants_df, file="plants_df.rds")
+
+   sea_ice_df <- data.frame(
+     year = c(2016, 2017, 2018, 2019, 2020),
+     mo = c(12, 12, 12, 12, 12),
+     data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"),
+     region = c("S", "S", "S", "S", "S"),
+     extent = c(8.28, 9.48, 9.19, 9.41, 10.44),
+     area = c(5.51, 6.23, 5.59, 6.59, 6.5),
+     row.names = c(1012:1016),
+     stringsAsFactors = FALSE
+   )
+
+   saveRDS(sea_ice_df, file="sea_ice_df.rds")
+
+   save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda")
+
+With ``read_rdata``, you can read the .rds or .rda files above:
+
+.. ipython:: python
+   :suppress:
+
+   rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata")
+   file_path = os.path.abspath(rel_path)
+
+.. ipython:: python
+
+   rds_file = os.path.join(file_path, "ghg_df.rds")
+   ghg_df = pd.read_rdata(rds_file)["r_dataframe"].tail()
+   ghg_df
+
+   rda_file = os.path.join(file_path, "env_data_dfs.rda")
+   env_dfs = pd.read_rdata(rda_file)
+   {k: df.tail() for k, df in env_dfs.items()}
+
+To ignore the rownames of the data.frame, use the option ``rownames=False``:
+
+.. ipython:: python
+
+   rds_file = os.path.join(file_path, "plants_df.rds")
+   plants_df = pd.read_rdata(rds_file, rownames=False)["r_dataframe"].tail()
+   plants_df
+
+
+To select specific objects in .rda, pass a list of names into ``select_frames``:
+
+.. ipython:: python
+
+   rda_file = os.path.join(file_path, "env_data_dfs.rda")
+   env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])
+   env_dfs
+
+To read from a file-like object, pass the object as the ``path_or_buffer`` argument:
+
+.. ipython:: python
+
+   rds_file = os.path.join(file_path, "plants_df.rds")
+   with open(rds_file, "rb") as f:
+       plants_df = pd.read_rdata(
+           f,
+           file_format="rds",
+       )["r_dataframe"]
+
+   plants_df
+
+To read from a URL, pass the link directly into the method:
+
+.. ipython:: python
+
+   url = ("https://github.com/hadley/nycflights13/"
+          "blob/master/data/airlines.rda?raw=true")
+
+   airlines = pd.read_rdata(url, file_format="rda")
+   airlines
+
+To read from an Amazon S3 bucket, point to the storage path. Note that R data
+encoded in anything other than UTF-8 is currently not supported:
+
+.. code-block:: ipython
+
+   In [608]: ghcran = pd.read_rdata("s3://public-r-data/ghcran.Rdata")
+   ...
+   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 45: invalid continuation byte
+
+Also, remember that if an R data file does not contain any data.frame object, a
+parsing error will occur:
+
+.. code-block:: ipython
+
+   In [608]: pd.read_rdata(os.path.join(file_path, "env_data_non_dfs.rda"))
+   ...
+   LibrdataParserError: Invalid file, or file has unsupported features
+
+
+Finally, note that R's ``Date`` (without a time component) translates to
+``datetime64`` in pandas. Also, R's date/time field type, ``POSIXct``, which can
+carry varying timezones, translates to UTC time in pandas. For example, in R,
+the following data sample from an .rda file shows date/time in the
+'America/Chicago' local timezone:
+
+.. code-block:: r
+
+   load("ppm_df.rda")
+   tail(ppm_df, 5)
+                      date decimal_date monthly_average deseasonalized num_days std_dev_of_days unc_of_mon_mean
+   612 2020-12-16 17:42:25     2020.958          414.25         414.98       30            0.47            0.17
+   613 2021-01-16 05:17:31     2021.042          415.52         415.26       29            0.44            0.16
+   614 2021-02-15 15:00:00     2021.125          416.75         415.93       28            1.02            0.37
+   615 2021-03-18 01:42:28     2021.208          417.64         416.18       28            0.86            0.31
+   616 2021-04-17 12:17:31     2021.292          419.05         416.23       24            1.12            0.44
+
+In pandas, the conversion shows the adjustment in hours to UTC:
+
+.. ipython:: python
+
+   r_dfs = pd.read_rdata(os.path.join(file_path, "ppm_df.rda"))
+   r_dfs["ppm_df"].tail()
+
+.. _io.rdata_writer:
+
+Writing R data
+''''''''''''''
+
+.. versionadded:: 1.3.0
+
+The method :func:`~pandas.core.frame.DataFrame.to_rdata` will write a DataFrame
+into R data files (.RData, .rda, and .rds).
+
+To write a single DataFrame as an rds file, pass a file path or buffer to the method:
+
+.. ipython:: python
+
+   plants_df.to_rdata("plants_df.rds")
+
+To write a single DataFrame as an RData or rda file, pass a file path or buffer
+to the method and optionally give the object a name:
+
+.. ipython:: python
+
+   ghg_df.to_rdata("ghg_df.rda", rda_name="ghg_df")
+
+While RData and rda types can hold multiple R objects, this method currently
+only supports writing out a single DataFrame.
+
+You can even write to a buffer and read back its content (be sure to change the
+default ``gzip`` compression to ``compression=None``):
+
+.. ipython:: python
+
+    with BytesIO() as b_io:
+        env_dfs["sea_ice_df"].to_rdata(
+            b_io,
+            file_format="rda",
+            index=False,
+            compression=None,
+        )
+        print(
+            pd.read_rdata(
+                b_io.getvalue(),
+                file_format="rda",
+                rownames=False,
+                compression=None,
+            )["pandas_dataframe"].tail()
+        )
+
+While the DataFrame index does not map to R rownames, the default ``index=True``
+writes it out as a named column, or as multiple columns for a MultiIndex.
+
+.. ipython:: python
+
+    ghg_df.rename_axis(None).to_rdata("ghg_df.rds")
+
+    pd.read_rdata("ghg_df.rds")["r_dataframe"].tail()
+
+To ignore the index, use ``index=False``:
+
+.. ipython:: python
+
+    ghg_df.rename_axis(None).to_rdata("ghg_df.rds", index=False)
+
+    pd.read_rdata("ghg_df.rds")["r_dataframe"].tail()
+
+By default, these R serialized types are files compressed with the gzip, bzip2,
+or xz algorithm. As in R, the default ``compression`` type in this method is
+"gzip" or "gz". Notice the size difference between compressed and uncompressed files:
+
+.. ipython:: python
+
+   plants_df.to_rdata("plants_df_gz.rds")
+   plants_df.to_rdata("plants_df_bz2.rds", compression="bz2")
+   plants_df.to_rdata("plants_df_xz.rds", compression="xz")
+   plants_df.to_rdata("plants_df_non_comp.rds", compression=None)
+
+   os.stat("plants_df_gz.rds").st_size
+   os.stat("plants_df_bz2.rds").st_size
+   os.stat("plants_df_xz.rds").st_size
+   os.stat("plants_df_non_comp.rds").st_size
+
+Like other IO methods, ``storage_options`` is supported for writing to remote storage platforms:
+
+.. code-block:: ipython
+
+   ghg_df.to_rdata(
+       "s3://path/to/my/storage/pandas_df.rda",
+       storage_options={"user": "xxx", "password": "???"}
+   )
+
+.. ipython:: python
+   :suppress:
+
+   os.remove("ghg_df.rds")
+   os.remove("ghg_df.rda")
+   os.remove("plants_df.rds")
+   os.remove("plants_df_gz.rds")
+   os.remove("plants_df_bz2.rds")
+   os.remove("plants_df_xz.rds")
+   os.remove("plants_df_non_comp.rds")
+
+Once exported, the single DataFrame can be read or loaded in R:
+
+.. code-block:: r
+
+   plants_df <- readRDS("plants_df.rds")
+   plants_df
+        plant_group              status count
+   16 Pteridophytes      Data Deficient   398
+   17 Pteridophytes             Extinct    65
+   18 Pteridophytes      Not Threatened  1294
+   19 Pteridophytes Possibly Threatened   408
+   20 Pteridophytes          Threatened  1275
+
+   load("ghg_df.rda")
+
+   mget(list=ls())
+   $ghg_df
+                     gas year emissions
+   141    Carbon dioxide 2018 5424.8815
+   142           Methane 2018  634.4571
+   143     Nitrous oxide 2018  434.5286
+   144 Fluorinated gases 2018  182.7824
+   145             Total 2018 6676.6496
+
 .. _io.stata:
 
 Stata format
@@ -5961,6 +6263,7 @@ outside of this range, the variable is cast to ``int16``.
   115 dta file format. Attempting to write *Stata* dta files with strings
   longer than 244 characters raises a ``ValueError``.
 
+
 .. _io.stata_reader:
 
 Reading from Stata format