Description
Location of the documentation
https://pandas.pydata.org/docs/user_guide/io.html#external-compatibility
Documentation problem
This section probably contains at least one typo and, more generally, doesn't seem to document current behaviour.
I'll quickly run through the example here, cleaned up slightly so that it runs without the rest of the page.
import pandas as pd
import numpy as np
df_for_r = pd.DataFrame({"first": np.random.rand(100),
                         "second": np.random.rand(100),
                         "class": np.random.randint(0, 2, (100, ))},
                        index=range(100))
store_export = pd.HDFStore('export.h5')
# In the documentation, this is written with 'data_columns=df_dc.columns', which I'm assuming is a mistake
store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)
store_export
We can take a look at what's in this file:
store_export.close()
!h5ls -r export.h5
Output
/ Group
/df_for_r Group
/df_for_r/_i_table Group
/df_for_r/_i_table/class Group
/df_for_r/_i_table/class/abounds Dataset {0/Inf}
/df_for_r/_i_table/class/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/class/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/indicesLR Dataset {131072}
/df_for_r/_i_table/class/mbounds Dataset {0/Inf}
/df_for_r/_i_table/class/mranges Dataset {0/Inf}
/df_for_r/_i_table/class/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/class/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/sortedLR Dataset {131201}
/df_for_r/_i_table/class/zbounds Dataset {0/Inf}
/df_for_r/_i_table/first Group
/df_for_r/_i_table/first/abounds Dataset {0/Inf}
/df_for_r/_i_table/first/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/first/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/indicesLR Dataset {131072}
/df_for_r/_i_table/first/mbounds Dataset {0/Inf}
/df_for_r/_i_table/first/mranges Dataset {0/Inf}
/df_for_r/_i_table/first/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/first/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/sortedLR Dataset {131201}
/df_for_r/_i_table/first/zbounds Dataset {0/Inf}
/df_for_r/_i_table/index Group
/df_for_r/_i_table/index/abounds Dataset {0/Inf}
/df_for_r/_i_table/index/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/index/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/indicesLR Dataset {131072}
/df_for_r/_i_table/index/mbounds Dataset {0/Inf}
/df_for_r/_i_table/index/mranges Dataset {0/Inf}
/df_for_r/_i_table/index/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/index/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/sortedLR Dataset {131201}
/df_for_r/_i_table/index/zbounds Dataset {0/Inf}
/df_for_r/_i_table/second Group
/df_for_r/_i_table/second/abounds Dataset {0/Inf}
/df_for_r/_i_table/second/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/second/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/indicesLR Dataset {131072}
/df_for_r/_i_table/second/mbounds Dataset {0/Inf}
/df_for_r/_i_table/second/mranges Dataset {0/Inf}
/df_for_r/_i_table/second/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/second/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/sortedLR Dataset {131201}
/df_for_r/_i_table/second/zbounds Dataset {0/Inf}
/df_for_r/table Dataset {200/Inf}
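To make the listing easier to read, here is a rough sketch that filters a few lines copied from the output above down to the datasets that actually hold data. It assumes that PyTables keeps its column indexes under groups named `_i_<table>`; the helper name `data_datasets` is my own and not part of pandas or PyTables.

```python
# A few lines copied from the h5ls output above (not the full listing).
H5LS_SAMPLE = """\
/df_for_r Group
/df_for_r/_i_table Group
/df_for_r/_i_table/class/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/sorted Dataset {0/Inf, 131072}
/df_for_r/table Dataset {200/Inf}
"""


def data_datasets(listing: str) -> list[str]:
    """Return dataset paths that are not inside an index group."""
    paths = []
    for line in listing.splitlines():
        parts = line.split(maxsplit=2)
        # Keep only lines of the form "<path> Dataset {...}".
        if len(parts) < 2 or parts[1] != "Dataset":
            continue
        path = parts[0]
        # Assumption: any path segment starting with "_i_" is index bookkeeping.
        if any(segment.startswith("_i_") for segment in path.split("/")):
            continue
        paths.append(path)
    return paths


print(data_datasets(H5LS_SAMPLE))
```

If that assumption holds, the only dataset left is `/df_for_r/table`, so all of the data sits in a single table node.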
Next, the documentation gives an R function for reading this data. Comparing that function with the file written above already suggests a mismatch:
library(rhdf5)
loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)
  data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
  columns = list()
  for (idx in seq(data_paths)) {
    # NOTE: matrices returned by h5read have to be transposed to obtain
    # required Fortran order!
    data <- data.frame(t(h5read(h5File, data_paths[idx])))
    names <- t(h5read(h5File, name_paths[idx]))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }
  data <- data.frame(columns)
  return(data)
}
For example, there are no entries in export.h5 which have _values or _items in their names.
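The name check can be reproduced outside R. The sketch below applies the same substring match as the two `grep` calls to the leaf names from the h5ls output of export.h5. The second list is purely hypothetical: its names are invented to follow the `*_values` / `*_items` pattern described in the R function's comment and are not copied from any real file. `grep_names` is my own helper, not an rhdf5 or pandas function.

```python
# Leaf names that appear in the h5ls output of export.h5 (table format).
TABLE_FORMAT_NAMES = [
    "df_for_r", "_i_table", "class", "first", "index", "second",
    "abounds", "bounds", "indices", "indicesLR", "mbounds", "mranges",
    "ranges", "sorted", "sortedLR", "zbounds", "table",
]

# Hypothetical names, made up to follow the *_values / *_items pattern
# that the R function's comment says it expects.
HYPOTHETICAL_MATCHING_NAMES = ["block0_values", "block0_items"]


def grep_names(pattern: str, names: list[str]) -> list[str]:
    """Return the names containing `pattern`, like a plain-substring grep."""
    return [name for name in names if pattern in name]


for label, names in [("table format", TABLE_FORMAT_NAMES),
                     ("hypothetical", HYPOTHETICAL_MATCHING_NAMES)]:
    print(label, grep_names("_values", names), grep_names("_items", names))
```

With no matches for either pattern, `data_paths` in the R function is empty, which is consistent with getting an empty data frame back.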
If we actually call this function, we get an empty dataframe back:
> loadhdf5data("export.h5")
data frame with 0 columns and 0 rows
This function does seem to work if the file is written using the "fixed" format:
df_for_r.to_hdf("export2.h5", key="df_for_r", format="fixed")
> loadhdf5data("export2.h5")
        first      second class
1 0.675013759 0.787289926     0
2 0.936797348 0.349671699     1
3 0.951930811 0.275965069     0
4 0.203085530 0.380154180     0
5 0.627195223 0.462702969     1
6 0.129148756 0.385663581     1
...
This is a bit contrary to the prose for this section, which reads:
HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables.
It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this:
Suggested fix for documentation
The section should probably state that the "table" format doesn't work with the R function it provides, and that the "fixed" format does. In addition, since external compatibility relies on users writing their own code to read these files, maybe a specification of the on-disk format should be documented here?