DOC: HDF5 External Compatibility section examples don't work #35419

Closed
@ivirshup

Location of the documentation

https://pandas.pydata.org/docs/user_guide/io.html#external-compatibility.

Documentation problem

This section probably contains at least one typo and, more generally, doesn't seem to document current behaviour.

I'll quickly run through the example here, cleaned up a bit so that it runs without the rest of the page.

import pandas as pd
import numpy as np

df_for_r = pd.DataFrame({"first": np.random.rand(100),
                         "second": np.random.rand(100),
                         "class": np.random.randint(0, 2, (100, ))},
                        index=range(100))


store_export = pd.HDFStore('export.h5')

# The documentation passes 'data_columns=df_dc.columns' here, which I assume is a mistake
store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)

store_export

We can take a look at what's in this file:

store_export.close()
!h5ls -r export.h5
Output
/                        Group
/df_for_r                Group
/df_for_r/_i_table       Group
/df_for_r/_i_table/class Group
/df_for_r/_i_table/class/abounds Dataset {0/Inf}
/df_for_r/_i_table/class/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/class/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/indicesLR Dataset {131072}
/df_for_r/_i_table/class/mbounds Dataset {0/Inf}
/df_for_r/_i_table/class/mranges Dataset {0/Inf}
/df_for_r/_i_table/class/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/class/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/sortedLR Dataset {131201}
/df_for_r/_i_table/class/zbounds Dataset {0/Inf}
/df_for_r/_i_table/first Group
/df_for_r/_i_table/first/abounds Dataset {0/Inf}
/df_for_r/_i_table/first/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/first/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/indicesLR Dataset {131072}
/df_for_r/_i_table/first/mbounds Dataset {0/Inf}
/df_for_r/_i_table/first/mranges Dataset {0/Inf}
/df_for_r/_i_table/first/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/first/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/sortedLR Dataset {131201}
/df_for_r/_i_table/first/zbounds Dataset {0/Inf}
/df_for_r/_i_table/index Group
/df_for_r/_i_table/index/abounds Dataset {0/Inf}
/df_for_r/_i_table/index/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/index/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/indicesLR Dataset {131072}
/df_for_r/_i_table/index/mbounds Dataset {0/Inf}
/df_for_r/_i_table/index/mranges Dataset {0/Inf}
/df_for_r/_i_table/index/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/index/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/sortedLR Dataset {131201}
/df_for_r/_i_table/index/zbounds Dataset {0/Inf}
/df_for_r/_i_table/second Group
/df_for_r/_i_table/second/abounds Dataset {0/Inf}
/df_for_r/_i_table/second/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/second/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/indicesLR Dataset {131072}
/df_for_r/_i_table/second/mbounds Dataset {0/Inf}
/df_for_r/_i_table/second/mranges Dataset {0/Inf}
/df_for_r/_i_table/second/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/second/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/sortedLR Dataset {131201}
/df_for_r/_i_table/second/zbounds Dataset {0/Inf}
/df_for_r/table          Dataset {200/Inf}

Next, the documentation gives an R function for reading this data. Comparing that function with the file written above already shows a mismatch:

library(rhdf5)

loadhdf5data <- function(h5File) {

listing <- h5ls(h5File)
# Find all data nodes, values are stored in *_values and corresponding column
# titles in *_items
data_nodes <- grep("_values", listing$name)
name_nodes <- grep("_items", listing$name)
data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
columns = list()
for (idx in seq(data_paths)) {
  # NOTE: matrices returned by h5read have to be transposed to obtain
  # required Fortran order!
  data <- data.frame(t(h5read(h5File, data_paths[idx])))
  names <- t(h5read(h5File, name_paths[idx]))
  entry <- data.frame(data)
  colnames(entry) <- names
  columns <- append(columns, entry)
}

data <- data.frame(columns)

return(data)
}

For example, no entries in export.h5 have _values or _items in their names.

If we actually call this function, we get an empty dataframe back:

> loadhdf5data("export.h5")
data frame with 0 columns and 0 rows
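
The empty result follows directly from the two grep calls. As a minimal sketch (not part of the pandas documentation), the same substring checks can be applied in Python to the node names from the h5ls listing above. The helper name find_block_nodes is hypothetical, and the per-column index sub-nodes (abounds, bounds, and so on) are left out for brevity; none of them contain either pattern.

```python
# Node paths copied from the `h5ls -r export.h5` listing above.
# The index sub-nodes under each _i_table/<column> group are omitted.
TABLE_FORMAT_NODES = [
    "/df_for_r",
    "/df_for_r/_i_table",
    "/df_for_r/_i_table/class",
    "/df_for_r/_i_table/first",
    "/df_for_r/_i_table/index",
    "/df_for_r/_i_table/second",
    "/df_for_r/table",
]


def find_block_nodes(paths):
    """Mirror the two grep() calls in loadhdf5data (hypothetical helper).

    Returns the paths whose last component contains "_values" and
    those whose last component contains "_items".
    """
    def leaf(path):
        return path.rsplit("/", 1)[-1]

    values = [p for p in paths if "_values" in leaf(p)]
    items = [p for p in paths if "_items" in leaf(p)]
    return values, items


print(find_block_nodes(TABLE_FORMAT_NODES))  # prints ([], [])
```

With no matching nodes, the loop in loadhdf5data never runs, so the function returns an empty data frame.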

This function does seem to work if the file is written using the "fixed" format:

df_for_r.to_hdf("export2.h5", key="df_for_r", format="fixed")
> loadhdf5data("export2.h5")
          first      second class
1   0.675013759 0.787289926     0
2   0.936797348 0.349671699     1
3   0.951930811 0.275965069     0
4   0.203085530 0.380154180     0
5   0.627195223 0.462702969     1
6   0.129148756 0.385663581     1
...
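
To see why the fixed format works with the R function, one can list the child nodes pandas writes for each format. The following is a rough sketch, assuming pandas, NumPy and PyTables are installed; the helper hdf_node_names is hypothetical, and it relies on the PyTables group attribute _v_children, which is expected to list only visible children (so index groups such as _i_table may not appear).

```python
import os
import tempfile

import numpy as np
import pandas as pd


def hdf_node_names(df, fmt):
    """Write df under the key 'df_for_r' using the given format and
    return the names of the child nodes of that group (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "export.h5")
        df.to_hdf(path, key="df_for_r", format=fmt)
        with pd.HDFStore(path, mode="r") as store:
            group = store.get_node("df_for_r")
            # _v_children maps child node names to nodes (PyTables API).
            return sorted(group._v_children)


df_for_r = pd.DataFrame(
    {
        "first": np.random.rand(100),
        "second": np.random.rand(100),
        "class": np.random.randint(0, 2, (100,)),
    },
    index=range(100),
)

fixed_names = hdf_node_names(df_for_r, "fixed")
table_names = hdf_node_names(df_for_r, "table")

print("fixed:", fixed_names)
print("table:", table_names)
```

If the fixed format behaves as the R function expects, its listing should contain names ending in _values and _items, while the table listing should contain neither.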

This is somewhat contrary to the prose for this section, which reads:

HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables.

It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this:

Suggested fix for documentation

The section should probably state that the "table" format doesn't work here. In addition, since external compatibility relies on the user writing code to read this format, maybe a specification for the format should be documented here?
