Skip to content

Don't make dropping missing rows a default behavior for HDF append()? #9382

Closed
@nickeubank

Description

@nickeubank

Hi All,

At the moment, the default behavior for the HDF append() function ( docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html?highlight=append#pandas.HDFStore.append ) is to silently drop all rows that are all NaN except for the index.

As I understand it from a PyData exchange with Jeff, the reason is that people working with panels often have sparse datasets, so this is a very reasonable default.

However, while I appreciate the appeal for time-series analysis, I think this is a dangerous default. The main reason is that the assumption is that if an index has a value but the columns do not, there is no meaningful data in the row. But while true in a time series context -- where it's easy to reconstruct the index values that are dropped -- if indexes contain information like userIDs, sensor codes, place names, etc., the index itself is meaningful, and not easy to reconstruct. Thus the default behavior is potentially deleting user data without a warning.

Given the trade-off between a default that may lead to inefficient storage (dropna = False) and one that potentially erases user data (dropna = True), I think we should error on the side of data preservation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions