HDFStore.append_to_multiple doesn't write rows that are all np.nan #4698


Merged
1 commit merged into pandas-dev:master on Sep 3, 2013

Conversation

adgaudio
Contributor

Hi,

Using HDFStore.append_to_multiple, if a row being written to any one table consists entirely of np.nan, that row is not written to that table, but it is written to the other tables. The following code reproduces the issue and shows a fix.

I would prefer that append_to_multiple maintain synchronized rows across tables, and to my knowledge the best way to do that is to drop such a row from the other tables as well. We would probably need a fix that looks something like this PR.

I'm not sure if this is the best way to do this, and would love some feedback!

Thanks,
Alex

The Fix:

# setup
import pandas
from functools import reduce  # reduce is a builtin in python 2, but this also works on python 3

d = {'table1': ['col1', 'col2'], 'table2': ['col3', 'col4']}
value = pandas.DataFrame([[1, 2, 3, 4], [5, 6, None, None]],
                         columns=['col1', 'col2', 'col3', 'col4'])

# fix (as in PR): keep only the rows that are not entirely NaN for any sub-table
dfs = (value[cols] for cols in d.values())
valid_index = reduce(
    lambda index1, index2: index2.intersection(index1),
    [df.dropna(how='all').index for df in dfs])
value = value.ix[valid_index]
# row 1 is entirely NaN in table2's columns, so it is dropped from `value`
# and both tables stay synchronized

To reproduce the error:

In [1]: from pandas import DataFrame, HDFStore

In [2]: store = HDFStore('tmp1.h5')

In [3]: df = DataFrame([[1,2], [3,None]], columns=['a', 'b'])

In [4]: store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1')

In [5]: store
Out[5]:
<class 'pandas.io.pytables.HDFStore'>
File path: tmp1.h5
/table1            frame_table  (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[a])
/table2            frame_table  (typ->appendable,nrows->1,ncols->1,indexers->[index])

In [6]: store.select_as_multiple(['table1', 'table2'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-1a2861fcaa34> in <module>()
----> 1 store.select_as_multiple(['table1', 'table2'])

/Users/alexgaudio/.virtualenvs/ds/lib/python2.7/site-packages/pandas/io/pytables.pyc in select_as_multiple(self, keys, where, selector, columns, start, stop, iterator, chunksize, auto_close, **kwargs)
    540                 nrows = t.nrows
    541             elif t.nrows != nrows:
--> 542                 raise ValueError("all tables must have exactly the same nrows!")
    543
    544         # select coordinates from the selector table

ValueError: all tables must have exactly the same nrows!

@jreback
Contributor

jreback commented Aug 27, 2013

related #4625

this is a more general issue; I think a better way to handle it is to add a dropna=True keyword (which preserves the existing behavior), and then your solution would work

please add some tests for this as well

@adgaudio
Contributor Author

@jreback - thanks for your prompt feedback! I just added to the existing tests for that method and added a dropna kwarg + docstring to the code.

For dropna, I deviated from the normal True/False syntax. Specifically, dropna = 'any' | 'all' | False. If dropna evaluates to True, the code calls DataFrame.dropna(how=dropna). You might prefer to revert to normal True/False behavior...

Also, if I entirely misinterpreted what you meant by dropna to begin with, please let me know!
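For concreteness, the filtering semantics I have in mind are roughly this (an illustrative sketch only — the helper name is made up and is not the PR's actual code):

def _filter_rows(value, dropna):
    # dropna is one of 'any', 'all', or False
    if dropna:
        # e.g. how='all' drops only the rows that are entirely NaN
        return value.dropna(how=dropna)
    return value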

@jreback
Contributor

jreback commented Aug 28, 2013

can you hook up Travis (see contributing.md)
add a release notes entry

@adgaudio
Contributor Author

Travis is enabled for my fork, and I've squashed. When tests pass, it's good to go.

I also modified existing documentation, which you may want to review.

Thanks!
Alex

defines which table is the selector table (which you can make queries from).
The argument ``dropna`` will remove rows from the input DataFrame with
'any' or 'all' ``np.NaN`` values, unless it evaluates to False. ``dropna``
will also ensure that tables are synchronized across rows. For example,
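An illustrative call along those lines, reusing the tables from the reproduction above (this is only a sketch, not the example from the actual documentation):

store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1',
                         dropna='all')
# a row that is entirely np.nan for one of the tables is dropped from all tables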
Contributor

this is good - maybe break into 2 paragraphs

Contributor Author

thanks - I attempted to simplify a little further and broke it into 2 paragraphs

@adgaudio
Contributor Author

Thanks for your review! I made the changes you suggested, and we should be good for review 2 if Travis tests pass for python3.

@jreback
Contributor

jreback commented Aug 29, 2013

gr8 I'll look further later this week

@jreback
Contributor

jreback commented Aug 31, 2013

@adgaudio please rebase on master; I merged #4714. This should allow you to simply pass dropna=False when creating the tables in append_to_multiple

@adgaudio
Contributor Author

I just rebased and pushed. Thanks!

@@ -833,6 +836,14 @@ def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, *
if data_columns is None:
data_columns = d[selector]

# ensure rows are synchronized across the tables
Contributor

what I meant was that you can just simply pass dropna=False to the individual table appends; if dropna=True is passed to append_to_multiple, that will achieve the same effect.

Contributor

nvm....what you have is correct

@jreback
Contributor

jreback commented Aug 31, 2013

can you add your example as 2 test cases, one with dropna=True, the other with dropna=False (and assert that it still raises the error)? thanks.....then I think it's ready to merge
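Roughly something like this (file handling simplified here; this is a sketch, not the code that actually landed in the test suite):

import numpy as np
from pandas import DataFrame, HDFStore

def test_append_to_multiple_dropna():
    # with dropna=True the all-NaN row for table2 is dropped from both tables,
    # so the tables stay synchronized and select_as_multiple works
    df = DataFrame([[1, 2], [3, np.nan]], columns=['a', 'b'])
    store = HDFStore('tmp_dropna_true.h5', mode='w')
    store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1',
                             dropna=True)
    assert len(store.select_as_multiple(['table1', 'table2'])) == 1
    store.close()

def test_append_to_multiple_dropna_false():
    # with dropna=False the tables end up with different nrows, so
    # select_as_multiple should still raise the "same nrows" ValueError
    df = DataFrame([[1, 2], [3, np.nan]], columns=['a', 'b'])
    store = HDFStore('tmp_dropna_false.h5', mode='w')
    store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1',
                             dropna=False)
    try:
        store.select_as_multiple(['table1', 'table2'])
        raise AssertionError('expected ValueError')
    except ValueError:
        pass
    store.close()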

tables are synchronized. This means that if a row for one of the tables
being written to is entirely ``np.NaN``, that row will be dropped from all tables.

If ``dropna`` is False, the user is responsible for synchronizing tables.
Contributor

can you put the 'USER IS RESPONSIBLE FOR SYNCHRONIZING TABLES' back in CAPS?

Contributor Author

done

@adgaudio
Contributor Author

I removed the doc section that raises the error, and I agree that showing it in the documentation is probably unnecessary (which is why it wasn't there to begin with).

Unless I'm misunderstanding you, the tests already test both dropna=True (the # regular operation subsection) and dropna=False (the # don't maintain synchronized rows across tables if dropna=False subsection). Regarding tests, I'm not sure what you're asking for.

If it's simpler for you to just make changes to the test, please feel free to modify the PR however you wish. Otherwise, I'll need a little more explanation on what you're looking for.

@jreback
Contributor

jreback commented Aug 31, 2013

@adgaudio the tests can't have tested exactly the case that you have, because otherwise they would fail. I want you to add the example as a set of tests (make a new function), rather than modify an existing test as you did.

…re writing

adds dropna kwarg + docstring + tests + documentation + release note

python3 compatible
@adgaudio
Contributor Author

adgaudio commented Sep 3, 2013

@jreback, I broke out the tests into their own function and rebased onto master. Let me know if you need anything else!

Thanks,
Alex

jreback added a commit that referenced this pull request Sep 3, 2013
HDFStore.append_to_multiple doesn't write rows that are all np.nan
@jreback jreback merged commit 6ffed43 into pandas-dev:master Sep 3, 2013
@jreback
Contributor

jreback commented Sep 3, 2013

Thanks for the pr!

@adgaudio
Contributor Author

adgaudio commented Sep 3, 2013

Awesome, thanks!

@jreback
Contributor

jreback commented Sep 3, 2013

gr8

also been thinking about a class to help out in splitting and reconstructing the shards
(e.g. instead of actually specifying a dictionary d, you would create a class to do this, which could then be serialized as well, to facilitate automatic table reconstruction)

if you have suggestions or maybe a nice use case, that would be interesting
(as I don't actually use this much)

@adgaudio
Contributor Author

Hey @jreback,

I bet I significantly misinterpreted what you were thinking, but here's my attempt to rephrase your idea... Also, if you like this, let me know how we should take it to the next step:

Problem (or my interpretation):
We'd like a way to write and collect large amounts of data into an HDFStore, and we need to write more data than fits in memory at any one time. The existing append_to_multiple method requires one massive DataFrame, which takes up too much memory to be particularly useful.

Possible Solution:
To resolve this problem, we want to buffer an incoming data stream in memory and then periodically flush it to the HDFStore according to some specific configuration. The incoming data stream could be smaller DataFrames, or an "infinite" list of dictionaries (or whatever else DataFrames accept in their constructor). The buffer should be flushed to the HDFStore whenever it reaches a certain size. We can implement this buffer as a class with "write", "flush", and "close" methods.

I've actually implemented this and would be willing to contribute it to pandas, if it's suitable for pandas. On the other hand, I'm not sure this is a direction we all should be going. What do you think?

The class I implemented (and regularly use) to write data to HDFStore looks basically like this:

class BufferedHDFStore:
    def __init__(self, path, key, max_rows=10000):
        self.store = pandas.HDFStore(path)
        self.key, self.max_rows, self.buffer = key, max_rows, []

    def write(self, dict_of_data):
        # validate input data, maybe? then append to an internal buffer
        self.buffer.append(dict_of_data)
        # if the internal buffer grows past some threshold, flush to the HDFStore
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self):
        # convert the in-memory buffer to a DataFrame and append to the HDFStore
        if self.buffer:
            self.store.append(self.key, pandas.DataFrame(self.buffer))
            self.buffer = []

    def close(self):
        # flush anything left in the buffer, then close the HDFStore
        self.flush()
        self.store.close()
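Usage then looks roughly like this (the names here are placeholders):

buf = BufferedHDFStore('data.h5', 'events', max_rows=50000)
for record in stream_of_dicts:  # any iterable of dicts
    buf.write(record)
buf.close()  # flushes whatever is left in the buffer and closes the store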

@jreback
Contributor

jreback commented Sep 10, 2013

@adgaudio hmm....I was talking about a different problem actually

basically the idea of 'sharding' a wide table into small sub-tables, but with a nicer interface and the saving of the metadata into the table node so that reconstruction becomes simpler. e.g. right now you have to have the dict d that you constructed the table with in append_to_multiple to reconstruct it properly

your idea is a different one, but how is it different from just periodically appending data?
(e.g. I do this when I have a store and then every day add a bunch of rows to it for that day; I am storing Panels)

@adgaudio
Contributor Author

haha yea, I totally misinterpreted that! Not requiring the table names in append_to_multiple could be pretty nice. I would need some time to play around with this idea, though.

Regarding how a buffered implementation is different from just periodically appending data: I have a particular use-case in which the stream of dicts gets chunked into (buffered) lists of Series, and each list then gets converted to a DataFrame and finally appended to the HDFStore. If the stream of dicts were already in DataFrame form, then, as you suggested, I wouldn't need to build this BufferedHDFStore implementation. On the other hand, I'm not completely happy with this solution, because something about it seems over-engineered. But given low memory constraints and some other details inspiring the work, it was a difficult project and is now doing the job well.

@jreback
Contributor

jreback commented Sep 10, 2013

Why are you saving the intermediate form as a dict of Series? Why not just add to the frame as you create/modify a list, then periodically append?

I'm all for doing chunked/iterator-based solutions. Yours 'sounds' like a write-iterator. Can you show a test example?

An example off the top of my head where I know we probably need a better solution is iterating over a csv and then putting it into an HDF file iteratively...

which itself is worthy of a pandas function....maybe csv_to_hdf (which is just an exercise in figuring out dtypes ahead of time and chunking the read and write) - because an int column in, say, the first chunk could later have missing data, so you really have to read it twice to be sure
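Something along these lines, as a naive sketch (csv_to_hdf doesn't actually exist, and this ignores the dtype problem described above):

import pandas as pd

def csv_to_hdf(csv_path, hdf_path, key, chunksize=100000):
    # read the csv in chunks and append each chunk to a table in the store;
    # a real version would have to resolve dtypes across chunks first (e.g. an
    # int column in the first chunk could have missing data in a later one)
    store = pd.HDFStore(hdf_path, mode='w')
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        store.append(key, chunk)
    store.close()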

@jreback jreback mentioned this pull request Sep 10, 2013