HDFStore.append_to_multiple doesn't write rows that are all np.nan #4698


Merged
1 commit merged into pandas-dev:master on Sep 3, 2013

Conversation

adgaudio
Contributor

Hi,

Using HDFStore.append_to_multiple, if a row being written to any one table consists entirely of np.nan, that row is not written to that table, but it is written to the other tables. The following code reproduces the issue and shows a fix.

I would prefer that append_to_multiple maintain synchronized rows across tables, and to my knowledge the best way to do that is to drop such a row from the other tables as well. We would probably need a fix that looks something like this PR.

I'm not sure if this is the best way to do this, and would love some feedback!

Thanks,
Alex

The Fix:

# setup
import pandas
from functools import reduce  # reduce is a builtin in python 2, but this also works on python 3

d = {'table1': ['col1', 'col2'], 'table2': ['col3', 'col4']}
value = pandas.DataFrame([[1, 2, 3, 4], [5, 6, None, None]],
                         columns=['col1', 'col2', 'col3', 'col4'])

# fix (as in PR): keep only the rows that are not entirely NaN for any sub-table
dfs = (value[cols] for cols in d.values())
valid_index = reduce(
    lambda index1, index2: index2.intersection(index1),
    [df.dropna(how='all').index for df in dfs])
value = value.ix[valid_index]
# row 1 is entirely NaN in table2's columns, so it is dropped from `value`
# and both tables stay synchronized

To reproduce the error:

In [1]: from pandas import DataFrame, HDFStore

In [2]: store = HDFStore('tmp1.h5')

In [3]: df = DataFrame([[1,2], [3,None]], columns=['a', 'b'])

In [4]: store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1')

In [5]: store
Out[5]:
<class 'pandas.io.pytables.HDFStore'>
File path: tmp1.h5
/table1            frame_table  (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[a])
/table2            frame_table  (typ->appendable,nrows->1,ncols->1,indexers->[index])

In [6]: store.select_as_multiple(['table1', 'table2'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-1a2861fcaa34> in <module>()
----> 1 store.select_as_multiple(['table1', 'table2'])

/Users/alexgaudio/.virtualenvs/ds/lib/python2.7/site-packages/pandas/io/pytables.pyc in select_as_multiple(self, keys, where, selector, columns, start, stop, iterator, chunksize, auto_close, **kwargs)
    540                 nrows = t.nrows
    541             elif t.nrows != nrows:
--> 542                 raise ValueError("all tables must have exactly the same nrows!")
    543
    544         # select coordinates from the selector table

ValueError: all tables must have exactly the same nrows!

@jreback
Contributor

jreback commented Aug 27, 2013

related #4625

this is a more general issue; I think a better way to handle it is to add a dropna=True keyword (which preserves the existing behavior), and then your solution would work

please add some tests for this as well

@adgaudio
Contributor Author

@jreback - thanks for your prompt feedback! I just added to the existing tests for that method and added a dropna kwarg + docstring to the code.

For dropna, I deviated from the normal True/False syntax. Specifically, dropna = 'any' | 'all' | False. If dropna evaluates to True, the code calls DataFrame.dropna(how=dropna). You might prefer to revert to normal True/False behavior...

Also, if I entirely misinterpreted what you meant by dropna to begin with, please let me know!
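For concreteness, the filtering semantics I have in mind are roughly this (an illustrative sketch only — the helper name is made up and is not the PR's actual code):

def _filter_rows(value, dropna):
    # dropna is one of 'any', 'all', or False
    if dropna:
        # e.g. how='all' drops only the rows that are entirely NaN
        return value.dropna(how=dropna)
    return value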

@jreback
Contributor

jreback commented Aug 28, 2013

can you hook up Travis (see contributing.md)
add a release notes entry

@adgaudio
Contributor Author

Travis is enabled for my fork, and I've squashed. When tests pass, it's good to go.

I also modified existing documentation, which you may want to review.

Thanks!
Alex

defines which table is the selector table (which you can make queries from).
The argument ``dropna`` will remove rows from the input DataFrame with
'any' or 'all' ``np.NaN`` values, unless it evaluates to False. ``dropna``
will also ensure that tables are synchronized across rows. For example,
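An illustrative call along those lines, reusing the tables from the reproduction above (this is only a sketch, not the example from the actual documentation):

store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1',
                         dropna='all')
# a row that is entirely np.nan for one of the tables is dropped from all tables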
Contributor

this is good - maybe break into 2 paragraphs

Contributor Author

thanks - I attempted to simplify a little further and broke it into 2 paragraphs

@adgaudio
Contributor Author

Thanks for your review! I made the changes you suggested, and we should be good for review 2 if Travis tests pass for python3.

@jreback
Contributor

jreback commented Aug 29, 2013

gr8 I'll look further later this week

@jreback
Contributor

jreback commented Aug 31, 2013

@adgaudio please rebase on master; I merged #4714. This should allow you to simply pass dropna=False when creating the tables in append_to_multiple

@adgaudio
Contributor Author

I just rebased and pushed. Thanks!

@@ -833,6 +836,14 @@ def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, *
if data_columns is None:
data_columns = d[selector]

# ensure rows are synchronized across the tables
Contributor

what I meant was that you can just simply pass dropna=False to the individual table appends; if dropna=True is passed to append_to_multiple, that will achieve the same effect.

Contributor

nvm....what you have is correct

@jreback
Contributor

jreback commented Aug 31, 2013

can you add your example as 2 test cases, one with dropna=True, the other with dropna=False (and assert that it still raises the error)? thanks.....then I think it's ready to merge
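Roughly something like this (file handling simplified here; this is a sketch, not the code that actually landed in the test suite):

import numpy as np
from pandas import DataFrame, HDFStore

def test_append_to_multiple_dropna():
    # with dropna=True the all-NaN row for table2 is dropped from both tables,
    # so the tables stay synchronized and select_as_multiple works
    df = DataFrame([[1, 2], [3, np.nan]], columns=['a', 'b'])
    store = HDFStore('tmp_dropna_true.h5', mode='w')
    store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1',
                             dropna=True)
    assert len(store.select_as_multiple(['table1', 'table2'])) == 1
    store.close()

def test_append_to_multiple_dropna_false():
    # with dropna=False the tables end up with different nrows, so
    # select_as_multiple should still raise the "same nrows" ValueError
    df = DataFrame([[1, 2], [3, np.nan]], columns=['a', 'b'])
    store = HDFStore('tmp_dropna_false.h5', mode='w')
    store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1',
                             dropna=False)
    try:
        store.select_as_multiple(['table1', 'table2'])
        raise AssertionError('expected ValueError')
    except ValueError:
        pass
    store.close()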

tables are synchronized. This means that if a row for one of the tables
being written to is entirely ``np.NaN``, that row will be dropped from all tables.

If ``dropna`` is False, the user is responsible for synchronizing tables.
Contributor

can you put the 'USER IS RESPONSIBLE FOR SYNCHRONIZING TABLES' back in CAPS?

Contributor Author

done

@adgaudio
Contributor Author

I removed the doc section that raises the error, and I agree that showing it in the documentation is probably unnecessary (which is why it wasn't there to begin with).

Unless I'm misunderstanding you, the tests already test both dropna=True (the # regular operation subsection) and dropna=False (the # don't maintain synchronized rows across tables if dropna=False subsection). Regarding tests, I'm not sure what you're asking for.

If it's simpler for you to just make changes to the test, please feel free to modify the PR however you wish. Otherwise, I'll need a little more explanation on what you're looking for.

@jreback
Contributor

jreback commented Aug 31, 2013

@adgaudio the tests can't have tested exactly the case that you have, because otherwise they would fail. I want you to add the example as a set of tests (make a new function), rather than modify an existing test as you did.

…re writing

adds dropna kwarg + docstring + tests + documentation + release note

python3 compatible
@adgaudio
Contributor Author

adgaudio commented Sep 3, 2013

@jreback, I broke out the tests into their own function and rebased onto master. Let me know if you need anything else!

Thanks,
Alex

jreback added a commit that referenced this pull request Sep 3, 2013
HDFStore.append_to_multiple doesn't write rows that are all np.nan
@jreback jreback merged commit 6ffed43 into pandas-dev:master Sep 3, 2013
@jreback
Contributor

jreback commented Sep 3, 2013

Thanks for the pr!

@adgaudio
Contributor Author

adgaudio commented Sep 3, 2013

Awesome, thanks!

@jreback
Contributor

jreback commented Sep 3, 2013

gr8

also been thinking about a class to help out in splitting and reconstructing the shards
(e.g. instead of actually specifying a dictionary d, you would create a class to do this, which could then be serialized as well, to facilitate automatic table reconstruction)

if you have suggestions or maybe a nice use case, that would be interesting
(as I don't actually use this much)

@adgaudio
Contributor Author

Hey @jreback,

I bet I significantly misinterpreted what you were thinking, but here's my attempt to rephrase your idea... Also, if you like this, let me know how we should take it to the next step:

Problem (or my interpretation):
We'd like a way to write and collect large amounts of data into an HDFStore, and we need to write more data than fits in memory at any one time. The existing append_to_multiple method requires one massive DataFrame, which takes up too much memory to be particularly useful.

Possible Solution:
To resolve this problem, we want to buffer an incoming data stream in memory and then periodically flush it to the HDFStore according to some specific configuration. The incoming data stream could be smaller DataFrames, or an "infinite" list of dictionaries (or whatever else DataFrames accept in their constructor). The buffer should be flushed to the HDFStore whenever it reaches a certain size. We can implement this buffer as a class with "write", "flush", and "close" methods.

I've actually implemented this and would be willing to contribute it to pandas, if it's suitable for pandas. On the other hand, I'm not sure this is a direction we all should be going. What do you think?

The class I implemented (and regularly use) to write data to HDFStore looks basically like this:

class BufferedHDFStore:
    def __init__(self, path, key, max_rows=10000):
        self.store = pandas.HDFStore(path)
        self.key, self.max_rows, self.buffer = key, max_rows, []

    def write(self, dict_of_data):
        # validate input data, maybe? then append to an internal buffer
        self.buffer.append(dict_of_data)
        # if the internal buffer grows past some threshold, flush to the HDFStore
        if len(self.buffer) >= self.max_rows:
            self.flush()

    def flush(self):
        # convert the in-memory buffer to a DataFrame and append to the HDFStore
        if self.buffer:
            self.store.append(self.key, pandas.DataFrame(self.buffer))
            self.buffer = []

    def close(self):
        # flush anything left in the buffer, then close the HDFStore
        self.flush()
        self.store.close()
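Usage then looks roughly like this (the names here are placeholders):

buf = BufferedHDFStore('data.h5', 'events', max_rows=50000)
for record in stream_of_dicts:  # any iterable of dicts
    buf.write(record)
buf.close()  # flushes whatever is left in the buffer and closes the store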

@jreback
Contributor

jreback commented Sep 10, 2013

@adgaudio hmm....I was talking about a different problem actually

basically the idea of 'sharding' a wide table into small sub-tables, but with a nicer interface and the saving of the metadata into the table node so that reconstruction becomes simpler. e.g. right now you have to have the dict d that you constructed the table with in append_to_multiple to reconstruct it properly

your idea is a different one, but how is it different from just periodically appending data?
(e.g. I do this when I have a store and then every day add a bunch of rows to it for that day; I am storing Panels)

@adgaudio
Contributor Author

haha yea, I totally misinterpreted that! Not requiring the table names in append_to_multiple could be pretty nice. I would need some time to play around with this idea, though.

Regarding how a buffered implementation is different from just periodically appending data: I have a particular use-case in which the stream of dicts gets chunked into (buffered) lists of Series, and each list then gets converted to a DataFrame and finally appended to the HDFStore. If the stream of dicts were already in DataFrame form, then, as you suggested, I wouldn't need to build this BufferedHDFStore implementation. On the other hand, I'm not completely happy with this solution, because something about it seems over-engineered. But given low memory constraints and some other details inspiring the work, it was a difficult project and is now doing the job well.

@jreback
Contributor

jreback commented Sep 10, 2013

Why are you saving the intermediate form as a dict of Series? Why not just add to the frame as you create/modify a list, then periodically append?

I'm all for doing chunked/iterator-based solutions. Yours 'sounds' like a write-iterator. Can you show a test example?

An example off the top of my head where I know we probably need a better solution is iterating over a csv and then putting it into an HDF file iteratively...

which itself is worthy of a pandas function....maybe csv_to_hdf (which is just an exercise in figuring out dtypes ahead of time and chunking the read and write) - because an int column in, say, the first chunk could later have missing data, so you really have to read it twice to be sure
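Something along these lines, as a naive sketch (csv_to_hdf doesn't actually exist, and this ignores the dtype problem described above):

import pandas as pd

def csv_to_hdf(csv_path, hdf_path, key, chunksize=100000):
    # read the csv in chunks and append each chunk to a table in the store;
    # a real version would have to resolve dtypes across chunks first (e.g. an
    # int column in the first chunk could have missing data in a later one)
    store = pd.HDFStore(hdf_path, mode='w')
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        store.append(key, chunk)
    store.close()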

@jreback jreback mentioned this pull request Sep 10, 2013