HDFStore.append_to_multiple doesn't write rows that are all np.nan #4698
Conversation
related #4625 - this is a more general issue; I think a better way to handle it is to add a dropna=True keyword (which preserves the existing behavior), then your solution would work. Please add some tests for this as well.
@jreback - thanks for your prompt feedback! I just added to the existing tests for that method and added a dropna kwarg + docstring to the code. For dropna, I deviated from the normal True/False syntax. Specifically, dropna='any' | 'all' | False. If dropna evaluates to True, then call DF.dropna(how=dropna). You might prefer to revert to normal True/False behavior... Also, if I entirely misinterpreted what you meant by dropna to begin with, please let me know!
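As a rough illustration of the semantics described above (a minimal sketch with made-up data; the actual change lives inside ``append_to_multiple``):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

dropna = "all"  # proposed values: 'any' | 'all' | False
if dropna:
    # 'any' drops rows containing any NaN; 'all' drops only rows
    # that are NaN in every column
    df = df.dropna(how=dropna)
```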
can you hook up Travis? (see contributing.md)
Travis is enabled for my fork, and I've squashed. When tests pass, it's good to go. I also modified existing documentation, which you may want to review. Thanks!
defines which table is the selector table (which you can make queries from).
The argument ``dropna`` will remove rows from the input DataFrame with
'any' or 'all' ``np.NaN`` values, unless it evaluates to False. ``dropna``
will also ensure that tables are synchronized across rows. For example,
this is good - maybe break into 2 paragraphs
thanks - I attempted to simplify a little further and broke into 2 paragraphs
Thanks for your review! I made the changes you suggested, and we should be good for review round 2 if the Travis tests pass on Python 3.
great - I'll look further later this week
I just rebased and pushed. Thanks!
@@ -833,6 +836,14 @@ def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, *
        if data_columns is None:
            data_columns = d[selector]

        # ensure rows are synchronized across the tables
what I meant was that for the individual table appends, you can just simply pass dropna=False; if dropna=True is passed to append_to_multiple, that will achieve the same effect.
nvm....what you have is correct
can you add your example as 2 test cases, one with dropna=True and the other with dropna=False (and assert that it still raises the error)? Thanks..... then I think it's ready to merge
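For context, the synchronization being discussed amounts to roughly the following (a simplified sketch, not the actual pandas source; ``d`` maps table names to column lists, ``value`` is the full DataFrame, and the 'remaining columns' (None) entry of ``d`` is ignored here):

```python
def synchronize_rows(value, d, dropna="all"):
    # drop any row that is entirely NaN within some sub-table's column
    # group, so every sub-table ends up with the same index
    if not dropna:
        return value
    idxs = (value[cols].dropna(how=dropna).index for cols in d.values())
    valid_index = next(idxs)
    for index in idxs:
        valid_index = valid_index.intersection(index)
    return value.loc[valid_index]
```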
tables are synchronized. This means that if a row for one of the tables
being written to is entirely ``np.NaN``, that row will be dropped from all tables.

If ``dropna`` is False, the user is responsible for synchronizing tables.
can you put the 'USER IS RESPONSIBLE FOR SYNCHRONIZING TABLES' back in CAPS?
done
I removed the doc section that raises the error, and agree the documentation addition showing it is probably unnecessary (which is why it wasn't there to begin with). Unless I'm misunderstanding you, the tests already test both the dropna=True and dropna=False cases. If it's simpler for you to just make changes to the test, please feel free to modify the PR however you wish. Otherwise, I'll need a little more explanation on what you're looking for.
@adgaudio the tests can't have tested exactly the case that you have, because otherwise it would fail. I want you to add the example as a set of tests (make a new function), rather than modify an existing test as you did.
…re writing; adds dropna kwarg + docstring + tests + documentation + release note; python3 compatible
@jreback, I broke out the tests into their own function and rebased onto master. Let me know if you need anything else! Thanks,
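For illustration, a test along the lines requested might look roughly like this (a sketch, not the actual test added in the PR; it assumes pytest, PyTables, and a pandas version that has the ``dropna`` keyword, and omits the dropna=False counterpart asserting the row-count mismatch error, since that behavior is version-dependent):

```python
import numpy as np
import pandas as pd


def test_append_to_multiple_dropna(tmp_path):
    df1 = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])
    df2 = pd.DataFrame(np.random.randn(10, 2), columns=["C", "D"])
    df1.iloc[1] = np.nan  # one row entirely NaN in the first sub-table
    df = pd.concat([df1, df2], axis=1)

    with pd.HDFStore(str(tmp_path / "test.h5")) as store:
        # with dropna=True the all-NaN row is dropped from *both* tables,
        # so their indexes stay synchronized
        store.append_to_multiple(
            {"t1": ["A", "B"], "t2": ["C", "D"]}, df, selector="t1", dropna=True
        )
        assert store["t1"].index.equals(store["t2"].index)
        assert 1 not in store["t1"].index
```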
Thanks for the PR!
Awesome, thanks!
great - I've also been thinking about a class to help out in splitting and reconstructing the shards; if you have suggestions or maybe a nice use case, that would be interesting
Hey @jreback, I bet I significantly misinterpreted what you were thinking, but here's my attempt to rephrase your idea... Also, if you like this, let me know how we should take this to the next step:

Problem (or my interpretation): …

Possible Solution: …

I've actually implemented this and would be willing to contribute it to pandas, if it's suitable for pandas. On the other hand, I'm not sure this is a direction we all should be going. What do you think?

The class I implemented (and regularly use) to write data to HDFStore looks basically like this:
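(The snippet that originally followed was not preserved in this copy of the thread; below is a hedged reconstruction of what such a buffered writer might look like, based on the description in the follow-up comments. The class name, parameters, and buffering details are assumptions, and dtype/string-width pitfalls of HDFStore.append are glossed over.)

```python
import pandas as pd


class BufferedHDFStore:
    """Buffer incoming records (dicts or Series) and flush them to an
    HDFStore table in DataFrame-sized chunks.  A sketch, not the
    author's actual implementation."""

    def __init__(self, path, key, buffer_size=10000):
        self._store = pd.HDFStore(path)
        self._key = key
        self._buffer_size = buffer_size
        self._buffer = []

    def write(self, record):
        # record: a dict (or Series) representing one row
        self._buffer.append(pd.Series(record))
        if len(self._buffer) >= self._buffer_size:
            self.flush()

    def flush(self):
        # convert the buffered rows to a DataFrame and append in one shot
        if self._buffer:
            self._store.append(self._key, pd.DataFrame(self._buffer))
            self._buffer = []

    def close(self):
        self.flush()
        self._store.close()
```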
@adgaudio hmm.... I was talking about a different problem actually - basically the idea of 'sharding' a wide table into small sub-tables, but with a nicer interface and with the meta data saved into the table node so that reconstruction becomes simpler. E.g. right now you have to have the dict d that you constructed the table with. Your idea is a different one, but how is it different from just periodically appending data?
haha yea, I totally misinterpreted that! Not requiring the table names in append_to_multiple could be pretty nice. I would need some time to play around with this idea, though. Regarding how a buffered implementation is different from just periodically appending data: I have a particular use-case in which the stream of dicts gets chunked into (buffered) lists of Series, and each list then gets converted to a DataFrame and finally appended to the HDFStore. If the stream of dicts were already in DataFrame form, then, as you suggested, I wouldn't need to build this BufferedHDFStore implementation. On the other hand, I'm not completely happy with this solution, because something about it seems over-engineered. But given low memory constraints and some other details inspiring the work, it was a difficult project and is now doing the job well.
Why are you saving the intermediate form as a dict of Series? Why not just add to the frame as you create/modify a list, then periodically append? I'm all for doing chunked/iterator-based solutions. Yours 'sounds' like a write-iterator. Can you show a test example? An example off the top of my head where I know we probably need a better solution is iterating over a CSV and putting it into an HDF file iteratively... which itself is worthy of a pandas function.... maybe
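A minimal sketch of that CSV-to-HDF pattern (file names and chunk size are placeholders, and details like fixed string column widths via min_itemsize are glossed over):

```python
import pandas as pd

# stream the CSV in chunks and append each chunk to an HDFStore table,
# so the whole file never has to fit in memory at once
with pd.HDFStore("out.h5") as store:
    for chunk in pd.read_csv("data.csv", chunksize=50000):
        store.append("data", chunk, data_columns=True)
```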
Hi,
Using HDFStore.append_to_multiple, if a row written to any one table consists entirely of np.nan, that row is not written to that table, but is written to the other tables. The following code reproduces and fixes the issue.
I would prefer that append_to_multiple maintain synchronized rows across tables, and to my knowledge, the best way to do that is to drop that row from the other tables as well. We would probably need a fix that looks something like this PR.
I'm not sure if this is the best way to do this, and would love some feedback!
Thanks,
Alex
The Fix:
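(The snippet originally posted here was not preserved. Roughly, the workaround drops rows that are entirely NaN within any column group before splitting, so every table receives the same rows. A sketch with hypothetical table and column names; the merged change performs the same synchronization inside append_to_multiple via the new ``dropna`` keyword.)

```python
import numpy as np
import pandas as pd

d = {"t1": ["A", "B"], "t2": ["C", "D"]}
df = pd.DataFrame(np.random.randn(5, 4), columns=list("ABCD"))
df.loc[2, ["A", "B"]] = np.nan  # row 2 is entirely NaN for table t1

# keep only rows that are not all-NaN in every column group
keep = df.index
for cols in d.values():
    keep = keep.intersection(df[cols].dropna(how="all").index)

with pd.HDFStore("shards.h5") as store:
    store.append_to_multiple(d, df.loc[keep], selector="t1")
```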
To reproduce the error:
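(The reproduction code was likewise lost in this copy. A sketch of the reported behavior with hypothetical names; on affected versions (around pandas 0.12), the all-NaN row is silently dropped from one table but not the other, and the cross-table select then raises because the row counts differ.)

```python
import numpy as np
import pandas as pd

d = {"t1": ["A", "B"], "t2": ["C", "D"]}
df = pd.DataFrame(np.random.randn(5, 4), columns=list("ABCD"))
df.loc[2, ["A", "B"]] = np.nan  # entirely NaN for table t1 only

with pd.HDFStore("repro.h5") as store:
    store.append_to_multiple(d, df, selector="t1")
    print(len(store["t1"]), len(store["t2"]))  # 4 vs. 5 on affected versions
    store.select_as_multiple(["t1", "t2"], selector="t1")  # raises on affected versions
```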