Skip to content

BUG: fix HDFStore iterator to handle a where properly (GH8014) #8029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 16, 2014

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Aug 14, 2014

closes #8014

@jreback
Copy link
Contributor Author

jreback commented Aug 14, 2014

cc @bboerner
give this a try

@bboerner
Copy link

@jreback, the fix looks good, thanks!

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2014

@bboerner ok this is fixed for a partial selection case.

note when you put up a fix, it should be on top of the current PR and only show the changes. e.g. just create whatever the changes are; these are local to you and don't modify anything else (unless they are pulled in).

@bboerner
Copy link

@jreback, thanks I verified the fix. And for the explanation about submitting fixes.

Several thoughts:

  1. I think the current = current + self.chunksize should revert back to above the check for if value is None else if None the loop continues w/o updating current.
  2. current = current + self.chunksize could be current = stop - same effect and IMHO clearer.
  3. As currently (and previously in the PR) implemented this can return one (or possibly many more) empty DataFrames through the generator. concat([df for df in it]) handles that case but it doesn't seem natural to get back a large number of empty DataFrames from an iterator. That could be handled by modifying the check to be if value is None or value.empty: Thoughts? I can put together a test case if you want.

Regards.

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2014

  1. easy

  2. and 3) do you have something that breaks this?

None should never actually be returned (it will raise if something bad happens or return an empty frame), never None (I may have had that for some reason but don't remember).

I don't think this can return a list of empty frames, but just a list of a single empty frame

e.g. your selection is completely invalid, then you will get a single empty frame. (but should test this).

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2014

updated

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2014

actually I think its correct to yield empties if the query doesn't select anything (you have to remember the columns are returned correct). Otherwise this is odd I think. You should get back something for each iteration and empty is valid.

@bboerner
Copy link

Thanks.

  1. I encountered the empty DataFrames a while back selecting ~10K rows from a file containing 1.74e7 rows and got back 1 non-empty DataFrame and 173 emptys. That was unexpected and not intuitive.
  2. More troubling is that it breaks with Python idiom e.g.
for e in []:
    print True

never prints True so I suggest that for a where clause which selects no rows that:

for df in store.select('df',where=where,iterator=True):
    print True

should also not print True.

Commit bboerner/pandas@7fedf5b contains the test cases illustrating.

If you'd like me to open this as a seperate issue let me know.

Regards.

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2014

ok, updated. (changed 1 item in your test which seemed wrong, see my comment).

@bboerner
Copy link

Thanks, look good. Is there a performance preference for not len(value): vs. value.empty?

Regards.

@jreback
Copy link
Contributor Author

jreback commented Aug 15, 2014

doesn't make any difference
this is all dwarfed by the actual operation of selection

@bboerner
Copy link

Thanks.

jreback added a commit that referenced this pull request Aug 16, 2014
BUG: fix HDFStore iterator to handle a where properly (GH8014)
@jreback jreback merged commit 0ecb4cb into pandas-dev:master Aug 16, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Iterating through TableIterator with where clause can incorrectly ignore data
2 participants