BUG: entries missing when reading from pytables hdf store using "where" statement #9676

Closed
@alexfields


When I select from an HDF store using a "where" string (to locate rows in which one field matches a particular string value), the function returns fewer rows than when I load the entire DataFrame into memory and then match on that field. The code below reproduces the problem. Unfortunately, I can't easily provide the code that generates the source HDF store, but I'm happy to provide the kept_tids_20150310.h5 file if it would help. There are no NaN values in the DataFrame.

Running ptrepack on the HDF file solves the problem, but I don't believe this should happen in the first place.

I am using pandas 0.15.2 but have not tried 0.16.0.

>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.19.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 2.4.0
sphinx: 1.2.3
patsy: 0.2.1
dateutil: 2.4.1
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: 0.7.2
xlsxwriter: None
lxml: 2.3.2
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.7.2
apiclient: None
rpy2: 2.4.2
sqlalchemy: None
pymysql: None
psycopg2: None
>>> kept_tids = pd.read_hdf('kept_tids_20150310.h5', 'kept_tids', mode='r')
>>> kept_tids.to_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='w', format='t', data_columns=True)
>>> chroms = kept_tids['chrom'].drop_duplicates().order().tolist()
>>> print chroms
['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrM', 'chrX', 'chrY']
>>> len(kept_tids)
202836
>>> sum(len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == '%s'"%x)) for x in chroms)
193757
>>> (kept_tids['chrom']=='chr16').sum()
10157
>>> len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == 'chr16'"))
6278
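
The comparison in the session above can be generalized into a small check that reports every value whose on-disk count differs from the in-memory count. This is only a sketch: the helper names `count_mismatches` and `hdf_counter` are hypothetical, not part of pandas, and it is written for a current pandas on Python 3 rather than the 0.15.2 / Python 2.7 setup shown above.

```python
import pandas as pd


def count_mismatches(df, column, count_on_disk):
    """Return {value: (in_memory_count, on_disk_count)} where the two differ.

    `count_on_disk` is any callable that takes a value and returns how many
    rows the store yields for it (hypothetical helper, see hdf_counter).
    """
    mismatches = {}
    for value, expected in df[column].value_counts().items():
        actual = count_on_disk(value)
        if actual != expected:
            mismatches[value] = (int(expected), int(actual))
    return mismatches


def hdf_counter(path, key, column):
    """Build a counter that queries the store with a where clause.

    Requires PyTables and a table-format store in which `column` is a
    data column, as in the resaved file above.
    """
    def count(value):
        where = "%s == %r" % (column, value)  # e.g. chrom == 'chr16'
        return len(pd.read_hdf(path, key, mode="r", where=where))
    return count
```

Assuming the resaved file from the session, `count_mismatches(kept_tids, 'chrom', hdf_counter('kept_tids_20150310_resave.h5', 'kept_tids', 'chrom'))` would list the affected chromosomes; an empty dict means the where-based selection agrees with the in-memory filter for every value.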
