Description
read_sas doesn't work well with chunksize or iterator parameters.
Code Sample and Problem Description
The following test data file in the repository has 32 rows.
sasfile = 'pandas/io/tests/sas/data/airline.sas7bdat'
pd.read_sas(sasfile).shape
Out[18]: (32, 6)
When we carefully read the file with chunksize/iterator, all's well:
reader = pd.read_sas(sasfile, chunksize=16)
df = reader.read()
df.shape
Out[31]: (16, 6)
df = reader.read()
df.shape
Out[33]: (16, 6)
or
reader = pd.read_sas(sasfile, iterator=True)
df = reader.read(30)
df.shape
Out[37]: (30, 6)
df = reader.read(2)
df.shape
Out[39]: (2, 6)
df = reader.read(2)
type(df)
Out[41]: NoneType
But if we don't know the length of the data, a read that asks for more rows than remain raises an exception, so the tail of the file can't be read. This is painful with large files.
reader = pd.read_sas(sasfile, chunksize=20)
df = reader.read()
df.shape
Out[45]: (20, 6)
df = reader.read()
Traceback (most recent call last):
File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-46-c5d811b93ac1>", line 1, in <module>
df = reader.read()
File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
rslt = self._chunk_to_dataframe()
File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
dtype=self.byte_order + 'd')
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
self._set_item(key, value)
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
or
reader = pd.read_sas(sasfile, iterator=True)
reader.read(30).shape
Out[51]: (30, 6)
reader.read(30).shape
Traceback (most recent call last):
File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-52-5d757f713808>", line 1, in <module>
reader.read(30).shape
File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
rslt = self._chunk_to_dataframe()
File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
dtype=self.byte_order + 'd')
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
self._set_item(key, value)
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
value = self._sanitize_column(key, value)
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
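Until this is fixed, one possible workaround is to never ask for more rows than remain, since the examples above suggest that a read ending exactly on the last row succeeds. The sketch below is only an illustration: read_clamped and StubReader are hypothetical names, StubReader merely mimics the behaviour reported here so the snippet runs without a SAS file, and total_rows is assumed to be known to the caller from somewhere else.

```python
class StubReader:
    """Hypothetical stand-in that mimics the behaviour reported in this issue."""

    def __init__(self, nrows):
        self.nrows = nrows
        self.pos = 0

    def read(self, n):
        if self.pos >= self.nrows:
            return None  # exhausted, as in the iterator example above
        if self.pos + n > self.nrows:
            # Request runs past the last row: the failure reported here.
            raise ValueError("Length of values does not match length of index")
        rows = list(range(self.pos, self.pos + n))
        self.pos += n
        return rows


def read_clamped(reader, total_rows, chunksize):
    """Yield chunks from reader, never requesting more rows than remain.

    total_rows is an assumption: the caller has to know it in advance.
    """
    remaining = total_rows
    while remaining > 0:
        n = min(chunksize, remaining)  # shrink the final request
        chunk = reader.read(n)
        if chunk is None:
            break
        yield chunk
        remaining -= n


# 32 rows read in chunks of 20: the last request is clamped to 12 rows.
sizes = [len(c) for c in read_clamped(StubReader(32), total_rows=32, chunksize=20)]
print(sizes)  # [20, 12]
```

With a real reader from pd.read_sas(sasfile, iterator=True) the same loop should apply, provided the row count is available beforehand, which is exactly what we don't have in the general case.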
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: 75b606a
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)i5-2520M_CPU@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.0+112.g75b606a
nose: 1.3.7
pip: 9.0.1
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0