BUG: Unexpected behaviour when attempting to read Stata 104 format dta files #58554

Closed
@cmjcharlton

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# The problem occurs when reading an external file, so unfortunately this
# requires a data file linked to in the issue description
import pandas as pd
df = pd.read_stata("bdendo.dta")

Issue Description

A previously closed issue (#26667) is related to this, but it appears that no documentation or examples of this format were available at the time.

Support for Stata 104-format data files was added in commit 703834c (as long as the data only contained values stored as int, float or byte). I have confirmed that this worked correctly at least up to pandas 0.15.2-1, the version included with Fedora 21.

This support was broken in commit d194844, which split _read_header() into _read_new_header() and _read_old_header(). During that split, the condition if self.format_version > 104: was removed from in front of the line self.time_stamp = self._null_terminate(self.path_or_buf.read(18)), so the time stamp is now read unconditionally.
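To illustrate the guard that was lost, here is a minimal sketch of a version-conditional time stamp read. The helper name read_time_stamp and the sample bytes are hypothetical and not taken from pandas; the only facts carried over from the description above are the 18-byte field width and the > 104 condition.

```python
import io


def read_time_stamp(buf, format_version):
    """Hypothetical helper: read the header time stamp only if the format has one.

    Assumes, as the removed condition implies, that formats up to and
    including 104 carry no time stamp field in the header.
    """
    if format_version > 104:
        raw = buf.read(18)  # fixed-width, null-terminated field
        return raw.split(b"\x00", 1)[0].decode("latin-1")
    return None  # nothing is consumed from the stream


# A 105-style stream consumes 18 bytes; a 104-style stream consumes none.
newer = io.BytesIO(b"12 Apr 2024 10:30\x00tail")
older = io.BytesIO(b"tail")
stamp_newer = read_time_stamp(newer, 105)
stamp_older = read_time_stamp(older, 104)
```

Without the guard, a format 104 reader would consume 18 bytes that belong to whatever follows in the header, which is consistent with the failure shown in the traceback further down.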

Some years later, pull request #34116 changed the _version_error string to remove 104 from the list of supported format versions. However, it did not change the actual check in _read_old_header() that triggers the message, which still accepts 104:

        if self._format_version not in [104, 105, 108, 111, 113, 114, 115]:
            raise ValueError(_version_error.format(version=self._format_version))
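One way to keep a check like this from drifting away from its message is to derive both from a single table. The following is only a sketch under that assumption: the names SUPPORTED_OLD_FORMATS and check_old_format are hypothetical, the message wording is invented for illustration, and none of it is the pandas implementation. The version labels are the ones quoted in the error message in this report.

```python
# Hypothetical single source of truth for old-format versions.
# An empty label means the error message quoted in this report gives none.
SUPPORTED_OLD_FORMATS = {
    105: "",
    108: "",
    111: "Stata 7SE",
    113: "Stata 8/9",
    114: "Stata 10/11",
    115: "Stata 12",
}


def check_old_format(version):
    """Raise ValueError if version is not in the table above."""
    if version not in SUPPORTED_OLD_FORMATS:
        # Build the list in the message from the same table used for the check.
        listed = ", ".join(
            f"{v} ({name})" if name else str(v)
            for v, name in sorted(SUPPORTED_OLD_FORMATS.items())
        )
        raise ValueError(
            f"Version of given Stata file is {version}. "
            f"Supported old-format versions: {listed}."
        )
```

With this arrangement, dropping 104 from the table changes both the check and the message text in one step.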

As a result, attempting to read a file in this format raises a bare ValueError from _get_time_stamp() instead of the helpful unsupported-version message:

>>> pandas.read_stata("bdendo.dta")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1683, in read
    self._ensure_open()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1175, in _ensure_open
    self._open_file()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1205, in _open_file
    self._read_header()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1287, in _read_header
    self._read_old_header(first_char)
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1464, in _read_old_header
    self._time_stamp = self._get_time_stamp()
                       ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1436, in _get_time_stamp
    raise ValueError()
ValueError
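Until this is resolved, a caller could inspect the format version before calling read_stata and report something clearer. The sketch below assumes that old-format files store the version number in their first byte and that newer files start with "<", which matches how _read_header dispatches on first_char in the traceback above. The function name peek_dta_version is hypothetical.

```python
import io


def peek_dta_version(fh):
    """Return the version byte of an old-format .dta stream, or None.

    fh must be a seekable binary stream; its position is restored.
    None is returned for an empty stream or one starting with b"<",
    which is assumed here to mark the newer header layout.
    """
    pos = fh.tell()
    first = fh.read(1)
    fh.seek(pos)
    if not first or first == b"<":
        return None
    return first[0]


# Usage with an in-memory stand-in for a file whose first byte is 104.
sample = io.BytesIO(bytes([104, 2, 1, 0]))
version = peek_dta_version(sample)
if version == 104:
    message = "Stata format 104 detected; read_stata may fail on this file."
```

For a real file, the same function can be applied to the object returned by open(path, "rb").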

The test file used above is \ssa8\dta\bdendo.dta contained in https://web.archive.org/web/19991011204243/http://lib.stat.cmu.edu/stata/STB/stb27v4.zip

Expected Behavior

I would expect either an error message indicating that the format is not supported, i.e.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 2109, in read_stata
    return reader.read()
           ^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1683, in read
    self._ensure_open()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1175, in _ensure_open
    self._open_file()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1205, in _open_file
    self._read_header()
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1287, in _read_header
    self._read_old_header(first_char)
  File "C:\Program Files\Python312\Lib\site-packages\pandas\io\stata.py", line 1453, in _read_old_header
    raise ValueError(_version_error.format(version=self._format_version))
ValueError: Version of given Stata file is 104. pandas supports importing versions 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), 118 (Stata 14/15/16),and 119 (Stata 15/16, over 32,767 variables).

or, if support were restored, for the data to be loaded:

>>> df
     set  fail  gall  hyp   ob  est  dos  dur  non  duration  age  cest  agegrp
0      1     1     0    0  1.0    1    3  4.0    1        96   74   3.0     4.0
1      1     0     0    0  NaN    0    0  0.0    0         0   75   0.0     4.0
2      1     0     0    0  NaN    0    0  0.0    0         0   74   0.0     4.0
3      1     0     0    0  NaN    0    0  0.0    0         0   74   0.0     4.0
4      1     0     0    0  1.0    1    1  3.0    1        48   75   1.0     4.0
..   ...   ...   ...  ...  ...  ...  ...  ...  ...       ...  ...   ...     ...
310   63     1     1    0  1.0    1    3  2.0    1        18   68   3.0     3.0
311   63     0     0    0  1.0    1    2  4.0    1        96   69   2.0     3.0
312   63     0     0    0  NaN    0    0  0.0    0         0   70   0.0     3.0
313   63     0     0    1  1.0    1    2  3.0    1        92   69   2.0     3.0
314   63     0     0    1  0.0    1    3  3.0    1        59   69   3.0     3.0

[315 rows x 13 columns]

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : 2.10.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Labels: Bug, Needs Triage
