Skip to content

ENH: Improve OverflowError message when pandas.read_stata fails to convert dates #36096

Open
@miker985

Description

@miker985

Is your feature request related to a problem?

Calling pd.read_stata on a DTA file with a bad date format code leads to a minimally useful error message. The error is raised from this line.

What I see:

>>> import pandas as pd
>>> pd.read_stata("date_overflow.DTA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1933, in read_stata
    data = reader.read()
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1682, in read
    data[col], self.fmtlist[i]
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 333, in _stata_elapsed_date_to_datetime_vec
    conv_dates = convert_delta_safe(base, days, "d")
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in convert_delta_safe
    values = [base + relativedelta(days=int(d)) for d in deltas]
  File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in <listcomp>
    values = [base + relativedelta(days=int(d)) for d in deltas]
  File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 405, in __radd__
    return self.__add__(other)
  File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 392, in __add__
    microseconds=self.microseconds))
OverflowError: date value out of range

Describe the solution you'd like

I'd like a clearer message indicating the problem column. It would be easy to catch the OverflowError in the loop doing the date conversion and include the column name.

Desired detailed error: OverflowError: Failed converting column tiempo_gen for format %dD_m_Y: date value out of range
Alternative error: OverflowError: date value out of range for column tiempo_gen
date_overflow.DTA.gz

I would be happy to submit a patch for this if the solution is approved.

API breaking implications

I don't believe this would break or alter any APIs

Describe alternatives you've considered

  • Add a flag skip_invalid_dates which changes behavior to skip converting dates IFF they would error. Instead of raising an exception it would be caught and a warning emitted along the lines of "date value out of range for column tiempo_gen - skipping date conversion". This would allow users access to as much converted data as possible. This flag would default to False and be backwards compatible with the APIs.

    • I would be happy to submit a patch for this if approved
  • Refactor the pandas.io.stata API to allow for manual date conversion without using private functions.

Additional context

Output of pd.show_versions()

/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/setuptools/distutils_patch.py:26: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first. "Distutils was imported before Setuptools. This usage is discouraged "

INSTALLED VERSIONS

commit : 76f74d5
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-112-generic
Version : #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+230.g76f74d53b.dirty
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

This was the test I put together to hammer out all the details so I knew what was happening with the file

tl;dr

  • The column contains values in the millions
  • The date units are days, meaning the dates occur after the year 56,000
import pandas as pd


def test_loading_bad_format():
    """
    Demonstrate OverflowError is raised with pd.read_stata
    """
    dta = "date_overflow.DTA"

    try:
        pd.read_stata(dta)
    except OverflowError as e:
        assert len(e.args) == 1
        # this message is minimally useful.
        # it should include at minimum the column causing the error
        assert e.args[0] == "date value out of range"

    # demonstrate cause
    reader = pd.io.stata.StataReader(dta, convert_dates=False)
    # date conversion is the cause here
    df = reader.read()
    # for one column
    assert len(df.columns) == 1 and 'tiempo_gen' in df
    min_val = df['tiempo_gen'].min()
    # with very large values ...
    assert min_val > 20_588_866
    assert min_val < 20_588_867
    # ... and a format indicating the units are days
    # (20.5 million days is ~56,000 years)
    assert len(reader.fmtlist) == 1 and reader.fmtlist[0] == '%dD_m_Y'

Example file - a subset of a real life file I ran into this issue with: date_overflow.DTA.gz

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions