Description
Is your feature request related to a problem?
Calling pd.read_stata
on a DTA file with a bad date format code leads to a minimally useful error message. The error is raised from this line.
What I see:
>>> import pandas as pd
>>> pd.read_stata("date_overflow.DTA")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1933, in read_stata
data = reader.read()
File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 1682, in read
data[col], self.fmtlist[i]
File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 333, in _stata_elapsed_date_to_datetime_vec
conv_dates = convert_delta_safe(base, days, "d")
File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in convert_delta_safe
values = [base + relativedelta(days=int(d)) for d in deltas]
File "/home/miker985/lib-code/pandas/pandas/io/stata.py", line 294, in <listcomp>
values = [base + relativedelta(days=int(d)) for d in deltas]
File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 405, in __radd__
return self.__add__(other)
File "/home/miker985/.virtualenvs/tmp-3c950d2b6b7b8c2/lib/python3.7/site-packages/dateutil/relativedelta.py", line 392, in __add__
microseconds=self.microseconds))
OverflowError: date value out of range
Describe the solution you'd like
I'd like a clearer message indicating the problem column. It would be easy to catch the OverflowError
in the loop doing the date conversion and include the column name.
Desired detailed error: OverflowError: Failed converting column tiempo_gen for format %dD_m_Y: date value out of range
Alternative error: OverflowError: date value out of range for column tiempo_gen
date_overflow.DTA.gz
I would be happy to submit a patch for this if the solution is approved.
API breaking implications
I don't believe this would break or alter any APIs
Describe alternatives you've considered
-
Add a flag
skip_invalid_dates
which changes behavior to skip converting dates IFF they would error. Instead of raising an exception it would be caught and a warning emitted along the lines of"date value out of range for column tiempo_gen - skipping date conversion"
. This would allow users access to as much converted data as possible. This flag would default toFalse
and be backwards compatible with the APIs.- I would be happy to submit a patch for this if approved
-
Refactor the
pandas.io.stata
API to allow for manual date conversion without using private functions.
Additional context
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 76f74d5
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-112-generic
Version : #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0.dev0+230.g76f74d53b.dirty
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
This was the test I put together to hammer out all the details so I knew what was happening with the file
tl;dr
- The column contains values in the millions
- The date units are days, meaning the dates occur after the year 56,000
import pandas as pd
def test_loading_bad_format():
"""
Demonstrate OverflowError is raised with pd.read_stata
"""
dta = "date_overflow.DTA"
try:
pd.read_stata(dta)
except OverflowError as e:
assert len(e.args) == 1
# this message is minimally useful.
# it should include at minimum the column causing the error
assert e.args[0] == "date value out of range"
# demonstrate cause
reader = pd.io.stata.StataReader(dta, convert_dates=False)
# date conversion is the cause here
df = reader.read()
# for one column
assert len(df.columns) == 1 and 'tiempo_gen' in df
min_val = df['tiempo_gen'].min()
# with very large values ...
assert min_val > 20_588_866
assert min_val < 20_588_867
# ... and a format indicating the units are days
# (20.5 million days is ~56,000 years)
assert len(reader.fmtlist) == 1 and reader.fmtlist[0] == '%dD_m_Y'
Example file - a subset of a real life file I ran into this issue with: date_overflow.DTA.gz