TLDR: the conversation here goes on for a bit, but to summarise, the suggestion is:
- deprecate `date_parser`, because it always hurts performance (counter-examples welcome!)
- add `date_format`, because that can boost performance
- for anything else, amend the docs to make clear that users should first read in the data as `object` and then apply their parsing (see the sketch just after this list)
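A minimal sketch of that recommended pattern, assuming a known fixed format (the column name and inline CSV here are purely illustrative):

```python
import io

import pandas as pd

# tiny stand-in for a real file; format is year, day, month
data = "ts_col\n2000-31-01 00:00:00\n2000-01-02 00:00:00\n"
fmt = "%Y-%d-%m %H:%M:%S"

# read the column in as plain object/strings first...
df = pd.read_csv(io.StringIO(data))

# ...then convert explicitly; passing format= keeps this fast, and any
# dayfirst/errors handling can be added here as needed
df["ts_col"] = pd.to_datetime(df["ts_col"], format=fmt)
```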
Performance-wise, this would only be an improvement over the status quo.
As far as I can tell, `date_parser` is a net negative and only ever slows things down. In the best case, it only results in a slight degradation:
import io
import pandas as pd

timestamp_format = '%Y-%d-%m %H:%M:%S'
date_index = pd.date_range(start='1900', end='2000')
dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
data = dates_df.to_csv()
In [6]: %%timeit
...: df = pd.read_csv(io.StringIO(data),
...: date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
...: parse_dates=['ts_col']
...: )
...:
...:
111 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %%timeit
...: df = pd.read_csv(io.StringIO(data),
...: #date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
...: #parse_dates=['ts_col']
...: )
...: df['ts_col'] = pd.to_datetime(df['ts_col'], format=timestamp_format)
...:
...:
75.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Parsing element-by-element is also slower than just using `.apply` (here, `du_parse` refers to `dateutil.parser.parse`):
In [21]: %%timeit
...: df = pd.read_csv(io.StringIO(data),
...: #date_parser=lambda x: pd.to_datetime(x, format=timestamp_format),
...: #parse_dates=['ts_col']
...: )
...: df['ts_col'].apply(lambda x: du_parse(x, dayfirst=True))
...:
...:
1.13 s ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [22]: %%timeit
...: df = pd.read_csv(io.StringIO(data),
...: date_parser=lambda x: du_parse(x, dayfirst=True),
...: parse_dates=['ts_col']
...: )
...:
...:
1.19 s ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the worst case, it results in a 65x performance degradation, see #50586 (comment) (and this gets way worse for larger datasets).
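To give a sense of where that worst case comes from, here's a rough, hedged sketch of the kind of fallback logic involved (an illustration, not pandas' actual converter): the user-supplied callable is first tried on the whole array of strings, and if that call raises (as it does for an element-wise parser like `dateutil.parser.parse`), parsing falls back to one Python-level call per row, which is where the big slowdowns come from.

```python
import numpy as np

def convert_with_date_parser(values, date_parser):
    """Hedged illustration only - not pandas' real code path."""
    try:
        # fast path: one vectorised call over the whole array of strings
        return date_parser(values)
    except Exception:
        # slow path: the callable only accepts single strings, so fall
        # back to one Python-level call per element
        return np.array([date_parser(v) for v in values], dtype=object)
```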
My suggestion is:
- deprecate `date_parser`
- introduce `date_format`, which would actually deliver a performance improvement:
In [1]: timestamp_format = '%Y-%d-%m %H:%M:%S'
...:
...: date_index = pd.date_range(start='1900', end='2000')
...:
...: dates_df = date_index.strftime(timestamp_format).to_frame(name='ts_col')
...: data = dates_df.to_csv()
In [2]: %%timeit
...: df = pd.read_csv(io.StringIO(data),
...: date_format=timestamp_format,
...: parse_dates=['ts_col']
...: )
...:
...:
19.5 ms ± 397 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That's over 3 times as fast as the next-best option above (19.5 ms vs 75.8 ms for reading in as `object` and converting afterwards)!
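And for users whose dates don't follow a single fixed format (so `date_format` wouldn't apply), here's a hedged sketch of what migrating away from `date_parser=lambda x: du_parse(x, dayfirst=True)` could look like using only existing `pandas.to_datetime` options; the inline CSV is illustrative:

```python
import io

import pandas as pd

data = "ts_col\n13/01/2000\n02/03/2000\n"  # day-first strings

# read the column as plain object/strings, no date_parser involved...
df = pd.read_csv(io.StringIO(data))

# ...then let to_datetime handle dayfirst itself after the read
df["ts_col"] = pd.to_datetime(df["ts_col"], dayfirst=True)
```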