Description
Code Sample
import pandas as pd
import concurrent.futures as futs
from pathlib import Path
paths = Path('data').rglob('*.csv')
with futs.ProcessPoolExecutor() as e:
    # Map each future back to the path it was submitted for
    futures = {e.submit(pd.read_csv, p, index_col='id'): p for p in paths}
    done, _ = futs.wait(futures)
    for d in done:
        r = d.result()
Problem description
Before v0.23, pandas could read CSVs very efficiently by dispatching pd.read_csv to concurrent.futures.ProcessPoolExecutor,
which could cut the time to load a large number of CSV files from tens of minutes to a few minutes.
This no longer appears to be possible because the parser functions are no longer picklable. Is there a reason for this change?
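For anyone hitting the same error, here is a minimal sketch of a possible workaround, not a confirmed fix. It assumes the failure comes only from pickling pd.read_csv itself, so the call is wrapped in a module-level function that pickle can reference by name. The names read_one and load_all and the executor_cls parameter are hypothetical and introduced only for illustration.

```python
import concurrent.futures as futs
from pathlib import Path

import pandas as pd


def read_one(path, index_col="id"):
    # Module-level wrapper (hypothetical helper): it is pickled by name,
    # so pd.read_csv itself never has to be pickled.
    return pd.read_csv(path, index_col=index_col)


def load_all(paths, executor_cls=futs.ProcessPoolExecutor):
    # Submit one task per path and collect the results as {path: DataFrame}.
    # executor_cls is an assumption of this sketch; it lets a thread pool
    # be swapped in for comparison.
    with executor_cls() as e:
        futures = {e.submit(read_one, p): p for p in paths}
        return {futures[f]: f.result() for f in futs.as_completed(futures)}
```

When running this as a script, call load_all(Path('data').rglob('*.csv')) under an if __name__ == "__main__": guard, since platforms that spawn worker processes re-import the main module.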
Expected Output
No error. Instead, the code raises:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object '_make_parser_function.<locals>.parser_f'
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.7.0
pip: 18.0
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None