ENH: add arrow engine to read_csv #31817
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -158,9 +158,11 @@ dtype : Type name or dict of column -> type, default ``None`` | |
(unsupported with ``engine='python'``). Use ``str`` or ``object`` together | ||
with suitable ``na_values`` settings to preserve and | ||
not interpret dtype. | ||
engine : {``'c'``, ``'python'``} | ||
Parser engine to use. The C engine is faster while the Python engine is | ||
currently more feature-complete. | ||
engine : {``'c'``, ``'pyarrow'``, ``'python'``} | ||
Parser engine to use. In terms of performance, the pyarrow engine, | ||
which requires ``pyarrow`` >= 0.15.0, is faster than the C engine, which | ||
is faster than the python engine. However, the pyarrow and C engines | ||
Review comment: add a ``versionchanged`` 1.2 tag here |
||
are currently less feature complete than their Python counterpart. | ||
converters : dict, default ``None`` | ||
Dict of functions for converting values in certain columns. Keys can either be | ||
integers or column labels. | ||
|
@@ -1600,11 +1602,18 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object: | |
Specifying the parser engine | ||
'''''''''''''''''''''''''''' | ||
|
||
Under the hood pandas uses a fast and efficient parser implemented in C as well | ||
as a Python implementation which is currently more feature-complete. Where | ||
possible pandas uses the C parser (specified as ``engine='c'``), but may fall | ||
back to Python if C-unsupported options are specified. Currently, C-unsupported | ||
options include: | ||
Currently, pandas supports using three engines, the C engine, the python engine, | ||
Review comment: add a ``versionchanged`` 1.2 tag here |
||
and an optional pyarrow engine (requires ``pyarrow`` >= 0.15). In terms of performance | ||
the pyarrow engine is fastest, followed by the C and Python engines. However, | ||
the pyarrow engine is much less robust than the C engine, which in turn lacks a | ||
couple of features present in the Python parser. | ||
|
||
Where possible pandas uses the C parser (specified as ``engine='c'``), but may fall | ||
Review comment: we might want to refactor this entire section to provide a more table-like comparison of all of the parsers, if you'd create an issue for this |
||
back to Python if C-unsupported options are specified. If pyarrow unsupported options are | ||
specified while using ``engine='pyarrow'``, the parser will error out | ||
(a full list of unsupported options is available at ``pandas.io.parsers._pyarrow_unsupported``). | ||
|
||
Currently, C-unsupported options include: | ||
|
||
* ``sep`` other than a single character (e.g. regex separators) | ||
* ``skipfooter`` | ||
|
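As a rough illustration of the engine choice described in this hunk, the sketch below tries ``engine='pyarrow'`` and falls back to the C engine when pyarrow is not installed. It assumes a pandas release that ships the pyarrow engine; the helper name `read_csv_prefer_pyarrow` and the sample data are hypothetical and not part of this PR. The second call shows a multi-character separator, one of the C-unsupported options listed above, passed to the Python engine explicitly.

```python
import io

import pandas as pd


def read_csv_prefer_pyarrow(raw: bytes) -> pd.DataFrame:
    """Hypothetical helper: prefer the pyarrow engine, else use the C engine."""
    try:
        # Raises ImportError when pyarrow is missing or too old.
        return pd.read_csv(io.BytesIO(raw), engine="pyarrow")
    except ImportError:
        return pd.read_csv(io.BytesIO(raw), engine="c")


# Plain comma-separated input works with any of the three engines.
df = read_csv_prefer_pyarrow(b"a,b\n1,2\n3,4\n")

# A separator longer than one character is treated as a regex, so the
# Python engine is requested explicitly here.
df_regex = pd.read_csv(io.StringIO("a;;b\n1;;2\n"), sep=";;", engine="python")
```

Requesting the Python engine explicitly keeps the choice visible in the calling code instead of relying on an implicit fallback.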
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,8 @@ | ||
import distutils.version | ||
import importlib | ||
import sys | ||
import types | ||
from typing import Optional | ||
import warnings | ||
|
||
# Update install.rst when updating versions! | ||
|
@@ -43,6 +45,7 @@ | |
"pandas_gbq": "pandas-gbq", | ||
"sqlalchemy": "SQLAlchemy", | ||
"jinja2": "Jinja2", | ||
"pyarrow.csv": "pyarrow", | ||
} | ||
|
||
|
||
|
@@ -58,7 +61,11 @@ def _get_version(module: types.ModuleType) -> str: | |
|
||
|
||
def import_optional_dependency( | ||
name: str, extra: str = "", raise_on_missing: bool = True, on_version: str = "raise" | ||
name: str, | ||
extra: str = "", | ||
raise_on_missing: bool = True, | ||
on_version: str = "raise", | ||
min_version: Optional[str] = None, | ||
): | ||
""" | ||
Import an optional dependency. | ||
|
@@ -70,8 +77,7 @@ def import_optional_dependency( | |
Parameters | ||
---------- | ||
name : str | ||
The module name. This should be top-level only, so that the | ||
version may be checked. | ||
The module name. | ||
extra : str | ||
Additional text to include in the ImportError message. | ||
raise_on_missing : bool, default True | ||
|
@@ -85,6 +91,8 @@ def import_optional_dependency( | |
* ignore: Return the module, even if the version is too old. | ||
It's expected that users validate the version locally when | ||
using ``on_version="ignore"`` (see. ``io/html.py``) | ||
min_version : str, optional | ||
Minimum required version. If given, it takes precedence over the entry in ``VERSIONS``. | ||
|
||
Returns | ||
------- | ||
|
@@ -109,10 +117,16 @@ def import_optional_dependency( | |
raise ImportError(msg) from None | ||
else: | ||
return None | ||
|
||
minimum_version = VERSIONS.get(name) | ||
# Handle submodules: if we have submodule, grab parent module from sys.modules | ||
Review comment: why is all this needed?
Reply: This has been answered before: #31817 (comment) (and the above comment has been added based on your comment). It's to 1) import a submodule ( Now I suppose that the submodule importing is not necessarily needed. Right now this PR does:
but I suppose this could also be:
And then this additional code to directly import a submodule with
Reply: @jorisvandenbossche importing as a submodule is required, you can't access the csv module by doing |
||
parent = name.split(".")[0] | ||
if parent != name: | ||
install_name = parent | ||
module_to_get = sys.modules[install_name] | ||
else: | ||
module_to_get = module | ||
minimum_version = min_version if min_version is not None else VERSIONS.get(name) | ||
if minimum_version: | ||
version = _get_version(module) | ||
version = _get_version(module_to_get) | ||
if distutils.version.LooseVersion(version) < minimum_version: | ||
assert on_version in {"warn", "raise", "ignore"} | ||
msg = ( | ||
|