ENH: Allow callable for on_bad_lines in read_csv when engine="python" #45146

Merged: 24 commits, Jan 8, 2022
Changes from 7 commits
Commits
d4c0cb7  Add doc and validation (mroeschke, Dec 31, 2021)
f654e39  Add whatsnew, testing, and docs (mroeschke, Dec 31, 2021)
6c12102  Fix whatsnew formatting (mroeschke, Dec 31, 2021)
1aee16c  Update doc (mroeschke, Dec 31, 2021)
103ae04  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Dec 31, 2021)
4a853f9  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 1, 2022)
d759a88  fix docstring validation (mroeschke, Jan 1, 2022)
dbf13e7  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 2, 2022)
9b73ae4  Test is callable returns a row longer than expected length (mroeschke, Jan 2, 2022)
15752be  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 3, 2022)
b77da02  Address comments (mroeschke, Jan 3, 2022)
39a83b4  Allow callable behavior returning None (mroeschke, Jan 3, 2022)
a5f3656  Add test for index_col inferred (mroeschke, Jan 3, 2022)
ae4d499  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 3, 2022)
d3f9c40  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 4, 2022)
8886bf8  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 5, 2022)
e3b445d  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 5, 2022)
013f05f  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 5, 2022)
67b7e3e  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 6, 2022)
743b83b  improve docs (mroeschke, Jan 6, 2022)
bd67152  type (mroeschke, Jan 6, 2022)
e04124a  Revert "improve docs" (mroeschke, Jan 6, 2022)
6a92f07  Merge remote-tracking branch 'upstream/master' into enh/on_bad_lines_… (mroeschke, Jan 6, 2022)
4817770  Add example of writing to an external list (mroeschke, Jan 6, 2022)

23 changes: 19 additions & 4 deletions doc/source/user_guide/io.rst
@@ -1305,14 +1305,29 @@ You can elect to skip bad lines:
     0  1  2   3
     1  8  9  10

+Or pass a callable function to handle the bad line if ``engine="python"``.
+The bad line will be a list of strings that was split by the ``sep``:
+
+.. code-block:: ipython
+
+    In [30]: pd.read_csv(StringIO(data), on_bad_lines=lambda x: x[-3:], engine="python")
+    Out[30]:
+       a  b   c
+    0  1  2   3
+    1  5  6   7
+    2  8  9  10
+
+.. versionadded:: 1.4.0
+
+
 You can also use the ``usecols`` parameter to eliminate extraneous column
 data that appear in some lines but not others:

 .. code-block:: ipython

-    In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
+    In [31]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])

-    Out[30]:
+    Out[31]:
        a  b  c
     0  1  2  3
     1  4  5  6
@@ -1324,9 +1339,9 @@ fields are filled with ``NaN``.

 .. code-block:: ipython

-    In [31]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])
+    In [32]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])

-    Out[31]:
+    Out[32]:
        a  b  c    d
     0  1  2  3  NaN
     1  4  5  6    7
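
The user-guide addition above passes an inline `lambda x: x[-3:]`. As a minimal sketch of the same idea with a named handler: `keep_last_three` and `read_with_handler` are hypothetical names introduced here, the sample text is assumed to match the guide's `data` variable, and the pandas call assumes a release that includes this change (1.4.0 or later).

```python
from io import StringIO

# Assumed to match the user guide's sample: the second data row has an extra field.
DATA = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"


def keep_last_three(bad_line):
    """Return the last three fields of a row that was already split on ``sep``."""
    return bad_line[-3:]


def read_with_handler(text=DATA):
    # pandas is imported inside the function so the handler above can be
    # exercised on its own; assumes pandas >= 1.4.
    import pandas as pd

    return pd.read_csv(StringIO(text), on_bad_lines=keep_last_three, engine="python")


# The handler only ever sees the over-long row, as a list of strings.
print(keep_last_three(["4", "5", "6", "7"]))  # ['5', '6', '7']
```

A named function is easier to unit-test than a lambda, since it can be called directly on a list of strings without building a CSV buffer.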
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -207,6 +207,7 @@ Other enhancements
 - :meth:`Series.str.split` now supports a ``regex`` argument that explicitly specifies whether the pattern is a regular expression. Default is ``None`` (:issue:`43563`, :issue:`32835`, :issue:`25549`)
 - :meth:`DataFrame.dropna` now accepts a single label as ``subset`` along with array-like (:issue:`41021`)
 - Added :meth:`DataFrameGroupBy.value_counts` (:issue:`43564`)
+- :meth:`read_csv` now accepts a ``callable`` function in ``on_bad_lines`` when ``engine="python"`` for custom handling of bad lines (:issue:`5686`)
 - :class:`ExcelWriter` argument ``if_sheet_exists="overlay"`` option added (:issue:`40231`)
 - :meth:`read_excel` now accepts a ``decimal`` argument that allow the user to specify the decimal point when parsing string columns to numeric (:issue:`14403`)
 - :meth:`.GroupBy.mean`, :meth:`.GroupBy.std`, :meth:`.GroupBy.var`, :meth:`.GroupBy.sum` now supports `Numba <http://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`43731`, :issue:`44862`, :issue:`44939`)
5 changes: 4 additions & 1 deletion pandas/io/parsers/python_parser.py
@@ -197,6 +197,7 @@ class MyDialect(csv.Dialect):
                 skipinitialspace = self.skipinitialspace
                 quoting = self.quoting
                 lineterminator = "\n"
+                strict = not callable(self.on_bad_lines)

             dia = MyDialect
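
The added `strict` attribute is the standard-library `csv.Dialect.strict` flag. The following is a small sketch of what that flag does in the `csv` module on its own; the sample string is made up for illustration, and the snippet says nothing about how pandas' reader behaves beyond setting the flag.

```python
import csv

# Made-up sample: a stray character follows the closing quote of the second field.
line = 'a,"b"c'

# Default (strict=False): the reader tolerates the stray character.
lenient = next(csv.reader([line]))

# strict=True: the same input raises csv.Error.
try:
    next(csv.reader([line], strict=True))
    strict_failed = False
except csv.Error:
    strict_failed = True

print(lenient, strict_failed)  # ['a', 'bc'] True
```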

@@ -990,7 +991,9 @@ def _rows_to_cols(self, content: list[list[Scalar]]) -> list[np.ndarray]:
                 actual_len = len(l)

                 if actual_len > col_len:
-                    if (
+                    if callable(self.on_bad_lines):
+                        content.append(self.on_bad_lines(l))
+                    elif (
                         self.on_bad_lines == self.BadLineHandleMethod.ERROR
                         or self.on_bad_lines == self.BadLineHandleMethod.WARN
                     ):
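
The hunk above routes an over-long row through the callable before falling back to the existing error/warn handling. A standalone sketch of that dispatch, under stated assumptions: `filter_rows` is a hypothetical helper, the `BadLineHandleMethod` enum here is a stand-in with illustrative values, and because the hunk does not show what the error/warn branch does, the sketch simply records the row index and length as a placeholder.

```python
from enum import Enum


class BadLineHandleMethod(Enum):
    # Stand-in for ParserBase.BadLineHandleMethod; the values are illustrative.
    ERROR = 0
    WARN = 1
    SKIP = 2


def filter_rows(rows, col_len, on_bad_lines):
    """Keep rows that fit; route over-long rows through ``on_bad_lines``."""
    kept, bad = [], []
    for i, row in enumerate(rows):
        if len(row) <= col_len:
            kept.append(row)
        elif callable(on_bad_lines):
            # Mirrors the added branch: the handler's return value replaces the row.
            kept.append(on_bad_lines(row))
        elif on_bad_lines in (BadLineHandleMethod.ERROR, BadLineHandleMethod.WARN):
            # Placeholder (assumption): remember the row for a later error or warning.
            bad.append((i, len(row)))
        # Anything else (SKIP): the row is dropped silently.
    return kept, bad


rows = [["1", "2"], ["2", "3", "4", "5", "6"], ["3", "4"]]
print(filter_rows(rows, 2, lambda row: row[:2]))
```

Checking `callable(...)` first keeps the enum comparisons untouched for the existing string options.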
19 changes: 16 additions & 3 deletions pandas/io/parsers/readers.py
@@ -9,6 +9,7 @@
 from textwrap import fill
 from typing import (
     Any,
+    Callable,
     NamedTuple,
 )
 import warnings
@@ -354,7 +355,7 @@
     .. deprecated:: 1.3.0
        The ``on_bad_lines`` parameter should be used instead to specify behavior upon
        encountering a bad line instead.
-on_bad_lines : {{'error', 'warn', 'skip'}}, default 'error'
+on_bad_lines : str or callable, default 'error'
     Specifies what to do upon encountering a bad line (a line with too many fields).
     Allowed values are :

@@ -364,6 +365,12 @@

     .. versionadded:: 1.3.0

+        - callable, function with signature ``(bad_line: list[str]) -> list[str]``
+          that will process a single bad line. ``bad_line`` is a list of strings

Contributor:
Am I right in thinking the output list[str] must be a certain length? If the output were the same as the input, for example, what would happen? I checked the tests, but they seem to cover only valid function cases.

Member Author:
read_csv has a precedent of throwing a ParserWarning if a row has more elements than expected and then continuing to parse (it seems to drop the extra elements), so I think if the callable does something similar it should also throw a ParserWarning.

Added a test to check this behavior.

Member:
Technically it can return a list of Hashables; this should not be an issue.

We should document that the fallback behavior is a warning.

+          split by the ``sep``. Only supported when ``engine="python"``
+
+    .. versionadded:: 1.4.0
+
 delim_whitespace : bool, default False
     Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
     used as the sep. Equivalent to setting ``sep='\\s+'``. If this option
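
The review thread on the callable entry asks what happens when the handler returns a row that is still too long, and the answer given is that a `ParserWarning` is the fallback. One way to observe this is sketched below; `identity_handler` and `read_and_collect_warnings` are hypothetical names, the pandas call assumes a release that includes this change (1.4.0 or later), and the exact warning text is not assumed.

```python
import warnings
from io import StringIO


def identity_handler(bad_line):
    # Returns the row unchanged, so it is still longer than the header.
    return bad_line


def read_and_collect_warnings(text, handler):
    """Parse ``text`` and return the frame plus the names of warnings raised."""
    # Imported lazily so the handler can be exercised without pandas installed;
    # assumes pandas >= 1.4.
    import pandas as pd

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame = pd.read_csv(StringIO(text), on_bad_lines=handler, engine="python")
    return frame, [w.category.__name__ for w in caught]
```

Calling `read_and_collect_warnings("a,b\n1,2\n2,3,4,5,6\n3,4", identity_handler)` and inspecting the second return value shows whether a `ParserWarning` was emitted in the installed version.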
@@ -1367,7 +1374,7 @@ def _refine_defaults_read(
     sep: str | object,
     error_bad_lines: bool | None,
     warn_bad_lines: bool | None,
-    on_bad_lines: str | None,
+    on_bad_lines: str | Callable | None,
     names: ArrayLike | None | object,
     prefix: str | None | object,
     defaults: dict[str, Any],
@@ -1399,7 +1406,7 @@ def _refine_defaults_read(
         Whether to error on a bad line or not.
     warn_bad_lines : str or None
         Whether to warn on a bad line or not.
-    on_bad_lines : str or None
+    on_bad_lines : str, callable or None
         An option for handling bad lines or a sentinel value(None).
     names : array-like, optional
         List of column names to use. If the file contains a header row,
@@ -1503,6 +1510,12 @@ def _refine_defaults_read(
             kwds["on_bad_lines"] = ParserBase.BadLineHandleMethod.WARN
         elif on_bad_lines == "skip":
             kwds["on_bad_lines"] = ParserBase.BadLineHandleMethod.SKIP
+        elif callable(on_bad_lines):
+            if engine != "python":
+                raise ValueError(
+                    "on_bad_line can only be a callable function if engine='python'"
+                )
+            kwds["on_bad_lines"] = on_bad_lines
         else:
             raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
     else:
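
The validation added to `_refine_defaults_read` can be read as a small mapping step: known strings become enum members, a callable is passed through only for the Python engine, and anything else is rejected. A self-contained sketch of that logic follows; `resolve_on_bad_lines` is a hypothetical helper, the enum is a stand-in with illustrative values, and a dictionary lookup replaces the `elif` chain used in the diff.

```python
from enum import Enum


class BadLineHandleMethod(Enum):
    # Stand-in for ParserBase.BadLineHandleMethod; the values are illustrative.
    ERROR = 0
    WARN = 1
    SKIP = 2


def resolve_on_bad_lines(on_bad_lines, engine):
    """Map a user-supplied ``on_bad_lines`` to the value the parser would store."""
    named = {
        "error": BadLineHandleMethod.ERROR,
        "warn": BadLineHandleMethod.WARN,
        "skip": BadLineHandleMethod.SKIP,
    }
    if callable(on_bad_lines):
        if engine != "python":
            # Message copied from the diff above.
            raise ValueError(
                "on_bad_line can only be a callable function if engine='python'"
            )
        return on_bad_lines
    if isinstance(on_bad_lines, str) and on_bad_lines in named:
        return named[on_bad_lines]
    raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")


print(resolve_on_bad_lines("skip", "c"))  # BadLineHandleMethod.SKIP
```

Strings are never callable, so checking the callable case first gives the same result as the order used in the diff.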
59 changes: 59 additions & 0 deletions pandas/tests/io/parser/test_python_parser_only.py
@@ -329,3 +329,62 @@ def readline(self):
             return self.data

     parser.read_csv(NoNextBuffer("a\n1"))
+
+
+@pytest.mark.parametrize("bad_line_func", [lambda x: ["2", "3"], lambda x: x[:2]])
+def test_on_bad_lines_callable(python_parser_only, bad_line_func):
+    # GH 5686
+    parser = python_parser_only
+    bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+    result = parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
+    expected = DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
+    tm.assert_frame_equal(result, expected)
+
+
+def test_on_bad_lines_callable_write_to_external_list(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+    lst = []
+
+    def bad_line_func(bad_line):
+        lst.append(bad_line)
+        return ["2", "3"]
+
+    result = parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
+    expected = DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
+    tm.assert_frame_equal(result, expected)
+    assert lst == [["2", "3", "4", "5", "6"]]
+
+
+@pytest.mark.parametrize("bad_line_func", [lambda x: ["foo", "bar"], lambda x: x[:2]])
+@pytest.mark.parametrize("sep", [",", "111"])
+def test_on_bad_lines_callable_iterator_true(python_parser_only, bad_line_func, sep):
+    # GH 5686
+    # iterator=True has a separate code path than iterator=False
+    parser = python_parser_only
+    bad_sio = StringIO(f"0{sep}1\nhi{sep}there\nfoo{sep}bar{sep}baz\ngood{sep}bye")
+    result_iter = parser.read_csv(
+        bad_sio, on_bad_lines=bad_line_func, chunksize=1, iterator=True, sep=sep
+    )
+    expecteds = [
+        {"0": "hi", "1": "there"},
+        {"0": "foo", "1": "bar"},
+        {"0": "good", "1": "bye"},
+    ]
+    for i, (result, expected) in enumerate(zip(result_iter, expecteds)):
+        expected = DataFrame(expected, index=range(i, i + 1))
+        tm.assert_frame_equal(result, expected)
+
+
+def test_on_bad_lines_callable_dont_swallow_errors(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    bad_sio = StringIO("a,b\n1,2\n2,3,4,5,6\n3,4")
+    msg = "This function is buggy."
+
+    def bad_line_func(bad_line):
+        raise ValueError(msg)
+
+    with pytest.raises(ValueError, match=msg):
+        parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
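
The `write_to_external_list` test above shows a handler that records each bad row in a list defined outside the function. A reusable variant of that pattern is sketched here; `make_recording_handler` is a hypothetical helper, and truncating to `expected_len` fields is an illustrative repair strategy rather than anything the PR prescribes.

```python
def make_recording_handler(expected_len):
    """Build a handler that logs each bad row and truncates it to fit."""
    seen = []

    def handler(bad_line):
        seen.append(list(bad_line))  # keep a copy for later inspection
        return bad_line[:expected_len]

    return handler, seen


handler, seen = make_recording_handler(expected_len=2)
fixed = handler(["2", "3", "4", "5", "6"])
print(fixed, seen)  # ['2', '3'] [['2', '3', '4', '5', '6']]
```

The returned `handler` can be passed as `on_bad_lines` together with `engine="python"`, and `seen` can be inspected after parsing to report which rows were repaired.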
12 changes: 12 additions & 0 deletions pandas/tests/io/parser/test_unsupported.py
@@ -149,3 +149,15 @@ def test_pyarrow_engine(self):
                 kwargs[default] = "warn"
             with pytest.raises(ValueError, match=msg):
                 read_csv(StringIO(data), engine="pyarrow", **kwargs)
+
+    def test_on_bad_lines_callable_python_only(self, all_parsers):
+        # GH 5686
+        sio = StringIO("a,b\n1,2")
+        bad_lines_func = lambda x: x
+        parser = all_parsers
+        if all_parsers.engine != "python":
+            msg = "on_bad_line can only be a callable function if engine='python'"
+            with pytest.raises(ValueError, match=msg):
+                parser.read_csv(sio, on_bad_lines=bad_lines_func)
+        else:
+            parser.read_csv(sio, on_bad_lines=bad_lines_func)