BUG: provide chunks with progressively numbered (default) indices #12289

Closed
wants to merge 1 commit into from
34 changes: 34 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -596,6 +596,40 @@ New Behavior:
    idx1.difference(idx2)
    idx1.symmetric_difference(idx2)

+.. _whatsnew_0190.api.autogenerated_chunksize_index:
+
+:func:`read_csv` called with ``chunksize`` parameter generates correct index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When :func:`read_csv` is called with ``chunksize=n`` and without specifying an index,
+each chunk used to have an independently generated index from ``0`` to ``n-1``.
+Chunks are now given a progressive index instead, starting from ``0`` for the first chunk,
+from ``n`` for the second, and so on, so that, when concatenated, they are identical to
+the result of calling :func:`read_csv` without the ``chunksize=`` argument.
+(:issue:`12185`)
+
+.. ipython:: python
+
+   data = 'A,B\n0,1\n2,3\n4,5\n6,7'
+
+Previous behaviour:
+
+.. code-block:: ipython
+
+   In [2]: pd.concat(pd.read_csv(StringIO(data), chunksize=2))
+   Out[2]:
+      A  B
+   0  0  1
+   1  2  3
+   0  4  5
+   1  6  7
+
+New behaviour:
+
+.. ipython:: python
+
+   pd.concat(pd.read_csv(StringIO(data), chunksize=2))
+
 .. _whatsnew_0190.deprecations:

 Deprecations
15 changes: 14 additions & 1 deletion pandas/io/parsers.py
@@ -16,7 +16,7 @@
is_list_like, is_integer_dtype,
is_float,
is_scalar)
-from pandas.core.index import Index, MultiIndex
+from pandas.core.index import Index, MultiIndex, RangeIndex
from pandas.core.frame import DataFrame
from pandas.core.common import AbstractMethodError
from pandas.core.config import get_option
@@ -700,6 +700,7 @@ def __init__(self, f, engine=None, **kwds):
         # miscellanea
         self.engine = engine
         self._engine = None
+        self._currow = 0

         options = self._get_options_with_defaults(engine)

@@ -913,8 +914,20 @@ def read(self, nrows=None):
         # May alter columns / col_dict
         index, columns, col_dict = self._create_index(ret)

+        if index is None:
+            if col_dict:
+                # Any column is actually fine:
+                new_rows = len(compat.next(compat.itervalues(col_dict)))
+                index = RangeIndex(self._currow, self._currow + new_rows)
+            else:
+                new_rows = 0
+        else:
+            new_rows = len(index)
+
         df = DataFrame(col_dict, columns=columns, index=index)

+        self._currow += new_rows
+
         if self.squeeze and len(df.columns) == 1:
             return df[df.columns[0]].copy()
         return df
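
The bookkeeping in this hunk can be illustrated outside pandas. The following is a minimal sketch, not the parser's actual code: `read_chunks` is a hypothetical helper built on the standard-library `csv` module, and its `currow` counter plays the role of `self._currow`, so each chunk's row labels continue where the previous chunk stopped.

```python
import csv
from io import StringIO
from itertools import islice


def read_chunks(text, chunksize):
    """Yield (index, rows) pairs whose index continues across chunks."""
    reader = csv.reader(StringIO(text))
    header = next(reader)  # first line holds the column names
    currow = 0  # rows handed out so far (stands in for self._currow)
    while True:
        rows = list(islice(reader, chunksize))
        if not rows:
            break
        # Number this chunk from the running offset instead of from 0.
        index = range(currow, currow + len(rows))
        yield index, [dict(zip(header, row)) for row in rows]
        currow += len(rows)


chunks = list(read_chunks('A,B\n0,1\n2,3\n4,5\n6,7', chunksize=2))
indices = [list(index) for index, _ in chunks]
```

Here `indices` holds one list of labels per chunk, and the second chunk starts at 2 rather than restarting at 0.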
12 changes: 12 additions & 0 deletions pandas/io/tests/parser/common.py
@@ -461,6 +461,18 @@ def test_get_chunk_passed_chunksize(self):
         piece = result.get_chunk()
         self.assertEqual(len(piece), 2)

+    def test_read_chunksize_generated_index(self):
+        # GH 12185
Contributor

Hmm, is this already done for the Python parser (i.e. the progressive numbering for chunks)?

Member Author

Do you mean the fix or the test? Anyway, both the fix and the test apply to both parsers.

Contributor

I meant both.

+        reader = self.read_csv(StringIO(self.data1), chunksize=2)
+        df = self.read_csv(StringIO(self.data1))
Contributor

Check with an index_col as well.

Member Author

See test_read_chunksize and test_read_chunksize_named (hence the "generated_index" in the new test's name)... am I missing something?

Contributor

This needs some more comprehensive tests. They should be in this same test method (even if they are slightly duplicated elsewhere). You are making a major change, so you need to exercise multiple cases.


+        tm.assert_frame_equal(pd.concat(reader), df)
+
+        reader = self.read_csv(StringIO(self.data1), chunksize=2, index_col=0)
+        df = self.read_csv(StringIO(self.data1), index_col=0)
+
+        tm.assert_frame_equal(pd.concat(reader), df)

     def test_read_text_list(self):
         data = """A,B,C\nfoo,1,2,3\nbar,4,5,6"""
         as_list = [['A', 'B', 'C'], ['foo', '1', '2', '3'], ['bar',
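
A broader test along the lines requested in the review might look like the sketch below. It is only an illustration: it assumes a pandas version in which chunks carry a progressive index, it uses the public `pandas.testing.assert_frame_equal` rather than the internal `tm` helper, and both the sample data and the `check_chunked_matches_full` name are made up here.

```python
from io import StringIO

import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical sample data: five rows, so chunk sizes 2 and 3 leave a short last chunk.
DATA = "A,B\n0,1\n2,3\n4,5\n6,7\n8,9"


def check_chunked_matches_full(data, chunksize, **kwargs):
    """Assert that concatenated chunks equal a single full read."""
    full = pd.read_csv(StringIO(data), **kwargs)
    chunks = list(pd.read_csv(StringIO(data), chunksize=chunksize, **kwargs))
    assert_frame_equal(pd.concat(chunks), full)
    return len(chunks)


# Chunk sizes that divide the data, leave a remainder, or exceed its length.
chunk_counts = {n: check_chunked_matches_full(DATA, n) for n in (1, 2, 3, 10)}

# An explicit index column should keep working as before.
for n in (1, 2, 3, 10):
    check_chunked_matches_full(DATA, n, index_col=0)
```

Keeping the generated-index and `index_col` cases in one helper makes it cheap to add further variations, such as the Python engine, by passing extra keyword arguments.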
4 changes: 0 additions & 4 deletions pandas/io/tests/parser/test_network.py
@@ -122,8 +122,6 @@ def test_parse_public_s3_bucket_chunked(self):
self.assertFalse(df.empty)
true_df = local_tips.iloc[
chunksize * i_chunk: chunksize * (i_chunk + 1)]
-# Chunking doesn't preserve row numbering
-true_df = true_df.reset_index().drop('index', axis=1)
tm.assert_frame_equal(true_df, df)

@tm.network
@@ -143,8 +141,6 @@ def test_parse_public_s3_bucket_chunked_python(self):
self.assertFalse(df.empty)
true_df = local_tips.iloc[
chunksize * i_chunk: chunksize * (i_chunk + 1)]
-# Chunking doesn't preserve row numbering
-true_df = true_df.reset_index().drop('index', axis=1)
tm.assert_frame_equal(true_df, df)

@tm.network
1 change: 0 additions & 1 deletion pandas/io/tests/test_common.py
@@ -86,7 +86,6 @@ def test_iterator(self):
         it = read_csv(StringIO(self.data1), chunksize=1)
         first = next(it)
         tm.assert_frame_equal(first, expected.iloc[[0]])
-        expected.index = [0 for i in range(len(expected))]
         tm.assert_frame_equal(concat(it), expected.iloc[1:])
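
The effect on a partially consumed iterator can be seen in this short sketch, which assumes a pandas version with the progressive chunk index and uses made-up sample data. After the first chunk is taken with `next`, the remaining chunks keep the labels of the corresponding rows, so the expected frame needs no index rewrite before the comparison.

```python
from io import StringIO

import pandas as pd

data = "A,B\n0,1\n2,3\n4,5\n6,7"  # hypothetical sample data

expected = pd.read_csv(StringIO(data))

it = pd.read_csv(StringIO(data), chunksize=1)
first = next(it)            # consume only the first chunk
rest = pd.concat(list(it))  # concatenate whatever is left

first_labels = list(first.index)
rest_labels = list(rest.index)
# Compare against the tail of the full frame without resetting any index.
matches_slice = rest.equals(expected.iloc[1:])
```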

