-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Fix MultiIndex DataFrame to_csv() segfault #26355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/tests/frame/test_to_csv.py
Outdated
@@ -874,6 +874,13 @@ def test_to_csv_index_no_leading_comma(self): | |||
expected = tm.convert_rows_list_to_csv_str(expected_rows) | |||
assert buf.getvalue() == expected | |||
|
|||
def test_to_csv_multi_index(self): | |||
# see gh-26303 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don’t actually write a file as that leaves an artifact
either use a context manager (see other routines) or write to StringIO
in either case compare the result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #26355 +/- ##
==========================================
- Coverage 92.04% 92.03% -0.01%
==========================================
Files 175 175
Lines 52292 52292
==========================================
- Hits 48132 48128 -4
- Misses 4160 4164 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26355 +/- ##
==========================================
- Coverage 91.68% 91.67% -0.01%
==========================================
Files 174 174
Lines 50703 50703
==========================================
- Hits 46488 46484 -4
- Misses 4215 4219 +4
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -254,6 +254,7 @@ Performance Improvements | |||
- Improved performance of :meth:`Series.map` for dictionary mappers on categorical series by mapping the categories instead of mapping all values (:issue:`23785`) | |||
|
|||
.. _whatsnew_0250.bug_fixes: | |||
- Fixed MultiIndex DataFrame to_csv segfault (:issue:`26303`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to Indexing, use :class:`MultiIdex` and
:meth:DataFrame.to_csv
``; specify that this is a single-level multi-index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in d47089e
@@ -946,7 +946,7 @@ def _format_native_types(self, na_rep='nan', **kwargs): | |||
new_codes.append(level_codes) | |||
|
|||
if len(new_levels) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here that this is a single-level MI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do new_levels and new_codes look like here (from your test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value of new_levels
is [array(['1', '2', '3'], dtype='<U21')]
.
The value of new_codes
is [FrozenNDArray([0, 2], dtype='int8')]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment in d47089e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok use levels[0].take(codes[0])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 159a828
pandas/tests/frame/test_to_csv.py
Outdated
@@ -874,6 +874,18 @@ def test_to_csv_index_no_leading_comma(self): | |||
expected = tm.convert_rows_list_to_csv_str(expected_rows) | |||
assert buf.getvalue() == expected | |||
|
|||
def test_to_csv_multi_index(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have a multiindex test, can you move this near; change the test to be test_to_csv_single_level_multiindex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in d47089e
pandas/tests/frame/test_to_csv.py
Outdated
index = pd.Index([(1,), (2,), (3,)]) | ||
df = pd.DataFrame([[1, 2, 3]], columns=index) | ||
with ensure_clean() as path: | ||
df = df.reindex(columns=[(1,), (3,)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this reindex needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reindex is what seems to cause the segfault in DataFrame.to_csv
as demonstrated in #26303. It seems that it has something to do with a multi-index having a larger number of levels than there are columns in a data frame.
d47089e
to
d403442
Compare
159a828
to
a3f85e8
Compare
pandas/core/indexes/multi.py
Outdated
@@ -946,7 +946,9 @@ def _format_native_types(self, na_rep='nan', **kwargs): | |||
new_codes.append(level_codes) | |||
|
|||
if len(new_levels) == 1: | |||
return Index(new_levels[0])._format_native_types() | |||
# a single-level multi-index | |||
return Index(new_levels[0].take(new_codes[0])).\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't need the explicit line break here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reformatted
pandas/tests/frame/test_to_csv.py
Outdated
df.to_csv(path, line_terminator='\n') | ||
expected = b",1,3\n0,1,3\n" | ||
|
||
with open(path, mode='rb') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to physically write this to a file to reproduce the segfault or can you just inspect the result returned as a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out that no: Tried the example provided in the issue. The segfault occured even if the csv was never written to a file.
Updated the test so that it doesn't write to a file.
thanks @mahepe |
git diff upstream/master -u -- "*.py" | flake8 --diff
This fix for #26303 would avoid an indexing issue caused by using MultiIndexes which results to a segfault in the example provided in #26303.