BUG: Fix MultiIndex DataFrame to_csv() segfault #26355

mahepe · 2019-05-12T13:15:24Z

closes MultiIndex DataFrame to_csv() terminates python process #26303
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This fix for #26303 would avoid an indexing issue caused by using MultiIndexes which results to a segfault in the example provided in #26303.

jreback · 2019-05-12T13:46:19Z

pandas/tests/frame/test_to_csv.py

@@ -874,6 +874,13 @@ def test_to_csv_index_no_leading_comma(self):
        expected = tm.convert_rows_list_to_csv_str(expected_rows)
        assert buf.getvalue() == expected

+    def test_to_csv_multi_index(self):
+        # see gh-26303


don’t actually write a file as that leaves an artifact
either use a context manager (see other routines) or write to StringIO

in either case compare the result

Thanks for this comment @jreback, resolved in c5d36b9

codecov · 2019-05-12T13:54:10Z

Codecov Report

Merging #26355 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26355      +/-   ##
==========================================
- Coverage   92.04%   92.03%   -0.01%     
==========================================
  Files         175      175              
  Lines       52292    52292              
==========================================
- Hits        48132    48128       -4     
- Misses       4160     4164       +4

Flag	Coverage Δ
#multiple	`90.59% <100%> (ø)`	⬆️
#single	`40.71% <0%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`95.62% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.01% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 681a22c...17d04a5. Read the comment docs.

codecov · 2019-05-12T13:54:10Z

Codecov Report

Merging #26355 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26355      +/-   ##
==========================================
- Coverage   91.68%   91.67%   -0.01%     
==========================================
  Files         174      174              
  Lines       50703    50703              
==========================================
- Hits        46488    46484       -4     
- Misses       4215     4219       +4

Flag	Coverage Δ
#multiple	`90.18% <100%> (ø)`	⬆️
#single	`41.17% <0%> (-0.18%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/multi.py	`95.62% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.01% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6e43a4...d1d0565. Read the comment docs.

jreback · 2019-05-12T14:41:36Z

doc/source/whatsnew/v0.25.0.rst

@@ -254,6 +254,7 @@ Performance Improvements
 - Improved performance of :meth:`Series.map` for dictionary mappers on categorical series by mapping the categories instead of mapping all values (:issue:`23785`)

 .. _whatsnew_0250.bug_fixes:
+- Fixed MultiIndex DataFrame to_csv segfault (:issue:`26303`)


move to Indexing, use :class:`MultiIdex` and :meth:DataFrame.to_csv ``; specify that this is a single-level multi-index

Done in d47089e

jreback · 2019-05-12T14:42:44Z

pandas/core/indexes/multi.py

@@ -946,7 +946,7 @@ def _format_native_types(self, na_rep='nan', **kwargs):
            new_codes.append(level_codes)

        if len(new_levels) == 1:


add a comment here that this is a single-level MI

what do new_levels and new_codes look like here (from your test)

The value of new_levels is [array(['1', '2', '3'], dtype='<U21')].
The value of new_codes is [FrozenNDArray([0, 2], dtype='int8')]

Added a comment in d47089e

ok use levels[0].take(codes[0])

Done in 159a828

jreback · 2019-05-12T14:44:36Z

pandas/tests/frame/test_to_csv.py

@@ -874,6 +874,18 @@ def test_to_csv_index_no_leading_comma(self):
        expected = tm.convert_rows_list_to_csv_str(expected_rows)
        assert buf.getvalue() == expected

+    def test_to_csv_multi_index(self):


we already have a multiindex test, can you move this near; change the test to be test_to_csv_single_level_multiindex

Done in d47089e

jreback · 2019-05-12T14:45:02Z

pandas/tests/frame/test_to_csv.py

+        index = pd.Index([(1,), (2,), (3,)])
+        df = pd.DataFrame([[1, 2, 3]], columns=index)
+        with ensure_clean() as path:
+            df = df.reindex(columns=[(1,), (3,)])


why is this reindex needed?

The reindex is what seems to cause the segfault in DataFrame.to_csv as demonstrated in #26303. It seems that it has something to do with a multi-index having a larger number of levels than there are columns in a data frame.

pep8speaks · 2019-05-12T15:31:32Z

Hello @mahepe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-13 18:15:51 UTC

WillAyd · 2019-05-13T14:41:11Z

pandas/core/indexes/multi.py

@@ -946,7 +946,9 @@ def _format_native_types(self, na_rep='nan', **kwargs):
            new_codes.append(level_codes)

        if len(new_levels) == 1:
-            return Index(new_levels[0])._format_native_types()
+            # a single-level multi-index
+            return Index(new_levels[0].take(new_codes[0])).\


Shouldn't need the explicit line break here

Reformatted

WillAyd · 2019-05-13T14:41:54Z

pandas/tests/frame/test_to_csv.py

+            df.to_csv(path, line_terminator='\n')
+            expected = b",1,3\n0,1,3\n"
+
+            with open(path, mode='rb') as f:


Do you need to physically write this to a file to reproduce the segfault or can you just inspect the result returned as a string?

Turns out that no: Tried the example provided in the issue. The segfault occured even if the csv was never written to a file.

Updated the test so that it doesn't write to a file.

jreback · 2019-05-16T01:47:44Z

thanks @mahepe

jreback requested changes May 12, 2019

View reviewed changes

mahepe force-pushed the dev branch from 17d04a5 to c5d36b9 Compare May 12, 2019 14:37

jreback requested changes May 12, 2019

View reviewed changes

jreback added IO CSV read_csv, to_csv MultiIndex labels May 12, 2019

mahepe force-pushed the dev branch 3 times, most recently from d47089e to d403442 Compare May 12, 2019 15:31

mahepe force-pushed the dev branch 2 times, most recently from 159a828 to a3f85e8 Compare May 12, 2019 16:11

WillAyd reviewed May 13, 2019

View reviewed changes

BUG: Fix MultiIndex DataFrame to_csv() segfault (pandas-dev#26303)

d1d0565

mahepe force-pushed the dev branch from a3f85e8 to d1d0565 Compare May 13, 2019 18:15

jreback added this to the 0.25.0 milestone May 16, 2019

jreback approved these changes May 16, 2019

View reviewed changes

jreback merged commit 9c5165e into pandas-dev:master May 16, 2019

		@@ -946,7 +946,7 @@ def _format_native_types(self, na_rep='nan', **kwargs):
		new_codes.append(level_codes)

		if len(new_levels) == 1:

Uh oh!

BUG: Fix MultiIndex DataFrame to_csv() segfault #26355

BUG: Fix MultiIndex DataFrame to_csv() segfault #26355

Uh oh!

Conversation

mahepe commented May 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented May 12, 2019

Codecov Report

Uh oh!

codecov bot commented May 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pep8speaks commented May 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2019-05-13 18:15:51 UTC

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mahepe May 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented May 16, 2019

Uh oh!

Uh oh!

mahepe commented May 12, 2019 •

edited

Loading

codecov bot commented May 12, 2019 •

edited

Loading

pep8speaks commented May 12, 2019 •

edited

Loading

mahepe May 13, 2019 •

edited

Loading