Skip to content

read_csv(...,mangle_dupe_cols=True) causes silent data loss for certain column names. Request introduction of mangle_dupe_cols_str #14704

Closed
@wilsoc5

Description

@wilsoc5

Lets say we have data with both column and row headers. Lets say this dataset also simply outputs a 0 in that csv file for the cell at the intersection of the column and row headers (A1 in excel notation, or cell (0,0) in the csv file). Additionally, both "0" and "0.1" are valid column names:

RH\CH -0.1 0 0.1
10.0 123 456 789
20.0 012 345 678

Thus, "RH\CH" is replaced by "0" on export (of which I have no control).

import pandas as pd
from cStringIO import StringIO

data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'
df = pd.read_csv(StringIO(data),index_col=0,header=0)
print(df)

      -0.1  0.1  0.1
0                   
10.0   123  789  789
20.0    12  678  678

The name mangling will change the duplicate "0" column containing data to "0.1", thus if a real data column has the name "0.1", this data will be copied (?) back to the mangled duplicate "0" column. Additionally, if the true "0.1" data name is missing, then this name mangling is frustrating as there is no obvious way to determine if the now present "0.1" column is a duplicate "0" that has been mangled or is a real "0.1" data series.

I propose the addition of a mangle_dupe_cols_str keyword option that defaults to '.' to preserve the current behavior. However, it can be passed as a kwarg to read_csv in cases where the period name mangling could result in further duplicate columns.

In https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L2109-L2111:

                        if cur_count > 0:
                            this_columns[i] = '%s.%d' % (col, cur_count)
                        counts[col] = cur_count + 1

should be adapted to

                        if cur_count > 0:
                            this_columns[i] = '%s%s%d' % (col, self.mangle_dupe_cols_str, cur_count)
                        counts[col] = cur_count + 1

and lines 1042-1043 in https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L1042-L1043 changed to:

        self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
        self.mangle_dupe_cols_str = kwds.get('mangle_dupe_cols_str', '.')        
        self.infer_datetime_format = kwds.pop('infer_datetime_format', False)

Expected Output

Then the corrected output would be:

>>> data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'
>>> df = pd.read_csv(StringIO(data),index_col=0,header=0,mangle_dupe_cols_str='__')
>>> print(df)

      -0.1  0__1  0.1
0                   
10.0   123  456  789
20.0    12  345  678

where subsequent operations could identify any columns with the mangle_dupe_cols_str, if needed. The current behavior silently causes data loss for certain column names.

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: None
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementIO CSVread_csv, to_csvNeeds DiscussionRequires discussion from core team before further action

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions