Description
Lets say we have data with both column and row headers. Lets say this dataset also simply outputs a 0
in that csv file for the cell at the intersection of the column and row headers (A1 in excel notation, or cell (0,0) in the csv file). Additionally, both "0"
and "0.1"
are valid column names:
RH\CH | -0.1 | 0 | 0.1 |
---|---|---|---|
10.0 | 123 | 456 | 789 |
20.0 | 012 | 345 | 678 |
Thus, "RH\CH" is replaced by "0"
on export (of which I have no control).
import pandas as pd
from cStringIO import StringIO
data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'
df = pd.read_csv(StringIO(data),index_col=0,header=0)
print(df)
-0.1 0.1 0.1
0
10.0 123 789 789
20.0 12 678 678
The name mangling will change the duplicate "0"
column containing data to "0.1"
, thus if a real data column has the name "0.1"
, this data will be copied (?) back to the mangled duplicate "0"
column. Additionally, if the true "0.1"
data name is missing, then this name mangling is frustrating as there is no obvious way to determine if the now present "0.1"
column is a duplicate "0"
that has been mangled or is a real "0.1"
data series.
I propose the addition of a mangle_dupe_cols_str
keyword option that defaults to '.'
to preserve the current behavior. However, it can be passed as a kwarg to read_csv
in cases where the period name mangling could result in further duplicate columns.
In https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L2109-L2111:
if cur_count > 0:
this_columns[i] = '%s.%d' % (col, cur_count)
counts[col] = cur_count + 1
should be adapted to
if cur_count > 0:
this_columns[i] = '%s%s%d' % (col, self.mangle_dupe_cols_str, cur_count)
counts[col] = cur_count + 1
and lines 1042-1043 in https://github.com/pandas-dev/pandas/blob/v0.19.1/pandas/io/parsers.py#L1042-L1043 changed to:
self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
self.mangle_dupe_cols_str = kwds.get('mangle_dupe_cols_str', '.')
self.infer_datetime_format = kwds.pop('infer_datetime_format', False)
Expected Output
Then the corrected output would be:
>>> data = '0,-0.1,0,0.1\n10.0,123,456,789\n20.0,012,345,678'
>>> df = pd.read_csv(StringIO(data),index_col=0,header=0,mangle_dupe_cols_str='__')
>>> print(df)
-0.1 0__1 0.1
0
10.0 123 456 789
20.0 12 345 678
where subsequent operations could identify any columns with the mangle_dupe_cols_str, if needed. The current behavior silently causes data loss for certain column names.
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: None
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None