Description
I have a wide delimited file that includes columns with predictable names but unpredictable capitalisation, as shown in this toy code.
import pandas as pd
from io import StringIO
data = """1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4
5,5,5,5"""
names = ['col1', 'col2', 'COL3', 'Col4']
df = pd.read_csv(StringIO(data), names=names, usecols=None)
df.head()
col1 | col2 | COL3 | Col4 | |
---|---|---|---|---|
0 | 1 | 1 | 1 | 1 |
1 | 2 | 2 | 2 | 2 |
2 | 3 | 3 | 3 | 3 |
3 | 4 | 4 | 4 | 4 |
4 | 5 | 5 | 5 | 5 |
I want to pass a list of column names to usecols
when importing the file, but I won't be able to unless I match the names exactly:
df = pd.read_csv(StringIO(data), names=names, usecols=['col3', 'col4'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-92d038c5ac8c> in <module>()
1 cols = ['col3', 'col4']
----> 2 df1 = pd.read_csv('test.csv', usecols=['col3', 'col4'])
3 df1.head()
/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
560 skip_blank_lines=skip_blank_lines)
561
--> 562 return _read(filepath_or_buffer, kwds)
563
564 parser_f.__name__ = name
/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
313
314 # Create the parser.
--> 315 parser = TextFileReader(filepath_or_buffer, **kwds)
316
317 if (nrows is not None) and (chunksize is not None):
/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
643 self.options['has_index_names'] = kwds['has_index_names']
644
--> 645 self._make_engine(self.engine)
646
647 def close(self):
/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
797 def _make_engine(self, engine='c'):
798 if engine == 'c':
--> 799 self._engine = CParserWrapper(self.f, **self.options)
800 else:
801 if engine == 'python':
/Users/alexchase/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1255
1256 if len(self.names) < len(self.usecols):
-> 1257 raise ValueError("Usecols do not match names.")
1258
1259 self._set_noconvert_columns()
ValueError: Usecols do not match names.
I see two problems here. First, the error text would be more helpful if it specified what column name(s) is/are wrong. That's pretty easy to fix. Second, if I know that capitalisation might be a problem, I have to check every column name individually before I can pass a list to usecols
. One possible solution would be to modify this part and this part of parsers.py
to case-transform the lists of column names before matching. This behaviour could be made the default, or it could be optional via an ignore_col_case
keyword in the various read
functions.
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None