BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571

Closed
62 commits
d485c4a
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 5, 2016
ae62350
BUG: `read_csv` throws UnicodeDecodeError with unicode
nateGeorge Jul 6, 2016
36bcdd8
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 6, 2016
285ccf9
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
173c38b
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
78d46d6
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
35dfb13
chore: matched master
nateGeorge Jul 12, 2016
71f084e
DOC: add pd.read_csv bug #13549
nateGeorge Jul 12, 2016
da8fce4
TST: out-> result and tm.ensure_clean
nateGeorge Jul 12, 2016
1825486
TST: conform to PEP8
nateGeorge Jul 12, 2016
1d30333
TST: condense test_read_utf_aliases test
nateGeorge Jul 12, 2016
4f680d7
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 12, 2016
b582195
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 13, 2016
e26c92a
CLN: remove unnecessary BytesIO import
nateGeorge Jul 13, 2016
d14b69e
CLN: remove unnecessary csv write line
nateGeorge Jul 13, 2016
eeb7011
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 13, 2016
b8d78c4
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 5, 2016
75869f4
BUG: `read_csv` throws UnicodeDecodeError with unicode
nateGeorge Jul 6, 2016
9c88919
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
6725536
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
671ad41
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
3c4a798
BUG: Groupby.nth includes group key inconsistently #12839
adneu Jul 6, 2016
5675b82
In gbq, use googleapiclient instead of apiclient #13454 (#13458)
parthea Jul 7, 2016
ff6117e
RLS: switch master from 0.18.2 to 0.19.0 (#13586)
jorisvandenbossche Jul 8, 2016
b983957
BUG: Datetime64Formatter not respecting ``formatter``
haleemur Jul 8, 2016
451c054
BUG: Fix TimeDelta to Timedelta (#13600)
yui-knk Jul 9, 2016
33278a9
COMPAT: 32-bit compat fixes mainly in testing
jreback Jul 7, 2016
181cecd
BUG: DatetimeIndex - Period shows ununderstandable error
sinhrks Jul 10, 2016
a2e5d54
ENH: add downcast to pd.to_numeric
gfyoung Jul 10, 2016
6c8b21b
CLN: remove radd workaround in ops.py
sinhrks Jul 10, 2016
5d99cff
DEPR: rename Timestamp.offset to .freq
sinhrks Jul 10, 2016
8e7904f
CLN: Remove the engine parameter in CSVFormatter and to_csv
gfyoung Jun 10, 2016
a07b5d3
BUG: Block/DTI doesnt handle tzlocal properly
sinhrks Jul 10, 2016
ff2a335
BUG: Series contains NaT with object dtype comparison incorrect (#13592)
sinhrks Jul 11, 2016
1f8cc7f
CLN/TST: Add tests for nan/nat mixed input (#13477)
sinhrks Jul 11, 2016
f743eb3
BUG: groupby apply on selected columns yielding scalar (GH13568) (#13…
jorisvandenbossche Jul 11, 2016
e161699
TST: Clean up tests of DataFrame.sort_{index,values} (#13496)
IamJeffG Jul 11, 2016
5765b92
DOC: add pd.read_csv bug #13549
nateGeorge Jul 12, 2016
ac18b36
TST: out-> result and tm.ensure_clean
nateGeorge Jul 12, 2016
1fc6b90
TST: conform to PEP8
nateGeorge Jul 12, 2016
6b0e2ca
TST: condense test_read_utf_aliases test
nateGeorge Jul 12, 2016
41a6fae
DOC: asfreq clarify original NaNs are not filled (GH9963) (#13617)
jorisvandenbossche Jul 12, 2016
f730e60
BUG: Invalid Timedelta op may raise ValueError
sinhrks Jul 12, 2016
05a2d04
CLN: Cleanup ops.py
sinhrks Jul 12, 2016
c4e93bd
CLN: Removed outtype in DataFrame.to_dict (#13627)
gfyoung Jul 12, 2016
430273d
CLN: Fix compile time warnings
yui-knk Jul 13, 2016
1fa91b9
CLN: remove unnecessary BytesIO import
nateGeorge Jul 13, 2016
e379e9f
CLN: remove unnecessary csv write line
nateGeorge Jul 13, 2016
a35521e
Pin IPython for doc build to 4.x (see #13639)
jorisvandenbossche Jul 13, 2016
6c09821
CLN: reorg type inference & introspection
jreback Jul 13, 2016
5584dff
BLD: included pandas.api.* in setup.py (#13640)
gfyoung Jul 13, 2016
9463dee
docs: add note about read_csv() bug
nateGeorge Aug 15, 2016
5198179
cln: trying to merge with master
nateGeorge Aug 15, 2016
3c30cd0
CLN: merge with master
nateGeorge Aug 15, 2016
e77ac2d
Merge branch 'fix/read_csv-utf-aliases' of github.com:nateGeorge/pand…
nateGeorge Aug 19, 2016
69ab536
CLN: reset to master branch
nateGeorge Aug 19, 2016
1eb478d
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Aug 19, 2016
a2f178f
CLN: fix small diff from upstream/master
nateGeorge Aug 19, 2016
8e05f7e
BUG: _read encoding fix
nateGeorge Aug 19, 2016
ab153d5
DOC: add note on read_csv bug
nateGeorge Aug 19, 2016
0c1de9f
TST: add test for read_csv with unicode bug
nateGeorge Aug 19, 2016
77ec966
CLN: fix indents and spacings
nateGeorge Aug 19, 2016
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -1075,3 +1075,5 @@ Bug Fixes
 - Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
 - Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
 - Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an UnboundLocalError exception because ``idx`` referenced before assignment.
+
+- Bug in ``read_csv()``, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised UnicodeDecodeError (:issue:`13549`)
3 changes: 3 additions & 0 deletions pandas/io/parsers.py
@@ -350,6 +350,9 @@ def _validate_nrows(nrows):
 def _read(filepath_or_buffer, kwds):
     "Generic reader of line files."
     encoding = kwds.get('encoding', None)
+    if encoding is not None:
+        encoding = re.sub('_', '-', encoding).lower()
+        kwds['encoding'] = encoding

     # If the input could be a filename, check for a recognizable compression
     # extension. If we're reading from a URL, the `get_filepath_or_buffer`
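
The normalization added to `_read` can be tried in isolation. The sketch below is illustrative only: `normalize_encoding` is a hypothetical helper name that does not appear in the PR, and the comparison against Python's codec registry is an added sanity check rather than something pandas does here. It suggests that rewriting the alias does not change which codec Python itself resolves.

```python
import codecs
import re


def normalize_encoding(encoding):
    """Hypothetical helper mirroring the PR's change:
    replace underscores with hyphens, then lowercase."""
    if encoding is None:
        return None
    return re.sub('_', '-', encoding).lower()


aliases = ['utf-8', 'utf_8', 'UTF-8', 'UTF_8',
           'utf-16', 'utf_16', 'UTF-16', 'UTF_16']

# Every spelling collapses to a single hyphenated, lowercase form.
normalized = [normalize_encoding(a) for a in aliases]

# Python's codec registry already resolves all of these spellings to the
# same codec, so the rewrite only changes the string handed to the parser.
canonical = [codecs.lookup(a).name for a in aliases]

for alias, norm, canon in zip(aliases, normalized, canonical):
    print(alias, '->', norm, '(codec: %s)' % canon)
```
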
11 changes: 11 additions & 0 deletions pandas/io/tests/parser/common.py
@@ -8,6 +8,7 @@
 import re
 import sys
 from datetime import datetime
+from io import BytesIO

 import nose
 import numpy as np
@@ -1583,3 +1584,13 @@ def test_temporary_file(self):
         new_file.close()
         expected = DataFrame([[0, 0]])
         tm.assert_frame_equal(result, expected)
+
+    def test_read_csv_utf_aliases(self):
+        # see gh issue 13549
+        expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']})
+        for byte in [8, 16]:
+            for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
+                encoding = fmt.format(byte)
+                data = 'mb_num,multibyte\n4.8,test'.encode(encoding)
+                result = self.read_csv(BytesIO(data), encoding=encoding)
+                tm.assert_frame_equal(result, expected)
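
The alias matrix used by this test can also be exercised without the pandas test harness. The following is a minimal, dependency-free sketch and not part of the PR: it swaps in the standard-library `csv` module for `read_csv` (an assumption made purely to keep the example self-contained) and checks that each of the eight alias spellings round-trips the same two-row payload.

```python
import csv
import io

# Same payload as the test, parsed as plain strings by the csv module.
expected = [['mb_num', 'multibyte'], ['4.8', 'test']]

results = {}
for byte in [8, 16]:
    for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
        encoding = fmt.format(byte)
        # Encode and decode with the same alias spelling, as the test does.
        data = 'mb_num,multibyte\n4.8,test'.encode(encoding)
        text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline='')
        results[encoding] = list(csv.reader(text))

for encoding, rows in results.items():
    print(encoding, rows == expected)
```

Swapping the `csv.reader` call for `pandas.read_csv(io.BytesIO(data), encoding=encoding)` gives the pandas-level check that the test in this diff performs.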