Code Sample, a copy-pastable example if possible
import pandas as pd

# Reference case: the file contains a regular ASCII colon.
print("This is what the data should look like (file with regular colon).\n")
df = pd.read_csv('testcase-utf16.txt', encoding='utf-16', delimiter='\t', engine='python')
print(df)

# Failing case: the same file, but with a Unicode full-width colon (U+FF1A).
print("\n\nIncorrect dataframe (file with Unicode full-width colon).\n")
df = pd.read_csv('testcase2-utf16.txt', encoding='utf-16', delimiter='\t', engine='python')
print(df)
Test files:
testcase2-utf16.txt
testcase-utf16.txt
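The attachments are not reproduced inline, so a comparable pair of files can be generated locally. The sketch below is only an approximation: the helper name `write_testcase` and the row text are hypothetical placeholders, not the contents of the attached files. It writes a small tab-separated file in UTF-16 and lets the caller choose which colon character goes into the first row.

```python
import io

# Hypothetical stand-in rows; the real attachments contain different text.
HEADER = u"Block\tLink"
ROWS = [
    (9, u"Example{colon} first placeholder title"),
    (10, u"Second placeholder title"),
    (11, u"Third placeholder title"),
]


def write_testcase(path, colon):
    """Write a UTF-16 TSV file, using `colon` in the first data row."""
    lines = [HEADER]
    for block, link in ROWS:
        lines.append(u"%d\t%s" % (block, link.format(colon=colon)))
    # newline="" keeps the "\n" line endings exactly as written.
    with io.open(path, "w", encoding="utf-16", newline="") as f:
        f.write(u"\n".join(lines) + u"\n")
```

Calling `write_testcase('regular.txt', u':')` and `write_testcase('fullwidth.txt', u'\uff1a')` should produce two files that differ only in the colon character, which can then be passed to the read_csv calls above in place of the attachments.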
Problem description
If a UTF-16-encoded TSV file contains a full-width colon (U+FF1A), read_csv parses the data incorrectly. I'm running Python from a Windows command prompt (Windows 10); the bug does not occur on a Mac, or on a Windows machine running Python under Cygwin.
Some amount of data appears to be skipped. I first ran into this in the middle of a large file, where read_csv raised an exception saying that a line contained too many columns, because the row with the full-width colon had been merged into the end of a row several lines later.
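One possible explanation for the Windows-only behaviour, which is a guess and is not confirmed anywhere in this report: in little-endian UTF-16 the full-width colon is stored as the bytes 0x1A 0xFF, and 0x1A is the Ctrl-Z character that Windows text-mode reads treat as end-of-file. The sketch below only inspects the bytes; it assumes the files are little-endian, and the helper name `utf16le_bytes` is made up for illustration.

```python
FULLWIDTH_COLON = u"\uff1a"
REGULAR_COLON = u":"


def utf16le_bytes(text):
    """Return the little-endian UTF-16 bytes of `text` (no BOM) as a bytearray."""
    return bytearray(text.encode("utf-16-le"))


raw = utf16le_bytes(FULLWIDTH_COLON)
print([hex(b) for b in raw])                  # ['0x1a', '0xff']
print(0x1A in raw)                            # True: contains the Ctrl-Z byte
print(0x1A in utf16le_bytes(REGULAR_COLON))   # False: ':' encodes as 0x3a 0x00
```

If this guess is right, any character whose UTF-16 encoding contains a 0x1A byte would be expected to trigger the same symptom, which may be an easy thing to test.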
In the minimal test case above, the two files differ only in that testcase2-utf16.txt contains a full-width colon in the first row, where testcase-utf16.txt contains a regular colon. With testcase2-utf16.txt, everything from the full-width colon onward disappears from the dataframe:
This is what the data should look like (file with regular colon).

   Block                                               Link
0      9  Survey: 57% of Tokyo high schools demand hair-...
1     10  17 Best ideas about Hair Colors on Pinterest |...
2     11  What It's Like to Change Your Hair Color - I T...
3     12                 Hair Dye: A History - The Atlantic
4     13  A molecular basis for classic blond hair color...


Incorrect dataframe (file with Unicode full-width colon).

   Block                                               Link
0      9  Survey: 57% of Tokyo high schools demand hair-...
Expected Output
The dataframe should contain all of the data, correctly formatted, including the full-width colon. Apart from that one character, the two dataframes should look the same when printed.
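As a possible workaround, not verified against pandas 0.19.2 on Windows, the file could be read in binary mode and decoded in Python before the text is handed to read_csv, so that no text-mode file handling is involved. This is only a sketch under that assumption; the helper name `read_utf16_tsv` is hypothetical.

```python
import io

import pandas as pd


def read_utf16_tsv(path):
    """Decode a UTF-16 TSV file in Python, then parse the decoded text."""
    # Binary mode avoids any platform-specific text-mode translation.
    with io.open(path, "rb") as f:
        text = f.read().decode("utf-16")
    return pd.read_csv(io.StringIO(text), delimiter="\t")
```

If `read_utf16_tsv('testcase2-utf16.txt')` returned all five rows where the direct read_csv call does not, that would suggest the data is lost while the file is being read rather than during parsing.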
Output of pd.show_versions()
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.3.2
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: None
bs4: 4.5.1
html5lib: 0.9999999
httplib2: 0.10.3
apiclient: 1.6.2
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None