Description
I had to massage some messy data recently, and a big bottleneck turned out to be wide_to_long
Some test data with many id variables and time variables
import pandas as pd
import numpy as np
import string
vars = string.ascii_uppercase[0:4]
nyrs = 20
nidvars = 20
N = 5000
yrvars = []
for var in vars:
for yr in range(1, nyrs + 1):
yrvars.append( var + str(yr))
yearobs = dict(zip(yrvars, np.random.randn(len(yrvars), N)))
idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))
frame = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)], axis=1)
frame.shape
#(5000, 91)
Reshaping with wide_to_long
takes around 2 secs.
%%time
frame['id'] = frame.index
res1 = pd.wide_to_long(frame, list(vars), i='id', j='year')
#CPU times: user 1.89 s, sys: 210 ms, total: 2.1 s
#Wall time: 2.13 s
I modified wide_to_long
slightly (regex on categorical column / avoid copying many "idvariables", postpone type coercion) and the runtime is now
%%time
res2 = wide_to_long2(frame, list(vars), i='id', j='year')
#CPU times: user 112 ms, sys: 10.8 ms, total: 123 ms
#Wall time: 125 ms
The result is the same
res1.equals(res2)
#True
Docstring clarification
The wide_to_long
docstring also contains an unused last parameter stubend : str
, which should be removed.
A docstring Note
addtion about escaping special characters (with for example re.escape
) in stubnames
could also perhaps be informative, since if the user passes a dataframe with messy stubnames, the function fails with a pretty uninformative error message.
I can send a PR for this if that is wanted?
Output of pd.show_versions()
pandas: 0.19.0+124.g06f26b5.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None