Skip to content

ENH/DOC: wide_to_long performance and docstring clarification #14778

Closed
@erikcs

Description

@erikcs

I had to massage some messy data recently, and a big bottleneck turned out to be wide_to_long

Some test data with many id variables and time variables

import pandas as pd
import numpy as np
import string

vars = string.ascii_uppercase[0:4]
nyrs = 20
nidvars = 20
N = 5000

yrvars = []
for var in vars:
    for yr in range(1, nyrs + 1):
        yrvars.append( var + str(yr))
        
yearobs = dict(zip(yrvars, np.random.randn(len(yrvars), N)))
idobs = dict(zip(range(nidvars), np.random.rand(nidvars, N)))
        
frame = pd.concat([pd.DataFrame(idobs), pd.DataFrame(yearobs)], axis=1)

frame.shape
#(5000, 91)

Reshaping with wide_to_long takes around 2 secs.

%%time
frame['id'] = frame.index
res1 = pd.wide_to_long(frame, list(vars), i='id', j='year')
#CPU times: user 1.89 s, sys: 210 ms, total: 2.1 s
#Wall time: 2.13 s

I modified wide_to_long slightly (regex on categorical column / avoid copying many "idvariables", postpone type coercion) and the runtime is now

%%time
res2 = wide_to_long2(frame, list(vars), i='id', j='year')
#CPU times: user 112 ms, sys: 10.8 ms, total: 123 ms
#Wall time: 125 ms

The result is the same

res1.equals(res2)
#True

Docstring clarification

The wide_to_long docstring also contains an unused last parameter stubend : str, which should be removed.

A docstring Note addtion about escaping special characters (with for example re.escape) in stubnames could also perhaps be informative, since if the user passes a dataframe with messy stubnames, the function fails with a pretty uninformative error message.

I can send a PR for this if that is wanted?

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: 06f26b5 python: 2.7.12.final.0 python-bits: 64 OS: Darwin OS-release: 16.1.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.19.0+124.g06f26b5.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    DocsPerformanceMemory or execution speed performanceReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions