Skip to content

Manylinux wheel size could be reduced by 54% #20719

Closed
@LouisAmon

Description

@LouisAmon

Code Sample

# Build pandas
python3 -m pip install -t build --no-binary pandas pandas==0.22.0

# Measure size
du -sh build/pandas

# Strip binaries
find build/pandas -name "*.so"|xargs strip

# Measure -drastically reduced- size
du -sh build/pandas

Problem description

When building pandas from source and running the command strip, the resulting folder is 54% lighter than when using the manylinux wheel (via pip).

[...] the strip program removes inessential information from executable binary programs and object files, thus potentially resulting in better performance and sometimes significantly less disk space usage
https://en.wikipedia.org/wiki/Strip_(Unix)

This is probably harmless on most systems, yet it is quite important in size-constrained systems (such as AWS Lambda).

Some developers have resorted to distributing their own stripped binaries (e.g. lambda packages for the serverless framework zappa), but it seems like a makeshift solution.

I think the problem should be solved upstream, as each library should be responsible for packaging their own optimized binaries.

An issue has also been opened regarding numpy, as it is a dependency of pandas and they both end up using a lot of unneccesary disk space.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.87-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

How to replicate

System specifications

Every command below has been executed using the amazonlinux docker image.
https://hub.docker.com/_/amazonlinux/

Pandas version: 0.22.0
Python version: 3.6.2

Prepare docker image
docker run -it amazonlinux bash
yum update -y
yum install -y findutils binutils python36-devel gcc gcc-c++
Install wheel & measure package size
python3 -m pip install -t wheel pandas==0.22.0
du -sh wheel/pandas

--> 111 MB

Try strip:

find wheel/pandas -name "*.so"|xargs strip
du -sh wheel/pandas

--> 52 MB

Almost 50% of the binary size can be stripped.

Build from source & measure package size
python3 -m pip install -t build --no-binary pandas pandas==0.22.0
du -sh build/pandas

--> 107 MB

Try strip:

find build/pandas -name "*.so"|xargs strip
du -sh build/pandas

--> 51 MB

Metadata

Metadata

Assignees

No one assigned

    Labels

    BuildLibrary building on various platformsDuplicate ReportDuplicate issue or pull request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions