Description
Code Sample
# Build pandas
python3 -m pip install -t build --no-binary pandas pandas==0.22.0
# Measure size
du -sh build/pandas
# Strip binaries
find build/pandas -name "*.so"|xargs strip
# Measure -drastically reduced- size
du -sh build/pandas
Problem description
When building pandas from source and running the command strip, the resulting folder is 54% lighter than when using the manylinux wheel (via pip).
[...] the strip program removes inessential information from executable binary programs and object files, thus potentially resulting in better performance and sometimes significantly less disk space usage
https://en.wikipedia.org/wiki/Strip_(Unix)
The extra size is probably harmless on most systems, yet it matters a lot in size-constrained environments (such as AWS Lambda).
Some developers have resorted to distributing their own stripped binaries (e.g. lambda packages for the serverless framework zappa), but it seems like a makeshift solution.
I think the problem should be solved upstream, as each library should be responsible for packaging its own optimized binaries.
An issue has also been opened regarding numpy, as it is a dependency of pandas and they both end up using a lot of unnecessary disk space.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.87-linuxkit-aufs
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
How to replicate
System specifications
Every command below has been executed using the amazonlinux docker image.
https://hub.docker.com/_/amazonlinux/
Pandas version: 0.22.0
Python version: 3.6.2
Prepare docker image
docker run -it amazonlinux bash
yum update -y
yum install -y findutils binutils python36-devel gcc gcc-c++
Install wheel & measure package size
python3 -m pip install -t wheel pandas==0.22.0
du -sh wheel/pandas
--> 111 MB
Try strip:
find wheel/pandas -name "*.so"|xargs strip
du -sh wheel/pandas
--> 52 MB
Roughly 53% of the package size (111 MB down to 52 MB) can be stripped.
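The measure, strip, measure sequence used here can be wrapped in a small helper. This is only a sketch: `strip_and_report` is a hypothetical name, not something shipped with pandas or pip, and it assumes `du`, `find` and `strip` (binutils) are on the PATH. It uses `find -exec ... {} +` instead of piping to `xargs`, which should behave the same but tolerates unusual file names.

```shell
#!/bin/sh
# Hypothetical helper (not part of pandas or pip): strip every shared object
# under a directory and report the directory size before and after.
strip_and_report() {
    dir=$1
    # du -sk prints the total size in 1 KiB blocks, a tab, then the path
    before=$(du -sk "$dir" | cut -f1)
    # -exec ... {} + batches files like xargs does; nothing runs if no file matches
    find "$dir" -name '*.so' -exec strip {} +
    after=$(du -sk "$dir" | cut -f1)
    # Guard against a zero-sized directory to avoid dividing by zero
    if [ "$before" -gt 0 ]; then
        pct=$(( (before - after) * 100 / before ))
    else
        pct=0
    fi
    echo "$dir: ${before} KiB -> ${after} KiB (${pct}% smaller)"
}
```

With the function defined, `strip_and_report wheel/pandas` would replace the three commands above. The percentage is truncated to an integer.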
Build from source & measure package size
python3 -m pip install -t build --no-binary pandas pandas==0.22.0
du -sh build/pandas
--> 107 MB
Try strip:
find build/pandas -name "*.so"|xargs strip
du -sh build/pandas
--> 51 MB
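The percentages quoted in this report can be checked from the four measurements above. A minimal sketch, where `percent_reduction` is a hypothetical helper and the sizes are the rounded MB values reported by `du -sh`:

```shell
#!/bin/sh
# Hypothetical helper: integer percentage saved when going from $1 to $2.
percent_reduction() {
    before=$1
    after=$2
    # Shell arithmetic is integer-only, so the result is truncated
    echo $(( (before - after) * 100 / before ))
}

percent_reduction 111 52   # wheel, stripped in place            -> 53
percent_reduction 107 51   # source build, stripped in place     -> 52
percent_reduction 111 51   # wheel vs. stripped source build     -> 54
```

The last line is the 54% figure from the problem description: the stripped source build compared with the unmodified manylinux wheel.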