Skip to content

ENH: Allow overwriting existing sheets when appending to excel files #40230

Closed
@mirober

Description

@mirober

Is your feature request related to a problem?

As far as I can tell there is no way to write a DataFrame to a specific (named) sheet in an existing Excel file, creating the sheet if it doesn't exist or overwriting if it does. The closest thing to this that exists is the openpyxl append mode, but if you try to write to sheet_which_already_exists with this, the data will be written to sheet_which_already_exists1. Using write mode overwrites the entire book.

Being able to overwrite existing sheets would be useful for populating "raw data" sheets in existing reports/publications on a regular basis, with pivot tables etc updating automatically.

This seems to be a very common issue:
https://stackoverflow.com/questions/62618680/overwrite-an-excel-sheet-with-pandas-dataframe-without-affecting-other-sheets
https://stackoverflow.com/questions/47602533/append-a-pandas-dataframe-to-an-existing-excel-table?noredirect=1&lq=1
https://stackoverflow.com/questions/34744863/python-how-to-use-excelwriter-to-write-into-an-existing-worksheet?noredirect=1&lq=1
#28653
#29990

Describe the solution you'd like

There are a few potential behaviours someone might want when they write to an existing sheet in append mode:

  • the current behaviour, which creates a new sheet with a different name
  • to delete the current contents of the sheet and write their data
  • to write the data without deleting anything, for example to add columns or rows to an existing table, or a new table in a different position

Originally I considered adding a flag like overwrite=True to pd.ExcelWriter but I don't think this would be sufficient to cover all the potential behaviours, so would suggest an enum along the lines of the if_exists option in df.to_sql, with the following options:

  • new_sheet: Create a new sheet with a different name.
  • overwrite_sheet: Delete the contents of the sheet, then write to it.
  • overwrite_cells: Write directly to the named sheet without deleting the previous contents.

This option would only be relevant when using mode="a" with the openpyxl engine.

API breaking implications

The new_sheet option would replicate the current behaviour, so this could be made the default.

Additional context

Demonstration of current behaviour:

import pandas as pd
from openpyxl import load_workbook
import io

df1 = pd.DataFrame({
    "col1": ["apple", "banana", "carrot"],
    "col2": [1, 2, 3],
})
df2 = pd.DataFrame({"col1": [5]})

excel = io.BytesIO()

with pd.ExcelWriter(excel, engine="openpyxl", mode="w") as writer:
    # write two sheets
    df1.to_excel(writer, sheet_name="sheet1", index=False)
    df1.to_excel(writer, sheet_name="sheet2", index=False)

with pd.ExcelWriter(excel, engine="openpyxl", mode="a") as writer:
    # try to overwrite one of them
    df2.to_excel(writer, sheet_name="sheet1", index=False)

wb = load_workbook(excel)
# sheet1, sheet2, sheet11
print(wb.sheetnames)

Why does it work this way? At the moment, when the openpyxl ExcelWriter is initialised in append mode it loads the workbook, but does not update self.sheets to reflect the sheets in the loaded file.
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L54-L58

This seems unintentional and means that when write_cells is called with a sheet name that already exists in the file, it is not found in self.sheets:
https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_openpyxl.py#L414-L419
so we follow the else branch, and when assigning to wks.title openpyxl suffixes the sheet name to avoid a naming conflict.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions