ENH: Using nrows option while processing xlsb files #39518

Closed
@code-R

Description

Is your feature request related to a problem?

I wish pandas could apply nrows eagerly while processing xlsx or xlsb files.

https://github.com/pandas-dev/pandas/blob/master/pandas/io/excel/_pyxlsb.py#L73
That code processes all of the sheet data first and only applies nrows afterwards, which is very slow on large files. In one of my xlsb files there are around 3k records, but the reader loops over roughly 100k rows (because of xlsb metadata or something similar). Even though I specify nrows as 3k, I have to wait for all 100k rows to be processed.
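To make the cost concrete, here is a minimal sketch that does not touch pandas or pyxlsb at all. `fake_sheet_rows` is a hypothetical stand-in for `sheet.rows()` that counts how many rows it is forced to produce, so the two strategies can be compared directly.

```python
from itertools import islice
from typing import Iterator, List


def fake_sheet_rows(total: int, consumed: List[int]) -> Iterator[List[int]]:
    """Hypothetical stand-in for sheet.rows(); records how many rows were parsed."""
    for i in range(total):
        consumed[0] += 1  # count every row the reader is forced to produce
        yield [i, i * 2]


# Read everything, then truncate: all 100_000 rows are parsed.
eager_count = [0]
eager = list(fake_sheet_rows(100_000, eager_count))[:3_000]

# Stop at nrows: only 3_000 rows are parsed.
lazy_count = [0]
lazy = list(islice(fake_sheet_rows(100_000, lazy_count), 3_000))

print(eager_count[0], lazy_count[0])  # 100000 3000
```

Both approaches return the same 3,000 rows; only the amount of work differs.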

Describe the solution you'd like

If we can pass nrows to get_sheet_data and break out of the loop once it has read nrows rows, it will be much faster.

API breaking implications

I am not really sure.

Describe alternatives you've considered

I don't have any alternatives at the moment and would like to hear suggestions about this.
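One possible stopgap, sketched here as an assumption rather than a tested recipe, is to call pyxlsb directly and stop iterating early. `take_rows` and `read_xlsb_head` are hypothetical helper names. The sketch relies on pyxlsb's `open_workbook`, `get_sheet`, and `rows(sparse=False)`, and on each cell exposing its value as `.v`. It returns raw values and skips the float and date conversion that pandas performs.

```python
from itertools import islice
from typing import Any, Iterable, List


def take_rows(rows: Iterable[Iterable[Any]], nrows: int) -> List[List[Any]]:
    """Return the values of the first nrows rows without reading any further."""
    return [[cell.v for cell in row] for row in islice(rows, nrows)]


def read_xlsb_head(path: str, sheet: Any = 1, nrows: int = 100) -> List[List[Any]]:
    """Hypothetical helper: read only the first nrows raw rows of an xlsb sheet."""
    # Imported here so take_rows stays usable without pyxlsb installed.
    from pyxlsb import open_workbook

    with open_workbook(path) as wb:
        with wb.get_sheet(sheet) as ws:
            return take_rows(ws.rows(sparse=False), nrows)
```

If the first row is a header, something like `rows = read_xlsb_head("sample.xlsb", nrows=101)` followed by `pd.DataFrame(rows[1:], columns=rows[0])` would give 100 data rows.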

Additional context


df = pd.read_excel("sample.xlsb", nrows=100, engine="pyxlsb")

def get_sheet_data(self, sheet, convert_float: bool, nrows: int) -> List[List[Scalar]]:
    res = []
    for index, r in enumerate(sheet.rows(sparse=False)):
        # Stop before converting any row past the limit, so exactly nrows
        # sheet rows are returned (the caller must account for header rows).
        if index >= nrows:
            break
        res.append([self._convert_cell(c, convert_float) for c in r])

    return res

We can set the default value of nrows to infinity, in case we don't want the loop to break for other callers.
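One way to get that "infinite" default without a special sentinel value is `itertools.islice`, which reads to the end when its stop argument is `None`. The sketch below is a standalone illustration, not the pandas method: `get_sheet_data_sketch` and its `convert_cell` parameter are hypothetical names, and the rows are plain lists rather than pyxlsb cells.

```python
from itertools import islice
from typing import Any, Callable, Iterable, List, Optional


def get_sheet_data_sketch(
    rows: Iterable[Iterable[Any]],
    convert_cell: Callable[[Any], Any],
    nrows: Optional[int] = None,
) -> List[List[Any]]:
    """Convert at most nrows rows; nrows=None means no limit."""
    # islice(rows, None) iterates to the end, so None acts as "infinity".
    return [[convert_cell(c) for c in r] for r in islice(rows, nrows)]


sheet = ([i, str(i)] for i in range(5))
print(get_sheet_data_sketch(sheet, str, nrows=2))  # [['0', '0'], ['1', '1']]
```

Callers that never pass nrows keep the current read-everything behaviour, which should limit the API impact.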

Metadata

Assignees

No one assigned

    Labels

    Duplicate Report (duplicate issue or pull request), IO Excel (read_excel, to_excel), Performance (memory or execution speed performance)