
read_html() performance when HTML malformed #14312

Closed
@kokes

Description


When parsing thousands of HTML files, I noticed a rather large gap in performance between pd.read_html and parsing by hand with lxml, one that could not be attributed to pandas' overhead in handling edge cases.

I looked into it a bit and found that pandas uses lxml, but only when there are no formatting issues (recover=False in the parser settings; let's call it strict mode). When strict parsing fails, it falls back on BeautifulSoup, which is an order of magnitude slower. I have put together a performance test using two public websites and a dummy example of a generated table:

https://gist.github.com/kokes/b97c8324ba664400714a78f5561340fc

(My code in no way replicates pd.read_html, but the gap is too big to be explained by edge case detection and proper column naming.)
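
For reference, a stripped-down sketch of the kind of comparison the gist makes might look like this (not the actual gist code; the file name, repeat count and bs4 backend are placeholders):

```python
# Rough timing sketch, not the actual gist code: parse the same saved HTML
# page with a recovering lxml parser and with BeautifulSoup. "page.html",
# the repeat count and the bs4 backend ("html.parser") are placeholders;
# pandas' fallback may use a different bs4 backend.
import time

import lxml.etree as etree
from bs4 import BeautifulSoup

html = open("page.html", "rb").read()

t0 = time.perf_counter()
for _ in range(100):
    etree.fromstring(html, parser=etree.HTMLParser(recover=True))
t_lxml = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    BeautifulSoup(html, "html.parser")
t_bs4 = time.perf_counter() - t0

print("lxml: %.3fs  bs4: %.3fs" % (t_lxml, t_bs4))
```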

I would like to find out how BeautifulSoup improves upon lxml when handling malformed HTML files, to see whether that justifies the performance gap. This may well be a non-issue if lxml is known to produce incorrect output - you tell me. For example, in the Wikipedia example (see the gist above), lxml fails in strict mode because it reaches a <bdi> element in one of the cells, which it does not recognise. There aren't any formatting errors, just an unknown element.
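
A minimal sketch of that failure mode (made-up markup, not the pandas code path; the exact error message is from memory):

```python
# Minimal sketch with made-up markup: libxml2's HTML parser does not know
# newer tags such as <bdi>, so a non-recovering parse fails even though the
# markup itself is well formed.
import lxml.etree as etree

html = b"<table><tr><td><bdi>42</bdi></td></tr></table>"

try:
    etree.fromstring(html, parser=etree.HTMLParser(recover=False))
except etree.XMLSyntaxError as exc:
    print("strict mode failed:", exc)  # something like "Tag bdi invalid"

# With recover=True the same document parses and the cell content survives.
doc = etree.fromstring(html, parser=etree.HTMLParser(recover=True))
print(doc.findtext(".//td/bdi"))  # -> 42
```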

(Or maybe there is a C/Cython implementation of bs4 that could be used - I haven't explored that option; I'm still trying to understand the basics.)

Thank you!

Metadata

Assignees

No one assigned

    Labels

    IO HTML (read_html, to_html, Styler.apply, Styler.applymap)
    Performance (Memory or execution speed performance)


    Milestone

    No milestone
