Description
When parsing thousands of HTML files, I noticed a rather large performance gap between using `pd.read_html` and parsing by hand with `lxml` - one that could not be attributed to the overhead pandas incurs handling edge cases.
I looked into it a bit and found that pandas uses `lxml`, but only when there are no formatting issues (`recover=False` in the parser settings - let's call it strict mode). When strict parsing fails, it falls back on `BeautifulSoup`, which is an order of magnitude slower. I have recreated this performance test using two public websites and a dummy example of a generated table:
https://gist.github.com/kokes/b97c8324ba664400714a78f5561340fc
(My code in no way replicates `pd.read_html`, but the gap is too big to be explained by edge-case detection and proper column naming.)
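For reference, a minimal sketch of the kind of comparison I ran - this is a simplified stand-in for the gist, using a generated dummy table rather than the two real websites:

```python
import time
from io import StringIO

import lxml.html
import pandas as pd

# Placeholder input: a simple generated table, not the actual benchmark pages.
html = "<table>" + "".join(
    f"<tr><td>{i}</td><td>{i * 2}</td></tr>" for i in range(10_000)
) + "</table>"

# Full pandas pipeline.
start = time.perf_counter()
df = pd.read_html(StringIO(html))[0]
print("pd.read_html:", time.perf_counter() - start)

# Hand-rolled lxml extraction of the same table.
start = time.perf_counter()
tree = lxml.html.fromstring(html)
rows = [
    [td.text_content() for td in tr.xpath("./td")]
    for tr in tree.xpath("//table//tr")
]
df_manual = pd.DataFrame(rows)
print("lxml by hand:", time.perf_counter() - start)
```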
I would like to find out how `BeautifulSoup` improves upon `lxml` on malformed HTML files, to justify the performance gap. This may well be a non-issue if `lxml` is known to produce incorrect outputs - you tell me. For example, in the Wikipedia case (see the gist above), `lxml` fails in strict mode because it reaches a `<bdi>` element in one of the cells, which it does not recognise. There aren't any formatting errors, just an unknown element.
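To illustrate what I mean by strict mode, here is a minimal sketch of the two `lxml` parser configurations as I understand them - not necessarily the exact code path pandas takes. A fragment with an unclosed `<br>` fails the strict parse but goes through the lenient HTML parser without complaint:

```python
from io import StringIO

from lxml import etree

# Fine as HTML, but not well-formed XML (the <br> is never closed).
snippet = "<table><tr><td>a<br></td><td><bdi>b</bdi></td></tr></table>"

# Strict parse (recover=False): any well-formedness issue raises an error.
try:
    etree.parse(StringIO(snippet), parser=etree.XMLParser(recover=False))
except etree.XMLSyntaxError as exc:
    print("strict parse failed:", exc)

# Lenient HTML parse (HTMLParser defaults to recover=True): succeeds.
tree = etree.parse(StringIO(snippet), parser=etree.HTMLParser())
print(["".join(td.itertext()) for td in tree.iter("td")])
```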
(Or maybe there is a C/Cython implementation of `bs4` that could be used - I haven't explored that option, as I'm still trying to understand the basics.)
Thank you!