Description
When parsing thousands of HTML files, I noticed a rather large performance gap between using `pd.read_html` and parsing by hand with `lxml` - one that could not be attributed to the overhead pandas incurs handling edge cases.
I looked into it a bit and found that pandas uses `lxml`, but only when there are no formatting issues (`recover=False` in the parser settings - let's call it strict mode). When strict parsing fails, it falls back on `BeautifulSoup`, which is an order of magnitude slower. I have recreated this performance test using two public websites and a dummy example of a generated table:
https://gist.github.com/kokes/b97c8324ba664400714a78f5561340fc
(My code in no way replicates `pd.read_html`, but the gap is too big to be explained by edge-case detection and proper column naming.)
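For reference, a minimal sketch of the kind of comparison I ran - this is a simplified stand-in for the gist, using a generated dummy table rather than the two real websites:

```python
import time
from io import StringIO

import lxml.html
import pandas as pd

# Placeholder input: a simple generated table, not the actual benchmark pages.
html = "<table>" + "".join(
    f"<tr><td>{i}</td><td>{i * 2}</td></tr>" for i in range(10_000)
) + "</table>"

# Full pandas pipeline.
start = time.perf_counter()
df = pd.read_html(StringIO(html))[0]
print("pd.read_html:", time.perf_counter() - start)

# Hand-rolled lxml extraction of the same table.
start = time.perf_counter()
tree = lxml.html.fromstring(html)
rows = [
    [td.text_content() for td in tr.xpath("./td")]
    for tr in tree.xpath("//table//tr")
]
df_manual = pd.DataFrame(rows)
print("lxml by hand:", time.perf_counter() - start)
```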
I would like to find out how `BeautifulSoup` improves upon `lxml` on malformed HTML files, to justify the performance gap. This may well be a non-issue if `lxml` is known to produce incorrect outputs - you tell me. For example, in the Wikipedia case (see the gist above), `lxml` fails in strict mode because it reaches a `<bdi>` element in one of the cells, which it does not recognise. There aren't any formatting errors, just an unknown element.
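To illustrate what I mean by strict mode, here is a minimal sketch of the two `lxml` parser configurations as I understand them - not necessarily the exact code path pandas takes. A fragment with an unclosed `<br>` fails the strict parse but goes through the lenient HTML parser without complaint:

```python
from io import StringIO

from lxml import etree

# Fine as HTML, but not well-formed XML (the <br> is never closed).
snippet = "<table><tr><td>a<br></td><td><bdi>b</bdi></td></tr></table>"

# Strict parse (recover=False): any well-formedness issue raises an error.
try:
    etree.parse(StringIO(snippet), parser=etree.XMLParser(recover=False))
except etree.XMLSyntaxError as exc:
    print("strict parse failed:", exc)

# Lenient HTML parse (HTMLParser defaults to recover=True): succeeds.
tree = etree.parse(StringIO(snippet), parser=etree.HTMLParser())
print(["".join(td.itertext()) for td in tree.iter("td")])
```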
(Or maybe there is a C/Cython implementation of `bs4` that could be used - I haven't explored that option, as I'm still trying to understand the basics.)
Thank you!