Suggestions for html table parsing

Hi,

I have lxml installed, and when trying to parse HTML with tables through pandas I ran into some problems:
1. HTMLParser is set to `recover=False`, contrary to lxml's default value (`recover=True`). This is unfortunate because the parser will fail on any problem in the HTML page, such as a missing closing tag, which IMHO happens too often for this to be justifiable. If such strict rules are really needed, I'd suggest parsing the HTML document with the default parser values and then applying this restriction only to table fragments.
2. HTMLParser will almost always deliver wrong content for tables encoded in anything other than ASCII. There is no encoding argument in `pandas.io.html.read_html()`, and lxml doesn't do any magic unless the charset is correctly declared through an http-equiv attribute in the HTML document, or an encoding argument is passed to HTMLParser.
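
To illustrate both points, here is a minimal sketch that uses lxml directly, outside of pandas. The helper `table_rows` and the sample markup are made up for this example and are not part of pandas or lxml; only `etree.HTMLParser(recover=..., encoding=...)` and `etree.fromstring` are real lxml calls. Whether the strict parse actually raises on this particular input may depend on the underlying libxml2 version, so the sketch handles both outcomes.

```python
from lxml import etree


def table_rows(raw_bytes, recover=True, encoding=None):
    """Hypothetical helper (not part of pandas): for each <table>, return
    a list of rows, each row being a list of cell strings."""
    # recover=True is lxml's own default; encoding=None leaves detection to lxml.
    parser = etree.HTMLParser(recover=recover, encoding=encoding)
    root = etree.fromstring(raw_bytes, parser)
    tables = []
    for table in root.xpath("//table"):
        rows = []
        for row in table.xpath(".//tr"):
            cells = row.xpath("./td|./th")
            rows.append(["".join(cell.itertext()).strip() for cell in cells])
        tables.append(rows)
    return tables


# Point 1: sloppy markup (unclosed <td>, stray </span>) parsed leniently.
broken = b"<html><body><table><tr><td>1<td>2</tr></table></span></body></html>"
lenient = table_rows(broken, recover=True)

# The same input with recover=False may raise, depending on the libxml2 version.
try:
    strict = table_rows(broken, recover=False)
except etree.XMLSyntaxError as exc:
    strict = None
    print("strict parse failed:", exc)

# Point 2: UTF-8 bytes with no charset declaration in the markup.
utf8_page = "<table><tr><td>Čačak</td><td>Niš</td></tr></table>".encode("utf-8")
declared = table_rows(utf8_page, encoding="utf-8")  # encoding passed explicitly
guessed = table_rows(utf8_page)  # left to lxml; may come back as mojibake
```

Comparing `declared` with `guessed` on a real page is a quick way to check whether the missing encoding argument is what garbles the table content.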


Suggestions for html table parsing #7220
