Closed
Description
Hi,
I have lxml installed, and when trying to parse HTML with tables through pandas I found some problems:
- `HTMLParser` is set to `recover=False`, contrary to lxml's default value. This is unfortunate because the parser will fail on any problem in the HTML page, such as a missing closing tag, which IMHO happens too often for this to be justifiable. If such strict rules are wanted, I'd suggest parsing the HTML document with the default parser values and then applying the restriction only to the table fragments.
- `HTMLParser` will almost always deliver wrong content for tables encoded in anything other than ASCII. There is no encoding argument in `pandas.io.html.read_html()`, and lxml doesn't do magic unless the http-equiv attribute is correctly declared in the HTML document or an encoding argument is passed to `HTMLParser`.
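
To illustrate both points, here is a minimal sketch that uses lxml directly rather than pandas. The sample markup and the `cell_texts` helper are made up for this example; only `etree.HTMLParser` with its `recover` and `encoding` arguments comes from lxml itself. Whether the strict parse actually raises on this particular input may depend on the libxml2 version, so the sketch handles both outcomes.

```python
from lxml import etree

# Hypothetical sample: a stray </div> makes the markup invalid, and there is
# no <meta http-equiv> charset declaration for lxml to pick up.
RAW = (
    "<html><body><table><tr><td>café</td><td>naïve</td></tr></table>"
    "</div></body></html>"
).encode("utf-8")


def cell_texts(raw, **parser_kwargs):
    """Parse raw HTML bytes and return the text of every <td> element."""
    parser = etree.HTMLParser(**parser_kwargs)
    root = etree.fromstring(raw, parser)
    return [td.text for td in root.iter("td")]


# lxml's default (recover=True) plus an explicitly stated encoding.
lenient = cell_texts(RAW, encoding="utf-8")

# Strict mode, as pandas configures it. Depending on the libxml2 version,
# this may raise on the stray end tag or parse without complaint.
try:
    strict = cell_texts(RAW, recover=False, encoding="utf-8")
except etree.XMLSyntaxError as exc:
    strict = None
    print("strict parse failed:", exc)

print(lenient)
```

Passing `encoding` explicitly removes the dependence on a charset declaration inside the page, which is why exposing such an argument from `read_html()` would help.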