Suggestions for html table parsing #7220

Closed
@klonuo

Hi,

I have lxml installed, and when parsing HTML tables through pandas I ran into some problems:

  1. HTMLParser is created with recover=False, contrary to lxml's default value. This is unfortunate because the parser will fail on any problem in the HTML page, such as a missing closing tag, which IMHO happens too often for this to be justifiable. If such strict rules are needed, I'd suggest parsing the HTML document with the default parser values and then applying this restriction only to the table fragments.
  2. HTMLParser will almost always deliver wrong content for tables encoded in anything other than ASCII. There is no encoding argument in pandas.io.html.read_html(), and lxml doesn't do magic unless the http-equiv attribute is correctly declared in the HTML document or an encoding argument is passed to HTMLParser.
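
To make the two points concrete, here is a minimal sketch that uses lxml directly, not pandas. The helper names `parse_html` and `cell_texts` and the sample byte strings are made up for illustration; only `etree.HTMLParser(recover=..., encoding=...)` and `etree.fromstring` are actual lxml calls. It is meant to show the settings I am talking about, not how pandas should implement them.

```python
from lxml import etree

# Sample input (hypothetical): the closing </td> and </tr> tags are missing.
BROKEN_HTML = b"<table><tr><td>1</td><td>2</table>"

# Sample input (hypothetical): UTF-8 bytes with no charset declared anywhere.
UTF8_HTML = "<table><tr><td>Zürich</td></tr></table>".encode("utf-8")


def parse_html(data, recover=True, encoding=None):
    """Parse HTML bytes with explicit parser settings.

    recover=True is lxml's own default; encoding=None leaves the
    charset detection to the parser.
    """
    parser = etree.HTMLParser(recover=recover, encoding=encoding)
    return etree.fromstring(data, parser)


def cell_texts(root):
    """Collect the text of every <td> element under the parsed root."""
    return [td.text for td in root.iter("td")]


# Point 1: with recover=True the parser repairs the missing closing tags.
print(cell_texts(parse_html(BROKEN_HTML, recover=True)))

# Point 2: passing the encoding explicitly decodes the non-ASCII cell,
# even though the document itself declares no charset.
print(cell_texts(parse_html(UTF8_HTML, encoding="utf-8")))
```

Something like an `encoding` argument forwarded from read_html() to the parser would cover the second case, and a lenient first pass over the whole page would cover the first.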

Labels: Enhancement, IO HTML, Unicode