BUG: iterparse on read_xml overwrites with attributes on child elements  #47343

Closed
@bailsman

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from tempfile import NamedTemporaryFile

import pandas as pd

# <type> is a child element of <issue>; <reporter> also carries a "type" attribute.
XML = '''
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
'''.encode('utf-8')

# Write the XML to a temporary file so read_xml can open it by path.
with NamedTemporaryFile() as tmp:
    tmp.write(XML)
    tmp.flush()
    # Request only the "type" field of each <issue> element.
    df = pd.read_xml(tmp.name, iterparse={'issue': ['type']})
    issue_type = df.iloc[0]['type']
    print(f"issue_type: expecting BUG, got {issue_type}")

Issue Description

With iterparse, read_xml parses attributes on all child elements rather than only on the main (repeating) element, so an attribute on a child element can overwrite a value of the same name.

This is probably unintentional: in the example above, the user is most likely interested in the type of the issue rather than the type attribute on some child element.

The iterparse feature is planned for 1.5.0 and has not been released yet. See: #45724 (comment)

Expected Behavior

Attributes are parsed only on the top-level element.
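
To make the expected behavior concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree. It is not the pandas implementation: parse_rows is a hypothetical helper, and preferring a child element's text over a same-named attribute on the row element is an assumption made for illustration. Attributes on child elements are never consulted.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""
<issue>
  <type>BUG</type>
  <reporter type="newbie">Emanuel</reporter>
</issue>
"""


def parse_rows(source, row_tag, fields):
    """Hypothetical helper (not pandas code): build one dict per <row_tag>."""
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != row_tag:
            continue
        row = {}
        for name in fields:
            child = elem.find(name)  # matches direct children only
            if child is not None:
                # Assumption: a child element's text wins over an attribute.
                row[name] = child.text
            elif name in elem.attrib:
                # Attributes are read from the row element itself,
                # never from its children.
                row[name] = elem.attrib[name]
        rows.append(row)
        elem.clear()  # release the finished subtree
    return rows


rows = parse_rows(io.BytesIO(XML), "issue", ["type"])
print(rows)  # [{'type': 'BUG'}]
```

With this approach the type attribute on the reporter element is ignored, so the type field keeps the value BUG.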

Installed Versions

5465f54
