
QST: loc returns matrix with one row when index is not strictly unique #58142

Open
@Laubeee


Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/78262738/why-does-pandas-loc-with-multiindex-return-a-matrix-with-single-row?noredirect=1#comment137985578_78262738

Question about pandas

It seems that when a MultiIndex contains at least one duplicate entry, ALL keys return a DataFrame when queried with .loc[], even if only one matching row exists, resulting in a 1xN matrix. (I didn't test whether this also applies to a single-level index.) Is this intended behaviour?

It does seem a bit strange and inconsistent with other functions, especially since you can still retrieve a Series by querying twice with .loc[].loc[]. On second glance, it could also be seen as improving consistency, since a DataFrame is always returned regardless of which index key is used. While this behaviour did help me notice that duplicates exist in one dataframe but not the other, getting there was long and tedious. I would prefer an INFO or WARN log message over being confused about why two similar dataframes return different objects when indexed in the same way.
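Until such a warning exists, a check along the following lines might make the duplicates visible up front. This is only a minimal sketch: `report_duplicate_index` and the small `frame` are hypothetical names made up for illustration, while `Index.is_unique` and `Index.duplicated` are existing pandas attributes.

```python
import pandas as pd


def report_duplicate_index(df: pd.DataFrame) -> pd.Index:
    """Hypothetical helper: return each index entry that occurs more than once."""
    # keep=False marks every occurrence of a duplicated entry, not only the repeats
    dup_mask = df.index.duplicated(keep=False)
    return df.index[dup_mask].unique()


# Illustrative frame with one duplicated (pdf_name, page) pair
index = pd.MultiIndex.from_tuples(
    [("pdf1", 0), ("pdf2", 0), ("pdf2", 0)], names=["pdf_name", "page"]
)
frame = pd.DataFrame({"col1": [1, 2, 3]}, index=index)

if not frame.index.is_unique:
    dups = report_duplicate_index(frame)
    print("duplicate index entries:", list(dups))
```

Running a check like this right after loading each file would presumably have pointed at the duplicated entry directly, instead of it surfacing later as a different return type.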

code to reproduce (see also SO-post)

import pandas as pd
from io import StringIO

file1 = StringIO("pdf_name;page;col1;col2\npdf1;0;val1;val2\npdf2;0;asdf;ffff")
file2 = StringIO("pdf_name;page;col1;col2\npdf1;0;;\npdf2;0;;\npdf2;0;;")
data1 = pd.read_csv(file1, sep=";", index_col=['pdf_name', 'page'])
data2 = pd.read_csv(file2, sep=";", index_col=['pdf_name', 'page'])
data2 = data2.sort_index()  # avoid performance warning
idx = data1.index[0]
print(idx)  # ('pdf1', 0)
print("data1.loc[idx]", type(data1.loc[idx]))  # data1.loc[idx] <class 'pandas.core.series.Series'>
print("data2.loc[idx]", type(data2.loc[idx]))  # data2.loc[idx] <class 'pandas.core.frame.DataFrame'>
print("data2.loc[idx].shape", data2.loc[idx].shape)  # (1, 2)  -- single row
print("data2['col1'].loc[idx]", type(data2['col1'].loc[idx]))  # data2['col1'].loc[idx] <class 'pandas.core.series.Series'>
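As a possible workaround, the result of .loc[] could be normalized so that a single matching row always comes back as a Series. The sketch below assumes that is the desired behaviour; `loc_row` is a hypothetical helper and not part of pandas, and the small `data` frame is made up for illustration.

```python
import pandas as pd


def loc_row(df: pd.DataFrame, key):
    """Hypothetical helper: return a Series when `key` matches exactly one row."""
    result = df.loc[key]
    # A one-row DataFrame is collapsed to that row; anything else is passed through
    if isinstance(result, pd.DataFrame) and len(result) == 1:
        return result.iloc[0]
    return result


# Illustrative frame: ("pdf1", 0) occurs once, ("pdf2", 0) occurs twice
index = pd.MultiIndex.from_tuples(
    [("pdf1", 0), ("pdf2", 0), ("pdf2", 0)], names=["pdf_name", "page"]
)
data = pd.DataFrame({"col1": ["a", "b", "c"], "col2": ["x", "y", "z"]}, index=index)

single = loc_row(data, ("pdf1", 0))
multi = loc_row(data, ("pdf2", 0))
print(type(single))
print(type(multi), multi.shape)
```

With this, keys that really do match several rows still return a DataFrame, so the duplicates are not silently hidden.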


Labels: Needs Triage, Usage Question
