Description
Research
- I have searched the [pandas] tag on StackOverflow for similar questions.
- I have asked my usage-related question on StackOverflow.
Link to question on StackOverflow
Question about pandas
It seems that when a MultiIndex contains at least one duplicate entry, every key queried with .loc[] returns a DataFrame, even if only one row matches, resulting in a 1xN frame. (I didn't test whether this also applies to a single-level index.) Is this the intended behaviour?
It does seem a bit strange and inconsistent with other functions, especially since you can still retrieve a Series when you query twice with .loc[].loc[]. On second glance, it can also be seen as improving consistency: a DataFrame is always returned for any key, regardless of which key is used. While this made me aware that the duplicate exists in one dataframe but not the other, the way there was long and tedious. I would prefer an INFO or WARN log over being confused about why two similar dataframes return different objects when indexed in the same way.
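In the absence of a built-in log message, a check along these lines can surface the duplicates early. This is only a minimal sketch: `warn_on_duplicate_index` is a hypothetical helper name, not part of pandas, and it relies only on `Index.is_unique` and `Index.duplicated`.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def warn_on_duplicate_index(df: pd.DataFrame, name: str = "DataFrame") -> bool:
    """Hypothetical helper: log a warning if df has duplicate index keys.

    Returns True when duplicates were found, False otherwise.
    """
    if df.index.is_unique:
        return False
    # Keys that occur more than once, each listed a single time.
    dupes = df.index[df.index.duplicated(keep="first")].unique()
    logger.warning(
        "%s has %d duplicated index key(s): %s", name, len(dupes), list(dupes)
    )
    return True


# Small frame with the same shape of problem as in the question.
index = pd.MultiIndex.from_tuples(
    [("pdf1", 0), ("pdf2", 0), ("pdf2", 0)], names=["pdf_name", "page"]
)
frame = pd.DataFrame({"col1": [1, 2, 3]}, index=index)
has_dupes = warn_on_duplicate_index(frame, "frame")
```

Calling the helper right after loading each file would have flagged the second dataframe before any .loc[] lookup returned an unexpected type.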
code to reproduce (see also SO-post)
import pandas as pd
from io import StringIO
file1 = StringIO("pdf_name;page;col1;col2\npdf1;0;val1;val2\npdf2;0;asdf;ffff")
file2 = StringIO("pdf_name;page;col1;col2\npdf1;0;;\npdf2;0;;\npdf2;0;;")
data1 = pd.read_csv(file1, sep=";", index_col=['pdf_name', 'page'])
data2 = pd.read_csv(file2, sep=";", index_col=['pdf_name', 'page'])
data2 = data2.sort_index()  # sort to avoid the PerformanceWarning on lookup
idx = data1.index[0]
print(idx) # ('pdf1', 0)
print("data1.loc[idx]", type(data1.loc[idx])) # data1.loc[idx] <class 'pandas.core.series.Series'>
print("data2.loc[idx]", type(data2.loc[idx])) # data2.loc[idx] <class 'pandas.core.frame.DataFrame'>
print("data2.loc[idx].shape", data2.loc[idx].shape) # (1, 2) -- single row
print("data2['col1'].loc[idx]", type(data2['col1'].loc[idx])) # data2['col1'].loc[idx] <class 'pandas.core.series.Series'>
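One way to get the same return type from both dataframes, sketched here under the assumption that a DataFrame is acceptable in every case, is to pass the key inside a list. The frames below are small stand-ins built by hand rather than the CSV data above, and the variable names are illustrative only.

```python
import pandas as pd

names = ["pdf_name", "page"]
unique_idx = pd.MultiIndex.from_tuples([("pdf1", 0), ("pdf2", 0)], names=names)
dup_idx = pd.MultiIndex.from_tuples(
    [("pdf1", 0), ("pdf2", 0), ("pdf2", 0)], names=names
)

df_unique = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=unique_idx)
df_dup = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6]}, index=dup_idx
).sort_index()

key = ("pdf1", 0)

# A list of keys asks for "all rows matching these keys", so the result
# is a DataFrame whether or not the index contains duplicates.
rows_unique = df_unique.loc[[key]]
rows_dup = df_dup.loc[[key]]
print(type(rows_unique).__name__, rows_unique.shape)
print(type(rows_dup).__name__, rows_dup.shape)

# If a Series is wanted, take a row explicitly instead of relying on .loc.
first_row = rows_dup.iloc[0]
```

The explicit .iloc[0] makes the choice of row visible in the code, which matters once a key really does match more than one row.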