BUG: Index and MultiIndex KeyError cases and discussion #39775

Open
@attack68

Since the introduction of KeyError for missing keys in an index there have been quite a few use cases from different issues. I will try and link some of the issues if I see them.

My view is that KeyErrors for Index are fine, but MultiIndexes should be treated differently: you cannot always raise a KeyError for a single key in a MultiIndex slice, since a MultiIndex cannot always be reindexed.

Index

indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]

[Screenshot: results table for the Index cases, 2021-02-12]

Code generator
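import numpy as np
import pandas as pd

ix = pd.IndexSlice  # shorthand used inside the generated commands

ret = None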
def do(command):
    try:
        exec(f'global ret; ret={command}', globals())
    except KeyError:
        return 'KeyError'
    else:
        if isinstance(ret, (np.int64)):
            return 'int64'
        elif isinstance(ret, (pd.Series)):
            return 'Series'
        elif isinstance(ret, (pd.DataFrame)):
            return 'DataFrame'
        return 'OtherType'

cases = [
"'a'",           # single valid key
"'!'",           # single invalid key
"['a']",         # single valid key as pseudo multiple valid keys
"['!']",         # single invalid key as pseudo multiple valid keys
"['a','e']",     # multiple valid keys
"['a','!']",     # at least one invalid key
"'a':'e'",       # valid key slice
"'a':'!'",       # at least one invalid slice key
"'!':",          # at least one invalid slice key
"'b'",           # single valid non-unique key
"['b']",         # single valid non-unique key as pseudo multiple keys
"'b':'d'",       # slice with non-unique key
]

base = [
    [f's.loc[{case}]',        # use regular s.loc[]
     f's.loc[ix[{case}]]']    # and with index slice as comparison s.loc[ix[{}]]
    for case in cases
]
commands = [
    command for sublist in base for command in sublist
]

indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]

results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i, j] = do(command)

This seems to be pretty consistent. The only inconsistency is perhaps highlighted in red, and a minor niggle for dynamic coding might be the different return types in the case of non-unique indexes.
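
To make the return-type niggle concrete, here is a minimal sketch using small illustrative Series (the names `s_unique` and `s_dup` are made up for this example and are not part of the generator above). The same scalar lookup yields a scalar on a unique index and a Series on a non-unique one, while a missing key raises KeyError:

```python
import pandas as pd

# Illustrative data only: one unique index and one with a duplicated label.
s_unique = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c']))
s_dup = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'b']))

scalar_result = s_unique.loc['b']   # a single value
series_result = s_dup.loc['b']      # a Series holding both 'b' rows

# A missing scalar key raises KeyError on a plain Index.
try:
    s_unique.loc['!']
    missing_raises = False
except KeyError:
    missing_raises = True
```

Code that handles both kinds of index dynamically therefore has to branch on the result type, or always pass a list such as `['b']` to get a Series back.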

Obviously, the solution for any case where you need to index by pre-defined keys that may have been filtered out is to reindex with those pre-defined keys. And this is quite easy to do in RAM.
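
The reindexing approach for a flat Index could look roughly like this. The data and the name `predefined_keys` are assumptions for illustration; note this only works when the index is unique:

```python
import pandas as pd

# Illustrative Series whose index has lost some of the expected keys.
s = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c']))

# Hypothetical pre-defined keys; '!' is not present in the index.
predefined_keys = ['a', 'c', '!']

# reindex inserts NaN for missing keys instead of raising KeyError.
out = s.reindex(predefined_keys)
```

After this, `out.loc[predefined_keys]` is safe because every key is guaranteed to exist.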

MultiIndex

indexes = [
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]

MultiIndexing is different. You cannot always reindex for one of two reasons:

  • The number of possible combinations of the index level values can exceed RAM, and expanding them is computationally slow.
  • If you add a value or set of values to a MultiIndex level, it is ambiguous which combinations should be created, and expanding to all combinations leads to the problems above.

For example, consider the MultiIndex levels: (a,b), (x,y,z). There is a maximum of 6 index tuples, but in practice one will work with indexes containing far fewer than the maximum number of combinations (since the combinations scale exponentially with the number of levels). Your MultiIndex might thus be [(a,x), (a,z), (b,x), (b,y)].
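
As a small sketch of the example above (variable names are chosen here for illustration), the full product of the levels can be compared with the sparse index that is actually in use:

```python
import pandas as pd

# All combinations of the two levels: 2 * 3 = 6 tuples.
full = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y', 'z']])

# The sparse index from the example: only 4 of the 6 tuples exist.
sparse = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')]
)
s = pd.Series([1, 2, 3, 4], index=sparse)

# Reindexing to the full product materialises every missing tuple as NaN.
expanded = s.reindex(full)
missing = full.difference(sparse)
```

With two small levels the expansion is harmless, but the size of `full` is the product of the level sizes, so the same step quickly becomes impractical as levels are added.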

I think you need to be able to index MultiIndexes with keys that are missing. As a rule, I would suggest that indexers given as an iterable do not yield KeyErrors. Here is a summary of some of the observations below for current behaviour:

[a, y] : KeyError
[a, [y]] : KeyError but should return empty (a in level0)
[[a], y] : KeyError but should return empty (y in level1)
[[a], [y]] : KeyError but should return empty 
[a, !] : KeyError 
[a, [!]] : returns empty
[[a], !] : KeyError (maybe OK since ! not in level1)
[[!], x] : returns empty (x in level1)
[[!], [!]] : returns empty
[!, !] : KeyError
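
For comparison, the "return empty instead of KeyError" behaviour suggested for list-like keys can be approximated today with boolean masks. This is only a workaround sketch, not a proposed pandas implementation; `select_lenient` is a hypothetical helper name:

```python
import pandas as pd

def select_lenient(s, level0_keys, level1_keys):
    """Hypothetical helper: select rows whose level values are in the
    given lists, returning an empty Series when nothing matches."""
    mask = (
        s.index.get_level_values(0).isin(level0_keys)
        & s.index.get_level_values(1).isin(level1_keys)
    )
    return s[mask]

index = pd.MultiIndex.from_tuples(
    [('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')]
)
s = pd.Series([1, 2, 3, 4], index=index)

# ('a', 'y') is not in the index: the result is empty, no KeyError.
empty = select_lenient(s, ['a'], ['y'])

# Unknown keys such as '!' are simply ignored.
hit = select_lenient(s, ['a', '!'], ['x', 'z'])
```

The mask-based selection never consults the level lookup tables, so it sidesteps the KeyError paths entirely, at the cost of scanning the whole index.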

[Screenshot: results table for the MultiIndex slice cases]

Code generator
cases_level0 = [
  "'a'",         # single valid key on level0
  "'!'",         # single invalid key on level0
  "['a']",       # single valid key on level0 as pseudo multiple valid keys
  "['!']",       # single invalid key on level0 as pseudo multiple valid keys
  "['a', 'b']",  # multiple valid key on level0
  "['a', '!']",  # at least one invalid key on level0
  "'a':'b'",     # valid level0 index slice
  "'a':'!'",     # invalid level0 index slice
  "'!':",        # fully invalid level0 index slice
]

comments_level0 = [
'0: valid single, ',
'0: invalid single, ',
'0: valid single as multiple, ',
'0: invalid single as multiple, ',
'0: multiple valid, ',
'0: one invalid in multiple, ',
'0: valid slice, ',
'0: semi-invalid slice, ',
'0: invalid slice, ',
]

base = [
  [f's.loc[{case}]', f's.loc[ix[{case}, :]]']  for case in cases_level0
]
commands = [
  command for sublist in base for command in sublist
]

indexes = [
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]

results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
  s = pd.Series([1,2,3,4,5], index=index)
  for i, command in enumerate(commands):
      results.iloc[i,j] = do(command)

base = [
  [com, com]  for com in comments_level0
]
comments = [
  com for sublist in base for com in sublist
]
      
results['Comment'] = comments    
results.style

cases_level1 = [
  "'x'",         # single valid key on level1
  "'y'",         # single sometimes-valid key on level1
  "'!'",         # single invalid key on level1
  "['x']",       # single valid key on level1 as pseudo multiple valid keys
  "['y']",       # single sometimes-valid key on level1 as pseudo multiple valid keys
  "['!']",       # single invalid key on level1 as pseudo multiple valid keys
  "['x', 'y']",  # multiple sometimes-valid key on level1
  "['x', '!']",  # at least one invalid key on level1
  "'x':'y'",     # sometimes-valid level1 index slice
  "'x':'!'",     # invalid level1 index slice
  "'!':",        # fully invalid level1 index slice
]

comments_level1 = [
'1: valid single',
'1: semi-valid single',
'1: invalid single',
'1: valid single as multiple',
'1: semi-valid single as multiple',
'1: invalid single as multiple',
'1: multiple semi-valid',
'1: one invalid in multiple',   
'1: semi-valid slice',
'1: semi-invalid slice',
'1: invalid slice',  
]

from itertools import product
multi_cases = list(product(cases_level0, cases_level1))
multi_comments = list(product(comments_level0, comments_level1))

base = [
  [f's.loc[{case[0]}, {case[1]}]', f's.loc[ix[{case[0]}, {case[1]}]]']  for case in multi_cases
]
commands = [
  command for sublist in base for command in sublist
]

results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
  s = pd.Series([1,2,3,4,5], index=index)
  for i, command in enumerate(commands):
      results.iloc[i,j] = do(command)

base = [
  [com, com]  for com in multi_comments
]
comments = [
  com for sublist in base for com in sublist
]        

results['comment'] = comments
results.style\
     .applymap(lambda v: 'background-color:red;', subset=ix[["s.loc['a', ['!']]", "s.loc['a', ['y']]", "s.loc[['!'], ['!']]", "s.loc[['a'], ['y']]"], :])\
     .applymap(lambda v: 'background-color:LemonChiffon;', subset=ix[["s.loc['a', 'x':'y']"], :])

Labels: API Design, Bug, Error Reporting, Indexing, MultiIndex
