Skip to content

Confusing interpretation of what DataFrame.join does #12188

Closed
@edublancas

Description

@edublancas

From the docs I see that the difference between .merge and .join is that .join operates using indexes by default, but it also lets you use columns, so I tried to use it for that since it sounds natural when coming from the SQL world.

From the docs:

on : column name, tuple/list of column names, or array-like
Column(s) to use for joining, otherwise join on index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation

From my understanding, if on is absent a join operation is performed on the index, if on is present, it would be reasonable to think that the same operation would be performed.

Having said that:

a = pd.DataFrame({'address' : ["1820 SOME STREET", "32 ANOTHER STREET",
                              "1140 BIG STREET", "20 SMALL STREET",
                              "1090 AVENUE NAME"],
                  'id' : [1,2,3,4,5]
                  })
b = pd.DataFrame({'address' : ["1090 AVENUE NAME",
                              "1140 BIG STREET", "20 SMALL STREET",
                              "1820 SOME STREET", "32 ANOTHER STREET"],
                  'id' : [5,3,4,1,2]})

a.join(b, on='id', lsuffix='_x',  rsuffix='_y', how='left')

Output:

           address_x  id_x          address_y  id_y
0   1820 SOME STREET     1    1140 BIG STREET     3
1  32 ANOTHER STREET     2    20 SMALL STREET     4
2    1140 BIG STREET     3   1820 SOME STREET     1
3    20 SMALL STREET     4  32 ANOTHER STREET     2
4   1090 AVENUE NAME     5                NaN   NaN

And if I add id as index:

a.set_index('id', inplace=True)
b.set_index('id', inplace=True)
a.join(b, lsuffix='_x',  rsuffix='_y', how='left')

Output:

            address_x          address_y
id
1    1820 SOME STREET   1820 SOME STREET
2   32 ANOTHER STREET  32 ANOTHER STREET
3     1140 BIG STREET    1140 BIG STREET
4     20 SMALL STREET    20 SMALL STREET
5    1090 AVENUE NAME   1090 AVENUE NAME

Is that the correct behavior? If yes, I think the documentation is misleading. It took me a lot to find the bug in my code and I ended up using merge since .join works in an unexpected way.

I don't think I'm the only one with this issue, so maybe a change in the documentation would help to clarify how .join works.

Metadata

Metadata

Assignees

No one assigned

    Labels

    DocsReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions