Description
From the docs I see that the difference between .merge and .join is that .join operates using indexes by default, but it also lets you use columns, so I tried to use it for that since it sounds natural when coming from the SQL world.
From the docs:
on : column name, tuple/list of column names, or array-like
Column(s) to use for joining, otherwise join on index. If multiples columns given, the passed DataFrame must have a MultiIndex. Can pass an array as the join key if not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
From my understanding, if on is absent a join operation is performed on the index, if on is present, it would be reasonable to think that the same operation would be performed.
Having said that:
a = pd.DataFrame({'address' : ["1820 SOME STREET", "32 ANOTHER STREET",
"1140 BIG STREET", "20 SMALL STREET",
"1090 AVENUE NAME"],
'id' : [1,2,3,4,5]
})
b = pd.DataFrame({'address' : ["1090 AVENUE NAME",
"1140 BIG STREET", "20 SMALL STREET",
"1820 SOME STREET", "32 ANOTHER STREET"],
'id' : [5,3,4,1,2]})
a.join(b, on='id', lsuffix='_x', rsuffix='_y', how='left')
Output:
address_x id_x address_y id_y
0 1820 SOME STREET 1 1140 BIG STREET 3
1 32 ANOTHER STREET 2 20 SMALL STREET 4
2 1140 BIG STREET 3 1820 SOME STREET 1
3 20 SMALL STREET 4 32 ANOTHER STREET 2
4 1090 AVENUE NAME 5 NaN NaN
And if I add id as index:
a.set_index('id', inplace=True)
b.set_index('id', inplace=True)
a.join(b, lsuffix='_x', rsuffix='_y', how='left')
Output:
address_x address_y
id
1 1820 SOME STREET 1820 SOME STREET
2 32 ANOTHER STREET 32 ANOTHER STREET
3 1140 BIG STREET 1140 BIG STREET
4 20 SMALL STREET 20 SMALL STREET
5 1090 AVENUE NAME 1090 AVENUE NAME
Is that the correct behavior? If yes, I think the documentation is misleading. It took me a lot to find the bug in my code and I ended up using merge since .join works in an unexpected way.
I don't think I'm the only one with this issue, so maybe a change in the documentation would help to clarify how .join works.