BUG: left join on index with multiple matches now works (GH5391) #7853

Closed
wants to merge 1 commit into from

Conversation

behzadnouri
Contributor

closes #5391

>>> ### left join with multiple matches - multi index case

>>> from numpy import nan
>>> from pandas import DataFrame
>>> import pandas.util.testing as tm
>>>
>>> left = DataFrame([
...     ['X', 'Y', 'C', 'a'],
...     ['W', 'Y', 'C', 'e'],
...     ['V', 'Q', 'A', 'h'],
...     ['V', 'R', 'D', 'i'],
...     ['X', 'Y', 'D', 'b'],
...     ['X', 'Y', 'A', 'c'],
...     ['W', 'Q', 'B', 'f'],
...     ['W', 'R', 'C', 'g'],
...     ['V', 'Y', 'C', 'j'],
...     ['X', 'Y', 'B', 'd']],
...     columns=['cola', 'colb', 'colc', 'tag'],
...     index=[3, 2, 0, 1, 7, 6, 4, 5, 9, 8])
>>> right = DataFrame([
...     ['W', 'R', 'C',  0],
...     ['W', 'Q', 'B',  3],
...     ['W', 'Q', 'B',  8],
...     ['X', 'Y', 'A',  1],
...     ['X', 'Y', 'A',  4],
...     ['X', 'Y', 'B',  5],
...     ['X', 'Y', 'C',  6],
...     ['X', 'Y', 'C',  9],
...     ['X', 'Q', 'C', -6],
...     ['X', 'R', 'C', -9],
...     ['V', 'Y', 'C',  7],
...     ['V', 'R', 'D',  2],
...     ['V', 'R', 'D', -1],
...     ['V', 'Q', 'A', -3]],
...     columns=['col1', 'col2', 'col3', 'val'])
>>> right.set_index(['col1', 'col2', 'col3'], inplace=True)
>>> result = left.join(right, on=['cola', 'colb', 'colc'], how='left')
>>> expected = DataFrame([
...     ['X', 'Y', 'C', 'a',   6],
...     ['X', 'Y', 'C', 'a',   9],
...     ['W', 'Y', 'C', 'e', nan],
...     ['V', 'Q', 'A', 'h',  -3],
...     ['V', 'R', 'D', 'i',   2],
...     ['V', 'R', 'D', 'i',  -1],
...     ['X', 'Y', 'D', 'b', nan],
...     ['X', 'Y', 'A', 'c',   1],
...     ['X', 'Y', 'A', 'c',   4],
...     ['W', 'Q', 'B', 'f',   3],
...     ['W', 'Q', 'B', 'f',   8],
...     ['W', 'R', 'C', 'g',   0],
...     ['V', 'Y', 'C', 'j',   7],
...     ['X', 'Y', 'B', 'd',   5]],
...     columns=['cola', 'colb', 'colc', 'tag', 'val'],
...     index=[3, 3, 2, 0, 1, 1, 7, 6, 6, 4, 4, 5, 9, 8])
>>> tm.assert_frame_equal(result, expected)
>>> print(left, right, result, sep='\n')

  cola colb colc tag
3    X    Y    C   a
2    W    Y    C   e
0    V    Q    A   h
1    V    R    D   i
7    X    Y    D   b
6    X    Y    A   c
4    W    Q    B   f
5    W    R    C   g
9    V    Y    C   j
8    X    Y    B   d

                val
col1 col2 col3     
W    R    C       0
     Q    B       3
          B       8
X    Y    A       1
          A       4
          B       5
          C       6
          C       9
     Q    C      -6
     R    C      -9
V    Y    C       7
     R    D       2
          D      -1
     Q    A      -3

  cola colb colc tag  val
3    X    Y    C   a    6
3    X    Y    C   a    9
2    W    Y    C   e  NaN
0    V    Q    A   h   -3
1    V    R    D   i    2
1    V    R    D   i   -1
7    X    Y    D   b  NaN
6    X    Y    A   c    1
6    X    Y    A   c    4
4    W    Q    B   f    3
4    W    Q    B   f    8
5    W    R    C   g    0
9    V    Y    C   j    7
8    X    Y    B   d    5

>>> ### left join with multiple matches - single index case

>>> left = DataFrame([
...     ['c', 0],
...     ['b', 1],
...     ['a', 2],
...     ['b', 3]],
...     columns=['tag', 'val'],
...     index=[2, 0, 1, 3])
>>> right = DataFrame([
...     ['a', 'v'],
...     ['c', 'w'],
...     ['c', 'x'],
...     ['d', 'y'],
...     ['a', 'z'],
...     ['c', 'r'],
...     ['e', 'q'],
...     ['c', 's']],
...     columns=['tag', 'char'])
>>> right.set_index('tag', inplace=True)
>>> result = left.join(right, on='tag', how='left')
>>> expected = DataFrame([
...     ['c', 0, 'w'],
...     ['c', 0, 'x'],
...     ['c', 0, 'r'],
...     ['c', 0, 's'],
...     ['b', 1, nan],
...     ['a', 2, 'v'],
...     ['a', 2, 'z'],
...     ['b', 3, nan]],
...     columns=['tag', 'val', 'char'],
...     index=[2, 2, 2, 2, 0, 1, 1, 3])
>>> tm.assert_frame_equal(result, expected)
>>> print(left, right, result, sep='\n')

  tag  val
2   c    0
0   b    1
1   a    2
3   b    3

    char
tag     
a      v
c      w
c      x
d      y
a      z
c      r
e      q
c      s

  tag  val char
2   c    0    w
2   c    0    x
2   c    0    r
2   c    0    s
0   b    1  NaN
1   a    2    v
1   a    2    z
3   b    3  NaN

This closes #5391 by returning all the matches when doing a left join on index, in both the single-index and the multi-index case. It still preserves the row order of the calling (left) DataFrame (as it used to), though when there are multiple matches the left index labels repeat and the index is no longer unique.

The added test cases should be self-explanatory.
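
To make the point about the repeated index concrete, here is a minimal sketch with made-up frames (smaller than the test cases above; the variable names are illustrative only). It assumes a pandas version that includes this fix. The final `reset_index(drop=True)` is just one way to get a unique index back and is not something this PR does.

```python
import pandas as pd

# Left frame with a deliberately shuffled, unique index.
left = pd.DataFrame({"tag": ["c", "b", "a"], "val": [0, 1, 2]}, index=[2, 0, 1])

# Right frame indexed by "tag"; the label "c" appears twice (non-unique index).
right = pd.DataFrame(
    {"char": ["v", "w", "x"]},
    index=pd.Index(["a", "c", "c"], name="tag"),
)

# Left join on the right frame's index: every match is returned.
result = left.join(right, on="tag", how="left")

# The left row order is kept, but the label 2 now appears once per match.
index_labels = result.index.tolist()
index_is_unique = result.index.is_unique

# One option (an assumption, not part of the PR) to restore a unique index.
flat = result.reset_index(drop=True)

print(result)
print(index_labels, index_is_unique)
```

Callers that rely on unique index labels after such a join may want to check `result.index.is_unique` before doing label-based lookups.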

Thank you,

@jreback
Contributor

jreback commented Jul 27, 2014

pls add a release note in v0.15.0 bug fix section (ref the original issue)
and post an ipython session showing this in action (using the test cases) - post at the top of the pr

@behzadnouri
Contributor Author

@jreback

http://nbviewer.ipython.org/github/behzadnouri/infra/blob/master/python/GH5391.ipynb
corresponds to the added test cases.

release note added.

@jreback
Contributor

jreback commented Jul 28, 2014

can you just copy paste these to the top of this pr, pls

jreback added this to the 0.15.0 milestone Jul 28, 2014
@behzadnouri
Contributor Author

@jreback copy-pasted

@jreback
Contributor

jreback commented Jul 28, 2014

@behzadnouri what I mean is show the actual execution of the ipython session, so it's easy to see the results

just a simple:

left....
right...

joined...

@behzadnouri
Contributor Author

don't you see the actual execution on the link i provided?
why do you make it so unproductive?

@jreback
Contributor

jreback commented Jul 28, 2014

@behzadnouri you are missing the point

I don't want to actually pull in your PR to do this.

I simply want to look at the PR and see what is happening.

I am not making it unproductive, but PRODUCTIVE.

you have to realize that 99% of people are not going to pull in this PR, but want to see what it does.

@behzadnouri
Contributor Author

and what is the problem with the nbviewer link i provided?

@jreback
Contributor

jreback commented Jul 28, 2014

because then when you search on github it's hard to see what this issue is about.

Why can't you simply update the top of the PR?

it's just a copy-paste.

@@ -958,6 +958,68 @@ def test_left_join_index_preserve_order(self):
right_on=['k1', 'k2'], how='right')
tm.assert_frame_equal(joined.ix[:, expected.columns], expected)

def test_left_join_index_multi_match_multiindex(self):
Contributor

does this work similarly on right / inner / outer when multiple matches? e.g. is left special case behavior, if so, why is that? if not, can you test with other how's. thanks.

Contributor Author

I have not tried right join.
As noted in your own comment on #5391, it works for outer join.

For left join, Wes adds an extra "if not sort:" part to preserve the order of the left frame. That part implicitly assumes there are no 1-to-many matches, and it breaks when there are. see here

Everything up to just before "if not sort:" works fine.
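
As a rough way to explore the question raised in this thread about other `how` values, the sketch below counts the rows produced by each join type on a small made-up pair of frames with a non-unique right index. The frames and names are illustrative assumptions, not taken from the PR's tests, and the sketch only compares row counts; it says nothing about row order or the resulting index.

```python
import pandas as pd

# Left frame: one row per tag; "b" has no match on the right.
left = pd.DataFrame({"tag": ["c", "b", "a"], "val": [0, 1, 2]}, index=[2, 0, 1])

# Right frame indexed by "tag": "c" appears twice, "d" has no match on the left.
right = pd.DataFrame(
    {"char": ["v", "w", "x", "y"]},
    index=pd.Index(["a", "c", "c", "d"], name="tag"),
)

# Number of result rows for each join type.
row_counts = {
    how: len(left.join(right, on="tag", how=how))
    for how in ("left", "inner", "outer", "right")
}

print(row_counts)
```

With these frames, a left join should give one row per match plus one row for the unmatched "b", while inner drops "b" and outer additionally keeps the unmatched "d".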

@behzadnouri
Contributor Author

@jreback added the ipython notebook output

@jreback
Contributor

jreback commented Jul 28, 2014

@behzadnouri ok, thanks for the output

ok, also need to investigate a join with multiple columns (that are non-unique).

behzadnouri changed the title from "left join on index with multiple matches now works (bug #5391)" to "BUG: left join on index with multiple matches now works (GH5391)" Jul 29, 2014
@behzadnouri
Contributor Author

@jreback the test frame extended to include multiple columns with different values

@jreback
Contributor

jreback commented Aug 5, 2014

can you squash to a single commit.

@behzadnouri
Contributor Author

how should i do that?

@jreback
Contributor

jreback commented Aug 5, 2014

@behzadnouri
Contributor Author

done.

@jreback
Contributor

jreback commented Aug 8, 2014

can you restart travis on this and see if it passes

git commit --amend -C HEAD

and repush

I get a failure when I tried in master

@behzadnouri
Contributor Author

all i can say is that this is passing the travis build

@behzadnouri
Contributor Author

@jreback is this resolved on your end?

@jreback
Contributor

jreback commented Aug 21, 2014

merged via 4411ab6

The problem was that you made a change in join.pyx, but setup was not picking it up (and recythonizing). I fixed that up and merged.

thanks!

Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

left join fails in case of non-unique indices
2 participants