Description
For clarification, by "interval-point" joins I mean joining an IntervalIndex/IntervalArray against the point values contained in the intervals, e.g. joining a numeric IntervalIndex against a Float64Index. I want to keep this discussion separate from interval-interval merges for the time being.
For example, the following join does not currently work (and likewise merge with column data):
In [2]: df1 = pd.DataFrame({'A': [10, 20, 30]}, index=pd.interval_range(0, 3))
In [3]: df2 = pd.DataFrame({'B': ['foo', 'bar', 'baz', 'qux']},
   ...:                    index=[0.5, 1, 2.71828, 3.14159])
In [4]: df1
Out[4]:
         A
(0, 1]  10
(1, 2]  20
(2, 3]  30
In [5]: df2
Out[5]:
           B
0.50000  foo
1.00000  bar
2.71828  baz
3.14159  qux
In [6]: df1.join(df2)
Out[6]:
         A    B
(0, 1]  10  NaN
(1, 2]  20  NaN
(2, 3]  30  NaN
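As a rough sketch of one plausible result for a left join (this is a manual workaround, not existing join behavior), each point can be mapped to the interval containing it with IntervalIndex.get_indexer, and the two frames merged on the resulting positions. The `_pos` helper column and the variable names are made up for illustration, and the approach assumes the intervals do not overlap.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [10, 20, 30]}, index=pd.interval_range(0, 3))
df2 = pd.DataFrame({"B": ["foo", "bar", "baz", "qux"]},
                   index=[0.5, 1, 2.71828, 3.14159])

# Position of the interval containing each point, or -1 if no interval does.
# get_indexer requires the intervals to be non-overlapping.
positions = df1.index.get_indexer(df2.index)

# Merge on plain integer positions instead of on the indexes themselves.
# "_pos" is a hypothetical helper column, dropped again below.
left = df1.reset_index(drop=True)
left["_pos"] = np.arange(len(left))
right = df2.reset_index(drop=True)
right["_pos"] = positions

merged = left.merge(right, on="_pos", how="left")

# Restore the intervals as the index; an interval repeats once per matching point.
merged.index = df1.index.take(merged["_pos"].to_numpy())
result = merged.drop(columns="_pos")
```

Here (0, 1] shows up twice because it contains both 0.5 and 1, (1, 2] keeps a missing value for B, and the point 3.14159 is dropped since no interval contains it.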
I think the behavior of such a join/merge is straightforward for left/right joins, but is a little less clear for inner/outer joins. For inner (outer) joins one takes the intersection (union) of both indexes as the resulting index values. This makes sense when both indexes contain the same type of objects, but that is not the case for interval-point joins. I can't think of a consistent way to handle inner/outer joins, and I'm not entirely sure they even make sense. A few options:
- Do not support inner/outer interval-point joins
- For inner joins:
  - always keep the intervals and filter any non-matches?
  - default to the left index and filter any non-matches?
- For outer joins:
  - union any non-matching point values for an object dtype?
    - obviously non-performant and a bit weird
  - coerce non-matching points to degenerate intervals (left == right) and union for an interval dtype?
- Use a new API for non-exact interval joins?
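To make the two inner-join options concrete, here is a minimal sketch that builds both candidate results by hand. It is only an illustration under the assumption that the intervals do not overlap; the names `inner_points` and `inner_intervals` are hypothetical, and nothing here reflects how join or merge would actually implement it.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [10, 20, 30]}, index=pd.interval_range(0, 3))
df2 = pd.DataFrame({"B": ["foo", "bar", "baz", "qux"]},
                   index=[0.5, 1, 2.71828, 3.14159])

# Position of the containing interval for each point, -1 for non-matches.
positions = df1.index.get_indexer(df2.index)
matched = positions >= 0

# Rows that survive an inner join: points that fall inside some interval.
inner_points = df2[matched].assign(A=df1["A"].to_numpy()[positions[matched]])

# Option "always keep the intervals": same rows, relabeled by interval.
inner_intervals = inner_points.copy()
inner_intervals.index = df1.index.take(positions[matched])
```

Both frames hold the same three rows; they differ only in whether the result is indexed by the points (as in `inner_points`, which is what "default to the left index" would give with the point frame on the left) or by the intervals.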
I'm leaning towards just using the existing API and not supporting inner/outer joins for the time being, but would appreciate any thoughts.