ENH: implement pd.Series.corr(method="distance") #22402

Closed
@dsaxton

Description

Distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) is a powerful yet underused measure of dependence between two random variables, and I think it would make a very nice addition to the existing correlation methods in pandas. In particular, two random variables $X$ and $Y$ (with finite first moments) are independent if and only if their distance correlation is zero, which cannot be said of Pearson, Spearman or Kendall.
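
For reference, here is a sketch of the sample statistic, following the definition on the linked Wikipedia page. The index and bar notation is chosen here for illustration, and $(x_j, y_j)$, $j = 1, \dots, n$, denotes the paired observations:

$$a_{jk} = |x_j - x_k|, \qquad b_{jk} = |y_j - y_k|$$

$$A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}$$

$$\mathrm{dCov}_n^2(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{jk} B_{jk}, \qquad \mathrm{dVar}_n^2(X) = \mathrm{dCov}_n^2(X, X)$$

$$\mathrm{dCor}_n(X, Y) = \frac{\mathrm{dCov}_n(X, Y)}{\sqrt{\mathrm{dVar}_n(X)\,\mathrm{dVar}_n(Y)}}$$

Here $\bar{a}_{j\cdot}$, $\bar{a}_{\cdot k}$ and $\bar{a}_{\cdot\cdot}$ are the row, column and grand means of the distance matrix, and $\mathrm{dCor}_n$ is taken to be zero when either distance variance is zero.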

The code below is a pure-NumPy implementation (which could certainly be optimized or written more elegantly) that could become part of the Series class and be called from within corr. Later it could be integrated with corrwith. If this feature were available, it would personally be one of the first things I looked at when approaching a regression problem.

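import numpy as np

# self and other can be assumed to be aligned already
def nandistcorr(self, other):
    n = len(self)
    a = np.zeros(shape=(n, n))
    b = np.zeros(shape=(n, n))

    # fill the upper triangle with pairwise absolute differences
    for i in range(n):
        for j in range(i+1, n):
            a[i, j] = abs(self[i] - self[j])
            b[i, j] = abs(other[i] - other[j])

    # mirror into the lower triangle (the diagonal stays zero)
    a = a + a.T
    b = b + b.T

    # each row of a_bar / b_bar holds the column means of the distance matrix
    a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
    b_bar = np.vstack([np.nanmean(b, axis=0)] * n)

    # double centering: subtract column and row means, add back the grand mean
    A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
    B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())

    # distance covariance and the square roots of the two distance variances
    cov_ab = np.sqrt(np.nansum(A * B)) / n
    std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
    std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)

    return cov_ab / std_a / std_b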
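
As one possible optimization, the double loop could be replaced with NumPy broadcasting. The following is only a sketch: the name `distance_correlation` is hypothetical, dropping pairs that contain a NaN is an assumed missing-data policy (it is not what the loop version does), and returning 0 for a constant input follows the usual convention rather than anything pandas defines.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D arrays (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of equal length")

    # Assumed NaN policy: drop every pair in which either value is missing.
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]

    def centered_distances(v):
        # Pairwise absolute differences via broadcasting: d[i, j] = |v_i - v_j|.
        d = np.abs(v[:, None] - v[None, :])
        # Double centering: subtract column and row means, add back the grand mean.
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    A = centered_distances(x)
    B = centered_distances(y)

    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvar2_x = (A * A).mean()  # squared sample distance variance of x
    dvar2_y = (B * B).mean()  # squared sample distance variance of y

    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        # Convention: the statistic is zero when either input is constant.
        return 0.0
    # max(...) guards against a tiny negative value caused by rounding.
    return float(np.sqrt(max(dcov2, 0.0) / denom))


rng = np.random.RandomState(2357)
x = rng.randn(500)
print(distance_correlation(x, x ** 2))
```

This still builds two n-by-n matrices, so memory grows quadratically with the length of the input. In pandas versions where `Series.corr` accepts a callable for `method`, a function like this could be passed directly, for example `a.corr(b, method=distance_correlation)` for two Series `a` and `b`.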

Here's an example that shows how distance correlation can detect relationships that the other common correlation methods miss:

import numpy as np
import pandas as pd
np.random.seed(2357)

s1 = pd.Series(np.random.randn(1000))
s2 = s1**2

s1.corr(s2, method="pearson")
s1.corr(s2, method="spearman")
s1.corr(s2, method="kendall")
nandistcorr(s1.values, s2.values)

Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Enhancement
