Description
Distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) is a powerful yet underused technique for measuring the dependence between two random variables, and I think it would make a very nice addition to the existing correlation methods in pandas. For one, it has the unique property that two random variables are independent if and only if their distance correlation is zero.
The code below is a pure NumPy implementation (which could certainly be optimized or written more elegantly) that could be part of the Series class and then called within corr. Later it could be integrated seamlessly with corrwith. If this feature were available, it would personally be one of the first things I would look at when approaching a regression problem.
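For reference, the sample statistic described on the linked Wikipedia page, which the function that follows is meant to compute, can be written as below. The symbols (x, y for the two samples of length n, and the bar notation for row, column and grand means) are notation chosen here for illustration.

```latex
\begin{align*}
a_{jk} &= |x_j - x_k|, \qquad b_{jk} = |y_j - y_k|, \qquad j, k = 1, \dots, n \\
A_{jk} &= a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad
B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot} \\
\operatorname{dCov}_n^2(x, y) &= \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{jk} B_{jk}, \qquad
\operatorname{dVar}_n^2(x) = \operatorname{dCov}_n^2(x, x) \\
\operatorname{dCor}_n(x, y) &= \frac{\operatorname{dCov}_n(x, y)}{\sqrt{\operatorname{dVar}_n(x)\,\operatorname{dVar}_n(y)}}
\end{align*}
```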
# self and other can be assumed to be aligned already
def nandistcorr(self, other):
    n = len(self)
    # Pairwise absolute differences: fill the upper triangle, then mirror it
    a = np.zeros(shape=(n, n))
    b = np.zeros(shape=(n, n))
    for i in range(n):
        for j in range(i+1, n):
            a[i, j] = abs(self[i] - self[j])
            b[i, j] = abs(other[i] - other[j])
    a = a + a.T
    b = b + b.T
    # Column means repeated as rows; the transpose gives the row means
    a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
    b_bar = np.vstack([np.nanmean(b, axis=0)] * n)
    # Double-center: subtract row and column means, add back the grand mean
    A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
    B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())
    # Distance covariance and the square roots of the two distance variances
    cov_ab = np.sqrt(np.nansum(A * B)) / n
    std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
    std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)
    return cov_ab / std_a / std_b
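As one possible optimization, the double Python loop can be replaced with NumPy broadcasting. The following is only a sketch under stated assumptions: the name distcorr_vectorized is hypothetical, the inputs are assumed to be NaN-free 1-D arrays of equal length, and returning 0.0 when either input is constant is a convention assumed here.

```python
import numpy as np


def distcorr_vectorized(x, y):
    """Sample distance correlation of two equal-length 1-D arrays.

    Hypothetical helper for illustration; it does not handle NaN values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")

    # Pairwise absolute differences via broadcasting: a[i, j] = |x[i] - x[j]|
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double-center: subtract column and row means, add back the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    # Squared sample distance covariance and distance variances
    dcov2 = max((A * B).mean(), 0.0)  # clip tiny negative rounding errors
    dvar2_x = (A * A).mean()
    dvar2_y = (B * B).mean()

    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        # Assumed convention: a constant input gives a correlation of 0
        return 0.0
    return float(np.sqrt(dcov2 / denom))
```

This still builds two n-by-n matrices, so memory use grows quadratically with the length of the inputs.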
Here's an example that shows how distance correlation can detect relationships that the other common correlation methods miss:
import numpy as np
import pandas as pd
np.random.seed(2357)
s1 = pd.Series(np.random.randn(1000))
s2 = s1**2
s1.corr(s2, method="pearson")
s1.corr(s2, method="spearman")
s1.corr(s2, method="kendall")
nandistcorr(s1.values, s2.values)
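In the meantime, Series.corr, DataFrame.corr and DataFrame.corrwith are documented to accept a callable as method (one that takes two 1-D ndarrays and returns a float), so a distance correlation function can be tried out without any changes to pandas. A minimal sketch, where distcorr is a hypothetical helper and not part of pandas:

```python
import numpy as np
import pandas as pd


def distcorr(x, y):
    """Hypothetical helper: sample distance correlation of two 1-D ndarrays."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center each pairwise distance matrix
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return 0.0  # assumed convention for constant input
    return float(np.sqrt(max((A * B).mean(), 0.0) / denom))


rng = np.random.default_rng(2357)
s1 = pd.Series(rng.standard_normal(500))
s2 = s1 ** 2

# Pass the function itself instead of a method name
r_series = s1.corr(s2, method=distcorr)

# The same callable works for a full correlation matrix and for corrwith
df = pd.DataFrame({"x": s1, "x_squared": s2})
r_matrix = df.corr(method=distcorr)
r_with = df.corrwith(s1, method=distcorr)
```

pandas aligns the inputs and drops missing pairs before calling the function, so the helper itself does not need any NaN handling in this setup.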