Description
Code Sample, a copy-pastable example if possible
# =============================================================================
# NUMPY VS PANDAS: DIFFERENT ESTIMATION OF COVARIANCE IN PRESENCE OF NAN VALUES
# =============================================================================
import numpy as np
import pandas as pd

# data with nan values
M = np.random.randn(10,2)
# add missing values
M[0,0] = np.nan
M[1,1] = np.nan
# Covariance matrix calculations
# ==============================
# numpy
# -----
masked_arr = np.ma.array(M, mask=np.isnan(M))
cov_numpy = np.ma.cov(masked_arr, rowvar=False, allow_masked=True, ddof=1).data
# pandas
# ------
cov_pandas = pd.DataFrame(M).cov(min_periods=0).values
# Homemade covariance coefficient calculation
# (what each of them is actually doing)
# =============================================
# select elements to estimate the covariance matrix element (0,1)
x = M[:,0]
y = M[:,1]
mask_x = ~np.isnan(x)
mask_y = ~np.isnan(y)
mask_common = mask_x & mask_y
# numpy
# -----
xn = x-np.mean(x[mask_x])
yn = y-np.mean(y[mask_y])
cov_np = sum(a*b for a,b,c in zip(xn,yn, mask_common) if c)/(np.sum(mask_common)-1)
# pandas
# ------
xn = x-np.mean(x[mask_common])
yn = y-np.mean(y[mask_common])
cov_pd = sum(a*b for a,b,c in zip(xn,yn, mask_common) if c)/(np.sum(mask_common)-1)
Problem description
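I am trying to calculate a covariance matrix in the presence of missing values, and I've noticed that numpy and pandas return different matrices, and that the difference grows as the proportion of missing values increases. The snippet above reproduces both calculations: for each pair of columns, numpy centers each column with the mean of all of its own non-missing values, while pandas centers it with the mean of only the rows where both columns are present. The numpy approach is more useful to me; it seems to be more robust in the presence of missing values.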
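To make the difference easy to check by hand, here is a minimal, dependency-free sketch of the two conventions on a tiny made-up dataset. The function `pairwise_cov` and its `common_means` flag are hypothetical names used only for illustration; they are not part of numpy or pandas, and the sketch simply mirrors the "homemade" calculations in the snippet above.

```python
import math


def pairwise_cov(x, y, common_means):
    """Sample covariance (ddof=1) of x and y over rows where both are present.

    common_means=False: center each series with the mean of all of its own
    non-missing values (the "numpy" calculation in the snippet above).
    common_means=True: center each series with the mean of only the rows
    where both values are present (the "pandas" calculation above).
    """
    # True for rows where neither value is missing
    both = [not (math.isnan(a) or math.isnan(b)) for a, b in zip(x, y)]
    if common_means:
        xs = [a for a, ok in zip(x, both) if ok]
        ys = [b for b, ok in zip(y, both) if ok]
    else:
        xs = [a for a in x if not math.isnan(a)]
        ys = [b for b in y if not math.isnan(b)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Products are always summed over the common rows only
    total = sum((a - mean_x) * (b - mean_y)
                for a, b, ok in zip(x, y, both) if ok)
    return total / (sum(both) - 1)


# Made-up data: one missing value in each series, two common rows
nan = math.nan
x = [nan, 2.0, 4.0, 6.0]
y = [1.0, nan, 3.0, 8.0]

cov_own = pairwise_cov(x, y, common_means=False)    # own-column means
cov_common = pairwise_cov(x, y, common_means=True)  # common-row means
print(cov_own, cov_common)  # 8.0 5.0
```

With only two common rows, the own-column means (4.0 and 4.0) differ noticeably from the common-row means (5.0 and 5.5), so the two estimates diverge (8.0 versus 5.0). With complete data both conventions use the same means and give the same result.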