BUG: REGRESSION: DataFrame.corr() floating point inaccuracy #45640

Closed
@tritemio

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 5)))

print(1 - np.diag(df.corr().abs()).min())
# pandas 1.4.0
# -4.440892098500626e-16
# pandas 1.3.5
# 0.0

Issue Description

With pandas 1.4.0, df.corr() returns a matrix whose diagonal is not exactly 1: the entries are off by floating point rounding error (about 4e-16 in the example above).

In pandas 1.3.5 the diagonal of df.corr() was exactly 1.

The example above shows the difference.

This causes problems when dist = 1 - df.corr().abs() is used as a distance matrix for clustering. In particular, the call to scipy.spatial.distance.squareform(dist) raises an error with pandas 1.4.0 because the diagonal of dist is not exactly 0.
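
A possible workaround, sketched under the assumption that forcing the diagonal to zero is acceptable for the clustering use case (the distance of a column to itself is 0 by definition). This is not an official fix, and the name `condensed` is only illustrative:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 5)))

# Correlation-based distance matrix as a writable NumPy array.
# Its diagonal may differ from 0 by rounding error on affected pandas versions.
dist = (1 - df.corr().abs()).to_numpy(copy=True)

# Workaround (assumption): set the diagonal to exactly 0 in place.
np.fill_diagonal(dist, 0.0)

# squareform's validity check now sees an exactly-zero diagonal.
condensed = squareform(dist)
print(condensed.shape)  # (10,) for 5 columns: 5 * 4 / 2 pairs
```

Another option may be squareform(dist, checks=False), which skips the validation, but it also skips the symmetry check, so zeroing the diagonal explicitly is the more conservative choice.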

Expected Behavior

The diagonal of df.corr() should be exactly 1, as it was in pandas 1.3.5.

Labels: Algos, Bug, Regression
