BUG: DataFrame(data, ...) creates a copy when 'data' is a NumPy array (pandas 3.0+) #58913

Closed
@jameslamb

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

X = np.random.default_rng().uniform(size=(10, 2)).astype(np.float32)
df = pd.DataFrame(X)
df_arr = df.to_numpy(dtype=np.float32)

np.shares_memory(X, df.values)
# numpy==1.26.4    , pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==3.0.0.dev0+1067 : False

np.shares_memory(X, df_arr)
# numpy==1.26.4    , pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==2.2.2           : True
# numpy==2.1.0.dev0, pandas==3.0.0.dev0+1067 : False

Issue Description

Starting with the pandas 3.0.0 development builds (observed on 3.0.0.dev0+1067), it appears that DataFrame(data) creates a copy when data is a NumPy array.

Expected Behavior

I expected df.values and the result of df.to_numpy() (with copy argument omitted) to return the same numpy array that the DataFrame was created from.

I believe that has been the behavior for most combinations of pandas and numpy for at least the last 2 years. Since January 2022 (microsoft/LightGBM#4927), LightGBM has run a test with similar code to confirm that lightgbm isn't creating unnecessary copies in its pandas support, and that test is now failing with pandas>=3.0.0.
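For comparison, here is a minimal sketch that requests the no-copy behavior explicitly instead of relying on the constructor's default. It assumes the documented `copy` parameter of `DataFrame` is honored for 2D NumPy input on both release lines; the name `df_no_copy` is only illustrative, and I have not confirmed that this is the intended migration path.

```python
import numpy as np
import pandas as pd

X = np.random.default_rng(0).uniform(size=(10, 2)).astype(np.float32)

# Assumption: passing copy=False asks the constructor to wrap X
# rather than copy it, regardless of what the default is.
df_no_copy = pd.DataFrame(X, copy=False)

# Same checks as in the reproducible example, against the explicit-copy frame.
shares_values = np.shares_memory(X, df_no_copy.values)
shares_to_numpy = np.shares_memory(X, df_no_copy.to_numpy(dtype=np.float32))

print(pd.__version__, shares_values, shares_to_numpy)
```

If both checks report shared memory on pandas 3.0.0.dev0 as well, then the difference is only in the default for `copy`, not in whether a zero-copy construction is possible at all.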

Apologies in advance if this is intentional behavior. I did try to look through the git blame, issues, and PRs. Did not see anything in these possibly-related discussions:

Installed Versions

Observed this in a conda environment on an M2 MacBook, using Python 3.11.9.

output of 'conda info'
     active environment : lgb-dev
    active env location : /Users/jlamb/miniforge3/envs/lgb-dev
            shell level : 1
       user config file : /Users/jlamb/.condarc
 populated config files : /Users/jlamb/miniforge3/.condarc
          conda version : 24.3.0
    conda-build version : not installed
         python version : 3.12.3.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=m2
                          __conda=24.3.0=0
                          __osx=14.4.1=0
                          __unix=0=0
       base environment : /Users/jlamb/miniforge3  (writable)
      conda av data dir : /Users/jlamb/miniforge3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /Users/jlamb/miniforge3/pkgs
                          /Users/jlamb/.conda/pkgs
       envs directories : /Users/jlamb/miniforge3/envs
                          /Users/jlamb/.conda/envs
               platform : osx-arm64
             user-agent : conda/24.3.0 requests/2.31.0 CPython/3.12.3 Darwin/23.4.0 OSX/14.4.1 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 501:20
             netrc file : None
           offline mode : False
How I installed stable versions of `numpy`, `pandas`, and `pyarrow`
conda install \
    -c conda-forge \
    --yes \
        'numpy' \
        'pandas' \
        'pyarrow'
How I gradually replaced those versions with latest nightlies

numpy and pyarrow:

python -m pip install \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    --prefer-binary \
    --pre \
    --upgrade \
        'numpy>=2.1.0.dev0'

python -m pip install \
    --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
    --prefer-binary \
    --pre \
    --upgrade \
        'pyarrow>=17.0.0.dev227'

pandas:

python -m pip install \
    --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
    --prefer-binary \
    --pre \
    --upgrade \
        'pandas>=3.0.0.dev0'
output of 'pd.show_versions()' with all nightlies installed
python -c "import pandas; pandas.show_versions()"

result:

INSTALLED VERSIONS
------------------
commit                : 76c7274985215c487248fa5640e12a9b32a06e8c
python                : 3.11.9.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.4.0
Version               : Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1067.g76c7274985
numpy                 : 2.1.0.dev0+git20240531.da5a779
pytz                  : 2024.1
dateutil              : 2.9.0
setuptools            : 69.5.1
pip                   : 24.0
Cython                : None
pytest                : 8.2.0
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : None
IPython               : None
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pyarrow               : 17.0.0.dev227
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : 1.13.0
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
