Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
X = np.random.default_rng().uniform(size=(10, 2)).astype(np.float32)
df = pd.DataFrame(X)
df_arr = df.to_numpy(dtype=np.float32)
np.shares_memory(X, df.values)
# numpy==1.26.4 , pandas==2.2.2 : True
# numpy==2.1.0.dev0, pandas==2.2.2 : True
# numpy==2.1.0.dev0, pandas==3.0.0.dev0+1067 : False
np.shares_memory(X, df_arr)
# numpy==1.26.4 , pandas==2.2.2 : True
# numpy==2.1.0.dev0, pandas==2.2.2 : True
# numpy==2.1.0.dev0, pandas==3.0.0.dev0+1067 : False
Issue Description
Starting with pandas==3.0.0, it appears that DataFrame(data) creates a copy when data is a numpy array.
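To separate the constructor's default from an explicit request, the sketch below passes the constructor's `copy` argument both ways and compares against the default. It is only an illustration: the seed and the `shares` dictionary are not part of the original report, and the result for the default case is expected to vary by pandas version.

```python
import numpy as np
import pandas as pd

# Seed chosen only to make the sketch repeatable.
X = np.random.default_rng(0).uniform(size=(10, 2)).astype(np.float32)

# Default: whether this copies is exactly what differs between versions.
df_default = pd.DataFrame(X)
# Explicit requests, for comparison with the default.
df_nocopy = pd.DataFrame(X, copy=False)
df_copy = pd.DataFrame(X, copy=True)

# Record whether each DataFrame's array still points at X's buffer.
shares = {
    "default": np.shares_memory(X, df_default.to_numpy()),
    "copy=False": np.shares_memory(X, df_nocopy.to_numpy()),
    "copy=True": np.shares_memory(X, df_copy.to_numpy()),
}
print(shares)
```

If "default" matches "copy=True" on one version and "copy=False" on another, that points to a changed constructor default rather than a change in `to_numpy()`.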
Expected Behavior
I expected df.values and the result of df.to_numpy() (with the copy argument omitted) to return the same numpy array that the DataFrame was created from.
I believe that has been the behavior of most combinations of pandas and numpy for at least the last 2 years. I say that because LightGBM has run a test with similar code since January 2022 (microsoft/LightGBM#4927) to confirm that lightgbm isn't creating unnecessary copies in its pandas support, and that test is now failing with pandas>=3.0.0.
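For context, a no-extra-copy check can also be written so that it does not depend on what the DataFrame constructor does. The sketch below is not LightGBM's actual test: `to_float32_array` and `conversion_adds_copy` are hypothetical names standing in for a library's pandas-to-NumPy conversion. It compares the converted array against the DataFrame's own buffer instead of the original array.

```python
import numpy as np
import pandas as pd


def to_float32_array(df: pd.DataFrame) -> np.ndarray:
    """Hypothetical stand-in for a library's pandas-to-NumPy conversion."""
    # copy=False asks pandas to avoid a copy when no dtype conversion is needed.
    return df.to_numpy(dtype=np.float32, copy=False)


def conversion_adds_copy(df: pd.DataFrame) -> bool:
    """True if the conversion allocated memory beyond what df already holds."""
    return not np.shares_memory(df.to_numpy(), to_float32_array(df))


X = np.random.default_rng(0).uniform(size=(10, 2)).astype(np.float32)

# Same dtype: the conversion should be able to reuse the DataFrame's buffer.
same_dtype_copy = conversion_adds_copy(pd.DataFrame(X))
# Different dtype: float64 -> float32 has to allocate a new array.
cast_copy = conversion_adds_copy(pd.DataFrame(X.astype(np.float64)))
print(same_dtype_copy, cast_copy)
```

A check of this shape only catches copies made by the conversion step. It says nothing about whether `DataFrame(data)` itself copied, which is the question this issue raises.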
Apologies in advance if this is intentional behavior. I did try to look through the git blame, issues, and PRs. Did not see anything in these possibly-related discussions:
- DISC: path to nullable-by-default integers and floats #58243
- DEPR (PDEP-7/CoW): deprecate and remove copy keyword (except in constructors) #56022
- BUG: CoW and np.asarray leads to surprising issues #52823
Installed Versions
Observed this in a conda environment on an M2 MacBook, using Python 3.11.9.
output of 'conda info'
active environment : lgb-dev
active env location : /Users/jlamb/miniforge3/envs/lgb-dev
shell level : 1
user config file : /Users/jlamb/.condarc
populated config files : /Users/jlamb/miniforge3/.condarc
conda version : 24.3.0
conda-build version : not installed
python version : 3.12.3.final.0
solver : libmamba (default)
virtual packages : __archspec=1=m2
__conda=24.3.0=0
__osx=14.4.1=0
__unix=0=0
base environment : /Users/jlamb/miniforge3 (writable)
conda av data dir : /Users/jlamb/miniforge3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/osx-arm64
https://conda.anaconda.org/conda-forge/noarch
package cache : /Users/jlamb/miniforge3/pkgs
/Users/jlamb/.conda/pkgs
envs directories : /Users/jlamb/miniforge3/envs
/Users/jlamb/.conda/envs
platform : osx-arm64
user-agent : conda/24.3.0 requests/2.31.0 CPython/3.12.3 Darwin/23.4.0 OSX/14.4.1 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
UID:GID : 501:20
netrc file : None
offline mode : False
How I installed stable versions of `numpy`, `pandas`, and `pyarrow`
conda install \
-c conda-forge \
--yes \
'numpy' \
'pandas' \
'pyarrow'
How I gradually replaced those versions with latest nightlies
numpy and pyarrow:
python -m pip install \
--extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
--prefer-binary \
--pre \
--upgrade \
'numpy>=2.1.0.dev0'
python -m pip install \
--extra-index-url https://pypi.fury.io/arrow-nightlies/ \
--prefer-binary \
--pre \
--upgrade \
'pyarrow>=17.0.0.dev227'
pandas:
python -m pip install \
--extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple \
--prefer-binary \
--pre \
--upgrade \
'pandas>=3.0.0.dev0'
output of 'pd.show_versions()' with all nightlies installed
python -c "import pandas; pandas.show_versions()"
result:
INSTALLED VERSIONS
------------------
commit : 76c7274985215c487248fa5640e12a9b32a06e8c
python : 3.11.9.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+1067.g76c7274985
numpy : 2.1.0.dev0+git20240531.da5a779
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : 8.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 17.0.0.dev227
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None