Description
Code Sample, a copy-pastable example if possible
# -*- coding: utf-8 -*-
import pandas as pd
def _do_calc_(_df):
    df2 = _df.copy()
    df2["vol_ma20"] = df2["volume"].rolling(20, min_periods=1).mean()
    return df2

dbars = pd.read_csv("2019.dbar_ftridx.csv.gz",
                    index_col=False,
                    encoding="utf-8-sig",
                    parse_dates=["trade_day"])
dbars = dbars.set_index("trade_day", drop=False)  # everything works fine if this line is commented out
df = dbars.groupby("exchange").apply(_do_calc_)
print(len(df))
Problem description
Here is the input data file:
2019.dbar_ftridx.csv.gz
This code runs fine with pandas 0.23. After upgrading to 0.24.2, it raises the following error:
Traceback (most recent call last):
File "D:/test/groupby_bug.py", line 16, in <module>
df = dbars.groupby("exchange").apply(_do_calc_)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 701, in apply
return self._python_apply_general(f)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 712, in _python_apply_general
not_indexed_same=mutated or self.mutated)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py", line 318, in _wrap_applied_output
not_indexed_same=not_indexed_same)
File "C:\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py", line 918, in _concat_objects
sort=False)
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 228, in concat
copy=copy, sort=sort)
File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 292, in __init__
obj._consolidate(inplace=True)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5156, in _consolidate
self._consolidate_inplace()
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5138, in _consolidate_inplace
self._protect_consolidate(f)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5127, in _protect_consolidate
result = f()
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 5136, in f
self._data = self._data.consolidate()
File "C:\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 922, in consolidate
bm = self.__class__(self.blocks, self.axes)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 114, in __init__
self._verify_integrity()
File "C:\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 311, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1691, in construction_error
passed, implied))
ValueError: Shape of passed values is (432, 27), indices imply (1080, 27)
If I do not call set_index() on the DataFrame, it works fine. Could something be wrong with how the DatetimeIndex is handled?
I don't know whether this error can be reproduced on your machine, but I can reproduce it on all of mine.
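For anyone who cannot download the attachment, here is a minimal sketch of the same steps on synthetic data. The helper `make_bars`, the `symbol` column, and the assumption that the same dates repeat across instruments (so the index contains duplicates after set_index) are all hypothetical stand-ins, not taken from the real file. I have not verified that this synthetic frame triggers the same ValueError on 0.24.2; it only mirrors the shape of the script above so it can be run against different pandas versions.

```python
import numpy as np
import pandas as pd


def make_bars(exchanges=("A", "B", "C"), symbols_per_exchange=2, n_days=30):
    """Build synthetic daily bars (hypothetical stand-in for the CSV)."""
    days = pd.date_range("2019-01-01", periods=n_days, freq="D")
    frames = []
    for exch in exchanges:
        for i in range(symbols_per_exchange):
            frames.append(pd.DataFrame({
                "trade_day": days,                      # same dates for every symbol
                "exchange": exch,
                "symbol": "{}{}".format(exch, i),       # hypothetical column
                "volume": np.arange(n_days) * (i + 1),  # deterministic volumes
            }))
    return pd.concat(frames, ignore_index=True)


def add_vol_ma20(group):
    # Same calculation as _do_calc_ in the report.
    out = group.copy()
    out["vol_ma20"] = out["volume"].rolling(20, min_periods=1).mean()
    return out


bars = make_bars()
# Assumption: dates repeat across symbols, so this index has duplicate labels.
bars = bars.set_index("trade_day", drop=False)
result = bars.groupby("exchange").apply(add_vol_ma20)
print(pd.__version__, len(result), len(bars))
```

If the row counts printed at the end match, the apply step completed for that pandas version.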
Expected Output
4104
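As a possible way to get the same column without going through apply (and so without concatenating per-group DataFrames), the rolling mean can be computed with groupby(...).transform. This is only a sketch on a tiny hand-made frame; the data below is made up, and I have not confirmed that this sidesteps the error on 0.24.2.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the script in this report.
bars = pd.DataFrame({
    "trade_day": pd.to_datetime(
        ["2019-01-02", "2019-01-03", "2019-01-02", "2019-01-03"]),
    "exchange": ["X", "X", "Y", "Y"],
    "volume": [10, 30, 100, 300],
}).set_index("trade_day", drop=False)

# transform returns one value per input row, in the original row order,
# so the row count of the frame is unchanged.
vol_ma20 = bars.groupby("exchange")["volume"].transform(
    lambda s: s.rolling(20, min_periods=1).mean())

# Assign positionally (.values) to avoid aligning on the duplicated dates.
bars["vol_ma20"] = vol_ma20.values

print(len(bars))                  # 4
print(bars["vol_ma20"].tolist())  # [10.0, 20.0, 100.0, 200.0]
```

Because the frame is modified in place rather than rebuilt from per-group pieces, len(bars) stays equal to the number of input rows.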
Output of pd.show_versions()
C:\Anaconda3\python.exe D:/test/groupby_bug.py
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: 4.4.0
pip: 19.0.3
setuptools: 41.0.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.6
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None