Closed
Description
The following runs quickly in 0.15.2 but has a performance issue on the last operation, df.T.duplicated(), in 0.16.0 and 0.16.1.
Also, on a private data set that works in 0.15.2, I get an error in 0.16.0 and 0.16.1 on the same operation.
code:
import pandas, numpy

# Small frame: 1000 identical rows, two identical columns.
df = pandas.DataFrame({'A': [1 for x in range(1000)],
                       'B': [1 for x in range(1000)]})
print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))

# Large frame: the transposed call below is the slow one.
df = pandas.DataFrame({'A': [1 for x in range(1000000)],
                       'B': [1 for x in range(1000000)]})
print(numpy.count_nonzero(df.duplicated()))
print(numpy.count_nonzero(df.T.duplicated()))
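To put numbers on the slowdown, a minimal timing sketch along these lines could be run under each pandas version. The helper name time_duplicated and the frame sizes are illustrative assumptions, not part of the original report. It builds the same constant two-column frame as above and times duplicated() on the frame and on its transpose separately.

```python
import time

import numpy as np
import pandas as pd


def time_duplicated(n_rows):
    """Time duplicated() on a constant two-column frame and on its transpose.

    Hypothetical helper for illustration only; returns the duplicate counts
    and the wall-clock seconds for each call.
    """
    df = pd.DataFrame({"A": np.ones(n_rows, dtype=np.int64),
                       "B": np.ones(n_rows, dtype=np.int64)})

    # Row-wise check on the tall frame (n_rows x 2).
    start = time.perf_counter()
    row_dups = int(np.count_nonzero(df.duplicated()))
    row_seconds = time.perf_counter() - start

    # Same check on the transpose (2 x n_rows), the call reported as slow.
    start = time.perf_counter()
    col_dups = int(np.count_nonzero(df.T.duplicated()))
    col_seconds = time.perf_counter() - start

    return {"row_dups": row_dups, "row_seconds": row_seconds,
            "col_dups": col_dups, "col_seconds": col_seconds}


if __name__ == "__main__":
    # Sizes are kept small here; increase them gradually to see the trend.
    for n in (1000, 10000):
        print(n, time_duplicated(n))
```

Comparing col_seconds across 0.15.2, 0.16.0 and 0.16.1 as n_rows grows should show whether the cost of the transposed call scales with the number of columns.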
This is the error I get on the private data set (not yet reproduced with synthetic data):
  File "C:\Anaconda3\lib\site-packages\pandas\util\decorators.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2867, in duplicated
    labels, shape = map(list, zip( * map(f, vals)))
  File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2856, in f
    labels, shape = factorize(vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
  File "C:\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 135, in factorize
    labels = table.get_labels(vals, uniques, 0, na_sentinel)
  File "pandas\hashtable.pyx", line 813, in pandas.hashtable.PyObjectHashTable.get_labels (pandas\hashtable.c:14025)
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)