Description
Code Sample, a copy-pastable example if possible
import pandas as pd

df = pd.DataFrame({'a': [1], 'val': [1.35]})
# no groupby, result is 'object', correct behavior
print(df['val'].transform(lambda x: x.map('+{}'.format)).dtype)
# result can't be converted to float, result is 'object', correct behavior
print(df.groupby('a')['val'].transform(lambda x: x.map('(+{})'.format)).dtype)
# result is 'float64' and plus sign is lost, INCORRECT behavior (should be 'object')
print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
# convert column to object type before transform:
df['val'] = df['val'].astype(object)
# same code as before, but now result is 'object' and plus sign is shown, correct behavior
print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
Problem description
It is sometimes useful to use DataFrame.groupby()[col].transform() to create a string column based on a float64 column. In the case that led to this report, I wanted a string column that shows the actual value for each group's minimum and, for the other rows, the difference from that minimum (e.g., "+1.2"). However, pandas unexpectedly converts the resulting column back to float64, dropping the '+' sign on the non-min rows. This only happens with groupby, not when the transformation is applied to the whole column. It also only happens if the original column has a float dtype and every resulting string can be parsed back into a float.
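For reference, the use case above can be sketched in a way that sidesteps the conversion: keep the groupby step numeric and build the strings outside of it. This is only an illustration under assumptions; the sample data, the `label` column name, and the one-decimal format are made up here and are not from the original code.

```python
import pandas as pd

# Hypothetical sample data: two groups, three rows.
df = pd.DataFrame({'a': [1, 1, 2], 'val': [1.35, 2.55, 4.0]})

# Numeric step: per-group minimum broadcast back to every row (stays float64).
group_min = df.groupby('a')['val'].transform('min')
diff = df['val'] - group_min

# String step, done outside groupby so no cast back to float is attempted.
labels = [
    '{}'.format(v) if d == 0 else '+{:.1f}'.format(d)
    for v, d in zip(df['val'], diff)
]
df['label'] = pd.Series(labels, index=df.index, dtype=object)

print(df['label'].tolist())
```

The min rows keep their plain value and the other rows get a signed difference, with the column explicitly created as object dtype.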
I don't think users expect string columns to be converted back to the type of the original column (float in this case), even if that is possible. Pandas also doesn't seem to promise this behavior, since it only occurs with groupby and not when applying transform to the whole column.
It is possible to work around this by converting the original column to object type before applying the groupby and transform, but I don't think that should be necessary.
I would recommend dropping this conversion and instead creating a new Series with the same dtype as the values returned by the transform function.
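To make the recommendation concrete, here is a rough sketch of the semantics I have in mind, written as a user-level helper rather than as a patch to pandas. The name `transform_keep_dtype` is hypothetical, and the sketch assumes the original index is unique.

```python
import pandas as pd

def transform_keep_dtype(series, keys, func):
    """Hypothetical helper: apply func to each group and reassemble the
    pieces in the original row order, without casting the result back to
    the dtype of the input. Assumes series.index is unique."""
    pieces = [func(group) for _, group in series.groupby(keys)]
    return pd.concat(pieces).reindex(series.index)

df = pd.DataFrame({'a': [1, 2, 1], 'val': [1.35, 2.0, 0.5]})
result = transform_keep_dtype(df['val'], df['a'],
                              lambda x: x.map('+{}'.format))

print(result.tolist())
print(result.dtype)
```

With this approach the strings returned by the function are kept as they are, so the '+' sign survives regardless of whether the strings could be parsed as floats.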
Expected Output
object
object
object
object
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
xarray: None
IPython: 5.7.0
sphinx: 1.6.1
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2017.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
feather: None
matplotlib: 2.2.2
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None