DataFrame.groupby()[col].transform() tries to convert result to original column type #22243

Closed
@mfripp

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'a': [1], 'val': [1.35]})
# no groupby: result is 'object', correct behavior
print(df['val'].transform(lambda x: x.map('+{}'.format)).dtype)
# groupby, but the strings can't be converted to float: result is 'object', correct behavior
print(df.groupby('a')['val'].transform(lambda x: x.map('(+{})'.format)).dtype)
# groupby, strings can be converted to float: result is 'float64' and the plus sign is lost,
# INCORRECT behavior (should be 'object')
print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)
# convert column to object type before transform:
df['val'] = df['val'].astype(object)
# same code as before, but now result is 'object' and plus sign is shown, correct behavior
print(df.groupby('a')['val'].transform(lambda x: x.map('+{}'.format)).dtype)

Problem description

It is sometimes useful to use DataFrame.groupby()[col].transform() to create a string column based on a float64 column. In the case leading up to this issue report, I wanted a string column showing the actual value for each group's minimum and, for the other rows, the difference from that minimum (e.g., "+1.2"). However, pandas unexpectedly converts this column back to float64 values, dropping the '+' sign for the non-min rows. This only happens with groupby, not when the transformation is applied to the whole column. It also only happens if the original column is a float type and the column of strings can be successfully converted back to floats.

I don't think users expect string columns to be converted back to the type of the original column (float in this case), even if that is possible. Pandas also doesn't seem to promise this behavior, since it only occurs with groupby and not when applying transform to the whole column.

It is possible to work around this by converting the original column to object type before applying the groupby and transform, but I don't think that should be necessary.
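For the use case described above, another way to sidestep the conversion is to keep the groupby step purely numeric and build the strings outside of transform(). The following is only a sketch under assumptions: the sample data and the names group_min, diff and labels are made up for illustration, and it is an alternative to the object cast, not a fix for the underlying behavior.

```python
import pandas as pd

# Hypothetical sample data: two groups with easily checked differences.
df = pd.DataFrame({'a': ['x', 'x', 'y', 'y'],
                   'val': [1.0, 2.5, 4.0, 4.5]})

# Keep the groupby step numeric: broadcast each group's minimum to its rows.
group_min = df.groupby('a')['val'].transform('min')

# Difference of every row from its group minimum (0.0 for the min rows).
diff = df['val'] - group_min

# Build the strings outside of groupby().transform(), so nothing gets a
# chance to convert them back to float: min rows keep the plain value,
# all other rows show '+<difference>'.
labels = df['val'].map(str).where(diff == 0, diff.map('+{}'.format))

print(labels.tolist())
```

Because the formatting happens after the grouped computation has finished, the result holds strings regardless of how transform() treats the return type.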

I would recommend dropping this conversion and creating a new Series with the same type returned by the transform function.

Expected Output

object
object
object
object

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
xarray: None
IPython: 5.7.0
sphinx: 1.6.1
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2017.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
feather: None
matplotlib: 2.2.2
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.6.0
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
