ENH: accept dict of column:dtype as dtype argument in DataFrame.astype #12086

Closed
97 commits
0aeee8d
ENH: inplace dtype changes, df per-column dtype changes; GH7271
StephenKappel May 8, 2016
58dd71b
ENH: NDFrame astype() now accepts inplace arg and dtype arg can be a …
StephenKappel May 10, 2016
43989fd
DOC: xref #13112, add back lexsorting example
jreback May 10, 2016
f0e47a9
COMPAT: boto import issues
jreback May 11, 2016
d0734ba
BUG: Added checks for NaN in __call__ of EngFormatter
yaduart May 11, 2016
2a99394
TST: fix assert_categorical_equal message
sinhrks May 11, 2016
4aa6323
BUG: Series ops with object dtype may incorrectly fail
sinhrks May 3, 2016
4de83d2
PERF: quantile now operates per block boosting perf / fix quantile wi…
jreback May 12, 2016
c9ffd78
DOC: Fix delim_whitespace regex typo.
dsm054 May 13, 2016
e5c18b4
BUG: Correct KeyError from matplotlib when processing Series yerr
gliptak May 13, 2016
00d4ec3
BUG: Misc fixes for SparseSeries indexing with MI
sinhrks May 13, 2016
82f54bd
ENH/BUG: str.extractall doesn't support index
sinhrks May 13, 2016
01dd111
DOC: Fix additional join examples in "10 Minutes to pandas" #13029
Xndr7 May 13, 2016
feee089
BUG: Bug in .groupby(..).resample(..) when the same object is called …
jreback May 14, 2016
b385799
DOC: Clarify Categorical Crosstab Behaviour
gfyoung May 14, 2016
2de2884
BUG: GH12896 where extra elements are returned in MultiIndex slicing
kawochen May 14, 2016
f637aa3
TST: Use compatible time zones
neirbowj May 15, 2016
62bed0e
COMPAT: Add Pathlib, py.path support for read_hdf
quintusdias May 16, 2016
4e4a7d9
COMPAT/TST: sparse formatting test for platform, xref #13163
jreback May 16, 2016
62fc481
CLN: no return on init
max-sixty May 17, 2016
20ea406
BUG: fix to_records confict with unicode_literals #13172
starplanet May 17, 2016
00e0f3e
BUG: Period and Series/Index comparison raises TypeError
sinhrks May 17, 2016
2429ec5
TST: change test comparison to work on older numpies, #13178
jreback May 17, 2016
009d1df
PERF: DataFrame transform
chris-b1 May 18, 2016
86f68e6
BUG: Sparse creation with object dtype may raise TypeError
sinhrks May 18, 2016
4b50149
TST: Test resampling with NaT
May 18, 2016
eeccd05
BUG: Fix #13213 json_normalize() and non-ascii characters in keys
May 19, 2016
070e877
BUG: Fix argument order in call to super
eddiejessup May 19, 2016
2a120cf
DOC: add v0.19.0 whatsnew doc
jreback May 19, 2016
fecb2ca
COMPAT: Further Expand Compatibility with fromnumeric.py
gfyoung May 20, 2016
123f2ee
BUG: Bug in .to_datetime() when passing integers or floats, no unit a…
jreback May 20, 2016
cc25040
BUG: GH12824 fixed apply() returns different result depending on whet…
adneu May 20, 2016
72164a8
API/COMPAT: add pydatetime-style positional args to Timestamp constru…
thejohnfreeman May 20, 2016
9d44e63
BUG: mpl fix to AutoDatFromatter to fix second/us-second formatters
tacaswell May 10, 2016
8e2f70b
TST: xref #13183, for windows compat
jreback May 20, 2016
f5c24d2
Reverse numpy compat changes to tslib.pyx
gfyoung May 21, 2016
d2b5819
BUG: Empty PeriodIndex issues
max-sixty May 21, 2016
6f90340
API: Use np.random's RandomState when seed is None in .sample
May 21, 2016
82bdc1d
TST: check internal Categorical
sinhrks May 21, 2016
b88eb35
TST/ERR: Add Period ops tests / fix error message
sinhrks May 22, 2016
19ebee5
ENH: support decimal option in PythonParser #12933
May 22, 2016
f8a11dd
ERR: Correct ValueError invalid type promotion exception
gliptak May 23, 2016
afde718
BUG: Fix #13149 and ENH: 'copy' param in Index.astype()
pijucha May 23, 2016
9a6ce07
BUG, ENH: Add support for parsing duplicate columns
gfyoung May 23, 2016
8662cb9
TST: assert_dict_equal to check input type
sinhrks May 24, 2016
75714de
BUG: remove_unused_categories dtype coerces to int64
sinhrks May 24, 2016
69ad08b
BUG: Bug in selection from a HDFStore with a fixed format and start a…
jreback May 24, 2016
e0a2e3b
DOC: fixed typos in GroupBy document
mortada May 24, 2016
b638f18
BUG: Properly validate and parse nrows in read_csv
gfyoung May 25, 2016
8749273
BUG: Fix for resampler for grouping kwarg bug
roycoding May 25, 2016
da5fc17
BUG, ENH: Improve infinity parsing for read_csv
gfyoung May 25, 2016
b4e2d34
TST: Remove imp and just use importlib to avoid memory error when sho…
nparley May 25, 2016
f2ce0ac
ERR: error in datetime conversion with non-convertibles
gliptak May 26, 2016
57ea76f
DOC: Improved documentation for DataFrame.join
edublancas May 26, 2016
9662d91
TST/CLN: remove np.assert_equal
sinhrks May 26, 2016
a67ac2a
COMPAT: extension dtypes (DatetimeTZ, Categorical) are now Singleton …
jreback May 25, 2016
5d67720
DOC: Added an example of pitfalls when using astype
pfrcks May 26, 2016
456dcae
TST: skip Fred / YahooOptions tests
jreback May 26, 2016
db43824
TST: split up test_merge
jreback May 26, 2016
40b4bb4
TST: reorg datetime with tz tests a bit
jreback May 26, 2016
4b05055
DOC: low_memory in read_csv
chris-b1 May 26, 2016
0f1666d
ENH: support decimal argument in read_html #12907
ccronca May 27, 2016
e8d9e79
BUG: preserve join keys dtype
jreback May 27, 2016
ae2ca83
COMPAT: windows test compat for merge, xref #13170
jreback May 27, 2016
c2ea8fb
TST: Make numpy_array test strict
sinhrks May 28, 2016
af4ed0f
DOC: remove references to deprecated numpy negation method
mortada May 28, 2016
70be8a9
DOC: Fix read_stata docstring
sinhrks May 29, 2016
721be62
BUG: Check for NaN after data conversion to numeric
gfyoung May 30, 2016
ed4cd3a
TST: Parser tests refactoring
gfyoung May 30, 2016
cc1025a
COMPAT: do not upcast results to float64 when float32 scalar *+/- flo…
jennolsen84 May 30, 2016
d6f814c
TST: remove tests_tseries.py and distribute to other tests files
jreback May 30, 2016
9e7bfdd
BLD: increase clone depth
jreback May 30, 2016
c0850ea
ENH: add support for na_filter in Python engine
gfyoung May 31, 2016
352ae44
TST: more strict testing in lint.sh
jreback May 31, 2016
132c1c5
BUG: Fix describe(): percentiles (#13104), col index (#13288)
pijucha May 31, 2016
d191640
ENH: Respect Key Ordering for OrderedDict List in DataFrame Init
gfyoung May 31, 2016
f3d7c18
BUG: Fix maybe_convert_numeric for unhashable objects
May 31, 2016
8bbd2bc
ENH: Series has gained the properties .is_monotonic*
jreback May 31, 2016
2e3c82e
TST: computation/test_eval.py tests (slow)
jreback May 31, 2016
45bab82
BUG: Parse trailing NaN values for the Python parser
gfyoung Jun 1, 2016
fcd73ad
BUG: GH13219 Fixed. Allow unicode values in usecols
hassanshamim May 19, 2016
99e78da
DOC: fix comment on previous versions cythonmagic
jorisvandenbossche Jun 2, 2016
ce56542
Fix #13306: Hour overflow in tz-aware datetime conversions.
uwedeportivo Jun 2, 2016
0c6226c
ENH: Add support for compact_ints and use_unsigned in Python engine
gfyoung Jun 2, 2016
2061e9e
BUG: Fix series comparison operators when dealing with zero rank nump…
gliptak Jun 3, 2016
103f7d3
DOC: Add example usage to DataFrame.filter
cswarth Jun 3, 2016
faf9b7d
DOC: Fixed a minor typo
Jun 5, 2016
eca7891
DOC: document doublequote in read_csv
gfyoung Jun 5, 2016
863cbc5
DEPR, DOC: Deprecate buffer_lines in read_csv
gfyoung Jun 5, 2016
5a9b498
BUG: Make pd.read_hdf('data.h5') work when pandas object stored conta…
chrish42 Jun 5, 2016
e90d411
DOC: remove obsolete cron job script (#13369)
Jun 5, 2016
b722222
CLN: remove old skiplist code
jreback Jun 5, 2016
3600bca
ENH: incorporate PR feedback; GH7271
StephenKappel Jun 5, 2016
29ecec0
ENH: inplace dtype changes, df per-column dtype changes; GH7271
StephenKappel May 8, 2016
95a029b
ENH: NDFrame astype() now accepts inplace arg and dtype arg can be a …
StephenKappel May 10, 2016
9d8e1b5
ENH: incorporate PR feedback; GH7271
StephenKappel Jun 5, 2016
c960523
resolve merge conflict in rebasing of 7271-df-astype-dict
StephenKappel Jun 5, 2016
4 changes: 2 additions & 2 deletions doc/source/whatsnew/v0.18.2.txt
@@ -30,8 +30,8 @@ Other enhancements
^^^^^^^^^^^^^^^^^^

- The ``.tz_localize()`` method of ``DatetimeIndex`` and ``Timestamp`` has gained the ``errors`` keyword, so you can potentially coerce nonexistent timestamps to ``NaT``. The default behaviour remains to raising a ``NonExistentTimeError`` (:issue:`13057`)


- The `copy` argument to the ``astype()`` functions has been deprecated in favor of a new ``inplace`` argument. (:issue:`12086`)
- ``astype()`` will now accept a dict of column name to data types mapping as the ``dtype`` argument. (:issue:`12086`)


.. _whatsnew_0182.api:
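To make the second changelog entry above concrete, here is a minimal sketch of the dict form of the `dtype` argument. The column names and values are invented for illustration, and it assumes a pandas release that includes this enhancement.

```python
import pandas as pd

# Illustrative data only; these columns are not taken from the PR.
df = pd.DataFrame({
    "b": [0, 1, 2],
    "c": [0.0, 0.2, 0.4],
    "d": ["1.0", "2", "3.14"],
})

# Cast only the listed columns. Columns that are not keys of the
# mapping ("c" here) keep their existing dtype.
result = df.astype({"b": "float64", "d": "float32"})

print(result.dtypes)
```

The call returns a new DataFrame, so `df` itself keeps its original dtypes.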
41 changes: 41 additions & 0 deletions pandas/core/frame.py
@@ -3772,6 +3772,47 @@ def update(self, other, join='left', overwrite=True, filter_func=None,
# ----------------------------------------------------------------------
# Misc methods

def astype(self, dtype, copy=True, inplace=False, raise_on_error=True,
**kwargs):
"""
Cast object to given data type(s).

Parameters
----------
dtype : numpy.dtype or Python type (to cast entire DataFrame to the
same type). Alternatively, {col: dtype, ...}, where col is a column
Contributor

why are you adding a method here? this should all be done in generic.py

label and dtype is a numpy.dtype or Python type (to cast one or
more of the DataFrame's columns to column-specific types).
copy : deprecated; use inplace instead
inplace : boolean, default False
Modify the DataFrame in place (do not create a new object)
raise_on_error : raise on invalid input
kwargs : keyword arguments to pass on to the constructor if
inplace=False

Returns
-------
casted : type of caller
"""
    if isinstance(dtype, collections.Mapping):
        if inplace:
            for col, typ in dtype.items():
                self[col].astype(typ, inplace=True,
                                 raise_on_error=raise_on_error)
            return None
        else:
            from pandas.tools.merge import concat
            casted_cols = [self[col].astype(typ, copy=copy)
                           for col, typ in dtype.items()]
            other_col_labels = self.columns.difference(dtype.keys())
            other_cols = [self[col].copy() if copy else self[col]
                          for col in other_col_labels]
            new_df = concat(casted_cols + other_cols, axis=1)
            return new_df.reindex(columns=self.columns, copy=False)
    df = super(DataFrame, self)
    return df.astype(dtype=dtype, copy=copy, inplace=inplace,
                     raise_on_error=raise_on_error, **kwargs)
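The non-inplace branch above casts the mapped columns, copies the rest, and restores the original column order. A standalone sketch of the same idea follows. `astype_columns` is a hypothetical helper written for illustration, not the code that was merged into pandas, and it assumes unique column labels. It imports `Mapping` from `collections.abc` because the `collections.Mapping` alias used in the diff was removed in Python 3.10.

```python
from collections.abc import Mapping

import pandas as pd


def astype_columns(df, dtype):
    """Hypothetical helper: cast selected columns, leave the rest untouched."""
    if not isinstance(dtype, Mapping):
        # A single dtype applies to the whole frame.
        return df.astype(dtype)

    # Reject keys that are not column labels, as the PR's tests expect.
    missing = [col for col in dtype if col not in df.columns]
    if missing:
        raise KeyError(f"Only column labels can be used as keys: {missing}")

    # Walk the columns in their original order so no reindex is needed.
    columns = {
        col: df[col].astype(dtype[col]) if col in dtype else df[col].copy()
        for col in df.columns
    }
    return pd.DataFrame(columns, index=df.index)


# Illustrative usage with made-up data.
frame = pd.DataFrame({"a": [1, 2], "b": ["1", "2"]})
casted = astype_columns(frame, {"b": "float32"})
print(casted.dtypes)
```

Building the result in column order avoids the `concat` plus `reindex` round trip, at the cost of not handling duplicate column labels.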

def first_valid_index(self):
"""
Return label for first non-NA/null value
16 changes: 12 additions & 4 deletions pandas/core/generic.py
@@ -143,7 +143,7 @@ def _init_mgr(self, mgr, axes=None, dtype=None, copy=False):

@property
def _constructor(self):
- """Used when a manipulation result has the same dimesions as the
+ """Used when a manipulation result has the same dimensions as the
original.
"""
raise AbstractMethodError(self)
@@ -2930,22 +2930,30 @@ def blocks(self):
"""Internal property, property synonym for as_blocks()"""
return self.as_blocks()

- def astype(self, dtype, copy=True, raise_on_error=True, **kwargs):
+ def astype(self, dtype, copy=True, inplace=False, raise_on_error=True,
+ **kwargs):
"""
Cast object to input numpy.dtype
Return a copy when copy = True (be really careful with this!)

Parameters
----------
dtype : numpy.dtype or Python type
copy : deprecated; use inplace instead
inplace : boolean, default False
Modify the NDFrame in place (do not create a new object)
Contributor

needs a versionadded tag

Contributor

actually I don't think we can have an inplace flag here. you can't really do this except in some extreme circumstances (e.g. you can inplace ints to floats), but it's not worth supporting at all and is quite convoluted.

The copy flag was just to prevent copies if we are already the same dtype. So let's leave all of that alone.

raise_on_error : raise on invalid input
kwargs : keyword arguments to pass on to the constructor

Returns
-------
casted : type of caller
"""

if inplace:
new_data = self._data.astype(dtype=dtype, copy=False,
raise_on_error=raise_on_error,
**kwargs)
self._update_inplace(new_data)
Contributor

verbose, you can do this like:

new_data = ....... copy=not inplace,...

if inplace:
    self._update_inplace(new_data)
else:
    return self._constructor(....)

return
mgr = self._data.astype(dtype=dtype, copy=copy,
raise_on_error=raise_on_error, **kwargs)
return self._constructor(mgr).__finalize__(self)
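The reviewer's suggestion for this hunk is to compute the converted data once, passing `copy=not inplace`, and branch only on what to do with the result. The sketch below shows that shape on a toy class. `Box`, `_convert`, and the list-backed storage are assumptions made for illustration; none of it is pandas code, and it does not address the reviewer's separate objection to supporting `inplace` at all.

```python
class Box:
    """Toy stand-in for an NDFrame-like container (hypothetical)."""

    def __init__(self, values):
        self._data = list(values)

    def _update_inplace(self, new_data):
        self._data = new_data

    def _convert(self, typ, copy):
        # When no copy is requested and every value already has the
        # target type, reuse the existing storage.
        if not copy and all(type(v) is typ for v in self._data):
            return self._data
        return [typ(v) for v in self._data]

    def astype(self, typ, inplace=False):
        # One conversion call serves both branches.
        new_data = self._convert(typ, copy=not inplace)
        if inplace:
            self._update_inplace(new_data)
            return None
        return type(self)(new_data)


box = Box([1, 2])
converted = box.astype(float)      # new object, box is unchanged
box.astype(float, inplace=True)    # returns None, box now holds floats
```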
70 changes: 70 additions & 0 deletions pandas/tests/frame/test_dtypes.py
@@ -372,6 +372,76 @@ def test_astype_str(self):
expected = DataFrame(['1.12345678901'])
assert_frame_equal(result, expected)

def test_astype_dict(self):
Contributor

you can test here, but test for Series as well (e.g. it would have to be a 1-element dictionary)

Contributor Author

Is there a more elegant way to handle the Series case than just adding a new if statement to handle the Series explicitly? The above implementation doesn't work for Series because the df[label] accessor would try to access rows in the Series, and I think we want to map the keys in the dict to the Series.name property.

Contributor

you can just test its ok, e.g. if self.ndim == 1

# GH7271
a = Series(date_range('2010-01-04', periods=5))
b = Series(range(5))
c = Series([0.0, 0.2, 0.4, 0.6, 0.8])
d = Series(['1.0', '2', '3.14', '4', '5.4'])
df = DataFrame({'a': a, 'b': b, 'c': c, 'd': d})
original = df.copy(deep=True)

# change type of a subset of columns
expected = DataFrame({
'a': a,
'b': Series(['0', '1', '2', '3', '4']),
'c': c,
'd': Series([1.0, 2.0, 3.14, 4.0, 5.4], dtype='float32')})
astyped = df.astype({'b': 'str', 'd': 'float32'})
Contributor

result = ...
expected = DataFrame(.....)
assert_frame_equal(result, expected)

assert_frame_equal(astyped, expected)
assert_frame_equal(df, original)
self.assertEqual(astyped.b.dtype, 'object')
self.assertEqual(astyped.d.dtype, 'float32')
Contributor

@jreback jreback May 9, 2016

don't use assertEqual like this, simply use assert_frame_equal (as you have above)


# change all columns
assert_frame_equal(df.astype({'a': str, 'b': str, 'c': str, 'd': str}),
df.astype(str))
assert_frame_equal(df, original)

# error should be raised when using something other than column labels
# in the keys of the dtype dict
self.assertRaises(KeyError, df.astype, {'b': str, 2: str})
self.assertRaises(KeyError, df.astype, {'e': str})
assert_frame_equal(df, original)

# if the dtypes provided are the same as the original dtypes, the
# resulting DataFrame should be the same as the original DataFrame
equiv = df.astype({col: df[col].dtype for col in df.columns})
assert_frame_equal(df, equiv)
assert_frame_equal(df, original)

# using inplace=True, the df should be changed
output = df.astype({'b': 'str', 'd': 'float32'}, inplace=True)
self.assertEqual(output, None)
assert_frame_equal(df, expected)
df.astype({'b': np.float32, 'c': 'float32', 'd': np.float32},
inplace=True)
self.assertEqual(df.a.dtype, original.a.dtype)
Contributor

again construct an expected

self.assertEqual(df.b.dtype, 'float32')
self.assertEqual(df.c.dtype, 'float32')
self.assertEqual(df.d.dtype, 'float32')
self.assertEqual(df.b[0], 0.0)
df.astype({'b': str, 'c': 'float64', 'd': np.float64}, inplace=True)
self.assertEqual(df.a.dtype, original.a.dtype)
self.assertEqual(df.b.dtype, 'object')
self.assertEqual(df.c.dtype, 'float64')
self.assertEqual(df.d.dtype, 'float64')
self.assertEqual(df.b[0], '0.0')
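Several review comments on this test ask for the same pattern: build a `result`, build an explicit `expected` frame, and compare the two with `assert_frame_equal` rather than asserting on individual dtypes. A minimal sketch of that pattern follows, with made-up data and the public `pandas.testing` import path, which is an assumption about the reader's pandas version (the PR itself predates that module).

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative input; not the fixture used in the PR's test.
df = pd.DataFrame({"b": [0, 1, 2], "d": ["1.0", "2", "3.14"]})

result = df.astype({"b": "float64", "d": "float64"})
expected = pd.DataFrame({"b": [0.0, 1.0, 2.0], "d": [1.0, 2.0, 3.14]})

# A single comparison covers values, dtypes, index, and column labels.
assert_frame_equal(result, expected)
```

Because `assert_frame_equal` checks dtypes by default, separate `assertEqual` calls on each column's dtype become redundant.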

def test_astype_inplace(self):
# GH7271
df = DataFrame({'a': range(10),
'b': range(2, 12),
'c': np.arange(4.0, 14.0, dtype='float64')})
df.astype('float', inplace=True)
for col in df.columns:
self.assertTrue(df[col].map(lambda x: type(x) == float).all())
self.assertEqual(df[col].dtype, 'float64')
df.astype('str', inplace=True)
for col in df.columns:
self.assertTrue(df[col].map(lambda x: type(x) == str).all())
self.assertEqual(df[col].dtype, 'object')

def test_timedeltas(self):
df = DataFrame(dict(A=Series(date_range('2012-1-1', periods=3,
freq='D')),
13 changes: 13 additions & 0 deletions pandas/tests/series/test_dtypes.py
@@ -133,6 +133,19 @@ def test_astype_unicode(self):
reload(sys) # noqa
sys.setdefaultencoding(former_encoding)

def test_astype_inplace(self):
s = Series(np.random.randn(5), name='foo')

for dtype in ['float32', 'float64', 'int64', 'int32']:
Contributor

{s.name : dtype} would be valid as well

astyped = s.astype(dtype, inplace=False)
self.assertEqual(astyped.dtype, dtype)
self.assertEqual(astyped.name, s.name)

for dtype in ['float32', 'float64', 'int64', 'int32']:
s.astype(dtype, inplace=True)
self.assertEqual(s.dtype, dtype)
self.assertEqual(s.name, 'foo')
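The reviewer notes that a one-element mapping keyed by the Series name, `{s.name: dtype}`, would be valid as well. The sketch below exercises that form. It assumes a pandas release where Series accepts a mapping keyed by its own name and raises `KeyError` for any other key, which is my understanding of current behaviour rather than something this diff shows.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], name="foo")

# The only accepted key is the Series name.
result = s.astype({s.name: "float32"})
print(result.dtype, result.name)

# Any other key is expected to be rejected.
try:
    s.astype({"bar": "float32"})
except KeyError as exc:
    print("rejected:", exc)
```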

def test_complexx(self):
# GH4819
# complex access for ndarray compat