Skip to content

TST: Added test case for DataFrame.at #33577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 21, 2020
Merged

Conversation

EdAbati
Copy link
Contributor

@EdAbati EdAbati commented Apr 15, 2020

@simonjayhawkins simonjayhawkins added the Testing pandas testing functions or related to the test suite label Apr 16, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1 milestone Apr 16, 2020
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @EdAbati lgtm

# https://github.com/pandas-dev/pandas/issues/33544
df = pd.DataFrame({"foo": [datetime(2000, 1, 1)]})
df.at[0, "foo"] = datetime(2000, 1, 2, tzinfo=timezone.utc)
assert df.at[0, "foo"].tzinfo == timezone.utc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you instead create an expected Dataframe and use tm.assert_frame_equal?

    result = pd.DataFrame({"foo": [datetime(2000, 1, 1)]})
    result.at[0, "foo"] = datetime(2000, 1, 2, tzinfo=timezone.utc)
    expected = pd.DataFrame(....)
    tm.assert_frame_equal(result, expected)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke ,

I have implemented as you suggested, but actually (as you can see in the checks below) this makes the test fail and raises an interesting question.

When creating the dataframe with:

result = pd.DataFrame({"foo": [datetime(2000, 1, 1)]})

It transforms the value result.at[0, "foo"] into a Timestamp object:Out[4]: Timestamp('2000-01-01 00:00:00')and running result.info() will show that the column Dtype is datetime64[ns].

If now I apply the following

result.at[0, "foo"] = datetime(2000, 1, 2, tzinfo=timezone.utc)

the result.at[0, "foo"] is transformed in to a datetime object
Out[6]: datetime.datetime(2000, 1, 2, 0, 0, tzinfo=datetime.timezone.utc) and also the column Dtype is changed to object.

I am not sure if this is the expected behaviour or it is a bug.

If this is the expected behaviour I should change the test, otherwise .at needs to be fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the expected behavior as opposed to the loc variant that would coerce to datetime type.

You can therefore change the expected construction to

expected = pd.DataFrame(np.array([datetime(2000, 1, 2, tzinfo=timezone.utc)], dtype=object, columns['foo'])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the expected behavior

this seems inconsisent with other types that cannot hold the scalar type.

>>> import numpy as np
>>>
>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1290.g7e99bb433'
>>>
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> df
   a
0  1
1  2
2  3
>>>
>>> df.at[0, "a"] = np.nan
Traceback (most recent call last):
...
ValueError: Cannot assign nan to integer series
>>>

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Timezones Timezone data dtype labels Apr 16, 2020
@mroeschke
Copy link
Member

@simonjayhawkins brings up a good point. This is possibly not fully fixed as the expected dtype should not be object but datetime64[ns, UTC].

Interested in fixing this bug @EdAbati?

@EdAbati
Copy link
Contributor Author

EdAbati commented Apr 19, 2020

Hi @mroeschke and @simonjayhawkins ,

I could try to to give it a go, but I am not super familiar with every internal class and I may need some guidance/a little bit more time.

I have tried to debug the __setitem__ during result.at[0, "foo"] = datetime(2000, 1, 2, tzinfo=timezone.utc).
I have noticed that at some point it is checked if the timezones are the same/compatible.
In the specific example of the test, even if the tz are not the same/compatible (one is None and the other is 'UTC'), it does not raise the TypeError here because TypeError is part of a exception statement.

The item is set using iLocIndexer and the function find_common_type (link) is used to translate the dtype to object.

What do you think, would changing find_common_type be a good approach in this case?
Or would it be better to raise an error for the timezones mismatch?

@mroeschke
Copy link
Member

I think that behavior is working as expected. The odd part is that essentially the entire column is being changed, so I am not sure why we need to do this data type check.

@jreback jreback merged commit 04f2c2b into pandas-dev:master Apr 21, 2020
@jreback
Copy link
Contributor

jreback commented Apr 21, 2020

thanks, .at like .loc does not do coercing of the entire column, unlike [] which does.

@simonjayhawkins
Copy link
Member

my take-away for this is that the behaviour reported in original issue was perhaps the correct behaviour.

The loss of the timezone a result of the scalar being coerced. This is similar to #22740. Agreed in both cases the user 'incorrectly' expected a different behaviour, because the scalar could be coerced.
For this case, numpy will raise an error in the future and would therefore raise. This will be consistent with #33577 (comment)

>>> df = pd.DataFrame({"foo": [dt.datetime(2000, 1, 1)]})
>>> df.foo
0   2000-01-01
Name: foo, dtype: datetime64[ns]
>>>
>>> df.foo.dtype.type(dt.datetime(2000, 1, 2, tzinfo=dt.timezone.utc))
<stdin>:1: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
numpy.datetime64('2000-01-02T00:00:00.000000')
>>>

The 'fixed' issue now coerces the entire column from datetime64[ns] to object. This is an incorrect behaviour for .at IMO

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Testing pandas testing functions or related to the test suite Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG:Timezone lost when assigning Datetime via DataFrame.at
4 participants