Skip to content

DOC: Fix pandas.Series.resample docstring #23197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Nov 10, 2018
Merged
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 37 additions & 16 deletions pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -7366,16 +7366,32 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
label=None, convention='start', kind=None, loffset=None,
limit=None, base=0, on=None, level=None):
"""
Resample time-series data.

Convenience method for frequency conversion and resampling of time
series. Object must have a datetime-like index (DatetimeIndex,
PeriodIndex, or TimedeltaIndex), or pass datetime-like values
to the on or level keyword.
series. Object must have a datetime-like index (``DatetimeIndex``,
``PeriodIndex``, or ``TimedeltaIndex``), or pass datetime-like values
to the ``on`` or ``level`` keyword.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single backticks.


Parameters
----------
rule : string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the Python types (str instead of string...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done. The module is full of string by the way. Should I create another issue for that? Do these values str, int get picked up by Sphinx?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to fix one whole docstring at a time, than fixing a problem in all docstrings.

Sphinx doesn't care about the types, but we try to be consistent, and will start validating the types soon (not sure if validate_docstrings.py already conmplains about these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, atomic improvements around each docstring make total sense. Pushing the amendments now.

the offset string or object representing target conversion
The offset string or object representing target conversion.
how : string
Method for down-/re-sampling, default to ‘mean’ for downsampling.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the possible values? If they are a fixed set, can we have {'mean',...}? If it's a problem, leave it like it is, as this will go away soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible values are any valid string function that does some form of aggregation. Interestingly, this would also work with the example dataframe I shared above:

>>> df
   price  volume       time
0     10      50 2018-01-07
1     11      60 2018-01-14
2      9      40 2018-01-21
3     13     100 2018-01-28
4     14      50 2018-02-04
5     18     100 2018-02-11
6     17      40 2018-02-18
7     19      50 2018-02-25

>>> df.resample('M', on='time', how={'price': min, 'volume': sum})
            price  volume
time                     
2018-01-31      9     250
2018-02-28     14     240
>>> df.resample('M', on='time', how=['min', 'sum'])
           price     volume     
             min sum    min  sum
time                            
2018-01-31     9  43     40  250
2018-02-28    14  68     40  240

Happy to change the description of the parameter to something like:

how : str, dict, list
       Method for down-/re-sampling, default to `mean` for downsampling. Accepted combinations are:
       * string function name
       * list of string function names
       * dict of column names -> string function names (or list of string function names) 

I got inspiration from pandas.DataFrame.aggregate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f the name needs to be a string with the function name, I assume it can't be any arbitrary value. But as it's deprecated, let's not spend time on this.


.. deprecated:: 0.18.0
The new syntax is ``.resample(...).mean()``, or
``.resample(...).apply(<func>)``
axis : int, optional, default 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check other docstrings to see how the axis type is usually documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another example is:

apply to each column (``axis=0`` or ``'index'``)
or to each row (``axis=1`` or ``'columns'``) or    

The problem is that for Series the axis is that along which the operation is computed, which in this case it's always along the rows (aka columnwise).

Maybe a more verbose but informative doc would be

 Which axis to use for up- or down-sampling. For ``Series`` this
 will default to 0, i.e. `along the rows`. Must be
 ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.

I have added this tentatively to the PR but let me know if something less verbose would be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't clear. There is not much standardization of the descriptions, use whatever makes more sense to you. But the used type is somehow standard, that's what I think we should copy from another docstring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmrr can you make this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @datapythonista not your fault at all. I somewhat skipped the word type in your request and I thought you meant how axis is usually documented. Regarding the type and values there's no much standardization either. From set_axis() in pandas/core/generic.py I like the following:

{0 or 'index', 1 or 'columns'}, default 0

If we extrapolate this to Series then that line should be:

axis : {0 or 'index'}, default 0

I will make this change now, hope it sounds good.

Which index to use for up- or down-sampling. Must be
``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the must be. I don't think axis must be DatetimeIndex...

Also, don't use backticks around the "along the rows". Backticks are to tell that it's a reference (to a variable, a function, a class...). We don't use them for text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the must be. I don't think axis must be DatetimeIndex...

I took it from the main function description. As rule must be an offset string which are time-related magnitudes I decided to be explicit here. Also I ran some tests and got TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex.

In fact, following this logic, on should also have that sentence: Must be ``DatetimeIndex``, ``TimedeltaIndex`` or ``PeriodIndex``.

For instance:

>>> df = pd.DataFrame({'price': [10, 11, 9, 13, 14, 18, 17, 19], 'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
>>> df['time'] = pd.date_range('01/01/2018', periods=8, freq='W')
>>> df
   price  volume       time
0     10      50 2018-01-07
1     11      60 2018-01-14
2      9      40 2018-01-21
3     13     100 2018-01-28
4     14      50 2018-02-04
5     18     100 2018-02-11
6     17      40 2018-02-18
7     19      50 2018-02-25
>>> df.resample('M', on='time').sum()
            price  volume
time                     
2018-01-31     43     250
2018-02-28     68     240

I hope I'm not missing anything here... resample is not a function I've used heavily personally.

Also, don't use backticks around the "along the rows". Backticks are to tell that it's a reference (to a variable, a function, a class...). We don't use them for text.

Slip of keyboard, I meant normal textual quotes ("). Changing.

fill_method : string, default None
Filling method for upsampling.

.. deprecated:: 0.18.0
The new syntax is ``.resample(...).<func>()``,
e.g. ``.resample(...).pad()``
closed : {'right', 'left'}
Which side of bin interval is closed. The default is 'left'
for all frequency offsets except for 'M', 'A', 'Q', 'BM',
Expand All @@ -7385,18 +7401,22 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
for all frequency offsets except for 'M', 'A', 'Q', 'BM',
'BA', 'BQ', and 'W' which all have a default of 'right'.
convention : {'start', 'end', 's', 'e'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some default values are missing, like in this case.

For PeriodIndex only, controls whether to use the start or end of
`rule`
kind: {'timestamp', 'period'}, optional
For ``PeriodIndex`` only, controls whether to use the start or
end of ``rule``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single backticks, in both cases, double backticks are for code (including possible values of variables like NaN). But for classes, variables, parameters... we use single backticks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, wasn't clear from the docstring docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would None be with double backtick in examples like "default None"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question. I think it probably should, but we don't have it anywhere, so let's leave it without backticks in the default for now

kind : {'timestamp', 'period'} optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing comma before optional

Pass 'timestamp' to convert the resulting index to a
``DateTimeIndex`` or 'period' to convert it to a ``PeriodIndex``.
By default the input representation is retained.
loffset : timedelta
Adjust the resampled time labels
Adjust the resampled time labels.
limit : int, default None
Maximum size gap when reindexing with ``fill_method``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


.. deprecated:: 0.18.0
base : int, default 0
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '5min' frequency, base could
range from 0 through 4. Defaults to 0
range from 0 through 4. Defaults to 0.
on : string, optional
For a DataFrame, column to use instead of index for resampling.
Column must be datetime-like.
Expand All @@ -7405,7 +7425,7 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,

level : string or int, optional
For a MultiIndex, level (name or number) to use for
resampling. Level must be datetime-like.
resampling. ``level`` must be datetime-like.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


.. versionadded:: 0.19.0

Expand Down Expand Up @@ -7522,9 +7542,10 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,
For a Series with a PeriodIndex, the keyword `convention` can be
used to control whether to use the start or end of `rule`.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
freq='A',
periods=2))
>>> s = pd.Series([1, 2], index=pd.period_range(
... '2012-01-01',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep the date in the previous line

... freq='A',
... periods=2))
>>> s
2012 1
2013 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the following examples, when using resample, it once uses .head() and the other time it shows 12 months. I don't like any of them. I'd use something like resampling a year to quarters, or a quarter to months... So we can show all the data, and it's just few rows that readers can quickly check.

Copy link
Contributor Author

@jmrr jmrr Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created two examples, one sampling a year into quarters using convention='start' and another sampling quarters into months using the convention='end'. If it's too much we can remove one of them.

Expand Down Expand Up @@ -7577,9 +7598,9 @@ def resample(self, rule, how=None, axis=0, fill_method=None, closed=None,

>>> time = pd.date_range('1/1/2000', periods=5, freq='T')
>>> df2 = pd.DataFrame(data=10*[range(4)],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep8 issue, missing spaces around operator.

columns=['a', 'b', 'c', 'd'],
index=pd.MultiIndex.from_product([time, [1, 2]])
)
... columns=['a', 'b', 'c', 'd'],
... index=pd.MultiIndex.from_product([time, [1, 2]])
... )
>>> df2.resample('3T', level=0).sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as in the previous example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, also with a different, more realistic approach. Please let me know your comments. I also struggled a bit with the formatting of long lines, opted for styling lines 7627-7632 using black (also compatible with PEP8) but let me know it breaks consistency and should be reverted to the old style. Pushing now.

a b c d
2000-01-01 00:00:00 0 6 12 18
Expand Down