
VERY SLOW _plotting._maybe_resample_data() #318

Closed
@Sean-Ker


Hi. I am working with large data frames of about 5.7 million rows each, and I noticed that _maybe_resample_data() is very slow on them: about 5.2 seconds, as shown in the output below. The cause is the way it picks the new frequency, freq = next((f for f in FREQS[from_index:] if len(df.resample(f)) <= _MAX_CANDLES), FREQS[-1]), which resamples the whole data frame once for every candidate frequency it tries.

I have never contributed to an open source project before, but I have attached a snippet of code that makes finding the optimal frequency much faster. Let me know what you think. Thanks.

CODE

import time

print(df)

_MAX_CANDLES = 10_000

""" Original Code Snippet from _plotting._maybe_resample_data """
start = time.time()

from_index = dict(day=-2, hour=-6, minute=1, second=0, millisecond=0, microsecond=0, nanosecond=0)[df.index.resolution]
FREQS = ("1T", "5T", "10T", "15T", "30T", "1H", "2H", "4H", "8H", "1D", "1W", "1M")
freq = next((f for f in FREQS[from_index:] if len(df.resample(f)) <= _MAX_CANDLES), FREQS[-1])  # VERY SLOW!

print(f"Took {time.time() - start} seconds | Freq: {freq}.")
""" End of Snippet """

""" Faster Code """
start = time.time()

from_index = dict(day=-2, hour=-6, minute=1, second=0, millisecond=0, microsecond=0, nanosecond=0)[df.index.resolution]
MIN_TO_FREQ = {
    1: "1T",
    5: "5T",
    10: "10T",
    15: "15T",
    30: "30T",
    60: "1H",
    120: "2H",
    240: "4H",
    480: "8H",
    1440: "1D",
    10080: "1W",
    43800: "1M",
}
FREQS = list(MIN_TO_FREQ.values())
index_timedelta = df.index[-1] - df.index[0]
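# Minutes each candle must cover so that the whole span fits in _MAX_CANDLES candles.
req_minutes = (index_timedelta / _MAX_CANDLES).total_seconds() // 60
# First (smallest) frequency at least that wide; fall back to monthly.
new_freq = next((v for k, v in MIN_TO_FREQ.items() if k >= req_minutes), "1M")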

print(f"Took {(time.time() - start):.6f} seconds | Freq: {new_freq}.")
""" End of Faster Code """
assert freq == new_freq
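
The same idea could be packaged as a small function. This is only a sketch under stated assumptions, not the library's code: the name `pick_freq` and its parameters are hypothetical, and it uses plain `datetime` values so it runs without pandas. It differs from the snippet above in two ways that may or may not be wanted: it rounds the required width up instead of down, so the estimated candle count (span divided by bin width) stays at or below the cap, and it honors `from_index` the way the original `FREQS[from_index:]` slice does.

```python
import math
from datetime import datetime, timedelta

# Candidate bin widths in minutes, smallest first (same table as the snippet above).
MIN_TO_FREQ = {
    1: "1T", 5: "5T", 10: "10T", 15: "15T", 30: "30T", 60: "1H",
    120: "2H", 240: "4H", 480: "8H", 1440: "1D", 10080: "1W", 43800: "1M",
}


def pick_freq(start, end, max_candles=10_000, from_index=0):
    """Hypothetical helper: choose a frequency from the index span alone.

    start and end are the first and last timestamps of the index.
    """
    span_minutes = (end - start).total_seconds() / 60
    # Round up: a width of at least span / max_candles keeps the
    # estimated candle count (span / width) at or below max_candles.
    required = math.ceil(span_minutes / max_candles)
    # Skip the finest frequencies, mirroring FREQS[from_index:].
    candidates = list(MIN_TO_FREQ.items())[from_index:]
    # First candidate that is wide enough; monthly if none is.
    return next((freq for minutes, freq in candidates if minutes >= required), "1M")


# Span of the data frame in this report, minute resolution (from_index=1).
print(pick_freq(datetime(2005, 1, 2, 18, 53), datetime(2021, 4, 26, 16, 14), from_index=1))
```

Rounding matters near a boundary: for a span of 59,000 minutes the required width is 5.9 minutes, which floor division turns into 5 and so selects "5T" (an estimated 11,800 candles), while rounding up selects "10T". The estimate ignores how resample aligns bin edges, so the real count may differ by a bin or so.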

OUTPUT

                        open     high      low    close  volume
2005-01-02 18:53:00  1.20510  1.20510  1.20510  1.20510     1.0
2005-01-02 18:54:00  1.20610  1.20610  1.20610  1.20610     1.0
2005-01-02 19:31:00  1.20610  1.20610  1.20610  1.20610     1.0
2005-01-02 19:32:00  1.20610  1.20610  1.20560  1.20560     2.0
2005-01-02 19:33:00  1.20560  1.20560  1.20560  1.20560     1.0
...                      ...      ...      ...      ...     ...
2021-04-26 16:10:00  1.24066  1.24069  1.24062  1.24062     5.0
2021-04-26 16:11:00  1.24058  1.24062  1.24042  1.24060    36.0
2021-04-26 16:12:00  1.24057  1.24069  1.24024  1.24024    45.0
2021-04-26 16:13:00  1.24026  1.24046  1.24026  1.24028    26.0
2021-04-26 16:14:00  1.24025  1.24025  1.24014  1.24018    16.0
[5736966 rows x 5 columns]

Took 5.253245115280151 seconds | Freq: 1D.
Took 0.001001 seconds | Freq: 1D.
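
The 1D result can be checked by hand from the first and last timestamps printed above. The following is a rough sketch using only `datetime`; it estimates candle counts as span divided by bin width and assumes the cap of 10,000 from `_MAX_CANDLES`, rather than reproducing what resample actually counts.

```python
from datetime import datetime

first = datetime(2005, 1, 2, 18, 53)   # first row of the printed frame
last = datetime(2021, 4, 26, 16, 14)   # last row of the printed frame

span = last - first                      # 5957 days, 21:21:00
span_minutes = span.total_seconds() / 60  # 8,579,361 minutes

# Minutes each candle must cover to fit the span into 10,000 candles.
per_candle = span_minutes / 10_000        # about 857.94

# Estimated candle counts for the two frequencies on either side of that width.
candles_8h = span_minutes / 480           # about 17,874: too many
candles_1d = span_minutes / 1440          # about 5,958: fits

print(span, per_candle, candles_8h, candles_1d)
```

Since 480 minutes is too narrow and 1440 minutes is the next entry in the table, both snippets land on 1D.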

Labels: bug (Something isn't working)
