Description
Hi. I am working with large data frames of ~5.7 million rows each, and I noticed that _maybe_resample_data() is very slow on them: about 5.2 seconds, as shown in the output below. The time goes into the way it picks the new frequency:
freq = next((f for f in FREQS[from_index:] if len(df.resample(f)) <= _MAX_CANDLES), FREQS[-1])
Every candidate frequency it tries means another resample of the whole frame, just to count the resulting candles.
I have never contributed to an open source project before, but I have attached a code snippet that makes finding the optimal frequency much faster. Let me know what you think. Thanks.
CODE
import time
# df is an OHLCV DataFrame with a DatetimeIndex (loading not shown here)
print(df)
_MAX_CANDLES = 10_000
""" Original Code Snippet from _plotting._maybe_resample_data """
start = time.time()
from_index = dict(day=-2, hour=-6, minute=1, second=0, millisecond=0, microsecond=0, nanosecond=0)[df.index.resolution]
FREQS = ("1T", "5T", "10T", "15T", "30T", "1H", "2H", "4H", "8H", "1D", "1W", "1M")
freq = next((f for f in FREQS[from_index:] if len(df.resample(f)) <= _MAX_CANDLES), FREQS[-1]) # VERY SLOW!
print(f"Took {time.time() - start} seconds | Freq: {freq}.")
""" End of Snippet """
""" Faster Code """
start = time.time()
from_index = dict(day=-2, hour=-6, minute=1, second=0, millisecond=0, microsecond=0, nanosecond=0)[df.index.resolution]
MIN_TO_FREQ = {
    1: "1T",
    5: "5T",
    10: "10T",
    15: "15T",
    30: "30T",
    60: "1H",
    120: "2H",
    240: "4H",
    480: "8H",
    1440: "1D",
    10080: "1W",
    43800: "1M",
}
FREQS = list(MIN_TO_FREQ.values())
index_timedelta = df.index[-1] - df.index[0]
# Minimum candle width in minutes. Not floored: a value such as 5.9 must select 10T, not 5T.
req_minutes = (index_timedelta / _MAX_CANDLES).total_seconds() / 60
new_freq = [v for k, v in MIN_TO_FREQ.items() if k >= req_minutes]
new_freq = new_freq[0] if new_freq else "1M"
print(f"Took {(time.time() - start):.6f} seconds | Freq: {new_freq}.")
""" End of Faster Code """
assert freq == new_freq
OUTPUT
open high low close volume
2005-01-02 18:53:00 1.20510 1.20510 1.20510 1.20510 1.0
2005-01-02 18:54:00 1.20610 1.20610 1.20610 1.20610 1.0
2005-01-02 19:31:00 1.20610 1.20610 1.20610 1.20610 1.0
2005-01-02 19:32:00 1.20610 1.20610 1.20560 1.20560 2.0
2005-01-02 19:33:00 1.20560 1.20560 1.20560 1.20560 1.0
... ... ... ... ... ...
2021-04-26 16:10:00 1.24066 1.24069 1.24062 1.24062 5.0
2021-04-26 16:11:00 1.24058 1.24062 1.24042 1.24060 36.0
2021-04-26 16:12:00 1.24057 1.24069 1.24024 1.24024 45.0
2021-04-26 16:13:00 1.24026 1.24046 1.24026 1.24028 26.0
2021-04-26 16:14:00 1.24025 1.24025 1.24014 1.24018 16.0
[5736966 rows x 5 columns]
Took 5.253245115280151 seconds | Freq: 1D.
Took 0.001001 seconds | Freq: 1D.
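
For reference, the idea behind the faster snippet can be written as a small standalone function. This is only a sketch under a few assumptions. The name pick_resample_freq and its signature are hypothetical and not part of the library. The candle count is estimated from the time span alone, so it can be off by one bin at the edges. It does not use from_index, so results for coarse indexes (e.g. daily data) may differ from the original code. The required width is kept as a float rather than floored, so a span that needs 5.9-minute candles maps to 10T instead of 5T.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

# Candidate frequencies and their approximate length in minutes,
# taken from the MIN_TO_FREQ table above (a month is counted as 43800 minutes).
FREQ_MINUTES = (1, 5, 10, 15, 30, 60, 120, 240, 480, 1440, 10080, 43800)
FREQ_NAMES = ("1T", "5T", "10T", "15T", "30T", "1H", "2H", "4H", "8H", "1D", "1W", "1M")


def pick_resample_freq(first, last, max_candles=10_000):
    """Hypothetical helper: return the finest frequency whose estimated
    candle count over the span [first, last] does not exceed max_candles."""
    span_minutes = (last - first).total_seconds() / 60
    # Minimum candle width in minutes; kept as a float so it is not rounded down.
    required = span_minutes / max_candles
    # Index of the first candidate that is at least as wide as required.
    pos = bisect_left(FREQ_MINUTES, required)
    # Fall back to the coarsest frequency if even that one is too fine.
    return FREQ_NAMES[min(pos, len(FREQ_NAMES) - 1)]


# First and last timestamps from the OUTPUT above.
first = datetime(2005, 1, 2, 18, 53)
last = datetime(2021, 4, 26, 16, 14)
print(pick_resample_freq(first, last))  # prints "1D"
```

With a DataFrame, the call would be pick_resample_freq(df.index[0], df.index[-1]); pandas timestamps support the same subtraction and total_seconds() call. For the span above, 8H would give roughly 17,900 candles and 1D roughly 5,960, which is why 1D is the first frequency under the 10,000 limit.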