Fix #890 Fix number of contiguous windows in snippet #894

NimaSarajpoor · 2023-08-05T03:41:46Z

This PR fixes issue #890.

codecov · 2023-08-05T04:06:45Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (900b10c) 98.93% compared to head (9674dfc) 98.93%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #894   +/-   ##
=======================================
  Coverage   98.93%   98.93%           
=======================================
  Files          84       84           
  Lines       14236    14292   +56     
=======================================
+ Hits        14084    14140   +56     
  Misses        152      152

Files Changed	Coverage Δ
stumpy/snippets.py	`100.00% <100.00%> (ø)`
tests/naive.py	`100.00% <100.00%> (ø)`

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

seanlaw · 2023-08-05T10:26:57Z

@NimaSarajpoor Can you please double check that this produces the same result in the existing tutorial? Also, can you please insert a reference to the original issue in the initial comment (I notice that this has been missing)?

NimaSarajpoor · 2023-08-05T10:56:55Z

Can you please double check that this produces the same result in the existing tutorial?

Good point. I will check that out

Also, can you please insert a reference to the original issue in the initial comment (I notice that this has been missing)?

Done! [if a summary of the issue is needed, plz let me know and I will revise the initial comment]

NimaSarajpoor · 2023-08-07T04:35:28Z

Can you please double check that this produces the same result in the existing tutorial?

Good point. I will check that out

The last figure changes in this branch (see Figure below)

The last bar becomes taller. Two notes regarding this matter:

(1) The matlab code provided in the snippet supporting webpage seems to suffer from the same problem:

# in snippetfinder.m

# some code
len = sub * ceil(length(data)/sub) - length(data);
%% padding zeros to the end of the time series. 
data = [data;zeros(len,1)];
 
%% compute all the profiles
for i = 1:sub:length(data) - sub 
    indexes = [indexes,i];
    distance = fastMPdist_SS(data,data(i:i+ sub) ,round(sub*per/100),0.05);
    distances = [distances;distance];
end
# some code

In this MATLAB code, as shown above, the last snippet subsequence is ignored if len(data) % m is zero.

(2) As a follow up to my previous note, I checked the length of the time series T in the tutorial and realized its length is 2000. m is 200, which means len(T) % m == 0. Therefore, this is the case that can be affected by the issue in the code.

Let's consider two scenarios:
In what follows, the original T has the length of 2000.

Scenario I: Let's do T = T[:-1], and consider this new T as the input. This new T will be padded by one value [in the snippet module] so that its length becomes 2000, and hence dividable by m. The number of contgiuous windows is 9 according to both existing code in branch main as well as the code in this branch. That is why this new T, of length 1999, does not change the last figure of the tutorial.
Scenario II: Let's do T = np.append(T, 0), and consdier this new T as the input. Note that in this case, the number contiguous windows according to the existing code in the branch main will be 10 [which is the same the number of contiguous windows in this branch]. In this case, the last figure [in the branch main but with new T, of length 2001] will be similar to what I showed above.

NimaSarajpoor · 2023-08-07T05:23:32Z

Q: Can we justify the figure shown in the previous comment?
A: Maybe! Let's see!

Let's use k=9 to get 9 snippets, and let's plot them:

plt.plot(T, color='k')
for idx in indices:
    plt.plot(np.arange(idx, idx+m), T[idx:idx+m], color='b')

idx = indices[-1]  # index of last snippet, corresponding to the tall bar positioned at `k=9`
plt.plot(np.arange(idx, idx+m)
T[idx:idx+m], color='r')
plt.show()

Note that the 9-th snippet is the one colored in red. This seems to be a pattern that is a little bit different compared to the patterns on the first half and the ones on the second half of T. Although this red pattern is like a combination of those two patterns, it can be considered as one unique pattern. So, maybe that is why the bar chart for k=9 increases since it is now associated with the part of the data that shows a new behavior.

Q: Can we further justify that the code in this branch makes more sense?
A: I think we can. Let's try k=10 and use the snippet code of this branch.

k = 10
snippets, indices, profiles, fractions, areas, regimes = stumpy.snippets(T, m, k=k, s=s)

# >>> areas
# array([5981.73838943, 1887.02451582, 1342.11397203,  991.55393208,
        723.79295706,  522.38654936,  327.99156887,  168.97041678,
         59.77463888,    0.        ])

Note that area is 0 for the "10-th" snippet, and all data is covered.

>>> indices
array([ 800, 1600, 1200, 1000, 1400,  400, 1800,  200,  600,    0])

Note that the last value of the array is 0, showing the start index of the last snippet computed with k=10. This is basically the black pattern shown in the figure above, which was missing when k==9.

NimaSarajpoor · 2023-08-07T05:49:09Z

@seanlaw
[[if needed] we may also contact the first author and/or Eamon Keogh to see what they think regarding this matter.

seanlaw · 2023-08-07T11:38:44Z

[[if needed] we may also contact the first author and/or Eamon Keogh to see what they think regarding this matter.

I am fine either way. If you have time to reach out to them then please do. Otherwise, I think I am satisfied with your justifications and am willing to merge. Your call.

NimaSarajpoor · 2023-08-07T12:25:45Z

Sure. I will send an email to them as I want to make sure I am not missing anything here.

[update-1]

(1) I sent an email. @seanlaw Please feel free to merge if I don't provide any update regarding this matter in the next few days.

(2) In the paper, we see the following statement:

Note that the number of non-overlapping subsequences is much smaller than the sliding windows, just ⌊n/m⌋.

Hence, I think the fixes [in this PR] make sense as it is based on what stated in the paper.

[update-2]
(1) I heard back from the first auther. I provided her with the data as requested. She is going to take a look.

review-notebook-app · 2023-08-08T05:15:52Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

seanlaw · 2023-08-08T13:33:53Z

@NimaSarajpoor As I re-read your (excellent) comments above, it seems strange that the change-in-profile-area jumps so much for the 10th snippet. At the end of the day, given k=10, how would you interpret the change-in-profile-area and choose the "optimal number of snippets (k)"?

Also, in the second-to-last plot (Profile Area) go from range(1,11) when k=10 (instead of range(1,10)?

NimaSarajpoor · 2023-08-09T01:31:41Z

@seanlaw
[In advance, sorry for this long comment!]

Before answering the question, I would like to correct the justification I provided before:

Note that the 9-th snippet is the one colored in red. This seems to be a pattern that is a little bit different compared to the patterns on the first half and the ones on the second half of T

I think this justification, that I made recently, is not correct. Let's see something interesting here:

# in this PR

# to store the start index of snippets for different seeds
first_snippet = []
second_snippet = []
last_snippet = []

for seed in range(10):
    T = warp_add_noise(df, seed=seed)
    snippets, indices, profiles, fractions, areas, regimes = stumpy.snippets(T, m=200, k=9, s=70)
    
    first_snippet.append(indices[0])
    second_snippet.append(indices[1])
    last_snippet.append(indices[-1])

>>> first_snippet  # for different seeds
[800, 200, 200, 400, 200, 400, 200, 200, 800, 400]

>>> second_snippet # for different seeds
[1200, 1200, 1600, 1200, 1200, 1200, 1400, 1600, 1200, 1200]

>>> last_snippet
[600, 1800, 400, 800, 800, 600, 1800, 400, 600, 600]

Note 1) For different seeds, the first snippet is always coming from the first half of T

Note 2) For different seeds, the second snippet is always coming from the second half of T
[so, it seems that the snippet algo can "understand" similar patterns and try to find unique ones in the first two snippets]

Note 3) The last snippet, however, is from the first half sometimes, and some other times, it is from the second half of T. So, the last snippet is NOT necessarily related to the pattern in the middle of T, where two uninique patters are being connected to each other.

Previouslly, the code in the tutorial notebook did not have the param seed, and that resulted in misleading me. (I have now added the param seed to improve reproducibility) I will come back to the code above later at the end as I need to discuss one more thing about it. But, for now, let's discuss the questions you asked.

It seems strange that the change-in-profile-area jumps so much for the 10th snippet. At the end of the day, given k=10, how would you interpret the change-in-profile-area and choose the "optimal number of snippets (k)"?

Let's discuss the formula areas[:-1]/areas[1:] - 1.0, used in the notebook. This formula is provided by the paper to compute "change in profile area" from which an optimal k can be discovered. Note that this formula is equal to (areas[:-1] - areas[1:] ) / areas[1:], which is basically "a slope that is divided by the second point of that slope". Hence, this value can become huge if the denominator becomes small. It is my understanding is that this method is proposed in the paper to reduce the subjectivity of user in selecting k using elbow method. Let's see a similar case in a clustering problem.

# clustering iris data
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_normalized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
# X_normalized.shape: (150, 4)

inertia_values = []
for k in range(2, 146):  # k: number of clusters
    km = KMeans(n_clusters=k).fit(X_normalized)
    inertia_values.append(km.inertia_)

Let's plot insertia values as well as change-in-inertial values:

Note the spike when k is large. In this case, the inerita becomes very small and hence the denominator becomes smaller which can result in spike in the plot. So, I am not sure if the change-in-profile formula proposed in the paper is a good approach, particularly if the number of snippets almost covers the whole T.

Let's go back to the code I provided at the top of this comment:

What happens if we consider k=10 snippets? Note that the profile area plot is descending. When k=10 (which is the maximum number of snippets since n/m == 10), then the area becomes 0 at the end. Note that, in this case, we get divide-by-0 error (!!) when we want to compute the change-in-profile value!

When I set k=10 snippet, the last value in indices (which is basicallty the start index of last snippet) is always 0 for seed in range(20). I would like to understand why since the index of first and second snippet changes from one seed to another.

seanlaw · 2023-08-10T02:32:56Z

@NimaSarajpoor Can you please check the failed test?

NimaSarajpoor · 2023-08-10T02:49:52Z

[update]
I had a discussion with the first auther of the paper. Here are some notes:

The last subsequence should be considered. However, its impact may not be that much when n >> m, where n is the length of time series, and m is the length subsequence. IMO, we can say this statement for any subseq in the time series. This is like saying:
"For n // m data points, removing one should not affect the k clusters that much."

So, to make sure we are not missing any data, the last subsequence should be considered.

k == n // m means that we are considering maximum number of clusters. In this case, the change-in-profile area for 1 to k can have a U shape.
The last element of the profile area vector for k == n // m snippets is 0.

NimaSarajpoor · 2023-08-10T02:53:11Z

@NimaSarajpoor Can you please check the failed test?

I think they are related to #881, which is [or should be] fixed by #892 [which has not been merged yet]

NimaSarajpoor · 2023-08-10T03:16:28Z

Also, in the second-to-last plot (Profile Area) go from range(1,11) when k=10 (instead of range(1,10)?

Are you referring to the plot provided in the notebook? I intentionally did not change the value of k. So, it is still set to 9. I wanted to just show the impact of considering the last subsequence while keeping everything else untouched.

However, now that we consider the last subsequence, the max value of k is 10. So, should I revise the notebook and use k=10? [note that, in this case, the last value of profile_area array becomes 0]

seanlaw · 2023-08-10T10:56:45Z

Please pull main into this branch and let me know when this is ready to be merged

NimaSarajpoor · 2023-08-10T16:12:48Z

@seanlaw

However, now that we consider the last subsequence, the max value of k is 10. So, should I revise the notebook and use k=10? [note that, in this case, the last value of profile_area array becomes 0]

What do you think? Do you want me to change k from 9 to 10? In that case, we may add a sentence or two to explain why there is a spike in the change-in-ProfileArea plot when k is getting close to n // m (the maximum value of k).

seanlaw · 2023-08-10T17:23:16Z

What do you think? Do you want me to change k from 9 to 10? In that case, we may add a sentence or two to explain why there is a spike in the change-in-ProfileArea plot when k is getting close to n // m (the maximum value of k).

Hmmm, I'm torn. If I were a user, I'd wonder why it doesn't go to 10... What do you think? Certainly, we'd need to explain it

NimaSarajpoor · 2023-08-10T19:15:06Z

If I were a user, I'd wonder why it doesn't go to 10... What do you think? Certainly, we'd need to explain it

Agree. Let's set k to 10. I will then explain why it does not fully replicate the figure provided in the paper, and explain the U-shape plot of change-in-ProfileArea.

seanlaw · 2023-08-10T19:22:42Z

I will then explain why it does not fully replicate the figure provided in the paper

As long as you explain how to interpret the plots/output, I don't think it is necessary to explain why it is different from the paper.

seanlaw · 2023-08-11T14:14:01Z

@NimaSarajpoor Is this one ready to be merged?

NimaSarajpoor · 2023-08-11T14:51:22Z

Would you mind checking the changes in the notebook? If you are okay with that, then I think it is ready.

review-notebook-app · 2023-08-11T16:20:22Z

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:21Z
----------------------------------------------------------------

Line #1.    plt.bar(range(2, k + 1), areas[:-1]/areas[1:] - 1.0, tick_label=range(2, k + 1))

Hmm, this is strange. When I do:

x = np.random.rand(10)
x[-1] = 0

x[:-1]/x[1:]

I don't get a divide by zero warning.

review-notebook-app · 2023-08-11T16:20:23Z

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:22Z
----------------------------------------------------------------

I think it should be all Change-in-Profile-Area and not change-in-ProfileArea

NimaSarajpoor · 2023-08-11T22:48:31Z

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:22Z ----------------------------------------------------------------

I think it should be all Change-in-Profile-Area and not change-in-ProfileArea

I changed and pushed the commit.

NimaSarajpoor · 2023-08-11T22:52:59Z

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:21Z ----------------------------------------------------------------

Line #1. plt.bar(range(2, k + 1), areas[:-1]/areas[1:] - 1.0, tick_label=range(2, k + 1))
Hmm, this is strange. When I do:

x = np.random.rand(10)
x[-1] = 0
x[:-1]/x[1:]

I don't get a divide by zero warning.

It is strage! I tried it with numpy 1.19 and 1.23, and got warning in both versions.
I also tried it online:

https://www.online-python.com/5lcazeR8o9

seanlaw · 2023-08-12T00:05:57Z

@NimaSarajpoor To avoid divide by zero, can we do this instead:

x[:-1] / np.where(x[1:]==0, np.nan, x[1:])

or

np.divide(x[:-1], x[1:], where=x[1:]!=0)

And then we can remove/omit the comment about divide by zero as well

NimaSarajpoor · 2023-08-12T01:32:00Z

@NimaSarajpoor To avoid divide by zero, can we do this instead:
x[:-1] / np.where(x[1:]==0, np.nan, x[1:])

Good idea! I went with this option, and pushed the changes :)

seanlaw · 2023-08-12T11:13:27Z

Thank you @NimaSarajpoor for this PR!

NimaSarajpoor added 2 commits August 4, 2023 23:13

fix n windows in naive snippet

25ebea8

fix n windows in performant snippet and minor bugs

0e9545b

NimaSarajpoor changed the title ~~Fix #890 Fix number of contiguous windows~~ Fix #890 Fix number of contiguous windows in snippet Aug 5, 2023

NimaSarajpoor added 2 commits August 8, 2023 01:07

Merge branch 'main' into fix_snippet_n_windows

03c305a

improve reproducability and update notebook

d41bf1e

Merge branch 'main' into fix_snippet_n_windows

9329ae1

NimaSarajpoor force-pushed the fix_snippet_n_windows branch from bade4c3 to 9329ae1 Compare August 11, 2023 05:22

set k=10 and add some explanation

a19faf2

minor changes

935fec2

Avoid numpy divide-by-zero warning to avoid unnecessary context

9674dfc

seanlaw merged commit 6663f5f into stumpy-dev:main Aug 12, 2023

NimaSarajpoor deleted the fix_snippet_n_windows branch September 4, 2023 03:24

Fix #890 Fix number of contiguous windows in snippet #894

Fix #890 Fix number of contiguous windows in snippet #894

Uh oh!

Conversation

NimaSarajpoor commented Aug 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Aug 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

seanlaw commented Aug 5, 2023

Uh oh!

NimaSarajpoor commented Aug 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 7, 2023

Uh oh!

seanlaw commented Aug 7, 2023

Uh oh!

NimaSarajpoor commented Aug 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Aug 8, 2023

Uh oh!

seanlaw commented Aug 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seanlaw commented Aug 10, 2023

Uh oh!

NimaSarajpoor commented Aug 10, 2023

Uh oh!

NimaSarajpoor commented Aug 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seanlaw commented Aug 10, 2023

Uh oh!

NimaSarajpoor commented Aug 10, 2023

Uh oh!

seanlaw commented Aug 10, 2023

Uh oh!

NimaSarajpoor commented Aug 10, 2023

Uh oh!

seanlaw commented Aug 10, 2023

Uh oh!

seanlaw commented Aug 11, 2023

Uh oh!

NimaSarajpoor commented Aug 11, 2023

Uh oh!

review-notebook-app bot commented Aug 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

review-notebook-app bot commented Aug 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 11, 2023

Uh oh!

NimaSarajpoor commented Aug 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

seanlaw commented Aug 12, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 12, 2023

Uh oh!

seanlaw commented Aug 12, 2023

Uh oh!

Uh oh!

NimaSarajpoor commented Aug 5, 2023 •

edited

Loading

codecov bot commented Aug 5, 2023 •

edited

Loading

NimaSarajpoor commented Aug 5, 2023 •

edited

Loading

NimaSarajpoor commented Aug 7, 2023 •

edited

Loading

NimaSarajpoor commented Aug 7, 2023 •

edited

Loading

NimaSarajpoor commented Aug 7, 2023 •

edited

Loading

seanlaw commented Aug 8, 2023 •

edited

Loading

NimaSarajpoor commented Aug 9, 2023 •

edited

Loading

NimaSarajpoor commented Aug 10, 2023 •

edited

Loading

NimaSarajpoor commented Aug 10, 2023 •

edited

Loading

review-notebook-app bot commented Aug 11, 2023 •

edited

Loading

review-notebook-app bot commented Aug 11, 2023 •

edited

Loading

NimaSarajpoor commented Aug 11, 2023 •

edited

Loading

seanlaw commented Aug 12, 2023 •

edited

Loading