Skip to content

Fix #890 Fix number of contiguous windows in snippet #894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Aug 12, 2023

Conversation

NimaSarajpoor
Copy link
Collaborator

@NimaSarajpoor NimaSarajpoor commented Aug 5, 2023

This PR fixes issue #890.

@NimaSarajpoor NimaSarajpoor changed the title Fix #890 Fix number of contiguous windows Fix #890 Fix number of contiguous windows in snippet Aug 5, 2023
@codecov
Copy link

codecov bot commented Aug 5, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (900b10c) 98.93% compared to head (9674dfc) 98.93%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #894   +/-   ##
=======================================
  Coverage   98.93%   98.93%           
=======================================
  Files          84       84           
  Lines       14236    14292   +56     
=======================================
+ Hits        14084    14140   +56     
  Misses        152      152           
Files Changed Coverage Δ
stumpy/snippets.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 5, 2023

@NimaSarajpoor Can you please double check that this produces the same result in the existing tutorial? Also, can you please insert a reference to the original issue in the initial comment (I notice that this has been missing)?

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 5, 2023

Can you please double check that this produces the same result in the existing tutorial?

Good point. I will check that out

Also, can you please insert a reference to the original issue in the initial comment (I notice that this has been missing)?

Done! [if a summary of the issue is needed, plz let me know and I will revise the initial comment]

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 7, 2023

Can you please double check that this produces the same result in the existing tutorial?

Good point. I will check that out

The last figure changes in this branch (see Figure below)

fig_this_branch

The last bar becomes taller. Two notes regarding this matter:

(1) The matlab code provided in the snippet supporting webpage seems to suffer from the same problem:

# in snippetfinder.m

# some code
len = sub * ceil(length(data)/sub) - length(data);
%% padding zeros to the end of the time series. 
data = [data;zeros(len,1)];
 
%% compute all the profiles
for i = 1:sub:length(data) - sub 
    indexes = [indexes,i];
    distance = fastMPdist_SS(data,data(i:i+ sub) ,round(sub*per/100),0.05);
    distances = [distances;distance];
end
# some code

In this MATLAB code, as shown above, the last snippet subsequence is ignored if len(data) % m is zero.

(2) As a follow up to my previous note, I checked the length of the time series T in the tutorial and realized its length is 2000. m is 200, which means len(T) % m == 0. Therefore, this is the case that can be affected by the issue in the code.

Let's consider two scenarios:
In what follows, the original T has the length of 2000.

  • Scenario I: Let's do T = T[:-1], and consider this new T as the input. This new T will be padded by one value [in the snippet module] so that its length becomes 2000, and hence dividable by m. The number of contgiuous windows is 9 according to both existing code in branch main as well as the code in this branch. That is why this new T, of length 1999, does not change the last figure of the tutorial.

  • Scenario II: Let's do T = np.append(T, 0), and consdier this new T as the input. Note that in this case, the number contiguous windows according to the existing code in the branch main will be 10 [which is the same the number of contiguous windows in this branch]. In this case, the last figure [in the branch main but with new T, of length 2001] will be similar to what I showed above.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 7, 2023

Q: Can we justify the figure shown in the previous comment?
A: Maybe! Let's see!

Let's use k=9 to get 9 snippets, and let's plot them:

plt.plot(T, color='k')
for idx in indices:
    plt.plot(np.arange(idx, idx+m), T[idx:idx+m], color='b')

idx = indices[-1]  # index of last snippet, corresponding to the tall bar positioned at `k=9`
plt.plot(np.arange(idx, idx+m)
T[idx:idx+m], color='r')
plt.show()

snippet_with_colors

Note that the 9-th snippet is the one colored in red. This seems to be a pattern that is a little bit different compared to the patterns on the first half and the ones on the second half of T. Although this red pattern is like a combination of those two patterns, it can be considered as one unique pattern. So, maybe that is why the bar chart for k=9 increases since it is now associated with the part of the data that shows a new behavior.


Q: Can we further justify that the code in this branch makes more sense?
A: I think we can. Let's try k=10 and use the snippet code of this branch.

k = 10
snippets, indices, profiles, fractions, areas, regimes = stumpy.snippets(T, m, k=k, s=s)

# >>> areas
# array([5981.73838943, 1887.02451582, 1342.11397203,  991.55393208,
        723.79295706,  522.38654936,  327.99156887,  168.97041678,
         59.77463888,    0.        ])

Note that area is 0 for the "10-th" snippet, and all data is covered.

>>> indices
array([ 800, 1600, 1200, 1000, 1400,  400, 1800,  200,  600,    0])

Note that the last value of the array is 0, showing the start index of the last snippet computed with k=10. This is basically the black pattern shown in the figure above, which was missing when k==9.

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
[[if needed] we may also contact the first author and/or Eamon Keogh to see what they think regarding this matter.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 7, 2023

[[if needed] we may also contact the first author and/or Eamon Keogh to see what they think regarding this matter.

I am fine either way. If you have time to reach out to them then please do. Otherwise, I think I am satisfied with your justifications and am willing to merge. Your call.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 7, 2023

Sure. I will send an email to them as I want to make sure I am not missing anything here.


[update-1]

(1) I sent an email. @seanlaw Please feel free to merge if I don't provide any update regarding this matter in the next few days.

(2) In the paper, we see the following statement:

Note that the number of non-overlapping subsequences is much smaller than the sliding windows, just ⌊n/m⌋.

Hence, I think the fixes [in this PR] make sense as it is based on what stated in the paper.

[update-2]
(1) I heard back from the first auther. I provided her with the data as requested. She is going to take a look.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@seanlaw
Copy link
Contributor

seanlaw commented Aug 8, 2023

@NimaSarajpoor As I re-read your (excellent) comments above, it seems strange that the change-in-profile-area jumps so much for the 10th snippet. At the end of the day, given k=10, how would you interpret the change-in-profile-area and choose the "optimal number of snippets (k)"?

Also, in the second-to-last plot (Profile Area) go from range(1,11) when k=10 (instead of range(1,10)?

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 9, 2023

@seanlaw
[In advance, sorry for this long comment!]

Before answering the question, I would like to correct the justification I provided before:

Note that the 9-th snippet is the one colored in red. This seems to be a pattern that is a little bit different compared to the patterns on the first half and the ones on the second half of T

I think this justification, that I made recently, is not correct. Let's see something interesting here:

# in this PR

# to store the start index of snippets for different seeds
first_snippet = []
second_snippet = []
last_snippet = []

for seed in range(10):
    T = warp_add_noise(df, seed=seed)
    snippets, indices, profiles, fractions, areas, regimes = stumpy.snippets(T, m=200, k=9, s=70)
    
    first_snippet.append(indices[0])
    second_snippet.append(indices[1])
    last_snippet.append(indices[-1])

>>> first_snippet  # for different seeds
[800, 200, 200, 400, 200, 400, 200, 200, 800, 400]

>>> second_snippet # for different seeds
[1200, 1200, 1600, 1200, 1200, 1200, 1400, 1600, 1200, 1200]

>>> last_snippet
[600, 1800, 400, 800, 800, 600, 1800, 400, 600, 600]

Note 1) For different seeds, the first snippet is always coming from the first half of T

Note 2) For different seeds, the second snippet is always coming from the second half of T
[so, it seems that the snippet algo can "understand" similar patterns and try to find unique ones in the first two snippets]

Note 3) The last snippet, however, is from the first half sometimes, and some other times, it is from the second half of T. So, the last snippet is NOT necessarily related to the pattern in the middle of T, where two uninique patters are being connected to each other.

Previouslly, the code in the tutorial notebook did not have the param seed, and that resulted in misleading me. (I have now added the param seed to improve reproducibility) I will come back to the code above later at the end as I need to discuss one more thing about it. But, for now, let's discuss the questions you asked.

It seems strange that the change-in-profile-area jumps so much for the 10th snippet. At the end of the day, given k=10, how would you interpret the change-in-profile-area and choose the "optimal number of snippets (k)"?

Let's discuss the formula areas[:-1]/areas[1:] - 1.0, used in the notebook. This formula is provided by the paper to compute "change in profile area" from which an optimal k can be discovered. Note that this formula is equal to (areas[:-1] - areas[1:] ) / areas[1:], which is basically "a slope that is divided by the second point of that slope". Hence, this value can become huge if the denominator becomes small. It is my understanding is that this method is proposed in the paper to reduce the subjectivity of user in selecting k using elbow method. Let's see a similar case in a clustering problem.

# clustering iris data
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_normalized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
# X_normalized.shape: (150, 4)

inertia_values = []
for k in range(2, 146):  # k: number of clusters
    km = KMeans(n_clusters=k).fit(X_normalized)
    inertia_values.append(km.inertia_)

Let's plot insertia values as well as change-in-inertial values:

inertia

inertia_change

Note the spike when k is large. In this case, the inerita becomes very small and hence the denominator becomes smaller which can result in spike in the plot. So, I am not sure if the change-in-profile formula proposed in the paper is a good approach, particularly if the number of snippets almost covers the whole T.


Let's go back to the code I provided at the top of this comment:

What happens if we consider k=10 snippets? Note that the profile area plot is descending. When k=10 (which is the maximum number of snippets since n/m == 10), then the area becomes 0 at the end. Note that, in this case, we get divide-by-0 error (!!) when we want to compute the change-in-profile value!


When I set k=10 snippet, the last value in indices (which is basicallty the start index of last snippet) is always 0 for seed in range(20). I would like to understand why since the index of first and second snippet changes from one seed to another.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 10, 2023

@NimaSarajpoor Can you please check the failed test?

@NimaSarajpoor
Copy link
Collaborator Author

[update]
I had a discussion with the first auther of the paper. Here are some notes:

  1. The last subsequence should be considered. However, its impact may not be that much when n >> m, where n is the length of time series, and m is the length subsequence. IMO, we can say this statement for any subseq in the time series. This is like saying:
    "For n // m data points, removing one should not affect the k clusters that much."

So, to make sure we are not missing any data, the last subsequence should be considered.

  1. k == n // m means that we are considering maximum number of clusters. In this case, the change-in-profile area for 1 to k can have a U shape.

  2. The last element of the profile area vector for k == n // m snippets is 0.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 10, 2023

@NimaSarajpoor Can you please check the failed test?

I think they are related to #881, which is [or should be] fixed by #892 [which has not been merged yet]

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 10, 2023

Also, in the second-to-last plot (Profile Area) go from range(1,11) when k=10 (instead of range(1,10)?

Are you referring to the plot provided in the notebook? I intentionally did not change the value of k. So, it is still set to 9. I wanted to just show the impact of considering the last subsequence while keeping everything else untouched.

However, now that we consider the last subsequence, the max value of k is 10. So, should I revise the notebook and use k=10? [note that, in this case, the last value of profile_area array becomes 0]

@seanlaw
Copy link
Contributor

seanlaw commented Aug 10, 2023

Please pull main into this branch and let me know when this is ready to be merged

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw

However, now that we consider the last subsequence, the max value of k is 10. So, should I revise the notebook and use k=10? [note that, in this case, the last value of profile_area array becomes 0]

What do you think? Do you want me to change k from 9 to 10? In that case, we may add a sentence or two to explain why there is a spike in the change-in-ProfileArea plot when k is getting close to n // m (the maximum value of k).

@seanlaw
Copy link
Contributor

seanlaw commented Aug 10, 2023

What do you think? Do you want me to change k from 9 to 10? In that case, we may add a sentence or two to explain why there is a spike in the change-in-ProfileArea plot when k is getting close to n // m (the maximum value of k).

Hmmm, I'm torn. If I were a user, I'd wonder why it doesn't go to 10... What do you think? Certainly, we'd need to explain it

@NimaSarajpoor
Copy link
Collaborator Author

If I were a user, I'd wonder why it doesn't go to 10... What do you think? Certainly, we'd need to explain it

Agree. Let's set k to 10. I will then explain why it does not fully replicate the figure provided in the paper, and explain the U-shape plot of change-in-ProfileArea.

@seanlaw
Copy link
Contributor

seanlaw commented Aug 10, 2023

I will then explain why it does not fully replicate the figure provided in the paper

As long as you explain how to interpret the plots/output, I don't think it is necessary to explain why it is different from the paper.

@NimaSarajpoor NimaSarajpoor force-pushed the fix_snippet_n_windows branch from bade4c3 to 9329ae1 Compare August 11, 2023 05:22
@seanlaw
Copy link
Contributor

seanlaw commented Aug 11, 2023

@NimaSarajpoor Is this one ready to be merged?

@NimaSarajpoor
Copy link
Collaborator Author

Would you mind checking the changes in the notebook? If you are okay with that, then I think it is ready.

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 11, 2023

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:21Z
----------------------------------------------------------------

Line #1.    plt.bar(range(2, k + 1), areas[:-1]/areas[1:] - 1.0, tick_label=range(2, k + 1))

Hmm, this is strange. When I do:

x = np.random.rand(10)

x[-1] = 0
x[:-1]/x[1:]

I don't get a divide by zero warning.



@review-notebook-app
Copy link

review-notebook-app bot commented Aug 11, 2023

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:22Z
----------------------------------------------------------------

I think it should be all Change-in-Profile-Area and not change-in-ProfileArea


@NimaSarajpoor
Copy link
Collaborator Author

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:22Z ----------------------------------------------------------------

I think it should be all Change-in-Profile-Area and not change-in-ProfileArea

I changed and pushed the commit.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Aug 11, 2023

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-08-11T16:20:21Z ----------------------------------------------------------------

Line #1. plt.bar(range(2, k + 1), areas[:-1]/areas[1:] - 1.0, tick_label=range(2, k + 1))
Hmm, this is strange. When I do:

x = np.random.rand(10)
x[-1] = 0
x[:-1]/x[1:]

I don't get a divide by zero warning.

It is strage! I tried it with numpy 1.19 and 1.23, and got warning in both versions.
I also tried it online:

https://www.online-python.com/5lcazeR8o9

@seanlaw
Copy link
Contributor

seanlaw commented Aug 12, 2023

@NimaSarajpoor To avoid divide by zero, can we do this instead:

x[:-1] / np.where(x[1:]==0, np.nan, x[1:])

or

np.divide(x[:-1], x[1:], where=x[1:]!=0)

And then we can remove/omit the comment about divide by zero as well

@NimaSarajpoor
Copy link
Collaborator Author

@NimaSarajpoor To avoid divide by zero, can we do this instead:

x[:-1] / np.where(x[1:]==0, np.nan, x[1:])

Good idea! I went with this option, and pushed the changes :)

@seanlaw seanlaw merged commit 6663f5f into stumpy-dev:main Aug 12, 2023
@seanlaw
Copy link
Contributor

seanlaw commented Aug 12, 2023

Thank you @NimaSarajpoor for this PR!

@NimaSarajpoor NimaSarajpoor deleted the fix_snippet_n_windows branch September 4, 2023 03:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants