Handle Rate Limiting for Replay Events #6710

Closed
@Lms24

Problem Statement

For all Sentry events, we currently do nothing when an event is not ingested by the Sentry backend due to rate limits. We do respect the retry-after period, during which we don't send events, but we don't retry sending the original event. This is fine for regular events (errors, transactions, sessions), as they are mostly atomic. For Replay, however, this is not the case, because a single replay is sent as multiple events (segments). If one segment goes missing, the replay cannot be continued after that segment, as one or more diffs would be missing.
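To make the failure mode concrete, here is a minimal sketch of the drop-on-rate-limit behavior described above. It is not the SDK's actual transport: `DroppingTransport`, `BackendResponse`, the 60-second fallback, and the simulated backend are all assumptions made for illustration.

```typescript
// Hypothetical types for illustration; not the SDK's real transport API.
interface BackendResponse {
  status: number;
  retryAfterSeconds?: number;
}

type SendResult = "sent" | "dropped";

class DroppingTransport {
  private readonly post: (segmentId: number) => BackendResponse;
  private disabledUntilMs = 0;

  constructor(post: (segmentId: number) => BackendResponse) {
    this.post = post;
  }

  send(segmentId: number, nowMs: number): SendResult {
    // Inside the retry-after window nothing is sent and nothing is kept.
    if (nowMs < this.disabledUntilMs) return "dropped";

    const res = this.post(segmentId);
    if (res.status === 429) {
      const retryAfterSeconds = res.retryAfterSeconds ?? 60; // assumed fallback
      this.disabledUntilMs = nowMs + retryAfterSeconds * 1000;
      return "dropped"; // the rate-limited segment itself is never retried
    }
    return "sent";
  }
}

// Simulated backend that rate-limits segment 2 with a 5 second retry-after.
function deliveredSegments(): number[] {
  const received: number[] = [];
  const transport = new DroppingTransport((segmentId) => {
    if (segmentId === 2) return { status: 429, retryAfterSeconds: 5 };
    received.push(segmentId);
    return { status: 200 };
  });
  // One segment every 3 seconds: segment 3 falls inside the window.
  for (let segmentId = 0; segmentId < 5; segmentId++) {
    transport.send(segmentId, segmentId * 3000);
  }
  return received;
}
```

In this simulation the backend ends up with segments 0, 1 and 4. Segments 2 and 3 are gone, so the diffs in segment 4 have nothing to apply to.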

Solution Brainstorm

We have a few options for handling replays and replay events when we hit a rate limit:

Option A: Splitting Replays

When we hit a rate limit, we pause the replay and, once the retry-after period has expired, we start a new replay with a new checkout. The obvious question here is: Can we link the two (or more) replays effectively? This will probably require additional complexity in the SDK and in the Sentry Replay UI. Possibly also for replay event ingestion (not sure here...)

Pros:

  • We get a functional replay after the rate limit period

Cons:

  • We end up with multiple replays per session
  • Linking adds complexity to SDK, UI and possibly ingestion
  • We still lose the window during the rate limit period

Option B: Pausing the Replay

When we hit a rate limit, we pause the replay and continue the same replay after the rate limit period has expired. When we restart, we take a full snapshot, which should theoretically make it possible to continue the replay even though we obviously missed segments during the rate limit period. IIRC this should work out and users would basically see a paused/inactive period of time.

Q: Can we show users in the UI that the "missing" segments are due to rate limits? What information do we need to pass along, and when?

Pros:

  • We get one functional replay

Cons:

  • We still lose the window during the rate limit period, which will be shown to users as a period of inactivity in the replay
  • There is still some complexity in implementing this in the SDK, but at least none on the ingestion side and mostly none in the UI (unless we want to show some sort of explanation for the inactivity).
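A rough sketch of what Option B could look like on the SDK side, assuming a small state holder that decides whether the next segment may be sent and whether it has to start with a full snapshot. `PausableReplay`, `SegmentPlan`, and `planNextSegment` are hypothetical names, not existing SDK APIs. The sketch also keeps segment ids contiguous across the pause, which is an assumption and not something decided here.

```typescript
// Hypothetical sketch of Option B; names are illustrative only.
interface SegmentPlan {
  replayId: string;
  segmentId: number;
  fullSnapshot: boolean; // true when the segment must begin with a checkout
}

class PausableReplay {
  private readonly replayId: string;
  private paused = false;
  private resumeAtMs = 0;
  private nextSegmentId = 0;

  constructor(replayId: string) {
    this.replayId = replayId;
  }

  // Called when the backend answers with a rate limit.
  onRateLimited(nowMs: number, retryAfterMs: number): void {
    this.paused = true;
    this.resumeAtMs = nowMs + retryAfterMs;
  }

  // Returns null while paused; otherwise describes the next segment to send.
  planNextSegment(nowMs: number): SegmentPlan | null {
    if (this.paused && nowMs < this.resumeAtMs) return null;

    // The first segment after a pause needs a full snapshot, because the
    // incremental diffs recorded during the pause were never sent.
    const resuming = this.paused;
    this.paused = false;

    const segmentId = this.nextSegmentId;
    this.nextSegmentId += 1;
    return {
      replayId: this.replayId,
      segmentId,
      fullSnapshot: resuming || segmentId === 0,
    };
  }
}
```

The replay id never changes, so the UI would see a single replay with a gap in time. A flag on the resuming segment (not shown) could be one way to tell the UI that the gap was caused by a rate limit.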

Option C: Retrying rate-limited Replay Requests

In order to not lose any segments, we could leave events that were rate-limited in the queue and retry sending them at a later time. There are implications around this, as we would potentially accumulate a lot of events in the queue, which we'd try to re-send after the first rate limit period in addition to newer segments. This increases the potential for more rate limits occurring at that time, which in turn increases the number of queued events, and so on.
This would occur even if we only retried a request one, two, or three times.

Pros:

  • We get one functional replay with the events during the rate limiting period included

Cons:

  • Can lead to an increase of queued events on the client
  • Can lead to more rate limits after the initial one
  • Possibly also has effects on sending of other Sentry events (??)
    ==> Is this scalable at all?
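The feedback loop behind these cons can be illustrated with a small deterministic simulation. This is a toy model under stated assumptions, not a description of how Sentry's quotas actually work: the backend accepts at most `quotaPerTick` segments per tick, the client records `segmentsPerTick` new segments per tick, and any leftover segment triggers a rate limit that blocks sending for `backoffTicks` ticks. All names are hypothetical.

```typescript
// Toy model of Option C; all parameters and names are illustrative.
interface SimOptions {
  ticks: number;
  quotaPerTick: number; // segments the backend accepts per tick
  segmentsPerTick: number; // segments the client records per tick
  backoffTicks: number; // ticks to wait after a rate limit
  initialBacklog: number; // segments already queued at the start
}

// Returns the client-side queue length after each tick.
function simulateRetryQueue(o: SimOptions): number[] {
  let queued = o.initialBacklog;
  let blockedUntil = 0; // first tick at which sending is allowed again
  const history: number[] = [];

  for (let tick = 0; tick < o.ticks; tick++) {
    queued += o.segmentsPerTick; // recording continues while we wait

    if (tick >= blockedUntil) {
      const sent = Math.min(queued, o.quotaPerTick);
      queued -= sent;
      if (queued > 0) {
        // The next attempt exceeds the quota: the segment stays queued
        // and the client backs off.
        blockedUntil = tick + 1 + o.backoffTicks;
      }
    }
    history.push(queued);
  }
  return history;
}

const steady = simulateRetryQueue({
  ticks: 9, quotaPerTick: 3, segmentsPerTick: 2, backoffTicks: 3, initialBacklog: 0,
});
const afterBurst = simulateRetryQueue({
  ticks: 9, quotaPerTick: 3, segmentsPerTick: 2, backoffTicks: 3, initialBacklog: 2,
});
console.log(steady); // [0, 0, 0, 0, 0, 0, 0, 0, 0]
console.log(afterBurst); // [1, 3, 5, 7, 6, 8, 10, 12, 11]
```

In this model the quota (3 per tick) is higher than the recording rate (2 per tick), so the client is fine in steady state. A single small burst is enough to trigger one rate limit, and from then on every backoff period adds more segments than the next send window can drain, so the queue keeps growing.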

My strong feeling is that option B is probably the best but I'm happy to hear everyone's opinions.
