
Feature request: Optimize Circuit-Breaking for SQS FIFO Partial Processor on records with different group IDs #2981

Closed
@rubenfonseca

Description


Use case

Discussed in #2936

We confirmed that SQS FIFO queues can deliver messages from different Group IDs to the same Lambda invocation. Right now, when a record fails processing, we short-circuit and fail the rest of the items in the invocation, regardless of whether they share the failing record's message group ID.

This suboptimal behavior can cause unaffected messages to land in DLQs or fail outright.

We need to explore the idea of continuing to process other group IDs on the same invocation.

Solution/User Experience

When a message fails processing in the SQS FIFO processor, mark the remaining messages from the same group ID as failed without processing them. When a record with a different group ID is found, resume processing.

At the end, return all the failed messages from each message group ID.
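The proposed behavior can be sketched as follows. This is an illustrative sketch only, not the Powertools API: `process_fifo_batch` and `process_record` are hypothetical names, and records are assumed to be shaped like SQS event records with a `messageId` and `attributes.MessageGroupId`.

```python
def process_fifo_batch(records, process_record):
    """Process SQS FIFO records, short-circuiting only groups that already failed.

    Returns the Lambda partial-batch-response payload.
    """
    failed_group_ids = set()
    batch_item_failures = []

    for record in records:
        group_id = record["attributes"]["MessageGroupId"]

        # Once a group has failed, every later record in that group is
        # failed without processing, to preserve in-group ordering.
        if group_id in failed_group_ids:
            batch_item_failures.append({"itemIdentifier": record["messageId"]})
            continue

        try:
            process_record(record)
        except Exception:
            failed_group_ids.add(group_id)
            batch_item_failures.append({"itemIdentifier": record["messageId"]})

    # All failed messages, across every message group ID.
    return {"batchItemFailures": batch_item_failures}
```

With this shape, a failure in group "A" fails the rest of "A" but leaves records from group "B" untouched, instead of failing the whole remainder of the batch.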

Alternative solutions

Depending on how the implementation and behavior turn out, we might not make the new behavior the default, keeping it behind a feature flag instead.

Acknowledgement

From the discussion

Originally posted by duc00 August 8, 2023
Hello,

According to the AWS doc on Implementing partial batch responses:

If you're using this feature with a FIFO queue, your function should stop processing messages after the first failure and return all failed and unprocessed messages in batchItemFailures. This helps preserve the ordering of messages in your queue.
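Concretely, the `batchItemFailures` payload the doc refers to looks like this (message IDs here are illustrative):

```python
# Partial batch response after the first failure in a 5-record batch:
# the failing record and every unprocessed record after it are reported.
response = {
    "batchItemFailures": [
        {"itemIdentifier": "msg-3"},  # the record that failed
        {"itemIdentifier": "msg-4"},  # unprocessed records after it
        {"itemIdentifier": "msg-5"},
    ]
}
```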

I see that the Powertools implementation, SqsFifoPartialProcessor, strictly follows this recommendation. This question is therefore more about AWS's implementation of partial batch responses with FIFO queues. Still, posting it on this repo seems like a good entry point, since both AWS developers and the community interact on these subjects.

My problem is the following:
I am processing an SQS FIFO queue with SqsFifoPartialProcessor. My batch size is 10. Since the queue is not high-scale, the batch often contains messages with different message group IDs. When a failure occurs, the rest of the processing is stopped and all remaining records are returned to the queue as failures. So I often end up with valid records in my dead-letter queue just because they were batched with an unrelated invalid one.

My question:
Would it be valid, after a failure, to return only the remaining records that share the failing record's group ID, rather than all remaining records? The doc recommends the current implementation to preserve the ordering of messages in the queue, but I don't see why processing records with a different group ID would break that. Hence my question.

Many thanks!

Metadata


Labels

batch (Batch processing utility), feature (New feature or functionality)



Status

Shipped


