Implement ProcessCdnLog and ProcessCdnLogQueue background jobs #8036


Merged
10 commits merged into rust-lang:main on Feb 1, 2024

Conversation

Turbo87
Member

@Turbo87 Turbo87 commented Jan 31, 2024

This PR builds on top of #8010, and implements two new background worker jobs.

The ProcessCdnLog job takes an AWS region, an S3 bucket name, and an object path inside the bucket, connects to the bucket via the object_store crate, and then stream-downloads the file while running count_downloads() on it. If download counting is successful, it prints a summary of the result to the logs.
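To make the counting step concrete, here is a hypothetical, heavily simplified stand-in for the count_downloads() logic described above: it scans CDN log lines for `.crate` download paths and tallies them per (crate, version) pair. The real implementation stream-downloads the S3 object via the object_store crate and parses the actual CDN log format; the path layout and helper below are illustrative assumptions, not the PR's code.

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Cursor};

/// Simplified sketch of download counting: tally `.crate` downloads
/// per (crate name, version) from a stream of log lines.
fn count_downloads(reader: impl BufRead) -> HashMap<(String, String), u64> {
    let mut counts = HashMap::new();
    for line in reader.lines().map_while(Result::ok) {
        // Assume paths like `/crates/serde/serde-1.0.195.crate` appear
        // somewhere in the line (the real log format is more involved).
        let Some(start) = line.find("/crates/") else { continue };
        let path = &line[start + "/crates/".len()..];
        let Some((name, rest)) = path.split_once('/') else { continue };
        let Some(file) = rest.split_whitespace().next() else { continue };
        let Some(stem) = file.strip_suffix(".crate") else { continue };
        let Some(version) = stem
            .strip_prefix(name)
            .and_then(|v| v.strip_prefix('-'))
        else {
            continue;
        };
        *counts
            .entry((name.to_string(), version.to_string()))
            .or_insert(0) += 1;
    }
    counts
}

fn main() {
    let log = "\
2024-01-31 GET /crates/serde/serde-1.0.195.crate 200
2024-01-31 GET /crates/serde/serde-1.0.195.crate 200
2024-01-31 GET /crates/rand/rand-0.8.5.crate 200
2024-01-31 GET /index.html 200
";
    let counts = count_downloads(BufReader::new(Cursor::new(log)));
    for ((name, version), n) in &counts {
        println!("{name} {version}: {n}");
    }
}
```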

The ProcessCdnLogQueue job connects to an AWS SQS queue, receives messages from it, parses them and enqueues ProcessCdnLog jobs based on the content of these messages.
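The queue-draining control flow can be sketched roughly as follows. This is not the PR's actual code: the real job talks to AWS SQS, and the message body is an S3 event notification; here the queue and enqueueing side are abstracted behind illustrative traits (`CdnLogQueue`, `Message`, a space-separated placeholder body format) so the loop is visible without AWS dependencies.

```rust
struct Message {
    id: String,
    receipt_handle: String,
    body: String, // an S3 event notification JSON in the real system
}

trait CdnLogQueue {
    fn receive_messages(&mut self, max: usize) -> Vec<Message>;
    fn delete_message(&mut self, receipt_handle: &str);
}

/// Extract (region, bucket, path) from a message body. Placeholder
/// format "region bucket path"; the real code parses JSON.
fn parse_message(body: &str) -> Option<(String, String, String)> {
    let mut parts = body.split_whitespace();
    Some((
        parts.next()?.to_string(),
        parts.next()?.to_string(),
        parts.next()?.to_string(),
    ))
}

fn run(queue: &mut impl CdnLogQueue, mut enqueue_job: impl FnMut(String, String, String)) {
    loop {
        let messages = queue.receive_messages(10);
        if messages.is_empty() {
            break; // queue drained, stop polling
        }
        for message in messages {
            if let Some((region, bucket, path)) = parse_message(&message.body) {
                // Enqueue a `ProcessCdnLog` job for this log file.
                enqueue_job(region, bucket, path);
            }
            println!("Deleting message {} from the CDN log queue…", message.id);
            queue.delete_message(&message.receipt_handle);
        }
    }
}

/// In-memory queue used to exercise the loop.
struct MockQueue(Vec<Message>);

impl CdnLogQueue for MockQueue {
    fn receive_messages(&mut self, max: usize) -> Vec<Message> {
        let n = max.min(self.0.len());
        self.0.drain(..n).collect()
    }
    fn delete_message(&mut self, _receipt_handle: &str) {}
}

fn main() {
    let mut queue = MockQueue(vec![Message {
        id: "1".into(),
        receipt_handle: "rh-1".into(),
        body: "us-west-1 cloudfront-logs path/to/log.gz".into(),
    }]);
    let mut enqueued = Vec::new();
    run(&mut queue, |region, bucket, path| {
        enqueued.push((region, bucket, path))
    });
    println!("{enqueued:?}");
}
```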

You may be asking: "Where are the counted downloads saved to the database?!" and the answer is: nowhere. The plan is to run this read-only setup in production for a couple of days to ensure that the download counting works correctly with production data. Once we have verified that everything works as expected, we will adjust the ProcessCdnLog job to save the result to the database instead of or in addition to reporting it to the logs.

Best reviewed commit by commit 😉

@Turbo87 Turbo87 added the C-internal 🔧 (Category: Nonessential work that would make the codebase more consistent or clear) and A-backend ⚙️ labels on Jan 31, 2024
@Turbo87 Turbo87 requested a review from a team January 31, 2024 09:16

codecov bot commented Jan 31, 2024

Codecov Report

Attention: 118 lines in your changes are missing coverage. Please review.

Comparison is base (1f4abb9) 87.55% compared to head (c27b101) 87.38%.

| File | Patch % | Lines missing |
| --- | --- | --- |
| src/worker/jobs/downloads.rs | 86.42% | 52 Missing ⚠️ |
| src/config/cdn_log_storage.rs | 29.03% | 22 Missing ⚠️ |
| src/sqs.rs | 56.86% | 22 Missing ⚠️ |
| src/config/cdn_log_queue.rs | 0.00% | 17 Missing ⚠️ |
| src/admin/enqueue_job.rs | 0.00% | 2 Missing ⚠️ |
| src/config/server.rs | 0.00% | 2 Missing ⚠️ |
| src/bin/background-worker.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8036      +/-   ##
==========================================
- Coverage   87.55%   87.38%   -0.17%     
==========================================
  Files         265      270       +5     
  Lines       25663    26172     +509     
==========================================
+ Hits        22468    22871     +403     
- Misses       3195     3301     +106     


@Turbo87
Member Author

Turbo87 commented Jan 31, 2024

Since CodeCov was unhappy I've now doubled the size of the diff and added a couple of tests... 😅

Contributor

@LawnGnome LawnGnome left a comment


Looking forward to having this "in production" so we can check it against the existing logic!


debug!("Deleting message {message_id} from the CDN log queue…");
queue
    .delete_message(receipt_handle)
Contributor


Maybe a question for infra, but this queue is configured to only deliver messages exactly once, right?

(What I'm concerned about here is double counting if this job runs twice, and the second receive_messages() call occurs before messages being handled by the first job are deleted.)

Member Author


> this queue is configured to only deliver messages exactly once, right?

From what I've seen so far, that does not appear to be the case. I assume that if it were, the delivered messages would be auto-deleted upon reception, right?

Member Author

@Turbo87 Turbo87 Feb 1, 2024


I will treat this as a non-blocking concern since the code here is experimental and read-only for now anyway. If we can change the queue to use once-only delivery then we can probably just delete the delete_message() calls here.

Member


> Maybe a question for infra, but this queue is configured to only deliver messages exactly once, right?

This is a standard queue that only guarantees at-least-once delivery. If we want exactly-once, we'd need to recreate the queue as a FIFO queue.
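Since the standard queue only guarantees at-least-once delivery, one way to guard against double counting (beyond switching to a FIFO queue) would be to make handling idempotent by deduplicating on the log file's identity. The sketch below is a hypothetical illustration of that idea, not anything in this PR; a production version would likely persist the seen-set in the database rather than keep it in memory.

```rust
use std::collections::HashSet;

/// Hypothetical deduplication guard for an at-least-once queue:
/// processing is keyed on (bucket, path), so a redelivered message
/// for an already-counted log file can be skipped.
struct DedupGuard {
    seen: HashSet<(String, String)>,
}

impl DedupGuard {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns true only the first time a (bucket, path) pair is seen.
    fn first_delivery(&mut self, bucket: &str, path: &str) -> bool {
        self.seen.insert((bucket.to_string(), path.to_string()))
    }
}

fn main() {
    let mut guard = DedupGuard::new();
    assert!(guard.first_delivery("cloudfront-logs", "2024/01/31/log.gz"));
    // Second delivery of the same message: skip it instead of
    // counting the same log file twice.
    assert!(!guard.first_delivery("cloudfront-logs", "2024/01/31/log.gz"));
    println!("duplicate delivery detected and skipped");
}
```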

This allows us to access the `config` at runtime too, instead of only having access at startup.
@Turbo87 Turbo87 force-pushed the wip/queue-job branch 2 times, most recently from 2cb6c5c to 64b0a7f Compare February 1, 2024 08:48
@Turbo87 Turbo87 enabled auto-merge February 1, 2024 08:52
@Turbo87 Turbo87 merged commit 348c52b into rust-lang:main Feb 1, 2024
@Turbo87 Turbo87 deleted the wip/queue-job branch February 1, 2024 08:59