
Split log entries so each entry contains a maximum of 100 datapoints #64


Merged: 8 commits into awslabs:master on Jan 28, 2021

Conversation

@paulez (Contributor) commented on Jan 15, 2021

Issue #, if available: #10

Description of changes:

  • Submit no more than 100 datapoints per log entry.
  • Split log entries so each entry contains a maximum of 100 datapoints.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -14,3 +14,4 @@
DEFAULT_NAMESPACE = "aws-embedded-metrics"
MAX_DIMENSIONS = 9
MAX_METRICS_PER_EVENT = 100
MAX_DATAPOINTS_PER_EVENT = 100
Member commented:

Consider renaming to MAX_DATAPOINTS_PER_METRIC.

Author (@paulez) replied:

Updated!

@@ -53,24 +55,53 @@ def create_body() -> Dict[str, Any]:
current_body: Dict[str, Any] = create_body()
event_batches: List[str] = []
num_metrics_in_current_body = 0
num_datapoints_in_current_body = 0
Member commented:

I'm not sure I'm following why we need to track the number of datapoints in the body. There are 2 requirements:

  1. You cannot have more than 100 metric names in an EMF event
  2. A given metric name cannot have more than 100 samples

These two requirements bound the number of metric datapoints in an event to 10_000.
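A minimal sketch of the arithmetic behind that bound, using the constant name suggested later in this review (the code itself is illustrative, not the PR's):

MAX_METRICS_PER_EVENT = 100      # requirement 1: at most 100 metric names per event
MAX_DATAPOINTS_PER_METRIC = 100  # requirement 2: at most 100 samples per metric name

# Together the two limits cap a single EMF event at 10_000 datapoints.
assert MAX_METRICS_PER_EVENT * MAX_DATAPOINTS_PER_METRIC == 10_000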

Author (@paulez) replied:

I didn't understand the spec properly; I thought the limit of 100 samples applied to the whole body. I will correct this to limit each metric name to 100 samples.

@paulez (Contributor Author) commented on Jan 18, 2021

I've updated the logic to limit each metric to 100 datapoints per batch.
I've also added a test that combines the 100-datapoints and 100-metrics per-batch limits.
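A minimal sketch of the per-metric chunking described here (the helper chunk_datapoints is hypothetical, not the PR's actual code):

MAX_DATAPOINTS_PER_METRIC = 100

def chunk_datapoints(values):
    # Yield successive slices of at most 100 datapoints for one metric.
    for start in range(0, len(values), MAX_DATAPOINTS_PER_METRIC):
        yield values[start:start + MAX_DATAPOINTS_PER_METRIC]

# A metric with 250 samples spans 3 batches: 100, 100, then 50.
assert [len(c) for c in chunk_datapoints(list(range(250)))] == [100, 100, 50]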

event_batches: List[str] = []
num_metrics_in_current_body = 0
missing_data = True
Member commented:
Can you add some comments about what this is, and consider a more intuitive name? It's not really "missing" data; it's more like data left to process.

Author (@paulez) replied:

I've added comments and renamed to "remaining_data".
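For reference, a hypothetical sketch of the remaining_data pattern (an assumed shape, not the serializer's actual implementation): the flag records whether any metric still has unserialized datapoints, so the loop keeps emitting batches until everything is consumed.

metrics = {"Latency": list(range(250)), "Errors": list(range(30))}
cursor = {name: 0 for name in metrics}  # next unserialized index per metric
batches = []

remaining_data = True
while remaining_data:
    remaining_data = False
    body = {}
    for name, values in metrics.items():
        chunk = values[cursor[name]:cursor[name] + 100]
        if chunk:
            body[name] = chunk
            cursor[name] += len(chunk)
        if cursor[name] < len(values):
            remaining_data = True  # at least one metric still has data left
    batches.append(body)

assert len(batches) == 3  # Latency's 250 datapoints force 3 batches of <= 100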


# assert
datapoints_count = Counter()
for batch in results:
Member commented:
Should we have an assertion on the number of expected events?

Author (@paulez) replied:

I've added an extra assert on the batch count.

I've also added logic and an assert to validate that the slicing logic doesn't drop any datapoints.
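A minimal sketch of that kind of check, mirroring the Counter usage in the test excerpt above (the sample results data is hypothetical):

from collections import Counter
import json

results = ['{"Metric-0": [1, 2, 3]}', '{"Metric-0": [4, 5]}']  # hypothetical batches

# Count how many datapoints each metric carries across all batches.
datapoints_count = Counter()
for batch in results:
    for metric_name, values in json.loads(batch).items():
        datapoints_count[metric_name] += len(values)

# The slicing logic must not drop any of the original 5 datapoints.
assert datapoints_count["Metric-0"] == 5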

Compute the full expected results so we ensure the slicing logic doesn't
cause any datapoint to be lost.
@jaredcnance (Member) left a review comment:

Thanks for the PR and for your time!

@jaredcnance jaredcnance merged commit 3e5ccce into awslabs:master Jan 28, 2021
@jaredcnance mentioned this pull request on Feb 8, 2021
@hussam789 commented:
PR Code Suggestions ✨

Category: Possible issue
Fix test batch range

The test is checking the wrong batches. It iterates through
range(expected_batches), but it should check the batches produced by the
extra metric, which fall in the range expected_batches to expected_batches +
expected_extra_batches - 1.

tests/serializer/test_log_serializer.py [165-171]

 # extra metric with more datapoints
-for batch_index in range(expected_batches):
+for batch_index in range(expected_batches, expected_batches + expected_extra_batches):
     result_json = results[batch_index]
     result_obj = json.loads(result_json)
     metric_name = f"Metric-{metrics}"
     expected_datapoint_count = extra_datapoints % 100 if (batch_index == expected_batches + expected_extra_batches - 1) else 100
     assert len(result_obj[metric_name]) == expected_datapoint_count
Suggestion importance [1-10]: 7

Why: Adjusting the iteration range to cover the extra batches for the metric with more datapoints addresses a likely bug in the test, ensuring all batches are properly verified.

Impact: Medium
