Skip to content

NMV2 metrics fanout and corrections #5763

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 25 commits into
base: develop
Choose a base branch
from
Open

Conversation

durch
Copy link
Contributor

@durch durch commented May 15, 2025

This change is Reviewable

Summary

This PR enhances the Network Monitor with a "metrics fanout" capability, allowing monitoring metrics to be submitted to multiple API endpoints concurrently. It also introduces a new database schema for storing detailed
route performance data to improve network reliability tracking and analysis.

Key Features

  • Metrics Fanout System: Submit monitoring data to multiple nym-api instances simultaneously
  • Route Performance Tracking: Store route metrics in a new database table

Technical Changes

  • Added support for multiple NYM_API endpoints via the nym_apis CLI parameter and NYM_APIS environment variable
  • Created new routes table schema in the database to store detailed route metrics:
    • Records node IDs for each layer (layer1, layer2, layer3, gateway)
    • Tracks success/failure status for each packet route
    • Includes created_at timestamps, would be better if these were received at, but out of scope for now
    • Added appropriate indexes for efficient querying
  • Implemented parallel submission of route data to all configured API endpoints using FuturesUnordered
  • Added support for efficient binary data copying to the database for metrics storage
  • Enhanced error handling and logging for metrics submission

Testing Notes

  • The metrics fanout functionality has been tested against multiple API endpoints to ensure proper data submission
  • Route performance data can be verified in the database by querying the new routes table
  • The implementation includes proper error handling to ensure metrics submission continues even if one endpoint fails
  • Performance impact is minimal as metrics submission happens in parallel using asynchronous requests

Deployment Considerations

  • To use this feature, configure multiple API endpoints using either:
    • Command-line arguments: --nym-apis <API_URL1> <API_URL2> ...
    • Environment variable: NYM_APIS=<API_URL1>,<API_URL2>,...
  • Database schema will need to be updated using the new migration (20250513104800_routes_table.sql)
  • No changes to existing functionality or backwards compatibility issues
  • For optimal performance, API endpoints should be geographically distributed

This enhancement significantly improves the robustness of the Nym Network Monitor by allowing redundant metric submission to multiple endpoints, ensuring that network performance data is properly captured and available for
analysis even if individual API endpoints experience issues. The detailed route tracking will also provide valuable insights into network performance patterns and help identify potential bottlenecks.

Copy link

vercel bot commented May 15, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nym-explorer-v2 ❌ Failed (Inspect) Jun 2, 2025 10:55am
2 Skipped Deployments
Name Status Preview Comments Updated (UTC)
docs-nextra ⬜️ Ignored (Inspect) Visit Preview Jun 2, 2025 10:55am
nym-next-explorer ⬜️ Ignored (Inspect) Visit Preview Jun 2, 2025 10:55am

Copy link
Contributor

@jstuczyn jstuczyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost entirely nits apart from one thing:
let node_ids = topology.node_details().keys().cloned().collect::<Vec<_>>();
^ this will not work the way you want as it also retrieves gateways (which also have to be tested, but not as mixnodes : D )

layer3 INTEGER NOT NULL, -- NodeId of layer 3 mixnode
gw INTEGER NOT NULL, -- NodeId of gateway
success BOOLEAN NOT NULL, -- Whether the packet was delivered successfully
timestamp INTEGER NOT NULL DEFAULT (unixepoch()) -- When the measurement was taken
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if possible, perhaps store it with explicit type for easier queries, i.e. TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP so that it could be coerced into OffsetDateTime with sqlx

Err(AxumErrorResponse::internal_msg(
"failed to submit gateway monitoring results",
))
match message.results() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is out of scope for this PR, but just wondering, does it still make sense to have separate results for gateways and mixnodes?


// Output struct for the calculated corrected reliability for an interval
#[derive(Debug, serde::Serialize)] // Add utoipa::ToSchema if this struct is directly exposed via API
pub struct CorrectedNodeIntervalReliability {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the correction be done by nym api or rather NM? (ideally, in stage 2, NM would be submitting corrected results to the contract)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that corrections need a lot of results to be useful, and NMs submit every 5 or 10 minutes, or similar cadence. So I've pushed the correction as close to the actual usage of the results as possible otherwise it would not be doing much

pub neg_samples_in_interval: u32,
pub final_fail_seq_in_interval: u32,
}

// all SQL goes here
impl StorageManager {
pub(super) async fn get_all_avg_mix_reliability_in_last_24hr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another nit: imo StorageManager should be concerned with just SQL queries and the logic for calculating all this stuff should go up one (or maybe even two) layers of abstractions. either NymApiStorage or maybe even above it

r.layer2 as NodeId,
r.layer3 as NodeId,
r.gw as NodeId,
r.success.unwrap_or_default(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would success be missing? can't we set it to NOT NULL and then avoid this whole iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, just being paranoid, if we set it to NOT NULL then inserts might start failing quietly if we have a problem somewhere else, better be explicit then sorry :)

let mut guilty_nodes_in_path: Vec<NodeId> = Vec::new();
for &node_id in &current_path_node_ids_for_blame {
if let Some(state_after_update) = node_states.get(&node_id) {
if state_after_update.fail_seq > 2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you start distributing the blame before you accumulated all results?


// Attempt to fetch identity, first as mixnode, then as gateway if not found.
// This assumes get_mixnode_identity_key and get_gateway_identity_key return Result<Option<String>, sqlx::Error>
let mut identity: Option<String> =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we even care about identity key at this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this is legacy, I think we just clean it all up once we get to the performance contract

let gateway_stats = monitor_gateway_results().await?;
let client = reqwest::Client::new();

let node_submit_url =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than using hardcoded strings, can't we just construct a NymApiClient and call methods on that guy instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants