Introduction
With each Go release, the set of metrics exported by the `runtime/metrics` package grows. Not all metrics apply to every use case, and it can be difficult to identify which ones are actually useful. This is especially true for projects like OpenTelemetry and Prometheus, which want to export some broadly applicable Go runtime metrics by default; the full set is overwhelming and not particularly user-friendly.
Another problem with collecting all metrics is cost. Projects like Prometheus watch the cardinality of the default metric set closely, because downstream users often pay the storage costs of these metrics when using hosted solutions.
This issue proposes defining a conservative subset of runtime metrics that are broadly applicable, and a simple mechanism for discovering them programmatically.
Proposal
There are two parts to this proposal: the categorization of some metrics as "recommended" by the Go toolchain, and the actual mechanism for that categorization.
To start with, I would like to propose documenting such a set of metrics as "recommended" at the top of the `runtime/metrics` documentation. Each metric is required to have a full rationale explaining its utility and use cases. The "recommended" set is intended to carry a lot of weight, so we need to make sure the reason we promote a particular metric is well documented. The "recommended" set of metrics generally follows the compatibility guarantees of the `runtime/metrics` package. That said, a metric is unlikely to be promoted to "recommended" unless it is expected to exist indefinitely. Still, we reserve the right to remove metrics from the set.
Next, we'll add a `Tags []string` field to `metrics.Description` so that these metrics can be found programmatically. We could get by with a simple boolean field, but that's inflexible. In particular, what I'd like to avoid is having dedicated fields for future categorizations such that they end up non-orthogonal and confusing.

The tag indicating the default set will be the string `"recommended"`.
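Filtering by tag could look like the following sketch. Since `metrics.Description` does not yet have a `Tags` field, the sketch uses a hypothetical local mirror type purely for illustration:

```go
package main

import "fmt"

// description is a hypothetical mirror of metrics.Description,
// with the Tags field this proposal would add.
type description struct {
	Name string
	Tags []string
}

// recommended returns the names of all descriptions carrying the
// proposed "recommended" tag.
func recommended(all []description) []string {
	var out []string
	for _, d := range all {
		for _, t := range d.Tags {
			if t == "recommended" {
				out = append(out, d.Name)
				break
			}
		}
	}
	return out
}

func main() {
	// Example inputs; in the real API these would come from metrics.All().
	all := []description{
		{Name: "/gc/heap/goal:bytes", Tags: []string{"recommended"}},
		{Name: "/gc/heap/frees:bytes"},
	}
	fmt.Println(recommended(all)) // [/gc/heap/goal:bytes]
}
```

Because tags compose, a future categorization would just be another string, with no new fields or API surface.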
Proposed initial metrics
Below is an initial proposed set of metrics. This list is intended to be a conservative and uncontroversial set of metrics that have clear real-world use-cases.
- `/gc/gogc:percent` - `GOGC`.
  - Rationale: This metric describes the `GOGC` parameter to the runtime, which sets the CPU/memory trade-off of the GC.
- `/gc/gomemlimit:bytes` - `GOMEMLIMIT`.
  - Rationale: This metric describes the `GOMEMLIMIT` parameter to the runtime, which sets a soft memory limit for the runtime.
- `/gc/heap/allocs:bytes` - Total bytes allocated.
  - Rationale: This metric may be used to derive an allocation rate in bytes/second, which is useful in understanding GC resource cost impact. In particular, it's useful for diagnosing regressions in production.
- `/gc/heap/allocs:objects` - Total individual allocations made.
  - Rationale: This metric may be used to derive an allocation rate in objects/second, which is useful in understanding memory allocation resource cost impact. In particular, it's useful for diagnosing regressions in production.
- `/gc/heap/goal:bytes` - GC heap goal.
  - Rationale: This metric is useful for understanding GC behavior, especially when tuning `GOGC` and `GOMEMLIMIT`, and a close approximation for heap memory footprint.
- `/memory/classes/heap/released:bytes` - Current count of heap bytes that are released back to the OS but which remain mapped.
  - Rationale: This metric is necessary for tuning `GOMEMLIMIT`. It is also necessary to understand what the runtime believes its own physical memory footprint is, as a subtraction from the total.
- `/memory/classes/heap/stacks:bytes` - Current count of bytes allocated to goroutine stacks.
  - Rationale: This metric is necessary for understanding application memory footprints, specifically those that users have some control over and may seek to optimize.
- `/memory/classes/total:bytes` - Total Go runtime memory footprint.
  - Rationale: This metric is necessary for tuning `GOMEMLIMIT`. It's also useful for identifying "other" memory and, together with `/memory/classes/heap/released:bytes`, what the runtime believes the physical memory footprint of the application is.
- `/sched/gomaxprocs:threads` - `GOMAXPROCS`.
  - Rationale: This metric is a core runtime parameter representing the available parallelism to the application.
- `/sched/goroutines:goroutines` - Current count of live goroutines (blocked, running, etc.).
  - Rationale: This metric is useful as a proxy for active work units in many circumstances, though it also includes leaks. Supplement with a goroutine profile for more detail, or app-specific concurrency counters (for example, to track the number of active `http.Handler`s).
- `/sched/latencies:seconds` - Distribution of time goroutines spend runnable (that is, not blocked), but not running.
  - Rationale: This metric is a measure of scheduling latency that is useful as a fine-grained proxy for overall system load. For example, diffing the distribution over short time windows can provide visibility into the latency impact of uneven load. This metric is battle-tested and has been found to be useful in a variety of scenarios.
This results in 10 `uint64` metrics and 1 `Float64Histogram` metric in the default set, a significant reduction from the 81 metrics currently exported by the package.
Here are a few other metrics that were not included.
- `/memory/classes/heap/objects:bytes` - Current count of bytes allocated.
  - Rationale: It is already possible to derive this from total allocations and frees.
- `/memory/classes/metadata/other:bytes` - Runtime metadata, mostly GC metadata.
  - Rationale: We expect to break this category out into specific measurements as useful ones come up, so this metric does not have good longevity prospects.
- `/gc/heap/frees:bytes` - Total bytes freed.
  - Rationale: This metric may be used, with `/gc/heap/allocs:bytes`, to compute the total amount of live+unswept heap memory. It's not that useful on its own, and live+unswept heap memory isn't a terribly useful metric either, since it tends to be noisy and misleading, subject to sweep scheduling nuances. The heap goal is a much more reliable measure of total heap footprint.
- `/gc/heap/frees:objects` - Total individual allocations freed.
  - Rationale: This metric may be used, with `/gc/heap/allocs:objects`, to compute the total number of live objects. It's not that useful on its own, and the number of live objects on its own also isn't that useful. Together with `/gc/heap/allocs:objects`, `/gc/heap/allocs:bytes`, and `/gc/heap/frees:bytes` it can be used to calculate average object size, but that's also not very useful on its own. The distribution of object sizes is more useful, but that metric is currently incomplete, as it currently buckets all objects >32 KiB in size together.
- `/godebug/non-default/*` - Count of instances of a behavior change due to a `GODEBUG` setting.
  - Rationale: While counting instances of non-default behavior is important, these particular metrics are intended to be used more on a case-by-case basis. Consider a team upgrading their `go.mod` version. The default behavior of their programs may change due to the upgrade, but because they're using the new defaults, these metrics won't actually be updated. If something goes wrong due to the new defaults, these metrics aren't much help in identifying that the new default behavior is the cause. Instead, these metrics are helpful for eliminating remaining sources of non-default behavior once a program has opted in.
Alternatives
Only documenting the recommended set
One alternative is to only document the set of recommended metrics. This is fine, but it runs counter to `runtime/metrics`' original goal of making metrics discoverable programmatically. Some mechanism here seems necessary to keep the package useful to both humans and computers.
A toolchain-versioned default metrics set
Originally, we had considered an API (for example, `metrics.Recommended(...)`) that accepted a Go toolchain version and would return the set of default metrics (specifically, a `[]metrics.Description`) for that version. All the metrics within would always be valid to pass to `metrics.Read`.
You could also imagine this set being controlled indirectly via the language version set in `go.mod`, through `GODEBUG` flags. (That is, every time we changed this set, we'd add a valid value to `GODEBUG`, specifically something like `GODEBUG=runtimemetricsgo121=1`.)
Unfortunately, there are already a lot of questions here about stability and versioning. Not least of which is the fact that toolchain versions, at least those reported by the `runtime/debug` package, aren't very structured.
Furthermore, this is a type of categorization that doesn't really compose well. If we ever wanted new categories, we'd need to define a new API, or possibly dummy toolchain strings. It's also a much more complicated change.