Alerting on RUM Metrics

Last9 RUM SDKs emit three families of gauge metrics derived from spans. The same data powers Discover → Applications for ad-hoc exploration; this guide covers turning those signals into alerts. Because all three are gauges, count events with sum_over_time and smooth with avg_over_time; do not use rate() or increase().

Metric reference

Metric	Covers	Key labels
`trace_client_count`	HTTP requests made by the app (network spans)	`service_name`, `span_name`, `http_status_code`, `http_method`, `env`
`trace_client_duration`	HTTP request durations — precomputed quantiles	`service_name`, `span_name`, `http_status_code`, `http_method`, `quantile`, `env`
`trace_internal_count`	App lifecycle, screens, errors, ANR, resources	`service_name`, `span_name`, `status_code`, `env`

`http_status_code` vs `status_code`

http_status_code on trace_client_count / trace_client_duration is the HTTP response code the device received ("200", "404", "503"). Use this for HTTP-level alerting.

status_code on trace_internal_count is the OTel span status (STATUS_CODE_ERROR / STATUS_CODE_UNSET). Use this for app-level error alerting (crashes, ANRs).

`trace_client_duration` quantiles

The quantile label takes the values avg, p50, p90, p95, and p99. These are precomputed, so query them directly by label; histogram_quantile() is not needed.

Duration values are in milliseconds.

Alert patterns

1. HTTP error count (4xx / 5xx)

Alert when users are receiving error responses from any backend endpoint.

sum(
  sum_over_time(
    trace_client_count{
      service_name="<your-app>",
      http_status_code=~"[45].."
    }[30m]
  )
)

Threshold: tune to your baseline. Start with > 10 and adjust based on traffic. Use a shorter window ([5m]) for high-traffic production apps.

2. HTTP error ratio

Alert when a significant fraction of requests are failing. More stable than raw counts for apps with variable traffic.

sum(sum_over_time(trace_client_count{service_name="<your-app>", http_status_code=~"[45].."}[30m]))
/
sum(sum_over_time(trace_client_count{service_name="<your-app>"}[30m]))

Threshold: > 0.05 fires when more than 5% of requests return an error.

3. Per-endpoint error breakdown

This query shows which API endpoint is failing, which helps route alerts to the right team.

sum by (span_name) (
  sum_over_time(
    trace_client_count{
      service_name="<your-app>",
      http_status_code=~"[45].."
    }[30m]
  )
)
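
The ratio from pattern 2 can also be computed per endpoint, which may be more useful than raw counts when endpoints differ widely in traffic. The following is a sketch that combines the two queries above; it assumes all requests to one endpoint share a single span_name, so the numerator and denominator match on that label.

```promql
# Error ratio per endpoint: both sides are grouped by span_name,
# so the division matches series one-to-one on that label.
sum by (span_name) (
  sum_over_time(trace_client_count{service_name="<your-app>", http_status_code=~"[45].."}[30m])
)
/
sum by (span_name) (
  sum_over_time(trace_client_count{service_name="<your-app>"}[30m])
)
```

Endpoints with no error series in the window produce no result rather than 0, so a threshold on this expression only evaluates endpoints that reported at least one error response.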

4. API latency — p95 threshold

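trace_client_duration holds precomputed quantiles per endpoint, so select quantile="p95" directly. The http_status_code="200" filter keeps long-lived WebSocket spans (status "101") from skewing the result.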

avg_over_time(
  trace_client_duration{
    service_name="<your-app>",
    quantile="p95",
    http_status_code="200"
  }[30m]
)

To find the worst-performing endpoint across your app:

max by (span_name) (
  avg_over_time(
    trace_client_duration{
      service_name="<your-app>",
      quantile="p95",
      http_status_code="200"
    }[30m]
  )
)

Threshold: set in milliseconds, e.g. > 3000 for 3 s.
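
If slow error responses should also count toward latency, one alternative is to exclude only WebSocket upgrades instead of pinning the query to "200". This sketch assumes the long-lived spans in your data all carry http_status_code="101", as described under Troubleshooting; check that against your own traffic before relying on it.

```promql
# p95 latency per endpoint, keeping every response code except WebSocket upgrades (101).
# max by (span_name) takes the worst p95 across the remaining status codes.
max by (span_name) (
  avg_over_time(
    trace_client_duration{
      service_name="<your-app>",
      quantile="p95",
      http_status_code!="101"
    }[30m]
  )
)
```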

5. App crashes and exceptions (mobile)

The SDK attaches exceptions as events on the active View span and marks it STATUS_CODE_ERROR. Alert on View spans in error state to catch crashes and unhandled exceptions.

sum(
  sum_over_time(
    trace_internal_count{
      service_name="<your-app>",
      span_name="View",
      status_code="STATUS_CODE_ERROR"
    }[30m]
  )
)

Threshold: > 0 for zero-tolerance, or tune to baseline.
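
For apps with variable traffic, a ratio of errored View spans to all View spans may be steadier than a raw count. This is a sketch only: it assumes View spans that finish without an error are also counted in trace_internal_count under span_name="View" (for example with status_code="STATUS_CODE_UNSET"), which you should confirm in Discover → Applications before alerting on it.

```promql
# Fraction of View spans that ended in an error state over the last 30 minutes.
sum(sum_over_time(trace_internal_count{service_name="<your-app>", span_name="View", status_code="STATUS_CODE_ERROR"}[30m]))
/
sum(sum_over_time(trace_internal_count{service_name="<your-app>", span_name="View"}[30m]))
```

A threshold such as > 0.01 (1% of views) is a hypothetical starting point, not a value taken from the SDK documentation.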

6. ANR rate (Android)

App Not Responding events are emitted as standalone spans by the SDK’s watchdog thread when the main thread is blocked beyond the configured threshold (default: 5 s).

sum(
  sum_over_time(
    trace_internal_count{
      service_name="<your-app>",
      span_name="ANR detected"
    }[1h]
  )
)

Threshold: > 0 — any ANR is a user-facing freeze and warrants investigation.

Filtering by environment

All metrics carry an env label populated from deployment.environment set at SDK init. Add it to any query to scope alerts to production only:

sum(
  sum_over_time(
    trace_client_count{
      service_name="<your-app>",
      env="production",
      http_status_code=~"[45].."
    }[30m]
  )
)
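
Before pinning env to a single value, it can help to check which values the label actually takes. The sketch below groups the same error count by env; the values it returns depend entirely on what your SDK configuration sets for deployment.environment.

```promql
# HTTP error count per environment, to inspect the env values present in your data.
sum by (env) (
  sum_over_time(
    trace_client_count{
      service_name="<your-app>",
      http_status_code=~"[45].."
    }[30m]
  )
)
```

If more than one value denotes production, a regex matcher such as env=~"production|prod" can cover them; those names are illustrative only.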

Configuring the alert in Last9

1. Go to Alerting → Alert Groups and create or open an Alert Group for your app.
2. Add an Indicator with the PromQL above as the query.
3. Set the threshold and bad minutes / total minutes window.
4. Optionally add dynamic annotations using {{ $labels.span_name }} to include the failing endpoint in notifications.

See Configuring an Alert for the full walkthrough.

Troubleshooting

Alert never fires despite visible errors in dashboards

You’re likely using rate() or increase(). RUM metrics are gauges, not counters, so these functions return 0 or NaN. Switch to sum_over_time(...[window]) for counting and avg_over_time(...[window]) for smoothing.

p95 latency values are absurdly high (30s+)

WebSocket connections show up as long-lived client spans with http_status_code="101" and skew quantiles. Add http_status_code="200" (or exclude "101") to latency queries.

Crash alert based on span_name="exception" never triggers

A standalone exception span is only emitted at early app startup. In normal operation, exceptions attach to the active View span and mark it STATUS_CODE_ERROR. Alert on span_name="View" + status_code="STATUS_CODE_ERROR" instead.

ANR alert is flaky or misses events

ANRs are sparse. A [5m] window will miss occurrences between scrapes. Use [1h] so the alert fires within an hour of the first ANR.

Production alert also fires on staging traffic

Add the env label to the query (e.g., env="production"). The label is populated from deployment.environment at SDK init. If it is empty, set it in your SDK configuration.

Error ratio query returns NaN

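The denominator is zero, meaning there was no traffic in the window, and 0 / 0 evaluates to NaN. Alert only when the denominator exceeds a minimum traffic floor. Adding or vector(0) to the numerator does not fix NaN; it covers the separate case where no error series exist and the query returns no data.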
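
One way to apply a minimum traffic floor is to gate the ratio with the and operator, so the expression returns a value only when enough requests were seen. This is a sketch: the floor of 100 requests is an arbitrary placeholder, and the or vector(0) on the numerator is there so that a window with traffic but no error series yields 0 instead of no data.

```promql
# Error ratio, evaluated only when the window saw more than 100 requests (placeholder floor).
(
  (
    sum(sum_over_time(trace_client_count{service_name="<your-app>", http_status_code=~"[45].."}[30m]))
    or vector(0)   # no error series in the window: treat as 0 errors
  )
  /
  sum(sum_over_time(trace_client_count{service_name="<your-app>"}[30m]))
)
and
(sum(sum_over_time(trace_client_count{service_name="<your-app>"}[30m])) > 100)
```

Because both sides of and are aggregated with sum(), they carry no labels and match each other directly. When traffic is below the floor the whole expression returns nothing, so the alert stays quiet.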

Please get in touch with us on Discord or Email if you have any questions.