Alerting on RUM Metrics
PromQL alert patterns for mobile and web RUM — HTTP error rates, latency thresholds, app crashes, and ANR detection using trace_client_count, trace_client_duration, and trace_internal_count.
Last9 RUM SDKs emit three families of gauge metrics derived from spans. The same data powers Discover → Applications for ad-hoc exploration; this guide covers turning those signals into alerts. All are gauge metrics — use sum_over_time for counting and avg_over_time for smoothing, not rate() or increase().
Metric reference
| Metric | Covers | Key labels |
|---|---|---|
| trace_client_count | HTTP requests made by the app (network spans) | service_name, span_name, http_status_code, http_method, env |
| trace_client_duration | HTTP request durations, as precomputed quantiles (milliseconds) | service_name, span_name, http_status_code, http_method, quantile, env |
| trace_internal_count | App lifecycle, screens, errors, ANR, resources | service_name, span_name, status_code, env |
http_status_code vs status_code
http_status_code on trace_client_count / trace_client_duration is the HTTP response code the device received ("200", "404", "503"). Use this for HTTP-level alerting.
status_code on trace_internal_count is the OTel span status (STATUS_CODE_ERROR / STATUS_CODE_UNSET). Use this for app-level error alerting (crashes, ANRs).
trace_client_duration quantiles
quantile label values: avg, p50, p90, p95, p99. These are precomputed — query them directly by label. No histogram_quantile() needed.
Duration values are in milliseconds.
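Because the values are in milliseconds, a threshold expressed in seconds needs a conversion. A minimal sketch, assuming a 30-minute smoothing window and a filter to successful responses (both are illustrative choices, not requirements):

```promql
# p99 latency for each series, converted from milliseconds to seconds.
# The [30m] window and the http_status_code="200" filter are assumptions.
avg_over_time(
  trace_client_duration{
    service_name="<your-app>",
    quantile="p99",
    http_status_code="200"
  }[30m]
) / 1000
```

With this form, a threshold of > 3 corresponds to 3 s.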
Alert patterns
1. HTTP error count (4xx / 5xx)
Alert when users are receiving error responses from any backend endpoint.

    sum(
      sum_over_time(
        trace_client_count{
          service_name="<your-app>",
          http_status_code=~"[45].."
        }[30m]
      )
    )

Threshold: tune to your baseline. Start with > 10 and adjust based on traffic. Use a shorter window ([5m]) for high-traffic production apps.
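If some 4xx responses are expected in normal use, it may help to alert on server-side failures separately. The sketch below narrows the same query to 5xx responses only; splitting the alert this way is a suggestion, and the right threshold is still something to tune against your own baseline.

```promql
# Count only 5xx responses over the last 30 minutes.
# The regex "5.." matches any three-character status starting with 5.
sum(
  sum_over_time(
    trace_client_count{
      service_name="<your-app>",
      http_status_code=~"5.."
    }[30m]
  )
)
```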
2. HTTP error ratio
Alert when a significant fraction of requests are failing. More stable than raw counts for apps with variable traffic.

    sum(sum_over_time(trace_client_count{service_name="<your-app>", http_status_code=~"[45].."}[30m]))
    /
    sum(sum_over_time(trace_client_count{service_name="<your-app>"}[30m]))

Threshold: > 0.05 for a 5% error rate.
3. Per-endpoint error breakdown
Identifies which specific API endpoint is failing, useful for routing alerts to the right team.

    sum by (span_name) (
      sum_over_time(
        trace_client_count{
          service_name="<your-app>",
          http_status_code=~"[45].."
        }[30m]
      )
    )

4. API latency: p95 threshold
trace_client_duration holds precomputed quantiles per endpoint. Query the p95 label directly.

    avg_over_time(
      trace_client_duration{
        service_name="<your-app>",
        quantile="p95",
        http_status_code="200"
      }[30m]
    )

To find the worst-performing endpoint across your app:

    max by (span_name) (
      avg_over_time(
        trace_client_duration{
          service_name="<your-app>",
          quantile="p95",
          http_status_code="200"
        }[30m]
      )
    )

Threshold: set in milliseconds, e.g. > 3000 for 3 s.
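The queries above pin http_status_code to "200", which also hides latency on other response codes. As an alternative sketch, the matcher can exclude only WebSocket upgrades ("101"), which appear as long-lived client spans and skew quantiles. Whether error responses belong in your latency alert is a judgment call, so treat this as one option rather than the recommended form.

```promql
# Worst p95 per endpoint across all response codes except WebSocket upgrades.
max by (span_name) (
  avg_over_time(
    trace_client_duration{
      service_name="<your-app>",
      quantile="p95",
      http_status_code!="101"
    }[30m]
  )
)
```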
5. App crashes and exceptions (mobile)
The SDK attaches exceptions as events on the active View span and marks it STATUS_CODE_ERROR. Alert on View spans in error state to catch crashes and unhandled exceptions.

    sum(
      sum_over_time(
        trace_internal_count{
          service_name="<your-app>",
          span_name="View",
          status_code="STATUS_CODE_ERROR"
        }[30m]
      )
    )

Threshold: > 0 for zero tolerance, or tune to baseline.
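For apps with variable traffic, the error ratio pattern from section 2 can be adapted to View spans. This is a sketch only: it assumes the total View span count is a meaningful denominator for your app, which is worth checking in Discover → Applications before alerting on it.

```promql
# Fraction of View spans that ended in error state over the last 30 minutes.
sum(sum_over_time(trace_internal_count{service_name="<your-app>", span_name="View", status_code="STATUS_CODE_ERROR"}[30m]))
/
sum(sum_over_time(trace_internal_count{service_name="<your-app>", span_name="View"}[30m]))
```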
6. ANR rate (Android)
App Not Responding events are emitted as standalone spans by the SDK’s watchdog thread when the main thread is blocked beyond the configured threshold (default: 5 s).

    sum(
      sum_over_time(
        trace_internal_count{
          service_name="<your-app>",
          span_name="ANR detected"
        }[1h]
      )
    )

Threshold: > 0. Any ANR is a user-facing freeze and warrants investigation.
Filtering by environment
All metrics carry an env label populated from deployment.environment set at SDK init. Add it to any query to scope alerts to production only:

    sum(
      sum_over_time(
        trace_client_count{
          service_name="<your-app>",
          env="production",
          http_status_code=~"[45].."
        }[30m]
      )
    )

Configuring the alert in Last9
- Go to Alerting → Alert Groups and create or open an Alert Group for your app.
- Add an Indicator with the PromQL above as the query.
- Set the threshold and bad minutes / total minutes window.
- Optionally add dynamic annotations using {{ $labels.span_name }} to include the failing endpoint in notifications.
See Configuring an Alert for the full walkthrough.
Troubleshooting
- Alert never fires despite visible errors in dashboards
  You’re likely using rate() or increase(). RUM metrics are gauges, not counters, so these functions return 0 or NaN. Switch to sum_over_time(...[window]) for counting and avg_over_time(...[window]) for smoothing.
- p95 latency values are absurdly high (30s+)
  WebSocket connections show up as long-lived client spans with http_status_code="101" and skew quantiles. Add http_status_code="200" (or exclude "101") to latency queries.
- Crash alert based on span_name="exception" never triggers
  A standalone exception span is only emitted at early app startup. In normal operation, exceptions attach to the active View span and mark it STATUS_CODE_ERROR. Alert on span_name="View" + status_code="STATUS_CODE_ERROR" instead.
- ANR alert is flaky or misses events
  ANRs are sparse. A [5m] window will miss occurrences between scrapes. Use [1h] so the alert fires within an hour of the first ANR.
- Production alert also fires on staging traffic
  Add the env label to the query (e.g., env="production"). The label is populated from deployment.environment at SDK init. If it’s empty, set it in your SDK configuration.
- Error ratio query returns NaN
  The denominator is zero, meaning there was no traffic in the window. Wrap with or vector(0) on the numerator, or alert only when the denominator exceeds a minimum traffic floor.
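The two suggestions for the error ratio can be combined into one query. The sketch below is one possible shape, not the only one: the floor of 100 requests per 30 minutes is a placeholder assumption to replace with a value that fits your traffic.

```promql
(
  (
    # Numerator: error responses; falls back to 0 when no error series exist.
    sum(sum_over_time(trace_client_count{service_name="<your-app>", http_status_code=~"[45].."}[30m]))
    or vector(0)
  )
  /
  # Denominator: all responses in the same window.
  sum(sum_over_time(trace_client_count{service_name="<your-app>"}[30m]))
)
and
(
  # Traffic floor: only return the ratio when there is enough traffic (assumed value).
  sum(sum_over_time(trace_client_count{service_name="<your-app>"}[30m])) > 100
)
```

When traffic is below the floor, the query returns no data, so the threshold (for example > 0.05) is not evaluated against a ratio computed from a handful of requests.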
Please get in touch with us on Discord or Email if you have any questions.