Argo Rollouts exposes Prometheus metrics on port 8090 — but the docs lie about which labels exist. Here's how to scrape them into Last9, build a canary dashboard, and use Last9 as an automated AnalysisTemplate gate.
Progressive delivery with Argo Rollouts shifts the risk model for deployments: instead of a binary cut-over, you graduate traffic incrementally and let data decide whether to proceed or roll back. That only works if you can actually see what your canary is doing.
Out of the box, Argo Rollouts gives you a metrics endpoint and an AnalysisTemplate API. What it doesn't give you is a running observability backend, or an accurate picture of which metrics exist and which labels are real. This post fills both gaps: how to get Argo Rollouts metrics into Last9 via the OTel Collector, what the metric set actually looks like in v1.8.x (it differs from the docs), and how to close the loop by using Last9 as the metric provider for automated canary promotion and rollback.
What Argo Rollouts Exposes
Argo Rollouts runs a Prometheus-format metrics server on port 8090. Every rollout controller instance exposes it — no configuration required. A typical scrape looks like:
```
# HELP rollout_info Information about rollout
rollout_info{name="checkout",namespace="production",phase="Progressing"} 1
# HELP rollout_phase Rollout phase gauge
rollout_phase{name="checkout",namespace="production",phase="Progressing"} 1
rollout_phase{name="checkout",namespace="production",phase="Paused"} 0
rollout_phase{name="checkout",namespace="production",phase="Completed"} 0
# HELP rollout_info_replicas_updated Number of updated replicas
rollout_info_replicas_updated{name="checkout",namespace="production"} 2
# HELP rollout_info_replicas_desired Number of desired replicas
rollout_info_replicas_desired{name="checkout",namespace="production"} 10
# HELP rollout_reconcile Rollout reconciliation performance
rollout_reconcile_bucket{name="checkout",le="0.005"} 42
```

The canary_weight Label That Doesn't Exist
A lot of documentation — including our own early draft — referenced rollout_info{canary_weight="X"} as the way to track traffic split. In Argo Rollouts v1.8.x, this label does not exist on rollout_info.
The correct way to compute canary fraction:
```
rollout_info_replicas_updated / rollout_info_replicas_desired
```

This gives you the fraction of replicas running the canary version — a reasonable proxy for traffic split in most configurations.
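If you track several rollouts, the same ratio works fleet-wide. A sketch that assumes both series carry matching name and namespace labels (they do in v1.8.x):

```
# Canary fraction (%) per rollout, across all namespaces
100 * sum by (name, namespace) (rollout_info_replicas_updated)
    / sum by (name, namespace) (rollout_info_replicas_desired)
```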
Full Verified Metric Set (v1.8.3)
| Metric | Type | Description |
|---|---|---|
| rollout_info | Gauge | Rollout presence; labels include phase |
| rollout_phase | Gauge | One series per phase, value 0 or 1 |
| rollout_info_replicas_available | Gauge | Available replica count |
| rollout_info_replicas_updated | Gauge | Updated (canary) replica count |
| rollout_info_replicas_desired | Gauge | Total desired replica count |
| rollout_reconcile | Histogram | Reconcile loop duration |
| rollout_reconcile_error | Counter | Reconcile error count |
| rollout_events_total | Counter | Lifecycle events by reason label |
| analysis_run_info | Gauge | Analysis run status |
rollout_phase is the right metric for dashboarding and alerting on phase state — not rollout_info filtered by label.
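The same metric works as an alert condition. Two illustrative expressions (thresholds and windows are assumptions; tune them to your rollout cadence):

```
# Fires when a rollout enters Degraded
rollout_phase{phase="Degraded"} == 1

# Fires when a rollout has been stuck in Progressing for 30 minutes straight
min_over_time(rollout_phase{phase="Progressing"}[30m]) == 1
```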
Getting Metrics into Last9
The path is straightforward:
```
Argo Rollouts :8090/metrics
  → OTel Collector (prometheus receiver)
    → Last9 (OTLP)
```

We also scrape kube-state-metrics in the same pipeline to get rollouts-pod-template-hash — the label that lets you distinguish canary pods from stable pods in per-pod dashboards.
OTel Collector Config
```yaml
# otel-collector-config.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: argo-rollouts
          scrape_interval: 15s
          static_configs:
            - targets: ["argo-rollouts-metrics.argo-rollouts.svc.cluster.local:8090"]
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "rollout_.*|analysis_.*"
              action: keep
        - job_name: kube-state-metrics
          scrape_interval: 30s
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "kube_pod_labels"
              action: keep

processors:
  resource:
    attributes:
      - key: cluster
        value: "production"
        action: upsert
      - key: environment
        value: "prod"
        action: upsert
  batch:
    send_batch_size: 1000
    timeout: 10s

exporters:
  otlp:
    endpoint: "https://otlp.last9.io"
    compression: gzip
    headers:
      Authorization: "Basic <your-last9-credentials>"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [otlp]
```

Deploy this as a Kubernetes DaemonSet or Deployment in the same cluster. The Argo Rollouts metrics service is only accessible within the cluster.
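Prometheus relabel regexes are fully anchored, so the keep rule matches whole metric names, not substrings. You can sanity-check the pattern locally before deploying; grep -E with explicit anchors mimics the anchoring:

```shell
# A sample of metric names the argo-rollouts scrape job would see
printf 'rollout_phase\nanalysis_run_info\nkube_pod_info\ngo_goroutines\n' \
  | grep -E '^(rollout_.*|analysis_.*)$'
# keeps rollout_phase and analysis_run_info; drops the other two
```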
Verifying the Pipeline
Once the collector is running, check Last9's Metrics Explorer for rollout_phase. Filter by namespace and name to find your rollout. You should see phase time series within one scrape interval (15s in the config above).
Dashboarding Canary State
With metrics in Last9, a useful canary dashboard has three panels:
Rollout phase — what state is the rollout in right now:
```
rollout_phase{name="checkout", namespace="production"}
```

This gives you one line per phase. The active phase has value 1.
Canary fraction — what percentage of replicas are running the new version:
```
100 * rollout_info_replicas_updated{name="checkout"}
    / rollout_info_replicas_desired{name="checkout"}
```

Reconcile error rate — is the controller itself healthy:
```
rate(rollout_reconcile_error{name="checkout"}[5m])
```

Pair these with your application's own error rate and latency metrics (from your services' OTel instrumentation) on the same dashboard. The canary metrics tell you the deployment state; your service metrics tell you whether the canary is healthy.
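To split your service's own error rate by revision, join it against kube_pod_labels from the scrape pipeline above. A sketch with two assumptions: your application metric http_requests_total carries a pod label, and kube-state-metrics runs with --metric-labels-allowlist=pods=[rollouts-pod-template-hash] so that pod label is exported at all:

```
# 5xx rate per revision; each label_rollouts_pod_template_hash value
# corresponds to one ReplicaSet (canary or stable)
sum by (label_rollouts_pod_template_hash) (
  rate(http_requests_total{namespace="production", status=~"5.."}[5m])
  * on (pod) group_left(label_rollouts_pod_template_hash)
    kube_pod_labels{namespace="production"}
)
```

The multiply-by-one join works because kube_pod_labels always has value 1; group_left copies the hash label onto your service series.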
Last9 as an Automated Canary Gate
The more powerful capability is closing the loop: using Last9 as the metric provider in an AnalysisTemplate, so Argo Rollouts automatically promotes or rolls back based on your application's error rate or latency in Last9.
The flow:
```
Canary at 10% traffic
        ↓
Argo Rollouts queries Last9 every 2 min
        ↓
error rate < 5%?  → promote to 25%
error rate ≥ 10%? → auto rollback (after 3 failures)
```

Argo Rollouts supports Prometheus as a metric provider out of the box. Last9 exposes a Prometheus-compatible read endpoint, so no plugin or custom integration is required.
AnalysisTemplate
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: last9-error-rate
  namespace: production
spec:
  args:
    - name: service-name
    - name: last9-auth
      valueFrom:
        secretKeyRef:
          name: last9-prometheus-auth
          key: authorization
  metrics:
    - name: error-rate
      interval: 2m
      failureLimit: 3
      successCondition: result[0] < 0.05
      failureCondition: result[0] >= 0.10
      provider:
        prometheus:
          address: https://app.last9.io/api/v1/prometheus
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
          headers:
            - key: Authorization
              value: "{{args.last9-auth}}"
```

Reference it in your Rollout:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: production
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: last9-error-rate
            args:
              - name: service-name
                value: checkout
              - name: last9-auth
                valueFrom:
                  secretKeyRef:
                    name: last9-prometheus-auth
                    key: authorization
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
```

Auth Gotcha 1: basicAuth Doesn't Exist
The Argo Rollouts Prometheus provider spec does not have a basicAuth field. The available authentication options are sigv4, oauth2, and plain headers. We went through three iterations to find the correct pattern.
The working approach: source the pre-encoded Basic <base64> value from a Kubernetes Secret via args.valueFrom.secretKeyRef, then interpolate it into the header string. The headers[].value field only accepts plain strings — but args with valueFrom are resolved before interpolation, so this works cleanly.
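If you manage secrets declaratively, the stringData field is an alternative worth knowing: Kubernetes base64-encodes it for you on write. A sketch; the token shown is the encoding of user:pass and is purely illustrative:

```
apiVersion: v1
kind: Secret
metadata:
  name: last9-prometheus-auth
  namespace: production
type: Opaque
stringData:
  # Kubernetes encodes stringData values itself; store the full header value
  authorization: "Basic dXNlcjpwYXNz"
```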
Auth Gotcha 2: Newlines in base64 Break the Header
When creating the secret, echo -n "user:pass" | base64 works on most systems — but some environments produce a trailing newline in the base64 output. That newline ends up in the Authorization header value, and Go's HTTP client rejects it with invalid header field value.
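The difference is easy to reproduce locally (user:pass stands in for real credentials):

```shell
# echo appends a newline, and it gets encoded along with the credentials
echo "user:pass" | base64      # dXNlcjpwYXNzCg==  (the Cg== is the \n)

# printf emits exactly the bytes you give it
printf 'user:pass' | base64    # dXNlcjpwYXNz
```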
Use printf and strip newlines explicitly:
```shell
kubectl create secret generic last9-prometheus-auth \
  --from-literal=authorization="Basic $(printf 'user:pass' | base64 | tr -d '\n')"
```

Validation
To validate, we ran an AnalysisRun directly against Last9's read endpoint, querying rollout_phase{phase="Error"} every 30 seconds. The result: 17/17 consecutive Successful measurements. Last9 returned [0] (no errors) each time, satisfying the successCondition result[0] == 0. The pipeline works.
Competitive Context
Datadog and New Relic are listed as native metric providers in Argo Rollouts' official documentation, with dedicated integration pages. Their pitch is the same: use your observability backend as a canary gate.
Last9 achieves the same capability via its Prometheus-compatible read endpoint. The mechanism is identical — no custom plugin required. The gap today is discoverability: Last9 isn't listed in the official Argo Rollouts provider docs yet. The Prometheus provider is the path in, and it works now.
What This Gives You
| Capability | How |
|---|---|
| Rollout phase visibility | rollout_phase metric in Last9 |
| Canary replica fraction | replicas_updated / replicas_desired |
| Controller health | rollout_reconcile_error rate |
| Automated promotion/rollback | AnalysisTemplate → Last9 Prometheus endpoint |
| Unified dashboard | Rollout metrics + service metrics in one place |
Code
All config files, Kubernetes manifests, and AnalysisTemplate examples are in last9/opentelemetry-examples — otel-collector/argo-rollouts.
For the OTel Collector setup, see What is the OpenTelemetry Collector?. For how Last9 handles deployment events alongside metrics, see Real-Time Canary Deployment Tracking with Argo CD & Last9.
