Apr 27th, 2026

Argo Rollouts Canary Monitoring: Metrics, Gotchas, and Automated Gates with Last9

Argo Rollouts exposes Prometheus metrics on port 8090 — but the docs lie about which labels exist. Here's how to scrape them into Last9, build a canary dashboard, and use Last9 as an automated AnalysisTemplate gate, including the auth and base64 gotchas.


Progressive delivery with Argo Rollouts shifts the risk model for deployments: instead of a binary cut-over, you graduate traffic incrementally and let data decide whether to proceed or roll back. That only works if you can actually see what your canary is doing.

Out of the box, Argo Rollouts gives you a metrics endpoint and an AnalysisTemplate API. What it doesn't give you is a running observability backend, or an accurate picture of which metrics exist and which labels are real. This post fills both gaps: how to get Argo Rollouts metrics into Last9 via the OTel Collector, what the metric set actually looks like in v1.8.x (it differs from the docs), and how to close the loop by using Last9 as the metric provider for automated canary promotion and rollback.

What Argo Rollouts Exposes

Argo Rollouts runs a Prometheus-format metrics server on port 8090. Every rollout controller instance exposes it — no configuration required. A typical scrape looks like:

# HELP rollout_info Information about rollout
rollout_info{name="checkout",namespace="production",phase="Progressing"} 1

# HELP rollout_phase Rollout phase gauge
rollout_phase{name="checkout",namespace="production",phase="Progressing"} 1
rollout_phase{name="checkout",namespace="production",phase="Paused"}     0
rollout_phase{name="checkout",namespace="production",phase="Completed"}  0

# HELP rollout_info_replicas_updated Number of updated replicas
rollout_info_replicas_updated{name="checkout",namespace="production"} 2

# HELP rollout_info_replicas_desired Number of desired replicas
rollout_info_replicas_desired{name="checkout",namespace="production"} 10

# HELP rollout_reconcile Rollout reconciliation performance
rollout_reconcile_bucket{name="checkout",le="0.005"} 42

The canary_weight Label That Doesn't Exist

A lot of documentation — including our own early draft — referenced rollout_info{canary_weight="X"} as the way to track traffic split. In Argo Rollouts v1.8.x, this label does not exist on rollout_info.

The correct way to compute canary fraction:

rollout_info_replicas_updated / rollout_info_replicas_desired

This gives you the fraction of replicas running the canary version — a reasonable proxy for traffic split in most configurations.
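Plugging in the sample scrape from above makes the arithmetic concrete:

```
# Worked example using the sample scrape above (2 updated, 10 desired)
rollout_info_replicas_updated{name="checkout",namespace="production"}
  / rollout_info_replicas_desired{name="checkout",namespace="production"}

# = 2 / 10 = 0.2 → roughly 20% of replicas are running the canary
```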

Full Verified Metric Set (v1.8.3)

Metric                           Type       Description
rollout_info                     Gauge      Rollout presence; labels include phase
rollout_phase                    Gauge      One series per phase, value 0 or 1
rollout_info_replicas_available  Gauge      Available replica count
rollout_info_replicas_updated    Gauge      Updated (canary) replica count
rollout_info_replicas_desired    Gauge      Total desired replica count
rollout_reconcile                Histogram  Reconcile loop duration
rollout_reconcile_error          Counter    Reconcile error count
rollout_events_total             Counter    Lifecycle events by reason label
analysis_run_info                Gauge      Analysis run status

rollout_phase is the right metric for dashboarding and alerting on phase state — not rollout_info filtered by label.

Getting Metrics into Last9

The path is straightforward:

Argo Rollouts :8090/metrics
    → OTel Collector (prometheus receiver)
    → Last9 (OTLP)

We also scrape kube-state-metrics in the same pipeline to get rollouts-pod-template-hash — the label that lets you distinguish canary pods from stable pods in per-pod dashboards.
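As a sketch of how that label gets used: assuming kube-state-metrics runs with `--metric-labels-allowlist=pods=[rollouts-pod-template-hash]` (so the label appears on `kube_pod_labels`), and assuming you also ship per-pod metrics such as `container_cpu_usage_seconds_total` from cAdvisor, a per-version resource breakdown looks like:

```
# Per-version CPU usage, grouped by rollout hash (canary vs stable).
# kube_pod_labels has value 1, so multiplying by it with group_left
# just attaches the label to each pod's rate.
sum by (label_rollouts_pod_template_hash) (
    rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
  * on (namespace, pod) group_left(label_rollouts_pod_template_hash)
    kube_pod_labels{namespace="production"}
)
```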

OTel Collector Config

# otel-collector-config.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: argo-rollouts
          scrape_interval: 15s
          static_configs:
            - targets: ["argo-rollouts-metrics.argo-rollouts.svc.cluster.local:8090"]
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "rollout_.*|analysis_.*"
              action: keep

        - job_name: kube-state-metrics
          scrape_interval: 30s
          static_configs:
            - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "kube_pod_labels"
              action: keep

processors:
  resource:
    attributes:
      - key: cluster
        value: "production"
        action: upsert
      - key: environment
        value: "prod"
        action: upsert
  batch:
    send_batch_size: 1000
    timeout: 10s

exporters:
  otlp:
    endpoint: "https://otlp.last9.io"
    compression: gzip
    headers:
      Authorization: "Basic <your-last9-credentials>"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [otlp]

Deploy this as a Kubernetes DaemonSet or Deployment in the same cluster. The Argo Rollouts metrics service is only accessible within the cluster.
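A minimal Deployment sketch, assuming the config above lives in a ConfigMap named `otel-collector-config` (the namespace and image tag are placeholders — the `prometheus` receiver requires the contrib distribution):

```yaml
# otel-collector-deployment.yaml — minimal sketch; adjust for your cluster
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.115.0
          args: ["--config=/etc/otel/otel-collector-config.yaml"]
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
```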

Verifying the Pipeline

Once the collector is running, check Last9's Metrics Explorer for rollout_phase. Filter by namespace and name to find your rollout. You should see phase time series within one scrape interval (15s in the config above).

Dashboarding Canary State

With metrics in Last9, a useful canary dashboard has three panels:

Rollout phase — what state is the rollout in right now:

rollout_phase{name="checkout", namespace="production"}

This gives you one line per phase. The active phase has value 1.

Canary fraction — what percentage of replicas are running the new version:

100 * rollout_info_replicas_updated{name="checkout"}
    / rollout_info_replicas_desired{name="checkout"}

Reconcile error rate — is the controller itself healthy:

rate(rollout_reconcile_error{name="checkout"}[5m])

Pair these with your application's own error rate and latency metrics (from your services' OTel instrumentation) on the same dashboard. The canary metrics tell you the deployment state; your service metrics tell you whether the canary is healthy.
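For the service-side panel, the same error-rate expression the AnalysisTemplate uses later in this post doubles as a dashboard query, assuming your services emit `http_requests_total` with `service` and `status` labels:

```
# 5xx error rate for the checkout service during the canary window
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  /
sum(rate(http_requests_total{service="checkout"}[5m]))
```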

Last9 as an Automated Canary Gate

The more powerful capability is closing the loop: using Last9 as the metric provider in an AnalysisTemplate, so Argo Rollouts automatically promotes or rolls back based on your application's error rate or latency in Last9.

The flow:

Canary at 10% traffic
    → Argo Rollouts queries Last9 every 2 min
        error rate < 5%?  → promote to 25%
        error rate ≥ 10%? → auto rollback (after 3 failures)

Argo Rollouts supports Prometheus as a metric provider out of the box. Last9 exposes a Prometheus-compatible read endpoint, so no plugin or custom integration is required.

AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: last9-error-rate
  namespace: production
spec:
  args:
    - name: service-name
    - name: last9-auth
      valueFrom:
        secretKeyRef:
          name: last9-prometheus-auth
          key: authorization
  metrics:
    - name: error-rate
      interval: 2m
      failureLimit: 3
      successCondition: result[0] < 0.05
      failureCondition: result[0] >= 0.10
      provider:
        prometheus:
          address: https://app.last9.io/api/v1/prometheus
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
          headers:
            - key: Authorization
              value: "{{args.last9-auth}}"

Reference it in your Rollout:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: production
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: last9-error-rate
            args:
              - name: service-name
                value: checkout
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100

Auth Gotcha 1: basicAuth Doesn't Exist

The Argo Rollouts Prometheus provider spec does not have a basicAuth field. The available authentication options are sigv4, oauth2, and plain headers. We went through three iterations to find the correct pattern.

The working approach: source the pre-encoded Basic <base64> value from a Kubernetes Secret via args.valueFrom.secretKeyRef, then interpolate it into the header string. The headers[].value field only accepts plain strings — but args with valueFrom are resolved before interpolation, so this works cleanly.

Auth Gotcha 2: Newlines in base64 Break the Header

When creating the secret, echo -n "user:pass" | base64 looks safe — but GNU base64 appends a trailing newline to its output and, for values longer than 76 characters (real API tokens usually are), wraps the output across multiple lines. Either way, a newline ends up in the Authorization header value, and Go's HTTP client rejects it with invalid header field value.

Use printf and strip newlines explicitly:

kubectl create secret generic last9-prometheus-auth \
  --from-literal=authorization="Basic $(printf 'user:pass' | base64 | tr -d '\n')"
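A quick local sanity check that the encoded value is newline-free before it ever reaches the cluster (user:pass is a placeholder for your real credentials):

```shell
# Encode the credentials with no trailing newline or line wrapping
token="$(printf 'user:pass' | base64 | tr -d '\n')"
echo "$token"   # dXNlcjpwYXNz

# Verify: byte count of the bare token should equal the encoded
# length exactly (12 here), with no newline bytes included
printf '%s' "$token" | wc -c
```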

Validation

To validate, we ran an AnalysisRun directly against Last9's read endpoint, querying rollout_phase{phase="Error"} every 30 seconds. All 17 measurements came back Successful: Last9 returned [0] (no errors) each time, satisfying the result[0] == 0 condition. The pipeline works.

Competitive Context

Datadog and New Relic are listed as first-class metric providers in Argo Rollouts' official documentation, with dedicated integration pages. Their pitch is the same: use your observability backend as a canary gate.

Last9 achieves the same capability via its Prometheus-compatible read endpoint. The mechanism is identical — no custom plugin required. The gap today is discoverability: Last9 isn't listed in the official Argo Rollouts provider docs yet. The Prometheus provider is the path in, and it works now.

What This Gives You

Capability                    How
Rollout phase visibility      rollout_phase metric in Last9
Canary replica fraction       replicas_updated / replicas_desired
Controller health             rollout_reconcile_error rate
Automated promotion/rollback  AnalysisTemplate → Last9 Prometheus endpoint
Unified dashboard             Rollout metrics + service metrics in one place

Code

All config files, Kubernetes manifests, and AnalysisTemplate examples are in last9/opentelemetry-examples — otel-collector/argo-rollouts.

For the OTel Collector setup, see What is the OpenTelemetry Collector?. For how Last9 handles deployment events alongside metrics, see Real-Time Canary Deployment Tracking with Argo CD & Last9.

About the authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
