You're running Kafka on Confluent Cloud. You care about lag, throughput, retries, and replication. But where do you see those metrics?
Confluent gives you metrics, sure, but not all in one place. Some live behind a metrics API, others behind Connect clusters or Schema Registries. You either wire them manually or give up.
What if you could stream those metrics to a platform built for high-frequency, high-cardinality time series, and do it in minutes?
Why Confluent Cloud Metrics Are Harder Than They Should Be
When you deploy Kafka to production, monitoring becomes critical. You need to know when consumer lag spikes, when brokers are overwhelmed, or when replication falls behind. But Confluent Cloud spreads this data across multiple services and APIs.
The Confluent Cloud Metrics API provides basic cluster and topic metrics, but accessing them requires polling with API keys, handling rate limits, and dealing with complex label formats. Schema Registry metrics live in a separate REST API. Connect cluster metrics are yet another beast entirely.
Most teams end up with a patchwork solution: some metrics in Confluent's dashboards, others scraped into Prometheus, and gaps everywhere else. When incidents happen, you're jumping between multiple tools to understand what's broken.
A Simple Way Out: Push Metrics Directly
Forget juggling REST APIs, scraping hacks, and brittle connectors. There’s a cleaner path:
Push Confluent Cloud metrics directly to Last9 using Prometheus PushGateway.
This gives you:
- No polling delays — You control when metrics are fetched and pushed.
- No Kafka Connect hacks — No need to deploy agents or plugins in your data path.
- Faster feedback loops — Get alerts and dashboards in near real-time.
- Centralized visibility — All your Kafka metrics in one place, ready to slice, dice, and alert.
Here’s how it works:
- Use the Confluent Cloud Metrics API to fetch relevant metrics.
- Format the output into Prometheus exposition format.
- Push that data to the PushGateway.
- Let Last9 collect from there, using native Prometheus support.
This setup avoids the usual pain of “how do I scrape Confluent’s managed services?” You fetch once. You push once. And you're done.
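To make the "format" step concrete: the payload you push is plain text in Prometheus exposition format, one sample per line. Something like this (the metric names and labels are illustrative, not what the Metrics API returns verbatim):
confluent_consumer_lag{cluster="lkc-abc123",topic="orders"} 1523
confluent_request_rate{cluster="lkc-abc123"} 412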
Quickstart Example
Let’s walk through a minimal setup you can run locally to push Confluent Cloud metrics into Last9.
You’ll set up Prometheus to scrape from a PushGateway, then push a sample metric to validate everything works. This assumes your Prometheus instance is already forwarding data to Last9 (for example, via remote_write).
Step 1: Configure Prometheus to Scrape from PushGateway
Add this to your prometheus.yml:
scrape_configs:
  - job_name: 'confluent-cloud'
    static_configs:
      - targets: ['pushgateway:9091']
This tells Prometheus to scrape the PushGateway at its default interval (60 seconds, unless your global scrape_interval says otherwise). You can adjust the scrape interval per job if needed, as shown below.
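If your push frequency differs from the default, a per-job override keeps the two in step. The snippet below is a sketch; honor_labels: true keeps the job and instance labels attached at push time instead of letting Prometheus rewrite them with the PushGateway's own:
scrape_configs:
  - job_name: 'confluent-cloud'
    scrape_interval: 30s   # match how often you push
    honor_labels: true     # keep labels from the pushed metrics
    static_configs:
      - targets: ['pushgateway:9091']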
Step 2: Push a Test Metric
Run this one-liner to simulate a metric push:
echo "confluent_kafka_lag{topic='events'} 42" | curl --data-binary @- http://pushgateway:9091/metrics/job/kafka
This creates a dummy metric (confluent_kafka_lag) with a topic label. Note the double quotes around the label value; that's what the Prometheus exposition format expects. PushGateway holds on to the last pushed value so Prometheus can scrape it.
Step 3: Validate in Last9
Head over to your Last9 dashboard. Query for confluent_kafka_lag; you should see your test value show up with the label topic="events".
At this point, you’ve wired up the basic plumbing.
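If the metric doesn't appear in Last9 right away, it helps to confirm that Prometheus itself picked it up. Querying its standard HTTP API directly (adjust the host to wherever Prometheus runs) should return the series:
curl 'http://<prometheus-host>:9090/api/v1/query?query=confluent_kafka_lag'
An empty result here usually points at the scrape or push step, not the Last9 side.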
Automate Metric Collection from Confluent Cloud
For a real setup, you’ll want to periodically fetch actual metrics from Confluent Cloud and push them. Below is a simplified Python snippet that does just that:
import os
import time

import requests

def push_consumer_lag(api_key, api_secret, cluster_id, pushgateway_url):
    # Step 1: Call Confluent Cloud's Metrics API
    # (the query parameters and response handling are simplified here; check
    # the Metrics API docs for the exact request shape your cluster supports)
    auth = (api_key, api_secret)
    metrics_url = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/export"
    response = requests.get(metrics_url, auth=auth, params={
        'resource.kafka.id': cluster_id,
        'metric.kafka.consumer_lag_sum': 'true'
    })
    response.raise_for_status()

    # Step 2: Format each data point as a Prometheus exposition line and push it
    for metric in response.json().get('data', []):
        labels = f'cluster="{cluster_id}",topic="{metric["topic"]}"'
        line = f"confluent_consumer_lag{{{labels}}} {metric['value']}\n"
        requests.post(f"{pushgateway_url}/metrics/job/confluent", data=line)

# Step 3: Read credentials from the environment and run every 30 seconds
api_key = os.environ["CONFLUENT_API_KEY"]
api_secret = os.environ["CONFLUENT_API_SECRET"]
cluster_id = os.environ["CONFLUENT_CLUSTER_ID"]

while True:
    push_consumer_lag(api_key, api_secret, cluster_id, "http://pushgateway:9091")
    time.sleep(30)
This:
- Pulls consumer lag from the Confluent Metrics API.
- Converts it to Prometheus-friendly format.
- Pushes it to your PushGateway, where Last9 will pick it up.
You can now scale this to include multiple metrics, enrich them with labels, and set up alerts in Last9 to identify issues early.
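One simple way to do that is to collect all the exposition lines first and push them in a single request. The helper below is a sketch with illustrative metric names; using PUT replaces everything previously pushed under that job, which keeps stale series from lingering:
import requests

def push_metrics(pushgateway_url, job, lines):
    # PUT replaces the whole metric group for this job in one shot
    payload = "\n".join(lines) + "\n"
    requests.put(f"{pushgateway_url}/metrics/job/{job}", data=payload)

push_metrics("http://pushgateway:9091", "confluent", [
    'confluent_consumer_lag{cluster="lkc-abc123",topic="orders"} 1523',
    'confluent_consumer_lag{cluster="lkc-abc123",topic="payments"} 87',
    'confluent_request_rate{cluster="lkc-abc123"} 412',
])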
What’s Supported and What’s Not
The Confluent Cloud Metrics API provides solid coverage for Kafka’s operational health, but not everything is included.
Here’s what you can reliably track today, and where you’ll need to build additional plumbing.
What You Can Track Today
These metrics are exposed directly via the API and can be pushed to Last9 without additional integration layers:
- Cluster-level metrics: Monitor core metrics like request throughput, partition counts, and network usage across the cluster.
- Topic-level metrics: Includes per-topic message rates, byte throughput, and partition lag, essential for tracking data distribution and backpressure.
- Consumer group metrics: Get real-time visibility into consumer lag, offset commits, and group consumption patterns.
- Producer metrics: Track request rates, retry counts, error rates, and delivery performance.
These cover most infrastructure-level observability needs for a typical Kafka setup.
Where You’ll Need Additional Instrumentation
The API has limitations when it comes to service-specific metrics and fine-grained introspection:
- Schema Registry metrics: Not available via the core metrics API. These require querying the Schema Registry's dedicated REST interface.
- Kafka Connect metrics: Exposed through separate management endpoints, not covered by the standard cloud metrics export. You'll need to wire these up manually.
- Cross-region replication: Limited visibility. Metrics are sparse and may not reflect replication performance accurately across regions.
- High-cardinality partition metrics: Per-partition metric collection at scale may run into API rate limits. Sampling or selective collection is recommended in large clusters.
Additionally, the Metrics API is focused on Kafka’s operational state, not your application’s behavior.
If you need to track custom application-level metrics, like SLA violations, business counters, or message payload errors, you’ll need to instrument your code directly.
OpenTelemetry helps you standardize how your services generate and export these metrics. You can then expose them in Prometheus format or route them through your observability pipeline alongside Confluent metrics.
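As a rough sketch of what that looks like in Python, the OpenTelemetry SDK with its Prometheus exporter can expose app-level counters on a /metrics endpoint that Prometheus scrapes alongside your Confluent metrics. The packages assumed here are opentelemetry-sdk, opentelemetry-exporter-prometheus, and prometheus-client; the service and metric names are made up for illustration:
# pip install opentelemetry-sdk opentelemetry-exporter-prometheus prometheus-client
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Expose OpenTelemetry metrics in Prometheus format on http://localhost:8000/metrics
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
start_http_server(8000)

meter = metrics.get_meter("order-service")
sla_violations = meter.create_counter(
    "sla_violations_total",
    description="Messages that exceeded the processing SLA",
)

# Somewhere in your consumer loop: record a violation with a topic attribute
sla_violations.add(1, {"topic": "orders"})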
Production-Ready Setup Tips
Once your Confluent Cloud metrics are flowing into Last9, the next step is to harden the pipeline for production. This includes securing credentials, automating metric collection, ensuring high availability, and avoiding common failure modes.
Secure Metric Pushes
- Use authentication for PushGateway access. Enable basic auth or token-based protection to prevent unauthorized metric injection.
- Store credentials securely. Confluent API keys and secrets should be injected at runtime via environment variables or sourced from a secret management system (e.g., Kubernetes Secrets, AWS Secrets Manager).
- Avoid hardcoding secrets. Never embed credentials in source code, Docker images, or configuration files.
Automate Metric Collection via CI/CD or Scheduled Jobs
Operational metrics should be collected independently of application lifecycles. A dedicated process ensures consistent metric delivery even during rollouts or restarts.
Example: schedule metric collection using a Kubernetes CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: confluent-metrics-pusher
spec:
  schedule: "*/1 * * * *"  # Run every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: metrics-pusher
              image: your-registry/confluent-pusher:latest
              env:
                - name: CONFLUENT_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: confluent-credentials
                      key: api-key
This decouples metric ingestion from application code and aligns with infrastructure-as-code best practices.
Configure Threshold-Based Alerts
Use Last9 to define alerts for failure scenarios that impact pipeline stability:
- High consumer group lag
- Broker or partition unavailability
- Replication delays or throughput drops
Start with coarse thresholds, then tune based on baseline production behavior.
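As a starting point, here's what a coarse Prometheus-style alerting rule might look like for consumer lag, assuming the confluent_consumer_lag metric from the pusher above; the threshold and duration are placeholders to tune against your own baselines:
groups:
  - name: kafka-alerts
    rules:
      - alert: HighConsumerLag
        expr: confluent_consumer_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag on {{ $labels.topic }} has stayed above 10k messages for 5 minutes"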
Handle Deduplication When Running Multiple Pushers
If metric pushers are run in parallel (for redundancy or failover), ensure each instance emits unique labels:
- Include an instance or source label in each metric payload.
- This prevents overwriting or conflicting time series in PushGateway.
This is especially important in HA setups where deduplication errors can mask critical signals.
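One way to do this without touching the metric payload is PushGateway's grouping key: extra label/value pairs in the push URL keep each pusher's metrics in their own group. For example (metric name and labels illustrative):
echo 'confluent_consumer_lag{cluster="lkc-abc123",topic="orders"} 1523' | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/confluent/instance/pusher-1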
Troubleshooting Common Issues
Here's how to debug common issues when metrics don't show up in Last9 or behave unexpectedly.
Metrics Aren’t Showing Up in Last9
Start by checking the basics:
Short-lived metrics disappearing?
PushGateway only holds the most recent push for each group, in memory by default. A one-off push can be lost if the PushGateway restarts before Prometheus scrapes it, and a value that's never refreshed just sits there going stale.
Two options:
- Push on a schedule (e.g., every 30s)
- Tune scrape_interval to match your push frequency
Is Prometheus scraping it?
Open Prometheus Targets UI:
http://<prometheus-host>:9090/targets
Look for your confluent-cloud job. If it's down or throwing scrape errors, check the targets under static_configs in your Prometheus config.
Is PushGateway receiving anything?
Run:
curl http://<pushgateway-host>:9091/metrics
You should see your pushed metric in plain text. If it’s empty, the push step isn’t working.
Auth Failures
- Confluent API errors (401/403)
  - Check if your API key/secret is still valid
  - Make sure it's scoped to the right Kafka cluster
  - Watch for expired credentials; rotate if needed
- PushGateway errors (401/403)
If you’ve secured PushGateway, make sure the script or container pushing metrics includes the correct credentials:
curl -u user:token --data-binary @- http://pushgateway:9091/metrics/job/kafka
Duplicate or Missing Time Series
- Metric name or label mismatch? Prometheus is strict: topic vs topics becomes a different time series. Watch for typos or inconsistent label sets.
- Same metric, multiple sources? If you're running more than one pusher, metrics can overwrite each other. Add a stable instance label to each source:
confluent_consumer_lag{cluster="...", topic="...", instance="pusher-1"} 123
Delays or Gaps in Dashboard Data
- Push is working, but charts look empty?
  - Check scrape frequency in Prometheus (scrape_interval)
  - Ensure pushed metrics persist long enough to be scraped
  - Look for rate-limiting errors in the API response if you're querying too frequently
Extend Kafka Monitoring with Correlated Workflows
Once Confluent Cloud metrics are integrated into Last9, you can move past basic uptime views and start building workflows that expose performance regressions, system bottlenecks, and data pipeline anomalies.
Here’s what to layer in:
Correlate Kafka internals with application metrics
Use shared dimensions like topic names or partition IDs to link Kafka lag, retries, and byte throughput to application-level latency, error rates, or queue depth.
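In PromQL terms, that correlation can be as simple as a label join. The sketch below assumes your application exports a counter like app_messages_processed_total with a topic label matching the Kafka metrics (both names are illustrative):
confluent_consumer_lag
  / on(topic) group_left
sum by (topic) (rate(app_messages_processed_total[5m]))
The result reads as roughly "seconds of backlog per topic", tying lag directly to how fast each consumer is draining it.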
Define multi-metric alert conditions
Set alerts that trigger only when specific combinations of metrics breach thresholds, for example, high consumer lag and low ingestion rates. This avoids alert storms from transient spikes.
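For instance, an expression along these lines only fires when lag is high and throughput has dropped on the same topic (metric names and thresholds are illustrative):
confluent_consumer_lag > 10000
  and on(topic)
sum by (topic) (rate(confluent_received_bytes_total[5m])) < 1000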
Compare environments and deployments
Use label filters to compare cluster behavior between staging and production. Analyze the impact of schema changes, deployment timings, or partition count adjustments across environments.
Track time-windowed deltas
Use Last9’s query engine to track rate-of-change metrics over sliding windows:
- Consumer lag trends over 1h vs 24h
- Retry count deltas post-deployment
- Replication lag increases across regions
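Sketches of the kind of queries this maps to, reusing the hypothetical metric names from earlier:
# 1h vs 24h average consumer lag, as a ratio
avg_over_time(confluent_consumer_lag[1h]) / avg_over_time(confluent_consumer_lag[24h])
# Retries accumulated over the last hour (useful right after a deployment)
increase(confluent_producer_retries_total[1h])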
Integrate with custom metrics
The Confluent Metrics API focuses on infrastructure-level stats. For app-level tracing or per-message metadata, instrument your services with OpenTelemetry and forward those signals into Last9.
You now have a metrics pipeline that connects Kafka to the rest of your system. Use it to track how data flows, where it slows, and when it breaks, without managing separate metric stores or custom dashboards.