How to Monitor Kafka Producer Metrics

Monitor critical Kafka producer metrics like record-send-rate, error-rate, and buffer-available-bytes to troubleshoot performance issues in production.

Jun 10th, ‘25

Your Kafka producer pushed a million messages yesterday. Nice. But can you tell if they all made it? Or why latency spiked at 2 PM?

Producer metrics help you determine that. They expose how long messages take to send, whether messages are getting stuck, and whether retries are piling up. Let’s go over which ones help while debugging and how to monitor them.

The Key Kafka Producer Metrics

Out of all the metrics your Kafka producer exposes, only a handful are worth paying attention to.

record-send-rate
This is your throughput — how many records per second the producer is successfully sending. If this number drops suddenly, something in the pipeline is likely stuck.

record-error-rate
Tells you how many sends are failing. A few errors are normal under load. But if the rate climbs steadily, it's time to investigate: broker unavailability, serialization errors, and timeouts are all valid suspects.

request-latency-avg
This is the round-trip time for sending a batch and getting a response from the broker. Spikes here often indicate broker lag, network delays, or throttling.

buffer-available-bytes
Producers batch records in memory. If this number gets too low, your producer might block or start dropping messages. Especially relevant under sustained high throughput.

Here’s how you might pull these metrics from a running producer:

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

Map<MetricName, ? extends Metric> metrics = producer.metrics();

double sendRate = getMetricValue(metrics, "record-send-rate");
double errorRate = getMetricValue(metrics, "record-error-rate");
long availableBuffer = (long) getMetricValue(metrics, "buffer-available-bytes");

logger.info("Send rate: {} msg/sec, Error rate: {} errors/sec, Buffer available: {} bytes",
    sendRate, errorRate, availableBuffer);
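
The snippet assumes a small getMetricValue helper, which isn't part of the Kafka API. Here's a minimal sketch of what it might look like: it matches a producer metric by name and returns its numeric value, falling back to 0 when the metric isn't present.

// Hypothetical helper used in the snippets in this post: look up a producer metric by name.
private static double getMetricValue(Map<MetricName, ? extends Metric> metrics, String name) {
    return metrics.entrySet().stream()
            .filter(entry -> entry.getKey().name().equals(name))
            .map(entry -> entry.getValue().metricValue())
            .filter(value -> value instanceof Number)
            .mapToDouble(value -> ((Number) value).doubleValue())
            .findFirst()
            .orElse(0.0);
}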
💡
When you see your send rate dropping but error rates staying low, you might actually be looking at consumer lag backing up your topics - check out our guide on fixing Kafka consumer lag to understand how slow consumers can impact your entire pipeline.

Trace Real Production Issues Using Kafka Producer Metrics

Metrics are useful when they map to actual failure modes or performance degradation.

  • batch-size-avg
    Controls how efficiently messages are sent over the wire.
    • Small batch sizes → more network overhead, higher CPU usage.
    • Large batch sizes → better throughput, but increased latency.
      Keep an eye here if you’re tuning for performance but seeing lag in message delivery.
  • record-send-rate drops, but record-error-rate stays flat
    This usually points to saturation somewhere — broker-side CPU, disk I/O limits, or socket-level backpressure. The producer is being throttled without outright failure.
  • record-retry-rate climbs, with no corresponding error spike
    A sign of transient network instability. Packets are making it through eventually, but retries are eating into buffer time and CPU. If this creeps up, dig into network paths or DNS latency; a small check for this pattern is sketched below.
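
Here's a rough sketch of how you might watch for that last pattern programmatically. It reuses the hypothetical getMetricValue helper from earlier, and the thresholds are placeholders rather than recommendations.

// Sketch: flag the "retries climbing while errors stay flat" pattern.
public void checkRetryPressure(KafkaProducer<String, String> producer) {
    Map<MetricName, ? extends Metric> metrics = producer.metrics();
    double retryRate = getMetricValue(metrics, "record-retry-rate");
    double errorRate = getMetricValue(metrics, "record-error-rate");

    // Placeholder thresholds: tune them against your own baselines
    if (retryRate > 5.0 && errorRate < 0.1) {
        logger.warn("Retries at {}/sec while errors stay flat ({}/sec): suspect transient network instability",
                retryRate, errorRate);
    }
}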

How to Set Up Kafka Producer Monitoring

Most Kafka producers expose dozens of metrics, but you only need a few to identify real issues early.

  • record-send-rate – This tells you how fast your producer is sending messages. If it drops without a rise in errors, something’s likely slowing things down—maybe full buffers, maybe backpressure from the broker.
  • record-error-rate – Failed sends. Some errors are expected. A steady climb usually means broker unavailability, bad configurations, or network instability.
  • request-latency-avg – Time taken to get a response from the broker. Spikes here often suggest the broker is under load or the network’s struggling.
  • buffer-available-bytes – When this hits zero, your producer starts blocking or dropping messages. Useful to identify issues before backup.

You don’t need a big monitoring setup to track these. A basic scheduled job that scrapes the metrics every 30 seconds is more than enough to start with:

@Component
public class KafkaMetricsCollector {
    private final MeterRegistry registry;
    private final KafkaProducer<String, String> producer;
    // Latest readings, shared with the registered gauges so they update on every collection
    private final Map<String, Double> latest = new ConcurrentHashMap<>();

    public KafkaMetricsCollector(MeterRegistry registry, KafkaProducer<String, String> producer) {
        this.registry = registry;
        this.producer = producer;
    }

    @Scheduled(fixedRate = 30000)
    public void collect() {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        update("kafka.producer.send.rate", "record-send-rate", metrics);
        update("kafka.producer.error.rate", "record-error-rate", metrics);
        update("kafka.producer.latency", "request-latency-avg", metrics);
        update("kafka.producer.buffer.available", "buffer-available-bytes", metrics);
    }

    private void update(String name, String key, Map<MetricName, ? extends Metric> metrics) {
        latest.put(name, getMetricValue(metrics, key));
        // Registering with the same name is a no-op after the first call; the gauge
        // reads the latest value from the map, so it stays current between collections
        Gauge.builder(name, latest, m -> m.getOrDefault(name, 0.0)).register(registry);
    }
}

Once this is running, watch for patterns: when things slow down, when buffers fill, when errors spike. Then tune based on what you see.

💡
You'll probably need more than just producer metrics to keep tabs on everything - we compared different Kafka monitoring tools that can help you monitor your whole setup.

Reading Kafka Producer Metrics Before Things Break

Kafka issues rarely happen out of nowhere. The signs are usually there; you just have to know what to monitor.

Start by knowing what normal looks like:

  • Send rates should follow predictable daily patterns.
  • Error rates should be low.
  • Latency should stay within a tight range, with small bumps during high traffic.

What matters more are the slow shifts; a simple way to catch them is sketched after this list:

  • Batch sizes getting smaller over time
    This could point to configuration drift or changes in how messages are grouped. Smaller batches mean more frequent sends, which can reduce throughput.
  • Buffer availability going down gradually
    Indicates your producer is pushing messages faster than the broker can handle. If this continues, you'll hit memory pressure or message drops.
  • Latency rising slowly while error rates stay flat
    Likely a sign of resource limits—broker I/O, CPU, or network congestion.
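
One way to catch these gradual changes is to compare each new reading against a running baseline. This is only a sketch: the window size and drift threshold are arbitrary placeholders, and you'd feed it values pulled via producer.metrics() (for example buffer-available-bytes or batch-size-avg) on each collection cycle.

// Sketch: keep a rolling baseline for one metric and flag slow drift away from it.
public class MetricDriftTracker {
    private static final int WINDOW_SIZE = 60;          // placeholder: 60 samples at 30s = 30 minutes
    private static final double DRIFT_THRESHOLD = 0.2;  // placeholder: flag a 20% move from baseline

    private final Deque<Double> window = new ArrayDeque<>();

    // Returns true when the latest sample sits more than the threshold away from the window average
    public boolean record(double latest) {
        if (window.size() >= WINDOW_SIZE) {
            window.removeFirst();
        }
        window.addLast(latest);

        double baseline = window.stream().mapToDouble(Double::doubleValue).average().orElse(latest);
        return baseline > 0 && Math.abs(latest - baseline) / baseline > DRIFT_THRESHOLD;
    }
}

A tracker like this won't tell you why a metric is drifting, but it turns "slow shifts" into something you can alert on.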

Here's a basic reference to tie symptoms to potential causes:

| What You Notice | Metric Behavior | Possible Cause |
|---|---|---|
| Broker under load | Latency increases, send rate unchanged | Brokers lagging behind |
| Unstable connection | Retry rate increases, errors stay low | Temporary network issues |
| Buffer filling up | Send rate drops, buffer usage high | Producer can't keep up |
| Config drift or tuning issue | Batch size inconsistent, latency varies | Inefficient batching or app changes |

How to Connect Kafka Producer Metrics to Cluster-Wide Issues

Producer metrics don’t always point to problems with the producer itself. Sometimes they reflect broader issues elsewhere in the system.

  • A drop in record-send-rate might not mean the producer is misbehaving—it could be caused by consumer lag, leading to topic backpressure.
  • Higher request-latency-avg might mean brokers are overloaded or the entire Kafka cluster is under resource pressure.

This is where correlation matters. Producer metrics in isolation won’t tell you much. You need to compare them with broker metrics (like queue size or under-replicated partitions) and consumer lag to get a full view.

One metric worth highlighting here is:

  • produce-throttle-time-avg
    When this climbs, your producer is being throttled by the broker. It’s not a client issue—it’s the broker applying quota limits. This usually ties back to capacity constraints or misconfigured quotas.

You’ll also want to pay attention to lower-level metrics like:

  • connection-close-rate
    A spike here can signal network issues or unstable broker availability.
  • io-wait-time-ns-avg
    If this climbs, your producers might be spending time blocked on disk or socket operations, pointing to infrastructure bottlenecks.

If you’re seeing these alongside retry spikes or latency issues, it's time to look beyond just producer code and review your Kafka cluster’s health and your network path.
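
To make that correlation easier, you can pull these broker-facing signals from the same producer.metrics() map you're already scraping. A rough sketch, again using the hypothetical getMetricValue helper, with placeholder thresholds:

// Sketch: surface the metrics that usually point beyond the producer itself.
public void checkClusterSignals(KafkaProducer<String, String> producer) {
    Map<MetricName, ? extends Metric> metrics = producer.metrics();

    double throttleTimeMs = getMetricValue(metrics, "produce-throttle-time-avg");
    double connectionCloseRate = getMetricValue(metrics, "connection-close-rate");
    double ioWaitTimeNs = getMetricValue(metrics, "io-wait-time-ns-avg");

    if (throttleTimeMs > 0) {
        logger.warn("Broker is throttling this producer (avg {} ms): check quotas and broker capacity", throttleTimeMs);
    }
    if (connectionCloseRate > 1.0) { // placeholder threshold
        logger.warn("Connections closing at {}/sec: possible network or broker instability", connectionCloseRate);
    }
    if (ioWaitTimeNs > 100_000_000) { // placeholder: more than 100 ms spent waiting on I/O
        logger.warn("High I/O wait ({} ns): producer threads are blocked on socket operations", ioWaitTimeNs);
    }
}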

💡
If you want to see how producer metrics fit into the bigger picture, our Kafka observability guide walks through monitoring your entire Kafka stack together.

What Kafka Metrics Reveal as Your Traffic Scales

As your throughput grows, the way your Kafka producer behaves will shift. What worked fine at a few thousand messages per second might start causing issues at ten or a hundred times that load.

Producer metrics help highlight these scaling challenges early before they turn into production bottlenecks.

Batching becomes more critical at high volume.
You’ll want to watch how your batch-size-avg metric compares to the batch.size and linger.ms settings. If your average batch size is consistently small despite high throughput, you’re probably flushing messages too early. A small tweak to linger.ms can give the producer just enough time to batch more efficiently, often improving throughput without major changes.

Buffer usage also shifts as load increases.
At low volume, the producer has plenty of breathing room—buffer usage stays stable. At higher throughput, you’ll likely see more fluctuation. This isn't necessarily a problem, but it's worth tracking: it shows how hard the producer is working to keep up.

You can use these signals to adapt configurations dynamically, especially if your workloads vary across time:

public class AdaptiveProducerConfig {
    private volatile int currentBatchSize = 16384; // 16KB default
    private volatile int currentLingerMs = 5;

    @Scheduled(fixedRate = 60000)
    public void adjustConfiguration() {
        // getMetricValue here wraps producer.metrics(), as in the earlier snippets
        double avgBatchSize = getMetricValue("batch-size-avg");
        double sendRate = getMetricValue("record-send-rate");

        // If we're sending a lot, but batching poorly, increase linger
        if (avgBatchSize < currentBatchSize * 0.5 && sendRate > 1000) {
            currentLingerMs = Math.min(currentLingerMs + 1, 100);
            // KafkaProducer settings are fixed at construction time, so applying a new
            // linger.ms means rebuilding the producer with the updated properties
            updateProducerConfig("linger.ms", currentLingerMs);
        }
    }
}

This kind of logic won’t replace good up-front tuning, but it can help avoid obvious inefficiencies when traffic patterns shift.

💡
Now, fix production Kafka producer issues instantly right from your IDE, with AI and Last9 MCP. Bring logs, metrics, and traces into your local environment to auto-fix code faster.

Using Kafka Producer Metrics to Debug Issues

When things go wrong, your metrics aren’t just numbers—they’re the starting point for figuring out what’s broken.

Start with three checks:

  • Is the send rate lower than usual?
  • Are error rates increasing?
  • Is latency rising?

These usually narrow the problem down quickly; a small triage sketch follows the list below.

  • Low send rate, normal error rate
    This often means the producer is blocked. Check buffer usage and broker performance—it's likely not your app logic, but pressure somewhere in the infrastructure.
  • High error rate
    The type of error matters:
    • Timeouts → likely network or overloaded brokers
    • Auth failures → check credentials, SSL configs
    • Serialization errors → usually bad input data or a mismatch in schemas
  • High latency, but send and error rates look fine
    That usually means your system is nearing its limits. Nothing’s failed yet, but it’s working harder than usual. Useful signal—if you catch it early, you can scale or adjust before things start dropping messages.
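
Here's what that first pass might look like in code. It's only a sketch: the baseline and thresholds are placeholders you'd replace with numbers from your own traffic, and it reuses the hypothetical getMetricValue helper from earlier.

// Sketch: first-pass triage using the three checks above. Thresholds are placeholders.
public void triage(KafkaProducer<String, String> producer, double baselineSendRate) {
    Map<MetricName, ? extends Metric> metrics = producer.metrics();
    double sendRate = getMetricValue(metrics, "record-send-rate");
    double errorRate = getMetricValue(metrics, "record-error-rate");
    double latencyMs = getMetricValue(metrics, "request-latency-avg");

    if (sendRate < baselineSendRate * 0.5 && errorRate < 0.1) {
        logger.warn("Send rate down, errors flat: check buffer usage and broker performance");
    } else if (errorRate >= 0.1) {
        logger.warn("Error rate elevated ({}/sec): look for timeouts, auth failures, or serialization errors", errorRate);
    } else if (latencyMs > 100) {
        logger.warn("Latency at {} ms with healthy send/error rates: the system may be nearing its limits", latencyMs);
    }
}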
💡
For setting up Last9 with your Kafka infrastructure, check out our docs for step-by-step instructions.

Last9 for Kafka Monitoring

Using Last9 to monitor Kafka gives you the basics you rely on in production: send rate, error rate, request latency, and buffer usage.

What makes it useful in Kafka-heavy environments:

  • You can monitor per-topic and per-producer metrics without worrying about cardinality limits or query performance.
  • It works directly with Prometheus and OpenTelemetry, with no custom exporters or side processes.
  • Logs, metrics, and traces land in one place. When something slows down, you don’t have to pivot across three systems to figure out why.
  • Alerting can be tuned for sustained issues, not short spikes, so you get signal, not noise.

For teams running Kafka at scale, especially with variable workloads, Last9 helps you stay ahead of throughput drops and producer bottlenecks. No need to build your own patchwork of tools to get basic visibility.

Get started with us today!

FAQs

What is the producer latency metric in Kafka?
request-latency-avg measures the time it takes for the producer to send a batch to the broker and receive an acknowledgment. High latency usually points to broker load, batching inefficiencies, or network issues.

How do I increase Kafka producer throughput?
Start by tuning batch.size, linger.ms, and compression.type. Larger batches reduce network overhead, and compression helps reduce payload size. Make sure the broker can handle the load—producer-side tuning only helps if the rest of the pipeline keeps up.
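
As a rough illustration, those settings go into the producer configuration like this. The values are examples to tune against your own workload, not recommendations, and the bootstrap address is a placeholder.

// Example starting points only; measure batch-size-avg and request-latency-avg after each change.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);          // 32KB batches instead of the 16KB default
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);              // wait up to 10ms to fill a batch
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");    // shrink payloads on the wire

KafkaProducer<String, String> producer = new KafkaProducer<>(props);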

What is the role of the Kafka producer?
A Kafka producer is responsible for publishing records to Kafka topics. It handles batching, partitioning, retries, and delivery guarantees based on the configuration you set.

What is the optimal batch size for Kafka producer?
There’s no universal number. Start with the default (16384 bytes) and tune based on your workload. Monitor batch-size-avg—if it’s consistently low relative to batch.size, you may need to adjust linger.ms or improve your batching logic.

Why do Kafka metrics matter?
They give you visibility into throughput, delivery health, retry behavior, and system limits. Without metrics, you’re flying blind, especially in production environments where small issues compound fast.

What are Graphite and Grafana?
Graphite is a time-series database often used to store metrics. Grafana is a dashboarding tool that can visualize data from Graphite (and many other sources). Together, they’re often used to monitor systems like Kafka.

What are the benefits and challenges of using Kafka for data streaming?
Kafka is durable, high-throughput, and horizontally scalable. It works well for real-time pipelines. The trade-offs: operational complexity, tuning requirements, and the learning curve around exactly-once semantics and partitioning.

What are the pros and cons of Apache Kafka, Apache Flink, and Apache Spark for data streaming?

| Tool | Pros | Cons |
|---|---|---|
| Kafka | Great for buffering, real-time messaging, fault tolerance | Not built for processing; you'll need external compute |
| Flink | True event-time stream processing, low latency, strong state support | Operational complexity, steeper learning curve |
| Spark | Strong ecosystem, batch + streaming, good for large workloads | Higher latency, not ideal for low-latency real-time pipelines |

How can you configure and customize distributed systems tools for your specific use case?
Start with defaults, then use metrics and logs to understand bottlenecks. Adjust the configuration based on observed patterns. For Kafka, that means tuning producers, brokers, and consumer settings together, not in isolation.

How to enable Kafka Metrics Reporter?
Custom reporters are enabled by setting the metric.reporters property in the client or broker configuration. For producers and consumers, metrics are also available programmatically via the producer.metrics() and consumer.metrics() APIs, and broker metrics are exposed over JMX by default.
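
For example, wiring a custom reporter into the producer looks roughly like this (the reporter class name is a placeholder for your own MetricsReporter implementation):

// metric.reporters is a standard Kafka config; the class below is a placeholder for your
// own implementation of org.apache.kafka.common.metrics.MetricsReporter.
props.put(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG, "com.example.CustomMetricsReporter");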

How do I monitor Kafka producer metrics effectively?
Track core metrics like record-send-rate, record-error-rate, request-latency-avg, and buffer-available-bytes. Use a system like Prometheus to scrape and store them, and build alerts around trends, not one-off spikes.

How can I monitor and interpret Kafka producer metrics?
Look for patterns:

  • Drop in send rate? Check buffers and broker throughput.
  • High error rate? Check log details for timeouts, auth failures, or serialization problems.
  • Latency creeping up? It could mean broker load or poor batching.

Use dashboards to correlate across metrics and time windows.

Authors

Anjali Udasi

Helping to make the tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.