Your Kafka producer pushed a million messages yesterday. Nice. But can you tell whether they all made it? Or why latency spiked at 2 PM?
Producer metrics help you answer those questions. They show how long messages take to send, whether messages are getting stuck, and whether retries are piling up. Let's go over which ones matter when you're debugging and how to monitor them.
The Key Kafka Producer Metrics
Out of all the metrics your Kafka producer exposes, only a handful are worth paying attention to.
record-send-rate
This is your throughput — how many records per second the producer is successfully sending. If this number drops suddenly, something in the pipeline is likely stuck.
record-error-rate
Tells you how many sends are failing. A few errors are normal under load. But if the rate climbs steadily, it's time to investigate: broker unavailability, serialization errors, and timeouts are all valid suspects.
request-latency-avg
This is the round-trip time for sending a batch and getting a response from the broker. Spikes here often indicate broker lag, network delays, or throttling.
buffer-available-bytes
Producers batch records in memory. If this number gets too low, your producer might block or start dropping messages. Especially relevant under sustained high throughput.
Here’s how you might pull these metrics from a running producer:
```java
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
Map<MetricName, ? extends Metric> metrics = producer.metrics();

double sendRate = getMetricValue(metrics, "record-send-rate");
double errorRate = getMetricValue(metrics, "record-error-rate");
long availableBuffer = (long) getMetricValue(metrics, "buffer-available-bytes");

// record-error-rate is failed sends per second, not a fraction, so log it as-is
logger.info("Send rate: {} msg/sec, Error rate: {} errors/sec, Buffer available: {} bytes",
        sendRate, errorRate, availableBuffer);
```
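The getMetricValue call above isn't part of the Kafka client API; it's a small helper you'd write yourself. A minimal version, matching how the later examples use it, might look like this:

```java
// Looks up a client metric by name and returns its value as a double.
// Returns 0.0 if the metric isn't present yet (e.g., before the first send).
private static double getMetricValue(Map<MetricName, ? extends Metric> metrics, String name) {
    return metrics.entrySet().stream()
            .filter(entry -> entry.getKey().name().equals(name))
            .mapToDouble(entry -> ((Number) entry.getValue().metricValue()).doubleValue())
            .findFirst()
            .orElse(0.0);
}
```

In practice you may also want to match on the metric group or tags, since some names repeat per topic or per broker node.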
Trace Real Production Issues Using Kafka Producer Metrics
Metrics are useful when they map to actual failure modes or performance degradation.
batch-size-avg
Shows how efficiently messages are being batched for the wire.
- Small batch sizes → more network overhead, higher CPU usage.
- Large batch sizes → better throughput, but increased latency.
Keep an eye here if you’re tuning for performance but seeing lag in message delivery.
record-send-rate drops, but record-error-rate stays flat
This usually points to saturation somewhere: broker-side CPU, disk I/O limits, or socket-level backpressure. The producer is being throttled without outright failure.
record-retry-rate climbs, with no corresponding error spike
A sign of transient network instability. Packets are making it through eventually, but retries are eating into buffer time and CPU. If this creeps up, dig into network paths or DNS latency.
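Here's a rough way to surface that retry pattern from the same metrics map used earlier; the thresholds are illustrative, not recommendations:

```java
// Assumes the metrics map and getMetricValue helper shown above.
// Flags likely transient network instability: retries climbing while errors stay flat.
double retryRate = getMetricValue(metrics, "record-retry-rate");
double errorRate = getMetricValue(metrics, "record-error-rate");
if (retryRate > 5.0 && errorRate < 0.1) {
    logger.warn("Retries at {}/sec with almost no errors; check network paths and DNS latency",
            retryRate);
}
```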
How to Set Up Kafka Producer Monitoring
Most Kafka producers expose dozens of metrics, but you only need a few to identify real issues early.
- record-send-rate – This tells you how fast your producer is sending messages. If it drops without a rise in errors, something's likely slowing things down: maybe full buffers, maybe backpressure from the broker.
- record-error-rate – Failed sends. Some errors are expected. A steady climb usually means broker unavailability, bad configurations, or network instability.
- request-latency-avg – Time taken to get a response from the broker. Spikes here often suggest the broker is under load or the network's struggling.
- buffer-available-bytes – When this hits zero, your producer starts blocking or dropping messages. Useful for catching issues before messages back up.
You don’t need a big monitoring setup to track these. A basic scheduled job that scrapes the metrics every 30 seconds is more than enough to start with:
```java
@Component
public class KafkaMetricsCollector {

    private final MeterRegistry registry;
    private final KafkaProducer<String, String> producer;
    // Latest scraped values; gauges read from this map so they stay current between scrapes.
    private final ConcurrentMap<String, Double> latest = new ConcurrentHashMap<>();

    public KafkaMetricsCollector(MeterRegistry registry, KafkaProducer<String, String> producer) {
        this.registry = registry;
        this.producer = producer;
    }

    @Scheduled(fixedRate = 30000)
    public void collect() {
        Map<MetricName, ? extends Metric> metrics = producer.metrics();
        update("kafka.producer.send.rate", "record-send-rate", metrics);
        update("kafka.producer.error.rate", "record-error-rate", metrics);
        update("kafka.producer.latency", "request-latency-avg", metrics);
        update("kafka.producer.buffer.available", "buffer-available-bytes", metrics);
    }

    private void update(String name, String key, Map<MetricName, ? extends Metric> metrics) {
        // getMetricValue is the lookup helper shown earlier.
        latest.put(name, getMetricValue(metrics, key));
        // Re-registering with the same name is a no-op in Micrometer; the gauge's lambda
        // reads the latest value from the map instead of capturing a stale snapshot.
        Gauge.builder(name, () -> latest.getOrDefault(name, 0.0)).register(registry);
    }
}
```
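One detail that's easy to miss: @Scheduled methods only run if scheduling is enabled somewhere in the application context, typically with a small configuration class like this:

```java
// Without this (or an equivalent @EnableScheduling elsewhere), the collector above never fires.
@Configuration
@EnableScheduling
public class SchedulingConfig {
}
```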
Once this is running, watch for patterns: when things slow down, when buffers fill, and when errors spike. Then tune based on what you see.
Reading Kafka Producer Metrics Before Things Break
Kafka issues rarely happen out of nowhere. The signs are usually there; you just have to know what to monitor.
Start by knowing what normal looks like:
- Send rates should follow predictable daily patterns.
- Error rates should be low.
- Latency should stay within a tight range, with small bumps during high traffic.
What matters more are the slow shifts:
- Batch sizes getting smaller over time
This could point to configuration drift or changes in how messages are grouped. Smaller batches mean more frequent sends, which can reduce throughput.
- Buffer availability going down gradually
Indicates your producer is pushing messages faster than the broker can handle. If this continues, you'll hit memory pressure or message drops.
- Latency rising slowly while error rates stay flat
Likely a sign of resource limits: broker I/O, CPU, or network congestion.
Here's a basic reference to tie symptoms to potential causes:
| What You Notice | Metric Behavior | Possible Cause |
|---|---|---|
| Broker under load | Latency increases, send rate unchanged | Brokers lagging behind |
| Unstable connection | Retry rate increases, errors stay low | Temporary network issues |
| Buffer filling up | Send rate drops, buffer usage high | Producer can't keep up |
| Config drift or tuning issue | Batch size inconsistent, latency varies | Inefficient batching or app changes |
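These slow shifts are easier to catch if you compare each scrape against a rolling baseline instead of a fixed threshold. A minimal sketch, assuming you feed it values from the collector above; the smoothing factor and 30% drift threshold are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MetricDriftTracker {

    private static final Logger logger = LoggerFactory.getLogger(MetricDriftTracker.class);
    private final Map<String, Double> baselines = new HashMap<>();

    // Updates an exponential moving average per metric and warns when the current
    // value falls well below that baseline.
    public void check(String key, double current) {
        double baseline = baselines.merge(key, current, (old, now) -> old * 0.9 + now * 0.1);
        if (baseline > 0 && current < baseline * 0.7) {
            logger.warn("{} is well below its recent baseline ({} vs {})", key, current, baseline);
        }
    }
}
```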
How to Connect Kafka Producer Metrics to Cluster-Wide Issues
Producer metrics don’t always point to problems with the producer itself. Sometimes they reflect broader issues elsewhere in the system.
- A drop in record-send-rate might not mean the producer is misbehaving; it could be caused by consumer lag, leading to topic backpressure.
- Higher request-latency-avg might mean brokers are overloaded or the entire Kafka cluster is under resource pressure.
This is where correlation matters. Producer metrics in isolation won’t tell you much. You need to compare them with broker metrics (like queue size or under-replicated partitions) and consumer lag to get a full view.
One metric worth highlighting here is:
produce-throttle-time-avg
When this climbs, your producer is being throttled by the broker. It’s not a client issue—it’s the broker applying quota limits. This usually ties back to capacity constraints or misconfigured quotas.
You’ll also want to pay attention to lower-level metrics like:
connection-close-rate
A spike here can signal network issues or unstable broker availability.
io-wait-time-ns-avg
If this climbs under load, the producer's I/O thread is spending more time waiting for sockets to become ready, which points to network or broker-side bottlenecks.
If you’re seeing these alongside retry spikes or latency issues, it's time to look beyond just producer code and review your Kafka cluster’s health and your network path.
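If you want these cluster-level signals in the same scrape, you can pull them from the producer's metrics map the same way. A minimal sketch, reusing the getMetricValue helper from earlier; the names come from the producer's client metrics:

```java
// Assumes the metrics map and getMetricValue helper shown earlier.
double throttleTimeMs = getMetricValue(metrics, "produce-throttle-time-avg");
double connectionCloseRate = getMetricValue(metrics, "connection-close-rate");
double ioWaitNs = getMetricValue(metrics, "io-wait-time-ns-avg");

if (throttleTimeMs > 0) {
    logger.warn("Broker is throttling produce requests (avg {} ms); check quotas and capacity",
            throttleTimeMs);
}
```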
What Kafka Metrics Reveal as Your Traffic Scales
As your throughput grows, the way your Kafka producer behaves will shift. What worked fine at a few thousand messages per second might start causing issues at ten or a hundred times that load.
Producer metrics help highlight these scaling challenges early before they turn into production bottlenecks.
Batching becomes more critical at high volume.
You'll want to watch how your batch-size-avg metric compares to the batch.size and linger.ms settings. If your average batch size is consistently small despite high throughput, you're probably flushing messages too early. A small tweak to linger.ms can give the producer just enough time to batch more efficiently, often improving throughput without major changes.
Buffer usage also shifts as load increases.
At low volume, the producer has plenty of breathing room—buffer usage stays stable. At higher throughput, you’ll likely see more fluctuation. This isn't necessarily a problem, but it's worth tracking: it shows how hard the producer is working to keep up.
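One way to make that fluctuation visible is to track utilization rather than raw bytes; buffer-total-bytes and buffer-available-bytes are both standard producer metrics:

```java
// Rough buffer utilization as a percentage, using the getMetricValue helper from earlier.
double totalBytes = getMetricValue(metrics, "buffer-total-bytes");
double availableBytes = getMetricValue(metrics, "buffer-available-bytes");
double utilizationPct = totalBytes > 0 ? (1.0 - availableBytes / totalBytes) * 100.0 : 0.0;

// A sustained climb here (not a single spike) is the signal worth alerting on.
logger.info("Producer buffer utilization: {}%", String.format("%.1f", utilizationPct));
```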
You can use these signals to adapt configurations dynamically, especially if your workloads vary across time:
```java
public class AdaptiveProducerConfig {

    private volatile int currentBatchSize = 16384; // 16KB default
    private volatile int currentLingerMs = 5;

    @Scheduled(fixedRate = 60000)
    public void adjustConfiguration() {
        // getMetricValue here is a single-argument wrapper around the lookup helper shown earlier.
        double avgBatchSize = getMetricValue("batch-size-avg");
        double sendRate = getMetricValue("record-send-rate");

        // If we're sending a lot, but batching poorly, increase linger
        if (avgBatchSize < currentBatchSize * 0.5 && sendRate > 1000) {
            currentLingerMs = Math.min(currentLingerMs + 1, 100);
            // linger.ms can't be changed on a live KafkaProducer; updateProducerConfig is a
            // placeholder for rebuilding the producer or feeding the value into the next rollout.
            updateProducerConfig("linger.ms", currentLingerMs);
        }
    }
}
```
This kind of logic won’t replace good up-front tuning, but it can help avoid obvious inefficiencies when traffic patterns shift.
Using Kafka Producer Metrics to Debug Issues
When things go wrong, your metrics aren’t just numbers—they’re the starting point for figuring out what’s broken.
Start with three checks:
- Is the send rate lower than usual?
- Are error rates increasing?
- Is latency rising?
These usually narrow the problem down quickly.
- Low send rate, normal error rate
This often means the producer is blocked. Check buffer usage and broker performance; it's likely not your app logic, but pressure somewhere in the infrastructure.
- High error rate
The type of error matters:
  - Timeouts → likely network or overloaded brokers
  - Auth failures → check credentials, SSL configs
  - Serialization errors → usually bad input data or a mismatch in schemas
- High latency, but send and error rates look fine
That usually means your system is nearing its limits. Nothing's failed yet, but it's working harder than usual. Useful signal: if you catch it early, you can scale or adjust before things start dropping messages.
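To make the triage concrete, here's a compact sketch of those three checks. The baseline and thresholds are illustrative and would come from your own normal traffic, and the logger is assumed to exist in the surrounding class:

```java
// Hypothetical helper: compares current readings against a known-good baseline.
void triage(double sendRate, double baselineSendRate, double errorRate, double latencyMs) {
    if (sendRate < baselineSendRate * 0.5 && errorRate < 0.1) {
        logger.warn("Send rate halved without errors; check buffer usage and broker health");
    } else if (errorRate > 1.0) {
        logger.warn("Error rate climbing; inspect exception types (timeouts, auth, serialization)");
    } else if (latencyMs > 200) {
        logger.warn("Latency elevated while sends look healthy; system may be nearing its limits");
    }
}
```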
Last9 for Kafka Monitoring
Using Last9 to monitor Kafka gives you the basics you rely on in production: send rate, error rate, request latency, and buffer usage.
What makes it useful in Kafka-heavy environments:
- You can monitor per-topic and per-producer metrics without worrying about cardinality limits or query performance.
- It works directly with Prometheus and OpenTelemetry; no custom exporters or side processes.
- Logs, metrics, and traces land in one place. When something slows down, you don’t have to pivot across three systems to figure out why.
- Alerting can be tuned for sustained issues, not short spikes, so you get signal, not noise.
For teams running Kafka at scale, especially with variable workloads, Last9 helps you stay ahead of throughput drops and producer bottlenecks. No need to build your own patchwork of tools to get basic visibility.
Get started with us today!
FAQs
What is the producer latency metric in Kafka?
request-latency-avg measures the time it takes for the producer to send a batch to the broker and receive an acknowledgment. High latency usually points to broker load, batching inefficiencies, or network issues.
How do I increase Kafka producer throughput?
Start by tuning batch.size, linger.ms, and compression.type. Larger batches reduce network overhead, and compression helps reduce payload size. Make sure the broker can handle the load; producer-side tuning only helps if the rest of the pipeline keeps up.
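For illustration, those settings look like this in producer configuration. The values are starting points to tune against your own batch-size-avg and latency numbers, not recommendations:

```java
// Example producer properties for throughput-oriented tuning (illustrative values).
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32768);        // 32KB batches: fewer, fuller requests
props.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to 10 ms to fill a batch
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // smaller payloads on the wire
```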
What is the role of the Kafka producer?
A Kafka producer is responsible for publishing records to Kafka topics. It handles batching, partitioning, retries, and delivery guarantees based on the configuration you set.
What is the optimal batch size for Kafka producer?
There's no universal number. Start with the default (16384 bytes) and tune based on your workload. Monitor batch-size-avg: if it's consistently low relative to batch.size, you may need to adjust linger.ms or improve your batching logic.
Why do Kafka metrics matter?
They give you visibility into throughput, delivery health, retry behavior, and system limits. Without metrics, you’re flying blind, especially in production environments where small issues compound fast.
What is Graphite and Grafana?
Graphite is a time-series database often used to store metrics. Grafana is a dashboarding tool that can visualize data from Graphite (and many other sources). Together, they’re often used to monitor systems like Kafka.
What are the benefits and challenges of using Kafka for data streaming?
Kafka is durable, high-throughput, and horizontally scalable. It works well for real-time pipelines. The trade-offs: operational complexity, tuning requirements, and the learning curve around exactly-once semantics and partitioning.
What are the pros and cons of Apache Kafka, Apache Flink, and Apache Spark for data streaming?
| Tool | Pros | Cons |
|---|---|---|
| Kafka | Great for buffering, real-time messaging, fault-tolerance | Not built for processing; you'll need external compute |
| Flink | True event-time stream processing, low latency, strong state support | Operational complexity, steeper learning curve |
| Spark | Strong ecosystem, batch + streaming, good for large workloads | Higher latency, not ideal for low-latency real-time pipelines |
How can you configure and customize distributed systems tools for your specific use case?
Start with defaults, then use metrics and logs to understand bottlenecks. Adjust the configuration based on observed patterns. For Kafka, that means tuning producers, brokers, and consumer settings together, not in isolation.
How to enable Kafka Metrics Reporter?
Kafka clients and brokers expose metrics over JMX by default; you can plug in additional reporters by setting the metric.reporters property in the client or broker config. For producers and consumers, metrics are also available programmatically via the producer.metrics() and consumer.metrics() APIs.
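As a sketch, registering a custom reporter on a producer looks like this; com.example.MyMetricsReporter is a hypothetical class implementing Kafka's MetricsReporter interface:

```java
// metric.reporters accepts a comma-separated list of MetricsReporter implementations.
// The class name below is hypothetical; substitute your own reporter.
props.put(ProducerConfig.METRIC_REPORTERS_CONFIG, "com.example.MyMetricsReporter");
```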
How do I monitor Kafka producer metrics effectively?
Track core metrics like record-send-rate, record-error-rate, request-latency-avg, and buffer-available-bytes. Use a system like Prometheus to scrape and store them, and build alerts around trends, not one-off spikes.
How can I monitor and interpret Kafka producer metrics?
Look for patterns.
- Drop in send rate? Check buffers and broker throughput.
- High error rate? Check log details for timeouts, auth failures, or serialization problems.
- Latency creeping up? It could mean broker load or poor batching.
Use dashboards to correlate across metrics and time windows.