
Apache Kafka

Monitor Apache Kafka clusters with OpenTelemetry for comprehensive streaming platform observability

Use OpenTelemetry to monitor self-managed Apache Kafka clusters and send telemetry data to Last9. This integration provides comprehensive monitoring of Kafka brokers, topics, consumers, producers, and message flow across your streaming platform.

Prerequisites

Before setting up Kafka monitoring, ensure you have:

  • Kafka Cluster: Running Apache Kafka cluster (2.0.0 or higher)
  • Monitoring Server: Virtual machine or container where you can run OpenTelemetry Collector
  • Network Access: Collector can reach Kafka brokers and Last9 endpoints
  • Administrative Access: Permission to install and configure monitoring components
  • Last9 Account: With integration credentials
  1. Install OpenTelemetry Collector

    Choose the appropriate package for your operating system. Note that systemd is required for automatic service configuration.

    For Debian/Ubuntu systems:

    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
    sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb

    More installation options are available in the OpenTelemetry documentation.

  2. Configure OpenTelemetry Collector

    Create the collector configuration file to monitor Kafka metrics, logs, and traces:

    sudo nano /etc/otelcol-contrib/config.yaml

    Add the following configuration. It uses the kafkametrics receiver to scrape broker, topic, and consumer metrics, and the kafka receiver to read logs and traces from Kafka topics:

    receivers:
      kafka:
        protocol_version: "2.0.0" # Change this to your Kafka protocol version
        brokers:
          - "localhost:9092"
          - "broker2:9092" # Add more broker URLs as needed
          - "broker3:9092"
        # Optional: Configure authentication
        # auth:
        #   sasl:
        #     username: "kafka_user"
        #     password: "kafka_password"
        #     mechanism: "PLAIN"
      kafkametrics:
        scrapers:
          - brokers # Broker-level metrics
          - topics # Topic-level metrics
          - consumers # Consumer group metrics
        brokers:
          - "localhost:9092"
          - "broker2:9092" # Add more broker URLs
          - "broker3:9092"
        collection_interval: 60s
        # Optional: Configure authentication
        # auth:
        #   sasl:
        #     username: "kafka_user"
        #     password: "kafka_password"
        #     mechanism: "PLAIN"

    processors:
      batch:
        timeout: 15s
        send_batch_size: 10000
        send_batch_max_size: 10000
      resourcedetection/cloud:
        detectors: ["aws", "gcp", "azure"]
      resourcedetection/system:
        detectors: ["system"]
        system:
          hostname_sources: ["os"]
      transform/logs:
        flatten_data: true
        log_statements:
          - context: log
            statements:
              - set(observed_time, Now())
              - set(time_unix_nano, observed_time_unix_nano) where time_unix_nano == 0
              - set(resource.attributes["service.name"], "kafka")
              - set(resource.attributes["deployment.environment"], "production")

    exporters:
      otlp/last9:
        endpoint: "$last9_otlp_endpoint"
        headers:
          "Authorization": "$last9_otlp_auth_header"
      debug:
        verbosity: detailed

    service:
      pipelines:
        logs:
          receivers: [kafka]
          processors: [batch, resourcedetection/system, resourcedetection/cloud, transform/logs]
          exporters: [otlp/last9]
        traces:
          receivers: [kafka]
          processors: [batch, resourcedetection/system, resourcedetection/cloud]
          exporters: [otlp/last9]
        metrics:
          receivers: [kafka, kafkametrics]
          processors: [batch, resourcedetection/system, resourcedetection/cloud]
          exporters: [otlp/last9]

    Configuration Explanation:

    • kafka receiver: Consumes logs and traces from Kafka topics
    • kafkametrics receiver: Collects comprehensive Kafka cluster metrics
    • scrapers: Define which Kafka components to monitor (brokers, topics, consumers)
    • brokers: List of Kafka broker endpoints to connect to
  3. Configure Kafka Authentication (Optional)

    If your Kafka cluster requires authentication, configure the SASL settings:

    receivers:
      kafka:
        auth:
          sasl:
            username: "kafka_user"
            password: "kafka_password"
            mechanism: "PLAIN"
      kafkametrics:
        auth:
          sasl:
            username: "kafka_user"
            password: "kafka_password"
            mechanism: "PLAIN"
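
    Hard-coding credentials in config.yaml is risky. The collector supports environment-variable substitution, so a sketch like the following keeps secrets out of the file (the variable names KAFKA_SASL_USER and KAFKA_SASL_PASSWORD are illustrative; export them before starting the service):

    ```yaml
    receivers:
      kafka:
        auth:
          sasl:
            # Resolved from the environment when the collector starts
            username: "${env:KAFKA_SASL_USER}"
            password: "${env:KAFKA_SASL_PASSWORD}"
            mechanism: "PLAIN"
    ```

    With systemd, the variables can be supplied through an Environment= or EnvironmentFile= line in the service unit.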
  4. Create Systemd Service Configuration

    Create a systemd service file for the OpenTelemetry Collector:

    sudo nano /etc/systemd/system/otelcol-contrib.service

    Add the following service configuration:

    [Unit]
    Description=OpenTelemetry Collector Contrib with custom flags
    After=network.target
    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml --feature-gates transform.flatten.logs
    Restart=always
    User=root
    Group=root
    [Install]
    WantedBy=multi-user.target
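
    Running the collector as root works but grants more privilege than it needs. A hedged variant of the unit above drops to a dedicated service account (the otelcol user and the EnvironmentFile path are assumptions; create them first, e.g. sudo useradd -r otelcol):

    ```ini
    [Unit]
    Description=OpenTelemetry Collector Contrib with custom flags
    After=network.target

    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml --feature-gates transform.flatten.logs
    Restart=always
    # Unprivileged service account instead of root
    User=otelcol
    Group=otelcol
    # Optional: load SASL credentials from an env file ("-" ignores it if absent)
    EnvironmentFile=-/etc/otelcol-contrib/otelcol.env

    [Install]
    WantedBy=multi-user.target
    ```

    Make sure the otelcol user can read /etc/otelcol-contrib/config.yaml before restarting the service.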
  5. Start and Enable the Service

    Start the OpenTelemetry Collector service and enable it to start automatically:

    sudo systemctl daemon-reload
    sudo systemctl enable otelcol-contrib
    sudo systemctl start otelcol-contrib

Understanding Kafka Metrics

The Kafka integration collects comprehensive metrics across different components:

Broker Metrics

  • Message Throughput: Messages per second, bytes per second
  • Request Metrics: Request rate, request latency, queue sizes
  • Network Metrics: Network I/O, connection counts
  • Storage Metrics: Log size, log flush rate, partition count
  • JVM Metrics: Garbage collection, heap usage, thread count

Topic Metrics

  • Partition Metrics: Partition count, leader election rate
  • Message Metrics: Message rate, byte rate per topic
  • Replication Metrics: In-sync replicas, under-replicated partitions
  • Retention Metrics: Log retention size and time

Consumer Metrics

  • Consumer Group: Lag, member count, rebalance rate
  • Consumption Rate: Records consumed per second
  • Offset Management: Committed offsets, offset lag
  • Consumer Coordinator: Heartbeat rate, sync time

Producer Metrics

  • Production Rate: Records sent per second, byte rate
  • Request Metrics: Request latency, batch size, compression rate
  • Error Metrics: Failed sends, retry rate
  • Buffer Metrics: Available memory, buffer pool usage

Advanced Configuration

Topic-Specific Monitoring

Monitor specific topics only:

kafkametrics:
  scrapers:
    - topics
  # The receiver filters topics with a regex pattern, not an explicit list
  topic_match: "^(user-events|payment-transactions|audit-logs)$"

Custom Collection Intervals

The kafkametrics receiver applies a single collection_interval to all of its scrapers. To collect different components at different rates, define a named receiver instance per scraper:

receivers:
  kafkametrics/brokers:
    scrapers: [brokers]
    brokers: ["localhost:9092"]
    collection_interval: 30s # Broker metrics every 30 seconds
  kafkametrics/topics:
    scrapers: [topics]
    brokers: ["localhost:9092"]
    collection_interval: 60s # Topic metrics every minute
  kafkametrics/consumers:
    scrapers: [consumers]
    brokers: ["localhost:9092"]
    collection_interval: 15s # Consumer metrics every 15 seconds

Add each named instance to the metrics pipeline's receivers list.

Resource Attribution

Add richer metadata to the telemetry the collector emits:

transform/logs:
  log_statements:
    - context: log
      statements:
        - set(resource.attributes["service.name"], "kafka-cluster-prod")
        - set(resource.attributes["kafka.cluster.name"], "production-cluster")
        - set(resource.attributes["deployment.environment"], "production")
        - set(resource.attributes["team"], "data-platform")
        - set(resource.attributes["region"], "us-east-1")

Multi-Cluster Monitoring

Monitor multiple Kafka clusters with different configurations:

receivers:
  kafkametrics/cluster1:
    scrapers: [brokers, topics, consumers]
    brokers: ["cluster1-broker1:9092", "cluster1-broker2:9092"]
  kafkametrics/cluster2:
    scrapers: [brokers, topics, consumers]
    brokers: ["cluster2-broker1:9092", "cluster2-broker2:9092"]
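
To tell the clusters apart downstream, one option is to attach a cluster label with the resource processor and give each receiver its own metrics pipeline. A sketch (the processor names and the kafka.cluster.name attribute are illustrative):

```yaml
processors:
  resource/cluster1:
    attributes:
      - key: kafka.cluster.name
        value: "cluster1"
        action: upsert
  resource/cluster2:
    attributes:
      - key: kafka.cluster.name
        value: "cluster2"
        action: upsert

service:
  pipelines:
    metrics/cluster1:
      receivers: [kafkametrics/cluster1]
      processors: [resource/cluster1, batch]
      exporters: [otlp/last9]
    metrics/cluster2:
      receivers: [kafkametrics/cluster2]
      processors: [resource/cluster2, batch]
      exporters: [otlp/last9]
```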

Verification

  1. Check Service Status

    Verify the OpenTelemetry Collector service is running:

    sudo systemctl status otelcol-contrib
  2. Monitor Service Logs

    Check for any configuration errors or connection issues:

    sudo journalctl -u otelcol-contrib -f
  3. Test Kafka Connectivity

    Verify the collector can connect to your Kafka brokers:

    # Test Kafka broker connectivity
    telnet localhost 9092
    # Check Kafka broker status (if you have Kafka tools installed)
    kafka-broker-api-versions.sh --bootstrap-server localhost:9092
  4. Generate Kafka Activity

    Create some Kafka activity to generate metrics:

    # Create a test topic
    kafka-topics.sh --create --topic test-monitoring --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
    # Produce some test messages
    echo "test message 1" | kafka-console-producer.sh --topic test-monitoring --bootstrap-server localhost:9092
    # Consume messages
    kafka-console-consumer.sh --topic test-monitoring --bootstrap-server localhost:9092 --from-beginning
  5. Verify Data in Last9

    Log into your Last9 account and check that Kafka metrics are being received in Grafana.

    Look for metrics like:

    • kafka.brokers
    • kafka.topic.partitions
    • kafka.consumer_group.lag
    • kafka.producer.record_send_rate

Key Metrics to Monitor

Critical Performance Indicators

| Metric | Description | Alert Threshold |
|---|---|---|
| kafka.consumer_group.lag | Consumer lag behind producers | > 1000 messages |
| kafka.broker.request.produce.time.99p | 99th percentile produce latency | > 500ms |
| kafka.topic.under_replicated_partitions | Partitions without sufficient replicas | > 0 |
| kafka.broker.request.fetch.time.99p | 99th percentile fetch latency | > 500ms |
| kafka.broker.log.flush.rate | Log flush rate to disk | Sudden drops |
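
These thresholds translate naturally into alert queries. As a starting point, a PromQL-style rule for consumer lag might look like the sketch below; the exact metric name after OTLP export depends on how your pipeline translates names, so confirm it in Grafana first:

```promql
# Fire when any consumer group falls more than 1000 messages behind (metric name assumed)
max by (group, topic) (kafka_consumer_group_lag) > 1000
```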

Health Monitoring

| Metric | Description | Importance |
|---|---|---|
| kafka.broker.alive | Broker availability | Critical |
| kafka.controller.active | Active controller count | Should be 1 |
| kafka.broker.leader.election.rate | Leader election frequency | Should be low |
| kafka.consumer_group.members | Active consumer count | Track changes |

Troubleshooting

Connection Issues

Cannot Connect to Kafka Brokers:

# Check if Kafka is running
sudo systemctl status kafka
# Test network connectivity
telnet kafka-broker 9092
# Check Kafka logs
sudo journalctl -u kafka -f

Authentication Failures:

# Verify SASL configuration
auth:
  sasl:
    username: "correct_username"
    password: "correct_password"
    mechanism: "PLAIN"

Missing Metrics

No Broker Metrics:

# The kafkametrics receiver connects over the Kafka protocol, so JMX is not
# required for these metrics. Verify the "brokers" scraper is enabled and that
# the collector can reach every configured broker.
# JVM-level broker metrics (heap, GC) do require JMX on the broker, e.g.:
# JMX_PORT=9999
# export JMX_PORT

Consumer Metrics Missing:

# Verify consumer groups are active
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group your-group

High Resource Usage

Monitor and optimize collector resource usage:

# Check collector memory and CPU usage
sudo systemctl status otelcol-contrib
ps aux | grep otelcol-contrib
# Adjust batch processing settings if needed

Best Practices

Security

  • Authentication: Use SASL/SCRAM or mTLS for secure connections
  • Network Security: Restrict collector access to Kafka ports only
  • Credential Management: Store sensitive credentials in environment variables or secret management systems

Performance

  • Collection Intervals: Balance monitoring granularity with resource usage
  • Batch Processing: Use appropriate batch sizes for efficient data transmission
  • Resource Limits: Set appropriate CPU and memory limits for the collector
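
For the resource-limit point above, the collector ships a memory_limiter processor that applies backpressure before memory runs out. A sketch with illustrative limits (place it first in each pipeline's processors list, ahead of batch):

```yaml
processors:
  memory_limiter:
    check_interval: 1s # How often memory usage is checked
    limit_mib: 512 # Ceiling; data is refused above this
    spike_limit_mib: 128 # Extra headroom for short bursts
```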

Monitoring Strategy

  • Alert Setup: Configure alerts for critical metrics like consumer lag and broker health
  • Dashboard Creation: Create comprehensive dashboards for different stakeholders
  • Capacity Planning: Monitor resource utilization trends for capacity planning

Cluster Management

  • Multi-Environment: Use different service names for different environments
  • Cluster Identification: Use clear naming conventions for different clusters
  • Version Tracking: Include Kafka version information in resource attributes

Need Help?

If you encounter any issues or have questions: