Apache Kafka
Monitor Apache Kafka clusters with OpenTelemetry for comprehensive streaming platform observability
Use OpenTelemetry to monitor self-managed Apache Kafka clusters and send telemetry data to Last9. This integration provides comprehensive monitoring of Kafka brokers, topics, consumers, producers, and message flow across your streaming platform.
Prerequisites
Before setting up Kafka monitoring, ensure you have:
- Kafka Cluster: A running Apache Kafka cluster (version 2.0.0 or higher)
- Monitoring Server: A virtual machine or container that can run the OpenTelemetry Collector
- Network Access: The collector must be able to reach both the Kafka brokers and the Last9 endpoints
- Administrative Access: Permission to install and configure monitoring components
- Last9 Account: An account with integration credentials
Install OpenTelemetry Collector
Choose the appropriate package for your operating system. Note that systemd is required for automatic service configuration.
For Debian/Ubuntu systems:
```shell
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
```
For Red Hat/CentOS systems:
```shell
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
```
More installation options are available in the OpenTelemetry documentation.
Configure OpenTelemetry Collector
Create the collector configuration file to monitor Kafka metrics, logs, and traces:
```shell
sudo nano /etc/otelcol-contrib/config.yaml
```
Add the following configuration. This integration uses both the `kafkametrics` receiver for broker metrics and the `kafka` receiver for message processing:
```yaml
receivers:
  kafka:
    protocol_version: "2.0.0" # Change this to your Kafka protocol version
    brokers:
      - "localhost:9092"
      - "broker2:9092" # Add more broker URLs as needed
      - "broker3:9092"
    # Optional: Configure authentication
    # auth:
    #   sasl:
    #     username: "kafka_user"
    #     password: "kafka_password"
    #     mechanism: "PLAIN"
  kafkametrics:
    scrapers:
      - brokers   # Broker-level metrics
      - topics    # Topic-level metrics
      - consumers # Consumer group metrics
    brokers:
      - "localhost:9092"
      - "broker2:9092" # Add more broker URLs
      - "broker3:9092"
    collection_interval: 60s
    # Optional: Configure authentication
    # auth:
    #   sasl:
    #     username: "kafka_user"
    #     password: "kafka_password"
    #     mechanism: "PLAIN"

processors:
  batch:
    timeout: 15s
    send_batch_size: 10000
    send_batch_max_size: 10000
  resourcedetection/cloud:
    detectors: ["aws", "gcp", "azure"]
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]
  transform/logs:
    flatten_data: true
    log_statements:
      - context: log
        statements:
          - set(observed_time, Now())
          - set(time_unix_nano, observed_time_unix_nano) where time_unix_nano == 0
          - set(resource.attributes["service.name"], "kafka")
          - set(resource.attributes["deployment.environment"], "production")

exporters:
  otlp/last9:
    endpoint: "$last9_otlp_endpoint"
    headers:
      "Authorization": "$last9_otlp_auth_header"
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors:
        [
          batch,
          resourcedetection/system,
          resourcedetection/cloud,
          transform/logs,
        ]
      exporters: [otlp/last9]
    traces:
      receivers: [kafka]
      processors: [batch, resourcedetection/system, resourcedetection/cloud]
      exporters: [otlp/last9]
    metrics:
      receivers: [kafka, kafkametrics]
      processors: [batch, resourcedetection/system, resourcedetection/cloud]
      exporters: [otlp/last9]
```
Configuration Explanation:
- `kafka` receiver: Collects logs and traces from Kafka message processing
- `kafkametrics` receiver: Collects comprehensive Kafka cluster metrics
- `scrapers`: Define which Kafka components to monitor (brokers, topics, consumers)
- `brokers`: List of Kafka broker endpoints to connect to
Configure Kafka Authentication (Optional)
If your Kafka cluster requires authentication, configure the SASL settings:
For SASL/PLAIN:
```yaml
receivers:
  kafka:
    auth:
      sasl:
        username: "kafka_user"
        password: "kafka_password"
        mechanism: "PLAIN"
  kafkametrics:
    auth:
      sasl:
        username: "kafka_user"
        password: "kafka_password"
        mechanism: "PLAIN"
```
For SASL/SCRAM-SHA-256:
```yaml
receivers:
  kafka:
    auth:
      sasl:
        username: "kafka_user"
        password: "kafka_password"
        mechanism: "SCRAM-SHA-256"
  kafkametrics:
    auth:
      sasl:
        username: "kafka_user"
        password: "kafka_password"
        mechanism: "SCRAM-SHA-256"
```
For mutual TLS:
```yaml
receivers:
  kafka:
    auth:
      tls:
        insecure: false
        ca_file: "/path/to/ca.pem"
        cert_file: "/path/to/client.pem"
        key_file: "/path/to/client-key.pem"
  kafkametrics:
    auth:
      tls:
        insecure: false
        ca_file: "/path/to/ca.pem"
        cert_file: "/path/to/client.pem"
        key_file: "/path/to/client-key.pem"
```
Create Systemd Service Configuration
Create a systemd service file for the OpenTelemetry Collector:
```shell
sudo nano /etc/systemd/system/otelcol-contrib.service
```
Add the following service configuration:
```ini
[Unit]
Description=OpenTelemetry Collector Contrib with custom flags
After=network.target

[Service]
ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml --feature-gates transform.flatten.logs
Restart=always
User=root
Group=root

[Install]
WantedBy=multi-user.target
```
Start and Enable the Service
Start the OpenTelemetry Collector service and enable it to start automatically:
```shell
sudo systemctl daemon-reload
sudo systemctl enable otelcol-contrib
sudo systemctl start otelcol-contrib
```
Understanding Kafka Metrics
The Kafka integration collects comprehensive metrics across different components:
Broker Metrics
- Message Throughput: Messages per second, bytes per second
- Request Metrics: Request rate, request latency, queue sizes
- Network Metrics: Network I/O, connection counts
- Storage Metrics: Log size, log flush rate, partition count
- JVM Metrics: Garbage collection, heap usage, thread count
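The `kafkametrics` receiver gathers its metrics over the Kafka protocol, while JVM-level metrics such as garbage collection and heap usage are exposed by the broker through JMX. One way to collect them is the collector's `jmx` receiver; the sketch below assumes the broker was started with `JMX_PORT=9999` and that the OpenTelemetry JMX metrics gatherer jar has been downloaded to the path shown (both are assumptions to adapt for your environment):

```yaml
receivers:
  jmx:
    # Path to the OpenTelemetry JMX metrics gatherer jar (downloaded separately)
    jar_path: /opt/opentelemetry-jmx-metrics.jar
    # Broker JMX endpoint; assumes JMX_PORT=9999 on the broker
    endpoint: localhost:9999
    # Collect both JVM and Kafka broker JMX metric sets
    target_system: jvm,kafka
    collection_interval: 60s
```

Add the `jmx` receiver to the metrics pipeline alongside `kafka` and `kafkametrics` if you enable it.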
Topic Metrics
- Partition Metrics: Partition count, leader election rate
- Message Metrics: Message rate, byte rate per topic
- Replication Metrics: In-sync replicas, under-replicated partitions
- Retention Metrics: Log retention size and time
Consumer Metrics
- Consumer Group: Lag, member count, rebalance rate
- Consumption Rate: Records consumed per second
- Offset Management: Committed offsets, offset lag
- Consumer Coordinator: Heartbeat rate, sync time
Producer Metrics
- Production Rate: Records sent per second, byte rate
- Request Metrics: Request latency, batch size, compression rate
- Error Metrics: Failed sends, retry rate
- Buffer Metrics: Available memory, buffer pool usage
Advanced Configuration
Topic-Specific Monitoring
Monitor specific topics only:
```yaml
kafkametrics:
  scrapers:
    - topics
  topics:
    - "user-events"
    - "payment-transactions"
    - "audit-logs"
```
Custom Collection Intervals
Configure different collection intervals for different metrics:
```yaml
kafkametrics:
  collection_interval: 30s # General metrics every 30 seconds
  scrapers:
    - brokers
    - topics:
        collection_interval: 60s # Topic metrics every minute
    - consumers:
        collection_interval: 15s # Consumer metrics every 15 seconds
```
Resource Attribution
Add comprehensive metadata to metrics:
```yaml
transform/logs:
  log_statements:
    - context: log
      statements:
        - set(resource.attributes["service.name"], "kafka-cluster-prod")
        - set(resource.attributes["kafka.cluster.name"], "production-cluster")
        - set(resource.attributes["deployment.environment"], "production")
        - set(resource.attributes["team"], "data-platform")
        - set(resource.attributes["region"], "us-east-1")
```
Multi-Cluster Monitoring
Monitor multiple Kafka clusters with different configurations:
```yaml
receivers:
  kafkametrics/cluster1:
    scrapers: [brokers, topics, consumers]
    brokers: ["cluster1-broker1:9092", "cluster1-broker2:9092"]
  kafkametrics/cluster2:
    scrapers: [brokers, topics, consumers]
    brokers: ["cluster2-broker1:9092", "cluster2-broker2:9092"]
```
Verification
Check Service Status
Verify the OpenTelemetry Collector service is running:
```shell
sudo systemctl status otelcol-contrib
```
Monitor Service Logs
Check for any configuration errors or connection issues:
```shell
sudo journalctl -u otelcol-contrib -f
```
Test Kafka Connectivity
Verify the collector can connect to your Kafka brokers:
```shell
# Test Kafka broker connectivity
telnet localhost 9092

# Check Kafka broker status (if you have Kafka tools installed)
kafka-broker-api-versions.sh --bootstrap-server localhost:9092
```
Generate Kafka Activity
Create some Kafka activity to generate metrics:
```shell
# Create a test topic
kafka-topics.sh --create --topic test-monitoring --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

# Produce some test messages
echo "test message 1" | kafka-console-producer.sh --topic test-monitoring --bootstrap-server localhost:9092

# Consume messages
kafka-console-consumer.sh --topic test-monitoring --bootstrap-server localhost:9092 --from-beginning
```
Verify Data in Last9
Log in to your Last9 account and confirm in Grafana that Kafka metrics are being received.
Look for metrics like:
- `kafka.brokers`
- `kafka.topic.partitions`
- `kafka.consumer_group.lag`
- `kafka.producer.record_send_rate`
Key Metrics to Monitor
Critical Performance Indicators
| Metric | Description | Alert Threshold |
|---|---|---|
| `kafka.consumer_group.lag` | Consumer lag behind producers | > 1000 messages |
| `kafka.broker.request.produce.time.99p` | 99th percentile produce latency | > 500ms |
| `kafka.topic.under_replicated_partitions` | Partitions without sufficient replicas | > 0 |
| `kafka.broker.request.fetch.time.99p` | 99th percentile fetch latency | > 500ms |
| `kafka.broker.log.flush.rate` | Log flush rate to disk | Sudden drops |
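The thresholds above can be wired into alert rules in Last9 or any PromQL-compatible backend. The sketch below uses the Prometheus alerting-rule format; the metric names are assumptions, since OTLP metric names like `kafka.consumer_group.lag` are typically translated (dots to underscores) on ingestion, so verify the exact names against your stored data before deploying:

```yaml
groups:
  - name: kafka-critical
    rules:
      - alert: KafkaConsumerLagHigh
        # Assumes kafka.consumer_group.lag is stored as kafka_consumer_group_lag
        expr: kafka_consumer_group_lag > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group lag above 1000 messages"
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_topic_under_replicated_partitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "One or more partitions are under-replicated"
```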
Health Monitoring
| Metric | Description | Importance |
|---|---|---|
| `kafka.broker.alive` | Broker availability | Critical |
| `kafka.controller.active` | Active controller count | Should be 1 |
| `kafka.broker.leader.election.rate` | Leader election frequency | Should be low |
| `kafka.consumer_group.members` | Active consumer count | Track changes |
Troubleshooting
Connection Issues
Cannot Connect to Kafka Brokers:
```shell
# Check if Kafka is running
sudo systemctl status kafka

# Test network connectivity
telnet kafka-broker 9092

# Check Kafka logs
sudo journalctl -u kafka -f
```
Authentication Failures:
```yaml
# Verify SASL configuration
auth:
  sasl:
    username: "correct_username"
    password: "correct_password"
    mechanism: "PLAIN"
```
Missing Metrics
No Broker Metrics:
```shell
# Check if JMX is enabled on Kafka brokers
# Add to Kafka broker configuration:
# JMX_PORT=9999
# export JMX_PORT
```
Consumer Metrics Missing:
```shell
# Verify consumer groups are active
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group your-group
```
High Resource Usage
Monitor and optimize collector resource usage:
```shell
# Check collector memory and CPU usage
sudo systemctl status otelcol-contrib
ps aux | grep otelcol-contrib

# Adjust batch processing settings if needed
```
Best Practices
Security
- Authentication: Use SASL/SCRAM or mTLS for secure connections
- Network Security: Restrict collector access to Kafka ports only
- Credential Management: Store sensitive credentials in environment variables or secret management systems
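For credential management, the collector supports environment-variable substitution in its configuration, so SASL secrets never need to appear in the file itself. A minimal sketch (the variable names are placeholders you define in the collector's environment):

```yaml
receivers:
  kafka:
    auth:
      sasl:
        # Values are read from the collector's environment at startup
        username: ${env:KAFKA_SASL_USERNAME}
        password: ${env:KAFKA_SASL_PASSWORD}
        mechanism: "SCRAM-SHA-256"
```

With the systemd unit from earlier, these variables can be supplied via an `Environment=` or `EnvironmentFile=` directive in the `[Service]` section.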
Performance
- Collection Intervals: Balance monitoring granularity with resource usage
- Batch Processing: Use appropriate batch sizes for efficient data transmission
- Resource Limits: Set appropriate CPU and memory limits for the collector
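Resource limits can also be enforced inside the collector with the `memory_limiter` processor, which must run first in each pipeline so it can apply backpressure before other processors buffer data. The limits below are illustrative values, not recommendations:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512        # Hard memory cap for the collector
    spike_limit_mib: 128  # Headroom reserved for short spikes

service:
  pipelines:
    metrics:
      receivers: [kafka, kafkametrics]
      # memory_limiter should be the first processor in the pipeline
      processors: [memory_limiter, batch]
      exporters: [otlp/last9]
```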
Monitoring Strategy
- Alert Setup: Configure alerts for critical metrics like consumer lag and broker health
- Dashboard Creation: Create comprehensive dashboards for different stakeholders
- Capacity Planning: Monitor resource utilization trends for capacity planning
Cluster Management
- Multi-Environment: Use different service names for different environments
- Cluster Identification: Use clear naming conventions for different clusters
- Version Tracking: Include Kafka version information in resource attributes
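One way to attach cluster and version identification is the `resource` processor, which stamps every signal passing through the pipeline with fixed attributes. The attribute values below are examples to replace with your own:

```yaml
processors:
  resource:
    attributes:
      - key: kafka.cluster.name
        value: "production-cluster"
        action: upsert
      - key: kafka.version
        value: "3.6.0" # Replace with your broker version
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
```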
Need Help?
If you encounter any issues or have questions:
- Join our Discord community for real-time support
- Contact our support team at support@last9.io