Skip to content
Last9 named a Gartner Cool Vendor in AI for SRE Observability for 2025! Read more →
Last9

AWS MSK

Monitor AWS MSK (Managed Streaming for Apache Kafka) clusters with CloudWatch metrics for comprehensive streaming platform observability

Monitor your Amazon MSK (Managed Streaming for Apache Kafka) clusters with CloudWatch metrics integration. This setup provides comprehensive monitoring of managed Kafka brokers, topics, consumer groups, and overall cluster health with minimal configuration overhead.

MSK automatically publishes detailed metrics to CloudWatch, making it easier to monitor your Kafka infrastructure without managing the underlying Kafka monitoring stack.

Prerequisites

Before setting up AWS MSK monitoring, ensure you have:

  • AWS Account: With access to MSK and CloudWatch services
  • MSK Cluster: Running MSK cluster to monitor
  • CloudWatch Permissions: IAM permissions to read CloudWatch metrics
  • Monitoring Server: Where you can install and run OpenTelemetry Collector
  • Last9 Account: With metrics integration credentials
  1. Install OpenTelemetry Collector

    Install the OpenTelemetry Collector with AWS receiver support:

    For Debian/Ubuntu systems:

    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
    sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
  2. Configure AWS Credentials

    Set up AWS credentials for CloudWatch access:

    Create or update ~/.aws/credentials:

    [default]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
    region = us-east-1
  3. Create OpenTelemetry Collector Configuration

    Create the collector configuration file:

    sudo mkdir -p /etc/otelcol-contrib
    sudo nano /etc/otelcol-contrib/config.yaml

    Add the following configuration to collect MSK CloudWatch metrics:

    receivers:
    awscloudwatch:
    region: us-east-1 # Change to your AWS region
    metrics:
    # Broker-Level Metrics
    - metric_name: CpuIdle
    namespace: AWS/Kafka
    stat: [Average, Minimum]
    dimensions:
    - name: Cluster Name
    value: "*" # Monitor all MSK clusters
    - name: Broker ID
    value: "*"
    - metric_name: CpuSystem
    namespace: AWS/Kafka
    stat: [Average, Maximum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: CpuUser
    namespace: AWS/Kafka
    stat: [Average, Maximum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: KafkaDataLogsDiskUsed
    namespace: AWS/Kafka
    stat: [Average, Maximum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: MemoryFree
    namespace: AWS/Kafka
    stat: [Average, Minimum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: MemoryUsed
    namespace: AWS/Kafka
    stat: [Average, Maximum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    # Network Metrics
    - metric_name: NetworkRxDropped
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: NetworkRxErrors
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: NetworkRxPackets
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: NetworkTxDropped
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: NetworkTxErrors
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    - metric_name: NetworkTxPackets
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Broker ID
    value: "*"
    # Topic-Level Metrics
    - metric_name: BytesInPerSec
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Topic
    value: "*"
    - metric_name: BytesOutPerSec
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Topic
    value: "*"
    - metric_name: MessagesInPerSec
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Topic
    value: "*"
    # Consumer Group Metrics
    - metric_name: ConsumerLag
    namespace: AWS/Kafka
    stat: [Sum, Average, Maximum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Consumer Group
    value: "*"
    - name: Topic
    value: "*"
    - name: Partition
    value: "*"
    - metric_name: OffsetLag
    namespace: AWS/Kafka
    stat: [Sum, Average, Maximum]
    dimensions:
    - name: Cluster Name
    value: "*"
    - name: Consumer Group
    value: "*"
    # Cluster-Level Aggregated Metrics
    - metric_name: GlobalTopicCount
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    - metric_name: GlobalPartitionCount
    namespace: AWS/Kafka
    stat: [Sum, Average]
    dimensions:
    - name: Cluster Name
    value: "*"
    collection_interval: 300s # 5 minutes (CloudWatch default)
    processors:
    batch:
    timeout: 30s
    send_batch_size: 10000
    send_batch_max_size: 10000
    resourcedetection/cloud:
    detectors: ["aws"]
    transform/metrics:
    metric_statements:
    - context: metric
    statements:
    - set(resource.attributes["service.name"], "aws-msk")
    - set(resource.attributes["deployment.environment"], "production")
    exporters:
    prometheusremotewrite:
    endpoint: "$last9_remote_write_url"
    auth:
    authenticator: basicauth/metrics
    resource_to_telemetry_conversion:
    enabled: true
    debug:
    verbosity: detailed
    extensions:
    basicauth/metrics:
    client_auth:
    username: "$last9_remote_write_username"
    password: "$last9_remote_write_password"
    service:
    extensions: [basicauth/metrics]
    pipelines:
    metrics:
    receivers: [awscloudwatch]
    processors: [batch, resourcedetection/cloud, transform/metrics]
    exporters: [prometheusremotewrite]
  4. Configure Specific Clusters (Optional)

    To monitor specific MSK clusters instead of all clusters, modify the dimensions:

    receivers:
    awscloudwatch:
    region: us-east-1
    metrics:
    - metric_name: CpuIdle
    namespace: AWS/Kafka
    stat: [Average, Minimum]
    dimensions:
    - name: Cluster Name
    value: "production-kafka" # Specific cluster
    - name: Broker ID
    value: "*"
  5. Enable Enhanced Monitoring (Optional)

    For more detailed metrics, enable enhanced monitoring on your MSK cluster:

    # Enable enhanced monitoring via AWS CLI
    aws kafka put-cluster-policy \
    --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/production-kafka/uuid \
    --current-version 1 \
    --policy '{
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Principal": {
    "Service": "cloudwatch.amazonaws.com"
    },
    "Action": [
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents"
    ],
    "Resource": "*"
    }
    ]
    }'
  6. Create Systemd Service Configuration

    Create a systemd service file:

    sudo nano /etc/systemd/system/otelcol-contrib.service

    Add the service configuration:

    [Unit]
    Description=OpenTelemetry Collector for AWS MSK Monitoring
    After=network.target
    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
    Restart=always
    User=root
    Group=root
    Environment=AWS_REGION=us-east-1
    [Install]
    WantedBy=multi-user.target
  7. Start and Enable the Service

    Start the OpenTelemetry Collector service:

    sudo systemctl daemon-reload
    sudo systemctl enable otelcol-contrib
    sudo systemctl start otelcol-contrib

Understanding MSK Metrics

The AWS MSK integration collects comprehensive CloudWatch metrics:

Broker Performance Metrics

  • CPU Utilization: User, system, and idle CPU percentages per broker
  • Memory Usage: Free and used memory across MSK brokers
  • Disk Usage: Kafka data logs disk utilization
  • Network Activity: Packet transmission, errors, and dropped packets

Topic and Partition Metrics

  • Throughput: Bytes and messages in/out per second per topic
  • Partition Distribution: Partition counts and distribution across brokers
  • Replication: Under-replicated partitions and replica lag
  • Topic Growth: Topic creation and partition scaling metrics

Consumer Group Metrics

  • Consumer Lag: Lag between latest offset and consumer position
  • Offset Management: Consumer offset commits and lag per partition
  • Consumer Activity: Active consumer group monitoring
  • Processing Rate: Messages consumed per second per group

Cluster Health Metrics

  • Broker Health: Broker availability and status
  • Cluster Capacity: Global topic and partition counts
  • Leader Elections: Partition leadership changes
  • Connection Health: Client connection metrics

Advanced Configuration

Multi-Cluster Monitoring

Monitor multiple MSK clusters with different configurations:

receivers:
awscloudwatch/production:
region: us-east-1
metrics:
- metric_name: CpuIdle
namespace: AWS/Kafka
dimensions:
- name: Cluster Name
value: "production-kafka"
awscloudwatch/staging:
region: us-east-1
metrics:
- metric_name: CpuIdle
namespace: AWS/Kafka
dimensions:
- name: Cluster Name
value: "staging-kafka"
service:
pipelines:
metrics:
receivers: [awscloudwatch/production, awscloudwatch/staging]

Topic-Specific Monitoring

Monitor specific high-value topics:

receivers:
awscloudwatch:
metrics:
- metric_name: BytesInPerSec
namespace: AWS/Kafka
stat: [Sum, Average]
dimensions:
- name: Cluster Name
value: "production-kafka"
- name: Topic
value: "user-events" # Specific topic
- metric_name: ConsumerLag
namespace: AWS/Kafka
stat: [Sum, Maximum]
dimensions:
- name: Cluster Name
value: "production-kafka"
- name: Topic
value: "user-events"
- name: Consumer Group
value: "analytics-service"

Cross-Region Replication Monitoring

For MSK clusters with cross-region replication:

receivers:
awscloudwatch/primary:
region: us-east-1
metrics:
- metric_name: ReplicationBytesInPerSec
namespace: AWS/Kafka
stat: [Sum, Average]
awscloudwatch/replica:
region: us-west-2
metrics:
- metric_name: ReplicationBytesOutPerSec
namespace: AWS/Kafka
stat: [Sum, Average]

Verification

  1. Check Service Status

    Verify the OpenTelemetry Collector is running:

    sudo systemctl status otelcol-contrib
  2. Monitor Service Logs

    Check for any configuration errors:

    sudo journalctl -u otelcol-contrib -f
  3. Verify AWS Connectivity

    Test AWS API access:

    aws kafka list-clusters --region us-east-1
    aws cloudwatch list-metrics --namespace AWS/Kafka --region us-east-1
  4. Generate MSK Activity

    Create some Kafka activity to generate metrics:

    # List MSK cluster endpoints
    aws kafka get-bootstrap-brokers --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/test/uuid
    # Create a test topic (if you have Kafka tools configured)
    kafka-topics.sh --create --topic test-monitoring --bootstrap-server your-msk-endpoint:9092 --partitions 3 --replication-factor 2
    # Produce test messages
    echo "test message" | kafka-console-producer.sh --topic test-monitoring --bootstrap-server your-msk-endpoint:9092
    # Consume messages
    kafka-console-consumer.sh --topic test-monitoring --bootstrap-server your-msk-endpoint:9092 --from-beginning
  5. Verify Metrics in Last9

    Log into your Last9 account and check that MSK metrics are being received in Grafana.

    Look for metrics like:

    • CpuIdle, CpuSystem, CpuUser
    • BytesInPerSec, BytesOutPerSec
    • ConsumerLag, OffsetLag
    • MemoryUsed, KafkaDataLogsDiskUsed

Key Metrics to Monitor

Critical Performance Indicators

MetricDescriptionAlert Threshold
ConsumerLagConsumer lag behind producers> 1000 messages for critical topics
CpuUserUser CPU utilization per broker> 80% sustained
MemoryUsedMemory usage per broker> 80% of available
KafkaDataLogsDiskUsedDisk usage for Kafka logs> 80% of allocated storage

Throughput Monitoring

MetricDescriptionMonitoring Focus
BytesInPerSecData ingestion rate per topicTrack load patterns
BytesOutPerSecData consumption rate per topicMonitor consumer activity
MessagesInPerSecMessage ingestion rateTrack message volume
NetworkRxPacketsNetwork receive packet rateNetwork performance

Cluster Health

MetricDescriptionAlert Condition
NetworkRxErrorsNetwork receive errors> 0 errors
NetworkTxErrorsNetwork transmit errors> 0 errors
GlobalTopicCountTotal topics in clusterUnexpected changes
GlobalPartitionCountTotal partitions in clusterMonitor for scaling

Troubleshooting

CloudWatch API Issues

Permission Denied:

# Verify AWS credentials
aws sts get-caller-identity
# Test MSK access
aws kafka list-clusters --region us-east-1
# Check CloudWatch permissions
aws cloudwatch list-metrics --namespace AWS/Kafka --region us-east-1 | head -10

Rate Limiting:

# Adjust collection interval to reduce API calls
receivers:
awscloudwatch:
collection_interval: 600s # 10 minutes instead of 5

Missing Metrics

No MSK Metrics:

# Verify MSK clusters exist
aws kafka list-clusters --region us-east-1
# Check if cluster is in correct state
aws kafka describe-cluster --cluster-arn arn:aws:kafka:region:account:cluster/name/uuid
# Verify enhanced monitoring is enabled if needed
aws kafka describe-cluster --cluster-arn YOUR_CLUSTER_ARN --query 'ClusterInfo.EnhancedMonitoring'

Consumer Lag Metrics Missing:

# Verify consumer groups are active
# You'll need Kafka client tools configured with MSK endpoint
# List consumer groups
kafka-consumer-groups.sh --bootstrap-server your-msk-endpoint:9092 --list
# Check consumer group lag
kafka-consumer-groups.sh --bootstrap-server your-msk-endpoint:9092 --describe --group your-consumer-group

High Resource Usage

Excessive CloudWatch API Calls:

  • Increase collection intervals for less critical metrics
  • Use metric filtering to collect only necessary metrics
  • Monitor CloudWatch API costs and usage

MSK Cluster Performance Issues:

# Check MSK cluster configuration
aws kafka describe-cluster-v2 --cluster-arn YOUR_CLUSTER_ARN
# Monitor broker instance types and scaling
aws kafka describe-configuration --arn CONFIGURATION_ARN

Best Practices

Security

  • IAM Roles: Use IAM roles instead of access keys when running on EC2
  • VPC Security: Deploy monitoring in the same VPC as MSK clusters
  • Encryption: Ensure MSK encryption in transit and at rest is enabled
  • Network Access: Restrict collector access to MSK clusters using security groups

Performance

  • Collection Intervals: Balance monitoring granularity with CloudWatch costs
  • Metric Selection: Focus on business-critical topics and consumer groups
  • Regional Optimization: Deploy collectors in the same region as MSK clusters
  • Batch Processing: Use appropriate batch sizes for metric collection

Monitoring Strategy

  • Consumer Lag Alerting: Set up alerts for critical consumer lag thresholds
  • Broker Health: Monitor CPU, memory, and disk usage across all brokers
  • Topic-Level Monitoring: Track throughput and growth for important topics
  • Capacity Planning: Monitor resource utilization trends for scaling decisions

Cost Optimization

  • Selective Monitoring: Monitor only necessary topics and consumer groups
  • Metric Retention: Use appropriate CloudWatch metric retention policies
  • Collection Frequency: Adjust collection intervals based on criticality
  • Resource Tagging: Use MSK resource tags for cost allocation and filtering

Need Help?

If you encounter any issues or have questions: