AWS MSK
Monitor AWS MSK (Managed Streaming for Apache Kafka) clusters with CloudWatch metrics for comprehensive streaming platform observability
Monitor your Amazon MSK (Managed Streaming for Apache Kafka) clusters with CloudWatch metrics integration. This setup provides comprehensive monitoring of managed Kafka brokers, topics, consumer groups, and overall cluster health with minimal configuration overhead.
MSK automatically publishes detailed metrics to CloudWatch, making it easier to monitor your Kafka infrastructure without managing the underlying Kafka monitoring stack.
Prerequisites
Before setting up AWS MSK monitoring, ensure you have:
- AWS Account: With access to MSK and CloudWatch services
- MSK Cluster: Running MSK cluster to monitor
- CloudWatch Permissions: IAM permissions to read CloudWatch metrics
- Monitoring Server: Where you can install and run OpenTelemetry Collector
- Last9 Account: With metrics integration credentials
-
Install OpenTelemetry Collector
Install the OpenTelemetry Collector with AWS receiver support:
For Debian/Ubuntu systems:
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.debsudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.debFor Red Hat/CentOS systems:
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpmsudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm -
Configure AWS Credentials
Set up AWS credentials for CloudWatch access:
Create or update
~/.aws/credentials:[default]aws_access_key_id = YOUR_ACCESS_KEY_IDaws_secret_access_key = YOUR_SECRET_ACCESS_KEYregion = us-east-1Set environment variables:
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_IDexport AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEYexport AWS_REGION=us-east-1If running on EC2, attach an IAM role with the following policy:
{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["cloudwatch:GetMetricStatistics","cloudwatch:ListMetrics","kafka:ListClusters","kafka:DescribeCluster","kafka:ListNodes"],"Resource": "*"}]} -
Create OpenTelemetry Collector Configuration
Create the collector configuration file:
sudo mkdir -p /etc/otelcol-contribsudo nano /etc/otelcol-contrib/config.yamlAdd the following configuration to collect MSK CloudWatch metrics:
receivers:awscloudwatch:region: us-east-1 # Change to your AWS regionmetrics:# Broker-Level Metrics- metric_name: CpuIdlenamespace: AWS/Kafkastat: [Average, Minimum]dimensions:- name: Cluster Namevalue: "*" # Monitor all MSK clusters- name: Broker IDvalue: "*"- metric_name: CpuSystemnamespace: AWS/Kafkastat: [Average, Maximum]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: CpuUsernamespace: AWS/Kafkastat: [Average, Maximum]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: KafkaDataLogsDiskUsednamespace: AWS/Kafkastat: [Average, Maximum]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: MemoryFreenamespace: AWS/Kafkastat: [Average, Minimum]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: MemoryUsednamespace: AWS/Kafkastat: [Average, Maximum]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"# Network Metrics- metric_name: NetworkRxDroppednamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: NetworkRxErrorsnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: NetworkRxPacketsnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: NetworkTxDroppednamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: NetworkTxErrorsnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"- metric_name: NetworkTxPacketsnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Broker IDvalue: "*"# Topic-Level Metrics- metric_name: BytesInPerSecnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Topicvalue: "*"- metric_name: BytesOutPerSecnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Topicvalue: "*"- metric_name: MessagesInPerSecnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- name: Topicvalue: "*"# Consumer Group Metrics- metric_name: ConsumerLagnamespace: AWS/Kafkastat: [Sum, Average, Maximum]dimensions:- name: Cluster Namevalue: "*"- name: Consumer Groupvalue: "*"- name: Topicvalue: "*"- name: Partitionvalue: "*"- metric_name: OffsetLagnamespace: AWS/Kafkastat: [Sum, Average, Maximum]dimensions:- name: Cluster Namevalue: "*"- name: Consumer Groupvalue: "*"# Cluster-Level Aggregated Metrics- metric_name: GlobalTopicCountnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"- metric_name: GlobalPartitionCountnamespace: AWS/Kafkastat: [Sum, Average]dimensions:- name: Cluster Namevalue: "*"collection_interval: 300s # 5 minutes (CloudWatch default)processors:batch:timeout: 30ssend_batch_size: 10000send_batch_max_size: 10000resourcedetection/cloud:detectors: ["aws"]transform/metrics:metric_statements:- context: metricstatements:- set(resource.attributes["service.name"], "aws-msk")- set(resource.attributes["deployment.environment"], "production")exporters:prometheusremotewrite:endpoint: "$last9_remote_write_url"auth:authenticator: basicauth/metricsresource_to_telemetry_conversion:enabled: truedebug:verbosity: detailedextensions:basicauth/metrics:client_auth:username: "$last9_remote_write_username"password: "$last9_remote_write_password"service:extensions: [basicauth/metrics]pipelines:metrics:receivers: [awscloudwatch]processors: [batch, resourcedetection/cloud, transform/metrics]exporters: [prometheusremotewrite] -
Configure Specific Clusters (Optional)
To monitor specific MSK clusters instead of all clusters, modify the dimensions:
receivers:awscloudwatch:region: us-east-1metrics:- metric_name: CpuIdlenamespace: AWS/Kafkastat: [Average, Minimum]dimensions:- name: Cluster Namevalue: "production-kafka" # Specific cluster- name: Broker IDvalue: "*" -
Enable Enhanced Monitoring (Optional)
For more detailed metrics, enable enhanced monitoring on your MSK cluster:
# Enable enhanced monitoring via AWS CLIaws kafka put-cluster-policy \--cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/production-kafka/uuid \--current-version 1 \--policy '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "cloudwatch.amazonaws.com"},"Action": ["logs:CreateLogGroup","logs:CreateLogStream","logs:PutLogEvents"],"Resource": "*"}]}' -
Create Systemd Service Configuration
Create a systemd service file:
sudo nano /etc/systemd/system/otelcol-contrib.serviceAdd the service configuration:
[Unit]Description=OpenTelemetry Collector for AWS MSK MonitoringAfter=network.target[Service]ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yamlRestart=alwaysUser=rootGroup=rootEnvironment=AWS_REGION=us-east-1[Install]WantedBy=multi-user.target -
Start and Enable the Service
Start the OpenTelemetry Collector service:
sudo systemctl daemon-reloadsudo systemctl enable otelcol-contribsudo systemctl start otelcol-contrib
Understanding MSK Metrics
The AWS MSK integration collects comprehensive CloudWatch metrics:
Broker Performance Metrics
- CPU Utilization: User, system, and idle CPU percentages per broker
- Memory Usage: Free and used memory across MSK brokers
- Disk Usage: Kafka data logs disk utilization
- Network Activity: Packet transmission, errors, and dropped packets
Topic and Partition Metrics
- Throughput: Bytes and messages in/out per second per topic
- Partition Distribution: Partition counts and distribution across brokers
- Replication: Under-replicated partitions and replica lag
- Topic Growth: Topic creation and partition scaling metrics
Consumer Group Metrics
- Consumer Lag: Lag between latest offset and consumer position
- Offset Management: Consumer offset commits and lag per partition
- Consumer Activity: Active consumer group monitoring
- Processing Rate: Messages consumed per second per group
Cluster Health Metrics
- Broker Health: Broker availability and status
- Cluster Capacity: Global topic and partition counts
- Leader Elections: Partition leadership changes
- Connection Health: Client connection metrics
Advanced Configuration
Multi-Cluster Monitoring
Monitor multiple MSK clusters with different configurations:
receivers: awscloudwatch/production: region: us-east-1 metrics: - metric_name: CpuIdle namespace: AWS/Kafka dimensions: - name: Cluster Name value: "production-kafka" awscloudwatch/staging: region: us-east-1 metrics: - metric_name: CpuIdle namespace: AWS/Kafka dimensions: - name: Cluster Name value: "staging-kafka"
service: pipelines: metrics: receivers: [awscloudwatch/production, awscloudwatch/staging]Topic-Specific Monitoring
Monitor specific high-value topics:
receivers: awscloudwatch: metrics: - metric_name: BytesInPerSec namespace: AWS/Kafka stat: [Sum, Average] dimensions: - name: Cluster Name value: "production-kafka" - name: Topic value: "user-events" # Specific topic - metric_name: ConsumerLag namespace: AWS/Kafka stat: [Sum, Maximum] dimensions: - name: Cluster Name value: "production-kafka" - name: Topic value: "user-events" - name: Consumer Group value: "analytics-service"Cross-Region Replication Monitoring
For MSK clusters with cross-region replication:
receivers: awscloudwatch/primary: region: us-east-1 metrics: - metric_name: ReplicationBytesInPerSec namespace: AWS/Kafka stat: [Sum, Average] awscloudwatch/replica: region: us-west-2 metrics: - metric_name: ReplicationBytesOutPerSec namespace: AWS/Kafka stat: [Sum, Average]Verification
-
Check Service Status
Verify the OpenTelemetry Collector is running:
sudo systemctl status otelcol-contrib -
Monitor Service Logs
Check for any configuration errors:
sudo journalctl -u otelcol-contrib -f -
Verify AWS Connectivity
Test AWS API access:
aws kafka list-clusters --region us-east-1aws cloudwatch list-metrics --namespace AWS/Kafka --region us-east-1 -
Generate MSK Activity
Create some Kafka activity to generate metrics:
# List MSK cluster endpointsaws kafka get-bootstrap-brokers --cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/test/uuid# Create a test topic (if you have Kafka tools configured)kafka-topics.sh --create --topic test-monitoring --bootstrap-server your-msk-endpoint:9092 --partitions 3 --replication-factor 2# Produce test messagesecho "test message" | kafka-console-producer.sh --topic test-monitoring --bootstrap-server your-msk-endpoint:9092# Consume messageskafka-console-consumer.sh --topic test-monitoring --bootstrap-server your-msk-endpoint:9092 --from-beginning -
Verify Metrics in Last9
Log into your Last9 account and check that MSK metrics are being received in Grafana.
Look for metrics like:
CpuIdle,CpuSystem,CpuUserBytesInPerSec,BytesOutPerSecConsumerLag,OffsetLagMemoryUsed,KafkaDataLogsDiskUsed
Key Metrics to Monitor
Critical Performance Indicators
| Metric | Description | Alert Threshold |
|---|---|---|
ConsumerLag | Consumer lag behind producers | > 1000 messages for critical topics |
CpuUser | User CPU utilization per broker | > 80% sustained |
MemoryUsed | Memory usage per broker | > 80% of available |
KafkaDataLogsDiskUsed | Disk usage for Kafka logs | > 80% of allocated storage |
Throughput Monitoring
| Metric | Description | Monitoring Focus |
|---|---|---|
BytesInPerSec | Data ingestion rate per topic | Track load patterns |
BytesOutPerSec | Data consumption rate per topic | Monitor consumer activity |
MessagesInPerSec | Message ingestion rate | Track message volume |
NetworkRxPackets | Network receive packet rate | Network performance |
Cluster Health
| Metric | Description | Alert Condition |
|---|---|---|
NetworkRxErrors | Network receive errors | > 0 errors |
NetworkTxErrors | Network transmit errors | > 0 errors |
GlobalTopicCount | Total topics in cluster | Unexpected changes |
GlobalPartitionCount | Total partitions in cluster | Monitor for scaling |
Troubleshooting
CloudWatch API Issues
Permission Denied:
# Verify AWS credentialsaws sts get-caller-identity
# Test MSK accessaws kafka list-clusters --region us-east-1
# Check CloudWatch permissionsaws cloudwatch list-metrics --namespace AWS/Kafka --region us-east-1 | head -10Rate Limiting:
# Adjust collection interval to reduce API callsreceivers: awscloudwatch: collection_interval: 600s # 10 minutes instead of 5Missing Metrics
No MSK Metrics:
# Verify MSK clusters existaws kafka list-clusters --region us-east-1
# Check if cluster is in correct stateaws kafka describe-cluster --cluster-arn arn:aws:kafka:region:account:cluster/name/uuid
# Verify enhanced monitoring is enabled if neededaws kafka describe-cluster --cluster-arn YOUR_CLUSTER_ARN --query 'ClusterInfo.EnhancedMonitoring'Consumer Lag Metrics Missing:
# Verify consumer groups are active# You'll need Kafka client tools configured with MSK endpoint
# List consumer groupskafka-consumer-groups.sh --bootstrap-server your-msk-endpoint:9092 --list
# Check consumer group lagkafka-consumer-groups.sh --bootstrap-server your-msk-endpoint:9092 --describe --group your-consumer-groupHigh Resource Usage
Excessive CloudWatch API Calls:
- Increase collection intervals for less critical metrics
- Use metric filtering to collect only necessary metrics
- Monitor CloudWatch API costs and usage
MSK Cluster Performance Issues:
# Check MSK cluster configurationaws kafka describe-cluster-v2 --cluster-arn YOUR_CLUSTER_ARN
# Monitor broker instance types and scalingaws kafka describe-configuration --arn CONFIGURATION_ARNBest Practices
Security
- IAM Roles: Use IAM roles instead of access keys when running on EC2
- VPC Security: Deploy monitoring in the same VPC as MSK clusters
- Encryption: Ensure MSK encryption in transit and at rest is enabled
- Network Access: Restrict collector access to MSK clusters using security groups
Performance
- Collection Intervals: Balance monitoring granularity with CloudWatch costs
- Metric Selection: Focus on business-critical topics and consumer groups
- Regional Optimization: Deploy collectors in the same region as MSK clusters
- Batch Processing: Use appropriate batch sizes for metric collection
Monitoring Strategy
- Consumer Lag Alerting: Set up alerts for critical consumer lag thresholds
- Broker Health: Monitor CPU, memory, and disk usage across all brokers
- Topic-Level Monitoring: Track throughput and growth for important topics
- Capacity Planning: Monitor resource utilization trends for scaling decisions
Cost Optimization
- Selective Monitoring: Monitor only necessary topics and consumer groups
- Metric Retention: Use appropriate CloudWatch metric retention policies
- Collection Frequency: Adjust collection intervals based on criticality
- Resource Tagging: Use MSK resource tags for cost allocation and filtering
Need Help?
If you encounter any issues or have questions:
- Join our Discord community for real-time support
- Contact our support team at support@last9.io