AWS SQS
Monitor AWS SQS queue performance, message throughput, and dead letter queues with CloudWatch metrics for comprehensive message queue observability
Monitor your Amazon SQS (Simple Queue Service) queues with CloudWatch metrics integration. This setup provides comprehensive monitoring of queue performance, message throughput, processing delays, dead letter queues, and overall queue health.
Prerequisites
Before setting up AWS SQS monitoring, ensure you have:
- AWS Account: With access to SQS and CloudWatch services
- SQS Queues: Running queues to monitor
- CloudWatch Permissions: IAM permissions to read CloudWatch metrics
- Monitoring Server: Where you can install and run OpenTelemetry Collector
- Last9 Account: With metrics integration credentials
Install OpenTelemetry Collector
Install the OpenTelemetry Collector with AWS receiver support:
For Debian/Ubuntu systems:
    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
    sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
For Red Hat/CentOS systems:
    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
    sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
Configure AWS Credentials
Set up AWS credentials for CloudWatch access:
Create or update ~/.aws/credentials:
    [default]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
    region = us-east-1
Set environment variables:
    export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
    export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
    export AWS_REGION=us-east-1
If running on EC2, attach an IAM role with the following policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:ListMetrics",
            "sqs:ListQueues",
            "sqs:GetQueueAttributes"
          ],
          "Resource": "*"
        }
      ]
    }
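If you choose the environment-variable option above, a small pre-flight check can catch a missing variable before the collector starts. The sketch below is illustrative only: `missing_aws_vars` is a hypothetical helper name, and it assumes the three variables from the export example are the ones your setup relies on. It does not apply to the credentials-file or IAM-role options.

```shell
# Hypothetical pre-flight helper (not part of the AWS CLI or the collector).
# Prints the names of required variables that are unset or empty and
# returns 1 if any are missing, 0 otherwise.
missing_aws_vars() {
  _missing=""
  for _name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION; do
    # Look up the variable whose name is stored in $_name.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      _missing="${_missing:+$_missing }$_name"
    fi
  done
  if [ -n "$_missing" ]; then
    echo "$_missing"
    return 1
  fi
  return 0
}
```

Calling `missing_aws_vars` in the shell that will launch the collector prints nothing when all three variables are set. Keep in mind that variables exported in an interactive shell are not inherited by a systemd service; for a service they have to be set in the unit file or an environment file.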
Create OpenTelemetry Collector Configuration
Create the collector configuration file:
    sudo mkdir -p /etc/otelcol-contrib
    sudo nano /etc/otelcol-contrib/config.yaml
Add the following configuration to collect SQS CloudWatch metrics:
    receivers:
      awscloudwatch:
        region: us-east-1 # Change to your AWS region
        metrics:
          # Queue Message Metrics
          - metric_name: NumberOfMessagesSent
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*" # Monitor all queues
          - metric_name: NumberOfMessagesReceived
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: NumberOfMessagesDeleted
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ApproximateNumberOfMessagesVisible
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ApproximateNumberOfMessagesNotVisible
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          # Dead Letter Queue Metrics
          - metric_name: ApproximateNumberOfMessagesDelayed
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: NumberOfMessagesRed
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
          # Age and Processing Metrics
          - metric_name: ApproximateAgeOfOldestMessage
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ReceiveMessageWaitTime
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          # Size and Throughput Metrics
          - metric_name: SentMessageSize
            namespace: AWS/SQS
            stat: [Average, Maximum, Sum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: NumberOfEmptyReceives
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
        collection_interval: 300s # 5 minutes (CloudWatch default)

    processors:
      batch:
        timeout: 30s
        send_batch_size: 10000
        send_batch_max_size: 10000
      resourcedetection/cloud:
        detectors: ["aws"]
      transform/metrics:
        metric_statements:
          - context: metric
            statements:
              - set(resource.attributes["service.name"], "aws-sqs")
              - set(resource.attributes["deployment.environment"], "production")

    exporters:
      prometheusremotewrite:
        endpoint: "$last9_remote_write_url"
        auth:
          authenticator: basicauth/metrics
        resource_to_telemetry_conversion:
          enabled: true
      debug:
        verbosity: detailed

    extensions:
      basicauth/metrics:
        client_auth:
          username: "$last9_remote_write_username"
          password: "$last9_remote_write_password"

    service:
      extensions: [basicauth/metrics]
      pipelines:
        metrics:
          receivers: [awscloudwatch]
          processors: [batch, resourcedetection/cloud, transform/metrics]
          exporters: [prometheusremotewrite]
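Because indentation errors are easy to introduce in a file of this size, it can help to check the configuration before starting the service. The sketch below wraps the collector's `validate` subcommand; the function name `validate_collector_config` and its exit codes 2 and 3 are assumptions made for this example, and the default path is the one used in this guide.

```shell
# Hypothetical wrapper: validate_collector_config [config_path]
# Returns 2 if the file is not readable, 3 if the collector binary is not on
# PATH, otherwise the exit status of the collector's own validation.
validate_collector_config() {
  _config="${1:-/etc/otelcol-contrib/config.yaml}"
  if [ ! -r "$_config" ]; then
    echo "config not readable: $_config" >&2
    return 2
  fi
  if ! command -v otelcol-contrib >/dev/null 2>&1; then
    echo "otelcol-contrib not found on PATH" >&2
    return 3
  fi
  # Parses and checks the configuration without starting any pipelines.
  otelcol-contrib validate --config="$_config"
}
```

Running `validate_collector_config` with no arguments checks the file created in this step. Validation only confirms that the file is well-formed for the installed collector build; it does not test AWS or Last9 connectivity.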
Configure Specific Queues (Optional)
To monitor specific SQS queues instead of all queues, modify the dimensions:
    receivers:
      awscloudwatch:
        region: us-east-1
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "production-orders" # Specific queue
          - metric_name: NumberOfMessagesSent
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "production-orders"
Add FIFO Queue Metrics (if applicable)
If you’re using FIFO queues, add FIFO-specific metrics:
    receivers:
      awscloudwatch:
        metrics:
          - metric_name: ContentBasedDeduplication
            namespace: AWS/SQS
            stat: [Sum]
            dimensions:
              - name: QueueName
                value: "*.fifo" # Monitor all FIFO queues
          - metric_name: DeduplicationScope
            namespace: AWS/SQS
            stat: [Sum]
            dimensions:
              - name: QueueName
                value: "*.fifo"
          - metric_name: FifoThroughputLimit
            namespace: AWS/SQS
            stat: [Sum]
            dimensions:
              - name: QueueName
                value: "*.fifo"
Create Systemd Service Configuration
Create a systemd service file:
    sudo nano /etc/systemd/system/otelcol-contrib.service
Add the service configuration:
    [Unit]
    Description=OpenTelemetry Collector for AWS SQS Monitoring
    After=network.target

    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
    Restart=always
    User=root
    Group=root
    Environment=AWS_REGION=us-east-1

    [Install]
    WantedBy=multi-user.target
Start and Enable the Service
Start the OpenTelemetry Collector service:
    sudo systemctl daemon-reload
    sudo systemctl enable otelcol-contrib
    sudo systemctl start otelcol-contrib
Understanding SQS Metrics
The AWS SQS integration collects comprehensive CloudWatch metrics:
Message Flow Metrics
- NumberOfMessagesSent: Messages added to the queue
- NumberOfMessagesReceived: Messages retrieved from the queue
- NumberOfMessagesDeleted: Messages successfully processed and removed
- NumberOfEmptyReceives: Polling attempts that returned no messages
Queue State Metrics
- ApproximateNumberOfMessages: Total messages in the queue
- ApproximateNumberOfMessagesVisible: Messages available for retrieval
- ApproximateNumberOfMessagesNotVisible: Messages being processed (in-flight)
- ApproximateNumberOfMessagesDelayed: Messages delayed for future delivery
Performance Metrics
- ApproximateAgeOfOldestMessage: Age of the oldest message in seconds
- ReceiveMessageWaitTime: Wait time for long polling operations
- SentMessageSize: Size of messages being sent
Dead Letter Queue Metrics
- NumberOfMessagesMoved: Messages moved to dead letter queues
- DeadLetterQueueSourceQueues: Dead letter queue relationships
FIFO Queue Metrics (FIFO Queues Only)
- ContentBasedDeduplication: Messages deduplicated by content
- DeduplicationScope: Deduplication behavior per message group
- FifoThroughputLimit: FIFO queue throughput limitations
Advanced Configuration
Multi-Region Monitoring
Monitor SQS queues across multiple AWS regions:
    receivers:
      awscloudwatch/us-east-1:
        region: us-east-1
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
      awscloudwatch/us-west-2:
        region: us-west-2
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]

    service:
      pipelines:
        metrics:
          receivers: [awscloudwatch/us-east-1, awscloudwatch/us-west-2]
Queue-Specific Monitoring
Monitor different queue types with specific configurations:
    receivers:
      awscloudwatch/standard-queues:
        region: us-east-1
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "production-*" # Standard queues
      awscloudwatch/fifo-queues:
        region: us-east-1
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*.fifo" # FIFO queues only
Dead Letter Queue Monitoring
Specific configuration for monitoring dead letter queues:
    receivers:
      awscloudwatch/dlq:
        region: us-east-1
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum, Sum]
            dimensions:
              - name: QueueName
                value: "*-dlq" # Dead letter queues
          - metric_name: ApproximateAgeOfOldestMessage
            namespace: AWS/SQS
            stat: [Maximum]
            dimensions:
              - name: QueueName
                value: "*-dlq"
Verification
Check Service Status
Verify the OpenTelemetry Collector is running:
    sudo systemctl status otelcol-contrib
Monitor Service Logs
Check for any configuration errors:
    sudo journalctl -u otelcol-contrib -f
Verify AWS Connectivity
Test AWS API access:
    aws sqs list-queues --region us-east-1
    aws cloudwatch list-metrics --namespace AWS/SQS --region us-east-1
Generate SQS Activity
Create some queue activity to generate metrics:
    # Send test messages to a queue
    aws sqs send-message \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/test-queue \
      --message-body "Test message 1"

    # Receive messages
    aws sqs receive-message \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/test-queue

    # Check queue attributes
    aws sqs get-queue-attributes \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/test-queue \
      --attribute-names All
Verify Metrics in Last9
Log in to your Last9 account and confirm in Grafana that SQS metrics are arriving.
Look for metrics like:
- ApproximateNumberOfMessages
- NumberOfMessagesSent
- NumberOfMessagesReceived
- ApproximateAgeOfOldestMessage
Key Metrics to Monitor
Critical Queue Health Indicators
| Metric | Description | Alert Threshold |
|---|---|---|
| ApproximateNumberOfMessages | Messages waiting in queue | > 1000 for high-throughput queues |
| ApproximateAgeOfOldestMessage | Age of oldest unprocessed message | > 300 seconds (5 minutes) |
| NumberOfMessagesReceived | Messages being processed | Sudden drops indicate consumer issues |
| NumberOfEmptyReceives | Polling without messages | High values indicate inefficient polling |
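The two numeric thresholds in the table above can be expressed as a simple check. This is a minimal sketch, not a Last9 or CloudWatch alerting feature: `check_queue_health` is a hypothetical helper, and it assumes you already have the current queue depth and oldest-message age as integers (for example, from a dashboard query).

```shell
# Thresholds taken from the table above; tune them per queue.
DEPTH_THRESHOLD=1000
AGE_THRESHOLD_SECONDS=300

# Hypothetical helper: check_queue_health <queue_depth> <oldest_message_age_seconds>
# Prints one ALERT line per breached threshold (or OK) and returns 1 on any breach.
check_queue_health() {
  _depth="$1"
  _age="$2"
  _breached=0
  if [ "$_depth" -gt "$DEPTH_THRESHOLD" ]; then
    echo "ALERT queue depth $_depth > $DEPTH_THRESHOLD"
    _breached=1
  fi
  if [ "$_age" -gt "$AGE_THRESHOLD_SECONDS" ]; then
    echo "ALERT oldest message age ${_age}s > ${AGE_THRESHOLD_SECONDS}s"
    _breached=1
  fi
  if [ "$_breached" -eq 0 ]; then
    echo "OK"
  fi
  return "$_breached"
}
```

For example, `check_queue_health 1500 5` reports a depth alert, while `check_queue_health 10 5` prints `OK`. Values exactly at a threshold are treated as healthy, matching the ">" conditions in the table.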
Performance Monitoring
| Metric | Description | Monitoring Focus |
|---|---|---|
| NumberOfMessagesSent | Production rate | Track message ingestion trends |
| NumberOfMessagesDeleted | Processing rate | Should match sent messages over time |
| SentMessageSize | Message size distribution | Monitor for size limits and costs |
| ReceiveMessageWaitTime | Long polling efficiency | Optimize consumer polling strategy |
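The "should match sent messages over time" guidance in the table above amounts to comparing two sums over the same window. A rough sketch, assuming you have the `Sum` of NumberOfMessagesSent and NumberOfMessagesDeleted for one period; both helper names are hypothetical.

```shell
# Hypothetical helper: backlog_delta <messages_sent> <messages_deleted>
# Prints sent minus deleted for the same time window.
backlog_delta() {
  echo $(( $1 - $2 ))
}

# Hypothetical helper: backlog_trend <messages_sent> <messages_deleted>
# Describes whether the backlog grew, drained, or held steady in the window.
backlog_trend() {
  _delta="$(backlog_delta "$1" "$2")"
  if [ "$_delta" -gt 0 ]; then
    echo "growing by $_delta"
  elif [ "$_delta" -lt 0 ]; then
    echo "draining by $(( 0 - $_delta ))"
  else
    echo "steady"
  fi
}
```

A single window with a positive delta is not necessarily a problem; a delta that stays positive across many consecutive windows suggests consumers are not keeping up.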
Dead Letter Queue Monitoring
| Metric | Description | Alert Condition |
|---|---|---|
| ApproximateNumberOfMessages (DLQ) | Failed messages | > 0 (any messages in DLQ) |
| NumberOfMessagesMoved | Messages moved to DLQ | Increasing trend indicates issues |
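For a quick manual check of the "any messages in DLQ" condition above, the queue's ApproximateNumberOfMessages attribute can be read directly with the AWS CLI. This is a sketch under assumptions: `dlq_depth` and `dlq_has_messages` are hypothetical helper names, and the queue URL you pass is your own dead letter queue.

```shell
# Hypothetical helper: dlq_depth <queue_url>
# Prints the ApproximateNumberOfMessages attribute for the given queue.
dlq_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text
}

# Hypothetical helper: dlq_has_messages <queue_url>
# Returns 0 when the DLQ holds at least one message, 1 when it is empty,
# and 2 when the AWS CLI call itself fails.
dlq_has_messages() {
  _count="$(dlq_depth "$1")" || return 2
  [ "$_count" -gt 0 ]
}
```

Something like `dlq_has_messages "$DLQ_URL" && echo "DLQ needs attention"` could run from a cron job, where `DLQ_URL` is a placeholder for your queue URL. The attribute is approximate, so treat the result as a prompt to investigate, not an exact count.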
Troubleshooting
CloudWatch API Issues
Permission Denied:
    # Verify AWS credentials
    aws sts get-caller-identity

    # Test SQS access
    aws sqs list-queues --region us-east-1

    # Check CloudWatch permissions
    aws cloudwatch list-metrics --namespace AWS/SQS --region us-east-1 | head -10
Rate Limiting:
    # Adjust collection interval to reduce API calls
    receivers:
      awscloudwatch:
        collection_interval: 600s # 10 minutes instead of 5
Missing Metrics
No Queue Metrics:
    # Verify queues exist
    aws sqs list-queues --region us-east-1

    # Check specific queue metrics availability
    aws cloudwatch get-metric-statistics \
      --namespace AWS/SQS \
      --metric-name ApproximateNumberOfMessages \
      --dimensions Name=QueueName,Value=your-queue-name \
      --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
      --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
      --period 300 \
      --statistics Average
Partial Data:
    # List all available SQS metrics
    aws cloudwatch list-metrics --namespace AWS/SQS --region us-east-1

    # Check queue-specific metrics
    aws sqs get-queue-attributes \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/queue-name \
      --attribute-names All
High Message Age
Troubleshoot Message Processing:
    # Check the queue's visibility timeout and long-polling wait time
    aws sqs get-queue-attributes \
      --queue-url YOUR_QUEUE_URL \
      --attribute-names VisibilityTimeout ReceiveMessageWaitTimeSeconds

    # Monitor consumer behavior (in-flight message count)
    aws sqs get-queue-attributes \
      --queue-url YOUR_QUEUE_URL \
      --attribute-names ApproximateNumberOfMessagesNotVisible
Best Practices
Security
- IAM Roles: Use IAM roles instead of access keys when running on EC2
- Least Privilege: Grant only necessary CloudWatch and SQS permissions
- Queue Access: Restrict SQS queue access to authorized consumers and producers
Performance
- Collection Intervals: Balance monitoring granularity with CloudWatch API costs
- Metric Selection: Monitor only metrics relevant to your specific queues
- Regional Optimization: Deploy collectors in the same region as SQS queues
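The trade-off between collection interval and CloudWatch API cost can be estimated with simple arithmetic. The sketch below assumes one API request per metric per queue per collection cycle, which is a simplifying assumption and not a statement about how the receiver batches its calls; `estimate_daily_requests` is a hypothetical helper.

```shell
# Hypothetical helper: estimate_daily_requests <metric_count> <queue_count> <interval_seconds>
# Assumes one request per metric per queue per collection cycle (a rough
# upper-bound model, not the receiver's actual batching behavior).
estimate_daily_requests() {
  _metrics="$1"
  _queues="$2"
  _interval="$3"
  _cycles_per_day=$(( 86400 / $_interval ))
  echo $(( $_metrics * $_queues * $_cycles_per_day ))
}
```

Under this model, 12 metrics across 5 queues at a 300-second interval come to `estimate_daily_requests 12 5 300`, or 17280 requests per day; doubling the interval to 600 seconds halves that to 8640.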
Monitoring Strategy
- Queue Depth Alerts: Set alerts for excessive queue depth
- Consumer Health: Monitor message processing rates and age
- Dead Letter Queues: Always monitor DLQs for failed message processing
- Cost Optimization: Use appropriate CloudWatch metric collection intervals
Queue Management
- Visibility Timeout: Configure appropriate visibility timeouts for your workload
- Message Retention: Set appropriate message retention periods
- Redrive Policy: Configure dead letter queues with appropriate maxReceiveCount
- Long Polling: Use long polling to reduce empty receives and costs
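The redrive policy and long polling settings above are both queue attributes that can be set from the AWS CLI. Because RedrivePolicy is a JSON string nested inside the attributes JSON, the quoting is easy to get wrong; the sketch below builds that document. The helper name `redrive_attributes_json`, the `QUEUE_URL` and `DLQ_ARN` placeholders, and the example values 5 and 20 are assumptions for illustration.

```shell
# Hypothetical helper: redrive_attributes_json <dlq_arn> <max_receive_count>
# Prints a JSON document suitable for `aws sqs set-queue-attributes --attributes`.
# RedrivePolicy is itself a JSON string, so its inner quotes are escaped.
redrive_attributes_json() {
  printf '{"RedrivePolicy":"{\\"deadLetterTargetArn\\":\\"%s\\",\\"maxReceiveCount\\":\\"%s\\"}"}\n' "$1" "$2"
}

# Example usage (commented out; QUEUE_URL and DLQ_ARN are placeholders):
#   aws sqs set-queue-attributes \
#     --queue-url "$QUEUE_URL" \
#     --attributes "$(redrive_attributes_json "$DLQ_ARN" 5)"
#
# Enable long polling with a 20-second wait (the maximum SQS allows):
#   aws sqs set-queue-attributes \
#     --queue-url "$QUEUE_URL" \
#     --attributes ReceiveMessageWaitTimeSeconds=20
```

A suitable maxReceiveCount depends on how often your consumers fail transiently; a value that is too low moves recoverable messages to the DLQ, while one that is too high delays detection of poison messages.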
Need Help?
If you encounter any issues or have questions:
- Join our Discord community for real-time support
- Contact our support team at support@last9.io