
AWS SQS

Monitor AWS SQS queue performance, message throughput, and dead letter queues with CloudWatch metrics for comprehensive message queue observability

Monitor your Amazon SQS (Simple Queue Service) queues with CloudWatch metrics integration. This setup provides comprehensive monitoring of queue performance, message throughput, processing delays, dead letter queues, and overall queue health.

Prerequisites

Before setting up AWS SQS monitoring, ensure you have:

  • AWS Account: With access to SQS and CloudWatch services
  • SQS Queues: One or more active queues to monitor
  • CloudWatch Permissions: IAM permissions to read CloudWatch metrics (a sample policy follows this list)
  • Monitoring Server: Where you can install and run OpenTelemetry Collector
  • Last9 Account: With metrics integration credentials
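
A minimal IAM policy for the collector's read-only CloudWatch access might look like the sketch below. Treat it as a starting point rather than a definitive policy, and scope it down further if your organization requires resource-level restrictions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    }
  ]
}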
Setup

  1. Install OpenTelemetry Collector

    Install the OpenTelemetry Collector with AWS receiver support:

    For Debian/Ubuntu systems:

    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
    sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
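
    To confirm the install, print the collector version (recent releases support a --version flag):

    otelcol-contrib --version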
  2. Configure AWS Credentials

    Set up AWS credentials for CloudWatch access:

    Create or update ~/.aws/credentials:

    [default]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
    region = us-east-1
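
    Alternatively, for containerized or CI environments, the standard AWS environment variables work too; the SDK credential chain picks them up without a credentials file:

    export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
    export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
    export AWS_REGION=us-east-1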
  3. Create OpenTelemetry Collector Configuration

    Create the collector configuration file:

    sudo mkdir -p /etc/otelcol-contrib
    sudo nano /etc/otelcol-contrib/config.yaml

    Add the following configuration to collect SQS CloudWatch metrics:

    receivers:
      awscloudwatch:
        region: us-east-1 # Change to your AWS region
        metrics:
          # Queue Message Metrics
          - metric_name: NumberOfMessagesSent
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*" # Monitor all queues
          - metric_name: NumberOfMessagesReceived
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: NumberOfMessagesDeleted
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ApproximateNumberOfMessagesVisible
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ApproximateNumberOfMessagesNotVisible
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          # Dead Letter Queue Metrics
          - metric_name: ApproximateNumberOfMessagesDelayed
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: NumberOfMessagesMoved
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
          # Age and Processing Metrics
          - metric_name: ApproximateAgeOfOldestMessage
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: ReceiveMessageWaitTime
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "*"
          # Size and Throughput Metrics
          - metric_name: SentMessageSize
            namespace: AWS/SQS
            stat: [Average, Maximum, Sum]
            dimensions:
              - name: QueueName
                value: "*"
          - metric_name: NumberOfEmptyReceives
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "*"
        collection_interval: 300s # 5 minutes (CloudWatch default)

    processors:
      batch:
        timeout: 30s
        send_batch_size: 10000
        send_batch_max_size: 10000
      resourcedetection/cloud:
        detectors: ["aws"]
      transform/metrics:
        metric_statements:
          - context: metric
            statements:
              - set(resource.attributes["service.name"], "aws-sqs")
              - set(resource.attributes["deployment.environment"], "production")

    exporters:
      prometheusremotewrite:
        endpoint: "$last9_remote_write_url"
        auth:
          authenticator: basicauth/metrics
        resource_to_telemetry_conversion:
          enabled: true
      debug:
        verbosity: detailed

    extensions:
      basicauth/metrics:
        client_auth:
          username: "$last9_remote_write_username"
          password: "$last9_remote_write_password"

    service:
      extensions: [basicauth/metrics]
      pipelines:
        metrics:
          receivers: [awscloudwatch]
          processors: [batch, resourcedetection/cloud, transform/metrics]
          exporters: [prometheusremotewrite]
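
    Before starting the service, it is worth validating the file. Recent otelcol-contrib releases include a validate subcommand that catches YAML and pipeline errors without starting the collector:

    otelcol-contrib validate --config /etc/otelcol-contrib/config.yaml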
  4. Configure Specific Queues (Optional)

    To monitor specific SQS queues instead of all queues, modify the dimensions:

    receivers:
      awscloudwatch:
        region: us-east-1
        metrics:
          - metric_name: ApproximateNumberOfMessages
            namespace: AWS/SQS
            stat: [Average, Maximum]
            dimensions:
              - name: QueueName
                value: "production-orders" # Specific queue
          - metric_name: NumberOfMessagesSent
            namespace: AWS/SQS
            stat: [Sum, Average]
            dimensions:
              - name: QueueName
                value: "production-orders"
  5. Add FIFO Queue Metrics (if applicable)

    If you’re using FIFO queues, add FIFO-specific metrics:

    receivers:
      awscloudwatch:
        metrics:
          - metric_name: ContentBasedDeduplication
            namespace: AWS/SQS
            stat: [Sum]
            dimensions:
              - name: QueueName
                value: "*.fifo" # Monitor all FIFO queues
          - metric_name: DeduplicationScope
            namespace: AWS/SQS
            stat: [Sum]
            dimensions:
              - name: QueueName
                value: "*.fifo"
          - metric_name: FifoThroughputLimit
            namespace: AWS/SQS
            stat: [Sum]
            dimensions:
              - name: QueueName
                value: "*.fifo"
  6. Create Systemd Service Configuration

    Create a systemd service file:

    sudo nano /etc/systemd/system/otelcol-contrib.service

    Add the service configuration:

    [Unit]
    Description=OpenTelemetry Collector for AWS SQS Monitoring
    After=network.target

    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
    Restart=always
    User=root
    Group=root
    Environment=AWS_REGION=us-east-1

    [Install]
    WantedBy=multi-user.target
  7. Start and Enable the Service

    Start the OpenTelemetry Collector service:

    sudo systemctl daemon-reload
    sudo systemctl enable otelcol-contrib
    sudo systemctl start otelcol-contrib

Understanding SQS Metrics

The AWS SQS integration collects comprehensive CloudWatch metrics:

Message Flow Metrics

  • NumberOfMessagesSent: Messages added to the queue
  • NumberOfMessagesReceived: Messages retrieved from the queue
  • NumberOfMessagesDeleted: Messages successfully processed and removed
  • NumberOfEmptyReceives: Polling attempts that returned no messages

Queue State Metrics

  • ApproximateNumberOfMessages: Total messages in the queue
  • ApproximateNumberOfMessagesVisible: Messages available for retrieval
  • ApproximateNumberOfMessagesNotVisible: Messages being processed (in-flight)
  • ApproximateNumberOfMessagesDelayed: Messages delayed for future delivery

Performance Metrics

  • ApproximateAgeOfOldestMessage: Age of the oldest message in seconds
  • ReceiveMessageWaitTime: Wait time for long polling operations
  • SentMessageSize: Size of messages being sent

Dead Letter Queue Metrics

  • NumberOfMessagesMoved: Messages moved to dead letter queues
  • DeadLetterQueueSourceQueues: The source queues that route failed messages into a given dead letter queue

FIFO Queue Metrics (FIFO Queues Only)

  • ContentBasedDeduplication: Messages deduplicated by content
  • DeduplicationScope: Deduplication behavior per message group
  • FifoThroughputLimit: FIFO queue throughput limitations

Advanced Configuration

Multi-Region Monitoring

Monitor SQS queues across multiple AWS regions:

receivers:
  awscloudwatch/us-east-1:
    region: us-east-1
    metrics:
      - metric_name: ApproximateNumberOfMessages
        namespace: AWS/SQS
        stat: [Average, Maximum]
  awscloudwatch/us-west-2:
    region: us-west-2
    metrics:
      - metric_name: ApproximateNumberOfMessages
        namespace: AWS/SQS
        stat: [Average, Maximum]

service:
  pipelines:
    metrics:
      receivers: [awscloudwatch/us-east-1, awscloudwatch/us-west-2]

Queue-Specific Monitoring

Monitor different queue types with specific configurations:

receivers:
  awscloudwatch/standard-queues:
    region: us-east-1
    metrics:
      - metric_name: ApproximateNumberOfMessages
        namespace: AWS/SQS
        stat: [Average, Maximum]
        dimensions:
          - name: QueueName
            value: "production-*" # Standard queues
  awscloudwatch/fifo-queues:
    region: us-east-1
    metrics:
      - metric_name: ApproximateNumberOfMessages
        namespace: AWS/SQS
        stat: [Average, Maximum]
        dimensions:
          - name: QueueName
            value: "*.fifo" # FIFO queues only

Dead Letter Queue Monitoring

Specific configuration for monitoring dead letter queues:

receivers:
  awscloudwatch/dlq:
    region: us-east-1
    metrics:
      - metric_name: ApproximateNumberOfMessages
        namespace: AWS/SQS
        stat: [Average, Maximum, Sum]
        dimensions:
          - name: QueueName
            value: "*-dlq" # Dead letter queues
      - metric_name: ApproximateAgeOfOldestMessage
        namespace: AWS/SQS
        stat: [Maximum]
        dimensions:
          - name: QueueName
            value: "*-dlq"

Verification

  1. Check Service Status

    Verify the OpenTelemetry Collector is running:

    sudo systemctl status otelcol-contrib
  2. Monitor Service Logs

    Check for any configuration errors:

    sudo journalctl -u otelcol-contrib -f
  3. Verify AWS Connectivity

    Test AWS API access:

    aws sqs list-queues --region us-east-1
    aws cloudwatch list-metrics --namespace AWS/SQS --region us-east-1
  4. Generate SQS Activity

    Create some queue activity to generate metrics:

    # Send test messages to a queue
    aws sqs send-message \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/test-queue \
      --message-body "Test message 1"

    # Receive messages
    aws sqs receive-message \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/test-queue

    # Check queue attributes
    aws sqs get-queue-attributes \
      --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/test-queue \
      --attribute-names All
  5. Verify Metrics in Last9

    Log into your Last9 account and check that SQS metrics are being received in Grafana.

    Look for metrics like:

    • ApproximateNumberOfMessages
    • NumberOfMessagesSent
    • NumberOfMessagesReceived
    • ApproximateAgeOfOldestMessage

Key Metrics to Monitor

Critical Queue Health Indicators

Metric | Description | Alert Threshold
ApproximateNumberOfMessages | Messages waiting in queue | > 1000 for high-throughput queues
ApproximateAgeOfOldestMessage | Age of oldest unprocessed message | > 300 seconds (5 minutes)
NumberOfMessagesReceived | Messages being processed | Sudden drops indicate consumer issues
NumberOfEmptyReceives | Polling without messages | High values indicate inefficient polling
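
If you also want a guardrail at the AWS layer, independent of the Last9 pipeline, the same queue-depth threshold can be encoded as a CloudWatch alarm. A sketch using the ApproximateNumberOfMessagesVisible CloudWatch metric; the queue name and SNS topic ARN are placeholders:

aws cloudwatch put-metric-alarm \
  --alarm-name sqs-queue-depth-production-orders \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=production-orders \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts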

Performance Monitoring

Metric | Description | Monitoring Focus
NumberOfMessagesSent | Production rate | Track message ingestion trends
NumberOfMessagesDeleted | Processing rate | Should match sent messages over time
SentMessageSize | Message size distribution | Monitor for size limits and costs
ReceiveMessageWaitTime | Long polling efficiency | Optimize consumer polling strategy

Dead Letter Queue Monitoring

Metric | Description | Alert Condition
ApproximateNumberOfMessages (DLQ) | Failed messages | > 0 (any messages in DLQ)
NumberOfMessagesMoved | Messages moved to DLQ | Increasing trend indicates issues

Trace Context Propagation through SQS

To get end-to-end distributed traces across services that communicate via SQS, you need to propagate W3C TraceContext (traceparent and tracestate) through SQS MessageAttributes.

Producer — Injecting Trace Context

On the producer (the service sending messages to SQS), inject the current trace context into MessageAttributes before calling SendMessage:

import json

from opentelemetry.propagate import inject

def inject_trace_context() -> dict:
    """Serialize the active trace context (traceparent/tracestate) into SQS MessageAttributes."""
    carrier = {}
    inject(carrier)
    message_attributes = {}
    for key, value in carrier.items():
        message_attributes[key] = {
            "DataType": "String",
            "StringValue": value,
        }
    return message_attributes

# Usage (sqs is a boto3 SQS client)
response = sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps(payload),
    MessageAttributes=inject_trace_context(),
)

Consumer — Extracting Trace Context

On the consumer (Lambda or any SQS reader), extract the trace context from MessageAttributes and use it as the parent context for new spans.

Field | ESM Format (Lambda trigger) | SDK Format (ReceiveMessage)
String value | stringValue | StringValue
Data type | dataType | DataType

See the AWS Lambda integration — SQS Trace Propagation for consumer-side extraction code.
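
For the SDK format specifically, extraction is the mirror image of the producer's inject. A minimal sketch, assuming boto3-style ReceiveMessage responses and the OTel Python SDK (process_with_trace_context is a hypothetical helper name):

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("sqs-consumer")

def process_with_trace_context(message: dict) -> None:
    # SDK format: {"MessageAttributes": {"traceparent": {"StringValue": "..."}}}
    carrier = {
        key.lower(): attr["StringValue"]
        for key, attr in (message.get("MessageAttributes") or {}).items()
        if "StringValue" in attr
    }
    parent_ctx = extract(carrier)  # Context carrying the producer's span
    with tracer.start_as_current_span("queue process", context=parent_ctx):
        ...  # your business logic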

Auto-Injection with @opentelemetry/instrumentation-aws-sdk

If you use the AWS SDK instrumentation for Node.js, trace context is injected and extracted automatically — no manual carrier code needed.

import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { AwsInstrumentation } from "@opentelemetry/instrumentation-aws-sdk";

registerInstrumentations({
  instrumentations: [
    new AwsInstrumentation({
      // false (default): extract traceparent from MessageAttributes
      // true: extract traceparent from message body JSON field;
      // needed for SNS→SQS fanout (MessageAttributes stripped) or Lambda producers
      sqsExtractContextPropagationFromPayload: false,
    }),
  ],
});
sqsExtractContextPropagationFromPayload | Extracts from | Use when
false (default) | MessageAttributes.traceparent | Producer uses AwsInstrumentation or manually injects into MessageAttributes
true | Message body JSON field traceparent | Lambda ESM triggers, non-OTel producers that embed context in body

SNS → SQS — Raw vs Wrapped Delivery

If your system uses both direct app → SQS and SNS → SQS on the same consumer, the delivery mode on the SNS subscription determines where traceparent ends up:

SNS subscription delivery | traceparent location in SQS message | Extraction
Raw message delivery ON | MessageAttributes.traceparent | Standard — sqsExtractContextPropagationFromPayload: false
Raw message delivery OFF (SNS default) | Inside body JSON envelope: body.MessageAttributes.traceparent.Value | Requires custom body parsing

Recommended: enable raw message delivery on SNS→SQS subscriptions. This preserves MessageAttributes through the fanout, so a single extraction path works for both direct and SNS-originated messages.

aws sns set-subscription-attributes \
  --subscription-arn <your-subscription-arn> \
  --attribute-name RawMessageDelivery \
  --attribute-value true

If you cannot change the subscription config, extract from both paths — MessageAttributes first, SNS envelope body as fallback:

import { SpanContext } from "@opentelemetry/api";
import { Message } from "@aws-sdk/client-sqs";

function extractProducerContext(message: Message): SpanContext | null {
  // Path 1: direct app → SQS, or SNS→SQS with rawMessageDelivery=true
  const fromAttrs = extractFromMessageAttributes(message);
  if (fromAttrs) return fromAttrs;
  // Path 2: SNS→SQS with rawMessageDelivery=false
  // SNS wraps body as: {"Type":"Notification","MessageAttributes":{"traceparent":{"Type":"String","Value":"..."}}}
  return extractFromSnsEnvelope(message);
}

function extractFromSnsEnvelope(message: Message): SpanContext | null {
  try {
    const envelope = JSON.parse(message.Body ?? "{}");
    if (envelope.Type !== "Notification" || !envelope.MessageAttributes) return null;
    const carrier: Record<string, string> = {};
    for (const [key, attr] of Object.entries(envelope.MessageAttributes as Record<string, { Type: string; Value: string }>)) {
      if (attr.Type === "String") carrier[key.toLowerCase()] = attr.Value;
    }
    // spanContextFrom: your helper that runs propagation.extract over the carrier
    return spanContextFrom(carrier);
  } catch {
    return null;
  }
}

Polling-based Consumers — Per-Poll and Per-Message Correlation

Long-polling consumers (e.g., setInterval/setTimeout loops) typically receive a batch of up to 10 messages per call. Without explicit spans, all message processing is invisible inside a single receive operation.

The recommended pattern creates two levels of spans:

sqs.poll_cycle (SPAN_KIND_INTERNAL)        ← one per interval tick
├── <queue> receive (SPAN_KIND_CONSUMER)   ← auto by AwsInstrumentation
├── <queue> process (SPAN_KIND_CONSUMER)   ← manual, per message
│   ├── link → producer trace              ← cross-trace navigation
│   └── SQS.DeleteMessage                  ← auto by AwsInstrumentation
└── <queue> process (SPAN_KIND_CONSUMER)   ← parallel per message

Why links and not parent? Setting the producer’s span as parent collapses producer and consumer into one trace tree. Using links keeps them as independent traces that can navigate to each other — correct per the OTel messaging spec.

import { trace, context, SpanKind, SpanStatusCode, propagation } from "@opentelemetry/api";
import { ReceiveMessageCommand, DeleteMessageCommand, SQSClient, Message } from "@aws-sdk/client-sqs";

const tracer = trace.getTracer("sqs-poller");
const sqs = new SQSClient({ region: process.env.AWS_REGION });

async function pollOnce() {
  // Root span groups the entire interval tick
  const pollSpan = tracer.startSpan("sqs.poll_cycle", {
    kind: SpanKind.INTERNAL,
    attributes: { "messaging.system": "aws.sqs", "messaging.destination.name": QUEUE_NAME },
  });
  await context.with(trace.setSpan(context.active(), pollSpan), async () => {
    try {
      // AwsInstrumentation auto-instruments this as a CONSUMER child span
      const { Messages = [] } = await sqs.send(new ReceiveMessageCommand({
        QueueUrl: QUEUE_URL,
        MaxNumberOfMessages: 10,
        WaitTimeSeconds: 5,
        MessageAttributeNames: ["All"], // required for traceparent extraction
      }));
      pollSpan.setAttribute("messaging.batch.message_count", Messages.length);
      await Promise.all(Messages.map(processMessage));
      pollSpan.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      pollSpan.recordException(err as Error);
      pollSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
    } finally {
      pollSpan.end();
    }
  });
}

async function processMessage(message: Message) {
  // Extract producer's span context from MessageAttributes
  const carrier: Record<string, string> = {};
  for (const [key, attr] of Object.entries(message.MessageAttributes ?? {})) {
    const val = attr as { StringValue?: string };
    if (val.StringValue) carrier[key.toLowerCase()] = val.StringValue;
  }
  const producerCtx = trace.getSpanContext(propagation.extract(context.active(), carrier));

  const msgSpan = tracer.startSpan(`${QUEUE_NAME} process`, {
    kind: SpanKind.CONSUMER,
    links: producerCtx ? [{ context: producerCtx }] : [],
    attributes: {
      "messaging.system": "aws.sqs",
      "messaging.message.id": message.MessageId,
      "messaging.operation": "process",
    },
  });
  await context.with(trace.setSpan(context.active(), msgSpan), async () => {
    try {
      await handleMessage(message); // your business logic
      await sqs.send(new DeleteMessageCommand({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle! }));
      msgSpan.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      msgSpan.recordException(err as Error);
      msgSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
    } finally {
      msgSpan.end();
    }
  });
}

Log Correlation

To correlate logs with traces, emit log records through the OTel Logs API with context.active(). The SDK automatically attaches trace_id and span_id from the active span context.

First, set up a LoggerProvider in your instrumentation bootstrap:

import { LoggerProvider, BatchLogRecordProcessor } from "@opentelemetry/sdk-logs";
import { OTLPLogExporter } from "@opentelemetry/exporter-logs-otlp-http";
import { logs } from "@opentelemetry/api-logs";

// `resource` is the same Resource instance used by your tracer provider
const loggerProvider = new LoggerProvider({ resource });
loggerProvider.addLogRecordProcessor(new BatchLogRecordProcessor(new OTLPLogExporter()));
logs.setGlobalLoggerProvider(loggerProvider);

Then emit structured logs from inside your span context:

import { SeverityNumber, logs } from "@opentelemetry/api-logs";
import { context } from "@opentelemetry/api";

const logger = logs.getLogger("sqs-consumer");

// Inside processMessage(), while msgSpan is active:
logger.emit({
  severityNumber: SeverityNumber.INFO,
  severityText: "INFO",
  body: "message_processing_start",
  attributes: { messageId: message.MessageId, queueName: QUEUE_NAME },
  context: context.active(), // SDK attaches trace_id + span_id automatically
});

Log records arrive in Last9 with trace_id and span_id fields, enabling direct navigation from a log line to its trace in the Last9 UI.

Full Examples

Pattern | Example
NestJS polling consumer — per-poll + per-message spans + log correlation | javascript/nestjs-sqs-correlation
Python SQS → Lambda trace propagation | python/aws-sqs-lambda

Best Practices

Security

  • IAM Roles: Use IAM roles instead of access keys when running on EC2
  • Least Privilege: Grant only necessary CloudWatch and SQS permissions
  • Queue Access: Restrict SQS queue access to authorized consumers and producers

Performance

  • Collection Intervals: Balance monitoring granularity with CloudWatch API costs
  • Metric Selection: Monitor only metrics relevant to your specific queues
  • Regional Optimization: Deploy collectors in the same region as SQS queues

Monitoring Strategy

  • Queue Depth Alerts: Set alerts for excessive queue depth
  • Consumer Health: Monitor message processing rates and age
  • Dead Letter Queues: Always monitor DLQs for failed message processing
  • Cost Optimization: Use appropriate CloudWatch metric collection intervals

Queue Management

  • Visibility Timeout: Configure appropriate visibility timeouts for your workload
  • Message Retention: Set appropriate message retention periods
  • Redrive Policy: Configure dead letter queues with appropriate maxReceiveCount
  • Long Polling: Use long polling to reduce empty receives and costs (see the example below)
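
For example, long polling and a redrive policy can both be set in a single set-queue-attributes call; the queue URL and DLQ ARN below are placeholders:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/production-orders \
  --attributes '{"ReceiveMessageWaitTimeSeconds":"20","RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:orders-dlq\",\"maxReceiveCount\":\"5\"}"}'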

Troubleshooting

  • CloudWatch API issues

    • Permission denied. Verify credentials and access to SQS and CloudWatch:

      aws sts get-caller-identity
      aws sqs list-queues --region us-east-1
      aws cloudwatch list-metrics --namespace AWS/SQS --region us-east-1 | head -10
    • Rate limiting. Increase the receiver’s collection_interval to reduce API call volume:

      receivers:
        awscloudwatch:
          collection_interval: 600s # 10 minutes instead of 5
  • Missing metrics

    • No queue metrics appearing. Confirm queues exist and that CloudWatch has metric data for them:

      aws sqs list-queues --region us-east-1
      aws cloudwatch get-metric-statistics \
        --namespace AWS/SQS \
        --metric-name ApproximateNumberOfMessages \
        --dimensions Name=QueueName,Value=your-queue-name \
        --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
        --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
        --period 300 \
        --statistics Average
    • Partial data. List the full set of SQS metrics available and inspect queue attributes:

      aws cloudwatch list-metrics --namespace AWS/SQS --region us-east-1
      aws sqs get-queue-attributes \
        --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/queue-name \
        --attribute-names All
  • High message age

    • Slow consumer processing. Check the visibility timeout, long-polling settings, and in-flight message count:

      aws sqs get-queue-attributes \
        --queue-url YOUR_QUEUE_URL \
        --attribute-names VisibilityTimeoutSeconds,ReceiveMessageWaitTimeSeconds

      aws sqs get-queue-attributes \
        --queue-url YOUR_QUEUE_URL \
        --attribute-names ApproximateNumberOfMessagesNotVisible

Please get in touch with us on Discord or Email if you have any questions.