It’s 2:47 AM and your Lambda functions are timing out. API response times are spiking. You’re flipping between the CloudWatch console, your APM tool, and your logs, trying to figure out what’s going wrong.
CloudWatch has the metrics you need: CPU usage, memory pressure, and request rates — but connecting that data to what your app is doing takes time. The delay in stitching it all together slows down your incident response. By the time you have a full view, your MTTR has already crossed your SLA.
The CloudWatch Data Correlation Problem
Most teams today run a distributed observability stack: Grafana for custom dashboards, Prometheus for time-series data, an APM for traces, and CloudWatch for AWS metrics. These tools do their jobs well, but they work in isolation.
The real bottleneck isn’t the amount of data—it’s connecting the dots across systems. When something breaks, you need to quickly correlate:
- High response latency with DB connection pool issues
- API timeouts with frequent Lambda cold starts
- RDS CPU spikes with delays in transaction processing
- ECS resource limits with increased HTTP 5xx errors
CloudWatch’s built-in dashboards offer basic charts, but lack the flexibility you need during an incident. At the same time, your application data lives in Grafana with rich templating and PromQL. Bridging these gaps during outages adds unnecessary friction when you need answers fast.
Why Use Last9 for CloudWatch Integration?
Before jumping into setup, it’s worth asking: why stream CloudWatch metrics to Last9 at all?
The Problem with Tool Sprawl
Most teams juggle multiple observability tools:
- Prometheus for metric storage
- Grafana for dashboards
- APMs for tracing
- CloudWatch for AWS metrics
- Logging systems for application logs
This fragmentation leads to real problems:
- Operational overhead — More tools mean more integrations and configs
- Inconsistent data — Each system has its own query language and retention model
- Vendor lock-in — Proprietary formats make switching harder
- Pricing complexity — Multiple billing models are tough to predict
What Last9 Solves
Last9 reduces this complexity with a unified, developer-friendly observability platform.
- Prometheus-compatible: Keep your PromQL queries, dashboards, and alerts as-is
- Built-in Grafana: Comes pre-wired—no setup, no plugin hassle
- Multi-source ingestion: Ingest metrics from:
  - AWS CloudWatch
  - Kubernetes (via exporters)
  - App metrics (via client libraries)
  - Third-party APIs
  - Custom sources via HTTP
- Predictable pricing: Charged by usage volume, not per host or metric
Built for High-Scale Workloads
- Handles high cardinality: Built for noisy container and microservice metrics
- Long-term retention: Store months of data without slowdowns
- Extended PromQL: Run complex queries for anomaly detection and forecasting
- Enterprise-ready: SOC2-compliant, encrypted, and access-controlled
No duct-taped dashboards. No jumping across tools. Just unified observability, built around your existing metrics.
How It Works: Streaming CloudWatch Metrics to Last9
CloudWatch Metric Streams let you export AWS metrics in near real time, with no polling and no API throttling. You can stream them directly into Last9 using Amazon Kinesis Firehose.
Here’s the architecture in action:
AWS Services (Lambda, RDS, ECS)
↓
CloudWatch Metric Streams
↓
Amazon Kinesis Firehose
↓
Last9 HTTP Endpoint (OpenTelemetry format)
↓
Grafana (Unified Dashboards)
Why This Setup Works Well
- Real-time metrics: Sub-minute latency. No need to wait for scrapers or API pulls.
- Lower AWS costs: Avoids pricey CloudWatch GetMetricData calls.
- Scales automatically: Kinesis Firehose handles high-throughput metric streams out of the box.
- Standards-based format: Metrics are streamed in OpenTelemetry format, making them easy to ingest, analyze, and visualize alongside your app metrics.
With this pipeline, your AWS infrastructure metrics show up in Grafana, next to app and business metrics, without juggling exporters or writing custom collectors.
10-Minute Setup from CloudWatch to Grafana
Here’s how to route your AWS metrics from CloudWatch into Grafana using Last9’s observability platform.
Step 1: Get Your Last9 Integration Credentials
In your Last9 dashboard:
- Go to Home → Integrations → CloudWatch
- Copy the following credentials:
- HTTP Endpoint URL: Target for your Firehose delivery stream
- Username and Password: Used for HTTP basic auth
You’ll use these to authenticate AWS Kinesis Firehose with Last9’s metric ingestion endpoint.
Step 2: Set Up IAM Permissions
CloudWatch metric streaming relies on multiple AWS services. Create an IAM policy with the following permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudwatch:StartMetricStreams",
"cloudwatch:PutMetricStream",
"cloudwatch:GetMetricStream",
"cloudwatch:GetMetricData",
"cloudwatch:ListMetrics",
"cloudwatch:ListMetricStreams"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"firehose:CreateDeliveryStream",
"firehose:PutRecord",
"firehose:PutRecordBatch",
"firehose:DescribeDeliveryStream",
"firehose:UpdateDestination",
"firehose:ListDeliveryStreams"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"s3:CreateBucket",
"s3:GetBucketLocation",
"s3:ListAllMyBuckets",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:GetObject",
"s3:PutObject"
],
"Resource": ["arn:aws:s3:::*"]
},
{
"Effect": "Allow",
"Action": [
"iam:CreateRole",
"iam:CreatePolicy",
"iam:AttachRolePolicy",
"iam:CreatePolicyVersion",
"iam:DeletePolicyVersion",
"iam:PassRole"
],
"Resource": [
"arn:aws:iam::<account_id>:role/*",
"arn:aws:iam::<account_id>:policy/*"
]
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream"
],
"Resource": [
"arn:aws:logs:<region>:<account_id>:log-group:*:log-stream:*"
]
}
]
}
Why each block matters:
- CloudWatch: Enables metric stream setup and retrieval
- Firehose: Handles metric delivery to Last9
- S3: Provides backup storage for failed records
- IAM: Grants Firehose the ability to assume roles and attach policies
- CloudWatch Logs: Lets you monitor Firehose delivery failures
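If you manage IAM from the CLI, the same setup looks roughly like the sketch below. It is not a definitive template: the policy file, policy name, and role name are placeholders chosen for illustration, and the trust policy shown is only the one Firehose needs. Step 4 additionally needs a role that CloudWatch Metric Streams (service principal streams.metrics.cloudwatch.amazonaws.com) can assume to write into Firehose.
# Save the policy JSON above as last9-streaming-policy.json, then create it
aws iam create-policy \
  --policy-name last9-cloudwatch-streaming \
  --policy-document file://last9-streaming-policy.json

# Role that Kinesis Firehose assumes for S3 backups and CloudWatch Logs
aws iam create-role \
  --role-name last9-firehose-delivery-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "firehose.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach the streaming policy to the role
aws iam attach-role-policy \
  --role-name last9-firehose-delivery-role \
  --policy-arn arn:aws:iam::<account_id>:policy/last9-cloudwatch-streaming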
Step 3: Create Kinesis Data Firehose Delivery Stream
- Go to the AWS Kinesis → Delivery Streams section
- Click Create delivery stream
- Choose Direct PUT as the source
- Set the stream name as:
last9-{your-org-name}
- Configure HTTP Endpoint:
- Endpoint URL: Paste Last9’s ingestion URL
- Authentication: Use basic auth with your credentials
- Set buffer parameters:
- Size: 1 MB (default)
- Interval: 60 seconds
- Enable GZIP compression
- Set an S3 bucket for backup of failed deliveries
- Enable CloudWatch Logs for error tracking
- Click Create delivery stream
Notes:
- GZIP cuts bandwidth by ~70% for time series payloads
- Buffer settings trade off delivery latency against request volume
- S3 + Logs improve durability and visibility during failures
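The same delivery stream can be created from the CLI. Treat this as a sketch under assumptions: the role, bucket, and log group names are placeholders, and how your Last9 username and password map onto the Firehose HTTP endpoint settings (the endpoint URL versus the access key field) should follow whatever the Last9 integration page shows.
aws firehose create-delivery-stream \
  --delivery-stream-name last9-your-org \
  --delivery-stream-type DirectPut \
  --http-endpoint-destination-configuration '{
    "EndpointConfiguration": {
      "Url": "<Last9 HTTP endpoint URL>",
      "Name": "last9",
      "AccessKey": "<credential value per the Last9 integration page>"
    },
    "RequestConfiguration": {"ContentEncoding": "GZIP"},
    "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
    "RoleARN": "arn:aws:iam::<account_id>:role/last9-firehose-delivery-role",
    "S3BackupMode": "FailedDataOnly",
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::<account_id>:role/last9-firehose-delivery-role",
      "BucketARN": "arn:aws:s3:::your-backup-bucket"
    },
    "CloudWatchLoggingOptions": {
      "Enabled": true,
      "LogGroupName": "/aws/kinesisfirehose/last9-your-org",
      "LogStreamName": "HttpEndpointDelivery"
    }
  }'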
Step 4: Create CloudWatch Metric Stream
- In the CloudWatch → Metrics → Streams section, click Create metric stream
- Choose your metrics:
- All metrics: For initial observability
- Selective: For cost-controlled setups
- Set Firehose delivery stream as the destination
- Choose Output format:
OpenTelemetry 0.7
- Name your stream:
last9-{your-org-name}
- Leave state as Enabled
- Click Create
Tips for metric filtering:
- Start broad, then filter based on what’s useful
- Namespaces: Focus on Lambda, RDS, ECS, API Gateway
- Metrics: Avoid high-cardinality metrics unless they’re actionable
- Statistics: Sum, Avg, Max are common; adjust for alerting needs
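The console steps above have a CLI equivalent, sketched below with placeholder names. The role passed here is the one CloudWatch Metric Streams assumes to write into Firehose (trusted for streams.metrics.cloudwatch.amazonaws.com with firehose:PutRecord and firehose:PutRecordBatch), not the Firehose delivery role from Step 2.
aws cloudwatch put-metric-stream \
  --name last9-your-org \
  --firehose-arn arn:aws:firehose:<region>:<account_id>:deliverystream/last9-your-org \
  --role-arn arn:aws:iam::<account_id>:role/<metric-stream-to-firehose-role> \
  --output-format opentelemetry0.7 \
  --include-filters Namespace=AWS/Lambda Namespace=AWS/RDS Namespace=AWS/ECS Namespace=AWS/ApplicationELB
Drop --include-filters entirely to stream all namespaces during initial observability, then narrow later.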
Step 5: Verify Ingestion in Grafana
Within a few minutes, AWS metrics will begin appearing in Last9-powered Grafana dashboards. All CloudWatch namespaces are prefixed with amazonaws_com_AWS_.
Some examples to validate:
# Lambda function duration
amazonaws_com_AWS_Lambda_Duration{function_name="your-function-name"}
# RDS CPU Utilization
amazonaws_com_AWS_RDS_CPUUtilization{db_instance_identifier="your-db-instance"}
# ECS container CPU
amazonaws_com_AWS_ECS_CPUUtilization{cluster_name="your-cluster-name"}
Advanced Dashboard Implementation
Once CloudWatch metrics are flowing through Last9, you can build dashboards that directly tie AWS service behavior to application performance characteristics. Below are a few practical examples.
1. Lambda Performance and Cold Start Impact
Monitor Lambda execution behavior across P50–P99 durations, cold start latency, and downstream dependencies.
PromQL Queries:
Database connection usage:
db_connection_pool_active_connections{service="api-handler"}
Upstream request latency:
histogram_quantile(0.95, http_request_duration_seconds{handler="/api/users"})
Error rate vs. invocations:
rate(amazonaws_com_AWS_Lambda_Errors[5m]) / rate(amazonaws_com_AWS_Lambda_Invocations[5m])
Memory pressure:
amazonaws_com_AWS_Lambda_MemoryUtilization{function_name="api-handler"}
Cold start time and frequency:
amazonaws_com_AWS_Lambda_InitDuration{function_name="api-handler"}
rate(amazonaws_com_AWS_Lambda_InitDuration[5m])
Latency percentiles:
histogram_quantile(0.50, amazonaws_com_AWS_Lambda_Duration{function_name="api-handler"})
histogram_quantile(0.95, amazonaws_com_AWS_Lambda_Duration{function_name="api-handler"})
histogram_quantile(0.99, amazonaws_com_AWS_Lambda_Duration{function_name="api-handler"})
This dashboard helps correlate cold starts with tail latencies, track memory usage under burst loads, and detect application bottlenecks caused by external service calls or DB saturation.
2. RDS Metrics and Query Latency Correlation
Visualize infrastructure-level metrics alongside database and app-level behavior to spot query slowdowns, CPU pressure, or connection pool starvation.
PromQL Queries:
5xx error trends:
rate(http_requests_total{status=~"5.."}[5m])
Connection pool wait times:
db_connection_pool_wait_time_seconds
Query durations (app-side):
db_query_duration_seconds{query_type="SELECT"}
db_query_duration_seconds{query_type="INSERT"}
Disk I/O and latency:
amazonaws_com_AWS_RDS_ReadIOPS{db_instance_identifier="prod-db"}
amazonaws_com_AWS_RDS_WriteIOPS{db_instance_identifier="prod-db"}
amazonaws_com_AWS_RDS_ReadLatency{db_instance_identifier="prod-db"}
amazonaws_com_AWS_RDS_WriteLatency{db_instance_identifier="prod-db"}
Connection usage:
amazonaws_com_AWS_RDS_DatabaseConnections{db_instance_identifier="prod-db"}
CPU and credit balance:
amazonaws_com_AWS_RDS_CPUUtilization{db_instance_identifier="prod-db"}
amazonaws_com_AWS_RDS_CPUCreditBalance{db_instance_identifier="prod-db"}
Use this dashboard to detect when spikes in latency or errors align with increased connection usage, CPU saturation, or degraded IOPS. It also helps surface whether query-level slowness is infrastructure-bound or coming from upstream services.
3. ECS Resource Usage and Application Load
Track how ECS-managed workloads consume CPU/memory, scale task counts, and respond to application-level traffic patterns.
PromQL Queries:
Container restarts (Kube or ECS-level):
increase(container_restarts_total[1h])
Request volume and latency (from app metrics):
rate(http_requests_total[5m])
histogram_quantile(0.95, http_request_duration_seconds)
Network I/O:
amazonaws_com_AWS_ECS_NetworkRxBytes{cluster_name="production"}
amazonaws_com_AWS_ECS_NetworkTxBytes{cluster_name="production"}
Running and pending tasks:
amazonaws_com_AWS_ECS_RunningTaskCount{cluster_name="production", service_name=~".*"}
amazonaws_com_AWS_ECS_PendingTaskCount{cluster_name="production", service_name=~".*"}
Memory utilization:
amazonaws_com_AWS_ECS_MemoryUtilization{cluster_name="production", service_name=~".*"}
Service-level CPU usage:
amazonaws_com_AWS_ECS_CPUUtilization{cluster_name="production", service_name=~".*"}
This dashboard is useful for tracking scaling anomalies, resource constraints across ECS services, and changes in network or memory usage that could lead to degraded app performance or restarts.
Production Optimization Strategies
Cost Management
CloudWatch metric streaming costs can quickly add up, especially if you're streaming large volumes of data you don’t use. To control spend without compromising observability:
1. Namespace Filtering
Restrict to essential AWS services only. For most applications, that includes:
"IncludeFilters": [
{ "Namespace": "AWS/Lambda" },
{ "Namespace": "AWS/RDS" },
{ "Namespace": "AWS/ECS" },
{ "Namespace": "AWS/ApplicationELB" }
]
2. Metric-Level Filtering
Filter out noisy or low-value metrics. Stream only what's needed for alerting and performance tracking:
"MetricNames": [
"Duration", "Errors", "Invocations", // Lambda
"CPUUtilization", "DatabaseConnections", // RDS
"MemoryUtilization" // ECS
]
3. Statistic Selection
Avoid streaming every statistic. Focus on percentiles for SLO/SLA tracking:
"AdditionalStatistics": [ "p50", "p95", "p99" ]
Performance Tuning
Fine-tune Kinesis Firehose delivery settings based on your monitoring needs.
Buffer Configuration
- Low latency (near real-time):
  - Buffer size: 1 MB
  - Interval: 60 seconds
- High throughput (batch analytics):
  - Buffer size: 5 MB
  - Interval: 300 seconds
Compression
- Enable GZIP to reduce network overhead.
Monitor Delivery Stream Health
Track the performance of metric delivery using built-in CloudWatch metrics:
- Delivery success rate:
amazonaws_com_AWS_KinesisFirehose_DeliveryToHttpEndpoint_Success
- Delivery latency (freshness):
amazonaws_com_AWS_KinesisFirehose_DeliveryToHttpEndpoint_DataFreshness
- Processing failures:
amazonaws_com_AWS_KinesisFirehose_DeliveryToHttpEndpoint_ProcessingFailed
Error Handling and Monitoring
To prevent data loss during delivery failures:
S3 Backup Setup
- Configure a fallback S3 bucket for failed deliveries.
- Apply a 30-day retention policy for debugging or replay.
Alerting
- Set up CloudWatch Alarms on S3 object count to detect spikes in delivery failures.
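One way to wire that up, sketched with the AWS CLI: S3 only reports NumberOfObjects once a day under the AWS/S3 namespace, so this catches a sustained backlog of failed records rather than momentary blips. The bucket name and SNS topic are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name last9-firehose-backup-objects \
  --namespace AWS/S3 \
  --metric-name NumberOfObjects \
  --dimensions Name=BucketName,Value=your-backup-bucket Name=StorageType,Value=AllStorageTypes \
  --statistic Maximum \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:<region>:<account_id>:<your-alerts-topic>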
Example PromQL Queries
# Failed delivery rate over 5 minutes
rate(amazonaws_com_AWS_KinesisFirehose_DeliveryToHttpEndpoint_ProcessingFailed[5m])
# P95 delivery latency
histogram_quantile(0.95, rate(amazonaws_com_AWS_KinesisFirehose_DeliveryToHttpEndpoint_DataFreshness_bucket[5m]))
Troubleshoot CloudWatch Metric Streaming Issues
This section outlines common failure scenarios and how to systematically debug issues across CloudWatch, Kinesis Firehose, and Last9’s ingestion pipeline.
1. Metrics Are Not Appearing in Grafana
Check CloudWatch Metric Stream Status
Run the following command to check whether the metric stream is running:
aws cloudwatch get-metric-stream --name last9-your-org
If the returned State is anything other than running (for example, stopped), metrics won’t stream.
Validate Kinesis Firehose Configuration
Make sure the delivery stream is active and correctly set up to push data to an HTTP endpoint:
aws firehose describe-delivery-stream --delivery-stream-name last9-your-org
Look for DeliveryStreamStatus: ACTIVE and a valid HTTP destination.
Inspect Firehose Logs for Delivery Failures
Check if metrics are failing to reach Last9 due to network errors, timeouts, or invalid endpoints:
aws logs describe-log-groups --log-group-name-prefix /aws/kinesisfirehose/last9-your-org
Look for DeliveryToHttpEndpoint_Failure metrics or log entries with 4xx/5xx status codes.
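To see the actual error messages rather than just the log group, filter-log-events can pull recent entries; the log group name assumes the default Firehose naming used in this guide.
# Print delivery error messages from the last hour (start time in epoch milliseconds)
aws logs filter-log-events \
  --log-group-name /aws/kinesisfirehose/last9-your-org \
  --start-time $(( ($(date +%s) - 3600) * 1000 )) \
  --query 'events[].message' \
  --output text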
2. Unexpected Increase in AWS Costs
Analyze High-Volume Metric Namespaces
CloudWatch Metric Streams charge based on volume. Namespaces like AWS/EC2, AWS/ApplicationELB, or high-cardinality custom metrics can lead to high costs.
Use the following metric to understand volume:
amazonaws_com_AWS_CloudWatch_MetricStreamRecords{stream_name="last9-your-org"}
Review Firehose Data Throughput
Large payloads or inefficient batching can spike costs. Check:
amazonaws_com_AWS_KinesisFirehose_DeliveryToHttpEndpoint_Bytes
Reduce frequency, batch more aggressively, and enable GZIP compression if needed.
Audit S3 Backup Usage
If S3BackupMode is enabled, undelivered metrics may be stored, incurring additional storage charges.
Check the configured S3 bucket:
aws s3 ls s3://your-backup-bucket --recursive
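To keep that backup storage bounded, a 30-day lifecycle expiration on the bucket (matching the retention suggested earlier) is a simple option; the bucket name and rule ID are placeholders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket your-backup-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-failed-firehose-records",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 30}
    }]
  }'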
3. Metrics Are Delayed or Dropped
Monitor Delivery Buffer Limits
Firehose buffering settings may be too aggressive or too small. The relevant parameters are BufferingHints.SizeInMBs and BufferingHints.IntervalInSeconds.
Check if your stream regularly hits the upper threshold and adjust as needed:
# Get the current VersionId and DestinationId from:
#   aws firehose describe-delivery-stream --delivery-stream-name last9-your-org
aws firehose update-destination \
  --delivery-stream-name last9-your-org \
  --current-delivery-stream-version-id <version-id> \
  --destination-id <destination-id> \
  --http-endpoint-destination-update '{"BufferingHints":{"SizeInMBs":10,"IntervalInSeconds":300},"RequestConfiguration":{"ContentEncoding":"GZIP"}}'
Inspect Firehose Throughput Limits
Direct PUT delivery streams have per-stream throughput quotas (records per second, requests per second, and MB per second; defaults vary by region). Exceeding them causes throttling. To handle more volume:
- Request a service quota increase for the delivery stream
- Split traffic across multiple delivery streams (if applicable)
Investigate Endpoint Health
If Last9’s ingestion endpoint is overloaded or rate-limited, Firehose retries can add latency or drop metrics. Check for:
- Increased DeliveryToHttpEndpoint_Failure metrics
- Elevated retry attempts
- 429 or 5xx responses in logs
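To check this from the AWS side, the delivery success ratio can be pulled directly from CloudWatch; the sketch below assumes the standard AWS/Firehose metric names and GNU date for the timestamps.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToHttpEndpoint.Success \
  --dimensions Name=DeliveryStreamName,Value=last9-your-org \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average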
Architecture Patterns for CloudWatch Metric Streaming
Choose integration patterns based on your architecture type. Below are the key CloudWatch namespaces and metric selectors to stream into Last9 for effective monitoring and troubleshooting.
Observability Patterns for Microservices on AWS
In a microservices setup, AWS services are often split across compute, data, and messaging layers. To monitor these distributed components, stream from these CloudWatch namespaces:
Recommended CloudWatch Namespaces:
- AWS/Lambda – Event-driven compute
- AWS/RDS – Relational database layer
- AWS/ElastiCache – In-memory cache stores
- AWS/SQS – Queueing and decoupling services
- AWS/ApplicationELB – Load balancing
- AWS/ApiGateway – API management and routing
These provide coverage across service interactions—ideal for tracing tail latency, retries, or timeouts across upstream and downstream dependencies.
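As a sketch, the metric stream filters for this pattern mirror the Step 4 CLI call, with the include-filters list swapped for these namespaces (same placeholders as before):
aws cloudwatch put-metric-stream \
  --name last9-your-org \
  --firehose-arn arn:aws:firehose:<region>:<account_id>:deliverystream/last9-your-org \
  --role-arn arn:aws:iam::<account_id>:role/<metric-stream-to-firehose-role> \
  --output-format opentelemetry0.7 \
  --include-filters Namespace=AWS/Lambda Namespace=AWS/RDS Namespace=AWS/ElastiCache \
      Namespace=AWS/SQS Namespace=AWS/ApplicationELB Namespace=AWS/ApiGateway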
Monitoring Serverless Architectures with CloudWatch
Serverless systems depend on tightly coupled services like Lambda, API Gateway, and DynamoDB. These metric selectors help track cold starts, latency spikes, and throughput issues.
Key Lambda, API Gateway, and DynamoDB Metrics:
# Lambda duration and concurrency
amazonaws_com_AWS_Lambda_Duration{function_name=~".*"}
amazonaws_com_AWS_Lambda_ConcurrentExecutions{function_name=~".*"}
# API Gateway latency and error rates
amazonaws_com_AWS_ApiGateway_Latency{api_name=~".*"}
amazonaws_com_AWS_ApiGateway_4XXError{api_name=~".*"}
# DynamoDB table capacity
amazonaws_com_AWS_DynamoDB_ConsumedReadCapacityUnits{table_name=~".*"}
amazonaws_com_AWS_DynamoDB_ConsumedWriteCapacityUnits{table_name=~".*"}
Use these to detect concurrency bottlenecks, misconfigured throttling, or excessive cold starts under load.
Observability for Container-Based Deployments (ECS/EKS)
When running applications on ECS or Kubernetes (EKS), metric visibility must include both cluster-level resource usage and service-level performance.
ECS Resource and Task Metrics:
# Cluster resource usage
amazonaws_com_AWS_ECS_CPUUtilization{cluster_name="production"}
amazonaws_com_AWS_ECS_MemoryUtilization{cluster_name="production"}
# Scaling signals
amazonaws_com_AWS_ECS_RunningTaskCount{service_name=~".*"}
amazonaws_com_AWS_ECS_PendingTaskCount{service_name=~".*"}
Combine with Application-Level Metrics:
# Request rate and latency histograms
rate(http_requests_total[5m])
histogram_quantile(0.95, http_request_duration_seconds)
This pairing enables direct correlation between infra behavior (like CPU spikes) and application symptoms (like slow response times or elevated error rates).
Final Notes
CloudWatch collects a ton of useful metrics, but getting them out in a usable, cost-efficient, and queryable form is where most setups fall short.
By streaming CloudWatch metrics directly to Last9, you skip the painful parts: no polling, no re-learning query syntax, no brittle dashboard workarounds. Just clean Prometheus-style metrics you can use.
This setup works well across stacks, be it serverless, containers, or traditional EC2-based microservices. And once it’s running, your team gets actual visibility into AWS workloads, not just surface-level graphs.