AWS ECS
Monitor AWS ECS clusters and containers with OpenTelemetry Container Insights for comprehensive containerized application observability
Monitor your AWS ECS (Elastic Container Service) clusters and containers with OpenTelemetry Container Insights integration. This setup provides comprehensive monitoring of container performance, resource utilization, task health, and cluster-wide metrics.
Prerequisites
Before setting up AWS ECS monitoring, ensure you have:
- AWS ECS Cluster: Running ECS cluster with tasks to monitor
- Container Runtime: Docker or containerd runtime
- Administrative Access: Root permissions to install and configure monitoring components
- Network Access: Outbound connectivity to Last9 endpoints
- Last9 Account: With OpenTelemetry integration credentials
Supported Deployment Models
This integration supports both ECS deployment models:
- ECS on EC2: Self-managed EC2 instances running ECS tasks
- AWS Fargate: Serverless container platform (with additional configuration)
Install OpenTelemetry Collector
Install the OpenTelemetry Collector with AWS Container Insights receiver:
For Debian/Ubuntu systems on ECS EC2 instances:
```bash
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
```
For Red Hat/CentOS systems on ECS EC2 instances:
```bash
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
```
For running as an ECS service, use this task definition snippet:
```json
{
  "family": "otel-collector-ecs",
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "networkMode": "bridge",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
      "essential": true,
      "command": ["--config=/etc/ecs/otel-config.yaml"],
      "mountPoints": [
        {
          "sourceVolume": "docker-sock",
          "containerPath": "/var/run/docker.sock",
          "readOnly": true
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "docker-sock",
      "host": {
        "sourcePath": "/var/run/docker.sock"
      }
    }
  ]
}
```
Configure AWS Permissions
Set up the necessary IAM permissions for ECS monitoring:
Attach this policy to your ECS EC2 instance role:
{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["ecs:ListClusters","ecs:ListContainerInstances","ecs:DescribeContainerInstances","ecs:ListServices","ecs:DescribeServices","ecs:ListTasks","ecs:DescribeTasks","ec2:DescribeInstances"],"Resource": "*"}]}Create a task role for the collector running as ECS service:
{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["ecs:ListClusters","ecs:ListContainerInstances","ecs:DescribeContainerInstances","ecs:ListServices","ecs:DescribeServices","ecs:ListTasks","ecs:DescribeTasks"],"Resource": "*"}]} -
Create OpenTelemetry Collector Configuration
Create the collector configuration file for ECS monitoring:
```bash
sudo mkdir -p /etc/otelcol-contrib
sudo nano /etc/otelcol-contrib/config.yaml
```
Add the following configuration to collect ECS Container Insights metrics:
```yaml
receivers:
  awscontainerinsightreceiver:
    collection_interval: 60s
    container_orchestrator: ecs
    # Add cluster name if running on specific cluster
    # cluster_name: "production-cluster"
    # Configure metric types to collect
    metric_types:
      - "cadvisor"  # Container resource metrics
      - "disk"      # Disk usage metrics
      - "diskio"    # Disk I/O metrics
      - "memory"    # Memory usage metrics
      - "network"   # Network metrics
      - "cpu"       # CPU metrics

processors:
  batch:
    timeout: 30s
    send_batch_size: 10000
    send_batch_max_size: 10000
  resourcedetection/aws:
    detectors: ["ecs", "ec2", "aws"]
    timeout: 2s
    override: false
  resource/ecs:
    attributes:
      - key: Timestamp
        action: delete
      - key: service.name
        value: "aws-ecs"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert

exporters:
  otlp/last9:
    endpoint: "$last9_otlp_endpoint"
    headers:
      "Authorization": "$last9_otlp_auth_header"
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [resourcedetection/aws, resource/ecs, batch]
      exporters: [otlp/last9]
```
Configure for Fargate (Optional)
If using AWS Fargate, additional configuration is needed:
```yaml
receivers:
  awscontainerinsightreceiver:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "fargate-cluster"
    # Fargate-specific configuration
    fargate:
      enabled: true
      # Fargate tasks don't have direct cAdvisor access
      use_fargate_metrics: true

processors:
  resourcedetection/aws:
    detectors: ["ecs", "aws"]
    ecs:
      # Fargate resource detection
      resource_arn_key: "aws.ecs.task.arn"
```
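For context on why Fargate needs separate handling: Fargate tasks have no host cAdvisor, so per-container stats come from the ECS task metadata endpoint instead. The snippet below sketches what consuming that endpoint looks like; the curl call only works from inside a running task, and the sample JSON shape is trimmed for illustration:

```shell
# From a container inside a Fargate task (illustrative, not runnable elsewhere):
#   curl -s "${ECS_CONTAINER_METADATA_URI_V4}/task/stats"
# The response maps container IDs to docker-stats-style JSON. A trimmed example:
python3 - <<'EOF'
import json
sample = '{"abc123": {"memory_stats": {"usage": 52428800}}}'  # trimmed example shape
stats = json.loads(sample)
for cid, s in stats.items():
    print(cid, s["memory_stats"]["usage"])  # prints: abc123 52428800
EOF
```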
Create Systemd Service Configuration
For EC2 deployment, create a systemd service:
```bash
sudo nano /etc/systemd/system/otelcol-contrib.service
```
Add the service configuration with required permissions:
```ini
[Unit]
Description=OpenTelemetry Collector for AWS ECS Monitoring
After=network.target

[Service]
ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
Restart=always
User=root
Group=root
# Environment variables
Environment=AWS_REGION=us-east-1
# Required for container access
SupplementaryGroups=docker

[Install]
WantedBy=multi-user.target
```
Note: Root permissions are required to access container runtime sockets and system files.
Start and Enable the Service
Start the OpenTelemetry Collector service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable otelcol-contrib
sudo systemctl start otelcol-contrib
```
Understanding ECS Metrics
The AWS Container Insights receiver collects comprehensive ECS metrics:
Cluster-Level Metrics
- Cluster Resource Utilization: CPU, memory, and network utilization across the cluster
- Task Count: Running, pending, and stopped tasks
- Service Health: Service status and desired task counts
- Instance Health: EC2 instance health for ECS clusters
Task-Level Metrics
- Task Resource Usage: CPU, memory, disk, and network usage per task
- Task Status: Task lifecycle states (pending, running, stopped)
- Task Duration: Time tasks spend in different states
- Task Placement: Task distribution across container instances
Container-Level Metrics
- Container Resources: CPU usage, memory usage, memory limits per container
- Container I/O: Disk read/write operations and network traffic
- Container State: Container health and restart counts
- Container Logs: Container log metrics and error counts
Service-Level Metrics
- Service Utilization: Resource usage aggregated by ECS service
- Deployment Metrics: Service deployment success/failure rates
- Auto Scaling: Service scaling events and capacity changes
- Load Balancer: ALB/NLB health check results for services
Advanced Configuration
Multi-Cluster Monitoring
Monitor multiple ECS clusters from a single collector:
```yaml
receivers:
  awscontainerinsightreceiver/prod:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "production-cluster"
  awscontainerinsightreceiver/staging:
    collection_interval: 120s
    container_orchestrator: ecs
    cluster_name: "staging-cluster"

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver/prod, awscontainerinsightreceiver/staging]
```
Resource Attribution Enhancement
Add detailed resource attributes for better observability:
```yaml
processors:
  resource/ecs:
    attributes:
      - key: aws.ecs.cluster.name
        from_attribute: "ClusterName"
        action: upsert
      - key: aws.ecs.service.name
        from_attribute: "ServiceName"
        action: upsert
      - key: aws.ecs.task.family
        from_attribute: "TaskDefinitionFamily"
        action: upsert
      - key: aws.ecs.task.revision
        from_attribute: "TaskDefinitionRevision"
        action: upsert
      - key: environment
        value: "production"
        action: upsert
```
Custom Metric Filtering
Filter metrics to reduce data volume:
```yaml
processors:
  filter/ecs:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - "container_memory_cache"
          - "container_memory_rss"
      # Only include metrics matching certain criteria
      include:
        match_type: regexp
        metric_names:
          - "container_cpu_.*"
          - "container_memory_usage_bytes"
          - "container_network_.*"
```
Performance Optimization
Optimize collector performance for high-scale environments:
```yaml
receivers:
  awscontainerinsightreceiver:
    collection_interval: 30s  # More frequent collection
    timeout: 10s

processors:
  batch:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 8000
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
```
Verification
Check Service Status
Verify the OpenTelemetry Collector is running:
```bash
sudo systemctl status otelcol-contrib
```
Monitor Service Logs
Check for any configuration errors:
```bash
sudo journalctl -u otelcol-contrib -f
```
Look for successful receiver initialization messages and metric collection activity.
Verify ECS Access
Test ECS API connectivity:
```bash
# List ECS clusters
aws ecs list-clusters --region us-east-1

# Describe cluster
aws ecs describe-clusters --clusters your-cluster-name --region us-east-1

# List running tasks
aws ecs list-tasks --cluster your-cluster-name --region us-east-1
```
Check Docker Socket Access
Verify the collector can access container runtime:
```bash
# Check Docker socket permissions
sudo ls -la /var/run/docker.sock

# Test Docker access
sudo docker ps

# Check if collector user can access Docker
sudo -u root docker ps
```
Generate Container Activity
Deploy test tasks to generate metrics:
```bash
# Create a simple task definition
aws ecs register-task-definition \
  --family test-monitoring \
  --container-definitions '[
    {
      "name": "test-container",
      "image": "nginx:latest",
      "memory": 128,
      "essential": true,
      "portMappings": [{"containerPort": 80, "protocol": "tcp"}]
    }
  ]'

# Run the task
aws ecs run-task \
  --cluster your-cluster-name \
  --task-definition test-monitoring:1
```
Verify Metrics in Last9
Log into your Last9 account and check that ECS metrics are being received in Grafana.
Look for metrics like:
```
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_network_receive_bytes_total
ecs_cluster_cpu_utilization
```
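Assuming these metrics land in a Prometheus-compatible store behind Grafana, queries along the following lines can chart container load. This is a sketch: the exact label names depend on the resource attributes your collector attaches, so adjust them to what you actually see in the metric explorer.

```promql
# Per-container CPU consumption over the last 5 minutes
rate(container_cpu_usage_seconds_total[5m])

# Network ingress aggregated per cluster (ClusterName label is an assumption)
sum by (ClusterName) (rate(container_network_receive_bytes_total[5m]))
```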
Key Metrics to Monitor
Critical Container Health Indicators
| Metric | Description | Alert Threshold |
|---|---|---|
| container_memory_usage_bytes | Container memory usage | > 80% of limit |
| container_cpu_usage_seconds_total | Container CPU usage | > 80% consistently |
| container_oom_events_total | Out-of-memory events | > 0 events |
| ecs_task_running_count | Running task count | < desired count |
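The thresholds above translate naturally into alerting rules. The fragment below is a hedged sketch in Prometheus-style rule syntax; metric names are taken from this table, but verify they match what your collector emits (and adapt the syntax to your alerting backend) before relying on it:

```yaml
# Sketch of alert rules mirroring the thresholds above (names are assumptions).
groups:
  - name: ecs-container-health
    rules:
      - alert: ContainerOOM
        # Any OOM kill in the last 5 minutes is worth paging on
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: critical
      - alert: ECSTaskShortfall
        # Running tasks persistently below the desired count
        expr: ecs_service_desired_count - ecs_task_running_count > 0
        for: 5m
        labels:
          severity: warning
```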
Performance Monitoring
| Metric | Description | Monitoring Focus |
|---|---|---|
| container_network_receive_bytes_total | Network ingress traffic | Track application load |
| container_network_transmit_bytes_total | Network egress traffic | Monitor data transfer |
| container_fs_reads_bytes_total | Filesystem read operations | I/O performance |
| container_fs_writes_bytes_total | Filesystem write operations | Storage performance |
Cluster Health
| Metric | Description | Alert Condition |
|---|---|---|
| ecs_cluster_active_services_count | Active services count | Unexpected changes |
| ecs_cluster_registered_container_instances_count | Available EC2 instances | Capacity monitoring |
| ecs_service_desired_count | Desired task count | Mismatch vs. running count |
Troubleshooting
Collector Permission Issues
Cannot Access Docker Socket:
```bash
# Check Docker socket permissions
sudo ls -la /var/run/docker.sock

# Add collector user to docker group (if running as non-root)
sudo usermod -aG docker otelcol-contrib

# For systemd service, ensure it runs as root
sudo systemctl edit otelcol-contrib
# Add: [Service]
#      User=root
```
ECS API Access Denied:
```bash
# Verify AWS credentials
aws sts get-caller-identity

# Test ECS permissions
aws ecs list-clusters --region us-east-1

# Check IAM role attached to EC2 instance
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
```
Missing Metrics
No Container Metrics:
```bash
# Verify containers are running
docker ps

# Check if cAdvisor metrics are available
curl http://localhost:8080/metrics  # If cAdvisor is exposed

# Verify ECS agent is running
sudo systemctl status ecs
```
Incomplete ECS Metrics:
```bash
# Check ECS cluster health
aws ecs describe-clusters --clusters your-cluster-name

# Verify tasks are running
aws ecs list-tasks --cluster your-cluster-name

# Check task definitions
aws ecs describe-task-definition --task-definition your-task:1
```
Fargate-Specific Issues
Fargate Metrics Not Available:
- Verify the collector is running within the same VPC as Fargate tasks
- Check that Fargate platform version supports Container Insights
- Ensure proper task role permissions for metric collection
High Resource Usage
Collector Using Too Much Memory:
```yaml
processors:
  memory_limiter:
    limit_mib: 256       # Reduce memory limit
    spike_limit_mib: 64

receivers:
  awscontainerinsightreceiver:
    collection_interval: 120s  # Reduce collection frequency
```
Best Practices
Security
- Root Access: Only run collector as root when necessary for container access
- Network Policies: Implement VPC security groups to restrict collector access
- IAM Roles: Use IAM roles instead of access keys for AWS authentication
- Secret Management: Store sensitive configuration in AWS Secrets Manager
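One way to keep the `$last9_otlp_endpoint` and `$last9_otlp_auth_header` values used in the collector config out of the unit file is a systemd drop-in that loads them from a root-only environment file. The paths below are illustrative, not required by the collector:

```ini
# /etc/systemd/system/otelcol-contrib.service.d/override.conf (illustrative path)
[Service]
# Load credentials from a root-only file instead of hardcoding them in the unit
EnvironmentFile=/etc/otelcol-contrib/last9.env
```

The referenced `last9.env` would contain `last9_otlp_endpoint=...` and `last9_otlp_auth_header=...` lines and should be readable by root only (`chmod 600`); for fully managed secrets, fetch these values from AWS Secrets Manager at deploy time instead.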
Performance
- Collection Intervals: Balance monitoring granularity with resource usage
- Metric Filtering: Filter out unnecessary metrics to reduce data volume
- Resource Limits: Set appropriate memory and CPU limits for the collector
- Batch Processing: Optimize batch sizes for efficient data transmission
Monitoring Strategy
- Multi-Layer Monitoring: Monitor cluster, service, task, and container levels
- Alerting: Set up alerts for critical metrics like OOM events and high resource usage
- Capacity Planning: Monitor resource utilization trends for scaling decisions
- Cost Optimization: Use appropriate collection intervals to balance cost and visibility
Deployment
- High Availability: Deploy collectors on multiple AZs for redundancy
- Service Discovery: Use ECS service discovery for dynamic service monitoring
- Rolling Updates: Implement rolling updates for collector configuration changes
- Health Checks: Configure health checks for collector containers
Need Help?
If you encounter any issues or have questions:
- Join our Discord community for real-time support
- Contact our support team at support@last9.io