AWS ECS

Monitor AWS ECS clusters and containers with OpenTelemetry Container Insights for comprehensive containerized application observability

Monitor your AWS ECS (Elastic Container Service) clusters and containers with OpenTelemetry Container Insights integration. This setup provides comprehensive monitoring of container performance, resource utilization, task health, and cluster-wide metrics.

Prerequisites

Before setting up AWS ECS monitoring, ensure you have:

  • AWS ECS Cluster: Running ECS cluster with tasks to monitor
  • Container Runtime: Docker or containerd runtime
  • Administrative Access: Root permissions to install and configure monitoring components
  • Network Access: Outbound connectivity to Last9 endpoints
  • Last9 Account: With OpenTelemetry integration credentials

Supported Deployment Models

This integration supports both ECS deployment models:

  • ECS on EC2: Self-managed EC2 instances running ECS tasks
  • AWS Fargate: Serverless container platform (with additional configuration)
Setup

  1. Install OpenTelemetry Collector

    Install the OpenTelemetry Collector with the AWS Container Insights receiver:

    For Debian/Ubuntu systems on ECS EC2 instances:

    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
    sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
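    To confirm the install, you can print the collector version (this assumes the package placed the binary on your PATH, as the .deb normally does):

    otelcol-contrib --version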
  2. Configure AWS Permissions

    Set up the necessary IAM permissions for ECS monitoring:

    Attach this policy to your ECS EC2 instance role:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ecs:ListClusters",
            "ecs:ListContainerInstances",
            "ecs:DescribeContainerInstances",
            "ecs:ListServices",
            "ecs:DescribeServices",
            "ecs:ListTasks",
            "ecs:DescribeTasks",
            "ec2:DescribeInstances"
          ],
          "Resource": "*"
        }
      ]
    }
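    If you manage the instance role from the CLI, here is a rough sketch of attaching the statement above as an inline policy; the role name ecsInstanceRole and the file name are assumptions to adjust for your environment:

    # Save the JSON above as ecs-monitoring-policy.json, then attach it to the instance role
    aws iam put-role-policy \
      --role-name ecsInstanceRole \
      --policy-name ecs-monitoring-read \
      --policy-document file://ecs-monitoring-policy.json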
  3. Create OpenTelemetry Collector Configuration

    Create the collector configuration file for ECS monitoring:

    sudo mkdir -p /etc/otelcol-contrib
    sudo nano /etc/otelcol-contrib/config.yaml

    Add the following configuration to collect ECS Container Insights metrics:

    receivers:
      awscontainerinsightreceiver:
        collection_interval: 60s
        container_orchestrator: ecs
        # Add cluster name if running on a specific cluster
        # cluster_name: "production-cluster"
        # Configure metric types to collect
        metric_types:
          - "cadvisor" # Container resource metrics
          - "disk"     # Disk usage metrics
          - "diskio"   # Disk I/O metrics
          - "memory"   # Memory usage metrics
          - "network"  # Network metrics
          - "cpu"      # CPU metrics
    processors:
      batch:
        timeout: 30s
        send_batch_size: 10000
        send_batch_max_size: 10000
      resourcedetection/aws:
        detectors: ["ecs", "ec2", "aws"]
        timeout: 2s
        override: false
      resource/ecs:
        attributes:
          - key: Timestamp
            action: delete
          - key: service.name
            value: "aws-ecs"
            action: upsert
          - key: deployment.environment
            value: "production"
            action: upsert
    exporters:
      otlp/last9:
        endpoint: "$last9_otlp_endpoint"
        headers:
          "Authorization": "$last9_otlp_auth_header"
      debug:
        verbosity: detailed
    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [resourcedetection/aws, resource/ecs, batch]
          exporters: [otlp/last9]
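    Before starting the service, you can ask the collector to check the file for syntax and unknown fields; recent otelcol-contrib builds ship a validate subcommand (if yours does not, starting the binary in the foreground surfaces the same errors):

    otelcol-contrib validate --config /etc/otelcol-contrib/config.yaml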
  4. Configure for Fargate (Optional)

    If using AWS Fargate, additional configuration is needed:

    receivers:
      awscontainerinsightreceiver:
        collection_interval: 60s
        container_orchestrator: ecs
        cluster_name: "fargate-cluster"
        # Fargate-specific configuration
        fargate:
          enabled: true
          # Fargate tasks don't have direct cAdvisor access
          use_fargate_metrics: true
    processors:
      resourcedetection/aws:
        detectors: ["ecs", "aws"]
        ecs:
          # Fargate resource detection
          resource_arn_key: "aws.ecs.task.arn"
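    On Fargate there is no host to install a package on, so a common pattern is to run the collector as a sidecar container inside each task (AWS also publishes the ADOT collector image for this purpose). A minimal sketch of the sidecar's container definition; the image tag, container name, and config path are assumptions, and many teams instead bake the configuration into a custom image or load it from SSM:

    {
      "name": "otel-collector",
      "image": "otel/opentelemetry-collector-contrib:0.118.0",
      "essential": true,
      "command": ["--config", "/etc/otelcol-contrib/config.yaml"]
    }

    Marking the sidecar essential ties task health to the collector; set "essential": false if the application should keep running when the collector fails.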
  5. Create Systemd Service Configuration

    For EC2 deployment, create a systemd service:

    sudo nano /etc/systemd/system/otelcol-contrib.service

    Add the service configuration with required permissions:

    [Unit]
    Description=OpenTelemetry Collector for AWS ECS Monitoring
    After=network.target
    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
    Restart=always
    User=root
    Group=root
    # Environment variables
    Environment=AWS_REGION=us-east-1
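    # The config.yaml above reads $last9_otlp_endpoint and $last9_otlp_auth_header from the
    # environment; define them here with your Last9 values (placeholders shown, not real values)
    # Environment=last9_otlp_endpoint=<your-last9-otlp-endpoint>
    # Environment=last9_otlp_auth_header=<your-last9-auth-header>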
    # Required for container access
    SupplementaryGroups=docker
    [Install]
    WantedBy=multi-user.target

    Note: Root permissions are required to access container runtime sockets and system files.

  6. Start and Enable the Service

    Start the OpenTelemetry Collector service:

    sudo systemctl daemon-reload
    sudo systemctl enable otelcol-contrib
    sudo systemctl start otelcol-contrib
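
    As an additional sanity check, the collector exposes its own internal metrics, by default on port 8888 (unless overridden under service: telemetry:); receiver counters appearing there indicate metrics are being collected:

    curl -s http://localhost:8888/metrics | grep otelcol_receiver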

Understanding ECS Metrics

The AWS Container Insights receiver collects comprehensive ECS metrics:

Cluster-Level Metrics

  • Cluster Resource Utilization: CPU, memory, and network utilization across the cluster
  • Task Count: Running, pending, and stopped tasks
  • Service Health: Service status and desired task counts
  • Instance Health: EC2 instance health for ECS clusters

Task-Level Metrics

  • Task Resource Usage: CPU, memory, disk, and network usage per task
  • Task Status: Task lifecycle states (pending, running, stopped)
  • Task Duration: Time tasks spend in different states
  • Task Placement: Task distribution across container instances

Container-Level Metrics

  • Container Resources: CPU usage, memory usage, memory limits per container
  • Container I/O: Disk read/write operations and network traffic
  • Container State: Container health and restart counts
  • Container Logs: Container log metrics and error counts

Service-Level Metrics

  • Service Utilization: Resource usage aggregated by ECS service
  • Deployment Metrics: Service deployment success/failure rates
  • Auto Scaling: Service scaling events and capacity changes
  • Load Balancer: ALB/NLB health check results for services

Advanced Configuration

Multi-Cluster Monitoring

Monitor multiple ECS clusters from a single collector:

receivers:
  awscontainerinsightreceiver/prod:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "production-cluster"
  awscontainerinsightreceiver/staging:
    collection_interval: 120s
    container_orchestrator: ecs
    cluster_name: "staging-cluster"

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver/prod, awscontainerinsightreceiver/staging]
      # Reuse the processors and exporters defined in the base configuration
      processors: [resourcedetection/aws, resource/ecs, batch]
      exporters: [otlp/last9]

Resource Attribution Enhancement

Add detailed resource attributes for better observability:

processors:
  resource/ecs:
    attributes:
      - key: aws.ecs.cluster.name
        from_attribute: "ClusterName"
        action: upsert
      - key: aws.ecs.service.name
        from_attribute: "ServiceName"
        action: upsert
      - key: aws.ecs.task.family
        from_attribute: "TaskDefinitionFamily"
        action: upsert
      - key: aws.ecs.task.revision
        from_attribute: "TaskDefinitionRevision"
        action: upsert
      - key: environment
        value: "production"
        action: upsert

Custom Metric Filtering

Filter metrics to reduce data volume:

processors:
  filter/ecs:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - "container_memory_cache"
          - "container_memory_rss"
      # Only include metrics matching certain criteria
      include:
        match_type: regexp
        metric_names:
          - "container_cpu_.*"
          - "container_memory_usage_bytes"
          - "container_network_.*"

Performance Optimization

Optimize collector performance for high-scale environments:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 30s # More frequent collection
    timeout: 10s

processors:
  batch:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 8000
  memory_limiter:
    check_interval: 1s # Required by the memory_limiter processor
    limit_mib: 512
    spike_limit_mib: 128

Verification

  1. Check Service Status

    Verify the OpenTelemetry Collector is running:

    sudo systemctl status otelcol-contrib
  2. Monitor Service Logs

    Check for any configuration errors:

    sudo journalctl -u otelcol-contrib -f

    Look for successful receiver initialization messages and metric collection activity.

  3. Verify ECS Access

    Test ECS API connectivity:

    # List ECS clusters
    aws ecs list-clusters --region us-east-1
    # Describe cluster
    aws ecs describe-clusters --clusters your-cluster-name --region us-east-1
    # List running tasks
    aws ecs list-tasks --cluster your-cluster-name --region us-east-1
  4. Check Docker Socket Access

    Verify the collector can access container runtime:

    # Check Docker socket permissions
    sudo ls -la /var/run/docker.sock
    # Test Docker access
    sudo docker ps
    # Check if collector user can access Docker
    sudo -u root docker ps
  5. Generate Container Activity

    Deploy test tasks to generate metrics:

    # Create a simple task definition
    aws ecs register-task-definition \
      --family test-monitoring \
      --container-definitions '[{
        "name": "test-container",
        "image": "nginx:latest",
        "memory": 128,
        "essential": true,
        "portMappings": [{
          "containerPort": 80,
          "protocol": "tcp"
        }]
      }]'
    # Run the task
    aws ecs run-task \
      --cluster your-cluster-name \
      --task-definition test-monitoring:1
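
    Once metrics appear, you can stop the test task and deregister the test definition; the task ID placeholder below comes from the run-task output:

    # Stop the test task (use the task ID/ARN returned by run-task)
    aws ecs stop-task --cluster your-cluster-name --task <task-id>
    # Remove the test task definition revision
    aws ecs deregister-task-definition --task-definition test-monitoring:1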
  6. Verify Metrics in Last9

    Log into your Last9 account and check that ECS metrics are being received in Grafana.

    Look for metrics like:

    • container_cpu_usage_seconds_total
    • container_memory_usage_bytes
    • container_network_receive_bytes_total
    • ecs_cluster_cpu_utilization

Key Metrics to Monitor

Critical Container Health Indicators

| Metric | Description | Alert Threshold |
|---|---|---|
| container_memory_usage_bytes | Container memory usage | > 80% of limit |
| container_cpu_usage_seconds_total | Container CPU usage | > 80% consistently |
| container_oom_events_total | Out-of-memory events | > 0 events |
| ecs_task_running_count | Running task count | < desired count |
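
If you alert with Prometheus-compatible rules (for example in Grafana or Last9 alerting), here is a minimal sketch covering two of the thresholds above; rule names and evaluation windows are assumptions, and the cross-metric comparison may need explicit label matching depending on the labels your setup emits:

groups:
  - name: ecs-container-health # hypothetical group name
    rules:
      - alert: ContainerOOMEvents
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: critical
      - alert: ECSRunningBelowDesired
        expr: ecs_task_running_count < ecs_service_desired_count
        for: 10m
        labels:
          severity: warning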

Performance Monitoring

| Metric | Description | Monitoring Focus |
|---|---|---|
| container_network_receive_bytes_total | Network ingress traffic | Track application load |
| container_network_transmit_bytes_total | Network egress traffic | Monitor data transfer |
| container_fs_reads_bytes_total | Filesystem read operations | I/O performance |
| container_fs_writes_bytes_total | Filesystem write operations | Storage performance |

Cluster Health

| Metric | Description | Alert Condition |
|---|---|---|
| ecs_cluster_active_services_count | Active services count | Unexpected changes |
| ecs_cluster_registered_container_instances_count | Available EC2 instances | Capacity monitoring |
| ecs_service_desired_count | Desired task count | Mismatch with running count |

Troubleshooting

Collector Permission Issues

Cannot Access Docker Socket:

# Check Docker socket permissions
sudo ls -la /var/run/docker.sock
# Add collector user to docker group (if running as non-root)
sudo usermod -aG docker otelcol-contrib
# For systemd service, ensure it runs as root
sudo systemctl edit otelcol-contrib
# Add: [Service]
# User=root

ECS API Access Denied:

# Verify AWS credentials
aws sts get-caller-identity
# Test ECS permissions
aws ecs list-clusters --region us-east-1
# Check IAM role attached to EC2 instance
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/

Missing Metrics

No Container Metrics:

# Verify containers are running
docker ps
# Check if cAdvisor metrics are available
curl http://localhost:8080/metrics # If cAdvisor is exposed
# Verify ECS agent is running
sudo systemctl status ecs

Incomplete ECS Metrics:

# Check ECS cluster health
aws ecs describe-clusters --clusters your-cluster-name
# Verify tasks are running
aws ecs list-tasks --cluster your-cluster-name
# Check task definitions
aws ecs describe-task-definition --task-definition your-task:1

Fargate-Specific Issues

Fargate Metrics Not Available:

  • Verify the collector is running within the same VPC as Fargate tasks
  • Check that Fargate platform version supports Container Insights
  • Ensure proper task role permissions for metric collection

High Resource Usage

Collector Using Too Much Memory:

processors:
  memory_limiter:
    check_interval: 1s # Required by the memory_limiter processor
    limit_mib: 256 # Reduce memory limit
    spike_limit_mib: 64

receivers:
  awscontainerinsightreceiver:
    collection_interval: 120s # Reduce collection frequency

Best Practices

Security

  • Root Access: Run the collector as root only when container runtime access requires it
  • Network Policies: Implement VPC security groups to restrict collector access
  • IAM Roles: Use IAM roles instead of access keys for AWS authentication
  • Secret Management: Store sensitive configuration in AWS Secrets Manager
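
For example, the Last9 auth header referenced in the collector configuration could live in Secrets Manager rather than on disk; a hedged sketch, with the secret name chosen arbitrarily:

# Store the header once (replace the placeholder with your real value)
aws secretsmanager create-secret \
  --name last9/otlp-auth-header \
  --secret-string '<your-last9-auth-header>'
# Retrieve it when rendering the collector environment
aws secretsmanager get-secret-value \
  --secret-id last9/otlp-auth-header \
  --query SecretString --output text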

Performance

  • Collection Intervals: Balance monitoring granularity with resource usage
  • Metric Filtering: Filter out unnecessary metrics to reduce data volume
  • Resource Limits: Set appropriate memory and CPU limits for the collector
  • Batch Processing: Optimize batch sizes for efficient data transmission

Monitoring Strategy

  • Multi-Layer Monitoring: Monitor cluster, service, task, and container levels
  • Alerting: Set up alerts for critical metrics like OOM events and high resource usage
  • Capacity Planning: Monitor resource utilization trends for scaling decisions
  • Cost Optimization: Use appropriate collection intervals to balance cost and visibility

Deployment

  • High Availability: Deploy collectors on multiple AZs for redundancy
  • Service Discovery: Use ECS service discovery for dynamic service monitoring
  • Rolling Updates: Implement rolling updates for collector configuration changes
  • Health Checks: Configure health checks for collector containers
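
For collector containers, an ECS health check can probe the collector's health_check extension (port 13133 by default). This sketch assumes you add that extension to the collector configuration, which the config above does not include, and that the image contains curl (the stock upstream image does not, so an ADOT or custom image is typically used):

"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:13133/ || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3
}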

Need Help?

If you encounter any issues or have questions: