AWS ECS
Monitor AWS ECS clusters and containers with OpenTelemetry Container Insights for comprehensive containerized application observability
Monitor your AWS ECS (Elastic Container Service) clusters and containers with OpenTelemetry Container Insights integration. This setup provides comprehensive monitoring of container performance, resource utilization, task health, and cluster-wide metrics.
Prerequisites
Before setting up AWS ECS monitoring, ensure you have:
- AWS ECS Cluster: Running ECS cluster with tasks to monitor
- Container Runtime: Docker or containerd runtime
- Administrative Access: Root permissions to install and configure monitoring components
- Network Access: Outbound connectivity to Last9 endpoints
- Last9 Account: With OpenTelemetry integration credentials
Supported Deployment Models
This integration supports both ECS deployment models:
- ECS on EC2: Self-managed EC2 instances running ECS tasks
- AWS Fargate: Serverless container platform (with additional configuration)
Install OpenTelemetry Collector
Install the OpenTelemetry Collector with AWS Container Insights receiver:
For Debian/Ubuntu systems on ECS EC2 instances:
```bash
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
```
For Red Hat/CentOS systems on ECS EC2 instances:
```bash
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
```
For running as an ECS service, use this task definition snippet:
```json
{
  "family": "otel-collector-ecs",
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "networkMode": "bridge",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
      "essential": true,
      "command": ["--config=/etc/ecs/otel-config.yaml"],
      "mountPoints": [
        {
          "sourceVolume": "docker-sock",
          "containerPath": "/var/run/docker.sock",
          "readOnly": true
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "docker-sock",
      "host": {
        "sourcePath": "/var/run/docker.sock"
      }
    }
  ]
}
```
Configure AWS Permissions
Set up the necessary IAM permissions for ECS monitoring:
Attach this policy to your ECS EC2 instance role:
{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["ecs:ListClusters","ecs:ListContainerInstances","ecs:DescribeContainerInstances","ecs:ListServices","ecs:DescribeServices","ecs:ListTasks","ecs:DescribeTasks","ec2:DescribeInstances"],"Resource": "*"}]}Create a task role for the collector running as ECS service:
{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["ecs:ListClusters","ecs:ListContainerInstances","ecs:DescribeContainerInstances","ecs:ListServices","ecs:DescribeServices","ecs:ListTasks","ecs:DescribeTasks"],"Resource": "*"}]} -
Create OpenTelemetry Collector Configuration
Create the collector configuration file for ECS monitoring:
```bash
sudo mkdir -p /etc/otelcol-contrib
sudo nano /etc/otelcol-contrib/config.yaml
```
Add the following configuration to collect ECS Container Insights metrics:
```yaml
receivers:
  awscontainerinsightreceiver:
    collection_interval: 60s
    container_orchestrator: ecs
    # Add cluster name if running on specific cluster
    # cluster_name: "production-cluster"
    # Configure metric types to collect
    metric_types:
      - "cadvisor"  # Container resource metrics
      - "disk"      # Disk usage metrics
      - "diskio"    # Disk I/O metrics
      - "memory"    # Memory usage metrics
      - "network"   # Network metrics
      - "cpu"       # CPU metrics

processors:
  batch:
    timeout: 30s
    send_batch_size: 10000
    send_batch_max_size: 10000
  resourcedetection/aws:
    detectors: ["ecs", "ec2", "aws"]
    timeout: 2s
    override: false
  resource/ecs:
    attributes:
      - key: Timestamp
        action: delete
      - key: service.name
        value: "aws-ecs"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert

exporters:
  otlp/last9:
    endpoint: "$last9_otlp_endpoint"
    headers:
      "Authorization": "$last9_otlp_auth_header"
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [resourcedetection/aws, resource/ecs, batch]
      exporters: [otlp/last9]
```
Configure for Fargate (Optional)
If using AWS Fargate, additional configuration is needed:
```yaml
receivers:
  awscontainerinsightreceiver:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "fargate-cluster"
    # Fargate-specific configuration
    fargate:
      enabled: true
      # Fargate tasks don't have direct cAdvisor access
      use_fargate_metrics: true

processors:
  resourcedetection/aws:
    detectors: ["ecs", "aws"]
    ecs:
      # Fargate resource detection
      resource_arn_key: "aws.ecs.task.arn"
```
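For context on why Fargate needs separate handling: Fargate tasks have no host cAdvisor, so per-container stats come from the ECS task metadata endpoint instead. The snippet below sketches what consuming that endpoint looks like; the curl call only works from inside a running task, and the sample JSON shape is trimmed for illustration:

```shell
# From a container inside a Fargate task (illustrative, not runnable elsewhere):
#   curl -s "${ECS_CONTAINER_METADATA_URI_V4}/task/stats"
# The response maps container IDs to docker-stats-style JSON. A trimmed example:
python3 - <<'EOF'
import json
sample = '{"abc123": {"memory_stats": {"usage": 52428800}}}'  # trimmed example shape
stats = json.loads(sample)
for cid, s in stats.items():
    print(cid, s["memory_stats"]["usage"])  # prints: abc123 52428800
EOF
```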
Create Systemd Service Configuration
For EC2 deployment, create a systemd service:
```bash
sudo nano /etc/systemd/system/otelcol-contrib.service
```
Add the service configuration with required permissions:
```ini
[Unit]
Description=OpenTelemetry Collector for AWS ECS Monitoring
After=network.target

[Service]
ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
Restart=always
User=root
Group=root
# Environment variables
Environment=AWS_REGION=us-east-1
# Required for container access
SupplementaryGroups=docker

[Install]
WantedBy=multi-user.target
```
Note: Root permissions are required to access container runtime sockets and system files.
Start and Enable the Service
Start the OpenTelemetry Collector service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable otelcol-contrib
sudo systemctl start otelcol-contrib
```
Understanding ECS Metrics
The AWS Container Insights receiver collects comprehensive ECS metrics:
Cluster-Level Metrics
- Cluster Resource Utilization: CPU, memory, and network utilization across the cluster
- Task Count: Running, pending, and stopped tasks
- Service Health: Service status and desired task counts
- Instance Health: EC2 instance health for ECS clusters
Task-Level Metrics
- Task Resource Usage: CPU, memory, disk, and network usage per task
- Task Status: Task lifecycle states (pending, running, stopped)
- Task Duration: Time tasks spend in different states
- Task Placement: Task distribution across container instances
Container-Level Metrics
- Container Resources: CPU usage, memory usage, memory limits per container
- Container I/O: Disk read/write operations and network traffic
- Container State: Container health and restart counts
- Container Logs: Container log metrics and error counts
Service-Level Metrics
- Service Utilization: Resource usage aggregated by ECS service
- Deployment Metrics: Service deployment success/failure rates
- Auto Scaling: Service scaling events and capacity changes
- Load Balancer: ALB/NLB health check results for services
Advanced Configuration
Multi-Cluster Monitoring
Monitor multiple ECS clusters from a single collector:
```yaml
receivers:
  awscontainerinsightreceiver/prod:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "production-cluster"
  awscontainerinsightreceiver/staging:
    collection_interval: 120s
    container_orchestrator: ecs
    cluster_name: "staging-cluster"

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver/prod, awscontainerinsightreceiver/staging]
```
Resource Attribution Enhancement
Add detailed resource attributes for better observability:
```yaml
processors:
  resource/ecs:
    attributes:
      - key: aws.ecs.cluster.name
        from_attribute: "ClusterName"
        action: upsert
      - key: aws.ecs.service.name
        from_attribute: "ServiceName"
        action: upsert
      - key: aws.ecs.task.family
        from_attribute: "TaskDefinitionFamily"
        action: upsert
      - key: aws.ecs.task.revision
        from_attribute: "TaskDefinitionRevision"
        action: upsert
      - key: environment
        value: "production"
        action: upsert
```
Custom Metric Filtering
Filter metrics to reduce data volume:
```yaml
processors:
  filter/ecs:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - "container_memory_cache"
          - "container_memory_rss"
      # Only include metrics matching certain criteria
      include:
        match_type: regexp
        metric_names:
          - "container_cpu_.*"
          - "container_memory_usage_bytes"
          - "container_network_.*"
```
Performance Optimization
Optimize collector performance for high-scale environments:
```yaml
receivers:
  awscontainerinsightreceiver:
    collection_interval: 30s  # More frequent collection
    timeout: 10s

processors:
  batch:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 8000
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
```
Verification
Check Service Status
Verify the OpenTelemetry Collector is running:
```bash
sudo systemctl status otelcol-contrib
```
Monitor Service Logs
Check for any configuration errors:
```bash
sudo journalctl -u otelcol-contrib -f
```
Look for successful receiver initialization messages and metric collection activity.
Verify ECS Access
Test ECS API connectivity:
```bash
# List ECS clusters
aws ecs list-clusters --region us-east-1

# Describe cluster
aws ecs describe-clusters --clusters your-cluster-name --region us-east-1

# List running tasks
aws ecs list-tasks --cluster your-cluster-name --region us-east-1
```
Check Docker Socket Access
Verify the collector can access container runtime:
```bash
# Check Docker socket permissions
sudo ls -la /var/run/docker.sock

# Test Docker access
sudo docker ps

# Check if collector user can access Docker
sudo -u root docker ps
```
Generate Container Activity
Deploy test tasks to generate metrics:
```bash
# Create a simple task definition
aws ecs register-task-definition \
  --family test-monitoring \
  --container-definitions '[
    {
      "name": "test-container",
      "image": "nginx:latest",
      "memory": 128,
      "essential": true,
      "portMappings": [{"containerPort": 80, "protocol": "tcp"}]
    }
  ]'

# Run the task
aws ecs run-task \
  --cluster your-cluster-name \
  --task-definition test-monitoring:1
```
Verify Metrics in Last9
Log into your Last9 account and check that ECS metrics are being received in Grafana.
Look for metrics like:
```
container_cpu_usage_seconds_total
container_memory_usage_bytes
container_network_receive_bytes_total
ecs_cluster_cpu_utilization
```
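Assuming these metrics land in a Prometheus-compatible store behind Grafana, queries along the following lines can chart container load. This is a sketch: the exact label names depend on the resource attributes your collector attaches, so adjust them to what you actually see in the metric explorer.

```promql
# Per-container CPU consumption over the last 5 minutes
rate(container_cpu_usage_seconds_total[5m])

# Network ingress aggregated per cluster (ClusterName label is an assumption)
sum by (ClusterName) (rate(container_network_receive_bytes_total[5m]))
```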
Key Metrics to Monitor
Critical Container Health Indicators
| Metric | Description | Alert Threshold |
|---|---|---|
| container_memory_usage_bytes | Container memory usage | > 80% of limit |
| container_cpu_usage_seconds_total | Container CPU usage | > 80% consistently |
| container_oom_events_total | Out-of-memory events | > 0 events |
| ecs_task_running_count | Running task count | < desired count |
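The thresholds above translate naturally into alerting rules. The fragment below is a hedged sketch in Prometheus-style rule syntax; metric names are taken from this table, but verify they match what your collector emits (and adapt the syntax to your alerting backend) before relying on it:

```yaml
# Sketch of alert rules mirroring the thresholds above (names are assumptions).
groups:
  - name: ecs-container-health
    rules:
      - alert: ContainerOOM
        # Any OOM kill in the last 5 minutes is worth paging on
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: critical
      - alert: ECSTaskShortfall
        # Running tasks persistently below the desired count
        expr: ecs_service_desired_count - ecs_task_running_count > 0
        for: 5m
        labels:
          severity: warning
```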
Performance Monitoring
| Metric | Description | Monitoring Focus |
|---|---|---|
| container_network_receive_bytes_total | Network ingress traffic | Track application load |
| container_network_transmit_bytes_total | Network egress traffic | Monitor data transfer |
| container_fs_reads_bytes_total | Filesystem read operations | I/O performance |
| container_fs_writes_bytes_total | Filesystem write operations | Storage performance |
Cluster Health
| Metric | Description | Alert Condition |
|---|---|---|
| ecs_cluster_active_services_count | Active services count | Unexpected changes |
| ecs_cluster_registered_container_instances_count | Available EC2 instances | Capacity monitoring |
| ecs_service_desired_count | Desired task count | Mismatch vs. running count |
Troubleshooting
Collector Permission Issues
Cannot Access Docker Socket:
```bash
# Check Docker socket permissions
sudo ls -la /var/run/docker.sock

# Add collector user to docker group (if running as non-root)
sudo usermod -aG docker otelcol-contrib

# For systemd service, ensure it runs as root
sudo systemctl edit otelcol-contrib
# Add: [Service]
#      User=root
```
ECS API Access Denied:
```bash
# Verify AWS credentials
aws sts get-caller-identity

# Test ECS permissions
aws ecs list-clusters --region us-east-1

# Check IAM role attached to EC2 instance
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
```
Missing Metrics
No Container Metrics:
```bash
# Verify containers are running
docker ps

# Check if cAdvisor metrics are available
curl http://localhost:8080/metrics  # If cAdvisor is exposed

# Verify ECS agent is running
sudo systemctl status ecs
```
Incomplete ECS Metrics:
```bash
# Check ECS cluster health
aws ecs describe-clusters --clusters your-cluster-name

# Verify tasks are running
aws ecs list-tasks --cluster your-cluster-name

# Check task definitions
aws ecs describe-task-definition --task-definition your-task:1
```
Fargate-Specific Issues
Fargate Metrics Not Available:
- Verify the collector is running within the same VPC as Fargate tasks
- Check that Fargate platform version supports Container Insights
- Ensure proper task role permissions for metric collection
High Resource Usage
Collector Using Too Much Memory:
```yaml
processors:
  memory_limiter:
    limit_mib: 256       # Reduce memory limit
    spike_limit_mib: 64

receivers:
  awscontainerinsightreceiver:
    collection_interval: 120s  # Reduce collection frequency
```
Best Practices
Security
- Root Access: Only run collector as root when necessary for container access
- Network Policies: Implement VPC security groups to restrict collector access
- IAM Roles: Use IAM roles instead of access keys for AWS authentication
- Secret Management: Store sensitive configuration in AWS Secrets Manager
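One way to keep the `$last9_otlp_endpoint` and `$last9_otlp_auth_header` values used in the collector config out of the unit file is a systemd drop-in that loads them from a root-only environment file. The paths below are illustrative, not required by the collector:

```ini
# /etc/systemd/system/otelcol-contrib.service.d/override.conf (illustrative path)
[Service]
# Load credentials from a root-only file instead of hardcoding them in the unit
EnvironmentFile=/etc/otelcol-contrib/last9.env
```

The referenced `last9.env` would contain `last9_otlp_endpoint=...` and `last9_otlp_auth_header=...` lines and should be readable by root only (`chmod 600`); for fully managed secrets, fetch these values from AWS Secrets Manager at deploy time instead.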
Performance
- Collection Intervals: Balance monitoring granularity with resource usage
- Metric Filtering: Filter out unnecessary metrics to reduce data volume
- Resource Limits: Set appropriate memory and CPU limits for the collector
- Batch Processing: Optimize batch sizes for efficient data transmission
Monitoring Strategy
- Multi-Layer Monitoring: Monitor cluster, service, task, and container levels
- Alerting: Set up alerts for critical metrics like OOM events and high resource usage
- Capacity Planning: Monitor resource utilization trends for scaling decisions
- Cost Optimization: Use appropriate collection intervals to balance cost and visibility
Deployment
- High Availability: Deploy collectors on multiple AZs for redundancy
- Service Discovery: Use ECS service discovery for dynamic service monitoring
- Rolling Updates: Implement rolling updates for collector configuration changes
- Health Checks: Configure health checks for collector containers
Need Help?
If you encounter any issues or have questions:
- Join our Discord community for real-time support
- Contact our support team at support@last9.io