AWS ECS

Monitor your AWS ECS (Elastic Container Service) clusters and containers with the OpenTelemetry Container Insights integration. This setup covers container performance, resource utilization, task health, and cluster-wide metrics.

Prerequisites

Before setting up AWS ECS monitoring, ensure you have:

AWS ECS Cluster: Running ECS cluster with tasks to monitor
Container Runtime: Docker or containerd runtime
Administrative Access: Root permissions to install and configure monitoring components
Network Access: Outbound connectivity to Last9 endpoints
Last9 Account: With OpenTelemetry integration credentials

Supported Deployment Models

This integration supports both ECS deployment models:

ECS on EC2: Self-managed EC2 instances running ECS tasks
AWS Fargate: Serverless container platform (with additional configuration)

Install OpenTelemetry Collector

Install the OpenTelemetry Collector with AWS Container Insights receiver:
For Debian/Ubuntu systems on ECS EC2 instances:
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
For Red Hat/CentOS systems on ECS EC2 instances:
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
For running as an ECS service, use this task definition snippet:
{
  "family": "otel-collector-ecs",
  "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
  "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
  "networkMode": "bridge",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
      "essential": true,
      "command": ["--config=/etc/ecs/otel-config.yaml"],
      "mountPoints": [
        {
          "sourceVolume": "docker-sock",
          "containerPath": "/var/run/docker.sock",
          "readOnly": true
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "docker-sock",
      "host": { "sourcePath": "/var/run/docker.sock" }
    }
  ]
}

Configure AWS Permissions

Set up the necessary IAM permissions for ECS monitoring:

EC2 Instance Role
Task Role (ECS Service)

Attach this policy to your ECS EC2 instance role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:ListClusters",
        "ecs:ListContainerInstances",
        "ecs:DescribeContainerInstances",
        "ecs:ListServices",
        "ecs:DescribeServices",
        "ecs:ListTasks",
        "ecs:DescribeTasks",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}

Create a task role for the collector running as ECS service:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:ListClusters",
        "ecs:ListContainerInstances",
        "ecs:DescribeContainerInstances",
        "ecs:ListServices",
        "ecs:DescribeServices",
        "ecs:ListTasks",
        "ecs:DescribeTasks"
      ],
      "Resource": "*"
    }
  ]
}
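
Before attaching either policy, it can help to confirm that the document grants every action the collector needs. The following is a minimal sketch using only the Python standard library. `REQUIRED_ACTIONS` mirrors the task-role policy above, and `missing_actions` is a hypothetical helper name, not part of any AWS SDK.

```python
import json

# Actions copied from the task-role policy above.
REQUIRED_ACTIONS = {
    "ecs:ListClusters",
    "ecs:ListContainerInstances",
    "ecs:DescribeContainerInstances",
    "ecs:ListServices",
    "ecs:DescribeServices",
    "ecs:ListTasks",
    "ecs:DescribeTasks",
}


def missing_actions(policy_json: str, required=REQUIRED_ACTIONS) -> set:
    """Return the required actions that no Allow statement grants.

    Only exact action names are compared; wildcards such as "ecs:*"
    are deliberately not expanded in this sketch.
    """
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    granted = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        granted.update(actions)
    return set(required) - granted
```

An empty result means every listed action is granted by name. For the EC2 instance role, pass `REQUIRED_ACTIONS | {"ec2:DescribeInstances"}` as `required`.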

Create OpenTelemetry Collector Configuration

Create the collector configuration file for ECS monitoring:

sudo mkdir -p /etc/otelcol-contrib
sudo nano /etc/otelcol-contrib/config.yaml

Add the following configuration to collect ECS Container Insights metrics:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 60s
    container_orchestrator: ecs
    # Add cluster name if running on specific cluster
    # cluster_name: "production-cluster"

    # Configure metric types to collect
    metric_types:
      - "cadvisor" # Container resource metrics
      - "disk" # Disk usage metrics
      - "diskio" # Disk I/O metrics
      - "memory" # Memory usage metrics
      - "network" # Network metrics
      - "cpu" # CPU metrics

processors:
  batch:
    timeout: 30s
    send_batch_size: 10000
    send_batch_max_size: 10000
  resourcedetection/aws:
    detectors: ["ecs", "ec2"]
    timeout: 2s
    override: false
  resource/ecs:
    attributes:
      - key: Timestamp
        action: delete
      - key: service.name
        value: "aws-ecs"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert

exporters:
  otlp/last9:
    endpoint: "$last9_otlp_endpoint"
    headers:
      "Authorization": "$last9_otlp_auth_header"
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [resourcedetection/aws, resource/ecs, batch]
      exporters: [otlp/last9]
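
A common source of startup errors is a pipeline that references a component that was never declared, or a component that is declared but never used. The sketch below checks for both. It assumes the YAML has already been loaded into a Python dictionary (the parser is left to you), and the function names are illustrative only.

```python
SECTIONS = ("receivers", "processors", "exporters")


def undeclared_components(config: dict) -> list:
    """Return 'pipeline/section/name' for every reference with no declaration."""
    problems = []
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    for pipeline_name, pipeline in pipelines.items():
        for section in SECTIONS:
            declared = config.get(section) or {}
            for name in pipeline.get(section) or []:
                if name not in declared:
                    problems.append(f"{pipeline_name}/{section}/{name}")
    return problems


def unused_components(config: dict) -> list:
    """Return 'section/name' for every declared component no pipeline uses."""
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    unused = []
    for section in SECTIONS:
        used = set()
        for pipeline in pipelines.values():
            used.update(pipeline.get(section) or [])
        for name in config.get(section) or {}:
            if name not in used:
                unused.append(f"{section}/{name}")
    return unused


# The configuration above, reduced to component names.
config = {
    "receivers": {"awscontainerinsightreceiver": {}},
    "processors": {"batch": {}, "resourcedetection/aws": {}, "resource/ecs": {}},
    "exporters": {"otlp/last9": {}, "debug": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["awscontainerinsightreceiver"],
                "processors": ["resourcedetection/aws", "resource/ecs", "batch"],
                "exporters": ["otlp/last9"],
            }
        }
    },
}

print(undeclared_components(config))  # []
print(unused_components(config))      # ['exporters/debug']
```

The second result shows that the debug exporter is declared but not attached to the metrics pipeline. To see its output while testing, add `debug` to the pipeline's exporters list.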

Configure for Fargate (Optional)

If using AWS Fargate, additional configuration is needed:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "fargate-cluster"

    # Fargate-specific configuration
    fargate:
      enabled: true
      # Fargate tasks don't have direct cAdvisor access
      use_fargate_metrics: true

processors:
  resourcedetection/aws:
    detectors: ["ecs"]
    ecs:
      # Fargate resource detection
      resource_arn_key: "aws.ecs.task.arn"

Create Systemd Service Configuration

For EC2 deployment, create a systemd service:

sudo nano /etc/systemd/system/otelcol-contrib.service

Add the service configuration with required permissions:

[Unit]
Description=OpenTelemetry Collector for AWS ECS Monitoring
After=network.target

[Service]
ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
Restart=always
User=root
Group=root

# Environment variables
Environment=AWS_REGION=us-east-1

# Required for container access
SupplementaryGroups=docker

[Install]
WantedBy=multi-user.target

Note: Root permissions are required to access container runtime sockets and system files.

Start and Enable the Service

Start the OpenTelemetry Collector service:

sudo systemctl daemon-reload
sudo systemctl enable otelcol-contrib
sudo systemctl start otelcol-contrib

ECS Application Logs

The metrics configuration above uses awscontainerinsightreceiver for infrastructure metrics. To collect application logs from your ECS containers and send them to Last9, use the filelog receiver running as a sidecar.

The service.name Problem

When ECS application logs are collected without explicit resource attribution, the service.name attribute defaults to aws_ecs, the same value for every service on the cluster. This makes it impossible to filter logs by service in Last9.

The fix: use the ecs detector of the resourcedetection processor to populate ECS task metadata attributes, then copy aws.ecs.task.family into service.name.

Sidecar Collector for ECS Logs

Add an OTel Collector sidecar container to your ECS task definition. The sidecar reads log files written by your application container and forwards them to Last9 with correct resource attributes.

ECS Task Definition (relevant containers section):

{
  "containerDefinitions": [
    {
      "name": "your-app",
      "image": "your-app-image",
      "logConfiguration": {
        "logDriver": "json-file"
      },
      "mountPoints": [
        {
          "sourceVolume": "app-logs",
          "containerPath": "/var/log/app"
        }
      ]
    },
    {
      "name": "otel-collector",
      "image": "otel/opentelemetry-collector-contrib:0.128.0",
      "command": ["--config=/etc/otel/config.yaml"],
      "mountPoints": [
        {
          "sourceVolume": "app-logs",
          "containerPath": "/var/log/app",
          "readOnly": true
        },
        {
          "sourceVolume": "otel-config",
          "containerPath": "/etc/otel"
        }
      ],
      "environment": [
        { "name": "LAST9_OTLP_ENDPOINT", "value": "https://otlp.last9.io" },
        { "name": "LAST9_AUTH_HEADER", "value": "Basic <your-credentials>" }
      ]
    }
  ],
  "volumes": [{ "name": "app-logs" }, { "name": "otel-config" }]
}
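
The sidecar can only read the application's logs if both containers mount the same volume. The sketch below checks a task definition for that, and reports whether the sidecar's mount is read-only. The helper name `volumes_shared_by` is hypothetical, and the sample dictionary is a reduced copy of the task definition above.

```python
def volumes_shared_by(task_def: dict, app_name: str, sidecar_name: str) -> dict:
    """Map each volume mounted by both containers to the sidecar's readOnly flag."""
    mounts = {}
    for container in task_def.get("containerDefinitions", []):
        mounts[container.get("name")] = {
            m["sourceVolume"]: m for m in container.get("mountPoints", [])
        }
    app_mounts = mounts.get(app_name, {})
    sidecar_mounts = mounts.get(sidecar_name, {})
    return {
        volume: bool(sidecar_mounts[volume].get("readOnly", False))
        for volume in app_mounts
        if volume in sidecar_mounts
    }


# Reduced copy of the task definition above: only names and mount points.
task_def = {
    "containerDefinitions": [
        {
            "name": "your-app",
            "mountPoints": [
                {"sourceVolume": "app-logs", "containerPath": "/var/log/app"}
            ],
        },
        {
            "name": "otel-collector",
            "mountPoints": [
                {
                    "sourceVolume": "app-logs",
                    "containerPath": "/var/log/app",
                    "readOnly": True,
                },
                {"sourceVolume": "otel-config", "containerPath": "/etc/otel"},
            ],
        },
    ]
}

print(volumes_shared_by(task_def, "your-app", "otel-collector"))  # {'app-logs': True}
```

An empty result would mean the sidecar has no path to the application's log files.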

Collector configuration (config.yaml):

receivers:
  filelog:
    include: [/var/log/app/*.log]
    start_at: beginning
    operators:
      - type: json_parser
        timestamp:
          parse_from: attributes.timestamp
          layout: "%Y-%m-%dT%H:%M:%S.%fZ"

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
  resourcedetection/ecs:
    detectors: ["ecs"]
    timeout: 2s
    override: false
  transform/service_name:
    log_statements:
      - context: resource
        statements:
          # Use ECS task family as service.name instead of the default "aws_ecs"
          - set(attributes["service.name"], attributes["aws.ecs.task.family"]) where attributes["aws.ecs.task.family"] != nil
          - set(attributes["deployment.environment"], "production")

exporters:
  otlp/last9:
    endpoint: "${env:LAST9_OTLP_ENDPOINT}"
    headers:
      "Authorization": "${env:LAST9_AUTH_HEADER}"

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [resourcedetection/ecs, transform/service_name, batch]
      exporters: [otlp/last9]
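
The timestamp layout in the json_parser only matches one specific shape of timestamp. The sketch below uses Python's `strptime` with the same layout string to show which values fit. It assumes Python's directives behave like the collector's for this layout, which may not hold in every edge case.

```python
from datetime import datetime, timezone

# Same layout string as the json_parser operator above.
LAYOUT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_log_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2024-05-01T12:34:56.789Z and mark it as UTC."""
    return datetime.strptime(value, LAYOUT).replace(tzinfo=timezone.utc)


print(parse_log_timestamp("2024-05-01T12:34:56.789Z").isoformat())

# A value without fractional seconds does not match the layout.
try:
    parse_log_timestamp("2024-05-01T12:34:56Z")
except ValueError as err:
    print("rejected:", err)
```

If your application emits timestamps without fractional seconds or with a numeric offset, adjust the layout to match.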

What `resourcedetection/ecs` Populates

The ecs detector queries the ECS task metadata endpoint ($ECS_CONTAINER_METADATA_URI_V4) and adds these resource attributes automatically:

| Attribute | Example Value | Description |
| --- | --- | --- |
| `aws.ecs.task.family` | `payment-service` | Task definition family name |
| `aws.ecs.task.revision` | `12` | Task definition revision |
| `aws.ecs.task.arn` | `arn:aws:ecs:...` | Full task ARN |
| `aws.ecs.task.id` | `9781c248-0edd-4cdb-9a93-f63cb662a5d3` | Task UUID |
| `aws.ecs.cluster.arn` | `arn:aws:ecs:...` | Cluster ARN |
| `aws.ecs.launchtype` | `FARGATE` or `EC2` | Launch type |

The transform/service_name processor then copies aws.ecs.task.family into service.name so logs from the payment-service task family appear under service.name=payment-service in Last9.
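
The two steps, detection and then the copy into service.name, can be sketched in plain Python. This is an illustration of the data flow, not the collector's implementation. It assumes the task metadata response carries the `Family`, `Revision`, `TaskARN`, `Cluster`, and `LaunchType` fields, and it does no normalization of those values. The sample values are made up.

```python
def ecs_resource_attributes(task_metadata: dict) -> dict:
    """Map a task metadata response to the resource attributes listed above."""
    task_arn = task_metadata.get("TaskARN") or ""
    attrs = {
        "aws.ecs.task.family": task_metadata.get("Family"),
        "aws.ecs.task.revision": task_metadata.get("Revision"),
        "aws.ecs.task.arn": task_arn or None,
        # The task ID is the last path segment of the task ARN.
        "aws.ecs.task.id": task_arn.rsplit("/", 1)[-1] if task_arn else None,
        "aws.ecs.cluster.arn": task_metadata.get("Cluster"),
        "aws.ecs.launchtype": task_metadata.get("LaunchType"),
    }
    return {key: value for key, value in attrs.items() if value is not None}


def apply_service_name(attrs: dict) -> dict:
    """Mirror the transform: copy the task family into service.name if present."""
    result = dict(attrs)
    family = result.get("aws.ecs.task.family")
    if family is not None:
        result["service.name"] = family
    return result


# Made-up sample resembling a task metadata response.
sample = {
    "Cluster": "arn:aws:ecs:us-east-1:111122223333:cluster/prod",
    "TaskARN": "arn:aws:ecs:us-east-1:111122223333:task/prod/9781c2480edd4cdb9a93f63cb662a5d3",
    "Family": "payment-service",
    "Revision": "12",
    "LaunchType": "FARGATE",
}

attrs = apply_service_name(ecs_resource_attributes(sample))
print(attrs["service.name"])  # payment-service
```

When the family attribute is missing, `apply_service_name` leaves service.name untouched, which matches the `where ... != nil` guard in the transform statement.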

Using CloudWatch Logs Driver (No Sidecar)

If your tasks already use the awslogs log driver (CloudWatch), forward logs to Last9 using a CloudWatch Logs subscription filter instead of a sidecar. The trade-off is that CloudWatch adds latency (about 1 minute), and the logs lose structured fields unless your application emits JSON.

For real-time structured logs with correct service.name, the sidecar pattern above is preferred.

Fargate Logs with FireLens

For ECS on Fargate, use FireLens when your application already writes logs to stdout or stderr and you want the task definition to route those logs through a Fluent Bit log router. This keeps the application container focused on application work while FireLens handles log export.

The pattern has three parts:

1. Store OTLP credentials in a private secret or SSM parameter.
2. Grant the ECS task execution role read access to that exact secret or parameter.
3. Add a FireLens log router container and route the application container through awsfirelens.

Use least-privilege IAM for credential access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ssm:GetParameter", "ssm:GetParameters"],
      "Resource": "arn:aws:ssm:<region>:<aws-account-id>:parameter/<ssm-parameter-name>"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
    }
  ]
}

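If you store the OTLP credential only in SSM, omit the Secrets Manager statement. If you store it only in Secrets Manager, scope the statement to that one secret. Secrets Manager appends a hyphen and six random characters to the secret name in the ARN, so copy the full ARN of the secret rather than building it from the name alone. Avoid Resource: "*" for telemetry credentials.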
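
A quick way to enforce the last point is to scan a policy for Allow statements that apply to every resource. The following is a minimal sketch using only the standard library; `wildcard_resource_statements` is a hypothetical helper name, and the ARN in the example is a placeholder.

```python
import json


def wildcard_resource_statements(policy_json: str) -> list:
    """Return the indexes of Allow statements that apply to Resource "*"."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    flagged = []
    for index, stmt in enumerate(statements):
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):  # "Resource" may be a string or a list
            resources = [resources]
        if stmt.get("Effect") == "Allow" and "*" in resources:
            flagged.append(index)
    return flagged


policy = json.dumps(
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ssm:GetParameter",
                # Placeholder ARN for illustration only.
                "Resource": "arn:aws:ssm:us-east-1:111122223333:parameter/otlp-auth",
            },
            {
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "*",
            },
        ],
    }
)

print(wildcard_resource_statements(policy))  # [1]
```

The check only flags a bare "*". Partial wildcards inside an ARN are not inspected in this sketch.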

Add a FireLens log router container:

{
  "name": "log_router",
  "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
  "essential": true,
  "firelensConfiguration": {
    "type": "fluentbit",
    "options": {
      "enable-ecs-log-metadata": "true"
    }
  }
}

Add FireLens to the application container:

{
  "name": "checkout-service",
  "image": "<application-image>",
  "essential": true,
  "logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
      "Name": "opentelemetry",
      "Host": "<last9-otlp-host>",
      "Port": "443",
      "Logs_uri": "/v1/logs",
      "Tls": "on",
      "Tls.verify": "on"
    },
    "secretOptions": [
      {
        "name": "Header",
        "valueFrom": "arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
      }
    ]
  }
}

Replace <last9-otlp-host> and the secret ARN with the values from your Last9 integration setup. The secret value should contain the authorization header expected by the destination, for example Authorization <redacted>.

FireLens adds ECS metadata by default, including the cluster, task ARN, and task definition. Keep this enabled unless you have a deliberate reason to remove it. If you need custom resource attributes such as service.name or deployment.environment, add them in your collector or Fluent Bit configuration rather than relying on unsupported log-driver options.

Use the same service.name and deployment.environment values across logs, metrics, and traces. In Last9, those attributes drive service grouping, filters, dashboards, and trace correlation. For ECS task metadata, also keep resourcedetection/ecs enabled in collector-based paths so resource attributes such as aws.ecs.task.family, aws.ecs.task.revision, and aws.ecs.cluster.arn are available for filtering.

Fargate tasks need a new task-definition revision and service deployment to pick up FireLens, SSM, or secret changes. Plan this as a normal ECS rolling deployment. The application code does not need to change for logs and ECS task metrics, but application containers will restart when the service deploys the new task definition.

Give the FireLens log router a smaller resource allocation than the application container. Start with a small CPU and memory allocation, then adjust based on task volume, export latency, and log-router memory usage. For high-volume services, increase task-level CPU or memory instead of starving the application container.

Understanding ECS Metrics

The AWS Container Insights receiver collects comprehensive ECS metrics:

Cluster-Level Metrics

Cluster Resource Utilization: CPU, memory, and network utilization across the cluster
Task Count: Running, pending, and stopped tasks
Service Health: Service status and desired task counts
Instance Health: EC2 instance health for ECS clusters

Task-Level Metrics

Task Resource Usage: CPU, memory, disk, and network usage per task
Task Status: Task lifecycle states (pending, running, stopped)
Task Duration: Time tasks spend in different states
Task Placement: Task distribution across container instances

Container-Level Metrics

Container Resources: CPU usage, memory usage, memory limits per container
Container I/O: Disk read/write operations and network traffic
Container State: Container health and restart counts
Container Logs: Container log metrics and error counts

Service-Level Metrics

Service Utilization: Resource usage aggregated by ECS service
Deployment Metrics: Service deployment success/failure rates
Auto Scaling: Service scaling events and capacity changes
Load Balancer: ALB/NLB health check results for services

Advanced Configuration

Multi-Cluster Monitoring

Monitor multiple ECS clusters from a single collector:

receivers:
  awscontainerinsightreceiver/prod:
    collection_interval: 60s
    container_orchestrator: ecs
    cluster_name: "production-cluster"
  awscontainerinsightreceiver/staging:
    collection_interval: 120s
    container_orchestrator: ecs
    cluster_name: "staging-cluster"

service:
  pipelines:
    metrics:
      receivers:
        [awscontainerinsightreceiver/prod, awscontainerinsightreceiver/staging]

Resource Attribution Enhancement

Add detailed resource attributes for better observability:

processors:
  resource/ecs:
    attributes:
      - key: aws.ecs.cluster.name
        from_attribute: "ClusterName"
        action: upsert
      - key: aws.ecs.service.name
        from_attribute: "ServiceName"
        action: upsert
      - key: aws.ecs.task.family
        from_attribute: "TaskDefinitionFamily"
        action: upsert
      - key: aws.ecs.task.revision
        from_attribute: "TaskDefinitionRevision"
        action: upsert
      - key: environment
        value: "production"
        action: upsert

Custom Metric Filtering

Filter metrics to reduce data volume:

processors:
  filter/ecs:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - "container_memory_cache"
          - "container_memory_rss"
      # Only include metrics matching certain criteria
      include:
        match_type: regexp
        metric_names:
          - "container_cpu_.*"
          - "container_memory_usage_bytes"
          - "container_network_.*"
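
To reason about which metrics survive this filter, the sketch below simulates it on a few metric names. It rests on two assumptions that you should confirm against your collector version: that include is evaluated before exclude, and that regexp patterns are matched unanchored. The function name is illustrative.

```python
import re

# Patterns and names copied from the filter configuration above.
INCLUDE_PATTERNS = [
    "container_cpu_.*",
    "container_memory_usage_bytes",
    "container_network_.*",
]
EXCLUDE_NAMES = {"container_memory_cache", "container_memory_rss"}


def is_kept(metric_name: str) -> bool:
    """Apply include first, then exclude, to a single metric name."""
    # Assumption: regexp include patterns are searched, not fully anchored.
    included = any(re.search(p, metric_name) for p in INCLUDE_PATTERNS)
    excluded = metric_name in EXCLUDE_NAMES  # strict match: exact names only
    return included and not excluded


for name in (
    "container_cpu_usage_seconds_total",
    "container_memory_usage_bytes",
    "container_memory_rss",
    "container_fs_reads_bytes_total",
):
    print(name, is_kept(name))
```

Under these assumptions the two excluded names are never matched by the include patterns in the first place, so the exclude block only matters if you later broaden the include list, for example to `container_memory_.*`.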

Performance Optimization

Optimize collector performance for high-scale environments:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 30s # More frequent collection
    timeout: 10s

processors:
  batch:
    timeout: 10s
    send_batch_size: 5000
    send_batch_max_size: 8000

  memory_limiter:
    check_interval: 1s # Required: the processor rejects a zero interval
    limit_mib: 512
    spike_limit_mib: 128

Verification

Check Service Status

Verify the OpenTelemetry Collector is running:
```
sudo systemctl status otelcol-contrib
```

Monitor Service Logs

Check for any configuration errors:
```
sudo journalctl -u otelcol-contrib -f
```
Look for successful receiver initialization messages and metric collection activity.

Verify ECS Access

Test ECS API connectivity:

# List ECS clusters
aws ecs list-clusters --region us-east-1

# Describe cluster
aws ecs describe-clusters --clusters your-cluster-name --region us-east-1

# List running tasks
aws ecs list-tasks --cluster your-cluster-name --region us-east-1

Check Docker Socket Access

Verify the collector can access container runtime:

# Check Docker socket permissions
sudo ls -la /var/run/docker.sock

# Test Docker access
sudo docker ps

# Check if collector user can access Docker
sudo -u root docker ps

Generate Container Activity

Deploy test tasks to generate metrics:

# Create a simple task definition
aws ecs register-task-definition \
  --family test-monitoring \
  --container-definitions '[{
    "name": "test-container",
    "image": "nginx:latest",
    "memory": 128,
    "essential": true,
    "portMappings": [{
      "containerPort": 80,
      "protocol": "tcp"
    }]
  }]'

# Run the task
aws ecs run-task \
  --cluster your-cluster-name \
  --task-definition test-monitoring:1

Verify Metrics in Last9

Log into your Last9 account and check that ECS metrics are being received in Metrics Explorer.

Look for metrics like:
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_network_receive_bytes_total
- ecs_cluster_cpu_utilization

Verify ECS Logs in Last9

Open Logs Explorer and filter for ECS metadata or the service and environment attributes you configured in your collector or Fluent Bit path.

Useful filters include:
- service.name = "checkout-service" if you configured this resource attribute
- deployment.environment = "production" if you configured this resource attribute
- aws.ecs.task.family = "checkout-service"
- aws.ecs.launchtype = "FARGATE"

Open a log line and check Resource Info. ECS logs should include task and cluster attributes such as aws.ecs.task.arn, aws.ecs.task.revision, and aws.ecs.cluster.arn.

Verify Consistent Service and Environment Attributes

Use the same service.name and deployment.environment values for ECS logs, Node.js or Next.js traces, and any dashboard filters. If logs show one service name and traces show another, align the ECS task-definition values with the application instrumentation values.

Key Metrics to Monitor

Critical Container Health Indicators

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `container_memory_usage_bytes` | Container memory usage | > 80% of limit |
| `container_cpu_usage_seconds_total` | Cumulative container CPU time (alert on its rate) | Rate > 80% consistently |
| `container_oom_events_total` | Out of memory events | > 0 events |
| `ecs_task_running_count` | Running task count | < desired count |
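
Because `container_cpu_usage_seconds_total` is a cumulative counter, the 80% threshold applies to its rate of change, not to the raw value. The sketch below shows the arithmetic for both the CPU and memory thresholds. The function names and the idea of comparing against a per-container CPU limit in cores are assumptions made for illustration.

```python
def cpu_utilization_percent(
    prev_seconds: float,
    curr_seconds: float,
    interval_seconds: float,
    cpu_limit_cores: float,
) -> float:
    """CPU time used between two counter samples, as a percentage of the limit."""
    if interval_seconds <= 0 or cpu_limit_cores <= 0:
        raise ValueError("interval and CPU limit must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # The counter went backwards, e.g. after a container restart.
        # Treat the current value as the usage since the reset.
        delta = curr_seconds
    return 100.0 * delta / (interval_seconds * cpu_limit_cores)


def memory_utilization_percent(usage_bytes: float, limit_bytes: float) -> float:
    """Memory usage as a percentage of the container's memory limit."""
    if limit_bytes <= 0:
        raise ValueError("memory limit must be positive")
    return 100.0 * usage_bytes / limit_bytes


# 48 CPU-seconds used over a 60-second interval on a 1-core limit.
print(cpu_utilization_percent(100.0, 148.0, 60.0, 1.0))  # 80.0
# 400 MiB used out of a 500 MiB limit.
print(memory_utilization_percent(400 * 1024**2, 500 * 1024**2))  # 80.0
```

To approximate "consistently", alert only when several consecutive intervals exceed the threshold rather than on a single sample.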

Performance Monitoring

| Metric | Description | Monitoring Focus |
| --- | --- | --- |
| `container_network_receive_bytes_total` | Network ingress traffic | Track application load |
| `container_network_transmit_bytes_total` | Network egress traffic | Monitor data transfer |
| `container_fs_reads_bytes_total` | Filesystem read operations | I/O performance |
| `container_fs_writes_bytes_total` | Filesystem write operations | Storage performance |

Cluster Health

| Metric | Description | Alert Condition |
| --- | --- | --- |
| `ecs_cluster_active_services_count` | Active services count | Unexpected changes |
| `ecs_cluster_registered_container_instances_count` | Available EC2 instances | Capacity monitoring |
| `ecs_service_desired_count` | Desired task count | Mismatch with running count |

Troubleshooting

Collector Permission Issues

Cannot Access Docker Socket:

# Check Docker socket permissions
sudo ls -la /var/run/docker.sock

# Add collector user to docker group (if running as non-root)
sudo usermod -aG docker otelcol-contrib

# For systemd service, ensure it runs as root
sudo systemctl edit otelcol-contrib
# Add: [Service]
# User=root

ECS API Access Denied:

# Verify AWS credentials
aws sts get-caller-identity

# Test ECS permissions
aws ecs list-clusters --region us-east-1

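# Check IAM role attached to EC2 instance (IMDSv2 requires a session token)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/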

Missing Metrics

No Container Metrics:

# Verify containers are running
docker ps

# Check if cAdvisor metrics are available
curl http://localhost:8080/metrics  # If cAdvisor is exposed

# Verify ECS agent is running
sudo systemctl status ecs

Incomplete ECS Metrics:

# Check ECS cluster health
aws ecs describe-clusters --clusters your-cluster-name

# Verify tasks are running
aws ecs list-tasks --cluster your-cluster-name

# Check task definitions
aws ecs describe-task-definition --task-definition your-task:1

Fargate-Specific Issues

Fargate Metrics Not Available:

- Verify the collector is running within the same VPC as Fargate tasks
- Check that Fargate platform version supports Container Insights
- Ensure proper task role permissions for metric collection

High Resource Usage

Collector Using Too Much Memory:

processors:
  memory_limiter:
    check_interval: 1s # Required: the processor rejects a zero interval
    limit_mib: 256 # Reduce memory limit
    spike_limit_mib: 64

receivers:
  awscontainerinsightreceiver:
    collection_interval: 120s # Reduce collection frequency

Best Practices

Security

Root Access: Only run collector as root when necessary for container access
Network Policies: Implement VPC security groups to restrict collector access
IAM Roles: Use IAM roles instead of access keys for AWS authentication
Secret Management: Store sensitive configuration in AWS Secrets Manager

Performance

Collection Intervals: Balance monitoring granularity with resource usage
Metric Filtering: Filter out unnecessary metrics to reduce data volume
Resource Limits: Set appropriate memory and CPU limits for the collector
Batch Processing: Optimize batch sizes for efficient data transmission

Monitoring Strategy

Multi-Layer Monitoring: Monitor cluster, service, task, and container levels
Alerting: Set up alerts for critical metrics like OOM events and high resource usage
Capacity Planning: Monitor resource utilization trends for scaling decisions
Cost Optimization: Use appropriate collection intervals to balance cost and visibility

Deployment

High Availability: Deploy collectors on multiple AZs for redundancy
Service Discovery: Use ECS service discovery for dynamic service monitoring
Rolling Updates: Implement rolling updates for collector configuration changes
Health Checks: Configure health checks for collector containers

Please get in touch with us on Discord or by email if you have any questions.