AWS ECS
Monitor AWS ECS clusters and containers with OpenTelemetry Container Insights for comprehensive containerized application observability
Monitor your AWS ECS (Elastic Container Service) clusters and containers with OpenTelemetry Container Insights integration. This setup provides comprehensive monitoring of container performance, resource utilization, task health, and cluster-wide metrics.
Prerequisites
Before setting up AWS ECS monitoring, ensure you have:
- AWS ECS Cluster: Running ECS cluster with tasks to monitor
- Container Runtime: Docker or containerd runtime
- Administrative Access: Root permissions to install and configure monitoring components
- Network Access: Outbound connectivity to Last9 endpoints
- Last9 Account: With OpenTelemetry integration credentials
Supported Deployment Models
This integration supports both ECS deployment models:
- ECS on EC2: Self-managed EC2 instances running ECS tasks
- AWS Fargate: Serverless container platform (with additional configuration)
Install OpenTelemetry Collector
Install the OpenTelemetry Collector with AWS Container Insights receiver:
For Debian/Ubuntu systems on ECS EC2 instances:
    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.deb
    sudo dpkg -i otelcol-contrib_0.118.0_linux_amd64.deb
For Red Hat/CentOS systems on ECS EC2 instances:
    wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol-contrib_0.118.0_linux_amd64.rpm
    sudo rpm -ivh otelcol-contrib_0.118.0_linux_amd64.rpm
For running as an ECS service, use this task definition snippet:
    {
      "family": "otel-collector-ecs",
      "taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
      "executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
      "networkMode": "bridge",
      "requiresCompatibilities": ["EC2"],
      "containerDefinitions": [
        {
          "name": "otel-collector",
          "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",
          "essential": true,
          "command": ["--config=/etc/ecs/otel-config.yaml"],
          "mountPoints": [
            {
              "sourceVolume": "docker-sock",
              "containerPath": "/var/run/docker.sock",
              "readOnly": true
            }
          ]
        }
      ],
      "volumes": [
        {
          "name": "docker-sock",
          "host": { "sourcePath": "/var/run/docker.sock" }
        }
      ]
    }
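A common mistake when editing a task definition like the one above is a mount point that names a volume the task never declares. The following is a minimal sketch of an offline check for that, written in Python with only the standard library. The helper name `undeclared_volumes` is hypothetical, and the embedded JSON is the snippet above trimmed to the fields the check reads.

```python
import json

# Task definition from the snippet above, trimmed to the fields this check reads.
TASK_DEFINITION = json.loads("""
{
  "family": "otel-collector-ecs",
  "containerDefinitions": [
    {
      "name": "otel-collector",
      "mountPoints": [
        {"sourceVolume": "docker-sock",
         "containerPath": "/var/run/docker.sock",
         "readOnly": true}
      ]
    }
  ],
  "volumes": [
    {"name": "docker-sock", "host": {"sourcePath": "/var/run/docker.sock"}}
  ]
}
""")


def undeclared_volumes(task_definition):
    """Return (container name, sourceVolume) pairs whose volume is not declared."""
    declared = {volume["name"] for volume in task_definition.get("volumes", [])}
    missing = []
    for container in task_definition.get("containerDefinitions", []):
        for mount in container.get("mountPoints", []):
            if mount["sourceVolume"] not in declared:
                missing.append((container["name"], mount["sourceVolume"]))
    return missing


if __name__ == "__main__":
    # An empty list means every mount point refers to a declared volume.
    print(undeclared_volumes(TASK_DEFINITION))
```

Running a check like this before registering the task definition catches the mismatch locally rather than at registration or task start.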
Configure AWS Permissions
Set up the necessary IAM permissions for ECS monitoring:
Attach this policy to your ECS EC2 instance role:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ecs:ListClusters",
            "ecs:ListContainerInstances",
            "ecs:DescribeContainerInstances",
            "ecs:ListServices",
            "ecs:DescribeServices",
            "ecs:ListTasks",
            "ecs:DescribeTasks",
            "ec2:DescribeInstances"
          ],
          "Resource": "*"
        }
      ]
    }
Create a task role for the collector running as ECS service:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ecs:ListClusters",
            "ecs:ListContainerInstances",
            "ecs:DescribeContainerInstances",
            "ecs:ListServices",
            "ecs:DescribeServices",
            "ecs:ListTasks",
            "ecs:DescribeTasks"
          ],
          "Resource": "*"
        }
      ]
    }
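The two policies above differ only in `ec2:DescribeInstances`, which appears in the instance role policy but not in the task role policy. One way to keep them from drifting apart is to generate both from a single action list. This is a sketch only: `build_policy` is a hypothetical helper, and the output is plain JSON that you would still attach through your usual IAM tooling.

```python
import json

# Read-only ECS actions shared by both policies in this section.
ECS_READ_ACTIONS = [
    "ecs:ListClusters",
    "ecs:ListContainerInstances",
    "ecs:DescribeContainerInstances",
    "ecs:ListServices",
    "ecs:DescribeServices",
    "ecs:ListTasks",
    "ecs:DescribeTasks",
]


def build_policy(actions):
    """Return an IAM policy document allowing the given actions on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


# Task role: ECS read access only.
task_role_policy = build_policy(ECS_READ_ACTIONS)
# Instance role: the same actions plus EC2 instance lookup.
instance_role_policy = build_policy(ECS_READ_ACTIONS + ["ec2:DescribeInstances"])

if __name__ == "__main__":
    print(json.dumps(instance_role_policy, indent=2))
```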
Create OpenTelemetry Collector Configuration
Create the collector configuration file for ECS monitoring:
    sudo mkdir -p /etc/otelcol-contrib
    sudo nano /etc/otelcol-contrib/config.yaml
Add the following configuration to collect ECS Container Insights metrics:
    receivers:
      awscontainerinsightreceiver:
        collection_interval: 60s
        container_orchestrator: ecs
        # Add cluster name if running on specific cluster
        # cluster_name: "production-cluster"
        # Configure metric types to collect
        metric_types:
          - "cadvisor" # Container resource metrics
          - "disk" # Disk usage metrics
          - "diskio" # Disk I/O metrics
          - "memory" # Memory usage metrics
          - "network" # Network metrics
          - "cpu" # CPU metrics
    processors:
      batch:
        timeout: 30s
        send_batch_size: 10000
        send_batch_max_size: 10000
      resourcedetection/aws:
        detectors: ["ecs", "ec2", "aws"]
        timeout: 2s
        override: false
      resource/ecs:
        attributes:
          - key: Timestamp
            action: delete
          - key: service.name
            value: "aws-ecs"
            action: upsert
          - key: deployment.environment
            value: "production"
            action: upsert
    exporters:
      otlp/last9:
        endpoint: "$last9_otlp_endpoint"
        headers:
          "Authorization": "$last9_otlp_auth_header"
      debug:
        verbosity: detailed
    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [resourcedetection/aws, resource/ecs, batch]
          exporters: [otlp/last9]
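The exporter section above contains two placeholders, `$last9_otlp_endpoint` and `$last9_otlp_auth_header`, which must be replaced with the values from your Last9 integration before the collector starts. As a rough sketch, the substitution can be scripted with Python's `string.Template`, which fails loudly when a value is missing. The function name `render_config` and the example values are assumptions for illustration, not real credentials or endpoints.

```python
from string import Template

# Exporter section of the configuration above, with its two placeholders.
EXPORTER_SNIPPET = """exporters:
  otlp/last9:
    endpoint: "$last9_otlp_endpoint"
    headers:
      "Authorization": "$last9_otlp_auth_header"
"""


def render_config(template_text, values):
    """Replace $name placeholders; raises KeyError if any placeholder has no value."""
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    # Hypothetical values; use the ones shown in your Last9 integration page.
    print(render_config(EXPORTER_SNIPPET, {
        "last9_otlp_endpoint": "https://otlp.example.invalid",
        "last9_otlp_auth_header": "Basic <your-credentials>",
    }))
```

Using `substitute` rather than `safe_substitute` is deliberate here: a missing value stops the script instead of leaving a literal placeholder in the deployed file.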
Configure for Fargate (Optional)
If using AWS Fargate, additional configuration is needed:
    receivers:
      awscontainerinsightreceiver:
        collection_interval: 60s
        container_orchestrator: ecs
        cluster_name: "fargate-cluster"
        # Fargate-specific configuration
        fargate:
          enabled: true
          # Fargate tasks don't have direct cAdvisor access
          use_fargate_metrics: true
    processors:
      resourcedetection/aws:
        detectors: ["ecs", "aws"]
        ecs:
          # Fargate resource detection
          resource_arn_key: "aws.ecs.task.arn"
Create Systemd Service Configuration
For EC2 deployment, create a systemd service:
    sudo nano /etc/systemd/system/otelcol-contrib.service
Add the service configuration with required permissions:
    [Unit]
    Description=OpenTelemetry Collector for AWS ECS Monitoring
    After=network.target
    [Service]
    ExecStart=/usr/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
    Restart=always
    User=root
    Group=root
    # Environment variables
    Environment=AWS_REGION=us-east-1
    # Required for container access
    SupplementaryGroups=docker
    [Install]
    WantedBy=multi-user.target
Note: Root permissions are required to access container runtime sockets and system files.
Start and Enable the Service
Start the OpenTelemetry Collector service:
    sudo systemctl daemon-reload
    sudo systemctl enable otelcol-contrib
    sudo systemctl start otelcol-contrib
ECS Application Logs
The metrics configuration above uses awscontainerinsightreceiver for infrastructure metrics. To collect application logs from your ECS containers and send them to Last9, use the filelog receiver running as a sidecar.
The service.name Problem
When collecting ECS application logs without explicit resource attribution, the service.name attribute defaults to aws_ecs — the same value for every service on the cluster. This makes it impossible to filter logs by service in Last9.
The fix: use the resourcedetection/ecs detector to populate ECS task metadata attributes, then copy aws.ecs.task.family into service.name.
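The copy step can be pictured as a small function over a resource's attribute map. This is an illustrative Python sketch of the logic only, not collector code; the function name `apply_service_name` is hypothetical. It leaves `service.name` untouched when the task family is missing, so a resource that never went through ECS detection is not overwritten with an empty value.

```python
def apply_service_name(resource_attributes):
    """Return a copy with service.name set from aws.ecs.task.family when present."""
    attributes = dict(resource_attributes)  # do not mutate the caller's map
    family = attributes.get("aws.ecs.task.family")
    if family is not None:
        attributes["service.name"] = family
    return attributes


if __name__ == "__main__":
    detected = {"service.name": "aws_ecs", "aws.ecs.task.family": "payment-service"}
    print(apply_service_name(detected)["service.name"])
```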
Sidecar Collector for ECS Logs
Add an OTel Collector sidecar container to your ECS task definition. The sidecar reads log files written by your application container and forwards them to Last9 with correct resource attributes.
ECS Task Definition (relevant containers section):
    {
      "containerDefinitions": [
        {
          "name": "your-app",
          "image": "your-app-image",
          "logConfiguration": { "logDriver": "json-file" },
          "mountPoints": [
            { "sourceVolume": "app-logs", "containerPath": "/var/log/app" }
          ]
        },
        {
          "name": "otel-collector",
          "image": "otel/opentelemetry-collector-contrib:0.128.0",
          "command": ["--config=/etc/otel/config.yaml"],
          "mountPoints": [
            { "sourceVolume": "app-logs", "containerPath": "/var/log/app", "readOnly": true },
            { "sourceVolume": "otel-config", "containerPath": "/etc/otel" }
          ],
          "environment": [
            { "name": "LAST9_OTLP_ENDPOINT", "value": "https://otlp.last9.io" },
            { "name": "LAST9_AUTH_HEADER", "value": "Basic <your-credentials>" }
          ]
        }
      ],
      "volumes": [{ "name": "app-logs" }, { "name": "otel-config" }]
    }
Collector configuration (config.yaml):
    receivers:
      filelog:
        include: [/var/log/app/*.log]
        start_at: beginning
        operators:
          - type: json_parser
            timestamp:
              parse_from: attributes.timestamp
              layout: "%Y-%m-%dT%H:%M:%S.%fZ"
    processors:
      batch:
        timeout: 5s
        send_batch_size: 10000
      resourcedetection/ecs:
        detectors: ["ecs"]
        timeout: 2s
        override: false
      transform/service_name:
        log_statements:
          - context: resource
            statements:
              # Use ECS task family as service.name instead of the default "aws_ecs"
              - set(attributes["service.name"], attributes["aws.ecs.task.family"]) where attributes["aws.ecs.task.family"] != nil
              - set(attributes["deployment.environment"], "production")
    exporters:
      otlp/last9:
        endpoint: "${env:LAST9_OTLP_ENDPOINT}"
        headers:
          "Authorization": "${env:LAST9_AUTH_HEADER}"
    service:
      pipelines:
        logs:
          receivers: [filelog]
          processors: [resourcedetection/ecs, transform/service_name, batch]
          exporters: [otlp/last9]
What resourcedetection/ecs Populates
The ecs detector queries the ECS task metadata endpoint ($ECS_CONTAINER_METADATA_URI_V4) and adds these resource attributes automatically:
| Attribute | Example Value | Description |
|---|---|---|
| aws.ecs.task.family | payment-service | Task definition family name |
| aws.ecs.task.revision | 12 | Task definition revision |
| aws.ecs.task.arn | arn:aws:ecs:... | Full task ARN |
| aws.ecs.task.id | 9781c248-0edd-4cdb-9a93-f63cb662a5d3 | Task UUID |
| aws.ecs.cluster.arn | arn:aws:ecs:... | Cluster ARN |
| aws.ecs.launchtype | FARGATE or EC2 | Launch type |
The transform/service_name processor then copies aws.ecs.task.family into service.name so logs from the payment-service task family appear under service.name=payment-service in Last9.
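To see what the detector has to work with, you can read the task metadata endpoint yourself from inside a running task. The sketch below is a rough Python approximation, not the detector's implementation: it fetches `$ECS_CONTAINER_METADATA_URI_V4/task` and maps a few response fields (`Family`, `Revision`, `TaskARN`) onto the attribute names in the table. The function names are hypothetical, the cluster and launch type attributes are left out because the collector may normalize them differently, and the sample document in the usage block uses made-up values.

```python
import json
import os
import urllib.request


def fetch_task_metadata(timeout=2.0):
    """Fetch the task metadata document; only works inside an ECS task."""
    base_url = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(f"{base_url}/task", timeout=timeout) as response:
        return json.load(response)


def to_resource_attributes(task_metadata):
    """Map task metadata fields to a subset of the aws.ecs.* resource attributes."""
    task_arn = task_metadata["TaskARN"]
    return {
        "aws.ecs.task.family": task_metadata["Family"],
        "aws.ecs.task.revision": task_metadata["Revision"],
        "aws.ecs.task.arn": task_arn,
        # The task ID is the last path segment of the task ARN.
        "aws.ecs.task.id": task_arn.rsplit("/", 1)[-1],
    }


if __name__ == "__main__":
    # Made-up sample document so the mapping can be tried outside ECS.
    sample = {
        "Family": "payment-service",
        "Revision": "12",
        "TaskARN": "arn:aws:ecs:us-east-1:111122223333:task/demo/0123456789abcdef",
    }
    print(to_resource_attributes(sample))
```

If `aws.ecs.task.family` is missing from your logs in Last9, calling `fetch_task_metadata()` from a shell inside the sidecar is a quick way to check whether the endpoint is reachable at all.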
Using CloudWatch Logs Driver (No Sidecar)
If your tasks already use the awslogs log driver (CloudWatch), forward logs to Last9 using a CloudWatch Logs subscription filter instead of a sidecar. The downside is CloudWatch adds latency (~1 minute) and the log format loses structured fields unless your application emits JSON.
For real-time structured logs with correct service.name, the sidecar pattern above is preferred.
Fargate Logs with FireLens
For ECS on Fargate, use FireLens when your application already writes logs to stdout or stderr and you want the task definition to route those logs through a Fluent Bit log router. This keeps the application container focused on application work while FireLens handles log export.
The pattern has three parts:
- Store OTLP credentials in a private secret or SSM parameter.
- Grant the ECS task execution role read access to that exact secret or parameter.
- Add a FireLens log router container and route the application container through awsfirelens.
Use least-privilege IAM for credential access:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["ssm:GetParameter", "ssm:GetParameters"],
          "Resource": "arn:aws:ssm:<region>:<aws-account-id>:parameter/<ssm-parameter-name>"
        },
        {
          "Effect": "Allow",
          "Action": "secretsmanager:GetSecretValue",
          "Resource": "arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
        }
      ]
    }
If you store the OTLP credential only in SSM, omit the Secrets Manager statement. If you store it only in Secrets Manager, scope the statement to that one secret. Avoid Resource: "*" for telemetry credentials.
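The advice to avoid a wildcard resource can be enforced with a small review script. The following is a minimal sketch, assuming the policy is available as a parsed JSON document; `wildcard_statements` is a hypothetical helper, and it only flags a literal `"*"` resource, not broader ARN patterns.

```python
def wildcard_statements(policy):
    """Return Allow statements whose Resource is, or includes, a bare "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    flagged = []
    for statement in statements:
        resources = statement.get("Resource", [])
        if isinstance(resources, str):  # Resource may be a string or a list
            resources = [resources]
        if statement.get("Effect") == "Allow" and "*" in resources:
            flagged.append(statement)
    return flagged


if __name__ == "__main__":
    too_broad = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "secretsmanager:GetSecretValue", "Resource": "*"}
        ],
    }
    # Prints the one statement that should be scoped to a single secret.
    print(wildcard_statements(too_broad))
```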
Add a FireLens log router container:
    {
      "name": "log_router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": true,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": { "enable-ecs-log-metadata": "true" }
      }
    }
Add FireLens to the application container:
    {
      "name": "checkout-service",
      "image": "<application-image>",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "opentelemetry",
          "Host": "<last9-otlp-host>",
          "Port": "443",
          "Logs_uri": "/v1/logs",
          "Tls": "on",
          "Tls.verify": "on"
        },
        "secretOptions": [
          {
            "name": "Header",
            "valueFrom": "arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
          }
        ]
      }
    }
Replace <last9-otlp-host> and the secret ARN with the values from your Last9 integration setup. The secret value should contain the authorization header expected by the destination, for example Authorization <redacted>.
FireLens adds ECS metadata by default, including the cluster, task ARN, and task definition. Keep this enabled unless you have a deliberate reason to remove it. If you need custom resource attributes such as service.name or deployment.environment, add them in your collector or Fluent Bit configuration rather than relying on unsupported log-driver options.
Use the same service.name and deployment.environment values across logs, metrics, and traces. In Last9, those attributes drive service grouping, filters, dashboards, and trace correlation. For ECS task metadata, also keep resourcedetection/ecs enabled in collector-based paths so resource attributes such as aws.ecs.task.family, aws.ecs.task.revision, and aws.ecs.cluster.arn are available for filtering.
Fargate tasks need a new task-definition revision and service deployment to pick up FireLens, SSM, or secret changes. Plan this as a normal ECS rolling deployment. The application code does not need to change for logs and ECS task metrics, but application containers will restart when the service deploys the new task definition.
The FireLens log router should be sized below the application container. Start with a small CPU and memory allocation, then adjust based on task volume, export latency, and log-router memory usage. For high-volume services, increase task-level CPU or memory instead of starving the application container.
Understanding ECS Metrics
The AWS Container Insights receiver collects comprehensive ECS metrics:
Cluster-Level Metrics
- Cluster Resource Utilization: CPU, memory, and network utilization across the cluster
- Task Count: Running, pending, and stopped tasks
- Service Health: Service status and desired task counts
- Instance Health: EC2 instance health for ECS clusters
Task-Level Metrics
- Task Resource Usage: CPU, memory, disk, and network usage per task
- Task Status: Task lifecycle states (pending, running, stopped)
- Task Duration: Time tasks spend in different states
- Task Placement: Task distribution across container instances
Container-Level Metrics
- Container Resources: CPU usage, memory usage, memory limits per container
- Container I/O: Disk read/write operations and network traffic
- Container State: Container health and restart counts
- Container Logs: Container log metrics and error counts
Service-Level Metrics
- Service Utilization: Resource usage aggregated by ECS service
- Deployment Metrics: Service deployment success/failure rates
- Auto Scaling: Service scaling events and capacity changes
- Load Balancer: ALB/NLB health check results for services
Advanced Configuration
Multi-Cluster Monitoring
Monitor multiple ECS clusters from a single collector:
    receivers:
      awscontainerinsightreceiver/prod:
        collection_interval: 60s
        container_orchestrator: ecs
        cluster_name: "production-cluster"
      awscontainerinsightreceiver/staging:
        collection_interval: 120s
        container_orchestrator: ecs
        cluster_name: "staging-cluster"
    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver/prod, awscontainerinsightreceiver/staging]
Resource Attribution Enhancement
Add detailed resource attributes for better observability:
    processors:
      resource/ecs:
        attributes:
          - key: aws.ecs.cluster.name
            from_attribute: "ClusterName"
            action: upsert
          - key: aws.ecs.service.name
            from_attribute: "ServiceName"
            action: upsert
          - key: aws.ecs.task.family
            from_attribute: "TaskDefinitionFamily"
            action: upsert
          - key: aws.ecs.task.revision
            from_attribute: "TaskDefinitionRevision"
            action: upsert
          - key: environment
            value: "production"
            action: upsert
Custom Metric Filtering
Filter metrics to reduce data volume:
    processors:
      filter/ecs:
        metrics:
          exclude:
            match_type: strict
            metric_names:
              - "container_memory_cache"
              - "container_memory_rss"
          # Only include metrics matching certain criteria
          include:
            match_type: regexp
            metric_names:
              - "container_cpu_.*"
              - "container_memory_usage_bytes"
              - "container_network_.*"
Performance Optimization
Optimize collector performance for high-scale environments:
    receivers:
      awscontainerinsightreceiver:
        collection_interval: 30s # More frequent collection
        timeout: 10s
    processors:
      batch:
        timeout: 10s
        send_batch_size: 5000
        send_batch_max_size: 8000
      memory_limiter:
        limit_mib: 512
        spike_limit_mib: 128
Verification
Check Service Status
Verify the OpenTelemetry Collector is running:
    sudo systemctl status otelcol-contrib
Monitor Service Logs
Check for any configuration errors:
    sudo journalctl -u otelcol-contrib -f
Look for successful receiver initialization messages and metric collection activity.
Verify ECS Access
Test ECS API connectivity:
    # List ECS clusters
    aws ecs list-clusters --region us-east-1
    # Describe cluster
    aws ecs describe-clusters --clusters your-cluster-name --region us-east-1
    # List running tasks
    aws ecs list-tasks --cluster your-cluster-name --region us-east-1
Check Docker Socket Access
Verify the collector can access container runtime:
    # Check Docker socket permissions
    sudo ls -la /var/run/docker.sock
    # Test Docker access
    sudo docker ps
    # Check if collector user can access Docker
    sudo -u root docker ps
Generate Container Activity
Deploy test tasks to generate metrics:
    # Create a simple task definition
    aws ecs register-task-definition \
      --family test-monitoring \
      --container-definitions '[
        {
          "name": "test-container",
          "image": "nginx:latest",
          "memory": 128,
          "essential": true,
          "portMappings": [
            { "containerPort": 80, "protocol": "tcp" }
          ]
        }
      ]'
    # Run the task
    aws ecs run-task \
      --cluster your-cluster-name \
      --task-definition test-monitoring:1
Verify Metrics in Last9
Log in to your Last9 account and check that ECS metrics are being received in Metrics Explorer.
Look for metrics like:
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_network_receive_bytes_total
- ecs_cluster_cpu_utilization
Verify ECS Logs in Last9
Open Logs Explorer and filter for ECS metadata or the service and environment attributes you configured in your collector or Fluent Bit path.
Useful filters include:
- service.name = "checkout-service" if you configured this resource attribute
- deployment.environment = "production" if you configured this resource attribute
- aws.ecs.task.family = "checkout-service"
- aws.ecs.launchtype = "FARGATE"
Open a log line and check Resource Info. ECS logs should include task and cluster attributes such as aws.ecs.task.arn, aws.ecs.task.revision, and aws.ecs.cluster.arn.
Verify Consistent Service and Environment Attributes
Use the same service.name and deployment.environment values for ECS logs, Node.js or Next.js traces, and any dashboard filters. If logs show one service name and traces show another, align the ECS task-definition values with the application instrumentation values.
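A quick way to spot such a mismatch is to copy the resource attributes from one log line and one trace span and compare them. The sketch below is a plain Python illustration with a hypothetical helper name, `attribute_mismatches`, and made-up attribute values. It does not call any Last9 API.

```python
def attribute_mismatches(signals, keys=("service.name", "deployment.environment")):
    """Return {key: {signal: value}} for keys whose values differ across signals."""
    mismatches = {}
    for key in keys:
        values = {name: attributes.get(key) for name, attributes in signals.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


if __name__ == "__main__":
    # Made-up resource attributes copied from a log line and a trace span.
    observed = {
        "logs": {"service.name": "checkout-service", "deployment.environment": "production"},
        "traces": {"service.name": "checkout", "deployment.environment": "production"},
    }
    print(attribute_mismatches(observed))
```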
Key Metrics to Monitor
Critical Container Health Indicators
| Metric | Description | Alert Threshold |
|---|---|---|
| container_memory_usage_bytes | Container memory usage | > 80% of limit |
| container_cpu_usage_seconds_total | Container CPU usage | > 80% consistently |
| container_oom_events_total | Out of memory events | > 0 events |
| ecs_task_running_count | Running task count | < desired count |
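The thresholds in this table mix gauges and counters, so they cannot all be compared directly. Memory usage is divided by the container limit, while `container_cpu_usage_seconds_total` is cumulative and has to be turned into a rate first. The Python sketch below shows that arithmetic under a few assumptions: the CPU threshold is taken relative to the container's CPU allocation in cores, a single sampling interval is evaluated (the table's "consistently" would need several), and all function and parameter names are hypothetical.

```python
MEMORY_THRESHOLD = 0.80  # fraction of the memory limit
CPU_THRESHOLD = 0.80     # fraction of the CPU allocation (assumed meaning)


def cpu_cores_used(previous_seconds, current_seconds, interval_seconds):
    """Average cores used between two samples of a cumulative CPU-seconds counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current_seconds - previous_seconds
    if delta < 0:
        # The counter went backwards, which usually means the container restarted.
        delta = current_seconds
    return delta / interval_seconds


def container_alerts(memory_usage_bytes, memory_limit_bytes,
                     cpu_cores, cpu_limit_cores, oom_events):
    """Return the names of the thresholds from the table that are exceeded."""
    alerts = []
    if memory_limit_bytes > 0 and memory_usage_bytes / memory_limit_bytes > MEMORY_THRESHOLD:
        alerts.append("memory")
    if cpu_limit_cores > 0 and cpu_cores / cpu_limit_cores > CPU_THRESHOLD:
        alerts.append("cpu")
    if oom_events > 0:
        alerts.append("oom")
    return alerts


if __name__ == "__main__":
    # 54 CPU-seconds consumed over a 60 s interval on a 1-core allocation.
    cores = cpu_cores_used(100.0, 154.0, 60.0)
    print(container_alerts(450 * 2**20, 512 * 2**20, cores, 1.0, 0))
```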
Performance Monitoring
| Metric | Description | Monitoring Focus |
|---|---|---|
| container_network_receive_bytes_total | Network ingress traffic | Track application load |
| container_network_transmit_bytes_total | Network egress traffic | Monitor data transfer |
| container_fs_reads_bytes_total | Filesystem read operations | I/O performance |
| container_fs_writes_bytes_total | Filesystem write operations | Storage performance |
Cluster Health
| Metric | Description | Alert Condition |
|---|---|---|
| ecs_cluster_active_services_count | Active services count | Unexpected changes |
| ecs_cluster_registered_container_instances_count | Available EC2 instances | Capacity monitoring |
| ecs_service_desired_count | Desired task count | vs running count mismatch |
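A desired-versus-running mismatch is normal for a short time during deployments and scale-outs, so alerting on a single sample tends to be noisy. One possible approach, sketched here in Python with a hypothetical function name and an arbitrary default of three samples, is to alert only when the running count stays below the desired count for several consecutive samples.

```python
def sustained_mismatch(samples, min_consecutive=3):
    """True when the last `min_consecutive` (desired, running) samples all fall short."""
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = samples[-min_consecutive:]
    if len(recent) < min_consecutive:
        return False  # not enough history to decide
    return all(running < desired for desired, running in recent)


if __name__ == "__main__":
    # (desired, running) pairs, oldest first; the last three all fall short.
    history = [(4, 4), (4, 3), (4, 3), (4, 3)]
    print(sustained_mismatch(history))
```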
Troubleshooting
Collector Permission Issues
Cannot Access Docker Socket:
    # Check Docker socket permissions
    sudo ls -la /var/run/docker.sock
    # Add collector user to docker group (if running as non-root)
    sudo usermod -aG docker otelcol-contrib
    # For systemd service, ensure it runs as root
    sudo systemctl edit otelcol-contrib
    # Add: [Service]
    # User=root
ECS API Access Denied:
    # Verify AWS credentials
    aws sts get-caller-identity
    # Test ECS permissions
    aws ecs list-clusters --region us-east-1
    # Check IAM role attached to EC2 instance
    curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
Missing Metrics
No Container Metrics:
    # Verify containers are running
    docker ps
    # Check if cAdvisor metrics are available
    curl http://localhost:8080/metrics # If cAdvisor is exposed
    # Verify ECS agent is running
    sudo systemctl status ecs
Incomplete ECS Metrics:
    # Check ECS cluster health
    aws ecs describe-clusters --clusters your-cluster-name
    # Verify tasks are running
    aws ecs list-tasks --cluster your-cluster-name
    # Check task definitions
    aws ecs describe-task-definition --task-definition your-task:1
Fargate-Specific Issues
Fargate Metrics Not Available:
- Verify the collector is running within the same VPC as Fargate tasks
- Check that Fargate platform version supports Container Insights
- Ensure proper task role permissions for metric collection
High Resource Usage
Collector Using Too Much Memory:
    processors:
      memory_limiter:
        limit_mib: 256 # Reduce memory limit
        spike_limit_mib: 64
    receivers:
      awscontainerinsightreceiver:
        collection_interval: 120s # Reduce collection frequency
Best Practices
Security
- Root Access: Only run collector as root when necessary for container access
- Network Policies: Implement VPC security groups to restrict collector access
- IAM Roles: Use IAM roles instead of access keys for AWS authentication
- Secret Management: Store sensitive configuration in AWS Secrets Manager
Performance
- Collection Intervals: Balance monitoring granularity with resource usage
- Metric Filtering: Filter out unnecessary metrics to reduce data volume
- Resource Limits: Set appropriate memory and CPU limits for the collector
- Batch Processing: Optimize batch sizes for efficient data transmission
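Before rolling out a metric filter, it can help to preview which metric names would survive it. The Python sketch below approximates the include and exclude lists from the Custom Metric Filtering example. It assumes include rules are applied before exclude rules and that regular expressions must match the whole metric name; check both assumptions against the filter processor documentation for your collector version. `kept_metrics` is a hypothetical helper.

```python
import re

# Lists taken from the Custom Metric Filtering example.
INCLUDE_PATTERNS = [
    r"container_cpu_.*",
    r"container_memory_usage_bytes",
    r"container_network_.*",
]
EXCLUDE_NAMES = {"container_memory_cache", "container_memory_rss"}


def kept_metrics(metric_names):
    """Return the names that pass the include patterns and are not excluded."""
    kept = []
    for name in metric_names:
        included = any(re.fullmatch(pattern, name) for pattern in INCLUDE_PATTERNS)
        if included and name not in EXCLUDE_NAMES:
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(kept_metrics([
        "container_cpu_usage_seconds_total",
        "container_memory_rss",
        "container_memory_usage_bytes",
        "container_fs_reads_bytes_total",
    ]))
```

Under these assumptions, the two excluded names never match an include pattern in the first place, so the exclude list in that example has no additional effect.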
Monitoring Strategy
- Multi-Layer Monitoring: Monitor cluster, service, task, and container levels
- Alerting: Set up alerts for critical metrics like OOM events and high resource usage
- Capacity Planning: Monitor resource utilization trends for scaling decisions
- Cost Optimization: Use appropriate collection intervals to balance cost and visibility
Deployment
- High Availability: Deploy collectors across multiple Availability Zones for redundancy
- Service Discovery: Use ECS service discovery for dynamic service monitoring
- Rolling Updates: Implement rolling updates for collector configuration changes
- Health Checks: Configure health checks for collector containers
Please get in touch with us on Discord or Email if you have any questions.