Mar 3rd, ’25 / 9 min read

EC2 Monitoring: A Practical Guide for AWS Engineers

Learn how to monitor EC2 instances effectively, reduce costs, and prevent outages with practical insights for AWS engineers.

Monitoring your EC2 instances shouldn’t be complicated or exhausting. Yet, too often, engineers find themselves troubleshooting issues in the middle of the night, searching for the root cause of an unexpected failure.

Whether you're managing a few instances or hundreds spread across multiple regions, effective EC2 monitoring helps you stay ahead of problems instead of constantly reacting to them. And if you've ever dealt with a critical alert at an inconvenient hour, you know how important that is.

This guide breaks down EC2 monitoring into clear, practical steps—no jargon, just straightforward advice to help you keep your systems running smoothly.

Why EC2 Monitoring Matters

EC2 monitoring isn't just about knowing if your instances are running—it's about understanding their behavior, predicting problems before they happen, and making sure you're not burning cash on underutilized resources.

Here's what proper EC2 monitoring gives you:

  • Early problem detection: Catch issues while they're minor annoyances, not full-blown outages
  • Performance insights: Know when it's time to scale up, down, or change instance types
  • Cost control: Identify instances that are costing more than they should
  • Security awareness: Spot unusual behavior that might indicate compromise
  • Sleep: Perhaps the most valuable commodity in tech
💡
If your EC2 instances rely on Redis, keeping an eye on its performance is just as important. Learn more in our Redis Metrics Monitoring guide.

Step-by-Step Process for Setting Up EC2 Monitoring

AWS provides basic monitoring out of the box—here's how to make the most of it.

Step 1: Enable Detailed Monitoring

By default, EC2 comes with basic monitoring that sends metrics to CloudWatch every 5 minutes. Detailed monitoring bumps this to every 1 minute.

# Enable detailed monitoring on a new instance
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type t2.micro --monitoring Enabled=true

# Enable detailed monitoring on an existing instance
aws ec2 monitor-instances --instance-ids i-1234567890abcdef0

Is it worth the extra cost? That depends. For production workloads, critical systems, or anything that needs rapid response to problems, absolutely. For dev environments or non-critical systems, maybe not.

Step 2: Set Up Basic CloudWatch Alarms

Now that you've got metrics flowing, let's make sure you're alerted when things go sideways:

  1. Navigate to the CloudWatch console
  2. Select "Alarms" then "Create Alarm"
  3. Choose the EC2 instance and metric (e.g., CPU Utilization)
  4. Set appropriate thresholds (e.g., CPU > 80% for 5 minutes)
  5. Add notification actions (SNS topic, email, etc.)

Pro tip: Don't set thresholds too low, or you'll be drowning in false positives faster than you can say "alert fatigue."
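
If you would rather define the alarm in code than click through the console, the steps above can be sketched in Python. This only builds the arguments for boto3's `put_metric_alarm`; the helper name `cpu_alarm_params`, the alarm name, and the SNS topic ARN are placeholders for illustration, not values from this guide.

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold=80.0):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when average CPU stays above `threshold` percent
    for one 5-minute period, mirroring the console example.
    """
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation period
        "EvaluationPeriods": 1,  # 1 x 300 s = 5 minutes
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = cpu_alarm_params(
    "i-1234567890abcdef0",
    "arn:aws:sns:region:account-id:ops-alerts",  # hypothetical topic ARN
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, `boto3.client("cloudwatch").put_metric_alarm(**params)` would create the alarm. Raising `EvaluationPeriods` is one way to cut down on false positives from short spikes.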

Step 3: Create a CloudWatch Dashboard

A well-organized dashboard lets you spot patterns at a glance:

  1. In CloudWatch, go to "Dashboards" and create a new one
  2. Add widgets for your most important metrics:
    • CPU Utilization
    • Network In/Out
    • Disk Read/Write Operations
    • Status Check Failures

Arrange them logically—group similar instances together or organize by environment (prod, staging, dev).
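
A dashboard can also be described as JSON and kept in version control. The sketch below builds a dashboard body with one graph per metric and one line per instance. The function name `ec2_dashboard_body`, the region, and the instance ID are assumptions for illustration; disk metrics are left out because which ones apply depends on your storage type.

```python
import json


def ec2_dashboard_body(instance_ids, region):
    """Return a CloudWatch dashboard body (JSON string), one widget per metric."""
    metric_names = [
        "CPUUtilization",
        "NetworkIn",
        "NetworkOut",
        "StatusCheckFailed",
    ]
    widgets = []
    for row, metric in enumerate(metric_names):
        widgets.append({
            "type": "metric",
            "x": 0,
            "y": row * 6,   # stack the widgets vertically
            "width": 24,
            "height": 6,
            "properties": {
                "title": metric,
                "region": region,
                "stat": "Average",
                "period": 300,
                # One line per instance so related hosts share a graph
                "metrics": [
                    ["AWS/EC2", metric, "InstanceId", instance_id]
                    for instance_id in instance_ids
                ],
            },
        })
    return json.dumps({"widgets": widgets})


body = ec2_dashboard_body(["i-1234567890abcdef0"], "us-east-1")
print(len(json.loads(body)["widgets"]))
```

The resulting string is what boto3's `put_dashboard(DashboardName=..., DashboardBody=body)` expects, so one function can generate a dashboard per environment.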

💡
If your EC2 instances power APIs, keeping them reliable is crucial. Check out our Top 11 API Monitoring Tools to find the right solution.

The Essential EC2 Metrics You Should Monitor

Not all metrics are created equal. Here are the ones that matter:

System-Level Metrics

Metric | What It Tells You | Warning Signs
CPU Utilization | How hard your instance is working | Sustained periods above 80%
Memory Usage* | Available RAM (requires custom metric) | Consistently above 85%
Disk Space* | Free space on your volumes (requires custom metric) | Less than 20% free space
Network In/Out | Data transfer volume | Sudden spikes or drops

Health Metrics

Metric | What It Tells You | Warning Signs
Status Check (System) | Hardware/AWS issues | Any failures
Status Check (Instance) | OS-level problems | Any failures
Instance State | Whether the instance is running | Unexpected state changes

Load & Performance Metrics

Metric | What It Tells You | Warning Signs
EBS Volume Queue Length | Backup of I/O operations | Consistently above 1
CPU Credit Balance (for burstable instances) | Available burst capacity | Approaching zero
Network Packet Loss | Connection quality | Any non-zero values
💡
For more control over your EC2 monitoring, custom metrics can help track exactly what matters. Learn how to set them up in our AWS CloudWatch Custom Metrics Guide.

Advanced EC2 Monitoring

Basic metrics can only tell you so much. For real visibility, you need to dig deeper.

Step 1: Install the CloudWatch Agent

The CloudWatch agent lets you collect system-level metrics that aren't available by default:

# Install the CloudWatch agent on Amazon Linux 2
sudo yum install amazon-cloudwatch-agent -y

# Create a basic config file
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

# Start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

A minimal config should include:

  • Memory usage
  • Disk space utilization
  • Swap usage
  • Key process metrics
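
As a rough illustration of such a minimal config, the Python below generates the agent's JSON for memory, disk, swap, and one watched process. The process name `nginx` and the choice of measurements are assumptions; adjust them to whatever your instances actually run.

```python
import json


def minimal_agent_config(process_name):
    """Build a small CloudWatch agent metrics config as a dict."""
    return {
        "metrics": {
            # Tag every metric with the instance it came from
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": ["/"],  # root volume only
                },
                "swap": {"measurement": ["swap_used_percent"]},
                "procstat": [
                    {
                        "exe": process_name,
                        "measurement": ["cpu_usage", "memory_rss"],
                    }
                ],
            },
        }
    }


config_json = json.dumps(minimal_agent_config("nginx"), indent=2)
print(config_json)
```

Saving the output to a file and passing it to `amazon-cloudwatch-agent-ctl` with `-c file:<path>` is an alternative to running the interactive wizard.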

Step 2: Set Up Custom Metrics

Some things are specific to your application. Custom metrics help you track what matters to your business:

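# Using the AWS CLI to publish a custom metric
aws cloudwatch put-metric-data --metric-name ActiveUsers --namespace MyApplication --value 42 --unit Count

// Or from within your application code using the AWS SDK for JavaScript (v2)
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

cloudwatch.putMetricData({
  MetricData: [
    {
      MetricName: 'ActiveUsers',
      Value: 42,
      Unit: 'Count'
    }
  ],
  Namespace: 'MyApplication'
}).promise()
  .catch((err) => console.error('putMetricData failed:', err));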

Good candidates for custom metrics:

  • Application-specific counters (users, transactions, etc.)
  • Business metrics (checkout completions, signups)
  • Application health indicators (error rates, response times)
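
To make the health-indicator case concrete, here is a small sketch that turns raw counters from one reporting interval into entries shaped for CloudWatch's `MetricData` parameter. The metric names `Requests`, `ErrorRate`, and `ResponseTime` and the sample numbers are made up for illustration.

```python
def build_metric_data(request_count, error_count, latencies_ms):
    """Convert one interval's raw counters into MetricData entries."""
    # Guard against division by zero on quiet intervals
    error_rate = 100.0 * error_count / request_count if request_count else 0.0
    avg_latency = sum(latencies_ms) / len(latencies_ms) if latencies_ms else 0.0
    return [
        {"MetricName": "Requests", "Value": float(request_count), "Unit": "Count"},
        {"MetricName": "ErrorRate", "Value": error_rate, "Unit": "Percent"},
        {"MetricName": "ResponseTime", "Value": avg_latency, "Unit": "Milliseconds"},
    ]


metric_data = build_metric_data(200, 5, [120.0, 80.0, 100.0])
for entry in metric_data:
    print(entry["MetricName"], entry["Value"], entry["Unit"])
```

The list can be passed as `MetricData=metric_data` to boto3's `put_metric_data` together with a `Namespace`. Publishing a rate alongside the raw count makes it easier to alarm on a percentage rather than an absolute number.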

Step 3: Implement Log Monitoring

Metrics tell you what's happening; logs tell you why.

Set up metric filters to convert log events to metrics:

# Pattern to match PHP fatal errors (quoted so the phrase is matched as a whole)
"PHP Fatal error"
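
A metric filter pairs a pattern like this with a metric to increment. The sketch below builds the arguments for boto3's `put_metric_filter`; the filter name, metric name, namespace, and log group are hypothetical. The phrase is wrapped in double quotes so it is matched as one term.

```python
def fatal_error_filter_params(log_group_name):
    """Build keyword arguments for CloudWatch Logs put_metric_filter."""
    return {
        "logGroupName": log_group_name,
        "filterName": "php-fatal-errors",
        "filterPattern": '"PHP Fatal error"',  # exact phrase match
        "metricTransformations": [
            {
                "metricName": "PhpFatalErrors",
                "metricNamespace": "MyApplication",
                "metricValue": "1",    # add 1 per matching log event
                "defaultValue": 0.0,   # report 0 when nothing matches
            }
        ],
    }


filter_params = fatal_error_filter_params("php-error-logs")  # hypothetical log group
print(filter_params["filterPattern"])
```

Setting a default value of 0 keeps the metric continuous, which tends to make alarms on it behave more predictably than a metric with gaps.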

Create CloudWatch Logs Insights queries for common issues:

# Find error patterns (case-insensitive)
filter @message like /(?i)(error|exception|failed|failure|timeout)/
| stats count(*) by bin(30m)

# Track specific events
filter @message like "Database connection"
| stats count(*) as connectionAttempts by bin(5m)

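Configure the CloudWatch agent to collect logs. The paths below assume a Debian/Ubuntu host; on Amazon Linux 2 the system log is /var/log/messages rather than /var/log/syslog: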

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "system-logs",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/nginx/error.log",
            "log_group_name": "nginx-error-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

AWS's built-in tools are a good start, but third-party solutions offer more power and flexibility.

Here's how they stack up:

Last9: The Telemetry Data Platform for Cloud-Native Monitoring

Why Last9?

  • Trusted by industry leaders like Disney+ Hotstar, Games24x7, CleverTap, and Replit.
  • Optimized for cloud-native environments, balancing performance, cost, and user experience.
  • Seamless integration with OpenTelemetry, Prometheus, and other observability tools.
  • Unifies metrics, logs, and traces, efficiently handling high-cardinality data.
  • Smart alerting and real-time insights via the Last9 Control Plane for proactive monitoring.

Best for:

Engineering teams managing complex distributed systems that require deep visibility without added complexity.

Probo Cuts Monitoring Costs by 90% with Last9

Datadog: The Feature-Rich Option

Strengths:

  • Comprehensive coverage across AWS services
  • Strong APM capabilities
  • Extensive integration library
  • Advanced dashboarding

Best for: Enterprises with diverse technology stacks and dedicated monitoring teams.

New Relic: The APM Specialist

Strengths:

  • Deep code-level visibility
  • Strong focus on application performance
  • Good EC2 resource monitoring
  • Robust alerting system

Best for: Development teams focused on application performance optimization.

Prometheus + Grafana: The Open-Source Combo

Strengths:

  • Complete control and customization
  • No per-host or per-metric fees
  • Powerful query language (PromQL)
  • Highly extensible

Best for: Budget-conscious teams with in-house monitoring expertise.

💡
If you're considering Datadog for EC2 monitoring, understanding its pricing is key. Get a detailed breakdown in our Datadog Pricing Guide.

How to Optimize Costs with Smarter EC2 Monitoring

Monitoring isn't just for reliability—it's for keeping your AWS bill in check too.

Identifying Underutilized Instances

Create a CloudWatch dashboard that highlights resource efficiency:

  1. Add metrics for CPU utilization (average, min, max)
  2. Include memory usage if you're using the CloudWatch agent
  3. Set up a weekly report showing instances with consistently low utilization

For burstable instances (T2/T3/T4g), monitor credit balances. If they're always high, you might be overpaying.
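
One way to produce that weekly list is a small script over exported CPU samples. This is a sketch under assumed thresholds (average below 10% and peak below 40%); the function name and the sample data are illustrative only, and the right cut-offs depend on your workloads.

```python
def find_underutilized(cpu_stats, avg_limit=10.0, max_limit=40.0):
    """Return instance IDs whose average and peak CPU both stay below the limits.

    cpu_stats maps an instance ID to a list of CPU utilization samples (%).
    """
    flagged = []
    for instance_id, samples in cpu_stats.items():
        if not samples:
            continue  # no data: do not guess
        average = sum(samples) / len(samples)
        if average < avg_limit and max(samples) < max_limit:
            flagged.append(instance_id)
    return sorted(flagged)


weekly = {
    "i-aaa": [3.0, 5.0, 4.0],                                      # idle
    "i-bbb": [2.0, 3.0, 4.0, 2.0, 3.0, 2.0, 2.0, 3.0, 2.0, 60.0],  # low average, real peak
    "i-ccc": [55.0, 60.0, 70.0],                                   # busy
}
print(find_underutilized(weekly))  # ['i-aaa']
```

Checking the peak as well as the average avoids flagging instances that sit idle most of the week but genuinely need their capacity for short bursts.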

Right-sizing Recommendations

Use AWS Cost Explorer's Resource Optimization to get automatic suggestions, or build your own logic:

# Pseudocode for basic right-sizing logic
for instance in ec2_instances:
    cpu_util = get_average_cpu_util(instance.id, period='2weeks')
    memory_util = get_average_memory_util(instance.id, period='2weeks')
    
    if cpu_util < 20 and memory_util < 30:
        recommend_downsize(instance)
    elif cpu_util > 80 or memory_util > 80:
        recommend_upsize(instance)
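
The pseudocode above can be made runnable along these lines. This sketch keeps the same 20/30/80 thresholds but takes utilization samples as plain lists, so how you fetch them (CloudWatch, an export, a third-party tool) is left open. The instance IDs and numbers are invented.

```python
from statistics import mean


def recommend(cpu_samples, memory_samples):
    """Return 'downsize', 'upsize', or 'keep' from utilization samples (%)."""
    cpu_util = mean(cpu_samples)
    memory_util = mean(memory_samples)
    if cpu_util < 20 and memory_util < 30:
        return "downsize"
    if cpu_util > 80 or memory_util > 80:
        return "upsize"
    return "keep"


# Two weeks of samples would go here; three per instance keeps the example short
fleet = {
    "i-web-1": ([12.0, 15.0, 9.0], [22.0, 25.0, 28.0]),
    "i-db-1": ([60.0, 70.0, 65.0], [85.0, 90.0, 88.0]),
    "i-api-1": ([45.0, 50.0, 55.0], [40.0, 50.0, 60.0]),
}
recommendations = {
    instance_id: recommend(cpu, memory)
    for instance_id, (cpu, memory) in fleet.items()
}
print(recommendations)
```

Note that `mean` raises an error on an empty list, so instances without data need to be filtered out before calling `recommend`.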

Automating Instance Scheduling

Not all instances need to run 24/7. Use EC2 monitoring data to identify patterns, then implement scheduling:

# Create a CloudWatch Events (EventBridge) rule to stop dev instances at 20:00 UTC on weekdays
aws events put-rule --name "StopDevInstancesNightly" --schedule-expression "cron(0 20 ? * MON-FRI *)"

# Add a target to the rule
aws events put-targets --rule "StopDevInstancesNightly" --targets "Id"="1","Arn"="arn:aws:lambda:region:account-id:function:StopEC2Instances"
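
The rule above targets a Lambda function named StopEC2Instances, which this guide does not show. A minimal sketch of what such a handler might look like follows. It assumes dev instances carry an `Environment=dev` tag, which is a convention invented here. boto3 is imported inside the handler so the parsing helper can be tried locally without AWS access.

```python
def running_instance_ids(describe_response):
    """Extract instance IDs from an EC2 describe_instances response."""
    return [
        instance["InstanceId"]
        for reservation in describe_response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def handler(event, context):
    """Lambda entry point: stop running instances tagged Environment=dev."""
    import boto3  # provided by the Lambda Python runtime

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev"]},  # assumed tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = running_instance_ids(response)
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}


# Local check of the parsing helper with a hand-written response
sample = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-aaa"}, {"InstanceId": "i-bbb"}]},
        {"Instances": [{"InstanceId": "i-ccc"}]},
    ]
}
print(running_instance_ids(sample))  # ['i-aaa', 'i-bbb', 'i-ccc']
```

For larger fleets the describe call would need pagination, and the function's role needs permission to describe and stop instances.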

Performance Tuning Based on EC2 Monitoring Data

Monitoring becomes truly valuable when you use it to improve performance.

Step 1: Establish Performance Baselines

Before tuning, know what "normal" looks like:

  1. Collect at least two weeks of data across various load conditions
  2. Calculate percentiles (p50, p90, p99) for key metrics
  3. Document these baselines for comparison
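
Percentiles are simple to compute once the samples are exported. The sketch below uses the nearest-rank method, which is one of several percentile definitions and may differ slightly from what CloudWatch reports. The CPU samples are invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


cpu = [12, 15, 11, 40, 18, 22, 95, 17, 14, 20]
baseline = {f"p{p}": percentile(cpu, p) for p in (50, 90, 99)}
print(baseline)  # {'p50': 17, 'p90': 40, 'p99': 95}
```

The gap between p50 and p99 here shows why averages alone are a weak baseline: a single spike dominates the tail while barely moving the median.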

Step 2: Identify Bottlenecks

Use your monitoring data to spot constraints:

  • CPU-bound? Look for high CPU utilization but low memory/disk/network usage
  • Memory-bound? Watch for high swap usage or OOM errors in logs
  • I/O-bound? Check EBS volume queue length and I/O operations
  • Network-bound? Monitor network throughput against instance limits
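
The checklist above can be turned into a first-pass classifier. In this sketch the 80% CPU and queue-length-above-1 thresholds come from the metric tables earlier in this guide, while the swap (10%) and network (80% of the instance limit) cut-offs are assumptions chosen only for illustration.

```python
def likely_bottlenecks(cpu_pct, swap_used_pct, ebs_queue_length,
                       network_mbps, network_limit_mbps):
    """Return the resource types that look constrained, in a fixed order."""
    findings = []
    if cpu_pct > 80:
        findings.append("cpu")
    if swap_used_pct > 10:        # assumed threshold
        findings.append("memory")
    if ebs_queue_length > 1:
        findings.append("io")
    if network_mbps > 0.8 * network_limit_mbps:  # assumed threshold
        findings.append("network")
    return findings


print(likely_bottlenecks(95, 0, 0.2, 100, 5000))   # ['cpu']
print(likely_bottlenecks(40, 35, 4.0, 100, 5000))  # ['memory', 'io']
```

More than one finding usually means the constraints are related (for example, swapping driving up disk I/O), so the list is a starting point for investigation rather than a diagnosis.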

Step 3: Implement and Verify Improvements

Make one change at a time and measure the impact:

  1. Modify the potential bottleneck (instance type, EBS volume type, etc.)
  2. Monitor the targeted metrics for 24-48 hours
  3. Compare against your baseline
  4. Document the improvement (or rollback if ineffective)

Real-world example: An e-commerce site was seeing slow response times during peak hours. EC2 monitoring showed CPU utilization spiking to 100%, while memory usage stayed below 40%. Moving from a general-purpose to a compute-optimized instance with more CPU power reduced response times by 62%.

💡
Choosing the right tools for EC2 monitoring is essential. Explore our Best Infrastructure Monitoring Tools to find the best fit for your needs.

Advanced EC2 Monitoring Techniques for Power Users

Try these advanced techniques:

Custom CloudWatch Composite Alarms

Instead of simple threshold-based alarms, create composite conditions:

# Create a composite alarm that triggers only when both CPU and memory are high
aws cloudwatch put-composite-alarm \
  --alarm-name HighResourceUtilization \
  --alarm-rule "(ALARM(HighCPUAlarm) AND ALARM(HighMemoryAlarm))"

This reduces false positives by ensuring multiple conditions are met before alerting.
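
When several alarms feed one composite, building the rule string in code avoids typos. A small sketch, assuming simple alarm names without spaces or special characters; the helper name `all_in_alarm` is made up.

```python
def all_in_alarm(*alarm_names):
    """Build a composite alarm rule that requires every named alarm to be in ALARM."""
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return "(" + " AND ".join(f"ALARM({name})" for name in alarm_names) + ")"


rule = all_in_alarm("HighCPUAlarm", "HighMemoryAlarm")
print(rule)  # (ALARM(HighCPUAlarm) AND ALARM(HighMemoryAlarm))
```

The string can be passed as `AlarmRule` to boto3's `put_composite_alarm`, or as `--alarm-rule` in the CLI command shown above.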

EC2 Instance Group Monitoring

Monitor groups of related instances together:

  1. Create a CloudWatch dashboard with metrics aggregated across instance groups
  2. Set up alarms on the aggregate metrics

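Use metric math with a SEARCH expression to aggregate across a group, for example the average CPU across every instance in the account and Region:

AVG(SEARCH('{AWS/EC2,InstanceId} MetricName="CPUUtilization"', 'Average', 300))

To compare instance types, search the {AWS/EC2,InstanceType} schema instead.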

Synthetic Canary Monitoring

Don't just monitor the infrastructure—test the user experience:

  1. Create a CloudWatch Synthetics canary that simulates user actions
  2. Schedule it to run every few minutes
  3. Alert on failures or performance degradation

# Create a simple canary using the AWS CLI
# (assumes the zipped script is already uploaded to S3 and RUNTIME_VERSION
#  holds a current Synthetics runtime, e.g. a syn-nodejs-puppeteer release)
aws synthetics create-canary \
  --name api-test-canary \
  --artifact-s3-location s3://my-bucket/canary/artifacts \
  --execution-role-arn arn:aws:iam::account-id:role/CanaryRole \
  --schedule Expression="rate(5 minutes)" \
  --run-config TimeoutInSeconds=60 \
  --runtime-version "$RUNTIME_VERSION" \
  --code Handler=index.handler,S3Bucket=my-bucket,S3Key=canary/canary-script.zip

💡
Monitoring EC2 instances is just one piece of the puzzle. Synthetic Monitoring helps you test performance and catch issues before users do.

EC2 Monitoring Strategy: A Practical Example

Let's put it all together with a practical example for a mid-sized web application:

The Architecture

  • 8 EC2 instances across 2 AZs (t3.large)
  • Application tier running Node.js
  • RDS for database
  • ElastiCache for session storage

The Monitoring Setup

  1. Basic Infrastructure Monitoring
    • CloudWatch detailed monitoring enabled
    • Status checks with automated recovery actions
    • EBS volume performance metrics
  2. Application-Specific Monitoring
    • CloudWatch agent collecting custom metrics:
      • Request rate, response time, error rate
      • Node.js event loop lag
      • Heap usage, garbage collection stats
    • Log monitoring for error patterns
  3. User Experience Monitoring
    • Synthetic transactions for critical user flows
    • Real user monitoring via client-side instrumentation
  4. Alerting Strategy
    • P1 alerts (immediate response): Instance failures, severe performance degradation
    • P2 alerts (business hours): Elevated error rates, resource constraints
    • Weekly performance reviews using collected data

The Results

  • 72% reduction in mean time to detection (MTTD)
  • 45% fewer false-positive alerts
  • 30% improvement in instance resource utilization
  • Zero unexpected outages in 6 months

Conclusion

Start with the basics, gradually add complexity as needed, and always tie your monitoring strategy to business outcomes.

Remember these key principles:

  • Monitor what matters to your users and your business
  • Alert on symptoms, not causes
  • Automate routine monitoring tasks
  • Use monitoring data to drive continuous improvement
  • Choose tools that fit your team's skills and workflow
💡
What's your biggest EC2 monitoring challenge? Drop it in our Discord community, where we talk cloud monitoring, trade war stories, and share the hard-earned lessons that never make it into the AWS documentation.

FAQs

What's the difference between basic and detailed EC2 monitoring?

Basic monitoring sends metrics to CloudWatch every 5 minutes and is free. Detailed monitoring increases the frequency to every 1 minute and costs extra, but provides more timely data for alerting and auto-scaling.

Do I need to install anything on my EC2 instances for monitoring?

For basic metrics like CPU and network, no. For memory, disk space, and application-specific metrics, you'll need to install the CloudWatch agent or a third-party monitoring agent.

How much does EC2 monitoring cost?

It varies based on your setup. Basic CloudWatch metrics are free, detailed monitoring costs about $2.10 per instance per month, and custom metrics cost $0.30 per metric per month. Third-party tools typically charge per host or per metric.
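
For a quick estimate using the prices quoted in this answer, the arithmetic looks like the sketch below. Prices change and vary by Region, so treat the constants as this article's figures rather than current AWS pricing. Each unique combination of metric name and dimensions counts as a separate custom metric.

```python
DETAILED_MONITORING_CENTS = 210  # per instance per month, from the figure above
CUSTOM_METRIC_CENTS = 30         # per custom metric per month, from the figure above


def monthly_cost_usd(instances_with_detailed, custom_metrics):
    """Estimate monthly CloudWatch cost in USD, computed in cents to avoid float drift."""
    cents = (instances_with_detailed * DETAILED_MONITORING_CENTS
             + custom_metrics * CUSTOM_METRIC_CENTS)
    return cents / 100


# 8 instances with detailed monitoring plus 5 custom metrics
print(monthly_cost_usd(8, 5))  # 18.3
```

Because agent metrics such as memory and disk are published per instance, the custom metric count tends to grow with fleet size faster than expected.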

Can I monitor Windows EC2 instances the same way as Linux?

Yes, though the CloudWatch agent configuration differs slightly. Windows instances also have some Windows-specific metrics like available memory and page file usage.

What's the best EC2 monitoring tool for a small startup?

Start with CloudWatch and the CloudWatch agent for basic needs. As you grow, consider Last9 for its balance of powerful features and user-friendliness without requiring a dedicated monitoring team.

How do I monitor instances across multiple AWS accounts?

Use AWS Organizations and CloudWatch cross-account observability, or implement a third-party solution like Last9 that supports multi-account monitoring natively.

Can I reduce CloudWatch costs while still maintaining good monitoring?

Yes, by being selective about which metrics you collect and at what frequency. Focus on critical metrics at 1-minute intervals and less important ones at 5-minute intervals. Use metric math instead of custom metrics where possible.

How do I correlate EC2 issues with application problems?

Implement distributed tracing with services like X-Ray or third-party APM tools that can connect infrastructure metrics to application performance. This helps identify whether application slowness stems from code issues or resource constraints.

Authors
Anjali Udasi

Helping to make the tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.