Elasticsearch does a lot right: it's fast, scalable, and makes search feel simple. But when things slow down or break, figuring out what's going on can be frustrating, especially if you're not keeping an eye on the right metrics.
This guide covers Elasticsearch metrics that are worth tracking and how they help you keep your cluster healthy without data overload.
Key Categories of Elasticsearch Metrics
Elasticsearch exposes a wide range of metrics that give visibility into both system behavior and performance. These metrics are essential for diagnosing issues, understanding load, and keeping clusters healthy.
They typically fall into four main categories:
- Cluster-level metrics — track shard allocation, cluster health, and indexing activity
- Node-level metrics — cover CPU, memory usage, and garbage collection
- Index-level metrics — include document counts, index size, and search latency
- Query performance metrics — capture latency, throughput, and failed request rates
What makes these metrics valuable is how they help tie user-facing issues, like slow searches or timeouts, back to specific resource constraints or architectural problems, such as memory pressure or disk I/O bottlenecks.
A Closer Look at the Right Elasticsearch Metrics
Elasticsearch tracks hundreds of metrics internally. While it’s tempting to monitor everything, most teams don’t need that level of detail. This section covers the core metrics worth adding to your dashboard. These are the ones that tend to surface first when performance dips or something breaks.
Cluster Health
Cluster health is often the first place people look, and for good reason. It gives you a high-level view of the system's status. But to understand what’s going on, you’ll need to go beyond green/yellow/red.
Key metrics to track:
- Active shards (%): Shows what percentage of shards are currently active. A drop here can signal allocation problems or nodes going offline.
- Pending tasks: Tells you how many background tasks (like shard relocations or template updates) are queued up. A high number can mean the cluster is overloaded or something is blocking progress.
- Number of nodes: A sudden drop in node count may indicate network splits, node crashes, or resource exhaustion.
These metrics give you a sense of whether the cluster is balanced and responsive—or barely keeping up.
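All three of these values come back from a single call to the cluster health API. A quick check, assuming a default cluster on localhost:9200 (the jq filter is optional and just trims the output to the fields above):
# Cluster status, node count, pending tasks, and active shard percentage
curl -s "localhost:9200/_cluster/health?pretty"
# Optionally narrow the output to the fields discussed above (requires jq)
curl -s "localhost:9200/_cluster/health" | jq '{status, number_of_nodes, number_of_pending_tasks, active_shards_percent_as_number}'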
Node Performance
Each node contributes to the overall cluster performance. Issues at the node level can cause query timeouts, indexing slowdowns, or even outages.
Here’s what to watch:
- JVM heap usage: Elasticsearch runs on the JVM, so heap pressure matters. Monitor both the used heap percentage and GC behavior. Frequent or long GC pauses often mean memory is too tight.
- CPU usage: Occasional spikes are normal, but consistently high CPU can point to expensive queries or an overloaded node.
- Disk I/O and I/O wait: Slow disk operations can throttle indexing and cause search delays. Pay attention to I/O wait time on nodes doing heavy indexing or holding large shards.
- Thread pool queues: Thread pools manage different types of operations (search, indexing, bulk, etc.). Backups in these queues often indicate bottlenecks.
- Circuit breakers: Elasticsearch uses memory circuit breakers to prevent nodes from crashing. If these start tripping, you're running too close to the edge.
- Open file descriptors: If you’re nearing the OS limit, Elasticsearch may fail to open index files, leading to errors that are hard to trace.
Node metrics are especially helpful when one node is acting differently from the rest—it usually means a localized issue worth digging into.
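Most of these node-level signals live in the node stats API, which lets you request only the sections you care about. A sketch, assuming the default port:
# JVM heap and GC, OS-level CPU, disk, thread pools, circuit breakers, and
# process stats (which include open file descriptors)
curl -X GET "localhost:9200/_nodes/stats/jvm,os,fs,thread_pool,breaker,process?pretty"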
Index and Search Performance
Once the cluster and nodes are healthy, the next focus is how well Elasticsearch handles the actual workload—indexing and search.
Important metrics here include:
- Indexing rate: Measures how many documents are being ingested per second. Sudden drops can indicate indexing slowdowns or blocked threads.
- Search rate: Reflects how many queries are being executed. A sharp change might be a traffic shift—or a performance bottleneck causing query queuing.
- Query latency: How long it takes to run a query, including coordinating across shards. Use percentiles (like p95) for a clearer picture under load.
- Fetch latency: Time taken to retrieve the matching documents after a query. If this is slow, it often points to disk or cache issues.
- Failed searches: Non-zero error rates usually suggest query timeouts or memory issues—these deserve immediate attention.
- Search throttling: Elasticsearch will throttle requests if a node is under pressure. Even occasional throttling can signal that you’re close to capacity.
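The counters behind these rates and latencies are exposed per index through the index stats API; your monitoring agent derives rates by sampling the totals over time. For example (the index name is a placeholder):
# Indexing and search counters for one index: totals plus time spent, in millis
curl -X GET "localhost:9200/my_index/_stats/indexing,search?pretty"
Dividing query_time_in_millis by query_total gives a rough average query latency, though a proper monitoring backend will do this math (and percentiles) for you.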
A Quick Reference to Search Performance Benchmarks
| Metric | Good | Warning | Critical |
|---|---|---|---|
| Query latency (p95) | < 100ms | 100–500ms | > 500ms |
| Indexing rate variance | < 10% | 10–25% | > 25% |
| Failed searches | 0% | 0–1% | > 1% |
| Search throttling | None | Occasional | Frequent |
How to Access and Use Elasticsearch Metrics
Elasticsearch exposes several APIs for pulling these metrics, each suited to a different use case: quick debugging, automated monitoring, or long-term observability.
Let’s walk through the most common options.
1. Stats API: Best for Structured, Machine-Readable Metrics
The Stats API provides detailed JSON output, making it ideal for integration with external monitoring systems. It covers node-level stats like memory, disk, and CPU usage, as well as index-level metrics like document count and query latency.
Useful endpoints include:
# Get detailed node-level stats
curl -X GET "localhost:9200/_nodes/stats"
# Fetch overall cluster stats
curl -X GET "localhost:9200/_cluster/stats"
# View stats for a specific index
curl -X GET "localhost:9200/my_index/_stats"
These responses are verbose but structured—perfect for forwarding into tools like Prometheus, Elasticsearch itself, or managed platforms like Last9. Most monitoring agents and exporters pull from these endpoints.
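If you're poking at these endpoints by hand rather than through an agent, a small jq filter keeps the verbose JSON manageable. A sketch for pulling heap usage per node, assuming jq is installed:
# Heap usage percentage per node, extracted from the node stats JSON
curl -s "localhost:9200/_nodes/stats/jvm" | jq '.nodes[] | {name, heap_used_percent: .jvm.mem.heap_used_percent}'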
2. Cat API: Best for Quick, Human-Readable Checks
The Cat API is Elasticsearch’s simplified, table-like view of internal metrics. It’s designed to be readable in a terminal or browser, which makes it great for quick health checks or during incident debugging.
Examples:
# View cluster health status
curl -X GET "localhost:9200/_cat/health?v"
# Basic node stats (CPU, heap, disk)
curl -X GET "localhost:9200/_cat/nodes?v&h=name,cpu,load_1m,heap.percent,disk.used_percent"
This output isn’t meant for parsing programmatically, but it’s great when you just need a quick snapshot of what’s going on—without digging through JSON.
3. Prometheus Exporter: Best for Long-Term Monitoring and Alerting
For teams using Prometheus, the Elasticsearch Exporter provides a clean way to expose Elasticsearch metrics in Prometheus format.
How it works:
- The exporter scrapes Elasticsearch’s Stats API on a regular interval.
- It converts the metrics into Prometheus format with consistent labels and types.
- You can set up alerts, store time-series data, and visualize trends using Grafana or another frontend.
This method works especially well if you're already collecting metrics from other systems with Prometheus and want to keep everything in one place.
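A minimal way to try it out is the community elasticsearch_exporter; the image name and flags below may differ by version, so treat this as a sketch:
# Run the exporter (listens on 9114 by default) and point it at your cluster;
# host.docker.internal works on Docker Desktop -- use your node's address otherwise
docker run --rm -p 9114:9114 quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri="http://host.docker.internal:9200"
# Verify metrics are being exposed in Prometheus format
curl -s "localhost:9114/metrics" | grep elasticsearch_cluster_health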
Setting Up Practical Alerting for Elasticsearch Metrics
Metrics aren’t that helpful unless you know when to act on them. The trick is to create alerts that point to real problems, without setting off false alarms every few minutes.
Here’s how to set up alerts that give you the right signal at the right time.
When Your Cluster Isn’t Healthy
Start with the basics: Is the cluster even functioning the way it should?
- Cluster status is red – this usually means primary shards are unassigned. Alert right away.
- Cluster stays yellow for more than 5 minutes – yellow isn’t always bad (it can just mean replica shards are still being allocated), but if it lingers, something’s probably stuck.
- Active shards drop below 95% – shard allocation may be lagging, or a node may have dropped off.
These alerts help catch cluster-wide issues before they snowball.
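If you don't have an alerting pipeline wired up yet, even a small cron-able script against the cluster health API covers the first two checks. A rough sketch, assuming jq is available and the default port:
# Exit non-zero when the cluster is red, or warn when it's yellow
STATUS=$(curl -s "localhost:9200/_cluster/health" | jq -r '.status')
if [ "$STATUS" = "red" ]; then
  echo "ALERT: cluster status is red" >&2
  exit 2
elif [ "$STATUS" = "yellow" ]; then
  echo "WARN: cluster status is yellow -- check again in a few minutes" >&2
  exit 1
fi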
When System Load Puts Pressure on the Cluster
High resource usage won’t always take down your cluster—but it’s usually the first warning sign.
- JVM heap usage above 85% for more than 5 minutes – leaves very little room for garbage collection. If this holds, expect slower queries or even OOMs.
- CPU usage above 80% for 10+ minutes – not all spikes are bad, but sustained load like this can indicate expensive queries or overloaded nodes.
- Disk usage above 80% on any node – Elasticsearch starts relocating shards and eventually blocks writes once it crosses its disk watermarks (85%, 90%, and 95% by default). Set alerts before you get anywhere near them.
These are the “things might break soon” metrics. Don’t ignore them, even if users haven’t noticed anything yet.
When Queries Slow Down or Start Failing
Not everything breaks loudly. Sometimes, things just get slower. These alerts help catch that early.
- Query latency jumps by more than 50% from the usual – absolute numbers are useful, but tracking change over time is often more reliable.
- Indexing rate drops by more than 30% – could point to a stuck thread pool, slow disk, or backpressure from the application.
- Search error rate goes above 1% – might not seem like much, but it usually means something is starting to go wrong.
These alerts help you catch performance regressions before they turn into outages.
Keep Alerts Actionable
A good rule of thumb: every alert should make someone do something.
- Add thresholds and durations to reduce noise (e.g. "only alert if heap is high for 5+ mins")
- Use historical data to set baselines for what’s “normal” in your system
- Don’t alert on things you wouldn’t take action on—disable those or send them to a low-priority channel
The goal isn’t more alerts; it’s the right ones, and only when they matter.
Advanced Elasticsearch Metrics Techniques You Should Know
Once you're comfortable with the basics, you can get more out of Elasticsearch by adding metrics that reflect what’s happening in your application, not just what the cluster is doing.
Adding Custom Metrics
Built-in metrics tell you a lot about cluster and node health, but they can’t explain how specific features or user actions affect performance. That’s where custom metrics help.
For example, you can capture how long a search takes from the application's point of view:
// Assumes the Java high-level REST client; other clients expose similar timing info
SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
long queryTimeMillis = response.getTook().getMillis();
// Send this value to your metrics backend
This kind of data helps you track things like query time per endpoint, errors tied to a specific feature, or usage patterns across different tenants.
Connecting Metrics to App Behavior
The real value comes when you tie metrics to changes in your system. A few patterns that help:
- Before vs. after deployments – Track search latency or error rates to catch regressions early
- User traffic patterns – See how traffic spikes impact indexing or query performance
- Heavy queries – Identify slow or resource-intensive searches so you can optimize where it counts
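For the heavy-query case, Elasticsearch's search slow log is the usual starting point: set per-index thresholds so expensive searches get logged. The index name and thresholds here are placeholders:
# Log queries slower than 1s and fetches slower than 500ms for this index
curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.fetch.warn": "500ms"
}'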
Troubleshooting Elasticsearch Using Metric Patterns
Some metric patterns consistently point to specific issues in Elasticsearch. Recognizing these early helps speed up troubleshooting and reduces downtime.
High CPU + High GC + Low Query Throughput
What it suggests: Your nodes are likely under memory pressure.
- Frequent garbage collection means the heap is filling up too quickly.
- High CPU usage may come from the JVM trying to keep up with GC or from complex query processing.
- If throughput is low even with high CPU, inefficient queries or too many aggregations could be the cause.
What to check:
- Look at GC logs for long or frequent pauses
- Review your query patterns: are they using scripts, nested fields, or sorting and aggregating on analyzed text fields (which loads fielddata onto the heap)?
- Check if field data or aggregations are using too much memory
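Two quick ways to confirm this pattern from the APIs covered earlier (the jq filter is optional):
# Fielddata memory per node -- large values point at sorting or aggregating on text fields
curl -X GET "localhost:9200/_cat/fielddata?v"
# GC collection counts and times per node
curl -s "localhost:9200/_nodes/stats/jvm" | jq '.nodes[] | {name, gc: .jvm.gc.collectors}'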
Low CPU + High Disk I/O Wait + Slow Indexing
What it suggests: Disk is the bottleneck.
- If CPU isn’t being used but things feel slow, it’s often because Elasticsearch is waiting on disk.
- Indexing is especially sensitive to disk I/O—merges and flushes can slow down write throughput if storage is underpowered.
What to check:
- Monitor I/O wait on the affected node(s)
- Look at indexing and merge times
- Consider upgrading to SSDs if you're on spinning disks
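To confirm this pattern, compare what the OS reports with Elasticsearch's own view of disk and merge activity:
# OS-level I/O wait on the node (Linux, needs the sysstat package)
iostat -x 5 3
# Elasticsearch's view: filesystem stats plus merge totals per node
curl -X GET "localhost:9200/_nodes/stats/fs,indices?pretty"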
Fluctuating Cluster State + Ongoing Shard Movement
What it suggests: You may have network instability or node flapping.
- When nodes go in and out of the cluster, Elasticsearch reshuffles shards, which can disrupt indexing and queries.
- Repeated shard reallocation also causes performance overhead.
What to check:
- Look at the cluster state change logs
- Check node uptime and connectivity
- Investigate hardware or cloud issues that could be causing nodes to drop out
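Two quick checks for flapping: node uptimes (short uptimes usually mean restarts or crashes) and the allocation explain API, which tells you why a shard is unassigned or moving:
# Uptime per node -- suspiciously short values suggest crashes or restarts
curl -X GET "localhost:9200/_cat/nodes?v&h=name,uptime,master"
# Why the cluster can't place a shard (returns an error if nothing is unassigned)
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"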
Thread Pool Rejections + High CPU
What it suggests: Your cluster is under sustained load and can’t keep up.
- Thread pools (for search, indexing, bulk, etc.) have size limits. Once full, they start rejecting new tasks.
- High CPU along with rejections usually points to workload saturation—either too much traffic or poorly optimized queries.
What to check:
- Monitor rejected tasks by thread pool type
- Use slow query logs to identify expensive operations
- Consider scaling the cluster or adding resource limits on client-side queries
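Rejections per pool are easiest to read from the cat API, sorted so the worst offenders come first:
# Active, queued, and rejected tasks per thread pool, highest rejection counts first
curl -X GET "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=rejected:desc"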
Circuit Breaker Trips + Slow Queries
What it suggests: You’re hitting memory limits during query execution.
- Circuit breakers are Elasticsearch’s safety mechanism to avoid out-of-memory crashes.
- These often trip during large aggregations, deep pagination, or the use of field data that isn't optimized.
What to check:
- Which breaker is tripping? (field data, request, parent)
- Are aggregations pulling too much data into memory?
- Consider using doc values and avoiding unbounded terms aggregations
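The node stats breaker section shows which breaker is tripping and how often. A quick way to pull just the trip counts (jq optional):
# Trip counts per circuit breaker, per node -- non-zero values are the ones to chase
curl -s "localhost:9200/_nodes/stats/breaker" | jq '.nodes[] | {name, tripped: (.breakers | map_values(.tripped))}'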
Wrapping Up
You can get a lot out of Elasticsearch’s built-in metrics, but connecting them to the bigger picture is where monitoring gets really useful.
At that point, it helps to have a platform like Last9 in place. It works well with Prometheus and OpenTelemetry, so you can bring in Elasticsearch metrics, tie them to logs and traces, and see what’s happening across your system.
One of the things that stands out is how well our platform handles high-cardinality data. If you need to track metrics per index, query, or tenant, it just works, without making things slow or expensive.
Talk to us to learn more, or get started for free today!
FAQs
How often should I collect Elasticsearch metrics?
For most metrics, collecting every 10-30 seconds provides a good balance between visibility and overhead. Critical metrics like cluster health can be collected more frequently (every 5 seconds), while slower-changing metrics like index statistics can be collected less often.
What are the most critical Elasticsearch metrics to monitor?
The most critical metrics include: cluster health status, JVM heap usage, indexing and search rates, query latency, garbage collection patterns, thread pool rejections, CPU usage, and disk I/O. These give you a comprehensive view of your cluster's health and performance.
Does collecting metrics impact Elasticsearch performance?
When implemented properly, metrics collection should have minimal impact (less than 1% overhead). However, excessively frequent polling of complex stats endpoints can add load to your cluster. Start with conservative collection intervals and adjust as needed.
Should I store my Elasticsearch metrics in Elasticsearch itself?
While Elasticsearch can store its own metrics (a pattern called "self-monitoring"), it's generally better to use a separate system for monitoring data. This separation ensures that monitoring remains available even when your primary Elasticsearch cluster is having issues.
How long should I retain Elasticsearch metrics data?
A common pattern is to keep high-resolution data (10-second intervals) for 1-2 days, medium resolution (1-minute intervals) for 1-2 weeks, and low resolution (5-minute intervals) for 6-12 months. This tiered approach balances troubleshooting needs with storage costs.
How can I correlate Elasticsearch metrics with user experience?
Implement distributed tracing that connects frontend requests through your application tier to Elasticsearch queries. OpenTelemetry provides a standardized way to achieve this correlation, giving you end-to-end visibility into how Elasticsearch performance affects user experience.