When your applications rely on HBase's distributed power, visibility isn't optional—it's your safety net. Most outages can be prevented with the right monitoring approach.
This guide gives you the tools to keep your HBase clusters running smoothly—key metrics that signal problems early, troubleshooting techniques for common issues, and practical strategies for both beginners and seasoned engineers.
What is HBase Monitoring?
HBase is an open-source, non-relational distributed database modeled after Google's Bigtable. Running on top of HDFS (the Hadoop Distributed File System), it provides random, real-time read/write access to your big data, handling billions of rows across millions of columns.
Why does monitoring HBase matter? Simple—when your data layer fails, everything fails. HBase monitoring helps you:
- Catch performance issues before users notice
- Understand resource utilization patterns
- Plan capacity ahead of time
- Reduce mean time to recovery (MTTR)
- Sleep better at night (seriously)
Key HBase Metrics You Should Monitor
Let's cut to the chase—here are the metrics that matter when monitoring HBase:
System-Level Metrics
These metrics give you the 30,000-foot view of your HBase cluster health:
- CPU Usage: High CPU can signal issues with compactions or too many requests
- Memory Usage: Watch for JVM heap usage and garbage collection patterns
- Disk I/O: HBase is storage-heavy, so disk throughput bottlenecks hurt
- Network I/O: Track bytes in/out, especially with region servers
- JVM Metrics: Garbage collection pauses can lead to timeout issues
HBase-Specific Metrics
This is where you get the real intel on your HBase performance:
- Request Latency: How long read/write operations take
- Region Server Load: Balance across your cluster
- Compaction Queue Size: Large queues mean your system is falling behind
- BlockCache Hit Ratio: Lower than 80% might signal configuration issues
- Memstore Size: Approaching flush thresholds causes performance spikes
Here's a quick reference table of critical metrics and their healthy ranges:
| Metric | Warning Threshold | Critical Threshold | Notes |
|---|---|---|---|
| Read Latency | > 20ms | > 100ms | Higher for complex scans |
| Write Latency | > 10ms | > 50ms | Watch for spikes |
| BlockCache Hit Ratio | < 85% | < 75% | Tune cache size if low |
| Compaction Queue | > 2000 | > 5000 | May need more resources |
| Memstore Flush Size | N/A | Near max | Check flush frequency |
How to Set Up Basic HBase Monitoring
Let's get practical with a step-by-step approach to monitoring your HBase clusters:
Step 1: Enable HBase Metrics Collection
First, configure your HBase cluster to expose metrics over JMX. Add these lines to your hbase-env.sh:
export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
Restart your HBase services to apply these changes:
$ stop-hbase.sh
$ start-hbase.sh
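To confirm metrics are being exposed, you can query the JMX JSON servlet that HBase serves on its web UI ports (16010 on the master and 16030 on region servers by default; adjust if you've changed them):
$ curl -s http://localhost:16030/jmx | head -n 20
You should see JSON describing MBeans such as Hadoop:service=HBase,name=RegionServer,sub=Server.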
Step 2: Install Your Collection Agent
You'll need something to collect and store these metrics. For beginners, here's how to set up Prometheus:
- Download Prometheus:
$ wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz
$ tar xvfz prometheus-2.37.0.linux-amd64.tar.gz
$ cd prometheus-2.37.0.linux-amd64/
- Configure Prometheus to scrape HBase metrics by adding this to prometheus.yml. The targets point at the JMX Exporter ports set up in the next steps (8081 on the master, 8080 on each region server), not the raw JMX ports:
scrape_configs:
  - job_name: 'hbase'
    metrics_path: /metrics
    static_configs:
      - targets: ['hbase-master:8081', 'regionserver1:8080', 'regionserver2:8080']
- Install and run the JMX Exporter on each HBase server:
$ wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.17.0/jmx_prometheus_javaagent-0.17.0.jar
- Create an hbase.yml configuration file:
---
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: "Hadoop<service=HBase, name=RegionServer, sub=Regions><>storeFileSize: (.*)"
    name: hbase_region_size
    attrNameSnakeCase: true
  - pattern: "Hadoop<service=HBase, name=RegionServer, sub=Regions><>storeFileCount: (.*)"
    name: hbase_region_storefiles
    attrNameSnakeCase: true
- Add the JMX exporter to your HBase startup:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -javaagent:/path/to/jmx_prometheus_javaagent-0.17.0.jar=8080:hbase.yml"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -javaagent:/path/to/jmx_prometheus_javaagent-0.17.0.jar=8081:hbase.yml"
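After restarting with the agent attached, a quick check confirms each exporter is serving Prometheus-formatted metrics on the ports configured above (8080 on region servers, 8081 on the master):
$ curl -s http://localhost:8080/metrics | grep -i hbase | head
If nothing matches, compare the patterns in hbase.yml against the MBean names your HBase version actually exposes.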
Step 3: Set Up Visualization
Install Grafana to visualize your HBase metrics:
- Install Grafana:
$ wget https://dl.grafana.com/oss/release/grafana-9.1.2.linux-amd64.tar.gz
$ tar -zxvf grafana-9.1.2.linux-amd64.tar.gz
$ cd grafana-9.1.2/
$ ./bin/grafana-server
- Add Prometheus as a data source in Grafana, either through the UI steps below or by provisioning it from a file (see the example after this list):
- Go to Configuration > Data Sources
- Click "Add data source"
- Select "Prometheus"
- Set URL to http://localhost:9090 (or wherever Prometheus is running)
- Click "Save & Test"
- Import a starter HBase dashboard:
- Go to Dashboards > Import
- Enter dashboard ID 13055 (a community HBase dashboard)
- Select your Prometheus data source
- Click "Import"
Step 4: Configure Essential Alerts
Set up basic alerts in Prometheus by adding these rules to your prometheus.yml:
rule_files:
  - "hbase_alerts.yml"
Create hbase_alerts.yml with:
groups:
  - name: hbase-alerts
    rules:
      - alert: HBaseRegionServerDown
        expr: up{job="hbase"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "HBase RegionServer down"
          description: "HBase region server has been down for more than 2 minutes."
      - alert: HBaseHighMemstoreUsage
        expr: hadoop_hbase_regionserver_server_memstoresize > 0.8 * hadoop_hbase_regionserver_server_memstoresizeupper
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memstore usage"
          description: "HBase memstore is approaching flush threshold."
Simplify with a Managed Observability Solution like Last9
If this setup seems complex, you're not wrong. This is where managed observability solutions like Last9 can save you time.
Last9 connects to your HBase clusters via OpenTelemetry or Prometheus, automatically collecting the right metrics without all the manual configuration.
Our platform gives you:
- Pre-built HBase dashboards that show what matters
- Intelligent alerts based on your cluster's actual behavior
- Correlation between HBase issues and application performance
- Cost predictability with event-based pricing
Whether you build your own monitoring stack or use a platform like Last9, the key is to start simple and expand as you learn which metrics matter most for your specific HBase workloads.

Advanced HBase Monitoring Techniques
Once you've got the basics down, here's how to level up your monitoring game:
Distributed Tracing
Tracing helps you understand request flows through your entire system, not just within HBase. Use OpenTelemetry to instrument your applications and trace requests from client to HBase and back.
This is especially useful for tracking down mysterious latency issues that cross service boundaries. You'll see exactly where time is spent—whether it's in network hops, region server processing, or compaction delays.
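As a minimal sketch, here's a manually instrumented HBase read using the OpenTelemetry Java API. The span and attribute names are illustrative, and an OpenTelemetry SDK or agent still has to be configured separately to export the spans:
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TracedHBaseRead {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("hbase-client-demo");

    // Wraps a single Get in a span so the HBase call shows up alongside the rest of the request trace
    public static Result tracedGet(Connection connection, String tableName, String rowKey) throws Exception {
        Span span = tracer.spanBuilder("hbase.get").startSpan();
        try (Scope ignored = span.makeCurrent();
             Table table = connection.getTable(TableName.valueOf(tableName))) {
            span.setAttribute("db.system", "hbase");
            span.setAttribute("db.hbase.table", tableName);
            return table.get(new Get(Bytes.toBytes(rowKey)));
        } catch (Exception e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}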
Custom JMX Metrics
HBase's built-in metrics are helpful, but sometimes you need more. You can expose custom metrics by registering your own Gauge with HBase's metrics API (org.apache.hadoop.hbase.metrics), most commonly from inside a coprocessor:
// Example of registering a custom metric from a region coprocessor's start() method,
// where "env" is the RegionCoprocessorEnvironment passed in by HBase
MetricRegistry registry = env.getMetricRegistryForRegionServer();
registry.register("custom_metric_name", new Gauge<Integer>() {
    @Override
    public Integer getValue() {
        // Compute and return the current value of your custom metric
        return calculateYourMetricValue();
    }
});
Predictive Monitoring
Move beyond reactive firefighting with predictive monitoring. By analyzing metric trends over time, you can:
- Predict when you'll need to add capacity
- Detect anomalies before they become outages
- Identify recurring patterns that could suggest misconfigurations
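A simple place to start is a Prometheus predict_linear rule. The example below assumes node_exporter is running on your HBase hosts and that the data volume is mounted at /data (both assumptions); it fires when the current 6-hour trend would fill the disk within four days:
# Add to the rules list in hbase_alerts.yml
- alert: HBaseDataDiskFillingUp
  expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[6h], 4 * 86400) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Data disk projected to fill within 4 days"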
Last9's anomaly detection capabilities can help spot unusual patterns in your HBase metrics before they cause problems.
Common HBase Issues and How to Spot Them
Now for the good stuff—how to identify specific HBase problems through monitoring:
Region Server Hotspotting
Symptoms: Uneven request load across region servers, with some servers handling significantly more requests than others.
Metrics to Watch:
- Request count per region server
- IO load per server
- CPU utilization differences between servers
Fix: Consider pre-splitting regions or redesigning row keys to distribute load more evenly.
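For example, if you salt row keys with a single leading digit, you can pre-split the table at those boundaries from the HBase shell (the table and column family names here are hypothetical):
$ hbase shell
hbase> create 'events', 'd', SPLITS => ['1', '2', '3', '4', '5', '6', '7', '8', '9']
This creates ten regions up front, so new writes spread across region servers instead of piling onto one.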
Compaction Storms
Symptoms: Sudden spikes in latency, disk I/O, and CPU usage across multiple region servers.
Metrics to Watch:
- Compaction queue size
- Disk I/O rates
- Read/write latencies during compaction
Fix: Tune compaction settings to spread the load or schedule major compactions during off-peak hours.
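One common pattern, sketched below, is to disable time-based major compactions in hbase-site.xml and trigger them yourself during a quiet window (the table name and schedule are placeholders):
<!-- hbase-site.xml: a value of 0 disables periodic major compactions -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
Then schedule the compaction yourself, for example from cron:
# Run a major compaction at 3 AM via the non-interactive HBase shell
0 3 * * * echo "major_compact 'events'" | hbase shell -n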
Memory Pressure
Symptoms: Frequent garbage collection pauses, increasing latency.
Metrics to Watch:
- JVM heap usage
- GC pause duration and frequency
- Old generation collection time
Fix: Tune JVM settings, consider increasing heap size (but beware of long GC pauses), or add more region servers to distribute memory load.
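As a starting point, many teams run region servers on G1GC with an explicit pause target. The settings below go in hbase-env.sh; the heap size is illustrative and should be validated against your own GC logs:
# Example region server heap and GC settings (tune for your hardware and workload)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms16g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=100"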
Slow Region Recovery
Symptoms: Regions take too long to reassign after server failures.
Metrics to Watch:
- Time for regions in transition
- ZooKeeper session timeouts
- Region server startup time
Fix: Check the network between region servers and ZooKeeper, ensure HDFS is healthy, and possibly increase handler counts.
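If handler counts or ZooKeeper timeouts turn out to be the limiting factor, both live in hbase-site.xml; treat the values below as illustrative rather than recommendations:
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>60</value> <!-- default is 30; raise gradually and watch memory -->
</property>
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value> <!-- milliseconds -->
</property>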
Alerting Strategy for HBase
All the monitoring in the world won't help if you don't know when to take action. Here's how to set up effective alerts:
Alert Tiers
Structure your alerts in tiers to avoid alert fatigue:
- Info: Metric trends worth noting, but not actionable
- Warning: Thresholds approaching danger zones
- Critical: Immediate action required
What to Alert On
Not everything needs to wake someone up at night. Focus on:
- Critical Service Availability: Master or region server down
- Customer Impact Metrics: Read/write latency beyond thresholds
- Resource Constraints: Disk space, memory, or network bottlenecks
- Canary Tests: Synthetic transactions failing
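HBase ships with a canary tool that performs exactly these synthetic probes; recent versions expose it as a subcommand of the hbase script. Run it from cron or your scheduler and alert on a non-zero exit code (the table name is a placeholder):
# Probe regions of a specific table; omit the table name to probe all tables
$ hbase canary events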
Reducing Alert Noise
Too many alerts lead to ignored alerts. Reduce noise by:
- Setting appropriate thresholds based on baseline measurements
- Implementing alert dampening for flapping conditions
- Creating composite alerts that trigger only when multiple conditions are met
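As an example of a composite alert, the rule below fires only when write latency and the compaction backlog are elevated at the same time. The metric names depend on your JMX Exporter rules, so treat them as placeholders:
- alert: HBaseWritePathDegraded
  expr: >
    hbase_regionserver_mutate_mean_ms > 50
    and on(instance)
    hbase_regionserver_compaction_queue_length > 2000
  for: 10m
  labels:
    severity: warning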
HBase Monitoring Automation
Take your monitoring to the next level with automation:
Auto-Remediation
Some issues can be fixed without human intervention:
- Automatic restart of failed HBase services
- Triggering compactions during low-traffic periods
- Balancing regions when an imbalance is detected
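For example, a remediation job can nudge the master to rebalance regions through a non-interactive HBase shell call:
# Ask the master to run the region balancer once; it's a no-op if nothing needs to move
$ echo "balancer" | hbase shell -n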
Runbooks
For issues that need human attention, prepare runbooks that detail:
- Detailed diagnosis steps
- Common fixes with commands to run
- Escalation paths when standard fixes don't work
Last9's correlation capabilities help identify the root cause faster, making your runbooks more effective by pointing you in the right direction immediately.
How to Scale Your HBase Monitoring
As your clusters grow, so should your monitoring approach:
High-Cardinality Challenges
With large HBase deployments, we often encounter high-cardinality metrics—thousands of regions across multiple servers can generate millions of time series. This is where our purpose-built observability platform, Last9, comes in. We’re designed to handle high-cardinality data efficiently, without driving up costs.
Cross-Cluster Visibility
Most organizations run multiple HBase clusters. Unified dashboards that show health across all clusters help spot patterns and compare performance between environments.
Conclusion
HBase monitoring doesn't need to be complicated, but it does need to be thorough. Start with the basics—system metrics, HBase-specific metrics, and simple alerting. As you grow more comfortable, add advanced techniques like distributed tracing and predictive analytics.
FAQs
How often should I check my HBase metrics?
Critical metrics should be monitored in real-time with dashboards, while trend analysis can be done weekly or monthly for capacity planning.
What's the difference between HBase metrics and HDFS metrics?
HBase metrics focus on database operations like read/write latency and region management, while HDFS metrics track underlying storage concerns like block replication and namenode health. You need both for complete visibility.
Can I monitor HBase without impacting performance?
Yes, with proper configuration. JMX has minimal overhead, and modern monitoring solutions can sample metrics intelligently to reduce impact. If concerned, start with fewer metrics and increase gradually.
How do I correlate HBase performance with application issues?
Distributed tracing provides the clearest picture, showing how HBase operations fit into broader request flows. Tools that combine metrics and traces, like Last9, make this correlation easier.
What's the minimum set of metrics for a small HBase cluster?
Even small clusters should monitor: region server status, read/write latency, compaction queue size, and basic system metrics (CPU, memory, disk I/O).