
Apr 14th, ‘25 / 12 min read

Database Monitoring Metrics: What to Track & Why It Matters

Not all database metrics are created equal. Learn which ones to track, why they matter, and how they help you stay ahead of performance issues.

Let’s be honest—your database isn’t just another component. It’s the thing holding everything else together. When it slows down or fails, the ripple effects hit fast and hard. So keeping an eye on its performance? Non-negotiable.

The challenge is, there’s no shortage of metrics you could monitor. But not all of them are useful. This guide walks through the database metrics that matter, how to choose the right ones for your environment, and the common traps that trip up even experienced teams.

Why Database Monitoring Requires a Specialized Approach

Database monitoring isn't just another checkbox in your observability strategy — it's a specialty of its own.

Unlike application monitoring, where you're mostly concerned with user experience and frontend performance, database monitoring requires a deeper technical understanding of storage systems, query languages, and resource management.

Here's why database monitoring deserves special attention:

  • It's the final resting place of your data — corrupt that, and you're in recovery mode (or worse)
  • Performance issues compound quickly — a small database slowdown can cause exponential application delays
  • Problems are often silent until catastrophic — many database issues don't announce themselves until they've become serious
  • Root cause analysis is complex — database problems often have multiple contributing factors
💡
If you're looking to improve performance beyond just monitoring, this guide to database optimization covers practical ways to tune your setup.

Essential Database Metrics That Drive Performance and Reliability

Connection Metrics

Connection count - The number of active connections to your database serves as a vital indicator of system load and application health. Each connection consumes memory and processing resources, so monitoring this metric helps prevent resource exhaustion.

Most databases have hard limits on the maximum connections allowed (e.g., PostgreSQL's max_connections parameter or MySQL's max_connections variable). When your active connections approach 70-80% of this maximum, it's time to investigate.

A sudden drop in connections could indicate application servers going offline, while a rapid increase might signal connection leaks or inefficient connection pooling. Track this metric both as an absolute value and as a percentage of your configured maximum.
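As a minimal sketch, the utilization check described above might look like this in Python. The 0.8 threshold mirrors the 70-80% guidance and is an illustrative assumption, not a database default:

```python
# Hypothetical helper: flag when active connections approach the configured
# maximum (e.g., PostgreSQL's max_connections). The 0.8 default threshold
# is an assumption to tune for your environment.
def connection_utilization(active: int, max_connections: int) -> float:
    """Return active connections as a fraction of the configured maximum."""
    return active / max_connections

def should_investigate(active: int, max_connections: int,
                       threshold: float = 0.8) -> bool:
    return connection_utilization(active, max_connections) >= threshold

near_limit = should_investigate(85, 100)  # 85% of max: investigate
healthy = should_investigate(40, 100)     # 40% of max: normal
```

Tracking both the raw count and this percentage lets the same alert rule work across instances with different configured maximums.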

Connection time - This metric measures the duration required to establish a new database connection, typically in milliseconds.

In a healthy system, connection time should remain consistently low (usually under 10ms in same-region deployments). Increasing connection times often reveal underlying problems before they become critical failures.

Extended connection times might indicate network latency issues, DNS resolution problems, authentication bottlenecks, or resource constraints on the database server. When connection times creep up, it's worth examining network performance between application and database tiers, along with server resource utilization patterns.

Connection errors - This metric tracks failed connection attempts, segmented by error type (authentication failures, timeout errors, connection refused errors, etc.). A sudden increase in connection errors is a leading indicator of configuration drift, security changes, network issues, or capacity problems.

For example, a spike in authentication errors might indicate credential rotation problems, while connection timeouts could point to network saturation or database server overload. Establishing baselines for normal error rates helps identify anomalies quickly.

Performance Metrics

Query execution time - Perhaps the most critical performance metric, query execution time measures how long database queries take to complete. This metric should be monitored across multiple dimensions: average time (mean), median (50th percentile), 95th percentile, 99th percentile, and maximum execution time.

The distribution of query times provides valuable insights. While averages are useful for spotting trends, they can mask problematic outliers. The 95th and 99th percentiles reveal your worst-performing queries that affect user experience. For OLTP workloads, most queries should execute in milliseconds, while analytical queries might acceptably run for seconds or minutes.

Track execution times by query type, application component, and time of day to establish meaningful patterns. A sudden increase in execution time often precedes more serious performance degradation.
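To see why percentiles matter more than averages, here is a small sketch that summarizes a batch of latency samples. The sample data is invented to show a tail that the mean alone would hide:

```python
import statistics

def latency_summary(samples_ms):
    """Summarize query latencies: mean, p50, p95, p99, max (milliseconds)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],   # quantiles() returns 99 cut points for n=100
        "p99": cuts[98],
        "max": max(samples_ms),
    }

# Mostly fast queries with a few slow outliers: the mean stays low
# while p99 exposes the tail that affects real users.
samples = [5] * 97 + [250, 300, 400]
summary = latency_summary(samples)
```

Here the median is 5ms and the mean only ~14ms, yet the 99th percentile sits above 250ms: exactly the outliers the text warns averages can mask.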

Cache hit ratio - This metric represents the percentage of data requests fulfilled from the database's memory cache rather than from disk. Memory access is typically thousands of times faster than disk access, making cache efficiency crucial for performance.

Different database systems have different caching mechanisms (buffer cache in PostgreSQL, buffer pool in MySQL, page cache in SQL Server), but all benefit from high cache hit ratios. For OLTP workloads, aim for at least 99% cache hit rates, while 90%+ might be acceptable for read-heavy analytical workloads.

Low cache hit ratios indicate your working dataset exceeds available memory, suggesting the need for memory increases, workload changes, or query optimization. Monitor this metric alongside memory usage patterns to optimize configuration settings like buffer pool size.
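The ratio itself is simple arithmetic over two counters. In PostgreSQL those counters correspond to `blks_hit` and `blks_read` in `pg_stat_database`; other engines expose equivalents. A sketch:

```python
def cache_hit_ratio(blocks_hit: int, blocks_read: int) -> float:
    """Fraction of block requests served from the in-memory cache
    rather than disk. In PostgreSQL, compute this from blks_hit and
    blks_read; treat zero traffic as a perfect ratio."""
    total = blocks_hit + blocks_read
    return blocks_hit / total if total else 1.0

# 99,500 cached hits vs 500 disk reads -> a 99.5% hit ratio,
# meeting the OLTP target mentioned above.
ratio = cache_hit_ratio(99_500, 500)
```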

💡
To get a clearer picture of what’s happening inside MySQL, understanding its logs is key—this MySQL logs guide breaks it down simply.

Index usage - This metric family tracks which indexes are being used by queries and how frequently, helping identify both missing indexes and unused indexes that waste resources. Unused indexes consume storage space and slow down write operations as the database must update all indexes when data changes.

For each table, monitor the ratio of index scans to sequential scans, index hit rates, and index fragmentation levels. Look for tables with high read activity but low index usage – these are prime candidates for index optimization.

Many database systems provide tools to identify missing indexes based on query patterns (e.g., PostgreSQL's pg_stat_statements combined with pg_stat_user_indexes or SQL Server's missing index DMVs).

Table scan frequency - This metric counts how often your database performs full table scans (reading every row in a table) instead of using indexes for targeted data retrieval. Full table scans are resource-intensive operations that can dramatically impact performance as your data grows.

While some table scans are inevitable (especially for small tables or certain analytical queries), high or increasing scan frequencies on large tables indicate indexing issues. Track this metric as a ratio of table scans to total operations for each table, with special attention to your largest and most frequently accessed tables.

When table scan frequencies increase, investigate whether you're missing indexes, using non-selective indexes, or writing queries that can't use existing indexes (e.g., using functions on indexed columns).
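A per-table check along these lines can be sketched as follows. In PostgreSQL the inputs map to `seq_scan` and `idx_scan` from `pg_stat_user_tables`; the ratio and row-count thresholds are illustrative assumptions, not database defaults:

```python
def seq_scan_ratio(seq_scans: int, index_scans: int) -> float:
    """Fraction of scans on a table that were full (sequential) scans."""
    total = seq_scans + index_scans
    return seq_scans / total if total else 0.0

def needs_index_review(seq_scans: int, index_scans: int, row_count: int,
                       ratio_threshold: float = 0.2,
                       min_rows: int = 10_000) -> bool:
    """Flag large tables where sequential scans dominate. Small tables
    are ignored because scanning them is often cheaper than an index."""
    return (row_count >= min_rows
            and seq_scan_ratio(seq_scans, index_scans) >= ratio_threshold)
```

Filtering by table size up front keeps the alert focused on the large, hot tables the text calls out.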

Resource Utilization

CPU usage - Database operations — particularly complex joins, sorts, and aggregations — can be extremely CPU-intensive. This metric should be monitored at multiple levels: overall system CPU usage, per-core utilization, CPU time per query type, and CPU queue length.

For most database systems, sustained CPU usage above 70-80% indicates approaching performance degradation. More important than the absolute value is the pattern: steady increases over time suggest growing workloads that will eventually require optimization or scaling, while sudden spikes might indicate problematic queries or background processes.

CPU steal time is particularly important to monitor in virtualized environments, as it indicates your database isn't getting the CPU resources it's requesting from the hypervisor. Different query types have different CPU profiles — OLTP queries typically need short bursts of CPU, while analytical operations consume longer, sustained CPU resources.

Memory usage - Memory is perhaps the most critical resource for database performance, as it enables caching data to avoid expensive disk operations. Track total memory usage, memory usage by component (e.g., buffer pool, query execution memory, connection overhead), and memory pressure indicators.

Key memory pressure signals include: buffer pool eviction rates, swap usage, page faults per second, and out-of-memory errors in logs. For most database systems, memory should be configured to cache your working dataset while leaving sufficient memory for the operating system and query execution.

Different database components compete for memory — buffer/page cache for data storage, sort/join buffers for query execution, and per-connection memory for client operations. Monitor how memory is distributed across these functions to optimize configuration.

Disk I/O - I/O operations represent how frequently your database reads from and writes to persistent storage. Monitor IOPS (I/O operations per second), throughput (MB/s read and written), I/O latency (average, median, and 95th percentile response times), and I/O queue depth.

Database performance typically degrades when average I/O latency exceeds 10ms for SSDs or 20ms for HDDs. Queue depth indicates how many I/O operations are waiting — sustained high queue depths lead to application slowdowns.

Break down I/O metrics by operation type (random vs. sequential, read vs. write) and database file type (data files, log files, temporary files). Each has different performance characteristics and optimization approaches. For example, write-ahead logs require low-latency sequential writes, while data files benefit from high-throughput random read performance.

Storage space - This seemingly simple metric is crucial for operational stability. Track total space usage, free space, growth rate (daily/weekly/monthly), space usage by table and index, and fragmentation levels.

Most databases experience catastrophic failures when completely out of space, so proactive monitoring is essential. Set alerts when approaching 80-85% capacity to allow time for remediation actions. Calculate growth trends to predict when you'll need additional capacity — sudden changes in growth rates often indicate application issues or inefficient queries.

Monitor how space is utilized across different database objects. Tables with frequent updates or deletes may experience fragmentation that wastes space and degrades performance. Regularly track fragmentation levels and schedule maintenance operations (VACUUM in PostgreSQL, OPTIMIZE TABLE in MySQL) to reclaim space.
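The capacity alert and growth projection described above reduce to a couple of lines. This is a naive linear projection under the assumption of steady daily growth; real forecasting should account for seasonality:

```python
def days_until_full(used_gb: float, capacity_gb: float,
                    daily_growth_gb: float) -> float:
    """Linear projection of time until disk exhaustion."""
    if daily_growth_gb <= 0:
        return float("inf")
    return (capacity_gb - used_gb) / daily_growth_gb

def capacity_alert(used_gb: float, capacity_gb: float,
                   threshold: float = 0.85) -> bool:
    """Alert at 85% usage, per the guidance above; tune per environment."""
    return used_gb / capacity_gb >= threshold

remaining_days = days_until_full(800, 1000, 10)  # 200 GB free at 10 GB/day
```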

Health and Availability

Uptime - Database uptime measures the total time your database has been operational and available to serve requests. While seemingly simple, this foundational reliability metric should be tracked with nuance — measuring both overall availability percentage and the frequency and duration of outages.

The standard "five nines" (99.999%) availability translates to just 5.26 minutes of downtime per year. Track uptime across multiple timeframes (daily, weekly, monthly, yearly) and differentiate between planned maintenance windows and unexpected outages.

Uptime should be measured from the application's perspective, not just the database process status. A database server may be running but unable to serve requests due to lock contention, resource exhaustion, or network issues. True uptime metrics capture the end-to-end availability as experienced by applications.
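The "five nines" arithmetic above is easy to verify directly:

```python
def availability_pct(downtime_minutes: float, period_minutes: float) -> float:
    """Availability as a percentage, given total downtime in the period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)

# 5.26 minutes of downtime across a year lands almost exactly on 99.999%.
minutes_per_year = 365.25 * 24 * 60
five_nines = availability_pct(5.26, minutes_per_year)
```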

Replication lag - In distributed database setups, replication lag measures how far behind secondary/replica instances are from the primary instance. This metric is typically measured in seconds or transaction log position differences.

Replication lag directly impacts data consistency and disaster recovery capabilities. High lag means applications reading from replicas see outdated data. During failover events, excessive lag can result in data loss if the primary fails before all transactions are replicated.

Monitor lag for each replica individually, as performance can vary. Set different thresholds based on replica purpose — replicas used for read scaling might tolerate more lag than those designated for high-availability failover. Consistently increasing lag indicates that replicas can't keep up with the primary write volume, suggesting the need for optimization or additional resources.
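Per-purpose thresholds can be expressed as a small lookup. The replica names, purposes, and threshold values below are hypothetical; the point is that failover replicas get a tighter bound than read-scaling ones:

```python
# Illustrative lag thresholds in seconds, keyed by replica purpose.
LAG_THRESHOLDS_S = {"ha_failover": 5, "read_scaling": 30}

def lagging_replicas(lag_by_replica, purpose_by_replica):
    """Return replicas whose lag exceeds the threshold for their purpose."""
    return [
        name for name, lag in lag_by_replica.items()
        if lag > LAG_THRESHOLDS_S[purpose_by_replica[name]]
    ]

alerts = lagging_replicas(
    {"replica-1": 2.0, "replica-2": 45.0, "replica-3": 8.0},
    {"replica-1": "ha_failover", "replica-2": "read_scaling",
     "replica-3": "ha_failover"},
)
```

Here replica-3 alerts at only 8 seconds of lag because it backs failover, while replica-1 at 2 seconds stays quiet.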

💡
If your database monitoring depends on Prometheus, it's worth knowing how to keep it resilient—this guide on high availability in Prometheus walks you through it.

Deadlocks - Deadlocks occur when two or more transactions are waiting for each other to release locks, creating a dependency cycle that prevents progress. This metric tracks deadlock frequency, affected tables, and the queries involved.

Infrequent deadlocks are normal in concurrent database systems, but increasing deadlock rates indicate application design issues or changing access patterns. Each deadlock forces the database to roll back at least one transaction, impacting performance and potentially causing application errors.

Collect detailed information about each deadlock, including the resources involved (tables, rows, indexes), lock types (shared, exclusive), and the exact queries in the deadlock cycle. This information is crucial for refactoring application code to avoid lock contention. Many databases provide deadlock graphs or logs that visualize these dependencies.

Backup success rate and recovery time - This multi-faceted metric family measures both how reliably your backup processes complete and how quickly you can restore from those backups. Track backup duration, compression ratio, verification status, and success/failure rates.

More important than backup metrics are recovery metrics — the time required to restore from backup and the success rate of test recoveries. Regularly test restores to validate both backup integrity and recovery procedures. Measure recovery time objective (RTO) compliance through these tests.

For point-in-time recovery systems, track continuous archiving status (e.g., PostgreSQL WAL archiving or MySQL binary logging) and validate the restore chain periodically. Monitor the lag between the latest restorable point and the current time to ensure your recovery point objective (RPO) is met.
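The RPO check described above compares the latest restorable point against the clock. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def rpo_met(latest_restorable, rpo, now=None):
    """True if the gap between now and the latest restorable point
    is within the recovery point objective."""
    now = now or datetime.now(timezone.utc)
    return now - latest_restorable <= rpo

# Fixed "now" so the example is deterministic; a 5-minute RPO.
now = datetime(2025, 4, 14, 12, 0, tzinfo=timezone.utc)
within_rpo = rpo_met(now - timedelta(minutes=3), timedelta(minutes=5), now=now)
breached = rpo_met(now - timedelta(minutes=20), timedelta(minutes=5), now=now)
```

In practice the restorable point would come from your archiving status (e.g., the latest archived WAL segment in PostgreSQL).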

Selecting the Most Valuable Metrics for Your Specific Database Environment

Not all metrics matter equally for every database. Your monitoring strategy should be tailored to your specific setup:

By Database Type

| Database Type | Critical Metrics | Why They Matter |
| --- | --- | --- |
| MySQL/MariaDB | InnoDB buffer pool usage<br>Slow query log<br>Table locks | These metrics reveal MySQL's specific performance bottlenecks |
| PostgreSQL | VACUUM frequency<br>Bloat<br>Tuple stats | PostgreSQL's MVCC model creates unique monitoring needs |
| MongoDB | Document scan ratio<br>WiredTiger cache<br>Oplog window | These reflect MongoDB's document-oriented performance profile |
| Redis | Keyspace misses<br>Evictions<br>Fragmentation | Memory management is everything for Redis |
| Elasticsearch | Indexing latency<br>Search latency<br>JVM heap usage | Search engines have unique resource profiles |

By Workload Type

Read-heavy applications - Focus on index usage, cache hit ratios, and query execution plans.

Write-heavy systems - Watch for lock contention, write amplification, and commit latency.

Mixed workloads - Balance your attention across both read and write metrics, with a particular focus on concurrency metrics.

Overcoming Database Monitoring Obstacles

Challenge 1: Alert Fatigue

The problem: Setting thresholds too low creates a constant barrage of alerts that eventually get ignored.

The solution: Implement dynamic baselines that adjust to your normal patterns and only alert you to significant deviations. Consider using percentile-based alerting rather than fixed thresholds.
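One simple form of percentile-based alerting derives the threshold from recent history rather than a fixed number. Both knobs below (the percentile and the 20% margin) are assumptions to tune per metric:

```python
import statistics

def dynamic_threshold(history, percentile=99, margin=1.2):
    """Alert threshold from recent observations: the chosen percentile
    of history, padded by a safety margin, so the threshold tracks
    each metric's own normal range."""
    cuts = statistics.quantiles(history, n=100, method="inclusive")
    return cuts[percentile - 1] * margin

# Recent latency observations with one known spike.
history = [10, 12, 11, 13, 9, 14, 12, 11, 10, 95]
threshold = dynamic_threshold(history)

fires_on_spike = 120 > threshold       # genuine deviation alerts
quiet_on_normal = 40 < threshold       # routine variation does not
```

Because the threshold is recomputed from each metric's own history, a noisy metric and a stable one get appropriately different bounds without per-metric hand-tuning.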

Challenge 2: Correlating Database and Application Issues

The problem: When your application slows down, is it the database or something else?

The solution: Track request traces that span from your application into the database to see exactly where time is spent. Tools that provide unified views across your stack make this much easier.

Challenge 3: Monitoring Without Adding Overhead

The problem: The monitoring itself can impact database performance.

The solution: Use lightweight agents and sampling strategies that reduce the impact of monitoring. Schedule intensive collection during off-peak hours when possible.

Challenge 4: High-Cardinality Data

The problem: Modern databases generate tons of metrics across numerous dimensions, making storage and analysis challenging.

The solution: Focus on aggregating metrics that matter most, and use tools that handle high-cardinality data efficiently.

Last9 is particularly strong here — their platform manages high-cardinality observability at scale, which is essential when you're tracking metrics across numerous database instances, queries, and users.

Challenge 5: Going Beyond Symptoms to Root Causes

The problem: Metrics tell you something's wrong but not why.

The solution: Combine metrics with logs and traces for context. When a metric spikes, you should be able to immediately see related logs and query patterns.

💡
Fix production database log issues instantly—right from your IDE, with AI and Last9 MCP.

How to Set Up a Solid Database Monitoring Stack

Tools of the Trade

When selecting monitoring tools, consider these options:

Last9 — If you're looking for a managed observability solution kinder to your budget without compromising performance, give Last9 a try.

We help teams handle high-cardinality observability with ease. Industry leaders like Disney+ Hotstar, CleverTap, and Replit trust us to keep things running smoothly.

With native support for OpenTelemetry and Prometheus, we bring together metrics, logs, and traces to give you real-time, correlated monitoring and alerting—all while keeping performance and cost in check.

Prometheus + Grafana — The open-source standard, perfect if you want maximum control and customization.

Dynatrace — Enterprise-focused monitoring with AI capabilities.

AppDynamics — Application-centric monitoring with database visibility.

Elastic APM — Part of the Elastic Stack, good integration with Elasticsearch.

Implementation Best Practices

Start small and expand - Begin with the essential metrics outlined above, then add more specialized ones as you understand your needs better.

Automate where possible - Use infrastructure-as-code to deploy consistent monitoring across all environments.

Document normal patterns - Know what "good" looks like so you can quickly spot "bad."

Test your monitoring - Regularly verify that alerts work by triggering test conditions.

Applying Database Monitoring Metrics to Solve Real Production Problems

Scenario: Query Performance Degradation

What you'll see:

  • Increasing query execution time
  • Higher CPU usage
  • More disk I/O
  • Lower cache hit ratio

First response steps:

  1. Check for recent changes (schema, code, data volume)
  2. Look for blocking locks or resource contention
  3. Examine query plans for inefficient operations
  4. Review index usage and table statistics

Scenario: Connection Pool Exhaustion

What you'll see:

  • Connection count approaching maximum
  • Increasing connection time
  • Application errors about connection limits
  • Brief spikes in CPU as new connections are created

First response steps:

  1. Temporarily increase connection limits if possible
  2. Check for connection leaks in the application code
  3. Implement connection pooling if not already in place
  4. Consider read/write splitting to distribute connection load
💡
To better understand how logging and monitoring work together, check out this guide on logging vs monitoring.

How Monitoring Data Can Help Your Database

The end goal of monitoring isn't just to know when things break — it's to make your database run better over time. Here's how to use your metrics for continuous improvement:

Trend analysis - Look at how metrics change over weeks and months to spot gradual degradation.

Capacity planning - Use growth trends to predict when you'll need more resources.

Performance tuning - Identify your slowest and most frequent queries for optimization.

Configuration refinement - Use resource utilization patterns to tune database configuration parameters.

Conclusion

Effective database monitoring isn't about collecting every possible metric — it's about tracking the right metrics at the right time and understanding what they're telling you.

Remember that the best monitoring setup is the one that fits your specific needs and environment. Take the principles outlined here and adapt them to your unique database landscape.

💡
If you have any questions about database monitoring metrics or are struggling with a specific monitoring challenge, do join our Discord Community to connect with other DevOps engineers and SREs facing similar challenges.

FAQs

How frequently should database metrics be collected?

Most metrics should be collected every 15-60 seconds for real-time monitoring, with historical data aggregated to longer intervals for trend analysis. Resource-intensive collections might be less frequent.

What's more important — availability or performance metrics?

Both are critical but serve different purposes. Availability metrics tell you if your system is functioning, while performance metrics tell you how well. A complete monitoring strategy needs both.

Should developers have access to database monitoring?

Yes, with appropriate guardrails. Developers with monitoring access can better understand the impact of their code on database performance and make more informed decisions.

How do you monitor databases in containerized environments?

Focus on collecting metrics from both the container layer (resource constraints, orchestration health) and from within the database itself. Be aware that ephemeral containers require special consideration for persistent monitoring.

How many alerts are too many?

If your team is regularly ignoring alerts or treating them as background noise, you have too many. Aim for actionable alerts that require human intervention rather than informational ones.

What's the relationship between database metrics and SLOs?

Database metrics should directly inform your Service Level Objectives. For example, if your SLO is "95% of queries complete in under 200ms," you need to monitor query execution time across your database fleet.
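Checking that SLO against measured latencies is a one-liner; the target and sample data below are illustrative:

```python
def slo_compliance(latencies_ms, target_ms=200):
    """Fraction of queries completing under the SLO target."""
    return sum(1 for t in latencies_ms if t < target_ms) / len(latencies_ms)

# 95 of 100 queries under 200ms meets a 95% objective exactly.
latencies = [50] * 95 + [500] * 5
meets_slo = slo_compliance(latencies) >= 0.95
```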
