Last9

Database Monitoring Metrics: A Stage-by-Stage Guide (2025)

Not all database metrics are created equal. Learn which ones to track, why they matter, and how they help you stay ahead of performance issues.

Apr 14th, ‘25

A database under strain can slow query execution, saturate connection pools, or trigger lock contention — any of which can halt application workflows. When this happens, API endpoints, background jobs, and dependent services all feel the slowdown.

The challenge is knowing which metrics truly help you detect and resolve issues early. This guide focuses on high-signal metrics that:

  • Detect performance regressions before deployment
  • Map directly to application behavior
  • Support rapid diagnosis during incidents

We’ll cover how each metric works, when to track it, and common pitfalls that can reduce its diagnostic value.

Why Database Monitoring Needs a Dedicated Approach

Database monitoring is a discipline in its own right, not just a subset of general observability. Where application monitoring often focuses on response times, error rates, and user interactions, database monitoring requires detailed insight into:

  • Storage engine behavior — how data is physically stored, indexed, and retrieved
  • Query execution paths — the actual operations performed to return results
  • Resource consumption — CPU, memory, I/O, and connection pool usage under varying workloads

Databases warrant this dedicated focus because:

  • Data integrity is non-negotiable — once corrupted, recovery can be time-consuming or incomplete
  • Small slowdowns can cascade — a single inefficient query can create delays across multiple services
  • Issues can remain hidden — replication lag, index bloat, or lock contention may not trigger alerts until performance degrades sharply
  • Root cause isolation is challenging — the same symptom can arise from schema design, query patterns, workload spikes, or infrastructure limits

A specialized monitoring strategy ensures these conditions are detected early and addressed before they affect application stability.

💡
If you’re tracking query performance or troubleshooting slowdowns, understanding MySQL logs can give you the context metrics alone can’t — here’s a guide on how to work with them.

Essential Database Metrics: Quick Reference

Connection Metrics:

  • Connection count: Monitor at 70-80% of max_connections
  • Connection time: Should be <10ms in same-region deployments
  • Connection errors: Baseline normal rates, alert on deviations

Performance Metrics:

  • Query execution time: Track P95/P99, not just averages
  • Cache hit ratio: Target 99%+ for OLTP, 90%+ for analytics
  • Index usage: Monitor index scans vs. sequential scans

Resource Utilization:

  • CPU usage: Alert above 70-80% sustained
  • Memory usage: Watch eviction rates and swap activity
  • Disk I/O: Latency >10ms (SSD) needs investigation

Essential Database Metrics for Each Stage of the Development Lifecycle

The value of a metric depends not just on what it measures, but when and how you measure it.
By looking at metrics through the lens of development, code review, and post-deployment monitoring, you get early warnings in dev, performance safeguards during review, and fast incident resolution in production.

Connection Metrics

Connection count

During Development — Tracks the total number of active database connections. Each connection consumes memory and CPU. Monitoring this in dev environments helps catch inefficient pooling, connection leaks, or ORM misconfigurations.

  • Most engines have limits (max_connections in PostgreSQL/MySQL). Hitting 70–80% in dev indicates a problem.
  • Watch sustained high counts in local/staging load tests — often caused by opening a new connection per request instead of reusing pooled ones.

During Code Review — Evaluate whether new background jobs, migrations, or API endpoints could push connection usage closer to limits. Run load tests in review environments to verify pooling strategies.

Post-Deployment — Spikes may signal leaks, misconfigured pools, or traffic bursts. Drops might mean app nodes went offline. Track both absolute numbers and as a percentage of the configured max. Correlate with deployment or traffic change events.

Connection time

During Development — Measures the time to establish a new DB connection. Same-region connections should take under ~10 ms.

  • High values in dev can indicate DNS resolution delays, SSL handshake overhead, or inefficient connection handling.

During Code Review — Test changes that alter network paths (service mesh, proxy layers). Even small latency increases here can multiply across concurrent requests.

Post-Deployment — Rising connection times can signal network congestion, auth bottlenecks, or DB host CPU/memory contention. Catching increases early prevents query queuing and degraded response times.

Connection errors

During Development — Any failed connection attempts here usually mean misconfigured connection strings, credentials, or environment variables.

During Code Review — Watch for spikes in specific error types after introducing changes to networking, auth, or DB driver versions.

Post-Deployment — Classify by type for faster triage:

  • Authentication failures — Expired credentials, key rotation issues.
  • Timeouts — Network saturation or overloaded DB nodes.
  • Refused connections — Max connection limits hit.

Baseline normal error rates so deviations trigger immediate investigation.

Here's how you can track connection pool usage in Node.js and Django:

PostgreSQL Connection Monitoring in Node.js

// Track connection pool metrics with node-postgres
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                  // upper bound on pooled connections
  idleTimeoutMillis: 30000, // close idle clients after 30s
});

pool.on('connect', () => {
  console.log(`Active connections: ${pool.totalCount}`);
  // Alert if approaching 80% of max (16 of 20)
  if (pool.totalCount > 16) {
    console.warn('Connection pool approaching capacity');
  }
});

Django Connection Monitoring

# settings.py
DATABASES = {
    'default': {
        # ... engine, name, credentials
        # Django reuses connections across requests for this many seconds
        'CONN_MAX_AGE': 60,
    }
}

# Custom middleware to log per-request query counts
# (django.db.connection.queries is only populated when DEBUG=True)
class DatabaseConnectionMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        from django.db import connection
        print(f"Queries this request: {len(connection.queries)}")
        return response

Performance Metrics

Query execution time

During Development — Measure mean, P50, P95, P99, and max times. OLTP workloads should be in ms; analytical queries may run longer but should be consistent.

  • Use staging datasets to identify slow queries before they reach code review.

During Code Review — Compare before/after performance for queries affected by schema changes. Even slight regressions in P95 can impact production latency.

Post-Deployment — Sudden increases often precede broader performance degradation. Track by query type, application component, and time of day to detect patterns. Outlier queries at P99 should get immediate attention.
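
If you run PostgreSQL with the pg_stat_statements extension enabled (an assumption; the extension ships with PostgreSQL but has to be preloaded and created explicitly), a quick query surfaces the slowest statements. Column names below are for PostgreSQL 13+, and the view doesn't expose percentiles directly, so true P95/P99 still comes from your metrics pipeline:

-- Slowest statements by average execution time (pg_stat_statements, PG 13+)
SELECT
  left(query, 60) AS query_snippet,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(max_exec_time::numeric, 2) AS max_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;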

Cache hit ratio

During Development — Verify hot queries in staging are served from memory (shared_buffers in PostgreSQL, innodb_buffer_pool in MySQL).

  • OLTP targets: 99%+; analytical: 90%+.

During Code Review — Ensure new queries benefit from cache rather than forcing disk reads. Poor cache usage here means predictable slowdowns in production.

Post-Deployment — Low ratios indicate working set > memory or inefficient query patterns. Adjust cache/buffer sizes or optimize queries. Monitor alongside memory usage to spot configuration mismatches.
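
In PostgreSQL, the ratio can be computed from pg_statio_user_tables (per table) or pg_stat_database (per database). A rough per-table sketch:

-- Buffer cache hit ratio per table (PostgreSQL)
SELECT
  relname,
  heap_blks_hit,
  heap_blks_read,
  round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;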

Index usage

During Development — Monitor index scans vs. sequential scans to catch missing indexes. Use EXPLAIN to validate new query performance.

During Code Review — Review schema migrations for index changes. Remove unused indexes to reduce storage and write latency.

Post-Deployment — Low index usage on high-read tables signals optimization opportunities. Use DB-native tools (pg_stat_user_indexes, SQL Server DMVs) to identify candidates for tuning.
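
In PostgreSQL, pg_stat_user_indexes (mentioned above) records how often each index is actually scanned; large indexes with idx_scan near zero on busy tables are candidates for removal. A sketch:

-- Rarely used indexes and how much space they cost (PostgreSQL)
SELECT
  schemaname,
  relname AS table_name,
  indexrelname AS index_name,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC
LIMIT 10;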

Table scan frequency

During Development — Identify full scans in staging tests, especially on large tables.

During Code Review — Flag PRs introducing queries that require full table scans unnecessarily.

Post-Deployment — Rising scan frequency on large tables often means missing indexes or non-selective WHERE clauses. Track scans as a ratio to total queries for each table.
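
PostgreSQL's pg_stat_user_tables keeps per-table counters for sequential and index scans, so the ratio described above falls out of one query:

-- Share of sequential scans per table (PostgreSQL)
SELECT
  relname,
  seq_scan,
  coalesce(idx_scan, 0) AS idx_scan,
  round(100.0 * seq_scan / nullif(seq_scan + coalesce(idx_scan, 0), 0), 1) AS seq_scan_pct,
  n_live_tup AS approx_rows
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;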

Resource Utilization

CPU usage

During Development — Run synthetic load to establish CPU usage patterns for different query types.

During Code Review — Watch for queries that cause CPU spikes during review environment load tests.

Post-Deployment — Sustained usage above 70–80% suggests growing workloads that need optimization or scaling. Monitor CPU steal time in virtualized/cloud setups to detect hypervisor contention.

Memory usage

During Development — Measure buffer/cache allocation, query execution memory, and per-connection overhead in test runs.

During Code Review — Changes to query complexity or parallelism may affect memory usage; verify allocation fits within limits.

Post-Deployment — Watch for high eviction rates, swap activity, or page faults. Keep enough memory for the working dataset and OS needs.

Disk I/O

During Development — Measure IOPS, throughput, and latency for different file types (data, logs, temp).

During Code Review — Evaluate new workloads for I/O patterns that may degrade performance.

Post-Deployment — Latency >10 ms (SSD) or >20 ms (HDD) warrants investigation. Track queue depth; sustained highs lead to slowdowns.

Storage space

During Development — Track table/index growth in test datasets to predict production scaling needs.

During Code Review — Review schema changes for potential storage overhead (e.g., large indexes, denormalization).

Post-Deployment — Alert at 80–85% capacity. Monitor fragmentation levels; run maintenance (VACUUM, OPTIMIZE TABLE) as needed.
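
For PostgreSQL, table and index sizes (including TOAST data) plus dead-tuple counts come straight from the catalog, which makes capacity and VACUUM checks easy to script:

-- Largest tables with index size and dead tuples (PostgreSQL)
SELECT
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_indexes_size(relid)) AS index_size,
  n_dead_tup AS dead_tuples
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;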

Health and Availability

Uptime

During Development — Test failover and maintenance scenarios to validate uptime monitoring accuracy.

During Code Review — Review deployment procedures for their effect on availability metrics.

Post-Deployment — Measure from the application perspective to capture cases where the DB is online but not serving queries.

Replication lag

During Development — Simulate replica lag in staging to validate read-consistency handling in code.

During Code Review — Check if new queries target replicas that may have acceptable but noticeable lag.

Post-Deployment — Monitor per-replica lag in seconds or transaction positions. Increasing lag means replicas can’t keep up with the write load.
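
In PostgreSQL, the primary reports per-replica lag in pg_stat_replication (the *_lag columns exist from version 10 onward), and a replica can report how far behind it is in wall-clock terms:

-- On the primary: lag per replica (PostgreSQL 10+)
SELECT application_name, state, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On a replica: seconds since the last replayed transaction
SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp()) AS lag_seconds;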

Deadlocks

During Development — Use concurrency tests to surface deadlocks in staging before production.

During Code Review — Review access patterns that may introduce circular locking dependencies.

Post-Deployment — Track frequency, affected objects, and involved queries. Use DB-native deadlock graphs to refactor problematic code.
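
PostgreSQL keeps a running deadlock counter per database; pairing it with log_lock_waits = on in postgresql.conf captures the detail needed to refactor the offending code paths:

-- Deadlocks recorded since the last stats reset (PostgreSQL)
SELECT datname, deadlocks, stats_reset
FROM pg_stat_database
WHERE datname = current_database();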

Backup success rate & recovery time

During Development — Test backup/restore procedures on realistic datasets.

During Code Review — Ensure changes to storage or schema don’t break backup scripts or retention policies.

Post-Deployment — Track completion rates, RTO, and RPO compliance. Test restores regularly to validate backup integrity.

💡
You can use these database optimization techniques to improve query performance and make better use of your existing resources — read the guide here.

Select the Most Valuable Metrics

Not every metric matters for every system. You can choose based on your database type and workload profile.

By Database Type

MySQL / MariaDB
  • InnoDB buffer pool usage – Shows how much of your data is served from memory.
  • Slow query log – Lists queries running longer than expected.
  • Table locks – Tracks blocking at the table level.
  Why it matters: MySQL bottlenecks often come from poor caching, inefficient queries, or lock contention.

PostgreSQL
  • VACUUM frequency – Measures how often dead tuples are cleaned up.
  • Bloat – Shows wasted storage from unvacuumed rows.
  • Tuple stats – Counts inserts, updates, and dead rows.
  Why it matters: MVCC can create storage bloat and slow queries if not maintained.

MongoDB
  • Document scan ratio – Compares scanned docs to returned docs.
  • WiredTiger cache – Shows memory use for active data.
  • Oplog window – Time available for replica recovery.
  Why it matters: High scan ratios mean missing indexes; oplog size controls replica reliability.

Redis
  • Keyspace misses – Keys requested but not found.
  • Evictions – Keys removed to free space.
  • Fragmentation – Memory lost to allocation patterns.
  Why it matters: Memory efficiency is critical for Redis performance.

Elasticsearch
  • Indexing latency – Time to make new data searchable.
  • Search latency – Query response times.
  • JVM heap usage – Memory allocation for search and indexing.
  Why it matters: Balancing ingestion speed and query latency is key; heap tuning affects both.

By Workload Type

  • Read-heavy — Watch index usage, cache hit ratio, query plans. Keeps reading fast without hitting the disk.
  • Write-heavy — Track lock contention, write amplification, and commit latency. Prevents queues and transaction stalls.
  • Mixed workloads — Balance read/write metrics. Focus on concurrency: lock waits, deadlocks, replication lag.

Overcome Database Monitoring Obstacles

Alert fatigue is common when monitoring rules are too aggressive. Instead of helping, they create background noise that’s easy to miss during busy days.

  • Set dynamic baselines that adjust to your system’s normal patterns.
  • Use percentile-based thresholds (P95, P99) to highlight meaningful changes rather than occasional outliers.

Sometimes it’s not clear whether a slowdown starts in the application or the database. Without context, troubleshooting can take longer than it should.

  • Capture end-to-end traces that follow a request from the application layer into the database.
  • Use tools that provide a combined view so you can see exactly where time is spent.

Monitoring has to be lightweight enough to avoid affecting performance. Collecting every detail all the time can put an unnecessary load on the database.

  • Run lightweight agents for frequent checks.
  • Apply sampling for high-frequency metrics.
  • Schedule more intensive collection during quieter periods.

High-cardinality metrics are part of modern database monitoring — especially when tracking queries, users, and instances with multiple labels. Storing and querying that data efficiently is key.

  • Aggregate where it makes sense, keeping the metrics most useful for decisions.
  • Use platforms designed to handle large metric dimensions without losing query speed.
  • Last9 is built for this, making high-cardinality monitoring practical at scale.

Metrics can confirm that something changed, but they don’t always explain why. Pairing them with other data sources makes troubleshooting faster.

  • Connect metrics to logs and traces for direct context.
  • Keep historical data so you can compare normal patterns to unusual ones.

💡
Fix production database log issues instantly—right from your IDE, with AI and Last9 MCP.

How to Set Up a Solid Database Monitoring Stack

The “right” monitoring stack depends on how much control you want, the scale of your environment, and the time you can commit to running it. Some teams want everything managed; others prefer to fine-tune every detail.

Last9 is a strong option if you want full observability without the overhead of building and maintaining your monitoring stack. It’s designed for teams that need to triage issues quickly and keep systems running smoothly while controlling costs.

We’re helping teams ship faster with:

  • Unified telemetry platform — metrics, logs, traces, APM, and RUM in one place.
  • AI Control Plane — manage observability costs in real time, with no post-ingestion billing surprises.
  • 20M+ cardinality per metric without sampling — track rich, granular data without losing fidelity.
  • Up to 70% cost reduction compared to incumbent vendors.

Companies like Probo, Replit, Games24x7, and more use Last9 to get more out of their monitoring data while spending less.

Prometheus + Grafana is the go-to open-source combination for teams that want maximum flexibility.

  • Rich ecosystem and mature community support.
  • Full control over storage, retention, and dashboarding.
  • Requires effort to scale for high-cardinality workloads.

AppDynamics is ideal if you already use it for application monitoring and want to extend that visibility to databases.

  • Ties business transactions directly to database queries.
  • Strong application-to-database correlation.

Elastic APM makes sense if you’re already invested in the Elastic Stack.

  • Unified search across metrics, logs, and traces in Kibana.
  • Cost-effective for smaller workloads, but requires tuning at scale.

Using Database Monitoring Metrics to Solve Production Problems

Scenario: Query Performance Degradation

When queries start running slower than usual, the symptoms often show up in multiple places at once — higher CPU, more I/O, and falling cache efficiency.

What you’ll notice:

  • Query execution times are trending upward.
  • CPU usage climbing, sometimes steadily, sometimes in bursts.
  • Disk I/O increasing as queries rely more on reads from storage.
  • Cache hit ratio dropping, meaning fewer queries are served from memory.

First steps to investigate:

  • Check for recent changes — schema updates, application deployments, or data volume growth.
  • Look for blocking locks or other forms of resource contention.
  • Review query execution plans for inefficient joins, missing indexes, or unnecessary sorts (see the sketch after this list).
  • Verify index usage and update table statistics if they’re outdated.
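
Two of those checks map directly to PostgreSQL one-liners: EXPLAIN (ANALYZE, BUFFERS) shows the real plan and whether reads hit the cache, and pg_blocking_pids() finds sessions stuck behind locks. The orders table and filter below are hypothetical placeholders for your own slow statement:

-- Actual plan, timing, and buffer usage for a suspect query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'pending';

-- Sessions that are blocked, and who is blocking them
SELECT
  pid,
  pg_blocking_pids(pid) AS blocked_by,
  wait_event_type,
  left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;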

Scenario: Connection Pool Exhaustion

When the database runs out of available connections, applications start timing out or throwing errors. This often happens gradually as leaks build up, but it can also be triggered by sudden load spikes.

What you’ll notice:

  • Connection count approaching the configured maximum.
  • Connection time increasing as the pool runs dry.
  • Application logs showing “too many connections” or similar errors.
  • Short CPU spikes as new connections are opened more frequently.

First steps to investigate:

  • Temporarily raise connection limits if it helps stabilize the system.
  • Check for connection leaks in application code or services that don’t return connections to the pool (the query after this list helps spot them).
  • If not already in place, add a connection pool to manage usage efficiently.
  • Consider read/write splitting to spread the load across replicas.
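
In PostgreSQL, long-lived idle and "idle in transaction" sessions are the classic signature of a leak: the application grabbed a connection and never released it. A sketch that lists the oldest offenders:

-- Oldest idle sessions: likely leaked or unreturned connections
SELECT
  pid,
  usename,
  application_name,
  state,
  now() - state_change AS idle_for,
  left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state IN ('idle', 'idle in transaction')
ORDER BY state_change ASC
LIMIT 10;
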
💡
To better understand how logging and monitoring work together, check out this guide on logging vs monitoring.

How Monitoring Data Can Help Your Database

Monitoring data is most valuable when it feeds back into database tuning and capacity planning, not just incident response.

  • Detect gradual performance regression — Track query execution times (P95/P99) and buffer cache hit ratios over time. A slow decline often points to index inefficiency, stale table statistics, or data volume exceeding memory capacity.
  • Plan capacity upgrades — Monitor storage consumption, connection pool utilization, and IOPS trends. Forecast when thresholds (e.g., 80% of max connections or disk capacity) will be reached so you can provision ahead of time.
  • Prioritize query optimization — Use query frequency × execution time to identify the highest-cost statements (see the sketch below). Optimize those first to reduce CPU utilization and disk I/O load.
  • Refine configuration parameters — Analyze resource utilization patterns (buffer pool usage, work_mem, parallel workers) and adjust database settings to better match workload characteristics.

Applied consistently, this shifts monitoring from reactive firefighting to proactive workload optimization.
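
With pg_stat_statements, "query frequency × execution time" is simply the total_exec_time column (PostgreSQL 13+ naming), so the highest-cost statements fall out of a single query:

-- Highest total cost = calls × mean time; optimize these first
SELECT
  left(query, 60) AS query_snippet,
  calls,
  round(total_exec_time::numeric, 0) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;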

Final Thoughts

Database monitoring is most effective when it’s intentional. Tracking a focused set of metrics at the right stages (development, review, and production) helps you identify meaningful changes without adding unnecessary noise.

Last9 removes the complexity of piecing together Prometheus, Grafana, and alerting for database monitoring. With OpenTelemetry auto-instrumentation, you can start collecting metrics in under five minutes.

Database monitoring features teams use in production:

  • High-cardinality metric support — track performance by query type, user, endpoint, or database instance without sampling.
  • Prebuilt dashboards for PostgreSQL, MySQL, and Redis with key metrics like query latency, cache hit ratio, index usage, and replication lag.
  • Real-time correlation — jump from a metric spike to related query traces and logs.

You can explore Last9 for free, and if it’s helpful, we’re happy to walk you through how it can work for your database monitoring needs.

FAQs

How frequently should database metrics be collected?

Most metrics should be collected every 15-60 seconds for real-time monitoring, with historical data aggregated to longer intervals for trend analysis. Resource-intensive collections might be less frequent.

What's more important — availability or performance metrics?

Both are critical but serve different purposes. Availability metrics tell you if your system is functioning, while performance metrics tell you how well. A complete monitoring strategy needs both.

Should developers have access to database monitoring?

Yes, with appropriate guardrails. Developers with monitoring access can better understand the impact of their code on database performance and make more informed decisions.

How do you monitor databases in containerized environments?

Focus on collecting metrics from both the container layer (resource constraints, orchestration health) and from within the database itself. Be aware that ephemeral containers require special consideration for persistent monitoring.

How many alerts are too many?

If your team is regularly ignoring alerts or treating them as background noise, you have too many. Aim for actionable alerts that require human intervention rather than informational ones.

What's the relationship between database metrics and SLOs?

Database metrics should directly inform your Service Level Objectives. For example, if your SLO is "95% of queries complete in under 200ms," you need to monitor query execution time across your database fleet.
