Monitoring Node.js: Key Metrics You Should Track

Node.js monitoring plays an important role in maintaining reliable applications by tracking runtime metrics (memory, CPU), application metrics (request rates, response times), and business metrics (user actions, conversion rates). Effective monitoring helps you identify issues before they impact users and provides clear diagnostic information when troubleshooting is needed.

Before discussing specific monitoring guidelines, let's take a brief look at why metrics matter for Node.js applications.

Why metrics matter for Node.js applications

Without proper metrics, troubleshooting becomes difficult when users report vague issues like "the app feels slow." Good metrics transform these vague complaints into actionable data points such as "the payment service is experiencing 500ms response times, up from a 120ms baseline."

Metrics provide quantifiable information about your application's performance, resource utilization, and user experience. They enable you to:

Detect problems before users notice them
Diagnose the root cause of issues quickly
Make data-driven decisions about scaling and optimization
Demonstrate the impact of technical performance on business outcomes

💡 If you're just starting out with instrumentation, this guide on metrics monitoring breaks down the core concepts and what to track.

Types of Node.js metrics to monitor

When it comes to monitoring Node.js applications, there are three primary categories of metrics:

Runtime metrics
Application metrics
Business metrics

Runtime metrics

Runtime metrics show how Node.js itself is performing. They provide early warning signs of problems with the underlying platform.

Memory usage metrics

Node.js memory management and garbage collection can be complex. Monitor these metrics to identify memory issues:

Metric	Description	Warning Signs
Heap Used vs Heap Total	Amount of memory currently in use compared to allocated heap	Continuous growth without returning to baseline
External Memory	Memory used by C++ objects bound to JavaScript objects	Unexpected growth
Garbage Collection Frequency	How often garbage collection runs	Excessive frequency
RSS (Resident Set Size)	Total memory allocated for the Node process in RAM	Continuous growth
GC Duration	Time spent in garbage collection	Increasing duration

You can implement basic memory usage logging in your application:

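// Log the current memory usage of this process in megabytes.
const logMemoryUsage = () => {
  const memUsage = process.memoryUsage();
  const toMB = (bytes) => `${Math.round(bytes / 1024 / 1024)} MB`;
  console.log({
    rss: toMB(memUsage.rss),             // total memory held by the process
    heapTotal: toMB(memUsage.heapTotal), // heap reserved by V8
    heapUsed: toMB(memUsage.heapUsed),   // heap currently in use
    external: toMB(memUsage.external)    // C++ objects bound to JavaScript objects
  });
};

// Example schedule: sample every 30 seconds. unref() keeps this timer
// from holding the process open on its own.
setInterval(logMemoryUsage, 30_000).unref();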
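
The table above also lists GC frequency and duration, which `process.memoryUsage()` does not report. One way to capture them is the built-in `perf_hooks` module, which emits a performance entry for each garbage collection. The sketch below is a minimal version of that idea; the names `gcStats`, `recordGc`, and `flushGcStats` are illustrative, not part of any library.

```javascript
const { PerformanceObserver } = require('node:perf_hooks');

// Running totals for the current reporting window.
const gcStats = { count: 0, totalMs: 0, maxMs: 0 };

// Fold one 'gc' performance entry into the totals.
function recordGc(entry) {
  gcStats.count += 1;
  gcStats.totalMs += entry.duration; // entry.duration is in milliseconds
  gcStats.maxMs = Math.max(gcStats.maxMs, entry.duration);
}

// Return the totals for the window that just ended and start a new one.
function flushGcStats() {
  const snapshot = { ...gcStats };
  gcStats.count = 0;
  gcStats.totalMs = 0;
  gcStats.maxMs = 0;
  return snapshot;
}

// Ask the runtime to notify us about every garbage collection.
const gcObserver = new PerformanceObserver((list) => {
  list.getEntries().forEach(recordGc);
});
gcObserver.observe({ entryTypes: ['gc'] });
```

Calling `flushGcStats()` on the same schedule as your memory logging gives you GC runs per interval (frequency) alongside total and maximum pause time (duration).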

CPU utilization metrics

Node.js runs your JavaScript on a single main thread by default, which means CPU-bound work can become a bottleneck, especially under load. That's why CPU-related metrics deserve close attention.

CPU Usage Percentage: This tells you how much of the CPU your Node.js process is actually using. If it’s consistently high, it might be time to optimize your code or consider offloading heavy work to worker threads or external services.
Event Loop Lag: Think of the event loop as the traffic controller of your Node app. Lag here means tasks are waiting longer than they should to be executed. Spikes in lag often point to blocking operations or too much CPU-bound work happening on the main thread.
Active Handles: This counts the number of open handles—like sockets, timers, and file descriptors. If this number keeps climbing and never drops, you might have a resource leak on your hands. It’s a good early warning sign that something’s off under the hood.

Together, these metrics help you understand not just how busy your Node.js app is, but why it might be struggling.
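
If you want to sample these values yourself, a rough sketch using only built-in APIs might look like the following. `process.cpuUsage()` reports cumulative CPU time in microseconds, so the percentage comes from the difference between two samples. The helper names (`cpuPercent`, `sampleCpu`, `activeResourceCount`) are made up for this example, and `process.getActiveResourcesInfo()` is only available on newer Node.js versions and is marked experimental, so the code checks for it first.

```javascript
// Convert a CPU-time delta (microseconds) into a percentage of one core.
function cpuPercent(cpuDelta, elapsedMs) {
  const usedMs = (cpuDelta.user + cpuDelta.system) / 1000;
  return (usedMs / elapsedMs) * 100;
}

let lastCpu = process.cpuUsage();
let lastTime = process.hrtime.bigint();

// CPU usage of this process since the previous call to sampleCpu().
function sampleCpu() {
  const cpu = process.cpuUsage();
  const now = process.hrtime.bigint();
  const delta = {
    user: cpu.user - lastCpu.user,
    system: cpu.system - lastCpu.system,
  };
  const elapsedMs = Number(now - lastTime) / 1e6;
  lastCpu = cpu;
  lastTime = now;
  return cpuPercent(delta, elapsedMs);
}

// Number of resources (sockets, timers, and so on) keeping the event loop alive.
// Returns null on Node.js versions that lack the API.
function activeResourceCount() {
  if (typeof process.getActiveResourcesInfo !== 'function') return null;
  return process.getActiveResourcesInfo().length;
}
```

The percentage is relative to a single core, so a process that also uses worker threads or the libuv thread pool can report more than 100%.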

💡 To start instrumenting your app with metrics that reflect real usage and behavior, check out our guide on Getting Started with OpenTelemetry Custom Metrics.

Event loop metrics

The event loop is at the heart of how Node.js handles concurrency. Since everything runs through it, any hiccup here can slow down your entire app. These metrics help you keep tabs on its health:

Event Loop Lag: Measures the delay between when a callback is scheduled and when it actually runs. High lag usually means the event loop is getting blocked—often by heavy synchronous code.
Event Loop Utilization: Tells you how busy the event loop is. A high utilization means it's constantly running code with little idle time—possibly a sign of overload.
Average Tick Length: This is the average time between each “tick” of the event loop. Longer tick lengths can point to slow operations or too much happening per iteration.
Maximum Tick Length: Shows the longest delay between ticks. Spikes here are red flags for blocking operations such as large JSON parsing, synchronous I/O, or CPU-intensive tasks.

Tracking these metrics gives you a window into how responsive your app is—and whether the event loop is staying healthy or getting overwhelmed.
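
Node.js exposes most of this through the built-in `perf_hooks` module. The sketch below combines `monitorEventLoopDelay()` (a histogram of how late a recurring timer fires, reported in nanoseconds) with `performance.eventLoopUtilization()`. It treats the delay histogram as a stand-in for lag and tick length, which is an approximation, and `sampleEventLoop` is a name invented for this example.

```javascript
const { monitorEventLoopDelay, performance } = require('node:perf_hooks');

// Sample the loop delay every 20 ms.
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

let lastElu = performance.eventLoopUtilization();

// Histogram values are nanoseconds; an empty histogram can report NaN.
const nsToMs = (ns) => (Number.isFinite(ns) ? ns / 1e6 : 0);

// Summarize event loop health since the previous call, then reset the window.
function sampleEventLoop() {
  const current = performance.eventLoopUtilization();
  // Passing two readings returns the utilization between them.
  const delta = performance.eventLoopUtilization(current, lastElu);
  lastElu = current;

  const sample = {
    utilization: delta.utilization, // 0 = idle, 1 = fully busy
    lagMeanMs: nsToMs(loopDelay.mean),
    lagP99Ms: nsToMs(loopDelay.percentile(99)),
    lagMaxMs: nsToMs(loopDelay.max),
  };
  loopDelay.reset();
  return sample;
}
```

Call `sampleEventLoop()` on a fixed interval and export the result with the rest of your runtime metrics. Call `loopDelay.disable()` when you no longer need the samples.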

The following table provides guidance on threshold values for key runtime metrics:

Metric	Warning Threshold	Critical Threshold	What It Means
Event Loop Lag	> 100ms	> 500ms	The application is struggling to process callbacks quickly enough
CPU Usage	> 70% sustained	> 90% sustained	Approaching CPU limits
Memory Growth	Steady increase over hours	Near heap limit	Possible memory leak

Application metrics

While runtime metrics tell you how your app is running, application metrics tell you how well. They help you understand how your code behaves under real-world usage and how it interacts with everything around it.

HTTP request metrics

For web apps and APIs, these are the basics:

Request Rate: Tracks how many requests your app is handling per second, ideally broken down by endpoint. Useful for spotting spikes or sudden drops in traffic.
Response Time: Go beyond averages—look at P95 and P99 latencies to understand worst-case performance.
Error Rate: Monitors how many requests result in client (4xx) or server (5xx) errors. A sudden rise can indicate regressions or dependency failures.
Concurrent Connections: Shows how many connections are open at a time, which can hint at resource saturation or connection pooling issues.
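
As a rough illustration, the metrics above can be gathered by a small piece of middleware. The sketch below assumes an Express-style `(req, res, next)` signature and keeps everything in memory; `createHttpMetrics` and the shape of its snapshot are invented for this example. It counts in-flight requests as a proxy for concurrency rather than tracking raw TCP connections.

```javascript
function createHttpMetrics() {
  const routes = new Map(); // "METHOD path" -> per-route stats
  let inFlight = 0;

  function middleware(req, res, next) {
    const start = process.hrtime.bigint();
    inFlight += 1;

    // 'close' fires once the response is done, including aborted requests.
    res.once('close', () => {
      inFlight -= 1;
      const ms = Number(process.hrtime.bigint() - start) / 1e6;

      // Prefer the matched route pattern (for example /users/:id) over the
      // raw URL so each distinct ID does not become its own series.
      const path = (req.route && req.route.path) || req.url;
      const key = `${req.method} ${path}`;

      const stats = routes.get(key) ||
        { count: 0, clientErrors: 0, serverErrors: 0, totalMs: 0, maxMs: 0 };
      stats.count += 1;
      stats.totalMs += ms;
      stats.maxMs = Math.max(stats.maxMs, ms);
      if (res.statusCode >= 500) stats.serverErrors += 1;
      else if (res.statusCode >= 400) stats.clientErrors += 1;
      routes.set(key, stats);
    });

    next();
  }

  // Plain-object view of the current state, suitable for logging or export.
  const snapshot = () => ({ inFlight, routes: Object.fromEntries(routes) });

  return { middleware, snapshot };
}
```

With Express you would register it using `app.use(httpMetrics.middleware)` before your routes. To get P95 and P99 latency, record each duration in a histogram instead of keeping only the total and the maximum.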

Database and external service metrics

If your app talks to a database or external APIs, these metrics help you track that interaction:

Query Execution Time: How long your database queries take—critical for tracking bottlenecks.
Connection Pool Utilization: Helps you see if your app is running out of available DB connections.
External API Response Times: If you depend on third-party services, track how quickly (or slowly) they respond.
Failed Transactions: Measures the rate of failed DB or API requests—helpful for early detection of outages or misconfigurations.
KB Read/Written Per Second: Indicates disk I/O patterns, especially for data-heavy apps.
Network I/O: Tracks how much data your app is sending and receiving over the network.
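
Query time, external API response time, and failure counts can all be captured with one small wrapper around the call you want to measure. The version below is a sketch rather than a library API: `timed` and the `record` callback are hypothetical names, and you would pass in whatever your metrics client expects.

```javascript
// Run fn, measure how long it takes, and report the outcome to record().
// Works for both synchronous functions and functions that return a promise.
function timed(name, fn, record) {
  const start = process.hrtime.bigint();
  const done = (ok) => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    record({ name, ms, ok });
  };

  try {
    const result = fn();
    if (result && typeof result.then === 'function') {
      // Async path: report when the promise settles, then pass the result on.
      return result.then(
        (value) => { done(true); return value; },
        (err) => { done(false); throw err; }
      );
    }
    done(true);
    return result;
  } catch (err) {
    done(false); // failed calls still count toward latency and error rate
    throw err;
  }
}
```

A call such as `await timed('db.getUser', () => db.query(sql), record)` would then yield both the query duration and a success flag, where `db.query` stands in for your own database client.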

Business metrics

To connect technical performance to real-world outcomes, track metrics that matter to your business:

Conversion Rate: If your API is slow, fewer users may complete purchases.
User Engagement: Metrics like session duration or pages per visit help correlate performance with user satisfaction.
Cart Abandonment: Often tied to slow checkouts or broken flows.
Revenue Impact: Know how much a slow or downed service costs in actual dollars.

Bringing these all together helps you monitor more than just infrastructure—it helps you monitor impact.

How to Collect Metrics in Node.js

Understanding how your Node.js app behaves in production starts with the right metrics. Here's how to set things up—starting simple and layering in what you need as your system grows.

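1. Use Built-in Node.js APIs for Basic Health Checks

Node.js gives you access to process and system-level stats using built-in modules like os and process. You can use these to expose a basic /health endpoint that reports uptime, memory usage, and CPU load, which is good enough for quick checks or load balancer probes.

// Assumes an Express app; other HTTP frameworks work the same way.
const express = require('express');
const os = require('os');

const app = express();

app.get('/health', (req, res) => {
  res.json({
    uptime: process.uptime(),      // seconds since the process started
    memory: process.memoryUsage(), // rss, heapTotal, heapUsed, external (bytes)
    cpu: {
      load: os.loadavg(),          // 1, 5, and 15 minute load averages (all 0 on Windows)
      cores: os.cpus().length
    }
  });
});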

2. Use OpenTelemetry for Deeper Visibility

Built-in metrics are fine for local dev or basic checks, but in production, you’ll want something more structured. OpenTelemetry is a solid choice. It gives you traces and metrics out of the box and works with the most popular Node.js libraries.

You can send this data to any backend that supports OpenTelemetry, like Prometheus (with Grafana for dashboards), or a managed platform like Last9, which handles metrics, logs, and traces together. This makes it easier to connect performance data across services without stitching together separate tools.

OpenTelemetry also supports auto-instrumentation, so you don’t have to manually add metrics for every HTTP call or DB query—it just works with most of the common libraries.

💡 Understanding how OpenTelemetry handles metric aggregation is key to building a reliable observability setup. This guide on OpenTelemetry Metrics Aggregation breaks down concepts like delta vs. cumulative temporality and offers practical advice on organizing and analyzing performance data.

3. Track Custom Metrics That Matter to Your App

Generic system metrics are fine, but the stuff that helps you debug and improve often comes from your code.

Checkout Time, API Calls, and Business Flows

Track how long specific flows take—like placing an order, logging in, or uploading a file. Count successes, errors, and time taken.

Cache Hits and Misses

A simple hit/miss counter tells you if your cache is doing its job — or just sitting there.

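// `data` is the result of a cache lookup; `metrics` stands in for
// whatever metrics client your app uses.
if (data) {
  metrics.increment('cache.hit');
} else {
  metrics.increment('cache.miss');
}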
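
The `metrics.increment` calls above assume some metrics client is already in scope. If you don't have one yet, a tiny in-memory stand-in is enough to experiment with. The sketch below is only that: the `metrics` object and `cacheHitRatio` helper are illustrative, and a real setup would forward these counters to your monitoring backend.

```javascript
const counters = new Map();

// Minimal counter store exposing the increment() call used above.
const metrics = {
  increment(name, by = 1) {
    counters.set(name, (counters.get(name) || 0) + by);
  },
  get(name) {
    return counters.get(name) || 0;
  },
};

// Share of lookups served from cache, or null before any lookup happens.
function cacheHitRatio() {
  const hits = metrics.get('cache.hit');
  const total = hits + metrics.get('cache.miss');
  return total === 0 ? null : hits / total;
}
```

A hit ratio that drops after a deploy or a traffic shift is often a more useful signal than the raw hit and miss counts on their own.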

Feature Usage and Internal State

Track things like job queue depth, retry counts, or how often a feature flag is used. These can show early signs of trouble before users complain.

How to Set Up Alerts That Don't Waste Your Time

Collecting metrics is straightforward. But knowing when something needs your attention—that’s where alerting comes in. The goal isn’t just to trigger alarms, but to catch real issues early without flooding your inbox.

Use Multi-Level Alerts

Instead of a single “on/off” trigger, set up alert stages:

Warning if latency crosses 200ms for 5 minutes
Error if it hits 500ms for 3 minutes
Critical if the error rate goes above 5% for 1 minute

This gives you room to investigate before things fully break.
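
To make the "for N minutes" part concrete, here is a sketch of how staged rules like these could be evaluated against a series of samples. Most alerting tools do this for you; the rule format, field names (`ts`, `latencyMs`, `errorRate`), and function names below are assumptions made for illustration.

```javascript
// Rules mirror the stages above, ordered from most to least severe.
const rules = [
  { level: 'critical', metric: 'errorRate', threshold: 0.05, forMs: 1 * 60_000 },
  { level: 'error', metric: 'latencyMs', threshold: 500, forMs: 3 * 60_000 },
  { level: 'warning', metric: 'latencyMs', threshold: 200, forMs: 5 * 60_000 },
];

// A rule fires only when the samples cover its whole window and every
// sample inside the window breaches the threshold.
// `samples` is an array of { ts, latencyMs, errorRate } sorted by ts (ms).
function ruleFires(rule, samples, now) {
  if (samples.length === 0) return false;
  const windowStart = now - rule.forMs;
  if (samples[0].ts > windowStart) return false; // not enough history yet
  const recent = samples.filter((s) => s.ts >= windowStart && s.ts <= now);
  return recent.length > 0 && recent.every((s) => s[rule.metric] > rule.threshold);
}

// Most severe level whose rule currently fires, or 'ok'.
function alertLevel(samples, now) {
  const hit = rules.find((rule) => ruleFires(rule, samples, now));
  return hit ? hit.level : 'ok';
}
```

Requiring every sample in the window to breach is what keeps a single slow request from paging anyone.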

Correlate Metrics, Don’t Alert on Just One

A spike in API latency isn’t always useful on its own. But if DB connection time also goes up, it probably points to a database issue.
If CPU usage jumps but throughput stays flat, it could be a background job running wild, or a memory leak that keeps the garbage collector busy.

Correlating metrics like this helps you avoid false positives and surface what needs fixing.

Handle Noisy Workloads with Anomaly Detection

For apps with traffic that changes throughout the day or week, static thresholds aren’t reliable. Anomaly detection helps by flagging behavior that’s out of line with past patterns—so you catch real problems, not just expected spikes.
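
Production anomaly detection usually accounts for seasonality and trend, but the core idea can be shown with a simple z-score check: compare the current value against the mean and standard deviation of comparable past values, such as the same hour on previous days. The function name and the default threshold of 3 are assumptions for this sketch.

```javascript
// Flag `value` if it sits more than zThreshold standard deviations
// away from the mean of `history` (comparable past observations).
function isAnomalous(history, value, zThreshold = 3) {
  if (history.length < 2) return false; // too little data to judge

  const mean = history.reduce((sum, x) => sum + x, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (history.length - 1);
  const std = Math.sqrt(variance);

  // A perfectly flat history makes any change an outlier.
  if (std === 0) return value !== mean;

  return Math.abs(value - mean) / std > zThreshold;
}
```

For example, with a history of `[100, 110, 90, 105, 95]` requests per second, a reading of 108 is within normal variation while a reading of 160 is flagged.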

💡 If you're looking for a way to set this up without fighting your alerting tool, Last9's Alert Studio makes it easier to define, tune, and manage alerts, whether you want simple rules or more advanced correlations.

Common Node.js Monitoring Mistakes to Avoid

Even with the right tools in place, it’s easy to misread what your metrics are telling you. Here are a few common pitfalls to watch out for:

Misreading Memory Usage

Node.js memory graphs often look jagged, and that’s by design. Garbage collection creates a sawtooth pattern where memory usage rises, then drops sharply. Don’t panic when you see this. Instead, pay attention to whether the peaks are creeping upward over time—that’s what usually signals a problem.
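
One way to check for creeping peaks programmatically is to pull out the local maxima of a `heapUsed` series and fit a line through them. The sketch below does this with a plain least-squares slope; `findPeaks` and `peakSlope` are illustrative names, and the sample values are invented.

```javascript
// Local maxima: points higher than the previous sample and
// at least as high as the next one.
function findPeaks(series) {
  const peaks = [];
  for (let i = 1; i < series.length - 1; i++) {
    if (series[i] > series[i - 1] && series[i] >= series[i + 1]) {
      peaks.push(series[i]);
    }
  }
  return peaks;
}

// Least-squares slope through the peaks, in series units per peak.
// Near zero means a healthy sawtooth; steadily positive suggests a leak.
function peakSlope(series) {
  const peaks = findPeaks(series);
  const n = peaks.length;
  if (n < 2) return 0;

  const meanX = (n - 1) / 2;
  const meanY = peaks.reduce((sum, y) => sum + y, 0) / n;
  let num = 0;
  let den = 0;
  peaks.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}
```

A healthy series such as `[50, 80, 50, 80, 50, 80, 50]` (MB) has flat peaks, while `[50, 80, 55, 90, 60, 100, 65]` gains 10 MB per peak and deserves a closer look.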

Relying on Averages

Averages can be misleading. Let’s say your average response time is 100ms. That could mean every request takes 100ms, or that most take 50ms while a few spike to 1 second. Those outliers matter. Look at p95 or p99 latency to get a more realistic picture of user experience.
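
A quick calculation shows how far apart the average and the tail can be. The sketch below uses the nearest-rank method for percentiles and a made-up sample of 100 requests: 94 fast ones at 50ms and 6 slow ones at 1 second.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of
// the samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Hypothetical latencies in milliseconds.
const latencies = [...Array(94).fill(50), ...Array(6).fill(1000)];

console.log({
  mean: mean(latencies),           // 107
  p50: percentile(latencies, 50),  // 50
  p95: percentile(latencies, 95),  // 1000
  p99: percentile(latencies, 99),  // 1000
});
```

The average of 107ms looks healthy, yet more than one request in twenty takes a full second, which only the p95 and p99 values reveal.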

Ignoring Socket.IO Metrics

If your app uses Socket.IO or any real-time protocol, don’t stop at HTTP metrics. You’ll want to track:

Active connections
Total connections since startup
Number and size of messages sent/received

These can reveal silent performance issues that don’t show up in traditional request metrics.
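
A minimal way to collect these numbers is to hook into the server's connection lifecycle. The sketch below assumes a Socket.IO server instance (`io`) from version 3 or later, where `socket.onAny()` is available as a catch-all listener for incoming events. `trackSocketMetrics` is a name made up for this example, and the byte count is an approximation based on the JSON length of each payload, not the bytes on the wire.

```javascript
// Attach listeners to a Socket.IO server and return a live stats object.
function trackSocketMetrics(io) {
  const stats = {
    active: 0,              // connections open right now
    total: 0,               // connections since startup
    messagesReceived: 0,
    approxBytesReceived: 0,
  };

  io.on('connection', (socket) => {
    stats.active += 1;
    stats.total += 1;

    socket.on('disconnect', () => {
      stats.active -= 1;
    });

    // Runs for every incoming event, whatever its name.
    socket.onAny((event, ...args) => {
      stats.messagesReceived += 1;
      stats.approxBytesReceived += Buffer.byteLength(JSON.stringify(args));
    });
  });

  return stats;
}
```

Export the returned object with your other application metrics. An `active` count that keeps climbing while traffic is flat usually points to clients that never disconnect cleanly.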

Getting Started with Node.js Monitoring

If you're just starting, here’s a practical way to build up your monitoring step by step:

Track core runtime metrics: Memory, CPU, and event loop health
Add application-level metrics: HTTP latency, DB queries, external API calls
Include business metrics: Conversion rates, checkout times, user drop-offs
Set up alerts: Use multi-level thresholds and pattern-based rules to reduce noise
Build dashboards: Tailor them for different teams—devs, ops, product
Review and adjust regularly: Use past incidents to improve what you monitor going forward

Final Thoughts

There’s no one-size-fits-all monitoring setup. What matters is tracking the metrics that help you answer questions like:

Is the app slowing down?
Are things failing silently?
Is user experience getting worse?

Once you’ve got the basics in place, the next step is making sense of the data. That’s where our platform, Last9, can help. We bring together your metrics, logs, and traces—so you can spot patterns, connect issues across services, and debug faster. You can even convert logs into metrics when needed, helping you track the right signals without extra overhead.

Talk to us if you'd like to know more about the platform capabilities or get started for free!

FAQs

How does Node.js's garbage collection affect my metrics?

Garbage collection causes periodic pauses in execution, visible as spikes in latency metrics and a sawtooth pattern in memory usage. These are normal, but excessive GC can indicate memory problems. Monitor both GC frequency and duration.

What's the overhead of collecting these metrics?

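Modern monitoring solutions usually add little overhead (often under 1% CPU), but the actual cost depends on how many metrics you collect, their cardinality, and how much auto-instrumentation you enable, so measure it in your own environment. Start with the essential metrics, then expand as needed.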

Should I instrument all my API endpoints?

Start with your most critical paths. For an e-commerce site, that might be product browsing, cart, and checkout flows. Add more based on customer impact.

How do I track Node.js metrics in a microservices architecture?

Correlation is key. Use consistent tracing IDs across services and a platform that can connect related metrics to trace requests across your entire system.

What's the difference between logs, metrics, and traces?

Logs are records of discrete events ("User X logged in")
Metrics are numeric measurements over time ("5 logins per minute")
Traces follow operations across multiple services ("User login → Auth service → Database")

What's a good alerting strategy for Node.js applications?

Multi-level alerting based on impact:

Info: Anomalies worth noting, but not urgent
Warning: Issues that need attention soon
Critical: Problems affecting users right now

Route different levels to appropriate channels (email for warnings, urgent notifications for critical alerts).