
What is APM Tracing?

Understand APM tracing to see how a request moves through services, helping you spot delays, errors, and bottlenecks quickly.

Sep 3rd, ‘25

APM tracing records the complete execution path of a request as it travels through your system, including database queries, external API calls, cache lookups, message queue events, and inter-service requests. Each step is captured with precise start and end timestamps, duration, and context such as service name, operation name, and relevant attributes.

This lets you pinpoint where latency or errors originate without piecing together metrics and logs manually. Metrics indicate that something is wrong, logs provide local detail, and traces connect them, showing how a request moved across services and how a delay or failure propagated.

What APM Means for Your Applications

Application Performance Monitoring (APM) focuses on measuring and analyzing how your application behaves under real user traffic, identifying performance regressions before they affect the user experience.

In modern architectures—comprising microservices, containers, serverless functions, databases, caches, message brokers, and third-party APIs—failures can emerge in any layer. Traditional infrastructure monitoring may raise an alert when a service degrades, but APM provides deeper insight by correlating signals across the stack to isolate the root cause.

The Role of APM in Observability

APM brings together three primary telemetry types:

  • Metrics – Time-series measurements such as latency percentiles (p50, p95, p99), error_rate, throughput, and resource utilization (CPU, memory, disk I/O).
  • Logs – Structured or unstructured event records with contextual fields (request_id, user_id, stack_trace).
  • Traces – Distributed transaction records mapping the full execution path of a request across services, with span-level timing, attributes, and causal relationships.

By correlating these, APM creates a complete performance profile. For example, while infrastructure monitoring might show sustained CPU usage above 85%, APM can attribute it to a specific API endpoint executing inefficient SQL queries such as:

SELECT * FROM orders WHERE status != 'delivered';

Beyond raw measurements, APM ties performance data to user experience and business outcomes: for instance, connecting increased checkout latency to a drop in conversion rate. This context enables you to move from reactive firefighting to proactive optimization.

💡
You can also use APM logs to connect trace context with detailed event data for faster debugging.

Why Tracing Matters in Distributed Systems

In monolithic architectures, a single request might hit one database or maybe a couple of external HTTP APIs. Troubleshooting is straightforward: fewer hops mean fewer unknowns.

Modern distributed applications are different. A single user request can span:

  • Multiple microservices, each with its own codebase and deployment pipeline.
  • Message queues and streaming pipelines (Kafka, RabbitMQ, Kinesis).
  • In-memory caches (Redis, Memcached) for faster lookups.
  • External API calls to payment gateways, search services, or fraud detection systems.
  • Multiple databases with different query patterns, isolation levels, and response characteristics.

This complexity makes isolating latency or failures much harder.

  • Metrics tell you something is wrong — e.g., error_rate spike or p95_latency doubled.
  • Logs show what happened inside each service.
  • Traces stitch these signals together into a correlated, end-to-end view, showing where and why a request slowed or failed.

Without tracing, you’re left correlating timestamps from multiple log files and guessing where the slowdown started. With tracing, you see the full path—spans in sequence, with precise timing and causal relationships.

Example: Checkout Request Trace

User Service (23ms) → Auth Service (12ms) → Payment Service (156ms) → Email Service (45ms)
                                            └── Database (134ms)

At first glance, Payment Service appears slow at 156ms. Tracing shows 134ms of that was a database query. The bottleneck isn’t the service—it’s the query execution or indexing strategy.

What You Get from Traces

1. Request Flow with Timing

A trace is composed of spans. A span represents a single operation—an HTTP call, a database query, a message queue consumption, or a function execution. Each span contains:

  • start_time / end_time — precise duration measurement.
  • parent_span_id — showing triggering relationships.
  • Attributes/tags — e.g., db.system="postgresql", http.method="POST", net.peer.name="payment-service".

Tracing shows:

  • Total request duration.
  • Sequential vs. parallel operations.
  • Idle time or I/O wait points.

Example insight: An API call takes 200ms total. Tracing reveals 150ms was waiting on a downstream service with database lock contention.
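
To make that concrete, here's roughly what a single span looks like once exported, shown as a Python dict. The field names follow OpenTelemetry conventions; the values are illustrative, not taken from a real trace:

# One exported span from a checkout trace (values are made up)
span = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "parent_span_id": "53995c3f42cd8ad8",
    "name": "POST /payments",
    "start_time": "2025-09-03T10:15:42.120Z",
    "end_time": "2025-09-03T10:15:42.276Z",   # 156ms duration
    "attributes": {
        "http.method": "POST",
        "net.peer.name": "payment-service",
        "http.status_code": 200,
    },
    "status": "OK",
}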

2. Error Context

Tracing surfaces the exact span where an error occurred.

Instead of:

Payment Service returned HTTP 500

You get:

Payment Service (5ms) → Fraud Detection Service (40ms) → Risk Database (timeout after 30s)

Error spans often include:

  • Exception messages
  • Stack traces
  • HTTP status codes (500, 504)
  • Database error codes (deadlock_detected, connection_timeout)

This reduces time spent jumping across log streams to piece together failures.
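
As a sketch of how that error context ends up on a span with the OpenTelemetry Python API: record_exception and set_status are standard API calls, while submit_to_gateway is a hypothetical downstream call standing in for your payment client.

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_payment(order_id):
    with tracer.start_as_current_span("charge_payment") as span:
        try:
            submit_to_gateway(order_id)  # hypothetical call to the payment gateway
        except TimeoutError as exc:
            # Attach the exception and mark the span as failed so the error
            # surfaces on exactly this span in the trace view
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, "gateway timeout"))
            raise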

3. Service Dependency Mapping

Traces expose the actual service dependency graph—not just the one in your architecture diagrams.

Example:

  • You expect Service A → Service B.
  • Trace shows Service A → Service C → Service A (circular dependency).

Deep call stacks (6+ services in one request) can signal the need for architectural simplification, caching, or asynchronous workflows.

In APM, distributed tracing is the connective tissue between metrics and logs. Without it, you’re working with disconnected data points. With it, you see the sequence, timing, and context of every operation in a request’s lifecycle—making root cause analysis faster and far more precise.

💡
For a broader view of how APM fits into observability and helps connect performance data to system behavior, read our blog!

Configure Tracing Without Breaking Your System

1. Set a Sampling Strategy

In most production setups, tracing every request is impractical—storage grows fast, query performance degrades, and costs spike. To keep systems stable, teams often adopt sampling strategies such as:

  • 100% of error traces — full detail for failure analysis
  • 10% of slow requests — those exceeding your SLA threshold
  • 1% of normal requests — for baseline performance tracking

Sampling can be applied:

  • At instrumentation — The SDK decides which spans to record.
  • In the OpenTelemetry Collector — The collector receives all spans but selectively exports only a subset to the backend.

However, sampling has trade-offs. Once a trace is dropped, you can’t reconstruct it later, which can lead to blind spots in debugging.
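
If you do sample at instrumentation time, a minimal sketch with the OpenTelemetry Python SDK's built-in samplers looks like this. ParentBased and TraceIdRatioBased are SDK classes; the 1% ratio mirrors the baseline tier above.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Record ~1% of new traces; for requests arriving with trace context,
# honor the sampling decision the caller already made
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)

Note that ratio-based sampling can't implement the error- and latency-aware tiers above on its own; those decisions need the finished trace, which is why they are usually handled as tail-based sampling in the OpenTelemetry Collector.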

With Last9, you don’t need to sample at all. Our architecture is built for full-fidelity tracing, even at very high cardinality, without the usual storage explosion. This means:

  • You capture every trace in production.
  • You can retroactively slice and filter data without worrying about what was discarded.
  • You still get predictable, cost-efficient storage and query performance.

If your current APM forces you into sampling, you’re making trade-offs that aren’t necessary with Last9.

Probo Cuts Monitoring Costs by 90% with Last9

2. Focus on High-Value Operations

Even with full-fidelity traces, you want meaningful instrumentation. Target operations that impact performance or cross service boundaries:

  • Inbound and outbound HTTP calls
  • SQL and NoSQL queries
  • Cache operations (Redis, Memcached)
  • Message queue publish/consume (Kafka, RabbitMQ, SQS)
  • External API calls (payment gateways, search, fraud detection)

Most modern frameworks support auto-instrumentation for these.

Example: Python OpenTelemetry setup

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a global tracer provider and export spans in batches over OTLP/gRPC
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

def process_order(order_id):
    # One span per business operation, with attributes you can filter on later
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        validate_order(order_id)   # your application logic
        charge_payment(order_id)
        span.set_attribute("order.status", "completed")

Start with high-traffic or business-critical flows like checkout, authentication, and data writes. Once trace quality is validated in production, expand coverage to more granular spans.

3. Maintain Trace Context Across Services

Trace continuity relies on propagating trace_id and span_id between services. For synchronous HTTP calls, this is typically done with the traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

The four dash-separated fields are the format version, the 128-bit trace_id, the parent span_id, and the trace flags (01 means the trace was sampled).

Once configured, most HTTP frameworks automatically forward this header.
For asynchronous workloads (message queues, background jobs, streaming systems), you'll need to:

  • Include trace context in message metadata or payload.
  • Extract it in the consumer to continue the same trace.

This ensures a single request’s timeline remains connected, even when crossing service or protocol boundaries.
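
Here's a minimal sketch of manual propagation for a queue-based workflow using the OpenTelemetry propagation API. The inject and extract calls are standard; the message shape and the publish/consume helpers are hypothetical.

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def publish_order_event(order_id, queue):
    # Producer: write the current trace context (traceparent) into message metadata
    with tracer.start_as_current_span("publish_order_event"):
        headers = {}
        inject(headers)
        queue.publish({"order_id": order_id, "headers": headers})

def handle_order_event(message):
    # Consumer: restore the context so this span joins the original trace
    ctx = extract(message["headers"])
    with tracer.start_as_current_span("handle_order_event", context=ctx):
        process_order(message["order_id"])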

The biggest debugging gains come from correlating telemetry types using a shared trace_id:

  • From a metric spike, pull related traces to see which requests caused it.
  • From a trace, jump to the logs for that specific request.

Example: Logging with trace context in Python

import logging
from opentelemetry import trace

# Attach the formatter to a handler so trace_id actually shows up in the output
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s - trace_id=%(trace_id)s - %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def process_request(user_id):
    # Pull the trace_id from the active span and attach it to the log record
    span_context = trace.get_current_span().get_span_context()
    logger.info(
        f"Processing request for user {user_id}",
        extra={"trace_id": format(span_context.trace_id, "032x")}
    )

With Last9, traces, logs, and metrics are automatically correlated, even at 20M+ cardinality per metric, so you get the full picture without custom linking logic. This means faster root cause analysis, complete data fidelity with no sampling, and predictable costs at scale.

💡
Now, pinpoint and resolve production performance issues faster with APM tracing and Last9 MCP. Bring real-time traces, metrics, and logs into your local environment to identify bottlenecks and fix them with precision.

Common Trace Patterns to Look for in Production

1. N+1 Query Pattern in Database Access

The N+1 query issue happens when a service fetches related data with multiple queries instead of one optimized call.

Example: Fetching 20 user profiles → 1 query for IDs + 20 separate queries for details = 21 total queries. Individually fast queries add up, saturating DB connections, wasting CPU, and increasing I/O. Under load, this can cause connection pool exhaustion and backlog growth.

In traces:

  • Repeated spans for the same db.statement.
  • Sequential execution with consistent per-span latency.

Fix:

  • Batch related queries.
  • Use JOIN or IN clauses.
  • Enable ORM features like eager loading (select_related in Django, include in Sequelize).
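
A quick sketch of the anti-pattern and the batched fix, assuming a generic db handle with a DB-API-style execute method and a PostgreSQL-compatible database; names and queries are illustrative.

# N+1: one query per user appears in traces as 20 near-identical sequential spans
profiles = []
for user_id in user_ids:
    profiles.append(
        db.execute("SELECT * FROM profiles WHERE user_id = %s", (user_id,))
    )

# Batched fix: a single query with an IN/ANY clause produces one span and one round trip
profiles = db.execute(
    "SELECT * FROM profiles WHERE user_id = ANY(%s)", (user_ids,)
)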

2. Long Wait Spans from Silent Timeouts

Silent timeouts occur when a service waits excessively for a dependency, often using default timeouts like 30s. This ties up worker threads or processes, reducing throughput and hiding the root cause, especially if the downstream service isn’t failing outright but responding slowly.

In traces:

  • One large-duration span with no child spans.
  • No downstream activity during the wait.

Fix:

  • Set realistic timeout values based on SLA.
  • Add circuit breakers (Resilience4j, Hystrix).
  • Alert on spans exceeding p95/p99 thresholds.
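
A small sketch of the first fix using the requests library; the endpoint, the timeout values, and the fallback score are assumptions.

import requests

DEFAULT_RISK_SCORE = 0.5  # hypothetical fallback when the risk service is slow

def fetch_risk_score(order_id):
    try:
        # 2s to connect, 5s to read -- instead of an implicit 30s (or unbounded) wait
        resp = requests.get(
            f"https://risk-db.internal/score/{order_id}",  # hypothetical endpoint
            timeout=(2, 5),
        )
        resp.raise_for_status()
        return resp.json()["score"]
    except requests.Timeout:
        # Fail fast and degrade gracefully instead of tying up a worker thread
        return DEFAULT_RISK_SCORE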

3. Cascading Latency in Dependency Chains

One slow service can ripple upstream in synchronous call chains.
Example: Database latency in Service C delays Service B, which delays Service A, ultimately hitting the user. Teams without tracing often misdiagnose the issue, wasting time fixing non-bottleneck services.

In traces:

  • Sequential spans with growing delays at deeper layers.
  • Total request time driven by the slowest dependency.

Fix:

  • Add caching at upstream layers.
  • Break deep synchronous chains into async flows.
  • Reduce dependency depth to avoid unnecessary hops.
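
A minimal sketch of upstream caching with a hand-rolled TTL cache, so Service A stops paying Service C's latency on every request; the fetch function and TTL value are assumptions.

import time

_cache = {}  # item_id -> (value, fetched_at)

def get_catalog_entry(item_id, ttl_seconds=30):
    entry = _cache.get(item_id)
    if entry and time.monotonic() - entry[1] < ttl_seconds:
        return entry[0]  # still fresh: skip the slow downstream chain entirely
    value = fetch_from_service_c(item_id)  # hypothetical slow dependency
    _cache[item_id] = (value, time.monotonic())
    return value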

4. Fan-Out Bottlenecks

When a service makes multiple parallel calls, the slowest one dictates total latency.
If an “optional” dependency is coded synchronously, it becomes part of the critical path without being obvious.

In traces:

  • Parallel child spans starting at the same time.
  • One outlier span much longer than the rest.

Fix:

  • Drop or async-run non-critical calls.
  • Use caching for slow, rarely changing data.
  • Prioritize calls that unblock the response first.
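
A sketch of keeping only blocking calls on the critical path with asyncio; the fetch_* coroutines, the background task, and render are hypothetical.

import asyncio

async def build_checkout_page(user_id):
    # Only the calls needed to render the response block it
    profile, cart = await asyncio.gather(
        fetch_profile(user_id),
        fetch_cart(user_id),
    )
    # Optional work runs in the background and never extends the critical path
    asyncio.create_task(record_recommendation_hint(user_id))
    return render(profile, cart)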

5. Retry Storms

Transient failures can trigger retries across multiple services. Without limits and proper backoff, retries stack up, saturating the failing service and turning a minor blip into a major outage.

In traces:

  • Multiple identical spans to the same dependency.
  • Retry intervals match the configured backoff (or have no delay).

Fix:

  • Limit retry count.
  • Use exponential backoff with jitter.
  • Ensure downstream services are idempotent.
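
A rough sketch of bounded retries with exponential backoff and full jitter; call_payment_gateway and TransientError stand in for your client and its retryable error type.

import random
import time

class TransientError(Exception):
    """Hypothetical error raised by the client for retryable failures."""

def call_with_retries(order_id, max_attempts=3, base_delay=0.2):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_payment_gateway(order_id)  # must be idempotent
        except TransientError:
            if attempt == max_attempts:
                raise  # give up instead of hammering a struggling dependency
            # Full jitter: sleep a random amount up to the exponential cap
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
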
💡
Last9 offers full monitoring with built-in alerting and notifications, purpose-built for high-cardinality environments. It’s designed to cut alert fatigue and shorten Mean Time to Detect.

Alert Strategies and Diagnostics for Mobile Applications

1. APM Alerts for Mobile Applications

Mobile-side telemetry complements backend monitoring by revealing client-specific performance degradation that backend metrics can’t see. These are often tied to:

  • Device/OS fragmentation (Android 13 vs iOS 17).
  • Region-specific network routes.
  • Hardware resource limits (RAM, battery).

For SREs, this means an error budget can be burned without any backend alerts firing.
Example: A checkout API call may succeed in 200ms from your test clients, but take >5s for users in Southeast Asia due to cellular network jitter.

Operational use:

  • Set cohort-specific alert thresholds (e.g., crash rate > 2% for Android 13 on Samsung S21).
  • Use segment-based telemetry to drive targeted rollbacks, feature flag changes, or regional routing tweaks.

2. Structure Dashboards for Actionability

Dashboards should mirror critical user workflows, not just infra layers.
For example:

  • Tier 1: High-level health — aggregate error rate, p95_latency, throughput per flow (login, checkout, search).
  • Tier 2: Dependency view — which backend or third-party calls each flow depends on, with their current latency/error rate.
  • Tier 3: Deep service metrics — DB query performance, thread pool utilization, cache hit/miss ratio.

Operational use:

  • Tie alert rules to Tier 1 KPIs first — e.g., “checkout success rate < 98% for 5m” → triggers incident.
  • Use Tier 2+3 during triage to isolate the root cause without pivoting across multiple tools.

3. APM Testing and Diagnostics Across Environments

For mobile-dependent systems, keeping APM active across dev, staging, and prod gives earlier warning signals:

  • Development: Instrument new flows early. Catch scaling problems before merge (e.g., query time growth from 100ms to 1.2s with real-world dataset size).
  • Staging: Run synthetic transactions simulating mobile conditions (slow networks, low-end devices).
  • Production: Continuously execute synthetic transactions from multiple geos; alert if they breach baseline.

Operational use:

  • Treat synthetic checks as “canaries” for high-value flows.
  • Use traces to pinpoint which span in the flow first degrades — often faster than waiting for user error reports.
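
A bare-bones synthetic canary might look like this; the endpoint, payload, and 800ms budget are made-up values, and in practice you'd run it on a schedule from several regions.

import time
import requests

def checkout_canary():
    start = time.monotonic()
    resp = requests.post(
        "https://api.example.com/checkout",  # hypothetical endpoint
        json={"sku": "CANARY-TEST", "qty": 1},
        timeout=10,
    )
    elapsed_ms = (time.monotonic() - start) * 1000
    assert resp.status_code == 200, f"checkout returned {resp.status_code}"
    assert elapsed_ms < 800, f"checkout took {elapsed_ms:.0f}ms, budget is 800ms"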

APM Tool Selection and Platform Integration

1. Evaluate APM Solutions for Complex Environments

APM tools differ widely in their strengths. Choosing the right one means matching capabilities to your operational reality, not just ticking feature boxes.

Key evaluation dimensions:

  • Native integration with your tech stack
    The tool should have official or well-maintained support for your runtime, frameworks, and deployment model. For example:
    • Node.js + Express with OpenTelemetry auto-instrumentation.
    • Python + FastAPI with async context propagation.
    • Kubernetes clusters with sidecar or DaemonSet deployments.
  • High-cardinality telemetry handling
    Many APM platforms collapse or sample away high-cardinality data to save on storage costs. This removes the ability to debug specific user sessions, request IDs, or device-level issues. The right tool should:
    • Ingest millions of unique label combinations without performance degradation.
    • Store granular traces while still offering fast queries.
    • Allow label-based filtering without pre-aggregation.
  • Seamless integration into existing workflows
    Look for:
    • Direct OTLP export from your services.
    • API-driven alerting rules that post to Slack, Teams, or PagerDuty.
    • Terraform or CLI configuration for infrastructure-as-code alignment.
  • Transparent cost model
    Understand the pricing impact of:
    • Ingestion rates (e.g., GB/day).
    • Retention windows for raw traces vs. aggregated metrics.
    • Query execution costs on large datasets.
      A tool that seems cost-effective at low volume can become prohibitively expensive at scale.

2. Embed APM Into the Development Lifecycle

An APM platform delivers the most value when it’s wired directly into the development and deployment process. This means using APM data before, during, and after code hits production.

Practical patterns:

  • Alert routing with context
    Push anomaly alerts (e.g., p95 latency above SLA, error rate spike) directly to your incident channel. Include a trace link and relevant metadata (service.name, deployment.version) so the on-call engineer can start debugging without additional queries.
  • Deployment-aware dashboards
    During rollouts, automatically surface dashboards filtered by the current deployment.version. This makes it easy to spot regressions tied to a specific release.
  • Data-driven capacity planning
    Instead of adding CPU when latency increases, analyze traces to see if the cause is:
    • Slow database queries (indexing or schema changes needed).
    • Excessive synchronous dependencies.
    • Cache eviction patterns.

  • Performance budgets in CI/CD
    Fail a build if critical performance indicators regress. Example:

if [ "$(apm-cli get-latency p95 --compare-to=baseline)" -gt 100 ]; then
    exit 1
fi

This prevents slow code from ever reaching production.

3. Handling Cloud-Native and Ephemeral Workloads

Cloud-native environments make certain traditional APM approaches ineffective.

  • Short-lived containers and serverless functions
    Instances may exist for seconds, so agent-based approaches that rely on long-lived processes often miss data. Use collector-based ingestion (e.g., OpenTelemetry Collector) to capture spans and metrics from workloads before they terminate.
  • Dynamic scaling
    A service may scale from 1 to 100 pods within minutes. Your APM must:
    • Maintain context for traces across changing instance IDs.
    • Aggregate and query by logical service name, not host identity.
  • Service mesh integration
    Meshes like Istio or Linkerd can automatically propagate trace context for inter-service calls. This provides out-of-the-box spans for service-to-service traffic, but also introduces latency and resource overhead. Traces should include mesh-related spans so you can observe and tune their performance impact.
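
For the dynamic-scaling point above, the usual approach with OpenTelemetry is to pin a stable logical identity on the tracer provider via resource attributes; the service name and version shown here are examples.

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Traces aggregate by service.name, not by pod name or instance ID
resource = Resource.create({
    "service.name": "checkout",
    "service.version": "2025.09.03",
})
trace.set_tracer_provider(TracerProvider(resource=resource))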

Get Started with APM Tracing

Begin with a single critical user journey—/signup, /checkout, or a high-traffic API endpoint. Instrument that path first so you can validate trace data, verify context propagation, and understand span relationships before extending tracing across all services.

Traces become far more useful when connected to your other telemetry. The ability to jump from a trace to the related metrics spike, and then to the exact log entries for that request, is what reduces debugging time.

With Last9, telemetry correlation is built in — no stitching, no sampling gaps:

  • Unified telemetry: Metrics, logs, and traces linked by the same trace_id, even with millions of unique labels.
  • Full fidelity: No enforced sampling — capture every trace, even at high cardinality.
  • OTLP-native ingestion: Send data directly from OpenTelemetry SDKs or Collectors.
  • Attribute-based filtering: Query by deployment.version, service.name, or any custom tag to isolate regressions fast.
  • Discover Services: a real-time view of throughput, latency, errors, and dependencies for every service in your stack.

Once you see how quickly traces pinpoint the root cause, whether it’s a slow SQL query, a failing upstream API, or resource contention, you’ll want broader coverage.

Get started for free today, or if you'd like to discuss how Last9 would fit into your current setup, book some time with us!

FAQs

What does APM stand for?
Application Performance Monitoring. It's the practice of monitoring software applications to detect performance issues, track user experience, and maintain system reliability.

What is an APM track?
An APM track is a trace—the complete path a request takes through your system. It shows every service call, database query, and external API request that happens to process one user action.

What is an APM alert on Android?
APM alerts on Android notify you about app performance issues like crashes, slow load times, or high memory usage. Tools like Firebase Performance Monitoring send these alerts when your mobile app crosses performance thresholds.

What is the meaning of APM testing?
APM testing involves monitoring application performance in real-time during development and production. It includes load testing, stress testing, and continuous monitoring to catch performance issues before users do.

What are the benefits of APM?
Faster incident resolution, better user experience, proactive issue detection, and data-driven performance optimization. You spot problems before they affect users and can fix the root cause instead of the symptoms.

Why is distributed tracing important?
Modern applications span multiple services, making it impossible to understand request flow from metrics alone. Distributed tracing shows you exactly how requests move through your system and where they get stuck.

What metrics does application performance monitoring track?
Response time, throughput, error rates, resource utilization (CPU, memory, disk), database query performance, external API latency, and user experience metrics like page load time.

What are some common dashboards?
Service health dashboards showing error rates and latency, infrastructure dashboards with CPU and memory usage, user experience dashboards with page load times, and business metric dashboards tracking key performance indicators.

What open source tools do you use to parse LevelDB files?
Tools like leveldb-cli, the ldb command-line utility, or programming libraries like leveldb for Python and level for Node.js. Most teams use these for debugging or data migration tasks.

How does APM tracing help in diagnosing application performance issues?
Tracing shows you the exact sequence of operations for slow requests. Instead of guessing which service is slow, you see the complete timeline and can identify the specific database query or API call causing delays.

How does APM tracing help in identifying performance bottlenecks?
Traces reveal which operations consume the most time in your request flow. You can spot patterns like N+1 database queries, slow external API calls, or services that don't timeout appropriately.

How does APM tracing improve application performance?
By giving you precise data about where time is spent, tracing helps you focus optimization efforts on the right places. You can fix the 200ms database query instead of optimizing code that only takes 5ms.

How does APM tracing help in identifying bottlenecks in applications?
Tracing shows request flow patterns that reveal architectural issues. You might discover that your "fast" service is actually making 10 downstream calls that add up, or that certain code paths create dependency loops.
