APM tracing records the complete execution path of a request as it travels through your system, including database queries, external API calls, cache lookups, message queue events, and inter-service requests. Each step is captured with precise start and end timestamps, duration, and context such as service name, operation name, and relevant attributes.
This lets you pinpoint where latency or errors originate without piecing together metrics and logs manually. Metrics indicate that something is wrong, logs provide local detail, and traces connect them, showing how a request moved across services and how a delay or failure propagated.
What APM Means for Your Applications
Application Performance Monitoring (APM) focuses on measuring and analyzing how your application behaves under real user traffic, identifying performance regressions before they affect the user experience.
In modern architectures—comprising microservices, containers, serverless functions, databases, caches, message brokers, and third-party APIs—failures can emerge in any layer. Traditional infrastructure monitoring may raise an alert when a service degrades, but APM provides deeper insight by correlating signals across the stack to isolate the root cause.
The Role of APM in Observability
APM brings together three primary telemetry types:
- Metrics – Time-series measurements such as latency percentiles (p50, p95, p99), error_rate, throughput, and resource utilization (CPU, memory, disk I/O).
- Logs – Structured or unstructured event records with contextual fields (request_id, user_id, stack_trace).
- Traces – Distributed transaction records mapping the full execution path of a request across services, with span-level timing, attributes, and causal relationships.
By correlating these, APM creates a complete performance profile. For example, while infrastructure monitoring might show sustained CPU usage above 85%, APM can attribute it to a specific API endpoint executing inefficient SQL queries such as:
SELECT * FROM orders WHERE status != 'delivered';
Beyond raw measurements, APM ties performance data to user experience and business outcomes. For instance, it can connect increased checkout latency to a drop in conversion rate. This context enables you to move from reactive firefighting to proactive optimization.
Why Tracing Matters in Distributed Systems
In monolithic architectures, a single request might hit one database or maybe a couple of external HTTP APIs. Troubleshooting is straightforward: fewer hops mean fewer unknowns.
Modern distributed applications are different. A single user request can span:
- Multiple microservices, each with its own codebase and deployment pipeline.
- Message queues and streaming pipelines (Kafka, RabbitMQ, Kinesis).
- In-memory caches (Redis, Memcached) for faster lookups.
- External API calls to payment gateways, search services, or fraud detection systems.
- Multiple databases with different query patterns, isolation levels, and response characteristics.
This complexity makes isolating latency or failures much harder.
- Metrics tell you something is wrong — e.g., an error_rate spike or p95_latency doubling.
- Logs show what happened inside each service.
- Traces stitch these signals together into a correlated, end-to-end view, showing where and why a request slowed or failed.
Without tracing, you’re left correlating timestamps from multiple log files and guessing where the slowdown started. With tracing, you see the full path—spans in sequence, with precise timing and causal relationships.
Example: Checkout Request Trace
User Service (23ms) → Auth Service (12ms) → Payment Service (156ms) → Email Service (45ms)
                                                     ↓
                                              Database (134ms)
At first glance, Payment Service appears slow at 156ms. Tracing shows 134ms of that was a database query. The bottleneck isn’t the service—it’s the query execution or indexing strategy.
What You Get from Traces
1. Request Flow with Timing
A trace is composed of spans. A span represents a single operation—an HTTP call, a database query, a message queue consumption, or a function execution. Each span contains:
- start_time / end_time — precise duration measurement.
- parent_span_id — showing triggering relationships.
- Attributes/tags — e.g., db.system="postgresql", http.method="POST", net.peer.name="payment-service".
Tracing shows:
- Total request duration.
- Sequential vs. parallel operations.
- Idle time or I/O wait points.
Example insight: An API call takes 200ms total. Tracing reveals 150ms was waiting on a downstream service with database lock contention.
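To make the span model concrete, here is a minimal sketch of manually creating a span and attaching the kinds of attributes listed above, using the OpenTelemetry Python API. The operation name, attribute values, and the run_query helper are illustrative, and a tracer provider is assumed to be configured as shown later in this guide.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def fetch_order(order_id):
    # One span per operation; start/end timestamps and the link to the
    # enclosing parent span are recorded automatically.
    with tracer.start_as_current_span("orders-db.select") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = %s")
        span.set_attribute("net.peer.name", "orders-db")
        return run_query(order_id)  # hypothetical query helper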
2. Error Context
Tracing surfaces the exact span where an error occurred.
Instead of:
Payment Service returned HTTP 500
You get:
Payment Service (5ms) → Fraud Detection Service (40ms) → Risk Database (timeout after 30s)
Error spans often include:
- Exception messages
- Stack traces
- HTTP status codes (500, 504)
- Database error codes (deadlock_detected, connection_timeout)
This reduces time spent jumping across log streams to piece together failures.
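If you instrument code by hand, the OpenTelemetry Python API lets you attach this error context to the failing span yourself. A minimal sketch, where the risk-database call and its timeout are hypothetical:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def check_risk(order_id):
    with tracer.start_as_current_span("risk-db.query") as span:
        try:
            return query_risk_database(order_id)  # hypothetical call that may time out
        except TimeoutError as exc:
            # Record the exception message and stack trace on the span,
            # and mark the span as errored so backends surface it.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, "risk database timeout"))
            raise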
3. Service Dependency Mapping
Traces expose the actual service dependency graph—not just the one in your architecture diagrams.
Example:
- You expect Service A → Service B.
- Trace shows Service A → Service C → Service A (circular dependency).
Deep call stacks (6+ services in one request) can signal the need for architectural simplification, caching, or asynchronous workflows.
In APM, distributed tracing is the connective tissue between metrics and logs. Without it, you’re working with disconnected data points. With it, you see the sequence, timing, and context of every operation in a request’s lifecycle—making root cause analysis faster and far more precise.
Configure Tracing Without Breaking Your System
1. Set a Sampling Strategy
In most production setups, tracing every request is impractical—storage grows fast, query performance degrades, and costs spike. To keep systems stable, teams often adopt sampling strategies such as:
- 100% of error traces — full detail for failure analysis
- 10% of slow requests — those exceeding your SLA threshold
- 1% of normal requests — for baseline performance tracking
Sampling can be applied:
- At instrumentation — The SDK decides which spans to record.
- In the OpenTelemetry Collector — The collector receives all spans but selectively exports only a subset to the backend.
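If you do apply sampling at instrumentation time, the OpenTelemetry Python SDK ships a ratio-based sampler; a minimal sketch, with the 10% ratio chosen purely as an example:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces; child spans follow the parent's sampling decision
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))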
However, sampling has trade-offs. Once a trace is dropped, you can’t reconstruct it later, which can lead to blind spots in debugging.
With Last9, you don’t need to sample at all. Our architecture is built for full-fidelity tracing, even at very high cardinality, without the usual storage explosion. This means:
- You capture every trace in production.
- You can retroactively slice and filter data without worrying about what was discarded.
- You still get predictable, cost-efficient storage and query performance.
If your current APM forces you into sampling, you’re making trade-offs that aren’t necessary with Last9.

2. Focus on High-Value Operations
Even with full-fidelity traces, you want meaningful instrumentation. Target operations that impact performance or cross service boundaries:
- Inbound and outbound HTTP calls
- SQL and NoSQL queries
- Cache operations (Redis, Memcached)
- Message queue publish/consume (Kafka, RabbitMQ, SQS)
- External API calls (payment gateways, search, fraud detection)
Most modern frameworks support auto-instrumentation for these.
Example: Python OpenTelemetry setup
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a tracer provider and export spans in batches over OTLP/gRPC
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        validate_order(order_id)
        charge_payment(order_id)
        span.set_attribute("order.status", "completed")
Start with high-traffic or business-critical flows like checkout, authentication, and data writes. Once trace quality is validated in production, expand coverage to more granular spans.
3. Maintain Trace Context Across Services
Trace continuity relies on propagating trace_id and span_id between services. For synchronous HTTP calls, this is typically done with the traceparent header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Once configured, most HTTP frameworks automatically forward this header.
For asynchronous workloads (message queues, background jobs, streaming systems), you’ll need to:
- Include trace context in message metadata or payload.
- Extract it in the consumer to continue the same trace.
This ensures a single request’s timeline remains connected, even when crossing service or protocol boundaries.
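With the OpenTelemetry Python API, one way to do this is to inject the active context into the message headers on the producer side and extract it in the consumer. A minimal sketch; the queue client and handler are hypothetical:
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def publish_order_event(queue, payload):
    headers = {}
    inject(headers)  # writes traceparent (and any baggage) into the dict
    queue.publish(payload, headers=headers)  # hypothetical queue client

def consume_order_event(message):
    ctx = extract(message.headers)  # rebuild the producer's context
    # The consumer span joins the same trace as the producer's span
    with tracer.start_as_current_span("process_order_event", context=ctx):
        handle_event(message)  # hypothetical handler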
4. Link Traces, Logs, and Metrics
The biggest debugging gains come from correlating telemetry types using a shared trace_id:
- From a metric spike, pull related traces to see which requests caused it.
- From a trace, jump to the logs for that specific request.
Example: Logging with trace context in Python
import logging
from opentelemetry import trace

# Attach a handler whose format string prints the trace_id field passed via `extra`
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('%(asctime)s - trace_id=%(trace_id)s - %(message)s'))

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def process_request(user_id):
    # Read the trace_id from the active span and attach it to the log record
    span_context = trace.get_current_span().get_span_context()
    logger.info(
        f"Processing request for user {user_id}",
        extra={"trace_id": format(span_context.trace_id, "032x")},
    )
With Last9, traces, logs, and metrics are automatically correlated, even at 20M+ cardinality per metric, so you get the full picture without custom linking logic. This means faster root cause analysis, complete data fidelity with no sampling, and predictable costs at scale.
Common Trace Patterns to Look for in Production
1. N+1 Query Pattern in Database Access
The N+1 query issue happens when a service fetches related data with multiple queries instead of one optimized call.
Example: Fetching 20 user profiles → 1 query for IDs + 20 separate queries for details = 21 total queries. Individually fast queries add up, saturating DB connections, wasting CPU, and increasing I/O. Under load, this can cause connection pool exhaustion and backlog growth.
In traces:
- Repeated spans for the same db.statement.
- Sequential execution with consistent per-span latency.
Fix:
- Batch related queries.
- Use JOIN or IN clauses.
- Enable ORM features like eager loading (select_related in Django, include in Sequelize) — see the sketch after this list.
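As an illustration, here is the shape of that fix with Django's ORM; the Order model and its customer relation are hypothetical:
# N+1: one query for the orders, then one extra query per order for its customer
for order in Order.objects.all():
    print(order.customer.name)

# One JOIN: customer rows are fetched in the same query via select_related
for order in Order.objects.select_related("customer"):
    print(order.customer.name)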
2. Long Wait Spans from Silent Timeouts
Silent timeouts occur when a service waits excessively for a dependency, often using default timeouts like 30s. This ties up worker threads or processes, reducing throughput and hiding the root cause, especially if the downstream service isn’t failing outright but responding slowly.
In traces:
- One large-duration span with no child spans.
- No downstream activity during the wait.
Fix:
- Set realistic timeout values based on SLA (see the sketch after this list).
- Add circuit breakers (Resilience4j, Hystrix).
- Alert on spans exceeding p95/p99 thresholds.
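For example, with the requests library you can bound both the connection and read phases instead of inheriting a long default; the endpoint and values are placeholders to tune against your SLA:
import requests

try:
    # Fail fast: 2s to establish the connection, 5s to wait for a response
    resp = requests.get("http://fraud-detection/score", timeout=(2, 5))
    resp.raise_for_status()
except requests.Timeout:
    # Surface the timeout explicitly instead of silently holding a worker
    raise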
3. Cascading Latency in Dependency Chains
One slow service can ripple upstream in synchronous call chains.
Example: Database latency in Service C delays Service B, which delays Service A, ultimately hitting the user. Teams without tracing often misdiagnose the issue, wasting time fixing non-bottleneck services.
In traces:
- Sequential spans with growing delays at deeper layers.
- Total request time driven by the slowest dependency.
Fix:
- Add caching at upstream layers.
- Break deep synchronous chains into async flows.
- Reduce dependency depth to avoid unnecessary hops.
4. Fan-Out Bottlenecks
When a service makes multiple parallel calls, the slowest one dictates total latency.
If an “optional” dependency is coded synchronously, it becomes part of the critical path without being obvious.
In traces:
- Parallel child spans starting at the same time.
- One outlier span much longer than the rest.
Fix:
- Drop or async-run non-critical calls (see the sketch after this list).
- Use caching for slow, rarely changing data.
- Prioritize calls that unblock the response first.
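A sketch of taking an optional call off the critical path with asyncio; the downstream coroutines are hypothetical:
import asyncio

async def build_response(order_id):
    # Required calls run in parallel and gate the response
    price, stock = await asyncio.gather(
        fetch_price(order_id),
        fetch_stock(order_id),
    )
    # Optional call runs in the background; keep a reference if you need to await or cancel it
    asyncio.create_task(record_recommendation_click(order_id))
    return {"price": price, "stock": stock}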
5. Retry Storms
Transient failures can trigger retries across multiple services. Without limits and proper backoff, retries stack up, saturating the failing service and turning a minor blip into a major outage.
In traces:
- Multiple identical spans to the same dependency.
- Retry intervals match the configured backoff (or have no delay).
Fix:
- Limit retry count.
- Use exponential backoff with jitter (see the sketch after this list).
- Ensure downstream services are idempotent.
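A minimal sketch of capped retries with exponential backoff and full jitter; call_downstream, TransientError, and the limits are illustrative:
import random
import time

def call_with_retries(max_attempts=3, base_delay=0.2):
    for attempt in range(max_attempts):
        try:
            return call_downstream()  # hypothetical remote call
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter spreads retries out instead of letting them synchronize
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))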
Alert Strategies and Diagnostics for Mobile Applications
1. APM Alerts for Mobile Applications
Mobile-side telemetry complements backend monitoring by revealing client-specific performance degradation that backend metrics can’t see. These are often tied to:
- Device/OS fragmentation (Android 13 vs iOS 17).
- Region-specific network routes.
- Hardware resource limits (RAM, battery).
For SREs, this means an error budget can be burned without any backend alerts firing.
Example: A checkout API call may succeed in 200ms from your test clients, but take >5s for users in Southeast Asia due to cellular network jitter.
Operational use:
- Set cohort-specific alert thresholds (e.g., crash rate > 2% for Android 13 on Samsung S21).
- Use segment-based telemetry to drive targeted rollbacks, feature flag changes, or regional routing tweaks.
2. Structure Dashboards for Actionability
Dashboards should mirror critical user workflows, not just infra layers.
For example:
- Tier 1: High-level health — aggregate error rate, p95_latency, throughput per flow (login, checkout, search).
- Tier 2: Dependency view — which backend or third-party calls each flow depends on, with their current latency/error rate.
- Tier 3: Deep service metrics — DB query performance, thread pool utilization, cache hit/miss ratio.
Operational use:
- Tie alert rules to Tier 1 KPIs first — e.g., “checkout success rate < 98% for 5m” → triggers incident.
- Use Tier 2+3 during triage to isolate the root cause without pivoting across multiple tools.
3. APM Testing and Diagnostics Across Environments
For mobile-dependent systems, keeping APM active across dev, staging, and prod gives earlier warning signals:
- Development: Instrument new flows early. Catch scaling problems before merge (e.g., query time growth from 100ms to 1.2s with real-world dataset size).
- Staging: Run synthetic transactions simulating mobile conditions (slow networks, low-end devices).
- Production: Continuously execute synthetic transactions from multiple geos; alert if they breach baseline.
Operational use:
- Treat synthetic checks as “canaries” for high-value flows (a sample check follows this list).
- Use traces to pinpoint which span in the flow first degrades — often faster than waiting for user error reports.
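A synthetic check can be as simple as a scripted request with a latency budget, run on a schedule from each region; the endpoint and thresholds below are placeholders:
import time
import requests

def synthetic_checkout_check(budget_ms=800):
    start = time.monotonic()
    resp = requests.post("https://api.example.com/checkout/ping", timeout=5)
    elapsed_ms = (time.monotonic() - start) * 1000
    # Fail the canary when the flow errors or breaches its latency budget
    assert resp.status_code == 200, f"checkout ping returned {resp.status_code}"
    assert elapsed_ms < budget_ms, f"checkout ping took {elapsed_ms:.0f}ms"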
APM Tool Selection and Platform Integration
1. Evaluate APM Solutions for Complex Environments
APM tools differ widely in their strengths. Choosing the right one means matching capabilities to your operational reality, not just ticking feature boxes.
Key evaluation dimensions:
- Native integration with your tech stack
The tool should have official or well-maintained support for your runtime, frameworks, and deployment model. For example:
- Node.js + Express with OpenTelemetry auto-instrumentation.
- Python + FastAPI with async context propagation.
- Kubernetes clusters with sidecar or DaemonSet deployments.
- High-cardinality telemetry handling
Many APM platforms collapse or sample away high-cardinality data to save on storage costs. This removes the ability to debug specific user sessions, request IDs, or device-level issues. The right tool should:
- Ingest millions of unique label combinations without performance degradation.
- Store granular traces while still offering fast queries.
- Allow label-based filtering without pre-aggregation.
- Seamless integration into existing workflows
Look for:
- Direct OTLP export from your services.
- API-driven alerting rules that post to Slack, Teams, or PagerDuty.
- Terraform or CLI configuration for infrastructure-as-code alignment.
- Transparent cost model
Understand the pricing impact of:
- Ingestion rates (e.g., GB/day).
- Retention windows for raw traces vs. aggregated metrics.
- Query execution costs on large datasets.
A tool that seems cost-effective at low volume can become prohibitively expensive at scale.
2. Embed APM Into the Development Lifecycle
An APM platform delivers the most value when it’s wired directly into the development and deployment process. This means using APM data before, during, and after code hits production.
Practical patterns:
- Alert routing with context
Push anomaly alerts (e.g., p95 latency above SLA, error rate spike) directly to your incident channel. Include a trace link and relevant metadata (service.name, deployment.version) so the on-call engineer can start debugging without additional queries.
- Deployment-aware dashboards
During rollouts, automatically surface dashboards filtered by the current deployment.version. This makes it easy to spot regressions tied to a specific release.
- Data-driven capacity planning
Instead of adding CPU when latency increases, analyze traces to see if the cause is:
- Slow database queries (indexing or schema changes needed).
- Excessive synchronous dependencies.
- Cache eviction patterns.
- Performance budgets in CI/CD
Fail a build if critical performance indicators regress. Example:
if [ "$(apm-cli get-latency p95 --compare-to=baseline)" -gt 100 ]; then
exit 1
fi
This prevents slow code from ever reaching production.
3. Handling Cloud-Native and Ephemeral Workloads
Cloud-native environments make certain traditional APM approaches ineffective.
- Short-lived containers and serverless functions
Instances may exist for seconds, so agent-based approaches that rely on long-lived processes often miss data. Use collector-based ingestion (e.g., OpenTelemetry Collector) to capture spans and metrics from workloads before they terminate.
- Dynamic scaling
A service may scale from 1 to 100 pods within minutes. Your APM must:
- Maintain context for traces across changing instance IDs.
- Aggregate and query by logical service name, not host identity.
- Service mesh integration
Meshes like Istio or Linkerd can automatically propagate trace context for inter-service calls. This provides out-of-the-box spans for service-to-service traffic, but also introduces latency and resource overhead. Traces should include mesh-related spans so you can observe and tune their performance impact.
Get Started with APM Tracing
Begin with a single critical user journey—/signup, /checkout, or a high-traffic API endpoint. Instrument that path first so you can validate trace data, verify context propagation, and understand span relationships before extending tracing across all services.
Traces become far more useful when connected to your other telemetry. The ability to jump from a trace to the related metrics spike, and then to the exact log entries for that request, is what reduces debugging time.
With Last9, telemetry correlation is built in — no stitching, no sampling gaps:
- Unified telemetry: Metrics, logs, and traces linked by the same trace_id, even with millions of unique labels.
- Full fidelity: No enforced sampling — capture every trace, even at high cardinality.
- OTLP-native ingestion: Send data directly from OpenTelemetry SDKs or Collectors.
- Attribute-based filtering: Query by deployment.version, service.name, or any custom tag to isolate regressions fast.
- Discover Services gives you a real-time view of throughput, latency, errors, and dependencies for every service in your stack.
Once you see how quickly traces pinpoint the root cause, whether it’s a slow SQL query, a failing upstream API, or resource contention, you’ll want broader coverage.
Get started for free today, or if you'd like to discuss how Last9 would fit in your current setup, book some time with us!
FAQs
What does APM stand for?
Application Performance Monitoring. It's the practice of monitoring software applications to detect performance issues, track user experience, and maintain system reliability.
What is an APM track?
An APM track is a trace—the complete path a request takes through your system. It shows every service call, database query, and external API request that happens to process one user action.
What is an APM alert on Android?
APM alerts on Android notify you about app performance issues like crashes, slow load times, or high memory usage. Tools like Firebase Performance Monitoring send these alerts when your mobile app crosses performance thresholds.
What is the meaning of APM testing?
APM testing involves monitoring application performance in real-time during development and production. It includes load testing, stress testing, and continuous monitoring to catch performance issues before users do.
What are the benefits of APM?
Faster incident resolution, better user experience, proactive issue detection, and data-driven performance optimization. You spot problems before they affect users and can fix the root cause instead of the symptoms.
Why is distributed tracing important?
Modern applications span multiple services, making it impossible to understand request flow from metrics alone. Distributed tracing shows you exactly how requests move through your system and where they get stuck.
What metrics does application performance monitoring track?
Response time, throughput, error rates, resource utilization (CPU, memory, disk), database query performance, external API latency, and user experience metrics like page load time.
What are some common dashboards?
Service health dashboards showing error rates and latency, infrastructure dashboards with CPU and memory usage, user experience dashboards with page load times, and business metric dashboards tracking key performance indicators.
What open source tools do you use to parse levelDB files?
Tools like leveldb-cli, ldb command-line utility, or programming libraries like leveldb for Python and level for Node.js. Most teams use these for debugging or data migration tasks.
How does APM tracing help in diagnosing application performance issues?
Tracing shows you the exact sequence of operations for slow requests. Instead of guessing which service is slow, you see the complete timeline and can identify the specific database query or API call causing delays.
How does APM tracing help in identifying performance bottlenecks?
Traces reveal which operations consume the most time in your request flow. You can spot patterns like N+1 database queries, slow external API calls, or services that don't timeout appropriately.
How does APM tracing improve application performance?
By giving you precise data about where time is spent, tracing helps you focus optimization efforts on the right places. You can fix the 200ms database query instead of optimizing code that only takes 5ms.
How does APM tracing help in identifying bottlenecks in applications?
Tracing shows request flow patterns that reveal architectural issues. You might discover that your "fast" service is actually making 10 downstream calls that add up, or that certain code paths create dependency loops.