Monitoring GraphQL applications presents unique challenges due to their complex execution model. The nested structure of queries, parallel resolver execution, and variable response payloads create observability gaps that traditional monitoring approaches struggle to address.
OpenTelemetry offers a robust solution for instrumenting GraphQL services, giving you visibility into the full request lifecycle. This guide walks through practical steps for implementing, optimizing, and troubleshooting OpenTelemetry in GraphQL environments, helping you build a comprehensive observability strategy for your API layer.
What Makes GraphQL Monitoring Different?
GraphQL isn't your standard REST API, and that makes all the difference when it comes to monitoring. With GraphQL, a single request can trigger dozens of resolvers, touch multiple data sources, and return wildly different response sizes based on the query.
Traditional API monitoring tools often miss the mark because:
- They track requests at the endpoint level, not the resolver level
- They can't show how nested fields impact performance
- They struggle to correlate resolver execution with database queries
- They lack visibility into the GraphQL parsing and validation phases
This is why pairing GraphQL with OpenTelemetry makes so much sense: you get granular insight into every step of query execution.
Setting Up OpenTelemetry in Your GraphQL Server
Getting OpenTelemetry running with your GraphQL server isn't as complicated as it might seem. Here's how to set it up step by step.
Installing OpenTelemetry Packages for GraphQL Integration
For a Node.js environment with Apollo Server, you'll need these packages:
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-proto @opentelemetry/instrumentation-graphql
For a Java environment with GraphQL Java:
implementation 'io.opentelemetry:opentelemetry-api:1.28.0'
implementation 'io.opentelemetry:opentelemetry-sdk:1.28.0'
implementation 'io.opentelemetry:opentelemetry-exporter-otlp:1.28.0'
implementation 'io.opentelemetry.instrumentation:opentelemetry-graphql-java-12.0:1.23.0-alpha'
Configuring Basic OpenTelemetry SDK for GraphQL Server
Here's a minimal setup for Node.js that will get you started:
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // URL of your collector
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations(),
    new GraphQLInstrumentation({
      // Merge list item resolver spans into a single span to keep traces compact
      mergeItems: true,
      // Include query variables and argument values in spans (mind sensitive data)
      allowValues: true,
    }),
  ],
});

sdk.start();
Make sure to import this file at the very beginning of your application:
// index.js
require('./tracing'); // Must be first import
const { ApolloServer } = require('apollo-server');
// Rest of your server setup
Capturing Meaningful GraphQL Telemetry
Setting up basic instrumentation is just the start. To get real value from OpenTelemetry and GraphQL, you need to capture the right signals.
Creating Custom Spans for GraphQL Resolver Performance Tracking
Resolvers are the heart of GraphQL performance. Here's how to instrument them properly:
const resolvers = {
  Query: {
    users: async (parent, args, context, info) => {
      // Create a custom span for this resolver
      const span = context.tracer.startSpan('users.resolver');

      // Add useful attributes
      span.setAttribute('graphql.args.limit', args.limit);
      span.setAttribute('graphql.args.offset', args.offset);

      try {
        // Your existing resolver logic
        const users = await getUsersFromDatabase(args);

        // Record result metadata
        span.setAttribute('result.count', users.length);
        return users;
      } catch (error) {
        // Record errors
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    },
  },
};
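The resolver above reads a tracer from the GraphQL context, which you have to wire up yourself. A minimal sketch of how that might look with Apollo Server (the tracer name graphql-service is illustrative):

// Attach a tracer to every request's context so resolvers and plugins can use it
const { trace } = require('@opentelemetry/api');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  context: () => ({
    tracer: trace.getTracer('graphql-service'),
  }),
});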
Monitoring GraphQL Query Parsing and Validation Phases
The parsing and validation phases can be performance bottlenecks, too. Here's how to track them with Apollo Server:
const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    {
      async requestDidStart(requestContext) {
        const { request, context } = requestContext;
        const span = context.tracer.startSpan('graphql.request');
        // Raw query strings are high-cardinality; consider normalizing them (see below)
        span.setAttribute('graphql.query', request.query);
        context.requestSpan = span;

        return {
          async parsingDidStart() {
            const parseSpan = context.tracer.startSpan('graphql.parse');
            return () => {
              parseSpan.end();
            };
          },
          async validationDidStart() {
            const validationSpan = context.tracer.startSpan('graphql.validate');
            return () => {
              validationSpan.end();
            };
          },
          async executionDidStart() {
            const executionSpan = context.tracer.startSpan('graphql.execute');
            return {
              async executionDidEnd() {
                executionSpan.end();
              },
            };
          },
          async didEncounterErrors(ctx) {
            // ctx.errors holds the GraphQL errors; record each one on the request span
            for (const error of ctx.errors) {
              context.requestSpan.recordException(error);
            }
          },
          async willSendResponse() {
            context.requestSpan.end();
          },
        };
      },
    },
  ],
});
Common OpenTelemetry GraphQL Problems and Solutions
Even with a solid setup, you're likely to run into issues. Here are solutions to common problems.
Preventing High Cardinality Issues in GraphQL Query Tracing
GraphQL queries can be virtually infinite in their variations, which can cause a cardinality explosion in your monitoring system.
Solution: Filter and normalize queries before recording them as span attributes:
function normalizeQuery(query) {
  // Replace literal values with placeholders
  return query
    .replace(/"[^"]*"/g, '"?"')
    .replace(/\d+/g, '?');
}

// Usage
span.setAttribute('graphql.query.normalized', normalizeQuery(query));
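If your clients send named operations, the operation name is another low-cardinality label worth recording alongside (or instead of) the normalized query. A one-line sketch, assuming you are inside an Apollo Server plugin hook where requestContext.operationName has been resolved (the attribute key is just a suggestion):

// Operation names stay low-cardinality and make traces easy to group
span.setAttribute('graphql.operation.name', requestContext.operationName || 'anonymous');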
Implementing Distributed Tracing in Federated GraphQL Architectures
If you're using Apollo Federation or other GraphQL federation approaches, tracing across services gets complex.
Solution: Ensure proper context propagation:
// In your gateway service
const gateway = new ApolloGateway({
serviceList: [
/* your services */
],
buildService({ name, url }) {
return new RemoteGraphQLDataSource({
url,
willSendRequest({ request, context }) {
// Extract and forward the trace context
const currentSpanContext = context.activeSpan?.spanContext();
if (currentSpanContext) {
const traceParent = `00-${currentSpanContext.traceId}-${currentSpanContext.spanId}-0${currentSpanContext.traceFlags.toString(16)}`;
request.http.headers.set('traceparent', traceParent);
}
}
});
}
});
Connecting GraphQL Resolver Spans with Database Operation Telemetry
Seeing GraphQL resolver times without a database query context isn't very helpful.
Solution: Link database spans with resolver spans:
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('graphql-service');

async function getUsersFromDatabase(args) {
  // startActiveSpan parents this span under whatever span is currently active (e.g., the resolver span)
  return tracer.startActiveSpan('db.query.users', async (span) => {
    span.setAttribute('db.statement', 'SELECT * FROM users LIMIT ? OFFSET ?');
    span.setAttribute('db.parameters', JSON.stringify([args.limit, args.offset]));
    try {
      return await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [args.limit, args.offset]);
    } catch (error) {
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
Analyzing OpenTelemetry GraphQL Data Effectively
Collecting data is only useful if you can make sense of it. Here's how to analyze your OpenTelemetry GraphQL data.
Essential GraphQL Performance Metrics for Operational Monitoring
| Metric | Description | Why It Matters |
|---|---|---|
| Resolver Duration | Time taken by each resolver | Identifies slow resolvers |
| Parse/Validate Time | Time spent in parsing and validation | Can indicate complex queries |
| N+1 Query Count | Number of duplicate database queries | Common GraphQL performance issue |
| Resolver Error Rate | Percentage of resolvers that throw errors | Shows reliability issues |
| Query Complexity | Calculated complexity score of queries | Helps identify abuse or optimization opportunities |
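If you would rather emit some of these directly as metrics instead of deriving them from trace data, a minimal sketch using the OpenTelemetry metrics API might look like this. It assumes a MeterProvider is configured in your SDK setup; the meter name, metric name, and attribute keys are illustrative:

const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('graphql-service');

// Histogram of per-resolver execution time
const resolverDuration = meter.createHistogram('graphql.resolver.duration', {
  unit: 'ms',
  description: 'Time taken by each resolver',
});

// Example: time a resolver's work and record it with a field label
async function timedUsersResolver(parent, args, context, info) {
  const startTime = Date.now();
  const users = await getUsersFromDatabase(args);
  resolverDuration.record(Date.now() - startTime, { 'graphql.field': 'Query.users' });
  return users;
}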
Designing GraphQL-Specific Observability Dashboards
A good GraphQL + OpenTelemetry dashboard should include:
- Top-level query response times
- Resolver timings by field
- Error rates by resolver
- Database query correlation
- Cache hit/miss rates
- Parsing and validation times
Integrating with Last9 for Advanced Observability
If you're looking for a budget-friendly managed observability solution that doesn’t compromise on features, Last9 pairs perfectly with your OpenTelemetry GraphQL setup.
We specialize in handling high-cardinality data — just like the data GraphQL generates — without the cost penalties you'd face with other vendors. Our pricing is based on event ingestion, so your costs stay predictable even as your GraphQL API usage grows.
Last9 integrates smoothly with your OpenTelemetry data and provides:
- Correlation between GraphQL operations and the underlying infrastructure
- Pre-built dashboards tailored for GraphQL workloads
- Smart alerting that understands GraphQL context
- A unified view across metrics, logs, and traces
Teams like Clevertap, Probo, and others trust Last9 for their OpenTelemetry needs, especially for how well we handle the high-cardinality nature of GraphQL telemetry data.

Best Practices for OpenTelemetry in GraphQL Production Environments
Moving to production requires some additional considerations:
Implementing Efficient Sampling for High-Volume GraphQL APIs
You likely don't need to trace every single GraphQL operation. Implement a smart sampling strategy:
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-node');

// Sample 10% of new (root) traces by default
const rootSampler = new TraceIdRatioBasedSampler(0.1);

// Respect the sampling decision of an incoming parent span; fall back to the ratio sampler for root spans
const sampler = new ParentBasedSampler({
  root: rootSampler,
});

// Add to your SDK config
const sdk = new NodeSDK({
  sampler,
  // other config...
});
Optimizing OpenTelemetry Resource Consumption in GraphQL Services
OpenTelemetry adds overhead. Manage it with these tips:
- Be selective about which resolvers you instrument manually
- Use attribute limits to prevent memory bloat
- Consider batching span exports in high-throughput environments (both of these are sketched below)
- Implement circuit breakers to disable tracing if the system is under heavy load
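Here is a minimal sketch of the attribute-limit and batching tips with the Node SDK; the limit values are illustrative, and the exact configuration surface can vary between SDK versions:

// Rough sketch: cap attribute counts/lengths and batch span exports
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');

const sdk = new NodeSDK({
  // Keep individual spans from growing unbounded
  spanLimits: {
    attributeCountLimit: 64,
    attributeValueLengthLimit: 512,
  },
  // Export spans in batches instead of one network call per span
  spanProcessor: new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })
  ),
});

sdk.start();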
Protecting Sensitive Data in GraphQL Telemetry Collection
GraphQL queries often contain sensitive data. Protect it:
- Always redact authentication tokens from headers
- Filter out sensitive fields from query variables (see the sketch after this list)
- Hash user identifiers before recording them as span attributes
- Consider field-level policies for what can be recorded in traces
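As a starting point, a small redaction helper might look like the sketch below; the denylist and the graphql.variables attribute key are illustrative, so adapt them to your schema and naming conventions:

// Replace values of known-sensitive variable names before recording them on a span
const SENSITIVE_KEYS = ['password', 'token', 'ssn', 'creditcard'];

function redactVariables(variables = {}) {
  return Object.fromEntries(
    Object.entries(variables).map(([key, value]) =>
      SENSITIVE_KEYS.includes(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, value]
    )
  );
}

// Usage: record redacted variables instead of the raw ones
span.setAttribute('graphql.variables', JSON.stringify(redactVariables(request.variables)));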
Wrapping Up
Setting up OpenTelemetry with GraphQL gives you x-ray vision into your API's performance and behavior. Remember that the goal isn't just to collect data – it's to make your GraphQL API more reliable, performant, and maintainable. Let the telemetry guide your optimization efforts and architecture decisions.
FAQs
How much overhead does OpenTelemetry add to my GraphQL server?
When properly configured, OpenTelemetry typically adds 3-5% overhead in terms of latency and CPU usage. You can reduce this by implementing sampling (tracing only a percentage of requests) or by selectively instrumenting only critical paths.
Can OpenTelemetry help identify N+1 query problems in GraphQL?
Yes! This is one of the biggest benefits. By correlating database spans with resolver spans, you can easily spot when a resolver is triggering multiple similar database queries that could be batched. Tools like DataLoader become much easier to implement effectively when you can see the N+1 problems.
How do I handle sensitive data in GraphQL queries when using OpenTelemetry?
Implement a sanitization layer that processes GraphQL queries and variables before they're attached to spans. You can write middleware that redacts sensitive fields (like passwords or personal information) before they're recorded in your telemetry data.
Is OpenTelemetry suitable for both monolithic and federated GraphQL architectures?
Absolutely. For monoliths, the setup is simpler but still valuable. For federated architectures, OpenTelemetry shines as it can trace requests across service boundaries, giving you end-to-end visibility that's otherwise very difficult to achieve.
How does batching affect OpenTelemetry tracing in GraphQL?
When using batching techniques like DataLoader, you'll want to ensure your custom spans correctly represent the batched nature of operations. This usually means creating spans for both individual resolver calls and the batched data loading operations, with proper parent-child relationships between them.
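For example, a batched loader instrumented this way might look roughly like the sketch below; getUsersByIds is a hypothetical batched fetch, and the span and attribute names are illustrative:

const DataLoader = require('dataloader');
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('graphql-service');

// One span per batched load; it becomes a child of whichever span is active when the batch runs
const userLoader = new DataLoader(async (ids) => {
  return tracer.startActiveSpan('dataloader.users.batch', async (span) => {
    span.setAttribute('dataloader.batch.size', ids.length);
    try {
      return await getUsersByIds(ids); // hypothetical batched fetch keyed by id
    } finally {
      span.end();
    }
  });
});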