Monitoring GraphQL applications presents unique challenges due to their complex execution model. The nested structure of queries, parallel resolver execution, and variable response payloads create observability gaps that traditional monitoring approaches struggle to address.
OpenTelemetry offers a robust solution for instrumenting GraphQL services, giving you visibility into the full request lifecycle. This guide walks through practical steps for implementing, optimizing, and troubleshooting OpenTelemetry in GraphQL environments, helping you build a comprehensive observability strategy for your API layer.
What Makes GraphQL Monitoring Different?
GraphQL isn't your standard REST API, and that makes all the difference when it comes to monitoring. With GraphQL, a single request can trigger dozens of resolvers, touch multiple data sources, and return wildly different response sizes based on the query.
Traditional API monitoring tools often miss the mark because:
- They track requests at the endpoint level, not the resolver level
- They can't show how nested fields impact performance
- They struggle to correlate resolver execution with database queries
- They lack visibility into the GraphQL parsing and validation phases
This is why pairing GraphQL with OpenTelemetry makes so much sense: you get granular insight into every step of query execution.
Setting Up OpenTelemetry in Your GraphQL Server
Getting OpenTelemetry running with your GraphQL server isn't as complicated as it might seem. Here's how to set it up step by step.
Installing OpenTelemetry Packages for GraphQL Integration
For a Node.js environment with Apollo Server, you'll need these packages:
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-proto @opentelemetry/instrumentation-graphql
For a Java environment with GraphQL Java:
implementation 'io.opentelemetry:opentelemetry-api:1.28.0'
implementation 'io.opentelemetry:opentelemetry-sdk:1.28.0'
implementation 'io.opentelemetry:opentelemetry-exporter-otlp:1.28.0'
implementation 'io.opentelemetry.instrumentation:opentelemetry-graphql-java-12.0:1.23.0-alpha'
Configuring Basic OpenTelemetry SDK for GraphQL Server
Here's a minimal setup for Node.js that will get you started:
// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    // URL of your collector
    url: 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations(),
    new GraphQLInstrumentation({
      // Merge list item resolver spans into a single span to keep traces compact
      mergeItems: true,
      // Include query variables and argument values in spans (mind sensitive data)
      allowValues: true,
    }),
  ],
});

sdk.start();
Make sure to import this file at the very beginning of your application:
// index.js
require('./tracing'); // Must be first import
const { ApolloServer } = require('apollo-server');
// Rest of your server setup
Capturing Meaningful GraphQL Telemetry
Setting up basic instrumentation is just the start. To get real value from OpenTelemetry and GraphQL, you need to capture the right signals.
Creating Custom Spans for GraphQL Resolver Performance Tracking
Resolvers are the heart of GraphQL performance. Here's how to instrument them properly:
const resolvers = {
  Query: {
    users: async (parent, args, context, info) => {
      // Create a custom span for this resolver
      const span = context.tracer.startSpan('users.resolver');

      // Add useful attributes
      span.setAttribute('graphql.args.limit', args.limit);
      span.setAttribute('graphql.args.offset', args.offset);

      try {
        // Your existing resolver logic
        const users = await getUsersFromDatabase(args);

        // Record result metadata
        span.setAttribute('result.count', users.length);
        return users;
      } catch (error) {
        // Record errors
        span.recordException(error);
        throw error;
      } finally {
        span.end();
      }
    },
  },
};
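The resolver above reads a tracer from the GraphQL context, which you have to wire up yourself. A minimal sketch of how that might look with Apollo Server (the tracer name graphql-service is illustrative):

// Attach a tracer to every request's context so resolvers and plugins can use it
const { trace } = require('@opentelemetry/api');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  context: () => ({
    tracer: trace.getTracer('graphql-service'),
  }),
});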
Monitoring GraphQL Query Parsing and Validation Phases
The parsing and validation phases can be performance bottlenecks, too. Here's how to track them with Apollo Server:
const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    {
      async requestDidStart(requestContext) {
        const { request, context } = requestContext;
        const span = context.tracer.startSpan('graphql.request');
        // Raw query strings are high-cardinality; consider normalizing them (see below)
        span.setAttribute('graphql.query', request.query);
        context.requestSpan = span;

        return {
          async parsingDidStart() {
            const parseSpan = context.tracer.startSpan('graphql.parse');
            return () => {
              parseSpan.end();
            };
          },
          async validationDidStart() {
            const validationSpan = context.tracer.startSpan('graphql.validate');
            return () => {
              validationSpan.end();
            };
          },
          async executionDidStart() {
            const executionSpan = context.tracer.startSpan('graphql.execute');
            return {
              async executionDidEnd() {
                executionSpan.end();
              },
            };
          },
          async didEncounterErrors(ctx) {
            // ctx.errors holds the GraphQL errors; record each one on the request span
            for (const error of ctx.errors) {
              context.requestSpan.recordException(error);
            }
          },
          async willSendResponse() {
            context.requestSpan.end();
          },
        };
      },
    },
  ],
});
Common OpenTelemetry GraphQL Problems and Solutions
Even with a solid setup, you're likely to run into issues. Here are solutions to common problems.
Preventing High Cardinality Issues in GraphQL Query Tracing
GraphQL queries can be virtually infinite in their variations, which can cause a cardinality explosion in your monitoring system.
Solution: Filter and normalize queries before recording them as span attributes:
function normalizeQuery(query) {
  // Replace literal values with placeholders
  return query
    .replace(/"[^"]*"/g, '"?"')
    .replace(/\d+/g, '?');
}

// Usage
span.setAttribute('graphql.query.normalized', normalizeQuery(query));
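If your clients send named operations, the operation name is another low-cardinality label worth recording alongside (or instead of) the normalized query. A one-line sketch, assuming you are inside an Apollo Server plugin hook where requestContext.operationName has been resolved (the attribute key is just a suggestion):

// Operation names stay low-cardinality and make traces easy to group
span.setAttribute('graphql.operation.name', requestContext.operationName || 'anonymous');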
Implementing Distributed Tracing in Federated GraphQL Architectures
If you're using Apollo Federation or other GraphQL federation approaches, tracing across services gets complex.
Solution: Ensure proper context propagation:
// In your gateway service
const gateway = new ApolloGateway({
serviceList: [
/* your services */
],
buildService({ name, url }) {
return new RemoteGraphQLDataSource({
url,
willSendRequest({ request, context }) {
// Extract and forward the trace context
const currentSpanContext = context.activeSpan?.spanContext();
if (currentSpanContext) {
const traceParent = `00-${currentSpanContext.traceId}-${currentSpanContext.spanId}-0${currentSpanContext.traceFlags.toString(16)}`;
request.http.headers.set('traceparent', traceParent);
}
}
});
}
});
Connecting GraphQL Resolver Spans with Database Operation Telemetry
Seeing GraphQL resolver times without a database query context isn't very helpful.
Solution: Link database spans with resolver spans:
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('graphql-service');

async function getUsersFromDatabase(args) {
  // startActiveSpan parents this span under whatever span is currently active (e.g., the resolver span)
  return tracer.startActiveSpan('db.query.users', async (span) => {
    span.setAttribute('db.statement', 'SELECT * FROM users LIMIT ? OFFSET ?');
    span.setAttribute('db.parameters', JSON.stringify([args.limit, args.offset]));
    try {
      return await db.query('SELECT * FROM users LIMIT ? OFFSET ?', [args.limit, args.offset]);
    } catch (error) {
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
Analyzing OpenTelemetry GraphQL Data Effectively
Collecting data is only useful if you can make sense of it. Here's how to analyze your OpenTelemetry GraphQL data.
Essential GraphQL Performance Metrics for Operational Monitoring
| Metric | Description | Why It Matters |
|---|---|---|
| Resolver Duration | Time taken by each resolver | Identifies slow resolvers |
| Parse/Validate Time | Time spent in parsing and validation | Can indicate complex queries |
| N+1 Query Count | Number of duplicate database queries | Common GraphQL performance issue |
| Resolver Error Rate | Percentage of resolvers that throw errors | Shows reliability issues |
| Query Complexity | Calculated complexity score of queries | Helps identify abuse or optimization opportunities |
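If you would rather emit some of these directly as metrics instead of deriving them from trace data, a minimal sketch using the OpenTelemetry metrics API might look like this. It assumes a MeterProvider is configured in your SDK setup; the meter name, metric name, and attribute keys are illustrative:

const { metrics } = require('@opentelemetry/api');

const meter = metrics.getMeter('graphql-service');

// Histogram of per-resolver execution time
const resolverDuration = meter.createHistogram('graphql.resolver.duration', {
  unit: 'ms',
  description: 'Time taken by each resolver',
});

// Example: time a resolver's work and record it with a field label
async function timedUsersResolver(parent, args, context, info) {
  const startTime = Date.now();
  const users = await getUsersFromDatabase(args);
  resolverDuration.record(Date.now() - startTime, { 'graphql.field': 'Query.users' });
  return users;
}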
Designing GraphQL-Specific Observability Dashboards
A good GraphQL + OpenTelemetry dashboard should include:
- Top-level query response times
- Resolver timings by field
- Error rates by resolver
- Database query correlation
- Cache hit/miss rates
- Parsing and validation times
Integrating with Last9 for Advanced Observability
If you're looking for a budget-friendly managed observability solution that doesn’t compromise on features, Last9 pairs perfectly with your OpenTelemetry GraphQL setup.
We specialize in handling high-cardinality data — just like the data GraphQL generates — without the cost penalties you'd face with other vendors. Our pricing is based on event ingestion, so your costs stay predictable even as your GraphQL API usage grows.
Last9 integrates smoothly with your OpenTelemetry data and provides:
- Correlation between GraphQL operations and the underlying infrastructure
- Pre-built dashboards tailored for GraphQL workloads
- Smart alerting that understands GraphQL context
- A unified view across metrics, logs, and traces
Teams like Clevertap, Probo, and others trust Last9 for their OpenTelemetry needs, especially for how well we handle the high-cardinality nature of GraphQL telemetry data.

Best Practices for OpenTelemetry in GraphQL Production Environments
Moving to production requires some additional considerations:
Implementing Efficient Sampling for High-Volume GraphQL APIs
You likely don't need to trace every single GraphQL operation. Implement a smart sampling strategy:
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-node');

// Sample 10% of new (root) traces by default
const rootSampler = new TraceIdRatioBasedSampler(0.1);

// Respect the sampling decision of an incoming parent span; fall back to the ratio sampler for root spans
const sampler = new ParentBasedSampler({
  root: rootSampler,
});

// Add to your SDK config
const sdk = new NodeSDK({
  sampler,
  // other config...
});
Optimizing OpenTelemetry Resource Consumption in GraphQL Services
OpenTelemetry adds overhead. Manage it with these tips:
- Be selective about which resolvers you instrument manually
- Use attribute limits to prevent memory bloat
- Consider batching span exports in high-throughput environments (both of these are sketched below)
- Implement circuit breakers to disable tracing if the system is under heavy load
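Here is a minimal sketch of the attribute-limit and batching tips with the Node SDK; the limit values are illustrative, and the exact configuration surface can vary between SDK versions:

// Rough sketch: cap attribute counts/lengths and batch span exports
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');

const sdk = new NodeSDK({
  // Keep individual spans from growing unbounded
  spanLimits: {
    attributeCountLimit: 64,
    attributeValueLengthLimit: 512,
  },
  // Export spans in batches instead of one network call per span
  spanProcessor: new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' })
  ),
});

sdk.start();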
Protecting Sensitive Data in GraphQL Telemetry Collection
GraphQL queries often contain sensitive data. Protect it:
- Always redact authentication tokens from headers
- Filter out sensitive fields from query variables (see the sketch after this list)
- Hash user identifiers before recording them as span attributes
- Consider field-level policies for what can be recorded in traces
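As a starting point, a small redaction helper might look like the sketch below; the denylist and the graphql.variables attribute key are illustrative, so adapt them to your schema and naming conventions:

// Replace values of known-sensitive variable names before recording them on a span
const SENSITIVE_KEYS = ['password', 'token', 'ssn', 'creditcard'];

function redactVariables(variables = {}) {
  return Object.fromEntries(
    Object.entries(variables).map(([key, value]) =>
      SENSITIVE_KEYS.includes(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, value]
    )
  );
}

// Usage: record redacted variables instead of the raw ones
span.setAttribute('graphql.variables', JSON.stringify(redactVariables(request.variables)));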
Wrapping Up
Setting up OpenTelemetry with GraphQL gives you x-ray vision into your API's performance and behavior. Remember that the goal isn't just to collect data – it's to make your GraphQL API more reliable, performant, and maintainable. Let the telemetry guide your optimization efforts and architecture decisions.
FAQs
How much overhead does OpenTelemetry add to my GraphQL server?
When properly configured, OpenTelemetry typically adds 3-5% overhead in terms of latency and CPU usage. You can reduce this by implementing sampling (tracing only a percentage of requests) or by selectively instrumenting only critical paths.
Can OpenTelemetry help identify N+1 query problems in GraphQL?
Yes! This is one of the biggest benefits. By correlating database spans with resolver spans, you can easily spot when a resolver is triggering multiple similar database queries that could be batched. Tools like DataLoader become much easier to implement effectively when you can see the N+1 problems.
How do I handle sensitive data in GraphQL queries when using OpenTelemetry?
Implement a sanitization layer that processes GraphQL queries and variables before they're attached to spans. You can write middleware that redacts sensitive fields (like passwords or personal information) before they're recorded in your telemetry data.
Is OpenTelemetry suitable for both monolithic and federated GraphQL architectures?
Absolutely. For monoliths, the setup is simpler but still valuable. For federated architectures, OpenTelemetry shines as it can trace requests across service boundaries, giving you end-to-end visibility that's otherwise very difficult to achieve.
How does batching affect OpenTelemetry tracing in GraphQL?
When using batching techniques like DataLoader, you'll want to ensure your custom spans correctly represent the batched nature of operations. This usually means creating spans for both individual resolver calls and the batched data loading operations, with proper parent-child relationships between them.
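For example, a batched loader instrumented this way might look roughly like the sketch below; getUsersByIds is a hypothetical batched fetch, and the span and attribute names are illustrative:

const DataLoader = require('dataloader');
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('graphql-service');

// One span per batched load; it becomes a child of whichever span is active when the batch runs
const userLoader = new DataLoader(async (ids) => {
  return tracer.startActiveSpan('dataloader.users.batch', async (span) => {
    span.setAttribute('dataloader.batch.size', ids.length);
    try {
      return await getUsersByIds(ids); // hypothetical batched fetch keyed by id
    } finally {
      span.end();
    }
  });
});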