LangChain apps are powerful, but they’re not easy to monitor. A single request might pass through an LLM, a vector store, external APIs, and a custom chain of tools. And when something slows down or silently fails, debugging is often guesswork.
In one instance, a developer ended up with an unexpected $30,000 OpenAI bill, with no visibility into what triggered it. This blog shows how to avoid that using OpenTelemetry and LangSmith.
With this setup, you’ll be able to:
- Spot expensive chains before they blow up your bill
- Trace failures across tools, even when there’s no error
- Sleep at night knowing you’ll catch issues before your users do
Why LangChain Needs Observability
In the last post, we looked at how LangChain apps behave differently from traditional ones and why that makes them harder to monitor. Here’s a quick recap:
1. Complex execution flows
A single user request might trigger 15+ LLM calls across multiple chains, models, and tools. If something breaks, you won’t know where or why unless you have tracing in place.
2. Token costs escalate quickly
LLM usage adds up fast. Without visibility, teams often discover $2,000+ monthly bills from a chain that was calling the wrong model, or running more times than expected.
3. Silent failures are common
LangChain doesn’t always throw errors. Chains can return empty results or partial outputs with no indication of what failed. You won’t see it unless you're watching closely.
4. Performance issues are hard to isolate
Is your app slow because of the LLM? The vector database? A third-party tool? Without basic timing data, it's all guesswork.
5. Debugging in production is painful
When something goes wrong in prod, you need context: which prompt ran, what model was used, how long it took, and whether it failed. Without that, you’re flying blind.
Start with built-in tracing before setting up anything custom.
Start with LangSmith Tracing
The easiest way to start tracing your LangChain app is by enabling LangSmith.
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key" # Get this from LangSmith
That’s it. Every chain run is now traced, no code changes needed.
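One optional tweak: LangSmith groups runs under a project, which defaults to "default". Setting LANGCHAIN_PROJECT keeps traces from different apps separate (the project name below is just an example):

```python
# Optional: group this app's traces under a named LangSmith project
os.environ["LANGCHAIN_PROJECT"] = "langchain-observability-demo"  # example name
```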
Try it with a simple chain:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo")  # gpt-3.5-turbo is a chat model, so use the chat wrapper
prompt = PromptTemplate.from_template("What is the capital of {country}?")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("France"))
Once this runs, your LangSmith dashboard will show:
Prompt: What is the capital of France?
Model: gpt-3.5-turbo
Response: Paris
Duration: 1.2s
This is a great way to verify your chains are working, measure latency, and get a feel for how your prompts behave.
LangSmith works out of the box, but with production workloads, you may want more control over:
- What gets traced and when
- Filtering traces by user ID, org, or request source
- Connecting traces with your logs and metrics
- Capturing calls to tools, vector databases, or external APIs
That's where OpenTelemetry comes in: it gives you that control. Instead of relying on built-in tracing defaults, you decide what gets captured, how it's tagged, and where the data goes.
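Whichever route you take, it helps to attach request context (user ID, org, request source) to each run. LangChain supports this through the tags and metadata fields on a run's config; LangSmith records them automatically, and custom callbacks receive them too, so you can copy them onto OpenTelemetry spans. A rough sketch, reusing the capital-of-France chain from above (invoke is the newer call style, covered in the LCEL section later; the metadata keys are illustrative):

```python
# Attach request-scoped context to a run via the chain config
result = chain.invoke(
    {"country": "France"},
    config={
        "tags": ["demo", "geo-lookup"],                              # free-form tags
        "metadata": {"user_id": "u_123", "org": "acme", "source": "web"},  # searchable metadata
    },
)
```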
Build Custom Callbacks for OpenTelemetry
LangSmith is built on LangChain’s callback system; you’re just not writing the callbacks yourself. By implementing your own, you get full control over what gets captured and how it's exported. Here's how to get started:
Create a Custom Callback Handler
Let's start with an example: a callback handler that traces chain execution, tracks token usage, captures errors, and exports everything to your observability backend.
from langchain_core.callbacks.base import BaseCallbackHandler
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import time
import logging
logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
class ProductionCallbackHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None
        self.token_count = 0
        self.chain_id = None
        self.span = None

    def on_chain_start(self, serialized, inputs, *, run_id, **kwargs):
        try:
            self.start_time = time.time()
            self.chain_id = str(run_id)
            # Start the span manually; we end it ourselves in on_chain_end / on_chain_error
            self.span = tracer.start_span("langchain_chain")
            self.span.set_attribute("chain.type", serialized.get("name", "unknown"))
            self.span.set_attribute("chain.id", self.chain_id)
        except Exception as e:
            logger.warning(f"Callback failed on chain_start: {e}")

    def on_llm_end(self, response, *, run_id, **kwargs):
        # Accumulate token usage reported by the LLM provider
        if response.llm_output and "token_usage" in response.llm_output:
            usage = response.llm_output["token_usage"]
            self.token_count += usage.get("total_tokens", 0)
            if self.span:
                self.span.set_attribute("chain.tokens_used", self.token_count)

    def on_chain_error(self, error, *, run_id, **kwargs):
        if self.span:
            self.span.record_exception(error)
            self.span.set_status(Status(StatusCode.ERROR, str(error)))
            self.span.end()  # end the span so failed runs still export

    def on_chain_end(self, outputs, *, run_id, **kwargs):
        if self.span:
            if self.start_time:
                duration = time.time() - self.start_time
                self.span.set_attribute("chain.duration_ms", int(duration * 1000))
            self.span.end()
What Each Method Does:
on_chain_start
- Starts a span named "langchain_chain"
- Captures metadata like chain type and run ID
- Stores the span for later use
self.span = tracer.start_span("langchain_chain")
self.span.set_attribute("chain.type", serialized.get("name", "unknown"))
on_llm_end
- Extracts token usage from the LLM response
- Adds token count as a trace attribute
- Useful for cost tracking and audits
self.span.set_attribute("chain.tokens_used", self.token_count)
on_chain_error
- Attaches exceptions to the trace
- Marks the span with an error status
- Ends the span so failed runs still get exported
- Helps you debug failed chains later
self.span.record_exception(error)
self.span.set_status(Status(StatusCode.ERROR, str(error)))
self.span.end()
on_chain_end
- Calculates and stores chain duration
- Finalizes the span so it's exportable
self.span.set_attribute("chain.duration_ms", int(duration * 1000))
self.span.end()
What this callback gives you:
- A trace span for each chain execution
- Total tokens used per request
- Runtime duration in milliseconds
- Captured errors with full context
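To actually use the handler, pass it in when you run a chain. This is the same pattern the FastAPI endpoint later in this post uses; the snippet below assumes chain is the capital-of-France chain from earlier:

```python
# Attach the custom handler to a single run via the callbacks config
handler = ProductionCallbackHandler()
result = chain.invoke({"country": "France"}, config={"callbacks": [handler]})
print(result, handler.token_count)
```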
Configure OpenTelemetry Exporters
Now that you're capturing spans, you need to send them to your observability backend. Here's how to set that up in three steps:
1. Set Service Name and Metadata
Before exporting traces, it helps to tag them with context, like the service name, version, and deployment environment. This makes it easier to filter and group traces later.
from opentelemetry.sdk.resources import Resource
import os
resource = Resource.create({
"service.name": "langchain-app",
"service.version": "1.0.0",
"deployment.environment": os.getenv("ENVIRONMENT", "production")
})
You can tweak the values here to reflect your environment or CI/CD metadata.
2. Register the OpenTelemetry Trace Provider
This step sets up a TracerProvider, which is required before you can generate or export traces. You also attach the metadata from the previous step here.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry import trace
trace.set_tracer_provider(TracerProvider(resource=resource))
At this point, you've created a tracer context, but it's not exporting anything yet.
3. Configure the Trace Exporter
Now comes the important part: pushing trace data to your observability backend using an OTLP-compatible exporter. Here's the setup for gRPC-based export (we've used Last9 as an example, but you can use any OTLP-compatible backend like Grafana or Jaeger):
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
otlp_exporter = OTLPSpanExporter(
endpoint=os.getenv("OTLP_ENDPOINT", "https://otlp.last9.io:443"),
headers={"Authorization": f"Bearer {os.getenv('OTLP_API_KEY')}"}
)
This sends traces over the OTLP protocol, but only once the exporter is attached to the tracer provider through a span processor (shown in the startup sketch below and in the complete Last9 setup later). You can swap in a vendor-specific exporter if needed.
Run This Setup at App Startup
This is important: you want tracing initialized before any chains, tools, or LLMs start running. Otherwise, you'll miss the early spans.
# Example: in your app's main entrypoint
init_tracing()
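Here's a minimal sketch of what that init_tracing() function might look like, combining the three steps above and wiring the exporter in through a BatchSpanProcessor (the complete Last9 setup later in this post follows the same shape):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import os

def init_tracing():
    # Step 1: tag traces with service metadata
    resource = Resource.create({
        "service.name": "langchain-app",
        "service.version": "1.0.0",
        "deployment.environment": os.getenv("ENVIRONMENT", "production"),
    })
    # Step 2: register the tracer provider
    provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(provider)
    # Step 3: export spans in batches over OTLP
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTLP_ENDPOINT", "https://otlp.last9.io:443"),
        headers={"Authorization": f"Bearer {os.getenv('OTLP_API_KEY')}"},
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
```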
Build a FastAPI API with Observability
LangChain’s newer patterns make it easier to trace chains, capture metrics, and surface errors, especially when paired with FastAPI.
Use LCEL Instead of LLMChain
LangChain's older LLMChain is now deprecated. The new standard is LCEL (LangChain Expression Language), which lets you build chains using Python's pipe operator (|). It's more flexible and avoids legacy surprises.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
prompt = ChatPromptTemplate.from_template(
"Generate a {content_type} about {topic}. Make it {length}."
)
chain = prompt | llm | StrOutputParser()
Why LCEL is Better
- No more deprecation warnings or old interfaces
- Prompt, LLM, and parser are cleanly separated
- Built-in streaming support
- Better editor hints and type safety
If you're still on LLMChain, you don't need to refactor everything right away. LCEL chains work alongside older ones, so you can migrate gradually.
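For example, the capital-of-France chain from earlier looks like this in LCEL; both versions can coexist while you migrate (a sketch using the same prompt):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Same prompt as the earlier LLMChain example, rebuilt with the pipe operator
capital_chain = (
    ChatPromptTemplate.from_template("What is the capital of {country}?")
    | ChatOpenAI(model="gpt-3.5-turbo")
    | StrOutputParser()
)
print(capital_chain.invoke({"country": "France"}))  # "Paris"
```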
Add Tracing to FastAPI Endpoints
To make chain executions observable, pass a callback handler when invoking the chain. This captures latency, token usage, and errors tied to each request.
@app.post("/generate")
async def generate_content(request: GenerationRequest):
try:
result = chain.invoke(
{
"topic": request.topic,
"content_type": request.content_type,
"length": request.length
},
config={"callbacks": [callback_handler]}
)
return {
"content": result,
"status": "success",
"tokens_used": callback_handler.token_count
}
except Exception as e:
logger.error(f"Generation failed for topic '{request.topic}': {e}")
raise HTTPException(status_code=500, detail="Generation failed")
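The endpoint references a few pieces that aren't shown; here's a minimal sketch of what they might look like (the GenerationRequest fields and defaults are assumptions based on how the endpoint uses them, and chain is the LCEL chain defined above):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import logging

logger = logging.getLogger(__name__)

# The same app object gets OpenTelemetry-instrumented in the next section
app = FastAPI(title="LangChain API with Observability")

class GenerationRequest(BaseModel):
    topic: str
    content_type: str = "blog post"  # assumed default
    length: str = "short"            # assumed default
```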
This setup gives you:
- Callbacks tied to each request — each call emits trace data with a unique ID
- Token usage in the response — useful for monitoring cost per request
- Clean error logging — logs include request context, like the topic
- Consistent API format — easy to consume on the frontend or across services
Handle Callback Failures Without Breaking the API
Tracing should never crash your app. If a callback fails (say, a span exporter goes down), it should log the issue without affecting the request.
Wrap it like this:
def on_chain_start(self, serialized, inputs, *, run_id, **kwargs):
    try:
        self.start_time = time.time()
        self.chain_id = str(run_id)
        self.span = tracer.start_span("langchain_chain")
        self.span.set_attribute("chain.type", serialized.get("name", "unknown"))
        self.span.set_attribute("chain.id", self.chain_id)
    except Exception as e:
        logger.warning(f"Callback failed on chain_start: {e}")
This way, your app keeps working even if the observability layer hits trouble.
Trace HTTP Requests Automatically
To track HTTP request data, like latency, status codes, and error rates, use the OpenTelemetry FastAPI instrumentation (the opentelemetry-instrumentation-fastapi package):
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
app = FastAPI(title="LangChain API with Observability")
FastAPIInstrumentor.instrument_app(app)
You'll now get a separate span for each HTTP request, helping you see:
- Which routes are slow
- Where failures happen
- How often requests error out or time out
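One optional refinement: if you add a /health endpoint (the Docker setup later in this post probes one), you can keep those checks out of your traces by passing excluded_urls to the instrument_app call above:

```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Exclude health-check probes so they don't clutter your traces
FastAPIInstrumentor.instrument_app(app, excluded_urls="health")
```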
Add Metrics to Your Setup
While traces help you debug individual problems, metrics help you track trends in usage, cost, latency, and performance over time. Use the OpenTelemetry Metrics API to track the numbers that matter.
1. Set up the MeterProvider and Exporter (run once at app startup)
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
import os
# Set up metrics exporter
metric_exporter = OTLPMetricExporter(
endpoint=os.getenv("OTLP_ENDPOINT"),
headers={"Authorization": f"Bearer {os.getenv('OTLP_API_KEY')}"}
)
# Register the exporter with a periodic reader
metric_reader = PeriodicExportingMetricReader(
metric_exporter, export_interval_millis=30000
)
# Apply the MeterProvider globally
metrics.set_meter_provider(
MeterProvider(metric_readers=[metric_reader])
)
Note: Like with traces, you can use any OTLP-compatible backend for metrics. Just adjust the endpoint and headers for your provider.
2. Define Metrics After the Provider is Set
# Get a meter for your app
meter = metrics.get_meter(__name__)
# Total token usage
token_counter = meter.create_counter(
"langchain_tokens_total",
description="Total tokens consumed by LangChain operations"
)
# End-to-end latency
request_duration = meter.create_histogram(
"langchain_request_duration_seconds",
description="Time taken to process LangChain requests"
)
# Rough cost estimation
cost_tracker = meter.create_counter(
"langchain_cost_usd",
description="Estimated cost of LangChain operations in USD"
)
3. Extend the Callback to Emit Metrics
class MetricsCallbackHandler(BaseCallbackHandler):
def on_llm_end(self, response, *, run_id, **kwargs):
if response.llm_output and "token_usage" in response.llm_output:
usage = response.llm_output["token_usage"]
tokens = usage.get("total_tokens", 0)
model_name = kwargs.get("model", "unknown")
token_counter.add(tokens, {"model": model_name})
# Estimate cost at $0.02 per 1k tokens (adjust as needed)
estimated_cost = tokens * 0.00002
cost_tracker.add(estimated_cost, {"model": model_name})
This structure ensures your metrics are registered and exported correctly: no silent drops, no surprises.
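To put this to work, attach the metrics handler alongside the tracing handler and feed the request_duration histogram around the call. A sketch, reusing the names defined above (the attribute on the histogram is illustrative):

```python
import time

def run_chain_with_telemetry(inputs: dict):
    # Both handlers ride along on the same run: spans from one, token/cost metrics from the other
    handlers = [ProductionCallbackHandler(), MetricsCallbackHandler()]
    start = time.time()
    try:
        return chain.invoke(inputs, config={"callbacks": handlers})
    finally:
        # Record end-to-end latency into the histogram defined earlier
        request_duration.record(time.time() - start, {"endpoint": "generate"})
```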
Export Traces & Metrics to Last9
Here's a complete setup using Last9 as your observability backend:
Set Environment Variables
export LAST9_API_KEY="your_last9_api_key"
export OTLP_ENDPOINT="https://otlp.last9.io:443"
export ENVIRONMENT="production"
Initialize the Complete Setup
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry import trace
import os
def setup_observability():
    resource = Resource.create({
        "service.name": "langchain-app",
        "service.version": "1.0.0",
        "deployment.environment": os.getenv("ENVIRONMENT", "production")
    })
    trace.set_tracer_provider(TracerProvider(resource=resource))
    otlp_exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTLP_ENDPOINT"),
        headers={"Authorization": f"Bearer {os.getenv('LAST9_API_KEY')}"}
    )
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

# Run this at app startup
setup_observability()
What You Can Do with Last9:
Once traces and metrics are flowing, here's what you can do:
- Follow each request: See how a FastAPI request moves through your chain, LLM calls, and any external tools.
- Track RED metrics out of the box: Request rate, error count, and latency, auto-instrumented and tagged.
- Search traces by chain, model, or env: Filter by any OpenTelemetry attribute, including dynamic values like chain.id.
- Build usage dashboards: Track token counts, estimated cost, and success rates over time.
- Debug slow chains: Trace latency back to specific vector DB calls, prompt handling, or retries.
- Alert on spend and spikes: Get notified when usage jumps or chains start failing silently.
Containerize the App with Tracing
Before you run this in production, you'll want it packaged and predictable — something that behaves the same on every machine and plays well with observability tools.
Docker makes this easier to manage: one image, one setup, and fewer surprises when you deploy.
Step 1: Build a Minimal Docker Image
Start with a lightweight Python base, install dependencies, and add a health check for load balancer integrations.
FROM python:3.11-slim
WORKDIR /app
# Install curl for the health check below (the slim base image doesn't ship it)
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
# Install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the code
COPY . .
# Set environment variables for production
ENV ENVIRONMENT=production
ENV LANGCHAIN_TRACING_V2=true
# Health check against the app's /health endpoint
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1
# Start the FastAPI app with multiple workers
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Note: If you're using LangSmith for tracing, LANGCHAIN_TRACING_V2 must be true. Set it to false only if you're using custom OpenTelemetry callbacks without LangSmith.
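The Dockerfile installs from requirements.txt; for the code in this post, that file needs roughly the following (versions left unpinned here; pin them for real deployments):

```
fastapi
uvicorn
pydantic
langchain
langchain-core
langchain-openai
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp-proto-grpc
opentelemetry-instrumentation-fastapi
```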
Step 2: Set Up Docker Compose with Observability
Here's a docker-compose.yml that wires in environment variables, observability endpoints, and basic resource controls:
version: '3.8'
services:
  langchain-app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LANGCHAIN_TRACING_V2=true
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OTLP_ENDPOINT=https://otlp.last9.io:443
      - LAST9_API_KEY=${LAST9_API_KEY}
      - ENVIRONMENT=production
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
What this setup gives you:
- Env vars for observability and API keys — all pulled from your local .env or secrets manager
- Resource limits — so a bad chain doesn't exhaust your host
- Automatic restarts — for resilience during transient failures (e.g., OpenAI 500s or DNS timeouts)
Step 3: Plan for Production Workloads
A basic LangChain API can run on modest hardware, but scale requirements vary based on usage.
If you're not using LangSmith and just want OpenTelemetry spans, skip LANGCHAIN_TRACING_V2 entirely and rely on your custom callback handler.
Common LangChain Issues and How to Debug Them
Most LangChain issues don’t surface until your app is under real load. This section shows how to identify them early and what your telemetry should be telling you.
What you see | What’s going wrong | How to fix it |
---|---|---|
Slow responses | Vector DB queries are slow, or LLM calls are taking time | Look at vector_search spans. Use async DB calls. For long llm_call spans, check prompt size and simplify context. |
Token usage keeps climbing | Some chains are too verbose, or prompts are inefficient | Track langchain_tokens_total by chain name. Reduce unnecessary messages and prompt length. |
OpenAI bill went through the roof | A chain is running more than expected, or failing silently | Correlate langchain_cost_usd with chain success. Add alerts for sudden token spikes or runaway loops. |
LLMs returning empty or useless responses | Prompt is malformed, or errors are swallowed | Monitor span outputs. Add prompt preview in your traces to spot formatting issues. |
Errors aren’t showing up in logs | Tool calls inside agents are failing silently | Look inside child spans — many errors don’t bubble up. Mark failed steps clearly in the callback. |
Memory usage keeps going up | Chain memory isn't resetting, or vector store is caching too much | Check memory patterns alongside traffic. Clear memory between requests. Audit how you’re storing embeddings. |
Spans missing from traces | Span wasn’t finalized due to crash or timeout | Make sure on_chain_end or on_chain_error is always called in your callbacks. Add logging if spans don’t close. |
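That last row is the easiest one to hit with hand-rolled callbacks. One defensive pattern is to route both on_chain_end and on_chain_error through a single helper that always closes the span; a sketch of how the ProductionCallbackHandler from earlier could be hardened (methods to add to or swap into that class):

```python
def _close_span(self, error=None):
    """Finalize the current span exactly once, even if recording the error fails."""
    if not self.span:
        return
    try:
        if error is not None:
            self.span.record_exception(error)
            self.span.set_status(Status(StatusCode.ERROR, str(error)))
    finally:
        self.span.end()
        self.span = None  # guard against double-ending if callbacks fire more than once

def on_chain_error(self, error, *, run_id, **kwargs):
    self._close_span(error=error)

def on_chain_end(self, outputs, *, run_id, **kwargs):
    if self.span and self.start_time:
        self.span.set_attribute("chain.duration_ms", int((time.time() - self.start_time) * 1000))
    self._close_span()
```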
You now have complete observability for your LangChain application.
Your traces flow to Last9, giving you the visibility needed to debug issues and optimize performance in production.
And if you hit a snag along the way, feel free to book some time with us; we're happy to help.