LangChain apps are powerful, but they’re not easy to monitor. A single request might pass through an LLM, a vector store, external APIs, and a custom chain of tools. And when something slows down or silently fails, debugging is often guesswork.
In one instance, a developer ended up with an unexpected $30,000 OpenAI bill, with no visibility into what triggered it. This blog shows how to avoid that using OpenTelemetry and LangSmith.
With this setup, you’ll be able to:
- Spot expensive chains before they blow up your bill
- Trace failures across tools, even when there’s no error
- Sleep at night knowing you’ll catch issues before your users do
Why LangChain Needs Observability
In the last post, we looked at how LangChain apps behave differently from traditional ones and why that makes them harder to monitor. Here’s a quick recap:
1. Complex execution flows
A single user request might trigger 15+ LLM calls across multiple chains, models, and tools. If something breaks, you won’t know where or why unless you have tracing in place.
2. Token costs escalate quickly
LLM usage adds up fast. Without visibility, teams often discover $2,000+ monthly bills from a chain that was calling the wrong model, or running more times than expected.
3. Silent failures are common
LangChain doesn’t always throw errors. Chains can return empty results or partial outputs with no indication of what failed. You won’t see it unless you're watching closely.
4. Performance issues are hard to isolate
Is your app slow because of the LLM? The vector database? A third-party tool? Without basic timing data, it's all guesswork.
5. Debugging in production is painful
When something goes wrong in prod, you need context: which prompt ran, what model was used, how long it took, and whether it failed. Without that, you’re flying blind.
Start with built-in tracing before setting up anything custom.
Start with LangSmith Tracing
The easiest way to start tracing your LangChain app is by enabling LangSmith.
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_api_key" # Get this from LangSmith
That’s it. Every chain run is now traced, no code changes needed.
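One optional tweak: LangSmith groups runs under a project, which defaults to "default". Setting LANGCHAIN_PROJECT keeps traces from different apps separate (the project name below is just an example):

```python
# Optional: group this app's traces under a named LangSmith project
os.environ["LANGCHAIN_PROJECT"] = "langchain-observability-demo"  # example name
```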
Try it with a simple chain:
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo")  # gpt-3.5-turbo is a chat model, so use the chat wrapper
prompt = PromptTemplate.from_template("What is the capital of {country}?")
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("France"))
Once this runs, your LangSmith dashboard will show:
Prompt: What is the capital of France?
Model: gpt-3.5-turbo
Response: Paris
Duration: 1.2s
This is a great way to verify your chains are working, measure latency, and get a feel for how your prompts behave.
LangSmith works out of the box, but with production workloads, you may want more control over:
- What gets traced and when
- Filtering traces by user ID, org, or request source
- Connecting traces with your logs and metrics
- Capturing calls to tools, vector databases, or external APIs
That's where OpenTelemetry comes in: it gives you that control. Instead of relying on built-in tracing defaults, you decide what gets captured, how it's tagged, and where the data goes.
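Whichever route you take, it helps to attach request context (user ID, org, request source) to each run. LangChain supports this through the tags and metadata fields on a run's config; LangSmith records them automatically, and custom callbacks receive them too, so you can copy them onto OpenTelemetry spans. A rough sketch, reusing the capital-of-France chain from above (invoke is the newer call style, covered in the LCEL section later; the metadata keys are illustrative):

```python
# Attach request-scoped context to a run via the chain config
result = chain.invoke(
    {"country": "France"},
    config={
        "tags": ["demo", "geo-lookup"],                              # free-form tags
        "metadata": {"user_id": "u_123", "org": "acme", "source": "web"},  # searchable metadata
    },
)
```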
Build Custom Callbacks for OpenTelemetry
LangSmith is built on LangChain’s callback system; you’re just not writing the callbacks yourself. By implementing your own, you get full control over what gets captured and how it's exported. Here's how to get started:
Create a Custom Callback Handler
Let's start with an example: a callback handler that traces chain execution, tracks token usage, captures errors, and exports everything to your observability backend.
from langchain_core.callbacks.base import BaseCallbackHandler
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import time
import logging
logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)
class ProductionCallbackHandler(BaseCallbackHandler):
    def __init__(self):
        self.start_time = None
        self.token_count = 0
        self.chain_id = None
        self.span = None

    def on_chain_start(self, serialized, inputs, *, run_id, **kwargs):
        try:
            self.start_time = time.time()
            self.chain_id = str(run_id)
            # Start the span manually; we end it ourselves in on_chain_end / on_chain_error
            self.span = tracer.start_span("langchain_chain")
            self.span.set_attribute("chain.type", serialized.get("name", "unknown"))
            self.span.set_attribute("chain.id", self.chain_id)
        except Exception as e:
            logger.warning(f"Callback failed on chain_start: {e}")

    def on_llm_end(self, response, *, run_id, **kwargs):
        # Accumulate token usage reported by the LLM provider
        if response.llm_output and "token_usage" in response.llm_output:
            usage = response.llm_output["token_usage"]
            self.token_count += usage.get("total_tokens", 0)
            if self.span:
                self.span.set_attribute("chain.tokens_used", self.token_count)

    def on_chain_error(self, error, *, run_id, **kwargs):
        if self.span:
            self.span.record_exception(error)
            self.span.set_status(Status(StatusCode.ERROR, str(error)))
            self.span.end()  # end the span so failed runs still export

    def on_chain_end(self, outputs, *, run_id, **kwargs):
        if self.span:
            if self.start_time:
                duration = time.time() - self.start_time
                self.span.set_attribute("chain.duration_ms", int(duration * 1000))
            self.span.end()
What Each Method Does:
on_chain_start
- Starts a span named "langchain_chain"
- Captures metadata like chain type and run ID
- Stores the span for later use
self.span = tracer.start_span("langchain_chain")
self.span.set_attribute("chain.type", serialized.get("name", "unknown"))
on_llm_end
- Extracts token usage from the LLM response
- Adds token count as a trace attribute
- Useful for cost tracking and audits
self.span.set_attribute("chain.tokens_used", self.token_count)
on_chain_error
- Attaches exceptions to the trace
- Marks the span with an error status
- Ends the span so failed runs still get exported
- Helps you debug failed chains later
self.span.record_exception(error)
self.span.set_status(Status(StatusCode.ERROR, str(error)))
self.span.end()
on_chain_end
- Calculates and stores chain duration
- Finalizes the span so it's exportable
self.span.set_attribute("chain.duration_ms", int(duration * 1000))
self.span.end()
What this callback gives you:
- A trace span for each chain execution
- Total tokens used per request
- Runtime duration in milliseconds
- Captured errors with full context
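To actually use the handler, pass it in when you run a chain. This is the same pattern the FastAPI endpoint later in this post uses; the snippet below assumes chain is the capital-of-France chain from earlier:

```python
# Attach the custom handler to a single run via the callbacks config
handler = ProductionCallbackHandler()
result = chain.invoke({"country": "France"}, config={"callbacks": [handler]})
print(result, handler.token_count)
```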
Configure OpenTelemetry Exporters
Now that you're capturing spans, you need to send them to your observability backend. Here's how to set that up in three steps:
1. Set Service Name and Metadata
Before exporting traces, it helps to tag them with context, like the service name, version, and deployment environment. This makes it easier to filter and group traces later.
from opentelemetry.sdk.resources import Resource
import os
resource = Resource.create({
"service.name": "langchain-app",
"service.version": "1.0.0",
"deployment.environment": os.getenv("ENVIRONMENT", "production")
})
You can tweak the values here to reflect your environment or CI/CD metadata.
2. Register the OpenTelemetry Trace Provider
This step sets up a TracerProvider, which is required before you can generate or export traces. You also attach the metadata from the previous step here.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry import trace
trace.set_tracer_provider(TracerProvider(resource=resource))
At this point, you've created a tracer context, but it's not exporting anything yet.
3. Configure the Trace Exporter
Now comes the important part: pushing trace data to your observability backend using an OTLP-compatible exporter. Here's the setup for gRPC-based export (we've used Last9 as an example, but you can use any OTLP-compatible backend like Grafana or Jaeger):
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
otlp_exporter = OTLPSpanExporter(
endpoint=os.getenv("OTLP_ENDPOINT", "https://otlp.last9.io:443"),
headers={"Authorization": f"Bearer {os.getenv('OTLP_API_KEY')}"}
)
This sends traces over the OTLP protocol, but only once the exporter is attached to the tracer provider through a span processor (shown in the startup sketch below and in the complete Last9 setup later). You can swap in a vendor-specific exporter if needed.
Run This Setup at App Startup
This is important: you want tracing initialized before any chains, tools, or LLMs start running. Otherwise, you'll miss the early spans.
# Example: in your app's main entrypoint
init_tracing()
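Here's a minimal sketch of what that init_tracing() function might look like, combining the three steps above and wiring the exporter in through a BatchSpanProcessor (the complete Last9 setup later in this post follows the same shape):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import os

def init_tracing():
    # Step 1: tag traces with service metadata
    resource = Resource.create({
        "service.name": "langchain-app",
        "service.version": "1.0.0",
        "deployment.environment": os.getenv("ENVIRONMENT", "production"),
    })
    # Step 2: register the tracer provider
    provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(provider)
    # Step 3: export spans in batches over OTLP
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTLP_ENDPOINT", "https://otlp.last9.io:443"),
        headers={"Authorization": f"Bearer {os.getenv('OTLP_API_KEY')}"},
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
```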
Build a FastAPI API with Observability
LangChain’s newer patterns make it easier to trace chains, capture metrics, and surface errors, especially when paired with FastAPI.
Use LCEL Instead of LLMChain
LangChain's older LLMChain is now deprecated. The new standard is LCEL (LangChain Expression Language), which lets you build chains using Python's pipe operator (|). It's more flexible and avoids legacy surprises.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
prompt = ChatPromptTemplate.from_template(
"Generate a {content_type} about {topic}. Make it {length}."
)
chain = prompt | llm | StrOutputParser()
Why LCEL is Better
- No more deprecation warnings or old interfaces
- Prompt, LLM, and parser are cleanly separated
- Built-in streaming support
- Better editor hints and type safety
If you're still on LLMChain, you don't need to refactor everything right away. LCEL chains work alongside older ones, so you can migrate gradually.
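For example, the capital-of-France chain from earlier looks like this in LCEL; both versions can coexist while you migrate (a sketch using the same prompt):

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Same prompt as the earlier LLMChain example, rebuilt with the pipe operator
capital_chain = (
    ChatPromptTemplate.from_template("What is the capital of {country}?")
    | ChatOpenAI(model="gpt-3.5-turbo")
    | StrOutputParser()
)
print(capital_chain.invoke({"country": "France"}))  # "Paris"
```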
Add Tracing to FastAPI Endpoints
To make chain executions observable, pass a callback handler when invoking the chain. This captures latency, token usage, and errors tied to each request.
@app.post("/generate")
async def generate_content(request: GenerationRequest):
try:
result = chain.invoke(
{
"topic": request.topic,
"content_type": request.content_type,
"length": request.length
},
config={"callbacks": [callback_handler]}
)
return {
"content": result,
"status": "success",
"tokens_used": callback_handler.token_count
}
except Exception as e:
logger.error(f"Generation failed for topic '{request.topic}': {e}")
raise HTTPException(status_code=500, detail="Generation failed")
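The endpoint references a few pieces that aren't shown; here's a minimal sketch of what they might look like (the GenerationRequest fields and defaults are assumptions based on how the endpoint uses them, and chain is the LCEL chain defined above):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import logging

logger = logging.getLogger(__name__)

# The same app object gets OpenTelemetry-instrumented in the next section
app = FastAPI(title="LangChain API with Observability")

class GenerationRequest(BaseModel):
    topic: str
    content_type: str = "blog post"  # assumed default
    length: str = "short"            # assumed default
```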
This setup gives you:
- Callbacks tied to each request — each call emits trace data with a unique ID
- Token usage in the response — useful for monitoring cost per request
- Clean error logging — logs include request context, like the topic
- Consistent API format — easy to consume on the frontend or across services
Handle Callback Failures Without Breaking the API
Tracing should never crash your app. If a callback fails (say, a span exporter goes down), it should log the issue without affecting the request.
Wrap it like this:
def on_chain_start(self, serialized, inputs, *, run_id, **kwargs):
    try:
        self.start_time = time.time()
        self.chain_id = str(run_id)
        self.span = tracer.start_span("langchain_chain")
        self.span.set_attribute("chain.type", serialized.get("name", "unknown"))
        self.span.set_attribute("chain.id", self.chain_id)
    except Exception as e:
        logger.warning(f"Callback failed on chain_start: {e}")
This way, your app keeps working even if the observability layer hits trouble.
Trace HTTP Requests Automatically
To track HTTP request data, like latency, status codes, and error rates, use the OpenTelemetry FastAPI instrumentation (the opentelemetry-instrumentation-fastapi package):
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
app = FastAPI(title="LangChain API with Observability")
FastAPIInstrumentor.instrument_app(app)
You'll now get a separate span for each HTTP request, helping you see:
- Which routes are slow
- Where failures happen
- How often requests error out or time out
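One optional refinement: if you add a /health endpoint (the Docker setup later in this post probes one), you can keep those checks out of your traces by passing excluded_urls to the instrument_app call above:

```python
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Exclude health-check probes so they don't clutter your traces
FastAPIInstrumentor.instrument_app(app, excluded_urls="health")
```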
Add Metrics to Your Setup
While traces help you debug individual problems, metrics help you track trends in usage, cost, latency, and performance over time. Use the OpenTelemetry Metrics API to track the numbers that matter.
1. Set up the MeterProvider and Exporter (run once at app startup)
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
import os
# Set up metrics exporter
metric_exporter = OTLPMetricExporter(
endpoint=os.getenv("OTLP_ENDPOINT"),
headers={"Authorization": f"Bearer {os.getenv('OTLP_API_KEY')}"}
)
# Register the exporter with a periodic reader
metric_reader = PeriodicExportingMetricReader(
metric_exporter, export_interval_millis=30000
)
# Apply the MeterProvider globally
metrics.set_meter_provider(
MeterProvider(metric_readers=[metric_reader])
)
Note: Like with traces, you can use any OTLP-compatible backend for metrics. Just adjust the endpoint and headers for your provider.
2. Define Metrics After the Provider is Set
# Get a meter for your app
meter = metrics.get_meter(__name__)
# Total token usage
token_counter = meter.create_counter(
"langchain_tokens_total",
description="Total tokens consumed by LangChain operations"
)
# End-to-end latency
request_duration = meter.create_histogram(
"langchain_request_duration_seconds",
description="Time taken to process LangChain requests"
)
# Rough cost estimation
cost_tracker = meter.create_counter(
"langchain_cost_usd",
description="Estimated cost of LangChain operations in USD"
)
3. Extend the Callback to Emit Metrics
class MetricsCallbackHandler(BaseCallbackHandler):
def on_llm_end(self, response, *, run_id, **kwargs):
if response.llm_output and "token_usage" in response.llm_output:
usage = response.llm_output["token_usage"]
tokens = usage.get("total_tokens", 0)
model_name = kwargs.get("model", "unknown")
token_counter.add(tokens, {"model": model_name})
# Estimate cost at $0.02 per 1k tokens (adjust as needed)
estimated_cost = tokens * 0.00002
cost_tracker.add(estimated_cost, {"model": model_name})
This structure ensures your metrics are registered and exported correctly: no silent drops, no surprises.
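To put this to work, attach the metrics handler alongside the tracing handler and feed the request_duration histogram around the call. A sketch, reusing the names defined above (the attribute on the histogram is illustrative):

```python
import time

def run_chain_with_telemetry(inputs: dict):
    # Both handlers ride along on the same run: spans from one, token/cost metrics from the other
    handlers = [ProductionCallbackHandler(), MetricsCallbackHandler()]
    start = time.time()
    try:
        return chain.invoke(inputs, config={"callbacks": handlers})
    finally:
        # Record end-to-end latency into the histogram defined earlier
        request_duration.record(time.time() - start, {"endpoint": "generate"})
```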
Export Traces & Metrics to Last9
Here's a complete setup using Last9 as your observability backend:
Set Environment Variables
export LAST9_API_KEY="your_last9_api_key"
export OTLP_ENDPOINT="https://otlp.last9.io:443"
export ENVIRONMENT="production"
Initialize the Complete Setup
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry import trace
import os
def setup_observability():
    resource = Resource.create({
        "service.name": "langchain-app",
        "service.version": "1.0.0",
        "deployment.environment": os.getenv("ENVIRONMENT", "production")
    })
    trace.set_tracer_provider(TracerProvider(resource=resource))
    otlp_exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTLP_ENDPOINT"),
        headers={"Authorization": f"Bearer {os.getenv('LAST9_API_KEY')}"}
    )
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

# Run this at app startup
setup_observability()
What You Can Do with Last9:
Once traces and metrics are flowing, here's what you can do:
- Follow each request: See how a FastAPI request moves through your chain, LLM calls, and any external tools.
- Track RED metrics out of the box: Request rate, error count, and latency, auto-instrumented and tagged.
- Search traces by chain, model, or env: Filter by any OpenTelemetry attribute, including dynamic values like chain.id.
- Build usage dashboards: Track token counts, estimated cost, and success rates over time.
- Debug slow chains: Trace latency back to specific vector DB calls, prompt handling, or retries.
- Alert on spend and spikes: Get notified when usage jumps or chains start failing silently.
Containerize the App with Tracing
Before you run this in production, you'll want it packaged and predictable — something that behaves the same on every machine and plays well with observability tools.
Docker makes this easier to manage: one image, one setup, and fewer surprises when you deploy.
Step 1: Build a Minimal Docker Image
Start with a lightweight Python base, install dependencies, and add a health check for load balancer integrations.
FROM python:3.11-slim
WORKDIR /app
# Install curl for the health check below (the slim base image doesn't ship it)
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
# Install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the code
COPY . .
# Set environment variables for production
ENV ENVIRONMENT=production
ENV LANGCHAIN_TRACING_V2=true
# Health check against the app's /health endpoint
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1
# Start the FastAPI app with multiple workers
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
Note: If you're using LangSmith for tracing, LANGCHAIN_TRACING_V2 must be true. Set it to false only if you're using custom OpenTelemetry callbacks without LangSmith.
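The Dockerfile installs from requirements.txt; for the code in this post, that file needs roughly the following (versions left unpinned here; pin them for real deployments):

```
fastapi
uvicorn
pydantic
langchain
langchain-core
langchain-openai
opentelemetry-api
opentelemetry-sdk
opentelemetry-exporter-otlp-proto-grpc
opentelemetry-instrumentation-fastapi
```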
Step 2: Set Up Docker Compose with Observability
Here's a docker-compose.yml that wires in environment variables, observability endpoints, and basic resource controls:
version: '3.8'
services:
  langchain-app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LANGCHAIN_TRACING_V2=true
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - OTLP_ENDPOINT=https://otlp.last9.io:443
      - LAST9_API_KEY=${LAST9_API_KEY}
      - ENVIRONMENT=production
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
What this setup gives you:
- Env vars for observability and API keys — all pulled from your local .env or secrets manager
- Resource limits — so a bad chain doesn't exhaust your host
- Automatic restarts — for resilience during transient failures (e.g., OpenAI 500s or DNS timeouts)
Step 3: Plan for Production Workloads
A basic LangChain API can run on modest hardware, but scale requirements vary based on usage.
If you're not using LangSmith and just want OpenTelemetry spans, skip LANGCHAIN_TRACING_V2 entirely and rely on your custom callback handler.
Common LangChain Issues and How to Debug Them
Most LangChain issues don’t surface until your app is under real load. This section shows how to identify them early and what your telemetry should be telling you.
What you see | What’s going wrong | How to fix it |
---|---|---|
Slow responses | Vector DB queries are slow, or LLM calls are taking time | Look at vector_search spans. Use async DB calls. For long llm_call spans, check prompt size and simplify context. |
Token usage keeps climbing | Some chains are too verbose, or prompts are inefficient | Track langchain_tokens_total by chain name. Reduce unnecessary messages and prompt length. |
OpenAI bill went through the roof | A chain is running more than expected, or failing silently | Correlate langchain_cost_usd with chain success. Add alerts for sudden token spikes or runaway loops. |
LLMs returning empty or useless responses | Prompt is malformed, or errors are swallowed | Monitor span outputs. Add prompt preview in your traces to spot formatting issues. |
Errors aren’t showing up in logs | Tool calls inside agents are failing silently | Look inside child spans — many errors don’t bubble up. Mark failed steps clearly in the callback. |
Memory usage keeps going up | Chain memory isn't resetting, or vector store is caching too much | Check memory patterns alongside traffic. Clear memory between requests. Audit how you’re storing embeddings. |
Spans missing from traces | Span wasn’t finalized due to crash or timeout | Make sure on_chain_end or on_chain_error is always called in your callbacks. Add logging if spans don’t close. |
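That last row is the easiest one to hit with hand-rolled callbacks. One defensive pattern is to route both on_chain_end and on_chain_error through a single helper that always closes the span; a sketch of how the ProductionCallbackHandler from earlier could be hardened (methods to add to or swap into that class):

```python
def _close_span(self, error=None):
    """Finalize the current span exactly once, even if recording the error fails."""
    if not self.span:
        return
    try:
        if error is not None:
            self.span.record_exception(error)
            self.span.set_status(Status(StatusCode.ERROR, str(error)))
    finally:
        self.span.end()
        self.span = None  # guard against double-ending if callbacks fire more than once

def on_chain_error(self, error, *, run_id, **kwargs):
    self._close_span(error=error)

def on_chain_end(self, outputs, *, run_id, **kwargs):
    if self.span and self.start_time:
        self.span.set_attribute("chain.duration_ms", int((time.time() - self.start_time) * 1000))
    self._close_span()
```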
You now have complete observability for your LangChain application.
Your traces flow to Last9, giving you the visibility needed to debug issues and optimize performance in production.
And if you hit a snag along the way, feel free to book some time with us; we're happy to help.