What is LLM Observability? A Complete Guide (with OpenTelemetry)
LLM observability is the practice of tracking what goes into an LLM, what comes out, and everything in between — latency, token usage, errors, and model behavior — so you can debug, optimize, and trust AI applications in production.
Large language models behave differently from traditional software. A function either returns the right value or it doesn't. An LLM might return a plausible-sounding answer that's factually wrong, take 8 seconds instead of 800ms, or cost $0.40 per request when you budgeted $0.04. None of these failures throw an exception. Without observability, you find out from users.
This guide covers what LLM observability is, how it differs from traditional observability, what to instrument, and how to implement it using OpenTelemetry.
Table of Contents
- What is LLM Observability?
- How LLM Observability Differs from Traditional Observability
- What LLM Observability Tracks
- LLM Observability Architecture
- Implementing LLM Observability with OpenTelemetry
- LLM Observability Tools Comparison
- Common LLM Issues Observability Catches
- Best Practices
- FAQ
What is LLM Observability? {#what-is-llm-observability}
LLM observability is the practice of collecting and analyzing traces, metrics, and logs from applications that use large language models — covering inputs (prompts), outputs (completions), token consumption, latency, and model behavior — to understand performance, catch failures, and control costs in production.
It answers three questions traditional monitoring can't:
- Did the model respond correctly? (not just successfully)
- Why did this response take 6 seconds and cost 10x more than expected?
- Which prompts produce hallucinations, and under what conditions?
Unlike a database query that either succeeds or fails with a clear error code, LLM calls succeed at the HTTP layer while failing at the semantic layer. Observability closes that gap.
How LLM Observability Differs from Traditional Observability {#how-llm-observability-differs}
Traditional observability — the logs, metrics, traces model — assumes deterministic systems. The same input produces the same output. Failures are binary.
LLM applications break both assumptions.
| Dimension | Traditional systems | LLM applications |
|---|---|---|
| Output determinism | Same input → same output | Same input → different outputs |
| Failure mode | Exception / error code | Plausible but wrong answer |
| Cost per request | Fixed (CPU/memory) | Variable (token consumption) |
| Latency driver | I/O, network | Model size, prompt length, load |
| What to measure | Latency, errors, throughput | All of the above + quality, tokens, context |
| Debug path | Stack trace | Prompt + context + model version |
The additional dimension is semantic correctness — whether the output is useful, accurate, and appropriate. Observability can't fully automate quality assessment, but it can surface the signals that make debugging possible: what prompt produced the bad output, what the context window contained, how many tokens were consumed.
What LLM Observability Tracks {#what-llm-observability-tracks}
Traces
A trace captures the full journey of a request through your LLM application. For a simple chatbot, that's: user input → prompt construction → LLM API call → response processing → output.
For a RAG pipeline, traces show: query → embedding → vector search → document retrieval → prompt assembly → LLM call → response. Each step is a span with its own timing and attributes.
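Here's a minimal sketch of what that span structure can look like in Python. The helper functions (retrieve_documents, rerank, generate_answer) and the rag.* attribute names are illustrative placeholders, not part of any standard:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-app")

def answer_query(query: str) -> str:
    # Parent span for the whole pipeline; child spans isolate each stage's latency.
    with tracer.start_as_current_span("rag.pipeline") as pipeline_span:
        pipeline_span.set_attribute("rag.query_length", len(query))

        with tracer.start_as_current_span("rag.retrieval") as retrieval_span:
            docs = retrieve_documents(query)  # embedding + vector search (hypothetical helper)
            retrieval_span.set_attribute("rag.documents_retrieved", len(docs))

        with tracer.start_as_current_span("rag.rerank"):
            docs = rerank(query, docs)  # hypothetical reranking step

        with tracer.start_as_current_span("gen_ai.chat") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "gpt-4o")
            return generate_answer(query, docs)  # wraps the actual LLM call
```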
Key span attributes for LLM calls (from the OTel GenAI semantic conventions):
- gen_ai.system — which model provider (openai, anthropic, etc.)
- gen_ai.request.model — model name (gpt-4o, claude-3-5-sonnet)
- gen_ai.usage.input_tokens — tokens consumed by the prompt
- gen_ai.usage.output_tokens — tokens in the completion
- gen_ai.response.finish_reason — why the model stopped (stop, length, content_filter)
Metrics
- Token usage — input/output tokens per request, per model, per user
- Latency — time to first token, total generation time, p50/p95/p99
- Error rate — rate limit hits, timeouts, content filter rejections
- Cost — derived from token usage × model pricing
- Throughput — requests per second your application handles
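A minimal sketch of recording these with the OTel metrics API in Python follows. The instrument names are based on the GenAI metric conventions (verify against the current spec), and the pricing table is a placeholder; check your provider's current rates:

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm-app")

# Instrument names based on the OTel GenAI metric conventions; verify against
# the current spec before relying on them.
token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}",
    description="Tokens used per LLM call",
)
operation_duration = meter.create_histogram(
    "gen_ai.client.operation.duration", unit="s",
    description="End-to-end LLM call duration",
)

# Placeholder prices in USD per 1K tokens; look up current rates for your models.
PRICING = {"gpt-4o": {"input": 0.0025, "output": 0.01}}

def record_llm_metrics(response, model: str, duration_s: float) -> float:
    attrs = {"gen_ai.system": "openai", "gen_ai.request.model": model}
    token_usage.record(response.usage.prompt_tokens,
                       attributes={**attrs, "gen_ai.token.type": "input"})
    token_usage.record(response.usage.completion_tokens,
                       attributes={**attrs, "gen_ai.token.type": "output"})
    operation_duration.record(duration_s, attributes=attrs)

    # Derived cost: tokens × per-token pricing.
    price = PRICING[model]
    return (response.usage.prompt_tokens / 1000) * price["input"] + \
           (response.usage.completion_tokens / 1000) * price["output"]
```

The returned cost can be emitted as its own metric or attached to the span, which gives per-feature cost breakdowns without calling a billing API.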
Logs
Structured logs capture prompt content, completion text, and contextual metadata. These are most useful for debugging specific failure cases — when you already know a response was wrong and need to see what the model received.
In production, log prompt/completion content selectively. At high volume, capturing every prompt is expensive and creates PII risk.
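One way to implement that selectivity, sketched in Python: gate content capture behind a flag and only attach payloads to spans that are actually being recorded. The environment variable name here is made up for illustration; OTel instrumentations expose their own opt-in settings for content capture.

```python
import json
import os
from opentelemetry import trace

# Illustrative flag; real instrumentations have their own opt-in setting.
CAPTURE_CONTENT = os.getenv("CAPTURE_GENAI_CONTENT", "false").lower() == "true"

def record_content(span: trace.Span, messages: list, completion: str) -> None:
    # Skip the (potentially large, PII-bearing) payloads unless explicitly
    # enabled and the span survived sampling.
    if not (CAPTURE_CONTENT and span.is_recording()):
        return
    span.add_event("gen_ai.content.prompt",
                   attributes={"gen_ai.prompt": json.dumps(messages)})
    span.add_event("gen_ai.content.completion",
                   attributes={"gen_ai.completion": completion})
```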
LLM Observability Architecture {#llm-observability-architecture}
A minimal LLM observability stack has three components:
```
┌─────────────────────────────────┐
│ LLM Application │
│ (instrumented with OTel SDK) │
│ │
│ spans: gen_ai.* attributes │
│ metrics: token_usage, latency │
│ logs: prompt/completion events │
└──────────────┬──────────────────┘
│ OTLP (gRPC or HTTP)
▼
┌─────────────────────────────────┐
│ OTel Collector (optional) │
│ - sampling │
│ - redaction (PII) │
│ - routing │
└──────────────┬──────────────────┘
│ OTLP
▼
┌─────────────────────────────────┐
│ Observability Backend │
│ (Last9, Datadog, Honeycomb…) │
│ - trace visualization │
│ - metric dashboards │
│ - alerting │
└─────────────────────────────────┘
```

For applications using multiple LLM providers or orchestration frameworks (LangChain, LlamaIndex), the collector layer becomes more important — it normalizes data from different sources before it reaches your backend.
Implementing LLM Observability with OpenTelemetry {#implementing-with-opentelemetry}
OpenTelemetry's GenAI semantic conventions define a standard schema for LLM spans. Instrumenting to this schema means your data works with any OTLP-compatible backend and doesn't lock you into a vendor-specific agent.
Python: Manual instrumentation with OpenAI
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
import openai
import json
# Configure the tracer
provider = TracerProvider()
provider.add_span_processor(
BatchSpanProcessor(
OTLPSpanExporter(
endpoint="https://otlp.last9.io/v1/traces",
headers={"Authorization": "Basic <your-token>"},
)
)
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")
client = openai.OpenAI()
def chat_with_observability(messages: list, model: str = "gpt-4o") -> str:
with tracer.start_as_current_span("gen_ai.chat") as span:
# Set standard GenAI attributes
span.set_attribute("gen_ai.system", "openai")
span.set_attribute("gen_ai.request.model", model)
# Record the prompt as a span event
span.add_event(
"gen_ai.content.prompt",
attributes={"gen_ai.prompt": json.dumps(messages)}
)
response = client.chat.completions.create(
model=model,
messages=messages,
)
# Record token usage
span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
span.set_attribute("gen_ai.response.finish_reason", response.choices[0].finish_reason)
span.set_attribute("gen_ai.response.model", response.model)
content = response.choices[0].message.content
# Record the completion
span.add_event(
"gen_ai.content.completion",
attributes={"gen_ai.completion": content}
)
        return content
```

Python: Using the Last9 GenAI SDK for conversation tracking
For multi-turn conversations or RAG workflows, the Last9 GenAI SDK adds conversation and workflow context on top of standard OTel:
```python
from last9_genai import conversation_context, workflow_context, Last9SpanProcessor
from opentelemetry.sdk.trace import TracerProvider
provider = TracerProvider()
provider.add_span_processor(Last9SpanProcessor())
# Track a multi-turn conversation — links all spans under one conversation_id
with conversation_context(conversation_id="session_abc123", user_id="user_456"):
first_response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is observability?"}]
)
second_response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": "What is observability?"},
{"role": "assistant", "content": first_response.choices[0].message.content},
{"role": "user", "content": "How does it apply to LLMs?"},
]
)
# Track a RAG pipeline — groups retrieval + generation under one workflow
with workflow_context(workflow_id="rag_001", workflow_type="retrieval"):
docs = retrieve_documents(query)
    response = generate_with_context(query, docs)
```

JavaScript: Instrumenting with the OTel Node SDK
```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const OpenAI = require('openai');
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'https://otlp.last9.io/v1/traces',
headers: { 'Authorization': 'Basic <your-token>' },
}),
});
sdk.start();
const tracer = trace.getTracer('llm-app');
const client = new OpenAI();
async function chatWithObservability(messages, model = 'gpt-4o') {
return tracer.startActiveSpan('gen_ai.chat', async (span) => {
span.setAttributes({
'gen_ai.system': 'openai',
'gen_ai.request.model': model,
});
try {
const response = await client.chat.completions.create({ model, messages });
span.setAttributes({
'gen_ai.usage.input_tokens': response.usage.prompt_tokens,
'gen_ai.usage.output_tokens': response.usage.completion_tokens,
'gen_ai.response.finish_reason': response.choices[0].finish_reason,
});
return response.choices[0].message.content;
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
throw err;
} finally {
span.end();
}
});
}
```

Once traces reach your backend, you can query token usage per model, trace slow requests back to prompt length, and correlate error spikes with model changes.
LLM Observability Tools Comparison {#llm-observability-tools}
| Tool | Best for | OTel support | Self-host | Pricing model |
|---|---|---|---|---|
| Langfuse | LLM-specific tracing, prompt management | Partial | Yes (OSS) | Free OSS, paid cloud |
| LangSmith | LangChain ecosystem, evals | No | No | Per-trace pricing |
| Arize Phoenix | Offline evals, model debugging | Yes | Yes (OSS) | Free OSS |
| Datadog | Existing Datadog users, unified platform | Yes | No | Per-host + usage |
| Last9 | OTel-native, unified metrics + traces + logs | Yes (native) | No | Usage-based |
| Helicone | OpenAI proxy, cost tracking | No | Yes (OSS) | Per-request |
How to choose:
- Building on LangChain → Langfuse or LangSmith for tight framework integration
- Already on Datadog → LLM monitoring as an add-on makes sense
- Want OTel-native, no vendor lock-in → Last9 or Phoenix
- Cost tracking is the priority → Helicone as a lightweight proxy
The main tradeoff is LLM-specific tooling (better prompt management, eval workflows) vs. unified observability platforms (all signals in one place, lower operational overhead). Most teams eventually want both: LLM-specific insights and correlation with infrastructure metrics.
Common LLM Issues Observability Catches {#common-issues}
Latency spikes — traces with gen_ai.usage.input_tokens reveal when context windows are ballooning. A prompt that grows unbounded across conversation turns is a common cause.
Cost overruns — token usage metrics aggregated by endpoint, model, and user identify which features consume disproportionate budget. A single poorly-scoped system prompt can multiply costs across all requests.
Silent failures — gen_ai.response.finish_reason: length means the model hit the output token limit and truncated its answer. This doesn't surface as an error; it surfaces as incomplete or cut-off responses. Only traces catch it.
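A small sketch of making that visible, assuming an OpenAI-style response object; the llm.response.truncated attribute is a custom name, not a standard one:

```python
from opentelemetry import trace

def flag_truncation(response) -> None:
    span = trace.get_current_span()
    finish_reason = response.choices[0].finish_reason
    span.set_attribute("gen_ai.response.finish_reason", finish_reason)
    if finish_reason == "length":
        # The API call succeeded, so mark the span rather than raising an error,
        # which makes truncated answers queryable and alertable.
        span.set_attribute("llm.response.truncated", True)
```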
Hallucinations by prompt pattern — correlating finish_reason, input length, and error rates across prompt variants helps identify which prompt structures produce unreliable outputs.
Rate limit cascades — error rate spikes with 429 status codes, correlated with throughput metrics, reveal when traffic patterns are hitting provider limits before your own infrastructure.
Model drift after version updates — latency and error rate comparisons across gen_ai.response.model values show when a provider silently rolls out a model update that affects your application.
Best Practices {#best-practices}
Instrument at the boundary, not inside it. Wrap your LLM client calls with OTel spans. Don't try to instrument inside the model itself — you can only observe the interface.
Use the OTel GenAI semantic conventions. Vendor-specific attribute names make your data non-portable. Standard attributes (gen_ai.usage.input_tokens, gen_ai.request.model) work across backends and enable community tooling.
Capture prompt/completion as span events, not attributes. Events are designed for variable-length content; attributes have size limits. Log them selectively in production — capturing every prompt at scale is expensive and creates PII exposure.
Build cost dashboards before you need them. Token usage metrics are cheap to collect. The first time a feature ships with a runaway prompt, you'll want cost data immediately — not after you've instrumented.
Trace multi-step pipelines end to end. A RAG pipeline with retrieval, reranking, and generation looks like one LLM call from the outside. Traces show you which step added latency. Retrieval taking 2 seconds vs. generation taking 2 seconds are completely different problems.
Sample aggressively in production, capture everything in dev. Head-based sampling at 1-5% in production keeps costs down. Always capture 100% of errors regardless of sampling rate.
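A sketch of head-based sampling in the Python SDK follows. Keeping 100% of errors on top of this typically happens downstream, for example with tail-based sampling in the OTel Collector, since a head-based sampler decides before any error exists:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~5% of root traces; child spans follow the parent's decision so
# sampled traces stay complete.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.05)))
trace.set_tracer_provider(provider)
```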
FAQ {#faq}
What is LLM observability? LLM observability is the practice of collecting traces, metrics, and logs from applications that use large language models to monitor performance, debug failures, and control token costs in production.
How is LLM observability different from traditional observability? Traditional observability monitors deterministic systems where failures throw errors. LLM applications can fail semantically — producing plausible but incorrect outputs, running over budget, or hitting context limits — without any error code. LLM observability adds token tracking, prompt/completion capture, and quality signals on top of the standard three pillars.
What metrics matter most for LLM observability? Token usage (input and output), end-to-end latency (especially time to first token for streaming), error rate by type (rate limits, timeouts, content filter), and cost per request derived from token consumption.
Does OpenTelemetry support LLM observability? Yes. The OTel GenAI semantic conventions define standard attribute names for LLM spans — model, tokens, finish reason, provider. Most LLM frameworks and providers are adopting these conventions.
Can I use my existing observability stack for LLM applications? If your backend is OTLP-compatible and supports traces and metrics, yes. You'll need to add LLM-specific instrumentation to your application code, but the data flows through the same collector and backend you already operate.
What is the best way to track LLM costs with observability? Capture gen_ai.usage.input_tokens and gen_ai.usage.output_tokens as span attributes on every LLM call. Multiply by current model pricing and aggregate by feature, user, or endpoint. This gives you cost breakdowns without a separate billing API.
How do I debug LLM hallucinations with observability? Observability surfaces the conditions under which hallucinations occur — prompt content, context length, model version, finish reason — but cannot automatically detect whether an answer is correct. Use traces to find the prompt that produced the bad output, then use evals or human review to assess quality.
Should I log all prompts and completions in production? No. Capture them as span events selectively — on errors, for sampled traces, or for specific features where debugging value is high. Logging every prompt at scale is expensive and creates PII/security exposure, especially for user-facing applications.
If you're instrumenting a Python application, the Last9 GenAI SDK adds conversation and workflow tracking on top of standard OTel — useful for multi-turn chatbots and RAG pipelines. Traces ship via OTLP to Last9 or any compatible backend.
For the full OTel setup, see Implement Distributed Tracing with OpenTelemetry and Instrument LangChain Apps with OpenTelemetry.
