AI observability

All articles tagged 'AI observability'

10,000 GPUs, One TSDB: Cardinality at GPU Scale

1,000 nodes × 8 GPUs × 60 metrics = 480K time series, before you add pod names or Slurm job IDs. GPU monitoring is a cardinality problem disguised as a metrics problem. How to design for it before production OOMs your Prometheus.
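The arithmetic in this teaser can be checked with a minimal sketch. The function name `series_count` and the `label_multiplier` of 3 are assumptions for illustration only (for example, three distinct pod names or Slurm job IDs seen per GPU series over the retention window); neither comes from the article.

```python
def series_count(nodes, gpus_per_node, metrics_per_gpu, label_multiplier=1):
    """Upper bound on active time series for a GPU fleet.

    label_multiplier is a hypothetical factor: the number of distinct
    values an extra label (pod name, job ID) takes per base series.
    """
    return nodes * gpus_per_node * metrics_per_gpu * label_multiplier


# Base cardinality from the three factors named in the teaser.
base = series_count(1_000, 8, 60)

# Assumed: each base series is seen under 3 distinct pod or job labels.
with_labels = series_count(1_000, 8, 60, label_multiplier=3)

print(f"{base:,}")         # 480,000
print(f"{with_labels:,}")  # 1,440,000
```

Every additional label multiplies the total rather than adding to it, which is why short-lived identifiers such as pod names and job IDs dominate series growth.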

Shekhar

Apr 21, 2026

From GPU Silicon to Business Metrics: The 8 Layers of GPU Observability

GPU observability isn't one thing: it's eight connected layers from silicon to cost. See why correlation across layers is what cuts debugging from 2 hours to 2 minutes, and why most teams instrument only one or two.

Shekhar

Apr 21, 2026

Every Token Has a Price: Per-Request GPU Cost Attribution

Flat per-token pricing is wrong by 10–50× per request. Prefill vs decode, batch sharing, and cache effects break the math. How to attribute real GPU cost (compute, energy, and dollars) to each inference request.

Shekhar

Apr 17, 2026

Last9 integration with TrueFoundry AI Gateway

TrueFoundry AI Gateway now integrates with Last9. Get unified observability for LLM traffic alongside your existing traces, metrics, and logs.

Sahil Khan

Dec 18, 2025

9 Monitoring Tools That Deliver AI-Native Anomaly Detection

A technical guide comparing nine observability platforms built to detect anomalies and support modern AI-driven workflows.

Anjali Udasi

Dec 1, 2025

What Are AI Guardrails

Learn the core concepts of AI guardrails and how they create safer, more reliable, and well-structured AI systems in production.

Anjali Udasi

Nov 5, 2025