Blog

Stories, guides, and lessons from the world of observability

Instrumenting WordPress with OpenTelemetry: PHP Tracing, Browser RUM, and Error Capture in Production

WordPress powers 40% of the web but has no native observability story. Here's how to instrument it end-to-end with OpenTelemetry - PHP, browser RUM, and errors.

Prathamesh Sonpatki

10,000 GPUs, One TSDB: Cardinality at GPU Scale

1,000 nodes × 8 GPUs × 60 metrics = 1.4M time series - before you add pod names or Slurm job IDs. GPU monitoring is a cardinality problem disguised as a metrics problem. How to design for it before production OOMs your Prometheus.

Shekhar

From GPU Silicon to Business Metrics: The 8 Layers of GPU Observability

GPU observability isn't one thing - it's eight connected layers from silicon to cost. See why correlation across layers is what cuts debugging from 2 hours to 2 minutes, and why most teams instrument only one or two.

Shekhar

The GPU Metrics That Actually Matter

Most teams monitor three GPU metrics - utilization, temperature, memory. There are 50+ that matter, and the ones you skip cause your worst outages. A vendor-neutral guide across NVIDIA, AMD, and Intel Gaudi.

Shekhar

Your LLM Is Slower Than You Think

60% GPU utilization and 3-second response times? GPU utilization is the wrong signal for LLM inference. Here's why TTFT, KV-cache pressure, and queue depth - not utilization - predict user-facing latency.

Shekhar

Predicting GPU Failures Before They Cost You

Predict GPU hardware failures 48–72 hours in advance. A guide to the five rate-based signals (ECC error trends, XID events, thermal ramp, row remap exhaustion, PCIe downtraining) and how to combine them into a composite health score.

Shekhar

Every Token Has a Price: Per-Request GPU Cost Attribution

Flat per-token pricing is wrong by 10–50× per request. Prefill vs decode, batch sharing, and cache effects break the math. How to attribute real GPU cost - compute, energy, and dollars - to each inference request.

Shekhar

Best Incident Management Software for Engineering Teams (2026)

Compare 9 incident management tools: PagerDuty, Opsgenie, Incident.io, Rootly, FireHydrant, BetterStack, Grafana OnCall, Squadcast, and Last9. Features, pricing, and which fits your team.

Sahil Khan

Database Partitioning: Types, Strategies, and When to Use Each

How database partitioning works in PostgreSQL and MySQL. Range, list, and hash partitioning with SQL examples and guidance on when to partition vs shard.

Prathamesh Sonpatki