SREcon Americas 2025 Recap: Day 1

If you're an SRE, or just hang out with them enough, you know SREcon isn't just another tech conference. It's the yearly pilgrimage where reliability engineers get candid about what it's actually like in the trenches, building systems that don't fall over.

SREcon25 Americas is where engineers who lose sleep over uptime and distributed systems come together to geek out. The vibe is about challenging assumptions, diving down technical rabbit holes, and sharing stories about that one time everything broke... and how you made sure it never happened again.

New for 2025:

Discussion Track: Where we can talk about that time everything caught fire and the bash script that somehow saved your job.
AMA Sessions: A chance to grab those SREs who've been paged more times than they've slept and make them reveal their secrets.
Breakout Group Discussions: Where you'll find fellow developers who understand why that obscure race condition makes you twitch.
Unconference Sessions: The room picks the topic, because the hallway conversations are always better than the scheduled talks. Agreed?

Highlights from Day 1

Here are some of the talks I enjoyed at SREcon Americas 2025:

Safe Evaluation and Rollout of AI Models

Brendan Burns, Corporate VP for Azure Cloud Native, talked about safely rolling out AI models. Here's his take:

Online services now rely heavily on AI and LLMs for core user experiences, so safely deploying new models and prompts is critical to system reliability. But unlike traditional deployments, there's rarely a clear "working/broken" signal with AI systems.

Instead, we need to evaluate performance probabilistically across many user inputs. Any model or prompt change might improve some responses while degrading others.

Brendan walked through the approaches they used when building Azure Copilot - both explaining the unique reliability challenges with AI systems and showing the actual tools they're using in production right now to handle this mess.
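To make the "evaluate probabilistically" idea concrete, here's a minimal sketch of comparing a baseline and a candidate model on the same set of inputs. To be clear, this is not the tooling shown in the talk. The function names, the per-input quality scores, and the rollout rule are all my own assumptions, and I'm using a plain bootstrap confidence interval as one possible way to decide whether a change is a net improvement.

```python
import random


def paired_outcomes(baseline_scores, candidate_scores, tolerance=0.05):
    """Count inputs where the candidate scored better, worse, or about the same."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must be the same length")
    wins = losses = ties = 0
    for base, cand in zip(baseline_scores, candidate_scores):
        diff = cand - base
        if diff > tolerance:
            wins += 1
        elif diff < -tolerance:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


def bootstrap_mean_diff_ci(baseline_scores, candidate_scores,
                           n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for the mean per-input score change."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the per-input differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def should_roll_out(baseline_scores, candidate_scores, min_gain=0.0):
    """Hypothetical gate: proceed only if even the pessimistic estimate beats min_gain."""
    lower, _ = bootstrap_mean_diff_ci(baseline_scores, candidate_scores)
    return lower > min_gain


if __name__ == "__main__":
    # Made-up scores (for example, from a grader) for the same six prompts.
    baseline = [0.6, 0.7, 0.5, 0.9, 0.4, 0.8]
    candidate = [0.8, 0.6, 0.7, 0.9, 0.6, 0.7]
    print("wins/losses/ties:", paired_outcomes(baseline, candidate))
    print("95% CI for mean change:", bootstrap_mean_diff_ci(baseline, candidate))
    print("roll out?", should_roll_out(baseline, candidate))
```

The win/loss/tie counts show that a change can help some inputs while hurting others, and the interval gives a rough sense of whether the overall shift is real or just noise from a small sample.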

Improving the SRE Experience for 10 Years as a Free, Open, and Automated Certificate Authority

Matthew, the tech lead for Let's Encrypt's SRE team, took us through the history of Let's Encrypt and dropped some practical knowledge for anyone dealing with TLS certificate headaches.

He covered the context you actually need when managing certs, plus he gave us a heads-up about upcoming changes and where things are headed. Bonus: we heard how they've tried to make life easier for SREs everywhere (and how the SRE community has helped them right back).

Distributed Tracing in Action: Our Journey with OpenTelemetry

Chris Detsicas from ThousandEyes shared the real story of their OpenTelemetry tracing implementation - warts and all.

He walked us through their journey, the walls they hit, and the tough calls they made while adopting OTel tracing. Chris got into the nitty-gritty of context propagation problems (which are a pain), why auto-instrumentation is worth the effort, and how testing saved their bacon.

We also got a look at how they built their pipeline, with concrete examples of how tracing has actually helped them spot issues, debug faster, and understand where their apps are really spending time.
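Since context propagation came up as a pain point, here's a rough sketch of what it means on the wire. This is not ThousandEyes' code and it isn't the OpenTelemetry SDK. It's a standard-library illustration of the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. The helper names (`extract`, `start_span`, `inject`) are my own, it only handles version `00`, and it assumes header names are already lowercased.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent (version 00): 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def extract(headers) -> Optional[SpanContext]:
    """Read the caller's span context from incoming headers, if it is valid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", "").strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, flags)


def start_span(parent: Optional[SpanContext]) -> SpanContext:
    """Create a span: reuse the parent's trace ID, or start a new trace."""
    if parent is None:
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), "01")
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.flags)


def inject(ctx: SpanContext, headers: dict) -> dict:
    """Write the current span context into outgoing request headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"
    return headers


if __name__ == "__main__":
    incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    current = start_span(extract(incoming))
    print(inject(current, {}))
```

If any hop forgets the `inject` step, or a proxy drops the header, the downstream service starts a brand-new trace and the request shows up as disconnected fragments. That's the kind of breakage that makes propagation bugs so annoying to track down.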

Techniques Netflix Uses to Weather Significant Demand Shifts

Joseph Lynch from Netflix took us behind the curtain to show how they handle massive traffic shifts across their global architecture.

Netflix deals with hundreds of device types connecting worldwide, and their traffic patterns can spike by orders of magnitude in minutes. Joseph broke down how they keep things running when this happens across their sprawling system of edge gateways, microservices, caches, and databases.

The real magic happens in their data layer. Joseph explained how data gateways with built-in resilience, careful capacity planning, sharding strategies, and smart caching make all this possible.

AMA with David Woods

Dr. David Woods, a cognitive psychologist and systems safety expert, led a wide-ranging discussion on resilience in systems and human-machine collaboration - topics that hit home for anyone doing SRE work.

He explored how systems adapt when things go sideways, why complex systems break in predictable ways, how our brains process information from interfaces, and more.

His research has uncovered the key factors that help systems build resilience and thrive despite the complexity penalties that come with growth.

His books include "Behind Human Error," "Resilience Engineering" (the first in the field), "Resilience Engineering in Practice," and "Joint Cognitive Systems."

Mapping a Better Future with STPA

Theo Klein shared how Google Maps is using Systems Theoretic Process Analysis (STPA) to catch reliability issues before they become production nightmares.

While traditional SRE approaches focus on component failures, Theo pointed out that many of our worst outages actually come from unexpected interactions between systems that are all "working as designed."

That's where STPA comes in. He walked through a real case study where this approach caught critical design flaws that would have been invisible to conventional methods, saving them from months of painful rework after the fact.

If you're at SREcon Americas 2025, drop by the Last9 booth (#22). We'd love to hear about your SRE experiences and challenges.

I'm already looking forward to Day 2 of SREcon Americas 2025. Don't forget to catch up on all the talks once the recordings are out!