Observability is often a misunderstood and misused term. It has come to mean nothing and everything at this point.
You can simplify thinking about software observability by splitting it into two parts: a Radar and a Black Box.
Radar
Radar systems are real-time. They must be able to detect any anomalies, understand related and unrelated signals, handle out-of-order data or loss of signal, and absolutely cannot have a blip. They enable operators to have a live view of their systems and respond immediately.
Definition of Radar
In software observability parlance, metrics and events lend themselves amazingly well to radars (monitoring). They are easy to instrument and store, and fast to retrieve and alert on. They are must-haves for any business to keep the lights on.
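To make the radar idea concrete, here is a minimal sketch in Python of a monitor for a single metric. It tolerates out-of-order samples and treats silence as a loss of signal. The `Radar` class, its field names, and the status strings are all hypothetical and chosen only for illustration. A real setup would lean on a metrics store and an alerting engine rather than an in-memory list.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Radar:
    """Toy real-time monitor for one metric (hypothetical, for illustration)."""
    threshold: float       # alert when the windowed average exceeds this
    window_seconds: float  # how much recent history to average over
    stale_after: float     # seconds of silence treated as loss of signal
    samples: List[Tuple[float, float]] = field(default_factory=list)

    def record(self, timestamp: float, value: float) -> None:
        # Samples may arrive out of order, so insert them in timestamp order.
        bisect.insort(self.samples, (timestamp, value))

    def check(self, now: float) -> str:
        # Loss of signal: nothing received, or the newest sample is too old.
        if not self.samples or now - self.samples[-1][0] > self.stale_after:
            return "NO_DATA"
        # Average only the samples that fall inside the recent window.
        start = now - self.window_seconds
        recent = [v for t, v in self.samples if start <= t <= now]
        if recent and sum(recent) / len(recent) > self.threshold:
            return "ALERT"
        return "OK"


if __name__ == "__main__":
    radar = Radar(threshold=0.5, window_seconds=60, stale_after=30)
    radar.record(100.0, 0.9)
    radar.record(90.0, 0.8)       # arrives late, still counted
    print(radar.check(now=105.0)) # high error rate inside the window
    print(radar.check(now=200.0)) # no fresh samples for a long time
```

The point of the sketch is the shape of the problem: small numeric samples, a cheap check, and an answer available immediately.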
There is ONE purpose of radar systems: To avoid mishaps in the first place.
Black Box
What about black boxes? Before jumping into 'flight recorders': I realize everyone imagines them to be black, which they absolutely are not. They are painted bright orange so that they can be found easily after a mishap!
These black boxes are great at collecting all the data. They can record all signals, a timeline of events, actual user journeys, environments, configurations, and things you never knew existed. They have enough storage capacity to span at least the last three flights!
Definition of Black Box
In software observability parlance, logs and traces are best suited for understanding the sequence of events, user journeys, and code paths. They need lots of storage and are available with a latency of minutes. They are excellent for debugging, performing a root cause analysis (RCA), and helping software engineering teams improve the system's design.
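As a rough sketch of the black box idea, the Python snippet below keeps a bounded buffer of structured events and reconstructs the timeline for one request after the fact. The `FlightRecorder` class, the `trace_id` field, and the event layout are assumptions made for illustration, not the API of any particular logging or tracing library.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class FlightRecorder:
    """Toy bounded event log (hypothetical, for illustration)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) silently evicts the oldest event once full,
        # much like a recorder that only keeps the last few flights.
        self._events: Deque[Dict[str, Any]] = deque(maxlen=capacity)

    def record(self, timestamp: float, trace_id: str, message: str,
               **context: Any) -> None:
        # Store rich context alongside the message; detail matters more than speed.
        self._events.append({
            "ts": timestamp,
            "trace_id": trace_id,
            "message": message,
            "context": context,
        })

    def timeline(self, trace_id: str) -> List[Dict[str, Any]]:
        # Post-incident view: every retained event for one request, in time order.
        matching = [e for e in self._events if e["trace_id"] == trace_id]
        return sorted(matching, key=lambda e: e["ts"])


if __name__ == "__main__":
    recorder = FlightRecorder(capacity=1000)
    recorder.record(2.0, "req-1", "payment failed", status=502)
    recorder.record(1.0, "req-1", "checkout started", user="u42")
    recorder.record(1.5, "req-2", "search", query="shoes")
    for event in recorder.timeline("req-1"):
        print(event["ts"], event["message"], event["context"])
```

Unlike the radar, nothing here needs to answer in real time. The value comes from retaining enough detail to replay what happened.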
The answers from a black box can, at times, mean going back to the drawing board to change the design of the plane itself!
There is one primary purpose of flight recorder systems: to provide all possible details for a post-incident analysis, in other words an RCA.
If you liked this post, you would like my take on Understanding the model of failures and its connection to Software Reliability.