At scale, software monitoring is a nightmare. More so, when you have hockey-stick growth; this is when teams want to focus on products and feature development. Software monitoring is the big hairy monster no one wants to touch, because it’s bound to be a veritable mess.
One of our customers had a major problem of onboarding different components across microservices. Alerting was convoluted and frustrating. Correlations were non-existent, and cascading failures were rampant.
Google Sheets & Tribal Knowledge
Given the hasty nature of the software stack, and the velocity of the business, teams were using Google sheets to oversee onboarding of new components to be monitored. This presented its own set of issues:
- Details of these services were becoming stale quickly.
- Tonnes of manual work to fill the sheets. Not to mention the many mistakes while copying information from the sheet.
- Miss-communication due to the lack of a common language about these intricate details. Bin this under a combination of tribal knowledge, and lack of unilateral nomenclature.
- Every service or team used to provide details at random times and it was hard to keep up with telemetry data. It was chaotic.
- Massive time slippage: lots of to and fro just to confirm details of what to be monitored.
Ultimately, this added to the stress of multiple teams and Reliability across the board took a hit. Coverage was always lacking because of inconsistencies, and productive time was being wasted.
Something needed to change.
Building auto-discovery in a week
Last9 is a managed SaaS partner. We’re invested in the success of your monitoring stack, not just from a robustness of monitoring point of view. We care about the Total Costs of Ownership (TCO) around monitoring, and spend our time reducing costs for our customers.
We built an auto-discovery tool in a week to solve the problem of monitoring coverage. Auto-discovery has a framework to discover all details of any component, or its instances running in a system, by asking for minimum information from our customers. Above all, it requires little to no human intervention making it effective and resilient.
Unlocking coverage for 70+ microservices
There are two parts to auto-discovery:
- Component template: This is created based on component types like, load balancer and exporter, or source version.
- Configuration file: If a customer has to filter any specific data it will need those details, else it is a one-time setup for all services.
Auto-discovery accepts the configuration file, parses it, and fetches the component template. It uses both of these to discover component/service instances.
Points we considered while building auto-discovery
- Every customer instruments code depending on how their org works or talks. There are also tonnes of customizations in telemetry for each customer. Accommodating use cases to discover data by taking care of how variation in telemetry will be accommodated was challenging. For ex: the label
cluster
can have different variation like cluster_name
, source_cluster
, service_cluster
etc… across customers, which they might want to discover. - It is important to consider the ways to use auto-discovery, since it’s specific to component type, exporter version, and the source (cloud). Many variations can result in missing data, so keeping it explicit w.r.t to version and the source is important.
- We cannot use a heavy query to discover data, as it might result in an error/timeout or not being able to fetch at all.
- There are multiple sources to discover data and each has a different kind of authentication mechanism, building a framework around such variations is hard.
- There are also cases where you have to discover data by querying some initial data from different sources. This has its own challenges.
How auto-discovery helps
- Information on the product about our customers’ systems is always up to date — we designed auto-discovery keeping this in mind.
- We no longer have Google Sheets, all information is available on the tool.
- No manual work from the customer. Our internal Customer Reliability team supports the customer. We now have far more productive teams.
- Our Customer Reliability team started talking in the same terminologies and defined a nomenclature to get parity across the board. This also accelerated the onboarding of new components and democratized onboarding for multiple engineers in our customer’s organization.
- Since we get information from the source of truth (Levitate), the amount of validation required is reduced by a huge margin. We now have to merely take a final review. Manual work is eradicated, and errors are negated.
- Coverage is no longer a problem. We are able to onboard and maintain around 70 services with at least 5-6 components with ease.
These are just some of the points we’ve unlocked with a simple tool like auto-discovery. It's also a tangible example of our focus on helping improve productivity, and save costs for our customers. That’s the beauty of not having lock-ins, or being completely Open Source. ✌️
Do reach out to me if you want to know more about the tool, or just discuss anything around software monitoring.
You can also book a demo with us to understand Last9 and its products. I highly recommend you book a general demo, will help you sleep better knowing we have your back 😉