Housing.com Replaces ELK Stack with Last9, Eliminating Infra Overhead While Cutting Costs
- Real Estate Technology
- 200 engineers
- APAC
- Amazon Web Services
Housing.com operates one of India’s leading real estate platforms, serving millions of users with property listings, virtual tours, and transaction services. With 200 engineers managing a complex microservices architecture on AWS, maintaining system reliability and debugging production issues quickly is critical to user experience.
The platform processes 2TB of logs per day across 20+ technology pods, each with its own log format. This diversity, combined with unpredictable traffic spikes from sources like Google’s web crawler, created significant observability challenges that consumed valuable engineering resources.
Growing Pains
High Operational Overhead
Self-managed ELK cluster with 8 data nodes, 30 log fleet servers, and Kafka brokers kept one engineer occupied with maintenance every month
Fragile Log Pipeline
Grok filter failures when developers changed log formats led to missing logs and pipeline breakages
Inadequate Retention
With only 3-4 days of retention, logs were gone by the time teams had shipped immediate fixes and could start a detailed root cause analysis (RCA)
Scaling Challenges
Infrastructure couldn’t handle unpredictable traffic spikes, causing log lag and missing data during critical periods
Housing’s self-managed Elasticsearch cluster struggled with the complexity of standardizing logs across multiple business units and technology stacks. The team spent significant effort managing Grok filters at the Logstash level to parse raw application logs from diverse sources.
When developers changed log formats — adding new fields or modifying existing ones — the pipeline would break unless the team manually updated Grok patterns. This created a constant firefighting cycle. Over four years, Housing revamped their internal stack 2-3 times, each time encountering new scaling and reliability challenges.
Last9 allowed us to offload the operational overhead of scaling so we could focus on business alerting rather than infrastructure management. Earlier, one manpower monthly was dedicatedly working on our ELK setup, and my team was quite occupied with firefighting the self-managed stack.
Upendra Singh
Associate Director DevOps, Housing.com
The 3-4 day retention window created a particularly painful problem: teams would fix production issues first (the priority), but by the time engineers were freed up to conduct thorough root cause analysis, the relevant logs were already deleted. This incomplete evidence trail reduced RCA completion rates and prevented the team from identifying underlying systemic issues.
The Last9 Advantage
Managed Infrastructure
Eliminated dedicated engineering time for ELK maintenance and scaling
Dynamic Remapping
Regex-based log parsing with updates reflected in 10-15 minutes without service restarts
Extended Retention
14-day retention within budget enables complete RCA with historical correlation
Built-in Alerting
Replaced custom scripts with scheduled searches and Slack integration for anomaly detection
Housing evaluated multiple managed observability solutions but chose Last9 for its combination of cost-effectiveness, technical capabilities, and engineering-focused support. The support quality was a decisive factor — Housing needed a vendor that understood complex log aggregation requirements and could provide hands-on consultation rather than just documentation.
Last9’s remapping feature replaced Housing’s complex Grok filter management with a more maintainable approach. The team uses regex-based parsing with Last9’s testing tool to quickly validate patterns before deployment. If remapping fails, logs remain available in raw format rather than being lost entirely — a critical safety net that didn’t exist with their ELK setup.
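The raw-format fallback described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Last9's remapping syntax or Housing's actual rules: the log layout, the `ACCESS_LOG` pattern, and the `remap` function are all hypothetical.

```python
import re

# Hypothetical access-log layout, assumed for illustration:
#   <timestamp> <app_name> <method> <url> <status>
ACCESS_LOG = re.compile(
    r"^(?P<timestamp>\S+) (?P<app_name>[\w-]+) (?P<method>[A-Z]+) "
    r"(?P<url>\S+) (?P<status>\d{3})$"
)


def remap(line: str) -> dict:
    """Extract structured fields; on a mismatch keep the raw line instead of dropping it."""
    match = ACCESS_LOG.match(line)
    if match is None:
        # Fallback: the log stays searchable as raw text.
        return {"raw": line, "parsed": False}
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["raw"] = line
    fields["parsed"] = True
    return fields
```

A line that matches the pattern comes back with named fields, while a line in an unexpected format comes back as `{"raw": ..., "parsed": False}`. A format change therefore degrades parsing but does not lose the log.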
The migration involved close collaboration between Last9’s success engineering team and Housing’s DevOps engineers. Multiple deep-dive sessions helped transition Housing’s team from four years of ELK patterns to Last9’s different visualization and querying approach.
Industry’s Most Complex Remapping Rules
Housing runs one of the most complex remapping rule sets. The complexity comes from managing multiple business units with different logging mechanisms and no standardization across teams.
The remapping configuration handles:
- Service name standardization across inconsistent naming patterns
- URL extraction for API performance tracking in their tech scorecard
- Log routing between business units (Housing vs PropTiger) using field attributes
- Dynamic URL pattern aggregation using regex transformation to normalize URLs containing UUIDs and project IDs
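As a rough sketch of what rules like these do, the Python below normalizes URL paths, standardizes service names, and routes a record by a field attribute. It is not Last9's rule format. The placeholder tokens, the naming convention, and the `business_unit` attribute are assumptions made for the example.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID.
UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# A path segment made only of digits, e.g. a project ID.
NUMERIC_ID = re.compile(r"(?<=/)\d+(?=/|$)")


def normalize_url_path(url: str) -> str:
    """Collapse UUIDs and numeric IDs so similar URLs aggregate together."""
    path = url.split("?", 1)[0]        # drop the query string
    path = UUID.sub("{uuid}", path)    # replace UUIDs first
    return NUMERIC_ID.sub("{id}", path)


def standardize_service_name(name: str) -> str:
    """Map variants like 'Search_API' and 'search api' to one lowercase, hyphenated name."""
    return re.sub(r"[\s_]+", "-", name.strip().lower())


def route(record: dict) -> str:
    """Pick a destination from a field attribute ('business_unit' is a hypothetical name)."""
    return "proptiger" if record.get("business_unit") == "proptiger" else "housing"
```

For example, `normalize_url_path("/api/v2/projects/48213/units/3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b?src=web")` returns `/api/v2/projects/{id}/units/{uuid}`, so every project and unit rolls up under a single URL pattern.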
For their Databricks integration powering tech scorecard analytics, Housing needed aggregated 5xx error counts and URL analysis. The team migrated from fetching data from ELK to Last9, extracting fields like URL, URL_path, and app_name for service-to-service communication tracking.
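The aggregation behind such a scorecard can be sketched as a count of 5xx responses per service and URL path. The field names `app_name` and `URL_path` come from the description above. The `status` field, the record shape, and the `count_5xx` helper are assumptions, and the real pipeline runs in Databricks rather than in plain Python.

```python
from collections import Counter
from typing import Iterable


def count_5xx(records: Iterable[dict]) -> Counter:
    """Count server errors (HTTP 500-599) per (app_name, URL_path) pair."""
    counts: Counter = Counter()
    for record in records:
        if 500 <= record["status"] <= 599:
            counts[(record["app_name"], record["URL_path"])] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    {"app_name": "search-api", "URL_path": "/listings/{id}", "status": 502},
    {"app_name": "search-api", "URL_path": "/listings/{id}", "status": 500},
    {"app_name": "search-api", "URL_path": "/listings/{id}", "status": 200},
    {"app_name": "media-api", "URL_path": "/tours/{uuid}", "status": 503},
]
errors = count_5xx(sample)
```

Because the paths are already normalized at ingestion, the grouping key stays small and stable even when raw URLs contain unique identifiers.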
Last9’s remapping handles log parsing dynamically — no service restarts when patterns change. With ELK, grok filter changes required pipeline restarts and risked losing logs entirely.
Ashish Mishra
DevOps Engineer, Housing.com
This standardization happens at ingestion without touching application code, and changes take effect within 10-15 minutes. By comparison, the ELK setup required modifying Grok patterns and restarting services, which often broke the entire pipeline.
Built-in Alerting Replaces Custom Scripts
Housing previously ran auxiliary services that queried ELK data and sent alerts to stakeholders — additional infrastructure they had to maintain. Last9’s alerting capabilities eliminated this overhead entirely.
The team now uses:
- Scheduled searches for log pattern monitoring
- Service-level alerts with customizable thresholds
- No-data scenario detection to catch missing log indexes
- Slack integration for immediate team notification
All alert configuration happens within Last9 without external scripts or coordination with additional services.
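The logic behind a threshold alert with no-data detection can be sketched as follows. This is a conceptual illustration, not Last9's alert configuration: the `Rule` type, the `evaluate` function, and the message wording are hypothetical, and delivery to Slack is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    service: str
    max_errors: int  # threshold per evaluation window (hypothetical setting)


def evaluate(rule: Rule, error_count: Optional[int]) -> Optional[str]:
    """Return an alert message, or None when the service looks healthy.

    error_count is None when the scheduled search found no logs at all,
    which is treated as its own alert condition rather than as zero errors.
    """
    if error_count is None:
        return f"NO DATA: {rule.service} sent no logs in the window"
    if error_count > rule.max_errors:
        return (
            f"ALERT: {rule.service} logged {error_count} errors "
            f"(threshold {rule.max_errors})"
        )
    return None
```

Keeping "no data" separate from "zero errors" is the point of the sketch: a silent service often means a broken log pipeline, which a plain threshold check would report as healthy.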
Key Results
Zero Infrastructure Toil
Eliminated monthly dedicated engineering time previously spent maintaining ELK, Kafka, and Logstash
Higher RCA Completion Rate
14-day retention enables teams to perform detailed root cause analysis after resolving immediate production issues
Improved Reliability
Dynamic remapping prevents log loss from application format changes, with raw format fallback protection
Cost-Effective Scale
Managed solution fits within budget constraints while handling 2TB daily log volume with extended retention
Housing successfully deprecated their in-house ELK implementation, allowing their platform engineering team to focus on product development rather than observability infrastructure maintenance. The combination of extended retention, reliable ingestion, and built-in alerting has improved their ability to debug production issues and complete thorough post-mortems.
The team continues to work with Last9’s support on ongoing optimizations, including their Databricks integration for tech scorecard metrics.