Operational Excellence at Zeotap — Sherlock, The Path Tracking Service
This post walks through our journey with schedule-based job management systems, the gaps we found in them, and why we built an in-house microservice that launches jobs based on data availability.
Zeotap deals with data. As of today, our lean data engineering team manages 450+ jobs/day across our data pipelines. Tracking every job to keep the pipelines running smoothly and within SLA is a major operational task.
Challenges faced
At Zeotap, we use Kingpin, our in-house cron-based job management system, to manage our data pipelines. While Kingpin addresses the limitations we faced with Oozie, a cron-based scheduling system still poses the following challenges:
- Job failures due to missing data, i.e., a job is launched at its scheduled frequency even if its input data is missing.
- Irregular data arrival causes failures under a frequency-based coordinator.
- Delayed ingestion SLAs, because ingestion jobs are launched about a day after the data actually arrives (to leave enough buffer for delays in data arrival).
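
To make the first challenge concrete, here is a minimal sketch of a data-availability guard in Python. It is not how Kingpin or Sherlock work internally; the function names, and the assumption that the upstream writer leaves a `_SUCCESS` marker file in the input directory (a common Hadoop convention), are ours for illustration only.

```python
from pathlib import Path
from typing import Callable


def input_is_available(input_dir: Path, marker: str = "_SUCCESS") -> bool:
    """Return True only if the (assumed) completion marker file exists."""
    return (input_dir / marker).is_file()


def launch_if_ready(input_dir: Path, job: Callable[[Path], None]) -> bool:
    """Run `job` only when its input is present; return whether it ran."""
    if not input_is_available(input_dir):
        # A purely schedule-based trigger would launch the job here anyway
        # and fail on the missing input.
        return False
    job(input_dir)
    return True
```

A scheduler that calls `launch_if_ready` on every tick skips the run when the input is absent instead of failing, and picks the data up on the first tick after it arrives.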
The above failures accounted for over 40% of our total failures and increased the ops overhead for our engineering team. To decrease…