Netflix (Remote)
Building Distributed Tracing infrastructure as a member of Observability Engineering.
- Currently developing a new system for trace data storage, modeled after data lake architectures
- Led the design and implementation of a new system for trace collection. Developed a sidecar/collector to support higher-volume, more reliable transport with lower overhead
- Developed long-requested aggregation and analytics features in the platform by deriving metrics from trace data, using Druid and Kafka
- Introduced a new form of sampling for trace data that allows non-ingress, mid-tier applications to make local decisions about sampling a request. Implemented it in platform libraries and rolled it out across the Netflix fleet
- Scaled tracing infrastructure by 10x to handle increased load from new domains
- Migrated the core storage system for trace data from Elasticsearch to a managed time-series abstraction, which reduced cost by ~20% while improving search capabilities