Openzipkin conf: Zipkin at Yelp
About me
- Prateek Agarwal
- Software Engineer, Infrastructure team @ Yelp
- Have worked on:
  - Python Swagger clients
  - Zipkin infrastructure
  - Maintaining Cassandra and ES clusters
Infrastructure overview
- 250+ services
- We <3 Python
- Pyramid/uwsgi framework
- SmartStack for service discovery
- Swagger for API schema declaration
- Zipkin transport: Kafka | Zipkin datastore: Cassandra
- Traces are sampled from live traffic at a very low rate (0.005%)
- A trace can also be generated on demand by passing a particular query param
pyramid_zipkin
- A simple decorator around every request
- Able to handle scribe | kafka transport
- Attaches a `unique_request_id` to every request
- No changes needed in the service logic
- Ability to add annotations using Python's `logging` module
- Ability to add custom spans (see the sketch below)
[Diagram: Service B running pyramid/uwsgi with pyramid_zipkin]
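A minimal wiring sketch of the above, assuming pyramid_zipkin's documented settings (`zipkin.transport_handler`, `zipkin.tracing_percent`) and the annotation-logging convention from older pyramid_zipkin docs; handler signatures vary across versions, so treat this as illustrative rather than Yelp's actual code:

```python
import logging
import time

from pyramid.config import Configurator
from py_zipkin.zipkin import zipkin_span  # span helpers now live in py_zipkin


def kafka_handler(stream_name, message):
    # Ship the encoded span to the Kafka transport
    # (producer setup omitted for brevity)
    ...


def main(global_config, **settings):
    settings['zipkin.transport_handler'] = kafka_handler
    settings['zipkin.tracing_percent'] = 0.005  # sample 0.005% of live traffic
    config = Configurator(settings=settings)
    config.include('pyramid_zipkin')  # traces every request; views stay unchanged
    return config.make_wsgi_app()


def my_view(request):
    # Annotations via Python's `logging` module; the logger name and payload
    # shape follow older pyramid_zipkin docs and are an assumption here.
    zipkin_logger = logging.getLogger('pyramid_zipkin.logger')
    zipkin_logger.debug({'annotations': {'cache_miss': time.time()}})

    # A custom nested span around an expensive section of the view
    with zipkin_span(service_name='service_a', span_name='expensive_work'):
        time.sleep(0.01)  # stand-in for real work
    return {'status': 'ok'}
```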
swagger_zipkin
- Eliminates the manual work of attaching zipkin headers
- Decorates over swagger clients (usage sketch below):
  - swaggerpy (swagger v1.2)
  - bravado (swagger v2.0)
[Diagram: Service A's swagger client wrapped by swagger_zipkin]
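A usage sketch following swagger_zipkin's README; the decorated client forwards calls untouched while attaching the zipkin (X-B3-*) trace headers. The service URL and operation names are placeholders:

```python
from bravado.client import SwaggerClient
from swagger_zipkin.zipkin_decorator import ZipkinClientDecorator

client = SwaggerClient.from_url('http://service-b/swagger.json')
zipkin_wrapped_client = ZipkinClientDecorator(client)

# Calls look exactly like the undecorated client's; zipkin headers are
# propagated downstream automatically.
pet = zipkin_wrapped_client.pet.getPetById(petId=42).result()
```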
Lessons Learned
- Cassandra is an excellent datastore for heavy writes
  - Typical prod writes/sec: 15k
  - It was able to handle even 100k writes/sec
Lessons Learned
- Allocating offheap memory for Cassandra cut write latency by 2x (config sketch below)
- Pending compactions also went down
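The slide doesn't name the exact setting; the usual knob for this is the off-heap memtable allocation mode available since Cassandra 2.1, sketched here as an assumption:

```yaml
# cassandra.yaml (sketch): keep memtable data off the JVM heap, easing GC
# pressure on the write path
memtable_allocation_type: offheap_objects
```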
Lessons Learned
- With more services added, fetching from Kafka became a bottleneck
- Solutions tried:
  - Adding more Kafka partitions
  - Running more instances of the collector
  - Adding multiple Kafka consumer threads (with appropriate changes in openzipkin/zipkin): WIN
  - Batching up messages before sending to Kafka (with appropriate changes in openzipkin/zipkin): BIG WIN (see the sketch below)
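An illustrative producer-side batching sketch using kafka-python (not the actual openzipkin/zipkin change, which was on the collector side): fewer, larger Kafka messages mean less per-message overhead for the consumer.

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    batch_size=64 * 1024,  # accumulate up to 64 KiB per broker batch
    linger_ms=50,          # wait up to 50 ms for a batch to fill
)


def send_span(encoded_span):
    # kafka-python buffers these and flushes them in batches, so each
    # network round trip carries many spans instead of one.
    producer.send('zipkin', encoded_span)
```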
Future Plans
- To be used during deployments to check for degradations (a hedged sketch follows this list)
  - Validate differences in the number of downstream calls
  - Check against any new dependency sneaking in
  - Time differences in the spans
- Create trace aggregation infrastructure using Splunk (WIP)
  - A missing part of Zipkin
- Redeploy the zipkin dependency graph service after improvements
  - The service was unprovisioned because it created 100s of gigs of /tmp files
  - The files got purged after the run (in ~1-2 hours), but meanwhile ops got alerted due to low remaining disk space
  - It didn't give much of a value addition
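A hypothetical deployment check (all names invented for illustration): diff the downstream span names between a baseline trace and a trace from the new deploy to catch a new dependency sneaking in.

```python
from collections import Counter


def downstream_counts(trace):
    """trace: a list of span dicts as returned by the Zipkin query API."""
    return Counter(span['name'] for span in trace)


def new_dependencies(baseline_trace, canary_trace):
    # Span names present after the deploy but not before point at a new
    # downstream dependency.
    return set(downstream_counts(canary_trace)) - set(downstream_counts(baseline_trace))
```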