© 2015 Autodesk
Scalable Eventing Over Mesos
Olivier Paugam, SW Architect / Autodesk Cloud
Big Data Montreal
Goals & Challenges
The Mission
General purpose, high-volume eventing system. Batch oriented I/O. Target audience: 20+ teams within Autodesk. Must be active/active across multiple data-centers. Must be able to scale at any time. Must be able to absorb traffic spikes. Must be accessible via a single API. Must be secure (transport + data at rest). Must not be tied to a specific provider.
A Few Use Cases
Application log pre-aggregation transport. Metering updates from our Platform API. Analytics transport prior to indexing. Event transport for Search, Activity & other services. Identity updates down to our IT systems. Editing increments for large 3D model collaboration.
Our 5 Technical Commandments
Must use Docker. Must run on Apache Mesos + Marathon. Must leverage Apache Kafka. Must be as autonomous & low-maintenance as possible. No automation scripting allowed (Chef, Salt, Ansible…).
Introducing Ochopod
Ochopod
100% open source! Novel container-centric orchestration model. A mix between a discovery system and an init system. No need for dedicated frameworks. Direct peer-to-peer HTTP I/O. Can run on Mesos, K8S, etc. Relies on ZooKeeper (ZK).
The Stack
How Does It Work?
Source of truth: ZooKeeper. Each container belongs to a “cluster”. A “leader” is picked per cluster. Leaders manage their peers via HTTP I/O. Settings are passed via environment variables. Eventually consistent.
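As a rough illustration of the "one leader per cluster" idea, here is a minimal in-memory sketch. It does not talk to ZooKeeper and is not Ochopod's actual code: the `Pod` model, the lowest-sequence-number rule (borrowed from the common ZooKeeper election recipe) and the `POD_` environment prefix are all assumptions made for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """One container registered under a cluster (hypothetical model)."""
    cluster: str   # name of the cluster the container belongs to
    seq: int       # registration order, similar to a ZooKeeper sequential node
    address: str   # where peers can reach the container over HTTP


def elect_leaders(pods):
    """Pick one leader per cluster: the pod with the lowest sequence number."""
    leaders = {}
    for pod in pods:
        current = leaders.get(pod.cluster)
        if current is None or pod.seq < current.seq:
            leaders[pod.cluster] = pod
    return leaders


def settings_from_env(environ=os.environ, prefix="POD_"):
    """Collect settings passed as environment variables (the prefix is made up)."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# Stand-in for what would be read from the source of truth.
registry = [
    Pod("kafka", 3, "10.0.0.7:8080"),
    Pod("kafka", 1, "10.0.0.5:8080"),
    Pod("play", 2, "10.0.0.9:8080"),
]
leaders = elect_leaders(registry)
```

In this sketch the leader of each cluster would then be the one configuring its peers over HTTP; if it disappears, rerunning the election over the remaining entries yields the next leader, which is one way to picture the "eventually consistent" behaviour.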
A Quick DIY Mini-PaaS
Proxy approach. 100% Mesos+Ochopod. Used for CI/CD as well. Proxy running on an edge node. Could easily factor OAuth2 in. Access via direct HTTPS or using a CLI. Toolkit to deploy, list, query, kill & update containers.
Building verticals at scale
Architecture
Phone Switch & State Machines
Going Global
Shooting For Higher Scales
Unit of scale == 1 Kafka topic. Keep the pressure on each broker constant. Every sub-system can be scaled independently. API protocol designed to account for nodes shutting down. Mix of horizontal scaling & sharding via RabbitMQ. Checkpoints + idempotency + state-machines. Ochopod is critical to enable scaling.
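The "checkpoints + idempotency" combination can be pictured with a small sketch. This is not the system's actual protocol: `StreamState`, `append_batch` and the offset convention are hypothetical, chosen only to show how a retried batch can be absorbed without duplicating events when a node shuts down mid-request.

```python
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Per-stream state: last committed offset plus the stored events."""
    checkpoint: int = 0
    events: list = field(default_factory=list)


def append_batch(state, start_offset, batch):
    """Append a batch idempotently and return the new checkpoint.

    The client sends the offset at which it believes its batch starts.
    Anything below the checkpoint was already stored, so a retried batch
    only contributes its unseen tail. A batch that starts past the
    checkpoint would leave a gap and is rejected.
    """
    if start_offset > state.checkpoint:
        raise ValueError("gap: client is ahead of the checkpoint")
    skip = state.checkpoint - start_offset   # events already stored
    fresh = batch[skip:]
    state.events.extend(fresh)
    state.checkpoint += len(fresh)
    return state.checkpoint


state = StreamState()
append_batch(state, 0, ["a", "b"])   # first attempt
append_batch(state, 0, ["a", "b"])   # retry after a lost reply: nothing is added
append_batch(state, 1, ["b", "c"])   # partially overlapping retry: only "c" is added
```

Because the checkpoint is returned on every call, a client that reconnects to a different node can resume from the last acknowledged offset instead of guessing what was stored.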
Conclusion
6 man-month effort. 6 open-source third-party components (Kafka, Zookeeper, RabbitMQ...). 3 deployments over 2 data-centers, using DCOS. 36+ c3.2xlarge CoreOS slaves on AWS/EC2 + VPC. ~20 Kafka brokers, ~40 Play! nodes. ~150 live containers. ~500 live streaming sessions at any time. ~30M events / ~65M API hits a day. < 5 minor incidents, no major incident to date. 1 single dev/op (!).
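To put the daily figures in perspective, they can be converted to average per-second rates. The short calculation below uses only the numbers quoted above and assumes the load is spread evenly over the day and across the ~40 Play! nodes, which real traffic with spikes will not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(daily_total):
    """Average rate per second for a daily total, assuming uniform load."""
    return daily_total / SECONDS_PER_DAY


events_per_s = per_second(30_000_000)     # ~30M events a day
hits_per_s = per_second(65_000_000)       # ~65M API hits a day
hits_per_node_per_s = hits_per_s / 40     # spread over ~40 Play! nodes
```

This works out to roughly 347 events and 752 API hits per second on average, or about 19 API hits per second per node.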
Issues & Next Steps
What does one do if a slave goes offline? Need for better placement constraints. Need for better storage schemes. The K8S “pod” concept is cool after all... We could invest in a dedicated Mesos framework. What about Spot instances?
https://github.com/autodesk-cloud/ochopod
Autodesk is a registered trademark of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to their respective holders. Autodesk reserves the right to alter product and services offerings, and specifications and pricing at any time without notice, and is not responsible for typographical or graphical errors that may appear in this document.