DevOps for ETL Processing at Scale with MongoDB, Solr, AWS and Chef
TRANSCRIPT
Proprietary and confidential information of stackArmor
PRESENTATION FOR DEVOPSDC, MAY 17, 2016
ETL processing at scale with MongoDB and Solr on AWS using Chef
Delivering innovation with Cloud, Data Analytics and Automation
• Cloud orchestration and automation
• Migration and Operations support
• Cloud hosting and SaaS Development
www.stackArmor.com
Our Partnerships
The Customer
• Big Data Analytics SaaS firm
◦ 500 million records every month needed to be processed as fast as possible at the lowest possible cost
◦ The current on-premises infrastructure was taking 3 weeks to run and complete the process
◦ Processing costs were a major concern; the job had to run within a very tight and specific budget
◦ Due to HIPAA compliance reasons, the application components could not be altered
◦ Needed to process 5 million records per hour, complete the run in 1-2 days, and tear down the environment post-completion (see the arithmetic sketch below)
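As a back-of-the-envelope check, the stated volume and target rate imply the following; a minimal arithmetic sketch (reading a single batch run as covering a subset of the monthly volume is our assumption):

```python
# Back-of-the-envelope throughput math from the stated requirements.
monthly_records = 500_000_000   # records received per month
target_rate = 5_000_000         # records per hour (stated goal)

hours_needed = monthly_records / target_rate
print(f"At {target_rate:,} records/hour, {monthly_records:,} records "
      f"take {hours_needed:.0f} hours (~{hours_needed / 24:.1f} days)")
# -> 100 hours, roughly 4 days for the full monthly volume; a single
#    batch run fitting the 1-2 day window would cover a subset of it.
```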
ETL System Overview
Diagram: Input Files -> Parsing Component -> Staging Collection (JSON docs) -> Ingestion Component -> Final Collection, with a Search Index consulted during ingestion.
1 - Read input records
2 - Store JSON docs
3 - Read each JSON doc
4 - Search for docs which might be a match
5 - Get the set of candidate doc IDs
6 - After evaluating candidates, merge with an existing doc, or save as a new doc
A sketch of this loop follows below.
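A minimal sketch of the match-merge loop in steps 3-6 (database, collection, field, and core names are illustrative assumptions; the production components were Java services, not Python):

```python
# Illustrative sketch of the match-merge ingestion loop (steps 3-6).
import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # local mongos
staging = client["etl"]["staging"]                  # hypothetical names
final = client["etl"]["final"]
SOLR_SELECT = "http://localhost:8983/solr/records/select"  # hypothetical core

def candidate_ids(doc):
    """Steps 4-5: ask the search index for docs that might match."""
    params = {"q": f'name:"{doc.get("name", "")}"', "fl": "id", "wt": "json"}
    resp = requests.get(SOLR_SELECT, params=params).json()
    return [hit["id"] for hit in resp["response"]["docs"]]

for doc in staging.find():          # step 3: read each staged JSON doc
    matches = candidate_ids(doc)    # steps 4-5: get candidate doc IDs
    if matches:                     # step 6: merge into an existing doc...
        final.update_one({"_id": matches[0]}, {"$set": doc})
    else:                           # ...or save it as a new doc
        final.insert_one(doc)
```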
Summary of Process
• Customer receives batch files with user data from different sources
• Needs to reconcile them with existing records
• Update info or create new records
• Most users already exist in the DBs
The Stack
• MongoDB 2.6
  AWS environment with 10 mongo shards, each with one primary, one replica, and one arbiter. The primaries and replicas run on r3.2xlarge instances. Each has 4 EBS volumes attached: 1x1TB, 1x60GB, and 2x25GB with 3000, 180/3000, 250, and 200 IOPS, respectively.
• Mongo config
  Three mongo config instances run on 3 m3.large instances. These instances also run the ZooKeeper processes used to manage the SolrCloud cluster; we saw issues with ZooKeeper stability.
• Mongos
  Mongos nodes run on the same VM as our application logic; each application server connects to a local mongos node (see the sketch below).
• Solr 5.4.1
  A SolrCloud cluster with 20 shards and a replication factor of 5. Each primary and secondary is deployed to a c4.4xlarge instance. Each instance has 3 EBS volumes: 2x60GB and 1x200GB with 180/3000, 1000, and 600/3000 IOPS, respectively.
• Application nodes
  Ingestion/indexing nodes run on c4.xlarge instances; 10 such nodes run Tomcat 7.
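Because every application server talks to a mongos on localhost, the client side stays simple; a sketch (the real app was Java on Tomcat, so this is the pymongo equivalent):

```python
# Each app server connects to the mongos running on its own VM, which
# routes reads/writes across the 10 shards.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # local mongos
print(client.admin.command("listShards"))          # enumerate shard replica sets
```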
Design Considerations
• Wanted to build the environment by "hand" to save time and meet project goals; automate now or later?
  ◦ Which automation and orchestration technology to use?
• What is the optimal shard-to-replica ratio?
• To P-IOPS or not to P-IOPS? Or supersize the instance?
• RAID or not? If so, which level?
• Scale-up or scale-out? Cost considerations
• Solr versus SolrCloud?
• Goal was to process 5 million records per hour
• Money, money, money?
The Math
• Needed to process a total data set of 2TB for staging and 2TB for production
  ◦ Estimated 50 shards with 80GB per shard
  ◦ Choose an instance size with enough memory to fit each ~80GB shard (worked out in the sketch below)
• Separate mongos nodes, or colocated with the app nodes?
• P-IOPS disks vs RAID 0
  ◦ Cost vs performance
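The shard-sizing arithmetic, worked out:

```python
# Shard sizing from the figures above: 2 TB staging + 2 TB production
# spread across an estimated 50 shards.
total_tb = 2 + 2
shards = 50
gb_per_shard = total_tb * 1024 / shards
print(f"{gb_per_shard:.0f} GB per shard")  # -> ~82 GB, matching the ~80 GB estimate
# Instance memory then has to hold a shard's working set comfortably,
# which pushes the choice toward memory-heavy instance families.
```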
Instance Selection
• Scale-up versus scale-out
• I/O calculations
Name         vCPU  Memory (GB)  Instance Storage  I/O         On-Demand ($/hr)  Cost per GB RAM  Cost per Core per GB RAM
r3.8xlarge   32    244          SSD 2 x 320       10 Gigabit  $ 2.6600          $ 0.01090        $ 0.0000447
r3.4xlarge   16    122          SSD 1 x 320       High        $ 1.3300          $ 0.01090        $ 0.0000894
r3.2xlarge   8     61           SSD 1 x 160       High        $ 0.6650          $ 0.01090        $ 0.0001787
m2.4xlarge   8     68.4         2 x 840           High        $ 0.9800          $ 0.01433        $ 0.0002095
cr1.8xlarge  32    244          SSD 2 x 120       10 Gigabit  $ 3.5000          $ 0.01434        $ 0.0000588
m4.xlarge    4     16           --                High        $ 0.2390          $ 0.01494        $ 0.0009336
m4.10xlarge  40    160          --                10 Gigabit  $ 2.3940          $ 0.01496        $ 0.0000935
m4.4xlarge   16    64           --                High        $ 0.9580          $ 0.01497        $ 0.0002339
m4.2xlarge   8     32           --                High        $ 0.4790          $ 0.01497        $ 0.0004678
m3.2xlarge   8     30           SSD 2 x 80        High        $ 0.5320          $ 0.01773        $ 0.0005911
m3.xlarge    4     15           SSD 2 x 40        High        $ 0.2660          $ 0.01773        $ 0.0011822
d2.8xlarge   36    244          HDD 24 x 2000     10 Gigabit  $ 5.5200          $ 0.02262        $ 0.0000927
This table helped us find the right instance for us; there were other considerations as well, such as P-IOPS or not.
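The table's ranking criterion is easy to reproduce; a sketch over a few of the rows (prices are the 2016 on-demand rates shown above):

```python
# Reproduce the "cost per GB RAM" ranking for a few candidates.
instances = {                   # name: (on-demand $/hr, GB RAM)
    "r3.2xlarge": (0.665, 61),
    "m4.2xlarge": (0.479, 32),
    "m3.2xlarge": (0.532, 30),
    "d2.8xlarge": (5.520, 244),
}
for name, (price, ram) in sorted(instances.items(),
                                 key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:12s} ${price / ram:.5f} per GB RAM per hour")
# The r3 family wins on memory cost, which matters when each node has to
# keep a large MongoDB working set in RAM.
```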
Initial Design
• <insert initial design>
Final Design
• MongoDB nodes
  ◦ 10x c4.8xlarge
• Solr nodes
  ◦ 100x c4.4xlarge
  ◦ Max heap 6 GB
  ◦ Replication factor of 1 (sharded, non-cloud mode)
• Separate EBS volumes for MongoDB data, journal, and logs
• 4x1TB EBS RAID0 drives for MongoDB data
The Tuning Journey
• Driving optimal load from mongos to Solr
• Throttling the number of Java threads so they generate only the load the cluster can handle (see the sketch below)
• Maximizing CPU utilization on the Solr nodes while not saturating network bandwidth
• Using the 10 Gb network between EC2 instances
• Determining the IOPS requirement that maximizes throughput (number of ingestion records processed)
• Gradually increasing threads instead of overloading Solr at the start
• Running Solr's optimize function periodically to sustain steady Solr search response times
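A minimal sketch of two of the tactics above, gradual thread ramp-up plus periodic optimize (the thread ceiling, ramp interval, optimize period, and core name are all assumptions; the production ingesters were Java):

```python
# Sketch: ramp ingestion threads up gradually and run Solr's optimize
# periodically to keep search latency steady.
import threading
import time
import requests

SOLR_UPDATE = "http://localhost:8983/solr/records/update"  # hypothetical core
MAX_THREADS = 32        # ceiling found empirically from Solr CPU/network headroom
RAMP_INTERVAL = 60      # seconds between starting additional workers
OPTIMIZE_PERIOD = 1800  # run optimize every 30 minutes

def ingest_next_batch():
    """Hypothetical stand-in: read staged docs and index them into Solr."""
    time.sleep(1)

def worker():
    while True:
        ingest_next_batch()

def optimizer():
    while True:
        time.sleep(OPTIMIZE_PERIOD)
        # Solr's update handler accepts optimize=true to merge index segments
        requests.get(SOLR_UPDATE, params={"optimize": "true"})

threading.Thread(target=optimizer, daemon=True).start()
for _ in range(MAX_THREADS):    # start slow instead of slamming Solr at full load
    threading.Thread(target=worker, daemon=True).start()
    time.sleep(RAMP_INTERVAL)
threading.Event().wait()        # keep the main thread alive
```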
Automation Saved the Day!
A parameterized ETL environment (Solr and the rest of the stack), with knobs for:
• Number of instances
• Type of instances
• Other params
Tuning by doing! A provisioning sketch follows below.
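A sketch of what a parameterized environment means in practice, using boto3 (the AMI ID and defaults are placeholders; the team's actual orchestration was built with Chef, with instance count and type as the tunable inputs):

```python
# Sketch: stand up N nodes of a given type from parameters, so each
# tuning run can rebuild the cluster with different knobs.
import boto3

def launch_cluster(count, instance_type, ami="ami-00000000"):  # placeholder AMI
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId=ami,
        InstanceType=instance_type,  # e.g. "c4.4xlarge" for Solr nodes
        MinCount=count,
        MaxCount=count,
    )
    return [i["InstanceId"] for i in resp["Instances"]]

ids = launch_cluster(count=10, instance_type="c4.xlarge")  # ingestion tier
```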
Some charts along the way
Graph showing throughput reaching around 75,000 records per 15 minutes for each node. We reached nearly 3 million records per hour, but with spikes (and errors).
Things are getting better!
Graph showing throughput reaching around 75,000 records per 15 minutes for each node. We reached nearly 3 million records per hour of sustained throughput.
What worked to improve throughput?
Our primary goal was to maximize throughput at the optimal price.
• Gradual increase in Solr load/throttle
• Running Solr's optimize frequently to spread the Solr index evenly and sustain Solr search times
• Use of the 10 Gb network between EC2 nodes to push throughput past the earlier ~3 Gb bandwidth
• Use of RAID0 with 4 disks to improve I/O, keeping the read queue smaller (sketched below)
• Use of a RAID read block size of 32 for MongoDB
• Disabled read-ahead for MongoDB
• Use of enhanced EBS networking
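The RAID0 and read-ahead items roughly correspond to commands like these (device names are placeholders, this must run as root, and reading the deck's "read block size of 32" as a 32-sector read-ahead is our assumption):

```python
# Sketch of the disk setup: stripe 4 EBS volumes into RAID0 for the
# MongoDB data path and tune read-ahead. Device names are placeholders.
import subprocess

CMDS = [
    # Stripe the four 1 TB EBS volumes into a single RAID0 device
    ["mdadm", "--create", "/dev/md0", "--level=0", "--raid-devices=4",
     "/dev/xvdb", "/dev/xvdc", "/dev/xvdd", "/dev/xvde"],
    ["mkfs.ext4", "/dev/md0"],
    # A small read-ahead (32 sectors = 16 KB) suits MongoDB's random reads
    ["blockdev", "--setra", "32", "/dev/md0"],
]
for cmd in CMDS:
    subprocess.run(cmd, check=True)
```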
Results
• Client has an automated, dynamically built environment
• Automated parts of the code deployment process
  ◦ (Jenkins -> S3 -> Nodes; the node side is sketched below)
• Capabilities delivered include:
  ◦ Offsite code archiving (S3)
  ◦ Infrastructure automation
  ◦ CloudTrail, VPC Flow Logs, and S3 access logs to help auditing
  ◦ Over 100 Chef recipes created
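The node-side half of that Jenkins -> S3 -> Nodes flow boils down to pulling the latest build artifact from S3; a sketch (bucket, key, and paths are placeholders):

```python
# Pull the artifact Jenkins published to S3 and drop it where Tomcat
# deploys from. Bucket, key, and paths are placeholders.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    "example-build-artifacts",              # bucket Jenkins uploads to
    "releases/ingest-latest.war",           # artifact key
    "/var/lib/tomcat7/webapps/ingest.war",  # Tomcat hot-deploy directory
)
```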
Next Steps
• Upgrade the technology stack and increase instance utilization using Apache Mesos
• Enable self-service for engineers:
  ◦ Create a new MongoDB/Solr cluster
  ◦ Start a cluster
  ◦ Stop a cluster
  ◦ Update the code on a cluster (deploy)
  ◦ Update configuration parameters
  ◦ Resize instances within a named cluster
• Create a Rundeck- or Jenkins-based dashboard
• Capture the state of all clusters and generate daily reports for management
• Optimize cost and performance of the ingestion cluster
StackArmor Demo
Thank you
Gaurav "GP" Pal
Principal, stackArmor.com
[email protected]
(571) 271 4396
www.stackArmor.com