scuba: diving into data at facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf ·...
TRANSCRIPT
![Page 1: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/1.jpg)
Scuba: Diving into Data at Facebook Presenter: Lavanya Subramanian
1
![Page 2: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/2.jpg)
Need for Data Analysis
• Performance monitoring
– Detect unexpected performance drops/rises
• Pattern mining
– Understand user response to new features
• Ad revenue monitoring
– Identify regional drops/rises in ad clicks and revenue
2
![Page 3: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/3.jpg)
Data Analysis at Facebook
• Large data volumes
• Real time analysis of this data
• Key Requirements
– Low latency
– Flexibility
– Scalability
3
![Page 4: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/4.jpg)
Proposed Solution: Scuba
• Structure
– In-memory database
– Across hundreds of servers
• How does it work?
– Holds and processes sampled real-time data
– Query interface to access data
– Visualization interface to analyze data
4
![Page 5: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/5.jpg)
Architecture
Server
Leaf nodes
5
![Page 6: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/6.jpg)
Data Layout
• Data stored in tables
• Data types supported
– Integers, strings, sets of strings, vectors of strings
• Different compression for different data types
Table Characteristics
• Table is created upon data arrival at a leaf node
• Table can have empty columns; treated as null
6
![Page 7: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/7.jpg)
Data Ingestion into Scuba
Leaf nodes
Scribe
7
![Page 8: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/8.jpg)
Data Ingestion into Scuba
• Events are sampled to reduce the data volume
• Use Scribe, a distributed messaging system to
– Collect, aggregate and deliver data to Scuba
• For each batch of incoming data
– Pick two leaf nodes at random
– Send the batch to the node with more free memory
• Data compressed and sent to disk
• Data then read back and stored in memory
8
![Page 9: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/9.jpg)
Dealing with Old Data
• Memory capacity is a concern
• Need to add new servers every 2-3 weeks
• Delete data based on
– Age: Sample and preserve a fraction of old data
– Space: When exceeding space limits, delete old data
9
![Page 10: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/10.jpg)
Querying Scuba
• Three kinds of interfaces
– Web-based
– SQL
– API to support querying from application code
• Queries supported
– Different forms of aggregation
– Percentiles, histograms
• Joins not supported by Scuba
10
![Page 11: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/11.jpg)
Query Execution
Root Aggregator
Leaf nodes
Intermediate Aggregators
Leaf Aggregators
11
![Page 12: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/12.jpg)
Query Execution
• Leaf node may or may not contain a table’s data
– Depends on the table size and age
• Data scanning is usually by time range
– Time is Scuba’s only notion of index
• Results of a node are omitted beyond a time out
– Small missing pieces of data do not affect accuracy of computations much
– Lower response time is a bigger requirement
12
![Page 13: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/13.jpg)
Performance Model
• Breaks down the latencies of different components
• Function of fanout, processing time at each aggregator, depth of tree
13
![Page 14: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/14.jpg)
Experimental Setup and Queries
• 4 racks of 40 machines
• Machine configuration
– Intel Xeon E5-2660
– 2.2 GHz
– 144 GB DRAM memory
• 10G ethernet
• Scan query, Time series query
14
![Page 15: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/15.jpg)
Speedup and Scaleup
15
![Page 16: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/16.jpg)
Throughput
16
![Page 17: Scuba: Diving into Data at Facebook - cs.cmu.edupavlo/courses/fall2013/static/slides/scuba.pdf · Proposed Solution: Scuba •Structure –In-memory database –Across hundreds of](https://reader033.vdocuments.net/reader033/viewer/2022042801/5a7bd3447f8b9a72118c3c70/html5/thumbnails/17.jpg)
Discussion
• Details on the kind of data stored and analyzed
• Performance numbers for a wider set of queries
• Are these query throughputs good enough?
– Might be fine for an internal system
17