JDD2014: Real Big Data - Scott MacGregor
DESCRIPTION
- The evolution of Big Data, both inside Akamai and in the industry.
- The current Big Data ecosystem with real-world examples.
- Challenges in Big Data and future directions.
TRANSCRIPT
![Page 1: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/1.jpg)
Real… Big… Data… and its constant evolution
Scott MacGregor
![Page 2: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/2.jpg)
Who is this guy?
![Page 3: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/3.jpg)
Akamai Big Data Infrastructure
150,000 collector nodes
5000 map/reduce nodes
Billions of jobs per day
![Page 4: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/4.jpg)
What is Big Data?
![Page 5: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/5.jpg)
The V’s
![Page 6: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/6.jpg)
Data that is Big
From Hortonworks
![Page 7: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/7.jpg)
What’s it really about?
![Page 8: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/8.jpg)
From the beginning…
• Akamai needed a billing system and scalable monitoring
• The Open Source community wanted a search engine
• Yahoo needed better product analytics for page views
• Google needed more scalable computation for ad management
• Facebook needed real-time updates to social graph
• LinkedIn needed a real-time activity data pipeline
• Twitter needed hashtag and topic streams
• Amazon needed durable shopping carts
• Netflix needed a recommendation engine
![Page 9: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/9.jpg)
Big Data timeline
[Timeline figure: two tracks, Akamai and Industry, spanning 1998 to 2014]
Years shown: 1998, 2001, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2014
Akamai
• Generalized map/reduce on 1 machine
• Decentralized job scheduling, multiple machines, File System DB
• Wide area, real-time, in-memory system monitoring
• Geographical redundancy
• Real-time reporting, Columnar DB
• Distributed File System DB
• Wide-area MapReduce, ExaByte Query
Industry
• Google MapReduce, Google FS
• Nutch, Yahoo spins off Hadoop
• Amazon Dynamo
• NoSql
• HBASE, Neo4J
• Facebook Cassandra, LinkedIn Kafka
• Twitter Storm, Facebook Presto
![Page 10: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/10.jpg)
How it works…
![Page 11: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/11.jpg)
Big Data modes
• Batch
  – Computation over a large static data set
  – Results are complete
• Online
  – Computation on data as it’s generated
  – Localized results, must be aggregated downstream
![Page 12: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/12.jpg)
Big Data primitives
• Collection • Parsing • Partitioning • Filtering • Throttling • Aggregation • Tracking • Validation • Analysis
![Page 13: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/13.jpg)
Collection
• What
  – Logs
  – Metadata
  – System stats
  – Application events
  – Application stats
  – Network data
• How
  – Email
  – SPDY
  – HTTP POST
  – SCP
  – Scribe
  – Avro
  – Custom
![Page 14: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/14.jpg)
Parsing
• Read lines or blocks and split into fields
• Transform, e.g. protobuf
• Map keys to values
Example log line:
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
Example key/value records:
1359486900 1423 a440.phobos.apple.com 1 3158
1359486900 1423 200 1 30128
1359486900 1423 1 209158
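The parsing step can be sketched in Python as follows. This is only an illustration: the field positions (timestamp at index 1, a byte count at index 4, an account code at index 8, status at index 10, host as the last field) are assumptions read off the example line, not a documented log format, and the pipeline names are hypothetical. The counts and byte totals shown on the slide are not derived here.

```python
FIVE_MINUTES = 300  # bucket width in seconds

# The example log line from the slide, split across string literals for width.
LOG_LINE = (
    "S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 "
    "itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 "
    "- - - - - W - "
    "us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg "
    "- a440.phobos.apple.com"
)


def parse_line(line):
    """Split one log line into fields and emit one (key, value) per pipeline."""
    fields = line.split()
    # ASSUMPTION: these field positions are guesses based on the example line.
    bucket = int(float(fields[1])) // FIVE_MINUTES * FIVE_MINUTES
    code = fields[8]          # assumed account/customer code
    status = fields[10]       # assumed HTTP status
    host = fields[-1]         # assumed request host
    nbytes = int(fields[4])   # assumed byte count
    value = (1, nbytes)       # (hit count, bytes)
    # Hypothetical pipeline names; each pipeline is one keyed stream.
    return {
        "by_host": ((bucket, code, host), value),
        "by_status": ((bucket, code, status), value),
        "total": ((bucket, code), value),
    }


records = parse_line(LOG_LINE)
for name, (key, value) in records.items():
    print(name, key, value)
```

One line fans out into several keyed records, which matches the idea that each log feeds many pipelines.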
![Page 15: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/15.jpg)
Partitioning
• Bucketing
  – Reduce to a single record per bucket
  – e.g. 5 minutes, /24, etc.
• Hashing
  – Bucket blocks or records of data by a hash function
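A minimal Python sketch of the two partitioning styles, assuming a 5-minute time bucket, a /24 subnet bucket, and CRC32 as the hash function. The function names are hypothetical and the hash choice is an assumption; the slides do not say which hash is used.

```python
import ipaddress
import zlib


def time_bucket(ts, width=300):
    """Round a Unix timestamp down to the start of its bucket (default 5 min)."""
    return int(ts) // width * width


def subnet_bucket(ip, prefix=24):
    """Map an IPv4 address to its enclosing network, e.g. a /24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def hash_partition(key, partitions):
    """Assign a record key to one of `partitions` buckets with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


print(time_bucket(1359487051.701))     # 1359486900
print(subnet_bucket("61.252.169.21"))  # 61.252.169.0/24
print(hash_partition("a440.phobos.apple.com", 16))
```

A stable hash such as CRC32 is used instead of Python's built-in `hash()` for strings, which is randomized per process and would send the same key to different partitions on different machines.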
![Page 16: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/16.jpg)
Filtering
• Statistical Methods
  – Top-k (HierarchicalCountSketch)
  – Set membership (Bloom filters)
  – Cardinality counting (HyperLogLog)
  – Frequency estimates (CountSketch)
  – Change detection (Deltoid)
• Sampling
  – Random
  – Reservoir
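Two of the filtering techniques above are small enough to sketch in Python: a Bloom filter for set membership and reservoir sampling (Algorithm R). The sizes, the SHA-256-based hashing, and the class and function names are assumptions chosen for illustration, not what any production system uses.

```python
import hashlib
import random


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


seen = BloomFilter()
seen.add("61.252.169.21")
print(seen.might_contain("61.252.169.21"))  # True
print(reservoir_sample(range(1000), 5))
```

Both structures use fixed memory regardless of how much data flows through, which is what makes them usable on a stream.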
![Page 17: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/17.jpg)
Throttling
• Limit on cardinality per partition
  – Requires central management
  – Drop records over max
• Remove or trim large fields
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - iPeV image/jpeg - - 44 3031 - - - - - W - ~
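Both throttling ideas can be sketched in a few lines of Python. The limits, the `~` placeholder handling, and the class name are assumptions for illustration. The slide notes that a real cardinality limit requires central management; this sketch keeps its state in a single process.

```python
from collections import defaultdict

MAX_FIELD_LEN = 24  # hypothetical threshold for a "large" field
PLACEHOLDER = "~"   # the slide's example uses "~" for trimmed fields


def trim_fields(fields, max_len=MAX_FIELD_LEN):
    """Replace any field longer than max_len with a placeholder."""
    return [f if len(f) <= max_len else PLACEHOLDER for f in fields]


class CardinalityThrottle:
    """Admit at most max_keys distinct keys per partition; drop the rest."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.seen = defaultdict(set)

    def admit(self, partition, key):
        keys = self.seen[partition]
        if key in keys:
            return True   # known key, always admitted
        if len(keys) >= self.max_keys:
            return False  # over the limit, record is dropped
        keys.add(key)
        return True


throttle = CardinalityThrottle(max_keys=2)
for key in ["a", "b", "c", "a"]:
    print(key, throttle.admit("partition-0", key))

print(trim_fields(["GET", "itunesus011.download.akamai.com", "200"]))
```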
![Page 18: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/18.jpg)
Aggregation
• Merge
  – Merge-sort blocks in a partition
• Reduce
  – Combine values for like keys
  – Sum, Min, Max, Mask, etc.
• Shuffle
  – Move the data to where it’s needed or closer to like data
Example input records:
1359486900 1423 1 209158
1359529800 1423 1 209158
1359486900 1423 1 209158
After Aggregate:
1359486900 1423 2 418316
1359529800 1423 1 209158
After Shuffle, by key:
{1423, 1359486900}: 2 418316
{1423, 1359529800}: 1 209158
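The aggregation example can be reproduced with a short Python sketch. The record layout (bucket, code, count, bytes) follows the example records; the function names and the use of `hash(key) % num_nodes` for the shuffle are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def merge_blocks(*sorted_blocks):
    """Merge already-sorted blocks into one sorted stream (merge-sort step)."""
    return list(heapq.merge(*sorted_blocks))


def reduce_sum(records):
    """Combine values for like keys by summing count and bytes."""
    totals = defaultdict(lambda: [0, 0])
    for bucket, code, count, nbytes in records:
        entry = totals[(code, bucket)]
        entry[0] += count
        entry[1] += nbytes
    return {key: tuple(value) for key, value in totals.items()}


def shuffle(aggregated, num_nodes):
    """Route each key to the node that owns it, so like keys end up together."""
    nodes = [dict() for _ in range(num_nodes)]
    for key, value in aggregated.items():
        nodes[hash(key) % num_nodes][key] = value
    return nodes


records = merge_blocks(
    [(1359486900, 1423, 1, 209158), (1359529800, 1423, 1, 209158)],
    [(1359486900, 1423, 1, 209158)],
)
totals = reduce_sum(records)
print(totals[(1423, 1359486900)])  # (2, 418316)
print(totals[(1423, 1359529800)])  # (1, 209158)
print(shuffle(totals, num_nodes=2))
```

Sum is used here; Min, Max, or a bit Mask would replace the two `+=` lines with the corresponding combine operation.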
![Page 19: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/19.jpg)
Tracking
• Tracking
  – Embed GUID in each data unit sent
  – Publish GUIDs independent from data flow
  – Completeness is expected (published GUIDs) vs. actual (embedded GUID)
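The completeness check amounts to a set comparison, sketched here in Python. The function name and the way GUIDs are generated are assumptions; the slides do not describe the GUID format or how the GUIDs are published.

```python
import uuid


def completeness(published, received):
    """Compare expected GUIDs (published) with GUIDs actually seen in the data."""
    published = set(published)
    received = set(received)
    missing = published - received
    if not published:
        return 1.0, missing
    return len(published & received) / len(published), missing


# Producer side: tag each data unit with a GUID and publish the GUID list
# through a channel that is independent of the data flow.
units = [{"guid": str(uuid.uuid4()), "payload": f"block-{i}"} for i in range(4)]
published_guids = [u["guid"] for u in units]

# Receiver side: one unit is lost in transit.
received_guids = [u["guid"] for u in units[:3]]

ratio, missing = completeness(published_guids, received_guids)
print(ratio)    # 0.75
print(missing)  # the GUID of the lost unit
```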
![Page 20: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/20.jpg)
Data integrity
• Watermark
  – Producer watermarks every n-lines with a crypto key
  – Receiver checks watermarks
• Checksum
  – Block checksums
  – Line CRC
  – Etc.
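One way to realize the watermark and checksum ideas is sketched below in Python, assuming an HMAC-SHA256 running over all lines so far and CRC32 for the line checksum. The `#WM ` marker, the key, and the function names are hypothetical; the slides do not specify the actual watermark scheme.

```python
import hashlib
import hmac
import zlib

WATERMARK_PREFIX = "#WM "  # hypothetical marker for watermark lines


def add_watermarks(lines, key, every_n):
    """Producer: after every n lines, emit an HMAC over all lines so far."""
    out = []
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for i, line in enumerate(lines, start=1):
        out.append(line)
        mac.update(line.encode("utf-8"))
        if i % every_n == 0:
            out.append(WATERMARK_PREFIX + mac.hexdigest())
    return out


def verify_watermarks(lines, key):
    """Receiver: recompute the running HMAC and compare at each watermark."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for line in lines:
        if line.startswith(WATERMARK_PREFIX):
            claimed = line[len(WATERMARK_PREFIX):]
            if not hmac.compare_digest(claimed, mac.hexdigest()):
                return False
        else:
            mac.update(line.encode("utf-8"))
    return True


def line_crc(line):
    """Per-line CRC32 checksum."""
    return zlib.crc32(line.encode("utf-8")) & 0xFFFFFFFF


key = b"shared-secret"  # hypothetical key shared by producer and receiver
stream = add_watermarks(["a", "b", "c", "d"], key, every_n=2)
print(verify_watermarks(stream, key))           # True
print(verify_watermarks(["x"] + stream[1:], key))  # False
```

In this sketch, lines after the last watermark are not covered, so a producer would typically close a block with a final watermark or a block checksum.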
![Page 21: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/21.jpg)
Analysis
• Online – Precomputed reports
• Batch – Spark Programs – Map/Reduce – Hive: HQL – SQL
![Page 22: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/22.jpg)
Big Data at Akamai
• Billing and Reporting • System monitoring • Media Analytics • Security • Log archive
![Page 23: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/23.jpg)
Billing and reporting
[Diagram components: Akamai Edge Networks and Products; Logs; Q; Parse; Pipelines; Shuffle; Split; Aggregate; Billing DB; Reporting; Customers; Internal Apps]
Parsing
• splits lines into fields
• maps keys to values per pipeline
• each log generates many pipelines
• each pipeline represents a streaming table
Evolution
• Logs were emailed (up to 1PB/day)
• Now delivered via SPDY (3PB/day)
Volume: 3 PB/day, doubles every year
![Page 24: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/24.jpg)
System monitoring
[Diagram components: Akamai Networks and Products; Client; SQL Parser; TLA; Agg (several); Alert; Trend]
TLA: top level aggregator
• Pulls data from aggregators, which pull data from producers at the time of the request
• Producers rewrite data locally
Volume: 50M jobs/day
Evolution
• Single machine memory for table joins
• Future: distributed memory for table joins
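The pull-at-request-time hierarchy can be illustrated with a small Python sketch. The class names and row format are hypothetical, and the real system answers SQL queries; this sketch only shows how a request fans down the tree and results flow back up.

```python
class Producer:
    """Leaf node that holds its own current rows locally."""

    def __init__(self, rows):
        self.rows = rows

    def query(self):
        # Return a snapshot of whatever the producer holds right now.
        return list(self.rows)


class Aggregator:
    """Pulls from its children only when a request arrives."""

    def __init__(self, children):
        self.children = children

    def query(self):
        rows = []
        for child in self.children:
            rows.extend(child.query())
        return rows


p1 = Producer([("host-1", "cpu", 0.42)])
p2 = Producer([("host-2", "cpu", 0.91)])
p3 = Producer([("host-3", "cpu", 0.13)])

# A top level aggregator (TLA) is an aggregator whose children are aggregators.
tla = Aggregator([Aggregator([p1, p2]), Aggregator([p3])])
print(len(tla.query()))  # 3

# Producers rewrite their data in place; the next request sees the new values.
p1.rows = [("host-1", "cpu", 0.55)]
print(tla.query()[0])  # ('host-1', 'cpu', 0.55)
```

Nothing is stored at the aggregators between requests, so results reflect the producers' state at the time of the request.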
![Page 25: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/25.jpg)
Media analytics
[Diagram components: Akamai Products; Events; Pipelines; Front end; Column Store; Index; Reporting; API / UI; Customers]
• Indexes are recreated for each update
• Supports insert and update
• Reads are flexible and fast
Evolution
• Index now fingerprint to lower cost
• HyperLogLog for uniqueness counting
![Page 26: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/26.jpg)
Security products
[Diagram components: Akamai Edge Networks; Akamai Web Firewall; External Data; Events; Pipelines; Front end; HDFS; Map/Reduce; HBASE; Hive; Cloudera; Graphite; Operations Center]
Outputs
• Reputation Scoring
• Threat Analysis
• Intelligence Reports
• Risk Based Authentication
• Payment Fraud
Evolution
• Replacing HBASE with custom aggregator
• Replacing Hive with custom SQL processor
Volume: 20 TB/day
![Page 27: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/27.jpg)
Log archive
[Diagram components: Logs; Q; Parse; Pipelines; Archive; Archive Index (10TB); Log cache (10%); Client IP Sketch; Spark; Spark SQL; HDFS; Archive Front End; Client Request]
Request flow
• Get Index and/or CIP
• Cache first, then archive
Scale
• 180 PB, 450 trillion records; doubles every year
• Archive is 90 data centers distributed over wide area; projected 1.2 EB in 3 years
Evolution
• Was flat file for index, now HDFS/Spark
![Page 28: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/28.jpg)
The Ecosystem
• Base: HDFS, Hadoop / Yarn
• Script: Pig
• SQL: Hive
• NoSQL: HBASE
• Stream: Kafka, Storm
• Search: Solr
• In-Mem: Spark
• Integration: Flume, Avro
• Operations: Ambari, Zookeeper, Oozie
• Monitoring: Graphite
• Sharing: Mesos
![Page 29: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/29.jpg)
Building a system
If you need fast access to massive amounts of data where queries are constrained to an index (read optimized):
• Start with HDFS or Cassandra
• Add HBASE column store
• Add Hive for SQL-like access
• Add Pig for scripting
[Diagram: HDFS, Hadoop / Yarn; HBASE (Get, Put); Hive (Select *); Pig ({ … })]
![Page 30: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/30.jpg)
Building a system
If you need to search logs:
• Start with HDFS
• Add Flume for log data integration
• Add Avro for data serialization
• Add Solr for search
[Diagram: HDFS, Hadoop / Yarn; Flume Agent (Avro Sink); Flume Collector (Avro Source); Solr (Search, e.g. Ip = 1.1.1.1)]
![Page 31: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/31.jpg)
Building a system
If you need flexible and shared access to unlimited amounts of data:
• Start with HDFS or Cassandra
• Add Hadoop for Map/Reduce or
• Add Hive for SQL-like access or
• Add Pig for scripting
• Add Mesos for resource sharing
• Add Ambari for cluster management and provisioning
• Add map/reduce programs for business logic
[Diagram: HDFS, Hadoop / Yarn; Pig ({…}); Hive (Select *); Flume; Ambari; Mesos; Map/Reduce (Java { … })]
![Page 32: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/32.jpg)
Building a system
If you need fast, flexible access to in-memory data:
• Start with HDFS
• Add Spark
• Add Spark SQL for SQL-like access or
• Create Spark programs for other business logic
[Diagram: HDFS, Hadoop / Yarn; Spark; SparkSQL (Select * from); Spark Progs (Java { … })]
![Page 33: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/33.jpg)
Building a system
If you need real-time stream event processing:
• Start with HDFS
• Add Kafka for messaging and pub/sub
• Add Storm for event processing
• Develop Java Bolts for processing logic
[Diagram: HDFS, Hadoop / Yarn; Kafka; Storm (Bolts { … })]
![Page 34: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/34.jpg)
Future at Akamai
• 100x
  – Everything bigger and faster
  – Requires new R&D across many Big Data components
• Scaling the Big Data ecosystem across the wide area
• Internet Security
  – Positive reputation scoring
  – Automatic DDoS mitigation
• Low latency data collection
  – 2^53 unique keys, <1 minute latency
• Support DevOps
  – Near real-time monitoring and control
![Page 35: JDD2014: Real Big Data - Scott MacGregor](https://reader034.vdocuments.net/reader034/viewer/2022051314/55933a7e1a28ab072d8b4664/html5/thumbnails/35.jpg)
Thank You