big data engineering - top 10 pragmatics
Post on 02-Dec-2014
Krishna Sankar, http://doubleclix.wordpress.com
EC4000–PhD Guest Seminar, Naval Post Graduate School
April 27, 2012
The road lies plain before me;--'tis a theme
Single and of determined bounds; …
- Wordsworth, The Prelude
What is Big Data ?
Big Data to smart data
Big Data Pipeline
Analytic Algorithms
Storage - NOSQL
Processing - Hadoop
Cloud Architectures
Analytics/Modeling
R
Visualization
Agenda
o To cover the broad picture
o Touch upon instances of the technologies employed
o Of the Big Data domain …
Thanks to … The giants whose shoulders I am standing on
Special Thanks to:
Peter Ateshian, NPS
Prof Murali Tummala, NPS
Shirley Bailes, O’Reilly
Ed Dumbill, O’Reilly
Jeff Barr, AWS
Jenny Kohr Chynoweth, AWS
Porcelain vs. Plumbing
• The balance is always interesting …
• This talk has both
• Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al…
① Volume
  o Scale
② Velocity
  o Data change rate vs. decision window
③ Variety
  o Different sources & formats
  o Structured vs. Unstructured
④ Variability
  o Breadth of interpretation &
  o Depth of analytics
EBC322
http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence
⑤ Contextual
  o Dynamic variability
  o Recommendation
⑥ Connectedness
• “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM Power 750 servers, in 10 racks
• 1 TB of dataset
• 200 Million pages processed by Hadoop
• This is a good example of Connected data
  – Contextual w/ variability
  – Breadth of interpretation
  – Analytics depth
http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/
Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/
Volume
Velocity
Variety
Variability
Connectedness
Context
Model
Infer-ability
Decomplexify! Contextualize! Network! Reason! Infer!
Logs, Scribe, Flume, Storm, Hadoop…
SQL, NOSQL, HDFS, XML, files, …
SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, Various other tools
Internal dashboards, Tableau
Ref: http://goo.gl/Mm83k
Hand coded Programs, R, Mahout, …
Twitter
• 200 million tweets/day
• Peak 10,000/second
• How would you handle the fire hose for social network analytics?
http://goo.gl/dcBsQ
Storage
• 4U box = 40 TB
• 1 PB = 25 boxes!
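The Twitter and storage figures above invite a quick back-of-envelope check. The sketch below is only that: it assumes decimal units (1 PB = 1000 TB) and a uniform spread of tweets over the day for the average, and the function names are made up for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000  # assumption: decimal units, as the "25 boxes" figure implies


def average_rate_per_second(events_per_day: float) -> float:
    """Average event rate if the daily volume were spread evenly."""
    return events_per_day / SECONDS_PER_DAY


def boxes_needed(target_tb: float, tb_per_box: float) -> int:
    """Number of whole storage boxes needed to hold target_tb."""
    return math.ceil(target_tb / tb_per_box)


# Figures taken from the slide.
tweets_per_day = 200_000_000
peak_per_second = 10_000

avg_per_second = average_rate_per_second(tweets_per_day)
peak_to_average = peak_per_second / avg_per_second
boxes_per_pb = boxes_needed(1 * TB_PER_PB, tb_per_box=40)

print(f"average tweets/second: {avg_per_second:.0f}")
print(f"peak / average ratio:  {peak_to_average:.1f}")
print(f"40 TB boxes per PB:    {boxes_per_pb}")
```

The gap between the average and the peak is the sizing question behind "how would you handle the fire hose": capacity has to follow the peak, not the daily mean.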
Zynga
• “Analytics company, not a gaming company!”
• Harvests data: 15 TB/day
• Test new features
• Target advertising
• 230 million players/month
AWS – 900 Billion objects!
• 6 Billion Messages per day
• 2 PB (w/compression) online
• 6 PB w/ replication
• 250 TB/Month growth
• HBase Infrastructure
Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf
Path Analysis, A/B Testing
50 TB/Day, 240 nodes, 84 PB Teradata Installation
Very systematic. Diagram speaks volumes!
eBay Extreme Analytics Architecture
Splunk Scribe Flume Storm
Collect
NOSQL Cassandra MongoDB HBase Neo4j
Store
Hadoop Pig/Hive
R
Transform & Analyze
R Mahout BI Tools
Model & Reason
D3.js Tableau
Dashboard
Predict, Recommend & Visualize
When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.
- Cowper, The Solitude of Alexander Selkirk
Key Value Column Document Graph
NOSQL
Neo4j
FlockDB
InfiniteGraph
CouchDB
MongoDB
Lotus Domino
Riak
Google BigTable
HBase
Cassandra
HyperTable
In-memory
Disk Based
SimpleDB
Memcached
Redis
Tokyo Cabinet
Dynamo
Voldemort
Azure TS
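To make the four NOSQL families above concrete, here is a rough sketch of one "who does this user follow?" fact laid out in each data model. It uses plain Python structures only; it does not call any of the listed products' APIs, and the sample users, keys, and helper names are all invented for illustration.

```python
import json

# Key-value: an opaque value fetched by exactly one key.
key_value = {"user:42": json.dumps({"name": "Ada", "follows": [7]})}

# Column family: row key -> column family -> column -> value.
column_family = {
    "42": {"profile": {"name": "Ada"}, "follows": {"7": "2012-04-27"}},
}

# Document: self-describing nested records, queryable by field.
documents = [
    {"_id": 42, "name": "Ada", "follows": [7]},
    {"_id": 7, "name": "Bob", "follows": []},
]

# Graph: nodes plus typed edges between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Bob"}}
edges = [(42, "FOLLOWS", 7)]


def follows_key_value(user_id: int) -> list:
    # The store cannot look inside the value; the client must decode it.
    return json.loads(key_value[f"user:{user_id}"])["follows"]


def follows_column_family(user_id: int) -> list:
    # Each followed user is one column in the "follows" family.
    return [int(col) for col in column_family[str(user_id)]["follows"]]


def follows_document(user_id: int) -> list:
    # Match on a field of the document, then read a nested field.
    return next(d for d in documents if d["_id"] == user_id)["follows"]


def follows_graph(user_id: int) -> list:
    # Traverse outgoing FOLLOWS edges from the node.
    return [dst for src, label, dst in edges
            if src == user_id and label == "FOLLOWS"]


for lookup in (follows_key_value, follows_column_family,
               follows_document, follows_graph):
    print(lookup.__name__, lookup(42))
```

The answer is the same in every model; what differs is where the structure lives (in the client, in the columns, in the document, or in the edges), which is roughly what drives the choice between the families.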
MapReduce
• Data parallelism
• Large Installations (many ~5000 node clusters!)
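The data-parallel idea behind MapReduce can be sketched in a few lines. The example below is a single-process toy word count, not Hadoop code: the function names are illustrative, and a real cluster would run the map and reduce calls on many nodes and do the shuffle over the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word; each document is independent."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key; each key is independent."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "smart data", "big big pipeline"])
print(counts)
```

Because no map call depends on another document and no reduce call depends on another key, both phases can be spread across as many machines as the data requires.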
Infrastructure As A Service
Platform As A Service
Software As A Service
Amazon – Canonical Cloud
• S3 – Blob storage
• DynamoDB – NOSQL
• EMR – Elastic Map Reduce
• EC2 – Compute
• 1% of Internet traffic
http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/
“Scalability is about building wider roads, not about building faster cars” – Steve Swartz
http://www.slideshare.net/AmazonWebServices/keynote-your-future-with-cloud-computing-dr-werner-vogels-aws-summit-2012-nyc
http://openclipart.org/detail/152311/internet-cloud-by-b.gaultier
http://openclipart.org/detail/17847
EC2
• Social Network Analysis
• Sentiment Analysis
• Brand Strength
• Citation/co-citation ≅ Followed by/Also Follows
• Metrics
  – Network diameter,
  – Weak-ties,
  – Erdős-Rényi model &
  – Kronecker Graphs
Tweets Followers Follow/Unfollow
http://www.oscon.com/oscon2012/public/schedule/detail/23130
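Two of the metrics named above, the Erdős-Rényi model and network diameter, are easy to try on a toy graph. The sketch below uses only the standard library; the G(n, p) generator and BFS-based diameter are textbook versions, the function names are illustrative, and the all-pairs BFS would not scale to a real follower graph.

```python
import random
from collections import deque
from typing import Dict, Optional, Set

Graph = Dict[int, Set[int]]


def erdos_renyi(n: int, p: float, seed: Optional[int] = None) -> Graph:
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    adjacency: Graph = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adjacency[u].add(v)
                adjacency[v].add(u)
    return adjacency


def bfs_distances(adjacency: Graph, source: int) -> Dict[int, int]:
    """Hop counts from source to every node reachable from it."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return distance


def diameter(adjacency: Graph) -> Optional[int]:
    """Longest shortest path in the graph, or None if it is disconnected."""
    longest = 0
    for source in adjacency:
        distance = bfs_distances(adjacency, source)
        if len(distance) < len(adjacency):
            return None  # some node is unreachable from this source
        longest = max(longest, max(distance.values()))
    return longest


# Hypothetical toy network: 50 accounts, each pair connected with probability 0.1.
toy_network = erdos_renyi(50, 0.1, seed=1)
print("diameter:", diameter(toy_network))
```

Comparing the diameter of a real follower graph against a random graph of the same size and density is one simple way to see how far the real network departs from the Erdős-Rényi baseline.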
Was it a vision, or a waking dream?
Fled is that music:—do I wake or sleep?
- Keats, Ode to a Nightingale