big data engineering - top 10 pragmatics

Krishna Sankar, http://doubleclix.wordpress.com EC4000–PhD Guest Seminar, Naval Post Graduate School April 27,2012 The road lies plain before me;--'tis a theme Single and of determined bounds; … - Wordsworth, The Prelude

Upload: krishna-sankar

Post on 02-Dec-2014

DESCRIPTION

Very high level, but covers all the essentials. Slides of my talk at the Naval PostGraduate School, Monterey

TRANSCRIPT

Page 1: Big Data Engineering - Top 10 Pragmatics

Krishna Sankar, http://doubleclix.wordpress.com

EC4000–PhD Guest Seminar, Naval Postgraduate School

April 27, 2012

The road lies plain before me;--'tis a theme

Single and of determined bounds; …

- Wordsworth, The Prelude

Page 2: Big Data Engineering - Top 10 Pragmatics

What is Big Data ?

Big Data to smart data

Big Data Pipeline

Analytic Algorithms

Storage - NOSQL

Processing - Hadoop

Cloud Architectures

Analytics/Modeling

R

Visualization

Agenda
o To cover the broad picture
o Touch upon instances of the technologies employed
o Of the Big Data domain …

Page 3: Big Data Engineering - Top 10 Pragmatics

Thanks to … The giants whose shoulders I am standing on

Special thanks to:
Peter Ateshian, NPS
Prof Murali Tummala, NPS
Shirley Bailes, O’Reilly
Ed Dumbill, O’Reilly
Jeff Barr, AWS
Jenny Kohr Chynoweth, AWS

Page 4: Big Data Engineering - Top 10 Pragmatics

Porcelain vs. Plumbing

• The balance is always interesting …

• This talk has both

• Would be happy to dive deep into plumbing topics like Hadoop, R, MongoDB, Cassandra et al…

Page 5: Big Data Engineering - Top 10 Pragmatics

① Volume
o Scale
② Velocity
o Data change rate vs. decision window
③ Variety
o Different sources & formats
o Structured vs. Unstructured
④ Variability
o Breadth of interpretation &
o Depth of analytics

EBC322

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf
http://www.quora.com/Business-Intelligence/What-is-the-future-of-business-intelligence

Page 9: Big Data Engineering - Top 10 Pragmatics

① Volume
o Scale
② Velocity
o Data change rate vs. decision window
③ Variety
o Different sources & formats
o Structured vs. Unstructured
④ Variability
o Breadth of interpretation &
o Depth of analytics
⑤ Contextual
o Dynamic variability
o Recommendation
⑥ Connectedness

EBC322

http://doubleclix.wordpress.com/2011/09/13/when-is-big-data-really-big-data/
http://www.hpts.ws/posters/Poster2011_13_Bulkowski.pdf

Page 10: Big Data Engineering - Top 10 Pragmatics

• “… they didn’t need a genius, … but build the world’s most impressive dilettante … battling the efficient human mind with spectacular flamboyant inefficiency” – Final Jeopardy by Stephen Baker
• 15 TB memory, across 90 IBM 760 servers, in 10 racks
• 1 TB of dataset
• 200 Million pages processed by Hadoop
• This is a good example of Connected data
– Contextual w/ variability
– Breadth of interpretation
– Analytics depth

http://doubleclix.wordpress.com/2011/03/01/the-education-of-a-machine-%E2%80%93-review-of-book-%E2%80%9Cfinal-jeopardy%E2%80%9D-by-stephen-baker/
http://doubleclix.wordpress.com/2011/02/17/watson-at-jeopardy-a-race-of-machines/

Page 11: Big Data Engineering - Top 10 Pragmatics

Ref: http://www.ciol.com/News/News/News-Reports/Vinod-Khosla%E2%80%99s-cool-dozen-tech-innovations/156307/0/
http://yourstory.in/2011/11/vinod-khoslas-keynote-at-nasscom-product-conclave-reject-punditry-believe-in-an-idea-take-risk-and-succeed/

Page 12: Big Data Engineering - Top 10 Pragmatics

Volume

Velocity

Variety

Variability

Connectedness

Context

Model

Infer-ability

Decomplexify! Contextualize! Network! Reason! Infer!

Logs, Scribe, Flume, Storm, Hadoop…

SQL, NOSQL, HDFS, XML, files, …

SQL, BI Tools, Hadoop, Pig, Hive, .NET Dryad, Various other tools

Internal dashboards, Tableau

Hand coded Programs, R, Mahout, …

Ref: http://goo.gl/Mm83k

Page 13: Big Data Engineering - Top 10 Pragmatics

Twitter
§ 200 million tweets/day
§ Peak 10,000/second
§ How would you handle the fire hose for social network analytics?
http://goo.gl/dcBsQ

Storage
§ 4U box = 40 TB
§ 1 PB = 25 boxes!

Zynga
§ “Analytics company, not a gaming company!”
§ Harvests data: 15 TB/day
§ Test new features
§ Target advertising
§ 230 million players/month

AWS – 900 Billion objects!
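The figures on this slide can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes decimal units (1 PB = 1000 TB) and a uniform tweet rate over the day, neither of which the slide states, and the function names are illustrative.

```python
# Back-of-the-envelope checks for the numbers quoted on the slide.
# Assumptions (not from the slide): decimal units, uniform arrival rate.
import math

SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000


def average_rate_per_second(events_per_day: int) -> float:
    """Average events per second if arrivals were spread evenly over a day."""
    return events_per_day / SECONDS_PER_DAY


def boxes_needed(target_tb: float, tb_per_box: float) -> int:
    """Whole storage boxes required to hold target_tb of raw capacity."""
    return math.ceil(target_tb / tb_per_box)


avg_tweets = average_rate_per_second(200_000_000)  # Twitter: 200 million/day
peak_to_average = 10_000 / avg_tweets              # quoted peak vs. average
boxes_per_pb = boxes_needed(1 * TB_PER_PB, 40)     # 4U box = 40 TB
zynga_tb_per_year = 15 * 365                       # 15 TB/day harvested

print(f"Average tweets/second: {avg_tweets:.0f}")
print(f"Peak-to-average ratio: {peak_to_average:.1f}")
print(f"Boxes per PB:          {boxes_per_pb}")
print(f"Zynga TB per year:     {zynga_tb_per_year}")
```

The peak-to-average ratio is the number that matters for the fire-hose question: a system sized for the daily average would fall behind during bursts.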

Page 14: Big Data Engineering - Top 10 Pragmatics

• 6 Billion Messages per day
• 2 PB (w/compression) online
• 6 PB w/ replication
• 250 TB/Month growth
• HBase Infrastructure
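A few derived numbers make these bullets easier to reason about. This is a rough sketch, assuming decimal units and assuming the 250 TB/month growth figure is measured before replication; the slide does not say either way.

```python
# Derived figures from the slide's bullets.
# Assumptions (not from the slide): 1 PB = 1000 TB, and the monthly
# growth figure is pre-replication.
SECONDS_PER_DAY = 86_400

messages_per_day = 6_000_000_000
online_pb = 2            # compressed, online
replicated_pb = 6        # with replication
growth_tb_per_month = 250

messages_per_second = messages_per_day / SECONDS_PER_DAY
replication_factor = replicated_pb / online_pb
growth_pb_per_year = growth_tb_per_month * 12 / 1000
replicated_growth_pb_per_year = growth_pb_per_year * replication_factor

print(f"Messages/second (average): {messages_per_second:,.0f}")
print(f"Implied replication factor: {replication_factor:.0f}x")
print(f"Growth per year: {growth_pb_per_year:.1f} PB "
      f"({replicated_growth_pb_per_year:.1f} PB replicated)")
```

Under these assumptions the 2 PB and 6 PB figures imply three copies of every byte.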

Page 15: Big Data Engineering - Top 10 Pragmatics

eBay Extreme Analytics Architecture

Path Analysis
A/B Testing

50 TB/Day
240 nodes, 84 PB Teradata Installation

Very systematic
Diagram speaks volumes!

Ref: http://www.hpts.ws/sessions/2011HPTS-TomFastner.pdf

Page 16: Big Data Engineering - Top 10 Pragmatics

Collect: Splunk, Scribe, Flume, Storm

Store: NOSQL, Cassandra, MongoDB, HBase, Neo4j

Transform & Analyze: Hadoop, Pig/Hive, R

Model & Reason: R, Mahout, BI Tools

Predict, Recommend & Visualize: D3.js, Tableau, Dashboard
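The five stages above can be pictured as functions chained end to end. The sketch below is a toy, in-memory stand-in written in plain Python; the log format, the function names and the "most viewed item" model are all assumptions made for illustration, and none of the tools named on the slide are actually invoked.

```python
# Toy pipeline: collect -> store -> transform -> model -> recommend.
# Everything here is hypothetical sample data and illustrative names.
from collections import Counter
from typing import Iterable


def collect() -> list[str]:
    """Collect: stand-in for a log collector (the Scribe/Flume role)."""
    return [
        "alice view item42",
        "bob view item42",
        "alice buy item42",
        "carol view item7",
    ]


def store(lines: Iterable[str]) -> list[dict[str, str]]:
    """Store: parse raw lines into records (the NOSQL role)."""
    records = []
    for line in lines:
        user, action, item = line.split()
        records.append({"user": user, "action": action, "item": item})
    return records


def transform(records: Iterable[dict[str, str]]) -> Counter:
    """Transform & Analyze: aggregate views per item (the Hadoop/Pig/Hive role)."""
    return Counter(r["item"] for r in records if r["action"] == "view")


def model(view_counts: Counter) -> list[str]:
    """Model & Reason: rank items by popularity, a deliberately naive model."""
    return [item for item, _ in view_counts.most_common()]


def recommend(ranking: list[str], top_n: int = 1) -> list[str]:
    """Predict, Recommend & Visualize: surface the top-ranked items."""
    return ranking[:top_n]


def run_pipeline() -> list[str]:
    return recommend(model(transform(store(collect()))))


print(run_pipeline())
```

Each stage only depends on the output of the one before it, which is what lets the real tools at each stage be swapped independently.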

When I think of my own native land,
In a moment I seem to be there;
But, alas! recollection at hand
Soon hurries me back to despair.

- Cowper, The Solitude of Alexander Selkirk

Page 17: Big Data Engineering - Top 10 Pragmatics

Key  Value   Column   Document   Graph  

NOSQL  

Neo4j  

FlockDB  

InfiniteGraph  

CouchDB  

MongoDB  

Lotus  Domino  

Riak  

Google  BigTable  

HBase  

Cassandra  

HyperTable  

In-memory

Disk  Based  

SimpleDB  

Memcached  

Redis  

Tokyo  Cabinet  

Dynamo  

Voldemort   Azure  TS  
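The four NOSQL families on this slide differ mainly in the shape of the data they expose. As a rough illustration, the same kind of information can be laid out in each shape using plain Python structures; the sample records and helper names below are hypothetical and do not reflect any particular product's API.

```python
# Four NOSQL data models sketched with plain Python structures.
# All sample data and helper names are hypothetical.

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": b'{"name": "Ada"}'}

# Column family: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Ada", "city": "Monterey"},
        "stats": {"logins": 17},
    }
}

# Document: nested, self-describing records that can be queried by field.
documents = [
    {"_id": 42, "name": "Ada", "tags": ["navy", "data"]},
    {"_id": 43, "name": "Grace", "tags": ["data"]},
]

# Graph: nodes connected by typed edges.
edges = [("Ada", "FOLLOWS", "Grace"), ("Grace", "FOLLOWS", "Linus")]


def find_documents(docs: list[dict], tag: str) -> list[str]:
    """Document-style query: names of all documents carrying a tag."""
    return [d["name"] for d in docs if tag in d["tags"]]


def neighbours(edge_list: list[tuple], node: str, relation: str) -> list[str]:
    """Graph-style query: follow one relation type out of a node."""
    return [dst for src, rel, dst in edge_list if src == node and rel == relation]


print(key_value["user:42"])
print(column_family["user:42"]["stats"]["logins"])
print(find_documents(documents, "data"))
print(neighbours(edges, "Ada", "FOLLOWS"))
```

The key-value store can only answer "what is stored under this key", while the document and graph shapes support queries over fields and relationships.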

Page 18: Big Data Engineering - Top 10 Pragmatics

MapReduce

• Data parallelism
• Large Installations (many ~5000 node clusters!)
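Data parallelism in MapReduce is easiest to see in the classic word-count example. The sketch below runs the map, shuffle and reduce phases in a single Python process; it is an illustration of the programming model only, not of Hadoop's API, and the function names are chosen for this example.

```python
# Single-process sketch of the MapReduce programming model (word count).
# Function names are illustrative; this is not Hadoop's API.
from collections import defaultdict
from itertools import chain
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word. Each document is independent,
    so in a cluster each one could be mapped on a different node."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key. Keys are independent too."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "big clusters", "data parallelism"])
print(counts)
```

Because neither the map calls nor the reduce calls share state, the framework is free to spread them across as many nodes as the data requires.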

Page 19: Big Data Engineering - Top 10 Pragmatics

Infrastructure As A Service

Platform As A Service

Software As A Service

Page 20: Big Data Engineering - Top 10 Pragmatics
Page 21: Big Data Engineering - Top 10 Pragmatics

Amazon – Canonical Cloud

• S3 – Blob storage
• Dynamo DB – NOSQL
• EMR – Elastic Map Reduce
• EC2 – Compute
• 1% of Internet traffic

http://blog.deepfield.net/2012/04/18/how-big-is-amazons-cloud/

“Scalability is about building wider roads, not about building faster cars” – Steve Swartz

Page 22: Big Data Engineering - Top 10 Pragmatics

http://www.slideshare.net/AmazonWebServices/keynote-your-future-with-cloud-computing-dr-werner-vogels-aws-summit-2012-nyc

Page 23: Big Data Engineering - Top 10 Pragmatics

http://openclipart.org/detail/152311/internet-cloud-by-b.gaultier
http://openclipart.org/detail/17847

EC2

EC2

Page 24: Big Data Engineering - Top 10 Pragmatics

• Social Network Analysis
• Sentiment Analysis
• Brand Strength
• Citation/co-citation ≅ Followed by/Also Follows
• Metrics
– Network diameter,
– Weak-ties,
– Erdős-Rényi model &
– Kronecker Graphs

Tweets, Followers, Follow/Unfollow

http://www.oscon.com/oscon2012/public/schedule/detail/23130
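Two of the metrics listed on this slide are small enough to sketch directly. The code below generates an Erdős-Rényi G(n, p) random graph, where each possible edge is included independently with probability p, and computes network diameter by breadth-first search from every node. It uses only the standard library; the function names are illustrative, and weak ties and Kronecker graphs are not covered.

```python
# Erdős-Rényi G(n, p) generator and network diameter via BFS.
# Standard library only; function names are illustrative.
import random
from collections import deque
from itertools import combinations
from typing import Optional


def erdos_renyi(n: int, p: float, seed: int = 0) -> dict[int, set[int]]:
    """Undirected G(n, p): each of the n*(n-1)/2 edges appears with probability p."""
    rng = random.Random(seed)
    graph: dict[int, set[int]] = {node: set() for node in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def bfs_distances(graph: dict[int, set[int]], source: int) -> dict[int, int]:
    """Hop counts from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(graph: dict[int, set[int]]) -> Optional[int]:
    """Longest shortest path in the graph, or None if it is disconnected."""
    longest = 0
    for node in graph:
        dist = bfs_distances(graph, node)
        if len(dist) < len(graph):
            return None
        longest = max(longest, max(dist.values()))
    return longest


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(diameter(path))                   # a 4-node chain
print(diameter(erdos_renyi(50, 0.1)))   # a sparse random graph
```

Running BFS from every node costs O(n * (n + m)), which is fine for a sketch but is exactly the kind of computation that needs a distributed approach at follower-graph scale.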

Page 25: Big Data Engineering - Top 10 Pragmatics

Was it a vision, or a waking dream?
Fled is that music:--do I wake or sleep?

- Keats, Ode to a Nightingale