1
DO NOT USE PUBLICLY PRIOR TO 10/23/12
Beyond Batch
Doug Cutting October 2012
2
Hadoop Started As Batch
MapReduce
• Simple, powerful
  • Kills a lot of birds
• Efficient, scalable
  • Compute at storage
• Shared platform
  • Used by Pig, Hive, etc.
• Incredibly useful!
  • But not sufficient
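To make the MapReduce model concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) as a word count. It is plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative assumptions, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over in-memory lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is not just batch", "batch is useful"])
```

In a real cluster the map calls would run on the machines that already hold the input blocks ("compute at storage"), and the shuffle would move data over the network; this sketch only shows the programming model.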
3
Big Data Is Not (Just) Batch
Its true themes are:
• Scalability
• Affordability
  • Commodity hardware
  • Open-source software
• Distributed & reliable
• Schema on read
• Data beats algorithms
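"Schema on read" can be illustrated with a small sketch: raw records are stored untouched, and each reader applies its own schema when it reads. The sample records, the `read_with_schema` helper, and the schema format (field name mapped to a converter) are all assumptions made for illustration.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced on write.
raw_log = [
    '{"user": "alice", "bytes": "512"}',
    '{"user": "bob", "bytes": "2048", "agent": "curl"}',
    "not json at all",
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) at read time, skipping bad rows."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: convert(record[field]) for field, convert in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # unreadable under this schema; the raw line is still kept


# Two readers, two schemas, one copy of the raw data.
usage = list(read_with_schema(raw_log, {"user": str, "bytes": int}))
agents = list(read_with_schema(raw_log, {"agent": str}))
```

The design point is that a new question only needs a new schema, not a reload of the data.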
4
HBase: First Non-Batch Component
Online key/value store
• Complement to batch
  • Online put/get
  • Batch load & analyze
  • Best of both
• Popular combination
• A step towards the future…
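The combination of online put/get with batch analysis can be sketched with a toy in-memory store. `ToyKeyValueStore` and its methods are hypothetical and are not the HBase client API; the sketch only shows the two access patterns side by side.

```python
class ToyKeyValueStore:
    """Hypothetical in-memory stand-in for an online key/value store."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        """Online write: store one value under one key."""
        self._rows[key] = value

    def get(self, key, default=None):
        """Online read: fetch the value for one key."""
        return self._rows.get(key, default)

    def scan(self):
        """Batch read: iterate over every row in key order."""
        for key in sorted(self._rows):
            yield key, self._rows[key]


store = ToyKeyValueStore()
store.put("user:alice", 3)
store.put("user:bob", 5)

latest = store.get("user:alice")                  # online access to a single row
total = sum(value for _, value in store.scan())   # batch analysis over all rows
```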
5
Holy Grail Of Big Data
• Open source, commodity HW, etc.
• Linear scaling
  • To scale, just buy more hardware
• On many axes
  • Storage capacity
  • Throughput & latency
    • of batch & query
  • Transactions, Joins, Indexes
    • and batch!
6
Google Gives Us A Map
Google publication        Open source project    Capability
2004  GFS & MapReduce     2006  Hadoop           batch programs
2005  Sawzall             2008  Pig & Hive       batch queries
2006  BigTable            2008  HBase            online key/value
...   ...                 ...   ...              ...
2012  Spanner             ?     ?                holy grail? (transactions, etc.)
Spanner: 5 years, 26 authors!
7
Impala Is Latest Step
Google publication        Open source project    Capability
2004  GFS & MapReduce     2006  Hadoop           batch programs
2005  Sawzall             2008  Pig & Hive       batch queries
2006  BigTable            2008  HBase            online key/value
2010  Dremel/F1           2012  Impala           online queries
2012  Spanner             ?     ?                holy grail? (transactions, etc.)
8
@cutting #bigquestions