Webinar: The Future of Hadoop

Upload: cloudera-inc

Post on 11-May-2015

DESCRIPTION

With a community of over 500 contributors, Apache Hadoop and related projects are evolving at an ever-increasing rate. Join the co-creator of Apache Hadoop, Doug Cutting, and Cloudera’s Chief Scientist, Jeff Hammerbacher, for a discussion of the most exciting new features being developed by the Apache Hadoop community.

TRANSCRIPT

Page 1: Webinar: The Future of Hadoop

Welcome to the webinar!

Audio/Telephone: +1 (215) 383-1016

Access Code: 421-634-457

Audio Pin: Shown after joining the Webinar

Hadoop, HBase, Pig, Hive, Bigtop, Avro, Flume & Whirr are trademarks of the Apache Software Foundation

The Future of Hadoop

Doug Cutting | A Founder of Apache Hadoop

Jeff Hammerbacher | Chief Scientist, Cloudera

Page 2: Webinar: The Future of Hadoop

Housekeeping

▪ All lines are on mute

▪ Ask questions at any time using the Questions panel on GoToMeeting

▪ Slides and recording will be available on www.cloudera.com/events

©2011 Cloudera, Inc. All Rights Reserved.

Page 3: Webinar: The Future of Hadoop

Presentation Outline

▪ 1. Context

▪ 2. Apache Bigtop

▪ 3. Apache Hadoop Core

▪ 4. Apache HBase, Hive, and Pig

▪ 5. Other components

▪ Questions and Discussion

Page 4: Webinar: The Future of Hadoop

1. Context

Page 5: Webinar: The Future of Hadoop

Context: Data

▪ 1.8 ZB will be created and replicated in 2011

▪ Up 9x in the last five years

▪ More than 90% of this data is unstructured

▪ Enterprises have some liability for 80% of this data

▪ Enterprises will spend $4T on managing data in 2011

▪ Source: IDC Digital Universe Report 2011
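
The growth figures above can be turned into a rough worked example. This is only a sketch: it assumes "up 9x in the last five years" compares the 2011 volume with the volume five years earlier, and it assumes steady compound growth, which the slide does not state.

```python
# Figures taken from the slide (attributed there to the IDC Digital Universe Report 2011).
volume_2011_zb = 1.8   # zettabytes created and replicated in 2011
growth_multiple = 9    # "up 9x in the last five years"
years = 5

# Assumption: the 9x figure compares 2011 with the volume five years earlier.
volume_five_years_earlier_zb = volume_2011_zb / growth_multiple

# Assumption: growth compounded at a constant annual rate over those years.
annual_growth_factor = growth_multiple ** (1 / years)

print(f"Implied volume five years earlier: {volume_five_years_earlier_zb:.2f} ZB")
print(f"Implied annual growth factor: {annual_growth_factor:.2f}x")
```

Under those assumptions the data volume grew by a little more than half each year.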

Page 6: Webinar: The Future of Hadoop

Context: Hadoop

▪ Apache Hadoop and related software are designed for this world

▪ Volume

  ▪ Commodity hardware and open source software lower cost and increase capacity

▪ Velocity

  ▪ Data ingest speed aided by append-only and schema-on-read design

▪ Variety

  ▪ Multiple tools to structure, process, and access data
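
As a minimal illustration of the append-only, schema-on-read idea mentioned under Velocity, the sketch below stores raw records untouched at ingest time and applies a schema only when reading. The record layout, the `ingest` and `read_with_schema` helpers, and the field names are all hypothetical and are not part of any Hadoop API.

```python
import json

# Ingest: append raw records as-is, with no parsing or validation (append-only).
raw_log = []

def ingest(line):
    raw_log.append(line)

ingest('{"user": "alice", "bytes": 120}')
ingest('{"user": "bob", "bytes": "300", "agent": "curl"}')  # extra field, number as string
ingest('{"user": "carol"}')                                  # missing field

# Read: each consumer applies the schema it needs at query time.
def read_with_schema(lines, schema):
    """Yield dicts holding only the schema's fields, cast to the given types."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }

usage_schema = {"user": str, "bytes": int}
rows = list(read_with_schema(raw_log, usage_schema))
total_bytes = sum(row["bytes"] or 0 for row in rows)
print(total_bytes)
```

Because nothing is validated on the way in, ingest stays cheap, and a different consumer can later read the same raw records with a different schema.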

Page 7: Webinar: The Future of Hadoop

Context: Hadoop

Page 8: Webinar: The Future of Hadoop

Context: HDFS and MapReduce

▪ Apache Hadoop = HDFS + MapReduce

  ▪ Similar to kernel of an operating system

  ▪ Referred to as “Hadoop Core”

▪ Related components are often deployed with Hadoop

  ▪ For example: HBase, Hive, Pig, Oozie, Flume, Sqoop

▪ Together, these components form a “Hadoop Stack”

  ▪ Not all components must be deployed

Page 9: Webinar: The Future of Hadoop

Context: Bigtop

▪ What standards should all components follow?

▪ How can we ensure all components of the stack work together?

▪ How can we find the right version of each component?

▪ How can we make it easy to install an additional component?

Page 10: Webinar: The Future of Hadoop

2. Apache Bigtop

Page 11: Webinar: The Future of Hadoop

Apache Bigtop

▪ Now incubating at Apache

▪ Hadoop ecosystem-wide project, including:

  ▪ Interoperability testing of components

  ▪ Packaging of compatible versions of components

▪ Like a Fedora, Debian, or CentOS for the Hadoop ecosystem

▪ Releases are not a single artifact

  ▪ Rather a set of interdependent, compatible components

Page 12: Webinar: The Future of Hadoop

Apache Bigtop

▪ Current components

  ▪ Hadoop

  ▪ HBase

  ▪ Hive

  ▪ Pig

  ▪ Oozie

  ▪ Sqoop

  ▪ Flume

  ▪ ZooKeeper

  ▪ Whirr

Page 13: Webinar: The Future of Hadoop

Apache Bigtop

▪ Outputs

  ▪ Source

  ▪ RPM

  ▪ Deb

▪ Tests

  ▪ Integration

  ▪ Package

  ▪ Smoke

▪ Release 0.1.0 under vote now!

Page 14: Webinar: The Future of Hadoop

3. Apache Hadoop Core

Page 15: Webinar: The Future of Hadoop

Apache Hadoop Core

▪ Current stable releases based on branches from 0.20

▪ Upcoming release: 0.22

  ▪ Includes both security and a new implementation of append

  ▪ Not expected to be run at scale or commercially supported

  ▪ Nearly ready for vote

▪ Upcoming release: 0.23

  ▪ Build and dependency management moved to Maven

  ▪ Branch to happen soon

Page 16: Webinar: The Future of Hadoop

HDFS

▪ Robustness

  ▪ HDFS-1073: Checkpointing of image and edits log

▪ Availability

  ▪ HDFS-1623: High availability

▪ Performance

  ▪ HDFS-941: Faster random reads

  ▪ HDFS-2080: Faster checksums
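
To illustrate why checksum speed matters for read performance: in a per-chunk checksum scheme, every chunk that is read has its checksum recomputed and compared with the stored value. The sketch below uses Python's `zlib.crc32` and an illustrative 512-byte chunk size. It is not HDFS code and does not show the optimization made in HDFS-2080; the function names are hypothetical.

```python
import zlib

CHUNK_SIZE = 512  # illustrative chunk size in bytes (an assumption for this sketch)

def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC32 value per fixed-size chunk of data."""
    return [
        zlib.crc32(data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]

def find_corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return indexes of chunks whose recomputed CRC32 differs from the stored one."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

block = bytes(range(256)) * 8      # 2048 bytes, so 4 chunks
stored = chunk_checksums(block)    # computed once at write time

corrupted = bytearray(block)
corrupted[700] ^= 0xFF             # flip the bits of one byte inside chunk 1
print(find_corrupt_chunks(bytes(corrupted), stored))
```

Since the checksum work is proportional to the bytes read, a faster checksum routine speeds up every read path that verifies data.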

Page 18: Webinar: The Future of Hadoop

MapReduce

▪ Modularity

  ▪ MAPREDUCE-279: MapReduce 2.0

    ▪ Break JobTracker into ResourceManager and ApplicationMaster

    ▪ Replace TaskTracker with NodeManager

▪ Source of diagram: http://www.odbms.org/download/dean-keynote-ladis2009.pdf
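
The split described above can be sketched as a toy model. The class names mirror the component names on the slide, but every method, field, and the slot-based allocation are hypothetical simplifications for illustration, not the MapReduce 2.0 API: a cluster-wide ResourceManager only hands out containers, a per-job ApplicationMaster decides what to run in them, and a per-node NodeManager launches the work.

```python
class NodeManager:
    """Per-node agent: launches tasks in containers on its own node."""
    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots

    def launch(self, task):
        return f"{task} on {self.name}"


class ResourceManager:
    """Cluster-wide: hands out containers and knows nothing about job logic."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, count):
        """Grant up to `count` containers, each identified by its hosting node."""
        granted = []
        for node in self.node_managers:
            while node.free_slots > 0 and len(granted) < count:
                node.free_slots -= 1
                granted.append(node)
        return granted


class ApplicationMaster:
    """Per-job: requests containers and tracks that job's own tasks."""
    def __init__(self, job_name, tasks):
        self.job_name = job_name
        self.pending = list(tasks)
        self.running = []

    def run(self, resource_manager):
        for node in resource_manager.allocate(len(self.pending)):
            task = self.pending.pop(0)
            self.running.append(node.launch(f"{self.job_name}/{task}"))
        return self.running


nodes = [NodeManager("node1", slots=2), NodeManager("node2", slots=1)]
rm = ResourceManager(nodes)
wordcount = ApplicationMaster("wordcount", ["map0", "map1", "reduce0"])
print(wordcount.run(rm))
```

Because job-specific logic lives only in the ApplicationMaster, a framework other than MapReduce could in principle supply its own master and reuse the same resource layer.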

Page 19: Webinar: The Future of Hadoop

MapReduce

▪ Potential New Frameworks

▪ MAPREDUCE-2719: Distributed shell

▪ MAPREDUCE-2720: Distributed Java commands

▪ MPI: Communication-intensive parallelism

▪ Fast scans and aggregations

  ▪ OpenDremel

▪ Bulk Synchronous Parallel

  ▪ Giraph, Golden Orb, Hama, et al.

▪ Actor Model (streaming)

  ▪ S4, Akka, Storm, et al.
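
For readers unfamiliar with Bulk Synchronous Parallel, listed above, here is a minimal single-process sketch of the model: computation proceeds in supersteps, each vertex works only on its own state and incoming messages, and messages sent in one superstep are delivered after a barrier in the next. The `bsp_max` function is a hypothetical illustration and does not use the API of Giraph, Golden Orb, or Hama.

```python
def bsp_max(graph, values):
    """Propagate the maximum value to every vertex of a connected graph.

    graph:  dict mapping each vertex to a list of neighbours
    values: dict mapping each vertex to its initial value
    Returns (final values, number of supersteps run).
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v in graph:
        for neighbour in graph[v]:
            inbox[neighbour].append(values[v])
    supersteps = 1

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        # Compute phase: each vertex looks only at its own value and inbox.
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for neighbour in graph[v]:
                    outbox[neighbour].append(values[v])
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = outbox
        supersteps += 1
    return values, supersteps


chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = bsp_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final)
```

Vertices that learn nothing new stay silent, so the computation stops on its own once the largest value has reached everyone.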

Page 20: Webinar: The Future of Hadoop

4. HBase, Hive, and Pig

Page 21: Webinar: The Future of Hadoop

Apache HBase

▪ Upcoming release: 0.92.0

▪ Server-side triggers

  ▪ HBASE-2000: Coprocessors

▪ Availability

  ▪ HBASE-1730/4213: Online schema changes

▪ Performance

  ▪ HBASE-3857: HFile 2.0

▪ HBase book in September!

Page 22: Webinar: The Future of Hadoop

Apache Hive

▪ Upcoming release: 0.8

▪ Data transfer

  ▪ HIVE-306: INSERT INTO

  ▪ HIVE-1918: EXPORT/IMPORT

▪ Indexes

  ▪ HIVE-1644: Automatically use indexes

  ▪ HIVE-1803: Bitmap indexes

▪ Data formats

  ▪ HIVE-895: Avro support
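
As background for the bitmap indexes item above, the sketch below shows the general bitmap-index technique: each distinct column value gets a bitmap with one bit per row, and a multi-column filter becomes a bitwise operation. It uses plain Python integers as bitmaps and hypothetical table data; it is not Hive's implementation.

```python
from collections import defaultdict

def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set
    when row i holds that value."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical table contents, for illustration only.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]
by_country = build_bitmap_index(rows, "country")
by_device = build_bitmap_index(rows, "device")

# A filter such as country = 'US' AND device = 'mobile' becomes a bitwise AND.
hits = by_country["US"] & by_device["mobile"]
print(matching_rows(hits))
```

Bitmap indexes tend to suit columns with few distinct values, since one bitmap is kept per value.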

Page 23: Webinar: The Future of Hadoop

Apache Pig

▪ Recent release: 0.9

▪ Scripting

  ▪ PIG-1479: Embedding Pig in Python

  ▪ PIG-1793: Macro expansion

▪ Debugging

  ▪ PIG-1712: ILLUSTRATE rework

▪ Data formats

  ▪ PIG-1748: Avro support

Page 24: Webinar: The Future of Hadoop

5. Other Components

Page 25: Webinar: The Future of Hadoop

Other Components

▪ Apache Incubator

  ▪ Sqoop, Flume, and Oozie now incubating

  ▪ Whirr graduated to a top-level Apache project

▪ Apache Avro

  ▪ Interoperability with Protocol Buffers and Thrift

  ▪ Column-oriented file format

  ▪ Python MapReduce implementation

▪ Apache ZooKeeper

  ▪ Multi-update

  ▪ Kerberos authentication of clients
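
The slide lists multi-update without detail. Assuming it refers to applying a group of updates atomically, so that either all of them take effect or none do, the idea can be sketched with a plain dictionary standing in for the data tree. The `multi` function and the tuple format below are hypothetical and are not the ZooKeeper client API.

```python
def multi(store, operations):
    """Apply all operations or none. Each operation is a tuple:
    ("create", path, data), ("set", path, data) or ("delete", path)."""
    staged = dict(store)  # work on a copy so a failure leaves `store` untouched
    for op, path, *args in operations:
        if op == "create":
            if path in staged:
                return False
            staged[path] = args[0]
        elif op == "set":
            if path not in staged:
                return False
            staged[path] = args[0]
        elif op == "delete":
            if path not in staged:
                return False
            del staged[path]
        else:
            raise ValueError(f"unknown operation: {op}")
    # Every operation succeeded: publish the staged state in one step.
    store.clear()
    store.update(staged)
    return True


store = {"/lock": b"a"}
ok = multi(store, [("delete", "/lock"), ("create", "/leader", b"n1")])
failed = multi(store, [("set", "/leader", b"n2"), ("delete", "/missing")])
print(ok, failed, store)
```

In the second call the first operation would have succeeded on its own, but because the second one fails, the store is left exactly as it was.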

Page 26: Webinar: The Future of Hadoop

Q & A

Visit www.hadoopworld.com

• November 8-9, 2011 in New York City

• Early bird discount ends September 5, 2011

Enter Today: www.facebook.com/cloudera

• Click the “Be a Cloudera Hero for Apache Hadoop” tab

• Share what you think Apache Hadoop can do for you

• Win a personal hackathon with Doug Cutting in San Francisco, CA