Extending the EDW with Hadoop - Chicago Data Summit 2011

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster and Jonathan Seidman Chicago Data Summit April 26 | 2011

Upload: jonathan-seidman

Post on 02-Dec-2014

DESCRIPTION

Slides from talk at the Chicago Data Summit on 4/26/11: "Extending the Enterprise Data Warehouse with Hadoop".

TRANSCRIPT

Page 1: Extending the EDW with Hadoop - Chicago Data Summit 2011

Extending the Enterprise Data Warehouse with Hadoop

Robert Lancaster and Jonathan Seidman

Chicago Data Summit

April 26 | 2011

Page 2: Extending the EDW with Hadoop - Chicago Data Summit 2011

Who We Are

•  Robert Lancaster
   – Solutions Architect, Hotel Supply Team
   – [email protected]
   – @rob1lancaster

•  Jonathan Seidman
   – Lead Engineer, Business Intelligence/Big Data Team
   – Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG)
   – [email protected]
   – @jseidman

Page 3: Extending the EDW with Hadoop - Chicago Data Summit 2011

Orbitz launched: 2001, Chicago, IL

Page 4: Extending the EDW with Hadoop - Chicago Data Summit 2011

Why are we using Hadoop?

Stop me if you’ve heard this before…

Page 5: Extending the EDW with Hadoop - Chicago Data Summit 2011

On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day.

Page 6: Extending the EDW with Hadoop - Chicago Data Summit 2011

Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data.

$ per TB

Page 7: Extending the EDW with Hadoop - Chicago Data Summit 2011

And…

Hadoop places no constraints on how data is processed.

Page 8: Extending the EDW with Hadoop - Chicago Data Summit 2011

Before Hadoop

Page 9: Extending the EDW with Hadoop - Chicago Data Summit 2011

With Hadoop

Page 10: Extending the EDW with Hadoop - Chicago Data Summit 2011

Access to this non-transactional data enables a number of applications…

Page 11: Extending the EDW with Hadoop - Chicago Data Summit 2011

Optimizing Hotel Search

Page 12: Extending the EDW with Hadoop - Chicago Data Summit 2011

Recommendations

Page 13: Extending the EDW with Hadoop - Chicago Data Summit 2011

Page Performance Tracking

Page 14: Extending the EDW with Hadoop - Chicago Data Summit 2011

Cache Analysis

[Chart. x-axis: 1 to 20; y-axis: 0% to 100%. Series: Queries, Searches, Reverse Running Total (Searches), Reverse Running Total (Queries). Labeled values: 2.78%, 34.30%, 31.87%, 71.67%.]

72% of queries are singletons and make up nearly a third of total search volume.

A small number of queries (3%) make up more than a third of search volume.
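The two observations above come from a frequency profile of search queries. A minimal sketch of how such a profile could be computed is shown here; it is not the presenters' code. It assumes that a "singleton" is a query seen exactly once and that a "reverse running total" is the share at a given frequency or higher. The function name `query_frequency_profile` and the output keys are hypothetical.

```python
from collections import Counter


def query_frequency_profile(queries):
    """Summarize how search volume is spread over distinct queries.

    `queries` is an iterable with one entry per search (the query text).
    Assumption: a query's frequency is the number of searches that used it.
    """
    counts = Counter(queries)             # distinct query -> number of searches
    total_queries = len(counts)
    total_searches = sum(counts.values())

    # frequency -> number of distinct queries seen exactly that many times
    by_freq = Counter(counts.values())

    profile = []
    for freq in sorted(by_freq):
        n_queries = by_freq[freq]
        profile.append({
            "frequency": freq,
            "query_share": n_queries / total_queries,
            "search_share": n_queries * freq / total_searches,
        })

    # Reverse running totals: share at this frequency or higher.
    q_tail = 0.0
    s_tail = 0.0
    for row in reversed(profile):
        q_tail += row["query_share"]
        s_tail += row["search_share"]
        row["query_share_at_or_above"] = q_tail
        row["search_share_at_or_above"] = s_tail
    return profile
```

In this sketch, the row with `frequency == 1` gives the singleton share of queries and of searches, and the reverse running totals at higher frequencies show how much volume a small set of popular queries would cover if cached.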

Page 15: Extending the EDW with Hadoop - Chicago Data Summit 2011

User Segmentation

Page 16: Extending the EDW with Hadoop - Chicago Data Summit 2011

All of this is great, but…

Most of these efforts are driven by development teams.

The challenge now is to unlock the value in this data by making it more available to the rest of the organization.

Page 17: Extending the EDW with Hadoop - Chicago Data Summit 2011

“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being ‘magnetic’: attracting all the data sources that crop up within an organization regardless of data quality niceties.”*

*MAD Skills: New Analysis Practices for Big Data

Page 18: Extending the EDW with Hadoop - Chicago Data Summit 2011

In a better world…

Page 19: Extending the EDW with Hadoop - Chicago Data Summit 2011

Integrating Hadoop with the Enterprise Data Warehouse

Page 20: Extending the EDW with Hadoop - Chicago Data Summit 2011

The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis.

Page 21: Extending the EDW with Hadoop - Chicago Data Summit 2011

BI vendors are working on integration with Hadoop…

Page 22: Extending the EDW with Hadoop - Chicago Data Summit 2011

And one more reporting tool…

Page 23: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Processing Pipeline for Web Analytics Data

Page 24: Extending the EDW with Hadoop - Chicago Data Summit 2011

Aggregating data for import into Data Warehouse

Page 25: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Use Case: Beta Data Processing

Page 26: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Use Case – Beta Data Processing

Page 27: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Use Case – Beta Data Processing Output

Page 28: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Use Case: RCDC Processing

Page 29: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Use Case – RCDC Processing

Page 30: Extending the EDW with Hadoop - Chicago Data Summit 2011

Example Use Case: Click Data Processing

Page 31: Extending the EDW with Hadoop - Chicago Data Summit 2011

Click Data Processing – Current DW Processing

Web Servers -> Web Server Logs -> ETL -> DW -> Data Cleansing (Stored procedure) -> DW

Diagram annotations: 3 hours, 2 hours, ~20% original data size

Page 32: Extending the EDW with Hadoop - Chicago Data Summit 2011

Click Data Processing – New Hadoop Processing

Web Servers -> Web Server Logs -> HDFS -> Data Cleansing (MapReduce) -> DW
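The slide does not show the cleansing job itself or say how it was written. As an illustration only, the cleansing step could be a map-only job: each mapper reads raw log lines, drops malformed records, and emits normalized ones for loading into the DW. The sketch below is written in the style of a Hadoop Streaming mapper (lines in, tab-separated lines out). The three-field record layout and the names `EXPECTED_FIELDS`, `clean_line`, and `run_mapper` are assumptions, not details from the talk.

```python
EXPECTED_FIELDS = 3  # hypothetical layout: timestamp, session id, URL


def clean_line(line):
    """Return a normalized tab-separated record, or None to drop the line."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    # Drop records with the wrong number of fields or with empty fields.
    if len(fields) != EXPECTED_FIELDS or not all(fields):
        return None
    return "\t".join(fields)


def run_mapper(lines, out):
    """Write one cleaned record per valid input line to `out`."""
    for line in lines:
        record = clean_line(line)
        if record is not None:
            out.write(record + "\n")


# In a streaming job the script's entry point would be:
#     import sys
#     run_mapper(sys.stdin, sys.stdout)
```

Because each line is handled independently, this kind of filter needs no reduce phase, so the work spreads across the cluster and only the cleaned, smaller data set has to be loaded into the warehouse.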

Page 33: Extending the EDW with Hadoop - Chicago Data Summit 2011

Conclusions

•  Market is still immature, but Hadoop has already become a valuable business intelligence tool, and will become an increasingly important part of a BI infrastructure.

•  Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to their BI infrastructure.

•  Use Hadoop to offload the time- and resource-intensive processing of large data sets, freeing up your data warehouse to serve user needs.

•  The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility.

Page 34: Extending the EDW with Hadoop - Chicago Data Summit 2011

Oh, and also…

•  Orbitz is looking for a Lead Engineer for the BI/Big Data team.

•  Go to http://careers.orbitz.com/ and search for IRC19035.

Page 35: Extending the EDW with Hadoop - Chicago Data Summit 2011

References

•  MAD Skills: New Analysis Practices for Big Data, Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, 2009
