extending the edw with hadoop - chicago data summit 2011

Extending the Enterprise Data Warehouse with Hadoop

Robert Lancaster and Jonathan Seidman

Chicago Data Summit

April 26, 2011

Who We Are

•  Robert Lancaster

– Solutions Architect, Hotel Supply Team

– @rob1lancaster

•  Jonathan Seidman

–  Lead Engineer, Business Intelligence/Big Data Team

– Co-founder/organizer of Chicago Hadoop User Group (http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG)

– @jseidman

Launched: 2001, Chicago, IL

Why are we using Hadoop?

Stop me if you’ve heard this before…

On Orbitz alone we do millions of searches and transactions daily, which leads to hundreds of gigabytes of log data every day.

Hadoop provides us with efficient, economical, scalable, and reliable storage and processing of these large amounts of data.

$ per TB

And…

Hadoop places no constraints on how data is processed.

Before Hadoop

With Hadoop

Access to this non-transactional data enables a number of applications…

Optimizing Hotel Search

Recommendations

Page Performance Tracking

Cache Analysis

[Chart: Queries and Searches, y-axis 0% to 100%, x-axis 1 to 20. Series: Queries, Searches, Reverse Running Total (Searches), Reverse Running Total (Queries). Labeled values: 2.78%, 34.30%, 31.87%, 71.67%.]

72% of queries are singletons and make up nearly a third of total search volume.

A small number of queries (3%) make up more than a third of search volume.
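Figures like the singleton share above fall out of a simple frequency count over the search log. The following is a minimal sketch in Python, assuming a hypothetical list with one query key per search; the function name `singleton_stats` and the sample data are illustrative only and are not taken from the Orbitz analysis.

```python
from collections import Counter

def singleton_stats(searches):
    """Return (share of distinct queries seen exactly once,
    share of all searches that those singleton queries account for)."""
    counts = Counter(searches)   # query key -> number of searches
    total = len(searches)
    if total == 0:
        return 0.0, 0.0
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    # Each singleton query contributes exactly one search to the volume.
    return singletons / distinct, singletons / total

# Hypothetical sample: one entry per search, keyed by destination and date.
sample = [
    "chi|2011-05-01", "chi|2011-05-01", "chi|2011-05-01",
    "nyc|2011-05-02",
    "sfo|2011-06-10",
]
query_share, search_share = singleton_stats(sample)
```

In this toy sample, two of the three distinct queries are singletons and they account for two of the five searches. A low singleton share of search volume would suggest that caching popular queries pays off; a high one limits what a cache can do.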

User Segmentation

All of this is great, but…

Most of these efforts are driven by development teams.

The challenge now is to unlock the value in this data by making it more available to the rest of the organization.

“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being ‘magnetic’: attracting all the data sources that crop up within an organization regardless of data quality niceties.”*

*MAD Skills: New Analysis Practices for Big Data

In a better world…

Integrating Hadoop with the Enterprise Data Warehouse

The goal is a unified view of the data, allowing us to use the power of our existing tools for reporting and analysis.

BI vendors are working on integration with Hadoop…

And one more reporting tool…

Example Processing Pipeline for Web Analytics Data

Aggregating data for import into Data Warehouse

Example Use Case: Beta Data Processing

Example Use Case – Beta Data Processing Output

Example Use Case: RCDC Processing

Example Use Case: Click Data Processing

Click Data Processing – Current DW Processing

Web server logs -> ETL -> DW -> Data cleansing (stored procedure) -> DW

Timings shown in the diagram: 3 hours and 2 hours. Cleansed data is ~20% of the original data size.

Click Data Processing – New Hadoop Processing

Web server logs -> HDFS -> Data cleansing (MapReduce) -> DW
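In the Hadoop version of this pipeline, data cleansing runs as a MapReduce job rather than a stored procedure. As a rough sketch of what a map-only cleansing pass could look like, here is a Python function of the kind a Hadoop Streaming mapper might use. The tab-separated field layout, the bot filter, and every name below are assumptions made for illustration; this is not the actual Orbitz job.

```python
import sys

# Assumed record layout (hypothetical): timestamp, session_id, url, user_agent
EXPECTED_FIELDS = 4
# Hypothetical markers used to drop automated traffic
BOT_MARKERS = ("bot", "spider", "crawler")

def clean_line(line):
    """Return a cleaned tab-separated record, or None if the line is dropped."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    # Drop malformed records: wrong column count or an empty key column.
    if len(fields) != EXPECTED_FIELDS or not all(fields[:3]):
        return None
    timestamp, session_id, url, user_agent = fields
    # Drop traffic that looks automated.
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        return None
    # Keep only the columns the warehouse load is assumed to need.
    return "\t".join((timestamp, session_id, url))

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming-style loop: read raw log lines, write cleaned records."""
    for line in stdin:
        cleaned = clean_line(line)
        if cleaned is not None:
            stdout.write(cleaned + "\n")
```

A streaming job would call `run_mapper()` as the script entry point. Because each line is cleaned independently, the work spreads across the cluster, and only the reduced output needs to be loaded into the data warehouse.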

Conclusions

•  The market is still immature, but Hadoop has already become a valuable business intelligence tool and will become an increasingly important part of BI infrastructure.

•  Hadoop won’t replace your EDW, but any organization with a large EDW should at least be exploring Hadoop as a complement to its BI infrastructure.

•  Use Hadoop to offload the time- and resource-intensive processing of large data sets so you can free up your data warehouse to serve user needs.

•  The challenge now is making Hadoop more accessible to non-developers. Vendors are addressing this, so expect rapid advancements in Hadoop accessibility.

Oh, and also…

•  Orbitz is looking for a Lead Engineer for the BI/Big Data team.

•  Go to http://careers.orbitz.com/ and search for IRC19035.

References

•  MAD Skills: New Analysis Practices for Big Data, Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, and Caleb Welton, 2009
