![Page 1: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/1.jpg)
INFO-H-419: Data Warehouses project
Hadoop in Data Warehousing
by Alexey Grigorev
![Page 2: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/2.jpg)
Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
![Page 3: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/3.jpg)
Why?
• Lots of data
• How to deal with it?
• Hadoop to the rescue!
• When to use it?
• When not to use it?
• Curiosity
![Page 4: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/4.jpg)
MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
• apply to each element of the list
• reduce = fold = accumulate
• aggregate a list and produce one value of output
• No side effects
![Page 5: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/5.jpg)
MapReduce: Origins
• (define (+1 el) (+ el 1))
• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
![Page 6: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/6.jpg)
MapReduce: Origins
• These functions do not have side effects
• So they can be parallelized easily
• The input data can be split into chunks:
• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) the results
![Page 7: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/7.jpg)
MapReduce: Origins
• Mapping separately:
• (define res1 (reduce + 0 (map +1 (list 1 2))))
• (reduce + res1 (map +1 (list 3 4)))
• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that to combine chunks processed in parallel, the function passed to reduce must be associative (as + is)
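The chunking idea above can be sketched in Python. This is an illustration only, not part of the original slides: `functools.reduce` plays the role of `reduce`, the helper names `add_one`, `map_reduce` and `chunked_sum` are made up for this example, and the chunks are processed one after another rather than on separate machines.

```python
from functools import reduce
from operator import add


def add_one(el):
    """Counterpart of the +1 function from the slides."""
    return el + 1


def map_reduce(chunk):
    """Map add_one over one chunk, then fold the results with +."""
    return reduce(add, map(add_one, chunk), 0)


def chunked_sum(data, chunk_size):
    """Split data into chunks, process each chunk on its own, combine the partial results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    partials = [map_reduce(chunk) for chunk in chunks]  # these calls are independent
    return reduce(add, partials, 0)


whole = map_reduce([1, 2, 3, 4])
split = chunked_sum([1, 2, 3, 4], chunk_size=2)
print(whole, split)  # both are 14
```

Because `+` is associative, the partial results can be combined in any grouping, which is what makes the per-chunk calls safe to run in parallel.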
![Page 8: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/8.jpg)
MapReduce
• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output
![Page 9: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/9.jpg)
MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group stage: group the intermediate results by key
• reduce stage: apply reduce to each group
![Page 10: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/10.jpg)
MapReduce Stages
[Figure: four data sources feed four map tasks, whose output is passed to three reduce tasks]
map: (in_key, in_val) -> [(out_key, out_val)]
reduce: (out_key, [out_val]) -> [res_val]
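The three stages and the two signatures above can be sketched as a tiny in-memory engine. This is a single-process illustration under stated assumptions, not how Hadoop is implemented: `run_mapreduce` and the order-totals job (`map_orders`, `reduce_orders`, the `orders` data) are hypothetical names invented for this example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce job in memory over (in_key, in_val) records."""
    # 1. map stage: every input pair yields zero or more intermediate pairs
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # 2. group stage: collect the intermediate values by key
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # 3. reduce stage: aggregate each group into the final output
    return {key: list(reduce_fn(key, vals)) for key, vals in groups.items()}


# Hypothetical job: total order amount per customer.
def map_orders(order_id, order):
    customer, amount = order
    yield customer, amount


def reduce_orders(customer, amounts):
    yield sum(amounts)


orders = [(1, ("alice", 10)), (2, ("bob", 5)), (3, ("alice", 7))]
totals = run_mapreduce(orders, map_orders, reduce_orders)
print(totals)  # {'alice': [17], 'bob': [5]}
```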
![Page 11: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/11.jpg)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis
sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem
nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris
mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.
Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.
Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod
massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis
fringilla dolor ornare mi dictum ornare.
![Page 12: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/12.jpg)
MapReduce Example
import re

def map(input_key, doc):
    # input_key: document name, doc: document contents
    for w in re.findall(r"\w+", doc.lower()):
        yield (w, 1)  # EmitIntermediate(w, 1)

def reduce(output_key, output_vals):
    # output_key: a word, output_vals: all the counts emitted for that word
    res = 0
    for v in output_vals:
        res += v
    yield res  # Emit(res)
![Page 13: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/13.jpg)
MapReduce Example
• map stage: output 1 for each word w
• group a list of pairs (w, 1) into (w, [1, 1, ..., 1])
• reduce stage: for each w calculate how many ones there are
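To make the intermediate results visible, the three stages can be traced by hand on the first two sentences of the sample text. This is a sketch in plain Python with no Hadoop involved; lowercasing and stripping punctuation with a regular expression is an assumption of this example.

```python
import re
from itertools import groupby

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Aenean dictum justo est, quis sagittis leo tincidunt sit amet.")

# map stage: emit (w, 1) for each word w
pairs = [(w, 1) for w in re.findall(r"[a-z]+", text.lower())]

# group stage: sort by word, then turn runs of (w, 1) into (w, [1, 1, ..., 1])
pairs.sort(key=lambda pair: pair[0])
grouped = {w: [one for _, one in run]
           for w, run in groupby(pairs, key=lambda pair: pair[0])}

# reduce stage: for each w, count the ones
counts = {w: sum(ones) for w, ones in grouped.items()}

print(grouped["amet"])  # [1, 1]
print(counts["amet"], counts["sit"], counts["lorem"])  # 2 2 1
```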
![Page 14: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/14.jpg)
MapReduce Example: Result
• amet: 2
• ante: 2
• aptent: 1
• consectetur: 1
• dictum: 3
• dolor: 2
• elit: 3
• ...
![Page 15: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/15.jpg)
Hadoop
http://flickr.com/photos/erikeldridge/3614786392/
![Page 16: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/16.jpg)
Hadoop
“... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.”
![Page 17: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/17.jpg)
Hadoop
• Open-source implementation of MapReduce
• "Hadoop" usually refers to a whole ecosystem:
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
![Page 18: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/18.jpg)
Hadoop Cluster: Terminology
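• Name Node: the master node, orchestrates the process (strictly speaking, the NameNode manages HDFS metadata, while the JobTracker schedules MapReduce tasks)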
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
![Page 19: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/19.jpg)
Hadoop
[Figure: data flow of a Hadoop job. Mapper: Read the file from HDFS, Map, Combine, write to local storage. Reducer: Pull (Copy) from the mappers' local storage, Sort, Reduce, write the result.]
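The Combine step in the data flow can be illustrated with word count: a combiner pre-aggregates the output of one mapper locally, so fewer pairs have to be copied to the reducers. The sketch below is a plain-Python simulation; the splits and the function names `map_words`, `combine` and `reduce_counts` are assumptions made for this example, not Hadoop APIs.

```python
from collections import Counter

# Hypothetical input splits, one per mapper.
splits = ["sit amet sit", "amet sit"]


def map_words(split):
    """Map: emit (word, 1) for every word in the split."""
    return [(w, 1) for w in split.split()]


def combine(pairs):
    """Combine: sum the counts per word locally, on the mapper's side."""
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return sorted(local.items())


def reduce_counts(all_pairs):
    """Reduce: sum the (already partially aggregated) counts per word."""
    total = Counter()
    for word, n in all_pairs:
        total[word] += n
    return dict(total)


mapped = [map_words(s) for s in splits]
combined = [combine(pairs) for pairs in mapped]

shipped_without = sum(len(pairs) for pairs in mapped)    # pairs copied with no combiner
shipped_with = sum(len(pairs) for pairs in combined)     # pairs copied with a combiner
result = reduce_counts(pair for pairs in combined for pair in pairs)

print(shipped_without, shipped_with)  # 5 4
print(result)  # {'amet': 2, 'sit': 3}
```

The saving is tiny here, but it grows with the number of repeated keys per split.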
![Page 20: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/20.jpg)
http://escience.washington.edu/get-help-now/what-hadoop
![Page 21: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/21.jpg)
![Page 22: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/22.jpg)
![Page 23: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/23.jpg)
![Page 24: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/24.jpg)
![Page 25: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/25.jpg)
![Page 26: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/26.jpg)
![Page 27: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/27.jpg)
Fault-Tolerance ≈ Load-Balancing
• No execution plan
• Node failed ⇒ Task reassigned
• Node done ⇒ Another task assigned
• No communication costs
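The two rules above (a failed node's task is reassigned, a finished node gets another task) can be sketched as a small deterministic simulation. This is only an illustration, not Hadoop's scheduler: `run_tasks`, the worker names and the hand-picked `failures` set are all hypothetical.

```python
from collections import deque


def run_tasks(tasks, workers, failures):
    """Assign tasks to free workers; a task lost to a failed worker goes back to the queue.

    failures is a set of (worker, task) pairs that crash, chosen by hand to keep
    the simulation deterministic.
    """
    pending = deque(tasks)
    alive = list(workers)
    done = {}  # task -> worker that completed it
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        for worker in list(alive):
            if not pending:
                break
            task = pending.popleft()       # worker is free: give it the next task
            if (worker, task) in failures:
                alive.remove(worker)       # node failed ...
                pending.append(task)       # ... so its task is reassigned
            else:
                done[task] = worker        # node done: it gets another task next round
    return done


done = run_tasks(
    tasks=["map-0", "map-1", "map-2", "map-3"],
    workers=["w1", "w2"],
    failures={("w2", "map-1")},
)
print(done)
```

There is no plan fixed in advance: tasks are handed out at run time, so the same mechanism covers both a crashed node and a node that simply finishes early.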
![Page 28: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/28.jpg)
Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
![Page 29: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/29.jpg)
Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
![Page 30: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/30.jpg)
Disadvantages
[Abouzeid et al., 2009]
![Page 31: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/31.jpg)
Hadoop as a Data Warehouse
• Cheetah
• Hive
![Page 32: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/32.jpg)
Cheetah
• Typical DW relational-like schemas
• ... but not exactly
• They call them virtual views
![Page 33: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/33.jpg)
Cheetah
![Page 34: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/34.jpg)
Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
![Page 35: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/35.jpg)
Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
![Page 36: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/36.jpg)
HiveQL
Tables
status_updates(userid int, status string, ds string)
profiles(userid int, school string, gender int)
LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
![Page 37: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/37.jpg)
HiveQL
-- join the status updates of one day with the user profiles
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid and a.ds = '2009-03-20')) subq1
-- number of status updates per gender
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
-- number of status updates per school
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1)
  GROUP BY subq1.school
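Hive compiles HiveQL into MapReduce jobs. The sketch below shows one way the query above could be expressed as map and reduce steps: a join keyed on userid, followed by one counting job per summary table. It is a simplified illustration and not Hive's actual query plan; the sample rows, the school names and the helpers `run_job`, `join_map`, `join_reduce` and `count_by` are all made up.

```python
from collections import defaultdict

# Made-up sample rows; the column layout follows the table definitions.
status_updates = [(1, "hello", "2009-03-20"), (2, "lol", "2009-03-20"),
                  (1, "bye", "2009-03-20"), (3, "old", "2009-03-19")]
profiles = [(1, "ULB", 0), (2, "VUB", 1), (3, "ULB", 1)]


def run_job(records, map_fn, reduce_fn):
    """Minimal in-memory MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [out for key, values in groups.items() for out in reduce_fn(key, values)]


def join_map(record):
    """Key every row by userid and tag it with its source table."""
    table, row = record
    if table == "a":                      # status_updates a
        userid, status, ds = row
        if ds == "2009-03-20":
            yield userid, ("a", status)
    else:                                 # profiles b
        userid, school, gender = row
        yield userid, ("b", (school, gender))


def join_reduce(userid, tagged):
    """Pair each status of a user with that user's profile."""
    statuses = [v for tag, v in tagged if tag == "a"]
    user_profiles = [v for tag, v in tagged if tag == "b"]
    for status in statuses:
        for school, gender in user_profiles:
            yield status, school, gender


def count_by(column):
    """Build map and reduce functions for SELECT col, count(1) GROUP BY col."""
    def map_fn(row):
        yield row[column], 1

    def reduce_fn(key, ones):
        yield key, sum(ones)

    return map_fn, reduce_fn


tagged_input = [("a", r) for r in status_updates] + [("b", r) for r in profiles]
subq1 = run_job(tagged_input, join_map, join_reduce)
gender_summary = dict(run_job(subq1, *count_by(2)))
school_summary = dict(run_job(subq1, *count_by(1)))
print(gender_summary, school_summary)
```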
![Page 38: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/38.jpg)
![Page 39: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/39.jpg)
HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) AS cnt
  FROM
    (MAP b.school, a.status
     USING 'meme_extractor.py'
     AS (school, meme)
     FROM status_updates a JOIN profiles b
     ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
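Scripts plugged in with MAP ... USING and REDUCE ... USING receive rows as tab-separated lines on standard input and write tab-separated lines to standard output. The slides do not show `top10.py`, so the following is a purely hypothetical sketch of what its core could look like; `top_n` is an invented name, and the sketch assumes that all rows of one school reach the same process.

```python
from collections import defaultdict


def top_n(lines, n=10):
    """Yield the n most frequent memes per school from tab-separated school, meme, cnt lines."""
    by_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        by_school[school].append((int(cnt), meme))
    for school, memes in by_school.items():
        # highest count first; ties broken alphabetically by meme
        for cnt, meme in sorted(memes, key=lambda cm: (-cm[0], cm[1]))[:n]:
            yield f"{school}\t{meme}\t{cnt}"


# Made-up input rows, as they would arrive on standard input.
sample = ["ULB\tcat\t3\n", "ULB\tdog\t5\n", "VUB\tcat\t1\n"]
for row in top_n(sample, n=1):
    print(row)
```

In a real script, the loop at the bottom would iterate over `sys.stdin` instead of the `sample` list.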
![Page 40: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/40.jpg)
Hadoop + Data Warehouse
http://www.flickr.com/photos/mrflip/5150336351/in/photostream/
![Page 41: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/41.jpg)
Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
![Page 42: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/42.jpg)
ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
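The Transform and Load steps can be sketched on a toy example. Everything specific here is an assumption: the pipe-separated call-center log format, the field names and the `transform` function are invented for illustration, and the code runs locally instead of as a MapReduce job over HDFS.

```python
import csv
import io

# Hypothetical raw call-center log: "timestamp|customer id|free text", with noise.
raw_lines = [
    "2013-12-01 10:15|42|  Great service, thanks! ",
    "garbage line without separators",
    "2013-12-01 10:20|17|Still WAITING for a refund",
]


def transform(line):
    """Transform: parse one raw line into a clean row, or None if it is malformed."""
    parts = line.split("|")
    if len(parts) != 3 or not parts[1].strip().isdigit():
        return None
    timestamp, customer_id, text = (p.strip() for p in parts)
    return {"timestamp": timestamp, "customer_id": int(customer_id), "text": text.lower()}


# In Hadoop this could be the map function of a map-only job over files in HDFS.
rows = [row for row in map(transform, raw_lines) if row is not None]

# Load: write a structured file that a DW bulk loader could ingest.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["timestamp", "customer_id", "text"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```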
![Page 43: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/43.jpg)
ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
![Page 44: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/44.jpg)
Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• Like: extract new keywords/features from the old dataset
![Page 45: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/45.jpg)
Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be much cheaper than high-cost data management solutions
• Move this data to Hadoop
• When needed, quickly analyze it there or move it back to the DW
![Page 46: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/46.jpg)
Analytical Sandbox
![Page 47: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/47.jpg)
http://www.flickr.com/photos/pasukaru76/9824401426/
![Page 48: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/48.jpg)
http://www.flickr.com/photos/pasukaru76/4977447932/
![Page 49: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/49.jpg)
Analytical Sandbox
• What are we looking for in this data?
• No structure, so it is hard to know
• Run ad-hoc Hive queries to see what's there
![Page 50: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/50.jpg)
Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used together with existing DW solutions
• as an ETL tool
• as Active Storage
• as an Analytical Sandbox
![Page 51: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/51.jpg)
References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf]
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
![Page 52: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/52.jpg)
References
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]
![Page 53: Hadoop in Data Warehousing](https://reader030.vdocuments.net/reader030/viewer/2022020218/55abd9b81a28ab64678b4761/html5/thumbnails/53.jpg)
Thank you