datawarehousing 101
Post on 03-Jun-2018
TRANSCRIPT
-
8/12/2019 Datawarehousing 101
1/70
Data Warehousing 101
Everything you never wanted to know about big databases but were forced to find out anyway
Josh Berkus
Open Source Bridge 2011
-
contents
covering:
  concepts of DW
  some DW techniques
  databases
not covering:
  hardware
  analytics/reporting tools
-
BIG DATA
-
190
-
What is a "data warehouse"?
-
Big Data?
-
OLTP vs DW
OLTP:
  many single-row writes
  current data
  queries generated by user activity
  < 1s response times
  1000's of users
DW:
  few large batch imports
  years of data
  queries generated by large reports
  queries can run for minutes/hours
  10's of users
-
OLTP vs DW
OLTP: big data for many concurrent requests to small amounts of data each
DW: big data for low concurrency requests to very large amounts of data each
-
synonyms & subclasses
-
archiving
-
archiving
WORN data: "write once, read never"
grows indefinitely
usually a result of regulatory compliance
main concern: storage efficiency
-
data mining
-
data mining
the database where you don't know what's in there, but you want to find out
lots of data (TB to PB)
mostly "semi-structured"
data produced as a side effect of other business processes
needs CPU-intensive processing
-
BI: Business Intelligence
DSS: Decision Support
OLAP: Online Analytical Processing
Analytics
-
BI/DSS/OLAP/Analytics
databases which support visualization of large amounts of data
data is fairly well understood
most data can be reduced to categories, geography, and taxonomy
primarily about indexing
-
What is a "dimension"?
-
dimensions vs. facts
Fact Table
customers / accounts
category
subcategory
sub-subcategory
-
dimension examples
location/region/country/quadrant
product categorization
URL
transaction type
account hierarchy
IP address
OS/version/build
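One way to picture the split between facts and dimensions, as a minimal sketch: fact rows carry a measurement plus a key into a small dimension table, and a report rolls the facts up along one level of that dimension. Every name here (`regions`, `sales`, `rollUp`) and all of the sample values are hypothetical, chosen only for illustration.

```javascript
// Hypothetical dimension table: a small lookup of descriptive attributes.
const regions = {
  1: { city: "Portland", country: "US" },
  2: { city: "Seattle", country: "US" },
  3: { city: "Vancouver", country: "CA" },
};

// Hypothetical fact table: one row per event, holding a measure (amount)
// and a key into the dimension (regionId).
const sales = [
  { regionId: 1, amount: 10 },
  { regionId: 2, amount: 5 },
  { regionId: 3, amount: 7 },
  { regionId: 1, amount: 3 },
];

// Roll the facts up along one level of the dimension hierarchy.
function rollUp(facts, dimension, level) {
  const totals = {};
  for (const fact of facts) {
    const key = dimension[fact.regionId][level];
    totals[key] = (totals[key] || 0) + fact.amount;
  }
  return totals;
}

const byCountry = rollUp(sales, regions, "country");
const byCity = rollUp(sales, regions, "city");
console.log(byCountry, byCity);
```

The same fact rows answer both the per-country and the per-city question; only the dimension level changes.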
-
dimension synonyms
facet
taxonomy
secondary index
view
-
What is ETL?
-
Extract Transform Load
how you turn external raw data into useful database data
Apache logs → web analytics DB
CSV POS files → financial reporting DB
OLTP server → 10-year data warehouse
also called ELT when the transformation is done inside the database
-
Purpose of ETL/ELT
getting data into the data warehouse
clean up garbage data
split out attributes
"normalize" dimensional data
deduplication
calculate materialized views / indexes
-
ETL Tools
K.E.T.T.L.E.
-
ETL Tools
-
Ad-hoc scripting
-
ELT Tips
think volume
bulk processing or parallel processing
no row-at-a-time, document-at-a-time
insert into permanent storage should be the last step
no updates
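These tips can be sketched as a single batch function: clean and deduplicate the whole batch in memory first, then append it to permanent storage as the very last step, never updating rows already stored. This is only an illustration under assumed inputs; `rawBatch`, `warehouse`, and `loadBatch` are hypothetical names, and a real load would use the database's bulk-load facility rather than an array.

```javascript
// Hypothetical raw batch, e.g. parsed log lines.
const rawBatch = [
  { id: "a1", path: "/home", ms: "120" },
  { id: "a2", path: "/about", ms: "oops" }, // garbage duration
  { id: "a1", path: "/home", ms: "120" },   // duplicate
  { id: "a3", path: "/home", ms: "80" },
];

// Stand-in for permanent storage; it only ever receives appended batches.
const warehouse = [];

function loadBatch(batch, storage) {
  // 1. Clean: convert durations to numbers and drop rows that fail.
  const cleaned = batch
    .map((row) => ({ ...row, ms: Number(row.ms) }))
    .filter((row) => Number.isFinite(row.ms));

  // 2. Deduplicate on id, keeping the first occurrence.
  const seen = new Set();
  const unique = cleaned.filter((row) => {
    if (seen.has(row.id)) return false;
    seen.add(row.id);
    return true;
  });

  // 3. Last step: append the finished batch in one operation, with no
  //    updates to rows already stored.
  storage.push(...unique);
  return unique.length;
}

const loaded = loadBatch(rawBatch, warehouse);
console.log(loaded, warehouse);
```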
-
Queues not Extract
-
What kind of database should I use for DW?
-
5 Types
1. Standard Relational
2. MPP
3. Column Store
4. Map/Reduce
5. Enterprise Search
-
standard relational
the all-purpose solution for not-that-big data
adequate for all tasks, but not excellent at any of them
easy to use
low resource requirements
well-supported by all software
familiar
not suitable for really big data
-
What's MPP?
-
Massively Parallel Processing
-
appliance software
-
MPP
cpu-intensive data warehousing
data mining, some analytics
supporting complex query logic
moderately big data (1-200TB)
drawbacks: proprietary, expensive
now hybridizes with other types
-
What's a column store?
-
column store
-
column store
inversion of a row store:
indexes become data
data becomes indexes
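As a rough illustration of that inversion, assuming a tiny made-up table: a row store keeps each record together, while a column store keeps each attribute together, so an aggregate over one attribute reads only that attribute's values. The `rows` data and the `toColumns` helper are hypothetical and ignore everything a real column store does about compression and disk layout.

```javascript
// Row store: each record is kept together.
const rows = [
  { id: 1, region: "US", amount: 10 },
  { id: 2, region: "CA", amount: 7 },
  { id: 3, region: "US", amount: 5 },
];

// Column store: each attribute is kept together, one array per column.
function toColumns(records) {
  const columns = {};
  for (const record of records) {
    for (const [name, value] of Object.entries(record)) {
      (columns[name] = columns[name] || []).push(value);
    }
  }
  return columns;
}

const columns = toColumns(rows);

// An aggregate over one attribute touches a single array and never
// reads the other columns.
const totalAmount = columns.amount.reduce((sum, v) => sum + v, 0);
console.log(columns, totalAmount);
```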
-
column stores
-
column stores
for aggregations and transformations of highly structured data
good for BI, analytics, some archiving
moderately big data (0.5-100TB)
bad for data mining
slow to add new data / purge data
usually support compression
-
What's map/reduce?
-
map/reduce
-
map/reduce
-
map/reduce
// map
function(doc) {
  for (var i in doc.links)
    emit([doc.parent, i], null);
}
// reduce
function(keys, values) {
  return null;
}
-
map/reduce
// Map
function (doc) {
  emit(doc.val, doc.val)
}
// Reduce
function (keys, values, rereduce) {
  // This computes the standard deviation of the mapped results
  var stdDeviation = 0.0;
  var count = 0;
  var total = 0.0;
  var sqrTotal = 0.0;

  if (!rereduce) {
    // This is the reduce phase, we are reducing over emitted values from
    // the map functions.
    for (var i in values) {
      total = total + values[i];
      sqrTotal = sqrTotal + (values[i] * values[i]);
    }
    count = values.length;
  } else {
    // This is the rereduce phase, we are re-reducing previously
    // reduced values.
    for (var i in values) {
      count = count + values[i].count;
      total = total + values[i].total;
      sqrTotal = sqrTotal + values[i].sqrTotal;
    }
  }

  var variance = (sqrTotal - ((total * total) / count)) / count;
  stdDeviation = Math.sqrt(variance);

  // the reduce result. It contains enough information to be rereduced
  // with other reduce results.
  return {stdDeviation: stdDeviation, count: count,
    total: total, sqrTotal: sqrTotal};
};
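To see why the reduce result carries `count`, `total`, and `sqrTotal` along with the answer, here is a small stand-alone harness, offered as a sketch rather than as how CouchDB actually schedules work. It restates the reduce logic above under the hypothetical name `stdDevReduce`, reduces two made-up chunks separately, rereduces the partial results, and compares that against reducing everything at once.

```javascript
// Same logic as the reduce function above, given a name so it can be called.
function stdDevReduce(keys, values, rereduce) {
  let count = 0;
  let total = 0.0;
  let sqrTotal = 0.0;
  if (!rereduce) {
    // Reduce phase: values are the raw emitted numbers.
    for (const v of values) {
      total += v;
      sqrTotal += v * v;
    }
    count = values.length;
  } else {
    // Rereduce phase: values are earlier reduce results.
    for (const v of values) {
      count += v.count;
      total += v.total;
      sqrTotal += v.sqrTotal;
    }
  }
  const variance = (sqrTotal - (total * total) / count) / count;
  return { stdDeviation: Math.sqrt(variance), count, total, sqrTotal };
}

// Hypothetical emitted values, split into two chunks as a database might.
const chunkA = [2, 4, 4, 4];
const chunkB = [5, 5, 7, 9];

const partialA = stdDevReduce(null, chunkA, false);
const partialB = stdDevReduce(null, chunkB, false);
const combined = stdDevReduce(null, [partialA, partialB], true);
const direct = stdDevReduce(null, chunkA.concat(chunkB), false);
console.log(combined.stdDeviation, direct.stdDeviation);
```

Both paths give the same population standard deviation because the partial results keep the running sums rather than only the final figure.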
-
map/reduce vs. MPP
map/reduce:
  open source
  petabytes
  write routines by hand
  inefficient
  generic
  cheap HW / cloud
  DIY tools
MPP:
  proprietary
  terabytes
  advanced query support
  efficient
  specific
  needs good HW
  integrated tools
-
What's enterprise search?
-
enterprise search
ElasticSearch
-
enterprise search
when you need to do DW with a huge pile of partly processed "documents"
does: light data mining, light BI/analytics
best "full text" and keyword search
supports "approximate results"
lots of special features for web data
-
What's a windowing query?
-
regular aggregate
-
windowing function
-
CREATE TABLE events (
  event_id INT,
  event_type TEXT,
  start TIMESTAMPTZ,
  duration INTERVAL,
  event_desc TEXT
);
-
SELECT MAX(concurrent)
FROM (
  SELECT SUM(tally)
    OVER (ORDER BY start)
    AS concurrent
  ...
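The running-sum idea behind this query can be sketched outside SQL. This is a minimal illustration, not the speaker's code: the sample `events` values are made up, and the way the tally rows are built (+1 at each start, -1 at each start plus duration) is an assumption, since the slide text does not show where `tally` comes from.

```javascript
// Hypothetical events: start time and duration in minutes.
const events = [
  { start: 0, duration: 30 },
  { start: 10, duration: 10 },
  { start: 15, duration: 20 },
  { start: 40, duration: 5 },
];

// Turn each event into two tally rows: +1 when it starts, -1 when it ends.
const tallies = events.flatMap((e) => [
  { at: e.start, tally: 1 },
  { at: e.start + e.duration, tally: -1 },
]);

// ORDER BY time; on ties, apply ends before starts so back-to-back
// events are not counted as overlapping.
tallies.sort((a, b) => a.at - b.at || a.tally - b.tally);

// Running SUM(tally) over the ordered rows, then MAX of the running sum.
let concurrent = 0;
let maxConcurrent = 0;
for (const row of tallies) {
  concurrent += row.tally;
  maxConcurrent = Math.max(maxConcurrent, concurrent);
}
console.log(maxConcurrent);
```

The loop plays the role of the window: each row sees the sum of every row ordered before it, which a regular aggregate cannot express without a self-join or several passes.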
-
stream processing SQL
replace multiple queries with a single query
avoid scanning large tables multiple times
replace pages of application code and
-
What's a materialized view?
-
query results as table
calculate once, read many times
complex/expensive queries
frequently referenced
not necessarily a whole query
often part of a query
might be manually or automatically updated
depends on product
-
non-relational matviews
CouchDB Views
  cache results of map/reduce jobs
  updated on data read
Solr / Elastic Search "Faceted Search"
  cached indexed results of complex searches
  updated on data change
-
maintaining matviews
BEST: update matviews at batch load time
GOOD: update matview according to clock/calendar
FAIR: update matview on data request
BAD for DW: update matviews using a trigger
-
matview tips
matviews should be small: 1/10 to ¼ of RAM on each node
each matview should support several queries or one really really important one
truncate + append, don't update
index matviews like crazy, if they are not indexes themselves
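The truncate-and-append tip, combined with refreshing at batch load time, might look like the following sketch. The tables (`pageViews`, `viewsPerDay`) and functions (`refreshMatview`, `batchLoad`) are hypothetical in-memory stand-ins, not any product's API; in a real warehouse the refresh would be a truncate followed by a bulk insert from a query.

```javascript
// Base fact table (hypothetical): one row per page view.
const pageViews = [
  { day: "2011-06-01", page: "/home" },
  { day: "2011-06-01", page: "/home" },
  { day: "2011-06-01", page: "/about" },
];

// The matview: a precomputed summary stored as its own table.
const viewsPerDay = [];

// Refresh by truncating and re-appending, rather than updating rows in place.
function refreshMatview(facts, matview) {
  const counts = new Map();
  for (const { day, page } of facts) {
    const key = day + "|" + page;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  matview.length = 0; // truncate
  for (const [key, views] of counts) {
    const [day, page] = key.split("|");
    matview.push({ day, page, views }); // append
  }
}

// Batch load: append the new facts, then refresh the matview once.
function batchLoad(newRows) {
  pageViews.push(...newRows);
  refreshMatview(pageViews, viewsPerDay);
}

batchLoad([
  { day: "2011-06-02", page: "/home" },
  { day: "2011-06-02", page: "/home" },
]);
console.log(viewsPerDay);
```

Refreshing once per batch keeps the cost proportional to the number of loads rather than the number of rows, which is the argument against trigger-based maintenance in a warehouse.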
-
What's OLAP?
cubes
-
[cube diagram with axes: Site, Repeat Visitors, Browser]
drill-down
-
OLAP
-
OnLine Analytical Processing
Visualization technique
all data as a multi-dimensional space
great for decision support
CPU & RAM intensive
hard to do on really big data
Works well with column stores
-
Contact
Josh Berkus: josh@pgexperts.com
blog: blogs.ittoolbox.com/database/soup
twitter: @fuzzychef
PostgreSQL: www.postgresql.org
pgexperts: www.pgexperts.com
This talk is copyright 2011 Josh Berkus and is licensed under the Creative Commons Attribution license.