01 data analytics
TRANSCRIPT
-
8/16/2019 01 Data Analytics
1/45
© Prof. Dr.-Ing. Wolfgang Lehner |
Data Analytics Methods and Techniques
Forschungskolleg
Martin Hahmann,
Gunnar Schröder,
Phillip Grosse
-
8/16/2019 01 Data Analytics
2/45
© Martin Hahmann | | 2
Why do we need it?
“We are drowning in data, but starving for knowledge!” John Naisbett
23 petabyte/second of raw data
1 petabyte/year
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
3/45
© Martin Hahmann | | 3
Data Analytics
The Big Four
Data Analytics Methods and Techniques
Classification Association
Rules
Prediction
23.11.
Clustering
-
8/16/2019 01 Data Analytics
4/45
© Martin Hahmann | | 4
Clustering
Clustering
automated grouping of objects into so called clusters objects of the same group are similar
different groups are dissimilar
Problems
applicability versatility
support for non-expert users
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
5/45
© Martin Hahmann | | 5
Clustering Workflow
algorithm selection
algorithm setup
adjustment
result interpretation
Data Analytics Methods and Techniques
>500
-
8/16/2019 01 Data Analytics
6/45
© Martin Hahmann | | 6
FeedbackAlgorithmics
Clustering Workflow
algorithm selection adjustment
algorithm setup result interpretation
Which algorithm fits my data?
Which parameters fit my data?
How good is the obtained result?
How to improve result quality?
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
7/45© Martin Hahmann | | 7
Traditional Clustering
one clustering
hierarchical
traditional clustering density-based
partitional
Algorithmics
Data Analytics Methods and Techniques
Feedback
-
8/16/2019 01 Data Analytics
8/45© Martin Hahmann | | 8
Traditional Clustering
Partitional Clustering
clusters represented by prototypes objects are assigned to most similar prototype
similarity via distance function
k-means[Lloyd, 1957]
partitions dataset into k disjoint subsets minimizes sum-of-squares criterion
parameters: k, seed
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
9/45© Martin Hahmann | | 9
Traditional Clustering
Partitional Clustering
clusters represented by prototypes objects are assigned to most similar prototype
similarity via distancefunction
k-means[Lloyd, 1957]
partitions dataset into k disjoint subsets minimizes sum-of-squares criterion
parameters: k, seed
ISODATA[Ball & Hall, 1965]
adjusts number of clusters
merges, splits, deletes according to thresholds
5 additional parameters
k=4
k=4
k=7
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
10/45© Martin Hahmann | | 10
Traditional Clustering
Partitional Clustering
clusters represented by prototypes objects are assigned to most similar prototype
similarity via distancefunction
k-means[Lloyd, 1957]
partitions dataset into k disjoint subsets minimizes sum-of-squares criterion
parameters: k, seed
ISODATA[Ball & Hall, 1965]
adjusts number of clusters
merges, splits, deletes according to thresholds
5 additional parameters
k=7
k=7
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
11/45© Martin Hahmann | | 11
Traditional Clustering
Density-Based Clustering
clusters are modelled as dense areas separated by less dense areas density of an object = number of objects satisfying a similarity threshold
DBSCAN [Ester & Kriegel et al., 1996]
dense areas
core objects
object count in defined ε-neighbourhood
connection via overlapping neighbourhood
parameters: ε , minPts
DENCLUE[Hinneburg, 1998]
density via gaussian kernels cluster extraction by hill climbing
ε=1.0; minPts=5
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
12/45
© Martin Hahmann | | 12
Traditional Clustering
Density-Based Clustering
clusters are modelled as dense areas separated by less dense areas density of an object = number of objects satisfying a similarity threshold
DBSCAN [Ester & Kriegel et al., 1996]
dense areas
core objects
object count in defined ε-neighbourhood
connection via overlapping neighbourhood
parameters: ε , minPts
DENCLUE[Hinneburg, 1998]
density via gaussian kernels cluster extraction by hill climbing
ε=0.5; minPts=5
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
13/45
© Martin Hahmann | | 13
Traditional Clustering
Density-Based Clustering
clusters are modelled as dense areas separated by less dense areas density of an object = number of objects satisfying a similarity threshold
DBSCAN [Ester & Kriegel et al., 1996]
dense areas
core objects
object count in defined ε-neighbourhood
connection via overlapping neighbourhood
parameters: ε , minPts
DENCLUE[Hinneburg, 1998]
density via gaussian kernels cluster extraction by hill climbing
ε=0.5; minPts=20
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
14/45
© Martin Hahmann | | 14
Traditional Clustering
Hierarchical Clusterings
builds a hierarchy by progressively merging / splitting clusters
clusterings are extracted by cutting the dendrogramm
bottom-up, AGNES[Ward, 1963]
top-down,
DIANA[Kaufmann & Rousseeuw, 1990]
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
15/45
© Martin Hahmann | | 15
Multiple Clusterings
traditional
clustering
multi-solution clustering
Algorithmics
ensemble clustering
multiple clusterings
Data Analytics Methods and Techniques
Feedback
-
8/16/2019 01 Data Analytics
16/45
© Martin Hahmann | | 16
Multi-Solution Clustering
traditional
clustering
multi-solution clustering
multiple clusterings
subspacealternative
Algorithmics
Data Analytics Methods and Techniques
Feedback
-
8/16/2019 01 Data Analytics
17/45
© Martin Hahmann | | 17
Multi-Solution Clustering
Alternative Clustering
generate initial clustering with traditional algorithm
utilize given knowledge to find alternative clusterings
generate a dissimilar clustering
initial clustering alternative 1 alternative 2
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
18/45
© Martin Hahmann | | 18
Multi-Solution Clustering
Alternative Clustering: COALA[Bae & Bailey, 2006]
generate alternative with same number of clusters high dissimilarity to initial clustering(cannot-link constraint)
high quality alternative (objects in cluster still similar)
trade-off between two goals controlled by parameter w
if dquality< w*ddissimilarity choose quality else dissimilarity
CAMI[Dang & Bailey, 2010]
clusters as gaussian mixture
quality by likelihood
dissimilarity by mutual information
Orthogonal Clustering[Davidson & Qi, 2008] iterative transformation of database
use of constraints
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
19/45
© Martin Hahmann | | 19
Multi-Solution Clustering
Subspace Clustering
high dimensional data clusters can be observed in arbitrary attribute combinations (subspaces)
detect multiple clusterings in different subspaces
CLIQUE [Agarawal et al. 1998]
based on grid cells find dense cells in all subspaces
FIRES[Kriegel et al.,2005]
stepwise merge of base clusters (1D)
max. dimensional subspace clusters
DENSEST[Müller et al.,2009]
estimation of dense subspaces
correlation based (2D histograms)
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
20/45
© Martin Hahmann | | 20
Ensemble Clustering
traditional
clustering
multi-solution
clustering
ensemble clustering
multiple base clusterings
one consensus clusteringpairwise
similarity
cluster
ids
Algorithmics
Data Analytics Methods and Techniques
Feedback
-
8/16/2019 01 Data Analytics
21/45
© Martin Hahmann | | 21
Ensemble Clustering
General Idea
generate multiple traditional clusterings employ different algorithms / parameters ensemble
combine ensemble to single robust consensus clustering
different consensus mechanisms
[Strehl & Ghosh, 2002], [Gionis et al., 2007]
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
22/45
© Martin Hahmann | | 22
Ensemble Clustering
Pairwise Similarity
a pair of objects belongs to: the same or different cluster(s) pairwise similarities represented as coassociation matrices
Data Analytics Methods and Techniques
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 1 1 0 0 0 0 0 0
x2 - - 1 0 0 0 0 0 0
x3 - - - 0 0 0 0 0 0
x4 - - - - 1 1 0 0 0
x5 - - - - - 1 0 0 0
x6 - - - - - - 0 0 0
x7 - - - - - - - 1 1
x8 - - - - - - - - 1
x9 - - - - - - - - -
-
8/16/2019 01 Data Analytics
23/45
© Martin Hahmann | | 23
Ensemble Clustering
Pairwise Similarity
a pair of objects belongs to: the same or different cluster(s) pairwise similarities represented as coassociation matrices
simple example based on [Gionis et al., 2007]
goal: generate consensus clustering with maximal similarity to ensemble
Data Analytics Methods and Techniques
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 1 1 0 0 0 0 0 0
x2 - - 1 0 0 0 0 0 0
x3 - - - 0 0 0 0 0 0
x4 - - - - 1 0 0 0 0
x5 - - - - - 0 0 0 0
x6 - - - - - - 0 1 0
x7 - - - - - - - 0 1
x8 - - - - - - - - 0
x9 - - - - - - - - -
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 0 1 1 1 0 0 0 0
x2 - - 0 0 0 0 0 0 0
x3 - - - 1 1 0 0 0 0
x4 - - - - 1 0 0 0 0
x5 - - - - - 0 0 0 0
x6 - - - - - - 0 0 0
x7 - - - - - - - 1 1
x8 - - - - - - - - 1
x9 - - - - - - - - -
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 1 1 0 0 0 0 0 0
x2 - - 1 0 0 0 0 0 0
x3 - - - 0 0 0 0 0 0
x4 - - - - 1 0 1 0 1
x5 - - - - - 0 1 0 1
x6 - - - - - - 0 1 0
x7 - - - - - - - 0 1
x8 - - - - - - - - 0
x9 - - - - - - - - -
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 1 1 0 0 0 0 0 0
x2 - - 1 0 0 0 0 0 0
x3 - - - 0 0 0 0 0 0
x4 - - - - 1 1 0 0 0
x5 - - - - - 1 0 0 0
x6 - - - - - - 0 0 0
x7 - - - - - - - 1 1
x8 - - - - - - - - 1
x9 - - - - - - - - -
-
8/16/2019 01 Data Analytics
24/45
© Martin Hahmann | | 24
Ensemble Clustering
Consensus Mechanism
majority decision pairs located in the same cluster in at least 50% of the ensemble, are also part of the
same cluster in the consensus clustering
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 0 1 1 1 0 0 0 0
x2 - - 0 0 0 0 0 0 0
x3 - - - 1 1 0 0 0 0
x4 - - - - 1 0 0 0 0
x5 - - - - - 0 0 0 0
x6 - - - - - - 0 0 0
x7 - - - - - - - 1 1
x8 - - - - - - - - 1
x9 - - - - - - - - -
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 3/4 4/4 1/4 1/4 0 0 0 0
x2 - - 3/4 0 0 0 0 0 0
x3 - - - 1/4 1/4 0 0 0 0
x4 - - - - 4/4 1/4 1/4 0 1/4
x5 - - - - - 1/4 1/4 0 1/4x6 - - - - - - 1/4 3/4 1/4
x7 - - - - - - - 2/4 4/4
x8 - - - - - - - - 2/4
x9 - - - - - - - - -
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 - 3/4 4/4 - - - - - -
x2 - - 3/4 - - - - - -
x3 - - - - - - - - -
x4 - - - - 4/4 - - - -
x5 - - - - - - - - -x6 - - - - - - - 3/4 -
x7 - - - - - - - 2/4 4/4
x8 - - - - - - - - 2/4
x9 - - - - - - - - -
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
25/45
© Martin Hahmann | | 25
Ensemble Clustering
Consensus Mechanism
majority decision pairs located in the same cluster in at least 50% of the ensemble, are also part of the
same cluster in the consensus clustering
more consensus mechanisms exist
e.g. graph partitioning via METIS [Karypis et al., 1997]
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
26/45
© Martin Hahmann | | 26
Our Research Area
Algorithm Integration in databases
the R Project for Statistical Computing
free software environment for statistical computing and graphics
includes lots of standard math functions/libraries
function- and object-oriented
flexible/extensible
Data Analytics Methods and Techniques
Dipl.-Inf. Phillip Grosse
-
8/16/2019 01 Data Analytics
27/45
© Martin Hahmann | | 27
Our Research Area
Vector: basic datastructure in R
> x x x "abc" -> y # vector of size 1
Vector output
> x
[1] 1 2 3 4 5 6 7 8 9 10
> sprintf("y is %s", y)
[1] "y is abc"
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
28/45
© Martin Hahmann | | 28
Our Research Area
Vector arithmetic
element-wise basic operators (+ - * / ^)> x y x*y
[1] 2 6 12 20
vectorrecycling (short vectors)> y x*y
[1] 2 14 6 28
arithmetic functions
log, exp, sin, cos, tan, sqrt, min, max, range, length, sum, prod, mean, var
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
29/45
© Martin Hahmann | | 29
R as HANA operator (R-OP)
Data Analytics Methods and Techniques
Database
R Client
SHM
ManagerSHM
R
RICE
SHM
Manager
Rserve
TCP/IP
6
1
4
write data
access data
3
7
fork R process
2
access data
write data
5
pass R Script
[Urbanek03]
-
8/16/2019 01 Data Analytics
30/45
© Martin Hahmann | | 30
R scripts in databases
Data Analytics Methods and Techniques
## Creates some dummy data
CREATE INSERT ONLY COLUMN TABLE Test1(A INTEGER, B DOUBLE);
INSERT INTO Test1 VALUES(0, 2.5);INSERT INTO Test1 VALUES(1, 1.5);
INSERT INTO Test1 VALUES(2, 2.0);
CREATE COLUMN TABLE "ResultTable" AS TABLE "TEST1" WITH NO DATA;
ALTER TABLE "ResultTable" ADD ("C" DOUBLE);
## Creates an SQL-script function including the R script
CREATE PROCEDURE ROP(IN input1 "Test1", OUT result "ResultTable" )
LANGUAGE RLANG AS
BEGIN
A
-
8/16/2019 01 Data Analytics
31/45
© Martin Hahmann | | 31
parallel R operators
Data Analytics Methods and Techniques
## call ROP in parallel
DROP PROCEDURE testPartition1;
CREATE PROCEDURE testPartition1 (OUT result "ResultTable")AS SQLSCRIPT
BEGIN
input1 = select * from Test1;
p = CE_PARTITION([@input1@],[],'''HASH 3 A''');
CALL ROP(@p$$input1@, r);
s = CE_MERGE([@r@]);
result = select * from @s$$r@;
END;
CALL testPartition1(?);
R-Op R-Op R-Op
......
Split-Op
...Merge
Dataset
calcModel
-
8/16/2019 01 Data Analytics
32/45
© Martin Hahmann | | 32
Result Interpretation
Algorithmics Feedback
resultinterpretation visualizationtraditionalclustering
multi-solution
clustering
ensemble
clustering
quality measures
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
33/45
© Martin Hahmann | | 33
Result Interpretation
Quality measures
„How good is the obtained result ?“ Problem: There is no universally valid definition of clustering quality
multiple approaches and metrics
Rand index [Rand , 1971]
comparison to known solution
Dunns Index[Dunn, 1974]
ratio intercluster / intracluster distances
higher score is better
Davis Bouldin Index[Davies & Bouldin, 1979]
ratio intercluster / intracluster distances
lower score is better
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
34/45
© Martin Hahmann | | 34
Result Interpretation
Visualization
use visual representations to evaluate clustering quality utilize perception capacity of humans
result interpretation left to user subjective quality evaluation
communicate information about structures
data-driven visualize whole dataset
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
35/45
© Martin Hahmann | | 35
Result Interpretation
Visualization: examples
all objects and dimensions are visualized scalability issues for large scale data
display multiple dimensions on a 2D screen
scatterplot matrix parallel coordinates[Inselberg, 1985]
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
36/45
© Martin Hahmann | | 36
Result Interpretation
Visualization: Heidi Matrix[Vadapali et al.,2009]
visualize clusters in high dimensional space show cluster overlap in subspaces
pixel technique based on k-nearest neighbour relations
pixel= object pair, color= subspace
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
37/45
© Martin Hahmann | | 37
Feedback
Algorithmics Feedback
result
interpretation visualization
quality measures
adjustment
Data Analytics Methods and Techniques
traditional
clustering
multi-solution
clustering
ensemble
clustering
-
8/16/2019 01 Data Analytics
38/45
© Martin Hahmann | | 38
Feedback
Result adjustment
best practise: adjustment via re-parameterization or algorithm swap to infer modifications from result interpretation
adjust low-level parameters
only indirect
Clustering with interactive Feedback [Balcan & Blum, 2008]
theoretical proof of concept
user adjustsments via high-level feedback: split and merge
limitations: 1d data, target solution available to the user
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
39/45
© Martin Hahmann | | 39
Our Research Area
Data Analytics Methods and Techniques
algorithmic platformvisual-interactive
interface
-
8/16/2019 01 Data Analytics
40/45
© Martin Hahmann | | 40
Augur
Generalized Process
based on Shneidermann information seeking mantra Visualization, Interaction,
algorithmic platform based on multiple clustering solutions
Data Analytics Methods and Techniques
-
8/16/2019 01 Data Analytics
41/45
O h
-
8/16/2019 01 Data Analytics
42/45
© Martin Hahmann | | 42
Our Research Area
Predict user ratings:
Products for cross selling & product classification:
Data Analytics Methods and Techniques
O R h A
-
8/16/2019 01 Data Analytics
43/45
© Martin Hahmann | | 43
Our Research Area
Similar items & similar users
Data Analytics Methods and Techniques
Ch ll
-
8/16/2019 01 Data Analytics
44/45
© Martin Hahmann | | 44
Challenges
Huge amounts of data
building recommender models is expensive!
Dynamic
models age
user preferences change
new users & items
Parameterization
data and domain dependent
support
automatic model maintenance
Data Analytics Methods and Techniques
Personalize User-Interface
Monitor UserActions
Evaluate Model
Optimize ModelParameters
(Re-)Build Model
CalculateRecommendations
>
-
8/16/2019 01 Data Analytics
45/45
>
Questions?