
Database Mining: Bringing Algorithms to Data

Sharma Chakravarthy

Information Technology Laboratory

Computer Science and Engineering Department

The University of Texas at Arlington, Arlington, TX 76009

Email: [email protected]

URL: http://itlab.uta.edu/sharma

Acknowledgments

• This presentation is based on the work of many of my students, especially Shiby Thomas, Mahesh Dudkiar, Hongen Zhang, Pratyush Mishra, and Himavalli Kona (and others)

• National Science Foundation and other agencies for their support

2/13/2012 © Sharma Chakravarthy 2

Outline

• Data Mining and DW

• Database Mining

– Overview

– Spectrum

• Database Mining Architectures

– Performance comparison

• Association Mining Using SQL‐92 and SQL‐OR

– Approaches

– Optimization

– Performance comparison

• Summary and Challenges


Role of Data Warehouses

• DW makes DM a lot cheaper

• DM is one of the reasons for DW

• OLAP: verification‐driven

– sales in CA Vs. FL in Q1 of 2003

• DM: discovery‐driven

– What factors contribute to non‐payment of loans ?

– Will Microsoft come back?


OLAP Vs. Data Mining

• OLAP is user driven

– Analyst generates hypothesis, uses OLAP to verify

– e.g., “people with high debt are bad credit risks”

• Data mining tool generates the hypothesis

– Tool performs exploration

• e.g., find risk factors for granting credit

– Discover new patterns that analysts didn’t think of

• e.g., debt-to-income ratio

• OLAP and DM complement each other


Why Database Mining?

• Proliferation of relational DW and the need to mine them without siphoning the data out

• Make mining co-exist with OLAP and other decision-support applications

• DM needs to be a sub-process in next-generation Business Intelligence (BI) systems

• Leverage the RDBMS technology for mining

– More than three decades' worth of research

• Provide an integrated decision‐support environment  for analysts


Data Mining Vs. Database Mining

• Data mining refers to main memory algorithms for mining

+ Can use arbitrary data structures

+ Can optimize algorithms with proper representation (hash tree for example)

- Limited memory; buffer management is needed (and has to be implemented separately for each approach!)

- Data has to be siphoned out of its location (mostly from a DBMS or a Data Warehouse)

- Works well only for small data sizes (no scalability)

- Every time data is added to the DB, the process has to be repeated

Solution? Database Mining – Bringing algorithms to data instead of taking data to algorithms


SQL‐based Mining: Implications

+ Leverage 3+ decades of  DBMS R&D

+ Buffer management comes for free !

+ Portability due to standardization (SQL)

+ Fast development of mining algorithms

+ SMP parallelism for free on parallel database engines

+ Data is not replicated outside of the DBMS

+ SQL may be extended to include ad hoc mining queries

- However, no specialized data structures or memory management (UDFs are the exception)


Data Mining Evolution

• File‐Based or Main Memory mining algorithms 

– Data mining

• SQL‐Based mining algorithms

– Database mining

• Parallel mining Algorithms

– Both main memory and SQL‐based


Mining Spectrum

• Study architectural alternatives

• Performance evaluation

• Extend the capability of current query processors

[Figure: spectrum of integration with the database, from loose to tight. Mining as application on client/app. server: Cache-Mine, Loose Coupling. Mining as application on database server: User-defined function, Stored Procedure, Mining extenders/blades. Mining using SQL + extensions: SQL-based approach. Integrated approach: integrated with SQL query engine.]

Database Mining Spectrum

• Database Mining 

– Single database (Directly) e.g. Intelligent Miner

– Single relation (using JDBC)

– Layered (multiple relations, using JDBC)

– Layered (Across databases, using JDBC)

– Integrated  Database mining


Long‐Term Vision

• Unbundle bulky mining operations (e.g., mine)

• Identification of Common operators

• Integration of the above into the Query Optimizer

• No distinction between OLAP and mining


Alternative Architectures

• Loose‐coupling: data read through a cursor

• Stored‐procedure: mining algorithm encapsulated as a stored procedure (SP)

• Cache‐mine‐store: data cached in files outside DB in binary form

• UDF: “heavy‐weight” UDFs placed in SQL queries

• SQL: mining algorithm formulated as SQL-92/SQL-OR queries


Loosely Coupled

• Loose‐coupling: data read through a cursor

• Intermediate results are stored in the database

• DBMS used as file system

• High context-switching overhead between address spaces

• Even with block reads, performance is poor
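The loose-coupling pattern above can be sketched roughly as follows. This is only an illustration, assuming Python's built-in sqlite3 module as a stand-in DBMS and a hypothetical table T(tid, item); the function names are made up for the example. Rows are pulled through a cursor in blocks and counted in application memory, so every block crosses the boundary between the application and the database.

```python
import sqlite3
from collections import Counter


def demo_connection():
    """Build a tiny in-memory table T(tid, item); the data is illustrative only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item INTEGER)")
    conn.executemany(
        "INSERT INTO T VALUES (?, ?)",
        [(100, 1), (100, 3), (200, 3), (200, 5)],
    )
    return conn


def count_items_loosely_coupled(conn, block_size=2):
    """Read T through a cursor and count item occurrences in application memory."""
    counts = Counter()
    cur = conn.execute("SELECT tid, item FROM T")
    while True:
        block = cur.fetchmany(block_size)  # block read through the cursor
        if not block:
            break
        for _tid, item in block:
            counts[item] += 1  # all mining work happens outside the DBMS
    return counts
```

The database is used only as a row store here; none of the counting is expressed in SQL, which is the point the slide makes about the DBMS being used as a file system.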


Stored procedures

• Mining algorithms executed as stored procedures on the server

• Programming flexibility

• Existing file-based code can be reused


Cache‐Mine‐Store

• Special case of loose coupling

• Data is read once and cached in files outside DB (in binary form)

• Advantages of SP + better performance

• Additional disk space for caching


UDFs

• UDF: “heavy-weight” UDFs placed in SQL queries

• Little use of DB query processing

• Fenced or unfenced mode

• Significant code rewrite

• Performance is good

• Portability is poor


Stored Procedures and UDFs

• Server side: Advantages

– Less traffic congestion

– Better development, modularization, and integration

– Can return anything from basic data types such as integers to complex structures such as tables

• Client side

– Coding of these functions takes time and effort


Stored Procedures and UDFs

• A stored procedure is written in PL/SQL, so DML can be used in it.

• As a consequence, the result can be stored into a table.

• A UDF, however, is written in a host programming language such as Java or C.


Stored Procedures and UDFs

The difference between calling a UDF (DB2) and Stored procedures (Oracle):

DB2: call the UDF in an SQL query directly.

insert into tidT1
select T_item, T_cnt, T_tids
from (select item, tid from T5D1K group by item, tid) as tt0,
     table(saveTid(item, tid)) as tt2

Oracle: using the “CallableStatement” class

CallableStatement stmt;
// prepare the CALL statement; the prototype is SaveTid(int rowCount)
String qs = "{Call SaveTid(?)}";
stmt = LogIn.con.prepareCall(qs);
int rowCount = getRowCount();
stmt.setInt(1, rowCount); // set the parameter value rowCount
stmt.execute(); // call the stored procedure


SQL‐Based 

• SQL: mining algorithm formulated as SQL-92/SQL-OR queries

• Several alternatives (query/subquery, K-way join)

• Can exploit SQL parallelism (where available)

• No specific extensions for mining

• May facilitate identification of needed extensions

[Figure: architecture diagram with the components GUI, SQL-92 / SQL-OR, Extended SQL preprocessor + Optimizer, (Object) Relational DBMS, and DB]

Integrated Approach

• Unbundle bulky mining operations (e.g., mine)

• Identification of Common operators

• Integration of the above into the Query Optimizer

• No distinction between OLAP and mining

• New optimization for handling large number of joins and self‐joins

• Availability of data structures in a meaningful manner


Architecture comparison

• Experiments on four real-life data sets

• Used IBM DB2 Universal Server version 5 on RS/6000: 200 MHz CPU, 256 MB main memory, 9 GB disk


Architecture comparisons (DB2)

[Figure: two bar charts of time in seconds, broken down by pass, for the Cache, Sproc, UDF, and SQL approaches. Data set D: support 0.2%, 0.07%, and 0.02%; passes 1 to 4. Data set A: support 0.50%, 0.35%, and 0.20%; passes 1 to 3.]

Summary

• SQL performance is good for smaller data sets

• As the size of the data set increases, SQL does not perform as well as Cache-Mine (better data representation, special data structures, …)

• Stored procedure and UDF did not perform well (no optimization, limited data structures)

• SQL seems comparable, and has its own advantages!


Storage Comparison

• Additional storage for Cache-Mine and SQL to cache/transform data

• Space for indexing / sorting

• Similar storage overhead for Cache-Mine and SQL

[Figure: Storage Requirement. Bar chart of space in millions of integers (cache and sort components) for the Cache, Sproc, and SQL approaches on Dataset-A, Dataset-B, Dataset-C, and Dataset-D.]

SQL-92 based approaches to Association Rules

Input to Mining

TID  Item1  Item2  Item3  Item4  Item5
100  1      -      1      1      -
200  -      1      1      -      1
300  1      1      1      -      1
400  -      1      -      -      1


Table format for database mining

TID  ITEM
---  ----
100  1
100  3
100  4
200  2
200  3
200  5
300  1
300  2
300  3
300  5
400  2
400  5

• DBMSs have an upper limit on the number of attributes in a table!

• Cannot use a table for horizontal layout
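As a rough illustration of the (tid, item) layout above, the following Python sketch flattens a one-entry-per-transaction representation into normalized rows. The function name `to_vertical` and the dictionary input are assumptions made for this example; the transactions are the four shown in the table.

```python
def to_vertical(horizontal):
    """Flatten {tid: [item, ...]} into sorted (tid, item) rows.

    Each transaction contributes one row per item, so the table needs only
    two attributes no matter how many distinct items exist.
    """
    return sorted(
        (tid, item) for tid, items in horizontal.items() for item in items
    )


# The four example transactions, keyed by tid (hard-coded for illustration).
HORIZONTAL = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}
```

Calling `to_vertical(HORIZONTAL)` yields the twelve (tid, item) rows of the table, independent of how many item columns a horizontal layout would have needed.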

Frequent itemsets table format

ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM COUNT
1     0     0     0     0     0     0     0     2     2
2     0     0     0     0     0     0     0     2     3
3     0     0     0     0     0     0     0     2     3
5     0     0     0     0     0     0     0     2     3
1     3     0     0     0     0     0     0     3     2
2     3     0     0     0     0     0     0     3     2
2     5     0     0     0     0     0     0     3     3
3     5     0     0     0     0     0     0     3     2
2     3     5     0     0     0     0     0     4     2

Rule table format

ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM RULEM CONF  SUP
1     3     0     0     0     0     0     0     3     2     100   50
3     1     0     0     0     0     0     0     3     2     66.67 50
2     3     0     0     0     0     0     0     3     2     66.67 50
3     2     0     0     0     0     0     0     3     2     66.67 50
2     5     0     0     0     0     0     0     3     2     100   75
5     2     0     0     0     0     0     0     3     2     100   75
3     5     0     0     0     0     0     0     3     2     66.67 50
5     3     0     0     0     0     0     0     3     2     66.67 50
2     3     5     0     0     0     0     0     4     2     66.67 50
3     2     5     0     0     0     0     0     4     2     66.67 50
5     2     3     0     0     0     0     0     4     2     66.67 50
2     3     5     0     0     0     0     0     4     3     100   50
2     5     3     0     0     0     0     0     4     3     66.67 50
3     5     2     0     0     0     0     0     4     3     100   50
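The CONF and SUP columns can be checked by hand against the four example transactions from the earlier table-format slide (100: {1, 3, 4}; 200: {2, 3, 5}; 300: {1, 2, 3, 5}; 400: {2, 5}). The sketch below assumes the usual definitions, which the slides do not spell out: support is the percentage of transactions containing every item of the rule, and confidence is that count divided by the count for the antecedent alone. The function name is hypothetical.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support %, confidence %) of the rule antecedent -> consequent.

    transactions maps tid -> iterable of items. Assumes the standard
    definitions of support and confidence (an assumption, not from the slides).
    """
    lhs = set(antecedent)
    both = lhs | set(consequent)
    n_lhs = sum(1 for items in transactions.values() if lhs <= set(items))
    n_both = sum(1 for items in transactions.values() if both <= set(items))
    if n_lhs == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    support = 100.0 * n_both / len(transactions)
    confidence = 100.0 * n_both / n_lhs
    return round(support, 2), round(confidence, 2)


# Example transactions, hard-coded for illustration.
TRANSACTIONS = {100: [1, 3, 4], 200: [2, 3, 5], 300: [1, 2, 3, 5], 400: [2, 5]}
```

Under these assumptions, the rule 2 -> 5 gives support 75 and confidence 100, and 3 -> 1 gives support 50 and confidence 66.67, matching the corresponding rows of the rule table.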

SQL‐based Association rule mining

• SQL‐92

– K-way joins

– Subquery

– 3‐way joins

– 2-GroupBy

– Set-oriented apriori (an improvement of K-way join)

• SQL-OR (uses table functions and other features: blobs, clobs, etc.)

– GatherJoin

– GatherPrune

– GatherCount

– Horizontal

– Vertical

– SQL‐bodied functions


SQL‐92 approaches

• Uses only the features of SQL‐92 standard

– “vanilla SQL”

– No table functions

– No stored procedures

– No UDF’s

• Portable (can be used on any RDBMS)


Performance of SQL‐92 approaches

• Experiments on four real-life data sets

• Used IBM DB2 Universal Server version 5 on RS/6000: 200 MHz CPU, 256 MB main memory, 9 GB disk


Comparison of SQL-92 approaches


SQL‐92 approaches

• Set-oriented Apriori was the overall winner

• Subquery approach performed well

• The candidate generation time and the time for the first pass are much smaller than the total time

• K-way join was also good for this data set

• As good as or better than loose-coupling only for high support

• For low support, considerably worse than loose-coupling

• 2-GroupBy has the worst performance

• 3-WayJoin is also not good


Observations on apriori

• C1 (typically input relation) can be substituted by F1 for subsequent passes

• Second pass is the most expensive one; no pruning at all; it is a Cartesian product

• As the number of passes increases (i.e., longer frequent itemsets), the number of joins increases; hence materialization may be effective 


Optimizations (Set‐oriented apriori)

• Pruning non‐frequent items

– Reduces the transaction table size

• Eliminating candidate generation in second pass

– Large C2 in many cases

– Significant performance improvement

• Reusing item combinations

• Space/time tradeoff 


SQL‐OR approaches

• GatherJoin: based on generating k‐item combinations using a table function

• GatherPrune: non‐candidate itemset pruning pushed inside the table function

• GatherCount: support counting pushed inside the table function

• Vertical: input data transformed into vertical format; support counting using UDF

• SQL‐bodied functions: uses the control structures in SQL

• Horizontal: input data transformed into horizontal format 


Performance of SQL‐OR approaches

[Figure: two bar charts of time in seconds, broken down into Prep and per-pass times, for the SQL-OR approaches. Data set A (Vertical, Gprune, Gjoin, Gcount): support 0.5%, 0.35%, and 0.2%; passes 1 to 3. Data set D (Vertical, Gjoin, Gcount): support 0.20%, 0.07%, and 0.02%; passes 1 to 4.]

SQL‐OR approaches

• Vertical approach is the best for higher passes

• Vertical performs poorly when Ck is too large

• GatherJoin good when the number of frequent items per transaction (Nf) is small

• GatherCount better when Ck and Nf are large

• Choose the best approach for each pass

• Cost estimates using statistics collected in each pass and the input data parameters


Data Characteristics 

• Number of items : in thousands

• Number of Transactions: in Millions

• Data set sizes: High Gigabytes

– Discovering all rules rather than verifying if a rule holds

– Completeness  and soundness

– Performance

– Scalability


Input/Output Formats

• Input transaction table in the normal form

• Two attributes (tid, item)

• Example: 1: A, B, C

• Output is a collection of rules

• Rule table schema:

(item1, …, itemk, len, rulem, confidence, support)

• Rule AB ‐> CD, conf. 90%, support 5%

(A, B, C, D, NULL, 4, 2, 0.9, 0.05)

Tid  Item
1    A
1    B
1    C


K‐way Join

• The process of support counting in Kwj is as follows:

In any pass k:

– Frequent itemsets of length k-1 are used to generate candidate itemsets of length k (Ck).

– Some of the generated candidates are pruned.

– To count support for these candidate itemsets, k copies of the input relation are joined with Ck.


Candidate Generation in SQL 

• Join step: join 2 copies of Fk-1

insert into Ck
select I1.item1, …, I1.itemk-1, I2.itemk-1
from Fk-1 I1, Fk-1 I2
where I1.item1 = I2.item1 and … and
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1


Candidate Generation and Pruning 

• Prune step: additional joins with (k-2) more copies of Fk-1

• Join predicates enumerated by skipping an item at a time

• A k-itemset has k subsets of size (k-1); 2 of them were used to generate the k-itemset, so there is no need to check them. Hence, the other (k-2) subsets need to be checked by doing (k-2) joins


Candidate Set Ck Generation

Insert into Ck
Select I1.item1, I1.item2, …, I1.itemk-1, I2.itemk-1
From Fk-1 I1, Fk-1 I2
Where I1.item1 = I2.item1 AND
      I1.item2 = I2.item2 AND
      ……
      I1.itemk-2 = I2.itemk-2 AND
      I1.itemk-1 < I2.itemk-1

Example: F3: {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}} => C4: {1, 2, 3, 4} and {1, 3, 4, 5}.


Pruning explanation

• Consider Ck {1 3 4 5}

• The subsets are

– {1 3 4} generated by skipping item at position 4

– {1 3 5} generated by skipping item at position 3

– {1 4 5} generated by skipping item at position 2

– {3 4 5} generated by skipping item at position 1

• First 2 have been used in the generation of {1 3 4 5} 

• Hence, skip positions 1 and 2 or 1 through k‐2 (here k is 4) to check for subsets!


Pruning

Prune step: for each k-itemset in Ck, if any of its (k-1)-subsets is not in Fk-1, that k-itemset must be deleted from Ck.

Skip item1:
    I1.item2 = I3.item1
    …
    I1.itemk-1 = I3.itemk-2
    I2.itemk-1 = I3.itemk-1

...

Skip itemk-2:
    I1.item1 = Ik.item1
    …
    I1.itemk-1 = Ik.itemk-2
    I2.itemk-1 = Ik.itemk-1

In the above example, one of the 4-itemsets in C4 is {1, 3, 4, 5}. This 4-itemset needs to be deleted because one of its 3-item subsets, {3, 4, 5}, is not in F3.
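The join and prune steps can be tried out for k = 4 on the F3 example. The following is a minimal sketch, assuming Python's built-in sqlite3 module as the SQL engine; the table and column names follow the slides, while the function name `generate_c4` is made up for the example. The first predicate group is the join step, and the two "skip" groups are the (k-2) = 2 prune joins.

```python
import sqlite3

# F3 from the slides' example.
F3_ROWS = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]

C4_QUERY = """
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2, F3 I3, F3 I4
    WHERE I1.item1 = I2.item1 AND I1.item2 = I2.item2   -- join step
      AND I1.item3 < I2.item3
      AND I1.item2 = I3.item1 AND I1.item3 = I3.item2   -- prune: skip item1
      AND I2.item3 = I3.item3
      AND I1.item1 = I4.item1 AND I1.item3 = I4.item2   -- prune: skip item2
      AND I2.item3 = I4.item3
    ORDER BY 1, 2, 3, 4
"""


def generate_c4(f3_rows):
    """Generate the pruned candidate set C4 from F3 with a single SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE F3 (item1 INTEGER, item2 INTEGER, item3 INTEGER)")
    conn.executemany("INSERT INTO F3 VALUES (?, ?, ?)", f3_rows)
    return conn.execute(C4_QUERY).fetchall()
```

On this input the join step alone would produce {1, 2, 3, 4} and {1, 3, 4, 5}; the skip-item1 join removes {1, 3, 4, 5} because {3, 4, 5} is not in F3, leaving only {1, 2, 3, 4}.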


Candidate Set Ck Generation

Complete Query Diagram

[Figure: query tree over k copies of Fk-1. Candidate generation: Fk-1 I1 is joined with Fk-1 I2 on I1.item1 = I2.item1, …, I1.itemk-2 = I2.itemk-2, and I1.itemk-1 < I2.itemk-1. Prune: the result is joined with Fk-1 I3 (skip item1) on I1.item2 = I3.item1, …, I1.itemk-1 = I3.itemk-2, I2.itemk-1 = I3.itemk-1, and so on up to Fk-1 Ik (skip itemk-2) on I1.item1 = Ik.item1, …, I1.itemk-1 = Ik.itemk-2, I2.itemk-1 = Ik.itemk-1.]

Example: F3: {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}} => C4: {1, 2, 3, 4} and {1, 3, 4, 5}.


SQL-92 support counting - K-way

• Join k copies of the input table (T) with Ck and do a group by on the itemsets

insert into Fk
select item1, … , itemk, count(*)
from Ck, T t1, … , T tk
where t1.item = Ck.item1 and … and
      tk.item = Ck.itemk and
      t1.tid = t2.tid and t1.item < t2.item and … and
      tk-1.tid = tk.tid and tk-1.item < tk.item
group by item1, item2, … , itemk
having count(*) ≥ minsup

• Requires k joins for the kth pass


Support Counting for Kwj in pass k

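[Figure: query tree. T t1 is joined with T t2 on t1.tid = t2.tid and t1.item < t2.item, and so on up to T tk on tk-1.tid = tk.tid and tk-1.item < tk.item. The result is joined with Ck on Ck.item1 = t1.item, ..., Ck.itemk = tk.item, followed by a group by on item1 … itemk with having count(*) > minsup.]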

Join Ck with k copies of T

Follow up the join with a group by on the items and filter on minsup

Requires k joins for the kth pass

Attributes of Input Table (T) : (tid, item)

Note that Ck is used as an inner relation !
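For k = 2 the K-way join plan can be run end to end on the small example table. This is a sketch only, assuming Python's built-in sqlite3 module, the twelve-row (tid, item) table from the table-format slide, and a hypothetical C2 built from its frequent items 1, 2, 3, and 5. The threshold is treated as an absolute minimum count compared with `>=`, which is an assumption; the function name is also made up.

```python
import sqlite3

# Example data, hard-coded for illustration.
T_ROWS = [(100, 1), (100, 3), (100, 4), (200, 2), (200, 3), (200, 5),
          (300, 1), (300, 2), (300, 3), (300, 5), (400, 2), (400, 5)]
C2_ROWS = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]

KWAY_F2 = """
    SELECT C2.item1, C2.item2, COUNT(*)
    FROM C2, T t1, T t2
    WHERE t1.item = C2.item1 AND t2.item = C2.item2
      AND t1.tid = t2.tid AND t1.item < t2.item
    GROUP BY C2.item1, C2.item2
    HAVING COUNT(*) >= ?
    ORDER BY C2.item1, C2.item2
"""


def count_support_kway(t_rows, c2_rows, min_count):
    """Join C2 with two copies of T, group by itemset, and filter on support."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item INTEGER)")
    conn.execute("CREATE TABLE C2 (item1 INTEGER, item2 INTEGER)")
    conn.executemany("INSERT INTO T VALUES (?, ?)", t_rows)
    conn.executemany("INSERT INTO C2 VALUES (?, ?)", c2_rows)
    return conn.execute(KWAY_F2, (min_count,)).fetchall()
```

With a minimum count of 2, the query returns the four 2-itemsets listed in the frequent itemsets table earlier in the deck: {1, 3}, {2, 3}, {2, 5}, and {3, 5}. For pass k the same pattern needs k copies of T, which is why the number of joins grows with the pass number.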


K‐way join plan (Ck outer) 

• Series of joins of Ck with k copies of T

• Final join result is grouped on the k items


Difference between inner and outer Ck

• Ck needs to be materialized if inner

– Requires additional I/O

• No materialization needed if outer

– Can write a single (large) query for candidate generation, pruning, and support counting

– May involve 10’s of joins!

– Current optimizers were not designed for that many join optimizations!

Example: Frequent itemsets generation using Kwj

Transaction (T)

Tid  Iid
1    1
1    3
1    4
2    2
2    3
2    5
3    2
3    3
3    4
3    5
4    2
4    5

F1

Item  Count
2     3
3     3
4     2
5     3

C2

Item1  Item2
2      3
2      4
2      5
3      4
3      5
4      5

[Figure: C2 is joined with two copies of the transaction table, T t1 and T t2, to produce F2.]

F2

Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2

C3

Item1  Item2  Item3
2      3      5
3      4      5

SQL-92 support counting - K-way

• Simple and easy to write in SQL

• Only downside is the number of joins

• Also, optimizers were never stressed for so many joins

• Optimizers were never optimized for self-joins!

• SQL is generated by a script!


Subquery‐based

• Differs in the way Fk is generated

• In Kwj, Fk is generated by joining two Fk-1 tables (1 join), pruning (k-2 additional joins), and then support counting (k more joins)

– For a total of (2k-1) joins

– At higher passes, this becomes large!

• Instead, in this approach Fk is generated by using subqueries recursively!


Subquery‐based

• Makes use of common prefixes between the itemsets in Ck to reduce the amount of work done in support counting

• Break up the support counting phase into a cascade of k subqueries

• The lth subquery Ql finds all tids that match the distinct itemsets formed by the first l columns of Ck

• The output of Ql is joined with T and dl+1 (the distinct itemset formed by the first l+1 columns of Ck) to get Ql+1

• The final output is obtained by doing a group-by on the k items to count support


SQL‐92 support counting ‐ Subquery

• Similar to K-way join, we generate Ck and Fk

• In the first pass, generate F1 as

select item, count(*)
from input
group by item
having count(*) ≥ minsup

Input table

TID  ITEM
100  1
100  3
100  4
200  2
200  3
200  5
300  2
300  3
300  4
300  5
400  2
400  5

F1

ITEM  COUNT
2     3
3     3
4     2
5     3

ITEM  TID
1     100
2     200
2     300
2     400
3     100
3     200
3     300
4     100
4     300
5     200
5     300
5     400

Subquery: Example

T

TID  ITEM
100  1
100  3
100  4
200  2
200  3
200  5
300  1
300  2
300  3
300  5
400  2
400  5

C2

ITEM1  ITEM2
2      3
2      4
2      5
3      4
3      5
4      5

D1

ITEM1
2
3
4

R1

ITEM1  TID
2      200
2      300
2      400
3      100
3      200
3      300
4      100
4      300

D2

ITEM1  ITEM2
2      3
2      4
2      5
3      4
3      5
4      5

Join D2, R1, and TIDITEM

ITEM1  ITEM2  TID
2      3      200
2      3      300
2      4      300
2      5      200
2      5      300
2      5      400
3      4      100
3      4      300
3      5      200
3      5      300
4      5      300

Group by item1, item2

F2

ITEM1  ITEM2  COUNT
2      3      2
2      5      3
3      4      2
3      5      2

Support counting: subquery

// Generate the frequent set in pass k
Insert into Fk
Select item1, item2, ……, itemk, count(*)
From (Subquery Qk) t
Group by item1, item2, ……, itemk
Having count(*) > minsup

Subquery Ql (for any l between 1 and k)
// Finds all tids that match the distinct itemsets formed by the first l columns of Ck
Select item1, item2, ……, iteml, tid
From T tl, Rl-1, Dl
Where Rl-1.item1 = Dl.item1 AND
      …
      Rl-1.iteml-1 = Dl.iteml-1 AND
      Rl-1.tid = tl.tid AND
      tl.item = Dl.iteml

Subquery Q0: No subquery Q0

// Finds all distinct itemsets formed by the first l-1 columns of Ck
Create table Dl-1 As
Select distinct item1, item2, ……, iteml-1
From Ck

Create table Rl-1 as
Select item1, item2, …, iteml-1, tid
from tiditem t1, dl-1
where t1.item = dl-1.item1 AND
      t1.item = dl-1.item2 AND
      …
      t1.item = dl-1.iteml-1

• Its performance is close to K-way join and better than the 2-GroupBy approach because it exploits common prefixes between candidate itemsets.

• Needs to materialize the intermediate table

SQL‐92 support counting ‐ Subquery

• For pass k (k > 1), we generate Ck and Fk tables

• Ck is generated as mentioned earlier.

• For Fk, we generate the subqueries.

insert into F2
select item1, item2, count(*)
from (Subquery Q2) as t
group by item1, item2
having count(*) ≥ minsup

Subquery Q2

select d2.item1, d2.item2, t2.tid
from input t2,
     (select item1, tid
      from input t1, (select distinct item1 from C2) as d1
      where t1.item = d1.item1) as r1,
     (select distinct item1, item2 from C2) as d2
where r1.item1 = d2.item1 and
      r1.tid = t2.tid and
      t2.item = d2.item2

SQL-based Association Rule Mining

Support Counting: Subquery

[Figure: query tree for Subquery Ql. The output of Subquery Ql-1 (Rl-1) is joined with T tl and Dl (select distinct item1, …, iteml from Ck) on Rl-1.item1 = Dl.item1, …, Rl-1.iteml-1 = Dl.iteml-1, Rl-1.tid = tl.tid, and tl.item = Dl.iteml, producing (item1, item2, …, iteml, tid).]

Subquery Q0: No subquery Q0

Subquery optimization

insert into Fk
select item1, …, itemk, count(*)
from (Subquery Qk) t
group by item1, …, itemk
having count(*) > :minsup

Subquery Ql

select item1, …, iteml, tid
from T t, (Subquery Ql-1) as Rl-1,
     (select distinct item1, …, iteml from Ck) as Dl
where Rl-1.item1 = Dl.item1 and … and
      Rl-1.iteml-1 = Dl.iteml-1 and
      Rl-1.tid = t.tid and
      t.item = Dl.iteml

Subquery Q0: No Subquery Q0


SQL‐92 support counting ‐ Subquery

• insert into Fk
  select item1, item2, … , itemk, count(*)
  from (Subquery Qk) t
  group by item1, item2, … , itemk
  having count(*) ≥ minsup

• Subquery Qi (1 ≤ i ≤ k)
  select item1, item2, … , itemi, tid
  from T ti, (Subquery Qi-1) as ri-1,
       (select distinct item1, item2, … , itemi from Ck) as di
  where ri-1.item1 = di.item1 and … and
        ri-1.itemi-1 = di.itemi-1 and
        ri-1.tid = ti.tid and
        ti.item = di.itemi

• Subquery Q0: No subquery Q0
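The cascade of subqueries can be exercised for k = 2 on the input table and C2 from the subquery example. The sketch below assumes Python's built-in sqlite3 module; the aliases d1, r1, d2, and t2 follow the slides, while the function name and the use of an absolute minimum count compared with `>=` are assumptions made for this example.

```python
import sqlite3

# Input table and C2 from the subquery example (hard-coded for illustration).
T_ROWS = [(100, 1), (100, 3), (100, 4), (200, 2), (200, 3), (200, 5),
          (300, 2), (300, 3), (300, 4), (300, 5), (400, 2), (400, 5)]
C2_ROWS = [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

SUBQUERY_F2 = """
    SELECT q2.item1, q2.item2, COUNT(*)
    FROM (
        -- Subquery Q2: tids containing both items of each distinct pair
        SELECT d2.item1 AS item1, d2.item2 AS item2, t2.tid AS tid
        FROM T t2,
             -- Subquery Q1: tids containing each distinct first item
             (SELECT d1.item1 AS item1, t1.tid AS tid
              FROM T t1, (SELECT DISTINCT item1 FROM C2) AS d1
              WHERE t1.item = d1.item1) AS r1,
             (SELECT DISTINCT item1, item2 FROM C2) AS d2
        WHERE r1.item1 = d2.item1
          AND r1.tid = t2.tid
          AND t2.item = d2.item2
    ) AS q2
    GROUP BY q2.item1, q2.item2
    HAVING COUNT(*) >= ?
    ORDER BY q2.item1, q2.item2
"""


def frequent_pairs_subquery(t_rows, c2_rows, min_count):
    """Compute F2 with the subquery formulation of support counting."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item INTEGER)")
    conn.execute("CREATE TABLE C2 (item1 INTEGER, item2 INTEGER)")
    conn.executemany("INSERT INTO T VALUES (?, ?)", t_rows)
    conn.executemany("INSERT INTO C2 VALUES (?, ?)", c2_rows)
    return conn.execute(SUBQUERY_F2, (min_count,)).fetchall()
```

With a minimum count of 2 this reproduces the F2 table of the example: (2, 3) with count 2, (2, 5) with count 3, (3, 4) with count 2, and (3, 5) with count 2. The inner subquery is evaluated once per distinct prefix rather than once per candidate, which is where the savings from common prefixes come from.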

Support counting: 2‐GroupBy

• Another way to avoid multi‐way joins is to first join T and Ck based on whether the ‘item’ of a (tid, item) pair of T is equal to any of the k items of Ck

• Then to do a group by on the itemsets


Support Counting: 2-GroupBy

// Gives all (itemset, tid) pairs such that the tid has all the items in the itemset.
Create Table temp AS
Select item1, item2, ……, itemk, count(*)
From Ck, T
Where item = Ck.item1 OR
      item = Ck.item2 OR
      ...
      item = Ck.itemk
Group by item1, item2, ……, itemk, tid
Having count(*) = k

Insert into Fk
Select item1, item2, ……, itemk, count(*)
From temp
Group by item1, item2, ……, itemk
Having count(*) > minsup

• No multi-way joins.

• Much time is lost in grouping, comparing and having clauses.

• Needs to materialize the intermediate table in Oracle.
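A small runnable version of the two group-bys for k = 2 is sketched below, assuming Python's built-in sqlite3 module and the same example input table and C2 as in the subquery example. The intermediate table is named `pairs_in_tid` here (a made-up name) and keeps the tid column so that the second group-by can count transactions; the `>=` threshold and the function name are likewise assumptions. The approach relies on each (tid, item) row being unique and on the items within a candidate being distinct.

```python
import sqlite3

# Example data, hard-coded for illustration.
T_ROWS = [(100, 1), (100, 3), (100, 4), (200, 2), (200, 3), (200, 5),
          (300, 2), (300, 3), (300, 4), (300, 5), (400, 2), (400, 5)]
C2_ROWS = [(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]


def frequent_pairs_two_group_by(t_rows, c2_rows, min_count):
    """Compute F2 with one join and two group-bys instead of a multi-way join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item INTEGER)")
    conn.execute("CREATE TABLE C2 (item1 INTEGER, item2 INTEGER)")
    conn.execute(
        "CREATE TABLE pairs_in_tid (item1 INTEGER, item2 INTEGER, tid INTEGER)"
    )
    conn.executemany("INSERT INTO T VALUES (?, ?)", t_rows)
    conn.executemany("INSERT INTO C2 VALUES (?, ?)", c2_rows)
    # First group-by: keep (itemset, tid) pairs where the tid matched all k = 2 items.
    conn.execute("""
        INSERT INTO pairs_in_tid
        SELECT C2.item1, C2.item2, T.tid
        FROM C2, T
        WHERE T.item = C2.item1 OR T.item = C2.item2
        GROUP BY C2.item1, C2.item2, T.tid
        HAVING COUNT(*) = 2
    """)
    # Second group-by: count supporting transactions per itemset.
    return conn.execute("""
        SELECT item1, item2, COUNT(*)
        FROM pairs_in_tid
        GROUP BY item1, item2
        HAVING COUNT(*) >= ?
        ORDER BY item1, item2
    """, (min_count,)).fetchall()
```

The result for a minimum count of 2 is the same F2 as in the subquery example. The OR predicate pairs every candidate with every transaction row that matches any of its items, which hints at why the slides report this approach as the slowest.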


Performance comparisons of SQL‐92 approaches

• On both datasets, K-way join emerged as the winner.

• For lower supports on larger datasets, pass two is the most time-consuming.

• Intelligent Miner is not SQL-based and hence appears to be faster.


Scale-Up Experiment

[Figure 3.21: Scale-Up Experiments for SQL-92 Approaches. Time (seconds) versus number of records (50K and 500K) for K-way Join and Subquery.]

• Both the K-way Join and Subquery approaches have similar scale-up behavior.

Optimizations

• Reduce the size of the input dataset

– Non-frequent 1-itemsets are pruned out of the input table, and this pruned input table is used instead in further passes.

– Effective for higher supports

• Optimize the second pass

– Skip generation of F1 and C2 and directly generate F2 by joining 2 copies of the input dataset.

– Effective for all large data sets

• Reduce the number of joins done in any pass

– Materialize all the frequent itemsets contained in any transaction at the end of pass k and use them for support counting in pass k+1

– Effective for higher iterations


K‐way join Optimizations

• Pruning non‐frequent items from T

insert into Tf 

select t.tid, t.item

from T t, F1 f

where t.item = f.item
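The pruning query can be tried on the twelve-row example table used earlier in the deck. The following sketch assumes Python's built-in sqlite3 module; the table names T, F1, and Tf follow the slides, while the `cnt` column, the function name, and the `>=` comparison against an absolute minimum count are assumptions made for this example.

```python
import sqlite3

# Example (tid, item) table; item 4 occurs only once.
T_ROWS = [(100, 1), (100, 3), (100, 4), (200, 2), (200, 3), (200, 5),
          (300, 1), (300, 2), (300, 3), (300, 5), (400, 2), (400, 5)]


def prune_infrequent(t_rows, min_count):
    """Build F1, then keep only the rows of T whose item is frequent (Tf)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item INTEGER)")
    conn.execute("CREATE TABLE F1 (item INTEGER, cnt INTEGER)")
    conn.execute("CREATE TABLE Tf (tid INTEGER, item INTEGER)")
    conn.executemany("INSERT INTO T VALUES (?, ?)", t_rows)
    conn.execute(
        "INSERT INTO F1 SELECT item, COUNT(*) FROM T "
        "GROUP BY item HAVING COUNT(*) >= ?",
        (min_count,),
    )
    # The pruning step from the slide: join T with F1.
    conn.execute(
        "INSERT INTO Tf SELECT t.tid, t.item FROM T t, F1 f WHERE t.item = f.item"
    )
    return conn.execute("SELECT tid, item FROM Tf ORDER BY tid, item").fetchall()
```

With a minimum count of 2, only the single row for item 4 is dropped here; on realistic data with many rare items the reduction is much larger, which is the effect the slides report for higher support values.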


Non‐frequent item pruning

• Reducing the size of T

• T is stored in normal form

• Simply drop the (tid, item) records of non‐frequent items

• Join T with F1

• Significant reduction in size of T

• Improved performance especially for higher support

[Figure: bar charts comparing the sizes of R and Rf (axis from 0 to 1,200,000) for T10.I4.D100K at support 2%, 1%, 0.75%, and 0.33%, and for T5.I2.D100K at support 2%, 1%, 0.75%, 0.50%, and 0.10%.]


Experiments

Effect of Pruning

2/13/2012 © Sharma Chakravarthy 73

2/13/2012 74

Effect of Pruned Input

© Sharma Chakravarthy

Second pass optimization

• C2 is the Cartesian product of F1 with itself

• Usually C2 is very large

• Avoid generating C2

• F2 is found by joining 2 copies of the pruned T

• Significant performance gains

insert into F2
select p.item, q.item, count(*)
from Tf p, Tf q
where p.tid = q.tid and p.item < q.item
group by p.item, q.item
having count(*) > :minsup
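As a rough illustration of the second-pass optimization, the sketch below runs the same self-join against an in-memory SQLite database. The toy Tf rows, the F2 column names, and the use of ">=" with an absolute minsup of 2 are assumptions made for this example (the slide's query uses ">").

```python
import sqlite3

# Hypothetical pruned input Tf as (tid, item) rows.
tf_rows = [(1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
           (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("create table Tf (tid integer, item integer)")
conn.execute("create table F2 (item1 integer, item2 integer, cnt integer)")
conn.executemany("insert into Tf values (?, ?)", tf_rows)

# F2 comes straight from a self-join of Tf; C2 is never materialized.
conn.execute(
    """
    insert into F2
    select p.item, q.item, count(*)
    from Tf p, Tf q
    where p.tid = q.tid and p.item < q.item
    group by p.item, q.item
    having count(*) >= :minsup
    """,
    {"minsup": 2},
)

f2 = conn.execute(
    "select item1, item2, cnt from F2 order by item1, item2"
).fetchall()
print(f2)
```

The condition p.item < q.item keeps each pair once, in lexicographic order, so no duplicate elimination is needed afterwards.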

[Figure: Pass 2 optimization (T10.I4.D100K). Time in seconds (0 to 14000) versus Support (2%, 1%, 0.75%, 0.33%); series: With Opt., Without Opt.]

Number of Candidate Itemsets in Different Passes

Dataset and support          C2      C3    C4   C5   C6   C7  C8  C9
T5I2D500K,  Sup = 0.10%      307720  126   7    0    --   --  --  --
T5I2D1000K, Sup = 0.10%      309291  127   61   0    --   --  --  --
T10I4D100K, Sup = 0.75%      12470   65    3    0    --   --  --  --
T10I4D100K, Sup = 0.33%      216153  2453  905  354  109  20  2   0

Experiments

Reusing item combinations

• The SQL formulations of association rule mining generate item combinations in each pass

• We can reuse them if we store the item combinations generated in each pass

• This and other optimizations constitute the Set-oriented apriori approach

• K-way join is replaced by a single join

• Very useful in later passes

• Similar to the AprioriTid algorithm

• Classic space/time trade-off

Set‐oriented apriori

• In the kth pass of the support counting phase, we generate a table Tk which contains all k-item combinations that are candidates.

• Tk has the schema (tid, item1, …, itemk)

• We join Tk-1, Tf, and Ck to generate Tk

• The frequent itemset Fk is obtained by grouping the tuples of Tk on the k items and applying the minimum support filtering

• In the kth pass, create a relation Combk (tid, item1, item2, …, itemk).

• Join Combk-1 with T and Ck, and insert into Combk only those transactions from T that contain candidate itemsets extending the itemsets in Combk-1 by one item.

• Do a group by on Combk to generate Fk.

• Thus in any pass k, we have only 3 joins, instead of k+1 joins.

• Comb1 and Comb2 are quite large; hence generating them takes more time.

Set‐oriented apriori

insert into Tk
select p.tid, p.item1, . . . p.itemk−1, q.item
from Ck, Tk−1 p, Tf q
where p.item1 = Ck.item1 and
      ...
      p.itemk−1 = Ck.itemk−1 and
      q.item = Ck.itemk and
      p.tid = q.tid

• T2 is not generated and stored due to second pass optimization

• Only T3 and onwards are generated and stored!

T3 generation

insert into T3
select p.tid, p.item, q.item, r.item
from Tf p, Tf q, Tf r, C3
where p.item = C3.item1 and
      q.item = C3.item2 and
      r.item = C3.item3 and
      p.tid = q.tid and
      q.tid = r.tid
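The T3 generation query can be exercised end to end on toy data. This is a minimal sketch, assuming SQLite, a hypothetical pruned input Tf, a hand-written candidate table C3, and an absolute minsup of 2 tested with ">="; the final group by is the support-filtering step described for Tk.

```python
import sqlite3

# Hypothetical pruned input and candidate 3-itemsets.
tf_rows = [(1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
           (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)]
c3_rows = [(2, 3, 5), (3, 4, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("create table Tf (tid integer, item integer)")
conn.execute("create table C3 (item1 integer, item2 integer, item3 integer)")
conn.execute(
    "create table T3 (tid integer, item1 integer, item2 integer, item3 integer)"
)
conn.executemany("insert into Tf values (?, ?)", tf_rows)
conn.executemany("insert into C3 values (?, ?, ?)", c3_rows)

# T3: every (tid, candidate) pair such that the transaction contains the candidate.
conn.execute(
    """
    insert into T3
    select p.tid, p.item, q.item, r.item
    from Tf p, Tf q, Tf r, C3
    where p.item = C3.item1 and q.item = C3.item2 and r.item = C3.item3
      and p.tid = q.tid and q.tid = r.tid
    """
)
t3 = conn.execute("select * from T3 order by tid, item1, item2, item3").fetchall()

# F3: group T3 on the three items and apply the support filter.
f3 = conn.execute(
    """
    select item1, item2, item3, count(*)
    from T3
    group by item1, item2, item3
    having count(*) >= ?
    """,
    (2,),
).fetchall()
print(t3, f3)
```

Because T3 is stored, pass 4 would only need to join T3 with Tf and C4 rather than four copies of Tf.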

Performance experiments

• Comparison of the set-oriented apriori and Subquery approaches for 2 datasets

[Figure: T10.I4.D100K, total time in seconds (0 to 24000), stacked by pass (Pass 1 to Pass 7), Set-apriori versus Subquery at supports 2%, 1%, 0.75%, 0.33%.]

[Figure: T5.I2.D100K, total time in seconds (0 to 2400), stacked by pass (Pass 1 to Pass 4), Set-apriori versus Subquery at supports 2%, 1%, 0.5%, 0.1%.]

Set-oriented apriori vs. Subquery

[Figure: Time in seconds (0 to 6000) for Pass 3 to Pass 7; series: Set-Apriori, Subquery.]

Scale‐up of set‐oriented Apriori

• Scales linearly with number of transactions and transaction size

[Figure: Time in seconds (0 to 25000) versus Number of Transactions in thousands (0 to 1200); series: T10.I4, T5.I2.]

[Figure: Time in seconds (0 to 25000) versus Average transaction length (0 to 60); series labeled 1000, 750, 500.]

Summary

• Explored SQL‐aware implementations of association rule mining

• Analyzed the best SQL‐92 option

• Cost formulae to characterize execution time

• Identified optimizations

– Set‐oriented apriori approach

• Performance experiments and scale‐up properties

• Moves us towards our short term vision 

• Basis for optimizations for the long term vision


Mining‐aware Optimizer

• Typically, data is stored in different DBMSs

• How can we perform mining on any RDBMS?

• Our experiments indicated that different RDBMSs optimize queries in different ways
  – Even support variations had an impact on the performance in different DBMSs

• Hence a global approach to mining did not seem appropriate!

• Analyze SQL-92 and SQL-OR to generate and consolidate heuristics as metadata to be used by a mining-aware optimizer

Motivation

• Limitation: Existing mining tools can't connect to multiple Database Management Systems.
  • Solution: Use Java Database Connectivity (JDBC)

• Limitation: Most of the mining tools use the Cache-Mine architecture. Data are copied to the local disk.
  • Solution: Use an SQL-based approach. Three of the approaches are based purely on SQL-92 and three of them are based on SQL-OR (Oracle).

• Limitation: Existing mining products do not provide expressive rule visualization.
  • Solution: We use the “rule-item” relationship in the association rule visualization to replace the “item-item” relationship [MINESET] or directed graph [MINER].
  • Java 3D, a new feature provided by JDK1.2, is used to implement three-dimensional display.

Motivation

• Limitation: Most of the available products can only use the data from one table (DBMiner has 64k transactions)
  • Solution: We provide the user with an interface to choose the tables to be used as data source. For each table, the user can specify the columns that correspond to items. A set of JOIN/UNION operations is transparently applied to generate the input data set.

• Limitation: Existing products use only one mining algorithm. However, the choice of an algorithm needs to be based on data as well as DBMS characteristics.
  • Solution: Implement a Mining Optimizer based on metadata to decide the algorithm to be used based on the data set and the underlying DBMS used.

Short‐term Goal

• Layered architecture

• JDBC provides the database connection and SQL interface

• VMO generates and visualizes the association rules


Mapping

• Mining is done on a relation with 2 attributes (Tid, Item) 

• However, the user's data is stored in relations and has to be mapped into the integer (Tid, Item) format

• Most mining tools accept a single Tid column and a single item column.

• Our mining optimizer accepts multiple Tid columns and single or multiple item columns (attributes) specified by the user from multiple relations

• Table input(Date, CustomerID, item1, item2, item3)

Mapping (Contd.)

InputTable1

Date    CustID  ITEM
1/1/00  100     Milk
1/1/00  100     Eggs
1/1/00  100     Bread
1/2/00  200     Sugar
1/2/00  200     Eggs
1/2/00  200     Cake

InputTable2

Date    CustID  ITEM
1/3/00  300     Milk
1/3/00  300     Sugar
1/3/00  300     Eggs
1/3/00  300     Cake
1/4/00  400     Sugar
1/4/00  400     Cake

MappedTidsTable

Number (TIDD1)  CustID (TIDD2)  TIDI
1/1/00          100             1
1/2/00          200             2
1/3/00          300             3
1/4/00          400             4

MappedItemsTable

ITEMD  ITEMI
Bread  1
Cake   2
Eggs   3
Milk   4
Sugar  5

FinalInputTable

TID  ITEM
1    1
1    3
1    4
2    2
2    3
2    5
3    2
3    3
3    4
3    5
4    2
4    5
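The mapping from user relations to the integer (Tid, Item) format can be sketched in a few lines. The code below is only an illustration: it assumes that integer ids are assigned in sorted order of the distinct (Date, CustID) pairs and of the item descriptions, which happens to reproduce the tables above, and the variable names are made up.

```python
# Rows of InputTable1 and InputTable2 combined: (date, cust_id, item).
rows = [
    ("1/1/00", 100, "Milk"), ("1/1/00", 100, "Eggs"), ("1/1/00", 100, "Bread"),
    ("1/2/00", 200, "Sugar"), ("1/2/00", 200, "Eggs"), ("1/2/00", 200, "Cake"),
    ("1/3/00", 300, "Milk"), ("1/3/00", 300, "Sugar"), ("1/3/00", 300, "Eggs"),
    ("1/3/00", 300, "Cake"), ("1/4/00", 400, "Sugar"), ("1/4/00", 400, "Cake"),
]

# MappedTidsTable: one integer per distinct (date, cust_id) combination.
tid_keys = sorted({(date, cust) for date, cust, _ in rows})
tid_map = {key: i for i, key in enumerate(tid_keys, start=1)}

# MappedItemsTable: one integer per distinct item description.
item_names = sorted({item for _, _, item in rows})
item_map = {name: i for i, name in enumerate(item_names, start=1)}

# FinalInputTable: the integer (tid, item) pairs handed to the mining step.
final_input = sorted({(tid_map[(d, c)], item_map[it]) for d, c, it in rows})

# Inverting item_map gives the reverse mapping used when rules are displayed.
item_desc = {i: name for name, i in item_map.items()}
print(final_input)
```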

RULES (FINAL)

Rule Head    Symbol  Rule Body    Confidence(%)  Support(%)
Cake         =>      Eggs         67             50
Eggs         =>      Cake         67             50
Eggs         =>      Milk         67             50
Milk         =>      Eggs         100            50
Cake         =>      Sugar        100            75
Sugar        =>      Cake         100            75
Eggs         =>      Sugar        67             50
Sugar        =>      Eggs         67             50
Cake         =>      Eggs, Sugar  67             50
Eggs         =>      Cake, Sugar  67             50
Sugar        =>      Cake, Eggs   67             50
Cake, Eggs   =>      Sugar        100            50
Cake, Sugar  =>      Eggs         67             50
Eggs, Sugar  =>      Cake         100            50

Rules table with descriptions mapped back

Rule Visualization

Rule Table with Filter capability

The key is to construct a where clause using the standard SQL operators, such as ‘LIKE’, ‘NOT’, ‘IN’, ‘AND’, etc.
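A filter of this kind might look as follows. This is a sketch only: the Rules table layout, its column names, and the helper filter_rules are assumptions for illustration and are not taken from the tool described in the slides.

```python
import sqlite3

# A few rules from the final rules table: (head, body, confidence %, support %).
rules = [
    ("Cake", "Eggs", 67, 50),
    ("Milk", "Eggs", 100, 50),
    ("Cake", "Sugar", 100, 75),
    ("Cake, Eggs", "Sugar", 100, 50),
    ("Eggs", "Sugar", 67, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Rules (head text, body text, confidence real, support real)"
)
conn.executemany("insert into Rules values (?, ?, ?, ?)", rules)


def filter_rules(where, params=()):
    """Apply a where clause to the rules table (hypothetical helper).

    The clause text is concatenated into the query, so it must come from a
    trusted source; the values themselves are bound as parameters.
    """
    sql = "select head, body from Rules where " + where + " order by head, body"
    return conn.execute(sql, params).fetchall()


# Rules whose head mentions Cake and whose confidence is at least 100%.
hits = filter_rules("head like ? and confidence >= ?", ("%Cake%", 100))
print(hits)
```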

Rule Visualization

Rule Table with Sort capability


Rule Visualization

# of Rules based on # of Items in Rule head

• Customize the graph
  • by modifying the attributes

• Add more constraints
  • by modifying the where clause

3-D Visualization

• Obtained by clicking on the previous graph (head: 3)

• All database access using relational operators

• Each column is a rule

• 1st: coat, lock, pump -> bike, eggs (50% sup, 80% conf)

Rule Generation steps

• Combine the frequent sets Fk in all passes into one table FISETS with columns ITEM1, ITEM2, ITEM3, ..., ITEMk, NULLM, COUNT

• Generate all non-empty subsets for each tuple in table FISETS and store the results into table Subsets with columns TITEM1, TITEM2, TITEM3, ..., TITEMk, TNULLM, TRULEM, TCOUNT

• Join the tables FISETS and Subsets to get the association rules.

Insert into Rules
Select TITEM1, TITEM2, ..., TITEMk, TNULLM, TRULEM, TCOUNT, (TCOUNT/COUNT)*100
From Subsets t1, FISETS t2
Where (t1.titem1 = t2.item1 or t1.TRULEM <= 1) AND
      (t1.titem2 = t2.item2 or t1.TRULEM <= 2) AND
      ...
      (t1.titemk = t2.itemk or t1.TRULEM <= k) AND
      t1.TRULEM = t2.NULLM AND
      (TCOUNT/COUNT)*100 >= :min_confidence

Rule Generation

FISETS table

ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM COUNT
1     0     0     0     0     0     0     0     2     2
2     0     0     0     0     0     0     0     2     3
3     0     0     0     0     0     0     0     2     3
5     0     0     0     0     0     0     0     2     3
1     3     0     0     0     0     0     0     3     2
2     3     0     0     0     0     0     0     3     2
2     5     0     0     0     0     0     0     3     3
3     5     0     0     0     0     0     0     3     2
2     3     5     0     0     0     0     0     4     2

Find the non-empty subsets with 2 items within the rule head for the tuple marked with the red arrow (the tuple 2 3 5).

Subsets table generated by a table function

TITEM1 TITEM2 TITEM3 TITEM4 TITEM5 TITEM6 TITEM7 TITEM8 TNULLM TRULEM TCOUNT
2      3      5      0      0      0      0      0      4      3      2
2      5      3      0      0      0      0      0      4      3      2
3      5      2      0      0      0      0      0      4      3      2

Join tables FISETS and Subsets

Association Rules

ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM RULEM CONF  SUP
2     3     5     0     0     0     0     0     4     3     100   50
2     5     3     0     0     0     0     0     4     3     66.67 50
3     5     2     0     0     0     0     0     4     3     100   50
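The confidence computation behind the join can be checked with a few lines of plain Python. This is a sketch, not the table-function implementation from the slides: it takes the item counts from the FISETS table above, assumes 4 transactions in total (the support figures are consistent with that), and enumerates rule heads with itertools instead of a Subsets table.

```python
from itertools import combinations

# Frequent itemsets and their counts, copied from the FISETS table.
counts = {
    (1,): 2, (2,): 3, (3,): 3, (5,): 3,
    (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2,
    (2, 3, 5): 2,
}
n_transactions = 4  # assumption consistent with the support percentages


def generate_rules(counts, n_transactions, min_conf):
    """Return (head, body, confidence %, support %) for every qualifying rule."""
    rules = []
    for itemset, cnt in counts.items():
        # Every non-empty proper subset of the itemset is a possible rule head.
        for size in range(1, len(itemset)):
            for head in combinations(itemset, size):
                body = tuple(i for i in itemset if i not in head)
                # confidence = count(itemset) / count(head), as in TCOUNT/COUNT
                conf = 100.0 * cnt / counts[head]
                if conf >= min_conf:
                    sup = 100.0 * cnt / n_transactions
                    rules.append((head, body, round(conf, 2), sup))
    return rules


rules = generate_rules(counts, n_transactions, min_conf=60.0)
for head, body, conf, sup in rules:
    print(head, "=>", body, conf, sup)
```

The lookup counts[head] always succeeds because every subset of a frequent itemset is itself frequent and therefore present in FISETS.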

Reverse mapping of items

Input

TID  ITEM
100  Milk
100  Eggs
100  Bread
200  Sugar
200  Eggs
200  Cake
300  Milk
300  Sugar
300  Eggs
300  Cake
400  Sugar
400  Cake

Description

ITEM NUMBER  DESCRIPTION
1            Milk
2            Sugar
3            Eggs
4            Bread
5            Cake

Intermediate Rule Table (output of Mining)

ITEM1 ITEM2 ITEM3 ITEM4 ITEM5 ITEM6 ITEM7 ITEM8 NULLM RULEM CONF  SUP
1     3     0     0     0     0     0     0     3     2     100   50
3     1     0     0     0     0     0     0     3     2     66.67 50
2     3     0     0     0     0     0     0     3     2     66.67 50
3     2     0     0     0     0     0     0     3     2     66.67 50
2     5     0     0     0     0     0     0     3     2     100   75
5     2     0     0     0     0     0     0     3     2     100   75
3     5     0     0     0     0     0     0     3     2     66.67 50
5     3     0     0     0     0     0     0     3     2     66.67 50
2     3     5     0     0     0     0     0     4     2     66.67 50
3     2     5     0     0     0     0     0     4     2     66.67 50
5     2     3     0     0     0     0     0     4     2     66.67 50
2     3     5     0     0     0     0     0     4     3     100   50
2     5     3     0     0     0     0     0     4     3     66.67 50
3     5     2     0     0     0     0     0     4     3     100   50

FinalRuleTable (after Mapping Back)

Rule Head    Symbol  Rule Body    Confidence(%)  Support(%)
Milk         =>      Eggs         100            50
Eggs         =>      Milk         67             50
Sugar        =>      Eggs         67             50
Eggs         =>      Sugar        67             50
Sugar        =>      Cake         100            75
Cake         =>      Sugar        100            75
Eggs         =>      Cake         67             50
Cake         =>      Eggs         67             50
Sugar        =>      Eggs, Cake   67             50
Eggs         =>      Sugar, Cake  67             50
Cake         =>      Sugar, Eggs  67             50
Sugar, Eggs  =>      Cake         100            50
Sugar, Cake  =>      Eggs         67             50
Eggs, Cake   =>      Sugar        100            50

SQL‐based Association rule mining

• SQL-92
  – K-way joins
  – Subquery
  – 3-way joins
  – 2-group-by
  – Set-oriented apriori (an improvement of K-way join)

• SQL-OR (uses table functions and other features such as BLOBs, CLOBs, etc.)
  – GatherJoin
  – GatherPrune
  – GatherCount
  – Horizontal
  – VerticalTid
  – SQL-bodied functions

SQL-OR based approaches

• Use UDFs (IBM DB2/UDB) and stored procedures (Oracle) along with some object-relational constructs.

• Advantages:
  – Flexibility in the way queries can be written.
  – Can make use of complex data structures, a way to mimic the main-memory algorithms.

• Disadvantages:
  – Specific to a given RDBMS
  – Involves considerable effort for developing and testing these procedures.

Methodology for experiments

• Synthetic data sets, generated by using IBM’s data‐generator.

• Datasets are named as TxxIyyDzzzK.

– xx denotes the average number of items present per transaction. 

– yy denotes the average size of the maximal potentially frequent itemsets in the dataset.

– zzzK denotes the total number of transactions in K (1000’s).

– Example: T5I2D1000K.

• Tested on Oracle 8i and IBM DB2/UDB V6.1

• Each experiment was performed 4 times in a row and the average was taken

• Most results are shown for three datasets: T5I2D500K, T5I2D1000K and T10I4D100K.

VerticalTid (Vtid)

• Uses a different representation for the input data. Uses two procedures:
  – SaveTid
    • To change the representation of the input data
  – CountAndK
    • To use this changed representation for support counting

VerticalTid (contd.)

Transaction

Item  Tid
1     1
2     2
2     3
2     4
3     1
3     2
3     3
4     1
4     3
5     2
5     3
5     4

TidListTable (output of the SaveTid Procedure)

Item  TidList (Clob)  Count
1     “1”             1
2     “2,3,4”         3
3     “1,2,3”         3
4     “1,3”           2
5     “2,3,4”         3

F1

Item  Count
2     3
3     3
4     2
5     3

C2

Item1  Item2
2      3
2      4
2      5
3      4
3      5
4      5

CountAnd2 Procedure

Item1  Item2  TidList1, TidList2   Count
2      3      “2,3,4”, “1,2,3”     2
2      4      “2,3,4”, “1,3”       1
2      5      “2,3,4”, “2,3,4”     3
3      4      “1,2,3”, “1,3”       2
3      5      “1,2,3”, “2,3,4”     2
4      5      “1,3”, “2,3,4”       1

F2

Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2

VerticalTid (contd.)

F2

Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2

C3

Item1  Item2  Item3
2      3      5
3      4      5

TidListTable (output of the SaveTid Procedure)

Item  TidList (Clob)  Count
1     “1”             1
2     “2,3,4”         3
3     “1,2,3”         3
4     “1,3”           2
5     “2,3,4”         3

CountAnd3 Procedure

Item1  Item2  Item3  TidList1, TidList2, TidList3   Count
2      3      5      “2,3,4”, “1,2,3”, “2,3,4”      2
3      4      5      “1,2,3”, “1,3”, “2,3,4”        1
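The two procedures can be mimicked in plain Python to make the data flow concrete. This is a hedged sketch: save_tid and count_and are stand-ins written for this example (the real SaveTid and CountAndK are stored procedures working on CLOBs), and comma-separated strings play the role of the CLOB TidLists.

```python
from collections import defaultdict

# The Transaction table from the slide as (tid, item) pairs.
transactions = [(1, 1), (1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
                (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)]


def save_tid(rows):
    """Stand-in for SaveTid: build one comma-separated TidList per item."""
    lists = defaultdict(list)
    for tid, item in sorted(rows):
        lists[item].append(tid)
    return {item: ",".join(map(str, tids)) for item, tids in lists.items()}


def count_and(tid_lists):
    """Stand-in for CountAndK: size of the intersection of k TidLists."""
    common = set(tid_lists[0].split(","))
    for tid_list in tid_lists[1:]:
        common &= set(tid_list.split(","))
    return len(common)


tid_list_table = save_tid(transactions)

# Support counting for the two candidates in C3.
c3 = [(2, 3, 5), (3, 4, 5)]
support = {
    cand: count_and([tid_list_table[item] for item in cand]) for cand in c3
}
print(tid_list_table, support)
```

Each call parses k strings, which hints at why the conclusions later single out CLOB processing as the expensive part.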

Experiment

Time taken (in secs) for mining different datasets

              T5I2D10K  T5I2D100K
TidListTable  22        102
Pass 1        0         0
Pass 2        3,148     55,487
Pass 3        8         175
Pass 4        1         11

Im_Vtid


Optimization

• In pass k, reduce the number of CLOBs passed to the CountAndK stored procedure.

1. Create CLOBs for only those items whose count > minsup

2. For pass 2: use the second pass optimization of k-way join

3. For pass k (k > 2), the CountAndK stored procedure has been modified to create the TidList for the frequent itemsets and materialize them.

4. In pass k+1, use this materialized relation, Ck+1 and TidListTable for support counting.

The CountAndK procedure now receives only 2 CLOBs (TidLists)

Optimization

Modified CountAnd3 Stored Procedure

Item1  Item2  Item3  TidList1, TidList2, TidList3   TidList  Count
2      3      5      “2,3,4”, “1,2,3”, “2,3,4”      “2,3”    2
2      3      7      “1,2,3”, “1,3”, “1,3,4”        “1,3”    2

FComb3

Item1  Item2  Item3  TidList  Count
2      3      5      “2,3”    2
2      3      7      “1,3”    3

F3

Item1  Item2  Item3  Count
2      3      5      2
2      3      7      2

C4

Item1  Item2  Item3  Item4
2      3      5      7

Modified CountAnd4 Stored Procedure

Item1  Item2  Item3  Item4  TidList, TidList   Count
2      3      5      7      “2,3”, “1,3,4”     2
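The effect of materializing the intersected TidList can be sketched as follows. The TidLists below are hypothetical values chosen for this example rather than the ones on the slide, and intersect is an illustrative helper, not the modified stored procedure itself.

```python
def intersect(*tid_lists):
    """Intersect comma-separated TidLists; return (TidList string, count)."""
    common = set(map(int, tid_lists[0].split(",")))
    for tid_list in tid_lists[1:]:
        common &= set(map(int, tid_list.split(",")))
    ordered = sorted(common)
    return ",".join(map(str, ordered)), len(ordered)


# Hypothetical per-item TidLists (stand-ins for the CLOBs in TidListTable).
tid_list_table = {2: "1,2,3,4", 3: "1,2,3", 5: "2,3,4", 7: "1,2,3"}

# Pass 3: intersect three TidLists and materialize the result (FComb3).
fcomb3 = {}
tid_list, count = intersect(
    tid_list_table[2], tid_list_table[3], tid_list_table[5]
)
if count >= 2:  # assumed absolute minsup
    fcomb3[(2, 3, 5)] = tid_list

# Pass 4: only 2 TidLists are needed, the materialized one for (2, 3, 5)
# and the per-item one for the extending item 7.
tid_list4, count4 = intersect(fcomb3[(2, 3, 5)], tid_list_table[7])
print(fcomb3, tid_list4, count4)
```

The space spent on FComb3 buys a pass 4 that parses two short lists instead of four full-length ones.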

Experiments

[Figure: chart with series labeled GJn and Vtid.]

Gather Join (Gjn)

• Differs in the way candidate itemsets are generated.
  – Uses the CombinationK procedure for candidate itemset generation.

• The CombinationK procedure scans the input dataset and collects all the items that correspond to a transaction in a Vector. This vector is then used to generate candidate itemsets of length K.
  – For support counting: a simple group by on the k items of the candidate itemset is sufficient to identify frequent itemsets.

Gather Join (contd…)

• For the Oracle implementation, the CombinationK stored procedure has been modified.
  – It scans the input dataset, collects all items bought in a transaction in a Vector, and uses it to generate candidate itemsets.
  – The ItemList table is not materialized.
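The Gather Join idea can be approximated in a few lines of Python. In this sketch, combination_k is a hypothetical stand-in for the CombinationK procedure, a Counter plays the role of the SQL group by, and the data and the ">=" threshold are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

transactions = [(1, 1), (1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
                (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)]


def combination_k(rows, k):
    """Gather the items of each transaction, then emit every k-item combination."""
    by_tid = defaultdict(list)
    for tid, item in rows:
        by_tid[tid].append(item)
    for items in by_tid.values():
        yield from combinations(sorted(items), k)


def frequent_itemsets(rows, k, minsup):
    """Group the emitted combinations and keep those meeting minsup."""
    counts = Counter(combination_k(rows, k))
    return {cand: n for cand, n in counts.items() if n >= minsup}


f2 = frequent_itemsets(transactions, 2, minsup=2)
f3 = frequent_itemsets(transactions, 3, minsup=2)
print(f2, f3)
```

Every combination of every transaction is emitted before counting, so the number of generated rows grows quickly with transaction length.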

Experiments

could not run for D500K and D1000K


Optimization

• Goal: reduce the number of candidate itemsets that are generated.

• Fact: The frequent itemsets of length k-1 are sufficient to generate all candidate itemsets of length k.

• Pass k (k > 2): Materialize the tuples that contain frequent itemsets and use only these tuples, instead of the input table, for generation of Ck+1.
  – The CombinationK stored procedure has been modified to insert the corresponding transaction ID along with the item combinations.

Optimization

Transaction

Tid  Iid
1    1
1    3
1    4
2    2
2    3
2    5
3    2
3    3
3    4
3    5
4    2
4    5

C2 (output of Modified Combination2)

Tid  Item1  Item2
1    1      3
1    1      4
1    3      4
2    2      3
2    2      5
2    3      5
3    2      3
3    2      4
3    2      5
3    3      4
3    3      5
3    4      5
4    2      5

F2

Item1  Item2  Count
2      3      2
2      5      3
3      4      2
3      5      2

FComb2

Tid  Item1  Item2
1    3      4
2    2      3
2    2      5
2    3      5
3    2      3
3    2      5
3    3      4
3    3      5
4    2      5

Optimization

FComb2

Tid  Item1  Item2
1    3      4
2    2      3
2    2      5
2    3      5
3    2      3
3    2      5
3    3      4
3    3      5
4    2      5

Input to Modified Combination3

Tid  Item
1    3
1    4
2    2
2    3
2    5
3    2
3    3
3    4
3    5

C3 (output of Modified Combination3)

Tid  Item1  Item2  Item3
2    2      3      5
3    2      3      4
3    2      3      5
3    3      4      5

F3

Item1  Item2  Item3  Count
2      3      5      2

Experiments


Gather Count (Gcnt)

• Similar to Gjn, except that in pass 2 this approach uses a 2-dimensional array to count all the two-item combinations and outputs only those item combinations whose support > minsup.

• Because of memory constraints, the same is not possible for other passes.
  – As the C2 table is not materialized, pruning can't be done for pass 3.
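The pass-2 counting array can be sketched as below. This is an illustration under stated assumptions: items are small integers usable directly as array indexes, the function name gather_count_pass2 is made up, and the threshold is tested with ">=" on an absolute count.

```python
transactions = [(1, 1), (1, 3), (1, 4), (2, 2), (2, 3), (2, 5),
                (3, 2), (3, 3), (3, 4), (3, 5), (4, 2), (4, 5)]


def gather_count_pass2(rows, n_items, minsup):
    """Count all 2-item combinations in a 2-D array and emit the frequent ones."""
    # counts[a][b] holds the number of transactions containing both a and b (a < b).
    counts = [[0] * (n_items + 1) for _ in range(n_items + 1)]

    by_tid = {}
    for tid, item in rows:
        by_tid.setdefault(tid, []).append(item)

    for items in by_tid.values():
        items = sorted(set(items))
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                counts[a][b] += 1

    # Only combinations meeting the threshold leave the procedure.
    return [
        (a, b, counts[a][b])
        for a in range(1, n_items + 1)
        for b in range(a + 1, n_items + 1)
        if counts[a][b] >= minsup
    ]


f2 = gather_count_pass2(transactions, n_items=5, minsup=2)
print(f2)
```

The array needs on the order of n_items squared cells, which is the memory constraint that keeps the same trick from being applied to 3-item and longer combinations.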

Experiments (Gcnt)


Optimization

• Because Gather Count is very similar to Gather Join, the same optimization also applies to it.

• For pass 3 onwards, the Im_Gcnt approach uses the modified CombinationK procedures.

Experiments

Experiment

IM_Vtid

Conclusion: SQL‐OR

• Ranking of the naïve approaches:

1. Gather Count

2. Gather Join

3. Vertical Tid

• Although the SQL-OR based approaches seem very simple in the way support counting is done, processing CLOBs is time-consuming.

• The optimizations reduce the number of CLOBs that need to be processed and hence are very effective.

Metadata Table

• Based on the cardinality.

• Underlying RDBMS.

• Whether we can use any extra space.


Metadata Table

Summary Table for SQL-OR based Approaches

            IBM DB2/UDB                   Oracle
T5I2DzzzK   Extra Space  No Extra Space   Extra Space  No Extra Space   Support Value
10K         -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.20 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.15 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.10 %
50K         -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.20 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.15 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.10 %
100K        -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.20 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.15 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.10 %
500K        -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.20 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.15 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.10 %
1000K       -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.20 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.15 %
            -NA-         Gjn              IM_Gcnt      Gcnt             S = 0.10 %

SQL‐92 based approaches

Implementation on Oracle and IBM DB2/UDB

Experiments

Second Pass Optimization on Pruned Input (SpoPi)

Reuse of Item Combinations on Pruned Input

Reuse of Item Combinations and Second Pass Optimization

Combination of all Optimizations

Comparison of SQL based approaches

Results

Trends in Oracle for SQL-92 based approaches

Table Name  Ranking  Supp = 0.2%  Supp = 0.15%  Supp = 0.1%
T5I2D100K   First    RicSpo       RicSpo        Kwj
            Second   All          All           RicSpo
            Last     RicPi        RicPi         RicPi
T5I2D500K   First    RicSpo       RicSpo        Spo
            Second   Spo          Spo           RicSpo
            Last     RicPi        RicPi         RicPi

Table Name  Ranking  Supp = 2.0%  Supp = 1.0%  Supp = 0.75%  Supp = 0.33%
T10I4D100K  First    All          RicSpo       RicSpo        Ric
            Second   Pi           All          All           Spo
            Last     Ric          RicPi        RicPi         RicSpo

Results

Trends in IBM DB2/UDB for SQL-92 based approaches

Table Name  Ranking  Supp = 0.2%  Supp = 0.15%  Supp = 0.1%
T5I2D100K   First    RicSpo       Spo           RicSpo
            Second   Spo          RicSpo        All
            Last     RicPi        SpoPi         SpoPi
T5I2D500K   First    Spo          Spo           Spo
            Second   RicSpo       RicSpo        RicSpo
            Last     SpoPi        SpoPi         SpoPi

Table Name  Ranking  Supp = 2.0%  Supp = 1.0%  Supp = 0.75%
T10I4D100K  First    Spo          RicSpo       RicSpo
            Second   RicSpo       All          All
            Last     Ric          Kwj          Kwj

Metadata Table

Summary Table for SQL-92 based Approaches

            IBM DB2/UDB                   Oracle
T5I2DzzzK   Extra Space  No Extra Space   Extra Space  No Extra Space   Support Value
10K         RicSpo       Spo              RicSpo       Spo              S = 0.20 %
            RicSpo       Spo              RicSpo       Spo              S = 0.15 %
            RicSpo       Spo              Spo          Spo              S = 0.10 %
50K         RicSpo       Spo              RicSpo       Spo              S = 0.20 %
            Spo          Spo              RicSpo       Spo              S = 0.15 %
            Spo          Spo              Spo          Spo              S = 0.10 %
100K        RicSpo       Spo              RicSpo       Spo              S = 0.20 %
            Spo          Spo              RicSpo       Spo              S = 0.15 %
            Spo          Spo              Spo          Spo              S = 0.10 %
500K        Spo          Spo              RicSpo       Spo              S = 0.20 %
            Spo          Spo              RicSpo       Spo              S = 0.15 %
            Spo          Spo              Spo          Spo              S = 0.10 %
1000K       RicSpo       Spo              RicSpo       Spo              S = 0.20 %
            Spo          Spo              RicSpo       Spo              S = 0.15 %
            Spo          Spo              Kwj          Spo              S = 0.10 %
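One way a mining-aware optimizer might consult such a metadata table is a plain keyed lookup. The sketch below is an assumption about the mechanism, not the deck's implementation: the key layout, the function choose_approach, and the fallback to "Spo" are all hypothetical, and only a handful of rows from the SQL-92 summary table are included.

```python
# (dataset size, support %, DBMS, extra space allowed) -> approach.
# The entries are copied from the SQL-92 summary table; the key layout is assumed.
METADATA = {
    ("500K", 0.20, "DB2", True): "Spo",
    ("500K", 0.20, "Oracle", True): "RicSpo",
    ("1000K", 0.20, "DB2", True): "RicSpo",
    ("1000K", 0.20, "DB2", False): "Spo",
    ("1000K", 0.10, "Oracle", True): "Kwj",
    ("1000K", 0.10, "Oracle", False): "Spo",
}


def choose_approach(size, support, dbms, extra_space, default="Spo"):
    """Pick the mining approach recorded for this configuration.

    The fallback value is an assumption made for this sketch; the slides do
    not say what happens for configurations missing from the table.
    """
    return METADATA.get((size, support, dbms, extra_space), default)


print(choose_approach("1000K", 0.10, "Oracle", True))
print(choose_approach("1000K", 0.20, "DB2", False))
```

Keeping the heuristics as data rather than code means new experiments on another DBMS only add rows.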

Thank You !!!

Discussion

References

• Thuraisingham, B., A Primer for Understanding and Applying Data Mining. IEEE, 2000. Vol. 2, No.1: p. 28-31.

• Thomas, S., Architectures and optimizations for integrating Data Mining algorithms with Database Systems, PhD thesis, CISE department 1998, University of Florida: Gainesville.

• Agrawal, R., T. Imielinski, and A. Swami. Mining Association Rules between sets of items in large databases. in ACM SIGMOD International Conference on the Management of Data. 1993. Washington, D.C.

• Agrawal, R. and R. Srikant. Fast Algorithms for mining association rules. in 20th Int'l Conference on Very Large Databases (VLDB). 1994.

• Savasere, A., E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. in 21st Int'l Conf. on Very Large Databases (VLDB). 1995. Zurich, Switzerland.

• Shenoy, P., et al. Turbo-charging Vertical Mining of Large Databases. in ACM SIGMOD Int'l Conference on Management of Data. 2000. Dallas.

• Han, J., J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. in ACM SIGMOD Int'l Conference on Management of Data. 2000. Dallas.

• Houtsma, M. and A. Swami. Set-Oriented Mining for Association Rules in Relational Databases. in 11th International Conference on Data Engineering (ICDE). 1995.


• Han, J., et al. DMQL: A data mining query language for relational database. in ACM SIGMOD workshop on research issues on data mining and knowledge discovery. 1996. Montreal.

• Meo, R., G. Psaila, and S. Ceri. A New SQL-like Operator for Mining Association Rules. in Proceedings of the 22nd VLDB Conference. 1996. Mumbai, India.

• Agrawal, R. and K. Shim, Developing tightly-coupled Data Mining Applications on a Relational Database System. 1995, IBM Almaden Research Center: San Jose, California.

• Sarawagi, S., S. Thomas, and R. Agrawal. Integrating Association Rule Mining with Relational Database System: Alternatives and Implications. in ACM SIGMOD Int'l Conference on Management of Data. 1998. Seattle, Washington.

• Dudgikar, M., A Layered Optimizer for Mining Association Rules over RDBMS, in CSE Department. 2000, University of Florida: Gainesville.

• Thomas. S, and Chakravarthy. S. Performance evaluation and optimization of join queries for association rule mining. Proc. of the First International Conference on Data Warehousing and Knowledge Discovery, DaWaK '99, Florence, Italy, August 1999

• R. Balachandran, S. Padmanabhan, and S. Chakravarthy, “Enhanced DB-Subdue: Supporting Subtle Aspects of Graph Mining Using a Relational Approach”, To appear in PAKDD, Singapore, April 2006 (Short paper)

• A. Srinivasan, S. Sreshta, and S. Chakravarthy, “Discovery of Significant Intervals in Sequential Data”, in the Proc. of 1st ADBIS Workshop on Data Mining and Knowledge Discovery, Tallinn, Estonia, September 2005, pp. 87-98.


• H. Kona, S. Chakravarthy, and A. Arora, “SQL-Based Approach to Incremental Association Rule Mining”, in the Proc. of 1st ADBIS Workshop on Data Mining and Knowledge Discovery, Tallinn, Estonia, September 2005, pp. 11-24

• Sharma Chakravarthy, Ramji Beera, and Ram Balachandran, DB-Subdue: Database Approach to Graph Mining, In PAKDD conference, Sydney, May 2004.

• P. Mishra and S. Chakravarthy, “Performance Evaluation of SQL-OR Variants for Association Rule Mining”, in Proc. Of DaWaK (Data warehousing and Knowledge Discovery), September 2003, Prague.

• P. Mishra and S. Chakravarthy, “Performance Evaluation and Analysis of K-way join variants for Association Rule Mining”, in Proceedings of BNCOD 2003, Sheffield, UK, July 2003, 95-114.

• M. Dudgikar, S. Chakravarthy, R. Liuzi, and L. Wong, “A Layered Optimizer for Mining Association Rules over Relational Database Management Systems”, 2003 International Conference on Artificial Intelligence (IC-AI'2003), June 2002, Monte Carlo Resort, Las Vegas, Nevada

• S. Chakravarthy and H. Zhang, Visualization of association Rules over RDBMSs, in the proceedings of ACM SAC 2003, Melbourne, FL, March 2003 (Multi-media and Visualization Track).


• Mr. Srihari Padmanabhan, “Relational Database Approach to Graph Mining and Hierarchical Reduction”, Fall 2005 http://itlab.uta.edu/itlabweb/students/sharma/theses/pad05ms.pdf

• Mr. Sunit Sreshta, “SQL_Based Approach to Significant Interval Discovery in Time-Series Data”, Summer 2005 http://itlab.uta.edu/itlabweb/students/sharma/theses/shr05ms.pdf


• R. Balachandran, “Relational Approach to Modeling and Implementing Subtle Aspects of Graph Mining”, Fall 2003. http://www.cse.uta.edu/Research/Publications/Downloads/CSE-2003-41.pdf

• Ms. A. Krishnamurthy, “Significant Interval and Episode Discovery in Time-Series Data”, Fall 2003. http://www.cse.uta.edu/Research/Publications/Downloads/CSE-2003-39.pdf

• Mr. P. Mishra, “Performance Evaluation and Analysis of SQL-Based Approaches for Association Rule Mining”, Fall 2002. http://www.cse.uta.edu/Research/Publications/Downloads/CSE-2003-3.pdf

• Mr. Hongen Zhang, “Mining and Visualization of Association Rules in Relational DBMSs”, Summer 2000. http://itlab.uta.edu/sharma/People/ThesisWeb/etd.pdf

• S. Thomas and S. Chakravarthy, “Incremental Mining of Constrained Associations”, in Proc. of High Performance Computing (HiPC), Bangalore, India, Dec. 2000.