OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage

Post on 16-Feb-2016

DESCRIPTION

Ben-Gurion University of the Negev, Faculty of Engineering Sciences, Department of Information Systems Engineering. OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage. Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici.

TRANSCRIPT

OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage

Ben-Gurion University of the Negev

Faculty of Engineering Sciences

Department of Information Systems Engineering

Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici

Definitions

$T_A$ – a given table A

$T_B$ – a given table B (our goal is to link records from table $T_A$ with one or more records from $T_B$)

$|T_A|$ – number of records in $T_A$

$|T_B|$ – number of records in $T_B$

$A$ – the set of attributes of table $T_A$, where $a_i$ is the i-th attribute

$|A|$ – the number of attributes in $T_A$

$B$ – the set of attributes of table $T_B$, where $b_i$ is the i-th attribute

$|B|$ – the number of attributes in $T_B$

$r(a) \in T_A$ – a record from table $T_A$

$r(b) \in T_B$ – a record from table $T_B$

$T_A \times T_B$ – the table generated by applying the Cartesian product of $T_A$ and $T_B$

$r = (r(a), r(b)) \in T_A \times T_B$ – a record of $T_A \times T_B$

$T_{AB} \subseteq T_A \times T_B$ – the set of matching records

$\overline{T_{AB}} \subseteq T_A \times T_B$ – the set of non-matching records

$d$ – a node in the OCCT model

$A_d \subseteq A$ – the subset of attributes of $T_A$ that were already selected as splitting attributes in the path from the root of the tree to node $d$

$T_{AB}(d) \subseteq T_{AB}$ – the subset of matching instances at node $d$ of the OCCT tree

$Split_a(T_{AB}(d)) = T_{AB}(d)(a)$ – the splitting of $T_{AB}(d)$ into $n$ subsets according to attribute $a$ such that $\forall i = 1..n$: $T_{AB}(d_i)(a) = \{ r \in T_{AB}(d) \mid a = v_i \}$

$\sigma_p(T_{AB}(d))$ – selection operator that is used to select the records in $T_{AB}(d)$ that satisfy the given predicate $p$ (in this case $p$ is $a = v_i$)

$\pi_A(T_{AB}(d))$ – projection operator that is used to select the subset of attributes in $T_{AB}(d)$ that appear in the attribute collection $A$
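The operators defined above can be sketched on plain Python lists of dicts. This is only an illustration under stated assumptions: the toy tables, the attribute values, and the `is_match` rule are hypothetical and do not come from the slides; only the operator semantics (Cartesian product, selection, projection, split by attribute value) follow the definitions.

```python
from itertools import product

# Hypothetical toy tables; attribute names follow the slides, values are made up.
T_A = [{"a1": "x", "a2": 1}, {"a1": "y", "a2": 2}]
T_B = [{"b1": "p"}, {"b1": "q"}]

# Cartesian product T_A x T_B: each record r = (r(a), r(b)) merged into one dict.
cartesian = [{**ra, **rb} for ra, rb in product(T_A, T_B)]

# Hypothetical matching rule, used only to carve out T_AB for this example.
def is_match(r):
    return (r["a1"], r["b1"]) in {("x", "p"), ("x", "q"), ("y", "q")}

T_AB = [r for r in cartesian if is_match(r)]                 # matching records
T_AB_complement = [r for r in cartesian if not is_match(r)]  # non-matching records

def select(rows, predicate):
    """Selection sigma_p: keep the records that satisfy predicate p."""
    return [r for r in rows if predicate(r)]

def project(rows, attributes):
    """Projection pi_A: keep only the listed attributes of each record."""
    return [{k: r[k] for k in attributes} for r in rows]

def split(rows, attribute):
    """Split_a: one subset per value v_i of the attribute, each built with sigma_{a = v_i}."""
    values = sorted({r[attribute] for r in rows})
    return {v: select(rows, lambda r, v=v: r[attribute] == v) for v in values}

subsets = split(T_AB, "a1")
print(project(subsets["x"], ["b1"]))
```

In this toy data the record with a1 = x is linked to two records of T_B, which is the one-to-many situation the title refers to.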

Definitions

TA: attributes a1, a2, a3, a4, …, an

TB: attributes b1, b2, b3, b4, …, bm

A = {a1, a2, a3, …, an}, |A| = n

|TA| = number of records in TA

r(a) = a record from TA

B = {b1, b2, b3, …, bm}, |B| = m

|TB| = number of records in TB

r(b) = a record from TB

Definitions

TA: attributes a1, a2, a3, a4, …, an

TB: attributes b1, b2, b3, b4, …, bm

TA x TB: attributes a1, a2, a3, a4, …, an, b1, b2, b3, b4, …, bm

r = (r(a), r(b))

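Definitions

TA x TB: attributes a1, a2, a3, a4, …, an, b1, b2, b3, b4, …, bm, Target

Records with Target = match form TAB, the set of matching records.

Records with Target = no-match form the complement of TAB, the set of non-matching records.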

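Definitions

Node d is split into child nodes d1 and d2 according to the values of the splitting attribute:

d1: attributes a1, a2, …, an, b1, …, bm; every record has the value v1

d2: attributes a1, a2, …, an, b1, …, bm; every record has the value v2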

Definitions

Tree nodes: d1, d2, d3, d4, d5

Ad2 = {a1}

Ad4 = {a1, a2}

Ad ⊆ A – the subset of attributes of TA that were already selected as splitting attributes in the path from the root of the tree to node d.
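A minimal sketch of how Ad could be collected by walking from a node back to the root. The tree layout is an assumption made for illustration (d1 is the root and splits on a1, d2 splits on a2 and has children d4 and d5); the transcript only preserves the node names and the two example sets. The `Node` class and `attributes_on_path` function are hypothetical names.

```python
class Node:
    """A tree node; split_attribute is the attribute tested at this node (None for a leaf)."""

    def __init__(self, name, split_attribute=None, parent=None):
        self.name = name
        self.split_attribute = split_attribute
        self.parent = parent


def attributes_on_path(node):
    """Return A_d: the attributes already used as splits on the path from the root to node."""
    used = set()
    current = node.parent
    while current is not None:
        used.add(current.split_attribute)
        current = current.parent
    return used


# Assumed layout: d1 (root, splits on a1) -> d2, d3; d2 (splits on a2) -> d4, d5.
d1 = Node("d1", split_attribute="a1")
d2 = Node("d2", split_attribute="a2", parent=d1)
d3 = Node("d3", parent=d1)
d4 = Node("d4", parent=d2)
d5 = Node("d5", parent=d2)

print(attributes_on_path(d2))  # the set containing a1
print(attributes_on_path(d4))  # the set containing a1 and a2
```

Under this assumed layout the sketch reproduces the two sets on the slide, Ad2 = {a1} and Ad4 = {a1, a2}, and the root has an empty set.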

Running Examples

The data set

Customer Type  Customer City  Request Location  Request Day Of Week  Request Part Of Day  Request ID
private        Berlin         Berlin            Friday               Afternoon            1
private        Hamburg        Hamburg           Wednesday            Afternoon            2
business       Berlin         Berlin            Wednesday            Morning              3
private        Berlin         Berlin            Wednesday            Morning              4
private        Berlin         Berlin            Saturday             Afternoon            5
private        Berlin         Berlin            Thursday             Morning              6
private        Berlin         Berlin            Friday               Afternoon            7
business       Berlin         Berlin            Saturday             Afternoon            8
private        Berlin         Berlin            Saturday             Afternoon            9
business       Hamburg        Hamburg           Friday               Afternoon            10
business       Hamburg        Hamburg           Monday               Afternoon            11
private        Hamburg        Hamburg           Saturday             Afternoon            12
private        Berlin         Berlin            Monday               Afternoon            13
private        Bonn           Berlin            Monday               Afternoon            14
private        Berlin         Berlin            Monday               Afternoon            15
private        Bonn           Bonn              Saturday             Morning              16
private        Hamburg        Hamburg           Saturday             Morning              17
private        Hamburg        Hamburg           Saturday             Morning              18
private        Hamburg        Hamburg           Friday               Afternoon            19

The data set – cont.

Customer Type  Customer City  Request Location  Request Day Of Week  Request Part Of Day  Request ID
private        Bonn           Hamburg           Friday               Afternoon            20
private        Berlin         Hamburg           Friday               Morning              21
business       Berlin         Berlin            Friday               Morning              22
private        Berlin         Berlin            Friday               Morning              23
private        Berlin         Berlin            Wednesday            Afternoon            24
private        Berlin         Berlin            Thursday             Afternoon            25
business       Berlin         Berlin            Thursday             Afternoon            26
business       Bonn           Bonn              Monday               Afternoon            27
private        Hamburg        Bonn              Monday               Afternoon            28
business       Berlin         Bonn              Monday               Afternoon            29
business       Bonn           Bonn              Wednesday            Afternoon            30
private        Bonn           Bonn              Friday               Afternoon            31
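The 31 records above can be loaded and tallied per request attribute, which is where the numerators of the W values on the Coarse Grained Jaccard slides appear to come from (for example 16 of 31 requests have Request Location = Berlin). The field names and the `value_counts` helper below are illustrative choices, not names used in the slides, and "Wednseday" is spelled "Wednesday" here.

```python
from collections import Counter
from fractions import Fraction

# One record per line: customer type, customer city, request location,
# request day of week, request part of day, request ID.
RAW = """
private Berlin Berlin Friday Afternoon 1
private Hamburg Hamburg Wednesday Afternoon 2
business Berlin Berlin Wednesday Morning 3
private Berlin Berlin Wednesday Morning 4
private Berlin Berlin Saturday Afternoon 5
private Berlin Berlin Thursday Morning 6
private Berlin Berlin Friday Afternoon 7
business Berlin Berlin Saturday Afternoon 8
private Berlin Berlin Saturday Afternoon 9
business Hamburg Hamburg Friday Afternoon 10
business Hamburg Hamburg Monday Afternoon 11
private Hamburg Hamburg Saturday Afternoon 12
private Berlin Berlin Monday Afternoon 13
private Bonn Berlin Monday Afternoon 14
private Berlin Berlin Monday Afternoon 15
private Bonn Bonn Saturday Morning 16
private Hamburg Hamburg Saturday Morning 17
private Hamburg Hamburg Saturday Morning 18
private Hamburg Hamburg Friday Afternoon 19
private Bonn Hamburg Friday Afternoon 20
private Berlin Hamburg Friday Morning 21
business Berlin Berlin Friday Morning 22
private Berlin Berlin Friday Morning 23
private Berlin Berlin Wednesday Afternoon 24
private Berlin Berlin Thursday Afternoon 25
business Berlin Berlin Thursday Afternoon 26
business Bonn Bonn Monday Afternoon 27
private Hamburg Bonn Monday Afternoon 28
business Berlin Bonn Monday Afternoon 29
business Bonn Bonn Wednesday Afternoon 30
private Bonn Bonn Friday Afternoon 31
"""

FIELDS = ("customer_type", "customer_city", "request_location",
          "request_day_of_week", "request_part_of_day", "request_id")

records = [dict(zip(FIELDS, line.split())) for line in RAW.strip().splitlines()]


def value_counts(rows, attribute):
    """Count how many records carry each value of the given attribute."""
    return Counter(row[attribute] for row in rows)


location_counts = value_counts(records, "request_location")
day_counts = value_counts(records, "request_day_of_week")
part_counts = value_counts(records, "request_part_of_day")

# Share of all records per request location, as exact fractions.
location_weights = {v: Fraction(n, len(records)) for v, n in location_counts.items()}
print(location_weights)
```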

Coarse Grained Jaccard

Coarse Grained Jaccard – Splitting the root of the tree

Three candidates for split:
• Request location
• Request day of week
• Request part of day
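For reference, the plain Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below shows only that generic coefficient on hypothetical (customer type, customer city) pairs; exactly which records the slides compare to arrive at values such as 1/23 is not recoverable from this transcript, so the example does not try to reproduce them.

```python
from fractions import Fraction


def jaccard(left, right):
    """Jaccard coefficient |left & right| / |left | right| of two collections."""
    left, right = set(left), set(right)
    union = left | right
    if not union:
        return Fraction(0)  # convention chosen here for two empty sets
    return Fraction(len(left & right), len(union))


# Hypothetical subsets, for illustration only.
subset_1 = {("private", "Berlin"), ("business", "Berlin")}
subset_2 = {("private", "Berlin"), ("private", "Bonn")}

print(jaccard(subset_1, subset_2))  # 1/3: one shared pair out of three distinct pairs
```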

CGJ – Splitting the root of the tree

Candidate: Request location. Node d is split once per value:

reqLocation = Berlin vs. reqLocation != Berlin: W1 = 16/31, Score1 = 1/23

reqLocation = Hamburg vs. reqLocation != Hamburg: W2 = 9/31, Score2 = 2/23

reqLocation = Bonn vs. reqLocation != Bonn: W3 = 6/31, Score3 = 1/23

Score(Split_reqLocation) = W1 * Score1 + W2 * Score2 + W3 * Score3 = 0.0561
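The `*` and `+` marks on this slide indicate that the per-value scores are combined as a weighted sum. A small sketch with exact fractions, using only the W and Score values printed on the slide, reproduces the reported 0.0561; the function name `weighted_split_score` is an illustrative choice.

```python
from fractions import Fraction


def weighted_split_score(parts):
    """Sum W_i * Score_i over the (weight, score) pairs of one candidate attribute."""
    return sum((w * s for w, s in parts), Fraction(0))


# (W_i, Score_i) pairs as printed on the Request location slide.
req_location = [
    (Fraction(16, 31), Fraction(1, 23)),
    (Fraction(9, 31), Fraction(2, 23)),
    (Fraction(6, 31), Fraction(1, 23)),
]

score = weighted_split_score(req_location)
print(score)                    # 40/713
print(round(float(score), 4))   # 0.0561
```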

CGJ – Splitting the root of the tree

Candidate: Request day of week. Node d is split once per value:

dayOfWeek = Monday vs. dayOfWeek != Monday: W1 = 7/31, Score1 = 3/15

dayOfWeek = Wednesday vs. dayOfWeek != Wednesday: W2 = 5/31, Score2 = 5/15

dayOfWeek = Thursday vs. dayOfWeek != Thursday: W3 = 3/31, Score3 = 3/15

dayOfWeek = Friday vs. dayOfWeek != Friday: W4 = 9/31, Score4 = 5/15

dayOfWeek = Saturday vs. dayOfWeek != Saturday: W5 = 7/31, Score5 = 3/15

Score(Split_dayOfWeek) = W1 * Score1 + W2 * Score2 + W3 * Score3 + W4 * Score4 + W5 * Score5 = 0.260
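As a check on the arithmetic, the five W and Score values on this slide combine as a weighted sum to the reported 0.260. All terms share the denominators 31 and 15, so the sum can be written out directly:

```latex
\[
\mathrm{Score}(\mathrm{Split}_{dayOfWeek})
  = \sum_{i=1}^{5} W_i \cdot \mathrm{Score}_i
  = \frac{7 \cdot 3 + 5 \cdot 5 + 3 \cdot 3 + 9 \cdot 5 + 7 \cdot 3}{31 \cdot 15}
  = \frac{121}{465} \approx 0.260
\]
```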

CGJ – Splitting the root of the tree

Candidate: Request part of day. Node d is split into partOfDay = Afternoon and partOfDay = Morning.

Score1 = 4/23

Score(Split_partOfDay) = 0.173

Coarse Grained Jaccard – Splitting the root of the tree

Three candidates for split:
• Request location: 0.0561
• Request day of week: 0.260
• Request part of day: 0.173
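The transcript does not preserve the rule used to pick among the three candidates. The sketch below assumes that the candidate with the lowest Jaccard-based score (the least similar subsets) is preferred; that rule is an assumption, although it is consistent with the later slides, which show Request Location at the root.

```python
# Scores as listed on the slide.
candidate_scores = {
    "Request location": 0.0561,
    "Request day of week": 0.260,
    "Request part of day": 0.173,
}

# Assumption: a lower similarity score means a better split.
ranked = sorted(candidate_scores, key=candidate_scores.get)
best = ranked[0]
print(ranked)
print(best)  # Request location
```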

The split in the root

Fine Grained Jaccard

Fine Grained Jaccard – Splitting the root of the tree

Node d is split into Req. Location = Berlin and Req. Location != Berlin.

Least Probable Intersections

LPI – Splitting the root of the tree

Node d is split into Req. Location = Berlin and Req. Location != Berlin.

The data set (the 31 requests listed under "The data set") is shown again alongside the split into Req. Location = Berlin and Req. Location != Berlin.

Maximum Likelihood Estimation

MLE – Splitting the root of the tree

Tree figure: the root is Request Location; Cust. City and Cust. Type appear under each of its three branches.

p(Cust. City | Cust. Type), p(Cust. Type | Cust. City)
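For a categorical attribute, the maximum likelihood estimate of a conditional probability is a relative frequency, so the two quantities named on this slide can be estimated by counting. The sketch below does this for the (customer type, customer city) pairs of Request IDs 1 to 6 from the running example. How OCCT turns these estimates into a splitting score is not shown in the transcript and is not attempted here; `conditional_mle` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

# (customer type, customer city) for Request IDs 1 to 6 of the running example.
pairs = [
    ("private", "Berlin"), ("private", "Hamburg"), ("business", "Berlin"),
    ("private", "Berlin"), ("private", "Berlin"), ("private", "Berlin"),
]


def conditional_mle(rows, target_index, given_index):
    """Estimate p(target | given) as count(target, given) / count(given)."""
    joint = Counter((row[target_index], row[given_index]) for row in rows)
    marginal = Counter(row[given_index] for row in rows)
    return {(t, g): Fraction(n, marginal[g]) for (t, g), n in joint.items()}


# Keys are (target value, given value).
p_city_given_type = conditional_mle(pairs, target_index=1, given_index=0)
p_type_given_city = conditional_mle(pairs, target_index=0, given_index=1)

print(p_city_given_type[("Berlin", "private")])  # 4/5
print(p_type_given_city[("private", "Berlin")])  # 4/5
```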
