data mining algorithms for large-scale distributed systems

Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003

Upload: raya-hutchinson

Post on 01-Jan-2016


Page 1: Data Mining Algorithms for Large-Scale Distributed Systems

Data Mining Algorithms for Large-Scale Distributed Systems

Presenter: Ran Wolff
Joint work with Assaf Schuster
2003

Page 2: Data Mining Algorithms for Large-Scale Distributed Systems

What is Data Mining?

The automatic analysis of large databases
The discovery of previously unknown patterns
The generation of a model of the data

Page 3: Data Mining Algorithms for Large-Scale Distributed Systems

Main Data Mining Problems

Association rules (description): "He who does this and that will usually do some other thing too"

Classification (fraud, churn): "These attributes indicate good behavior; those indicate bad behavior"

Clustering (analysis): "There are three types of entities"

Page 4: Data Mining Algorithms for Large-Scale Distributed Systems

Examples – Classification

Customers purchase artifacts in a store
Each transaction is described in terms of a vector of features
The owner of the store tries to predict which transactions are fraudulent
  Example: young men who buy small electronics during rush hours
  Solution: do not accept checks

Page 5: Data Mining Algorithms for Large-Scale Distributed Systems

Examples – Associations

Amazon tracks user queries
  Suggests to each user additional books he would usually be interested in

Supermarket finds out “people who buy diapers also buy beer”
  Place diapers and beer at opposite sides of the supermarket

Page 6: Data Mining Algorithms for Large-Scale Distributed Systems

Examples – Clustering

Resource location
  Find the best location for k distribution centers

Feature selection
  Find 1000 concepts which summarize a whole dictionary
  Extract the meaning out of a document by replacing each word with the appropriate concept
  Car for auto, etc.

Page 7: Data Mining Algorithms for Large-Scale Distributed Systems

Why Mine Data of LSD Systems?

Data mining is good
It is otherwise difficult to monitor an LSD system: there is a lot of data, it is spread across the system, and it is impossible to collect
Many interesting phenomena are inherently distributed (e.g., DDoS); monitoring just a few nodes is not enough

Page 8: Data Mining Algorithms for Large-Scale Distributed Systems

An Example

Peers in the Kazaa network reveal to the system which files they have on their disks in exchange for access to the files of their peers
The result is a 2M-peer database of people's recreational preferences
Mining it, you could discover that Matrix fans are also keen on Radiohead (RH) songs
  Promote RH performances in Matrix Reloaded
  Ask RH to write the music for Matrix IV

Page 9: Data Mining Algorithms for Large-Scale Distributed Systems

What is so special about this problem?

Huge systems – huge amounts of data
Dynamic setting
  System – join / depart
  Data – constant update
Ad-hoc solution
Fast convergence

Page 10: Data Mining Algorithms for Large-Scale Distributed Systems

Our Work

We developed an association rule mining algorithm that works well in LSD systems
  Local and therefore scalable
  Asynchronous and therefore fast
  Dynamic and therefore robust
  Accurate – not approximated
  Anytime – you get early results fast

Page 11: Data Mining Algorithms for Large-Scale Distributed Systems

In a Teaspoon

A distributed data mining algorithm can be described as a series of distributed decisions
Those decisions are reduced to a majority vote
We developed a majority voting protocol which has all those good qualities
The outcome is an LSD association rule mining algorithm (still to come: classification)

Page 12: Data Mining Algorithms for Large-Scale Distributed Systems

Problem Definition – Association Rule Mining (ARM)

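\[ I = \{ i_1, i_2, \ldots, i_m \} \]

\[ X \subseteq I, \qquad T \subseteq I \]

\[ DB = \{ T_1, T_2, \ldots, T_k \} \]

\[ \mathrm{Support}(X, DB) = \left| \{ T \in DB : X \subseteq T \} \right| \]

\[ \mathrm{Freq}(X, DB) = \frac{\mathrm{Support}(X, DB)}{|DB|} \]

\[ \mathrm{Conf}(X \Rightarrow Y, DB) = \frac{\mathrm{Freq}(X \cup Y, DB)}{\mathrm{Freq}(X, DB)} \]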
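The three measures on this slide can be sketched in a few lines of Python. This is only an illustration: the toy database, the item names, and the function names are assumptions, and returning 0.0 for a rule whose left-hand side never occurs is a convention chosen here, not something the slides specify.

```python
# Toy transaction database (hypothetical data, for illustration only).
DB = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
)]


def support(x, db):
    """Number of transactions in db that contain every item of x."""
    return sum(1 for t in db if x <= t)


def freq(x, db):
    """Fraction of the transactions in db that contain x."""
    return support(x, db) / len(db)


def conf(x, y, db):
    """Freq(X u Y) / Freq(X); 0.0 when X never occurs (assumed convention)."""
    fx = freq(x, db)
    return freq(x | y, db) / fx if fx > 0 else 0.0


X, Y = frozenset({"diapers"}), frozenset({"beer"})
print(support(X, DB), freq(X, DB), conf(X, Y, DB))
```

In this toy database three of the four transactions contain diapers and two contain both diapers and beer, so the confidence of diapers ⇒ beer is two thirds.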

Page 13: Data Mining Algorithms for Large-Scale Distributed Systems

Solution to Traditional ARM

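\[ \text{Let } 0 \le \mathit{MinFreq} \le 1, \quad 0 \le \mathit{MinConf} \le 1 \]

\[ R[DB] = \left\{ X \Rightarrow Y \;:\; \mathrm{Freq}(X \cup Y, DB) \ge \mathit{MinFreq}, \;\; \mathrm{Conf}(X \Rightarrow Y, DB) \ge \mathit{MinConf} \right\} \]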
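As a rough illustration of what the solution set contains, the sketch below enumerates every candidate rule by brute force. It is not the algorithm of this talk and it is exponential in the number of items; the function names and toy data are assumptions, and it further assumes that X and Y are non-empty and disjoint.

```python
from itertools import combinations


def freq(x, db):
    """Fraction of the transactions in db that contain x."""
    return sum(1 for t in db if x <= t) / len(db)


def rules(db, min_freq, min_conf):
    """Brute-force R[DB]: all X => Y with Freq(X u Y) >= min_freq
    and Conf(X => Y) >= min_conf (X, Y non-empty and disjoint)."""
    items = sorted(set().union(*db))
    found = set()
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            z = frozenset(combo)
            fz = freq(z, db)
            if fz == 0 or fz < min_freq:
                continue
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    x = frozenset(lhs)
                    # freq(x) >= freq(z) > 0, so the division is safe.
                    if fz / freq(x, db) >= min_conf:
                        found.add((x, z - x))
    return found


DB = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
)]
for lhs, rhs in sorted(rules(DB, 0.5, 0.6), key=lambda r: sorted(r[0])):
    print(sorted(lhs), "=>", sorted(rhs))
```

With MinFreq = 0.5 only the pair {beer, diapers} is frequent, so both rules built from it are candidates; raising MinConf to 0.7 keeps only beer ⇒ diapers.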

Page 14: Data Mining Algorithms for Large-Scale Distributed Systems

Large-Scale Distributed ARM

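\[ [u]_t = \{\, v \in V : v \text{ is reachable from } u \text{ at time } t \,\} \]

\[ DB_t^{[u]} = \bigcup_{v \in [u]_t} DB_t^{v} \]

\[ R_t[u] = R\!\left[ DB_t^{[u]} \right] \]

Here $DB_t^{v}$ is the local database of node $v$ at time $t$.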

Page 15: Data Mining Algorithms for Large-Scale Distributed Systems

Solution of LSD-ARM

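No termination
Anytime solution: $\tilde{R}_t[u]$, the set of rules $X \Rightarrow Y$ that node $u$ holds as its ad-hoc result at time $t$

Recall

\[ \frac{\left| R_t[u] \cap \tilde{R}_t[u] \right|}{\left| R_t[u] \right|} \]

Precision

\[ \frac{\left| R_t[u] \cap \tilde{R}_t[u] \right|}{\left| \tilde{R}_t[u] \right|} \]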
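Recall and precision of a node's ad-hoc rule set against the correct rule set are plain set ratios. A minimal sketch, with rules represented as strings and with the value 1.0 for an empty denominator chosen here as an assumption:

```python
def recall(correct, adhoc):
    """Share of the correct rules that the node has already found."""
    return len(correct & adhoc) / len(correct) if correct else 1.0


def precision(correct, adhoc):
    """Share of the node's ad-hoc rules that are actually correct."""
    return len(correct & adhoc) / len(adhoc) if adhoc else 1.0


# Hypothetical rule sets for one node at one point in time.
correct_rules = {"a=>b", "b=>a", "a=>c", "c=>d"}
adhoc_rules = {"a=>b", "b=>a", "d=>a"}

print(recall(correct_rules, adhoc_rules))     # two of four correct rules found
print(precision(correct_rules, adhoc_rules))  # two of three ad-hoc rules correct
```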

Page 16: Data Mining Algorithms for Large-Scale Distributed Systems

Majority Vote in LSD Systems

Unknown number of nodes vote 0 or 1
Nodes may dynamically change their vote
Edges are dynamically added / removed
An infrastructure
  detects failure
  ensures message integrity
  maintains a communication forest

Each node should decide if the global majority is of 0 or 1

Page 17: Data Mining Algorithms for Large-Scale Distributed Systems

Majority Vote in LSD Systems – cont.

Because of the dynamic setting, the algorithm never terminates
Instead we measure the percentage of correct outputs
In static periods that percentage ought to converge to 100%
In stationary periods we will show it converges to a different percentage
  Assume the overall percentage of ones remains the same, but the nodes that vote one keep changing

Page 18: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority Algorithm

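Nodes communicate by exchanging messages $\langle s, c \rangle$
Node $u$ maintains:
  $s^u$ – its vote, $c^u$ – one (for now)
  $\langle s^{uv}, c^{uv} \rangle$ – the last $\langle s, c \rangle$ it sent to $v$
  $\langle s^{vu}, c^{vu} \rangle$ – the last $\langle s, c \rangle$ it received from $v$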

Page 19: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority – cont.

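Node u calculates:

\[ \Delta^{u} = \Big( s^{u} + \sum_{v \in E^{u}} s^{vu} \Big) - \lambda \Big( c^{u} + \sum_{v \in E^{u}} c^{vu} \Big) \]

Captures the current knowledge of u

\[ \Delta^{uv} = \left( s^{uv} + s^{vu} \right) - \lambda \left( c^{uv} + c^{vu} \right) \]

Captures the current agreement between u and v

Here $E^{u}$ is the set of $u$'s current neighbors and $\lambda$ is the majority threshold ($\lambda = 1/2$ for a simple majority).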
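A small worked instance may help; every number below is made up for illustration. Write $\Delta^u$ for the knowledge of $u$ (its own vote and count plus everything received, weighed against the threshold $\lambda$) and $\Delta^{uv}$ for the agreement between $u$ and $v$ (what was last sent plus what was last received over that edge). Take $\lambda = 1/2$, let $u$ vote one ($s^u = 1$, $c^u = 1$), and suppose it last received $\langle 3, 4 \rangle$ from $v$ and $\langle 0, 2 \rangle$ from $w$, and last sent $\langle 1, 3 \rangle$ to $v$.

\[
\begin{aligned}
\Delta^{u} &= (1 + 3 + 0) - \tfrac{1}{2}(1 + 4 + 2) = 4 - 3.5 = 0.5 \\
\Delta^{uv} &= (1 + 3) - \tfrac{1}{2}(3 + 4) = 4 - 3.5 = 0.5
\end{aligned}
\]

The two values coincide because the message last sent to $v$ already summarized everything $u$ knows apart from $v$'s own report. If $w$ later reports $\langle 0, 4 \rangle$ instead, the knowledge of $u$ changes while the agreement with $v$ does not:

\[
\Delta^{u} = (1 + 3 + 0) - \tfrac{1}{2}(1 + 4 + 4) = 4 - 4.5 = -0.5, \qquad \Delta^{uv} = 0.5
\]

Now $v$ still assumes that $u$ supports a positive decision while $u$ no longer does, which is the situation in which $u$ has to send $v$ an update.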

Page 20: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority – Rationale

It is OK if the current knowledge of u is more extreme than what it had agreed with v
The opposite is not OK
  v might assume u supports its decision more strongly than u actually does

Tie breaking prefers a negative decision

Page 21: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority – The Protocol

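For each neighbor $v \in E^{u}$:

If

\[ c^{uv} + c^{vu} = 0 \ \text{ and } \ \Delta^{u} \ge 0 \]

or

\[ c^{uv} + c^{vu} > 0 \ \text{ and either } \ \left( \Delta^{uv} < 0 \text{ and } \Delta^{u} > \Delta^{uv} \right) \ \text{ or } \ \left( \Delta^{uv} \ge 0 \text{ and } \Delta^{u} < \Delta^{uv} \right) \]

then set

\[ s^{uv} = s^{u} + \sum_{w \ne v,\; w \in E^{u}} s^{wu}, \qquad c^{uv} = c^{u} + \sum_{w \ne v,\; w \in E^{u}} c^{wu} \]

and send $\langle s^{uv}, c^{uv} \rangle$ to $v$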

Page 22: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority – The Protocol

The same decision is applied whenever
  a message is received
  $s^u$ changes
  an edge fails or recovers
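The protocol can be tried out with a small single-process simulation. The sketch below is a reading of the slides under stated assumptions, not the authors' implementation: the topology is a static tree, messages are delivered in FIFO order from one in-process queue, the threshold is fixed at one half, a node outputs 1 when its knowledge value is non-negative, and an edge over which nothing has been exchanged triggers a message only when that value is non-negative. The names `Node`, `react`, and `run` are hypothetical.

```python
from collections import deque

LAMBDA = 0.5  # simple majority; multiples of 0.5 are exact in floating point


class Node:
    def __init__(self, vote):
        self.s, self.c = vote, 1   # own vote and own count
        self.sent = {}             # v -> last <s, c> sent to v
        self.recv = {}             # v -> last <s, c> received from v

    def add_edge(self, v):
        self.sent[v] = (0, 0)
        self.recv[v] = (0, 0)

    def delta(self):
        """Current knowledge of this node."""
        s = self.s + sum(m[0] for m in self.recv.values())
        c = self.c + sum(m[1] for m in self.recv.values())
        return s - LAMBDA * c

    def delta_edge(self, v):
        """Current agreement between this node and neighbor v."""
        (s_out, c_out), (s_in, c_in) = self.sent[v], self.recv[v]
        return (s_out + s_in) - LAMBDA * (c_out + c_in)

    def react(self):
        """Apply the send rule to every edge; return [(neighbor, message)]."""
        du, out = self.delta(), []
        for v in self.sent:
            exchanged = self.sent[v][1] + self.recv[v][1]
            duv = self.delta_edge(v)
            must_send = (exchanged == 0 and du >= 0) or (
                exchanged > 0
                and ((duv < 0 and du > duv) or (duv >= 0 and du < duv))
            )
            if must_send:
                # Everything this node knows except what v itself reported.
                s = self.s + sum(m[0] for w, m in self.recv.items() if w != v)
                c = self.c + sum(m[1] for w, m in self.recv.items() if w != v)
                self.sent[v] = (s, c)
                out.append((v, (s, c)))
        return out


def run(edges, votes):
    """Run until no message is in flight; return each node's output."""
    nodes = {u: Node(vote) for u, vote in votes.items()}
    for u, v in edges:
        nodes[u].add_edge(v)
        nodes[v].add_edge(u)
    queue = deque()
    for u, node in nodes.items():
        queue.extend((u, v, msg) for v, msg in node.react())
    while queue:
        src, dst, msg = queue.popleft()
        nodes[dst].recv[src] = msg
        queue.extend((dst, v, m) for v, m in nodes[dst].react())
    return {u: (1 if node.delta() >= 0 else 0) for u, node in nodes.items()}


path = [("a", "b"), ("b", "c")]
print(run(path, {"a": 1, "b": 1, "c": 0}))  # majority of ones
print(run(path, {"a": 0, "b": 0, "c": 1}))  # majority of zeros
```

On the three-node path every node ends up with the global majority in both runs. Vote changes and edge failures are left out of the sketch; under the rule on this slide they would trigger the same `react` step.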

Page 23: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority – Example

Page 24: Data Mining Algorithms for Large-Scale Distributed Systems
Page 25: Data Mining Algorithms for Large-Scale Distributed Systems

LSD-Majority Results

Page 26: Data Mining Algorithms for Large-Scale Distributed Systems

Proof of Correctness

Will be given in class

Page 27: Data Mining Algorithms for Large-Scale Distributed Systems

Back from Majority to ARM

To decide whether an itemset $X$ is frequent or not

set $\lambda = \mathit{MinFreq}$

set $s^{u} = \mathrm{Support}(X, DB_t^{u})$

set $c^{u} = \left| DB_t^{u} \right|$

run LSDM
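The sketch below shows why this choice of inputs answers the frequency question. It computes each node's pair of values locally and then, centrally and only for checking, evaluates the sign that the majority vote is meant to agree on. It assumes the local databases are disjoint parts of one global database; the function names and data are hypothetical, and `Fraction` is used so that the threshold comparison is exact.

```python
from fractions import Fraction


def local_vote(x, local_db):
    """(s, c) for one node: local support of x and local database size."""
    return sum(1 for t in local_db if x <= t), len(local_db)


def globally_frequent(x, partitions, min_freq):
    """Sign of sum(s) - min_freq * sum(c) over all nodes."""
    votes = [local_vote(x, db) for db in partitions]
    total_s = sum(s for s, _ in votes)
    total_c = sum(c for _, c in votes)
    return total_s - min_freq * total_c >= 0


# Three nodes, each holding a few transactions (toy data).
partitions = [
    [frozenset("ab"), frozenset("a")],
    [frozenset("b"), frozenset("ab"), frozenset("ab")],
    [frozenset("c")],
]
X = frozenset("ab")
print([local_vote(X, db) for db in partitions])
print(globally_frequent(X, partitions, Fraction(2, 5)))
print(globally_frequent(X, partitions, Fraction(3, 5)))
```

Here X appears in three of the six transactions, so it is frequent for a threshold of 2/5 and not for 3/5.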

Page 28: Data Mining Algorithms for Large-Scale Distributed Systems

Back from Majority to ARM

To decide whether a rule $X \Rightarrow Y$ is confident or not

set $\lambda = \mathit{MinConf}$

set $s^{u} = \mathrm{Support}(X \cup Y, DB_t^{u})$

set $c^{u} = \mathrm{Support}(X, DB_t^{u})$

run LSDM
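The reduction can be checked with a short derivation. It assumes that the local databases $DB_t^{u}$ are disjoint and together make up the global database $DB$, so that supports add up across nodes, and that $\mathrm{Support}(X, DB) > 0$:

\[
\begin{aligned}
\mathrm{Conf}(X \Rightarrow Y, DB) \ge \mathit{MinConf}
&\iff \frac{\mathrm{Support}(X \cup Y, DB)}{\mathrm{Support}(X, DB)} \ge \mathit{MinConf} \\
&\iff \sum_{u} \mathrm{Support}(X \cup Y, DB_t^{u}) - \mathit{MinConf} \sum_{u} \mathrm{Support}(X, DB_t^{u}) \ge 0 \\
&\iff \sum_{u} s^{u} - \lambda \sum_{u} c^{u} \ge 0
\end{aligned}
\]

The first step uses the fact that the factor $|DB|$ cancels in the ratio of the two frequencies. The last line is the weighted majority question, with each node contributing $s^{u}$ votes out of $c^{u}$.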

Page 29: Data Mining Algorithms for Large-Scale Distributed Systems

Additionally

Create candidates based on the ad-hoc solution
Create rules on-the-fly rather than upon termination

Our algorithm outputs the correct rules without specifying their global frequency and confidence

Page 30: Data Mining Algorithms for Large-Scale Distributed Systems

Eventual Results

By the time the database is scanned once, in parallel, the average node has discovered 95% of the rules, and has less than 10% false rules.