Data Mining Algorithms for Large-Scale Distributed Systems

Presenter: Ran Wolff
Joint work with Assaf Schuster
2003

What is Data Mining?

The automatic analysis of large databases
The discovery of previously unknown patterns
The generation of a model of the data

Main Data Mining Problems

Association rules (description)
    He who does this and that will usually do some other thing too
Classification (fraud, churn)
    These attributes indicate good behavior; those indicate bad behavior
Clustering (analysis)
    There are three types of entities

Examples – Classification

Customers purchase artifacts in a store
Each transaction is described in terms of a vector of features
The owner of the store tries to predict which transactions are fraudulent
    Example: young men who buy small electronics during rush hours
    Solution: do not honor checks

Examples – Associations

Amazon tracks user queries
    Suggests to each user additional books he would usually be interested in
Supermarket finds out “people who buy diapers also buy beer”
    Place diapers and beer at opposite sides of the supermarket

Examples – Clustering

Resource location
    Find the best location for k distribution centers
Feature selection
    Find 1000 concepts which summarize a whole dictionary
    Extract the meaning out of a document by replacing each word with the appropriate concept
        Car for auto, etc.

Why Mine Data of LSD Systems?

Data mining is good
It is otherwise difficult to monitor an LSD system: lots of data, spread across the system, impossible to collect
Many interesting phenomena are inherently distributed (e.g., DDoS); it is not enough to monitor just a few nodes

An Example

Peers in the Kazaa network reveal to the system which files they have on their disks in exchange for access to the files of their peers
The result is a 2M-peer database of people's recreational preferences
Mining it, you could discover that Matrix fans are also keen on Radiohead songs
    Promote RH performances in Matrix-Reloaded
    Ask RH to write the music for Matrix-IV

What is so special about this problem?

Huge systems – Huge amounts of data
Dynamic setting
    System – join / depart
    Data – constant update
Ad-hoc solution
Fast convergence

Our Work

We developed an association rule mining algorithm that works well in LSD Systems
    Local and therefore scalable
    Asynchronous and therefore fast
    Dynamic and therefore robust
    Accurate – not approximated
    Anytime – you get early results fast

In a Teaspoon

A distributed data mining algorithm can be described as a series of distributed decisions
Those decisions are reduced to a majority vote
We developed a majority voting protocol which has all those good qualities
The outcome is an LSD association rule mining algorithm (still to come: classification)

Problem Definition – Association Rule Mining (ARM)

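Items: $I = \{i_1, i_2, \ldots, i_m\}$
Itemset: $X \subseteq I$
Transaction: $T \subseteq I$
Database: $DB = \{T_1, T_2, \ldots, T_k\}$

\[ \mathrm{Support}(X, DB) = |\{T \in DB : X \subseteq T\}| \]

\[ \mathrm{Freq}(X, DB) = \frac{\mathrm{Support}(X, DB)}{|DB|} \]

\[ \mathrm{Conf}(X \Rightarrow Y, DB) = \frac{\mathrm{Freq}(X \cup Y, DB)}{\mathrm{Freq}(X, DB)} \]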
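The three measures in this problem definition can be checked on a toy database. The following is a minimal sketch, not the presenter's code: the function names `support`, `freq` and `conf` and the four-transaction database are made up for illustration, and a zero-support antecedent is treated as confidence 0 by assumption.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(x: FrozenSet[str], db: List[Transaction]) -> int:
    """Number of transactions in db that contain every item of x."""
    return sum(1 for t in db if x <= t)


def freq(x: FrozenSet[str], db: List[Transaction]) -> float:
    """Support of x divided by the size of the database."""
    return support(x, db) / len(db)


def conf(x: FrozenSet[str], y: FrozenSet[str], db: List[Transaction]) -> float:
    """Freq(X u Y) / Freq(X); returns 0.0 when X never occurs (assumption)."""
    sx = support(x, db)
    return support(x | y, db) / sx if sx else 0.0


# Illustrative toy database (hypothetical data).
db = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"milk"},
)]

diapers, beer = frozenset({"diapers"}), frozenset({"beer"})
print(support(diapers, db))      # 3 of the 4 transactions contain diapers
print(freq(diapers | beer, db))  # 2 / 4 = 0.5
print(conf(diapers, beer, db))   # 2 / 3
```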

Solution to Traditional ARM

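Let $0 \le \mathrm{MinFreq} \le 1$, $0 \le \mathrm{MinConf} \le 1$

\[ R[DB] = \{\, X \Rightarrow Y : X \cap Y = \emptyset \;\wedge\; \mathrm{Freq}(X \cup Y, DB) \ge \mathrm{MinFreq} \;\wedge\; \mathrm{Conf}(X \Rightarrow Y, DB) \ge \mathrm{MinConf} \,\} \]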
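For a small centralized database, the rule set defined on this slide can be enumerated by brute force. The sketch below only illustrates the definition; it is exponential in the number of items and is not the algorithm presented in this talk. The name `rules` and the toy data are hypothetical.

```python
from itertools import combinations


def support(x, db):
    """Number of transactions in db that contain every item of x."""
    return sum(1 for t in db if x <= t)


def rules(db, min_freq, min_conf):
    """All rules X => Y (X, Y disjoint, non-empty) meeting both thresholds."""
    items = sorted(set().union(*db))
    found = set()
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            z = frozenset(combo)
            sz = support(z, db)
            # Skip itemsets that never occur or are not frequent.
            if sz == 0 or sz / len(db) < min_freq:
                continue
            # Split the frequent itemset into antecedent X and consequent Y.
            for k in range(1, size):
                for left in combinations(combo, k):
                    x = frozenset(left)
                    if sz / support(x, db) >= min_conf:
                        found.add((x, z - x))
    return found


db = [frozenset(t) for t in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"milk"},
)]

for x, y in sorted(rules(db, min_freq=0.5, min_conf=0.6), key=str):
    print(sorted(x), "=>", sorted(y))
```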

Large-Scale Distributed ARM

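\[ [u]_t = \{\, v \in V : v \text{ is reachable from } u \text{ at time } t \,\} \]

\[ DB_t^{[u]} = \bigcup_{v \in [u]_t} DB_t^{v} \]

\[ R[u]_t = R\big[ DB_t^{[u]} \big] \]

where $DB_t^{v}$ is the database held by node $v$ at time $t$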

Solution of LSD-ARM

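No termination
Anytime solution: at any time $t$, node $u$ holds an ad hoc set of rules $\tilde{R}[u]_t$, each of the form $X \Rightarrow Y$

\[ \mathrm{Recall} = \frac{|R[u]_t \cap \tilde{R}[u]_t|}{|R[u]_t|} \]

\[ \mathrm{Precision} = \frac{|R[u]_t \cap \tilde{R}[u]_t|}{|\tilde{R}[u]_t|} \]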
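Recall and precision of a node's interim rule set against the correct rule set can be computed directly from set intersections. A minimal sketch follows; the function names and the sample rule sets are hypothetical, and returning 1.0 for an empty denominator is an assumed convention, not something stated in the talk.

```python
def recall(correct: set, adhoc: set) -> float:
    """Fraction of the correct rules that the node has already found."""
    return len(correct & adhoc) / len(correct) if correct else 1.0


def precision(correct: set, adhoc: set) -> float:
    """Fraction of the node's interim rules that are actually correct."""
    return len(correct & adhoc) / len(adhoc) if adhoc else 1.0


# Rules are written as (antecedent, consequent) pairs; the data is made up.
correct_rules = {("a", "b"), ("b", "a"), ("a", "c")}
interim_rules = {("a", "b"), ("c", "a")}

print(recall(correct_rules, interim_rules))     # 1 of 3 correct rules found
print(precision(correct_rules, interim_rules))  # 1 of 2 interim rules correct
```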

Majority Vote in LSD Systems

Unknown number of nodes vote 0 or 1
Nodes may dynamically change their vote
Edges are dynamically added / removed
An infrastructure
    detects failure
    ensures message integrity
    maintains a communication forest
Each node should decide if the global majority is of 0 or 1

Majority Vote in LSD Systems – cont.

Because of the dynamic setting, the algorithm never terminates
Instead we measure the percentage of correct outputs
In static periods that percentage ought to converge to 100%
In stationary periods we will show it converges to a different percentage
    Assume the overall percentage of ones remains the same, but they are constantly switched

LSD-Majority Algorithm

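Nodes communicate by exchanging messages $\langle s, c \rangle$
Node u maintains:
    $s^u$ – its vote, $c^u$ – one (for now)
    $\langle s^{uv}, c^{uv} \rangle$ – the last $\langle s, c \rangle$ it had sent to v
    $\langle s^{vu}, c^{vu} \rangle$ – the last $\langle s, c \rangle$ it had received from v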

LSD-Majority – cont.

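Node u calculates:

\[ \Delta^{u} = \Big( s^{u} + \sum_{v \in E^{u}} s^{vu} \Big) - \lambda \Big( c^{u} + \sum_{v \in E^{u}} c^{vu} \Big) \]

Captures the current knowledge of u

\[ \Delta^{uv} = \big( s^{uv} + s^{vu} \big) - \lambda \big( c^{uv} + c^{vu} \big) \]

Captures the current agreement between u and v

Here $E^{u}$ is the set of neighbors of u and $\lambda$ is the majority threshold ($\lambda = 1/2$ for a simple majority)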
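A small numeric check of the two quantities a node calculates, its current knowledge and its current agreement with one neighbor. This sketch assumes a threshold parameter `lam` equal to 1/2 (a simple majority), so each quantity is "sum of ones minus half the count of votes"; the function names and the message values are hypothetical.

```python
def delta_u(s_u, c_u, received, lam=0.5):
    """Knowledge of u: own vote plus everything last received from neighbors."""
    total_s = s_u + sum(s for s, _ in received.values())
    total_c = c_u + sum(c for _, c in received.values())
    return total_s - lam * total_c


def delta_uv(sent_to_v, received_from_v, lam=0.5):
    """Agreement between u and v: last message sent plus last message received."""
    (s_uv, c_uv), (s_vu, c_vu) = sent_to_v, received_from_v
    return (s_uv + s_vu) - lam * (c_uv + c_vu)


# Node u votes 1 and has received <2, 5> from v and <3, 4> from w.
received = {"v": (2, 5), "w": (3, 4)}
print(delta_u(1, 1, received))          # (1 + 2 + 3) - 0.5 * (1 + 5 + 4) = 1.0

# The last message u sent to v covered itself and w: <1 + 3, 1 + 4> = <4, 5>.
print(delta_uv((4, 5), received["v"]))  # (4 + 2) - 0.5 * (5 + 5) = 1.0
```

In this example the two values coincide because the message to v was built from u's current state, so u has nothing new to tell v.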

LSD-Majority – Rationale

It is OK if the current knowledge of u is more extreme than what it had agreed with v
The opposite is not OK
    v might assume u supports its decision more strongly than u actually does
Tie breaking prefers a negative decision

LSD-Majority – The Protocol

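For each neighbor $v \in E^{u}$:

If $c^{uv} + c^{vu} = 0$ and $\Delta^{u} > 0$
or $c^{uv} + c^{vu} > 0$ and
    either $\Delta^{uv} \le 0$ and $\Delta^{u} > \Delta^{uv}$
    or $\Delta^{uv} > 0$ and $\Delta^{u} < \Delta^{uv}$
then send $\Big\langle s^{u} + \sum_{w \in E^{u},\, w \ne v} s^{wu},\;\; c^{u} + \sum_{w \in E^{u},\, w \ne v} c^{wu} \Big\rangle$ to v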

LSD-Majority – The Protocol

The same decision is applied whenever
    a message is received
    $s^u$ changes
    an edge fails or recovers
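The send rule can be tried out on a small static tree. The following is a simplified sketch under stated assumptions, not the authors' implementation: votes and edges never change, messages are delivered reliably in FIFO order, the threshold is fixed at 1/2, and ties (a value of exactly zero) are treated as negative, in line with the tie-breaking remark earlier. The names `Node` and `run_lsd_majority` are hypothetical.

```python
from collections import deque

LAMBDA = 0.5  # assumed threshold: simple majority of ones


class Node:
    def __init__(self, vote):
        self.s, self.c = vote, 1
        self.sent = {}  # neighbor -> last <s, c> sent to it
        self.recv = {}  # neighbor -> last <s, c> received from it

    def delta(self):
        """Current knowledge of this node."""
        s = self.s + sum(m[0] for m in self.recv.values())
        c = self.c + sum(m[1] for m in self.recv.values())
        return s - LAMBDA * c

    def delta_edge(self, v):
        """Current agreement with neighbor v."""
        (s1, c1), (s2, c2) = self.sent[v], self.recv[v]
        return (s1 + s2) - LAMBDA * (c1 + c2)

    def must_send(self, v):
        d, dv = self.delta(), self.delta_edge(v)
        if self.sent[v][1] + self.recv[v][1] == 0:
            return d > 0  # nothing exchanged yet: speak up only if positive
        # Send only when the agreement overstates what this node knows.
        return (dv <= 0 and d > dv) or (dv > 0 and d < dv)

    def outgoing(self, v):
        """Own vote plus everything received from neighbors other than v."""
        s = self.s + sum(m[0] for w, m in self.recv.items() if w != v)
        c = self.c + sum(m[1] for w, m in self.recv.items() if w != v)
        return (s, c)


def run_lsd_majority(votes, edges):
    nodes = {name: Node(vote) for name, vote in votes.items()}
    for a, b in edges:
        for u, v in ((a, b), (b, a)):
            nodes[u].sent[v] = (0, 0)
            nodes[u].recv[v] = (0, 0)
    queue = deque()

    def react(u):
        for v in nodes[u].sent:
            if nodes[u].must_send(v):
                msg = nodes[u].outgoing(v)
                nodes[u].sent[v] = msg
                queue.append((u, v, msg))

    for u in nodes:
        react(u)
    while queue:  # deliver messages one at a time until the system is quiet
        u, v, msg = queue.popleft()
        nodes[v].recv[u] = msg
        react(v)
    return {name: int(node.delta() > 0) for name, node in nodes.items()}


path = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(run_lsd_majority({"a": 1, "b": 0, "c": 0, "d": 1, "e": 1}, path))
print(run_lsd_majority({"a": 1, "b": 0, "c": 0, "d": 0, "e": 1}, path))
```

In the first run three of five nodes vote 1 and every node ends up deciding 1; in the second run only two do and every node decides 0. In the second run the middle node never receives a message, which illustrates the locality of the protocol: a node stays silent as long as its neighbors' assumptions do not overstate its support.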

LSD-Majority – Example

LSD-Majority Results

Proof of Correctness

Will be given in class

Back from Majority to ARM

To decide whether an itemset is frequent or not
    set $s^{u} = \mathrm{Support}(X, DB_t^{u})$
    set $c^{u} = |DB_t^{u}|$
    set the majority threshold $\lambda = \mathrm{MinFreq}$
    run LSDM

Back from Majority to ARM

To decide whether a rule is confident or not
    set $s^{u} = \mathrm{Support}(X \cup Y, DB_t^{u})$
    set $c^{u} = \mathrm{Support}(X, DB_t^{u})$
    set the majority threshold $\lambda = \mathrm{MinConf}$
    run LSDM
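The reduction can be illustrated by computing each node's local input pair and the global quantity whose sign the vote decides, namely the sum of all s values minus the threshold times the sum of all c values. The sketch below computes that sum centrally, purely as a check on the arithmetic; a real run would let the voting protocol reach the same sign without collecting the data. The function names and the three local databases are hypothetical.

```python
def itemset_vote(x, local_db):
    """Input for the frequency question: (local support of X, local size)."""
    return sum(1 for t in local_db if x <= t), len(local_db)


def rule_vote(x, y, local_db):
    """Input for the confidence question: (support of X u Y, support of X)."""
    return (sum(1 for t in local_db if (x | y) <= t),
            sum(1 for t in local_db if x <= t))


def global_delta(votes, lam):
    """Sum of s minus lam times sum of c over all nodes."""
    return sum(s for s, _ in votes) - lam * sum(c for _, c in votes)


# Three nodes, each holding a local partition of the database (made-up data).
partitions = [
    [frozenset("ab"), frozenset("a")],
    [frozenset("ab"), frozenset("b"), frozenset("c")],
    [frozenset("abc")],
]
x, y = frozenset("a"), frozenset("b")

freq_votes = [itemset_vote(x | y, db) for db in partitions]
conf_votes = [rule_vote(x, y, db) for db in partitions]

# Global Freq({a, b}) is 3/6; with MinFreq = 0.4 the sign is positive.
print(global_delta(freq_votes, 0.4) > 0)
# Global Conf(a => b) is 3/4; with MinConf = 0.8 the sign is negative.
print(global_delta(conf_votes, 0.8) > 0)
```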

Additionally

Create candidates based on the ad-hoc solution
Create rules on-the-fly rather than upon termination
Our algorithm outputs the correct rules without specifying their global frequency and confidence

Eventual Results

By the time the database is scanned once, in parallel, the average node has discovered 95% of the rules, and has less than 10% false rules.
