edited dat intro
TRANSCRIPT
-
8/17/2019 Edited Dat Intro
1/28
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
Data Mining: Introduction
Lecture Notes for Chapter 1
Introduction to Data Mining
by
Tan, Steinbach, Kumar
-
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
- Web data, e-commerce
- Purchases at department/grocery stores
- Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
- Provide better, customized services for an edge (e.g. in Customer Relationship Management)
-
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
- Remote sensors on a satellite
- Telescopes scanning the skies
- Microarrays generating gene expression data
- Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
- in classifying and segmenting data
- in Hypothesis Formation
-
Mining Large Data Sets - Motivation
There is often information "hidden" in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999]
From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"
-
What is Data Mining?
Many Definitions
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
-
What is (not) Data Mining?
What is Data Mining?
- Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in Boston area)
- Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
- Look up phone number in phone directory
- Query a Web search engine for information about "Amazon"
-
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
-
Data Mining Tasks
Prediction Methods
- Use some variables to predict unknown or future values of other variables.
Description Methods
- Find human-interpretable patterns that describe the data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
-
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
-
Classification: Definition
Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
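The train/test procedure described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the 70/30 split, and the choice of a 1-nearest-neighbor model are hypothetical and are not taken from the lecture.

```python
import random

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and return (training_set, test_set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def nearest_neighbor_model(training_set):
    """Build a model: a function mapping attribute values to a class label.

    Assumption: 1-nearest-neighbor is used only because it is short;
    any classifier could take its place.
    """
    def predict(attributes):
        def squared_distance(record):
            return sum((a - b) ** 2 for a, b in zip(record[0], attributes))
        return min(training_set, key=squared_distance)[1]
    return predict

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical toy records: ((attribute_1, attribute_2), class)
records = [((x, y), "low" if x + y < 10 else "high")
           for x in range(10) for y in range(10)]

training_set, test_set = split_records(records)
model = nearest_neighbor_model(training_set)   # built from the training set only
print(f"test accuracy: {accuracy(model, test_set):.2f}")
```

The model never sees the test records while it is being built, which is what makes the measured accuracy an estimate for previously unseen records.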
-
Classification Example
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single
-
Classification: Application 1
Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers.
    - Type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
-
Classification: Application 2
Fraud Detection
- Goal: Predict fraudulent cases in credit card transactions.
- Approach:
  - Use credit card transactions and the information on the account holder as attributes.
    - When does a customer buy, what does he buy, how often he pays on time, etc.
  - Label past transactions as fraud or fair transactions. This forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.
-
Classification: Application 3
Customer Attrition/Churn:
- Goal: To predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes.
    - How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  - Label the customers as loyal or disloyal.
  - Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
-
Classification: Application 4
Sky Survey Cataloging
- Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
  - 3000 images with 23,040 x 23,040 pixels per image.
- Approach:
  - Segment the image.
  - Measure image attributes (features): 40 of them per object.
  - Model the class based on these features.
  - Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
-
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.
Similarity Measures:
- Euclidean Distance if attributes are continuous.
- Other Problem-specific Measures.
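As a small illustration of the Euclidean distance measure for continuous attributes, the sketch below compares distances within a cluster to distances between clusters. The two clusters and their coordinates are hypothetical, chosen only to make the contrast visible; no clustering algorithm is run here.

```python
import math
from itertools import combinations, product

def euclidean_distance(p, q):
    """Straight-line distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_intracluster_distance(cluster):
    """Average distance over all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(euclidean_distance(p, q) for p, q in pairs) / len(pairs)

def mean_intercluster_distance(cluster_a, cluster_b):
    """Average distance over all pairs with one point from each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean_distance(p, q) for p, q in pairs) / len(pairs)

# Hypothetical, already-formed clusters of 2-D points
cluster_1 = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]

print("intracluster (cluster 1):", mean_intracluster_distance(cluster_1))
print("intracluster (cluster 2):", mean_intracluster_distance(cluster_2))
print("intercluster:", mean_intercluster_distance(cluster_1, cluster_2))
```

A good clustering in the sense of this slide keeps the intracluster values small relative to the intercluster value.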
-
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
-
Clustering: Application 1
Market Segmentation:
- Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
-
Clustering: Application 2
Document Clustering:
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
- Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
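One possible way to form a similarity measure from term frequencies is sketched below. The slide does not name a specific measure; cosine similarity is an assumption made here for illustration, and the three short documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (assumed measure)."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
doc_a = term_frequencies("the team won the match")
doc_b = term_frequencies("the team lost the match")
doc_c = term_frequencies("stocks fell as markets closed")

print("a vs b:", cosine_similarity(doc_a, doc_b))
print("a vs c:", cosine_similarity(doc_a, doc_c))
```

A clustering step would then group documents whose pairwise similarity is high; real systems usually filter common words first, as the next slide notes.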
-
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).
Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
-
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
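To make the two discovered rules concrete, the sketch below scores them against the five transactions in the table. Support and confidence are the standard measures for association rules, but this slide does not define them, so treat their use here as an assumption; the function names are illustrative.

```python
# The five transactions from the table above, keyed by TID
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(sorted(antecedent), "-->", sorted(consequent),
          f"support={support(antecedent, consequent):.2f}",
          f"confidence={confidence(antecedent, consequent):.2f}")
```

For example, {Milk} --> {Coke} holds in 3 of the 5 transactions, and in 3 of the 4 transactions that contain Milk.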
-
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
- Let the rule discovered be
  {Bagels, … } --> {Potato Chips}
- Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
- Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
-
Association Rule Discovery: Application 2
Supermarket shelf management.
- Goal: To identify items that are bought together by sufficiently many customers.
- Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule:
  - If a customer buys diapers and milk, then he is very likely to buy beer.
  - So, don't be surprised if you find six-packs stacked next to diapers!
-
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: the pattern (A B) (C) (D E) annotated with timing constraints <= xg, > ng, <= ws, and <= ms]
-
Sequential Pattern Discovery: Examples
In telecommunications alarm logs,
- (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences,
- Computer Bookstore:
  (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
- Athletic Apparel Store:
  (Shoes) (Racket, Racketball) --> (Sports_Jacket)
-
Regression
Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Greatly studied in statistics and neural network fields.
Examples:
- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.
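The first example (sales as a function of advertising expenditure) can be sketched with an ordinary least-squares line, which is one instance of the linear model of dependency mentioned above. The data values and the function name `fit_line` are hypothetical and serve only to illustrate the idea.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: advertising expenditure and resulting sales
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept   # prediction for an unseen expenditure
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted_sales:.2f}")
```

A nonlinear model of dependency would replace the straight line with another function family, while the idea of fitting to known values and predicting a continuous target stays the same.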
-
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
- Credit Card Fraud Detection
- Network Intrusion Detection
Typical network traffic at University level may reach over 100 million connections per day
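A very simple way to flag a significant deviation from normal behavior is to compare a new observation with the mean and standard deviation of past normal observations. The slide does not prescribe a method; the z-score rule, the threshold of 3, and the spending figures below are all assumptions made for illustration.

```python
import statistics

def fit_baseline(normal_values):
    """Summarize normal behavior by its mean and sample standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_deviation(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) / stdev > threshold

# Hypothetical daily card spending observed during normal use
normal_daily_spend = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0]
baseline = fit_baseline(normal_daily_spend)

for amount in (43.0, 400.0):
    label = "deviation" if is_deviation(amount, baseline) else "normal"
    print(amount, label)
```

Fitting the baseline on known-normal data, rather than on data that may already contain anomalies, keeps a single extreme value from inflating the standard deviation and hiding itself.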
-
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data