edited dat intro

Upload: vishnuraju

Post on 06-Jul-2018

TRANSCRIPT

  • 8/17/2019 Edited Dat Intro

    1/28

    © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

    Data Mining: Introduction

    Lecture Notes for Chapter 1

    Introduction to Data Mining
    by
    Tan, Steinbach, Kumar


    Why Mine Data? Commercial Viewpoint

    Lots of data is being collected and warehoused
      - Web data, e-commerce
      - Purchases at department/grocery stores
      - Bank/credit card transactions

    Computers have become cheaper and more powerful

    Competitive pressure is strong
      - Provide better, customized services for an edge (e.g. in Customer Relationship Management)


    Why Mine Data? Scientific Viewpoint

    Data collected and stored at enormous speeds (GB/hour)
      - Remote sensors on a satellite
      - Telescopes scanning the skies
      - Microarrays generating gene expression data
      - Scientific simulations generating terabytes of data

    Traditional techniques infeasible for raw data

    Data mining may help scientists
      - in classifying and segmenting data
      - in hypothesis formation


    Mining Large Data Sets - Motivation

    There is often information "hidden" in the data that is not readily evident
    Human analysts may take weeks to discover useful information
    Much of the data is never analyzed at all

    [Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

    From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"


    What is Data Mining?

    Many definitions
      - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
      - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


    What is (not) Data Mining?

    What is Data Mining?
      - Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly... in Boston area)
      - Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

    What is not Data Mining?
      - Look up a phone number in a phone directory
      - Query a Web search engine for information about "Amazon"


    Origins of Data Mining

    Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

    Traditional techniques may be unsuitable due to
      - Enormity of data
      - High dimensionality of data
      - Heterogeneous, distributed nature of data

    [Figure: diagram relating Data Mining to Machine Learning/Pattern Recognition, Statistics/AI, and Database Systems.]


    Data Mining Tasks

    Prediction Methods
      - Use some variables to predict unknown or future values of other variables

    Description Methods
      - Find human-interpretable patterns that describe the data

    From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


    Data Mining Tasks...

    Classification [Predictive]
    Clustering [Descriptive]
    Association Rule Discovery [Descriptive]
    Sequential Pattern Discovery [Descriptive]
    Regression [Predictive]
    Deviation Detection [Predictive]


    Classification: Definition

    Given a collection of records (training set)
      - Each record contains a set of attributes, one of the attributes is the class

    Find a model for the class attribute as a function of the values of the other attributes

    Goal: previously unseen records should be assigned a class as accurately as possible
      - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
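The train/test procedure in this definition can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the records, the class labels "A" and "B", the 75/25 split, and the 1-nearest-neighbour model. The slides do not prescribe a particular model.

```python
import math

# Hypothetical records: ((attribute_1, attribute_2), class_label).
RECORDS = [
    ((1.0, 1.2), "A"), ((5.0, 5.2), "B"), ((0.8, 1.0), "A"), ((5.1, 4.8), "B"),
    ((1.1, 0.9), "A"), ((4.9, 5.0), "B"), ((1.2, 1.1), "A"), ((5.2, 5.1), "B"),
]

def split(records, train_fraction=0.75):
    """Divide the data set into a training set and a test set.

    Assumes the records are already in random order.
    """
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

def predict(training_set, attributes):
    """1-nearest-neighbour model: return the class of the closest training record."""
    _, nearest_class = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return nearest_class

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, cls in test_set if predict(training_set, attrs) == cls)
    return correct / len(test_set)

training_set, test_set = split(RECORDS)
print(f"accuracy on {len(test_set)} unseen records: {accuracy(training_set, test_set):.2f}")
```

The test records never take part in building the model, which is what makes the measured accuracy an estimate for previously unseen records.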


    Classification Example

    Tid   Refund   Marital Status   Taxable Income   Cheat
    1     Yes      Single           125K             No
    2     No       Married          100K             No
    3     No       Single


    Classification: Application 1

    Direct Marketing
      - Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
      - Approach:
          - Use the data for a similar product introduced before
          - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute
          - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.)
          - Use this information as input attributes to learn a classifier model

    From [Berry & Linoff] Data Mining Techniques, 1997


    Classification: Application 2

    Fraud Detection
      - Goal: Predict fraudulent cases in credit card transactions
      - Approach:
          - Use credit card transactions and the information on the account holder as attributes (when does a customer buy, what does he buy, how often he pays on time, etc.)
          - Label past transactions as fraud or fair transactions. This forms the class attribute
          - Learn a model for the class of the transactions
          - Use this model to detect fraud by observing credit card transactions on an account


    Classification: Application 3

    Customer Attrition/Churn:
      - Goal: To predict whether a customer is likely to be lost to a competitor
      - Approach:
          - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.)
          - Label the customers as loyal or disloyal
          - Find a model for loyalty

    From [Berry & Linoff] Data Mining Techniques, 1997


    Classification: Application 4

    Sky Survey Cataloging
      - Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory)
          - 3000 images with 23,040 x 23,040 pixels per image
      - Approach:
          - Segment the image
          - Measure image attributes (features): 40 of them per object
          - Model the class based on these features
          - Success story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find

    From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996


    Clustering Definition

    Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
      - Data points in one cluster are more similar to one another
      - Data points in separate clusters are less similar to one another

    Similarity Measures:
      - Euclidean distance if attributes are continuous
      - Other problem-specific measures
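As a rough illustration of clustering with Euclidean distance, the sketch below uses k-means. The slides do not name an algorithm, so k-means is a swapped-in choice; the data points, the two starting centroids, and the iteration count are likewise hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Euclidean distance is the similarity measure for continuous attributes.
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of the members (unchanged if a cluster is empty).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 3-D data points forming two visibly separate groups.
POINTS = [(1.0, 1.0, 1.0), (1.5, 2.0, 1.0), (1.0, 0.5, 1.5),
          (8.0, 8.0, 8.0), (9.0, 8.5, 8.0), (8.5, 9.0, 9.5)]

centroids, clusters = kmeans(POINTS, centroids=[POINTS[0], POINTS[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Starting centroids are passed in explicitly so the run is deterministic; a practical implementation would usually pick them at random and repeat the run several times.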


    Illustrating Clustering

    Euclidean distance based clustering in 3-D space
      - Intracluster distances are minimized
      - Intercluster distances are maximized
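One simple way to check the two criteria numerically is to compare the average distance between points inside a cluster with the average distance between points of different clusters. The sketch below assumes two hypothetical 3-D clusters; the helper names are illustrative and not taken from the slides.

```python
import math
from itertools import combinations, product

def mean_intracluster_distance(cluster):
    """Average Euclidean distance over all pairs of points within one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_intercluster_distance(cluster_a, cluster_b):
    """Average Euclidean distance over all pairs with one point from each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical clusters in 3-D space.
CLUSTER_A = [(1.0, 1.0, 1.0), (1.5, 2.0, 1.0), (1.0, 0.5, 1.5)]
CLUSTER_B = [(8.0, 8.0, 8.0), (9.0, 8.5, 8.0), (8.5, 9.0, 9.5)]

print("intra A:", mean_intracluster_distance(CLUSTER_A))
print("intra B:", mean_intracluster_distance(CLUSTER_B))
print("inter  :", mean_intercluster_distance(CLUSTER_A, CLUSTER_B))
```

A good clustering shows small intracluster values relative to the intercluster value.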


    Clustering: Application 1

    Market Segmentation:
      - Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix
      - Approach:
          - Collect different attributes of customers based on their geographical and lifestyle related information
          - Find clusters of similar customers
          - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters


    Clustering: Application 2

    Document Clustering:
      - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them
      - Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster
      - Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents
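The approach above leaves the similarity measure open. A minimal sketch, assuming cosine similarity over raw term frequencies (one common choice, not necessarily the one the authors used), with a hypothetical stop-word list and made-up documents:

```python
import math
from collections import Counter

# Hypothetical filter list; real systems use much larger stop-word lists.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def term_frequencies(text):
    """Count how often each non-stop-word term occurs in a document."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_finance_1 = term_frequencies("the stock market fell as interest rates rose")
doc_finance_2 = term_frequencies("interest rates and the stock market")
doc_sports = term_frequencies("the team won the final match")

print(cosine_similarity(doc_finance_1, doc_finance_2))
print(cosine_similarity(doc_finance_1, doc_sports))
```

Any clustering procedure that accepts a pairwise similarity can then group the documents.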


    Illustrating Document Clustering

    Clustering Points: 3204 articles of Los Angeles Times
    Similarity Measure: How many words are common in these documents (after some word filtering)

    Category        Total Articles   Correctly Placed
    Financial       555              364
    Foreign         341              260
    National        273              36
    Metro           943              746
    Sports          738              573
    Entertainment   354              278
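The table gives counts only. The short Python sketch below turns them into the fraction of articles placed correctly, per category and overall. The numbers are copied from the table; the variable names are illustrative.

```python
# (total articles, correctly placed) per category, copied from the table.
RESULTS = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total_articles = sum(total for total, _ in RESULTS.values())     # 3204
total_correct = sum(correct for _, correct in RESULTS.values())  # 2257

for category, (total, correct) in RESULTS.items():
    print(f"{category:<14} {correct / total:.1%}")
print(f"{'Overall':<14} {total_correct / total_articles:.1%}")
```

The category totals add up to the 3204 articles stated on the slide, and roughly 70% of them end up correctly placed, with the National category far below the rest.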


    Association Rule Discovery: Definition

    Given a set of records each of which contain some number of items from a given collection
      - Produce dependency rules which will predict occurrence of an item based on occurrences of other items

    TID   Items
    1     Bread, Coke, Milk
    2     Beer, Bread
    3     Beer, Coke, Diaper, Milk
    4     Beer, Bread, Diaper, Milk
    5     Coke, Diaper, Milk

    Rules Discovered:
      {Milk} --> {Coke}
      {Diaper, Milk} --> {Beer}
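The slide does not say how a rule is scored. A common convention, assumed here, is to report support (fraction of transactions containing all items of the rule) and confidence (fraction of transactions containing the antecedent that also contain the consequent). The sketch below evaluates both discovered rules on the five transactions from the table; the function names are illustrative.

```python
# The five transactions from the table.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs, rhs):.2f}", f"confidence={confidence(lhs, rhs):.2f}")
```

On this data {Milk} --> {Coke} has support 3/5 and confidence 3/4, while {Diaper, Milk} --> {Beer} has support 2/5 and confidence 2/3.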


    Association Rule Discovery: Application 1

    Marketing and Sales Promotion:
      - Let the rule discovered be {Bagels, … } --> {Potato Chips}
      - Potato Chips as consequent => Can be used to determine what should be done to boost its sales
      - Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels
      - Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips


    Association Rule Discovery: Application 2

    Supermarket shelf management
      - Goal: To identify items that are bought together by sufficiently many customers
      - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items
      - A classic rule:
          - If a customer buys diaper and milk, then he is very likely to buy beer
          - So, don't be surprised if you find six-packs stacked next to diapers
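"Bought together by sufficiently many customers" can be read as a minimum co-occurrence count. The sketch below counts item pairs across baskets and keeps those that meet a threshold. The baskets reuse the five transactions from the definition slide; the threshold of 3 and the function name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Point-of-sale baskets (the five transactions from the definition slide).
BASKETS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def frequent_pairs(baskets, min_count):
    """Return item pairs that appear together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

print(frequent_pairs(BASKETS, min_count=3))
```

Counting every pair is fine for a handful of items; real point-of-sale data needs pruning strategies to keep the number of candidate itemsets manageable.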


    Sequential Pattern Discovery: Definition

    Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events

    Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints

    [Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws, and <= ms.]
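To make the idea of a pattern occurring in a timeline concrete, the sketch below checks whether a sequence of event sets such as (A B) (C) (D E) appears, in order, in one object's timeline. It supports only an optional maximum gap between consecutive pattern elements; the other timing constraints in the figure are not modelled. The timeline data and all names are hypothetical.

```python
def contains(timeline, pattern, max_gap=None):
    """Check whether pattern occurs in timeline.

    timeline: list of (time, set_of_events), sorted by time.
    pattern:  list of sets; each set must be matched by a single timeline entry,
              and the matches must appear in increasing time order.
    max_gap:  optional upper bound on the time between consecutive matches.
    """
    def search(p_idx, start, prev_time):
        if p_idx == len(pattern):
            return True
        for i in range(start, len(timeline)):
            t, events = timeline[i]
            if prev_time is not None and max_gap is not None and t - prev_time > max_gap:
                break  # later entries are even further away in time
            # Backtracking: try every entry that covers this pattern element.
            if pattern[p_idx] <= events and search(p_idx + 1, i + 1, t):
                return True
        return False

    return search(0, 0, None)

# Hypothetical timeline for one object.
TIMELINE = [(1, {"A", "B"}), (2, {"C"}), (4, {"F"}), (10, {"D", "E"})]
PATTERN = [{"A", "B"}, {"C"}, {"D", "E"}]

print(contains(TIMELINE, PATTERN))             # no timing constraint
print(contains(TIMELINE, PATTERN, max_gap=5))  # gap between time 2 and time 10 is too large
```

Counting the objects whose timelines contain a pattern gives its support, from which rules can then be formed.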


    Sequential Pattern Discovery: Examples

    In telecommunications alarm logs,
      - (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)

    In point-of-sale transaction sequences,
      - Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
      - Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)


    Regression

    Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency

    Greatly studied in statistics, neural network fields

    Examples:
      - Predicting sales amounts of new product based on advertising expenditure
      - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
      - Time series prediction of stock market indices
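For the linear case with a single predictor, the model can be fitted by ordinary least squares. The sketch below does this for the advertising example; the expenditure and sales figures are invented for illustration, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
ADVERTISING = [1.0, 2.0, 3.0, 4.0, 5.0]
SALES = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ADVERTISING, SALES)
predicted = intercept + slope * 6.0  # predicted sales for a new expenditure level
print(f"sales ~ {intercept:.2f} + {slope:.2f} * advertising; prediction at 6.0: {predicted:.2f}")
```

A nonlinear dependency would need a different model family, but the workflow of fitting on known pairs and predicting for new inputs stays the same.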


    Deviation/Anomaly Detection

    Detect significant deviations from normal behavior

    Applications:
      - Credit card fraud detection
      - Network intrusion detection

    Typical network traffic at University level may reach over 100 million connections per day
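One of the simplest ways to flag a "significant deviation" is to summarize normal behavior by its mean and standard deviation and mark values that lie many standard deviations away. The sketch below assumes this z-score approach; the transaction amounts, the threshold of 3, and the function names are all hypothetical.

```python
import statistics

def fit_normal_profile(history):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def is_deviation(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mean, stdev = profile
    return abs(value - mean) / stdev > threshold

# Hypothetical history of normal card transaction amounts.
HISTORY = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0]
profile = fit_normal_profile(HISTORY)

for amount in (21.0, 250.0):
    print(amount, "deviation" if is_deviation(amount, profile) else "normal")
```

The profile is fitted on past behavior only, so a single extreme value cannot inflate the standard deviation and hide itself.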


    Challenges of Data Mining

    Scalability
    Dimensionality
    Complex and Heterogeneous Data
    Data Quality
    Data Ownership and Distribution
    Privacy Preservation
    Streaming Data