Dunham - Data Mining


Post on 04-Jun-2018


  • 8/13/2019 Dunham - Data Mining

    1/156


    Contents

    Preface

    Part One  Introduction

    1 Introduction
      1.1 Basic Data Mining Tasks
          1.1.1 Classification
          1.1.2 Regression
          1.1.3 Time Series Analysis
          1.1.4 Prediction
          1.1.5 Clustering
          1.1.6 Summarization
          1.1.7 Association Rules
          1.1.8 Sequence Discovery
      1.2 Data Mining Versus Knowledge Discovery in Databases
          1.2.1 The Development of Data Mining
      1.3 Data Mining Issues
      1.4 Data Mining Metrics
      1.5 Social Implications of Data Mining
      1.6 Data Mining from a Database Perspective
      1.7 The Future
      1.8 Exercises
      1.9 Bibliographic Notes

    2 Related Concepts
      2.1 Database/OLTP Systems
      2.2 Fuzzy Sets and Fuzzy Logic
      2.3 Information Retrieval
      2.4 Decision Support Systems
      2.5 Dimensional Modeling
          2.5.1 Multidimensional Schemas
          2.5.2 Indexing
      2.6 Data Warehousing
      2.7 OLAP
      2.8 Web Search Engines
      2.9 Statistics
      2.10 Machine Learning
      2.11 Pattern Matching
      2.12 Summary
      2.13 Exercises
      2.14 Bibliographic Notes


    3 Data Mining Techniques
      3.1 Introduction
      3.2 A Statistical Perspective on Data Mining
          3.2.1 Point Estimation
          3.2.2 Models Based on Summarization
          3.2.3 Bayes Theorem
          3.2.4 Hypothesis Testing
          3.2.5 Regression and Correlation
      3.3 Similarity Measures
      3.4 Decision Trees
      3.5 Neural Networks
          3.5.1 Activation Functions
      3.6 Genetic Algorithms
      3.7 Exercises
      3.8 Bibliographic Notes

    Part Two  Core Topics

    4 Classification
      4.1 Introduction
          4.1.1 Issues in Classification
      4.2 Statistical-Based Algorithms
          4.2.1 Regression
          4.2.2 Bayesian Classification
      4.3 Distance-Based Algorithms
          4.3.1 Simple Approach
          4.3.2 K Nearest Neighbors
      4.4 Decision Tree-Based Algorithms
          4.4.1 ID3
          4.4.2 C4.5 and C5.0
          4.4.3 CART
          4.4.4 Scalable DT Techniques
      4.5 Neural Network-Based Algorithms
          4.5.1 Propagation
          4.5.2 NN Supervised Learning
          4.5.3 Radial Basis Function Networks
          4.5.4 Perceptrons
      4.6 Rule-Based Algorithms
          4.6.1 Generating Rules from a DT
          4.6.2 Generating Rules from a Neural Net
          4.6.3 Generating Rules Without a DT or NN
      4.7 Combining Techniques
      4.8 Summary
      4.9 Exercises
      4.10 Bibliographic Notes

    5 Clustering
      5.1 Introduction
      5.2 Similarity and Distance Measures
      5.3 Outliers
      5.4 Hierarchical Algorithms
          5.4.1 Agglomerative Algorithms
          5.4.2 Divisive Clustering
      5.5 Partitional Algorithms
          5.5.1 Minimum Spanning Tree
          5.5.2 Squared Error Clustering Algorithm
          5.5.3 K-Means Clustering
          5.5.4 Nearest Neighbor Algorithm
          5.5.5 PAM Algorithm
          5.5.6 Bond Energy Algorithm
          5.5.7 Clustering with Genetic Algorithms
          5.5.8 Clustering with Neural Networks
      5.6 Clustering Large Databases
          5.6.1 BIRCH
          5.6.2 DBSCAN
          5.6.3 CURE Algorithm
      5.7 Clustering with Categorical Attributes
      5.8 Comparison
      5.9 Exercises
      5.10 Bibliographic Notes

    6 Association Rules
      6.1 Introduction
      6.2 Large Itemsets
      6.3 Basic Algorithms
          6.3.1 Apriori Algorithm
          6.3.2 Sampling Algorithm
          6.3.3 Partitioning
      6.4 Parallel and Distributed Algorithms
          6.4.1 Data Parallelism
          6.4.2 Task Parallelism
      6.5 Comparing Approaches
      6.6 Incremental Rules
      6.7 Advanced Association Rule Techniques
          6.7.1 Generalized Association Rules
          6.7.2 Multiple-Level Association Rules
          6.7.3 Quantitative Association Rules
          6.7.4 Using Multiple Minimum Supports
          6.7.5 Correlation Rules
      6.8 Measuring the Quality of Rules
      6.9 Exercises
      6.10 Bibliographic Notes


    Part Three  Advanced Topics

    7 Web Mining
      7.1 Introduction
      7.2 Web Content Mining
          7.2.1 Crawlers
          7.2.2 Harvest System
          7.2.3 Virtual Web View
          7.2.4 Personalization
      7.3 Web Structure Mining
          7.3.1 PageRank
          7.3.2 Clever
      7.4 Web Usage Mining
          7.4.1 Preprocessing
          7.4.2 Data Structures
          7.4.3 Pattern Discovery
          7.4.4 Pattern Analysis
      7.5 Exercises
      7.6 Bibliographic Notes

    8 Spatial Mining
      8.1 Introduction
      8.2 Spatial Data Overview
          8.2.1 Spatial Queries
          8.2.2 Spatial Data Structures
          8.2.3 Thematic Maps
          8.2.4 Image Databases
      8.3 Spatial Data Mining Primitives
      8.4 Generalization and Specialization
          8.4.1 Progressive Refinement
          8.4.2 Generalization
          8.4.3 Nearest Neighbor
          8.4.4 STING
      8.5 Spatial Rules
          8.5.1 Spatial Association Rules
      8.6 Spatial Classification Algorithm
          8.6.1 ID3 Extension
          8.6.2 Spatial Decision Tree
      8.7 Spatial Clustering Algorithms
          8.7.1 CLARANS Extensions
          8.7.2 SD(CLARANS)
          8.7.3 DBCLASD
          8.7.4 BANG
          8.7.5 WaveCluster
          8.7.6 Approximation
      8.8 Exercises
      8.9 Bibliographic Notes

    9 Temporal Mining
      9.1 Introduction
      9.2 Modeling Temporal Events
      9.3 Time Series
          9.3.1 Time Series Analysis
          9.3.2 Trend Analysis
          9.3.3 Transformation
          9.3.4 Similarity
          9.3.5 Prediction
      9.4 Pattern Detection
          9.4.1 String Matching
      9.5 Sequences
          9.5.1 AprioriAll
          9.5.2 SPADE
          9.5.3 Generalization
          9.5.4 Feature Extraction
      9.6 Temporal Association Rules
          9.6.1 Intertransaction Rules
          9.6.2 Episode Rules
          9.6.3 Trend Dependencies
          9.6.4 Sequence Association Rules
          9.6.5 Calendric Association Rules
      9.7 Exercises
      9.8 Bibliographic Notes

    Appendices

    A Data Mining Products
      A.1 Bibliographic Notes

    B Bibliography

    Index

    About the Author


    Preface

    Data doubles about every year, but useful information seems to be decreasing. The area of data mining has arisen over the last decade to address this problem. It has become not only an important research area, but also one with large potential in the real world. Current business users of data mining products achieve millions of dollars a year in savings by using data mining techniques to reduce the cost of day to day business operations. Data mining techniques are proving to be extremely useful in detecting and predicting terrorism.

    The purpose of this book is to introduce the reader to various data mining concepts and algorithms. The book is concise yet thorough in its coverage of the many data mining topics. Clearly written algorithms with accompanying pseudocode are used to describe approaches. A database perspective is used throughout. This means that I examine algorithms, data structures, data types, and complexity of algorithms and space. The emphasis is on the use of data mining concepts in real-world applications with large database components.

    Data mining research and practice is in a state similar to that of databases in the 1960s. At that time applications programmers had to create an entire database environment each time they wrote a program. With the development of the relational data model, query processing and optimization techniques, transaction management strategies, and ad hoc query languages (SQL) and interfaces, the current environment is drastically different. The evolution of data mining techniques may take a similar path over the next few decades, making data mining techniques easier to use and develop. The objective of this book is to help in this process.

    The intended audience of this book is either the experienced database professional who wishes to learn more about data mining or graduate level computer science students who have completed at least an introductory database course. The book is meant to be used as the basis of a one-semester graduate level course covering the basic data mining concepts. It may also be used as a reference book for computer professionals and researchers.

    [Figure: organization of the book. Introduction: Ch. 1 Introduction, Ch. 2 Related Concepts, Ch. 3 Data Mining Techniques. Core Topics: Ch. 4 Classification, Ch. 5 Clustering, Ch. 6 Association Rules. Advanced Topics: Ch. 7 Web Mining, Ch. 8 Spatial Mining, Ch. 9 Temporal Mining. Appendix: Data Mining Products.]


    The book is divided into four major parts: Introduction, Core Topics, Advanced Topics, and Appendix. The introduction covers background information needed to understand the later material. In addition, it examines topics related to data mining such as OLAP, data warehousing, information retrieval, and machine learning. In the first chapter of the introduction I provide a very cursory overview of data mining and how it relates to the complete KDD process. The second chapter surveys topics related to data mining. While this is not crucial to the coverage of data mining and need not be read to understand later chapters, it provides the interested reader with an understanding and appreciation of how data mining concepts relate to other areas. To thoroughly understand and appreciate the data mining algorithms presented in subsequent chapters, it is important that the reader realize that data mining is not an isolated subject. It has its basis in many related disciplines that are equally important on their own. The third chapter in this part surveys some techniques used to implement data mining algorithms. These include statistical techniques, neural networks, and decision trees. This part of the book provides the reader with an understanding of the basic data mining concepts. It also serves as a standalone survey of the entire data mining area.

    The Core Topics covered are classification, clustering, and association rules. I view these as the major data mining functions. Other data mining concepts (such as prediction, regression, and pattern matching) may be viewed as special cases of these three. In each of these chapters I concentrate on coverage of the most commonly used algorithms of each type. Our coverage includes pseudocode for these algorithms, an explanation of them, and examples illustrating their use.

    The advanced topics part looks at various concepts that complicate data mining applications. I concentrate on temporal data, spatial data, and Web mining. Again, algorithms and pseudocode are provided.

    In the appendix, production data mining systems are surveyed. I will keep a more up to date list on the Web page for the book. I thank all the representatives of the various companies who helped me correct and update my descriptions of their products.

    All chapters include exercises covering the material in that chapter. In addition to conventional types of exercises that either test the student's understanding of the material or require him to apply what he has learned, I also include some exercises that require implementation (coding) and research. A one-semester course would cover the core topics and one or more of the advanced ones.

    ACKNOWLEDGMENTS

    Many people have helped with the completion of this book. Tamer Özsu provided initial advice and inspiration. My dear friend Bob Korfhage introduced me to much of computer science, including pattern matching and information retrieval. Bob, I think of you often.

    I particularly thank my graduate students for contributing a great deal to some of the original wording and editing. Their assistance in reading and commenting on earlier drafts has been invaluable. Matt McBride helped me prepare most of the original slides, many of which are still available as a companion to the book. Yongqiao Xiao helped write much of the material in the Web mining chapter. He also meticulously reviewed an earlier draft of the book and corrected many mistakes. Le Gruenwald, Zahid Hossain, Yasemin Seydim, and Al Xiao performed much of the research that provided information found concerning association rules. Mario Nascimento introduced me to the world of temporal databases, and I have used some of the information from his dissertation in the temporal mining chapter. Nat Ayewah has been very patient with his explanations of hidden Markov models and helped improve the wording of that section. Zhigang Li has introduced me to the complex world of time series and helped write the solutions manual. I've learned a lot, but still feel a novice in many of these areas.

    The students in my CSE 8331 class (Spring 1999, Fall 2000, and Spring 2002) at SMU have had to endure a great deal. I never realized how difficult it is to clearly word algorithm descriptions and exercises until I wrote this book. I hope they learned something even though at times the continual revisions necessary were, I'm sure, frustrating. Torsten Staab wins the prize for finding and correcting the most errors. Students in my CSE 8331 class during Spring 2002 helped me prepare class notes and solutions to the exercises. I thank them for their input.

    My family has been extremely supportive in this endeavor. My husband, Jim, has been (as always) understanding and patient with my odd work hours and lack of sleep. A more patient and supportive husband could not be found. My daughter Stephanie has put up with my moodiness caused by lack of sleep. Sweetie, I hope I haven't been too short-tempered with you (YMMTYLM). At times I have been impatient with Kristina, but you know how much I love you. My Mom, sister Martha, and brother Dave as always are there to provide support and love.

    Some of the research required for this book was supported by the National Science Foundation under Grant No. IIS-9820841. I would finally like to thank the reviewers (Michael Huhns, Julia Hodges, Bob Cimikowski, Greg Speegle, Zoran Obradovic, T.Y. Lin, and James Buckly) for their many constructive comments. I tried to implement as many of these as I could.


    PART ONE

    INTRODUCTION


    CHAPTER 1

    Introduction

    1.1 BASIC DATA MINING TASKS
    1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES
    1.3 DATA MINING ISSUES
    1.4 DATA MINING METRICS
    1.5 SOCIAL IMPLICATIONS OF DATA MINING
    1.6 DATA MINING FROM A DATABASE PERSPECTIVE
    1.7 THE FUTURE
    1.8 EXERCISES
    1.9 BIBLIOGRAPHIC NOTES

    The amount of data kept in computer files and databases is growing at a phenomenal rate. At the same time, the users of these data are expecting more sophisticated information from them. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers' past purchases as well as predictions of future purchases. Simple structured/query language queries are not adequate to support these increased demands for information. Data mining steps in to solve these needs. Data mining is often defined as finding hidden information in a database. Alternatively, it has been called exploratory data analysis, data driven discovery, and deductive learning.

    Traditional database queries (Figure 1.1) access a database using a well-defined query stated in a language such as SQL. The output of the query consists of the data from the database that satisfies the query. The output is usually a subset of the database, but it may also be an extracted view or may contain aggregations. Data mining access of a database differs from this traditional access in several ways:

    • Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what he wants to see.

    • Data: The data accessed is usually a different version from that of the original operational database. The data have been cleansed and modified to better support the mining process.

    • Output: The output of the data mining query probably is not a subset of the database. Instead it is the output of some analysis of the contents of the database.
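The contrast between the two kinds of access can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `purchases` table, its rows, and the "most frequent companion of bread" analysis are all hypothetical and are not taken from the book. The first result is a subset of the stored rows, while the second is the outcome of an analysis and does not appear anywhere in the database.

```python
import sqlite3
from collections import Counter

# Hypothetical toy database (table name and rows are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", "bread"), ("ann", "jelly"), ("bob", "bread"),
     ("bob", "pretzels"), ("cy", "bread"), ("cy", "jelly")],
)

# Traditional access: a precisely stated query whose output is a subset of the rows.
bread_rows = conn.execute(
    "SELECT customer, item FROM purchases WHERE item = 'bread' ORDER BY customer"
).fetchall()

# Mining-style access: an analysis of the contents; the result is not a stored row.
items_by_customer = {}
for customer, item in conn.execute("SELECT customer, item FROM purchases"):
    items_by_customer.setdefault(customer, set()).add(item)

companions = Counter()
for items in items_by_customer.values():
    if "bread" in items:
        companions.update(items - {"bread"})

top_companion = companions.most_common(1)[0]
print(bread_rows)
print(top_companion)
```

Here the analysis step is deliberately trivial; real mining algorithms replace the counting loop with the techniques covered in later chapters.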

    The current state of the art of data mining is similar to that of database query processing in the late 1960s and early 1970s. Over the next decade there undoubtedly will be great strides in extending the state of the art with respect to data mining. We probably will see the development of "query processing" models, standards, and algorithms targeting the data mining applications. We probably will also see new data structures designed for the storage of databases being used for data mining applications. Although data mining is currently in its infancy, over the last decade we have seen a proliferation of mining algorithms, applications, and algorithmic approaches. Example 1.1 illustrates one such application.

    FIGURE 1.1: Database access.

    EXAMPLE 1.1

    Credit card companies must determine whether to authorize credit card purchases. Suppose that based on past historical information about purchases, each purchase is placed into one of four classes: (1) authorize, (2) ask for further identification before authorization, (3) do not authorize, and (4) do not authorize but contact police. The data mining functions here are twofold. First the historical data must be examined to determine how the data fit into the four classes. Then the problem is to apply this model to each new purchase. Although the second part indeed may be stated as a simple database query, the first part cannot be.

    Data mining involves many different algorithms to accomplish different tasks. All of these algorithms attempt to fit a model to the data. The algorithms examine the data and determine a model that is closest to the characteristics of the data being examined. Data mining algorithms can be characterized as consisting of three parts:

    • Model: The purpose of the algorithm is to fit a model to the data.

    • Preference: Some criteria must be used to fit one model over another.

    • Search: All algorithms require some technique to search the data.

    In Example 1.1 the data are modeled as divided into four classes. The search requires examining past data about credit card purchases and their outcome to determine what criteria should be used to define the class structure. The preference will be given to criteria that seem to fit the data best. For example, we probably would want to authorize a credit card purchase for a small amount of money with a credit card belonging to a long-standing customer. Conversely, we would not want to authorize the use of a credit card to purchase anything if the card has been reported as stolen. The search process requires that the criteria needed to fit the data to the classes be properly defined.
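The three parts can be made concrete with a deliberately small sketch. It is an illustration, not the book's method: it assumes only two of the four classes (authorize or do not authorize), a single attribute (the purchase amount), and an invented purchase history. The model is a cutoff on the amount, the preference is the fraction of past purchases the cutoff classifies correctly, and the search tries every observed amount as a candidate cutoff.

```python
# Hypothetical history: (purchase amount, was the purchase legitimate?)
history = [
    (20, True), (35, True), (50, True), (80, True),
    (120, False), (200, False), (60, True), (150, False),
]

def accuracy(threshold, data):
    """Preference: fraction of past purchases the cutoff classifies correctly."""
    correct = sum((amount <= threshold) == legitimate for amount, legitimate in data)
    return correct / len(data)

def fit_threshold(data):
    """Search: try each observed amount as the cutoff and keep the best one."""
    candidates = sorted({amount for amount, _ in data})
    return max(candidates, key=lambda t: accuracy(t, data))

# Model: "authorize if the amount does not exceed the fitted threshold".
threshold = fit_threshold(history)

def authorize(amount):
    return amount <= threshold

print(threshold, authorize(40), authorize(130))
```

Fitting the threshold requires examining the history, whereas applying it to a new purchase is a simple comparison. That mirrors the remark in Example 1.1 that only the second step could be stated as an ordinary query.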

    As seen in Figure 1.2, the model that is created can be either predictive or descriptive in nature. In this figure, we show under each model type some of the most common data mining tasks that use that type of model.

    FIGURE 1.2: Data mining models and tasks. Predictive: classification, regression, time series analysis, prediction. Descriptive: clustering, summarization, association rules, sequence discovery.

    A predictive model makes a prediction about values of data using known results found from different data. Predictive modeling may be made based on the use of other historical data. For example, a credit card use might be refused not because of the user's own credit history, but because the current purchase is similar to earlier purchases that were subsequently found to be made with stolen cards. Example 1.1 uses predictive modeling to predict the credit risk. Predictive model data mining tasks include classification, regression, time series analysis, and prediction. Prediction may also be used to indicate a specific type of data mining function, as is explained in section 1.1.4.

    A descriptive model identifies patterns or relationships in data. Unlike the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature.

    1.1 BASIC DATA MINING TASKS

    In the following paragraphs we briefly explore some of the data mining functions. We follow the basic outline of tasks shown in Figure 1.2. This list is not intended to be exhaustive, but rather illustrative. Of course, these individual tasks may be combined to obtain more sophisticated data mining applications.

    1.1.1 Classification

    Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. Two examples of classification applications are determining whether to make a bank loan and identifying credit risks. Classification algorithms require that the classes be defined based on data attribute values. They often describe these classes by looking at the characteristics of data already known to belong to the classes. Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to these predefined classes. Example 1.1 illustrates a general classification problem. Example 1.2 shows a simple example of pattern recognition.

    EXAMPLE 1.2

    An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, shape of head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.

    1.1.2 Regression

    Regression is used to map a data item to a real valued prediction variable. In actuality, regression involves the learning of the function that does this mapping. Regression assumes that the target data fit into some known type of function (e.g., linear, logistic, etc.) and then determines the best function of this type that models the given data. Some type of error analysis is used to determine which function is "best." Standard linear regression, as illustrated in Example 1.3, is a simple example of regression.

    EXAMPLE 1.3

    A college professor wishes to reach a certain level of savings before her retirement. Periodically, she predicts what her retirement savings will be based on its current value and several past values. She uses a simple linear regression formula to predict this value by fitting past behavior to a linear function and then using this function to predict the values at points in the future. Based on these values, she then alters her investment portfolio.
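A minimal sketch of the calculation in Example 1.3, assuming ordinary least squares as the error criterion and using invented savings figures (the example itself gives no numbers). The line y = slope * x + intercept is fitted to past values and then evaluated at a future point.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years since the first observation, savings in thousands.
years = [0, 1, 2, 3, 4]
savings = [100.0, 112.0, 119.0, 131.0, 138.0]

slope, intercept = fit_line(years, savings)
predicted_at_year_10 = slope * 10 + intercept
print(slope, intercept, predicted_at_year_10)
```

With these assumed values the fitted line grows by 9.5 per year from an intercept of 101, so the projection for year 10 is 196. Whether a straight line is an appropriate function type is exactly the assumption that regression makes up front.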

    1.1.3 Time Series Analysis

    With time series analysis, the value of an attribute is examined as it varies over time. The values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.). A time series plot (Figure 1.3) is used to visualize the time series. In this figure you can easily see that the plots for Y and Z have similar behavior, while X appears to have less volatility. There are three basic functions performed in time series analysis. In one case, distance measures are used to determine the similarity between different time series. In the second case, the structure of the line is examined to determine (and perhaps classify) its behavior. A third application would be to use the historical time series plot to predict future values. A time series example is given in Example 1.4.

    EXAMPLE 1.4

    Mr. Smith is trying to determine whether to purchase stock from Companies X, Y, or Z. For a period of one month he charts the daily stock price for each company. Figure 1.3 shows the time series plot that Mr. Smith has generated. Using this and similar information available from his stockbroker, Mr. Smith decides to purchase stock X because it is less volatile while overall showing a slightly larger relative amount of growth than either of the other stocks. As a matter of fact, the stocks for Y and Z have a similar behavior. The behavior of Y between days 6 and 20 is identical to that for Z over a period beginning on day 13.
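The comparison in Example 1.4 can be sketched with two simple measures. Both are assumptions chosen for illustration, since the text does not fix a particular formula: volatility is taken as the standard deviation of day-to-day price changes, and similarity between two series is measured by Euclidean distance. The three price series are invented.

```python
import math
from statistics import pstdev

def daily_changes(prices):
    return [later - earlier for earlier, later in zip(prices, prices[1:])]

def volatility(prices):
    """Assumed measure: standard deviation of the day-to-day changes."""
    return pstdev(daily_changes(prices))

def euclidean(a, b):
    """Assumed distance measure between two equally long series."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical daily prices for the three stocks.
x = [10, 10.5, 11, 11.5, 12, 12.5]   # steady growth
y = [10, 13, 9, 14, 8, 13]           # large swings
z = [11, 14, 10, 15, 9, 14]          # same swings as y, shifted up by 1

print(volatility(x), volatility(y), volatility(z))
print(euclidean(y, z), euclidean(x, y))
```

Under these assumptions X shows no variation in its daily change, Y and Z show identical volatility, and the distance between Y and Z is smaller than the distance between X and Y, which matches the qualitative reading of the plot.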

    FIGURE 1.3: Time series plots (series X, Y, and Z).

    1.1.4 Prediction

    Many real-world data mining applications can be seen as predicting future data states based on past and current data. Prediction can be viewed as a type of classification. (Note: This is a data mining task that is different from the prediction model, although the prediction task is a type of prediction model.) The difference is that prediction is predicting a future state rather than a current state. Here we are referring to a type of application rather than to a type of data mining modeling approach, as discussed earlier. Prediction applications include flooding, speech recognition, machine learning, and pattern recognition. Although future values may be predicted using time series analysis or regression techniques, other approaches may be used as well. Example 1.5 illustrates the process.

    EXAMPLE 1.5

    Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, and so on. Then the water level at a potential flooding point in the river can be predicted based on the data collected by the sensors upriver from this point. The prediction must be made with respect to the time the data were collected.

    1.1.5 Clustering

    Clustering is similar to classification except that the groups are not predefined, but rather defined by the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be thought of as partitioning or segmenting the data into groups that might or might not be disjointed. The clustering is usually accomplished by determining the similarity among the data on predefined attributes. The most similar data are grouped into clusters. Example 1.6 provides a simple clustering example. Since the clusters are not predefined, a domain expert is often required to interpret the meaning of the created clusters.


    EXAMPLE 1.6

    A certain national department store chain creates special catalogs targeted to various demographic groups based on attributes such as income, location, and physical characteristics of potential customers (age, height, weight, etc.). To determine the target mailings of the various catalogs and to assist in the creation of new, more specific catalogs, the company performs a clustering of potential customers based on the determined attribute values. The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.
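As a rough sketch of what such a clustering exercise might look like, the code below groups customers on a single attribute (income) with a basic K-means style loop. The choice of K-means, the income values, and the starting centers are all assumptions made for illustration; Example 1.6 does not say which algorithm the company uses.

```python
def kmeans_1d(values, centers, rounds=10):
    """Assign each value to its nearest center, then move each center to its cluster mean."""
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical annual incomes in thousands.
incomes = [22, 25, 27, 58, 61, 64, 118, 125]
centers, clusters = kmeans_1d(incomes, centers=[20, 60, 120])
print(clusters)
print(centers)
```

The groups are produced by the data alone; deciding that the three clusters correspond to, say, budget, mid-range, and premium catalogs is the interpretation step the text leaves to a domain expert.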

    A special type of clustering is called segmentation. With segmentation a database is partitioned into disjointed groupings of similar tuples called segments. Segmentation is often viewed as being identical to clustering. In other circles segmentation is viewed as a specific type of clustering applied to a database itself. In this text we use the two terms, clustering and segmentation, interchangeably.

    1.1.6 Summarization

    Summarization maps data into subsets with associated simple descriptions. Summarization is also called characterization or generalization. It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data. Alternatively, summary type information (such as the mean of some numeric attribute) can be derived from the data. The summarization succinctly characterizes the contents of the database. Example 1.7 illustrates this process.

    EXAMPLE 1.7

    One of the many criteria used to compare universities by the U.S. News & World Report is the average SAT or ACT score [GM]. This is a summarization used to estimate the type and intellectual level of the student body.

    1.1.7 Association Rules

    Link analysis, alternatively referred to as affinity analysis or association, refers to the data mining task of uncovering relationships among data. The best example of this type of application is to determine association rules. An association rule is a model that identifies specific types of data associations. These associations are often used in the retail sales community to identify items that are frequently purchased together. Example 1.8 illustrates the use of association rules in market basket analysis. Here the data analyzed consist of information about what items a customer purchases. Associations are also used in many other applications such as predicting the failure of telecommunication switches.

    EXAMPLE 1.8

    A grocery store retailer is trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold so are pretzels and that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the association between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either of these items on sale at the same time.

    Users of association rules must be cautioned that these are not causal relationships. They do not represent any relationship inherent in the actual data (as is true with functional dependencies) or in the real world. There probably is no relationship between bread and pretzels that causes them to be purchased together. And there is no guarantee that this association will apply in the future. However, association rules can be used to assist retail store management in effective advertising, marketing, and inventory control.
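The percentages quoted in Example 1.8 are what is usually called the confidence of a rule: of the baskets that contain bread, the fraction that also contain the other item. The sketch below computes it over an invented set of baskets constructed so that the bread rules come out at 60% and 70%; the baskets and the helper names are illustrative assumptions.

```python
# Hypothetical market baskets: 10 contain bread, 2 do not.
baskets = (
    [{"bread", "pretzels", "jelly"}] * 3
    + [{"bread", "pretzels"}] * 3
    + [{"bread", "jelly"}] * 4
    + [{"milk"}, {"pretzels", "milk"}]
)

def count(itemset, data):
    """Number of baskets containing every item in itemset."""
    return sum(itemset <= basket for basket in data)

def support(itemset, data):
    """Fraction of all baskets that contain itemset."""
    return count(itemset, data) / len(data)

def confidence(lhs, rhs, data):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs, data) / count(lhs, data)

print(confidence({"bread"}, {"pretzels"}, baskets))
print(confidence({"bread"}, {"jelly"}, baskets))
print(support({"bread"}, baskets))
```

These numbers only describe co-occurrence in the recorded baskets. As the caution above says, they do not show that buying bread causes anyone to buy pretzels or jelly.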

    1.1.8 Sequence Discovery

    Sequential analysis or sequence discovery is used to determine sequential patterns in data. These patterns are based on a time sequence of actions. These patterns are similar to associations in that data (or events) are found to be related, but the relationship is based on time. Unlike a market basket analysis, which requires the items to be purchased at the same time, in sequence discovery the items are purchased over time in some order. Example 1.9 illustrates the discovery of some simple patterns. A similar type of discovery can be seen in the sequence within which data are purchased. For example, most people who purchase CD players may be found to purchase CDs within one week. As we will see, temporal association rules really fall into this category.

    EXAMPLE 1.9

    The Webmaster at the XYZ Corp. periodically analyzes the Web log data to determine how users of the XYZ's Web pages access them. He is interested in determining what sequences of pages are frequently accessed. He determines that 70 percent of the users of page A follow one of the following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C). He then determines to add a link directly from page A to page C.
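A small sketch of the kind of analysis described in Example 1.9, assuming the Web log has already been split into per-user page sequences. The sessions below are invented so that 7 of the 10 visits that start at page A later reach B and then C, with other pages possibly in between.

```python
from collections import Counter

# Hypothetical sessions reconstructed from a Web log.
sessions = [
    ("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C"),
    ("A", "B", "C"), ("A", "F"), ("A", "B", "C"),
    ("A", "D", "B", "C"), ("A", "G"), ("A", "E", "B", "C"), ("A", "H"),
]

def is_subsequence(pattern, session):
    """True if the pages of pattern occur in session in the same order (gaps allowed)."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)

# How often each exact path occurs.
path_counts = Counter(sessions)

# Share of sessions that go from A to B to C in that order.
share = sum(is_subsequence(("A", "B", "C"), s) for s in sessions) / len(sessions)

print(path_counts.most_common(3))
print(share)
```

Unlike the market basket case, the order of the pages matters here: a session that visits C before A would not count toward the pattern.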

    1.2 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

    The terms knowledge discovery in databases (KDD) and data mining are often used interchangeably. In fact, there have been many other names given to this process of discovering useful (hidden) patterns in data: knowledge extraction, information discovery, exploratory data analysis, information harvesting, and unsupervised pattern recognition. Over the last few years KDD has been used to refer to a process consisting of many steps, while data mining is only one of these steps. This is the approach taken in this book. The following definitions are modified from those found in [FPSS96c, FPSS96a].

    DEFINITION 1.1. Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data.

    DEFINITION 1.2. Data mining is the use of algorithms to extract the information and patterns derived by the KDD process.


    FIGURE 1.5: Historical perspective of data mining (databases, algorithms, information retrieval, statistics, machine learning).

    1.2.1 The Development of Data Mining

    The current evolution of data mining functions and products is the result of years of influence from many disciplines, including databases, information retrieval, statistics, algorithms, and machine learning (Figure 1.5). Another computer science area that has had a major impact on the KDD process is multimedia and graphics. A major goal of KDD is to be able to describe the results of the KDD process in a meaningful manner. Because many different results are often produced, this is a nontrivial problem. Visualization techniques often involve sophisticated multimedia and graphics presentations. In addition, data mining techniques can be applied to multimedia applications.

    Unlike previous research in these disparate areas, a major trend in the database community is to combine results from these seemingly different disciplines into one unifying data or algorithmic approach. Although in its infancy, the ultimate goal of this evolution is to develop a "big picture" view of the area that will facilitate integration of the various types of applications into real-world user domains.

    Table 1.1 shows developments in the areas of artificial intelligence (AI), information retrieval (IR), databases (DB), and statistics (Stat) leading to the current view of data mining. These different historical influences, which have led to the development of the total data mining area, have given rise to different views of what data mining functions actually are [RG99]:

    Induction is used to proceedfromvery specic knowedge to more generainfor

    maion This type o technique is ofen found in AI appicaions

    Because he primary obective of data mning is o descrbe some characerisics

    of as etof databy a generam odehis approach canb eviewed as aypeofcom-

    pression . Herethe deailed data wihin hedatabase are absractedandcompressed

    to a smaler descripion of the daa characeristics that are foundin e mode.

    Ass tatedearier he dataminingprocessitsefcanbeviewedas atypeof queing

    he underlying daabase Indeed an ongoing direcion of daa ming research s

    Sectio 1.2 Data Miig Vrsus Kowledge iscovey i Databases 3

    TABLE: Time Line of Data Mining Development

    Time         Area  Contribution                                       Reference
    Late 1700s   Stat  Bayes theorem of probability                       [Bay63]
    Early 1900s  Stat  Regression analysis
    Early 1920s  Stat  Maximum likelihood estimate                        [Fis21]
    Early 1940s  AI    Neural networks                                    [MP43]
    Early 1950s        Nearest neighbor                                   [FJ51]
    Early 1950s        Single link                                        [FLP+51]
    Late 1950s   AI    Perceptron                                         [Ros58]
    Late 1950s   Stat  Resampling, bias reduction, jackknife estimator
    Early 1960s  AI    ML started                                         [FF63]
    Early 1960s  DB    Batch reports
    Mid 1960s          Decision trees                                     [HMS66]
    Mid 1960s    Stat  Linear models for classification                   [Nil65]
                 IR    Similarity measures
                 IR    Clustering
                 Stat  Exploratory data analysis (EDA)
    Late 1960s   DB    Relational data model                              [Cod70]
    Early 1970s  IR    SMART IR systems                                   [Sal71]
    Mid 1970s    AI    Genetic algorithms                                 [Hol75]
    Late 1970s   Stat  Estimation with incomplete data (EM algorithm)     [DLR77]
    Late 1970s   Stat  K-means clustering
    Early 1980s  AI    Kohonen self-organizing map                        [Koh82]
    Mid 1980s    AI    Decision tree algorithms                           [Qui86]
    Early 1990s  DB    Association rule algorithms
                       Web and search engines
    1990s        DB    Data warehousing
    1990s        DB    Online analytic processing (OLAP)

    how to define a data mining query and whether a query language (like SQL) can be developed to capture the many different types of data mining queries.

    • Describing a large database can be viewed as using approximation to help uncover hidden information about the data.

    • When dealing with large databases, the impact of size and efficiency of developing an abstract model can be thought of as a type of search problem.

    It is interesting to think about the various data mining problems and how each may be viewed from several different perspectives based on the viewpoint and background of the researcher or developer. We mention these different perspectives only to give the reader the full picture of data mining. Often, due to the varied backgrounds of the data mining participants, we find that the same problems (and perhaps even the same solutions) are described differently. Indeed, different terminologies can lead to misunderstandings and


    several tutorials surveying data mining concepts [Agr94], [Agr95], [Han96], and [RS99]. A recent tutorial [Kei97] provided a thorough survey of visualization techniques as well as a comprehensive bibliography.

    The aspect of parallel and distributed data mining has become an important research topic. A workshop on Large-Scale Parallel KDD Systems was held in 1999 [ZH].

    The idea of developing an approach to unifying all data mining activities has been proposed in [FPSS96b], [Man96], and [Man97]. The term KDDMS was first proposed in [IM96]. A recent unified model and algebra that supports all major data mining tasks has been proposed [JN]. The 3W model views data as being divided into three dimensions. An algebra, called the dimension algebra, has been proposed to access this three-dimensional world. DMQL was developed at Simon Fraser University [HFW+96].

    There are several KDD and data mining resources. The ACM (Association for Computing Machinery) has a special interest group, SIGKDD, devoted to the promotion and dissemination of KDD information. SIGKDD Explorations is a free newsletter produced by ACM SIGKDD. The ACM SIGKDD home page contains a wealth of resources concerning KDD and data mining (www.acm.org/sigkdd).

    A vendor-led group, Data Mining Group (DMG), is active in the development of data mining standards. Information about DMG can be found at www.dmg.org. The ISO/IEC standards group has created a final committee draft for an SQL standard including data mining extensions [Com]. In addition, a project begun by a consortium of data mining vendors and users resulted in the development of the data mining process model, CRISP-DM (see: www.crisp-dm.org).

    There currently are several research journals related to data mining. These include IEEE Transactions on Knowledge and Data Engineering, published by IEEE Computer Society, and Data Mining and Knowledge Discovery from Kluwer Academic Publishers. International KDD conferences include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the Conference on Information and Knowledge Management (CIKM), the IEEE International Conference on Data Mining (ICDM), the European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). KDnuggets News is an e-mail newsletter that is produced biweekly. It contains a wealth of KDD and data mining information for practitioners, users, and researchers. Subscriptions are free at www.kdnuggets.com. Additional KDD resources can be found at Knowledge Discovery Central (www.kdcentral.com).

    C H A P T E R 2

    Related Concepts

    2.1 DATABASE/OLTP SYSTEMS
    2.2 FUZZY SETS AND FUZZY LOGIC
    2.3 INFORMATION RETRIEVAL
    2.4 DECISION SUPPORT SYSTEMS
    2.5 DIMENSIONAL MODELING
    2.6 DATA WAREHOUSING
    2.7 OLAP
    2.8 WEB SEARCH ENGINES
    2.9 STATISTICS
    2.10 MACHINE LEARNING
    2.11 PATTERN MATCHING
    2.12 SUMMARY
    2.13 EXERCISES
    2.14 BIBLIOGRAPHIC NOTES

    Data mining applications have existed for thousands of years. For example, the classification of plants as edible or nonedible is a data mining task. The development of the data mining discipline has its roots in many other areas. In this chapter we examine many concepts related to data mining. We briefly introduce each concept and indicate how it is related to data mining.

    2.1 DATABASE/OLTP SYSTEMS

    A database is a collection of data usually associated with some organization or enterprise. Unlike a simple set, data in a database are usually viewed to have a particular structure or schema with which it is associated. For example, (ID, Name, Address, Salary, JobNo) may be the schema for a personnel database. Here the schema indicates that each record (or tuple) in the database has a value for each of these five attributes. Unlike a file, a database is independent of the physical method used to store it on disk (or other media). It also is independent of the applications that access it. A database management system (DBMS) is the software used to access a database.

    Data stored in a database are often viewed in a more abstract manner or data model. This data model is used to describe the data, attributes, and relationships among them. A data model is independent of the particular DBMS used to implement and access the database. In effect, it can be viewed as a documentation and communication tool to convey the type and structure of the actual data. A common data model is the


    FIGURE 2.1: ER model example.

    ER (entity-relationship) data model. Although originally proposed in 1976, the ER data model is still used today with many extensions and improvements to the first design. Example 2.1 illustrates the use of an ER model with an associated ER diagram seen in Figure 2.1. The basic components of an ER model are the entities and the relationships. An entity is associated with a real-world object and has a key that uniquely identifies it. A relationship is used to describe an association that exists between entities.

    EXAMPLE 2.1

    An employee database consists of employees and information concerning the jobs that they perform. An entity would be an Employee and the key could be the ID. Similarly, different jobs can be associated with a job number so that we can think of the Job as an entity with key JobNo. In Figure 2.1, there is a rectangle for each entity. The diamond is used to represent the relationship between the two entities. Here the relationship HasJob indicates that a specific employee with key ID has a particular job with key JobNo. The attributes associated with Employee are {ID, Name, Address, Salary} and the attributes for Job are {JobNo, JobDesc, PayRange}.

    The ER model is often used to abstractly view the data independent of DBMS. DBMS systems often view the data in a structure more like a table. This gives rise to the relational model, where data are viewed as being composed of relations. Taking a mathematical perspective, a relation is a subset of a Cartesian product. Imagine looking at the domain, or set of values, associated with each attribute in the Employee example. A relation R could then be viewed as a subset of the product of the domains:

    $R \subseteq \mathrm{dom}(ID) \times \mathrm{dom}(Name) \times \mathrm{dom}(Address) \times \mathrm{dom}(Salary) \times \mathrm{dom}(JobNo)$    (2.1)

    Access to a relation can be performed based on operations in the traditional set algebra such as union and intersection. This extended group of set operations is referred to as relational algebra. An equivalent set based on first-order predicate calculus is called relational calculus. Access to databases is usually achieved via a query language. This query language may be based on relational algebra or calculus. Although many query languages have been proposed, the standard language used by most DBMSs is SQL.

        SELECT Name
        FROM R
        WHERE Salary > 100000

    FIGURE 2.2: SQL example.

    Figure 2.2 shows a sample SQL statement issued against the relation R, which lists the names of all employees with a salary greater than $100,000.
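
The query in Figure 2.2 can be tried out with a small sketch using Python's built-in `sqlite3` module. The rows below are invented purely for illustration, and the helper name `high_earners` is hypothetical; only the relation name R, its attributes, and the salary predicate come from the text.

```python
import sqlite3

# Hypothetical sample tuples for R(ID, Name, Address, Salary, JobNo).
# These values are made up for illustration only.
ROWS = [
    (1, "Ann", "1 Elm St", 120000, 10),
    (2, "Bob", "2 Oak St", 95000, 20),
    (3, "Cho", "3 Pine St", 100000, 10),
    (4, "Dee", "4 Ash St", 150000, 30),
]


def high_earners(rows, threshold=100000):
    """Return the names of employees whose Salary exceeds the threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE R (ID INTEGER PRIMARY KEY, Name TEXT, "
        "Address TEXT, Salary INTEGER, JobNo INTEGER)"
    )
    conn.executemany("INSERT INTO R VALUES (?, ?, ?, ?, ?)", rows)
    # Same statement as Figure 2.2, with the constant passed as a parameter.
    cursor = conn.execute(
        "SELECT Name FROM R WHERE Salary > ? ORDER BY ID", (threshold,)
    )
    names = [name for (name,) in cursor.fetchall()]
    conn.close()
    return names


print(high_earners(ROWS))
```

The predicate is strict, so a tuple with a salary of exactly 100,000 is not returned. The result is a well-defined subset of the relation, which is the property the surrounding text contrasts with data mining queries.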

    Users' expectations for queries have increased, as have the amount and sophistication of the associated data. In the early days of database (DB) and online transaction processing (OLTP) systems, simple selects were enough. Now queries are complex, involving data distributed over many sites, and they use complicated functions such as joins, aggregates, and views. Traditional database queries usually involve retrieving data from a database based on a well-defined query. As shown in Figure 2.2, a user may ask to find all employees who earn over $100,000. This could be viewed as a type of classification application as we segment the database into two classes: those who have salaries satisfying the predicate and those who do not. A simple database application is not thought of as a data mining task, however, because the queries are well defined with precise results. Data mining applications, conversely, are often vaguely defined with imprecise results. Users might not even be able to precisely define what they want, let alone be able to tell if the results of their request are accurate. A database user usually can tell if the results of his query are not correct. Thus, it is usually assumed that a DBMS returns the correct results for a query. Metrics (instead of quality) often include such things as response time and throughput.

    When viewed as a query system, data mining queries extend database concepts. Data mining problems are often ill-posed with many different solutions. Judging the effectiveness of the result of a data mining request is often difficult. A major difference between data mining queries and those of database systems is the output. Basic database queries always output either a subset of the database or aggregates of the data. A data mining query outputs a KDD object. A KDD object is either a rule, a classification, or a cluster. These objects do not exist before executing the query, and they are not part of the database being queried. Aggregation operators have existed in SQL for years. They do not return objects existing in the database, but return a model of the data. For example, an average operator returns the average of a set of attribute values rather than the values themselves. This is a simple type of data mining operator.

    2.2 FUZZY SETS AND FUZZY LOGIC

    A set is normally thought of as a collection of objects. It can be defined by enumerating the set

    $F = \{1, 2, 3, 4, 5\}$    (2.2)

    or by indicating the set membership requirement

    $F = \{x \mid x \in Z^{+} \text{ and } x \le 5\}$    (2.3)

    A fuzzy set is a set, F, in which the set membership function, f, is a real valued (as opposed to boolean) function with output in the range [0, 1]. An element x is said


    to belong to F with probability f(x) and simultaneously to be in ¬F with probability 1 - f(x). In actuality, this is not a true probability, but rather the degree of truth associated with the statement that x is in the set. To show the difference, let us look at a fuzzy set operation. Suppose the membership value for Mary being tall is 0.7 and the value for her being thin is 0.4. The membership value for her being both is 0.4, the minimum of the two values. If these were really probabilities, we would look at the product of the two values.
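
The difference between the two readings of "tall and thin" can be checked with a short sketch. The function names are hypothetical, and the probabilistic version assumes the two events are independent, which is an assumption added here for illustration.

```python
def fuzzy_and(*memberships):
    """Fuzzy conjunction: the minimum of the membership values."""
    return min(memberships)


def independent_probability_and(*probabilities):
    """Joint probability of events, ASSUMING they are independent."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


tall, thin = 0.7, 0.4  # membership values from the Mary example
print("fuzzy membership of 'tall and thin':", fuzzy_and(tall, thin))
print("product if these were probabilities:",
      round(independent_probability_and(tall, thin), 2))
```

The fuzzy result never drops below the weaker of the two memberships, while the product shrinks every time another condition is added.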

    Fuzzy sets have been used in many computer science and database areas. In the classification problem, all records in a database are assigned to one of the predefined classification areas. A common approach to solving the classification problem is to assign a set membership function to each record for each class. The record is then assigned to the class that has the highest membership function value. Similarly, fuzzy sets may be used to describe other data mining functions. Association rules are generated given a confidence value that indicates the degree to which it holds in the entire database. This can be thought of as a membership function.

    Queries can be thought of as defining a set. With traditional database queries, however, the set membership function is boolean. The set of tuples in relation R that satisfy the SQL statement in Figure 2.2 can be defined as

    $\{x \mid x \in R \text{ and } x.Salary > 100{,}000\}$    (2.4)

    Here x.Salary refers to the Salary attribute within the tuple x. Some queries, however, do not have a membership function that is boolean. For example, suppose that we wished to find the names of employees who are tall:

    $\{x \mid x \in R \text{ and } x \text{ is tall}\}$    (2.5)

    This membership function is not boolean, and thus the results of this query are fuzzy. A good example of queries of this type are searches performed on the Web.

    Figure 2.3 shows the real difference between traditional and fuzzy set membership. Suppose there are three sets (short, medium, and tall) to which a person can be classified based on his height. In Figure 2.3(a) the traditional (or crisp) set membership values are shown. Part (b) shows the triangular view of set membership values. Notice that there is a gradual decrease in the set membership value for short; there is a gradual increase and decrease for set membership in the medium set; there is a gradual increase in the set membership value for tall.
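
A minimal sketch of the two kinds of membership function follows. The breakpoints (165 cm and 175 cm for the crisp sets, 160 to 180 cm for the fuzzy ramps) are hypothetical values chosen only to make the shapes concrete; the figure itself does not give numbers.

```python
def clamp(value):
    """Restrict a value to the membership range [0, 1]."""
    return max(0.0, min(1.0, value))


def crisp_membership(height_cm):
    """Boolean membership: each height belongs to exactly one set."""
    return {
        "short": 1.0 if height_cm < 165 else 0.0,
        "medium": 1.0 if 165 <= height_cm < 175 else 0.0,
        "tall": 1.0 if height_cm >= 175 else 0.0,
    }


def fuzzy_membership(height_cm):
    """Graded membership with linear ramps centered on 170 cm (assumed)."""
    return {
        "short": clamp((170 - height_cm) / 10),          # gradual decrease
        "medium": clamp(1 - abs(height_cm - 170) / 10),  # rise, then fall
        "tall": clamp((height_cm - 170) / 10),           # gradual increase
    }


for height in (158, 165, 172, 185):
    print(height, crisp_membership(height), fuzzy_membership(height))
```

With these assumed breakpoints, a height of 165 cm is crisply "medium" but is half "short" and half "medium" in the fuzzy view.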

    FIGURE 2.3: Fuzzy vs. traditional set membership. Panels: (a) crisp sets, (b) fuzzy sets. Axis: Height. Sets: Short, Medium, Tall.

    Fuzzy logic is reasoning with uncertainty. That is, instead of a two valued logic (true and false), there are multiple values (true, false, maybe). Fuzzy logic has been used in database systems to retrieve data with imprecise or missing values. In this case, the membership of records in the query result set is fuzzy. As with traditional boolean logic, fuzzy logic uses operators such as ¬, ∧, and ∨. Assuming that x and y are fuzzy logic statements and that mem(x) defines the membership value, the following values are commonly used to define the results of these operations:

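    $\mathrm{mem}(\neg x) = 1 - \mathrm{mem}(x)$    (2.6)

    $\mathrm{mem}(x \wedge y) = \min(\mathrm{mem}(x), \mathrm{mem}(y))$    (2.7)

    $\mathrm{mem}(x \vee y) = \max(\mathrm{mem}(x), \mathrm{mem}(y))$    (2.8)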
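
The three operator definitions translate directly into code. The sketch below uses hypothetical function names and the membership values 0.7 and 0.4 from the earlier Mary example; it also checks that De Morgan's law carries over to these definitions.

```python
def fuzzy_not(mem_x):
    """mem(not x) = 1 - mem(x)"""
    return 1.0 - mem_x


def fuzzy_and(mem_x, mem_y):
    """mem(x and y) = min(mem(x), mem(y))"""
    return min(mem_x, mem_y)


def fuzzy_or(mem_x, mem_y):
    """mem(x or y) = max(mem(x), mem(y))"""
    return max(mem_x, mem_y)


tall, thin = 0.7, 0.4
print("not tall:    ", round(fuzzy_not(tall), 2))
print("tall and thin:", fuzzy_and(tall, thin))
print("tall or thin: ", fuzzy_or(tall, thin))

# De Morgan: not (x and y) has the same membership as (not x) or (not y).
left = fuzzy_not(fuzzy_and(tall, thin))
right = fuzzy_or(fuzzy_not(tall), fuzzy_not(thin))
print("De Morgan holds:", left == right)
```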

    Fuzzy logic uses rules and membership functions to estimate a continuous function. Fuzzy logic is a valuable tool to develop control systems for such things as elevators, trains, and heating systems. In these cases, instead of providing a crisp on-off environment, the fuzzy controller provides a more continuous adjustment.

    Most real-world classification problems are fuzzy. This is illustrated by Figure 2.4. In this figure we graphically show the threshold for approving a loan based on the income of the individual and the loan amount requested. A loan officer may make the loan decision by simply approving any loan requests on or above the line and rejecting any requests that fall below the line [Figure 2.4(a)]. This type of decision would not be fuzzy. However, this type of decision could lead to erroneous and perhaps costly decisions by the loan officer. From a data mining perspective, this application is a classification problem; that is, classify a loan application into the approval or reject class. There are many other factors (other than income) that should be used to predict the classification problem (such as net worth and credit rating). Even if all the associated predictors could be identified, the classification problem is not a black-and-white issue. It is possible that two individuals with exactly the same predictor values should be placed in two different classes. This is due to the fuzzy nature of this classification. This is shown by the shading around the line in Figure 2.4(b). We could perhaps classify

    FIGURE 2.4: Fuzzy classification. Panels: (a) simplistic loan approval, (b) loan approval is not precise. Axes: Loan amount and Income.


                  Classified tall   Classified not tall
    Tall                20                  10
    Not tall            45                  25

    FIGURE 2.8: Precision and recall applied to classification.

    level in the tree such as "cat." Although this would result in a higher recall, the precision would decrease. A concept hierarchy may actually be a DAG (directed acyclic graph) rather than a tree.

    IR has had a major impact on the development of data mining. Much of the data mining classification and clustering approaches had their origins in the document retrieval problems of library science and information retrieval. Many of the similarity measures developed for information retrieval have been applied to more general data mining applications. Similarly, the precision and recall measures are often applied to data mining applications, as illustrated in Example 2.2. Concept hierarchies are frequently used in spatial data mining applications. Data mining consists of many more types of applications than are found in traditional information retrieval. The linking and predictive tasks have no real counterpart in IR, for example.

    EXAMPLE 2.2

    The accuracy of a predictive modeling technique can be described based on precision and recall. Suppose 100 college students are to be classified based on height. In actuality, there are 30 tall students and 70 who are not tall. A classification technique classifies 65 students as tall and 35 as not tall. The precision and recall applied to this problem are shown in Figure 2.8. The precision is 20/65, while the recall is 20/30. The precision is low because so many students who are not tall are classified as such.
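
The arithmetic of Example 2.2 can be reproduced with a few lines. The function name and the true-positive/false-positive vocabulary are additions for illustration; the counts (20 tall students classified as tall, 45 not-tall students classified as tall, 10 tall students classified as not tall) follow from the numbers in the example.

```python
from fractions import Fraction


def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) as exact fractions."""
    precision = Fraction(true_pos, true_pos + false_pos)
    recall = Fraction(true_pos, true_pos + false_neg)
    return precision, recall


# 65 classified tall = 20 actually tall + 45 not tall.
# 30 actually tall   = 20 classified tall + 10 classified not tall.
precision, recall = precision_recall(true_pos=20, false_pos=45, false_neg=10)
print("precision =", precision, "which is about", round(float(precision), 3))
print("recall    =", recall, "which is about", round(float(recall), 3))
```

Exact fractions are used so that 20/65 and 20/30 are not blurred by floating-point rounding; `Fraction` reduces them to 4/13 and 2/3.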

    2.4 DECISION SUPPORT SYSTEMS

    Decision support systems (DSS) are comprehensive computer systems and related tools that assist managers in making decisions and solving problems. The goal is to improve the decision-making process by providing specific information needed by management. These systems differ from traditional database management systems in that more ad hoc queries and customized information may be provided. Recently, the terms executive information systems (EIS) and executive support systems (ESS) have evolved as well. These systems all aim at developing the business structure and computer techniques to better provide information needed by management to make effective business decisions. Data mining can be thought of as a suite of tools that assist in the overall DSS process; that is, DSS may use data mining tools.

    In many ways the term DSS is much more broad than the term data mining. While a DSS usually contains data mining tools, this need not be so. Likewise, a data mining tool need not be contained in a DSS system. A decision support system could be enterprise-wide, thus allowing upper-level managers the data needed to make intelligent business decisions that impact the entire company. A DSS typically operates using data warehouse data. Alternatively, a DSS could be built around a single user and a PC. The bottom line is that the DSS gives managers the tools needed to make intelligent decisions.

    2.5 DIMENSIONAL MODELING

    Dimensional modeling is a different way to view and interrogate data in a database. This view may be used in a DSS in conjunction with data mining tasks. Although not required, for efficiency purposes the data may be stored using different data structures as well. Decision support applications often require that information be obtained along many dimensions. For example, a sales manager may want to obtain information about the amount of sales in a geographic region, particular time frame, and by product type. This query requires three dimensions. A dimension is a collection of logically related attributes and is viewed as an axis for modeling the data. The time dimension could be divided into many different granularities: millennium, century, decade, year, month, day, hour, minute, or second. Within each dimension, these entities form levels on which various DSS questions may be asked. The specific data stored are called the facts and usually are numeric data. Facts consist of measures and context data. The measures are the numeric attributes about the facts that are queried. DSS queries may access the facts from many different dimensions and levels. The levels in each dimension facilitate the retrieval of facts at different levels. For example, the sales information could be obtained for the year 1999, for the month of February in the year 2000, or between the times of 10 and 11 AM on March 1, 2000. The same query could be formulated for a more general level, roll up, or for a more specific level, drill down.

    Table 2.1 shows a relation with three dimensions: Products, Location, and Date. Determining a key for this relation could be difficult because it is possible for the same product to be sold multiple times on the same day. In this case, product 150 was sold at two different times in Dallas on the same day. A finer granularity on the time (perhaps down to the minute rather than date as is here) could make a key. However, this illustrates that choice of key may be difficult. The same multidimensional data may also be viewed as a cube. Figure 2.9 shows a view of the data from Table 2.1 as a cube. Each dimension is seen as an axis for the cube. This cube has one fact for each unique combination of dimension values. In this case, we could have 8 * 7 * 5 = 280 facts stored (even though the relation showed only 10 tuples). Obviously, this sparse amount of data would need to be stored efficiently to reduce the amount of space required.
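
The cube size quoted above can be recomputed from the dimension values in Table 2.1. The sketch below copies the (ProdID, LocID, Date) columns of the ten tuples as read from the table; the function name `cube_size` is hypothetical.

```python
# (ProdID, LocID, Date) for the ten tuples of Table 2.1.
TUPLES = [
    (123, "Dallas", "022900"),
    (123, "Houston", "020100"),
    (150, "Dallas", "031500"),
    (150, "Dallas", "031500"),
    (150, "Fort Worth", "021000"),
    (150, "Chicago", "012000"),
    (200, "Seattle", "030100"),
    (300, "Rochester", "021500"),
    (500, "Bradenton", "022000"),
    (500, "Chicago", "012000"),
]


def cube_size(tuples):
    """Return the number of cells in the cube: the product of the
    number of distinct values along each dimension."""
    size = 1
    for dimension in zip(*tuples):  # one column per dimension
        size *= len(set(dimension))
    return size


cells = cube_size(TUPLES)
occupied = len(set(TUPLES))
print("cells in cube:", cells)
print("tuples in relation:", len(TUPLES))
print("distinct occupied cells:", occupied)
print("fraction occupied:", round(occupied / cells, 3))
```

Because product 150 was sold twice in Dallas on the same date, the ten tuples occupy only nine distinct cells, which also shows why (ProdID, LocID, Date) alone cannot serve as a key.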

    The levels of a dimension may support a partial order or a total order and can be viewed via a directed path, a hierarchy, or a lattice. To be consistent with earlier uses of the term, we use aggregation hierarchy, even though it may be a lattice, to refer to the order relationship among different levels in a dimension. We use < to represent this order relationship. X < Y if X and Y are levels in the same dimension and X is contained


    TABLE 2.1: Relational View of Multidimensional Data

    ProdID  LocID       Date    Quantity  UnitPrice
    123     Dallas      022900  5         25
    123     Houston     020100  10        20
    150     Dallas      031500  1         100
    150     Dallas      031500  5         95
    150     Fort Worth  021000  5         80
    150     Chicago     012000  20        75
    200     Seattle     030100  5         50
    300     Rochester   021500  200       5
    500     Bradenton   022000  15        20
    500     Chicago     012000  10        25

    FIGURE 2.9: Cube. Axis labels recovered: Products; locations Seattle, Rochester, Houston, Fort Worth, Dallas, Chicago, Bradenton.

    in Y. Figure 2.10(a) shows a total order relationship among levels in the Product dimension from Figure 2.9. Here Product < Product Type < Company. The two facts that we are using in this example are Quantity and UnitPrice. When this order relationship is satisfied between two levels, there is an aggregate type of relationship among the facts. Here the Quantity of products sold of a particular type is the sum of the quantities for all products within that type. Similarly, the quantity of products sold for a company is the sum of all products sold across all product types. The aggregate operation is not always the sum, however. When looking at the unit price, it would be reasonable to look at

    FIGURE 2.10: Aggregation hierarchies. Panels: (a) Product dimension (Product, Product type, Company); (b) Time dimension (Second, Minute, Hour, AM/PM, Day, Month, Season, Year); (c) Location dimension (City, Zip Code, County, State, Region, Country, Continent, Planet).

    such aggregate operations as average, maximum, and minimum prices. Figure 2.10(b) shows a hierarchical relationship among levels in the time dimension, and Figure 2.10(c) shows a lattice for the location dimension. Here Day < Month but Day ≮ Season. The aggregation can be applied only to levels that can be found in the same path as defined by the < relationship. When levels for a dimension satisfy this structure, the facts along these dimensions are said to be additive. If we add the sales data for all 24 hours in a day, we get the sales data for that day. This is not always the case. Looking at the location dimension, if we were to sum up the sales data for all zip codes in a given county, however, we would not get the sales data for the county. Thus, these dimensions are not additive. This is due to the fact that zip codes may span different counties. The use of nonadditive dimensions complicates the roll up and drill down applications.
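
A roll up along an additive path can be sketched as a group-and-aggregate step. Everything in the sketch below is hypothetical: the product identifiers, the product-to-type mapping, and the measures are invented for illustration, and `roll_up` is not a function from any particular library.

```python
from collections import defaultdict


def roll_up(facts, parent_of, aggregate=sum):
    """Aggregate (level_value, measure) pairs to the next higher level.

    parent_of maps each lower-level value to the single higher-level
    value that contains it, i.e. the X < Y relationship.
    """
    groups = defaultdict(list)
    for level_value, measure in facts:
        groups[parent_of[level_value]].append(measure)
    return {parent: aggregate(values) for parent, values in groups.items()}


# Hypothetical Product < Product Type mapping and facts.
product_type = {"P1": "widgets", "P2": "widgets", "P3": "gadgets"}
quantity = [("P1", 5), ("P2", 10), ("P3", 20), ("P1", 1)]
unit_price = [("P1", 25), ("P2", 20), ("P3", 75), ("P1", 100)]

print("quantity by type (sum):  ", roll_up(quantity, product_type))
print("max unit price by type:  ", roll_up(unit_price, product_type, max))
```

The sketch relies on each lower-level value having exactly one parent. A zip code that spans two counties breaks that assumption, which is the nonadditive case described in the text: no single `parent_of` mapping exists for it.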

    2.5.1 Multidimensional Schemas

    Specialized schemas have been developed to portray multidimensional data. These include star schema, snowflake schema, and fact constellation schema.

    A star schema shows data as a collection of two types: facts and dimensions. Unlike a relational schema, which is flat, a star schema is a graphical view of the data. At the center of the star, the data being examined, the facts, are shown in fact tables (sometimes called major tables). On the outside of the facts, each dimension is shown separately in dimension tables (sometimes called minor tables). The simplest star schema has one fact table with multiple dimension tables. In this case each fact points to one tuple in each of the dimensions. The actual data being accessed are stored in the fact tables and thus tend to be quite large. Descriptive information about the dimensions is stored in the dimensions tables, which tend to be smaller. Figure 2.11 shows a star schema based on the data in Figure 2.9. Here one extra dimension, division, is shown. The facts include the quantity and price, while the dimensions are the product, time,

    FIGURE 2.11: Star schema. Fact table: Sales (Product ID, Day ID, Salesman ID, Location ID, Quantity, Unit Price). Dimension tables: Product (Product ID, Description, Type, Type Description); Day (Day ID, Day, Month, Quarter, Year); Division (Salesman ID, Dept, Dept Desc, Div, Div Desc); Location (Location ID, Zip Code, State, City).

    location, and division. Descriptive information about a product includes the description, type, and type description. Access to the fact table from a dimension table can be accomplished via a join between a dimension table and the fact table on particular dimension values. For example, we could access all locations in Dallas by doing the following SQL query:

        SELECT Quantity, Price
        FROM Facts, Location
        WHERE (Facts.LocationID = Location.LocationID)
          AND (Location.City = 'Dallas')

    Here the LocationID is a foreign key from the fact table to the Location dimension table. The primary key for the fact table is a collection of foreign keys that point to the dimension tables. Although this example shows only one fact table, there may be several. In addition, a dimension table may itself point to another dimension table.
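
The join between the fact table and the Location dimension can be exercised with Python's built-in `sqlite3` module. This is only a sketch: the table and column names follow the query in the text, while the rows, zip codes, and the helper name `dallas_sales` are invented for illustration.

```python
import sqlite3


def dallas_sales():
    """Run the star-schema join from the text on a tiny in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Location (
            LocationID INTEGER PRIMARY KEY,
            ZipCode TEXT, State TEXT, City TEXT
        );
        -- Fact table: its key is the collection of foreign keys.
        CREATE TABLE Facts (
            ProductID INTEGER, DayID INTEGER, SalesmanID INTEGER,
            LocationID INTEGER REFERENCES Location(LocationID),
            Quantity INTEGER, Price INTEGER
        );
        -- Hypothetical sample rows.
        INSERT INTO Location VALUES (1, '75201', 'TX', 'Dallas');
        INSERT INTO Location VALUES (2, '77002', 'TX', 'Houston');
        INSERT INTO Facts VALUES (123, 1, 7, 1, 5, 25);
        INSERT INTO Facts VALUES (123, 2, 7, 2, 10, 20);
        INSERT INTO Facts VALUES (150, 3, 8, 1, 1, 100);
        """
    )
    rows = conn.execute(
        """
        SELECT Quantity, Price
        FROM Facts, Location
        WHERE (Facts.LocationID = Location.LocationID)
          AND (Location.City = 'Dallas')
        ORDER BY Facts.DayID
        """
    ).fetchall()
    conn.close()
    return rows


print(dallas_sales())
```

Only the two facts whose LocationID points at the Dallas tuple come back; the Houston fact is filtered out by the condition on the dimension table.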

    A star schema view can be obtained via a relational system where each dimension is a table and the facts are stored in a fact table. The facts can be accessed relatively efficiently through the creation of indices for the dimensions. However, the sheer volume of the data involved, exacerbated by the need to summarize the fact information at different levels across all the dimensions, complicates the process. For example, we may wish to see all sales in all regions for a particular day. Or we may want to see all sales in May 2000 for the salesmen in a specific department. To ensure efficiency of access, facts may be stored for all possible aggregation levels. The fact data would then be extended to include a level indicator.

    We assume that the data are stored as both fact tables and dimension tables. Data in the fact table can be viewed as a regular relation with an attribute for each fact to be stored and the key being the values for each dimension. There are four basic approaches to the storage of data in dimension tables [PB99]. Each dimension table can be stored in one of these four manners. Figure 2.12 illustrates these four approaches with the sales data. The first technique, the flattened technique, stores the data for each dimension in exactly one table. There is one row in the table for each row in the lowest level in the dimensional model. The key to the data are the attributes for all levels in that dimension. With the flattened approach, a roll up is accomplished by a SUM aggregation operation over the appropriate tuples. Even though this approach suffers from space problems as the

    Sales(Product ID, Day ID, Salesman ID, Location ID, Quantity, Unit Price)
    Product(Product ID, Description, Type, Type Description)
    Day(Day ID, Month, Quarter, Year)
    Division(Salesman ID, Dept, Dept Desc, Div, Div Desc)
    Location(Location ID, Zip Code, State, City)

    (a) Flattened

    Sales(Product ID, Day ID, Salesman ID, Location ID, Quantity, Unit Price)
    Product(Product ID, Description, Type)
    Type(Type, Type Description)
    Day(Day ID, Month)
    Month(Month, Quarter)
    Quarter(Quarter, Year)
    Year(Year)
    Salesman(Salesman ID, Dept)
    Dept(Dept, Dept Desc, Div)
    Div(Div, Div Desc)
    Location(Location ID, Zip Code)
    Zip(Zip Code, City)
    Cities(State, City)
    State(State)

    (b) Normalized

    Sales(Product ID, Day ID, Salesman ID, Location ID, Quantity, Unit Price)
    Product(Product ID, Description, Type, Type Description)
    Type(Type, Type Description)
    Day(Day ID, Month, Quarter, Year)
    Month(Month, Quarter, Year)
    Quarter(Quarter, Year)
    Year(Year)
    Salesman(Salesman ID, Dept, Dept Desc, Div, Div Desc)
    Dept(Dept, Dept Desc, Div, Div Desc)
    Div(Div, Div Desc)
    Location(Location ID, Zip Code, State, City)
    Zip(Zip Code, State, City)
    Cities(State, City)
    State(State)

    (c) Expanded

    Sales(Product ID, Day ID, Salesman ID, Location ID, Quantity, Unit Price)
    Product(Product ID, Description, Type, Type Description, Level No)
    Day(Day ID, Month, Quarter, Year, Level No)
    Division(Salesman ID, Dept, Dept Desc, Div, Div Desc, Level No)
    Location(Location ID, Zip Code, State, City, Level No)

    (d) Levelized

    FIGURE 2.12: Options to implement star schema.


    number of attributes grows with the number of levels, it does facilitate the simple implementation of many DSS applications via the use of traditional SQL aggregation operations.

    The second technique to store a dimension table is called the normalized technique, where a table exists for each level in each dimension. Each table has one tuple for every occurrence at that level. As with traditional normalization, duplication is removed at the expense of creating more tables and potentially more expensive access to factual data due to the requirement of more joins. Each lower level dimension table has a foreign key pointing to the next higher level table.

    Using expanded dimension tables achieves the operational advantages of both the flattened and the normalized views, while actually increasing the space requirements beyond that of the flattened approach. The number of dimension tables is identical to that in the normalized approach, and the structure of the lowest level dimension table is identical to that in the flattened technique. Each higher level dimension table has, in addition to the attributes existing for the normalized structure, attributes from all higher level dimensions.

    The levelized approach has one dimension table, as does the flattened technique. However, the aggregations have already been performed. There is one tuple for each instance of each level in the dimension, the same number existing in all normalized tables combined. In addition, attributes are added to show the level number.

    An extension of the star schema, the snowflake schema facilitates more complex data views. In this case, the aggregation hierarchy is shown explicitly in the schema itself. An example of a snowflake schema based on the sales data is shown in Figure 2.13. A snowflake schema can be viewed as a partially normalized version of the corresponding star schema. The division and location dimension tables have been normalized in this figure.

    2.5.2 Indexing

    With multidimensional data, indices help to reduce the overhead of scanning the extremely large tables. Although the indices used are not defined specifically to support multidimensional data, they do have inherent advantages in their use for these types of data.

    With bitmap indices each tuple in the table (fact table or dimension table) is represented by a predefined bit so that a table with n tuples would be represented by a vector of n bits. The first tuple in the table is associated with the first bit, the second with the second bit, and so on. There is a unique bit vector for each value in the domain. This vector indicates which associated tuples in the table have that domain value. To find the precise tuples, an address or pointer to each tuple would also have to be associated with each bit position, not each vector. Bitmap indices facilitate easy functions such as join and aggregation through the use of bit arithmetic operations. Bitmap indices also save space over more traditional indices where pointers to records are maintained.
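
A bitmap index can be sketched with Python integers standing in for bit vectors, with bit i set when tuple i has the indexed value. The column contents and the function names below are hypothetical, and a real system would use compressed, fixed-width bit vectors rather than arbitrary-precision integers.

```python
def build_bitmap_index(column):
    """Map each distinct value to an integer whose bit i is 1
    when tuple i (0-based) holds that value."""
    index = {}
    for position, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << position)
    return index


def matching_positions(bitmap):
    """List the tuple positions whose bits are set in the bitmap."""
    positions = []
    position = 0
    while bitmap:
        if bitmap & 1:
            positions.append(position)
        bitmap >>= 1
        position += 1
    return positions


# Hypothetical columns of a five-tuple table.
city = ["Dallas", "Houston", "Dallas", "Chicago", "Dallas"]
product = [123, 123, 150, 150, 150]

city_index = build_bitmap_index(city)
product_index = build_bitmap_index(product)

# "City = Dallas AND ProdID = 150" becomes one bitwise AND.
both = city_index["Dallas"] & product_index[150]
print("matching tuple positions:", matching_positions(both))
print("COUNT(*) via bit count:", bin(both).count("1"))
```

The conjunction of two selection conditions costs one bitwise AND, and a count aggregate is a population count over the result, which illustrates the bit arithmetic mentioned in the text.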

    Join indices support joins by precomputing tuples from tables that join together and pointing to the tuples in those tables. When used for multidimensional data, a common approach is to create a join index between a dimension table and a fact table. This facilitates the efficient identification of facts for a specific dimension level and/or value. Join indices can be created for multiway joins across multiple dimension tables. Join indices can be constructed using bitmaps as opposed to pointers.

    FIGURE 2.13: Snowflake schema. Tables recovered: Product, Day, Sales, Salesman, Department, Location, Zip Codes.

    Traditional B-tree indices may be constructed to access each entry in the fact table. Here the key would be the combination of the foreign keys to the dimension tables.

    2.6 DATA WAREHOUSING

    Decision support systems (DSS) are subject-oriented, integrated, time-variant, and nonvolatile. The term data warehouse was first used by William Inmon in the early 1980s. He defined data warehouse to be a set of data that supports DSS and is "subject-oriented, integrated, time-variant, nonvolatile" [Inm95]. With data warehousing, corporate-wide data (current and historical) are merged into a single repository. Traditional databases contain operational data that represent the day-to-day needs of a company. Traditional business data processing (such as billing, inventory control, payroll, and manufacturing support) supports online transaction processing and batch reporting applications. A data warehouse, however, contains informational data, which are used to support other functions such as planning and forecasting. Although much of the content is similar between the operational and informational data, much is different. As a matter of fact, the operational data are transformed into the informational data. Example 2.3 illustrates the difference between the two.

    EXAMPLE 2.3

    The ACME Manufacturing Company maintains several operational databases: sales, billing, employee, manufacturing, and warehousing. These are used to support the day-to-day functions such as writing paychecks, placing orders for supplies needed in the manufacturing process, billing customers, and so on. The president of ACME, Stephanie Eich, wishes to streamline manufacturing to concentrate production on the most profitable products. To perform this task, she asks several "what if" questions, does a projection of current sales into the future, and examines data at different geographic and time dimensions. All the data that she needs to perform this task can be found in one or more of the existing databases. However, it is not easily retrieved in the exact format that she desires. A data warehouse is created with exactly the sales information she needs b


    Summarizing data is performed to provide a higher level view of the data. This summarization may be done at multiple granularities and for different dimensions.

    New derived data (e.g., using age rather than birth date) may be added to better facilitate decision support functions.

    Handling missing and erroneous data must be performed. This could entail replacing them with predicted or default values or simply removing these entries.

    The portion of the transformation that deals with ensuring valid and consistent data is sometimes referred to as data scrubbing or data staging.
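    The three transformation steps just listed (summarizing, deriving new data, and handling missing values) can be sketched as follows. The records, the field names, and the choice of replacing a missing amount with the mean of the known amounts are assumptions for illustration, not a prescribed policy.

```python
from collections import defaultdict
from datetime import date

# Hypothetical operational records; None marks a missing value.
orders = [
    {"city": "Dallas", "birth_date": date(1980, 6, 15), "amount": 120.0},
    {"city": "Dallas", "birth_date": date(1975, 1, 2), "amount": None},
    {"city": "Austin", "birth_date": date(1990, 12, 31), "amount": 80.0},
]

def age_on(birth_date, as_of):
    """Derived attribute: whole years between birth_date and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def transform(rows, as_of):
    """Replace missing amounts with the mean (assumed policy) and derive age."""
    known = [r["amount"] for r in rows if r["amount"] is not None]
    default_amount = sum(known) / len(known)
    cleaned = []
    for r in rows:
        amount = default_amount if r["amount"] is None else r["amount"]
        cleaned.append({"city": r["city"],
                        "age": age_on(r["birth_date"], as_of),
                        "amount": amount})
    return cleaned

def summarize_by(rows, key):
    """Summarization: total amount per value of the chosen attribute."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r["amount"]
    return dict(totals)

warehouse_rows = transform(orders, as_of=date(2020, 1, 1))
city_totals = summarize_by(warehouse_rows, "city")
```

    Removing the incomplete record instead of filling it in would be an equally valid reading of the text; only the line that computes default_amount and the branch that uses it would change.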

    There are many benefits to the use of a data warehouse. Because it provides an integration of data from multiple sources, its use can provide more efficient access of the data. The data that are stored often provide different levels of summarization. For example, sales data may be found at a low level (purchase order), at a city level (total of sales for a city), or at higher levels (county, state, country, world). The summary can be provided for different types of granularity. The sales data could be summarized by both salesman and department. These summarizations are provided by the conversion process instead of being calculated when the data are accessed. Thus, this also speeds up the processing of the data for decision support applications.

    The data warehouse may appear to increase the complexity of database management because it is a replica of the operational data. But keep in mind that much of the data in the warehouse are not simply a replication but an extension to or aggregation of the data. In addition, because the data warehouse contains historical data, data stored there probably will have a longer life span than the snapshot data found in the operational databases. The fact that the data in the warehouse need not be kept consistent with the current operational data also simplifies its maintenance. The benefits obtained by the capabilities (e.g., DSS support) provided usually are deemed to outweigh any disadvantages.

    A subset of the complete data warehouse, a data mart, may be stored and accessed separately. The level is at a departmental, regional, or functional level. These separate data marts are much smaller, and they more efficiently support narrower analytical types of applications.

    A virtual warehouse is a warehouse implemented as a view from the operational data. While some of this view may actually be materialized for efficiency, it need not all be.

    There are several ways to improve the performance of data warehouse applications.

    Summarization: Because many applications require summary-type information, data that are known to be needed for consolidation queries should be presummarized before storage. Different levels of summarization should be included to improve performance. With a 20 to 100% increase in storage space, an increase in performance of 2 to 10 times can be achieved [Sin98, p. 302].

    Denormalization: Traditional normalization reduces such problems as redundancy as well as insert, update, and deletion anomalies. However, these improvements are achieved at the cost of increased processing time due to joins. With a data warehouse, improved performance can be achieved by storing denormalized data. Since data warehouses are not usually updated as frequently as operational data are, the negatives associated with update operations are not an issue.

    Partitioning: Dividing the data warehouse into smaller fragments may reduce processing time by allowing queries to access small data sets.

    The relationship between data mining and data warehousing can be viewed as symbiotic [Inm96]. Data used in data mining applications are often slightly modified from that in the databases where the data permanently reside. The same is true for data in a data warehouse. When data are placed in a warehouse, they are extracted from the database, cleansed, and reformatted. The fact that the data are derived from multiple sources with heterogeneous formats complicates the problem. In addition, the fact that the source databases are updated requires that the warehouse be updated periodically or work with stale data. These issues are identical to many of those associated with data mining and KDD (see Figure 1.4). While data mining and data warehousing are actually orthogonal issues, they are complementary. Due to the type of applications and massive amount of data in a data warehouse, data mining applications can be used to provide meaningful information needed for decision support systems. For example, management may use the results of classification or association rule applications to help determine the target population for an advertising campaign. In addition, data mining activities can benefit from the use of data in a data warehouse. However, its use is not required. Data warehousing and data mining are sometimes thought of as the same thing. Even though they are related, they are different, and each can be used without the other.

    2.7 OLAP

    Online analytic processing (OLAP) systems are targeted to provide more complex query results than traditional OLTP or database systems. Unlike database queries, however, OLAP applications usually involve analysis of the actual data. They can be thought of as an extension of some of the basic aggregation functions available in SQL. This extra analysis of the data as well as the more imprecise nature of the OLAP queries is what really differentiates OLAP applications from traditional database and OLTP applications. OLAP tools may also be used in DSS systems.

    OLAP is performed on data warehouses or data marts. The primary goal of OLAP is to support ad hoc querying needed to support DSS. The multidimensional view of data is fundamental to OLAP applications. OLAP is an application view, not a data structure or schema. The complex nature of OLAP applications requires a multidimensional view of the data. The type of data accessed is often (although not a requirement) a data warehouse.

    OLAP tools can be classified as ROLAP or MOLAP. With MOLAP (multidimensional OLAP), data are modeled, viewed, and physically stored in a multidimensional database (MDD). MOLAP tools are implemented by specialized DBMS and software systems capable of supporting the multidimensional data directly. With MOLAP, data are stored as an n-dimensional array (assuming there are n dimensions), so the cube view is stored directly. Although MOLAP has extremely high storage requirements, indices are used to speed up processing. With ROLAP (relational OLAP), however, data are stored in a relational database, and a ROLAP server (middleware) creates the multidimensional view for the user. As one would think, the ROLAP tools tend to be less complex but also less efficient. MDD systems may presummarize along all dimensions. A third approach, hybrid OLAP (HOLAP), combines the best features of ROLAP and MOLAP. Queries are stated in multidimensional terms. Data that are not updated frequently will be stored as MDD, whereas data that are updated frequently will be stored as RDB.

    FIGURE 2.15 OLAP operations: (a) single cell, (b) multiple cells, (c) slice, (d) dice.

    As seen in Figure 2.15, there are several types of OLAP operations supported by OLAP tools:

    A simple query may look at a single cell within the cube [Figure 2.15(a)].

    Slice: Look at a subcube to get more specific information. This is performed by selecting on one dimension. As seen in Figure 2.15(c), this is looking at a portion of the cube.

    Dice: Look at a subcube by selecting on two or more dimensions. This can be performed by a slice on one dimension and then rotating the cube to select on a second dimension. In Figure 2.15(d), a dice is made because the view in (c) is rotated from all cells for one product to all cells for one location.

    Roll up (dimension reduction, aggregation): Roll up allows the user to ask questions that move up an aggregation hierarchy. Figure 2.15(b) represents a roll up from (a). Instead of looking at one single fact, we look at all the facts. Thus, we could, for example, look at the overall total sales for the company.

    Drill down: Figure 2.15(a) represents a drill down from (b). These functions allow a user to get more detailed fact information by navigating lower in the aggregation hierarchy. We could perhaps look at quantities sold within a specific area of each of the cities.

    Visualization: Visualization allows the OLAP users to actually "see" results of an operation.

    To assist with roll up and drill down operations, frequently used aggregations can be precomputed and stored in the warehouse. There have been several different definitions for a dice. In fact, the term slice and dice is sometimes viewed together as indicating that the cube is subdivided by selecting on multiple dimensions.
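    The operations above can be tried on a toy cube. The sketch below assumes a cube stored as a dictionary from (product, city, quarter) coordinates to quantities; the data, the dimension names, and the helpers select and roll_up are hypothetical and are not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales cube: (product, city, quarter) -> quantity sold.
cube = {
    ("widget", "Dallas", "Q1"): 5,
    ("widget", "Dallas", "Q2"): 3,
    ("widget", "Austin", "Q1"): 2,
    ("gadget", "Dallas", "Q1"): 7,
    ("gadget", "Austin", "Q2"): 4,
}
DIMENSIONS = ("product", "city", "quarter")

def select(cube, **fixed):
    """Keep cells matching every fixed dimension value.
    Fixing one dimension is a slice; fixing two or more is a dice."""
    wanted = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {cell: qty for cell, qty in cube.items()
            if all(cell[axis] == value for axis, value in wanted.items())}

def roll_up(cube, keep):
    """Aggregate away every dimension that is not listed in keep."""
    axes = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for cell, qty in cube.items():
        totals[tuple(cell[a] for a in axes)] += qty
    return dict(totals)

single_cell = cube[("widget", "Dallas", "Q1")]              # one fact
widget_slice = select(cube, product="widget")               # slice
dallas_q1_dice = select(cube, city="Dallas", quarter="Q1")  # dice
by_city = roll_up(cube, keep=("city",))                     # roll up
grand_total = roll_up(cube, keep=())                        # roll up to the top
by_city_quarter = roll_up(cube, keep=("city", "quarter"))   # drill down from by_city
```

    Drill down is expressed here as a roll up that keeps more dimensions, which mirrors the remark that frequently used aggregations can be precomputed and stored.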

    2.8 WEB SEARCH ENGINES

    As a result of the large amount of data on the Web and the fact that it is continually growing, obtaining desired information can be challenging. Web search engines are used to access the data and can be viewed as query systems much like IR systems. As with IR queries, search engine queries can be stated as keyword, boolean, weighted, and so on. The difference is primarily in the data being searched (pages with heterogeneous data and extensive hyperlinks) and the architecture involved.

    Conventional search engines suffer from several problems [RS99]:

    Abundance: Most of the data on the Web are of no interest to most people. In other words, although there is a lot of data on the Web, an individual query will retrieve only a very small subset of it.

    Limited coverage: Search engines often provide results from a subset of the Web pages. Because of the extreme size of the Web, it is impossible to search the entire Web any time a query is requested. Instead, most search engines create indices that are updated periodically. When a query is requested, often only the index is directly accessed.

    Limited query: Most search engines provide access based only on simple keyword-based searching. More advanced search engines may retrieve or order pages based on other properties such as the popularity of pages.

    Limited customization: Query results are often determined only by the query itself. However, as with traditional IR systems, the desired results are often dependent on the background and knowledge of the user as well. Some more advanced search engines add the ability to do customization using user profiles or historical information.

    Traditional IR systems (such as Lexis-Nexis) may actually be tailored to a specific domain. The Web, however, has information for everyone.

    As discussed in Chapter 7, true Web mining consists of content, structure, and usage mining. Web search engines are very simplistic examples of Web content mining.

    2.9 STATISTICS

    Such simple statistical concepts as determining a data distribution and calculating a mean and a variance can be viewed as data mining techniques. Each of these is, in its own right, a descriptive model for the data under consideration.

    Part of the data mining modeling process requires searching the actual data. An equally important part requires inferencing from the results of the search to a general model. A current database state may be thought of as a sample (albeit large) of the real data that may not be stored electronically. When a model is generated, the goal is to fit it to the entire data, not just that which was searched. Any model derived should be statistically significant, meaningful, and valid. This problem may be compounded by the preprocessing step of the KDD process, which may actually remove some of the data. Outliers compound this problem. The fact that most database practitioners


    TABLE 2.3 Relationship Between Topics [FPSM91]

    Database Management                              Machine Learning
    Database is an active, evolving entity           Database is static
    Records may contain erroneous or missing data    Databases are complete and noise-free
    Typical field is numeric                         Typical feature is binary
    Database contains millions of records            Database contains hundreds of instances
    AI should get down to reality                    All database problems have been solved

    mnns toun


    C H A P T E R 3

    Data Mining Techniques

    3.1 INTRODUCTION
    3.2 A STATISTICAL PERSPECTIVE ON DATA MINING
    3.3 SIMILARITY MEASURES
    3.4 DECISION TREES
    3.5 NEURAL NETWORKS
    3.6 GENETIC ALGORITHMS
    3.7 EXERCISES
    3.8 BIBLIOGRAPHIC NOTES

    3.1 INTRODUCTION

    There are many different methods used to perform data mining tasks. These techniques not only require specific types of data structures, but also imply certain types of algorithmic approaches. In this chapter we briefly introduce some of the common data mining techniques. These will be examined in more detail in later chapters of the book as they are used to perform specific data mining tasks.

    Parametric models describe the relationship between input and output through the use of algebraic equations where some parameters are not specified. These unspecified parameters are determined by providing input examples. Even though parametric modeling is a nice theoretical topic and can sometimes be used, often it is either too simplistic or requires more knowledge about the data involved than is available. Thus, for real-world problems, these parametric models may not be useful.

    Nonparametric techniques are more appropriate for data mining applications. A nonparametric model is one that is data-driven. No explicit equations are used to determine the model. This means that the modeling process adapts to the data at hand. Unlike parametric modeling, where a specific model is assumed ahead of time, the nonparametric techniques create a model based on the input. While the parametric methods require more knowledge about the data before the modeling process, the nonparametric technique requires a large amount of data as input to the modeling process itself. The modeling process then creates the model by sifting through the data. Recent nonparametric methods have employed machine learning techniques to be able to learn dynamically as data are added to the input. Thus, the more data, the better the model created. Also, this dynamic learning process allows the model to be created continuously as the data is input. These features make nonparametric techniques particularly suitable to database applications with large amounts of dynamically changing data. Nonparametric techniques include neural networks, decision trees, and genetic algorithms.

    3.2 A STATISTICAL PERSPECTIVE ON DATA MINING

    There have been many statistical concepts that are the basis for data mining techniques. We briefly review some of these concepts.

    3.2.1 Point Estimation

    Point estimation refers to the process of estimating a population parameter, $\Theta$, by an estimate of the parameter, $\hat{\Theta}$. This can be done to estimate mean, variance, standard deviation, or any other statistical parameter. Often the estimate of the parameter for a general population may be made by actually calculating the parameter value for a population sample. An estimator technique may also be used to estimate (predict) the value of missing data. The bias of an estimator is the difference between the expected value of the estimator and the actual value:

    $$\text{Bias} = E(\hat{\Theta}) - \Theta \tag{3.1}$$

    An unbiased estimator is one whose bias is 0. While point estimators for small data sets may actually be unbiased, for larger database applications we would expect that most estimators are biased.

    One measure of the effectiveness of an estimate is the mean squared error (MSE), which is defined as the expected value of the squared difference between the estimate and the actual value:

    $$\text{MSE}(\hat{\Theta}) = E(\hat{\Theta} - \Theta)^2 \tag{3.2}$$

    The squared error is often examined for a specific prediction to measure accuracy rather than to look at the average difference. For example, if the true value for an attribute was 10 and the prediction was 5, the squared error would be $(5 - 10)^2 = 25$. The squaring is performed to ensure that the measure is always positive and to give a higher weighting to the estimates that are grossly inaccurate. As we will see, the MSE is commonly used in evaluating the effectiveness of data mining prediction techniques. It is also important in machine learning. At times, instead of predicting a simple point estimate for a parameter, one may determine a range of values within which the true parameter value should fall. This range is called a confidence interval.
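    As a small illustration of the squared error and the MSE, the sketch below approximates the expectation in the definition by an average over a finite set of predictions. That averaging, the function names, and the sample numbers are assumptions made for the example.

```python
def squared_error(estimate, actual):
    """Squared difference between one estimate and the true value."""
    return (estimate - actual) ** 2

def mean_squared_error(estimates, actuals):
    """Average squared error over paired estimates and true values."""
    if not estimates or len(estimates) != len(actuals):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [squared_error(e, a) for e, a in zip(estimates, actuals)]
    return sum(errors) / len(errors)

# Hypothetical predictions for an attribute whose true value is 10 each time.
predictions = [5, 9, 12]
truth = [10, 10, 10]

per_prediction = [squared_error(p, t) for p, t in zip(predictions, truth)]
mse = mean_squared_error(predictions, truth)
```

    The grossly inaccurate prediction (5) contributes 25 of the 30 units of total squared error, which shows the heavier weighting that squaring gives to large misses.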

    The root mean square (RMS) may also be used to estimate error or as another statistic to describe a distribution. Calculating the mean does not indicate the magnitude of the values. The RMS can be used for this purpose. Given a set of n values $X = \{x_1, \ldots, x_n\}$, the RMS is defined by

    $$\text{RMS} = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n}} \tag{3.3}$$

    An alternative use is to estimate the magnitude of the error. The root mean square error (RMSE) is found by taking the square root of the MSE.
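    A short sketch shows why the RMS conveys magnitude where the mean does not. The sample values and function names are hypothetical; the RMSE here averages squared errors over a finite sample as a stand-in for the expectation.

```python
import math

def rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    return math.sqrt(sum(x * x for x in values) / len(values))

def rmse(estimates, actuals):
    """Square root of the sample mean squared error."""
    mse = sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / len(estimates)
    return math.sqrt(mse)

# Values that cancel in the mean but clearly have magnitude 3.
symmetric = [-3.0, 3.0, -3.0, 3.0]
mean_value = sum(symmetric) / len(symmetric)
magnitude = rms(symmetric)

# Hypothetical predictions against a true value of 10.
error_size = rmse([5, 9, 12], [10, 10, 10])
```

    For the symmetric values the mean is 0 while the RMS is 3, and the RMSE of the sample predictions is the square root of 10, expressed in the same units as the attribute.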

    A popular estimating technique is the jackknife estimate. With this approach, the estimate of a parameter, $\theta$, is obtained by omitting one value from the set of observed values. Suppose that there is a set of n values $X = \{x_1, \ldots, x_n\}$. An estimate for the mean would be

    $$\hat{\mu}_{(i)} = \frac{\sum_{j=1}^{i-1} x_j + \sum_{j=i+1}^{n} x_j}{n - 1} \tag{3.4}$$

    Here the subscript (i) indicates that this estimate is obtained by omitting the ith value. Given a set of jackknife estimates, $\hat{\theta}_{(i)}$, these can in turn be used to obtain an overall estimate

    $$\hat{\theta}_{(\cdot)} = \frac{\sum_{j=1}^{n} \hat{\theta}_{(j)}}{n} \tag{3.5}$$
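    The leave-one-out means and their overall average can be computed directly. The function names and the observed values below are hypothetical; the sketch covers only the mean, the parameter used in the text.

```python
def jackknife_means(values):
    """Entry i is the mean of all values except values[i]."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to omit one")
    total = sum(values)
    return [(total - x) / (n - 1) for x in values]

def jackknife_overall(estimates):
    """Average the leave-one-out estimates into one overall estimate."""
    return sum(estimates) / len(estimates)

observed = [1.0, 5.0, 10.0, 4.0]             # hypothetical sample
leave_one_out = jackknife_means(observed)
overall = jackknife_overall(leave_one_out)
sample_mean = sum(observed) / len(observed)
```

    For the mean, the overall jackknife estimate coincides with the ordinary sample mean (up to floating-point rounding); the spread of the leave-one-out values is what indicates how sensitive the estimate is to any single observation.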

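    EXAMPLE 3.1

    Suppose that a coin is tossed in the air five times with the following results (1 indicates a head and 0 indicates a tail): {1, 1, 1, 1, 0}. If we assume that the coin toss follows the Bernoulli distribution, we know that

    $$f(x_i \mid p) = p^{x_i} (1 - p)^{1 - x_i} \tag{3.6}$$

    Assuming a perfect coin when the probability of 1 and 0 are both 1/2, the likelihood is then

    $$L(p \mid 1, 1, 1, 1, 0) = \prod_{i=1}^{5} 0.5 = 0.03 \tag{3.7}$$

    However, if the coin is not perfect but has a bias toward heads such that the probability of getting a head is 0.8, the likelihood is

    $$L(p \mid 1, 1, 1, 1, 0) = 0.8 \times 0.8 \times 0.8 \times 0.8 \times 0.2 = 0.08 \tag{3.8}$$

    Here it is more likely that the coin is biased toward getting a head than that it is not biased. The general formula for likelihood is

    $$L(p \mid x_1, \ldots, x_5) = \prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_{i=1}^{5} x_i} (1 - p)^{5 - \sum_{i=1}^{5} x_i} \tag{3.9}$$

    By taking the log we get

    $$l(p) = \log L(p) = \sum_{i=1}^{5} x_i \log(p) + \left(5 - \sum_{i=1}^{5} x_i\right) \log(1 - p) \tag{3.10}$$

    and then we take the derivative with respect to p

    $$\frac{\partial l(p)}{\partial p} = \frac{\sum_{i=1}^{5} x_i}{p} - \frac{5 - \sum_{i=1}^{5} x_i}{1 - p} \tag{3.11}$$

    Setting equal to zero we finally obtain

    $$p = \frac{\sum_{i=1}^{5} x_i}{5} \tag{3.12}$$

    For this example, the estimate for p is then $\hat{p} = 4/5 = 0.8$. Thus, 0.8 is the value for p that maximizes the likelihood that the given sequence of heads and tails would occur.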
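    The coin example can be checked numerically. The sketch assumes the toss sequence of four heads followed by one tail that the example's likelihood values imply; the function names and the grid search are illustrative additions, not part of the derivation.

```python
import math

def bernoulli_likelihood(p, tosses):
    """Product of p for each head (1) and 1 - p for each tail (0)."""
    return math.prod(p if x == 1 else 1 - p for x in tosses)

def bernoulli_mle(tosses):
    """Closed-form maximizer: the fraction of heads."""
    return sum(tosses) / len(tosses)

tosses = [1, 1, 1, 1, 0]
fair = bernoulli_likelihood(0.5, tosses)      # 0.5 ** 5
biased = bernoulli_likelihood(0.8, tosses)    # 0.8 ** 4 * 0.2
p_hat = bernoulli_mle(tosses)

# Numerical cross-check: search a grid of candidate values for p.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_likelihood(p, tosses))
```

    The fair-coin likelihood is 0.03125 and the biased-coin likelihood is roughly 0.082, matching the rounded 0.03 and 0.08 in the example, and the grid search lands on the same value, 0.8, as the closed-form estimate.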

    Another technique for point estimation is called the maximum likelihood estimate (MLE). Likelihood can be defined as a value proportional to the actual probability that with a specific distribution the given sample exists. So the sample gives us an estimate for a parameter from the distribution. The higher the likelihood value, the more likely the underlying distribution will produce the results observed. Given a sample set of values $X = \{x_1, \ldots, x_n\}$ from a known distribution function $f(x_i \mid \Theta)$, the MLE can estimate parameters for the population from which the sample is drawn. The approach obtains parameter estimates that maximize the probability that the sample data occur for the specific model. It looks at the joint probability for observing the sample data by multiplying the individual probabilities. The likelihood function, L, is thus defined as

    $$L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta) \tag{3.13}$$

    The value of $\Theta$ that maximizes L is the estimate chosen. This can be found by taking the derivative (perhaps after finding the log of each side to simplify the formula) with respect to $\Theta$. Example 3.1 illustrates the use of MLE.

    ALGORITHM 3.1

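    Input:
       $\Theta = \{\theta_1, \ldots, \theta_p\}$       //Parameters to be estimated
       $X_{obs} = \{x_1, \ldots, x_k\}$         //Input database values observed
       $X_{miss} = \{x_{k+1}, \ldots, x_n\}$    //Input database values missing
    Output:
       $\hat{\Theta}$                            //Estimates for $\Theta$
    EM algorithm:
       i := 0;
       Obtain initial parameter MLE estimate, $\hat{\Theta}^i$;
       repeat
          Estimate missing data, $X^i_{miss}$;
          i := i + 1;
          Obtain next parameter estimate, $\hat{\Theta}^i$, to maximize likelihood;
       until estimate converges;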

    The expectation-maximization (EM) algorithm is an approach that solves the estimation problem with incomplete data. The EM algorithm finds an MLE for a parameter (such as a mean) using a two-step process: estimation and maximization. The basic EM algorithm is shown in Algorithm 3.1. An initial set of estimates for the parameters is obtained. Given these estimates and the training data as input, the algorithm then calculates a value for the missing data. For example, it might use the estimated mean to predict a missing value. These data (with the new value added) are then used to determine an estimate for the mean that maximizes the likelihood. These steps are applied iteratively until successive parameter estimates converge. Any approach can be used to find the initial parameter estimates. In Algorithm 3.1 it is assumed that the input database has actual observed values $X_{obs} = \{x_1, \ldots, x_k\}$ as well as values that are missing $X_{miss} = \{x_{k+1}, \ldots, x_n\}$. We assume that the entire database is actually $X = X_{obs} \cup X_{miss}$. The parameters to be estimated are $\Theta = \{\theta_1, \ldots, \theta_p\}$. The likelihood function is defined by

    $$L(\Theta \mid X) = \prod_{i=1}^{n} f(x_i \mid \Theta) \tag{3.14}$$

    We are looking for the $\Theta$ that maximizes L. The MLE of $\Theta$ are the estimates that satisfy

    $$\frac{\partial \ln L(\Theta \mid X)}{\partial \theta_i} = 0 \tag{3.15}$$

    The expectation part of the algorithm estimates the missing values using the current estimates of $\Theta$. This can initially be done by finding a weighted average of the observed data. The maximization step then finds the new estimates for the $\Theta$ parameters that maximize the likelihood by using those estimates of the missing data. An illustrative example of the EM algorithm is shown in Example 3.2.

    EXAMPLE 3.2

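    We wish to find the mean, $\mu$, for data that follow the normal distribution where the known data are {1, 5, 10, 4} with two data items missing. Here n = 6 and k = 4. Suppose that we initially guess $\hat{\mu}^0 = 3$. We then use this value for the two missing values. Using this, we obtain the MLE estimate for the mean as

    $$\hat{\mu}^1 = \frac{\sum_{i=1}^{k} x_i}{n} + \frac{\sum_{i=k+1}^{n} x_i}{n} = 3.33 + \frac{3 + 3}{6} = 4.33 \tag{3.16}$$

    We now repeat using this as the new value for the missing items, then estimate the mean as

    $$\hat{\mu}^2 = \frac{\sum_{i=1}^{k} x_i}{n} + \frac{\sum_{i=k+1}^{n} x_i}{n} = 3.33 + \frac{4.33 + 4.33}{6} = 4.77 \tag{3.17}$$

    Repeating, we obtain

    $$\hat{\mu}^3 = \frac{\sum_{i=1}^{k} x_i}{n} + \frac{\sum_{i=k+1}^{n} x_i}{n} = 3.33 + \frac{4.77 + 4.77}{6} = 4.92 \tag{3.18}$$

    and then

    $$\hat{\mu}^4 = \frac{\sum_{i=1}^{k} x_i}{n} + \frac{\sum_{i=k+1}^{n} x_i}{n} = 3.33 + \frac{4.92 + 4.92}{6} = 4.97 \tag{3.19}$$

    We decide to stop here because the last two estimates are only 0.05 apart. Thus, our estimate is $\hat{\mu} = 4.97$.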
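    The iteration in this example is easy to reproduce. The sketch below is one minimal reading of it: the function name em_mean, the stopping tolerance of 0.05, and the iteration cap are assumptions, and the code keeps full floating-point precision rather than two-decimal intermediate values.

```python
def em_mean(observed, n_missing, initial_guess, tolerance=0.05, max_iterations=100):
    """Estimate a mean when n_missing values are absent from the sample."""
    n = len(observed) + n_missing
    observed_sum = sum(observed)
    estimate = initial_guess
    history = [estimate]
    for _ in range(max_iterations):
        # Expectation step: stand in for each missing value with the current mean.
        imputed_sum = n_missing * estimate
        # Maximization step: the sample mean of the completed data is the
        # maximum likelihood estimate of a normal mean.
        new_estimate = (observed_sum + imputed_sum) / n
        history.append(new_estimate)
        converged = abs(new_estimate - estimate) <= tolerance
        estimate = new_estimate
        if converged:
            break
    return estimate, history

estimate, history = em_mean([1, 5, 10, 4], n_missing=2, initial_guess=3)
```

    With these inputs the loop stops after four updates at roughly 4.975. The successive values agree with the worked figures to within the two-decimal arithmetic used there, and continuing the iteration would approach 5, the mean of the four observed values.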

    One of the basic guidelines in estimating is Occam's Razor, which basically states that simpler models generally yield the best results.

    3.2.2 Models Based on Summarization

    T