data mining classification

Upload: lan-anh

Post on 22-Jul-2015


Data mining - classification

Lecturer: Nguyễn Quỳnh Chi

Students: Trần Tuấn Anh, Đinh Thị Thanh Hng, Nguyễn Trọng Th


INTRODUCTION

The rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). As the content and number of Web pages change and grow by the day and by the hour, finding information becomes ever harder for users. The need to search unstructured databases has grown largely alongside the Internet itself. Through the Internet, people have become used to Web pages and the vast amount of information they carry. In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page. Compared with other services, such as buying and selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world much faster. A Web page can be likened to an encyclopedia. Information on the Web is diverse in both content and form. The Internet can be seen as a virtual society that holds information about every aspect of economic and social life, presented as text, images, sound and so on.

This diversity and volume, however, have created a problem of information overload. Users cannot find by themselves the addresses of the Web pages that contain the information they need, so a utility is required that manages the content of Web pages and returns the addresses of pages whose content matches the searcher's request. Such utilities manage data as unstructured objects. Familiar examples are Yahoo, Google and AltaVista.

On the other hand, suppose we run Web pages on topics such as informatics, sport, economy and society, and construction. By classifying the documents that customers view or download, we learn which content on our site they concentrate on, so we can add more documents on the topics they care about and fewer on the others. On the customer side, the same analysis shows which topics a customer focuses on, so we can offer that customer additional support. Given these practical needs, Web page classification and search remains an interesting problem that calls for further research.

[D08 HTTT1]


TABLE OF CONTENTS

Introduction
    Data mining
        Concept
        Advantages of data mining
        Data mining techniques
        Decision trees
    The Weka data mining tool
        Functions of the Weka Explorer
        Exploring the data
Classifying data with decision trees
    Overview of data classification in data mining
        Data classification
    Decision trees in data classification
        Definition
        The C4.5 algorithm
Practice
    Introduction to the dataset
    Analysis of results


Introduction

Data mining

Concept

Data mining is defined as the process of extracting valuable hidden information from the large volumes of data stored in databases, data warehouses and similar stores. Besides "data mining", several other terms with a similar meaning are in use: knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology and data dredging. Many people treat data mining and another common term, Knowledge Discovery in Databases (KDD), as the same thing. In practice, however, data mining is only one essential step in the process of knowledge discovery in databases. That process consists of the following steps:

Step 1: Data cleaning: remove noise and unsuitable data.
Step 2: Data integration: combine data from different sources such as databases, data warehouses and text files.
Step 3: Data selection: retrieve from the original sources the data directly relevant to the task.
Step 4: Data transformation: convert the data into a form suitable for mining by applying summary or aggregation operations.
Step 5: Data mining: the essential stage, in which intelligent methods are applied to extract data patterns.
Step 6: Pattern evaluation: assess how useful the patterns that represent knowledge are, using a number of measures.
Step 7: Knowledge presentation: use presentation and visualization techniques to show the mined knowledge to the user.
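The seven steps above can be sketched as a chain of small functions. This is only an illustration on toy records: the two sources, the field names, the age threshold and the "support" count used as the evaluation measure are all hypothetical and do not come from the report.

```python
from collections import Counter

# Hypothetical toy sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "buys": "yes"}, {"id": 2, "age": None, "buys": "no"}]
source_b = [{"id": 3, "age": 47, "buys": "no"}, {"id": 4, "age": 31, "buys": "yes"}]

def clean(rows):
    # Step 1: drop records that contain a missing value.
    return [r for r in rows if None not in r.values()]

def integrate(*sources):
    # Step 2: merge several cleaned sources into one list.
    return [r for source in sources for r in source]

def select(rows, fields):
    # Step 3: keep only the fields relevant to the task.
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    # Step 4: summarise the numeric age into two groups.
    return [{"age_group": "young" if r["age"] < 40 else "older", "buys": r["buys"]}
            for r in rows]

def mine(rows):
    # Step 5: count how often each (age_group, buys) pattern occurs.
    return Counter((r["age_group"], r["buys"]) for r in rows)

def evaluate(patterns, min_count):
    # Step 6: keep only patterns seen at least min_count times.
    return {p: c for p, c in patterns.items() if c >= min_count}

def present(patterns):
    # Step 7: render the surviving patterns as readable lines.
    return [f"{g} -> buys={b} (support {c})" for (g, b), c in sorted(patterns.items())]

rows = transform(select(integrate(clean(source_a), clean(source_b)), ["age", "buys"]))
report = present(evaluate(mine(rows), min_count=2))
print(report)
```

Each function stands in for a whole family of techniques; the point is only the order in which the stages feed each other.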


Data mining and knowledge discovery in databases draw on methods, algorithms and techniques from many research fields, such as machine learning, pattern recognition, databases, statistics, artificial intelligence and knowledge acquisition for expert systems, all aimed at one goal: extracting knowledge from the data held in very large databases. Compared with these other methods, data mining has several clear advantages.

Advantages of data mining

Data mining has many applications, and its main advantages are considered below:

+ Compared with machine learning, data mining can be used on databases that contain a lot of noise, incomplete data or continuously changing data, whereas machine learning methods are applied mainly to databases that are complete, change little and are not too large.
+ Expert systems: this approach differs from data mining in that the examples supplied by experts are usually of much higher quality than the data in a database, and they usually cover only the important cases. In addition, the experts confirm the value and usefulness of the patterns discovered.
+ Statistics is one of the theoretical foundations of data mining, but a comparison shows weaknesses in statistical methods that data mining overcomes:
    - Standard statistical methods are not suited to the structured data types found in many databases.
    - Statistical methods are entirely data-driven; they do not use existing domain knowledge.
    - The results of a statistical analysis can be very numerous and hard to interpret.

With these advantages, data mining is being applied to personnel data, where it copes with data that change and grow constantly and finds hidden information that other methods cannot detect.


Data mining techniques

Data mining techniques are usually divided into two main groups:

- Descriptive data mining techniques describe the properties or general characteristics of the data in an existing database. They include clustering, summarization, visualization, change and deviation detection, association rule analysis and so on.
- Predictive data mining techniques make predictions by inference from current data. They include classification, regression and so on.

The three most common data mining methods are clustering, classification and association rule mining. Only classification is considered here.

Data classification: the goal of classification is to predict the class label of data samples. The process usually has two steps: building the model, and using the model to classify data.

Step 1: a model is built by analysing the available data samples. Each sample belongs to one class, determined by an attribute called the class attribute. These samples are called the training data set. The class labels of the training set must all be known before the model is built, which is why the method is called supervised learning, in contrast to clustering, which is unsupervised learning.

Step 2: the model is used to classify data. First the accuracy of the model must be estimated. If the accuracy is acceptable, the model is used to predict class labels for future data samples.

Regression differs from classification in that regression predicts continuous values, whereas classification predicts only discrete values.
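The two steps of classification can be illustrated with a deliberately tiny model. The sketch below assumes a single numeric attribute and a nearest-class-mean rule; the rule, the data and the 0.6 acceptance threshold are hypothetical choices made for illustration and are not the method used later in the report.

```python
def build_model(training):
    # Step 1: learn from labelled samples; here, one mean value per class label.
    sums = {}
    for value, label in training:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def predict(model, value):
    # Assign the label whose class mean is closest to the value.
    return min(model, key=lambda label: abs(model[label] - value))

def accuracy(model, labelled):
    # Step 2a: estimate accuracy on labelled samples not used for training.
    hits = sum(1 for value, label in labelled if predict(model, value) == label)
    return hits / len(labelled)

training = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
held_out = [(1.5, "low"), (7.0, "high"), (5.2, "low")]

model = build_model(training)      # class means: low 1.5, high 8.5
acc = accuracy(model, held_out)    # two of the three held-out samples are correct

# Step 2b: use the model on new data only if the accuracy is acceptable.
if acc >= 0.6:
    print(predict(model, 3.0))
```

The class labels are needed both for training and for the accuracy estimate, which is what makes the approach supervised.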


Decision trees

In data classification, the decision tree is an intuitive, visual form of the model. The sections that follow present the role and evaluation of decision trees in data mining.

The Weka data mining tool

Functions of the Weka Explorer

The main functions of the Weka Explorer appear as tabs on the main screen:

- Preprocess: opens, edits and saves a data file; this tab holds the algorithms used for data preprocessing.
- Classify: provides classification and regression models.
- Cluster: provides clustering models.
- Associate: mines frequent itemsets and association rules.
- Select attributes: selects the most relevant attributes in the data set.
- Visualize: displays the data as plots.


Exploring the data

Using the Preprocess tab

(1) Open file: opens a data file.
(2) Edit: displays the data and lets you edit it by hand if needed.
(3) Save: saves the current data to a file. The Weka Explorer supports several formats, including arff and csv.
(4) Filter: data preprocessing operations are called filters.
(5) Selected attribute: information about the currently selected attribute:
    o Type: the data type of the attribute (numeric; nominal, that is discrete and non-numeric; ordinal; binary)
    o Missing: the number of samples with no value for the attribute
    o Distinct: the number of distinct values
    o Unique: the number of samples whose value is not shared by any other sample
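The Missing, Distinct and Unique figures shown for a selected attribute can be reproduced with a few lines of counting. The sketch below is an independent illustration, not Weka code; it assumes missing values are written as "?" and uses a made-up column of values.

```python
from collections import Counter

def attribute_summary(values, missing_marker="?"):
    # Values equal to the marker count as missing and are ignored otherwise.
    present = [v for v in values if v != missing_marker]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),                                  # different values seen
        "unique": sum(1 for c in counts.values() if c == 1),      # values seen exactly once
    }

summary = attribute_summary(["a", "b", "a", "?", "c"])
print(summary)
```

Here "a" occurs twice, so it is distinct but not unique, while "b" and "c" are both.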

Using the Classify tab

(1) Classifier: selects the classifier and its parameters.
(2) Test options: the options for testing the model:
    o Use training set: tests on the training data itself.
    o Supplied test set: tests on a separate data set.
    o Cross-validation: splits the data into several parts (folds) and evaluates the result over several rounds.
    o Percentage split: splits the data into two parts by percentage; one part builds the model and the rest is used for testing.
(3) Result list: the list of results from each run of the algorithm; interacting with this list gives access to additional functions.
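To make the cross-validation option concrete, the sketch below generates the train/test index sets for k folds. It is a simplified assumption of how folds can be formed (round-robin assignment, no shuffling, no stratification by class); real tools usually randomise and stratify the data first, so the exact folds will differ.

```python
def k_fold_indices(n, k):
    # Fold i holds samples i, i + k, i + 2k, ...; each fold is the test set once.
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print("train", train, "test", test)
```

Every sample is tested exactly once, which is why a cross-validation run reports results over the whole data set rather than over a held-out fraction.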


Classifying data with decision trees

Overview of data classification in data mining

Data classification

One of the main tasks of data mining is solving the classification problem. The input is a set of training samples that have already been classified, each described by a number of attributes. Attributes are of two kinds, continuous and discrete. Among the discrete attributes there is one special attribute, the class attribute, whose values are called class labels. A continuous attribute takes ordered values, whereas a discrete attribute takes unordered values. An attribute may also take an undefined value (for example, when its value cannot be known for objective reasons). Note that the class label of every sample must be defined. The task of classification is to establish a mapping from attribute values to class labels. The model that represents this relationship is then used to assign class labels to new observations that are not in the original sample set.

[Figure: input data → classification algorithm → Class 1, Class 2, ..., Class n]

In practice there is a need to extract intelligent business decisions from a database that holds a great deal of hidden information. Classification and prediction are two forms of data analysis that extract a model describing important data classes or predicting future data trends. Classification predicts categorical labels or discrete values; that is, it works with data objects whose set of possible values is known in advance. Prediction, in contrast, builds models with continuous-valued functions. For example, a weather classification model can tell whether tomorrow will be rainy or sunny from measurements such as humidity, wind strength and temperature for today and the preceding days. From rules about customers' buying trends in a supermarket, sales staff can make sound decisions about the quantity and kinds of goods to stock. A prediction model can estimate how much potential customers will spend from information about their income and occupation.

In recent years data classification has attracted researchers in many fields, such as machine learning, expert systems and statistics. The technology is also applied in many areas: commerce, banking, marketing, market research, insurance, health care, education and so on.

The classification process has two steps.

Step 1 (learning)

Learning builds a model that describes a predefined set of data classes or concepts. The input is a structured data set described by attributes and made up of tuples of values of those attributes. Each tuple is called a data tuple, and may also be called a sample, example, object, record or case; the thesis uses these terms interchangeably. Each data tuple is assumed to belong to a predefined class, given by the value of an attribute chosen as the class label attribute. The output of this step is usually a set of classification rules in the form of if-then rules, a decision tree, a logic formula or a neural network. The process is shown in the figure:

[Figure: Training data → Classification algorithm → Classifier (model)]

Practice

Introduction to the dataset

No.  Attribute  Range of values
1    P_i        …
2    P_t        … to 49.432
3    L_l_a      14 to 125.742
4    S_s        13.367 to 121.43
5    P_r        70.083 to 163.071
6    D_s        -11.058 to 418.543
7    class      Normal, Abnormal (class attribute)

Example values: 63.02, 39.05, 68.82 (P_i); 22.52

Class distribution:
    o Normal: 100 (32.258%)
    o Abnormal: 210 (67.742%)


ARFF is a text file format with two parts, a declaration section and a data section:

Declaration section

    @relation column_2C_weka
    @attribute pelvic_incidence numeric
    @attribute pelvic_tilt numeric
    @attribute lumbar_lordosis_angle numeric
    @attribute sacral_slope numeric
    @attribute pelvic_radius numeric
    @attribute degree_spondylolisthesis numeric
    @attribute class {Abnormal, Normal}

Data section

    @data
    63.0278175,22.55258597,39.60911701,40.47523153,98.67291675,0.254399986,Abnormal
    39.05695098,10.06099147,25.01537822,28.99595951,114.4054254,4.564258645,Abnormal
    68.83202098,22.21848205,50.09219357,46.61353893,105.9851355,3.530317314,Abnormal
    69.29700807,24.65287791,44.31123813,44.64413017,101.8684951,11.21152344,Abnormal
    49.71285934,9.652074879,28.317406,40.06078446,108.1687249,7.918500615,Abnormal
    40.25019968,13.92190658,25.1249496,26.32829311,130.3278713,2.230651729,Abnormal

o Declaration section: one @relation line followed by one @attribute line per attribute. The data types are:

    Numeric   numeric data    Example: @ATTRIBUTE name numeric
    Nominal   discrete data   Example: @ATTRIBUTE class {setosa, versicolor}
    String    string data     Example: @ATTRIBUTE name string
    Date      date data       Example: @ATTRIBUTE discovered date

  Missing data is written as a question mark (?).

o Data section: each sample is on one line; the attribute values are listed in order from left to right and separated by commas.

The file can be displayed with ArffViewer.
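The two-part layout described above is simple enough to read with a short script. The following is a minimal sketch under stated assumptions, not a full ARFF reader: it ignores quoted names, sparse rows and date formats, keeps every value as a string, and maps "?" to None. The second data row in the sample is hypothetical and is there only to show a missing value.

```python
def parse_arff(text):
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()                 # keywords are case-insensitive
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            attributes.append((name, kind.strip()))
        elif keyword == "@data":
            in_data = True                        # everything after this is data
    return relation, attributes, rows

sample = """@relation column_2C_weka
@attribute pelvic_incidence numeric
@attribute class {Abnormal, Normal}
@data
63.0278175,Abnormal
?,Normal
"""
relation, attributes, rows = parse_arff(sample)
print(relation, attributes, rows)
```

Keeping the declaration and data passes in one loop mirrors the file itself: the @data line is the only switch between the two sections.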

Meaning of the attributes

No.  Attribute                        Meaning
1    Pelvic_incidence = Pi            Pelvic incidence
2    Pelvic_tilt = Pt                 Pelvic tilt
3    Lumbar_lordosis_angle = lla      Lumbar lordosis angle (forward curvature of the lumbar spine)
4    Sacral_slope = Ss                Sacral slope
5    Pelvic_radius = Pr               Pelvic radius
6    Degree_spondylolisthesis = Ds    Degree of spondylolisthesis
7    Class: Normal, Abnormal          Class: normal, abnormal

Analysis of results

The J48 (C4.5) algorithm provided by Weka was used to train on the data set. The decision tree produced by the algorithm is:

[Figure: J48 decision tree]

The classification performance of the algorithm on this data set was evaluated with two methods:

Cross-validation

First test: 10 folds.

                          Samples   Rate
Correctly classified      253       81.6129%
Incorrectly classified    57        18.3871%
Unclassified              0         0
Total                     310

Second test: 12 folds (more than 10).

                          Samples   Rate
Correctly classified      255       82.2581%
Incorrectly classified    55        17.7419%
Unclassified              0         0
Total                     310

Fifth test: 15 folds (more than 10).

                          Samples   Rate
Correctly classified      260       83.871%
Incorrectly classified    50        16.129%
Unclassified              0         0
Total                     310
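The rates in the cross-validation runs above follow directly from the sample counts. The short check below recomputes them from the reported counts for 10, 12 and 15 folds; only the helper name is introduced here.

```python
# Reported (correct, incorrect) counts per number of folds.
runs = {10: (253, 57), 12: (255, 55), 15: (260, 50)}

def rates(correct, incorrect):
    # Percentage of correctly and incorrectly classified samples, 4 decimals.
    total = correct + incorrect
    return round(100 * correct / total, 4), round(100 * incorrect / total, 4)

for folds, (correct, incorrect) in runs.items():
    print(folds, rates(correct, incorrect))
```

All three runs classify the same 310 samples, so their rates are directly comparable.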

With cross-validation, the highest classification accuracy, 83.871%, was reached with Folds = 15, over 310 test samples.

Percentage split: this shows which split percentage gives the highest classification accuracy.

First test: 66% split.

                          Samples   Rate
Correctly classified      90        85.7143%
Incorrectly classified    15        14.2857%
Unclassified              0         0
Total                     105

Second test: 60% split (below 66%).

                          Samples   Rate
Correctly classified      97        78.2258%
Incorrectly classified    27        21.7742%
Unclassified              0         0
Total                     124

Third test: 70% split (above 66%).

                          Samples   Rate
Correctly classified      76        81.7204%
Incorrectly classified    17        18.2796%
Unclassified              0         0
Total                     93

Fifth test: 75% split (above 66%).

                          Samples   Rate
Correctly classified      65        84.4156%
Incorrectly classified    12        15.5844%
Unclassified              0         0
Total                     77
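The test-set sizes in the percentage-split runs (105, 124, 93 and 77 samples) are consistent with holding out whatever remains after the training share of the 310 samples. The sketch below assumes the training size is that share rounded to the nearest integer, with halves rounded up; this rounding rule is an assumption chosen because it reproduces the reported sizes, not something stated in the report.

```python
import math

def split_sizes(n, percent):
    # Assumed rule: training size = n * percent / 100, rounded half up.
    train = math.floor(n * percent / 100 + 0.5)
    return train, n - train

for percent in (66, 60, 70, 75):
    print(percent, split_sizes(310, percent))
```

Because each split tests on a different, much smaller subset, accuracies from different percentages are less comparable than the cross-validation results.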

With percentage split, the 66% split gave the highest classification accuracy, 85.7143%, but it classifies only 105 samples rather than 310, so it is not taken as the best result. The observations below are drawn from the decision tree evaluated with cross-validation.

Classifier output: the results are listed as text, in the following distinct parts.

Run information: general information about the algorithm used, the data and the data set.

Classifier model: the details of the classification model; for some classifiers, however, the model cannot be shown completely as text.

Summary: overall statistics on the accuracy of the classifier in the test run.


The correctly and incorrectly classified instances give the percentage of test instances that were classified correctly and incorrectly. The raw counts are shown in the confusion matrix, where a and b stand for the class labels. There are 310 instances here, so the percentages and the raw counts agree: aa + bb = 191 + 69 = 260 and ab + ba = 19 + 31 = 50. The percentage of correctly classified instances is usually called accuracy, or sample accuracy. As a performance estimate it has some drawbacks (it is not corrected for chance and is not sensitive to the class distribution), so it is worth looking at some of the other figures as well.

Kappa is a chance-corrected measure of agreement between the classifications and the true classes. It is computed by subtracting the agreement expected by chance from the observed agreement and dividing by the maximum possible agreement beyond chance. Kappa is at most 1, and a value greater than 0 means that the classifier is doing better than chance (as it really should).

The error rates are used for numeric prediction rather than classification. In numeric prediction a prediction is not simply right or wrong; the error has a magnitude, and these measures reflect that.
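The kappa statistic described above can be computed from the four counts of the confusion matrix. The sketch below assumes rows are the actual classes and columns the predicted classes, so that the first row is (aa, ab) = (191, 19) and the second is (ba, bb) = (31, 69); the function name is introduced here for illustration.

```python
def kappa(matrix):
    n = sum(sum(row) for row in matrix)
    size = len(matrix)
    # Observed agreement: share of instances on the main diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / n
    # Agreement expected by chance: product of row and column totals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (n * n)
    return (observed - expected) / (1 - expected)

confusion = [[191, 19],   # actual a: classified as a, as b
             [31, 69]]    # actual b: classified as a, as b
k = kappa(confusion)
print(round(k, 3))        # about 0.619
```

The observed agreement is 260/310, about 0.839, while chance alone would give about 0.577, so the classifier recovers roughly 62% of the agreement that chance leaves available.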

Detailed Accuracy By Class and Confusion Matrix: detailed accuracy results of the classifier for each class.

The confusion matrix is a 2x2 matrix. The number of correctly classified instances is the sum of its main diagonal, aa + bb.


TP rate (true positive rate): the proportion of examples classified as class x among all examples that really belong to class x. In the confusion matrix this is the diagonal element divided by the sum of its row: 191/(191+19) = 0.91 for class a and 69/(69+31) = 0.69 for class b.

FP rate (false positive rate): the proportion of examples classified as class x but belonging to another class, among all examples that are not of class x. In the confusion matrix this is the column sum of class x minus the diagonal element, divided by the row sums of the other classes: 31/(31+69) = 0.31 for class a and 19/(191+19) = 0.09 for class b.

Precision: the proportion of examples that really belong to class x among all those classified as class x:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: the proportion of positive instances that are retrieved; it is equal to the TP rate.

F-measure: the harmonic mean of precision and recall:

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
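The per-class figures defined above can all be derived from the confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, with counts aa = 191, ab = 19, ba = 31, bb = 69; the function and key names are introduced here for illustration.

```python
def class_metrics(matrix, i):
    tp = matrix[i][i]                              # class i classified as i
    fn = sum(matrix[i]) - tp                       # class i classified as something else
    fp = sum(row[i] for row in matrix) - tp        # other classes classified as i
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return {
        "tp_rate": tp / (tp + fn),                 # same as recall
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

confusion = [[191, 19],   # actual a
             [31, 69]]    # actual b
metrics_a = class_metrics(confusion, 0)
metrics_b = class_metrics(confusion, 1)
print({k: round(v, 3) for k, v in metrics_a.items()})
print({k: round(v, 3) for k, v in metrics_b.items()})
```

The smaller class b has both lower recall and lower precision than class a, a difference that the single overall accuracy figure hides.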
