data and knowledge visualization in knowledge discovery process

Upload: hoinongdan

Post on 07-Jul-2018


  • 8/19/2019 Data and Knowledge Visualization in Knowledge Discovery Process

    1/6

    Visualization Support for a User-Centered KDD Process

    Tu Bao Ho
    Japan Advanced Institute of Science and Technology
    Tatsunokuchi, Ishikawa 923-1292 Japan
    81-761-51-1730
    bao@jaist.ac.jp

    Trong Dung Nguyen
    Japan Advanced Institute of Science and Technology
    Tatsunokuchi, Ishikawa 923-1292 Japan
    +81-761-51-1732
    nguyen@jaist.ac.jp

    Dung Duc Nguyen
    Japan Advanced Institute of Science and Technology
    Tatsunokuchi, Ishikawa 923-1292 Japan
    81-761-51-1732
    dungduc@jaist.ac.jp

    ABSTRACT

    Viewing knowledge discovery as a user-centered process that requires an effective
    collaboration between the user and the discovery system, our work aims to support an
    active role of the user in that process by developing synergistic visualization tools
    integrated in our discovery system D2MS. These tools provide the ability to visualize
    the entire process of knowledge discovery in order to help the user with data
    preprocessing, selecting mining algorithms and parameters, evaluating and comparing
    discovered models, and taking control of the whole discovery process. Our case studies
    with two medical datasets on meningitis and stomach cancer show that, with the
    visualization tools in D2MS, the user gains better insight into each step of the
    knowledge discovery process as well as into the relationship between data and
    discovered knowledge.

    Keywords

    model selection, knowledge discovery process, data and knowledge visualization,
    the user's active role.

    1. INTRODUCTION

    The process of knowledge discovery in databases (KDD) can be viewed as inherently
    consisting of five steps: (1) understanding the application domain, (2) data
    preprocessing, (3) data mining, (4) post-processing, and (5) applying discovered
    knowledge, where each step requires many decisions to be made by the user [10]. To
    find implicit but potentially useful patterns/models from large databases, one cannot
    expect just to push a large amount of data into a KDD system without the user's
    participation. In other words, the KDD process can alternatively be viewed as a
    process of model selection, i.e., that of the user choosing the most interesting
    discovered patterns/models, or the algorithms and their settings for obtaining such
    patterns/models, in a given application. Model selection in KDD is a complicated
    human-centered and domain-centered process in which the participation of the user
    plays a key role in its success.

    Visualization has proven its effectiveness in exploratory data analysis [2] and has a
    high potential in knowledge discovery in databases [3], [9]. In this paper we
    introduce the knowledge discovery system D2MS (Data Mining with Model Selection),
    which makes two main contributions to visual knowledge discovery. The first is its
    efficient visualizers for large multidimensional databases, discovered rules, and
    hierarchical structures, as well as a synergistic visualization of data and
    knowledge. In particular, the novel visualization technique T2.5D (Trees 2.5
    Dimensions) for large hierarchical structures can be seen as an alternative to
    powerful techniques for representing large hierarchical structures such as cone trees
    [14] or hyperbolic trees [8]. The second is its tight integration of the visualizers
    with functions in each step of the knowledge discovery process to support model
    selection.

    Permission to make digital or hard copies of all or part of this work for personal or
    classroom use is granted without fee provided that copies are not made or distributed
    for profit or commercial advantage and that copies bear this notice and the full
    citation on the first page. To copy otherwise, or republish, to post on servers or to
    redistribute to lists, requires prior specific permission and/or a fee.
    SIGKDD 02, July 23-26, 2002, Edmonton, Alberta, Canada.
    Copyright 2002 ACM 1-58113-567-X/02/0007...$5.00.

    2. MODEL SELECTION IN D2MS

    Figure 1 shows a conceptual architecture of D2MS, in which the Data Mining component
    currently includes a decision tree learning subsystem (CABRO [11]) and a rule
    learning subsystem (LUPC [6]).

    2.1 User-centered model selection

    The interestingness of discovered patterns/models is commonly characterized by
    several criteria: evidence indicates the significance of a finding measured by a
    statistical criterion; redundancy amounts to the similarity of a finding with respect
    to other findings and measures to what degree a finding follows from another one;
    usefulness relates a finding to the goal of the users; novelty includes the deviation
    from prior knowledge of the user or system; simplicity refers to the syntactical
    complexity of the presentation of a finding; and generality is determined by the
    fraction of the population a finding refers to. The interestingness can be seen as a
    function of the above criteria, and strongly depends on the user as well as on
    his/her domain knowledge.

    Figure 1: Conceptual architecture of the system


    The key idea of our solution to model selection in D2MS is to support an effective
    participation of the user in this process. Concretely, D2MS first supports the user
    in doing trials on combinations of algorithms and their parameter settings in order
    to produce competing models, and then it supports the user in evaluating them
    quantitatively and qualitatively by providing both performance metric values and
    visualizations of these models (Figure 2).

    2.2 Plan and plan manager

    Model selection in D2MS mainly involves the three steps of data pre-processing, data
    mining, and post-processing, as shown in Figure 1. There are three phases in doing
    model selection in D2MS, and all are managed by the plan management module: (i)
    registering plans of selected algorithms and their settings; (ii) executing the plans
    to discover models; (iii) selecting appropriate models by a comparative evaluation of
    competing models. These phases can be done interactively in all three steps of the
    KDD process.

    The first phase is to register plans. A plan is an ordered list of algorithms
    associated with their parameter settings that can yield a model or an intermediate
    result when executed. The plans are represented in a tree form called a plan tree,
    whose nodes are selected algorithms associated with their settings (the top-left
    window in Figure 3). The nodes on a path of the plan tree must follow the order of
    preprocessing, data mining, and postprocessing. A plan may contain several algorithms
    (nodes) of the preprocessing and postprocessing steps, for example filling missing
    values by natural cluster-based mean-and-mode and then discretizing continuous
    attributes by an entropy-based algorithm in preprocessing. A plan can be edited in
    D2MS during the KDD process.
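
    The plan tree described above can be sketched as a small data structure. This is
    only an illustration of the ordering constraint and of how root-to-leaf paths form
    plans; the class name PlanNode, the function plans, and the parameter names
    max_depth and min_accuracy are assumptions made for the example, not part of D2MS.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Step order that every path of the plan tree must respect (taken from the text).
STEP_ORDER = {"preprocessing": 0, "data mining": 1, "postprocessing": 2}


@dataclass
class PlanNode:
    """Hypothetical plan-tree node: an algorithm plus its parameter settings."""
    algorithm: str
    step: str
    settings: Dict[str, object] = field(default_factory=dict)
    children: List["PlanNode"] = field(default_factory=list)

    def add(self, child: "PlanNode") -> "PlanNode":
        # A child may stay in the same step or move to a later one, never back.
        if STEP_ORDER[child.step] < STEP_ORDER[self.step]:
            raise ValueError(f"{child.step} cannot follow {self.step}")
        self.children.append(child)
        return child


def plans(node: PlanNode, prefix=()):
    """Yield every plan, i.e. every root-to-leaf list of algorithm names."""
    path = prefix + (node.algorithm,)
    if not node.children:
        yield list(path)
    for child in node.children:
        yield from plans(child, path)


# Two preprocessing nodes followed by two competing miners.
# The settings below are invented placeholders.
root = PlanNode("fill missing values", "preprocessing")
disc = root.add(PlanNode("entropy-based discretization", "preprocessing"))
disc.add(PlanNode("CABRO", "data mining", {"max_depth": 5}))
disc.add(PlanNode("LUPC", "data mining", {"min_accuracy": 0.9}))

all_plans = list(plans(root))
```

    Sharing the preprocessing prefix between the two miners mirrors the idea that
    competing plans branch from a common intermediate result.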

    The second phase is to execute registered plans. While gradually registering a plan,
    the user can run an algorithm just after adding it to the plan, then evaluate its
    results before deciding whether to continue this plan from its current stage with
    other algorithms, or to backtrack and try the plan with another algorithm or setting.
    The user can also run a plan after fully registering it, or even register a number of
    plans and then run them all together. The intermediate results, the discovered
    models, and their summaries and exported forms are automatically created and stored
    in the model base.

    The third phase is the selection of appropriate models by the user. D2MS provides a
    summary table presenting performance metrics of discovered models according to
    executed plans (the bottom-right window in Figure 3). However, the user can also
    evaluate each model in depth by visualizing it, browsing its structure, checking its
    relationship with the dataset, etc. (the top-middle and top-right windows in Figure
    3). The user can also visualize several models simultaneously to compare them. By
    gaining insight into competing models, the user can make a better-informed selection
    of models.

    2.3 Process visualization

    The plan manager allows the user to create, edit, and manage plans with the help of
    the plan visualizer. In Figure 3, the plan visualizer displays a plan tree in the
    top-left window. A plan tree is displayed in the form of a tree, where each node
    represents a dataset, a model, or an algorithm. If the user double-clicks on a data
    or model node, the data or model visualization is activated to show the user the
    corresponding data or model. Otherwise, if the user double-clicks on an algorithm
    node, a dialog box is activated for the user to enter or change the parameters of
    that algorithm. Therefore, the plan visualizer can serve as a hub for the user to
    activate other visualizers easily.

    Another function of the plan visualizer is to allow the user to follow the discovery
    process. While a plan tree is running, the finished part of the plan tree is changed
    to gray, and the currently running node blinks. The user can then easily suspend,
    continue, or run the plan tree step by step.

    3. DATA AND MODEL VISUALIZATION

    3.1 Data Visualization

    We have chosen the parallel coordinates technique for visualizing 2D tabular datasets
    defined by n rows and p columns. D2MS improves parallel coordinates in several ways
    to adapt

    3.1.1 Viewing original data

    The basic idea of viewing a p-dimensional dataset by parallel coordinates is to use p
    equally spaced axes (which are parallel to one of the screen axes and correspond to
    attributes, with the ends of the axes corresponding to the minimum and maximum values
    for each dimension) to represent each data instance as a polyline that crosses each
    axis at a position proportional to its value for that dimension. This view gives the
    user a rough idea about the distribution of data on the values of each attribute; in
    particular, the colors of different classes can in many cases show clearly how the
    classes differ from each other. An example of the stomach cancer data is visualized
    in the bottom-left window in Figure 4, where the dataset is shown in the top-left
    window.
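
    The mapping from a data instance to a polyline can be sketched as follows. This is a
    minimal illustration of the general parallel coordinates idea, not D2MS code; the
    function name polyline, the unit-square drawing area, and the toy data are
    assumptions, as is the choice to draw a constant attribute at mid-height.

```python
def polyline(row, mins, maxs, width=1.0, height=1.0):
    """Return the (x, y) vertices of one instance in parallel coordinates."""
    p = len(row)
    # Equally spaced vertical axes across the available width.
    step = width / (p - 1) if p > 1 else 0.0
    points = []
    for j, value in enumerate(row):
        span = maxs[j] - mins[j]
        # Position along the axis is proportional to the value; a constant
        # attribute is drawn at mid-height to avoid dividing by zero.
        t = (value - mins[j]) / span if span else 0.5
        points.append((j * step, t * height))
    return points


# Toy dataset with n = 3 rows and p = 3 columns (values are invented).
data = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [2.0, 30.0, 5.0]]
mins = [min(col) for col in zip(*data)]
maxs = [max(col) for col in zip(*data)]
lines = [polyline(row, mins, maxs) for row in data]
```

    Coloring each polyline by its class value would then give the class separation
    effect mentioned in the text.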

    3.1.2 Summarizing data

    This view is significant as the dataset may be very large. The key idea is not to
    view original data points but to view their summaries on parallel attributes. Like
    WinViz [9], D2MS uses bar charts in the place of attribute values on each axis. The
    bar charts on each axis have the same height (depending on the number of possible
    attribute values) and different widths that signify the frequencies of attribute
    values. D2MS also interactively provides common statistics on each attribute, such as
    mean or mode, median, variance, box plots, etc. The top-right window in Figure 4
    shows the summaries of the stomach cancer data.
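
    A per-axis summary of this kind might be computed as below. The sketch assumes that
    bar height is the axis length divided by the number of distinct values and that bar
    width is proportional to relative frequency; the function name bar_summary and the
    sample values are hypothetical.

```python
from collections import Counter


def bar_summary(values, axis_length=1.0, max_width=1.0):
    """Return (value, bar_height, bar_width) triples for one attribute axis."""
    if not values:
        return []
    counts = Counter(values)
    # All bars on an axis share one height, set by the number of distinct values.
    height = axis_length / len(counts)
    total = len(values)
    # Width encodes the relative frequency of each attribute value.
    return [(v, height, max_width * c / total) for v, c in sorted(counts.items())]


bars = bar_summary(["low", "high", "low", "mid", "low", "high"])
```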

    Figure 2: An illustration of user-centered model selection


    Figure 3: A screenshot of D2MS in selecting models learned from Wisconsin breast
    cancer data. The top-left window shows the plan tree; the top-middle window shows a
    tightly-coupled view of a decision tree learned by CABRO; the top-right window shows
    a rule set learned by LUPC; the bottom-left window displays intermediate computation
    results; and the bottom-right window shows the summary table of performance metrics
    of discovered models.

    3.1.3 Querying data

    This view serves hypothesis generation and hypothesis testing by the user. It allows
    the user to view subsets of the dataset determined by queries. There are three types
    of queries: (i) based on a value of the class attribute, where the query determines
    the subset of all instances belonging to the indicated class; (ii) based on a value
    of a descriptive attribute, where the query determines the subset of all instances
    having this value; (iii) based on a conjunction of attribute-value pairs, where the
    query determines the subset of all instances satisfying this conjunction. The queries
    can be specified just by point-and-click. The subset of instances matching the query
    is visualized in viewing data mode and in summarizing data mode. The gray regions on
    each axis show the proportions of specified instances on the values of this attribute
    (as shown in the bottom-right window in Figure 4).
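
    The three query types can be treated as one operation, since a query on a single
    class or attribute value is a conjunction with one pair. The sketch below assumes
    instances are stored as dictionaries; the function names query and proportions and
    the toy records are invented for illustration.

```python
from collections import Counter


def query(instances, conditions):
    """Return the instances that satisfy every attribute-value pair."""
    return [
        inst for inst in instances
        if all(inst.get(attr) == val for attr, val in conditions.items())
    ]


def proportions(instances, subset, attribute):
    """For each value of an attribute, the share of its instances in the subset."""
    total = Counter(inst[attribute] for inst in instances)
    hit = Counter(inst[attribute] for inst in subset)
    return {value: hit[value] / n for value, n in total.items()}


# Invented toy records, not taken from the paper's datasets.
records = [
    {"age": "old", "stage": "II", "class": "dead"},
    {"age": "old", "stage": "I", "class": "alive"},
    {"age": "young", "stage": "II", "class": "alive"},
    {"age": "old", "stage": "II", "class": "alive"},
]

by_class = query(records, {"class": "alive"})              # type (i)
by_value = query(records, {"age": "old"})                  # type (ii)
by_conj = query(records, {"age": "old", "stage": "II"})    # type (iii)
shares = proportions(records, by_conj, "class")
```

    The values in shares correspond to what the gray regions would encode on the class
    axis for the conjunctive query.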

    3.2 Rule Visualization

    A rule is a pattern related to several attribute-values and a subset of instances.
    What matters in visualizing a rule is how this local structure is viewed in its
    relation to the whole dataset, and how the view supports the user's evaluation of the
    rule's interestingness. D2MS's rule visualizer allows the user to visualize rules in
    the form antecedent → consequent, where antecedent is a conjunction of
    attribute-value pairs, and consequent is a conjunction of attribute-value pairs in
    the case of association rules and a value of the class attribute in the case of
    prediction rules. A rule is simply displayed by the subset of parallel coordinates
    included in antecedent and consequent. The D2MS rule visualizer has the following
    functions:

    3.2.1 Viewing rules

    Each rule is displayed by a polyline that goes through the axes containing the
    attribute-values occurring in the antecedent part of the rule, leading to the
    consequent part of the rule, which is displayed in a different color. In the case of
    prediction rules, the ratio associated with each class in the class attribute
    corresponds to the number of instances of the class covered by the rule over the
    total number of instances in the class. This view gives a first observation of the
    rule quality.
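
    The per-class ratio described above can be written out directly. This is a sketch
    under the assumption that a rule covers an instance when the instance matches every
    attribute-value pair of the antecedent; the names covers and class_ratios and the
    toy instances are hypothetical.

```python
from collections import Counter


def covers(antecedent, instance):
    """True if the instance matches every attribute-value pair of the antecedent."""
    return all(instance.get(attr) == val for attr, val in antecedent.items())


def class_ratios(antecedent, instances, class_attr="class"):
    """Per class: instances of that class covered by the rule / all of that class."""
    totals = Counter(inst[class_attr] for inst in instances)
    covered = Counter(
        inst[class_attr] for inst in instances if covers(antecedent, inst)
    )
    return {cls: covered[cls] / n for cls, n in totals.items()}


# Invented toy instances with two descriptive attributes and a class attribute.
instances = [
    {"a1": "x", "a2": "u", "class": "c1"},
    {"a1": "x", "a2": "v", "class": "c1"},
    {"a1": "x", "a2": "u", "class": "c2"},
    {"a1": "y", "a2": "u", "class": "c2"},
    {"a1": "y", "a2": "v", "class": "c2"},
    {"a1": "y", "a2": "u", "class": "c2"},
]

ratios = class_ratios({"a1": "x"}, instances)
```

    A rule predicting c1 with a high ratio for c1 and a low ratio for c2, as here, is
    the kind of first impression of quality the view is meant to give.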

    3.2.2 Viewing rules and data

    The subset of instances covered by a rule is visualized together with the rule by
    parallel coordinates or by summaries on parallel coordinates. From this subset of
    instances, the user can see the set of rules each of which covers some of these
    instances, or the user can smoothly change the values of an attribute in the rule to
    see other related possible rules. These operations help the user evaluate the quality
    of this rule: a rule is good if the instances covered by it are not recognized by
    other rules, and vice versa. The rules for a class can be displayed together, and the
    instances of the class as well as of other classes covered by these rules are
    displayed.

    3.3 Tree Visualization

    D2MS provides several visualization techniques that allow the user to visualize
    effectively large hierarchical structures.

    3.3.1 Different modes of viewing hierarchical structures

    The D2MS tree visualizer provides multiple views of trees or hierarchical structures
    (Figure 5).

    • Tightly-coupled views: The global view shows the tree structure with nodes in the
    same small size without labels, and therefore it can display a tree fully or a large
    part of it, depending on the tree size. The detailed view shows the tree structure
    and nodes with their labels, associated with operations to display node information.
    The global view is associated with a field-of-view or panner (a wire-frame box) that
    corresponds to the detailed view [7].

    • Customizing views: Initially, according to the user's choice, the tree is either
    displayed fully or with only the root node and its direct sub-nodes. The tree can
    then be collapsed or expanded partially or fully from the root or from any
    intermediate node. Any subtree rooted at an uncollapsed node can be collapsed into
    one node. Thus, the user is able to interactively customize views of the tree to meet
    his/her needs and interests. Also, the user is provided with a focus view on one
    class and its relation to other classes in the whole hierarchical structure, shown
    with different colors.

    • Tiny mode with fish-eye view: Note that no current visualization technique allows
    us to display efficiently the entire tree when it has, say, ten thousand nodes. The
    tightly-coupled views are extended with three viewing modes according to the user's
    choice: normal size, small size and tiny size. Fish-eye is an interesting variant of
    the classic overview-detail browser, proposed in [4]. This view distorts the
    magnified image so that the center of interest is displayed at high magnification,
    and the rest of the image is progressively compressed. In the D2MS tree visualizer,
    we define three fish-eye components as follows: (i) Focal point f: some node of
    current interest in the tree; (ii) Distance from focal point f to a node x:
    D(f, x) = d(f, x), where d(x, y) between two points x and y on the tree is the number
    of links intervening on the path connecting them in the tree; (iii) Level of detail,
    importance, resolution: LOD(x) = -d(r, x), where r is the root of the tree.
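
    The fish-eye components listed above can be computed on a tree stored as a
    child-to-parent map, as in the sketch below. The functions d and lod follow the
    definitions in the text. The final function, which combines them as
    LOD(x) - D(f, x), is an assumption in the spirit of generalized fish-eye views; the
    text does not state how D2MS combines the two quantities. All names and the toy tree
    are hypothetical.

```python
def ancestors(parent, x):
    """Path from x up to the root, x first."""
    path = [x]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def d(parent, x, y):
    """Number of links on the tree path connecting x and y."""
    up_x = ancestors(parent, x)
    depth_in_x = {node: i for i, node in enumerate(up_x)}
    for j, node in enumerate(ancestors(parent, y)):
        if node in depth_in_x:  # lowest common ancestor reached
            return depth_in_x[node] + j
    raise ValueError("nodes are not in the same tree")


def lod(parent, root, x):
    """Level of detail as defined in the text: LOD(x) = -d(r, x)."""
    return -d(parent, root, x)


def degree_of_interest(parent, root, focus, x):
    # Assumed combination (not stated in the text): LOD(x) - D(f, x).
    return lod(parent, root, x) - d(parent, focus, x)


# Toy tree: r has children a and b; a has a1 and a2; b has b1.
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a", "b1": "b"}
doi = {x: degree_of_interest(parent, "r", "a1", x) for x in parent}
```

    With the focus on a1, every node on the path from the root to a1 receives the same
    score under this assumed combination, while nodes in other branches score lower the
    farther they are from that path.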


    3.3.2 Trees 2.5 Dimensions

    T2.5D is inspired by the work of Reingold and Tilford [13], which draws tidy trees in
    reasonable time and storage. Unlike the tightly-coupled and fish-eye views, which can
    be seen as location-based views (views of objects in a region), T2.5D can be seen as
    a relation-based view (a view of related objects). The starting point of T2.5D is the
    observation that a large tree consists of many subtrees that are not usually and
    necessarily viewed simultaneously. The key idea of T2.5D is to represent a large tree
    in a virtual 3D space (subtrees are overlapped to reduce occupied space) while each
    subtree of interest is displayed in a 2D space. To this end, T2.5D determines the
    fixed position of each subtree (its root node) on two axes X and Y, and in addition,
    it dynamically computes a Z-order for this subtree on an imaginary axis Z. A subtree
    with a given Z-order is displayed above its siblings that have higher Z-orders.

    When visualizing and navigating a tree, at each moment the Z-order of all nodes on
    the path from the root to a node in focus in the tree is set to zero by T2.5D. The
    active wide path to a node in focus, which contains all nodes on the path from the
    root to this node in focus and their siblings, is displayed in the front of the
    screen with highlighted colors to give the user a clear view. Other parts of the tree
    remain in the background to provide an image of the overall structure. With Z-order,
    T2.5D can give the user an impression that trees are drawn in a 3D space. The user
    can easily change the active wide path by choosing another node in focus [12].
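
    The active wide path can be derived from the tree structure alone, as the sketch
    below shows. It covers only what the text states: the nodes on the root-to-focus
    path (whose Z-order is set to zero) and their siblings. How T2.5D assigns Z-orders
    to the remaining subtrees is not described here, so the sketch leaves them out. The
    function names and the toy tree are assumptions.

```python
def build_children(parent):
    """Invert a child-to-parent map into a parent-to-children map."""
    children = {node: [] for node in parent}
    for node, p in parent.items():
        if p is not None:
            children[p].append(node)
    return children


def active_wide_path(parent, focus):
    """Return (path, wide): the root-to-focus path and that path plus siblings."""
    children = build_children(parent)
    path = []
    node = focus
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()  # now ordered from the root down to the focus

    wide = []
    for node in path:
        p = parent[node]
        siblings = children[p] if p is not None else [node]
        for s in siblings:
            if s not in wide:
                wide.append(s)
    return path, wide


# Toy tree: r has children a and b; a has a1 and a2; b has b1.
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a", "b1": "b"}
path, wide = active_wide_path(parent, "a1")
z_order = {node: 0 for node in path}  # path nodes are brought to the front
```

    Choosing another focus node simply recomputes both lists, which matches the
    description of changing the active wide path during navigation.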

    We have experimented with T2.5D on various real and artificial datasets. It has been
    verified that T2.5D can handle well trees with more than 20,000 nodes, and more than
    1,000 nodes can be displayed together on the screen [12]. Figure 7 illustrates a
    pruned tree of 1795 nodes learned from stomach cancer data and drawn by T2.5D (note
    that the original screen with colors gives a better view than this black-and-white
    screen).

    Figure 4: Rule visualization in D2MS: the top-left window shows the list of
    discovered rules, the middle-left and the top-right windows show a rule under
    inspection, and the bottom window displays the instances covered by that rule.

    4. VISUALIZATION IN THE KDD PROCESS

    Figure 2 shows that, in D2MS, in order to take control of the KDD process the user
    needs visualization support to decide what task to do next and what the right
    algorithms and parameters are for that task. For example, after examining the data
    with the data visualizers, the user can decide whether the data require
    discretization or not, and if they do, what kind of discretization algorithms are
    suitable for that data. In this section we describe in detail how the user uses these
    visualization tools to manage the KDD process.

    4.1 The whole process visualization with plan visualization

    The framework of D2MS allows many algorithms and visualizers to work together in an
    integrated environment. That gives the user great flexibility in combining these
    algorithms and visualizers in order to achieve a better result; however, it may also
    make tasks more complicated. With too many algorithms involved and a lot of views
    activated to display different data or models, it is hard to remember and understand
    the relationships among them.

    To solve that problem, the plan visualizer is designed as a hub to control these
    algorithms and visualizers. When the user double-clicks on a node in a plan tree, if
    that node describes an applied algorithm, the selected set of parameters is displayed
    in a parameter dialog box. If the node describes a dataset or a model, the
    corresponding visualizer is activated to provide the user with views of that data or
    model. By following the relationships among nodes in the plan tree, the user can
    easily track down the relationships among activated views and applied algorithms.

    4.2 Data visualization in the KDD process

    With the above three views of data, D2MS integrates data visualization into different
    KDD steps by displaying and interactively changing these views of data at any time.
    In the first step of collecting data and formulating the problem, the user can and
    often needs to view the original dataset and its summarization. The visual analysis
    of collected data may help the user to identify important or redundant attributes, or
    new attributes to be added.

    T he d a t a v i sua l i z a t ion ha s show n t o be s i gn i f i c a n t i n t he da t a

    pre proc e s s i ng s t e p tha t c ons i s t s o f func t i ons on da t a c l e a n i ng ,

    i n t e gra t i on , t r a ns forma t i on a nd re duc t i on . For e xa mpl e , ma ny

    di sc re t i z a ti on a l gor i t hms prov i de a l t e rna t i ve so l u t i on o f d i v i d i ng a

    num e r i c a l a t tr i bu t e i n t o i n t e rva l s , a nd t he v i sua l da t a que ry on t he

    d i sc re t i z e d a t t r i bu t e a nd t he c l a s s a t t r i bu t e c a n g i ve i ns i gh t s fo r

    de c i s i on . T he da t a v i sua l i z a t i on i s a l so ve ry s i gn i f i c a n t i n da t a

    mi n i ng s t e p wi t h da t a que ry mode , a nd pa r t i c u l a r l y i n t he

    e va l ua t i on s t e p i n i t s syne rg i s t i c c ombi na t i on wi t h ru l e a nd t r e e

    v i sua l i z a t i on .
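
    As a rough illustration of the query on a discretized attribute against the class attribute, the
    sketch below cuts a numeric attribute into equal-width intervals and counts instances per
    (interval, class) pair. Equal-width cutting is a stand-in chosen for brevity; it is not one of the
    discretization techniques named in this paper, and the function names and toy data are assumptions.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def equal_width_cuts(values: Sequence[float], k: int) -> List[float]:
    """Return the k-1 cut points that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def interval_index(x: float, cuts: Sequence[float]) -> int:
    """Index of the interval containing x (number of cut points <= x)."""
    return bisect_right(cuts, x)


def cross_tab(values: Sequence[float], labels: Sequence[str],
              cuts: Sequence[float]) -> Dict[Tuple[int, str], int]:
    """Count instances for every (interval index, class label) pair."""
    table: Counter = Counter()
    for x, y in zip(values, labels):
        table[(interval_index(x, cuts), y)] += 1
    return dict(table)


# Made-up values, only to show the shape of the result.
ages = [20, 40, 60, 80]
classes = ["a", "a", "b", "b"]
cuts = equal_width_cuts(ages, 3)
for (interval, label), count in sorted(cross_tab(ages, classes, cuts).items()):
    print(interval, label, count)
```

    Running the same cross-tabulation for two alternative sets of cut points gives the side-by-side
    comparison that the visual query is meant to support.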

    4.3 Model Visualization in the KDD process

    A model that can be understood is a model that can be trusted. A data mining algorithm that uses a
    human-understandable model can be checked easily by domain experts, providing much needed semantic
    validity to the model. To that end, model visualization provides much help to the user.

    There are several ways to support the user in evaluating the quality of a rule, together with other
    measures such as the coverage and accuracy of the rule. For example, when two rules predicting a
    target class have the same support and confidence, the one that wrongly covers more instances
    belonging to classes different from the target class would be considered worse.

    Figure 4 illustrates rule visualization in D2MS, where the top-left and bottom-left windows display a
    discovered rule, and the top-right and bottom-right windows show the instances covered by that rule.
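
    The link between a rule and the instances it covers can be sketched as follows. The definitions used
    here are assumptions made for the example: accuracy is the fraction of covered instances that belong
    to the target class, and coverage is the fraction of all instances that the rule covers. The function
    names and the toy records are hypothetical.

```python
from typing import Dict, List


def covered(conditions: Dict[str, object], instances: List[dict]) -> List[dict]:
    """Instances that satisfy every attribute = value condition of the rule."""
    return [row for row in instances
            if all(row.get(attr) == value for attr, value in conditions.items())]


def rule_quality(conditions: Dict[str, object], target: str,
                 instances: List[dict], class_attr: str = "class") -> dict:
    """Counts and ratios for one rule (assumed definitions, see lead-in)."""
    hits = covered(conditions, instances)
    correct = sum(1 for row in hits if row[class_attr] == target)
    return {
        "covered": len(hits),
        "correct": correct,
        "wrong": len(hits) - correct,  # covered instances from other classes
        "accuracy": correct / len(hits) if hits else 0.0,
        "coverage": len(hits) / len(instances) if instances else 0.0,
    }


# Made-up records, only to exercise the functions.
data = [
    {"sex": "F", "grade": 3, "class": "pos"},
    {"sex": "F", "grade": 3, "class": "neg"},
    {"sex": "M", "grade": 3, "class": "pos"},
    {"sex": "F", "grade": 1, "class": "pos"},
]
print(rule_quality({"sex": "F", "grade": 3}, "pos", data))
```

    Reporting the wrongly covered count next to the ratios is what lets two rules with similar headline
    numbers be told apart.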

    Three main criteria for selecting hierarchical models are their size, accuracy and understandability.
    Tree size and accuracy can be quantitatively evaluated, and of the two, accuracy is widely considered
    to be of great importance. The understandability of trees is difficult to quantify or measure, and
    the idea here is to use the tree visualizer to support the user's understanding.

    In the current version of CABRO in D2MS, the user can generate new models, each of which is composed
    of an attribute selection measure chosen from the gain-ratio, the gini-index [1], χ² and R-measure;
    a pruning technique from error-complexity, reduced-error and pessimistic error [1]; and a
    discretization technique from the entropy-based and error-based techniques.
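
    The space of model candidates described above is the Cartesian product of the three choices. The
    sketch below enumerates it, reading the third selection measure as chi-square; the list names, the
    dictionary keys and the function candidate_models are illustrative assumptions, not CABRO identifiers.

```python
from itertools import product
from typing import Dict, List

SELECTION_MEASURES = ["gain-ratio", "gini-index", "chi-square", "R-measure"]
PRUNING_TECHNIQUES = ["error-complexity", "reduced-error", "pessimistic-error"]
DISCRETIZATION_TECHNIQUES = ["entropy-based", "error-based"]


def candidate_models() -> List[Dict[str, str]]:
    """Every combination of one measure, one pruning and one discretization technique."""
    return [
        {"measure": m, "pruning": p, "discretization": d}
        for m, p, d in product(SELECTION_MEASURES, PRUNING_TECHNIQUES,
                               DISCRETIZATION_TECHNIQUES)
    ]


models = candidate_models()
print(len(models))  # 4 * 3 * 2 = 24 candidate configurations
print(models[0])
```

    With 24 combinations, comparing trials by hand quickly becomes tedious, which is the motivation for
    viewing several resulting trees side by side.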

    For each model candidate, the D2MS tree visualizer graphically displays the corresponding pruned
    tree, its size, and its prediction error rate. It offers the user multiple views of these trials and
    helps the user compare their results in order to make his/her final selection of techniques/models
    of interest.

    The D2MS tree visualizer is used not only in inducing decision trees but also in classifying unknown
    objects. It plays the role of the interface for visual explanation of the matching process, in a way
    similar to the explanation in knowledge-based systems. The D2MS tree visualizer supports three modes
    of matching an unknown object, according to the way that the unknown object is declared.

    • The whole record of the unknown object is read from a database: D2MS directly shows the leaf node
    that matches the object. The path from the tree root to that leaf node is highlighted. Information
    accumulated along the path can be viewed at any node.

    • Values of attributes are given by the user when answering the system's questions: Questions about
    attributes are asked according to the hierarchical structure, in a top-down manner from the root.
    From a menu the user chooses one value in the list of discrete values of the attribute, or enters a
    numerical value in the case of a continuous attribute. Questions are asked dynamically according to
    the stepwise refinement of the matching process.

    • The user declares values of the attributes he/she knows: The user is able to select the attributes
    that he/she wishes to query on. These attributes can be selected from the attribute list with
    corresponding values. Once the attribute-value pairs are entered, the tree visualizer limits the
    regions of the tree to those that partially satisfy the data. The system then asks additional
    questions to complete the match.
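
    The three matching modes can be sketched on a toy decision tree as follows. This is a minimal
    illustration under stated assumptions, not the D2MS matching code: the nested-dictionary tree
    encoding, the restriction to discrete attributes, the toy tree itself and the three function names
    are all introduced here for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy tree: inner nodes test one attribute, leaves carry a class label.
TREE = {"attr": "outlook", "branches": {
    "sunny": {"attr": "humidity", "branches": {
        "high": {"label": "no"}, "normal": {"label": "yes"}}},
    "overcast": {"label": "yes"},
    "rain": {"attr": "wind", "branches": {
        "strong": {"label": "no"}, "weak": {"label": "yes"}}},
}}


def match_path(tree: dict, record: Dict[str, str]) -> Tuple[List[Tuple[str, str]], str]:
    """Mode 1: a full record is known; return the tests on the path and the leaf label."""
    path = []
    node = tree
    while "label" not in node:
        attr = node["attr"]
        value = record[attr]
        path.append((attr, value))
        node = node["branches"][value]
    return path, node["label"]


def next_question(tree: dict, known: Dict[str, str]) -> Optional[str]:
    """Mode 2: walk top-down and return the next attribute to ask, or None at a leaf."""
    node = tree
    while "label" not in node:
        attr = node["attr"]
        if attr not in known:
            return attr
        node = node["branches"][known[attr]]
    return None


def candidate_leaves(tree: dict, known: Dict[str, str]) -> List[str]:
    """Mode 3: labels of the leaves still reachable given only some attribute values."""
    if "label" in tree:
        return [tree["label"]]
    attr = tree["attr"]
    if attr in known:
        child = tree["branches"].get(known[attr])
        return candidate_leaves(child, known) if child is not None else []
    leaves = []
    for child in tree["branches"].values():
        leaves.extend(candidate_leaves(child, known))
    return leaves


print(match_path(TREE, {"outlook": "sunny", "humidity": "high", "wind": "weak"}))
print(next_question(TREE, {"outlook": "rain"}))
print(candidate_leaves(TREE, {"outlook": "rain"}))
```

    In the third mode, calling next_question on the same partial values gives the additional question
    needed to narrow the remaining candidate leaves down to one.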

    5. A CASE-STUDY

    This section illustrates the utility of synergistic visualization of data and knowledge of D2MS in
    extracting knowledge from a stomach cancer dataset.

    Figure 5: Multiple views of generated trees in D2MS: the top window shows the T2.5D view, while the
    bottom window shows the tightly-coupled views of the decision tree generated from stomach cancer data.

    5.1 The stomach cancer dataset

    The stomach cancer dataset collected at the National Cancer Center in Tokyo during 1962-1991 is a
    very precious source for the research. It contains data of 7,520 patients described originally by 83
    numeric and categorical attributes of location, combined resection, pre-operative complication,
    post-operative complication, etc. The problem is to find predictive and descriptive rules for the
    class of patients who died within 90 days after operation, amidst a total of 5 classes: death within
    90 days, death after 90 days, death after 5 years, alive, unknown.

    Several well-known data mining systems have been applied to this task. However, the obtained results
    were far from expectations: the rules have low support and confidence, and usually relate to only a
    small percentage of patients of the target class.

    5.2 Mining rules with visual LUPC

    The D2MS visualization tools associated with LUPC allow us to examine the data and to gain better
    insight into complex data before learning. While the viewing mode of original data offers an
    intuition about the distribution of individual attributes and instances, the summarizing and
    querying modes can suggest irregular or rare events to be investigated, or guide which biases could
    be used to narrow the huge search space. It is commonly known that patients who have the symptom
    liver metastasis at any level (1, 2, or 3) will certainly not survive. Also, serosal invasion = 3 is
    a typical symptom of the class death within 90 days. With the visualization tools, we found several
    unusual events. For example, among 2329 patients in the class alive, 5 have heavy metastasis of
    level 3, and 1 and 8 have metastasis of level 2 and 1, respectively. Moreover, data querying allows
    us to verify some significant combinations of symptoms, such as liver metastasis = 3 and serosal
    invasion = 3, as shown in Figure 6.

    It is commonly known that patients cannot survive when liver metastasis occurs aggressively. Learning
    methods, when applied to this dataset, often yield rules for the class death within 90 days
    containing liver metastasis that are considered acceptable but not useful by domain experts. Also,
    these discovered rules usually cover only a subset of patients of this class. This low coverage means
    that there are patients of the class who do not have liver metastasis and are therefore difficult to
    detect.

    Figure 6: Visualization of data suggested rare events to be investigated.

    Using visual interactive LUPC, we ran different trials and specified parameters and constraints to
    find only rules that do not contain the attribute liver_metastasis and/or its combination with two
    other typical attributes, Peritoneal_metastasis and Serosal_invasion. Below is a rule with accuracy
    100% discovered by LUPC that can be seen as a rare and irregular event in the target class.

    Rule 8 [accuracy = 1.0 (4/4), cover = 0.001 (4/6712)]
    IF category = R AND sex = F AND proximal_third = 3
    AND middle_third = 1
    THEN class = death within 90 days
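
    The attribute constraint described above can be illustrated as a filter over candidate rules. This
    is only a post-hoc sketch: how LUPC applies such constraints during its search is not described
    here, and the rule encoding, the second candidate rule (made up for contrast) and the function names
    are assumptions. The counts for Rule 8 are the ones printed in its header.

```python
from typing import Dict, List, Set

FORBIDDEN: Set[str] = {"liver_metastasis", "Peritoneal_metastasis", "Serosal_invasion"}


def satisfies_constraint(rule: dict, forbidden: Set[str]) -> bool:
    """True when none of the rule's conditions uses a forbidden attribute."""
    return not (set(rule["conditions"]) & forbidden)


def filter_rules(rules: List[dict], forbidden: Set[str]) -> List[dict]:
    """Keep only the rules that avoid every forbidden attribute."""
    return [rule for rule in rules if satisfies_constraint(rule, forbidden)]


def header(rule: dict, total: int) -> Dict[str, float]:
    """Recompute the bracketed figures from raw counts, rounded to three decimals."""
    return {
        "accuracy": round(rule["correct"] / rule["covered"], 3),
        "cover": round(rule["covered"] / total, 3),
    }


rule_8 = {
    "name": "Rule 8",
    "conditions": {"category": "R", "sex": "F", "proximal_third": 3, "middle_third": 1},
    "target": "death within 90 days",
    "correct": 4, "covered": 4,
}
# Hypothetical rule, invented only to show a rule that the constraint rejects.
rejected = {
    "name": "hypothetical",
    "conditions": {"liver_metastasis": 3, "sex": "M"},
    "target": "death within 90 days",
    "correct": 1, "covered": 1,
}

kept = filter_rules([rule_8, rejected], FORBIDDEN)
print([rule["name"] for rule in kept])
print(header(rule_8, total=6712))
```

    Rounding 4/6712 to three decimals reproduces the cover value 0.001 shown in the rule header.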

    The prediction of rare events is becoming particularly interesting. When supposing that some
    attribute-value pairs may characterize some rare and/or significant events, LUPC, thanks to its
    associated visualization tools, allows us to examine the hypothesis space effectively and identify
    rare rules with any given small support or confidence. An example is to find rules in the class
    alive that contain the symptom liver_metastasis. Such events are certainly rare and influence human
    decision making. We found rare events in the class alive, such as male patients with liver
    metastasis at the serious level 3 who can survive, with an accuracy of 50%.

    Rule 1 [accuracy = 0.500 (2/4); cover = 0.001 (4/6712)]
    IF sex = M AND type = B1 AND liver_metastasis = 3
    AND middle_third = 1
    THEN class = alive

    6. CONCLUSION

    We have presented the knowledge discovery system D2MS with support for model selection integrated
    with visualization. We emphasize the crucial role of the user's participation in the model selection
    process of knowledge discovery and have developed data, rule and tree visualizers in D2MS to support
    such participation. Our basic idea is to use the right visualization techniques in the right places,
    and visualization should be integrated into the steps of the knowledge discovery process. D2MS with
    its visualization support has been used, and has shown advantages, in extracting knowledge from a
    real-world application on stomach cancer data.

    REFERENCES

    [1] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees,
    Belmont, CA: Wadsworth, 1984.

    [2] Card, S.K., Mackinlay, J.D., Shneiderman, B., Readings in Information Visualization, Morgan
    Kaufmann, 1999.

    [3] Fayyad, U.M., Grinstein, G.G., and Wierse, A., Information Visualization in Data Mining and
    Knowledge Discovery, Morgan Kaufmann, 2002.

    [4] Furnas, G.W., The FISHEYE View: A New Look at Structured Files, Bell Laboratories Technical
    Memorandum, #81-11221-9, 1981.

    [5] Han, J. and Cercone, N., RuleViz: A Model for Visualizing Knowledge Discovery Process, Sixth
    Inter. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 244-253.

    [6] Ho, T.B., Nguyen, D.D., and Kawasaki, S., Mining Prediction Rules from Minority Classes, Inter.
    Workshop Rule-Based Data Mining, Tokyo, 2001, pp. 254-264.

    [7] Kumar, H.P., Plaisant, C., Shneiderman, B., Browsing Hierarchical Data with Multi-Level Dynamic
    Queries and Pruning, Inter. Journal of Human-Computer Studies, 46(1), pp. 103-124, 1997.

    [8] Lamping, J. and Rao, R., The Hyperbolic Browser: A Focus + Context Technique for Visualizing
    Large Hierarchies, Journal of Visual Languages and Computing, 7(1), pp. 33-55, 1997.

    [9] Lee, H.Y., Ong, H.L., and Quek, L.H., Exploiting Visualization in Knowledge Discovery, First
    Inter. Conf. on Knowledge Discovery and Data Mining, 1995, pp. 198-203.

    [10] Mannila, H., Methods and Problems in Data Mining, Inter. Conf. on Database Theory, Springer,
    1997, pp. 41-55.

    [11] Nguyen, T.D. and Ho, T.B., An Interactive Graphic System for Decision Tree Induction, Journal of
    Japanese Society for Artificial Intelligence, Vol. 14, N. 1, pp. 131-138, 1999.

    [12] Nguyen, T.D., Ho, T.B., and Shimodaira, H., A Visualization Tool for Interactive Learning of
    Large Decision Trees, Twelfth IEEE Inter. Conf. on Tools with Artificial Intelligence, 2000,
    pp. 28-35.

    [13] Reingold, E.M. and Tilford, J.S., Tidier Drawings of Trees, IEEE Transactions on Software
    Engineering, Vol. SE-7, No. 2, pp. 223-228, 1991.

    [14] Robertson, G.G., Mackinlay, J.D., and Card, S.K., Cone Trees: Animated 3D Visualization of
    Hierarchical Information, ACM Conf. on Human Factors in Computing Systems, 1991, pp. 189-194.
