
Post on 01-Aug-2020


  • Departamento de Electrónica, Telecomunicações e Informática

    EXPLORAÇÃO de DADOS & DATA MINING

    Exercises: Clustering - k-means and hierarchical clustering

    1. The following table and plot describe 16 objects with two attributes:

    ID   attribute #1   attribute #2
     2       0.8            9.8
     3       1.2           11.6
     4       2.8            9.6
     5       3.8            9.9
     6       4.4            6.5
     7       4.8            1.1
     8       6.0           19.9
     9       6.2           18.5
    10       7.6           17.4
    11       7.8           12.2
    12       6.6            7.7
    13       8.2            4.5
    14       8.4            6.9
    15       9.0            3.4
    16       9.6           11.1

    The data set has to be clustered into k = 3 clusters using the k-means algorithm.

    (a) Assuming that the cluster centroids¹ are

                attribute #1   attribute #2
    cluster 1       3.8            9.9
    cluster 2       7.8           12.2
    cluster 3       6.2           18.5

    assign each object in the data set to a cluster, using the Euclidean distance as the measure of proximity.

    (b) After the objects have been assigned, the cluster centroids may take different values. Compute the new centroids.

    (c) Do the objects stay in the same clusters with the new centroid values?
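One iteration of k-means for question 1 can be sketched in Python as follows. This is a minimal sketch rather than a reference solution: it uses only the objects listed in the table (IDs 2 to 16), and the helper names `assign` and `update` are illustrative, not part of the exercise.

```python
import math

# Objects as listed in the table: ID -> (attribute #1, attribute #2).
points = {
    2: (0.8, 9.8), 3: (1.2, 11.6), 4: (2.8, 9.6), 5: (3.8, 9.9),
    6: (4.4, 6.5), 7: (4.8, 1.1), 8: (6.0, 19.9), 9: (6.2, 18.5),
    10: (7.6, 17.4), 11: (7.8, 12.2), 12: (6.6, 7.7), 13: (8.2, 4.5),
    14: (8.4, 6.9), 15: (9.0, 3.4), 16: (9.6, 11.1),
}

# Initial centroids given in part (a): cluster label -> coordinates.
centroids = {1: (3.8, 9.9), 2: (7.8, 12.2), 3: (6.2, 18.5)}


def assign(points, centroids):
    """Part (a): map each object ID to the label of its nearest centroid."""
    return {
        pid: min(centroids, key=lambda c: math.dist(p, centroids[c]))
        for pid, p in points.items()
    }


def update(points, labels):
    """Part (b): recompute each centroid as the mean of its members."""
    new_centroids = {}
    for c in sorted(set(labels.values())):
        members = [points[pid] for pid, lab in labels.items() if lab == c]
        new_centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centroids


labels = assign(points, centroids)
new_centroids = update(points, labels)
new_labels = assign(points, new_centroids)  # part (c): reassign and compare

print("assignment:", labels)
print("new centroids:", new_centroids)
print("assignment unchanged:", labels == new_labels)
```

Repeating the `assign` and `update` steps until the assignment stops changing gives the full k-means loop. `math.dist` requires Python 3.8 or later.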

    2. The table on the left gives the Euclidean distances between pairs of the 6 objects in the data set, which are the initial clusters of a hierarchical clustering algorithm. The table on the right is the distance matrix between the clusters at a later step of the algorithm.

    ¹ Representing a random choice of objects in the data set.

    Left table (distances between the 6 objects):

          A    B    C    D    E    F
    A     0   12    6    3   25    4
    B    12    0   19    8   14   15
    C     6   19    0   12    5   18
    D     3    8   12    0   11    9
    E    25   14    5   11    0    7
    F     4   15   18    9    7    0

    Right table (distances between the clusters):

         AD    B    C    E    F
    AD    0    ?    ?    ?    ?
    B     ?    0   19   14   15
    C     ?   19    0    5   18
    E     ?   14    5    0    7
    F     ?   15   18    7    0

    (a) The table on the right represents the new distance matrix after the first merge.

    • What proximity measure was used to merge the two objects (A, D)?
    • Compute the missing values (marked as ?), assuming that the same proximity measure is used between clusters.

    (b) Draw the dendrogram corresponding to the hierarchical clustering process.
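The first merge in question 2 and the update of the distance matrix can be sketched in Python as follows. The exercise does not name the linkage criterion, so the sketch computes the merged row under two common assumptions, single linkage (minimum) and complete linkage (maximum); the helper name `merged_distances` is illustrative.

```python
from itertools import combinations

# Pairwise distances between the 6 objects, copied from the left table.
names = ["A", "B", "C", "D", "E", "F"]
matrix = [
    [0, 12, 6, 3, 25, 4],
    [12, 0, 19, 8, 14, 15],
    [6, 19, 0, 12, 5, 18],
    [3, 8, 12, 0, 11, 9],
    [25, 14, 5, 11, 0, 7],
    [4, 15, 18, 9, 7, 0],
]

# Look up d(x, y) by object name.
d = {
    (a, b): matrix[i][j]
    for i, a in enumerate(names)
    for j, b in enumerate(names)
}

# The first merge joins the pair with the smallest off-diagonal distance.
first_merge = min(combinations(names, 2), key=lambda pair: d[pair])


def merged_distances(pair, linkage):
    """Distance from the merged cluster to every remaining object.

    `linkage` is an assumption: min gives single linkage, max gives
    complete linkage.
    """
    return {
        x: linkage(d[(member, x)] for member in pair)
        for x in names
        if x not in pair
    }


single = merged_distances(first_merge, min)
complete = merged_distances(first_merge, max)

print("first merge:", first_merge)
print("single linkage row:", single)
print("complete linkage row:", complete)
```

Repeating the merge on the updated matrix until one cluster remains yields the merge order and heights needed for the dendrogram.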

    3. A health insurance provider wants to identify individuals at risk of coronary disease. The data set is stored in coronary.csv and has the attributes weight, cholesterol and gender. Download the data and explore it using k-means in RapidMiner.
