Chapter 6. Clustering Analysis

Clustering analysis: an analysis method that groups genes or samples showing similar expression patterns into clusters. It is the most widely used approach, and its results are easy to display.

6.1 Introduction

Clustering analysis is an exploratory data analysis method that forms clusters by grouping together data that are similar to one another.

Eisen et al. (1998) were the first to apply it to gene expression data. Clustering is divided into supervised clustering and unsupervised clustering.

In supervised clustering, vectors are classified with respect to known reference vectors. In unsupervised clustering, no predefined reference vectors are used.

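Targets of clustering analysis

1) Samples: samples sharing the same characteristics are grouped into clusters so that similarity is high among samples within the same cluster and low between samples that belong to different clusters.

2) Genes:

a) cDNA chip: genes showing similar patterns are found from the ratio of relative intensity values for each gene.

b) GeneChip: genes showing similar patterns are found from the absolute intensity values.

From here on, samples and genes are both referred to as subjects and described together.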

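(1) Purpose of clustering microarray data

When genes are clustered, if a gene of unknown function falls into the same cluster as genes of known function, the function of the unknown gene can be inferred.

Clustering of samples groups together samples with similar characteristics and then looks for sets of genes with similar expression patterns.

Alizadeh et al. (2000): identified new lymphoma subtypes in addition to the previously known lymphomas.

It can also be applied to assess the overall quality of microarrays by identifying slides that had problems during the experiment or that contain many missing values.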

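(2) Data structure

$X = (x_{ij})$: $p \times n$ data matrix

$x_{ij}$: intensity value (GeneChip) or log intensity ratio (cDNA microarray)

i: gene (p genes)

j: sample (n samples)

ex) Rat Stem Cell (RSC) data

2 control groups,

3 replicates,

6 time points,

3800 genes per slide, so that

p = 3800 and n = 2 × 3 × 6 = 36.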

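6.2 Procedure for clustering analysis

□ When to run clustering: clustering is applied to the data matrix X after image processing and normalization.

□ Clustering results: evaluated by within-cluster homogeneity and between-cluster heterogeneity.

□ What is needed to run clustering

- An objective criterion for assessing the similarity between genes → expresses how close genes are to one another

- Decide in advance whether to use the entire data set or a selected subset.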

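6.2.1 Definition and types of distance

□ An objective measure is needed to define similarity or dissimilarity between subjects, and that measure is distance.

Objective definitions are needed for

① how similar genes are to one another across samples, and

② how similar samples are to one another across genes.

The distance between samples measures how similar or how different two samples are.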

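(1) Measure

A measure $d$ is called a distance if it satisfies the following three conditions:

① $d(x, y) \ge 0$, with $d(x, x) = 0$   ② $d(x, y) = d(y, x)$   ③ $d(x, z) + d(z, y) \ge d(x, y)$

Criterion: a small distance between $x$ and $y$ indicates that the two subjects are similar.

① Since $d(x, y) \ge 0$ and the distance from a subject to itself is 0, a smaller value of $d(x, y)$ expresses greater similarity.

② Since $d(x, y) = d(y, x)$, a distance cannot reflect biological hierarchy, directionality, or causality between two subjects.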

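(2) Types of measures and the distance matrix

The following four distances are widely used in clustering analysis.

For convenience, distances are defined for two genes $x = (x_1, \dots, x_n) \in R^n$ and $y = (y_1, \dots, y_n) \in R^n$.

① Euclidean distance

- Definition: $d_E(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$

If three genes $x$, $y$, $z$ satisfy $d_E(x, y) = d_E(x, z)$, then, taking gene $x$ as the reference, $y$ and $z$ are geometrically points on the circumference of a circle centered at $x$.

② Minkowski distance

- Definition: $d_M(x, y) = \left( \sum_{j=1}^{n} |x_j - y_j|^m \right)^{1/m}$

This distance takes dimensional information about the subjects into account through the order $m$.

In particular, $m = 1$ gives the Manhattan distance, which is the $L_1$ norm, and $m = 2$ gives the Euclidean distance.

③ Mahalanobis distance

- Definition: $d_S(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$

This distance represents the statistical distance between two genes.

Here $S$ is the covariance matrix. It is an $n \times n$ matrix when genes are clustered and a $p \times p$ matrix when samples are clustered.

④ Correlation coefficient

$r_{xy} = \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{n} (x_j - \bar{x})^2 \sum_{j=1}^{n} (y_j - \bar{y})^2}}, \quad -1 \le r_{xy} \le 1$

The correlation coefficient is widely used as a measure of distance or similarity between two genes.

If the raw data are transformed so that $\sum_{j} x_j = 0$ and $\sum_{j} x_j^2 = 1$ for every gene, then

$d_E^2(x, y) = 2(1 - r_{xy})$

holds.

A correlation coefficient of $r_{xy} = 1$ corresponds to a Euclidean distance of 0, and the Euclidean distance increases as the correlation coefficient decreases toward -1.

Thus the Euclidean distance is a distance that focuses on dissimilarity, while the correlation coefficient is a measure based on similarity.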
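The relation between the Euclidean distance and the correlation coefficient can be checked numerically. The following is a minimal sketch in base R, assuming the standardization "mean 0, sum of squares 1"; the helper `unit_standardize` and the two illustrative profiles are made up for this check and are not part of the original notes.

```r
# Hypothetical helper: center a vector and scale it to unit sum of squares.
unit_standardize <- function(v) {
  centered <- v - mean(v)
  centered / sqrt(sum(centered^2))
}

x <- c(2, 4, 5, 6)        # illustrative expression profile
y <- c(2.5, 3.5, 4.5, 1)  # second illustrative profile

zx <- unit_standardize(x)
zy <- unit_standardize(y)

d2  <- sum((zx - zy)^2)   # squared Euclidean distance after standardization
rxy <- cor(x, y)          # Pearson correlation of the raw profiles

# The two quantities should agree up to floating-point error.
identity_holds <- isTRUE(all.equal(d2, 2 * (1 - rxy)))
print(c(squared_distance = d2, two_times_one_minus_r = 2 * (1 - rxy)))
```

Under a different scaling (for example sample variance 1 instead of unit sum of squares) the constant 2 changes, but the squared distance stays proportional to 1 - r.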

Example 6.2

Trend data with n = 4, observed over four days for four genes with representative trend characteristics: steep up, up, down, and change.

steep up and up both increase but differ in absolute magnitude, while change increases together with steep up through day 3 and then drops sharply on day 4.

steep up = (2, 4, 5, 6)

up = (2/4, 4/4, 5/4, 6/4)

down = (6/4, 4/4, 3/4, 2/4)

change = (10/4, 14/4, 18/4, 1)

<Euclidean distance: results given in the book>

Euclidean     steep up    up      down    change
steep up      0
up            2.60        0
down          2.75        1.23    0
change        2.25        2.14    2.15    0

<Euclidean distance: results actually computed in R>

Euclidean (corrected)    steep up    up         down        change
steep up                 0
up                       6.75        0
down                     7.586995    1.5        0
change                   5.074446    4.58939    4.643544    0

<Mahalanobis distance: results actually computed in R>

Mahalanobis    steep up    up          down        change
steep up       0
up             2.449490    0
down           2.449490    2.449490    0
change         2.449490    2.449490    2.449490    0

<Manhattan distance: results given in the book>

Manhattan     steep up    up      down    change
steep up      0
up            12.75       0
down          13.75       2.50    0
change        6.50        8.25    7.75    0

<Manhattan distance: results computed in R>

Manhattan (corrected)    steep up    up      down    change
steep up                 0
up                       12.75       0
down                     13.25       2.50    0
change                   6.50        8.25    7.75    0

<Correlation-based distance (1 - r)>

              steep up    up      down    change
steep up      0
up            0           0
down          2           2       0
change        1.18        1.18    0.82    0
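The last table (values 0, 2, 1.18, 0.82) is consistent with a distance defined as one minus the Pearson correlation. The sketch below recomputes it in base R from the four profiles of the example, under the assumption that this is the definition that was used.

```r
# Rows are genes, columns are the four observation days.
genes <- rbind(
  steepup = c(2, 4, 5, 6),
  up      = c(0.5, 1, 1.25, 1.5),
  down    = c(1.5, 1, 0.75, 0.5),
  change  = c(2.5, 3.5, 4.5, 1)
)

# cor() correlates columns, so transpose to correlate genes with each other.
cor_dist <- as.dist(1 - cor(t(genes)))
print(round(cor_dist, 2))
```

Because up is an exact rescaling of steep up, the two profiles are perfectly correlated and their distance is 0, although their Euclidean distance is 6.75.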

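Distance               Characteristic
Euclidean              Measures the geometric distance between two vectors
Manhattan              When measuring the distance between two vectors, each variable carries more weight than in the Euclidean distance
Mahalanobis            Measures the distance between two vectors after adjusting for the correlation among variables
Correlation (1 - r)    Measures differences in pattern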

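6.2.2 One-way and two-way clustering

One-way clustering: cluster only the genes with the samples held fixed, or cluster only the samples with the genes held fixed.

Two-way clustering: cluster genes and samples simultaneously.

Example: when the goal is to distinguish the two cancer types ALL and AML, clustering is performed on the genes only. When one also wants to check the sample data themselves for abnormalities, clustering is performed on the samples as well. For instance, when the goal is to discover a new subtype in breast cancer data, clustering is performed on the samples.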

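6.2.3 Data transformation: location adjustment and scale adjustment

If the measurement units agree and the scale does not differ greatly across genes or samples, no transformation of the data is needed.

Because of the nature of microarray experiments, however, scales differ in most cases.

-> A transformation of the data is needed.

The transformation is applied in the same way to genes and to samples.

(1) Gene transformation

For gene $i$, $x_{ij}$ is transformed to

$z_{ij} = (x_{ij} - \bar{x}_{i\cdot}) / s_i$.

Here

$\bar{x}_{i\cdot} = \frac{1}{n}\sum_{j=1}^{n} x_{ij}$ and $s_i^2 = \frac{1}{n-1}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i\cdot})^2$, so the $n$ values of each transformed gene have mean 0 and sample variance 1.

When the sample means and sample variances differ severely across genes, the median is sometimes used in place of $\bar{x}_{i\cdot}$ and a robust scale estimate in place of $s_i$.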
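A minimal sketch of the gene transformation in base R follows. The expression matrix is simulated, and the use of the MAD as the robust scale estimate is an assumption made only for this illustration; the notes do not say which robust estimate is intended.

```r
set.seed(1)
# Toy expression matrix: 5 genes (rows) by 6 samples (columns); values are simulated.
X <- matrix(rnorm(30, mean = 8, sd = 2), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("s", 1:6)))

# Gene transformation: center and scale every row.
# scale() works column-wise, so transpose before and after.
Z <- t(scale(t(X)))

# Robust variant (assumption for illustration): median and MAD per gene.
# A length-5 vector is recycled down the columns, so row i uses gene i's value.
Z_robust <- (X - apply(X, 1, median)) / apply(X, 1, mad)

row_means <- rowMeans(Z)
row_vars  <- apply(Z, 1, var)
print(round(cbind(mean = row_means, var = row_vars), 10))
```

Since each row is only shifted and rescaled, `cor(t(X))` and `cor(t(Z))` coincide: the correlation between genes is not affected by the gene transformation, while Euclidean and Manhattan distances between genes generally are.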

(2) Sample transformation

For sample $j$, $x_{ij}$ is transformed to

$z_{ij} = (x_{ij} - \bar{x}_{\cdot j}) / s_j$, where

$\bar{x}_{\cdot j} = \frac{1}{p}\sum_{i=1}^{p} x_{ij}$ and $s_j^2 = \frac{1}{p-1}\sum_{i=1}^{p} (x_{ij} - \bar{x}_{\cdot j})^2$.

Each transformed sample has mean 0 and variance 1. Because the standardization is carried out slide by slide, this is one of the normalization methods of Chapter 4.

Distance between samples    Gene transformation                       Sample transformation
Euclidean                   changes                                   changes
Manhattan                   changes                                   changes
Mahalanobis                 changes (unchanged when S is diagonal)    changes
Correlation coefficient     changes                                   unchanged

<Effect of standardization on distances between samples>

Distance between genes      Gene transformation    Sample transformation
Euclidean                   changes                changes
Manhattan                   changes                changes
Mahalanobis                 changes                changes (unchanged when S is diagonal)
Correlation coefficient     unchanged              changes

<Effect of standardization on distances between genes>

R program

> steepup <- c(2, 4, 5, 6)         # enter the steepup values
> up <- c(0.5, 1, 1.25, 1.5)       # enter the up values
> down <- c(1.5, 1, 0.75, 0.5)     # enter the down values
> change <- c(2.5, 3.5, 4.5, 1)    # enter the change values
> sqrt(t(steepup-up)%*%(steepup-up))    # Euclidean distance between steepup and up
     [,1]
[1,] 6.75
> euc <- function(x,y) sqrt(sum((x-y)^2))    # define a Euclidean distance function
> euc(steepup,up)    # Euclidean distance between steepup and up using the function
> mink <- function(x,y,m) (sum(abs(x-y)^m))^(1/m)    # Minkowski distance
> m <- 1    # order m = 1
> manh <- function(x,y) sum(abs(x-y))    # define a Manhattan distance function
> library(far)    # package that provides invgen()
> gen <- cbind(steepup,up,down,change)    # bind the genes together to compute the covariance
> gen
     steepup   up down change
[1,]       2 0.50 1.50    2.5
[2,]       4 1.00 1.00    3.5
[3,]       5 1.25 0.75    4.5
[4,]       6 1.50 0.50    1.0
> tgen <- t(gen)    # transpose so that rows are genes and columns are days
> tgen
        [,1] [,2] [,3] [,4]
steepup  2.0  4.0 5.00  6.0
up       0.5  1.0 1.25  1.5
down     1.5  1.0 0.75  0.5
change   2.5  3.5 4.50  1.0
> cov(tgen)    # covariance matrix
          [,1]     [,2]     [,3]      [,4]
[1,] 0.7291667 1.104167 1.437500 0.4583333
[2,] 1.1041667 2.562500 3.479167 2.7083333
[3,] 1.4375000 3.479167 4.770833 3.6250000
[4,] 0.4583333 2.708333 3.625000 6.4166667
> maha <- function(x,y) sqrt(t(x-y)%*%invgen(cov(tgen),0.0001)%*%(x-y))    # define a Mahalanobis distance function
> maha(steepup,up)    # Mahalanobis distance between genes steepup and up
     [,1]
[1,] 2.449490
> diag(rep(1,4))    # S = I (identity matrix: ones on the diagonal, zeros elsewhere)
> sqrt(t(steepup-up)%*%diag(rep(1,4))%*%(steepup-up))
     [,1]
[1,] 6.75    # same as the Euclidean distance

> dist(tgen)

steepup up down

up 6.750000

down 7.586995 1.500000

change 5.074446 4.589390 4.643544

> dist(tgen,method="minkowski")

steepup up down

up 6.750000

down 7.586995 1.500000

change 5.074446 4.589390 4.643544

> dist(tgen,method="manhattan")

steepup up down

up 12.75

down 13.25 2.50

change 6.50 8.25 7.75

> dist(tgen, method="minkowski", p=3)

steepup up down

up 5.585276

down 6.465423 1.285641

change 5.004995 3.872614 4.111141
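In the Mahalanobis output above, every pair of genes is at distance 2.449490. One plausible explanation, offered here as a sketch rather than as the authors' reasoning: S is estimated from only four genes, so the 4 × 4 covariance matrix has rank 3 and a generalized inverse is used, and in that situation all pairwise distances collapse to sqrt(2(4 - 1)) = sqrt(6). The code below uses an SVD-based pseudo-inverse in base R as a stand-in for `invgen()`; `pinv` is a hypothetical helper.

```r
tgen <- rbind(
  steepup = c(2, 4, 5, 6),
  up      = c(0.5, 1, 1.25, 1.5),
  down    = c(1.5, 1, 0.75, 0.5),
  change  = c(2.5, 3.5, 4.5, 1)
)

# Hypothetical helper: Moore-Penrose pseudo-inverse via SVD.
pinv <- function(A, tol = 1e-8) {
  s <- svd(A)
  keep <- s$d > tol * max(s$d)
  # V %*% D^{-1} %*% t(U), restricted to the retained singular values
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

S_inv <- pinv(cov(tgen))   # cov(tgen) is 4 x 4 but has rank 3 only

maha <- function(x, y) drop(sqrt(t(x - y) %*% S_inv %*% (x - y)))

pairs <- combn(rownames(tgen), 2)
maha_all <- apply(pairs, 2, function(p) maha(tgen[p[1], ], tgen[p[2], ]))
names(maha_all) <- paste(pairs[1, ], pairs[2, ], sep = "-")
print(maha_all)
```

If this explanation is right, the Mahalanobis distance carries no information for clustering in this toy example; a covariance matrix estimated from many more subjects than variables would be needed.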

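6.3 Clustering methods

6.3.1 Hierarchical clustering

The most widely used method for microarrays.

① Divisive method:

- Starts from a single cluster and repeatedly splits clusters in two until each cluster finally contains a single subject.

② Agglomerative method:

- Starts from individual clusters and keeps merging them until everything forms a single cluster.

- At every step it is determined which higher-level cluster each lower-level cluster belongs to. Hence, if there are initially p genes, the total number of steps needed to determine the cluster structure is p-1.

Algorithm of the agglomerative method: one-way clustering (of genes)

Step 1: each of the p genes forms its own cluster, giving p clusters.

Step 2: compute the distances among the p clusters and merge the two closest clusters.

Step 3: compute the distances between the newly formed cluster and the existing clusters, and merge the two closest clusters.

Step 4: repeat Step 3 until a single cluster is formed.

Even when the actual data do not have a hierarchical structure, the hierarchical method unconditionally merges the two closest clusters at each of the p-1 steps. In addition, once two subjects have been placed in the same cluster, they are never separated again.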
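The agglomerative procedure can be sketched with `hclust()` from base R, here applied to the four trend profiles from the distance example (p = 4 genes, so p - 1 = 3 merges). The choice of Euclidean distance and average linkage is an assumption made only for this illustration.

```r
genes <- rbind(
  steepup = c(2, 4, 5, 6),
  up      = c(0.5, 1, 1.25, 1.5),
  down    = c(1.5, 1, 0.75, 0.5),
  change  = c(2.5, 3.5, 4.5, 1)
)

d  <- dist(genes)                      # Euclidean distances between genes
hc <- hclust(d, method = "average")    # agglomerative clustering

# Each row of hc$merge records one merge. Negative entries are single genes
# (row indices of `genes`); positive entries refer to earlier merge steps.
print(hc$merge)
print(hc$height)                       # distance at which each merge happened
n_steps <- nrow(hc$merge)
```

The first merge joins up and down (distance 1.5, the smallest entry of the distance matrix), and the merge table has exactly p - 1 rows.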

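□ Single linkage (minimum distance method)

□ Complete linkage (maximum distance method)

□ Average linkage (average distance method)

(1) Single linkage (minimum distance method)

Definition: all distances between the elements of two clusters G and H are computed, and the smallest of them is used.

$d(G, H) = \min_{i \in G,\, j \in H} d_{ij}$

Example 6.3

Single linkage for the case of five genes, using the correlation coefficient as the measure of similarity between genes.

Step 1. Form 4 clusters from the 5 clusters 1, 2, 3, 4, 5.

(Distance matrix over the clusters 1, 2, 3, 4, 5)

The closest pair, 3 and 5, is merged.

Step 2. Form 3 clusters from the 4 clusters 1, 2, (3,5), 4.

d((3,5),1) = min(d(3,1), d(5,1)) = min(0.3, 0.99) = 0.3

d((3,5),2) = min(d(3,2), d(5,2)) = min(0.7, 0.95) = 0.7

d((3,5),4) = min(d(3,4), d(5,4)) = min(0.9, 0.8) = 0.8

         (3,5)    1      2      4
(3,5)    0
1        0.3      0
2        0.7      0.9    0
4        0.8      0.6    0.5    0

(3,5) and 1, which correspond to the smallest value 0.3, are merged.

Step 3. Form 2 clusters from the 3 clusters (1,3,5), 2, 4.

d((1,3,5),2) = min(d((3,5),2), d(1,2)) = min(0.7, 0.9) = 0.7

d((1,3,5),4) = min(d((3,5),4), d(1,4)) = min(0.8, 0.6) = 0.6

           (1,3,5)    2      4
(1,3,5)    0
2          0.7        0
4          0.6        0.5    0

2 and 4, which correspond to the smallest value 0.5, are merged.

Step 4. Form 1 cluster from the 2 clusters (1,3,5), (2,4).

d((1,3,5),(2,4)) = min(d((1,3,5),2), d((1,3,5),4)) = min(0.7, 0.6) = 0.6

           (1,3,5)    (2,4)
(1,3,5)    0
(2,4)      0.6        0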
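The single-linkage steps can be replayed with `hclust()`. The sketch below rebuilds the 5 × 5 distance matrix from the pairwise values that appear in the worked examples of this section. Two points are assumptions: d(3,5) is not given anywhere (it is only known to be the smallest entry), so the value 0.2 is a placeholder, and the split of 0.9 and 0.8 between d(3,4) and d(5,4) follows the order in which the values are listed.

```r
labels <- as.character(1:5)
D <- matrix(c(
  0,    0.90, 0.30, 0.60, 0.99,
  0.90, 0,    0.70, 0.50, 0.95,
  0.30, 0.70, 0,    0.90, 0.20,   # d(3,5) = 0.20 is an assumed placeholder
  0.60, 0.50, 0.90, 0,    0.80,
  0.99, 0.95, 0.20, 0.80, 0
), nrow = 5, byrow = TRUE, dimnames = list(labels, labels))

hc_single <- hclust(as.dist(D), method = "single")

print(hc_single$merge)    # negative entries are original subjects
print(hc_single$height)   # merge distances
```

Apart from the assumed first height, the merge heights 0.3, 0.5 and 0.6 and the merge order (3,5), then 1, then (2,4) match the steps of the example.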

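(2) Complete linkage (maximum distance method)

All distances between the elements of two clusters G and H are computed, and the largest of them is used. That is,

$d(G, H) = \max_{i \in G,\, j \in H} d_{ij}$

Drawback: it does not guarantee that the similarity among subjects merged into one cluster is always higher than their similarity to subjects in other clusters.

<Example>

Consider complete linkage with the same data as for single linkage.

The shortest distance at the start is again the one between 3 and 5.

Next, the distance from (3,5) to each remaining subject is computed, and the closest pair is again merged into one group.

That is, treating (3,5) as one group, the distances to the other subjects are

d((3,5),1) = max(d(3,1), d(5,1)) = max(0.3, 0.99) = 0.99

d((3,5),2) = max(d(3,2), d(5,2)) = max(0.7, 0.95) = 0.95

d((3,5),4) = max(d(3,4), d(5,4)) = max(0.9, 0.8) = 0.9

         (3,5)    1      2      4
(3,5)    0
1        0.99     0
2        0.95     0.9    0
4        0.9      0.6    0.5    0

The smallest distance in this matrix is 0.5, so 2 and 4 are merged.

d((3,5),(2,4)) = max(d((3,5),2), d((3,5),4)) = max(0.95, 0.9) = 0.95

d(1,(2,4)) = max(d(1,2), d(1,4)) = max(0.9, 0.6) = 0.9

         (3,5)    (2,4)    1
(3,5)    0
(2,4)    0.95     0
1        0.99     0.9      0

The smallest distance in this matrix is 0.9, so (2,4) and 1 are merged. Finally, (3,5) and (1,2,4) are merged at distance max(0.95, 0.99) = 0.99.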

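(3) Average linkage (average distance method)

All distances between the elements of two clusters G and H are computed, and their mean is defined as the distance.

$d(G, H) = \frac{1}{n_G n_H} \sum_{i \in G} \sum_{j \in H} d_{ij}$

Here $n_G$ and $n_H$ are the numbers of elements in the respective clusters.

It is a compromise between the complete and single linkage methods.

Property: if the measurement unit of the data is changed by a monotone increasing transformation, the clustering result changes. Complete and single linkage, in contrast, always give the same clustering result under such a transformation, because the ordering of the values does not change.

<Example>

Consider average linkage with the same data as for single linkage.

The shortest distance at the start is again the one between 3 and 5. Treating (3,5) as one group, the distances to the other subjects are

d((3,5),1) = avg(d(3,1), d(5,1)) = avg(0.3, 0.99) = 0.645

d((3,5),2) = avg(d(3,2), d(5,2)) = avg(0.7, 0.95) = 0.825

d((3,5),4) = avg(d(3,4), d(5,4)) = avg(0.9, 0.8) = 0.85

         (3,5)    1      2      4
(3,5)    0
1        0.645    0
2        0.825    0.9    0
4        0.85     0.6    0.5    0

The smallest distance in this matrix is 0.5, so 2 and 4 are merged.

d((3,5),1) = avg(0.3, 0.99) = 0.645

d((3,5),(2,4)) = avg(0.7, 0.9, 0.95, 0.8) = 0.8375

d(1,(2,4)) = avg(0.9, 0.6) = 0.75

         (3,5)     (2,4)    1
(3,5)    0
(2,4)    0.8375    0
1        0.645     0.75     0

The smallest distance in this matrix is 0.645, so (3,5) and 1 are merged.

<Comparison of the three linkage methods>

1. If the similarity within each cluster is relatively greater than the similarity to subjects in other clusters, all three methods give the same result.

2. With single linkage, the minimum distance converges to 0 as n grows.

With complete linkage, the distance becomes arbitrarily large as n grows.

For an infinite population with infinitely many elements, single and complete linkage therefore fail to reflect the cluster structure well because of the nature of the distance itself.

Average linkage, on the other hand, is built on the notion of a mean and therefore yields similar distances even as n grows.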
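The three linkage methods can be compared side by side on the five-subject example. The sketch below is self-contained and uses the pairwise distances that appear in the worked examples; as before, d(3,5) = 0.2 is an assumed placeholder because only its rank (smallest entry) is known.

```r
labels <- as.character(1:5)
D <- matrix(c(
  0,    0.90, 0.30, 0.60, 0.99,
  0.90, 0,    0.70, 0.50, 0.95,
  0.30, 0.70, 0,    0.90, 0.20,   # d(3,5) = 0.20 is an assumed placeholder
  0.60, 0.50, 0.90, 0,    0.80,
  0.99, 0.95, 0.20, 0.80, 0
), nrow = 5, byrow = TRUE, dimnames = list(labels, labels))

methods <- c("single", "complete", "average")
fits <- lapply(methods, function(m) hclust(as.dist(D), method = m))
names(fits) <- methods

# Merge heights per method (one column per method).
heights <- sapply(fits, function(h) h$height)
print(round(heights, 4))

# Two-cluster solutions: the methods need not agree.
two_clusters <- sapply(fits, cutree, k = 2)
print(two_clusters)
```

With these distances, single and average linkage split the subjects into (1,3,5) and (2,4), whereas complete linkage gives (3,5) and (1,2,4), which illustrates that the first comparison point above only holds when the clusters are well separated.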

< reference "Analysis of Microarray gene expression data">

(4) Centroid Linkage Method

<Defines the distance between the centroid of cluster U and the centroid of cluster V as the distance between the two clusters, and merges the most similar clusters step by step.>

The centroid linkage method uses the average value of all points in a cluster (i.e., the cluster centroid) as the reference point for distances to other points or clusters.

The distance between two clusters is defined as the Euclidean distance between the centroids of the cluster pair.

The process proceeds by combining clusters according to the distance between their centroids, the clusters with the shortest distance being combined first.

A disadvantage of the centroid method is that if the sizes of the two clusters to be considered are very different, then the centroid of the new cluster will be very close to that of the larger cluster.

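(5) Median Linkage Clustering

The median linkage method uses the median distance between pairs of points in different clusters as the inter-cluster distance measure.

<Example>

Consider median linkage with the same data as for single linkage.

The shortest distance at the start is again the one between 3 and 5. Treating (3,5) as one group, the distances to the other subjects are

d((3,5),1) = med(d(3,1), d(5,1)) = med(0.3, 0.99) = 0.645

d((3,5),2) = med(d(3,2), d(5,2)) = med(0.7, 0.95) = 0.825

d((3,5),4) = med(d(3,4), d(5,4)) = med(0.9, 0.8) = 0.85

         (3,5)    1      2      4
(3,5)    0
1        0.645    0
2        0.825    0.9    0
4        0.85     0.6    0.5    0

The smallest distance in this matrix is 0.5, so 2 and 4 are merged.

d((3,5),1) = med(0.3, 0.99) = 0.645

d((3,5),(2,4)) = med(0.7, 0.9, 0.95, 0.8) = 0.85

d(1,(2,4)) = med(0.9, 0.6) = 0.75

         (3,5)    (2,4)    1
(3,5)    0
(2,4)    0.85     0
1        0.645    0.75     0

The smallest distance in this matrix is 0.645, so (3,5) and 1 are merged.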

(6) Ward's clustering Method

Ward proposed a clustering procedure that minimizes the information loss associated with clustering. Ward used an error sum-of-squares criterion to define information loss. At each step, the union of every possible pair of clusters is considered, and the two clusters whose fusion results in the smallest increase in "information loss" are combined.

6.3.2 K-means clustering

□ A method that quantifies the dissimilarity between two subjects and forms K clusters in a top-down manner.

MacQueen (1967) introduced a non-hierarchical clustering technique called the K-means method. This method assigns each object to the cluster having the nearest centroid. In applying the K-means clustering method, the total number of clusters, K, is specified in advance of applying the clustering procedure. Because a proximity matrix does not have to be built and the basic data do not have to be stored during the computer run, the K-means method can be applied to much larger data sets than hierarchical techniques.

The basic steps in this clustering method are:

Step 1: Select a set of K points as cluster seeds. These seeds represent a first guess at the centroids of the K clusters.

Step 2: Assign each individual observation to the cluster whose centroid is nearest. The Euclidean distance is usually used as the distance measure, with either standardized or unstandardized observations. The centroids are recalculated for the cluster receiving the new object and for the cluster losing the object.

Step 3: Repeat Step 2 until no further changes occur in the cluster compositions.

K-means clustering does not give an ordering of objects within a cluster. The final assignment of clusters will be somewhat dependent on the initial selection of seed points. As the number of clusters K is changed, the cluster memberships can also change in arbitrary ways. For example, the solution for K=4 clusters may not be nested within the K=3 cluster solution.

ex)

Suppose we measure two variables $X_1$ and $X_2$ for each of four items A, B, C and D. The data are given in the following table:

        Observations
Item    x1     x2
A        5      3
B       -1      1
C        1     -2
D       -3     -2

The objective is to divide these items into K = 2 clusters such that the items within a cluster are closer to one another than they are to the items in different clusters. To implement the K = 2-means method, we arbitrarily partition the items into two clusters, such as (AB) and (CD), and compute the coordinates $(\bar{x}_1, \bar{x}_2)$ of the cluster centroid (mean). Thus, at Step 1, we have

           Coordinates of centroid
Cluster    x1-bar                  x2-bar
(AB)       (5 + (-1))/2 = 2        (3 + 1)/2 = 2
(CD)       (1 + (-3))/2 = -1       (-2 + (-2))/2 = -2

At Step 2, we compute the Euclidean distance of each item from the group centroids and reassign each item to the nearest group. If an item is moved from the initial configuration, the cluster centroids (means) must be updated before proceeding. We compute the squared distances

$d^2(A, (AB)) = (5 - 2)^2 + (3 - 2)^2 = 10$

$d^2(A, (CD)) = (5 + 1)^2 + (3 + 2)^2 = 61$

Since A is closer to cluster (AB) than to cluster (CD), it is not reassigned. Continuing, we get

$d^2(B, (AB)) = (-1 - 2)^2 + (1 - 2)^2 = 10$

$d^2(B, (CD)) = (-1 + 1)^2 + (1 + 2)^2 = 9$

and, consequently, B is reassigned to cluster (CD), giving cluster (BCD) and the following updated coordinates of the centroids:

           Coordinates of centroid
Cluster    x1-bar    x2-bar
(A)         5         3
(BCD)      -1        -1

Again, each item is checked for reassignment. Computing the squared distances gives the following:

           Squared distances to group centroids
           Item
Cluster    A      B      C      D
A          0      40     41     89
(BCD)      52     4      5      5

We see that each item is currently assigned to the cluster with the nearest centroid (mean), and the process stops. The final clusters are A and (BCD).
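The example can be replayed in base R. The sketch below is an illustration under one stated difference: `kmeans()` with `algorithm = "Lloyd"` reassigns all items in one batch per iteration, whereas the text moves one item at a time; for these data both routes end in the same clusters. The helper `sq_dist` is hypothetical.

```r
items <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))
colnames(items) <- c("x1", "x2")

# Initial centroids of the arbitrary partition (AB) and (CD).
init <- rbind(AB = colMeans(items[c("A", "B"), ]),
              CD = colMeans(items[c("C", "D"), ]))

# Hypothetical helper: squared Euclidean distance from every item (columns)
# to every centroid (rows).
sq_dist <- function(points, centers) {
  sapply(rownames(points), function(i) {
    point <- matrix(points[i, ], nrow(centers), ncol(points), byrow = TRUE)
    rowSums((centers - point)^2)
  })
}
print(sq_dist(items, init))

# Batch K-means started from the same two centroids.
km <- kmeans(items, centers = init, algorithm = "Lloyd")
print(km$cluster)
print(km$centers)

final_table <- sq_dist(items, km$centers)
print(final_table)
```

The last table reproduces the squared distances 0, 40, 41, 89 and 52, 4, 5, 5 of the example, with A alone in one cluster and B, C, D in the other.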

6.3.3 SOM (Self-Organizing Maps) clustering

● Comparison of SOMs and K-Means clustering

SOMs work somewhat like K-Means clustering but are a little richer. With K-Means, you choose the number of clusters to fit the data into. For a SOM you choose the shape and size of a network of clusters to fit the data into. In a SOM, we call these clusters 'nodes'.

Like K-Means, a SOM initially populates its nodes or clusters by randomly sampling the data (or randomly generating points in the data space, depending on the initialization option you choose), and then refines the nodes in a systematic fashion. Unlike K-Means clustering, however, a SOM will not force there to be exactly as many clusters as there are nodes, because it is possible for a node to end up without any associated cluster items when the map is complete. A further difference with K-Means clustering is that the SOM automatically provides some information on the similarity between nodes - i.e., how strongly the certain nodes resemble each other.

● Outline 1

Tamayo et al., Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, PNAS Vol. 96, pp. 2907, 1999

Method:

Choose a geometry of "nodes" (e.g., a 6 by 5 grid).

The nodes are mapped into k-dimensional "gene expression" space (k = number of conditions), initially at random, and then iteratively adjusted.

Each iteration involves randomly selecting a data point $P$ and moving the nodes in the direction of $P$. The closest node $N_P$ is moved the most, whereas other nodes are moved by smaller amounts depending on their distance from $N_P$ in the initial geometry.

The position of node $N$ at iteration $i$ is denoted $f_i(N)$. The initial mapping $f_0$ is random. On subsequent iterations, a data point $P$ is selected and the node $N_P$ that maps nearest to $P$ is identified. The mapping of nodes is then adjusted by moving points toward $P$:

$f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i)\,(P - f_i(N))$.

Here $\tau$ is the learning rate, which decreases with the distance $d(N, N_P)$ in the initial geometry and with the iteration number $i$, and is zero for nodes farther from $N_P$ than the radius $\rho(i)$.

The radius $\rho(i)$ decreases linearly with $i$ and eventually becomes zero.

$T$: maximum number of iterations

● Outline 2

Kohonen introduced this method, and Tamayo et al. first applied it to gene expression data. Self-organizing maps are constructed as follows. k is fixed and some topology on the centers is assumed. One chooses a grid, k = l×m, of nodes, and a distance function between nodes, D(N1,N2). Each of the grid nodes is mapped into the n-dimensional data space, at random. The gene vectors are mapped into this space as well. As the algorithm proceeds, the grid nodes are iteratively adjusted. Each iteration involves randomly selecting a data point P and moving the grid nodes in the direction of P. The closest node nP is moved the most, whereas other nodes are moved by smaller amounts depending on their distance from nP in the initial geometry of the grid. In this fashion, neighboring points in the initial geometry tend to be mapped to nearby points in the n-dimensional space. The process continues iteratively.

Self-organizing maps:

1. Input: an n-dimensional vector for each element (data point) p.

2. Start with a grid of k = l × m nodes, and a random n-dimensional associated vector f0(v) for each grid node v, representing the initial associated center.

3. Iteration i:

Pick a data point p. Find a grid node np such that fi(np) is the closest to p.

Update all node vectors v as follows:

fi+1(v) = fi(v) + H(D(np, v), i)[p - fi(v)]

where H is a learning function which decreases with the number of iterations (i), as well as with D(np, v), i.e., nodes that are farther from np are less affected.

4. Repeat until no improvement is possible.

The clusters are defined by the grid nodes. We assign each point (gene vector) to its nearest node np (cluster). The movement of a center is affected not only by the elements of its own cluster. Note that the number of clusters, k, is set a priori in this method.

● Detail

The SOM algorithm

The basic idea of SOM is simple yet effective. The SOM defines a mapping from high dimensional input data space onto a regular two-dimensional array of neurons. Every neuron i of the map is associated with an n-dimensional reference vector $m_i = [m_{i1}, \dots, m_{in}]^T$, where n denotes the dimension of the input vectors. The reference vectors together form a codebook. The neurons of the map are connected to adjacent neurons by a neighbourhood relation, which dictates the topology, or the structure, of the map. The most common topologies in use are rectangular and hexagonal.

Adjacent neurons belong to the neighbourhood Ni of the neuron i. In the basic SOM algorithm, the topology and the number of neurons remain fixed from the beginning. The number of neurons determines the granularity of the mapping, which has an effect on the accuracy and generalization of the SOM.

During the training phase, the SOM forms an elastic net that folds onto the "cloud" formed by input data. The algorithm controls the net so that it strives to approximate the density of the data. The reference vectors in the codebook drift to the areas where the density of the input data is high. Eventually, only few codebook vectors lie in areas where the input data is sparse.

The learning process of the SOM goes as follows:

1. One sample vector x is randomly drawn from the input data set and its similarity (distance) to the codebook vectors is computed by using, e.g., the common Euclidean distance measure. The best-matching unit (BMU) is the unit c whose codebook vector is closest to x:

$\|x - m_c\| = \min_i \{ \|x - m_i\| \}$

2. After the BMU has been found, the codebook vectors are updated. The BMU itself as well as its topological neighbours are moved closer to the input vector in the input space, i.e., the input vector attracts them. The magnitude of the attraction is governed by the learning rate. As the learning proceeds and new input vectors are given to the map, the learning rate gradually decreases to zero according to the specified learning rate function type. Along with the learning rate, the neighbourhood radius decreases as well.

The update rule for the reference vector of unit i is the following:

$m_i(t+1) = m_i(t) + \alpha(t)\,[x(t) - m_i(t)], \quad i \in N_c(t)$

$m_i(t+1) = m_i(t), \quad i \notin N_c(t)$

3. Steps 1 and 2 together constitute a single training step, and they are repeated until the training ends. The number of training steps must be fixed prior to training the SOM because the rate of convergence in the neighbourhood function and the learning rate is calculated accordingly.

After the training is over, the map should be topologically ordered. This means that n topologically close (using some distance measure e.g. Euclidean) input data vectors map to n adjacent map neurons or even to the same single neuron.
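The learning process described above can be sketched in a few lines of base R. This is an illustration only, not the implementation of any SOM package: the data are simulated, the 3 × 3 rectangular map, the linear decay of the learning rate and radius, and the starting values `alpha0` and `radius0` are all assumptions.

```r
set.seed(42)
# Toy data: two well-separated 2-D clouds (simulated for illustration only).
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))

# 3 x 3 rectangular map; `grid` holds the fixed map coordinates of each unit.
grid <- as.matrix(expand.grid(gx = 1:3, gy = 1:3))
codebook <- X[sample(nrow(X), nrow(grid)), ]   # initialise from random samples

n_steps <- 2000
alpha0  <- 0.5    # initial learning rate (assumed value)
radius0 <- 2      # initial neighbourhood radius (assumed value)

for (t in seq_len(n_steps)) {
  frac   <- 1 - (t - 1) / n_steps          # decays linearly from 1 towards 0
  alpha  <- alpha0 * frac
  radius <- radius0 * frac

  x <- X[sample(nrow(X), 1), ]             # step 1: draw one input vector
  dist_to_x <- rowSums(sweep(codebook, 2, x)^2)
  bmu <- which.min(dist_to_x)              # best-matching unit c

  # Units whose map distance to the BMU is within the radius form N_c(t).
  map_dist <- sqrt(rowSums(sweep(grid, 2, grid[bmu, ])^2))
  in_nbhd  <- map_dist <= radius

  # step 2: m_i(t+1) = m_i(t) + alpha(t) * (x - m_i(t)) for i in N_c(t)
  codebook[in_nbhd, ] <- codebook[in_nbhd, , drop = FALSE] +
    alpha * sweep(-codebook[in_nbhd, , drop = FALSE], 2, x, FUN = "+")
}

# Assign every observation to its nearest unit after training.
assign_unit <- apply(X, 1, function(x) which.min(rowSums(sweep(codebook, 2, x)^2)))
print(table(assign_unit))
```

Units outside the neighbourhood are left unchanged in each step, which corresponds to the second line of the update rule.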

● Process of SOM

[Figure: process of SOM]

● Example 1 (Iris data)

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

> dim(iris)

[1] 150 5

> library(som)
>
> xl <- 5
> yl <- 6
>
> foo <- som(normalize(iris[,1:4]), xdim=xl, ydim=yl, rlen=c(1000,5000)) # run the SOM
> plot(foo) # plot the SOM result
> tt <- cbind(foo$visual,iris[,5]) # distribution of topological positions by iris species

> table(tt[c(2,1,4)])

, , iris[, 5] = setosa

x

y 0 1 2 3 4

0 0 0 1 1 48

2 0 0 0 0 0

3 0 0 0 0 0

4 0 0 0 0 0

5 0 0 0 0 0

, , iris[, 5] = versicolor

x

y 0 1 2 3 4

0 0 0 0 0 0

2 2 2 0 0 0

3 6 10 0 0 0

4 4 4 0 1 0

5 3 2 8 5 3

, , iris[, 5] = virginica

x

y 0 1 2 3 4

0 0 0 0 0 0

2 0 0 0 0 0

3 0 0 0 0 0

4 0 0 0 0 0

5 50 0 0 0 0

> tt1 <- apply(ftable(tt[c(2,1,4)]),1,sum) # distribution of iris species by topological position

> index <- which(!tt1==0)

>

> tt2 <- ftable(tt[c(2,1,4)])[index,]

> row.names(tt2) <-paste(0:(xl-1),rep(0:(yl-1),rep(xl,yl)))[index]

>

> diris<- foo$code[index,]
> row.names(diris)<-paste(0:(xl-1),rep(0:(yl-1),rep(xl,yl)))[index]
> diris <- dist(diris)
> plot(hclust(diris)) # clustering of the nodes (hierarchical clustering, complete linkage)

> tt2[hclust(diris)$order,]

[,1] [,2] [,3]

2 0 1 0 0

3 0 1 0 0

4 0 48 0 0

0 4 0 3 50

1 4 0 2 0

1 3 0 4 0

0 3 0 4 0

2 4 0 8 0

0 2 0 6 0

3 4 0 5 0

1 2 0 10 0

4 4 0 3 0

1 1 0 2 0

0 1 0 2 0

3 3 0 1 0


● Example 2 (Yeast data)

The yeast data frame has 6601 rows and 18 columns: a gene-name column followed by expression values for 6601 genes measured at 17 time points.

Source

http://genomics.stanford.edu

References

Tamayo et al. (1999), Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, PNAS V96, pp. 2907-2912, March 1999.

> data(yeast)

> dim(yeast)

[1] 6601 18

> head(yeast)

Gene zero ten twenty thirty fourty fifty sixty seventy eighty ninety

1 18srRnaa 22 38 41 43 23 29 25 20 17 98

2 18srRnab 5 9 -13 -9 -14 -13 -11 -18 -1 -18

3 18srRnac 3 -2 13 5 6 5 -3 -1 -6 37

4 18srRnad 3 3 9 8 -4 2 -7 -2 -1 18

5 18srRnae 9 12 24 13 5 9 1 4 8 63

6 25srRnaa 11 24 52 30 164 67 104 31 19 346

hundred one.ten one.twenty one.thirty one.fourty one.fifty one.sixty

1 46 27 23 38 27 28 287

2 9 -8 -15 -6 -19 -35 150

3 8 -3 -3 7 7 0 182

4 9 2 0 3 5 15 80

5 16 6 5 23 16 21 147

6 97 51 28 100 61 45 448

> yeast.1 <- normalize(yeast[,-1])

> rownames(yeast.1) <- yeast[,1]

> colnames(yeast.1) <- names(yeast)[-1]

> head(yeast.1)

zero ten twenty thirty fourty fifty

18srRnaa -0.4203440 -0.1707074 -0.12390052 -0.09269594 -0.4047417 -0.3111280

18srRnab 0.1343007 0.2335665 -0.31239515 -0.21312940 -0.3372116 -0.3123952

18srRnac -0.2719121 -0.3852088 -0.04531869 -0.22659344 -0.2039341 -0.2265934


18srRnad -0.2771141 -0.2771141 0.03012110 -0.02108477 -0.6355551 -0.3283200

18srRnae -0.3839521 -0.2984431 0.04359281 -0.26994009 -0.4979640 -0.3839521

25srRnaa -0.7296869 -0.6215308 -0.38857909 -0.57161257 0.5432277 -0.2637835

sixty seventy eighty ninety hundred one.ten

18srRnaa -0.37353712 -0.4515486 -0.49835542 0.7654299 -0.04588908 -0.3423325

18srRnab -0.26276228 -0.4364773 -0.01459790 -0.4364773 0.23356647 -0.1883130

18srRnac -0.40786819 -0.3625495 -0.47584622 0.4985056 -0.15861541 -0.4078682

18srRnad -0.78917274 -0.5331434 -0.48193755 0.4909739 0.03012110 -0.3283200

18srRnae -0.61197598 -0.5264670 -0.41245504 1.1552094 -0.18443112 -0.4694610

25srRnaa 0.04404549 -0.5632929 -0.66312931 2.0574137 -0.01419244 -0.3968988

one.twenty one.thirty one.fourty one.fifty one.sixty

18srRnaa -0.4047417 -0.17070738 -0.3423325 -0.32673026 3.714262

18srRnab -0.3620280 -0.13868009 -0.4612938 -0.85835677 3.732684

18srRnac -0.4078682 -0.18127475 -0.1812748 -0.33989016 3.784110

18srRnad -0.4307317 -0.27711409 -0.1747024 0.33735628 3.665737

18srRnae -0.4979640 0.01508982 -0.1844311 -0.04191616 3.549461

25srRnaa -0.5882520 0.01076668 -0.3137018 -0.44681702 2.906023

>

> som.fit <- som(as.data.frame(yeast.1), xdim=5, ydim=6,rlen=c(300,1500))

> plot(som.fit)

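> yeast.2 <- filtering(yeast[,-c(1,11)])
> yeast.2 <- normalize(yeast.2)
> dim(yeast.2)
[1] 760 16
> som.fit <- som(as.data.frame(yeast.2), xdim=5, ydim=6)
> plot(som.fit)

6.4 Comparison of clustering algorithms and assessment of the reliability of clustering results

(1) Comparison of clustering algorithms

Method                    Advantages                                        Disadvantages
Hierarchical              Fast computing time                               A hierarchical structure that was built wrongly once is hard to correct later
K-means, SOM              Provides clusters that approximately satisfy      Choosing the initial value K is not easy; long computing time;
                          the specified optimality criterion                a unique result is not guaranteed
Projection (PCA, etc.)    Reduces the gene dimension                        The new artificial genes created by dimension reduction
                                                                            (e.g., super genes) are hard to interpret

Table 6.12 Advantages and disadvantages of clustering algorithms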

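(2) Assessing the reliability of clustering results

After clustering, the silhouette width is used as a measure for assessing the number of clusters and the reliability of the clustering result.

$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$

Here, $a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \ne i} d(i, j)$ is the average of the distances between the $i$-th gene (or sample) and the other genes (samples) in the cluster $C(i)$ to which $i$ belongs. $b(i) = \min_{C \ne C(i)} d(i, C)$ is the smallest of the average distances from $i$ to the subjects in each of the other clusters. $s(i)$ lies between -1 and 1, and values close to 1 indicate that subject $i$ is well placed in its cluster.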
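The silhouette width can be computed directly from a distance matrix. The following base-R sketch is a hand-written illustration of the definition above; the helper `silhouette_width`, the toy data, and the convention s(i) = 0 for a cluster with a single member are assumptions of this example.

```r
# Hypothetical helper: silhouette width s(i) for every subject.
silhouette_width <- function(d, cluster) {
  d <- as.matrix(d)
  n <- nrow(d)
  sapply(seq_len(n), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE                       # exclude the subject itself
    if (!any(own)) return(0)              # convention for singleton clusters
    a <- mean(d[i, own])                  # average distance within own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # smallest average distance to any other cluster
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

x  <- c(0, 1, 10, 11)           # four subjects on a line (toy data)
cl <- c(1, 1, 2, 2)             # a candidate clustering
s  <- silhouette_width(dist(x), cl)
print(round(s, 4))
print(mean(s))                  # average silhouette width of the clustering

# A poor clustering of the same subjects gives a much lower average.
s_bad <- silhouette_width(dist(x), c(1, 2, 1, 2))
print(mean(s_bad))
```

Comparing the average silhouette width across different numbers of clusters is one way to use this measure for choosing the number of clusters.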

6.5 R programs for clustering analysis

(1) ExpressionSet class

- An ExpressionSet usually consists of several conceptually distinct parts: assay data, phenotypic meta-data, feature annotations and meta-data, and a description of the experiment.

Assay data: a matrix of 'expression' values with F rows and S columns, where F is the number of features on the chip and S is the number of samples.

Phenotypic data: phenotypic data summarizes information about the samples (e.g., sex, age, and treatment status; referred to as 'covariates'). The information describing the samples can be represented as a table with S rows and V columns, where V is the number of covariates.

Annotations and feature data: provide a character string identifying the type of chip used in the experiment. It is also possible to record information about features that are unique to the experiment.

Experiment description: basic description about the experiment (e.g., the investigator or lab where the experiment was done, an overall title, and other notes).

- Assembling an ExpressionSet in R:

new("ExpressionSet", exprs=assay data, phenoData=phenotypic data, experimentData=experiment description, annotation=annotation)

example

library(Biobase)

data(geneCov); data(geneData)

> head(geneCov)

cov1 cov2 cov3

A 1 1 1

B 1 1 1

C 1 1 1

D 1 1 1

E 1 2 1

F 1 2 1

> head(geneData)

A B C D E F

AFFX-MurIL2_at 192.7420 85.75330 176.7570 135.5750 64.49390 76.3569

AFFX-MurIL10_at 97.1370 126.19600 77.9216 93.3713 24.39860 85.5088

AFFX-MurIL4_at 45.8192 8.83135 33.0632 28.7072 5.94492 28.2925

AFFX-MurFAS_at 22.5445 3.60093 14.6883 12.3397 36.86630 11.2568

AFFX-BioB-5_at 96.7875 30.43800 46.1271 70.9319 56.17440 42.6756

AFFX-BioB-M_at 89.0730 25.84610 57.2033 69.9766 49.58220 26.1262

G H I J K L

AFFX-MurIL2_at 160.5050 65.9631 56.9039 135.60800 63.44320 78.2126

AFFX-MurIL10_at 98.9086 81.6932 97.8015 90.48380 70.57330 94.5418

AFFX-MurIL4_at 30.9694 14.7923 14.2399 34.48740 20.35210 14.1554

AFFX-MurFAS_at 23.0034 16.2134 12.0375 4.54978 8.51782 27.2852

AFFX-BioB-5_at 86.5156 30.7927 19.7183 46.35200 39.13260 41.7698

AFFX-BioB-M_at 75.0083 42.3352 41.1207 91.53070 39.91360 49.8397

M N O P Q R

AFFX-MurIL2_at 83.0943 89.3372 91.0615 95.9377 179.8450 152.4670

AFFX-MurIL10_at 75.3455 68.5827 87.4050 84.4581 87.6806 108.0320

AFFX-MurIL4_at 20.6251 15.9231 20.1579 27.8139 32.7911 33.5292

AFFX-MurFAS_at 10.1616 20.2488 15.7849 14.3276 15.9488 14.6753

AFFX-BioB-5_at 80.2197 36.4903 36.4021 35.3054 58.6239 114.0620

AFFX-BioB-M_at 63.4794 24.7007 47.4641 47.3578 58.1331 104.1220

S T U V W X

AFFX-MurIL2_at 180.83400 85.4146 157.98900 146.8000 93.8829 103.85500

AFFX-MurIL10_at 134.26300 91.4031 -8.68811 85.0212 79.2998 71.65520

AFFX-MurIL4_at 19.81720 20.4190 26.87200 31.1488 22.3420 19.01350

AFFX-MurFAS_at -7.91911 12.8875 11.91860 12.8324 11.1390 7.55564

AFFX-BioB-5_at 93.44020 22.5168 48.64620 90.2215 42.0053 57.57380

AFFX-BioB-M_at 115.83100 58.1224 73.42210 64.6066 40.3068 41.82090

Y Z

AFFX-MurIL2_at 64.4340 175.61500

AFFX-MurIL10_at 64.2369 78.70680

AFFX-MurIL4_at 12.1686 17.37800

AFFX-MurFAS_at 19.9849 8.96849

AFFX-BioB-5_at 44.8216 61.70440

AFFX-BioB-M_at 46.1087 49.41220

## Build the phenotypic data ##

covN <- c("Covariate 1; 2 levels", "Covariate 2; 2 levels", "Covariate 3; 3 levels")

metadata <- data.frame(labelDescription=covN, row.names=colnames(geneCov))

pD <- new("AnnotatedDataFrame", data = geneCov, varMetadata=metadata)

> pD

rowNames: A, B, ..., Z (26 total)

varLabels and varMetadata:

cov1: Covariate 1; 2 levels



## Build the ExpressionSet object ##

eSet <- new("ExpressionSet", exprs = geneData, phenoData = pD)

> eSet

ExpressionSet (storageMode: lockedEnvironment)

assayData: 500 features, 26 samples

element names: exprs

phenoData

rowNames: A, B, ..., Z (26 total)






featureData

featureNames: AFFX-MurIL2_at, AFFX-MurIL10_at, ..., 31739_at (500 total)

varLabels and varMetadata: none

experimentData: use 'experimentData(object)'

Annotation character(0)

## Reference commands ##

exprs(ExpressionSet): extracts the assay data.

phenoData(ExpressionSet): extracts the phenotypic data.

pData(ExpressionSet): extracts the covariate values.

sampleNames(ExpressionSet): extracts the sample (array) names.

featureNames(ExpressionSet): extracts the feature (gene) names.

varLabels(ExpressionSet): extracts the covariate names.

varMetadata(ExpressionSet): extracts the information about the covariates.

experimentData(ExpressionSet): extracts the experiment description.

annotation(ExpressionSet): extracts the annotation.

(2) toy example

data(sample.exprSet.1)

## Convert to an ExpressionSet ##

eset <- as(sample.exprSet.1, "ExpressionSet")

> eset

phenoData

sampleNames: A, B, ..., Z (26 total)

featureData

featureNames: AFFX-MurIL2_at, AFFX-MurIL10_at, ..., 31739_at (500 total)

Annotation [1] "hgu95"

## Extract the assay data ##

Exprs <- exprs(eset)

> head(Exprs)

A B C D E F

AFFX-MurIL2_at 192.7420 85.75330 176.7570 135.5750 64.49390 76.3569

AFFX-MurIL10_at 97.1370 126.19600 77.9216 93.3713 24.39860 85.5088

AFFX-MurIL4_at 45.8192 8.83135 33.0632 28.7072 5.94492 28.2925

AFFX-MurFAS_at 22.5445 3.60093 14.6883 12.3397 36.86630 11.2568

AFFX-BioB-5_at 96.7875 30.43800 46.1271 70.9319 56.17440 42.6756

AFFX-BioB-M_at 89.0730 25.84610 57.2033 69.9766 49.58220 26.1262

G H I J K L

AFFX-MurIL2_at 160.5050 65.9631 56.9039 135.60800 63.44320 78.2126

AFFX-MurIL10_at 98.9086 81.6932 97.8015 90.48380 70.57330 94.5418

AFFX-MurIL4_at 30.9694 14.7923 14.2399 34.48740 20.35210 14.1554

AFFX-MurFAS_at 23.0034 16.2134 12.0375 4.54978 8.51782 27.2852

AFFX-BioB-5_at 86.5156 30.7927 19.7183 46.35200 39.13260 41.7698

AFFX-BioB-M_at 75.0083 42.3352 41.1207 91.53070 39.91360 49.8397

M N O P Q R

AFFX-MurIL2_at 83.0943 89.3372 91.0615 95.9377 179.8450 152.4670

AFFX-MurIL10_at 75.3455 68.5827 87.4050 84.4581 87.6806 108.0320

AFFX-MurIL4_at 20.6251 15.9231 20.1579 27.8139 32.7911 33.5292

AFFX-MurFAS_at 10.1616 20.2488 15.7849 14.3276 15.9488 14.6753

AFFX-BioB-5_at 80.2197 36.4903 36.4021 35.3054 58.6239 114.0620

AFFX-BioB-M_at 63.4794 24.7007 47.4641 47.3578 58.1331 104.1220

S T U V W X

AFFX-MurIL2_at 180.83400 85.4146 157.98900 146.8000 93.8829 103.85500

AFFX-MurIL10_at 134.26300 91.4031 -8.68811 85.0212 79.2998 71.65520

AFFX-MurIL4_at 19.81720 20.4190 26.87200 31.1488 22.3420 19.01350

AFFX-MurFAS_at -7.91911 12.8875 11.91860 12.8324 11.1390 7.55564

AFFX-BioB-5_at 93.44020 22.5168 48.64620 90.2215 42.0053 57.57380

AFFX-BioB-M_at 115.83100 58.1224 73.42210 64.6066 40.3068 41.82090

Y Z

AFFX-MurIL2_at 64.4340 175.61500

AFFX-MurIL10_at 64.2369 78.70680

AFFX-MurIL4_at 12.1686 17.37800

AFFX-MurFAS_at 19.9849 8.96849

AFFX-BioB-5_at 44.8216 61.70440


AFFX-BioB-M_at 46.1087 49.41220

## Extract the gene names ##

gN <- featureNames(eset)

> head(gN)

[1] "AFFX-MurIL2_at" "AFFX-MurIL10_at" "AFFX-MurIL4_at"

[4] "AFFX-MurFAS_at" "AFFX-BioB-5_at" "AFFX-BioB-M_at"

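library(annotate) # library for working with annotation data

library("hgu95av2") # annotation for the Affymetrix Human 95 chip, version 2 (recent Bioconductor releases provide it as hgu95av2.db)

## Replace Affymetrix probe IDs with gene symbols where available ##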

syms <- getSYMBOL(gN, "hgu95av2")

> head(syms)

AFFX-MurIL2_at AFFX-MurIL10_at AFFX-MurIL4_at AFFX-MurFAS_at

NA NA NA NA

AFFX-BioB-5_at AFFX-BioB-M_at

NA NA

syms <- ifelse(is.na(syms),gN, syms)

> head(syms)

AFFX-MurIL2_at AFFX-MurIL10_at AFFX-MurIL4_at AFFX-MurFAS_at

"AFFX-MurIL2_at" "AFFX-MurIL10_at" "AFFX-MurIL4_at" "AFFX-MurFAS_at"

AFFX-BioB-5_at AFFX-BioB-M_at

"AFFX-BioB-5_at" "AFFX-BioB-M_at"

rownames(Exprs) <- syms
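The NA fallback used above can be tried on a small vector without loading any annotation package. This is only a sketch: the probe IDs and symbols below are made up for illustration and are not taken from hgu95av2.

```r
# Hypothetical probe IDs and a partial lookup result (NA = no symbol found).
probe_ids <- c("probe_1", "probe_2", "probe_3")
symbols   <- c("GENE_A", "GENE_B", NA)
names(symbols) <- probe_ids

# Keep the symbol where one exists, otherwise fall back to the probe ID.
labels <- ifelse(is.na(symbols), probe_ids, symbols)
print(labels)
```

Because control probes such as the AFFX ones have no gene symbol, this fallback keeps every row of the expression matrix labeled.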

dR <- dimnames(Exprs)[[1]][1:100]

dC <- dimnames(Exprs)[[2]]

nR <- length(dR)

nC <- length(dC)

temp <-Exprs[1:100,]

library(geneplotter)

image(1:nC,1:nR,scale(t(temp)), col=rev(greenred.colors(50)), axes=FALSE,

xlab="Array", ylab="Gene name", main="Expression Profile matrix Graph")


axis(1, at =(1:nC), label=dC, tick=TRUE)

axis(2, at =(1:nR), label=dR, tick=FALSE )

## greenred.colors ##

Returns colors along a green - black - red gradient.

heatmap(Exprs[1:50,], col=rev(dChip.colors(50)))

## dChip.colors ##

Returns colors along a blue - red gradient.

hv <- heatmap(Exprs[1:50,], col=rev(greenred.colors(50)))

hv

> hv

$rowInd

[1] 23 11 42 41 40 44 19 43 32 15 10 38 4 30 14 3 35 37 31 45 9 12

[23] 34 26 24 27 28 39 16 7 20 1 29 21 22 13 18 8 5 6 2 17 33 25

[45] 36 46 47 48 49 50

(3) Heatmap


$colInd

[1] 26 8 14 16 12 11 4 13 17 24 19 10 7 2 25 9 23 6 15 18 22 20

[23] 5 1 3 21
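The rowInd and colInd vectors printed above are the row and column permutations that heatmap() applies after clustering. A minimal sketch on a toy matrix (random values, not the data used in this chapter) shows how they can be used to reorder the matrix itself:

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 8,
            dimnames = list(paste0("g", 1:8), paste0("s", 1:5)))

pdf(NULL)                 # null device, so the sketch also runs non-interactively
hv <- heatmap(m)          # returns the orderings invisibly
invisible(dev.off())

# Rows and columns in dendrogram order.
reordered <- m[hv$rowInd, hv$colInd]
print(rownames(reordered))
```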

#### change the clustering linkage ####

slf <- function(d) hclust(d, method="single")

heatmap(Exprs[1:50,], col=rev(dChip.colors(50)), hclustfun=slf)


#### visualization of distance matrix ###

d1 <- dist(t(scale(Exprs)))

dN <- dimnames(Exprs)[[2]]

nS <- length(dN)

d1M <- as.matrix(d1)

image(1:nS,1:nS,d1M, col=rev(dChip.colors(50)), axes=FALSE, main="Distance between samples")

axis(1, at =(1:nS), label=dN, tick=FALSE )

axis(2, at =(1:nS), label=dN, tick=FALSE )

# 4) Load the Leukemia (AML / ALL) data

library(golubEsets)

(4) Leukemia(AML/ALL)


data(Golub_Train); data(Golub_Test); data(Golub_Merge);

> Golub_Train




phenoData

sampleNames: 1, 2, ..., 33 (38 total)


Samples: Sample index

ALL.AML: Factor, indicating ALL or AML

...: ...

Source: Source of sample

(11 total)

featureData

featureNames: AFFX-BioB-5_at, AFFX-BioB-M_at, ..., Z78285_f_at (7129 total)



pubMedIds: 10521349

Annotation [1] "hu6800"

> Golub_Test




phenoData





...: ...


(11 total)

featureData




pubMedIds: 10521349



> Golub_Merge




phenoData





...: ...


(11 total)

featureData




pubMedIds: 10521349


### Filtering and transforming the data ###

### Transformation function ###

GolubTrans <-function(eSet) {

X<-exprs(eSet)

X[X<100]<-100

X[X>16000]<-16000

X <- log2(X)

exprs(eSet) <- X

eSet

}

gTrn <- GolubTrans(Golub_Train)

gTest <- GolubTrans(Golub_Test)

gMerge <- GolubTrans(Golub_Merge)
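What GolubTrans does to the expression values can be checked on a small matrix. The sketch below uses hypothetical intensities and the base R functions pmax() and pmin() instead of the indexed assignments in the function above; the result should be the same.

```r
# Toy intensities (hypothetical values): two genes by three samples.
X <- matrix(c(20, 150, 30000,
              5000, -8, 16000), nrow = 2, byrow = TRUE)

X_clipped <- pmin(pmax(X, 100), 16000)   # floor at 100, then ceiling at 16000
X_log <- log2(X_clipped)                 # base-2 log transform
print(round(X_log, 3))
```

The floor removes negative and very small values, so the logarithm is always defined.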

### Filter function ###

mmfilt <- function(r=5, d=500, na.rm=TRUE) {

function(x) {

minval <- min(2^x, na.rm=na.rm)

maxval <- max(2^x, na.rm=na.rm)

(maxval/minval > r) && (maxval-minval > d)


}

}

library(genefilter)

mmfun <- mmfilt()

ffun <- filterfun(mmfun) ## wraps the filter(s) into a function of a single argument

sub <- genefilter(gTrn, ffun ) ## returns TRUE for each gene that passes the filter

sub[c(2401,3398,4168)] <- FALSE

sum(sub)

> sum(sub)

[1] 3051

gTrnS <- gTrn[sub,]

gTestS <- gTest[sub,]

gMergeS <- gMerge[sub,]
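The max/min rule inside mmfilt can be applied row by row with apply() alone. The following is a sketch on hypothetical values; keep_gene is an illustrative name, not a genefilter function.

```r
# Log2-scale toy matrix (hypothetical values): rows are genes, columns are samples.
X_log <- log2(rbind(flat  = c(100, 120, 110, 130),
                    ratio = c(100, 300, 450, 590),    # ratio 5.9, difference 490
                    both  = c(100, 900, 400, 2000)))  # ratio 20, difference 1900

# Same rule as mmfilt: max/min > r AND max - min > d, on the raw scale.
keep_gene <- function(x, r = 5, d = 500) {
  raw <- 2^x
  (max(raw) / min(raw) > r) && (max(raw) - min(raw) > d)
}

keep <- apply(X_log, 1, keep_gene)
print(keep)
```

Only the gene that satisfies both conditions is kept; a gene that passes the ratio test but not the difference test is dropped.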

## Extract the ALL / AML class labels ##

Ytr <-Golub_Train$ALL.AML

> Ytr

[1] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL

ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML

AML AML AML AML AML AML AML

Levels: ALL AML

Ytest <- Golub_Test$ALL.AML

> Ytest


ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML

AML AML AML AML

Levels: ALL AML

Ymerge <- Golub_Merge$ALL.AML

> Ymerge



ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML

AML AML AML AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL

ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML

AML AML AML AML AML AML AML AML AML AML

Levels: ALL AML

### 1 - rho and Euclidean distance ###

r <- cor(exprs(gTrnS))

dimnames(r) <- list(as.vector(Ytr),as.vector(Ytr))

d <-1-r

library(ellipse)

plotcorr(r, main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")

An ellipse oriented at 45° indicates a positive correlation, and one oriented at 135° indicates a negative correlation. The narrower the ellipse, the stronger the relationship.
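The two distances in this part are closely related. When every sample (column) is standardized, the squared Euclidean distance between two samples equals 2(n - 1)(1 - r), where n is the number of genes and r is their correlation. The sketch below checks this on random toy data, not on the Golub data.

```r
set.seed(42)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)   # 20 toy genes, 3 toy samples
n <- nrow(X)

d_cor <- 1 - cor(X)                 # correlation distance between samples
Z <- scale(X)                       # standardize each sample (column)
d_euc <- as.matrix(dist(t(Z)))      # Euclidean distance between samples

# For standardized columns: d_euc^2 = 2 * (n - 1) * (1 - r)
same <- all.equal(unname(d_euc^2), unname(2 * (n - 1) * d_cor))
print(same)
```

Clustering on 1 - r is therefore comparable to clustering standardized samples on squared Euclidean distance.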

(5) 1 - rho and Euclidean distance

## Filtering the data with a t-test ##

gtt <- ttest(gTrnS$ALL.AML, p=0.01) ## ttest() comes from library(genefilter)

gf1 <- filterfun(gtt)

whT <- genefilter(gTrnS, gf1)

> sum(whT)

[1] 609

gTrnSub <- gTrnS[whT,]
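The idea of the t-test filter can be sketched with base R only. Note the assumption: this uses t.test(), which by default is the Welch test, so its p-values may differ slightly from those of genefilter's ttest(). The data are simulated.

```r
set.seed(7)
grp <- factor(rep(c("ALL", "AML"), each = 10))   # toy class labels
X <- matrix(rnorm(5 * 20), nrow = 5)             # 5 toy genes, 20 toy samples
X[1, grp == "AML"] <- X[1, grp == "AML"] + 5     # gene 1 differs strongly between classes

# One p-value per gene (row); keep genes with p < 0.01.
pvals <- apply(X, 1, function(x) t.test(x ~ grp)$p.value)
keep <- pvals < 0.01
print(which(keep))
```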

## Correlation matrix for the set of 609 genes ##

rS <- cor(exprs(gTrnSub))

dimnames(rS) <- list(gTrnSub$ALL.AML, gTrnSub$ALL.AML)

dS <- 1-rS

plotcorr(rS, main="Leukemia data: Correlation matrix for 38 mRNA samples\n 609 genes")

(6) Explicit filtering and finding a gene subset

mds <- cmdscale(d, k=2, eig=TRUE)

plot(mds$points, type="n",main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=2")

text(mds$points[,1],mds$points[,2],labels=Ytr,col=as.integer(Ytr)+1, cex=0.8)
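The same MDS call can be tried on simulated samples to see how the configuration separates two groups. This is a sketch under stated assumptions: two toy expression profiles with added noise, four samples per group.

```r
set.seed(3)
prof_a <- rnorm(50)                       # toy expression profile, group A
prof_b <- rnorm(50)                       # toy expression profile, group B
X <- cbind(sapply(1:4, function(i) prof_a + rnorm(50, sd = 0.3)),
           sapply(1:4, function(i) prof_b + rnorm(50, sd = 0.3)))
grp <- factor(rep(c("A", "B"), each = 4))

d_toy <- 1 - cor(X)                       # correlation distance between samples
mds_toy <- cmdscale(d_toy, k = 2, eig = TRUE)

# First coordinate by group: the two groups fall on opposite sides.
print(split(round(mds_toy$points[, 1], 2), grp))
```

The points can then be drawn with the same plot() and text() calls as for the leukemia data.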

## 3-dim. MDS ##

(7) MDS program

mds2 <- cmdscale(d, k=3, eig=TRUE)

pairs(mds2$points, main="MDS for ALL AML data, correlation matrix,G=3,051 genes, k=3", pch=c("L","M")[as.integer(Ytr)],col = as.integer(Ytr)+1)

## Screeplot ##

mdsScree <- cmdscale(d, k=8, eig=TRUE)

plot(mdsScree$eig, pch=18, col="blue", type="o")
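A scree plot is easier to read together with the cumulative share of the eigenvalues. The sketch below computes that share on toy data; negative eigenvalues, which can occur with non-Euclidean dissimilarities such as 1 - r, are set to zero here as a simplifying assumption.

```r
set.seed(11)
X <- matrix(rnorm(10 * 6), nrow = 10)          # 10 toy samples in 6 dimensions
fit <- cmdscale(dist(X), k = 2, eig = TRUE)

ev <- pmax(fit$eig, 0)                         # guard against negative eigenvalues
prop <- cumsum(ev) / sum(ev)                   # cumulative share of the leading eigenvalues
print(round(prop, 3))
```

A number of dimensions k is usually chosen where this curve flattens, which corresponds to the elbow of the scree plot.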

(8) Hierarchical clustering program

## average linkage ##

hc1 <- hclust(as.dist(d), method="average")

plot(hc1, main="Dendrogram for ALL AML data: ", sub="Average linkage, correlation matrix, G=3,051 genes")

cthc1 <- cutree(hc1, 3) # cuts a tree

table(Ytr, cthc1) # two-way contingency table

> table(Ytr, cthc1)

cthc1

Ytr 1 2 3

ALL 24 2 1

AML 0 11 0

## single linkage ##

hc2 <- hclust(as.dist(d), method="single")

plot(hc2, main="Dendrogram for ALL AML data: ", sub="Single linkage, correlation matrix, G= 3,051 genes")

cthc2 <- cutree(hc2, 3)

table(Ytr, cthc2)

> table(Ytr, cthc2)

cthc2

Ytr 1 2 3

ALL 25 1 1

AML 11 0 0


## complete linkage ##

hc3 <- hclust(as.dist(d), method="complete")

plot(hc3, main="Dendrogram for ALL AML data: ", sub="Complete linkage, correlation matrix, G= 3,051 genes")

cthc3 <- cutree(hc3, 3)

table(Ytr, cthc3)

> table(Ytr, cthc3)

cthc3

Ytr 1 2 3

ALL 26 1 0

AML 5 0 6
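The three linkage methods can be compared in a single loop. The sketch below uses simulated, well-separated samples and Euclidean distance, so all three linkages should recover the two groups; on real data such as the tables above they can disagree considerably.

```r
set.seed(5)
X <- rbind(matrix(rnorm(6 * 5, mean = 0),  nrow = 6),   # 6 toy samples, group A
           matrix(rnorm(6 * 5, mean = 10), nrow = 6))   # 6 toy samples, group B
truth <- factor(rep(c("A", "B"), each = 6))
d_toy <- dist(X)

# Cross-tabulate the known groups against the clusters for each linkage.
for (m in c("average", "single", "complete")) {
  hc <- hclust(d_toy, method = m)
  cat("Linkage:", m, "\n")
  print(table(truth, cluster = cutree(hc, k = 2)))
}
```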


kmeans.fit <- kmeans(t( exprs(gTrnS)), centers = 2, iter.max = 10000)

kmeans.fit$cluster

> kmeans.fit$cluster

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

1 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 2 2 1

25 26 27 34 35 36 37 38 28 29 30 31 32 33

2 1 2 2 2 2 2 2 2 2 2 2 2 2

cR <- kmeans(Exprs[1:10,], 2, 1000)

cc <- kmeans(t(Exprs[1:10,]), 2, 1000)
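K-means starts from randomly chosen centers, so repeated runs can give different cluster labels or even different partitions. A minimal sketch, assuming simulated data, fixes the seed and uses several random starts through the nstart argument:

```r
set.seed(2024)
X <- rbind(matrix(rnorm(8 * 4, mean = 0), nrow = 8),    # 8 toy samples, group A
           matrix(rnorm(8 * 4, mean = 8), nrow = 8))    # 8 toy samples, group B
truth <- rep(c("A", "B"), each = 8)

# nstart = 20 keeps the best of 20 random initializations.
fit <- kmeans(X, centers = 2, iter.max = 100, nstart = 20)
print(table(truth, cluster = fit$cluster))
```

As in the calls above, rows are the objects being clustered, so the matrix is transposed when samples rather than genes are to be grouped.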

-PAM-

- Partitioning (clustering) of the data into k clusters "around medoids", a more robust version of K-means.

library(cluster)

pm3 <- pam(as.dist(d), k=2, diss=TRUE)

table(Ytr, pm3$clustering)

> table(Ytr, pm3$clustering)

Ytr 1 2

ALL 23 4

AML 0 11

clusplot(d, pm3$clustering, diss=TRUE, labels=3,col.p=1, col.txt=as.integer(Ytr)+1, main="Bivariate cluster plot for ALL AML data\n Correlation matrix, K=2, G=3,051 genes")

(9) K-Means


plot(pm3,which.plots=2, main="Silhouette plot for ALL-AML Data")

dCor <- 1-cor(exprs(gTrnS)) # named dCor so that it does not mask stats::dist()

pm <- pam(as.dist(dCor),k=2,diss=TRUE)

sil <- pm$silinfo$widths

color <- rep(c(2,4),c(27,11))[as.numeric(rownames(sil))]

barplot(sil[,3],col=color,names.arg=as.numeric(rownames(sil)),

main="Silhouette plot for ALL-AML Data")
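The silhouette width shown in these plots is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of object i to the other members of its own cluster and b(i) is the average dissimilarity to the nearest other cluster. The sketch below computes it by hand for k = 2 on toy data and compares the result with silhouette() from the cluster package.

```r
library(cluster)

set.seed(9)
X <- rbind(matrix(rnorm(6 * 3, mean = 0), nrow = 6),    # 6 toy samples, group 1
           matrix(rnorm(6 * 3, mean = 6), nrow = 6))    # 6 toy samples, group 2
D <- as.matrix(dist(X))
pm_toy <- pam(as.dist(D), k = 2, diss = TRUE)
cl <- pm_toy$clustering

sil_by_hand <- sapply(seq_len(nrow(D)), function(i) {
  own   <- setdiff(which(cl == cl[i]), i)   # other members of i's own cluster
  other <- which(cl != cl[i])               # members of the other cluster (k = 2)
  a <- mean(D[i, own])                      # average within-cluster dissimilarity
  b <- mean(D[i, other])                    # average dissimilarity to the other cluster
  (b - a) / max(a, b)
})

sil_pkg <- silhouette(cl, dmatrix = D)[, "sil_width"]
agree <- all.equal(unname(sil_by_hand), unname(sil_pkg))
print(agree)
```

Values near 1 indicate objects that sit well inside their cluster, and values near 0 or below indicate objects between clusters or possibly misassigned.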


# SOM

library(som)

gtrns<-filtering(2^(exprs(gTrnS)), lt=100, ut=16000, mmr=5, mmd=500)

gtrns_n <- normalize(log2(gtrns))

foo1 <- som(gtrns_n, xdim=5, ydim=6)

plot(foo1)

# PCA

pca<-prcomp(t(exprs(gTrnS)),retx=TRUE,center=TRUE, scale.=FALSE, tol=NULL)

plot(pca)

summary(pca)
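The proportion of variance reported by summary() can be derived directly from the standard deviations stored in the prcomp object. A minimal sketch on random toy data:

```r
set.seed(21)
X <- matrix(rnorm(12 * 5), nrow = 12)            # 12 toy samples, 5 toy genes
pca_toy <- prcomp(X, center = TRUE, scale. = FALSE)

var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)   # share of variance per component
print(round(cumsum(var_explained), 3))                  # cumulative share
```

With samples in the rows, as in the call above, each principal component is a weighted combination of genes.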

(10) SOM & PCA
