unsupervised analysis goal a: find groups of genes that have correlated expression profiles. these...
![Page 1: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/1.jpg)
UNSUPERVISED ANALYSIS
•GOAL A: FIND GROUPS OF GENES THAT HAVE
CORRELATED EXPRESSION PROFILES. THESE GENES ARE
BELIEVED TO BELONG TO THE SAME BIOLOGICAL
PROCESS.
•GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR
GENE EXPRESSION PROFILES. THESE TISSUES ARE
EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL)
STATE.
CLUSTERING
Unsupervised analysis
![Page 2: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/2.jpg)
Giraffe
DEFINITION OF THE CLUSTERING PROBLEM
![Page 3: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/3.jpg)
CLUSTER ANALYSIS YIELDS DENDROGRAM
T (RESOLUTION)
![Page 4: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/4.jpg)
Giraffe + Okapi
BUT WHAT ABOUT THE OKAPI ?
![Page 5: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/5.jpg)
STATEMENT OF THE PROBLEM
GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN
D-DIMENSIONAL SPACE, IDENTIFY THE
UNDERLYING STRUCTURE OF THE DATA.
AIMS: PARTITION THE DATA INTO M CLUSTERS,
POINTS OF SAME CLUSTER - "MORE SIMILAR"
M ALSO TO BE DETERMINED!
GENERATE DENDROGRAM,
IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL POSED": WHAT IS "MORE SIMILAR"?
RESOLUTION
Statement of the problem2
![Page 6: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/6.jpg)
CLUSTER ANALYSIS YIELDS DENDROGRAM
Dendrogram2
T
LINEAR ORDERING OF DATA
YOUNG OLD
![Page 7: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/7.jpg)
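Agglomerative Hierarchical Clustering
[Figure: five data points labeled 1 to 5 and the dendrogram built from them, with leaves labeled 5, 2, 4, 1, 3; vertical axis: distance between joined clusters]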
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs
or distance between cluster centers
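The linkage rules above can be written down directly. The following is a minimal sketch, assuming Euclidean distance between points; the function names and the two toy clusters are illustrative and do not come from the slides.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Average distance over all pairs of points, one from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(points):
    """Coordinate-wise mean of a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_linkage(a, b):
    """Distance between the two cluster centers."""
    return dist(centroid(a), centroid(b))


# Hypothetical toy clusters, chosen only to make the four rules disagree.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(cluster_a, cluster_b), 4))
```

For these two clusters the single-linkage distance is 3.0 (the pair (0, 0) and (3, 0)), while the other rules give larger values, which is why the choice of rule can change which clusters are joined next.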
Dendrogram
The dendrogram induces a linear ordering of the data points
![Page 8: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/8.jpg)
Hierarchical Clustering - Summary
• Results depend on distance update method
• Greedy iterative process
• NOT robust against noise
• No inherent measure to identify stable clusters
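The greedy iterative process can be sketched as follows. This is an illustrative implementation, not the one used to produce the slides: it assumes Euclidean distance and average linkage as the distance update method, and the helper names and the one-dimensional toy data are hypothetical.

```python
from itertools import product
from math import dist


def average_linkage(a, b):
    """Average distance over all pairs of points, one from each cluster."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def agglomerate(points, linkage=average_linkage):
    """Greedy agglomerative clustering.

    Returns (merges, order). merges lists (members_a, members_b, height) in
    the order the merges happened; order is the leaf ordering induced by
    always writing the two joined clusters side by side.
    """
    # Each cluster is a list of point indices, kept in leaf order.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: find the pair of clusters with the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage([points[k] for k in clusters[i]],
                            [points[k] for k in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two joined clusters by the new one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges, clusters[0]


# Hypothetical one-dimensional data: two tight pairs and one outlier.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges, order = agglomerate(points)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
print("leaf order:", order)
```

Each merge is final once made, which is what makes the procedure greedy: an early join caused by a noisy point is never revisited. Passing a different `linkage` function changes the merge heights and possibly the merge sequence, in line with the first bullet above.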
![Page 9: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/9.jpg)
2 good clouds
COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS
![Page 10: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/10.jpg)
2 flat clouds
2 FLAT CLOUDS - SINGLE LINKAGE WORKS
![Page 11: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/11.jpg)
filament
SINGLE LINKAGE SENSITIVE TO NOISE
![Page 12: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/12.jpg)
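Average linkage
[Figure: five data points labeled 1 to 5 and the dendrogram built from them, with leaves labeled 5, 2, 4, 1, 3; vertical axis: distance between joined clusters]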
Need to define the distance between the new cluster and the other clusters.
Average Linkage: average distance between all pairs
Mean Linkage: distance between centroids
Dendrogram
![Page 13: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/13.jpg)
nature 2002 breast cancer
![Page 14: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/14.jpg)
STATEMENT OF THE PROBLEM
GIVEN DATA POINTS Xi, i = 1, 2, ..., N, EMBEDDED IN
D-DIMENSIONAL SPACE, IDENTIFY THE
UNDERLYING STRUCTURE OF THE DATA.
AIMS: PARTITION THE DATA INTO M CLUSTERS,
POINTS OF SAME CLUSTER - "MORE SIMILAR"
M ALSO TO BE DETERMINED!
GENERATE DENDROGRAM,
IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL POSED": WHAT IS "MORE SIMILAR"?
RESOLUTION
Statement of the problem2
![Page 15: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/15.jpg)
how many clusters?
3 LARGE
MANY small (SPC)
toy problem SPC
![Page 16: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/16.jpg)
other methods
![Page 17: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/17.jpg)
K-means
Iteration = 0
•Start with random positions of centroids.
![Page 18: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/18.jpg)
K-means
Iteration = 1
•Start with random positions of centroids.
•Assign data points to
centroids
![Page 19: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/19.jpg)
K-means
Iteration = 1
•Start with random positions of centroids.
•Assign data points to
centroids
•Move centroids to center
of assigned points
![Page 20: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/20.jpg)
K-means
Iteration = 3
•Start with random positions of centroids.
•Assign data points to
centroids
•Move centroids to center
of assigned points
•Iterate till minimal cost
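The four steps above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and the sum of squared point-to-centroid distances as the cost; the `init` parameter, the choice of drawing random starting centroids from the data points, and the toy data are assumptions made for the example, not details from the slides.

```python
import random
from math import dist


def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centroids, labels, cost) after iterating to a fixed point."""
    rng = random.Random(seed)
    # Start with random positions of centroids (here: k distinct data points),
    # unless explicit starting positions are supplied.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each data point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable: a local minimum of the cost
        labels = new_labels
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # a centroid with no points keeps its old position
                centroids[c] = mean(members)
    cost = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, cost


# Hypothetical data: two square blobs far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, labels, cost = kmeans(points, 2, init=[(0.0, 0.0), (11.0, 11.0)])
print(centroids, labels, cost)
```

With one starting centroid in each blob, the loop settles on centroids (0.5, 0.5) and (10.5, 10.5) with a cost of about 4.0. Calling `kmeans(points, 2)` without `init` uses a random start, and the iteration only guarantees a local minimum of the cost, so other starts are not guaranteed to reach the same partition.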
![Page 21: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/21.jpg)
K-means - Summary
• Result depends on initial centroids’ position
• Fast algorithm: compute distances from data points to centroids
• Must preset K
• Fails for non-spherical distributions
![Page 22: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/22.jpg)
TSS vs K
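A curve of this kind can be computed by running K-means for several values of K and recording the final cost. The sketch below assumes that TSS denotes the total within-cluster sum of squared distances, which the slide does not spell out; the function names, the number of restarts, and the toy data are likewise illustrative assumptions.

```python
import random
from math import dist


def kmeans_cost(points, k, rng, max_iter=100):
    """Run K-means once from a random start and return its final cost."""
    centroids = rng.sample(points, k)  # k distinct data points as the start
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def tss_curve(points, k_values, restarts=10, seed=0):
    """Best cost over several random restarts, for each K."""
    rng = random.Random(seed)
    return {k: min(kmeans_cost(points, k, rng) for _ in range(restarts))
            for k in k_values}


# Hypothetical data: three small, well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
          (5.0, 9.0), (5.0, 10.0), (6.0, 9.0)]
curve = tss_curve(points, range(1, len(points) + 1))
for k, tss in curve.items():
    print(k, round(tss, 3))
```

The cost reaches zero once K equals the number of distinct points, so the minimum of the curve is not informative by itself. A common heuristic is to look for the value of K after which the decrease flattens, but that reading is a judgment call rather than a rule.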
![Page 23: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/23.jpg)
Iris setosa
Iris versicolor
Iris virginica
50 specimens from each group
4 numbers for each flower
150 data points in 4-dimensional space
irises
![Page 24: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/24.jpg)
150 points in d=4
3 large clusters
d=4
![Page 25: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/25.jpg)
Output of SPC
Stable clusters “live” for large T
![Page 26: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/26.jpg)
Choosing a value for T
![Page 27: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/27.jpg)
Same data - Average Linkage
No analog for
![Page 28: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/28.jpg)
Same data - Average Linkage
Examining this cluster
![Page 29: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/29.jpg)
[Figure 2A: Coupled Two-Way Clustering (CTWC) of 358 Genes and 36 Samples. Labels in the figure: A (II), ScGBM, PrGBM, CL, GENES, T, S1(G1), S2, S3, G12, G5]
GLIOBLASTOMA: M. HEGI et al., CHUV, CLONTECH ARRAYS
glioblastoma
![Page 30: UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL](https://reader036.vdocuments.net/reader036/viewer/2022081516/56649d095503460f949dbde0/html5/thumbnails/30.jpg)
AB004904 STAT-induced STAT inhibitor 3
M32977 VEGF
M35410 IGFBP2
X51602 VEGFR1
M96322 gravin
AB004903 STAT-induced STAT inhibitor 2
X52946 PTN
J04111 c-jun
X79067 TIS11B
[Figure 2B: Super-Paramagnetic Clustering of All Samples Using Stable Gene Cluster G5. Sample clusters labeled in the figure: S10, S11, S12, S13, S14, S1(G5)]
S1(G5)