Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data

Li Zhang and Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260
lizhang,[email protected]

Murali Ramanathan
Department of Pharmaceutical Sciences
State University of New York at Buffalo
Buffalo, NY [email protected]

Abstract

DNA microarray technology provides a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns and relationships in a complex dataset by presenting the data in a graphical format in which the key characteristics become more apparent. The purpose of this study is to present an interactive visualization technique that conveys the temporal patterns of gene expression data in a form intuitive for non-specialized end-users.

The first Fourier harmonic projection (FFHP) was introduced to translate the multi-dimensional time series data into a two-dimensional scatter plot. The spatial relationship of the points reflects the structure of the original dataset, and relationships among clusters become two-dimensional. The proposed method was tested using two published, array-derived gene expression datasets. Our results demonstrate the effectiveness of the approach.

Keywords: visualization, gene expression, time series, Fourier harmonic projection

1 INTRODUCTION

Knowledge of the spectrum of genes expressed at a certain time or under given conditions is instrumental to understanding the workings of a living cell. DNA microarray technology allows measurements of expression levels for thousands of genes simultaneously [9]. Extensive research has been conducted on the temporal patterns of gene expression [10, 16]. Clustering methods, which group genes or samples with similar patterns, have become a mainstream analysis tool [14]. Visualization can facilitate the discovery of structures, features, patterns, and relationships in data and may provide more insightful information than traditional numerical methods. By visualization, we hope to gain some intuition regarding the data, but more importantly, we would like to understand the relationships between data points and detect the intrinsic structure or possible cluster tendencies. Visualization is especially important in the early stages of data analysis, in which qualitative analysis takes precedence over quantitative analysis. Early success will enhance the users' performance in the remaining stages of analysis. Array-derived gene expression datasets present analysis and visualization challenges because of their dimensionality, noise, and variety of patterns.

The parallel coordinates approach [18] is perhaps the simplest method to display patterns of gene expression profiles. Here the data in each dimension are plotted along a separate axis. Holter et al. [8] have used parallel coordinates to visualize the temporal progression in the yeast cell cycle data. Self-organizing maps (SOM) [12] are another approach. More recently, Hautaniemi et al. [6] have presented a heat map-based strategy for visualizing the U-matrix from SOM. The most prominent visualization-enhanced analysis tool for gene expression data is TreeView [5], which provides a user-friendly computational and graphical environment for assessing the results from hierarchical clustering. The graphical presentations from TreeView include a dendrogram to reflect the distance relationships between clusters and a heat plot to visually convey gene expression changes between samples. The heat plot can be viewed as a variation of the parallel coordinates plot in which color is used to convey dimension values.

Here, we present an alternative mapping for multi-dimensional data that is based on the first harmonic of the discrete Fourier transform. The mapping has interesting properties and preserves certain key characteristics of a variety of data sets, especially time series data. Unlike parallel coordinates and heat plots, which display all individual dimensional information, our approach uses a two-dimensional point to represent each gene over time, at a computational cost of $O(N \log N)$. It focuses on one very important aspect of the visualization: revealing the structure of the entire dataset. When the approach was tested using two published, array-derived gene expression time series datasets, our results indicated that temporal patterns were well reflected in the visualization: cluster relationships became two-dimensional, clusters were apparent, and outliers were clear. An interactive visualization tool, VizStruct, was implemented to perform the visualization.

The remainder of this paper is organized as follows. Section 2 presents the visualization model. In Section 3, we show our analysis results. The final section discusses other issues in this approach. Proofs for all mapping propositions are included in the appendix.

2 METHODS AND SYSTEM

The Mapping

Mapping converts multi-dimensional data to two dimensions for visualization. An array-derived profile for $M$ genes with $N$ measurements results in $M$ $N$-dimensional data points containing real-valued numerical data. Time series data in its simplest form is merely a set of data $y_t$, $t = 0, \ldots, N-1$, where the subscript $t$ indicates the time at which the datum $y_t$ was observed [4]. On the other hand, a discrete-time real signal on $N$ evenly distributed time points is represented as a sequence of $N$ real numbers indexed by $n = 0, \ldots, N-1$ and denoted by $x[n]$ [1]. Each term of the sequence is also denoted by $x[n]$. The notational similarity between time series and digital signals suggests that we can view each data point in a time series as a discrete-time real signal (it is not necessary for the signal's time index to comply with the actual time points). In this scenario, the problem of a two-dimensional visualization of the time series is transformed into the problem of finding a two-dimensional point estimate for signals (data points).

The frequency domain representation of discrete-time signals is obtained through the discrete Fourier transform (DFT) [13]. The DFT of an $N$-point signal $x[n]$ is a frequency sequence with $N$ complex values, $F(x[n]) = [F_k(x[n])]$, where each

$$F_k(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, \ldots, N-1. \tag{1}$$

$W_N = e^{-i2\pi/N}$ is called the twiddle factor.

Each harmonic $F_k$ in the DFT is a measure of the $k$th sinusoidal frequency component in the signal. For example, the zeroth harmonic, $F_0$, is proportional to the mean value (with the unnormalized sum used here it equals $N$ times the mean), the first harmonic $F_1$ measures the base frequency component, the second harmonic $F_2$ measures the component in the signal that is twice the base frequency, and so forth. Because Fourier harmonics, in general, are complex numbers, they provide a two-dimensional point estimate for mapping a multi-dimensional signal. For this reason, we refer to the mapping as the Fourier harmonic projection. In particular, we are interested in the first Fourier harmonic projection (FFHP):

$$F_1(x[n]) = \sum_{n=0}^{N-1} x[n]\, W_N^{n} = \sum_{n=0}^{N-1} x[n]\, e^{-i2\pi n/N}. \tag{2}$$

The relationship between the DFT and the mapping allows the fast Fourier transform (FFT) algorithm, originally discovered by Cooley and Tukey [13], to be used for computation. The FFT is a computationally efficient algorithm and has a complexity of $O(N \log N)$.
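As a rough illustration of the two routes to $F_1$, the sketch below computes the first harmonic once directly from Equation (2) and once as entry $k = 1$ of a textbook recursive radix-2 FFT. It is a minimal sketch, not the VizStruct implementation: the function names and the 8-point `profile` are hypothetical, and the radix-2 routine assumes the number of time points is a power of two.

```python
import cmath

def ffhp_direct(x):
    """First Fourier harmonic of a real sequence x, summed as in Equation (2)."""
    n_points = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * n / n_points)
               for n, v in enumerate(x))

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        # Twiddle factor W_N^k applied to the odd half (butterfly step).
        t = cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out

# Hypothetical 8-point expression profile (values are made up).
profile = [0.1, 0.4, 0.9, 1.0, 0.7, 0.3, 0.2, 0.1]
f1_direct = ffhp_direct(profile)
f1_fft = fft_radix2(profile)[1]   # harmonic k = 1 of the full transform
print(f1_direct, f1_fft)
```

Both routes should agree to rounding error. When only the single harmonic $F_1$ is needed, the direct sum already costs $O(N)$ per profile, so the FFT is mainly useful if several harmonics are wanted at once.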

The complex number $F_1(x[n])$ in Equation (2) can be expressed in terms of magnitude $r$ and phase $\theta$ to provide a useful geometric interpretation of the mapping. The data set was normalized so that the range of values of each dimension across the dataset was 0 to 1. For a data point with $N$ dimensions, the complex exponential divides a unit circle centered at the origin of the complex plane into $N$ equally spaced angles. The value of the first dimension is projected onto the radial line corresponding to $\theta = 0$ and, similarly, the value of the $k$th dimension is projected onto the radial line corresponding to $\theta = 2\pi(1-k)/N$ radians. The overall two-dimensional FFHP mapping is the complex sum of all $N$ projections from a data point. Figure 1 illustrates the geometric interpretation for a point containing 6 dimensions.

Figure 1. A geometric interpretation of the first Fourier harmonic projection. A normalized 6-dimensional data point $x[n]$ is shown on the right by the stem plot. The powers of the twiddle factor $W_6$ divide the unit circle centered at the origin into 6 equal angles, and each dimension of the data point is projected onto a different radial angle (open circle). The projections are summed as complex numbers to give a 2-dimensional image $F_1(x[n])$ (indicated by a filled circle).
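The geometric reading above can be sketched in a few lines: rescale each dimension to the range 0 to 1, project the $k$th value onto the radial line at angle $2\pi(1-k)/N$, and add the projections as complex numbers. This is only a sketch under stated assumptions: the helper names and the small 3-by-6 `data` matrix are hypothetical, and a dimension that is constant across the dataset is mapped to 0, a choice the text does not specify.

```python
import cmath

def normalize_columns(data):
    """Rescale every dimension (column) to the range 0..1 across the dataset."""
    n_dims = len(data[0])
    lows = [min(row[j] for row in data) for j in range(n_dims)]
    highs = [max(row[j] for row in data) for j in range(n_dims)]
    return [
        [(row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
         for j in range(n_dims)]
        for row in data
    ]

def radial_projections(point):
    """Place dimension k (1-based) on the radial line at angle 2*pi*(1-k)/N."""
    n_dims = len(point)
    return [value * cmath.exp(1j * 2 * cmath.pi * (1 - k) / n_dims)
            for k, value in enumerate(point, start=1)]

# Hypothetical dataset: 3 data points with 6 dimensions each.
data = [[2.0, 5.0, 1.0, 0.0, 3.0, 4.0],
        [4.0, 1.0, 3.0, 2.0, 1.0, 0.0],
        [0.0, 3.0, 2.0, 4.0, 5.0, 2.0]]
normalized = normalize_columns(data)
projections = radial_projections(normalized[0])   # the "open circles"
image = sum(projections)                          # the "filled circle"
print(image.real, image.imag)
```

The complex sum `image` is the same quantity as Equation (2) evaluated on the normalized point, so its real and imaginary parts are the scatter-plot coordinates.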

Figure 2. Illustration of the effect of (A) amplitude shifting (+a) and amplitude scaling (×a), and (B) time shifting on the first Fourier harmonic projection.

Properties of FFHP

The first Fourier harmonic projection has useful properties that preserve the correlation between dimensions in the multi-dimensional data point. We summarize them as propositions below. Detailed proofs of the propositions are provided in the appendix.

1. Data points with equal values for all the dimensions are mapped to the origin. If $x[n] = [a, \ldots, a]$, then $F_1(x[n]) = 0$.

2. Two data points whose dimension values differ by a constant amplitude shift are mapped to the same point. If $y[n] = x[n] + a$, then $F_1(y[n]) = F_1(x[n])$.

3. Two data points whose dimension values differ by a constant amplitude factor are mapped to two points on a line through the origin. If $y[n] = a\,x[n]$, then $F_1(y[n]) = a\,F_1(x[n])$. See Figure 2A.

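4. Two data points whose dimension values are time-reversed copies of each other, i.e. symmetric about the middle time point, are mapped to points symmetric about the real axis. If $y[n] = x[N-n-1]$, then $F_1(y[n]) = \overline{F_1(x[n])}$ (for this index convention, up to a rotation by one angular step, i.e. a factor of $W_N^{-1}$).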

5. Data points that differ only because they are "time-shifted" by $d$ dimensions relative to each other are mapped to the circumference of a circle that is concentric with the unit circle, and the angle between the points in the visualization is $\varphi = 2\pi d/N$. If $y[n] = x[n-d]$, then $F_1(y[n]) = F_1(x[n])\,W_N^{d}$. This property is illustrated in Figure 2B.

6. Let $w[n] = x[n] - y[n]$ be the difference between two $N$-dimensional points, $x[n]$ and $y[n]$. The distance between these two points in the visualization is:

$$\|F_1(w[n])\|^2 = g_0 N \left(1 + 2 \sum_{k=1}^{N-1} r_k \cos(2\pi k/N)\right) \tag{3}$$
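Propositions 1 to 5 are easy to check numerically. The sketch below evaluates each one on a made-up 7-point profile. The names `ffhp`, `x`, `a`, and `d` are hypothetical, and two conventions are assumptions rather than statements from the text: the time shift is taken to be circular (indices wrap modulo $N$), and the reversal check compares against the complex conjugate rotated by one angular step, which is what the sum in Equation (2) yields for $y[n] = x[N-n-1]$.

```python
import cmath

def ffhp(x):
    """First Fourier harmonic projection of a real sequence (Equation (2))."""
    n_points = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * n / n_points)
               for n, v in enumerate(x))

x = [0.2, 0.9, 0.6, 0.1, 0.4, 0.8, 0.3]   # hypothetical 7-point profile
N = len(x)
W = cmath.exp(-2j * cmath.pi / N)          # twiddle factor W_N
a, d = 2.5, 2                              # arbitrary constant and shift
base = ffhp(x)

p1 = ffhp([a] * N)                              # Prop. 1: constant profile
p2 = ffhp([v + a for v in x])                   # Prop. 2: amplitude shift
p3 = ffhp([a * v for v in x])                   # Prop. 3: amplitude scaling
p4 = ffhp(x[::-1])                              # Prop. 4: y[n] = x[N-n-1]
p5 = ffhp([x[(n - d) % N] for n in range(N)])   # Prop. 5: circular shift by d

print("Prop. 1 residual:", abs(p1))
print("Prop. 2 residual:", abs(p2 - base))
print("Prop. 3 residual:", abs(p3 - a * base))
print("Prop. 4 residual:", abs(p4 - base.conjugate() / W))
print("Prop. 5 residual:", abs(p5 - base * W ** d))
```

Each printed residual should be at the level of floating-point rounding. Note that a non-circular shift, where values fall off one end of the profile, would satisfy Proposition 5 only approximately.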

Theory in Practice

We will demonstrate in the next section that the relative locations of the mapped images of temporal profiles can be predicted by Propositions 1–5. For example, genes with relatively low levels of expression throughout the time line are mapped close to the origin. Genes that steadily increase over the time line and genes that steadily decrease over the time line are likely to be mapped symmetric to the real axis.

In Equation (3), $g_0$ is the variance of $w[n]$ and $r_k$ is the $k$th-sample autocorrelation coefficient of $w[n]$. It can be shown [4] that $-1 \le r_k \le 1$ and that, for mutually independent random sequences or white noise, $r_k = 0$. Equation (3) provides insight into the cluster delineation capabilities of the first Fourier harmonic projection. The variance and $k$th-sample autocorrelation coefficients of the difference between two points within a given cluster are likely to be small because the points share similarities across many dimensions. Thus, points within a cluster are likely to map close to each other in the visualization.
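Equation (3) can be checked numerically once $g_0$ and $r_k$ are pinned down. The sketch below assumes the usual sample definitions, $g_k = \frac{1}{N}\sum_{t=0}^{N-1-k}(w_t - \bar{w})(w_{t+k} - \bar{w})$ and $r_k = g_k / g_0$; the text defers to [4] for these, so treat that choice, the function names, and the two 7-point profiles as assumptions made for illustration.

```python
import cmath
import math

def ffhp(x):
    """First Fourier harmonic projection of a real sequence (Equation (2))."""
    n_points = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * n / n_points)
               for n, v in enumerate(x))

def autocorrelation_terms(w):
    """Return g0 and [r_1, ..., r_{N-1}] under the assumed sample definitions."""
    n_points = len(w)
    mean = sum(w) / n_points
    centered = [v - mean for v in w]
    g = [sum(centered[t] * centered[t + k] for t in range(n_points - k)) / n_points
         for k in range(n_points)]
    return g[0], [gk / g[0] for gk in g[1:]]

# Two hypothetical 7-point profiles and their difference w[n].
x = [0.2, 0.9, 0.6, 0.1, 0.4, 0.8, 0.3]
y = [0.3, 0.7, 0.7, 0.2, 0.1, 0.9, 0.5]
w = [xi - yi for xi, yi in zip(x, y)]
N = len(w)

g0, r = autocorrelation_terms(w)
lhs = abs(ffhp(w)) ** 2                                   # squared distance in the plot
rhs = g0 * N * (1 + 2 * sum(rk * math.cos(2 * math.pi * k / N)
                            for k, rk in enumerate(r, start=1)))
print(lhs, rhs)
```

Under these definitions the two sides agree to rounding error, because subtracting the mean of $w[n]$ leaves $F_1$ unchanged (Proposition 2).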

The first Fourier harmonic projection is well suited for visualizing temporal patterns. FFHP separates classes better when the first harmonic dominates the other harmonics. Since the first harmonic is a measure of the base sinusoidal frequency component in the signal, a lower-frequency signal tends to have a larger first harmonic. This is the case for a large number of time series datasets, in which most trend patterns show low-frequency variations (no multiple cycles). The first Fourier harmonic projection is sensitive to dimension order. For time series data, the time points (dimensions) are naturally ordered.
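One way to make "dominance of the first harmonic" concrete is to measure the share of non-constant spectral energy that falls on harmonic 1. The sketch below does this for a slow single-cycle profile, a faster three-cycle profile, and a reordered copy of the slow profile. The share measure, the function names, and the synthetic profiles are illustrative assumptions, not quantities defined in the text.

```python
import cmath
import math

def dft(x):
    """Plain O(N^2) discrete Fourier transform, as in Equation (1)."""
    n_points = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * n * k / n_points)
                for n, v in enumerate(x))
            for k in range(n_points)]

def first_harmonic_share(x):
    """Energy of harmonic 1 divided by the energy of harmonics 1..N//2."""
    spectrum = dft(x)
    energies = [abs(spectrum[k]) ** 2 for k in range(1, len(x) // 2 + 1)]
    total = sum(energies)
    return energies[0] / total if total > 0 else 0.0

N = 12
slow = [math.sin(2 * math.pi * n / N) for n in range(N)]       # one cycle
fast = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]   # three cycles
reordered = slow[0::2] + slow[1::2]   # same values, different dimension order

share_slow = first_harmonic_share(slow)
share_fast = first_harmonic_share(fast)
share_reordered = first_harmonic_share(reordered)
print(share_slow, share_fast, share_reordered)
```

The slow profile puts essentially all of its energy on harmonic 1 and the fast one almost none. The reordered copy loses most of its first-harmonic share, which illustrates why the order of the dimensions matters.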

VizStruct System

The first Fourier harmonic projection approach is implemented as a visualization tool called VizStruct, which is written in Java and is available on request from the first author ([email protected]). The name VizStruct emphasizes its capability of visualizing the structure of the dataset.

Figure 3. Three snapshots (A)–(C) from a dimension tour through a synthetic three-dimensional data set containing 25,000 points. The parameter settings for (A)–(C) were 〈0.5, 0.5, 0.5〉, 〈0.3, 0.5, 1〉, and 〈0.05, 1, 1〉, respectively.

Dimension Tour

The dimension tour is a feature of VizStruct that allows the user to interact with the data via dynamic animations. It is analogous to the grand tour [2]: the user interacts with the visualization by changing the dimension parameter associated with each dimension. The default value of each dimension parameter is 0.5, and the individual parameters can be changed over the range −1 to +1, either manually or systematically by the program. Because no two-dimensional mapping can capture all the interesting properties of the original multi-dimensional space, two points that are close in the visualization can be far apart in the multi-dimensional space. The dimension tour, which creates animations that explore the dimension parameter space, can reveal structures in the multi-dimensional input that were hidden due to overlap with other points in the visualization.

Figure 3 illustrates the capabilities of the dimension tour using a synthetic dataset containing 25,000 points in three dimensions. At the default settings of the dimension parameters (Figure 3A), 5 clusters are apparent. However, during the course of the animations (Figures 3B and 3C), the multi-layered structures of the original 5 clusters become increasingly apparent.
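The text does not spell out how a dimension parameter enters the mapping. One plausible reading, offered purely as an assumption, is that each parameter multiplies the value of its dimension before the complex sum is taken. The sketch below uses that reading to show how two distinct points that overlap at the default setting can separate when the parameters change; `weighted_ffhp` and the sample points are hypothetical.

```python
import cmath

def weighted_ffhp(point, params):
    """FFHP with one multiplicative weight per dimension (assumed semantics)."""
    n_dims = len(point)
    return sum(p * v * cmath.exp(-2j * cmath.pi * n / n_dims)
               for n, (p, v) in enumerate(zip(params, point)))

# Two hypothetical 3-dimensional points that differ by a constant offset of 0.1.
point_a = [0.2, 0.5, 0.9]
point_b = [0.3, 0.6, 1.0]

default_params = [0.5, 0.5, 0.5]   # default setting quoted in the text
tour_params = [0.3, 0.5, 1.0]      # setting quoted for Figure 3B

gap_default = abs(weighted_ffhp(point_a, default_params)
                  - weighted_ffhp(point_b, default_params))
gap_tour = abs(weighted_ffhp(point_a, tour_params)
               - weighted_ffhp(point_b, tour_params))
print(gap_default, gap_tour)
```

With uniform weights the constant offset cancels, as in Proposition 2, so the two images coincide. With unequal weights the offset no longer cancels and the points move apart, which is the kind of hidden overlap a tour through parameter space could expose.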

3 RESULTS

Data Sets for Visualization

Our approach was tested using two published array-derived data sets. The rat kidney array dataset of Stuart et al. [16] contains measurements of gene expression during rat kidney organogenesis. The data were downloaded from http://organogenesis.ucsd.edu/data.html. The dataset consists of 873 genes whose expression varies significantly during kidney development, measured at 7 different time points: gestational days 13, 15, 17, and 19; newborn (N); 1 week (W); and nonpregnant adult (A).

The fibroblast dataset of Iyer et al. [10] is the result of a study of the response of human fibroblasts to serum. It consists of gene expression measurements of the temporal changes in mRNA levels of 517 human genes at 13 time points, ranging from 15 minutes to 24 hours after serum stimulation. The data were downloaded from http://genome-www.stanford.edu/serum.

Rat Kidney Dataset

In the rat kidney dataset, there are 5 discrete patterns or groups of gene expression during nephrogenesis. Figure 4 illustrates these temporal profiles, characterized by idealized gene expression curves.

Figure 5A shows how genes are classified by a hierarchical clustering algorithm. It is reproduced from Figure 3 of [16]. Figure 5B shows the parallel coordinates of the dataset. The patterns of genes in each group conform to the profiles depicted in Figure 4 (with some noise).

Figure 4. Idealized temporal gene expression profiles during kidney development (panels 1 to 5; horizontal axis: time points 13, 15, 17, 19, N, W, A). The groups were named 1 through 5 based on the timing of their peak expression during development. The seven time points were embryonic days 13, 15, 17, and 19; N, newborn; W, 1 week old; A, adult.

Figure 5. Visualization of the rat kidney dataset in heat plot and parallel coordinates. (A) Dendrogram and heat plot from the hierarchical clustering algorithm. (B) Parallel coordinates for the entire dataset (panel "All Data") and for each of the gene groups (panels "Group 1" to "Group 5").

Figures 6A and 6B show the visualization of the rat kidney dataset in VizStruct under the first Fourier harmonic projection for two dimension parameter settings. There are 5 sets of colored symbols, one for each of the 5 gene groups. Each symbol represents one gene across 7 time points. In Figure 6A, two big clusters are clearly apparent. The top cluster consists of genes from groups 3, 4, and 5. The bottom cluster comprises genes from groups 1 and 2. Furthermore, genes from each group are aggregated.

The formation of two large clusters can be explained by the temporal profiles. Groups 1 and 2, whose genes have very high relative levels of expression in early development, are quite different from groups 3, 4, and 5, whose genes show a relatively steady increase in expression throughout development. The visualization also indicates that points in the upper cluster are symmetric to the points in the lower cluster. The properties of FFHP suggest the reason. The temporal profiles of groups 1 and 4 are roughly mirror images of each other about the middle time point (gestational day 19). By Proposition 4, they would be mapped to points symmetric to the real axis. On the other hand, group 4 and group 5 are mapped close together since they have similar profiles except for the significant up-regulation at the last time point. The same arguments can be applied to group 1 vs. group 2, or group 3 vs. group 4.

Due to the noise, most boundaries between groups are not very clear. However, the separation between group 4 and group 5 improves in Figure 6B compared to Figure 6A.

Fibroblast Dataset

Temporal patterns are somewhat more complicated in the fibroblast dataset. The data were classified into 10 groups by the original authors using a hierarchical clustering algorithm [10]. Figure 7A shows the result of the hierarchical clustering. It is reproduced from Figure 3 of [10]. Figure 7B gives the parallel coordinates of the entire dataset and of each gene group.

Figure 6. Visualization of the rat kidney dataset in VizStruct (axes: Re[F1] and Im[F1]; legend: G1 to G5). Five gene groups were represented by blue plus symbols, red circles, green triangles, magenta stars, and black cross symbols. The dimension parameters for (A) and (B) were 〈0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5〉 and 〈1, 0.5, −1, −0.5, 0.5, 1, −1〉, respectively.

Figure 7. Visualization of the fibroblast dataset in heat plot and parallel coordinates. (A) Dendrogram and heat plot from the hierarchical clustering algorithm. (B) Parallel coordinates for the entire dataset (panel "All") and for each of the gene groups (panels "Group A" to "Group J" and "Group U"). The gene group labelled U consists of genes left without a label after hierarchical clustering.

Figure 8. Visualization of the fibroblast dataset in VizStruct (axes: Re[F1] and Im[F1]). 11 colored numbers were used, one for each of the gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization in (A).

Figure 9. Visualization of the fibroblast dataset in VizStruct (axes: Re[F1] and Im[F1]). Genes were grouped by the k-means clustering method. 11 colored numbers were used, one for each of the gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization in (A).

Figure 8A shows the visualization of the fibroblast dataset in VizStruct under the first Fourier harmonic projection. Colored numbers 1 through 11 were used, one for each gene group. The visualization reveals several outliers and no distinct clusters. The majority of the data points are too densely packed to be distinguished. An enlarged view, obtained by zooming, is shown in Figure 8B. More numbers spread out, but most blue numbers (1) are still covered by numbers 2 and 3.

Two observations can be made: (1) a large number ofgenes are close to the center. (2) most clusters have a radialshape emitting from the center. They can be interpreted bythe FFHP properties. (1) Temporal patterns of groupA andgroupB have very flat shape and relative smaller values. ByPropositions1 and2, they tend to be mapped closed to theorigin. (2) In hierarchical clustering, Pearson’s correlationcoefficient was used to measure the similarity. Genes whosetime values differ due to the amplitude shifting or multiply-

−3 −2 −1 0 1 2 3−4

−3

−2

−1

0

1

2

3

4

x

y

−4 −3 −2 −1 0 1 2 3 4 5−6

−4

−2

0

2

4

6

1

23

4

5

6

7

8

9

10

1112

13

14151617

18

19

20

2122

23

24 25

262728

29

30

3132

3334

35 36

37

38394041

42

4344

4546

47

48

49

50

51

5253

5455

56

5758

59

6061

62

63

64

65

66

67

68

69

7071

727374

75

76

7778

79

80

81

82

8384

8586

87

88

8990

9192

93

94

9596

97

98

99

100

101

102

103

104 105

106107

108

109

110111

112113

114

115

116117

118

119120

121 122123124

125126127

128

129

130

131

132133 134

135

136

137

138

139

140

141

142

143144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166167

168

169

170

171

172

173

174

175

176177

178179

180181

182

183

184

185

186

187

188

189

190191

192

193

194

195

196

197

198

199

200

201

202203

204

205206

207

208

209

210

211

212

213

214

215

216217

218 219

220221

222223

224

225226

227

228

229

230

231

232233

234

235

236

237

238

239240

241

242243

244

245

246

247

248

249

250

251

252

253254

255256

257258

259260

261

262263

264265

266

267

268269270

271

272273

274

275

276

277

278279

280

281

282

283284285

286

287288

289

290

291

292293

294295296

297

298

299

300

301

302

303

304

305

306

307 308

309

310

311

312

313

314 315

316

317318

319

320321322

323

324

325

326327

328329

330

331332

333334

335336

337338 339

340

341

342

343344 345

346

347

348349

350351

352

353354

355

356

357

358

359

360

361362

363

364

365

366

367 368

369

370

371

372373

374

375

376

377378

379

380

381382383

384

385386

387

388

389390

391

392393

394

395 396 397

398

399

400

401

402

403404405

406

407408

409410411

412

413

414415416

417

418419

420

421

422

423

424

425

426

427428429430

431

432

433

434

435

436

437438439

440

441

442

443

444

445

446

447448

449450

451

452

453454455

456

457458

459

460

461

462

463

464

465

466

467468469 470

471

472473

474475476

477

478

479

480

481

482483484

485

486

487

488

489

490491

492

493

494

495496497

498499

500501502

503

504

505

506

507

508

509510

511

512

513

514515

516

517

518

519520

521

522523524525

526

527

528

529

530

531

532

533

534535

536

537

538539

540

541

542

543 544

545

546

547

548

549550

551

552553

554

555556

557

558

559

560

561

562563564

565

566

567568

569

570

571

572

573

574

575576577

578579

580581

582

583584

585586

587

588589590

591

592

593

594595

596 597598

599

600 601

602 603

604

605

606

607

608

609 610

611

612613

614

615616617

618

619620

621622623

624625

626 627

628

629

630

631632633

634635

636

637638

639

640641

642643644

645646

647648

649

650651

652653

654

655656657

658659

660661

662

663

664

665666

667

668

669

670

671

672

673674 675

676

677

678679

680

681682

683

684

685

686

687

688

689

690

691692693694695 696697

698699

700701

702

703

704705706 707

708709

710

711

712

713

714

715716 717

718

719

720721

722723724

725726

727

728

729730

731732

733

734

735736737738

739

740

741742

743744745746

747

748

749750751

752

753754

755

756

757

758

759760761762763

764

765766

767

768

769

770

771

772

773

774

775

776777778

779

780781

782783784

785

786

787788789

790791

792

793

794

795

796797

798

799

800

801

802

803804

805

806

807808809

810811

812813

814

815816817

818819820821

822823

824

825

826827

828829

830

831

832

833834835836

837

838

839840

841

842

843

844845846

847

848849

850851

852

853

854

855

856

857

858

859860

861

862863

864865

866

867868

Re[F1]

Im[F

1]

−3 −2 −1 0 1 2 3−4

−3

−2

−1

0

1

2

3

4

1

2

3

4

5

6

78

9

10

11

12

13

14

15

16

17

1819

20

212223 24

2526

27

28

29

30

31

3233

34

35

36

37

3839

4041

4243

44

45

46

47

48

49

50

5152

53

54

55

565758

596061

62

63

64

65

66

67

68

69

70 71

7273

74

75 76

77

78

79

80

81

82

8384

85

86

87

88

8990

9192 9394

95

96

97

98

99100

101

102

103

104

105

106

107

108109

110111

112

113

114115

116117118 119

120

121

122

123124

125

126127128

129

130 131

132

133

134

135

136137

138

139

140

141

142

143

144145

146

147

148149

150

151

152153

154

155

156157

158

159

160161

162

163

164

165

166167

168 169170

171

172

173

174

175176

177

178179 180

181182

183

184

185

186

187

188

189

190

191

192

193194

195

196

197

198

199

200

201202

203

204205

206

207

208

209

210211212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232233 234

235

236

237238

239

240

241

242

243

244

245

246

247

248

249

250

251

252 253

254

255

256

257 258259

260

261

262

263264

265266 267268269

270

271

272273274 275

276

277

278279

280

281 282

283

284

285

286

287288

289 290 291

292

293

294

295296

297

298

299300

301

302

303

304

305

306

307

308

309

310

311 312

313314

315

316317318

319

320321

322 323

324 325

326

327

328

329

330

331

332

333334

335

336

337338 339

340

341 342

343

344

345

346

347348

349

350

351

352 353

354

355356

357

358

359

360361

362

363

364

365

366367

368

369

370371

372373

374

375

376

377378

379

380381

382383384

385

386

387 388

389

390

391

392393

394

395 396

397

398399

400

401

402

403404405

406

407408

409

410411 412

413

414

415416

417

[Figure 10: three scatter-plot panels, (A), (B) and (C), with axes x and y.]

Figure 10. Comparison of Sammon's mapping and the first Fourier harmonic projection. (A) The result from Sammon's mapping of the rat kidney dataset. Five gene groups were represented by 5 colored symbols, the same as were used in Figure 6. (B)–(C) Comparison of relative gene locations between (B) VizStruct and (C) Sammon's mapping. Five colors were used, one for each of the gene groups. The color scheme is the same as in (A). Notice that the visualization layouts of (A) and (C) are identical. The purpose of panel (C) is to allow a gene-by-gene comparison of locations with panel (B).

ing a constant have very high coefficient values. By Propositions 2 and 3, they tend to be mapped closely along a line through the origin.

Some relative gene group locations can be predicted. For example, by Proposition 4, groups 4 and 5 are mapped symmetrically about the real axis. Close inspection suggests that the temporal pattern of group 8 is roughly a 2- or 3-point right time shift of the pattern of group 6. By Proposition 5, each time-point shift corresponds to a rotation of $2\pi/13$ radians, roughly $30^\circ$. In Figure 8B, genes in group 8 are mapped about $60^\circ$ clockwise from genes in group 6.

The lack of clear clusters in the visualization indicates that different clustering methods may yield different results. Figure 9 shows the grouping aspect of k-means clustering. Euclidean distance was used as the distance measure in k-means. The visualization reveals that genes close to the origin are grouped as one.
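The rotation angles quoted above for Proposition 5 can be checked by hand. The following worked arithmetic assumes $N = 13$ time points (the value implied by the $2\pi/13$ step) and $W_N = e^{-i2\pi/N}$ (as read off the proof of Proposition 5), so that a right shift by $d$ time points rotates the mapped point clockwise by $2\pi d/N$:

\[
\frac{2\pi}{13}\ \text{rad} = \frac{360^\circ}{13} \approx 27.7^\circ, \qquad
2 \cdot \frac{360^\circ}{13} \approx 55.4^\circ, \qquad
3 \cdot \frac{360^\circ}{13} \approx 83.1^\circ .
\]

Under these assumptions, the observed clockwise offset of about $60^\circ$ lies closest to a shift of two time points.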

Comparison of FFHP to Other Visualizations

Heat plot and parallel coordinates display the information in every individual dimension. In parallel coordinates, a polyline represents a gene over time. This format is concise and intuitive for displaying temporal profiles. However, parallel coordinates have an obvious drawback: as the dataset grows, the display becomes increasingly unreadable, as indicated in the first panel of Figure 5 and Figure 7. Heat plot uses a color mosaic instead of a polyline to overcome overlapping. Combined with the dendrogram from hierarchical clustering, it gives a global view of the dataset as well as of individual clusters by grouping genes with similar patterns together. Cluster relationships in heat plot are one dimensional: clusters of genes are listed one by one, as shown in Figure 5A and Figure 7A.

The first Fourier harmonic projection takes a different approach. It uses a two dimensional point to represent each gene over time. By doing so, it takes advantage of the spatial relationship of the points to reflect the structure of the original dataset. Thus the cluster relationships become two dimensional, and FFHP has a better capability of displaying outliers than heat plot. Unlike heat plot, which requires algorithms to group similar genes in order to produce a meaningful visualization, FFHP directly maps the multi-dimensional data onto a two dimensional space without any prior knowledge of the dataset.

Multidimensional scaling (MDS) [3] is a competing approach for visualizing multi-dimensional data. We compared FFHP to Sammon's mapping [15], a variant of MDS that optimizes the following stress function $E$:

\[
E = \frac{1}{\sum_{i}\sum_{j<i} d^{*}_{ij}} \sum_{i}\sum_{j<i} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}}, \tag{4}
\]

where $d^{*}_{ij}$ is the distance between points $i$ and $j$ in the $N$-dimensional space and $d_{ij}$ is the distance between $i$ and $j$ in the visualization.
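To make Eq. (4) concrete, the stress of a given layout can be evaluated directly. The following is a minimal sketch, assuming Euclidean distances in both spaces and no duplicate points; the names `pairwise_distances` and `sammon_stress` are illustrative and do not come from the original work, and the iterative optimization that Sammon's mapping performs on $E$ is not shown.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix for the rows of a 2-D array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sammon_stress(high_dim, low_dim):
    """Stress E of Eq. (4) for a given low-dimensional layout."""
    d_star = pairwise_distances(np.asarray(high_dim, dtype=float))
    d = pairwise_distances(np.asarray(low_dim, dtype=float))
    # Use each unordered pair once (j < i). Assumes no duplicate points,
    # so every original distance d*_ij is strictly positive.
    i, j = np.tril_indices(d_star.shape[0], k=-1)
    d_star, d = d_star[i, j], d[i, j]
    return ((d_star - d) ** 2 / d_star).sum() / d_star.sum()
```

For instance, two points that are 5 units apart in the original space and drawn 4 units apart give $E = \frac{1}{5}\cdot\frac{(5-4)^2}{5} = 0.04$, while a layout that preserves every distance gives $E = 0$.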

Figure 10 shows Sammon's mapping of the rat kidney dataset. A comparison of Figure 10A to Figure 6A reveals the extensive similarities between FFHP and Sammon's mapping. The relative locations of individual samples are also remarkably similar. This is indicated in Figure 10B and Figure 10C.

Sammon's mapping has some drawbacks: (1) it provides a single final result and the user cannot intervene interactively during visualization, (2) the incremental addition of even a single point requires a complete repetition of the optimization procedure and possibly an extensive reorganization of all the previously mapped points to new locations, and (3) it requires time-consuming optimization procedures of time complexity $O(N^2)$ or greater.

Our results illustrate some of the strengths and weaknesses of FFHP. The visualization reflects the structure of the dataset: outliers, clusters and their relationships. Comprehending the structure of the dataset can facilitate the choice of clustering methods and the understanding of their results. Although FFHP uses an approach to mapping multi-dimensional data that is distinctively different from Sammon's mapping, it yields results that are consistently comparable. However, when the dataset contains a large number of patterns, it becomes difficult to separate different patterns and the visualization may be difficult to interpret.

4 DISCUSSION

Visualization of microarray data is a challenge because of its high dimensionality. In this paper, we have explored the use of the first Fourier harmonic projection for visualizing multi-dimensional time course array datasets. Unlike the parallel coordinate or heat plot approaches, which display all dimensional information, FFHP uses a two dimensional point to represent each data point (a gene in this case). Our results indicated that temporal patterns were well reflected by spatial relationships: genes with similar patterns were aggregated and the relative locations of gene groups can be predicted. Moreover, the first Fourier harmonic projection was shown to yield results that were similar to those from Sammon's mapping.

Achieving a two dimensional mapping requires a trade-off: the mapping is lossy with respect to detailed dimensional information. Our approach attempts to preserve as much of the semantics of the data points as possible through their Fourier harmonic content. In particular, it characterizes the data using two descriptive measurements: the real and imaginary parts of the first discrete Fourier harmonic. A similar approach uses principal component analysis (PCA) [11], which employs a different pair of descriptive measurements: the first and second principal components.
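The two descriptive measurements can be computed in a few lines. The sketch below is illustrative only: it assumes a complete expression matrix with one gene per row and one time point per column, and it uses the unnormalized sum $F_1(x[n]) = \sum_{n=0}^{N-1} x[n]\,e^{-i2\pi n/N}$ that appears in the appendix proofs. The function name `ffhp` is hypothetical, and any scaling or interactive features of the authors' tool are not reproduced.

```python
import numpy as np


def ffhp(profiles):
    """Map each row (one gene's time series) to its first Fourier harmonic.

    Returns an array of shape (n_genes, 2) holding the real and imaginary
    parts of F1(x) = sum_n x[n] * exp(-2j*pi*n/N).
    """
    profiles = np.asarray(profiles, dtype=float)
    n_time = profiles.shape[1]
    basis = np.exp(-2j * np.pi * np.arange(n_time) / n_time)
    f1 = profiles @ basis
    return np.column_stack([f1.real, f1.imag])
```

Each gene is mapped independently of all others, so adding a gene does not move the points already plotted. Under NumPy's DFT sign convention the same complex value is available as `np.fft.fft(x)[1]`.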

In addition to the first Fourier harmonic projection, higher Fourier harmonics can also be used as mappings. It can be shown that for any harmonic (> 1), there exists an equivalent first harmonic of the original discrete signal whose time indices are systematically rearranged. Under certain conditions (such as temporal patterns with high-frequency variations, i.e., multiple cycles), higher harmonic projections enhance substructure separation in the visualization. A detailed discussion is beyond the scope of this paper because of length constraints.
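One way to see the rearrangement claim is to move sample $n$ to position $kn \bmod N$. The sketch below is a hedged illustration rather than the authors' construction: it only covers the case where $k$ and $N$ are coprime (true for every $k$ when $N$ is prime, e.g. $N = 13$), since only then is the index map a pure permutation. The helper names `harmonic` and `rearrange_for_harmonic` are hypothetical.

```python
import numpy as np
from math import gcd


def harmonic(x, k=1):
    """k-th discrete Fourier harmonic: sum_n x[n] * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(x.size)
    return np.sum(x * np.exp(-2j * np.pi * k * n / x.size))


def rearrange_for_harmonic(x, k):
    """Permute x so that its first harmonic equals the k-th harmonic of x.

    Sample n is moved to position (k*n) mod N. This is a permutation only
    when gcd(k, N) == 1, which this sketch requires.
    """
    x = np.asarray(x, dtype=float)
    size = x.size
    if gcd(k, size) != 1:
        raise ValueError("k and N must be coprime for a pure rearrangement")
    y = np.empty_like(x)
    y[(k * np.arange(size)) % size] = x
    return y
```

With these definitions, `harmonic(rearrange_for_harmonic(x, k), 1)` agrees with `harmonic(x, k)` up to floating-point error.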

The FFHP mapping results in two-dimensional visualizations that are identical to those of radial coordinate visualization techniques, e.g., RadViz [7]. However, rather than the vector notation and the spring paradigm of RadViz, we have used a complex number notation. This substantive reformulation of the mapping provides valuable theoretical insights and allows important properties of the mapping, including its relationship to the DFT, to be easily derived. It also opens up possible extensions such as higher Fourier harmonic projections.

The first Fourier harmonic projection requires data without missing values. This requirement can be easily met because filling in missing values is a mature research field [17].

Gene expression data can be studied in either sample space or gene space. Here, we have reported only the visualization on gene space. In a separate report [19], we applied the first Fourier harmonic projection on the sample space and performed visualization-driven classifications. Our experiments demonstrated that FFHP offers an alternative format of visualization. We believe that using FFHP alone or in combination with heat plot or parallel coordinates would give a biologist additional powerful tools for analyzing and visualizing microarray datasets.

ACKNOWLEDGEMENTS

This work was supported by grants from the National Science Foundation. We also thank the anonymous reviewers for their constructive comments on the manuscript.

References

[1] Cadzow, J. A. and Landingham, H. F. Signals, Systems, and Transforms. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985.

[2] Cook, C., Buja, A., Cabrera, J., and Hurley, C. Grand Tour and Projection Pursuit. Journal of Computational and Graphical Statistics, 2(3):225–250, 1995.

[3] Davison, M. L. Multidimensional Scaling. Krieger Publishing, Inc., Malabar, FL, 1992.

[4] Diggle, P. J. Time Series: A Biostatistical Introduction. Oxford University Press, Oxford OX2 6DP, 1990.

[5] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. Cluster Analysis and Display of Genome-wide Expression Patterns. Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868, December 1998.

[6] Hautaniemi, S., Yli-Harja, O., Astola, J., Kauraniemi, P., Kallioniemi, A., Wolf, M., Ruiz, J., Mousses, S., and Kallioniemi, O. Analysis and Visualization of Gene Expression Microarray Data in Human Cancer Using Self-Organizing Maps, 2003. To appear in Machine Learning: Special Issue on Methods in Functional Genomics.

[7] Hoffman, P. E., Grinstein, G. G., Marx, K., Grosse, I., and Stanley, E. DNA Visual and Analytic Data Mining. In IEEE Visualization '97, pages 437–441, Phoenix, AZ, 1997.

[8] Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R., and Fedoroff, N. V. Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity. Proc. Natl. Acad. Sci. USA, Vol. 97(15):8409–8414, July 2000.

[9] Ideker, T., Galitski, T., and Hood, L. A New Approach to Decoding Life: Systems Biology. Annu. Rev. Genomics Hum. Genet., 2:343–372, July 2001.

[10] Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J. Jr., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P. O. The Transcriptional Program in the Response of Human Fibroblasts to Serum. Science, Vol. 283(1):83–87, January 1999.

[11] Yeung, K. Y. and Ruzzo, W. L. Principal Component Analysis for Clustering Gene Expression Data. Bioinformatics, Vol. 17(9):763–774, 2001.

[12] Kohonen, T. Self-Organizing Maps, Springer Series in Information Sciences, Vol. 30. Springer, Berlin, Heidelberg, New York, 1995.

[13] Morrison, N., editor. Introduction to Fourier Analysis. John Wiley & Sons, Inc., New York, NY, 1994.

[14] Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L., editors. The Analysis of Gene Expression Data: Methods and Software. Springer-Verlag New York, Inc., New York, NY, 2003.

[15] Sammon, J. W. A Nonlinear Mapping for Data Structure Analysis. IEEE Transactions on Computers, C-18(5):401–409, 1969.

[16] Stuart, R. O., Bush, K. T., and Nigam, S. K. Changes in Global Gene Expression Patterns During Development and Maturation of the Rat Kidney. Proc. Natl. Acad. Sci. USA, Vol. 98(10):5649–5654, May 2001.

[17] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics, Vol. 17(6):520–525, 2001.

[18] Ward, M. O. XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data. In IEEE Visualization 1994, pages 326–336, 1994.

[19] Zhang, L., Zhang, A., Ramanathan, M. et al. VizCluster and Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Databases, 13(1):79–97, January 2003.

Appendix: Proof for Propositions

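Lemma 1

\[
\left(\sum_{n=0}^{N-1} a_n\right)^{2} = \sum_{n=0}^{N-1} a_n^{2} + 2\sum_{k=1}^{N-1}\sum_{t=0}^{N-k-1} a_t\, a_{t+k}.
\]

Lemma 2 For any two complex numbers $z$ and $w$, (1) $\overline{z + w} = \overline{z} + \overline{w}$, (2) $\overline{z\,w} = \overline{z}\;\overline{w}$, (3) $\overline{\overline{z}} = z$.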

Lemma 3 Let $j \in \mathbb{N}$ with $j$ not a multiple of $N$; then

\[
\sum_{n=0}^{N-1} e^{-i2\pi jn/N} = \sum_{n=0}^{N-1} \cos(2\pi jn/N) = \sum_{n=0}^{N-1} \sin(2\pi jn/N) = 0.
\]

Lemma 4 FFHP is homomorphic: $F_1(a\,x[n] + b\,y[n]) = a\,F_1(x[n]) + b\,F_1(y[n])$.

Proposition 1 (Cancellation) If $x[n] = [a, \ldots, a]$, then $F_1(x[n]) = 0$.

Proof: From the formula in Eq. (2), we get

\[
F_1(x[n]) = \sum_{n=0}^{N-1} a\, e^{-i2\pi n/N} = a \sum_{n=0}^{N-1} e^{-i2\pi n/N}.
\]

By Lemma 3 with $j = 1$, $F_1(x[n]) = 0$. □

Proposition 2 (Amplitude Shifting) If $y[n] = x[n] + a$, then $F_1(y[n]) = F_1(x[n])$.

Proof: From the formula in Eq. (2), we get

\[
F_1(y[n]) = \sum_{n=0}^{N-1} (x[n] + a)\, e^{-i2\pi n/N} = \sum_{n=0}^{N-1} x[n]\, e^{-i2\pi n/N} + \sum_{n=0}^{N-1} a\, e^{-i2\pi n/N} = F_1(x[n]) + 0.
\]

The second summation is $0$ by Proposition 1. □

Proposition 3 (Amplitude Multiplying) If $y[n] = a\,x[n]$, then $F_1(y[n]) = a\,F_1(x[n])$.

Proof: From the formula in Eq. (2), we get

\[
F_1(y[n]) = \sum_{n=0}^{N-1} a\, x[n]\, e^{-i2\pi n/N} = a \sum_{n=0}^{N-1} x[n]\, e^{-i2\pi n/N} = a\, F_1(x[n]). \qquad \square
\]
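Propositions 1 to 3 are easy to confirm numerically. The snippet below is a small sanity check, not part of the original paper: the helper `f1` implements the unnormalized sum used in the proofs, and the profile `x` and the constants 4.0 and 2.5 are made-up values chosen only for illustration.

```python
import numpy as np


def f1(x):
    """First Fourier harmonic: sum_n x[n] * exp(-2j*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(x.size)
    return np.sum(x * np.exp(-2j * np.pi * n / x.size))


x = np.array([0.2, 1.5, 3.1, 2.4, 0.9, -0.7, -1.3])  # made-up profile

constant_image = f1(np.full(x.size, 4.0))  # Proposition 1: maps to the origin
shifted_image = f1(x + 4.0)                # Proposition 2: same point as f1(x)
scaled_image = f1(2.5 * x)                 # Proposition 3: 2.5 * f1(x)
```

Together, Propositions 2 and 3 imply that profiles of the form $a\,x[n] + b$ fall on one line through the origin, which is the behavior used in the main text.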

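Proposition 4 (Transposing) If $x[n]$ is a real signal and $y[n] = x[N - n - 1]$, then $F_1(y[n]) = e^{i2\pi/N}\,\overline{F_1(x[n])}$; that is, $F_1(y[n])$ is the mirror image of $F_1(x[n])$ about the real axis, up to a one-sample rotation of $2\pi/N$.

Proof: By Lemma 2, we have

\[
\overline{F_1(x[n])} = \overline{\sum_{n=0}^{N-1} x[n]\, e^{-i2\pi n/N}} = \sum_{n=0}^{N-1} \overline{x[n]}\, e^{i2\pi n/N} = \sum_{n=0}^{N-1} \overline{x}[n]\, e^{i2\pi n/N} = F_1(\overline{x}[-n]).
\]

However, when $x[n]$ is a real signal, $\overline{x}[n] = x[n]$. Then we have $\overline{F_1(x[n])} = F_1(x[-n])$, i.e., $F_1(x[-n]) = \overline{F_1(x[n])}$.

Under the periodic extension $x[n] = x[n + N]$, $x[N - n - 1] = x[-(n + 1)]$, which is $x[-n]$ advanced by one sample. Substituting $m = N - n - 1$ in Eq. (2) gives

\[
F_1(y[n]) = \sum_{m=0}^{N-1} x[m]\, e^{-i2\pi (N - 1 - m)/N} = e^{i2\pi/N} \sum_{m=0}^{N-1} x[m]\, e^{i2\pi m/N} = e^{i2\pi/N}\, \overline{F_1(x[n])}. \qquad \square
\]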
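The role of the indexing convention in Proposition 4 can be checked numerically. The sketch below is illustrative and rests on two stated assumptions: the periodic extension $x[n] = x[n+N]$ used in the proof of Proposition 5, and a made-up real profile `x`. It compares the periodic reversal $x[-n]$ with the plain index reversal $x[N-n-1]$.

```python
import numpy as np


def f1(x):
    """First Fourier harmonic: sum_n x[n] * exp(-2j*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(x.size)
    return np.sum(x * np.exp(-2j * np.pi * n / x.size))


x = np.array([0.2, 1.5, 3.1, 2.4, 0.9, -0.7, -1.3])  # made-up real profile
N = x.size

periodic_reversal = x[(-np.arange(N)) % N]  # y[n] = x[-n] = x[N - n]
index_reversal = x[::-1]                    # y[n] = x[N - n - 1]

one_sample_rotation = np.exp(2j * np.pi / N)
```

Here `f1(periodic_reversal)` equals the complex conjugate of `f1(x)`, while `f1(index_reversal)` equals that conjugate multiplied by `one_sample_rotation`. The extra rotation is $2\pi/N$, so it shrinks as the number of time points grows.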

Proposition 5 (Time Shifting) If $y[n] = x[n - d]$, then $F_1(y[n]) = W_N^{d}\, F_1(x[n])$.

Proof: Assume $0 \le n < N$ and let $l = n - d$, so that $n = l + d$. When $n = 0$, $l = -d$, and when $n = N - 1$, $l = N - 1 - d$. From the formula in Eq. (2), we get

\[
F_1(y[n]) = F_1(x[n - d]) = \sum_{l=-d}^{N-1-d} x[l]\, e^{-i2\pi (l + d)/N} = \sum_{l=-d}^{N-1-d} x[l]\, e^{-i2\pi l/N}\, e^{-i2\pi d/N} = W_N^{d} \sum_{l=-d}^{N-1-d} x[l]\, e^{-i2\pi l/N}.
\]

However, since $e^{i2\pi n/N} = e^{i2\pi (n + N)/N}$ and $x[n] = x[n + N]$,

\[
\sum_{l=-d}^{N-1-d} x[l]\, e^{-i2\pi l/N} = \sum_{l=-d}^{-1} x[l + N]\, e^{-i2\pi (l + N)/N} + \sum_{l=0}^{N-1-d} x[l]\, e^{-i2\pi l/N}.
\]

Letting $t = l + N$ in the first summation and $t = l$ in the second summation, we get

\[
\sum_{l=-d}^{N-1-d} x[l]\, e^{-i2\pi l/N} = \sum_{t=N-d}^{N-1} x[t]\, e^{-i2\pi t/N} + \sum_{t=0}^{N-1-d} x[t]\, e^{-i2\pi t/N} = \sum_{t=0}^{N-1} x[t]\, e^{-i2\pi t/N} = F_1(x[n]).
\]

Therefore, $F_1(y[n]) = F_1(x[n])\, W_N^{d}$. □
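The time-shifting property can be verified with a circular shift. This is a minimal check under stated assumptions: $W_N = e^{-i2\pi/N}$ (as read off the proof above), a periodic signal so that `np.roll` realizes $y[n] = x[n-d]$, and a made-up 13-point profile; none of the variable names come from the original work.

```python
import numpy as np


def f1(x):
    """First Fourier harmonic: sum_n x[n] * exp(-2j*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(x.size)
    return np.sum(x * np.exp(-2j * np.pi * n / x.size))


N, d = 13, 2
x = np.sin(2 * np.pi * np.arange(N) / N) + 0.3 * np.arange(N)  # made-up profile
y = np.roll(x, d)                 # y[n] = x[(n - d) mod N], a right shift by d
w_n = np.exp(-2j * np.pi / N)     # assumed definition of W_N

rotation = np.angle(f1(y) / f1(x))  # signed angle between the two images
```

The value of `rotation` is $-2\pi d/N$, a clockwise turn, and the distance from the origin is unchanged because $|W_N^{d}| = 1$.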

Definition 1 The mean of a signal $x[n]$ is defined as $\bar{x} = \sum_{n=0}^{N-1} x[n]/N$. The $k$-th sample autocovariance coefficient of a signal $x[n]$ is defined as $g_k = \sum_{n=0}^{N-1-k} (x[n] - \bar{x})(x[n + k] - \bar{x})/N$. $g_0$ is called the variance of $x[n]$. The $k$-th sample autocorrelation coefficient is defined as $r_k = g_k/g_0$.

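Proposition 6 (General Distance) Let $w[n] = x[n] - y[n]$ be the difference between $x[n]$ and $y[n]$. The squared distance between $F_1(x[n])$ and $F_1(y[n])$ is

\[
\|F_1(w[n])\|^{2} = g_0 N \left(1 + 2\sum_{k=1}^{N-1} r_k \cos(2\pi k/N)\right),
\]

where $g_0$ and $r_k$ are the variance and the autocorrelation coefficients of $w[n]$.

Proof: By Lemma 4, the distance between $F_1(y[n])$ and $F_1(x[n])$ is $\|F_1(w[n])\|$. From Eq. (2), we get

\[
\|F_1(w[n])\| = \left\|\sum_{n=0}^{N-1} w[n]\, e^{-i2\pi n/N}\right\| = \left\|\sum_{n=0}^{N-1} \bigl(w[n]\cos(2\pi n/N) - i\, w[n]\sin(2\pi n/N)\bigr)\right\|.
\]

Let $\omega = 2\pi/N$. By Lemma 3, we have $\sum_{n=0}^{N-1} \cos(n\omega) = \sum_{n=0}^{N-1} \sin(n\omega) = 0$. Now add a term $\bar{w}$, the mean of $w[n]$:

\[
\|F_1(w[n])\|^{2} = \left(\sum_{n=0}^{N-1} w[n]\cos(n\omega)\right)^{2} + \left(\sum_{n=0}^{N-1} w[n]\sin(n\omega)\right)^{2}
= \left(\sum_{n=0}^{N-1} (w[n] - \bar{w})\cos(n\omega)\right)^{2} + \left(\sum_{n=0}^{N-1} (w[n] - \bar{w})\sin(n\omega)\right)^{2}.
\]

Expanding each squared term by Lemma 1, we get

\[
\sum_{n=0}^{N-1} (w[n] - \bar{w})^{2}\bigl(\cos^{2}(n\omega) + \sin^{2}(n\omega)\bigr) + 2\sum_{k=1}^{N-1}\sum_{t=0}^{N-1-k} \bigl[(w[t] - \bar{w})(w[t + k] - \bar{w})\,\Omega\bigr],
\]

where $\Omega = \cos(t\omega)\cos((t + k)\omega) + \sin(t\omega)\sin((t + k)\omega)$. By the trigonometric identity $\cos\theta\cos\phi + \sin\theta\sin\phi = \cos(\phi - \theta)$, we have $\Omega = \cos(k\omega)$. Now

\[
\|F_1(w[n])\|^{2} = \sum_{n=0}^{N-1} (w[n] - \bar{w})^{2} + 2\sum_{k=1}^{N-1}\sum_{t=0}^{N-1-k} \bigl[(w[t] - \bar{w})(w[t + k] - \bar{w})\cos(k\omega)\bigr]
\]

\[
= N\left(g_0 + 2\sum_{k=1}^{N-1} g_k \cos(k\omega)\right) = g_0 N \left(1 + 2\sum_{k=1}^{N-1} r_k \cos(2\pi k/N)\right). \qquad \square
\]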
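The distance formula of Proposition 6 can be compared against a direct computation. The sketch below is a numerical check only, assuming the autocovariance of Definition 1 applied to the difference signal $w[n]$ and a non-constant $w[n]$ (so that $g_0 > 0$); the function names are hypothetical.

```python
import numpy as np


def f1(x):
    """First Fourier harmonic: sum_n x[n] * exp(-2j*pi*n/N)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(x.size)
    return np.sum(x * np.exp(-2j * np.pi * n / x.size))


def autocovariances(w):
    """Sample autocovariance coefficients g_0 .. g_{N-1} of Definition 1."""
    w = np.asarray(w, dtype=float)
    centered = w - w.mean()
    size = w.size
    return np.array([np.sum(centered[:size - k] * centered[k:]) / size
                     for k in range(size)])


def squared_distance_from_autocorrelation(x, y):
    """Right-hand side of Proposition 6 for w = x - y."""
    w = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    g = autocovariances(w)
    r = g / g[0]  # assumes w is not constant, so g[0] > 0
    k = np.arange(1, w.size)
    return g[0] * w.size * (1 + 2 * np.sum(r[1:] * np.cos(2 * np.pi * k / w.size)))
```

For any two profiles `x` and `y` of equal length, `abs(f1(x) - f1(y)) ** 2` and `squared_distance_from_autocorrelation(x, y)` agree up to floating-point error.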
