
Preface

As natural phenomena are being probed and mapped in ever-greater detail, scientists in genomics and proteomics are facing an exponentially growing volume of increasingly complex-structured data, information, and knowledge. Examples include data from microarray gene expression experiments, bead-based and microfluidic technologies, and advanced high-throughput mass spectrometry. A fundamental challenge for life scientists is to explore, analyze, and interpret this information effectively and efficiently. To address this challenge, traditional statistical methods are being complemented by methods from data mining, machine learning and artificial intelligence, visualization techniques, and emerging technologies such as Web services and grid computing.

There exists a broad consensus that sophisticated methods and tools from statistics and data mining are required to address the growing data analysis and interpretation needs in the life sciences. However, there is also a great deal of confusion about the arsenal of available techniques and how these should be used to solve concrete analysis problems. Partly this confusion is due to a lack of mutual understanding caused by the different concepts, languages, methodologies, and practices prevailing within the different disciplines.

Fundamentals of Data Mining in Genomics and Proteomics, edited by Werner Dubitzky, Martin Granzow, and Daniel Berrar. Springer, 2007, XXII, 282 p., 68 illus., Hardcover. ISBN-10: 0-387-47508-7, ISBN-13: 978-0-387-47508-0

A typical scenario from pharmaceutical research should illustrate some of the issues. A molecular biologist conducts nearly one hundred experiments examining the toxic effect of certain compounds on cultured cells using a microarray gene expression platform. The experiments include different compounds and doses and involve nearly 20 000 genes. After the experiments are completed, the biologist presents the data to the bioinformatics department and briefly explains what kind of questions the data is supposed to answer. Two days later the biologist receives the results, which describe the output of a cluster analysis separating the genes into groups of activity and dose. While the groups seem to show interesting relationships, they do not directly address the questions the biologist has in mind. Also, the data sheet accompanying the results shows the original data but in a different order and somehow transformed. Discussing this with the bioinformatician again, it turns out that what the biologist wanted was not clustering (automatic classification or automatic class prediction) but supervised classification or supervised class prediction.
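The distinction at the heart of this scenario can be made concrete with a small sketch. The data, the class names, and the helper functions below are hypothetical and chosen only for illustration; they are not taken from the book. The first function groups samples without looking at any labels (clustering), while the second and third learn from known labels and then predict the class of a new sample (supervised classification).

```python
from statistics import mean

# Hypothetical toy data: one expression value per sample (not from the book).
samples = [1.0, 1.2, 0.8, 3.9, 4.1, 4.3]
labels = ["non-toxic", "non-toxic", "non-toxic", "toxic", "toxic", "toxic"]


def two_means(values, iterations=10):
    """Unsupervised: split values into two groups without using any labels."""
    centers = [min(values), max(values)]
    assignment = []
    for _ in range(iterations):
        # Assign every value to the nearer of the two current centers.
        assignment = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of the values assigned to it.
        for k in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:
                centers[k] = mean(members)
    return assignment


def train_nearest_centroid(values, classes):
    """Supervised: learn one centroid per known class label."""
    return {
        c: mean(v for v, label in zip(values, classes) if label == c)
        for c in set(classes)
    }


def predict(centroids, value):
    """Assign a new sample to the class with the closest centroid."""
    return min(centroids, key=lambda c: abs(value - centroids[c]))


clusters = two_means(samples)                    # group numbers, no class names
model = train_nearest_centroid(samples, labels)  # uses the known labels
new_prediction = predict(model, 3.5)             # class label for an unseen sample
```

The clustering result is only a set of anonymous group numbers, whereas the supervised model answers the question the biologist actually asked: which known class does a new sample belong to?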

One main reason for this confusion and lack of mutual understanding is the absence of a conceptual platform that is common to and shared by the two broad disciplines, life science and data analysis. Another reason is that data mining in the life sciences is different to that in other typical data mining applications (such as finance, retail, and marketing) because many requirements are fundamentally different. Some of the more prominent differences are highlighted below.

A common theme in many genomic and proteomic investigations is the need for a detailed understanding (descriptive, predictive, explanatory) of genome- and proteome-related entities, processes, systems, and mechanisms. A vast body of knowledge describing these entities has been accumulated on a staggering range of life phenomena. Most conventional data mining applications do not have the requirement of such a deep understanding and there is nothing that compares to the global knowledge base in the life sciences.

A great deal of the data in genomics and proteomics is generated specifically to be analyzed and interpreted in the context of the questions and hypotheses to be answered and tested. In many classical data mining scenarios, the data to be analyzed are generated as a “by-product” of an underlying business process (e.g., customer relationship management, financial transactions, process control, Web access logs). Hence, in the conventional scenario there is no notion of question or hypothesis at the point of data generation.

Depending on what phenomenon is being studied and the methodology and technology used to generate data, genomic and proteomic data structures and volumes vary considerably. They include temporally and spatially resolved data (e.g., from various imaging instruments), data from spectral analysis, encodings for the sequential and spatial representation of biological macromolecules and smaller chemical and biochemical compounds, graph structures, and natural language text. In comparison, data structures encountered in typical data mining applications are simple.

Because of ethical constraints and the costs and time involved in running experiments, most studies in genomics and proteomics create a modest number of observation points ranging from several dozen to several hundred. The number of observation points in classical data mining applications ranges from thousands to millions. On the other hand, modern high-throughput experiments measure several thousand variables per observation, many more than encountered in conventional data mining scenarios.

By definition, research and development in genomics and proteomics is subject to constant change – new questions are being asked, new phenomena are being probed, and new instruments are being developed. This leads to frequently changing data processing pipelines and workflows. Business processes in classical data mining areas are much more stable. Because solutions will be in use for a long time, the development of complex, comprehensive, and expensive data mining applications (such as data warehouses) is readily justified.

Genomics and proteomics are intrinsically “global” – in the sense that hundreds if not thousands of databases, knowledge bases, computer programs, and document libraries are available via the Internet and are used by researchers and developers throughout the world as part of their day-to-day work. The information accessible through these sources forms an intrinsic part of the data analysis and interpretation process. No comparable infrastructure exists in conventional data mining scenarios.

This volume presents state-of-the-art analytical methods to address key analysis tasks that data from genomics and proteomics involve. Most importantly, the book will put particular emphasis on the common caveats and pitfalls of the methods by addressing the following questions: What are the requirements for a particular method? How are the methods deployed and used? When should a method not be used? What can go wrong? How can the results be interpreted? The main objectives of the book include:

• To be acceptable and accessible to researchers and developers both in life science and computer science disciplines – it is therefore necessary to express the methodology in a language that practitioners in both disciplines understand;

• To incorporate fundamental concepts from both conventional statistics and the more exploratory, algorithmic and computational methods provided by data mining;

• To take into account the fact that data analysis in genomics and proteomics is carried out against the backdrop of a huge body of existing formal knowledge about life phenomena and biological systems;

• To consider recent developments in genomics and proteomics such as the need to view biological entities and processes as systems rather than collections of isolated parts;

• To address the current trend in genomics and proteomics towards increasing computerization, for example, computer-based modeling and simulation of biological systems and the data analysis issues arising from large-scale simulations;

• To demonstrate where and how the respective methods have been successfully employed and to provide guidelines on how to deploy and use them;

• To discuss the advantages and disadvantages of the presented methods, thus allowing the user to make an informed decision in identifying and choosing the appropriate method and tool;

• To demonstrate potential caveats and pitfalls of the methods so as to prevent any inappropriate use;

• To provide a section describing the formal aspects of the discussed methodologies and methods;

• To provide an exhaustive list of references the reader can follow up to obtain detailed information on the approaches presented in the book;

• To provide a list of freely and commercially available software tools.

It is hoped that this volume will (i) foster the understanding and use of powerful statistical and data mining methods and tools in life science as well as computer science and (ii) promote the standardization of data analysis and interpretation in genomics and proteomics.

The approach taken in this book is conceptual and practical in nature. This means that the presented data-analytical methodologies and methods are described in a largely non-mathematical way, emphasizing an information-processing perspective (input, output, parameters, processing, interpretation) and conceptual descriptions in terms of mechanisms, components, and properties. As a result, the reader is not required to possess detailed knowledge of advanced theory and mathematics. Importantly, the merits and limitations of the presented methodologies and methods are discussed in the context of “real-world” data from genomics and proteomics. Alternative techniques are mentioned where appropriate. Detailed guidelines are provided to help practitioners avoid common caveats and pitfalls, e.g., with respect to specific parameter settings, sampling strategies for classification tasks, and interpretation of results. For completeness, a short section outlining mathematical details accompanies a chapter where appropriate. Each chapter provides a rich reference list to more exhaustive technical and mathematical literature about the respective methods.

Our goal in developing this book is to address complex issues arising from data analysis and interpretation tasks in genomics and proteomics by providing what is simultaneously a design blueprint, user guide, and research agenda for current and future developments in the field.

As a design blueprint, the book is intended for the practicing professional (researcher, developer) tasked with the analysis and interpretation of data generated by high-throughput technologies in genomics and proteomics, e.g., in pharmaceutical and biotech companies, and academic institutes.

As a user guide, the book seeks to address the requirements of scientists and researchers to gain a basic understanding of existing concepts and methods for analyzing and interpreting high-throughput genomics and proteomics data. To assist such users, the key concepts and assumptions of the various techniques, and their conceptual and computational merits and limitations, are explained, and guidelines for choosing the methods and tools most appropriate to the analytical tasks are given. Instead of presenting a complete and intricate mathematical treatment of the presented analysis methodologies, our aim is to provide the users with a clear understanding and practical know-how of the relevant concepts and methods so that they are able to make informed and effective choices for data preparation, parameter setting, output post-processing, and result interpretation and validation.




As a research agenda, this volume is intended for students, teachers, researchers, and research managers who want to understand the state of the art of the presented methods and the areas in which gaps in our knowledge demand further research and development. To this end, our aim is to maintain readability and accessibility throughout the chapters, rather than compiling a mere reference manual. Therefore, considerable effort is made to ensure that the presented material is supplemented by rich literature cross-references to more foundational work.

In a quarter-length course, one lecture can be devoted to two chapters, and a project may be assigned based on one of the topics or techniques discussed in a chapter. In a semester-length course, some topics can be covered in greater depth, including – perhaps with the aid of an in-depth statistics/data mining text – more of the formal background of the discussed methodology. Throughout the book concrete suggestions for further reading are provided.

Clearly, we cannot expect to do justice to all three goals in a single book. However, we do believe that this book has the potential to go a long way in bridging a considerable gap that currently exists between scientists in the field of genomics and proteomics on the one hand and computer scientists on the other hand. Thus, we hope, this volume will contribute to increased communication and collaboration across the disciplines and will help facilitate a consistent approach to analysis and interpretation problems in genomics and proteomics in the future.

This volume comprises 12 chapters, which follow a similar structure in terms of the main sections. The centerpiece of each chapter is a case study that demonstrates the use – and misuse – of the presented method or approach. The first chapter provides a general introduction to the field of data mining in genomics and proteomics. The remaining chapters are intended to shed more light on specific methods or approaches.

The second chapter focuses on study design principles and discusses replication, blocking, and randomization. While these principles are presented in the context of microarray experiments, they are applicable to many types of experiments.

Chapter 3 addresses data pre-processing in cDNA and oligonucleotide microarrays. The methods discussed include background intensity correction, data normalization and transformation, how to make gene expression levels comparable across different arrays, and others.

Chapter 4 is also concerned with pre-processing. However, the focus is placed on high-throughput mass spectrometry data. Key topics include baseline correction, intensity normalization, signal denoising (e.g., via wavelets), peak extraction, and spectra alignment.

Data visualization plays an important role in exploratory data analysis. Generally, it is a good idea to look at the distribution of the data prior to analysis. Chapter 5 revolves around visualization techniques for high-dimensional data sets, and puts emphasis on multi-dimensional scaling. This technique is illustrated on mass spectrometry data.




Chapter 6 presents the state of the art of clustering techniques for discovering groups in high-dimensional data. The methods covered include hierarchical and k-means clustering, self-organizing maps, self-organizing tree algorithms, model-based clustering, and cluster validation strategies, such as functional interpretation of clustering results in the context of microarray data.

Chapter 7 addresses the important topics of feature selection, feature weighting, and dimension reduction for high-dimensional data sets in genomics and proteomics. This chapter also includes statistical tests (parametric or nonparametric) for assessing the significance of selected features, for example, based on random permutation testing.
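As a rough illustration of random permutation testing for a single feature, the following minimal sketch estimates how often a difference in class means at least as large as the observed one arises when class labels are shuffled. The function name, the choice of test statistic, the number of permutations, and the expression values are all assumptions made for this example; Chapter 7 may proceed differently.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to shuffling class labels.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical expression values of one gene in two classes.
treated = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
control = [3.9, 4.2, 3.7, 4.0, 4.4, 3.8]
p_separated = permutation_p_value(treated, control)

# Two identical groups: every permutation is at least as extreme as observed.
p_identical = permutation_p_value([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```

When thousands of features are tested in this way, the resulting p-values still need to be adjusted for multiple testing, a topic covered under FDR and FWER in the list of abbreviations.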

Since data sets in genomics and proteomics are usually relatively small with respect to the number of samples, predictive models are frequently tested based on resampled data subsets. Chapter 8 reviews some common data resampling strategies, including n-fold cross-validation, leave-one-out cross-validation, and the repeated hold-out method.

Chapter 9 discusses support vector machines for classification tasks, and illustrates their use in the context of mass spectrometry data.

Chapter 10 presents graphs and networks in genomics and proteomics, such as biological networks, pathways, topologies, interaction patterns, gene-gene interactome, and others.

Chapter 11 concentrates on time series analysis in genomics. A methodology for identifying important predictors of time-varying outcomes is presented. The methodology is illustrated in a study aimed at finding mutations of the human immunodeficiency virus that are important predictors of how well a patient responds to a drug regimen containing two different antiretroviral drugs.

Automated extraction of information from biological literature promises to play an increasingly important role in text-based knowledge discovery processes. This is particularly important for high-throughput approaches such as microarrays and high-throughput proteomics. Chapter 12 addresses knowledge extraction via text mining and natural language processing.

Finally, we would like to acknowledge the excellent contributions of the authors and Alice McQuillan for her help in proofreading.
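The three resampling strategies named above differ only in how sample indices are divided into training and test sets. The sketch below shows one possible way to generate such splits; the function names, the fixed seeds, and the sample sizes are illustrative assumptions and are not taken from Chapter 8.

```python
import random


def k_fold_splits(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_splits(n_samples):
    """Leave-one-out is the special case with as many folds as samples."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


def repeated_holdout_splits(n_samples, test_fraction, repeats, seed=0):
    """Draw a fresh random train/test partition on every repeat."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


cv = list(k_fold_splits(10, 5))
loo = list(leave_one_out_splits(4))
holdout = list(repeated_holdout_splits(10, 0.3, repeats=3))
```

In cross-validation every sample is used for testing exactly once, whereas in repeated hold-out a sample may appear in several test sets or in none.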

Coleraine, Northern Ireland, and Weingarten, Germany
August 2006

Werner Dubitzky
Martin Granzow
Daniel Berrar




List of Abbreviations and Symbols

The following list shows the symbols or abbreviations for the most commonly occurring quantities/terms in the book. In general, uppercase boldfaced letters such as X refer to matrices. Vectors are denoted by lowercase boldfaced letters, e.g., x, while scalars are denoted by lowercase italic letters, e.g., x.

ACE       Average (test) classification error
ANOVA     Analysis of variance
ARD       Automatic relevance determination
AUC       Area under the curve (in ROC analysis)
BACC      Balanced accuracy (average of sensitivity and specificity)
bp        Base pair
CART      Classification and regression tree
CV        Cross-validation
Da        Daltons
DDWT      Decimated discrete wavelet transform
ESI       Electrospray ionization
EST       Expressed sequence tag
ETA       Experimental treatment assignment
FDR       False discovery rate
FLD       Fisher’s linear discriminant
FN        False negative
FP        False positive
FPR       False positive rate
FWER      Family-wise error rate
GEO       Gene Expression Omnibus
GO        Gene Ontology
ICA       Independent component analysis
IE        Information extraction
IQR       Interquartile range
IR        Information retrieval
LOOCV     Leave-one-out cross-validation
MALDI     Matrix-assisted laser desorption/ionization
MDS       Multidimensional scaling
MeSH      Medical Subject Headings
MM        Mismatch
MS        Mass spectrometry
m/z       Mass-over-charge
NLP       Natural language processing
NPV       Negative predictive value
PCA       Principal component analysis
PCR       Polymerase chain reaction
PLS       Partial least squares
PM        Perfect match
PPV       Positive predictive value
RLE       Relative log expression
RLR       Regularized logistic regression
RMA       Robust multi-chip analysis
S2N       Signal-to-noise
SAGE      Serial analysis of gene expression
SAM       Significance analysis of microarrays
SELDI     Surface-enhanced laser desorption/ionization
SOM       Self-organizing map
SOTA      Self-organizing tree algorithm
SSH       Suppression subtractive hybridization
SVD       Singular value decomposition
SVM       Support vector machine
TIC       Total ion current
TN        True negative
TOF       Time-of-flight
TP        True positive
UDWT      Undecimated discrete wavelet transform
VSN       Variance stabilization normalization

#(·)      Counts; the number of instances satisfying the condition in (·)
x̄         The mean of all elements in x
χ²        Chi-square statistic
ε         Observed error rate
ε_.632    Estimate for the classification error in the .632 bootstrap
ŷ_i       Predicted value for y_i (i.e., predicted class label for case x_i)
¬y        Not y
Σ         Covariance
τ         True error rate
x′        Transpose of vector x
D         Data set
d(x, y)   Distance between x and y
E(X)      Expectation of a random variable X
⟨k⟩       Average of k
L_i       ith learning set
ℝ         Set of real numbers
T_i       ith test set
TR_ij     Training set of the ith external and jth internal loop
V_ij      Validation set of the ith external and jth internal loop
v_i       ith vertex in a network
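Several of the abbreviations above (TP, FP, TN, FN, FPR, PPV, NPV, BACC) are related through simple ratios of the four confusion-matrix counts. The sketch below uses the standard textbook definitions together with the definition of BACC given in the list; the function name and the example counts are hypothetical and serve only as an illustration.

```python
def performance_measures(tp, fp, tn, fn):
    """Derive common performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of actual positives found
    specificity = tn / (tn + fp)  # fraction of actual negatives found
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "FPR": fp / (fp + tn),    # false positive rate = 1 - specificity
        "PPV": tp / (tp + fp),    # positive predictive value
        "NPV": tn / (tn + fn),    # negative predictive value
        "BACC": (sensitivity + specificity) / 2,  # balanced accuracy
    }


# Hypothetical counts for a two-class problem with 50 positives and 100 negatives.
measures = performance_measures(tp=40, fp=10, tn=90, fn=10)
```

Balanced accuracy is often preferred over plain accuracy when the classes are of very different sizes, because it weights sensitivity and specificity equally.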



Contents

1 Introduction to Genomic and Proteomic Data Analysis
Daniel Berrar, Martin Granzow, and Werner Dubitzky
1.1 Introduction
1.2 A Short Overview of Wet Lab Techniques
    1.2.1 Transcriptomics Techniques in a Nutshell
    1.2.2 Proteomics Techniques in a Nutshell
1.3 A Few Words on Terminology
1.4 Study Design
1.5 Data Mining
    1.5.1 Mapping Scientific Questions to Analytical Tasks
    1.5.2 Visual Inspection
    1.5.3 Data Pre-Processing
        1.5.3.1 Handling of Missing Values
        1.5.3.2 Data Transformations
    1.5.4 The Problem of Dimensionality
        1.5.4.1 Mapping to Lower Dimensions
        1.5.4.2 Feature Selection and Significance Analysis
        1.5.4.3 Test Statistics for Discriminatory Features
        1.5.4.4 Multiple Hypotheses Testing
        1.5.4.5 Random Permutation Tests
    1.5.5 Predictive Model Construction
        1.5.5.1 Basic Measures of Performance
        1.5.5.2 Training, Validating, and Testing
        1.5.5.3 Data Resampling Strategies
    1.5.6 Statistical Significance Tests for Comparing Models
1.6 Result Post-Processing
    1.6.1 Statistical Validation
    1.6.2 Epistemological Validation
    1.6.3 Biological Validation
1.7 Conclusions
References




2 Design Principles for Microarray Investigations
Kathleen F. Kerr
2.1 Introduction
2.2 The “Pre-Planning” Stage
    2.2.1 Goal 1: Unsupervised Learning
    2.2.2 Goal 2: Supervised Learning
    2.2.3 Goal 3: Class Comparison
2.3 Statistical Design Principles, Applied to Microarrays
    2.3.1 Replication
    2.3.2 Blocking
    2.3.3 Randomization
2.4 Case Study
2.5 Conclusions
References

3 Pre-Processing DNA Microarray Data
Benjamin M. Bolstad
3.1 Introduction
    3.1.1 Affymetrix GeneChips
    3.1.2 Two-Color Microarrays
3.2 Basic Concepts
    3.2.1 Pre-Processing Affymetrix GeneChip Data
    3.2.2 Pre-Processing Two-Color Microarray Data
3.3 Advantages and Disadvantages
    3.3.1 Affymetrix GeneChip Data
        3.3.1.1 Advantages
        3.3.1.2 Disadvantages
    3.3.2 Two-Color Microarrays
        3.3.2.1 Advantages
        3.3.2.2 Disadvantages
3.4 Caveats and Pitfalls
3.5 Alternatives
    3.5.1 Affymetrix GeneChip Data
    3.5.2 Two-Color Microarrays
3.6 Case Study
    3.6.1 Pre-Processing an Affymetrix GeneChip Data Set
    3.6.2 Pre-Processing a Two-Channel Microarray Data Set
3.7 Lessons Learned
3.8 List of Tools and Resources
3.9 Conclusions
3.10 Mathematical Details
    3.10.1 RMA Background Correction Equation
    3.10.2 Quantile Normalization
    3.10.3 RMA Model
    3.10.4 Quality Assessment Statistics
    3.10.5 Computation of M and A Values for Two-Channel Microarray Data
    3.10.6 Print-Tip Loess Normalization
References

4 Pre-Processing Mass Spectrometry Data
Kevin R. Coombes, Keith A. Baggerly, and Jeffrey S. Morris
4.1 Introduction
4.2 Basic Concepts
4.3 Advantages and Disadvantages
4.4 Caveats and Pitfalls
4.5 Alternatives
4.6 Case Study: Experimental and Simulated Data Sets for Comparing Pre-Processing Methods
4.7 Lessons Learned
4.8 List of Tools and Resources
4.9 Conclusions
References

5 Visualization in Genomics and Proteomics
Xiaochun Li and Jaroslaw Harezlak
5.1 Introduction
5.2 Basic Concepts
    5.2.1 Metric Scaling
    5.2.2 Nonmetric Scaling
5.3 Advantages and Disadvantages
5.4 Caveats and Pitfalls
5.5 Alternatives
5.6 Case Study: MDS on Mass Spectrometry Data
5.7 Lessons Learned
5.8 List of Tools and Resources
5.9 Conclusions
References

6 Clustering – Class Discovery in the Post-Genomic Era
Joaquín Dopazo
6.1 Introduction
6.2 Basic Concepts
    6.2.1 Distance Metrics
    6.2.2 Clustering Methods
        6.2.2.1 Aggregative Hierarchical Clustering
        6.2.2.2 k-Means
        6.2.2.3 Self-Organizing Maps
        6.2.2.4 Self-Organizing Tree Algorithm
        6.2.2.5 Model-Based Clustering
    6.2.3 Biclustering
    6.2.4 Validation Methods
    6.2.5 Functional Annotation
6.3 Advantages and Disadvantages
6.4 Caveats and Pitfalls
    6.4.1 On Distances
    6.4.2 On Clustering Methods
6.5 Alternatives
6.6 Case Study
6.7 Lessons Learned
6.8 List of Tools and Resources
    6.8.1 General Resources
        6.8.1.1 Multiple Purpose Tools (Including Clustering)
    6.8.2 Clustering Tools
    6.8.3 Biclustering Tools
    6.8.4 Time Series
    6.8.5 Public-Domain Statistical Packages and Other Tools
    6.8.6 Functional Analysis Tools
6.9 Conclusions
References

7 Feature Selection and Dimensionality Reduction in Genomics and Proteomics
Milos Hauskrecht, Richard Pelikan, Michal Valko, and James Lyons-Weiler . . . 149
7.1 Introduction . . . 149
7.2 Basic Concepts . . . 151
  7.2.1 Filter Methods . . . 151
    7.2.1.1 Criteria Based on Hypothesis Testing . . . 151
    7.2.1.2 Permutation Tests . . . 152
    7.2.1.3 Choosing Features Based on the Score . . . 153
    7.2.1.4 Feature Set Selection and Controlling False Positives . . . 153
    7.2.1.5 Correlation Filtering . . . 154
  7.2.2 Wrapper Methods . . . 155
  7.2.3 Embedded Methods . . . 155
    7.2.3.1 Regularization/Shrinkage Methods . . . 155
    7.2.3.2 Support Vector Machines . . . 156
  7.2.4 Feature Construction . . . 156
    7.2.4.1 Clustering . . . 156
    7.2.4.2 Clustering Algorithms . . . 158
    7.2.4.3 Probabilistic (Soft) Clustering . . . 158
    7.2.4.4 Clustering Features . . . 158
    7.2.4.5 Principal Component Analysis . . . 159
    7.2.4.6 Discriminative Projections . . . 159
7.3 Advantages and Disadvantages . . . 160
7.4 Case Study: Pancreatic Cancer . . . 161
  7.4.1 Data and Pre-Processing . . . 161
  7.4.2 Filter Methods . . . 162
    7.4.2.1 Basic Filter Methods . . . 162
    7.4.2.2 Controlling False Positive Selections . . . 162
    7.4.2.3 Correlation Filters . . . 164
  7.4.3 Wrapper Methods . . . 165
  7.4.4 Embedded Methods . . . 166
  7.4.5 Feature Construction Methods . . . 167
  7.4.6 Summary of Analysis Results and Recommendations . . . 168
7.5 Conclusions . . . 169
7.6 Mathematical Details . . . 169
References . . . 170

8 Resampling Strategies for Model Assessment and Selection
Richard Simon . . . 173
8.1 Introduction . . . 173
8.2 Basic Concepts . . . 174
  8.2.1 Resubstitution Estimate of Prediction Error . . . 174
  8.2.2 Split-Sample Estimate of Prediction Error . . . 175
8.3 Resampling Methods . . . 176
  8.3.1 Leave-One-Out Cross-Validation . . . 177
  8.3.2 k-fold Cross-Validation . . . 178
  8.3.3 Monte Carlo Cross-Validation . . . 178
  8.3.4 Bootstrap Resampling . . . 179
    8.3.4.1 The .632 Bootstrap . . . 179
    8.3.4.2 The .632+ Bootstrap . . . 180
8.4 Resampling for Model Selection and Optimizing Tuning Parameters . . . 181
  8.4.1 Estimating Statistical Significance of Classification Error Rates . . . 183
  8.4.2 Comparison to Classifiers Based on Standard Prognostic Variables . . . 183
8.5 Comparison of Resampling Strategies . . . 184
8.6 Tools and Resources . . . 184
8.7 Conclusions . . . 185
References . . . 186

9 Classification of Genomic and Proteomic Data Using Support Vector Machines
Peter Johansson and Markus Ringnér . . . 187
9.1 Introduction . . . 187
9.2 Basic Concepts . . . 187
  9.2.1 Support Vector Machines . . . 188
  9.2.2 Feature Selection . . . 190
  9.2.3 Evaluating Predictive Performance . . . 191
9.3 Advantages and Disadvantages . . . 192
  9.3.1 Advantages . . . 192
  9.3.2 Disadvantages . . . 192
9.4 Caveats and Pitfalls . . . 192
9.5 Alternatives . . . 193
9.6 Case Study: Classification of Mass Spectral Serum Profiles Using Support Vector Machines . . . 193
  9.6.1 Data Set . . . 193
  9.6.2 Analysis Strategies . . . 194
    9.6.2.1 Strategy A: SVM without Feature Selection . . . 196
    9.6.2.2 Strategy B: SVM with Feature Selection . . . 196
    9.6.2.3 Strategy C: SVM Optimized Using Test Samples Performance . . . 196
    9.6.2.4 Strategy D: SVM with Feature Selection Using Test Samples . . . 196
  9.6.3 Results . . . 196
9.7 Lessons Learned . . . 197
9.8 List of Tools and Resources . . . 197
9.9 Conclusions . . . 198
9.10 Mathematical Details . . . 198
References . . . 200

10 Networks in Cell Biology
Carlos Rodríguez-Caso and Ricard V. Solé . . . 203
10.1 Introduction . . . 203
  10.1.1 Protein Networks . . . 204
  10.1.2 Metabolic Networks . . . 205
  10.1.3 Transcriptional Regulation Maps . . . 205
  10.1.4 Signal Transduction Pathways . . . 206
10.2 Basic Concepts . . . 206
  10.2.1 Graph Definition . . . 206
  10.2.2 Node Attributes . . . 207
  10.2.3 Graph Attributes . . . 208
10.3 Caveats and Pitfalls . . . 212
10.4 Case Study: Topological Analysis of the Human Transcription Factor Interaction Network . . . 213
10.5 Lessons Learned . . . 218
10.6 List of Tools and Resources . . . 219
10.7 Conclusions . . . 220
10.8 Mathematical Details . . . 220
References . . . 221

11 Identifying Important Explanatory Variables for Time-Varying Outcomes
Oliver Bembom, Maya L. Petersen, and Mark J. van der Laan . . . 227
11.1 Introduction . . . 227
11.2 Basic Concepts . . . 229
11.3 Advantages and Disadvantages . . . 233
  11.3.1 Advantages . . . 233
  11.3.2 Disadvantages . . . 234
11.4 Caveats and Pitfalls . . . 235
11.5 Alternatives . . . 237
11.6 Case Study: HIV Drug Resistance Mutations . . . 239
11.7 Lessons Learned . . . 245
11.8 List of Tools and Resources . . . 246
11.9 Conclusions . . . 247
References . . . 248

12 Text Mining in Genomics and Proteomics
Robert Hoffmann . . . 251
12.1 Introduction . . . 251
  12.1.1 Text Mining . . . 251
  12.1.2 Interactive Literature Exploration . . . 253
12.2 Basic Concepts . . . 253
  12.2.1 Information Retrieval . . . 253
  12.2.2 Entity Recognition . . . 254
  12.2.3 Information Extraction . . . 254
  12.2.4 Biomedical Text Resources . . . 255
  12.2.5 Assessment and Comparison of Text Mining Methods . . . 256
12.3 Caveats and Pitfalls . . . 256
  12.3.1 Entity Recognition . . . 256
  12.3.2 Full Text . . . 257
  12.3.3 Distribution of Information . . . 257
  12.3.4 The Impossible . . . 258
  12.3.5 Overall Performance . . . 258
12.4 Alternatives . . . 259
  12.4.1 Functional Coherence Analysis of Gene Groups . . . 259
  12.4.2 Co-Occurrence Networks . . . 260
  12.4.3 Superimposition of Experimental Data to the Literature Network . . . 260
  12.4.4 Gene Ontologies . . . 261
12.5 Case Study . . . 261
12.6 Lessons Learned . . . 265
12.7 List of Tools and Resources . . . 266
12.8 Conclusion . . . 266
12.9 Mathematical Details . . . 270
References . . . 270

Index . . . 275



List of Contributors

Keith A. Baggerly
Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030

Oliver Bembom
Division of Biostatistics, University of California, Berkeley, CA 94720-7360

Daniel Berrar
Systems Biology Research Group, University of Ulster, Northern Ireland

Benjamin M. Bolstad
Department of Statistics, University of California, Berkeley, CA 94720-3860

Kevin R. Coombes
Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030

Joaquín Dopazo
Department of Bioinformatics, Centro de Investigación Príncipe Felipe, E46013, Valencia

Werner Dubitzky
Systems Biology Research Group, University of Ulster, Northern Ireland

Martin Granzow
quantiom bioinformatics GmbH & Co. KG, Ringstrasse 61, D-76356 Weingarten

Jaroslaw Harezlak
Harvard School of Public Health, Boston, MA 02115

Milos Hauskrecht
Department of Computer Science, and Intelligent Systems Program, and Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260




Robert Hoffmann
Memorial Sloan-Kettering Cancer Center, 1275 York Avenue, New York, NY 10021

Peter Johansson
Computational Biology and Biological Physics Group, Department of Theoretical Physics, Lund University, SE-223 62, Lund

Kathleen F. Kerr
Department of Biostatistics, University of Washington, Seattle, WA 98195

Xiaochun Li
Dana Farber Cancer Institute, Boston, Massachusetts, USA, and Harvard School of Public Health, Boston, MA 02115

James Lyons-Weiler
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260

Jeffrey S. Morris
Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center, Houston, TX 77030

Richard Pelikan
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260

Maya L. Petersen
Division of Biostatistics, University of California, Berkeley, CA 94720-7360

Markus Ringnér
Computational Biology and Biological Physics Group, Department of Theoretical Physics, Lund University, SE-223 62, Lund

Carlos Rodríguez-Caso
ICREA-Complex Systems Lab, Universitat Pompeu Fabra (GRIB), Dr Aiguader 80, 08003 Barcelona

Richard Simon
National Cancer Institute, Rockville, MD 20852

Ricard V. Solé
ICREA-Complex Systems Lab, Universitat Pompeu Fabra (GRIB), Dr Aiguader 80, 08003 Barcelona, Spain, and Santa Fe Institute, 1399 Hyde Park Road, NM 87501

Michal Valko
Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

Mark J. van der Laan
Division of Biostatistics, University of California, Berkeley, CA 94720-7360



Index

χ², see Chi square
2D-DIGE, see two-dimensional difference in-gel electrophoresis
2D-PAGE, see two-dimensional polyacrylamide gel electrophoresis

accuracy, 24
  balanced, 24, 194
adjacency matrix, 206–207
affycomp, 64
Affymetrix, 5, 52–53, 64
Agilent, 53–55
amplicon, 3
ANOVA, see one-way analysis of variance
Arabidopsis, 47
ARD, see automatic relevance determination
area under the curve, 152, 169
ArrayAssist, 62
assortative mixing, 210, 220
assortativeness, 211
AUC, see area under the curve
automatic relevance determination, 156

BACC, see accuracy, balanced
background correction, 55
backward elimination, 17, 155
bagging, see bootstrap aggregation
baseline subtraction, 82
beam search, 155
betweenness
  centrality, 208
  distribution, 210
bias, 22
  experimenter, 14
  selection, 27, 177, 196
bias-variance trade-off, 23
biclustering, 124, 131, 135
Bioconductor, 62, 74, 98, 104, 112, 120
blocking, 43, 47
  complete block design, 43
blotting
  Northern, 4
  Southern, 4
  Western, 5
Bonferroni, 20, 134, 154, 232
bootstrap, 28, 179
  .632 bootstrap, 28, 179
  .632+ bootstrap, 180, 184
  aggregation, 238
BRB-ArrayTools, 184

CAAT, 138
caBig, 98
calibrant, 84, 119
calibration, 81, 82
capacity control, 23
CART, see classification and regression trees
centroid, 193
Chi square, 152, 169
Ciphergen, 84, 93, 161
ClaNC, 197
class comparison, 41, 124
class discovery, 40, 124
class prediction, 124
classification, 10
classification and regression trees, 155, 166
classifier
  linear maximal margin, 189
  non-linear, 188
CLENCH, 134
cluster analysis, 40
clustering, 10, 124, 156, 158
  average, 209
  biclustering, 131
  coefficient, 207
  distribution, 210
  fuzzy, 137
  hierarchical, 46, 126, 128
    average linkage, 129
    complete linkage, 129
    single linkage, 129
  k-means, 126, 129, 158
  model-based, 130
  probabilistic, 158
  self-organizing maps, 130
  self-organizing tree algorithm, 130
  soft, 158
confidence interval, 24
confounding, 40, 47
connectedness, 131
connectivity, see node, degree
correlation
  filtering, 154
  jackknifed correlation coefficient, 127
  Pearson, 117, 126
  profile, 210
  Spearman rank, 127
covariate, 7
cross-hybridization, 53
cross-validation, 27, 191
  10-fold, 184
  5x2CV, 30
  external, 27, 181
  internal, 27, 181
  k-fold, 28, 178
  leave-k-out, 28
  leave-one-out, 28, 177, 184
curse of dimensionality, 2
Cy3, 52
Cy5, 52

D/S/A algorithm, 230
data mining, 2, 8
data re-scaling, 15
data transformation, 14
Daubechies wavelet, 87
DAVID, 134
dChip MBEI, 63
DDWT, see decimated discrete wavelet transform
decimated discrete wavelet transform, 86
deconvolution, 92
dendrogram, 10, 104
detector, 5
differential display, 4
discrete wavelet transform, 86
discriminant analysis, 41
distance metric, 135, 157
  City Block, see distance metric, Manhattan
  correlation, 157
  Cosine, 157
  Euclidean, 105, 126, 157
    standardized, 157
  Hamming, 157
  Jaccard, 157
  Mahalanobis, 137, 157
  Manhattan, 112, 157
  Minkowski, 157
Dunn-like indices, 131
dye-swap, 45, 47

EAM, see energy absorbing matrix
eGOn, 134
eigengene, 16
electrophoresis, 4
electrospray ionization, 5
embedded methods, 150, 155, 160
energy absorbing matrix, 79
entity recognition, 254
ER graph, see graph, Erdős-Rényi
error
  of prediction, 25
  rate, 24
    family-wise, 20
    comparison-wise, 20
    observed, 24–25, 181, 183
    true, 24–25, 181
  resubstitution estimate, 174
  selection, 177
  split-sample estimate, 175
  Type I, 18, 258
  Type II, 19, 258
ESI, see electrospray ionization
EST, see expressed sequence tag
ETA, see experimental treatment assignment
experimental treatment assignment assumption, 235
expressed sequence tag, 5

F-measure, 258
false discovery rate, 20, 97, 134, 154, 233
false positive rate, 20
FatiGO, 134
FatiGOplus, 138
FDR, see false discovery rate
feature, 7, 149
  construction, 151
  selection, 149–169, 190
    recursive feature elimination, 190
filter, 150, 160, 169, 190
fingerprint, 7
Fisher score, 151, 152, 169
Fisher-like score, 18
FLD, see linear discriminant, Fisher
forward selection, 16, 155
FPR, see false positive rate
FWER, see error rate, family-wise

Gaussian mixture, 158
GCRMA, 63
Gene Expression Omnibus, 53, 65, 69
Gene Ontology, 261
GeneChip, 5, 52
GenePix, 55, 60, 69
GeneSifter, 62
GeneSpring, 62
genetic algorithm, 150, 155
genomics, 1
  functional, 2
GEO, see Gene Expression Omnibus
GO, see Gene Ontology
GoMiner, 134
goodness-of-fit, 30, 111
GOStat, 134
Gosurfer, 134
GOTM, 134
GOToolBox, 134
graph, 203, 206
  directed, 205
  Erdős-Rényi, 204, 207
  k-scaffold, 211
  random, 209
  random modular graph, 204
  scale-free, 204, 209
  undirected, 206
  weighted, 206
Graphviz, 219

heatmap, 104
Hidden Markov model, 125
high-throughput, 2
hill climbing, 150, 155
hyperplane, 200
  maximal margin, 188
hyperplanes, 188

ICA, see independent component analysis
IE, see information extraction
iHOP, 252, 254, 260
Imagene, 55
Incogen, 99
independent component analysis, 16, 159
inference, 43
information
  extraction, 254, 268
  retrieval, 253, 267
interquartile range, 59
inverse-probability-of-treatment-weighted transformation, 231
ion source, 5
IPTW, see inverse-probability-of-treatment-weighted transformation
IQR, see interquartile range
IR, see information retrieval
iterative signature algorithm, 131

J5-score, 152, 169
jackknife, 178

k-nearest neighbor, 26
k-NN, see k-nearest neighbor
Karhunen-Loève transform, see principal component analysis
kernel function, 189, 200

L1-metric, see distance metric, Manhattan
L2-metric, see distance metric, Euclidean
Lagrangian, 199
latent vectors, 16
learning
  supervised, 40–41
  unsupervised, 9, 40
learning by rote, 24
LIBSVM, 197
lift, 24
linear discriminant
  Fisher, 159
loess, 60, 245
  print-tip, 61
LOOCV, 28, see cross-validation, leave-one-out

m/z, see mass-to-charge ratio
MA-plot, 60–61, 76
MAC, see maximum allowed absolute correlation
MALDI, see matrix-assisted laser desorption/ionization
margin, 198
Markov blanket filtering, 17
MAS 5.0, 64
mass analyzer, 5
mass spectrometry, 5, 79–99, 180
mass-to-charge ratio, 5, 80
MATLAB, 98
matrix-assisted laser desorption and ionization, 79
matrix-assisted laser desorption/ionization, 5
maximum allowed absolute correlation, 164
MDS, see multidimensional scaling
Medical Subject Headings, 252
Medline, 255
MeSH, see Medical Subject Headings
microarray, 5, 39, 51–76
  cDNA, 53, 55
  single-channel, 52
  spotted cDNA array, 5
  two-channel, 52, 59
microarray sample pool, 64
mismatch, 54
missing value handling, 13
MM, see mismatch
model, 6
  assessment, 173, 191
  construction, 182
  selection, 173, 181, 182
modifications
  posttranslational, 1, 5
modularity, 203
Monte Carlo cross-validation, see sampling, repeated random subsampling
Monte Carlo permutation, 21
MS, see mass spectrometry
MSP, see microarray sample pool
MUDWT, 96, 97
multi-array probe-level model, 57
multidimensional scaling, 12, 105, 159
multiple hypotheses testing, 19
mutual information, 152, 169
mzXML, 99

natural language processing, 255
negative predictive value, 24
neighborhood divergence, 259
network, 203
  cellular, 204
  hierarchical, 210
  ii, 206
  motif, 211
NLP, see natural language processing
No Free Lunch theorem, 10
node, 207
  average degree, 208
  degree, 207
  indegree, 207
  outdegree, 207
normalization, 51, 56
  between-slide, 61
  loess, 61
  of mass spectra, 82
  print-tip loess, 61, 76
  quantile, 57, 61, 63, 75
  variance stabilization, 64
  within-slide, 61
NPV, see negative predictive value
nuisance parameter, 229
NUSE, see standard error, normalized unscaled

Occam’s razor, 10
one-versus-all, 18
one-way analysis of variance, 18
Onto-Express, 134
overfitting, 23–25, 150

Pajek, 219
partial least squares, 16, 159
path, 209
  average length, 209
PCA, see principal component analysis
PCR, see polymerase chain reaction
peak
  detection, 82
  matching, 82
  quantification, 82
peptide/protein chips, 6
perceptual mapping, see multidimensional scaling
perfect match, 53
phage display, 6
phase
  application, 22
  learning, 22
  test, 22
  training, 22
  validation, 22
PLIER, 64
PLM, see multi-array probe-level model
PLS, 16, see partial least squares
PM, see perfect match
polymerase chain reaction, 3
polysemy, 256
population, 42
positive predictive value, 24
PPV, see positive predictive value
pre-processing, 13
precision, 258
predictor, 7
prevalence, 24, 97
principal component analysis, 12, 15, 108, 159
probe, 6
probe set, 53
  spike-in probe set, 58
probes, 53
PROcess, 98, 114
profile, 7
  array, 7
  gene expression, 7
  protein expression, 7
projection pursuit, 113
ProteinChip, 89, 93, 114
proteomics, 1
PubMed, 252

qRT-PCR, 3, see quantitative real-time reverse transcriptase PCR
QT-Clust, 136
quantitative real-time reverse transcriptase PCR, 3

randomization, 46
recall, 258
receiver operating characteristic, 169
reference design, 45
reference RNA, 45
regression, 10
  least angle, 121
  regularized logistic, 166
regularization, 155
relative log expression, 59, 76
replicate
  biological, 43
  technical, 43
replication, 42
reverse transcriptase, 3
ribonuclease, 3
ribonuclease protection assay, 3
RLE, see relative log expression
RLR, see regression, regularized logistic
RMA, see robust multi-chip analysis
robust multi-chip analysis, 56–57, 62, 74
ROC, see receiver operating characteristic
RPA, see ribonuclease protection assay
RProteomics, 98

S+ArrayAnalyzer, 62
S2N, see signal-to-noise
SAGE, see serial analysis of gene expression
SAM, see significance analysis of microarrays
SAM scoring criterion, 152, 169
SAMBA, 131
Sammon mapping, 111
sample, 1, 42
sampling, 173–185, 196
  bootstrapping, see bootstrap
  k-fold random subsampling, 28
  random subsampling, 27
  repeated random subsampling, 178
  single hold-out method, 27
  split-sample, 175
  two-fold nested resampling, 181
Savitzky-Golay, 90
scale-freeness, 203
scaling
  metric, 107
  nonmetric, 109
ScanAlyze, 55
segmentation, 55
SELDI-TOF, see surface-enhanced laser desorption/ionization time-of-flight
self-organizing maps, 113, 126, 130
self-organizing tree algorithm, 126, 128, 130
sensitivity, 24, 97
serial analysis of gene expression, 4
set
  learning, 25, 176
  test, 25, 176
  training, 191
  validation, 25, 181, 191
shrunken centroid classifier, 181
signal-to-noise, 18, 59, 82, 90, 190
significance analysis of microarrays, 154
silhouette coefficient, 131
simulated annealing, 150, 155
singular value decomposition, 15
SiZer plot, 88
small-n-large-p problem, 2
small-world pattern, 203, 209
SOMs, see self-organizing maps
SOTA, see self-organizing tree algorithm
specificity, 24
spectrum, 7
Spot, 55
SSH, see suppression subtractive hybridization
standard error
  normalized unscaled, 58, 75
stress
  function, 106
  squared, 109
  weighted, 108
study
  experimental, 40
  observational, 40
subtractive hybridization, 4
SUDWT, 96–97
summarization, 53
support vector machine, 11, 156, 187–200
suppression subtractive hybridization, 4
surface-enhanced laser desorption/ionization time-of-flight, 6, 79, 104, 161, 194
SVD, see singular value decomposition
SVM, see support vector machine
SVMLight, 197
SW pattern, see small-world pattern
synonymy, 256

t-statistic, 17, 169
tag, 4
target, 6
test
  Anderson-Darling, 11
  ANOVA, 18
  Bartlett, 17
  Benjamini and Hochberg, 21, 233
  Brown and Forsythe, 19
  Cochran, 19
  Duncan, 19
  Dunnett, 19
  F-test, 18
  Hochberg, 21
  Holm, 20, 134
  Kruskal-Wallis, 19
  Levene, 17
  McNemar, 30
  post-hoc, 19
  random permutation, 21, 152, 153, 183
  Storey and Tibshirani, 21
  Student, 19
  t-test, 17, 152, 169
  Tukey, 19
  variance-corrected resampled, 30–31
  Welch, 19
  Wilcoxon rank-sum, 152
testing, 25, 182
text mining, 32, 251–270
  full text mining, 257
TIC, see total ion current
time resolution, 79
time series analysis, 227–247
time-of-flight, 5, 79
TOF, see time-of-flight
topological overlap analysis, 211
total ion current, 89
training, 25, 182
transcriptomics, 2
truly alternative, 20
truly null, 20
two-dimensional difference in-gel electrophoresis, 5
two-dimensional polyacrylamide gel electrophoresis, 5

UCSF Spot, 55
UDWT, see undecimated discrete wavelet transform
undecimated discrete wavelet transform, 83

validating, 25
validation, 182
variance, 22
VSN, see normalization, variance stabilization

Welch-Satterthwaite, 17
Wolfe dual, 199
wrapper, 150, 155, 160, 190

yeast two-hybrid, 6

z-score transformation, 15


