

Series on Advances in Bioinformatics and Computational Biology - Volume 4

Life Sciences Society

COMPUTATIONAL SYSTEMS BIOINFORMATICS

CSB2006 CONFERENCE PROCEEDINGS

Stanford CA, 14-18 August 2006

Editors

Peter Markstein Ying Xu


Imperial College Press


Life Sciences Society

COMPUTATIONAL SYSTEMS BIOINFORMATICS


SERIES ON ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY

ISSN: 1751-6404

Series Editors: Ying XU (University of Georgia, USA), Limsoon WONG (National University of Singapore, Singapore)

Associate Editors:

Ruth Nussinov (NCI, USA)
See-Kiong Ng (Inst for Infocomm Res, Singapore)
Rolf Apweiler (EBI, UK)
Kenta Nakai (Univ of Tokyo, Japan)
Ed Wingender (BioBase, Germany)
Mark Ragan (Univ of Queensland, Australia)

Published

Vol. 1: Proceedings of the 3rd Asia-Pacific Bioinformatics Conference Eds: Yi-Ping Phoebe Chen and Limsoon Wong

Vol. 2: Information Processing and Living Systems Eds: Vladimir B. Bajic and Tan Tin Wee

Vol. 3: Proceedings of the 4th Asia-Pacific Bioinformatics Conference Eds: Tao Jiang, Ueng-Cheng Yang, Yi-Ping Phoebe Chen and Limsoon Wong

Vol. 4: Computational Systems Bioinformatics Eds: Peter Markstein and Ying Xu



Published by

Imperial College Press
57 Shelton Street
Covent Garden
London WC2H 9HE

Distributed by

World Scientific Publishing Co. Pte. Ltd.

5 Toh Tuck Link, Singapore 596224

USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601

UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Series on Advances in Bioinformatics and Computational Biology — Vol. 4
COMPUTATIONAL SYSTEMS BIOINFORMATICS
Proceedings of the Conference CSB 2006

Copyright © 2006 by Imperial College Press

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN 1-86094-700-X

Printed in Singapore by B & JO Enterprise


Life Sciences Society

THANK YOU LSS Corporate Members and CSB2006 Platinum Sponsors!

The Life Sciences Society, LSS Directors, together with the CSB2006 program committee and conference organizers are extremely grateful to the

Hewlett-Packard Company

and

Microsoft Research

for their LSS Corporate Membership and for their Platinum Sponsorship of the

Fifth Annual Computational Systems Bioinformatics Conference, CSB2006 at Stanford University, California, August 14-18, 2006



COMMITTEES

Steering Committee

Phil Bourne - University of California, San Diego
Eric Davidson - California Institute of Technology
Steven Salzberg - The Institute for Genomic Research
John Wooley - University of California, San Diego, San Diego Supercomputer Center

Organizing Committee

Russ Altman - Stanford University, Faculty Sponsor (CSB2005)
Serafim Batzoglou - Stanford University, Faculty Sponsor (CSB2002-CSB2004)
Pat Blauvelt - Communications
Ed Buckingham - Local Arrangements Chair
Kass Goldfein - Finance Consultant
Karen Hauge - Local Arrangements - Food
VK Holtzendorf - Sponsorship
Robert Lashley - Sun Microsystems Inc, Co-Chair
Steve Madden - Agilent Technologies
Alexia Marcous - CEI Systems Inc, Sponsorship
Vicky Markstein - Life Sciences Society, Co-Chair, LSS President
Yogi Patel - Stanford University, Communications
Gene Ren - Finance Chair
Jean Tsukamoto - Graphics Design
Bill Wang - Sun Microsystems Inc, Registration Chair
Peggy Yao - Stanford University, Sponsorship
Dan Zuras - Group 70, Recorder

Program Committee

Tatsuya Akutsu - Kyoto University
Vineet Bafna - University of California, San Diego
Serafim Batzoglou - Stanford University
Chris Bystroff - Rensselaer Polytechnic Institute
Jake Chen - Indiana University
Amar Das - Stanford University
David Dixon - University of Alabama
Terry Gaasterland - University of California, San Diego
Robert Giegerich - Universität Bielefeld
Eran Halperin - University of California, Berkeley
Wolfgang R. Hess - University of Freiburg


Ivo Hofacker - University of Vienna
Wen-Lian Hsu - Academia Sinica
Daniel Huson - Tübingen University
Tao Jiang - University of California, Riverside
Sun-Yuan Kung - Princeton University
Dong Yup Lee - Singapore
Cheng Li - Harvard School of Public Health
Jie Liang - University of Illinois at Chicago
Ann Loraine - University of Alabama
Bin Ma - University of Western Ontario
Peter Markstein - Hewlett-Packard Co., Co-chair
Satoru Miyano - University of Tokyo
Sean Mooney - Indiana University
Ruth Nussinov - National Cancer Institute
Mihai Pop - University of Maryland
Isidore Rigoutsos - IBM TJ Watson Research Center
Marie-France Sagot - Université Claude Bernard
Mona Singh - Princeton University
Victor Solovyev - Royal Holloway, University of London
Chao Tang - University of California at San Francisco
Olga Troyanskaya - Princeton University
Limsoon Wong - Institute for Infocomm Research
Ying Xu - University of Georgia, Co-chair

Assistants to the Program Co-Chairs

Misty Hice - Hewlett-Packard Labs
Ann Terka - University of Georgia
Joan Yantko - University of Georgia

Poster Committee

Dick Carter - Hewlett-Packard Labs
Robert Marinelli - Stanford University
Nigam Shah - Stanford University, Chair
Kathleen Sullivan - Five Prime Therapeutics, Inc

Tutorial Committee

Carol Cain - Agency for Healthcare Research and Quality, US Department of Health and Human Services
Betty Cheng - Stanford University Biomedical Informatics Training Program, Chair
Al Shpuntoff

Workshop Committee

Will Bridewell - Stanford University, Chair

Demonstrations Committee

AJ Chen - Stanford University, Chair
Rong Chen - Stanford University


Referees

Larisa Adamian, Tatsuya Akutsu, Doi Atsushi
Vineet Bafna, Purushotham Bangalore, Serafim Batzoglou, Sebastian Boecker, Chris Bystroff
Jake Chen, Shihyen Chen, Zhong Chen
Amar Das, Eugene Davydov, Tobias Dezulian, David A. Dixon, Chuong B. Do
Kelsey Forsythe, Ana Teresa Freitas
Terry Gaasterland, Irene Gabashvili, Robert Giegerich, Samuel S. Gross, Juntao Guo
Eran Halperin, Wolfgang Hess, Ivo Hofacker, Daniel Huson, Wen-Lian Hsu
Seiya Imoto
Tao Jiang
Uri Keich, Gad Kimmel, Bonnie Kirkpatrick, S. Y. Kung
Vincent Lacroix, Dong Yup Lee, Xin Lei, Cheng Li, Guojun Li, Xiang Li, Jie Liang, Huiqing Liu, Jingyuan Liu, Nianjun Liu, Ann Loraine
Bin Ma, Man-Wai Mak, Fenglou Mao, Peter Markstein, Alice C. McHardy, Satoru Miyano, Sean Mooney
Jose Carlos Nacher, Rei-ichiro Nakamichi, Brian Naughton, Kay Nieselt, Ruth Nussinov
Victor Olman
Daniel Platt, Mihai Pop
Vibin Ramakrishnan, Isidore Rigoutsos
Marie-France Sagot, Nigam Shah, Baozhen Shan, Daniel Shriner, Mona Singh, Sagi Snir, Victor Solovyev, Andreas Sundquist, Ting-Yi Sung
Chao Tang, Eric Tannier, Olga Troyanskaya, Aristotelis Tsirigos
Adelinde Uhrmacher
Raj Vadigepalli, Gabriel Valiente
Limsoon Wong, Hongwei Wu
Lei Xin, Ying Xu
Rui Yamaguchi, Will York, Hiroshi Yoshida, Ryo Yoshida
Noah Zaitlen, Stanislav O. Zakharkin


PREFACE

The Life Sciences Society, LSS, was launched at the CSB2005 conference. Its goal is to bring together the power of computer science, the engineering capability to design complex automated instruments, and the weight of centuries of accumulated knowledge from the biosciences.

LSS directors, organizing committee and members have dedicated time and talent to make CSB2006 one of the premier life sciences conferences in the world. Besides the huge volunteer effort for CSB, it is important that this conference be properly financed. LSS and CSB are thankful for the continuous and generous support from Hewlett-Packard and from Microsoft Research. We also want to thank the CSB2006 authors who have trusted us with the results of their research. In return, LSS has arranged to have the CSB2006 Proceedings distributed to libraries as a volume in the "Advances in Bioinformatics and Computational Biology" book series from Imperial College Press. CSB proceedings are indexed in Medline.

A very big thank you to John Wooley, CSB steering committee member par excellence, who was there to help whenever needed. The general conference Co-Chair for CSB2006, Robert Lashley, has done a phenomenal job in his first year with LSS. Ed Buckingham, as Local Arrangements Chair, continues to provide, for the fourth consecutive year, outstanding professional leadership for CSB. Once again the Program Committee, co-chaired by Peter Markstein and Ying Xu, has orchestrated a stellar selection of thirty-eight bioinformatics papers for the plenary sessions and for publication in the Proceedings. The selection of the best posters was done under the supervision of Nigam Shah, Poster Chair. Selection of the ten tutorial classes was conducted by Betty Cheng, Tutorial Chair, and of the seven workshops by Will Bridewell, Workshop Chair. Ann Loraine's work with PubMed has been instrumental in getting CSB proceedings indexed in Medline. Kirindi Choi is again Chair of Volunteers. Pat Blauvelt is LSS Membership Chair, Bill Wang is Registration Chair, and Gene Ren is Finance Chair. Together with the above committee members, all CSB committee members deserve a special thank you. This has been an incredibly dedicated CSB organizing committee!

If you believe that Sharing Matters, you are invited to join our drive for successful knowledge transfer and persuade a colleague to join LSS.

Thank you for participating in CSB2006.

Vicky Markstein President, Life Sciences Society


CONTENTS

Committees vii

Referees ix

Preface xi

Keynote Addresses

Exploring the Ocean's Microbes: Sequencing the Seven Seas 1
Marvin E. Frazier et al.

Don't Know Much About Philosophy: The Confusion Over Bio-Ontologies 3
Mark A. Musen

Invited Talks

Biomedical Informatics Research Network (BIRN): Building a National Collaboratory for BioMedical and Brain Research 5

Mark H. Ellisman

Protein Network Comparative Genomics 7
Trey Ideker

Systems Biology in Two Dimensions: Understanding and Engineering Membranes as Dynamical Systems 9

Erik Jakobsson

Bioinformatics at Microsoft Research 11
Simon Mercer

Movie Crunching in Biological Dynamic Imaging 13
Jean-Christophe Olivo-Marin

Engineering Nucleic Acid-Based Molecular Sensors for Probing and Programming Cellular Systems 15
Christina D. Smolke

Reactome: A Knowledgebase of Biological Pathways 17
Lincoln Stein et al.


Structural Bioinformatics

Effective Optimization Algorithms for Fragment-Assembly based Protein Structure Prediction 19
Kevin W. DeRonne and George Karypis

Transmembrane Helix and Topology Prediction Using Hierarchical SVM Classifiers and an Alternating Geometric Scoring Function 31

Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung and Wen-Lian Hsu

Protein Fold Recognition Using the Gradient Boost Algorithm 43
Feng Mao, Jinbo Xu, Libo Yu and Dale Schuurmans

A Graph-Based Automated NMR Backbone Resonance Sequential Assignment 55
Xiang Wan and Guohui Lin

A Data-Driven, Systematic Search Algorithm for Structure Determination of Denatured or Disordered Proteins 67

Lincong Wang and Bruce Randall Donald

Multiple Structure Alignment by Optimal RMSD Implies that the Average Structure is a Consensus 79
Xueyi Wang and Jack Snoeyink

Identification of α-Helices from Low Resolution Protein Density Maps 89
Alessandro Dal Palù, Enrico Pontelli, Jing He and Yonggang Lu

Efficient Annotation of Non-Coding RNA Structures Including Pseudoknots via Automated Filters 99
Chunmei Liu, Yinglei Song, Ping Hu, Russell L. Malmberg and Liming Cai

Thermodynamic Matchers: Strengthening the Significance of RNA Folding Energies 111
Thomas Höchsmann, Matthias Höchsmann and Robert Giegerich

Microarray Data Analysis and Applications

PEM: A General Statistical Approach for Identifying Differentially Expressed Genes in Time-Course cDNA Microarray Experiment without Replicate 123

Xu Han, Wing-Kin Sung and Lin Feng

Efficient Generalized Matrix Approximations for Biomarker Discovery and Visualization in Gene Expression Data 133

Wenyuan Li, Yanxiong Peng, Hung-Chung Huang and Ying Liu

Computational Genomics and Genetics

Efficient Computation of Minimum Recombination with Genotypes (not Haplotypes) 145
Yufeng Wu and Dan Gusfield

Sorting Genomes by Translocations and Deletions 157
Xingqin Qi, Guojun Li, Shuguang Li and Ying Xu


Turning Repeats to Advantage: Scaffolding Genomic Contigs Using LTR Retrotransposons 167
Ananth Kalyanaraman, Srinivas Aluru and Patrick S. Schnable

Whole Genome Composition Distance for HIV-1 Genotyping 179
Xiaomeng Wu, Randy Goebel, Xiu-Feng Wan and Guohui Lin

Efficient Recursive Linking Algorithm for Computing the Likelihood of an Order of a Large Number of Genetic Markers 191

S. Tewari, S. M. Bhandarkar and J. Arnold

Optimal Imperfect Phylogeny Reconstruction and Haplotyping (IPPH) 199
Srinath Sridhar, Guy E. Blelloch, R. Ravi and Russell Schwartz

Toward an Algebraic Understanding of Haplotype Inference by Pure Parsimony 211
Daniel G. Brown and Ian M. Harrower

Global Correlation Analysis Between Redundant Probe Sets Using a Large Collection of Arabidopsis ATH1 Expression Profiling Data 223

Xiangqin Cui and Ann Loraine

Motif Sequence Identification

Distance-Based Identification of Structure Motifs in Proteins Using Constrained Frequent Subgraph Mining 227
Jun Huan, Deepak Bandyopadhyay, Jan Prins, Jack Snoeyink, Alexander Tropsha and Wei Wang

An Improved Gibbs Sampling Method for Motif Discovery via Sequence Weighting 239
Xin Chen and Tao Jiang

Detection of Cleavage Sites for HIV-1 Protease in Native Proteins 249
Liwen You

A Methodology for Motif Discovery Employing Iterated Cluster Re-Assignment 257
Osman Abul, Finn Drabløs and Geir Kjetil Sandve

Biological Pathways and Systems

Identifying Biological Pathways via Phase Decomposition and Profile Extraction 269
Yi Zhang and Zhidong Deng

Expectation-Maximization Algorithms for Fuzzy Assignment of Genes to Cellular Pathways 281
Liviu Popescu and Golan Yona

Classification of Drosophila Embryonic Developmental Stage Range Based on Gene Expression Pattern Images 293

Jieping Ye, Jianhui Chen, Qi Li and Sudhir Kumar

Evolution versus "Intelligent Design": Comparing the Topology of Protein-Protein Interaction Networks to the Internet 299

Qiaofeng Yang, Georgos Siganos, Michalis Faloutsos and Stefano Lonardi


Protein Functions and Computational Proteomics

Cavity-Aware Motifs Reduce False Positives in Protein Function Prediction
Brian Y. Chen, Drew H. Bryant, Viacheslav Y. Fofanov, David M. Kristensen, Amanda E. Cruess, Marek Kimmel, Olivier Lichtarge and Lydia E. Kavraki

Protein Subcellular Localization Prediction Based on Compartment-Specific Biological Features
Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung and Wen-Lian Hsu

Predicting the Binding Affinity of MHC Class II Peptides
Fatih Altiparmak, Altuna Akalin and Hakan Ferhatosmanoglu

Codon-Based Detection of Positive Selection Can be Biased by Heterogeneous Distribution of Polar Amino Acids Along Protein Sequences

Xuhua Xia and Sudhir Kumar

Bayesian Data Integration: A Functional Perspective
Curtis Huttenhower and Olga G. Troyanskaya

An Iterative Algorithm to Quantify the Factors Influencing Peptide Fragmentation for MS/MS Spectra
Chungong Yu, Yu Lin, Shiwei Sun, Jinjin Cai, Jingfen Zhang, Zhuo Zhang, Runsheng Chen and Dongbo Bu

Complexity and Scoring Function of MS/MS Peptide De Novo Sequencing
Changjiang Xu and Bin Ma

Biomedical Applications

Expectation-Maximization Method for Reconstructing Tumor Phylogenies from Single-Cell Data
Gregory Pennington, Charles A. Smith, Stanley Shackney and Russell Schwartz

Simulating In Vitro Epithelial Morphogenesis in Multiple Environments
Mark R. Grant, Sean H. J. Kim and C. Anthony Hunt

A Combined Data Mining Approach for Infrequent Events: Analyzing HIV Mutation Changes Based on Treatment History

Ray S. Lin, Soo-Yon Rhee, Robert W. Shafer and Amar K. Das

A Systems Biology Case Study of Ovarian Cancer Drug Resistance
Jake Y. Chen, Changyu Shen, Zhong Yan, Dawn P. G. Brown and Mu Wang

Author Index


EXPLORING THE OCEAN'S MICROBES: SEQUENCING THE SEVEN SEAS

Marvin E. Frazier¹, Douglas B. Rusch¹, Aaron L. Halpern¹, Karla B. Heidelberg¹, Granger Sutton¹, Shannon Williamson¹, Shibu Yooseph¹, Dongying Wu², Jonathan A. Eisen², Jeff Hoffman¹, Charles H. Howard¹, Cyrus Foote¹, Brooke A. Dill¹, Karin Remington¹, Karen Beeson¹, Bao Tran¹, Hamilton Smith¹, Holly Baden-Tillson¹, Clare Stewart¹, Joyce Thorpe¹, Jason Freemen¹, Cindy Pfannkoch¹, Joseph E. Venter¹, John Heidelberg², Terry Utterback¹, Yu-Hui Rogers¹, Shaojie Zhang³, Vineet Bafna³, Luisa Falcón⁴, Valeria Souza⁴, German Bonilla⁴, Luis E. Eguiarte⁴, David M. Karl⁵, Ken Nealson⁶, Shubha Sathyendranath⁷, Trevor Platt⁷, Eldredge Bermingham⁸, Victor Gallardo⁹, Giselle Tamayo¹⁰, Robert Friedman¹, Robert Strausberg¹, J. Craig Venter¹

¹ J. Craig Venter Institute, Rockville, Maryland, United States of America
² The Institute for Genomic Research, Rockville, Maryland, United States of America
³ Department of Computer Science, University of California San Diego
⁴ Instituto de Ecología, Dept. Ecología Evolutiva, National Autonomous University of Mexico, Mexico City, 04510 Distrito Federal, Mexico
⁵ University of Hawaii, Honolulu, United States of America
⁶ Dept. of Earth Sciences, University of Southern California, Los Angeles, California, United States of America
⁷ Dalhousie University, Halifax, Nova Scotia, Canada
⁸ Smithsonian Tropical Research Institute, Balboa, Ancón, Republic of Panama
⁹ University of Concepción, Concepción, Chile
¹⁰ University of Costa Rica, San Pedro, San José, Republic of Costa Rica

The J. Craig Venter Institute's (JCVI) environmental genomics group has collected ocean and soil samples from around the world. We have begun shotgun sequencing of microbial samples from more than 100 open-ocean and coastal sites across the Pacific, Indian and Atlantic Oceans. These data are being augmented with deep sequencing of 16S and 18S rRNA and the draft sequencing of ~150 cultured marine microbial species. The JCVI is also developing and refining bioinformatics tools to assemble, annotate, and analyze large-scale metagenomic data, along with the appropriate database infrastructure to enable directed analyses. The goals of this Global Ocean Survey are to better understand microbial biodiversity; to discover new genes of ecological importance, including those involved in carbon cycling; to discover new genes that may be useful for biological energy production; and to establish a freely shared, global environmental genomics database that can be used by scientists around the world.

Using newly developed metagenomic methods, we are able to examine not only the community of microorganisms, but the community of genes that enable them to capture energy from the sun, remove carbon dioxide from the air, take up organic carbon, and cycle nitrogen in its various forms through the ecosystem. To date, we have discovered many thousands of new microbial species and millions of new genes, with no apparent slowing of the rate of discovery. These data will be of great value for the study of protein function and protein evolution. The goal of this new science, however, is not merely to catalog sequences, genes, gene families, and species for their own sake. We are attempting to use these new data to better understand the functioning of natural ecosystems. Environmental metagenomics examines the interplay of perhaps thousands of species present and functioning at a point in space and time. Each individual sequence is no longer just a piece of a genome; it is a piece of an entire biological community. This is a resource that can be mined by microbial ecologists worldwide to better understand biogeochemical cycling. Moreover, within this data set is a huge diversity of previously unknown, energy-related genes that may be useful for developing new methods of biological energy production.

We acknowledge the DOE, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig Venter Science Foundation for funding to undertake this study. We are also indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the Governments of Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and French Polynesia/France for facilitating sampling activities. All sequencing data collected from waters of the above named countries remain part of the genetic patrimony of the country from which they were obtained.

Canada's Bedford Institute of Oceanography provided a vessel and logistical support for sampling in Bedford Basin. The Universidad Nacional Autónoma de México (UNAM) facilitated permitting and logistical arrangements and identified a team of scientists for collaboration. The scientists and staff of the Smithsonian Tropical Research Institute (STRI) hosted our visit in Panama. Representatives from Costa Rica's Organization for Tropical Studies (Jorge Arturo Jimenez and Francisco Campos Rivera), the University of Costa Rica (Jorge Cortes) and the National Biodiversity Institute (INBio) provided assistance with planning, logistical arrangements and scientific analysis. Our visit to the Galapagos Islands was facilitated by assistance from the Galapagos National Park Service Director, Washington Tapia, and the Charles Darwin Research Institute, especially Howard Snell and Eva Danulat. We especially thank Greg Estes (guide), Hector Chauz Campo (Institute of Oceanography of the Ecuador Navy) and a National Park Representative, Simon Ricardo Villemar Tigrero, for field assistance while in the Galapagos Islands. Martin Wikelski (Princeton) and Rod Mackie (University of Illinois) provided advice on target regions to sample in the Galapagos. We thank Matthew Charette (Woods Hole Oceanographic Institution) and Dave Karl (University of Hawaii) for nutrient analysis work and advice. We also acknowledge the help of Michael Ferrari and Jennifer Clark in acquiring the satellite images. The U.S. Department of State facilitated Governmental communications on multiple occasions. John Glass (JCVI) provided valuable assistance in methods development. Tyler Osgood (JCVI) facilitated many of the vessel-related technical needs. We gratefully acknowledge Dr. Michael Sauri, who oversaw medical issues for the crew of the Sorcerer II. Finally, special thanks also to the captain and crew of the S/V Sorcerer II.


DON'T KNOW MUCH ABOUT PHILOSOPHY: THE CONFUSION OVER BIO-ONTOLOGIES

Mark A. Musen, M.D., Ph.D.

The National Center for Biomedical Ontology
Stanford University

251 Campus Drive, X-215 Stanford, CA 94305 USA

Abstract:

For the past decade, there has been increasing interest in ontologies in the biomedical community. As interest has peaked, so has the confusion. The confusion stems from the multiple knowledge-representation languages used to encode ontologies (e.g., frame-based systems, Semantic Web standards such as RDF(S) and OWL, and languages created specifically by the bioinformatics community, such as OBO), where each language has explicit strengths and weaknesses. Biomedical scientists use ontologies for multiple purposes, from annotation of experimental data, to natural-language processing, to data integration, to construction of decision-support systems. Each of these purposes imposes different requirements concerning which entities ontologies should encode and how those entities should be encoded. Although the biomedical informatics community remains excited about ontologies, exactly what an ontology is and how it should be represented within a computer are points about which, with considerable questioning, we can see little uniformity of opinion. The confusion will persist until we understand that different developers have very different requirements for ontologies, and therefore will make very different assumptions about how ontologies should be created and structured. We will review those assumptions and the corresponding implications for ontology construction.

Our National Center for Biomedical Ontology (http://bioontology.org) is one of the seven national centers for biomedical computing formed under the NIH Roadmap. The Center takes a broad perspective on what ontologies are and how they should be developed and put to use. Our goal, simply put, is to help to eliminate much of the current confusion. The Center recognizes the importance of ontologies for use in a wide range of biomedical applications, and is developing new technology to make all relevant ontologies widely accessible, searchable, alignable, and usable within software systems. Ultimately, the Center will support the publication of biomedical ontologies online, much as we publish scientific knowledge in print media. The advent of biomedical knowledge that is widely available in machine-processable form will alter the way that we think about science and perform scientific experiments. The biomedical community soon will enter an era in which scientific knowledge will become more accessible, more usable, and more precise, and in which new methods will be needed to support a radically different kind of scientific publishing.


BIOMEDICAL INFORMATICS RESEARCH NETWORK (BIRN): BUILDING A NATIONAL COLLABORATORY FOR BIOMEDICAL AND BRAIN RESEARCH

Mark H. Ellisman, Ph.D., Professor

UCSD Department of Neurosciences and Director of the BIRN Coordinating Center (www.nbirn.net)
The Center for Research on Biological Systems (CRBS) at UCSD

The Biomedical Informatics Research Network (BIRN) is an initiative within the National Institutes of Health (US) that fosters large-scale collaborations in biomedical science by utilizing the capabilities of the emerging national cyberinfrastructure (high-speed networks, distributed high-performance computing and the necessary software and data integration capabilities). Currently, the BIRN involves a consortium of 20 universities and 30 research groups participating in three test bed projects centered around brain imaging of human neuropsychiatric disease and associated animal models. These groups are working on large scale, cross-institutional imaging studies on Alzheimer's disease, depression, and schizophrenia using structural and functional magnetic resonance imaging (MRI). Others are studying animal models relevant to multiple sclerosis, attention deficit disorder, and Parkinson's disease through MRI, whole brain histology, and high-resolution light and electron microscopy. These test bed projects present practical and immediate requirements for performing large-scale bioinformatics studies and provide a multitude of use cases for distributed computation and the handling of heterogeneous data. The promise of the BIRN is the ability to test new hypotheses through the analysis of larger patient populations and unique multi-resolution views of animal models through data sharing and the integration of site independent resources for collaborative data refinement.

The BIRN Coordinating Center (BIRN-CC) is orchestrating the development and deployment of key infrastructure components for immediate and long-range support of the scientific goals pursued by these test bed scientists. These components include high bandwidth inter-institutional connectivity via Internet2, a uniformly consistent security model, grid-based file management and computational services, software and techniques to federate data and databases, data caching and replication techniques to improve performance and resiliency, and shared processing, visualization and analysis environments. As a core component of the BIRN infrastructure, Internet2 provides a solid foundation for the future expansion of the BIRN as well as the stable high performance network required by researchers in a national collaboratory. Researchers within BIRN are also benefiting directly from the connectivity to high performance computing resources, such as TeraGrid. Currently researchers are performing advanced shape analyses of anatomical structures to gain a better understanding of diseases and disorders. These analyses run on TeraGrid have produced over 10TB of resultant data which were then transferred back to the BIRN Data Grid.

BIRN intertwines concurrent revolutions occurring in biomedicine and information technology. As the requirements of the biomedical community become better specified through projects like the BIRN, the national cyberinfrastructure being assembled to enable large-scale science projects will also evolve. As these technologies mature, the BIRN is uniquely situated to serve as a major conduit between the biomedical research community of NIH-sponsored programs and the information technology development programs, mostly supported by other government agencies (e.g., NSF, NASA, DOE, DARPA) and industry.


PROTEIN NETWORK COMPARATIVE GENOMICS

Trey Ideker

University of California San Diego

With the appearance of large networks of protein-protein and protein-DNA interactions as a new type of biological measurement, methods are needed for constructing cellular pathway models using interaction data as the central framework. The key idea is that, by comparing the molecular interaction network with other biological data sets, it will be possible to organize the network into modules representing the repertoire of distinct functional processes in the cell. Three distinct types of network comparisons will be discussed, including those to identify:

(1) Protein interaction networks that are conserved across species

(2) Networks in control of gene expression changes

(3) Networks correlating with systematic phenotypes and synthetic lethals

Using these computational modeling and query tools, we are constructing network models to explain the physiological response of yeast to DNA damaging agents.

Relevant articles and links

1. Yeang, C.H., Mak, H.C., McCuine, S., Workman, C., Jaakkola, T., and Ideker, T. Validation and refinement of gene regulatory pathways on a network of physical interactions. Genome Biology 6(7): R62 (2005).

2. Kelley, R. and Ideker, T. Systematic interpretation of genetic interactions using protein networks. Nature Biotechnology 23(5):561-566 (2005).

3. Sharan, R., Suthram, S., Kelley, R. M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R. M., and Ideker, T. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 102(6): 1974-1979 (2005).

4. Suthram, S., Sittler, T., and Ideker, T. The Plasmodium network diverges from those of other species. Nature 437 (November 3, 2005).

5. http://www.pathblast.org
6. http://www.cytoscape.org

Acknowledgements

We gratefully acknowledge funding through NIH/NIGMS grant GM070743-01; NSF grant CCF-0425926; Unilever, PLC, and the Packard Foundation.


SYSTEMS BIOLOGY IN TWO DIMENSIONS: UNDERSTANDING AND ENGINEERING MEMBRANES AS DYNAMICAL SYSTEMS

Erik Jakobsson

University of Illinois at Urbana-Champaign
Director, National Center for the Design of Biomimetic Nanoconductors

Theme:

The theme of our NIH Nanomedicine Development Center is the design of biomimetic nanoconductors and devices utilizing nanoconductors. The model theoretical systems are native and mutant biological channels and other ion transport proteins and synthetic channels, and heterogeneous membranes containing channels and transporters. The model experimental systems are engineered protein channels and synthetic channels in isolation, and in self-assembled membranes supported on nanoporous silicon scaffolds. The ultimate goal is to understand how biomimetic nanoscale design can be utilized in devices to achieve the functions that membrane systems accomplish in biological systems: a) electrical and electrochemical signaling, b) generation of osmotic pressures and flows, c) generation of electrical power, and d) energy transduction.

Broad Goals:

Our Center's broad goals are:

1. To advance theoretical, computational, and experimental methods for understanding and quantitatively characterizing biomembrane and other nanoscale transport processes, through interactive teams doing collaborative macromolecular design and synthesis, computation/theory, and experimental functional characterization.

2. To use our knowledge and technical capabilities to design useful biomimetic devices and technologies that utilize membrane and nanopore transport.

3. To interact synergistically with other workers in the areas of membrane processes, membrane structure, the study of membranes as systems, biomolecular design, biomolecular theory and computation, transport processes, and nanoscale device design.

4. To disseminate enhanced methods and tools for: theory and computation related to transport, experimental characterization of membrane function, theoretical and experimental characterization of nanoscale fluid flow, and nanotransport aspects of device design.

Initial Design Target:

A biocompatible biomimetic battery (the "biobattery") to power an implantable artificial retina, extendable to other neural prostheses. Broad design principles are suggested by the electrocyte of the electric eel, which generates large voltages and current densities by stacking large areas of electrically excitable membranes in series. The potential advantages of the biomimetic battery are lack of toxic materials, and ability to be regenerated by the body's metabolism.

Major Emergent Reality Constraints:

The development and maintenance of the electrocyte in the eel are guided by elaborate and adaptive pathways under genetic control, which we cannot realistically hope to include in a device.


Our approach will include replacing the developmental machinery with a nanoporous silicon scaffold, on which membranes will self-assemble. The lack of maintenance machinery will be compensated for by making the functional components of the biobattery from more durable, less degradable molecules.

Initial Specific Activities:

1. Making a detailed dynamical model, including electrical and osmotic phenomena and incorporating specific geometry, of the eel electrocyte.

2. Doing an initial design of a biomimetic battery that is potentially capable of fabrication/self-assembly.

3. Searching for more durable functional analogues of the membranes and transporters of the electrocyte. Approaches being pursued include designing beta-barrel functional analogues for helix-bundle proteins, mining extremophile genomes for appropriate transporters, chemically functionalized silicon pores, and design of durable synthetic polymer membranes that can incorporate transport molecules by self-assembly. These approaches combine information technology, computer modeling, and simulation, with experiment.

4. Fabricating nanoporous silicon supports for heterogeneous membranes in complex geometries.

Organizational Principles of Center:

Our core team is supported by the NIH Roadmap grant, but we welcome collaborations with all workers with relevant technologies and skills, and aligned interests.


BIOINFORMATICS AT MICROSOFT RESEARCH

Simon Mercer

Microsoft Research
One Microsoft Way

Redmond, WA 98052, USA

The advancement of the life sciences in the last twenty years has been in part the story of increasing integration of computing with scientific research, a trend that is set to transform the practice of science in our lifetimes. Conversely, biological systems are a rich source of ideas that will transform the future of computing.

In addition to supporting academic research in the life sciences, Microsoft Research is a source of tools and technologies well suited to the needs of basic scientific research - current projects include new languages to simplify data extraction and processing, tools for scientific workflows, and biological visualization.

Computer science researchers also bring new perspectives to problems in biology, such as the use of schema-matching techniques in merging ontologies, machine learning in vaccine design, and process algebra in understanding metabolic pathways.


MOVIE CRUNCHING IN BIOLOGICAL DYNAMIC IMAGING

Jean-Christophe Olivo-Marin

Quantitative Image Analysis Unit
Institut Pasteur
CNRS URA 2582
25 rue du Dr Roux

75724 Paris, France

Recent advances in biological imaging technologies have enabled the observation of living cells with high resolution during extended periods of time and are impacting biological research in such different areas as high-throughput image-based drug screening, cellular therapies, cell and developmental biology and gene expression studies. Deciphering the complex machinery of cell functions and dysfunction indeed necessitates large-scale multidimensional image-based assays to cover the wide range of highly variable and intricate properties of biological systems. However, understanding the wealth of data generated by multidimensional microscopy depends critically on decoding the visual information contained therein and on the availability of the tools to do so. Innovative automatic techniques to extract quantitative data from image sequences are therefore of major interest. I will present methods we have recently developed to perform the computational analysis of image sequences coming from multidimensional microscopy, with particular emphasis on tracking and motion analysis for 3D+t image sequences using active contours and multiple particle tracking.

1. INTRODUCTION

The advent of multidimensional microscopy (real-time optical sectioning and confocal, TIRF, FRET, FRAP, FLIM) has enabled biologists to visualize cells, tissues and organs in their intrinsic 3D and 3D+t geometry, in contrast to the limited 2D representations that were available until recently. These new technologies are already impacting biological research in such different areas as high-throughput image-based drug screening, cellular therapies, cell and developmental biology and gene expression studies, as they are putting at hand the imaging of the inner workings of living cells in their natural context. Expectations are high for breakthroughs in areas such as cell response and motility modification by drugs, control of targeted sequence incorporation into the chromatin for cell therapy, spatial-temporal organization of the cell and its changes with time or under infection, assessment of pathogens routing into the cell, interaction between proteins, sanitary control of pathogen evolution, to name but a few. Deciphering the complex machinery of cell functions and dysfunction necessitates large-scale multidimensional image-based assays to cover the wide range of highly variable and intricate properties of biological material. However, understanding the wealth of data generated by multidimensional microscopy depends critically on decoding the visual information contained therein.

Within the wide interdisciplinary field of biological imaging, I will concentrate on work developed in our laboratory on two aspects central to cell biology, particle tracking and cell shape and motility analysis, which have many applications in the important field of infectious diseases.

2. PARTICLE TRACKING

Molecular dynamics in living cells is a central topic in cell biology, as it opens the possibility to study with sub-micron resolution molecular diffusion, spatio-temporal regulation of gene expression, and pathogen motility and interaction with host cells. For example, it is possible, after labelling with specific fluorochromes, to record the movement of organelles like phagosomes or endosomes in the cell,6 the movement of different mutants of bacteria or parasites,2 or the positioning of telomeres in nuclei (Galy et al., 2000).3

I will describe the methods we have developed to perform the detection and the tracking of microscopic spots directly on four dimensional (3D+t) image data.4,5

They are able to detect with high accuracy multiple biological objects moving in three-dimensional space and incorporate the possibility to follow moving spots switching between different types of dynamics. Our methods decouple the detection and the tracking processes and are based on a two-step procedure: first, the objects are detected in the image stacks thanks to a procedure based on a three-dimensional wavelet transform; then the tracking is performed within a Bayesian framework where each object is represented by a state vector evolving according to biologically realistic dynamic models.
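As a rough illustration of this two-step scheme, the Python sketch below separates per-frame detection from frame-to-frame linking. It is a hypothetical simplification, not the published implementation: the difference of two successive Gaussian smoothings stands in for one detail band of the 3D wavelet transform, nearest-neighbor association with a constant-velocity prediction stands in for the full Bayesian state-vector filtering, and the threshold factor k and gating radius are invented parameters.

import numpy as np
from scipy import ndimage

def detect_spots(stack, k=3.0):
    # One detail band of a wavelet-like decomposition: the difference of two
    # successive smoothings of the 3D stack (Gaussian used as a stand-in kernel).
    band = ndimage.gaussian_filter(stack, 1.0) - ndimage.gaussian_filter(stack, 2.0)
    # Robust, noise-scaled threshold on the detail band (assumed heuristic).
    labels, n = ndimage.label(band > k * np.median(np.abs(band)))
    if n == 0:
        return np.empty((0, stack.ndim))
    return np.array(ndimage.center_of_mass(band, labels, range(1, n + 1)))

def track(frames, gate=5.0):
    # Initialize one track per spot detected in the first frame.
    tracks = [{"pos": p, "vel": np.zeros_like(p), "path": [p]}
              for p in detect_spots(frames[0])]
    for stack in frames[1:]:
        detections = detect_spots(stack)
        claimed = set()
        for t in tracks:
            if len(detections) == 0:
                break
            predicted = t["pos"] + t["vel"]          # constant-velocity prediction
            dist = np.linalg.norm(detections - predicted, axis=1)
            j = int(np.argmin(dist))
            if dist[j] < gate and j not in claimed:  # associate within a gate
                claimed.add(j)
                t["vel"] = detections[j] - t["pos"]  # crude velocity update
                t["pos"] = detections[j]
                t["path"].append(detections[j])
    return [t["path"] for t in tracks]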

3. CELL TRACKING

Another important project of our laboratory is motivated by the problem of cell motility. The ability of cells to move and change their shape is important in many areas of biology, including cancer, development, infection and immunity.7 We have developed algorithms to automatically segment and track moving cells in dynamic 2D or 3D microscopy.1, 8 For this purpose, we have adopted the framework of active contours and deformable models that is widely employed in the computer vision community. The segmentation proceeds by evolving the front according to evolution equations that minimize an energy functional (usually by gradient descent). This energy contains both data attachment terms and terms encoding prior information about the boundaries to be extracted, e.g. smoothness constraints. Tracking, i.e. linking segmented objects between time points, is simply achieved by initializing front evolutions using the segmentation result of the previous frame, under the assumption that inter-frame motions are modest. I will describe some of our work on adapting these methods to the needs of cellular imaging in biological research.
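For reference, the kind of energy functional minimized by such front-evolution schemes can be written in the classical parametric ("snake") form of Kass, Witkin and Terzopoulos; this is the generic textbook form, not necessarily the exact functional used in refs. 1, 7 and 8:

E(C) = \int_0^1 \left( \alpha\,|C'(s)|^2 + \beta\,|C''(s)|^2 \right) ds \;-\; \lambda \int_0^1 \left| \nabla I\big(C(s)\big) \right|^2 ds

Here C(s) is the parameterized contour (the front); the first integral penalizes stretching and bending (the smoothness prior mentioned above), while the second is the data-attachment term drawing the front toward high image gradients |∇I|. Gradient descent on E yields the evolution equations, and initializing C at frame t+1 with the converged contour from frame t implements the inter-frame linking described in the text.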

References

1. A. Dufour, V. Shinin, S. Tajbakhsh, N. Guillen, J.-C. Olivo-Marin, and C. Zimmer, Segmenting and tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces, IEEE Trans. Image Processing, vol. 14, no. 9, pp. 1396-1410, 2005.

2. F. Frischknecht, P. Baldacci, B. Martin, C. Zimmer, S. Thiberge, J.-C. Olivo-Marin, S. L. Shorte, and R. Menard, Imaging movement of malaria parasites during transmission by Anopheles mosquitoes, Cell Microbiol, vol. 6, no. 7, pp. 687-694, 2004.

3. V. Galy, J.-C. Olivo-Marin, H. Scherthan, V. Doye, N. Rascalou, and U. Nehrbass, Nuclear pore complexes in the organization of silent telomeric chromatin, Nature, vol. 403, pp. 108-112, 2000.

4. A. Genovesio, B. Zhang, and J.-C. Olivo-Marin, Tracking of multiple fluorescent biological objects in three dimensional video microscopy, IEEE International Conference on Image Processing ICIP 2003, vol. I, pp. 1105-1108, Barcelona, Spain, September 2003.

5. A. Genovesio, T. Liedl, V. Emiliani, W. Parak, M. Coppey-Moisan, and J.-C. Olivo-Marin, Multiple particle tracking in 3D+t microscopy: method and application to the tracking of endocytosed quantum dots, IEEE Trans. Image Processing, vol. 15, no. 5, pp. 1062-1070, 2006.

6. C. Murphy, R. Saffrich, J.-C. Olivo-Marin, A. Giner, W. Ansorge, T. Fotsis, and M. Zerial, Dual function of RhoD in vesicular movement and cell motility, Eur. Journal of Cell Biology, vol. 80, no. 6, pp. 391-398, 2001.

7. C. Zimmer, E. Labruyere, V. Meas-Yedid, N. Guillen, and J.-C. Olivo-Marin, Segmentation and tracking of migrating cells in videomicroscopy with parametric active contours: a tool for cell-based drug testing, IEEE Trans. Medical Imaging, vol. 21, pp. 1212-1221, 2002.

8. C. Zimmer and J.-C. Olivo-Marin, Coupled parametric active contours, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1838-1842, 2005.


ENGINEERING NUCLEIC ACID-BASED MOLECULAR SENSORS FOR PROBING AND PROGRAMMING CELLULAR SYSTEMS

Professor Christina D. Smolke

California Institute of Technology, Department of Chemical Engineering

Information flow through cellular networks is responsible for regulating cellular function at both the single cell and multi-cellular systems levels. One of the key limitations to understanding dynamic fluctuations in intracellular biomolecule concentrations is the lack of enabling technologies that allow for user-specified probing and programming of these cellular events. I will discuss our work in developing the molecular design and cellular engineering strategies for the construction of tailor-made sensor platforms that can temporally and spatially monitor and regulate information flow through diverse cellular networks. The construction of sensor platforms based on allosteric regulation of non-coding RNA (ncRNA) activity will be presented, where molecular recognition of a ligand-binding event is coupled to a conformational change in the RNA molecule. This regulated conformational change may be linked to an appropriate readout signal by controlling a diverse set of ncRNA gene regulatory activities. Our research has demonstrated the modularity, design predictability, and specificity inherent in these molecules for cellular control. In addition, the flexibility of these sensor platforms enables these molecules to be incorporated into larger circuits based on molecular computation strategies to construct sensor sets that will perform higher-level signal processing toward complex systems analysis and cellular programming strategies. In particular, the application of these molecular sensors to the following downstream research areas will be discussed: metabolic engineering of microbial alkaloid synthesis and 'intelligent' therapeutic strategies.


REACTOME: A KNOWLEDGEBASE OF BIOLOGICAL PATHWAYS

Lincoln Stein, Peter D'Eustachio, Gopal Gopinathrao, Marc Gillespie, Lisa Matthews, Guanming Wu

Cold Spring Harbor Laboratory Cold Spring Harbor, NY, USA

Imre Vastrik, Esther Schmidt, Bernard de Bono, Bijay Jassal, David Croft, Ewan Birney

European Bioinformatics Institute Hinxton, UK

Suzanna Lewis

Lawrence Berkeley National Laboratory Berkeley, CA, USA

Reactome, located at http://www.reactome.org, is a curated, peer-reviewed resource of human biological processes. Given the genetic makeup of an organism, the complete set of possible reactions constitutes its reactome. The basic unit of the Reactome database is a reaction; reactions are then grouped into causal chains to form pathways. The Reactome data model allows us to represent many diverse processes in the human system, including the pathways of intermediary metabolism, regulatory pathways, and signal transduction, and high-level processes, such as the cell cycle. Reactome provides a qualitative framework on which quantitative data can be superimposed. Tools have been developed to facilitate custom data entry and annotation by expert biologists, and to allow visualization and exploration of the finished dataset as an interactive process map. Although our primary curational domain is pathways from Homo sapiens, we regularly create electronic projections of human pathways onto other organisms via putative orthologs, thus making Reactome relevant to model organism research communities. The database is publicly available under open source terms, which allows both its content and its software infrastructure to be freely used and redistributed.
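As a toy sketch of this reaction-centric design (illustrative only; the names below are invented and the real Reactome data model is considerably richer, with physical entities, catalysts, compartments and regulation), a reaction can be modeled as a set of inputs and outputs over identified entities, and a pathway as a group of reactions chained causally whenever one reaction's output feeds another's input:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Reaction:
    name: str
    inputs: List[str]     # entity identifiers consumed by the reaction
    outputs: List[str]    # entity identifiers produced by the reaction

@dataclass
class Pathway:
    name: str
    reactions: List[Reaction] = field(default_factory=list)

    def precedes(self, a: Reaction, b: Reaction) -> bool:
        # A causal link in the chain: some output of `a` is an input of `b`.
        return bool(set(a.outputs) & set(b.inputs))

# A two-step fragment of glycolysis expressed in this toy model.
hexokinase = Reaction("glucose + ATP -> G6P + ADP", ["glucose", "ATP"], ["G6P", "ADP"])
isomerase = Reaction("G6P -> F6P", ["G6P"], ["F6P"])
glycolysis = Pathway("glycolysis (fragment)", [hexokinase, isomerase])
assert glycolysis.precedes(hexokinase, isomerase)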


EFFECTIVE OPTIMIZATION ALGORITHMS FOR FRAGMENT-ASSEMBLY BASED PROTEIN STRUCTURE PREDICTION

Kevin W. DeRonne* and George Karypis

Department of Computer Science & Engineering,

Digital Technology Center, Army HPC Research Center, University of Minnesota, Minneapolis, MN 55455

*Email: {deronne, karypis}@cs.umn.edu

Despite recent developments in protein structure prediction, an accurate new fold prediction algorithm remains elusive. One of the challenges facing current techniques is the size and complexity of the space containing possible structures for a query sequence. Traditionally, fragment-assembly approaches to new fold prediction have explored this space using stochastic optimization techniques. Here we examine deterministic algorithms for optimizing scoring functions in protein structure prediction. Two techniques not previously applied to the problem are examined, called the Greedy algorithm and the Hill-climbing algorithm. The main difference between the two is that the latter implements a technique to overcome local minima. Experiments on a diverse set of 276 proteins show that the Hill-climbing algorithms consistently outperform existing approaches based on Simulated Annealing optimization (a traditional stochastic technique) in optimizing the root mean squared deviation (RMSD) between native and working structures.

1. INTRODUCTION

Reliably predicting protein structure from amino acid sequence remains a challenge in bioinformatics. Although the number of known structures continues to grow, many new sequences still lack a known homolog in the PDB 2, which makes it harder to predict structures for these sequences. The presence or absence of a known structural homolog to a query sequence commonly delineates a set of subproblems within the greater arena of protein structure prediction. For example, the biennial CASP competition3 breaks down structure prediction as follows. In homologous fold recognition the structure of the query sequence is similar to a known structure for some other sequence. However, these two sequences have only a low (though detectable) similarity. In analogous fold recognition there exists a known structure similar to the correct structure of the query, but the sequence of that structure has no detectable similarity to the query sequence. Still more challenging is the problem of predicting the structure of a query sequence lacking a known structural relative, which is called new fold (NF) prediction.

Within the context of the NF problem, knowledge-based methods have attracted increasing attention over the last decade. In CASP, prediction approaches that assemble fragments of known structures into a candidate structure 18, 7, 10 have consistently outperformed alternative methods, such as those based largely on explicit modeling of physical forces. Fragment assembly for a query protein begins with the selection of structural fragments based on sequence information. These fragments are then successively inserted into the query protein's structure, replacing the coordinates of the query with those of the fragment. The quality of this new structure is assessed by a scoring function. If the scoring function is a reliable measure of how close the working structure is to the native fold of the protein, then optimizing the function through fragment insertions will produce a good structure prediction. Thus, building a structure in this manner breaks down into three main components: a fragment selection technique, an optimizer for the scoring function, and the scoring function itself.

To optimize the scoring function, all the leading assembly-based approaches use an algorithm involving a stochastic search (e.g. Simulated Annealing 18, genetic algorithms 7, or conformational space annealing 10). One potential drawback of such techniques is that they can require extensive parameter tuning before producing good solutions.

* Corresponding author.
a http://predictioncenter.org/

Page 37: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

20

In this paper we wish to examine the relative performance of deterministic and stochastic techniques to optimize a scoring function. The new algorithms presented below are inspired by techniques originally developed in the context of graph partitioning 4, and do not depend on a random element. The Greedy approach examines all possible fragment insertions at a given point and chooses the best one available. The Hill-climbing algorithm follows a similar strategy but allows for moves that reduce the score locally, provided that they lead to a better global score.
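The contrast between the two optimizers can be made concrete with the following sketch (hypothetical Python pseudocode, not the authors' implementation: score is the function being minimized, moves enumerates the candidate fragment insertions at the current point, apply_move returns the structure after an insertion, and the bounded-downhill rule is a simplified stand-in for the graph-partitioning-style refinement the text alludes to):

def greedy(state, moves, score, apply_move):
    # Repeatedly take the single best insertion; stop at a local minimum.
    while True:
        best = min(moves(state), key=lambda m: score(apply_move(state, m)))
        if score(apply_move(state, best)) >= score(state):
            return state            # no insertion improves the score
        state = apply_move(state, best)

def hill_climb(state, moves, score, apply_move, max_downhill=20):
    # Like greedy, but keep moving through locally worsening insertions,
    # remembering the best state seen globally.
    best_state, best_score = state, score(state)
    downhill = 0
    while downhill < max_downhill:
        m = min(moves(state), key=lambda mv: score(apply_move(state, mv)))
        state = apply_move(state, m)       # best available move, even if worse
        s = score(state)
        if s < best_score:
            best_state, best_score, downhill = state, s, 0
        else:
            downhill += 1                  # consecutive non-improving moves
    return best_state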

Several variables can affect the performance of optimization algorithms in the context of fragment-based ab initio structure prediction. For example, how many fragments per position are available to the optimizer, how long the fragments are, whether they should be of multiple sizes at different stages 18 or all different sizes used together 7, and other parameters specific to the optimizer can all influence the quality of the resulting structures.

Taking the above into account, we varied fragment length and number of fragments per position when comparing the performance of our optimization algorithms to that of a tuned Simulated Annealing approach. Our experiments test these algorithms on a diverse set of 276 protein domains derived from SCOP 1.69 14. The results of these experiments show that the Hill-climbing-based approaches are very effective in producing high-quality structures in a moderate amount of time, and that they generally outperform Simulated Annealing. On average, Hill-climbing is able to produce structures that are 6% to 20% better (as measured by the root mean squared deviation (RMSD) between the computed structure and the actual structure), and the relative advantage of the Hill-climbing-based approaches improves with the length of the proteins.
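For concreteness, the RMSD quoted here is the usual quantity

\mathrm{RMSD}(X, Y) = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - y_i \rVert^2 }

computed over N corresponding coordinates (e.g. Cα atoms) x_i of the computed structure and y_i of the native structure after superposition; "6% to 20% better" therefore means this deviation is on average 6-20% smaller for Hill-climbing than for Simulated Annealing.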

2. MATERIALS AND METHODS

2.1. Data

The performance of the optimization algorithms studied in this paper was evaluated using a set of proteins with known structure that was derived from

b No bond lengths were modified to fit this constraint; proteins not satisfying it were simply removed from consideration.
c This dataset is available at http://www.cs.umn.edu/~deronne/supplement/optimize

SCOP 1.69 14 as follows. Starting from the set of domains in SCOP, we first removed all membrane and cell surface proteins, and then used Astral's tools 3 to construct a set of proteins with less than 25% sequence identity. This set was further reduced by keeping only the structures that were determined by X-ray crystallography, filtering out any proteins with a resolution greater than 2.5 Å, and removing any proteins with a Cα-Cα distance greater than 3.8 Å times their sequential separation.b

The above steps resulted in a set of 2817 proteins. From this set, we selected a subset of 276 proteins (roughly 10%) to be used in evaluating the performance of the various optimization algorithms (i.e., a test set), whereas the remaining 2541 sequences were used as the database from which to derive the structural fragments (i.e., a training set).c The test sequences, whose characteristics are summarized in Table 1, were selected to be diverse in length and secondary structure composition.

Table 1. Number of sequences at various length intervals and SCOP classes.

                        Sequence Length
SCOP Class      < 100   100-200   > 200   Total
alpha              23        40       6      69
beta               23        27      18      69
alpha/beta          4        26      39      69
alpha+beta         15        36      17      69

2.2. Neighbor Lists

As the search space for fragment assembly is much too vast, fragment-based ab initio structure prediction approaches must reduce the number of possible structures that they consider. They accomplish this primarily by restricting the number of structural fragments that can be used to replace each k-mer of the query sequence. In evaluating the various optimization algorithms developed in this work, we followed a methodology for identifying these structural fragments that is similar in spirit to that used by the Rosetta 18 system.

Consider a query sequence X of length l.



For each position i, we identify a list (L_i) of n structural fragments by comparing the query sequence against the sequences of the proteins in the training set. For fragments of length k, these comparisons involve the k-mer of X starting at position i (0 ≤ i < l − k + 1) and all k-mers in the training set. The n structural fragments are selected so that their corresponding sequences have the highest profile-based score with the query sequence's k-mer. Throughout the rest of this paper, we will refer to the list L_i as the neighbor list of position i.
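To make the neighbor-list construction concrete, the following minimal Python sketch keeps, for each query position, the n highest-scoring training fragments. The containers `query_kmers` and `train_kmers` and the callable `score` are hypothetical stand-ins for the paper's data structures and the profile-based score of Section 2.2.2, not the authors' implementation.

```python
import heapq

def neighbor_lists(query_kmers, train_kmers, score, n):
    """For each query position i, keep the n structural fragments whose
    sequences score highest against the query k-mer (the neighbor list L_i).
    `train_kmers` is assumed to yield (sequence_kmer, structural_fragment)
    pairs; `score` is the profile-to-profile k-mer score."""
    lists = {}
    for i, qk in enumerate(query_kmers):
        best = heapq.nlargest(n, train_kmers, key=lambda pair: score(qk, pair[0]))
        lists[i] = [fragment for _, fragment in best]
    return lists
```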

In our study we used neighbor lists containing fragments of a single length as well as neighbor lists containing fragments of different lengths. In the latter case we consider two different approaches to leveraging the varied length fragments. The first, referred to as scan, uses the fragment lengths in decreasing order. For example, if the neighbor lists contain structural fragments of length three, six, and nine, the algorithm starts by first optimizing the structure using only fragments of length nine, then fragments of length six, and finally fragments of length three. Each one of these optimization phases terminates when the algorithm has finished (i.e., reached a local optimum or performed a predetermined number of iterations), and the resulting structure becomes the input to the subsequent optimization phase. The second approach for combining different length fragments is referred to as pool, and it optimizes the structure once, selecting fragments from any available length. Using any single length fragment in isolation, or using either scan or pool will be referred to as a fragment selection scheme.
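The difference between the two multi-length schemes can be summarized in a few lines. The sketch below is illustrative only: `optimize` stands for any of the optimizers of Section 2.5, and the layout of the neighbor lists (a dict from fragment length to per-position lists) is our assumption rather than the authors' data structure.

```python
def scan(structure, lists_by_len, optimize):
    """Scan: one optimization phase per fragment length, longest first;
    each phase's result seeds the next phase."""
    for k in sorted(lists_by_len, reverse=True):  # e.g. 9, then 6, then 3
        structure = optimize(structure, lists_by_len[k])
    return structure

def pool(structure, lists_by_len, optimize):
    """Pool: a single optimization over the union of all fragment lengths."""
    pooled = {}
    for lists in lists_by_len.values():
        for pos, frags in lists.items():
            pooled.setdefault(pos, []).extend(frags)
    return optimize(structure, pooled)
```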

2.2.1. Sequence Profiles

The comparisons between the query and the training sequences take advantage of evolutionary information by utilizing PSI-BLAST 1 generated sequence profiles.

The profile of a sequence X of length l is represented by two l × 20 matrices. The first is its position-specific scoring matrix PSSM_X that is computed directly by PSI-BLAST. The rows of this matrix correspond to the various positions in X, while the columns correspond to the 20 distinct amino acids. The second matrix is its position-specific frequency matrix PSFM_X that contains the frequencies used by PSI-BLAST to derive PSSM_X. These frequencies (also referred to as target frequencies 13) contain both the sequence-weighted observed frequencies (also referred to as effective frequencies 13) and the BLOSUM62 6 derived pseudocounts 1. For each row of a PSFM, the frequencies are scaled so that they add up to one. In the cases where PSI-BLAST could not produce meaningful alignments for a given position of X, the corresponding rows of the two matrices are derived from the scores and frequencies of BLOSUM62.

For our study, we used the version of the PSI-BLAST algorithm available in NCBI's BLAST release 2.2.10 to generate profiles for both the test and training sequences. These profiles were derived from the multiple sequence alignment constructed after five iterations using an e-value of 10^-2. The PSI-BLAST search was performed against NCBI's nr database, downloaded in November of 2004, which contained 2,171,938 sequences.

2.2.2. Profile-to-Profile Scoring Method

The similarity score between a pair of k-mers (one from the query sequence and one from a sequence in the training set) was computed as the ungapped alignment score of the two k-mers, whose aligned positions were scored using profile information.

Many different schemes have been developed for determining the similarity between profiles that combine information from the original sequence, position-specific scoring matrix, or position-specific target and/or effective frequencies 13, 21, 11. In our work we use a scheme derived from PICASSO 5, 13 that was recently used in developing effective remote homology prediction and fold recognition algorithms 16. Specifically, the similarity score between the ith position of protein X's profile and the jth position of protein Y's profile is given by

    S_{X,Y}(i,j) = \sum_{l=1}^{20} PSFM_X(i,l)\, PSSM_Y(j,l) + \sum_{l=1}^{20} PSFM_Y(j,l)\, PSSM_X(i,l),    (1)

where PSFM_X(i,l) and PSSM_X(i,l) are the values corresponding to the lth amino acid at the ith position of X's position-specific frequency and scoring matrices, and PSFM_Y(j,l) and PSSM_Y(j,l) are defined in a similar fashion.

Equation 1 determines the similarity between two profile positions by weighting the position-specific scores of the first sequence according to the frequency at which the corresponding amino acid occurs in the second sequence's profile. The key difference between Equation 1 and the corresponding scheme used in 13 (therein referred to as PICASSO3) is that our measure uses the target frequencies, whereas the scheme of 13 is based on effective frequencies.
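For illustration, Equation 1 extends to an ungapped k-mer score by summing the per-position scores over the k aligned positions. The sketch below assumes the PSSM/PSFM are stored as l × 20 NumPy arrays, which is an assumption about layout, not the authors' code.

```python
import numpy as np

def kmer_score(psfm_x, pssm_x, psfm_y, pssm_y, i, j, k):
    """Ungapped profile-to-profile score of the k-mer of X starting at i
    against the k-mer of Y starting at j, applying Equation 1 at each of
    the k aligned positions."""
    score = 0.0
    for a in range(k):
        score += np.dot(psfm_x[i + a], pssm_y[j + a])  # sum over the 20 amino acids
        score += np.dot(psfm_y[j + a], pssm_x[i + a])
    return score
```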

2.3. Protein Structure Representation

Internally, we consider only the positions of the Cα atoms, and we use a vector representation of the protein in lieu of φ and ψ backbone angles. Our protein construction approach uses the actual coordinates of the atoms in each fragment, rotated and translated into the reference frame of the working structure. Fragments are taken directly from known structures, and are chosen from the training dataset using the above profile-to-profile scoring method.

2.4. Scoring Function

As the focus of this work is to develop and evaluate new optimization techniques, we use the RMSD between the predicted and native structure of a protein as the scoring function. Although such a function cannot serve as a predictive measure, we believe that using this as a scoring function allows for a clearer differentiation between the optimization process and the scoring function. In effect, we assume an ideal scoring function in order to test the optimization techniques.

2.5. Optimization Algorithms

In this study we compare the performance of three different optimization algorithms in the context of fragment assembly-based approaches for ab initio structure prediction. One of these algorithms, Simulated Annealing 8, is currently a widely used method for solving such problems, whereas the other two algorithms, Greedy and Hill-climbing, are newly developed for this work.

The key operation in all three of these algorithms is the replacement of the k-mer starting at a particular position i with that of a neighbor structure. We will refer to this operation as a move. A move is considered valid if, after inserting the fragment, it does not create any steric conflicts. A structure is considered to have a steric conflict if it contains a pair of Cα atoms within 2.5 Å of one another. Also, for each valid move, its gain is defined as the improvement in the value of the scoring function between the working structure and the native structure of the protein.

2.5.1. Simulated Annealing (SA)

Simulated Annealing 8 is a generalization of the Monte Carlo 12 method for discrete optimization problems. This optimization approach is designed to mimic the process by which a material such as metal or glass cools. At high temperatures, the atoms of a metal can adopt configurations not available to them at lower temperatures—e.g., a metal can be a liquid rather than a solid. As the system cools, the atoms arrange themselves into more stable states, forming a stronger substance.

The Simulated Annealing (SA) algorithm proceeds in a series of discrete steps. In each step it randomly selects a valid move and performs it (i.e., inserts the selected fragment into the structure). This move can either improve or degrade the quality of the structure. If the move improves the quality, then the move is accepted. If it degrades the quality, then the move will still be accepted with probability

    p = e^{(q_{old} - q_{new})/T},    (2)

where T is the current temperature of the system, q_old is the score of the last state, and q_new is the score of the state in question. From Equation 2 we see that the likelihood of accepting a bad move decreases as the temperature drops and as the gap between the new and current scores widens. That is, the optimizer will accept a very bad move with a higher probability if the temperature is high than if the temperature is low.

The algorithm begins with a high system temperature which it progressively decreases according to an annealing schedule. As the optimization must use finite steps, the cooling of the system cannot be continuous, but the annealing schedule can be modified to increase its smoothness. The annealing schedule depends on a combination of the number of total allowed moves and the number of steps in which to make those moves. Our implementation of Simulated Annealing, following the general framework employed in Rosetta 18, uses an annealing schedule that linearly decreases the temperature of the system to zero over a fixed number of cycles.
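A minimal sketch of this loop is given below, assuming a `propose_move` callable that returns a candidate structure and a `score` function in which lower is better (as with the RMSD of Section 2.4). The linear schedule and the Equation-2 acceptance rule follow the description above; everything else is a placeholder, not the authors' implementation.

```python
import math
import random

def simulated_annealing(structure, propose_move, score, t0=0.1,
                        cycles=3500, moves_per_cycle=100):
    """Simulated Annealing with a linear annealing schedule: the temperature
    falls from t0 toward zero over a fixed number of cycles, and bad moves
    are accepted with probability p = e^((q_old - q_new)/T)."""
    q_old = score(structure)
    for cycle in range(cycles):
        temperature = t0 * (1.0 - cycle / cycles)  # linear cooling
        for _ in range(moves_per_cycle):
            candidate = propose_move(structure)
            q_new = score(candidate)
            if q_new <= q_old or (temperature > 0 and
                                  random.random() < math.exp((q_old - q_new) / temperature)):
                structure, q_old = candidate, q_new
    return structure
```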

Simulated Annealing is a highly tunable optimization framework. The starting temperature and the annealing schedule can be varied to improve performance, and the performance of the algorithm depends greatly on these parameters. Section 3.2.1 describes how we arrive at the values for these parameters of SA as implemented in this study.

2.5.2. The Greedy Algorithm (G)

One of the characteristics of the Simulated Annealing algorithm is that it considers moves for insertion at random, irrespective of their gains. The Greedy algorithm that we present here selects maximum gain moves.

Specifically, the algorithm consists of two phases. In the first phase, called initial structure generation, the algorithm starts from a structure corresponding to a fully extended chain, and attempts to make a valid move at each position of the protein. This is achieved by scoring all neighbors in each neighbor list and inserting the best neighbor (i.e. the neighbor with the highest gain) from each list. If some positions have no valid moves on the first pass, the algorithm attempts to make moves at these positions after trying all positions once. This ensures that the algorithm makes moves at nearly every position down a chain, and also provides a good starting point for the next phase.

In the second phase, called progressive refinement, the algorithm repeatedly finds the maximum gain valid move over all positions of the chain, and if this move leads to a positive gain (i.e., it improves the value of the scoring function) the algorithm makes the move. The progressive refinement phase terminates upon failing to find any such move, so the Greedy algorithm is guaranteed to finish this phase in at least a local optimum.
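The two phases can be sketched as follows; `best_valid_move(structure, pos)` (returning the highest-gain valid insertion at a position, or None) and `gain` are hypothetical helpers standing in for the fragment-insertion machinery.

```python
def greedy(structure, positions, best_valid_move, gain):
    """Greedy optimizer: initial structure generation followed by
    progressive refinement."""
    # Phase 1: try every position once, then retry positions that had no
    # valid move on the first pass.
    pending = list(positions)
    for _ in range(2):
        failed = []
        for pos in pending:
            move = best_valid_move(structure, pos)
            if move is None:
                failed.append(pos)
            else:
                structure = move(structure)
        pending = failed

    # Phase 2: repeatedly take the maximum-gain valid move over all
    # positions while it still improves the scoring function.
    while True:
        moves = [best_valid_move(structure, p) for p in positions]
        moves = [m for m in moves if m is not None and gain(structure, m) > 0]
        if not moves:
            break  # at least a local optimum
        structure = max(moves, key=lambda m: gain(structure, m))(structure)
    return structure
```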

2.5.3. Hill-Climbing (HC)

The Hill-climbing algorithm was developed to allow the Greedy algorithm to effectively climb out of locally optimal solutions. The key idea behind Hill-climbing is to not stop after achieving a local optimum but to continue performing valid moves in the hope of finding a better local or a (hopefully) global optimum.

Specifically, the Hill-climbing algorithm works as follows. The algorithm begins by applying the Greedy algorithm in order to reach a local optimum. At this point, it begins a sequence of iterations consisting of a hill-climbing phase, followed by a progressive refinement phase (as in the Greedy approach). In the hill-climbing phase, the algorithm performs a series of moves, each time selecting the highest gain valid move irrespective of whether or not it leads to a positive gain. If at any point during this series of moves, the working structure achieves a score that is better than that of the structure at the beginning of the hill-climbing phase, this phase terminates and the algorithm enters the progressive refinement phase. The above sequence of iterations terminates when the hill-climbing phase is unable to produce a better structure after successively performing all best scoring valid moves.

Since the hill-climbing phase starts at a local optimum, its initial set of moves will lead to a structure whose quality (as measured by the scoring function) is worse than that at the beginning of the hill-climbing phase. However, subsequent moves can potentially lead to improvements that outweigh the initial quality degradation; thus allowing the algorithm to climb out of locally optimal solutions.

Move Locking As Hill-climbing allows negative gain moves, the algorithm can potentially oscillate between a local optimum and a non-optimal solution. To prevent this from happening, we implement a notion of move locking. After each move, a lock is placed on the move to prevent the algorithm from making this move again within the same phase. By doing so, we ensure the algorithm does not repeatedly perform the same sequence of moves, thus guaranteeing its termination after a finite number of moves. All locks are cleared at the end of a hill-climbing phase, allowing the search maximum freedom to proceed.

We investigate two different locking methods. The first, referred to as fine-grain locking, locks the single move made. The algorithm can subsequently select a different neighbor for insertion at this position. The second, referred to as coarse-grain locking, locks the position of the query sequence itself; preventing any further insertions at that position. In the case of pooling, coarse locking locks moves of all sizes.

Since fine-grain locking is less restrictive, we expect it to lead to better quality solutions. However, the advantage of coarse-grain locking is that each successive fragment insertion significantly reduces the set of fragments that need to be considered for future insertions; thus, leading to a faster optimization algorithm.
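A sketch of one hill-climbing phase with both locking variants is shown below. The `all_moves` interface yielding (position, fragment, apply_fn) triples and the score convention (lower is better) are assumptions, not the authors' interfaces.

```python
def hill_climbing_phase(structure, all_moves, score, fine_grain=True):
    """Perform best-scoring valid moves, even with negative gain, until the
    working structure beats the score at the start of the phase. Locks keep
    the phase from repeating a move; the caller clears them when the phase
    ends and then resumes progressive refinement."""
    start_score = score(structure)
    locked = set()
    while True:
        legal = [(pos, frag, apply_fn)
                 for pos, frag, apply_fn in all_moves(structure)
                 if (pos, frag) not in locked and pos not in locked]
        if not legal:
            return structure, False  # no better structure found; terminate
        pos, frag, apply_fn = min(legal, key=lambda m: score(m[2](structure)))
        # Fine-grain locking bars only this (position, fragment) insertion;
        # coarse-grain locking bars the whole position.
        locked.add((pos, frag) if fine_grain else pos)
        structure = apply_fn(structure)
        if score(structure) < start_score:
            return structure, True  # climbed out of the local optimum
```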

2.5.4. Efficient Checking of Steric Conflicts

One characteristic of the Greedy and Hill-climbing algorithms is their need to evaluate the validity of every available move after every insertion. This is necessary because each insertion can potentially introduce new proximity conflicts. To reduce the time this process requires, we have developed an efficient formulation for validity checking.

Recall that a valid move brings no two Cα atoms within 2.5 Å of each other. To quickly determine whether this proximity constraint holds, we impose a three-dimensional grid over the structure being built, with boxes 2.5 Å on each side. As each move is made, its atoms are added to the grid, and for each addition the surrounding 26 boxes are checked for atoms violating the proximity constraint. In this fashion we limit the number of actual distances that must be computed.
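The grid can be implemented as a hash from integer cell coordinates to atom lists, as in the sketch below (our own minimal rendering of the idea, not the authors' code).

```python
from collections import defaultdict

CELL = 2.5  # grid spacing equals the 2.5 Å steric cutoff

def cell_of(p):
    """Integer grid cell containing point p = (x, y, z)."""
    return tuple(int(c // CELL) for c in p)

def try_add_atom(grid, p):
    """Add a C-alpha coordinate to the grid unless it falls within 2.5 Å of
    an atom already present. Only the atom's own cell and the 26 surrounding
    cells need searching, which bounds the number of distances computed."""
    cx, cy, cz = cell_of(p)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for q in grid[(cx + dx, cy + dy, cz + dz)]:
                    if sum((a - b) ** 2 for a, b in zip(p, q)) < CELL ** 2:
                        return False  # steric conflict
    grid[(cx, cy, cz)].append(p)
    return True

grid = defaultdict(list)  # fresh, empty grid for a new structure
```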

We further decrease the required time by sequentially checking neighbors at each position down the amino acid chain. All atoms upstream of the insertion point must be internally valid, as they have previously passed proximity checks. Thus, we need only examine those atoms at or downstream from the insertion. This saves on computation time within one iteration of checking all possible moves.

3. EXPERIMENTAL EVALUATION

3.1. Performance of the Greedy and Hill-climbing Algorithms

To compare the effectiveness of the Greedy and Hill-climbing optimization techniques, we report results from a series of experiments in which we vary a number of parameters. Table 2 shows results for the Greedy and Hill-climbing optimization techniques using k-mer sizes of 9, 6, and 3 individually, as well as using the scan and pool techniques to combine them. Average times are also reported for each of these five fragment selection schemes.

Examining Table 2, we see that the Hill-climbing algorithm consistently outperforms the Greedy algorithm. As Hill-climbing includes running Greedy to convergence, this result is not surprising, and neither is the increased run-time that Hill-climbing requires. Both schemes seem to take advantage of the increased flexibility of smaller fragments and greater numbers of fragments per position. For example, on the average the 3-mer results are 9.4%, 12.0%, and 8.5% better than the corresponding 9-mer results for Greedy, Hill-climbing (coarse) (hereafter HCc), and Hill-climbing (fine) (hereafter HCf), respectively. Similarly, increasing the neighbor lists from 25 to 100 yields a 23.1%, 31.6%, and 43.6% improvement for Greedy, HCc, and HCf, respectively. These results also show that the search algorithms embedded in Greedy, HCc, and HCf are progressively more powerful as the size of the overall search space increases.

With respect to locking, the less restrictive fine-grained approach generally yields better results than the coarse-grained scheme. For example, averaging over all experiments, fine-grained locking yields a 21.2% improvement over coarse-grained locking. However, this increased performance comes at the cost of an increase in run-time of 1128% on the average.

Comparing the performance of the scan and pool methods for combining variable length k-mers, we see that pool performs consistently better than scan by an average of 4.4%. This improvement also comes at the cost of an increase in run time, which in this case is 131.1% on the average.


Table 2. Average values over 276 proteins optimized using Greedy and Hill-climbing with different locking schemes. Times are in seconds and scores are in Å. Lower is better in both cases.

                                 n = 25          n = 50          n = 75          n = 100
                               Score  Time     Score  Time     Score   Time     Score   Time
Greedy           k = 9          9.07    11      8.20    14      7.77     18      7.38     22
                 k = 6          8.76    12      7.98    17      7.50     22      7.21     27
                 k = 3          8.20    15      7.51    22      7.08     30      6.80     39
                 Scan           7.21    33      6.52    40      6.02     48      5.81     56
                 Pool           7.06    41      6.34    58      5.94     76      5.57     97
Hill-climbing    k = 9          6.70    49      5.99    98      5.54    143      5.29    226
(coarse) (HCc)   k = 6          6.46    65      5.67   124      5.23    221      4.93    279
                 k = 3          6.07    76      5.35   182      4.92    313      4.68    433
                 Scan           5.10   120      4.50   216      4.01    333      3.76    517
                 Pool           5.06   341      4.33   912      3.96   1588      3.74   1833
Hill-climbing    k = 9          5.81   357      4.96  1314      4.53   2656      4.30   4978
(fine) (HCf)     k = 6          5.67   352      4.76  1417      4.30   3277      3.99   5392
                 k = 3          5.56   390      4.60  1561      4.10   3837      3.87   6369
                 Scan           4.69   748      3.92  2878      3.37   6237      3.17  10677
                 Pool           4.30  1997      3.56  7101      3.14  18000      2.87  21746

Results from the pool and scan settings clearly indicate that Greedy and HCc are not as effective at exploring the search space as HCf.

Table 3. SCOP classes and lengths for the tuning set.

SCOP identifier   Length   SCOP class
d1jiwi_             105    beta
d1kpf__             111    alpha+beta
d2mcm__             112    beta
d1bea__             116    alpha
d1cal.2             121    beta
d1jiga_             146    alpha
d1nbca_             155    beta
d1yaca_             204    alpha/beta
d1a8d.2             205    beta
d1aoza2             209    beta

3.2. Comparison with Simulated Annealing

3.2.1. Tuning the Performance of SA

Due to the sensitivity of Simulated Annealing to the specific values of various parameters, we performed a search on a subset of the test proteins in an attempt to maximize the ability of SA to optimize the test structures. Specifically, we attempted to find values for two governing factors: the initial temperature T0 and the number of moves nm. To this end, we selected ten medium length proteins of diverse secondary structural classification (see Table 3), and optimized them over various initial temperatures. The initial temperature that yielded the best average optimized RMSD was T0 = 0.1, and we used this value in all subsequent experiments.

In addition to an initial temperature, when using Simulated Annealing one must select an appropriate annealing schedule. Our annealing schedule decreases the temperature linearly over 3500 cycles. This allows for a smooth cooling of the system. Over the course of these cycles, the algorithm attempts a × (l × n) moves, where a is an empirically determined scaling factor, l is the number of amino acids in the query protein, and n is the number of neighbors per position. Note that for the scan and pool techniques (see Section 2.2), we allow SA three times the number of attempted moves because the total number of neighbors is that much larger. In order to produce run-times comparable to the G, HCc, and HCf schemes, a values of 20, 50, and 100 are employed, respectively. Finally, following recent work 17, we allowed for a temporary increase in the temperature after 150 consecutive rejected moves.

3.2.2. Results

The Simulated Annealing results are summarized in Table 4. As we see in this table, Simulated Annealing consistently outperforms the Greedy scheme. Specifically, the average performance of SA with a = 20 is 15.1% better than that obtained by G.


Table 4. Average values over 276 proteins optimized using Simulated Annealing. Times are in seconds and scores are in Å. Lower is better in both cases.

                       n = 25        n = 50        n = 75        n = 100
                     Score  Time   Score  Time   Score  Time   Score  Time
a = 20    k = 9       7.88    25    6.99    31    6.54    36    6.28    42
          k = 6       7.45    25    6.46    30    6.12    36    6.03    42
          k = 3       6.78    25    6.01    31    5.87    37    5.81    43
          Scan        6.11    74    5.54    92    5.39   109    5.39   128
          Pool        5.93    75    5.84    94    6.00   112    6.13   132
a = 50    k = 9       7.20    34    6.44    48    6.31    65    6.21    80
          k = 6       6.69    34    6.13    49    6.06    64    6.11    80
          k = 3       6.19    35    5.90    51    6.02    67    6.18    81
          Scan        5.68   103    5.48   148    5.50   197    5.48   258
          Pool        5.91   103    6.08   150    6.25   203    6.31   251
a = 100   k = 9       6.76    52    6.34    81    6.31   112    6.28   145
          k = 6       6.31    50    6.14    81    6.18   115    6.26   146
          k = 3       6.05    52    6.21    84    6.34   118    6.40   155
          Scan        5.65   148    5.53   241    5.62   348    5.62   439
          Pool        5.99   156    6.23   265    6.34   352    6.38   447

The values of a in the above table scale the number of moves Simulated Annealing is allowed to make. In our case, the total number of moves is a × (l × n), where l is the length of the protein being optimized and n is the number of neighbors per position.

Table 5. Average values over the longest 138 proteins optimized using Greedy and Hill-climbing with different locking schemes. Times are in seconds and scores are in Å. Lower is better in both cases.

                                 n = 25          n = 50           n = 75            n = 100
                               Score  Time     Score   Time     Score    Time     Score    Time
Greedy           k = 9         11.56    17     10.59     23     10.01      29      9.52      37
                 k = 6         11.15    19     10.29     27      9.77      36      9.52      46
                 k = 3         10.36    24      9.73     38      9.30      51      8.95      68
                 Scan           9.52    50      8.62     62      8.08      76      7.78      91
                 Pool           9.24    64      8.50     95      7.96     126      7.55     164
Hill-climbing    k = 9          8.44    90      7.48    185      6.89     271      6.46     433
(coarse) (HCc)   k = 6          8.11   121      7.16    234      6.63     424      6.18     535
                 k = 3          7.60   142      6.86    347      6.32     602      6.07     833
                 Scan           6.43   213      5.73    394      5.08     625      4.72     982
                 Pool           6.42   651      5.55   1773      4.93    3109      4.74    3581
Hill-climbing    k = 9          7.33   672      6.18   2477      5.55    4992      5.23    9396
(fine) (HCf)     k = 6          7.23   662      5.92   2690      5.31    6238      4.95   10252
                 k = 3          7.02   737      5.88   2974      5.24    7360      4.94   12190
                 Scan           6.03  1376      4.97   5173      4.18   11524      3.94   19818
                 Pool           5.38  3844      4.45  13717      3.82   34960      3.40   42045

These performance comparisons are obtained by averaging the ratios between the two schemes of the corresponding RMSDs over all fragment selection schemes and values of n. The superior performance of Simulated Annealing over Greedy is to be expected, as Greedy lacks any sort of hill-climbing ability, whereas the stochastic nature of Simulated Annealing gives it a chance of overcoming locally optimal solutions. In contrast, both the fine and coarse-locking versions of Hill-climbing outperform SA. More concretely, on the average HCc performs 22.0% better than SA with a = 50, and HCf performs 46.3% better than SA with a = 100.

Analyzing the performance of Simulated Annealing with respect to the value of a, we see that while Simulated Annealing shows an average improvement of 1.7% when a is increased from 20 to 50, the performance deteriorates by an average of 0.07% when a is increased from 50 to 100. This indicates that further increasing the value of a may not lead to performance comparable to that of the Greedy and Hill-climbing schemes.



Table 6. Average values over the longest 138 proteins optimized using Simulated Annealing. Times are in seconds and scores are in Å. Lower is better in both cases.

                       n = 25        n = 50        n = 75        n = 100
                     Score  Time   Score  Time   Score  Time   Score  Time
a = 20    k = 9       9.97    37    8.92    48    8.38    58    8.16    68
          k = 6       9.54    37    8.39    48    7.94    58    7.89    69
          k = 3       8.63    38    7.77    49    7.70    59    7.67    70
          Scan        7.86   113    7.20   145    7.06   176    7.12   210
          Pool        7.76   114    7.80   147    8.07   180    8.18   215
a = 50    k = 9       9.21    53    8.30    80    8.26   109    8.23   135
          k = 6       8.58    54    8.01    81    8.06   108    8.16   136
          k = 3       7.96    55    7.76    83    7.95   112    8.16   138
          Scan        7.31   164    7.18   245    7.25   331    7.24   440
          Pool        7.90   163    8.19   248    8.33   341    8.39   427
a = 100   k = 9       8.57    86    8.35   137    8.41   192    8.43   251
          k = 6       8.17    83    8.19   138    8.23   197    8.36   254
          k = 3       7.90    86    8.20   143    8.30   204    8.37   270
          Scan        7.32   243    7.33   411    7.39   592    7.45   760
          Pool        8.11   260    8.30   455    8.36   606    8.44   778

The values of a in the above table scale the number of moves Simulated Annealing is allowed to make. In our case, the total number of moves is a × (l × n), where l is the length of the protein being optimized and n is the number of neighbors per position.

Also note that in some of the results shown in Table 4, the performance occasionally decreases as the a value increases. This ostensibly strange result comes from the dependence of the cooling process on the number of allowed moves, in which the value of a plays a role. For all entries in Table 4 the annealing schedule cools the system over a fixed number of steps, but the number of moves made varies greatly. Thus, in order to keep the cooling of the system linear, we vary the number of moves allowed before the system reduces its temperature. As a result, different values of a can lead to different randomly chosen optimization paths.

Comparing the performance of the various optimization schemes with respect to the various fragment selection schemes, we see an interesting trend. The performance of SA deteriorates (by 9.6% on the average) when the different length k-mers are used via the pool method, whereas the performance of HCf improves (by 4.4% on average). We are currently investigating the source of this behavior, but one possible explanation is that Simulated Annealing has a bias towards smaller fragments. This bias might arise because the insertion of a bad 3-mer will degrade the structure less than that of a bad 9-mer, and as a result, the likelihood of accepting the former move will be higher (Equation 2). This may reduce the optimizer's ability to effectively utilize the variable length k-mers.

Performance on Longest Sequences In order to gain a better understanding of how the optimization schemes perform, we focus on the longer half of the test proteins. Average RMSDs and times for the Greedy and Hill-climbing schemes are shown in Table 5, and average RMSDs and times for Simulated Annealing are shown in Table 6.

In general, the trends in these tables agree with the trends in the average values over all the proteins. However, one key difference is that the relative improvement of the Hill-climbing schemes over Simulated Annealing is higher, while Greedy actually does worse. For example, comparing G and SA for a = 20, SA performs 15.7% better, as opposed to 15.1% for the full average. Comparing with SA for a = 50, HCc performs 27.0% better, as opposed to 22.0% for the full average. Finally, comparing with SA for a = 100, HCf is 54.6% better, as opposed to 46.3% for the full average. These results suggest that, in the context of a larger search space, a hill-climbing ability is important, and that the hill-climbing abilities of HCc and HCf are better than those of SA.

4. DISCUSSION AND CONCLUSIONS

This paper presents two new techniques for optimizing scoring functions for protein structure prediction. One of these approaches, HCc, using the scan technique, reaches better solutions than Simulated Annealing in comparable time. The performance of SA seems to saturate beyond a = 50, but HCf will make use of an increased time allowance, finding the best solutions of all the examined algorithms. Furthermore, experiments with variations on the number of moves available to the optimizer demonstrate that the Hill-climbing approach makes better use of an expanded search space than Simulated Annealing. Additionally, Simulated Annealing requires the hand-tuning of several parameters, including the total number of moves, the initial temperature, and the annealing schedule. One of the main advantages of using schemes like Greedy and Hill-climbing is that they do not rely on such parameters.

Recently, greedy techniques have been applied to problems similar to the one this paper addresses. The first problem is to determine a set of representative fragments for use in decoy structure construction 15, 9. The second problem is to reconstruct a native protein fold given such a set of representative fragments 19, 20. The greedy approaches used for both these problems traverse the query sequence in order, inserting the best found fragment for each position. As an extension, the algorithms build multiple structures simultaneously in the search for a better structure. While such approaches have the ability to avoid local minima, they lack an explicit notion of hill-climbing.

The techniques this paper describes could be modified to solve either of the above two problems. To build a representative set of fragments, one could track the frequency of fragment use within multiple Hill-climbing optimizations of different proteins. This would yield a large set of fragments, which could serve as input to a clustering algorithm. The centroids of these clusters could then be used in decoy construction. In order to construct a native fold from these fragments, one need only restrict the move options of Hill-climbing to the representative set. We are currently working on adapting our algorithms to solve these problems.

ACKNOWLEDGMENTS

This work was supported in part by NSF EIA-9986042, ACI-0133464, IIS-0431135, and NIH RLM008713A; the Digital Technology Center at the University of Minnesota; and by the Army High Performance Computing Research Center (AHPCRC) under the auspices of the Department of the Army, Army Research Laboratory (ARL) under Cooperative Agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.

References

1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402, 1997.

2. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 2000.

3. J. Chandonia, G. Hon, N. S. Walker, L. Lo Conte, P. Koehl, M. Levitt, and S. E. Brenner. The ASTRAL compendium in 2004. Nucleic Acids Research, 2004.

4. C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. Proceedings of the 19th Design Automation Conference, 1982.

5. A. Heger and L. Holm. Picasso: generating a covering set of protein family profiles. Bioinformatics, 2001.

6. S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915-10919, 1992.

7. K. Karplus, R. Karchin, J. Draper, J. Casper, Y. Mandel-Gutfreund, M. Diekhans, and R. Hughey. Combining local-structure, fold-recognition, and new fold methods for protein structure prediction. PROTEINS: Structure, Function and Genetics, 2003.

8. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.

9. R. Kolodny, P. Koehl, L. Guibas, and M. Levitt. Small libraries of protein fragments model native protein structures accurately. Journal of Molecular Biology, 323:297-307, 2002.

10. J. Lee, S. Kim, K. Joo, I. Kim, and J. Lee. Prediction of protein tertiary structure using PROFESY, a novel method based on fragment assembly and conformational space annealing. PROTEINS: Structure, Function and Bioinformatics, 2004.

11. M. Marti-Renom, M. Madhusudhan, and A. Sali. Alignment of protein sequences by their profiles. Protein Science, 13:1071-1087, 2004.

12. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1092, 1953.

13. D. Mittelman, R. Sadreyev, and N. Grishin. Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19(12):1531-1539, 2003.

14. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536-540, 1995.

15. B. H. Park and M. Levitt. The complexity and accuracy of discrete state models of protein structure. Journal of Molecular Biology, 249:493-507, 1995.

16. H. Rangwala and G. Karypis. Profile based direct kernels for remote homology detection and fold recognition. Bioinformatics, 21:4239-4247, 2005.

17. C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker. Protein structure prediction using Rosetta. Methods in Enzymology, 2004.

18. K. T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 1997.

19. P. Tuffery and P. Derreumaux. Dependency between consecutive local conformations helps assemble protein structures from secondary structures using Go potential and greedy algorithm. Protein Science, 61:732-740, 2005.

20. P. Tuffery, F. Guyon, and P. Derreumaux. Improved greedy algorithm for protein structure reconstruction. Journal of Computational Chemistry, 26:506-513, 2005.

21. G. Wang and R. L. Dunbrack Jr. Scoring profile-to-profile sequence alignments. Protein Science, 13:1612-1626, 2004.


TRANSMEMBRANE HELIX AND TOPOLOGY PREDICTION USING HIERARCHICAL SVM CLASSIFIERS AND AN ALTERNATING GEOMETRIC SCORING FUNCTION

Allan Lo 1,2, Hua-Sheng Chiu 3, Ting-Yi Sung 3, Wen-Lian Hsu 3,*
1 Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
2 Department of Life Sciences, National Tsing Hua University, Hsinchu, Taiwan
3 Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
Email: {allanlo, huasheng, tsung, hsu}@iis.sinica.edu.tw

Motivation: A key class of membrane proteins contains one or more transmembrane (TM) helices, traversing the membrane lipid bilayer. Various properties such as the length, arrangement, and topology or orientation of TM helices are closely related to a protein's functions. Although a range of methods have been developed to predict TM helices and their topologies, no single method consistently outperforms the others. In addition, topology prediction has much lower accuracy than helix prediction, and thus requires continuous improvement. Results: We develop a method based on support vector machines (SVM) in a hierarchical framework to predict TM helices first, followed by their topology. By partitioning the prediction problem into two steps, specific input features can be selected and integrated in each step. We also propose a novel scoring function for topology models based on the membrane protein folding process. When benchmarked against other methods in terms of performance, our approach achieves the highest scores at 86% in helix prediction (Q2) and 91% in topology prediction (TOPO) for the high-resolution data set, resulting in an improvement of 6% and 14% in their respective categories over the second best method. Furthermore, we demonstrate the ability of our method to discriminate between membrane and non-membrane proteins, with an accuracy higher than 99%. When tested on a small set of newly solved structures of membrane proteins, our method overcomes some of the difficulties in predicting TM helices by incorporating multiple biological input features.

1. INTRODUCTION

Integral membrane proteins constitute a wide and important class of biological entities that are crucial for life, representing about 25% of the proteins encoded by several genomes 1-3. They also play a key role in various cellular processes including signal and energy transduction, cell-cell interactions, and transport of solutes and macromolecules across membranes 4. Despite their biological importance, the proportion of available high-resolution structures is exceedingly limited, at about 0.5% of all solved structures 5, compared to that of globular proteins deposited in the Protein Data Bank (PDB) 6. In the absence of a high-resolution structure, an accurate structural model is important for the functional annotation of membrane proteins. A membrane protein structural model defines the number and location of transmembrane helices (TMHs) and the orientation or topology of the protein relative to the lipid bilayer. However, experimental approaches for identifying membrane protein structural models are time-consuming 7. Therefore, bioinformatics development in sequence-based prediction methods is valuable for elucidating the structural genomics of membrane proteins.

Many different methods have been developed to predict structural models of transmembrane helix (TMH) proteins. Earlier approaches relied on physico-chemical properties such as hydrophobicity 8-10 to identify TMH regions. Recently, more advanced methods using hidden Markov models 3, 11 and neural networks 12 have been developed, and they have achieved significant improvements in prediction accuracy. Although several methods are available, none of them have integrated multiple biological input features in a machine-learning framework. Furthermore, an evaluation study 13 concluded that current accuracies were over-estimated and that topology prediction remained a major challenge.

In this paper, we propose a machine-learning approach called SVMtmh (SVM for transmembrane helix prediction) in a hierarchical classification framework to predict membrane protein structure. We divide the prediction task into two successive steps by using a tertiary classifier consisting of two hierarchical binary classifiers. The number and location of TMHs are predicted in the first step, followed by the prediction of the topology in the second step.

* Corresponding author.


Our key contributions are as follows: 1) By decomposing the prediction into two steps, we reduce the complexity involved in each step, and biological input features relevant to each classifier can be applied. 2) We select multiple input features, including those based on different structural parts of a TMH protein, and integrate them to predict helices. 3) For topology prediction, we propose a novel topology scoring function based on the current understanding of membrane protein insertion. To the best of our knowledge, the proposed topology scoring function is the first model to capture the relationship between topogenic factors and topology formation.

The performance of SVMtmh is compared with other methods across several benchmark data sets, and SVMtmh achieves a marked improvement in both helix and topology prediction. Specifically, SVMtmh achieves the highest score at 91% for topology prediction (TOPO) and 86% for helix prediction (Q2) in the high-resolution data set, an improvement of 14% and 6%, respectively, compared to the second highest score. In addition, SVMtmh yields the lowest false positive rate at 0.5% when tested for discrimination between membrane and non-membrane proteins. Finally, we apply SVMtmh to analyze a newly solved structure of bacteriorhodopsin (bR) and show that our method can provide the correct structural model, which is in close agreement with the structure obtained through X-ray crystallography. We also provide a detailed analysis of the comparison with other methods and conclude with a summary and directions for future work.

2. METHODS

2.1. System architecture

The proposed approach uses hierarchical binary classifiers to predict the helices and topology of an integral membrane protein. We represent the problem of membrane protein structure prediction as a multiple classification process and solve it in two steps using hierarchical SVM classifiers. The overall framework is described in this section.

Each residue of a TMH protein can be regarded as belonging to one of the three classes defined by its position with respect to the membrane: inner (i) loop, transmembrane helix (H), and outer (o) loop. The aim of predicting membrane protein structures is to identify the correct class of each residue. Since there are three classes for a protein sequence, we design a tertiary classifier, which consists of two binary classifiers in a hierarchical structure. An overview of the system architecture is shown in Fig. 1.

[Figure 1: two-panel flow diagram.]

Fig. 1. Overview of the SVMtmh system architecture. In Step 1 (helix prediction), peptides extracted from the query protein with a sliding window w1 are feature-encoded and scaled; H/~H residues are predicted using the input features (amino acid composition (AA), di-peptide composition (DP), hydrophobicity scale (HS), and amphiphilicity (AM)); the predictions of the two best feature sets are combined into a consensus; and TMH candidates are determined and assembled into predicted TMH segments. In Step 2 (topology prediction), the non-helical segments identified in Step 1 are encoded with amino acid composition (AA) over a sliding window w2, i/o residues are predicted in the non-helical segments, and the alternating geometric scoring function determines the final TMH topology.


In Step 1, TM and non-TM residues (H/~H) are predicted. We use different feature sets (Section 3.3) to train our SVM classifiers and then combine the results from the best two combinations into a consensus prediction, which is screened for TMH candidates that are subsequently assembled into physical TMH segments. In Step 2, the remaining non-helix residues (~H) from Step 1 are classified as either inner or outer (i/o) residues. To determine whether the topology of the protein starts with an inner (i) or an outer (o) loop, we apply the proposed alternating geometric scoring function. Briefly, the classification framework is performed in two steps, each of which uses an associated binary classifier (H/~H, i/o).

We use sliding windows to partition a protein sequence into peptides. The optimal length of the sliding window, w, is incrementally searched from 3 to 41 for both classifiers. The optimal window sizes, w1 for the first classifier and w2 for the second classifier, are found to be 21 and 29, respectively.

2.2. Training and testing

We train our classifiers with the LIBSVM package 14, and the Radial Basis Function (RBF) is chosen as the kernel function. The associated parameters (C, γ) are optimized at (1.8661, 0.1250). The cost weight is adjusted to avoid under-prediction in unbalanced data sets. Since the helix and non-helix classes make up about 30% and 70% of the data set, respectively, we set the cost weight at 7/3 for the first classifier. Similarly, we set the cost weight at 1/1 for the second classifier to reflect the proportion of the inner and outer loop classes in the data set. Ten-fold cross-validation is used to evaluate our method. The data set is first divided into ten subsets of equal size. Each subset is in turn tested using the classifier trained on the remaining nine subsets. Since each residue of the whole data set is predicted exactly once, the overall prediction accuracy is the percentage of correctly predicted residues. The values in the feature vectors are scaled in the range of [0, 1].
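As an illustration, the first-step classifier could be configured as below. The paper uses the LIBSVM package 14 directly; scikit-learn's SVC is a wrapper around the same library, so this is an approximation with the reported parameter values, and the label convention (1 = helix, 0 = non-helix) is our assumption.

```python
from sklearn.svm import SVC

# RBF kernel with (C, gamma) = (1.8661, 0.1250) and a 7/3 cost weight for
# the minority helix class, mirroring Section 2.2.
helix_classifier = SVC(kernel="rbf", C=1.8661, gamma=0.1250,
                       class_weight={1: 7, 0: 3})
# helix_classifier.fit(X_train, y_train)  # feature vectors scaled to [0, 1]
```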

2.3. Helix prediction

2.3.1. Feature selection and extraction

The choice of relevant features is critical in any prediction model. Thus, in the present study we select features that capture important relationships between a sequence and its structure. TMH proteins are subject to the global constraints of the lipid bilayer since they contain membrane-spanning helices 15. Additionally, TM helices can be divided into distinct local structural parts, including the core and end regions, based on the propensity of amino acids 3. Fig. 2 shows the selection of features to capture both the global and local information of a TM helix. The representation of each feature is described below:

Fig. 2. Transmembrane (TM) helix structure in the lipid bilayer: helix core region (~15 Å) and helix end regions (~7.5 Å each), with loops of flexible length connecting adjacent TM helices. The helix core region is surrounded by an aliphatic hydrocarbon layer about 30 Å in thickness; the helix end regions are embedded in the water-membrane interfaces of about 15 Å. We select global and local input features to capture the information contained in a TM helix. Global input features: amino acid (AA) and di-peptide (DP) compositions. Local input features: hydrophobicity scale (HS) 16 and amphiphilicity (AM) 17.


1. Amino acid composition (AA): This basic feature enables us to capture the global information of a TM helix. Each residue of a peptide is represented by a vector of length 20 which contains a 1 in the position corresponding to the amino acid type of the residue, and 0 otherwise.

2. Di-peptide composition (DP): We consider the coupling effect of two adjacent residues, which contains global information along the sequence. This feature is represented by the pair-residue occurrence probability P(X,Y), where (X,Y) is an ordered pair of amino acids X followed by Y. The vector space of this feature comprises 400 dimensions. (Both global encodings are sketched in the code after this list.)

3. Helix core feature: Hydrophobicity is used to capture local information within the core region of a TM helix, where it is a major stabilizing factor 16. We select a hydrophobicity scale (HS) recently determined by membrane insertion experiments 16. Each residue is represented by a vector of length 20 that has a real value corresponding to its hydrophobicity.

4. Helix ends feature: The end regions of a TM helix near the membrane-water interface exhibit a preference for aromatic and polar residues, as shown in amino acid propensity studies 16, 17. We select an amphiphilicity (AM) index 17 as an input feature to capture the local information contained in the helix-capping ends. Each residue is represented by a vector of length 20 that has a real value corresponding to its amphiphilicity.
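The two global encodings might look as follows in Python; the exact normalization of the di-peptide probabilities is our reading of the text rather than the authors' specification.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def aa_composition(peptide):
    """AA feature: one length-20 one-hot vector per residue."""
    vectors = np.zeros((len(peptide), 20))
    for i, aa in enumerate(peptide):
        vectors[i, AA_INDEX[aa]] = 1.0
    return vectors

def dipeptide_composition(peptide):
    """DP feature: 400-dimensional vector of ordered-pair occurrence
    probabilities P(X, Y) over adjacent residues."""
    counts = np.zeros((20, 20))
    for x, y in zip(peptide, peptide[1:]):
        counts[AA_INDEX[x], AA_INDEX[y]] += 1.0
    total = counts.sum()
    return (counts / total).ravel() if total else counts.ravel()
```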

2.3.2. Determination of TMH candidates

To identify potential TMH regions, it is necessary to determine if there are any TMH candidates among our initial prediction results. We do this by modifying the algorithm proposed in the THUMBUP program18 to determine TMH candidates and assemble them into physical TMH segments.

Step 1: Filtering

We define a cut-off value, l_min, as the minimal length for a TMH candidate. A predicted helix segment is a TMH candidate if its length is at least l_min; otherwise, it is converted to a non-helix segment. Steps 2 and 3 describe the assembly of TMH candidates.

Step 2: Extension

An optimal TMH length, l_opt, is set at 21 to reflect the thickness of the hydrocarbon core of a lipid bilayer 19. If the length of a TMH candidate is between l_min and l_opt, it is extended to l_opt from its N- and C-termini. Two or more TMH candidates are merged if they overlap after the extension.

Step 3: Splitting

We define l_max as the cut-off value for the length at which a TMH candidate is split. A TMH candidate whose length is greater than or equal to l_max is split into two helices, starting from its N- and C-termini, with the loop in the center.

We optimize l_min and l_max on the training data set (Section 3.1). The optimized values of l_min and l_max for the best prediction performance are 9 and 38, respectively. The three post-processing steps are sketched in the code below.
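A compact sketch of the three steps, operating on (start, end) helix runs (0-based, inclusive), is given below; the merge and split bookkeeping is our assumption about details the text leaves open.

```python
L_MIN, L_OPT, L_MAX = 9, 21, 38  # optimized cut-offs from Section 2.3.2

def assemble_tmh(segments, seq_len):
    """Filter, extend, and split predicted helix runs into TMH segments."""
    # Step 1: filtering -- drop candidates shorter than L_MIN.
    candidates = [(s, e) for s, e in segments if e - s + 1 >= L_MIN]

    # Step 2: extension -- grow short candidates toward L_OPT from both
    # termini, merging any candidates that come to overlap.
    grown = []
    for s, e in candidates:
        while e - s + 1 < L_OPT:
            ns, ne = max(0, s - 1), min(seq_len - 1, e + 1)
            if (ns, ne) == (s, e):
                break  # cannot extend further within the sequence
            s, e = ns, ne
        if grown and s <= grown[-1][1]:
            grown[-1] = (grown[-1][0], max(grown[-1][1], e))  # merge overlap
        else:
            grown.append((s, e))

    # Step 3: splitting -- a candidate of length >= L_MAX becomes two
    # helices taken from its N- and C-termini, leaving a central loop.
    final = []
    for s, e in grown:
        if e - s + 1 >= L_MAX:
            final.extend([(s, s + L_OPT - 1), (e - L_OPT + 1, e)])
        else:
            final.append((s, e))
    return final
```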

2.4. Topology prediction

2.4.1. Input feature

Using the second classifier, we predict the topology label (i/o) of each non-helix residue from the results of the first classifier (H/~H). Amino acid composition is employed as the input feature. The encoding scheme follows the same procedure outlined in the helix prediction section.

2.4.2. Alternating geometric scoring function

The purpose of predicting the topology of a TMH protein is to determine the orientation of the protein with respect to the membrane. A TMH protein follows special constraints on its topology: it always starts with an inner (i) loop or an outer (o) loop, and the loops must alternate in order to connect the TM helices. Therefore, the problem of predicting the topology of a TMH protein is reduced to predicting the topology of the first loop located at the N-terminus.

There is a growing body of evidence that the final topology is influenced by multiple signals distributed along the entire protein in the loop segments, including the charge bias, loop size, and folding of the N-terminal loop domain 20. Furthermore, the widely accepted two-stage model for membrane protein folding suggests that the final topology of a membrane protein is established in the early stages of membrane insertion 21. These biological phenomena form the basis of our assumptions about topology models. First, we assume that topology formation is a result of contributing signals present in the various loop segments. Second, signals embedded in the loop segments near the N-terminus are more likely to be a factor in the formation of topology, since they are inserted into the membrane at an earlier time. Based on these assumptions, we develop a novel topology scoring function in which the topogenic contribution of a loop segment diminishes with its distance from the N-terminus.

In the proposed topology scoring function, the contribution of the signals in a loop segment decreases with its distance from the N-terminus according to a geometric series. Given a transmembrane protein that has n non-helical segments s_j (1 ≤ j ≤ n and n, j ∈ ℕ) predicted in the first step, for each s_j of length |s_j| we define two ratios, R_i and R_o, to represent the predicted ratios of the topology labels i and o, respectively:

    R_i(j) = (\# \text{ of "inside" residues} / |s_j|) \times 100\%    (1)

    R_o(j) = (\# \text{ of "outside" residues} / |s_j|) \times 100\%    (2)

where R_i + R_o = 100%. To determine the protein topology, we define two topology scores, TS_i and TS_o, where TS_i is for an N-terminal loop on the inside of the membrane and TS_o is for the outside:

    TS_i = \sum_{1 \le j \le n} W(j) \times [\, \alpha R_i(j) + (1-\alpha) R_o(j) \,]    (3)

    TS_o = \sum_{1 \le j \le n} W(j) \times [\, (1-\alpha) R_i(j) + \alpha R_o(j) \,]    (4)

    \alpha = \begin{cases} 1, & \text{if } j \text{ is odd} \\ 0, & \text{if } j \text{ is even} \end{cases}    (5)

    W(j) = 1 / b^{(j-1) \cdot EI}, \quad b, EI \in \mathbb{R}    (6)

where b and EI denote the base and the exponent increment, respectively. W(j) is a geometric function which assigns weights to the R_i(j) and R_o(j) terms. If TS_i > TS_o, then the topology of the N-terminal loop is inside; otherwise, the topology is outside.

For the calculation of topology scores, the geometric scoring function alternates between the inner (i) and outer (o) loops to take into account the alternating nature of the connecting loops. Fig. 3 illustrates the calculation of the alternating geometric scoring function for an example protein.
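The scoring function is small enough to state in full. The sketch below implements Equations 1-6 (with the ratios kept as fractions rather than percentages, which does not change the comparison) and reproduces the Fig. 3 example; the segment representation as i/o label strings is our own choice.

```python
def topology_scores(segments, b=1.6, ei=1.0):
    """Alternating geometric scoring of the non-helical segments, ordered
    from the N-terminus. Each segment is a string of predicted i/o labels.
    Returns (TS_i, TS_o); the larger score decides the N-terminal topology."""
    ts_i = ts_o = 0.0
    for j, seg in enumerate(segments, start=1):
        r_i = seg.count("i") / len(seg)        # R_i(j)
        r_o = seg.count("o") / len(seg)        # R_o(j)
        w = 1.0 / b ** ((j - 1) * ei)          # W(j) = 1 / b^((j-1)*EI)
        alpha = 1.0 if j % 2 == 1 else 0.0     # alternates with segment parity
        ts_i += w * (alpha * r_i + (1 - alpha) * r_o)
        ts_o += w * ((1 - alpha) * r_i + alpha * r_o)
    return ts_i, ts_o

# The Fig. 3 example: R_i = 2/5, 4/5, 3/5 over the three segments.
ts_i, ts_o = topology_scores(["ooiio", "iiiio", "iiioo"])
# ts_i ~ 0.7594 and ts_o ~ 1.2563, so the N-terminal loop is outside (o).
```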

3. RESULTS AND DISCUSSION

3.1. Data sets

1. Low-resolution TMH proteins: We train and perform ten-fold cross-validation on a collection of low-resolution proteins compiled by Möller et al. 22. We select 145 proteins of good reliability from a set of 148 non-redundant proteins. We manually validate this data set using annotations from SWISS-PROT release 49.0 23 and further remove two proteins because they have no membrane protein annotations. The final data set contains 143 proteins for which low-resolution topology models are available. This entire data set is also used to train our model for testing on the following three data sets.

2. High-resolution TMH proteins: We use a collection of 36 high-resolution TMH proteins from the PDB compiled by Chen et al. 13 and obtain topology information for 35 out of the 36 proteins. We validate this data set using annotations from SWISS-PROT release 49.0 23 and update the topologies of two proteins.

3. Soluble proteins: A collection of 616 high-resolution soluble proteins from PDB compiled by Chen et al.13 is used to test for discrimination between membrane and soluble proteins.

4. Newly solved TMH proteins: Four newly solved high-resolution TMH proteins24 are used as an independent test set.

[Figure 3: an example protein with two predicted helices; the N-terminal, internal, and C-terminal non-helical segments each contain a mixture of predicted i and o residues.]

Fig. 3. An example of evaluating a TMH protein's topology with the alternating geometric scoring function. Helices are predicted in the first step (Section 2.3). A predicted loop segment can contain more than one type of topology label. We use the proposed alternating geometric scoring function to determine the final topology. In this example, R_i(1) = 2/5, R_o(1) = 3/5, R_i(2) = 4/5, R_o(2) = 1/5, R_i(3) = 3/5, and R_o(3) = 2/5. Given the set of optimal values (b, EI) = (1.6, 1.0) indicated in Section 3.6, TS_i = 1 × R_i(1) + 1/(1.6^{1.0}) × R_o(2) + 1/(1.6^{2.0}) × R_i(3) = 0.7594. Similarly, TS_o = 1.2563. TS_o > TS_i; therefore, the final topology for the N-terminal loop is outside (o).


3.2. Evaluation metrics

There are two sets of evaluation measures for the TMH prediction: per-segment and per-residue accuracies13. Per-segment scores indicate how accurately the location of a TMH region is predicted and per-residue scores report how well each residue is predicted. Table 1 lists the per-segment and per-residue metrics used in this paper.

In the calculation of per-segment scores, two issues must be addressed when counting a helix as correctly predicted. First, a minimal overlap with observed helix segments must be defined. For this, we use a stricter criterion which requires at least 9 overlapping residues; an evaluation study by Chen et al.13 used a more relaxed minimal overlap of only 3 residues. Second, we do not allow an overlapping observed helix to be counted twice. We use the following examples to illustrate these two issues (H = Helix):

[Schematic alignments of an observation (two observed helices) against Predictions 1-3 (H = helix residue); the exact layout was lost in extraction. The three cases are described below.]

Prediction 1 achieves 100% accuracy if the minimal overlap is 3 residues; if the minimal overlap is 9 residues, Prediction 1 achieves 50% accuracy. Prediction 2 achieves 50% accuracy because its single long helix already overlaps the first observed helix. Prediction 3 achieves 100% accuracy if the minimal overlap is 3 residues, but the second predicted helix is an over-prediction, since we count an overlapping observed helix only once. Prediction 3 achieves 50% accuracy if the minimal overlap is 9 residues because the first predicted helix does not satisfy the minimal overlap requirement; in addition, the second predicted helix remains an over-prediction and thus is not counted.
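The two counting rules are easy to state precisely in code. The following sketch, with hypothetical (start, end) residue intervals as inputs, counts a predicted helix as correct only when it overlaps a not-yet-matched observed helix by at least the minimal number of residues:

def count_correct_segments(observed, predicted, min_overlap=9):
    """Count predicted helices that overlap an unmatched observed helix
    by at least min_overlap residues; each observed helix may be
    matched at most once."""
    matched = set()   # indices of observed helices already used
    correct = 0
    for p_start, p_end in predicted:
        for k, (o_start, o_end) in enumerate(observed):
            if k in matched:
                continue   # an observed helix is counted only once
            overlap = min(p_end, o_end) - max(p_start, o_start) + 1
            if overlap >= min_overlap:
                matched.add(k)
                correct += 1
                break
    return correct

obs = [(3, 15), (19, 28)]
pred = [(3, 15), (20, 22)]   # second helix overlaps by only 3 residues
print(count_correct_segments(obs, pred, min_overlap=3))  # 2
print(count_correct_segments(obs, pred, min_overlap=9))  # 1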

3.3. Performance of input feature combinations for helix prediction

We test the performance of different input feature combinations for the first classifier. The following combinations are considered: 1) AA only; 2) AA and any one of DP, HS, and AM; 3) AA and any two of DP, HS, and AM; and 4) all four features. We also construct a consensus prediction from the two top-performing combinations through probability estimation using LIBSVM25. The value of the estimated probability for each residue corresponds to the confidence in its predicted class. In the case of disagreement between the predicted classes, the consensus prediction takes the class from the prediction with the higher estimated probability.
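As an illustration of the consensus step, the sketch below uses scikit-learn's SVC with probability=True as a stand-in for LIBSVM's probability estimates; the feature matrices and labels are randomly generated placeholders, not the actual AA/DP/HS/AM encodings.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X5, X6 = rng.normal(size=(200, 20)), rng.normal(size=(200, 24))
y = rng.integers(0, 2, size=200)   # 1 = helix residue, 0 = non-helix

clf5 = SVC(probability=True).fit(X5, y)   # stand-in for combination 5
clf6 = SVC(probability=True).fit(X6, y)   # stand-in for combination 6

p5, p6 = clf5.predict_proba(X5), clf6.predict_proba(X6)
c5, c6 = p5.argmax(axis=1), p6.argmax(axis=1)
# On disagreement, take the class from the more confident classifier.
consensus = np.where(c5 == c6, c5,
                     np.where(p5.max(axis=1) >= p6.max(axis=1), c5, c6))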

Table 1. Evaluation metrics used in this work. Per-segment metrics include Q_ok, Q_htm^%obs, Q_htm^%prd and TOPO. Per-residue metrics include Q_2, Q_2T^%obs and Q_2T^%prd. N_prot is the number of proteins in a data set. We follow the same performance measures proposed by Chen et al.13

Q_ok (percentage of proteins in which all TMH segments are predicted correctly): (1/N_prot) Σ_i δ_i × 100%, where δ_i = 1 if Q_htm^%obs = Q_htm^%prd = 100% for protein i, and 0 otherwise.

Q_htm^%obs (TMH segment recall): (number of correctly predicted TM segments in the data set / number of TM segments observed in the data set) × 100%.

Q_htm^%prd (TMH segment precision): (number of correctly predicted TM segments in the data set / number of TM segments predicted in the data set) × 100%.

TOPO (percentage of correctly predicted topology): (number of proteins with correctly predicted topology / N_prot) × 100%.

Q_2 (percentage of correctly predicted TMH residues): (1/N_prot) Σ_i (number of residues predicted correctly in protein i / number of residues in protein i) × 100%.

Q_2T^%obs (TMH residue recall): (number of residues correctly predicted in TM helices / number of residues observed in TM helices) × 100%.

Q_2T^%prd (TMH residue precision): (number of residues correctly predicted in TM helices / number of residues predicted in TM helices) × 100%.


Table 2 shows the performance of combinations of input features and the consensus prediction. Combination 5 achieves the highest score for Q_ok at 71.9% and performs consistently well in the other per-segment and per-residue measures. Combination 6 has a strikingly high Q_2T^%obs score of 85.9%. The purpose of consensus prediction is to maximize the benefits of both combinations. In fact, the consensus approach increases the Q_ok score of Combination 6 by 1.5%, while the Q_2T^%obs score only decreases by 0.3%. Compared to Combination 5, the consensus has a decrease in Q_ok of 1.4%, but an increase in Q_2T^%obs of 3.8%. In addition, the consensus approach also scores the highest for Q_2 at 89.1%. The consensus approach is selected as our best model for comparison with other approaches.

Table 2. Performance of input feature combinations and the consensus method. Input features: AA (amino acid composition), DP (di-peptide composition), HS (hydrophobicity scale)16 and AM (amphiphilicity)17.

                            Per-segment (%)                     Per-residue (%)
No.  Input Feature(s)   Q_ok   Q_htm^%obs  Q_htm^%prd   Q_2   Q_2T^%obs  Q_2T^%prd
1    AA                 71.2   93.8        93.9         89.1  82.9       83.0
2    AA+DP              69.8   94.0        93.8         88.9  81.9       83.2
3    AA+HS              71.2   92.8        94.2         89.1  81.9       84.0
4    AA+AM              70.5   93.6        93.6         89.1  83.0       82.9
5    AA+DP+HS           71.9   93.6        94.2         89.0  81.8       83.7
6    AA+DP+AM           69.0   93.4        94.0         89.0  85.9       80.6
7    AA+HS+AM           68.3   93.3        94.2         88.8  79.8       84.4
8    AA+DP+HS+AM        69.1   92.3        95.4         89.0  80.9       84.3
9    Consensus (5+6)    70.5   93.2        94.9         89.1  85.6       81.4

3.4. Performance on high- and low-resolution data sets

SVMtmh is compared to other methods for the high- and low-resolution data sets in Table 3. For the low-resolution set, SVMtmh ranks the highest among all the compared methods for the per-segment measures TOPO, Q_ok, and Q_htm^%prd at 84%, 71%, and 95%, respectively. Specifically, SVMtmh improves TOPO by 5% over the second best method for the low-resolution data set. For the high-resolution set, most notably, SVMtmh has the highest score at 91% for TOPO, a 14% improvement over the second best method. Another marked improvement is also observed for the high-resolution set in Q_2, for which SVMtmh obtains the highest score at 86%, compared to the second best methods at 80%. Generally, SVMtmh performs 3% to 12% better for the high-resolution set than for the low-resolution set in terms of per-segment scores. Meanwhile, for per-residue scores, the accuracy for the high- and low-resolution data sets is similar, in the range of 81% to 90%. The shaded area in Table 3 denotes the four top-performing approaches, which are selected to further predict newly solved membrane protein structures (Section 3.7).

Table 3. Performance of prediction methods for the low- and high-resolution data sets. Per-segment and per-residue scores of all compared methods are taken from an evaluation by Chen et al.13 TOPO scores for the high-resolution data set are re-evaluated due to the update of topology information. The shaded area outlines the four top-performing methods. Note that we do not have cross-validation results for the other methods; therefore, their accuracies might be over-estimated. In addition, we use a minimal overlap of 9 residues whereas Chen et al.13 used only 3 residues. Methods are sorted by their Q_ok values for the low-resolution data set.


[Table 3 body: per-segment (Q_ok, Q_htm^%obs, Q_htm^%prd, TOPO) and per-residue (Q_2, Q_2T^%obs, Q_2T^%prd) scores for each method on both data sets; the numeric entries could not be reliably recovered from the scanned source. The four shaded top-performing methods are SVMtmh, HMMTOP2, TMHMM2 and PHDpsiHtm08; the remaining methods, in decreasing order of low-resolution Q_ok, are PRED-TMR, PHDhtm08, PHDhtm07, SOSUI, TopPred2, DAS, Ben-Tal, Wolfenden, WW, GES, Eisenberg, KD, Heijne, Hopp-Woods, Sweet, Av-Cid, Roseman, Levitt, Nakashima, A-Cid, Lawson, Radzicka, Bull-Breese, EM and Fauchere.]



3.5. Discrimination between soluble and membrane proteins

To assess our method's ability to discriminate between soluble and membrane proteins, we apply SVMtmh to the soluble protein data set. A cut-off length is chosen as the minimum TMH length: any protein that does not have at least one predicted TMH exceeding this length is classified as a soluble protein. We calculate the false positive (FP) rate for the soluble protein set, where a false positive is a soluble protein falsely classified as a membrane protein. Similarly, we calculate the false negative (FN) rates for both the high-resolution (FN_high) and low-resolution (FN_low) membrane protein sets using the chosen cut-off length. Clearly, the cut-off length is a trade-off between the FP and FN rates; therefore, the cut-off length selected must minimize FP + FN_high + FN_low. Fig. 4 shows the FP and FN rates as a function of cut-off length. The cut-off length of 18, which minimizes the sum of all errors, is used to discriminate between soluble and membrane proteins. Table 4 shows the results of our method compared to the other methods. SVMtmh distinguishes soluble from membrane proteins with FP and FN_low rates of less than 1% and an FN_high rate of 5.6%. In general, most advanced methods such as TMHMM2 (ref. 3) and PHDpsiHtm08 (ref. 12) achieve better accuracies than simple hydrophobicity scale methods such as Kyte-Doolittle (KD)8 and White-Wimley (WW)10.
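The cut-off selection amounts to a one-dimensional scan. Below is a minimal sketch, assuming per-protein lists of longest-predicted-TMH lengths as inputs (hypothetical names, not the actual SVMtmh outputs):

def best_cutoff(soluble, tmh_high, tmh_low, cutoffs=range(1, 31)):
    """Pick the cut-off length minimizing FP + FN_high + FN_low.

    Each argument is a list holding, per protein, the length of its
    longest predicted TMH. A protein is called 'membrane' if that
    length reaches the cut-off."""
    def rate(lengths, is_soluble, c):
        # soluble proteins are FPs if a predicted TMH reaches c;
        # membrane proteins are FNs if no predicted TMH reaches c
        bad = [(l >= c) if is_soluble else (l < c) for l in lengths]
        return 100.0 * sum(bad) / len(lengths)

    return min(cutoffs, key=lambda c: rate(soluble, True, c)
                                      + rate(tmh_high, False, c)
                                      + rate(tmh_low, False, c))

# Example: cutoff = best_cutoff(sol_lengths, high_lengths, low_lengths)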

3.6. Effect of alternating geometric scoring function on topology accuracy

We characterize the dependency of topology accuracy (TOPO) on the values of the base (b) and the exponent increment (EI) used in the alternating geometric scoring function for the low-resolution data set.

Fig. 4. The false positive and false negative rates as a function of cut-off length. The x-axis: cut-off length; the y-axis: false positive and false negative rates (%). Discrimination between soluble proteins and membrane proteins is based on the chosen cut-off length. The cut-off length of 18 (dashed line) minimizes the sum of the three error rates (FP + FN_low + FN_high).

Table 4. Confusion between soluble and membrane proteins. The results of all compared methods are taken from Chen et al.13 False positive rates for soluble proteins are given in the second column; the third and fourth columns report false negative rates for the low- and high-resolution membrane protein sets. Methods are sorted by false positive rate.

Method         FP (%)   FN, low-res (%)   FN, high-res (%)
SVMtmh         0.5      0                 5.6
TMHMM2         1        4                 8
SOSUI          1        4                 8
PHDpsiHtm08    2        8                 3
PHDhtm08       2        23                19
Wolfenden      2        13                39
Ben-Tal        3        4                 11
PHDhtm07       3        16                14
PRED-TMR       4        1                 8
HMMTOP2        6        1                 0
TopPred2       10       11                8
DAS            16       0                 0
WW             32       0                 0
GES            53       0                 0
Eisenberg      66       0                 0
KD             81       0                 0
Sweet          84       0                 0
Hopp-Woods     89       0                 0
Nakashima      90       0                 0
Heijne         92       0                 0
Levitt         93       0                 0
Roseman        95       0                 0
A-Cid          95       0                 0
Av-Cid         95       0                 0
Lawson         98       0                 0
FM             99       0                 0
Fauchere       99       0                 0
Bull-Breese    100      0                 0
Radzicka       100      0                 0

Fig. 5 shows the relationships between topology accuracy, coded by colours, and the variables in the scoring function. The white circles indicate the highest topology accuracy, at about 84%, and their corresponding values of b and EI. The region containing half of the white circles (8/16) falls in the ranges of b and EI between [1.5, 2.5]

Page 56: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

39

and [0.5, 1.5], respectively. The set of values of (b, EI) we choose for the scoring function is (1.6, 1.0). An interesting observation is that low topology accuracy (80%: blue; 79%: navy) occurs in the vertical-left, lower-horizontal, and upper-right regions. In the vertical-left (b = 1) and lower-horizontal (EI = 0) regions, the scoring function simplifies to assigning an equal weight of 1 to all loop signals regardless of their distance from the N-terminus. Conversely, in the upper-right region, when both b and EI are large, the scoring function assigns very small weights to the loop signals downstream of the N-terminus. The poor accuracy in the vertical-left and lower-horizontal regions results from weighting the contribution of every signal in the loop segments equally. On the other hand, in the upper-right region, the poor performance is due to the contribution from downstream signals being made negligible by the scoring function. Therefore, our analysis supports the assumptions we have made about our scoring function: 1) topology formation is a result of contributing signals distributed along the protein sequence, particularly in the loop regions; and 2) the contribution of each downstream loop segment to the first loop segment is not equal and diminishes as a function of distance from the N-terminus. Our results suggest that the inclusion of both assumptions in modeling membrane protein topology is a key factor in achieving the best topology accuracy.
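The (b, EI) analysis can be reproduced with a simple grid scan. The sketch below reuses the topology_scores() helper from the earlier sketch and a toy stand-in data set; the real scan would run over the 143-protein low-resolution set.

import numpy as np

def topo_accuracy(proteins, b, ei):
    """Percentage of proteins whose N-terminal topology is predicted
    correctly for a given (b, EI) setting."""
    hits = 0
    for r_in, r_out, true_topo in proteins:
        ts_i, ts_o = topology_scores(r_in, r_out, b=b, ei=ei)
        hits += (("inside" if ts_i > ts_o else "outside") == true_topo)
    return 100.0 * hits / len(proteins)

# Toy stand-in for the low-resolution data set.
proteins = [([0.4, 0.8, 0.6], [0.6, 0.2, 0.4], "outside")]
grid = [(b, ei) for b in np.arange(1.0, 3.01, 0.1)
                for ei in np.arange(0.0, 4.01, 0.1)]
best_b, best_ei = max(grid, key=lambda be: topo_accuracy(proteins, *be))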

3.7. Performance on newly solved structures and analysis of bacteriorhodopsin

To illustrate the performance of the top four methods on the high- and low-resolution data sets shown in Table 3, we test four recently solved membrane protein structures not included in the training set. The results are shown in Table 5. The best predicted protein is a photosynthetic reaction center protein (PDB ID: 1umx_L), for which all methods predict all helices correctly (Q_ok = 100%). On the other hand, only two methods are capable of correctly predicting all the helices of a bacteriorhodopsin (bR) structure (PDB ID: 1tn0_A) (Q_ok = 100%). In terms of topology prediction, most methods predict correctly for all four proteins. We devote our analysis to bR to illustrate that TMH prediction is by no means a trivial task and that continuous development in this area is indispensable to advancing our understanding of membrane protein structures.

Fig. 6(a) displays the high-resolution structure of bR from PDB. Bacteriorhodopsin is a member of the rhodopsin family, which is characterized by seven distinct transmembrane helices, indexed from Helix A to G. Studies of synthetic peptides of each of the seven TM helices of bR have shown that Helix A to Helix E can form independently stable helices when inserted into a lipid bilayer26.

Table 5. Performance of the top four approaches (shaded in Table 3) on newly solved membrane proteins. Proteins are indicated by their PDB codes and observed topologies. Topology terms: N_in, N-terminal loop on the inside of the membrane; N_out, N-terminal loop on the outside of the membrane. PRED_TOPO: predicted topology.

Protein            Method        PRED_TOPO   Q_ok  Q_htm^%obs  Q_htm^%prd   Q_2  Q_2T^%obs  Q_2T^%prd
1tn0_A (N_out)     SVMtmh        N_out       100   100         100          85   84         94
                   TMHMM2        N_out       0     86          100          71   68         87
                   PHDpsiHtm08   N_out       0     71          100          76   77         87
                   HMMTOP2       N_out       100   100         100          73   69         90
1vfp_A (N_in)      SVMtmh        N_in        0     70          100          87   57         74
                   TMHMM2        N_in        0     70          100          86   54         72
                   PHDpsiHtm08   N_in        0     50          50           86   52         72
                   HMMTOP2       N_in        0     80          89           85   58         63
1umx_L (N_in)      SVMtmh        N_in        100   100         100          90   91         89
                   TMHMM2        N_in        100   100         100          85   78         89
                   PHDpsiHtm08   N_out       100   100         100          82   92         75
                   HMMTOP2       N_in        100   100         100          83   78         83
1xfh_A (N_in)      SVMtmh        N_in        0     70          78           60   62         60
                   TMHMM2        N_in        0     70          88           63   57         65
                   PHDpsiHtm08   N_in        0     50          56           53   69         53
                   HMMTOP2       N_in        0     90          90           71   73         71


Fig. 5. The relationship between the base (b) and exponent increment (EI) in the alternating geometric scoring function and topology accuracy. The x-axis: base (b); the y-axis: exponent increment (EI). The accuracy of topology prediction (TOPO) for the low-resolution data set is divided into 8 levels, each indicated by a colour. The best accuracy (84%) and its associated (b, EI) values occur within the white circles.

[Fig. 6(a) shows the bR structure with the N-terminus on the extracellular side and the C-terminus on the cytoplasmic side; Fig. 6(b) shows the bR sequence with the observed helices and each method's predicted helices.]

Fig. 6(a). The structure of bacteriorhodopsin (bR) (PDB ID: 1tn0_A). Each helix is coloured and indexed from A to G. The figure is prepared with ViewerLite29. Fig. 6(b). Prediction results for bR by the top four methods (* = predicted helix). The observed helices are indicated by coloured boxes. The region of Helix G (purple) and its predictions are highlighted in grey.


However, Helix G does not form a stable helix in detergent micelles27 and exhibits structural irregularity at Lys216 by forming a π-bulge28. Despite its atypical structure, Helix G is important to the function of bR, as it binds retinal and undergoes a conformational change during the photosynthetic cycle28.

The results of the predictions by all four approaches are shown in Fig. 6(b). Interestingly, all approaches succeed in identifying the first six helices (Helix A to F) with good accuracy. However, most methods do not predict Helix G with the same level of success. In particular, TMHMM2 misses Helix G entirely, and PHDpsiHtm08 merges its predictions for Helix F and Helix G into one long helix. SVMtmh and HMMTOP2 (ref. 11) are the only two of the four methods that correctly identify the presence of Helix G. Furthermore, upon closer examination of Helix G, HMMTOP2 over-predicts by 3 residues at the N-terminus and severely under-predicts by 9 residues at the C-terminus, whereas SVMtmh only under-predicts by 2 residues at the N-terminus. The poor prediction results may be due to the intrinsic structural irregularity described earlier, which adds another level of complexity to the TMH prediction problem. Despite the difficulties involved in predicting the correct location of Helix G, SVMtmh succeeds in producing a prediction for the bR structure that is in close agreement with the experimentally determined structure. One possible reason for our success in this case could be the integration of multiple biological input features that encompass both global and local information for TMH prediction. TMHMM2 and HMMTOP2 rely solely on amino acid composition as sequence information, while PHDpsiHtm08 only uses sequence information from multiple sequence alignments. In contrast, SVMtmh incorporates a combination of both physico-chemical and sequence-based input features for helix prediction.

4. CONCLUSION

We have proposed an SVM-based approach in a hierarchical framework to predict transmembrane helices and topology in two successive steps. We demonstrate that by separating the prediction problem between two classifiers, specific biological input features associated with each classifier can be applied more effectively. By integrating both the sequence and structural

input features and using a novel topology scoring function, SVMtmh achieves comparable or better per-segment and topology accuracy for both high- and low-resolution data sets. When tested for confusion between membrane and soluble proteins, SVMtmh discriminates between them with the lowest false positive rate compared to the other methods. We further analyze a set of newly solved structures and show that SVMtmh is capable of predicting the correct helix and topology of bacteriorhodopsin as derived from a high resolution experiment.

With regard to future work, we will continue to enhance the performance of our approach by incorporating more relevant features in both stages of helix and topology prediction. We will also consider some complexities of TM helices, including helix lengths, tilts, and structural motifs, as in the case of bacteriorhodopsin. Supported by the results achieved here, our approach could prove valuable for genome-wide predictions to identify potential integral membrane proteins and their topologies.

While obtaining high-resolution structures for membrane proteins remains a major challenge in structural biology, accurate prediction methods are in high demand. We believe that the continued development of computational methods integrating biological knowledge in this area will be immensely fruitful.

Acknowledgments

We gratefully thank Jia-Ming Chang, Hsin-Nan Lin, Wei-Neng Hung, and Wen-Chi Chou for helpful discussions and computational assistance. This work was supported in part by the thematic program of Academia Sinica under grants AS94B003 and AS95ASIA02.

References

1. Wallin E and von Heijne G. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 1998; 7: 1029-1038.

2. Stevens TJ and Arkin IT. The effect of nucleotide bias upon the composition and prediction of transmembrane helices. Protein Sci 2000; 9: 505-511.

3. Krogh A, Larsson B, von Heijne G, and Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001; 305: 567-580.

4. Ubarretxena-Belandia I and Engelman DM. Helical membrane proteins: diversity of functions in the context of simple architecture. Curr Opin Struct Biol 2001; 11: 370-376.

5. White SH. The progress of membrane protein structure determination. Protein Sci 2004; 13: 1948-1949.

6. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235-242.

7. van Geest M and Lolkema JS. Membrane topology and insertion of membrane proteins: search for topogenic signals. Microbiol Mol Biol Rev 2000; 64: 13-33.

8. Kyte J and Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982; 157: 105-132.

9. Eisenberg D, Weiss RM, and Terwilliger TC. The hydrophobic moment detects periodicity in protein hydrophobicity. Proc Natl Acad Sci USA 1984; 81: 140-144.

10. White SH and Wimley WC. Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct 1999; 28: 319-365.

11. Tusnady GE and Simon I. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 1998; 283: 489-506.

12. Rost B, Fariselli P, and Casadio R. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996; 5: 1704-1718.

13. Chen CP, Kernytsky A, and Rost B. Transmembrane helix predictions revisited. Protein Sci 2002; 11: 2774-2791.

14. Chang CC and Lin CJ. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

15. von Heijne G. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992; 225: 487-494.

16. Hessa T, Kim H, Bihlmaier K, Lundin C, Boekel J, Andersson H, Nilsson I, White SH, and von Heijne G. Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature 2005; 433: 377-381.

17. Mitaku S, Hirokawa T, and Tsuji T. Amphiphilicity index of polar amino acids as an aid in the characterization of amino acid preference at membrane-water interfaces. Bioinformatics 2002; 18: 608-616.

18. Zhou H and Zhou Y. Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci 2003; 12: 1547-1555.

19. Jayasinghe S, Hristova K, and White SH. Energetics, stability, and prediction of transmembrane helices. J Mol Biol 2001; 312: 927-934.

20. Goder V and Spiess M. Topogenesis of membrane proteins: determinants and dynamics. FEBS Letters 2001;504:87-93.

21. Popot JL and Engelman DM. Membrane protein folding and oligomerization: the two-stage model. Biochemistry 1990; 29: 4031-4037.

22. Moller S, Kriventseva EV, and Apweiler R. A collection of well characterised integral membrane proteins. Bioinformatics 2000; 16: 1159-1160.

23. Bairoch A and Apweiler R. The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J Mol Med 1997; 5: 312-316.

24. Cao B, Porollo A, Adamczak R, Jarrell M, and Meller J. Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics 2006; 22: 303-309.

25. Wu TF, Lin CJ, and Weng RC. Probability estimates for multi-class classification by pairwise coupling. JMLR 2004; 5: 975-1005.

26. Booth PJ. Unravelling the folding of bacteriorhodopsin. Biochim Biophys Acta 2000; 1460: 4-14.

27. Hunt JF, Earnest TN, Bousche O, Kalghatgi K, Reilly K, Horvath C, Rothschild KJ, and Engelman DM. A biophysical study of integral membrane protein folding. Biochemistry 1997; 36: 15156-15176.

28. Luecke H, Schobert B, Richter HT, Cartailler JP, and Lanyi JK. Structure of bacteriorhodopsin at 1.55 Å resolution. J Mol Biol 1999; 291: 899-911.

29. ViewerLite for molecular visualization. Software available at http://www.jaici.or.jp/sci/viewer.htm.


PROTEIN FOLD RECOGNITION USING THE GRADIENT BOOST ALGORITHM

Feng Jiao*

School of Computer Science, University of Waterloo, Canada [email protected]

Jinbo Xu†

Toyota Technological Institute at Chicago, USA j3xu@tti-c.org

Libo Yu

Bioinformatics Solutions Inc., Waterloo, Canada

[email protected]

Dale Schuurmans

Department of Computing Science, University of Alberta, Canada

dale@cs.ualberta.ca

Protein structure prediction is one of the most important and difficult problems in computational molecular biology. Protein threading represents one of the most promising techniques for this problem. One of the critical steps in protein threading, called fold recognition, is to choose the best-fit template for the query protein whose structure is to be predicted. The standard method for template selection is to rank candidates according to the z-score of the sequence-template alignment. However, the z-score calculation is time-consuming, which greatly hinders structure prediction at a genome scale. In this paper, we present a machine learning approach that treats the fold recognition problem as a regression task and uses a least-squares boosting algorithm (LS_Boost) to solve it efficiently. We test our method on Lindahl's benchmark and compare it with other methods. From our experimental results we draw the following conclusions: (1) Machine learning techniques offer an effective way to solve the fold recognition problem. (2) Formulating protein fold recognition as a regression rather than a classification problem leads to a more effective outcome. (3) Importantly, the LS_Boost algorithm does not require the calculation of the z-score as an input, and therefore can obtain significant computational savings over standard approaches. (4) The LS_Boost algorithm obtains superior accuracy, with less computation for both training and testing, than alternative machine learning approaches such as SVMs and neural networks, which also need not calculate the z-score. Finally, by using the LS_Boost algorithm, one can identify important features in the fold recognition protocol, something that cannot be done using a straightforward SVM approach.

1. INTRODUCTION

In the post-genomic era, understanding protein function has become a key step toward modelling complete biological systems. It has been established that the functions of a protein are directly linked to its three-dimensional structure. Unfortunately, current "wet-lab" methods used to determine the three-dimensional structure of a protein are costly, time-consuming and sometimes unfeasible. The ability to predict a protein's structure directly from its sequence is urgently needed in the post-genomic era, where protein sequences are becoming available at

a far greater rate than the corresponding structure information.

Protein structure prediction is one of the most important and difficult problems in computational molecular biology. In recent years, protein threading has turned out to be one of the most successful approaches to this problem7, 14, 15. Protein threading predicts protein structures by using statistical knowledge of the relationship between protein sequences and structures. The prediction is made by aligning each amino acid in the target sequence to a position in a template structure and evaluating how well

* Work performed at the Alberta Ingenuity Centre for Machine Learning, University of Alberta. † Contact author.


the target fits the template. After aligning the sequence to each template in the structural template database, the next step is to separate the correct templates from the incorrect templates for the target sequence, a step we refer to as template selection or fold recognition. After the best-fit template is chosen, the structural model of the sequence is built based on the alignment between the sequence and the chosen template.

The traditional fold recognition technique is based on calculating the z-score, which statistically tests the possibility of the target sequence folding into a structure very similar to the template 3. In this technique, the z-score is calculated for each sequence-template alignment by first determining the distribution of alignment scores among random re-shufflings of the sequence, and then comparing the alignment score of the correct sequence (in standard deviation units) to the average alignment score over random sequences. Note that the z-score calculation requires the alignment score distribution to be determined by randomly shuffling the sequence many times (approx. 100 times), meaning that the shuffled sequence has to be threaded to the template repeatedly. Thus, the entire process of calculating the z-score is very time-consuming. In this paper, instead of using the traditional z-score technique, we propose to solve the fold recognition problem by treating it as a machine learning problem.

Several research groups have already proposed machine learning methods for fold recognition, such as neural networks9, 23 and support vector machines (SVMs)20, 22. In this general framework, for each sequence-template alignment one generates a set of features to describe the instance, treats the extracted features as input data, and uses the alignment accuracy or similarity level as a response variable. Thus, the fold recognition problem can be expressed as a standard prediction problem that can be solved by supervised machine learning techniques for regression or classification. In this paper we investigate a new approach that proves to be simpler to implement, more accurate and more computationally efficient. In particular, we combine the gradient boosting algorithm of Friedman5 with a least-squares loss criterion to obtain a least-squares boosting algorithm, LS_Boost. We use LS_Boost to estimate the alignment accuracy

of each sequence-template alignment and employ this as part of our fold recognition technique.

To evaluate our approach, we test it experimentally on Lindahl's benchmark12 and compare the resulting performance with other fold recognition methods, such as the z-score method, SVM regression, SVM classification, neural networks and Bayes classification. Our experimental results demonstrate that the LS_Boost method outperforms the other techniques in terms of both prediction accuracy and computational efficiency. It is also a much easier algorithm to implement.

The remainder of the paper is organized as follows. We first briefly introduce the idea of using protein threading for protein structure prediction. We show how to generate features from each sequence-template alignment and convert protein threading into a standard prediction problem (making it amenable to supervised machine learning techniques). We discuss how to design the least-squares boosting algorithm by combining gradient boosting with a least-squares loss criterion, and then describe how to use our algorithm to solve the fold recognition problem. Finally, we will describe our experimental set-up and compare LS_Boost with other methods, leading to the conclusions we present in the end.

2. PROTEIN THREADING AND FOLD RECOGNITION

2.1. The threading method for protein structure prediction

The idea of protein threading originated from the observation that the number of different structural folds in nature may be quite small, perhaps two orders of magnitude fewer than the number of known protein sequences11. Thus, the structure prediction problem can potentially be reduced to a recognition problem: choosing a known structure into which the target sequence will fold. Put another way, protein threading is in fact a database search technique: given a query sequence of unknown structure, one searches a structure (template) database and finds the best-fit structure for the given sequence. Protein threading typically consists of the following four steps:

(1) Build a template database of representative three-dimensional protein structures, which usually involves removing highly redundant structures.

(2) Design a scoring function to measure the fitness between the target sequence and the template based on the knowledge of the known relationship between the structures and the sequences. Usually, the minimum value of the scoring function corresponds to the optimal sequence-template alignment.

(3) Find the best alignment between the target sequence and the template by minimizing the scoring function.

(4) Choose the best-fit template for the sequence according to a criterion, based on all the sequence-template alignments.

In this paper, we focus only on the final step, that is, how to choose the best template for the sequence, which is called fold recognition. We use our existing protein threading server RAPTOR21, 22 to generate all the sequence-structure alignments. For the fold recognition problem, there are two different approaches: the z-score method3 and the machine learning method9, 23.

2.2. The z-score method for fold recognition

The z-score is defined to be the "distance" (in standard deviation units) between the optimal alignment score and the mean alignment score obtained by randomly shuffling the target sequence. An accurate z-score can cancel out the sequence composition bias and offset the mismatch between the sequence size and the template length. Bryant et al. 3 proposed the following procedures to calculate z-score:

(1) Shuffle the aligned sequence residues randomly.
(2) Find the optimal alignment between the shuffled sequence and the template.
(3) Repeat the above two steps N times, where N is on the order of one hundred. Then calculate the distribution of these N alignment scores.

After the N alignment scores are obtained, we calculate the deviation of the optimal alignment score from the distribution of these N alignment scores.

We can see from the above that in order to calculate the z-score for each sequence-template alignment, we need to shuffle and re-thread the target sequence many times, which takes a significant amount of time and essentially prevents this technique from being applied to genome-scale structure prediction.
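In code, the shuffling-based z-score looks roughly as follows; align_score is a placeholder for the (expensive) threading and alignment routine, and the sign convention depends on whether the alignment score is minimized or maximized:

import random
import statistics

def z_score(sequence, template, align_score, n=100):
    """Deviation (in standard-deviation units) of the true alignment
    score from the score distribution over shuffled sequences."""
    true_score = align_score(sequence, template)
    shuffled_scores = []
    for _ in range(n):
        residues = list(sequence)
        random.shuffle(residues)               # step (1): shuffle
        shuffled_scores.append(                # step (2): re-thread
            align_score("".join(residues), template))
    mu = statistics.mean(shuffled_scores)      # step (3): distribution
    sigma = statistics.stdev(shuffled_scores)
    return (true_score - mu) / sigma

Each call re-threads the sequence n times, which is exactly the cost the regression approach below avoids.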

2.3. Machine learning methods for fold recognition

Another approach to the fold recognition problem is to use machine learning methods, such as neural networks, as in the GenTHREADER9 and PROSPECT-I23 systems, or SVMs, as in the RAPTOR system22. Current machine learning methods generally treat the fold recognition problem as a classification problem. However, there is a limitation to the classification approach that arises when one realizes that there are three levels of similarity one can draw between two proteins: fold level similarity, superfamily level similarity and family level similarity. Currently, classification-based methods treat the three different similarity levels as a single level, and thus are unable to effectively differentiate one similarity level from another while maintaining a hierarchical relationship between the three levels. Even a multi-class classifier cannot deal with this limitation very well, since the three levels are in a hierarchical relationship.

Instead, we use a regression approach, which simply uses the alignment accuracy as the response value. That is, we reformulate the fold recognition problem as predicting the alignment accuracy of a threading pair, which is then used to differentiate the similarity level between proteins. In our approach, we use SARF2 to generate the alignment accuracy between the target protein and the template protein. The alignment accuracy of a threading pair is defined as the number of correctly aligned positions, based on the correct alignment generated by SARF. A position is correctly aligned only if its alignment position is no more than four position shifts away from its correct alignment. On average, the higher the similarity level between two proteins, the higher the alignment accuracy will be. Thus alignment accuracy can help to effectively differentiate the three similarity levels. Below we show experimentally that the regression approach obtains


much better results than the standard classification approach.

3. FEATURE EXTRACTION

One of the key steps in the machine learning approach is to choose a set of proper features to be used as inputs for predicting the similarity between two proteins. After optimally threading a given sequence to each template in the database, we generate the following features from each threading pair.

(1) Sequence size: the number of residues in the sequence.

(2) Template size: the number of residues in the template.

(3) Alignment length: the number of aligned residues. Usually, two proteins from the same fold class should share a large portion of similar sub-structure. If the alignment length is considerably smaller than the sequence size or the template size, it indicates that this threading pair is unlikely to be in the same SCOP class.

(4) Sequence identity. Although a low sequence identity does not imply that two proteins are not similar, a high sequence identity can indicate that two proteins should be considered similar.

(5) Number of contacts with both ends aligned to the sequence. There is a contact between two residues if their spatial distance is within a given cutoff. Usually, a longer protein has more contacts.

(6) Number of contacts with only one end aligned to the sequence. If this number is large, it might indicate that the sequence is aligned to an incomplete domain of the template, which is undesirable since the sequence should fold into a complete structure.

(7) Total alignment score.

(8) Mutation score, which measures the sequence similarity between the target protein and the template protein.

(9) Environment fitness score, which measures how well a residue fits into a specific environment.

(10) Alignment gap penalty. When aligning a sequence and a template, some gaps are allowed. However, too many gaps might indicate that the quality of the alignment is poor and that the two proteins may not be at the same similarity level.

(11) Secondary structure compatibility score, which measures the secondary structure difference between the template and the sequence at all positions.

(12) Pairwise potential score, which characterizes the capability of a residue to make contact with another residue.

(13) The z-score of the total alignment score, and the z-scores of the single score items such as the mutation score, environment fitness score, secondary structure score and pairwise potential score.

Notice that we still include the traditional z-score here for the sake of performance comparison. Later, however, we will show that we can obtain nearly the same performance without using the z-score, which means it is unnecessary to calculate the z-score as one of the features.

We calculate the alignment accuracy between the target protein and the template protein using the structure comparison program SARF, and use this alignment accuracy as the response variable. Given the training set of input feature vectors and response variables, we need to find a prediction function that maps the features to the response variable. Using this function, we can estimate the alignment accuracy of each sequence-template alignment. All the sequence-template alignments can then be ranked by predicted alignment accuracy, and the first-ranked one is chosen as the best alignment for the sequence. Thus we have converted the protein structure prediction problem into a function estimation problem. In the next section, we show how to design our LS_Boost algorithm by combining the gradient boosting algorithm of Friedman5 with a least-squares loss criterion.

4. LEAST-SQUARES BOOSTING ALGORITHM FOR FOLD RECOGNITION

The problem can be formulated as follows. Let $x$ denote the feature vector and $y$ the alignment accuracy. Given an input variable $x$, a response variable $y$ and some samples $\{y_i, x_i\}_{i=1}^{N}$, we want to find a function $F^*(x)$ that can predict $y$ from $x$ such that, over the joint distribution of $\{y, x\}$ values, the expected value of a specific loss function $L(y, F(x))$ is minimized5. The loss function measures the deviation between the real $y$ value and the predicted $y$ value:

$F^*(x) = \arg\min_{F(x)} E_{y,x} L(y, F(x)) = \arg\min_{F(x)} E_x [ E_y L(y, F(x)) \mid x ]$    (1)

Normally $F(x)$ is a member of a parameterized class of functions $F(x; P)$, where $P$ is a set of parameters. We use the form of the "additive" expansions to design the function as follows:

$F(x; P) = \sum_{m=0}^{M} \beta_m h(x; a_m)$    (2)

where $P = \{\beta_m, a_m\}_{m=0}^{M}$. The functions $h(x; a)$ are usually simple functions of $x$ with parameters $a = \{a_1, a_2, \ldots, a_M\}$. When we wish to estimate $F(x)$ non-parametrically the task becomes more difficult. In general, we can choose a parameterized model $F(x; P)$ and change the function optimization problem into a parameter optimization problem. That is, we fix the form of the function and optimize the parameters instead. A typical parameter optimization method is a "greedy-stagewise" approach: we optimize $\{\beta_m, a_m\}$ after all of the $\{\beta_i, a_i\}$ ($i = 0, 1, \ldots, m-1$) have been optimized. This process can be represented by the following two recursive equations:

$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \beta h(x_i; a))$    (3)

$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m)$    (4)

Algorithm 1: Gradient_Boost

• Initialize $F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$
• For m = 1 to M do:
    • Step 1. Compute the negative gradient
      $\tilde{y}_i = -\left[ \partial L(y_i, F(x_i)) / \partial F(x_i) \right]_{F(x) = F_{m-1}(x)}$
    • Step 2. Fit a model
      $a_m = \arg\min_{a, \beta} \sum_{i=1}^{N} [\tilde{y}_i - \beta h(x_i; a)]^2$
    • Step 3. Choose a gradient descent step size as
      $\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + \rho h(x_i; a_m))$
    • Step 4. Update the estimate of $F(x)$:
      $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$
• end for
• Output the final regression function $F_M(x)$

Fig. 1. Gradient boosting algorithm.

Algorithm 2: LS_Boost

• Initialize $F_0 = \bar{y} = \frac{1}{N} \sum_i y_i$
• For m = 1 to M do:
    • $\tilde{y}_i = y_i - F_{m-1}(x_i)$, $i = 1, \ldots, N$
    • $(\rho_m, a_m) = \arg\min_{\rho, a} \sum_{i=1}^{N} [\tilde{y}_i - \rho h(x_i; a)]^2$
    • $F_m(x) = F_{m-1}(x) + \rho_m h(x; a_m)$
• end for
• Output the final regression function $F_M(x)$

Fig. 2. LS_Boost algorithm.

Friedman proposed a steepest-descent method to solve the optimization problem described in Equation 2 (ref. 5). This algorithm is called the Gradient Boosting algorithm, and its entire procedure is given in Figure 1.

By employing the least-squares loss function $L(y, F) = (y - F)^2 / 2$, we obtain the least-squares boosting algorithm shown in Figure 2. For this procedure, $\rho$ is calculated as follows:

$(\rho_m, a_m) = \arg\min_{\rho, a} \sum_{i=1}^{N} [\tilde{y}_i - \rho h(x_i; a)]^2$, and therefore $\rho_m = \sum_{i=1}^{N} \tilde{y}_i h(x_i; a_m) \Big/ \sum_{i=1}^{N} h(x_i; a_m)^2$    (5)

The simple function $h(x; a)$ can have any form that can be conveniently optimized over $a$. In terms of boosting, optimizing over $a$ to fit the training data is called weak learning. In this paper, for reasons of speed, we choose functions for which $a$ is easy to obtain. The simplest is the linear regression function

$y = ax + b$,    (6)

where $x$ is a single input feature and $y$ is the alignment accuracy. The parameters of the linear regression function have the closed-form solution

$a = l_{xy} / l_{xx}$, $\quad b = \bar{y} - a\bar{x}$,

where $l_{xx} = n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2$ and $l_{xy} = n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$.

There are many other simple functions one can use, such as an exponential function $y = a + e^{bx}$, a logarithmic function $y = a + b \ln x$, a quadratic function $y = ax^2 + bx + c$, or a hyperbolic function $y = a + b/x$.

In our application, in each round we choose one feature and obtain the simple function $h(x; a)$ with the minimum least-squares error. The underlying reasons for choosing a single feature at each round are: i) we would like to see the role of each feature in fold recognition; and ii) we notice that the alignment accuracy is approximately proportional to some features. For example, the higher the alignment accuracy, the lower the mutation score, fitness score and pairwise score. Figure 3 shows the relation between alignment accuracy and mutation score.

Fig. 3. The relation between alignment accuracy and mutation score.

In the end, we combine these simple functions to form the final regression function. As such, Algorithm 2 translates into the following procedure.

(1) Calculate the difference between the real alignment accuracy and the predicted alignment accuracy. We call this difference the alignment accuracy residual. Assume the initial predicted alignment accuracy is the average alignment accuracy of the training data.

(2) Choose the single feature that correlates best with the alignment accuracy residual. The parameter ρ is calculated using Equation 5. The alignment accuracy residual is then predicted using this chosen feature and the parameter.

(3) Update the predicted alignment accuracy by adding the predicted alignment accuracy residual. Repeat the above two steps until the predicted alignment accuracy does not change significantly.
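A minimal sketch of this procedure is given below. It fits one single-feature linear weak learner per round, as described above; for simplicity the step size ρ is absorbed into the least-squares line fitted to the residual, and all names are illustrative:

import numpy as np

def fit_ls_boost(X, y, n_rounds=500):
    """LS_Boost with single-feature linear weak learners.

    X has shape (n_samples, n_features); y holds the SARF alignment
    accuracies. Returns [F_0, (feature, slope, intercept), ...]."""
    model = [float(np.mean(y))]        # F_0 = average accuracy
    residual = y - model[0]
    for _ in range(n_rounds):
        best = None
        for j in range(X.shape[1]):
            # Closed-form least-squares line through (x_j, residual).
            a, b = np.polyfit(X[:, j], residual, 1)
            pred = a * X[:, j] + b
            sse = np.sum((residual - pred) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, a, b, pred)
        _, j, a, b, pred = best
        model.append((j, a, b))        # weak learner chosen this round
        residual = residual - pred     # update the accuracy residual
    return model

def predict_ls_boost(model, X):
    out = np.full(X.shape[0], model[0])
    for j, a, b in model[1:]:
        out += a * X[:, j] + b
    return out

Ranking the candidate templates for a query then reduces to sorting them by predict_ls_boost(model, X_templates) and taking the top hit.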

5. EXPERIMENTAL RESULTS

When a protein structure is to be predicted, we thread its sequence onto each template in the database and obtain the predicted alignment accuracy using the LS_Boost algorithm. We choose the template with the highest predicted alignment accuracy as the basis for building the structure of the target sequence.

We can describe the relationship between two proteins at three different levels: the family level, the superfamily level and the fold level. If two proteins


are similar at the family level, then they have evolved from a common ancestor and usually share more than 30% sequence identity. If two proteins are similar only at the fold level, then their structures are similar even though their sequences are not. The superfamily level is in between the family and fold levels. If the target sequence has a template in the same family, it is easier to predict the structure of the sequence; if two proteins are similar only at the fold level, they share less sequence similarity and it is harder to predict their relationship.

We use the SCOP database16 to judge the similarity between two proteins and evaluate our predicted results at the different levels. If the predicted template is similar to the target sequence at the family level according to the SCOP database, we treat it as a correct prediction at the family level. If the predicted template is similar at the superfamily level but not at the family level, we assess the prediction as correct at the superfamily level. Similarly, if the predicted template is similar at the fold level but not at the other two levels, we assess the prediction as correct at the fold level. When we say a prediction is correct according to the top K criterion, we mean that there are no more than K - 1 incorrect predictions ranked before this prediction. The fold-level relationship is the hardest to predict because two proteins share very little sequence similarity in this case.
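The top-K criterion can be made concrete in a few lines of code; ranked_labels is a hypothetical list of correctness flags for the templates of one target, sorted by predicted alignment accuracy:

def correct_at_top_k(ranked_labels, k):
    """True if at most K-1 incorrect templates are ranked before the
    first correct one."""
    wrong_before = 0
    for is_correct in ranked_labels:
        if is_correct:
            return wrong_before <= k - 1
        wrong_before += 1
    return False

print(correct_at_top_k([False, True, False], k=1))  # False
print(correct_at_top_k([False, True, False], k=5))  # True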

To train the parameters of our algorithm, we randomly choose 300 templates from the FSSP list1 and 200 sequences from Holm's test set6. By threading each sequence to all the templates, we obtain a set of 60,000 training examples.

To test the algorithm, we use Lindahl's benchmark, which contains 976 proteins, each pair of which shares at most 40% sequence identity. By threading each one against all the others, we obtain a set of 976 × 975 threading pairs. Since the training set is chosen randomly from a set of non-redundant proteins, the overlap between the training set and Lindahl's benchmark is fairly small: no more than 0.4 percent of the whole test set. To ensure complete separation of the training and testing sets, these overlapping pairs are removed from the test data.

We calculate the recognition rate of each method at the three similarity levels.

5.1. Sensitivity

Figure 4 shows the sensitivity of our algorithm at each round. We can see that the LS_Boost algorithm nearly converges within 100 rounds, although we train the algorithm further to obtain higher performance.

Fig. 4. Sensitivity curves during the training process (sensitivity according to the top 1 and top 5 criteria at the family, superfamily and fold levels, as a function of the number of training rounds).

Table 1 lists the results of our algorithm against several other algorithms. PROSPECT II uses the z-score method, and its results are taken from Kim et al.'s paper10. We can see that the LS_Boost algorithm is better than PROSPECT II at all three levels. The results for the other methods are taken from Shi et al.'s paper18. Here we can see that our method clearly outperforms the other methods. However, since we use different sequence-structure alignment methods, this disparity may be partially due to different threading techniques. Nevertheless, we can see that the machine learning approaches normally perform much better than the other methods.

Table 2 shows the results of our algorithm against several other popular machine learning methods. We will not describe the details of each method here. In this experiment, we use RAPTOR to generate all the sequence-template alignments. For each method, we tune the parameters on the training set and test the model on the test set. In total we test the following six other machine learning methods.


Table 1. Sensitivity of the LS_Boost method compared with other structure prediction servers.

                      Family Level       Superfamily Level   Fold Level
Method                Top 1    Top 5     Top 1    Top 5      Top 1    Top 5
RAPTOR (LS_Boost)     86.5%    89.2%     60.2%    74.4%      38.8%    61.7%
PROSPECT II           84.1%    88.2%     52.6%    64.8%      27.7%    50.3%
FUGUE                 82.3%    85.8%     41.9%    53.2%      12.5%    26.8%
PSI-BLAST             71.2%    72.3%     27.4%    27.9%      4.0%     4.7%
HMMER-PSIBLAST        67.7%    73.5%     20.7%    31.3%      4.4%     14.6%
SAMT98-PSIBLAST       70.1%    75.4%     28.3%    38.9%      3.4%     18.7%
BLASTLINK             74.6%    78.9%     29.3%    40.6%      6.9%     16.5%
SSEARCH               68.6%    75.7%     20.7%    32.5%      5.6%     15.6%
THREADER              49.2%    58.9%     10.8%    24.7%      14.6%    37.7%

Table 2. Performance comparison of seven machine learning methods. The sequence-template alignments are generated by RAPTOR.

                         Family Level       Superfamily Level   Fold Level
Method                   Top 1    Top 5     Top 1    Top 5      Top 1    Top 5
LS_Boost                 86.5%    89.2%     60.2%    74.4%      38.8%    61.7%
SVM (regression)         85.0%    89.1%     55.4%    71.8%      38.6%    60.6%
SVM (classification)     82.6%    83.6%     45.7%    58.8%      30.4%    52.6%
AdaBoost                 82.8%    84.1%     50.7%    61.1%      32.2%    53.3%
Neural networks          81.1%    83.2%     47.4%    58.3%      30.1%    54.8%
Bayes classifier         69.9%    72.5%     29.2%    42.6%      13.6%    40.0%
Naive Bayes classifier   68.0%    70.8%     31.0%    41.7%      15.1%    37.4%

(1) SVM regression. Support vector machines are based on the concept of structural risk minimization from statistical learning theory19. The fold recognition problem is treated as a regression problem, so we consider SVMs used for regression. Here we use the svm_light software package8 with an RBF kernel to obtain the best performance. As shown in Table 2, LS_Boost performs slightly better than SVM regression.

(2) SVM classification. The fold recognition problem is treated as a classification problem, and we consider an SVM for classification. The software and kernel are the same as for SVM regression. In this case, SVM classification performs worse than SVM regression, especially at the superfamily and fold levels.

(3) AdaBoost. Boosting is a procedure that combines the outputs of many "weak" classifiers to produce a powerful "committee". We use the standard AdaBoost algorithm4 for classification, which is similar to LS_Boost except that it performs classification rather than regression and uses the exponential instead of the least-squares loss function. The AdaBoost algorithm achieves a result comparable to SVM classification but is worse than both of the regression approaches, LS_Boost and SVM regression.

(4) Neural networks. Neural networks are one of the most popular methods used in machine learning17. Here we use a multi-layer perceptron for classification, based on the Matlab neural network toolbox. The performance of the neural network is similar to SVM classification and AdaBoost.

(5) Bayesian classifier. A Bayesian classifier is a probability-based classifier which assigns a sample to a class based on the probability that it belongs to that class13.

(6) Naive Bayesian classifier. The Naive Bayesian classifier is similar to the Bayesian classifier except that it assumes the features of each class are independent, which greatly decreases computation13. Both the Bayesian classifier and the Naive Bayesian classifier obtain poor performance.

Our experimental results clearly show that: (1) the regression-based approaches demonstrate better performance than the classification-based approaches; (2) LS_Boost performs slightly better than SVM regression and significantly better than the other methods; and (3) the computational efficiency of


LS_Boost is much better than SVM regression, SVM classification and the neural network.

One of the advantages of our boosting approach over SVM regression is its ability to identify important features, since at each round LS_Boost chooses only a single feature to approximate the alignment accuracy residual. The top five features chosen by our algorithm are listed below; a sketch of the fitting loop follows the list. The simple function associated with each feature is a linear regression function y = ax + b, showing that there is a strong linear relation between each feature and the alignment accuracy. For example, Figure 3 shows that the linear regression function is the best fit.

(1) Sequence identity; (2) Total alignment score; (3) Fitness score; (4) Mutation score; (5) Pairwise potential score.
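The following is a minimal sketch of this round-by-round fitting, not the implementation used in the paper; the feature matrix `X`, the target vector `y` of exact alignment accuracies, the number of rounds and the shrinkage factor are illustrative assumptions.

```python
import numpy as np

def ls_boost_fit(X, y, n_rounds=50, shrinkage=0.1):
    """Least-squares boosting sketch: each round picks the single
    feature whose linear fit y = a*x + b best approximates the
    current residual, then adds the (shrunken) fit to the ensemble."""
    X = np.asarray(X, dtype=float)
    residual = np.asarray(y, dtype=float).copy()
    ensemble = []  # list of (feature index, slope, intercept)
    for _ in range(n_rounds):
        best_err, best = None, None
        for j in range(X.shape[1]):
            a, b = np.polyfit(X[:, j], residual, 1)  # least-squares line
            err = np.sum((residual - (a * X[:, j] + b)) ** 2)
            if best_err is None or err < best_err:
                best_err, best = err, (j, a, b)
        j, a, b = best
        ensemble.append((j, shrinkage * a, shrinkage * b))
        residual -= shrinkage * (a * X[:, j] + b)    # gradient step
    return ensemble

def ls_boost_predict(ensemble, X):
    """Predicted alignment accuracy = sum of the shrunken fits."""
    X = np.asarray(X, dtype=float)
    return sum(a * X[:, j] + b for j, a, b in ensemble)
```

Because each round commits to a single feature, the ensemble doubles as a feature ranking: the features chosen in the earliest rounds are the most informative.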

It may seem surprising that the widely used z-score is not chosen as one of the most important features, which suggests that the z-score may be redundant. To confirm this hypothesis, we re-trained our model using all features except the z-scores; that is, we conducted the same training and test procedures as before, but with the reduced feature set. The results in Table 3 show that for LS_Boost there is almost no difference between using the z-score as an additional feature and not using it. Thus we conclude that with the LS_Boost approach it is unnecessary to calculate the z-score to obtain the best performance. By completely avoiding the expensive z-score calculation, we can greatly improve the computational efficiency of protein threading without sacrificing accuracy.

To quantify the margin of superiority of LS_Boost over the other machine learning methods, we use the bootstrap method for error analysis. After training the model, we randomly sample 600 sequences from the Lindahl benchmark and calculate the sensitivity as before. We repeat the sampling 1000 times and obtain the mean and standard deviation of the sensitivity of each method, listed in Table 4. The LS_Boost method is slightly better than SVM regression and much better than the other methods.
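A minimal sketch of this bootstrap, assuming a hypothetical 0/1 array `correct` that flags whether each benchmark target is recognized correctly:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def bootstrap_sensitivity(correct, n_draw=600, n_rep=1000):
    """Resample targets with replacement; the sensitivity of one
    replicate is the fraction of sampled targets recognized
    correctly. Returns the mean and std over all replicates."""
    correct = np.asarray(correct, dtype=float)
    sens = [rng.choice(correct, size=n_draw, replace=True).mean()
            for _ in range(n_rep)]
    return float(np.mean(sens)), float(np.std(sens))
```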

5.2. Specificity

We further examine the specificity of the LS_Boost method on the Lindahl benchmark. All threading pairs are ranked by their confidence score (i.e., the predicted alignment accuracy, or the classification score if an SVM classifier is used), and the sensitivity-specificity curves are drawn in Figures 5, 6 and 7. Figure 6 demonstrates that at the superfamily level, the LS_Boost method is consistently better than SVM regression and classification across the whole spectrum of sensitivity. At both the family level and the fold level, LS_Boost is slightly better when the specificity is high but worse when the specificity is low. At the family level, LS_Boost achieves sensitivities of 55.0% and 64.0% at 99% and 50% specificity, respectively, whereas SVM regression achieves 44.2% and 71.3%, and SVM classification achieves 27.0% and 70.9%. At the superfamily level, LS_Boost has sensitivities of 8.2% and 20.8% at 99% and 50% specificity, respectively; in contrast, SVM regression has 3.6% and 17.8%, and SVM classification 2.0% and 16.1%. Figure 7 shows that at the fold level there is no big difference between the LS_Boost, SVM regression and SVM classification methods.
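For concreteness, the following sketch computes such a curve by sweeping a threshold over the ranked confidence scores; the `labels` array marking true threading pairs is our assumption about the data layout, not the authors' code.

```python
import numpy as np

def sens_spec_curve(scores, labels):
    """For every confidence-score threshold, compute
    sensitivity = TP / P over the positive pairs and
    specificity = TN / N over the negative pairs."""
    order = np.argsort(-np.asarray(scores))      # descending confidence
    labels = np.asarray(labels, dtype=int)[order]
    P = labels.sum()
    N = len(labels) - P
    tp = np.cumsum(labels)        # positives above each threshold
    fp = np.cumsum(1 - labels)    # negatives above each threshold
    sensitivity = tp / P
    specificity = (N - fp) / N
    return sensitivity, specificity
```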


Fig. 5. Family-level specificity-sensitivity curves on the Lindahl benchmark set. The three methods LS_Boost, SVM regression and SVM classification are compared.


Table 3. Comparison of fold recognition performance with and without the z-score.

                          LS_Boost with z-score   LS_Boost without z-score
Family Level       Top 1  86.5%                   85.8%
                   Top 5  89.2%                   89.2%
Superfamily Level  Top 1  60.2%                   60.2%
                   Top 5  74.4%                   73.9%
Fold Level         Top 1  38.8%                   38.3%
                   Top 5  61.7%                   62.9%

Table 4. Error analysis of seven machine learning methods. The sequence-template alignments are generated by RAPTOR. SVM (R): SVM regression; SVM (C): SVM classification; NN: neural network; BC: Bayes classifier; NBC: naive Bayes classifier.

                               LS_Boost  SVM (R)  SVM (C)  AdaBoost  NN      BC      NBC
Family Level       Top 1 mean  86.6%     85.2%    82.5%    82.9%     81.8%   70.0%   68.8%
                         std   0.029     0.031    0.028    0.030     0.029   0.027   0.026
                   Top 5 mean  89.2%     89.2%    83.8%    84.2%     83.5%   72.6%   71.0%
                         std   0.031     0.031    0.030    0.029     0.030   0.027   0.028
Superfamily Level  Top 1 mean  60.2%     55.6%    45.8%    50.7%     47.5%   29.1%   31.1%
                         std   0.029     0.029    0.026    0.028     0.027   0.021   0.022
                   Top 5 mean  74.3%     72.0%    58.9%    61.2%     58.4%   42.6%   41.9%
                         std   0.034     0.033    0.030    0.031     0.031   0.026   0.025
Fold Level         Top 1 mean  38.9%     38.7%    30.4%    32.1%     30.2%   13.7%   15.1%
                         std   0.027     0.027    0.024    0.025     0.024   0.016   0.017
                   Top 5 mean  61.8%     60.7%    52.8%    53.4%     55.0%   40.1%   37.3%
                         std   0.036     0.035    0.032    0.034     0.033   0.028   0.027

5.3. Computational Efficiency

Overall, the LS_Boost procedure achieves superior computational efficiency during both training and testing. Running our program on a 2.53 GHz Pentium IV processor, after feature extraction the training time is less than thirty seconds and the total test time is approximately two seconds. Our technique is thus very fast compared to other approaches, in particular machine learning approaches such as neural networks and SVMs, which require much more time to train. Table 5 lists the running times of several fold recognition methods. From this table, we can see that the boosting approach is more efficient than the SVM regression method, which is desirable for genome-scale structure prediction. The running times shown in this table do not include the computation of the sequence-template alignments.

6. CONCLUSION

In this paper, we propose a new machine learning approach, LS_Boost, to solve the protein fold recognition problem. We use a regression approach which proves to be both more accurate and more efficient than classification-based approaches. One of the most significant conclusions of our experimental evaluation is that we do not need to calculate the standard z-score, and can thereby achieve substantial computational savings without sacrificing prediction accuracy.


Fig. 6. Superfamily-level specificity-sensitivity curves on the Lindahl benchmark set. The three methods LS_Boost, SVM regression and SVM classification are compared.


Fig. 7. Fold-level specificity-sensitivity curves on the Lindahl benchmark set. The three methods LS_Boost, SVM regression and SVM classification are compared.


Table 5. Running times of different machine learning approaches.

                        Training time   Testing time
LS_Boost                30 seconds      2 seconds
SVM classification      19 mins         26 mins
SVM regression          1 hour          4.3 hours
Neural network          2.3 hours       2 mins
Naive Bayes classifier  1.8 hours       2 mins
Bayes classifier        1.9 hours       2 mins

Our algorithm achieves strong sensitivity results compared to other fold recognition methods, including both machine learning methods and z-score based methods. Moreover, our approach is significantly more efficient in both the training and testing phases, which may allow genome-scale structure prediction.

References

1. T. Akutsu and S. Miyano. On the approximation of protein threading. Theoretical Computer Science, 210:261-275, 1999.

2. N.N. Alexandrov. SARFing the PDB. Protein Engineering, 9:727-732, 1996.

3. S.H. Bryant and S.F. Altschul. Statistics of sequence-structure threading. Current Opinions in Structural Biology, 5:236-244, 1995.

4. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37, 1995.

5. J.H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), October 2001.

6. L. Holm and C. Sander. Decision support system for the evolutionary classification of protein structures. 5:140-146, 1997.

7. J. Moult, T. Hubbard, F. Fidelis, and J. Pedersen. Critical assessment of methods of protein structure prediction (CASP)-round III. Proteins: Structure, Function and Genetics, 37(S3):2-6, December 1999.

8. T. Joachims. Making Large-scale SVM Learning Practical. MIT Press, 1999.

9. D.T. Jones. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287:797-815, 1999.

10. D. Kim, D. Xu, J. Guo, K. Ellrott, and Y. Xu. PROSPECT II: Protein structure prediction method for genome-scale applications. Protein Engineering, 16(9):641-650, 2003.

11. H. Li, R. Helling, C. Tang, and N. Wingreen. Emergence of preferred structures in a simple model of protein folding. Science, 273:666-669, 1996.

12. E. Lindahl and A. Elofsson. Identification of related proteins on family, superfamily and fold level. Journal of Molecular Biology, 295:613-625, 2000.

13. D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural and Statistical Classification (edited collection). Ellis Horwood, 1994.

14. J. Moult, F. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP)-round IV. Proteins: Structure, Function and Genetics, 45(S5):2-7, December 2001.

15. J. Moult, F. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins: Structure, Function and Genetics, 53(S6):334-339, October 2003.

16. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536-540, 1995.

17. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Springer, 1995.

18. J. Shi, T. Blundell, and K. Mizuguchi. FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. Journal of Molecular Biology, 310:243-257, 2001.

19. V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

20. J. Xu. Protein fold recognition by predicted alignment accuracy. IEEE Transactions on Computational Biology and Bioinformatics, 2:157-165, 2005.

21. J. Xu, M. Li, D. Kim, and Y. Xu. RAPTOR: optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology, 1(1):95-117, 2003.

22. J. Xu, M. Li, G. Lin, D. Kim, and Y. Xu. Protein threading by linear programming. In Biocomputing: Proceedings of the 2003 Pacific Symposium, pages 264-275, Hawaii, USA, 2003.

23. Y. Xu, D. Xu, and V. Olman. A practical method for interpretation of threading scores: an application of neural networks. Statistica Sinica Special Issue on Bioinformatics, 12:159-177, 2002.


A GRAPH-BASED AUTOMATED NMR BACKBONE RESONANCE SEQUENTIAL ASSIGNMENT

Xiang Wan and Guohui Lin*

Department of Computing Science, University of Alberta

Edmonton, Alberta T6G 2E8, Canada

*Email: [email protected]

Success in backbone resonance sequential assignment is fundamental to protein three-dimensional structure determination via NMR spectroscopy. Such a sequential assignment can roughly be partitioned into three separate steps: grouping resonance peaks in multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning strings of spin systems to non-overlapping consecutive amino acid residues in the target protein. Many existing assignment programs deal with these three steps separately, which works well on protein NMR data of close to ideal quality, but only moderately or even poorly on most real protein datasets, where noise and data degeneracy occur frequently. We propose in this work to partition the sequential assignment not into physical but only virtual steps, and to use their outputs to cross validate each other. The novelty lies in that the ambiguities in the grouping step are resolved by finding the highly confident strings in the chaining step, and the ambiguities in the chaining step are resolved by examining the mappings of strings in the assignment step. In this way, all ambiguities in the sequential assignment are resolved globally and optimally. The resultant assignment program, called GASA, was compared with several recent similar developments: RIBRA, MARS, PACES and a random graph approach. The performance comparisons demonstrate that GASA may be more promising for practical use.

Keywords: Protein NMR backbone resonance sequential assignment, chemical shift, spin system, connectivity graph.

1. INTRODUCTION

Nuclear Magnetic Resonance (NMR) spectroscopy has been increasingly used for protein three-dimensional structure determination. Although it has not achieved the same accuracy as X-ray crystallography, enormous technological advances have brought NMR to the forefront of structural biology 1 since the publication of the first complete solution structure of a protein (bull seminal trypsin inhibitor) determined by NMR in 1985 2. The underlying principle of protein NMR structure determination is to employ NMR spectroscopy to obtain local structural restraints, such as distances between hydrogen atoms and ranges of dihedral angles, and then to calculate the three-dimensional structure. Local structural restraint extraction is mostly guided by the backbone resonance sequential assignment, which is therefore crucial to accurate three-dimensional structure calculation. The resonance sequential assignment maps the identified resonance peaks from multiple NMR spectra to their corresponding nuclei in the target protein, where every peak captures a nuclear magnetic interaction among a set of nuclei and its coordinates are the chemical shift values of the interacting nuclei. Normally, such an assignment procedure is roughly partitioned into three main steps: grouping resonance peaks from multiple spectra into spin systems, chaining the resultant spin systems into strings, and assigning the strings of spin systems to non-overlapping consecutive amino acid residues in the target protein, as illustrated in Figure 1, where the scoring scheme quantifies the residual signature information of the peaks and spin systems.

Many existing assignment programs deal with these three steps separately 3-10. Furthermore, depending on NMR spectral data availability, different programs may have different starting points. To name a few automated assignment programs: PACES 6, a random graph approach 8 (abbreviated as RANDOM in the rest of the paper) and MARS 10 assume the availability of spin systems and focus on chaining the spin systems and the subsequent assignment; AutoAssign 3 and RIBRA 9 can start with the multiple spectral peak lists and automate the whole sequential assignment process.

*To whom correspondence should be addressed.


[Flow chart: peak lists -> Grouping -> Chaining -> Assignment -> candidates, with Scoring informing the three steps]

Fig. 1. The flow chart of the NMR resonance sequential assignment.

In terms of computational techniques, PACES uses exhaustive search algorithms to enumerate all possible strings and then performs the string assignment; RANDOM 8 avoids exhaustive enumeration through multiple calls to Hamiltonian path/cycle generation in a randomized fashion; MARS 10 first searches for all possible strings of length 5 and then uses their mapping positions to filter out the correct strings; AutoAssign 3 uses a best-first search algorithm with constraint propagation to look for assignments; RIBRA 9 applies a weighted maximum independent set algorithm for assignments.

The above-mentioned sequential assignment programs all work well on high quality NMR data, but most of them remain unsatisfactory in practice and even fail when the spectral data is of low resolution. Through a thorough investigation, we identified resonance peak grouping as the bottleneck of automated sequential assignment. Essentially, a good grouping output gives well organized, high quality spin systems, for which the correct strings can fairly easily be determined and the subsequent string assignment also becomes easy. In AutoAssign and RIBRA, grouping is done through a binary decision model that takes the HSQC peaks as anchor peaks and subsequently maps the peaks from other spectra to these anchor peaks. For such a mapping, the HN and N chemical shift values in the other peaks are required to fall within pre-specified HN and N chemical shift tolerance thresholds of the anchor peaks. However, this binary decision model inevitably suffers from its sensitivity to the tolerance thresholds. In practice, chemical shift thresholds vary from one protein dataset to another due to experimental conditions and structure complexity. Large tolerance thresholds could create too many ambiguities in the resultant spin systems, and consequently in the later chaining and assignment, leading to a dramatic decrease in assignment accuracy; on the other hand, small tolerance thresholds would produce too few spin systems when the spectral data resolution is low, hardly leading to a useful assignment.

Secondly, we found that in the traditional three-step procedure, which is the basis of many automated sequential assignment programs, each step is executed separately, without consideration of inter-step effects. Basically, the input to each step is assumed to contain enough information to produce meaningful output. However, for low resolution spectral data, the ambiguities appearing in the input of one step are very hard to resolve within that step. Though it is possible to generate multiple sets of outputs, the uncertainties contained in one input might cause more ambiguities in the outputs, which are taken as inputs to the succeeding steps. Consequently, the whole process would fail to produce a meaningful resonance sequential assignment, which might otherwise be possible if the outputs of succeeding steps were used to validate the input to the current step.

In this paper, we propose a two-phase Graph-based Approach for Sequential Assignment (GASA) that uses the spin system chaining results to validate the peak grouping and uses the string assignment results to validate the spin system chaining. GASA thus not only addresses the chemical shift tolerance threshold issue in the grouping step but also presents a new model to automate the sequential assignment. In more detail, we propose a two-way nearest neighbor search approach in the first phase to eliminate the requirement of user-specified HN and N chemical shift tolerance thresholds. The output of the first phase consists of two lists of spin systems: one contains the perfect spin systems, which are regarded as of high quality, and the other the imperfect spin systems, in which some ambiguities have to be resolved to produce legal spin systems. In the second phase, the spin system chaining is performed to resolve the ambiguities contained in the imperfect spin systems, and the string assignment step is included as a subroutine to identify the confident strings.


In other words, the ambiguities in the imperfect spin systems are resolved through finding the highly confident strings in the chaining step, and the ambiguities in the chaining step are resolved through examining the mappings of strings in the assignment step. Therefore, GASA does not separate the sequential assignment into physical steps but only virtual steps, and all ambiguities in the whole assignment process are resolved globally and optimally.

The rest of the paper is organized as follows. In Section 2, we describe the operations of GASA in detail. Section 3 presents our experimental results and discussion. We conclude the paper in Section 4.

2. THE GASA ALGORITHM

The input data to GASA can be a set of peak lists or, assuming the grouping is done, a list of spin systems. Given a list of spin systems, GASA skips the first phase and directly invokes the second phase to conduct the spin system chaining and the assignment. Otherwise, GASA first conducts a bidirectional nearest neighbor search to generate the perfect spin systems and the imperfect spin systems with ambiguities. It then invokes the second phase, which applies a heuristic search, guided by the quality of the string mapping to the target protein, to perform the chaining and assignment, resolving the ambiguities in the imperfect spin systems and meanwhile completing the assignment.

2.1. Phase 1: Filtering

For ease of exposition and fair comparison with RANDOM, PACES, MARS and RIBRA, we assume the availability of spectral peaks containing chemical shifts for Cα and Cβ, and of the HSQC peak list. One typical example would be the triple spectra HSQC, CBCA(CO)NH and HNCACB; nevertheless, GASA can accept other combinations of spectra. An HSQC spectrum contains 2D peaks, each of which corresponds to a pair of chemical shifts for an amide proton and the directly attached nitrogen. An HNCACB spectrum contains 3D peaks, each of which is a triple of chemical shifts for a nitrogen, the directly adjacent amide proton, and a carbon alpha/beta from the same or the preceding amino acid residue. A CBCA(CO)NH spectrum contains 3D peaks, each of which is a triple of chemical shifts for a nitrogen, the directly adjacent amide proton, and a carbon alpha/beta from the preceding amino acid residue. For ease of presentation, a 3D peak containing a chemical shift of the intra-residue carbon alpha is referred to as an intra-peak, and otherwise as an inter-peak. The goal of filtering is to identify all perfect spin systems without asking for chemical shift tolerance thresholds. Note that, to the best of our knowledge, all existing peak grouping models require manually set chemical shift tolerance thresholds in order to decide whether two resonance peaks should be grouped into the same spin system. Consequently, different tolerance thresholds produce different sets of possible spin systems, and for low resolution spectral data a minor change of tolerance thresholds can lead to huge differences in the formed spin systems and subsequently in the final sequential assignment. In fact, the proper tolerance thresholds are normally dataset dependent, and how to choose them is a very challenging issue in automated resonance assignment. We propose to use the nearest neighbor approach, detailed as follows using the triple spectra as an example. Due to the high quality of the HSQC spectrum, the peaks in HSQC are considered as centers, and every peak in CBCA(CO)NH and HNCACB is distributed to the closest center, using the normalized Euclidean distance. Given a center C = (HN_C, N_C) and a peak P = (HN_P, N_P, C_P), the normalized Euclidean distance between them is defined as

$$D = \sqrt{\left(\frac{\mathrm{HN}_P - \mathrm{HN}_C}{\sigma_{\mathrm{HN}}}\right)^2 + \left(\frac{\mathrm{N}_P - \mathrm{N}_C}{\sigma_{\mathrm{N}}}\right)^2}, \qquad (1)$$

where $\sigma_{\mathrm{HN}}$ and $\sigma_{\mathrm{N}}$ are the standard deviations of HN and N chemical shifts, collected from BioMagResBank (http://www.bmrb.wisc.edu).
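For illustration, a minimal sketch of this distance and the closest-center rule; the σ values below are placeholders, the real ones being collected from BioMagResBank.

```python
import math

SIGMA_HN, SIGMA_N = 0.02, 0.2   # placeholder standard deviations

def norm_dist(center, peak):
    """Normalized Euclidean distance between an HSQC center
    (HN, N) and a 3D peak (HN, N, C), per Eq. (1)."""
    return math.sqrt(((peak[0] - center[0]) / SIGMA_HN) ** 2 +
                     ((peak[1] - center[1]) / SIGMA_N) ** 2)

def closest_center(peak, centers):
    """Distribute a CBCA(CO)NH/HNCACB peak to its closest center."""
    return min(centers, key=lambda c: norm_dist(c, peak))
```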

In the ideal case, each center should have 6 peaks distributed to it in total: 4 from the HNCACB spectrum and 2 from the CBCA(CO)NH spectrum. However, due to chemical shift degeneracy, some centers may have fewer than 6, or even 0, peaks. The reason is that peaks that should be associated with these centers might turn out to be closer to other centers. Therefore, using a set of common chemical shift tolerance thresholds results in more troublesome centers.

Figure 2 illustrates a simple scenario where 3 centers are present, but using the common tolerance thresholds C1 has only 4 peaks associated while C2 has 8. In Figure 2(a), using the common tolerance thresholds, only one perfect spin system, with center C3, is formed, because the two peaks that should belong to center C1 are closer to center C2, which creates ambiguities in both spin systems. Nevertheless, a closer look at center C1 reveals that the two peaks that should belong to it, but are closer to center C2, are among its 6 closest peaks. That is, using center-specific tolerance thresholds, the spin system with center C1 can be formed by adding these two peaks (see Figure 2(b)); similarly, using center-specific tolerance thresholds, the spin system with center C2 becomes another perfect spin system.


Fig. 2. A sample scenario in peak grouping: (a) There are 3 HSQC peaks as 3 centers C1, C2, C3. Every peak is distributed to the closest center, measured by the normalized Euclidean distance. Using the common tolerance thresholds, only C3 forms a perfect spin system (with exactly 6 associated peaks). (b) Using center-specific tolerance thresholds, both C1 and C2 find their 6 closest peaks to form perfect spin systems, respectively.

We designed a bidirectional nearest neighbor model, which essentially applies center-specific tolerance thresholds, with two steps of operation: Residing and Inviting. In the Residing step, we associate each peak in the CBCA(CO)NH and HNCACB spectra with its closest HSQC peak. If an HSQC peak and its associated peaks in CBCA(CO)NH and HNCACB form a perfect spin system, the resultant spin system is inserted into the list of perfect spin systems, and the associated peaks are removed from the nearest neighbor model from further consideration. In the Inviting step, each remaining peak in the HSQC spectrum looks for the k closest peaks in the CBCA(CO)NH and HNCACB spectra, and if a perfect spin system can be formed using some of these k peaks, the spin system is formed and the associated peaks are removed. The parameter k is related to the number of peaks contained in a perfect spin system, which is known ahead of resonance assignment; a typical value of k is 1.5 times the number of peaks in a perfect spin system. In the triple spectra case (HSQC, HNCACB and CBCA(CO)NH), k = 9. These two steps are executed iteratively until no more perfect spin systems can be found, and two lists of spin systems, perfect and imperfect, are constructed. Note that this bidirectional nearest neighbor model essentially applies center-specific tolerance thresholds, and thus it does not require any chemical shift tolerance thresholds. Nonetheless, the user may specify maximal HN and N chemical shift tolerance thresholds to speed up the process, though we have noticed that minor differences in these maximal thresholds do not really affect the performance of the bidirectional search.
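The alternation of the two steps can be sketched as follows; `distance` and `is_perfect` are hypothetical callbacks standing in for the normalized Euclidean distance and the perfect-spin-system test, and the peak bookkeeping is simplified.

```python
def group_spin_systems(hsqc, peaks, distance, is_perfect, k=9):
    """Bidirectional nearest-neighbor grouping sketch.
    Returns (perfect, imperfect) lists of (center, peaks) pairs."""
    perfect, remaining, centers = [], list(peaks), list(hsqc)
    changed = True
    while changed and centers and remaining:
        changed = False
        # Residing: send every free peak to its closest center.
        resident = {id(c): [] for c in centers}
        for p in remaining:
            c = min(centers, key=lambda cc: distance(cc, p))
            resident[id(c)].append(p)
        for c in list(centers):
            if is_perfect(c, resident[id(c)]):
                perfect.append((c, resident[id(c)]))
                centers.remove(c)
                remaining = [p for p in remaining
                             if p not in resident[id(c)]]
                changed = True
        # Inviting: each remaining center recruits its k closest peaks.
        for c in list(centers):
            if not remaining:
                break
            invited = sorted(remaining, key=lambda p: distance(c, p))[:k]
            if is_perfect(c, invited):
                perfect.append((c, invited))
                centers.remove(c)
                remaining = [p for p in remaining if p not in invited]
                changed = True
    # Whatever is left constitutes the imperfect spin systems.
    imperfect = []
    for c in centers:
        mine = [p for p in remaining
                if min(centers, key=lambda cc: distance(cc, p)) is c]
        imperfect.append((c, mine))
    return perfect, imperfect
```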

2.2. Phase 2: Resolving

The goal of Resolving is to identify the true peaks contained in the imperfect spin systems and then to conduct the spin system chaining and string assignment. In general, it is very difficult to distinguish true peaks from fake peaks when every imperfect spin system is examined individually. During our development, we found that in most cases spin systems containing true peaks enable more confident string finding than those containing fake peaks. With this observation, we propose to extract true peaks from the imperfect spin systems through spin system chaining and the resultant string assignment; namely, we accept those peaks that result in spin systems having highly confident mapping positions in the target protein.

The relationships between spin systems are formulated into a connectivity graph similar to the one we proposed in another sequential assignment program, CISA 11. In the connectivity graph, one vertex corresponds to one spin system. Given two perfect spin systems v_i = (HN_i, N_i, Cα_i, Cβ_i, Cα_{i-1}, Cβ_{i-1}) and v_j = (HN_j, N_j, Cα_j, Cβ_j, Cα_{j-1}, Cβ_{j-1}), if both |Cα_i - Cα_{j-1}| < δα and |Cβ_i - Cβ_{j-1}| < δβ hold, then there is an edge from v_i to v_j with its weight calculated as

$$w(v_i, v_j) = \frac{1}{2}\left(\frac{|C^\alpha_i - C^\alpha_{j-1}|}{\delta_\alpha} + \frac{|C^\beta_i - C^\beta_{j-1}|}{\delta_\beta}\right). \qquad (2)$$

In Equation (2), both δα and δβ are pre-determined chemical shift tolerance thresholds, typically set to 0.2 ppm and 0.4 ppm, respectively, though minor adjustments are sometimes necessary to ensure a sufficient number of connectivities. Given a perfect spin system v_i = (HN_i, N_i, Cα_i, Cβ_i, Cα_{i-1}, Cβ_{i-1}) and an imperfect spin system v_j = (HN_j, N_j, Cα_{j1}, Cα_{j2}, ..., Cα_{jm}, Cβ_{j1}, Cβ_{j2}, ..., Cβ_{jn}), we check each legal combination v'_j = (HN_j, N_j, Cα_{jl}, Cβ_{jk}, Cα_{jp}, Cβ_{jq}) with l, p ∈ [1, m] and k, q ∈ [1, n]; the carbon chemical shifts with subscripts l, k represent the intra-residue chemical shifts, and those with subscripts p, q the inter-residue chemical shifts. Subsequently, if both |Cα_i - Cα_{jp}| < δα and |Cβ_i - Cβ_{jq}| < δβ hold, then there is an edge from v_i to v'_j with its weight calculated as

$$w(v_i, v'_j) = \frac{1}{2}\left(\frac{|C^\alpha_i - C^\alpha_{j_p}|}{\delta_\alpha} + \frac{|C^\beta_i - C^\beta_{j_q}|}{\delta_\beta}\right). \qquad (3)$$

If both |Cα_{jl} - Cα_{i-1}| < δα and |Cβ_{jk} - Cβ_{i-1}| < δβ hold, then there is an edge from v'_j to v_i with its weight calculated as

$$w(v'_j, v_i) = \frac{1}{2}\left(\frac{|C^\alpha_{j_l} - C^\alpha_{i-1}|}{\delta_\alpha} + \frac{|C^\beta_{j_k} - C^\beta_{i-1}|}{\delta_\beta}\right). \qquad (4)$$

Note that there may be multiple edges between a perfect spin system and an imperfect spin system, but at most one of them can be true. In GASA, no connection is allowed between two imperfect spin systems.
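The following sketch shows how the edge test and weight of Eq. (2) might look in code; the `SpinSystem` record and its field names are our own illustration, not the GASA data structure.

```python
from dataclasses import dataclass

@dataclass
class SpinSystem:
    """Illustrative record; field names are ours, not GASA's."""
    hn: float
    n: float
    ca: float        # intra-residue C-alpha
    cb: float        # intra-residue C-beta
    ca_prev: float   # inter-residue (preceding) C-alpha
    cb_prev: float   # inter-residue (preceding) C-beta

def edge_weight(u, v, delta_a=0.2, delta_b=0.4):
    """Weight of the edge u -> v (u precedes v) per Eq. (2);
    returns None when the tolerance test fails."""
    da = abs(u.ca - v.ca_prev)
    db = abs(u.cb - v.cb_prev)
    if da < delta_a and db < delta_b:
        return 0.5 * (da / delta_a + db / delta_b)
    return None
```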

Once the connectivity graph has been constructed, GASA proceeds essentially as CISA 11, applying a local heuristic search algorithm guided by the mapping quality of the generated string of spin systems in the target protein. Given a string, its mapping quality is measured by the average likelihood of its spin systems at the best mapping position for the string, where the likelihood of a spin system at a position is estimated by the histogram-based scoring scheme developed in 12. This scoring scheme is essentially naive Bayesian learning: it uses the chemical shift values collected in BioMagResBank (http://www.bmrb.wisc.edu) as prior distributions and estimates, for every observed chemical shift value, the probability that it is associated with an amino acid residue residing in a certain secondary structure. More precisely, for every type of chemical shift there is a tolerance window of length ε. For an observed chemical shift value cs, the number of chemical shift values in BioMagResBank that fall in the range (cs - ε, cs + ε), denoted N(cs | aa, ss), is counted for every combination of amino acid type aa and secondary structure type ss. The probability is then computed as

$$P(cs \mid aa, ss) = \frac{N(cs \mid aa, ss)}{N(aa, ss)},$$

where N(aa, ss) is the total number of chemical shift values of the same kind collected in BioMagResBank. The scoring scheme takes the absolute logarithm of this probability as the mapping score. Summing the individual intra-residue chemical shift mapping scores in a spin system gives, for that spin system, its mapping score to every amino acid residue in the target protein.
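A minimal sketch of this histogram score, assuming `shift_db` is a list of (value, amino-acid type, secondary-structure type) triples collected from a reference database such as BioMagResBank; the data layout is our assumption.

```python
import math

def mapping_score(cs, aa, ss, shift_db, eps=0.5):
    """|log P(cs | aa, ss)| with P = N(cs | aa, ss) / N(aa, ss):
    count database shifts of the same (aa, ss) kind falling within
    the tolerance window (cs - eps, cs + eps)."""
    same_kind = [v for v, a, s in shift_db if a == aa and s == ss]
    if not same_kind:
        return float("inf")   # no statistics for this combination
    n_match = sum(1 for v in same_kind if cs - eps < v < cs + eps)
    if n_match == 0:
        return float("inf")   # zero probability: worst score
    return abs(math.log(n_match / len(same_kind)))
```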

Therefore, the edges in the connectivity graph are weighted by the scoring scheme, and they are used to order the edges out of the ending spin system of the current string, providing the candidate spin systems for the current string to grow to. It has been observed that a sufficiently long string is itself able to detect the succeeding spin system by taking advantage of the discerning power of the scoring scheme. In each iteration of GASA, the search algorithm starts with an Open List (OL) of strings and seeks to expand the one with the best mapping score. Another list, the Complete List (CL), is used to save completed strings. In the following, we briefly describe the GASA algorithm for resolving the ambiguities in imperfect spin systems through the chaining of spin systems into strings and the subsequent assignment.

OL Initialization: Let G denote the constructed connectivity graph. GASA first searches for all unambiguous edges in G, i.e., edges whose starting vertex has out-degree 1 and whose ending vertex has in-degree 1. It then expands these edges into simple paths of a pre-defined length L by tracing their starting vertices backward and their ending vertices forward. The tracing stops when either of the following conditions is satisfied: (1) the newly reached vertices are already in the paths; (2) the length of each path reaches L. These paths are stored in OL in non-increasing order of their mapping scores. The size of OL is fixed at S, and thus only the top S paths are kept in OL. Both L and S are set so as to obtain the best trade-off between computing time and performance.

Path Growing: In this step, GASA tries to bidirectionally expand the top ranked path stored in OL. Denote this path P, its starting vertex h and its ending vertex t. All directed edges incident to h and incident from t are considered as candidate edges to potentially expand P, and the resultant expanded paths are called child paths of P. For every potential child path, GASA finds its best mapping position in the target protein and calculates its mapping score. If the mapping score is higher than that of some path already stored in OL, then this child path makes it into OL (and accordingly the path with the least mapping score is removed from OL). When none of the potential child paths of P is actually added into OL, or P is not expandable in either direction (that is, there is no edge incident to h, nor edge incident from t), path P is closed for further expansion and is added into CL. GASA iteratively considers the top ranked path in OL, and this growing process terminates when OL becomes empty.
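A skeleton of this bounded best-first growth, under simplifying assumptions: `expand(path)` yields the child paths and `map_score(path)` returns the mapping quality; both are hypothetical stand-ins for the graph expansion and the string-mapping score.

```python
def grow(open_list, expand, map_score, S=50):
    """Repeatedly take the best-scoring open path; children that
    beat the worst open path enter OL (trimmed to size S), and a
    path with no surviving children is closed. Returns CL."""
    complete = []
    while open_list:
        open_list.sort(key=map_score, reverse=True)
        path = open_list.pop(0)
        worst = map_score(open_list[-1]) if open_list else float("-inf")
        children = [c for c in expand(path) if map_score(c) > worst]
        if children:
            open_list.extend(children)
            open_list.sort(key=map_score, reverse=True)
            del open_list[S:]       # keep only the top S paths
        else:
            complete.append(path)   # not expandable any further
    return complete
```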

CL Finalizing: Let P denote the path with the highest mapping score in CL (ties broken toward the longest path). GASA performs the following filtering. First, all paths in CL whose lengths and scores are both less than 90% of the length and score of path P are discarded from further consideration, as they are of low quality compared to path P; all remaining paths are considered reliable strings. Next, only those edges occurring in at least 90% of the paths in CL are regarded as reliable; the other edges are removed, which may break the paths into shorter ones. The resultant paths are the final candidate paths.

Ambiguities Resolving: GASA scans through the paths in CL for the longest one, which is the confident string built in the current iteration. It may happen that the mapping position in the target protein for this string conflicts with mappings from previous iterations. In this case, GASA respects the previous mappings, and the current string is broken by removing the spin systems that have conflicts; consequently, the spin systems assigned in this iteration might not form a single string. The assigned spin systems are then removed from the connectivity graph G, together with the edges incident to and from them. Additionally, for the imperfect spin systems assigned in the current iteration, the peaks used to build the spin systems and edges are considered true peaks, while the others are considered fake peaks and are subsequently removed. If the remaining connectivity graph G is still non-empty, GASA proceeds to the next iteration. When GASA terminates, all the assigned spin systems and their mapping positions are reported as the output assignment.

2.3. Implementation

All components of GASA are written in the C/C++ programming language and can be compiled on both Linux and Windows systems. They can be obtained separately or as a whole package from the corresponding author.

3. EXPERIMENTAL RESULTS

We evaluated the performance of GASA through three comparison experiments with several recent works, including RANDOM, PACES, MARS and RIBRA. We note that there is another recent work, GANA 13, which uses a genetic algorithm to automatically perform backbone resonance assignment with a high degree of precision and recall, but which, due to time constraints, we were unable to compare with in the current work. The first experiment compares GASA with RANDOM, PACES and MARS only, all of which work well when assuming the availability of spin systems; their original design focuses on chaining the spin systems into strings and the subsequent string assignment. Such a comparison is interesting since the experimental results show the validity of combining the spin system chaining with the resultant string assignment in order to resolve the ambiguities in the adjacencies between spin systems. The other two experiments are used for comparison with RIBRA only, to judge the value of combining peak grouping into spin systems, spin system chaining, and string assignment all together.

RIBRA explicitly defines two criteria, precision and recall, to measure its performance. In particular, precision is defined as the percentage of correctly assigned amino acids among all assigned amino acids, and recall as the percentage of correctly assigned amino acids among the amino acids that should be assigned spin systems 9. In this work, we use the same criteria in the second and third experiments to facilitate the comparison. For the first experiment, which assumes the availability of spin systems and whose datasets are simulated so that there is no fake spin system, the performance of an assignment program is measured by the assignment accuracy, defined as the percentage of correctly assigned spin systems among all simulated spin systems. (In fact, in this case, accuracy = precision = recall.)
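For concreteness, the two measures amount to the following (a toy sketch with hypothetical counts):

```python
def precision_recall(n_correct, n_assigned, n_should_assign):
    """precision = correct / assigned; recall = correct / should-assign."""
    return n_correct / n_assigned, n_correct / n_should_assign

# Example: 80 correct out of 100 assigned, 120 assignable residues.
p, r = precision_recall(80, 100, 120)   # -> (0.8, 0.666...)
```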

3.1. Experiment 1

The dataset in Experiment 1 is simulated on the basis of the 12 proteins in 14, whose lengths range from 66 to 215. The dataset construction is detailed as follows. For each of these 12 proteins, we extracted its data entry from BioMagResBank to obtain all the chemical shift values for the amide proton HN, the directly attached nitrogen N, the carbon alpha Cα, and the carbon beta Cβ. For each amino acid residue, its four chemical shifts together with the Cα and Cβ chemical shifts from the preceding residue formed the initial spin system. Next, for each initial spin system, the chemical shifts for intra-residue Cα and Cβ were perturbed by adding random errors that follow independent normal distributions with 0 means and constant standard deviations. We adopted the widely accepted tolerance thresholds for Cα and Cβ chemical shifts, δα = 0.2 ppm and δβ = 0.4 ppm, respectively 3, 6, 8, 10. Subsequently, the standard deviations of the normal distributions were set to 0.2/2.5 = 0.08 ppm and 0.4/2.5 = 0.16 ppm, respectively. The resulting spin system is called a final spin system. These 12 instances, with suffix 1, are summarized in Table 1 (the left half).
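A one-function sketch of this perturbation rule (the generator seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def perturb(shift, tolerance):
    """Add a zero-mean normal reading error whose standard deviation
    is tolerance/2.5, e.g. 0.2/2.5 = 0.08 ppm for C-alpha."""
    return shift + rng.normal(0.0, tolerance / 2.5)
```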

To test the robustness of all four programs, we generated another set of 12 instances by doubling the tolerance thresholds (that is, δα = 0.4 ppm and δβ = 0.8 ppm). These instances, with suffix 2, are also summarized in Table 1 (the right half). Table 1 clearly shows that the instances in the second set are much harder than the corresponding ones in the first set, where the complexity of an instance is measured by the average out-degree of the vertices in the connectivity graph.

All four programs, RANDOM, PACES, MARS and GASA, were run on both sets of instances. Their performance results are collected in Table 2, and the assignment accuracies on the two sets are also plotted in Figure 3. In summary, RANDOM achieved on average 50% assignment accuracy (we followed the exact way of determining accuracy described in 8, running 1000 iterations for each instance), which is roughly the same as claimed in its original paper 8. PACES performed better than RANDOM, but it failed on seven instances whose connectivity graphs were too complex (computer memory ran out; see Discussion for more information). The collected results for PACES on these seven instances were obtained by manually reducing the tolerance thresholds to remove a significant portion of edges from the connectivity graph: whenever PACES did not finish an instance in 8 hours, the tolerance thresholds were reduced by 25%, for example from δα = 0.2 ppm to δα = 0.15 ppm. We remark that the performance of PACES in this experiment is a bit lower than that claimed in its original paper 6. There are at least three reasons for this: (1) The datasets tested in 6 are different from ours. We have done a test using the datasets in 6 to compare RANDOM, PACES, MARS and CISA, a predecessor of GASA 11, and the result tendency is very much the same as what we have seen in this experiment.


Table 1. Two sets of instances, 12 in each, for the first experiment. 'Length' denotes the length of a protein, measured by the number of amino acid residues; '#CE' records the number of correct edges in the connectivity graph, which ideally should equal (Length - 1); '#WE' records the number of wrong edges; 'Avg.OD' records the average out-degree of the connectivity graph.

        δα = 0.2 ppm, δβ = 0.4 ppm          δα = 0.4 ppm, δβ = 0.8 ppm
Length  InstanceID   #CE   #WE   Avg.OD     InstanceID   #CE   #WE   Avg.OD
66      bmr4391.1     65    20   1.29       bmr4391.2     65    51   1.76
68      bmr4752.1     67    43   1.62       bmr4752.2     67   118   2.72
78      bmr4144.1     77    30   1.37       bmr4144.2     77    86   2.09
86      bmr4579.1     85    82   1.94       bmr4579.2     85   221   3.56
89      bmr4316.1     88   168   2.88       bmr4316.2     88   349   4.91
105     bmr4288.1    104    45   1.42       bmr4288.2    104   139   2.34
112     bmr4670.1    111    35   1.30       bmr4670.2    111   109   1.96
114     bmr4929.1    113    41   1.35       bmr4929.2    113   128   2.11
115     bmr4302.1    112    44   1.38       bmr4302.2    112   166   2.46
116     bmr4353.1    114    47   1.40       bmr4353.2    114   139   2.29
158     bmr4027.1    157    85   1.53       bmr4027.2    157   224   3.04
215     bmr4318.1    206   191   1.85       bmr4318.2    206   652   3.99

Table 2. Assignment accuracies of RANDOM, PACES, MARS and GASA in the first experiment. † PACES performance on these 3 datasets was obtained by reducing the tolerance thresholds to δα = 0.15 ppm and δβ = 0.3 ppm (75%). ‡ PACES performance on one of these datasets was obtained by reducing the tolerance thresholds to δα = 0.3 ppm and δβ = 0.6 ppm (75%), and on the other three to δα = 0.2 ppm and δβ = 0.4 ppm (50%).

        δα = 0.2 ppm, δβ = 0.4 ppm                      δα = 0.4 ppm, δβ = 0.8 ppm
Length  InstanceID   RANDOM  PACES   MARS   GASA        InstanceID   RANDOM  PACES   MARS   GASA
66      bmr4391.1    0.67    0.70    0.87   0.89        bmr4391.2    0.56    0.67    0.80   0.92
68      bmr4752.1    0.37    0.77    0.97   0.97        bmr4752.2    0.32    0.72‡   0.90   0.87
78      bmr4144.1    0.40    0.51    0.97   0.99        bmr4144.2    0.33    0.35    0.97   0.99
86      bmr4579.1    0.52    0.60†   0.85   0.79        bmr4579.2    0.34    0.41‡   0.71   0.61
89      bmr4316.1    0.37    0.38†   0.96   0.99        bmr4316.2    0.29    0.17‡   0.92   0.94
105     bmr4288.1    0.56    0.63    0.95   0.98        bmr4288.2    0.49    0.47    0.93   0.92
112     bmr4670.1    0.62    0.70    0.80   0.83        bmr4670.2    0.44    0.52    0.70   0.61
114     bmr4929.1    0.66    0.83    0.97   0.97        bmr4929.2    0.48    0.74    0.97   0.99
115     bmr4302.1    0.65    0.69    0.92   0.95        bmr4302.2    0.49    0.47    0.82   0.90
116     bmr4353.1    0.48    0.67    0.80   0.90        bmr4353.2    0.45    0.52    0.73   0.87
158     bmr4027.1    0.32    0.77    0.93   0.99        bmr4027.2    0.30    0.30    0.81   0.76
215     bmr4318.1    0.38    0.48†   0.80   0.81        bmr4318.2    0.22    0.40‡   0.62   0.57

Avg.                 0.50    0.64    0.90   0.92                     0.39    0.48    0.82   0.83

on using the combination (HN, N, C a , C13, CO) of chemical shifts in 6 to compare RANDOM, PACES, MARS and CIS A u , and the result tendency is very much the same as what we have seen in this experiment. MARS and GASA performed equally very well. They both outperformed PACES and RANDOM in all instances, and even more significantly on the second set of more difficult instances, which indicates that combining the chaining and assignment together does effectively resolve the ambiguities and then make better assignments.


(a) Assignment accuracies on the 1st set of instances.


(b) Assignment accuracies on the 2nd set of instances.

Fig. 3. Plots of assignment accuracies for RANDOM, PACES, MARS and GASA on two sets of instances with different tolerance thresholds, using Cα and Cβ chemical shifts for connectivity inference.

3.2. Experiment 2

In RIBRA, 5 different datasets were simulated from the data entries deposited in BioMagResBank. Among them, one is a perfect dataset, simulated from BioMagResBank without adding any errors; the other four datasets contain four different types of errors, respectively. The false positive dataset is generated by adding 5% fake carbon peaks into the perfect CBCA(CO)NH and HNCACB peak lists. The false negative dataset is generated by randomly removing a small portion of inter carbon peaks from the perfect CBCA(CO)NH and HNCACB peak lists. The grouping error dataset is generated by adding HN, N, Cα and Cβ perturbations to inter peaks in the perfect CBCA(CO)NH peak list. The linking error dataset is generated by adding Cα and Cβ perturbations to inter peaks in the perfect HNCACB peak list.

Table 3 collects the average performances of RIBRA and GASA on these 5 datasets. As shown, there is no significant difference among the performances on the perfect, false positive and linking error datasets. GASA shows more robustness on the dataset with missing data, while RIBRA performs better on the grouping error dataset. Through detailed investigation, we found that these 5 datasets contain Cβ inter and intra peaks with 0 Cβ chemical shifts for Glycine, indicating that in the RIBRA simulation Glycine would have two inter peaks and two intra peaks in HNCACB, and the amino acid residues after Glycine would have two inter peaks in CBCA(CO)NH. However, this is not the case in real NMR spectral data. In fact, a huge amount of the ambiguity in sequential assignment results from Glycine, because it produces various legal combinations in grouping, making the identification of perfect spin systems harder. For example, spin systems containing 3, 4 and 5 peaks have the same chance of being perfect spin systems as those containing 6 peaks, and meanwhile they could be considered spin systems with missing peaks. Therefore, grouping is much easier on datasets with the simulated Cβ peaks for Glycine. Since GASA is designed to deal with real spectral data, in which there are no peaks with 0 carbon chemical shifts, its performance on the grouping error dataset is not as good as RIBRA's. To verify this, we randomly selected 14 proteins from the grouping error dataset, with lengths ranging from 69 to 186, and removed all peaks with 0 Cβ chemical shift. Both RIBRA and GASA were tested on them: RIBRA achieved 87.7% precision and 72.7% recall, while GASA achieved 88.5% precision and 79.4% recall, slightly better. It is noticed that in the construction of the grouping error dataset, RIBRA kept the perfect HSQC and HNCACB peak lists untouched and only added perturbations to the inter peaks in the CBCA(CO)NH peak list. We believe that to simulate a real NMR spectral dataset, perturbing the chemical shifts in all simulated peaks is necessary and closer to reality, because the chemical shifts deposited in BioMagResBank have been manually adjusted across multiple spectra.


Table 3. Comparison results for RIBRA and GASA in Experiment 2. Percentages in parentheses were obtained on 14 randomly chosen proteins with the Cβ peaks for Glycine removed.

                 RIBRA                      GASA
Dataset          Precision   Recall         Precision   Recall
Perfect          98.28%      92.33%         98.24%      93.44%
False positive   98.28%      92.35%         97.33%      92.24%
False negative   95.61%      77.36%         96.34%      89.0%
Grouping error   98.16%      88.57%         91.12%      81.27%
                 (87.7%)     (72.7%)        (88.5%)     (79.4%)
Linking error    96.28%      89.15%         96.17%      89.74%

Average          97.33%      87.95%         95.84%      89.14%

Even though HSQC is a very reliable experiment, the deposited HN and N chemical shifts in BioMagResBank are still slightly different from the values measured in real HSQC spectra (http://bmrb.wisc.edu/). In Experiment 3, we therefore chose not to simulate Cβ peaks for Glycine and to perturb every chemical shift in the data.

3.3. Experiment 3

The purpose of Experiment 3 is to provide more convincing comparison results between GASA and RIBRA, based on the better data simulation. For this purpose, we used the same 12 proteins as in Experiment 1, and the simulation is detailed as follows. For each of these 12 proteins, we extracted its data entry from BioMagResBank to obtain all the chemical shift values for HN, N, Cα and Cβ. For each amino acid residue in the protein, except Proline, its HN and N chemical shifts formed a peak in the HSQC peak list; its HN and N chemical shifts, with the Cα and Cβ chemical shifts from the preceding residue, formed two inter peaks in the CBCA(CO)NH peak list; and its HN and N chemical shifts, with its own Cα and Cβ chemical shifts and with the Cα and Cβ chemical shifts from the preceding residue, formed two intra peaks and two inter peaks, respectively, in the HNCACB peak list. Note that there is no Cβ peak for Glycine in either the CBCA(CO)NH or the HNCACB peak list. Next, for each peak in the HSQC, CBCA(CO)NH and HNCACB peak lists, the contained HN, N, Cα or Cβ chemical shifts were perturbed by adding random errors that follow independent normal distributions with 0 means and constant standard deviations. We chose the same tolerance thresholds as used in RIBRA: δHN = 0.06 ppm for HN, δN = 0.8 ppm for N, δα = 0.2 ppm for Cα, and δβ = 0.4 ppm for Cβ. Subsequently, the standard deviations of the normal distributions were set to 0.06/2.5 = 0.024 ppm, 0.8/2.5 = 0.32 ppm, 0.2/2.5 = 0.08 ppm, and 0.4/2.5 = 0.16 ppm, respectively.

Partial information on, and the performances of, RIBRA and GASA on these 12 proteins are summarized in Table 4. The detailed datasets are available at http://www.cs.ualberta.ca/~ghlin/src/WebTools/gasa.php. From the table, we can see that GASA formed many more spin systems than RIBRA did on every dataset, and from the assignment precision we can conclude that most of these spin systems are true spin systems. On average, GASA performed significantly better than RIBRA (precision 86.72% versus 65.23%, recall 74.18% versus 42.10%). The detailed precision and recall are also plotted in Figure 4. In summary, GASA outperformed RIBRA on all instances, and RIBRA failed to solve three instances: bmr4316, bmr4288 and bmr4929. As shown in Table 4, RIBRA achieved only 65.23% precision and 42.1% recall on average, noticeably worse than what is claimed in 9. The possible explanations for RIBRA not doing well on these 12 instances are: (1) the simulation procedure in Experiment 3 did not generate Cβ peaks with 0 chemical shift for Glycines, which causes more ambiguities in the peak grouping and the subsequent spin system chaining; (2) in the 12 simulated datasets, the chemical shifts of every peak in all the HSQC, HNCACB and CBCA(CO)NH peak lists were perturbed with random reading errors, which generates more uncertainty in every step of the sequential assignment.

4. CONCLUSIONS

In this paper, we proposed a novel two-stage graph-based algorithm called GASA for protein NMR backbone resonance sequential assignment.


Table 4. Partial information on, and the performance of, RIBRA and GASA on the 12 protein NMR datasets in Experiment 3. 'Length' denotes the length of a protein, measured by the number of amino acid residues; 'Missing' records the number of true spin systems that are not simulated in the dataset, including those for Prolines; 'Grouped' records the number of spin systems actually formed by RIBRA and GASA, respectively.

                             RIBRA                             GASA
BMRB Entry  Length  Missing  Grouped  Precision  Recall       Grouped  Precision  Recall
bmr4391     66        7      44       65.12%     48.54%       52       92.32%     81.41%
bmr4752     68        2      44       63.12%     42.33%       54       90.71%     74.22%
bmr4144     78       10      42       64.25%     39.68%       63       84.12%     77.93%
bmr4579     86        3      54       66.34%     43.22%       70       82.92%     69.93%
bmr4316     89        4      N/A      N/A        N/A          67       79.11%     62.37%
bmr4288     105       9      N/A      N/A        N/A          84       82.91%     72.32%
bmr4670     112      10      47       76.23%     35.35%       83       90.44%     73.65%
bmr4929     114       4      N/A      N/A        N/A          89       95.51%     77.32%
bmr4302     115       8      70       71.35%     46.67%       97       84.52%     76.61%
bmr4353     116      18      72       55.24%     40.75%       89       96.62%     87.38%
bmr4027     158      10      96       65.23%     42.15%       123      82.64%     68.92%
bmr4318     215      24      127      60.22%     40.17%       165      78.81%     68.13%

Average                               65.23%     42.1%                 86.72%     74.18%

The input to GASA can be spin systems or raw spectral peak lists. GASA is based on an assignment model that separates the whole assignment process only into virtual steps and uses the outputs of these virtual steps to cross validate each other. The novelty of GASA lies in that all ambiguities in the assignment process are resolved globally and optimally. The extensive comparison experiments with several recent works, including RANDOM, PACES, MARS and RIBRA, showed that GASA is more effective in dealing with NMR spectral data degeneracy and thereby provides a more promising solution to automated resonance sequential assignment.

We have also proposed a spectral dataset simulation method that generates datasets closer to reality. One of our future works is to formalize this simulation method to produce a large number of protein NMR datasets for common comparison purposes. One reason for doing so is that, although BioMagResBank as a repository has collected all known protein NMR data, there are no benchmark testing datasets in the literature. As a preliminary effort, the 12 simulated protein NMR datasets, in the format of the triple spectra HSQC, HNCACB and CBCA(CO)NH, are available at http://www.cs.ualberta.ca/~ghlin/src/WebTools/gasa.php.


(a) Assignment precision.


(b) Assignment recall.

Fig. 4. Plots of detailed assignment (a) precision and (b) recall on each of the 12 protein datasets in Experiment 3, by RIBRA and GASA.


ACKNOWLEDGMENTS

This research is supported in part by AICML, CFI and NSERC. The authors would like to thank the authors of RIBRA for providing access to their datasets and for their prompt responses to our inquiries.

References

1. A. E. Ferentz and G. Wagner. NMR spectroscopy: a multifaceted approach to macromolecular structure. Quarterly Reviews of Biophysics, 33:29-65, 2000.

2. M. P. Williamson, T. F. Havel, and K. Wüthrich. Solution conformation of proteinase inhibitor IIA from bull seminal plasma by proton NMR and distance geometry. Journal of Molecular Biology, 182:295-315, 1985.

3. D. E. Zimmerman, C. A. Kulikowski, Y. Huang, W. F. M. Tashiro, S. Shimotakahara, C. Chien, R. Powers, and G. T. Montelione. Automated analysis of protein NMR assignments using methods from artificial intelligence. Journal of Molecular Biology, 269:592-610, 1997.

4. P. Güntert, M. Salzmann, D. Braun, and K. Wüthrich. Sequence-specific NMR assignment of proteins by global fragment mapping with the program Mapper. Journal of Biomolecular NMR, 18:129-137, 2000.

5. G.-H. Lin, D. Xu, Z. Z. Chen, T. Jiang, J. J. Wen, and Y. Xu. An efficient branch-and-bound algorithm for the assignment of protein backbone NMR peaks. In Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB 2002), pages 165-174. IEEE Computer Society Press, 2002.

6. B. E. Coggins and P. Zhou. PACES: Protein sequential assignment by computer-assisted exhaustive search. Journal of Biomolecular NMR, 26:93-111, 2003.

7. T. K. Hitchens, J. A. Lukin, Y. Zhan, S. A. McCallum, and G. S. Rule. MONTE: An automated Monte Carlo based approach to nuclear magnetic resonance assignment of proteins. Journal of Biomolecular NMR, 25:1-9, 2003.

8. C. Bailey-Kellogg, S. Chainraj, and G. Pandurangan. A random graph approach to NMR sequential assignment. In Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), pages 58-67, 2004.

9. K.-P. Wu, J.-M. Chang, J.-B. Chen, C.-F. Chang, W.-J. Wu, T.-H. Huang, T.-Y. Sung, and W.-L. Hsu. RIBRA - an error-tolerant algorithm for the NMR backbone assignment problem. In Proceedings of the 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2005), pages 103-117, 2005.

10. Y.-S. Jung and M. Zweckstetter. Mars - robust automatic backbone assignment of proteins. Journal of Biomolecular NMR, 30:11-23, 2004.

11. X. Wan and G.-H. Lin. CISA: Combined NMR resonance connectivity information determination and sequential assignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2005. Submitted.

12. X. Wan, T. Tegos, and G.-H. Lin. Histogram-based scoring schemes for protein NMR resonance assignment. Journal of Bioinformatics and Computational Biology, 2:747-764, 2004.

13. H.-N. Lin, K.-P. Wu, J.-M. Chang, T.-Y. Sung, and W.-L. Hsu. GANA - a genetic algorithm for NMR backbone resonance assignment. Nucleic Acids Research, 33:4593-4601, 2005.

14. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang. Automated assignment of backbone NMR peaks using constrained bipartite matching. IEEE Computing in Science & Engineering, 4:50-62, 2002.


A DATA-DRIVEN, SYSTEMATIC SEARCH ALGORITHM FOR STRUCTURE DETERMINATION OF DENATURED OR DISORDERED PROTEINS

Lincong Wang

Dartmouth Computer Science Department Hanover, NH 03755, USA

Email: [email protected]

Bruce Randall Donald*f

Dartmouth Computer Science Department, Dartmouth Chemistry Department Dartmouth Department of Biological Sciences

Hanover, NH 03755, USA Email: [email protected]

Traditional algorithms for the structure determination of native proteins by solution nuclear magnetic resonance (NMR) spectroscopy require a large number of experimental restraints. These algorithms formulate the structure determination problem as the computation of a structure or a set of similar structures that best fit the restraints. However, for both laboratory-denatured and natively-disordered proteins, the number of restraints measured by the current NMR techniques is well below that required by traditional algorithms. Furthermore, there presumably exists a heterogeneous set of structures in either the denatured or disordered state. We present a data-driven algorithm capable of computing a set of structures (ensemble) directly from sparse experimental restraints. For both denatured and disordered proteins, we formulate the structure determination problem as the computation of an ensemble of structures from the restraints. In this formulation, each experimental restraint is a distribution. Compared with previous algorithms, our algorithm can extract more structural information from the experimental data. In our algorithm, all the backbone conformations consistent with the data are computed by solving a series of low-degree monomials (yielding exact solutions in closed form) and systematic search with pruning. The algorithm has been successfully applied to determine the structural ensembles of two denatured proteins, acyl-coenzyme A binding protein (ACBP) and eglin C, using real experimental NMR data.

1. INTRODUCTION

The protein folding problem is fundamental in structural biology. It can be stated as the problem of elucidating how a protein can fold, in less than a second, from the denatured state to the native state. One challenge to solving the folding problem is the lack of knowledge about the structures of proteins in the denatured state. In this paper, "denatured state" means the state in which the backbone NH groups have little protection against 1H/2H-exchange. It has been estimated that about one-third of eukaryotic proteins are disordered or partially-disordered in their native state in solution. Such natively-disordered proteins play key roles in signal transduction and genetic regulation as well as in human diseases such as Alzheimer's and Parkinson's diseases. Although much progress has been made in understanding how the structure defines the biological function of native proteins, it is not well-known how the structures of disordered proteins determine their function. For denatured proteins, an accurate, quantitative structural distribution is key to solving the protein-folding problem 27, 20, while for disordered proteins, an accurate distribution of the structures is critical for establishing the structure-function relationship. To quantify the structural distribution, it is necessary to compute the ensemble of structures directly from experimental data. At present, NMRᵃ is the only available technique that can measure many individual structural restraints for these proteins. However, even using the most advanced NMR techniques, the number of measured restraints is well below that required by traditional

*Corresponding author. †This work is supported by the following grants to B.R.D.: National Institutes of Health (R01 GM 65982) and National Science Foundation (EIA-0305444). ᵃAbbreviations used: NMR, nuclear magnetic resonance; PRE, paramagnetic relaxation enhancement; RDC, residual dipolar coupling; CH, the bond vector between backbone Cα and Hα; NH, the bond vector between backbone amide nitrogen and amide proton; CC', the bond vector between backbone Cα and C'; NC', the bond vector between backbone amide nitrogen and C'; RMSD, root-mean-squared deviation; NOE, nuclear Overhauser effect; POF, principal order frame; SVD, singular value decomposition; ACBP, acyl-coenzyme A binding protein; vdW, van der Waals; MD, molecular dynamics; SA, simulated annealing.


NMR structure determination methods 5, 11. Furthermore, these methods formulate the structure determination problem as the computation of a structure (or a set of similar structures) that best fits the restraints. Such a formulation is appropriate for the structure determination of a native protein having a single dominant conformation. However, a new formulation is necessary for computing the structures of either denatured or disordered proteins, which are presumably heterogeneous in solution 19, 24. In this paper, we first formulate the structure determination problem of both denatured and disordered proteins as the determination of a heterogeneous ensemble of structures from sparse experimental restraints measured in either the denatured or disordered state. In this formulation, the restraints are distributions. We then present a data-driven algorithm capable of accurately computing denatured backbone structures directly from sparse restraints. In our algorithm, the conformational space consistent with the data is searched systematically, rather than randomly as in previous approaches 14, 4, 16, 6. The algorithm uses considerably more experimental data than previous approaches for characterizing the denatured state from experimental data 16, 14, 4. The larger amount of data, together with the systematic search, significantly increases the accuracy of the computed ensembles. In the following, we only present the algorithm and its application to denatured proteins. The algorithm can be applied to natively-disordered proteins as well. Our contributions are:

(1) A new formulation of the structure determination problem for denatured proteins.

(2) A data-driven, systematic search algorithm for computing an ensemble of all-atom backbone structures for denatured proteins directly from experimental data.

(3) Successful application of the algorithm to compute the structure ensembles of two denatured proteins from real, biological NMR data.

This paper concentrates on the computer science aspects of the algorithm. We will only describe briefly the applications of the algorithm to two real biological systems. The biological significance of our results and the use of the computed ensembles to understand protein folding will be addressed in detail in another paper.

1.1. Organization of the paper

We begin with a probabilistic interpretation of NMR data in the denatured state in terms of equilibrium statistical physics. Section 3 presents a formulation of the structure determination problem of denatured proteins using experimental NMR data such as the orientational restraints from residual dipolar coupling (RDC) 25, 12 and distance restraints from paramagnetic relaxation enhancement (PRE) 1 experiments. Section 4 reviews existing approaches. Section 5 presents the mathematical basis of the algorithm. Section 6 describes our algorithm for computing an ensemble of structures. Section 7 briefly presents the results of applying our algorithm to compute the structural ensembles of two denatured proteins, acyl-coenzyme A binding protein (ACBP) and eglin C, from real, experimental NMR data. Finally, in section 8 we analyze the complexity of the algorithm and describe its performance in practice.

2. A PROBABILISTIC INTERPRETATION OF RESTRAINTS IN THE DENATURED STATE

Our algorithm first computes both the backbone dihedral angles and the orientation of each structural fragmentᵇ independently, using the orientational restraints from RDCs, and then assembles the computed structural fragments into a complete structure using the distance restraints from PREs. Our algorithm is based on a new formulation for structure determination in which each experimental restraint is converted to a distribution. In the following, we present the physical basis for the formulation.

RDCs can be measured on proteins weakly aligned in a dilute liquid crystal medium 25, 26. The RDC, r, between two nuclear spins is related to the direction of the corresponding internuclear unit vector v = (x, y, z) by 22

    r = S_xx x² + S_yy y² + S_zz z²    (1)

where S_xx, S_yy and S_zz are the three diagonal elements of a diagonalized Saupe matrix S (the alignment tensor), specifying the ensemble-averaged anisotropic orientation of a molecule in the laboratory frame; x, y and z are, respectively, the x-, y- and z-components of v in a principal

ᵇA structural fragment consists of m consecutive residues; typically m ≈ 10.


order frame (POF) which diagonalizes S. Before diagonalization, S is a 3 × 3 symmetric, traceless matrix with five independent elements. Note that x² + y² + z² = 1 and S_xx + S_yy + S_zz = 0. Thus, given both the RDC r and the tensor S, Eq. (1) represents the projection onto a 2-sphere of an ellipse of solutions for the orientation of the vector v with respect to a global frame (POF) common to all the RDCs measured on the same aligned protein. The tensor S must be known first in order to extract orientational restraints from RDC data. RDCs, alone or in combination with other NMR-measured geometric restraints, have been used extensively to determine and refine the solution structures of native proteins 7, 29.
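As a concrete illustration of Eq. (1), the sketch below back-computes an RDC from a unit internuclear vector expressed in the POF; the tensor values and the vector are hypothetical placeholders, not values from the paper.

```python
# A minimal sketch of Eq. (1), assuming a diagonalized Saupe tensor.
import numpy as np

def rdc(v, s_xx, s_yy, s_zz):
    """Back-compute an RDC from an internuclear vector v = (x, y, z)
    expressed in the principal order frame (POF)."""
    x, y, z = v / np.linalg.norm(v)   # enforce x^2 + y^2 + z^2 = 1
    return s_xx * x**2 + s_yy * y**2 + s_zz * z**2

# Traceless tensor: s_xx + s_yy + s_zz = 0 (hypothetical values, in Hz).
s_xx, s_yy, s_zz = 4.0, 8.0, -12.0
print(rdc(np.array([0.3, 0.5, 0.8]), s_xx, s_yy, s_zz))
```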

Recently, it has been shown that RDCs can also be measured accurately on weakly-aligned, denatured proteins 23, 2, 18, 9 and disordered proteins 4. For a folded, native n-residue protein, a single global tensor, S, can be used to interpret all the experimental RDCs by Eq. (1). However, according to equilibrium statistical physics 15, a set of tensors, Q, is required to interpret the RDCs measured in the denatured state. Each tensor in the set Q represents a cluster of denatured structures that have similar structures and align similarly in the medium. The set of RDCs corresponding to each tensor in the set Q can be sampled from the individual distributions associated with each measured RDC. The distribution for each RDC can be defined by an RDC random variable that has as its sampling space the RDCs of all the orientations of the corresponding vector v in the different structures that exist in the denatured state. The experimentally-measured RDC value is the expectation. The different tensors in the set Q represent different conformations in the denatured state that are oriented differently in the aligning medium. The tensor S is also a random variable.

Paramagnetic relaxation enhancement (PRE) is similar to the nuclear Overhauser effect (NOE) 30 in terms of physics.ᶜ However, PRE can be observed even in the denatured state between an electron spin and a nuclear spin as far as 20 Å apart, while no NOE between two nuclear spins can be observed at such a distance. The reason is that PRE is almost 2,000-fold stronger than the NOE. In fact, long-range NOEs, which are critical for computing structures using traditional methods 5, 11, are generally too weak to be detected on denatured proteins. The PRE-derived distance, d, in the denatured state is also a random variable, where the measured value is an average over all the possible structures in the denatured state.
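The r⁻⁶ dependence noted in footnote c is what makes a distance estimate possible. The sketch below shows one common way such a conversion can be done, calibrating against a reference spin pair; the function name and the numbers are our own illustrative assumptions, not the authors' calibration procedure.

```python
# Under the isolated two-spin assumption, cross-peak intensity ~ r^-6,
# so a distance can be estimated relative to a reference pair.
def distance_from_intensity(i_obs, i_ref, r_ref):
    """Estimate a spin-spin distance (same units as r_ref) from an
    observed intensity, given a reference pair (i_ref, r_ref)."""
    return r_ref * (i_ref / i_obs) ** (1.0 / 6.0)

# Hypothetical example: a weak peak relative to a 10 Å reference pair.
print(distance_from_intensity(i_obs=0.02, i_ref=1.0, r_ref=10.0))
```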

3. THE STRUCTURE DETERMINATION PROBLEM FOR DENATURED PROTEINS

As is well known, given the bond lengths, bond angles and peptide plane ω angles, the backbone conformation of an n-residue protein is completely determined by a 2n-tuple of backbone dihedral angles, c_n = (φ_1, ψ_1, ..., φ_n, ψ_n), where (φ_i, ψ_i) are the dihedral angles of residue i. This 2n-tuple will be called a conformation vector, c_n. In fact, the sines and cosines of the 2n (φ, ψ) angles are sufficient to determine a backbone conformation. The structure determination problem for a denatured protein is to compute an ensemble of presumably heterogeneous structures that are consistent with the experimental data within a relatively large range for the data. More precisely, the structure determination problem for denatured proteins can be formulated as the computation of a set of conformation vectors, c_n, given the distributions for all the RDCs r and for all the PREs d.

4. PREVIOUS WORK

Solution NMR spectroscopy is the only experimental technique currently capable of measuring geometric restraints for individual residues of a denatured protein at the atomic level. Traditional NMR structure determination methods 5, 11, developed for computing structures in the native state, require more than 10 restraints per residue, derived mainly from NOE experiments, to compute a well-defined native structure. Recently developed RDC-based approaches for computing native structures rely on either heuristic approaches such as restrained molecular dynamics (MD) and simulated annealing (SA) 10, 13 or a structural database 8, 21. It is not clear how to extend these native structure determination approaches to compute the desired denatured structures. Traditional NOE-based approaches cannot be used since long-range NOEs, which are critical for applying the traditional approaches to determine NMR structures, are usually too weak to be detected in the denatured

ᶜThe main difference between PRE and NOE is that PRE results from the dipole-dipole interaction between an electron and a nucleus, while the physical basis of NOE is the dipole-dipole interaction between two nuclei. Under the isolated two-spin assumption, both PRE and NOE (that is, the observed intensity of cross-peaks in either a PRE or NOE experiment) are proportional to r⁻⁶, where r is the distance between the two spins.


state.ᵈ Previous RDC-based MD/SA approaches typically require either more than 5 RDCs per residue, or at least 3 RDCs and 1 NOE per residue (most of them long-range), to compute a well-defined native structure. In the database-based approaches, RDCs are employed to select structural fragments mined from the Protein Data Bank (PDB) 3, a database of experimentally-determined native structures. A backbone structure for a native protein is then constructed by linking together the RDC-selected fragments using a heuristic method. Compared with the MD/SA approaches, the database-based approaches require fewer RDCs. However, these database-based approaches have not been extended to compute structures for denatured proteins. In summary, neither the traditional NOE-based methods nor the above RDC-based approaches can be applied to compute all-atom backbone structures in the denatured state at this time.

Recently, approaches 14, 4 have been developed to build structural models for the denatured state using one RDC per residue. These approaches are generate-and-test. They begin with the construction of a library of backbone (φ, ψ) angles using only the angles occurring in the loops of the native proteins deposited in the PDB. Then, they randomly select (φ, ψ) angles from the library to build an ensemble of backbone models. Finally, the models are tested by comparing the experimental RDCs with the average RDCs back-computed from the ensemble of backbone structures. There are three problems with these methods. First, the (φ, ψ) angle library is biased, since only the (φ, ψ) angles from the loops of native proteins are used. Consequently, the models constructed from the library may be biased towards the native conformations in the PDB. Second, random selection may miss valid conformations. Third, the agreement of the experimental RDCs with the average RDCs back-computed from the ensemble of structures may result from over-fitting. Over-fitting is likely since one RDC per residue is not enough to restrain the orientation of an internuclear vector (such as the NH bond vector) to a finite set. In fact, given an alignment tensor S, an infinite number of backbone conformations can agree with one RDC per residue, while only a finite number of conformations agree with two RDCs per residue 29, 28, 31.

All-atom models for the denatured state have been computed previously in a generate-and-test manner in 16, by using PREs to select structures from all-atom MD simulation at high temperature. Due to the data sparsity and large experimental errors, PREs alone are, in general, insufficient to define precisely even the backbone Cα-trace. The generated models have large uncertainty. A generate-and-test approach 6 using mainly NOE distance restraints has been developed to determine the ensemble of all-atom structures of an SH3 domain in the unfolded state in equilibrium with a folded state.ᵉ However, the relatively large experimental errors, as well as the sparsity and locality of NOEs, similarly introduce large uncertainty in the resulting ensemble of structures, which was selected mainly by the NOEs.

5. THE MATHEMATICAL BASIS OF OUR ALGORITHM

Our algorithm uses a set of low-degree (≤ 4) monomials for computing, exactly and in constant time, the sines and cosines of individual backbone dihedral (φ, ψ) angles. These monomials have been derived from the RDC equation (1) and protein backbone kinematics, and have been described in detail elsewhere 29, 28, 31. In the following, for ease of exposition, we state the monomials for computing, respectively, the sine and cosine of the backbone φ angle from a CH RDC and those of the ψ angle from an NH RDC 28. NH and CH RDCs denote, respectively, the RDCs measured on NH and CH bond vectors. Starting with peptide plane i, we can compute the sines and cosines of the φ_i, ψ_i angles, respectively, from the CH RDC of residue i and the NH RDC of residue i + 1 using the following two Propositions:

Proposition 5.1 28 Given the orientation of peptide plane i in the POF (see section 2) of RDCs, the x-component of the CH unit vector u of residue i, in the POF, can be computed from the CH RDC by solving a quartic monomial in x. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given u, the sine and cosine of the φ_i angle can be computed by solving linear equations.

ᵈThe denatured state in this paper (see section 1) has been called the "unfolded state" 20. ᵉAn unfolded state in equilibrium with a folded state 6 differs from the denatured state in this paper. In 6, the observed NOEs result from the equilibrium between the folded and unfolded states, not from the unfolded state alone.


Proposition 5.2 28 Given the orientation of peptide plane i in the POF of RDCs, the x-component of the NH unit vector v of residue i + 1, in the POF, can be computed from the NH RDC by solving a quartic monomial in x. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given v, the sine and cosine of the ψ_i angle can be computed by solving linear equations.

According to Propositions 5.1-5.2, given the orientation of peptide plane i (plane i stands for the peptide plane for residue i in the protein sequence), the sines and cosines of the backbone φ_i, ψ_i angles can be computed, exactly in closed form, from the CH RDC of residue i and the NH RDC of residue i + 1. Furthermore, the orientation of the peptide plane for residue i + 1 can be computed, exactly in closed form, from the orientation of the peptide plane for residue i and the sines and cosines of the intervening φ_i, ψ_i angles. Thus, given a tensor S, the orientation of the peptide plane for residue 1 (the first peptide plane) of the protein sequence, and the CH and NH RDCs, all the sines and cosines of the backbone (φ, ψ) angles can be computed from the RDCs by solving a series of quartic and linear equations. Thus, the set of conformations consistent with two RDCs per residue is finite and algebraic. In conclusion, given the bond lengths, bond angles, peptide plane ω angles, and the orientation of the first peptide plane, as well as a tensor S and a set of two RDCs per residue sampled from the RDC distributions (see section 2), a finite and algebraic set of backbone conformations can be determined exactly. Furthermore, this set of conformations can be computed by a systematic search such as a depth-first search over a k-ary tree, where k ≤ 64, the maximum number of solutions for the (φ, ψ) angles of a single residue 28, 29. Taken together, we have stated the mathematical basis of our algorithm: an ensemble of denatured structures can be computed exactly by solving a series of monomials, each of degree ≤ 4, using different sets of two RDCs per residue sampled from their distributions and the corresponding tensors S from the set Q.
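The exact-solution step can be made concrete with a small sketch: given the assembled coefficients of the quartic monomial from Propositions 5.1-5.2, the real roots in [−1, 1] are the candidate x-components of the unit vector. The coefficient values below are hypothetical placeholders, not a derivation of the actual quartic.

```python
# Solve a degree-<=4 polynomial and keep only physically meaningful roots.
import numpy as np

def real_roots_of_quartic(coeffs, tol=1e-9):
    """Real roots of a polynomial given its coefficients in numpy
    convention (highest degree first), restricted to [-1, 1] because
    x is a component of a unit vector."""
    roots = np.roots(coeffs)
    return [r.real for r in roots if abs(r.imag) < tol and -1.0 <= r.real <= 1.0]

# Each surviving root x yields y from Eq. (1) and z from x^2+y^2+z^2 = 1,
# so a residue contributes a small finite set of (phi, psi) candidates,
# which is what makes the depth-first search over a k-ary tree possible.
print(real_roots_of_quartic([4.0, 0.0, -3.0, 0.0, 0.2]))
```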

6. AN ALGORITHM FOR STRUCTURE DETERMINATION OF DENATURED PROTEINS

Our algorithm for computing the structure ensemble of a denatured protein extends, but differs substantially from, our previous algorithms 29, 28, 31 for computing the backbone structures of native proteins. The goal of the present algorithm is to compute a presumably heterogeneous ensemble of structures that are consistent with the experimental data within a large range, rather than a single structure or a set of similar structures that best fits the data (as in the native state). For the native state, a single tensor, S, can be used to interpret all the experimental RDCs by Eq. (1). Moreover, for native proteins, this single tensor can be determined during structure computation (if secondary structure elements are known 29, 28). However, it is physically infeasible to use a single tensor to interpret all the experimental RDCs on a denatured protein (see section 2). Rather, one should use a set, Q, of different tensors to compute all the possible different conformations in the denatured state. This set of tensors Q is updated continuously during the structure computation. Our algorithm computes the ensemble using a divide-and-conquer strategy for efficiency.

6.1. Divide-and-conquer strategy

The algorithm first divides the entire protein sequence into p fragments, F_1, ..., F_p, and p − 1 linkers, L_1, ..., L_{p−1} (Fig. 1). A linker consists of the residues between two neighboring fragments. Next, the algorithm computes, independently, an ensemble of structures, W_i, for each fragment i, where i = 1, ..., p. This step is called Fragment computation (Fig. 2) and will be detailed in Section 6.2. Next, for each structure in ensemble W_i, i = 1, ..., p, we compute the corresponding tensor t_i by singular value decomposition (SVD) 17 and save each t_i into a set T_i. Given a structure and the experimental RDCs, a tensor S can be computed by using SVD to minimize the RDC RMSD, E_r = √(Σ_{j=1}^{u} (r_j − r'_j)² / u), where u is the total number of RDCs for fragment F_i, and r_j and r'_j are, respectively, the experimental RDC for residue j of F_i and the RDC back-computed from the structure using the tensor S by Eq. (1). As shown in Eq. (1), given a structure, r'_j is a function of S, so by minimizing E_r, S can be computed by SVD 17. Next, the algorithm merges all the tensors in the sets T_i, i = 1, ..., p, into p-tuples, (t_1, ..., t_p), such that t_i is from the set T_i and all p tensors in a p-tuple have their S_yy and S_zz values agree with one another up to the ranges defined by [S_yy − δ_yy, S_yy + δ_yy] and [S_zz − δ_zz, S_zz + δ_zz], where δ_yy and δ_zz are thresholds. For each merged p-tuple,


the algorithm then computes their common tensor by SVD, using the corresponding structures in W_i, i = 1, ..., p, and all the experimental RDCs for F_i, i = 1, ..., p, and saves the common tensors into a set Q. The diagonalization of the tensor returned from SVD gives not only the diagonal elements, S_xx, S_yy and S_zz, but also the orientation of each fragment in the common POF. In particular, the orientations of all the peptide planes in the POF are returned from SVD, where the first and last peptide planes are used for computing (φ, ψ) angles from RDCs by Propositions 5.1-5.2. Finally, the algorithm computes the linkers, L_1, ..., L_{p−1}, using every common tensor in Q and assembles the corresponding fragments and linkers into complete backbone structures. This step is called Linker computation and assembly (Fig. 3) and will be detailed in Section 6.3.
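The tensor-update step can be sketched as follows. The full (non-diagonalized) form of Eq. (1) is linear in the five independent Saupe elements, so given internuclear unit vectors from a trial structure and their experimental RDCs, S can be fitted by least squares, in the spirit of the SVD-based order matrix analysis of 17. This is a minimal sketch with hypothetical vectors and RDC values, not the authors' implementation.

```python
# Fit the five independent elements of a symmetric, traceless Saupe
# matrix S from unit vectors and measured RDCs by linear least squares
# (np.linalg.lstsq uses the SVD internally).
import numpy as np

def fit_saupe(unit_vectors, rdcs):
    rows = []
    for x, y, z in unit_vectors:
        # r = Sxx x^2 + Syy y^2 + Szz z^2 + 2 Sxy xy + 2 Sxz xz + 2 Syz yz,
        # with Szz = -(Sxx + Syy) eliminating one unknown (traceless S).
        rows.append([x*x - z*z, y*y - z*z, 2*x*y, 2*x*z, 2*y*z])
    sxx, syy, sxy, sxz, syz = np.linalg.lstsq(
        np.array(rows), np.array(rdcs), rcond=None)[0]
    s = np.array([[sxx, sxy, sxz],
                  [sxy, syy, syz],
                  [sxz, syz, -(sxx + syy)]])
    return np.linalg.eigvalsh(s)  # diagonal elements; np.linalg.eigh also
                                  # yields the eigenvectors, i.e. the POF

vs = [(0.3, 0.5, 0.812), (0.7, 0.1, 0.707), (0.0, 0.6, 0.8),
      (0.9, 0.2, 0.387), (0.5, 0.5, 0.707)]       # hypothetical unit vectors
print(fit_saupe(vs, rdcs=[5.1, -2.3, 7.8, 1.2, -4.4]))
```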

6.2. Fragment computation

A structure ensemble, W_i, of an m-residue fragment F_i is computed as follows (Fig. 2). First, the algorithm estimates an initial tensor S_{0,i} by SVD, using the experimental RDCs and a model built with the backbone (φ, ψ) angles for polyproline II. The algorithm then selects b different sets of RDCs, R_1, ..., R_b, for the fragment by randomly sampling CH and NH RDC values from their respective normal distributions. Next, for each R_t, t = 1, ..., b, the algorithm computes an optimal conformation vector, c_{1,t}, by systematically searching over all the possible conformation vectors, c_m, of 2m-tuples (φ_1, ψ_1, ..., φ_m, ψ_m) computed from R_t, where the φ_k angle for residue k is computed according to Proposition 5.1 from the sampled CH RDC for residue k, and the ψ_k angle is computed according to Proposition 5.2 from the sampled NH RDC for residue k + 1. An optimal conformation vector is a vector which has the minimum score under a scoring function T_F, defined as

    T_F = E_r² + w_v E_v²    (2)

where E_r = √(Σ_{k=1}^{m} Σ_{j=1}^{u} (r_{j,k} − r'_{j,k})² / (um)) is the RDC RMSD, u is the number of RDCs for each residue, and r_{j,k} and r'_{j,k} are, respectively, the experimental RDC for RDC j of residue k and the corresponding RDC back-computed from the structure. The variables w_v and E_v are, respectively, the relative weight and score for van der Waals (vdW) repulsion. For each conformation vector c_m of a fragment, E_v is computed with respect to a quasi-polyalanine model built with c_m. The quasi-polyalanine model consists of alanine, glycine and proline residues with proton coordinates. If a residue is neither a glycine nor a proline in the protein sequence, it is replaced with an alanine residue. If the vdW distance between two atoms computed from the model is larger than the minimum vdW distance between the two atoms, the contribution of this pair of atoms to E_v is set to zero. Since the (φ, ψ) angles are computed from the sampled CH and NH RDCs by exact solution, the back-computed NH and CH RDCs are in fact the same as their sampled values. For additional RDCs (CC' or NC' RDCs), E_r is minimized as cross-validation using Eq. (2). For each sampled set of RDCs, R_t, t = 1, ..., b, the output of this systematic search step is the optimal conformation vector c_{1,t} in Fig. 2. The search step is followed by an SVD step to update the tensors, S_{1,t}, using the experimental RDCs and the just-computed fragment structure. Next, the algorithm repeats the cycle of systematic search followed by SVD (systematic-search/tensor-update) to compute a new ensemble of structures using each of the newly-computed tensors, S_{1,t}, t = 1, ..., b. The output of the fragment computation for a fragment i is a set of conformation vectors c_{h,w}, w = 1, ..., b^h, where h is the number of cycles of systematic-search/tensor-update.
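A minimal sketch of the RDC-sampling step is given below, assuming each measured RDC defines a normal distribution centered at the experimental value, with the standard deviations reported in section 11 (8.0 Hz for CH and 4.0 Hz for NH RDCs). The example RDC lists are hypothetical.

```python
# Draw b sets of perturbed (CH, NH) RDCs for an m-residue fragment.
import numpy as np

rng = np.random.default_rng(0)

def sample_rdc_sets(ch_rdcs, nh_rdcs, b, ch_sd=8.0, nh_sd=4.0):
    """Each set R_t pairs one sampled CH RDC and one sampled NH RDC
    per residue, drawn around the experimental values (in Hz)."""
    ch = np.asarray(ch_rdcs)
    nh = np.asarray(nh_rdcs)
    return [(rng.normal(ch, ch_sd), rng.normal(nh, nh_sd)) for _ in range(b)]

sets = sample_rdc_sets(ch_rdcs=[12.1, -3.4, 6.7], nh_rdcs=[4.2, -1.1, 2.9], b=4)
print(sets[0])
```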

6.3. Linker computation and assembly

Given a common tensor S in the set Q and the orientations of two fragments F_1 and F_2 in the POF for S, an m-residue linker L_1 between them is computed as shown in Fig. 3. The computation of a linker can start from either its N-terminus, as detailed in Fig. 3, or from its C-terminus, depending on the availability of experimental data. For the latter, the interested reader can see Propositions 10.1 and 10.2 (section 10 of APPENDIX) for the details. Every two consecutive fragments are assembled (combined), recursively, into a single fragment, and the process stops when all the fragments have been assembled. The scoring function for the linker computation, T_L, is computed similarly to T_F:

    T_L = E_r² + w_v E_v² + w_p E_p²    (3)

The main difference is that E_v for a linker is computed with respect to an individual structure composed of all the previously-computed and linked fragments and the current linker, built with the backbone (φ, ψ) angles computed from RDCs. In addition, the PRE violation, E_p, which is essentially the PRE RMSD for an individual


[Figure 1: flow diagram of the divide-and-conquer strategy. Divide the protein sequence into fragments F_1, ..., F_p and linkers L_1, ..., L_{p−1}; Fragment computation produces structure ensembles; Tensor update produces sets of tensors; Merge (S_yy ± δ_yy and S_zz ± δ_zz) produces a set of p-tuples; Tensor update produces the set Q of common alignment tensors; Linker computation and Assemble produce the final ensemble c_1, c_2, ..., c_q.]

Fig. 1. Divide-and-conquer strategy. The input to the algorithm is: the protein sequence, at least two RDCs per residue in a single medium, and PREs (if available). The terms c_i denote conformation vectors for the complete backbone structure. Please see the text for the definitions of other terms and an explanation of the algorithm.

[Figure 2: fragment computation as alternating Systematic Search and Tensor Update (SVD) steps, expanding an ensemble of conformation vectors c_{1,t} with a set of tensors S_{1,t} into an ensemble c_{2,w} with tensors S_{2,w}.]

Fig. 2. Fragment computation: the computation of a structure ensemble of a fragment. The figure shows only two cycles of systematic search followed by SVD. Please see the text for the definition of terms and an explanation of the algorithm.

structure composed of all the previously-computed and linked fragments and the current linker, is computed as E_p = √(Σ_{i=1}^{o} (d_i − d'_i)² / o), where d_i and d'_i are, respectively, the experimental PRE distance and the distance between the two Cα atoms back-computed from the model, and o is the number of PRE restraints. An experimental PRE


distance restraint is between two Cα atoms and is computed from the PRE peak intensity 16. If d'_i < d_i, the contribution of PRE violation i to E_p is set to zero. This search step is similar to our previous systematic searches, as detailed in 29, 28, 31. The key difference is that the linker scoring function, Eq. (3), has two new terms, E_v and E_p, and lacks the term in 29, 28, 31 for restraining (φ, ψ) angles to the favorable Ramachandran region for a typical α-helix or β-strand.
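A hedged sketch of this flat-bottomed PRE term: a restraint contributes to E_p only when the back-computed Cα-Cα distance exceeds the experimental PRE distance. The input arrays are hypothetical.

```python
# RMSD-style PRE violation over o restraints, flat-bottomed so that
# model distances shorter than the restraint cost nothing.
import numpy as np

def pre_violation(d_exp, d_model):
    d_exp = np.asarray(d_exp)      # experimental PRE distances (Å)
    d_model = np.asarray(d_model)  # back-computed Ca-Ca distances (Å)
    excess = np.maximum(d_model - d_exp, 0.0)
    return np.sqrt(np.mean(excess**2))

print(pre_violation(d_exp=[18.0, 20.0, 15.0], d_model=[17.2, 22.5, 15.1]))
```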

7. APPLICATION TO REAL BIOLOGICAL SYSTEMS

We have applied our algorithm to compute the structure ensembles of two proteins, acid-denatured ACBP and urea-denatured eglin C, from real experimental NMR data.

Application to acid-denatured ACBP. An ensemble of 231 structures has been computed for ACBP denatured at pH 2.3. The experimental NMR data 9 has both PREs and four backbone RDCs per residue: NH, CH, NC' and CC'. All 231 structures have no vdW repulsion larger than 0.1 Å, except for a few vdW violations as large as 0.35 Å between the two nearest neighbors of a proline and the proline itself. These 231 structures satisfy all the experimental RDCs (CH, NH, CC' and NC') much better than the native structure, and have PRE violations, E_p, in the range of 4.4-7.0 Å. The native structure also has very different Saupe elements, S_yy and S_zz. Further analysis of the computed ensemble shows that acid-denatured ACBP is neither random coil nor native-like.

Application to urea-denatured eglin C. An ensemble of 160 structures was computed for eglin C denatured in 8 M urea. No structure in the ensemble has a vdW violation larger than 0.1 Å, except for a few vdW violations as large as 0.30 Å. The computed structures satisfy the experimental CH and NH RDCs much better than the native structure. The native structure also has very different Saupe elements, S_yy and S_zz. Further analysis of the computed ensemble also shows that urea-denatured eglin C is neither random coil nor native-like.

8. ALGORITHMIC COMPLEXITY AND PRACTICAL PERFORMANCE

The complexity of the algorithm (Fig. 1) can be analyzed as follows. Let the protein sequence be divided into p m-residue fragments and p − 1 m-residue linkers, and let the size of the samplings be b. The systematic-search step in Fragment computation takes O(pbf^m) time to compute all the p ensembles for the p fragments (Fig. 2), where f is the number of (φ, ψ) pairs for each residue computed from two quartic equations (Propositions 5.1-5.2) and pruned using a real-solution filter, as described in 28, and also a vdW filter (repulsion). A single SVD step in Fig. 2 takes m·5² + 5³ = O(m) time. Thus, h cycles of systematic-search/SVD take t_F time in the worst case, where t_F = Σ_{j=1}^{h} p b^j (f^m + m) = p (b^{h+1} − b)/(b − 1) (f^m + m) = O(p b^{h+1} (f^m + m)) = O(p b^{h+1} f^m), since f^m is much larger than m. In the implementation, b = 8 × 1024 and h = 2 (see section 11 of APPENDIX). In practice, only a small number (about 100) of structures, out of all the possible b^h computed structures for fragment i (section 6.2 and Fig. 2), are selected and saved in W_i (Fig. 1); that is, the selected structures have T_F < T_max or T_L < T_max, where T_F and T_L are computed, respectively, by Eq. (2) and Eq. (3), and T_max is a threshold. The Merge step takes O(p w^p log w) time, where w = |W_i| is the number of structures in W_i. The Merge step generates q p-tuples of alignment tensors, where q = γ w^p and γ is the percentage of p-tuples selected from the Cartesian product of the sets T_i, i = 1, ..., p, according to the ranges for S_yy and S_zz (section 6.1). The SVD step for computing q common tensors from p m-residue fragments takes q(mp·5² + 5³) = O(mpq) time. The linkers are computed and assembled top-down using a binary tree. The Linker computation and assembly step then takes t_L = bq Σ_{k=1}^{log p} 2^k f^{(2k+1)m} = O(bq p^{2(c+1)m+1} f^m) time, where c = log f = O(1), since at depth k, vdW repulsion and PRE violation are computed for the assembled fragment consisting of 2^k m-residue fragments and an m-residue linker (Fig. 3). The total time is therefore O(p b^{h+1} f^m + p w^p log w + mpq + bq p^{2(c+1)m+1} f^m).

The largest possible value 28 for f is 16, but on average f is about 2. The largest possible value for γ is 1, but in practice it is very small, about 10⁻⁹, and q = 10³ with w = 100. Although the worst-case time complexity is exponential in O(h), O(m) and O(p), the parameters m, h and p are rather small constants in practice, with typical values of m = 10, h = 2 and p = 6 for a 100-residue protein. In practice, on a Linux cluster with 32 2.8 GHz Xeon processors, 20 days are required for computing an ensemble of 231 structures for ACBP, and 7 days for computing


For i ← 1 to 4  // 4-fold degeneracy in relative orientation

(1) T_L ← ∞
(2) c_{m,i} ← 0  // initialize the conformation vector
(3) For j ← 1 to b  // sampling cycle
    (a) Sample a set of RDCs, R_j, from the normal distributions for the RDCs.
    (b) Compute an optimal conformation vector c'_{m−2,i} ← (φ_1, ψ_1, ..., φ_{m−2}, ψ_{m−2}) by systematic search.
    (c) Compute φ_{m−1} by Proposition 5.1 using the CH RDC for residue m − 1.
    (d) Compute ψ_{m−1}, φ_m and ψ_m by Proposition 10.3 (section 10 of APPENDIX).
    (e) Build a polyalanine model for linker L_1 using the vector c'_{m,i} ← (φ_1, ψ_1, ..., φ_m, ψ_m).
    (f) Link L_1 to F_1 and F_2.  // see figure caption for an explanation
    (g) Compute E_p and a new score T'_L by Eq. (3) for the assembled fragment F_1 ∪ L_1 ∪ F_2.
    (h) If T'_L < T_L and E_p < P_max, then c_{m,i} ← c'_{m,i} and T_L ← T'_L.
(4) Return c_{m,i}  // the optimal conformation vector

Fig. 3. Linker computation and assembly. b is the number of sampling cycles. P_max is the maximum PRE violation allowed, set to 7.0 Å. The Link step, step (f), first translates the N-terminus of L_1 to the C-terminus of F_1, and then translates the C-terminus of the fragment F_1 ∪ L_1 to the N-terminus of F_2. There exists an intrinsic 4-fold degeneracy in the relative orientation between two fragments computed using RDCs measured in a single medium.

an ensemble of 160 structures for eglin C.

9. CONCLUSION AND BIOLOGICAL SIGNIFICANCE

At present, we have only very limited knowledge of the structural distribution of either laboratory-denatured or natively-disordered proteins. The main reason is that the current experimental techniques can only provide a sparse number of restraints, even while the traditional structure determination methods require a large number of them. In this paper, we presented and demonstrated a data-driven, systematic search algorithm capable of computing the ensemble of denatured solution structures directly from sparse experimental restraints. Our algorithm is based on the formulation of structure determination of denatured or disordered proteins as the computation of a set of heterogeneous structures from the distributions for the sparse experimental restraints. We have shown that the ensemble of denatured structures can be computed using the distributions for the orientational restraints from RDCs by solving a series of low-degree monomials. Compared with the previous approaches for characterizing the denatured state from experimental data, the ensemble of structures computed by our algorithm is substantially more accurate. More restraints were used in our algorithm, and most importantly, exact algebraic solutions in combination with systematic search

guarantee that all the valid conformations consistent with the experimental restraints are computed. The accurately-computed structure ensemble makes it possible to answer two key questions in protein folding: (a) are the structures in the denatured state random coils? and (b) are the denatured structures native-like? Our quantitative analysis concludes that the denatured states of both ACBP and eglin C are neither random nor native-like.

Acknowledgments We would like to thank Drs. Kresten Lindorff-Larsen and Flemming Poulsen for NMR data on acid-denatured ACBP, Dr. David Shortle for NMR data on urea-denatured eglin C, and Dr. Jane Dyson for communicating to us the values of the backbone (φ, ψ) angles for the polyproline II model. We would like to thank Mr. Tony Yan and Drs. Ramgopal Mettu, Kresten Lindorff-Larsen, Andrei Alexandrescu, Mehmet Apaydin and Chris Bailey-Kellogg, and all members of the Donald lab, for helpful discussions and critical reading of the manuscript.

References

1. A. Abragam. The Principles of Nuclear Magnetism. Clarendon Press, Oxford, 1961.

2. M. S. Ackerman and D. Shortle. Molecular alignment of denatured states of staphylococcal nuclease with strained polyacrylamide gels and surfactant liquid crystalline phases. Biochemistry, 41:3089-3095, 2002.

3. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucl. Acids Res., 28:235-242, 2000.

4. P. Bernado, L. Blanchard, P. Timmins, D. Marion, R. W. Ruigrok, and M. Blackledge. A structural model for unfolded proteins from residual dipolar couplings and small-angle x-ray scattering. Proc. Natl. Acad. Sci. USA, 102:17002-17007, 2005.

5. A. T. Brünger. XPLOR: A system for X-ray crystallography and NMR. Yale University Press, New Haven, 1993.

6. W. Y. Choy and J. D. Forman-Kay. Calculation of ensembles of structures representing the unfolded state of an SH3 domain. J. Mol. Biol., 308:1011-1023, 2001.

7. G. M. Clore, M. R. Starich, C. A. Bewley, M. L. Cai, and J. Kuszewski. Impact of residual dipolar couplings on the accuracy of NMR structures determined from a minimal number of NOE restraints. J. Am. Chem. Soc., 121(27):6513-6514, 1999.

8. F. Delaglio, G. Kontaxis, and A. Bax. Protein structure determination using molecular fragment replacement and NMR dipolar couplings. J. Am. Chem. Soc, 122(9):2142-2143, 2000.

9. W. Fieber, S. Kristjansdottir, and F. M. Poulsen. Short-range, long-range and transition state interactions in the denatured state of ACBP from residual dipolar couplings. J. Mol. Biol., 339:1191-1199, 2004.

10. A. W. Giesen, S. W. Homans, and J. M. Brown. Determination of protein global folds using backbone residual dipolar coupling and long-range NOE restraints. J. Biomol. NMR, 25:63-71, 2003.

11. P. Güntert, C. Mumenthaler, and K. Wüthrich. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol., 273:283-298, 1997.

12. W. Hu and L. Wang. Residual dipolar couplings: Measurements and applications to biomolecular studies. Annual Reports on NMR Spectroscopy (In Press), 2006.

13. J. C. Hus, D. Marion, and M. Blackledge. Determination of protein backbone using only residual dipolar couplings. J. Am. Chem. Soc., 123:1541-1542, 2001.

14. A. K. Jha, A. Colubri, K. F. Freed, and T. R. Sosnick. Statistical coil model of the unfolded state: Resolving the reconciliation problem. Proc. Natl. Acad. Sci. USA, 102:13099-13104, 2005.

15. L. D. Landau and E. M. Lifshitz. Statistical Physics. Pergamon Press, Oxford, 1980.

16. K. Lindorff-Larsen, S. Kristjansdottir, K. Teilum, W. Fieber, C. M. Dobson, M. Poulsen, and M. Vendruscolo. Determination of an ensemble of structures representing the denatured state of the bovine acyl-coenzyme A binding protein. J. Am. Chem. Soc., 126:3291-3299, 2004.

17. J. A. Losonczi, M. Andrec, M. W. Fischer, and J. H. Prestegard. Order matrix analysis of residual dipolar couplings using singular value decomposition. J. Magn. Reson., 138(2):334-342, 1999.

18. R. Mohana-Borges, N. K. Goto, G. J. Kroon, H. J. Dyson, and P. E. Wright. Structural characterization of unfolded states of apomyoglobin using residual dipolar couplings. J. Mol. Biol., 340:1131-1142, 2004.

19. P. J. Flory. Statistical Mechanics of Chain Molecules. Oxford University Press, New York, 1988.

20. T. L. Religa, J. S. Markson, U. Mayor, S. M. V. Freund, and A. R. Fersht. Solution structure of a protein denatured state and folding intermediate. Nature, 437:1053-1056, 2005.

21. C. A. Rohl and D. Baker. De novo determination of protein backbone structure from residual dipolar couplings using Rosetta. J. Am. Chem. Soc., 124(11):2723-2729, 2002.

22. A. Saupe. Recent results in the field of liquid crystals. Angew. Chem., 7:97-112, 1968.

23. D. Shortle and M. S. Ackerman. Persistence of nativelike topology in a denatured protein in 8 M urea. Science, 293:487-489, 2001.

24. C. Tanford. Protein denaturation. Part C. Theoretical models for the mechanism of denaturation. Adv. Protein Chem., 24:1-95, 1970.

25. N. Tjandra and A. Bax. Direct measurement of distances and angles in biomolecules by NMR in a dilute liquid crystalline medium. Science, 278:1111-1114, 1997.

26. J. R. Tolman, J. M. Flanagan, M. A. Kennedy, and J. H. Prestegard. Nuclear magnetic dipole interactions in field-oriented proteins: Information for structure determination in solution. Proc. Natl. Acad. Sci. USA, 92:9279-9283, 1995.

27. W. F. van Gunsteren, R. Bürgi, C. Peter, and X. Daura. The key to solving the protein-folding problem lies in an accurate description of the denatured state. Angew. Chem. Int. Ed., 40:351-355, 2001.

28. L. Wang and B. R. Donald. Analysis of a systematic search-based algorithm for determining protein backbone structure from a minimal number of residual dipolar couplings. In IEEE Computer Society Bioinformatics Conference, pages 319-330, Stanford University, CA, 2004.

29. L. Wang and B. R. Donald. Exact solutions for internuclear vectors and backbone dihedral angles from NH residual dipolar couplings in two media, and their application in a systematic search algorithm for determining protein backbone structure. J. Biomol. NMR, 29:223-242, 2004.

30. L. Wang and B. R. Donald. An efficient and accurate algorithm for assigning nuclear Overhauser effect restraints using a rotamer library ensemble and residual dipolar couplings. In IEEE Computer Society Bioinformatics Conference, pages 189-202, Stanford University, CA, 2005.

31. L. Wang, R. Mettu, and B. R. Donald. An algebraic geometry approach to backbone structure determination from NMR data. In IEEE Computer Society Bioinformatics Conference, pages 235-246, Stanford University, CA, 2005.

APPENDIX

In this appendix, we first state the polynomials for computing, respectively, the sine and cosine of the backbone φ angle from an NH RDC and those of the ψ angle from a CH RDC, starting with the C-terminus of a fragment. By comparison, Propositions 5.1-5.2 of the main text compute the (φ, ψ) angles from RDCs starting with the N-terminus. The proof for these two propositions is very


similar to the proof for Lemmas 5.1-5.2 given in 28. We then present a proof of a new proposition for computing backbone (φ, ψ) angles from oriented peptide planes. Finally, we describe the parameters and implementation of the algorithm.

10. LOW-DEGREE POLYNOMIALS FOR COMPUTING BACKBONE DIHEDRAL ANGLES

The following two Propositions, 10.1 and 10.2, are a generalization of Propositions 5.3 and 5.4 of 28 to compute the backbone structure from the C-terminus, rather than the N-terminus. Starting with peptide plane i + 1, we can compute the backbone φ_i, ψ_i angles, respectively, from the NH RDC of residue i and the CH RDC of residue i, as follows:

Proposition 10.1. Given the orientation of peptide plane i + 1 in the POF of RDCs, the x-component of the CH unit vector u of residue i, in the POF, can be computed from the CH RDC for residue i by solving a quartic monomial in x describing the intersection of two ellipses. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given u, the sine and cosine of the ψ_i angle can be computed by solving a linear equation.

Here, the CH vector ellipse equation is a function of the ψ_i angle. The ellipse equation has been described in detail in 28.

Proposition 10.2. Given the orientation of peptide plane i + 1 in the POF of RDCs, the x-component of the NH unit vector v of residue i, in the POF, can be computed from the NH RDC for residue i by solving a quartic monomial in x describing the intersection of two ellipses. Given the x-component, the y-component can be computed from Eq. (1), and the z-component from x² + y² + z² = 1. Given v, the sine and cosine of the φ_i angle can be computed by solving a linear equation.

Here, the NH vector ellipse equation is a function of the φ_i angle. The ellipse equation has been described in detail previously 28.

The sines and cosines of the backbone (φ, ψ) angles of the last two residues linking two oriented fragments can be computed, exactly and in constant time, by the following Proposition:

Proposition 10.3 Given the orientation of peptide planes i and i + 2 and the backbone dihedral angle φ_i, the sines and cosines of the backbone dihedral angles ψ_i, φ_{i+1} and ψ_{i+1} can be computed exactly and in constant time.

Proof. In the following, small and capital bold letters denote, respectively, column vectors and matrices. All the vectors are 3D vectors and all the matrices are 3D rotation matrices. Let v_1, v_3 and w_1, w_3 denote, respectively, the NH and Cα vectors of peptide planes i and i + 2. From protein backbone kinematics we have

    L G_1 w_3 = R_z(ψ_i) R R_y(φ_{i+1}) R_x(θ_3) R_z(ψ_{i+1}) c_w,    (4)
    L G_1 v_3 = R_z(ψ_i) R R_y(φ_{i+1}) R_x(θ_3) R_z(ψ_{i+1}) c_v

where R is a constant matrix, c_w and c_v are two constant vectors, and θ_3 is a constant angle. Given the backbone angle φ_i, the matrix L is known. The matrix G_1 is the rotation matrix from the POF of RDCs to a coordinate frame defined in peptide plane i. From Eq. (4), through algebraic manipulation we can derive the following three simple trigonometric equations satisfied by the ψ_i, φ_{i+1} and ψ_{i+1} angles:

    a_1 sin φ_{i+1} + b_1 cos φ_{i+1} = c_1,
    a_2 sin ψ_{i+1} + b_2 cos ψ_{i+1} = c_2,
    a_3 sin ψ_i + b_3 cos ψ_i = c_3

where a_1, b_1, c_1 are constants derived from the constant matrix R, and the six variables a_2, b_2, c_2, a_3, b_3, c_3 are simple trigonometric functions of the φ_{i+1} angle. □
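Each of the three equations above has the form a sin t + b cos t = c, which has at most two solutions recoverable in closed form. The sketch below shows the standard constant-time computation; the coefficients are hypothetical.

```python
# Solve a*sin(t) + b*cos(t) = c exactly by rewriting it as
# r*sin(t + delta) = c with r = sqrt(a^2 + b^2), delta = atan2(b, a).
import math

def solve_sin_cos(a, b, c):
    """Return all t in (-pi, pi] satisfying a*sin(t) + b*cos(t) = c."""
    r = math.hypot(a, b)
    if abs(c) > r:
        return []                       # no real solution
    delta = math.atan2(b, a)
    base = math.asin(c / r)
    candidates = {base - delta, math.pi - base - delta}
    # Wrap each candidate back into (-pi, pi].
    return [math.atan2(math.sin(t), math.cos(t)) for t in candidates]

print(solve_sin_cos(1.0, 1.0, 1.2))
```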

11. PARAMETERS AND IMPLEMENTATION OF THE ALGORITHM

Our algorithm (Figs. 1, 2 and 3 of the main text) is built upon (a) exact solutions for backbone (φ, ψ) angles from RDCs, and (b) a systematic search for exploring all the possible solutions consistent with the experimental restraints and biophysical properties (minimum vdW repulsion). However, several parameters must be chosen to ensure the correctness and convergence of the algorithm. We explored, via computational experiments, the spaces of these parameters to find proper values that ensure the computed ensembles are stable. The parameters include:

(1) division of protein sequence into fragments and linkers

(2) initial estimation of alignment tensors


(3) the standard deviations of the probability distributions for convolving the experimental RDCs

(4) the size of sampling, b

(5) the number of systematic-search/SVD cycles, h

In order to see their effects on the computed ensembles, we have run the algorithm with different initial tensors computed by SVD using either an ideal α-helix (φ = −64.3°, ψ = −39.4°), a β-strand (φ = −120.0°, ψ = 138.0°), or a polyproline II model (φ = −80.0°, ψ = 135.0°). We have also tested the algorithm using different sizes b of sampling and different numbers h of systematic-search/SVD cycles. Our computational experiments showed that with b = 8 × 1024 and h = 2, the computed ensemble has already reached a stable state, since further increases in either b or h do not change the distributions of backbone (φ, ψ) angles or the pairwise backbone RMSDs between the structures in the ensembles. The largest effect appears to be how the protein sequence is divided if there are missing RDCs concentrated in a certain region. In the implementation, the division into fragments and linkers is based primarily on the availability of experimental RDCs. In general, the linkers between two fragments have more missing RDCs than the fragments. If no experimental data is available for either CH or NH RDCs, the corresponding φ and ψ are selected randomly in the range [−π, π]. As detailed in section 6.1, the alignment tensor used to compute the linkers is computed from the structures of the fragments. Thus, if we exchange a fragment with a linker and the linker has many missing RDCs, the computed ensemble differs, to some extent, from the original one. Our choice of division emphasizes the experimental data. The standard deviations for the RDC random variables are, respectively, 8.0 Hz (Hertz) for CH RDCs and 4.0 Hz for NH RDCs; both are much larger than the real experimental errors, which are estimated to be less than 1.0 Hz for CH RDCs and 0.50 Hz for NH RDCs. The values of these deviations are, respectively, about one-half of the ranges of all the experimental CH and NH RDC values. The probability distributions used to convolve the RDCs are rather broad relative to the experimental values, and thus the algorithm is capable of computing most of the structures in the denatured state. The relative weights w_v and w_p in Eq. (2) and Eq. (3) of the main text are set to 8.0 and 2.0, respectively. The effects of these weights on the final ensembles are minimal, since vdW repulsion is very small in the final structures, and PRE violation is implemented by the requirement that all the final structures have no RMSD in PREs larger than 7.0 Å. The functional forms for both E_v and E_p in Eq. (3) are flat-bottomed harmonic walls.
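For reference, the implementation parameters reported in this section can be collected in one place. The dictionary below uses our own (hypothetical) key names; the values are those stated in the text.

```python
# Parameter values as reported in section 11 (names are illustrative).
PARAMS = {
    "b": 8 * 1024,          # size of sampling
    "h": 2,                 # systematic-search/SVD cycles
    "ch_rdc_sd_hz": 8.0,    # std. dev. for convolving CH RDCs
    "nh_rdc_sd_hz": 4.0,    # std. dev. for convolving NH RDCs
    "w_v": 8.0,             # vdW repulsion weight in Eqs. (2) and (3)
    "w_p": 2.0,             # PRE violation weight in Eq. (3)
    "p_max_angstrom": 7.0,  # maximum allowed PRE RMSD
}
```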


MULTIPLE STRUCTURE ALIGNMENT BY OPTIMAL RMSD IMPLIES THAT THE AVERAGE STRUCTURE IS A CONSENSUS

Xueyi Wang* and Jack Snoeyink

Department of Computer Science, University of North Carolina at Chapel Hill Chapel Hill, NC, 27599-3175, USA

Email: {xwang , snoeyink}@cs.unc.edu

Root mean square deviation (RMSD) is often used to measure the difference between structures. We show mathematically that, for multiple structure alignment, the minimum RMSD (weighted at aligned positions or unweighted) for all pairs is the same as the RMSD to the average of the structures. Thus, using RMSD implies that the average is a consensus structure. We use this property to validate and improve algorithms for multiple structure alignment. In particular, we establish the properties of the average structure, and show that an iterative algorithm proposed by Sutcliffe and co-authors can find it efficiently — each iteration takes linear time and the number of iterations is small. We explore the residuals after alignment and assign weights to positions to identify aligned cores of structures. Observing this property also calls into question whether global RMSD is the right way to compare multiple protein structures, and guides the search for more local techniques.

1. INTRODUCTION

Although protein structures are uniquely determined by their sequences1, protein structures are better conserved through evolution than the sequences2. Proteins with similar 3D structures may have similar functions and are often evolved from common ancestors3. As structural biologists classify proteins, how should they compare structures?

Pairwise comparisons are commonly performed by measuring the root mean squared deviation (RMSD) between corresponding atoms in two structures, once a suitable correspondence has been chosen and the molecules have been translated and rotated as rigid bodies to the best match4-6. Corresponding atoms may also be given weights so that core atoms have the greatest influence on the matching and on the weighted RMSD score.

Pairwise comparison can be extended to multiple structure alignment in several ways. In this paper we look at ways to extend RMSD (weighted at aligned positions or unweighted) after a correspondence between atoms has already been chosen. Multiple structure alignment is an important tool to identify structurally conserved regions, to provide clues for building evolutionary trees and finding common ancestors, and to determine consensus structures for protein families.

For multiple structure alignment, first we need to choose a score function to measure the goodness of the alignment. Examples from the literature include the sum of all pairwise squared distances7, 8, which we also use, or the average RMSD per aligned position9. If we consider the protein structures as rigid bodies, then the problem of multiple structure alignment is to translate and rotate these structures to minimize the score function. Several methods also choose a consensus structure to represent the whole alignment.

Many algorithms have been presented to solve this multiple structure alignment problem. Some first do pairwise structure alignments and then use heuristic methods to integrate the structures. Gerstein and Levitt10 choose the structure that has minimum total RMSD to all other structures as the consensus structure and align the other structures to it. Ochagavia and Wodak9 and Lupyan et al.7 present a progressive algorithm that chooses one structure at a time and minimizes the total RMSD to all the already aligned structures until all the structures are aligned. Other researchers use non-deterministic methods: Sali and Blundell11 use simulated annealing to determine the optimal structure alignments, and Guda et al.12 use Monte Carlo optimization.

Other algorithms align all the structures together instead of aligning each pair separately. Two iterative algorithms by Sutcliffe et al.8 align protein structures to their average structure, as also done by Verboon and Gabriel13 and Pennec14. We will focus most of our attention on this approach. MUSTA15 uses geometric

* Corresponding author.


hashing and finds a consensus structure of Cα atoms. MultiProt16 iteratively chooses each structure as a consensus structure, aligns all other structures to it, and detects the largest core among the aligned molecules. MASS17 and CBA18 first align secondary structure and then align tertiary structure.

In this paper, we show that if you use the root of total squared deviation to score multiple structure alignment, then mathematically you obtain the same result by taking the average structure as a consensus structure and doing pairwise alignment to this consensus. We use this to establish properties of the Sutcliffe et al.8 algorithms, including a better stopping condition. In our tests on protein families from HOMSTRAD19, this algorithm quickly reaches the optimum alignment and consensus structure. By modeling deviations from the average positions as 3-dimensional Gaussian distributions, we can also determine weights for well-aligned positions that can determine the aligned core. We also raise the question: if the average is not the right consensus structure, then what scoring function should replace wRMSD?

2. METHODS

We define the average structure and the weighted RMSD for multiple structures with position weights, and then establish the properties of wRMSD.

2.1. Average structure and weighted root mean square deviation

We assume there are $n$ structures, each having $m$ points (atoms), so that structure $S_i$ for $1 \le i \le n$ has points $p_{i1}, p_{i2}, \ldots, p_{im}$. For a fixed position $k$, the $n$ points $p_{ik}$ for $1 \le i \le n$ are assumed to correspond. We define the average structure $\bar S$ to have points $\bar p_k = \frac{1}{n}\sum_{i=1}^{n} p_{ik}$ for $1 \le k \le m$.

We may assign a position weight $w_k \ge 0$ to each aligned position $k$ and define the weighted root mean squared deviation (wRMSD) from the weighted sum of all squared pairwise distances between structures, i.e.,

$$\mathrm{wRMSD} = \sqrt{\frac{2}{m\,n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_k\,\|p_{ik}-p_{jk}\|^2}.$$

Weights allow us to emphasize some positions in the alignment (e.g., an aligned core) and reduce or eliminate the influence of other positions; we obtain the standard RMSD by setting $w_k = 1$ for $1 \le k \le m$.

Note that there are $n(n-1)/2$ structure pairs, and each structure pair has $m$ squared distances. If we want to transform the atom positions to minimize wRMSD, then, because $m$ and $n$ are fixed and the square root function is monotone increasing, we can instead minimize the weighted sum of all squared pairwise distances $\sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_k\|p_{ik}-p_{jk}\|^2$.

The following technical lemma on weighted sums of squares allows us to make several observations about the average structure under wRMSD.

Lemma 1. For any aligned position $k$, the total squared distance from $p_{1k}, p_{2k}, \ldots, p_{nk}$ to any point $q_k$ equals the total squared distance to the average point $\bar p_k$ plus the total squared distance from $\bar p_k$ to $q_k$:

$$\sum_{i=1}^{n}\|p_{ik}-q_k\|^2 = \sum_{i=1}^{n}\|p_{ik}-\bar p_k\|^2 + \sum_{i=1}^{n}\|\bar p_k-q_k\|^2.$$

Proof. To establish the Lemma, we subtract the second term from both sides, expand the difference of squares, then apply the definition of $\bar p_k$ in the penultimate step:

$$\begin{aligned}
\sum_{i=1}^{n}\left(\|p_{ik}-q_k\|^2-\|p_{ik}-\bar p_k\|^2\right)
&= \sum_{i=1}^{n}\bigl(p_{ik}-q_k+p_{ik}-\bar p_k\bigr)\cdot\bigl(p_{ik}-q_k-p_{ik}+\bar p_k\bigr)\\
&= \sum_{i=1}^{n}\bigl(2p_{ik}-q_k-\bar p_k\bigr)\cdot\bigl(\bar p_k-q_k\bigr)\\
&= \bigl(2n\bar p_k-nq_k-n\bar p_k\bigr)\cdot\bigl(\bar p_k-q_k\bigr)\\
&= \sum_{i=1}^{n}\|\bar p_k-q_k\|^2. \qquad\Box
\end{aligned}$$

Our first theorem says that if wRMSD is used to compare multiple structures, then what is really happening is that all structures are being compared to the average structure: the average structure $\bar S$ is a consensus, whether we recognize it or not. It is better computationally to recognize this, because it reduces the number of pairs of structures that must be compared from $n(n-1)/2$ to $n$.

Theorem 1. The weighted sum of squared distances for all pairs equals $n$ times the weighted sum of squared distances to the average structure $\bar S$:

$$\sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_k\|p_{ik}-p_{jk}\|^2 = n\sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k\|^2.$$

Proof. In Lemma 1, replace $q_k$ by $p_{jk}$, then multiply by the weight $w_k$, and sum over all $j$ and $k$ to obtain:

$$\sum_{j=1}^{n}\sum_{k=1}^{m}\sum_{i=1}^{n} w_k\|p_{ik}-p_{jk}\|^2 = 2n\sum_{k=1}^{m}\sum_{i=1}^{n} w_k\|p_{ik}-\bar p_k\|^2.$$

We can re-arrange the order of summation on the left, noticing that terms with $i = j$ vanish and every other term appears twice. The resulting equation gives the desired result after dividing out the extra factor of two:

$$2\sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_k\|p_{ik}-p_{jk}\|^2 = 2n\sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k\|^2. \qquad\Box$$

Two more theorems suggest how to choose the structure closest to a given set of structures. If you can choose any structure, then choose the average $\bar S$; if you must choose from a limited set, then choose the structure closest to the average $\bar S$.

Theorem 2. The average structure $\bar S$ minimizes the weighted sum of squared distances from all the structures, i.e., for any structure $Q$ with points $q_1, q_2, \ldots, q_m$,

$$\sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-q_k\|^2 \ge \sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k\|^2,$$

and equality holds if and only if $q_k = \bar p_k$ for all positions with $w_k > 0$.

Proof. This follows immediately from Lemma 1, since $n\,w_k\|q_k-\bar p_k\|^2 \ge 0$, with equality if and only if $q_k = \bar p_k$ or $w_k = 0$. $\Box$

Theorem 3. The structure from a set $Q_1, \ldots, Q_m$ that minimizes the weighted sum of squared distances from all the structures $S_i$ is the one whose wRMSD to $\bar S$ is smallest.

Proof. In Lemma 1, $\sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k\|^2$ is fixed by the set of structures, so it is both necessary and sufficient to minimize $\sum_{k=1}^{m} w_k\|q_k-\bar p_k\|^2$. $\Box$
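The identity of Theorem 1 is easy to sanity-check numerically; the following small sketch (ours, with random coordinates and weights) verifies it on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 8                                  # structures, aligned positions
X = rng.normal(size=(n, m, 3))               # X[i, k] = p_ik
w = rng.uniform(0.5, 2.0, size=m)            # position weights w_k

# Weighted sum of squared distances over all pairs of structures.
pairs = sum(w @ ((X[i] - X[j]) ** 2).sum(axis=1)
            for i in range(1, n) for j in range(i))

# n times the weighted sum of squared distances to the average structure.
avg = X.mean(axis=0)
to_avg = n * sum(w @ ((X[i] - avg) ** 2).sum(axis=1) for i in range(n))

assert np.isclose(pairs, to_avg)             # Theorem 1
```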

2.2. Minimizing wRMSD

In structure alignment, we translate and rotate structures in 3D space to minimize wRMSD. We define $R_i$ as a $3\times 3$ rotation matrix and $T_i$ as a $3\times 1$ translation vector for structure $S_i$. We aim to find the optimal $T_i$ and $R_i$ for each structure to minimize the wRMSD. The target function is:

$$\arg\min_{R,T}\;\sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_k\,\|(R_i p_{ik}+T_i)-(R_j p_{jk}+T_j)\|^2.$$

We can fix one of the rotations to be the identity, and one of the translations to be zero. When there are only two structures, the minimization reduces to a linear equation in $R$ and $T$. Horn5 showed that these can be found separately: the minimum wRMSD for the pair can be found by translating each structure so that its origin is the weighted center of mass, i.e., $\sum_k w_k p_{jk} = 0$, then applying an optimum rotation found with quaternions.

To minimize wRMSD with more than two structures, we can combine Theorem 1 with Horn's analysis to show that wRMSD is minimized when the centroid (weighted center of mass) of each structure coincides with the centroid of the average structure, i.e., each structure may be translated so that the origin is its centroid.

Finding optimum rotations for several structures is harder than for a pair because the minimization problem no longer reduces to a linear equation. We can use the fact that the average is the best consensus (Theorem 1) and modify a simple iterative algorithm of Sutcliffe et al.8 to converge to the minimum wRMSD. Instead of directly finding the optimal rotation matrices, we align each structure to the average structure separately to minimize wRMSD. Because rotating the structures also changes the average structure, we repeat until the algorithm converges to a local minimum of wRMSD.

Algorithm: Given $n$ structures with $m$ points (atoms) each and weights $w_k$ at each position, minimize wRMSD to within a threshold value $\epsilon$ (e.g., $\epsilon = 1.0\times 10^{-5}$).

1. Translate the weighted centroid of each structure $S_i$ for $1 \le i \le n$ to the origin. (Optionally, align each structure to a randomly chosen $S_t$ for a good initial average.)
2. Calculate the average $\bar S$, with points $\bar p_k = \frac{1}{n}\sum_{i=1}^{n} p_{ik}$, and $SD = \sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k\|^2$.
3. For each $1 \le i \le n$, align $S_i$ to $\bar S$ using Horn's method to calculate the optimum rotation matrix $R_i$ that minimizes $\sum_{k=1}^{m} w_k\|R_i p_{ik}-\bar p_k\|^2$, and replace $S_i = R_i S_i$.
4. Calculate the new average $\bar S^{\,new}$ and the deviation $SD^{new} = \sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k^{\,new}\|^2$. If $SD - SD^{new} < \epsilon$, then the algorithm terminates; otherwise, set $SD = SD^{new}$ and $\bar S = \bar S^{\,new}$ and go to step 3.
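A compact sketch of this algorithm in Python/NumPy (ours, not the authors' MATLAB code; `horn_rotation` is our implementation of Horn's closed-form quaternion solution5 for the weighted two-structure problem):

```python
import numpy as np

def horn_rotation(P, Q, w):
    """Rotation R minimizing sum_k w_k ||R P_k - Q_k||^2 for centered
    m x 3 point sets P and Q (Horn's quaternion method)."""
    S = (w[:, None] * P).T @ Q                     # 3x3 weighted correlation
    N = np.array([
        [S[0,0]+S[1,1]+S[2,2], S[1,2]-S[2,1],        S[2,0]-S[0,2],        S[0,1]-S[1,0]],
        [S[1,2]-S[2,1],        S[0,0]-S[1,1]-S[2,2], S[0,1]+S[1,0],        S[2,0]+S[0,2]],
        [S[2,0]-S[0,2],        S[0,1]+S[1,0],       -S[0,0]+S[1,1]-S[2,2], S[1,2]+S[2,1]],
        [S[0,1]-S[1,0],        S[2,0]+S[0,2],        S[1,2]+S[2,1],       -S[0,0]-S[1,1]+S[2,2]]])
    a, b, c, d = np.linalg.eigh(N)[1][:, -1]       # quaternion of largest eigenvalue
    return np.array([
        [a*a+b*b-c*c-d*d, 2*(b*c-a*d),     2*(b*d+a*c)],
        [2*(b*c+a*d),     a*a-b*b+c*c-d*d, 2*(c*d-a*b)],
        [2*(b*d-a*c),     2*(c*d+a*b),     a*a-b*b-c*c+d*d]])

def align_to_average(structs, w, eps=1e-5):
    """Steps 1-4 above: center, then rotate each structure to the running
    average until the decrease of SD falls below eps."""
    X = np.array(structs, dtype=float)             # n x m x 3
    X -= ((w[:, None] * X).sum(axis=1) / w.sum())[:, None, :]   # step 1
    avg = X.mean(axis=0)                                        # step 2
    sd = (w * ((X - avg) ** 2).sum(axis=2)).sum()
    while True:
        for i in range(len(X)):                                 # step 3
            X[i] = X[i] @ horn_rotation(X[i], avg, w).T
        avg = X.mean(axis=0)                                    # step 4
        sd_new = (w * ((X - avg) ** 2).sum(axis=2)).sum()
        if sd - sd_new < eps:
            return X, avg, sd_new
        sd = sd_new
```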


Horn's method and our theorems imply that the deviation $SD$ decreases monotonically in each iteration. From Theorem 1, we know that minimizing the deviation $SD$ to the average minimizes the global wRMSD. From Horn5, in step 3 we have

$$\sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}^{\,new}-\bar p_k\|^2 \;\le\; \sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}-\bar p_k\|^2 = SD.$$

From Theorem 2, in step 4 we have

$$SD^{new} = \sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}^{\,new}-\bar p_k^{\,new}\|^2 \;\le\; \sum_{i=1}^{n}\sum_{k=1}^{m} w_k\|p_{ik}^{\,new}-\bar p_k\|^2.$$

So $SD^{new} \le SD$ and $SD$ decreases in each iteration. We stop when this decrease is less than the threshold $\epsilon$; the result is a local minimum of $SD$.

Horn's method calculates the optimal rotation matrix for two $m$-atom structures in $O(m)$ operations, so initialization and each iteration take $O(nm)$ operations. Our experiments show that for any starting positions of the $n$ structures, the algorithm converges in a maximum of 4-6 iterations when $\epsilon = 1.0\times 10^{-5}$. The number of iterations is one fewer when the proteins start with a preliminary alignment from the optional initialization in step 1. Because the lower bound for aligning $n$ structures with $m$ points per structure is $\Omega(nm)$, this algorithm is close to optimal.

We must make two remarks about the paper of Sutcliffe et al.8, which proposed the algorithm above. First, they actually give different weights to individual atoms, which they change during the minimization. We can establish analogues of Theorems 1-3 for individual atom weights if the weight of a corresponding pair of atoms is the half-normalized product of the individual weights. To minimize wRMSD for such weights, however, we have observed that it is no longer sufficient to translate the structure centroids to the origin. We believe that this may explain why Sutcliffe's algorithm can take many iterations to converge: the weights are not well-grounded in mathematics. We plan to explore atom weights more thoroughly in a subsequent paper.

Second, their termination condition was that the deviation between two consecutive average structures is small, which actually tests only the second of the two inequalities on the decrease of SD above. Terminating based on the decrease of SD itself is a stronger condition.

While preparing the final version of this paper, we found two papers with similar iterative algorithms13, 14. Both algorithms use singular value decomposition (SVD) as the subroutine for finding an optimal rotation matrix; quaternions should be used instead because they preserve chirality. Pennec14 presented an iterative algorithm for unweighted multiple structure alignment, and our work can be regarded as an extension of his. Verboon and Gabriel13 presented their iterative algorithm as minimizing wRMSD with atom weights (different atoms having different weights), but in fact it works only for position weights, because the optimization of translation and of rotation cannot be separated with atom weights.

3. RESULTS AND DISCUSSION

3.1. Performance

We test the performance of our algorithm by minimizing the RMSD for 23 protein families from HOMSTRAD19, which are all the families that contain more than 10 structures with a total aligned length longer than 100. We set ε = 1.0×10⁻⁵ and ran the experiments on a 1.8 GHz Pentium M laptop with 768 MB memory. The code is written in MATLAB and is downloadable at http://www.cs.unc.edu/~xwang/.

We run our algorithm 5,000 times for each protein family. Each time we begin by randomly rotating each structure in 3D space and then minimize the RMSD. We expect the changes in RMSD to be small, since these proteins were carefully aligned with a combination of tools, but we want to make sure that our algorithm does not become stuck in local minima that are not the global minimum. The results are shown in Table 1.

For each protein family's 5,000 tests, the difference between the maximum RMSD and the minimum RMSD is less than 1.0×10⁻⁸, so they converge to the same local minimum. Moreover, the optimal RMSD values found by our algorithm are less than the original RMSD from the alignments in HOMSTRAD in all cases. In three cases the relative difference is greater than 3%; in each of these cases there is an aligned core for all proteins in the family, but some disordered regions allow our algorithm to find alignments with better RMSD. These cases clearly call for weighted alignment.


Table 1. Performance of the algorithm on different protein families from HOMSTRAD. We report n, the number of proteins; m, the number of atoms aligned; the RMSD for the HOMSTRAD alignment (HA); the RMSD for the optimal alignment from our algorithm; and statistics on iterations and time (milliseconds) over 5,000 runs of each alignment.

Protein family                                      n    m   RMSD HA(Å)  optim. RMSD  % rel. diff  Iterations (avg, med, max)  Time ms (avg, median, max)
immunoglobulin domain - V set - heavy chain        21  107    1.224        1.213         0.91       3.8, 4, 4                   11.7, 10, 30
globin                                             41  109    1.781        1.747         1.95       4.0, 4, 5                   24.4, 20, 40
phospholipase A2                                   18  111    1.492        1.478         0.95       3.9, 4, 4                   10.5, 10, 41
ubiquitin conjugating enzyme                       13  114    1.729        1.714         0.88       4.0, 4, 5                    7.9, 10, 11
Lipocalin family                                   15  118    2.881        2.873         0.28       4.0, 4, 5                    9.3, 10, 30
glycosyl hydrolase family 22 (lysozyme)            12  119    1.357        1.342         1.12       3.9, 4, 4                    7.3, 10, 11
Fatty acid binding protein-like                    17  122    1.825        1.824         0.05       4.0, 4, 5                   10.5, 10, 40
Proteasome A-type and B-type                       17  148    3.302        3.032         8.91       4.8, 5, 6                    9.3, 10, 21
phycocyanin                                        12  148    2.188        2.077         5.34       4.0, 4, 5                   11.0, 10, 40
short-chain dehydrogenases/reductases              13  177    1.971        1.954         0.87       4.0, 4, 5                    8.8, 10, 11
serine proteinase - eukaryotic                     27  181    1.454        1.435         1.32       3.8, 4, 4                   17.4, 20, 40
Papain fam cysteine proteinase                     13  190    1.396        1.383         0.94       3.9, 4, 5                    8.9, 10, 30
glutathione S-transferase                          14  200    2.336        2.315         0.91       4.0, 4, 5                    9.8, 10, 20
Alpha amylase, catalytic dom.                      23  201    2.327        2.293         1.48       4.0, 4, 5                   16.1, 20, 40
legume lectin                                      12  202    1.302        1.287         1.17       3.8, 4, 4                    8.0, 10, 30
Serine/Threonine protein kinases, catalytic dom.   15  205    2.561        2.503         2.32       4.0, 4, 5                   10.6, 10, 21
subtilase                                          11  222    2.279        2.268         0.49       4.0, 4, 5                    8.1, 10, 30
Alpha amylase, catalytic and C-terminal domains    23  224    2.668        2.602         2.54       4.0, 4, 5                   16.6, 20, 40
triose phosphate isomerase                         10  242    1.398        1.386         0.87       3.7, 4, 4                    7.0, 10, 11
pyridine nucleotide-disulphide oxidoreductases
  class-I                                          11  262    3.870        3.420        13.16       4.7, 5, 6                   10.1, 10, 21
lactate/malate dehydrogenase                       14  266    2.036        2.024         0.59       4.0, 4, 5                   10.9, 10, 21
cytochrome p450                                    12  295    2.872        2.861         0.38       4.0, 4, 5                    9.8, 10, 30
aspartic proteinase                                13  297    1.932        1.877         2.93       4.0, 4, 4                   10.5, 10, 30

(a) Average running time vs. number of atoms (b) Average running time vs. number of structures
Fig. 1. Average running time vs. the number of atoms (n×m) or the number of structures (n)


The maximum number of iterations is 6 and the average and median numbers of iterations are around 4, so the number of iterations is a small constant and the algorithm achieves the lower bound of multiple structure alignment, which is Θ(nm). All of the average running times are less than 25 milliseconds and all of the maximum running times are no more than about 40 milliseconds, which means our algorithm is highly efficient.

Figures 1a and 1b show the relationship between the average running time and the number of atoms (n×m) or the number of structures (n) in each protein family. The average running time shows a linear relation with the number of structures but not with the number of atoms, because the most time-consuming operation is computing the eigenvectors and eigenvalues of a 4×4 matrix in Horn's method, which takes O(n) time in each iteration.

3.2. Consensus structure

For a given protein family, one problem is to find a consensus structure that summarizes the structural information. Altman and Gerstein20 and Chew and Kedem21 propose to use the average structure of the conserved core as the consensus structure. In fact, by Theorems 1 and 2, the wRMSD is minimized by aligning to the average structure, and no other structure has a better wRMSD to all the structures. Thus, we claim that the average structure is the natural candidate for the consensus structure.

(a) all 11 aligned proteins (b) the consensus structure (c) structure with minimum RMSD (d) structure with maximum RMSD
Fig. 2. Multiple structure alignment for pyridine nucleotide-disulphide oxidoreductases class-I

One objection to this claim is that the average structure is not a true protein structure: it may have physically unrealizable distances or angles due to the averaging. Whether this matters depends on the intended use for the consensus structure; in fact, some other proposed consensus structures are even more schematic: Taylor et al.22, Chew and Kedem21, and Ye and Janardan23 use vectors between neighboring Cα atoms to represent protein structures and define a consensus structure as a collection of average vectors from aligned columns.

But a more significant answer comes from Theorem 3: if you do have a set of structures from which you wish to choose a consensus, including the proposal of Gerstein and Levitt10 to use the true protein structure that has the minimum RMSD to all other structures, or POSA of Ye and Godzik24, which builds a consensus structure by rearranging input structures based on alignments of partial order graphs of these structures, then you should choose from this set the structure with minimum wRMSD to the average.

Figure 2 shows the alignment of the conserved core of the protein family pyridine nucleotide-disulphide oxidoreductases class-I, the consensus structure, the structure with the minimum RMSD to all other structures, and the structure with the maximum RMSD to the other structures.

(a) Distribution of the best aligned position (b) Histogram of R² for all aligned positions
Fig. 3. 3D Gaussian distribution analysis of the distances from each atom to corresponding points on the average structure

3.3. Statistical analysis of deviation from consensus in aligned structures

Deriving a statistical description of the aligned protein structures is an intriguing question that has significant theoretical and practical implications. As a first step, we investigate the following question concerning the spatial distribution of aligned positions in a protein family. More specifically, we want to test the null hypothesis that, at a fixed position k, the distances at which the n atoms are found from the average $\bar p_k$, especially for positions in the "core" area of the protein structures, are consistent with distances from a 3D Gaussian distribution. We chose the Gaussian not only because it is the most widely used distribution function, due to the central limit theorem of statistics, but also because previous studies hint that the Gaussian is the best model to describe aligned structures25. If we can establish from our data that aligned positions are distributed according to a 3D Gaussian distribution, the set of aligned protein structures can be conveniently described by a concise model composed of the average structure and the covariance matrix specifying the distribution of the positions.

To test the fit of our data to the hypothesized 3D Gaussian model, we adopted the quantile-quantile plot (q-q plot) procedure26, which is commonly used to determine whether two data sets come from a common distribution. In our procedure, the y-axis is the distances from each structure to the average structure for each aligned position, and the x-axis is the quantile data from the 3D Gaussian. Figure 3a shows the q-q plot for the best aligned position. The correlation coefficient R² is 0.9632, which suggests that the data fits the 3D Gaussian model well. We carried out the same experiment for all the aligned positions; the histogram of the collected correlation coefficients R² is shown in Figure 3b. More than 79% of the positions we checked have R² > 0.8.

The types of curves in q-q plots reveal information that can be used to classify whether a position should be deemed part of the core. The illustrated q-q plot has the last two curves above the line, which indicates that the two corresponding structures have larger errors at this position than a Gaussian distribution would predict. Most positions produce curves like this, or with all or almost all points on a line through the origin. A low slope indicates that they align well, and that the residuals may fit a 3D Gaussian distribution with a small scale. A few plots begin above the line and come down, or stay on a line of higher slope, indicating that such positions are disordered and should not be considered part of the core.
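The q-q construction is straightforward to reproduce. A minimal sketch (ours, not the paper's code) assuming isotropic deviations: distances from an isotropic 3D Gaussian follow a Maxwell (chi with 3 degrees of freedom) distribution, and since the correlation coefficient of the q-q points is invariant to scale, the unit-scale quantiles suffice:

```python
import numpy as np
from scipy import stats

def qq_r2_3d_gaussian(dists):
    """q-q comparison of observed atom-to-average distances at one aligned
    position against the Maxwell distribution; returns the R^2 of the plot."""
    d = np.sort(np.asarray(dists, dtype=float))
    probs = (np.arange(1, len(d) + 1) - 0.5) / len(d)   # plotting positions
    q = stats.maxwell.ppf(probs)                        # theoretical quantiles
    r, _ = stats.pearsonr(q, d)
    return r ** 2

# Example: n = 11 structures with deviations drawn from a 3D Gaussian.
rng = np.random.default_rng(0)
d = np.linalg.norm(rng.normal(scale=0.5, size=(11, 3)), axis=1)
print(qq_r2_3d_gaussian(d))   # close to 1 for Gaussian-consistent positions
```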

3.4. Determining and weighting the core for aligned structures

There are many ways in which we can potentially use this model of the alignment in a family to determine the structurally conserved core of the family and help biologists compare protein structures. Due to space constraints, we briefly demonstrate one heuristic for determining position weights to identify and align the conserved core of two of our structure families.


(a) pyridine nucleotide-disulphide oxidoreductases class-I (b) proteasome A-type and B-type
Figure 4. Aligned protein families using position weights. The black colored positions satisfy $a_k < \bar a + \sigma$, dark-gray colored atoms satisfy $\bar a + \sigma \le a_k < \bar a + 2\sigma$, gray colored atoms satisfy $\bar a + 2\sigma \le a_k < \bar a + 3\sigma$, and light-gray colored atoms satisfy $a_k \ge \bar a + 3\sigma$.

We use the following iterative algorithm to assign weights.

1. Align the protein structures by RMSD using the algorithm of Section 2.2.
2. For each aligned position $k$, calculate the distances $d_{ik} = \|p_{ik}-\bar p_k\|$ for $1 \le i \le n$, the correlation coefficient $R_k$ obtained by assuming that the deviations follow a 3D Gaussian distribution, and the average squared distance $a_k = \frac{1}{n}\sum_{i=1}^{n} d_{ik}^2$. Then calculate the mean $\bar a$ and standard deviation $\sigma$ of the $a_k$.
3. If all $a_k < \bar a + 3\sigma$, then exit the algorithm. Otherwise, set the weights
$$w_k = \begin{cases} R_k^2/a_k & \text{if } a_k < \bar a + 3\sigma,\\ 0 & \text{otherwise,}\end{cases}$$
align the structures by wRMSD, and go to step 2.

The term $1/a_k$ in the weights encourages the alignment at positions where the average squared deviations are small, and the term $R_k^2$ encourages those positions where the distances to the average structure are close to a 3D Gaussian distribution. Figure 4 shows two examples of alignments, where black is the core and gray marks portions that are eliminated by being given weight zero, often due to divergence of some or all members of the family.
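A sketch of this weighting loop (ours; the wRMSD re-alignment between rounds is elided, and the per-position $R_k^2$ values are assumed to be precomputed, e.g., with the q-q procedure of Section 3.3):

```python
import numpy as np

def core_weights(X, r2, n_rounds=20):
    """Iterative position weighting of Sec. 3.4 (sketch). X is an n x m x 3
    array of aligned coordinates; r2 holds the per-position R_k^2 values."""
    n, m, _ = X.shape
    w = np.ones(m)
    for _ in range(n_rounds):
        avg = X.mean(axis=0)                             # average structure
        a = ((X - avg) ** 2).sum(axis=2).mean(axis=0)    # a_k, avg squared distance
        if np.all(a < a.mean() + 3 * a.std()):
            break                                        # every position is core-like
        core = a < a.mean() + 3 * a.std()
        w = np.zeros(m)
        w[core] = r2[core] / a[core]
        # ...re-align the structures by wRMSD with w here (e.g., with
        # align_to_average above) before the next round...
    return w
```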

4. CONCLUSION

In this paper, we analyzed the problem of multiple structure alignment under weighted RMSD with weights at aligned positions, which includes RMSD as a special case. While directly minimizing wRMSD over all pairs is hard in multiple structure alignment, we show that this problem is the same as minimizing the

wRMSD to the average structure. Thus, the average structure is the natural choice for a consensus structure.

Based on this property, we create an efficient iterative algorithm for minimizing the wRMSD and prove its convergence and other properties. Each iteration takes time proportional to the number of atoms in the structures. We tested the algorithm on the protein families from the HOMSTRAD database that have more than 10 proteins with a total aligned length longer than 100 atoms. The results show that our algorithm minimizes the wRMSD in less than 50 milliseconds in MATLAB for any of these protein families. Regardless of the starting positions of the structures, the tests show that the algorithm converges to the same local minimum, which is most probably the global minimum. The tests also show that the number of iterations is a small constant whenever the input does not have near symmetry, so the algorithm achieves the linear lower bound for multiple structure alignment.

The algorithm in this paper aligns protein structures after a sequence alignment has been chosen. We plan to extend our work to weighted multiple structure alignment with atom weights, i.e., different weights at different atoms (which includes gapped structure alignment as a special case). We also plan to devise new algorithms that achieve better aligned structures by combining sequence and structure alignment together, and to build 3D Hidden Markov Models for protein structure classification.

Acknowledgments

We thank Jun Huan for helpful discussion and advice during this work, and the reviewers for their comments. This work is partially supported by NIH grant 127666


and a UNC Bioinformatics and Computational Biology training grant.

References

1. Sela M, White FH Jr, Anfinsen CB. Reductive cleavage of disulfide bridges in ribonuclease. Science 1957; 125: 691-692.

2. Holm L, Sander C. Mapping the protein universe. Science 1996; 273: 595-602.

3. Branden C, Tooze J. Introduction to protein structure, 2nd ed. Garland Publishing, Inc., New York. 1999.

4. Kabsch W. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica A 1978; 34: 827-828.

5. Horn BKP. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A: Optics, Image Science, and Vision 1987; 4(4): 629-642.

6. Coutsias EA, Seok C, Dill KA. Using quaternions to calculate RMSD. Journal of Computational Chemistry 2004; 25(15): 1849-1857.

7. Lupyan D, Leo-Macias A, Ortiz AR. A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005; 21(15): 3255-3263.

8. Sutcliffe MJ, Haneef I, Carney D, Blundell TL. Knowledge based modelling of homologous proteins, Part I: Three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engineering 1987; 1(5): 377-384.

9. Ochagavia ME, Wodak S. Progressive combinatorial algorithm for multiple structural alignments: application to distantly related proteins. Proteins 2004; 55(2): 436-454.

10. Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Science 1998; 7: 445-456.

11. Sali A, Blundell TL. The definition of general topological equivalence in protein structures: A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. Journal of Molecular Biology 1990; 212: 403-428.

12. Guda C, Scheeff ED, Bourne PE, Shindyalov IN. A new algorithm for the alignment of multiple protein structures using Monte Carlo optimization. Proceedings of Pacific Symposium on Biocomputing 2001: 275-286.

13. Verboon P, Gabriel KR. Generalized Procrustes analysis with iterative weighting to achieve resistance. Br. J. Math. Statist. Psychol. 1995; 48: 57-74.

14. Pennec X. Multiple registration and mean rigid shapes: Application to the 3D case. Proceedings of the 16th Leeds Annual Statistical Workshop, 1996: 178-185.

15. Leibowitz N, Nussinov R, Wolfson HJ. MUSTA - a general, efficient, automated method for multiple structure alignment and detection of common motifs: application to proteins. Journal of Computational Biology 2001; 8(2): 93-121.

16. Shatsky M, Nussinov R, Wolfson HJ. A method for simultaneous alignment of multiple protein structures. Proteins 2004; 56(1): 143-156.

17. Dror O, Benyamini H, Nussinov R, Wolfson HJ. Multiple structural alignment by secondary structures: Algorithm and applications. Protein Science 2003; 12: 2492-2507.

18. Ebert J, Brutlag D. Development and validation of a consistency based multiple structure alignment algorithm. Bioinformatics 2006; 22(9): 1080-1087.

19. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Science 1998; 7: 2469-2471.

20. Altman RB, Gerstein M. Finding an average core structure: application to the globins. Proc. Int. Conf. Intelligent Systems for Molecular Biology 1994; 2: 19-27.

21. Chew LP, Kedem K. Finding the Consensus Shape for a Protein Family. Algorithmica 2002; 38(1): 115-129.

22. Taylor WR, Flores TP, Orengo CA. Multiple protein structure alignment. Protein Science 1994; 3: 1858-1870.

23. Ye J, Janardan R. Approximate multiple protein structure alignment using the sum-of-pairs distance. Journal of Computational Biology 2004; 11(5): 986-1000.

24. Ye Y, Godzik A. Multiple flexible structure alignment using partial order graphs. Bioinformatics 2005; 21(10): 2362-2369.

25. Alexandrov V, Gerstein M. Using 3D Hidden Markov Models that explicitly represent spatial coordinates to model and compare protein structures. BMC Bioinformatics 2004; 5:2.

26. Evans M, Hastings N, Peacock B. Statistical Distributions. 3rd ed. New York, Wiley. 2000.


IDENTIFICATION OF α-HELICES FROM LOW RESOLUTION PROTEIN DENSITY MAPS

A. Dal Palu
Dip. Matematica, Università di Parma
alessandro.dalpalu@unipr.it

J. He, E. Pontelli*, Y. Lu
Dept. Computer Science, New Mexico State University
{jinghe, epontell, ylu}@cs.nmsu.edu

This paper presents a novel methodology to analyze low resolution (e.g., 6Å to 10Å) protein density maps, such as those that can be obtained through electron cryomicroscopy. At such resolutions, it is often not possible to recognize the backbone chain of the protein, but it is possible to identify individual structural elements (e.g., α-helices and β-sheets). The methodology proposed in this paper performs gradient analysis to recognize volumes in the density map and to classify them. In particular, the focus is on the reliable identification of α-helices. The methodology has been implemented in a tool, called Helix Tracer, and successfully tested with simulated structures, modeled from the Protein Data Bank at 10Å resolution. The results of the study have been compared with the only other known tool with similar capabilities (Helixhunter), showing significant improvements in recognition and precision.

1. INTRODUCTION

3-dimensional (3D) protein structure information is essential in understanding the mechanisms of biological processes. A protein can be thought of as a chain of beads that adopts a certain conformation in the 3D space (native conformation). The building blocks of the chain are 20 kinds of amino acids. Knowledge of 3D structure of proteins is essential in understanding the mechanisms of protein function, and this information has become more and more important in rational drug design.

Both experimental techniques and prediction techniques have been devised to generate 3D protein structures. The most commonly used experimental techniques for protein structure determination are X-ray crystallography and Nuclear Magnetic Resonance (NMR). Both techniques can determine structures at atomic resolution (usually better than 3Å). In particular, X-ray crystallography has produced more than 80% of the known protein structures currently present in the Protein Data Bank (PDB). Although these two techniques are successful in targeting soluble proteins, they are seriously limited for non-soluble proteins, such as membrane-bound proteins and large protein complexes. In particular, X-ray crystallography is limited by the availability of suitable crystals of the protein, and large protein complexes cannot easily produce crystals.

Electron cryomicroscopy is an experimental technique that has the potential to allow structure determination for large protein complexes15, 12, 2. Using the cutting-edge techniques in this field, 3D structures of large complexes, such as the Herpes virus, have been successfully generated at 8.5Å resolution15. Although it is not possible to determine the backbone chain of the protein at the resolution range of 6Å to 10Å (current methods to solve protein structure require a density map of much higher resolution, such as 3Å or 4Å14, 6), this resolution allows the visualization of various secondary structure elements, such as α-helices and β-sheets15.

In this paper, we present a new methodology to aid in the identification of α-helices in a low resolution density map. The methodology relies on a novel representation of α-helices, where helices are modeled as general cylinder-like shapes, defined by a central axial line (i.e., a spline). The spline is a continuous line (possibly not straight), described by a set of control points. This feature allows the model to better fit real helices, and thus yields smaller errors, since helices in nature are often not straight. The actual identification of the helices makes use of a new type of analysis of the density maps, based on the notion of gradient segmentation. The strength of this segmentation method is its threshold independence, which allows the segmentation of volumes present in the density map without the drawback of using generic thresholds that can be inadequate for specific regions of the density map. The segmentation we propose is general, and can potentially be used for the extraction of other features, e.g., β-sheets and coils. The proposed methodology has been implemented in a software tool, called Helix Tracer. Preliminary experiments show very promising results: Helix Tracer is on average capable of identifying over 75% of the helices, with very low RMSD errors, and with greater accuracy than related systems (the Helixhunter system7).

*Corresponding author.

To the best of our knowledge, only one other approach has been proposed to identify helices in low resolution density maps: Helixhunter7 relies on a cylindrical representation of helices, where each helix is described as a straight cylinder with a 5Å diameter. Helixhunter identifies helices by searching for cylindrically shaped areas in the density map using the second moment tensor.

Since α-helices and β-sheets are the major components of a structure, knowledge of this information helps in discovering the fold of a protein. Moreover, these secondary structure components provide important geometrical constraints on the tertiary structure. Such constraints can be employed to effectively guide a protein prediction method, significantly improving the precision and efficiency of the prediction4, as well as to reduce the search space in the context of molecular dynamics applications. For example, in 4, we present an effective method, based on constraint satisfaction, to combine information about α-helices, obtained from Helix Tracer, with results from helix prediction (obtained from PHD13), with the aim of determining the most likely mappings of the α-helices onto the primary sequence.

2. METHOD

The input to our analysis algorithm is a density map, encoded as a 3-dimensional array. Each element corresponds to a cubic volume, called a voxel, and each voxel is associated with the mean electron density of the protein in that volume. For the sake of simplicity, the density is normalized with respect to the maximal density in the map.

Our analysis method relies on the observation that, at the resolution range of 6Å-10Å15, 7, the density distribution of a helix resembles a cylinder. In particular, the cylindrical area presents the local maximum density value roughly on the central axis of the corresponding α-helix. The density gradually decreases as the distance from the central axis increases. However, most helices have a certain degree of curvature, particularly long helices, making a perfect cylinder an inaccurate template.

2.1. Overall approach

The algorithm for helix extraction is based on processing the discrete density map. The outcome of this analysis is a description of the helices identified. In various previous proposals, such as Helixhunter7, each α-helix is described as a cylinder with a 2.5Å radius. The cylinder is characterized by three parameters (see Fig. 1 on the left): (a) the starting point s = (s_x, s_y, s_z), located at one extremity of the central axis of the cylinder, (b) the axis orientation vector d = (d_x, d_y, d_z), and (c) the length L of the axis. Following our concern that actual helices in nature are not straight but tend to bend and curve to a certain degree, we introduce a more general representation. In this work, we describe the central axial line of the α-helix in terms of a quadratic spline1, while the helix itself is defined as the set of points whose minimal distance from the spline is at most 2.5Å. A spline is a continuous curve, controlled by a finite number of control points. The central axial spline is generated using a standard spline function, based on the identified control points S_1, ..., S_n, where S_i = (a_ix, a_iy, a_iz); see Figure 1 on the right.
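A minimal sketch of this axis representation (ours, not the Helix Tracer code) using SciPy's spline routines with degree k = 2; the choice of an interpolating quadratic spline through the control points is our assumption:

```python
import numpy as np
from scipy.interpolate import splev, splprep

def axial_spline(control_points, samples=41):
    """Sample a quadratic spline through the helix-axis control points
    S_1..S_n; returns samples x 3 points along the central axial line."""
    pts = np.asarray(control_points, dtype=float)   # n x 3
    tck, _ = splprep(pts.T, k=2, s=0)               # interpolating, degree 2
    u = np.linspace(0.0, 1.0, samples)
    return np.array(splev(u, tck)).T
```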

The essential idea used in the helix detection process is to segment the density map into volumes satisfying certain properties.


Fig. 1. Helix models: the straight-cylinder representation (left) and the spline-based representation (right)

The intuition of the segmentation process is that each local maximal density voxel can be related to the presence of a packed set of atoms. This situation arises when amino acids are arranged into specific patterns that provide a high local density contribution. For example, helices are arranged so that the side chains of the amino acids involved show an average increase of local density with respect to normal coil, due to the helical packing of the backbone. At low resolution, this is characterized by a clear increase of local density that reflects the three-dimensional shape of the helix. Hence, the problem boils down to recognizing such clusters of locally higher density.

Every maximal density voxel v is the representative of a volume, defined as the set of voxels that can be reached from v without increasing the density along the path followed. Each volume is a maximal set of voxels and it contains, in general, small parts of individual helices. The key idea is that this segmentation offers a robust identification of subsets of the helices' volumes. Thus, the problem boils down to correctly merging some of these volumes in order to reconstruct the identified helices.

The method involves gradient analysis, and it is substantially different from simple density-value thresholding (as used in previous proposals7, 8). The gradient is vectorial information, expressed in terms of a 3D direction and an intensity. Intuitively, the gradient shows the direction that points to the locally maximal increase of density. The gradient information is computed for each voxel, considering the density map as the discretization of a continuous function from R³ to density values. In this perspective, the gradient corresponds to the first derivative of such a density function. For processing purposes, the gradient associated with each voxel is approximated using a discrete and local operator over the original density map.

Using the gradient direction as a pointer, we can follow these directions from voxel to voxel, until a maximal density voxel is found. The paths generated touch every voxel, and can be partitioned according to the ending points reached. Paths that share the same ending point form a tree structure that is associated with the same volume. This process outputs the segmentation we require for helix detection.

The motivation for requiring such a segmentation is that low resolution density maps witness the presence of a helix as a dense cylinder-like shape, where the density increases gradually towards its axis. When close to the axis (e.g., < 5Å), the gradient points towards that axis. This means that the high density voxels of the trees identified on the gradient paths can be employed to characterize the location of the helix axis. Observe that we use gradient trees to segment volumes, by collecting in a single volume all the nodes whose gradient paths lead to the same maximal density voxel. Thus, each of these volumes will contribute to only a part of a helix, and further analysis is required to study the properties of the volumes (and the relationships between their maximal density voxels) and determine whether different volumes actually belong to the same helix.

The complete process is articulated in the following phases, described in the next subsections: (i) gradient calculation; (ii) graph construction and processing; (iii) detection of helices.

2.2. Gradient determination

The density map is processed in order to build the map of gradients. The gradient is approximated using Sobel-like convolution masks (3×3×3) over the original density map5. The gradient is represented by a vector whose direction and intensity can be calculated using the Sobel-like masks in Figure 2. For each voxel, a 3D convolution process is performed using the three masks: each mask is overlapped on the density map, and the summation of a point-by-point product is performed in order to collect the intensity of the gradient component for each dimension. The addition of the three resulting vectors generates the gradient associated with the voxel. For example, the component of the gradient along the X axis can be calculated by using the three matrices in the first row of Figure 2.

In Fig. 3(a), we show a slice of a density map; Fig. 3(b) indicates the corresponding z-projection of the gradient for each point. Fig. 3(c) is the overlay of Fig. 3(a) and Fig. 3(b). Observe how the gradient lines are "pointing" towards the denser regions of the density map (shown in darker color).

2.3. Construction of the graph

The next step of the algorithm involves the construction of a graph describing the structure of the density map. In particular, a directed graph G = (N, E) is used to summarize the gradient properties, where N is the set of nodes of the graph and E ⊆ N × N is the set of edges. Nodes represent voxels of interest (as described later), while edges connect voxels that are "adjacent" in the density map.

Let us consider two voxels V₁ = (x₁, y₁, z₁) and V₂ = (x₂, y₂, z₂). The voxels are considered to be neighbors if and only if the following relationship is satisfied: max{|x₁ - x₂|, |y₁ - y₂|, |z₁ - z₂|} ≤ 1. In other words, two voxels are neighbors if they differ by at most one unit in each coordinate, which leads to 26 possible neighbors per voxel. In the graph we construct, edges are introduced only if the two nodes involved are neighbors.

The process starts with a coarse thresholding (cropping) of the density map. The purpose of this step is to discard grossly irrelevant voxels, to improve the efficiency of the successive analysis steps, without incurring any relevant loss of information. In particular, we retain only the voxels with a density value greater than 0.5 (50% of the maximum value of the map); this choice arises from the practical observation that the voxels of interest have an average density larger than 0.65.

After this coarse thresholding, N consists of the nodes formed by the remaining voxels. For each node n₁, we add a directed edge (n₁, n₂) that starts from n₁ and points to the neighbor n₂ ∈ N of n₁ (the arrowed lines in Figure 4) if the following two conditions are satisfied:

• The directed edge is the best approximation of the gradient direction (the non-arrowed lines in Figure 4);
• The density at n₂ is higher than that at n₁.

As the last step in the construction, the direction of each edge in the graph is inverted. The resulting graph is a directed acyclic graph, and each node has at most one incoming edge (Figure 4(c)). The graph is actually a forest of trees, since it is not necessarily connected. The key property is that every path in each tree represents a decreasing density sequence of neighboring voxels.

The trees recognized in this graph construction provide a segmentation of the density map in distinct volumes. Each volume contains the voxels that belong to the same tree. The root of a tree is the representative of the associated volume, and it is the voxel with the highest density in the volume—e.g., the double-circled node in Figure 4(c).

After the trees have been generated, small and spatially close trees are merged to simplify the forest. Note that in this version we did not apply any image preprocessing, i.e., smoothing and/or filtering. This implies that some noise could be present and split the gradient segmentation into small volumes. However, due to the low resolution scale (6-12Å), we can address this problem by merging volumes that are close to each other. This is important in order to reduce volume fragmentation and to facilitate the detection of volume borders. When the distance between the roots of two trees is less than 3.0Å, the two trees are merged, because the typical distance between consecutive amino acids is 3.8Å. In the cases we select, the roots are close enough to ensure that the merged volumes describe two areas that are consecutive along the direction defined by the backbone. In the future, we plan to introduce more robust image preprocessing (e.g., a smoothing phase) to cope with this problem.

The last operation in the graph processing phase is to mark the border of each volume induced by the tree. A voxel is on the border if at least one of its neighbors is not in the same volume. The border voxels are used later for merging volumes that are determined to belong to the same helix. Voxels belonging to a volume border are properly marked using a flag.
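The following sketch (ours) mirrors this segmentation: from each retained voxel it follows, step by step, the neighbor that best approximates the gradient direction among the denser neighbors, so voxels whose paths end at the same local maximum form one volume:

```python
import numpy as np

def segment_volumes(density, grad, crop=0.5):
    """Gradient-following segmentation (sketch of Sec. 2.3). `density` is a
    3D array, `grad` an (X, Y, Z, 3) gradient field; returns a dict mapping
    each retained voxel to the root (local maximum) of its volume."""
    offsets = [(i, j, k) for i in (-1, 0, 1) for j in (-1, 0, 1)
               for k in (-1, 0, 1) if (i, j, k) != (0, 0, 0)]  # 26 neighbors

    def parent(v):
        """Neighbor best aligned with the gradient among denser neighbors."""
        best, best_dot = None, 0.0
        for off in offsets:
            u = tuple(np.add(v, off))
            if all(0 <= u[t] < density.shape[t] for t in range(3)) \
                    and density[u] > density[v]:
                dot = np.dot(grad[v], off) / np.linalg.norm(off)
                if dot > best_dot:
                    best, best_dot = u, dot
        return best                                # None at a local maximum

    labels = {}
    for v in zip(*np.nonzero(density > crop)):     # coarse cropping at 0.5
        path = []
        while v not in labels:
            nxt = parent(v)
            if nxt is None:
                break
            path.append(v)
            v = nxt
        root = labels.get(v, v)                    # root of this volume
        for u in path + [v]:
            labels[u] = root
    return labels
```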


Sobelx[0] = [[ 1, 0, -1], [ 2, 0, -2], [ 1, 0, -1]]
Sobelx[1] = [[ 2, 0, -2], [ 4, 0, -4], [ 2, 0, -2]]
Sobelx[2] = [[ 1, 0, -1], [ 2, 0, -2], [ 1, 0, -1]]

Sobely[0] = [[ 1, 2, 1], [ 0, 0, 0], [-1, -2, -1]]
Sobely[1] = [[ 2, 4, 2], [ 0, 0, 0], [-2, -4, -2]]
Sobely[2] = [[ 1, 2, 1], [ 0, 0, 0], [-1, -2, -1]]

Sobelz[0] = [[ 1, 2, 1], [ 2, 4, 2], [ 1, 2, 1]]
Sobelz[1] = [[ 0, 0, 0], [ 0, 0, 0], [ 0, 0, 0]]
Sobelz[2] = [[-1, -2, -1], [-2, -4, -2], [-1, -2, -1]]

Fig. 2. Sobel's convolution masks
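The masks are separable: a [1, 2, 1] smoothing profile along two axes combined with a [1, 0, -1] difference along the third. The whole gradient field can therefore be computed with three 3×3×3 convolutions; a sketch (ours, up to the sign convention of the masks above):

```python
import numpy as np
from scipy.ndimage import convolve

def sobel_gradients(density):
    """Per-voxel 3D gradient field via the Sobel-like masks of Fig. 2.
    Returns an (X, Y, Z, 3) array: one gradient vector per voxel."""
    smooth = np.array([1.0, 2.0, 1.0])     # smoothing profile along an axis
    deriv = np.array([1.0, 0.0, -1.0])     # difference profile along an axis
    grad = np.empty(density.shape + (3,))
    for axis in range(3):
        profiles = [smooth, smooth, smooth]
        profiles[axis] = deriv             # derivative on this axis only
        kernel = np.einsum('i,j,k->ijk', *profiles)
        grad[..., axis] = convolve(density, kernel, mode='nearest')
    return grad
```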

Fig. 3. Density map and gradients: (a) a slice of a density map; (b) the z-projection of the gradient at each point; (c) the overlay of (a) and (b)

2.4. Detection of helices

Fig. 4. Obtaining the tree from gradients

The final phase in our computation is to define the location of the α-helices. This phase involves two steps. The first step is to detect and merge the volumes that belong to the same helix. The second step is to construct a description of each identified helix, by defining the location of the control points that constitute the central axial line.

It has been observed that a helix often contains one or more neighboring volumes. Therefore, the volumes are analyzed to see if they belong to the same helix. The border voxels of each volume are scanned for the satisfaction of two rules. One is to see if there is a neighboring border voxel that belongs to a different volume. The other is to see if the border voxel has high density (greater than a threshold helixTHR). A typical threshold is 85% of the maximum density value of the map. Volumes that satisfy the above two rules are merged to generate the volume of a helix. Note that this thresholding is applied after the segmentation phase, and thus is used simply as a selection tool for the relevant volumes.

The rationale of this process is the following. If two volumes belong to the same helix, the contact points between the two volumes lie on a plane which is orthogonal to the helix axis. In practice, due to the discretization of the density into voxels, the contact surface may not be a regular plane, and it may contain some irregularities. Nevertheless, this does not constitute a problem. This follows from the gradient property: each gradient associated with a voxel that belongs to a helix points towards the axis of the helix. Therefore, if the contact surface is not orthogonal to the axis, the gradients on the border will point into one volume, and thus one volume would be a subset of the other. This also implies that the high density region on the border between two volumes is very close to the axis of the α-helix. We have experimentally observed that only helices show a high density (above the helixTHR threshold) close to their axis, thus we can safely merge two volumes that present this characteristic.

The identification of the control points relies on the fact that the central axial line of the helix is often located at the high density voxels of the volume. To define the search space of the control points, a subset of the helix volume, called the trace volume, is generated using a threshold (helixselectTHR). This threshold, by default, is set to 2% less than helixTHR. This choice is suggested by the practical observation that helixTHR, the threshold used to detect volumes belonging to the same helix, is too strict when used for the construction of the axis. Moreover, note that this thresholding is performed on volumes, which ensures that the analysis never takes us to volumes that do not belong to the same helix. This guarantees that separate helices are not erroneously merged and that a helix is not incorrectly broken into separate structures.

The central axial line is generated using a greedy algorithm. The idea is to start from the highest density voxel close to the barycentre of the trace volume (in a neighborhood of 3Å). We estimate the initial search direction by means of a least-squares fitting of the trace volume. From the starting voxel, we launch two searches along the initial search direction, which return two half axes, one for each side analyzed. The traversal moves to the neighbor that contains the locally highest density available. During this exploration, we move along the axis towards the ends of the helix, while building a path that contains the maximal values detected; recall that the density map for a helix decreases quickly when moving orthogonally away from the axis. The union of the two paths gives the set of control points associated with the axis.

Finally, the control points are smoothed with a single pass of Gaussian smoothing (σ = 8), in order to reduce the scattered jumps between neighbors. The smoothed, real-coordinate points are used as the actual control points for the second-order spline that is then generated. At the end of the process, a validation step is launched, in order to discard the helices that show an extreme and unnatural curvature.
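The smoothing pass is a standard one-dimensional Gaussian filter applied independently to each coordinate of the control-point sequence; a sketch (ours), with σ = 8 as in the text:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_control_points(points, sigma=8.0):
    """Single-pass Gaussian smoothing of the traced axis control points."""
    pts = np.asarray(points, dtype=float)        # k x 3 raw control points
    return gaussian_filter1d(pts, sigma=sigma, axis=0, mode='nearest')
```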

3. EXPERIMENTAL RESULTS

3.1. Helix Tracer results

In order to test Helix Tracer, we generated density maps for 29 proteins with known structures in the PDB. The density maps have been generated at 10Å resolution, using the program pdb2mrc from the EMAN suite11. The proteins have been randomly chosen, and they offer a good representation of proteins with varying structural configurations. In particular, we include representatives from the major SCOP families10: α+β (e.g., 1A06), all-α (e.g., 1CC5), α/β (e.g., 1B0M), and proteins with separate α and β domains (e.g., 1BVP). For example, the density map of protein 1BM1, at 10Å resolution, is shown in Figure 5. The helices identified by Helix Tracer are shown as sticks that are overlaid on the density map and on the backbone of the protein. Notice that the identified helices are not straight. All experiments have been conducted using Linux (2.6.11) workstations (a Pentium 4, 3.1GHz, and an AMD 64-bit, 2.39GHz).

Table 2 reports the number of helices recognized by Helix Tracer. In particular, Table 2 provides the following information: the PDB Id of each protein (ID), the execution time in seconds (Time), the number of helices present in the PDB annotation (PDB Helices), the number of helices in the PDB annotation that are longer than 8.1Å(a), the number of helices recognized by Helix Tracer (Identified), the number of identified helices that are correct (Correct), the number of false positives (False), and the number of helices of adequate length present in the PDB annotation and missed by Helix Tracer (Missed). The last two columns report the number of Cα atoms present in the helices of length greater than 8.1Å in the PDB annotation, and the corresponding number of Cα atoms correctly identified by Helix Tracer.

Fig. 5. Helices identified for 1BM1 (PDB Id)

A helix is correctly identified by Helix Tracer if it can be paired with a PDB helix in the protein. In particular, there should be at least one Cα on the PDB helix that is within a 4Å distance from the central axial line of the identified helix. Helix Tracer, on average, recognizes 80% of the helices longer than 8.1Å. In particular, the ratio of helices correctly identified is often larger than 88%.

Although the pairing process is simple, the accuracy of the identified helices can also be seen from the number of Cα atoms that are recognized by Helix Tracer. A Cα is a correctly recognized Cα atom if it can be projected internally onto the identified helix axis. Cα atoms that cannot be projected onto the axis, i.e., that could only be projected onto the prolongation of the axis outside the helix, are not counted as correctly identified Cα atoms. Helix Tracer recognized 75% of the total Cα atoms that are on the PDB helices longer than 5 amino acids (shown in Table 2 and in Figure 6 on the left, as a comparison). Observe also that, despite the lack of optimization in the current implementation, the execution times are very reasonable (Table 2, column Time).

(a) This is the minimal length of helices detected by default by Helix Tracer and Helixhunter.

When a helix is represented as a straight cylinder, it is straightforward to calculate the projection of a Cα atom onto the central helix axis. However, since we use splines as the axis representation (see Figure 1), a method to project a Cα onto the axial spline needs to be developed. We subdivide the continuous spline into a set of 40 contiguous segments, a number sufficient to approximate the spline. The lengths of these segments are not necessarily identical, and they depend on the spatial distribution of the control points. We thus approximate the problem of computing the projection onto a continuous spline with the problem of finding the smallest projection vector over this set of segments.
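The following sketch makes the projection procedure concrete. It treats the discretized spline as a list of 3D points bounding the segments and accepts only internal projections; the returned distance is also what the 4Å pairing test and the Cα bookkeeping above rely on. Names and data layout are illustrative assumptions.

    import numpy as np

    def project_onto_axis(ca, spline_pts):
        """Return (distance, foot_point) for the closest internal projection
        of a C-alpha onto the piecewise-linear axis, or None if the atom
        would project only onto prolongations of the axis."""
        best = None
        for a, b in zip(spline_pts[:-1], spline_pts[1:]):
            ab = b - a
            t = np.dot(ca - a, ab) / np.dot(ab, ab)
            if 0.0 <= t <= 1.0:            # internal projection only
                foot = a + t * ab
                d = np.linalg.norm(ca - foot)
                if best is None or d < best[0]:
                    best = (d, foot)
        return best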

In order to further evaluate the accuracy, the RMSD (Root Mean Square Deviation) is calculated for the correctly identified Cα atoms. The theoretical distance between a Cα atom and the central axial line of the helix is 2.5Å. The RMSD we calculate is therefore the deviation from 2.5Å of the distances between the correctly identified Cα atoms and the central axial line. RMSD values for the selected proteins are plotted in Figure 6 (on the right).
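Written out, the quantity described above is (with d_i the distance of the i-th correctly identified Cα atom from the axial line, and n the number of such atoms):

    \mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( d_i - 2.5\,\text{\AA} \right)^{2}}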

Table 1. Use of different resolutions

                   8Å      10Å     12Å
1BVP   Correct     100%    89%     78%
       Missed      0       1       2
       Cα          93%     85%     76%
1Q16   Correct     62%     57%     36%
       Missed      20      23      34
       Cα          63%     58%     36%
1TCD   Correct     92%     76%     68%
       Missed      2       6       8
       Cα          85%     77%     66%

Finally, let us underline that the quality of the results depends on the resolution of the density maps employed. We tested the program on density maps at 8Å, 10Å, and 12Å resolution. Table 1



shows the percentage of correctly identified helices, the number of missed helices, and the percentage of correctly identified Cα. We performed these experiments using the proteins 1BVP, 1Q16, and 1TCD. As expected, the accuracy of Helix Tracer degrades as the quality of the density map degrades.

3.2. Comparison with Helixhunter

In this subsection we report the comparison between Helix Tracer and Helixhunter 7. We employ release 1.7 of the Helixhunter software. The comparison is performed on the same set of 29 proteins used in the previous subsection. The density threshold used in Helixhunter is 0.85, which is the same value (helix_THR) used in Helix Tracer. Although we also tested different density thresholds for Helixhunter, 0.85 appears to be the threshold that generates the best overall results for these 29 proteins.

The comparison starts with the evaluation of the following two parameters (for definition, see the previous subsection):

• RMSD: we compare the RMSD values for those Cα atoms that have been correctly identified by both systems;

• Cα: we compare the number of Cα atoms that have been correctly identified by either system.

These results are depicted in the two diagrams of Figure 6. Helix Tracer correctly identified 75.0% of the total Cα atoms on the helices longer than 8.1Å, while Helixhunter identified 58.4% of such Cα atoms. For the correctly identified Cα atoms, the RMSD from Helix Tracer is on average 0.086Å lower than that of Helixhunter. For the protein 1CC5 we reach a peak improvement of 64% in Cα recognition. Moreover, note that the method adopted in Helix Tracer always performs better than Helixhunter in terms of the number of correctly identified helices and RMSD.

Figure 7 compares the performance of the two systems in terms of the number of helices that are longer than 5 amino acids. In particular, the diagram on the left compares the number of correctly identified helices (relating them to the number of helices in the PDB annotation), while the figure on the right shows the trend in the number of helices present in the PDB annotation and not recognized by either of the systems. Once again, we can observe that Helix Tracer provides significantly better results (up to 37% more helices correctly identified) and never performs worse than Helixhunter.

4. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a novel methodology to analyze low resolution density maps (e.g., 6Å to 10Å) of proteins. This is the resolution level that can be obtained for large protein complexes using techniques such as electron cryomicroscopy. At this level of resolution, it is often impossible to recognize the actual backbone directly from the protein density map, but the resolution is sufficient to visually recognize structural features, such as α-helices and β-sheets.

The methodology proposed in this paper makes use of gradient information, extracted from the density map, to perform volume segmentation and to guide the analysis of volumes toward the identification of secondary structure components. In this paper, we focused on the problem of recognizing α-helices. The resulting technique has been implemented in the Helix Tracer system, and we performed a test using 29 proteins, with very positive results. In particular, Helix Tracer provides a more flexible representation of helices, leading to a more accurate identification.

The outcome of the analysis performed by Helix Tracer can be applied to aid in the reconstruction of a tentative atomic model of the entire protein complex. For example, the information about α-helices can be employed as constraints to guide and/or filter ab initio secondary structure prediction, and to aid in threading the protein sequence into the 3D structure. In this direction, we have proposed a framework to map the secondary structures identified from the density map to their locations on the primary sequence of the protein 4; the framework computes the mappings that best satisfy both the constraints obtained from the density map and the results obtained using secondary structure prediction tools (e.g., PHD). The framework relies on encoding all the components of the problem as a constraint satisfaction problem 9 and makes use of Constraint Logic Programming techniques to solve it.

This gradient-based technique is a general approach, and can be effectively used to recognize other


Table 2. Helix Tracer results

       Execution  PDB      PDB Helices                                        Cα    Cα
ID     Time (s)   Helices  > 8.1Å        Identified  Correct  False  Missed   PDB   Helix Tracer
1A06   3.5        14       10            10          9        1      1        99    80
1AGW   2.5        8        8             7           7        0      1        98    86
1AXG   3.6        15       10            10          8        2      2        98    61
1B0M   8.9        30       26            33          23       10     3        259   207
1BM1   1.8        7        7             7           7        0      0        161   154
1BRX   1.8        7        7             8           7        1      0        162   156
1BVP   11.7       10       9             10          8        2      1        123   105
1CC5   0.5        4        4             5           4        1      0        41    37
1CI1   5.2        28       25            24          18       6      7        231   172
1DZE   1.9        9        9             8           7        1      2        182   145
1FIY   12.5       40       37            31          28       3      9        572   418
1GIH   2.6        12       11            7           6        1      5        114   76
1JQN   12.3       39       38            35          31       4      7        652   479
1KE8   2.6        13       12            8           8        0      4        121   86
1KGB   1.8        8        8             7           7        0      1        184   168
1M52   11.7       30       20            19          15       4      5        242   183
1NVS   4.9        12       9             10          8        2      1        95    74
1POC   12.3       34       27            23          17       6      10       240   154
1P14   1.9        14       11            7           7        0      4        122   79
1P8I   1.8        8        7             8           7        1      0        179   166
1Q16   29.0       57       53            37          30       7      23       480   279
1QHD   8.4        14       10            11          8        3      2        122   98
1TCD   5.2        28       25            24          19       5      6        230   177
1TPB   5.0        22       22            17          13       4      9        194   130
2BRD   1.8        7        7             7           7        0      0        169   158
2YPI   5.0        24       24            17          16       1      8        266   158
3PRK   2.0        6        6             11          5        6      1        90    63
7TIM   5.1        24       24            22          18       4      6        216   163
8TIM   5.0        28       22            18          17       1      5        211   150

[Figure: two panels, per protein. Left: comparison of helix Cα recognition (Helixhunter, Helix Tracer, and PDB data). Right: RMSD comparison over the correctly identified Cα (Helixhunter and Helix Tracer).]

Fig. 6. Helix Tracer vs. Helixhunter: number of amino acids identified and RMSD

secondary structure traits of the protein. Work is in progress to apply this methodology to identify β-sheets and coils. Future work will include the development of heuristics to further improve the quality of α-helix identification and to reduce the number of false positives. This will require a more comprehensive analysis.


[Figure: two panels, per protein. Left: correctly identified helices (Helix Tracer, Helixhunter, and PDB helices). Right: missed helices (Helix Tracer and Helixhunter).]

Fig. 7. Correctly identified helices and missed helices

This analysis will include the recognition of coils and β-sheets. Finally, work is in progress to apply the proposed technique to real data obtained from electron cryomicroscopy.

Acknowledgments

The research has been partially supported by NSF grants CNS-0454066, HRD-0420407, and CNS-0220590.

References

1. R.H. Bartels, J.C. Beatty, and B.A. Barsky. An Introduction to Splines for Use in Computer Graphics and Geometric Modeling. Morgan Kaufmann Publishers, Los Altos, 1987.

2. B. Bottcher, S.A. Wynne, and R.A. Crowther. Determination of the Fold of the Core Protein of Hepatitis B Virus by Electron Cryomicroscopy. In Nature, 386:88-91, 1997.

3. W. Chiu et al. Deriving Folds of Macromolecular Complexes through Electron Cryomicroscopy and Bioinformatics Approaches. In Curr Opin Struct Biol, 12:263-269, 2002.

4. A. Dal Palu, E. Pontelli, J. He, and Y. Lu. A Constraint Logic Programming Approach to 3D Structure Determination of Large Protein Complexes. In ACM Symposium on Applied Computing, ACM Press, 2006.

5. R. Gonzalez and R. Woods. Digital Image Processing. Addison Wesley, 1992.

6. J. Greer. Automated Interpretation of Electron Density Maps of Proteins: Derivation of Atomic Coordinates for the Main Chain. In J Mol Biol, 100:427-458, 1974.

7. W. Jiang, M. Baker, S. Ludtke, and W. Chiu. Bridging the Information Gap: Computational Tools for Intermediate Resolution Structure Interpretation. In J. of Mol. Biol., 308, 2001.

8. Y. Kong and J. Ma. A Structural-informatics Approach for Mining Beta-sheets: Locating Sheets in Intermediate Resolution Density Maps. In J Mol Biol, 332(2):399-413, 2003.

9. V. Kumar. Algorithms for CSP: a Survey. In AI Magazine, Spring, 32-44, 1992.

10. L. Lo Conte, B. Ailey, T.J. Hubbard, S.E. Brenner, A.G. Murzin, and C. Chothia. SCOP: a Structural Classification of Proteins Database. In Nucl. Acids Res., 28:257-259, 2000.

11. S. Ludtke, P.R. Baldwin, and W. Chiu. EMAN: Semi-automated Software for High Resolution Single Particle Reconstructions. In J. Struct. Biol., 128:82-97, 1999.

12. E.J. Mancini, M. Clarke, B.E. Gowen, T. Rutten, and S.D. Fuller. Cryo-electron Microscopy Reveals the Functional Organization of an Enveloped Virus. In Mol. Cell, 5:255-266, 2000.

13. B. Rost. Protein Secondary Structure Prediction Continues to Rise. In J. Struct. Biol., 134:204-218, 2001.

14. C.E. Wang. ConfMatch: Automating Electron-density Map Interpretation by Matching Conformations. In Acta Crystallogr. D Biol. Crystallogr., 56:1591-1611, 2000.

15. Z.H. Zhou, M. Dougherty, J. Jakana, J. He, F. Rixon, and W. Chiu. Seeing the Herpesvirus Capsid at 8.5Å. In Science, 288:877-880, 2000.


EFFICIENT ANNOTATION OF NON-CODING RNA STRUCTURES INCLUDING PSEUDOKNOTS VIA AUTOMATED FILTERS

Chunmei Liu* and Yinglei Song

Department of Computer Science, University of Georgia

Athens, Georgia 30602, USA

Email: {chunmei, song}@cs.uga.edu

Ping Hu

Department of Genetics, University of Georgia

Athens, GA 30602, USA

Email: [email protected]

Russell L. Malmberg

Department of Plant Biology, University of Georgia

Athens, GA 30602, USA

Email: [email protected]

Liming Cai*

Department of Computer Science, University of Georgia Athens, Georgia 30602, USA

Email: [email protected]

Computational search of genomes for RNA secondary structure is an important approach to the annotation of non-coding RNAs. The bottleneck of the search is sequence-structure alignment, which is often computationally intensive. A plausible solution is to devise effective filters that can efficiently remove segments of the genome unlikely to contain the desired structure patterns and to apply the search only on the remaining portions. Since filters can be substructures of the RNA to be searched, the strategy for selecting which substructures to use as filters is critical to the overall search speed-up. The issue becomes more involved when the structure contains pseudoknots; approaches that can filter pseudoknots are not yet available. In this paper, a new effective filtration scheme is introduced to filter RNA pseudoknots. Based upon the authors' earlier work on a tree-decomposable graph model for RNA pseudoknots, the new scheme can automatically derive a set of filters with the overall optimal filtration ratio. Search experiments on both synthetic and biological genomes showed that, with this filtration approach, RNA structure search can be sped up 11- to 60-fold while maintaining the same search sensitivity and specificity as without filtration. In some cases, the filtration even improves a specificity that is already high.

1. INTRODUCTION

Non-coding RNAs (ncRNAs) do not encode proteins, yet they play fundamental roles in many biological processes including chromosome replication, RNA modification, and gene regulation 7, 19, 28. Due to the explosive growth of fully sequenced genome data, homologous searching using computational methods has recently become an important approach to annotating genomes and identifying new ncRNAs 16, 21, 22. In general, a computational searching tool scans through a genome and aligns its sequence segments to an RNA profile. Since secondary structure generally determines the biological functions of an ncRNA and is preserved across its homologs, a profile needs to include both sequence conservation and secondary structure information. For example, compared with profiling models based on Hidden Markov Models (HMMs) 14, Covariance Models (CMs) 6 contain additional emission states that can emit base pairs to generate stems. CMs can thus be used as structural profiles to model RNA families. However, for pseudoknots, which contain at least one pair of crossing stems, the sequence-structure alignment is computationally intractable. RNA structure search

*To whom correspondence should be addressed.


in genomes or large databases thus remains difficult. Searches on genomes can be sped up with filtration methods: with simpler sequence or structural models, it is possible to efficiently remove genome segments unlikely to contain the desired pattern. A few filtration methods 2, 16, 27 have been developed to improve search efficiency. For example, in tRNAscan-SE 16, two efficient tRNA detection algorithms are used as filters to preprocess a genome and remove most parts that are unlikely to contain the searched tRNA structure. The remaining part of the genome is then scanned with a CM to identify the tRNA. FastR 2 considers the structural units of an RNA structure; it evaluates the specificity of each structural unit and constructs filters based on these specificities. In 27, an algorithm is developed to safely "break" the base pairs in an RNA structure and automatically select filters from the resulting HMM models. These approaches have significantly improved the computational efficiency of genome annotation. However, none of them has yet been applied to search for structures that contain pseudoknots.

Filters, like the structure to be searched, need to be profiled with appropriate models. Most of the existing searching tools 3, 13, 15, 16 use CMs to profile the secondary structure of an ncRNA. While CM based searching tools can achieve high accuracy, they are incapable of modeling pseudoknots. In addition, the time complexity for optimally aligning a sequence segment to a CM profile is too high for a thorough search of a genome 13. A few models 4, 20, 23, 26 based on stochastic grammar systems have been proposed to profile pseudoknot structures. However, for all these models, the computation time and memory space needed for an optimal structure-sequence alignment are O(N^5) and O(N^4) respectively. In practice, these models cannot be directly used for searching. Heuristic approaches 3, 8, 15 can significantly improve the search efficiency for pseudoknots. These approaches either cannot guarantee the search accuracy 8 or have the same drawback in computation efficiency as CM based approaches 3, 15.

A tree decomposable graph model has been introduced in our previous work 25. In particular, the secondary structure of an RNA is modeled as a conformational graph, while a queried sequence segment is modeled with an image graph with valued vertices and edges. The sequence-structure alignment can be determined by finding in the image graph the maximum valued subgraph that is isomorphic to the conformational graph. Based on a tree decomposition of the conformational graph with tree width t, a sequence-structure alignment can be accomplished in time O(k^t N^2) 25, where k is a small parameter (practically k ≤ 7) and N is the size of the conformational graph. The tree width t of the RNA conformational graph is very small, e.g., t = 2 for pseudoknot-free RNAs, and it increases only slightly for pseudoknots. Experiments have shown that this approach is significantly faster than CM based searching approaches while achieving an accuracy comparable with that of CMs.

In this paper, based on the tree decomposable model, we develop a novel filtration approach. In particular, based on the profiling model in our previous work, the subtree formed by the tree nodes containing either of the two vertices that form a stem can be used as a filter; a filter can thus be constructed for each stem in the conformational graph. Based on the intersection relationships among the subtrees of filters, we construct a filter graph, in which each vertex represents a maximal subtree and two vertices are connected with an edge if the corresponding subtrees intersect. We associate with every vertex in the filter graph a weight, which is the filtration ratio of the filter, measured on randomly generated sequences. We then select the filters that correspond to the maximum weighted independent set in the graph. A filter graph is a chordal graph, and we are therefore able to compute its maximum weighted independent set in time O(n^2), where n is its number of vertices. Filters can thus be selected in time O(n^2).

We have implemented this filter selection algorithm and combined it with the original tree decomposition based searching program to improve its computational efficiency. To test its accuracy and computational efficiency, we used this combined search tool to search for RNA structures inserted into randomly generated sequences. Our testing results showed that, compared with the original searching program, this filtering approach is significantly faster and can achieve improved specificity. Specifically, it achieved


20- to 60-fold speed up for pseudoknot-free RNAs and 11- to 45-fold speed up for RNAs containing pseudoknots. In addition, for some tested structures, this approach is able to achieve an improvement in specificity from about 80% to 92%. We then used this combined searching tool to search a few biological genomes for ncRNAs. Our testing results showed that the combined program can accurately determine the locations of these ncRNAs with significantly reduced computational time; e.g., compared with the original searching program, it achieved 6- to 142-fold speed up for genome searches for pseudoknots.

2. ALGORITHMS AND MODELS

2.1. Tree Decomposable Graph Model

In our previous work 25, the consensus secondary structure of an RNA family was modeled as a topological relation among stems and loops. The model consists of two components: a conformational graph that describes the relationship among all stems and loops, and a set of simple statistical profiles that model individual stems and loops. In the conformational graph, each vertex defines one of the base pairing regions of a stem. The graph contains both directed and undirected edges. Each undirected edge connects two vertices that form the pairing regions of a stem. In addition, the vertices for two base regions are connected with a directed edge (from 5' to 3') if the sequence part between them is a loop. Technically, two additional vertices s (called the source) and t (called the sink) are included in the graph. Figure 1(a) and (b) show the consensus structure of an RNA family and its conformational graph. In general, we can construct a consensus structure from the multiple structural alignment of a family of RNAs. In this model, in addition to the conformational graph, individual stems are profiled with the Covariance Model (CM) 6, and loops are profiled with HMMs 14.
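The construction just described can be sketched as follows. Stems are given as pairs of base-pairing regions in 5'→3' order; the vertex names and the simplification that every gap between consecutive regions is treated as a loop are our illustrative assumptions, not the authors' data format.

    def build_conformational_graph(stems):
        """stems: list of ((l_start, l_end), (r_start, r_end)) region pairs.
        Returns (vertices, undirected stem edges, directed loop edges)."""
        vertices, undirected, regions = ['s'], [], []
        for i, (left, right) in enumerate(stems):
            vl, vr = 'v%dl' % i, 'v%dr' % i
            vertices += [vl, vr]
            undirected.append((vl, vr))     # the two regions of one stem
            regions += [(left, vl), (right, vr)]
        vertices.append('t')
        regions.sort()                      # 5'->3' order of base regions
        order = ['s'] + [v for _, v in regions] + ['t']
        # each gap between consecutive regions is modeled as a loop edge
        directed = list(zip(order[:-1], order[1:]))
        return vertices, undirected, directed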

To align the structure model to a target sequence, we first preprocess the target sequence to identify all possible matches to each individual stem profile. All pairs of regions with statistically significant alignment scores, called the images of the stem, are identified. Then an image graph is constructed from the set of images for all stems in the structure. In particular, each vertex represents an image for one pairing region of a stem; two vertices for the base pairing regions of a stem are connected with an undirected edge. In addition, a directed edge connects the vertices for two non-overlapping base regions (5' to 3'). To reduce the complexity of the graph, a parameter k is used to define the maximum number of images that a stem can map to. It can be computed based on a statistical cut-off value, and its value is generally small in nature. Figure 1(c) and (d) illustrate the mapping from stems to their images and the corresponding image graph.

The optimal structure-sequence alignment between the structure model and the target sequence thus corresponds to finding in the image graph a maximum weighted subgraph that is isomorphic to the conformational graph. The weight is defined by the alignment score between vertices (stems) and edges (loops) in the conformational graph and their counterparts in the image graph. The subgraph isomorphism problem is NP-hard. Interestingly, the conformational graph for the RNA secondary structure is tree decomposable; efficient isomorphism algorithms are possible.

Definition 2.1 (24). Let G = (V, E) be a graph, where V is the set of vertices in G and E denotes the set of edges in G. A pair (T, X) is a tree decomposition of graph G if it satisfies the following conditions:

(1) T = (I, F) defines a tree, the sets of nodes and edges in T being I and F respectively,

(2) X = {X_i | i ∈ I, X_i ⊆ V}, and for every u ∈ V there exists i ∈ I such that u ∈ X_i,

(3) for every edge (u, v) ∈ E, there exists i ∈ I such that u ∈ X_i and v ∈ X_i,

(4) for all i, j, k ∈ I, if k is on the path that connects i and j in tree T, then X_i ∩ X_j ⊆ X_k.

The tree width of the tree decomposition (T, X) is defined as max_{i ∈ I} |X_i| - 1. The tree width of the graph G is the minimum tree width over all possible tree decompositions of G.
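For illustration, conditions (2)-(4) can be checked directly; note that condition (4) is equivalent to requiring, for each vertex, that the tree nodes whose bags contain it form a connected subtree, which is what the sketch below verifies. The data structures are hypothetical.

    def is_tree_decomposition(vertices, graph_edges, tree_adj, bags):
        """tree_adj: tree node -> set of neighboring tree nodes;
        bags: tree node i -> set X_i of graph vertices."""
        # (2) every graph vertex occurs in some bag
        if not all(any(u in X for X in bags.values()) for u in vertices):
            return False
        # (3) both endpoints of every edge share some bag
        if not all(any(u in X and v in X for X in bags.values())
                   for u, v in graph_edges):
            return False
        # (4) the bags containing each vertex induce a connected subtree
        for u in vertices:
            nodes = {i for i, X in bags.items() if u in X}
            seen, stack = set(), [next(iter(nodes))]
            while stack:
                i = stack.pop()
                seen.add(i)
                stack += [j for j in tree_adj[i] if j in nodes and j not in seen]
            if seen != nodes:
                return False
        return True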

Figure 2 provides an example for a tree decomposition of a given graph. Tree decomposition is a technique rooted in the deep graph minor theorems 24; it provides a topological view on graphs. Tree width of a graph measures how much the graph is "tree-like". Conformational graphs for the RNA secondary structure have small tree width. For example,



Fig. 1. (a) An RNA structure that contains both nested and parallel stems. (b) The corresponding conformational graph. (c) A secondary structure (top), and the mapped regions and images for its stems on the target sequence (bottom); the dashed lines specify the possible mappings between stems and their images. (d) The image graph formed by the images of the stems on a target sequence: (il_1, ir_1) and (jl_1, jr_1) for stem 1, and (il_2, ir_2) and (jl_2, jr_2) for stem 2.


Fig. 2. (a) An example of a graph. (b) A tree decomposition for the graph in (a).

the tree width is 2 for the graph of any pseudoknot-free RNA and it can only increase slightly for all known pseudoknot structures 25. For instance, the conformational graph shown in Figure 5 for sophisticated bacterial tmRNAs has tree width 5.

We showed in our previous work 25 that, given a tree decomposition of the conformational graph with tree width t, the maximum weighted subgraph isomorphism can be efficiently found in time O(k^t N^2), where N is the length of the structure model and k is the maximum number of images that a stem can map to.

2.2. Automated Structure Filter

We observe that any subtree in a tree decomposition of a conformational graph induces a substructure and is thus a structure profile of smaller size. It can be used as a filter to preprocess a genome to be annotated. In particular, the left and right regions of any stem s_i in an RNA structure have two corresponding vertices v_i^l and v_i^r in its conformational graph.


In the tree decomposition of the conformational graph, these two vertices induce a maximal connected subtree T_i, in which every node contains either of the two vertices. We choose subtrees with this maximal property since each of them contains the maximum amount of structural information associated with the stem. This also ensures that, when the RNA structure contains a simple pseudoknot, the pseudoknot will be included in some filter.

In this way, we can obtain up to O(N) such subtrees, where N is the size of the conformational graph. However, the subtrees may intersect, and it is more desirable to select a set of disjoint subtrees to preprocess the genome. For this, we construct a filter graph as follows: each vertex represents a maximal subtree as defined above, and two vertices are connected with an edge if the corresponding subtrees intersect. Figure 3 shows an example of the filter graph of a given RNA structure.

We associate with every vertex in the filter graph a weight, which is the filtration ratio of the filter resulting from the corresponding subtree. The filtration ratio of a filter is defined as the percentage of nucleotides that pass the corresponding filtration process, and it is obtained as follows. For each filter, we randomly generate a sequence of sufficient length and compute the distribution of the scores of alignments between the filter profile and all the sequence segments in the generated sequence. For a filter with filtration ratio f, we assign a weight of -ln f to its corresponding vertex. To achieve a minimum filtration ratio, we need to find the maximum weighted independent set in the filter graph. We show in the following that this independent set can be found easily. According to 10, the filter graph constructed from a tree decomposition is actually a chordal graph, in which any cycle of length larger than 3 contains a chord. Moreover, for any chordal graph there exists a tree decomposition such that the vertices contained in every tree node induce a clique, and this tree decomposition can be found in time O(|V|^2), where V is the vertex set of the chordal graph 9. Given such a tree decomposition, a simple dynamic programming algorithm can be developed to find the maximum weight independent set.

Theorem 2.1. For an RNA secondary structure that contains n stems, there exists an algorithm of time O(n^2) that can select a set of disjoint filters with the maximum filtration ratio.
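A sketch of the selection step follows. It assigns each candidate filter the weight -ln f and picks the disjoint subset of maximum total weight; since at most eleven filters arise for the structures tested here (Tables 1 and 4), plain enumeration is used in place of the O(n^2) chordal-graph algorithm of Theorem 2.1, which it merely illustrates.

    import math
    from itertools import combinations

    def select_filters(ratios, intersects):
        """ratios[i]: filtration ratio f of filter i (0 < f < 1);
        intersects(i, j): True if the subtrees of filters i and j share
        a tree node. Returns the indices of the chosen filters."""
        n = len(ratios)
        w = [-math.log(f) for f in ratios]      # smaller f => larger weight
        best, best_w = (), 0.0
        for r in range(1, n + 1):
            for s in combinations(range(n), r):
                if any(intersects(i, j) for i, j in combinations(s, 2)):
                    continue                    # subtrees not disjoint
                total = sum(w[i] for i in s)
                if total > best_w:
                    best, best_w = s, total
        return list(best)

Because the weights are -ln f, maximizing the total weight minimizes the product of the filtration ratios of the selected filters, i.e., the overall fraction of the genome that survives filtration.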

2.3. Filter-Sequence Alignment

For a given filter F, the vertices contained in the tree bags of its corresponding subtree induce a subgraph of the conformational graph; this induced subgraph is its filter conformational graph. An alignment between a structural filter profile and a target sequence is essentially an isomorphism between the filter conformational graph H and some subgraph of the image graph G for the target sequence. To find such an isomorphism, we adopt the general dynamic programming technique 1 over the tree decomposition of H. However, since the general technique can only be directly applied to subgraph isomorphism on a small fixed graph H and a graph G of small tree width 17, we introduce some additional techniques to solve the problem in our setting. We present a summary and some details of the new optimal alignment algorithm in the following.

The dynamic programming over the tree decomposition to find an optimal alignment is based on the maintenance of a dynamic programming table for each node in the tree. An entry in a table includes a possible combination of images of vertices in the corresponding tree node and the validity and partial optimal alignment score associated with the combination. The table thus contains a column allocated for each vertex in the node and two additional columns V and S to maintain validity and partial optimal alignment scores respectively.

In a bottom-up fashion, the algorithm first fills the entries in the tables for all leaf nodes. Specifically, for vertices in a leaf node, a combination of their images is valid if the corresponding mapping satisfies the first two conditions for isomorphism (see section 2), and the partial optimal alignment score for a valid combination is the sum of the alignment scores of loops and stems induced by images of vertices that are only contained in the node. For an internal node X_i in the tree, without loss of generality, we assume X_j and X_k are its children nodes. For a given combination e_i of images of vertices in X_i, the algorithm checks the first two conditions for isomorphism (section 2 in 25) and sets e_i to be


[Figure: the filter subtrees, labeled with the stem vertex pairs they are built from, e.g. filter (u, v), filter (a, b), and filter (c, d).]

Fig. 3. (a) The conformational graph for a secondary structure that includes a pseudoknot. (b) A tree decomposition for the graph in (a). (c) A filter graph for the secondary structure in (a). (d) Substructures of the filters.


Fig. 4. A sketch of the dynamic programming approach for optimal alignments. The algorithm maintains a dynamic programming table in each tree node. Starting with the leaves of the tree, the algorithm proceeds in a bottom-up fashion. In computing the table for a parent node, only combinations of the images of the vertices in the node are considered. In every such combination, only one locally best combination (computed in the children tables) is used for vertices that occur in the children nodes but not in the parent node.

invalid if one of them is not satisfied. Otherwise, the algorithm queries the tables for X_j and X_k; e_i is set to be valid if and only if there exist valid entries e_j and e_k from the tables of X_j and X_k such that e_j and e_k have the same assignments of images as e_i for the vertices in X_i ∩ X_j and X_i ∩ X_k respectively. The partial optimal alignment score for a valid entry e_i includes the alignment scores of stems and loops induced by images of vertices only in X_i, plus the maximum partial alignment scores over all valid entries e_j and e_k whose assignments of images for the vertices in X_i ∩ X_j and X_i ∩ X_k agree with those of e_i, in the tables for X_j and X_k respectively. Figure 4 provides an example of the overall algorithm.
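The recursion can be summarized in code. The sketch below is schematic: bags, candidate images, and the scoring callback are placeholders; the isomorphism validity checks are folded into local_score returning None for invalid combinations; and local_score is assumed to score only the stems and loops whose vertices appear at the current node for the first time, so nothing is counted twice.

    from itertools import product

    def assignments(vertices, images):
        for combo in product(*(images[v] for v in vertices)):
            yield dict(zip(vertices, combo))

    def table(node, children, bag, images, local_score):
        """Bottom-up DP: returns {assignment -> best partial score}."""
        child_tables = [table(c, children, bag, images, local_score)
                        for c in children(node)]
        result = {}
        for assign in assignments(bag(node), images):
            score = local_score(node, assign)
            if score is None:               # fails the isomorphism checks
                continue
            for child, ctab in zip(children(node), child_tables):
                shared = set(bag(node)) & set(bag(child))
                # best child entry agreeing with 'assign' on shared vertices
                cand = [s for a, s in ctab.items()
                        if all(dict(a)[v] == assign[v] for v in shared)]
                if not cand:
                    score = None            # no valid extension below
                    break
                score += max(cand)
            if score is not None:
                result[tuple(sorted(assign.items()))] = score
        return result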

The alignment score is the sum of the scores for aligning the individual stems and loops in the structure profile. The alignment score for a stem is calculated between the stem profile and a chosen image of the stem in the target. Since any loop in the structure lies between two stems, the alignment score for a loop is calculated between its profile and the sequence segment in the target delimited by the two chosen images of those stems. The time complexity of this dynamic programming approach is O(k^t N^2), where k is the number of images for each vertex in the conformational graph, t is the tree width of its tree decomposition, and N is its number of vertices.

3. EXPERIMENTAL RESULTS

We performed experiments to test the accuracy and efficiency of this filtration based approach and compared it with the original tree decomposition based program. The training data was obtained from the Rfam database 12. For each family, we chose up to 60 sequences with pairwise identities lower than 80% from the structural alignment of seed sequences.

In practice, to obtain a reasonably small value for the parameter k, the upper bound on the number of images that a stem can map to, we constrain the images of a stem to a certain region, called the constrained image region of the stem, in the target sequence. We assume that, for homologous sequences, the distances from the pairing regions of a given stem to the 3' end follow a Gaussian distribution. For each stem, we compute the mean and standard deviation of the distances from its two pairing regions to the 3' end of the sequence, evaluated over all training sequences. For training data representing distant homologs of an RNA family with structural variability, we can effectively divide the data into groups so that a different but related profile can be built for each group and used for searches. This ensures a small value for the parameter k in the models.
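A sketch of how a constrained image region could be derived from the training data follows; the ±3σ window width is an illustrative choice, not a value stated in the paper.

    import statistics

    def constrained_region(distances, n_sigma=3.0):
        """distances: 3'-end distances of one pairing region of a stem,
        measured over all training sequences. Returns the (lo, hi)
        window of the target sequence in which images are considered."""
        mu = statistics.mean(distances)
        sd = statistics.stdev(distances)
        return mu - n_sigma * sd, mu + n_sigma * sd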

As a first profiling and searching experiment, we inserted several RNA sequences from the same family into a random background generated with the same base composition as the sequences in the family. We then used this filtration based approach and the original tree decomposition based program to search for the inserted sequences. We compared the sensitivity and specificity of both approaches on several different RNA families. Finally, we tested the performance of our approach by searching for non-coding RNA genes in real biological genomes.

3.1. On Pseudoknot-Free Structures

We implemented this filter selection algorithm and combined it with our tree decomposition based searching tool to improve searching efficiency. To test its accuracy and computational efficiency, we used this program to search for about 30 pseudoknot-free RNA structures inserted in a random background of 10^5 nucleotides generated with the same base composition as the RNA structure. In particular, we computed the filtration ratio of each selected filter with a random sequence of 10,000 nucleotides, generated with the same base composition as that of the sequence to be searched. The statistical distribution of alignment scores for each filter and for the overall structural profile is determined on the same sequence, using a method similar to that used by RSEARCH 13. To improve the computational efficiency, we determine the maximum size of the substructure for each filter; a window with a size that is about 1.2 times this value is used for searching while this filter is used.

The order in which the selected filters are applied is critical to the performance of the search. However, the number of possible orders for l selected filters is up to l!, so we are unable to exhaustively search through all possible orders to find the best one. In practice, we developed a heuristic method to determine the order of the filters, considering both the filtration ratio and the computation time of each filter. With each selected filter we associate the value T/(-ln f), where f is its measured filtration ratio and T is the computation time needed for the filter to scan the testing sequence. We then apply the structural profiles of the filters to scan the target sequence in increasing order of this value.
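Assuming the ranking value is T/(-ln f), i.e., scan time paid per unit of filtration power (the original formula is partially garbled, so this reconstruction is an assumption), ordering the filters reduces to a single sort; cheaper and more selective filters then run first.

    import math

    def order_filters(filters):
        """filters: list of (name, f, T) tuples with filtration ratio f
        (0 < f < 1) and measured scan time T. Returns the scan order."""
        return sorted(filters, key=lambda nft: nft[2] / (-math.log(nft[1])))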


A sequence segment passes the screening of a filter if its corresponding alignment Z-score is larger than 2.0. For the final processing, we use the original tree decomposition based algorithm on the remaining sequence segments; an alignment Z-score larger than 5.0 is reported as a hit. In our experiments, for each stem, the algorithm selects the k images with the maximum alignment scores within the constrained image region of the stem. In order to evaluate the impact of the parameter k on the accuracy of the algorithm, we carried out the same searching experiments for each given k. Table 1 shows the number of filters selected for each tested structure and the filtration ratio of the filter that is first applied to scan the genome.
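The two-stage screening cascade can be sketched as follows; the scoring callables and the score statistics (estimated on random sequences of like composition, as described above) are placeholders.

    def zscore(score, mean, std):
        return (score - mean) / std

    def screen(segments, ordered_filters, full_profile, stats):
        """ordered_filters: scoring callables in their scan order;
        stats[f]: (mean, std) of f's alignment scores on the random
        testing sequence. Returns the segments reported as hits."""
        surviving = segments
        for f in ordered_filters:
            m, s = stats[f]
            surviving = [seg for seg in surviving
                         if zscore(f(seg), m, s) > 2.0]
        m, s = stats[full_profile]
        return [seg for seg in surviving
                if zscore(full_profile(seg), m, s) > 5.0]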

Table 2 shows that on the tested RNA families, the filtration based approach achieves the same or better searching accuracy than the original approach. In particular, a significant improvement in specificity is observed on a few tested families. As Table 3 shows, compared to the original approach, the filtration based approach consumes a significantly reduced amount of computation time. On most of the tested families, the filtration based search is more than 30 times faster than our original approach.

3.2. On Pseudoknot Structures

We also performed searching experiments on several RNA families that contain pseudoknot structures. For each family, we inserted about 30 structures that contain pseudoknots into a background randomly generated with the same base composition as that of the inserted sequences. The training data was also obtained from the Rfam database 12, where we selected up to 40 sequences with pairwise identity lower than 80% from the seed alignment of each family.

For each tested pseudoknot structure, the filtration ratio of the first filter that is applied to scan the genome is shown in Table 4. Tables 5 and 6 compare the searching accuracy and efficiency between the filtration based approach and the original one. It is evident that on families with pseudoknots, the filtration based algorithm achieves the same accuracy as that of the original tree decomposition based algorithm when the parameter k reaches a value of 7. In addition, the filtration based approach is more than 20 times faster than the original approach on most of the tested pseudoknot structures.

3.3. On Biological Genomes

We used the program to search biological genomes for structural patterns that contain pseudoknots: corona virus genomes, tmRNA, and telomerase RNAs. For example, the secondary structure formed by nucleotides in the 3' untranslated region in the genomes of the corona virus family contains a pseudoknot, which was recently shown to play important roles in the replication of the viruses in the family 11. We selected four genomes from the corona virus family and used the algorithm to search for this pseudoknot. For bacteria, the tmRNA is essential for the trans-translation process and is responsible for adding a new C-terminal peptide tag to the incomplete protein product of a broken mRNA 18. The secondary structure of tmRNA contains four pseudoknots; Figure 5 provides a sketch of the stems that constitute the secondary structure of a tmRNA. The tree decomposition based algorithm was also used to search for tmRNA genes in the genomes of two bacterial organisms, Haemophilus influenzae and Neisseria meningitidis, both of which contain more than 10^6 nucleotides. Among the bacteria containing tmRNAs, these two are evolutionarily relatively distant from each other. To test the accuracy and efficiency of the algorithm on genomes of significantly larger size, we used the algorithm to search for the telomerase RNA gene in the genomes of two yeast organisms, Saccharomyces cerevisiae and Saccharomyces bayanus, both of which contain more than 10^7 nucleotides. Telomerase RNA is responsible for the addition of specific simple sequences onto the chromosome ends 5.

The parameter k used in the tree decomposition based algorithm for searching all genomes is 7. Table 4 also shows the filtration ratio of the first applied filter obtained for different values of k for each pseudoknot structure. Table 7 provides the real locations of the searched patterns and the offsets of the identified locations from the real annotated locations, for the filtration based and the original approaches respectively. The table clearly shows that, compared with the original approach, the filtration based approach is able to achieve the same accuracy with a significantly reduced amount of computation time. Both programs achieve 100% sensitivity and specificity for the searches in genomes. The table also shows that on real biological genomes, the selected filter sets can effectively screen out the parts of the genome that do not contain the desired structures and thus improve the searching efficiency.


Table 1. The number of filters selected on tested pseudoknot-free structures. For each structure, the filtration ratio of the first filter used to scan the genome is also shown.

RNA       Number of          Filtration Ratio
          Selected Filters   k = 6    k = 7    k = 8
EC        1                  0.147    0.084    0.084
EO        1                  0.082    0.049    0.049
Let-7     2                  0.110    0.074    0.055
Lin-4     3                  0.045    0.030    0.030
Purine    1                  0.042    0.042    0.021
SECIS     1                  0.089    0.036    0.036
S-box     3                  0.189    0.189    0.189
TTL       2                  0.093    0.056    0.056

EC, EO and TTL represent Entero_CRE, Entero_OriR, and Tymo_tRNA-like respectively.

Table 2. A comparison of the searching accuracy of the filtration based approach and the original tree decomposition based program, in terms of sensitivity and specificity.

          Without Filtration                            With Filtration
          k = 6         k = 7         k = 8             k = 6         k = 7         k = 8
RNA       SE     SP     SE     SP     SE     SP         SE     SP     SE     SP     SE     SP
EC        100    80.65  100    80.65  100    80.65      100    91.18  100    93.93  100    96.87
EO        100    100    100    100    100    100        100    100    100    100    100    100
Let-7     95.8   100    100    100    100    100        95.8   100    100    100    100    100
Lin-4     100    94.11  100    94.11  100    94.11      100    100    100    100    100    100
Purine    93.10  96.43  93.10  96.43  93.10  96.43      93.10  96.43  93.10  100    93.10  100
SECIS     100    97.30  100    97.30  100    97.30      100    97.30  100    97.30  100    97.30
S-box     100    92.86  100    96.30  100    96.30      100    96.30  100    100    100    100
TTL       100    96.67  100    96.67  100    96.67      100    96.67  100    96.67  100    96.67

SE and SP are sensitivity and specificity in percentage respectively.

Table 3. The computation time for both approaches on all pseudoknot-free RNA families.

          Without Filtration       With Filtration
          k = 6  k = 7  k = 8      k = 6            k = 7            k = 8
RNA       RT     RT     RT         RT     SU        RT     SU        RT     SU
EC        2.85   3.21   3.38       0.07   40.71x    0.08   40.13x    0.11   30.73x
EO        4.91   5.26   5.42       0.17   28.88x    0.23   22.87x    0.27   20.07x
Let-7     14.97  16.38  16.92      0.24   62.38x    0.31   52.84x    0.34   49.76x
Lin-4     3.22   4.25   5.10       0.11   29.27x    0.14   30.36x    0.16   31.87x
Purine    7.09   8.49   9.61       0.25   28.36x    0.33   25.72x    0.38   25.29x
SECIS     9.14   10.23  10.89      0.15   60.94x    0.20   51.15x    0.23   39.73x
S-box     29.76  34.76  41.01      1.22   24.39x    1.71   20.33x    1.81   22.65x
TTL       5.01   6.10   7.07       0.20   25.05x    0.24   25.42x    0.30   23.57x

RT is the computation time in minutes; SU is the amount of speed up compared to the original approach.

4. CONCLUSIONS

In this paper, we develop a new approach to improve the computational efficiency of annotating non-coding RNAs in biological genomes. Based on the graph theoretical profiling model proposed in our previous work, we develop a new filtration model that uses subtrees in a tree decomposition of the conformational graph as filters. This new filtering approach can be used to search genomes for structures



Table 4. The number of filters selected on tested pseudoknot structures. For each structure, the filtration ratio of the first filter used to scan the genome is also shown.

RNA             Number of          Filtration Ratio
                Selected Filters   k = 6    k = 7    k = 8
Alpha-RBS       3                  0.095    0.071    0.071
Antizyme-FSE    1                  0.078    0.066    0.042
HDV_ribozyme    3                  0.030    0.030    0.010
IFN-gamma       5                  0.069    0.035    0.035
Tombus_3_IV     3                  0.067    0.048    0.048
corona_pk3      1                  0.028    0.014    0.014
PK3             1                  0.027    0.013    0.013
tmRNA           11                 0.220    0.220    0.070
Telomerase      2                  0.130    0.130    0.130

Table 5. The search sensitivity (SE) and specificity (SP) of both the filtration based and original approaches on RNA sequences containing pseudoknots.

                Without Filtration                          With Filtration
                k = 6        k = 7        k = 8             k = 6        k = 7        k = 8
RNA             SE     SP    SE     SP    SE     SP         SE     SP    SE     SP    SE     SP
Alpha-RBS       95.80  92.00 100    96.00 100    96.00      95.80  96.0  100    96.0  100    96.0
Antizyme-FSE    96.43  100   100    100   100    100        92.86  100   100    100   100    100
HDV_ribozyme    100    97.37 100    97.37 100    97.37      100    97.37 100    97.37 100    97.37
IFN-gamma       100    100   100    100   100    100        90     100   100    100   100    100
Tombus_3_IV     100    100   100    100   100    100        100    100   100    100   100    100
corona_pk3      100    97.37 100    97.37 100    97.37      97.30  100   100    100   100    100

Table 6. The computation performance of both searching algorithms on RNA families that contain pseudoknots.

                Without Filtration       With Filtration
                k = 6  k = 7  k = 8      k = 6             k = 7             k = 8
RNA             RT     RT     RT         RT      SU        RT      SU        RT      SU
Alpha-RBS       0.31   0.42   0.55       0.02    15.50x    0.03    14.00x    0.05    11.00x
Antizyme-FSE    0.13   0.18   0.23       0.003   43.33x    0.004   45.00x    0.006   38.33x
HDV_ribozyme    0.34   0.52   0.79       0.01    34.00x    0.02    26.00x    0.03    26.33x
IFN-gamma       0.72   1.07   1.52       0.04    18.00x    0.05    21.40x    0.06    25.33x
Tombus_3_IV     0.27   0.40   0.57       0.01    27.00x    0.03    13.33x    0.05    11.40x
corona_pk3      0.15   0.20   0.26       0.005   30.00x    0.007   28.57x    0.01    26.00x

RT is the computation time in hours; SU is the amount of speed up compared to the original approach.


Fig. 5. Diagram of the stems in the secondary structure of a tmRNA. Upper case letters indicate base regions that pair with the corresponding lower case letters. The four pseudoknots constitute the central part of the tmRNA gene and are labeled PK1, PK2, PK3, and PK4 respectively.

containing pseudoknots with high accuracy. Compared to the original method, a significant amount of speed up is also achieved. More importantly, this filtering method allows us to apply more sophisticated sequence-structure alignment algorithms on the remaining portions of the genome. For example, we


Table 7. A comparison of the accuracy and efficiency of both algorithms on searching biological genomes.

              Without Filtration     With Filtration                 Real location
OR    ncRNA   L    R    RT           L    R    RT      SU            Left      Right     GL
BCV   3'PK    0    0    0.053        0    0    0.008   6.63x         30798     30859     0.31
MHV   3'PK    0    0    0.053        0    0    0.007   7.57x         31092     31153     0.31
PDV   3'PK    0    0    0.048        0    0    0.004   12.00x        27802     27882     0.28
HCV   3'PK    0    0    0.047        0    0    0.006   7.83x         27063     27125     0.27
HI    tmRNA   -1   -1   44.0         -1   -1   0.32    137.50x       472210    472575    18.3
NM    tmRNA   0    0    52.9         0    0    0.37    142.97x       1241197   1241559   22.0
SC    TLRNA   -3   -1   492.3        -3   -1   8.74    56.33x        307691    308430    103.3
SB    TLRNA   -3   2    550.2        -3   2    9.28    59.29x        7121532   7122282   114.8

OR is the name of the organism; GL is the length of the genome in multiples of 10^5 nucleotides. BCV is Bovine corona virus; MHV is Murine hepatitis virus; PDV is Porcine diarrhea virus; HCV is Human corona virus; HI and NM represent Haemophilus influenzae and Neisseria meningitidis respectively, and SC and SB represent Saccharomyces cerevisiae and Saccharomyces bayanus respectively. L and R are the left and right offsets of the resulting locations compared to the real locations. RT is the single-CPU time needed to identify the ncRNA, in hours. For tmRNA and telomerase RNA searches, RT was estimated from the time needed by a parallel search with 16 processors. SU is the amount of speed up compared to the original approach.

are able to search for remote homologs of a sequence family using a few alternative profiling models for each stem or loop. This approach can be used to find remote homologs with unknown secondary structure.

References

1. S. Arnborg and A. Proskurowski, "Linear time algorithms for NP-hard problems restricted to partial k-trees.", Discrete Applied Mathematics, 23: 11-24, 1989.

2. V. Bafna and S. Zhang, "FastR: Fast database search tool for non-coding RNA.", Proceedings of the 3rd IEEE Computational Systems Bioinformatics Conference, 52-61, 2004.

3. M. Brown and C. Wilson, "RNA Pseudoknot Modeling Using Intersections of Stochastic Context Free Grammars with Applications to Database Search.", Pacific Symposium on Biocomputing, 109-125, 1995.

4. L. Cai, R. Malmberg, and Y. Wu, "Stochastic Modeling of Pseudoknot Structures: A Grammatical Approach.", Bioinformatics, 19: i66-i73, 2003.

5. A. T. Dandjinou, N. Levesque, S. Larose, J. Lucier, S. A. Elela, and R. J. Wellinger, "A Phylogenetically Based Secondary Structure for the Yeast Telomerase RNA.", Current Biology, 14: 1148-1158, 2004.

6. S. Eddy and R. Durbin, "RNA sequence analysis using covariance models.", Nucleic Acids Research, 22: 2079-2088, 1994.

7. D. N. Frank and N. R. Pace, "Ribonuclease P: unity and diversity in a tRNA processing ribozyme.", Annu Rev Biochem., 67: 153-180, 1998.

8. D. Gautheret and A. Lambert, "Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles.", Journal of Molecular Biology, 313: 1003-1011, 2001.

9. F. Gavril, "Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph.", SIAM Journal on Computing, 1: 180-187, 1972.

10. F. Gavril, "The intersection graphs of subtrees in trees are exactly the chordal graphs.", Journal of Combinatorial Theory Series B, 16: 47-56, 1974.

11. S. J. Goebel, B. Hsue, T. F. Dombrowski, and P. S. Masters, "Characterization of the RNA components of a Putative Molecular Switch in the 3' Untranslated Region of the Murine Coronavirus Genome.", Journal of Virology, 78: 669-682, 2004.

12. S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, and S. R. Eddy, "Rfam: an RNA family database.", Nucleic Acids Research, 31: 439-441, 2003.

13. R. J. Klein and S. R. Eddy, "RSEARCH: Finding Homologs of Single Structured RNA Sequences.", BMC Bioinformatics, 4:44, 2003.

14. A. Krogh, M. Brown, I.S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov models in computational biology: Applications to protein modeling.", J. Molecular Biology, 235: 1501-1531, 1994.

15. C. Liu, Y. Song, R. Malmberg, and L. Cai, "Profiling and Searching for RNA Pseudoknot Structures in Genomes.", Lecture Notes in Computer Science, 3515: 968-975.

16. T. M. Lowe and S. R. Eddy, "tRNAscan-SE: A Program for Improved Detection of Transfer RNA genes in Genomic Sequence.", Nucleic Acids Research, 25: 955-964, 1997.

17. J. Matousek and R. Thomas, "On the complexity of finding iso- and other morphisms for partial k-trees.", Discrete Mathematics, 108: 343-364, 1992.

18. N. Nameki, B. Felden, J. F. Atkins, R. F. Gesteland, H. Himeno, and A. Muto, "Functional and structural analysis of a pseudoknot upstream of the tag-encoded sequence in E. coli tmRNA.", Journal of Molecular Biology, 286(3): 733-744, 1999.

19. V. T. Nguyen, T. Kiss, A. A. Michels, and O. Bensaude, "7SK small nuclear RNA binds to and inhibits the activity of CDK9/cyclin T complexes.", Nature, 414: 322-325, 2001.

20. E. Rivas and S. Eddy, "The language of RNA: a formal grammar that includes pseudoknots.", Bioinformatics, 16: 334-340, 2000.

21. E. Rivas and S. R. Eddy, "Noncoding RNA gene detection using comparative sequence analysis.", BMC Bioinformatics, 2:8, 2001.

22. E. Rivas, R. J. Klein, T. A. Jones, and S. R. Eddy, "Computational identification of noncoding RNAs in E. coli by comparative genomics.", Current Biology, 11: 1369-1373, 2001.

23. E. Rivas and S. R. Eddy, "A dynamic programming algorithm for RNA structure prediction including pseudoknots.", Journal of Molecular Biology, 285: 2053-2068, 1999.

24. N. Robertson and P. D. Seymour, "Graph Minors II. Algorithmic aspects of tree-width.", Journal of Algorithms, 7: 309-322, 1986.

25. Y. Song, C. Liu, R. L. Malmberg, F. Pan, and L. Cai, "Tree decomposition based fast search of RNA structures including pseudoknots in genomes.", Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, 223-224, 2005.

26. Y. Uemura, A. Hasegawa, Y. Kobayashi, and T. Yokomori, "Tree adjoining grammars for RNA structure prediction.", Theoretical Computer Science, 210: 277-303, 1999.

27. Z. Weinberg and W. L. Ruzzo, "Faster genome annotation of non-coding RNA families without loss of accuracy.", Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 243-251, 2004.

28. Z. Yang, Q. Zhu, K. Luo, and Q. Zhou, "The 7SK small nuclear RNA inhibits the Cdk9/cyclin T1 kinase to control transcription.", Nature, 414: 317-322, 2001.


THERMODYNAMIC MATCHERS: STRENGTHENING THE SIGNIFICANCE OF RNA FOLDING ENERGIES

T. Höchsmann*, M. Höchsmann and R. Giegerich

Faculty of Technology, Bielefeld University,

Bielefeld, Germany
Email: {thoechsm, mhoechsm, robert}@techfak.uni-bielefeld.de

Thermodynamic RNA secondary structure prediction is an important ingredient of the latest generation of tools for finding functional non-coding RNAs. However, the predicted energy is not strong enough by itself to distinguish a single functional non-coding RNA from other RNA. Here, we analyze how well an RNA molecule folds into a particular structural class with a restricted folding algorithm called a Thermodynamic Matcher (TDM), and we compare this energy value to that of randomized sequences. We construct and apply TDMs for the non-coding RNA families RNA I and hammerhead ribozyme type III; our results show that using TDMs rather than universal minimum free energy folding allows for highly significant predictions.

1. INTRODUCTION

In this section, we briefly discuss the state of the art in RNA gene prediction and classification, and give an outline of the new approach presented here.

1.1. RNA gene prediction and classification

The term "RNA genes" is defined, for the purpose of this article, as those RNA transcripts that are not translated to protein, but carry out some cellular function by themselves. Recent increased interest in the manifold regulatory functions of RNA have led to the characterization of close to 100 classes of functional RNA1' 2. These RNA regulators mostly exert their function via their tertiary structure.

RNA genes are more difficult to predict than protein coding genes for two reasons: (1) There is no signal such as an open reading frame, which would be a first necessary indicator of a coding region. (2) Comparative gene prediction approaches are difficult to apply, because sequence need not be preserved in order to preserve a functional structure. In fact, structure preservation in the presence of sequence variation is the best indicator of a potentially interesting piece of RNA 3, 4. This means that, in one way or another, structure must play an essential role in RNA gene prediction.

Whereas the full 3D structure of an RNA molecule currently cannot be computed, its 2D structure, the particular pattern of base pairs that form helices, bulges, hairpins, etc., can be determined by dynamic programming algorithms based on an elaborate thermodynamic model 5-7. Unfortunately, the minimum free energy (MFE) structure as defined and computed by this model is often weakly determined, and does not necessarily correspond to the functional structure in vivo. And of course, every single-stranded RNA molecule, be it functional or not, attains some structure.

However, if there is a functional structure, preserved by evolution, it should be well-defined, according to two criteria:

• Energy Criterion: The energy level of the MFE structure should be relatively low, to ensure that the structure is stable enough to execute a specific function.

• Uniqueness Criterion: The determined MFE structure should not be challenged by alternative foldings with similar free energy.

Much work has been invested in the Energy Criterion: Can we move a window along an RNA sequence, determine the MFE of the best local folding, and, where it is significantly lower than for a random sequence, hope for an RNA gene, because evolution has selected for a well-defined structure? Surprising first results were reported by Seffens & Digby8, indicating that mRNAs (where one would not even expect such an effect) had lower energies than random sequences of the same nucleotide composition. However, this finding was refuted by Workman & Krogh9, who showed that the effect goes away when considering randomized sequences with conserved dinucleotide composition. Rivas & Eddy10 studied the significance of local folding energies in detail, reporting two further caveats: First, local inhomogeneity of CG content can produce a seemingly strong signal. Second, the variance in MFE values is high, and a value of at least 4 standard deviations from the mean (a Z-score of 4) should be required before a particular value is considered an indicator of structural conservation. In the most recent work, Clote et al.11 studied several functional RNA families, comparing their MFE values against sequences of the same dinucleotide composition. They found that, on the one hand, there is a signal of smaller-than-random free energy, but on the other hand, it is not significant enough to be used for RNA gene prediction.

A weak signal can be amplified by using a comparative approach. Washietl et al.12 suggest that, by scanning several well-aligned sequences, significant Z-scores can be obtained. The tool RNAz3 is based on this idea. Of course, a good sequence alignment is not always available.

All in all, it has been determined that the Energy Criterion is not useless, but also not strong enough by itself to distinguish functional RNA genes from other RNA.

A first move to incorporate the Uniqueness Criterion has been suggested by Le et al.13. They compute scores based on energy differences: they compare the MFE value to the folding energy of a "restrained structure", which is defined by forbidding all base pairs observed in the MFE structure. This essentially partitions the folding space into two parts, taking the MFE structure within each part as the representative structure. This can be seen as a binary version of the shape representative structures defined by Giegerich et al.14. Just recently, the complete probabilistic analysis of abstract shapes of RNA has become possible15, which would allow us to base the Le et al. approach on probabilities derived from Boltzmann statistics. This appears to be a promising route to follow. Here, however, we take yet another road in the same direction.

1.2. Outline of the new approach

After gene prediction via the Energy Criterion, the next step is to analyze the candidate structure, in order to decide whether it is a potential member of a known functional class. The structural family models provided in Rfam16, 17 are typically used for this purpose. We suggest combining the second step with the first: We ask how well the molecule folds into a particular structural class, and compare this energy value to that of randomized sequences. We shall show that in this way we can obtain significant Z-scores. Note that this approach contains the earlier one as a special case: If the "particular class" holds all feasible structures, we are back at simple MFE folding. The Le et al. approach, by contrast, is not subsumed by this idea, as their partitioning is derived from the sequence at hand, while ours is set a priori.

The term Thermodynamic Matcher (TDM) has been suggested by Reeder et al.18 for an algorithm that folds an RNA sequence into a particular type of structure in the energetically most favorable way. This is similar to using covariance models based on stochastic context free grammars, but uses thermodynamics rather than statistics. A first example of a TDM was the program pknotsRG-enf, which folds an RNA sequence into the energetically best structure containing at least one pseudoknot somewhere.

Although the idea of specialized thermodynamic folding appears to be an attractive supplement to covariance models16, to our knowledge, no other TDMs have been reported. This is most likely due to the substantial programming effort incurred when implementing such specialized folding algorithms under the full energy model. However, these efforts are reduced by the technique of algebraic dynamic programming19, 20, which makes it possible to produce such a folding program, at least an executable draft, in one afternoon of work. Subsequent experimentation may be required to make the draft more specific, as explicated below. By this technique, we have been able to produce TDMs for nine RNA families so far, and our results show that using TDMs rather than universal MFE folding allows for highly significant predictions.

The same results as with our TDMs in this paper can be computed with RNAmotif21, using the free energy as score function. However, our motifs would match exponentially many structures for an input sequence, and the energy of every structure would have to be computed separately, resulting in exponential runtime.

1.3. Tree grammars

RNA secondary structure, excluding pseudoknots, can be formally defined with regular tree grammars15. Similar to context free string grammars, a set of rules, called productions, transforms non-terminal symbols into trees labeled with terminal and non-terminal symbols. Formally, a tree grammar G is a tuple (Σ, V, P, A), where Σ is a set of terminal symbols, V is a set of variables with Σ ∩ V = ∅, P is a production set, and A is a designated variable called the axiom. The language L(G) of a tree grammar G is the set of trees that do not contain variables and that can be derived by iteratively applying productions, starting with the axiom.

Figure 1 shows the tree grammar G_GF for RNA secondary structures. G_GF is a simplified version of the base grammar our TDMs are derived from, which is more complex and takes the latest energy model for RNA folding into account. We use G_GF to illustrate the basic concepts of TDMs. Note that the sequence of leaf nodes (in left-to-right order) of a tree T ∈ L(G) is the primary sequence of T. RNA structure prediction, like stochastic context free grammar approaches to aligning RNA structures, is the problem of computing an optimal derivation for a primary sequence.

Fig. 1. General folding grammar G_GF. The terminal symbol "base" denotes one of the nucleotides A, C, G, U, and "region" is a sequence of nucleotides; struct and comp are non-terminal symbols. The productions can be read as follows: an RNA secondary structure can be a single component or a component next to some other struct. A component is either a single stranded region (SS), or it is composed (AD) from stacking regions (SR) and loops (BR, BL, IL, ML), which can be arbitrarily nested and are terminated by a hairpin loop (HL).

Fig. 2. One possible derivation of the grammar G_GF for the sequence "CUCCGGCGCAG". Note that this is just one of many possible trees/structures.
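To make the grammar formalism concrete, the productions of such a tree grammar can be written down as plain data. The following minimal Python sketch renders the simplified grammar of Figure 1; the production shapes and the encoding are our own assumptions inferred from the figure caption, not the authors' implementation:

```python
# A plain-data rendering of the simplified folding grammar G_GF of Fig. 1.
# The production shapes are assumptions inferred from the caption; the
# labels (SS, SR, BL, BR, IL, ML, HL) follow the figure.
G_GF = {
    "axiom": "struct",
    "productions": {
        "struct": [("comp",), ("comp", "struct")],
        "comp": [
            ("SS", "region"),                    # single stranded region
            ("SR", "base", "comp", "base"),      # stacking region (one pair)
            ("BL", "region", "comp"),            # left bulge
            ("BR", "comp", "region"),            # right bulge
            ("IL", "region", "comp", "region"),  # internal loop
            ("ML", "comp", "comp"),              # multiloop (simplified)
            ("HL", "region"),                    # terminating hairpin loop
        ],
    },
}

# A restricted grammar (a matcher) would reuse these productions but
# replace the axiom with motif-specific rules, as in Figure 5 below.
```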

2. THERMODYNAMIC MATCHERS

The RNA folding problem means finding the energetically best folding for a given sequence under a certain model. Throughout this article, we consider the Zuker & Stiegler model, which describes the structure space and energy contributions for RNA secondary structures and is used in a wide range of folding routines6, 7, 15. As indicated above, the structure space for an RNA molecule can be defined with a tree grammar, and the folding problem becomes a parsing problem19, 20. We use this view and express (or restrict) folding spaces in terms of tree grammars, thereby obtaining thermodynamic matchers. The informal notion of a structural motif is formally modeled by a specialized tree grammar.

Let G be a grammar that describes the folding space for some structural motif, e.g. only those structures that have a tRNA-like hairpin structure. G typically differs from G_GF by the absence of some rules, while other rules may be duplicated and specialized. F_G denotes the structure space of the grammar G, in other words, all possible trees that can be derived from the grammar's axiom. A thermodynamic matcher TDM_G(s) is an algorithm that calculates the minimum free energy and the corresponding structure from the structure space F_G for a nucleotide sequence s. MFE_G(s) is the minimum free energy calculated by TDM_G(s). Since the same energy model is used, the minimum free energy of the restricted folding cannot be lower than that of the general folding; we always have MFE_G(s) ≥ MFE_GF(s). Note that it is not always possible to fold a sequence into a particular motif. In this case, the TDM returns an empty result.

2.1. Z-scores

A Z-score is the distance from the mean of a distribution, normalized by the standard deviation; mathematically, Z(x) = (x - μ)/σ, with μ being the mean and σ the standard deviation. Z-scores are useful for quantifying how far from normal a recorded value is. This concept has been applied to eliminate an effect that is well known for minimum free energy folding: the energy distribution is biased by the G/C content of a sequence, as well as by its length and dinucleotide composition.

To calculate the Z-score for a particular sequence, the distribution of MFE values for random sequences with the same dinucleotide composition must be known. The lower the Z-score, the lower is the energy compared to the energies of random sequences. Clote et al.11 observed that the Z-score distribution for RNA genes is lower than the Z-score distribution for random RNA. However, this difference is fairly small and only significant if the whole distribution is considered; it is not sufficient to distinguish an individual RNA gene from random RNA10. The reason for the insufficient significance of Z-scores is the combinatorics of RNA folding: there is often some structure in the complete search space that obtains a low energy.

Fig. 3. Z-score histogram for 10000 random sequences with a length of 100 nucleotides, for the two TDMs (G_RNAI, G_HH) and the general folding (G_GF).

Here, our aim is not the general prediction of non-coding RNA, but the detection of new members of a known, or at least defined, RNA family. By restricting the folding space, we can, as we demonstrate in Section 3, shift Z-scores for family members into a significant zone. Structures with MFE_GF = MFE_G for a grammar G get a lower Z-score, since the distribution of MFE_G for random RNA is shifted to higher energies. Even if this seems to hold for the grammars used in this paper, the effect of a folding space restriction on the energy distribution is not obvious. Clearly, the mean is shifted to more positive values, but the effect on the variance is not yet understood mathematically. Therefore, our applications must provide evidence that the Z-scores are affected in the desired way.

Let D_G(s) be the frequency distribution of MFE values for random sequences with the same dinucleotide frequency as s, i.e. the minimum free energy versus the fraction of sequences s' obtaining that energy with TDM_G(s'). Z_G(s) is the Z-score for a sequence s with respect to the distribution D_G(s).


The mean and the standard deviation can be determined by a sampling procedure. For our experiments, we generate 1000 random sequences preserving the dinucleotide frequencies of s.
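A minimal sketch of this sampling procedure, assuming two hypothetical helpers: mfe_fold (any general or restricted folding routine, e.g. a TDM) and shuffle (any dinucleotide-preserving randomization):

```python
import statistics

def z_score(seq, mfe_fold, shuffle, n_samples=1000):
    """Z-score of a sequence's MFE against dinucleotide-shuffled samples.

    mfe_fold(seq) -> float and shuffle(seq) -> str are assumed helpers;
    any folding routine (general or motif-restricted) and any
    dinucleotide-preserving shuffler can be plugged in.
    """
    samples = [mfe_fold(shuffle(seq)) for _ in range(n_samples)]
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return (mfe_fold(seq) - mu) / sigma
```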

The distribution of Z-scores for random RNA sequences is shown in Figure 3. Interestingly, a restriction of the folding space does not affect the Z-score distribution; at least this holds for the TDMs shown in this paper. For a reliable detection of RNA genes, a Z-score lower than -4 is needed10. Our experiments showed that over 99.98% of random RNAs have Z-scores greater than -4. To distinguish RNA genes from other RNA on a genomic scale, the threshold should be set to a Z-value such that the number of false predictions remains tractable.

2.2. Design and implementation

Designing a thermodynamic matcher means defining its structure space. On the one hand it must be large enough to support good sensitivity, and on the other hand it must be small enough to provide good specificity. A systematic analysis of the relation between a structure space restriction and its effect on the specificity and sensitivity of MFE based Z-scores is a subject of our current research.

The design of a TDM for an RNA gene requires a consensus structure. If an RNA family is listed in the Rfam database, the consensus shown there is a good starting point, at least its structural part. Alternatively, the consensus of known sequences can be obtained with programs that predict a common structure, like PMmulti22 and RNAcast23.

Fig. 4. Consensus structure for RNAI genes taken from the Rfam database.

We now exemplify the design of a TDM. For instance, we are interested in stable secondary structures that consist of three hairpin loops separated by single stranded regions, like the structures of RNAI genes shown in Figure 4. A specialized grammar for RNAI must only allow structures compatible with this motif. A simplified version of the grammar G_RNAI, which abstracts from length constraints for stems and loops, is given in Figure 5.

Fig. 5. Simplified version of the grammar G_RNAI. Reconsider the grammar in Figure 1: instead of an axiom that derives arbitrary RNA structures, the axiom motif derives three hairpin loops (hloop) connected by single stranded regions.

Since we want to demonstrate that, with a search space reduction, new members of an RNA family can be detected by their energy based Z-score, we do not incorporate explicit sequence constraints in a thermodynamic matcher other than those necessary to form the required base pairs. However, such constraints could easily be incorporated in our framework.

We use the algebraic dynamic programming (ADP) framework19 to turn RNA secondary structure space grammars into thermodynamic matchers. In the context of ADP, writing a grammar in a text based notation is equivalent to writing a dynamic programming structure prediction program. This approach is similar to using an engine for searching with regular expressions: there is no need to implement the search routines, it is only a matter of specifying the search results. A grammar, which constitutes the control structure of an unrestricted folding algorithm, is augmented by an evaluation algebra incorporating the established energy rules5. All TDMs share these rules; only the grammar changes.
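To illustrate the grammar/algebra separation that ADP provides, here is a toy sketch in Python. It uses a Nussinov-style base pair maximization in place of the full thermodynamic energy rules the authors plug in, so the scores are illustrative only; the point is that exchanging the algebra changes what the same grammar computes:

```python
from functools import lru_cache

def fold(seq, algebra):
    """Fold seq over a toy single-nonterminal grammar, scored by 'algebra'."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
             ("G", "U"), ("U", "G")}

    @lru_cache(maxsize=None)
    def best(i, j):                                    # subsequence seq[i:j]
        if i >= j:
            return algebra["empty"]
        cands = [algebra["ss"](best(i + 1, j))]        # base i unpaired
        for k in range(i + 1, j):                      # base i paired with k
            if (seq[i], seq[k]) in pairs:
                cands.append(algebra["pair"](best(i + 1, k), best(k + 1, j)))
        return algebra["choice"](cands)

    return best(0, len(seq))

# Toy "energy" algebra: -1 per base pair, minimized (a stand-in for the
# real thermodynamic rules); a counting algebra reuses the same grammar.
mfe_algebra = {"empty": 0, "ss": lambda x: x,
               "pair": lambda a, b: a + b - 1, "choice": min}
count_algebra = {"empty": 1, "ss": lambda x: x,
                 "pair": lambda a, b: a * b, "choice": sum}

print(fold("CUCCGGCGCAG", mfe_algebra))    # best toy score
print(fold("CUCCGGCGCAG", count_algebra))  # number of structures
```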

The time complexity of a TDM depends on the motif complexity. If multiloops are included, the runtime is O(n³), where n is the length of the sequence that is folded. Without multiloops, the time complexity is O(n²) if the size of bulges and loops is bounded by a constant. In both cases the memory consumption scales with O(n²).

3. RESULTS

We constructed TDMs for the non-coding RNA families RNAI and hammerhead type III ribozyme (HammerheadIII), taken from the Rfam database Version 7.016, 17. All TDMs used in this section utilize the complete energy model for RNA folding6 and therefore have more complex grammars than those presented to explain our method.

To assess whether TDMs can be used to find candidates for an RNA family, we searched for known members in genomic data. The known members are those from the Rfam seeds, which are experimentally validated. We apply our TDMs to genomes containing the seed sequences and measure the relation between Z-score threshold, sensitivity, and specificity. We define sensitivity as TP/(TP+FN) and specificity as TN/(TN+FP), where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
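In code, these definitions are a direct transcription (a trivial sketch):

```python
def sensitivity(tp, fn):
    # fraction of true family members that are recovered
    return tp / (tp + fn)

def specificity(tn, fp):
    # fraction of non-members that are correctly rejected
    return tn / (tn + fp)
```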

3.1. RNA I

Replication of ColE1 and related bacterial plasmids is initiated by a primer, the plasmid encoded RNAII transcript, which forms a hybrid with its template DNA. RNAI is a shorter plasmid-encoded RNA that acts as a kinetically controlled suppressor of replication and thus controls the plasmid copy number24. Sequences coding for RNAI fold into stable secondary structures, with Z-scores ranging from -3.6 to -6.7 (Table 1).

Table 1. Z-scores for the RNAI seed sequences computed with TDM_GGF and TDM_GRNAI.

EMBL Accession number    Z_GGF    Z_GRNAI
AF156893.2               -6.61    -7.31
X80302.1                 -4.88    -6.20
Y17716.1                 -5.74    -6.29
Y17846.1                 -5.06    -6.16
U80803.1                 -6.33    -6.84
D21263.1                 -3.96    -5.33
S42973.1                 -4.53    -5.82
U65460.1                 -6.73    -7.41
X63534.1                 -3.63    -5.41
AJ132618.1               -5.93    -6.71

The Rfam consensus structure consists of three adjacent hairpin loops connected by single stranded regions (Figure 4). Structures for this consensus are described by the grammar G_RNAI (Figure 5). If we allowed arbitrary stem lengths in our motif, all structures that consist of three adjoined hairpins would be favored by TDM_GRNAI. This has an undesired effect: it would be possible to fold a sequence that folds (with general folding) into a single low-energy hairpin into a structure with one long and two very short hairpins. Although the energy of the restricted folding is higher than the energy of the unrestricted folding, it would still obtain a good energy, resulting in a low Z-score. Clearly, such structures do not really resemble the structures of RNAI genes. As a refinement, each stem loop is restricted to a minimal length of 25 nucleotides, and the length of the complete structure is restricted to at most 100 nucleotides. These restrictions are compatible with the consensus of RNAI and increase the sensitivity and specificity of TDM_GRNAI. Sequences from the seed obtain Z_GRNAI values between -5.33 and -7.41 (Table 1). For random RNA, the frequency distribution of Z_GRNAI is similar to that of Z_GGF (see Figure 3). The Z_GRNAI score difference is large enough to distinguish RNAI genes from random RNA.

Fig. 6. TDM scan for RNAI in a plasmid of Klebsiella pneumoniae (EMBL Accession number AF156893). The known RNAI gene is located at position 4498, indicated by the dotted vertical line. (a) General folding (TDM_GGF): in steps of 5 nucleotides, the score Z_GGF is shown for the following 100 nucleotides and for their reverse complement; the Z-scores for both directions are drawn versus the same sequence position. The position where the known RNAI gene starts achieves a low Z-score, but there is another position with a lower Z-score (position ~1450) and positions with nearly as low scores (around position 750). (b) Restricted folding (TDM_GRNAI): the RNAI gene now clearly separates from all other positions. Sequences that fold into some unrelated stable structure are penalized because they cannot fold into a stable RNAI structure.

To verify whether RNAI genes can also be distinguished from genomic RNA, we applied our matcher to 10 plasmids that contain the seed sequences (one in each of them). The plasmid lengths range from 108 to 8193 nucleotides in this experiment; all plasmids together have a length of ~27500 nucleotides. For each plasmid, a 100 nucleotide long window was slid from 5' to 3' with a successive offset of 5, and Z_GRNAI was computed for every window. RNAI can be located on both strands of the plasmid; therefore, TDM_GRNAI was also applied to the reverse complement. Overall, this results in ~11000 Z_GRNAI scores. An RNAI sequence was counted as a positive hit if a window within 5 nucleotides to the left or right of the starting position of an RNAI gene has a Z-score equal to or lower than the current threshold. In this region, no negative hits are counted. Figure 6 shows the result for a plasmid of Klebsiella pneumoniae.
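A minimal sketch of this scanning procedure, assuming a scoring function tdm_z (for instance, Z_GRNAI built from the Z-score sketch above) and input given in the RNA alphabet; both assumptions are ours:

```python
def scan_plasmid(sequence, tdm_z, window=100, step=5):
    """Sliding-window TDM scan of both strands.

    tdm_z(window_seq) -> float is an assumed scorer; 'sequence' is
    assumed to be given in the RNA alphabet ACGU.
    """
    comp = str.maketrans("ACGU", "UGCA")
    scores = []
    for start in range(0, len(sequence) - window + 1, step):
        w = sequence[start:start + window]
        rc = w.translate(comp)[::-1]          # reverse-complement strand
        scores.append((start, tdm_z(w), tdm_z(rc)))
    return scores

# Positions whose score on either strand falls below a chosen threshold
# (e.g. -5, as in the text) would then be reported as candidate hits.
```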

It is also possible to use a complete sequence as input for a TDM. However, this will return the best substructure (or substructures) in terms of energy, which does not always correspond to the substructure with the lowest Z-score.

Fig. 7. Sensitivity and specificity versus the Z-score threshold. TDM_GRNAI improves sensitivity and specificity compared to TDM_GGF.

If we set the Z-score threshold to -5, we obtain for TDM_GRNAI a sensitivity of 100% and a specificity of 99.89%, which means 10 true positives and 12 false positives (over all plasmids). For TDM_GGF, we obtain only a sensitivity of 80% and a specificity of 99.10%, which means 8 true positives and 99 false positives; a threshold of -3.5 is required to find all RNAI genes of the seed, and the specificity in this case is 96.71%, resulting in 362 false positives (Figure 7). Although the differences in specificity appear small, they make a big difference in the number of false positives for genome wide applications.

3.2. Hammerhead ribozyme (type III)

The hammerhead ribozyme was originally discovered as a self-cleaving motif in viroids and satellite RNAs. These RNAs replicate using the rolling circle mechanism, which generates long multimeric replication intermediates; they use the cleavage reaction to resolve the multimeric intermediates into monomeric forms. The region able to self-cleave has three base paired helices connected by two conserved single stranded regions and a bulged nucleotide. Hammerhead type III ribozymes (HammerheadIII) form stable secondary structures, with Z-scores varying from -6 to -2 for general folding.

The seed sequences from the Rfam database vary in their length: 6 sequences have a length of around 80 nucleotides, while all other seed sequences are around 55 nucleotides long. To be able to use length constraints that are not too vague, we removed the 6 long sequences from our experiment. Thus, TDM_GHH is not designed to search for HammerheadIII candidates with a sequence length larger than 60 nucleotides.

Fig. 8. Consensus structure for hammerhead ribozyme type III genes taken from the Rfam database.

Grammar G_HH describes the folding space for the consensus structure shown in Figure 8. The maximal length of our motif is 60 nucleotides, and the single stranded region between the two stem loops in the multiloop has to be between 5 and 6 nucleotides long. The stem lengths are not explicitly restricted. TDM_GHH improves the distribution of Z-scores for the seed sequences (Figure 9).

Most sequences now obtain a Z-score smaller than -4, but some obtain a higher score. These sequences are only about 45 nucleotides long; they fold into two adjacent hairpin loops and do not form a multiloop with TDM_GGF. They are forced into our HammerheadIII motif with considerably higher free energy. If a family has many members, it might be necessary to consider subfamilies separately.

Fig. 9. Z-score distribution for 68 hammerhead ribozyme type III sequences.

We applied TDM_GHH to 59 viroid sequences with lengths of 290 to 475 nucleotides. HammerheadIII can be located on both strands of the DNA, and each sequence contains one or two HammerheadIII genes. A 60 nucleotide long window was slid from 5' to 3' with a successive offset of 2, and Z_GHH was computed for each window sequence and its reverse complement. Overall, this resulted in ~19500 scores. A HammerheadIII sequence was counted as a positive hit if a window within 3 nucleotides to the left or right of the starting position of a HammerheadIII gene has a Z-score equal to or lower than the current threshold. In this region, no negative hits are counted. The sensitivity and specificity depending on the Z-score threshold are shown in Figure 10. The sensitivity is improved significantly compared to TDM_GGF; however, the specificity is lower for Z-score thresholds smaller than -3, which is the relevant region. It turned out that many false positives with Z-values smaller than -4 may be true positives, which are not part of the Rfam seed but are predicted as new HammerheadIII candidate genes in Rfam. Figure 11 shows sensitivity and specificity if false positives that are candidate genes in Rfam are counted as true positives. All RNA candidate genes that are provided in Rfam achieve low Z-scores, as shown in Figure 12. Unlike Infernal16, which is used for the prediction of candidate family members in Rfam, we use pure thermodynamics rather than covariance based optimization. This gives further and independent evidence for the correctness of both predictions.

Fig. 10. Sensitivity and specificity versus the Z-score threshold. TDM_GHH improves sensitivity and specificity compared to TDM_GGF.

Fig. 11. Sensitivity and specificity versus the Z-score threshold, with candidates predicted by Rfam treated as positive hits. TDM_GHH improves sensitivity and specificity compared to TDM_GGF.

Fig. 12. Distribution of Z-scores for all 274 HammerheadIII gene and gene candidate sequences taken from the Rfam database, folded with G_GF and G_HH; the Z-score distribution for random sequences is shown for comparison.

4. DISCUSSION

Our observations regarding specialized folding spaces add to the current debate about the quality of thermodynamic prediction of RNA secondary structures. It is well known that the predicted MFE structure in most cases shares only a small number of base pairs with the structure determined from sources more reliable than MFE, such as compensatory base mutations. This is a consequence of the combinatorics of the RNA folding space, which provides many "good" foldings. Thus, MFE on its own cannot be used to discriminate non-coding RNAs. We demonstrated that, given a consensus structure for a family of non-coding RNA, a restriction of the folding space to this family prunes low energy foldings of non-coding RNA that does not belong to this family. The overlap of the Z-score distributions of MFE values for family members and non-family members can be reduced by our technique, resulting in a search technique with high sensitivity and specificity, called thermodynamic matching.

In our experiments for RNAI and the hammerhead type III ribozyme, we did not include restrictions other than size restrictions for parts of the structure. These matchers can be fine tuned and can also include sequence restrictions, which could further increase their sensitivity and specificity. It is also possible to include H-type pseudoknots in the motif using techniques presented in Ref. 18.


We demonstrated that a TDM can detect members of RNA families by scanning single sequences. It seems promising to extend the TDM approach to scan aligned sequences using a combined energy and covariance scoring in the spirit of RNAalifold12. This should further increase selectivity or, if this is not necessary, allow "looser" motif definitions.

A question that arises from our observations is: can our TDM approach be incorporated into a gene prediction strategy? If we guess a certain motif and find stable structures with significant Z-scores, these might be biologically relevant.

In a current research project, we focus on the systematic generation of TDMs for known RNA families from the Rfam database. We are also working on a graphical user interface that enables biologists to create their own TDMs without requiring knowledge of the underlying algebraic dynamic programming technique.

Besides the two RNA families shown here, we have implemented TDMs for 7 other non-coding RNA families, including transfer RNA, the microRNA precursor, and the Nanos 3' UTR translation control element. The results were consistent with our observations for RNAI and the hammerhead ribozyme given here, and will be used to further analyze the predictive power of thermodynamic matchers.

ACKNOWLEDGEMENTS

We thank Marc Rehmsmeier for helpful discussions and Michael Beckstette for comments on the manuscript.

References

1. A. F. Bompfünewerer, C. Flamm, C. Fried, G. Fritzsch, I. L. Hofacker, J. Lehmann, K. Missal, A. Mosig, B. Müller, S. J. Prohaska, B. M. R. Stadler, P. F. Stadler, A. Tanzer, S. Washietl, and C. Witwer, "Evolutionary patterns of non-coding RNAs," Theor. Biosci., vol. 123, pp. 301-369, 2005.

2. S. R. Eddy, "Non-coding RNA Genes and the Modern RNA World," Nature Reviews Genetics, vol. 2, pp. 919-929, 2001.

3. S. Washietl, I. L. Hofacker, and P. F. Stadler, "From The Cover: Fast and reliable prediction of noncoding RNAs," PNAS, vol. 102, no. 7, pp. 2454-2459, 2005.

4. E. Rivas and S. Eddy, "Noncoding RNA gene detection using comparative sequence analysis," BMC Bioinformatics, vol. 2, no. 1, p. 8, 2001.

5. D. H. Turner, N. Sugimoto, and S. M. Freier, "RNA Structure Prediction," Annual Review of Biophysics and Biophysical Chemistry, vol. 17, no. 1, pp. 167-192, 1988.

6. M. Zuker, "Mfold web server for nucleic acid folding and hybridization prediction," Nucl. Acids Res., vol. 31, no. 13, pp. 3406-3415, 2003.

7. I. L. Hofacker, "Vienna RNA secondary structure server," Nucl. Acids Res., vol. 31, no. 13, pp. 3429-3431, 2003.

8. W. Seffens and D. Digby, "mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences," Nucl. Acids Res., vol. 27, no. 7, pp. 1578-1584, 1999.

9. C. Workman and A. Krogh, "No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution," Nucl. Acids Res., vol. 27, no. 24, pp. 4816-4822, 1999.

10. E. Rivas and S. R. Eddy, "Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs," Bioinformatics, vol. 16, no. 7, pp. 583-605, 2000.

11. P. Clote, F. Ferré, E. Kranakis, and D. Krizanc, "Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency," RNA, vol. 11, no. 5, pp. 578-591, 2005.

12. S. Washietl and I. L. Hofacker, "Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics," J Mol Biol, vol. 342, pp. 19-30, 2004.

13. S.-Y. Le, J.-H. Chen, D. Konings, and J. V. Maizel, Jr., "Discovering well-ordered folding patterns in nucleotide sequences," Bioinformatics, vol. 19, no. 3, pp. 354-361, 2003.

14. R. Giegerich, B. Voss, and M. Rehmsmeier, "Abstract Shapes of RNA," Nucl. Acids Res., vol. 32, no. 16, pp. 4843-4851, 2004.

15. B. Voss, R. Giegerich, and M. Rehmsmeier, "Complete probabilistic analysis of RNA shapes," BMC Biology, vol. 4, no. 5, 2006.

16. S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, and S. R. Eddy, "Rfam: an RNA family database," Nucl. Acids Res., vol. 31, no. 1, pp. 439-441, 2003.

17. S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman, "Rfam: annotating non-coding RNAs in complete genomes," Nucl. Acids Res., vol. 33, no. suppl 1, pp. D121-124, 2005.

18. J. Reeder and R. Giegerich, "Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics," BMC Bioinformatics, vol. 5, no. 104, 2004.

19. R. Giegerich, C. Meyer, and P. Steffen, "A discipline of dynamic programming over sequence data," Science of Computer Programming, vol. 51, no. 3, pp. 215-263, 2004.

20. P. Steffen and R. Giegerich, "Versatile and declarative dynamic programming using pair algebras," BMC Bioinformatics, vol. 6, no. 224, 2005.

21. T. J. Macke, D. J. Ecker, R. R. Gutell, D. Gautheret, D. A. Case, and R. Sampath, "RNAMotif, an RNA secondary structure definition and search algorithm," Nucl. Acids Res., vol. 29, no. 22, pp. 4724-4735, 2001.

22. I. L. Hofacker, S. H. F. Bernhart, and P. F. Stadler, "Alignment of RNA Base Pairing Probability Matrices," Bioinformatics, vol. 20, pp. 2222-2227, 2004.

23. J. Reeder and R. Giegerich, "Consensus shapes: an alternative to the Sankoff algorithm for RNA consensus structure prediction," Bioinformatics, vol. 21, no. 17, pp. 3516-3523, 2005.

24. Y. Eguchi, T. Itoh, and J. Tomizawa, "Antisense RNA," Annu. Rev. Biochem., vol. 60, pp. 631-652, 1991.


PEM: A GENERAL STATISTICAL APPROACH FOR IDENTIFYING DIFFERENTIALLY EXPRESSED GENES IN TIME-COURSE CDNA MICROARRAY EXPERIMENT WITHOUT REPLICATE

Xu Han*

Genome Institute of Singapore, 60, Biopolis Street, Singapore 138672

*Email: [email protected]

Wing-Kin Sung

Genome Institute of Singapore, 60, Biopolis Street, Singapore 138672

School of Computing, National University of Singapore, Singapore 117543

Email: [email protected], [email protected]

Lin Feng

School of Computer Engineering, Nanyang Technological University, Singapore 637553

Email: [email protected]

Replication of time series in microarray experiments is costly. To analyze time series data with no replicates, many model-specific approaches have been proposed. However, they fail to identify the genes whose expression patterns do not fit the pre-defined models. Besides, modeling the temporal expression patterns is difficult when the dynamics of gene expression in the experiment is poorly understood. We propose a method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In the PEM method, we assume the gene expressions vary smoothly in the temporal domain. This assumption is comparatively weak, and hence the method is general enough to identify genes expressed in unexpected patterns. To identify the differentially expressed genes, a new statistic is developed by comparing the energies of two convoluted profiles. We further improve the statistic for microarray analysis by introducing the concept of partial energy. The PEM statistic is incorporated into the permutation based SAM framework for significance analysis. We evaluated the PEM method with an artificial dataset and two published time course cDNA microarray datasets on yeast. The experimental results show the robustness and the generality of the PEM method: it outperforms the previous versions of SAM and the spline based EDGE approaches in identifying genes of interest, which are differentially expressed in various manners.

Keywords: Time course, cDNA microarray, differentially expressed gene, PEM.

1. INTRODUCTION

Time-course cDNA microarray experiments are widely used to study cell dynamics from a genomic perspective and to discover the associated gene regulatory relationships. Identifying differentially expressed genes is an important step in time course microarray data analysis, selecting the biologically significant portion of the genes available in the dataset. A number of solutions have been proposed in the literature for this purpose.

When replicated time course microarray data is available, various statistical approaches, like ANOVA and its modifications, are employed (Lonnstedt & Speed, 2002; Park et al., 2003; Smyth, 2004). This category of approaches has been extended to recent work on longitudinally sampled data, where the microarray measurements span a multi-dimensional space with coordinates such as gene index, individual donor, and time point (Guo et al., 2003; Storey et al., 2005). However, replication of time series or longitudinal sampling is costly if the number of time points is comparatively large. For this reason, many published time course datasets have no replicates.

* Corresponding author.


When a replicated time course is not available, clustering based approaches and model-specific approaches are widely used.

Clustering based approaches select genes whose patterns are similar to each other. A famous example of clustering software is Eisen's Cluster (Eisen et al., 1998). Clustering based approaches are advantageous in finding co-expressed genes. The drawback is that clustering does not provide a ranking for individual genes, and it is difficult to determine a cut-off threshold based on confidence analysis. Additionally, cluster analysis may fail to detect changing genes that belong to clusters in which most genes do not change (Bar-Joseph et al., 2003).

Model-specific approaches identify differentially expressed genes based on prior knowledge of their temporal patterns. For instance, Spellman et al. (1998) used the Fourier transform to identify cell-cycle regulated genes; Peddada et al. (2003) proposed an order-restricted model to select responsive genes; Xu et al. (2002) developed a regression-based approach to identify the genes induced in a Huntington's disease transgenic model; and in the recent versions of SAM (Tusher et al., 2001), two alternative methods, slope based and signed area based, are provided for analyzing single time course data. However, the assumptions underlying the model-specific approaches are too strong, and some biologically informative genes that do not fit the predefined model may be ignored. Bar-Joseph et al. (2002) proposed a spline based approach, which is built on comparatively weaker assumptions. The EDGE software (Storey et al., 2005) implements natural cubic splines and polynomial splines for testing the statistical significance of genes. In spline based approaches, the dimension of the spline needs to be chosen carefully to balance the robustness and the diversity of gene patterns, and an empirical setting of the dimension may not be applicable for some applications.

The goal of this paper is to propose a new statistical method called PEM (Partial Energy ratio for Microarray) for the analysis of time course cDNA microarray data. In time-course experiments, the measurements are sampled from continuously varying gene expressions. Thus it is often observed that the log-ratio expression profiles of the differentially expressed genes feature "smooth" patterns, whose energies mainly concentrate in the low frequencies. To utilize this feature, we employ two simple convolution kernels that function as a low-pass filter and a high-pass filter, namely the smoothing kernel and the differential kernel, respectively. The basic statistic for testing the smoothness of a temporal pattern is the energy ratio of the convoluted profiles. We further improve the performance of the statistic for microarray analysis by introducing a concept called partial energy, which solves the problem caused by "steep edges", i.e. rapid increases or decreases of the gene expression level. The proposed ratio statistic is incorporated into the permutation based SAM (Tusher et al., 2001) framework for determining the confidence interval and the false discovery rate (Benjamini and Hochberg, 1995). In the SAM framework, a small positive constant called the "relative difference" is added to the denominator of the ratio, which efficiently stabilizes the variance of the proposed statistic.

An artificial dataset and two published cDNA microarray datasets are employed to evaluate our approach. The published datasets are the yeast environment response dataset (Gasch et al., 2000) and the yeast cell cycle dataset (Spellman et al., 1998). The experimental results show the robustness and generality of the proposed PEM method: it outperforms previous versions of SAM and the spline based EDGE in identifying genes differentially expressed in various manners. In the experiment with the yeast cell cycle dataset, the PEM method not only identified the periodically expressed genes, but also a set of non-periodically expressed genes, which are verified to be biologically informative.

2. METHOD

2.1 Signal/noise model for cDNA microarray data

Consider a two-channel cDNA time-course microarray experiment over m genes g_1, g_2, ..., g_m and n time points t_1, t_2, ..., t_n. The log-ratio expression profile of gene g_i (i = 1 to m) can be represented by X_i = [X_i(t_1), X_i(t_2), ..., X_i(t_n)]^T, where X_i(t_j) (j = 1 to n) is the log-ratio expression value of g_i at the jth time point.


We model the log-ratio expression profile X_i as the sum of its signal component S_i = [S_i(t_1), S_i(t_2), ..., S_i(t_n)]^T and its noise component ε_i = [ε_i(t_1), ε_i(t_2), ..., ε_i(t_n)]^T, i.e. X_i = S_i + ε_i. We have the following assumption on the noise component:

Assumption of noise: ε_i(t_1), ε_i(t_2), ..., ε_i(t_n) are independent random variables following a symmetric distribution with mean equal to zero.

Note that the noise distribution in our assumption is not necessarily normal; this gives a better model of the heavy-tailed symmetric noise distributions that are often observed in microarray log-ratio data.

For a non-differentially expressed gene g_i, we assume its expression signals in the two channels are identical at all time points. In this case, the signal component S_i is constantly zero, and the log-ratio expression profile X_i consists only of the noise component. Thus the null hypothesis is defined as follows:

H_0: X_i = ε_i

Due to the variation of populations in cDNA microarray experiments, there is a bias between the expression signals in the two channels. Thus the assumption underlying the null hypothesis may not hold if the log-ratios are calculated directly from the raw data. We suggest using pre-processing approaches such as Lowess regression to compensate for the global bias (Yang et al., 2002). To further overcome the influence of gene-specific bias, we adopt the SAM framework, in which a small positive constant called the "relative difference" is introduced to stabilize the variance of the statistic (Tusher et al., 2001). Nevertheless, the null hypothesis provides a mathematical foundation for the demonstration of our method.

2.2 Smoothing convolution and differential convolution

In time-course experiments, the measurements are sampled from continuously varying gene expressions. If there is an adequate number of sampled time points, the temporal pattern of the signal S_i will be comparatively smooth, so that the energy of S_i will concentrate in the low frequencies. To utilize this feature, we introduce two simple convolution kernels for time series data analysis, namely the smoothing kernel and the differential kernel. The smoothing kernel is represented by a sliding window W_s = [1, 1], and the differential kernel is represented by W_d = [-1, 1]. In signal processing terms, the smoothing kernel and the differential kernel function as a low-pass filter and a high-pass (edge-detecting) filter, respectively.

Given a vector V = [V(t_1), V(t_2), ..., V(t_n)]^T representing a time series, the smoothed profile and the differential profile of V are V * W_s = [V(t_1)+V(t_2), V(t_2)+V(t_3), ..., V(t_{n-1})+V(t_n)]^T and V * W_d = [V(t_1)-V(t_2), V(t_2)-V(t_3), ..., V(t_{n-1})-V(t_n)]^T, respectively, where * is the convolution operator.
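As a minimal sketch (the helper names are ours), both profiles are one-liners on a numpy array:

```python
import numpy as np

def smoothed_profile(x):
    # V * Ws with Ws = [1, 1]: sums of adjacent time points, length n-1
    x = np.asarray(x, dtype=float)
    return x[:-1] + x[1:]

def differential_profile(x):
    # V * Wd with Wd = [-1, 1]: signed differences of adjacent time points
    x = np.asarray(x, dtype=float)
    return x[:-1] - x[1:]

v = [0.1, 0.2, 1.5, 1.4, 0.3]
print(smoothed_profile(v))      # approx. [0.3, 1.7, 2.9, 1.7]
print(differential_profile(v))  # approx. [-0.1, -1.3, 0.1, 1.1]
```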

Since the energy of the signal component S_i is likely to concentrate in the low frequencies, we have:

Assumption of signal: If S_i is a non-zero signal vector, then

E(|S_i * W_s|^2) > E(|S_i * W_d|^2)

where E(|S_i * W_s|^2) and E(|S_i * W_d|^2) represent the expected energies of the corresponding smoothed profile and differential profile.

Next, we derive two propositions from the Assumption of noise and the Assumption of signal, as follows:

Proposition 1: If the noise component ε_i satisfies the Assumption of noise, then

E(|ε_i * W_s|^2) = E(|ε_i * W_d|^2)    (1)

Proposition 2: If the signal component S_i satisfies the Assumption of signal, and the noise component ε_i satisfies the Assumption of noise, then

E(|(S_i + ε_i) * W_s|^2) > E(|(S_i + ε_i) * W_d|^2)    (2)

Propositions 1 and 2 can be proven based on the symmetry of the noise distribution and the linear decomposability of the convolution operation.

Note that the log-ratio expression profile X_i = S_i + ε_i. According to Eq. (1) and Eq. (2), we define a statistic called the energy ratio (ER) for testing the null hypothesis, as follows:


ER(X_i) = |X_i * W_s|^2 / |X_i * W_d|^2    (3)

Fig. 1. The numerically estimated distribution of the logarithm of ER(ε_i) for n = 5, 10, and 15, where n is the number of time points.

Fig. 2. An example of a responsive gene expression profile where a "steep edge" occurs between the 3rd and the 4th time points.

The distribution of the logarithm of ER(ε_i) is shown in Fig. 1, where the number of time points varies from 5 to 15 and the distribution of ε_i is multivariate normal. We take the logarithm simply for the convenience of visualization. Obviously, the logarithm of ER(ε_i) follows a symmetric distribution highly peaked around zero mean. The distribution is two-tailed, but we are only interested in the positive tail when testing the null hypothesis, because the negative tail implies that the energy concentrates in the high frequencies; according to the Nyquist sampling criterion, the high frequency component is not adequately sampled, so the expression profile may not be reliable. When n → ∞, ER(ε_i) is asymptotically independent of the distribution of ε_i, which can easily be proven based on the central limit theorem.

2.3 Partial energy

In most time-course microarray experiments, the number of time points is limited. Due to insufficient sampling, the smoothness of the signal component S_i is not guaranteed at all time points. We call this the "steep edge" problem. A steep edge refers to a rapid increase or decrease of the gene expression level at certain time points. Fig. 2 shows an example of a responsive gene expression profile in which a steep up-slope edge occurs between the 3rd and the 4th time points. When the number of time points is limited, a steep edge adds a large value to the denominator in Eq. (3) and hence reduces the statistical significance of the ER score.

To solve the "steep edge" problem, we propose a new concept called partial energy. The basic idea of partial energy is to exclude the steep edges when calculating the energy of a differential profile. Let Y = [Y_1, Y_2, ..., Y_n]^T be a vector representing a profile. The k-order partial energy of Y is defined as:

PE_k(Y) = Σ_{i=1}^{n} Y_i^2 - Σ_{i=1}^{k} y_i^2

where k < n and y_i^2 denotes the ith biggest value among Y_1^2, Y_2^2, ..., Y_n^2. For example, let Y = [1, -1, -4, 3, 2]^T; its 2-order partial energy is PE_2(Y) = 1^2 + (-1)^2 + 2^2 = 6, where -4 and 3 are excluded from the calculation.

For most responsive patterns in microarray data, the number of steep edges is much smaller than the number of time points. We assume there are no more than 2 steep edges in a gene expression profile, and modify the statistic to be the ratio of the 2-order partial energies (PER_2) of the smoothed profile and the differential profile:

PER_2(X_i) = PE_2(X_i * W_s) / PE_2(X_i * W_d)    (4)

where * is the convolution operator, and W_s and W_d are the smoothing kernel and the differential kernel defined in Section 2.2.
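A small sketch of the k-order partial energy and the PER_2 statistic (the function names are ours); the printed value reproduces the worked example above:

```python
import numpy as np

def partial_energy(y, k=2):
    # k-order partial energy: total energy minus the k largest squared values
    sq = np.sort(np.asarray(y, dtype=float) ** 2)
    return float(sq[:-k].sum()) if k > 0 else float(sq.sum())

def per2(x):
    x = np.asarray(x, dtype=float)
    smoothed = x[:-1] + x[1:]       # X * Ws
    differential = x[:-1] - x[1:]   # X * Wd
    return partial_energy(smoothed, 2) / partial_energy(differential, 2)

print(partial_energy([1, -1, -4, 3, 2]))  # 6.0, as in the example above
```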

2.4 Significance analysis

Since the PER_2 statistic defined in Eq. (4) takes the form of a ratio, it can be easily incorporated into the SAM framework (Tusher et al., 2001) for significance analysis.

In the first step, a "relative difference" s_0 is added to the denominator in Eq. (4), as follows:

PEM(X_i) = PE_2(X_i * W_s) / (PE_2(X_i * W_d) + s_0)    (5)

For the sake of simplicity, the constant s_0 is chosen to be the 5th percentile of PE_2(X_i * W_d) over all genes (i = 1 to m). By introducing the relative difference, the genes with small fold-changes are excluded from the top-ranking list. This efficiently reduces the influence of channel bias and stabilizes the variance of the statistic. The statistic defined in Eq. (5) is called PEM (Partial Energy ratio for Microarray).
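Putting Eq. (5) together for a whole gene-by-time matrix might look as follows (a sketch reusing partial_energy from the previous snippet; the 5th percentile choice of s_0 follows the text):

```python
import numpy as np

def pem_scores(X):
    # X: m x n array of log-ratio time-course profiles, one gene per row;
    # partial_energy() is the helper from the sketch after Eq. (4).
    X = np.asarray(X, dtype=float)
    num = np.array([partial_energy(x[:-1] + x[1:]) for x in X])
    den = np.array([partial_energy(x[:-1] - x[1:]) for x in X])
    s0 = np.percentile(den, 5)      # the "relative difference"
    return num / (den + s0)
```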

Secondly, we employ the SAM algorithm for determining the confidence interval and the false discovery rate (sometimes called the q-value). For the details of the algorithm, one can refer to the SAM manual available at http://www-stat.stanford.edu/~tibs/SAM/. Here, we briefly describe our strategy of randomized permutation. With the PEM statistic, the permutation procedure consists of two steps: in the first step, the order of the log-ratio measurements in the expression profile is randomly permuted for each gene; in the second step, the signs of the measurements are randomly flipped. The second step is based on the Assumption of noise, under which the distributions of the measurements of non-differentially expressed genes are symmetric with zero mean.
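One permutation round under this scheme might look like the following sketch (the remaining SAM machinery, i.e. ranking and FDR estimation, is omitted):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def permute_once(X):
    # Step 1: permute the time order within each gene independently;
    # Step 2: flip the sign of each measurement at random (symmetric noise).
    X = np.asarray(X, dtype=float).copy()
    for row in X:
        rng.shuffle(row)
    signs = rng.choice([-1.0, 1.0], size=X.shape)
    return X * signs
```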

3. EXPERIMENTS

The robustness and generality of the proposed PEM method are evaluated with both a simulated dataset and published microarray datasets, namely the yeast environment response dataset (Gasch et al., 2000) and the yeast cell cycle dataset (Spellman et al., 1998). The missing values in the published datasets are filled in using KNN-Impute (Troyanskaya et al., 2001). The evaluation is based on the relative operating characteristic (ROC) score, which gives a reasonable measurement of sensitivity vs. specificity (McNeil and Hanley, 1984).
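The ROC score can be computed as a rank statistic; a minimal sketch, ignoring ties:

```python
def roc_score(scores, labels):
    # AUC via the Mann-Whitney rank formula: the probability that a
    # randomly chosen positive gene outranks a randomly chosen negative one.
    ranked = sorted(zip(scores, labels))
    pos_rank_sum = sum(r for r, (_, is_pos) in enumerate(ranked, 1) if is_pos)
    n_pos = sum(1 for is_pos in labels if is_pos)
    n_neg = len(labels) - n_pos
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(roc_score([0.9, 0.8, 0.4, 0.2], [True, False, True, False]))  # 0.75
```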

For comparison, our evaluation also includes the approaches employed in two of the most popular microarray analysis software packages, SAM (Tusher et al., 2001) and the spline based EDGE (Storey et al., 2005).

Recent versions of SAM provide two alternative approaches for the analysis of single time course data, slope based and signed area based. The slope based SAM is designed for identifying genes with monotonically increasing or decreasing patterns, and the signed area based SAM is an improved version of the paired t-test.

From the EDGE software, both the natural cubic spline and the polynomial spline based approaches are included in our evaluation. The dimension of the splines is empirically optimized to be 4 in the simulation and in the yeast environment response experiments. In the yeast cell cycle experiment, the dimension of the splines is optimized to be 8 for the cubic spline, and is set to 5 for the polynomial spline to avoid a singular matrix in the calculation.

3.1 Simulation

In the simulation experiment, each log-ratio expression profile is generated by summing its signal component and its noise component. The noise component follows a normal distribution with zero mean. For non-differentially expressed genes, the intensity of the signal component is constantly zero. For differentially expressed genes, the signal component is one of the three frequently observed signal patterns in time course microarray data, as shown in Fig. 3(a): a monotonous decreasing pattern defined by a linear function, a peaked responsive pattern defined by a Gaussian function, and a periodic pattern defined by a sine function. There are two free parameters in our simulation test: the number of time points and the signal-to-noise ratio (SNR). For each parameter setting, we generate 6000 artificial time course gene expression profiles, of which 5400 (90%) belong to non-differentially expressed genes and 600 (10%) belong to differentially expressed genes. The 600 profiles of differentially expressed genes are equally divided into 3 portions, corresponding to the monotonous decreasing pattern, the peaked responsive pattern, and the periodic pattern.

Fig. 3. Simulation experiment: (a) the three basic patterns of gene expression profile defined in the artificial dataset; (b) ROC scores under a varying number of time points; (c) ROC score vs. SNR. Methods compared: slope based SAM, signed area based SAM, cubic spline, polynomial spline, and PEM.

First, we generate artificial datasets by setting the SNR to be 1.0 and the numbers of time points to be 5, 7, 10, or 15. The ROC scores are plotted against the number of time points in Fig. 3(b). Next, we fix the number of time points to be 10 and set the SNR to be 0.5, 1.0, and 2.0. The ROC scores are plotted against SNR in Fig. 3(c).
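A sketch of such a generator; the exact functional forms and amplitudes are our assumptions, since the text specifies only the pattern families, the proportions, and the parameters swept:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def simulate(n_genes=6000, n_time=10, snr=1.0, frac_de=0.1):
    t = np.linspace(0.0, 1.0, n_time)
    patterns = [
        1.0 - 2.0 * t,                        # monotonous decreasing (linear)
        np.exp(-((t - 0.5) ** 2) / 0.02),     # peaked responsive (Gaussian)
        np.sin(2.0 * np.pi * t),              # periodic (sine)
    ]
    X = rng.normal(0.0, 1.0, size=(n_genes, n_time))   # noise, sd = 1
    n_de = int(n_genes * frac_de)
    for i in range(n_de):                     # 10% differentially expressed
        X[i] += snr * patterns[i % 3]
    labels = np.arange(n_genes) < n_de        # True = differentially expressed
    return X, labels
```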

The result of the simulation experiment demonstrates that PEM achieves the best overall performance among the methods in the evaluation. The signed area based SAM is the most robust when the number of time points is 5. However, as the number of time points increases or the SNR becomes larger, the PEM and EDGE approaches achieve much higher ROC scores than SAM. This is because the SAM approaches are modeled based on specific patterns, while the models underlying PEM and EDGE are more general.

3.2 Evaluation with yeast environment response dataset

The yeast environment response dataset consists of measurements from 173 arrays published in (Gasch et al., 2000) and (DeRisi et al., 1997). The dataset is used to discover the ways in which budding yeast S. cerevisiae cells adapt to various changing environments. Among the arrays available in the dataset, we selected 79 arrays based on two criteria: (i) population from wild-type cells; (ii) at least 7 time points sampled under each condition. These arrays fall into 10 individual experiments:

• Heat shock from 25°C to 37°C, consisting of 8 time points;
• Hydrogen peroxide treatment, consisting of 10 time points;
• Menadione exposure, consisting of 9 time points;
• DTT exposure, consisting of 8 time points;
• Diamide treatment, consisting of 8 time points;
• Hyper-osmotic shock, consisting of 7 time points;
• Nitrogen source depletion, consisting of 10 time points;
• Diauxic shift, consisting of 7 time points;
• Stationary phase, including two nearly-identical experiments consisting of 10 and 12 time points, respectively.

Fig. 4. Average expression patterns of ESR genes in the various experiments: (a) heat shock; (b) diauxic shift; (c) nitrogen depletion. The up-regulated and down-regulated groups are shown separately.

Table 1. ROC scores of the evaluated methods on the environment response experiments. The bold fonts correspond to the highest scores in the rows.

We assess the approaches by applying them to the 10 time course experiments individually. To evaluate the sensitivity and specificity of the methods, we use a list


of 270 genes available at the website of Chen et al. (2003). This list is the intersection of (i) the roughly 800 Environment Stress Response (ESR) genes identified by Gasch et al. (2000) using hierarchical clustering on multiple experiments, and (ii) a list of ortholog genes in the fission yeast S. pombe which are differentially expressed under environmental stress. Figure 4 shows that these evolutionarily conserved ESR genes are expressed with various expression patterns in different experiments, so they provide a good test-bed to evaluate the robustness and the generality of the methods.

The ROC scores for the methods are summarized in Table 1. The PEM method outperforms the other methods in 7 out of 10 experiments. It achieves reasonably good ROC scores (>0.7) in most experiments, except for the Menadione exposure experiment, in which none of the methods performs well. To further show the


superiority of PEM, we averaged the ROC scores over all experiments for each method and used a paired t-test to compare the performance of PEM with that of each other method. The p-values of the paired t-test demonstrate the significance of the improvement made by PEM.

3.3 Evaluation with Yeast Cell Cycle Dataset

The yeast cell cycle dataset (Spellman et al., 1998) consists of measurements from three experiments (Alpha factor, CDC15, CDC28) on cell cycle synchronized yeast S. cerevisiae cells. We employed a reference list containing 104 cell cycle regulated genes determined by traditional biological experiments, as mentioned in the original paper. In addition to SAM and EDGE, we also include the Fourier transform method (Spellman et al., 1998) in our evaluation. The Fourier transform (FT) method was introduced specifically for identifying periodically expressed genes.

The ROC scores are shown in Table 2. The PEM method outperforms the SAM approaches and the spline based EDGE approaches in all experiments. The FT method performs slightly better than PEM in identifying periodically expressed genes. However, the PEM method also identified a number of non-periodically expressed genes, which count as false positives in the ROC score calculation. To show this, we clustered the top 706 differentially expressed genes identified by PEM in the alpha factor experiment; these genes were selected at a false discovery rate of 0.1. We applied K-means clustering using Eisen's Cluster software (Eisen et al., 1998) and obtained eight clusters, as shown in Fig. 5. Five of the clusters are periodic and the remaining three are non-periodic. Note that the non-periodic portion of the differentially expressed genes cannot be found significant by the Fourier transform approach. The non-periodic clusters were mapped to gene ontology clusters using the GO Term Finder in the SGD database (http://db.yeastgenome.org/cgi-bin/GO/goTermFinder/). We selected four significant gene ontology terms corresponding to the non-periodic clusters, as listed in Table 3. The Bonferroni-corrected hypergeometric P-values show that these non-periodic clusters are biologically meaningful.
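A hedged sketch of this selection-plus-clustering step follows; scikit-learn's KMeans stands in for Eisen's Cluster software, the gene count (706, the FDR = 0.1 cut reported above) and the eight clusters follow the text, while the function name and the use of the statistic's magnitude for ranking are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_top_genes(expr, stat, n_top=706, n_clusters=8, seed=0):
    """Take the n_top genes with the largest ranking statistic and
    partition their time-course profiles into n_clusters groups."""
    top = np.argsort(-np.abs(stat))[:n_top]           # top-ranked gene indices
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(expr[top, :])             # rows: genes, cols: time points
    return top, labels
```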

Fig. 5. Clustering result shows periodic and non-periodic patterns of differentially expressed genes identified by PEM in the alpha factor experiment. [Image not reproduced: eight clusters, of which five are periodic and three are non-periodic.]

Table 2. ROC scores for evaluation of the methods in identifying periodically expressed cell cycle regulated genes. [Table body largely illegible in the source scan.]

Table 3. Selected significant gene ontology terms mapped to non-periodic clusters. The GO terms and cluster IDs are retrieved from the SGD database. [Table body illegible in the source scan.]



The evaluation with the yeast S. cerevisiae cell cycle dataset clearly demonstrates the ability of the PEM method to identify genes with either periodic or non-periodic patterns. In comparison with model-specific approaches such as the Fourier transform, the PEM method is more general and leads to a better overview of the dynamics of gene expression changes.

4 CONCLUSION AND DISCUSSION

Replications in time course microarray experiments are costly. In selecting methods for analysis without replicates, one usually faces a tradeoff between robustness and generality: if the assumption underlying a method is too strong, the method may fail to identify genes whose expression patterns do not fit the pre-defined model. In this paper, we propose a general statistical method called PEM (Partial Energy ratio for Microarray) for identifying differentially expressed genes in time course cDNA microarray experiments without replicates. In the PEM method, we assume that gene expression varies smoothly over the time series. This assumption is comparatively weak, and hence the PEM method is more general in identifying genes expressed in unexpected patterns. To identify differentially expressed genes, we employed convolution kernels in our statistic and introduced the concept of partial energy. The proposed statistic can easily be incorporated into the SAM framework for significance analysis, in which the variance of the statistic is stabilized by introducing the "relative difference". Experimental results show the robustness and generality of PEM when the number of time points is comparatively large (>6). Another advantage of PEM is that its parameters can be fixed across different experiments, although automatic determination of the optimal parameters may slightly improve the performance; this will be investigated in the future.
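The exact PEM statistic is defined earlier in the paper and is not reproduced here; purely as an illustration of the smoothness-plus-partial-energy idea described above, a toy score might look as follows. The kernel, the number q of retained terms, and the small stabilizing constant are all our inventions, not the paper's parameters.

```python
import numpy as np

def smoothness_score(x, kernel=(0.25, 0.5, 0.25), q=3):
    """Toy PEM-like score: smooth a single gene's time series with a
    convolution kernel, then measure how much of the total energy is
    concentrated in the q largest smoothed responses."""
    s = np.convolve(x, kernel, mode="same")    # smoothed expression profile
    e = np.sort(s ** 2)[::-1]                  # per-time-point energies, descending
    return e[:q].sum() / (e.sum() + 1e-12)     # partial-to-total energy ratio
```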

The main limitation of the PEM method is that the assumption of signal smoothness may not hold if the measurements are not adequately sampled; in this case, replication of the time series is necessary. Thus, we will also explore the possibility of modifying the PEM method for applications where replicates are available. One possible solution is to integrate the PEM statistic and the ANOVA F-score using a permutation-based strategy, which will be implemented and tested in the near future.

Acknowledgments

The authors thank Dr. Radha Krishna Karuturi, Dr. Vladimir Andreevich Kuznetsov, Mr. Juntao Li, and Mr. Vinsensius Berlian Vega for valuable discussions on the topics related to this paper.

References

1. Bar-Joseph Z, Gerber G, Simon I, Gifford D, and Jaakkola T. Comparing the continuous representation of time-series expression profiles to identify differentially expressed genes. Proc. Natl Acad. Sci. USA 2003; 100: 10146-10151.
2. Bar-Joseph Z, Gerber G, Gifford D, Jaakkola T, and Simon I. A new approach to analyzing gene expression time series data. RECOMB 2002: 39-48.
3. Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 1995; 57: 289-300.
4. Chen D, Toone WM, Mata J, Lyne R, Burns G, Kivinen K, Brazma A, Jones N, and Bahler J. Global transcriptional responses of fission yeast to environmental stress. Mol. Biol. Cell 2003; 14: 214-229.

5. DeRisi JL, Iyer VR, and Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997; 278: 680-686.
6. Eisen MB, Spellman PT, Brown PO, and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 1998; 95: 14863-14868.
7. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, and Brown PO. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 2000; 11: 4241-4257.


8. Guo X, Qi H, Verfaillie CM, and Pan W. Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003; 19: 1628-1635.
9. Lonnstedt I and Speed TP. Replicated microarray data. Statistica Sinica 2002; 12: 31-46.
10. McNeil BJ and Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med. Decis. Mak. 1984; 4: 137-150.
11. Park T, Yi S, Lee S, Lee SY, Yoo D, Ahn J, and Lee Y. Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics 2003; 19: 694-703.
12. Peddada SD, Lobenhofer EK, Li L, Afshari CA, Weinberg CR, and Umbach DM. Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 2003; 19: 834-841.
13. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 2004; 3: article 3.
14. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, and Futcher B. Comprehensive identification of cell-cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998; 9: 3273-3297.

15. Storey JD, Xiao W, Leek JT, Tompkins RG, and Davis RW. Significance analysis of time course microarray experiments. Proc. Natl Acad. Sci. USA 2005; 102: 12837-12842.
16. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, and Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics 2001; 17: 520-525.
17. Tusher V, Tibshirani R, and Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 2001; 98: 5116-5121.
18. Xu XL, Olson JM, and Zhao LP. A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Human Molecular Genetics 2002; 11: 1977-1985.
19. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, and Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002; 30: e15.


EFFICIENT GENERALIZED MATRIX APPROXIMATIONS FOR BIOMARKER DISCOVERY AND VISUALIZATION IN GENE EXPRESSION DATA

Wenyuan Li, Yanxiong Peng, Hung-Chung Huang and Ying Liu*

Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A. *Email: ying.liu@utdallas.edu

In most real-life gene expression data sets, there are often multiple sample classes with ordinal labels, which are categorized into the normal or diseased type. Traditional feature or attribute selection methods treat multiple classes equally without paying attention to the up/down regulation across the normal and diseased types of classes, while specific gene selection methods consider the differential expression across normal and diseased samples but ignore the existence of multiple classes. In this paper, to improve biomarker discovery, we propose to make the best use of both aspects: the differential expression (which can be viewed as domain knowledge of gene expression data) and the multiple classes (which can be viewed as a data set characteristic). We simultaneously take these two aspects into account by employing 1-rank generalized matrix approximations (GMA). Our results show that considering both aspects can not only improve the accuracy of classifying the samples, but also provide a visualization method for effectively analyzing the gene expression data on both genes and samples. Based on the GMA mechanism, we further propose an algorithm for obtaining a compact biomarker by reducing redundancy.

1. INTRODUCTION

With the rapid advances of microarray technologies, massive amounts of gene expression data are generated in experiments. The analysis of these high-throughput data poses both opportunities and challenges to biologists, statisticians, and computer scientists. One of the most salient features of microarray data is very high dimensionality with a small number of samples: there are thousands of genes but at most a few hundred samples in a data set. This characteristic, seldom found in any other type of data, has made traditional data mining and analysis methods ineffective, and has therefore attracted the focus of recent research. Among these methods, a crucial approach is to select a small portion of informative genes for further analysis, such as disease classification and the discovery of the structure of the genetic network 18. Due to the drastic size difference between genes and samples, the gene selection step is also needed to address the well-known "curse of dimensionality" problem in statistics, data mining, and machine learning 5.

However, quite different from traditional feature selection in other data sets such as text 22, the final goal of gene selection is to discover the "biomarker", a minimal subset of genes that not only is differentially expressed across different sample classes, but also contains the most relevant genes without redundancy. These two characteristics distinguish the task of discovering a biomarker from common feature selection tasks.

Recent gene selection methods fall into two categories: filter methods and wrapper methods 18. The wrapper methods 3 are closely "embedded" in the classifier and are thus often time-consuming. The filter methods, on the other hand, analyze the data by investigating two domain-specific targets: (1) differential expression across classes and (2) redundancy induced by the relevant genes. They are independent of the sample classification and are efficient in analyzing the functions of genes; they have therefore attracted more attention in recent years.

The basic goal of these filter methods is to obtain a subset of genes with maximum relevance and minimum redundancy 9, 23, 4. Most existing filter methods follow the methodologies of statistics 9 and information theory 4, 23, 18 to rank the genes and reduce the redundancy, using, e.g., t-like statistics and mutual information or information gain based methods. These methods are computationally efficient. However, they select the biomarker by considering only binary class labels, e.g., healthy/diseased,

* Corresponding author.


while the sample classes in real experiments are often ordinal, with a gradually changing tendency 3. For example, in the Lupus experiment (see Subsection 5.2), four classes of persons are considered: normal persons, relatives of patients, patients who show early symptoms, and patients whose symptoms are complete. Some gene expression experiments^a also consider classes of samples that are composites of normal and diseased ingredients at different scales, e.g., 1:4 or 3:4. In such gene expression data, although there are two types of classes, i.e., positive and negative, the labels of the multiple classes show ordinal scales according to the degree of their membership in the positive or negative type, e.g., "normal", "low-grade" tumor, "intermediate-grade" tumor, and "high-grade" tumor 16. However, when dealing with data sets with such multiple classes and two types, most existing filter methods, e.g., information gain and t-statistics, combine all classes of the positive type into one positive class, similarly combine all classes of the negative type into one negative class, and then run the filtering process on the two combined classes. Such analysis may ignore the characteristics of the expression data within each single class, and may therefore lose accuracy in discovering the biomarker with maximal relevance and minimal redundancy. On the other hand, most general feature selection methods, e.g., ReliefF 10, consider the multiple classes but ignore the special characteristic of gene selection, namely up and down regulation; they are therefore not specific to the task of gene selection either. There have been few wrapper methods investigating the biomarker on such data sets, one example being the Gaussian process model based method 3; however, as a wrapper method relying on leave-one-out error and forward selection, it is not efficient. Moreover, the original and intuitive objective of biomarker discovery is that the user can visually select the differentially expressed genes without redundancy; judged against this objective, a wrapper method is like a black box that screens the user out of the analysis

process. Therefore, in this paper, we propose a class of 1-rank Generalized Matrix Approximation (GMA)^b filter methods to simultaneously rank the genes and samples and identify the biomarker in data sets with multiple classes. The GMA simultaneously takes into account the global between-class data distribution (differential expression) and the local within-class data distribution (collections of low or high values). As pointed out by Achlioptas and McSherry 1, through low-rank matrix approximation, the particular trends or meaningful dimensions of high-dimensional data reveal the inherent overall structure. This is the second "blessing of dimensionality" stated by Donoho 5. Latent semantic indexing 14 for understanding text data, the success of the HITS 11 and PageRank 13 algorithms for understanding the huge WWW graph adjacency matrix, and the recent greedy matrix approximation for machine learning 17 all reveal this implication. Among these techniques, 1-rank matrix approximation is essential for analyzing high-dimensional data 12, 11, 13. One efficient technique for obtaining the 1-rank matrix is to employ a discrete dynamical system that quickly converges to a local optimum, which has been widely used and studied 12, 20, 7, 11.

We followed the framework of the resonance model introduced in our previous work on visually analyzing high-dimensional data 12. We generalized it into a novel discrete dynamical system particularly designed for approximating a gene expression matrix with multiple classes, i.e., GMA-1. In essence, it is a reinforcement mechanism simulating the resonance phenomenon. Owing to its quick convergence and efficient matrix-vector multiplications, GMA-1 is quite efficient. As a filter method, GMA-1 provides a simultaneous ranking of genes and samples^c. By rearranging the gene expression matrix with the GMA-1 rankings, we can visually observe the overall distribution of the values (see Fig. 4), where the top genes are differentially expressed across

a In the MAQC project, the descriptions of the data sets are available at http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc. b A 1-rank matrix is a matrix whose rank is 1. It can be formally expressed as a multiplication x y^T of two vectors x and y; a particular overall structure of the matrix can therefore be approximated and observed through the tendency of x y^T. The GMA follows the framework of the traditional 1-rank matrix approximation in linear algebra and generalizes it by partitioning the matrix. c The samples are ranked within each class.


classes and the top samples are important to their class. Therefore, GMA-1 is well suited to biomarker discovery. Moreover, the matrix sorted according to the rankings of both genes and samples can be shown visually to the user for further analysis. Furthermore, if the user needs to refine the biomarker to obtain a compact biomarker, GMA-2 can be employed to remove redundant genes. We followed the idea of Jaeger et al. 9 of using a representative of each dense cluster in the gene correlation matrix of the biomarker to reduce the redundancy. As observed and proved for GMA-2, it is able to find clusters with a fixed density. Unlike the general clustering algorithm used by Jaeger, GMA-2 is particularly customized for finding clusters with a fixed density. Therefore, CBioMarker, which combines GMA-1 and GMA-2, yields higher accuracy.

2. BASIC RESONANCE MODEL FOR APPROXIMATING MATRIX

In this section, we first introduce the resonance model for the purpose of revealing the terrain of a high-dimensional data set. Then the underlying rationale of the resonance model, i.e., 1-rank matrix approximation, is explained by two theorems. After explaining the basic mechanism for approximating the matrix in this section, a generalized matrix approximation for the task of biomarker discovery is introduced in the next section.

2.1. Process of Resonance Model

This resonance model was introduced in 12 for visually analyzing high-dimensional data sets. The target is to rearrange the matrix so as to collect the large values in the left-top corner of the sorted matrix (called the "mountain"), while leaving the small values in the right-bottom corner (called the "plains"). In this way, the data terrain (showing where the "mountains" and "plains" are) can be used to visually analyze high-dimensional data sets. Fig. 1(a) and (b) clearly indicate how this process can be used to visually analyze the matrix, through a comparison before and after the rearranging process. In a real-world example from yeast gene correlation data 19 in Fig. 1(c), multiplying the ranking value vectors of the rows and columns yields Fig. 1(d). This implies that the resonance model indirectly performs a 1-rank matrix approximation, using the matrix in Fig. 1(d) to approximate the real matrix in Fig. 1(c). Through this matrix approximation process, the underlying dominant terrain and structure are revealed.

In essence, the resonance model is an iterative reinforcement learning process on the matrix. It simulates the resonance phenomenon by introducing a forcing object ô, such that when an appropriate response function r is applied, ô resonates to elicit those objects {o_i, ...} ⊆ O whose "natural frequency" is similar to ô's. This "natural frequency" represents the makeup of both ô and the objects {o_i, ...} that resonated with ô when r was applied. Through the iterative reinforcement process, the "frequency" of the forcing object ô and the ranking values of the objects o_i ∈ O are updated until convergence, when ô is similar to the objects with the largest ranking values. In this way, the "frequency" vector ô of the forcing object and the ranking value vector r of the object set approximate the matrix W by the matrix r ô^T, denoted r ô^T ≈ W.

In the context of the weighted bipartite graph G = (O, F, E, W) with W = (w_ij)_{|O|×|F|}^d, where O and F are two subsets of vertices, the static "natural frequency" of o_i ∈ O is o_i = (w_i1, w_i2, ..., w_i|F|). Likewise, the dynamic "frequency" of the forcing object ô is defined as ô = (ŵ_1, ŵ_2, ..., ŵ_|F|). The components of the graph G are clearly shown in Fig. 2(a).

Simply put, if two objects of the same "natural frequency" resonate, they should have a similar terrain. The evaluation of the resonance strength between objects o_i and o_j is given by the response function r(o_i, o_j): R^n × R^n → R. We define this function abstractly to support different measures of resonance strength. For example, one existing measure for comparing two terrains is based on the well-known rearrangement inequality theorem, stated below.

d In the gene expression data, to make sure W is a non-negative matrix, we scale the values of W to the range [0,1] by (w - min)/(max - min), where min and max can be the minimum and maximum of each row or of the whole matrix. In the rest of the paper, W is assumed to have values in [0,1] unless otherwise stated.


Fig. 1. Matrix approximation by the basic linear resonance model (r = c = I): (a) and (b), a small example matrix with 4 rows and 5 columns, before and after sorting, illustrating how the terrain, i.e., "mountains" and "plains", helps analyze the data in both rows and columns; (c), a symmetric matrix S sorted by r* and ô*; (d), its approximation matrix r* ô*^T. [Images not reproduced.]

The rearrangement inequality states that I(x, y) = Σ_{i=1}^n x_i y_i is maximized when the two positive sequences x = (x_1, ..., x_n) and y = (y_1, ..., y_n) are ordered in the same way (i.e., x_1 ≥ x_2 ≥ ... ≥ x_n and y_1 ≥ y_2 ≥ ... ≥ y_n) and is minimized when they are ordered in opposite ways. Notice that if two vectors maximizing I(x, y) are put together to form M = [x; y] (in MATLAB format), we obtain a terrain with the "mountain" on the left side and the "plain" on the right side. The response function I is therefore a suitable candidate for characterizing the similarity of the terrains of two objects. Likewise, E(x, y) = exp(Σ_{i=1}^n x_i y_i) is also an effective response function, with the added effect of magnifying the role of the "mountains".

To find the "mountains" and "plains", the forcing object ô evaluates the resonance strength of every object o_i against itself to locate a "best fit" based on the contour of its terrain. By running this iteratively, the objects that resonate with ô are discovered and placed together to form the "mountains" within the 2-dimensional matrix W. In the same fashion, the "plains" are discovered by combining the objects that resonate weakly with ô. This iterative learning process between ô and O is outlined below.

Initialization: Set up ô with a uniform distribution: ô = (1, 1, ..., 1); normalize it as ô = norm(ô)^e; then let k = 0; and record this as ô^(0) = ô.

Apply Response Function: For each object o_i ∈ O, compute the resonance strength r(ô, o_i); store the results in a vector r = (r(ô, o_1), r(ô, o_2), ..., r(ô, o_|O|)); and then normalize it, i.e., r = norm(r).

Adjust Forcing Object: Using r from the previous step, adjust the terrain of ô for all o_i ∈ O. To do this, we define the adjustment function c(r, f_j): R^|O| × R^|O| → R, where the weights of the j-th frequency are given in f_j = (w_1j, w_2j, ..., w_|O|j). For each frequency f_j, ŵ_j = c(r, f_j) integrates the weights from f_j into ô by evaluating the resonance strength recorded in r. Again, c is abstract, and can be materialized using the inner product c(r, f_j) = r · f_j = Σ_i w_ij · r(ô, o_i). Finally, we compute ô = norm(ô) and record it as ô^(k+1) = ô.

Test Convergence: Compare ô^(k+1) against ô^(k). If the result converges, go to the next step; else apply r on O again (i.e., forcing resonance), and then adjust ô.

Matrix Rearrangement: Sort the objects o_i ∈ O by the coordinates of r in descending order; and sort the frequencies f_j ∈ F by the coordinates of ô in descending order.

To state the whole process clearly, we express it in the following formulas:

r^(k+1) = norm( r( W ô^(k) ) )    (1)
ô^(k+1) = norm( c( W^T r^(k+1) ) )    (2)
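As a concrete illustration, the following is a minimal Python/NumPy sketch of this iteration with the linear choices r = c = I (plain matrix-vector products), i.e., Eqn. (1) and (2); the function names and convergence test are ours, not from the paper.

```python
import numpy as np

def norm(x):
    """2-norm normalization, as in footnote e."""
    return x / np.linalg.norm(x)

def resonance_model(W, tol=1e-8, max_iter=1000):
    """Basic linear resonance model: alternate Eqn. (1) and (2)
    until the forcing object's frequency vector converges."""
    o_hat = norm(np.ones(W.shape[1]))          # Initialization: uniform frequencies
    for _ in range(max_iter):
        r = norm(W @ o_hat)                    # Eqn. (1): response step
        o_new = norm(W.T @ r)                  # Eqn. (2): adjustment step
        if np.linalg.norm(o_new - o_hat) < tol:  # Test Convergence
            return r, o_new
        o_hat = o_new
    return r, o_hat

# Matrix Rearrangement: sort rows by r and columns by o_hat, descending:
# W_sorted = W[np.argsort(-r)][:, np.argsort(-o_hat)]
```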

To illustrate how the matrix is sorted, let us take a look at a real-life example from yeast gene expression data 19. The symmetric gene correlation matrix is computed with the Pearson correlation measure. After running the resonance model, we obtained the converged r*

e norm(x) = x/||x||_2, where ||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2} is the 2-norm of the vector x = (x_1, ..., x_n).


Fig. 2. The resonance models for approximating the matrix for different purposes: (a) basic resonance model, collecting the high values into the left-top corner; (b) GMA-1, extended resonance model 1, simultaneously collecting high/low values into the left-top corners of the k class submatrices W_i^- or W_i^+; (c) GMA-2, extended resonance model 2, collecting the extremely high similarity/correlation values into the left-top corner to form a dense cluster. [Diagram not reproduced.]

and ô* in decreasing order, and also sorted o_i ∈ O and f_j ∈ F accordingly. Certainly, the rows and columns of the matrix S are also rearranged in the same orders as o_i and f_j. The sorted S of this example is shown in Fig. 1(c). We also draw its corresponding 1-rank approximation matrix r* ô*^T in Fig. 1(d). The example in Fig. 1(c) and (d) illustrates two observations: (1) the function of the resonance model is to collect the large values into the left-top corner of the rearranged matrix and leave the small values in the right-bottom corner; (2) the underlying rationale is to employ the 1-rank matrix r* ô*^T to approximate S. In fact, the value distribution of r* ô*^T essentially determines how the values of the sorted S are distributed.

3. TWO GENERALIZED MATRIX APPROXIMATIONS BY EXTENDING RESONANCE MODEL FOR GENE SELECTION

In this section, we extend and generalize the basic mechanism of the resonance model of Section 2 for the purpose of gene selection, in two respects. The first is to rank genes and samples in order to select the differentially expressed genes Q = {g_1, ..., g_k}. The second is to discover the very dense clusters in the correlation matrix computed from Q, and to remove the redundant genes in Q by selecting only one or two representative genes from each dense cluster. For these two steps, we designed two extended resonance models; from the perspective of matrix computation, they are two generalized matrix approximation methods based on the basic resonance model.

3.1. GMA-1 for Ranking Differentially Expressed Genes

Consider the general case of gene expression data: suppose the data set consists of m genes and n samples with k classes, whose numbers of samples are n_1, ..., n_k respectively, with n_1 + ... + n_k = n. Without loss of generality, suppose the first k_- classes are negative, the following k_+ classes are positive, and k_- + k_+ = k. A general gene-sample matrix W_{m×n} = [W_1^-, ..., W_{k_-}^-, W_1^+, ..., W_{k_+}^+] is thus composed of submatrix blocks as in Fig. 3(a). Because the target of analyzing differentially expressed genes is to find up-regulated or down-regulated genes between the negative and positive sample classes, the basic resonance model should be changed, from collecting high values into the left-top corner of W, to:

(1) a series of low-value collections into the left-top corner of each W_i^-, and simultaneously a series of high-value collections into the left-top corner of each W_i^+;

(2) controlling the differences between the left-top corners of the negative classes W_i^- and the positive classes W_i^+.

An example of such a matrix approximation is illustrated in Fig. 4. To meet these two goals, we extended the basic resonance model, calling the result GMA-1, as follows.

(1) Transformation of W: before running GMA-1, we transform the original gene-sample matrix W to W'.


Fig. 3. Transformation of the matrix W = [W_1^-, ..., W_{k_-}^-, W_1^+, ..., W_{k_+}^+] with n = n_1 + ... + n_k: the transformed matrix W' has the same structure of submatrix blocks as shown in (a), but with different submatrices W_i'^- and W_i'^+ as listed in (b): for up regulation, W_i'^- = 1 - W_i^- and W_i'^+ = W_i^+; for down regulation, W_i'^- = W_i^- and W_i'^+ = 1 - W_i^+.

The structure of W' is made of the submatrix blocks W_i^- and W_i^+ of the negative and positive classes, as shown in Fig. 3(a). In the case of finding up-regulated differentially expressed genes, since we need to collect the low values of W_i^- into the left-top corner, we reverse the values of W_i^- so that low values become high and vice versa; in other words, we apply the transformation W_i'^- = 1 - W_i^-. In this way, collecting the high values of W_i'^- and W_i'^+ into their own left-top corners naturally leads to collecting the low values of W_i^- and the high values of W_i^+ into the left-top corners. This is the essential step for meeting the first goal above. Other reverse functions could also be used instead of the simple 1 - x function of Fig. 3(b). Similarly, we transform W by W_i'^+ = 1 - W_i^+ in the case of finding down-regulated differentially expressed genes.

(2) The k partitions of the forcing object ô: an implicit requirement of the first goal is that the relative order of the classes (submatrices W_i^- and W_i^+) must be preserved after running GMA-1 and sorting W. For example, after running our algorithm, all columns of the submatrix W_2^- must appear after all columns of W_1^-, although the order of columns (samples) within W_1^- or W_2^- may change. To satisfy this requirement, we partition the forcing object's frequency vector ô into k parts corresponding to the k classes or submatrices. Specifically, ô = (ô_1; ...; ô_k)^f, where each ô_i corresponds to a sample class. In the process of GMA-1, we separately normalize each ô_i and then sum their resonance strength vectors together, with a factor α to control the differentiation between the negative and positive classes.

(3) The factor α for controlling the differentiation between the negative and positive classes: the frequency vector ô is divided into k = k_- + k_+ parts, each of which is normalized independently. We can therefore control the differentiation between the negative and positive classes by magnifying the resonance strength vectors r_i^+ = norm(W_i'^+ ô_i) of the k_+ positive classes relative to the vectors r_i^- = norm(W_i'^- ô_i) of the k_- negative classes. Formally,

r = norm( r_1^- + ... + r_{k_-}^- + α r_1^+ + ... + α r_{k_+}^+ )    (3)

where α ≥ 1 is a scaling factor multiplied with the normalized resonance strength vectors of the positive classes. As α increases, the proportion of the positive classes in the resonance strength vector r increases, resulting in increasingly large differences between the left-top corners of the positive and negative classes. In this way, the user can tune α to get a suitable differential contrast between the two types of classes.

f The concatenation of the k = k_- + k_+ vectors is expressed in MATLAB format.


To summarize the above changes to the resonance model, we draw the architecture of GMA-1 in Fig. 2(b) and express its process in the following formulas:

r_i^-(k+1) = norm( W_i'^- ô_i^-(k) ),  i = 1, ..., k_-
r_i^+(k+1) = norm( W_i'^+ ô_i^+(k) ),  i = 1, ..., k_+
r^(k+1) = norm( Σ_{i=1}^{k_-} r_i^-(k+1) + α Σ_{i=1}^{k_+} r_i^+(k+1) )    (4)
ô_i^-(k+1) = norm( (W_i'^-)^T r^(k+1) ),  i = 1, ..., k_-
ô_i^+(k+1) = norm( (W_i'^+)^T r^(k+1) ),  i = 1, ..., k_+

Algorithm 3.1 (GMA-1): Biomarker Discovery.

Input: (1) W_{m×n}, expression matrix of the gene set G (m genes) and sample set S (n samples); (2) (n_1, ..., n_k)^T, sizes of the k sample classes, with the submatrix structure of Fig. 3(a); (3) (k_-, k_+)^T, numbers of negative and positive classes; (4) regulation option, down or up; (5) α, differentiation factor.
Output: (1) (g_1, ..., g_m), ranking sequence of the m genes; (2) (s_1, ..., s_n), ranking sequence of the n samples.

1: preprocess W so that its values lie in [0,1], following the steps in Subsection 2.1.
2: transform W to W' according to the formulas in Fig. 3(b), using the matrix structure given by (n_1, ..., n_k)^T and (k_-, k_+)^T and the regulation option.
3: iteratively run the equations in Eqn. (4) to obtain the converged r* and ô_i* (i = 1, 2, ..., k).
4: sort r* in decreasing order to get the gene ranking sequence (g_1, ..., g_m), and sort each of ô_1*, ..., ô_k* in decreasing order to get the sorted sample sequence. {comment: Because the positions of all sample classes in W do not change, as shown in Fig. 3(a), each sorting of ô_i* only changes the order of samples within the i-th sample class W_i.}

where r, r_i^+, r_i^- ∈ R^{m×1} and ô_i^-, ô_i^+ ∈ R^{n_i×1}. Comparing Eqn. (1) and (2) with Eqn. (4), besides using the linear functions r = c = I, we partition the matrix W into k submatrix blocks and divide the frequency vector ô into k subvectors. The two equations of the basic resonance model are thus expanded into the (2k+1) equations of GMA-1. We formally summarize the procedure as Algorithm 3.1 (GMA-1) for biomarker discovery. A real-life example of the overall process of Algorithm GMA-1 is shown visually in Fig. 4.

In practice, GMA-1 converges quickly. Since GMA-1 is a generalized resonance model obtained by partitioning the matrix into k submatrices, its computational complexity is the same as that of the resonance model on the whole matrix, i.e., O(mn).
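To make the procedure concrete, here is a hedged Python/NumPy sketch of the GMA-1 iteration of Eqn. (4), including the transformation of Fig. 3(b); the function name, defaults (e.g., α = 2, as used in the experiments of Section 5), and the convergence test are our own choices rather than the paper's.

```python
import numpy as np

def norm(x):
    n = np.linalg.norm(x)
    return x / n if n > 0 else x

def gma1(W, class_sizes, k_neg, alpha=2.0, regulation="up",
         tol=1e-8, max_iter=1000):
    """Sketch of GMA-1: rank genes (rows) and samples (columns) of a
    [0,1]-scaled expression matrix W with k ordered classes, the first
    k_neg of which are negative."""
    bounds = np.cumsum([0] + list(class_sizes))
    blocks = [W[:, bounds[i]:bounds[i + 1]] for i in range(len(class_sizes))]
    # Fig. 3(b): reverse negative blocks for "up", positive blocks for "down"
    blocks = [1 - B if (i < k_neg) == (regulation == "up") else B
              for i, B in enumerate(blocks)]
    o_hats = [norm(np.ones(B.shape[1])) for B in blocks]
    r = None
    for _ in range(max_iter):
        # per-class resonance strengths; positive classes scaled by alpha
        parts = [norm(B @ o) * (alpha if i >= k_neg else 1.0)
                 for i, (B, o) in enumerate(zip(blocks, o_hats))]
        r_new = norm(np.sum(parts, axis=0))            # combined gene ranking
        o_hats = [norm(B.T @ r_new) for B in blocks]   # per-class sample rankings
        if r is not None and np.linalg.norm(r_new - r) < tol:
            r = r_new
            break
        r = r_new
    # sort genes by -r; sort samples within each class by -o_hat
    return r, o_hats
```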

3.2. GMA-2 for Reducing Redundancy by Finding Dense Clusters

It has been recognized that the top-ranked genes may not form the minimal gene subset for biomarker discovery and classification 9, 4, 23, because there are correlations among the top-ranked genes, which induces the problem of reducing "redundancy" in the top-ranked gene subsets. One effective strategy is to take the gene-to-gene correlation into account and remove redundant genes through pairwise correlation analysis among genes 9, 4, 21. In this section, we propose to use GMA-2, a special instance of the basic resonance model, to reduce the redundancy of the top-ranked genes selected by GMA-1. GMA-2 is a clustering method that finds high-density clusters; we can then simply select one or more representative genes from each cluster and thereby reduce the redundancy. The underlying rationale is that members of a very homogeneous and dense cluster are highly correlated and carry more redundancy, while a heterogeneous and loose cluster means bigger variety among the genes. Although similar work has been done by Jaeger et al. 9, those authors used a fuzzy clustering algorithm, which is not suitable for controlling the density of the clusters. Compared with the fuzzy clustering algorithm, GMA-2 can not only find clusters with different densities but also provide each gene's membership degree for a cluster.

Given a pairwise correlation or similarity matrix of a set of genes^g, GMA-2 outputs the largest cluster with a fixed density. To find more clusters with the fixed density, GMA-2 can be run iteratively on the remaining matrix after removing the rows and columns of the genes in clusters already found. Unlike GMA-1, which is a generalization of the basic resonance model, GMA-2 is actually a special instance of it. As Fig. 1(c) and (d) show, the linear basic resonance model is

g In our context, this set of genes is the set of top-ranked m' genes selected by GMA-1.


able to collect the high values of a symmetric matrix into the left-top corner of the sorted matrix, which means it can approximate a high-density cluster. We therefore customized the basic resonance model to find the dense cluster by setting the response and adjustment functions to be I or E. When r = c = I, we call this linear resonance model RML; when r = c = E, the non-linear resonance model is called RME. The overall architecture of RML and RME is illustrated in Fig. 2(c). With these settings and S = S^T, the two equations of the basic resonance model (i.e., Eqn. (1) and (2)) can be combined by removing ô, so that RML and RME are represented by Eqn. (5) and Eqn. (6), respectively:

r^(k+1) = norm( S r^(k) )    (5)
r^(k+1) = norm( E( S r^(k) ) )    (6)

A theoretical analysis is given in the following to show how RML works.

Given a nonnegative gene correlation matrix S = (s_ij)_{n×n} ∈ R^{n×n}, let a nonnegative membership vector x = (x_1, ..., x_n)^T ∈ {0,1}^{n×1} indicate the membership degree of each gene belonging to the dense and largest cluster. When the values of x are 0 or 1, D(x) in Eqn. (7) measures the density of the cluster formed by those genes whose corresponding x_i is 1:

D(x) = Σ_{i=1}^{n} Σ_{j=1}^{n} s_ij x_i x_j = x^T S x    (7)

The problem of finding the densest subgraph^h has been studied extensively and is known to be NP-hard 6. A typical strategy in approximation algorithms is to relax the integer constraints on x (i.e., binary values 0 or 1) to continuous real numbers, e.g., x ∈ [0,1]^{n×1}, normalized so that ||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2} = 1. In this way, the membership degree x changes from a binary to a continuous value. According to matrix computation theory 8, we have the following theorem.

Theorem 3.1 (Rayleigh-Ritz). Let S ∈ R^{n×n} be a real symmetric matrix and λ_max(S) be the largest eigenvalue of S. Then

λ_max(S) = max_{x≠0} (x^T S x)/(x^T x) = max_{||x||_2=1} x^T S x    (8)

and the eigenvector x* corresponding to λ_max(S) is the solution at which the maximum is attained.

Theorem 3.1 indicates that the first eigenvector x* of S maximizes D(x) under the relaxed constraints and therefore reveals a dense cluster. By linear algebra, iterating Eqn. (5) in RML leads to the convergence of r to the first eigenvector of S, i.e., r* = x*; therefore, RML can reveal the dense cluster. In practice, we found that the non-linear resonance model RME works better than the linear RML, since the exponential function magnifies the roles of the high values in the dense cluster. Hence, based on RME, GMA-2 is formally stated in Algorithm 3.2.

Algorithm 3.2 (GMA-2): Find a δ-Dense Cluster.

Input: (1) S_{n×n}, a non-negative gene correlation matrix from a set of n genes G; (2) δ, a fixed density threshold.
Output: G' = {g_1, ..., g_k} ⊆ G, a sequence of k genes which forms a dense cluster.

1: run RME on S to get the converged r*.
2: sort r* in decreasing order and get the sequence of genes (g_1, ..., g_n) in this order. Then set the subset G_2 = {g_1, g_2} and k = 2.
3: while D(S(G_k)) ≥ δ do
4:   k = k + 1.
5:   set G_k = {g_1, ..., g_k} as the top k genes.
6: end while
7: if there is no k satisfying D(S(G_k)) ≥ δ then
8:   return ∅.
9: end if
10: return G_{k-1}.
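An illustrative Python/NumPy sketch of RME (Eqn. (6)) and the growth loop of Algorithm 3.2 follows. The density measure here is the mean of the submatrix entries, which is one plausible normalization of D(x) = x^T S x; that choice, and all names, are our assumptions.

```python
import numpy as np

def rme(S, tol=1e-8, max_iter=1000):
    """Non-linear resonance model RME: iterate r <- norm(exp(S r))."""
    r = np.ones(S.shape[0]) / np.sqrt(S.shape[0])
    for _ in range(max_iter):
        r_new = np.exp(S @ r)
        r_new /= np.linalg.norm(r_new)
        if np.linalg.norm(r_new - r) < tol:
            return r_new
        r = r_new
    return r

def gma2(S, delta):
    """Grow the top-RME-ranked gene set while its density stays >= delta.
    Returns indices into S, or an empty list if no delta-dense cluster exists."""
    order = np.argsort(-rme(S))
    def density(idx):
        return S[np.ix_(idx, idx)].mean()   # assumed normalization of x^T S x
    k = 2
    while k <= len(order) and density(order[:k]) >= delta:
        k += 1
    return list(order[:k - 1]) if k > 2 else []
```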

4. ALGORITHM FOR COMPACT BIOMARKER DISCOVERY

In some cases, the user is more interested in a biomarker with the minimal number of genes that can classify the samples. Therefore, in this section, we discover the compact biomarker by combining GMA-1 and GMA-2. We outline it in Algorithm 4.1, CBioMarker.

h Considering that a nonnegative symmetric matrix is the adjacency matrix of an undirected weighted graph, a dense cluster corresponds to a dense subgraph of this graph; the two problems are therefore equivalent.


Similar to that of the basic resonance model, the computational complexity of GMA-2 is O(n^2). The computational complexity of CBioMarker is therefore at most O(mn + h·m'^2) if the size of S' in each iteration is taken to be m' (in fact, S' shrinks in every iteration after removing the gene dense cluster already found), where h is the number of iterations in Algorithm CBioMarker, which depends on the number of dense clusters found in S'. Algorithm 4.1 (CBioMarker) is therefore efficient as well. Our empirical result on the large Leukemia data set of size 12582×72 in Subsection 5.1 shows that it took about 3 seconds in a MATLAB environment on a Pentium IV PC with 512MB RAM.

Algorithm 4.1 (CBioMarker): Outline of Compact Biomarker Discovery with GMA-1 and GMA-2.

Input: (1) W_{m×n}, a gene expression matrix from a set of m genes G; (2) (n_1, ..., n_k)^T, sizes of the k sample classes, with the submatrix structure shown in Fig. 3(a); (3) (k_-, k_+)^T, numbers of negative and positive classes; (4) δ, a fixed density threshold for the clusters.
Output: G' = {g_1, ..., g_q} ⊆ G, a subset of q genes which forms a biomarker.

1: run GMA-1 on W to get the gene ranking sequence (g_1, ..., g_m).
2: select the first m' genes from the ranking sequence (g_1, ..., g_m) and compute their correlation matrix S.
3: set G' = {} and S' = S.
4: repeat
5:   run GMA-2 on S' with δ and get the highly correlated gene cluster sequence G''.
6:   if G'' is not empty then
7:     select the first representative gene g''_1 and add it to G', i.e., G' = {G', g''_1}. {comment: the number of representative genes selected depends on δ. If δ is high, one representative gene is enough; otherwise, select several more.}
8:   end if
9:   remove the rows and columns of the genes in G'' from S'.
10: until G'' is empty {comment: this indicates there are no δ-dense clusters any more.}
11: add the remaining genes that were not clustered and found by GMA-2 to G'.
12: return G'.
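Combining the two sketches above, a hypothetical end-to-end driver in the spirit of Algorithm 4.1 might look as follows; it reuses the gma1 and gma2 functions sketched earlier, and m', δ, the rescaling of the correlation matrix to [0,1], and the single-representative-per-cluster choice are illustrative assumptions.

```python
import numpy as np

def cbiomarker(W, class_sizes, k_neg, m_prime=50, delta=0.7, alpha=2.0):
    """Sketch of CBioMarker: GMA-1 ranking, then GMA-2-based
    redundancy reduction over the top m' genes."""
    r, _ = gma1(W, class_sizes, k_neg, alpha=alpha)   # step 1: rank genes
    top = np.argsort(-r)[:m_prime]                    # step 2: top m' genes
    S = (np.corrcoef(W[top, :]) + 1.0) / 2.0          # correlations scaled to [0,1]
    remaining = list(range(m_prime))
    marker = []
    while len(remaining) >= 2:
        sub = S[np.ix_(remaining, remaining)]
        cluster = gma2(sub, delta)                    # step 5: delta-dense cluster
        if not cluster:
            break                                     # step 10: no more dense clusters
        marker.append(remaining[cluster[0]])          # step 7: one representative
        remaining = [g for j, g in enumerate(remaining) if j not in cluster]
    marker.extend(remaining)                          # step 11: unclustered genes
    return [int(top[i]) for i in marker]
```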

5. EMPIRICAL STUDY

In this section, we conduct experiments on two data sets and compare our method with three popular filter methods: T-statistics (T), Information Gain (IG), and ReliefF 10. We first used GMA-1^i, T, and IG to rank the genes and compared them over different feature sizes, k = 2, 4, 10, 20, 50, 100, 200, 500, 1000. Each resulting feature subset was used to train an SVM classifier^j with the linear kernel function. Because of the small number of samples, Leave-One-Out Cross Validation (LOOCV), a popular performance validation procedure adopted by many researchers, was performed to assess the classification performance. Then, to obtain a minimum biomarker, we ran CBioMarker to get the compact biomarker and similarly used LOOCV accuracy to evaluate it.
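For completeness, LOOCV accuracy of a linear-kernel SVM on a selected feature subset can be computed as below; the paper used SVMlight, so scikit-learn's SVC here is only a stand-in for the authors' setup.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def loocv_accuracy(X, y):
    """Leave-One-Out Cross Validation accuracy with a linear-kernel SVM.
    X: samples x selected-genes matrix, y: class labels."""
    clf = SVC(kernel="linear")
    return cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
```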

5.1. Leukemia Data

We used the Leukemia gene expression data 2, in which, besides the classes "ALL" and "AML", a new class "MLL" of samples is identified. It contains 12,582 genes and 72 samples across these 3 sample classes. We therefore performed three experiments, using one class versus the rest as positive versus negative: (1) ALL versus MLL&AML, (2) MLL versus ALL&AML, and (3) AML versus ALL&MLL. In each experiment, the gene expression matrix for our method is partitioned into one positive and two negative class blocks, following Fig. 3(a). In all three experiments, α was set to 2 for GMA-1. The results are shown in Tables 1, 2, and 3. As the three tables show, our method GMA-1 outperforms the other methods in:

• High accuracy: in all three experiments, GMA-1 maintains very high accuracies across different k. In the experiment "MLL versus ALL&AML", where the class MLL is hard to distinguish, GMA-1 still obtains high accuracy even when k is very small.

• Compact biomarker: observing the accuracies of the methods from small k to large,

i Because GMA-1 can rank genes in terms of up and down regulation respectively, in this experiment of comparing k top-ranking genes, we selected the 0.5k top-ranking genes in up regulation and the 0.5k top-ranking genes in down regulation to form the k top-ranking genes given by GMA-1. j SVMlight was used.


GMA-1 quickly reaches high accuracy at very small k, while the methods T and IG require larger k to arrive at the same accuracy (compare the minimum k at which each method reaches its highest accuracy in the three tables). This means that GMA-1 outperforms the other methods in terms of discovering the compact or minimal biomarker. For example, in Table 1, the top 2 ranking genes found by GMA-1 achieve 100% accuracy, while the top 2 ranking genes of the other two methods achieve less than 80%. Similar cases appear in Tables 2 and 3.

• Stability: not only do small subsets of selected genes achieve higher accuracies than the other methods, but large subsets of selected genes also maintain high accuracy. This stability as k increases may be interesting to biologists when they try to analyze more relevant genes contributing to the diseases. In contrast, the method T is not stable, especially in Table 2, where the samples are hard to distinguish.

Table 1. LOOCV accuracy rate (%) of ALL versus MLL&AML.

k=        2     4     10    20    50    100   200   500   1000
T         79.2  86.1  91.7  93.1  98.6  98.6  98.6  100   100
IG        76.4  80.6  95.8  98.6  98.6  98.6  98.6  98.6  98.6
ReliefF   63.9  86.1  95.8  95.8  98.6  98.6  100   98.6  98.6
GMA-1     100   100   98.6  100   100   100   100   100   100

Table 2. LOOCV accuracy rate (%) of MLL versus ALL&AML.

k=        2     4     10    20    50    100   200   500   1000
T         69.4  65.2  81.9  80.6  84.7  86.1  93.1  90.3  87.5
IG        72.2  88.9  88.9  88.9  98.6  98.6  97.2  98.6  97.2
ReliefF   72.2  88.9  95.8  94.4  94.4  94.4  97.2  98.6  98.6
GMA-1     86.0  88.9  97.2  98.6  100   97.2  98.6  98.6  98.6

CBioMarker: finds 4 genes with 93.1% accuracy.

An important factor enabling GMA-1 to perform well is that the matrix approximation has a global searching ability: it takes into account the value distribution of the whole matrix and of the multiple classes in a macroscopic way, rather than considering genes, samples, or gene-to-gene relations individually. We then sought to obtain a minimal biomarker while keeping a relatively high accuracy (e.g., greater than 90%). There is no need to find the compact biomarker in any experiment except "MLL versus ALL&AML", because GMA-1 already found 2 genes with accuracy greater than 90% in the others. We therefore ran algorithm CBioMarker with δ = 0.7 for GMA-2 on the second experiment. As shown in Table 2, we found 4 genes with 93.1% accuracy, which is better than any other method in Table 2 at k = 4.

Table 3. LOOCV accuracy rate (%) of AML versus ALL&MLL.

k=        2     4     10    20    50    100   200   500   1000
T         66.7  77.8  97.2  98.6  100   98.6  97.2  97.2  97.2
IG        79.2  76.4  87.5  93.1  97.2  97.2  97.2  97.2  97.2
ReliefF   86.1  84.7  95.8  94.4  97.2  97.2  97.2  98.6  97.2
GMA-1     90.3  91.7  97.2  97.2  97.2  97.2  97.2  97.2  97.2

To test whether the biomarkers found by our method are effective, we checked, for instance, the two genes found by GMA-1 in Table 1 against Entrez Gene at the NCBI website (http://www.ncbi.nlm.nih.gov/entrez). The two genes are MME, which is underexpressed, and LGALS1, which is overexpressed^k. In the results of Armstrong et al. 2, these two genes were also ranked first among the underexpressed and overexpressed genes, respectively. MME is a common acute lymphocytic leukemia antigen and an important cell surface marker in the diagnosis of human acute lymphocytic leukemia (ALL), while LGALS1 has also been reported to be highly correlated with ALL 15.

5.2. Lupus Data

In this experiment, we used an unpublished data set from the Lupus gene expression experiments of the Microarray core facility at UT Southwestern Medical Center. We demonstrate the visualization ability of our method in helping the user analyze genes and samples simultaneously. This data set contains 1,022 genes and 84

k The GenBank accession number of MME is J03779 and that of LGALS1 is AI535946.


samples with 4 sample classes: "NC" (Normal Control), "FDR" (First-Degree Relative), "ILE" (Incomplete Lupus Erythematosus), and "SLE" (Systemic Lupus Erythematosus). Among these classes, "NC" and "FDR" are from normal persons, while "ILE" and "SLE" are from patients.

Fig. 4. Visualization of (a) the sorted matrix W, (b) the sorted transformed matrix W' = [1-W_NC, 1-W_FDR, W_ILE, W_SLE], and (c) the sorted approximation matrix r* ô*^T, where ô* is the concatenation of the k vectors: ô* = (ô_1*; ...; ô_k*). Columns are grouped by the classes NC, FDR, ILE, and SLE. [Grey scale images not reproduced.]

and WgLE- ^n t n i s w a y > t n e ^ a t a within-class distributions and the between-class distribution are fully considered. To illustrate the process of GMA-1, we also drew the grey scale image of the transformed matrix W for up regulation and the final approximation matrix r*6* r given by the converged resonance strength vector r* and the frequency distribution vector 6.

By observing the grey scale image of the approximation matrix r* ô*^T in Fig. 4(c), we found that the outlier samples of each class are placed at the rightmost positions of the corresponding class submatrix. For example, the colors of the rightmost sample (the 13th column) in the class "NC" differ significantly from the colors of all the other samples to its left, which indicates that this sample may be an outlier of the class "NC". This can also be observed in Fig. 4(a), the original sorted gene expression matrix. From this visualization, besides obtaining the top-ranking relevant genes, the user can also draw the conclusion that some normal persons may be early-stage, undetected patients. Similar cases occur in the other classes as well.

6. CONCLUSIONS

In this work, we have introduced a novel matrix approximation perspective for filtering genes in multiple-class data sets. It comprehensively considers the global between-class data distribution and the local within-class data distributions, and therefore improves the accuracy of biomarker discovery. Meanwhile, it provides an overall tendency of the whole matrix for visualizing and analyzing the data. Experiments on gene expression data have demonstrated its efficiency and effectiveness for both biomarker discovery and visualization.

ACKNOWLEDGMENT

The authors would like to thank Dr. Quan Li from Microarray core facility in UT Southwestern Medical Center for sharing his unpublished data set with us.

References
1. Achlioptas D, McSherry F. Fast computation of low rank matrix approximations. In Proc. of STOC 2001.
2. Armstrong SA, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2002; 30(1): 41-47.

3. Chu W, Ghahramani Z, Falciani F, Wild D. Biomarker discovery in microarray gene expression data with gaussian processes. Bioinformatics 2005; 21(16): 3385-3393.

4. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. In Proc. of CSB 2003.

5. Donoho DL. High-dimensional data analysis: The curses and blessings of dimensionality. In Math Challenges of the 21st Century 2000.

6. Feige U, Kortsarz G, Peleg D. The dense k-subgraph problem. Algorithmica 2001; 29(3): 410-421.


7. Gibson D, Kleinberg J, Raghavan P. Clustering categorical data: An approach based on dynamical systems. VLDB Journal 2000; 8(3-4): 222-236.

8. Golub G, Loan CV. Matrix Computations. The Johns Hopkins University Press, 1996.

9. Jaeger J, Sengupta R, Ruzzo WL. Improved gene selection for classification of microarrays. In Proc. of PSB 2003.

10. Kira K, Rendell L. A practical approach to feature selection. In Proc. of ICML 1992.

11. Kleinberg J. Authoritative sources in a hyperlinked environment. J. of the ACM 1999; 46(5): 604-632.

12. Li W, Ong KL, Ng WK. Visual terrain analysis of high-dimensional datasets. In Proc. of PKDD 2005.

13. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.

14. Papadimitriou CH, Tamaki H, Raghavan P, Vempala S. Latent semantic indexing: A probabilistic analysis. In Proc. of PODS 1998.

15. Rozovskaia T, et al. Expression profiles of acute lymphoblastic and myeloblastic leukemias with ALL-1 rearrangements. Proc. of the National Academy of Sciences USA 2003; 100(13): 7853-7858.
16. Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1: 203-209.

17. Smola AJ, Scholkopf B. Sparse greedy matrix approximation for machine learning. In Proc. of ICML 2000.

18. Tabus I, Astola J. Gene feature selection. Technical report. http://www.cs.tut.fi/~tabus/course/GSP/ Chapter2GSPSR.pdf.

19. Tavazoie S, Hughes J, Campbell M, Cho R, Church G. Yeast micro data set, 2000.

20. Tsaparas P. Using non-linear dynamical systems for web searching and ranking. In Proc. of PODS 2004.

21. Xing E, Jordan M, Karp R. Feature selection for high-dimensional genomic microarray data. In Proc. of ICML 2001.

22. Yang Y, Pedersen JP. A comparative study on feature selection in text categorization. In Proc. of ICML 1997.

23. Yu L, Liu H. Redundancy based feature selection for microarray data. In Proc. of SIGKDD 2004; Seattle, Washington.

Page 162: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

145

EFFICIENT COMPUTATION OF MINIMUM RECOMBINATION WITH GENOTYPES (NOT HAPLOTYPES)

Yufeng Wu* and Dan Gusfield

Department of Computer Science

University of California, Davis

Davis, CA 95616, U.S.A.

Email: {wuyu, gusfield}@cs.ucdavis.edu

*Corresponding author.

A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach to this issue has been to try to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes. That two-stage approach has potential deficiencies, giving rise to the general question of how well one can answer the downstream questions using genotype data without first inferring haplotypes, and also giving rise to the goal of computing the range of downstream answers that would be obtained over the range of possible inferred haplotype solutions. This paper provides some tools for the study of those issues, and some partial answers. We present algorithms to solve downstream questions concerning the minimum amount of recombination needed to derive given genotypic data, without first fixing a choice of haplotypes. We apply these algorithms to the goal of finding recombination hotspots, obtaining results as good as those of a published method that first infers haplotypes; and to the problem of estimating the minimum amount of recombination needed to derive the true haplotypes underlying the genotypic data, obtaining weaker results compared to first inferring haplotypes using the program PHASE. Hence our tools allow an initial study of the two-stage versus one-stage issue, in the context of specific downstream questions, but our experiments certainly do not fully resolve the issue.

1. INTRODUCTION

The field of genomics is now in a phase where large-scale population data are collected in order to study population-level variations 8. Variations between individuals are used to provide insight into basic biological processes such as meiotic recombination, to locate genes that are currently under natural selection 36, and to help locate the genes that influence genetic disease or economic traits (through a technique called "association mapping" or "LD mapping") 15. Algorithms and computation play a central role in all of these efforts, and there is a growing literature on several key problems involved in both the acquisition and the downstream analysis of population variation data. In discussing acquisition and analysis problems, and in order to introduce the theme of this paper, we must first define some basic terms, concepts and issues.

In diploid organisms (such as humans) there are two (not completely identical) "copies" of each chromosome, and hence of each region of interest. A description of the data from a single copy is called a haplotype, while a description of the conflated (mixed) data on the two copies is called a genotype. Today, the underlying data that form a haplotype are usually a vector of values of m single nucleotide polymorphisms (SNPs). A SNP is a single nucleotide site where exactly two (of four) different nucleotides occur in a large percentage of the population. Genotype data is represented as an n by m 0-1-2 (ternary) matrix G. Each row is a genotype. A pair of binary vectors of length m (haplotypes) generates a row i of G if for every position c both entries in the haplotypes are 0 (or 1) if and only if G(i,c) is 0 (or 1) respectively, and exactly one entry is 1 and one is 0 if and only if G(i,c) = 2. The International Haplotype Map Project 7, 8 is focused on determining, both molecularly and computationally, the common haplotypes in several diverse human populations.
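To make the 0-1-2 encoding concrete, here is a minimal sketch of ours (not from the paper) of the generation test just defined; the function name is our own.

```python
def generates(h1, h2, g):
    """Return True iff haplotypes h1, h2 (0/1 vectors) generate genotype g.

    Encoding: g[c] = 0 or 1 means both haplotypes carry that value at
    site c; g[c] = 2 means the site is heterozygous (one 0 and one 1).
    """
    assert len(h1) == len(h2) == len(g)
    for a, b, gc in zip(h1, h2, g):
        if gc == 2:
            if a + b != 1:           # need exactly one 0 and one 1
                return False
        elif not (a == b == gc):     # homozygous site must match g
            return False
    return True

# Example: this haplotype pair generates the genotype (0, 2, 1, 2).
print(generates([0, 0, 1, 1], [0, 1, 1, 0], [0, 2, 1, 2]))  # True
```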

The Key Issue: The key technological fact is that it is very difficult and costly to collect large-scale haplotype data, but relatively easy and cheap to collect genotype data. But mutation, recombination, selection, and evolution all operate on haplotypes, not genotypes, and therefore the "downstream" biological questions that we want to answer using population variation data (for example, about recombination hotspots, linkage disequilibrium, natural selection, association mapping, phylogenetic networks, etc.) are all questions that are most naturally framed in terms of haplotypes. So, we have the ability to gather large-scale genotypes in populations, but we have the need to ask and answer questions about the underlying haplotypes in populations. To date, the resolution of this issue has overwhelmingly involved two independent stages: first, try to infer the "correct" haplotypes from the genotypes, either inferring a pair of haplotypes for each genotype in the sample, or inferring just the frequencies of the haplotypes in the sample; second, do the downstream analysis using those inferred haplotypes.

There is a very large literature on haplotype inference (HI), and on an absolute scale, the underlying haplotypes can be inferred with remarkable fidelity 25, although problems remain and the field of haplotype inference is still very active (for example, even the developer of the most widely used program PHASE 35 has recently introduced a totally different approach in order to handle much larger data than PHASE can 33). So, the two-stage approach may in the end be the best way to address many of the downstream biological questions of interest, but in general there is some (potential) loss of information in any two-stage approach, and certainly this particular approach has both problems and missed opportunities. The main problem is that the haplotype inferences are likely to be incorrect to some extent, and it is not clear what effect those inaccuracies will have on the downstream analysis. The missed opportunity inherent in the two-stage approach is that by choosing just one set of haplotypes, we do not address questions about the range of possible answers to the downstream questions that the collected genotype data support. Range questions are of interest because they provide a kind of "sensitivity analysis" for any particular chosen answer (for example, for an answer derived from the two-stage approach), and they address the general question of "how much does it really help to know the underlying haplotypes that give rise to the genotypes" 6, 27, 28? The answer to that question helps determine how much effort or money one would be willing to spend to determine the correct haplotypes (by molecular means or by gathering more genotype data). Indeed, we are seeing results in this general direction. For example, Halperin, et al. developed a method for tag SNP selection with genotype data 14.

The Main Theme of This Paper: Motivated by the above discussion, this paper concerns the solution of certain downstream biological questions using genotypic data, without first fixing a choice of haplotypes. In particular, we are concerned with estimating, and bracketing the range of, the minimum amount of recombination needed to derive haplotypes that can pair to form the observed genotypes, and with problems of inferring an explicit evolutionary history of haplotypes that can pair to form the observed genotypes. As a byproduct, turning the two-stage process on its head, we can use some of these computations to solve the haplotype inference problem itself. We develop polynomial-time algorithms for some problems, non-polynomial but practical algorithms for other problems, and show the results of applying these methods to simulated and real biological data. Our methods provide some tools to study the two-stage versus one-stage issue, in the context of specific problems involving recombination. However, we do not claim that our experimental results resolve the issue of which approach is best.

2. ADDITIONAL DEFINITIONS

Before discussing our results in detail, we need some additional definitions.

Given an input set (or matrix) of n genotype vectors G of length m, the Haplotype Inference (HI) Problem is to find a set (or matrix) H of n pairs of binary vectors (with values 0 and 1), one pair for each genotype vector, such that each genotype vector in G is generated by the associated pair of haplotypes in H. H is called an "HI solution for G". Genotype data is also called "unphased data", and the decision on whether to expand a 2 entry in G to the ordered pair (0,1) or to (1,0) across the two haplotype rows in H is called a "phasing" of that entry. The way that all the 2's in a column are expanded is called the phasing of the column.


The standard assumption in population genetics 15, 16 is that at most one mutation has occurred at any sampled site in the evolution of the haplotypes. This is called the "infinite sites model". In addition to mutation, haplotypes may evolve through recombination between haplotype sequences. Meiotic recombination takes two equal-length sequences and produces a third sequence of the same length consisting of some prefix of one of the sequences, followed by a suffix of the other sequence. Meiotic recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species. Efforts to deduce patterns of historical recombination or to estimate the frequency or the location of recombination are central to modern-day genetics 26, and recombination is at the heart of the logic of association mapping, a technique that is widely hoped to help locate genes influencing genetic diseases and important traits 15.
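As a concrete illustration (ours), a single meiotic recombination under this definition is a one-line operation on equal-length sequences; the breakpoint k is a hypothetical parameter.

```python
def recombine(s1, s2, k):
    """Single-crossover recombination: prefix s1[:k] followed by suffix s2[k:].

    s1 and s2 must have equal length; k is the breakpoint (0 < k < len(s1)).
    """
    assert len(s1) == len(s2) and 0 < k < len(s1)
    return s1[:k] + s2[k:]

# Example: crossing over after site 2.
print(recombine([0, 0, 0, 0], [1, 1, 1, 1], 2))  # [0, 0, 1, 1]
```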

For a given set of haplotypes, computing the minimum number of recombinations needed to explain their evolution (under the infinite sites model) is a standard question of interest, for both practical and fundamental reasons. For a matrix of haplotypes H, we define Rmin(H) as the minimum number of recombination events needed in a derivation of H from some unknown (or sometimes known) ancestral haplotype, under the infinite sites model. The problem of computing Rmin(H) exactly is NP-hard, but there is a growing literature on polynomial-time methods that work on problems of special structure; on practical heuristics that are exact on small data; and on efficient methods to compute close bounds on Rmin(H).

The evolutionary history of a set of haplotypes H, which evolve by site mutations (assuming the infinite sites model) and recombination, is displayed on a directed acyclic graph called a "Phylogenetic Network" or an "Ancestral Recombination Graph (ARG)". For a formal definition of these graphs, see Gusfield, et al. 12 or Gusfield 11.

In most of the results in this paper the concept of site incompatibility is fundamental. Given a haplotype matrix H, two sites (columns) p and q in H are said to be incompatible if and only if there are four rows in H where columns p and q contain all four of the ordered pairs 0,1; 1,0; 1,1; and 0,0. The test for the existence of all four pairs is called the "four-gamete test" in the population genetics literature. The classic Perfect Phylogeny theorem is that there is a phylogenetic network without recombination (and hence a tree) that derives haplotypes H, if and only if there is no incompatible pair of sites in H. An HI solution H for G is called a "PPH solution" if no pair of sites in matrix H is incompatible. The problem of determining if there is a PPH solution can be solved in linear time 9, 32.
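The four-gamete test is direct to state in code (a sketch of ours); it is reused by the later sketches.

```python
def incompatible(col_p, col_q):
    """Four-gamete test: sites p and q are incompatible iff the two
    columns exhibit all four gametes 00, 01, 10, 11 across the rows."""
    gametes = set(zip(col_p, col_q))
    return gametes == {(0, 0), (0, 1), (1, 0), (1, 1)}

# Four haplotypes realizing all four gametes at two sites:
print(incompatible([0, 0, 1, 1], [0, 1, 0, 1]))  # True
```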

3. RECOMBINATION LOWER BOUNDS OVER GENOTYPES

Let L denote a particular recombination lower bound method that works on haplotype data, and let L(H) be the lower bound given by L when applied to haplotype matrix H. That is, L(H) ≤ Rmin(H). Given a genotype matrix G, we define MinL(G) as the minimum value of L(H) taken over every HI solution H for G, that is, over all haplotype matrices that generate G. Similarly, we define MaxL(G) by changing "minimum" to "maximum" in the definition. The two quantities, MinL(G) and MaxL(G), precisely define the range of results that method L will produce, over all possible HI solutions for G. Note that MinL(G) ≤ L(H*) ≤ Rmin(H*), where H* is the true (but unknown) set of haplotypes that gives rise to G, but it is not true that Rmin(H*) ≤ MaxL(G). Rather, L(H*) ≤ MaxL(G).

The motivation for wanting to know MaxL(G) may be a bit unintuitive. One situation is where we are interested in the amount of recombination that must have occurred in the generation of the true haplotypes underlying the observed genotypes G, and the available tool for studying recombination levels is the ability to compute L(H) given haplotypes H. An obvious question in this situation is whether it is valuable to expend additional resources to better determine the true H (in the laboratory or by collecting more data). The difference MaxL(G) − MinL(G) indicates the most that can be learned about recombination (through the use of L(H)), even with additional efforts to learn the true H. In particular, if the difference is small, determination of the true H has little value in this context, and if the difference is large, MaxL(G) bounds the most that can be learned, even if the true H is known.

In the next three sections we develop lower bound methods that work on a genotype matrix G. These methods will also be useful in Section 4, where we develop a method to build a minimum ARG for a set of genotypes.

3.1. The case of the Hudson-Kaplan (HK) lower bound

The first and best-known lower bound on Rmin(H) is the HK bound 19. When L is the HK method, we use MinHK(G) and MaxHK(G) in place of MinL(G) and MaxL(G). Previously, Wiuf 38 showed that MinHK(G) can be computed in polynomial time. In this section, we show that MaxHK(G) can also be computed in polynomial time. We first have to define the incompatibility graph and to briefly describe the HK bound and method.

The "incompatibility graph" IG(H) for H is a graph containing one node for each site in H, and an edge connecting two nodes p and q if and only if sites p and q are incompatible. We will refer to a node of IG{H) and to the site of G it corresponds to, interchangeably. The HK lower bound on Rmin(H) can be described and computed as follows: Arrange the nodes of IG(H) on a line in the order that the corresponding sites appear in the underlying chromosome. Then compute the maximum number of non-overlapping edges in the embedded graph IG(H). Two edges that only share a single node are still non-overlapping. The computed number is the HK bound, denoted HK(H). It is easy to establish that HK(H) < Rmin(H), and that HK{H) can be computed in time that is linear in the number of edges of IG(H).

3.1.1. Efficient algorithm for MaxHK(G)

Given a genotype matrix G, we define the maximal incompatibility graph for G, denoted MIG(G), as follows: each node in MIG(G) corresponds to a site in G, and there is an edge between nodes p and q if there exists an HI solution H for G such that the pair p, q is incompatible. Note that the existence of an edge (p, q) is determined independently of all other pairs of sites; the HI solution that is used for one pair can be completely different from the HI solution used for another pair. Therefore, we only need to look at sites p and q to determine if edge (p, q) is in MIG(G). Graph MIG(G) is a supergraph of every incompatibility graph IG(H) where H is an HI solution for G. For example, suppose sites p and q in G contain the four rows

0 0
0 1
1 0
2 2

Then there is an edge (p, q) in MIG(G): the first three rows already exhibit the gametes 0,0; 0,1 and 1,0, and we can phase the 2's in row four so that one haplotype receives 1,1 (and the other 0,0), making p, q incompatible. We now describe the algorithm.
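Since the edge test depends only on the two columns, it can be sketched directly (a sketch of ours, not the paper's code): a row heterozygous at both sites contributes the gamete pair {00, 11} or {01, 10} depending on how it is phased, and every other row contributes fixed gametes.

```python
def mig_edge(col_p, col_q):
    """Edge test for MIG(G): can the 2's in genotype columns p and q be
    phased so that the pair fails the four-gamete test (is incompatible)?"""
    base = set()   # gametes present regardless of phasing
    het2 = 0       # rows heterozygous at both sites
    for a, b in zip(col_p, col_q):
        if a == 2 and b == 2:
            het2 += 1                   # cis gives {00,11}, trans gives {01,10}
        elif a == 2:
            base |= {(0, b), (1, b)}    # both expansions of the single 2
        elif b == 2:
            base |= {(a, 0), (a, 1)}
        else:
            base.add((a, b))
    all4 = {(0, 0), (0, 1), (1, 0), (1, 1)}
    if het2 >= 2:
        return True   # one cis row plus one trans row yield all four gametes
    if het2 == 1:
        return (base | {(0, 0), (1, 1)}) == all4 or (base | {(0, 1), (1, 0)}) == all4
    return base == all4
```

Given the edges of MIG(G), step 2 of Algorithm MaxHK below is the same greedy scan over non-overlapping edges used for HK(H) above.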

Algorithm MaxHK

1. Construct MIG(G) for input data G.

2. Arrange the nodes of MIG(G) on a line, in the order that the sites appear in the underlying chromosome. Then find a maximum-size set, EG, of non-overlapping edges in MIG(G). We claim that |EG| = MaxHK(G).

Time analysis: The first step takes O(nm²) time. The second step takes O(m²) time. Thus, the algorithm runs in O(nm²) time.

Correctness: For every HI solution H for G, |EG| ≥ HK(H) because IG(H) is a subgraph of MIG(G). Therefore, |EG| ≥ MaxHK(G). To show the converse, it is sufficient to show that |EG| ≤ HK(H) for some HI solution H for G. This is not immediate because it is not necessarily true that MIG(G) = IG(H) for some HI solution H for G. But if we can find an HI solution H for G where all the edges of EG are in IG(H) (where they will be non-overlapping), then |EG| ≤ HK(H). The edges in EG induce a graph; consider one of the connected components, C, of that graph. Because the edges in EG are non-overlapping and C is a connected component, the edges in C form a simple connected path along the nodes in C, ordered from left to right in the embedded MIG(G). Let s1, s2, ..., sk denote the ordered nodes in C. To construct the desired H, we first phase sites s1, s2 to make the pair s1, s2 incompatible (that is possible since edge (s1, s2) is in MIG(G)). Now we move to site s3. We want to make the pair s2, s3 incompatible, but we have already chosen how s2 will be phased with respect to s1. The critical observation is that this prior decision does not constrain the ability to make the pair s2, s3 incompatible, although one has to pay attention to how s2 was phased. In choosing how to phase s3 relative to s2, the only rows in G where a phasing choice has any effect on whether the pair s2, s3 will be incompatible are the rows where both those sites have value 2 in the genotype matrix G. For one such row k of G, suppose we need to phase the 2's in s2, s3 to produce the pair 0,1 or the pair 1,0 or both, in order to make the pair s2, s3 incompatible. (The case where we need 0,0 and/or 1,1 is similar and omitted.) If column s2 (for row k) has been phased with its 0 on the first haplotype, we phase s3 (for row k) with its 1 on the first haplotype; otherwise, we phase s3 with its 0 on the first haplotype. In either case, we will produce the needed binary pairs in sites s2, s3 for row k. Similarly, we can follow the same approach to phase sites s4, ..., sk, making each consecutive pair of sites incompatible.

In this way, we can construct a haplotyping solution H for G where all the edges of EG (and possibly more) appear in IG(H), and hence |EG| ≤ HK(H) ≤ MaxHK(G). But since |EG| ≥ MaxHK(G), |EG| = MaxHK(G), completing the proof of the correctness of Algorithm MaxHK.

3.2. The case of connected-component lower bound

A "non-trivial" connected component, C, of a graph is a connected component that contains at least one edge. A trivial connected component has only one node, and no edges. For a graph / , we use cc(I) to denote the number of non-trivial connected components in graph I. It has previously been established 13, 1 that for a haplotype matrix H, cc{IG{H)) < Rmin(H), and that this lower bound can be, but is not always, superior to the HK bound when applied to specific haplotype matrices. Therefore, for the same reasons we want to compute MinHK{G) and MaxHK(G), we define MinCC(G) and MaxCC(G) respectively as the minimum and maximum values of cc(IG(H)) over every HI solution H for G. In this section we show

that MinCC{G) can be computed in polynomial time by Algorithm MinCC, using an idea similar to one used for MaxHK(G). The problem of efficiently computing MaxCC{G) is currently open.
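On haplotypes the CC bound itself is easy to compute; a sketch of ours, using a small union-find over the incompatible pairs (reusing `incompatible` from above):

```python
def cc_bound(H):
    """Connected-component lower bound cc(IG(H)) <= Rmin(H): the number of
    non-trivial connected components of the incompatibility graph."""
    m = len(H[0])
    cols = [[row[c] for row in H] for c in range(m)]
    parent = list(range(m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    nontrivial = set()
    for p in range(m):
        for q in range(p + 1, m):
            if incompatible(cols[p], cols[q]):
                nontrivial.update((p, q))   # both endpoints are non-trivial
                parent[find(p)] = find(q)
    return len({find(x) for x in nontrivial})
```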

Algorithm MinCC

1. Given genotype matrix G, construct the graph MIG(G) and remove all trivial components.

2. For each remaining component C, let G(C) be the matrix G restricted to the sites in C. For each such C, determine if there is a PPH solution for G(C), and remove component C if there is a PPH solution for G(C).

3. Let Kc be the number of remaining connected components. We claim that Kc = MinCC(G).

Time analysis: Constructing MIG(G) takes O(nm²) time. Finding all components takes O(m) time. Checking all components for PPH solutions takes O(nm) time. Thus, the entire algorithm takes O(nm²) time.

Correctness: We first argue that cc(IG(H)) ≥ Kc for every HI solution H for G. Let H be an arbitrary HI solution for G, and consider one of the Kc remaining connected components, C, found by the algorithm. Since G(C) does not have a PPH solution, there must be at least one incompatible pair of sites in H, and so at least one edge in C must also be in IG(H). Further, since IG(H) is a subgraph of MIG(G), every connected component of IG(H) must be completely contained in a connected component of MIG(G). Therefore, there must be at least one non-trivial connected component of IG(H) contained in C, and so cc(IG(H)) ≥ Kc.

To finish the proof of correctness, it suffices to find an HI solution H' for G where cc(IG(H')) = Kc. Note that we can phase the sites in each connected component of MIG(G) separately, assured that no pair of sites in different components will be made incompatible. This is due to the maximality of connected components and the definition of MIG(G). To begin the construction of H', for a non-trivial component C of MIG(G) where G(C) has a PPH solution, we phase the sites in C to create a PPH solution. As a result, none of those sites will be incompatible with any other sites in G. Next we phase the sites of one of the Kc remaining components, C, so that in H' the nodes of C form a connected component of IG(H'). To do this, first find an arbitrary rooted, directed spanning tree T of C. Then phase the site at the root and one of its children in T so that those two sites are made incompatible. Any other site can be phased as soon as its unique parent site has been phased. As in the proof of correctness for Algorithm MaxHK, and because each node has a unique parent, each site can be phased to be made incompatible with its parent site, no matter how that parent site was phased. The result is that all the sites of C will be in a single connected component of IG(H'), so Kc ≥ cc(IG(H')). But cc(IG(H)) ≥ Kc for every HI solution H for G, so MinCC(G) = Kc, and the correctness of Algorithm MinCC is proved.

Final comments on the polynomial-time methods

Above, we developed polynomial-time methods to compute MaxHK(G) and MinCC(G), given genotypes G. These are two specific cases of our interest in efficiently computing MinL(G) and MaxL(G) for different lower bounding methods L that work on haplotypes. Clearly, for the best application of such numerical values, we would like to compute MinL(G) and MaxL(G) for the lower bound methods L that obtain the highest lower bounds on Rmin(H) when given haplotypes H. The HK and CC lower bounds are not the best, but they are of interest because they allow provably polynomial-time methods to compute MinHK(G), MaxHK(G) and MinCC(G). Those results contribute to the theoretical study of lower bound methods, and may help to obtain polynomial-time, or practical, methods for better lower bound methods. In the next section we discuss a practical method (on moderate-size data) to compute better lower bounds given genotypes.

3.3. Parsimony-based lower bound

One of the most effective methods to compute lower bounds on Rmin(H), for a haplotype matrix H, was developed in Myers, et al. 30, further studied in Bafna, et al. 2, and optimized in Song, et al. 34. All of the methods in those papers produce lower bounds on Rmin(H) that are much superior to HK(H) and CC(H), particularly when n > m. Therefore, given G, we would like to compute the minimum and/or maximum of these better bounds over all HI solutions for G. Unfortunately, we do not have a polynomial-time method for that problem, and we presently solve it only for very small data. However, we have developed a lower bounding method that works on genotype matrices of moderate size, using an idea related to the cited methods, and we have observed that when n > m, the lower bound obtained is often much superior to MinHK(G) and MinCC(G).

All the lower bound methods in the papers cited above work by first finding (local) lower bounds for (selected) intervals or subsets of sites in H, and then combining those local bounds to form a composite lower bound on Rmin(H). The composition method was developed in Myers, et al. 30 and is the same for all of the methods. What differs between the methods is the way local bounds are computed. We do not have space to fully detail the methods, but all the local bounds are computed with some variation of the following idea 30: let Hap(H) be the number of distinct rows of H, minus the number of distinct columns, minus 1. Then Hap(H) ≤ Rmin(H). Hap(H) is called the Haplotype lower bound. When applied to the entire matrix H, Hap(H) is often a very poor lower bound, but when it is used to compute many local lower bounds in small intervals, and these local bounds are combined with the composition method, the overall lower bound on Rmin(H) is generally quite good.

Similar to the methods that work on haplotype data, given a genotype matrix G, we compute relaxed Haplotype lower bounds for many small intervals, and then use the composition method to create an overall number Ghap(G), which is a lower bound on the minimum of Rmin(H) over every HI solution H for G. Of course, to be of value, Ghap(G) must be larger than MinHK(G) and MinCC(G) for a large range of data.

We now explain how we compute the local bounds in G that combine to create Ghap(G). When restricted to the sites in an interval, we have a submatrix G' of G. An HI solution H' for a genotype matrix G' is called a "pure parsimony" solution if it minimizes the number of distinct haplotypes used, over all HI solutions for G'. If the number of distinct haplotypes in a pure parsimony HI solution for G' is p(G'), and G' has m' sites, it is easy to show that p(G') − m' − 1 ≤ Rmin(H') for any HI solution H' for G'. We call this bound Par(G'). To compute Ghap(G), we compute the local bound Par(G') for each submatrix of G defined by an interval of sites of G, and then combine those local bounds using the composition method from Myers, et al. 30. It is easy to show that Ghap(G) ≤ Rmin(H) for every HI solution H for G.

The problem of computing a pure parsimony haplotyping solution is known to be NP-hard 17, 22, so computing Par(G') is also NP-hard. But a pure parsimony HI solution can be found relatively efficiently in practice on datasets of moderate size by using integer linear programming 10. Other papers have shown how to solve the problem on larger datasets 4, 5. Therefore, each local Par(G') bound can be computed in practice when the size of G' is moderate, and so Ghap(G) can be computed in practice for a wide range of data.
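To pin down what Par(G') measures, here is a brute-force sketch of ours for a tiny interval submatrix; it enumerates all phasings, so it is exponential and purely illustrative (the paper instead uses the integer-programming approach of Ref. 10).

```python
from itertools import product

def par_bound(Gp):
    """Par(G') = p(G') - m' - 1 by brute force: p(G') is the minimum number
    of distinct haplotypes over all HI solutions for the small matrix Gp."""
    def expansions(g):
        # all 2^k phasings of the k heterozygous sites of genotype row g
        het = [c for c, x in enumerate(g) if x == 2]
        for bits in product([0, 1], repeat=len(het)):
            h1, h2 = list(g), list(g)
            for c, b in zip(het, bits):
                h1[c], h2[c] = b, 1 - b
            yield tuple(h1), tuple(h2)

    best = None
    for choice in product(*(list(expansions(g)) for g in Gp)):
        used = len({h for pair in choice for h in pair})
        best = used if best is None else min(best, used)
    return best - len(Gp[0]) - 1
```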

Our experiments show that Ghap(G) is often smaller than MinHK(G) or MinCC(G) when n < m and when the recombination rate is low. However, when n increases, Ghap(G) becomes higher than MinHK(G) or MinCC(G). Our simulations show that for datasets with 20 genotypes and 20 sites, Ghap(G) is larger than MinHK(G) or MinCC(G) for over 80% of the data. As an example, a real biological dataset (from Orzack, et al. 31) has 80 rows and 9 sites. MinHK(G) = MinCC(G) = 2, while Ghap(G) is 5 (which is equal to Rmin(G), as shown in Section 5.3).

4. CONSTRUCTING A MINIMUM ARG FOR GENOTYPE DATA USING BRANCH AND BOUND

In this section, we consider the problem of constructing an ancestral recombination graph (ARG) that derives an HI solution H for the genotype matrix G and uses the fewest recombinations. We call such an ARG a minimum ARG for G, and denote the minimum number of recombinations in this ARG by Rmin(G). Formally:

Haplotyping on a minimum ARG: Given genotype data G, find an HI solution H for G such that we can derive H on an ARG with the fewest number of recombinations. Here, as usual, we assume the infinite sites model of mutations.

It is easy to see that this problem is difficult. After all, there is no known efficient algorithm for constructing the minimum ARG for haplotype data 37, 3, and haplotype data can be considered a special case of genotype data. Here, we show that under certain conditions we can solve this problem by a branch and bound method. The intuition for our method comes from the concept of the hypercube of length-m binary sequences.

Note that there are up to 2^m possible sequences in the hypercube that can appear on an ARG that derives an HI solution for G. Conceptually, we can build the ARG as follows. We start from every sequence node in the hypercube as the root of the ARG. Each time, we try all possible ways of deriving a new sequence by (1) an (unused) mutation from a derived sequence, or (2) a recombination of two derived sequences. The ARG grows as we derive new sequences. Once the ARG derives an HI solution for G, we have found an ARG that is potentially the solution. We can find the minimum ARG by searching through all possible ways of deriving new sequences and finding the ARG with the smallest number of recombinations.

Directly applying the above idea is not practical as the data size increases. We therefore developed a practical method using branch and bound. We start building the ARG from a sequence chosen as the root. At each step, we maintain the set of sequences that have been derived. We also maintain the best ARG found so far, i.e. the ARG that derives an HI solution for G and uses the smallest number of recombinations (denoted Rmin). We derive a new sequence by a recombination of two already derived sequences or by an unused mutation from a derived sequence. We check whether the current ARG derives an HI solution. If so, we store this solution if this ARG uses fewer recombinations than Rmin. If not, we compute a lower bound on the minimum number of recombinations needed to derive an HI solution, given the choices made along the search path. If the lower bound is not smaller than Rmin, we know the current partially built ARG cannot lead to a better solution, and we stop this search path. Otherwise, we continue to derive more sequences from the currently derived sequences. We illustrate the basic procedure of the branch and bound method in Algorithm GenoMinARG.

Algorithm GenoMinARG

1. Root. We maintain a set of sequences called the derived set (containing the sequences that are part of the ARG built so far). Initialize the derived set with a binary sequence sr as the root of the ARG. Maintain a variable Rmin as the currently known minimum number of recombinations. Initialize Rmin to ∞ (or some pre-computed upper bound).

2. Deriving sequences. Repeat until all search paths are explored or terminated. Then return to Step 1 if there are more root sequences to try; stop the algorithm otherwise.

2.1 Through either a recombination or an (unused) mutation from sequences in the derived set, grow the derived set by deriving a new sequence.

2.2 Check whether the derived set contains an HI solution. If so, stop this search path. Denote the number of recombinations in this ARG by Rminc. If Rminc < Rmin, set Rmin ← Rminc. Continue with the next branch.

2.3 If the recombination lower bound (with the current derived haplotypes) is at least Rmin, stop this search path and continue with the next search. Otherwise, follow this branch and continue at step 2.1.

The key to the success of the branch and bound method is the use of the genotype recombination lower bounds presented earlier. We use the lower bound methods of Section 3 and also improve them with some additional ideas, which speed up the method significantly. We omit the details due to space limits.

Remarks. The branch and bound method works for many datasets with up to 8 sites. The method is still useful because there are real biological datasets that contain a small number of sites and many rows. We provide such an example in Section 5.3.

5. APPLICATIONS

5.1. Detecting recombination hotspots using genotype data

Recombination rates are often believed to vary significantly across a genome. A recombination hotspot refers to a genomic region where the recombination rate is much higher than in its neighboring regions. Detecting recombination hotspots is important for many applications, e.g. association mapping, and has been actively studied recently, for example in Myers, et al. 29. Bafna and Bansal 2 applied recombination lower bounds based on haplotypes to reveal recombination hotspots. Their results on the MHC data 20 and the MS32 data 21 indicate that recombination lower bounds may be useful in identifying the locations and intensity of recombination hotspots.

However, computing recombination lower bounds on haplotype data has a potential problem. The real biological data (such as the MHC 20 and MS32 21 data) are genotypes. It is not clear what effect haplotyping error has on recombination hotspot detection. Here, we compute the exact minimum number of recombinations for small intervals of the genotype data, and thus effectively remove the haplotyping uncertainty from our results.

Given a set of genotypes G, we move a sliding window containing a small number of (say 6) SNPs. We denote the submatrix within a window by Ḡ. For each window, we compute Rmin(Ḡ). Each time, we move the window by half of its width. We use Rmin(Ḡ) on these intervals to calculate the average minimum number of recombinations per kb along the genome. We first analyze the MHC data. After removing missing values, there are 277 (out of 296) SNPs and 50 genotypes. On a Pentium 2.0 GHz machine, the computation takes 242 seconds when the window size is 5, and 2865 seconds when the window size is set to 6. For a few intervals, the computation of the exact minimum number of recombinations is slow; we simply time out and use an efficiently computed recombination lower bound instead for those intervals.
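Schematically, the scan looks as follows (a sketch of ours; `rmin_genotypes` is a hypothetical stand-in for the exact computation of Section 4, with the time-out fallback omitted, and `positions_kb` is an assumed vector of SNP positions in kb).

```python
def hotspot_profile(G, positions_kb, rmin_genotypes, width=6):
    """Slide a window of `width` SNPs across genotype matrix G, advancing
    by half a window, and report Rmin per kb for each window."""
    profile = []
    step = max(1, width // 2)
    for start in range(0, len(G[0]) - width + 1, step):
        window = [row[start:start + width] for row in G]
        span_kb = positions_kb[start + width - 1] - positions_kb[start]
        if span_kb > 0:
            profile.append((positions_kb[start],
                            rmin_genotypes(window) / span_kb))
    return profile
```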

Fig. 1. Detecting recombination hotspots using genotype data for the MHC data, using a sliding window of 6 SNPs; position 0 is assigned to the leftmost SNP. (The plot shows the average minimum number of recombinations per kb versus the physical distance in kb from the leftmost SNP; the DNA1/2, DNA3, DMB1/2 and TAP2 hotspot regions stand out.) The results match the ones based on haplotypes.

Figure 1 plots the average minimum number of recombinations per kb across the region of interest. The results we obtained by computing over genotypes match the results over haplotype data quite well. The DNA1/2, DNA3, DMB1/2 and TAP2 hotspots are clearly identifiable and match the results in Bafna, et al. 2 quite well. We also tested our method on the MS32 data 21; the result is shown in Figure 2. Again, we see good matches with the results reported in Bafna, et al. 2. Five hotspots are identifiable: NID2, NID1, MS32, MSTM1 and MSTM2. As in Bafna, et al. 2, NID3 is not significant and is not detected. Overall, our results from computing the minimum number of recombinations on genotype data match the results in Bafna, et al. 2 quite well.

5.2. Comparing minimum recombination on genotypes and haplotypes

In Section 4, we demonstrated that for certain genotype data we can build a minimum ARG that derives an HI solution explaining the given genotypes. This allows us to compare the minimum recombination on genotypes, on the original haplotypes, and on HI solutions found by the program PHASE.

We generated simulated genotype data as follows. We first run Hudson's program MS 18 to generate 2n haplotypes (denoted H0). We choose various scaled recombination rates (denoted ρ) when running MS. A genotype matrix G with n rows is then generated by pairing haplotype 2i with haplotype 2i − 1. We fix the number of sites in the sequences to a small number, say 7. We run our method on G and obtain an HI solution H. We compare H with the original haplotypes H0. As a comparison, we run the program PHASE on G, and compare the PHASE HI solution (denoted Hp) with H0. For each data size and scaled recombination rate, we generate 100 datasets.
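The pairing step is simple enough to state as code (a sketch of ours):

```python
def genotypes_from_haplotypes(H0):
    """Pair consecutive haplotypes (rows 2i-1, 2i) into one genotype row:
    equal entries stay as 0/1, unequal entries become 2."""
    return [[a if a == b else 2 for a, b in zip(H0[i], H0[i + 1])]
            for i in range(0, len(H0), 2)]

G = genotypes_from_haplotypes([[0, 1, 1], [0, 0, 1],
                               [1, 1, 0], [0, 1, 0]])
print(G)  # [[0, 2, 1], [2, 1, 0]]
```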

For each dataset, we compute Rmin(G), the minimum number of recombinations on G. We also compute Rmin(H0) using the program beagle 24. As a comparison, we compute Rmin(Hp). Note that Rmin(G) ≤ Rmin(H0) and Rmin(G) ≤ Rmin(Hp).

Table 1 shows the comparison among Rmin(G), Rmin(H0) and Rmin(Hp) for various data sizes and ρ. One can see that for a large portion of the simulated data, Rmin(G) = Rmin(H0). Thus, haplotyping on a minimum ARG may be an effective approach for the range of data we tested. Another interesting observation concerns the performance of PHASE. PHASE is known to be quite accurate for many datasets. Our simulations here show that PHASE tends to minimize the number of recombinations to some extent, at least implicitly. Our simulation results show that often, but not always, Rmin(Hp) is between Rmin(G) and Rmin(H0).

From the simulations we performed, we observed that for some data, Rmin(H0) is much closer to Rmin(G) than to Rmin(Hp). But on average, Rmin(H0) is usually closer to Rmin(Hp) than to Rmin(G). Overall, these experiments suggest that for the downstream question of computing the minimum number of recombinations, the two-stage approach of first using PHASE to obtain an HI solution Hp and then computing Rmin(Hp) is an effective approach for the range of data we tested. We point out, however, that this conclusion would not have been possible without our method of computing Rmin(G), which allows one to study this issue.

Fig. 2. Detecting recombination hotspots using genotype data for the MS32 data, using a sliding window of 6 SNPs. (The plot shows the average minimum number of recombinations per kb versus the physical distance in kb from the leftmost SNP; the NID2, NID1, MS32, MSTM1 and MSTM2 hotspot regions stand out.)

Table 1. Comparison of Rmin(H0), Rmin(G) and Rmin(Hp). Here, H stands for Rmin(H0), G stands for Rmin(G) and P stands for Rmin(Hp). The data size is the number of genotype rows by the number of sites; ρ is the scaled recombination rate in Hudson's program MS. The next three columns display the percentage of datasets where the two compared minimum numbers of recombinations are equal. For example, for data 15 by 7, ρ = 40, 43% of the datasets have equal Rmin(G) and Rmin(H0). The two rightmost columns show the average difference between Rmin(G) (resp. Rmin(Hp)) and Rmin(H0), averaged over the 100 datasets.

Data size   ρ    G,H   P,H   G,P   H−G    H−P
15 by 7     20   56%   68%   63%   0.62   0.45
20 by 7     20   45%   60%   72%   0.68   0.34
15 by 7     30   38%   60%   56%   0.99   0.32
20 by 7     30   44%   56%   68%   0.79   0.34
15 by 7     40   43%   46%   54%   0.87   0.20
20 by 7     40   46%   47%   76%   0.80   0.48


5.3. Haplotyping on a minimum ARG

Since we can build a minimum ARG for the input genotypes, we can construct an HI solution from this minimum ARG, since the ARG gives a set of haplotypes that explains the genotypes. Naturally, we want to study the haplotyping accuracy of this minimum-ARG method. Here, we use the same simulated data as in Section 5.2. Table 2 shows the haplotyping accuracy of our method, and Table 3 shows the results of the program PHASE on the same datasets. We use the following three haplotyping accuracy measures: (a) the standard error 35 is the percentage of incorrectly phased genotypes; (b) the switching error 23 is related to incorrectly phased neighboring heterozygotes; and (c) the percentage of mis-phased 2's. The results show that our method is comparable in accuracy to the solutions produced by PHASE. Sometimes our method produces an HI solution better than PHASE's, although statistically PHASE is still slightly more accurate for the range of data we tested. This indicates that finding a single minimum ARG that derives an HI solution may not be enough to produce more accurate HI solutions than PHASE. Finding an ARG, either a minimum one or a near-minimum one, that gives the best haplotyping accuracy remains an interesting research problem.


Table 2. Accuracy of haplotyping on a minimum ARG. The results are averaged over 100 datasets for each parameter setting. The accuracy measures include the standard error, the switching error, and the percentage of mis-phased 2's. One can see that the min-ARG approach is comparable to PHASE but underperforms it slightly.

min-ARG        ρ=20            ρ=30            ρ=40
               15x7    20x7    15x7    20x7    15x7    20x7
Std. err.      0.269   0.237   0.336   0.276   0.381   0.333
Switch         0.801   0.821   0.750   0.799   0.721   0.752
% of mis-2     10.98   10.00   14.11   11.63   15.90   13.90

Table 3. Accuracy of the program PHASE on the same datasets. The results are averaged over 100 datasets for each parameter setting.

PHASE          ρ=20            ρ=30            ρ=40
               15x7    20x7    15x7    20x7    15x7    20x7
Std. err.      0.256   0.214   0.297   0.265   0.349   0.281
Switch         0.804   0.834   0.787   0.811   0.753   0.793
% of mis-2     10.65   8.98    12.13   11.01   14.49   11.65

Finally, we test our method with the APOE data from Orzack, et al. 31. This data has 47 non-trivial genotypes (i.e., genotypes containing more than one 2) and 9 sites. This genotype data has a real solution (i.e., experimentally determined phases). Program PHASE produces a solution within 16 seconds, with 4 incorrectly phased genotypes. Our method takes about 2 minutes to find an HI solution with 5 incorrectly phased genotypes, performing slightly worse than PHASE. A benefit of our method is that it computes the minimum number of recombinations for the given genotype data (under the infinite sites model). For this data, Rmin(G) = 5. We also note that Rmin(Hp) = 6 and, for the real solution, Rmin(H0) = 7. This is a small indication that haplotyping on a minimum (or near-minimum) ARG may be useful, i.e., real data may be derived on a minimum or near-minimum ARG.

Acknowledgments. The authors wish to thank Yun S. Song for sharing simulation scripts. Work supported by grants CCF-0515278 and IIS-0513910 from the National Science Foundation.

References

1. V. Bafna and V. Bansal. The number of recombination events in a sample history: conflict graph and lower bounds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1:78-90, 2004.
2. V. Bafna and V. Bansal. Improved recombination lower bounds for haplotype data. In Proceedings of RECOMB 2005, LNBI Vol. 3500, pages 569-584. Springer, 2005.
3. M. Bordewich and C. Semple. On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics, 8:409-423, 2004.
4. D. Brown and I. Harrower. A new integer programming formulation for the pure parsimony problem in haplotype analysis. In Proc. of 2004 Workshop on Algorithms in Bioinformatics, Berlin, Germany, 2004. Springer-Verlag LNCS.
5. D. Brown and I. Harrower. A new formulation for haplotype inference by pure parsimony, report cs-2005-03. Technical report, University of Waterloo, School of Computer Science, 2005.
6. A. Clark. The role of haplotypes in candidate gene studies. Genetic Epidemiology, 27:321-333, 2004.
7. International HapMap Consortium. The HapMap project. Nature, 426:789-796, 2003.
8. International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299-1320, 2005.
9. Z. Ding, V. Filkov, and D. Gusfield. A linear-time algorithm for the perfect phylogeny haplotyping problem. In Proceedings of RECOMB 2005, LNBI Vol. 3500, pages 585-600. Springer, 2005.
10. D. Gusfield. Haplotype inference by pure parsimony. In R. Baeza-Yates, E. Chavez, and M. Crochemore, editors, 14th Annual Symposium on Combinatorial Pattern Matching (CPM'03), volume 2676 of Springer LNCS, pages 144-155, 2003.
11. D. Gusfield. Optimal, efficient reconstruction of Root-Unknown phylogenetic networks with constrained and structured recombination. JCSS, 70:381-398, 2005.
12. D. Gusfield, S. Eddhu, and C. Langley. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinformatics and Computational Biology, 2(1):173-213, 2004.
13. D. Gusfield, D. Hickerson, and S. Eddhu. An efficiently-computed lower bound on the number of recombinations in phylogenetic networks: Theory and empirical study. To appear in Discrete Applied Math, special issue on Computational Biology.
14. E. Halperin, G. Kimmel, and R. Shamir. Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics, 21:i195-i203, 2005. Bioinformatics Suppl. 1, Proceedings of ISMB 2005.
15. J. Hein, M. Schierup, and C. Wiuf. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, UK, 2005.
16. D. Hinds, L. Stuve, G. Nilsen, E. Halperin, E. Eskin, D. Ballinger, K. Frazer, and D. Cox. Whole-genome patterns of common DNA variation in three human populations. Science, 307:1072-1079, 2005.
17. E. Hubbel. Personal communication, August 2000.
18. R. Hudson. Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337-338, 2002.
19. R. Hudson and N. Kaplan. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111:147-164, 1985.
20. A. Jeffreys, L. Kauppi, and R. Neumann. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nature Genetics, 29:217-222, 2001.
21. A. Jeffreys, R. Neumann, M. Panayi, S. Myers, and P. Donnelly. Human recombination hot spots hidden in regions of strong marker association. Nature Genetics, 37:601-606, 2005.
22. G. Lancia, C. Pinotti, and R. Rizzi. Haplotyping populations by pure parsimony: Complexity, exact and approximation algorithms. INFORMS J. on Computing, special issue on Computational Biology, 16:348-359, 2004.
23. S. Lin, D. Cutler, M. Zwick, and A. Chakravarti. Haplotype inference in random population samples. Am. J. of Hum. Genet., 71:1129-1137, 2003.
24. R. Lyngso, Y. Song, and J. Hein. Minimum recombination histories by branch and bound. In Proceedings of the Workshop on Algorithms in Bioinformatics (WABI) 2005, volume 3692, pages 239-250.
25. J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z. Qin, H. Munro, G. Abecasis, and P. Donnelly. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Human Genetics, 78:437-450, 2006.
26. G. McVean, S. Myers, S. Hunt, P. Deloukas, D. Bentley, and P. Donnelly. The fine-scale structure of recombination rate variation in the human genome. Science, 304:581-584, 2004.
27. A. Morris, J. Whittaker, and D. Balding. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am. J. Human Genetics, 74:945-953, 2004.
28. R. Morris and N. Kaplan. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genetic Epidemiology, 23:221-233, 2002.
29. S. Myers, L. Bottolo, C. Freeman, G. McVean, and P. Donnelly. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310:321-324, 2005.
30. S. R. Myers and R. C. Griffiths. Bounds on the minimum number of recombination events in a sample history. Genetics, 163:375-394, 2003.
31. S. Orzack, D. Gusfield, J. Olson, S. Nesbitt, L. Subrahmanyan, and V. P. Stanton. Analysis and exploration of the use of rule-based algorithms and consensus methods for the inferral of haplotypes. Genetics, 165:915-928, 2003.
32. R. V. Satya and A. Mukherjee. An optimal algorithm for perfect phylogeny haplotyping. In Proceedings of the 4th CSB Bioinformatics Conference, Los Alamitos, CA, 2005. IEEE Press.
33. P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Human Genetics, 78:629-644, 2006.
34. Y. Song, Y. Wu, and D. Gusfield. Efficient computation of close lower and upper bounds on the minimum number of needed recombinations in the evolution of biological sequences. Bioinformatics, 21:i413-i422, 2005. Bioinformatics Suppl. 1, Proceedings of ISMB 2005.
35. M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Human Genetics, 68:978-989, 2001.
36. B. Voight, S. Kudaravalli, X. Wen, and J. Pritchard. A map of recent positive selection in the human genome. PLoS Biology, 4:e72, 2006.
37. L. Wang, K. Zhang, and L. Zhang. Perfect phylogenetic networks with recombination. Journal of Computational Biology, 8:69-78, 2001.
38. C. Wiuf. Inference of recombination and block structure using unphased data. Genetics, 166:537-545, 2004.


SORTING GENOMES BY TRANSLOCATIONS AND DELETIONS

Xingqin Qi

Department of Applied Mathematics, Shandong University at Weihai,

Weihai, 264213, China
School of Mathematics and System Sciences, Shandong University,

Jinan, 250100, China

Email: [email protected]

Guojun Li

School of Mathematics and System Sciences, Shandong University,

Jinan, 250100, China

Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602, USA

Email: guojun@csbl.bmb.uga.edu

Shuguang Li

School of Mathematics and System Sciences, Shandong University, Jinan, 250100, China

Department of Mathematics and Information Science, Yantai University, Yantai, 264005, China

Email: [email protected]

Ying Xu

Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia 30602, USA Email: [email protected]

Given two signed multi-chromosomal genomes Π and Γ with the same gene set, the problem of sorting by translocations (SBT) is to find a shortest sequence of translocations transforming Π into Γ, where the length of the sequence is called the translocation distance between Π and Γ. In 1996, Hannenhalli gave the formula for the translocation distance for the first time, based on which an O(n³) algorithm for SBT was given. In 2005, Anne Bergeron et al. revisited this problem and gave an elementary proof of the formula for the translocation distance, which leads to a new O(n³) algorithm for SBT. In this paper, we show how to extend Anne Bergeron's algorithm for SBT to include deletions, which allows us to compare genomes containing different genes. We present an asymptotically optimal algorithm for transforming Π into Γ by translocations and deletions, producing a feasible sequence of length at most OPT + 2, where OPT is the minimum number of translocations and deletions transforming Π into Γ. Furthermore, this analysis can be used to approximate the minimum number of translocations and insertions transforming one genome into another.

1. INTRODUCTION

A translocation considered here is always reciprocal: it exchanges non-empty tails between two chromosomes. Given two multi-chromosomal genomes Π and Γ, the problem of sorting by translocations (abbreviated as SBT) is to find a shortest translocation sequence that transforms Π into Γ. SBT was first introduced by Kececioglu and Ravi 1 and was given a polynomial-time algorithm by Hannenhalli 2. Bergeron, Mixtacki and Stoye 3 pointed out an error in Hannenhalli's sorting strategy and gave a new O(n³) algorithm for SBT. Li et al. 4 gave a linear-time algorithm for computing the translocation distance (without producing a shortest sequence). Wang et al. 5 presented an O(n²) algorithm for SBT.

Note that all the above algorithms assume that the two genomes have the same gene content. Such is of course rarely the case in biological practice. In this paper we consider a more general case: when the gene set of Γ is a subset of the gene set of Π. Clearly, in such a case, "deletions" are needed. Write A for the set of genes in both Π and Γ, and write A_Π for those in Π only. We assume that each gene in A appears exactly once in each genome. We will try to compute the minimum number of translocations and deletions transforming Π into Γ, which is denoted dtd(Π, Γ). We present an asymptotically optimal algorithm, which provides a feasible sequence of length at most dtd(Π, Γ) + 2.

The paper is organized as follows. The necessary preliminaries are given in Sections 2 and 3. A lower bound on dtd(Π, Γ) is given in Section 4. In Sections 5 and 6 we give the approximation algorithm and its analysis, respectively. Conclusions are given in Section 7.

2. PRELIMINARIES

As usual, we represent a gene by a positive integer with an associated sign ("+" or "−") reflecting the direction of the gene; the corresponding element is said to be a positive element or a negative element. A chromosome is a sequence of genes and does not have an orientation. A genome is a set of chromosomes.

A chromosome is orientation-less; therefore flipping a chromosome X = (x1, ..., xk) into −X = (−xk, ..., −x1) does not affect the chromosome it represents. Hence, a chromosome X is said to be identical to a chromosome Y iff either X = Y or X = −Y. As a convention, we illustrate a chromosome horizontally and read it from left to right. Genomes Π and Γ are said to be identical if their sets of chromosomes are the same. Let X = (X1, X2) and Y = (Y1, Y2) be two chromosomes, where X1, X2, Y1, Y2 are sequences of genes. A prefix-prefix translocation switches X1 with Y1, resulting in (Y1, X2) and (X1, Y2). A prefix-suffix translocation switches X1 with Y2, resulting in (−Y2, X2) and (Y1, −X1). The resulting genome after applying a translocation ρ to genome Π is denoted Π · ρ. For a chromosome X = (x1, ..., xk), the numbers +x1 and −xk are called the tails of X. The set of tails of all the chromosomes in Π is denoted T(Π). Genomes Π and Γ are co-tailed if T(Π) = T(Γ). Therefore, SBT is limited to genomes that are co-tailed.

In the following, w.l.o.g., we assume that the elements in each chromosome of the target genome Γ are positive and in increasing order. For example, let Π = {(4,3), (1,2,−7,5), (6,−8,9)} and Γ = {(1,2,3), (4,5), (6,7,8,9)}. Readers are assumed to have a thorough understanding of Refs. 2 and 3.
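As an illustration of these operations (a sketch of ours, not code from the paper), with chromosomes represented as Python lists of signed integers:

```python
def flip(X):
    """Flip a chromosome: reverse the gene order and negate the signs."""
    return [-g for g in reversed(X)]

def prefix_prefix(X, Y, i, j):
    """Exchange prefix X[:i] with prefix Y[:j]."""
    return Y[:j] + X[i:], X[:i] + Y[j:]

def prefix_suffix(X, Y, i, j):
    """Exchange prefix X[:i] with suffix Y[j:]; the exchanged pieces flip."""
    return flip(Y[j:]) + X[i:], Y[:j] + flip(X[:i])

# A prefix-prefix translocation on (4, 3) and (1, 2, -7, 5) from the example:
print(prefix_prefix([4, 3], [1, 2, -7, 5], 1, 2))  # ([1, 2, 3], [4, -7, 5])
```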

2.1. The Cycle Graph

For a chromosome X = (x1, ..., xk), replace every positive element +xi by the ordered pair (xi^t, xi^h) of vertices, and replace every negative element −xi by the ordered pair (xi^h, xi^t) of vertices. Vertices u and v are neighbors in X if they are adjacent in the ordered list constructed in this manner. We say that vertices u and v are neighbors in a genome if they are neighbors in some chromosome of this genome. For a gene x, the vertices x^t and x^h are always neighbors, and for simplicity we exclude them from the definition of "neighbors" in the following discussion.

The bicolored cycle graph G(Π, Γ) of Π with respect to Γ, which has the same gene content, is defined as follows. The vertex set V contains the pair of vertices x^t and x^h for every gene x in Π, i.e. V = {u : u is either x^t or x^h, where x is a gene in Π}. Vertices u and v are connected by a black edge iff they are neighbors in Π, and by a gray edge iff they are neighbors in Γ.

A gray edge (u, v) in G(Π, Γ) is interchromosomal if u and v belong to different chromosomes; otherwise it is intrachromosomal. Each vertex has degree either 2 or 0; hence the graph can be uniquely decomposed into a number of disjoint cycles. A cycle is interchromosomal if it contains at least one interchromosomal gray edge; otherwise it is intrachromosomal.
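As an illustration, the following Python sketch (the encoding is ours: vertex x^t is the pair (x, 't') and x^h is (x, 'h')) builds the black and gray edges for two genomes with the same gene content and counts the cycles as connected components:

    def vertices(chrom):
        # Expand a signed chromosome into its ordered vertex list.
        vs = []
        for g in chrom:
            vs += [(g, 't'), (g, 'h')] if g > 0 else [(-g, 'h'), (-g, 't')]
        return vs

    def neighbor_edges(genome):
        # Edges between consecutive genes; the (x^t, x^h) pair inside a
        # gene and the chromosome tails contribute no edge.
        edges = set()
        for chrom in genome:
            vs = vertices(chrom)
            for k in range(1, len(vs) - 1, 2):
                edges.add(frozenset((vs[k], vs[k + 1])))
        return edges

    def count_cycles(pi, gamma):
        adj = {}
        for e in neighbor_edges(pi) | neighbor_edges(gamma):
            u, v = tuple(e)
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        seen, cycles = set(), 0
        for start in adj:                  # each component is one cycle
            if start not in seen:
                cycles += 1
                stack = [start]
                while stack:
                    u = stack.pop()
                    if u not in seen:
                        seen.add(u)
                        stack.extend(adj[u])
        return cycles

    pi = [[4, 3], [1, 2, -7, 5], [6, -8, 9]]
    gamma = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(count_cycles(pi, gamma))  # the number of cycles c in G(Pi, Gamma)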

2.2. The Sub-permutation

A segment is an interval I = x_i, ..., x_j within a chromosome X = x_1, x_2, ..., x_m. Let V_I be the set of vertices induced by the genes in I, i.e., V_I = {u : u is either x_k^t or x_k^h, i ≤ k ≤ j}. We refer to the left vertex corresponding to x_i and the right vertex corresponding to x_j as Left(I) and Right(I), respectively. Define IN(I) = V_I \ {Left(I), Right(I)}. An edge (u, v) in G(Π, Γ) is said to be inside the interval I if u, v ∈ IN(I).

A sub-permutation (SP) is an interval I = x_i, x_{i+1}, ..., x_j within a chromosome X of Π such that (i) there exists no edge (u, v) with u ∈ IN(I), v ∉ IN(I), and (ii) it is not the union of smaller such intervals.

We refer to an SP by giving its first and last elements, such as (x_i...x_j). A minimal sub-permutation (MSP) is an SP not containing any other SP. An SP is trivial if it is of the form (i...i+1) or (−(i+1)...−i); otherwise it is non-trivial.

2.3. The Forest of SPs

Two different SPs of a chromosome are either disjoint, nested with different endpoints, or overlapping on one element. When two SPs overlap on one element, we say that they are linked. Successive linked SPs form a chain. A chain that cannot be extended to the left or right is called maximal. The nesting and linking relations of the SPs on a chromosome can be represented in the following way.

Definition 2.1. Given a chromosome X and its SPs, define the forest F_X by the following construction:

1. Each non-trivial SP is represented by a round node.

2. Each maximal chain that contains a non-trivial SP is represented by a square node whose (ordered) children are the round nodes that represent the non-trivial SPs of this chain.

3. A square node is the child of the smallest SP that contains this chain.

The above definition can be extended to a forest of a genome by combining the forests of all chromosomes:

Definition 2.2. Given a genome Π consisting of chromosomes {X_1, X_2, ..., X_N}, the forest F_Π is the union of the forests F_X_1, ..., F_X_N.

Note that a leaf of F_Π corresponds to an MSP of Π. Denote the numbers of leaves and trees of F_Π by L and T respectively. If T = 1 and L is even, genome Π has an even-isolation. We will refer to an MSP that is a leaf in F_Π simply as a leaf.

2.4. The Translocation Distance

Let ρ(X, Y, i, j) be a translocation acting on chromosomes X = (x_1, ..., x_p) and Y = (y_1, ..., y_q), where the cleavages occur in X between x_{i−1} and x_i, and in Y between y_{j−1} and y_j. Let f ∈ {x_{i−1}^t, x_{i−1}^h} and g ∈ {x_i^t, x_i^h} be such that f and g are neighbors in X, and let u ∈ {y_{j−1}^t, y_{j−1}^h} and v ∈ {y_j^t, y_j^h} be such that u and v are neighbors in Y. Then ρ acts on the black edges (f, g) and (u, v). Let Δc denote the change in the number of cycles after performing a translocation on Π; then Δc ∈ {−1, 0, 1} (Ref. 2). A translocation is proper if Δc = 1, improper if Δc = 0, and bad if Δc = −1.

It is easy to see that each interchromosomal gray edge (u, v) in G(Π, Γ) determines a proper (prefix-prefix or prefix-suffix) translocation ρ of Π, obtained by cutting the two black edges incident on u and v respectively. In the following, as in Ref. 2, we only consider proper translocations determined by interchromosomal gray edges.

We say that a translocation destroys an SP C if C is not an SP in the resulting genome. The only way to destroy an SP with translocations is to apply a translocation with one cleavage in the SP and one cleavage in another chromosome. Such translocations always merge cycles and thus are always bad. Yet a translocation may destroy more than one SP at the same time. In fact, at most two MSPs on two different chromosomes, plus all SPs containing these two MSPs, can be destroyed by a single translocation. If a translocation destroys two MSPs on different chromosomes at the same time, we say the translocation merges the two MSPs. A. Bergeron et al.3 proved that it is also possible to eventually merge two MSPs that initially belong to two different trees of the same chromosome.

Lemma 2.1 (Ref. 3). If a chromosome X of genome Π contains more than one tree, and no other chromosome of Π contains a non-trivial MSP, then the trees can be separated onto different chromosomes by proper translocations without modifying F_Π.

Lemma 2.2 (Ref. 3). Given two genomes Π and Γ with the same gene set, assume there are N chromosomes and n genes in both Π and Γ. Let c be the number of cycles in G(Π, Γ). The minimum number of translocations for transforming Π into Γ is dt(Π, Γ) = n − N − c + t, where

    t = L + 2, if L is even and T = 1;
    t = L + 1, if L is odd;
    t = L,     if L is even and T ≠ 1.
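The distance is immediate to evaluate once n, N, c, L and T are known; a minimal sketch (function names are ours):

    def t_term(L, T):
        # t of Lemma 2.2, by the parity of L and the number of trees T
        if L % 2 == 0:
            return L + 2 if T == 1 else L
        return L + 1

    def dt(n, N, c, L, T):
        # dt(Pi, Gamma) = n - N - c + t
        return n - N - c + t_term(L, T)

    # Example 6.1 later in this paper: n = 16, N = 2, c = 7, L = 5 leaves
    # (L odd, so t = L + 1 = 6 and T is irrelevant), giving dt = 13.
    print(dt(16, 2, 7, 5, 2))  # 13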


A translocation ρ is valid if dt(Π, Γ) − dt(Π · ρ, Γ) = 1. A translocation is safe if it does not create any new non-trivial SPs. Based on this formula, A. Bergeron et al. gave an O(n³) algorithm for SBT (hereafter called Algorithm I). For completeness, we describe it in the following.

Algorithm I

1. if L is even and T = 1, destroy one leaf such that L′ = L − 1.

2. if L is odd, perform a bad translocation such that L′ = 0, or T′ > 1 and L′ = L − 1.

3. while Π is not sorted do
       if there exist MSPs on different chromosomes then
           perform a bad translocation such that L′ = 0, or T′ > 1 and L′ = L − 2;
       else
           perform a proper translocation such that T and L remain unchanged.
   end while

We make two remarks about the above algorithm.

Remark 2.1. Throughout Algorithm I, we always try to merge a pair of valid MSPs on different chromosomes, i.e., a pair whose merging will not create an even-isolation in the resulting genome. How to select a pair of valid MSPs to merge is described in the proof of Lemma 4 in Ref. 3.

Remark 2.2. Throughout Algorithm I, we always try to destroy a valid MSP on some chromosome, i.e., one whose destruction will not create an even-isolation in the resulting genome. How to select a valid MSP to destroy is described in the proof of Theorem 2 in Ref. 3.

3. ON SORTING BY TRANSLOCATIONS AND DELETIONS

Returning to the problem at hand: when the gene set of Γ is a subset of the gene set of Π, find the minimum number of translocations and deletions required to transform genome Π into Γ. Let Π̄ be the genome induced from Π by deleting all the genes in Δ_Π. We always assume genomes Π and Γ are co-tailed, which implies that Π̄ and Γ are co-tailed too. Thus one can use Algorithm I to transform Π̄ into Γ.

3.1. The New Definition of the Cycle Graph

Given that the genes of Δ_Π are destined to be deleted, their identities and signs are irrelevant and can be replaced by any symbols different from those used in Δ. For any segment of the form S = u_1, u_2, ..., u_{p−1}, u_p, where u_1 and u_p correspond to two elements of Δ and, for all i with 2 ≤ i ≤ p − 1, u_i corresponds to a gene of Δ_Π, we replace S by S′ = u_1, δ(u_1), u_p, where δ(u_1) is the segment of Δ_Π genes between u_1 and u_p. Thus u_1 and u_p are separated by a δ.

Example 3.1. Let S = +1, −a, +2, −3, +b, −c, +4, −5 be a segment on a chromosome of Π. Then S can be rewritten as +1, δ(1^h), +2, −3, δ(3^t), +4, −5, where δ(1^h) = −a and δ(3^t) = +b, −c.
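A small Python sketch of this relabeling; the string encoding of signed genes and the keying of each δ by the kept element on its left are our own choices for illustration:

    def compress(chrom, delta):
        # chrom: list of signed gene symbols, e.g. ['+1', '-a', '+2'].
        # delta: set of gene names present in both genomes (the set Delta).
        # Segments here start and end with kept genes (cf. the text above).
        out, labels, run = [], {}, []
        for g in chrom:
            if g[1:] in delta:
                if run and out:
                    labels[out[-1]] = run   # the delta between out[-1] and g
                run = []
                out.append(g)
            else:
                run.append(g)
        return out, labels

    S = ['+1', '-a', '+2', '-3', '+b', '-c', '+4', '-5']
    print(compress(S, {'1', '2', '3', '4', '5'}))
    # (['+1', '+2', '-3', '+4', '-5'], {'+1': ['-a'], '-3': ['+b', '-c']})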

We represent genomes Π and Γ by the redefined cycle graph G(Π, Γ), where V is the set of vertices, B is the set of black edges, and D is the set of gray edges. These three sets are defined as follows:

• V = {x^t, x^h : x is a gene in Δ}.

• The black edges pertain to genome Π. There are two kinds of black edges: direct black edges, which link two adjacent vertices in Π, and indirect black edges, which link two vertices separated by a δ. For an indirect black edge e = (a, b), δ(a), the segment of elements of Δ_Π between a and b, is called the label of e.

• Gray edges link adjacent vertices in Γ.

An indirect cycle (or indirect SP) is one containing at least one indirect black edge; otherwise it is direct. We color the leaves corresponding to indirect MSPs red and the leaves corresponding to direct MSPs blue.

An example is given in Fig. 1, where Π = {(1, −2, 3, 8, 4, −5, a, b, c, 6), (7, 9, −10, 11, −12, 13, 14, −15, d, e, f, 16)}. Indirect black edges are indicated by thick lines.


Fig. 1. The cycle graph G(Π, Γ) and the forest F_Π.

3.2. The New Definition for a Translocation

In G(Π, Γ), an indirect black edge determines not an adjacency of genome Π but an interval containing only genes to be deleted. We thus have to redefine what we mean by "the bad translocation acting on two black edges" and "the proper translocation determined by an interchromosomal gray edge". Let e = (a, b) be an indirect edge in G(Π, Γ). The segment [x, δ(a)] designates the interval bounded on the left by x and on the right by the element of Δ_Π adjacent to b. The segment [δ(a), x] designates the interval bounded on the left by the element of Δ_Π adjacent to a and on the right by x. To state Definition 3.1 simply, we define δ(a) = ∅ for a direct black edge e = (a, b); then the segment [x, δ(a)] designates the interval bounded on the left by x and on the right by a, and the segment [δ(a), x] designates the interval bounded on the left by b and on the right by x.

Definition 3.1. Assume the two black edges e = (a, b) and f = (c, d) are on two different chromosomes X = x_1, ..., a, δ(a), b, ..., x_p and Y = y_1, ..., c, δ(c), d, ..., y_q, where x_i (1 ≤ i ≤ p) and y_j (1 ≤ j ≤ q) are vertices of G(Π, Γ).

(1) The translocation determined by g = (a, c) exchanges the segment [x_1, a] of X with the segment [δ(c), y_q] of Y.

(2) The translocation determined by g = (b, d) exchanges the segment [x_1, δ(a)] of X with the segment [d, y_q] of Y.

(3) The translocation determined by g = (a, d) exchanges the segment [x_1, a] of X with the segment [y_1, δ(c)] of Y.

(4) The translocation determined by g = (b, c) exchanges the segment [x_1, δ(a)] of X with the segment [y_1, c] of Y.

(5) The translocation determined by e and f exchanges the segment [x_1, δ(a)] of X with the segment [y_1, c] of Y.

4. A LOWER BOUND ON dtd(Π, Γ)

Algorithm I can be generalized to graphs containing direct and indirect edges by making use of the new definition of a translocation.

Lemma 4.1. A proper translocation determined by an interchromosomal gray edge of a cycle C transforms C into two cycles C_1 and C_2, at least one of which is of size 1, say C_1. Then the black edge of C_1 is direct.

Corollary 4.1. An interchromosomal cycle C of size k is transformed by Algorithm I, with k − 1 translocations, into k cycles of size 1. If C is direct, the k cycles are all direct; if not, exactly one of these cycles is indirect.

By Corollary 4.1, for each interchromosomal indirect cycle, Algorithm I gathers all the genes to be deleted into a single segment; at the end, a single deletion is required for each interchromosomal indirect cycle. Now consider the intrachromosomal indirect cycles, which form MSPs. The merging of two MSPs is achieved by combining two intrachromosomal cycles C_1 and C_2, one from each of the two MSPs. This gets rid of the two MSPs and creates an interchromosomal cycle. The destruction of one MSP M is achieved by combining one intrachromosomal cycle C_1 ∈ M with another cycle C_2 which is not in any MSP. This gets rid of M and creates an interchromosomal cycle. The resulting interchromosomal cycles can be resolved as described above. For either destroying or merging, if both C_1 and C_2 are indirect, we save one deletion. Thus we obtain the optimal sorting scheme: to use as few deletions and translocations as possible, we need to merge as many pairs of indirect cycles as possible through Algorithm I.

Given two genomes Π and Γ with different gene sets, denote the number of red leaves in F_Π by r(Π, Γ) and the number of indirect cycles in G(Π, Γ) by ci(Π, Γ).


Lemma 4.2. The number of mergings of pairs of indirect cycles through Algorithm I is at most ⌊r(Π, Γ)/2⌋ + 1.

Proof. We prove this in the following subcases.

Subcase 1: L is even and T = 1. In this case, two MSPs M_1 and M_2 will be destroyed in steps 1 and 2 respectively, and the remaining MSPs are paired for merging in step 3. If both M_1 and M_2 are indirect, there are at most ⌊(r(Π, Γ) − 2)/2⌋ mergings of pairs of indirect cycles in step 3, thus at most ⌊(r(Π, Γ) − 2)/2⌋ + 2 = ⌊r(Π, Γ)/2⌋ + 1 mergings through Algorithm I;^a if only one of M_1 and M_2 is indirect, there are at most ⌊(r(Π, Γ) − 1)/2⌋ mergings in step 3, thus at most ⌊(r(Π, Γ) − 1)/2⌋ + 1 ≤ ⌊r(Π, Γ)/2⌋ + 1 mergings through Algorithm I; if neither M_1 nor M_2 is indirect, it is impossible to merge a pair of indirect cycles in steps 1 and 2, thus there are at most ⌊r(Π, Γ)/2⌋ mergings through Algorithm I.

Subcase 2: L is odd. In this case, step 1 is not executed; one MSP M will be destroyed in step 2, and the remaining MSPs are paired for merging in step 3. If M is indirect, there are at most ⌊(r(Π, Γ) − 1)/2⌋ mergings of pairs of indirect cycles in step 3, thus at most ⌊(r(Π, Γ) − 1)/2⌋ + 1 ≤ ⌊r(Π, Γ)/2⌋ + 1 mergings through Algorithm I; if M is direct, it is impossible to merge a pair of indirect cycles in step 2, so there are at most ⌊r(Π, Γ)/2⌋ mergings through Algorithm I.

Subcase 3: L is even and T ≠ 1. In this case, steps 1 and 2 are not executed, so there are at most ⌊r(Π, Γ)/2⌋ mergings of pairs of indirect cycles in step 3, i.e., through Algorithm I. □

Theorem 4.1. dtd(Π, Γ) ≥ dt(Π̄, Γ) + ci(Π, Γ) − ⌊r(Π, Γ)/2⌋ − 1, where Π̄ is the genome induced from Π by deleting all the genes in Δ_Π and dt(Π̄, Γ) is the translocation distance between Π̄ and Γ.
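The bound is cheap to evaluate once its three ingredients are known; a one-line sketch in the same spirit as the earlier distance function (names ours):

    def dtd_lower_bound(dt_bar, ci, r):
        # Theorem 4.1: dtd >= dt(Pi_bar, Gamma) + ci(Pi, Gamma)
        #                     - floor(r(Pi, Gamma) / 2) - 1
        return dt_bar + ci - r // 2 - 1

    # Example 6.1 later in the paper: dt = 13, ci = 2 indirect cycles,
    # r = 2 red leaves, so dtd >= 13 (the optimum there turns out to be 14).
    print(dtd_lower_bound(13, 2, 2))  # 13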

5. DESIGN OF THE ALGORITHM

We will approximate dtd(Π, Γ) by merging as many pairs of indirect MSPs as possible through Algorithm I. To this end, when some MSP must be destroyed, our strategy prefers to destroy a direct MSP.

Our sorting scheme requires careful consideration of the choice of MSPs and of the cycle (i.e., black edge) within each MSP. In summary:

MSP merging

1. Choose a pair of valid MSPs M_1 and M_2, having the same color if possible.

2. If both M_1 and M_2 are indirect, choose an indirect black edge e in M_1 and an indirect black edge f in M_2; otherwise, choose any black edge e in M_1 and f in M_2.

3. Apply the (prefix-prefix) translocation determined by e and f, as given by Definition 3.1.

MSP destroying

1. Choose a valid MSP M, direct if possible.

2. Choose a black edge e in M, and a black edge f on a different chromosome not contained in any MSP.

3. Apply the (prefix-prefix) translocation determined by e and f, as given by Definition 3.1.

5.1. Special Cases and Corresponding Sub-procedures

We always try to merge a pair of "valid" MSPs with "the same color", or to destroy a "valid" "direct" MSP. In some cases, however, the two conditions for "merging" or "destroying" are not compatible. We list the following six cases. A tree with x leaves is called an x-tree; if x is even, it is an even-tree, otherwise an odd-tree.

Case 1: T = 3, where one tree is an even-tree and the other two are 1-trees. All leaves of the even-tree have color i, and the leaves of the two 1-trees have color j, where i ≠ j.

Sub-procedure 1: Ignore the colors of the leaves and apply Algorithm I on G(Π, Γ).

Case 2: T = 2, where one tree is an x-tree and the other is a 1-tree, with x odd and x ≥ 3. The leaf of the 1-tree has color i. The rightmost leaf 𝓡 of the odd-tree has color k, the leftmost leaf 𝓛 of the odd-tree has color l, and the other internal leaves of the odd-tree have color j, where i ≠ j.

Sub-procedure 2:

^a Since MSP destroying may merge a pair of indirect cycles.


1. Apply a sequence of proper translocations without changing F_Π until the two trees are on different chromosomes, by Lemma 2.1.

2. If k = l = i, then

2.1. merge the leaf of the 1-tree with the middle leaf of the odd-tree.^b

2.2. merge 𝓡 with 𝓛.

3. Ignore the colors of the leaves and apply Algorithm I on G(Π, Γ).

Case 3: T = 2, where one tree is an x-tree and the other is a y-tree, with x, y ≥ 2 and x + y even. All leaves of the x-tree have color i and all leaves of the y-tree have color j, where i ≠ j.

Sub-procedure 3:

1. Apply a sequence of proper translocations without changing F_Π until the two trees are on different chromosomes, by Lemma 2.1.

2. Merge the middle leaves of the two trees.^c

3. Flip the chromosome on which the tree of color j is to the right of the tree of color i.

4. Keep merging the pair of leftmost leaves of color j on different chromosomes until at most one leaf of color j is left.^d

5. Keep merging the pair of rightmost leaves of color i on different chromosomes until at most one leaf of color i is left.

6. Ignore the colors of the leaves and apply Algorithm I on G(Π, Γ).

Case 4: T = 1, L is odd and L ≥ 3. The rightmost leaf 𝓡 has color i, the leftmost leaf 𝓛 has color j, and the other internal leaves are all red.

Sub-procedure 4:

1. If i = j = blue, then

1.1. destroy the middle (red) leaf of the tree.^e

1.2. merge 𝓡 with 𝓛.

2. Ignore the colors of the leaves and apply Algorithm I on G(Π, Γ).

Case 5: T = 2, where one tree is an even-tree and the other is a 1-tree. All leaves of the even-tree are red, and the leaf of the 1-tree is blue.

Sub-procedure 5: Ignore the colors of the leaves and apply Algorithm I on G(Π, Γ).

Case 6: T = 1, L is even, and all leaves of F_Π are red.

Sub-procedure 6: Ignore the colors of the leaves and apply Algorithm I on G(Π, Γ).

5.2. Main Lemmas

The following lemmas will be central in providing an invariant for the sorting algorithm.

Lemma 5.1 (Ref. 3). If a chromosome X of genome Π contains more than one tree, then there exists a proper translocation involving chromosome X.

Lemma 5.2 (Ref. 2). If there exists a proper translocation in Π, then there exists a proper translocation ρ in Π such that Π · ρ does not have any new non-trivial MSP.

Lemma 5.3 (Ref. 3). If a chromosome X of genome Π contains more than one tree, and no other chromosome of Π contains a non-trivial MSP, then there exists a proper translocation involving X that does not modify F_Π.

Lemma 5.4. In the process of Algorithm I, if some MSP must be destroyed, we can always destroy a "valid" "direct" MSP, as long as the configuration is not Case 4, 5 or 6.

Lemma 5.5. Let Π be a genome whose forest F_Π has L ≥ 4 leaves and T ≥ 2 trees, where L is even and F_Π is not Case 1, 2 or 3. If there are same-color leaves on different chromosomes, then there exists a bad translocation merging a pair of same-color leaves such that the resulting forest F_Π′ has L′ = L − 2 leaves and T′ ≠ 1 trees.

Proof. We prove this by a case analysis on T.

Subcase 1: T = 2. Assume the two trees are an x-tree and a y-tree, where x + y ≥ 4. Clearly, the two trees must be on different chromosomes.

^b This will put 𝓡 and 𝓛 on different chromosomes.
^c This will create four trees.
^d It is feasible since we always use a prefix-prefix translocation to merge MSPs.
^e This will put 𝓡 and 𝓛 on different chromosomes.


When x ≥ 2 and y ≥ 2, since F_Π is not Case 3, there must exist a pair of same-color leaves between the two trees; merging this pair results in a genome Π′ with L′ = L − 2 leaves and T′ ≠ 1 trees. When x = 1, y ≥ 3 (y = 1, x ≥ 3, respectively), since F_Π is not Case 2, there exists an internal leaf l_1 of the y-tree (x-tree, respectively) such that l_1 has the same color as the only leaf l_2 of the x-tree (y-tree, respectively). Merging l_1 and l_2 results in a genome Π′ with L′ = L − 2 leaves and T′ ≠ 1 trees.

Subcase 2: T = 3. Since F_Π is not Case 1, there exists a pair of same-color leaves l_1 and l_2 on different chromosomes such that either l_1 or l_2 is a leaf of some x-tree with x ≥ 2. Clearly, merging l_1 and l_2 results in a genome Π′ with L′ = L − 2 leaves and T′ ≠ 1 trees.

Subcase 3: T ≥ 4. In this case, merging any pair of same-color leaves on different chromosomes results in a genome Π′ with L′ = L − 2 leaves and T′ ≠ 1 trees. □

Lemma 5.6. Let Π be a genome whose forest F_Π has L ≥ 4 leaves and T ≥ 2 trees, where L is even and F_Π is not Case 1, 2 or 3. If there does not exist any pair of same-color leaves on different chromosomes, then there exists a valid proper translocation of Π that does not change L.

Proof. We prove this by a case analysis on T.

Subcase 1: T = 2. The two trees must be on the same chromosome. By Lemma 5.3, there exists a valid proper translocation without changing F_Π.

Subcase 2: T ≥ 3. Since there are only two kinds of colors, all trees in F_Π must be on at most two chromosomes. If the trees are on one chromosome, by Lemma 5.3, there exists a valid proper translocation without modifying F_Π. Otherwise, assume the two chromosomes are X and Y, and all leaves on X have color i while all leaves on Y have color j, where i ≠ j. Since T ≥ 3, either X or Y has more than one tree. Then by Lemmas 5.1 and 5.2, there exists a proper translocation without changing L. The resulting genome must have more than two trees, thus this proper translocation is valid. □

Lemma 5.7. Let Π be a genome having no leaves. If Π is not sorted, then there always exists a safe proper translocation on Π.

Proof. Since Π is not sorted and F_Π = ∅, there must be an interchromosomal gray edge in G(Π, Γ), which determines a proper (prefix-prefix or prefix-suffix) translocation. Then by Lemma 5.2, there exists a safe proper translocation on Π. □

5.3. The Approximation Algorithm

Our extended algorithm, which merges as many pairs of indirect MSPs as possible, is given below as Algorithm II.

Algorithm II

1. if L is even and T = 1:
       if it is Case 6, go to Sub-procedure 6; else, destroy a valid blue leaf by Lemma 5.4.

2. if L is odd:
       (a) if L = 1, destroy this leaf.
       (b) if it is Case 4 or 5, go to Sub-procedure 4 or 5 respectively; else, destroy a valid blue leaf by Lemma 5.4.

3. while L ≥ 4 do
       (a) if it is Case 1, 2 or 3, go to Sub-procedure 1, 2 or 3 respectively.
       (b) if there exist same-color leaves on different chromosomes, then perform a valid bad translocation merging a pair of same-color leaves by Lemma 5.5; else, perform a valid proper translocation by Lemma 5.6.

4. if L = 2 (at this point T must be 2):
       (a) perform proper translocations without changing F_Π until the two leaves are on different chromosomes, by Lemma 2.1.
       (b) perform a bad translocation merging these two leaves.

5. Perform safe proper translocations on Π until Π is sorted, by Lemma 5.7.
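The control flow above can be summarized in the following Python-style skeleton. Every helper on the state object is a hypothetical name standing in for the constructions of Lemmas 2.1 and 5.4-5.7 and the six sub-procedures; this is a structural sketch, not an implementation:

    def algorithm_II(state):
        # state exposes L (leaves), T (trees) and the operations of Sec. 5;
        # all methods below are hypothetical stand-ins.
        if state.L % 2 == 0 and state.T == 1:
            if state.is_case(6):
                state.run_subprocedure(6)
            else:
                state.destroy_valid_blue_leaf()            # Lemma 5.4
        if state.L % 2 == 1:
            if state.L == 1:
                state.destroy_leaf()
            elif state.is_case(4) or state.is_case(5):
                state.run_subprocedure(4 if state.is_case(4) else 5)
            else:
                state.destroy_valid_blue_leaf()            # Lemma 5.4
        while state.L >= 4:
            for case in (1, 2, 3):
                if state.is_case(case):
                    state.run_subprocedure(case)
                    break
            else:
                if state.same_color_leaves_on_different_chromosomes():
                    state.merge_same_color_pair()          # Lemma 5.5
                else:
                    state.valid_proper_translocation()     # Lemma 5.6
        if state.L == 2:                                   # then T == 2
            state.separate_leaves()                        # Lemma 2.1
            state.merge_remaining_pair()
        while not state.is_sorted():
            state.safe_proper_translocation()              # Lemma 5.7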


Let G_1(Π, Γ) be the graph containing only cycles of size 1, obtained by applying Algorithm II to the graph G(Π, Γ). Now we apply the following Procedure Deletion to G_1(Π, Γ).

Procedure Deletion
For each indirect edge e = (a, b) of G_1(Π, Γ), delete the gene segment between a and b, labeled δ(a).

We call Algorithm II augmented by Procedure Deletion the Translocation-Deletion algorithm. It is clear that the algorithm transforms genome Π into genome Γ. Let β(Π, Γ) be the number of mergings of pairs of indirect cycles in Algorithm II.

Theorem 5.1. The number of translocations and deletions obtained by the Translocation-Deletion algorithm is Apxtd(Π, Γ) = dt(Π̄, Γ) + ci(Π, Γ) − β(Π, Γ).

6. ANALYSIS OF THE TRANSLOCATION-DELETION ALGORITHM

Let α(Π, Γ) be the number of mergings of pairs of indirect MSPs during Algorithm II. Obviously, β(Π, Γ) ≥ α(Π, Γ). Assume that there are r_step3 red leaves at the beginning of step 3, that there are r_si red leaves when Sub-procedure i is invoked, and that y_i pairs of red MSPs are merged in Sub-procedure i, for i = 1, 2, ..., 6.

Lemma 6.1. For i = 1, 2, 3, 4, 5, 6: if r_si is even, then y_i = ⌊r_si/2⌋ − 1; if r_si is odd, then y_i = ⌊r_si/2⌋.

Proof. We discuss the six cases respectively.

Case 1 appears: r_s1 is even. In Sub-procedure 1, two red leaves are used to merge with two blue leaves respectively, and the other red leaves, if any, are paired for merging, so y_1 = ⌊(r_s1 − 2)/2⌋ = ⌊r_s1/2⌋ − 1.

Case 2 appears: If k = l = i, then r_s2 is odd. In Sub-procedure 2, one red leaf is used to merge with a blue leaf, and the other red leaves, if any, are paired for merging, so y_2 = ⌊(r_s2 − 1)/2⌋ = ⌊r_s2/2⌋. If k = i, l = j or k = j, l = i, then r_s2 is even; two red leaves are used to merge with two blue leaves respectively, and the other red leaves, if any, are paired for merging, so y_2 = ⌊(r_s2 − 2)/2⌋ = ⌊r_s2/2⌋ − 1. If k = l = j, then r_s2 is odd; one red leaf is used to merge with a blue leaf, and the other red leaves, if any, are paired for merging, so y_2 = ⌊(r_s2 − 1)/2⌋ = ⌊r_s2/2⌋.

Case 3 appears: If r_s3 is even, in Sub-procedure 3 two red leaves are used to merge with two blue leaves respectively, and the other red leaves are paired for merging, so y_3 = ⌊(r_s3 − 2)/2⌋ = ⌊r_s3/2⌋ − 1. If r_s3 is odd, the middle red leaf is used to merge with the middle blue leaf, and the other red leaves are paired for merging, so y_3 = ⌊(r_s3 − 1)/2⌋ = ⌊r_s3/2⌋.

Case 4 appears: If i = j = blue, then r_s4 is odd. In Sub-procedure 4, one red leaf is destroyed, and the other red leaves are paired for merging, so y_4 = ⌊(r_s4 − 1)/2⌋ = ⌊r_s4/2⌋. If all leaves of the odd-tree are red, then r_s4 is odd; one red leaf is destroyed, and the other red leaves are paired for merging, so y_4 = ⌊(r_s4 − 1)/2⌋ = ⌊r_s4/2⌋. If only one of the rightmost and leftmost leaves is red, then r_s4 is even; a red internal leaf is destroyed, another red leaf is used to merge with the only blue leaf, and the other red leaves are paired for merging, so y_4 = ⌊(r_s4 − 2)/2⌋ = ⌊r_s4/2⌋ − 1.

Case 5 appears: r_s5 is even. In Sub-procedure 5, one red leaf is destroyed and another red leaf is used to merge with the blue leaf, and the other red leaves are paired for merging, so y_5 = ⌊(r_s5 − 2)/2⌋ = ⌊r_s5/2⌋ − 1.

Case 6 appears: r_s6 is even. In Sub-procedure 6, two red leaves are destroyed, and the other red leaves are paired for merging, so y_6 = ⌊(r_s6 − 2)/2⌋ = ⌊r_s6/2⌋ − 1. □

Since, when Case i = 4, 5 or 6 appears in Algorithm II, r_si = r(Π, Γ) and α(Π, Γ) = y_i, we have:

Lemma 6.2. If one of Cases 4, 5, 6 appears in Algorithm II, then α(Π, Γ) = ⌊r(Π, Γ)/2⌋ if r(Π, Γ) is odd, and α(Π, Γ) = ⌊r(Π, Γ)/2⌋ − 1 if r(Π, Γ) is even.

Lemma 6.3. If one of Cases 1, 2, 3 appears in Algorithm II, then α(Π, Γ) = ⌊r(Π, Γ)/2⌋ − 1 if r(Π, Γ) is even, and α(Π, Γ) = ⌊r(Π, Γ)/2⌋ if r(Π, Γ) is odd.

Proof. Clearly, r_step3 = r(Π, Γ). Note that a red leaf is always merged with another red leaf in step 3 until one of Cases 1, 2, 3 appears. Assume that x_i pairs of indirect MSPs have been merged in step 3 when Sub-procedure i happens, i = 1, 2, 3. Since r_si + 2x_i = r_step3 = r(Π, Γ),


r_si and r(Π, Γ) have the same parity. We have α(Π, Γ) = y_i + x_i. So by Lemma 6.1, if r(Π, Γ) is even, α(Π, Γ) = x_i + ⌊r_si/2⌋ − 1 = ⌊(r_si + 2x_i)/2⌋ − 1 = ⌊r(Π, Γ)/2⌋ − 1; otherwise, α(Π, Γ) = x_i + ⌊r_si/2⌋ = ⌊r(Π, Γ)/2⌋. □

Lemma 6.4. If none of the six cases appears in Algorithm II, then α(Π, Γ) = ⌊r(Π, Γ)/2⌋.

Proof. Since none of the six cases appears in Algorithm II, if r(Π, Γ) is even, a red leaf is always merged with another red leaf; if r(Π, Γ) is odd, one red leaf is merged with a blue leaf and the others are paired for merging. So α(Π, Γ) = ⌊r(Π, Γ)/2⌋. □

By Lemmas 6.2-6.4, we have:

Lemma 6.5. α(Π, Γ) ≥ ⌊r(Π, Γ)/2⌋ − 1.

Theorem 6.1. Apxtd(Π, Γ) − dtd(Π, Γ) ≤ 2.

Proof. By Theorems 4.1 and 5.1, Apxtd(Π, Γ) − dtd(Π, Γ) ≤ ⌊r(Π, Γ)/2⌋ + 1 − β(Π, Γ) ≤ ⌊r(Π, Γ)/2⌋ + 1 − α(Π, Γ) ≤ 2. □

Example 6.1. See Fig. 1: there are two indirect cycles in G(Π, Γ) and five non-trivial MSPs: (1...3), (4...6), (9...11), (11...13), (14...16), where (4...6) and (14...16) are indirect. Π̄ = {(1, −2, 3, 8, 4, −5, 6), (7, 9, −10, 11, −12, 13, 14, −15, 16)} and dt(Π̄, Γ) = 16 − 2 − 7 + 6 = 13. If, in step 2 of Algorithm II, (1...3) is chosen for destruction, the resulting forest will be Case 2; then β(Π, Γ) = 0, so Apxtd(Π, Γ) = 13 + 2 − 0 = 15. If (11...13) is chosen for destruction, (4...6) will merge with (14...16) and (1...3) will merge with (9...11); then β(Π, Γ) = 1, so Apxtd(Π, Γ) = 13 + 2 − 1 = 14. Note that 14 is the minimum number of translocations and deletions transforming Π to Γ.

7. CONCLUSIONS

In this paper, we give an asymptotically optimal algorithm for the case when the gene set of Γ is a subset of the gene set of Π. In fact, the problem of transforming Π to Γ with a minimum number of translocations and insertions can be approximated by the same translocation-deletion analysis, with Π taking the role of Γ and vice versa. To the best of our knowledge, this is the first work to consider SBT when the genomes have different gene sets.

ACKNOWLEDGMENTS

This work was supported, in part, by the National Science Foundation (NSF/DBI-0354771, NSF/ITR-IIS-0407204). Guojun Li's work was also supported by NSFC under Grant No. 60373025.

References

1. J. D. Kececioglu, R. Ravi. Of mice and men: Algorithms for evolutionary distance between genomes with translocation. In: Proc. 6th ACM-SIAM Symposium on Discrete Algorithms. 1995: 604-613.

2. S. Hannenhalli. Polynomial algorithm for computing translocation distance between genomes. Discrete Applied Mathematics 1996; 71: 137-151.

3. A. Bergeron, J. Mixtacki, J. Stoye. On sorting by translocations. In: Proc. 9th Annual International Conference on Research in Computational Molecular Biology (RECOMB'05). Cambridge, MA. 2005: 615-629.

4. G. Li, X. Qi, X. Wang, B. Zhu. A linear time algorithm for computing translocation distance between signed genomes. In: Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM'04). Springer-Verlag. 2004: 323-332.

5. L. Wang, D. Zhu, X. Liu, S. Ma. An O(n²) algorithm for signed translocation. Journal of Computer and System Sciences 2005; 70(3): 284-299.

6. N. El-Mabrouk. Genome rearrangement by reversals and insertions/deletions of contiguous segments. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM'00). Springer-Verlag. 2000: 222-234.


TURNING REPEATS TO ADVANTAGE: SCAFFOLDING GENOMIC CONTIGS USING LTR RETROTRANSPOSONS

A. Kalyanaraman*1, S. Aluru1 and P.S. Schnable2

1 Department of Electrical and Computer Engineering
2 Departments of Agronomy, and Genetics, Development and Cell Biology

Iowa State University, Ames, IA 50011, USA

Email: {ananthk,aluru,schnable}@iastate.edu

The abundance of repeat elements in the maize genome complicates its assembly. Retrotransposons alone are estimated to constitute at least 50% of the genome. In this paper, we introduce a problem called retroscaffolding, a new variant of the well-known problem of scaffolding, which orders and orients a set of assembled contigs in a genome assembly project. The key feature of this new formulation is that it takes advantage of the structural characteristics and abundance of a particular type of retrotransposon, the Long Terminal Repeat (LTR) retrotransposon. This approach is not meant to supplant but rather to complement other scaffolding approaches. The advantages of retroscaffolding are twofold: (i) it allows detection of regions containing LTR retrotransposons within the unfinished portions of a genome and can therefore guide the process of finishing, and (ii) it provides a mechanism to lower sequencing coverage without impacting the quality of the final assembled genic portions. Sequencing and finishing costs dominate the expenditures in whole-genome projects, and it is often desirable, in the interest of saving cost, to reduce the effort spent on repetitive regions of a genome. The retroscaffolding technique provides a viable mechanism to this effect. Results of preliminary studies on maize genomic data validate the utility of our approach. We also report on the on-going development of an algorithmic framework to perform retroscaffolding.

1. INTRODUCTION

Hierarchical sequencing4 is being used to sequence the maize genome18. In this approach, a genome is first broken into numerous smaller clones of size up to 200 kbp, each called a Bacterial Artificial Chromosome (or BAC). Next, a combination of these BACs that provides a minimum tiling path based on their locations along the genome is determined. Each selected BAC is then individually sequenced using a shotgun approach that generates numerous short (~500-1,000 bp long) fragments. The problem of assembling the target genome is thereby reduced to the problem of computationally assembling each BAC from its fragments.

The fragments generated by a shotgun experiment approximately represent a collection of sequences originating from positions distributed uniformly at random over each BAC. As with a jigsaw puzzle, the idea is to generate fragments such that each genomic position is expected to be covered (or sampled) by at least one fragment, while also ensuring that there is sufficient computable evidence, in the form of "overlaps" between fragments, to carry out the assembly. Regardless of the coverage specified, however, gaps invariably occur during sequencing, i.e., it cannot be guaranteed that every position is covered by at least one fragment. Coverage affects the nature of gaps: a low coverage typically results in several long gaps, while a high coverage results in fewer and shorter gaps. Because of gaps, assembling a set of fragments sequenced from a BAC typically results in not one but many assembled sequences called contigs, which represent the set of all contiguous genomic stretches sampled. The next step, scaffolding, aims at determining the order and orientation of the contigs relative to one another. Once scaffolded, the identified gaps between contigs can be filled through targeted experimental procedures called pre-finishing and finishing. For simplicity, we use the term "finishing" to refer collectively to both these procedures.

The main focus of this paper is the scaffolding step. The need for scaffolding arises from the fact that there can be gaps in sequencing. To be able to identify a pair of contigs corresponding to adjacent genomic stretches, current methods generate shotgun fragments in "pairs".

* Corresponding author.

Page 185: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

168


Fig. 1. An example showing 6 pairs of clone mate fragments (shown connected by dotted lines) sequenced from a given BAC. The relative order and orientation between contigs c1 and c2 (also, between c3 and c4) can be inferred from the clone mates. The supplied clone mate information is, however, not sufficient to determine the scaffolding information between all pairs of contigs in this example.

Each BAC is first broken into smaller clones of length ~5 kbp, and each such clone is sequenced from both ends, thereby producing two fragments which are referred to as clone mates (or a clone pair). During scaffolding, the fact that a pair of clone mates originated from the same ~5 kbp clone can be used to impose distance and orientation constraints for linking contigs that span the corresponding fragments1, 9, 10, 17, 19. Figure 1 illustrates an example of scaffolding contigs based on clone mate information. This technique is not, however, sufficient to link contigs surrounding gaps without a flanking pair of clone mates (gap2 in Figure 1). Such gaps, called physical gaps, are typically harder to "close" and involve costly finishing efforts. Performing higher-coverage sequencing is an effective but expensive way to reduce the occurrence of gaps. The approach proposed in this paper provides an alternative mechanism to scaffold around physical gaps as well, subject to their repeat content.

In this paper, we introduce a new variant of the scaffolding problem called the retroscaffolding problem. The problem is to order and orient contigs based on their span of LTR retrotransposon-rich regions of the genome. This approach has the following advantages:

• It does not require clone mate information. Thus, our approach complements existing scaffolding approaches for genomes with significant LTR retrotransposon content. Moreover, with the advent of newer sequencing technologies13 that do not generate clone mate information, the importance of our approach is further emphasized.

• It can be used to identify LTR retrotransposon-rich portions within the unfinished genomic regions. Such information can be useful if it is decided not to finish repetitive regions in the interest of saving costs, as is the case with the maize genome project18.

• In genome projects of highly repetitive genomes, most of the sequencing and finishing effort is expected to be spent on repeat-rich regions. This is one of the main concerns in the on-going efforts to sequence the maize genome, at least 50% of which is expected to be retrotransposons. The retroscaffolding technique provides a mechanism to reduce sequencing coverage without affecting the quality of the genic portion of the final assembly, thereby providing a means to reduce the sequencing costs.

In Section 2, we describe the retroscaffolding idea, formulate it as a problem, and discuss the various factors that affect the ability to retroscaffold. To obtain a proof of concept, we conducted experiments on previously sequenced maize BAC data. The results show that (i) 3X/4X coverage sequencing is best suited for exploiting the data's repeat content towards retroscaffolding, (ii) retroscaffolding can yield over 30% savings in finishing costs, and (iii) with retroscaffolding it is possible to opt for a lower sequencing coverage. These and other experimental results assessing the effects of various factors on retroscaffolding are presented in Section 3. As part of the NSF/DOE/USDA maize genome project18, we are working on applying the retroscaffolding technique to the maize data as it becomes available. To this effect, we are developing an algorithmic framework to perform retroscaffolding, as described in Section 4.

Page 186: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

169

In Section 5, we present the results of our experiments to assess the effect of applying both clone mate-based scaffolding and retroscaffolding to maize genomic data. Various strengths and limitations of the retroscaffolding technique are discussed in Section 6. Given that retrotransposons are abundant in the genomes of numerous plant crops yet to be sequenced (e.g., wheat, barley, sorghum), the capability of retroscaffolding to exploit this repeat content can provide a significant means to reduce sequencing and finishing costs.

2. RETROSCAFFOLDING

Retrotransposons are DNA repeat elements abundant in several eukaryotic genomes, occupying at least 45% of the human genome6, >50% of maize15, 20, and up to 90% of wheat7. Long Terminal Repeat (LTR) retrotransposons constitute one of the most abundant classes of retrotransposons, and have been studied in relation to genome evolution, genomic rearrangements and retroviral transposition mechanisms2, 3. As their name suggests, LTR retrotransposons are distinctly characterized in their structure by two terminal repeat sequences, one each at the 5′ and 3′ ends of a retrotransposon inserted in a host genome. Given that these retrotransposons are typically 10-15 kbp long, their flanking LTRs can also be expected to be separated by as many bps along the genome^a. Moreover, the LTR sequences are identical at the time a retrotransposon inserts itself into a host genome, and gradually diverge over time due to mutations. Yet, the LTRs flanking most retrotransposons are similar enough for detection. These properties form the basis of our retroscaffolding idea, as explained below.

Low-coverage sequencing of a genome with significant LTR retrotransposon content is likely to result in a proportionately large number of gaps that span these repetitive regions. If it so happens that the sequencing covers only the two LTRs of a given retrotransposon, a subsequent assembly can be expected to have two contigs each spanning one of the LTRs. Therefore, the detection of two identical or highly similar LTR-like sequences in two contigs is a necessary (but not sufficient) indication that the contigs sample the flanking regions of an inserted retrotransposon. If this indication can be further validated to sufficiency by searching for other structural signals of an LTR retrotransposon (described below), then the contigs can be relatively ordered and oriented (because LTRs are directed repeats). In addition, this implies that the intervening region between two consecutively ordered contigs contains retrotransposon-related sequences, information that can be used to prioritize the gaps for finishing, and potentially to reduce the effort spent on finishing repetitive regions, if so desired.

The structure of a full-length LTR retrotransposon (illustrated in Figure 2a) is characterized by the following key attributes:

• L1: The 5′ and 3′ LTRs share a "high" sequence identity.

• L2: The starting positions of the 5′ and 3′ LTRs are at least Dmin bp and at most Dmax bp apart along the genome.

• L3: Typically, LTRs start with TG and end in CA.

• L4: The 5 (or 6) bp immediately to the left of the 5′ LTR are "highly similar" (if not identical) to the 5 (or 6) bp immediately to the right of the 3′ LTR. This repeat is referred to as a Target Site Duplication (TSD) because it corresponds to the 5 (or 6) bp duplicated in the host genome at the time and site of the retrotransposon's insertion.

• L5: The intervening region between the 5′ and 3′ LTRs contains several signals that correspond to an inserted retrotransposon. These include a primer binding site (PBS), retrotransposon genes (gag, pol, and env), and a poly-purine tract (PPT).

For a sequence s, let s^s = s, and let s^r denote its reverse complement. A sequence c is said to contain a sequence l if there exists, between c and either l^s or l^r, a "good quality" alignment that spans a sufficiently "long" suffix or prefix of the latter sequence. Let an LTR pair (l_5′, l_3′) denote the two LTRs of a given LTR retrotransposon.
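The containment test can be sketched as a standard semi-global alignment in which end gaps on the contig are free. The function below is our own illustration (it fits all of l into c, with match/mismatch scores mirroring Table 2 but a simple linear gap penalty), not the authors' implementation; a full version would also allow a long prefix or suffix of l:

    def revcomp(s):
        return s[::-1].translate(str.maketrans('ACGT', 'TGCA'))

    def fit_score(l, c, match=2, mismatch=-5, gap=-1):
        # Best score of aligning all of l inside c: leading and trailing
        # gaps in c are free; gaps in l are charged.
        prev = [0] * (len(c) + 1)
        for i in range(1, len(l) + 1):
            cur = [prev[0] + gap]
            for j in range(1, len(c) + 1):
                s = match if l[i - 1] == c[j - 1] else mismatch
                cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            prev = cur
        return max(prev)

    def contains(c, l, cutoff):
        # c 'contains' l if l or its reverse complement fits well into c.
        return max(fit_score(l, c), fit_score(revcomp(l), c)) >= cutoff

    print(contains('CCTGTTGAAC', 'TGTTG', cutoff=10))  # True (exact hit)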

"Sometimes, LTR retrotransposons can be nested within one another, accordingly affecting the distances between the 5' and 3 ' LTRs.


Fig. 2. (a) Structure of a full-length LTR retrotransposon. (b) An example showing two contigs c1 and c2 with a retro-link between them.

Definition of a Retro-link: Given a set L of n LTR pairs, two contigs c_i and c_j are said to be retro-linked if ∃(l_5′, l_3′) ∈ L such that both c_i and c_j contain l_5′ or l_3′ or both.

An example of a retro-link between two contigs is shown in Figure 2b. As shown, the above definition is extended to account for additional structural attributes such as L3, L4 and L5, to ensure that a retro-link indeed spans the same full-length LTR retrotransposon. Details are omitted for brevity.

The Retroscaffolding Problem: Given a set C of m contigs and a set L of n LTR pairs, partition C such that:

• each subset is an ordered set of contigs, and

• every pair of consecutive contigs in each subset is retro-linked, and no contig participates in two retro-links in opposite orientations.

The retroscaffolding problem can be viewed as a variant of the standard scaffolding problem, the Contig Scaffolding Problem, which is NP-complete9. In the latter, the input is a set of contigs and a set of clone mates, where each clone mate pair is a pair of fragments sequenced from the same clone of a known approximate length. This is similar to the distance constraint imposed by a retro-link between the two contigs containing the two LTRs of the same retrotransposon. In addition to the LTRs, a retro-link accounts for other structural attributes of an LTR retrotransposon. Also, as in the original scaffolding problem, not all retro-links may be used in the final ordering and orientation. Like the contig scaffolding problem, the retroscaffolding problem can be formulated as an optimization problem.
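One way to see the orientation constraint in code is to treat contigs as nodes and retro-links as candidate edges, then greedily chain nodes while rejecting any link that would branch a chain or assign a contig two conflicting orientations. The sketch below is our own greedy simplification, not the Bambus-based method used later in the paper; it tracks relative orientations with a union-find structure:

    class DSU:
        def __init__(self, items):
            self.parent = {x: x for x in items}
            self.flip = {x: 0 for x in items}   # orientation parity vs. root
        def find(self, x):
            if self.parent[x] == x:
                return x, 0
            root, f = self.find(self.parent[x])
            self.parent[x] = root
            self.flip[x] = (self.flip[x] + f) % 2
            return root, self.flip[x]

    def greedy_retroscaffold(contigs, links):
        # links: (ci, cj, flip) with flip = 1 iff the retro-link implies
        # opposite orientations for ci and cj.
        dsu, degree, kept = DSU(contigs), {c: 0 for c in contigs}, []
        for ci, cj, flip in links:
            (ri, fi), (rj, fj) = dsu.find(ci), dsu.find(cj)
            if degree[ci] >= 2 or degree[cj] >= 2 or ri == rj:
                continue          # would branch a chain or close a cycle
            dsu.parent[ri] = rj
            dsu.flip[ri] = (fi + fj + flip) % 2
            degree[ci] += 1
            degree[cj] += 1
            kept.append((ci, cj, flip))
        return kept               # an edge set forming orientable paths

    print(greedy_retroscaffold(['c1', 'c2', 'c3'],
                               [('c1', 'c2', 0), ('c2', 'c3', 1),
                                ('c3', 'c1', 0)]))   # third link rejected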

The effectiveness of retroscaffolding on a genome is dictated by the following factors.

LTR retrotransposon abundance: The ability to retroscaffold depends on the number of retro-links that can be established, which is limited by the number of detectable LTR retrotransposons in the genome. Note that this approach of exploiting the abundance of retrotransposons offers a respite from the traditional view that they are a source of complication in genome projects.

Presence of distinguishable LTRs: LTRs from different retrotransposons of the same "family" may share substantial sequence similarity. Therefore, it is essential to take into account other structural evidence specific to an insertion before establishing a retro-link between two contigs. Even if the same LTR retrotransposon is present in two different locations of a genome, it can be expected that the TSDs are different, because they correspond to the host genomic sequence at the site of insertion. It may still happen that a target genome contains retrotransposons of the same family in abundant quantities, so that other structural attributes become less distinguishable as well. If BAC-by-BAC sequencing is used, this situation can be alleviated by applying retroscaffolding to contigs corresponding to the same BAC (instead of across BACs), because the likelihood of the same family occurring multiple times at a BAC level is much smaller than


Table 1. Summary of the LTR retrotransposons identified in 4 maize BACs using LTR-par.

            GenBank     BAC Length   LTR retrotransposons in BAC
            Accession   (in bp)      Number   Length (in bp)   % bp
    BAC1    AC157977    107,631      3        29,578           27%
    BAC2    AC160211    132,549      6        60,391           46%
    BAC3    AC157776    147,470      8        73,099           50%
    BAC4    AC157487    136,932      6        57,783           42%

Table 2. LTR-par parameter settings.

    Parameter Name    Default Value    Description
    Dmin/Dmax         600/15,000 bp    Distance constraints between 5′ and 3′ LTRs (L2)
    T                 70%              % identity cutoff between 5′ and 3′ LTRs (L1)
    Lmin/Lmax         100/2,000 bp     Minimum/maximum allowed length of an LTR
    Match/mismatch    2/-5             Match and mismatch scores
    Gap penalties     6/1              Gap opening and continuation penalties

at a genome level.

Sequencing coverage: Retroscaffolding targets each sequencing gap that spans an inserted retrotransposon such that its flanking LTRs are represented in two different contigs. Henceforth, we will refer to such gaps as retro-gaps. Given that the length of such an insert ranges from 10-15 kbp (more, if it is a nested retrotransposon), the coverage at which the genome is sequenced is a key factor affecting the ability to retroscaffold. If the sequencing coverage is too high (e.g., 10X), there are likely to be so few (short) sequencing gaps that the need for any scaffolding technique diminishes, whereas at very low coverage (e.g., 1X) long sequencing gaps that may span entire LTR retrotransposons are likely to prevail.

3. PROOF OF CONCEPT OF RETROSCAFFOLDING ON MAIZE GENOMIC DATA

In this section, we provide a proof of concept for retroscaffolding. For this purpose, four finished maize BACs (listed in Table 1) were acquired from Cold Spring Harbor Laboratory14. The first step was to determine the LTR retrotransposon content of these BACs. LTR-par11, which is a program for the de novo identification of LTR retrotransposons, was used to analyze each BAC with the parameters specified in Table 2. Table 1 summarizes the findings.

As can be observed, the fraction of LTR retrotransposons in these BACs averages 42%, consistent with their estimated abundance in the genome.

The effect of sequencing at different coverages was assessed as follows. A program that "simulates" random shotgun sequencing over an arbitrary input sequence at a user-specified coverage was provided by Scott Emrich at Iowa State University5. Each run of the program produces a set (or sample) of fragments, along with the information of their originating positions. We ran this program on each BAC for coverages 1X through 10X, and for each coverage 10 samples were collected to simulate sequencing 10 such BACs. For each sample, using the knowledge of the fragments' originating positions, the set of all contiguous genomic stretches covered (and thereby the set of sequencing gaps) was determined. Ideally, assembling the sample would produce a contig for each contiguous stretch. Based on the placement information of the contigs on the BAC and that of the LTR pairs (Table 1) on the BAC, each LTR pair was classified into one of the following three classes (see Figure 3; a small coordinate-based sketch of this classification follows the list):

• CgC: both LTRs are contained in two different contigs,

• C_C: both LTRs are contained in the same contig, and

• GgX: at least one LTR is not contained by any contig (i.e., it is located in a gap).
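A minimal sketch of the classification, assuming contigs and LTRs are given as (start, end) coordinate intervals on the BAC (the interval bookkeeping is our own formulation, not the authors' code):

    def classify(ltr_pair, contigs):
        # ltr_pair: ((s5, e5), (s3, e3)), the intervals of the two LTRs.
        # contigs: list of (start, end) contiguous covered stretches.
        def host(interval):
            s, e = interval
            for k, (cs, ce) in enumerate(contigs):
                if cs <= s and e <= ce:    # LTR wholly inside this contig
                    return k
            return None                     # LTR falls in a sequencing gap
        h5, h3 = host(ltr_pair[0]), host(ltr_pair[1])
        if h5 is None or h3 is None:
            return 'GgX'
        return 'C_C' if h5 == h3 else 'CgC'

    print(classify(((100, 400), (10300, 10600)),
                   [(0, 5000), (9000, 12000)]))   # 'CgC'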


Fig. 3. Classification of LTR pairs based on the locations of sequencing gaps, LTRs, and contigs. Dotted lines denote sequencing gaps. Retro-links correspond to the class CgC.

Table 3. Classification of the LTR pairs in 4 BACs, with respect to a set of 10 shotgun samples obtained from each BAC at different coverages.

                ----------- BAC1 -----------    BAC2    BAC3    BAC4
    Coverage    CgC    C_C    GgX    CgC%       CgC%    CgC%    CgC%
    1X          16     1      13     53         83      63      63
    2X          26     0      4      87         95      77      92
    3X          25     3      2      83         100     97      100
    4X          27     3      0      90         100     88      100
    5X          24     6      0      80         95      93      95
    6X          22     8      0      73         83      76      98
    7X          19     11     0      63         83      61      100
    8X          18     12     0      60         77      64      67
    9X          16     14     0      53         48      50      60
    10X         7      23     0      23         37      31      43

In this classification scheme, it is easy to see that retro-links can be expected to be established only for CgC LTR pairs. Therefore, the ratio of the number of CgC LTR pairs to the total number of LTR pairs is indicative of the maximal value of retroscaffolding at a given coverage. We computed this ratio for each of the 4 BACs used in our experiments, by considering one coverage at a time and counting the LTR pairs in each of the three classes over all 10 samples. From Table 3, we observe that the ratio is maximal at 3X coverage for 3 out of the 4 BACs, and at 4X for the other BAC. This implies that a 3X/4X coverage project is expected to benefit most from the retroscaffolding approach. To understand these results intuitively, observe that a very high coverage has a high likelihood of sequencing an LTR retrotransposon region in its entirety, making retroscaffolding unnecessary, while a very low coverage results in a high likelihood of LTRs falling in gaps, making retroscaffolding ineffective. Both these expectations are corroborated in our experiments by the gradual increase in C_C and the decrease in GgX with increasing coverage in Table 3. The increase of C_C with coverage also indicates the amount of effort spent in sequencing retrotransposon-rich regions.

In our next experiment, we assess the potential savings that can be achieved at the finishing step through the information provided by retroscaffolding on gap content. Table 4 shows the number of gaps generated at various sequencing coverages, and the number of these that can be detected using retroscaffolding (i.e., retro-gaps). While the results are shown only for two BACs (due to lack of space), we observed a similar pattern in all four BACs. As each retro-gap corresponds to a potential region of the genome that may not necessitate finishing, the ratio of the number of retro-gaps to the total number of sequencing gaps indicates the potential savings achievable at the finishing step because of retroscaffolding. From the table we observe that this ratio ranges from 23%-40% for BAC1 and 24%-49% for BAC4, averaging over 34% savings for both BACs.

Table 4 also shows that sequencing BAC1 at a 6X coverage is expected to result in ~37 sequencing gaps, while sequencing at a 4X coverage and subsequently applying retroscaffolding is expected to result in an effective 39 gaps (65.7 − 26.6). This implies that through retroscaffolding it is possible to reduce the coverage from 6X to 4X on BAC1 without much loss of scaffolding information.


Table 4. Number of retro-gaps vs. all sequencing gaps. Measurements are averaged over all 10 samples of each of the two BACs.

                ----------- BAC1 -----------         ----------- BAC4 -----------
    Coverage    All gaps   Retro-gaps   %Retro-gaps   All gaps   Retro-gaps   %Retro-gaps
    1X          70.5       26.4         37.4          78.0       24.8         31.8
    2X          88.7       33.6         37.9          93.5       33.4         35.7
    3X          84.6       32.2         38.1          84.0       31.0         36.9
    4X          65.7       26.6         40.5          64.5       19.5         30.2
    5X          50.6       19.3         38.1          46.4       16.7         36.0
    6X          37.4       13.7         36.6          39.5       13.2         33.4
    7X          28.3       9.5          33.6          26.6       9.1          34.2
    8X          18.7       6.5          34.8          19.1       6.3          33.0
    9X          13.0       3.0          23.1          11.9       5.9          49.6
    10X         9.3        2.7          29.0          9.5        2.3          24.2

As retroscaffolding can be used independently of clone mate information, we are working on evaluating the collective effectiveness of both clone mate-based scaffolding and retroscaffolding. If similar results can be shown at a much larger scale of experimental data for a target genome, then retroscaffolding can be used to advocate low-coverage sequencing, directly impacting the sequencing costs of repetitive genomes.

4. A FRAMEWORK FOR RETRO-LINKING

We developed the following two-phase approach to retroscaffolding. In the first phase, retro-links are established between contigs that show "sufficient" evidence of spanning the two ends of the same LTR retrotransposon. Once retro-links are established, the process of scaffolding the contigs is the same as scaffolding based on clone mate information, i.e., each retro-link can be treated as equivalent to a clone mate pair that imposes distance and orientation constraints appropriate for LTR retrotransposon inserts. Therefore, in principle, any of the programs developed for the conventional contig scaffolding problem1, 9, 10, 17, 19 can be used to achieve retroscaffolding from the retro-linked contigs^b. In what follows, we describe our approach to establishing retro-links.

There are two types of retro-links that can be established among contig data: (i) those that correspond to LTR retrotransposons already known to exist in the genome of the target organism or closely related species, and (ii) those found de novo in the contig data. The first class of retro-links can be established by building a database of known LTR retrotransposons and detecting contigs that overlap with LTR sequences of the same retrotransposon. However, such a database of already known LTR sequences of a target genome may hardly be complete in practice. For this reason, the second class of retro-links, based on a de novo detection of LTR sequences in the contig data, is preferable. However, additional validation will be necessary to ensure the correctness of such retro-links.

In what follows, we describe the algorithmic framework we developed to establish retro-links based on already known LTR retrotransposons, and the results of applying it to maize genomic data.

4.1. Building a Database of LTR Pairs

Given that the entire genome of maize has not yet been assembled, the first step in our approach is to build a database of maize LTR pairs from previously sequenced maize genomic data. A set of 560 known full-length LTR retrotransposons and 149 solo LTRs^c was acquired from San Miguel16. In addition, a set of 470 maize BACs was downloaded from GenBank5. Because information about the LTR sequences within the full-length retrotransposons and BACs

^b For our experiments, we used the Bambus19 program.
^c Solo LTRs are typically the result of a deletion/recombination event at the site of an inserted LTR retrotransposon, in which only either the 5′ or the 3′ LTR (or a part of it) survives.


Table 5. Summary of LTR pairs predicted by LTR_par.

Input                      Number of sequences   Number of full-length predictions   Number of LTR pairs
LTR retrotransposons 16    560                   556                                 556
Solo-LTRs 16               149                   -                                   149
Maize BACs 5               470                   1,234                               1,234

Total                                                                                1,939

was not available, we used the LTR_par program to identify LTR retrotransposons and their location information. We did not include the LTRs identified in the four maize BACs listed in Table 1, so that they can be used as benchmark data for validating retroscaffolding.

Given a set of sequences, LTR_par identifies subsequences within each sequence that bear structural resemblance to full-length LTR retrotransposons. Desired values for structural attributes can be input as parameters; we used the values shown in Table 2. As part of each prediction, the locations of both the 5' and 3' LTRs are output. A prediction is made only if the identified region satisfies the LTR sequence similarity (L1) and LTR distance (L2) conditions. Based on the presence of other signals such as the TG..CA motif (L3) and TSDs (L4), each prediction is also associated with a "confidence level": a confidence level of 1 implies the presence of both L3 and L4, 0.5 implies either L3 or L4 but not both, and 0 implies only L1 and L2. In this paper, we use level 1 predictions, although we are currently evaluating other combinations of LTR pairs from across confidence levels. Table 5 shows the statistics over the resultant total of 1,939 LTR pairs.
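As a small illustration, a sketch of this confidence assignment follows; the boolean inputs are our own stand-ins for the L3 and L4 signals that LTR_par detects from the sequence itself.

```python
# Sketch: LTR_par-style confidence levels. A prediction already satisfies
# L1 (LTR similarity) and L2 (LTR distance); the level depends on the
# optional signals L3 (TG..CA motif) and L4 (TSDs).
def confidence_level(has_tg_ca_motif: bool, has_tsd: bool) -> float:
    if has_tg_ca_motif and has_tsd:
        return 1.0   # both L3 and L4 present
    if has_tg_ca_motif or has_tsd:
        return 0.5   # exactly one of L3, L4 present
    return 0.0       # only L1 and L2 hold

assert confidence_level(True, True) == 1.0   # the level used in this paper
```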

4.2. An Algorithm to Establish Retro-links

Let $C$ denote a set of $m$ contigs generated through an assembly of maize fragments corresponding to one BAC, and let $L$ denote the set of $n$ LTR pairs ($n = 1{,}939$ in Table 5). Our algorithmic framework performs the following steps:

• S1: Compute $P = \{(c, (l_{5'}, l_{3'})) \mid c \in C,\ (l_{5'}, l_{3'}) \in L,\ c \text{ contains } l_{5'} \text{ or } l_{3'} \text{ or both}\}$.

• S2: Construct a set $G = \{G_1, G_2, \ldots, G_n\}$ such that $\forall G_i \subseteq C$, $\forall c \in G_i$, $(c, (l_{5'}^{i}, l_{3'}^{i})) \in P$. Note that $G$ need not be a partition of $C$. We call each $G_i$ a contig group.

• S3: $\forall G_i \in G$, compute $R_i = \{(c_j, c_k) \mid c_j, c_k \in G_i,\ c_j \text{ and } c_k \text{ are retro-linked by } (l_{5'}^{i}, l_{3'}^{i})\}$.

A naive way to perform step S1 is to evaluate each of the $m \times n$ pairs of the form (contig, LTR pair), checking whether the contig contains one of the LTRs. The check can be performed through standard dynamic programming techniques for computing semi-global alignments, which take time proportional to the product of the lengths of the sequences being aligned. As reverse complemented forms also need to be considered, this approach involves $4 \times m \times n$ alignments in the worst case. We developed a run-time efficient method based on the observation that if two sequences align significantly, then they also have a "long" exact match between them (although the converse need not hold). Thus it is sufficient to evaluate only pairs of the form (contig, LTR) that have an exact match of a minimum cutoff length. For this purpose, we adapted a parallel algorithm for detecting maximal matches across DNA sequences that we had originally developed for a clustering problem 12. The algorithm runs in linear space and in run-time proportional to the number of output pairs. For each generated pair, an optimal semi-global alignment is computed, and only pairs whose alignments satisfy specified criteria are output. As pairs are output, the set G is computed as well, in constant time per pair (step S2).
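The sketch below illustrates the flow of S1 and S2 under simplifying assumptions: a fixed-length exact-match seed stands in for the maximal-match algorithm of ref. 12, and semi_global_score() is a placeholder for the alignment routine; all names are ours.

```python
# Sketch of steps S1 and S2: exact-match seeding filters candidate
# (contig, LTR) pairs before the (placeholder) alignment check.
from collections import defaultdict

SEED = 20  # minimum exact-match length used to generate candidate pairs

def seeds(seq, k=SEED):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def semi_global_score(a, b):
    # Placeholder: a real implementation computes an optimal semi-global
    # alignment in time proportional to |a| * |b|.
    return 1.0 if seeds(a) & seeds(b) else 0.0

def contig_groups(contigs, ltr_pairs, threshold=0.9):
    """S1: pair contigs with LTR pairs; S2: group contigs per LTR pair."""
    groups = defaultdict(set)            # index of LTR pair -> contig ids
    for cid, c in contigs.items():
        c_seeds = seeds(c)
        for i, (l5, l3) in enumerate(ltr_pairs):
            for ltr in (l5, l3, revcomp(l5), revcomp(l3)):
                if c_seeds & seeds(ltr) and semi_global_score(c, ltr) >= threshold:
                    groups[i].add(cid)   # G_i grows in constant time per pair
                    break
    return groups
```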

Steps S1 and S2 ensure that two contigs are paired if and only if they contain LTRs from the same LTR pair. To perform S3, it is therefore necessary only to establish additional structural evidence such as the presence of TSDs, PPT, PBS, and/or retrotransposon genes. The attributes to look for, however, depend on the location of the subsequences corresponding to the LTRs within the contigs — e.g., it may not be possible to look for


Fig. 4. Validation of two retro-links — between contigs c10 and c16 (copia element), and contigs c41 and c24 (gypsy element). Each panel shows the prediction on BAC1 (unknown) above the truth on BAC1 (assembled); vertically aligned ovals denote overlapping regions, and squares denote retrotransposon hits through tblastx against the GenBank nr database.

retrotransposon genic sequences if the LTR regions within the contigs are a suffix of one contig and a prefix of another (see Figure 2b). We perform S3 as follows: we concatenate each pair of contigs under consideration in each of the 4 possible orientation combinations, and run LTR_par on the concatenated sequence. A retro-link is established between a pair only if sufficient structural evidence is detected.
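A minimal sketch of this enumeration follows; the spacer separating the two contigs is our own device and its length is arbitrary.

```python
# Sketch: the 4 orientation combinations screened in step S3.
def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orientation_combinations(c1, c2, spacer="N" * 100):
    return [
        c1 + spacer + c2,                    # forward / forward
        c1 + spacer + revcomp(c2),           # forward / reverse
        revcomp(c1) + spacer + c2,           # reverse / forward
        revcomp(c1) + spacer + revcomp(c2),  # reverse / reverse
    ]

for candidate in orientation_combinations("ACGTACGT", "TTGGCCAA"):
    pass  # run the structural screen (LTR_par, in our framework) on each
```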

Preliminary Validations: We validated the retro-linking algorithm on BAC1 of Table 1 as follows. Shotgun fragments were experimentally sequenced at a 3X coverage of the BAC 14, and were assembled 5 using the CAP3 assembler 8. The resulting 45 contigs were input along with the 1,939 LTR pairs (in Table 5) to our retro-linking program. Note that the 1,939 LTR pairs do not include the 3 LTR pairs in BAC1 as identified by LTR_par (Table 1) — that way, the validation reflects an assessment of retro-linking under practical settings in which a target BAC sequence and its LTR pairs are unknown prior to the retroscaffolding step. The experiment resulted in 44 contig groups (= |G|), and upon investigation we found that most of the groups were "equivalent", i.e., the corresponding LTR pairs share a significant sequence identity (> 95%). The equivalent groups were merged.

The subsequent step was to evaluate each contig pair of a merged group for a valid retro-link. For detecting retrotransposon genic sequences in contigs, we queried the contigs against the GenBank nr database using the tblastx program. Other structural attributes were detected using LTR_par. This step resulted in only two retro-linked pairs: (c10 → c16) and (c24 → c41), with the arrows implying the order in which the contigs can be expected to occur along the "unknown" BAC sequence (BAC1) in the specified orientations. We verified the predictions by aligning each of these 4 contigs directly against the known sequence of BAC1 and found that the retroscaffolding prediction is correct (see Figure 4).

5. SCAFFOLDING WITH CLONE MATES AND RETRO-LINKS

Retroscaffolding differs from conventional contig scaffolding in that it relies on the presence of LTR retrotransposons instead of clone mate information. While this suggests that either technique can be applied independently of the other, their outputs need not be mutually exclusive — i.e., it is possible that the relative ordering and orientation between the same two contigs is implied by both techniques. While such redundancies in output can be used as additional supporting evidence for bolstering the validity of scaffolding, the actual value added by either of these two techniques is dictated by its respective unique share of the output scaffolding. Ideally, we would hope that these two outputs


Table 6. Results of (i) scaffolding contig data for BAC4 (136,932 bp) using clone mate information, (ii) retroscaffolding, and (iii) combined scaffolding using both clone mate and retro-link information.

                                      Clone mate scaffolding   Retroscaffolding   Combined scaffolding
Number of scaffolds                   32                       5                  27
Total span of scaffolds (bp)          120,350                  65,605             138,356
Average span of scaffold (bp)         3,760                    6,246              4,457
Number of contig pairs scaffolded     42                       10                 71
Number of assembly gaps covered       22                       17                 28

complement one another.

We assessed the effect of a combined application of retroscaffolding and clone mate based scaffolding on real maize genomic contig data as follows: 62 contigs were generated by performing a CAP3 assembly over a 3X coverage set of fragments sequenced from BAC4. Ideally, all 62 contigs would be part of just one "scaffold" if the contigs were all to be ordered along the target BAC.

The scaffolding achievable from just the clone mate information was first assessed by running the Bambus 19 program on the contigs. This resulted in 32 scaffolds spanning an estimated total of 120,350 bp, each with an average span of 3,760 bp. (Note that the "span" of a scaffold output by Bambus is only an estimate, because it includes the size estimated for sequencing gaps between the scaffolded contigs.) We then assessed the scaffolding achieved by retroscaffolding the contig data — retro-links were first established using the framework described in Section 4 and the output was transformed into input for Bambus. While retroscaffolding resulted in many fewer scaffolds (5), the total span was smaller (65,605 bp) when compared to clone mate scaffolding. However, the average span of each scaffold was almost twice as large in retroscaffolding. This is as expected, because the distance constraint used for each retro-link ([5000, 15000]) was longer than that of clone mate links ([2200, 3800]).

In the next step, we input both the retro-link and clone mate information with their respective distance and orientation constraints to Bambus. This combination resulted in fewer scaffolds (27) and a longer total span (138,356 bp) than was achieved by just clone mate scaffolding — implying that retroscaffolding provides added information that is not provided by clone mate information. The above results are summarized in Table 6. The table also shows the number of contig pairs scaffolded as a result of the respective scaffolding strategies; the higher this number is, the more inclusive the scaffolding is on the contigs — ideally, we would expect all contigs to be in one scaffold, thereby implying $\binom{62}{2}$ scaffolded contig pairs.

We also assessed the individual effect of these scaffolding techniques on "assembly gaps": each of the 62 contigs was individually aligned to the assembled BAC4 sequence, and the stretch along which each has a maximum alignment score was selected as its locus on the BAC. A maximal stretch along the BAC not covered by any of the 62 contigs was considered an "assembly gap"; there were a total of 42 such gaps. For each of the three scaffolding strategies (i.e., clone mate based, retroscaffolding, and combined), an assembly gap is said to be "covered" (alternatively, "not covered") if there exists a (alternatively, does not exist any) pair of scaffolded contigs spanning the gap. Based on this definition, the number of covered assembly gaps was 22 for clone mate scaffolding, 17 for retroscaffolding, and 28 for the combined scaffolding. This further demonstrates the value added by retroscaffolding.
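A small sketch of this bookkeeping, assuming the contig loci on the finished BAC and the set of scaffolded contig pairs have already been computed; the data structures are our own.

```python
# Sketch: counting assembly gaps "covered" by scaffolded contig pairs.
def covered_gaps(gaps, loci, scaffolded_pairs):
    """gaps: list of (start, end) uncovered stretches along the BAC;
    loci: contig id -> (start, end) locus on the BAC;
    scaffolded_pairs: iterable of frozenset({contig_a, contig_b})."""
    count = 0
    for gap_start, gap_end in gaps:
        for pair in scaffolded_pairs:
            a, b = tuple(pair)
            left, right = sorted((loci[a], loci[b]))
            # the pair spans the gap if one contig ends at or before the
            # gap and the other starts at or after it
            if left[1] <= gap_start and right[0] >= gap_end:
                count += 1
                break
    return count
```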

6. DISCUSSION

Our preliminary studies on maize genomic data (Section 3) and the experimental results on maize contig data (Section 5) demonstrate a proof of concept and the value added by retroscaffolding in genome assembly projects. For retroscaffolding to be effective in a genome project, it is necessary that the LTR retrotransposons in the genome be both abundant and distinguishable. LTR sequences within the same family of LTR retrotransposons are harder to distinguish, and repeat-rich genomes (e.g., maize) could


have numerous copies of the same family scattered across the genome. Therefore, applying retroscaffolding at the genome level may cause several spurious retro-links to be established, thereby confounding the process of scaffolding. It is for this reason that retroscaffolding is more suited for genome projects involving hierarchical (e.g., BAC-by-BAC) sequencing. Retroscaffolding can also be used to order and orient BACs, if the overlapping ends of two consecutive BACs along a tiling path span an LTR retrotransposon.

In genome projects which generate clone mate information, the scaffolding information derived from retroscaffolding may in part already be provided by clone mates. In the worst case, even if no new scaffolding information is provided by retroscaffolding, we can still benefit from it in two ways: (i) we will have information about not only the genomic loci but also the composition of the assembly gaps covered by retroscaffolding, as they are expected to contain sequences corresponding to a retrotransposon insert, and we can therefore prioritize the gaps to finish based on this information; and (ii) the scaffolding output by retroscaffolding can be used as supporting evidence to validate the output of clone mate based scaffolding.

Retroscaffolding will also be useful in projects which do not generate clone mate information. New sequencing technologies such as 454 sequencing 13, which do not generate clone mate information, are becoming increasingly popular due to their high throughput and cost-effectiveness. Such sequencing technologies may be an appropriate choice for low-budget sequencing projects, and retroscaffolding could make the task of carrying out the assembly in such projects practically feasible.

Retroscaffolding also provides a mechanism to explore the feasibility of lower coverage sequencing in genome projects. While reducing the sequencing coverage to as low as 3X may expose more gaps that span LTR retrotransposons in a target genome, it also implies that there is less redundancy in the fragment data. This might affect the quality of the output assembly, especially of those contigs corresponding to the non-repetitive regions of the genome. To circumvent this issue in a hierarchical sequencing project, we propose the following iterative approach to sequencing and assembly: first, sequence all the BACs at a low coverage and assemble them. If a subsequent retroscaffolding reveals low repeat content in a subset of the input BACs, then perform additional coverage sequencing selectively on these BACs, and reassemble them using the fragments from all sequencing phases. In practice, this procedure can be repeated until sufficient information is gathered to completely assemble and scaffold each BAC. This approach provides a cost-effective mechanism to sequence repeat-rich genomes without compromising the quality of the output assembly.

7. CONCLUSIONS

Genome projects for several economically important plant crops such as maize, barley, sorghum, and wheat are either already underway or are likely to be initiated over the next few years. Most of these plant genomes contain an enormous number of retrotransposons that are not only expected to confound the assembly process, but are also expected to consume the bulk of the sequencing and finishing budget. In contrast to this perspective, the retroscaffolding approach proposed in this paper offers the possibility of exploiting the abundance of LTR retrotransposons, thus serving three main purposes: (i) to scaffold contigs that are output by an assembler, (ii) to guide the process of finishing by providing information on the unfinished regions of the genome, and (iii) to introduce the possibility of reducing sequencing coverage without loss of information regarding the sequenced genes and their relative ordering. Given that sequencing and finishing account for most of the expenditures in genome projects, continued research in developing this new methodology further could have a high impact.

Several developments are planned as future work on this research. Specifically, we plan to evaluate the collective effectiveness of retroscaffolding and clone mate based scaffolding at a larger scale. The algorithmic framework for retroscaffolding is still at an early stage of development; further validation of the framework on sequenced genomes and at much larger scales is essential to ensure an effective and high-quality application of our methodology in forthcoming complex genome projects. To


this end, the application of retroscaffolding in the ongoing maize genome project will provide a good starting point.

ACKNOWLEDGMENTS

We thank the reviewers of an earlier version of this manuscript for providing several useful insights. We are grateful to Scott Emrich for the scripts to simulate shotgun sequencing and for discussions. We thank Richard McCombie for the maize BAC data, and Phillip San Miguel for the maize retrotransposon data. This research was supported by the NSF award DBI #0527192 and an IBM Ph.D. Fellowship to A. Kalyanaraman.

References

1. S. Batzoglou, D.B. Jaffe, K. Stanley, J. Butler, et al. ARACHNE: a whole-genome shotgun assembler. Genome Research, 12(1):177-189, 2002.
2. J.L. Bennetzen. The contributions of retroelements to plant genome organization, function and evolution. Trends in Microbiology, 4(9):347-353, 1996.
3. J.M. Coffin, S.H. Hughes, and H.E. Varmus. Retroviruses. Cold Spring Harbor Laboratory Press, Plainview, NY, 1997.
4. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860-921, 2001.
5. S.J. Emrich. Personal communication, 2005.
6. E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, et al. Initial sequencing and analysis of the human genome. Nature, 409:860-921, 2001.
7. R.B. Flavell. Repetitive DNA and chromosome evolution in plants. Philosophical Transactions of the Royal Society of London B, 312:227-242, 1986.
8. X. Huang and A. Madan. CAP3: A DNA sequence assembly program. Genome Research, 9(9):868-877, 1999.
9. D.H. Huson, K. Reinert, and E. Myers. The greedy path-merging algorithm for sequence assembly. In Proc. International Conference on Research in Computational Molecular Biology (RECOMB), pages 157-163, 2001.
10. D.B. Jaffe, J. Butler, S. Gnerre, E. Mauceli, et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Research, 13:91-96, 2003.
11. A. Kalyanaraman and S. Aluru. Efficient algorithms and software for detection of full-length LTR retrotransposons. In Proc. IEEE Computational Systems Bioinformatics Conference, pages 56-64, 2005.
12. A. Kalyanaraman, S. Aluru, V. Brendel, and S. Kothari. Space and time efficient parallel algorithms and software for EST clustering. IEEE Transactions on Parallel and Distributed Systems, 14(12):1209-1221, 2003.
13. M. Margulies, M. Egholm, W.E. Altman, S. Attiya, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437:376-380, 2005.
14. R. McCombie. Personal communication, 2005.
15. B.C. Meyers, S.V. Tingey, and M. Morgante. Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Science, 274:765-768, 1998.
16. P.S. Miguel. Personal communication, 2005.
17. J.C. Mullikin and Z. Ning. The Phusion assembler. Genome Research, 13:81-90, 2003.
18. NSF. NSF, USDA and DOE award $32 million to sequence corn genome. http://www.nsf.gov/news/news_summ.jsp?cntn_id=104608&org=BIO&from=news, Press Release 05-197, 2005.
19. M. Pop, D.S. Kosack, and S.L. Salzberg. Hierarchical scaffolding with Bambus. Genome Research, 14:149-159, 2004.
20. P. SanMiguel, B.S. Gaut, A. Tikhonov, Y. Nakajima, and J.L. Bennetzen. The paleontology of intergene retrotransposons of maize. Nature Genetics, 20:43-45, 1998.


WHOLE GENOME COMPOSITION DISTANCE FOR HIV-1 GENOTYPING

Xiaomeng Wu, Randy Goebel

Department of Computing Science, University of Alberta

Edmonton, Alberta T6G 2E8, Canada

Email: xiaomeng, [email protected]

Xiu-Feng Wan

Systems Biology Laboratory, Department of Microbiology, Miami University, Oxford, OH 45056, USA.

E-mail: [email protected]

Guohui Lin*

Department of Computing Science, University of Alberta

Edmonton, Alberta T6G 2E8, Canada

Email: [email protected]

Existing HIV-1 genotyping systems require a computationally expensive phase of multiple sequence alignments, and the alignments must be of sufficiently high quality for accurate genotyping. This is particularly challenging when the number of strains is large. Here we propose a whole genome composition distance (WGCD) to measure the evolutionary closeness between two HIV-1 whole genomic RNA sequences, and use that measure to develop an HIV-1 genotyping system. Such a WGCD-based genotyping system avoids multiple sequence alignments and does not require any prior knowledge about the evolutionary rates. Experimental results showed that the system is able to correctly identify the known subtypes, sub-subtypes, and individual circulating recombinant forms.

Keywords: String composition; Whole genome phylogenetic analysis; Neighbor joining; HIV-1 genotyping; Circulating recombinant form

1. INTRODUCTION

Acquired Immune Deficiency Syndrome (AIDS) is caused by a virus known as the human immunodeficiency virus (HIV). The first case of AIDS was reported in the United States in 1981, and the disease has since become a major worldwide epidemic. By the end of 2005, over 900,000 people had been diagnosed with HIV infection, and more than 2.3 million were estimated to be HIV positive (http://www.who.int).

There are two types of HIV: HIV-1 and HIV-2. HIV-1 is more pathogenic than HIV-2, and most HIV infections are caused by HIV-1 1. HIV is a retrovirus, which has one RNA genome segment encoding 9 genes, including env, gag, nef, pol, rev, tat, vif, vpr, and vpu (for HIV-1) or vpx (for HIV-2). Similar to other RNA viruses, HIV is notorious for its fast mutation and recombination. The identification of emerging genotypes, due to HIV mutation and recombination, presents a major challenge for the development of HIV vaccines and anti-HIV

medicines 2. In the last two decades, many different genotypes of HIV-1 have been reported, largely consisting of three major groups: M, O and N 3. Further analyses have characterized group M of HIV-1 into 9 subtypes (A-D, F-H, J, and K) and at least 16 circulating recombinant forms (CRFs) (http://hiv-web.lanl.gov). These subtypes represent different lineages of HIV-1 and have some geographical associations (Figure 1). Improved genotyping information will not only enhance the development of anti-HIV drugs and HIV vaccines, but also help us understand the epidemics of HIV infection. For example, the previously designated subtypes E and I were later discovered to be recombinants. It is thus clear that an efficient and effective genotyping system for HIV is essential for HIV study.

Currently, however, most genotyping methods are complicated and laborious. For a genotyping process that uses HIV-1 whole genomic sequences, there are two main challenges: (1) The mutation rates

*To whom correspondence should be addressed.


Fig. 1. This diagram illustrates the different levels of HIV-1 classification. HIV-1 is divided into three groups (M, N, and O), and group M is further divided into 9 subtypes (A, B, C, D, F, G, H, J, K).

between HIV-1 genomes are not equal. For example, the genetic distance between genotypes is found to be greater in more polymorphic genes (e.g., the envelope gene) than in others. As a result, a phylogenetic relation based on a single gene may not accurately reflect the evolutionary patterns of HIV-1. Some recent methods propose the use of partial genome or whole genome sequences to conduct the phylogenetic analysis 4. It is expected that recombination will complicate these genotype interpretation systems. (2) As the number of HIV-1 genomes increases, genotyping methods must become faster and more robust. Almost all currently available traditional genotyping methods are based on multiple sequence alignments 2, 5, 4. For example, most of them rely on alignments at least 300-500 characters long, which can then provide enough sequence information, even though the aligned length could vary for different regions of a genome. The importance of alignment length was emphasized in RevML 6, where it is reported that alignments with fewer than 400 characters generated trees with problems, such as mixing of sub-subtypes between A1 and A2, F1 and F2 and sometimes K, as well as mixing of B and D. However, we believe that the performance of multiple sequence alignments will decrease as the number of sequences increases. There are a number of genotyping systems especially designed for anti-retrovirus drug resistance studies 7, 2. These genotyping methods either employ rule-based algorithms or are based on a genotype-phenotype database. However, these methods may generate confusing instead of confirmatory information, because of the continually updated sequence information in the database 2.

In this paper we propose a novel method, called the whole genome composition distance (WGCD), that avoids multiple sequence alignment in measuring the evolutionary distance between two HIV-1 whole genomic sequences. Subsequently, a distance-based phylogenetic construction method, Neighbor-Joining (NJ) 8, is adapted to build the phylogenetic clades. Our results show that the proposed approach can efficiently construct phylogenetic clades for a set of 42 HIV-1 strains that are exactly the same as those published by the Los Alamos National Laboratory (LANL), which required intensive computation and human curation 6.

The rest of the paper is organized as follows: In Section 2, we introduce in detail the whole genome composition distance computation using complete genomic sequences. Section 3 outlines the flow of operations in this novel HIV-1 genotyping system. An HIV-1 dataset that includes in total 42 whole genomic sequences is introduced in Section 4, as are the experimental results and our discussion. Experimental results and discussion on the individual CRF identification are also included in Section 4. We conclude the paper in Section 5.

2. WHOLE GENOME COMPOSITION DISTANCE

There are several existing HIV-1 genotyping systems, including the Stanford HIV-Seq program (http://hivdb.stanford.edu), the NCBI Genotyping Program (http://www.ncbi.nih.gov/projects/genotyping), the Los Alamos Recombinant Identification Program (http://hiv-web.lanl.gov/RIP/RIPsubmit.html), the European-based Subtype Analyzer Program (http://pgv19.virol.ucl.ac.uk/download/star-linux.tar) 9, and a recently developed system (http://www.bioafrica.net/subtypetool/html) described in 4. All these systems employ a computationally intensive phase


of multiple sequence alignments or similar to align the query sequences, with priority given to some fragments that are known to be more conserved than others. Consequently, when limited to single genes in the HIV-1 strains, these systems all perform acceptably well. However, the performance is highly dependent on the accumulated knowledge of HIV-1 strains, such as the fact that the HIV-1 pol gene is highly conserved, so that at most two gaps can be introduced into the alignment.

On the other hand, different levels of conservation within different HIV-1 genes pose additional difficulties in the phylogenetic analysis: analyses using different genes might produce inconsistent and even erroneous results (the same situation occurs in numerous HIV databases) 10, and the multiple sequence alignments become more challenging, making MSA-based phylogenetic analysis using multiple genes computationally prohibitive.

In this study, we explore the possibility of HIV-1 phylogenetic analysis using complete genomic sequences, avoiding the computationally intensive phase of multiple sequence alignments. We adapt some ideas for whole genome phylogeny construction from the literature 11, 12 and propose a novel composition distance to measure the evolutionary closeness between two HIV-1 whole genomic sequences. After the distance matrix on the set of HIV-1 whole genomic sequences is thus computed, the Neighbor-Joining method 8 is utilized to build the phylogenetic clades. We note that there is a rich literature on whole genome phylogenetic analysis, where many approaches have been proposed for estimating the pairwise distance between two whole genomes. To name a few, there are approaches based on string composition 11-14, approaches based on text compression 15-18, and approaches based on gene content 19-22.

We use the whole genomic RNA sequences of HIV-1 to validate our method. Given an RNA sequence $R$, in our case representing an HIV-1 strain, the single nucleotide composition of $R$ is a vector of the 4 nucleotide frequencies in sequence $R$. Namely, for each type of nucleotide $nu$, the frequency of $nu$ in $R$ is the number of occurrences of $nu$ in sequence $R$ divided by the total number of nucleotides in $R$. Likewise, the dinucleotide composition of $R$ is a vector of the $4^2 = 16$ dinucleotide frequencies in sequence $R$. In general, for each length-$k$ nucleotide segment (there are possibly $4^k$ of them), its frequency in $R$ is calculated as the number of its occurrences in sequence $R$ divided by the total number of overlapping length-$k$ nucleotide segments in $R$. For simplicity, the length-$k$ nucleotide segment composition of $R$ is called the $k$-th composition of $R$ and denoted as $C_k(R)$.
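For concreteness, a minimal sketch of this definition, storing only the length-k segments that actually occur (the helper names are ours):

```python
# Sketch: the k-th composition C_k(R) as a sparse map from each observed
# length-k segment to its frequency among overlapping windows.
from collections import Counter

def composition(seq: str, k: int) -> dict:
    n = len(seq) - k + 1                  # number of overlapping windows
    counts = Counter(seq[i:i + k] for i in range(n))
    return {segment: c / n for segment, c in counts.items()}

c2 = composition("ACGACGT", 2)            # e.g., c2["AC"] == 2/6
```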

The $k$-th composition of $R$ can be used as a signature for strain $R$. Given two strains $R$ and $S$, one may define, for example, the Euclidean distance between $C_k(R)$ and $C_k(S)$, each a $4^k$-dimensional vector, to measure the evolutionary distance between $R$ and $S$. There are several similar measures proposed in the literature for whole genome phylogenetic analysis, including the single amino acid composition, the dipeptide composition 11, an SVD-based measure using tripeptide (and tetrapeptide) composition 13, 23, the complete information set (CIS) 12, and the composition vector (CV) 14. In fact, the CIS method defines an information discrepancy between $C_k(R)$ and $C_k(S)$, and uses its normalized version to measure the evolutionary distance between $R$ and $S$ 12. Based on our previous research on whole genome phylogenetic analysis for Avian Influenza Viruses (AIV), we propose to use the Euclidean distance between $C_k(R)$ and $C_k(S)$ to measure the evolutionary distance. That is, assuming that $C_k(R) = (f_1, f_2, \ldots, f_{4^k})$ and $C_k(S) = (g_1, g_2, \ldots, g_{4^k})$, where $f_i$ and $g_i$ are the frequencies of a common length-$k$ nucleotide segment in $R$ and $S$, respectively, the Euclidean distance $d_k(R, S)$ is

$$ d_k(R, S) = \sqrt{ \sum_{i=1}^{4^k} (f_i - g_i)^2 }. \qquad (1) $$

Note that in general $C_k(R)$ and $C_{k+1}(R)$ both contain evolutionary information about $R$, but some information hidden in one composition is not necessarily revealed by the other. Obviously, the single nucleotide composition is one of the most information-rich compositions. The dinucleotide composition reveals the single nucleotide composition and some amount of extra evolutionary information not included in the single nucleotide composition. Likewise, the $(k+1)$-th composition is expected to contain some additional evolutionary information not included in the $k$-th composition. Nonetheless, this additional information is expected to decrease with increasing $k$ (see also the Experimental Results and Discussion). For these reasons, we propose to use


$(C_1(R), C_2(R), \ldots, C_k(R))$, for a sufficiently large $k$, to represent strain $R$. Subsequently, assuming strain $S$ is represented as $(C_1(S), C_2(S), \ldots, C_k(S))$, the Euclidean distance between $R$ and $S$ is $d(R, S)$, defined as

$$ d(R, S) = \sqrt{ \sum_{i=1}^{k} d_i^2(R, S) }, \qquad (2) $$

where $d_i(R, S)$ is defined in Equation (1). Such a distance is called the whole genome composition distance (WGCD) between $R$ and $S$.
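A direct transcription of Equations (1) and (2) into code, reusing composition() from the previous sketch; segments absent from one strain implicitly contribute frequency 0.

```python
# Sketch: Equations (1) and (2). Assumes composition() from the sketch above.
import math

def d_k(comp_r: dict, comp_s: dict) -> float:
    # Equation (1): Euclidean distance over the union of observed segments.
    segments = comp_r.keys() | comp_s.keys()
    return math.sqrt(sum((comp_r.get(x, 0.0) - comp_s.get(x, 0.0)) ** 2
                         for x in segments))

def wgcd(r: str, s: str, k_max: int = 80) -> float:
    # Equation (2): aggregate d_k over k = 1, ..., k_max.
    return math.sqrt(sum(d_k(composition(r, k), composition(s, k)) ** 2
                         for k in range(1, k_max + 1)))
```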

3. WGCD-BASED HIV-1 GENOTYPING

The WGCD-based HIV-1 genotyping system can be used as a tool, in addition to existing systems, typically in cases where whole strains are available. In fact, the genotyping system described here has been developed for analyzing whole strains in order to avoid the computationally intensive phase of multiple sequence alignments. The method can be roughly partitioned into the following four steps:

In the first step, a fixed maximum nucleotide segment length $k$ is used, which in our case is set to 80, to calculate the composition $(C_1(S), C_2(S), \ldots, C_k(S))$ for each HIV-1 whole strain $S$. Note that it is impossible to count the frequency of every length-80 nucleotide segment, as there would be $4^{80} \approx 1.5 \times 10^{48}$ of them. Instead, the counting is done only for those length-80 nucleotide segments that actually occur in $S$. This counting is computed through several linear scans of the whole strain $S$, and the frequencies of same-length nucleotide segments are ordered alphabetically. In the second step, for every pair of strains $S$ and $T$, their evolutionary distance $d(S, T)$ is computed using Equation (2). Note that the computation is done with a linear scan of the composition vectors for $S$ and $T$ at the same time, where the frequency of a non-occurring segment is automatically treated as 0. The resultant pairwise evolutionary distance matrix $M = (d(S, T))$ is fed to the Neighbor-Joining method in the third step to build a phylogenetic tree on the strains. The fourth step is the standard bootstrapping used in phylogeny construction methods 24: at each iteration, 30% of the genetic sequences are randomly mutated and a bootstrapping tree is built using exactly the same method as stated in steps one to three. In total, 200 such bootstrapping trees are built and their consensus is produced as the final phylogenetic clades. Note that bootstrapping is used to test whether the output phylogeny is stable (or confident) with respect to the input sequences, given that there might be sequencing errors and some strains are incomplete sequences.
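A minimal sketch of steps two to four, assuming wgcd() from the previous sketch; the neighbor-joining step is delegated to Biopython's implementation here (our assumption, not necessarily the tool used in the paper), and the 30% mutation step is simplified.

```python
# Sketch: distance matrix -> NJ tree -> bootstrap, under stated assumptions.
import random
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

def nj_tree(names, seqs):
    # Lower-triangular matrix (including the zero diagonal), the layout
    # Biopython's DistanceMatrix expects.
    matrix = [[wgcd(seqs[i], seqs[j]) for j in range(i)] + [0.0]
              for i in range(len(names))]
    return DistanceTreeConstructor().nj(DistanceMatrix(names, matrix))

def mutate(seq, rate=0.30):
    # Simplified stand-in for the bootstrap perturbation: randomly mutate
    # 30% of the positions.
    out = list(seq)
    for i in random.sample(range(len(out)), int(rate * len(out))):
        out[i] = random.choice("ACGT".replace(out[i], ""))
    return "".join(out)

def bootstrap_trees(names, seqs, iterations=200):
    # 200 such trees are built; their consensus gives the final clades.
    return [nj_tree(names, [mutate(s) for s in seqs])
            for _ in range(iterations)]
```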

4. EXPERIMENTAL RESULTS AND DISCUSSION

4.1. Datasets

We downloaded a set of 42 HIV-1 strains based on a review paper 6; these were used as reference sequences in genotyping with an enhanced version of the fastDNAml maximum likelihood tree fitting (RevML) and site rate estimation (RevRates) codes 6. The trees created in 6 require several executions of RevML and RevRates using different initial site mutation rates, followed by the use of modified global and local site mutation rates. Trees on both single genes and full-length env, pol, and gag were constructed, and the single gene trees show problems when alignments of fewer than 400 characters occurred. In addition to these 42 HIV-1 strains, 2 CPZ strains (CPZ is a subtype of Simian immunodeficiency virus (SIV), which is believed to share a common ancestor with HIV-1 and HIV-2) were included for outgrouping purposes, as stated in the review paper 6. The average length of these 44 whole genomic RNA sequences is 9,019 bp, with a maximum length of 9,829 bp and a minimum length of 8,349 bp. 15 recombinant strains were also downloaded to test the recombinant identification ability of the WGCD-based HIV-1 genotyping system.

4.2. Results on Maximum Nucleotide Segment Length Determination

As discussed earlier, for any strain $S$, the $(k+1)$-th composition is expected to contain some additional evolutionary information not included in the $k$-th composition, yet this additional amount decreases with increasing $k$. We set up experiments to validate this expectation, and subsequently to determine the maximum length $k$ we should use in the WGCD-based genotyping. Given a pair of strains $S$ and $T$, we calculate the composition vectors for them, respectively, i.e., $(C_1(S), C_2(S), \ldots, C_{80}(S))$


Fig. 2. The average distance contribution percentages of individual compositions to the WGCD with respect to different values of k: (a) k = 10; (b) k = 15; (c) k = 20; (d) k = 30; (e) k = 50; (f) k = 80. Green lines plot 5 randomly chosen distance contribution percentages and blue lines plot the minimum and maximum distance contribution percentages.

Fig. 3. The Neighbor-Joining phylogenetic clades constructed from the WGCD for different values of k: (a) k = 10; (b) k = 15; (c) k = 20; (d) k = 30; (e) k = 50; (f) k = 80. Leaves are labeled by subtype and GenBank accession (e.g., F2+AY371158, B+K03455, CPZ+U42720).

and $(C_1(T), C_2(T), \ldots, C_{80}(T))$. For each $i \le 80$, we compute $d_i(S, T)$ according to Equation (1). We set $k = 10, 15, 20, 30, 50, 80$ to compute the WGCD between $S$ and $T$, $d(S, T)$, using Equation (2). Subsequently, $c_i(S, T) = d_i^2(S, T) / d^2(S, T)$ is the distance contribution percentage of the $i$-th composition to the WGCD. We took the average of $c_i(S, T)$ over all pairs of strains; this average, denoted $\bar{c}_i$, is the average distance contribution percentage of the $i$-th composition to the WGCD. For $k = 10, 15, 20, 30, 50$ and $80$, these $\bar{c}_i$'s are plotted as red lines in Figures 2(a)-2(f), where one can see that the tail portion of the line corresponding to $k = 80$ is approximately horizontal, but this is not the case for $k = 10, 15, 20, 30$, or $50$. Note that we may regard the difference $(\bar{c}_{i+1} - \bar{c}_i)$ as the extra contribution of the $(i+1)$-th composition compared to the $i$-th composition. Therefore, the approximately horizontal tail portion indicates that we may neglect the evolutionary information carried by nucleotide segments longer than 80.

4.3. Phylogenetic Analysis Results

We constructed the phylogenies using k = 10, 15, 20, 30, 50 and 80; these trees are illustrated in Figures 3(a)-3(f), respectively. Consistent with the contribution plots for the individual compositions, the strains are classified into the phylogenetic clades better with increasing k. For example, in Figures 3(a) and 3(b), AJ249238 belongs to sub-subtype F1 but is misclassified into F2; this problem is resolved in Figures 3(c) and 3(d), when k increases to 20 and 30, respectively. Nonetheless, subtype K is mis-inserted into subtype F in both of the phylogenies in Figures 3(c) and 3(d). When k = 80, a phylogeny which maps to all known evolutionary relationships is obtained (Figure 3(f)). For example, sub-subtypes A1 and A2 are adjacent to each other; sub-subtypes F1 and F2 are adjacent to each other; subtypes B and D are closer to each other than to other subtypes; and groups M, N and O are well-separated.

For comparison, we also conducted a control experiment using the multiple sequence alignments computed by ClustalW (http://www.ebi.ac.uk/clustalw/). We uploaded all the 44 genomic sequences to the ClustalW webserver. The guide tree generated by ClustalW is shown in Figure 4. One can see that there is a problem with the outgrouping CPZ SIV strains, and subtype C is misplaced outside of group M.

We have also used the software Biolayout 25, 26 (http://www.biolayout.org/) to display the phylogenetic clades. For this purpose, we removed the outgrouping CPZ SIV strains and sorted the pairwise distances between all 42 HIV-1 strains in increasing order. Scanning through this order, we selected the 84 minimum distances, which involve all 42 strains (84 is the minimum number such that this many distances involve all 42 strains), and sent them to Biolayout. Figure 5 shows the graphical view of these distances, which clearly demonstrates 11 clades corresponding to 13 subtypes. It can also be seen that subtypes A1 and A2, F1 and F2, B and D, D and G, and G and A2 are closer than the others.
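The selection of the 84 distances can be sketched as follows, assuming the pairwise WGCD values are given; 84 is simply the point at which every strain has appeared in at least one selected pair.

```python
# Sketch: pick the smallest pairwise distances until every strain is
# covered by at least one selected pair (84 pairs for the 42-strain set).
def smallest_covering_distances(dist):
    """dist: dict mapping frozenset({a, b}) -> WGCD value."""
    strains = {s for pair in dist for s in pair}
    chosen, seen = [], set()
    for pair, d in sorted(dist.items(), key=lambda kv: kv[1]):
        chosen.append((pair, d))
        seen |= pair
        if seen == strains:     # all strains now appear in some pair
            break
    return chosen
```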

Fig. 4. The ClustalW guide tree on the 42 HIV-1 strains and two CPZ SIV strains.

4.4. Discussion

4.4.1. Strains as Concatenations of Gene Sequences

We have also used the concatenation of all 9 gene sequences to represent a strain, and then used these to construct a phylogeny. For the set of the above 42 pure subtype strains and 2 CPZ strains, the resulting bootstrapping phylogenetic tree using k = 80 is shown in Figure 6. An interesting observation is that this phylogeny is different from the one constructed using complete genomic sequences — it contains several misplaced strains in subtypes A2 and C. In fact, this misplacement is consistent across all values of k we tested, namely, k = 10, 15, 20, 30, 50 and


Fig. 5. The Biolayout graphical view of the phylogenetic clades of the 42 HIV-1 strains, using the 84 smallest distances computed by the WGCD method using nucleotide strings of length 1 to 80.

80. We have also obtained the distance contribution percentages of all the compositions, and the plots (as in Figure 2) are very "horizontal" when k > 30. This observation might indicate that the intergenic regions in the complete genomic sequences of HIV-1 strains also contain evolutionary information that can be used for genotyping purposes.

Fig. 6. The Neighbor-Joining phylogeny with 200 bootstrapping iterations on the WGCD using k = 80, where strains are represented as concatenations of 9 gene sequences each.

4.4.2. Strains as Concatenations of Protein Sequences

The gene products were also used for phylogenetic analysis. For this purpose, every strain is represented as a concatenation of 9 protein sequences. Since there are 20 types of amino acid residues, we set k = 10, 20, 25 and 40 in the WGCD distance computation. Once again, the distance contribution percentages were calculated, and their plots (similar to those in Figure 2) are very "horizontal" when k = 40. The resultant bootstrapping Neighbor-Joining phylogeny for k = 40 is shown in Figure 7. In this phylogeny one can see that there are more misplaced phylogenetic clades than in the phylogeny using only gene sequences. For example, subtypes A1, C, F1, and G are each split into 2 parts; sub-subtypes F1 and F2, and sub-subtypes A1 and A2, become disconnected. Therefore, we might conclude that using gene products for phylogenetic analysis of such fast evolving viruses is the least appropriate, and confirm that nucleotide sequences are superior. We note that this could result from silent mutations, which contribute to the genotyping diversity at the nucleotide level but not at the protein level. In fact, standard HIV-1 genotyping is done mostly


at the nucleotide level. Another possible reason that gene products are inferior may be that the evolutionary changes in the intergenic regions are not considered when genotyping uses protein coding regions only.

4.4.3. Identification of Circulating Recombinant Forms

Group M HIV-1 strains are currently classified into 9 subtypes and 16 circulating recombinant forms (CRFs). In our previous experiments, only non-CRF strains were included. Recently there has been intensive activity in the area of discovering new HIV-1 CRFs, although many of the new CRFs have not yet been published 6. In addition, evidence suggests many sub-subtypes and even a potential new subtype. The classification of new HIV-1 sequences follows the proposed HIV nomenclature guidelines 3, 27, using the HIV-1 subtyping reference set. In this set, each of the known CRFs was described by four representatives. It is recognized that, with the large increase in reported CRFs, the reference set would grow to an extent that would cause problems in some analyses if four sequences were included for each CRF. Therefore, previous studies have limited the CRF section to one sequence per CRF, and the sequence selected for each CRF was intended to show how it is composed of the included subtypes.

Fig. 7. The Neighbor-Joining phylogeny with 200 bootstrapping iterations on the WGCD using k = 40, where strains are represented as concatenations of 9 protein sequences each.

We have also performed experiments to test whether or not the WGCD-based genotyping system can recognize CRFs. In these experiments, we found that when the number of CRFs is limited to one or two at a time (where CRFs are represented as complete genomic RNA sequences), the recombinant subtype information can always be correctly identified. For example, Figures 8(a) and 8(b) show that both L39106 (NCBI accession number, strain from Nigeria) and AF063224 (NCBI accession number, strain from Djibouti) are recognized as A/G recombinants, and both of them are closer to subtype A than to subtype G, consistent with previous studies on these two recombinants. When we added both L39106 and AF063224 into the phylogenetic analysis, we found (Figure 9) that they were grouped together and were recognized as A/G recombinants (a little closer to subtype A than to subtype G). Similarly, we have tested 51 other CRFs individually and as a whole set, and in each case their recombinant form(s) was correctly identified (data not shown). We have also used Biolayout to graphically display the phylogenetic closeness relationships of L39106 and AF063224 to the pure subtype strains. Again for this purpose, the outgrouping CPZ strains were removed from the dataset, which consists of the 42 HIV-1 pure subtype strains and the 2 CRFs, and a total of 92 smallest distances were used. Figure 10 clearly shows the pure subtype clades and that both L39106 and AF063224 have edges connected to all A1, A2, and G strains. This indicates that neither L39106 nor AF063224 can be classified into the A1, A2, or G pure subtypes, yet both have close relationships with the A1, A2, and G subtypes, an obvious indication of recombinants. Based on these results, we are confident in claiming that, in terms of single CRF detection, our WGCD-based HIV-1 genotyping system performs well and is superior to existing methods in the sense that it requires no pre-selection of a CRF section, nor does it require each CRF to be described by four representatives.

Compared with other recombination detection systems based on multiple sequence alignments, the current version of our method has the limitation that it cannot detect the breakpoints of the recombination. In the future, we propose to examine the motifs contributing the most to the recombination, and thus find the breakpoints within the recombination. We will study how to detect the re-


Fig. 8. The Neighbor-Joining phylogenies with 200 bootstrapping iterations on the WGCD using k = 80: (a) L39106; (b) AF063224. Both L39106 and AF063224 are recognized as A/G recombinants, and both of them are closer to subtype A than to subtype G.

combination interactively, which will make it possible to find recombinations from the original subtypes and even between recombinant genotypes.

Fig. 9. The Neighbor-Joining phylogeny with 200 bootstrapping iterations on the WGCD using k = 80, where L39106 and AF063224 are grouped together and recognized as A/G recombinants (a little closer to subtype A than to subtype G).

4.4.4. Subtype Signature Nucleotide Segment Identification

In the current work, for a fixed value of k, all nucleotide (or amino acid) segments of length up to k are used in the WGCD distance calculation. Looking at the resultant distance matrices, we have found that the pairwise distances are very close to each other, though they are sufficient for identifying the subtypes and CRFs. Therefore, one interesting research subject would be to identify those nucleotide segments that signify subtypes. With these signature segments, subtype identification for a newly sequenced strain could be easier, by simply looking at the occurrences of signature segments. Furthermore, occurrences of signature segments associated with multiple subtypes are indications of CRFs.

This is one of our ongoing research foci, and we have obtained promising partial results. The full set of detailed experimental results will be reported in a succeeding paper. We are also constructing a web server based on the WGCD for HIV-1 genotyping, which will be released to the public upon passing extensive tests.

5. CONCLUSIONS

We proposed a whole genome composition distance based HIV-1 genotyping system. Such a system avoids the computationally intensive phase of multiple sequence alignments by using nucleotide segment frequencies to represent the genomes for phylogenetic analysis. Experiments show that every composition makes its own unique contribution to the evolutionary distance computation, and that the contribution decreases as the nucleotide segment length increases. By


Fig. 10. The Biolayout graphical view of the phylogenetic clades of the 42 HIV-1 strains and 2 A/G CRFs, using the 92 smallest distances computed by the WGCD method using nucleotide strings of length 1 to 80. The figure shows that subtypes A (including A1 and A2) and G are fully connected by these two A/G CRFs.

setting the maximum segment length k to 80, the Neighbor-Joining method using the WGCD-based distances constructs phylogenetic clades that are identical to the gold standard ones determined by running several MSA-based genotyping systems together with manual parameter adjustments. Experiments on single CRF identification also confirm that such a new system is capable of CRF discovery using only the whole strains, without any extra requirements on the CRFs. Lastly, our experiments also confirm that, for phylogenetic analysis of fast evolving viruses such as HIV, complete genomic sequences could be a better source than gene sequences and gene products.

ACKNOWLEDGMENTS

This research is supported in part by AICML, CFI and NSERC.

References

1. S. J. Popper, A. D. Sarr, K. U. Travers, A. Gueye-Ndiaye, S. Mboup, M. E. Essex, and P. J. Kanki. Lower human immunodeficiency virus (HIV) type 2 viral load reflects the difference in pathogenicity of HIV-1 and HIV-2. Journal of Infectious Diseases, 180:1116-1121, 1999.

2. M. Sturmer, H. W. Doerr, and W. Preiser. Variety of interpretation systems for human immunodeficiency virus type 1 genotyping: Confirmatory information or additional confusion? Current Drug Targets - Infectious Disorders, 3:373-382, 2003.

3. D. L. Robertson, J. P. Anderson, J. A. Bradac, J. K. Carr, B. Foley, R. K. Funkhouser, F. Gao, B. H. Hahn, M. L. Kalish, C. Kuiken, G. H. Learn, T. Leitner, F. McCutchan, S. Osmanov, M. Peeters, D. Pieniazek, M. Salminen, P. M. Sharp, S. Wolinsky, and B. Korber. HIV-1 nomenclature proposal. Science, 288:55-56, 2000.

4. T. de Oliveira, K. Deforche, S. Cassol, M. Salminen, D. Paraskevis, C. Seebregts, J. Snoeck, E. J. van Rensburg, A. M. J. Wensing, D. A. van de Vijver, C. A. Boucher, R. Camacho, and A.-M. Vandamme. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics, 21:3797-3800, 2005.

5. M. Rozanov, U. Plikat, C. Chappey, A. Kochergin, and T. Tatusova. A web-based genotyping resource for viral sequences. Nucleic Acids Research, 32:W654-W659, 2004.

6. T. Leitner, B. Korber, M. Daniels, C. Calef, and B. Foley. HIV-1 Subtype and Circulating Recombinant Form (CRF) Reference Sequences. Accessible through http://www.hiv.lanl.gov/content/hiv-db/REVIEWS/RefSeqs2005/RefSeqs05.html, 2005.

7. R. W. Shafer, P. Hsu, A. K. Patick, C. Craig, and V. Brendel. Identification of biased amino acid substitution patterns in human immunodeficiency virus type 1 isolates from patients treated with protease inhibitors. Journal of Virology, 73:6197-6202, 1999.

8. N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406-425, 1987.

9. R. E. Myers, C. V. Gale, A. Harrison, Y. Takeuchi, and P. Kellam. A statistical model for HIV-1 sequence classification using the subtype analyser (STAR). Bioinformatics, 21:3535-3540, 2005.

10. The HIV Sequence Database. Accessible through http://www.hiv.lanl.gov/content/hiv-db/mainpage.html.

11. S. Karlin and C. Burge. Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics, 11:283-290, 1995.

12. W. Li, W. Fang, L. Ling, J. Wang, Z. Xuan, and R. Chen. Phylogeny based on whole genome as inferred from complete information set analysis. Journal of Biological Physics, 28:439-447, 2002.

13. G. Stuart, K. Moffet, and S. Baker. Integrated gene and species phylogenies from unaligned whole genome sequence. Bioinformatics, 18:100-108, 2002.

14. B. Hao and J. Qi. Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. In Proceedings of the 2003 IEEE Bioinformatics Conference (CSB 2003), pages 375-385, 2003.

15. S. Grumbach and F. Tahi. Compression of DNA sequences. Data Compression Conference, 1993.

16. E. Rivals, M. Dauchet, J. Delahaye, and O. Delgrange. Compression and genetic sequences analysis. Biochimie, 78:315-322, 1996.

17. X. Chen, S. Kwong, and M. Li. A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Sixth Annual International Computing and Combinatorics Conference (RECOMB), pages 107-117. ACM Press, 2000.

18. D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88:048702, 2002.

19. B. Snel, P. Bork, and M. A. Huynen. Genome phylogeny based on gene content. Nature Genetics, 21:108-110, 1999.

20. E. Herniou, T. Luque, X. Chen, J. Vlak, D. Winstanley, J. Cory, and D. O'Reilly. Use of whole genome sequence data to infer baculovirus phylogeny. Journal of Virology, 75:8117-8126, 2001.

21. C. House and S. Fitz-Gibbon. Using homolog groups to create a whole-genomic tree of free-living organisms: An update. Journal of Molecular Evolution, 54:539-547, 2002.

22. B. Snel, P. Bork, and M. A. Huynen. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Research, 12:17-25, 2002.

23. G. Stuart, K. Moffet, and J. Leader. A comprehensive vertebrate phylogeny using vector representation of protein sequences from whole genomes. Molecular Biology and Evolution, 19:554-562, 2002.

24. J. Felsenstein. PHYLIP. Accessible through http://evolution.genetics.washington.edu/phylip.html.

25. A. J. Enright and C. A. Ouzounis. BioLayout: an automatic graph layout algorithm for similarity and network visualization. Bioinformatics, 17:853-854, 2001.

26. L. Goldovsky, I. Cases, A. J. Enright, and C. A. Ouzounis. An automatic graph layout algorithm for similarity and network visualization. Applied Bioinformatics, 4:71-74, 2005.

27. D. L. Robertson, J. P. Anderson, J. A. Bradac, J. K. Carr, R. K. Funkhouser, F. Gao, B. H. Hahn, C. Kuiken, G. H. Learn, T. Leitner, F. McCutchan, S. Osmanov, M. Peeters, D. Pieniazek, M. Salminen, S. Wolinsky, and B. Korber. HIV-1 nomenclature proposal: a reference guide to HIV-1 classification. In Human Retroviruses and AIDS 1999: a compilation and analysis of nucleic acid and amino acid sequences. Los Alamos National Laboratory, Los Alamos, NM, 2000.


EFFICIENT RECURSIVE LINKING ALGORITHM FOR COMPUTING THE LIKELIHOOD OF AN ORDER OF A LARGE NUMBER OF GENETIC MARKERS

S. Tewari

Dept. of Statistics, University of Georgia, Athens, GA 30605-1952, USA
Email: statsusant@yahoo.com

Dr. S. M. Bhandarkar*

Dept. of Computer Science, University of Georgia, Athens, GA 30605-7404, USA
Email: [email protected]

Dr. J. Arnold

Dept. of Genetics, University of Georgia, Athens, GA 30605-7223, USA
Email: [email protected]

*Corresponding author.

Assuming no interference, a multi-locus genetic likelihood is implemented based on a mathematical model of the recombination process in meiosis that accounts for events up to double crossovers in the genetic interval for any specified order of genetic markers. The mathematical model is realized with a straightforward algorithm that implements the likelihood computation process. The time complexity of the straightforward algorithm is exponential in the number of genetic markers, and implementation of the model for more than 7 genetic markers is not feasible, motivating the need for a novel algorithm. A recursive linking algorithm is proposed that decomposes the pool of genetic markers into segments and renders the model implementable for a large number of genetic markers. The recursive algorithm is shown to reduce the order of time complexity from exponential to linear. The improvement in time complexity is shown theoretically by a worst-case analysis of the algorithm and supported by run time results using data on linkage group-I of the fungal genome Neurospora crassa.

Keywords: Crossover, EM algorithm, recursive linking, time complexity, MLE.

1. INTRODUCTION

High density linkage maps are an essential tool for characterizing genes in many systems, fundamental genetic processes such as genetic exchange between chromosomes, as well as the analysis of traits controlled by more than one gene (i.e., complex traits)1. Since genetic maps are most often the critical link between phenotype (what a gene or its product does) and the genetic material, genetic maps can be exploited to address how the genetic material controls a particular trait2 controlled by one or more genes. Most model systems possess high density linkage maps that can assist in the analysis of complex traits. The bread mold Neurospora crassa3, which gave us the biochemical function of genes, is no exception4. One approach to understanding the genetic basis of a complex trait is to follow its segregation in offspring along with an array of genetic markers. One class of genetic markers frequently used is restriction fragment length polymorphisms (RFLPs), markers in the DNA itself. These markers in essence allow a triangulation on loci in the DNA affecting the complex trait. Part of this triangulation process involves the construction of a genetic map with many markers. This is a computationally challenging problem5, and it is at the heart of understanding complex traits, such as human disease. In this paper we address the problem of genetic map reconstruction from a large number of RFLP markers. We focus on map construction for a model system, N. crassa6, where there is a wealth of published information about how markers segregate, because the genetic makeup of gametes can be identified. Previous attempts7, 8 at genetic map



construction with a large number of genetic markers are mostly based on pairwise genotypic information, not on likelihood computation. Here we focus on solving the computational challenge posed by our probabilistic modelling of the genetic map.

2. MULTILOCUS GENETIC LIKELIHOOD FOR A SPECIFIED ORDER OF GENETIC MARKERS

Let S be the sample space of an exchange (crossover) between any two non-sister strands (chromatids) of the tetrad in a single meiosis, and let c denote the probability of exchange of genetic material between any two non-sister strands in the tetrad at meiosis:

$$S = \{0, 1, 2, 3, 4\}$$

$$P(i) = \frac{c}{4}, \quad i = 1, \ldots, 4;\ i \in S$$

$$P(0) = 1 - c;\ 0 \in S \qquad (1)$$

The element 0 in S indicates the absence of a crossover event. Elements 1, 2, 3 and 4 indicate that the non-sister chromatid pairs (1,3), (2,3), (2,4) and (4,1) took part in the exchange, respectively.

Let $S_i\,(= S \times S)$ be the set whose elements denote crossover events covering events up to double crossovers between locus $A_i$ and locus $A_{i+1}$ $(i = 1, \ldots, l-1)$, where $l$ is the total number of loci being studied. Using equation (1), the probability distribution on $S_i$ is given by:

$$P(\{i,j\}) = \frac{c_i^2}{16}\, I_{\{i \neq 0;\, j \neq 0\}} + \frac{c_i(1-c_i)}{4}\,\left\{ I_{\{i=0;\, j \neq 0\}} + I_{\{i \neq 0;\, j=0\}} \right\} + (1-c_i)^2\, I_{\{i=j=0\}} \qquad (2)$$

where $\{i,j\} \in S_i$.

Let $\phi_k$ denote a unique crossover event on $S^l$ as described below:

$$\phi_k = i_1 \times i_2 \times \cdots \times i_{l-1} \qquad (3)$$

where $k = i_1.i_2.\cdots.i_{l-1}$; $i_j \in S_j$; $\phi_k \in S^l = \prod_{j=1}^{l-1} S_j$.

Let $f_k$ denote a multi-locus genotype with $l$ loci:

$$f_k = i_1 \times i_2 \times \cdots \times i_{l-1} \times i_l$$

where $k = i_1.i_2.i_3.\cdots.i_l$; $i_j = 0, 1$; $\forall j = 1, \ldots, l$.

The indices $i_j = 1$ and $i_j = 0$ indicate the paternal and maternal alleles respectively. The progeny are obtained by crossover between homogeneous parents. The observed data set can be represented as:

$$\mathcal{D} = \{n_j;\ \forall j = 1, \ldots, 2^l\}$$

where $n_j$ is the observed frequency of $f_j$.

2.1. Probability distribution on $S^l$

Let us define the following functions:

$$f^0(a) = (a_1, a_2, a_3, a_4)'$$

$$f^1(a) = (a_3, a_2, a_1, a_4)'$$

$$f^2(a) = (a_1, a_3, a_2, a_4)' \qquad (4)$$

$$f^3(a) = (a_1, a_4, a_3, a_2)'$$

$$f^4(a) = (a_4, a_2, a_3, a_1)'$$

where $a = (a_1, a_2, a_3, a_4)'$ and $a_i = 0, 1\ \forall i$.

The function $f^{ij}(a) = f^j(f^i(a))$ corresponds to the events in $S_i$. For a particular crossover $\phi_k$ we can generate a model tetrad at meiosis using the function $f^{ij}$. The following matrix $R_k$ of size $4 \times l$ defines the simulated tetrad:

$$R_k = (R_0\, R_1 \cdots R_{l-1}) \qquad (5)$$

where $R_0 = (1\,1\,0\,0)'$; $R_i = f^{jk}(R_{i-1})\ \forall i = 1, \ldots, l-1$, and the $i$th genetic interval $S_i$ observed the crossover event $\{j, k\}$. The conditional distribution of $f_i$ for a given $\phi_k$ is

$$P(f_i \mid \phi_k) = \frac{1}{4} \sum_{j=1}^{4} I_{\{f_i = R_k(j, \cdot)\}} \qquad (6)$$

where $R_k(j, \cdot)$ is the $j$th row of $R_k$. The marginal density of a single spore $f_i$ is given by

$$P(f_i) = \sum_{k} P(f_i \mid \phi_k) \times P_k = C \times P \qquad (7)$$


where $C$ is the conditional probability matrix given by

$$C = ((C_{ki})), \quad C_{ki} = P(f_i \mid \phi_k) \text{ (from equation (6))} \qquad (8)$$

and $P$ is given by

$$P = (P_k;\ \forall k)', \quad P_k = P(\phi_k) = \prod_{j=1}^{l-1} P(I_j = i_{kj}) \qquad (9)$$

where $i_{kj} \in S_j$ and the probability distribution $P(I_j = i_{kj})$ is as defined in equation (2).
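The tetrad construction of equations (4)-(6) can be made concrete with a minimal Python sketch (ours, for illustration; the strand-swap table follows the chromatid pairs listed after equation (1)):

# Position swaps for f^0..f^4: event e swaps the two chromatids it names.
SWAPS = {0: None, 1: (0, 2), 2: (1, 2), 3: (1, 3), 4: (3, 0)}

def apply_exchange(col, e):
    """Apply f^e to a 4-element tetrad column (equation (4))."""
    col = list(col)
    if SWAPS[e] is not None:
        a, b = SWAPS[e]
        col[a], col[b] = col[b], col[a]
    return col

def tetrad_matrix(crossover):
    """Build R_k (equation (5)) for a crossover phi_k given as a list of
    (i, j) events, one per genetic interval; R_0 = (1,1,0,0)'."""
    cols = [[1, 1, 0, 0]]
    for (i, j) in crossover:
        cols.append(apply_exchange(apply_exchange(cols[-1], i), j))
    return cols  # cols[m] is the tetrad pattern at locus m+1

def count_match(spore, cols):
    """Number of the 4 strands matching a spore genotype (equation (6) numerator)."""
    strands = list(zip(*cols))
    return sum(1 for s in strands if list(s) == list(spore))

R = tetrad_matrix([(1, 0), (0, 3)])        # 3 loci, 2 intervals
print(count_match([1, 0, 0], R) / 4.0)     # P(f | phi_k) = 0.25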

Let $\theta = (c_1, c_2, \ldots, c_{l-1})'$ denote the unknown parameter vector in the model. The log-likelihood of $\theta$ given the data $\mathcal{D}$ is:

$$\ln L(\theta \mid \mathcal{D}) = \sum_{j=1}^{2^l} n_j \log p_j, \quad p_j = \sum_{k} \pi_{j|k} \times \pi_k \qquad (10)$$

where $\pi_{j|k} = C_{kj}$ in equation (8) and $\pi_k = P_k$ in equation (9).

The following two theorems solve equation (10) using a set of recurrence relations obtained via the Expectation-Maximization (EM) algorithm9. The proofs are not given in the interest of brevity.

Theorem 1. The EM-iterative equations are given below:

$$c_m^{(t+1)} = \frac{N_{1,m} + 2 N_{2,m}}{2\,(N_{0,m} + N_{1,m} + N_{2,m})} \qquad (11)$$

where

$$\hat{\pi}_{k|j} = P(x_{kj} = 1 \mid f_j) = \frac{\pi_{j|k} \times \pi_k}{p_j}, \quad \hat{n}(k) = \sum_{j} n_j\, \hat{\pi}_{k|j}$$

$$N_{0,m} = \sum_{k \mid i_{k,m} = (0,0)} \hat{n}(k)$$

$$N_{1,m} = \sum_{k \mid i_{k,m} = (i_1, i_2),\ i_1 = 0 \text{ (strict) OR } i_2 = 0} \hat{n}(k)$$

$$N_{2,m} = \sum_{k \mid i_{k,m} = (i_1, i_2),\ i_1 \neq 0 \text{ AND } i_2 \neq 0} \hat{n}(k)$$

Note that $i_{k,m}$ denotes an event in $S_m$ for the crossover $\phi_k$.
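A minimal numeric sketch of one EM step per equation (11) (ours, for illustration; the expected counts are invented and would in practice be accumulated from the posteriors defined above):

def em_update(n0, n1, n2):
    """One EM step for an interval's exchange probability c_m (equation (11)).

    n0, n1, n2: expected counts of no-, single- and double-crossover events.
    Each meiosis offers two crossover slots in the interval, hence the
    denominator 2 * (total expected events).
    """
    return (n1 + 2.0 * n2) / (2.0 * (n0 + n1 + n2))

# Invented expected counts for one genetic interval:
c_new = em_update(n0=70.0, n1=25.0, n2=5.0)
print(round(c_new, 4))  # 0.175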

Theorem 2. Let $f = (f_1, f_2, f_3, f_4)'$ be the observed frequency vector corresponding to all possible meiotic products for parental genes M and O for two markers. The genotype vector for $f$ is $(MM\ MO\ OM\ OO)'$. The maximum likelihood estimator10 of the exchange probability $c$ under the model represented by equation (1) is unique and is given as follows:

(1) If $f_1 + f_4 \leq f_2 + f_3$ then $c_{mle} = 1$.
(2) If $f_1 + f_4 > f_2 + f_3$ then $c_{mle}$ is given by the unique solution (in the interval $[0, 1]$) of the following equation:

$$f(c) = c^2 - 2c + D = 0 \qquad (12)$$

where

$$D = \frac{2(f_2 + f_3)}{N}, \quad N = \sum_{i=1}^{4} f_i.$$

This theorem is used to obtain the starting values of $c_m$ for the EM-iterative equations in Theorem 1.
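Since $c \in [0,1]$, equation (12) has the closed form $c = 1 - \sqrt{1 - D}$. A minimal Python sketch of this starting-value computation (ours, for illustration):

import math

def c_mle_two_markers(f1, f2, f3, f4):
    """MLE of the exchange probability c for two markers (Theorem 2).

    f1..f4: observed counts for genotypes MM, MO, OM, OO.
    """
    if f1 + f4 <= f2 + f3:
        return 1.0                      # case (1) of the theorem
    n = f1 + f2 + f3 + f4
    d = 2.0 * (f2 + f3) / n             # D in equation (12); here D < 1
    return 1.0 - math.sqrt(1.0 - d)     # root of c^2 - 2c + D in [0, 1]

print(c_mle_two_markers(40, 8, 7, 45))  # EM starting value, about 0.1633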


3. THE STRAIGHTFORWARD ALGORITHM

In the pseudocode below, k denotes a particular crossover $\phi_k$ as defined in equation (3). The function getRMatrix implements equation (5) to create the $R_k$ matrix corresponding to the crossover $\phi_k$. The function kProb computes the marginal probability $P_k$ due to crossover $\phi_k$ as defined in equation (9). The matrix $R_k$ and the probability $P_k$ in turn create the matrices C and P progressively during the course of the recursive loop, to calculate the marginal probability of each observed distinct genotype as defined in equation (7). These marginal probabilities, along with the counts for the distinct genotypes, are used to compute the log-likelihood in equation (10). A particular crossover $\phi_k$ does not enter into the computation of the likelihood so long as it does not have positive probability for at least one distinct genotype. The elimination of such crossovers is achieved with the function kIsWorthy, which implies that at least some amount of computation cannot be avoided for each crossover. In the recursive algorithm we propose in this paper, this feature is handled more efficiently: a large number of crossovers are eliminated by performing checks on a few.

In the pseudocode, the vector sum accumulates conditional probabilities across all distinct genotypes. The conditional probability is computed with the help of the function countMatch, which implements equation (6) by counting the number of strands (out of 4) in $R_k$ that match the observed genotype. The vector freq has the observed counts corresponding to the distinct genotypes (dg), totalObs is the total sample size, and probFOld stores the marginal probability of each distinct genotype using equation (7). The array pCount implements the EM algorithm via equation (11) by re-categorizing the vector sum based on the crossover values along the chromosome. Note that the denominator of $\hat{\pi}_{k|j}$, namely $p_j$, the marginal probability due to the $j$th distinct genotype, is left out of the $\hat{\pi}_{k|j}$ computation, as it requires going through all the crossovers and is progressively computed by probFOld. To compute $\hat{n}(k)$ in equation (11) we need to add up the posterior probabilities $\hat{\pi}_{k|j}$ across the distinct genotypes but, as their marginal probabilities are not yet computed, it is not possible to do so directly. We work around this problem by adding another dimension, running along the distinct genotypes, to the structure pCount. The first dimension of pCount is of magnitude 3, to account for $N_{0,m}$, $N_{1,m}$ and $N_{2,m}$ in equation (11), and the second dimension runs along m, accounting for the $(l-1)$ genetic intervals. Once all the crossovers are processed and the marginal probabilities computed, the elements in the third dimension are divided by their corresponding marginal probabilities and then added up across that dimension. This gives us the two-dimensional structure postCount (not shown in the pseudocode) containing the values of $N_{0,m}$, $N_{1,m}$ and $N_{2,m}$ for all the genetic intervals. Then the new value of $c_m$ for each genetic interval is computed using equation (11), and the process iterates until convergence. Despite being a recursive algorithm (crossovers are generated recursively), it suffers from the computational bottleneck of processing a huge number of crossovers ($25^{l-1}$ for $l$ loci). This problem is overcome using the proposed recursive linking algorithm. For brevity we show the pseudocode of only the most important part of the straightforward algorithm.

{Pseudocode of the Straightforward Algorithm}
loop
  {This is a recursive loop. It dynamically generates l-1 FOR loops.}
  r = getRMatrix(k)
  if kIsWorthy() == 1 then
    prob = kProb(cProbOld, k)
    for i = 0; i < dg; i++ do
      sum[i] = countMatch() * prob * freq[i]/4.0 * totalObs
      probFOld[i] += countMatch()/4.0 * prob
    end for
    for j = 0; j < loci-1; j++ do
      pCount[CellSpecial(k[j])][j] += sum
      {The function CellSpecial() maps crossover values from 0 to 24 to the events of the set S, and the addition is a componentwise vector addition.}
    end for
  end if
end loop
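The recursive loop above corresponds to enumerating all $25^{l-1}$ crossovers; a minimal Python analogue (ours, not the authors' implementation) makes the dynamically nested loops explicit:

def crossovers(intervals, prefix=()):
    """Recursively generate every crossover phi_k: one of the 25 events of
    S_i = S x S per genetic interval, mimicking l-1 dynamically nested loops."""
    if len(prefix) == intervals:
        yield prefix
        return
    for i in range(5):
        for j in range(5):
            yield from crossovers(intervals, prefix + ((i, j),))

print(sum(1 for _ in crossovers(2)))  # 625 = 25^(3-1) crossovers for 3 loci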


4. THE PROPOSED RECURSIVE LINKING ALGORITHM

Let the entire order of genetic markers be broken into equal segments of width $h$, such that all the intervals are covered. So, for $l$ genetic markers the number of segments $s$ is given by the following equation:

$$s = \frac{l-1}{h-1} \qquad (13)$$

(For example, $l = 9$ markers with $h = 3$ give $s = 4$ segments of two intervals each.)

The first segment has an associated array called kArrayFirst that holds all the crossovers for the segment. The array linkInfoFirst stores the last row generated by the R matrix for each crossover of the segment. The array cArrayFirst records, for each particular crossover and each observed genotype of the first segment, which strands of the simulated tetrad (based on the model described in equation (1)) obtained by R match the genotype. The matching status forms the last dimension of the array, with length 4, and consists of symbols 1 and 0 indicating a match (1) and a mismatch (0) respectively. For example, a matching status 1 0 0 1 for the first distinct genotype corresponding to crossover pattern 0 0 2 3 4 in the first segment indicates that, among the 4 tetrads in meiosis generated by the crossover pattern 0 0 2 3 4 in the first segment, the observed genotype in question was found only on the first and the fourth tetrad. When we use this information over a combined segment formed from two segments, only a match at the same tetrad position will ensure a match for the combined segment. Note that R depends on the crossover values on all intervals of the segment, and its columns are sequentially dependent on each other with a lag of one. Each crossover in the first segment branches out to $25^{h-1}$ crossovers in the following segment and creates $25^{2(h-1)}$ combined crossovers. This continues till the last segment is accounted for. In order to move along the segments following the model described in equation (1), we need to know the last row (a tetrad pattern at the last locus of the segment) generated by the R matrix of the linked crossover of the previous segment corresponding to each combined crossover of the two segments. The following lemma states that only certain patterns are possible at the end locus of the adjoining segments. We create arrays similar to kArray, linkInfo and cArray for all the following segments except the last one, and call them kArrayTemp[], linkInfoTemp[] and cArrayTemp[] respectively, where the dimension denotes the segment number.

Lemma 4.1. Under the model described by equation (1), at any particular locus only one of the tetrad patterns 1100, 0110, 1010, 1001, 0101 and 0011 could occur.

Consider a combined crossover for all the segments. To compute equation (6), i.e., to count the matches for the entire crossover, we have to examine the matching status of all the segments and update them. To be considered a match for the whole segment at a particular position (out of 4 possible positions), one must have a match for all the segments at that position. Hence, when the matching statuses of two segments are combined, the resulting matching status is 1 if and only if both segments have 1 at that position, and 0 otherwise. This updated matching status is termed a spore in this paper; see the sketch after the following lemma. The distinction between a spore and a matching status is that while the matching status is the original status of the segment, the spore is the matching status obtained after updating with the matching statuses of all the previous segments. The following lemma restricts the number of possible spore patterns.

Lemma 4.2. Under the model described by equation (1), for any crossover on any observed genotype the spore patterns 0111, 1011, 1101, 1110 and 1111 are not possible.
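A minimal Python sketch of the matching-status update described above (ours, for illustration): combining two segments is a positionwise AND of their 4-bit statuses:

def update_score(status_a, status_b):
    """Combine two 4-bit matching statuses into a spore: positionwise AND."""
    return tuple(a & b for a, b in zip(status_a, status_b))

# A genotype matching strands 1 and 4 in the left segment and strands 1 and 2
# in the right segment matches the combined segment only on strand 1.
print(update_score((1, 0, 0, 1), (1, 1, 0, 0)))  # (1, 0, 0, 0)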

Analogous to the function kIsWorthy in the straightforward algorithm, we implement the concept of counting active crossovers for each segment. We call a crossover of a particular segment active if it has positive probability for at least one distinct genotype. Note that the active crossovers will differ across segments, as an active crossover depends both on the observed genotypes (which vary across segments) and on the tetrad pattern used in the generation of the R matrix.

It is important to emphasize the huge computational gain achieved by eliminating the crossovers on a segment-wise basis in the proposed algorithm, compared to the straightforward algorithm, which


eliminates crossovers one at a time. A single elimination of a crossover in the first segment has the effect of $25^{l-2}$ eliminations of crossovers in the straightforward algorithm. The effect is $25^{l-3}$ for the 2nd segment, and so on.

For the last segment we create an array called tempSumIndex. Corresponding to all possible spores (the dimension is restricted by Lemma 4.2), the matching status of the last segment is updated and then summed (along its dimension) to compute equation (8) for the last segment, except that the matching status is computed for all possible tetrad patterns in Lemma 4.1. Note that Lemma 4.1 and Lemma 4.2 restrict the array sizes and ensure storage economy. Next we implement equation (7) for the last segment for all possible spores, denoted by the array tempSum. To update $\theta$ using equation (11), we use another array called tempPCount, which computes $N_{0,m}$, $N_{1,m}$ and $N_{2,m}$ via index manipulation as shown in the pseudocode.

{Pseudocode for the recursive structure}

tempSum[6][dg][11]
tempPCount[6][dg][11][3][loci-1]
loop
  {i1, i2, i3; 0 <= i4 < activeKLast[i1]}
  tempDouble = kProb(lastCProb, kArrayLast[i1][i4])
  condProb = tempSumIndex[i1][i2][i3][i4]/4.0
  tempSum[i1][i2][i3] += condProb * tempDouble
  for j = loci-h; j < loci-1; j++ do
    tempPCount[i1][i2][i3][CellSpecial(kArrayLast[i1][i4][j-loci+h])][j] += condProb * tempDouble
  end for
end loop

The arrays tempSum and tempPCount taken together are termed a recursive structure for the last segment. This structure has the property that $25^{h-1}$ crossovers have already been processed in a form such that equation (11) can be implemented, and it can handle any spore generated from the previous segment. Note that when a crossover from the previous segment (allowing for all possible spore patterns of its previous segment) is processed, the recursive structure identifies and uses an appropriate spore from the last segment and thus processes $25^{h-1}$ combined crossovers of these two segments simultaneously. After all the crossovers from the previous segment are processed, the proposed algorithm generates a recursive structure that is exactly the same as that of the last segment, without adding to the storage requirement. The recursive structure for the last-but-one segment now has $25^{2(h-1)}$ crossovers processed within it, with provision for all possible spores from the previous segment. This very feature shows how we geometrically increase the information base (in terms of crossovers) of the "table look-up" procedure and avoid traversing all the crossovers one at a time. One of the reasons this procedure works is that, if we look into the combined crossovers of two segments, the crossover values for the left segment change only every $25^{h-1}$ combined crossovers. This lets us delay the probability value updates when linking the segments. The process is best understood by looking at the pseudocode below and noticing how the arrays tempSum and tempPCount are updated in the recursive linking process.

{Pseudocode for Recursive Linking}
temp1Sum = tempSum
temp1PCount = tempPCount
int startPos = 0
int endPos = loci-h
for i = 0; i < segments-2; i++ do
  startPos = endPos-h+2
  firstCProb --> from startPos to endPos in cProbOld
  spores[11][4] --> generate spores
  for j0 = 0; j0 < 6; j0++ do
    for j1 = 0; j1 < dg; j1++ do
      for j2 = 0; j2 < 11; j2++ do
        for jk = 0; jk < activeKTemp[i][j0]; jk++ do
          k = kArrayTemp[i][j0][jk]
          prob = kProb(firstCProb, k)
          int j01 = linkInfoTemp[i][j0][jk]
          int spike = SporeMatch(UpdateScore(cArrayTemp[i][j0][jk][j1], spores[j2]))
          temp2Sum[j0][j1][j2] += temp1Sum[j01][j1][spike] * prob
          for m = startPos-1; m < endPos; m++ do
            temp2PCount[j0][j1][j2][CellSpecial(k[m-startPos+1])][m] += temp1Sum[j01][j1][spike] * prob
          end for
          for n = 0; n < 3; n++ do
            for index = endPos; index < loci-1; index++ do
              temp2PCount[j0][j1][j2][n][index] += temp1PCount[j01][j1][spike][n][index] * prob
            end for
          end for
        end for
      end for
    end for
  end for
  temp1Sum = temp2Sum
  temp1PCount = temp2PCount
  endPos = startPos-1
end for

Linking with the first segment is the last step

end for Linking with the first segment is the last step

of the likelihood computation process. In this phase we do not have any previous matching status vectors, and instead of tempPCount we have the structure pCount as in the straightforward algorithm. Note that the length of the first segment must be adjusted to account for both even and odd numbers of genetic markers. That entails a trivial modification of the algorithm, hence we do not mention the details. In the interest of clarity of description we have assumed all the segments to be of equal length, and hence both even and odd numbers of markers cannot be handled without first changing the length of at least one segment, preferably that of the first one.

5. TIME COMPLEXITY COMPARISON OF THE ALGORITHMS

In the analysis of the time complexity of the algorithms11, the running variable is $l$, the number of genetic markers. The core function of the algorithm is to process a large number of crossovers in real time. The algorithm would be required even if one wanted to compute just the likelihood for the initial probabilities and not use the subsequent EM iterations. Hence for both the straightforward and the proposed algorithm we provide a run time complexity analysis for the main computationally intensive phase, namely processing all the crossovers for a single iteration.

The loop in the straightforward algorithm runs for $25^{(l-1)}$ iterations. In each iteration the computation time of matrix R is $O(l)$: R has $l$ columns and the computation time for each column is fixed. Since the observed genotypes are not known in advance, we do a worst-case analysis for the computation of the vector sum. The worst case occurs when there is a match between an observed genotype and any of the 4 rows of the matrix R, resulting in run time complexity of order $O(l)$. The vector sum has $d_g$ elements, which does not vary with $l$, and hence the total execution time for computing the vector sum is $O(l)$. The vector addition involved in computing pCount entails $d_g$ elements per genetic interval and thus accounts for run time complexity of $O(l)$. Hence the total run time complexity of the straightforward algorithm is

$$R_{st} = O\left(l \cdot 25^{(l-1)}\right) \qquad (14)$$

In the recursive linking algorithm the run time for each of the $s$ segments is $O((h-1) \cdot 25^{(h-1)})$. So the total run time for the recursive algorithm is

$$R_{rl} = s \cdot O\left((h-1) \cdot 25^{(h-1)}\right)$$

which is minimized for $h = 3$ for any odd number of loci $l$. Hence for $h = 3$ the run time complexity of the recursive linking algorithm is of order $O(l)$, using equation (13).

6. RUN TIME RESULTS

We ran several jobs on a DELL PC (Model DM051, Pentium(R) 4 CPU 3.40GHz and 4GB of RAM) for different numbers of genetic markers in their natural order of precedence, on a data set from linkage group-I of Neurospora crassa4, for both algorithms. It was verified that both algorithms provide the same likelihood for the same order of markers, as they essentially solve the same problem in different ways. The run time corresponds to the average time for a single EM iteration before convergence over multiple starts. The resulting speedup is clearly evident from the table below.


Table 1. Run time (in seconds) comparison of the straightforward and recursive linking algorithms.

Loci (l)    Straightforward Algorithm    Recursive Linking Algorithm
5           2.49                         0.073
7           1336.45                      0.298
9           > 43200.0                    0.66
21          (infeasible)                 5.61
41          (infeasible)                 8.93
61          (infeasible)                 13.03

7. CONCLUSIONS

Assuming no interference, a multi-locus genetic likelihood was implemented based on a mathematical model of the recombination process in meiosis that accounted for events up to double crossovers in the genetic interval for any specified order of genetic markers. The mathematical model was realized with a straightforward algorithm that implemented the likelihood computation process. The time complexity of the straightforward algorithm was exponential in the number of genetic markers, and implementation of the model for more than 7 genetic markers turned out to be infeasible, motivating the need for a novel algorithm. The proposed recursive linking algorithm decomposed the pool of genetic markers into segments and rendered the model implementable for a large number of genetic markers. The recursive algorithm has been shown to reduce the order of time complexity from exponential to linear. The improvement in time complexity has been shown theoretically by a worst-case analysis of the algorithm and supported by run time results using data on linkage group-I of the fungal genome Neurospora crassa.

ACKNOWLEDGEMENTS

The research was supported in part by a grant from the U.S. Dept. of Agriculture under the NRI Competitive Grants Program (Award No: GEO-2002-03590) to Drs. Bhandarkar and Arnold. We thank the College of Agricultural and Environmental Sciences, University of Georgia for their support.

References

1. Lander ES, Schork NJ. Genetic dissection of complex traits. Science 1994; 265: 2037-2048.
2. Doerge RW, Zeng ZB, Weir BS. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science 1997; 12: 195-219.
3. Raju NB. Meiosis and ascospore genesis in Neurospora. Eur. J. Cell Biol. 1980; 23: 208-223.
4. Nelson MA, Crawford ME, Natvig DO. Restriction polymorphism maps of Neurospora crassa: 1998 update. http://www.fgsc.net/fgn45/45rflp.html 1998.
5. Lander ES, Green P. Construction of multi-locus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 1987; 84: 2363-2367.
6. Barratt RW, Newmeyer D, Perkins DD, Garnjobst L. Map construction in Neurospora crassa. Advances in Genetics 1954; 6: 1-93.
7. Mester D, Romin Y, Minkov D, Nevo E, Korol A. Constructing large-scale genetic maps using an evolutionary strategy algorithm. Genetics 2003; 165: 2269-2282.
8. Cuticchia AJ, Arnold J, Timberlake WE. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics 1992; 132: 591-601.
9. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977; 39(1): 1-38.
10. Rao CR. Linear Statistical Inference and Its Applications, 2nd ed. Wiley-Interscience, 2002.
11. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, 2nd ed. The MIT Press, 2001.


OPTIMAL IMPERFECT PHYLOGENY RECONSTRUCTION AND HAPLOTYPING (IPPH)

Srinath Sridhar

Computer Science Department, Carnegie Mellon University

Pittsburgh, PA 15213. USA.

Email: [email protected]

Guy E. Blelloch

Computer Science Department, Carnegie Mellon University

Pittsburgh, PA 15213. USA.

Email: [email protected]

R. Ravi

Tepper School of Business, Carnegie Mellon University

Pittsburgh, PA 15213. USA. Email: [email protected]

Russell Schwartz

Department of Biological Sciences, Carnegie Mellon University

Pittsburgh, PA 15213. USA. Email: [email protected]

The production of large quantities of diploid genotype data has created a need for computational methods for large-scale inference of haplotypes from genotypes. One promising approach to the problem has been to infer possible phylogenies explaining the observed genotypes in terms of putative descendants of some common ancestral haplotype. The first attempts at this problem proceeded on the restrictive assumption that observed sequences could be explained by a perfect phylogeny, in which each variant locus is presumed to have mutated exactly once over the sampled population's history. Recently, the perfect phylogeny model was relaxed and the problem of reconstructing an imperfect phylogeny (IPPH) from genotype data was considered. A polynomial time algorithm was developed for the case when a single site is allowed to mutate twice, but the general problem remained open. In this work, we solve the general IPPH problem and show for the first time that it is possible to infer optimal q-near-perfect phylogenies from diploid genotype data in polynomial time for any constant q, where q is the number of "extra" mutations required in the phylogeny beyond what would be present in a perfect phylogeny. This work has application to the haplotype phasing problem as well as to various related problems in phylogenetic inference, analysis of sequence variability in populations, and association study design. Empirical studies on human data of known phase show this method to be competitive with the leading phasing methods and provide strong support for the value of continued research into algorithms for general phylogeny construction from diploid data.

1. INTRODUCTION

Sophisticated computational methods for data refinement and interpretation have become a core component of modern studies in human genetics. Computational methods have long been central to the study of phylogenetics, or evolutionary tree building, particularly the parsimony variants best suited to inference of relationships over short time scales. The more specialized problem of haplotype reconstruction has also benefited tremendously from contributions from the fields of discrete algorithms and combinatorial optimization. In haplotype reconstruction, also called phasing, one seeks to separate the allelic contributions of two chromosomes observed together in a diploid genotype assay. If we use the symbols 0 and 1 to represent homozygous alleles and 2 to represent heterozygous alleles, then '0221' is a genotype typed on four loci. The two pairs of haplotypes 0001, 0111 and 0011, 0101 are both consistent with the genotype, and the goal of phasing is to correctly infer the true haplotypes given the genotypes. The problem has relevance to basic research into population histories as well as to applied problems such as statistically linking haplotypes to disease phenotypes.
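A minimal Python sketch of this consistency relation (ours, for illustration): every heterozygous site must be split 0/1 between the two haplotypes, so a genotype with s such sites admits 2^{s-1} unordered phasings:

from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a {0,1,2} genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for pos, b in zip(het, bits):
            h1[pos], h2[pos] = b, 1 - b   # split the heterozygous site 0/1
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

for p in haplotype_pairs((0, 2, 2, 1)):
    print(sorted(p))  # the two phasings of '0221': {0001,0111} and {0011,0101}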


The field of computational haplotype reconstruction began with a fast and simple iterative method based on the idea that a haplotype observed in one individual is likely to be found in other individuals as well8. Various statistically motivated algorithms based on heuristic optimization techniques such as expectation maximization28 and Markov chain Monte Carlo methods35 have since been developed. Such algorithms provide significantly improved robustness and accuracy, although they still scale poorly with problem size. Several combinatorial optimization methods have also been formulated based on Clark's original method20, although these have all been intractable in theory and practice. The advent of large-scale genotyping created a need for new methods designed to scale to chromosome-sized data sets.

Recently, a new avenue towards haplotype reconstruction was developed based on phylogenetic methods. In such an approach, one seeks to identify the ancestral history that could have produced a set of haplotype sequences from a common ancestor such that the inferred haplotypes could give rise to the observed genotypes. Gusfield19 showed that it is possible to directly and efficiently infer phylogenies that could explain an observed set of diploid genotypes provided the phylogenies are perfect, meaning that each character mutates only once from the common ancestor over the entire phylogeny. This problem is referred to as Perfect Phylogeny Haplotyping (PPH). Several subsequent results simplified and improved on this original method2, 3, 12, 13. The work produced a fast, practical method for large-scale haplotyping, in which one breaks a large sequence into blocks consistent with perfect phylogenies and uses the phylogenies to phase those sequences.

The perfect phylogeny assumption is quite restrictive, though, and several approaches have been taken to adapt the perfect phylogeny method to more realistic models of molecular evolution. Data inconsistent with perfect phylogenies can arise from multiple mutations of a base over the history of a species or through the processes of recombination or gene conversion, which can assemble hybrid chromosomes in which different segments have different phylogenies. Heuristic methods have allowed some tolerance to recurrent mutations (for example, the work by Halperin and Eskin23), resulting in the generation of imperfect phylogenies. Imperfect phylogeny haplotyping has proven to be very fast and competitive with the best prior methods in accuracy. Some recent work has been directed at provably optimal methods for imperfect phylogeny haplotyping (IPPH), resulting in a polynomial-time algorithm for the case when a single site is allowed to mutate twice (or a single recombination is present), under a practical assumption on the input data33. But the general IPPH problem remained open. It appears for the moment intractable in both theory and practice to infer recombinational histories in non-trivial cases.

Our Contributions: In this work, we solve the general IPPH problem and show for the first time that it is possible to infer imperfect phylogenies with any constant number q of recurrent mutations in polynomial time in the number of sequences and number of sites typed. Our approach builds on both the prior theory on phylogeny construction from diploid data13, 33 and a separate body of theory on the inference of imperfect phylogenies from haplotype data4, 14, 21, 25, 31. Our algorithm reconstructs a q-near-perfect phylogeny in time nm^{O(q)}, where m is the number of characters (variant sites typed), n is the number of taxa (sequences examined), and q is the imperfectness of the solution, defined below. The prior method of Song et al.33 that solves the 1-near-perfect phylogeny haplotyping problem relied on an empirically but not theoretically supported assumption that an embedded perfect phylogeny problem will have a unique solution. We relax this assumption to allow a polynomial number of such solutions and show that the relaxed assumption holds with high probability if the population obeys a common assumption of population genetics theory called Hardy-Weinberg equilibrium. We thus provide a theoretical basis for Song et al.'s empirical observation. We demonstrate and validate our algorithm by comparing with leading phasing methods using a collection of large-scale genotype data of known phase29. We find that our method is efficient and more accurate on blocks with small q. We further provide strong empirical support for the value of continuing research into accelerated algorithms for phylogeny construction from diploid data.


2. PRELIMINARIES

We begin by introducing definitions for parsimony-based phylogeny reconstruction. In such problems, we wish to study the relationship between a set of n taxa, each of which is defined over a set of m binary characters. Input I will be represented by a matrix, where the row set R(I) represents taxa and the column set C(I) represents characters. In a haplotype matrix all taxa r_i ∈ {0,1}^m, and in a genotype matrix r_i ∈ {0,1,2}^m.

Definition 2.1. A phylogeny for an n × m binary input matrix I is a tree T(V, E) and a label function l : V(T) → {0,1}^m with the following properties: R(I) ⊆ l(V(T)) and for all (u, v) ∈ E(T), d(l(u), l(v)) = 1, where d is the Hamming distance. That is, every input taxon appears in T and the Hamming distance between adjacent vertices is 1.

Definition 2.2. We define the following terms for a phylogeny T on input I:

• character μ(e) represents the mutation on edge e = (u, v) s.t. l(u)[μ(e)] ≠ l(v)[μ(e)].
• vertex v ∈ V(T) is terminal if l(v) ∈ R(I) and Steiner otherwise.
• length(T) = |E(T)|.
• phylogeny T is optimal if length(T) is minimized.
• penalty(T) = length(T) - m.
• phylogeny T is q-near-perfect if penalty(T) = q and perfect if penalty(T) = 0.

We say that a character i mutates on edge e if μ(e) = i. We will assume that both states 0, 1 are present in all characters. Therefore the length of an optimum phylogeny is at least m. This provides a natural motivation for the penalty of a phylogeny as defined above. For simplicity, we will drop the label function l(v) and use v to refer to both a vertex and the taxon it represents.

The IPPH problem: The input to the problem is an n × m matrix G, where each row g_i ∈ {0,1,2}^m represents a (genotype) taxon and each column represents a character. The output is a 2n × m matrix H in which each row h_i ∈ {0,1}^m represents a haplotype. Furthermore, corresponding to every taxon g_i ∈ R(G), there are two taxa h_{2i-1}, h_{2i} ∈ R(H) with the following properties:

• if g_i[c] ≠ 2 then h_{2i-1}[c] = h_{2i}[c] = g_i[c]
• if g_i[c] = 2 then h_{2i-1}[c] ≠ h_{2i}[c]

The objective of the IPPH problem is to find an output matrix H such that the length of the optimum phylogeny on H is minimized. This problem is clearly NP-hard, since if matrix G contains no 2s, then the problem is equivalent to reconstructing the most parsimonious phylogenetic tree15. We therefore consider the following parameterized version of the problem. Given matrix G and parameter q, we return matrix H such that there exists an optimal phylogeny T* on H with penalty(T*) ≤ q, under the assumption that such a matrix H exists. Note that the PPH problem is a restriction of IPPH with q = 0.

Definition 2.3. For a set of binary state taxa S and a set of characters C, the set of gametes GAM(S, C) is the projection of S on characters C. In other words, (x_1, ..., x_{|C|}) ∈ GAM(S, C) iff there exists s ∈ S with s[c_i] = x_i for all c_i ∈ C. A set of characters C shares k gametes if |GAM(S, C)| = k. A pair of characters i, j conflict if |GAM(S, {i, j})| = 4.
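A minimal Python sketch of Definition 2.3 and the resulting four-gamete conflict test (ours, for illustration):

def gam(taxa, chars):
    """GAM(S, C): set of gametes, i.e. projections of taxa onto characters."""
    return {tuple(t[c] for c in chars) for t in taxa}

def conflict(taxa, i, j):
    """Characters i, j conflict iff they share all four gametes."""
    return len(gam(taxa, (i, j))) == 4

S = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
print(gam(S, (0, 1)))     # all four gametes appear on characters 0, 1
print(conflict(S, 0, 1))  # True: violates the perfect phylogeny condition
print(conflict(S, 0, 2))  # False: characters 0 and 2 share only two gametes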

Pre-processing: We perform a well established pre-processing step that ensures that for any optimal output matrix H, (0,0) ∈ GAM(H, {i, j}) for all i, j ∈ C(H) (see the work of Eskin et al.13). We assume that the input matrix has no duplicate rows, since such rows do not change the optimal solution. Note that such a matrix should have at most q + 1 more taxa than characters, since otherwise it does not have a solution to the IPPH problem.

3. ALGORITHM

At a high level, our algorithm has the same spirit as the algorithm of Song et al.33. The algorithm guesses characters that mutate more than once and removes them from the input matrix. It then solves the perfect phylogeny haplotyping problem on the remainder of the matrix. Finally, our algorithm adds the removed characters back and performs haplotyping on the full matrix. In this section, we will use the same assumption as Song et al.33: for any subset of characters of the input matrix, if a solution to perfect phylogeny haplotyping (PPH) exists, then it is


function solveIPPH(input matrix G)

(1) guess Q = {i ∈ C(G) | ∃ e1, e2 ∈ E(T*), e1 ≠ e2, μ(e1) = μ(e2) = i}
(2) let M be matrix G restricted to characters C(G) \ Q
(3) let M' be the unique solution to perfect phylogeny haplotyping of M
(4) let H' be matrix M' defined on characters C(G) s.t. ∀ h'_{2i-1}, h'_{2i} ∈ R(H') and corresponding taxa g_i ∈ R(G), ∀ c ∈ Q: h'_{2i-1}[c] = h'_{2i}[c] = g_i[c]
(5) guess κ = {i ∈ C(G) | ∃ j ∈ C(G), |GAM(t(T*), {i, j})| = 4}
(6) guess GAM(t(T*), κ)
(7) loop for c ∈ C(H') \ κ
    (a) H' := processMatrix(H', c)
(8) for all couples h'_{2i-1}, h'_{2i} in H'
    (a) resolve state 2 on h'_{2i-1}, h'_{2i} s.t. GAM({h_{2i-1}, h_{2i}}, κ) ⊆ GAM(t(T*), κ)

function processMatrix(matrix H', character c)

(1) initialize the set A := {c}
(2) while |A| > 0 do
    (a) extract a character c from A
    (b) for all (2, c)-couples h_{2i-1}, h_{2i}
        i. for all ĉ ∈ Q with h_{2i-1}[ĉ] = 2 (= h_{2i}[ĉ])
           A. if |IND(H', {c, ĉ})| = 3 then G(c, ĉ) = IND(H', {c, ĉ}) else guess three gametes G(c, ĉ)
           B. resolve state 2 in ĉ on h_{2i-1}, h_{2i} based on G(c, ĉ)
           C. A := A ∪ {c' ∉ κ | h_{2i-1}[c'] ≠ h_{2i}[c']}, i.e. add to A characters c' ∉ κ for which the current couple is also a (2, c')-couple
    (c) if ∃ ĉ ∈ Q s.t. (c, ĉ) conflict, or for the set H2 of all (2, c)-couples GAM(H2, κ) ⊄ GAM(t(T*), κ), then return "no solutions"
    (d) remove c from C(H')
(3) return H'

Fig. 1. Algorithm to solve IPPH.

unique. In Section 4, we show that the number of PPH solutions is polynomial with high probability.

Throughout the paper, we fix an arbitrary optimal phylogeny T*, which we will use as a reference for expository purposes. Let t(T*) be the set of all taxa (Steiner vertices included) in T*. Let Q be the set of characters that mutate more than once in T*. Since |Q| ≤ q, we can find Q by brute force in time O(Σ_{i≤q} C(m, i)) = O(q m^q).

After finding Q, we can remove the characters to obtain matrix M with character set C(G) \ Q. We now use a prior method to solve the perfect phylogeny haplotyping (PPH) problem on M in time O(nm)12. Let M' be the unique solution to the PPH problem. The solution matrix M' contains 2n taxa and m - |Q| characters. We can now add the characters Q to matrix M' to obtain H': for all j ∈ Q and h'_{2i-1}, h'_{2i} ∈ H' and g_i ∈ G, taxa h'_{2i-1}[j] = h'_{2i}[j] = g_i[j]. Note that matrix H' contains state 2 only on the characters Q (see Fig 3(b) for an example).

Definition 3.1. A pair of taxa h'_{2i-1}, h'_{2i} ∈ R(H') is defined as a couple. For any c ∈ C(H'), if h'_{2i-1}[c] = 2 (= h'_{2i}[c]) or h'_{2i-1}[c] ≠ h'_{2i}[c], then h'_{2i-1}, h'_{2i} is a (2, c)-couple.

Note that if c contains both 0 and 1 in a couple of H', then the couple contained state 2


at character c in the input (before perfect phylogeny haplotyping). Let κ = {i ∈ C(G) | ∃ j ∈ C(G), |GAM(t(T*), {i, j})| = 4} be the set of characters that conflict with some other character in t(T*). Note that Q ⊆ κ. To complete the description and analysis of the algorithm we borrow the following definition from prior work13.

Definition 3.2. We say that an (unordered) couple r_1, r_2 induces (x, y) at characters (c, c') if r_i[c] = x, r_i[c'] = y, or r_1[c] = r_2[c] = 2, r_1[c'] = r_2[c'] = y, or r_1[c] = r_2[c] = x, r_1[c'] = r_2[c'] = 2. We define IND(H', {c, c'}) to denote the set of gametes induced by the couples of H' at c, c'.
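A minimal Python sketch of the induced-gamete sets of Definition 3.2 (ours, for illustration; rows are tuples over {0,1,2}):

def induced(couple, c, cp):
    """Gametes (x, y) a couple induces at characters (c, cp) per Def. 3.2."""
    r1, r2 = couple
    out = set()
    for r in (r1, r2):                      # case 1: a fully resolved row
        if r[c] != 2 and r[cp] != 2:
            out.add((r[c], r[cp]))
    if r1[c] == r2[c] == 2 and r1[cp] == r2[cp] != 2:
        out |= {(0, r1[cp]), (1, r1[cp])}   # case 2: 2s at c, same y at cp
    if r1[cp] == r2[cp] == 2 and r1[c] == r2[c] != 2:
        out |= {(r1[c], 0), (r1[c], 1)}     # case 3: same x at c, 2s at cp
    return out

def ind(rows, c, cp):
    """IND(H', {c, cp}): union of induced gametes over all couples."""
    out = set()
    for i in range(0, len(rows), 2):
        out |= induced((rows[i], rows[i + 1]), c, cp)
    return out

H = [(0, 0, 2), (0, 0, 2), (1, 0, 1), (1, 1, 0)]
print(ind(H, 0, 2))  # four induced gametes on characters 0 and 2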

The goal now is to convert the {0, 1, 2} matrix H' to a {0, 1} matrix H such that the following correctness conditions are satisfied:

(1) for every (2, c)-couple in H', one of the two taxa should contain state '1' and the other '0' on character c in H;
(2) GAM(H, κ) ⊆ GAM(t(T*), κ), i.e. the set of gametes on κ in H is a subset of the set of gametes on κ in T*;
(3) (|GAM(H, {c, c'})| = 4) ⟹ c, c' ∈ κ, i.e. a pair of characters share four gametes in H only if they are both in κ.

In matrix H', if a couple contains state 2 at character c, then replacing it with state 0 on one taxon and 1 on the other is informally referred to as a resolution. The algorithm to solve IPPH is summarized in Figure 1. We now go into the details of the steps. The following lemma shows that Steps 5 and 6 of function solveIPPH can be implemented efficiently:

Lemma 3.1. Sets κ and GAM(t(T*), κ) can be found in time m^{2q} q^{O(q)} + O(nm).

Proof. We can easily identify κ by brute force in time O(m^{|κ|}). Since we do not know |κ|, this step can take time O(2^m). We can however do better by performing such an enumeration over phylogenies, as illustrated in Figure 2. First we construct the unique perfect phylogeny T for the matrix H' restricted to C(G) \ Q, which contains m - |Q| + 1 vertices, in time O(nm)12. Note that contracting the edges e ∈ T* with μ(e) ∈ Q results in tree T (see Figure 2(b)). We will begin with T and add the edges e with μ(e) ∈ Q to obtain T* as follows. Since we know Q already, we can identify the multi-set of |Q| + q edge labels μ in time O(C(|Q|+q, q)) = O(C(2q, q)) = O(4^q). There are m - |Q| + 1 locations to add an edge that mutates a character in Q. All possible edge assignments can be enumerated in time O((m - |Q| + 1)^{|Q|+q}) = O(m^{2q}). Each enumeration assigns a set of edges (a multi-set on characters) Q_v to each vertex v of T. We now enumerate all possible rooted trees T_v with edge labels in Q_v for all vertices v ∈ T in time O((|Q| + q)^{|Q|+q}) = O((2q)^{2q}). Since the mutations in C(G) \ Q are already fixed by the perfect phylogeny, the states of all vertices on the characters C(G) \ Q are known. For every root of T_v, we guess the states on all Q characters, which can be done in time O(q^q) (since this is equivalent to enumerating all tree structures with edges mutating Q). Note that since we guessed the states at the root of T_v, we know the states on all characters for all vertices in T_v (for all v). Further, for any two roots of T_v, T_{v'}, the path that connects them is given by the path connecting v, v' in the perfect phylogeny T (see Figure 2(c)). Therefore, we can identify the edges that lie between two mutations of a character in Q. We can now find κ since for all i ∈ κ \ Q, there exist j ∈ Q and e1, e2, e3 ∈ T* such that μ(e1) = μ(e3) = j, μ(e2) = i, and e2 lies in the path connecting e1, e3.

Since we know the states in all characters of T_v and T_{v'}, we know the states in all the characters for every vertex in the path connecting them as well. We now consider the set V_κ of all vertices that lie in the path connecting two mutations of the same character in T. It is easy to see that GAM(V_κ, κ) = GAM(t(T*), κ). We can therefore identify κ and GAM(t(T*), κ) in time m^{2q} q^{O(q)} + O(nm). □

Figure 3 illustrates how the algorithm performs haplotyping given κ and GAM(t(T*), κ). For reference, Figures 3(a) and 3(b) represent the input matrix G and the matrix H' as found in Step 4 of solveIPPH. Function solveIPPH selects c = 2 (Step 7). Function processMatrix determines G(2,6) = IND(H', {2,6}) = {(0,0), (0,1), (1,1)} and guesses G(2,7) = {(0,0), (0,1), (1,1)} (Step 2(b)iA). Based on these two sets of three gametes, the (2,2)-couples, rows 9 to 12, are resolved (Step 2(b)iB), Fig 3(c). Character 2 is removed (Step 2d).



Fig. 2. Algorithm to determine κ and GAM(t(T*), κ) efficiently. (a) Optimal phylogeny T* with Q = {6,7}, κ = {4,5,6,7}. (b) Perfect phylogeny T on characters C(G) \ Q. (c) The four edges mutating characters 6 and 7 are assigned to three vertices of T; rooted trees T_v are constructed on the assigned edges; states on the three roots 0000000, 0001010, 0000111 determine the edges {4,5} that lie between two mutations of 6 and 7. Filled vertices form V_κ.

Notice that a character c is removed only when all the (2, c)-couples are completely resolved (rows 9 to 12). Function solveIPPH then selects c = 3 (Step 7), Fig 3(d). Function processMatrix guesses G(3,6) = {(0,0), (1,0), (0,1)} and determines G(3,7) = IND(H', {3,7}) = {(0,0), (0,1), (1,1)} (Step 2(b)iA), and using them the (2,3)-couple, rows 7 and 8, is resolved (Step 2(b)iB), Fig 3(e). Since the pair of rows (7, 8) is also a (2,1)-couple, character 1 is added to A (Step 2(b)iC). Character 3 is removed (Step 2d) and c = 1 is extracted from A (Step 2a), Fig 3(f). Since there are no (2,1)-couples, character 1 is removed (Step 2d), Fig 3(g). This exhausts all c ∈ C(H') \ κ. Function solveIPPH then resolves state 2 in the first couple, resulting in 0000, 0010, which are both in GAM(t(T*), κ) (Step 8a), Fig 3(h). Since the next couple has no state 2, we resolve the third couple, which results in 0101, 0111 (Step 8a), completing the algorithm, Fig 3(i). We now prove the main theorem that bounds the running time of our algorithm.

Theorem 3.1. If penalty(T*) ≤ q, then the algorithm described in Figure 1 returns a solution matrix H that obeys all correctness conditions in time nm^{O(q)}.

Proof. To prove this theorem, we first need three simple lemmas.

Lemma 3.2. If c_1 ∈ κ and c_2 ∈ C(H') \ κ, then c_1, c_2 share exactly three gametes in t(T*).

Proof. Every pair of characters shares at least three gametes in T*. Characters c_1, c_2 cannot share four gametes since c_2 ∉ κ. □

Lemma 3.3. For a pair of characters c ∉ κ, ĉ ∈ Q, given a set of three gametes G(c, ĉ), there exists a unique resolution of state 2 in character ĉ for any (2, c)-couple s.t. for the resulting matrix H', GAM(H', {c, ĉ}) ⊆ G(c, ĉ).

Proof. Let r_{2i-1}, r_{2i} ∈ R(H') be a (2, c)-couple. By definition, r_{2i-1}[c] ≠ r_{2i}[c]. If r_{2i-1}[ĉ] = r_{2i}[ĉ] = 2, then a resolution will create either GAM({r_{2i-1}, r_{2i}}, {c, ĉ}) = {(0,0), (1,1)} or {(0,1), (1,0)}. Only one of the two can be contained in the set G(c, ĉ) established between c and ĉ. □

Lemma 3.4. In matrix H', for c ∉ κ, ĉ ∈ Q, if every (2, c)-couple is in {0,1}^m, then GAM(H, {c, ĉ}) is fixed, where H is any matrix obtained from H' by resolution of 2s.

Proof. For any couple in H', if h_{2i-1}[ĉ] = h_{2i}[ĉ] = 2, then according to the condition of the lemma, h_{2i-1}[c] = h_{2i}[c] = s. Therefore, any couple h_{2i-1}, h_{2i} in H obtained by resolving state 2 on ĉ will have the property that GAM({h_{2i-1}, h_{2i}}, {c, ĉ}) = {(s, 0), (s, 1)}. □

Corollary 3.1. Although ĉ can contain state 2 in matrix H', Step 2c of function processMatrix in Figure 1 can test whether c, ĉ are conflicting. Furthermore, if c, ĉ do not conflict at Step 2c, then the final matrix obtained by the algorithm will not have a conflict between c, ĉ.

We now return to the proof of the main theorem. Step 1 requires at most O(m^q) enumerations. Lemma 3.1 shows that, given the correct Q, both κ and GAM(t(T*), κ) can be found in time m^{2q} q^{O(q)} +


(a) Input G: 0000020, 0000211, 0000121, 2020121, 0201012, 0201022. (b) Matrix H', with couples (0000020, 0000020), (0000011, 0000111), (0000121, 0000121), (1000121, 0010121), (0001012, 0101012), (0000022, 0101022). (c)-(i) Intermediate matrices during the resolution steps described in the text.

Fig. 3. Given Q = {6,7}, κ = {4,5,6,7}, GAM(t(T*), κ) = {0000, 0010, 0011, 1011, 1010, 0111, 0101}, the algorithm chooses c = 2, 3, 1. Based on a set of three gametes, it resolves states 2 in ĉ ∈ Q in all (2, c)-couples. Shaded regions represent deleted (or ignored) characters and couples completely resolved by the algorithm. After exhausting all characters c ∈ C(G) \ κ, the algorithm considers the remaining couples and iteratively resolves states 2 s.t. the couples are in GAM(t(T*), κ).

O(nm). We now bound the run time for function processMatrix.

Lemma 3.5. The total run time for all calls to function processMatrix is O(2^q n m^2).

Proof. First notice that once a character c is removed from A at Step 2a of function processMatrix, it is never added back throughout the execution of the algorithm. This is because function processMatrix resolves all the states 2 present in (2,c)-couples, and subsequently the character is deleted (or ignored) for the rest of the algorithm in Step 2d. This bounds the number of times Step 2a is executed throughout the algorithm by O(m). To prove the lemma, we now bound the time for executing Steps 2a through 2d by O(2^q · nm).

The number of (2,c)-couples is O(n), and therefore Step 2b loops O(n) times. The cardinality of Q is at most q, and therefore the loop in Step 2(b)i executes O(|Q|) = O(q) times. The time bound for guessing the set of gametes G(c, c̄) shared between c and c̄ is the hardest to analyze.

Consider any one call to function processMatrix. Let the c_j's represent the characters added into A during the execution of the call. We show that if G(c_j, c̄) is guessed at Step 2(b)iA, then for all future c_k encountered during the execution of Step 2(b)iA, |IND(H′, {c_k, c̄})| = 3. When the algorithm guessed G(c_j, c̄), by the definition of (2,c_j)-couples, h_{2i-1}[c_j] ≠ h_{2i}[c_j], and by the condition on Step 2(b)i, h_{2i-1}[c̄] = h_{2i}[c̄] = 2. Also, h_{2i-1}[c_k] = h_{2i}[c_k] = x, since otherwise |IND(H′, {c_k, c̄})| = 3 (c_k and c̄ cannot be identical characters). Let c_l be the character extracted from A and used in the loop of Step 2b during which c_k was added into A. This implies the existence of a couple h′_{2i-1}, h′_{2i} which is both a (2,c_l)-couple and a (2,c_k)-couple. However, h′_{2i-1}[c̄] = h′_{2i}[c̄] = y; again, otherwise |IND(H′, {c_k, c̄})| = 3. Therefore matrix H′ induces gametes (x, 0), (x, 1), (0, y), (1, y) on the characters (c_k, c̄). For all values of x, y ∈ {0,1}, this results in three induced gametes. The above argument therefore shows that the if condition in Step 2(b)iA fails at most q times for any call to processMatrix. Therefore the probability of all guesses performed at Step 2(b)iA being correct for any single call to processMatrix is at least 2^(−q). Equivalently, we suffer a multiplicative factor of 2^q in the running time because of this step.

The check performed at Step 2c takes time O(nm). Assuming q < m, the total running time for all calls to processMatrix combined is O(2^q · nm²). •

Using Lemma 3.2, we know that c shares exactly three gametes with every character c̄ ∈ Q in t(T*). For character c, Step 2(b)iA iterates (guesses) over the set of all possible gametes shared between c and c̄. Given the set of three gametes, Lemma 3.3 shows that there is a unique resolution of states 2 in the (2,c)-couples, and this is performed in Step 2(b)iB. Correctness condition 1 holds by the definition of resolution. Step 2c of function processMatrix checks for conditions 2 and 3. Using Corollary 3.1, we know that c and c̄ cannot conflict because of the


resolution of any of the remaining 2s (ensuring correctness condition 3). Finally, there can be no 2s in character c (since c ∉ Q) or in any of the (2,c)-couples, since they were just resolved. Step 8 of function solveIPPH iterates n times. At each iteration, it performs a brute-force step of computing all possible ways of resolving the 2s. Since only the characters in Q can contain state 2 at this point, Step 8a takes O(m·2^q) time. This step also checks whether the resulting gametes on characters in K are in the predicted set GAM(t(T*), K) (ensuring correctness condition 2). This shows that any matrix found by the algorithm obeys the correctness conditions and that the running time is nm^O(q). Finally, we know that there exists a set of three gametes G(c, c̄) (as defined by t(T*)) s.t. resolving based on G will ensure that c, c̄ do not conflict and GAM(H′, K) ⊆ GAM(t(T*), K). Using these two observations, we know that if penalty(T*) ≤ q, then the algorithm finds a matrix H′ that obeys the correctness conditions in the stated time.

We now prove the correctness of our algorithm:

Theorem 3.2. Any solution matrix H obeying all the correctness conditions is optimal.

Proof. The proof is constructive and demonstrates the procedure to construct a q-near-perfect phylogeny for H. The phylogeny, along with correctness condition 1, guarantees that the returned matrix H is an optimal solution.

Matrix H satisfies the following two properties: if a pair of characters c, c′ conflict in H, then c, c′ ∈ K (third correctness condition); and a q-near-perfect phylogeny can be constructed on GAM(H, K) (since GAM(H, K) ⊆ GAM(t(T*), K), second correctness condition). It can be shown that a q-near-perfect phylogeny can be constructed for any matrix that satisfies the above two properties (see Section 7 of Gusfield and Bansal21). Such a phylogeny is obtained by constructing the q-near-perfect phylogeny on GAM(H, K), contracting that phylogeny to a vertex, and constructing a perfect phylogeny on the remaining characters. •

Theorem 3.3. The algorithm of Figure 1 returns an optimal solution H to the IPPH problem in time nm^O(q).

Proof. The proof follows directly from Theorems 3.1 and 3.2. •

4. SOLUTIONS TO PPH

We assumed in the preceding sections, following prior work33, that the perfect phylogeny stage of the algorithm will find a unique solution, but this assumption does not necessarily hold. To guarantee optimality, the algorithm would need to enumerate over all solutions to the PPH sub-problem, increasing the running time proportionally. Prior work showed that the number of PPH solutions is at most 2^k, where k is the number of characters of G that do not contain the homozygous minor allele13,19. In the worst case, k could be as large as m, and therefore the number of PPH solutions can be 2^m. This, however, should not occur in practice, since the underlying population typically follows a random mating model.

Hardy-Weinberg equilibrium states that the two haplotypes of any given individual are selected independently of one another at random. For any fixed character, let p be the minor allele frequency and 1 − p be the major allele frequency. Consider G, an n × m input genotype matrix to the PPH problem. The probability under Hardy-Weinberg that none of the n taxa contains the homozygous minor allele is (1 − p²)^n. This probability could be large for very small values of p.

It is reasonable to assume that p ≥ c for some constant c, since otherwise SNPs cannot be detected. In this case, with high probability in n (at least 1 − (1 − c²)^n), a specific character contains the homozygous minor allele. Therefore, in expectation, the number of characters without a homozygous minor allele is at most m(1 − c²)^n. This expectation tends to zero exponentially in n.

We now consider a more general setting in which the value of p is assumed to be uniformly distributed in [0, 0.5]. Now, the probability that a specific character does not contain the homozygous minor allele is:

$$
2\int_0^{0.5} (1-p^2)^n \, dp
= 2\int_0^{1/\sqrt{n}} (1-p^2)^n \, dp + 2\int_{1/\sqrt{n}}^{0.5} (1-p^2)^n \, dp
\le \frac{2}{\sqrt{n}} + 2\int_{1/\sqrt{n}}^{0.5} e^{-np^2} \, dp
\le \frac{2 + e^{-1}}{\sqrt{n}}
= O\!\left(\frac{1}{\sqrt{n}}\right).
$$

Now, if n = Ω(m²), then using Chernoff bounds we can show that, with high probability, the number of characters that lack a homozygous minor allele is bounded by 2 log m, and therefore the number of solutions to the PPH problem is bounded by m².
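As a quick numerical sanity check on this bound (not part of the original analysis), the probability can be integrated directly; the Python sketch below, with a helper name of our own choosing, confirms that the probability decays like 1/√n:

```python
import numpy as np

def prob_no_homozygous_minor(n, grid=200001):
    """Probability that none of n genotypes is homozygous for the minor
    allele, when the minor allele frequency p is uniform on [0, 0.5]:
    the average of (1 - p^2)^n over that interval (a midpoint rule)."""
    p = (np.arange(grid) + 0.5) / (2.0 * grid)   # midpoints in (0, 0.5)
    return float(np.mean((1.0 - p ** 2) ** n))

for n in [100, 400, 1600, 6400]:
    est = prob_no_homozygous_minor(n)
    # The bound above says est = O(1/sqrt(n)), so est * sqrt(n) stays bounded.
    print(n, round(est, 5), round(est * np.sqrt(n), 3))
```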

This discussion answers the question raised by Gusfield on the theoretical estimate of the number of PPH solutions to expect from a coalescent model19. Furthermore, since PPH is always performed on SNP blocks with low diversity, it is not unreasonable to assume n ≫ m. Since the number of solutions returned by PPH is O(m²), the IPPH algorithm described above with high probability suffers only O(m²) overhead for finding the optimal extension of each PPH solution.

5. EMPIRICAL VALIDATION

We demonstrate and validate our algorithm by comparing it with leading phasing methods on a collection of large-scale genotype data of known phase from a high-resolution scan of human chromosome 21 (Ref. 29). The study identified common single nucleotide polymorphisms (SNPs) and typed them on 20 sequences through a method that directly identifies haplotypes rather than genotypes. The SNPs were partitioned into 4135 haplotype blocks by the authors of that study using a diversity test. We extracted haplotypes from the 'haplotype pattern' file provided in the supplementary material, which identifies distinct haplotypes in each block and provides some inference of missing data when it can be done unambiguously. We ignored haplotypes that still had significant missing data, replacing them with haplotypes randomly selected from the multinomial

distribution of fully-resolved haplotypes in the corresponding block, in order to maintain 20 chromosomes per block. The haplotype blocks were divided into four sets based on their imperfectness (q = 0, 1, 2, 3+) using a prior method31. A large majority of blocks (98%) are 0-, 1-, or 2-near-perfect. We do not solve optimally for the 2% with imperfectness 3 or greater, because the running time for optimal solutions would be prohibitive. Such blocks can be solved non-optimally in practice by subdividing them into smaller blocks of lower imperfectness, but such data would not be comparable to those solved optimally and are therefore omitted from our analysis. We then randomly paired haplotypes to produce 10 diploid genotypes per block as our final test set. Figure 4 shows that the length of the blocks is related to the imperfectness q. The tails are heavier for larger values of q, as expected. For instance, a significant fraction of the q = 2 blocks have a large number of characters when compared with q = 1 or q = 0.


Fig. 4. Distribution of the fraction of blocks as a function of the number of characters in the block. Data for more than 50 characters not shown.

We compared our method with two popular phasing packages. We ran the haplotyper28 package, which uses an expectation-maximization heuristic, with 20 rounds (the recommended value). We also ran the PHASE35 package, which employs a Markov chain Monte Carlo method. Although

Page 225: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

208

                                 Error rate                  Total run time (secs)
Perfectness  #Blocks  #SNPs   PHASE  haplotyper  our alg   PHASE    haplotyper  our alg
    0         3497    20816   0.11     0.11       0.17    3521.33    337.18      17.21
    1          461     4211   0.53     0.47       0.35     805.62     80.62       8.77
    2           93     1266   0.83     0.68       0.55     268.02     59.28    1111.18

Fig. 5. Empirical results on Chromosome 21. Blocks with 0-, 1-, and 2-near-perfectness (accounting for 4051 out of 4135 blocks) were analyzed separately using the three algorithms. Error rate is the number of switch errors divided by the number of blocks. Total run time is the sum over all blocks.

PHASE can make use of additional information such as SNP positions, we provided it only the genotypes as input.

For our own method, for any given q, we enumerated all possible phylogenies that are at most q-near-perfect. Note that this does not guarantee finding all possible output matrices that are consistent with a q-near-perfect phylogeny. Where multiple solutions were obtained, we selected the one that minimized the entropy of the haplotype frequencies. For the PPH sub-problem we implemented a fast prior algorithm13. We note that, for simplicity, we do not implement the algorithm exactly as described: function processMatrix does not add characters into A to be processed iteratively. The implementation, however, always returns a haplotype output matrix, and we report the switch errors33 for the output.
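The switch-error metric is easy to state in code. The helper below is our own illustrative implementation of the usual definition (count the positions, among heterozygous sites, where the inferred phase must flip between the two true haplotypes), not the authors' evaluation script:

```python
def switch_errors(true_pair, inferred_pair):
    """Count phase switches between the inferred and true haplotype
    pairs of one individual, over the heterozygous sites.
    Haplotypes are equal-length strings over {'0', '1'}."""
    t1, t2 = true_pair
    i1, i2 = inferred_pair
    het = [j for j in range(len(t1)) if t1[j] != t2[j]]
    switches, phase = 0, None
    for j in het:
        aligned = (i1[j] == t1[j])   # does inferred hap 1 track true hap 1 here?
        if phase is not None and aligned != phase:
            switches += 1
        phase = aligned
    return switches

# One switch: the phase flips between the 2nd and 3rd heterozygous sites.
print(switch_errors(("0011", "1100"), ("0000", "1111")))  # -> 1
```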

The table in Figure 5 summarizes the test results. (Haplotyper crashed on 18 blocks, which were ignored in the calculation of its accuracy.) All methods provide comparable accuracy when blocks are perfect. We attribute the slightly worse performance of our method on perfect input to the fact that we do not use any procedure to select a maximum-likelihood output matrix among all perfect matrices, as is done for prior perfect phylogeny methods13. All three methods degrade in accuracy as imperfectness increases. Our method, however, scales much better with imperfectness than do the other two, clearly outperforming them on 1-near-perfect and 2-near-perfect inputs. Imperfectness can result from recurrent mutations, recombinations, or incorrect SNP inferences, and we attribute our method's superior performance on imperfect data to the fact that it explicitly handles one of these factors, while the others suffer from all three. Our method is extremely fast for q = 0, where it

reduces to the perfect phylogeny algorithm of Eskin et al.13, and it also substantially outperforms the competing methods for q = 1. Our method's run time rapidly increases with larger q, though, as expected from the theoretical bounds.

6. DISCUSSION AND CONCLUSIONS

We have developed a theory for reconstructing phylogenetic trees directly from genotypes that is optimal in the number of mutations. As an immediate application, we solve the general IPPH problem. We demonstrate practical results that show great promise in accurately inferring phase from real data sets. Our results suggest that imperfect phylogeny approaches can lead to significant improvements in accuracy over other leading phasing methods. Run time, while very fast for perfect and almost-perfect data, remains an obstacle for even modest q; this observation suggests a need for further research into improving theoretical and practical bounds for general q. Our new method has several immediate applications in computational genetics:

• Phasing: At present, the method is competitive with the most widely used tools in accuracy and, with optimizations, should become competitive in run time for larger q. Both PHASE and haplotyper return confidence scores on their results, which might allow a slower, high-accuracy method such as ours to function as a fall-back for regions that those methods cannot infer with confidence.

• Phylogeny Construction: Our run times are competitive with typical times to be expected from other optimal phylogeny reconstruction algorithms, even when the input consists of haplotypes. Our approach may thus be considered preferable to the standard practice of inferring haplotypes and then fitting them to phylogenies.



• Haplotype Block Inference: Our method could serve as an improved means of identifying recombination-free haplotype blocks for purposes of association study design, by more accurately distinguishing recurrent mutation from recombination. The blocks might thus be useful in improving statistical power in haplotype-based association testing.

Future empirical studies, enabled by our method, will be needed to better establish the nature of imperfectness in real genotype data and the degree to which better handling of recurrent mutations will be of use in practice.

Acknowledgments

This work is supported in part by NSF ITR grant CCR-0122581 (The ALADDIN project) and NSF grant CCF-043075.

References

1. R. Agarwala and D. Fernandez-Baca. A Polynomial-Time Algorithm for the Perfect Phylogeny Problem when the Number of Character States is Fixed. SIAM Journal on Computing, 23 (1994).

2. V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: A direct approach. Journal of Computational Biology, 10 (2003).

3. V. Bafna, D. Gusfield, G. Hannenhalli, and S. Yooseph. A note on efficient computation of haplotypes via perfect phylogeny. Journal of Computational Biology, 11 (2004).

4. G. E. Blelloch, K. Dhamdhere, E. Halperin, R. Ravi, R. Schwartz and S. Sridhar. Fixed Parameter Tractability of Binary Near-Perfect Phylogenetic Tree Reconstruction. To appear in proc International Colloquium on Automata, Languages and Programming (2006).

5. H. Bodlaender, M. Fellows and T. Warnow. Two Strikes Against Perfect Phylogeny. In proc International Colloquium on Automata, Languages and Programming (1992).

6. H. Bodlaender, M. Fellows, M. Hallett, H. Wareham and T. Warnow. The Hardness of Perfect Phylogeny, Feasible Register Assignment and Other Problems on Thin Colored Graphs. Theoretical Computer Science (2000).

7. M. Bonet, M. Steel, T. Warnow and S. Yooseph. Better Methods for Solving Parsimony and Compatibility. Journal of Computational Biology, 5(3) (1992).

8. A. G. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution 7:111-122 (1990).

9. W. H. Day and D. Sankoff. Computational Complexity of Inferring Phylogenies by Compatibility. Systematic Zoology (1986).

10. P. Damaschke. Parameterized Enumeration, Transversals, and Imperfect Phylogeny Reconstruction. In proc International Workshop on Parameterized and Exact Computation (2004).

11. R.G. Downey and M. R. Fellows. Parameterized Complexity. Monographs in Computer Science (1999).

12. Z. Ding, V. Filkov and D. Gusfield. A Linear Time Algorithm For Perfect Phylogeny Haplotyping. In proc Research in Computational Molecular Biology (2005).

13. E. Eskin, E. Halperin and R. M. Karp. Efficient Reconstruction of Haplotype Structure via Perfect Phylogeny. Journal of Bioinformatics and Computational Biology (2003).

14. D. Fernandez-Baca and J. Lagergren. A Polynomial-Time Algorithm for Near-Perfect Phylogeny. SIAM Journal on Computing, 32 (2003).

15. L. R. Foulds and R. L. Graham. The Steiner problem in Phylogeny is NP-complete. Advances in Applied Mathematics (3) (1982).

16. G. Ganapathy, V. Ramachandran and T. Warnow. Better Hill-Climbing Searches for Parsimony. In proc Workshop on Algorithms in Bioinformatics (2003).

17. D. Gusfield. Efficient Algorithms for Inferring Evolutionary Trees. In: Networks, 21 (1991).

18. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press (1999).

19. D. Gusfield. Haplotyping as a Perfect Phylogeny: Conceptual Framework and Efficient Solutions. In proc Research in Computational Molecular Biology (2002).

20. D. Gusfield. An Overview of Combinatorial Methods for Haplotype Inference. Lecture Notes in Computer Science, 2983 (2004).

21. D. Gusfield and V. Bansal. A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters. In proc Research in Computational Molecular Biology (2005).

22. D. Gusfield, S. Eddhu and C. Langley. Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination. In proc IEEE Computer Society Bioinformatics Conference (2003).

23. E. Halperin and E. Eskin. Haplotype Reconstruction from Genotype Data using Imperfect Phylogeny. Bioinformatics (2004).

24. D. A. Hinds, L. L. Stuve, G. B. Nilsen, E. Halperin, E. Eskin, D. G. Ballinger, K. A. Frazer, D. R. Cox. Whole Genome Patterns of Common DNA Variation in Three Human Populations, www.perlegen.com. Science (2005).

25. D. Huson, T. Klopper, P. J. Lockhart, M. A. Steel. Reconstruction of Reticulate Networks from Gene Trees. In proc Research in Computational Molecular Biology (2005).

26. The International HapMap Consortium. The International HapMap Project, www.hapmap.org. Nature 426 (2003)

27. S. Kannan and T. Warnow. A Fast Algorithm for the Computation and Enumeration of Perfect Phylogenies. SIAM Journal on Computing, 26 (1997).

28. T. Niu, Z. S. Qin, X. Xu, and J. Liu. Bayesian Haplotype Inference for Multiple Linked Single Nucleotide Polymorphisms. American Journal of Human Genetics (2002).

29. N. Patil et al. Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21. In Science (2001).

30. H. J. Promel and A. Steger. The Steiner Tree Problem: A Tour Through Graphs, Algorithms and Complexity. Vieweg Verlag (2002).

31. S. Sridhar, K. Dhamdhere, G. E. Blelloch, E. Halperin, R. Ravi and R. Schwartz. Simple Reconstruction of Binary Near-Perfect Phylogenetic Trees. In proc International Workshop on Bioinformatics Research and Applications (2006).

32. C. Semple and M. Steel. Phylogenetics. Oxford University Press (2003).

33. Y. Song, Y. Wu and D. Gusfield. Algorithms for Imperfect Phylogeny Haplotyping with a Single Homoplasy or Recombination Event. In proc Workshop on Algorithms in Bioinformatics (2005).

34. M. A. Steel. The Complexity of Reconstructing Trees from Qualitative Characters and Subtrees. Journal of Classification, 9 (1992).

35. M. Stephens, N. Smith and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics (2001).


TOWARD AN ALGEBRAIC UNDERSTANDING OF HAPLOTYPE INFERENCE BY PURE PARSIMONY

Daniel G. Brown and Ian M. Harrower

David R. Cheriton School of Computer Science, University of Waterloo,

200 University Avenue W.,

Waterloo, Ontario, Canada N2L 3G1

Email: {browndg, imharrow}@cs.uwaterloo.ca

Haplotype inference by pure parsimony (HIPP) is known to be NP-hard. Despite this, many algorithms successfully solve HIPP instances on simulated and real data. In this paper, we explore the connection between algebraic rank and the HIPP problem, to help identify easy and hard instances of the problem. The rank of the input matrix is known to be a lower bound on the size of an optimal HIPP solution. We show that this bound is almost surely tight for data generated by randomly pairing p haplotypes derived from a perfect phylogeny when the number of distinct population members is more than ((1+ε)/2) p log p (for some positive ε). Moreover, with only a constant multiple more population members, and a common mutation, we can almost surely recover an optimal set of haplotypes in polynomial time. We examine the algebraic effect of allowing recombination, and bound the effect recombination has on rank. In the process, we prove a stronger version of the standard haplotype lower bound. We also give a complete classification of the rank of a haplotype matrix derived from a galled tree. This classification identifies a set of problem instances with recombination where the rank lower bound is also tight for the HIPP problem.

Keywords: Haplotype inference, phylogenetic networks, galled tree, probabilistic analysis of algorithms.

1. INTRODUCTION

Haplotype inference is the process of attempting to identify the chromosomal sequences that have given rise to a diploid population. Recently, this problem has become increasingly important, as researchers attempt to connect variations to inherited diseases. The simplest haplotype inference problem to describe is haplotype inference by pure parsimony (HIPP), introduced by Gusfield1. The goal is to identify a smallest set of haplotypes to explain a set of genotypes. This objective is partly justified by the observation, now made in several species2,3, that in genomic regions under strong linkage disequilibrium, few ancestral haplotypes are found.

The problem is also interesting purely combinatorially; it is NP-hard1, and its only known polynomial-time approximation algorithms have exponentially bad performance guarantees4. In practice, though, it is surprisingly easy to solve, especially when applied to synthetically generated test instances arising from standard evolution models. Several authors have developed integer programming or branch-and-bound algorithms that have shown fast performance1,5-7, and some real-world problems have been readily solved exactly or near-exactly7.

To help resolve this conflict between theoretical complexity and practical ease of solution, we explore the algebraic structure of the problem. We focus on problem instances resulting from random pairing of haplotypes generated by standard evolution models. Our results focus on the algebraic rank of the genotype matrix. This is known to be a lower bound on the optimal solution size of the HIPP problem8. We show this bound is almost surely tight for data generated by ((1+ε)/2) p log p random pairings of p haplotypes from a perfect phylogeny (for positive ε).

In Section 5, we strengthen our results by exploring population models that allow for recombination. Here, we show that for haplotypes with few recombinations, the rank bound is near-exact, and we show in particular the conditions under which it is exact for galled trees, the simplest kind of phylogenetic networks that include recombination. In addition, this study of algebraic rank allows us to prove a new variation of the "haplotype lower bound"9 on the number of recombinations required to explain a haplotype matrix.

The rank lower bound is important: when tight, it verifies for branch-and-bound procedures that they have found an optimum; moreover, computing the size of the HIPP solution is NP-hard in general. However, we can go one step further for some HIPP instances, to compute an optimal set of haplotypes in polynomial time. Our second theorem, in Section 4, shows a constructive algorithm that almost surely runs in polynomial time on instances derived from perfect phylogenies with a constant factor more genotypes than for our first theorem. Our result requires that there be a mutation found in at least an α fraction of the genotypes. We connect this theorem to the standard coalescent model from population genetics in Section 4.1, where we show that for haplotypes generated by this model, we can guarantee such a common mutation exists with probability at least 1 − ε when the mutation parameter θ of the coalescent model is at least a constant depending on ε and α.

Our results identify structure in HIPP instances and show how to easily solve HIPP in such cases, despite its theoretical hardness. Although we do not show why previous haplotype inference algorithms often run surprisingly quickly, we do show that many of the instances upon which they are run have special structure, which reduces the complexity of the problem.

2. BACKGROUND AND RELATED WORK

We begin by briefly reviewing existing work on haplotype inference, particularly with the parsimony objective, and on a phase transition for haplotyping problems.

2.1. Haplotype inference and notation

The input to a haplotype inference algorithm is a genotype matrix, G. Each of its n rows represents the genotype g_i of population member p_i; the m columns represent sites in the genome. G(i,j) is the genotype of population member p_i at site s_j.

We assume there are two alleles, 0 and 1, at each site, and thus three choices for G(i,j): G(i,j) = 0 if both parent chromosomes of p_i have allele 0 at s_j, G(i,j) = 1 at positions where one parent has each allele, and G(i,j) = 2 if both have allele 1. (Note: this is not the standard notation, which exchanges the meanings of 1 and 2.)

Haplotype inference consists of explaining G with a 0/1 haplotype matrix, H. The k rows of H represent possible chromosome choices and their alleles at the m sites. Genotype g_i is explained by H if there exist two rows (possibly the same) of H whose sum is g_i. We can represent this pairing by an n × k pairing matrix, Π, where row r_i of Π has value 1 in the two columns corresponding to the two parent haplotypes of g_i, and 0 elsewhere. (If two copies of the same haplotype explain genotype g_i, the row of Π has a 2 in that column and 0 elsewhere.) In our formulation, haplotype inference consists of finding a haplotype matrix H and a valid pairing matrix Π such that G = Π · H. (The simplicity of this formulation explains our notational choice; it is inspired by a formulation of He and Zelikovsky10.)
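This formulation is easy to check on a toy instance. The numpy sketch below (with a made-up three-haplotype pool of our own) shows a genotype matrix arising as the product Π · H; note the 0/1/2 genotype encoding defined above:

```python
import numpy as np

# Three haplotypes over four sites (the rows of H).
H = np.array([[0, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0]])

# Pairing matrix Pi: genotype 0 pairs haplotypes 0 and 1; genotype 1
# pairs haplotype 2 with itself, hence the entry 2 in its row.
Pi = np.array([[1, 1, 0],
               [0, 0, 2]])

G = Pi @ H
print(G)
# [[0 1 2 1]    sums of the selected haplotype rows: 0 = homozygous 0,
#  [2 2 0 0]]   1 = heterozygous, 2 = homozygous 1, as in the text
```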

2.2. Pure parsimony

In haplotype inference by pure parsimony (HIPP), introduced by Gusfield1, we seek the smallest set of haplotypes to explain G. This corresponds to finding the smallest H such that a proper pairing matrix Π exists where G = Π · H. It is NP-hard1, and its known approximation algorithms have only exponential approximation guarantees4.

Still, many instances of this problem have been easy to solve. Gusfield1 gave an integer linear programming formulation for the problem that, though theoretically exponential in size, often solves very quickly. Halldorsson et al. found a polynomial-sized IP formulation for the problem11; we independently identified and extended it with further inequalities5,7. Our experiments demonstrated that even on large instances, the optimal solution can be found easily. Why are HIPP instances often solvable in practice?

2.3. Haplotypes and genotypes

To answer this question, we examine some models for generating HIPP instances: where do haplotypes come from, and how are they paired to make genotypes?

The simplest generative model for haplotypes is perfect phylogeny. In this model, all m sites evolve according to a rooted phylogenetic tree, with the root at the top of the tree. Without loss of generality, we assume that at every sampled site s_j, the common ancestor of all haplotypes had allele 0. Each site is assigned to a single edge of the tree, which represents when the unique mutation of that site occurred. Leaves descendant from that point have allele 1 at site s_j; other leaves have allele 0. A matrix H compatible with this framework is a PPH matrix (for perfect phylogeny haplotype; we can also relax the all-zero ancestor requirement). For ease of calculation, we also include a single non-polymorphic site s_0, where all haplotypes have value 1; this has no effect on any solution, since any haplotype pairing gives the same genotype, 2, at that site.

More complex generative models for haplotypes allow for recombination, breaking the rule that every site derives from the same single phylogenetic tree. We discuss recombination in Section 5. We can also allow a probabilistic process for generation of the haplotype matrix; the simplest of these is described in Section 4.1, where we consider model parameter settings that make the conditions of our second major theorem (Theorem 4.1) likely to hold.

Haplotypes pair to form genotypes. The simplest process is random pairing: each genotype results from pairing two haplotypes sampled with replacement. This corresponds to n edges being picked from a random multigraph model; every edge (i,j) has probability 2/k², and every loop (i,i) has probability 1/k². A haplotype or genotype may occur in multiple copies.

2.4. A phase transition for haplotyping?

In 2003, Chung and Gusfield12 examined the number of distinct PPH solutions to a genotype matrix obtained by randomly pairing 2n haplotypes from a perfect phylogeny without replacement n times. This model of data generation is not the same as the model studied in this paper, but their observations are clearly related. Let k be the number of distinct haplotypes. There seems to be a phase transition: when n ≪ k log k, there are many PPH solutions; when n ≫ k log k, there is typically only one. Cleary and St. John13 then studied the structure of random pairing graphs; they showed that if there are o(k log k) population members, with high probability there are multiple PPH solutions. They also show experimentally, but do not prove, that above this bound there is usually a unique PPH solution. Note that the number of sites and the number of distinct haplotypes are directly correlated in the coalescent model. This observation explains the dependence on the length of the sequences observed in their experiments.

3. LINEAR ALGEBRAIC STRUCTURE AND A FIRST BOUND

Our work focuses on the rank of the genotype matrix, which Kalpakis and Namjoshi8 have previously noted is a lower bound on the size of the solution to the HIPP instance.

Lemma 3.1. The number of haplotypes in the solution to the HIPP instance G is at least k* = rank(G). If there exist a k* × m haplotype matrix H and an n × k* pairing matrix Π such that G = Π · H, then H forms an optimal set of haplotypes for G.

Proof. Since H is a valid set of haplotypes for G only if there exists a pairing matrix such that G = Π · H, H must have rank at least rank(G), and must have at least that many rows. If G = Π · H and H has exactly k* rows, then it matches the lower bound and is optimal. •

Corollary 3.1. If Π is an n × k* pairing matrix of rank k*, H is a k* × m haplotype matrix of rank k*, and G = Π · H, then k* = rank(G) and (Π, H) is an optimal HIPP solution for G.

Proof. This follows since if both Π and H are of full rank k*, so is their product. •
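Computing this lower bound in practice is a single rank computation; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def hipp_rank_lower_bound(G):
    """rank(G) lower-bounds the number of haplotypes in any HIPP
    solution: G = Pi @ H forces rank(G) <= rank(H) <= rows(H)."""
    return np.linalg.matrix_rank(np.asarray(G, dtype=float))

G = [[0, 1, 2, 1],
     [2, 2, 0, 0],
     [1, 2, 1, 0]]
print(hipp_rank_lower_bound(G))  # any explaining H needs at least this many rows
```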

3.1. The rank of the pairing and haplotype matrices

We now consider when we can expect G to be created from full-rank matrices H and Π.

3.1.1. The rank of the pairing matrix

Lemma 3.2. If Π is a random pairing matrix for k haplotypes with more than ((1+ε)/2) k log k pairings, for some constant ε > 0, then rank(Π) = k almost surely as k → ∞.

Proof. Π is the node-edge incidence matrix of the pairing multigraph. A standard graph theory result shows that Π is full rank if all connected components of the graph are non-bipartite14. For example, if the graph is connected and contains a triangle, then rank(Π) = k.

If the graph has ((1+ε)/2) k log k pairings, then almost surely as k → ∞ it has at least ℓ = ((1+ε′)/2) k log k distinct non-self pairings, for some constant ε′ > 0. As such, it contains a subgraph G′ from the random graph model G(k, ℓ), with k nodes and ℓ edges, and each possible edge equally likely. The classic Erdős–Rényi Theorem15 shows that for such ℓ, G′ is connected almost surely, and a standard textbook exercise (see, for example, Bollobás's textbook16) gives that such a graph contains a triangle almost surely, as k → ∞. As such, Π is of rank k almost surely as k → ∞. •

3.1.2. Pairing in non-uniform haplotype pools

We may also be sampling from a non-uniform population, where some haplotypes are more common than others. If there is a pool of p different haplotypes being randomly paired to form genotypes, but only k distinct haplotypes, we may require substantially more genotypes in order to be confident that we have paired all of the haplotype kinds.

In this representation, we can assume that H, the haplotype matrix used to create G, is a p × m matrix, and Π, the pairing matrix, is n × p. If the matrix Π is of rank p and the matrix H is of rank k, then their product G = Π · H is also of rank k.

The Erdős–Rényi Theorem shows that as long as the number of distinct non-self pairings is at least ((1+ε′)/2) p log p (which is true, again, almost surely if the number of genotypes is ((1+ε)/2) p log p, for some ε′ > 0), the pairing graph on the p nodes represented by the pool of p haplotypes is connected, and again, it will also contain a triangle as a subgraph as before. Hence, Π will be full rank (of rank p).

3.2. The haplotype matrix

Now we consider the haplotype matrix. It is of rank k when it comes from a perfect phylogeny.

Lemma 3.3. If H is a p × m haplotype matrix with k distinct haplotypes and can be realized by a perfect phylogeny, with one column s_0 of all ones, then H is of rank k.

Proof. Each distinct haplotype h_i corresponds to a different leaf in the phylogenetic tree, and has value 1 at positions corresponding to mutations on the path from leaf to root. Consider two neighbouring leaves i and j in the subtree induced by the haplotypes of H. Their sequences are distinct, so at least one of i or j has a mutation on the path from their common ancestor to it; suppose it is i. Then h_i has a one found in no other haplotype, and is thus linearly independent of all other haplotypes. Remove h_i and repeat this process for all k haplotypes. The column s_0 prevents a row of all zeros being the last row left, ensuring the matrix is of rank k, not k − 1. Elementary column operations allow us to extend to the case where the root has allele 1 at sites other than s_0. •
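Lemma 3.3 can be verified directly on a small hand-built perfect phylogeny; in the sketch below (our own example), each polymorphic site mutates on one tree edge and s0 is the all-ones column:

```python
import numpy as np

# Four distinct haplotypes over sites s0..s4 from a perfect phylogeny:
# s0 is non-polymorphic (all ones); s1 sits on one edge below the root,
# s2 on the other, and s3, s4 on edges below s2.
H = np.array([[1, 1, 0, 0, 0],   # leaf below the s1 mutation
              [1, 0, 1, 0, 0],   # leaf below s2 only
              [1, 0, 1, 1, 0],   # leaf below s2 and s3
              [1, 0, 1, 0, 1]])  # leaf below s2 and s4
print(np.linalg.matrix_rank(H))  # 4: rank equals the number of distinct haplotypes
```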

3.3. Putting it together: a first bound

If the data are generated by a perfect phylogeny, and there are enough population members, then the rank of the genotype matrix is the optimal number of haplotypes.

Theorem 3.1. Let G be a genotype matrix produced by random pairing of a pool of p haplotypes (not necessarily unique) that are generated by a perfect phylogeny (with a column of all ones). If G has at least ((1+ε)/2) p log p pairings, then the size of the optimal solution to the HIPP instance G is rank(G), almost surely as p → ∞.

Proof. The pairing matrix is almost surely of full rank, p, by Lemma 3.2, and the rank of the haplotype matrix equals its number of distinct haplotypes, by Lemma 3.3. Hence, Corollary 3.1 shows that the initial set of haplotypes found in H is an optimal set for this HIPP instance. •

Thus, for many instances, the general NP-hardness of HIPP is partly ameliorated: we can easily compute the size of the optimum, if not the actual haplotypes. In the next section, we see that PPH instances with a few times more pairings can be exactly solved.

4. WHEN CAN WE FIND THE ACTUAL HAPLOTYPES?

For instances of the problem with a constant factor more genotypes than the bound of Theorem 3.1, and with a "common" mutation, we can almost surely identify the optimal haplotypes for PPH instances. We do this by connecting to a variant of the HIPP problem, where we restrict the haplotype matrix to be a PPH matrix. Our main result is the following constructive theorem.

Theorem 4.1. Let G be derived from randomly pairing p haplotypes compatible with a perfect phylogeny, represented by the haplotype matrix H, with k distinct haplotypes. Suppose there exists a column of H (without loss of generality, s_1) with at least αp of both zeros and ones, for some α > 0. If G arises from at least max(((1+ε)/2) p log p, ((2+ε)/α) p log k) pairings of members of H, then we can solve HIPP in polynomial time almost surely as k → ∞ (and consequently as p → ∞).

For example, if p = k and there is a site with minor allele frequency at least 25%, Theorem 4.1 says that if the number of rows of G is at least (8+ε) k log k, then the optimal set of haplotypes can be found almost surely in polynomial time as k → ∞.

We prove Theorem 4.1 through several steps. Our first lemma notes that we can restrict ourselves to PPH matrices H. The Min-PPH problem, studied by Bafna et al.17, is the HIPP problem subject to this restriction on H. (Bafna et al. have shown that Min-PPH is NP-hard.)

Lemma 4.1. If the conditions of Theorem 4.1 hold, then almost surely, the set of unique haplotypes in H is an optimal HIPP solution, and also an optimal solution to the Min-PPH instance G.

Proof. The number of distinct population members is greater than ((1+ε)/2) p log p, so Theorem 3.1 applies: rank(G) = k almost surely, and the unique haplotypes of H form an optimal HIPP solution for G. Since they satisfy a perfect phylogeny, they are also a smallest PPH solution. •

Lemma 4.1 allows us to restrict our search to PPH solutions. Lemma 4.2 shows there exists only one, with high probability; Corollary 4.1 shows it can be found in polynomial time. This will complete the proof of Theorem 4.1.

Lemma 4.2. Given a genotype matrix G satisfying the conditions of Theorem 4.1, almost surely there exists only one set of haplotypes that satisfies a perfect phylogeny and can generate G.

Proof. We prove the lemma via a property of the DPPH algorithm of Bafna et al.18. This algorithm constructs a graph with one vertex for each column of G. The main result we use is that the number of PPH solutions for G is 2^(c−1), where c is the number of connected components in a specific subgraph18 of this graph. We show that if G satisfies the conditions of Theorem 4.1, then with high probability, c = 1, and there is only one PPH solution.

Graph D(G) has one vertex for each distinct column in the genotype matrix G, and an edge between two vertices s and s′ if there exists a row g of G with value 1 in columns s and s′ whose resolution at sites s and s′ is restricted by the perfect phylogeny condition. Recall from Section 2 that in our notation, 1 represents a heterozygous site. More precisely, we connect s and s′ if we can find three genotypes g_1, g_2 and g_3 such that the 3 × 2 submatrix of G induced by these genotypes and sites has one of the forms in Figure 1. In each of these forms, one possible resolution of sites s and s′ violates the perfect phylogeny condition, so the possible space of PPH solutions is restricted. The total number of PPH solutions for G equals 2^(c−1), where c is the number of connected components of D(G)18.

(a)  1 1     (b)  1 1     (c)  1 1
     1 x          0 0          2 0
     y 1          2 2          0 2

Fig. 1. Patterns of 3 × 2 submatrices which cause an edge to be added between the vertices representing the sites in D(G). The values x and y are each either 0 or 2.
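As an illustration, the three forms are simple to test for; the helper below is our own simplified reading of the Figure 1 patterns (it checks whether the required genotype rows occur somewhere in the two columns), not the DPPH implementation itself:

```python
def dg_edge(col_s, col_t):
    """Return True if columns s and t of a genotype matrix exhibit one
    of the three 3 x 2 patterns of Figure 1 (x, y in {0, 2}), so that
    an edge s -- t would be added to the graph D(G)."""
    rows = set(zip(col_s, col_t))
    have = lambda *need: all(r in rows for r in need)
    form_a = any(have((1, 1), (1, x), (y, 1)) for x in (0, 2) for y in (0, 2))
    form_b = have((1, 1), (0, 0), (2, 2))
    form_c = have((1, 1), (2, 0), (0, 2))
    return form_a or form_b or form_c

# Columns containing rows (1,1), (1,0) and (0,1): form (a) with x = y = 0.
print(dg_edge([1, 1, 0], [1, 0, 1]))  # -> True
```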

To show D(G) is almost surely connected, we show that almost surely there is an edge between each node and the node for s_1, the site with the common mutation. For any site s_i, let e_i be the tree edge containing the mutation at site s_i, let A_i be the set of haplotypes below e_i, and let B_i be all other haplotypes. When considering the random pairings that make genotypes, we use (X, Y) to denote a pairing of a haplotype from class X with a haplotype from class Y.


Consider a site s. We consider a few cases on s, depending on whether e_s is below e_1 or not, and on the size of A_s. First, suppose e_s is not below e_1. If A_s has fewer than αp/2 haplotypes, we will have an edge between s and s_1 in D(G) if the events (A_s, A_1), (A_s, B_1 \ A_s) and (B_1 \ A_s, A_1) occur, since these events produce a submatrix of type (a) in Figure 1. If |A_s| ≥ αp/2, the events (A_s, A_1), (A_s, A_s) and (A_1, A_1) produce a submatrix of type (c) in Figure 1.

Suppose instead e_s is below e_1. If |A_s| < αp/2, the events (A_s, B_1), (A_s, A_1 \ A_s) and (B_1, A_1 \ A_s) give a submatrix of form (a), while if |A_s| ≥ αp/2, the events (A_s, B_1), (A_s, A_s) and (B_1, B_1) give a submatrix of form (b). In a random pairing, the needed events always have probability at least α/(2p).

Therefore, for each column s, three events, each with probability at least α/(2p), will connect s and s_1. This totals fewer than 6k events, since there are at most 2k − 2 distinct columns in a perfect phylogeny with k distinct leaves. By the coupon-collector lemma19, after ((2+ε)/α) p log k random pairings, the probability that some needed event has not yet occurred is less than 6k^(−ε/2).

Thus, if the conditions of Theorem 4.1 are satisfied, then almost surely, D(G) is connected, and the DPPH results of Bafna et al.18 show that there exists a unique PPH solution for G. •

Corollary 4.1. A Min-PPH instance G satisfying the conditions of Theorem 4.1 can be solved in polynomial time with high probability.

Proof. The algorithm DPPH of Bafna et al.18 gives a representation of all PPH solutions for a given genotype matrix G in polynomial time, and allows their enumeration in time polynomial in the input matrix size and proportional to the number of PPH solutions. Lemma 4.2 shows that there is a unique PPH solution almost surely, and it can be recovered in polynomial time. •

4.1. A bound on finding a common mutation in a coalescent model

Our theorems show that, almost surely, if there exists a mutation with minor allele frequency at least α in our data, we will be able to solve HIPP if the number of genotypes is above a relatively small bound. How often does such a common mutation occur? One would likely not study a population if all mutations were rare, but we can also give a partial answer to this question probabilistically, in a simple version of the coalescent model from population genetics.

Our results show that to guarantee that with probability 1 − ε there is a site chosen with minor allele frequency at least α, one needs to set the parameter θ in the coalescent process to a constant depending only on α and ε; hence, the number of polymorphic sites needs only increase as the logarithm of the population sample's size. Our bounds are coarse, but again prove that one can have provably high success in solving synthetic HIPP instances.

The question of how many mutations with minor allele frequency α can be expected to exist has been studied by theoretical population geneticists; see, for example, Fu20. However, the work of these authors has mostly concerned the expected number of mutations with minor allele frequency α; we need to determine how large θ must be in order to be confident that a mutation of the type we desire exists almost surely.

4.1.1. An introduction to the coalescent model

Infinite-site constant-population coalescent models are a standard population genetics model used to produce haplotypes. We describe them briefly, focusing on the details we need; for full detail, see Hein et al.21. In particular, we focus on the event order, not the time between events; see Hudson22 for justification of this approach. The coalescent approach is used, for example, in the program ms23, which has been used by several groups to generate HIPP problem test instances by randomly pairing the resultant haplotypes1,5,7.

The coalescent model describes the descent of a population under neutral evolution. We use it to generate rooted trees with p leaves, where each leaf represents a haplotype. We will describe the model going backward in time: we begin with the p leaves, and coalesce them to their common ancestor. Two kinds of events can occur as we move backwards in time: mutations and coalescences; the parameter θ governs which of these is more likely to happen. (In population genetics, θ = 2Nμ depends on both μ, the mutation rate, and N, the effective population size.) If k lineages are active at a point in time, a coalescence is the next event with probability (k−1)/(k−1+θ), and a mutation with probability θ/(k−1+θ). When a mutation is indicated, one of the active lineages is uniformly chosen, and a mutation at a new polymorphic site is assigned to it. When a coalescence is indicated, two lineages are uniformly chosen and joined into a common ancestor. The process continues until only one lineage remains, which is the common ancestor for all sites. All random choices are independent. We establish haplotypes for the sequences, as in Section 2.3.

We can sample from the same distribution of tree topologies by thinking about the coalescent process moving forward in time, not backward. The standard way to do this is as a branching process, where we start with one lineage, and then whenever a divergence event is indicated, a lineage is chosen from all of the i lineages, and it bifurcates to produce i + 1 lineages; this process is continued until there are n lineages present. (Mutations occur in this process as in the backward conception of the coalescent, but we can ignore this here. We will use the forward branching process model only to estimate the number of lineages at a time when a mutation occurring on one lineage in particular would be sufficient to create a mutation of our desired type; we then switch to the backward coalescent version of the process, conditioned on having a lineage of our desired type.)

For our purposes, however, it is actually easier to use a non-standard forward-in-time way to sample from this distribution. A fact about the branching process is that if we pick one of the two lineages that result from the initial bifurcation, its number of eventual descendants in the n-member population is uniformly distributed over {1, . . . , n − 1} (see, e.g., Ref. 22). Moreover, the structures of the two trees that result from this bifurcation, one with i eventual descendants and the other with n − i, are themselves chosen independently from the coalescent distribution with those numbers of nodes.

As such, we can generate trees from the coalescent distribution in a somewhat different-appearing formulation that is still equivalent, by annotating each lineage with the number of eventual descendants that it will have; the initial lineage is annotated to have n eventual descendants. We still pick our branching lineage uniformly at random from the active lineages, but a lineage is only active if it has more than one eventual descendant; those with only one descendant will never bifurcate again. When we choose a lineage with i eventual descendants to bifurcate, we sample the number of descendants that one of the new lineages will have uniformly from {1, . . . , i − 1}, with the other new lineage resulting from the bifurcation having i minus that many descendants. After n − 1 such bifurcations, we will have chosen the topology of our coalescent tree. Since we are successively conditioning according to the probabilities of the traditional forward coalescent process, the tree we choose by this procedure is chosen from the same probabilistic distribution as for the classic model.
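This descendant-annotated process takes only a few lines to implement. The sketch below is our own code; since the order in which active lineages bifurcate does not affect the final topology, it simply recurses on each lineage's descendant count:

```python
import random

def coalescent_topology(n, rng=random):
    """Sample a coalescent tree topology via the forward branching
    process described in the text: a lineage destined for d eventual
    descendants splits them Uniform{1, ..., d-1} between its children."""
    counter = iter(range(n))
    def grow(d):
        if d == 1:
            return next(counter)        # an inactive lineage: one leaf
        left = rng.randint(1, d - 1)    # uniform split of the d descendants
        return (grow(left), grow(d - left))
    return grow(n)

random.seed(7)
print(coalescent_topology(6))  # a nested-tuple topology on leaves 0..5
```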

4.1.2. Common mutations

We now connect the coalescent model to our HIPP theorems. If we use the model to generate p haplotypes, we can apply Theorem 4.1 if we have a polymorphic site with its minor allele found in at least αp haplotypes.

Theorem 4.2. Suppose that we produce p haplotypes by the coalescent model of Section 4.1.1. For any ε > 0, if we choose the parameter θ of the coalescent model to be at least ((1−ε)/ε)(ε^(1/(2α−1)) − 1), then with probability at least 1 − 2ε, we will have a site with minor allele frequency at least α. Also, if θ is ω(1) as a function of p, such a site is chosen almost surely as p → ∞.

We prove Theorem 4.2 by focusing on finding one edge with between αp and (1 − α)p descendants (a "good" edge) with a mutation. Our bounds are likely coarse as a consequence. First, we show that there is likely a good edge reasonably high in the tree.

Lemma 4.3. Consider a coalescent tree with p leaves, and assume 0 < α < 1/3. The probability that the top of a good edge exists in the tree at or before the ℓ-th bifurcation from the top is at least 1 − ℓ^(2α−1). Given ε > 0, with probability at least 1 − ε, there is such an edge starting at or above the ℓ*_{α,ε}-th bifurcation, for ℓ*_{α,ε} = ε^(1/(2α−1)).

Proof. At each step, there exists a lineage with the most descendants. If it has fewer than (1 − α)p descendants, we have already seen a good edge. To see this, consider the first time this happens: we divided a value greater than (1 − α)p into two parts, both smaller than (1 − α)p. Since α < 1/3, one is at least αp.

Thus, we can concern ourselves with bifurcations on the lineage with the most descendants. At step i in the coalescent process, going forward, this lineage has probability at least 1/i of being chosen to bifurcate; it may be more if there are lineages with only one descendant haplotype. If it bifurcates, the probability of a good edge being produced is at least 1 − 2α. Since all bifurcations are independent, we can upper bound the probability of no good edge occurring by level ℓ by

∏_{i=1}^{ℓ} (1 − (1−2α)/i) ≤ e^(−(1−2α) ∑_{i=1}^{ℓ} 1/i) ≤ e^((2α−1) log ℓ) = ℓ^(2α−1).

The bound as a function of the probability 1 − ε of a good edge at or above level ℓ*_{α,ε} is easily shown by arithmetic. •

Now, we can finish the proof of Theorem 4.2.

Proof. By Lemma 4.3, with probability at least 1 − ε, there is a period in the coalescent history of the sequences during which there are fewer than ℓ*_{α,ε} lineages, and during which a mutation on one lineage would be a good mutation. Sticking to such instances, and now working backward in time, the next event on that lineage is a mutation with probability at least θ/(ℓ*_{α,ε} − 1 + θ); if there are more coalescences of other parts of the tree before an event on our good lineage, this only increases the probability of mutation preceding coalescence on the lineage of interest. Setting this probability equal to 1 − ε and solving for θ gives that if θ ≥ ((1−ε)/ε)(ℓ*_{α,ε} − 1), then the probability of a mutation of minor allele frequency at least α is at least 1 − 2ε. •
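For concreteness, the final solve is one line of algebra (writing ℓ* for ℓ*_{α,ε}; this is our expansion of the step the proof leaves to the reader):

```latex
\[
\frac{\theta}{\ell^{*}-1+\theta} = 1-\varepsilon
\;\Longrightarrow\;
\varepsilon\,\theta = (1-\varepsilon)(\ell^{*}-1)
\;\Longrightarrow\;
\theta = \frac{1-\varepsilon}{\varepsilon}\,(\ell^{*}-1),
\qquad \ell^{*} = \varepsilon^{1/(2\alpha-1)}.
\]
```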

Our bounds, while perhaps odd, are constant as a function of p. They indicate how far down in the tree one must look in order to be guaranteed a high probability of having found a good edge, and this depends solely on α. Since the expected number of mutations in the tree is approximately Poisson distributed with mean θ log p (see Ref. 21), we note that if the number of mutations accumulated is ω(log p), then a common mutation exists almost surely (for any α) as p grows.

4.2. Applicability to small populations

The previous theorems are asymptotic results depending on the value of p. However, a small experiment shows that they apply for small p as well. For a variety of values of p, we used Hudson's program ms23 to generate a PPH matrix with varying values of the mutation parameter θ, and paired the haplotypes randomly to generate n distinct genotypes. Shown in Table 1 are the numbers of genotypes needed so that in 200 experiments, the generating set of haplotypes (after removing duplicates) was always optimal for HIPP and was the unique Min-PPH solution. (We verified the number of PPH solutions using Ding, Filkov and Gusfield's LPPH24.) Even for moderate values of p and θ, 1.1 p log p genotypes satisfied these conditions.

Table 1. The smallest number of genotypes n for which all 200 trials passed the rank and PPH tests.

                    Number of haplotypes p
  θ      10    15    30    50    75    100   150   200
  5      25    45    60    75    130   115   150   185
 10      25    40    85    130   155   195   260   310
 20      35    45    90    160   220   260   440   515
 40      35    55    110   200   295   395   540   675

5. THE ALGEBRAIC RANK OF NON-PPH INSTANCES

For PPH instances, rank(H) exactly equals the number of unique haplotypes. If Π is full rank, then the unique rows of H form an optimal solution to the HIPP instance G = Π · H with exactly rank(G) haplotypes. For instances generated by models that include recombination, the situation is more complicated: H may not be full rank, and the rank of G may not equal the size of its optimal solution. We now study the ranks of haplotype matrices in such models, assuming always that Π is of full rank. We use the rank of H to prove a lower bound on the number of recombinations in a phylogenetic network that explains a set of haplotypes, which is provably at least as strong as the commonly used "haplotype lower bound"9. For galled trees, a class of recombination networks, we give a full characterization of the rank. One interesting feature of our findings is that estimating the number of recombinations and performing haplotype inference by pure parsimony seem opposed to each other. We discuss this more in Section 5.4.

5.1. Phylogenetic networks with recombination

We first give a combinatorial description of this domain. A phylogenetic network is a rooted directed acyclic graph with edges pointing away ("down") from the root. The leaves (indegree 1 and outdegree 0) correspond to current haplotypes. Coalescent nodes (indegree 1 and outdegree 2) correspond to the most recent common ancestor of their descendant haplotypes. Recombination nodes (indegree 2 and outdegree 1) correspond to the point where two incoming lineages recombine to form a single lineage, a prefix of one lineage followed by a suffix of the second lineage. The node is labelled to indicate which parental lineage takes each role and the discrete recombination breakpoint where the recombination occurs.

Mutations are assigned to edges of the network. Each mutation has a chromosomal position, which mutates once, from allele 0 to allele 1, in the network. We assume that the haplotype at any position in the network identifies the allele found at that network position for every site with a mutation in the network. Without loss of generality, we assume that at the top of the network, the haplotype is all zeros, except for a one at a special site s_0 that never mutates. At a coalescent node or a leaf, the haplotype is the haplotype at the parent of the node, with zeros changed to ones at positions corresponding to any mutations on the edge separating them. At recombination nodes, labelled with position k, the first k sites of the haplotype come from the parent corresponding to the prefix edge into the node, and the remainder comes from the other parent.

An important element of a phylogenetic network is its recombination cycles between recombination nodes and coalescent nodes. For any network, we can give a partial order over recombination cycles, where one cycle precedes another when its recombination node is a descendant of the other's, and then identify an order from the bottom of the network to the top.

Gusfield et al.25 defined a simple kind of recombination network, galled trees, in which every edge is found in at most one recombination cycle. We will focus our attention on the rank of data coming from a galled tree, but first give a general rank bound.

5.2. The rank of data from phylogenetic networks

We can easily relate the data rank to the number of recombinations in the network.

Theorem 5.1. Let H be a haplotype matrix with k unique rows, derived from a phylogenetic network with r recombinations. Then rank(H) ≥ k − r. Stated another way, r ≥ k − rank(H).

Proof. We prove this by induction on the number of recombinations in the network. If the network has no recombinations, it is a tree and has a column of all ones, so Lemma 3.3 applies. If not, consider a lowest recombination node. Below it is a tree; suppose its leaves carry p unique haplotypes. If there is a mutation found in all p of these haplotypes and in no others, then the lemma applies and removing the p haplotypes drops the rank by p. If no such mutation exists, removing the p haplotypes drops the rank by at least p − 1, but not necessarily p. In either case, we remove that recombination node and its descendants and obtain a network with one fewer recombination. □

This rank bound may be surprisingly useful; it is similar in spirit to the "haplotype lower bound"⁹ on the number of recombinations required to explain a haplotype matrix, which equals k − c + 1, where c is the number of unique columns in the matrix H. The haplotype bound is often negative, because there may be many distinct columns, but k − rank(H) is always non-negative, and thus may be stronger. We also note that this bound applies for unknown ancestral sequence; it can be adjusted in the standard way to apply to the case of a known ancestral sequence. Of course, rank(H) is slower to compute than the number of distinct columns of H, but use of the bound may still be interesting to explore.
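For concreteness, here is a minimal sketch of the two bounds compared above, assuming the haplotypes arrive as rows of a 0/1 numpy array; the rank is computed over the rationals, and the haplotype-bound formula follows the text as printed.

    import numpy as np

    def recombination_lower_bounds(H):
        # k = number of distinct haplotypes (rows), c = number of distinct
        # columns; returns the rank bound k - rank(H) of Theorem 5.1 and the
        # haplotype lower bound k - c + 1 quoted above.
        Hu = np.unique(H, axis=0)
        k = Hu.shape[0]
        c = np.unique(H.T, axis=0).shape[0]
        return k - np.linalg.matrix_rank(Hu), k - c + 1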


5.3. The algebraic rank of galled trees

The rank of the haplotype matrix can decrease from full rank by at most the number of recombinations in the network that gives rise to the haplotypes. In the case of galled trees, we can identify for each recombination whether it actually does reduce the rank or not.

We will inspect recombination cycles from the bottom of the network up, and identify them as rank-decreasing, rank-maintaining, or rank-confounding. For rank-confounding cycles, all haplotypes in the cycle are independent, but there may be a dependency between them and the other haplotypes in the tree, so we add a new haplotype to determine this.

There are three types of node in the recombination cycle. The coalescent node and recombination node that together define the recombination cycle will be referred to simply as the coalescent node and the recombination node, respectively. The other nodes in the cycle (although they mark coalescence events) will be referred to as cycle nodes. In a recombination cycle, a node is included if it represents a haplotype in H and no higher node on the cycle also represents that haplotype. The sides of the cycle are the two directed paths from the coalescent node to the recombination node. Consecutive included nodes are nodes on either side of the cycle that have only unincluded nodes between them. Let h_c be the haplotype at the coalescent node and h_r the haplotype at the recombination node.

Theorem 5.2. A recombination cycle is rank-maintaining if:

• the recombination node is not included, or
• the recombination node is included and has a mutation found in no other included node in the cycle, or
• between the coalescent node and the first included cycle node on either side of the cycle, or between any two consecutive included nodes on the cycle, two mutations are found on either side of the recombination breakpoint of the cycle (a "rank-maintaining pair").

A recombination cycle is rank-decreasing if it is not rank-maintaining, and the coalescent node of the cycle is included.

A recombination cycle is rank-confounding if it is neither rank-maintaining nor rank-decreasing. The rank of H will be the same as that of H with h_r removed and h_c added.

Proof. We need to detect whether the haplotype h_r at the recombination node is independent of all other haplotypes. We begin with the rank-maintaining cases. The case where h_r is not included is trivial. If there is a mutation j unique to h_r among cycle nodes, that column is independent of all other columns, so the recombination node is independent; other haplotypes may possess mutation j, but they also possess other mutations not found in h_r. If there is a rank-maintaining pair, then every included haplotype that possesses one mutation from the pair also possesses the other, except h_r, so h_r is independent.

If there is no rank-maintaining pair, all mutations on the cycle can be individually isolated by subtracting haplotypes at consecutive included nodes. Elementary row operations can thus transform h_r into h_c. If h_c is already in H, then h_r is not independent of the other haplotypes in H; if it is not, it may or may not be independent. We must add h_c to our set of haplotypes to find out. □

Going through each cycle obeying the partial order, we can identify its effect on the rank, and thus compute the overall rank of H.

5.4. Consequences of the rank bounds for phylogenetic networks

In the standard variant of the coalescent process that includes recombination, the relative rates of recombination and mutation are given by two parameters, ρ and θ. When ρ is large relative to θ, recombination is common, whereas when θ is large, recombination is rare.

The bounds from Theorems 5.1 and 5.2 can be read as saying that if mutation is common relative to recombination, the rank of H (and consequently of G = Π · H, if Π is full rank) is likely to be close to its number of unique haplotypes; many mutations make it likely that the haplotype at a recombination node is linearly independent of the other haplotypes. When the rank is high, the most parsimonious set of haplotypes producing the genotype matrix G is likely to have close to the same number of distinct haplotypes as does H.

By contrast, when recombination is more common than mutation, we may start to accumulate many rank-decreasing cycles. This may mean that for the genotype matrix G, the rank bound on the HIPP solution may be far from optimal. But, interestingly, for the haplotype matrix H, the lower bound on the minimum number of recombinations to explain H will be increasingly accurate, since this bound goes up as the rank goes down.

This suggests a tension between the HIPP problem and estimating the number of recombinations: for instances with few recombinations, we get little information about the minimum number of recombinations, but may obtain a close match on the minimum number of haplotypes. For instances with lots of recombinations, we get no information about the minimum number of haplotypes, but may get some information about the number of recombinations. Unfortunately, we cannot identify which of these bounds we have, just from the rank of G.

6. CONCLUSION

We have presented several results about algebraic rank and HIPP instances, studying HIPP instances generated by randomly pairing haplotypes drawn from two important models from population genetics. For data generated by a perfect phylogeny on p haplotypes, when the number of distinct population members is more than a constant multiple of p log p, the size of the optimal solution equals the rank of the genotype matrix. Moreover, with only a few times more genotypes and a common mutation, we can recover the haplotypes in polynomial time.

We studied more closely data generated by the coalescent model often studied in population genetics; this is relevant, for example, for data generated with Hudson's popular ms package²³. We showed that a constant value of the mutation parameter θ suffices to guarantee a common mutation with probability 1 − ε.

Finally, we examined the effect of adding recombination to the generative model. Here we derived two interesting results. First, we provided a variant of the "haplotype lower bound", showing that rank still gives a close bound when the model allows recombination. Second, we completely classified the algebraic rank of haplotype matrices derived from galled trees, the simplest type of phylogenetic network with recombination. The algebraic structure of HIPP instances will likely have other fruitful consequences as well.

Acknowledgements

This research has been supported by the Natural Sciences and Engineering Research Council of Canada, through a Discovery Grant to D.B. and Postgraduate and Canada Graduate Scholarships to I.H. We would like to thank Katherine St. John for sending us a copy of her paper with Sean Cleary13, and Dan Gusfield, Yun Song and Brian Golding for helpful conversations.

References

1. D. Gusfield. Haplotype inference by pure parsimony. In Proceedings of CPM 2003, pages 144-155, 2003.
2. The International HapMap Consortium. A haplotype map of the human genome. Nature, 437(7063):1299-1300, 2005.
3. K. Lindblad-Toh et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438(7069):803-809, 2005.
4. G. Lancia, C. M. Pinotti, and R. Rizzi. Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS Journal on Computing, 16:348-359, 2004.
5. D. G. Brown and I. M. Harrower. A new integer programming formulation for the pure parsimony problem in haplotype analysis. In Proceedings of WABI 2004, pages 254-265, 2004.
6. L. Wang and Y. Xu. Haplotype inference by maximum parsimony. Bioinformatics, 19(14):1773-1780, 2003.
7. D. G. Brown and I. M. Harrower. Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(2):141-154, 2006.
8. K. Kalpakis and P. Namjoshi. Haplotype phasing using semidefinite programming. In Proceedings of BIBE 2005, pages 145-152, 2005.
9. S. R. Myers and R. C. Griffiths. Bounds on the minimum number of recombination events in a sample history. Genetics, 163:375-394, 2003.
10. J. He and A. Zelikovsky. Linear reduction for haplotype inference. In Proceedings of WABI 2004, pages 242-253, 2004.
11. B. V. Halldorsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, and S. Istrail. A survey of computational methods for determining haplotypes. In Computational Methods for SNPs and Haplotype Inference: DIMACS/RECOMB Satellite Workshop, volume 2983 of LNCS, pages 26-47, 2004.
12. R. H. Chung and D. Gusfield. Empirical exploration of perfect phylogeny haplotyping and haplotypers. In Proceedings of COCOON 2003, pages 5-19, 2003.
13. S. Cleary and K. St. John. Analyses of haplotype inference algorithms. 2005. Manuscript under review.
14. C. Van Nuffelen. On the incidence matrix of a graph. IEEE Transactions on Circuits and Systems, 23(9):572, Sep 1976.
15. P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290-297, 1959.
16. B. Bollobás. Random Graphs. Cambridge University Press, 2nd edition, 2001.
17. V. Bafna, D. Gusfield, S. Hannenhalli, and S. Yooseph. A note on efficient computation of haplotypes via perfect phylogeny. Journal of Computational Biology, 11:858-866, 2004.
18. V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: A direct approach. Journal of Computational Biology, 10:323-340, 2003.
19. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
20. Y. X. Fu. Statistical properties of segregating sites. Theoretical Population Biology, 48(2):172-177, 1995.
21. J. Hein, M. H. Schierup, and C. Wiuf. Gene Genealogies, Variation and Evolution. Oxford University Press, 2005.
22. R. R. Hudson. Gene genealogies and the coalescent process. Oxford Surveys of Evolutionary Biology, 7:1-44, 1990.
23. R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337-338, 2002.
24. Z. Ding, V. Filkov, and D. Gusfield. A linear-time algorithm for the perfect phylogeny haplotyping (PPH) problem. In Proceedings of RECOMB 2005, pages 585-600, 2005.
25. D. Gusfield, S. Eddhu, and C. Langley. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology, 2(1):173-213, 2004.


GLOBAL CORRELATION ANALYSIS BETWEEN REDUNDANT PROBE SETS USING A LARGE COLLECTION OF ARABIDOPSIS ATH1 EXPRESSION PROFILING DATA

Xiangqin Cui and Ann Loraine

¹ Section on Statistical Genetics, Department of Biostatistics
² Department of Genetics
³ Department of Medicine
⁴ Corresponding author

University of Alabama at Birmingham
Birmingham, AL 35294
{aloraine,xcui}@uab.edu

Oligo-based expression microarrays from Affymetrix typically contain thousands of redundant probe sets that match different regions of the same gene. We used linear regression and correlation to survey redundant probe set behavior across nearly 500 quality-screened experiments from the Arabidopsis ATH1 array manufactured by Affymetrix. We found that expression values from redundant probe set pairs were often poorly correlated. Pre-filtering expression results using MAS5.0 "present-absent" calls increased the overall percentage of well-correlated probe sets. However, poor correlation was still observed for a substantial number of probe set pairs. Visual inspection of non-correlated probe sets' target genes suggests that some may be inappropriately merged gene models and represent independently expressed, but neighboring loci. Others may reflect differential regulation of alternative 3-prime end processing. Results are on-line at http://www.transvar.org/exp_cor/analysis.

1. INTRODUCTION

Affymetrix microarrays contain thousands of probes which are grouped into probe sets, collections of probes that (typically) hybridize to 300-500 bp sequence segments near the three prime ends of target transcripts. Due to the high frequency of alternative mRNA processing (splicing and polyadenylation) in many eukaryotic genomes, Affymetrix arrays typically include multiple probe sets that match predicted or known mRNA variants produced by the same gene. Because these probe sets measure different regions (or transcripts) of the same gene, we designate these as "redundant probe sets."

Thanks to new public resources that archive and distribute expression data from hundreds, sometimes thousands, of microarray experiments, it is now possible to survey the behavior of individual probe sets across many different experimental conditions and laboratory settings. The Nottingham Arabidopsis Stock Centre's NASCArrays is perhaps the acme of such services¹. For a nominal fee, users can subscribe to the NASC AffyWatch service, which delivers quarterly DVDs bearing expression data in the form of array 'CEL' files containing numeric probe intensity data from microarray scans. These CEL files, the majority of which are from the ATH1 microarray², are contributed by users who enjoy discounted array processing service from NASC in exchange for donating their data for public use.

Because the ATH1 array is based on a solved genome, data from NASC provide an unprecedented opportunity to investigate the long-range behavior of redundant probe sets. Toward this end, we analyzed co-expression patterns among redundant probe sets using a database that contained data from nearly 500 quality-screened ATH1 array hybridizations.

2. METHODS

We obtained probe set-to-gene mappings and gene structure annotations (version 6) from the Arabidopsis Information Resource (TAIR)³. To simplify the analysis, we purged all probe sets that mapped to multiple genes. Using methods described previously⁴, we created an expression database containing quality-screened data from 486 array hybridizations compiled from AffyWatch Release 1.0. Array data were processed using RMA⁵, followed by quantile-quantile normalization. We also processed the CEL files using MAS5.0⁶ and generated Present, Absent, and Marginal "calls" for each probe set. All array processing was done using the BioConductor software under default settings⁷.

We used R to perform linear regression and compute Pearson's correlation coefficient for each pair of redundant probe sets that measure the same gene. Results from these analyses, including scatter plots showing regression results, are posted as Supplementary Files at our Web site http://www.transvar.org/exp_cor/analysis.
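The computation itself was done in R; the Python/scipy sketch below shows the equivalent per-pair calculation, with hypothetical expression vectors standing in for the 486-array data.

    import numpy as np
    from scipy import stats

    def probe_pair_stats(x, y):
        # Pearson's r and a linear regression between the expression vectors
        # of two redundant probe sets (one value per hybridization).
        r, _ = stats.pearsonr(x, y)
        fit = stats.linregress(x, y)
        return r, fit.slope, fit.rvalue ** 2  # r, slope, R-squared

    # hypothetical RMA expression values across 486 arrays
    x = np.random.default_rng(1).normal(8, 1, 486)
    y = 0.9 * x + np.random.default_rng(2).normal(0, 0.5, 486)
    print(probe_pair_stats(x, y))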

We manually inspected gene models using the Integrated Genome Browser (http://genoviz.sourceforge.net) and the IGB Quickload site http://www.transvar.org/data/quickload, which serves probe set-to-genome alignments generated by Affymetrix and Arabidopsis gene annotations (versions 5 and 6). To assess cDNA evidence, we used the Sequence Viewer tool at the TAIR Web site.

3. RESULTS

The ATH1 array contains 21,148 probe sets that uniquely map to 20,987 protein-coding genes in the Arabidopsis genome, as determined by extensive sequence analysis performed at TAIR. Of these 21,148 probe sets, 309 are redundant probe sets measuring 148 genes (Table 1). To simplify the analysis, we focused on the 142 genes interrogated by two probe sets each.

We hypothesized that if redundant probe sets measure related molecular entities, i.e., transcripts whose synthesis is driven by the same promoter, they should exhibit a high degree of correlation across a broad range of conditions. To test this, we computed Pearson's correlation coefficient (r) and performed linear regression between each pair of redundant probe sets. Interestingly, we found that many redundant probe sets are not well correlated (Figure 1A).

Table 1. Breakdown of redundant probe sets per gene on the ATH1 expression microarray

Probe sets per gene    Genes
1                      20,839
2                      142
3                      4
>3                     2

It is commonly believed that less than half of the genes in a genome are simultaneously expressed⁸,⁹. If true, the low degree of correlation between some redundant probe sets may be the result of including low expression readings whose target gene was not actually expressed. The readings from these probe sets might represent random noise and therefore exhibit low correlation. To reduce the influence of probe set readings not derived from bona fide expression of their respective target genes, we ran the "present-absent" call procedure in MAS5.0 for each probe set on each array and eliminated probe set readings called "Absent" from the analysis. We then re-computed linear regression and Pearson's correlation coefficient for the redundant probe set pairs in which both partners were called "Present" in at least 20 chips. This filtering step removed 55 genes, leaving 87 for further correlation analysis.
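A small sketch of this filtering rule, assuming per-array expression vectors and MAS5.0 call vectors coded 'P'/'M'/'A' as numpy arrays; the function name and the decision to keep 'Marginal' readings are our assumptions.

    import numpy as np

    def pa_filter(expr1, expr2, calls1, calls2, min_present=20):
        # Keep the pair only if each probe set is called 'P' (Present) on at
        # least min_present chips; then drop readings where either partner is
        # called 'A' (Absent). 'M' (Marginal) readings are retained here.
        if (calls1 == "P").sum() < min_present or (calls2 == "P").sum() < min_present:
            return None  # pair excluded from the filtered analysis
        keep = (calls1 != "A") & (calls2 != "A")
        return expr1[keep], expr2[keep]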

This PA filtering followed by correlation analysis generated two notable results. First, we found surprisingly little correspondence in Present-versus-Absent calls between redundant probe sets (Supplementary File 4). Second, we found that eliminating probe set readings called "Absent" by MAS5.0 removed many of the genes found to be poorly correlated in the first (no PA filtering) analysis (Figure 1B).



Figure 1. Correlation (r) computed using RMA expression values before (A) and after (B) PA filtering.

We next explored the possibility that, for some of the poorly correlated probe sets, the annotated structure of the target gene may have inappropriately merged adjacent or overlapping genes into a single gene model. If this were true, then we might observe a negative correlation between putative transcript size and expression correspondence between redundant probe sets, since inappropriately merged gene models would likely be unusually large.

To test this, we computed Pearson's correlation coefficient comparing average transcript size per gene (log scale) and R-squared from the linear regression, which is the percentage of variation in one probe set that can be explained by variation in the other. (Note that transcript sizes are approximately log-normally distributed; see Supplemental File 2.) We found that there was indeed a weak negative correlation (r = -0.28) between average transcript size per gene and R-squared, suggesting that some fraction of the genes with non-correlated, redundant probe sets might represent gene models that should be split.

Many genes currently included in the Arabidopsis version 6 annotations are based originally on the results of computational analysis and manual curation. For many of these gene models, additional evidence is needed before they can be accepted as accurate. Currently, the gold standard for assessing the correctness of a gene model is the existence of one or more full-length or partial cDNA sequences that cover the gene region in question. Using the Integrated Genome Browser to visualize probe sets and the TAIR Sequence Viewer to visualize gene structures and cDNA alignments, we investigated cDNA support for genes whose redundant probe sets generated non-correlated expression values.

Of nine genes with non-correlated redundant probe sets (r < 0.3), only one was supported by cDNA evidence covering the entire gene model. However, visualization with the Integrated Genome Browser revealed that the probe sets associated with this gene (AT5G04440) appear to interrogate opposite strands of the chromosome, which explains why expression readings from these two probe sets were uncorrelated. No similarly trivial explanation accounts for the lack of correspondence (r = 0.09) between the two redundant probe sets interrogating AT4G12640, however. This lack of correspondence suggests that gene model AT4G12640 represents two independent transcriptional units.


Figure 2. A. Alignment of ATH1 redundant probe sets to Arabidopsis chromosome 4, alongside gene model AT5G04440. Probes appear as vertical bars above rectangles representing the genomic alignment of probe set design sequences, from which the probe sequences were selected. B. Scatter diagram showing expression readings from the probe sets in A. C. TAIR Sequence Viewer showing lack of full cDNA support for this gene model.


4. DISCUSSION & CONCLUSIONS

We found that large-scale analysis of redundant probe sets reveals a surprising lack of correspondence of expression values between probe sets annotated as interrogating the same gene. Some discordance between redundant probe sets may arise from differential regulation of alternative splicing or polyadenylation. In many cases, however, it is more likely to result from incorrect gene models. We suggest that this lack of correspondence can be used to improve annotation, first as a means of checking probe set to gene mappings (as with AT5G04440) and second as a way to flag gene models that require further validation through cDNA sequencing or other means.

Acknowledgments

We thank Tapan Mehta and Vinodh Srinivasasainagendra for superb programming support. Funds from NSF grant 0217651 (PI David Allison) supported this work.

References

1. Craigon DJ, James N, Okyere J, Higgins J, Jotham J, May S. NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res 2004; 32: D575-577.

2. Redman JC, Haas BJ, Tanimoto G, Town CD. Development and evaluation of an Arabidopsis whole genome Affymetrix probe array. Plant J 2004;38:545-561.

3. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 2003; 31: 224-228.

4. Persson S, Wei H, Milne J, Page GP, Somerville CR. Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proc Natl Acad Sci USA 2005; 102: 8633-8638.

5. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003; 4: 249-264.

6. Hubbell E, Liu WM, Mei R. Robust estimators for expression analysis. Bioinformatics 2002; 18: 1585-1592.

7. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004; 5: R80.

8. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004; 101: 6062-6067.

9. Jongeneel CV, Delorenzi M, Iseli C, Zhou D, Haudenschild CD, Khrebtukova I, Kuznetsov D, Stevenson BJ, Strausberg RL, Simpson AJ, Vasicek TJ. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res 2005; 15: 1007-1014.


DISTANCE-BASED IDENTIFICATION OF STRUCTURE MOTIFS IN PROTEINS USING CONSTRAINED FREQUENT SUBGRAPH MINING

Jun Huan, Deepak Bandyopadhyay, Jan Prins, Jack Snoeyink¹, Alexander Tropsha², Wei Wang¹

¹ Computer Science Department
² The Laboratory for Molecular Modeling, School of Pharmacy

University of North Carolina at Chapel Hill
Email: {huan, debug, prins, snoeyink, weiwang}@cs.unc.edu, [email protected]

Structure motifs are amino acid packing patterns that occur frequently within a set of protein structures. We define a labeled graph representation of protein structure in which vertices correspond to amino acid residues and edges connect pairs of residues, labeled by (1) the Euclidean distance between the Cα atoms of the two residues and (2) a boolean indicating whether the two residues are in physical/chemical contact. Using this representation, a structure motif corresponds to a labeled clique that occurs frequently among the graphs representing the protein structures. The pairwise distance constraints on each edge of a clique serve to limit the variation in geometry among different occurrences of a structure motif. We present an efficient constrained subgraph mining algorithm to discover structure motifs in this setting. Compared with contact graph representations, the number of spurious structure motifs is greatly reduced.

Using this algorithm, structure motifs were located for several SCOP families including the Eukaryotic Serine Proteases, Nuclear Binding Domains, Papain-like Cysteine Proteases, and FAD/NAD-linked Reductases. For each family, we typically obtain a handful of motifs within seconds of processing time. The occurrences of these motifs throughout the PDB were strongly associated with the original SCOP family, as measured using a hyper-geometric distribution. The motifs were found to cover functionally important sites like the catalytic triad for Serine Proteases and co-factor binding sites for Nuclear Binding Domains. The fact that many motifs are highly family-specific can be used to classify new proteins or to provide functional annotation in Structural Genomics Projects.

Keywords: protein structure comparison, structure motif, graph mining, clique

1. INTRODUCTION

This paper studies the following structural comparison problem: given a set 𝒢 of three-dimensional (3D) protein structures, identify all structure motifs that occur with sufficient frequency among the proteins in 𝒢. Our study is motivated by the large number (> 35,000) of 3D protein structures stored in public repositories such as the Protein Data Bank (PDB)⁴. The recent Structural Genomics projects²⁸ aim to generate many new protein structures in a high-throughput fashion, which may further increase the number of available protein structures significantly. With fast-growing structure data, automatic and effective knowledge discovery tools are needed to gain insight from the available structures, in order to generate testable hypotheses about the functional roles of proteins and the evolutionary relationships among them.

Our study is also motivated by the complex relationship between protein structure and protein function⁸. It is well known that global structure similarity does not necessarily imply similar function. For example, the TIM barrels are a large group of proteins with remarkably similar global structures yet widely varying functions. Conversely, similar function does not necessarily imply similar global structure: the most versatile enzymes, the hydro-lyases and the O-glycosyl glucosidases, are associated with 7 different global structural families¹¹. Many globally dissimilar structures show convergent evolution of biological function. Because of this puzzling relationship between global protein structure and function, recent research effort in protein structure comparison has shifted to identifying local structural features (referred to as structure motifs) responsible for biological functions, including protein-protein interaction, ligand binding, and catalysis³,⁵,³⁰⁻³³. A recent review of methods and applications in protein structure motif identification can be found in ref. 19.

Using a graph representation of proteins, we formalize the structure motif identification problem as a frequent clique mining problem in a set of graphs 𝒢 and present a novel constrained clique mining algorithm to obtain recurring cliques from 𝒢 that satisfy certain additional constraints. The constraints are encoded in the graph representation of protein structure as pairwise amino acid residue distances, pairwise amino acid residue interactions, and the physical/chemical properties of the amino acid residues and their interactions in a protein structure.

Compared to other methods, our method offers the following advantages. First, our method is efficient. It usually takes only a few seconds to process a group of proteins of moderate size (ca. 30 proteins), which makes it suitable for processing protein families defined by various classifications such as SCOP or EC (Enzyme Commission). Second, our results are specific. As we show in our experimental study section, by requiring structure motifs to recur among a group of proteins, rather than in just two proteins, we significantly reduce spurious patterns without losing structure motifs that have clear biological relevance. With a quantitative definition of significance based on the hyper-geometric distribution, we find that the structure motifs we identify are specifically associated with the original family. This association may significantly improve the accuracy of feature-based functional annotation of structures from structural genomics projects.

The rest of this paper is organized as follows. In Section 1.1, we review recent progress in discovering protein structure motifs. In Section 2, we review definitions related to graphs and introduce the constrained subgraph mining problem. In Section 3, we discuss our graph representation of protein structures. In Section 4, we present a detailed description of our method; we also include a practical implementation of the algorithm that supports the experimental study in Section 5. Finally, Section 6 concludes with a brief discussion of future work.

1.1. Related work

There is an extensive body of literature on comparing and classifying proteins using multiple sequence or structure alignment, such as VAST⁹ and DALI¹². Here we focus on recent algorithmic techniques for discovering structure motifs from protein structures. The methods can be classified into the following five types:

• Depth-first search, starting from simple geometric patterns such as triangles and progressively finding larger patterns⁵,²⁵,³⁰.
• Geometric hashing, originally developed in computer vision, applied pairwise between protein structures to identify structure motifs³,²⁴,³⁵.
• String pattern matching methods that encode the local structure and sequence information of a protein as a string and apply string search algorithms to derive motifs¹⁷,¹⁸,³².
• Delaunay tessellation (DT)⁶,²⁰,³³, partitioning the structure into an aggregate of non-overlapping, irregular tetrahedra and thus identifying all unique nearest-neighbor residue quadruplets for any protein³³.
• Graph matching methods comparing protein structures modeled as graphs and discovering structure motifs by finding recurring subgraphs¹,¹⁰,¹⁴,²²,²⁹,³¹,³⁸.

Geometric hashing²¹ and graph matching³⁸ methods have been extended for inferring recurring structure motifs from multiple structures, but both methods have running time exponential in the number of structures in a data set.

2. CONSTRAINED FREQUENT CLIQUE MINING

2.1. Labeled graphs

We define a labeled graph G as a four-element tuple G = (V, E, Σ, λ), where V is a set of vertices or nodes, E ⊆ V × V is a set of undirected edges, Σ is a set of (disjoint) vertex and edge labels, and λ: V ∪ E → Σ is a function that assigns labels to vertices and edges. We assume that a total ordering is defined on the labels in Σ.



Fig. 1. Database 𝒢 of three labeled graphs. The mapping (isomorphism) q1 → p2, q2 → p4, and q3 → p3 demonstrates that clique Q is isomorphic to a subgraph of P, and so we say that Q occurs in P. The set {p2, p3, p4} is an embedding of Q in P. Similarly, graph S (not a clique) occurs in both graph P and graph Q.

G′ = (V′, E′) is a subgraph of G, denoted G′ ⊆ G, if V′ ⊆ V and E′ ⊆ (E ∩ (V′ × V′)), i.e. E′ is a subset of the edges of G that join vertices in V′.

Page 246: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

229

2.2. Constraints on structure motifs

A constraint in our discussion is a function that assigns a boolean value to a subgraph, where true implies that the subgraph has some desired property and false indicates otherwise. For example, the statement "each amino acid residue in a structure motif must have a solvent accessible surface of sufficient size" is a constraint; it selects only those structure motifs that are close to the surface of proteins. The task of formulating the right constraint(s) is left to domain experts. As part of our computational concern, we answer the following two questions: (1) what types of constraints can be efficiently incorporated into a subgraph mining algorithm, and (2) how to incorporate a constraint if it can be efficiently incorporated. The answers to these two questions are the major contribution of this paper and are discussed in detail in Section 4.

2.3. Graph matching

A fundamental part of our constrained subgraph mining method is finding an occurrence of a graph H within another graph G. To make this precise, we say that graph H occurs in G if we can find an isomorphism between H = (V_H, E_H, Σ, λ_H) and some subgraph of G = (V_G, E_G, Σ, λ_G). An isomorphism from H to the subgraph of G defined by vertices V ⊆ V_G is a 1-1 mapping between vertices f: V_H → V that preserves edges and edge/node labels. The set V is an embedding of H in G. This definition is illustrated in Figure 1.

In this paper, we restrict ourselves to matching cliques, i.e. fully connected subgraphs. For example, the graph Q in Figure 1 is a clique, since each pair of (distinct) nodes is connected by an edge in Q, while S is not. In protein structure graphs, a clique corresponds to a structure motif with all pairwise inter-residue distances specified.

2.4. The constrained frequent clique mining problem

Given a set of graphs, or a graph database 𝒢, we define the support of a clique C, denoted s(C), as the fraction of graphs in 𝒢 in which C occurs. We choose a support threshold 0 < σ ≤ 1 and define C to be frequent if it occurs in at least a fraction σ of the graphs in 𝒢. Note that while C may occur many times within a single graph, for the purpose of measuring support these count as only one occurrence. Given a constraint p, the problem of Constrained Frequent Clique Mining is to identify all frequent cliques C in a graph database 𝒢 such that p(C) is true. Figure 2 shows all cliques (without any constraint) that appear in at least two graphs of the database shown in Figure 1. If we use support threshold σ = 2/3 without any constraint, all six cliques are reported to users. If we increase σ to 3/3, only cliques A1, A2, A3, and A4 are reported. If we use support threshold σ = 2/3 and the constraint that each clique should contain at least one node with label "a", the constrained frequent cliques are A1, A3, A4, and A6.


Fig. 2. All (non-empty) frequent cliques with support ≥ 2/3 in the database 𝒢 of Figure 1. The actual support values are (3/3, 3/3, 3/3, 3/3, 2/3, 2/3) for cliques A1 to A6.
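For concreteness, here is a sketch of the support computation using networkx, with node and edge labels stored under a hypothetical attribute key "l"; because the patterns are cliques (complete graphs), the induced subgraph isomorphism test below coincides with the occurrence test defined above.

    import networkx as nx
    from networkx.algorithms import isomorphism

    def support(C, database):
        # Fraction of graphs in `database` containing an occurrence of the
        # labeled clique C; multiple embeddings in one graph count once.
        same = lambda a, b: a["l"] == b["l"]  # compare node/edge labels
        hits = 0
        for G in database:
            gm = isomorphism.GraphMatcher(G, C, node_match=same, edge_match=same)
            if gm.subgraph_is_isomorphic():  # induced matching; C is complete
                hits += 1
        return hits / len(database)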

3. HYBRID GRAPH REPRESENTATION OF PROTEIN STRUCTURES

3.1. Graph representation overview

We model a protein structure as a labeled undirected graph by incorporating pairwise amino acid residue distances and the contact relation in the following way. The nodes of our protein graphs represent the Cα atoms of the amino acid residues. We create edges connecting each and every pair of (distinct) residues, labeled by two types of information: (1) the Euclidean distance between the two related Cα atoms and (2) a boolean indicating whether the two residues have physical/chemical contact. More precisely, a protein in our study is a labeled graph P = (V, E, Σ, λ), where (a small construction sketch follows the list):

• V is a set of nodes that represents the set of amino acid residues in the protein;
• E = (V × V) − {(u, u) | u ∈ V};
• Σ = Σ_V ∪ Σ_E is the set of disjoint node labels (Σ_V) and edge labels (Σ_E);
• Σ_V is the set of 20 amino acid types;
• Σ_E = ℝ⁺ × {true, false}, where ℝ⁺ is the set of positive real numbers;
• λ assigns labels to nodes and edges.
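A minimal construction sketch under simplifying assumptions: the contact relation is passed in precomputed (the paper derives it from almost-Delaunay edges, which we do not implement here), and edges beyond a distance cutoff are dropped, anticipating the pruning discussed below.

    import numpy as np
    import networkx as nx

    def protein_graph(ca_coords, residue_types, contact_pairs, max_dist=13.0):
        # One node per residue, labeled by amino acid type; an edge between
        # every residue pair within max_dist, labeled by the Euclidean
        # C-alpha distance and a boolean contact flag.
        G = nx.Graph()
        for i, aa in enumerate(residue_types):
            G.add_node(i, l=aa)
        n = len(residue_types)
        for i in range(n):
            for j in range(i + 1, n):
                d = float(np.linalg.norm(ca_coords[i] - ca_coords[j]))
                if d <= max_dist:
                    G.add_edge(i, j, dist=d, contact=((i, j) in contact_pairs))
        return G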

Our graph representation can be viewed as a hybrid of two popular representations of proteins: the distance matrix representation⁸ and the contact map representation¹³.

In practice we are not concerned with interactions over long distances (say > 13 Å), so proteins need not be represented by complete graphs. Since each amino acid occupies a real volume, the number of edges per vertex in the graph representation can be bounded by a small constant.

The graph representation presented here is similar to those used by other groups²⁵,³⁸. The major difference is that in our representation, geometric constraints take the form of pairwise distance constraints and are embedded into the graph representation to model geometrically conserved patterns; the absence of such geometric constraints can lead to many spurious matches, as noted in refs. 25 and 38. Another difference is that we explicitly specify the "contact" relation, which enables us to incorporate various constraints into the subsequent graph mining process and further reduce irrelevant patterns.

In the following, we discuss how to discretize distances into distance bins, which is important for our structure motif identification algorithm.

l < 4           →  bin 1
4 ≤ l < 5.5     →  bin 2
5.5 ≤ l < 7     →  bin 3
7 ≤ l < 8.5     →  bin 4
8.5 ≤ l < 10    →  bin 5
10 ≤ l < 11.5   →  bin 6
11.5 ≤ l        →  bin 7

Fig. 3. Mapping distances l to bins. The unit is Å.

3.2. Distance discretization

To map continuous distances to discrete values, we discretize distances into bins. The width of such bins is commonly referred to as the distance tolerance; popular choices are 1 Å²², 1.5 Å⁵, and 2 Å²⁶. In our system, we choose the median value, 1.5 Å, as shown in Figure 3, which empirically delivers patterns with good geometric conservation.
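The bin map of Fig. 3 as a small function; treating each interval as closed on the left is our reading of the figure.

    import bisect

    # bin boundaries from Fig. 3 (angstroms); bins are numbered 1..7
    _EDGES = [4.0, 5.5, 7.0, 8.5, 10.0, 11.5]

    def distance_bin(l):
        # Map a continuous distance to one of the seven discrete bins.
        return bisect.bisect_right(_EDGES, l) + 1

    assert distance_bin(3.9) == 1 and distance_bin(9.0) == 5 and distance_bin(12.0) == 7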

4. THE CONSTRAINED CLIQUE MINING ALGORITHM

In this section, we present a detailed discussion of (1) what types of constraints can be incorporated efficiently into a subgraph mining algorithm and (2) how to incorporate them.

Our strategy relies on designing graph normalization functions that map cliques to one-dimensional sequences of labels. A graph normalization function is a 1-1 mapping N such that N(G) = N(G′) if and only if G = G′. In other words, a graph normalization function always assigns a unique string to each unique graph. The string N(G) is the canonical code (code for short) of the graph G with respect to the function N.

Many graph normalization functions have a very desirable property: prefix-preservation. A graph normalization function is prefix-preserving if for every graph G there always exists a subgraph G′ ⊂ G such that the code of G′ is a prefix of the code of G. Examples of prefix-preserving graph normalization functions include the DFS code³⁹ and the CAM code¹⁵. As we prove in Theorem 4.8, with a generic depth-first search algorithm, a prefix-preserving graph normalization function guarantees that no frequent constrained pattern can be missed. The design challenge is to construct a graph normalization function that is prefix-preserving in the presence of constraints.

4.1. A synthetic example of constraints

The following constraint is our driving example for constrained clique mining. The constraint states that we should only report frequent cliques that contain at least one edge with label "y". The symbol "y" is selected to make the constraint work well with the graph example shown in Figure 1. Applying this constraint to the frequent cliques shown in Figure 2, we find that only three cliques satisfy the constraint, namely A3, A5, and A6. We name this simple constraint an edge label constraint and show a specific graph normalization function that is prefix-preserving for it. Before we do that, we introduce a normalization that does not support any constraints. Our final solution will adapt this constraint-unaware graph normalization function.

4.2. A graph normalization function that does not support constraints

We use our previous canonical code 15 for graph normalization, outlined below for completeness.

Given an n × n adjacency matrix M of a graph G with n nodes, we define the code of M, denoted code(M), as the sequence of lower-triangular entries of M (including the node labels as diagonal entries) in the order M_{1,1} M_{2,1} M_{2,2} M_{3,1} M_{3,2} M_{3,3} … M_{n,1} … M_{n,n−1} M_{n,n}, where M_{i,j} represents the entry at the ith row and jth column of M. Since edges are undirected, we are concerned only with the lower-triangular entries of M. Figure 4 shows examples of adjacency matrices and codes for the labeled graph Q shown in the same figure.

The lexicographic order of sequences defines a total order over adjacency matrix codes. Given a graph G, its canonical code, denoted F(G), is the maximal code among all its possible codes. For example, code(M1) = "bybyxa", shown in Figure 4, is the canonical code of the graph.
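A brute-force sketch of code(M) and F(G), feasible only for cliques as small as those in these examples; the (labels, edges) dictionary encoding is a hypothetical one of our choosing.

    from itertools import permutations

    def code(order, labels, edges):
        # Adjacency-matrix code for one vertex ordering: for each row, the
        # lower-triangular edge labels followed by the diagonal node label.
        out = []
        for i, u in enumerate(order):
            for j in range(i):
                out.append(edges.get(frozenset((order[j], u)), "0"))
            out.append(labels[u])
        return "".join(out)

    def canonical_code(labels, edges):
        # F(G): the lexicographically maximal code over all vertex orderings.
        return max(code(p, labels, edges) for p in permutations(labels))

    # the clique Q of Fig. 4: nodes b, b, a; edges y, y, x
    Q_labels = {1: "b", 2: "b", 3: "a"}
    Q_edges = {frozenset((1, 2)): "y", frozenset((1, 3)): "y", frozenset((2, 3)): "x"}
    print(canonical_code(Q_labels, Q_edges))  # "bybyxa", since y > x > b > a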


Fig. 4. All possible adjacency matrices for the clique Q. Since adjacency matrices of undirected graphs are symmetric, only the lower-triangular part of each matrix is shown. Using the lexicographic order of symbols y > x > d > c > b > a > 0, we have code(M1) = "bybyxa" > code(M2) = "bybxya" > code(M3) = "aybxyb" > code(M4) = "axbyyb". Hence code(M1) is the canonical code for the graph Q.

4.3. A graph normalization function that supports constraints

In this section, we introduce the definition of a generic graph normalization function ψ. In the following two sections, we show applications of ψ: in Section 4.4, we show that several widely used constraints for protein structure motifs lead to a well-defined function ψ, and in Section 4.5, we show that ψ can be used in a depth-first constrained clique search procedure to ensure that each constrained frequent clique is searched once and exactly once.

Definition 4.1. Given a graph normalization function F with codomain Γ and a constraint p, the generic graph normalization function ψ maps a graph G to Γ* recursively as:

    ψ(G) = F(G)                                       if p(G) is false,
    ψ(G) = F(G)                                       if |V(G)| = 1,
    ψ(G) = max_{G′ ⊂ G, p(G′)} [ ψ(G′) $ F(G) ]       otherwise,

where G′ ranges over subgraphs of G whose size is one less than that of G, and ψ(G′)$F(G) is the concatenation of the code ψ(G′), the symbol $, and the code F(G). We assume that the symbol $ is not in the set Γ and use it to separate the parts of the generic graph normalization. The total ordering on strings (max) is the lexicographic order, with the underlying symbol ordering assumed from Γ.

Example 4.1. Applying the generic graph normalization to the clique Q shown in Figure 4, with F the canonical code discussed in Section 4.2, ψ(Q) is b$byb$bybyxa. The string bybyxa is a suffix of the code since it is the canonical code of Q. In a search for ψ(Q), two subgraphs of Q that satisfy the edge label constraint are searched: one is a single edge connecting nodes labeled "b" and "b" with edge label "y", and the other is a single edge connecting nodes labeled "a" and "b" with edge label "y". Since the canonical code of the first ("byb") is greater than that of the second ("bya"), we put the string "byb" before the string "bybyxa", obtaining "byb$bybyxa". Finally, at the last step of the recursive definition, we add a single "b" before "byb$bybyxa", giving ψ(Q) = b$byb$bybyxa.
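A recursive sketch of Definition 4.1, reusing canonical_code and the clique Q from the previous sketch. One liberty is taken to keep the recursion total: when no one-vertex-smaller subgraph satisfies p (e.g. below a single constrained edge), we fall back to all such subgraphs, which reproduces the terminal "b" of Example 4.1.

    def one_smaller(G):
        # All node-induced subgraphs of G with exactly one vertex removed.
        labels, edges = G
        for v in labels:
            yield ({u: l for u, l in labels.items() if u != v},
                   {e: l for e, l in edges.items() if v not in e})

    def psi(G, p, F):
        # Definition 4.1 (sketch): F(G) if p(G) fails or |V(G)| = 1; else the
        # maximal psi over one-smaller subgraphs, then '$', then F(G).
        labels, edges = G
        if not p(G) or len(labels) == 1:
            return F(G)
        opts = [H for H in one_smaller(G) if p(H)]
        if not opts:                    # assumed fallback; see lead-in above
            opts = list(one_smaller(G))
        return max(psi(H, p, F) for H in opts) + "$" + F(G)

    has_y = lambda G: "y" in G[1].values()   # the edge label constraint
    F = lambda G: canonical_code(*G)         # from the previous sketch
    print(psi((Q_labels, Q_edges), has_y, F))  # b$byb$bybyxa, as in Example 4.1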

Theorem 4.1. ψ(G) exists for every graph G under the edge label constraint.

Proof. First assume that a graph G contains an edge labeled "y". We claim that there always exists a subgraph G′ of G that also contains the same label. This observation suggests that we can always find at least one G′ in the recursive definition, and hence ψ(G) is defined. If the original G does not contain any edge labeled "y", its code is F(G), which is also defined. □

Theorem 4.2. ψ is a 1-1 mapping and thus a graph normalization function.

Proof. If two graphs are isomorphic, they must yield the same string, by the definition of ψ. To prove that two graphs with the same canonical string are isomorphic, notice that the last element of a label sequence produced by ψ is F(G), where F is a graph normalization function. Therefore two identical sequences must imply the same graph, as guaranteed by F. □

Theorem 4.3. For all G such that p(G) is true, there exists a subgraph G′ ⊂ G of size one less than G such that p(G′) is true and ψ(G′) is a prefix of ψ(G).

Proof. This property is a direct result of the recursive Definition 4.1. □

We notice that in proving Theorems 4.2 and 4.3 we did not use the definition of the constraint p. In other words, Theorems 4.2 and 4.3 hold as long as we have Theorem 4.1. Therefore, we have the following theorem:

Theorem 4.4. If ψ is defined for every graph with respect to a given constraint p, then ψ is 1-1 and prefix-preserving.

Proof. This is a direct result of the recursive Definition 4.1. □

4.4. More examples related to protein structure motifs

Let us first consider a real-world example of a constraint that is widely used in structure motif discovery. The connected component constraint (CC constraint for short) asserts that in a structure motif, each amino acid residue is connected to at least one other amino acid residue by a contact relation, and that the motif is a connected component with respect to the contact relation. The intuition behind the CC constraint is that a structure motif should be compact and hence have no isolated amino acid residue. Formally, the CC constraint is a function cc that assigns the value true to a graph if it is a connected component according to the contact relation and false otherwise.

As another example, the contact density constraint asserts that the ratio of the number of contacts to the total number of edges in a structure motif should be greater than a predefined threshold. This ratio is referred to as the contact density of the motif, and the constraint is referred to as the density constraint. The intuition behind the density constraint is again that a structure motif should be compact, with amino acid residues that interact well with each other; it may be viewed as a stricter version of the CC constraint, which only requires a motif to be a connected component. Formally, the density constraint is a function d that assigns the value true to a graph if its contact density is at least the predefined threshold and false otherwise.

It would be awkward if we needed to define a new graph normalization procedure for each of the constraints discussed above. Fortunately, this is not the case: in the following, we show that the generic graph normalization function ψ is well defined for both constraints.

Theorem 4.5. ψ(G) exists for every graph G with respect to the CC constraint or the density constraint.

Proof. We show the proof of the theorem for the CC constraint; the proof for the density constraint is similar. The key observation is that for every graph G of size n that is a connected component with respect to the node contact relation, there exists a subgraph G′ ⊂ G that is also a connected component according to the same contact relation. The observation is a well-known result from graph theory, and a detailed proof can be found in ref. 15. □

Following Theorem 4.4, we have the following theorem.

Theorem 4.6. ψ is a 1-1 mapping and prefix-preserving for the CC constraint and for the density constraint.

After working through several example constraints, we study the sufficient and necessary condition for our graph normalization function ψ to be well defined for a constraint p. The following theorem formalizes the answer.

Theorem 4.7. Given a constraint p, ψ(G) exists for every graph G with respect to p if and only if for each graph G of size n such that p(G) is true, there exists a subgraph G′ ⊂ G of size n − 1 such that p(G′) is also true.

Proof. (If) For a graph G such that p(G) is true, if there exists a G′ ⊂ G such that p(G′) is also true, then by the definition of ψ, ψ(G) exists.

(Only if) If ψ(G) exists for every graph G with respect to a constraint p, then for a graph G such that p(G) is true, the definition of ψ guarantees at least one G′ ⊂ G such that p(G′) is also true. □

4.5. The CliqueHashing algorithm

We have designed an efficient algorithm for identifying frequent cliques from a labeled graph database under constraints, as described below. At the beginning of the algorithm, we scan the graph database and find all frequent node types (lines 1–4, Figure 5). The node types and their occurrences are kept in a hash table counter. At each subsequent step, a frequent clique of size n ≥ 1 is picked from the hash table and extended to all possible cliques of size n + 1 by attaching one additional node to its occurrences in all possible ways. The newly discovered cliques and their occurrences are again indexed in a separate hash table and enumerated recursively. The algorithm backtracks to the parents of a clique if no further extension from the clique is possible, and stops when all frequent node types have been enumerated. We illustrate the CliqueHashing algorithm, with the edge label constraint, in Figure 6.

CliqueHashing(𝒢, σ, p)
begin
1.  for each node label t ∈ λ(v), v ∈ V[G], G ∈ 𝒢 do
2.    counter[t] ← counter[t] ∪ {v}
3.    C ← C ∪ {t}
4.  end for
5.  for each t ∈ C do
6.    if s(t) ≥ σ and p(t) is true then
7.      T ← T ∪ backtrack_search(t, counter[t])
8.    end if
9.  end for
10. return T
end

backtrack_search(t₀, O)
begin
1.  for each clique h ∈ O do
2.    O′ ← {f | f = h ∪ {v}, h ⊆ V[G], v ∈ V[G] − h}
3.    for each occurrence of a clique f ∈ O′ do
4.      t ← ψ(f)
5.      counter[t] ← counter[t] ∪ {f}
6.      C ← C ∪ {t}
7.    end for
8.  end for
9.  for each t ∈ C do
10.   if s(t) ≥ σ and t₀ ⊑ t and p(t) is true then
11.     T ← T ∪ backtrack_search(t, counter[t])
12.   end if
13. end for
14. return T
end

Fig. 5. The CliqueHashing algorithm, which reports the set T of frequent cliques from a group of graphs 𝒢 with support at least σ, subject to a constraint p. ψ is the graph normalization function defined in Definition 4.1; x ⊑ y means that string x is a prefix of string y; s(G) is the support of a graph G.


Theorem 4.8. If ψ is well defined for all possible graphs under the constraint p, the CliqueHashing algorithm identifies every frequent constrained clique in a graph database exactly once.

Proof. The prefix-preserving property of Definition 4.1 implies that at least one subclique of a frequent clique will pass the IF statement on line 10 of the backtrack_search procedure of CliqueHashing. Therefore the algorithm will not miss any frequent clique in the presence of a constraint p.

That the algorithm discovers every constrained frequent clique exactly once may not be obvious at first glance. The key observation is that for a clique G of size n, there is only one subclique of size n − 1 whose code matches a prefix of ψ(G). Given this observation, line 10 of the backtrack_search procedure guarantees that each constrained frequent clique is discovered exactly once.

To prove the observation, assume to the contrary that there are at least two such subcliques of the same size, both of whose codes are prefixes of ψ(G). One of the two codes must then be a prefix of the other (by the definition of prefix), which implies that one of the two subcliques is a subclique of the other (by the definition of ψ). This contradicts the assumption that the two subcliques have the same size. □

"d"

"c" "b-

"a"

W iP5) {p2}

(Pal

(q,)

{q3}

{s,}

(s3)

(q2)

(Sp)

"byb"

"bya"

(P2. P3)

(qi. q3)

{Pz.Pi)

{q,.q2}

{s,,s2)

"bybyxa" {P2. Ps. Pi)

h i . q3 . q2)

steps step2

stepl

Fig. 6. The contents of the hash table counter after applying the CliqueHashing algorithm to the data set shown in Figure 1 with the edge label constraint.



Fig. 7. A graph database of three graphs with multiple labels.

"d"

"c"

"b"

"a"

(PJ (Ps)

(s4l

<Pj>

(P,l

{q,(

M (s,)

{ s j

{p,l

<q?}

{s2}

"byb"

"bya"

(q,.q2)

(q3.q2) {s,,s2}

{s,,s,}

"bybyxa" (P2. P3. Pi)

fq t. q3. q2)

step3

step2

stepl

Fig. 8. The contents of the hash table counter after applying the CliqueHashing algorithm to the multi-labeled data set shown in Figure 7.

4.6. CliqueHashing on multi-labeled graphs

A multi-labeled graph is a graph in which two or more labels may be associated with a single edge. The CliqueHashing algorithm can be applied to multi-labeled graphs directly, without major modification. The key observation is that our enumeration is based on occurrences of cliques (line 3 of backtrack_search). In Figure 7, we show a graph database with three multi-labeled graphs, and in Figure 8, we show (pictorially) how the CliqueHashing algorithm can be applied to graphs with multiple labels.

In the context of structure motif detection, handling multi-labeled graphs is important for the following reasons. First, due to the imprecision of the 3D coordinate data used in motif discovery, we need to tolerate distance variations between different instances of the same motif. Second, partitioning the 1D distance space into distance bins is not a perfect solution, since distance variations cannot be handled well at the boundaries of the bins; in our application, distance bins may lead to a significant number of missed motifs. Using a multi-labeled graph, we can solve the boundary problem by using "overlapping" bins to take care of boundary effects, as sketched below.
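One simple way to realize such overlapping bins (a hypothetical parameterization, not the paper's exact scheme) is to give a distance near a bin boundary the labels of both adjacent bins:

```python
def distance_labels(d, width=1.0, overlap=0.25):
    """Multi-label a distance d (in angstroms): its own bin, plus any
    neighboring bin whose boundary lies within `overlap` of d."""
    base = int(d // width)
    labels = {base}
    if base > 0 and d - base * width < overlap:
        labels.add(base - 1)              # near the lower boundary
    if (base + 1) * width - d < overlap:
        labels.add(base + 1)              # near the upper boundary
    return labels
```

With these labels, two instances of the same motif whose corresponding distances fall on opposite sides of a bin boundary still share an edge label, so the occurrence-based enumeration treats them as the same clique.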

5. EXPERIMENTAL STUDY

5.1. Experimental setup

To exclude redundant structures from our analysis, we used the culled PDB list (http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html) with a sequence similarity cutoff value of 90% (resolution = 3.0, R factor = 1.0). This list contains about one quarter of all protein structures in PDB; the remaining ones are regarded as duplicates of proteins in the list. We study four SCOP families: Eukaryotic Serine Protease (ESP), Papain-like Cysteine Protease (PCP), Nuclear Binding Domains (NB), and FAD/NAD-linked Reductase (FAD). Each protein structure in a SCOP family was converted to its graph representation as outlined in Section 3. The pairwise amino acid residue contacts were obtained by computing the almost-Delaunay edges 2 with ε = 0.1 and with length up to 8.5 Å, as was also done in 14. Structure motifs in a SCOP family were identified using the CliqueHashing algorithm with the CC constraint, which states that "each amino acid residue in a motif should contact at least one other residue, and the motif should be a connected component with respect to the contact relation". Timings of the search algorithm were obtained using the same hardware configuration as in 14.
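For concreteness, a protein graph in the representation assumed by the sketches above can be built as follows (a plain C-alpha distance cutoff is used here as a stand-in for the almost-Delaunay computation, which requires separate machinery):

```python
import math
from itertools import combinations

def contact_graph(residues, cutoff=8.5, bin_width=1.0):
    """residues: list of (one_letter_code, (x, y, z)) C-alpha records.
    Nodes are labeled by residue identity, edges by binned distance."""
    nodes = {i: code for i, (code, _) in enumerate(residues)}
    edges = {}
    for i, j in combinations(range(len(residues)), 2):
        d = math.dist(residues[i][1], residues[j][1])
        if d <= cutoff:                       # contact up to 8.5 angstroms
            edges[frozenset((i, j))] = int(d // bin_width)  # edge label
    return {"nodes": nodes, "edges": edges}
```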

In Table 1, we document the four families, including their SCOP ID, the total number of proteins in the family (N), the support threshold we used to retrieve structure motifs (σ), and the processing time (T, in seconds). In the same table, we also record all the structure motifs identified, giving the motifs' compositions (a sequence of one-letter residue codes), their actual support values (κ), the number of occurrences outside the family in the representative structures in PDB (referred to as the background frequencies hereafter) (δ), and their statistical significance in the family (P). The statistical significance is computed by a hypergeometric distribution, specified in Appendix 7.1. Images of protein structures were produced using VMD 16, and residues in the images were colored by residue identity using default VMD settings.

5.2. Eukaryotic serine protease

The structure motifs identified from the ESP family are documented in the top part of Table 1. The data indicate that the motifs we found are highly specific to the ESP family, as measured by P-values below 10^-82. We investigated the spatial distribution of the residues covered by these motifs.


Table 1. Structure motifs identified from the four SCOP families. For each family the header gives the SCOP ID, the number of member proteins N, the support threshold σ, and the running time T (in seconds); each motif row gives the composition (one-letter residue codes), the support κ, the background frequency δ, and -log(P).

Eukaryotic Serine Protease (ID: 50514), N: 56, σ: 48/56, T: 31.5

Motif  Composition  κ    δ   -log(P)
 1     DHAC         54   13  100
 2     ACGG         52    9  100
 3     DHSC         52   10  100
 4     DHSA         52   10  100
 5     DSAC         52   12  100
 6     DGGG         52   23  100
 7     DHSAC        51    9  100
 8     SAGC         51   11  100
 9     DACG         51   14  100
10     HSAC         51   14  100
11     DHAA         51   18  100
12     DAAC         51   32   99
13     DHAAC        50    5  100
14     DHAC         50    6  100
15     HACA         50    8  100
16     ACGA         50   11  100
17     DSAG         50   16  100
18     SGGC         50   17  100
19     AGAG         50   27   95
20     AGGG         50   58   85
21     ACGAG        49    4  100
22     SCGA         49    6  100
23     DACS         49    7  100
24     DGGS         49    8  100
25     SACG         49   10   98
26     DSGC         49   15   98
27     DASC         49   20   92
28     SAGG         49   31   90
29     DGGL         49   53   83
30     DSAGC        48    9   99
31     DSSC         48   12   97
32     SCSG         48   19   93
33     AGAG         48   19   93
34     SAGG         48   23   88
35     DSGS         48   23   94
36     DAAG         48   27   89
37     DASG         48   32   87
38     GGGG         48   71   76

Papain-like Cysteine Protease (ID: 54002), N: 24, σ: 18/24, T: 18.4

Motif  Composition  κ    δ   -log(P)
 1     HCQS         18    2   34
 2     HCQG         18    3   34
 3     WWGS         18    3   44
 4     WGNS         18    4   44
 5     WGSG         18    5   43

Nuclear Receptor Ligand-Binding Domain (ID: 48509), N: 23, σ: 17/23, T: 15.3

Motif  Composition  κ    δ   -log(P)
 1     FQLL         20   21   43
 2     DLQF         18    7   42
 3     DLQF         17    8   39
 4     LQLL         17   40   31

FAD/NAD-linked Reductase (ID: 51943), N: 20, σ: 15/20, T: 90.0

Motif  Composition  κ    δ   -log(P)
 1     AGGG         17   34   34
 2     AGGA         17   91   27

All residues covered by at least one motif were plotted in the structure of a trypsin (1HJ9), as shown in Figure 9. Interestingly, we found that all these residues are confined to the vicinity of the catalytic triad of 1HJ9, namely HIS57-ASP102-SER195, confirming the known fact that the geometry of the catalytic triad and its spatially adjacent residues is rigid, which is probably responsible for the functional specificity of the enzyme.

Fig. 9. Left: Spatial distribution of residues found in 38 common structure motifs within protein 1HJ9. The residues of the catalytic triad, HIS57-ASP102-SER195, are connected by white dotted lines. Right: Performance comparison of graph mining (GM) and geometric hashing (GH) for structure motif identification.

We found five motifs that occur significantly (P-value < 10^-7) in another SCOP family: Prokaryotic Serine Protease (details not shown). This is not surprising, since prokaryotic and eukaryotic serine proteases are quite similar at both the structural and functional levels, and they share the same SCOP superfamily classification. None of the motifs has a significant presence outside these two families.

Compared to our own previous study, which used a generic subgraph mining algorithm (without constraints and without utilizing pairwise amino acid residue distance information), and to pairwise structural comparisons performed by other groups 1, 10, 22, 31, 29, 38, we report a significant improvement in the "precision" of structure motifs. For example, rather than reporting thousands of motifs for a small data set like serine proteases 38, 14, we report a handful of structure motifs that are highly specific to the serine protease family (as measured by low P-values) and highly specific to the catalytic sites of the proteins (as shown in Figure 9).

To further evaluate our algorithm, we randomly sampled two proteins from the ESP family and searched for common structure motifs. We obtained an average of 2300 motifs per experiment over a total of one thousand runs. Such motifs are characterized by poor statistical significance and are not specific to known functional sites in the ESP family. If we require a structure motif to appear in at least 24 of 31 randomly selected ESP proteins and repeat the same experiment, we obtain an average of 65 motifs per experiment, with improved statistical significance. This experiment demonstrates that comparing a group of proteins improves the quality of the motifs, as observed by 38.

Besides the improved quality of structure motifs, we observe a significant speed-up of our structure motif comparison algorithm compared to other methods such as geometric hashing. In the right part of Figure 9, we show a performance comparison of graph mining (GM) and geometric hashing (GH) 21 (executable downloaded from the companion website) for serine proteases. We notice a general trend: with an increasing number of protein structures, the running time of graph mining decreases (since there are fewer common structure motifs), while the running time of geometric hashing increases. The two techniques have different sets of parameters, which makes any direct comparison of running times difficult; however, the trend is clear that graph mining scales better than geometric hashing for data sets containing a large number of protein structures.

Fig. 10. Left: Residues included in the motifs from the PCP family in protein 1CQD. The residues of the catalytic dyad CYS27-HIS161 are connected by a white dotted line, and two important surrounding residues, ASN181 and SER182, are labeled. Right: Residues included in motifs from the NB family in protein 1OVL. The labeled residue GLN435 has direct interaction with the cofactor of the protein.

5.3. Papain-like cysteine protease and nuclear binding domain

We applied our approach to two additional SCOP families: Papain-Like Cysteine Protease (PCP, ID: 54002) and Nuclear Receptor Ligand-Binding Domain (NB, ID: 48509). The results are documented in the middle part of Table 1.

For the PCP family, we identified five structure motifs, which cover the catalytic CYS-HIS dyad and the nearby residues ASN and SER that are known to interact with the dyad 7, as shown in Figure 10. For the NB family, we identified four motifs (a) which map to the co-factor binding sites 37, shown in the same figure. In addition, members missed by SCOP were identified: 1srv, 1khq, and 1o0e for the PCP family, and six members, 1sj0, 1rkg, 1osh, 1nq7, 1pq9, and 1nr1, for the NB family.

(a) Structure motifs 2 and 3 have the same residue composition, but they have different residue contact patterns and are therefore regarded as two patterns; they do not map to the same set of residues.

Fig. 11. The motif appears in two proteins, 1LVL (belonging to the FAD/NAD-linked reductase family, without Rossmann fold) and 1JAY (belonging to the 6-phosphogluconate dehydrogenase-like, N-terminal domain family, with Rossmann fold), with conserved geometry.

5.4. FAD/NAD binding proteins

In the SCOP database, there are two superfamilies of NADPH binding proteins, the FAD/NAD(P)-binding domains and the NAD(P)-binding Rossmann-fold domains, which share no sequence or fold similarity with each other. This presents a challenging test case for our system: can we find biologically significant patterns across the two groups?

To address this question, we applied our algorithm to the largest family in the SCOP FAD/NAD(P)-binding domain superfamily, the FAD/NAD-linked reductases (SCOP ID: 51943). With support threshold 15/20, we obtained two recurring structure motifs from the family, and both showed strong statistical significance in the NAD(P)-binding Rossmann-fold superfamily, as shown in the bottom part of Table 1.

In Figure 11, we show a motif that is statistically enriched in both families; it has conserved geometry and interacts with the NADPH molecule in two proteins belonging to the two families. Notice that we did not include any information about the NADPH molecule during our search; we identified this motif because of its strong structural conservation among proteins in a SCOP superfamily. The two proteins have only 16% sequence similarity and adopt different folds (DALI z-score 4.5). The result suggests that significant common features can be inferred from proteins with no apparent sequence or fold similarity.



5.5. Random proteins

Our last case study is a control experiment to empirically evaluate the statistical significance of the structure motifs, independent of the P-value definition. To that end, 20 proteins were randomly sampled from the culled PDB list in order to obtain common motifs with support ≥ 15. The parameters 20 and 15 were set to mimic the size of a typical SCOP family. We repeated the experiment a million times and did not find a single recurring structure motif. Limited by the available computational resources, we did not test the system further; however, we are convinced that the chance of observing a random structure motif in our system is rather small.

6. CONCLUSION

We present a method to identify recurring structure motifs in a protein family with high statistical significance. The method was applied to selected SCOP families to demonstrate its applicability to finding biologically significant motifs with statistical significance. In future studies, we will apply this approach to all families in SCOP, as well as to families from other classification systems such as Gene Ontology and Enzyme Classification. The accumulation of all significant motifs characteristic of known protein functional and structural families will aid in annotating protein structures resulting from structural genomics projects.

References

1. Peter J. Artymiuk, Andrew R. Poirrette, Helen M. Grindley, David W. Rice, and Peter Willett. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. Journal of Molecular Biology, 243:327-344, 1994.

2. D. Bandyopadhyay and J. Snoeyink. Almost-Delaunay simplices: Nearest neighbor relations for imprecise points. In ACM-SIAM Symposium on Discrete Algorithms, pages 403-412, 2004.

3. J.A. Barker and J.M. Thornton. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics, 19(13):1644-9, 2003.

4. H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28:235-242, 2000.

5. Philip Bradley, Peter S. Kim, and Bonnie Berger. TRILOGY: Discovery of sequence-structure patterns across diverse proteins. Proceedings of the National Academy of Sciences, 99(13):8500-8505, June 2002.

6. S.A. Cammer, C.W. Carter, and A. Tropsha. Identification of sequence-specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture Notes in Computational Science and Engineering, 24:477-494, 2002.

7. K.H. Choi, R.A. Laursen, and K.N. Allen. The 2.1 angstrom structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 38(36):11624-33, 1999.

8. I. Eidhammer, I. Jonassen, and W.R. Taylor. Protein Bioinformatics: An Algorithmic Approach to Sequence and Structure Analysis. John Wiley & Sons, Ltd, 2004.

9. J.F. Gibrat, T. Madej, and S.H. Bryant. Surprising similarities in structure comparison. Curr Opin Struct Biol, 6(3):683-92, 1996.

10. H.M. Grindley, P.J. Artymiuk, D.W. Rice, and P. Willett. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J. Mol. Biol., 229:707-721, 1993.

11. H. Hegyi and M. Gerstein. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol, 288:147-164, 1999.

12. L. Holm and C. Sander. Mapping the protein universe. Science, 273:595-602, 1996.

13. J. Hu, X. Shen, Y. Shao, C. Bystroff, and M.J. Zaki. Mining protein contact maps. 2nd BIOKDD Workshop on Data Mining in Bioinformatics, 2002.

14. J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. Mining protein family specific residue packing patterns from protein structure graphs. In Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 308-315, 2004.

15. J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pages 549-552, 2003.

16. William Humphrey, Andrew Dalke, and Klaus Schulten. VMD - Visual Molecular Dynamics. Journal of Molecular Graphics, 14:33-38, 1996.

17. I. Jonassen, I. Eidhammer, D. Conklin, and W.R. Taylor. Structure motif discovery and mining the PDB. Bioinformatics, 18:362-367, 2002.

18. I. Jonassen, I. Eidhammer, and W.R. Taylor. Discovery of local packing motifs in protein structures. Proteins, 34:206-219, 1999.

19. Susan Jones and Janet M. Thornton. Searching for functional sites in protein structures. Current Opinion in Chemical Biology, 8:3-7, 2004.

20. Bala Krishnamoorthy and Alexander Tropsha. Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12):1540-48, 2003.

21. N. Leibowitz, Z.Y. Fligelman, R. Nussinov, and H.J. Wolfson. Automated multiple structure alignment and detection of a common substructural motif. Proteins, 43(3):235-45, May 2001.

22. M. Milik, S. Szalma, and K.A. Olszewski. Common structural cliques: a tool for protein structure and function analysis. Protein Eng., 16(8):543-52, 2003.

23. N. Nagano, C.A. Orengo, and J.M. Thornton. One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. Journal of Molecular Biology, 321:741-765, 2002.

24. Ruth Nussinov and Haim J. Wolfson. Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques. PNAS, 88:10495-99, 1991.

25. Robert B. Russell. Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. Journal of Molecular Biology, 279:1211-1227, 1998.

26. S. Schmitt, D. Kuhn, and G. Klebe. A new method to detect related function among proteins independent of sequence and fold homology. J. Mol. Biol., 323(2):387-406, 2002.

27. J.P. Shaffer. Multiple hypothesis testing. Ann. Rev. Psych., pages 561-584, 1995.

28. Jeffrey Skolnick, Jacquelyn S. Fetrow, and Andrzej Kolinski. Structural genomics and its importance for gene function analysis. Nature Biotechnology, 18:283-287, 2000.

29. R.V. Spriggs, P.J. Artymiuk, and P. Willett. Searching for patterns of amino acids in 3D protein structures. J Chem Inf Comput Sci, 43:412-421, 2003.

30. A. Stark and R.B. Russell. Annotation in three dimensions. PINTS: Patterns in non-homologous tertiary structures. Nucleic Acids Res, 31(13):3341-4, 2003.

31. A. Stark, A. Shkumatov, and R.B. Russell. Finding functional sites in structural genomics proteins. Structure (Camb), 12:1405-1412, 2004.

32. William R. Taylor and Inge Jonassen. A method for evaluating structural models using structural patterns. Proteins, July 2004.

33. A. Tropsha, C.W. Carter, S. Cammer, and I.I. Vaisman. Simplicial neighborhood analysis of protein packing (SNAPP): a computational geometry approach to studying proteins. Methods Enzymol, 374:509-544, 2003.

34. J.R. Ullman. An algorithm for subgraph isomorphism. Journal of the Association for Computing Machinery, 23:31-42, 1976.

35. A.C. Wallace, N. Borkakoti, and J.M. Thornton. TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci, 6(11):2308-23, 1997.

36. G. Wang and R.L. Dunbrack. PISCES: a protein sequence culling server. Bioinformatics, 19:1589-1591, 2003. http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html.

37. Z. Wang, G. Benoit, J. Liu, S. Prasad, P. Aarnisalo, X. Liu, H. Xu, N.R. Walker, and T. Perlmann. Structure and function of Nurr1 identifies a class of ligand-independent nuclear receptors. Nature, 423(3):555-60, 2003.

38. P.P. Wangikar, A.V. Tendulkar, S. Ramya, D.N. Mali, and S. Sarawagi. Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol, 326(3):955-78, 2003.

39. X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. International Conference on Data Mining '02, pages 721-724, 2002.

7. APPENDIX

7.1. Statistical significance of structure motifs

Any clique that is frequent in a SCOP family is checked against a data set of 6500 representative proteins from CulledPDB 36, selected from all proteins in the Protein Data Bank. For each clique c, we used Ullman's subgraph isomorphism algorithm 34 to search for its occurrence(s) and recorded the search result in an occurrence vector V = v_1, v_2, ..., v_n, where v_i is 1 if c occurs in the protein p_i, and 0 otherwise. Such cliques are referred to as structure motifs. We determine the statistical significance of a structure motif by computing the related P-value, defined by a hypergeometric distribution 5. There are three parameters in our statistical significance formula: a collection of representative proteins M, which stands for all known structures in PDB; a subset of proteins T ⊆ M in which a structure motif m occurs; and a subset of proteins F ⊆ M which stands for the family for which we would like to establish the statistical significance. The probability of observing a set K = F ∩ T of motif-m-containing proteins with size at least k is given by the following formula:

\text{P-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{|F|}{i}\binom{|M|-|F|}{|T|-i}}{\binom{|M|}{|T|}}    (1)

where |X| denotes the cardinality of a set X. For example, if a motif m occurs in every member of a family F and in no proteins outside F (i.e., K = F = T) for a large family F, we would estimate that this motif is specifically associated with the family; the statistical significance of such a case is measured by a P-value close to zero.

We adopt the Bonferroni correction for multiple independent hypotheses 27: 0.001/|C|, where |C| is the number of categories, is used as the default threshold for judging the significance of the P-value of an individual test. Since the total number of SCOP families is 2327, a good starting point for the P-value upper bound is 10^-7.
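Equation (1) is the upper tail of a hypergeometric distribution, so it can be computed directly with, for example, scipy (a sketch; the paper does not specify an implementation):

```python
from scipy.stats import hypergeom

def motif_pvalue(n_background, n_family, n_with_motif, k):
    """P(|F ∩ T| >= k): population |M| = n_background, |F| = n_family
    family members, |T| = n_with_motif motif-containing proteins;
    sf(k-1, ...) gives the upper tail of Eq. (1)."""
    return hypergeom.sf(k - 1, n_background, n_family, n_with_motif)

# Bonferroni-style significance threshold over the 2327 SCOP families:
alpha = 0.001 / 2327   # about 4.3e-7, consistent with the 10^-7 bound
```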

7.2. Background frequency

Using the culled PDB list (http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html), as discussed in Section 5.1, we obtain around 6000 proteins as the "representative proteins" in PDB. We treat these proteins as a sample from PDB and, for each motif, estimate its background frequency (the number of proteins in which it occurs) using graph matching. Specifically, each sample protein is transformed to its graph representation using the procedure outlined in Section 3, and we use subgraph isomorphism testing to obtain the total number of proteins in which the motif occurs.
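A sketch of this background-frequency count follows; networkx's VF2 matcher is used here as a stand-in for the Ullman subgraph isomorphism algorithm 34 cited above:

```python
from networkx.algorithms import isomorphism

def background_frequency(motif, proteins):
    """Count how many protein graphs contain `motif` as a label-preserving
    subgraph; graphs carry 'label' attributes on nodes and edges."""
    nm = isomorphism.categorical_node_match("label", None)
    em = isomorphism.categorical_edge_match("label", None)
    count = 0
    for g in proteins:
        gm = isomorphism.GraphMatcher(g, motif, node_match=nm, edge_match=em)
        if gm.subgraph_is_isomorphic():  # induced subgraph; exact for cliques
            count += 1
    return count
```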


AN IMPROVED GIBBS SAMPLING METHOD FOR MOTIF DISCOVERY VIA SEQUENCE WEIGHTING

Xin Chen*

School of Physical and Mathematical Sciences

Nanyang Technological University, Singapore

*Email: [email protected]

Tao Jiang

Department of Computer Science and Engineering, University of California at Riverside, USA

Currently visiting at Tsinghua University, Beijing, China

Email: [email protected]

The discovery of motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics, although a large number of computational methods have been proposed in the past decade. Among these methods, the Gibbs sampling strategy has shown great promise and is routinely used for finding regulatory motif elements in the promoter regions of co-expressed genes. In this paper, we present an enhancement to the Gibbs sampling method for the case where expression data of the concerned genes is given. A sequence weighting scheme is proposed that explicitly takes gene expression variation into account in Gibbs sampling. That is, every putative motif element is assigned a weight proportional to the fold change in the expression level of its downstream gene under a single experimental condition, and a position specific scoring matrix (PSSM) is estimated from these weighted putative motif elements. Such an estimated PSSM might represent a more accurate motif model, since motif elements with dramatic fold changes in gene expression are more likely to represent true motifs. This weighted Gibbs sampling method has been implemented and successfully tested on both simulated and biological sequence data. Our experimental results demonstrate that the use of sequence weighting has a profound impact on the performance of a Gibbs motif sampling algorithm.

1. INTRODUCTION

Discovering motifs in DNA sequences remains a fundamental and challenging problem in computational molecular biology and regulatory genomics 19, although a large number of computational methods have been proposed in the past decade. The motif finding problem can be simply formalized as the problem of looking for short segments that are over-represented among a set of long DNA sequences. Previously proposed methods for finding motifs broadly fall into two categories: (a) deterministic combinatorial approaches based on word statistics 15, 14, 2, 6, and (b) probabilistic approaches based on local multiple sequence alignment 10, 1, 8, 11. Typical methods in the first category search the promoter sequences of co-regulated genes exhaustively for motifs of various sizes and then evaluate their significance by a statistical method, whereas methods in the second category rely on local search techniques such as expectation maximization and Gibbs sampling. The latter methods also usually represent a motif as a position specific scoring matrix (PSSM), which is also commonly referred to as a position weight matrix.

Gibbs sampling has shown to be a very promising strategy for motif discovery. The original implementation of Gibbs sampling was done in the site sampling mode, which assumes that there is exactly one motif element (notably, a transcription factor binding site) located in each (promoter) sequence. Since its first application to finding conserved DNA motifs in the early 90's 10, quite a few improvements have been made in the literature to improve its effectiveness. These improvements include: (a) motif sampling, allowing zero or multiple motif elements in each sequence 13; (b) incorporation of a higher-order Markov background model 18, 11; (c) column sampling, allowing gaps within a motif 13; and (d) incorporation of phylogeny information 17.

* Corresponding author.

Page 257: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

240

Besides these enhancements, we observe below two important aspects common to all previous implementations of Gibbs sampling (and also common to most other motif finding algorithms).

First, the promoter DNA sequences upstream of a collection of co-expressed genes are often taken as the input to a motif finding algorithm. This is because co-expression is usually taken as evidence of co-regulation, which is in turn assumed to be controlled by a common motif. Molecular biology has been revolutionized by cDNA microarray and chromatin immunoprecipitation (ChIP) techniques, which allow us to simultaneously monitor the mRNA expression levels of many genes under various conditions. With gene expression data in hand, one generally applies a selected threshold on fold changes in the expression level under a single experimental condition relative to some control condition in order to retrieve a set of co-expressed genes. Here, a question naturally arises: how should the threshold be selected so that the motif can be found easily and reliably? Notice that motif elements with large fold changes in expression are in general more likely to represent a true motif 12. However, the statistical significance (e.g., p-values) of these motif elements may not increase as the threshold increases, because an increase in threshold may also reduce the number of co-expressed genes. On the other hand, lowering the threshold may cluster more genes that are less likely co-regulated, and decrease the statistical significance as well. This is a dilemma that was not addressed by any previous Gibbs sampling algorithm.

Second, a PSSM is commonly used to represent a probabilistic motif model by taking into account the base variation of motif elements at different positions. Specifically, given a PSSM Q of a motif, each component q_ij describes the probability of observing base j at position i in the motif. In all previous implementations of Gibbs sampling, a PSSM is estimated from a set of putative motif elements, which are sampled from the input promoter sequences, with components proportional to the observed frequencies of each base at different positions. This relies on an implicit assumption that has never been questioned before: that all motif elements, regardless of the downstream genes that they regulate, should

contribute equally to the components of the motif PSSM. However, we know that expression levels (and fold changes) vary in a large range even among co-regulated genes, which could perhaps suggest that the above assumption might not be fair. In other words, equating every motif element could result in an inaccurate PSSM. Note, however, that an accurate motif model for a transcription factor is essential to differentiate its true binding sites from spurious ones.

In this paper, we address the above two problems together by one scheme, referred to as sequence weighting. It is natural to assume that motif elements with dramatic fold changes in expression are more likely to represent a true motif. Therefore, we want to estimate a PSSM such that it explicitly reflects such a (nonuniform) likelihood distribution over motif elements. One way to achieve this is to assign each motif element a weight, e.g., proportional to the fold change in expression, and then to estimate each component q_ij of the PSSM as the weighted frequency of base j at position i among all motif elements. One can see that a weighted PSSM favors putative motif elements showing large fold changes in expression. On the other hand, a putative motif element with small fold changes, which is less likely to represent the true motif, will not affect a weighted PSSM as much.

The use of fold changes in expression as weights to estimate PSSMs implicitly assumes that the DNA sequences of motif elements exhibiting higher fold changes are more similar to the motif consensus pattern. This is plausible since such motif elements are more likely to represent the true motif. Moreover, since the binding energy of a transcription factor (TF) protein to a DNA site can be approximated as the sum of pairwise contact energies between the individual nucleotides and the protein 20, different binding sites may indeed have different affinities for their cognate transcription factors. In evolution, there is not only a selection force for TF binding sites to remain recognized by their TFs, but also a selection force for preserving the strength of binding sites 17, especially those showing dramatic fold changes in expression.

We have incorporated the sequence weighting scheme into the Gibbs sampling algorithm originally

Page 258: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

241

developed in 10, 7. In real applications on a set of co-expressed genes, we can assign each input promoter sequence a weight proportional to the fold change in gene expression obtained from a cDNA microarray experiment, or proportional to the so-called binding ratio determined by a genome-wide location analysis, a popular approach that combines a modified chromatin immunoprecipitation procedure with DNA microarray analysis for studying genome-wide protein-DNA interactions and transcription regulation 16. In a genome-wide location analysis, a binding ratio is calculated by taking the average of fold changes in expression over three independent microarray experiments.

Our implementation of Gibbs sampling via sequence weighting has been successfully tested on both simulated and real biological sequence data. We considered two sets of genes regulated by the transcriptional activators Gal4 and Ste12, respectively, whose expression levels and binding ratios were determined by genome-wide location analysis 16. The test results show that the use of sequence weighting has a profound impact on the performance of the Gibbs motif sampling algorithm.

The rest of the paper is organized as follows. The next section introduces the basic Gibbs sampling algorithm and the proposed sequence weighting scheme. Preliminary experiments on simulated data and real data are presented in Section 3. Section 4 gives some concluding remarks.

2. GIBBS SAMPLING THROUGH SEQUENCE WEIGHTING

In this section, we start with the description of the basic Gibbs sampling algorithm, and then introduce the new method of estimating position specific scoring matrices (PSSMs) via sequence weighting.

2.1. The Motif Model

A DNA motif is usually represented by a set of short sequences that are all binding sites of some transcription factor protein. Due to base variation at binding sites, the pattern of a DNA motif is conveniently described by a probabilistic model of base frequencies

at each position, for which a common mathematical representation is the so-called position specific scoring matrix (PSSM). A PSSM Q consists of entries q_ij, which give the probabilities of observing base j at position i of a binding site. The main assumption underlying this motif model is that the bases occurring at different positions of a DNA motif are probabilistically independent.

Assume that we are given a set of N binding sites, s_1, s_2, ..., s_N, of width W each. Let J = 4 be the number of bases in the alphabet {A, C, G, T}, and let c_{ij} be the observed count of base j at position i. A widely used method to estimate a PSSM from these binding sites is simply given by^a

q_{ij} = \frac{c_{ij}}{c_i}, \quad 1 \le i \le W, \ 1 \le j \le J,

where c_i is the sum of c_{ij} over the alphabet; that is, c_i = \sum_{j=1}^{J} c_{ij}.

^a In an actual implementation, a "pseudocount" should be added to each c_{ij} in order to avoid a zero frequency for any base not actually observed.

With the PSSM Q, we are able to estimate the probability P(s|Q) of an arbitrary sequence s being generated by the motif model as

P(s|Q) = \prod_{i=1}^{W} q_{i,s_i},

where s_i is the base of s at position i. On the other hand, a background sequence model V is estimated to depict the probabilities p_j of base j occurring in the background sequence. The probability of the sequence s being generated by V is given by

P(s|V) = \prod_{i=1}^{W} p_{s_i}.

Therefore, the likelihood that s is a true binding site of interest under the motif model Q versus the background model V is given by the ratio

L(s|V,Q) = \frac{P(s|Q)}{P(s|V)}.

The most useful quantity characterizing the quality of a PSSM Q is its information content I, defined as

I = \frac{1}{W} \sum_{i=1}^{W} \sum_{j=1}^{J} q_{ij} \log \frac{q_{ij}}{p_j},

Page 259: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

242

where the logarithm is often taken with base 2 to express the information content in bits. The information content thus ranges from 0 to 2, reflecting the weakest to the strongest motifs.
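A minimal sketch of the model above, assuming simple list/dict data structures (all function names are illustrative): estimating a PSSM with pseudocounts (footnote a), scoring a sequence under the motif and background models, and computing the information content in bits.

```python
import math

BASES = "ACGT"

def estimate_pssm(sites, pseudocount=1.0):
    """Estimate q_ij from equal-width binding-site strings (Sec. 2.1);
    a pseudocount avoids zero frequencies, as in footnote a."""
    W = len(sites[0])
    q = []
    for i in range(W):
        counts = {b: pseudocount for b in BASES}   # c_ij with pseudocounts
        for s in sites:
            counts[s[i]] += 1
        c_i = sum(counts.values())                 # c_i = sum_j c_ij
        q.append({b: counts[b] / c_i for b in BASES})
    return q

def likelihood_ratio(s, q, p):
    """L(s | V, Q) = P(s|Q) / P(s|V) for background frequencies p_j."""
    ratio = 1.0
    for i, b in enumerate(s):
        ratio *= q[i][b] / p[b]
    return ratio

def information_content(q, p):
    """I = (1/W) sum_i sum_j q_ij log2(q_ij / p_j), in bits."""
    W = len(q)
    return sum(q[i][b] * math.log2(q[i][b] / p[b])
               for i in range(W) for b in BASES) / W
```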

2.2. Basic Gibbs Sampling Algorithm

The basic motif finding problem is, given a set of DNA sequences S_1, S_2, ..., S_N, to look for short sequence segments of a specified width W that are over-represented among the input sequences. In real biological applications, the input sequences (of a typical size of 800 bp) are usually taken from the upstream regions of co-expressed genes, W ranges from 5 bp to 16 bp, and the output segments are putative binding sites.

Gibbs sampling has proven to be a powerful strategy for finding weak DNA motifs. The most basic implementation of Gibbs sampling, known as a site sampler, assumes that there is exactly one binding site located in each input sequence. The details of this implementation are described in 10. In Figure 1, we briefly summarize it in order to show how sequence weighting is incorporated into Gibbs sampling. Note that S_k[i, i+W-1] denotes the substring of width W starting at position i and ending at position i+W-1 in sequence S_k.

2.3. Estimating the PSSM via Sequence Weighting

To start, we introduce two more notations in order to incorporate sequence weights into the computation of a PSSM. Given a set of N binding sites, s_1, s_2, ..., s_N, of width W each, let w_k be the weight associated with the input sequence S_k, reflecting in some way the contribution of the sequence S_k to the PSSM, as discussed above. The sequence weights can be normalized so that they sum up to N. We define a binary function \delta(i,j,k) as

\delta(i,j,k) = \begin{cases} 1, & \text{if } s_k(i) = j, \\ 0, & \text{otherwise,} \end{cases}

where s_k(i) is the base at position i of binding site s_k. In order to incorporate sequence weights into the Gibbs sampling algorithm, we propose to compute c_{ij} as the weighted count of base j at position i of the binding motif, i.e.

c_{ij} = \sum_{k=1}^{N} w_k \, \delta(i,j,k).

Then, we estimate q_{ij} as before, but using the weighted counts c_{ij}; that is,

c_i = \sum_{j=1}^{J} c_{ij} \quad \text{and} \quad q_{ij} = \frac{c_{ij}}{c_i}, \quad 1 \le i \le W, \ 1 \le j \le J.

One can easily see that the above is a natural extension of the original construction of PSSM where the weights for all sequences involved were assumed to be equal.

We have implemented the above sequence weighting scheme in the Gibbs motif sampler software developed in 10, 7. Only the necessary parts of the source code have been modified, so that we can make a fair comparison between the original Gibbs sampler and this modified version. It is easy to see that the extra running time caused by sequence weighting is negligible.
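Under the same assumptions as the earlier sketch, the weighted counts only change how c_{ij} is accumulated:

```python
def weighted_pssm(sites, weights, pseudocount=1.0):
    """Weighted PSSM of Sec. 2.3: c_ij = sum_k w_k * delta(i,j,k).
    weights: one w_k per binding site (e.g., fold changes); they are
    normalized here so that they sum to N, as in the text."""
    W, N = len(sites[0]), len(sites)
    norm = N / sum(weights)
    q = []
    for i in range(W):
        counts = {b: pseudocount for b in "ACGT"}
        for s, w in zip(sites, weights):
            counts[s[i]] += w * norm
        c_i = sum(counts.values())
        q.append({b: counts[b] / c_i for b in "ACGT"})
    return q
```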

3. EXPERIMENTAL RESULTS

In order to test the performance of the above weighted Gibbs sampler, we have applied it to both simulated and real sequence data, and compared its results with those of the original Gibbs sampler 10, 7. The simulated data sets allow us to compare the performance of the algorithms in an idealized situation that does not involve the complexities of real data. For our tests on real data, we use two sets of genes in Saccharomyces cerevisiae (yeast) that were determined by ChIP-array experiments 16 to be co-regulated by the two proteins Ste12 and Gal4, respectively.

3.1. Simulated Data

In our simulation studies, a motif model was created as follows. First, 20 short DNA sequences (of width W) were randomly generated as binding sites of a common transcription factor with varying degrees of conservation. The seed transcription factor binding site is described by a consensus pattern.


Input:  A set of DNA sequences S_1, S_2, ..., S_N and the motif width W
Output: The starting position a_k of the motif in each sequence S_k;
        a PSSM Q = [q_ij] for the putative motif model

begin
  Initialization:
    Randomly select a position a_k for the motif in each sequence S_k
    Estimate the background base frequencies p_j, for j from 1 to J, to obtain V
  Repeat until convergence:
    Predictive update step:
      Randomly select a sequence S_z from the input sequences
      Take the set of putative binding sites {S_k[a_k, a_k+W-1] | 1 ≤ k ≤ N, k ≠ z}
      Estimate the PSSM Q from {S_k[a_k, a_k+W-1] | 1 ≤ k ≤ N, k ≠ z}
    Sampling step:
      Estimate P(S_z[n, n+W-1] | Q) for every position n in sequence S_z
      Estimate P(S_z[n, n+W-1] | V) for every position n in sequence S_z
      Randomly select a new position a_z in S_z according to L(S_z[n, n+W-1] | V, Q)
end

Fig . 1. The basic Gibbs sampling algorithm.
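A compact, hypothetical rendering of Figure 1 in Python, reusing weighted_pssm and likelihood_ratio from the sketches above (uniform weights reproduce the original site sampler; convergence is simplified to a fixed iteration budget):

```python
import random

def gibbs_site_sampler(seqs, W, weights=None, iters=2000):
    """Site sampler of Fig. 1; returns motif start positions and a PSSM."""
    N = len(seqs)
    weights = weights or [1.0] * N
    # Background base frequencies p_j estimated from all input sequences.
    total_len = sum(len(s) for s in seqs)
    p = {b: sum(s.count(b) for s in seqs) / total_len for b in "ACGT"}
    a = [random.randrange(len(s) - W + 1) for s in seqs]  # initialization
    for _ in range(iters):
        z = random.randrange(N)               # predictive update step
        sites = [seqs[k][a[k]:a[k] + W] for k in range(N) if k != z]
        ws = [weights[k] for k in range(N) if k != z]
        q = weighted_pssm(sites, ws)
        # Sampling step: draw a new a_z proportionally to L(S_z[n, n+W-1]).
        scores = [likelihood_ratio(seqs[z][n:n + W], q, p)
                  for n in range(len(seqs[z]) - W + 1)]
        r, acc = random.uniform(0, sum(scores)), 0.0
        for n, sc in enumerate(scores):
            acc += sc
            if acc >= r:
                a[z] = n
                break
    final_sites = [seqs[k][a[k]:a[k] + W] for k in range(N)]
    return a, weighted_pssm(final_sites, weights)
```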

The degree of conservation is measured by the Hamming distance to the consensus pattern, and the weakest binding sites have one half of their bases different from those at the corresponding positions of the consensus. Second, a set of 20 promoter sequences, each 800 bases long, were randomly generated, each with a binding site implanted at a randomly selected position. Finally, each promoter sequence was assigned a weight equal to the degree of conservation of the implanted binding site, based on the observation that binding sites with dramatic fold changes in gene expression are more likely to represent true motifs 12. In the test, we chose five different motif widths (W = 8, 10, 12, 14, 16), reflecting different levels of difficulty for motif finding. For each motif width, 100 test data sets were generated, giving rise to a total of 500 data sets.
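A sketch of this simulation protocol under the stated parameters (helper names are illustrative):

```python
import random

def make_dataset(consensus, n_seqs=20, seq_len=800):
    """Implant degraded copies of `consensus` into random sequences and
    weight each sequence by the conservation of its implanted site."""
    W = len(consensus)
    seqs, weights = [], []
    for _ in range(n_seqs):
        # Mutate up to W/2 positions (the weakest sites differ in half
        # of their bases from the consensus).
        n_mut = random.randrange(W // 2 + 1)
        site = list(consensus)
        for i in random.sample(range(W), n_mut):
            site[i] = random.choice("ACGT")
        seq = [random.choice("ACGT") for _ in range(seq_len)]
        start = random.randrange(seq_len - W + 1)
        seq[start:start + W] = site
        seqs.append("".join(seq))
        weights.append(W - n_mut)        # weight = degree of conservation
    return seqs, weights
```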

Both the original Gibbs sampler and the weighted version were applied to search for motifs in each data set, and the top three motifs were reported by each program. A found motif is considered correct if its consensus sequence differs from the planted motif consensus pattern by at most two bases. We are interested in the number of times each program successfully detected the implanted motif in the 100 tests for each motif width, and in the average rank of the correct motif when it comes up in the top three. The results are summarized in Table 1.

Each program was run twice on each test data set with the option of column sampling 13 turned on or turned off, respectively.

We can see that the weighted Gibbs sampling method was able to find more correct motifs than the original Gibbs motif sampler in all the tests. It is particularly promising in the discovery of weak (i.e., short) motifs: twice as many correct motifs were found by the weighted method when W = 8 (without column sampling) as by the original Gibbs sampling method.

3.2. Real Biological Data

3.2.1. Ste12

The transcription activator Ste12 is a DNA-bound protein that directly controls the expression of genes in the response of haploid yeast to mating pheromones 16. We will use it to demonstrate how the sequence weighting scheme can boost the prediction accuracy of the Gibbs sampling method.

Genome-wide location analysis is a promising approach to monitoring protein-DNA interactions across a whole genome 16. It combines a modified chromatin immunoprecipitation (ChIP) procedure with DNA microarray analysis in order to provide the relative binding of the protein of interest to DNA sequences.


Table 1. Simulation results on 20 sequences of 800 bases.

                     Original Gibbs motif sampler 7, 10           Gibbs sampling via sequence weighting
                     with column sampling   without column        with column sampling   without column
                                            sampling                                     sampling
Tests        Times found  Avg. rank   Times found  Avg. rank   Times found  Avg. rank   Times found  Avg. rank
W = 8            22         2.50          23          1.91         31          1.61         49          1.90
W = 10           35         2.03          50          2.00         58          1.97         73          1.97
W = 12           53         1.91          73          1.97         77          1.92         88          2.08
W = 14           74         1.89          84          1.96         80          1.86         95          1.96
W = 16           70         2.00          92          1.85         88          1.76         95          2.00

Such an analysis of epitope-tagged Ste12 has determined that 29 pheromone-induced genes in yeast are likely to be directly regulated by Ste12 16. Figure 2 lists these genes and their binding ratios extracted from 16. Note that, in the figure, names in all capital letters, such as STE12, represent genes, and names that begin with a capital letter, such as Ste12, represent DNA binding motifs.

Of great interest is to find the sites bound by Ste12 in the promoter sequences of these 29 genes. For this purpose, we extracted up to 800 bp of the upstream region of each gene from the Saccharomyces genome database, and assigned each sequence a weight equal to the relative binding ratio obtained from the genome-wide location analysis. Both the original Gibbs sampling algorithm and our weighted version were run on all 29 sequences, and their experimental results were then compared.

Due to the stochastic nature of Gibbs sampling, we ran both programs 10 times with different random seeds, and each time the top ten putative motifs were reported. One can see that the same motif might be reported in different runs. Of the 100 putative motifs, the original Gibbs sampling algorithm did not find any motif resembling the known Ste12 consensus pattern TGAAACA 5. Our algorithm, however, found the correct Ste12 motif six times in 10 runs, and ranked the Ste12 motif second among all the putative motifs in terms of information content. Figure 2 lists the putative binding sites found by the weighted Gibbs sampling method upstream of the 29 genes regulated by Ste12, and Figure 3 shows the corresponding weighted PSSM. The information content of this PSSM is 1.09, indicating that a very strong motif has been detected by our algorithm.
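As a worked check, the information_content sketch from Section 2.1 can be applied to the PSSM of Figure 3 (note this assumes a uniform background, which the paper does not state it used):

```python
# Recomputing the information content of the Fig. 3 PSSM with the
# information_content sketch above, assuming a uniform background.
fig3_pssm = [
    {"A": 0.087732, "C": 0.017586, "G": 0.050137, "T": 0.844545},
    {"A": 0.028292, "C": 0.036817, "G": 0.906780, "T": 0.028111},
    {"A": 0.778292, "C": 0.045558, "G": 0.078109, "T": 0.098041},
    {"A": 0.912907, "C": 0.042062, "G": 0.016920, "T": 0.028111},
    {"A": 0.839480, "C": 0.038565, "G": 0.016920, "T": 0.105034},
    {"A": 0.176893, "C": 0.760593, "G": 0.016920, "T": 0.045594},
    {"A": 0.872697, "C": 0.035069, "G": 0.016920, "T": 0.075314},
]
uniform = {b: 0.25 for b in "ACGT"}
# Prints about 1.19 bits; the paper's 1.09 was presumably computed
# against the estimated (non-uniform) yeast promoter background.
print(information_content(fig3_pssm, uniform))
```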

The above experimental results are very encouraging, but not surprising to us. One can see from Figure 2 that, roughly speaking, the higher a relative binding ratio is, the closer the corresponding binding site is to the known motif consensus pattern (in terms of sequence Hamming distance). In particular, each of the top six sequences in the table contains a binding site exactly matching the Ste12 motif consensus pattern. Once some of these binding sites have been selected by chance, they are strongly favored in the construction of the (weighted) PSSM due to their large weights. This process tends to recruit more correct sites, which in turn further improves the specificity of the PSSM.

3.2.2. Gal4

Gal4 is among the most characterized transcriptional activators; it activates genes necessary for galactose metabolism 16. It provides another test showing that sequence weighting really improves the performance of the original Gibbs sampling algorithm. The genome-wide location analysis 16 found 10 genes to be regulated by Gal4 and induced in galactose, with varying relative binding ratios (see Figure 4).

We performed the same experiment on Gal4 as we did on Ste12. Of the 100 putative motifs, the original Gibbs sampling algorithm once again failed to find any motif similar to the known Gal4 consensus pattern CGGN11CCG. This result was unexpected, because the binding sites of Gal4 are actually well conserved among the input sequences (shown below). With sequence weighting, however, our algorithm successfully discovered the exact Gal4 motif, with the highest information content (1.80) among the 100 putative motifs. The putative binding sites are listed in Figure 4, and the weighted PSSM is given in Figure 5. This clearly shows the advantage of the sequence weighting that we have implemented in the Gibbs sampling algorithm, although we suspect that the algorithms might have found or missed the Gal4 motif by chance, due to its very low statistical significance and the stochastic nature of Gibbs sampling.


Name       Site      Prob
FUS1       TGAAACA   0.3031
STE12      TGAAACA   0.3031
FUS3       TGAAACA   0.3031
PEP1       TGAAACA   0.3031
PCL2       TGAAACA   0.3031
ERG24      TGAAACA   0.3031
PRH1       TGAAAAA   0.0705
FIG2       TGAAACA   0.3031
PGH1       TGTATCA   0.0047
YIL169C    TGCAACA   0.0177
AGA1       TGAAACA   0.3031
HYH1       TGACACA   0.0139
GIC2       TGAAAAT   0.0060
YER019W    TGAATCA   0.0379
SPC25      TGGAACA   0.0304
FAR1       TGTAACT   0.0032
KAR5       TGGAACA   0.0304
FIG1       TGAACAA   0.0032
YIL037C    TGAAAAA   0.0705
YOR343C    ACAATAA   0.0000
AFR1       TGAAACA   0.3031
PHO81      GGAAAAA   0.0041
YOL155C    AGGAACA   0.0031
YIL083C    TGAAATC   0.0007
CIK1       TGAAACA   0.3031
YPL192C    TGAAACA   0.3031
SCH9       AGAAACA   0.0314
CHS1       GGAAACA   0.0179
YOR129C    TGTAACA   0.0381

Fig. 2. Putative binding sites for the transcription activator Ste12 found by the weighted Gibbs sampling method. The column under Prob lists the probabilities of the binding sites given by the motif model represented by the PSSM in Figure 3. (The flanking-sequence coordinates and the column of relative binding ratios obtained from the genome-wide location analysis were garbled in extraction; the ratios decrease down the table from roughly 5.1 to 0.7.)

POS        A           C           G           T
1       0.087732    0.017586    0.050137    0.844545
2       0.028292    0.036817    0.906780    0.028111
3       0.778292    0.045558    0.078109    0.098041
4       0.912907    0.042062    0.016920    0.028111
5       0.839480    0.038565    0.016920    0.105034
6       0.176893    0.760593    0.016920    0.045594
7       0.872697    0.035069    0.016920    0.075314

Fig. 3. The weighted PSSM calculated from the putative binding sites in Figure 2.

The complete results of both tests are available at http://www.ntu.edu.sg/home/ChenXin/Gibbs.

4. DISCUSSION AND FUTURE RESEARCH



Name     Binding site         Ratio   Prob
GAL1     CGGATTAGAAGCCGCCG    8.5     0.5853
GAL10    CGGAGGAGAGTCTTCCG    8.5     0.5853
GAL3     CGGTCCACTGTGTGCCG    5.9     0.5853
GAL2     CGGAGATATCTGCGCCG    5       0.5853
MTH1     CGGGGAAATGGAGTCCG    2.5     0.5853
GAL7     CGGACAACTGTTGACCG    1.5     0.5853
GAL80    CGGCGCACTCTCGCCCG    1.4     0.5853
GCY1     CGGGGCAGACTATTCCG    1.1     0.5853
FUR4     CCGATTTCCTAGACCGG    1.1     0.0014
PCL10    CGGAATATATCTTTTCG    0.6     0.0279

Fig. 4. Putative binding sites for the transcription activator Gal4 found by the weighted Gibbs sampling. The relative binding ratios and probabilities of the binding sites are displayed as in Figure 2. (The flanking-sequence coordinates were garbled in extraction.)

POS        A           C           G           T
1       0.027807    0.927216    0.016568    0.028409
2       0.027807    0.045826    0.897958    0.028409
3       0.027807    0.018125    0.925659    0.028409
15      0.027807    0.912106    0.016568    0.043519
16      0.027807    0.899515    0.044269    0.028409
17      0.027807    0.018125    0.925659    0.028409

Fig. 5. The weighted PSSM calculated from the putative binding sites in Figure 4.

The selection of a suitable threshold value on expression level (or binding ratio) in order to retrieve a set of co-regulated genes, and the construction of an accurate PSSM from a set of promoter sequences to represent the true motif model, are two delicate problems usually ignored by a Gibbs sampling strategy in motif discovery. In this paper, we tackle these problems with a sequence weighting scheme, in order to improve the prediction accuracy of the basic Gibbs sampling algorithm.

Gibbs sampling via sequence weighting can be effectively applied to find motifs when gene expression data is available. As we have noted before, several computational methods that take advantage of gene expression variation have been developed 3, 12, 9, 4, but they all differ from ours in various aspects.

For example, MDscan 12 uses a word-enumeration strategy to exhaustively search for motifs, and is thus a deterministic combinatorial approach. Moreover, it needs a threshold value on expression level in order to extract highly expressed genes, and also treats all putative binding sites equally when representing a motif model, regardless of their expression variations. Our method does not require a preset threshold value.

On the other hand, many computational methods have been proposed to identify motifs in the promoter regions of genes that exhibit similar expression patterns across a variety of experimental conditions 3. Here, our proposed method focuses on a single experimental condition (relative to a control condition). Previous studies 9 showed that focusing on a single experimental condition is crucial for identifying experiment-specific regulatory motifs. One reason for this is that averaging across experiments may destroy the significant relationship between the expression of genes and their regulatory motifs that is present only in a single experiment.

To summarize, we have proposed in this paper a sequence weighting scheme for enhancing the motif finding accuracy of the basic Gibbs sampling algorithm. This is achieved by estimating a PSSM from the promoter sequences, weighted proportionally to the fold changes in the expression of their downstream genes. Our preliminary experiments on simulated and real biological data have clearly shown the advantage of this sequence weighting scheme in Gibbs sampling. In the future, we would like to test the method on more real data sets with gene expression profiles, and to extend the method to gene expression data across multiple experimental conditions.


ACKNOWLEDGMENTS

Research supported in part by NSF grant CCR-0309902, NIH grant LM008991-01, NSFC grant 60528001, National Key Project for Basic Research (973) grant 2002CB512801, and a fellowship from the Center for Advanced Study, Tsinghua University.

References

1. T. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2, 28-36, 1994.

2. H. Bussemaker, H. Li, and E. Siggia. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA, 97, 10096-10100, 2001.

3. H. Bussemaker, H. Li, and E.D. Siggia. Regulatory element detection using correlation with expression. Nat. Genet., 27, 167-171, 2001.

4. E. Conlon, X. Liu, J. Lieb, and J. Liu. Integrating regulatory motif discovery and genome-wide expression analysis. PNAS, 100, 3339-3344, 2003.

5. J. Dolan, C. Kirkman, and S. Fields. The yeast STE12 protein binds to the DNA sequence mediating pheromone induction. Proc. Natl. Acad. Sci. USA, 86, 5703-5707, 1989.

6. M. Gupta and J. Liu. Discovery of conserved sequence patterns using a stochastic dictionary model. J. Am. Stat. Assoc., 98, 55-66, 2003.

7. http://www.fas.harvard.edu/~junliu/Software/gibbs9_95.tar

8. J. Hughes, P. Estep, S. Tavazoie, and G. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205-1214, 2000.

9. S. Keles, M. Laan, and M. Eisen. Identification of regulatory elements using a feature selection method. Bioinformatics, 18, 1167-1175, 2002.

10. C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, and J. Wootton. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science, 262, 208-214, 1993.

11. X. Liu, D. Brutlag, and J. Liu. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 127-138, 2001.

12. X. Liu, D. Brutlag, and J. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology, 20, 835-839, 2002.

13. A. Neuwald, J. Liu, and C. Lawrence. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci., 4, 1618-1632, 1995.

14. P. Pevzner and S. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 269-278, 2000.

15. I. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14, 55-67, 1998.

16. B. Ren, et al. Genome-wide location and function of DNA binding proteins. Science, 290, 2306-2309, 2000.

17. R. Siddharthan, E. Siggia, and E. van Nimwegen. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Computational Biology, 1, e67, 0534-0555, 2005.

18. G. Thijs, M. Lescot, K. Marchal, S. Rombauts, B. Moor, P. Rouze, and Y. Moreau. A higher order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17, 1113-1122, 2001.

19. M. Tompa, N. Li, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, 23, 2005.

20. M. Djordjevic, A. Sengupta, and B. Shraiman. A biophysical approach to transcription factor binding site discovery. Genome Res., 13, 2381-2390, 2003.


DETECTION OF CLEAVAGE SITES FOR HIV-1 PROTEASE IN NATIVE PROTEINS

Liwen You*

Computational Biology and Biological Physics Group

Department of Theoretical Physics, Lund University

Solvegatan 14 A, SE-22362, Lund, Sweden * Email: [email protected]

Intelligent Systems Lab, School of Information Science, Computer and Electrical Engineering, Halmstad University

Box 823, SE-30118, Halmstad, Sweden

Predicting novel cleavage sites for HIV-1 protease in non-viral proteins is a difficult task because of the scarcity of previous cleavage data on proteins in a native state. We introduce a three-level hierarchical classifier which combines information from experimentally verified short oligopeptides with secondary structure and solvent accessibility information from prediction servers to predict potential cleavage sites in non-viral proteins. The best second-level classifier of the hierarchy, using secondary structure information, is the one based on logistic regression. With this level of classification, the false positive ratio was reduced by more than half compared to the first-level classifier, which uses only the oligopeptide cleavage information. The method can be applied to other protease specificity problems as well, to combine information from oligopeptides with structural information from native proteins.

1. INTRODUCTION

Within the HIV-1 genome, gag and pol are two main genes. The gag gene encodes the proteins that form the building blocks of the viral core (the matrix, capsid, and nucleocapsid proteins), and the pol gene encodes replication related proteins (protease, reverse transcriptase and integrase). Translation of the gag and gag/pol transcripts results in the Gag and GagPol polyproteins. During the HIV-1 virion maturation process, HIV-1 protease cleaves the viral Gag and GagPol polyproteins into structural and other replication proteins, making it possible to assemble an infectious virion. The cleavage of the polyproteins by HIV-1 protease therefore plays an important role in the final stage of the HIV virion maturation process, and efficiently hindering the cleavage process is one way of blocking the viral life cycle. HIV-1 protease inhibitors are therefore part of the therapy arsenal against HIV/AIDS today. Efficiently cleaved substrates are excellent templates for the synthesis of tightly binding, chemically modified inhibitors 1. The difficulty, however, is that the protease cleaves at different sites with little or no sequence similarity. In the last two decades, several studies, including wet-lab experiments on HIV-1 protease cleavage of oligopeptides, have been performed to study cleavage specificity 2-5.

On the other hand, little is known about the role of the protease after mutation and in the post-maturation phases of the viral life cycle. This raises questions about the involvement of the protease in the breakdown of host proteins related to the immune system, the protein synthesis machinery, gene regulatory pathways and so on. So far, it has been discovered that the protease acts on more than 20 different non-viral proteins, such as actin 6 and vimentin 7. However, there is a lack of comprehensive information about the interaction between the protease and non-viral proteins. Therefore, the study of the susceptibility of host proteins in native states to hydrolysis by the protease is important for understanding the role of HIV-1 protease in its host cell.

The two cleavage problems, cleavage of short oligopeptides and cleavage of native proteins, are related but different. Short oligopeptides and denatured proteins do not have folded structures. The protease is known to have an active site with eight subsites, to which eight corresponding substrate residues can bind. Quite a large number of oligopeptides have been experimentally verified as substrates of HIV-1 protease. The cleavage specificity of the protease is both sensitive to context and broad. In previous work we collected an extended data set of 746 octamers 8 and built a predictor with 92% sensitivity and specificity for predicting cleavage of short oligopeptides. In contrast to short oligopeptides, proteins in their native states are folded into complex structures. Only around 20 tested native protein substrates are reported in the literature, with in total around 42 cleaved sites. On average, a protein with a length of about 400 amino acids has only one or two cleavage sites; in other words, cleavages in native proteins are rare events. Due to this rarity and the complex structure, the native protein cleavage problem is much harder to attack than the one on short oligopeptides. The two problems are related in the sense that a cleavage site found in short oligopeptides is very likely to be cleaved in a native protein if it is located in a surface exposed region. On the other hand, such sites might not be cleaved in native proteins, since the protease recognizes specific structures and the local environment may prevent this recognition.

The aim of the present work was to predict cleavage sites in native proteins by combining information from short oligopeptides and native proteins. This is complicated, since information from short oligopeptides is difficult to transfer to native proteins and vice versa, but it is important to do, since experiments on oligopeptides are much easier to perform and are more abundant in the literature than experiments on native proteins.

2. Systems and methods

There are about 42 experimentally verified cleavage sites within 21 proteins with a total length of 8212 amino acids. Native protein cleavage sites are rarely observed, which implies that the cleavage sites are in a tiny region of the whole protein sequence and structure space.

As mentioned before, the two cleavage problems are different. A predictor based on short oligopeptides should discover all true cleavage sites, but with many false positives. Taking the Bcl2 protein 9 as an example: it has 205 amino acids, but only one cleavage site. The predictor predicts 55 cleavage sites, including the true one. Therefore, a predictor based only on short oligopeptides does not work well on native proteins. This is not surprising, since some predicted cleavage sites might not be exposed to the protease, or their local secondary structures may prevent binding of the protease.

Tyndall et al. 10, 11 have targeted the recognition of substrates and ligands by proteases based on PDB files. They found that proteases generally recognize an extended beta strand conformation in their active sites. The structure of peptidic compounds can be defined by their φ and ψ angles. But strictly speaking, short oligopeptides do not contain much structure information. Therefore it is not possible to build a predictor for native proteins based on short oligopeptides alone, as the example with Bcl2 shows.

Is it possible to get structure information for proteins? As far as ligands go, unless they are peptidic compounds, secondary structure cannot be readily defined. Although there are lots of PDB files describing ligand structure information, it is almost impossible to find experimental structure information for a whole protein, except for short parts of it. So, lack of experimental structure information is a problem. We therefore use structure predictors to get secondary structure information. Many research groups have developed secondary structure predictors, and today some predictors reach around 80% correct prediction performance. In this way, secondary structure information can be accessed. The risk is that it contains noise; however, as long as it contains more information than noise, it should still improve the prediction. The same goes for solvent accessibility information.

Due to the scarce information about the cleavage of native proteins, and the insufficient experimental structure information on proteins, it would be hard to work directly on the native protein level. As mentioned in the section above, there is much more data available for short oligopeptides. Fortunately, our predictor based on short oligopeptides contains information about the cleavage specificity of the protease, but it predicts too many false positives on native proteins, since it is not possible to take structure into account for short oligopeptides. By accessing prediction servers to get secondary structure and solvent accessibility information, we can combine these with the information from short oligopeptides to build a predictor for the cleavage of native proteins.


2.1. Hierarchical classifier

Boyd et al. 12 have built a publicly accessible bioinformatics tool for building computational models of protease specificity, which can be based on amino acid sequences, expert knowledge, and secondary/tertiary structure of substrates. However, their way of building prediction models is mainly based on protease specificity profiles, which is too flexible to tune easily. In addition, they extracted accessible surface area information from PDB files, which might not be available for the proteins of interest. Furthermore, they used a rule-based method to incorporate secondary structure information, instead of a data-driven one.

Here we used a three-level hierarchical classifier to combine the information from oligopeptides and native proteins. Figure 1 illustrates the structure of the hierarchical classifier.

Fig. 1. A three-level hierarchical classifier, which combines information from short oligopeptides and native proteins. [Flowchart: a predictor trained on short oligopeptides is applied to the native protein sequence; sites passing the secondary structure check and the solvent accessibility check are output as the sites predicted to be cleaved after the three levels of prediction.]

(1) At the first classification level, the predictor trained using short oligopeptides and denatured proteins, with a window size of 8 amino acids (4 residues on each side of the cleavage site), moves along a protein sequence and predicts all possible cleavage sites. Only sites predicted to be cleaved are collected, together with true cleavage indicators (class labels). The predictor works like a filter on the protein sequence level that removes only a part of the true non-cleaved sites.

(2) At the second level, secondary structure information around these predicted cleavage sites is collected with a larger window size, to include residue interactions; the window does not need to be symmetric around the cleavage site. A predictor trained using the secondary structure information is used to check the cleaving at those sites.

(3) At the final step, solvent accessibility information is collected with the same window size as in the second step around the remaining sites. If the residues inside the window are not exposed to the protease, the site should not be cleaved, and it is removed from the cleaving list. Only sites claimed to be cleaved at this step are classified as cleaved by the whole hierarchical classifier.

With regard to the first classification level, our previous work 8, 13 has discussed how to build a classifier based on short oligopeptides. The first classification level should never miss a true cleavage site, in the sense that it should be tuned to never produce false negatives, at the cost of some more false positives. At the last step, in order to measure the fraction of exposed volume of the residues inside the window, a conservative rule is used: if 90% of the residues inside the window are buried, then the whole fragment is considered inaccessible to the protease.
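To make the data flow of the cascade concrete, a minimal sketch is given below. This is an illustration, not the paper's implementation: predict_oligo, predict_structure and buried_fraction are hypothetical stand-ins for the first-level oligopeptide predictor, the second-level structure predictor and the solvent accessibility lookup, and the (15,15) window size mirrors the one used later for Figure 2.

```python
# Sketch of the three-level hierarchical classifier described above.
# predict_oligo(octamer) -> bool, predict_structure(seq, lo, hi) -> bool and
# buried_fraction(seq, lo, hi) -> float are placeholders for the models
# described in Section 3; they are not part of the paper's code.

def hierarchical_cleavage_sites(sequence, predict_oligo, predict_structure,
                                buried_fraction, win=(15, 15)):
    # Level 1: slide an 8-residue window (4 on each side of the scissile
    # bond) and keep every site the oligopeptide predictor calls cleaved.
    candidates = [pos for pos in range(4, len(sequence) - 4)
                  if predict_oligo(sequence[pos - 4:pos + 4])]

    cleaved = []
    for pos in candidates:
        lo, hi = max(0, pos - win[0]), min(len(sequence), pos + win[1])
        # Level 2: secondary structure check around the candidate site.
        if not predict_structure(sequence, lo, hi):
            continue
        # Level 3: conservative accessibility rule; if 90% of the residues
        # in the window are buried, the fragment is deemed inaccessible.
        if buried_fraction(sequence, lo, hi) >= 0.9:
            continue
        cleaved.append(pos)
    return cleaved
```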

2.2. Data

Cleavage sites in 21 native proteins were collected from the literature 6, 7, 9, 14-23. Similar sequences with minor mutations were not included, since they contain redundant information. These protein sequences were submitted to a structure and solvent accessibility prediction server 24 (http://www.predictprotein.org/), where PROFsec and PROFacc were used to obtain secondary structure and solvent accessibility information, respectively.

3. Algorithms

Two generative models, a naive Bayes classifier and a Bayesian inference model, and two discriminative models, logistic regression and support vector machines (SVM), were tested for the cleavage prediction. For a generative model, the data distribution is either known or assumed to be close to a well known distribution. For a discriminative model, no density estimation is needed; one works directly on the model to find optimum values for its parameters.

3.1. Rare case detection

The cleavage site prediction problem is a rare case detection problem. In rare case detection with an imbalanced data set, there is a majority class and a minority class. Classifiers tend to be biased towards the majority class, but sampling methods (i.e., under-sampling of the majority class and over-sampling of the minority class) can compensate for this to some extent. We use the synthetic minority over-sampling technique (SMOTE) introduced by Chawla et al. 25. It introduces new data by randomly choosing a sample and interpolating new samples between it and its k nearest neighbors.
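A minimal sketch of the SMOTE interpolation step, assuming the minority class samples are rows of a NumPy array (function and parameter names are ours, not from the SMOTE paper):

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)   # distances to all samples
        d[i] = np.inf                              # exclude the sample itself
        x_nn = minority[rng.choice(np.argsort(d)[:k])]
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(x + gap * (x_nn - x))
    return np.array(synthetic)
```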

Classification accuracy is a common measure for evaluating model performance. However, the data set is very imbalanced, and a classifier that always predicts uncleaved is correct in more than 97% of the cases; accuracy is therefore not a suitable metric for this problem. Good metrics for this problem are sensitivity, specificity, the geometric mean (G), which is the square root of the product of sensitivity and specificity, and the area under the ROC (receiver operating characteristic) curve. We use all of them to evaluate and compare models.
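For reference, the confusion-matrix based metrics can be computed as in the sketch below; the AUC is normally obtained from a library routine and is omitted here:

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)                   # true positive rate
    specificity = tn / (tn + fp)                   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(sensitivity * specificity)  # geometric mean G
    return sensitivity, specificity, precision, g_mean

# Example with the first-level counts of Table 1 (42 TP, 1613 FP, 0 FN,
# 6557 TN): the precision comes out at about 0.025.
print(imbalance_metrics(42, 1613, 0, 6557))
```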

3.2. Naive Bayes

The secondary structure predictor outputs probabilities. We use the notation π^i_j = (π^i_{E,j}, π^i_{H,j}, π^i_{L,j}) for the secondary structure probabilities at position j of sample i. The numbers are provided by the secondary structure predictor and are normalized so that π^i_{E,j} + π^i_{H,j} + π^i_{L,j} = 1. The index j runs from 1 to J (the size of the input window). We assume that the data set π^i_j, where i = 1, ..., N (the number of cleaved samples), in the cleaved class has a Dirichlet distribution at each position j; the same holds for the non-cleaved class. In addition, we assume that all positions inside the window are independent. In total, there are 3 × J parameters for each class, and we use maximum likelihood to estimate them. The posterior probability needed for the classification decision is computed using Bayes' theorem. The Fastfit MATLAB toolbox was used to estimate the Dirichlet distribution parameters.
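A sketch of the resulting classifier is given below. It substitutes a simple moment-matching estimate for the maximum likelihood Dirichlet fit done with Fastfit in the paper, so it should be read as an approximation of the method, not a reproduction of it:

```python
import numpy as np
from scipy.special import gammaln

def fit_dirichlet_moments(p):
    """Moment-matching Dirichlet estimate for samples p of shape (N, 3);
    a cruder stand-in for the maximum likelihood fit done with Fastfit."""
    m, v = p.mean(axis=0), p.var(axis=0)
    s = m[0] * (1 - m[0]) / v[0] - 1      # precision from first coordinate
    return m * s                          # alpha vector of length 3

def dirichlet_loglik(alpha, p):
    """Dirichlet log density of probability vectors p (N, 3) under alpha;
    assumes the probabilities are (near) strictly positive."""
    norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return norm + ((alpha - 1) * np.log(p + 1e-12)).sum(axis=1)

class DirichletNB:
    """One independent Dirichlet per window position j and per class."""
    def fit(self, X, y):
        # X has shape (N, J, 3): per-position (E, H, L) probabilities.
        self.alphas = {c: [fit_dirichlet_moments(X[y == c][:, j])
                           for j in range(X.shape[1])] for c in (0, 1)}
        self.logprior = {c: np.log((y == c).mean()) for c in (0, 1)}
        return self

    def predict(self, X):
        scores = np.stack([
            self.logprior[c] + sum(dirichlet_loglik(a, X[:, j])
                                   for j, a in enumerate(self.alphas[c]))
            for c in (0, 1)])
        return scores.argmax(axis=0)      # class label per sample
```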

3.3. Bayesian inference

Each amino acid residue has a specific structure in native proteins, and HIV-1 protease recognizes specific structures. We can interpret the helix, strand and loop probabilities at each position as the probability of observing H, E and L at that position if we randomly draw new samples from the unseen but possible structure character sequence space around cleaved sites. In other words, we can draw new samples representing possible structure patterns (e.g. HHHHL...LLL) from the structure probability data set. Using the drawn structure character sequences, we can estimate the parameters of the Dirichlet distributions at the different positions. When predicting for new structure probability data, we draw a set of structure sequences from it, use the Dirichlet distributions to calculate the probability of observing those structure character sequences, and average them to get the posterior probability. We used a Gibbs sampling method to implement this.

3.4. Logistic regression

Logistic regression has the form log( P(θ = 1 | π) / P(θ = 0 | π) ) = w · π + b, where π holds the secondary structure probabilities for the residues inside a window and θ denotes the class label. The parameters w and b are fitted using maximum likelihood with the MATLAB StatBox toolbox (version 4.2).
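Equivalently, the model can be fitted on the flattened window of structure probabilities with any standard logistic regression routine; a sketch using scikit-learn in place of the MATLAB StatBox fit:

```python
from sklearn.linear_model import LogisticRegression

# X_struct: array (N, J, 3) of per-position (E, H, L) probabilities; y: 0/1.
# Since the three probabilities sum to one, two per position suffice, which
# matches the 2 x J + 1 parameter count quoted in the Discussion.
def fit_logistic(X_struct, y):
    X = X_struct[:, :, :2].reshape(len(X_struct), -1)  # drop redundant column
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model   # model.coef_ holds w, model.intercept_ holds b
```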

3.5. SVM

Training an SVM on a very imbalanced data set makes the decision boundary biased towards the majority class. Random under-sampling and an over-sampling method (SMOTE) were used in our experiments to remove and add secondary structure probability data, respectively. The problem with this approach is that there are quite a lot of parameters to tune (constraints and kernel parameters for the SVM; sampling rate, ratio between the two classes after sampling, and number of nearest neighbors in SMOTE). Cross-validation was used to find their optimum values for good generalization performance. We used the libSVM 26 MATLAB toolbox to train the SVM.

4. Experiments and results

4.1. Exploring the data set

We explored the structure sequence data from the secondary structure predictor output to see if the structure data set contains any information to separate the cleaved and non-cleaved classes. Figure 2 displays the probabilities of observing L, H and E at each position for the non-cleaved (upper part) and cleaved (bottom part) class.

There is almost no structure difference between positions for the non-cleaved class: each structure is uniformly distributed inside the window. For the cleaved class, loops are less likely to be observed around the active site, whereas strands are more likely to be observed in the vicinity of the cleaved site, which agrees with Tyndall's conclusion that an extended conformation is preferred at active sites. It is worth noting that the probability of observing helix structure increases somewhat closer to the active site. This is probably due to the structure prediction performance on helix, strand and loop; the secondary structure prediction server states that "PHD as well as other methods focus on predicting hydrogen bonds. Consequently, occasionally strongly predicted (high reliability index) helices are observed as strands and vice versa (expected accuracy of PHDsec)."

4.2. Experiments

After using the first level predictor on the 21 proteins having 42 true cleavage sites, the prediction results are shown in Table 1.


Table 1. Prediction results after using the first level predictor on the 21 native protein sequences. Sensitivity = 100%; false positive rate = 16%; precision = 2.5%.

                                      True cleavage sites   True non-cleavage sites
  Predicted to be cleaved sites                42                     1613
  Predicted to be non-cleaved sites             0                     6557


We can see that after this step, all true cleavage sites were kept (100% sensitivity, TP/(TP+FN)), while 16% of the non-cleaved sites (false positive rate, FP/(FP+TN)) were predicted to be cleaved. The precision (TP/(TP+FP)) is 2.5%, which means that for each true cleavage site, the predictor predicts 38.4 non-cleaved sites as cleaved. In total, 1655 sites were predicted to be cleaved and fed into the second level predictor.

Fig. 2. The probabilities for observing loop, helix and strand structures at each site for the non-cleaved (upper part) and cleaved (bottom part) class. We use window (15,15) to demonstrate it.

The next experiment was to try the four different classifiers, two generative and two discriminative methods, with different window sizes, using secondary structure information on the 1655 sites, and to estimate the generalization performance using cross-validation. AUC (area under the ROC curve) was used to compare the four classifiers. The cross-validation was done in the following way: for each classifier, the whole data set was randomly divided into two parts; 80% of the data set was used to train the classifier, and the remaining 20% was used to test its prediction performance. The process was repeated 100 times for each window size. For SMOTE over-sampling, 5 nearest neighbors were used to interpolate new samples. The cleaved samples were over-sampled 3 times, and the uncleaved samples were under-sampled to get the same number as the cleaved samples after the over-sampling process.
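This evaluation loop can be sketched as follows (a scikit-learn style classifier is assumed; the stratified splits are our addition, to guarantee that cleaved samples occur in every test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def repeated_holdout_auc(model, X, y, n_repeats=100, test_size=0.2, seed=0):
    """Mean and standard deviation of the AUC over repeated 80/20 splits."""
    aucs = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + r)
        model.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return np.mean(aucs), np.std(aucs)
```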

Table 2. The best three classification generalization performances, area under the ROC curve, of the four different classifiers based only on the secondary structure information. Mean values are reported with standard deviations in parentheses.

       Logistic regression   SVM             Naive Bayes     Bayesian inference
  1    0.706 (0.095)         0.701 (0.068)   0.685 (0.079)   0.67 (0.079)
  2    0.702 (0.083)         0.698 (0.081)   0.684 (0.081)   0.67 (0.080)
  3    0.701 (0.083)         0.697 (0.075)   0.682 (0.080)   0.66 (0.078)

Table 2 lists the three largest AUC values for each classifier. Although the performance variance is around 8%, we can see that, on average, the best classifier is the one using logistic regression, and that SVM with sampling methods performs as well as logistic regression. The naive Bayes classifier is a little inferior to them, and Bayesian inference is almost the same as naive Bayes.

The third experiment estimates the influence of the cutoff value on the sensitivity, specificity and precision of the best classifier, logistic regression. In this experiment, cross-validation was used during training to tune the cutoff value over the outputs of the classifier, such that it gives the best geometric mean value. The generalization performances were then estimated by applying this tuned cutoff value to the held-out test data set. Table 3 lists the sensitivity, specificity and precision using the tuned cutoff value on the test data set. If only the predictor based on short oligopeptides is used, the precision is around 2.5%. After adding the second predictor, based on secondary structure information, the precision increases to 4.4%, a factor of about 1.7. In other words, for each true cleavage site there are 38 false ones from the first level classifier; after the second classifier, for each correct cleavage site it predicts 22 false cleavage sites.

Table 3. Sensitivity, specificity and precision performance estimated with the tuned cutoff value for the logistic regression classifier.

                        Sensitivity   Specificity   Precision   TP    FP      FN    TN
  Mean                  0.64          0.65          0.044       5.1   112.4   2.9   204.6
  Standard deviation    0.21          0.10          0.013       1.7   30.5    1.7   30.5

There are no exact criteria for choosing the best cutoff value. For this cleavage site prediction, if it is required to reach 90% sensitivity, we can lower the cutoff value, but we then get more false positives. Generally, the ROC curve gives a good basis for picking the cutoff value for a specific requirement. Figure 3 displays the ROC curve of the best classifier, using the logistic regression method.

Fig. 3. ROC curve of the best classifier with the logistic regression method. The upper curve is the ROC measured on the training data set and the lower curve is for the test data set. Error bars with standard deviations are also displayed.


5. Discussion

Two discriminative models (logistic regression and SVM) and two generative models (naive Bayes and Bayesian inference) were used to build classifiers with secondary structure information. From our experiments, there is no major difference between them; logistic regression is, on average, the best among them. For the Bayesian methods, a bad assumption about the data and model parameter distributions could affect performance quite a lot. With the Bayesian method, there are 2 × 3 × J parameters to estimate for the Dirichlet distributions. Logistic regression has only 2 × J + 1 parameters in the model, while SVM with sampling methods has around 2 × J + 5. Due to the lack of data, discriminative approaches are better than generative ones in general. Since logistic regression has few parameters and is fast to train, it is the method of choice in this case.

During the experiments, the secondary structure and solvent accessibility information were predicted by only one prediction server. It has not been tested how sensitive the classifiers are to the predicted structure and accessibility information. In addition, the hierarchical classifier does not consider the cleaving order in a protein with more than one cleavage site. If a protein is cleaved at the first cleavage site, it is cut into two fragments whose secondary structures might change, and previously buried parts can become exposed to the protease.

Another useful piece of information is the absolute position of predicted cleavage sites. Normally it is impossible to have cleavage sites at the very ends of a native protein, so this rule can be used to directly rule out some falsely predicted cleavage sites.

To conclude, the hierarchical classifier, which combines protein sequences, experimentally tested short oligopeptides, protein secondary structure and solvent accessibility information, can be used to detect cleavage sites in native proteins. By using the second level classification, based on secondary structure information, the false positive ratio is more than halved compared to the classifier using only short oligopeptide information on the first level. Structure and solvent accessibility data thus provide information to predict protease-substrate interactions. This method can also be used for other cleavage problems on native proteins.

Acknowledgments

The work was supported by the National Research School in Genomics and Bioinformatics hosted by Göteborg University, Sweden. The author also thanks Dr Martin Sköld, Dr Joel Tyndall, Prof Thorsteinn Rögnvaldsson and Dr Daniel Garwicz for discussions and pointers to important references.

References

1. Beck ZQ, Hervio L, Dawson PE, Elder JE, and Madison EL. Identification of efficiently cleaved substrates for HIV-1 protease using a phage display library and use in inhibitor development. Virology, 274:391-401, 2000.

2. Ridky TW, Bizub-Bender D, Cameron CE, Weber IT, Wlodawer A, Copeland T, Skalka AM, and Leis J. Programming the Rous sarcoma virus protease to cleave new substrate sequences. J Biol Chem, 271:10538-10544, 1996.

3. Ridky TW, Cameron CE, Cameron J, Leis J, Copeland T, Wlodawer A, Weber IT, and Harrison RW. Human immunodeficiency virus, type 1 protease substrate specificity is limited by interactions between substrate amino acids bound in adjacent enzyme subsites. J Biol Chem, 271:4709-4717, 1996.

4. Tozser J, Bagossi P, Weber IT, Louis JM, Copeland TD, and Oroszlan S. Studies on the symmetry and sequence context dependence of the HIV-1 proteinase specificity. J Biol Chem, 272:16807-16814, 1997.

5. Tozser J, Zahuczky G, Bagossi P, Louis JM, Copeland TD, Oroszlan S, Harrison RW, and Weber IT. Comparison of the substrate specificity of the human T-cell leukemia virus and human immunodeficiency virus proteinases. Eur J Biochem, 267:6287-6295, 2000.

6. Tomasselli AG, Hui JO, Adams L, Chosay J, Lowery D, Greenberg B, Yem A, Deibel MR, Zurcher-Neely H, and Heinrikson RL. Actin, troponin c, alzheimer amyloid precursor protein and pro-interleukin 1 beta as substrates of the protease from human immunodeficiency virus. J Biol Chem, 266(22):14548-53, 1991.

7. Shoeman RL, Honer B, Stoller TJ, Kesselmeier C, Miedel MC, Traub P, and Graves MC. Human immunodeficiency virus type 1 protease cleaves the intermediate filament proteins vimentin, desmin, and glial fibrillary acidic protein. Proc Natl Acad Sci USA, 87(16):6336-6340, 1990.

8. You L, Garwicz D, and Rognvaldsson T. Comprehensive bioinformatic analysis of the specificity of human immunodeficiency virus type 1 protease. J Virol, 79(19):12477-86, 2005.

9. Strack PR, Frey MW, Rizzo CJ, Cordova B, George HJ, Meade R, Ho SP, Corman J, Tritch R, and Korant BD. Apoptosis mediated by hiv protease is preceded by cleavage of bcl-2. Proc Natl Acad Sci USA, 93(18):9571-6, 1996.

10. Fairlie DP, Tyndall JD, Reid RC, Wong AK, Abbenante G, Scanlon MJ, March DR, Bergman DA, Chai CL, and Burkett BA. Conformational selection of inhibitors and substrates by proteolytic enzymes: implications for drug design and polypeptide processing. J Med Chem, 43(7):1271-81, 2000.

11. Tyndall JD, Nall T, and Fairlie DP. Proteases universally recognize beta strands in their active sites. Chem Rev, 105(3):973-99, 2005.

12. Boyd SE, Garcia de la Banda M, Pike RN, Whisstock JC, and Rudy GB. PoPS: A computational tool for modeling and predicting protease specificity. Proceedings of the IEEE Computer Society Bioinformatics Conference, Stanford, CA, page 372, 2004.

13. Rognvaldsson T and You L. Why neural networks should not be used for hiv-1 protease cleavage site prediction. Bioinformatics, 20(11):1702-1709, 2004.

14. Meier UC, Billich A, Mann K, Schramm HJ, and Schramm W. alpha 2-macroglobulin is cleaved by hiv-1 protease in the bait region but not in the c-terminal inter-domain region. Biol Chem Hoppe Seyler, 372(12):1051-6, 1991.

15. Oswald M and von der Helm K. Fibronectin is a non-viral substrate for the hiv proteinase. FEBS Lett, 292(1-2):298-300, 1991.

16. Riviere Y, Blank V, Kourilsky P, and Israel A. Processing of the precursor of nf-kappa b by the hiv-1 protease during acute infection. Nature, 350(6319):625-6, 1991.

17. Tomaszek TA Jr, Moore ML, Strickler JE, Sanchez RL, Dixon JS, Metcalf BW, Hassell A, Dreyer GB, Brooks I, Debouck C, et al. Proteolysis of an active site peptide of lactate dehydrogenase by human immunodeficiency virus type 1 protease. Biochemistry, 31(42):10153-68, 1992.

18. Chattopadhyay D, Evans DB, Deibel MR Jr, Vosters AF, Eckenrode FM, Einspahr HM, Hui JO, Tomasselli AG, Zurcher-Neely HA, Heinrikson RL, and Sharma SK. Purification and characterization of heterodimeric human immunodeficiency virus type 1 (hiv-1) reverse transcriptase produced by in vitro processing of p66 with recombinant hiv-1 protease. J Biol Chem, 267(20):14227-32, 1992.

19. Freund J, Kellner R, Konvalinka J, Wolber V, Krausslich HG, and Kalbitzer HR. A possible regulation of negative factor (nef) activity of human immunodeficiency virus type 1 by the viral protease. Eur J Biochem, 223(2):589-93, 1994.

20. Mildner AM, Paddock DJ, LeCureux LW, Leone JW, Anderson DC, Tomasselli AG, and Heinrikson RL. Production of chemokines ctapiii and nap/2 by digestion of recombinant ubiquitin-ctapiii with yeast ubiquitin c-terminal hydrolase and human immunodeficiency virus protease. Protein Expr Purif, 16(2):347-354, 1999.

21. Alvarez E, Menendez-Arias L, and Carrasco L. The eukaryotic translation initiation factor 4gi is cleaved by different retroviral proteases. J Virol, 77:12392-400, 2003.

22. Tomasselli AG, Howe WJ, Hui JO, Sawyer TK, Reardon IM, DeCamp DL, Craik CS, and Heinrikson RL. Calcium-free calmodulin is a substrate of proteases from human immunodeficiency viruses 1 and 2. Proteins, 10(1):1-9, 1991.

23. Alvarez E, Castello A, Menendez-Arias L, and Carrasco L. Human immunodeficiency virus protease cleaves poly(a) binding protein. Biochem. J., Immediate Publication, doi:10.1042/BJ20060108, 2006.

24. Rost B, Yachdav G, and Liu J. The predictprotein server. Nucleic Acids Research, 31(13):3300-3304, 2003.

25. Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP. Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.

26. Chang CC and Lin CJ. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.


A METHODOLOGY FOR MOTIF DISCOVERY EMPLOYING ITERATED CLUSTER RE-ASSIGNMENT

Osman Abul*,† and Finn Drabløs

Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology,

Trondheim, Norway

*Email: [email protected], finn.drablos@ntnu.no

*Corresponding author. †This work was carried out during the tenure of an ERCIM fellowship.

Geir Kjetil Sandve

Department of Computer and Information Science, Norwegian University of Science and Technology,

Trondheim, Norway

Email: [email protected]

Motif discovery is a crucial part of regulatory network identification, and is therefore widely studied in the literature. Motif discovery programs search for statistically significant, well-conserved and over-represented patterns in given promoter sequences. When gene expression data is available, there are mainly three paradigms for motif discovery: cluster-first, regression, and joint probabilistic. The success of motif discovery depends highly on the homogeneity of the input sequences, regardless of the paradigm employed. In this work, we propose a methodology for obtaining homogeneous subsets of the input sequences for increased motif discovery performance. It is a unification of the cluster-first and regression paradigms based on iterative cluster re-assignment. The experimental results show the effectiveness of the methodology.

1. INTRODUCTION

Transcription Factors (TFs) are proteins that selectively bind to short pieces (5-25 nt long) of DNA, so called Transcription Factor Binding Sites (TFBSs). Although TFs bind in a selective way, they allow some degeneracy in their binding sites, forming Transcription Factor Binding Motifs (TFBMs), or just motifs. This property creates the TFBS representation problem, i.e. the choice of language in which motifs are expressed. The most common representations are motif consensus over IUPAC codes, mismatch strings and position specific weight matrices (PSWMs), as well as their variants and specializations.

Finding TFBMs is an important step in the elucidation of genetic regulatory networks (a). There are basically two methods for finding TFBMs, experimental and computational, although they usually benefit from each other. ChIP-chip experiments can analyze the genome-wide binding of a specific TF. For instance, Lee et al. 19 have conducted experiments for over 100 TFs for experimental identification of the regulatory networks of Saccharomyces cerevisiae. Unfortunately, the resolution (about 1K-nt) is not enough to exactly identify binding locations. Other problems include condition specific binding, measurement noise, and difficulty in finding an optimal consensus motif. TFBMs are functional elements of genes and are preserved throughout evolution. This property, together with the available completed genetic maps of many species, has made computational identification based solely on sequence data possible. That is, since these regions have accumulated very few mutations compared to non-functional parts, it is possible to find them computationally by just exploiting their statistical over-representation. Computational approaches built around this fact include MEME 2, BioProspector 20, AlignACE 12, Consensus 10, and MDScan 21, among many others.

(a) Regulatory network identification methods are also studied without explicitly focusing on the use and discovery of TFBMs 33, 27, 36, 28, 24; in this paper we do not cover these approaches.

TFs bind to their respective TFBSs in the promoter regions of their target genes. Each gene can have a number of TFBSs for several different TFs in its promoter sequence. In eukaryotes, TFBSs are organized in modules: sets of TFBSs for a number of TFs. Each TF can function as inducer or repressor, and this process is combinatorial, i.e. it depends qualitatively and quantitatively on the binding of the other TFs. This combinatorial behavior can cause non-additive expression behavior for their common targets. In general, intra-module couplings are much stronger than inter-module couplings. Expression behavior also depends on genome-wide global conditions.

To understand the governing rules for gene expression, we need to know 1) all TFs, 2) their abundance and activity under varying conditions, 3) their binding sites, and 4) their combinatorial joint regulation of target expression 35, 9. From this, it is clear that to induce regulatory networks computationally we need both sequence and functional data. Typically, the sequence data employed are the inter-genic promoter regions upstream of transcription start sites, while the functional data are obtained from microarray experiments under various conditions. Other useful sources of data for motif (and module) discovery include ChIP-chip experiments (e.g. 3), TFBM databases (e.g. 26), and phylogenetic relations (e.g. 14).

The success of motif discovery programs depends on the quality of the input data. That is, they typically give many false positives/negatives if the input genes are heterogeneous with respect to regulation. To make the input genes homogeneous, they are clustered before they are presented to the motif discovery programs; hence this is called the cluster-first approach. Since gene expression depends on the combinatorial binding of TFs to TFBMs, co-expressed genes are assumed to be co-regulated, and genes are therefore clustered based on their expression profile similarity over a course of microarray experiments. Each cluster (whose sequences are then likely to contain homogeneous TFBMs) is given as input to motif finding programs (MEME, BioProspector, MDScan etc.).

An alternative to the cluster-first approach is to start from a large set of putative motifs and filter them by regressing on expression data. The idea behind this approach is to remove non-relevant motifs and thereby reduce the number of false positives. Examples of this approach include Reduce 7, Motif Regressor 8, 21, a boosting approach also employing ChIP-chip data (Hong et al. 15), and a logic regression approach by Keles et al. 17.

Although a number of algorithms and programs have been developed for motif discovery, little has been done on designing a methodology for their optimal usage. In particular, little attention has been paid to the selection of homogeneous subsets from the heterogeneous gene sets of interest. In practice, what an experimenter does is 1) cluster the gene sets of interest (using a clustering program like k-means, hierarchical clustering, self-organizing maps, etc.), then 2) input them to one or a few motif finding programs, and finally 3) decide on the true motifs among all the candidates, either by further analysis (like regression) or manually. Though clustering before motif discovery improves homogeneity compared to random subsets, it might fail in finding the true clusters. Motivated by this, we here study the generation of homogeneous clusters using both sequence and expression data, and we address the issue of a methodology for motif discovery.

We define an iterative procedure (a methodology) for the motif discovery process. Briefly, we start with an initial clustering of gene sets from gene expression data and find motifs in these clusters. We then (optionally) refine these motifs by filtering out irrelevant ones; in this step, simple filtering or filtering employing regression analysis is applied. After that, we screen all the genes with the motif profiles of each cluster and refine the clusters by re-assignment based on the screening scores. Following this, we restart motif discovery on the new gene clusters, and iterate this procedure until convergence. Finally, we output the set of motifs found in the last iteration.

2. POWERING MOTIF DISCOVERY USING GENE EXPRESSION DATA

The three main paradigms for incorporating gene expression data into motif discovery are cluster-first, regression and joint probabilistic.

Brazma et al. 6 presented one of the earliest methods within the cluster-first paradigm. They look for over-represented oligos with limited degeneracy, both genome-wide and for clusters generated from gene expression clustering based on time series data. The approach taken by Beer et al. 5 is also cluster-first: the genes are clustered using expression data with k-means clustering, and AlignACE 12 is used for motif discovery. A very similar approach using a custom clustering algorithm is presented in 23.

A variant of the cluster-first approach is the TFCC (Transcription Factor Centric Clustering) of Zhu et al. 37. The idea is to find a set of genes showing expression profiles similar to that of a particular TF over a set of expression experiments, and then look for motifs in that cluster using AlignACE 12. Similarly, Hvidsten et al. 13 find genes similar to a selected gene using the expression data, and construct logical rules (in the form of if-then rules) in terms of the absence/presence of a priori given motifs. The objective in this approach is not to find novel motifs but motif modules. An indirect cluster-first approach is presented by Tamada et al. 34, where the objective is finding regulatory networks and motif discovery is an intermediate step used to refine the network. Briefly, they construct a regulatory network from gene expression data, identify TFs from the induced network, and search for motifs in the sequence data of the subtrees of the TFs.

One of the earliest works using regression on gene expression data for motif discovery is the Reduce method of Bussemaker et al. 7. The objective is to find the best minimal set of motifs (k-mers) capable of explaining the gene expression data. The method uses single gene expression experiments over which oligo scores are linearly regressed. The model is fit in an iterative manner, i.e. starting with an empty set and adding the most significant motif to the model in each iteration until there is no statistical improvement. Similarly, Conlon et al. 8 introduced a linear regression method called Motif Regressor. They employ MDScan 21 to extract features, sets of candidate motifs, from the sequence data. From the resulting large number of putative motifs, the insignificant ones are eliminated through regression. The LogicMotif approach of Keles et al. 17 uses two-step logistic regression on a single gene expression experiment. In the first step, the set of all over-represented oligos (allowing limited degeneracy) in the input sequences is identified as candidate motifs. In the second step, for each sequence a binary score vector (serving as a covariate vector) is constructed, in which each entry corresponds to the existence of a motif type (or a logical function of a subset of all motif types, a so-called logic tree), and this vector is regressed on the expression data. The Rim-Finder system of Zilberstein et al. 38 is another method using the regression approach. Identification of synergistic effects of pairs of motifs using co-expression has also been studied 26.

Methods for binary regression (classification) have also been developed. A large-margin classification approach, called Medusa, using boosting together with alternating decision trees is given in 22. Likewise, the recent study by Hong et al. 15 presents a boosting approach for motif discovery. They formulate the problem of motif discovery as classification of ChIP-chip data, and find motifs accordingly.

The idea of using a joint probabilistic paradigm was first proposed by Holmes and Bruno 11. The idea is to model probabilistic interactions between sequence features (motifs) and expression data. The approach has been extensively studied by Segal et al. 30, 29, 32, 31, 4 on a few model variants, all employing Bayesian reasoning. The basic variant assumes that transcriptional modules depend on sequence information and in turn determine gene expression. The approach learns transcriptional module motif profiles using the Expectation-Maximization (EM) algorithm. Another similar probabilistic clustering algorithm, jointly using sequence and time series expression data, is presented in 18, where each cluster represents a transcriptional module that in turn determines the motif profile and gene expression of the genes in the module. However, they assume an initial set of motifs given a priori and assign motif profiles to modules after the clustering finishes, i.e. as a post-processing step.

3. A MOTIF DISCOVERY METHODOLOGY

In the cluster-first approaches, clustering based on gene expression data is assumed to represent true functional clusters. Due to noise in the data, uncertainty about the number of clusters, and lack of true knowledge of optimal distance measures, the results only partially represent true clusters. It is also the case that genes with TFBSs for the same TF are not necessarily co-expressed during a specific time-course, as gene expression is combinatorial and therefore depends on several factors.

To explore the claims above, we have conducted experiments on some subsets of the S. cerevisiae gene clusters reported by Harbison et al. 9 and on the genome-wide gene expression data by Gasch et al. 25; more information on these datasets is provided in the Experiments section. In Figure 1 we show the Silhouette index of the true clusterings and of the clustering induced by k-means for two subsets. The Silhouette index, ranging from -1 to 1, measures how similar a point is to the points in its own cluster compared to points in other clusters; larger index values indicate good cluster separation. The results agree with our claims that genes having similar motifs need not be co-expressed, and that co-expression clustering therefore can be deceptive for motif discovery.

On the other hand, as shown in Figure 2, clustering (using gene expression) before motif discovery improves the quality of the discovered motifs. In this analysis, we have used MDScan for motif discovery and selected random subsets of 500 genes from the over 6000 genes of the Gasch et al. dataset, with the number of clusters set to 5. Note also that the scores in Figure 2:b are higher than those in Figure 2:a. This makes sense, as selection of homogeneous clusters instead of random clusters gives better candidates for motif discovery, as already discussed.

To get the advantages of gene expression clustering, while at the same time avoiding its potential deceptiveness, we propose a methodology for discovering regulatory motifs using both gene expression and upstream sequence data.

The methodology is illustrated in Figure 3. It starts with an initial clustering from gene expression data. Following this, a motif discovery algorithm is used to find candidate motifs for each cluster. These motifs are then optionally regressed on the gene expression values for motif filtering; only the significant motifs are retained. Since motif discovery is applied to each cluster, the discovered motifs can be regarded as motif profiles of their respective clusters. Motifs from the motif discovery algorithm are regarded as putative and are further refined by filtering; in this way, only significant and relevant motifs are kept. After that, the motif profiles are used to screen all the genes, i.e. a score for each gene against each motif profile is computed. Based on the motif scores, genes are re-assigned to clusters. The idea here is that if a gene is closer to the motif profile of a different cluster than to that of its current one, then its cluster membership should be changed based on this new evidence. These steps (motif discovery, filtering, screening and cluster re-assignment) are iterated until the clusters converge, and a final set of motifs is output as the motif profile of each cluster. Note that we do not force the use of any particular clustering, motif discovery, filtering or screening algorithm.
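The whole procedure can be summarized in a few lines of Python; cluster, discover_motifs, filter_motifs and score_gene below are placeholders for whichever algorithms the user plugs into each step:

```python
def iterative_motif_discovery(genes, Y, S, n_clusters, cluster,
                              discover_motifs, filter_motifs, score_gene,
                              max_iter=20):
    # Initial clustering from expression data only; cluster() is assumed
    # to return a dict mapping each gene to a cluster index.
    assignment = cluster(Y, n_clusters)
    for _ in range(max_iter):
        members = [[g for g in genes if assignment[g] == c]
                   for c in range(n_clusters)]
        # Motif discovery and (optional) filtering, per cluster.
        profiles = [filter_motifs(discover_motifs(S, m), Y) for m in members]
        # Screening: re-assign each gene to its best-scoring motif profile.
        new_assignment = {g: max(range(n_clusters),
                                 key=lambda c: score_gene(S[g], profiles[c]))
                          for g in genes}
        if new_assignment == assignment:   # clusters converged
            break
        assignment = new_assignment
    return profiles, assignment
```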

We will now define a basic vocabulary to be used in the remaining parts of this section. Let G = {G_g}, g = 1, ..., |G|, be the set of genes and E = {E_e}, e = 1, ..., |E|, be the set of gene expression experiments (either time series or various treatment conditions). Our input data are DNA sequence data, extracted from the regions upstream of the transcription start sites, and gene expression data. Define S = {S_g : g ∈ G} as the sequence data, such that S_g = {S_gl : l = 1, ..., |S_g|}, where S_gl ∈ {A, C, G, T} is the nucleotide in the l'th position and |S_g| is the length of the sequence for gene g. Finally, define the gene expression matrix as Y = {Y_ge : g = 1, ..., |G|; e = 1, ..., |E|}, where Y_ge is the pre-processed gene expression value for gene g under experiment e. For convenience, we also define Y_g as the expression vector for g over all experiments and Y_e as the expression vector for e over all genes.

3.1. Clustering

The input to the clustering step is Y. The task is to partition G into a given number of partitions based on similarity computed from the Y_g. In the literature, a number of clustering algorithms for gene expression data have been designed and employed. A crucial point in clustering is to decide on the clustering method (k-means, self-organizing maps and hierarchical clustering are the most common) and the similarity measure (e.g. Euclidean distance, Mahalanobis distance, Pearson correlation). For instance, in 37 and 5, modified k-means algorithms with the Pearson correlation coefficient are used.


Fig. 1. Silhouette index scores for true clustering versus k-means clustering: a) gene cluster {CBF1,FHL1,BAS1,INO4,MBP1}; b) gene cluster {CBF1,BAS1,MBP1,MSN2,REB1}.

Fig. 2. MDScan scores: a) random 500 genes (random clustering versus k-means clustering); b) gene cluster {CBF1,BAS1,MBP1,MSN2,REB1}.

It is also important to decide on the number of clusters; some methods estimate it by applying model selection (e.g. using cross-validation 18, or adaptive quality-based clustering 23). Since hierarchical clusterings are exploratory and flexible, they are usually the preferred choice.

Since the clustering step is done only once in our approach, and serves only as a source of good initial clusters, we leave the selection of the clustering algorithm to the user, as different clustering algorithms may be optimal depending on the specific dataset. We denote the initial clustering result by {C_c^1}, c = 1, ..., |C^1|, where |C^1| is the number of clusters.

3.2. Motif Discovery

The motif discovery methods basically differ in their motif representation (e.g. IUPAC codes, regular expressions, PSWMs), search algorithm (e.g. Gibbs sampling, Expectation-Maximization, word counting), and exploitation of biological knowledge (e.g. fixed/flexible gaps, bi-modality, palindromic motifs, motifs in modules, inter-dependence of motif locations).

For our purpose, any PSWM based motif finding method like AlignACE, MEME, MDScan and BioProspector can be used in this step. We apply the motif discovery algorithm to each cluster separately and independently. Let us denote the clusters at the i'th iteration by C^i and the motif set output from cluster C_c^i by M_c^i.

3.3. Motif Filtering

Given the resulting motifs M_c^i of the motif discovery step, the filtering step outputs a subset of these motifs, denoted M_c^i'.

The reason we introduce a filtering step is that statistical over-representation does not necessarily imply biological significance. In other words, some statistically over-represented motifs may be either artifacts of the motif discovery program or simple tandem repeats.


Fig. 3. Iterative Motif Discovery Approach. [Flowchart: expression data feeds the initial clustering; each cluster goes through motif discovery and filtering; the filtered motif profiles are used to screen the sequence data of all genes, and the resulting scores drive cluster re-assignment, which feeds back into motif discovery.]

As those artifact motifs are generally not consistent with expression, this filtering step has the potential of eliminating most of them. Since the success of regression based approaches depends highly on the initial putative motifs, feeding these programs with statistically over-represented motifs usually gives better results. Note that it is also possible to have a simple filter that does not employ expression data, e.g. one that filters nothing, or filters putative motifs based only on their motif scores.

3.4. Screening and Cluster Re-assignment

Given the motif profile M_c^i of each cluster, we score all the genes, including the genes in other clusters, measuring their conformance to the motif profile. This creates a vector of cluster motif profile conformance measures for each gene. We use the matrix similarity score metric reported by Kel et al. 16. The metric basically uses the information content of the PSWM and scales each k-mer between the maximum possible and minimum possible match scores.
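A sketch of this metric, following our reading of the information-content weighted matrix similarity score of Kel et al. (the exact implementation used in the paper may differ in details such as pseudocounts):

```python
import numpy as np

def matrix_similarity_score(pswm, kmer, alphabet="ACGT"):
    """Information-weighted match score in [0, 1], scaled between the
    minimum and maximum possible matches. pswm: (k, 4) base frequencies."""
    f = np.asarray(pswm, dtype=float)
    info = (f * np.log(4.0 * f + 1e-12)).sum(axis=1)   # I(i) per position
    idx = [alphabet.index(b) for b in kmer]
    current = (info * f[np.arange(len(kmer)), idx]).sum()
    lo = (info * f.min(axis=1)).sum()
    hi = (info * f.max(axis=1)).sum()
    return (current - lo) / (hi - lo)

def best_window_score(pswm, sequence):
    """Conformance of a whole promoter sequence: best-scoring k-mer."""
    k = len(pswm)
    return max(matrix_similarity_score(pswm, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))
```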

The genes are then assigned to the cluster to which they have the highest conformance, thus creating a cluster re-assignment. If the re-assigned clustering is the same as, or very similar to, the previous clustering, the set of motifs for each cluster is output; otherwise the iteration continues with the new gene clusters C^{i+1}.

4. EXPERIMENTS

To assess the merit and relevance of the methodology presented, we conduct several experiments on real datasets for S. cerevisiae.

We use the gene expression dataset by Gasch et al. 25. The dataset contains over 150 gene expression arrays (measured under several conditions, with repetitions) for 6371 ORFs of S. cerevisiae. We pre-process the dataset by log transforming the background corrected intensities. Since the dataset contains missing values, we eliminate arrays and genes with a considerable number of missing entries. This gives 149 arrays and 6107 ORFs, which can be considered as a 6107 × 149 matrix. There are still missing values in this matrix, and we impute them with the k-nn imputation method. The method, for each gene with missing values, identifies the k closest genes over the non-missing entries and then imputes each missing value by the average of the corresponding column values for those k genes. After this we have a complete expression matrix. As for the sequence data, we use at most 500 bases (-500 to -1) upstream of the transcription start site for each of the 6107 genes.
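A sketch of the k-nn imputation step (variable names are ours; distances are computed only over the arrays that both genes have observed):

```python
import numpy as np

def knn_impute(Y, k=10):
    """Fill NaN entries of the expression matrix Y (genes x arrays) with
    the average over the k most similar genes."""
    Y = Y.copy()
    observed = ~np.isnan(Y)
    for g in np.where((~observed).any(axis=1))[0]:
        shared = observed & observed[g]          # columns comparable to gene g
        # Mean squared distance over the shared columns, per gene.
        d = np.nansum((Y - Y[g]) ** 2 * shared, axis=1) / shared.sum(axis=1).clip(1)
        d[~shared.any(axis=1)] = np.inf          # no overlap: not a neighbor
        d[g] = np.inf                            # exclude the gene itself
        nearest = np.argsort(d)[:k]
        for e in np.where(~observed[g])[0]:
            vals = Y[nearest, e]
            vals = vals[~np.isnan(vals)]
            if vals.size:                        # average of the k neighbors
                Y[g, e] = vals.mean()
    return Y
```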

There are many alternative methods and tools that could be used in the different steps of our methodology. Since our objective here is to show its effectiveness, we do not experiment with an extensive set of methods and tools, but rather with a few practical ones. In all of the experiments we have selected k-means as the clustering algorithm and MDScan as the motif discovery algorithm, and use either a trivial identity filter or linear regression based filters. The k-means and MDScan algorithms were chosen mainly because they are fast. This is particularly important for the choice of motif discovery algorithm, as it is run for each cluster in every iteration. Although MDScan is originally designed for ChIP-chip experiments, it also works well without ChIP-chip data.

As performance measures we use MDScan scores, convergence, the Jaccard index and the Silhouette index. The MDScan score is used to quantify the strength of motifs within clusters. In cases where experimentally determined binding sites are available, the correspondence between predicted and known sites could have been used as a performance measure; we preferred the MDScan score as it is more objective and more general. Convergence is measured by the number of re-assigned genes, so it is a natural metric for our methodology. We use the Silhouette index and the Jaccard index as cluster separation and cluster similarity metrics, respectively.
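For completeness, a small sketch of a pair-counting Jaccard index between two clusterings; the paper does not spell out which variant it uses, so this particular definition is an assumption:

```python
from itertools import combinations

def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard index in [0, 1] between two clusterings of
    the same items; 1 means identical cluster structure."""
    ss = sd = ds = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1          # pair together in both clusterings
        elif same_a:
            sd += 1          # together in A only
        elif same_b:
            ds += 1          # together in B only
    return ss / (ss + sd + ds) if ss + sd + ds else 1.0
```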

4.1. Random Clusters

In this experiment, we randomly select 500 genes among the 6107 genes and cluster them into 5 clusters by k-means clustering. For each cluster we use the same parameter settings for MDScan, as follows (with default values for the other parameters):

• motif width = 8
• number of motifs to report = 2
• number of top sequences to seed = 20
• number of motifs to be kept for refinement = 4 × number of motifs to report

The order of the genes presented to MDScan is relevant. In our experiments we have used random orders to avoid any bias (this is because we do not use ChIP-chip data). On the other hand, we conjecture that the genes could have been sorted based on distance to their cluster centroids, thereby possibly improving motif discovery.

Figure 4 gives the results for 20 runs (iteration 1 is the result of the initial k-means clustering). In all of these runs we employ k-means as the initial clustering algorithm and use the trivial identity filter, starting from converged k-means results at iteration number 1. From the figure we see that the number of re-assigned genes decreases over the iterations, suggesting convergence. The MDScan scores of the clusters also increase with the iterations. It is clear from both panels that our approach is able to correct some of the deceptiveness of the initial clustering.

Fig. 4. Performance results on random clusters: a) MDScan scores; b) convergence (number of re-assigned genes), by iteration number.

We have tested how sensitive our methodology is to the initial clustering by running it with random initial clusterings. We have also tested the approach while varying the number of random genes, the number of clusters and the MDScan parameters. In all of these cases, results similar to those reported in Figure 4 were observed. This means that the improvement offered by the methodology does not depend on particular settings of the initial clusters or the motif discovery tools.

4.2. Harbison et al. Clusters

Harbison et al. 9 identified S. cerevisiae target genes for a number of TFs by collecting results from the following resources: ChIP-chip data, published data from the literature, and phylogenetic conservation. As a result, they defined several dozens of (overlapping) gene clusters, one per binding motif. They also confirmed the results by applying several motif discovery programs (MEME, MDScan, AlignACE, etc.). We therefore treat these clusters as true clusters for our purpose.

We conduct experiments with the following three gene subsets drawn from Table 1:

• Subset 1: {CBF1, FHL1, BAS1, INO4, MBP1} (5 clusters),


Fig. 4. Performance results on random clusters: (a) MDScan scores and (b) convergence (number of re-assigned genes) vs. iteration number.

• Subset 2: {CBF1, BAS1, MBP1, MSN2, REB1} (5 clusters),

• Subset 3: {CBF1, REB1} (2 clusters)

As pre-processing, we remove genes not contained within the 6107 genes of the Gasch et al. dataset. We also remove genes found in more than one cluster. This ensures that each gene belongs to exactly one cluster. As a result, subsets 1, 2, and 3 contain 399, 403, and 253 genes, respectively.

Table 1. Clusters used in experiments

Regulator:               CBF1  FHL1  BAS1  INO4  MBP1  MSN2  REB1
# of genes in cluster:   195   131   17    32    92    74    99

We show the general utility of the methodology on subsets 1 and 2. The number-of-clusters parameter for k-means is set to 5. The MDScan parameters are the same as in Section 4.1, and trivial identity filters are employed.

In Figure 5:a-d, the average MDScan scores and convergence performances are shown for subsets 1 and 2 over 20 runs. The results clearly indicate improved MDScan scores and convergence over the iterations. With the same parameters, the true clustering for subset 1 (subset 2) has an MDScan score of 4.00 (4.14) and 230 (248) re-assigned genes. This shows that even though the true clusterings perform better than k-means clustering, there is still potential to increase performance by cluster re-assignment. Figure 5:e-f shows the Silhouette index of the clusterings for subsets 1 and 2 through the iterations, and also for the true clustering. We conclude from the figure that our method achieves a Silhouette index similar to that of the true clusters, whereas k-means clustering (iteration 1) destroys the original clustering. To measure the similarity between the true clustering and the clusterings over iterations we use the Jaccard index. This index, ranging from 0 to 1, is an external cluster validation measure of how similar one clustering is to another; a high value indicates high similarity. The results, shown in Figure 5:g-h, are inconclusive. This might be due to sequence similarities (such as the TATA box) across different clusters of the true clustering, and to the failure of MDScan to find cluster-specific motifs: for instance, it finds specific binding sites for the true clusters of CBF1, FHL1, INO4, MBP1 and REB1, but fails for the true BAS1 and MSN2 clusters.
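For reference, the Jaccard index used above is the standard pair-counting form; the sketch below is our own illustration (the Silhouette index can be obtained analogously, e.g. from scikit-learn's silhouette_score), not the evaluation code used in the experiments.

from itertools import combinations

def jaccard_index(labels_a, labels_b):
    # Pair-counting Jaccard index between two clusterings of the same genes:
    # n11 = pairs co-clustered in both, n10/n01 = pairs co-clustered in one only.
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
    return n11 / (n11 + n10 + n01)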

We test the effect of using different filters on subset 3. Filter 1 is a trivial identity filter, while Filters 2 and 3 are stepwise multiple linear regression filters. The only difference between Filters 2 and 3 is the scoring function used to compute the covariate vector: Filter 2 uses the function defined in 16 and Filter 3 uses the one employed in Motif Regressor. For Filter 1 the number of motifs reported by MDScan is set to 2, while for Filters 2 and 3 it is set to 15; the other parameters are the same as in Section 4.1. The number-of-clusters parameter for k-means is set to 2 for all filters.

Fig. 5. Performance results for subsets 1 and 2 over iterations: (a)-(b) MDScan scores, (c)-(d) convergence (number of re-assigned genes), (e)-(f) Silhouette index, and (g)-(h) Jaccard index, with true-clustering values shown for reference.


Fig. 6. Performance results for subset 3 for Filters #1, #2 and #3: (a) convergence, (b) MDScan score, and (c) Jaccard index vs. iteration number.

Since we have 149 expression experiments, we regress once for each experiment and count how often each motif is selected by the stepwise regression. Finally, we use the 2 most frequently selected motifs out of 15 (i.e., 13 motifs are filtered out) as the cluster motif signature and use them for the cluster conformance computation in the re-assignment step.
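As an illustration of this filtering step, the sketch below implements a simplified greedy forward selection in place of the full stepwise multiple linear regression, together with the per-experiment frequency count; all names are ours, and the entry criterion (relative drop in residual sum of squares) is an assumption, not the paper's exact rule.

import numpy as np
from collections import Counter

def forward_select(X, y, max_terms=5, min_gain=1e-3):
    # X: genes x motif-scores for one cluster; y: expression in one experiment.
    # Greedily add the motif that most reduces the residual sum of squares.
    selected = []
    rss = float(np.sum((y - y.mean()) ** 2))
    while len(selected) < max_terms:
        best_j, best_rss = None, rss
        for j in range(X.shape[1]):
            if j in selected:
                continue
            A = np.column_stack([np.ones(len(y))] + [X[:, m] for m in selected + [j]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            r = float(np.sum((y - A @ beta) ** 2))
            if r < best_rss:
                best_j, best_rss = j, r
        if best_j is None or (rss - best_rss) / rss < min_gain:
            break
        selected.append(best_j)
        rss = best_rss
    return selected

def signature_motifs(X, expr, n_keep=2):
    # Regress once per experiment (columns of expr), count selections, and
    # keep the n_keep most frequently selected motifs as the cluster signature.
    counts = Counter(j for e in range(expr.shape[1])
                     for j in forward_select(X, expr[:, e]))
    return [j for j, _ in counts.most_common(n_keep)]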

Figure 6 gives the results averaged over 20 runs. All performance measures improve over the initial k-means clustering for every filter used. Note that in Figure 6:a there is full convergence, and in Figure 6:c the Jaccard index increases, suggesting recovery of the true clustering for all filters; recall that this issue was inconclusive for subsets 1 and 2. We conclude that the methodology works well for all of these filters. It is also possible to compare the filters on the performance measures: on average MDScan score, Filter 1 scores best, while in terms of Jaccard index Filter 2 scores slightly better. This illustrates the tradeoff in selecting among multiple filters, and is precisely why we have introduced a filtering step in our methodology.

5. CONCLUSION

In this work, we have developed a methodology for motif discovery. It is organized around the idea of obtaining highly homogeneous gene clusters using both sequence and expression data. We do this by screening all genes and re-assigning them to clusters over several iterations.

The analysis and experimental results show that clustering based on gene expression is a better basis for motif discovery than random clustering, but it is not perfect and may even mislead. Our method is designed to compensate for these two issues and thereby improve the quality of motif discovery. The conducted experiments clearly demonstrate the utility of our approach.

The methodology is quite flexible: it is not built around a particular motif discovery, filtering, screening or clustering algorithm. In other words, a broad range of algorithms developed in the field can be used within it.

The methodology presented here can also be considered a unification of the cluster-first and regression-based motif discovery paradigms into a single framework. Our approach is similar to the joint probabilistic approaches, especially to Tamada et al.,34 whose main motivation is finding regulatory networks rather than discovering motifs. In general, however, it differs from these approaches in that it does not establish any probabilistic relationship between gene expression and sequence information.

We have also shown the importance of the filtering step: regardless of the actual filtering method used, the methodology works well, i.e. it improves over the initial clustering.

Future work will focus on assessing the general utility and performance of our methodology compared to joint probabilistic modeling.

References

1. Timothy L. Bailey and Charles Elkan. The Value of Prior Knowledge in Discovering Motifs with MEME. In Proc. of the ISMB'95, Menlo Park, CA, 1995.

2. Timothy L. Bailey and Charles Elkan. Unsupervised Learning of Multiple Motifs in Biopolymers using Expectation Maximization. Machine Learning, (21):51-80, 1995.

3. Yoseph Barash, Gal Elidan, Nir Friedman, and Tommy Kaplan. Modeling Dependencies in Protein-DNA Binding Sites. In Proc. of the 7th International Conf. on Research in Computational Molecular Biology, Berlin, Germany, 2003.

4. Alexis Battle, Eran Segal, and Daphne Koller. Probabilistic Discovery of Overlapping Cellular Processes and Their Regulation. In Proc. of 9th RECOMB, San Diego, CA, 2004.

5. Michael A. Beer and Saeed Tavazoie. Predicting gene expression from sequence. Cell, 117:185-198, 2004.

6. Alvis Brazma, Inge Jonassen, Jaak Vilo, and Esko Ukkonen. Predicting Gene Regulatory Elements in Silico on a Genomic Scale. Genome Research, 8:1202-1215, 1998.

7. Harmen J. Bussemaker, Hao Li, and Eric D. Siggia. Regulatory Element Detection using Correlation with Expression. Nature Genetics, 27:167-171, February 2001.

8. Erin M. Conlon, X. Shirley Liu, Jason D. Lieb, and Jun S. Liu. Integrating Regulatory Motif Discovery and Genome-wide Expression Analysis. PNAS, 100(6):3339-3344, 2003.

9. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, and Young RA. Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004):99-104, 2004.

10. Gerald Z. Hertz and Garry D. Stormo. Identifying DNA and Protein Patterns with Statistically Significant Alignments of Multiple Sequences. Bioinformatics, 15(7/8):563-577, 1999.

11. Ian Holmes and William J. Bruno. Finding regulatory elements using joint likelihoods for sequence and expression profile data. In Proc. of Eighth International Conference of Intelligent Systems for Molecular Biology, pages 202-210, 2000.

12. Jason D. Hughes, Preston W. Estep, Saeed Tavazoie, and George M. Church. Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae. Journal of Molecular Biology, (296):1205-1214, 2000.

13. Torgeir R. Hvidsten, Bartosz Wilczynski, Andriy Kryshtafovych, Jerzy Tiuryn, Jan Komorowski, and Krzysztof Fidelis. Discovering Regulatory Binding-site Modules using Rule-based Learning. Genome Research, (15):856-866, 2005.

14. Shane T. Jensen, Lei Shen, and Jun S. Liu. Combining Phylogenetic Motif Discovery and Motif Clustering to Predict Co-regulated Genes. Bioinformatics, 21(20):3832-3839, 2005.

15. Katherina J. Kechris, Erik van Zwet, Peter J. Bickel, and Michael B. Eisen. A Boosting Approach for Motif Modeling using ChIP-chip Data. Bioinformatics, 21(11):2636-2643, 2005.

16. A.E. Kel, E. Gössling, I. Reuter, E. Cheremushkin, O.V. Kel-Margoulis, and E. Wingender. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Research, 31(13):3576-3579, 2003.

17. Sündüz Keleş, Mark J. van der Laan, and Chris Vulpe. Regulatory Motif Finding by Logic Regression. U.C. Berkeley Biostatistics Working Paper Series, (145), 2004.

18. Anshul Kundaje, Manuel Middendorf, Feng Gao, Chris Wiggins, and Christina Leslie. Combining sequence and time series expression data to learn transcriptional modules. IEEE Transactions on Computational Biology and Bioinformatics, 2(3):194-202, 2005.

19. T. Lee, N. Rinaldi, F. Robert, D. Odom, Z. Bar-Joseph, G. Gerber, N. Hannett, C. Harbison, C. Thompson, I. Simon, J. Zeitlinger, E. Jennings, H. Murray, D. Gordon, B. Ren, J. Wyrick, J. Tagne, T. Volkert, E. Fraenkel, D. Gifford, and R. Young. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, (298):799-804, 2002.


20. X. Liu, D.L. Brutlag, and J.S. Liu. BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-expressed Genes. In Proc. of Pacific Symposium on Biocomputing, 2001.

21. X. Shirley Liu, Douglas L. Brutlag, and Jun S. Liu. An Algorithm for Finding Protein-DNA Binding Sites with Applications to Chromatin-Immunoprecipitation Microarray Experiments. Nature Biotechnology, 20:835-839, 2002.

22. Manuel Middendorf, Anshul Kundaje, Mihir Shah, Yoav Freund, Chris H. Wiggins, and Christina Leslie. Motif Discovery through Predictive Modeling of Gene Regulation. In Proc. of 9th RECOMB, Cambridge, MA, 2005.

23. Yves Moreau, Gert Thijs, Kathleen Marchal, Frank De Smet, Janick Mathys, Magali Lescot, Stephane Rombauts, Pierre Rouze, and Bart De Moor. Integrating Quality-based Clustering of Microarray Data with Gibbs Sampling for the Discovery of Regulatory Motifs. JOBIM 2002, pages 75-79, 2002.

24. Naoki Nariai, Yoshinori Tamada, Seiya Imoto, and Satoru Miyano. Estimating Gene Regulatory Networks and Protein-protein Interactions of Saccharomyces cerevisiae from Multiple Genome-wide Data. Bioinformatics, 21(2):206-212, 2005.

25. Audrey P. Gasch, Paul T. Spellman, Camilla M. Kao, Orna Carmel-Harel, Michael B. Eisen, Gisela Storz, David Botstein, and Patrick O. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11:4241-4257, 2000.

26. Yitzhak Pilpel, Priya Sudarsanam, and George M. Church. Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements. Nature Genetics, 29:153-159, 2001.

27. Jiang Qian, Jimmy Lin, Nicholas M. Luscombe, Haiyuan Yu, and Mark Gerstein. Prediction of Regulatory Networks: Genome-wide Identification of Transcription Factor Targets from Gene Expression Data. Bioinformatics, 19(15):1917-1926, 2003.

28. John Jeremy Rice, Yuhai Tu, and Gustavo Stolovitzky. Reconstructing Biological Networks using Conditional Correlation Analysis. Bioinformatics, 21(6):765-773, 2005.

29. E. Segal, R. Yelensky, and D. Koller. Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression. Bioinformatics, 19(1):273-282, 2003.

30. Eran Segal, Yoseph Barash, Itamar Simon, Nir Friedman, and Daphne Koller. From Promoter Sequence to Expression: A Probabilistic Framework. In Proc. of 6th RECOMB, Washington, DC, 2001.

31. Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning Module Networks. Journal of Machine Learning Research, 6:557-588, 2005.

32. Eran Segal, Michael Shapira, Aviv Regev, Dana Pe'er, David Botstein, Daphne Koller, and Nir Friedman. Module Networks: Identifying Regulatory Modules and Their Condition-specific Regulators from Gene Expression Data. Nature Genetics, 34(2):166-176, 2003.

33. Lev A. Soinov, Maria A. Krestyaninova, and Alvis Brazma. Towards Reconstruction of Gene Networks from Expression Data by Supervised Learning. Genome Biology, 4(1):R6.1-R6.10, 2003.

34. Yoshinori Tamada, SunYong Kim, Hideo Bannai, Seiya Imoto, Kousuke Tashiro, Satoru Kuhara, and Satoru Miyano. Estimating Gene Networks from Gene Expression Data by Combining Bayesian Network Model with Promoter Element Detection. Bioinformatics, 19(2):227-236, 2003.

35. Biao Xing and Mark J. van der Laan. A Statistical Method for Constructing Transcriptional Regulatory Networks using Gene Expression and Sequence Data. U.C. Berkeley Biostatistics Working Paper Series, (144), 2004.

36. Biao Xing and Mark J. van der Laan. A Causal Inference Approach for Constructing Transcriptional Regulatory Networks. Bioinformatics, 21(21):4007-4013, 2005.

37. Zhou Zhu, Yitzhak Pilpel, and George M. Church. Computational Identification of Transcription Factor Binding Sites via a Transcription-factor-centric Clustering (TFCC) Algorithm. Journal of Molecular Biology, (318):71-81, 2002.

38. Chaya Ben-Zaken Zilberstein, Eleazar Eskin, and Zohar Yakhini. Sequence Motifs in Ranked Expression Data. Technion CS Dept. Technical Report, (CS-2003-09), 2003.


IDENTIFYING BIOLOGICAL PATHWAYS VIA PHASE DECOMPOSITION AND PROFILE EXTRACTION

Yi Zhang and Zhidong Deng

Department of Computer Science, Tsinghua University, Beijing 100084, China

Email: [email protected]

Biological processes are carried out by large numbers of genes (and their products), and these activities are often organized into cellular pathways: sets of genes that cooperate to accomplish specific biological functions. With the development of microarray technology and its ability to measure the expression of thousands of genes simultaneously, effective algorithms to reveal biologically significant pathways have become possible. However, open problems such as the large amount of noise in microarrays, and the fact that most biological processes overlap and are active only under a subset of conditions, pose great challenges. In this paper, we propose a novel approach that identifies overlapping pathways by extracting partial expression profiles from coherent cliques of clusters scattered over different conditions. We first decompose the gene expression data into highly overlapping segments and partition genes into clusters in each segment; we then organize all the resulting clusters into a cluster graph and search for coherent cliques of clusters; finally, we extract expression profiles from the coherent cliques and shape biological pathways as the sets of genes consistent with these profiles. We compare our algorithm with several recent models, and the experimental results demonstrate the advantages of our approach: it robustly identifies overlapping pathways over arbitrary sets of conditions and consequently discovers more biologically significant pathways in terms of enrichment of gene functions.

1. INTRODUCTION

The rapid development of high-throughput techniques such as oligonucleotide and cDNA microarrays [5] enables measuring the expression of thousands of genes simultaneously. This offers an unprecedented opportunity to characterize the underlying mechanisms of a living cell. The activities of a living cell are so complex that different sets of genes participate in diverse biological processes to perform various cell functions. In this sense, identifying cellular pathways, i.e. sets of coherent genes that coordinate in biological processes to achieve specific functions, plays a considerable role in gaining insight into the cell's activities.

Recently, researchers have made tremendous efforts to identify coherent gene groups [10]. Pioneering work includes agglomerative hierarchical clustering [7], K-means clustering of genes [17] and graph-theoretical approaches for gene-based clustering [15]. Admittedly, applying traditional clustering algorithms to gene expression data can provide new perspectives on cellular processes. However, several problems should be highlighted: (1) biological processes are active only on partial conditions, which renders clustering genes over the entire set of conditions ineffective; (2) extremely high noise exists in microarrays, which calls for robust models for pathway identification; (3) partitioning genes into mutually exclusive clusters is unreasonable, because biological pathways are always overlapping.

Biclustering algorithms [14] are designed to capture biological processes active on partial conditions. Unlike traditional clustering methods, these models perform simultaneous clustering on both rows and columns, and thus discover coherent submatrices whose rows refer to genes and whose columns correspond to the relevant conditions. One challenge for biclustering is that the number of possible combinations of genes and conditions is almost infinite.

Furthermore, the overlapping property of cellular pathways has also been addressed by recent work. On one hand, some biclustering algorithms discover submatrices one after another and thus naturally yield non-exclusive biclusters. For instance, Cheng et al. [6] mask the previous biclusters with random numbers and then find further ones. Similarly, in [12] each bicluster deals only with the "residual" expression left by previous biclusters. On the other hand, algorithms that discover overlapping pathways simultaneously also exist: Battle et al. [4] proposed a probabilistic model to discover overlapping biological processes concurrently.

Managing the high noise in gene expression data is also indispensable for successfully determining coherent genes. In [2], the authors use a robust similarity measure based on the rank of expression in each condition, rather than the exact expression level, to model the similarity between gene expression profiles. Such measures are robust in the sense that they focus on the rough shape of the expression profile and are not affected by disturbances of the exact expression level. The consensus clustering algorithm in [9] combines different clustering results to form a clustering ensemble; the underlying idea of this ensemble learning approach is that integrating the opinions of different "experts" yields a robust estimate.

* Corresponding author.

Indeed, an algorithm that addresses all of these open problems is highly desirable. In this paper, we propose a strategy that satisfies all of these demands: robustly discovering overlapping pathways on partial conditions. Rather than directly seeking a grouping of the genes, as traditional approaches do, our algorithm identifies cellular pathways by robustly searching for expression profiles over arbitrary sets of conditions. The key ideas are: (1) decompose the entire set of conditions into highly overlapping segments and cluster the genes over each segment; (2) organize all the resulting clusters into a cluster graph and discover coherent cliques on the cluster graph; (3) extract expression profiles over the coherent cliques and shape overlapping pathways according to these profiles. As a result, the algorithm is capable of robustly recognizing overlapping molecular pathways on partial conditions, and thus furnishes biologically significant sets of genes in terms of enrichment of gene functions.

2. METHOD

Our pathway discovery algorithm consists of three steps: (1) decomposing the conditions into overlapping segments and performing gene clustering on each segment; (2) constructing a cluster graph from the resulting clusters over all segments and discovering coherent cliques on the graph; (3) extracting an expression profile from each coherent clique and identifying a biological pathway according to each profile. In the rest of this section, we examine these steps in Sections 2.1-2.3 and analyze the properties of the algorithm in Section 2.4.

2.1. Phase Decomposition

In order to capture biological processes on partial conditions, we divide the entire set of conditions (i.e. columns) into highly overlapping segments.

Fig. 1. Phase decomposition: overlapping segments s1, s2, s3, s4, ... of length L slide over conditions c1, c2, ..., each advancing by a step ΔL.

Each segment contains all the rows and a few of the columns of the gene expression matrix. We then discover co-expressed genes by gene-based clustering on each segment. Finally, large clusters are retained for later processing.

The first step is to decompose the gene expression matrix into many segments. The goal of this decomposition is to ensure that a biological process active on any partial set of conditions can be discovered by combining some segments. The term "segment" refers to a submatrix of the gene expression matrix containing all rows and a subset of consecutive columns. The decomposition strategy is shown in Figure 1. Each segment covers a fixed number of conditions and advances a small step beyond the previous segment. For instance, s1 covers {c1, c2, c3, c4} and s2 contains {c2, c3, c4, c5}. Two parameters are involved. (1) Segment length L: the number of conditions covered by a segment. Too large an L loses the ability to discover pathways active over short periods, while too small an L makes clustering on each segment ineffective, since co-expression over a very short period often appears by chance. We set L = 4 in Figure 1 only for illustration; such short segments are not used in the experiments. (2) Step length ΔL: in Figure 1 we set ΔL = 1, so any biological process whose life-span is longer than L can be obtained by combining certain segments. For instance, the period c2-c6 can be captured by integrating segments s2 and s3. One may choose a larger step length to reduce the total number of segments; combinations of different segments can still approximately represent any period longer than L, as long as the segments are highly overlapping. (A minimal sketch of the segment enumeration is given below.)
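The following sketch enumerates the segments for given L and ΔL, under the assumption of zero-based condition indices; the function name is ours, not the paper's notation.

def make_segments(n_conditions, L, dL):
    # Windows of L consecutive conditions, each advancing dL columns past
    # the previous window, so that adjacent segments overlap heavily.
    return [list(range(start, start + L))
            for start in range(0, n_conditions - L + 1, dL)]

# With the illustrative values of Figure 1 (L = 4, dL = 1):
# make_segments(11, 4, 1)[:2] -> [[0, 1, 2, 3], [1, 2, 3, 4]]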

The second procedure is gene-based clustering on each segment so as to obtain co-expression groups. Here we use hierarchical clustering, with average link and Pearson correlation, to group the genes on each segment. On each segment, cutting the hierarchical tree at the threshold 1-c produces many sets of co-expressed genes. Note that c is a key parameter of our algorithm: two gene expression profiles are considered coherent when their Pearson correlation is larger than c, i.e. their distance is smaller than 1-c.

Finally, clusters containing fewer than 5 genes are discarded, since such small clusters are considered outliers or biologically insignificant groups. (A sketch of this per-segment clustering follows.)
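A minimal sketch of the per-segment clustering, using SciPy's average-link hierarchical clustering with correlation distance (1 minus Pearson correlation); the helper name and gene indexing are our own choices.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_segment(expr_seg, c=0.7, min_size=5):
    # expr_seg: genes x (conditions of one segment). Cut the average-link
    # tree at distance 1 - c, then drop clusters with fewer than min_size genes.
    D = pdist(expr_seg, metric='correlation')  # 1 - Pearson correlation
    labels = fcluster(linkage(D, method='average'), t=1 - c, criterion='distance')
    clusters = {}
    for gene, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(gene)
    return [genes for genes in clusters.values() if len(genes) >= min_size]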

2.2. Coherent Clique on Cluster Graph

After clustering on each segment as discussed in Section 2.1, we obtain many co-expressed gene clusters. Clusters from the same segment are mutually exclusive, while clusters computed from different segments may be highly overlapping, especially when the step length is small and adjacent segments present similar expression structure. In this section, we address how to use these clusters to discover biological processes active on an arbitrary period. For this purpose we introduce the concepts of cluster graph and coherent clique, and then focus on how to discover coherent cliques in a cluster graph. Note that searching for coherent cliques amounts to finding possible biological processes.

First, given two gene clusters C and C', we define the overlapping degree and use it to derive a distance measure between clusters; |C| denotes the number of genes in cluster C (a direct transcription into code follows the definitions).

• Overlap(C, C') = |C ∩ C'| / |C ∪ C'|
• Distance(C, C') = 1 - Overlap(C, C')
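The sketch below (helper names ours) treats a cluster as a collection of gene identifiers.

def overlap(C1, C2):
    # Overlapping degree |C1 ∩ C2| / |C1 ∪ C2| between two gene clusters.
    C1, C2 = set(C1), set(C2)
    return len(C1 & C2) / len(C1 | C2)

def distance(C1, C2):
    return 1.0 - overlap(C1, C2)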

Second, with this distance between clusters, all clusters obtained by the procedure of Section 2.1 constitute a large cluster graph, which furnishes a global view of the relationships among genes over different segments. A cluster graph G(V, E) is a complete graph in which each node v ∈ V refers to a cluster C, and the weight of an edge e = (v1, v2) ∈ E is the distance between the two clusters corresponding to v1 and v2.

Third, the concept of β-coherent clique is defined as follows: a β-coherent clique Q(V', E') in a cluster graph G(V, E) is a complete subgraph of G satisfying (1) every edge in E' has weight less than β, and (2) V' contains at least two nodes. A β-coherent clique is biologically meaningful: (1) any two clusters in the clique have distance smaller than β, i.e. overlapping degree larger than 1-β; (2) the clusters in a β-coherent clique must come from different segments, since clusters from the same segment are mutually exclusive; (3) the fact that several clusters from diverse segments share a large proportion of common co-expressed genes indicates the existence of a biological process active on the period composed of these segments.

Finally, given a cluster graph, we want to discover the β-coherent cliques in it. An effective algorithm for this is hierarchical clustering with complete link [11]: we build a hierarchical tree and then cut it into many β-coherent cliques according to the chosen β. The definition of complete link ensures that the resulting groups of nodes in the cluster graph are β-coherent cliques.
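A sketch of this step with SciPy's complete-link implementation; cutting the tree at height β guarantees that every pairwise distance inside a flat cluster is at most β, which matches the β-coherent clique property up to the boundary case (all names are ours).

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def coherent_cliques(clusters, beta=0.7):
    # clusters: list of gene-id collections. Build the pairwise distance
    # matrix of the cluster graph, cluster it with complete linkage, and
    # keep groups of at least two nodes as beta-coherent cliques.
    n = len(clusters)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = set(clusters[i]), set(clusters[j])
            D[i, j] = D[j, i] = 1.0 - len(a & b) / len(a | b)
    Z = linkage(squareform(D), method='complete')
    flat = fcluster(Z, t=beta, criterion='distance')
    groups = {}
    for node, lab in enumerate(flat):
        groups.setdefault(lab, []).append(node)
    return [g for g in groups.values() if len(g) >= 2]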

2.3. Profile and Pathway Extraction

In Section 2.2 we partitioned the entire cluster graph into many β-coherent cliques. In this section we discuss how to robustly extract the expression profile of the biological process underlying each coherent clique, and how to derive cellular pathways, i.e. sets of coordinated genes, from these profiles.

To begin with, recall that a coherent clique is composed of a set of nodes in the cluster graph, each referring to a cluster from some segment. Since each cluster covers the conditions of the segment in which it was generated, we can define the active period of a coherent clique:

• The active period P(Q) of a coherent clique Q is the set of all conditions covered by at least one cluster in Q.

See Figure 1 for an illustration: suppose coherent clique Q is composed of three clusters, generated on segments s1, s2 and s4, respectively. Then the active period of Q is {c1, c2, c3, c4, c5, c6, c7}.

Another important notion is that of core genes:

• Gene g is a core gene of coherent clique Q if and only if g is a member of all clusters in Q.

Furthermore, the expression profile of the biological process underlying coherent clique Q is:

• The expression profile of coherent clique Q is defined on Q's active period and computed as the mean expression of all the core genes of Q.

Finally, we identify the cellular pathway corresponding to Q based on its expression profile (see the sketch after the definition below):

• A gene g belongs to the cellular pathway of coherent clique Q if and only if the Pearson correlation between g's expression and Q's expression profile over Q's active period exceeds c, the coherence parameter mentioned in Section 2.1. Note that g's expression outside Q's active period is not considered.
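Putting the definitions of this section together, a minimal sketch of profile and pathway extraction could look as follows; clusters, segment_of, segments, expr and gene_index are data structures of our own choosing (gene lists per cluster, the segment each cluster came from, condition indices per segment, the genes-by-conditions matrix, and a gene-to-row map).

import numpy as np

def active_period(clique, segment_of, segments):
    # Union of the conditions covered by the segments of the clique's clusters.
    conds = set()
    for node in clique:
        conds.update(segments[segment_of[node]])
    return sorted(conds)

def core_genes(clique, clusters):
    # Genes shared by every cluster in the clique.
    genes = set(clusters[clique[0]])
    for node in clique[1:]:
        genes &= set(clusters[node])
    return genes

def extract_pathway(clique, clusters, segment_of, segments, expr, gene_index, c=0.7):
    # Profile = mean expression of the core genes over the active period;
    # the pathway is every gene whose Pearson correlation with the profile
    # on that period exceeds the coherence threshold c.
    period = active_period(clique, segment_of, segments)
    core_rows = [gene_index[g] for g in core_genes(clique, clusters)]
    profile = expr[core_rows][:, period].mean(axis=0)
    return [g for g, row in gene_index.items()
            if np.corrcoef(expr[row, period], profile)[0, 1] > c]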

2.4. Further Analysis

In this section we discuss three properties of the algorithm: (1) its ability to discover pathways on partial conditions; (2) its identification of overlapping pathways; (3) its robustness.

First, if the active period of a biological process P can be obtained by combining different segments produced in Section 2.1, a corresponding coherent clique should be identifiable in the cluster graph. This rests on two assumptions: (1) segments are defined at a suitable granularity and are highly overlapping; (2) genes cooperating in P co-express during P's active period.

Second, the overlapping property of pathways is ensured: (1) the active periods of two coherent cliques Q1 and Q2 differ, so a gene g can be consistent both with Q1's expression profile on Q1's active period and with Q2's expression profile on Q2's active period; (2) even on the same period, the expression of gene g can be consistent with several different expression profiles.

Third, the algorithm is robust, from four perspectives. (1) The definition of coherent clique is robust in that any two clusters in the clique have a high overlapping degree, so it is unlikely that an "outlier" cluster can be accommodated. (2) The computation of the active period of each coherent clique is robust. See Figure 1 for an illustration: suppose a biological process P is active on conditions c1-c7; then the corresponding coherent clique Q should have active period c1-c7. Ideally, Q ought to consist of four clusters located in segments s1, s2, s3 and s4, respectively, because P is active on these segments and genes participating in P should co-express on them. However, owing to the high noise in microarrays, some clusters may be missed, so Q may consist of only three clusters, on s1, s2 and s4. Even in this case, the active period of Q is judged correctly as c1-c7, based on s1, s2 and s4. In short, segment overlapping ensures the robustness of active period estimation. (3) The choice of core genes in a coherent clique Q is robust in that each core gene must belong to all the clusters in Q. Since these clusters come from different segments and are obtained by clustering each segment independently, it is unlikely that an outlier gene will belong to all of them. Admittedly, this selection strategy is so "cautious" that it may miss some core genes. However, a subset of the core genes is still sufficient to extract the expression profile of the underlying biological process, because core genes are expected to co-express well in their biological process. (4) Finally, the quality of core gene selection and active period estimation ensures the quality of the extracted expression profile and of the resultant pathway.

3. EXPERIMENTAL RESULTS

In this section we present empirical results. Compared with several state-of-the-art models, our algorithm is more capable of identifying biologically significant pathways in terms of the enrichment of gene functions.

3.1. Dataset and Preprocessing

Two well-known datasets are used in our experiments: the yeast cell cycle data [16] and the yeast stress data [8]. For preprocessing, we remove genes with more than 5% missing values and estimate the remaining missing values with KNNimpute [18]; genes with small variance are then removed. These steps leave 526 genes in the cell cycle dataset and 659 genes in the stress dataset.
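A sketch of this preprocessing, with scikit-learn's KNNImputer standing in for KNNimpute [18]; the variance cutoff is illustrative, since the paper does not state its threshold.

import numpy as np
from sklearn.impute import KNNImputer

def preprocess(expr, max_missing=0.05, min_var=0.5):
    # Drop genes with more than 5% missing values, impute the remaining
    # missing entries from the nearest genes, then drop low-variance genes.
    expr = expr[np.isnan(expr).mean(axis=1) <= max_missing]
    expr = KNNImputer(n_neighbors=10).fit_transform(expr)
    return expr[expr.var(axis=1) >= min_var]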

3.2. Rival Methods

In this part, we introduce the rival algorithms and their parameters. (1) HClust [7]: hierarchical clustering with average link and Pearson correlation; 30 clusters are formed on both datasets. (2) Plaid [12]: designed to discover biclusters one by one independently; the default parameters are used, and we stop at 100 biclusters on both datasets. (3) OP [4]: a probabilistic model that searches for overlapping pathways simultaneously; the number of pathways is set to 30. (4) PIPE (Pathway Identification by Profile Extraction): our algorithm. The coherence threshold c of Sections 2.1 and 2.3 is 0.7, and the parameter β used to define β-coherent cliques is 0.7. On the cell cycle dataset, which contains 76 conditions, we set the segment length L to 10 and the step length ΔL to 2; on the yeast stress dataset, which contains 173 conditions, L is 20 and ΔL is 3.

3.3. Results on Cell Cycle

Running our algorithm on the 526 genes over 76 conditions results in 162 coherent cliques, and hence 162 cellular pathways. These pathways are generated independently.


Fig. 2. Distribution of pathway size (x-axis: pathway size n; y-axis: proportion of pathways larger than n).

Fig. 3. Distribution of gene participation for HClust, PIPE, OP and Plaid (x-axis: number n of pathways a gene joins; y-axis: proportion of genes involved in more than n pathways).

The fact that 162 pathways are produced from 526 genes does not imply that the average pathway size is small: in reality, the smallest pathway contains 4 genes and the largest contains 101 genes. The distribution of pathway sizes is shown in Figure 2, where the x-axis is the pathway size and the y-axis is the proportion of pathways larger than that size. From Figure 2 we observe that more than 80% of pathways contain more than 10 genes, while only about 20% contain more than 40 genes. This shows that the majority of pathways are of moderate size.

More interestingly, we measure the overlap among pathways. HClust necessarily generates 30 mutually exclusive pathways; the OP model produces 30 slightly overlapping pathways; and the Plaid model finds 100 biclusters one by one. Figure 3 shows the distribution of the number of pathways each gene participates in: the x-axis is the number of pathways a gene joins, and the y-axis is the proportion of genes involved in more than that number of pathways. The four algorithms behave quite differently in Figure 3: (1) since HClust produces only mutually exclusive pathways, every gene takes part in exactly one pathway; (2) under the OP model, a single gene takes part in at most 5 pathways, and only 19 of the 526 genes participate in more than three; (3) with Plaid, pathways are excessively overlapping: almost all genes participate in more than 30 pathways; (4) for our PIPE method, Figure 3 shows a natural distribution in which only a few genes join more than 15 pathways.

According to many studies of the scale-free topology of biological networks [3], and of genetic regulatory networks in particular [13]: (1) there should be a few "hub" genes connected with many other genes that thus join many biological processes; (2) most genes in the network should not have large degrees, and thus participate in only a few biological processes. As shown in Figure 3, only our algorithm generates results consistent with this picture: HClust and the OP model cannot produce "hub" genes, and Plaid produces too many "hubs".

Finally, to assess the biological significance of the pathways generated by these models, we test the enrichment of gene functions in GO categories [1]. For a pathway, the enrichment of a GO category is represented by a p-value: the smaller the p-value, the stronger the enrichment. The p-values are computed with Genomica [19]. For each GO category, we report the p-value of the pathway with the best enrichment. Enrichment results for all four algorithms are listed in Table 1; p-values larger than 0.001 are considered failures to find enrichment and are labeled "—" in the table. Among the 117 GO categories listed in Table 1: (1) PIPE wins 70 times, while the OP, HClust and Plaid models win 25, 21 and 1 times, respectively; (2) PIPE fails 23 times, whereas OP, HClust and Plaid fail 41, 58 and 90 times, respectively. Examining the results in Table 1 naturally yields the conclusion that PIPE identifies more biologically significant pathways.
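The usual way to obtain such a p-value is a hypergeometric tail test on the overlap between a pathway and a GO category; the sketch below illustrates the computation with SciPy (helper name ours), although Genomica's exact procedure may differ in detail.

from scipy.stats import hypergeom

def enrichment_pvalue(pathway, category, background):
    # pathway, category, background: sets of gene names.
    N = len(background)                       # population size
    K = len(category & background)            # category genes in background
    n = len(pathway & background)             # pathway genes in background
    k = len(pathway & category & background)  # observed overlap
    # P(overlap >= k) when n genes are drawn at random from the N
    return hypergeom.sf(k - 1, N, K, n)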

3.4. Results on Stress Condition

To further justify the superiority of PIPE, we use the yeast stress condition dataset [8] for another experiment. Running our algorithm on 659 genes and 173 conditions yields 174 pathways. The pathway size distribution of PIPE is shown in Figure 4: a few pathways contain more than 100 genes and the majority are of moderate size.


Fig. 4. Distribution of pathway size on the stress dataset.

Fig. 5. Distribution of gene participation on the stress dataset for HClust, PIPE, OP and Plaid.

In addition, Figure 5 shows results similar to those in Figure 3: HClust and OP cannot discover any "hub" gene, while Plaid treats most of the 659 genes as hubs that join many pathways.

We also test the enrichment of GO categories and list the results in Table 2. Over the 128 GO categories listed in Table 2: (1) PIPE finds the best enrichment for 84 categories, while OP, HClust and Plaid do so for 18, 13 and 19 categories, respectively; (2) PIPE fails to find enrichment for 11 GO terms, whereas OP, HClust and Plaid fail 61, 81 and 68 times, respectively. In a word, PIPE has a clear advantage in discovering biologically meaningful pathways.

Another interesting fact is that the Plaid algorithm performs much better on the stress dataset than on the cell cycle dataset. One explanation is the difference in regulation mechanisms between the endogenous phase (e.g. cell cycle and sporulation) and the exogenous phase (e.g. stress conditions, DNA damage and diauxic shift) [13]. In the exogenous phase, such as the stress response, genes are often regulated by more transcription factors and participate in more processes than in the endogenous phase, such as the cell cycle. Therefore the Plaid model, which tends to produce excessively overlapping pathways, yields more accurate results here.

4. CONCLUSION

In this paper, we presented a new approach to discovering cellular pathways. We first decompose the gene expression matrix into highly overlapping segments and partition genes into clusters on each segment; then we organize all the resulting clusters into a cluster graph and identify coherent cliques; finally we extract the expression profiles of the coherent cliques and shape biological pathways from these profiles. We compared our algorithm with several recent models, and the experimental results demonstrate the advantages of our approach: robustly identifying overlapping pathways on partial conditions and consequently discovering biologically significant pathways.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 60321002 and by the Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions of MOE (TRAPOYT), China.

References

1 M. Ashburner, et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25: 25-29, 2000.

2 R. Balasubramaniyan, E. Hüllermeier, N. Weskamp and J. Kämper. Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics, 21(7): 1069-1077, 2005.

3 A.L. Barabasi and Z.N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5: 101-113, 2004.

4 A. Battle, E. Segal and D. Koller. Probabilistic discovery of overlapping cellular processes and their regulation. Journal of Computational Biology, 12(7): 909-927, 2005.

5 P.O. Brown and D. Botstein. Exploring the new world of the genome with DNA microarrays. Nat. Genet. 21, 33-37, 1999.

6 Y. Cheng and G.M. Church. Biclustering of expression data. ISMB, 2000.

7 M.B. Eisen, P.T. Spellman, P.O. Brown and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95: 14863-14868, 1998.

8 A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel-Harel, M.B. Eisen, G. Storz, D. Botstein and P.O. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11: 4241-4257, 2000.

9 T. Grotkjar, O. Winther, B. Regenberg, J. Nielsen and L.K. Hansen. Robust multi-scale clustering of large DNA microarray datasets with consensus algorithm. Bioinformatics, 22(1): 58-67, 2006.

10 D. Jiang, C. Tang and A. Zhang. Cluster analysis for gene expression data: a survey. IEEE Trans. on Knowledge and Data Engineering, 16(11): 1379-1386, 2004.

11 B. King. Step-wise clustering procedures. J. Am. Stat. Assoc. 62, pages 86-101, 1967.

12 L. Lazzeroni and A. Owen. Plaid models for gene expression data. Technical report. Stanford Univ., 2000.

13 N.M. Luscombe, M.M. Babu, H. Yu, M. Snyder, S.A. Teichmann and M. Gerstein. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature, 431(7006): 308-312, 2004.

14 S.C. Madeira and A.L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE Trans. on Computational Biology and Bioinformatics, 1(1): 24-45, 2004.

15 R. Sharan and R. Shamir. CLICK: a clustering algorithm with applications to gene expression analysis. ISMB, 2000.

16 P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9: 3273-3297, 1998.

17 S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho and G.M. Church. Systematic determination of genetic network architecture. Nat. Genet. 22, 281-285, 1999.

18 O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein and R.B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17: 520-525, 2001.

19 http://genomica.weizmann.ac.il/index.html

Table 1. GO Categories Enrichment based on Cell Cycle Dataset. (For each of the 117 GO categories, from 35S primary transcript processing to vesicle-mediated transport, the table lists the best-enrichment p-value achieved by HClust, PIPE, OP and Plaid; "—" marks p-values above 0.001, i.e. no enrichment found.)

Table 2. GO Categories Enrichment based on Stress Condition Dataset. (For each of the 128 GO categories, from 35S primary transcript processing to unfolded protein binding, the table lists the best-enrichment p-value achieved by HClust, PIPE, OP and Plaid; "—" marks p-values above 0.001.)

Page 298: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

281

EXPECTATION-MAXIMIZATION ALGORITHMS FOR FUZZY ASSIGNMENT OF GENES TO CELLULAR PATHWAYS

Liviu Popescu

Department of Computer Science, Cornell University, Ithaca, NY 14853, USA

Golan Yona *

Computer Science Department, Technion - Israel Institute of Technology. *Email: [email protected]

Cellular pathways are composed of multiple reactions and interactions mediated by genes. Many of these reactions are common to multiple pathways, and each reaction might potentially be mediated by multiple genes in the same genome. Existing pathway reconstruction procedures assign a gene to all pathways in which it might catalyze a reaction, leading to a many-to-many mapping of genes to pathways. However, it is unlikely that all genes that are capable of mediating a certain reaction are involved in all the pathways that contain it. Rather, it is more likely that each gene is optimized to function in specific pathway(s). Hence, existing procedures for pathway construction produce assignments that are ambiguous. Here we present a probabilistic algorithm for the assignment of genes to pathways that addresses this problem and reduces this ambiguity. Our algorithm uses expression data, database annotations and similarity data to infer the most likely assignments, and estimates the affinity of each gene with the known cellular pathways. We apply the algorithm to metabolic pathways in Yeast and compare the results to assignments that were experimentally verified.

1. INTRODUCTION

In the last decade an increasingly large number of genomes were sequenced and analyzed. The wealth of experimental data about genes initiated many studies in search of larger complexes, patterns and regularities. Of broad interest are studies that attempt to compile the network of cellular pathways in a given genome1-4. Due to the complexity of these studies, these pathways have been verified and studied extensively only in a few organisms, while in others the analysis is mostly computational. To propagate the experimental knowledge to other organisms, several groups developed procedures that extrapolate pathways (mostly metabolic pathways) based on the known association of genes with reactions in these pathways. However, many genes have unknown function, and therefore the cellular processes in which they participate remain largely unknown. On the other hand, some reactions can be catalyzed by multiple genes and are associated with multiple pathways. Thus, assignments that rely just on the general functional characterization of genes are not refined enough and tend to introduce ambiguity by creating many-to-many mappings between genes and pathways. This ambiguity characterizes popular procedures for the assignment of genes to pathways5, 6.

In a previous work7 we presented a deterministic algorithm for pathway assignment that reduces the ambiguity by using expression data, in addition to the functional characterization of genes, and by selectively assigning genes to pathways such that co-expression within pathways is maximized and conflicts among pathways (due to shared assignments) are minimized. Furthermore, to complement the set of known enzymes (which is usually incomplete), our algorithm considers other genes in the subject genome that might possess catalytic capabilities based on similarity. As our tests showed, our algorithm works well on a set of test pathways; however, it assigns a single gene to each reaction. While our results generally support this assumption, it does not always hold in reality. Different genes might participate with different affinities in different pathways. Therefore, a more reasonable approach would be to assign a probabilistic measure to indicate the level of association of a given gene with a given pathway.

In this paper we present a variation of a known EM algorithm that addresses this problem, now assuming that the same gene can participate in multiple pathways, and estimates pathway assignment probabilities from expression data and sequence similarities. Our framework can be extended to include interaction data and other high-throughput data sets, each one providing information on

"Corresponding author.


different aspects of the same cellular process.

2. A PROBABILISTIC FRAMEWORK FOR ASSIGNING GENES TO PATHWAYS

In this paper we focus on metabolic pathways. In Ref. 8 a metabolic pathway is defined as "a sequence of consecutive enzymatic reactions that brings about the synthesis, breakdown, or transformation of a metabolite from a key intermediate to some terminal compound." This definition is used in many studies and by most biochemical textbooks, and underlies literature-curated databases such as MetaCyc9. We adopt this definition in our algorithm.

Our initial assumption is that the expression profiles of genes assigned to the same pathway tend to be similar, which suggests that each pathway has a characteristic expression profile. Indeed, a similar assumption was employed in other studies on pathway reconstruction (see section 4). Therefore a pathway can be modeled as a probabilistic source for the expression profiles of the participating genes, whose centroid is the pathway's characteristic profile.

2.1. Preliminaries

In the next sections we use bold characters to represent sets or vectors and non-bold characters to represent individual entities or measurements. The input to our algorithm is a genome G with N genes, enzyme families $F_1, F_2, \ldots, F_M$ and pathways $P_1, P_2, \ldots, P_K$. We adhere to the set of known pathways, as our algorithm is concerned with pathway assignments rather than pathway discovery (although this can easily be changed). Each pathway P contains a set of enzymatic reactions. Each reaction is associated with an enzyme family F whose member genes can catalyze the reaction. We denote by F(P) the set of protein families that are associated with the reactions of pathway P. We use $G(F_j)$ to represent the set of genes that can be assigned to enzyme family $F_j$ based on their database records (or based on their similarity with known enzymes of family $F_j$, as described in section 1.2.1 of the online Supplementary Material). The set of enzymatic reactions (families) associated with gene i is denoted by F(i).

Our goal is to predict which genes take part in each pathway. In other words, our goal is to compute the probability $p(i|P_k)$ of gene i participating in pathway $P_k$, as well as the posterior probability $p(P_k|i)$, which we refer to as the affinity of gene i with pathway $P_k$.

Computing the probabilities $p(i|P_k)$ and $p(P_k|i)$ is difficult since they refer to biological entities (genes) that are not observed directly but only through measurements (e.g. expression levels). Therefore, we assume that each cellular process (in our case a metabolic pathway) can be modeled as a statistical source^a generating measurable observations over genes. Each gene i is associated with a feature vector $x_i$, and the conditional probability $p(x_i|P_k)$ denotes the probability of the k-th source to emit $x_i$. We initialize these probabilities based on prior knowledge of metabolic reactions. We then revisit these estimates and recompute these probabilities based on experimental observations until convergence to maximum-likelihood solutions. However, this process is constrained so as to maintain the prior information.

The observations can be characterized in terms of different types of data (such as expression profiles, interactions, etc.) that reflect different aspects of the pathway. E.g. $x_i = \{e_i, i_i, \ldots\}$, where $e_i$ is the expression profile of gene i, $i_i$ is the interaction profile, and so on. Assuming independence between these features, we can decompose $p(x_i|P_k) = p(e_i|P_k)\,p(i_i|P_k)\cdots$. In this work we use only expression profiles (generated from multiple experiments), i.e. we estimate $p(x_i|P_k) \approx p(e_i|P_k)$, where $p(e|P_k)$ is the probability of observing expression vector e in pathway $P_k$. This approximation is based on the assumption that genes participating in the same biological process are similarly expressed. Indeed, it produces good results, as we demonstrate later on. However, the algorithm can easily be generalized to include other types of data.

2.2. The EM algorithm

Our algorithm is based on the fuzzy EM clustering algorithm that assumes a mixture of Gaussian sources10, with several modifications that are discussed in the next section. We model each pathway as a source that generates expression profiles for the pathway genes, such that $p(e|P_k)$ follows a Gaussian distribution $N(\mu_k, \Sigma_k)$ or a mixture of Gaussian sources (assuming there are several underlying processes, intermingled together).

^a Each pathway can also be modeled as a mixture of sources, for example, when there are multiple branches.


Each pathway has a prior $p(P_k)$. We assume that the microarray experiments are independent of each other^b, such that the expression vector e is composed of d independent measurements $\{e_1, e_2, \ldots, e_d\}$, i.e. $p(e|P_k) = \prod_{l=1}^{d} p(e_l|P_k)$, where each component is distributed as a one-dimensional normal distribution $N(\mu_{kl}, \sigma_{kl})$. Hence the covariance matrix $\Sigma_k$ is actually a diagonal matrix, whose non-zero elements are denoted by $\sigma_{kl}$.

We seek the parameters that maximize the likelihood of the data. We initialize the parameters of the pathway models $\mu_k$, $\sigma_k$ and $p(P_k)$ based on database annotations and similarity data, as described in section 1.2.1 of the online Supplementary Material. These parameters, as well as the probabilities $p(P_k|e_i)$, are modified iteratively, using an EM algorithm similar to the one described in Ref. 10, until convergence. This algorithm converges to parameters that (locally) maximize the likelihood of the data $p(E|\Theta) = \prod_i p(e_i|\Theta) = \prod_i \sum_k p(e_i|\mu_k, \sigma_k)\,p(P_k)$, where $\Theta = \{(\mu_1, \sigma_1), (\mu_2, \sigma_2), \ldots, (\mu_K, \sigma_K)\}$. For more details, see section 1 of the online Supplementary Material.
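As a concrete illustration, the following Python sketch implements a generic fuzzy EM loop for a mixture of diagonal-covariance Gaussian sources of this form. It is a minimal sketch of the underlying mixture estimation only: the constraints, prior-preserving updates and the mass-distance variant described below and in the Supplementary Material are not included, and all names (fuzzy_em, E, mu, sigma, prior) are illustrative rather than the authors' code.

```python
import numpy as np

def fuzzy_em(E, mu, sigma, prior, n_iter=100, tol=1e-6):
    """Fuzzy EM for K diagonal-covariance Gaussian 'pathway' sources.

    E     : (n, d) expression profiles, one row per gene.
    mu    : (K, d) initial source means (pathway characteristic profiles).
    sigma : (K, d) initial per-dimension standard deviations.
    prior : (K,)  initial pathway priors p(P_k).
    Returns fitted parameters and the posteriors p(P_k | e_i).
    """
    n, _ = E.shape
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log p(e_i | P_k) as a product of independent 1-D Gaussians.
        log_lik = -0.5 * (((E[:, None, :] - mu[None]) / sigma[None]) ** 2
                          + 2.0 * np.log(sigma[None])
                          + np.log(2.0 * np.pi)).sum(-1)        # shape (n, K)
        log_joint = log_lik + np.log(prior)[None]
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        post = np.exp(log_joint - log_norm)                     # p(P_k | e_i)
        # M-step: re-estimate means, variances and priors from soft counts.
        w = post.sum(axis=0)
        mu = (post.T @ E) / w[:, None]
        var = (post.T @ E ** 2) / w[:, None] - mu ** 2
        sigma = np.sqrt(np.maximum(var, 1e-8))
        prior = w / n
        ll = log_norm.sum()                                     # data log-likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return mu, sigma, prior, post
```

The returned posterior matrix corresponds to the gene-pathway affinities $p(P_k|e_i)$ reported in the Results section.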

2.3. Knowledge-based clustering

Our algorithm differs from the EM algorithm described above in several ways. First, it utilizes any prior information that might help to obtain more accurate assignments: instead of random initialization of $\mu_k$ and $\sigma_k$, we use the prior information available from database annotations and similarity searches to initialize the parameters. Second, we employ constrained clustering so as to minimize the number of pathways that end up with an incomplete assignment. Third, we replace the Euclidean metric with a new metric, the mass-distance measure, which is more effective for detecting similarity between expression profiles. Due to space limitations, the details of these three elements are described in the Supplementary Material, section 1.2.
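A minimal sketch of the prior-based initialization idea follows, assuming the initial gene-to-pathway assignments (from annotations and similarity searches) are already available as index lists; the paper's exact initialization scheme is in its Supplementary Material, and the function and argument names here are illustrative.

```python
import numpy as np

def initialize_from_priors(E, initial_assignments, eps=1e-3):
    """Initialize pathway sources from prior gene-to-pathway assignments.

    E                   : (n, d) expression profiles.
    initial_assignments : list of K lists of gene indices, one per pathway,
                          derived from annotations and similarity searches.
    """
    K, d = len(initial_assignments), E.shape[1]
    mu, sigma, prior = np.zeros((K, d)), np.ones((K, d)), np.zeros(K)
    for k, genes in enumerate(initial_assignments):
        profiles = E[genes]
        mu[k] = profiles.mean(axis=0)                     # characteristic profile
        sigma[k] = np.maximum(profiles.std(axis=0), eps)  # avoid zero variance
        prior[k] = len(genes)
    return mu, sigma, prior / prior.sum()
```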

3. RESULTS

To evaluate the performance of our method, we test the influence of different settings and parameters on pathway assignments and show that the algorithm produces results of biological significance. We first provide quantitative measures of performance by comparing our results to experimentally validated assignments. We then look at particular examples to illustrate the strengths of our algorithm.

Our model organism is the Yeast genome. Pathway blueprints are obtained from the MetaCyc database9. We used a subset of 52 metabolic pathways for which we could assign Yeast genes to all the reactions in the pathway. 23 of these were experimentally verified in SGD11 to exist in Yeast. This set of 23 pathways serves as our test set. To assign genes to pathways we test two different expression datasets: the Cell Cycle dataset of Ref. 12 and the Rosetta Inpharmatics Yeast compendium dataset13. Genes are mapped to enzymatic reactions using Biozon14 at biozon.org. Proteins that are linked with enzyme families based on their annotation are referred to as annotated enzymes. Proteins assigned to reactions based on similarity with known enzymes are referred to as predicted enzymes. For more information on the datasets used in this study see section 2 of the Supplementary Material.

To explore the influence of different options on our algorithm we ran a total of 12 experiments. We compare performance across different models (the Gaussian model vs the mass-distance model), different data sets (Cell-cycle vs. Rosetta) and different subsets of genes from the Yeast genome as outlined below. We use several performance measures as discussed in section 3 of the Supplementary Material.

Gene sets. Using the prior knowledge we can restrict the set of genes we consider in our algorithm. The most constrained set is pathway genes (PG), i.e. the genes that can be assigned to at least one of the pathways based on database annotations and by prediction: $\mathbf{PG} = \bigcup_k \bigcup_{F_j \in F(P_k)} G(F_j)$. The intermediate set consists of all enzymes (AE) in the genome, annotated or predicted (including enzymes that are not associated with any of the reactions in the pathways we considered): $\mathbf{AE} = \bigcup_j G(F_j)$. The third set we consider is the entire genome, or all genes (AG): $\mathbf{AG} = \mathbf{G}$.
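These three sets can be expressed directly as set unions. The sketch below assumes hypothetical containers F_of_P (families per pathway) and G_of_F (genes per family); it illustrates the definitions above and is not code from the paper.

```python
def gene_sets(F_of_P, G_of_F, genome):
    """F_of_P[k]: families associated with pathway k; G_of_F[j]: genes
    mappable to family j; genome: all gene identifiers."""
    PG = set().union(*(G_of_F[j] for F in F_of_P for j in F))  # pathway genes
    AE = set().union(*G_of_F.values())                         # all enzymes
    AG = set(genome)                                           # entire genome
    return PG, AE, AG
```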

^b Expression profiles are typically composed of measurements taken from a set of independent experiments. For example, in time-series datasets the measurements are collected at different time points, usually spaced at relatively large time intervals during which the cell has undergone significant changes, so the correlation between consecutive time points is relatively weak. Other datasets (e.g. Rosetta) are generated from experiments that are conducted practically independently under different conditions.


Table 1. Comparative results for different experimental settings. Results are reported over the test set of 23 pathways. The first column lists the experimental setup. Codes used: Cell Cycle data (TS), Rosetta (ROS), pathway genes (PG), all enzymes (AE), entire genome (AG), mass distance (MD), Gaussian model (GM), deterministic algorithm (DET); e.g., "ROS:AE:MD" is the experiment using Rosetta, clustering only enzymes and using the mass distance model. The second column shows the number of pathways with an experimentally verified (EV) assignment in the top position. The third column shows the number of pathways with violated constraints (the number in parentheses is over the entire set of 52 pathways used in clustering). The fourth and fifth columns show the precision and recall with respect to the verified genes (where genes are assigned to a pathway if the posterior probability is greater than a threshold θ = 0.1). The last two columns show the MAP with respect to the ranking of genes based on their affinity, and with respect to the ranking of all possible deterministic assignments based on their score (see section 3 of the Supplementary Material for details). The last row represents a model that gives a random ordering of the genes and assignments; this is equivalent to a regular pathway reconstruction algorithm.

Experiment     # verified top  # violated     precision  recall   MAP     MAP
               assignment      constraints    (genes)    (genes)  genes   assignments
------------------------------------------------------------------------------------
TS:AG:MD       12              12 (29)        0.72       0.60     0.86    0.70
TS:AE:MD       10              12 (28)        0.80       0.65     0.83    0.62
TS:PG:MD        8              11 (24)        0.79       0.69     0.81    0.62
TS:PG:GM        9               9 (22)        0.80       0.69     0.85    0.61
TS:AG:GM        9              12 (27)        0.84       0.71     0.87    0.60
TS:AE:GM        6              11 (27)        0.79       0.63     0.84    0.52
TS:PG:DET      10              N/A            N/A        N/A      N/A     0.68
ROS:PG:MD      14              10 (24)        0.85       0.72     0.94    0.84
ROS:PG:GM      14              11 (22)        0.83       0.71     0.94    0.82
ROS:AE:MD      13              10 (24)        0.85       0.69     0.94    0.78
ROS:AG:MD      13              10 (26)        0.84       0.69     0.93    0.77
ROS:AE:GM      12              12 (25)        0.74       0.63     0.91    0.77
ROS:AG:GM      10              11 (25)        0.80       0.56     0.90    0.66
ROS:PG:DET     11              N/A            N/A        N/A      N/A     0.75
random model    5.4            N/A            N/A        N/A      0.74    0.45

3.1. Summary of results

Our method is conceived as an extension of current pathway reconstruction methods like Pathway Tools15. These methods do not attempt to assign genes to pathways selectively and hence cannot be compared to ours. Therefore we need some other baseline to compare our results against. We consider a random model that generates random permutations over the set of all possible assignments (this setting is similar to that of KEGG or Pathway Tools, where there is no ranking over assignments and all assignments are equally probable). For each pathway we generate 100,000 random permutations and compute the average MAP over the results (see section 3 of the Supplementary Material for details). We also compare the results with those of the deterministic algorithm^c of our previous work7.
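A sketch of this baseline computation follows, using the standard definition of average precision; the exact MAP protocol is in section 3 of the Supplementary Material, and the function names here are illustrative.

```python
import numpy as np

def average_precision(ranking, relevant):
    """Average precision of a ranked list against a set of verified items."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)

def random_baseline_map(candidates, relevant, n_perm=100_000, seed=0):
    """Mean AP over random orderings of the candidate assignments,
    mimicking a reconstruction method that imposes no ranking."""
    rng = np.random.default_rng(seed)
    items = list(candidates)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(items)
        total += average_precision(items, relevant)
    return total / n_perm
```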

As Table 1 shows, our algorithm improves significantly over the random model under all settings, and it also improves over the deterministic algorithm. Clearly, our model exploits the information in the expression data to rank the genes effectively. When comparing the different settings, our general conclusions are:

• Clearly, the choice of the expression dataset is important. The performance of our algorithm on the Rosetta set is significantly and consistently better than on the Cell Cycle data.

• There are no significant differences between the different model variations within each expression dataset, but there are some noticeable trends. The mass distance model has a slight advantage over the Gaussian model, all else being equal. The good performance even with the Euclidean metric reflects the strong correlation between pathways and expression patterns.

T o compare the results, we ran the deterministic algorithm of Ref. 7 but skipped the last step that attempts to minimize shared assignments by looking at near-optimal assignments, since it explicitly drives the assignments towards solutions that assign a single gene to each reaction.


• Interestingly, the performance does not decline significantly when we use a larger set of genes. This confirms that pathways tend to have unique characteristic expression profiles.

• Most of the pathways have zero or one violated constraint (see section 1.2.2 of the Supplementary Material) in all settings. However, there are a few pathways in which, consistently, most of the constraints are violated (such as arginine biosynthesis, cysteine biosynthesis II, arginine degradation I and trehalose biosynthesis III). To some extent, this reflects the capacity (or lack thereof) of our model to cover certain pathways. However, it can also suggest that these pathways do not exist in Yeast, or that they exist in a different configuration than the pathway blueprint. This conclusion is reinforced by the fact that the average number of violated constraints does not seem to depend on the expression dataset.

The number of related genes (genes that have significant affinity with a pathway although they were not initially assigned to it) loosely correlates with the number of reactions in the pathway, as well as with the number of genes initially assigned to the pathway (data not shown). Finally, most pathways tend to have similar performance across all experiments within a dataset. This explains the small difference in performance between experiments within each dataset. The difference between the two datasets is caused by a few pathways for which the Rosetta dataset is more informative.

3.2. Example - Homoserine methionine biosynthesis

In this section we present an individual case that illustrates our method. Due to the page limit we focus on only one example. Two additional examples are given in the Supplementary Material (section 4). The examples are discussed in the context of the ROS:AE:MD experiment. This setting is one of the best according to our performance measures, and is chosen because the clustering is done with all enzymes, revealing interesting dependencies between pathways.

Methionine is biosynthesized in this pathway from homoserine. It is part of the superpathway Threonine and methionine biosynthesis, which consists of three pathways: homoserine biosynthesis, homoserine methionine biosynthesis, and threonine biosynthesis from homoserine, as depicted in Figure 1a. Though related, these pathways


Fig. 1. The relation between homoserine biosynthesis, homoserine methionine biosynthesis and threonine biosynthesis from homoserine. (a) Homoserine is synthesized from aspartate in the first pathway; methionine and threonine are synthesized in the second and third pathways respectively, both starting from homoserine, so the superpathway forks at homoserine. (b) The characteristic profiles of the three pathways. Homoserine biosynthesis and threonine biosynthesis from homoserine are correlated, while homoserine methionine biosynthesis is only loosely correlated.


have different characteristic expression profiles in most of the experiments. However, they seem to be similar in certain experiments (Figure 1b), which suggests that they share regulation mechanisms.

This pathway has three reactions: 2.3.1.31, 4.2.99.- and 2.1.1.14. We have excluded reaction 4.2.99.- from the pathway model because it has an incomplete EC number^d. There are seven genes that can initially be assigned to this pathway based on database annotations and function prediction (by similarity). However, only three are experimentally verified assignments: MET2, MET6 and MET17. These are also the only genes whose affinity

^d Recently, this reaction was revised and assigned the new number 2.5.1.49.

with the pathway (posterior probability) at the end of the run is significant (1 in this case), as shown in Table 2. Our algorithm assigns these genes such that each reaction is associated with exactly one gene. The other four unverified genes have insignificant affinity with the pathway, and no other genes are associated with the pathway. Note that all the unverified genes, as well as MET17, initially have similar functional assignments (to 6 different reactions), only with different e-values. However, only MET17 makes it to the final round and is assigned to the pathway by our algorithm. Furthermore, our algorithm consistently recovers the experimentally verified genes

Table 2. The probabilistic assignment of the Homoserine methionine biosynthesis pathway. The table lists all the genes that are potentially assigned to this pathway. The double line separates the genes that were assigned to this pathway from the ones that were rejected. The first column shows the name of the gene or its systematic name. The second column shows the Biozon ID14. The third column shows the affinity with the pathway ($p(P_k|x_i)$). The fourth column shows the EC family memberships (the number in parentheses is the weight, which reflects the confidence in the assignment). If the EC number is in bold, the gene was annotated in the database as capable of catalyzing the reaction; otherwise it was predicted. In the last column we list the MetaCyc IDs of alternate pathways to which the gene was also assigned (the number in parentheses is the affinity with that pathway).

Gene Name | Biozon ID    | Verification | Pathway affinity | EC Numbers | Alternative pathway affinity
MET2      | 004860000048 | verified     | 1.00             | 2.3.1.31 (1.00) | -
MET6      | 007670000142 | verified     | 1.00             | 2.1.1.14 (1.00) | -
MET17     | 004440000819 | verified     | 1.00             | 4.2.99.- (1.00), 4.2.99.10 (1.00), 4.2.99.8 (0.99), 4.2.99.9 (0.45), 2.3.1.31 (0.59), 4.4.1.1 (0.37), 4.4.1.8 (0.38) | -
==========
YHR112C   | 003780000158 | unverified   | 0.00             | 2.3.1.31 (0.09), 4.2.99.10 (0.15), 4.2.99.8 (0.09), 4.2.99.9 (0.31), 4.4.1.1 (0.24), 4.4.1.8 (0.26) | GLYOXYLATE-BYPASS (0.99)
Cys3      | 003940001012 | unverified   | 0.00             | 4.4.1.1 (1.00), 4.2.99.10 (0.37), 4.2.99.8 (0.24), 4.2.99.9 (0.68), 2.3.1.31 (0.26), 4.4.1.8 (0.57) | HOMOCYSDEGR-PWY (0.50), PWY-801 (0.50)
Str3      | 004650000171 | unverified   | 0.00             | 4.4.1.8 (1.00), 4.2.99.10 (0.22), 4.2.99.8 (0.15), 4.2.99.9 (0.45), 4.4.1.1 (0.39), 2.3.1.31 (0.15) | HOMOCYSDEGR-PWY (0.50), PWY-801 (0.50)
YFR055W   | 003400000153 | unverified   | N/A (no profile) | 4.4.1.8 (1.00), 4.2.99.10 (0.18), 4.2.99.8 (0.11), 4.2.99.9 (0.34), 4.4.1.1 (0.28) | N/A (no profile)



Fig. 2. Homoserine methionine biosynthesis: the pathway diagram. The diagram was obtained from the MetaCyc database17 and augmented with the genes that are experimentally verified to participate in the pathway and their expression profiles. We notice a strong correlation between the genes catalyzing the last two reactions, while the first gene is less correlated. (Profiles shown are for the Cell Cycle dataset.)

over all experiments involving Rosetta, and most of the time-series experiments.

4. RELATED STUDIES

Metabolic pathway reconstruction has been an important direction in experimental research for many decades. This research has focused on a few well-studied organisms like E. coli and S. cerevisiae. The knowledge thus obtained was collected in databases like EMP/MPW18, MetaCyc17 and KEGG19.

Unfortunately, the experimental reconstruction of metabolic pathways is a long and costly process, and the information obtained is restricted to the studied organism. The breakthroughs in DNA sequencing, and thus the large number of sequenced and annotated organisms, led to the development of procedures for extending metabolic knowledge from the organisms in which it was experimentally studied to newly sequenced organisms. Methods like PathoLogic5, PUMA220, SEED21 and KEGG6 use sets of blueprints of experimentally elucidated metabolic pathways, and match the reactions in these blueprints with genes in the target organism based on their functional annotations. Sometimes not all enzyme functions needed to complete the pathway can be found in the original annotation. To cope with this situation, tools for predicting the missing enzymatic activity22-26 were added to complement the original annotation.

The analysis of the dynamic aspects of cellular processes, including metabolic pathways, was made possible by the increasing availability of high-throughput data, like expression data, interaction data and the subcellular location of proteins. Clustering is one of the favorite methods for the analysis of expression data, because genes that are similarly expressed might participate in the same cellular process. Consequently, a number of clustering methods were applied to expression data, starting with the seminal work in Ref. 27 (see Ref. 28 and Ref. 29 for a discussion of these methods).

Expression data is used in metabolic pathway analysis by a large number of studies. A large class of these studies extracts active pathways by scoring them based on the expression of the assigned genes30-34. Clustering of expression data is also used in metabolic pathway analysis. These methods try to elucidate the function of uncharacterized genes by mapping pathways to the clusters


to which these genes belong35, 36. The integration of metabolic information and expression data is further used to extract active pathways and processes in several ways. Expression data and metabolic network topology are combined in Ref. 37 to define a metric that is used for extracting clusters of genes corresponding to active pathways. Similarly, active pathways and their patterns of activity are extracted38 using a generalized form of canonical correlation analysis between kernels defined based on expression data and on the pathway graph. To predict operons, this approach is extended in Ref. 39 by also integrating the positions of the genes on the DNA.

A first step in metabolic network reconstruction is the inference of the more general cellular network. Several unsupervised prediction methods used models like Bayesian networks40 and Boolean networks41 for cellular network inference from expression data. A supervised method for cellular network inference is described in Ref. 42. The method is based on canonical correlation between a kernel function integrating expression data, interaction data, phylogenetic profiles and subcellular location, and a second kernel function defined based on the experimentally validated cellular network of yeast. This work is extended in Ref. 43 by enforcing chemical compatibility constraints for edges in the predicted cellular network.

Also related to our study are the studies on regulatory modules. Regulatory modules44 are sets of genes whose expression is controlled by the same group of control genes (a regulation program). Genes in a module are assumed to have a common function. It is also commonly assumed that enzymes in the same pathway are co-regulated, so there is an overlap between a pathway and a regulatory module. This holds for some of the known metabolic pathways; however, it is not always the case, and the relationship between pathways and modules can be one of several types, as presented in Figure 3:

(1) one to one - a pathway overlaps with a regulatory module, i.e. the genes participating in the pathway are co-regulated (see Figure 3a); e.g. homoserine methionine biosynthesis.

(2) many to one (module sharing) - a module is shared by several pathways, i.e. the genes participating in several pathways are co-regulated (see Figure 3b); e.g. valine biosynthesis and isoleucine biosynthesis.

(3) one to many - a pathway overlaps several modules, i.e. not all the genes participating in a pathway are co-regulated, but they can be grouped into a few co-regulated groups (see Figure 3c).

(4) mixed - a pathway overlaps several modules and shares some of them with other pathways (see Figure 3d); e.g. folic acid biosynthesis.


Fig. 3. Relationships between pathways and modules. (a) One to one: the pathway P1 overlaps module M1. (b) Many to one (module sharing): the pathways P1 and P2 share the module M1. (c) One to many: the pathway P1 overlaps both modules M1 and M2. (d) Mixed: the pathways P1 and P2 share the module M1, while P1 overlaps with module M2 as well.

Regulatory modules are explored in depth in Ref. 45, where several probabilistic models and inference algorithms are presented. For information on other related studies and an extended discussion, see the appendix of Ref. 7.

It is important to emphasize that these and the other methods described in this section are technically different from our approach, and most of them are targeted towards pathway analysis. Our current work focuses on a probabilistic framework for metabolic pathway assignment, which enables us to address problems like ambiguous assignments, protein complexes and missing enzymes in the same context. In addition to expression data and metabolic knowledge, our framework can be extended to use other types of high-throughput data.


5. DISCUSSION

In this paper we present an algorithm for probabilistic assignment of genes to pathways. Given a genome, our algorithm uses pathway blueprints (from MetaCyc or other sources), database annotations and similarity data (from Biozon) and genome-wide mRNA expression data to determine the characteristic expression profiles of pathways and assess the affinity of each gene with each pathway.

We test and demonstrate the power of our method on the Yeast genome. Although it is difficult to evaluate our algorithm, since the amount of experimentally validated data is limited, our results so far are significant and very encouraging; for most pathways the top assignment is also an experimentally verified one. The algorithm can also predict complexes, and it accommodates multifunctional enzymes.

While in this work we refer to a pathway as a well-defined entity, in reality this is not the case, and cellular processes are tightly related. Moreover, since cellular processes form a complex and highly connected network, it is difficult to delineate the boundaries of individual pathways, and the same process might be defined differently by different groups (see Ref. 7). This further motivates a probabilistic approach that assigns each gene with a certain probability to each pathway. Although we adhere to pathway diagrams that were determined in the literature, our procedures can be modified to redefine pathway boundaries so as to correlate better with regulatory modules (see section 4). Furthermore, our method can easily be applied to fill holes in pathways with uncharacterized or unassigned reactions (as discussed in section 3 of the Supplementary Material).

It should be noted that while all pathways included in our analysis can be associated with Yeast genes based on annotation and functional prediction, some of these pathways might not exist in Yeast after all. An example is cysteine biosynthesis II, which exists in mammals but not in Yeast. In Yeast, cysteine is obtained from homoserine, and the reactions making up this pathway overlap reactions in two other pathways (therefore, in our analysis, all the reactions in this pathway have violated constraints). In this view, our method can also help to validate whether certain pathways exist in a given genome.

As our examples demonstrate, clustering alone cannot solve the pathway reconstruction problem, and it is necessary to add constraints and prior knowledge to generate effective pathway models. This emphasizes the fundamental difference between our work and studies that are based on clustering of expression profiles. Furthermore, our results also indicate that even with these constraints and prior knowledge, expression data alone cannot discover all pathways; additional datasets, such as interaction data and subcellular location data, are necessary to improve the models, and we intend to integrate such datasets in future versions of our algorithm.

Besides the aforementioned extensions, there are other improvements and future directions that we would like to pursue. For example, we would like to improve function prediction. Currently, this is done based on database annotations or sequence similarity. However, the latter is problematic, and genes are often assigned by similarity to multiple enzymatic reactions. To address this problem, we intend to develop better methods to characterize enzymatic domains, using a methodology similar to the one we introduced in Ref. 46 and Ref. 24.

Finally, our algorithm can be applied to other genomes given a compatible expression dataset; using an analysis similar to the one reported here, we have started mapping pathways in the human genome.

6. SUPPLEMENTARY MATERIAL

A detailed description of our algorithm, the evaluation methodology and additional examples are available in the online supplementary material at biozon.org/ftp/data/papers/pathway-assignment-em/.

References

1. B. Schwikowski, P. Uetz, and S. Fields. A network of protein-protein interactions in yeast. Nat Biotechnol, 18(12):1257-1261, 2000.

2. Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S.-L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sorensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180-183, January 2002.

3. N. C. Duarte, M. J. Herrgard, and B. O. Palsson. Reconstruction and Validation of Saccharomyces cerevisiae iND750, a Fully Compartmentalized Genome-Scale Metabolic Model. Genome Res., 14(7):1298-1309, 2004.

4. J. Forster, I. Famili, P. Fu, B. O. Palsson, and J. Nielsen. Genome-Scale Reconstruction of the Saccharomyces cerevisiae Metabolic Network. Genome Res., 13(2):244-253, 2003.

5. S. M. Paley and P. D. Karp. Evaluation of computational metabolic-pathway predictions for Helicobacter pylori. Bioinformatics, 18(5):715-724, 2002.

6. H. Bono, H. Ogata, S. Goto, and M. Kanehisa. Reconstruction of Amino Acid Biosynthesis Pathways from the Complete Genome Sequence. Genome Res., 8(3):203-210, 1998.

7. L. Popescu and G. Yona. Automation of gene assignments to metabolic pathways using high-throughput expression data. BMC Bioinformatics, 6(1):217, 2005.

8. J. Stenesh. Dictionary of Biochemistry and Molecular Biology (2nd Edition). John Wiley & Sons, 1989.

9. C. J. Krieger, P. Zhang, L. A. Mueller, A. Wang, S. Paley, M. Arnaud, J. Pick, S. Y. Rhee, and P. D. Karp. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res., 32(Database issue):D438-442, 2004.

10. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, 2nd ed. edition, October 2000.

11. K. R. Christie, S. Weng, R. Balakrishnan, M. C. Costanzo, K. Dolinski, S. S. Dwight, S. R. Engel, B. Feierbach, D. G. Fisk, J. E. Hirschman, E. L. Hong, L. Issel-Tarver, R. Nash, A. Sethuraman, B. Starr, C. L. Theesfeld, R. Andrada, G. Binkley, Q. Dong, C. Lane, M. Schroeder, D. Botstein, and J. M. Cherry. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res., 32(Database issue):D311-314, 2004.

12. P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol. Biol. Cell, 9(12):3273-3297, 1998.

13. T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional Discovery via a Compendium of Expression Profiles. Cell, 102(1):109-126, July 2000.

14. A. Birkland and G. Yona. BIOZON: a system for unification, management and analysis of heterogeneous biological data. BMC Bioinformatics, 7:70, 2006. URL biozon.org.

15. P. D. Karp, S. Paley, and P. Romero. The Pathway Tools software. Bioinformatics, 18(Suppl 1):S225-S232, 2002.

16. G. Yona, W. Dirks, S. Rahman, and D. Lin. Effective similarity measures for expression profiles. Bioinformatics, 2006. in press.

17. BioCyc. BioCyc database, 2005. URL http://biocyc.org/.

18. E. Selkov, Jr., Y. Grechkin, N. Mikhailova, and E. Selkov. MPW: the Metabolic Pathways Database. Nucleic Acids Res., 26(1):43-45, 1998.

19. M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic Acids Res., 32(Database issue):D277-280, 2004.

20. N. Maltsev, E. Glass, D. Sulakhe, A. Rodriguez, M. H. Syed, T. Bompada, Y. Zhang, and M. D'Souza. PUMA2: grid-based high-throughput analysis of genomes and metabolic pathways. Nucl. Acids Res., 34(suppl 1):D369-372, 2006.

21. SEED. The SEED: an annotation/analysis tool provided by FIG, 2005. URL http://theseed.uchicago.edu/FIG/index.cgi.

22. H. Bono, S. Goto, W. Fujibuchi, H. Ogata, and M. Kanehisa. Systematic Prediction of Orthologous Units of Genes in the Complete Genomes. Genome Inform Ser Workshop Genome Inform, 9:32-40, 1998.

23. M. Green and P. Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5(1):76, 2004.

24. U. Syed and G. Yona. Using a mixture of probabilistic decision trees for direct prediction of protein function. In Proceedings of the seventh annual international conference on Computational molecular biology, pages 289-300. ACM Press, 2003.

25. I. Shah. Predicting enzyme function from sequence. PhD thesis, George Mason University, 1999.

26. P. Kharchenko, D. Vitkup, and G. M. Church. Filling gaps in a metabolic network using expression information. Bioinformatics, 20(suppl 1):i178-i185, 2004.

27. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95(25):14863-14868, December 1998.

28. F. Valafar. Pattern Recognition Techniques in Microarray Data Analysis: A Survey. Ann NY Acad Sci, 980(1):41-64, 2002.

29. D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370-1386, 2004.

30. P. Grosu, J. P. Townsend, D. L. Hartl, and D. Cavalieri. Pathway Processor: A Tool for Integrating Whole-Genome Expression Results into Metabolic Networks. Genome Res., 12(7):1121-1126, 2002.

31. R. Kuffner, R. Zimmer, and T. Lengauer. Pathway analysis in metabolic databases via differential metabolic display (DMD). Bioinformatics, 16(9):825-836, 2000.

32. P. Pavlidis, D. Lewis, and W. Noble. Exploring gene expression data with class scores. In Pac Symp Biocomput, pages 474-485, 2002.

33. S. Doniger, N. Salomonis, K. Dahlquist, K. Vranizan, S. Lawlor, and B. Conklin. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol, 4(1):R7, 2003.

Page 308: Computational Systems Bioinformatic Csb2006 Conference Proceedings 2006

291

34. J. Rahnenfuhrer, F. S. Domingues, J. Maydt, and T. Lengauer. Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data. Statistical Applications in Genetics and Molecular Biology, 3(1), 2004.

35. M. Nakao, H. Bono, S. Kawashima, T. Kamiya, K. Sato, S. Goto, and M. Kanehisa. Genome-scale Gene Expression Analysis and Pathway Reconstruction in KEGG. Genome Inform Ser Workshop Genome Inform., 10:94-103, 1999.

36. J. van Helden, D. Gilbert, L. Wernisch, M. Schroeder, and S. Wodak. Applications of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data. Lecture Notes in Computer Science, 2066:155-172, 2001.

37. D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(Suppl 1):S145-154, 2002.

38. J. P. Vert and M. Kanehisa. Extracting active pathways from gene expression data. Bioinformatics, 19(Suppl 2):ii238-ii244, 2003.

39. Y. Yamanishi, J.-P. Vert, A. Nakaya, and M. Kanehisa. Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis. Bioinformatics, 19(Suppl 1):i323-i330, 2003.

40. N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601-620, 2000.

41. T. Akutsu, S. Miyano, and S. Kuhara. Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function. Journal of Computational Biology, 7(3-4):331-343, 2000.

42. Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20(suppl 1):i363-i370, 2004.

43. Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21(suppl 1):i468-i477, 2005.

44. E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet, 34(2):166-176, 2003.

45. E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet, 34(2):166-176, 2003.

46. N. Nagarajan and G. Yona. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics, 20:1335-1360, 2004.

47. R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2:65-73, July 1998.

48. S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25(17):3389-3402, 1997.

49. E. M. Voorhees and L. P. Buckland, editors. The Fourteenth Text REtrieval Conference Proceedings (TREC 2005), 2005.

50. J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29-36, 1982.


CLASSIFICATION OF DROSOPHILA EMBRYONIC DEVELOPMENTAL STAGE RANGE BASED ON GENE EXPRESSION PATTERN IMAGES

Jieping Ye^{a,b,*}, Jianhui Chen^{a,b}, Qi Li^{c}, Sudhir Kumar^{a,d}

^a Center of Evolutionary Functional Genomics, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5301. {jieping.ye,jianhui.chen,s.kumar}@asu.edu.
^b Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287.
^c Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716. [email protected].
^d School of Life Sciences, Arizona State University, Tempe, AZ 85287.

The genetic analysis of spatial patterns of gene expression relies on the direct visualization of the presence or absence of gene products (mRNA or protein) at a given developmental stage (time) of a developing animal. The raw data produced by these experiments include images of Drosophila embryos showing a particular gene expression pattern revealed by a gene-specific probe. The identification of genes showing spatial and temporal overlaps in their expression patterns is fundamentally important to formulating and testing gene interaction hypotheses. Comparison of expression patterns is most biologically meaningful when images from a similar time point (developmental stage range) are compared. In this paper, we propose a computational system for automatic developmental stage classification by image analysis. This classification system uses image textural properties at a sub-block level across developmental stages as distinguishing features. Gabor filters are applied to extract features of image sub-blocks. Robust implementations of Linear Discriminant Analysis (LDA) are employed to extract the most discriminant features for the classification. Experiments on a collection of 2705 expression pattern images from early stages show that the proposed system significantly outperforms previously reported results in terms of classification accuracy, which shows the high promise of the proposed system in reducing the time taken by biologists to assign the embryo stage range.

1. INTRODUCTION

Gene expression in a developing embryo is modulated in particular cells in a time-specific manner, which leads to the differentiation of cell fates. Research efforts into the spatial and temporal characteristics of gene expression patterns of the model organism Drosophila melanogaster (the fruit fly) have been at the leading edge of scientific investigations into the fundamental principles of animal development.5, 16

These studies have now established that the same gene (or its product) may be utilized in different ways at different times during development, and that multiple genes show similar expression patterns in one or more developmental stages. The genetic analysis of spatial patterns of gene expression relies on the direct visualization of the presence or absence of gene products (mRNA or protein) at a given developmental stage (time) of a developing animal. The raw data produced from these experiments include images of the Drosophila embryo showing a particular gene expression pattern revealed by a gene-specific probe. The knowledge of the spatial overlap of patterns of gene expression is important to understanding the interplay of genes in different stages of development.5, 16

Estimation of the pattern overlap is most biologically meaningful when images from a similar time point (developmental stage range) are compared. Stages in Drosophila melanogaster development denote the time after fertilization at which certain specific events occur in the developmental cycle. Embryogenesis is traditionally divided into a series of consecutive stages distinguished by morphological markers.1 The duration of developmental stages varies from 15 minutes to more than 2 hours; therefore, the stages of development are differentially represented in embryo collections. Some consecutive stages, although morphologically distinguishable, differ very little in terms of changes in gene expression, whereas other stage transitions, such as the onset of zygotic transcription or organogenesis, are accompanied by massive changes in gene expression.1

The first 16 stages of embryogenesis are divided into six convenient stage ranges (stages 1-3, 4-6, 7-8, 9-10, 11-12 and 13-16). In recent high-throughput experiments,18 each image is assigned to one of the stage ranges manually.

* Corresponding author.


In this paper, we examine how image analysis can be used for automatic stage range determination (classification). In order to distinguish between different stage ranges of development, we need to use embryo morphology to extract features. Across the various developmental stages, a distinguishing feature is image textural properties at a sub-block level, because image texture at the sub-block level changes as embryonic development progresses (Fig. 1). The staining procedure helps illuminate the morphological features of the transparent embryos as well. We thus apply Gabor filters7 to extract the textural features of image sub-blocks. Since not all features are useful for stage range discrimination, we apply robust implementations of Linear Discriminant Analysis (LDA)8, 10, 11 to extract the most discriminant features, which are linear combinations of the textural features derived from the Gabor filters. Finally, the Nearest-Neighbor (NN) algorithm and Support Vector Machines (SVM)4, 6, 19 are employed for classification (stage range determination). Our experiments on a collection of 2705 expression pattern images from early stages show that the proposed system achieves about 86% accuracy when less than 10% of the data is used for training, which is significantly higher than the previously reported result (about 73%).11


Fig. 1. Spatial and temporal view of Drosophila images across different stages (1-8) of development (of the same gene Kr). The figure shows the morphological changes at the anterior and posterior ends of the embryo during stages 4-6, and the morphological changes in the middle regions of the embryo during stages 7-8. The textural features (based on the morphology of the embryo) are distinct from the gene expression, which is indicated by the blue staining.

1.1. Raw image pre-processing

We used a collection of 2705 embryo images from three different developmental stage ranges (1-3, 4-6, and 7-8) in our study. The raw images of Drosophila embryos were collected from the Berkeley Drosophila Genome Project (BDGP).18 Gene expression pattern images were in different sizes and orientations. The image standardization procedure from Ref. 16 was applied, and all images were standardized to the size of 128 × 320.

Next, we applied histogram equalization13 to improve the contrast and obtain an approximately uniform histogram distribution, while still keeping the detailed information of the processed images.
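A minimal sketch of histogram equalization for an 8-bit grayscale image, assuming intensities in 0-255, is shown below; the paper presumably used a standard implementation, so this is only illustrative.

```python
import numpy as np

def equalize_histogram(img):
    """Histogram equalization of an 8-bit grayscale image: remap each
    intensity through the normalized cumulative histogram."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(255.0 * cdf).astype(np.uint8)       # intensity lookup table
    return lut[img]
```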

Finally, we applied Gabor filters7 to extract the textural features of image sub-blocks. Gabor filters are well known for texture analysis,2 as they are effective in extracting information in different spatial frequency ranges and orientations. We found the textural features obtained via Gabor filters very effective for stage range classification (see Section 4). The number of textural features extracted via Gabor filters is 384. Since not all features are useful for stage discrimination, we applied Linear Discriminant Analysis (LDA) to extract the most discriminant features before classification.
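The sketch below illustrates sub-block Gabor feature extraction: build a bank of Gabor kernels, filter the standardized 128 × 320 image, and average the response magnitude within each sub-block. The kernel parameters, block size and resulting feature count are illustrative assumptions and are not the paper's exact filter-bank settings (which yield 384 features).

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, sigma=4.0, size=31):
    """Real Gabor kernel: a sinusoid at spatial frequency `freq` (cycles per
    pixel) and orientation `theta`, windowed by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # rotated coordinate
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * freq * xr)

def subblock_gabor_features(img, freqs, thetas, block=(32, 32)):
    """Mean filter-response magnitude of each image sub-block, concatenated
    into one feature vector per image."""
    feats = []
    for f in freqs:
        for t in thetas:
            resp = np.abs(fftconvolve(img, gabor_kernel(f, t), mode='same'))
            for i in range(0, img.shape[0], block[0]):
                for j in range(0, img.shape[1], block[1]):
                    feats.append(resp[i:i + block[0], j:j + block[1]].mean())
    return np.array(feats)
```

For example, with a 128 × 320 image, 32 × 32 blocks and a bank of a few frequencies and orientations, each image yields a fixed-length textural feature vector that can be fed to the LDA step described next.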

2. LINEAR DISCRIMINANT ANALYSIS

Linear Discriminant Analysis (LDA)8, 10, 14 is a well-known method for feature extraction that projects high-dimensional data onto a low-dimensional space so as to maximize class separability. The optimal projection or transformation in classical LDA is obtained by minimizing the within-class distance and maximizing the between-class distance simultaneously, thus achieving maximum class discrimination.

Given a training dataset consisting of n data points (images), $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^m$, from k different classes, classical LDA aims to compute the transformation $G \in \mathbb{R}^{m \times \ell}$ ($\ell < m$) that maps each $x_i$ to a vector $y_i$ in the $\ell$-dimensional space as follows:

$$G : x_i \in \mathbb{R}^m \rightarrow y_i = G^T x_i \in \mathbb{R}^{\ell}.$$

In classical LDA, the transformation matrix G is computed so that the class structure is preserved. The class structure is quantified by three scatter matrices, called the within-class scatter $S_w$, the between-class scatter $S_b$, and the total scatter $S_t$, defined below.

Assume that there are k classes in the dataset. Suppose $c_i$, $S_i$, and $n_i$ are the centroid, covariance matrix, and sample size of the i-th class, respectively,


and c is the global centroid. Define the matrices

$$H_w = \frac{1}{\sqrt{n}}\left[(A_1 - c_1 e^T), \cdots, (A_k - c_k e^T)\right], \quad (1)$$

$$H_b = \frac{1}{\sqrt{n}}\left[\sqrt{n_1}(c_1 - c), \cdots, \sqrt{n_k}(c_k - c)\right], \quad (2)$$

$$H_t = \frac{1}{\sqrt{n}}(A - c e^T), \quad (3)$$

where $A = [x_1, \cdots, x_n]$ is the data matrix, $A_i$ is the data matrix of the i-th class, $n_i$ is the size of the i-th class, and e is the vector of all ones. Then the three scatter matrices are defined as follows:10

$$S_w = H_w H_w^T, \quad S_b = H_b H_b^T, \quad \text{and} \quad S_t = H_t H_t^T.$$

It follows from the definition that trace($S_w$) measures the within-class cohesion, trace($S_b$) measures the between-class separation, and trace($S_t$) measures the variance of the dataset, where the trace12 of a square matrix is the sum of its diagonal entries. It is easy to verify that $S_t = S_b + S_w$.
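The three scatter matrices can be computed directly from Eqs. (1)-(3); a NumPy sketch with samples as the columns of A is given below (an illustration, not the authors' code).

```python
import numpy as np

def scatter_matrices(A, labels):
    """Within-class, between-class and total scatter of data matrix A
    (one sample per column), following Eqs. (1)-(3)."""
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)              # global centroid
    Sw, Sb = np.zeros((m, m)), np.zeros((m, m))
    for k in np.unique(labels):
        Ak = A[:, labels == k]
        ck = Ak.mean(axis=1, keepdims=True)        # class centroid
        D = Ak - ck
        Sw += D @ D.T
        Sb += Ak.shape[1] * (ck - c) @ (ck - c).T
    St = (A - c) @ (A - c).T / n                   # satisfies St = Sb + Sw
    return Sw / n, Sb / n, St
```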

The scatter matrices in the reduced space (projected by G) are G^T S_w G, G^T S_b G, and G^T S_t G, respectively. The optimal transformation G in classical LDA is computed by maximizing the following objective function:8, 10, 14

$$f_1(G) = \mathrm{trace}\left((G^T S_w G)^{-1} G^T S_b G\right), \qquad (4)$$

subject to the constraint that $G^T S_w G = I_\ell$, where $I_\ell$ is the identity matrix of size ℓ. The optimal solution is given by the eigenvectors of $S_w^{-1} S_b$ corresponding to the nonzero eigenvalues, provided that S_w is nonsingular. Since S_t = S_b + S_w, the solution can also be obtained by computing the eigenvectors of $S_t^{-1} S_b$, assuming S_t is nonsingular. The reduced dimension, ℓ, is no larger than k - 1, where k is the number of classes, as the rank of S_b is bounded from above by k - 1. In practice, ℓ often equals k - 1. Note that the total scatter matrix is a multiple of the sample covariance matrix and is required to be nonsingular. If a small number of expression pattern images is used in the training set, all scatter matrices in question can be singular. This is known as the singularity or undersampled problem.15

We have recently developed Uncorrelated LDA (ULDA)21 as an extension of classical LDA. A key property of ULDA is that the features in the transformed space of ULDA are uncorrelated with each other, thus reducing the redundancy in the transformed (dimension-reduced) space. Furthermore, ULDA is applicable even when all scatter matrices are singular, thus overcoming the singularity problem. The optimal transformation G of ULDA can be computed by maximizing the following objective function:

$$f_2(G) = \mathrm{trace}\left((G^T S_t G)^{+} G^T S_b G\right), \qquad (5)$$

subject to the constraint that $G^T S_t G = I_\ell$, where $M^{+}$ denotes the pseudo-inverse12 of a matrix M. The computation of the optimal transformation of ULDA is based on the simultaneous diagonalization of the three scatter matrices.21 Let X be the matrix that simultaneously diagonalizes S_b, S_w, and S_t. That is,

$$X^T S_b X = D_b, \quad X^T S_w X = D_w, \quad \text{and} \quad X^T S_t X = D_t, \qquad (6)$$

where D_b, D_w, and D_t are diagonal, and the diagonal entries of D_b are sorted in non-increasing order. Then $G = X_q$ solves the optimization problem in Eq. (5), where $X_q$ consists of the first q columns of X with $q = \mathrm{rank}(S_b)$.

ULDA has been applied successfully in several applications, including microarray gene expression data analysis.22 However, we have observed that for data containing a large amount of noise, ULDA can be less effective.21 We employ the regularization technique to improve the robustness of ULDA; the resulting algorithm is called Regularized ULDA (RULDA). Regularization is commonly used to stabilize the sample covariance matrix estimation and improve the classification performance.9 Regularization is also the key to many other machine learning methods such as Support Vector Machines (SVM),19 spline fitting,20 etc. In RULDA, a regularization parameter λ is added to the diagonal elements of the total scatter matrix S_t as $S_t + \lambda I_m$, where $I_m$ is the identity matrix of size m. The optimal transformation G of RULDA is given by computing the eigenvectors of

$$(S_t + \lambda I_m)^{-1} S_b. \qquad (7)$$

The performance of RULDA depends critically on the choice of an appropriate regularization value λ: a large λ may significantly disturb the information in S_t, while a small λ may not be effective enough to solve the singularity problem. Cross-validation is commonly used to estimate the optimal λ from a finite set $\{\lambda_1, \ldots, \lambda_N\}$ of N candidates. We used N = 100 in our experiments. A small sketch of the RULDA computation is given below.

With the discriminant features extracted via LDA, the Nearest-Neighbor (NN) algorithm and Support Vector Machines (SVM) are applied for classification.

3. K-NEAREST NEIGHBOR AND SUPPORT VECTOR MACHINES FOR CLASSIFICATION

K-Nearest Neighbor (KNN)8, 14 is a non-parametric classifier; theoretical results show that its error is asymptotically at most twice the Bayes error rate. KNN finds the K nearest neighbors among the training samples based on a certain distance measure, and uses the categories of the K neighbors to determine the category of the test sample. The parameter K for the number of neighbors can be selected by cross-validation. In our experiments, K is set to 1 and the resulting algorithm is called Nearest-Neighbor (NN).

Support Vector Machines (SVM)3, 6, 19 are state-of-the-art classifiers for many classification problems.3 SVM finds a maximum-margin separating hyperplane between two classes. It leads to a straightforward learning algorithm that can be reduced to a convex optimization problem. The formulation can be extended to multi-class classification.6, 17 SVM is attractive due to its well-developed theory.19 Another appealing feature of SVM classification is the sparseness of its representation of the decision boundary: the maximum-margin hyperplane can be represented as a linear combination of data points, and those training examples that receive nonzero weights are called the support vectors, since removing them would change the location of the separating hyperplane. Kernels6, 17 can be used to extend SVM to classify nonlinearly separable data. We apply linear SVM in our experiments; a minimal sketch of this classification step follows.
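For concreteness, the classification step might look as follows with scikit-learn (an assumed implementation for this sketch; the paper does not name a software package):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def classify(train_feats, train_labels, test_feats):
    # 1-NN and linear SVM applied to the LDA-reduced features.
    nn = KNeighborsClassifier(n_neighbors=1).fit(train_feats, train_labels)
    svm = LinearSVC().fit(train_feats, train_labels)
    return nn.predict(test_feats), svm.predict(test_feats)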

4. RESULTS AND DISCUSSIONS

In this section, we experimentally evaluate the proposed system on embryonic developmental stage range classification. A collection of 2705 embryo images from three different developmental stage ranges (1-3, 4-6, 7-8) was used in our study.

We performed our study by randomly splitting the whole dataset into training and test sets. The dataset was partitioned randomly into a training set consisting of n images (n denotes the training sample size) and a test set consisting of the remaining 2705 - n images. We varied the training sample size n from 30 to 540. To reduce the variability, the splitting was repeated 50 times and the resulting accuracies were averaged.

We first examined the effect of Histogram Equalization (HE) and Gabor Filters (GF) on stage range classification. To this end, we ran the experiments under four different conditions: "NO" without any pre-processing, "HE" with Histogram Equalization, "GF" with Gabor Filters, and "HE+GF" with both Histogram Equalization and Gabor Filters. The classification result (accuracy in percentage) using SVM as the classifier is shown in Table 1. We can observe that both the HE and GF operations are effective in classification, while GF is more effective than HE. In the following experiments, all images were pre-processed via both operations.

Table 1. Effect of image pre-processing operations on stage range classification (accuracies shown in percentage). NO: No pre-processing; HE: Histogram Equalization; and GF: Gabor Filters.

             Image pre-processing operation
size n     NO       HE       GF       HE + GF
30         53.49    61.22    67.91    76.31
60         62.04    65.27    77.77    80.34
90         67.17    68.21    79.91    82.73

Next, we evaluated the proposed system on stage range classification. We employed both RULDA and ULDA to extract discriminant features before applying NN and SVM for classification. We can observe from Table 2 that RULDA plus NN and RULDA plus SVM achieve the best overall performance. When less than 10% of the images are used in the training set, they achieve about 86% accuracy, significantly higher than the previously reported result11 (about 73%). The key feature of the proposed computational system, in comparison with the previous work, is the inclusion of the feature extraction step via Regularized ULDA (RULDA), as well as the use of SVM as the classifier. Experimental results in Table 2 show the effectiveness of both RULDA and SVM for stage range classification.

Table 2. Comparison of two feature extraction algorithms (ULDA and RULDA) and two classifiers (NN and SVM) on classification accuracy and standard deviation (in parentheses) in percentage.

training              ULDA                          RULDA
sample size n    NN            SVM            NN            SVM
30               76.94 (3.48)  76.94 (3.48)   76.55 (4.33)  76.31 (4.18)
60               79.89 (2.33)  79.89 (2.33)   80.91 (3.08)  80.34 (2.93)
90               80.68 (2.11)  80.68 (2.11)   82.71 (3.09)  82.73 (2.29)
180              77.22 (2.62)  77.22 (2.62)   85.74 (1.82)  86.10 (1.65)
300              66.30 (2.67)  66.30 (2.67)   86.60 (1.18)  87.37 (1.58)
480              68.29 (2.19)  68.69 (2.39)   87.48 (0.99)  88.75 (1.16)
540              73.80 (2.01)  73.90 (2.20)   87.24 (1.26)  88.91 (1.16)

In general, as the training sample size n increases, the classification accuracy of both RULDA plus NN and RULDA plus SVM increases. We observe that ULDA does not perform well when the training sample size n is large. The rationale behind this may be that ULDA enforces minimum redundancy (uncorrelated features) in the transformed space and is susceptible to overfitting. The expression pattern images may contain a large amount of noise due to errors encountered in the high-throughput experiments and in image pre-processing. RULDA significantly improves on ULDA in these cases, which shows the effectiveness of the regularization applied in RULDA. The regularization parameter in RULDA is estimated via cross-validation using the training data. When the training set is large, the estimation of the regularization value is more reliable and more robust to the noise. This explains the relatively larger difference between RULDA and ULDA in classification when the training sample size n is large. Overall, RULDA plus SVM performs slightly better than RULDA plus NN, especially when the training sample size n is large.

Recall that RULDA projects the data onto $\mathbb{R}^{k-1}$, where k is the number of classes in the dataset. There are k = 3 stage ranges (classes) in our experiments, and all images are projected onto a 2D plane. To examine the effectiveness of the projection, we ran RULDA on a training set of 180 images and applied the projection to a test set of 2525 images. In Fig. 2, we show the projection of a subset of test images (for clarity of presentation). We depict each test image by the corresponding stage range (1, 2, and 3). Overall, the three stage ranges were separated well, which shows that the discriminant features derived via RULDA are effective in stage range discrimination. We observe that stage ranges 1 and 2 are connected, as are stage ranges 2 and 3, while stage ranges 1 and 3 are better separated. Note that embryonic development is a continuous process, where the cutting points (boundaries) between different stages are assigned manually. These observations are consistent with the data distribution (after projection) shown in Fig. 2.

Fig. 2. Visualization of a subset of test images after the projection onto the 2D plane via RULDA. Images from the first range (1-3), the second range (4-6), and the third range (7-8) are depicted by "1", "2", and "3", respectively.

5. CONCLUSIONS

We present in this paper a computational system for automatic developmental stage classification by image analysis. This classification system applies Gabor filters to extract textural features of image sub-blocks. Uncorrelated LDA (ULDA) and Regularized ULDA (RULDA) are employed to extract the most discriminant features for the classification. Experiments on a collection of 2705 expression pattern images from early stages show that the proposed system significantly outperforms previously reported results in terms of classification accuracy. The experimental results demonstrate the promise of the proposed computational system for embryonic developmental stage range classification. As future work, we plan to test the proposed system using a much larger collection of expression pattern images, including images from all stage ranges.

Acknowledgement

This research is sponsored by the Center of Evolutionary Functional Genomics of the Biodesign Institute at Arizona State University and the National Institutes of Health (NIH).

References

1. M. Bownes. A photographic study of development in the living embryo of Drosophila melanogaster. J Embryol Exp Morphol, 33:789-801, 1975.

2. B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837-842, 1996.

3. C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.

4. C. J. C. Burges. Geometric methods for feature extraction and dimensional reduction - a guided tour. The Data Mining and Knowledge Discovery Handbook, pages 59-92, 2005.

5. S. B. Carroll, J. K. Grenier, and S. D. Weatherbee. From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design. 2nd ed. Malden, MA: Blackwell Pub, 2005.

6. N. Cristianini and J. Shawe-Taylor. Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.

7. J. G. Daugman. Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(7):1169-1179, 1988.

8. R. O. Duda, P. E. Hart, and D. Stork. Pattern Classification. Wiley, 2000.

9. J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.

10. K. Fukunaga. Introduction to Statistical Pattern Classification. Academic Press, San Diego, California, USA, 1990.

11. M. Gargesha, J. Yang, B. Van Emden, S. Panchanathan, and S. Kumar. Automatic annotation techniques for gene expression images of the fruit fly embryo. In Proceedings of SPIE (Visual Communications and Image Processing), pages 576-583, 2005.

12. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, USA, third edition, 1996.

13. R. C. Gonzalez and R. E. Woods. Digital Image Processing, Second Edition. Addison-Wesley, 1993.

14. T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

15. W. J. Krzanowski, P. Jonathan, W. V. McCarthy, and M. R. Thomas. Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Applied Statistics, 44:101-115, 1995.

16. S. Kumar, K. Jayaraman, S. Panchanathan, R. Gurunathan, A. Marti-Subirana, and S. J. Newfeld. BEST: A novel computational approach for comparing gene expression patterns from early stages of Drosophila melanogaster development. Genetics, 162(4):2037-2047, 2002.

17. B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

18. P. Tomancak et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol, 3(12):research0088.1-14, 2002.

19. V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

20. G. Wahba. Spline Models for Observational Data. Society for Industrial & Applied Mathematics, 1998.

21. J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483-502, 2005.

22. J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. Computational Biology and Bioinformatics, 1(4):181-190, 2004.


EVOLUTION VERSUS "INTELLIGENT DESIGN": COMPARING THE TOPOLOGY OF PROTEIN-PROTEIN INTERACTION NETWORKS TO THE INTERNET

Q. Yang, G. Siganos, M. Faloutsos and S. Lonardi*

Department of Computer Science and Engineering, University of California,

Riverside, CA 92521, USA

* Email: [email protected]

Recent research efforts have made available genome-wide, high-throughput protein-protein interaction (PPI) maps for several model organisms. This has enabled the systematic analysis of PPI networks, which has become one of the primary challenges for the systems biology community. In this study, we attempt to better understand the topological structure of PPI networks by comparing them against man-made communication networks, and more specifically, the Internet.

Our comparative study is based on a comprehensive set of graph metrics. Our results exhibit an interesting dichotomy. On the one hand, both networks share several macroscopic properties such as the scale-free and small-world properties. On the other hand, the two networks exhibit significant topological differences, such as the cliquishness of the highest degree nodes. We attribute these differences to the distinct design principles and constraints that the two networks are assumed to satisfy. We speculate that evolutionary constraints favoring survivability and diversification are behind the building process of PPI networks, whereas the leading force in shaping the Internet topology is a decentralized optimization process geared towards efficient node communication.

1. INTRODUCTION

From an engineering perspective, cells are complex systems that process information. The main mechanism by which cells are able to process information is through protein-protein interactions (PPI). Cellular proteins either aggregate in protein complexes or act concertedly to assemble, store and transduce biological information in an efficient and reliable way. Pathways of interactions between proteins can be found in essentially every cellular process, e.g., signal transduction cascades, metabolism, cell cycle control, apoptosis. Recently, a number of experimental, genome-wide, high-throughput studies have been conducted to determine protein-protein interactions and the consequent interaction networks in several model organisms (see, e.g., Refs. 1 and 2). They provide a unique opportunity to study the complex dynamics of "message passing" in cellular networks at the genome-scale.

The overarching goal of our study is to understand better the topological properties and structure of PPI networks. To do this, along the lines of comparative genomics, we propose to compare PPI networks against one of the largest and most successful communication networks, the Internet. The main question we tackle in this paper is: how different or similar are the two types of networks? This comparison can constitute a valuable reference when attempting to understand the design principles that underlie PPI networks. Interestingly, PPI networks could be thought of as a type of communication network, since the protein interactions implicitly convey information on biological processes. Clearly, the building process behind the two networks is very different. PPI networks resulted as a byproduct of processes at the evolutionary scale and are constrained by the laws of physics and chemistry. The Internet was built to optimize communication efficiency through a decentralized process and under the constraints imposed by technological, geographical, social and economical factors.

In recent years, several research groups have studied large complex systems and their topologies, from social networks to the structure of the web. In what follows, we provide a quick overview of the most related previous work on PPI and Internet topologies. The rapidly developing theoretical models for complex networks, such as the ER random model 3, and the small-world 4, scale-free 5 and hierarchical network models 6, have greatly influenced the analysis of the topology of complex biological networks (see, e.g., Refs. 7, 8 and 9). PPI networks have been characterized as scale-free networks that follow a power-law degree distribution with a sharp cutoff for large degrees 7. Recently, it has been shown that PPI networks show hierarchical organization 9. The literature on the analysis of the Internet topology^a is even richer than the one for PPI networks. A recent study provides a good overview of this body of work 10. The study in this field was jump-started in 1999, when Faloutsos et al. 5 used power-laws to characterize the degree distribution of the AS-level Internet topology. It has also been argued that the Internet topology is organized with a natural semantic proximity, such as geography or business interests 11, and exhibits a hierarchical structure 12.

The contribution of this work is an extensive topological comparison of PPI and Internet networks. On the one hand, both network types exhibit some similar properties, such as a skewed degree distribution. On the other hand, the networks have been built by completely different processes, over very different time scales, and to optimize different criteria. Our study uses the most important and diverse graph metrics that have been proposed and used in a wide range of studies in multiple disciplines. To our knowledge, this is the first such extensive study of these two types of networks.

We classified the results of our study into six categories, namely, (1) connectivity, (2) small-world, (3) modular/hierarchical organization, (4) entropy, (5) communication efficiency, and (6) robustness. Some of our findings are somewhat surprising; they are discussed in Section 4 and summarized in Section 5. We speculate that the differences found by our study can be attributed to the distinctive objectives and constraints that the two types of networks are supposed to satisfy. We conjecture that the goals are robustness and survivability for cellular networks, and communication efficiency in man-made networks.

2. NOTATIONS AND METRICS

First, we briefly review all the metrics used in this study. Formally, a graph metric is a function $M : \mathcal{G} \to \mathbb{R}^t$, where $\mathcal{G}$ is the space of all possible graphs and t is a positive integer.

2.1. Connectivity

In the domain of connectivity metrics, we selected average degree and the degree distribution to measure the global connectivity, and the rich club connectivity 13 to measure the core connectivity.

Definition 2.1. The average degree of a graph G = (V, E) is defined as $\bar{k} = 2m/n$, where $n = |V|$ and $m = |E|$.

Definition 2.2. The degree distribution of a graph G = (V, E) is a function $P : \{0, \ldots, k_{\max}\} \to [0, 1]$, where P(k) is the fraction of the vertices in G that have degree k for $0 \le k \le k_{\max}$, and $k_{\max}$ is the largest degree in G.
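Both metrics are straightforward to compute; below is a small Python sketch using networkx (our choice of library, not one mentioned in the paper).

import networkx as nx

def average_degree(G):
    # Definition 2.1: average degree = 2m/n.
    return 2 * G.number_of_edges() / G.number_of_nodes()

def degree_distribution(G):
    # Definition 2.2: P(k) = fraction of vertices of degree k.
    n = G.number_of_nodes()
    return [count / n for count in nx.degree_histogram(G)]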

High degree vertices play an essential role in communication networks. They carry most of the communication flow and together form the backbone of the network, which is also referred to as the core of the network. We used rich club connectivity to measure how densely connected the high degree vertices in the network are 13.

Definition 2.3. The rich club connectivity of a graph G = (V, E) is a function $\phi : 2^V \to \mathbb{R}$ defined as follows

$$\phi(\rho) = \frac{|\{(u, v) \in E : u \in \rho, v \in \rho\}|}{|\rho|(|\rho| - 1)/2}, \qquad (1)$$

where ρ is the set containing the first |ρ| highest degree vertices in the list of vertices ranked according to their degree in non-increasing order.
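A direct transcription of Definition 2.3 as a sketch (again assuming a networkx graph):

import networkx as nx

def rich_club_connectivity(G, size):
    # rho = the `size` highest-degree vertices.
    ranked = sorted(G.nodes, key=G.degree, reverse=True)
    rho = set(ranked[:size])
    # Edge density among the vertices of rho, as in Eq. (1).
    internal = sum(1 for u, v in G.edges if u in rho and v in rho)
    return internal / (size * (size - 1) / 2)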

2.2. Small-World metrics

The small-world hypothesis states that everyone in the world can be reached through a short chain of social acquaintances. According to Watts and Strogatz 4, a small-world network is mainly characterized by two structural properties, namely, (1) a shorter characteristic path length and (2) a higher clustering coefficient when compared to random networks.

^a Note that we study the Internet topology at the AS level, which is defined as follows. The Internet consists of a large number of independently managed networks, which we call Autonomous Systems (AS). For example, an Internet Service Provider or a large company network usually constitutes an AS. In the AS-level graph, the vertices are the Autonomous Systems and an edge represents the fact that the two adjacent nodes are physically connected and exchange information in the form of packets. In this work, we use the term Internet or Internet topology to refer to the AS-level Internet graph.

Definition 2.4. Given a graph G = (V, E), the characteristic path length L of G is defined as $L = \left(\sum_{u, v \in V} L(u, v)\right) / [n(n - 1)/2]$, where L(u, v) is the shortest path length between vertices u and v.

Definition 2.5. The clustering coefficient C(v) of a vertex $v \in V$ is defined as

$$C(v) = \frac{E(v)}{d(v)(d(v) - 1)/2}, \qquad (2)$$

where E(v) is the number of edges among the neighbors of v and $d(v) > 1$ is the degree of vertex v. The clustering coefficient C of a graph G = (V, E) is defined as $C = \left(\sum_{v \in V} C(v)\right)/n$.
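Both small-world metrics (Definitions 2.4 and 2.5) are available directly in networkx; the sketch below assumes G is connected (otherwise one would restrict the computation to the largest connected component first).

import networkx as nx

def small_world_metrics(G):
    L = nx.average_shortest_path_length(G)  # characteristic path length (Def. 2.4)
    C = nx.average_clustering(G)            # mean clustering coefficient (Def. 2.5)
    return L, C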

2.3. Modular and Hierarchical Organization

The modular and hierarchical structures in networks can be quantified by the scaling relation between the clustering coefficient $C_k$ and the vertex degree k, where $C_k = \left(\sum_{d(v)=k} C(v)\right)/N(k)$ and N(k) is the number of vertices having degree k. Several studies 6, 14 have shown that if a network has modular and hierarchical structure, the distribution of $C_k$ is power-law-like, that is, $C_k \sim k^{-\alpha}$ for some real positive α.

2.4. Entropy

We selected graph entropy 15 and target entropy 16 to evaluate the randomness of a graph. Let X and Y be two discrete random variables associated with the degrees of the two vertices of a randomly chosen edge.

Definition 2.6. The graph entropy E(G) of a graph G = (V, E) is defined as follows

$$E(G) = H(X) + H(Y) - H(X, Y), \qquad (3)$$

where H(X) and H(Y) are the entropies of the random variables X and Y, and H(X, Y) is the joint entropy of X and Y (as defined, e.g., in Ref. 17).

The graph entropy E(G) corresponds to the mutual information between the random variables X and Y, which measures the amount of information that one random variable contains about another. The mutual information quantifies the reduction in the uncertainty of one random variable due to the knowledge of the other 17.
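As a sketch, E(G) can be computed by tabulating the joint distribution of endpoint degrees over all edges of a networkx graph, counting each edge in both orientations so that X and Y share the same marginal distribution:

from collections import Counter
from math import log2

def graph_entropy(G):
    deg = dict(G.degree())
    pairs = [(deg[u], deg[v]) for u, v in G.edges()]
    pairs += [(dv, du) for du, dv in pairs]  # symmetrize the joint law
    total = len(pairs)
    pxy = {p: c / total for p, c in Counter(pairs).items()}
    px = Counter()
    for (du, dv), p in pxy.items():
        px[du] += p
    # E(G) = I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    return sum(p * log2(p / (px[du] * px[dv]))
               for (du, dv), p in pxy.items())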

Our second metric of randomness is target entropy, which measures the predictability of the amount of traffic in the neighborhood of any given vertex 16. More specifically, assume that every vertex in the network sends one unit of flow to vertex u using a shortest path. Let c(u, v) denote the fraction of the flows with destination u that pass through vertex v, where v is an immediate neighbor of u.

Definition 2.7. The target entropy T(u) of a vertex u e V is defined as follows

T(u) = - X I c(u,v)log2c(u,v). (4) V neighbor of U

The target entropy T of a graph G = (V, E) is defined ™T = (J2u€VT(u))/n.

2.5. Performance Measures

We selected two metrics to evaluate the performance of the network, namely eccentricity and edge congestion. The former metric is related to the notion of reachability of a graph 18, whereas the latter measures the congestion on the edges assuming a flow model.

Definition 2.8. The eccentricity e(u) of a vertex $u \in V$ is defined as $e(u) = \max_{v \in V} L(u, v)$.

Edge congestion measures the amount of flow traveling through the edges of a network assuming a given traffic model and routing policy. In this study, we assume that one unit of flow between every pair of vertices is routed using the shortest-path routing policy 19, 20.

Definition 2.9. The edge congestion ec(u, v) of an edge $(u, v) \in E$ is defined as $ec(u, v) = f(u, v)/[n(n - 1)]$, where f(u, v) denotes the total number of flows traveling through the edge (u, v).
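Up to the normalization constant and the treatment of ties among shortest paths, this quantity is edge betweenness centrality under all-pairs shortest-path routing, so a sketch reduces to a library call:

import networkx as nx

def edge_congestion(G):
    # normalized=True divides each flow count by the number of vertex pairs.
    return nx.edge_betweenness_centrality(G, normalized=True)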

2.6. Robustness

We selected two simple methods to measure the robustness of network topology under failures. In the first, we remove vertices at random, which corresponds to random failures. In the second, we remove vertices in the order of decreasing degree, which corresponds to "intelligent attacks" 21. In both cases, the network eventually gets decomposed into a set of connected components. To characterize this process, we measured (a) $L_c = |S|/n$, where S is the largest connected component, and (b) $N_c$, the number of components in the network. A sketch of the targeted-attack variant follows.
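A sketch of the targeted-attack experiment (the number of removal steps is an arbitrary illustrative parameter):

import networkx as nx

def targeted_attack(G, steps=20):
    H = G.copy()
    n = H.number_of_nodes()
    trace = []
    for _ in range(min(steps, n - 1)):
        # Remove the current highest-degree vertex and all incident edges.
        victim = max(H.degree, key=lambda kv: kv[1])[0]
        H.remove_node(victim)
        components = list(nx.connected_components(H))
        Lc = max(len(c) for c in components) / n
        trace.append((Lc, len(components)))  # (L_c, N_c) after each removal
    return trace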

Table 1. Statistic summary: n is the number of vertices, m is the number of edges, k is the average degree, L is the characteristic path length, and C is the clustering coefficient.

       Yeast    Fly        AS990220    Skitter
n        -      6926       4686        9200
m        -      20745      8772        28957
k        -      5.9905     3.7439      6.2927
L        -      4.45931    3.72621     3.118
C        -      0.0154     0.3786      0.6212

3. DATASETS

In our work, we used four networks whose global statistics are summarized in Table 1. Yeast and Fly are two PPI networks downloaded from the DIP database 22, in which vertices represent proteins and edges represent physical interactions between pairs of proteins. AS990220 and Skitter are two AS-level Internet instances obtained with two different methods^b.

We also employed two models of random graphs, namely, G(n, p) and degree-based random graphs. A G(n, p) random graph is a graph composed of n vertices where each pair of vertices is connected with probability p 3. Given a degree distribution d and an integer n, a degree-based random graph (DBRG) is a graph with n vertices where vertices u, v are connected with probability proportional to the product of their degrees d(u)d(v) 26. In the following, G(n, p) random graphs were generated based on the same number of vertices and edges as in the real networks. DBRG random graphs were produced based on the same degree distribution of the real networks, along with the same number of vertices and edges as in the real networks. A sketch of how such null models can be generated is shown below.
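A sketch of the two null models; for the DBRG we use the Chung-Lu style generator in networkx, which joins vertices with probability proportional to the product of their expected degrees:

import networkx as nx

def null_models(G):
    n, m = G.number_of_nodes(), G.number_of_edges()
    p = 2 * m / (n * (n - 1))               # matches the observed edge density
    gnp = nx.gnp_random_graph(n, p)
    dbrg = nx.expected_degree_graph([d for _, d in G.degree()],
                                    selfloops=False)
    return gnp, dbrg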

4. RESULTS AND DISCUSSION

4.1. Connectivity

Average Degree. Table 1 summarizes the average degree of the four networks. The Yeast and Fly PPI networks have an average degree around 6. AS990220 has approximately the same number of vertices as the Yeast PPI network, but its average degree is much lower (about 4). The other AS-level Internet instance, Skitter, has an average degree of about 6, much closer to the two PPI networks.

Skewed Degree Distribution. Figure 1 shows the complementary cumulative density function (CCDF) of the degree distribution of the four networks. The two Internet networks show a "perfect" power-law degree distribution with γ ≈ 1.1. Observe that although the degree distribution of the two PPI networks is highly skewed, they do not follow closely a power-law distribution (γ ≈ 1.7).

The degree distribution of PPI networks has been characterized as truncated scale-free, which has a power-law regime followed by a sharp cutoff, like an exponential or Gaussian decay of the tail 27. In Ref. 27, the authors showed that they could generate networks with such a degree distribution by imposing constraints on the process that adds new links to vertices. We speculate that such constraints could potentially exist in the evolutionary process that shaped the topology of PPI networks. For example, one constraint is the physical and chemical limitation on the number of interacting partners that a protein could possibly have. Moreover, compartmentalization and the inherent functional modular organization of various components inside the cell would restrict, spatially and functionally, the number of links added between two different compartments or two different functional modules. Although it is not clear what the evolutionary advantage of a scale-free topology for PPI networks would be, we argue that the physical, chemical and thermodynamic constraints in the cell could account for the lack of a perfect scale-free topology in PPI networks.

^b AS990220 is an AS-level topology collected by the Oregon Route Views project 23, which extracts the information from BGP routing updates 24. Skitter was collected by CAIDA (Cooperative Association for Internet Data Analysis) using traceroute and then carefully processed 10, 25.


Fig. 1. Complementary cumulative density function (CCDF) of the degree distribution. (The power-law fit shown for AS990220 is $10^{-0.137} x^{-1.2071}$, with $R^2 = 0.9907$.)

Rich club connectivity. Figure 4 shows that about 10% of the vertices with the highest degree in the AS-level Internet are more densely connected with each other than those in PPI networks. In other words, links between high degree vertices in PPI networks are suppressed, which is consistent with previous observations 28. A comparison with G(n, p) and DBRG random graphs further illustrates that the number of links between high degree vertices in PPI networks is significantly lower than expected. In contrast, the number of links connecting high degree vertices in the AS-level Internet matches the expected number observed in the corresponding random networks (data not shown).

The core connectivity analysis shows that high degree vertices in PPI networks do not connect with each other as much as in the Internet or when compared to random networks. This feature is consistent with the theory of functional modular organization of the cell. Functional modules can be insulated from or connected to each other. Insulation allows the cell to carry out many diverse reactions without the cross-talk that would harm the cell, whereas connectivity allows one function to influence another. The most notable effect of suppressing the connections between high degree vertices is to prevent deleterious perturbations from propagating rapidly over the network through densely connected high degree vertices. In contrast, such a concern does not exist in the Internet, in which high degree vertices (i.e., large Internet Service Providers) are expected to connect to each other to promote communication between different cities, countries and continents.

4.2. Small-world metrics

Characteristic Path Length. Table 1 summarizes the characteristic path length for the four networks. The two AS-level Internet instances have an average shortest path length almost one hop shorter than that of the two PPI networks. This indicates that on average it takes fewer edges for vertices to reach one another using shortest paths in the Internet than in PPI networks.


Fig. 2. Clustering coefficient $C_k$ as a function of the degree k. (Power-law fits shown in the plot include yeast: $10^{0.3036} k^{-1.1058}$, $R^2 = 0.5235$; AS990220: $10^{-0.247} k^{-0.7171}$, $R^2 = 0.5197$; and skitter: $10^{0.431} k^{-1.7154}$, $R^2 = 0.8692$.)

Clustering Coefficient. The clustering coefficient C is also shown in Table 1. The two AS-level Internet instances have a much higher clustering coefficient than the two PPI networks, indicating that neighboring vertices in the Internet are well connected when compared to PPI networks. This may also imply that prominent cluster structures exist in the Internet 6, 9. The fact that the Internet networks have a shorter characteristic path length and a higher clustering coefficient than PPI networks indicates that the former graphs are more small-world than PPI networks. These results imply that the overall design of the Internet promotes well-connected neighborhoods and can propagate messages through a short chain between long-distance vertices, which in turn implies more efficient communication. This conclusion is rather expected and not surprising. The primary goal of the Internet is to deliver messages in a fast and reliable manner, thus the small-world properties are more desirable in such a type of network. In contrast, fast communication is perhaps not one of the primary concerns in PPI networks.

4.3. Modular and Hierarchical Organization

Figure 2 shows the clustering coefficient $C_k$ versus degree k. The two AS-level Internet instances show a power-law-like distribution $C_k \sim k^{-\alpha}$, which is an indication of hierarchical organization. In PPI networks, however, there is almost no indication of hierarchical structures, except perhaps among high degree vertices, where we detected weak signs of hierarchical structure.

Fig. 3. The size $L_c$ of the largest normalized component (represented on the y-axis) in real networks and the corresponding G(n, p) and DBRG random networks after successive removal of critical edges. The x-axis represents the fraction of edges removed.

In addition to the analysis of the scaling relation between $C_k$ and k, we designed another experiment to explicitly demonstrate that the combination of a high clustering coefficient and the presence of the relation $C_k \sim k^{-\alpha}$ results in modular and hierarchical structures in the AS-level Internet. The experiment was designed as follows. (1) Iteratively remove the most critical edge (for example, we removed the edge with the highest betweenness 29) in the network, until the network breaks into two components. (2) Measure the size $L_c$ of the largest normalized connected component. (3) Repeat steps (1) and (2) on the largest component until its size reaches one node. The rationale behind this decomposition process is that if the network is modular and hierarchically organized, then one expects the decomposition to separate large components from the network. A sketch of this procedure is given below.
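A sketch of the decomposition loop just described (computationally expensive on large graphs, since edge betweenness is recomputed after every removal):

import networkx as nx

def decompose(G):
    H = G.copy()
    n = G.number_of_nodes()
    sizes = []
    while H.number_of_nodes() > 1:
        # Step (1): remove critical edges until the graph breaks apart.
        while nx.is_connected(H):
            eb = nx.edge_betweenness_centrality(H)
            H.remove_edge(*max(eb, key=eb.get))
        # Step (2): record the normalized size of the largest component.
        largest = max(nx.connected_components(H), key=len)
        sizes.append(len(largest) / n)
        # Step (3): recurse on the largest component.
        H = H.subgraph(largest).copy()
    return sizes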

Figure 3 shows the comparison between the real networks and their corresponding G(n, p) and DBRG random networks. The graph illustrates that $L_c$ decays much faster in the four real networks than in their two random counterparts, which in turn indicates that the four real networks are more modular and hierarchically organized than random networks. Figure 5 shows a comparison of $L_c$ between the four real networks. Observe that the size of the largest component in the AS-level Internet decays faster than in the PPI networks. Although the decay rate of $L_c$ in Skitter is comparable to that of the two PPI networks, in Skitter larger components are separated from the network, indicating a much stronger modularity.

The measurements on the scaling relation and the decomposition experiment suggest that modular and hierarchical structures exist in all networks examined. Moreover, the topology of the Internet is significantly more modular and hierarchical than that of PPI networks. In fact, hierarchical organization is inherent in the topology of the Internet, which mirrors the hierarchical structure in business relationships. It is a well known fact that on the Internet there are a few tens of vertices that provide international world-wide connectivity, and they practically form a clique. Then, within each country there are national, regional and local Internet Service Providers (ISPs). Typically, the smaller ISPs are the customers of the larger ISPs. This hierarchical structure, which emerges as a reflection of business policies, is not a strict hierarchy, but it definitely provides a topological structure.

Fig. 4. Rich club connectivity $\phi(\rho)$ as a function of ρ.

Fig. 5. $L_c$ as a function of the fraction of edges removed.

Fig. 6. Graph entropy E(G) of the networks under study.

4.4. Entropy

Graph Entropy. Figure 6 illustrates the graph entropy for real networks and their corresponding G(n, p) and DBRG random networks. The figure also shows the graph entropy for rewired networks, in which 10, 20, 30, 40, 50% of all edges are rewired^c (both in real and random networks). The data points for random and rewired networks were averaged over 10 corresponding networks. Observe that the entropy of all real networks approaches that of the corresponding DBRG random networks with more and more rewiring. In contrast, the entropy of random networks remains almost the same regardless of how much rewiring was performed (data not shown). Since the vertices in G(n, p) random networks are connected uniformly at random, their graph entropy is expected to be zero, as shown in Figure 6. These observations show that graph entropy reflects the randomness of a network in a quantitative way, since the value of graph entropy varies accordingly with the randomness of a network.

Fig. 7. Target entropy $T_k$ as a function of the degree k.

The figure also shows that the two AS-level Internet instances have a much higher graph entropy than the two PPI networks. This result implies that on the Internet we can observe a regular connection pattern between different classes of vertices, e.g., low or high degree vertices. The analysis of the assortativity coefficient 30 confirms this result by showing preferential connections between low and high degree vertices in the Internet networks (data not shown).

^c Rewiring is a process that randomly switches the edges in the network in such a way that the degree distribution of the nodes in the network remains constant 4.

Fig. 8. Eccentricity $e_k$ as a function of the degree k.

Fig. 9. Distribution of edge congestion.

Fig. 10. $L_c$ as a function of the fraction of vertices removed.

Fig. 11. $N_c$ as a function of the fraction of vertices removed.

The analysis of graph entropy shows that the connectivity between vertices with different degrees

in PPI networks is close to uniformly random, which is analogous to the notion of diversification in evolution. Diversification is a process in which multiple phenotypes and genotypes are simultaneously present in a population, which increases the probability that some individuals will survive and reproduce in a heterogeneous and changing environment. The same mechanism could possibly be at work in the building process of PPI networks too, since the heterogeneous connectivity pattern could potentially increase the robustness of the network by redundancy and/or degeneracy mechanisms 31.

Target Entropy. Recall that the target entropy of a vertex u is the entropy of the distribution of the number of times a vertex in the neighborhood of u is traversed to route messages. The closer the distribution is to a uniform distribution, the higher its entropy. Figure 7 shows $T_k$ versus degree k, where $T_k = \left(\sum_{d(u)=k} T(u)\right)/N(k)$ and N(k) is the number of vertices having degree k. The figure shows that the vertices in the Internet are less uniform in choosing neighbors to route messages when compared to PPI networks. This is due to the fact that all the routing domains in the Internet have to visit some large ISPs in order to exchange information with other administrative domains. Therefore, the choice of which vertex to use for routing is highly selective in the Internet. Since such constraints do not exist in PPI networks, the vertices have more freedom in choosing which vertex to pass the message to.

The results presented here on the target entropy appear to somewhat contradict the ones obtained by Sneppen et al. in Ref. 16. In their work, the authors show that the Internet has higher target entropy than the PPI networks of Yeast and Fly, and that the PPI network of Fly has higher target entropy than the one of Yeast. Our results show that the Internet has lower target entropy than the two PPI networks and that the PPI network of Fly has lower target entropy than that of Yeast. This discrepancy is likely explained by the fact that they used much smaller networks than the ones in our study. In addition, the authors of Ref. 16 do not report the average degree of the networks used in their dataset; comparing the target entropy of networks with very different average degrees might not be very meaningful.

4.5. Performance Measure

Our main measure of performance is communication efficiency. Many factors influence the performance of the Internet, such as routing policy, traffic flow, etc. Since for PPI networks the notions of routing and traffic might not be very meaningful, we used the simplest model for both types of networks, namely, one unit of flow between every pair of vertices, routed using the shortest-path policy. Under these assumptions, our goal was to determine whether PPI networks have any advantages over the Internet as communication networks. We measured eccentricity, which estimates how quickly one vertex can reach any vertex in the network, and edge congestion, which is related to the traffic flow in the network.

Eccentricity. Figure 8 shows the quantity $e_k$ versus degree k, where $e_k = \left(\sum_{d(u)=k} e(u)\right)/N(k)$ is the average eccentricity of the vertices having the same degree k. The figure shows that, on average, the vertices in the Internet reach the rest of the vertices in the network with fewer hops than those in PPI networks.

Edge Congestion. The edge congestion for all edges in the network was sorted in non-decreasing order and the distribution of the congestion of each edge was plotted in Figure 9. The figure shows that most of the edges in Skitter carry less flow traffic than those in the other networks. Although the edge congestion in AS990220 is comparable to that of the two PPI networks, we need to recall that AS990220 has an average degree of 3.74, in contrast to an average degree of about 6 in the PPI networks. In other words, AS990220 achieves the same level of edge congestion as the PPI networks with a significantly smaller number of edges. This indicates that the topology of the Internet has inherent structural properties that tend to reduce edge congestion. The performance analysis suggests that the Internet is highly optimized for communication efficiency (under the assumptions we made on the routing and the traffic). In contrast, it appears that PPI networks are not optimized to route messages and minimize traffic.

4.6. Robustness

When vertices were randomly removed from the network, along with all incident edges, all four networks behaved similarly in terms of the size of the largest normalized component $L_c$ and the number of components $N_c$ (data not shown). However, when we targeted first the vertices with the highest degrees, the AS-level Internet collapsed much faster than the PPI networks, as shown by a smaller $L_c$ and a larger $N_c$ (see Figures 10 and 11). The results indicate that high degree vertices play a critical role in the Internet. In contrast, the fact that high degree vertices in PPI networks tend not to be connected with each other (as shown by the rich club connectivity analysis) prevents deleterious effects, such as gene knockout or protein malfunction, from spreading throughout the network too fast. Thus, suppressed cross-talk between high degree vertices in PPI networks clearly contributes to the robustness of the network by localizing the effects of deleterious perturbations. The distinction between the robustness of these two types of networks supports the idea that the underlying driving forces that shape the topology of these two types of networks are distinct, namely, for PPI networks, the survivability of the cell favored by evolution, and for the Internet, the optimized communication requirements.

5. CONCLUSION

In this paper we showed that by comparing PPI networks to the AS-level Internet, one can possibly gain some insights into the topological properties and the design principles underlying the two types of networks. Such a cross-disciplinary comparison brings together tools, expertise, and ideas from different communities, and benefits both research areas.

Our results suggest that although both types of networks have been characterized as scale-free topologies, they also exhibit non-trivial topological differences.

• Connectivity. The Internet has a highly-connected "core", which does not appear to exist in PPI networks.

• Small-world. The Internet topology exhibits stronger small-world properties than PPI networks.

• Modular/Hierarchical organization. The Internet topology shows a more prominent modular and hierarchical organization than PPI networks.

• Entropy. Vertices with different degrees are more uniformly connected in PPI networks than those in the Internet.

• Communication efficiency. The Internet topology is more efficient in routing messages and minimizing traffic than PPI networks with respect to the metrics that capture the communication efficiency.

• Robustness. The Internet and PPI networks seem equally robust against random failures. However, PPI networks are more robust under "targeted" (e.g., toward high degree nodes) attacks compared to the Internet.

We speculate that the structural and functional differences between PPI networks and the AS-level Internet originate from the different constraints and objectives that govern and shape the building processes of these complex networks. Specifically, the building process of PPI networks is driven by evolutionary constraints that favor survivability and diversification. In contrast, the architecture of the Internet is shaped by the need for fast and reliable communication.

ACKNOWLEDGMENTS

This project was supported in part by NSF CAREER IIS-0447773 and NSF DBI-0321756. We thank Dr. M. Newman for providing us with the implementation of the edge betweenness clustering algorithm and for helpful discussions. We also want to thank the anonymous reviewers who helped improve the quality of this manuscript.

References

1. Uetz P, Giot L, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000; 403: 623-627.
2. Giot L, Bader J, et al. A protein interaction map of Drosophila melanogaster. Science 2003; 302: 1727-1736.
3. Erdos P, Renyi A. On random graphs. Publicationes Mathematicae 1959; 6: 290-297.
4. Watts D, Strogatz S. Collective dynamics of 'small-world' networks. Nature 1998; 393: 440-442.
5. Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the Internet topology. Proc. of ACM SIGCOMM 1999: 251-263.
6. Ravasz E, Barabasi A-L. Hierarchical organization in complex networks. Phys. Rev. E 2003; 67: 026112.
7. Jeong H, Mason S, et al. Lethality and centrality in protein networks. Nature 2001; 411: 41-42.
8. Jeong H, Tombor B, et al. The large-scale organization of metabolic networks. Nature 2000; 407: 651-654.
9. Barabasi A-L, Deszo Z, et al. Scale-free and hierarchical structures in complex networks. Seventh Granada Lectures on Modeling of Complex Systems 2002.
10. Mahadevan P, Krioukov D, et al. The Internet AS-level topology: three data sources and one definitive metric. ACM SIGCOMM Computer Communication Review 2006; 36: 17-26.
11. Gkantsidis C, Mihail M, et al. Spectral analysis of Internet topologies. Proc. of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM'03) 2003: 364-374.
12. Jaiswal S, Rosenberg A, et al. Comparing the structure of power-law graphs and the Internet AS graph. Proc. of the 12th IEEE International Conference on Network Protocols (ICNP'04) 2004: 294-303.
13. Zhou S, Mondragon R. Accurately modeling the Internet topology. Phys. Rev. E 2004; 70: 066108.
14. Dorogovtsev S, Goltsev A, et al. Pseudofractal scale-free web. Phys. Rev. E 2002; 65: 066122.
15. Mahadevan P, Krioukov D, et al. Comparative analysis of the Internet AS-level topologies extracted from different data sources. http://www.krioukov.net/~dima/pub/.
16. Sneppen K, Trusina A, et al. Hide-and-seek on complex networks. Europhys. Lett. 2005; 69: 853-859.
17. Cover T, Thomas J. Elements of Information Theory. John Wiley & Sons, Inc., New York, NY. 1991.
18. Harary F. Graph Theory. Addison-Wesley Publishing Company, Reading, MA. 1994.
19. Akella A, Chawla S, et al. Scaling properties of the Internet graph. Proc. of the Twenty-second Annual Symposium on Principles of Distributed Computing 2003: 337-346.
20. Gkantsidis C, Mihail M, et al. Conductance and congestion in power law graphs. Proc. of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems 2003: 148-159.
21. Albert R, Jeong H, et al. Error and attack tolerance of complex networks. Nature 2000; 406: 378-382.
22. Xenarios I, Salwinski L, et al. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 2002; 30: 303-305.
23. University of Oregon Route Views Project. Online data and reports. http://www.routeviews.org/.
24. Rekhter Y, Li T (Eds). A Border Gateway Protocol 4 (BGP-4). 1995.
25. Mao Z, Rexford J, et al. Towards an accurate AS-level traceroute tool. Proc. of ACM SIGCOMM 2003: 365-378.
26. Aiello W, Chung F, et al. A random graph model for massive graphs. Proc. of the Thirty-second Annual ACM Symposium on Theory of Computing 2000: 171-180.
27. Amaral L, Scala A, et al. Classes of small-world networks. Proc. Natl. Acad. Sci. 2000; 97: 11149-11152.
28. Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science 2002; 296: 910-913.
29. Girvan M, Newman M. Community structure in social and biological networks. Proc. Natl. Acad. Sci. 2002; 99: 7821-7826.
30. Newman M. Assortative mixing in networks. Phys. Rev. Lett. 2002; 89: 208701.
31. Tononi G, Sporns O, et al. Measures of degeneracy and redundancy in biological networks. Proc. Natl. Acad. Sci. 1999; 96: 3257-3262.


CAVITY-AWARE MOTIFS REDUCE FALSE POSITIVES IN PROTEIN FUNCTION PREDICTION

Brian Y. Chen^a,*, Drew H. Bryant^b,*, Viacheslav Y. Fofanov^c, David M. Kristensen^d, Amanda E. Cruess^a, Marek Kimmel^c, Olivier Lichtarge^d,e, Lydia E. Kavraki^a,b,e,†

^a Department of Computer Science, ^b Department of Bioengineering, ^c Department of Statistics, Rice University, Houston, TX 77005, USA

^d Program in Structural and Computational Biology and Molecular Biophysics, ^e Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA

* Equal contribution. † Corresponding author: [email protected]

Determining the function of proteins is a problem with immense practical impact on the identification of inhibition targets and the causes of side effects. Unfortunately, experimental determination of protein function is expensive and time consuming. For this reason, algorithms for computational function prediction have been developed to focus and accelerate this effort. These algorithms are comparison techniques which identify matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Matches of statistically significant geometric and chemical similarity can identify targets with active sites cognate to the matching motif. Unfortunately, statistically significant matches can include false positive matches to functionally unrelated proteins. We target this problem by presenting Cavity Aware Match Augmentation (CAMA), a technique which uses C-spheres to represent active clefts which must remain vacant for ligand binding. CAMA rejects matches to targets without similar binding volumes. On 18 sample motifs, we observed that introducing C-spheres eliminated 80% of false positive matches and maintained 87% of true positive matches found with identical motifs lacking C-spheres. Analyzing a range of C-sphere positions and sizes, we observed that some high-impact C-spheres eliminate more false positive matches than others. High-impact C-spheres can be detected with a geometric analysis we call Cavity Scaling, permitting us to refine our initial cavity-aware motifs to contain only high-impact C-spheres. In the absence of expert knowledge, Cavity Scaling can guide the design of cavity-aware motifs to eliminate many false positive matches.

1. INTRODUCTION

Exhaustive knowledge of the biological function of a large number of proteins would have a broad impact on the identification of drug targets and the reduction of potential side effects. Unfortunately, the experimental determination of protein function is an expensive and time consuming process. In an effort to guide and accelerate the experimental process, computational techniques have been developed to predict protein function by identifying distinct similarities to known proteins. Algorithms like Geometric Hashing34, JESS14, pvSOAR33 and Match Augmentation (MA)5 search functionally uncharacterized protein structures (targets) for substructures with geometric and chemical similarity (matches) to known active sites (motifs). Finding a match with statistically significant geometric and chemical similarity can imply that the target has an active site similar to the motif, suggesting functional homology1, 14, 33, 5.

One fundamental subproblem of protein function prediction is the design of effective motifs. Ideally, effective motifs have geometric and chemical characteristics which match functionally homologous targets (sensitive motifs) and do not match functionally unrelated targets (specific motifs). In practice, however, many matches are identified within functionally unrelated targets. For this reason, statistical models1,14,33,5 can establish a threshold of similarity necessary to imply functional homology. Predictions from any non-trivial statistical model will inevitably contain some false positive matches which identify statistically significant geometric similarity to functionally unrelated proteins. In the context of actual function predictions, where expensive resources could be deployed to verify computational predictions, false positive matches must be avoided to minimize wasted resources, while preserving as many true positive matches to functional homologs as possible. This paper proposes a method that reduces false positive matches while preserving most true positive matches, by adding biological information that rejects matches to functionally unrelated targets.

It is hypothesized that ligand binding proteins often contain active clefts or cavities which create chemical microenvironments essential for biological function. In several instances, large surface concavities have been associated with protein function30,13. Inspired by seminal work in the modeling and search for protein cavities30,8,33, we seek to use cavities to eliminate false positive matches. If the matching atoms of the target truly form a cognate active site with similar function, the matching atoms of the target should surround an empty cavity with similar shape.

This paper presents Cavity-Aware Match Augmentation (CAMA), an adaptation of Match Augmentation5, which searches for motifs built from motif points while requiring specific geometric volumes, represented with sets of C-spheres, to remain empty. On 18 cavity-aware motifs derived from ligand binding proteins, we compared the number of false positive matches found relative to identical motifs without C-spheres. Cavity-aware motifs eliminated a large proportion of the false positive matches that were identified with point-based motifs, while preserving most true positive matches. We also compared the relative effect of many C-sphere positions and sizes on the number of false positive matches eliminated. This led us to observe trends indicating that certain high-impact C-spheres contribute more to the elimination of false positive matches than others. We exploited these trends to produce Cavity Scaling, a technique for identifying high-impact C-spheres a priori. Cavity Scaling allowed us to refine our existing motifs to contain only high-impact C-spheres, guiding the design of cavity-aware motifs that eliminate many false positive matches.

2. RELATED WORK

Motif Types The search for effective motifs has led to many different geometric representations of protein active sites, including point-based motifs and cavity-based motifs. Point-based motifs represent active sites as sets of motif points in three dimensions, labeled with varying chemical and biological definitions. Depending on how motif points are defined, they have different labels associated with them, and these labels need to be taken into account with varying comparison algorithms. Motif points have been used to represent evolutionarily significant amino acids5, "pseudo-centers" representing protein-ligand interactions17, atoms in catalytic sites2,14, points on the Connolly surface21 with labels representing electrostatic potentials15, and even atoms in flexible motifs18.

Clefts and cavities, on the surface or within protein structures, have many different volumetric representations. These cavity-based representations include spheres12,6,26,20, alpha-shapes9,8,33,32, and grid-based techniques28.

Geometric Comparison Algorithms Many algorithms exist for identifying matches between motifs and targets. These methods differ fundamentally in that they are optimized for comparing different types of motifs. There are algorithms for comparing graph-based motifs27, algorithms for finding catalytic sites14, and the seminal Geometric Hashing framework10 which can search for many types of motifs, including motifs based on atom position22, points on Connolly face centers16, catalytic triads2, and flexible protein models18. The comparison algorithm we use in this work is based on Match Augmentation5, because of its availability and compatibility with our selected motif type.

Fig. 1. (a) A frequency distribution of matches between a motif and all functionally unrelated proteins in the PDB. (b) Computing p-values with motif profiles: the area A under the curve to the left of some LRMSD r, relative to the entire area under the curve A + B, gives p = A / (A + B).

Statistical Models of Geometric Similarity Finding a match with MA indicates only that substructural geometric and chemical similarity exists between the motif and a substructure of the target, not that the motif and the target have functionally similar active sites. We measure geometric similarity with LRMSD: the root mean square distance (RMSD) between matching points in 3D, when aligned with least RMSD. In order to use matches to imply functional similarity, it is essential to understand the degree of similarity, in LRMSD, sufficient to imply functional similarity. However, a simple LRMSD threshold is insufficient to indicate functional similarity between any motif and a matching target. Some motifs match functional homologs at lower values of LRMSD than other motif-target pairs, and LRMSD itself is affected by the number of matching points5. Fortunately, earlier work has demonstrated that motif-specific LRMSD thresholds can be produced with statistical models of functional similarity5. Many important statistical models have been designed, including parametric1,14, empirical33, and nonparametric5 statistical models. The optimal superposition underlying LRMSD is sketched below.
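Since every statistical decision in this framework rests on LRMSD, it may help to make the alignment step explicit. The following minimal sketch is our own illustration, not the authors' implementation; the function name lrmsd is hypothetical. It computes the least RMSD between matched point sets using the standard SVD-based (Kabsch) superposition.

```python
import numpy as np

def lrmsd(motif_pts, target_pts):
    """RMSD between two matched point sets after optimal rigid alignment."""
    P = np.array(motif_pts, dtype=float)     # |S| x 3 motif points
    Q = np.array(target_pts, dtype=float)    # |S| x 3 correlated target points
    P -= P.mean(axis=0)                      # center both point sets
    Q -= Q.mean(axis=0)
    H = P.T @ Q                              # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # optimal rotation (Kabsch)
    diff = P @ R.T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```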

Geometric comparison algorithms operate on the assumption that substructural and chemical similarity implies functional similarity. Our statistical model can be used to identify the degree of similarity sufficient to follow this implication. Given a match m with LRMSD r between motif S and target T, exactly one of two hypotheses must hold:

H_0: S and T are structurally dissimilar
H_A: S and T are structurally similar

Our statistical model tests these hypotheses by computing a motif profile. Motif profiles are frequency distributions (see Figure 1a) of match LRMSDs between S and the entire Protein Data Bank (PDB)11, which is essentially a large set of functionally unrelated proteins. A motif profile is basically a histogram (see the example plotted in Figure 1a), where the vertical axis indicates the number of matches at each specific LRMSD, indicated by the horizontal axis. Motif profiles provide very complete information about matches typical of H_0. If we suspect that a match m has LRMSD r indicative of functional similarity, we can use the motif profile to determine the probability p of observing another match m' with smaller LRMSD by computing the area under the curve to the left of r, relative to the entire area (see Figure 1b). The probability p, referred to as the p-value, is the measure of statistical significance. With a standard of statistical significance α, if p < α, then we say that the probability of observing a match m' with LRMSD r' < r is so low that we reject the null hypothesis (H_0) in favor of the alternative hypothesis (H_A). We call m statistically significant. This computation is sketched below.
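To make the p-value computation concrete, here is a minimal sketch, assuming the motif profile is available simply as the list of match LRMSDs against the homolog-free PDB; the names are ours, not the authors'.

```python
def motif_profile_p_value(profile_lrmsds, r):
    """Fraction of profile matches at or below LRMSD r, i.e. A / (A + B)."""
    n_left = sum(1 for x in profile_lrmsds if x <= r)   # area A of Figure 1b
    return n_left / len(profile_lrmsds)                 # the p-value

# Example: reject H_0 at significance alpha = 0.01 if p < alpha
profile = [2.1, 2.4, 2.6, 3.0, 3.1, 3.5, 3.8, 4.0, 4.2, 4.5] * 100
print(motif_profile_p_value(profile, r=1.9) < 0.01)     # True: significant
```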

In the context of controlled experiments, where we know when matches identify functional homologs and when they do not, there are four possibilities: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A match is a TP if it identifies a functional homolog and is statistically significant. A match is a FP if it identifies a functionally unrelated protein and is statistically significant. A match is a TN if it identifies a functionally unrelated protein and is not statistically significant. A match is a FN if it identifies a functional homolog but is not statistically significant.

In practice, our statistical model occasionally identifies false positive matches. Designing motifs which generate fewer FP matches is an essential aspect of motif design, especially when we consider the possibility that expensive experimental resources could be wasted in an attempt to verify predicted functions. In the next section, we discuss a method for designing motifs which strongly reduces false positives.


Fig. 2. A diagram of a cavity-aware motif. Beginning with functionally relevant amino acids (e.g., Trp-366, Tyr-367, Glu-371, and Asp-376 of nitric oxide synthase) and bound ligand coordinates (a), cavity-aware motif points are positioned at alpha carbon coordinates (black dots, (b)), and C-spheres are positioned at ligand atom coordinates (transparent spheres, (b)).

3. METHODS

Cavity-Aware Motifs The cavity-aware motifs used in this work are an integration of a point-based motif and a cavity-based motif. Cavity-aware motifs contain motif points taken from atom coordinates labeled with evolutionary data23,24,5,7. A motif S contains a set of |S| motif points {s_1, ..., s_|S|} in three dimensions, whose coordinates are taken from backbone and side-chain atoms. Each motif point s_i in the motif has an associated rank, a measure of the functional significance of the motif point. Each s_i also has a set of alternate amino acid labels l(s_i) ⊆ {GLY, ALA, ...}, which represent residues to which this amino acid has mutated during evolution. Labels permit our motifs to simultaneously represent many homologous active sites with slight mutations, not just a single active site. In this paper, we obtain labels and ranks using the Evolutionary Trace23,24.

Cavity-aware motifs also contain a set of C-spheres C = {c_1, c_2, ..., c_k} with radii r(c_1), r(c_2), ..., r(c_k), which are rigidly associated with the motif points. For each i, 1 ≤ i ≤ k, a maximum radius r_max(c_i) is defined to be the largest radius (rounded to the nearest integer) such that c_i contains no atoms from the protein which gave rise to the motif. C-spheres are a loose approximation of solvent-exposed volumes essential for ligand binding. C-spheres can have arbitrary radii, and can be centered at arbitrary positions. While this work targets the functional prediction of active sites that bind

small ligands, this representation could be used to represent protein-protein interfaces and other generalized interaction zones.
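As a concrete reading of the definition of r_max, the sketch below (our own hypothetical code, with assumed names) returns the largest integer radius that keeps one C-sphere free of atoms from the motif's source protein.

```python
import numpy as np

def max_c_sphere_radius(center, protein_atoms):
    """Largest integer radius such that the C-sphere at `center` contains
    no atoms of the protein that gave rise to the motif."""
    center = np.asarray(center, dtype=float)
    atoms = np.asarray(protein_atoms, dtype=float)             # N x 3 coordinates
    nearest = float(np.min(np.linalg.norm(atoms - center, axis=1)))
    return int(nearest)   # floor: the conservative reading of "rounded"
```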

C-sphere positions in this work were selected based on the coordinates of atoms in bound ligands. For example, in Figure 2, we modeled the heme-dependent enzyme nitric oxide synthase, which catalyzes the synthesis of nitric oxide (NO) from an L-arginine substrate. This multi-step reaction takes place in a deep cleft and involves zinc, tetrahydrobiopterin, and hydride-donating (NADPH or H2O2) cofactors4,31. Using PDB structure 1dww, we centered C-spheres at several atom coordinates on the heme, in order to fill the heme-binding cavity, and placed one C-sphere to represent tetrahydrobiopterin, which lies farther outside the main cavity, as shown in Figure 2.

In our experimentation, a small number (usually 10) of C-spheres were manually placed for each motif. In some cases, not all atoms of the ligand were used, such as in heme in Figure 2, but selections were made to approximate the shape of the ligand binding cavity based on the atom coordinates available. C-spheres could have been made to fit better by moving the C-sphere centers, but we used atom coordinates to standardize our experimentation. Future work will explore the generalized positioning of C-spheres.

Fig. 3. Two cases of cavity-aware matching. Every time a match is generated by CAMA, an alignment of the motif points is generated to the matching points of the target. This specifies the precise positions of the C-spheres in the motif relative to the target. CAMA accepts matches to targets where no C-spheres contain any target atoms (a, a successful match), and rejects matches where any target atom is within one or more C-spheres (b, an unsuccessful match).

Matching Criteria Cavity-Aware Match Augmentation compares a cavity-aware motif S to a target T, a protein structure encoded as |T| target points T = {t_1, ..., t_|T|}, where each t_i is taken from atom coordinates and labeled l(t_i) for the amino acid t_i belongs to. A match m is a bijection correlating all motif points in S to a subset of T, of the form m = {(s_a1, t_b1), (s_a2, t_b2), ..., (s_a|S|, t_b|S|)}. Referring to the Euclidean distance between points a and b as ||a - b||, an acceptable match requires:

Criterion 1: ∀i, s_ai and t_bi are label compatible: l(t_bi) ∈ l(s_ai).

Criterion 2: ∀i, ||A(s_ai) - t_bi|| < ε, our threshold for geometric similarity.

Criterion 3: ∀t_i ∀c_j, ||t_i - A(c_j)|| > r(c_j),

where motif S is in LRMSD alignment with a subset of target T, via rigid transformation A. Criterion 1 assures that we have motif and target amino acids that are identical or vary with respect to important evolutionary divergences. Criterion 2 assures that when in LRMSD alignment, all motif points are within ε of correlated target points. Finally, Criterion 3 assures that no target point falls within a C-sphere when the motif is in LRMSD alignment with the matching target points. CAMA outputs the match with smallest LRMSD among all matches that fulfill these criteria; partial matches correlating subsets of S to T are rejected. A direct transcription of these criteria appears in the sketch below.
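This is a hedged sketch under assumed data structures (motif points carry a set of alternate labels, target points carry one label, and the transform A is given as a rotation and translation); it is not the authors' implementation.

```python
import numpy as np

def acceptable_match(match, c_spheres, target_points, A, eps):
    """Check Criteria 1-3 for one candidate correlation.

    match: (motif_point, target_point) pairs, with a motif point given as
      (coords, label_set) and a target point as (coords, label);
    c_spheres: (center, radius) pairs in the motif frame;
    A: rigid transform (R, t) placing the motif in LRMSD alignment;
    eps: the geometric similarity threshold.
    """
    R, t = A
    place = lambda p: R @ np.asarray(p, dtype=float) + t
    # Criterion 1 (label compatibility) and Criterion 2 (within eps when aligned)
    for (s_xyz, s_labels), (t_xyz, t_label) in match:
        if t_label not in s_labels:
            return False
        if np.linalg.norm(place(s_xyz) - np.asarray(t_xyz)) >= eps:
            return False
    # Criterion 3: no target point may fall inside any aligned C-sphere
    for center, radius in c_spheres:
        c = place(center)
        for t_xyz, _ in target_points:
            if np.linalg.norm(np.asarray(t_xyz) - c) <= radius:
                return False
    return True
```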

Matching Algorithm CAMA is a three-stage hierarchical matching algorithm which identifies correlations for motif points in order of rank. The first stage, Seed Matching, is a hashing technique which exploits pairwise distances between motif points to rapidly identify correlations between the three highest ranking motif points and triplets of target points. These triplets are passed to the second stage, Augmentation, which expands seed matches to full correlations of all motif points. The final stage, Cavity Filtering, identifies the aligned position of the C-spheres in each full correlation, and checks to see if any target points fall within a C-sphere. The correlation with the smallest LRMSD that has no target points within any C-sphere is returned as the resulting match. Seed Matching and Augmentation are documented in earlier work5, but we summarize them below for completeness.

Seed Matching Seed Matching identifies all sets of 3 target points T' = {t_A, t_B, t_C} which fulfill our matching criteria with the 3 highest ranked motif points, S' = {s_1, s_2, s_3}. In this stage, we represent the target as a geometric graph with colored edges. There are exactly three unordered pairs of points in S', and we name them red, blue, and green. In the target, if any pair of target points t_i, t_j fulfills our first two criteria with either the red, blue, or green pair, we draw a corresponding red, blue, or green edge between t_i and t_j in the target. Once we have processed all pairs of target points, we find all three-colored triangles in T. These are the Seed Matches, a set of three-point correlations to S' that we sort by LRMSD and pass to Augmentation; a sketch follows.
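The colored-edge construction can be rendered compactly. The code below is our own illustrative sketch under assumed tolerances and data layout; it also approximates the three-colored-triangle test by requiring every color to appear on some edge of the triangle (a faithful version would assign the three edges a bijection of the three colors).

```python
from itertools import combinations
import numpy as np

def seed_matches(motif3, target, tol):
    """motif3: the three highest-ranked motif points as (coords, label_set);
    target: list of (coords, label). Returns candidate 3-point seeds."""
    colors = list(combinations(range(3), 2))        # the red, blue, green pairs
    edges = {c: set() for c in colors}
    for i, j in combinations(range(len(target)), 2):
        d = np.linalg.norm(np.asarray(target[i][0]) - np.asarray(target[j][0]))
        for a, b in colors:
            (sa_xyz, sa_lab), (sb_xyz, sb_lab) = motif3[a], motif3[b]
            d_ab = np.linalg.norm(np.asarray(sa_xyz) - np.asarray(sb_xyz))
            compatible = (target[i][1] in sa_lab and target[j][1] in sb_lab) or \
                         (target[j][1] in sa_lab and target[i][1] in sb_lab)
            if compatible and abs(d - d_ab) < tol:  # draw a colored edge
                edges[(a, b)].add(frozenset((i, j)))
    seeds = []
    for tri in combinations(range(len(target)), 3):
        sides = [frozenset(p) for p in combinations(tri, 2)]
        if all(any(s in edges[c] for s in sides) for c in colors):
            seeds.append(tri)                       # a three-colored triangle
    return seeds
```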

Augmentation Augmentation is an application of depth-first search that begins with the list of seed matches. Assuming that there are more than three motif points, we must find correspondences for the unmatched motif points within the target. Interpret the list of seed matches as a stack of partially complete matches. Pop off the first match, and, considering the LRMSD alignment of this match, plot the position P of the next unmatched motif point s_j relative to the aligned orientation of the motif. In the spherical region V around P, identify all target points t_i compatible with s_j inside V. Now compute the LRMSD alignment of all correlated points, including the new correlation (s_j, t_i). If the new alignment satisfies our first two criteria and there are no more unmatched motif points, put this match into a heap which maintains the match with smallest LRMSD. If there are more unmatched motif points, put this partial match back onto the stack. Continue to test correlations in this manner until V contains no more target points that satisfy our criteria. Then, return to the stack, and begin again by popping off the first match on the stack, repeating this process until the stack is empty.

Cavity Filtering Augmentation results in a heap of completed matches. Beginning from the match with lowest LRMSD, for each match, retrieve the alignment of the motif onto the target. Using this alignment, we plot the positions of the C-spheres in rigid alignment with the motif. Then, for each C-sphere, we check if a target point exists within the C-sphere. If any target point is found within any C-sphere, the match is discarded, and we continue to the match with the next-lowest LRMSD. This is diagrammed in Figure 3b. If we identify a match with no target points in any C-spheres, as in Figure 3a, we return this match as the output; a sketch of this loop follows.
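The order-of-rejection logic of Cavity Filtering fits in a few lines. This sketch is our own (data layout assumed: Augmentation is taken to yield (LRMSD, transform, correlation) triples), not the authors' code.

```python
import numpy as np

def cavity_filter(matches, c_spheres, target_points):
    """Return the lowest-LRMSD match whose aligned C-spheres are all empty."""
    pts = np.asarray(target_points, dtype=float)           # N x 3 target atoms
    for lrmsd, (R, t), correlation in sorted(matches, key=lambda m: m[0]):
        for center, radius in c_spheres:
            c = R @ np.asarray(center, dtype=float) + t    # aligned C-sphere
            if np.min(np.linalg.norm(pts - c, axis=1)) <= radius:
                break                  # a target atom intrudes: reject (Fig. 3b)
        else:
            return lrmsd, (R, t), correlation              # accept (Fig. 3a)
    return None                        # every completed match was rejected
```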

Discussion Standard MA would accept the match with lowest LRMSD, regardless of the C-spheres. Cavity Filtering rejects matches in order of ascending LRMSD, starting with the match with lowest LRMSD, causing CAMA to potentially increase the LRMSD of matches found, in comparison to MA. When C-sphere radii are all zero, CAMA and MA are therefore identical.

Cavity-Aware Statistical Significance We evaluate p-values for matches to a cavity-aware motif S in the same manner as for point-based motifs. We first generate a point-based version S' of S, and use S' to compute a motif profile. Then, given a match m of S with LRMSD r, we compute the p-value of r relative to this motif profile. P-values for cavity-aware motifs are computed relative to point-based motif profiles because the purpose of a cavity-aware motif is to eliminate matches which would have been statistically significant relative to the point-based motif. Since matches with cavity-aware motifs have equal or greater LRMSDs than matches with identical point-based motifs, matches found with cavity-aware motifs have equal or higher p-values.

Cavity-aware motifs are not perfect; due to variations in active site structure, some functional homologs have atoms which occupy C-spheres. In our experimentation, we measured both the number of FP matches eliminated and the number of TP matches lost by adding C-spheres, and demonstrate that the number of TP matches lost is small in comparison to the number of FP matches eliminated.

High-Impact C-spheres In our experimentation, we observed that some high-impact C-spheres eliminated more FP matches than other C-spheres. Identifying high-impact C-spheres is essential, because a cavity-aware motif without high-impact C-spheres would not eliminate many more FP matches than an identical point-based motif. More importantly, a computational technique for identifying high-impact C-spheres could simplify the design of cavity-aware motifs by ensuring that only high-impact C-spheres are used.

We have observed that motif profiles derived from cavity-aware motifs that include high-impact C-spheres tend to shift towards higher LRMSDs as C-sphere radius increases. In Figure 4a, we demonstrate motif profiles computed with a motif that has exactly one C-sphere. Each motif profile corresponds to identical motif points with a C-sphere at an identical position, where the only difference is that the radius changes evenly between zero and the C-sphere's maximum size. As size increases, the motif profile changes very little. This is a low-impact C-sphere. In comparison, in Figure 4b, for the same motif points and a C-sphere in a different position, as the radius changes uniformly between zero and the C-sphere's maximum size, many more matches shift towards higher LRMSDs, as mentioned in Section 3. This is a high-impact C-sphere.

Fig. 4. Twenty motif profiles each for a low-impact C-sphere (a) and a high-impact C-sphere (b), as radius increases (horizontal axes: LRMSD, 0-6).

We have designed a technique which uses this effect to identify high-impact C-spheres, called Cavity Scaling. Cavity Scaling takes as input a single C-sphere and a set of motif points. Using this cavity-aware motif, we generate a spectrum of cavity-aware test motifs which differ only in the radius of the single C-sphere. The C-sphere radius in each test motif ranges from zero to the maximum size of the input C-sphere. We then compute a motif profile for each test motif, and compare the motif profile medians. If the motif profile medians change significantly as C-sphere radius increases, then we consider the input C-sphere a high-impact C-sphere. The process of Cavity Scaling is then repeated for each C-sphere that has been defined, individually. Cavity Scaling permitted us to refine C-sphere selections in cavity-aware motifs. As we will show later, refined cavity-aware motifs eliminate most FP matches and maintain TP matches in comparison to manually defined cavity-aware motifs. More importantly, even though this work tests C-spheres centered on ligand atom coordinates, Cavity Scaling is independent of C-sphere centers, making it a general test for high-impact C-spheres. In the future, this could be applied at a larger scale to explore more general representations of cavity-aware motifs, and provide feedback about C-sphere placements in motif design. A sketch of Cavity Scaling follows.
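Cavity Scaling amounts to a radius sweep over one C-sphere. In the sketch below, compute_motif_profile is an assumed stand-in for a full CAMA search of the homolog-free PDB, and the median-shift cutoff is our own invented illustration, not a value from the paper.

```python
import statistics

def is_high_impact(motif_points, center, r_max, compute_motif_profile,
                   steps=20, shift_threshold=0.5):
    """Flag one C-sphere as high-impact by sweeping its radius.

    compute_motif_profile(motif_points, c_spheres) -> list of match LRMSDs.
    """
    medians = []
    for j in range(steps):
        radius = r_max * j / (steps - 1)        # 0 ... r_max in `steps` increments
        profile = compute_motif_profile(motif_points, [(center, radius)])
        medians.append(statistics.median(profile))
    # high-impact: the profile median drifts toward higher LRMSDs with radius
    return (medians[-1] - medians[0]) > shift_threshold
```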

4. EXPERIMENTAL RESULTS

Motifs The motifs used in this work begin as 18 point-based motifs designed to represent a range of unrelated active sites in unmutated protein structures with biologically occurring bound ligands. These are documented in Figure 5. Earlier work has produced examples of motifs designed with evolutionarily significant amino acids5,7 and amino acids with documented function29, so these principles were followed in the design of our point-based motifs. Amino acids for use in 10 of the motifs were selected by evolutionary significance, and are taken directly from earlier work7, and the remaining 8 motifs were identified by functionally active amino acids documented in the literature (marked * in Figure 5).

For example, in the case of nitric oxide synthase, we selected active site residues which bind NHA and heme. Cys-194 is axially coordinated to heme. Glu-371 and Trp-366 form hydrogen bonds with the guanidinium group of NHA, while Tyr-367 and a protonated Asp-376 form hydrogen bonds to the carboxylate group of NHA3. We also selected Val-346 and Phe-363, which create a small hydrophobic cavity within the larger heme-binding cavity, allowing dioxygen (O2) to bind end-on to heme without steric interference4. C-sphere positions and sizes were defined in Section 3.

The selection of motif points strongly influences motif sensitivity and specificity. However, in this work, we seek to demonstrate that adding C-spheres can improve point-based motifs. For this reason, we take the selection of motif points and the number of TP and FP matches found, for each point-based motif, as given.

Functional Homologs In order to count TP and FN matches, it is essential to fix a benchmark set of functional homologs. We use the functional classification of the Enzyme Commission25 (EC), which identifies distinct families of functional homologs for each motif used. Proteins with PDB structures in these families form the set of functional homologs we search for. Structure fragments and mutants were removed to ensure accuracy.

Unrelated Proteins In order to measure FP and TN matches, it is essential to fix the set of functionally unrelated protein structures. The set we use is, initially, a snapshot of the PDB from Sept 1, 2005. For each motif, the set of functional homologs is removed, producing a homolog-free variation of the PDB specific to each motif. Furthermore, the PDB was processed to reduce sequence and structure redundancy. In structures with multiple chains describing the same protein, only one copy of each redundant chain was used, and all mutants and protein fragments were removed. This produced 13599 protein structures. The set of structures used was not strictly filtered for sequence nonredundancy, because eliminating one member of any pair with too much sequence identity involves making arbitrary choices. Eliminating fragments and mutated structures, which seem to be the largest source of sequence redundancy, was the most reproducible and well defined policy.

Fig. 5. Motifs used in experimentation, with example diagrams (not reproduced here). Starred (*) motifs use functionally documented amino acids. The column marked "#C" denotes the number of C-spheres in each motif. "Range" denotes the range of C-sphere maximum diameters (in Å) for the motif. Experimental details can be found at: http://www.cs.rice.edu/~brianyc/papers/CSB2006

PDB id  Amino Acids Used                               Ligands Used              #C  Range
16pk*   R39,P45,G376,G399,K202                         C15H22N5O12F4P3           10  4-6
1ady*   E81,T83,R112,E130,Y264,R311                    C10H21N5O5P               10  4-6
1ani*   D51,D101,S102,R166,H331,H412                   Zn2+, O4P3-               10  2-6
1ayl    L249,S250,G251,G253,K254,T255                  ATP, C2O4 2-              10  4-8
1b7y*   W149,H178,S180,E206,Q218,F258,F260             C19H28N6O7P, Mg2+         10  4-8
1czf    D180,D201,D202,A205,G228,S229,R256,K258,Y291   C8H15NO6, Zn2+            10  2-8
1did*   F25,H53,D56,F93,W136,K182                      Mn2+, C8H13NO4            10  2-6
1dww*   C194,V346,F363,W366,Y367,E371,D376             Heme, NHA                 10  4-10
1ggm*   E188,R311,E239,E341,E359,S361                  C12H17N6O8P               10  4-10
1ja7    S36,C76,W108,Q57,I58,W63                       C8H15NO6                  10  4-8
1jg1    E97,G99,G101,D160,L179,G183                    C15H22N6O5S               10  6-8
1kp3    R106,F139,E202,L286,R288,Y331                  ATP                       10  6-8
1kpg    D17,G72,G74,W75,G76,F200                       C5H11NO2Se                10  6-6
1lbf    E51,S56,P57,F89,G91,F112,E159,N180,S211,G233   C12H18NO9P                10  4-6
1ucn    K12,P13,G92,R105,N115,H118                     O4P3-, Ca2+, ADP           8  4-8
2ahj    P53,L120,Y127,V190,D193,I196                   Fe3+, NO, C4H8O2, Zn2+    10  4-10
7mht    P80,C81,S85,E119,R163,R165                     C14H20N6O5S               10  4-8
8tln*   M120,E143,L144,Y157,H231                       C2H6OS, Ca2+, Zn2+         9  2-8

Implementation Specifics CAMA was implemented in C/C++. Large scale comparison of many potential C-sphere radii was accomplished with a distributed version of CAMA, which used the Message Passing Interface19 (MPI) protocol for interprocess communication. Code was prototyped on a 16-node Athlon 1900MP cluster and the Rice TeraCluster, a cluster of 272 800MHz Intel Itanium2 processors. Final production runs ran on Ada, a 28-chassis Cray XD1 with 672 2.2GHz AMD Opteron cores.

4.1. C-Spheres Eliminate False Positives, Preserve True Positives

We first demonstrate that C-spheres affect the elimination of FP matches and the retention of TP matches. We compared the number of TP and FP matches found with 18 point-based motifs to cavity-aware versions of the same motifs. For completeness, we show how 20 increments of varying C-sphere radii affect the number of TP and FP matches found.

Our data begins as 18 motifs {S_1, S_2, ..., S_18}. For each motif S_i, we generated 20 C-sphere size variations called {S_{i_0}, S_{i_1}, ..., S_{i_19}}. If S_i has C-spheres {c_1, c_2, ..., c_k}, with individual maximum sizes r_max(c_1), r_max(c_2), ..., r_max(c_k), then the variation S_{i_j} ∈ {S_{i_0}, S_{i_1}, ..., S_{i_19}} has C-spheres of radii (j/19) r_max(c_1), (j/19) r_max(c_2), ..., (j/19) r_max(c_k). For example, S_{i_19} has C-spheres of radii r_max(c_1), r_max(c_2), ..., r_max(c_k), and S_{i_0} would have only C-spheres of radii 0, making S_{i_0} equivalent to a point-based motif.

Since matches to S_{i_1}, S_{i_2}, ..., S_{i_19} have p-values greater than or equal to those of S_{i_0}, because they have C-spheres with non-zero radii, the number of FP and TP matches identified among S_{i_1}, S_{i_2}, ..., S_{i_19} is less than or equal to that of S_{i_0}. The number of homologs matched by each point-based motif, S_{i_0}, is listed at the left of Figure 6. The number of TP and FP matches eliminated is calculated relative to the number matched by the point-based motif, and thus all S_{i_0} have 100% of TP and FP matches, as in the leftmost point of the graph in Figure 6. Second from the left, we plot the percentage of TP and FP matches retained among S_{i_1}, relative to S_{i_0}, for all i, and then average these percentages over all S_{i_1}. Continuing from left to right, we compute the average percentage of TP and FP matches over all S_{i_2}, then all S_{i_3}, etc., again relative to S_{i_0}.

Fig. 6. Average effect of cavity-aware motifs on TP and FP matches, over all motifs. The horizontal axis charts C-sphere radius, where the radius of all C-spheres scales simultaneously from zero to individual maximum size (see Section 4.1). The vertical axis charts the average percentage, per motif, of TP and FP matches remaining, relative to their respective point-based motifs. FP matches are dramatically reduced while most TP matches are preserved. Before TP matches begin to fall off, cavity-aware motifs eliminate 80% of FP matches while maintaining 87% of TP matches. The number of TP and FP matches for each point-based motif, shown at the left of the original figure, is:

Motif   # Homologs   # TPs   # FPs
16pk          20        14     216
1ady          22        20     200
1ani          75        75     205
1ayl           8         8     170
1b7y           9         0     170
1czf          14        14     117
1did         149       149      80
1dww         192       181      76
1ggm           7         5     195
1ja7        1008       448      57
1jg1          13        13     196
1kp3          35        35     162
1kpg          13        11     151
1lbf          11        11      50
1ucn         153       133     162
2ahj          23         6     186
7mht          10         9     160
8tln          59        56     187

Observations As demonstrated in Figure 6, as C-sphere radius increases, the number of FP matches is reduced dramatically, while the number of TP matches falls slightly. Also, large percentages of TP matches were maintained as C-sphere radius increased, with few losses, until approximately 80% of maximum size, when the number of true positives began to fall off for most motifs. This was expected, since maximum size was computed only on the primary motif structure, and not on homologs.

One motif, Phenylalanyl-tRNA Synthetase (1b7y), exhibited zero sensitivity. The point-based version of 1b7y matched no functional homologs, so no cavity-aware motifs based on 1b7y matched any functional homologs either. For this reason, the percentage of TP matches eliminated by cavity-aware variations of 1b7y is undefined, and therefore no TP or FP data (for consistency) is included in the averages plotted in Figure 6. Cavity-aware variations of 1b7y still rejected more FPs as C-sphere radius increased. Point-based motifs from 1ja7 and 2ahj exhibited low sensitivity, identifying less than 20% of the total number of true positives. Because 16pk has a very flexible active site, its cavity-aware variations were significantly less sensitive than its point-based counterpart. Overall, cavity-aware motifs eliminate many FP matches, while preserving most TP matches.

4.2. Analysis of Individual C-spheres

Some C-spheres may have a greater impact on FP match elimination than other C-spheres. We performed Cavity Scaling on each C-sphere in each of our 18 motifs, identifying which C-spheres were high-impact. 1ayl, used in Figure 7, is an excellent example, having several high- and low-impact C-spheres. All motifs had related behavior: some motifs had many high-impact C-spheres, and others (1czf, 16pk, 8tln) had none, but significant increases in motif profile medians remained correlated to the elimination of FP matches in all examples.

Observations Motif profiles of some single-C-sphere motifs, computed over increasing radii, shift significantly in the median towards higher LRMSDs.


Fig. 7. Effect of individual C-spheres on motif specificity. As C-sphere size uniformly increases, as described in Section 4.1 (horizontal axis: % max C-sphere size), some high-impact C-spheres, such as 4 and 6, eliminate more FP matches (vertical axis) than others, such as 10 and 9. Line plots show the number of remaining FP matches for a specific single-C-sphere motif, and for a motif containing all C-spheres. C-sphere positions relative to cavity shape are illustrated in the inset graphic. High-impact C-spheres, such as C-sphere 6, generate motif profiles whose medians shift towards higher LRMSDs as C-sphere radius increases. Other C-spheres, which do not eliminate as many FP matches, such as C-sphere 10, do not affect motif profiles as much. Cavity Scaling identifies C-spheres which eliminate more FP matches.

These single-C-sphere motifs eliminate more FP matches as radii increase. Alternatively, motif profile medians of other single-C-sphere motifs that do not eliminate many FP matches also do not shift towards higher LRMSDs as radii increase. This is apparent in Figure 7, where we detail this effect for single-C-sphere motifs based on 1ayl. In the inset graphs, identical copies of the 1ayl motif that contain only C-sphere 4 or 6 undergo significant changes in motif profile medians, towards higher LRMSDs, as radius increases. Simultaneously, as seen in the main graph, these single-C-sphere motifs, containing only C-sphere 4 or 6, rapidly eliminate FP matches. 1ayl motif copies with only C-sphere 9 or 10 experience insignificant changes in motif profile medians, and also eliminate FP matches more slowly, as radius increases. C-sphere positions relative to active site geometry are provided in the inset graphic in Figure 7. No correlation between high-impact C-spheres and cavity topography was apparent, emphasizing the difficulty of designing motifs with high-impact cavities.

Motifs with only one C-sphere eliminate very few TP matches, but careful inspection indicates that individual cavities cause different TP matches to be rejected. This effect accumulates into the slow loss of TP matches observed in Section 4.1.

Fig. 8. TP/FP matches preserved when using automatically refined cavity-aware motifs. Axes are identical to those of Figure 6 (horizontal: % max C-sphere size; vertical: % of TP and FP matches remaining; curves: TPs and FPs for manually designed motifs (black) and automatically refined motifs (gray)). Automatically refined motifs reject a large majority of FP matches, retaining slightly more than manually designed motifs. Automatically refined motifs also preserve slightly more TP matches than manually designed motifs.

4.3. Automatically Refined Cavity-aware Motifs

In an experimental function prediction setting, rules and automated techniques for defining sensitive and specific motifs are important for high throughput function predictions. Having shown in the previous section that Cavity Scaling can identify high-impact C-spheres, we use Cavity Scaling to generate motifs containing only high-impact C-spheres, and demonstrate that they are reasonably effective.

Experiment: We applied Cavity Scaling on every C-sphere in every motif, which identified a set of high-impact C-spheres for all motifs except 1czf, 16pk, and 8tln. We repeated the experiment described in Section 4.1 for the remaining motifs, using only high-impact C-spheres. We refer to these as automatically refined motifs. We compared our results to the manually designed motifs used in Section 4.1, which contained all C-spheres.

Observations: Like the axes of Figure 6, Figure 8 plots percent of maximum size (horizontal axis) versus the average percent of remaining TP and FP matches (vertical axis). Automatically refined cavity-aware motifs reject a large majority of FP matches, retaining a few more than manually designed motifs. This is expected, because low-impact C-spheres still eliminate some FP matches, which are not eliminated by automatically refined motifs. Automatically refined motifs retained more TP matches on average than manually designed motifs, for the same reasons.

5. CONCLUSIONS

In order to design more sensitive and specific motifs, we have integrated atom geometry and active cavity volumes into cavity-aware motifs. On 18 nonhomologous motifs, cavity-aware motifs eliminated most false positive matches while preserving most true positive matches. We also observed that some high-impact C-spheres have a greater influence on the number of true positive and false positive matches eliminated, and that high-impact C-spheres can be identified with Cavity Scaling. Cavity Scaling refines the selection of C-spheres in cavity-aware motifs, ensuring that motifs used in practice will contain high-impact C-spheres.

Cavity Scaling is particularly relevant for cavity-aware motif design because it operates independently of C-sphere centers. C-spheres centered on general spatial locations could be filtered with Cavity Scaling for high-impact C-spheres, providing a general approach to C-sphere placement, independent of bound ligands. Cavity Scaling does not entirely answer the problem of designing cavity-aware motifs, because it does not provide quantitative reasons for selecting specific sphere sizes, but from our experience with this data set, C-spheres at approximately 80-85% of maximum size seem best.

ACKNOWLEDGEMENTS

This work is supported by a grant from the National Science Foundation, NSF DBI-0318415. Additional support is gratefully acknowledged from training fellowships of the W.M. Keck Center (NLM Grant No. 5T15LM07093) to B.C. and D.K.; from March of Dimes Grant FY03-93 to O.L.; from a Sloan Fellowship to L.K.; and from a VIGRE Training in Bioinformatics Grant from NSF DMS 0240058 to V.F. Experiments were run on equipment funded by NSF EIA-0216467 and NSF CNS-0523908. Large production runs were done on equipment supported by NSF CNS-042119, Rice University, and partnership with AMD and Cray. D.B. has been partially supported by the W.M. Keck Undergraduate Research Training Program and by the Brown School of Engineering at Rice University. A.C. has been partially supported by a CRA-W Fellowship.

References

1. Stark A., Sunyaev S., and Russell R.B. A model for statistical significance of local similarities in structure. J. Mol. Biol., 326:1307-1316, 2003.

2. Wallace A.C., Laskowski R.A., and Thornton J.M. Derivation of 3D coordinate templates for searching structural databases. Prot. Sci., 5:1001-13, 1996.

3. Crane B.R., Arvai A.S., Ghosh D.K., Wu C., Getzoff E.D., Stuehr D.J., and Tainer J.A. Structure of nitric oxide synthase oxygenase dimer with pterin and substrate. Science, 279:2121-2126, 1998.

4. Crane B.R., Arvai A.S., Ghosh S., Getzoff E.D., Stuehr D.J., and Tainer J.A. Structures of the Nω-hydroxy-L-arginine complex of inducible nitric oxide synthase oxygenase dimer with active and inactive pterins. Biochemistry, 39:4608-4621, 2000.

5. Chen B.Y., Fofanov V.Y., Kristensen D.M., Kimmel M., Lichtarge O., and Kavraki L.E. Algorithms for structural comparison and statistical analysis of 3d protein motifs. Proceedings of Pacific Symposium on Biocomputing 2005, pages 334-45, 2005.

6. Levitt D.G. and Banaszak L.J. Pocket: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. Journal of Molecular Graphics, 10(4):229-34, 1992.

7. Kristensen D.M., Chen B.Y., Fofanov V.Y., Ward R.M., Lisewski A.M., Kimmel M., Kavraki L.E., and Lichtarge O. Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity. Protein Science, in press, 2006.

8. Edelsbrunner H., Facello M., and Liang J. On the definition and the construction of pockets in macromolecules. Discrete Applied Mathematics, 88:83-102, 1998.

9. Edelsbrunner H. and Mucke E.P. Three-dimensional alpha shapes. ACM Trans. Graphics, 13:43-72, 1994.

10. Wolfson H.J. and Rigoutsos I. Geometric hashing: An overview. IEEE Comp. Sci. Eng., 4(4):10-21, 1997.

11. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., and Bourne P.E. The protein data bank. Nucleic Acids Research, 28:235-242, 2000.

12. Kuntz I.D., Blaney J.M., Oatley S.J., Langridge R., and Ferrin T.E. A geometric approach to macromolecule-ligand interactions. J. Mol. Biol., 161:269-288, 1982.

13. Liang J., Edelsbrunner H., and Woodward C. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Science, 7:1884-1897, 1998.

14. Barker J.A. and Thornton J.M. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinf., 19(13):1644-1649, 2003.

15. Kinoshita K. and Nakamura H. Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Science, 12:1589-1595, 2003.

16. Rosen M., Lin S.L., Wolfson H., and Nussinov R. Molecular shape comparisons in searches for active sites and functional similarity. Prot. Eng., 11(4):263-277, 1998.

17. Shatsky M., Shulman-Peleg A., Nussinov R., and Wolfson H.J. Recognition of binding patterns common to a set of protein structures. Proceedings of RECOMB 2005, pages 440-55, 2005.

18. Shatsky M., Nussinov R., and Wolfson H.J. Flexprot: Alignment of flexible protein structures without a predefinition of hinge regions. Journal of Computational Biology, 11(1):83-106, 2004.

19. Snir M. and Gropp W. MPI: The Complete Reference (2nd Edition). The MIT Press, 1998.

20. Williams M.A., Goodfellow J.M., and Thornton J.M. Buried waters and internal cavities in monomeric proteins. Protein Science, 3:1224-35, 1994.

21. Connolly M.L. Solvent-accessible surfaces of proteins and nucleic acids. Science, 221:709-713, 1983.


22. Bachar O., Fischer D., Nussinov R., and Wolfson H. A computer vision based technique for 3-D sequence independent structural comparison of proteins. Prot. Eng., 6(3):279-288, 1993.

23. Lichtarge O., Bourne H.R., and Cohen F.E. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257(2):342-358, 1996.

24. Lichtarge O., Yamamoto K.R., and Cohen F.E. Identification of functional surfaces of the zinc binding domains of intracellular receptors. J. Mol. Biol., 274:325-7, 1997.

25. International Union of Biochemistry. Nomenclature Committee. Enzyme Nomenclature. Academic Press: San Diego, California, 1992.

26. Smart O.S., Goodfellow J.M., and Wallace B.A. The pore dimensions of gramicidin A. Biophysical Journal, 65:2455-2460, 1993.

27. Artymiuk P.J., Poirrette A.R., Grindley H.M., Rice D.W., and Willett P. A graph-theoretic approach to the identification of three dimensional patterns of amino acid side chains in protein structures. J. Mol. Biol, 243:327-344, 1994.

28. Laskowski R.A. SURFNET: A program for visualizing molecular surfaces, cavities, and intramolecular interactions. Journal of Molecular Graphics, 13:321-330, 1995.

29. Laskowski R.A., Watson J.D., and Thornton J.M. Protein function prediction using local 3d templates. Journal of Molecular Biology, 351:614-626, 2005.

30. Laskowski R.A., Luscombe N.M., Swindells M.B., and Thornton J.M. Protein clefts in molecular recognition and function. Protein Science, 5:2438-2452, 1996.

31. Adak S., Wang Q., and Stuehr D.J. Arginine conversion to nitroxide by tetrahydrobiopterin-free neuronal nitric-oxide synthase. J. Biol. Chem., 275:33554-33561, 2000.

32. Binkowski T.A., Joachimiak A., and Liang J. Protein surface analysis for function annotation in high-throughput structural genomics pipeline. Protein Science, 14:2972-2981, 2005.

33. Binkowski T.A., Adamian L., and Liang J. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J. Mol. Biol., 332:505-526, 2003.

34. Lamdan Y. and Wolfson H.J. Geometric hashing: A general and efficient model based recognition scheme. Proc. IEEE Conf. Comp. Vis., pages 238-249, 1988.


PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES

Chia-Yu Su (1,2), Allan Lo (1,3), Hua-Sheng Chiu (4), Ting-Yi Sung (4), Wen-Lian Hsu (4,*)

(1) Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
(2) Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan
(3) Department of Life Sciences, National Tsing Hua University, Hsinchu, Taiwan
(4) Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan

Email: {cysu, allanlo, huasheng, tsung, hsu}@iis.sinica.edu.tw

Prediction of subcellular localization of proteins is important for genome annotation, protein function prediction, and drug discovery. We present a prediction method for Gram-negative bacteria that uses ten one-versus-one support vector machine (SVM) classifiers, where compartment-specific biological features are selected as input to each SVM classifier. The final prediction of localization sites is determined by integrating the results from the ten binary classifiers using a combination of majority votes and a probabilistic method. The overall accuracy reaches 91.4%, which is 1.6% better than the state-of-the-art system, in a ten-fold cross-validation evaluation on a benchmark data set. We demonstrate that feature selection guided by biological knowledge and insights in one-versus-one SVM classifiers can lead to a significant improvement in the prediction performance. Our model also produces highly accurate predictions, at 92.8% overall accuracy, for proteins with dual localizations.

1. INTRODUCTION

Gram-negative bacteria have five major subcellular localization sites, which are the cytoplasm (CP), the inner membrane (IM), the periplasm (PP), the outer membrane (OM), and the extracellular space (EC). Prediction of protein subcellular localization for Gram-negative bacteria has been extensively studied and several systems have been developed. PSORT I1 has been a widely used prediction tool. Gardy et al.2 proposed PSORT-B, a multi-modular method combined with a Bayesian network, to improve the performance of PSORT I. Although PSORT-B has a high precision, it only yields an overall prediction recall, also referred to as accuracy, of 74.8%. Yu et al.3 presented an approach called CELLO that utilized support vector machines (SVM) based on n-peptide compositions. The overall prediction accuracy of CELLO reaches 88.9%, but the accuracy for extracellular proteins is still relatively low, at 78.9%. Recently, Wang et al.4 developed a system called P-CLASSIFIER that used multiple SVMs based on amino acid subalphabets. The system attains an overall prediction accuracy of 89.8%.

In this study, we present a method called PSL101 (Protein Subcellular Localization prediction by 1-On-1 classifiers) that incorporates compartment-specific biological features in ten one-versus-one (1-v-1) SVM classifiers to predict protein subcellular localization for

Gram-negative bacteria. Given a protein sequence, PSL101 constructs feature vectors extracted from specific input features that are characteristic of a given localization. These features include amino acid composition, di-peptide composition, solvent accessibility, secondary structure, signal peptides, transmembrane α-helices, transmembrane β-barrels, and non-classical protein secretion. Biological knowledge and insights are used to guide our feature selection in the classification of different compartments. The output probability values from the ten binary classifiers are integrated by a combination of majority votes and a probabilistic method to determine the final prediction of localization sites. Experiment results show that our method attains an overall prediction accuracy of 91.4%, presently the most accurate prediction performance for single-localized proteins. Based on a forward feature selection algorithm, the final feature combinations correlate well with biological insights. We further make use of this method in the prediction of dual-localized proteins and obtain an overall accuracy of 92.8%.

2. METHODS

2.1. SVM framework

SVM has been widely used in pattern recognition applications in data mining and bioinformatics. Prediction of

* Corresponding author.


protein subcellular localization can be treated as a multi-class classification problem. For multi-class classification, the one-versus-rest (1-v-r) SVM model has demonstrated a good classification performance5. However, for any localization site, it is difficult to find a universal set of biological features that distinguishes it from the remaining four sites and can be effectively used in a 1-v-r SVM model. Based on biological domain knowledge, compartment-specific biological features should be used in distinguishing two localization sites, and this presupposition is later confirmed by our experiment results. Thus, we propose to use ten 1-v-1 SVM classifiers for protein subcellular localization prediction. The system architecture of PSL101 is shown in Fig. 1.

The LIBSVM6 software is used in our experiments. For all classifiers, we use the Radial Basis Function (RBF) kernel and optimize the cost (c) and gamma (γ) parameters. The probability estimates by LIBSVM are used for determining the confidence levels of classifications7. A sketch of this setup follows.
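As an illustration, the sketch below trains one binary RBF-kernel classifier with a small grid search over the cost and gamma parameters. It uses scikit-learn's SVC (which wraps LIBSVM) as a stand-in, with invented toy feature vectors; it is not the authors' code.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for compartment-specific feature vectors of two sites,
# e.g. cytoplasmic (0) versus inner membrane (1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 21)),   # class 0 feature vectors
               rng.normal(0.8, 1.0, (50, 21))])  # class 1 feature vectors
y = np.array([0] * 50 + [1] * 50)

# One 1-v-1 classifier: RBF kernel, cost C and gamma tuned by grid search,
# probability estimates enabled for the later class-determination step.
grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                    param_grid={"C": [1, 10, 100],
                                "gamma": [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.predict_proba(X[:2]))
```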

Fig. 1. System architecture of one-versus-one SVM models based on compartment-specific features (input: a protein sequence; output: predicted localization site(s)).

2.2. Biological input features

In Gram-negative bacteria secretory pathways, proteins localized to a particular subcellular compartment have distinct biological properties. We consider the following nine biological input features to distinguish between proteins translocated to different compartments, and construct our classification framework to mimic the translocation process of bacterial secretory pathways.

1. Amino acid composition (AA). Protein descriptors based on n-peptide compositions or their variations have been shown effective in protein subcellular localization prediction3,4. If n = 1, then the n-peptide composition reduces to the amino acid composition. The feature vector is of dimension 21 (i.e., 20 amino acid types plus a symbol 'X' for others).

2. Di-peptide composition (Dip). The n-peptide compositions preserve more global sequence information as n gets larger. For computational efficiency, we choose n = 2, the di-peptide composition. This feature vector has dimension 441 (21×21).

3. Solvent accessibility (SA). Protein structures from different compartments show characteristic differences, particularly at the surface, which is directly exposed to the environment. Proteins in different localization sites have different surface residue compositions. Cytoplasmic proteins have a balance of acidic and basic surface residues, while extracellular proteins have a slight excess of acidic surface residues8. Thus, solvent accessibility represented by the amino acid composition of surface residues could be useful to identify extracellular proteins.

4. Secondary structure elements (SSE). Transmembrane α-helices are frequently observed in inner membrane proteins, while transmembrane β-barrels are largely found in outer membrane proteins9. The secondary structure elements are useful for detecting proteins localized in the inner membrane and the outer membrane. We compute the amino acid compositions of three secondary structure elements (α-helix, β-strand, and random coil) based on the predicted results from HYPROSP II10, a knowledge-based secondary structure prediction approach.

5. Signal peptides (Sig). Signal peptides are N-terminal peptides typically between 15 and 40 amino acids long, and they target proteins for translocation through the general secretory pathway11. The presence of a signal peptide suggests that the protein does not reside in the cytoplasm. SignalP12, a neural network and hidden Markov model based method, is used to predict the presence and location of signal peptide cleavage sites in protein sequences. We employ this prediction method to distinguish cytoplasmic and non-cytoplasmic proteins.


6. Transmembrane α-helices (TMA). Integral inner membrane proteins are characterized by transmembrane α-helices. The presence of transmembrane α-helices could imply that the protein is located in the inner membrane. TMHMM13 is a hidden Markov model based method for the prediction of transmembrane α-helices and their topology in proteins. We apply TMHMM to identify potential transmembrane α-helical proteins residing in the inner membrane.

7. Twin-arginine signal peptides (TAT). The twin-arginine translocase (TAT) system exports proteins from the cytoplasm to the periplasm. The proteins translocated by TAT bear a unique twin-arginine motif14. The presence of the motif is a useful feature to distinguish periplasmic and non-periplasmic proteins. The TatP server15 uses a combination of two neural networks to predict the presence and location of twin-arginine signal peptide cleavage sites in bacteria. This server is used to detect TAT.

8. Transmembrane β-barrels (TMB). A large number of proteins residing in the outer membrane are characterized by β-barrel structures. Thus, they could be a candidate feature to detect outer membrane proteins. TMB-Hunt is a method that uses a modified k-Nearest Neighbor (k-NN) algorithm to distinguish protein sequences of transmembrane β-barrels (TMB) from non-TMB on the basis of amino acid composition. We employ TMB-Hunt to identify potential outer membrane proteins.

9. Non-classical protein secretion (Sec). It had been believed for a long time that an N-terminal signal peptide was strictly required to export a protein to the extracellular space. Recent studies, however, have shown that several extracellular proteins can be secreted without a classical N-terminal signal peptide17. Identification of non-classical protein secretion, which is not triggered by signal peptides, could be a potential discriminator for cytoplasmic and extracellular proteins. Predictions produced from SecretomeP18, a non-classical protein secretion method, are applied in our experiments.

2.3. Feature selection in SVM classifiers

Since it is infeasible to try all possible feature combinations in different classifiers, heuristics guided by biological insights are used to determine a small subset of input features specific to each classifier. Starting with an empty subset, a forward feature selection algorithm keeps adding the best features that lead to an improvement in the accuracy of the classifiers. The process is terminated when adding features no longer improves the accuracy, as in the sketch below.
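A minimal sketch of that greedy procedure, with evaluate_accuracy as an assumed callback (in this setting, the cross-validated accuracy of one binary classifier trained on the concatenated feature blocks):

```python
def forward_feature_selection(feature_blocks, evaluate_accuracy):
    """Greedy forward selection over named feature blocks (AA, Dip, SA, ...)."""
    selected, best_acc = [], 0.0
    while True:
        remaining = [f for f in feature_blocks if f not in selected]
        if not remaining:
            break
        # Try adding each remaining block; keep the best improvement.
        acc, best_f = max((evaluate_accuracy(selected + [f]), f)
                          for f in remaining)
        if acc <= best_acc:
            break                 # stop when no block improves accuracy
        selected.append(best_f)
        best_acc = acc
    return selected, best_acc
```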

2.4. Class determination

In order for each binary classifier C_ij to distinguish classes i and j, the input feature vector is constructed by concatenating different biological features refined specifically according to the intrinsic characteristics of proteins in localization sites i and j. We utilize several prediction methods to extract specific features based on biological domain knowledge. For each protein in the testing set, a predicted class and its corresponding probability are returned from each classifier.

In order to determine the predicted localization site of each protein, we combine the predicted results from the ten binary classifiers by majority votes. In the case of a tie, the localization site with the highest average probability is assigned as the final prediction of the localization site, as in the sketch below.

3.1. Benchmark data set

To train and test our method, we use a benchmark data set of proteins from Gram-negative bacteria applied in previous works1-4. It consists of 1,441 proteins with experimentally determined localizations, in which 1,302 proteins have a single localization site and 139 proteins have dual localization sites. Table 1 lists the number of proteins at the different sites in the data set.

Table 1. Number of proteins in different localization sites.

Localization sites                            No.
Cytoplasmic (CP)                              248
Inner membrane (IM)                           268
Periplasmic (PP)                              244
Outer membrane (OM)                           352
Extracellular (EC)                            190
Cytoplasmic / Inner membrane (CP / IM)        14
Inner membrane / Periplasmic (IM / PP)        49
Outer membrane / Extracellular (OM / EC)      76
All sites                                     1,441


3.2. Evaluation measures

For comparison with other approaches, we follow the same measures used in previous works1-4 to evaluate the performance of our method. Accuracy (Acc) and the Matthews correlation coefficient (MCC)19, defined in Eq. (1) and (2), are used to assess the performance at the five localization sites. The overall accuracy is defined in Eq. (3).

Acc_i = TP_i / N_i,    (1)

MCC_i = (TP_i*TN_i - FP_i*FN_i) / sqrt((TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)),    (2)

Acc = (sum_{i=1..l} TP_i) / (sum_{i=1..l} N_i),    (3)

where l = 5 is the total number of localization sites, and TP_i, TN_i, FP_i, FN_i, and N_i are the numbers of true positives, true negatives, false positives, false negatives, and proteins in localization site i, respectively. MCC, which considers both under- and over-predictions, offers a complementary measure of prediction performance: MCC = 1 indicates a perfect prediction and MCC = 0 a completely random assignment.
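For concreteness, Eq. (1)-(3) transcribe directly into Python; this sketch is ours and simply mirrors the formulas, taking raw confusion counts per site:

    import math

    def site_measures(tp, tn, fp, fn):
        acc = tp / (tp + fn)  # Eq. (1): N_i = TP_i + FN_i proteins in site i
        denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Eq. (2)
        return acc, mcc

    def overall_accuracy(counts):
        # counts: list of (tp, tn, fp, fn) tuples over the l = 5 sites; Eq. (3)
        return sum(tp for tp, _, _, fn in counts) / sum(tp + fn for tp, _, _, fn in counts)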

Due to different intrinsic characteristics of single localization and dual localization proteins, their prediction results are reported separately.

3.3. Results of single localization proteins

In Table 2, we compare the performance of our approach with other approaches using the 1,302 single-localized proteins in a ten-fold cross-validation test. The overall accuracy of PSL101 reaches 91.4%, which is 1.6% better than the state-of-the-art system, P-CLASSIFIER. In addition, PSL101 outperforms P-CLASSIFIER in terms of MCC for every site except extracellular proteins. The compartment-specific features selected by PSL101 are summarized in Table 3. The experimental results show that our feature selection not only leads to a significant improvement in the overall accuracy but also correlates well with biological insights. For example, PSL101 selects signal peptides and transmembrane α-helices as the optimal features to distinguish proteins localized in the cytoplasm (no signal peptides) and the inner membrane (presence of transmembrane α-helices).

3.4. Results of dual localization proteins

For dual localization classification, we conduct two experiments. In the first experiment, we compare with P-CLASSIFIER, in which the dual-localized proteins are tested with classifiers trained on single-localized proteins. The two localization sites receiving the highest probability sums from the 10 classifiers are assigned as the dual localization sites of the protein. Instead of giving full marks to dual-localized proteins with at least one site predicted correctly, we choose a less biased criterion to assess the performance: if only one of the dual localization sites is predicted correctly, the prediction receives only half a mark.

Table 3. Compartment-specific feature selection. Rows: the ten 1-v-1 classifiers C_{CP,IM}, C_{CP,PP}, C_{CP,OM}, C_{CP,EC}, C_{IM,PP}, C_{IM,OM}, C_{IM,EC}, C_{PP,OM}, C_{PP,EC}, and C_{OM,EC}; columns: the candidate features AA, Dip, SA, SSE, Sig, TMA, TAT, TMB, and Sec, with a mark indicating that the feature is selected for that classifier.

Table 2. The comparison of different approaches in the prediction of subcellular localization for Gram-negative bacteria.

Localization      PSL101            P-CLASSIFIER      CELLO             PSORT-B           PSORT I
                  Acc (%)   MCC     Acc (%)   MCC     Acc (%)   MCC     Acc (%)   MCC     Acc (%)   MCC
CP                95.2      0.88    94.6      0.85    90.7      0.85    69.4      0.79    75.4      0.58
IM                93.7      0.95    87.1      0.92    88.4      0.92    78.7      0.85    95.1      0.64
PP                87.3      0.84    85.9      0.81    86.9      0.80    57.6      0.69    66.4      0.55
OM                93.8      0.93    93.6      0.90    94.6      0.90    90.3      0.93    54.5      0.47
EC                84.2      0.83    86.0      0.89    78.9      0.82    70.0      0.79    -         -
Overall Acc (%)   91.4              89.8              -                 74.8              60.9


Table 4 lists the prediction performance. PSL101 outperforms P-CLASSIFIER except for the cytoplasmic/inner membrane class, for which there are only 14 proteins in the data set.

In the second experiment, we apply 1-v-1 SVM models directly to dual-localized proteins in a ten-fold cross-validation test. Since there are three pairs of dual localization sites, {CP,IM}, {IM,PP}, and {OM,EC}, we use the following three 1-v-1 SVM classifiers: C_{CP,IM},{IM,PP}, C_{CP,IM},{OM,EC}, and C_{IM,PP},{OM,EC}. For each dual-localized protein, the ten predicted probabilities generated by the previous 10 classifiers trained on single-localized proteins comprise the input feature vector (of dimension 10). Since the classifier C_{CP,IM},{IM,PP} has the IM site in common between its two classes, it requires an additional single localization classifier, C_CP,PP, to distinguish {CP,IM} from {IM,PP}. The final prediction of the dual localization sites is therefore determined by a combination of the output probabilities from the three dual localization classifiers and the single localization classifier C_CP,PP. To assess the prediction performance, we use the same evaluation measures defined in Eq. (1), (2), and (3). The predicted results are shown in Table 5. The overall accuracy reaches 92.8% for proteins localized in two different localizations. The results indicate that PSL101 performs consistently well on both single and dual localization proteins. Thus, the input feature vector of dimension 10 trained on single-localized proteins is able to capture the important relationships between the input biological features and the characteristics of the localization sites.
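A minimal sketch of this stacking scheme (ours, not the authors' implementation) assumes scikit-learn-style models with a predict_proba interface:

    import numpy as np
    from sklearn.svm import SVC

    def stacked_features(single_models, X):
        # one predicted probability per single-localization 1-v-1 classifier,
        # giving the 10-dimensional meta-feature vector for each protein
        return np.column_stack([m.predict_proba(X)[:, 1] for m in single_models])

    # a dual-site meta-classifier, e.g. distinguishing {CP,IM} from {OM,EC}:
    # dual_clf = SVC(probability=True).fit(stacked_features(models, X_train), y_dual)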

4. CONCLUSION

In this study, we propose a method to predict protein subcellular localization using multiple 1-v-1 SVM models based on compartment-specific features. Experimental results show that our method attains high overall prediction accuracies of 91.4% and 92.8% for single and dual localization proteins, respectively. The feature combinations generated by a forward feature selection algorithm correlate well with biological insights. Our method provides accurate predictions and suggests useful biological features for protein localization prediction.

Table 4. The comparison of the prediction performance for dual localization proteins.

                PSL101               P-CLASSIFIER
Localization    Mark     Acc (%)     Mark     Acc (%)
CP / IM         6.5      46.4        10.5     75.0
IM / PP         26.5     54.1        19.0     38.8
OM / EC         73.0     96.1        64.0     84.2
Overall         106      76.3        93.5     67.3

Table 5. The performance of dual localization classifiers that use predicted probabilities from the 10 single localization classifiers as input features.

Localization    Acc (%)    MCC
CP / IM         64.3       0.70
IM / PP         93.9       0.85
OM / EC         97.4       0.96
Overall         92.8       -

Acknowledgments

We thank Hsin-Nan Lin, Jia-Ming Chang, and Ching-Tai Chen for helpful suggestions and computational assistance. The research was supported in part by the thematic program of Academia Sinica under grants AS94B003 and AS95ASIA02.

References

1. Nakai K and Kanehisa M. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins 1991; 11: 95-110.

2. Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, et al. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res 2003; 31: 3613-3617.

3. Yu CS, Lin CJ, and Hwang JK. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci 2004; 13: 1402-1406.

4. Wang J, Sung WK, Krishnan A, and Li KB. Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines. BMC Bioinformatics 2005; 6: 174.


5. Garg A, Bhasin M, and Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem 2005; 280: 14427-14432.

6. Chang CC and Lin CJ. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

7. Wu TF, Lin CJ, and Weng RC. Probability estimates for multi-class classification by pairwise coupling. J Machine Learning Res 2004; 5: 975-1005.

8. Andrade MA, O'Donoghue SI, and Rost B. Adaptation of protein surfaces to subcellular location. J Mol Biol 1998; 276: 517-525.

9. Pautsch A and Schulz GE. Structure of the outer membrane protein A transmembrane domain. Nat Struct Biol 1998; 5: 1013-1017.

10. Lin HN, Chang JM, Wu KP, Sung TY, and Hsu WL. HYPROSP II: a knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence. Bioinformatics 2005; 21: 3227-3233.

11. Emanuelsson O, Nielsen H, Brunak S, and von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000; 300: 1005-1016.

12. Bendtsen JD, Nielsen H, von Heijne G, and Brunak S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 2004; 340: 783-795.

13. Krogh A, Larsson B, von Heijne G, and Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001; 305: 567-580.

14. Berks BC. A common export pathway for proteins binding complex redox cofactors? Mol Microbiol 1996; 22: 393-404.

15. Bendtsen JD, Nielsen H, Widdick D, Palmer T, and Brunak S. Prediction of twin-arginine signal peptides. BMC Bioinformatics 2005; 6: 167.

16. Garrow AG, Agnew A, and Westhead DR. TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics 2005; 6: 56.

17. Nickel W. The mystery of nonclassical protein secretion. A current view on cargo proteins and potential export routes. Eur J Biochem 2003; 270: 2109-2119.

18. Bendtsen JD, Jensen LJ, Blom N, Von Heijne G, and Brunak S. Feature-based prediction of non-classical and leaderless protein secretion. Protein Eng Des Sel 2004; 17: 349-356.

19. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975; 405: 442-451.


PREDICTING THE BINDING AFFINITY OF MHC CLASS II PEPTIDES

Fatih Altiparmak, Altuna Akalin, Hakan Ferhatosmanoglu

Computer Science and Engineering, The Ohio State University; Computational Biology Unit, Bergen Center for Computational Science, University of Bergen

emails: {altiparm, hakan}@cse.ohio-state.edu, Altuna.Akal...

MHC (Major Histocompatibility Complex) proteins are categorized under the heterodimeric integral membrane proteins. The MHC molecules are divided into two subclasses, class I and class II, which differ from each other in the size of their binding pockets. Predicting the affinity of these peptides is important for vaccine design; it is also vital for understanding the roles of the immune system in various diseases. Due to the variability of the locations of the class II peptide binding cores, predicting the affinity of these peptides is difficult. In this paper, we propose a new method for predicting the affinity of MHC Class II binding peptides based on their sequences. Our method classifies peptides as binding and non-binding using a 3-step algorithm. In the first step we identify the informative n-grams based on their frequencies in both classes. In the next step, the alphabet size is reduced. In the last step, the class of a given sequence is predicted by utilizing the informative n-grams. We have tested our method on the MHC Bench IV-b data set [13] and compared it with various other methods in the literature.

1. INTRODUCTION

MHC (Major Histocompatibility Complex) proteins are categorized under the heterodimeric integral membrane proteins. The primary function of MHC proteins is the presentation of antigenic peptides, which are degraded from foreign proteins, to T lymphocytes so that an immune response in the system can start [4]. The MHC molecules are divided into two subclasses, class I and class II. Members of both classes bind to peptides by recognizing a core sequence of 9 residues. The two classes differ from each other in the size of their binding pockets: MHC class I molecules usually bind peptides of around 9 residues, whereas class II molecules bind peptides of length 10-30 [4,11]. It has been indicated that specific positions in this binding core, called anchor residues, are important for binding specificity and affinity; in the 9-residue binding core, positions 1, 4, 6, 7 and 9 are the important ones [6]. There is a study showing that not only binding cores but also flanking residues towards the N- and C-terminal of the peptide affect the binding stability and affinity [7]. Predicting the affinity of these peptides is important for vaccine design. It is also vital for understanding the roles of the immune system in various diseases. Due to the variability of the locations of the class II peptide binding cores, predicting the affinity of these peptides is difficult. The task of prediction algorithms has been learning the binding motif and using it for prediction. Various methods

have been employed for this task: HMMs, neural networks, Gibbs sampling, SVMs and popular matrix based methods [1-3,12,15] are some of them. In this paper, we propose a new method for predicting the affinity of MHC Class II binding peptides based on their sequences. Our method classifies peptides as binding and non-binding. It has been shown that some types of amino acids are preferred at some locations of binding peptides [6,7], so we expect that some motifs are important for binding and occur more frequently in the binding peptide set and not in the non-binding set. In order to find those frequent motifs we utilize n-grams and their information content. N-grams are subsequences of peptides composed of n consecutive amino acids. They have recently been utilized for classification. Ganapathiraju and colleagues [5] investigated the n-grams in different organisms and observed that some n-grams are specific to some of the organisms. In addition, Vries et al. [9] utilized them to predict protein families: they found the most representative n-grams for each family and used that information for classifying proteins in a Bayesian probabilistic model. There are also reports that n-grams have been successfully incorporated in GPCR ligand determination [16].

Our prediction method is based on a 3-step algorithm. In the first step we identify the informative n-grams, where n is in [1,5]. We declare an n-gram as informative according to the frequency of the n-gram in the distributions of both classes. In the next


step, the alphabet size is reduced such that each resulting amino acid group captures informative sub-groups. In the last step, by utilizing the informative n-grams, we aim to predict the class of a given sequence. In order to do this we employ two different prediction schemes. We have tested our methods on the MHC Bench IV-b data set [13]. Various other methods have been applied to this data set, and our methods perform better than most of them.

2. DECIDING THE INFORMATION CONTENT OF AN N-GRAM

The information content of an n-gram is decided in two steps: determination of the classes and, for each class, finding the distribution of the n-grams. In the first step, the sequences having affinity less than or equal to 0 are assigned to the first class, non-binding peptides, and the rest are assigned to the second class, binding peptides. In the second step, for the given n, we find the distribution of n-grams for both classes.

We declare an n-gram as informative according to where the frequency of the n-gram lies in the cumulative distributions of the two classes. The cumulative distribution function (cdf) F of a real-valued random variable X is defined as:

F(x) = P(X <= x)

For a specific n-gram whose information content is explored, the cdf is calculated for the distributions of both classes. Here the cdf refers to the fraction of n-grams having frequency less than or equal to that of the explored n-gram. As an example, for n = 4, assume the 4-grams AGIR, AGLH, KWVF, NCPA, and DETY show up in the sample with the following frequencies.

            AGIR   AGLH   KWVF   NCPA   DETY
Class-I     10     30     20     40     50
Class-II    10     40     50     30     20

The corresponding cdfs of these five 4-grams for both classes are then as follows.

            AGIR   AGLH   KWVF   NCPA   DETY
Class-I     0.2    0.6    0.4    0.8    1
Class-II    0.2    0.8    1      0.6    0.4

Then, the target space of the cdf, [0,1], is divided into 3 subspaces according to the minimum and maximum thresholds, as shown in Figure 1. The n-grams having a cdf less than the minimum threshold are in region 1, the ones having a cdf between the minimum and maximum thresholds are in the 2nd region, and the rest are in the 3rd region. If the given n-gram is in the same region for both distributions then it is accepted as uninformative, otherwise informative. Assume that the minimum threshold is 0.25 and the maximum one is 0.75. For the above example, the 4-grams AGIR and AGLH are uninformative. For the informative n-grams, the absolute difference of the cdfs from the two classes is assigned as the information content of the n-gram. The 4-grams KWVF and DETY have the same information content, whereas NCPA has a smaller value.

Fig. 1. Dividing the cdf space [0,1] into 3 subspaces (regions 1-3) according to the minimum and maximum thresholds.
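The informative-n-gram test above can be sketched in Python as follows; this is our illustration of the stated rule, with the 0.25/0.75 thresholds taken from the example:

    import bisect

    def cdf_scores(freqs):
        # cdf of each n-gram: fraction of n-grams whose frequency is <= its own
        values = sorted(freqs.values())
        return {g: bisect.bisect_right(values, f) / len(values) for g, f in freqs.items()}

    def region(c, lo=0.25, hi=0.75):
        return 1 if c < lo else (2 if c <= hi else 3)

    def informative(freq_bind, freq_nonbind, lo=0.25, hi=0.75):
        # an n-gram is informative if it falls in different regions for the two
        # classes; its information content is the absolute cdf difference
        c1, c2 = cdf_scores(freq_bind), cdf_scores(freq_nonbind)
        return {g: abs(c1[g] - c2[g]) for g in set(c1) & set(c2)
                if region(c1[g], lo, hi) != region(c2[g], lo, hi)}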

3. CLUSTERING INFREQUENT AMINO ACIDS: DECREASING THE ALPHABET SIZE

Since there is some preference at some positions of the binding peptides, a bias towards some types of amino acids is expected in binding peptides. In a set of binding peptides this bias may not be apparent if we just look at the single amino acid frequencies. However, if amino acids are grouped based on their chemical and physical properties, reducing the size of the alphabet, it is possible to see that some of these abstract groupings become more frequent for a class. In order to find the sub-groups of amino acids, uninformative 1-grams are extracted, and among those uninformative 1-grams, the ones that are infrequent in both the binding and non-binding sets are extracted. Extracted amino acids are grouped together if they belong to the same group in Table 1. After grouping we recalculate the frequencies with the reduced alphabet and repeat the procedure until there is no infrequent 1-gram.

Hence, the resulting groupings depend on the dataset and on the groups given in Table 1. Assume that the amino acids R and H are infrequent for both classes in the sample set, but K has a cdf above the minimum threshold. Then R and H are grouped together, whereas K will not join this group and will be


analyzed separately.

Table 1. Possible Groupings

Group 1   Group 2   Group 3      Group 4   Group 5      Group 6
D, E      R, H, K   N, Q, S, T   F, W, Y   C, A, P, G   V, I, L, M

4. ALGORITHM

In the previous sections we described how to find the information content of an n-gram using the distributions of both classes, and how to reduce the size of the alphabet. In this section, we describe the algorithms that utilize the informative subsequences to find the affinity class of a given amino acid sequence. After identifying the employed n-grams, the cdfs of these n-grams are summed up for each class, and the class with the maximum sum is assigned as the predicted class for the given sequence.

4.1. N-gram Algorithm

Any subsequence of length n is taken into account. The information held by the informative length-n subsequences is combined, and the majority class is assigned as the class of the query sequence. For example, for n = 3 and the query sequence ACDEFVWYZ of length 9, there are 7 3-grams. The information content of these 3-grams is combined and the result assigned as the class of the query sequence ACDEFVWYZ.

This approach can be utilized in two different ways. The first is using only the n-grams for a fixed n, as shown in the above example. The other is employing all the n-grams for which the information content has been explored, for n in [1,5]. We name the second variant UAL (Utilize All n-grams).
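A sketch of the prediction step (our code, not the authors'), assuming per-class dicts cdf_bind and cdf_nonbind that hold the class-specific cdfs of the informative n-grams:

    def predict(seq, cdf_bind, cdf_nonbind, ns=(3,)):
        # fixed-n variant with ns=(3,); UAL variant with ns=(1, 2, 3, 4, 5)
        s_bind = s_nonbind = 0.0
        for n in ns:
            for i in range(len(seq) - n + 1):
                g = seq[i:i + n]
                s_bind += cdf_bind.get(g, 0.0)
                s_nonbind += cdf_nonbind.get(g, 0.0)
        return "binding" if s_bind >= s_nonbind else "non-binding"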

4.2. Dynamic Approach

The algorithm shown in Table 2 is proposed for a problem which is a variant of the matrix multiplication problem (MMP). The MMP can be summarized as finding the order of multiplication such that the total cost of the multiplication operations between the matrices is minimized and each matrix is multiplied only once. The given algorithm utilizes the information content of all n-grams where n is in [1,5]. For a given sequence the algorithm aims to find the division with the greatest information content; the objective is to maximize the sum of the information contents while dividing the given sequence into subsequences. For the sequence ACD, the algorithm considers the following divisions: {ACD, A-CD, AC-D, A-C-D}. As inherited from the matrix multiplication problem, this algorithm considers each amino acid only once. The only difference is that this solution also considers the size of the n-gram; the length multiplier, LM, is added to the algorithm for this purpose. The LM function takes the size of the n-gram as input and returns 1 + n/10. Due to space restrictions, we do not detail the elements of the dynamic program.

Table 2. Dynamic Algorithm Using each Amino Acid Once

length := length of the given sequence
for i := 1 to length do
    Inf[i,i] := LM(1) * information content (IC) of aminoacid_i
for n := 2 to 5 do
    for i := 1 to length - n + 1 do
        j := i + n - 1
        Inf_All[i,j]    := LM(n) * IC of the n-gram at positions i..j
        Inf_Divide[i,j] := max over i <= k < j of ( Inf[i,k] + Inf[k+1,j] )
        Inf[i,j]        := max( Inf_All[i,j], Inf_Divide[i,j] )
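One way to realize the full computation over a whole sequence is an equivalent prefix dynamic program; this is our sketch, where ic is assumed to map an n-gram to its information content (0 if uninformative):

    def division_score(seq, ic, lm=lambda n: 1 + n / 10.0, max_n=5):
        # best[j]: optimal length-weighted information content of seq[:j],
        # over all divisions into consecutive pieces of length 1..max_n
        best = [0.0] * (len(seq) + 1)
        for j in range(1, len(seq) + 1):
            best[j] = max(best[j - n] + lm(n) * ic.get(seq[j - n:j], 0.0)
                          for n in range(1, min(max_n, j) + 1))
        return best[-1]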

5. RESULTS

There are 10 datasets available at [13]. An ideal test set should contain an equal number of binders and non-binders; in the absence of this, the evaluation parameters will show bias. Hence, we test our methods on dataset 4-b, which is categorized by its owners under the label of balanced binders and non-binders. Raghava and Singh [13] evaluate 12 different approaches in 3 categories (motif-, matrix- and ANN-based) on this dataset; we give the rankings of our results among theirs. The elimination process of the infrequent 1-grams takes two iterations: at the end of the first iteration the size of the alphabet decreases from 20 to 18, and at the end of the last iteration it is reduced to 14. Experiments were done for each alphabet. The ordinary n-gram algorithm is run for n <= 3. The data set is divided into 5 equal partitions; each partition is used in turn as the test set, with the remaining partitions used as the training set.


Table 3. Results for the 4-b dataset

Alphabet size    20       18       14
1-gram           0.546    0.6284   0.61
2-gram           0.5788   0.584    0.6045
3-gram           0.6627   0.6746   0.6954
UAL              0.6507   0.6832   0.6695
Dynamic          0.6421   0.673    0.6575

As shown in Table 3, each method performs better with the reduced-size alphabets than with the original alphabet. As the size of the alphabet decreases, the accuracy of the 2-gram and 3-gram algorithms increases. This is not the case for the others, which perform best with the alphabet of size 18. We pick the best 4 results from [13] and compare our results with these top performing methods. The selected results and the authors of the corresponding papers are listed below.

Authors                   Accuracy
Rammensee et al. [14]     0.7003
Marshal et al. [10]       0.6849
Sturniolo et al. [17]     0.6764
Hammer et al. [8]         0.6627

The highest accuracy achieved by our methods is 0.6954; only one of the results shown in [13], that of Rammensee et al. [14], is greater. The highest accuracy achieved by the Dynamic approach is 0.673, slightly less than the third entry, Sturniolo et al. [17]. That of the UAL method is 0.6832, close to the second entry, Marshal et al. [10].

6. CONCLUSION & DISCUSSION

Our methods surpass many of the methods whose results are shown in [13]. We have shown that our simple approach is as accurate as these more complicated methods.

We are currently investigating new algorithms and trying our existing algorithms on the other data sets. One possible modification to our dynamic algorithm is to use each amino acid more than once. Another is to change the LM function according to the performance of the n-gram algorithm for various n: for the dataset used here, 4-b, the 3-gram performs best, so the LM function could be designed to favor 3-grams.

References

1. M. Bhasin and GPS. Raghava. SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics, 20:421-423, 2004.
2. Vladimir Brusic, George Rudy, and Leonard C. Harrison. Prediction of MHC binding peptides using artificial neural networks. In Complex Systems: Mechanism of Adaptation, pages 253-260, 1994.
3. S. Buus, SL. Lauemoller, P. Worning, C. Kesmir, T. Frimurer, S. Corbet, A. Fomsgaard, H. Hilden, A. Holm, and S. Brunak. Sensitive quantitative predictions of peptide-MHC binding by a 'query by committee' artificial neural network approach. Tissue Antigens, 62:378-384, 2003.
4. Flora Castellino et al. Antigen presentation by MHC class II molecules: invariant chain function, protein trafficking, and the molecular basis of diverse determinant capture. Hum Immunol, 54(2):159-169, 1997.
5. M. Ganapathiraju, J. Klein-Seetharaman, R. Rosenfeld, J. Carbonell, and R. Reddy. Comparative n-gram analysis of whole-genome protein sequences. In Proceedings of the Human Language Technologies Conference, 2002.
6. AJ. Godkin, T. Friede, M. Davenport, S. Stevanovic, A. Willis, J. Jewell, A. Hill, and H.-G. Rammensee. Use of eluted peptide sequence data to identify the binding characteristics of peptides to the insulin-dependent diabetes susceptibility allele HLA-DQ8 (DQ 3.2). Int. Immunology, 9:905, 1997.
7. AJ. Godkin, KJ. Smith, A. Willis, MV. Tejada-Simon, J. Zhang, T. Elliott, and AVS. Hill. Naturally processed HLA class II peptides reveal highly conserved immunogenic flanking region sequence preferences that reflect antigen processing rather than peptide-MHC interactions. Journal of Immunology, 166:6720-6727, 2001.
8. J. Hammer, E. Bono, F. Gallazi, C. Belunis, Z. Nagy, and F. Sinigaglia. Precise prediction of major histocompatibility complex class II peptide interaction based on side chain scanning. J. Exp. Med., 180:2353, 1994.
9. JK. Vries, R. Munshi, D. Tobi, J. Klein-Seetharaman, PV. Benos, and I. Bahar. A sequence alignment-independent method for protein classification. Appl Bioinformatics, 3(2-3):137-48, 2004.
10. K.W. Marshal, K.J. Wilson, J. Liang, A. Woods, D. Zaller, and J.B. Rothbard. Prediction of peptide affinity to HLA-DRB1*0401. J. Immunol., 154:5927-5933, 1995.
11. H. Max, T. Halder, H. Kropshofer, M. Kalbus, CA. Muller, and H. Kalbacher. Characterization of peptides bound to extracellular and intracellular HLA-DR1 molecules. Hum Immunol, 38:193-200, 1993.
12. M. Nielsen, C. Lundegaard, P. Worning, SL. Lauemoller, K. Lamberth, S. Buus, S. Brunak, and O. Lund. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Science, 12:1007-1017, 2003.
13. G.P.S. Raghava and Harpreet Singh. Evaluation of MHC binding peptide prediction methods. http://www.imtech.res.in/raghava/mhcbench.
14. H.G. Rammensee, T. Friede, and S. Stevanovic. MHC ligands and peptide motifs: first listing. Immunogenetics, 41:178-228, 1995.
15. HG. Rammensee, J. Bachmann, NPN. Emmerich, OA. Bachor, and S. Stevanovic. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics, 50:213-219, 1999.
16. U. Sezerman, A. Akalin, Z. Kasap, and E. Kavak. GPCR ligand determination using SVMs. In ISMB/ECCB, 2004.
17. T. Sturniolo, E. Bono, J. Ding, L. Raddrizzani, O. Tureci, U. Sahin, M. Braxenthaler, F. Gallazzi, M.P. Protti, F. Sinigaglia, and J. Hammer. Generation of tissue-specific and promiscuous HLA ligand databases using DNA microarrays and virtual HLA class II matrices. Nat. Biotech., 17:555-561, 1999.


CODON-BASED DETECTION OF POSITIVE SELECTION CAN BE BIASED BY HETEROGENEOUS DISTRIBUTION OF POLAR AMINO ACIDS ALONG PROTEIN SEQUENCES

Xuhua Xia

Department of Biology, University of Ottawa, 30 Marie Curie, P.O. Box 450, Station A, Ottawa, Ontario, Canada, K1N 6N5.

E-mail: [email protected]

Sudhir Kumar

Center for Evolutionary Functional Genomics, The Biodesign Institute and The School of Life Sciences, Arizona State University

Tempe, AZ 85287-5301, USA E-mail: [email protected]

The ratio of the number of nonsynonymous substitutions per site (Ka) over the number of synonymous substitutions per site (Ks) has often been used to detect positive selection. Investigators now commonly generate Ka/Ks ratio profiles in a sliding window to look for peaks and valleys in order to identify regions under positive selection. Here we show that the interpretation of peaks in the Ka/Ks profile as evidence for positive selection can be misleading. Genic regions with Ka/Ks > 1 in the MRG gene family, previously claimed to be under positive selection, are associated with a high frequency of polar amino acids with a high mutability. This association between an increased Ka and a high proportion of polar amino acids appears general and is not limited to the MRG gene family or the sliding-window approach. For example, the sites detected to be under positive selection in the HIV1 protein-coding genes with a high posterior probability turn out to be mostly occupied by polar amino acids. These findings caution against invoking positive selection from Ka/Ks ratios alone and highlight the need for considering the biochemical properties of the protein domains showing high Ka/Ks ratios. In short, a high Ka/Ks ratio may arise from the intrinsic properties of amino acids instead of from extrinsic positive selection.

1. INTRODUCTION

Positive selection is one of the sculptors of biological adaptation. To detect positive selection on protein-coding sequences, it is common to calculate the number of synonymous substitutions per site (Ks) and nonsynonymous substitutions per site (Ka) and test the null hypothesis of Ka - Ks = 0, based on the neutrality principle 1-5. If the null hypothesis is rejected and Ka/Ks > 1 (or Ka - Ks > 0), then the presence of positive selection may be invoked 6-9. Statistical methods frequently used for detecting positive selection at the sequence level include the distance-based methods for pairwise comparisons 10-13, and the maximum parsimony 14 and maximum likelihood (ML) 5, 15-17 methods used for phylogeny-based inferences. A number of inherent biases and problems in some of these methods have been outlined by various authors 4, 18-22.

A previous study has shown that Ka/Ks > 1 need not be a signature of positive Darwinian selection 23. Here, we illustrate one particular bias associated with the heterogeneous distribution of polar amino acids along the linear protein sequence. Our results suggest that peaks in Ka/Ks profiles can arise from an increased frequency of polar amino acids and consequently may not be taken as evidence for positive selection. The generality of the association between an increased Ka/Ks ratio and a high proportion of polar amino acids is further demonstrated with the protein-coding genes in the HIV1 genome.

2. SITES OF "POSITIVE SELECTION" CODE FOR A RELATIVELY HIGH FREQUENCY OF POLAR AMINO ACIDS

We illustrate the problem by using sequence data from the MRG gene family, which belongs to the G-protein-coupled receptor superfamily, is expressed specifically in nociceptive neurons, and is implicated in the modulation of nociception 8. Using the Pamilo-Bianchi-Li (PBL) method 11, 12 with a sliding-window approach (window width of 90 base pairs and step length of 15 base pairs) to generate the Ka/Ks profile along the sequence, it has been reported that the peaks (Ka/Ks > 1) in the profile coincided with the extracellular domain boundaries, and the valleys (Ka/Ks < 1) coincided with the transmembrane and cytoplasmic domains 8.


This observation prompted the conclusion that the extracellular domains of the MRG receptor family have experienced strong positive selection.

The PBL method is based on the number of transitional and transversional substitutions at the 0-fold degenerate sites (where any nucleotide substitution leads to a nonsynonymous substitution, e.g., the second codon position), 2-fold degenerate sites (where one nucleotide substitution, typically a transition, is synonymous, and the other two nucleotide substitutions are nonsynonymous; e.g., the third codon position of lysine codons AAA and AAG), and 4-fold degenerate sites (where any nucleotide substitution is synonymous, e.g., the third codon position of glycine codons GGA, GGC, GGG, and GGU). The equations for computing the window-specific Kaw and Ksw under the PBL method are as follows:

Ksw = (L2w*A2w + L4w*A4w) / (L2w + L4w) + B4w,
Kaw = A0w + (L0w*B0w + L2w*B2w) / (L0w + L2w),    (1)

where L0w, L2w, and L4w are the numbers of 0-fold, 2-fold, and 4-fold degenerate sites, and Aiw and Biw are the numbers of transitional and transversional substitutions per i-fold degenerate site, respectively, in the given window w.
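As a sketch (ours, not code from the paper), the window computation under these formulas is direct once the per-window counts have been tallied from the pairwise alignment:

    def pbl_window(L0, L2, L4, A0, A2, A4, B0, B2, B4):
        # Li: counts of i-fold degenerate sites in the window; Ai, Bi:
        # transitions / transversions per i-fold degenerate site (i = 0, 2, 4)
        Ksw = (L2 * A2 + L4 * A4) / (L2 + L4) + B4
        Kaw = A0 + (L0 * B0 + L2 * B2) / (L0 + L2)
        return Kaw, Ksw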

In the practical application of the PBL method, Ksw may often become 0 when the window size is small and/or when closely related sequences are compared. In this case, investigators compute the Kaw/Ks ratio (where Ks is estimated from the whole-sequence comparison) rather than the window-specific Kaw/Ksw ratios. The Kaw/Ks profile for the MRG gene family was obtained in this way 8. Thus, any fluctuation seen in Kaw/Ks is simply the fluctuation of Kaw (Fig. 1). We observed that the Kaw/Ks fluctuation for MRG sequences was associated negatively with the number of 4-fold degenerate sites in a window (L4w). This negative correlation is highly significant (Pearson r = -0.45246, P = 0.003; Fig. 1) and means that codons in the extracellular domains contain a rather small number of 4-fold degenerate 3rd codon positions. A survey of the amino acid composition of extracellular domains provides the answer: extracellular domains contain (and require) hydrophilic (polar) amino acids that are mostly coded by 2-fold degenerate codons (we restrict "polar amino acids" to refer to the eight strongly polar amino acids only, i.e., Arg, Asn, Asp, Glu, Gln, Ser, Lys, and His). The negative correlation between the window-specific L4w and the number of polar amino acids (Npw) is also statistically highly significant (Fig. 2; Pearson r = -0.5946, P < 10^-5).

Fig. 1. Kaw/Ks increases with decreasing L4w. The Kaw/Ks curve is identical to that in Fig. 2a in ref. 8 for MRGX1 vs MRGX4. The number of 4-fold degenerate sites (L4w) is superimposed for easy comparison. Shaded areas mark the extracellular domains. The analysis is performed with DAMBE.

The negative association between LAw and Kaw in the extracellular domains, which implies that nonsynonymous substitutions at the extracellular domains mainly occur at 2-fold degenerate sites, points to a reason for the higher Kaw in the extracellular domains. The nonsynonymous substitutions at the 2-fold degenerate sites involve amino acids that are biochemically more similar to each other than those at the 0-fold degenerate positions. This can be seen by considering the extent of amino acid dissimilarity, which can be measured by Grantham's 26 or Miyata's biochemical distance 27. Grantham's distance is based on the chemical composition of the side chain, the volume and the polarity of the amino acid residues, whereas Miyata's distance is based on the volume and polarity only. It is well established that amino acid pairs with a small Grantham's or Miyata's distance replace each other more often than those with a large Grantham's or Miyata's distance 28. With this we address the question of whether amino acid substitutions at the 2-fold degenerate sites have smaller Grantham's or Miyata's


distance. Among the 196 possible codon substitutions involving a single nucleotide change for the universal genetic code 28, 58 are transversions at the 2nd codon position (i.e., 0-fold degenerate sites), with the average Grantham's 26 distance between the two involved amino acids equal to 102.48. In contrast, the 24 nonsynonymous transversions at the third codon position and the 56 nonsynonymous transversions at the first codon position (i.e., 2-fold degenerate sites) have average Grantham's distances of only 67.67 and 69.27, respectively. A similar trend is observed with Miyata's distance 27. These Grantham dissimilarity values are close to those reported for interspecific variation in many different proteins 28-30.

Fig. 2. The increased number of polar amino acids (Npw) in the extracellular domains results in low L4w values (a), leading to a significant negative correlation between L4w and Npw (b). Based on comparisons between MRGX1 and MRGX4. Only strongly polar amino acids (Arg, Asn, Asp, Glu, Gln, Ser, Lys, His) were included. The shaded areas mark extracellular domains.

The observations mentioned above suggest that an increase in the number of 2-fold degenerate positions in a sliding window increases the opportunity for nonsynonymous substitutions involving more similar amino acids, which would be subjected to less intense purifying selection and would yield elevated fixation rates of nonsynonymous mutations. This possibility is supported by the fact that some of the polar amino acids (present in high frequency in the extracellular domains) have high substitution rates. For example, serine is known to be the fastest-evolving amino acid in the PAM and JTT substitution matrices 31, 32.

In the windows with the highest Kaw/Ks peak in Fig. 1 (corresponding to the second shaded extracellular domain), six of the eight serine residues are involved in nonsynonymous substitutions. The MRGX1 and MRGX4 sequences code for 194 strongly polar amino acids, with 50 (25.8%) involved in nonsynonymous substitutions, which is in contrast to the 450 non-polar or weakly-polar amino acids with only 86 (19.1%) involved in the nonsynonymous substitutions. In short, the high Kaw/Ks peaks associated with the extracellular domains in the MRG gene family may be attributed, at least partially, to the biochemical constraint that extracellular domains need to have a high frequency of polar amino acids, i.e., it may not be necessary to invoke positive selection.

3. SIMULATIONS CONFIRMING THE ASSOCIATION BETWEEN HIGHER Kaw/Ks RATIO AND INCREASED FREQUENCY OF POLAR AMINO ACIDS

While the above-mentioned properties of extracellular domains explain the elevation of Kaw, they do not explain why the Kaw/Ks ratio is greater than 1 for some peaks. In order to investigate how this can happen, we examined the effects of an overabundance of codons coding for polar amino acids (hereafter referred to as PAA-coding codons) on the estimation of Ks and Ka values. We simulated the evolution of protein-coding genes with codon frequencies derived from the MRGX1 and MRGX4 sequences by using the Evolver program in PAML (abacus.gene.ucl.ac.uk/software/paml.html).

We set the transition/transversion ratio (kappa) = 2, sequence length = 90, branch length = 1.5 nucleotide substitutions per codon, and omega = 1 (i.e., no differential selection against synonymous and nonsynonymous substitutions). We performed two types of simulations, designated MorePolarAA and FewerPolarAA, that differ only in the frequencies of codons coding for the polar amino acids, as follows. First, the codon frequencies for the MRGX1 and MRGX4 sequences used in Choi and Lahn 8 were obtained.


Second, designating P_PAA,i_obs as the observed frequency of the i-th PAA-coding codon in the two sequences, the P_PAA,i value equals (10/11)*P_PAA,i_obs in the MorePolarAA simulation and (1/11)*P_PAA,i_obs in the FewerPolarAA simulation. Thus, the PAA-coding codons are 10-fold more frequent in the MorePolarAA simulation than in the FewerPolarAA simulation. A 10-fold difference such as this is not drastic because, for the window-specific codon frequencies, the extreme values for the frequencies of PAA-coding codons are 3.3% and 63.3% (a nearly 20-fold difference) in the MRG genes 8. The codon frequencies for non-PAA-coding codons are the same for the two types of simulations.

Each simulation was repeated 150 times, and the Ka, Ks and Ka/Ks ratios were calculated. The use of the PBL method on these simulated data produced a mean Ka/Ks of 1.22 for the MorePolarAA simulation and 0.79 for the FewerPolarAA simulation (t = 5.1372, df = 298, P < 0.0001). Thus, the Kaw/Ks peaks may be caused at least partially by the presence of high frequencies of codons coding for polar amino acids in the extracellular domains. Therefore, we conclude that the heterogeneous distribution of polar amino acids along protein sequences, together with the problem of estimating Ka/Ks for short sequences, may generate spurious peaks and valleys in Ka/Ks profiles that are not indicative of positive selection.

The association between the extracellular domains of the MRG gene family and the high Kaw/Ks peaks 8 may be interpreted in two ways. First, it is possible that these domains are under positive selection; but it is also possible that these domains carry high frequencies of polar amino acids because of the hydrophilic necessity of being extracellular. In the second case, the high Kaw/Ks peaks may simply arise because of the higher intrinsic mutability of polar amino acids. Unless we can exclude the second possibility, it is prudent to refrain from interpreting the existence of high Kaw/Ks peaks as evidence in favor of positive selection.

4. DISCUSSION

How robust are our conclusions drawn from the analysis of the MRG genes using the PBL method? In particular, do other methods suffer from the same problem as the PBL method? Our answer is positive, because the high Kaw/Ks peaks for the extracellular domains of the MRG genes are also recovered when other statistical methods are used (Fig. 3). Therefore, the potentially erroneous interpretation that extracellular domains (which contain an overabundance of polar amino acids) are subject to positive selection will be made using many different existing methods, as they all suffer from similar biases caused by the heterogeneous distribution of polar amino acids along protein sequences.

Fig. 3. The window-specific Kaw/Ks values between the MRGX1 and MRGX4 sequences, estimated by four different methods: YN00 13, Codeml 34, PBL 11, 12 and modified Nei-Gojobori 10.

The association between statistically detected positive selection and polar amino acids is not restricted to the MRG gene family. This is evident from the results of our examination of data from a recent study in which positive selection was inferred in protein-coding genes from HIV1 genomes 33. Amino acid sites statistically inferred to be under positive selection tend to be occupied by polar amino acids. In particular, amino acid sites inferred with a greater posterior probability have a greater chance of being occupied by polar amino acids. For example, polar amino acids account for 41.88% of all amino acids coded in the reference HIV1 sequence HXB2 33, but 49.55% of all amino acids at the positively selected sites detected with posterior probability P > 0.90, and 59.52% of all amino acids at the positively selected sites detected with posterior probability P > 0.95 (data from Table 3 in ref. 33).


The pattern is even stronger for the env gene, which harbours the overwhelming majority of the statistically detected positively selected sites. Polar amino acids account for 40.54% of all amino acids coded in this gene, but 52.46% of all amino acids at the positively selected sites detected with posterior probability P > 0.90, and 68.97% of all amino acids at the positively selected sites detected with posterior probability P > 0.95 (data from Table 3 in ref. 33). These results exemplify well the association between an increased Ka and a high frequency of polar amino acids.

The result from the HIV protein-coding genes is particularly noteworthy because the method used to detect positive selection is not the sliding-window approach but a more recently developed site-specific approach 35. We may therefore conclude that positively selected sites detected by current statistical methods should be interpreted cautiously. In particular, we suggest that statistically detected "positively selected sites" be qualified with the word "putative".

Acknowledgments

This work was supported in part by a Discovery Grant, a Strategic Grant and an RTI Grant from the Natural Sciences and Engineering Research Council of Canada to X. Xia and an NIH grant to S. Kumar. We thank Masatoshi Nei and S. Aris-Brosou for helpful comments on previous versions. Two anonymous reviewers provided helpful comments and suggestions that reduced the ambiguity of the paper and improved the generality of our conclusions.

References

1. Hughes AL, Nei M: Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 1988,335:167-170.

2. Hughes AL, Ota T, Nei M: Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex molecules. Mol Biol Evol 1990, 7:515-524.

3. Li W-H: Molecular evolution. Sunderland, Massachusetts: Sinauer; 1997.

4. Nei M, Kumar S: Molecular evolution and phylogenetics. New York: Oxford University Press; 2000.

5. Yang Z, Bielawski JP: Statistical methods for detecting molecular adaptation. Trends In Ecology And Evolution 2000,15:496-503.

6. Thornton K, Long M: Excess of Amino Acid Substitutions Relative to Polymorphism between X-linked Duplications in Drosophila melanogaster. Mol Biol Evol 2004,13:13.

7. Skibinski DO, Ward RD: Average allozyme heterozygosity in vertebrates correlates with Ka/Ks measured in the human-mouse lineage. Mol Biol Evol 2004,21:1753-1759.

8. Choi SS, Lahn BT: Adaptive evolution of MRG, a neuron-specific gene family implicated in nociception. Genome Res 2003,13:2252-2259.

9. Wang HY, Tang H, Shen CK, Wu CI: Rapidly evolving genes in human. I. The glycophorins and their possible role in evading malaria parasites. Mol Biol Evol 2003, 20:1795-1804.

10. Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986, 3:418-426.

11. Li WH: Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 1993,36:96-99.

12. Pamilo P, Bianchi NO: Evolution of the ZFX and ZFY genes: Rates and interdependence between the genes. Mol Biol Evol 1993,10:271-281.

13. Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 2000,17:32-43.

14. Suzuki Y, Gojobori T: A method for detecting positive selection at single amino acid sites. Mol Biol Evol 1999,16:1315-1328.

15. Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 1994,11:725-736.

16. Muse SV, Gaut BS: A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 1994,11:715-724.

17. Aris-Brosou S: Determinants of adaptive evolution at the molecular level: the extended complexity hypothesis. Mol Biol Evol 2005, 22:200-209.

18. Bierne N, Eyre-Walker A: The genomic rate of adaptive amino acid substitution in Drosophila. Mol Biol Evol 2004, 21:1350-1360.

19. Suzuki Y, Nei M: Reliabilities of parsimony-based and likelihood-based methods for detecting positive selection at single amino acid sites. Mol Biol Evol 2001, 18:2179-2185.

20. Suzuki Y, Nei M: Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol Biol Evol 2002, 19:1865-1869.

21. Suzuki Y, Nei M: False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol Biol Evol 2004, 21:914-921.

22. Wong WS, Yang Z, Goldman N, Nielsen R: Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 2004, 168:1041-1051.

23. Hughes AL, Friedman R: Variation in the pattern of synonymous and nonsynonymous difference between two fungal genomes. Mol Biol Evol 2005.

24. Xia X, Xie Z: DAMBE: Software package for data analysis in molecular biology and evolution. J Hered 2001, 92:371-373.

25. Xia X: Data analysis in molecular biology and evolution. Boston: Kluwer Academic Publishers; 2001.

26. Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185:862-864.

27. Miyata T, Miyazawa S, Yasunaga T: Two types of amino acid substitutions in protein evolution. J Mol Evol 1979, 12:219-236.

28. Xia X, Li WH: What amino acid properties affect protein evolution? J Mol Evol 1998, 47:557-564.

29. Briscoe AD, Gaur C, Kumar S: The spectrum of human rhodopsin disease mutations through the lens of interspecific variation. Gene 2004, 332:107-118.

30. Miller MP, Kumar S: Understanding human disease mutations through the use of interspecific genetic variation. Hum Mol Genet 2001, 10:2319-2328.

31. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In: Dayhoff MO (ed.) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington D.C. 1978: 345-352.

32. Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 1992, 8:275-282.

33. Yang W, Bielawski JP, Yang Z: Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. J Mol Evol 2003, 57:212-221.

34. Yang Z: Phylogenetic analysis by maximum likelihood (PAML). Version 3.12. London: University College; 2002.

35. Yang Z, Swanson WJ, Vacquier VD: Maximum-likelihood analysis of molecular adaptation in abalone sperm lysin reveals variable selective pressures among lineages and sites. Mol Biol Evol 2000, 17:1446-1455.


BAYESIAN DATA INTEGRATION: A FUNCTIONAL PERSPECTIVE

Curtis Huttenhower and Olga G. Troyanskaya

Department of Computer Science, Lewis-Sigler Institute for Integrative Genomics, Princeton University

Princeton, NJ 08544, USA

Email: [email protected]

Accurate prediction of protein function and interactions from diverse genomic data is a key problem in systems biology. Heterogeneous data integration remains a challenge, particularly due to noisy data sources, diversity of coverage, and functional biases. It is thus important to understand the behavior and robustness of data integration methods in the context of various biological functions. We focus on the ability of Bayesian networks to predict functional relationships between proteins under a variety of conditions. This study considers the effect of network structure and compares expert estimated conditional probabilities with those learned using a generative method (expectation maximization) and a discriminative method (extended logistic regression). We consider the contributions of individual data sources and interpret these results both globally and in the context of specific biological processes. We find that it is critical to consider variation across biological functions; even when global performance is strong, some categories are consistently predicted well, and others are difficult to analyze. All learned models outperform the equivalent expert estimated models, although this effect diminishes as the amount of available data decreases. These learning techniques are not specific to Bayesian networks, and thus our conclusions should generalize to other methods for data integration. Overall, Bayesian learning provides a consistent benefit in data integration, but its performance and the impact of heterogeneous data sources must be interpreted from the perspective of individual functional categories.

1. INTRODUCTION

As more sources of high-throughput biological data have become available, many efforts have been made to automatically integrate heterogeneous data types for the prediction of protein function and interactions1-3. Several of these systems have focused solely on data representation and presentation as a means of allowing efficient storage, retrieval, and manipulation by domain experts4-6. We look instead at the process of fully automated prediction of genetic interactions and functional linkages, which has to date been addressed by four primary methods: decision trees7,8, support vector machines9, graph-based methods10-13, and Bayesian networks14,15.

When applying any of these techniques to the problem of data integration, it is important to consider the effects of the method's parameters and assumptions on the resulting biological predictions. If two genes are predicted to be interacting or functionally related, are they necessarily related under all conditions, or do they interact only under specific circumstances or within one or two narrowly defined processes? Similarly, any set of heterogeneous experimental data to be integrated may easily possess significant differences in reliability, magnitude, and coverage of various biological functions. Intuitively, one might expect correlations in microarray expression values to indicate a different, less reliable relationship between proteins than direct binding in an immunoprecipitation experiment. These differences could also be more pronounced in specific functional categories; for example, regulatory mechanisms such as phosphorylation signaling pathways will not be visible in microarray experiments.

Thus, it is important to examine the behavior of data integration methods from the perspective of diverse biological functions. We focus our analysis on Bayesian networks, considering both generative and discriminative learning frameworks. Bayesian networks provide an interpretable framework for examining machine learning across a variety of biological functions, experimental data types, and network parameters. We thus investigate the characteristics of Bayesian data integration by breaking performance down with respect to specific biological processes drawn from the Gene Ontology16. We further decompose the network's behavior by examining its dependence on each of its heterogeneous data sources, and we examine the effect of network structure by comparing a multilayer network structure to a single-layer naive Bayesian classifier. Finally, both of these network structures can be parameterized with expert estimates, with probabilities learned generatively



through expectation maximization (EM)17, or with a discriminative model learned using extended logistic regression (ELR)18.

Varying these parameters allows us to evaluate the predictive power of expert estimation, generative learning, and discriminative learning for each configuration of the Bayesian model. This results in a detailed comparison of functional predictions for individual ontology terms, network parameters, and data sources. All of these facets of functional prediction are nonspecific to Bayesian networks, allowing our conclusions to generalize to other heterogeneous data integration techniques.

2. RESULTS

We evaluated per-function Bayesian network performance over four variables: overall network structure, experimental data types, conditional probability sources (expert estimated, generatively learned, or discriminatively learned), and stability of learned parameters over varying initial conditions. The system described in Troyanskaya et al15 acted as a basis, providing a predefined multilevel network structure and fixed conditional probabilities estimated by a consensus of domain experts. We integrated a variety of data

Fig. 1. Areas under sensitivity/specificity curves over all data for the six primary network configurations: three parameter estimation methods (expert estimation in blue, generative EM in beige, and discriminative ELR in red) and two network structures (naive and full). Evaluations are over a five-fold cross validation. Both generative and discriminative learning show a substantial improvement over expert estimation, particularly in the full network.

sources heterogeneous both in their experimental origin (e.g. physical binding versus pathway membership) and in their computational behavior (e.g. discrete versus continuous data).

In all cases, evaluation was performed against a gold standard of functional relationships derived from S. cerevisiae GO annotations (experiments with the MIPS functional hierarchy19 generated similar results). Our overall results appear in Figure 1, which shows that both generative and discriminative learning improve upon expert estimated probabilities (particularly in the full network). However, a global evaluation such as this cannot reveal how well each model predicts interactions within specific biological processes. For example, will these predictions be helpful to a biologist interested in DNA replication, or is the learned performance due to improvements in other functional areas?

To address questions such as this, we examined individual areas under receiver operating characteristic (ROC) curves (AUCs) for each term in a subset of the Gene Ontology (see Methods). This made it possible to monitor performance within individual functional categories as network parameters varied. Figure 2 displays the results, and it is important to note that performance varies far more across functional categories than it does across network parameters, learning techniques, or data sets. This means that for any aggregate, cross-functional evaluation to remain biologically relevant, it is necessary to keep in mind that it may represent average behavior based on strong performance in only a few functional areas. For example, without functional analysis such as that in Figure 2, it would be difficult to determine that even the most accurate predictions included in Figure 1 are often inapplicable to RNA processing terms (purple cluster, Figure 2). Conversely, we might be more inclined to trust predictions for uncharacterized genes paired with known genes annotated to metabolic terms (red cluster, Figure 2).

Interestingly, network structure proved to have little effect on learned networks, while it greatly impacted the performance of expert estimated parameters. Experiments were performed on two network structures, a slight modification of that proposed in Troyanskaya et al15 and a naive Bayesian simplification of this model (Figure 5). Hidden nodes


(Figure 2 appears here as a heat map. Its rows are labeled with the individual Gene Ontology biological process terms evaluated, ranging from RNA modification and RNA 3' end processing through transcription, transport, and metabolism terms to autophagy and vitamin metabolism; its columns correspond to network configurations.)

Fig. 2. A heat map of pairwise functional relationship prediction within individual Gene Ontology processes. Yellow indicates an AUC above random, blue below, and black exactly random (AUC = 0.5). Each column represents a network configuration (a combination of structure, parameter source, and data set presence), and each row represents a biological function. ELR networks perform similarly to their EM counterparts and have been omitted for clarity. Grey cells indicate network configurations for which fewer than ten gene pairs were available for evaluating a functional category. Marked clusters indicate terms that are consistently poorly predicted (purple), predicted well (green), and predicted well only by learned networks (red).



Fig. 3. A) A comparison of functional predictions with the naive network structure using EM and expert parameter estimation, removing each data set in turn. Networks with complete input were trained and evaluated using all available data sets; each other network had either one data set (cellular component (CC), coimmunoprecipitation (CoIP), transcription factor binding (TF), synthetic lethality (SynL), two-hybrid (TH), or microarrays (MA)) removed or all small data sets (biochemical association, dosage lethality, purified complexes, reconstructed complexes, and synthetic rescue) removed as a single unit. All evaluation was performed using only gene pairs with at least two data types available so as to allow evaluation with any one data set removed. AUCs are averages across five-fold cross validation. B) A comparison (as in part A) using the full network structure. Expert estimated parameters produce markedly worse performance with the full structure relative to the naive structure, and in both cases and across all data sets they are less accurate than learned parameters.

provide a way of relating similar data types and taking advantage of additional network parameters; a naive Bayesian assumption limits both the complexity and the representational power of the network17. The learned networks gained little from the additional parameters available in the full network, and its complexity hampered the predictive power of the expert estimated parameters (Figure 1).

Given these network configurations, it is of interest to see how much information is contributed by different data sources. The number of pairs in the data sources varied from a few dozen to several million, and particularly in light of the potential sensitivity of learning algorithms to their training data, one would hope that performance degrades gracefully as training examples are removed. Figure 3 contains performance results for both network structures using expert estimated and learned network parameters. ELR and EM learning performed essentially equivalently; they were largely unaffected by network structure, and their performance dropped off only with the removal of the largest or most informative data sets. The expert estimated probabilities proved to be much less effective when using the full network structure, but they were affected only minimally by the removal of particular data sets.

We next focused specifically on the robustness of the network's predictions in the face of variations in

input data and learning characteristics. In the case of learned networks, the choice of initial probability values could conceivably influence the point to which the network converged after learning. To ensure that this was not the case, Figure 4A demonstrates the performance of the two network structures using randomized probability tables as initial parameters. Variation is small for both learning methods and network structures, justifying our use of expert estimated probabilities for initialization.

Similarly, both learned and expert estimated networks could be susceptible to fluctuations in individual probability tables or their corresponding input data sets. To investigate this possibility, we randomized the conditional probability tables for each of four nodes in the full learned network: the root "Functional Relationship" node, "Microarray Correlation," the hidden "Genetic Association" node, or the "Yeast Two-Hybrid" leaf node (see Methods). This resulted in the relative performances seen in Figure 4B. Performance degrades gracefully and roughly in proportion to the data set effects seen in Figure 3. Randomizing the "Functional Relationship" prior will not change the relative order of predictions and, as expected, leaves the performance/recall curve largely unchanged. Randomization of "Yeast Two-Hybrid" or "Microarray Correlation" has roughly the same impact as removing the associated data sets in Figure 3.
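The randomization itself is simple. As a minimal sketch (assuming a conditional probability table stored as a NumPy array whose last axis enumerates the child node's states; this is illustrative, not the SMILE interface used in this study):

```python
import numpy as np

def randomize_cpt(cpt, rng=None):
    """Replace a conditional probability table with random distributions.

    cpt -- array whose last axis enumerates the child's states; each
    slice along that axis is renormalized so it sums to 1.
    """
    rng = rng or np.random.default_rng()
    rand = rng.random(cpt.shape)
    return rand / rand.sum(axis=-1, keepdims=True)
```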



Fig. 4. A) Variation in convergence of networks learned from randomly initialized probability tables. Five randomizations were performed to generate the given means and standard deviations; variation is zero for the naive EM network since expectation maximization reduces to deterministic maximum likelihood in this case. Random initialization has little impact on learned network performance. B) A comparison of the full Bayesian network learned using EM to four versions with randomized parameters for the "Functional Relationship", "Microarray Correlation", "Genetic Association", and "Yeast Two-Hybrid" nodes. For each randomized network, one conditional probability table was set to random values after learning and before evaluation. Results using ELR learning are similar (data not shown). Recall has been scaled to emphasize the high precision area of biological interest, and performance is shown using the log-likelihood score $LLS = \log_2\left(\frac{TP \cdot N}{FP \cdot P}\right)$ for $P$ total positive pairs, $N$ total negative pairs, and $TP$ and $FP$ the number of true and false positives at a particular sensitivity threshold.

However, the randomization of the hidden "Genetic Association" node had little impact, again indicating the minimal benefit of the full network over a naive Bayesian structure. Thus, the network is fairly resistant to errors in input data or in the corresponding probability tables.

3. DISCUSSION

We investigated the behavior of Bayesian data integration while varying a set of parameters relevant to both the computational and biological aspects of the task. Using two network structures (naive and full), three parameter estimation methods (expert estimation, generative EM learning, and discriminative ELR learning), and several data sets, we demonstrate that learning consistently improves the predictive power of the network. More importantly, decomposing this performance into individual functional categories shows that the improvement afforded by learning varies by function, as does the network's general accuracy.

3.1. Per-function behavior

As mentioned above, it is critical that we consider performance results in the context of individual biological functions. While it is unsurprising that particular data types would have functional biases, the fact that even highly varied and heterogeneous data

sources provide little predictive power under some circumstances is rarely taken into account. At least two conclusions can be drawn from the functional analysis in Figure 2. In many cases, underrepresented functional categories are areas in which good high-throughput data is certainly available; one might expect, for example, that yeast two-hybrid experiments would provide information on protein complex assembly20. For functional categories such as this, Figure 2 informs us that such signals can be rapidly lost in noise from experimental conditions where otherwise related genes do not function in tandem. Such categories may be better predicted directly by individually trained classifiers such as support vector machines. For other functions, though, data may not be available at all; autophagy might indicate an area in which further laboratory experimentation would substantially improve prediction performance.

There are several of these functional categories for which the best performance remains close to random given any data or network parameters. Most of these are regulatory functions (regulation of biosynthesis, regulation of protein metabolism, regulation of transcription, etc.) with several nucleic acid processing terms interspersed (RNA splicing, mRNA metabolism, DNA packaging, etc.). These terms have no strong size bias, with numbers of annotated genes ranging from tens to several hundred. Aside from issues introduced by


data sparsity, it is possible that these larger terms represent more tenuous and less easily detectable functional relationships and are thus more difficult to predict. This bias may also be due to the sparsity of current high-throughput data in certain functional areas; for example, post-transcriptional modification cannot be directly detected by any of the data sets included in this study. However, even the most unreliably predicted functions remain well above random performance, with only two terms having AUCs below 0.6 in the naive EM network.

A functional perspective on performance also allows us to discover terms that are in some sense easier or harder to predict. For example, Figure 2 contains a metabolism cluster (red) that is only predicted well by learned networks; expert networks using either structure produce near or below random performance. Conversely, several groups of functional terms appear to be easy to predict or, equivalently, difficult to improve upon past a certain baseline. A cluster of processes (green) including transcription from RNA polymerase I, translational elongation, rRNA metabolism, and cytoplasm organization all fall into this category. They are predicted with reasonable accuracy by every network configuration, and learning improves their accuracies only minimally. Gene pairs in these terms tend to be supported by multiple data types (many of the most confident pairs are supported by microarray correlation, coimmunoprecipitation, and/or cellular component) and by high microarray correlations. When sufficient data is available, it is unsurprising that results become less dependent upon particular integration techniques.

3.2. Diverse data sources

As shown in Figure 3, we found that several factors influence the relationship between experimental data types and prediction performance. Most data sets (especially small ones containing fewer than 5000 gene pairs) have a negligible impact on the learned networks, and the performance of the naive expert networks remains unchanged even after the removal of some larger data sets. Removing microarrays (by far the largest data source) greatly reduces the accuracy of all four learned networks and the full expert network, but the naive expert network is only minimally affected in this case. This can be interpreted as the full expert network "trusting" the relatively noisy microarray data too much, while the naive network balances it more

fairly against other data types; given the large fraction of training data microarrays account for in the learned networks, it is unsurprising that their performance suffers as well.

We also observed that removing moderately sized data sets tends to have an appropriately moderate effect on the overall precision and recall, but more varied results can be seen in a comparison of functional categories (Figure 2). Losses of coimmunoprecipitation, two-hybrid, or synthetic lethality data degrade the naive expert network's performance equally over many functional categories. Removing cellular component data actually improves both the naive and full expert networks, the latter substantially. This improvement is most visible in the group of terms for which learning is particularly beneficial (cell division, DNA recombination, biopolymer biosynthesis, lipid metabolism, and so forth). An inspection of the learned conditional probabilities in any of the network variants reveals that a positive cellular component signal decreases the posterior probability of functional relationship, which is likely indicative of the different focuses of the component and process ontologies within GO.

3.3. Parameter estimation

In general, using expectation maximization or logistic regression to learn conditional probability tables for functional prediction has several clear benefits over expert estimation. In terms of overall prediction accuracy, both precision and recall are significantly enhanced in learned networks, particularly for the full network structure. Predictions are also made much more continuously; expert-populated network predictions tend to cluster in a few tight groups (data not shown). This effectively limits the usefulness of these Bayesian networks as continuous probability estimators and restricts them to a few discrete quanta, a problem not encountered in learned networks.

When comparing ELR to EM learning, Figure 3 indicates that ELR is generally more sensitive to the removal of training data than EM. Particularly in the naive case, when expectation maximization reduces to a simple maximum likelihood estimate, ELR unsurprisingly requires more training data and processing power. The benefits of ELR for our task are seen mainly in its increased consistency, especially in the high precision/low recall area of biological interest.


Figure 4 demonstrates this best, with ELR showing a lower standard deviation over random convergences and producing slightly better results at low recall. This behavior comes at a cost of interpretability, though, since the network parameters learned during ELR are no longer directly interpretable as reliabilities of individual data sets.

It is interesting to note that Bayesian data integration in general is fairly robust to errors in the conditional probability tables. Neither the suboptimal naive expert estimates nor the "worst-case" errors introduced by randomizing the tables (Figure 4B) reduce performance significantly. These randomizations indicate that performance degrades gracefully; in particular, errors in a data set's probability table are generally less harmful than complete removal of the data set (Figure 3). As was already indicated by the similarity between full and naive network performances, modifying hidden nodes has little impact on learned prediction accuracy. However, the full network parameters are clearly more difficult for experts to estimate, a known property of Bayesian probabilities17.

When examining performance over individual functional categories in Figure 2, the AUCs for almost all processes are either improved or left unchanged by EM or ELR relative to naive expert probability estimates. In particular, there are specific functional categories for which high-throughput data appears to perform particularly well. These include mainly metabolism terms (amino acid and derivative, alcohol, amine, organic acid, carbohydrate, etc.), many of which are related to a general stress response and are thus represented well in microarray data. A number of other terms are improved less dramatically, consisting mainly of nuclear transport and nucleotide processing categories. From the data set removal results, this appears to be a general improvement not due to any specific data type. The only term significantly damaged by expectation maximization is RNA modification, which is slightly enriched relative to the prior for related pairs with a shared cellular component.

4. CONCLUSION

The process of collecting, analyzing, and integrating high-throughput biological data has always required a balance between automation and expert knowledge. Bayesian learning provides a natural way to incorporate prior knowledge in the context of formal probabilistic

methodology, giving domain experts ample opportunity to intervene in, manipulate, and visualize results, while leaving the work of relationship discovery to computational methods. Such tools can take advantage of biological accessibility while simultaneously scaling up to larger, heterogeneous data sets and providing fine-grained information regarding individual gene pairs and specific functional categories. This applies not only to Bayesian networks, but to any sufficiently sensitive, flexible, and accessible machine learning technique.

Moreover, regardless of the machine learning technique being examined, it is necessary to evaluate the accuracy of functional predictions in the context of individual biological areas. Overall performance improvements may arise from gains in only a few functional categories (such as microarray data's strong ability to predict ribosomal functions). If a computational method is to be used to steer the direction of future laboratory experiments, care must be taken to ensure that it performs adequately in the biological areas relevant to those experiments. Similarly, if predictions are to be made across the entire genome, it is important not to exclude functional terms due to signal loss or lack of data.

In this study, we found that Bayesian learning can be a robust method for prediction of functional relationships from heterogeneous data, but care must be taken in selecting an appropriate training method. While expert estimation provided good overall results, machine learning was able to improve both precision and recall over a wide variety of functional categories, particularly those to which high-throughput data tends to be sensitive. Both the generative expectation maximization and discriminative ELR methods consistently surpassed the expert models' predictions, particularly for the full network structure modeling hidden relationships between experimental data types. As the field expands, it is vital to adapt learning methods such as these to new high-throughput data sources, and it is equally vital to produce high-throughput data with sufficient coverage and functional diversity to realize the potential of computational methods.

It is clearly necessary to examine performance within specific functional categories to reveal many of these differences, and such evaluation is an important aspect of any functional predictor. Individual data sources come with functional biases, and integrating


Fig. 5. A) Full network structure. Hidden nodes are shown in dark gray, data set nodes in light gray, and the output node in white. B) Naive network structure; shading is as in part A.

them without proper care can exacerbate these deficiencies; adding more data will almost always improve overall genomic performance, but this may come at the cost of drowning out specific functional categories. At the most difficult end of the scale, there still exist biological areas insufficiently covered by all high-throughput data, even in an organism as well-studied as S. cerevisiae. It is in areas such as these that a carefully managed integration of computational and biological knowledge can yield the most substantial returns.

5. METHODS

5.1. Bayesian network algorithms

Bayesian network inference was performed using the Lauritzen algorithm21 for inference and expectation maximization17 or extended logistic regression18 for parameter learning. Expectation maximization was, in all cases, run for five iterations; ELR ran for 1000 iterations. Some additional experiments with continuous Bayesian networks were run using the junction tree inference algorithm22. Their performance was found to be generally below that of the discrete networks (data not shown), which is perhaps unsurprising given the unsuitability of the linear Gaussian assumption. The

University of Pittsburgh Decision Systems Laboratory's SMILE library and GeNIe modeling environment23

were used for manipulation of discrete networks, and the Intel PNL library24 was used with continuous networks.
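To make the integration concrete, here is a minimal sketch of how a naive Bayesian classifier of this kind combines discrete evidence into a posterior probability of functional relationship. It is an illustration under simplified assumptions rather than the SMILE-based implementation used in this study, and the data source names and probability values are hypothetical.

```python
def naive_bayes_posterior(prior, cpts, evidence):
    """Posterior P(FR=1 | evidence) for a naive Bayes integration model.

    prior    -- P(FR=1), the prior probability of a functional relationship
    cpts     -- {source: (P(obs=1 | FR=1), P(obs=1 | FR=0))}
    evidence -- {source: 0 or 1}; sources absent from the dict are ignored
    """
    odds = prior / (1.0 - prior)
    for source, obs in evidence.items():
        p1, p0 = cpts[source]
        like1 = p1 if obs else 1.0 - p1  # P(obs | FR=1)
        like0 = p0 if obs else 1.0 - p0  # P(obs | FR=0)
        odds *= like1 / like0
    return odds / (1.0 + odds)

# Hypothetical parameters for three binary data sources.
cpts = {"coip": (0.30, 0.05), "two_hybrid": (0.20, 0.08), "tf_binding": (0.10, 0.06)}
print(naive_bayes_posterior(0.06, cpts, {"coip": 1, "two_hybrid": 0}))
```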

5.2. Bayesian network implementation

The full and naive network structures are shown in Figure 5. The former was constructed by simplifying the Troyanskaya et al15 network's complex microarray inputs to a single correlation node and removing the trivially small "Unlinked Noncomplementation" node. Preliminary experiments showed that this made a negligible difference in performance relative to the original expert network (data not shown). The latter was constructed by removing each hidden node (those representing neither inputs to nor outputs from the predictor) while maintaining the expert estimated conditional probability tables of the remaining nodes.

Of the heterogeneous data sources, most represent positive binary genetic interactions in which a "true" result indicates that two genes interact and a "false" result indicates that they do not interact or that no data is present. Biochemical assays, coimmunoprecipitation, synthetic and dosage interactions, protein complexes (all drawn from the GRID25 and BIND26 databases), and transcription factor modules27 all fall into this category,


as does the cellular component data (from the Gene Ontology). This allowed each of these data types to be presented to the Bayesian network as a single boolean variable per gene pair. Microarray coexpression data were collected from a variety of sources28-38.

5.3. Data preparation

Each of the data sources was represented as a binary input with a true value indicating cooccurrence of a gene pair in the data set. For each data set, missing gene pairs were represented by false values. The microarray data described above were preprocessed by concatenating the approximately 350 conditions into a single expression vector for each gene. Genes with more than 70% missing data were removed, after which any remaining missing values were imputed using KNNImpute39 with k=10. Pairwise relationships were calculated by computing the centered Pearson correlations for all gene pairs within individual data sets and normalizing these to the range [0, 1]. The overall correlation was then taken to be the average of these values and subsequently quantized into five bins representing values less than 0.5, 0.5-0.75, 0.75-0.8, 0.8-0.9, and greater than 0.9 for input into the Bayesian networks.
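As a sketch of this preprocessing, the following assumes each microarray data set is a dict mapping gene names to NumPy expression vectors; mapping the centered Pearson correlation from [-1, 1] to [0, 1] via (r + 1)/2 is our assumption about the intended normalization, while the bin edges are taken from the text.

```python
import numpy as np

def quantized_coexpression(datasets, g1, g2):
    """Average normalized correlation of two genes across data sets,
    quantized into the five discrete states used as network input."""
    corrs = []
    for expr in datasets:  # expr: {gene: 1-D expression vector}
        x = expr[g1] - expr[g1].mean()
        y = expr[g2] - expr[g2].mean()
        r = float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
        corrs.append((r + 1.0) / 2.0)  # assumed normalization to [0, 1]
    avg = sum(corrs) / len(corrs)
    # bin edges from the text: <0.5, 0.5-0.75, 0.75-0.8, 0.8-0.9, >0.9
    return sum(avg >= edge for edge in (0.5, 0.75, 0.8, 0.9))  # state 0-4
```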

For robustness testing, network terms were chosen for randomization in such a way as to cover a variety of data set types and sizes. Randomizing the "Functional Relationship" node demonstrates the evaluation's dependence only on gene pair rank (and not on exact probability estimation), and "Genetic Association" shows the relatively small benefit provided by the full network's hidden nodes. "Microarray Correlation" provided a large, continuous data set, and "Two-Hybrid" represented one that was smaller and discretized.

5.4. Gold standard generation

Gene Ontology terms representing positive functional relationships were selected at a 5% gene count cutoff, corresponding to GO biological process terms to which at most 321 of the 6438 S. cerevisiae genes were annotated40. Any two genes coannotated to such a term or its descendants were considered to be functionally related. Similarly, terms to which at least 15% of the genome (965 genes) was annotated represented a negative threshold; any genes coannotated to such a term and not to any more specific term were considered

to be functionally unrelated. Gene pairs coannotated to intermediate terms were excluded from the gold standard and thus from the evaluation. This process resulted in a set of 720458 related gene pairs and 10566822 unrelated pairs. A 5% positive and 10% negative term cutoff tested with the MIPS hierarchy performed similarly (data not shown).
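The coannotation rule can be expressed compactly. This sketch assumes each gene's term set already includes ancestor terms, and that terms have been pre-classified against the 5% and 15% gene-count cutoffs; interpreting "not to any more specific term" as a condition on the shared terms is our assumption, and all identifiers are illustrative.

```python
def label_pair(terms1, terms2, small_terms, large_terms):
    """Gold-standard label for a gene pair.

    terms1, terms2 -- sets of GO terms annotating each gene (incl. ancestors)
    small_terms    -- terms annotating <= 5% of the genome (positive cutoff)
    large_terms    -- terms annotating >= 15% of the genome (negative cutoff)
    Returns 1 (related), 0 (unrelated), or None (excluded from evaluation).
    """
    shared = terms1 & terms2
    if shared & small_terms:
        return 1        # coannotated to a sufficiently specific term
    if shared and shared <= large_terms:
        return 0        # coannotated only to very general terms (assumption)
    return None         # intermediate specificity: excluded
```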

5.5. Testing and cross validation

For all networks, evaluation was performed as an average of 5-fold cross validation (approximately 953 genes per fold). Additional cross validation was performed by varying random seeds as shown in Figure 4A; each training and evaluation cycle, regardless of conditional probability table seeding, utilized five different gene sets.

Overall LLS/recall curves were generated from probabilities drawn from the topmost "Functional Relationship" network node after Bayesian inference, again averaged over 5-fold cross validation. To calculate per-functional-category performance, gene pairs were considered relevant to a category if both genes were annotated to the category (and thus related) or if they were unrelated and one gene was annotated to the category. Negative pairs in which neither gene was annotated to a functional term below the 5% cutoff were evaluated with every functional category. All AUCs were calculated analytically using the Wilcoxon rank sum formula41. The resulting data were converted into heat maps using the TIGR MeV42 software.
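Since the AUCs are computed analytically from the rank-sum statistic, a minimal sketch of that computation follows (with average ranks for ties; the function and variable names are our own):

```python
def auc_wilcoxon(pos_scores, neg_scores):
    """Area under the ROC curve via the Wilcoxon rank-sum formula:
    AUC = (R_pos - n_pos*(n_pos+1)/2) / (n_pos * n_neg), where R_pos is
    the rank sum of the positive scores in the pooled sorted sample."""
    pooled = sorted((s, lab) for lab, group in ((1, pos_scores), (0, neg_scores))
                    for s in group)
    rank_sum_pos, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1                              # group of tied scores
        avg_rank = (i + 1 + j) / 2.0            # average of ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(lab for _, lab in pooled[i:j])
        i = j
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```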

Acknowledgments

We thank everyone in the Troyanskaya lab for their help and insightful discussions. This research was partially supported by NIH grant R01 GM071966 and NSF grant IIS-0513552 to Olga G. Troyanskaya. Dr. Troyanskaya is an Alfred P. Sloan Research Fellow.

References

1. Detours, V., et al. Integration and cross-validation of high-throughput gene expression data: comparing heterogeneous data sets. FEBS Letters 2003. 546(1): 98-102.

2. Joyce, A.R. and B.O. Palsson. The model organism as a system: integrating 'omics' data sets. Nature Reviews: Molecular Cell Biology 2006. 7(3): 198-210.


3. Yu, J. and F. Fotouhi. Computational approaches for predicting protein-protein interactions: a survey. J. Medical Systems 2006. 30(1): 39-44.

4. Chapman, A., C. Yu, and H.V. Jagadish. Effective integration of protein data through better data modeling. Omics 2003. 7(1): 101-2.

5. Lacroix, Z. Biological data integration: wrapping data and tools. IEEE Transactions on Information Technology in Biomedicine 2002. 6(2): 123-8.

6. Venkatesh, T.V. and H.B. Harlow. Integromics: challenges in data integration. Genome Biology 2002. 3(8): REPORTS4027.

7. Clare, A. and R.D. King. Predicting gene function in Saccharomyces cerevisiae. Bioinformatics 2003. 19 Suppl 2: II42-II49.

8. Zhang, L.V., et al. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 2004. 5: 38.

9. Lanckriet, G.R., et al. Kernel-based data fusion and its application to protein function prediction in yeast. Pacific Symposium on Biocomputing 2004: 300-11.

10. Hwang, D., et al. A data integration methodology for systems biology. PNAS 2005. 102(48): 17296-301.

11. Karaoz, U., et al. Whole-genome annotation by using evidence integration in functional-linkage networks. PNAS 2004. 101(9): 2888-93.

12. Lee, I., et al. A probabilistic functional network of yeast genes. Science 2004. 306(5701): 1555-8.

13. Nabieva, E., et al. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 2005. 21 Suppl 1: i302-i310.

14. Lu, L.J., et al. Assessing the limits of genomic data integration for predicting protein networks. Genome Research 2005. 15(7): 945-53.

15. Troyanskaya, O.G., et al. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). PNAS 2003.100(14): 8348-53.

16. Ashburner, M., et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000. 25(1): 25-9.

17. Neapolitan, R. Learning Bayesian Networks. 2004, Chicago, Illinois: Prentice Hall.

18. Greiner, R., et al. Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers. Machine Learning Journal 2005. 59(3): 297-322.

19. Ruepp, A., et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research 2004. 32(18): 5539-45.

20. Ito, T., et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. PNAS 2001. 98(8): 4569-74.

21. Lauritzen, S. and D. Spiegelhalter. Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems. J. Royal Statistical Society 1988. 50(2).

22. Jensen, F. An Introduction to Bayesian Networks. 1996: Springer.

23. Druzdzel, M. SMILE: Structural Modeling, Inference, and Learning Engine and GeNIe: A development environment for graphical decision-theoretic models. Proceedings of the Sixteenth National Conference on Artificial Intelligence 1999: 902-903.

24. Eruhimov, V., K. Murphy, and G. Bradski, Intel's open-source probabilistic networks library. 2003: http://www.intel.com/technology/computing/pnl/index.htm.

25. Breitkreutz, B.J., C. Stark, and M. Tyers. The GRID: the General Repository for Interaction Datasets. Genome Biology 2003. 4(3): R23.

26. Bader, G.D., et al. BIND-The Biomolecular Interaction Network Database. Nucleic Acids Research 2001. 29(1): 242-5.

27. Fujibuchi, W., J.S. Anderson, and D. Landsman. PROSPECT improves cis-acting regulatory element prediction by integrating expression profile data with consensus pattern searches. Nucleic Acids Research 2001. 29(19): 3988-96.

28. Chu, S., et al. The transcriptional program of sporulation in budding yeast. Science 1998. 282(5389): 699-705.

29. DeRisi, J.L., V.R. Iyer, and P.O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997. 278(5338): 680-6.

30. Gasch, A.P., et al. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Meclp. Molecular Biology of the Cell 2001. 12(10): 2987-3003.

31. Gasch, A.P., et al. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell 2000. 11(12): 4241-57.

32. Hughes, T.R., et al. Widespread aneuploidy revealed by DNA microarray expression profiling. Nature Genetics 2000. 25(3): 333-7.

33. Ogawa, N., J. DeRisi, and P.O. Brown. New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Molecular Biology of the Cell 2000. 11(12): 4309-21.

34. Shakoury-Elizeh, M., et al. Transcriptional remodeling in response to iron deprivation in Saccharomyces cerevisiae. Molecular Biology of the Cell 2004. 15(3): 1233-43.

35. Spellman, P.T., et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998. 9(12): 3273-97.

36. Sudarsanam, P., et al. Whole-genome expression analysis of snf/swi mutants of Saccharomyces cerevisiae. PNAS 2000. 97(7): 3364-9.

37. Yoshimoto, H., et al. Genome-wide analysis of gene expression regulated by the calcineurin/Crzlp signaling pathway in Saccharomyces cerevisiae. J. Biological Chemistry 2002. 277(34): 31079-88.

38. Zhu, G., et al. Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature 2000. 406(6791): 90-4.

39. Troyanskaya, O., et al. Missing value estimation methods for DNA microarrays. Bioinformatics 2001.17(6): 520-5.

40. Hong, E., et al., Saccharomyces Genome Database. 2005, http://www.yeastgenome.org/.

41. Lehmann, E. Nonparametrics: Statistical Methods Based on Ranks. 1975: McGraw-Hill.

42. Saeed, A.I., et al. TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003. 34(2): 374-8.


AN ITERATIVE ALGORITHM TO QUANTIFY THE FACTORS INFLUENCING PEPTIDE FRAGMENTATION FOR MS/MS SPECTRUM

Chungong Yu†, Yu Lin†, Shiwei Sun, Jinjin Cai, Jingfen Zhang

Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China

Zhuo Zhang, Runsheng Chen*

Institute of Biophysics, Chinese Academy of Sciences, Beijing 100035, China

Dongbo Bu *

David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

Email: [email protected]

In protein identification through MS/MS spectra, it is critical to accurately predict the theoretical spectrum from a peptide sequence, which heavily depends on a quantitative understanding of the fragmentation process. To date, widely used database searching methods have adopted a simple statistical model to predict the theoretical spectrum, yielding spectra that deviate significantly from the practical spectra for some peptides and therefore preventing automated positive identification. Here, in order to derive an improved prediction model, we propose a novel method to automatically learn the factors influencing fragmentation from a training set of MS/MS spectra. In this method, the determination of the factors is converted into an optimization problem that minimizes an objective function measuring the distance between the experimental spectrum and the theoretical one. An iterative algorithm is then proposed to minimize the non-linear objective function. We implemented the methods and tested them on experimental data. Examination of 1451 spectra shows good agreement with known knowledge about peptide fragmentation, such as the tendency of cleavage towards the middle of the peptide and Pro's preference for N-terminal cleavage. Moreover, on a testing set containing 1425 spectra, comparison between predicted and practical spectra yields a median correlation of 0.759, demonstrating this method's ability to predict a "realistic" spectrum. The results in this paper contribute to accurate identification of proteins through both database searching and de novo methods.

1. INTRODUCTION

A major goal of proteomics is to study biological processes comprehensively through the identification, characterization, and quantification of expressed proteins in a cell or a tissue. Tandem mass spectrometry (MS/MS) has emerged as a powerful tool for sensitive high-throughput identification of proteins1, 2. In an experiment, proteins of interest are selected and digested by an enzyme such as trypsin, and the resultant peptides are separated in the mass analyzer according to their mass-to-charge ratio (m/z value). In a single experiment, multiple copies of the same peptide are fragmented into many charged fragments, and the fragments retaining the ionizing charge after collision-induced dissociation (CID) have their m/z values measured, the aggregate of which

forms the MS/MS spectrum16. Predicting the theoretical spectrum accurately from

a peptide sequence lies at the core of protein identification, especially for the database searching methods. Most database searching methods start by constructing a theoretical spectrum for each peptide in a protein database, followed by a comparison of the theoretical spectrum with the experimental one using an effective scoring function1, 2, 11, 12, 14-16, 28. The peptides with the highest scores are reported as potential solutions. Lacking a complete understanding of the fragmentation process, the widely used algorithms, such as Sequest10 and Mascot13, adopted a simple statistical model to predict the theoretical spectrum, which assumes that cleavage occurs at peptide bonds in a uniform manner, regardless

*To whom correspondence should be addressed. †These two authors contributed equally to this paper.


of some important influencing factors such as the position of amino acids, types of bond, etc. Though it succeeds in general cases, this simple model produces theoretical spectra that deviate significantly from the experimental ones for some peptides, leading to low or insignificant scores and thus preventing positive protein identification.

Furthermore, the de novo identification approaches could also benefit from an accurate prediction of the theoretical spectrum. Many studies have been conducted to identify proteins without dependence on a protein sequence database. Sakurai adopted a method to enumerate all possible sequences and compare each one with the spectrum5, and prefix pruning techniques were proposed to speed up the search6, 2. An alternative strategy is the spectrum graph, which formulates the spectrum as a graph and attempts to find the longest path in the graph28, 15, 4, 8. To overcome the shortcomings of the spectrum graph arising from missing and mixed peaks, PEAKS employs a sophisticated dynamic programming method9. In addition, Zhongqi Zhang proposed a method to combine a divide-and-conquer algorithm with a spectrum simulation7. Typically, a de novo sequencing method computes candidate sequences first and then evaluates them by comparing the experimental spectrum with the predicted spectra. Hence, an accurate prediction of the theoretical spectrum is useful not only to the database search approach, but also to de novo methods.

Accurately predicting the theoretical spectrum depends on a quantitative understanding of the fragmentation process occurring in mass spectrometry, which remains a challenge for the following reasons: First, fragmentation is a stochastic process governed by complicated physical and chemical rules and affected by many factors such as the position and identity of amino acids, types of bonds, etc. Moreover, it is also unclear to what extent each factor affects the fragmentation process. Second, isotopic atoms, neutral losses, post-translational modifications, and measuring error always result in peak deviations from their expected positions. This paper presents an attempt to quantify the factors influencing fragmentation.

1.1. Related Work

To predict the theoretical spectrum, aside from the promising chemical kinetic model to simulate the fragmentation process24, several studies have been conducted to develop a statistical prediction model. Dancik et al. introduced an automatic tool, the offset frequency function, to learn ion-type tendencies and intensity thresholds from experimental spectra15, 16. J.R. Yates III et al. attempted to identify statistical trends in spectrum peak intensities and put them into chemical context. F.P. Roth and S.P. Gygi applied a probability decision tree approach to distinguish the important factors from a total of 63 peptide and fragmentation attributes29. Another interesting method to determine the factors influencing fragmentation is a linear model proposed by F. Schutz21, in which a linear model is fitted to the spectrum, reflecting the influence of specific amino acids and their positions in the peptide. Moreover, the linear model also shows the ability to accurately predict the theoretical spectrum.

The linear model has some difficulties, however. In this model, the preference for cleavage at a bond is represented as the sum of the influence of the C-terminal residue and that of the N-terminal residue. This assumption is too strict, since it implies that a Xaa-Pro bond has an enhanced cleavage tendency relative to any Xaa-Yaa bond regardless of which amino acid Xaa is; this is inconsistent with the observation that Xaa-Pro bond cleavage is hindered when Xaa is Gly or Pro21. Hence, it is more reasonable to consider the cleavage preference at the level of bonds rather than as the sum of residue influences. In this paper, we present a novel model to overcome these difficulties.

1.2. Our Contribution

Our contributions within this paper are as follows: 1. We introduce a novel statistical model to

determine the important factors that influence the global fragmentation. Following the well-known "mobile proton" hypothesis, our model accounts for the influence of amino acid position and the cleavage preference of each bond in a more reasonable manner.

2. We used this model to predict theoretical spectra for a test set and made comparisons with practical ones. Using the derived quantitative


parameters, theoretical spectra can be generated by simulating the tendency of cleavage towards the middle of the peptide and the preference for N-terminal or C-terminal cleavage at a specific bond. Experimental results show that this model can predict a more 'realistic' spectrum.

We implemented these algorithms in an open source package PI (Peptide Identifier, downloadable freely from http://www.bioinfo.org.cn/MSMS/) and trained PI on several sets of spectra from ISB18. As a result, we rediscovered some known knowledge about peptide fragmentation, such as the tendency of cleavage towards the middle of the peptide and Pro's preference for N-terminal cleavage. Moreover, PI can predict accurate theoretical mass spectra from a peptide sequence.

2. METHODS

2.1. Fragmentation Model

"Mobile proton" hypothesis is one of the widely accepted tenets of the peptide fragmentation. In this model, the ionizing protons on the peptide migrate to an amide carbonyl oxygen along the peptide backbone, resulting in the cleavage of its N-terminal peptide bond and the production of a b-ion or y-ion depending on N-terminus or C-terminus retains the charge, respectively. Occasionally, an a-ion is generated from a 6-ion by losing of carbon monoxide. Other possible backbone ions, such as c, x, z ions, are not typically generated under low energy collision-induced dissociation conditions22' 23, 26.

Several factors have significant effects on the fragmentation process, since fragmentation in spectrometry is a stochastic process governed by the physical and chemical properties of a peptide and the collision dynamics. Some of these factors are as follows: First, there is a tight relationship between peak intensity and the relative position of the cleavage site; that is, fragmentation occurs more often in the middle of a peptide than at its ends17, 22, 25. Second, individual amino acids have different preferences for which of the two adjacent amide bonds (N-terminal or C-terminal) may break. For example, it was reported that Pro has a strong bias towards N-terminal cleavage22. Other factors, such as the excitation method, charge state of the ions, etc., also influence the fragmentation process26. Hence, identifying the significant factors

is important for improving theoretical spectrum prediction. This paper attempts to quantify the factors influencing the fragmentation process under the "mobile proton" hypothesis.

2.2. Influence of Cleavage Site and Peptide Bonds

In the mobile proton fragmentation model, it was reported that proton attachment depends partly on the relative affinities and the positions of the amino acids31. Let $A(a_i)$ denote the relative proton affinity of amino acid $a_i$, and let $f(j)$ denote the influence of position $j$ on proton affinity. For a peptide bond $\langle a_i, a_j \rangle$, let $B(a_i, a_j)$ denote the relative possibility that the bond breaks when a proton migrates onto it. Thus, for a peptide $P^{(i)} = p^{(i)}_1 p^{(i)}_2 \cdots p^{(i)}_L$, the number of cleavage events at the $j$-th bond, denoted $c_{i,j}$, can be estimated to be proportional to $f(j) \cdot A(p^{(i)}_{j-1}) \cdot B(p^{(i)}_{j-1}, p^{(i)}_j)$. Hence, minimizing the difference between the actual value $c_{i,j}$ and its estimate will assign reasonable values to $A(a_i)$, $f(j)$, and $B(a_i, a_j)$. For the sake of simplicity, we define $C(a_i, a_j) = A(a_i) \cdot B(a_i, a_j)$, a measurement of the relative possibility that a migrating proton results in cleavage at the bond $\langle a_i, a_j \rangle$. All the above parameters can then be determined by solving the following non-linear programming problem on a training peptide set $P^{(1)}, P^{(2)}, \ldots, P^{(K)}$ of equal length $|P^{(1)}| = |P^{(2)}| = \cdots = |P^{(K)}| = L$:

$$\min \sum_{i=1}^{K} \sum_{j=2}^{L} \left( c_{i,j} - \alpha_i \cdot f(j) \cdot C\left(p^{(i)}_{j-1}, p^{(i)}_j\right) \right)^2$$

$$\text{s.t.} \quad \sum_{j} f(j) = 1, \quad f(j) \geq 0,$$

$$\sum_{j} \alpha_i \cdot f(j) \cdot C\left(p^{(i)}_{j-1}, p^{(i)}_j\right) = 1 \quad (i = 1, \ldots, K),$$

$$\alpha_i > 0, \quad C(a_i, a_j) \geq 0.$$

Here, $\alpha_i$ is an auxiliary variable, a scale factor introduced so that $\sum_{j} \alpha_i \cdot f(j) \cdot C(p^{(i)}_{j-1}, p^{(i)}_j) = 1$ for each peptide. The objective function is the sum of squared differences between the theoretical and experimental intensities.

We tried some classical non-linear programming methods but failed to find an optimal solution in reasonable time, owing to the high rank of the constraint system.


Here, an iterative method was adopted to solve this problem. The method is based on the fact that the above formulation reduces to a least squares problem if two of the three types of variables, e.g., $\alpha_i$ and $C(a_i, a_j)$, are fixed, while only one type, e.g., $f(j)$, is treated as the variables.

At first, all the variables are assigned random initial values. Each iteration loop contains three steps, in each of which one of $f(j)$, $\alpha_i$, and $C(a_i, a_j)$ is chosen as the variable set. For example, if $f(j)$ is chosen as the variable while $C(a_i, a_j)$ and $\alpha_i$ are fixed at their current values, a classical optimization algorithm is called to solve the least squares problem over the variables $f(j)$; the same is then done for $\alpha_i$ and for $C(a_i, a_j)$. The iteration loop is repeated until the value of the objective function no longer changes.

It can easily be proved that the iterative algorithm must converge. The proof is based on the fact that the value of the objective function is non-negative and decreases monotonically at each step. In practice, the algorithm always converges to a fixed point after no more than 10 iteration loops, and experiments reach the same fixed point from different random initializations. In addition, the structure of the formulation guarantees that only positive solutions are found.
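To illustrate the alternating structure of the algorithm, here is a simplified sketch in which each coordinate update is the closed-form least squares solution and the constraints are handled by clipping and renormalization rather than by a constrained solver; the array layouts are our own assumptions, not the PI implementation.

```python
import numpy as np

def fit_fragmentation_model(C_obs, bond_idx, n_bonds, iters=10, rng=None):
    """Alternating least-squares sketch for c_ij ~ alpha_i * f(j) * C(bond).

    C_obs    -- (K, L-1) array of observed cleavage intensities per bond
    bond_idx -- (K, L-1) integer array giving the bond type <a_{j-1}, a_j>
                at each position, encoded as 0 .. n_bonds-1
    """
    rng = rng or np.random.default_rng(0)
    K, M = C_obs.shape
    f = np.full(M, 1.0 / M)          # position influence f(j)
    alpha = np.ones(K)               # per-peptide scale factors
    c = rng.random(n_bonds) + 0.1    # bond-level propensities C(a_i, a_j)

    for _ in range(iters):
        # f(j): closed-form least squares per position, then project
        A = alpha[:, None] * c[bond_idx]
        f = (C_obs * A).sum(axis=0) / (A ** 2).sum(axis=0)
        f = np.clip(f, 1e-9, None)
        f /= f.sum()                 # enforce sum_j f(j) = 1
        # alpha_i: closed-form least squares per peptide
        B = f[None, :] * c[bond_idx]
        alpha = (C_obs * B).sum(axis=1) / (B ** 2).sum(axis=1)
        # C(a_i, a_j): closed-form least squares per bond type
        W = alpha[:, None] * f[None, :]
        for b in range(n_bonds):
            mask = bond_idx == b
            denom = (W[mask] ** 2).sum()
            if denom > 0:
                c[b] = max((C_obs[mask] * W[mask]).sum() / denom, 1e-9)
    return f, alpha, c
```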

The algorithm to minimize the distance function is given in Fig. 1.

3. EXPERIMENTAL RESULTS

3.1. Datasets

A public online spectrum set, ABJP, from ISB18 was used to test our algorithm; it contains spectra generated through shotgun analysis of proteins from human K562 cells. We restricted our analysis to doubly charged, 'mobile' peptides for this proof-of-concept experiment. The spectrum set was randomly divided into two parts, a training set (1451 matches) and a testing set (1425 matches). (See supplementary material at http://www.bioinfo.org.cn/MSMS/.)

3.2. Position and Bond's Influence on Cleavage

Relationship between fragmentation probability and cleavage site. The training set was categorized into several subsets with respect to peptide length. On the subsets containing peptides of the same length L = 7, 8, 9, 10, 11, 12, 13, 14, 15, the relationship between amino acid position and proton affinity was calculated. Fig. 2 shows the cases where L = 9, 11, 13, 15 and demonstrates that fragmentation occurs more often towards the middle of a peptide than at its ends, which is consistent with observations reported previously21, 29, 17. Moreover, Fig. 2 shows that the shorter the peptide, the more asymmetric the curve, which supports the observation that fragmentation near the N-terminus differs significantly from that at other sites26.

Cleavage preference of peptide bonds. The statistical results for the preference of fragmentation at all 400 peptide bonds were calculated (see supplementary material at http://www.bioinfo.org.cn/MSMS/).

To justify the motivation of this work, we compared the cleavage preference of Xaa-Pro bonds with that of Xaa-Trp bonds (see Figure 3a). Figure 3a shows that in general a Xaa-Pro bond has a higher tendency to cleave than the corresponding Xaa-Trp bond; however, a Xaa-Pro bond is relatively hard to cleave when Xaa is Gly or Pro, since cleavage is hindered in these two cases21. This phenomenon cannot be reflected correctly if the cleavage preference is simply measured as the sum of residue influences. Hence, it is more reasonable to consider the cleavage preference at the bond level.

Examination of the preference data shows good agreement with knowledge already familiar to mass spectrometry experts. First, some amino acids prefer cleavage at the N-terminal over the C-terminal bond. For example, it is well known that cleavage at Pro's N-terminus is preferred over that at its C-terminus, because attack of the adjacent carbonyl oxygen at the electropositive carbon is hindered by the molecular structure of Pro22. Figure 3b shows that Xaa-Pro always has a higher possibility of fragmentation than the counterpart Pro-Xaa, supporting the finding that Pro tends to cleave at its N-terminal rather than its C-terminal bond21. It was also reported that cleavage at His-Xaa bonds occurs much more often than at others22, 23. As an example, Figure 3c shows a comparison between His-Xaa bonds and Asn-Xaa bonds, which is consistent with the


above observation. Second, fragmentation of the Xaa-Pro bond is encouraged when Xaa is Ile (0.020), His (0.016), or Trp (0.014), while it is hindered when Xaa is Gly (0.0018) or Pro (0.0009) (see Figure 3a). In conclusion, the above results are strongly supported by the "mobile proton" model: the more basic the residue, the larger its proton affinity, and thus the more facile the fragmentation.

3.3. Predicting Theoretical Spectrum

For a given peptide P, the theoretical spectrum can be predicted by simulating the fragmentation process following the mobile proton model. That is, the number of cleavage events at the $j$-th bond can be estimated to be proportional to $f(j) \cdot A(p_{j-1}) \cdot B(p_{j-1}, p_j)$. Here, we roughly assumed that a b-ion or y-ion is formed by a cleavage event with equal probability, since the 'effective' temperature is unknown26.
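A minimal sketch of this prediction step, assuming the learned parameters f, A, and B are available as Python mappings (hypothetical data structures, not PI's actual interface):

```python
def predict_intensities(peptide, f, A, B):
    """Relative b/y-ion intensities predicted under the mobile proton model.

    f -- {position j: influence f(j)}; A -- {residue: proton affinity};
    B -- {(res_i, res_j): bond break propensity}. Each cleavage event is
    split equally between the b and y ion, per the assumption in the text.
    """
    raw = [f[j] * A[peptide[j - 1]] * B[(peptide[j - 1], peptide[j])]
           for j in range(1, len(peptide))]   # one value per backbone bond
    total = sum(raw) or 1.0
    return [(0.5 * r / total, 0.5 * r / total) for r in raw]  # (b, y) pairs
```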

Two examples are shown in Fig. 4, one for "DPLLLAIIPK", which contains Pro (Pro has a unique fragmentation preference), and the other for "DAGTIAGINVMR". Each predicted spectrum is plotted on the lower axis, showing reasonable similarity to its experimental counterpart (correlation coefficients of 0.80 and 0.81, respectively).

On the test set of 1425 spectrum-peptide pairs, theoretical spectra were predicted and compared with the experimental spectra. The median correlation between predicted and experimental spectra is 0.759, showing that this method can predict a "realistic" spectrum.

4. CONCLUSION AND DISCUSSION

Accurate prediction of theoretical spectra is important to database searching methods; however, such prediction requires a quantitative understanding of the fragmentation process. Here, we proposed a non-linear programming method to estimate the factors influencing fragmentation. We applied this algorithm to real data and successfully obtained many biological features that are also supported by known rules of fragmentation, demonstrating the effectiveness of the method. Moreover, our simulated mass spectra are reasonably similar to their experimental counterparts. Currently, we have not taken charge +3 and non-mobile peptides into account, and we adopted the rough assumption that b-ions and y-ions are produced with equal probability by a cleavage event. The influence of distant amino acids on fragmentation is also not considered in our model. How to incorporate those factors in PI remains an open problem.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under grants 60496320, 30500104 and 30570393, the National Key Basic Research and Development Program under grants 2002CB713805 and 2003CB715900, and the opening project of the Shanghai Key Laboratory of Intelligent Information Processing, Fudan University (No. IIPL-04-001). Dongbo Bu was partly supported by NSERC operating grant OGP0046506.

References
1. Zhu, H.; Bilgin, M.; Snyder, M. Annu Rev Biochem 2003, 72, 783-812.
2. Yates, J. R., 3rd. J Mass Spectrom 1998, 33, 1-19.
3. Aebersold, R.; Goodlett, D. R. Chem Rev 2001, 101, 269-295.
4. Hines, W. M.; Falick, A. M.; Burlingame, A. L.; Gibson, B. W. J Am Soc Mass Spectrom 1992, 3, 326-336.
5. Sakurai, T.; Matsuo, T.; Matsuda, H.; Katakuse, I. Biomed Mass Spectrom 1984, 11(8), 396-399.
6. Siegel, M. M.; Bauman, N. Biomed Environ Mass Spectrom 1988, 15, 333-343.
7. Zhang, Z. Anal Chem 2004, 76, 6374-6383.
8. Bartels, C. Biomed Environ Mass Spectrom 1990, 19, 363-368.
9. Ma, B.; Zhang, K.; Liang, C. Journal of Computer and System Sciences 2005, 70, 418-430.
10. Yates, J. R., 3rd; Eng, J. K.; McCormack, A. L. J Am Soc Mass Spectrom 1994, 5, 976-989.
11. Sonar. http://65.219.84.5/ProteinId.html.
12. MOWSE. http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/mowsedoc.html.
13. Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
14. Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-1412.
15. Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. J Comput Biol 1999, 6, 327-342.
16. Bafna, V.; Edwards, N. Bioinformatics 2001, 17 Suppl 1, S13-S21.
17. Havilio, M.; Haddad, Y.; Smilansky, Z. Anal Chem 2003, 75, 435-444.
18. Resing, K. A. et al. Anal Chem 2004, 76(13), 3556-3568.
19. Pevzner, P. A.; Mulyukov, Z.; Dancik, V.; Tang, C. L. Genome Res 2001, 11, 290-299.
20. Keller, A.; Purvine, S.; Nesvizhskii, A. I.; Stolyar, S.; Goodlett, D. R.; Kolker, E. Omics 2002, 6, 207-212.
21. Schutz, F.; Kapp, E. A.; Simpson, R. J.; Speed, T. P. Biochem Soc Trans 2003, 31, 1479-1483.
22. Tabb, D. L.; Smith, L. L.; Breci, L. A.; Wysocki, V. H.; Lin, D.; Yates, J. R., 3rd. Anal Chem 2003, 75, 1155-1163.
23. Wysocki, V. H.; Tsaprailis, G.; Smith, L. L.; Breci, L. A. J Mass Spectrom 2000, 35, 1399-1406.
24. Zhang, Z. Proc 50th ASMS Conf Mass Spectrom, Orlando, FL, 2002, Paper TPE-126.
25. O'Hair, R. A. J Mass Spectrom 2000, 35, 1377-1381.
26. Paizs, B.; Suhai, S. Mass Spectrom Rev 2004, 5, 103-113.
27. Chen, T.; Kao, M. Y.; Tepel, M.; Rush, J.; Church, G. M. J Comput Biol 2001, 8, 325-337.
28. Wan, Y.; Chen, T. RECOMB 2005, pp. 342-356.
29. Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nature Biotechnology 2004, 22(2).
30. Huang, Y.; Wysocki, V. H.; Tabb, D. L.; Yates, J. R. J Am Soc Mass Spectrom 2002, 219, 233-244.
31. Nold, M. J.; Cerda, B. A.; Wesdemiotis, C. J Am Soc Mass Spectrom 1999, 10, 1-8.


Algorithm to Minimize Distance Function

Input: K pairs of peptides and tandem mass spectra {(P_1, S_1), (P_2, S_2), ..., (P_K, S_K)}, with |P^(1)| = |P^(2)| = ... = |P^(K)| = L.
Output: bond cleavage preference C(a_i, a_j) for each bond <a_i, a_j>, and position influence on cleavage f(j), j = 1, 2, ..., L.
1. Initialize C(a_i, a_j) and f(j) randomly;
2. Optimize formula (1) over a_i with C(a_i, a_j) and f(j) held at their current values;
3. Optimize formula (1) over C(a_i, a_j) with a_i and f(j) held at their current values;
4. Optimize formula (1) over f(j) with a_i and C(a_i, a_j) held at their current values;
5. Repeat steps 2-4 until the objective function value converges;
6. Output C(a_i, a_j) and f(j).

Fig. 1. Algorithm to Minimize Distance Function
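Steps 2-4 of Fig. 1 are a block coordinate descent. A minimal Python sketch of the scheme, assuming SciPy is available and the caller supplies a distance objective implementing formula (1) over the parameter blocks (this is an illustration, not the authors' implementation):

    import numpy as np
    from scipy.optimize import minimize

    def alternating_minimize(distance, blocks, max_rounds=100, tol=1e-8):
        # distance: objective over the parameter blocks (here: residue factors,
        # bond preferences C, and position influences f); blocks: 1-D arrays
        blocks = [np.asarray(b, dtype=float) for b in blocks]
        prev = distance(blocks)
        for _ in range(max_rounds):
            for i in range(len(blocks)):
                # optimize block i while the other blocks hold current values
                def obj(v, i=i):
                    trial = list(blocks)
                    trial[i] = v
                    return distance(trial)
                blocks[i] = minimize(obj, blocks[i]).x
            cur = distance(blocks)
            if prev - cur < tol:
                break
            prev = cur
        return blocks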


Fig. 2. Relationship between Proton Affinity and Cleavage Site (L = 9, 11, 13, 15).


Fig. 3. Bonds Preference for Proton Affinity and Cleavage (panels 3a, 3b, and 3c; the x-axes range over the twenty amino acid residues).

Fig. 4. Simulated and Experimental Spectra for 'DAGTIAGINVMR' and 'DPLLLAIIPK'


COMPLEXITY AND SCORING FUNCTION OF MS/MS PEPTIDE DE NOVO SEQUENCING

Changjiang Xu and Bin Ma*

Department of Computer Science, University of Western Ontario,

London, ON N6A 5B7, Canada Email: [email protected]

* Email: [email protected]

Tandem mass spectrometry (MS/MS) has become a standard way to identify peptides and proteins. A scoring function plays an important role in MS/MS data analysis. De novo sequencing is the computational step that derives a peptide sequence from an MS/MS spectrum, normally by constructing the peptide that maximizes the scoring function. A number of polynomial time algorithms have been developed based on scoring functions that consider only the N-terminal or C-terminal fragment ions of the peptide. It has remained unknown whether the problem stays polynomial-time solvable when internal fragment ions are also considered. In this paper, we prove that internal fragment ions make the de novo sequencing problem NP-complete. We also propose a regression model based scoring method to incorporate correlations between the fragment ions. Our scoring function is combined with the PEAKS de novo sequencing algorithm and tested on ion trap data. The experimental results show that the regression model based scoring method can remarkably improve the de novo sequencing accuracy.

1. INTRODUCTION

Identification of the proteins present in a tissue is frequently a key step in proteomics research. In recent years, tandem mass spectrometry (MS/MS) has become a powerful analytical tool for protein and peptide identification1, 2. It is difficult to identify intact proteins directly. Hence the proteins are digested into short peptides, and the individual peptides are identified separately using MS/MS.

Peptide identification deduces the peptide sequence that best matches an MS/MS spectrum. This technique can give accurate peptide identifications provided that a high quality MS/MS spectrum is available. However, currently only a fraction of the acquired spectra lead to positive peptide identifications. The reasons involve various factors3-5, including poor fragmentation of the selected precursor ions, chemical contaminants obscuring peptide fragment ions, and unanticipated residues caused by post-translational modifications.

Over the past decade, numerous computational approaches and software programs have been developed for MS/MS peptide identification. These can be categorized into four classes6: sequence database searching, de novo sequencing, sequence tagging, and consensus of multiple search engines.

Database searching finds the best matching peptide in a protein sequence database. Popular algorithms using this approach include Sequest7, Mascot8, Tandem9, and Omssa10. De novo sequencing is equivalent to searching for the optimal peptide in a universal peptide database that includes all linear combinations of amino acids; efficient algorithms for computing the optimal peptide are required to avoid the explicit search. Among the de novo algorithms are Lutefisk11, 12, Sherenga13, Compute-Q14, PEAKS15, 16, PepNovo17, and NovoHMM18. Sequence tagging finds the best peptide by searching a database with sequence tags that may be inferred by de novo sequencing; existing algorithms include GutenTag19, OpenSea20, SPIDER21, and DeNovoID22. The consensus method combines several different programs to increase confidence and coverage23, 24. A review of most of the protein and peptide identification algorithms can be found in Ref. 4.

The scoring function, which evaluates the matches between candidate peptides and the MS/MS spectrum, is a key component of peptide identification. It is usually described by a mathematical model that quantifies the likelihood that a given sequence is the correct peptide

*Corresponding author.


sequence that generates the MS/MS spectrum. The basic principle of MS/MS peptide identification is that the peaks in the spectrum are produced by the fragment ions of the peptide, and the scoring function evaluates the peptide using the number and intensities of the peaks. Many scoring methods have been developed for database searching7, 25-30 and for de novo sequencing13-15, 17, 18. Many scoring functions examine the correlations between the fragment ions, using techniques such as likelihood tests13, 26, 27, Hidden Markov models28, decision trees29, and Bayesian networks17.

However, all of the polynomial time de novo sequencing algorithms developed previously are based on scoring functions that use only the N-terminal or C-terminal ions of the peptide. None of them utilizes the internal fragment ions. Some de novo sequencing programs such as PEAKS15 use a scoring function that takes internal fragment ions into account to further re-evaluate the peptide candidates. However, the de novo sequencing step itself could not account for the internal fragment ions, since internal fragment ions make the de novo sequencing problem very similar to two well-known open problems: the partial digest problem31 and Problem 12.116 in Ref. 32. Neither of these two open problems has a known polynomial time algorithm.

In this paper, we prove that de novo sequencing with internal fragment ions is in fact NP-complete. Therefore, future research in utilizing internal fragment ions should focus on either heuristic algorithms or exponential time algorithms that run fast enough for small instances. This also justifies the two-step approach used in PEAKS15 to utilize the internal fragment ions. The second contribution of the paper is a new scoring function based on a regression model (RM). The RM-based scoring method can efficiently exploit the relationships between different fragment ion types. The new scoring function is used to refine PEAKS' de novo sequencing results, and a significant improvement is achieved.

The remainder of this paper is organized as follows. Section 2 proves the complexity of de novo sequencing with internal fragment ions. Section 3 presents the RM-based scoring method. Section 4 gives the comparison of the new scoring method with PEAKS and PepNovo.

2. COMPLEXITY OF DE NOVO SEQUENCING WITH INTERNAL FRAGMENT IONS

2.1. Notations and preliminaries

There are 20 common amino acid residues, denoted by 20 different single-letter codes. Normally, a peptide is a string over the alphabet of these 20 letters. The mass of a residue a is denoted by m(a). For a string of residues a_1 a_2 ... a_n, define m(a_1 a_2 ... a_n) = sum_{i=1}^{n} m(a_i).

In MS/MS, a peptide is fragmented into different ion types. In low energy collision-induced dissociation (CID), the fragmentation produces mostly y-ions (fragments with the C-terminus) and b-ions (fragments with the N-terminus). These are the most interesting ion types for de novo sequencing algorithms. However, the peptide is frequently fragmented more than once, which produces internal fragment ions (fragments containing neither terminus). Internal fragment ions are observed more often when the collision energy is high, such as in TOF/TOF mass spectrometers. When an internal fragment ion contains only one amino acid, it is also called an immonium ion. In this paper we only consider internal fragment ions with two or more amino acids. For example, the peptide AGEDK has four b-ions A, AG, AGE, and AGED; four y-ions K, DK, EDK, and GEDK; and three internal fragment ions GE, ED, and GED.

The mass value of each ion is the total mass of its residues plus a constant associated with the ion type. In practice, when the ions retain only one positive charge, the constants for b, y, and internal ions are 1, 19, and 1, respectively. Therefore, the b-ions of the peptide a_1 a_2 ... a_n have mass values B = {1 + m(a_1 a_2 ... a_k) | k = 1, ..., n-1}; the y-ions have mass values Y = {19 + m(a_k a_{k+1} ... a_n) | k = 2, ..., n}; and the internal fragment ions have mass values I = {1 + m(a_k a_{k+1} ... a_j) | 1 < k < j < n}.
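These three mass-value sets are easy to compute directly; a small Python sketch (the helper name and mass table are ours; the table uses the nominal residue masses given later in Section 2.2, plus K = 128 for the AGEDK example):

    # Nominal residue masses for the residues used in the running examples
    MASS = {'G': 57, 'E': 129, 'A': 71, 'D': 115, 'W': 186, 'K': 128}

    def ion_masses(peptide):
        m = [MASS[a] for a in peptide]
        n = len(m)
        B = {1 + sum(m[:k]) for k in range(1, n)}          # b-ions (prefixes)
        Y = {19 + sum(m[k:]) for k in range(1, n)}         # y-ions (suffixes)
        I = {1 + sum(m[k:j]) for k in range(1, n - 1)      # internal ions with
             for j in range(k + 2, n)}                     # two or more residues
        return B, Y, I

    B, Y, I = ion_masses("AGEDK")    # 4 b-ions, 4 y-ions, 3 internal ions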

An MS/MS spectrum provides the signal intensity at every mass value (in fact, mass to charge ratio). This can be used to define a scoring function to select peptides. First, three functions f_y(x), f_b(x), and f_i(x) are used to define the score of a y, b, and internal fragment ion at mass x, respectively. Suppose the sets of mass values of the y, b, and internal fragment ions of a peptide P are Y, B, and I, respectively. Then the score of the peptide is defined by:

score(P) = sum_{x in Y} f_y(x) + sum_{x in B\Y} f_b(x) + sum_{x in I\(Y union B)} f_i(x)

The de novo sequencing problem is to compute the peptide sequence P over an alphabet Sigma such that score(P) is maximized. Notice that when the mass values of multiple ions overlap, only the score of one (the most important one) is counted in the scoring function.

A simplified version is to let f_y(x) = c_y f(x), f_b(x) = c_b f(x), and f_i(x) = c_i f(x) for some constants c_y, c_b, c_i and a function f(x). In the next section we will prove that even this simplified version is NP-complete.
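Continuing the sketch above (using the ion_masses helper), the score definition with its stated precedence for overlapping masses can be written as:

    def score(peptide, f_y, f_b, f_i):
        # Overlapping mass values are counted once, with y taking precedence
        # over b, and b over internal ions, per the definition above.
        B, Y, I = ion_masses(peptide)
        return (sum(f_y(x) for x in Y)
                + sum(f_b(x) for x in B - Y)
                + sum(f_i(x) for x in I - (Y | B)))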

2.2. NP-completeness

The de novo sequencing problem has been extensively studied, and polynomial time algorithms have been proposed. However, all polynomial time algorithms consider only the ions with either the N- or C-terminus; in other words, f_i(x) = 0 for all x. Some software systems such as PEAKS use internal fragment ions to refine the results after the de novo sequencing algorithm finds a list of candidates, but internal fragment ions are not used in the de novo sequencing algorithm itself. An MS/MS spectrum usually contains a large number of ions that are neither y nor b ions, and a significant portion of these additional ions are internal fragments. The use of these ions could therefore improve the accuracy. However, it was unknown whether an efficient algorithm exists when these ions are taken into account. Our result in this section answers the question negatively: finding the optimal sequence is NP-complete when internal fragment ions are counted. This suggests that, unless P=NP, no polynomial time algorithm exists, and therefore research efforts should be put into designing either heuristic algorithms or exponential time algorithms that run fast enough when the sequence is short.

Theorem 2.1. De novo sequencing is NP-complete if internal fragment ions are counted.

Proof. Obviously the problem is in NP because, given any peptide sequence, the score can be calculated in polynomial time. In what follows we reduce the Max-Cut-3 problem to our problem. A Max-Cut-3 instance is a graph (V, E), where V = {v_1, v_2, ..., v_n} and E = {e_1, e_2, ..., e_m}. Each vertex has degree exactly 3. The optimal solution is two disjoint vertex sets V_1 and V_2 such that (a) V_1 union V_2 = V and (b) the number of edges whose two endpoints lie in different sets is maximized. It is well known that Max-Cut-3 is NP-hard33.

Our constructed instance of de novo sequencing has only five letters in the alphabet Sigma: G, E, A, D, and W. The mass values of the five letters are 57, 129, 71, 115, and 186, respectively.^a Therefore, m(G) + m(E) = m(A) + m(D) = m(W).

For each vertex v_i, suppose it is adjacent to the three edges e_j, e_k, and e_l; we construct the following string

s_i = t_i W^{j-1} AD W^{k-j-1} AD W^{l-k-1} AD W^{2m-l},

where t_i can be one of W, GE, and EG, all having the same mass. Let

s_0 = W^{2m+1},

and let S be the concatenation of the constructed strings: S = s_0 s_1 s_2 ... s_n s_0.

The idea of our construction is to define a spectrum so that the optimal solution of the de novo sequencing problem has the form of S. This is achieved by carefully designing the y and b ion scores f_y(x) = c_y f(x) and f_b(x) = c_b f(x). Then we use the internal fragment ion score f_i(x) = c_i f(x) to "fine tune" the t_i in each s_i. Depending on whether t_i takes GE or EG, s_i will produce different internal fragment ions EW...WA or GW...WA. For an edge e_k = (v_i, v_j), if t_i and t_j are different, then both EW^{k-1}A and GW^{k-1}A will contribute to the score. However, if t_i and t_j are the same, then only one of

^a These are the nominal mass values of the five real amino acid residues coded by the five letters.


the two will contribute to the score. Thus, the solution of the de novo sequencing problem is connected to the solution of Max-Cut-3. The details of the construction follow.

Let Y, Y', and Y'' be the y-ions of S obtained by setting all t_i to W, GE, and EG, respectively. Similarly, let B, B', and B'' be the b-ions of S obtained by setting all t_i to W, GE, and EG, respectively. Obviously Y is a subset of Y' intersect Y'', and B is a subset of B' intersect B''.

Furthermore, let I = {1 + m(EW^{k-1}A), 1 + m(GW^{k-1}A) | k = 1, ..., m}. Clearly, I consists of the internal fragment ions resulting from fragments spanning t_i and one of the three AD substrings in the same s_i. Each pair of internal fragment ions in I corresponds to an edge e_k; therefore, |I| = 2m. Because of the existence of s_0 at both ends of S, it is easy to verify that the three sets Y, B, and I do not overlap each other. Let I* = {1 + m(aW^j b) | a in {E, G, W}, b in {A, D, W}, 0 <= j <= 2mn + 4m + n}. One can easily verify that all the internal fragment ions of S are in I*, no matter how the individual t_i's take their values.

We assign values of f(x) as follows: f(x) = 1 if x in Y union B; f(x) = 1/(4m) if x in I; f(x) = 0 if x in (Y' union Y'' union B' union B'' union I*) \ (Y union B union I); and f(x) = -1 otherwise.

Let f_y(x) = f_b(x) = f(x), and f_i(x) = f(x). Because f(x) = 1 for x in Y union B and f(x) = 0 for x in (Y' union Y'' union B' union B'' union I*) \ (Y union B union I), the y and b ion scores will force the optimal solution to have the form of S, as proved in the following lemma.

Lemma 2.1. Any optimal solution can be modified to have the form of S. In addition, each t_i must be either GE or EG.

Proof. According to the definition of f(x), all the y, b, and internal ions of sequences of the form S have scores greater than or equal to 0, and all of the ions in Y union B contribute score 1. Therefore, any sequence of the form S will have a score no less than |Y union B|. Because Y intersect B is empty, |Y union B| = |Y| + |B|. On the other hand, even if all the positive positions are matched by y and b ions, the score is no more than |Y| + |B| + 1/2 because |I| = 2m. Consequently, an optimal solution needs to match all mass values in Y union B using its y and b ions. This ensures that it has the form X^{(2m+1)(n+2)}, where each segment X can independently take one of W, EG, GE, AD, and DA. If this optimal solution does not satisfy the lemma, then for every segment X that contradicts the lemma, there are two possible cases.

Case 1. X takes W but S asks for GE or EG as a t_i. In this case, we simply change X from W to either GE or EG. By the definition of f(x), this will not reduce the score.

Case 2. X is a two-letter segment, and is different from what S asks for. In this case, one can easily check that the y-ion caused by the fragmentation inside the two-letter segment X will give a -1 value. This will make the total score less than |Y| + |B|. Therefore, this case does not exist.

Thus, the lemma is proved. •

The following lemma concludes our proof of Theorem 2.1.

Lemma 2.2. The spectrum has an optimal solution with value 4mn + 8m + 2n + 2 + (m+K)/(4m) if and only if the Max-Cut-3 instance has an optimal solution that cuts K edges.

Proof. Note that |Y| + |B| = 4mn + 8m + 2n + 2. That is, any solution that satisfies Lemma 2.1 gains score 4mn + 8m + 2n + 2 from the y and b ions. The (m+K)/(4m) portion is determined by the internal cleavage ions in I.

"<=" Suppose the optimal cut is V = V_1 union V_2. For each v_i in V_1, let t_i = GE. For each v_i in V_2, let t_i = EG. Then S is a solution of the de novo sequencing problem. All the mass values in Y union B are matched. For each edge e_k, if it is cut, the pair of mass values 1 + m(EW^{k-1}A) and 1 + m(GW^{k-1}A) in I are both matched. If e_k is not cut, then exactly one is matched. This gives score (m+K)/(4m) from the internal fragment ions.

"=>" Because of Lemma 2.1, each t_i is either GE or EG. Let V_1 consist of all v_i such that t_i = GE, and V_2 consist of all v_i such that t_i = EG. We get a cut for the Max-Cut-3 instance. This way, it is clear that an edge e_k is cut if and only if the pair of mass values 1 + m(EW^{k-1}A) and 1 + m(GW^{k-1}A) in I are both


matched. The score (m+K)/(4m) contributed by the ions in I ensures that exactly K pairs are both matched. That is, exactly K edges are cut. •

The proof of Lemma 2.2 finishes the proof of Theorem 2.1. •
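To make the reduction tangible, here is a small Python sketch of the vertex-gadget construction (the function names are ours; the example uses K4, the complete graph on four vertices, in which every vertex has degree 3):

    def vertex_gadget(t, incident, m):
        # t in {'W', 'GE', 'EG'}; incident: the 1-based indices (j, k, l) of the
        # three edges adjacent to this vertex; m: total number of edges.
        j, k, l = sorted(incident)
        return (t + 'W' * (j - 1) + 'AD' + 'W' * (k - j - 1) + 'AD'
                  + 'W' * (l - k - 1) + 'AD' + 'W' * (2 * m - l))

    def build_S(assignment, incidence, m):
        # assignment[i]: the choice for t_i; incidence[i]: edges at vertex v_i.
        s0 = 'W' * (2 * m + 1)
        return s0 + ''.join(vertex_gadget(assignment[i], incidence[i], m)
                            for i in sorted(incidence)) + s0

    # Example: K4 with edges numbered 1..6
    edges = {1: (1, 2), 2: (1, 3), 3: (1, 4), 4: (2, 3), 5: (2, 4), 6: (3, 4)}
    incidence = {v: tuple(e for e, ends in edges.items() if v in ends)
                 for v in range(1, 5)}
    S = build_S({v: 'GE' for v in incidence}, incidence, m=len(edges))

Every gadget has total mass (2m+1) m(W), matching the analysis in the proof of Lemma 2.1.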

3. REGRESSION MODEL BASED SCORING METHODS FOR DE NOVO SEQUENCING

3.1. Relationship between fragment ions

When peptides are fragmented by collision-induced dissociation (CID) in a tandem mass spectrometer, the resulting fragment ions can be categorized into three classes. One is the complementary fragment ions generated from one backbone cleavage, which include the N-terminal fragments (a, b, and c ions) and the C-terminal fragments (x, y, and z ions). Another is the derivatives of fragment ions, which include the neutral losses of water or ammonia, multiply charged ions, and isotopic ions. The last is internal fragments and immonium ions generated from double backbone cleavage. The typical fragment ions in low energy CID are summarized in Table 1, together with the notations used in this paper. Notice that b^i and y^i denote the derivative ions from b- and y-ions. This is different from the conventional notation b_i and y_i, which denotes the b-ion and y-ion with i residues, respectively.

The fragment ions observed in an MS/MS spectrum have various intensities. Many are low and even below noise, so it is difficult to directly distinguish fragment ions with low intensity from contaminants and noise. However, the fragment ions occur correlatively with each other, and this relationship is helpful for correctly identifying the fragment ions. The dependencies and correlations between types of fragment ions may be categorized into two classes. One is between the complementary fragments (such as b and y ions). The other is between fragments and their derivatives (such as b, b-NH3, and b-H2O, or y, y-NH3, and y-H2O). The relationship between the fragment ions can be examined via their statistical distributions. Table 2 lists the conditional probabilities calculated by examining the fragment ions in ion trap data sets.

From the statistical results, we can clearly see the dependencies between different types of fragment ions. For example, b and y ions mostly occur together, and the derivatives of fragment ions strongly depend on the fragment ions.

3.2. Regression model for scoring function

First, the peak intensities in the mass spectrum are normalized so that each peak has intensity between 0 and 1. Let p be the r-th highest peak in the spectrum, where r is referred to as the ranking of peak p. Then the normalized intensity of p is defined by s(r) = (r_0 + 1)/(r_0 + r). The constant r_0 may be taken in the range [50, 100].
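A minimal Python illustration of this rank-based normalization (r_0 = 75 is an arbitrary choice within the stated range):

    def normalized_intensities(intensities, r0=75):
        # Rank peaks from highest (r = 1) downward and map each to s(r).
        order = sorted(range(len(intensities)), key=lambda i: -intensities[i])
        s = [0.0] * len(intensities)
        for rank, i in enumerate(order, start=1):
            s[i] = (r0 + 1) / (r0 + rank)
        return s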

Suppose a peptide P = a_1 a_2 ... a_n. Each fragmentation between a_k and a_{k+1} is associated with a number of ions. The N-terminal ions include the b-ion a_1 a_2 ... a_k and its derivative ions as in Table 1. The C-terminal ions include the y-ion a_{k+1} ... a_n and its derivative ions as in Table 1. We use the same notation b^i and y^i to denote both the derivative ions and the normalized intensities of the ions. In addition, there are internal fragment ions a_i ... a_k (i = 2, ..., k-1) and a_{k+1} ... a_j (j = k+2, ..., n-1) associated with the fragmentation. We sort these internal fragment ions by normalized intensity and denote the normalized intensities by u^1, u^2, ..., from high to low. Thus for each fragmentation k we construct a score function via the following quadratic regression model:

f(k) = sum_i alpha_{1,i} y^i + sum_i beta_{1,i} b^i + sum_i gamma_{1,i} u^i + sum_{i,j} alpha_{2,ij} y^i y^j + sum_{i,j} beta_{2,ij} b^i b^j + sum_{i,j} gamma_{2,ij} y^i b^j   (1)

where the alpha's, beta's, and gamma's are the regression coefficients, which are nonnegative and satisfy the constraint

sum_i alpha_{1,i} + sum_i beta_{1,i} + sum_i gamma_{1,i} + sum_{i,j} alpha_{2,ij} + sum_{i,j} beta_{2,ij} + sum_{i,j} gamma_{2,ij} = 1   (2)

The last three terms are the quadratic regression part, which represents the dependencies between different ion types. If necessary, the model also allows adding triple regression terms.


Table 1. Fragment ions in low energy CID and notations. m_c = M - m + 2, where M is the precursor ion mass.

Fragments with N-terminus:
  Fragment type | Mass      | Notation
  b             | m         | b^0 or b
  b2+           | (m+1)/2   | b^2
  b3+           | (m+2)/3   | b^3
  b-NH3         | m - 17    | b^17
  b-H2O         | m - 18    | b^18
  b-2NH3        | m - 34    | b^34
  b-NH3-H2O     | m - 35    | b^35
  b-2H2O        | m - 36    | b^36
  a (b-CO)      | m - 28    | b^28

Fragments with C-terminus:
  Fragment type | Mass        | Notation
  y             | m_c         | y^0 or y
  y2+           | (m_c+1)/2   | y^2
  y3+           | (m_c+2)/3   | y^3
  y-NH3         | m_c - 17    | y^17
  y-H2O         | m_c - 18    | y^18
  y-2NH3        | m_c - 34    | y^34
  y-NH3-H2O     | m_c - 35    | y^35
  y-2H2O        | m_c - 36    | y^36

Table 2. Statistical probabilities of fragment ions in the ion trap data set.

  Derivative ion b^i           | b^2  | b^3  | b^17 | b^18 | b^34 | b^35 | b^36 | b^28
  P(b^i observed)              | 0.23 | 0.10 | 0.45 | 0.45 | 0.23 | 0.26 | 0.23 | 0.29
  P(b^i observed | b observed) | 0.24 | 0.11 | 0.58 | 0.59 | 0.29 | 0.33 | 0.34 | 0.46

  Derivative ion y^i           | y^2  | y^3  | y^17 | y^18 | y^34 | y^35 | y^36
  P(y^i observed)              | 0.32 | 0.12 | 0.36 | 0.37 | 0.22 | 0.24 | 0.20
  P(y^i observed | y observed) | 0.32 | 0.11 | 0.47 | 0.48 | 0.28 | 0.30 | 0.25

  P(b observed) = 0.68; P(b observed | y observed) = 0.78.

In practice, we do not need to consider all combinations of all fragment ions. The regression model can be simplified according to the statistical characterization. In low energy CID, it is known that b and y-ions are the dominant ion types. Moreover, for tryptic peptides, y-ions in general have stronger intensities than b-ions, and the derivatives of fragment ions strongly depend on the fragment ions. Taking this into account, we simplify the above model as follows:

f(k) = sum_{i=0,2,3} alpha_{1,i} y^i + sum_{i=0,2,3} beta_{1,i} b^i + sum_{i<=5} gamma_{1,i} u^i + y sum_{i!=0} alpha_{2,i} y^i + b sum_{i!=0} beta_{2,i} b^i + y sum_{i!=0} gamma_{2,i} b^i   (3)

In this simplified model, the neutral loss of water or ammonia is not considered in the linear regression terms, and only the top five internal ions are used for each fragmentation. We also ignore the relationships between the derivative ions because their effects are too weak. For clarity, we rewrite the above scoring model as

f(k) = x_k^T w   (4)

where x_k = [y^i, b^i, u^i, y y^i, b b^i, y b^i]^T is a column vector associated with the fragmentation between a_k and a_{k+1}, w = [alpha_{1,i}, beta_{1,i}, gamma_{1,i}, alpha_{2,i}, beta_{2,i}, gamma_{2,i}]^T is a column vector of the regression coefficients, and the superscript T stands for the transpose of a vector. Notice that because each i can take several different values, both x_k and w are 34-dimensional vectors.

Let N_k be the number of unobserved b and y ions associated with the fragmentation k; N_k can be 0, 1, or 2. Introducing a penalty for the unobserved b and y ions, we further modify the scoring model as

f'(k) = f(k) - mu N_k = x_k^T w - mu N_k   (5)

where 0 <= mu <= 1 is a penalty coefficient. For a peptide P of n amino acids, the score of the spectrum S matched by the peptide P is calculated by

score(S, P) = sum_{k=1}^{n-1} f'(k) = sum_{k=1}^{n-1} x_k^T w - mu sum_{k=1}^{n-1} N_k   (6)
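Formulas (4)-(6) amount to a dot product plus a penalty. A minimal Python sketch (the array names are ours, and the feature matrix is assumed to be prepared by the caller):

    import numpy as np

    def peptide_score(X, N, w, mu):
        # X: (n-1) x d matrix whose k-th row is the feature vector x_k of (4);
        # N: length-(n-1) counts of unobserved b/y ions per fragmentation site;
        # w: nonnegative regression coefficients; mu: penalty in [0, 1].
        X, N, w = np.asarray(X), np.asarray(N), np.asarray(w)
        return float((X @ w).sum() - mu * N.sum())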

We train the regression coefficients by linear programming. Suppose we have K mass spectra as the training dataset. For each spectrum S_k, there is one positive peptide P_k and L negative peptides P_kl, l = 1, ..., L. The linear programming formulation is


given as

max sum_{k=1}^{K} e_k   subject to   (7)
score(S_k, P_k) - score(S_k, P_kl) >= e_k,
0 <= w_i <= 1,
0 <= mu <= 1,
e_k <= c

This formulation is very similar to the linear programming used in Ref. 30. However, Ref. 30 concerned the "database search" approach of MS/MS peptide identification, where all the negative peptides are usually very different from the positive peptide. We are concerned with the de novo sequencing approach, where the negative peptides often differ from the positive peptide by only a few amino acids. As a result, the selected ion types and dependencies in the regression model are very different in the two approaches. Furthermore, we use normalized intensities, which depend on the ranking of a peak rather than its actual signal intensity. This is a novel approach, and it produces better accuracy. Before this work, it was not known whether a linear programming formulation similar to that in Ref. 30 could improve de novo sequencing results.
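Formulation (7) maps directly onto a standard LP solver. A Python sketch using scipy.optimize.linprog, under the assumption that the per-spectrum feature differences have already been computed (the names and data layout are ours, not the authors' implementation):

    import numpy as np
    from scipy.optimize import linprog

    def train_weights(dX, dN, c=1.0):
        # dX[k, l]: summed feature vector of the positive peptide minus that of
        # negative peptide l for spectrum k, shape (K, L, d); dN[k, l]: the
        # analogous difference of unobserved-b/y counts, shape (K, L).
        K, L, d = dX.shape
        nvar = d + 1 + K                  # variables: w (d), mu, e_1..e_K
        obj = np.zeros(nvar)
        obj[d + 1:] = -1.0                # linprog minimizes, so negate sum e_k
        A_ub, b_ub = [], []
        for k in range(K):
            for l in range(L):
                # score(S_k,P_k) - score(S_k,P_kl) >= e_k rewritten as
                # -dX.w + dN*mu + e_k <= 0
                row = np.zeros(nvar)
                row[:d] = -dX[k, l]
                row[d] = dN[k, l]
                row[d + 1 + k] = 1.0
                A_ub.append(row)
                b_ub.append(0.0)
        bounds = [(0, 1)] * d + [(0, 1)] + [(None, c)] * K
        res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub,
                      bounds=bounds, method="highs")
        return res.x[:d], res.x[d]        # learned w and mu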

The current PEAKS program first computes a y-ion matching score and a b-ion matching score at each mass value according to the peaks around it, and then efficiently computes thousands of amino acid sequences that maximize the total scores at the mass values of b-ions and y-ions 15. These candidate sequences are then re-evaluated by a refined scoring function and the top scoring sequence is output. Here, we add another step to further use the regression model based scoring function to re-evaluate the 100 top-scoring sequences computed by PEAKS. We refer to this modified approach as PEAKS-RM.

4. EXPERIMENTS

In this section, we give experimental results to show that the regression model based scoring method can significantly improve the de novo sequencing accuracy over two existing high performance de novo programs: PEAKS and PepNovo.

The performance is measured using the ratio between the number of correctly predicted amino acids and the total length of the peptides. This ratio is referred to as the identification accuracy. Two types of the ratio are considered, defined as follows:

Type I accuracy = (number of correctly predicted amino acids) / (number of amino acids in the real peptides)

Type II accuracy = (number of correctly predicted amino acids) / (number of amino acids in the prediction)

An amino acid is correctly predicted if it appears at the same mass position in both the predicted and the real peptides. Given a test data set, the total length of the real peptides is fixed; therefore, Type I accuracy depends only on the number of correctly predicted amino acids. However, because PepNovo only outputs partial sequences for some peptides, the number of predicted amino acids may be significantly less than the total number of amino acids in the real peptides, so Type II accuracy may be very different from Type I accuracy. We note that software can increase Type II accuracy by omitting the amino acids that do not appear in the MS/MS spectra and outputting only the amino acids that are easy to determine.
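A Python sketch of the two accuracy measures (the residue mass table and the pairing of real and predicted peptides are assumed inputs; the helper is ours):

    def accuracies(pairs, mass):
        # pairs: list of (real_peptide, predicted_peptide) strings;
        # mass: residue -> mass table. An amino acid counts as correct when it
        # occurs at the same prefix-mass offset in both sequences.
        def offsets(p):
            out, m = {}, 0
            for a in p:
                out[m] = a
                m += mass[a]
            return out
        correct = real_total = pred_total = 0
        for real, pred in pairs:
            ro, po = offsets(real), offsets(pred)
            correct += sum(1 for m, a in po.items() if ro.get(m) == a)
            real_total += len(real)
            pred_total += len(pred)
        return correct / real_total, correct / pred_total   # Type I, Type II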

All the data sets used in our experiments are ion trap MS/MS data. The mass error in the ion trap data is around 0.5 dalton. Therefore, we do not make a distinction between the amino acids leucine and isoleucine (which have identical mass) and between lysine and glutamine (which have a small difference of 0.04 dalton in their masses).

Three ion trap datasets are used in the experiment. The training dataset contains 168 positive MS/MS spectra obtained from the first LC/MS/MS runs on "mixture A" as described in Ref. 34; the peptide sequences of these 168 spectra were all identified in Ref. 34. The two datasets used for testing are denoted dataset 1 and dataset 2, respectively. Dataset 1 has 400 spectra provided by the authors of Ref. 17; 280 of the 400 spectra were used to compare PepNovo and other software in Ref. 17. Dataset 2 has 144 LCQ spectra. The three datasets were obtained in different labs with different protein mixtures.

Experimental results are given in Tables 3, 4 and 5. Table 3 shows the results for dataset 1. All the spectra in this dataset are doubly charged. Table 4


shows the results for dataset 2. This dataset contains singly, doubly, and triply charged spectra. Because PepNovo's parameters were only trained for doubly charged spectra17, we also list the results for the doubly charged spectra of dataset 2 in Table 5. Comparison with the PEAKS and PepNovo algorithms makes clear that our regression model based scoring function can significantly improve the de novo sequencing accuracy.

Table 3. Accuracies of PEAKS-RM, PEAKS, and PepNovo for dataset 1. (The average length of real peptides is 10.55.)

  Algorithm | Type I | Type II | Average length
  PEAKS-RM  | 0.708  | 0.701   | 10.66
  PEAKS     | 0.655  | 0.665   | 10.38
  PepNovo   | 0.652  | 0.697   | 9.87

Table 4. Accuracies of PEAKS-RM, PEAKS, and PepNovo for dataset 2. (The average length of real peptides is 11.82.)

  Algorithm | Type I | Type II | Average length
  PEAKS-RM  | 0.639  | 0.638   | 11.83
  PEAKS     | 0.623  | 0.638   | 11.54
  PepNovo   | 0.518  | 0.547   | 11.19

Table 5. Accuracies of PEAKS-RM, PEAKS, and PepNovo for only the doubly charged spectra of dataset 2. (The average length of real peptides is 12.085.)

  Algorithm | Type I | Type II | Average length
  PEAKS-RM  | 0.666  | 0.663   | 12.14
  PEAKS     | 0.655  | 0.667   | 11.86
  PepNovo   | 0.567  | 0.602   | 11.40

5. CONCLUSION AND DISCUSSION

This paper first proved that de novo sequencing with internal fragment ions is NP-complete, which explains why all existing polynomial time de novo sequencing algorithms could not use internal fragment ions. The paper then studied the statistical correlations between different ion types in ion trap MS/MS spectra and proposed a regression model based scoring function for de novo sequencing that incorporates the correlations between fragment ion types. The experimental results showed that the regression model is a very effective scoring method for peptide de novo sequencing.

The authors also compared the regression models with and without internal fragment ions using our datasets. In the regression model without internal fragment ions, the coefficients were re-trained using the training data. The results showed that the inclusion of internal fragment ions improved the accuracy considerably in dataset 1 but only very slightly in datasets 2 and 3. The improvement mostly happens when some y and b ions are missing for a peptide, and the internal fragment ions can then help to deduce the missing information. The experiments do not prove or disprove that the consideration of internal fragment ions will significantly improve peptide identification accuracy. This is because (a) the training and testing data were selected by currently available software that does not utilize internal fragment ions; and (b) as illustrated by the NP-hardness result, there is no efficient algorithm (unless P=NP) to find the optimal solution with internal fragment ions, and the regression model is only a heuristic method. Consequently, the detailed comparison is omitted, and the results in Section 3 should be regarded purely as a regression model rather than as a study of internal fragment ions.

Acknowledgment

This research was undertaken, in part, thanks to funding from NSERC, PREA, and the Canada Research Chairs Program. The authors thank Dr. Kaizhong Zhang and Dr. Gilles Lajoie for discussions. The authors also thank the authors of Ref. 34 for providing the training dataset; Dr. Pavel Pevzner and Ari Frank for providing dataset 1 and the PepNovo program; and Dr. Richard Johnson for providing dataset 2.

References

1. Snyder, A.P. Interpreting Protein Mass Spectra: A Comprehensive Resource. Oxford University Press, 2000.
2. Aebersold, R. and Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198-207.
3. Johnson, R.S. et al. Informatics for protein identification by mass spectrometry. Methods 2005, 35, 223-236.
4. Shadforth, I. et al. Protein and peptide identification algorithms using MS for use in high-throughput, automated pipelines. Proteomics 2005, 5, 4082-4095.
5. Zhang, N. et al. ProbIDtree: An automated software program capable of identifying multiple peptides from a single collision-induced dissociation spectrum collected by a tandem mass spectrometer. Proteomics 2005, 5, 4096-4106.
6. Xu, C. and Ma, B. Software for computational peptide identification from MS-MS data. Drug Discovery Today, July 2006.
7. Eng, J.K. et al. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Amer. Soc. Mass Spectrom. 1994, 5, 976-989.
8. Perkins, D.N. et al. Probability-based protein identification by searching sequence database using mass spectrometry data. Electrophoresis 1999, 20, 3551-3567.
9. Craig, R. and Beavis, R.C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.
10. Geer, L.Y. et al. Open Mass Spectrometry Search Algorithm. J. Proteome Research 2004, 3, 958-964.
11. Taylor, J.A. and Johnson, R.S. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 1997, 11, 1067-1075.
12. Taylor, J.A. and Johnson, R.S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 2001, 73, 2594-2604.
13. Dancik, V. et al. De novo peptide sequencing via tandem mass spectrometry. J. Comp. Biology 1999, 6, 327-342.
14. Chen, T. et al. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comp. Biology 2001, 8, 325-337.
15. Ma, B. et al. PEAKS: powerful software for peptide de novo sequencing by MS/MS. Rapid Commun. Mass Spectrom. 2003, 17, 2337-2342.
16. Ma, B. et al. An effective algorithm for peptide de novo sequencing from MS/MS spectra. Journal of Computer and System Sciences 2005, 70, 418-430.
17. Frank, A. and Pevzner, P. PepNovo: De novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005, 77, 964-973.
18. Fischer, B. et al. NovoHMM: A Hidden Markov Model for de novo peptide sequencing. Anal. Chem. 2005, 77, 7265-7273.
19. Tabb, D. et al. GutenTag: High-throughput sequence tagging via an empirically derived fragmentation model. Anal. Chem. 2003, 75, 6415-6421.
20. Searle, B.C. et al. High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. Anal. Chem. 2004, 76, 2220-2230.
21. Han, Y. et al. SPIDER: Software for protein identification from sequence tags with de novo sequencing error. Journal of Bioinformatics and Computational Biology 2005, 3, 697-716.
22. Halligan, B.D. et al. DeNovoID: a web-based tool for identifying peptides from sequence and mass tags deduced from de novo peptide sequencing by mass spectroscopy. Nucleic Acids Research 2005, 33, 376-381.
23. Searle, B.C. Improving sensitivity by combining results from multiple search methodologies. Workshop on Computational Proteomics and Mass Spectrometry 2004, Ohio State University, Ohio.
24. Rogers, I. Assessment of an amalgamative approach to protein identification. ASMS Conference on Mass Spectrometry 2005, San Antonio, Texas.
25. Fu, Y. et al. Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry. Bioinformatics 2004, 20, 1948-1954.
26. Bafna, V. et al. A probabilistic model for scoring tandem mass spectra against a peptide database. Bioinformatics 2001, 17, S13-S21.
27. Havilio, M. et al. Intensity-based statistical scorer for tandem mass spectrometry. Anal. Chem. 2003, 75, 435-444.
28. Colinge, J. et al. OLAV: Towards high-throughput tandem mass spectrometry data identification. Proteomics 2003, 3, 1454-1463.
29. Elias, J.E. et al. Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 2004, 22, 214-219.
30. Liu, J., Ma, B., and Li, M. PRIMA: Peptide robust identification from MS/MS spectra. APBC 2005, 181-190.
31. Pevzner, P.A. and Waterman, M.S. Open combinatorial problems in computational molecular biology. Proceedings of the 3rd Israel Symposium on Theory of Computing and Systems 1995, 158-173.
32. Pevzner, P.A. Computational Molecular Biology: An Algorithmic Approach. MIT Press, 2000.
33. Papadimitriou, C. and Yannakakis, M. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences 1991, 43, 425-440.
34. Keller, A. et al. Experimental protein mixture for validating tandem mass spectra analysis. OMICS 2002, 6(2), 207-212.


EXPECTATION-MAXIMIZATION METHOD FOR RECONSTRUCTING TUMOR PHYLOGENIES FROM SINGLE-CELL DATA

G. Pennington

Computer Science Department, Carnegie Mellon University

Pittsburgh, PA 15213, USA

C. A. Smith and S. Shackney

Allegheny Singer Research Institute, Allegheny General Hospital

Pittsburgh, PA 15212, USA

R. Schwartz*

Department of Biological Sciences, Carnegie Mellon University

Pittsburgh, PA 15213, USA Email: russells@andrew.cmu.edu

Recent studies of gene expression in cancerous tumors have revealed that cancers presenting indistinguishable symptoms in the clinic can represent substantially different entities at the molecular level. The ability to distinguish between these different cancers makes possible more accurate prognoses and more finely targeted therapeutics. Making full use of this knowledge, however, requires characterizing commonly occurring cancer sub-types and the specific molecular abnormalities that produce them. Computational approaches to this problem to date have been hindered by the fact that tumors are highly heterogeneous masses typically containing cells at multiple stages of progression from healthy to aggressively malignant. We present a computational approach for taking advantage of tumor heterogeneity when characterizing tumor progression pathways by inferring those pathways from single-cell assays. Our approach uses phylogenetic algorithms to infer likely evolutionary sequences producing cell populations in single tumors, which are in turn used to create a profile of commonly used pathways across the patient population. This approach is combined with expectation maximization to infer unknown parameters used in the phylogeny construction. We demonstrate the approach on a set of fluorescent in situ hybridization (FISH) data measuring cell-by-cell gene and chromosome copy numbers in a large sample of breast cancers. The results validate the proposed computational methods by showing consistency with several previous findings on these cancers. They also provide novel insights into the mechanisms of tumor progression in these patients.

1. INTRODUCTION

Computational studies have led to substantial revisions in thinking about how to treat and diagnose cancers. Although all cancers are characterized by a general pattern of uncontrolled cell growth, it has long been recognized that they represent many different diseases at the molecular level. Numerous different combinations of genetic abnormalities could potentially disrupt the controls on cell growth and produce essentially the same gross phenotypes. Classic chemotherapies for treating cancers thus typically target the phenotype of frequent cell division rather than any specific genetic state distinguishing cancerous from healthy cells, leading to treatments that are broadly but not consistently effective and that carry serious side-effects.

The application of computational clustering methods to gene expression microarrays5 has recently shown that most tumors can be grouped into one of a few common "cancer sub-types,"6, 11, 14 each characterized by similar molecular abnormalities and potentially treatable by common "targeted therapeutics" addressing those specific abnormalities. Subtype identification has proven useful in predicting patient outcomes22, 26, 25, 23 and in selecting appropriate treatment regimens.3, 1 The most notable success of this new approach to targeted therapeutics is the drug trastuzumab (Herceptin), an antibody to the Her-2/neu gene product that is specifically effective in a subset of breast cancers characterized by amplification of the Her-2/neu gene.10

The recognition of cancer sub-types was a significant

* Corresponding author


advance, but it is also a simplification. A cancer sub-type characterizes a general progression pathway or set of related pathways by which successively accumulating mutations transform once-healthy cells into increasingly aggressive tumor cells. However, any given patient may have advanced to a greater or lesser degree along this pathway,9, 19 and the degree of progression is itself a significant predictor of prognosis.13 It is thus valuable to understand not just what changes distinguish advanced cancer cells on a particular pathway from those on a different pathway, but also the particular sequence of events by which those changes accumulate on any given pathway. Desper et al.4 showed that it is possible to identify relationships among different tumors by constructing phylogenies, or evolutionary trees, using microarray gene expression data and a distance metric similar to those used in the prior clustering approaches. However, this approach oversimplifies in some ways because tumors are not homogeneous masses. As cells in a tissue progress along a given pathway through the accumulation of successive mutations, the earlier states do not die out, but rather leave remnant populations in the tumor. Figure 1 illustrates this process. The existence of multiple progression states within a single tumor can be expected to confound microarray-based approaches, which can only measure tissue-wide average expression levels. Cancer prognosis has indeed been shown to be affected by changes apparent in single cells, but not from such tissue-wide measurements.20

Our contributions: We present a new method that treats tumor heterogeneity as an asset rather than an obstacle to the inference of progression pathways by using single-cell measurements to infer progression pathways within and between patients. We develop an algorithm for inferring likely evolutionary trees across cells by combining phylogenetic methods with an expectation-maximization framework for learning model parameters. We then use trees inferred patient-by-patient to identify specific sequences of molecular changes that commonly underlie a particular tumor type. We apply our technique to a large set of single-cell fluorescent in situ hybridization (FISH) measurements from breast cancers in which copy numbers are assessed for the Her-2/neu oncogene, the p53 tumor suppressor gene, and

chromosome 17, on which both genes are found.7 The results validate our approach by recapitulating several previously observed features of the roles of these genes in breast cancers. They further provide new insights into the nature of common progression pathways in these cancers, with implications for the optimal diagnosis and treatment of cancer patients.

Fig. 1. Illustration of cancer progression resulting in tumor heterogeneity. (a) A healthy mass of cells labeled H. (b) A cell mutates into a diseased state D_1, which encourages proliferation and further progression. (c) The proliferating cell expands, leaving a heterogeneous population. (d) A D_1 cell reaches a further progression state D_2, increasing potential for proliferation. (e) Both populations continue to expand. (f) The D_2 population becomes dominant, and an additional mutation results in a new disease state, D_3.

2. METHODS

Our method uses expectation maximization (EM) to learn several unknown parameters in a model of cell progression, applies an algorithm for the minimum cost arborescence problem to construct per-patient phylogenies consistent with the model, and then identifies commonly used pathways across patients. The remainder of this section defines the input data and phylogeny model and explains each step of the overall inference process. All algorithms described below were implemented using the functional programming language Objective Caml.

2.1. Input Data

Although our high-level approach is intended to apply to any form of cell-by-cell assay, we assume below an input format based on the FISH copy number data used in our validation experiments. These data count copy numbers of a single gene and a single


chromosome in individual cells. Each patient can thus be represented as an N by N two-dimensional array M, where N is some maximum observed count. For the present work, N is 10 and any counts above 10 are collectively grouped into a single row or column of M representing the count "greater than 10." Element m_ij of M is then the fraction of cells of a given sample that have i copies of the chromosome and j copies of the gene, which we call state (i, j). The FISH data are produced by manually counting fluorescent probes on labeled cell microscopy images, which can produce false counts if two probes are too close together or a single indistinct probe signal is incorrectly viewed as two distinct probes. We apply a preprocessing step to the input prior to our algorithm to reduce this noise. We assume that up to ten percent of cells from an observed state may have been misclassified and thus screen out from each patient's data any state whose observed frequency is less than 10% of the sum of the frequencies of its neighbors.
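A Python sketch of this screening step; the exact neighborhood is not spelled out in the text, so the sketch assumes the four states adjacent in copy number (our implementation, not the authors' Objective Caml code):

    import numpy as np

    def screen_noise(M, frac=0.10):
        # M: N x N array; M[i, j] is the fraction of cells in state (i, j).
        # Zero out any state whose frequency is below frac of the summed
        # frequencies of its neighbouring states (4-neighbourhood assumed).
        out = M.copy()
        N = M.shape[0]
        for i in range(N):
            for j in range(N):
                nb = sum(M[a, b]
                         for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < N and 0 <= b < N)
                if M[i, j] < frac * nb:
                    out[i, j] = 0.0
        return out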

2.2. Probability Model

We select phylogenies using a likelihood model that assumes cell states evolve from one another through four possible known molecular mechanisms for tumorigenesis: gene gain, gene loss, chromosome duplication, and chromosome loss. In gene gain, a cell gives rise to a new state with one extra copy of the gene. Gene loss produces a new state with one fewer gene copy. Chromosome duplication, modeling incomplete mitosis, doubles the complement of genes and chromosomes in a cell. Chromosome loss results in the loss of a single chromosome; as it is not clear how many gene copies might lie on the lost chromosome, we allow any number of gene copies to be lost simultaneously with the chromosome. Each of these operations is assumed to have some prior probability: p_{g+} for gene gain, p_{g-} for gene loss, p_{c+} for chromosome duplication, and p_{c-} for chromosome loss. We call the vector of these four prior probabilities theta. The prior probability of a full tree, Pr{T|theta}, is then defined to be the product of the prior probabilities of its edges.

Our model further defines the probability of the data given a tree, Pr{M|T}, to be the product over all non-root nodes u of the frequency of u's parent node, where the root is always defined to be the (2,2) state. This model is meant to capture the intuition that a node is more likely to have descended from a well-populated state than from a sparsely populated state. Thus, we can define the full probability we seek to maximize, Pr{M|T} Pr{T|theta}, to be the product over edges e = (u, v) in T of f_u p_e, where f_u is the frequency of node u and p_e is the prior probability of the edge type of edge e given by theta. Figure 2 illustrates the definition, showing two possible trees for a given set of nodes and describing how the probability is derived for each tree. The goal of our computational methods is to find the theta maximizing Pr{M|theta} over the full distribution of trees that might have produced M and, given this theta, find the tree T maximizing Pr{M, T|theta} = Pr{M|T} Pr{T|theta}.

Fig. 2. Example illustrating the probability model for trees. Three cell states, (2,2), (2,1), and (1,1), with frequencies f_{2,2}, f_{2,1}, and f_{1,1}, can be joined by two possible trees. (a) States (1,1) and (2,1) each descend directly from (2,2), by a chromosome loss and a gene loss respectively. The probability Pr{M|T} Pr{T|theta} has a contribution of f_{2,2} p_{c-} from the chromosome loss and f_{2,2} p_{g-} from the gene loss. (b) State (1,1) is descended from (2,1) by a chromosome loss and (2,1) from (2,2) by a gene loss. Pr{M|T} Pr{T|theta} has a contribution of f_{2,2} p_{g-} from the gene loss as in (a) and a contribution of f_{2,1} p_{c-} from the chromosome loss.
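The tree probability is a simple product over edges; a minimal Python sketch (log-space to avoid underflow; the edge list, state frequencies, and priors are assumed inputs, and the helper is ours rather than the authors' implementation):

    import math

    def tree_log_prob(edges, freq, prior):
        # edges: list of (u, v, event) triples forming the tree, where u is the
        # parent state and event is one of 'g+', 'g-', 'c+', 'c-';
        # freq[u]: observed frequency of state u (assumed > 0);
        # prior[event]: the corresponding entry of theta.
        # Returns log( Pr{M|T} Pr{T|theta} ) = sum over edges of log(f_u * p_e).
        return sum(math.log(freq[u] * prior[event]) for u, _, event in edges)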

2.3. Optimal Tree Inference

Given the input M and a current set of parameters 9, we construct a directed graph G = (V,E), where V is the set of observed states, and E is the set of all possible single mutation events that connect two states in V. If there exist states in G that are not reachable from the root, we add Steiner nodes to G using a heuristic method presented as Algorithm 2.1.

Once we have ensured that every node of G is reachable, we add a weight function w(v, u) = f_v p_{(v,u)} to G and compute an optimal phylogenetic tree for the given patient using a classic algorithm for finding minimum weight arborescences (directed


minimum spanning trees) due to Chu and Liu.2 Chu and Liu's algorithm is similar to Prim's greedy algorithm for undirected spanning trees12 but with some additional complications to handle directed cycles. We specifically define the (2,2) state to be the root of the tree, which is part of the input to the arborescence algorithm. This algorithm is used after parameter inference to find the best-fit trees and is also used as a subroutine of the parameter inference method to initialize the Markov chain sampling for each patient on each EM round. A summary of our method for single-patient phylogeny inference is provided as Algorithm 2.2.

Algorithm 2.1 Heuristic algorithm for Steiner node inference

1: Given G = (V, E). Let G' = (V', E') be a directed graph containing all possible states and edges, and let R, a subset of V, be the set of vertices of G reachable from the root.
2: while R != V do
3:   Perform a breadth-first traversal of G' starting from R, stopping when we encounter a vertex v in V - R. Let k be the distance from R to v (the length of the path found by BFS).
4:   Consider all nodes in V - R at distance k from R, and let v* be the one from which we can reach the most nodes in V - R (the largest island).
5:   Solve the multiple source shortest path problem from R to v* in G', where the weight of an edge e in E' is equal to -log p_e (minus the log of its probability).
6:   Add the nodes and edges on the shortest path from R to v* to G.
7: end while

2.4. Parameter Inference

We estimate the parameter set 6 by EM. We treat the tree topology T as a set of latent variables corresponding to the presence or absence of each tree edge. In the expectation phase, we find the expectation of each of these latent variables by enumerating over possible trees T consistent with the output, weighted by the conditional probability Pr{M|T}Pr{T|6>} = Pr{M,T|0}. This expectation is evaluated by a Markov chain Monte Carlo method,

in which states correspond to the possible trees and their stationary distributions are set to be proportional to Pr{M|T}Pr{T|θ}. The frequency of occurrence of each possible tree edge in the Markov chain thus gives the expected value of the latent variable corresponding to that edge.

Algorithm 2.2 Procedure for tree inference from a matrix of cell counts S

1: Convert the FISH matrix for an individual patient into a graph G.
2: Add all edges to G allowed by the connectivity model of Section 2.2, each weighted as minus the log of its probability.
3: Apply Algorithm 2.1 to add Steiner nodes until G is connected.
4: Find a minimum-cost arborescence on G by the method of Chu and Liu.2
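
For illustration, step 4 can be served by networkx's Edmonds implementation, which computes the same minimum-weight arborescence as the Chu-Liu algorithm. The sketch below is our assumption of how the pieces fit together, not the authors' code; forbidding in-edges at (2,2) forces it to head the arborescence.

import math
import networkx as nx

def best_tree(states, edges, freq, theta, root=(2, 2)):
    """Sketch: minimum-cost arborescence with edge weight -log(f_u * p_e).
    edges: iterable of (u, v, edge_type); freq: state -> f_u;
    theta: edge_type -> prior probability."""
    G = nx.DiGraph()
    G.add_nodes_from(states)
    for u, v, etype in edges:
        if v == root:
            continue  # no in-edges at the root, so it heads the arborescence
        G.add_edge(u, v, weight=-math.log(freq[u] * theta[etype]))
    # Chu-Liu/Edmonds via networkx; raises if no spanning arborescence exists.
    return nx.minimum_spanning_arborescence(G)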

In the maximization phase, these edge expectations are used to determine maximum-likelihood estimates of the parameters for the next EM round. This estimation is accomplished by counting the expected occurrences of each of the four edge types, summed over all potential tree edges of that type.

We initialize the method by assuming each of the four parameters is 0.25. We then construct an initial tree for each patient by running Algorithm 2.2 to provide a starting state for the Monte Carlo iteration. We then perform successive Monte Carlo steps as follows:

(1) Pick a node u from the tree uniformly at random from all nodes other than the root.

(2) For each possible parent v of u, excluding current descendants of u, compute an edge weight w(v,u) = f_v p_{(v,u)}, where f_v is v's node frequency and p_{(v,u)} is the prior probability of the edge type from v to u. Note that v might be u's current parent.

(3) Pick some v among all possible parents with probability p_v = w(v,u) / Σ_x w(x,u).

(4) Delete the edge from u's current parent to u and replace it with an edge from v to u.

Repeatedly applying this move creates a Markov model we call H.
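
A minimal sketch of one move of H follows, assuming the tree is stored as a child-to-parent map (`parent`) and using hypothetical helpers `freq` (node frequencies) and `edge_prior(v, u)` (the prior probability of the edge type from v to u; the caller is assumed to guarantee at least one candidate with positive weight).

import random

def mcmc_step(parent, nodes, root, freq, edge_prior):
    # Step 1: pick a non-root node uniformly at random.
    u = random.choice([n for n in nodes if n != root])
    # Exclude u and its current descendants as candidate parents.
    desc = {u}
    grew = True
    while grew:
        grew = False
        for n, p in parent.items():
            if p in desc and n not in desc:
                desc.add(n)
                grew = True
    # Steps 2-3: weight each remaining candidate by w(v,u) = f_v * p_(v,u).
    candidates = [v for v in nodes if v not in desc]
    weights = [freq[v] * edge_prior(v, u) for v in candidates]
    # Step 4: replace u's old parent edge with the sampled edge (v, u).
    parent[u] = random.choices(candidates, weights=weights, k=1)[0]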

Note that this move set cannot produce a cycle


in the graph because, when selecting a new parent v of u, the move set specifically prohibits the selection of any v that is a current descendant of u. This guarantees v is not reachable by any directed path from u. The newly added edge (v, u) thus cannot create any directed cycle.

We can further show that H samples among all possible trees according to their relative probabilities as defined in our probability model. This result is established by the following theorem:

Theorem 2.1. For any two trees S and T and any input data set M, the ratio of the stationary probabilities π_S/π_T in H will be equal to (Pr{M|S} Pr{S|θ}) / (Pr{M|T} Pr{T|θ}).

Proof. Each non-root node has exactly one parent, so a tree T is completely defined by its list of parent assignments p_T : V → V, where p_T(v) is the parent of v in T. H is ergodic, since we can reach any tree T from any tree S by reassigning parents in S to match those in T in breadth-first order of the nodes in T. Any cycle from tree T returning to itself will contain some (possibly empty) sequence of changes to the parent of a node v: p_T(v), u_1, u_2, ..., u_k, p_T(v). These changes will contribute a factor of w(u_1,v) w(u_2,v) ⋯ w(u_k,v) w(p_T(v),v) / W^{k+1} for some W to the probability of traversing the cycle. The probability of traversing the cycle in the opposite direction is w(u_k,v) w(u_{k-1},v) ⋯ w(u_1,v) w(p_T(v),v) / W^{k+1}, i.e., the same value. Counting contributions for all v ∈ V, this establishes by the Kolmogorov criterion that H converges on a unique stationary distribution obeying detailed balance. It then suffices to show that for any two neighboring trees T_1 and T_2, the ratio of their transition probabilities P_{T_1→T_2}/P_{T_2→T_1} is equal to (Pr{M|T_2} Pr{T_2|θ}) / (Pr{M|T_1} Pr{T_1|θ}). The fact that they are neighbors means that they differ by a single parent assignment, u_1 versus u_2, of a node v.

P_{T_1→T_2} / P_{T_2→T_1} = (w(u_2,v)/W) / (w(u_1,v)/W)
                          = w(u_2,v) / w(u_1,v)
                          = (Pr{M|T_2} Pr{T_2|θ}) / (Pr{M|T_1} Pr{T_1|θ}).  □

In order to establish that the Markov chain is adequately sampling states, we also need to show it is rapidly mixing. If we define P_S(t,T) to be the probability of encountering tree T at step t from starting tree S, then we can do this by showing there is some t_0 polynomial in n for which we have a small variation distance between P_S(t_0,T) and the stationary distribution Π, where we follow Jerrum and Sinclair8 in defining variation distance as Δ_t = (1/2) Σ_T |P_S(t,T) − π_T|. We establish this by the following theorem:

Theorem 2.2. The Markov chain H initialized with some state S reaches variation distance ε from Π in time O(nφ(ln ε^{-1} + ln π_S^{-1})), where n is the number of distinct cell states and φ is the maximum ratio of any two probabilities from θ = {p_{c-}, p_{c+}, p_{g-}, p_{g+}}.

Proof. We can prove rapid mixing using the canonical path method,21 in which we define a path γ_{U,V} between any two states U and V. Space does not permit a detailed tutorial on the method, so we provide only the details specific to our problem here and refer the interested reader to Jerrum and Sinclair8 for an excellent tutorial on the method. We establish a canonical path between any tree S and tree T in which we convert parents of nodes in S to their parents in T according to the breadth-first order of those nodes in T. Suppose we examine a step on the canonical path from S to T transitioning from some S* to T*, in which we change the parent of some node v from u_S to u_T. Then the other canonical paths using that transition will be those between any S′ and T′ for which S′ and T′ have the same parents as T for nodes before v in breadth-first order and the same parents as S for nodes after v, with p_{S′}(v) = p_S(v) and p_{T′}(v) = p_T(v). The canonical path method depends on bounding a quantity called the edge loading, defined for a transition e = (S*,T*) as (π_{T*} P_{T*,S*})^{-1} Σ_{S′,T′ s.t. e ∈ γ_{S′,T′}} π_{S′} π_{T′} |γ_{S′,T′}|. We have Σ_{S′,T′} π_{S′} π_{T′} ≤ π_{T*} p_{S*,T*} and |γ_{S′,T′}| ≤ n, so the edge loading for H is bounded by (π_{T*} P_{T*,S*})^{-1} (n π_{T*} p_{S*,T*}), which is itself bounded by nφ. This establishes the mixing time bound of nφ(ln ε^{-1} + ln π_S^{-1}).  □

To ensure adequate mixing, we apply the Monte Carlo move 100n³ times per patient, counting edge types every 10n² moves. The fraction of edges assigned to each type provides a maximum likelihood


estimate of that edge type's probability for the next EM round. We repeat the above steps until all parameters converge with an error of less than one percent. We perform two versions of this inference: a global inference, in which we establish the four edge type probabilities for the whole population by pooling edge counts across all patients on each EM round, and a per-patient inference, in which we establish distinct parameters for each patient by performing the complete EM algorithm on each patient individually.
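
A sketch of the resulting maximization step, under the assumption that each sampled tree is recorded as a list of (parent, child, edge type) triples:

from collections import Counter

def m_step(sampled_trees):
    """Renormalize expected edge-type counts into the next theta.
    sampled_trees: iterable of trees, each a list of (u, v, edge_type);
    in the global variant the trees of all patients are pooled here."""
    counts = Counter(etype for tree in sampled_trees for _, _, etype in tree)
    total = sum(counts.values())
    return {t: counts[t] / total for t in ('c-', 'c+', 'g-', 'g+')}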

2.5. Identifying a Global Consensus Network

A final stage of analysis is performed with the EM-inferred parameters to find a best-fit tree for each patient and a global consensus network for the entire population. We first fit a phylogeny to each patient using Algorithm 2.2. We then find a global consensus network by identifying all pathways used in at least some fraction t of all patients. Given the per-patient trees T_1, ..., T_n, we can identify consensus pathways by searching depth-first through each tree individually and then, for each node, counting how many other trees have the same node and exhibit the same pathway from that node to the root. Those pathways occurring in a fraction t of trees are added to the global consensus network. For the present study, t = 5%. Note that this consensus network need not itself be a tree, since a node may be reachable by more than one common pathway in different individual trees.
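
One way to realize the pathway counting is sketched below (our interpretation, not the authors' code): each tree is reduced to its set of root-to-node paths, and any path occurring in at least a fraction t of the trees enters the consensus network.

from collections import Counter

def consensus_paths(trees, root, t=0.05):
    """trees: per-patient parent maps {child: parent}; returns the paths
    (as root-to-node state tuples) found in at least a fraction t of trees."""
    path_counts = Counter()
    for parent in trees:
        for node in parent:
            path = [node]
            while path[-1] != root:
                path.append(parent[path[-1]])
            path_counts[tuple(reversed(path))] += 1
    return [p for p, c in path_counts.items() if c >= t * len(trees)]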

3. RESULTS AND ANALYSIS

3.1. Data

We applied our phylogeny inference methods to two data sets collected for a previous study on human breast cancer progression7 using a protocol for FISH-based analysis of gene and chromosome copy numbers.17 One data set consists of Her-2/neu gene copy numbers and chromosome 17 copy numbers assayed in cells from 118 individuals, with an average of 63 cell assays per patient. The second consists of p53 and chromosome 17 copy numbers assayed in 113 individuals with an average of 68 assays per patient. These two data sets were chosen for the present study in part because of the importance of both genes in

breast cancer progression. Her-2/neu amplification promotes cell proliferation and is associated with a class of breast cancers.15,24 p53 is a crucial tumor suppressor gene whose loss or inactivation is implicated in approximately half of all human cancers.18

Furthermore, the fact that both genes occupy chromosome 17 provides some means for validation of the method, as inferred patterns of chromosome gain and loss should be the same in both datasets.

3.2. Global Consensus Trees

We first performed a single consensus inference, fitting one set of prior probabilities to each of the full data sets. Table 1 shows the inferred probabilities from the two data sets. Both show similar frequencies of chromosome duplication and loss, with slightly higher rates of loss than duplication. p53 and Her-2 show very different patterns of gene gain and loss, though. Her-2 shows a notable preference for gene gain over loss, consistent with the fact that Her-2 amplification characterizes a subset of breast cancers.15,24 p53, on the other hand, shows a slight excess of gene loss over gain, consistent with the fact that p53 is implicated in cancers through loss of function, rather than amplification.

Table 1. Global consensus probabilities inferred for chromosome duplication (p_{c+}), chromosome loss (p_{c-}), gene gain (p_{g+}), and gene loss (p_{g-}).

             p_{c+}   p_{c-}   p_{g+}   p_{g-}
Her-2/neu    0.268    0.282    0.319    0.131
p53          0.274    0.290    0.198    0.238

Figure 3(a) shows a consensus phylogenetic network for Her-2/neu and chromosome 17. Two dominant edges project from the root, one corresponding to chromosome duplication and the other to gene gain, with lesser amounts of chromosome and gene loss. The two dominant edges lead to two prominent pathways in the graph. One pathway exhibits successive gene gains without changes in chromosome copy number, while the other shows a pattern of alternating chromosome duplication and loss. There is support in the literature for both of these pathways. A large fraction of breast cancers exhibit diploidy with substantial amplification of Her-2/neu,10,24 consistent with the pure gene gain pathway. The alternating pattern of chromosome duplication and loss has also previously been predicted based on mathematical models and is supported by evidence from several classes of solid tumors.16 We further note, however, that these two pathways are not rigidly separated, but rather exhibit some ability to interconvert. Gene gain or loss events occasionally branch off of the chromosome gain/loss pathway, and chromosome abnormalities occasionally appear off of the gene amplification pathway. This observation is, to our knowledge, novel. Examination of individual phylogenies (data not shown) suggests that individual patients may follow one or the other of these two dominant pathways exclusively or may combine the two.

Figure 3(b) shows the consensus phylogenetic network inferred for p53. Like the Her-2/neu network, the p53 network shows one prominent pathway exhibiting alternating chromosome duplication and loss. It is to be expected that the same chromosome patterns would be observed, as p53 and Her-2/neu are found on the same chromosome, and this finding thus validates the data and the analysis methods. Gene gain and loss are much less prominent for p53 than for Her-2/neu, however. Some gene gain and loss does occur, but it is comparatively rare and does not produce any long chains of successive amplifications, as is seen with Her-2/neu. Patterns of p53 gain and loss are likely to be difficult to interpret directly from copy number data, as they may involve partial inactivation of the gene rather than total loss.18 p53 is, however, a tumor suppressor, so we would not expect to see a prominent p53 amplification pathway in cancers.

Fig. 3. Consensus networks inferred from pathways found in at least 5% of patients. Nodes represent cell states, with chromosome and gene counts in parentheses and the frequency of the state as a percentage of observed cells. Black dashed edges denote gene events and gray solid edges chromosome events. Edge label and thickness indicate the number of patients exhibiting the given edge. (a) Her-2/neu and chromosome 17 network. (b) p53 and chromosome 17 network.


3.3. Heterogeneity Between Patients

Fig. 4. Visual representation of the space of inferred prior probability parameters from per-patient data. Each image shows data points for the four inferred probability parameters on individual patients. p_{c-} (chromosome loss) and p_{c+} (chromosome duplication) determine the x and y positions of the points. p_{g-} (gene loss) determines point size, with point size proportional to 1 + 10 p_{g-}. p_{g+} determines the color of the point, ranging from black for p_{g+} = 0 to white for p_{g+} = 1. Point positions are perturbed by a random factor of up to 0.025 in the x and y dimensions in order to make points with the same positions visible as distinct entities. (a) Parameters for Her-2/neu. (b) Parameters for p53.

While the global analysis gives us a reasonable best estimate of the overall frequencies of each of the possible genetic abnormalities, it is also useful to assess differences between patients. Figure 4 provides a graphical display of edge-type distributions derived by performing the EM inference one patient at a time

instead of globally. Figure 4(a) shows parameters for Her-2/neu and

chromosome 17. A substantial fraction of points cluster on the axes and especially at the origin, corresponding to tumors that exhibit no chromosome loss, no chromosome duplication, or no loss and no duplication; these tumors cover a spectrum of gene gain and gene loss probabilities. Many points exhibit no gene loss (appearing as small squares in the figure) and these are scattered throughout the graph. A relatively small number of points exhibit almost exclusively chromosome events. A substantial fraction of all points lie in the middle of the plot, exhibiting some balance of all four event types. These observations are consistent with what was seen in the consensus phylogenies, suggesting that a large fraction of patients use both the gene and chromosome amplifying pathways, with other groups exhibiting exclusively one pattern or the other.

Figure 4(b) shows parameters for p53 and chromosome 17. The plot is superficially similar to that of Her-2/neu but with some notable differences. First, pure gene gain or gene loss in the absence of the other is comparatively rare for p53. Of those points on the axes or origin, though, a comparatively greater portion of them show up as having high gene loss and low gene gain (large black squares) as opposed to high gene gain and low gene loss (small white squares). This again appears consistent with the fact that p53 amplification is not associated with breast cancer, while Her-2/neu amplification is.

4. DISCUSSION

We have developed a novel computational method using phylogeny reconstruction algorithms to infer tumor progression pathways from cell-by-cell assays. The method allows us to produce likely progression trees for individual patients and to identify common progression pathways across distinct patients. Application to a set of FISH data on two known cancer-related genes gathered from breast cancer tumors validates the method by recapitulating several previously identified properties of these genes and their role in breast cancer. It further provides novel insights into the progression mechanisms acting in these tumors.

This work may have several important consequences


for cancer biology in general and in the specific types studied here. Her-2/neu amplifying tumors show two dominant pathways, chromosome amplifying and gene amplifying, which is consistent with prior knowledge. Our study also reveals, though, that these pathways can work in concert in individual patients. Approaches to cancer subtype identification based on clustering of tissue-averaged measurements would not generally be able to recognize that these hybrid tumors are in fact using combinations of two fundamental pathways and may require therapeutics directed at both. This problem may be particularly significant for the classification of Her-2/neu tumors because current clinical standards for detecting Her-2/neu amplification and prescribing anti-Her-2/neu therapy use a protocol tuned for diploid cells and normalized by chromosome counts;27 the protocol would be expected to be less sensitive to Her-2/neu amplification in aneuploid cells and thus potentially to fail to recommend anti-Her-2/neu therapy to patients whose tumors are genuinely Her-2/neu amplifying but are also aneuploid. We can anticipate that similar issues will arise with other tumor types as more targeted therapies become available. Accurate inference of progression pathways within tumors is thus likely to be a key step in developing a more rational approach to the targeted treatment of cancers.

There are several future directions to be explored in this work. One current limitation is that it looks at only a small number of measurements per cell simultaneously (one gene and one chromosome in the present work). The nature of the assay precludes much improvement in the experimental data, but computational inferences could in principle correlate states across different sets of copy data. For example, one might infer which Her-2/chromosome 17 states and which p53/chromosome 17 states overlap to produce likely Her-2/p53/chromosome 17 phylogenies. There are also other kinds of single-cell cytometry data to which this method could be applied, such as single-cell protein expression data. Finally, there are many avenues for advancement in developing more realistic models of the tumorigenesis process and more sophisticated phylogeny algorithms for the core inference and sampling steps, for example to deal more robustly with the inference of Steiner nodes.

Acknowledgments

R.S. and G.R. were supported by a grant from the Berkman Faculty Development Fund at Carnegie Mellon University.

References

1. Ayers M, Symmans WF, Stec J, Damokosh AI, Clark E, Hess K, Lecocke M, Metivier J, Booser D, Ibrahim N, Valero V, Royce M, Arun B, Whitman G, Ross J, Sneige N, Hortobagyi GN, Pusztai L. Gene expression profiles predict complete pathologic response to neoadjuvant paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide chemotherapy in breast cancer. J Clin Oncol 2004; 22(12): 1-10.

2. Chu Y, Liu T. On the shortest arborescence of a directed graph. Sci Sinica 1965; 14: 1386-1400.

3. Cunliffe HE, Ringner M, Bilke S, Walker RL, Cheung JM, Chen Y, and Meltzer PS. The gene expression response of breast cancer to growth regulators: patterns and correlation with tumor expression profiles. Cancer Res 2003; 63: 7158-7166.

4. Desper R, Khan J, Schaffer AA. Tumor classification using phylogenetic methods on expression data. J Theor Biol 2004; 228: 477-496.

5. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998; 95: 14863-14868.

6. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286: 531-537.

7. Janocko LE, Brown KA, Smith CA, Gu LP, Pollice AA, Singh SG, Julian T, Wolmark N, Sweeney L, Silverman JF, Shackney SE. Distinctive patterns of Her-2/neu, c-myc, and cyclin D1 gene amplification by fluorescence in situ hybridization in primary human breast cancers. Cytometry 2001; 46(3): 136-149.

8. Jerrum M, Sinclair A. The Markov chain Monte Carlo method: an approach to approximate counting and integration. In Hochbaum DS (ed.), Approximation Algorithms for NP-Hard Problems. PWS Publishing, Boston. 1996: 482-520.

9. Nowell PC. The clonal evolution of tumor cell populations. Science 1976; 194: 23-28.

10. Pegram MD, Konecny G, and Slamon DJ. The molecular and cellular biology of HER2/neu gene amplification/overexpression and the clinical development of herceptin (trastuzumab) therapy for breast cancer. Cancer Treat Res 2000; 103: 57-75.

11. Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge Ø, Pergamenschikov A, Williams C, Zhu SX, Lønning PE, Børresen-Dale A-L, Brown PO, and Botstein D. Molecular portraits of human breast tumors. Nature 2000; 406: 747-752.

12. Prim RC. Shortest connection networks and some generalizations. Bell System Technical Journal 1957; 36: 1389-1401.

13. Ried T, Heselmeyer-Haddad K, Blegen H, Schrock E, and Auer G. Genomic changes defining the genesis, progression, and malignancy potential of solid human tumors: a phenotype/genotype correlation. Genes Chromosomes Cancer 1999; 25(3) : 195-204.

14. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, van de Rijn M, Waltham M, Pergamenschikov A, Lee JCF, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, and Brown PO. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000; 24: 227-235.

15. Slamon DJ, Clark GM, Wong SG, Levin WJ, Ullrich A, McGuire WL. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 1987; 235: 177-182.

16. Shackney SE, Smith CA, Miller BW, Burholt DR, Murtha K, Giles HR, Ketterer DM, Pollice AA. Model for the genetic evolution of human solid tumors. Cancer Res 1989; 49: 3344-3354.

17. Shackney SE, Singh SG, Yakulis R, Smith CA, Pollice AA, Petruolo S, Waggoner A, Hartsock RJ. Aneuploidy in breast cancer: a fluorescence in situ hybridization study. Cytometry 1995; 22(4): 282-291.

18. Shackney SE, Shankey TV. Common patterns of genetic evolution in human solid tumors. Cytometry 1997; 29: 1-27.

19. Shackney SE, Silverman JF. Molecular evolutionary patterns in breast cancer. Adv Anat Pathol 2003; 10(5): 278-290.

20. Shackney SE, Smith CA, Pollice A, Brown K, Day R, Julian T, Silverman JF. Intracellular patterns of Her-2/neu, ras, and ploidy abnormalities in primary human breast cancers predict postoperative clinical disease-free survival. Clin Cancer Res 2004; 10: 3042-3052.

21. Sinclair A. Improved bounds for mixing rates of Markov chains and multicommodity flow. Combin Probab Comput 1992; 1: 351-370.

22. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lønning PE, Børresen-Dale A-L. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 2001; 98(19): 10869-10874.

23. Valk PJM, Verhaak RGW, Beijen MA, Erpelinck CAJ, van Waalwijk van Doorn-Khosrovani SB, Boer JM, Beverloo HB, Moorhouse MJ, van der Spek PJ, Löwenberg B, Delwel R. Prognostically useful gene-expression profiles in acute myeloid leukemia. N Engl J Med 2004; 350(16): 1617-1628.

24. van de Vijver M, van de Bersselaar R, Devilee P, Cor-nelisse C, Peterse J, Nusse R. Amplification of the neu (c-erbB-2) oncogene in human mammary tumors is relatively frequent and is often accompanied by amplification of the linked c-erbA oncogene. Mol Cell Biol 1987; 7(5) : 2019-2023.

25. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002; 347: 1999-2009.

26. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415: 484-485.

27. Winston JS, Ramanaryanan T, Levine E. Her-2/neu evaluation in breast cancer: are we there yet? Am J Clin Pathol 2004; 121: S33-S49.


SIMULATING IN VITRO EPITHELIAL MORPHOGENESIS IN MULTIPLE ENVIRONMENTS

M. R. Grant, S. H. J. Kim, C. A. Hunt*

Joint UCSF/UCB Bioengineering Graduate Group and The Biosystems Group, Department of Biopharmaceutical Sciences, The University of California, San Francisco, CA 94143, USA

Email: [email protected], [email protected], [email protected]

In vitro studies of epithelial cell morphogenesis have demonstrated the influence of environment composition and orientation in the development of multicellular epithelial structures such as tubules and cysts. We have constructed a low resolution, discrete event simulation model and report on its use to explore how experimentally observed morphogenetic phenomena under four growth conditions might be generated and controlled. We identified simulation attributes that may have in vitro counterparts. We studied how changes in the logic governing simulated epithelial cell behavior might cause abnormal growth. Simulation results support the importance of a polarized response to the environment to the generation of a normal epithelial phenotype and show how disruptions of tight mechanistic control lead to aberrant growth characteristics.

1. INTRODUCTION

Epithelial cells are studied in vitro in order to better understand the mechanisms that lead to normal and disregulated epithelial cell morphogenesis.1 It has been observed that epithelial cells are capable of sensing the presence of matrix in the local environment through specific cell surface receptors such as integrins, and that decisions about cell survival are made at least in part based on attachment to matrix.2 Attachment to neighboring cells is important for the specification of an apical surface and for the directed transport of factors to the apical and basal surfaces.3 Finally, the presence of an apical surface free of matrix contact appears to be an important requirement for the stability of an epithelial monolayer.4 Epithelial development in vitro leads ultimately to structures in which cells have a basal surface associated with matrix, lateral surfaces associated with other epithelial cells, and an apical surface adjacent to an environment free of cell or matrix content. Epithelial cell behavior is thus hypothesized to be governed by processes, including environment sensing, that lead ultimately to the acquisition of a stable structure in which each cell has a three-surface environment.5

We report on the behaviors of low-resolution, discrete space, discrete event simulation models based on the above hypothesis. Each action taken by a simulated cell is strictly determined by environment-focused rules. Actions are mandated in response to the type and location, in the local environment, of any combination of three components: cells, matrix, or free space. One model

provides strong support for the above hypothesis. It mimics key in vitro epithelial cell phenotypic attributes in four different in vitro conditions: in surface, embedded, suspension, and overlay cultures. The model stands as a testable analogue for the high-level mechanisms driving epithelial cell behavior in vitro. However, we hypothesized that it is unlikely that every cell always adheres to a set of genome-controlled mandates. To test that hypothesis we placed selected rules under stochastic control. We report results of in silico experiments conducted to determine how much a rule can be relaxed without significant erosion of targeted behaviors. We also report the initial results of altering selected rules to identify potential conditions that may lead to aberrant cell growth in vitro.

2. MODEL STRUCTURE

We use the agent-based simulation modeling package MASON.6 It provides tools for event scheduling, controlling event execution, and visualization of simulation components. The simulations consist of 2D hexagonal grids, three simulation component specifications, and visualizations of grid states. In order to avoid confusion, simulated components, environments, and outcomes will hereafter be labeled in small caps in order to distinguish them from their real-world referents. FREE SPACE and MATRIX components do not exhibit any behaviors. All interaction with the environments occurs through the actions of individual CELLS on their immediate environment.

* Corresponding author.


Fig. 1. Examples of each of the nine transition rules. Two rules were altered to simulate mutations resulting in altered epithelial cell behavior: M1: mutated rule 1; M8: mutated rule 8.

We identified twenty-one out of a potential sixty-four distinct local environment configurations that could give distinctive epithelial cell behaviors. Final states of the local environment for each of these environment configurations were assigned based on observations from in vitro experiments, and the outcomes were simplified into eight rules that generate a complete mapping from any initial configuration of the local environment to a final configuration. The derived rules, examples of which are pictured in Fig. 1, are as follows. During each simulation cycle (time step), each CELL determines if it has one, two, or three types of neighbors. Only one type: follow Rule 1, 2, or 3. Two types: follow Rule 4, 5, 6, or 7. Three types: follow Rule 8 or 9. Rules 1-3, if the environment contains only: 1) CELLS: die; 2) FREE SPACE: die; 3) MATRIX: divide and replace a MATRIX with a daughter to maximize CELL neighbors. Rules 4-7, when there is: 4) FREE SPACE, no MATRIX, and just one CELL: add MATRIX between self and that CELL; 5) FREE SPACE, no MATRIX, and at least two CELLS: die; 6) at least one CELL and MATRIX (no FREE SPACE): divide and replace a MATRIX with a daughter to maximize CELL neighbors; 7) MATRIX and at least one FREE SPACE: divide and place the daughter in a FREE SPACE that has a MATRIX neighbor. Rule 8: when there is at least one CELL plus MATRIX neighbored by an adjacent FREE SPACE: divide and place the daughter in a FREE SPACE that neighbors MATRIX. Otherwise, Rule 9: do nothing: mandates achieved.
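
The dispatch logic can be summarized in a few lines. The sketch below is our Python paraphrase of the published rules (the original model is implemented in Java on MASON); the spatial test of Rule 8, whether a MATRIX neighbor is itself adjacent to FREE SPACE, is abstracted into a hypothetical boolean parameter.

def apply_rules(nbrs, matrix_next_to_free=True):
    """nbrs: list of a CELL's neighbor labels: 'CELL', 'MATRIX', or 'FREE'."""
    kinds = set(nbrs)
    if kinds == {'CELL'} or kinds == {'FREE'}:
        return 'die'                                   # Rules 1 and 2
    if kinds == {'MATRIX'}:
        return 'divide, daughter replaces a MATRIX'    # Rule 3
    if kinds == {'CELL', 'FREE'}:                      # Rules 4 and 5
        return 'add MATRIX' if nbrs.count('CELL') == 1 else 'die'
    if kinds == {'CELL', 'MATRIX'}:
        return 'divide, daughter replaces a MATRIX'    # Rule 6
    if kinds == {'MATRIX', 'FREE'}:
        return 'divide, daughter into FREE SPACE'      # Rule 7
    if kinds == {'CELL', 'MATRIX', 'FREE'} and matrix_next_to_free:
        return 'divide, daughter into FREE SPACE'      # Rule 8
    return 'do nothing'                                # Rule 9: mandates achieved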

Four simulated environments were constructed in order to provide a range of environments to analyze simulation outcomes. Each simulated environment corresponds to a particular condition frequently used to study epithelial cells in vitro. Surface culture was simulated as two 2D square grids, the lower grid filled with MATRIX and the upper grid with FREE SPACE components. To initiate a simulation, a single CELL is placed in the center of the FREE SPACE grid. Simulated embedded culture was represented using a MATRIX-filled 2D hexagonal grid, with the center of the grid at the start of a simulation occupied by a single CELL. Overlay culture was simulated by placing a row of CELLS on a horizontal plane through the center of a 2D grid filled with MATRIX. Finally, suspension culture was simulated by placing two CELLS adjacent to one another in a 2D hexagonal grid filled with FREE SPACE.

To explore the consequences of perturbations in CELL behavior, modifications to the rules were implemented. We altered the rules for CELL survival (mutated Rule 1) and the rules for orientation of CELL division (mutated Rule 8) (Fig. 1). Also, a probabilistic option was added to Rule 1 to allow for CELL survival in the presence of a CELL-only environment. When a CELL event is executed and the CELL is in a CELL-only environment, there is a probability that the CELL will die; if that probability is not met, the CELL survives. The probability is a parameter value shared by all of the CELLS in a particular simulation run. At each event, a CELL samples from a uniform distribution using an implementation of the Mersenne Twister random number generation algorithm provided with MASON. For each simulation experiment, CELL numbers were recorded at each step, for fifty steps, for at least 25 simulation runs per simulated condition.
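
A minimal sketch of the stochastic Rule 1 variant (our paraphrase; the original draws from MASON's Mersenne Twister, and Python's random module happens to be Mersenne Twister based as well):

import random

def rule1_stochastic(nbrs, p_death):
    """p_death is shared by all CELLS in a run; p_death=1.0 recovers Rule 1."""
    if set(nbrs) == {'CELL'} and random.random() < p_death:
        return 'die'
    return 'survive'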

3. RESULTS

A simulation model in which epithelial cell behavior is determined by the types and locations of components in the local environment produces a range of phenotypes exhibited by epithelial cells during growth in vitro, under a range of different conditions. The 2D simulation successfully generated stable monolayers in simulated surface culture, stable structures resembling lumen-filled cysts in simulated embedded culture, stable inverted cysts in simulated suspension culture, and tubule-like structures in simulated overlay culture (not shown). An example of a growth sequence is presented in Fig. 2. Similarity to additional characteristics of in vitro cultures was observed as well, including the development of stable CYSTS after 10 simulated days (Fig. 3A), the distribution of CYST sizes (Fig. 3B), and the continuation of CELL death and division during the growth of a CYST (Fig. 3C).


Fig. 2. Results of a typical simulation of growth in embedded culture. The numbers refer to the simulation step at which the snapshot was taken. In this example, a stable structure formed at simulation step 13. The rules are applied once to each CELL at each simulation step, with random ordering of CELL scheduling each step. As a consequence, a wide variety of stable CYSTS are formed.

To ensure that the collection of rules in Fig. 1 is necessary for forming the targeted attributes, simulations were conducted using rule sets in which one of the first eight rules was broken or weakened. Breaking or weakening any one of these rules caused a dramatic change in growth characteristics. Two examples are presented. The first rule change simulated the consequences of a loss in the ability of the CELL to die. The simulated CELLS having the altered rule set generated normal phenotypes in simulated surface culture and suspension culture, but abnormal attributes in simulated embedded and simulated overlay conditions. A second rule change simulated a disruption in the direction in which CELL division was implemented. Both altered rule sets caused a loss in the ability to form stable structures in simulated embedded culture (Fig. 3D). Reduction in the probability of CELL death as a consequence of following Rule 1 resulted in an increase in the number of CELLS found in the structures formed in simulated embedded culture (Fig. 4), but not in uncontrolled growth. We also observed that despite a reduction in CELL death probability of 80% and more, stable structures were still formed, although their morphology varied.

4. DISCUSSION

The model described simulates outcomes under four different growth conditions where the key features are remarkably similar to observations of epithelial cell growth under corresponding in vitro conditions. Simulations also generated additional in silico system-level attributes (not shown) under other conditions. Some of these may merit in vitro exploration as a means of further model validation, possibly invalidation. This model provides a foundation for the design of increasingly representative and predictive simulation models of epithelial cell morphogenesis. This model and its descendants will have three important uses: 1) it will provide a framework for the systematic reverse engineering of the mechanisms underlying the targeted behaviors, a process that will lead to causal linkages between known molecular-level events, system-level events, and system-level phenotype; 2) it will provide means to posit causes of disregulated epithelial cell growth; and 3) it will also enable in silico experimental methods to discover and analyze potential treatments of disregulated growth.

Fig. 3. Characteristics of simulated epithelial cell morphogenesis under different conditions (n > 24). A: Comparison of simulation and in vitro epithelial cell growth rates in embedded culture; error bars: 1 SD; two simulation steps represent about 24 hours in vitro. B: The frequency distribution of CELL number per CYST is similar to in vitro referent data from embedded culture. C: Analysis of division and death events in simulated embedded culture. D: Consequences of rule changes described in the text.

Although two fundamentally different rule changes were identified that both give rise to disregulated growth in simulated embedded culture, others are possible and are being explored to identify additional changes in simulated cell behavior that could result in altered growth patterns under different conditions. We anticipate that some of these will merit in vitro follow-up.

Despite the ability of the model to simulate targeted outcomes under a range of conditions, there are in vitro observations that are evident and measurable at the current level of resolution that the model currently is not capable of representing. For example, not all cysts formed in 3D embedded culture are lumen-filled, monolayer structures. In the case of MDCK cells, a kidney epithelial cell line, nearly thirty percent of cysts that develop contain either no lumenal space or have partially filled lumens after day ten.7 Another unmatched in vitro observation is the polarity inversion that takes place in the transfer of a cyst grown in suspension culture into an


embedded matrix.8 It is straightforward to add rules representing additional mechanistic detail with the goal of extending model behavior to cover one or more of these additional attributes. It requires iterative in silico experimentation to also adjust existing mechanistic rules, as new ones are added, so that the behavior of the resulting new model validates against the behavior of the original model.

Fig. 4. Consequences of stochastic application of Rule 1 in simulated embedded culture (n = 50). In addition to p = 1.0, five probabilities were tested: 0.8, 0.6, 0.4, 0.2, and 0. When p = 0.4, Rule 1 is followed 40% (average) of the time when applied. The plateau level, the number of CELLS in the final, stable structure, was highly dependent on the early form of that structure. Consequently, variance is large.

We had such additional attributes in mind when we changed the model so that rule application would become stochastic. The results in Fig. 4 for the stochastic application of Rule 1 were unexpected: a gradual shift from formation of stable structures to unregulated growth had been judged a possible outcome. Application of Rule 1 as infrequently as 1% of the time still results in the formation of stable (yet large) structures. Furthermore, the morphology resulting from different rule application probabilities differed. Such results suggest that tight control and regulation by currently unknown mechanisms are necessary for proper growth and morphogenesis.

5. CONCLUSION

Modeling and simulating epithelial cell morphogenesis with a focus on the types and locations of components in the extracellular environment result in a representation of epithelial cell behavior that has remarkably high fidelity at this low level of resolution. This model serves as a reference point for more complex models of epithelial cell morphogenesis that are at least as predictive as the model presented here.

Acknowledgments

This work was abstracted in part from material assembled by MRG for his Ph.D. dissertation. We are grateful for the funding provided by the CDH Research Foundation, and for the support provided by our cell biology collaborators and their groups: Profs. Keith Mostov (tetrad.ucsf.edu/faculty.php?ID=45) and Thea Tlsty (tetrad.ucsf.edu/faculty.php?ID=78). Our dialogues with Glen Ropella, Hal Berman, and Nancy Dumont covering many technical, biological, and theoretical issues proved important. For their support and helpful advice, we thank the members of the BioSystems Group (biosystems.ucsf.edu).

References

1. Zegers MMP, O'Brien LE, Yu W, Datta A, Mostov KE. Epithelial polarity and tubulogenesis in vitro. Trends Cell Biol 2003; 13: 169-176.

2. Meredith JEJ, Fazeli B, Schwartz MA. The extracellular matrix as a cell survival factor. Mol Biol Cell 1993; 4: 953-961.

3. Lipschutz JH, Guo W, O'Brien LE, Nguyen YH, Novick P, et al. Exocyst is involved in cystogenesis and tubulogenesis and acts by modulating synthesis and delivery of basolateral plasma membrane and secretory proteins. Mol Biol Cell 2000; 11: 4259-4275.

4. Hall HG, Farson DA, Bissell MJ. Lumen formation by epithelial cell lines in response to collagen overlay: a morphogenetic model in culture. Proc Natl Acad Sci USA 1982; 79: 4672-4676.

5. O'Brien LE, Zegers MM, Mostov KE. Opinion: Building epithelial architecture: insights from three-dimensional culture models. Nat Rev Mol Cell Biol 2002; 3: 531-537.

6. Luke S, Balan GC, Panait L, Cioffi-Revilla C, Paus S. MASON: A Java Multi-Agent Simulation Library; 2003.

7. Lin H-H, Yang T-P, Jian S-T, Yang H-Y, Tang M-J. Bcl-2 overexpression prevents apoptosis-induced Madin-Darby canine kidney simple epithelial cyst formation. Kidney International 1999; 55: 168-178.

8. Wang AZ, Ojakian GK, Nelson WJ. Steps in the morphogenesis of a polarized epithelium I. Uncoupling the roles of cell-cell and cell-substratum contact in establishing plasma membrane polarity in multicellular epithelial (MDCK) cysts. J Cell Sci 1990; 95: 137-151.


A COMBINED DATA MINING APPROACH FOR INFREQUENT EVENTS: ANALYZING HIV MUTATION CHANGES BASED ON TREATMENT HISTORY

Ray S. Lin1*, Soo-Yon Rhee2, Robert W. Shafer2, and Amar K. Das1

1 Stanford Medical Informatics and 2 Division of Infectious Diseases

Department of Medicine, Stanford University

Stanford, CA 94305, United States. Email: raylin@stanford.edu

Many biological databases contain a large number of variables, among which events of interest may be very infrequent. Using a single data mining method to analyze such databases may not find adequate predictors. The HIV Drug Resistance Database at Stanford University stores sequential HIV-1 genotype-test results on patients taking antiretroviral drugs. We have analyzed the infrequent event of gene mutation changes by combining three data mining methods. We first use association rule analysis to scan through the database and identify potentially interesting mutation patterns with relatively high frequency. Next, we use logistic regression and classification trees to further investigate these patterns by analyzing the relationship between treatment history and mutation changes. Although the AUC measures of the overall prediction are not very high, our approach can effectively identify strong predictors of mutation change and thus focus the analytic efforts of researchers on verifying these results.

1. INTRODUCTION

Many databases contain a large number of variables, among which events of biological relevance could be rare and difficult to predict accurately. Combining different data mining methods may be effective in discovering the relationship between rare events (e.g., mutation changes in genotype-test results) and other measured factors (e.g., phenotypic or environmental variables). In our work, we have investigated the relationship between mutation changes in the HIV protease gene and information on patients' recent antiretroviral treatment, using data from the HIV Drug Resistance Database (HIVDB) at Stanford University (hivdb.stanford.edu)1. In this paper, we present an evaluation of an approach combining association rule analysis, logistic regression, and classification trees to predict the occurrence of mutations in the HIV protease gene based on treatment history.

Association rule analysis is a popular method for mining commercial transaction databases and has been explored for finding patterns in biomedical data, such as gene regulatory elements in microarray data,3 gene expression and co-regulated clusters,4 and protein-protein interactions.5 Association rule analysis is an efficient method to discover relations hidden in a sparse, large database.6 However, when the events of primary interest are rare in the database, the rules being mined may have very low support (<5%) and cannot support confident conclusions about the associations. Therefore, other analytical methods are necessary in order to investigate the discovered associations more thoroughly.

Logistic regression is a classic method for predicting binary outcomes and identifying risk factors in clinical research.7 Classification trees are powerful analytical tools that produce interpretable results (tree diagrams) and thus have been widely used in predicting various clinical and biomedical events. In HIV research, they have been used to analyze the association of antiretroviral resistance mutations with response to therapy8 and to predict drug resistance based on HIV mutations.9 While logistic regression assumes the linearity and independence of the predictors, classification trees can be used to discover interactions among predictors that do not exhibit strong marginal effects.10 Classification trees explore high-order effects by the nature of recursive partitioning.2 However, both of these methods require pre-specified outcome variables and


predictors and are not effective in scanning through a sparse database with a large number of variables.

Combining these three methods may mitigate each of their limitations. Studies have shown the effectiveness of using different methods in classification11,12 and of combining association rule analysis with classification methods.13,14 However, only a few studies have explored such combined usage in the biomedical domain.15

In this study, we have undertaken the following approach: we first use association rule analysis to scan through the occurrences of HIV protease gene mutations and identify potentially interesting patterns with relatively high frequency. Logistic regression and classification trees are then used to further investigate these patterns by analyzing the relationship between treatment history and mutation changes. We have found that, whereas association rule analysis can effectively focus attention on interesting patterns for further investigation, logistic regression and classification trees can identify both the linear and the high-order relationships in the database.

2. BACKGROUND

Significant research efforts have been undertaken to investigate the association between HIV gene mutations and antiretroviral therapy. Mutations in the protease and reverse transcriptase genes—the targets of antiretroviral drugs—have been shown to be associated with drug resistance.16 On the one hand, these mutations can cause treatment failure; thus, it is important to be able to predict drug resistance based on specific HIV mutations and identify the best treatment for the patient.8 On the other hand, a mutation change may also be the result of certain antiretroviral treatments; therefore, it is also crucial to predict mutation changes based on patients' treatment history.

There have been a number of studies examining the occurrence of HIV gene mutations after initial exposure to specific antiretroviral treatments.16 Few have investigated sequential mutation changes in patients changing antiretroviral drugs in the context of comprehensive drug treatment histories, such as those available in HIVDB. With data from over 2000 subjects in HIVDB, we undertook a combined data-mining approach to predict mutation changes in the HIV protease gene (P1 to P99) based on antiretroviral drug history.

3. METHOD

We derived a dataset from HIVDB that included 2,681 unique patients who had more than one HIV

protease genotype-test result and who had treatment history recorded during "time windows," starting with one genotype-test result and ending with another. For each of the 99 coding positions in the HIV protease gene, the occurrence of a mutation change is identified as a difference between the test result at the beginning of the time window and that at the end. Antiretroviral drugs administered within the time window are considered to be the predictors of mutation change. Each of the 7 protease inhibitors (PIs)—abbreviated APV, IDV, NFV, RTV, SQV, LPV, and ATV—is represented as an individual predictor. Nucleoside reverse transcriptase inhibitors (NRTIs) and non-nucleoside reverse transcriptase inhibitors (nNRTIs) are treated as two aggregated predictors.

We first utilized association rule analysis to scan through the database to identify patterns (i.e., pairs of predictors and mutation changes) of relatively high frequency. In order to achieve high sensitivity in identifying potentially interesting patterns, rules were mined by the Apriori algorithm using a minimum support of 0.002 and a minimum confidence of 0.1. Each of the coding positions identified in the association rules was further analyzed by a logistic regression model and a classification tree. For each model, the dependent variable is a binary variable indicating whether there is a mutation change at that coding position; the independent variables are the 9 treatment predictors. In the classification trees, the Gini index of diversity is used as the splitting criterion. The trees are pruned back by the 1-SE rule based on 10-fold cross-validation error rates.2 We evaluated the performance by area under the ROC curve (AUC) analysis, and assessed the sensitivity and specificity at the optimal point.
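
For illustration, both stages can be reproduced with standard libraries. The sketch below uses mlxtend's Apriori and scikit-learn (our choice of tools, not named by the paper), with cost-complexity pruning standing in for the paper's 1-SE pruning; `df` is assumed to be a 0/1 data frame holding the 9 treatment predictors and the 99 mutation-change indicators per time window.

from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

def combined_pipeline(df, treatments, position):
    # Stage 1: low-threshold association rules to nominate coding positions.
    itemsets = apriori(df.astype(bool), min_support=0.002, use_colnames=True)
    rules = association_rules(itemsets, metric='confidence', min_threshold=0.1)
    # Stage 2: per-position logistic regression and classification tree.
    X, y = df[treatments], df[position]
    lr = LogisticRegression(max_iter=1000).fit(X, y)
    tree = DecisionTreeClassifier(criterion='gini', ccp_alpha=0.001).fit(X, y)
    return (rules,
            roc_auc_score(y, lr.predict_proba(X)[:, 1]),
            roc_auc_score(y, tree.predict_proba(X)[:, 1]))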

4. PRELIMINARY RESULTS

4.1. Descriptive Statistics

On average, the length of the time window is 391 days, the number of mutation changes per time window is 3.01, and the percentage of mutation changes per coding position is 3%. The occurrence of each PI drug in a single time window ranged on average from 1% (ATV) to 20% (IDV); the occurrences of NRTI and nNRTI were 63% and 33%, respectively.

4.2. Association Rules Mining

In total, 449,077 rules were mined in the association rule analysis. Among them, we further examined 1,406 rules with treatment on the left-hand side and mutation change on the right-hand side. The highest support is


0.04 and the highest confidence is 0.50. Table 1 shows example mined association rules that have relatively high confidence and support.

4.3. Genotype Change Prediction

We identified nine unique coding positions from the 50 rules with highest confidence and the 50 rules with highest support. Each of these nine positions was analyzed by a logistic regression model and a classification tree. The models predict whether there is a mutation change at a particular position based on the treatment. The AUC measurements and sensitivity/specificity at the optimal cut point are summarized in Table 2. The AUC measurements show that the overall prediction performance is not high (from 0.56 to 0.70). However, both methods identify strong predictors of mutation change in each of the nine coding positions. On average, logistic regression identifies four strong predictors (with p < 0.001), and classification trees identify drug combinations (consisting of two to seven drugs) as predictors of the mutation change at each position.

Table 3 shows the effects of treatment on one particular mutation change (P54) as estimated by logistic regression. In this analysis, APV, IDV, RTV, SQV, and LPV are strong risk factors for developing this mutation change. NRTI and nNRTI show strong protective effects, while the other drugs are not associated with the mutation change.

Table 1. Association rules with relatively high confidence (Conf.) and support (Supp.)

Rule                                   Supp.    Conf.
{APV, SQV, LPV} => {P10}               0.002    0.50
{APV, RTV, LPV, nNRTI} => {P10}        0.003    0.47
{APV, SQV, LPV} => {P54}               0.002    0.45
{IDV} => {P46}                         0.039    0.20
{RTV} => {P71}                         0.038    0.22

Table 2. The prediction performance of classification trees and logistic regression indicated by AUC and by sensitivity and specificity, in parentheses

Coding Position    Classification Trees    Logistic Regression
P10                0.64 (0.51, 0.69)       0.61 (0.60, 0.55)
P13                0.62 (0.67, 0.53)       0.60 (0.68, 0.50)
P20                0.65 (0.62, 0.64)       0.65 (0.66, 0.57)
P36                0.63 (0.66, 0.55)       0.63 (0.66, 0.55)
P46                0.67 (0.63, 0.62)       0.64 (0.57, 0.64)
P54                0.62 (0.54, 0.76)       0.70 (0.64, 0.64)
P71                0.62 (0.60, 0.58)       0.60 (0.51, 0.65)
P82                0.63 (0.58, 0.70)       0.67 (0.57, 0.68)
P90                0.56 (0.46, 0.74)       0.64 (0.56, 0.65)

Table 3. The treatment effects on the mutation change of P54 estimated by logistic regression

Treatment    Odds Ratio (95% CI)    p value
APV          4.72 (3.55, 6.28)      <0.001
IDV          1.56 (1.23, 1.97)      0.003
NFV          0.96 (0.73, 1.25)      0.75
RTV          1.77 (1.36, 2.30)      <0.001
SQV          1.68 (1.27, 2.21)      <0.001
LPV          2.45 (1.81, 3.31)      <0.001
ATV          0.95 (0.36, 2.47)      0.91
NRTI         0.65 (0.51, 0.83)      <0.001
nNRTI        0.66 (0.52, 0.82)      <0.001

5. DISCUSSION

In this study, we analyze data on mutation changes stored in the HIVDB and investigate their association with patients' history of antiretroviral treatment. In the HIVDB, the occurrence of mutation changes in the HIV protease gene is less than 15%, and the frequency of the combination of these mutation changes and contextual antiretroviral history is less than 5%. No single data-mining method may adequately find predictors of mutation change. Therefore, we use a novel combined approach to analyze the database. Association rule analysis is first applied to the database to identify patterns with relatively high support and confidence. Logistic regression and classification trees are then used to predict the mutation change at specific coding positions.

Although the AUC measurements resulting from our approach do not show high overall prediction performance, we can effectively identify strong predictors of mutation change. A preliminary analysis of these results validates their relevance based on prior studies. The rules listed in Table 1 are consistent with previous research on the association between treatments and HIV mutations.16 P10 is a well-known polymorphic position, so it is not surprising that mutation change at P10 is associated with different treatments. APV, SQV, and LPV have all been shown to be associated with the P54 mutation (specifically I54V) in previous research. The association rule shows a similar association between these three drugs and mutation change at P54. Similarly, IDV has previously been shown to be associated with the P46 mutations (M46I and M46L), and RTV with the P71 mutation (A71V), and the rules also show these associations in the mutation changes.

For the P54 mutation changes, past studies have shown associations between RTV and I54V/I54L, IDV and I54V, NFV and I54V/I54L, and ATV and I54L. The results of logistic regression find similar


associations for RTV and IDV but not for NFV and ATV (Table 3). One limitation of our study is that we did not distinguish the specific genotype mutation in each change, since these events are too rare to be analyzed with such differentiation. As a result, we were not able to compare our data mining results with the existing literature on genotype-specific mutation changes. In addition, this study did not distinguish treatment-naive patients from patients changing drugs during treatment. Mutations observed in these two populations may be associated with different predictors; this is a potential caveat of the current results.

Identifying predictors of infrequent events is an important but difficult task in studying biological databases. We have addressed this challenge by developing a combined method that uses association rule analysis to find potentially interesting patterns involving rare events, and logistic regression and classification trees to establish linear and high-order associations among those events. In this paper, we show that our approach can identify strong predictors for rare events in a biomedical genomics database and can provide researchers with potentially biologically relevant results for further verification.

Acknowledgments

The authors thank Martin O'Connor for his assistance in data preparation. This work was funded in part by a training grant from the National Library of Medicine (5T15LM007033-22).

References

1. Rhee, S.Y. et al. Human immunodeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res 31, 298-303 (2003).

2. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, (Springer-Verlag, 2001).

3. Conklin, D., Jonassen, I., Aasland, R. & Taylor, W.R. Association of nucleotide patterns with gene function classes: application to human 3' untranslated sequences. Bioinformatics 18, 182-9 (2002).

4. Ji, L. & Tan, K.L. Mining gene expression data for positive and negative co-regulated gene clusters. Bioinformatics 20, 2711-8 (2004).

5. Oyama, T., Kitano, K., Satou, K. & Ito, T. Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics 18, 705-14 (2002).

6. Tan, P., Steinbach, M. & Kumar, V. Introduction to Data Mining, (Addison Wesley, 2006).

7. Dreiseitl, S. & Ohno-Machado, L. Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 35, 352-9 (2002).

8. Quigg, M. et al. Association of antiretroviral resistance genotypes with response to therapy-comparison of three models. Antivir Ther 7, 151-7 (2002).

9. Beerenwinkel, N. et al. Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. Proc Natl Acad Sci USA 99, 8271-6 (2002).

10. Cook, N.R., Zee, R.Y. & Ridker, P.M. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med 23, 1439-53 (2004).

11. Alexe, G. et al. A Robust Meta-classification Strategy for Cancer Diagnosis from Gene Expression Data. Proc IEEE Comput Syst Bioinform Conf, 322-5 (2005).

12. Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science (1998). [http://www.ics.uci.edu/~mlearn/MLRepository.html].

13. Liu, B., Hsu, W. & Ma, Y. Integrating classification and association rule mining, in KDD'98 pp.80-86 (New York, NY, 1998).

14. Li, W., Han, J. & Pei, J. CMAR: Accurate and efficient classification based on multiple class-association rules, in ICDM'01 pp.369-376 (San Jose, CA, 2001).

15. Chae, Y.M., Ho, S.H., Cho, K.W., Lee, D.H. & Ji, S.H. Data mining approach to policy analysis in a health insurance domain. Int J Med Inform 62, 103-11 (2001).

16. Rhee, S.Y. et al. HIV-1 Protease and reverse-transcriptase mutations: correlations with antiretroviral therapy in subtype B isolates and implications for drug-resistance surveillance. J Infect Dis 192, 456-65 (2005).


A SYSTEMS BIOLOGY CASE STUDY OF OVARIAN CANCER DRUG RESISTANCE

Jake Y. Chen 1,2,*, Changyu Shen 3, Zhong Yan 1, Dawn P. G. Brown 4, Mu Wang 4

1 Indiana University School of Informatics, IUPUI, Indianapolis, IN 46202; 2 Department of Computer and Information Science, Purdue University School of Science, IUPUI, Indianapolis, IN 46202; 3 Division of Biostatistics, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202; 4 Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202

* Corresponding author. Email: [email protected]

In ovarian cancer treatment, the chemotherapy drug cisplatin often induces drug resistance after prolonged use, causing cancer relapse and the eventual deaths of patients. Cisplatin-induced drug resistance is known to involve a complex set of cellular changes, but its molecular mechanism(s) remain unclear. In this study, we designed a systems biology approach to examine global protein-level and network-level changes by comparing proteomics profiles between cisplatin-resistant and cisplatin-sensitive cell lines. First, we used an experimental proteomics method based on a label-free Liquid Chromatography/Mass Spectrometry (LC/MS) platform to obtain a list of 119 proteins that are differentially expressed in the samples. Second, we expanded these proteins into a cisplatin-resistance-activated sub-network, which consists of 1,230 proteins in 1,723 protein interactions. An examination of network topology features reveals that the activated responses in the network are closely coupled. Third, we examined the sub-network proteins using Gene Ontology categories and found significant enrichment of proton-transporting ATPase and ATP synthase complexes, in addition to protein-binding proteins. Fourth, we examined sub-network protein interaction function categories using 2-dimensional visualization matrices. We found that significant cellular physiological responses arise from endogenous, abiotic, and stress-related signals, which correlates well with the known facts that internalized cisplatin causes DNA damage and induces cell stress. Fifth and finally, we developed a new visual representation structure for the display of activated sub-networks, using functional categories as network nodes and their cross-talk as network edges. This type of sub-network further shows that while cell communication and cell growth are generally important to tumor mechanisms, molecular regulation of cell differentiation and development caused by responses to genome-wide stress seems to be more relevant to the acquisition of drug resistance.

1. INTRODUCTION

The study of cancer drug resistance entails great challenges and opportunities for cancer chemotherapy. Each year in the United States, approximately 23,000 women are diagnosed with ovarian cancer and approximately 15,000 die from it, making ovarian cancer rank second only to breast cancer by the total number of new patients, and first by the total number of deaths, among all gynecological cancer cases. Platinum-based chemotherapy, usually with the cancer drug cisplatin, has been the primary treatment for ovarian cancer 1. Cisplatin is known to bind DNA to form cisplatin-DNA adducts and can therefore inhibit DNA replication and/or transcription in cancer cells 2. At the beginning of such chemotherapeutic treatments, patients are usually responsive. As the treatment continues for six to twelve months, however, many patients relapse and eventually die from the spread of new tumors that are refractory to further cisplatin treatment, even at large toxic doses 3. In cisplatin-resistant cancer cells, compared with the initially cisplatin-sensitive cancer cells, researchers have observed a diverse range of cellular changes, including decreased drug accumulation, increased cellular glutathione, and enhanced DNA repair capacity 3,4. Recent advances in microarray technology have enabled researchers to identify global differential gene expression patterns between cisplatin-resistant and cisplatin-sensitive cell lines 5. Notable examples of differentially expressed genes are DNA repair-related genes (BRCA1, DNA-PK, and ERCC1), P-glycoprotein genes, genes encoding heat shock proteins (HSP27, HSP70), and copper transport protein-encoding genes (CTR). Despite this progress, the molecular mechanism(s) of acquired cisplatin resistance remain unclear. This difficulty reflects a general challenge in applying a single genomics or functional genomics technology platform to complex disease biology problems. For example, it is generally known that there is a low correlation (15-25%) between global gene expression levels and protein expression levels in higher eukaryotes, including cancer cells 6. A very low correlation (merely above the random level) between global gene co-expression and protein-protein interaction pairs has also been observed 7. Such evidence suggests that the control mechanisms for intracellular signaling networks may not be fully understood at the gene expression level. A holistic approach to collecting and interpreting global signal changes beyond the gene expression level would provide significant additional insights, which in turn could lead to the development of novel therapeutic strategies for cancer.

In this work, we adopt a systems biology approach to the study of ovarian cancer drug resistance molecular mechanisms. By systems biology, we refer to the simultaneous experimental measurement of global molecular expression data in perturbed cells and the computational interpretation of these data at the molecular interaction network level or above (adapted and expanded from 8). We consider systems biology a holistic study approach distinct from a pure Omics or bioinformatics-based method, and we summarize its characteristics as follows. 1) Functional genomics or proteomics experiments should be performed on specific biological conditions, driven by a specific biological hypothesis. 2) The experimental Omics data derived should be interpreted in the context of existing genomics knowledge, based on integrated annotated genomics databases and integrated bioinformatics data analysis methods. 3) High-level knowledge structures at the molecular network or pathway level must be derived.

Our systems biology study of cisplatin-resistant ovarian cancer consists of three elements. The first element is an experimental proteomics study, using a label-free Liquid Chromatography/Mass Spectrometry (LC/MS)-based technology platform (see Methods) to identify differentially expressed proteins in replicated cisplatin-resistant vs. cisplatin-sensitive ovarian cell line samples. The second element is the annotation of the experimental proteomics results using a human genome annotation database (from Gene Ontology 9), a human protein interaction network database (from OPHID 10), and integrated statistical data analysis, network analysis, and information visualization methods. The third element is the representation of our discovery results as a network of activated biological processes that summarizes the underlying complex protein interaction networks. These elements interweave, for the first time, into a coherent functional picture of the global changes in ovarian cancer cisplatin-resistant cells.

Our work is significant in several ways. First, LC/MS-based proteomics results comparing cisplatin-resistant with cisplatin-sensitive ovarian cancer cells have not been previously reported, and many new proteins involved in the drug resistance were identified. Second, we described and used a set of novel systems biology informatics methods and, in particular, one that can identify "significantly interacting protein categories", distinct from previous work using GO annotations for gene classification in microarray results analysis 11. Collectively, these methods can be generalized to enable other similar systems biology studies, in which statistically significant experimental Omics results, public protein interactome data, and genome/proteome annotation databases are integrated into an easy-to-interpret 2-dimensional visualization matrix. Third, we developed a unique molecular network visual representation scheme based on automatically data-mined significant biological process categories and significant between-category interactions. This representation scheme can enable bioinformatics scientists to summarize and infer essential relationships between networked proteins in many application domains beyond the case study in this work.

2. METHODS

2.1. Proteomics Methods

A2780 and 2008 cisplatin-sensitive human ovarian cancer cell lines and their resistant counterparts, A2780/CP and 2008/C13*5.25, were used in this study. Proteins were prepared and subjected to LC/MS/MS analysis as previously described 12. There were two groups (two different parent cell lines), six samples per cell line, and two HPLC injections per sample. Samples were run on a Surveyor HPLC (ThermoFinnigan) with a C18 microbore column (Zorbax 300SB-C18, 1 mm x 5 cm). All tryptic peptides (100 µL, or 20 µg) were injected onto the column in random order. Peptides were eluted with a linear gradient from 5 to 45% acetonitrile developed over 120 min at a flow rate of 50 µL/min, and the eluant was introduced into a ThermoFinnigan LTQ mass spectrometer. The data were collected in the triple-play mode, and the acquired data were filtered by proprietary software as described by Higgs et al.12. Database searching against the IPI human database and the NR-Homo sapiens database was carried out using the SEQUEST algorithm. Protein quantification was carried out using the LC/MS-based label-free proprietary protein quantification software licensed from Eli Lilly and Company 12. Briefly, once the raw files are acquired from the LTQ, all total ion chromatograms (TIC) are aligned by retention time. Each aligned peak must match in parent ion, charge state, daughter ions (MS/MS data), and retention time (within a 1-minute window); if any of these parameters do not match, the peak is disqualified from the quantification. The area under the curve (AUC) of each individually aligned peak was measured, normalized, and compared for relative abundance. All peak intensities are transformed to a log2 scale before quantile normalization 13. If multiple peptides have the same protein identification, their quantile-normalized log2 intensities are averaged to obtain the log2 protein intensity, the final quantity that is fit by a separate ANOVA statistical model for each protein: log2(intensity) = overall mean + group effect (fixed) + sample effect (random) + replicate effect (random). The group effect refers to the effect caused by the experimental conditions or treatments being evaluated; the sample effect captures the random effects from individual biological samples, including sample preparation; and the replicate effect refers to the random effects from replicate injections of the same sample. All injections were run in random order, and the instrument was operated by the same operator. The inverse log2 of each sample mean was computed to obtain the fold change between samples. A summary of the overall process is shown in Fig. 1.
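As an illustration of the per-protein model quoted above, the following sketch fits the fixed group effect with a random sample effect using statsmodels. In this simplified version the replicate effect is folded into the residual term, and the data-frame layout and group labels are assumptions, not the paper's actual (proprietary) software.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_protein_anova(df: pd.DataFrame):
    """df: one row per injection for a single protein, with columns
    log2_intensity, group (cell line condition), and sample (biological sample)."""
    # Fixed group effect plus a random intercept per biological sample; the
    # replicate (injection) effect is absorbed into the residual in this sketch.
    model = smf.mixedlm("log2_intensity ~ C(group)", df, groups=df["sample"])
    result = model.fit()
    # Fold change on the raw scale is the inverse log2 of the group difference.
    log2_diff = result.params["C(group)[T.resistant]"]  # assumes group labels
    return result, 2.0 ** log2_diff
```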

[Figure: LC/MS workflow from protein digestion with enzyme, through separation of peptides on a microbore HPLC (1 mm C18 column, flow rate 50 µL/min), to mass determination and MS/MS fragmentation on an ion trap mass spectrometer, followed by systems biology data analysis.]

Figure 1. Schematic presentation of LC/MS-based protein quantitative analysis of complex biological samples.

2.2. Preparation of Data Sets

Proteins Differentially Expressed in Cisplatin-Resistant vs. Cisplatin-Sensitive Ovarian Cancer Cells. Our experimental proteomics platform generated 574 differentially expressed proteins (with q-value <= 0.10; both up- and down-regulated) or 141 proteins (with q-value <= 0.05), all identified with International Protein Index (IPI) database IDs. We converted these IPI identifiers into UniProt IDs in order to integrate this data set with all other annotated public data. 119 of the 141 proteins (0.05 q-value threshold) were successfully mapped and converted, using the IPI database 14 downloaded in March 2006, the UniProt database 15 downloaded in November 2005, and additional internally curated public database mapping tables. Similarly, 451 of the 574 proteins at the less restrictive threshold (q-value <= 0.10) were mapped from IPI IDs to UniProt IDs.
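A minimal sketch of this ID conversion step might look as follows, assuming a two-column IPI-to-UniProt mapping table distilled from the database releases named above; the file name and column names are hypothetical.

```python
import pandas as pd

def map_ipi_to_uniprot(ipi_ids, mapping_file="ipi_to_uniprot.tsv"):
    """Return a dict of successfully mapped IDs plus the list of unmapped IDs."""
    mapping = pd.read_csv(mapping_file, sep="\t")  # columns: ipi, uniprot
    hits = mapping[mapping["ipi"].isin(ipi_ids)]
    mapped = dict(zip(hits["ipi"], hits["uniprot"]))
    unmapped = sorted(set(ipi_ids) - set(mapped))  # e.g. 141 seeds -> 119 mapped
    return mapped, unmapped
```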

Human Protein Interactome Data. The primary source of human data is the Online Predicted Human Interaction Database (OPHID) 10, downloaded in February 2006. It contains 47,213 human protein interactions among 10,577 proteins identified by UniProt accession numbers. After mapping the proteins in OPHID to UniProt IDs, we recorded 46,556 unique protein interactions among 9,959 proteins. Note that even though more than half of OPHID consists of interacting protein pairs inferred from available lower organisms onto their human orthologous protein pair counterparts, the statistical significance of these predicted human interactions was confirmed by additional evidence according to OPHID and partially cross-validated in our previous experience 16. We assigned a heuristic interaction confidence score to each protein interaction, based on the types and sources of the interactions recorded in OPHID, according to a method described in 16.

Human Protein Annotation Data. Gene Ontology (GO) classification data 9 were downloaded from Geneontology.org in January 2006 and were used as the primary source of protein annotation for this study. Human proteome GO annotation was further performed based on human gene GO annotation from NCBI and human gene ID to protein UniProt ID mappings curated internally.

Human Interacting Protein Categorical Annotation Data. Each GO term from the human protein annotation data was annotated with its minimal GO level number in the GO term hierarchy. Each GO term's higher-level parent GO terms (multiple parent GO terms are possible) up to GO level 1 (three GO terms at this level: molecular function, cellular component, and biological process) were also traced and recorded in an internally curated GO annotation table. When calculating interacting protein GO category information, we use this internally curated GO term table to map all the low-level GO term IDs (original GO term IDs) used to annotate each protein to their high-level GO term IDs (folded GO term IDs). For this study, we designate that all folded GO term IDs be at GO term hierarchy level 3. Note that our method intentionally allows multiple GO annotation term IDs (original or folded) to be generated for each protein ID; it is therefore possible for a protein or a protein interaction pair to appear in more than one folded GO term category or in more than one folded GO term interacting category pair.
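The folding procedure can be sketched as follows, assuming the GO hierarchy has been loaded into a dictionary mapping each term to its set of parent terms; this is an illustrative reimplementation, not the internally curated table itself.

```python
def go_level(term, parents, cache=None):
    """Minimal depth of a GO term: the three root terms are at level 1."""
    if cache is None:
        cache = {}
    if not parents.get(term):
        return 1
    if term not in cache:
        cache[term] = 1 + min(go_level(p, parents, cache) for p in parents[term])
    return cache[term]

def fold_to_level3(annotated_terms, parents):
    """Map each annotated GO term to all of its level-3 ancestors; a term may
    intentionally fold into several level-3 categories."""
    folded, stack, cache = set(), list(annotated_terms), {}
    while stack:
        term = stack.pop()
        level = go_level(term, parents, cache)
        if level == 3:
            folded.add(term)
        elif level > 3:
            stack.extend(parents.get(term, ()))  # climb toward level 3
    return folded
```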

2.3. Network Expansion

We derive the ovarian cancer drug resistance-related differentially expressed protein interaction sub-network using a nearest-neighbor expansion method described in 16. We call the original list of differentially expressed proteins (119 proteins) seed (S) proteins and all the protein interactions among them seed interactions (or S-S type interactions). After expansion, we call the collection of seed proteins and expanded non-seed (N) proteins the sub-network proteins (including both S and N proteins), and the collection of seed interactions and expanded seed-to-non-seed interactions (or S-N type interactions) the sub-network protein interactions (including both S-S and S-N type interactions). Note that we do not include non-seed-to-non-seed protein interactions (or N-N type interactions) in our definition of the sub-network, primarily because N-N type interactions often outnumber the total S-S and S-N type interactions severalfold, with a molecular network context often not tightly related to the initial seed proteins and seed interactions. The only occasion on which we consider N-N type interactions is when we calculate sub-network properties such as node degrees for proteins in the sub-network.
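A minimal sketch of this expansion, assuming `interactions` is an iterable of UniProt ID pairs from OPHID and `seeds` is the set of 119 seed proteins:

```python
def expand_subnetwork(interactions, seeds):
    """Keep S-S and S-N interactions; drop N-N interactions."""
    seeds = set(seeds)
    ss_edges, sn_edges = [], []
    for a, b in interactions:
        if a in seeds and b in seeds:
            ss_edges.append((a, b))   # seed-to-seed (S-S)
        elif a in seeds or b in seeds:
            sn_edges.append((a, b))   # seed-to-non-seed (S-N)
    sub_proteins = {p for edge in ss_edges + sn_edges for p in edge}
    return sub_proteins, ss_edges, sn_edges
```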

2.4. Visualization

We use Spotfire DecisionSite Browser 7.2 to implement the 2-dimensional functional categorical cross-talk matrix. To perform interaction network visualization, we used ProteoLens 17. ProteoLens has native built-in support for relational database access and manipulation. It allows expert users to browse database schemas and tables, query relational data using SQL, and customize data fields to be visualized as graphical annotations in the visualized network.

2.5. Network Statistical Examination

Since the seed proteins are those found to display different abundance levels between the two cell lines via mass spectrometry, one would expect the network "induced" by them to be more "connected", in the sense that the proteins are to a certain extent related to the same biological process(es). To gauge network "connectivity", we introduce several basic concepts. We define a path between two proteins A and B as a set of proteins P1, P2, ..., Pn such that A interacts with P1, P1 interacts with P2, ..., and Pn interacts with B; if A directly interacts with B, the path is the empty set. We define the largest connected component of a network as the largest subset of proteins such that there is at least one path between any pair of proteins in the subset. We define the index of aggregation of a network as the ratio of the size of the largest connected component of the network to the size of the network by protein count; the higher the index of aggregation, the more "connected" the network. Lastly, we define the index of separation of a sub-network as the ratio of S-S type interactions over all sub-network interactions. A high index of separation in a network represents extensive "re-discovery" of seed proteins after the protein interactions are expanded from the seed proteins.

To examine the statistical significance of the observed index of aggregation and index of separation in expanded protein networks, we measure how likely the topology of the observed sub-network would be under random selection of seed proteins. This is done by randomly selecting 119 proteins, identifying the induced/expanded sub-network, and calculating the sub-network indexes accordingly. The same procedure is repeated n = 1000 times to generate the distribution of the indexes under random sampling, against which the observed values are compared to obtain significance levels (for details, refer to 18).
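The two indexes and the permutation test can be sketched with networkx as follows, reusing the expand_subnetwork() sketch from Section 2.3; the exact procedure of 18 may differ in detail.

```python
import random
import networkx as nx

def subnetwork_indexes(interactions, seeds):
    proteins, ss, sn = expand_subnetwork(interactions, seeds)
    graph = nx.Graph(ss + sn)
    largest = max(nx.connected_components(graph), key=len)
    aggregation = len(largest) / len(proteins)   # e.g. 1193/1230 = 97.0%
    separation = len(ss) / len(ss + sn)          # e.g. 17/1723
    return aggregation, separation

def permutation_pvalues(interactions, all_proteins, observed, n_iter=1000):
    """Upper-tail p-values against random seed sets of the same size (119)."""
    agg_obs, sep_obs = observed
    agg_hits = sep_hits = 0
    for _ in range(n_iter):
        random_seeds = random.sample(sorted(all_proteins), 119)
        agg, sep = subnetwork_indexes(interactions, random_seeds)
        agg_hits += agg >= agg_obs
        sep_hits += sep >= sep_obs
    return agg_hits / n_iter, sep_hits / n_iter
```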

2.6. Significance of interacting Protein Categories

To assess the statistical significance of the number of pairs in the sub-network that fall in a specific function category, we treat it as the outcome of a random draw of 1723 pairs from the pool of 46556 pairs in OPHID, so that the count follows a hypergeometric distribution. A p-value is calculated from the hypergeometric distribution to evaluate the likelihood of observing, under random selection of 1723 pairs, an outcome at least as "extreme" as the one we observed; "extreme" implies either an unusually large (over-representation) or an unusually small (under-representation) number. Let x be the count of pairs that fall in a function category in the sub-network, n = 1723, N = 46556, and k = the corresponding count in OPHID; then the p-values for over- and under-representation of the observed count can be calculated as:

Over-representation:

$$p = \Pr[X \ge x \mid n, N, k] = \sum_{i=x}^{\min(n,k)} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}$$

Under-representation:

$$p = \Pr[X \le x \mid n, N, k] = \sum_{i=0}^{x} \frac{\binom{k}{i}\binom{N-k}{n-i}}{\binom{N}{n}}$$

Since tests of over-/under-representation of various

categories are correlated with one another (over-representation of one category could imply under-representation of other categories), we also control the false discovery rate (FDR) using the method developed by Benjamini and Yekutieli 19.
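In modern terms, both tail probabilities and the Benjamini-Yekutieli adjustment can be computed as in the following sketch; scipy and statsmodels are our tool choices here, not tools named by the paper.

```python
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

N, n = 46556, 1723   # interaction pairs in OPHID / in the sub-network

def representation_pvalues(x, k):
    """x: pairs in the category within the sub-network; k: pairs in OPHID."""
    p_over = hypergeom.sf(x - 1, N, k, n)   # P[X >= x], over-representation
    p_under = hypergeom.cdf(x, N, k, n)     # P[X <= x], under-representation
    return p_over, p_under

# FDR control across all tested categories (Benjamini-Yekutieli):
# reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")
```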

3. RESULTS

3.1. Activated Protein Interaction Subnetwork Properties

We examined the network topology of the protein interaction sub-network expanded from the seed proteins. Recall that seed proteins are the significantly differentially expressed proteins derived from the LC/MS proteomics experiments comparing cisplatin-resistant with cisplatin-sensitive cell line samples (see Methods). The resulting protein interaction sub-network consists of 1,230 seed and non-seed proteins in 1,723 sub-network interactions (17 S-S type and 1,706 S-N type protein interactions). We call this protein interaction sub-network the "core sub-network" to distinguish it from the "full sub-network" (which additionally includes all N-N type protein interactions), and plot their node degree frequency distributions in Fig. 2, where the whole human protein interaction network from OPHID (labeled "network") is also shown. As expected, both the network and the full sub-network display a good "scale-free" property. The result also shows that the cisplatin-resistance-activated full sub-network contains more "hub" than "peripheral" proteins, forming a cohesive functional sub-network. The core sub-network, while perhaps limited in size, begins to show a "scale-free-like" distribution, although hubs in the core sub-network are more distinctively identifiable than the overly abundant peripheral nodes by their high node degree counts.

[Figure: log-log plot of node degree distributions for the sub-network (core), sub-network (full), and whole network; x-axis: node degree in network.]

Figure 2. Node degree distribution of the sub-networks (core or full) in comparison with the human protein interaction network.

We also examined other network features of the core sub-network. The sub-network consists of 1230 proteins with 1723 interactions, and its largest connected component (defined in the Methods section) contains 1193 proteins. The index of aggregation is therefore 1193/1230 = 97.0%. The index of separation, the percentage of S-S type interactions (17) among all core sub-network interactions (1723), is 17/1723 = 0.99%. The index of aggregation has a p-value less than 0.001 (upper tail) and the index of separation a p-value of 0.06 (upper tail). A significant, though not exceptionally high, index of aggregation suggests that the core sub-network has connectivity structures that are not random in nature. This correlates well with the node degree distribution in Fig. 2, where an exceptionally large number of hubs are shown to exist. A relatively high (though not significant) index of separation after expansion suggests that the 119 seed proteins may be tightly related, an observation consistent with the assumption that the majority of the connected proteins participate in a few shared biological pathways defined by the activated sub-network.


3.2. Analysis of Activated Protein Functional Category Distributions

We are interested in discovering enriched protein functional categories among the differentially expressed seed proteins and their immediate interaction partners in the sub-network. Note that this enrichment includes response ("activated") proteins that are either up-regulated or down-regulated in the proteomics experiment. Although the up-/down-regulation detail can be essential for establishing regulatory network models, we are more interested in proteins and protein groups that are "activated" in the cisplatin-response functional process than in those that are not; we therefore choose not to differentiate the regulation detail in this study.

Although GO-based functional category analysis can be done routinely with many existing bioinformatics methods 11, the inclusion of protein interaction network context has not been previously described. Here, we are interested in broad pathway-related changes in the sub-network, even at the cost of failing to detect the significant appearance of some isolated proteins in the initial seed protein set. In Table 1, we show GO categories that are significantly enriched or impoverished in the sub-network. The 17 GO categories are filtered from the 70 GO categories (data not shown) in which sub-network proteins have GO annotations. The filter criteria are that 1) the p-value for over- or under-representation must be within 0.05, and 2) the total count of the GO category in the whole network must be greater than 10. In the GO_TERM column, we list three types of information: the level-3 GO term; the GO term category type ('C' for cellular component, 'F' for molecular function, and 'P' for biological process; in parentheses, preceding the dash); and the GO identifier (the seven-digit number following the dash in parentheses). In the ENRICHMENT column, we list two counts of proteins whose GO annotations fall in the corresponding category: within the core sub-network, and within the whole network (in parentheses). In the PVALUE column, we list two numbers separated by a '/': the p-values from significance tests of over- and under-representation of the observed GO term category count in the sub-network. In the last CONCLUSION column, we use symbols to summarize the test results: '++' to indicate significant over-representation with the false discovery rate (FDR) controlled at 0.05, '--' to indicate significant under-representation with the FDR controlled at 0.05, '+'

Table 1. Summarized result for observed proteomics-level changes in their sub-network context when comparing cisplatin-resistant with cisplatin-sensitive ovarian cancer cells. Only statistically significant changes are shown (see text for explanations of symbols).

GO TERM | ENRICHMENT | PVALUE (OVER/UNDER) | CONCLUSION
proton-transporting ATP synthase complex (C-0045259) | 8 (56) | 0/1 | ++
proton-transporting two-sector ATPase complex (C-0016469) | 4 (22) | .0001/1 | ++
proteasome complex (sensu Eukaryota) (C-0000502) | 4 (66) | .0079/.9989 | +
organelle lumen (C-0043233) | 2 (13) | .0101/.9996 | +
myosin (C-0016459) | 2 (18) | .0191/.9988 | +
membrane (C-0016020) | 5 (1848) | 1/0 | --
protein binding (F-0005515) | 31 (1412) | .0004/.9998 | ++
drug binding (F-0008144) | 2 (13) | .0101/.9996 | +
isomerase activity (F-0016853) | 3 (41) | .0127/.9986 | +
transferase activity (F-0016740) | 8 (338) | .0493/.9802 | +
receptor binding (F-0005102) | 2 (944) | .9999/.0006 | -
metabolism (P-0008152) | 13 (556) | .0152/.9936 | +
regulation of viral life cycle (P-0050792) | 2 (20) | .0234/.9984 | +
cellular physiological process (P-0050875) | 64 (4496) | .0353/.9768 | +
detection of stimulus (P-0051606) | 2 (30) | .0496/.9947 | +
cell communication (P-0007154) | 15 (1881) | .9747/.0452 | -
organismal physiological process (P-0050874) | 5 (928) | .9892/.0289 | -


to indicate over-representation that is not significant with the FDR controlled at 0.05 but is significant at a native p-value of 0.05, and '-' to indicate under-representation that is not significant with the FDR controlled at 0.05 but is significant at a native p-value of 0.05.

From the above table, we can obtain the following insights. First, there are abnormally high levels of proton-transporting ATP synthase and ATPase production in the cell, suggesting unusually high oxidative energy production capability in cisplatin-resistant cell lines relative to cisplatin-sensitive cell lines. Second, although the protein interaction network is inherently enriched with proteins with "protein binding" capabilities (note the 1412 proteins in this category in the whole network), the cisplatin-resistant cell line demonstrated an unusually high level of protein-binding activity. This suggests that intracellular signaling cascades, not intercellular signaling (note the under-representation of the "cell communication" category), are positively correlated with cisplatin resistance. Third, the data suggest that the biological activities of the cisplatin-resistance response take place in the cytoplasm or nucleus rather than on the "membrane". This analysis gives essential clues to the overall picture of molecular signaling events in cisplatin-resistant cell lines.

[Figure: the cross-talk matrix rows and columns are level-3 biological process categories, including response to stress, response to external stimulus, response to endogenous stimulus, response to biotic stimulus, response to abiotic stimulus, regulation of viral life cycle, regulation of physiological process, regulation of development, regulation of cellular process, organismal physiological process, organ development, metabolism, locomotory behavior, localization, cellular physiological process, and cell communication.]

Figure 3. Cross-talk between related biological processes in Cisplatin-resistant Cell lines.


We also obtained additional categorical enrichment data from different GO levels (not shown here due to space constraints).

3.3. Functional Category Cross-talks

We developed a 2-dimensional visualization matrix (extended from our technique described in 20) to show significant cross-talk between GO categories in Fig. 3 (only biological processes at level 3 are shown, due to space constraints). The size of each node is inversely proportional to the p-value of the interacting categories. The color legend is: red (dark) for interacting categories that are significant with the FDR controlled at 0.05, and gray (light) for interacting categories that are not. The figure reveals additional interesting findings. First, cellular physiological processes are significantly activated in drug-resistant cell lines (the largest and reddest dot, at the bottom left corner). This could lead to further drill-down of the protein interactions in this interacting category for biological validation (preliminary results; not shown). Second, these cellular physiological processes seem to be quite selective rather than comprehensive. For example, among the significant regulation-of-cellular-response categories, the significant cross-talk functional patterns strongly suggest that the cellular and physiological responses arise from endogenous, abiotic, and stress-related signals (internalized cisplatin causing DNA damage and inducing cell stress). Using a cross-talk matrix such as this, cancer biologists can quickly filter out insignificant secondary responses (such as the cell growth and cell development categories shown) to establish new prioritized hypotheses to test.
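For readers without Spotfire, the same encoding can be approximated in a few lines of matplotlib; the tuple layout of `crosstalk` below is an assumption for illustration.

```python
import math
import matplotlib.pyplot as plt

def plot_crosstalk(crosstalk):
    """crosstalk: list of (category_a, category_b, p_value, fdr_significant)."""
    cats = sorted({c for a, b, _, _ in crosstalk for c in (a, b)})
    index = {c: i for i, c in enumerate(cats)}
    xs = [index[a] for a, _, _, _ in crosstalk]
    ys = [index[b] for _, b, _, _ in crosstalk]
    # Dot size grows as the p-value shrinks (capped near p = 0).
    sizes = [min(300.0, 20.0 * -math.log10(max(p, 1e-10)))
             for _, _, p, _ in crosstalk]
    colors = ["red" if sig else "lightgray" for _, _, _, sig in crosstalk]
    plt.scatter(xs, ys, s=sizes, c=colors)
    plt.xticks(range(len(cats)), cats, rotation=90)
    plt.yticks(range(len(cats)), cats)
    plt.tight_layout()
    plt.show()
```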

3.4. Visualization of the Activated Interaction Functional Sub-network

In Fig. 4, we show a visualization of the activated biological process functional network, produced with the recently developed software tool ProteoLens 17.


Figure 4. Overview of the activated biological process functional network in cisplatin-resistant ovarian cancer cells.


ProteoLens is a biological network data mining and annotation platform that supports standard GML files and relational data in the Oracle Database Management System (for additional details, visit http://www.proteolens.org). In the figure, in contrast with a regular protein interaction network, we encode nodes as significantly over-/under-represented protein functional categories and edges as significantly interacting protein functional categories. Several additional information types are also represented. The p-values of interacting categories are inversely proportional to the thickness of the edges, while the FDR = 0.05 interacting-category significance flags are indicated by line color: red (dark) for "significant" and blue (light) for "not significant". The original abundance (by count) of each functional category is encoded in node size. The p-value of each activated protein category's significance in the sub-network is encoded as node color intensity, on a scale from light yellow (less significant) to dark red (more significant).
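The same visual encodings can be reproduced outside ProteoLens; the sketch below uses networkx and matplotlib, with hypothetical input structures (`category_nodes` mapping a category to its protein count and p-value, and `category_edges` listing interacting-category test results).

```python
import math
import networkx as nx
import matplotlib.pyplot as plt

def draw_category_network(category_nodes, category_edges):
    graph = nx.Graph()
    for cat, (count, p_value) in category_nodes.items():
        graph.add_node(cat, count=count, p=p_value)
    for a, b, p_value, significant in category_edges:
        graph.add_edge(a, b, weight=-math.log10(max(p_value, 1e-10)),
                       sig=significant)
    pos = nx.spring_layout(graph, seed=0)
    node_sizes = [50 + 5 * graph.nodes[c]["count"] for c in graph]   # abundance
    node_colors = [-math.log10(max(graph.nodes[c]["p"], 1e-10))
                   for c in graph]                                   # significance
    edge_widths = [graph.edges[e]["weight"] for e in graph.edges]
    edge_colors = ["red" if graph.edges[e]["sig"] else "blue"
                   for e in graph.edges]
    nx.draw_networkx(graph, pos, node_size=node_sizes, node_color=node_colors,
                     cmap=plt.cm.YlOrRd, width=edge_widths,
                     edge_color=edge_colors, font_size=7)
    plt.show()
```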

The resulting categorical network is both novel and informative. First, we confirm that cisplatin-resistant ovarian cancer cells demonstrate significant changes in the cell's overall physiological processes. These processes are first and foremost connected to the cancer cell's native response to stimuli that are endogenous, abiotic, and stress-related, as opposed to exogenous, biotic, and related to tissue development. Second, while cell communication and cell growth are generally important to tumor mechanisms, we observe that molecular regulation of cell differentiation and development seems to be more relevant to the acquisition of drug resistance, based on examination of this activated functional category interaction sub-network. Third, interestingly, we observe that regulation of the viral life cycle also plays a very significant role in the overall drug resistance process. This unexpected observation may be further examined at the protein level to formulate hypotheses about acquired cisplatin resistance in ovarian cancer.

4. DISCUSSION

In this study, we showed that the key to interpreting Omics data is a systems biology approach that is both hypothesis-driven and data-driven, with the ultimate goal of integrating multi-dimensional biological signals at the molecular signaling network level. This is essential to unleashing the powerful potential of proteomics techniques, which have become a very powerful and efficient methodology in recent years for analyzing thousands of proteins on the basis of differences in their expression levels and post-translational modifications. In our systems biology approach, we integrate data from existing Omics databases and assemble different statistics as different visual cues on intuitive information visualization platforms. Our use of the functional 2-dimensional matrix and the interaction category network is innovative. All of these contributed to biological observations that clearly demonstrate that cellular responses to genomic stress, in this case a DNA-damaging agent, are tightly associated with cisplatin resistance. Evidence for molecular regulation of cell differentiation and development also provides insight into the underlying mechanisms of cisplatin resistance in ovarian cancer cells. Further examination of the filtered subset of proteins in the significantly detected functional categories and their cross-talks provides opportunities to modulate proteins involved in these biological processes/pathways in order to re-sensitize cisplatin-resistant ovarian cancer cells.

We plan to conduct further studies to generate a ranked list of proteins that significantly participate, as a group, in the overall biological process of ovarian cancer drug resistance. We believe that the prioritized validation of these proteins in subsequent steps will move us closer to finding the molecular mechanism(s) that cause ovarian cancer cells to become resistant to platinum-based chemotherapy. Meanwhile, we plan to collect microarray data and conduct similar experiments to determine the differences between the "Omics" results. We also plan to apply these systems biology approaches in other biological application domains.

Acknowledgments

This work was supported in part by a grant provided by Indiana University Purdue University Indianapolis to Dr. Jake Chen and by computer systems obtained by Indiana University through its relationship with Sun Microsystems Inc. as a Sun Center of Excellence. We thank Stephanie Burks for maintaining the high-end Sun servers and Oracle 10g servers. We also thank Jason Sisk and Kimberly Melluck from the Indiana University School of Informatics for maintaining the Windows and Linux client computers on which this study was partially conducted.


References

1. Yamamoto, K., Okamoto, A., Isonishi, S., Ochiai, K. and Ohtake, Y. (2001) Heat shock protein 27 was up-regulated in cisplatin resistant human ovarian tumor cell line and associated with the cisplatin resistance. Cancer Lett, 168, 173-81.

2. Zamble, D.B. and Lippard, S.J. (1995) Cisplatin and DNA repair in cancer chemotherapy. Trends Biochem Sci, 20,435-9.

3. Auersperg, N., Edelson, M.I., Mok, S.C., Johnson, S.W. and Hamilton, T.C. (1998) The biology of ovarian cancer. Semin Oncol, 25, 281-304.

4. Johnson, S.W., Laub, P.B., Beesley, J.S., Ozols, R.F. and Hamilton, T.C. (1997) Increased platinum-DNA damage tolerance is associated with cisplatin resistance and cross-resistance to various chemotherapeutic agents in unrelated human ovarian cancer cell lines. Cancer Res, 57,850-6.

5. Sakamoto, M., Kondo, A., Kawasaki, K., Goto, T., Sakamoto, H., Miyake, K., Koyamatsu, Y., Akiya, T., Iwabuchi, H., Muroya, T. et al. (2001) Analysis of gene expression profiles associated with cisplatin resistance in human ovarian cancer cell lines and tissues using cDNA microarray. Hum Cell, 14,305-15.

6. Chen, G., Gharib, T.G., Huang, C.C., Taylor, J.M., Misek, D.E., Kardia, S.L., Giordano, T.J., Iannettoni, M.D., Orringer, M.B., Hanash, S.M. et al. (2002) Discordant protein and mRNA expression in lung adenocarcinomas. Mol Cell Proteomics, 1,304-13.

7. Bhardwaj, N. and Lu, H. (2005) Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics, 21,2730-8.

8. Kitano, H. (2002) Systems biology: a brief overview. Science, 295, 1662-4.

9. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25,25-9.

10. Brown, K.R. and Jurisica, I. (2005) Online predicted human interaction database. Bioinformatics, 21,2076-82.

11. Pinto, F.R., Cowart, L.A., Hannun, Y.A., Rohrer, B. and Almeida, J.S. (2005) Local correlation of expression profiles with gene annotations—proof of concept for a general conciliatory method. Bioinformatics, 21,1037-45.

12. Higgs, R.E., Knierman, M.D., Gelfanova, V., Butler, J.P. and Hale, J.E. (2005) Comprehensive label-free method for the relative quantification of proteins from biological samples. J Proteome Res, 4,1442-50.

13. Bolstad, B.M., Irizarry, R.A., Astrand, M. and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185-93.

14. Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E. and Apweiler, R. (2004) The International Protein Index: an integrated database for proteomics experiments. Proteomics, 4, 1985-8.

15. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, 32, D115-9.

16. Chen, J.Y., Shen, C. and Sivachenko, A. (2006) Mining Alzheimer Disease Relevant Proteins from Integrated Protein Interactome Data. Pacific Symposium on Biocomputing '06. Maui, HI, Vol. 11, pp. 367-378.

17. Sivachenko, A., Chen, J. and Martin, C. (2005) ProteoLens: A Visual Data Mining Platform for Exploring Biological Networks (submitted). Bioinformatics.

18. Chen, J.Y., Pinkerton, S.L., Shen, C. and Wang, M. (2006) An Integrated Computational Proteomics Method to Extract Protein Targets for Fanconi Anemia Studies. 21st Annual ACM Symposium on Applied Computing. Dijon, France, Vol. 1, pp. 173-179.

19. Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29,1165-1188.

20. Chen, J.Y., Sivachenko, A.Y., Bell, R., Kurschner, C., Ota, I. and Sahasrabudhe, S. (2003) Initial Large-scale Exploration of Protein-protein Interactions in Human Brain. IEEE Computer Society Bioinformatics 2003. IEEE Computer Society Press, Stanford, California, pp. 229-234.


AUTHOR INDEX

Osman Abul 257
Altuna Akalin 331
Fatih Altiparmak 331
Srinivas Aluru 167
J. Arnold 191
Deepak Bandyopadhyay 227
S. M. Bhandarkar 191
Ewan Birney 17
Guy E. Blelloch 199
Daniel G. Brown 211
Dawn P. G. Brown 389
Drew H. Bryant 311
Dongbo Bu 353
Jinjin Cai 353
Liming Cai 99
Brian Y. Chen 311
Jake Y. Chen 389
Jianhui Chen 293
Runsheng Chen 353
Xin Chen 239
Hua-Sheng Chiu 31
Hua-Sheng Chiu 325
David Croft 17
Amanda E. Cruess 311
Xiangqin Cui 223
Peter D'Eustachio 17
Amar K. Das 385
Bernard de Bono 17
Zhidong Deng 269
Kevin W. DeRonne 19
Bruce Randall Donald 67
Finn Drabløs 257
Mark H. Ellisman 5
Michalis Faloutsos 299
Lin Feng 123
Hakan Ferhatosmanoglu 331
Viacheslav Y. Fofanov 311
Marvin E. Frazier 1
Robert Giegerich 111
Marc Gillespie 17
Randy Goebel 179
Gopal Gopinathrao 17
Mark R. Grant 381
Dan Gusfield 145
Xu Han 123
Ian M. Harrower 211
Jing He 89
Matthias Höchsmann 111
Thomas Höchsmann 111
Wen-Lian Hsu 31
Wen-Lian Hsu 325
Ping Hu 99
Jun Huan 227
Hung-Chung Huang 133
C. Anthony Hunt 381
Curtis Huttenhower 341
Trey Ideker 9
Eric Jakobsson
Bijay Jassal 17
Tao Jiang 239
Feng Jiao 43
Ananth Kalyanaraman 167
George Karypis 19
Lydia E. Kavraki 311
Sean H. J. Kim 381
Marek Kimmel 311
David M. Kristensen 311
Sudhir Kumar 293
Sudhir Kumar 335
Suzanna Lewis 17
Guojun Li 157
Qi Li 293
Shuguang Li 157
Wenyuan Li 133
Olivier Lichtarge 311
Guohui Lin 55
Guohui Lin 179
Ray S. Lin 385
Yu Lin 353
Chunmei Liu 99
Ying Liu 133
Allan Lo 31
Allan Lo 325
Stefano Lonardi 299
Ann Loraine 223
Yonggang Lu 89
Bin Ma 361
Russell L. Malmberg 99
Lisa Matthews 17
Simon Mercer 11
Mark A. Musen 3
Jean-Christophe Olivo-Marin 13
Alessandro Dal Palu 89
Yanxiong Peng 133
Gregory Pennington 371
Enrico Pontelli 89
Liviu Popescu 281
Jan Prins 227
Xingqin Qi 157
R. Ravi 199
Soo-Yon Rhee 385
Geir Kjetil Sandve 257
Esther Schmidt 17
Patrick S. Schnable 167
Dale Schuurmans 43
Russell Schwartz 199
Russell Schwartz 371
Stanley Shackney 371
Robert W. Shafer 385
Changyu Shen 389
Georgos Siganos 299
Charles A. Smith 371
Christina D. Smolke 15
Jack Snoeyink 79
Jack Snoeyink 227
Yinglei Song 99
Srinath Sridhar 199
Lincoln Stein 17
Chia-Yu Su 325
Shiwei Sun 353
Ting-Yi Sung 31
Ting-Yi Sung 325
Wing-Kin Sung 123
S. Tewari 191
Alexander Tropsha 227
Olga G. Troyanskaya 341
Imre Vastrik 17
Xiang Wan 55
Xiu-Feng Wan 179
Lincong Wang 67
Mu Wang 389
Wei Wang 227
Xueyi Wang 79
Guanming Wu 17
Xiaomeng Wu 179
Yufeng Wu 145
Xuhua Xia 335
Changjiang Xu 361
Jinbo Xu 43
Ying Xu 157
Zhong Yan 389
Qiaofeng Yang 299
Jieping Ye 293
Golan Yona 281
Liwen You 249
Chungong Yu 353
Libo Yu 43
Jingfen Zhang 353
Yi Zhang 269
Zhuo Zhang 353