genomes, man, and machines [guest editorial]



From the Guest Editor

Genomes, Man, and Machines

Bioinformatics may be defined as a discipline that generates computational tools, databases, and methods to support genomic and postgenomic research. It comprises the study of DNA structure and function, gene and protein expression, protein production, structure and function, genetic regulatory systems, and clinical applications. Advances in genomics and functional genomics are fundamental components of science in the new millennium. Since international efforts to sequence genomes began formally in 1990, outstanding technological achievements have been implemented with enormous implications for medicine.

Over the past 15 years, numerous innovations have supported the development of a new biological research paradigm, one that is information-heavy and computer-driven. Some of these advances include improved DNA sequencing methods, new approaches to identify protein structure, and revolutionary methods to monitor the expression of many genes in parallel. The design of techniques able to deal with different sources of incomplete and noisy data has become another crucial goal for the bioinformatics community. Moreover, there is the need to implement computational solutions based on theoretical frameworks to allow scientists to perform complex inferences about the phenomena under study.

This special issue focuses on the multidisciplinary challenges brought about by this information revolution. It consists of eight articles from leading authorities in their fields, which reflect key issues in both genomic and postgenomic research. The order in which the articles are organized aims to facilitate an overview of topics ranging from the biological and bioinformatics basis through the systems to the clinical applications.

Ursula Bond and colleagues (Trinity College Dublin) review a number of tools and approaches to performing genomic and postgenomic studies based on an important model organism: Saccharomyces cerevisiae. Genome expression profiling has become a promising approach to understanding the molecular dynamics of many biological and physiological processes. Paul Bertone and Mark Gerstein (Yale University) review some of the recent advances in this area, as well as relevant challenges for the development of integrative data mining systems. Michael Wendl and colleagues, researchers of the Washington University Genome Sequencing Center and collaborators of the Human Genome Project, provide us with a comprehensible description of some of the components, techniques, and tools involved in the processing of DNA sequence data. Gustavo Glusman and Doron Lancet, researching at the Weizmann Institute of Science and at the Institute for Systems Biology, discuss the GESTALT workbench. This is an integrative platform for the analysis of large-scale genomic data, which provides the user with advanced and intuitive visualization facilities.

Microbial genome identification and analysis are key tasks that provide the basis for much further research into the biology of those organisms. Daniel Dalevi and Siv Andersson (Uppsala University) discuss a computational approach to studying the dynamics of microbial genomes. There is also the need to provide scientists with effective techniques to represent, filter, and compress DNA data. Xin Chen (Peking University) and colleagues from the City University of Hong Kong and the University of Waterloo describe a powerful compression algorithm for DNA sequences. Diane Cook and colleagues (University of Texas at Arlington) report the implementation of a system able to perform three important data mining tasks: unsupervised pattern discovery, supervised concept learning, and hierarchical clustering on a protein database. The contribution from Berrar and co-workers (German Cancer Research Centre) deals with the design of an integrated database system. It brings together different types of genomic and clinical data in order to support comparative genomic hybridization analysis.

Finally, this introduction includes tables that give our readers a categorized collection of key resources in the rapidly evolving field of bioinformatics. Table 1 introduces some of the leading companies offering products and services in the areas of genomics, functional genomics, and medical applications. Table 2 includes key references to recently published special issues on genomic and postgenomic studies, as well as a selection of online courses and tutorials. A categorized list of fundamental databases is given in Table 3. Table 4 includes the URLs of prestigious public research centers around the world. A selection of renowned meetings, conferences, and other scientific events is shown in Table 5. For a more detailed view of these and other resources on bioinformatics, the reader is referred to the website Genomes & Machines (http://www.cs.tcd.ie/francisco.Azuaje/genomes&machines.html).

Due to space and time constraints we were not able to include additional relevant studies. Some of these developments include the reconstruction of evolutionary trees from genomic data and fundamental problems linked to database, machine learning, and data mining research.

We are indebted to all of the contributors to this special issue for their appreciable efforts in producing an enlightening collection of articles. I am particularly grateful to our former editor Dr. Alvin Wald for giving me the opportunity to collaborate as a guest editor and for his continuous advice. I thank our new editor Dr. Mark Wiederhold for his support throughout the final preparation of this special issue. We greatly appreciate the cooperation from the referees who kindly supported a rigorous peer-review process.

We hope that this special issue will motivate our readers to further learn and participate in this exciting multidisciplinary challenge.

Francisco Azuaje
Department of Computer Science,
Trinity College Dublin

IEEE ENGINEERING IN MEDICINE AND BIOLOGY, July/August 2001. 0739-5175/01/$10.00 ©2001 IEEE

Table 2. Special Issues, References, and Tutorials

Nature special issue on the human genome, vol. 409, pp. 813-958, 15 February 2001; and Nature insight on functional genomics, vol. 405, pp. 819-865, 15 June 2000.

http://www.nature.com

IEEE Computing in Science & Engineering, special issue on computational biology, May-June, 1999.

http://www.computer.org/cise/

National Center for Biotechnology Information: tutorials and databases.

http://www.ncbi.nlm.nih.gov/

Science special issue on the human genome, vol. 291, 16 February 2001; and Science functional genomics, a collection of online resources related to the human genome sequence.

http://www.sciencemag.org

IEEE Spectrum, special feature on genome sequencing and analysis, vol. 37, number 11, November 2000.

http://www.spectrum.ieee.org

Virtual Bacteria ID lab, Howard Hughes Medical Institute.

http://www.biointeractive.org

Getting Connected to the Postgenomic Era

Table 1. Hardware, Software, and Services Providers

Company | Activity | URL
Affymetrix | Expression chips and analysis | www.affymetrix.com
Agilent Technologies | Genome expression systems, software and hardware | www.agilent.com
Celera | Databases and genotyping | www.celera.com
Compugen | Proprietary and collaborative gene discovery | www.compugen.com
CuraGen | Drug-induced changes in gene expression | www.curagen.com
DoubleTwist.com | Internet portal, on-line access to tools and databases | www.doubletwist.com
eBioinformatics | Web-based bioinformatics tools and databases | www.ebioinformatics.com
Genaissance | Population genomics | www.genaissance.com
GeneLogic | Gene expression database products | www.genelogic.com
Genomica | Enterprise-wide bioinformatics systems and services | www.genomica.com
Genomics Institute | A division of Novartis, genomics and proteomics | www.gnf.org
IBM | Data mining and protein structure determination methods | www.research.ibm.com/topics/serious/bio/
Incyte | Databases and analysis tools | www.incyte.com
Informax | Desktop and enterprise-wide products | www.informaxinc.com
Lexicon Genetics | Biochips | www.lexgen.com
Lion Bioscience | Enterprise-wide systems and services | www.lionbioscience.com
Motorola Biochip | Expression arrays | www.motorola.com/biochipsystems/
Myriad Genetics | Therapeutic and diagnostic products based on genomic methods | www.myriad.com
Oxford Molecular | Bioinformatics systems and services | www.oxmol.co.uk
Rosetta Inpharmatics | Gene expression data acquisition for drug discovery applications | www.rosetta.com
Silicon Genetics | Gene expression analysis and visualization | www.sigenetics.com
SpotFire | Data visualization software for gene expression | www.spotfire.com
Structural GenomiX | Proteomics, structure prediction | www.stromix.com
Syrrx | Protein structure prediction | www.syrrx.com


Computing Life: The Challenge Ahead

by Leroy Hood

Institute for Systems Biology

The Human Genome Project has catalyzed the birth of the field of bioinformatics. In its broadest context, bioinformatics or computational biology is concerned with capturing, storing, analyzing, graphically displaying, modeling, and ultimately distributing biological information. Moreover, the Human Genome Project has encouraged a series of paradigm changes leading to the view that biology is an informational science [1].

• The draft of the human genome [2,3] has given us a genetics parts list of what is necessary for building a human: the 30,000-40,000 genes, their regulatory regions, a lexicon of motifs that are the building block components of proteins (and genes), and access to the human variability (polymorphisms) that makes us each different from one another.

• Genomics has triggered the development of high-throughput instrumentation for DNA sequencing, DNA arrays, genotyping, proteomics, etc. These instruments have catalyzed a new type of science for biology termed discovery science. Discovery science defines all (most) of the elements in a biological system (e.g., sequence of the genome, identification and quantitation of all of the mRNAs or proteins in a particular cell type; respectively, the genome, transcriptome, and the proteome). Discovery science creates databases of information, in contrast to the more classical hypothesis-driven science that formulates hypotheses and attempts to test them. The high-throughput tools both provide the means for discovery science and can assay how global information sets (e.g., transcriptomes or proteomes) change as systems are perturbed.

• The tools of computer science, statistics, and mathematics are critical for studying biology as an informational science. Curiously, biology is the only science that, at its very heart, employs a digital language. The grand challenge in biology is to determine how the digital language of the chromosomes is converted into the three-dimensional and four-dimensional (time variant) languages of living, breathing organisms.

• The genomes of the model organisms yeast, worm, fly, etc., have demonstrated the fundamental conservation among all living organisms of the basic informational pathways. Hence, systems can be perturbed in model organisms to gain insight into their functioning, and these data will provide fundamental insights into human biology. From the genome, the informational pathways and networks can be extracted to begin understanding their “logic of life.” Moreover, different genomes can be compared to identify similarities and differences in the strategies for the logic of life, and these provide fundamental insights into development, physiology, and evolution.

• Biology is an informational science. There are two major types of biological information: 1) the information of the genes or proteins, which are the molecular machines of life; and 2) the information of the regulatory networks that coordinate and specify the expression patterns of the genes (proteins). All biological information is hierarchical: DNA→mRNA→protein→protein interactions→informational pathways→informational networks→cells→networks of cells (tissues/organs)→individuals→populations→ecologies. The challenge is to create tools that can capture these different levels of biological information and, as we see below, integrate them.
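As a small illustration of the "digital language" idea, the first two arrows of the hierarchy (DNA→mRNA→protein) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names and the example sequence are hypothetical, and the codon table holds just five entries of the standard genetic code, so it is not a general-purpose translator.

```python
# Minimal sketch of the first two arrows of the hierarchy: DNA -> mRNA -> protein.
# Only a handful of codons from the standard genetic code are included, so
# this is an illustration, not a general-purpose translator.

PARTIAL_CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "UAA": "*",  # stop
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the toy table above
        amino_acid = PARTIAL_CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 15-base example sequence, chosen to use only the codons above
mrna = transcribe("ATGTTTGGCGCTTAA")
print(mrna, translate(mrna))
```

The later levels of the hierarchy (interactions, pathways, networks) have no comparably simple lookup rule, which is part of why the sidebar frames their capture and integration as an open challenge.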

All of these paradigm changes lead to the view that the major challenge for biology and medicine in the 21st century will be the study of complex systems, and that the approach necessary for studying biological complexity will be systems biology. An algorithm (i.e., approach) for systems biology has been proposed [1,4].

i) Identify all elements in the system with discovery tools (e.g., sequence genome, etc.).

ii) Use current knowledge of the system to formulate a model predicting its behavior.

iii) Perturb the system in a model organism using biological, genetic (knockout, over expression), or environmental perturbations, and capture information at all relevant levels: mRNA, protein, protein interactions, etc. Integrate this information.

iv) Compare theoretical predictions and experimental data. Carry out additional perturbations to bring theory and experiment into closer apposition. Integrate new data into the model. Iterate steps iii) and iv) until the mathematical model can predict the structure of the system and its systems or emergent properties given particular perturbations.
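The iterative character of steps ii) to iv) can be sketched as a simple loop. The following Python example is a toy under loudly stated assumptions, not the method of [1,4]: the "system" is a hypothetical hidden linear response, the "model" is a single rate constant, and the update rule is an ordinary error-correction step chosen only for illustration.

```python
# Toy illustration of the iterate-until-predictive loop in steps ii) to iv).
# Everything here is hypothetical: the "system" is a hidden linear response,
# and the "model" is a single rate constant refined from each experiment.

def run_experiment(perturbation: float) -> float:
    """Stand-in for a wet-lab measurement (assumed hidden rate of 2.5)."""
    hidden_rate = 2.5
    return hidden_rate * perturbation


def refine_model(initial_rate: float, perturbations: list[float],
                 tolerance: float = 1e-3, learning_rate: float = 0.5,
                 max_rounds: int = 100) -> tuple[float, int]:
    """Repeat predict -> perturb -> compare -> update until predictions fit.

    Perturbations must be non-zero because the update divides by them.
    """
    rate = initial_rate
    for round_number in range(1, max_rounds + 1):
        worst_error = 0.0
        for p in perturbations:
            predicted = rate * p                 # step ii: model prediction
            observed = run_experiment(p)         # step iii: perturb and measure
            error = observed - predicted         # step iv: compare
            worst_error = max(worst_error, abs(error))
            rate += learning_rate * error / p    # integrate new data into the model
        if worst_error < tolerance:
            return rate, round_number
    return rate, max_rounds


fitted_rate, rounds_used = refine_model(initial_rate=1.0,
                                        perturbations=[1.0, 2.0, 4.0])
print(f"fitted rate = {fitted_rate:.3f} after {rounds_used} rounds")
```

The stopping condition mirrors the text: the loop ends once the model's predictions match the measurements for every perturbation tried, within a chosen tolerance. Real systems-biology models have many parameters and noisy data, so the fitting step would be far more involved than this one-line update.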

We have successfully tested the first stages of this approach on the galactose utilization system in yeast [4].

What are the challenges presented by systems biology?

• The integration of technology, biology, and computation.

• The integration of the various levels of biological information and the modeling.

• The proper annotation of biological information and its storage and integration in databases.

• The inclusion of other molecules, large and small, in the systems approach.

• The integration imperatives of systems biology present many challenges to industry and academia [1, 5].

References

[1] T. Ideker, T. Galitski, and L. Hood, “A new approach to decoding life: Systems biology,” Annu. Rev. Genomics & Human Genetics, to be published, 2001.

[2] E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, K. Dewar, et al., “Initial sequencing and analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, 2001.

[3] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O. Smith, M. Yandell, et al., “The sequence of the human genome,” Science, vol. 291, pp. 1304-1351, 2001.

[4] T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner, D.R. Goodlett, R. Aebersold, and L. Hood, “Integrated genomic and proteomic analyses of a systematically perturbed metabolic network,” Science, to be published, 2001.

[5] A. Aderem and L. Hood, “Immunology in the post-genomic era,” Nature Immunology, to be published, 2001.


Table 3. Databases

Genome mapping and sequence repositories

GenBank, contains all known nucleotide and protein sequences.

http://www.ncbi.nlm.nih.gov/

The Genome Database, contains all known nucleotide and protein sequences.

http://gdbwww.gdb.org/

EMBL Nucleotide Sequence Database, contains all known nucleotide and protein sequences.

http://www.ebi.ac.uk/embl/index.html

Expression data

ASDB, protein products and expression patterns.

http://cbcg.nersc.gov/asdb/

BodyMap, human and mouse gene expression data.

http://bodymap.ims.u-tokyo.ac.jp/

TRIPLES, localization and expression in Saccharomyces.

http://ycmi.med.yale.edu/ygac/triples.htm

Genetic maps

GB4-RH, Genebridge4 human radiation hybrid maps - Sanger Centre.

http://www.sanger.ac.uk/Rhserver/Rhserver.shtml

GeneMap’99, gene map of the human genome.

http://www.ncbi.nlm.nih.gov/genemap99/

IXDB, physical map of human chromosome X.

http://www.mpimg-berlin-dahlem.mpg.de/~xteam/welcome.html

Genomic databases

FlyBase, Drosophila sequences and genomic information.

http://www.fruitfly.org

MGD, mouse genome database.

http://www.informatix.jax.org

SGD, Saccharomyces cerevisiae genome.

http://genome-www.stanford.edu/Saccharomyces/

Metabolic and regulatory pathways

KEGG, Kyoto Encyclopaedia of Genes and Genomes.

http://www.genome.ad.jp/kegg/

MIPS, yeast pathways.

http://www.mips.biochem.mpg.de/proj/yeast/pathways/index.html

Gene Ontology Consortium.

http://www.geneontology.org/

Protein databases

PROSITE, biologically significant protein patterns and profiles.

http://www.expasy.ch/prosite/

Protein data bank (PDB), 3-D macromolecular structure data.

http://www.rcsb.org/pdb/

SWISS-PROT/TrEMBL, curated protein sequences.

http://www.expasy.ch/sprot/


Table 4. Public Research Centers: Projects, Publications, and Tools

In the U.S.A.

Baylor College of Medicine

http://www.hgsc.bcm.tmc.edu

Lawrence Berkeley Laboratory

http://www-hgc.lbl.gov

Stanford Human Genome Center

http://shgc-www.stanford.edu

University of Washington Genome Center

http://www.genome.washington.edu

The Institute for Genomic Research (TIGR)

http://www.tigr.org

The National Center for Biotechnology Information (NCBI)

http://www.ncbi.nlm.nih.gov

University of California at Santa Cruz

http://genome.ucsc.edu

MIT Whitehead Institute

http://www-genome.wi.mit.edu

In Europe

European Bioinformatics Institute

http://www.ebi.ac.uk

Pasteur Institute

http://www.pasteur.fr

European Molecular Biology Laboratory

http://www.embl-heidelberg.de

Sanger Centre

http://www.sanger.ac.uk

Around the World

Peking University Centre of Bioinformatics

http://www.cbi.pku.edu.cn

Xylella fastidiosa Genome Project, Brazil

http://onsona.lbi.ic.unicamp.br/xf/

South African National Bioinformatics Institute

http://www.sanbi.ac.za

Weizmann Institute of Science

http://dapsas1.weizmann.ac.il

Table 5. Conferences, Workshops, and Courses

Cold Spring Harbor Laboratory meetings & courses

http://nucleus.cshl.org/meetings/

International Conference on Intelligent Systems for Molecular Biology

http://ismb01.cbs.dtu.dk/

European Molecular Biology Organization (EMBO) courses and meetings

http://www.embo.org/

U.S. D.O.E. Human Genome Program calendar

http://www.ornl.gov/hgmis/CAL.HTML