genomes, man, and machines [guest editorial]
From the Guest Editor
Genomes, Man, and Machines
Bioinformatics may be defined as a discipline that generates computational tools, databases, and methods to support genomic and postgenomic research. It comprises the study of DNA structure and function, gene and protein expression, protein production, structure and function, genetic regulatory systems, and clinical applications. Advances in genomics and functional genomics are fundamental components of science in the new millennium. Since international efforts to sequence genomes began formally in 1990, outstanding technological achievements have been implemented with enormous implications for medicine.
Over the past 15 years, numerous innovations have supported the development of a new biological research paradigm, one that is information-heavy and computer-driven. Some of these advances include improved DNA sequencing methods, new approaches to identify protein structure, and revolutionary methods to monitor the expression of many genes in parallel. The design of techniques able to deal with different sources of incomplete and noisy data has become another crucial goal for the bioinformatics community. Moreover, there is the need to implement computational solutions based on theoretical frameworks to allow scientists to perform complex inferences about the phenomena under study.
This special issue focuses on the multidisciplinary challenges brought about by this information revolution. It consists of eight articles from leading authorities in their fields, which reflect key issues in both genomic and postgenomic research. The order in which the articles are organized aims to facilitate an overview of topics ranging from the biological and bioinformatics basis through the systems to the clinical applications.
Ursula Bond and colleagues (Trinity College Dublin) review a number of tools and approaches to performing genomic and postgenomic studies based on an important model organism: Saccharomyces cerevisiae. Genome expression profiling has become a promising approach to understanding the molecular dynamics of many biological and physiological processes. Paul Bertone and Mark Gerstein (Yale University) review some of the recent advances in this area, as well as relevant challenges for the development of integrative data mining systems. Michael Wendl and colleagues, researchers of the Washington University Genome Sequencing Center and collaborators of the Human Genome Project, provide us with a comprehensible description of some of the components, techniques, and tools involved in the processing of DNA sequence data. Gustavo Glusman and Doron Lancet, researching at the Weizmann Institute of Science and at the Institute for Systems Biology, discuss the GESTALT workbench. This is an integrative platform for the analysis of large-scale genomic data, which provides the user with advanced and intuitive visualization facilities.
Microbial genome identification and analysis are key tasks that provide the basis for much further research into the biology of those organisms. Daniel Dalevi and Siv Andersson (Uppsala University) discuss a computational approach to studying the dynamics of microbial genomes. There is also the need to provide scientists with effective techniques to represent, filter, and compress DNA data. Xin Chen (Peking University) and colleagues from the City University of Hong Kong and the University of Waterloo describe a powerful compression algorithm for DNA sequences. Diane Cook and colleagues (University of Texas at Arlington) report the implementation of a system able to perform three important data mining tasks: unsupervised pattern discovery, supervised concept learning, and hierarchical clustering on a protein database. The contribution from Berrar and co-workers (German Cancer Research Centre) deals with the design of an integrated database system. It brings together different types of genomic and clinical data in order to support comparative genomic hybridization analysis.
Finally, the tables in this introduction provide our readers with a categorized collection of key resources on the rapidly evolving field of bioinformatics. Table 1 introduces some of the leading companies offering products and services in the areas of genomics, functional genomics, and medical applications. Table 2 includes key references to special issues on genomic and postgenomic studies recently published, as well as a selection of online courses and tutorials. A categorized list of fundamental databases is described in Table 3. Table 4 includes the URLs of prestigious public research centers around the world. A selection of renowned meetings, conferences, and other scientific events is shown in Table 5. For a more detailed view of these and other resources on bioinformatics, the reader is referred to the website Genomes & Machines (http://www.cs.tcd.ie/francisco.Azuaje/genomes&machines.html).
Francisco Azuaje
Department of Computer Science,
Trinity College Dublin
18 IEEE ENGINEERING IN MEDICINE AND BIOLOGY July/August 2001
0739-5175/01/$10.00 ©2001 IEEE
Due to space and time constraints we were not able to include additional relevant studies. Some of these developments include the reconstruction of evolutionary trees from genomic data and fundamental problems linked to database, machine learning, and data mining research.
We are indebted to all of the contributors to this special issue for their appreciable efforts in producing an enlightening collection of articles. I am particularly grateful to our former editor Dr. Alvin Wald for giving me the opportunity to collaborate as a guest editor and for his continuous advice. I thank our new editor Dr. Mark Wiederhold for his support throughout the final preparation of this special issue. We greatly appreciate the cooperation from the referees who kindly supported a rigorous peer-review process.
We hope that this special issue will motivate our readers to learn more about, and participate in, this exciting multidisciplinary challenge.
Table 2. Special Issues, References, and Tutorials
Nature special issue on the human genome, vol. 409, pp. 813-958, 15 February 2001; and Nature insight on functional genomics, vol. 405, pp. 819-865, 15 June 2000.
http://www.nature.com
IEEE Computing in Science & Engineering, special issue on computational biology, May-June 1999.
http://www.computer.org/cise/
National Center for Biotechnology Information: tutorials and databases.
http://www.ncbi.nlm.nih.gov/
Science special issue on the human genome, vol. 291, 16 February 2001; and Science functional genomics, a collection of online resources related to the human genome sequence.
http://www.sciencemag.org
IEEE Spectrum, special feature on genome sequencing and analysis, vol. 37, no. 11, November 2000.
http://www.spectrum.ieee.org
Virtual Bacteria ID lab, Howard Hughes Medical Institute.
http://www.biointeractive.org
Getting Connected to the Postgenomic Era
Table 1. Hardware, Software, and Services Providers
Company               Activity                                                          URL
Affymetrix            Expression chips and analysis                                     www.affymetrix.com
Agilent Technologies  Genome expression systems, software and hardware                  www.agilent.com
Celera                Databases and genotyping                                          www.celera.com
Compugen              Proprietary and collaborative gene discovery                      www.compugen.com
CuraGen               Drug-induced changes in gene expression                           www.curagen.com
DoubleTwist.com       Internet portal, on-line access to tools and databases            www.doubletwist.com
eBioinformatics       Web-based bioinformatics tools and databases                      www.ebioinformatics.com
Genaissance           Population genomics                                               www.genaissance.com
GeneLogic             Gene expression database products                                 www.genelogic.com
Genomica              Enterprise-wide bioinformatics systems and services               www.genomica.com
Genomics Institute    A division of Novartis, genomics and proteomics                   www.gnf.org
IBM                   Data mining and protein structure determination methods           www.research.ibm.com/topics/serious/bio/
Incyte                Databases and analysis tools                                      www.incyte.com
Informax              Desktop and enterprise-wide products                              www.informaxinc.com
Lexicon Genetics      Biochips                                                          www.lexgen.com
Lion Bioscience       Enterprise-wide systems and services                              www.lionbioscience.com
Motorola Biochip      Expression arrays                                                 www.motorola.com/biochipsystems/
Myriad Genetics       Therapeutic and diagnostic products based on genomic methods      www.myriad.com
Oxford Molecular      Bioinformatics systems and services                               www.oxmol.co.uk
Rosetta Inpharmatics  Gene expression data acquisition for drug discovery applications  www.rosetta.com
Silicon Genetics      Gene expression analysis and visualization                        www.sigenetics.com
SpotFire              Data visualization software for gene expression                   www.spotfire.com
Structural GenomiX    Proteomics, structure prediction                                  www.stromix.com
Syrrx                 Protein structure prediction                                      www.syrrx.com
Computing Life: The Challenge Ahead
by Leroy Hood
Institute for Systems Biology
The Human Genome Project has catalyzed the birth of the field of bioinformatics. In its broadest context, bioinformatics or computational biology is concerned with capturing, storing, analyzing, graphically displaying, modeling, and ultimately distributing biological information. Moreover, the Human Genome Project has encouraged a series of paradigm changes leading to the view that biology is an informational science [1].
• The draft of the human genome [2,3] has given us a genetic parts list of what is necessary for building a human: the 30,000-40,000 genes, their regulatory regions, a lexicon of motifs that are the building block components of proteins (and genes), and access to the human variability (polymorphisms) that makes each of us different from one another.
• Genomics has triggered the development of high-throughput instrumentation for DNA sequencing, DNA arrays, genotyping, proteomics, etc. These instruments have catalyzed a new type of science for biology termed discovery science. Discovery science defines all (or most) of the elements in a biological system (e.g., the sequence of the genome, or the identification and quantitation of all of the mRNAs or proteins in a particular cell type; respectively, the genome, transcriptome, and proteome). Discovery science creates databases of information, in contrast to the more classical hypothesis-driven science that formulates hypotheses and attempts to test them. The high-throughput tools both provide the means for discovery science and can assay how global information sets (e.g., transcriptomes or proteomes) change as systems are perturbed.
• The tools of computer science, statistics, and mathematics are critical for studying biology as an informational science. Curiously, biology is the only science that, at its very heart, employs a digital language. The grand challenge in biology is to determine how the digital language of the chromosomes is converted into the three-dimensional and four-dimensional (time variant) languages of living, breathing organisms.
• The genomes of the model organisms yeast, worm, fly, etc., have demonstrated the fundamental conservation among all living organisms of the basic informational pathways. Hence, systems can be perturbed in model organisms to gain insight into their functioning, and these data will provide fundamental insights into human biology. From the genome, the informational pathways and networks can be extracted to begin understanding their "logic of life." Moreover, different genomes can be compared to identify similarities and differences in the strategies for the logic of life, and these provide fundamental insights into development, physiology, and evolution.
• Biology is an informational science. There are two major types of biological information: 1) the information of the genes or proteins, which are the molecular machines of life; and 2) the information of the regulatory networks that coordinate and specify the expression patterns of the genes (proteins). All biological information is hierarchical: DNA→mRNA→protein→protein interactions→informational pathways→informational networks→cells→networks of cells (tissues/organs)→individuals→populations→ecologies. The challenge is to create tools that can capture these different levels of biological information and, as we see below, integrate them.
All of these paradigm changes lead to the view that the major challenge for biology and medicine in the 21st century will be the study of complex systems, and that the approach necessary for studying biological complexity will be systems biology. An algorithm (i.e., approach) for systems biology has been proposed [1,4].
i) Identify all elements in the system with discovery tools (e.g., sequence genome, etc.).
ii) Use current knowledge of the system to formulate a model predicting its behavior.
iii) Perturb the system in a model organism using biological, genetic (knockout, overexpression), or environmental perturbations; capture information at all relevant levels: mRNA, protein, protein interactions, etc. Integrate this information.
iv) Compare theoretical predictions and experimental data. Carry out additional perturbations to bring theory and experiment into closer apposition. Integrate new data into model. Iterate steps iii) and iv) until the mathematical model can predict the structure of the system and its systems or emergent properties given particular perturbations.
We have successfully tested the first stages of this approach on the galactose utilization system in yeast [4].
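The iterative cycle in steps i) to iv) can be illustrated with a small toy program. What follows is only a minimal sketch under strong simplifying assumptions, not the method used in [4]: the "system" is a hypothetical one-parameter linear response, the names `measure`, `Model`, and `refine` are invented for illustration, and step i) is assumed to have been completed already. The model is refit by ordinary least squares after each perturbation whose prediction disagrees with the measurement.

```python
from dataclasses import dataclass, field


def measure(perturbation: float) -> float:
    """Stand-in for an experiment (hypothetical): the hidden system
    responds linearly with a gain of 2.5 that the model does not know."""
    return 2.5 * perturbation


@dataclass
class Model:
    # Step ii): an initial model built from "current knowledge".
    gain: float = 1.0
    observations: list = field(default_factory=list)

    def predict(self, perturbation: float) -> float:
        return self.gain * perturbation

    def integrate(self, perturbation: float, response: float) -> None:
        """Integrate new data: refit the gain by least squares through the origin."""
        self.observations.append((perturbation, response))
        den = sum(p * p for p, _ in self.observations)
        if den > 0:  # a zero perturbation carries no information about the gain
            self.gain = sum(p * r for p, r in self.observations) / den


def refine(model: Model, perturbations, tolerance: float = 1e-6) -> int:
    """Iterate steps iii) and iv) until prediction and experiment agree.

    Returns the number of times the model had to be updated.
    """
    updates = 0
    for p in perturbations:
        response = measure(p)                      # step iii): perturb and capture data
        error = abs(model.predict(p) - response)   # step iv): compare theory and experiment
        if error <= tolerance:
            break                                  # the model now predicts the response
        model.integrate(p, response)
        updates += 1
    return updates
```

Calling `refine(Model(), [1.0, 2.0, 3.0])` stops as soon as the model predicts the response to a new perturbation within the tolerance. A real biological model would have many interacting parameters and noisy measurements, so the stopping rule and the fitting procedure would both need to be far more careful than in this sketch.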
What are the challenges presented by systems biology?
• The integration of technology, biology, and computation.
• The integration of the various levels of biological information and the modeling.
• The proper annotation of biological information and its storage and integration in databases.
• The inclusion of other molecules, large and small, in the systems approach.
• The integration imperatives of systems biology present many challenges to industry and academia [1, 5].
References
[1] T. Ideker, T. Galitski, and L. Hood, “A new approach to decoding life: Systems biology,” Annu. Rev. Genomics & Human Genetics, to be published, 2001.
[2] E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, K. Dewar, et al., “Initial sequencing and analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, 2001.
[3] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O. Smith, M. Yandell, et al., “The sequence of the human genome,” Science, vol. 291, pp. 1304-1351, 2001.
[4] T. Ideker, V. Thorsson, J.A. Ranish, R. Christmas, J. Buhler, J.K. Eng, R. Bumgarner, D.R. Goodlett, R. Aebersold, and L. Hood, “Integrated genomic and proteomic analyses of a systematically perturbed metabolic network,” Science, to be published, 2001.
[5] A. Aderem and L. Hood, “Immunology in the post-genomic era,” Nature Immunology, to be published, 2001.
Table 3. Databases
Genome mapping and sequence repositories
GenBank, contains all known nucleotide and protein sequences.
http://www.ncbi.nlm.nih.gov/
The Genome Database, contains all known nucleotide and protein sequences.
http://gdbwww.gdb.org/
EMBL Nucleotide Sequence Database, contains all known nucleotide and protein sequences.
http://www.ebi.ac.uk/embl/index.html
Expression data
ASDB, protein products and expression patterns.
http://cbcg.nersc.gov/asdb/
BodyMap, human and mouse gene expression data.
http://bodymap.ims.u-tokyo.ac.jp/
TRIPLES, localization and expression in Saccharomyces.
http://ycmi.med.yale.edu/ygac/triples.htm
Genetic maps
GB4-RH, Genebridge4 human radiation hybrid maps - Sanger Centre.
http://www.sanger.ac.uk/Rhserver/Rhserver.shtml
GeneMap’99, gene map of the human genome.
http://www.ncbi.nlm.nih.gov/genemap99/
IXDB, physical map of human chromosome X.
http://www.mpimg-berlin-dahlem.mpg.de/~xteam/welcome.html
Genomic databases
FlyBase, Drosophila sequences and genomic information.
http://www.fruitfly.org
MGD, mouse genome database.
http://www.informatix.jax.org
SGD, Saccharomyces cerevisiae genome.
http://genome-www.stanford.edu/Saccharomyces/
Metabolic and regulatory pathways
KEGG, Kyoto Encyclopaedia of Genes and Genomes.
http://www.genome.ad.jp/kegg/
MIPS, yeast pathways.
http://www.mips.biochem.mpg.de/proj/yeast/pathways/index.html
Gene Ontology Consortium.
http://www.geneontology.org/
Protein databases
PROSITE, biologically significant protein patterns and profiles.
http://www.expasy.ch/prosite/
Protein data bank (PDB), 3-D macromolecular structure data.
http://www.rcsb.org/pdb/
SWISS-PROT/TrEMBL, curated protein sequences.
http://www.expasy.ch/sprot/
Table 4. Public Research Centers: Projects, Publications, and Tools
In the U.S.A.
Baylor College of Medicine
http://www.hgsc.bcm.tmc.edu
Lawrence Berkeley Laboratory
http://www-hgc.lbl.gov
Stanford Human Genome Center
http://shgc-www.stanford.edu
University of Washington Genome Center
http://www.genome.washington.edu
The Institute for Genomic Research (TIGR)
http://www.tigr.org
The National Center for Biotechnology Information (NCBI)
http://www.ncbi.nlm.nih.gov
University of California at Santa Cruz
http://genome.ucsc.edu
MIT Whitehead Institute
http://www-genome.wi.mit.edu
In Europe
European Bioinformatics Institute
http://www.ebi.ac.uk
Pasteur Institute
http://www.pasteur.fr
European Molecular Biology Laboratory
http://www.embl-heidelberg.de
Sanger Centre
http://www.sanger.ac.uk
Around the World
Peking University Centre of Bioinformatics
http://www.cbi.pku.edu.cn
Xylella fastidiosa Genome Project, Brazil
http://onsona.lbi.ic.unicamp.br/xf/
South African National Bioinformatics Institute
http://www.sanbi.ac.za
Weizmann Institute of Science
http://dapsas1.weizmann.ac.il
Table 5. Conferences, Workshops, and Courses
Cold Spring Harbor Laboratory meetings & courses
http://nucleus.cshl.org/meetings/
International Conference on Intelligent Systems for Molecular Biology
http://ismb01.cbs.dtu.dk/
European Molecular Biology Organization (EMBO) courses and meetings
http://www.embo.org/
U.S. D.O.E. Human Genome Program calendar
http://www.ornl.gov/hgmis/CAL.HTML