
8/13/2019 Bioinformatics Overview


BIOINFORMATICS: AN OVERVIEW

T.R. Sharma

Genoinformatics Lab, National Research Centre on Plant Biotechnology

I.A.R.I, New Delhi 110012

[email protected]

Introduction

Bioinformatics is the computational analysis of biological data, consisting of the information stored in the form of DNA and protein sequences in various biological databases. The National Center for Biotechnology Information (NCBI 2001) defines bioinformatics as: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics which assess relationships among members of large data sets, the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."

Analyses in bioinformatics focus on three types of datasets: genome sequences, macromolecular structures, and functional genomics experiments (e.g. microarray data). However, bioinformatics tools are also applied to various other data, e.g. phylogenetic and metabolic pathway analysis, the text of scientific papers, and plant varietal information and statistics. Analysis of biological data requires the application of a large number of techniques such as primary sequence alignment, protein 3D structure alignment, phylogenetic tree construction, prediction and classification of protein structure, prediction of RNA structure, prediction of protein function, and expression data clustering. Development of suitable algorithms is an important part of bioinformatics. Many techniques and algorithms were developed specifically for the analysis of biological data; for instance, the dynamic programming algorithm for sequence alignment is one of the most widely used methods among biologists. The sequence information generated worldwide is stored systematically in different types of databases. Hence, it is necessary to understand these databases and their different types.
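The dynamic programming idea mentioned above can be illustrated with a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values taken from any particular program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of sequences a and b."""
    # Row 0: aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned against a gap
            left = curr[j - 1] + gap  # b[j-1] aligned against a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Only the score is computed here; recovering the alignment itself would require keeping the full table and tracing back through it.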

What is a database?

A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions. A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.
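As a rough illustration of the record structure described above, the sketch below models a tiny in-memory database in Python. The class name, field names, accession identifiers and sequences are all hypothetical and are not the schema of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One entry of a toy nucleotide database (field names are illustrative)."""
    accession: str          # hypothetical identifier, not a real accession
    contact_name: str
    molecule_type: str      # e.g. "DNA" or "mRNA"
    organism: str           # scientific name of the source organism
    sequence: str
    citations: list = field(default_factory=list)


def find_by_organism(records, organism):
    """Return all records whose source organism matches, ignoring case."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Made-up demonstration data.
DEMO_RECORDS = [
    SequenceRecord("DEMO0001", "A. Contact", "DNA", "Oryza sativa", "ATGGCCTTAG"),
    SequenceRecord("DEMO0002", "B. Contact", "mRNA", "Zea mays", "ATGAAACCCTGA"),
    SequenceRecord("DEMO0003", "A. Contact", "DNA", "Oryza sativa", "GGCATTACGA"),
]

rice = find_by_organism(DEMO_RECORDS, "oryza sativa")
print([r.accession for r in rice])
```

Every record carries the same set of fields, which is what allows a program to answer a question such as "which entries come from rice?" with a simple filter.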



Divisions of DNA databases

Since the size of the databases is growing rapidly, they have been further broken into divisions on the basis of the taxonomy of the organisms. The GenBank divisions fall into two general categories, organismal and functional. Sequences derived from specific organisms are stored in the organismal category, whereas the functional category includes divisions that are independent of taxonomic classification, e.g. EST, STS and HTG. Each GenBank division is identified by a three-letter code given at the beginning of each sequence entry. For instance, the HTG (high throughput genome) division contains sequences generated from different organisms. These sequences are generally unfinished and are further classified as Phase 1 (unfinished, unordered sequences that contain gaps) and Phase 2 (unfinished, ordered sequences that contain a few gaps). Once a sequence is finished and all gaps are resolved (Phase 3), it is moved to a specific organismal division, e.g. PLN in the case of plants. The huge wealth of information in the form of DNA and protein sequences and publications on molecular biology is stored in the data banks (Fig. 1). The major public data banks that take care of DNA and protein sequences are GenBank in the USA (http://www.ncbi.nlm.nih.gov), EMBL (European Molecular Biology Laboratory) in Europe (http://www.ebi.ac.uk/embl/) and DDBJ (DNA Data Bank of Japan) in Japan (http://www.ddbj.nig.ac.jp). The growth of DNA sequence data in GenBank is depicted in Fig. 2. This rapid growth in DNA sequence data has occurred because various collaborative international programmes were started during the past few years to sequence the complete genomes of various organisms. The whole genomes of various microorganisms have already been sequenced by The Institute for Genomic Research (TIGR) and can be seen on its website, www.tigr.org. Large genomes such as human (3 billion bp), rice (450 Mb), Arabidopsis (130 Mb) and mouse (2.5 billion bp) have also been sequenced, and the data are in the public domain in GenBank. These DNA sequences now have to be used in meaningful ways for the welfare of mankind. Different types of sequences of important crops available in the public domain are listed in Table 1.
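The division codes described above can be read programmatically. The following sketch assumes that the code appears as a whitespace-separated token on the LOCUS line of a GenBank flat-file entry. The code sets are a partial list, and the example LOCUS lines are made up for illustration.

```python
# Partial lists of GenBank division codes; PLN, EST, STS and HTG are the
# ones named in the text, the others are added here for illustration.
ORGANISMAL = {"PRI", "ROD", "MAM", "VRT", "INV", "PLN", "BCT", "VRL", "PHG"}
FUNCTIONAL = {"EST", "STS", "GSS", "HTG"}


def division_of(locus_line):
    """Return (code, category) for a LOCUS line, or None if no known code is found."""
    tokens = locus_line.split()
    if not tokens or tokens[0] != "LOCUS":
        return None
    # The division code sits near the end of the line, so scan from the right.
    for token in reversed(tokens[1:]):
        if token in ORGANISMAL:
            return (token, "organismal")
        if token in FUNCTIONAL:
            return (token, "functional")
    return None


# Hypothetical LOCUS lines, not real GenBank entries.
print(division_of("LOCUS       DEMO0001     1200 bp    DNA     linear   PLN 01-JAN-2000"))
print(division_of("LOCUS       DEMO0002   150000 bp    DNA     linear   HTG 01-JAN-2000"))
```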

Fig. 1. Status of sequences submitted to GenBank (Source: NCBI)



Table 1. Different types of sequences of important crops available in public domain*

Type of database in public domain | Plant species

Whole genome | Oryza sativa, Arabidopsis thaliana

Partial genome | T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, L. esculentum, V. vinifera, Poncirus trifoliata, Medicago truncatula, Lotus corniculatus

EST | Aegilops tauschii, Allium cepa, Arabidopsis thaliana, Avena sativa, Beta vulgaris subsp. vulgaris, Brassica napus, Brassica oleracea, Brassica rapa, Capsicum annuum, Coffea arabica, Glycine max, Gossypium arboreum, Gossypium hirsutum, Helianthus annuus, Hordeum vulgare, Lactuca sativa, Lolium perenne, Lotus corniculatus, Lycopersicon esculentum, Malus domestica, Medicago sativa, Medicago truncatula, Nicotiana benthamiana, Nicotiana tabacum, Oryza sativa, Phaseolus coccineus, Phaseolus vulgaris, Saccharum officinarum, Secale cereale, Solanum melongena, Solanum tuberosum, Sorghum bicolor, Triticum monococcum, Vitis vinifera, Zea mays

mRNA | T. aestivum, Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, L. esculentum, V. vinifera, Medicago truncatula, L. corniculatus, O. sativa, A. thaliana

Protein | Z. mays, S. bicolor, B. oleracea, B. rapa, G. max, S. tuberosum, V. vinifera, C. sinensis, M. truncatula, E. globulus, O. sativa, A. thaliana

BAC end | Oryza australiensis, O. brachyantha, O. glaberrima, O. granulata, O. latifolia, O. minuta, O. officinalis, O. punctata, O. ridleyi, O. rufipogon, O. schlechteri, G. hirsutum

*Source: NCBI

Divisions of Protein databases

Protein sequences are mainly stored in two databases, EMBL and GenBank. Swiss-Prot, a very well maintained and curated database, was established at the Swiss Institute of Bioinformatics. Though it is a small database, it has important annotations which are freely available to academic users. GenBank created PIR, a protein database, as a translation of GenBank. The PIR database is further subdivided into four sections, PIR1, PIR2, PIR3 and PIR4, on the basis of the degree of annotation.

DNA Sequence Analysis

Bioinformatics tools are now easily available to biologists with the advent of the internet and various web browsers on the World Wide Web. These tools are indispensable for any genome sequencing centre. The analysis of DNA sequences starts once they are out of the sequencing machines. The first and foremost task of a biologist is to check the accuracy of the sequence obtained from the machine. One way is to find the cloning sites of the insert in the sequencing vector. If the insert is a PCR product, then one should look for the primer sequences used in the amplification of that product. Then one can perform a Basic Local Alignment Search Tool (BLAST) search against the DNA sequence database in GenBank and see the probable matches. If the unknown sequence shows hits with any sequence of the same or related organisms, then it is considered a true sequence. These are the basic steps, which can be performed manually if the dataset is very small or if one has to deal with a single or a few sequences. However, in large genome sequencing projects one has to handle thousands of sequences at a given time.
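When thousands of reads have to be checked, the primer test described above for PCR-derived inserts can be automated. The sketch below is a deliberately simple version that assumes the read begins exactly with the forward primer and ends exactly with the reverse complement of the reverse primer. Real reads often need trimming and tolerance for sequencing errors, which are not handled here. The primer and read sequences are made up.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence made of A, C, G, T or N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def flanked_by_primers(read, forward_primer, reverse_primer):
    """True if the read starts with the forward primer and ends with the
    reverse complement of the reverse primer (exact matches only)."""
    read = read.upper()
    return (read.startswith(forward_primer.upper())
            and read.endswith(reverse_complement(reverse_primer)))


# Made-up example: forward primer ATGGC, reverse primer TTAGC.
read = "ATGGC" + "AAAA" + reverse_complement("TTAGC")
print(flanked_by_primers(read, "ATGGC", "TTAGC"))
```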

Searching for Sequence Alignment

Once a high quality sequence is obtained, one has to ask an important question: is this a new sequence, or is it similar to other DNA sequences available in the databases? To answer this question, one has to perform a database search for sequence comparisons. All sequence searching methods rely on the basic concepts of alignment and distance between sequences, and pairwise sequence alignment is performed. There are different algorithms to perform global and local alignments (Fig. 2). In global alignment, the input sequence is aligned over its complete length with sequences available in the databases, whereas in local alignment only the most similar segments of the input sequence are aligned with the database sequences. Sequence comparison (DNA/protein) against a database is one of the most important and powerful tools of bioinformatics. This type of sequence comparison is generally performed with two programmes, BLAST and FASTA, which compare an unknown sequence against a sequence database. In BLAST, the best local alignments between the unknown sequence and the database are found by using an approach based on matching short sequence fragments and a powerful statistical model, whereas FASTA uses a method of approximation which tries to concentrate only on significant alignments. In the BLAST search output, Expect (E) values and bit scores are reported to determine the significance of a match between the unknown sequence and sequences available in the database (Fig. 3). The significance of a BLAST hit is very important for the interpretation of results. Because the genetic code is degenerate, two sequences that are only about 67% identical at the DNA level can still be 100% identical at the protein level. It is also suggested that at least 75% sequence identity between two sequences should be observed for considering it a significant hit.
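To make the idea of local alignment and of E values more concrete, the sketch below computes a local alignment score in the style of Smith and Waterman and converts a bit score into an expected number of chance hits using the standard relation E = m x n x 2^(-S'), where m and n are the query and database lengths and S' is the bit score. This is not how BLAST searches internally (BLAST uses a word-matching heuristic and effective lengths). The scoring values and sequences are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of segments of a and b (local alignment)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0, so an alignment can restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def expected_hits(bit_score, query_len, db_len):
    """E value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Two made-up sequences that share only the segment GATTACA.
print(local_alignment_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
print(expected_hits(10, 100, 1024))
```

The smaller the E value, the less likely it is that a hit with that score arose by chance in a database of that size.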

Fig. 2. Global and local alignments between two DNA sequences



Fig. 3. BLAST output showing bit score and E values after similarity search

Gene Prediction and Annotation

Simply determining the four letters (A, T, G, C) of the DNA sequence of any organism has no value until some meaning is derived from it by gene prediction. Gene prediction is complex work, and there is no algorithm which can exactly predict the true exons in a DNA sequence. Basically, two major considerations are taken into account while predicting a gene: 1) identification of structural elements such as start/stop codons and splice sites in the unknown sequence, and 2) homology search against protein, EST and cDNA databases to identify potential coding regions. A very commonly used gene prediction software is GENSCAN, developed at MIT, USA (http://www.genes.mit.edu/GENSCAN.html), which is freely available on the Web and allows online analysis of DNA sequences. The output obtained from GENSCAN is then used for gene annotation by using BLAST to search public or private DNA sequence databases, matching the unknown query sequence against the millions of sequences available in GenBank. A very popular website, http://www.ncbi.nlm.nih.gov, provides BLAST from NCBI's home page and performs searches by using various criteria and options (Fig. 4).
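The first consideration above, locating start and stop codons, can be sketched as a simple open reading frame (ORF) scan. This is far simpler than what GENSCAN does: it looks only at the forward strand, ignores splice sites and introns, and uses a hypothetical minimum length. It is meant only to show the principle.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs (0-based, end exclusive) of ATG...stop
    ORFs on the forward strand; the length counts the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


seq = "CCATGAAATTTTAGGG"  # made-up sequence
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

Candidate regions found this way would then be checked by the homology searches described in the second consideration.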



Fig. 4. Performing BLAST search at NCBI Home page

Primer Design

Another important aspect of the use of genome sequence data after predicting genes is the design of primers, either for PCR or for sequencing. Such primers are used for the amplification of genes or their alleles from known sources so that the best use can be made of them. Though the PRIME software within the GCG package is mainly used for this purpose, PRIMER3, a web based software (www-genome.wi.mit.edu/genome_software/other/primer3.html), is also commonly used for designing primers. PCR primer pairs are designed to amplify a well-defined target sequence from the template. Some of the important considerations while designing primers are the GC content, melting temperature, primer size, and size of the PCR product to be amplified. These parameters can be used with their default settings or changed as required.
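The primer properties listed above are easy to compute for a candidate sequence. The sketch below uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rough estimate suited to short oligonucleotides and is not the model used by PRIMER3. The acceptance thresholds are hypothetical defaults chosen for illustration.

```python
def gc_content(primer):
    """GC content of a primer as a percentage."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primer_ok(primer, min_len=18, max_len=25, min_gc=40.0, max_gc=60.0,
              min_tm=50, max_tm=65):
    """Check size, GC content and Tm against illustrative thresholds."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_content(primer) <= max_gc
            and min_tm <= wallace_tm(primer) <= max_tm)


candidate = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
print(len(candidate), gc_content(candidate), wallace_tm(candidate), primer_ok(candidate))
```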

Phylogenetic Analysis

Once a similarity search has been performed between the unknown sequence and the database sequences to find the per cent homology between them, the natural next step is to find out how these sequences are related to each other. Sequences derived from two closely related organisms show more similarity at the DNA level, and those from distantly related organisms show more dissimilarity at the sequence level. To find the evolutionary relationship among sequences derived from different organisms, a phylogenetic tree is constructed (Fig. 5). Such an evolutionary tree can also be constructed on the basis of phenotypic markers, molecular markers or sequence information. A typical phylogenetic tree comprises nodes, branches and the termini of the branches. When all the branches emerge from a common node, that node is termed the root of the tree. Some trees, however, are constructed as un-rooted trees, where the common evolutionary point is not known. For constructing a phylogenetic tree, the PILEUP option of the GCG package is commonly used. Besides, the DNASTAR software (www.dnastar.com) also has options to construct a tree from different DNA or protein sequences. However, web based tools like MacClade (www.phylogeny.arizona.edu/macclade/) can also be used for evolutionary studies of different organisms based on their DNA sequences.
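One simple way to turn sequence dissimilarity into a rooted tree is distance-based clustering. The sketch below computes the proportion of differing sites between aligned sequences and joins the closest clusters with UPGMA, returning the topology as a Newick string without branch lengths. This is an illustrative method and is not claimed to be what PILEUP, DNASTAR or MacClade do. The sequences are made up and are assumed to be already aligned to equal length.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster named, aligned sequences; return a Newick topology string."""
    sizes = {name: 1 for name in seqs}  # cluster label -> number of leaves
    dist = {(a, b): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}

    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    while len(sizes) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        merged = f"({a},{b})"
        others = [c for c in sizes if c not in (a, b)]
        # Distance to the new cluster is the size-weighted average.
        new_d = {c: (sizes[a] * d(a, c) + sizes[b] * d(b, c))
                    / (sizes[a] + sizes[b]) for c in others}
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        for c, value in new_d.items():
            dist[(merged, c)] = value
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


demo = {"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "AATTTTTT", "D": "AATTTTTA"}
print(upgma(demo))
```

In this toy data set A and B differ at one site, as do C and D, so the two pairs are grouped first and then joined at the root.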

Similarly, bioinformatics tools can be used for protein function analysis by database search. Finding SSR markers and SNP markers from EST or genome sequences can be performed in silico by using different algorithms, which will also be discussed in the presentation.
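As a small example of in silico marker discovery, the sketch below scans a sequence for simple sequence repeats (SSRs) with a regular expression. The motif lengths (di- and trinucleotides) and the minimum of five repeats are hypothetical settings, and the sequence is made up. Dedicated SSR tools apply more elaborate criteria.

```python
import re


def find_ssrs(seq, min_repeats=5, motif_lengths=(2, 3)):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for k in motif_lengths:
        # A motif of length k followed by at least (min_repeats - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymer runs such as AAAAAAAAAA
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


seq = "GGC" + "AT" * 6 + "TTT" + "CAG" * 5 + "TT"  # made-up sequence
for start, motif, count in find_ssrs(seq):
    print(start, motif, count)
```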

Fig. 5. Phylogenetic analysis of resistance gene analogue sequences (sk21, sk95, sk10, sk3, sk76, sk101 and sk65) obtained from rice and known resistance gene sequences (L6, M, N, RPS2 and Xa1) isolated from different crops. Analysis was performed with DNASTAR software.

Conclusions

In functional genomics, gene expression at the whole genome level under different stresses can be studied by using microarrays. Nowadays, gene expression databases of this type are being prepared for different organisms and even for different tissues. Bioinformatics tools are helpful in locating DNA sequences in GenBank simply by entering accession numbers, making alignments of two or more sequences, performing similarity searches for unknown sequences in GenBank, assembling short sequence reads and developing consensus sequences, finding genes and markers in silico, and performing comparative analysis of different genomes.

Selected References and Web Resources

Sobral, B.W.S. 1997. Common language of bioinformatics. Nature. 389:418.

Brown, S.M. 2000. Bioinformatics: A Biologist's Guide to Biocomputing and the Internet. Eaton Publishing, Natick, MA, USA.

Baxevanis, A.D. and Ouellette, B.F.F. 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Second Edition. John Wiley and Sons, Inc., NY.

GENSCAN: http://genes.mit.edu/GENSCAN.html

FGENESH: http://www.softberry.com/berry.phtml
