bioinformatics - databases - bioplexity · data sources for bioinformatics main types of biological...
TRANSCRIPT
Bioinformatics - Lecture 09
BioinformaticsBiological databases
Martin Saturka
http://www.bioplexity.org/lectures/
EBI version 0.4
Creative Commons Attribution-Share Alike 2.5 License
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Data sources for bioinformatics
Main types of biological databases with utilization tools.communication with databases. database usage.genomes, structure families, expression maps.
Main topicsdatabase technics- file types, sql, biodas- protocols, bioperlparticular databases- sequences, structures- expression, ontologydigest approaches- constraint programming- filters, data structures
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
File types
sequencesFasta - multiple sequencesone sequence: first line - header >..., next lines - per 70 ntGenBank header: gi|gi-number|gb|accession| locusGFF - sequence featuresseqname source feature start end score strand frame
GenBank flat files or ASN.1flat files: multiline description, 6x10 nt per lineASN.1: structured description {...{...}}, optionally packed
structuresPDB - line/column-wise dataline start: line decription - comment / atomatom rank role residuum chain res-rank coordinates
expressiontables: lines - genes, columns - tissues / experimentsMAGE: http://www.mged.org/Workgroups/MAGE/mage.html
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Annotation access
annotation systems
http://www.open-bio.org/wiki/Projects
BioDAS - distributed annotation systemhttp://biodas.org/data accesshttp://example.org/das/organism/features?segment=CHR I:1,500
site prefix das data command arguments
system compositiona reference sequence server plus annotation servers
other projectsMOBY - interoperability for biological data server servicesOBDA - sequence access standardizationmyGrid - grid and middleware for bioinformatics
http://www.mygrid.org.uk/myExperiment - myGrid spin-off
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
DBMS
SQL
tabular data storagemain open-source database systemssmall: sqlite, large: postgresql, firebirdsqlaccess: sql - structured query languagesuitable for data structured into regular tablesdifferent approaches
linear data (sequences): flat filesdeeply structured data: ASN.1, HDF, NetCDF
> sqlite3
CREATE TABLE genes (gi INTEGER, chr CHAR(3), ori CHAR(1));
INSERT INTO genes VALUES (826, ’19’, ’+’);
INSERT INTO genes VALUES (827, ’X’, ’-’);
SELECT * FROM genes;
.quit
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Genomes
where to get data on sequenced chromosomes
gene specific: gene id sequence specefic: accession
main genome database sitesNCBI - National center for biotechnology information
http://www.ncbi.nlm.nih.gov/Entrez/EMBL - European bioinformatics institute
http://www.ebi.ac.uk/embl/DDBJ - DNA databank of Japan
http://www.ddbj.nig.ac.jp/
NCBI ftp://ftp.ncbi.nlm.nih.gov/directory /genomes/H sapiens/
assembled reference sequences: Assembled chromosomesfile /gene/DATA/gene2refseq.gz
gene IDs with positions along chromosomes
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
DNA variations
SNPs, CNVs
many projects set to deal with intra-species variationdbSNPhttp://www.ncbi.nlm.nih.gov/SNP/the SNP consortiumhttp://snp.cshl.org/haplotypeshttp://www.hapmap.org/glovar - human variationshttp://www.glovar.org/human variomehttp://www.humanvariomeproject.org/general variomeshttp://variome.net/
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Gene prediction
open-source and on-line gene prediction
Glimmer - bacteria, archea, viruseshttp://cbcb.umd.edu/software/glimmer/
GlimmerHMM - eukaryotic geneshttp://cbcb.umd.edu/software/GlimmerHMM/
GeneZilla (TIGRscan) - eukaryotic geneshttp://www.genezilla.org/
GenScan - human geneshttp://genes.mit.edu/GENSCAN.html
software listshttp://www.genefinding.org/
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Nucleic structures
RNAs and 3D nucleic structural databases
3D structures of nucleic acidsRNABasehttp://www.rnabase.org/NDB nucleic acids databasehttp://ndbserver.rutgers.edu/
SCOR - structural classification of RNAhttp://scor.berkeley.edu/
RNA motifs, structures and interactions
other databasesSmall RNA databasehttp://condor.bcm.tmc.edu/smallRNA/Noncoding RNA databasehttp://biobases.ibch.poznan.pl/ncRNA/
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Proteins
protein structures
3D structuresRCSB http://www.rcsb.org/
protein domainsExpasy http://www.expasy.ch/UniProt http://www.uniprot.org/
structuresSCOP, CATH, FSSP, CASP, PFAMhierarchical classification
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Protein families
systematics on protein structures
SCOP http://scop.berkeley.edu/structural classification of proteinsalpha, beta, alpha/beta, alpha+beta, ... superfamiliesfolds: cca 1000, superfamilies: cca 1500, families: cca 3000
CATH http://www.cathdb.info/class (C), architecture (A), topology (T), homologoussuperfamily (H)cca 1400 familiesC: main secondary structure compositionA: orientation of secondary structuresT: folds with sec. structure connectivityH: similarity superfamilies
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Expasy
ExPASy (expert protein analysis system)
UniProt - the universal protein resourcehttp://www.expasy.uniprot.org/
knowledgebase, reference clusters, archives
swissprothttp://www.expasy.ch/sprot/
database of protein sequences together with annotationsstructure and function of proteins
prositehttp://www.expasy.ch/prosite/
documentation on protein domains, folds, families
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Expressions
expression microarrays repositories
not a central repositoryevery institution wants to have a main microarray database
some of the repositoriesGEO - gene expression omnibushttp://www.ncbi.nlm.nih.gov/geo/Stanford microarray databasehttp://genome-www.stanford.edu/Broad (MIT/Harvard) institutehttp://www.broad.mit.edu/tools/data.htmlEBI ArrayExpresshttp://www.ebi.ac.uk/arrayexpress/ChipDBhttp://staffa.wi.mit.edu/chipdb/public/
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Expression atlases
expression mapping projectsBrainAtlas (mouse oriented)http://www.brainatlas.org/http://www.brain-map.org/RAD - RNA abundance databasehttp://www.cbil.upenn.edu/RAD3/BodyMaphttp://bodymap.ims.u-tokyo.ac.jp/GNF gene expression atlashttp://expression.gnf.org/3D developmental gene expressionhttp://www.univie.ac.at/GeneEMAC/TissueInfohttp://pbtest.med.cornell.edu/services/tissueinfo/query
relational schemaGUS - genomics unified schemahttp://www.gusdb.org/
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Ontology
overall description of bio-systems
Gene Ontologyhttp://www.geneontology.org/description of gene products for various databasesthe main bio-ontology project
Gene Cardshttp://www.genecards.org/human genes information / ontology databaseone of the first ontology projects
KEGGhttp://www.geneobjects.org/Kyoto encyclopedia of genes and genomesmainly known for molecular interaction pathways
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Organisms
sites dedicated to particular model organisms
the sites:the generic model organism database projecthttp://www.gmod.org/Escherichia coli http://ecocyc.org/Saccharomyces cerevisiaehttp://www.yeastgenome.org/Arabidopsis thalianahttp://www.arabidopsis.orgDrosophila melanogasterhttp://www.flybase.org/ http://www.fruitfly.org/Caenorhabditis eleganshttp://www.wormbase.org/Danio rerio http://zfin.org/Mus musculus http://www.informatics.jax.org/Rattus sp. http://rgd.mcw.edu/
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
EBI
European bioinformatics institute
EBI http://www.ebi.ac.uk/part of EMBL http://www.embl.org/
EBI databsesEMBL nucleotide databasehttp://www.ebi.ac.uk/embl/UniProt (together with Expasy and PIR)ArrayExpresshttp://www.ebi.ac.uk/arrayexpress/public repository for microarray dataEnsemblhttp://www.ensembl.org/genomes and annotation for metazoa
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
NCBI
National center for biotechnology information
http://www.ncbi.nlm.nih.gov/
the main bioinformatics institute / web sitedatabases and services
sequence databasesGenBank, ESTs, SNPs, etc.PubMed - literature databaseEntrezhttp://www.ncbi.nlm.nih.gov/entrez/retrieval system connecting together plethora of databasesincluding PubMed, genomes, ontologiesBlast - the search engine, OMIM, etc.Science primerhttp://www.ncbi.nlm.nih.gov/About/primer/introductions into molecular biology and bioinformatics
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Blast
Blast - basic local alignment search tool
http://www.ncbi.nlm.nih.gov/BLAST/
two examplesblast - nucleotide - blastn (for short queries)ATCAGTGTAGTCATCGATACCGTAGTCA
- short random sequenceresultsnothing significant, use mouse sequence gi 83999722display graph
genomes - human - megablast (for related sequences)GACACCTTCTCTCCTCCCAGATTCCAGTAACTCCCAATCTTCTCTCTGCAG
- part of an immunoglobulin sequenceresultstwo very significant matches, use ref NT 026437.11click on IGHG1 for informationclick on ’blue box’ to zoom in 8x
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Perl
standard language for biosequences
Perl scriptingpros:
for fast access to various configuration and log filessuitable for short to middle programs on structured datahuge amount of packages for varius database systems,datastructures, including formats of biological data andconnections to biological databasesregularily used for parsing datafiles and program outputsin common daily bioinformatics
cons:usually hard to read and sustain scriptsobject oriented approach rather rudimentary
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Perl basics
simple perl scripting
example.pl#/usr/bin/env perl
use strict;
use warnings;
my $var01 = "GATTACA";
$var01 =˜ s/T/U/g;
print substr (reverse($var01), 1, 4), "\n"; # CAUU
my @array01 = (3.14, "Pi");
my %hash01 = ("value" => 3.14, "symbol" => "Pi");
print $hash01{"symbol"}, "\n" if 3.14 == $array01[0];
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
BioPerl
modules for sequence-based work in bioinformatics
modulescore, run, dbi packages
main parser, standard bio filetypeswrapper around variety bioinformatics toolsconnecting to biological databases
microarray packagemanipulation of microarray formats (preliminary)
other packagesfor linkage studies, C extensions for align algorithms, etc.
usageuse Bio::Perl;
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Filetype conversion
converterread a given file ’file.seq’takes sequence and writes it in fasta format
#/usr/bin/env perl
use strict;
use warnings;
use Bio::Perl;
my $in = Bio::SeqIO->new(-file => "file.seq" ,
’-format’ => ’GenBank’);
my $out = Bio::SeqIO->new(-file => ">file.fa" ,
’-format’ => ’Fasta’);
my $seq = $in->next seq();
$out->write seq($seq);
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Sequence retrieval
example on sequence filestakes sequence of human protein il9rmakes blast request and write the results
#/usr/bin/env perl
use strict;
use warnings;
use Bio::Perl;
my $seq obj = get sequence(’genbank’,"il9r human");
write sequence(">il9r.fasta",’fasta’,$seq obj);
my $blast result = blast sequence($seq obj);
write blast(">il9r.blast", $blast result);
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Human gene IDs extraction
#/usr/bin/env perl
use strict; #use with <gene2refseq
use warnings;
use Bio::Perl;
my @cols = (1, 2, 7, 9, 10);
my $col last = 11;
my $done = 0;
while (<STDIN>) {. my @line = split /\s+/, $ ;
. if ("9606" eq $line[0]) {
. $line[2] = lc $line[2];
. next if $done >= $line[1];
. foreach my $col (@cols) {
. print STDOUT $line[$col], " ";}
. print STDOUT $line[$col last], "\n";}}
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Data structures
data storage for running programs
one / several long sequencessimple string (array of chars)not to put it into composite structures
long time to access a packed string
sets of all oligonucleotidesnucleotides→ numbers 0...3standard array as a lookup table for the oligos
fast access to each cell
table of gene ids - gaps between idshash (i.e. associative array)scripting languages with suitable hash data structures
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Constraint programming
how to solve problems with constraints
kind of logic programmingdeclarative programming
specify what to do, not how to do ita logic program with constraint specifications
setting relations between dataspecific methods for constraint satisfaction
many general solutions but few of them obey the constraintsmostly combinatorial problems
not the technics for general optimizations
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Constraint approaches
search while consistent
depth-first searchbetter than starting paths if most of them falseback-tracking rather ineffective, to avoid itefficiency with filtering out the wrong paths
filter aheadset as unaccessible all the recognised wrong pathslesser ways for other (backtracked) search attemptsdata reduction for a final exhaustive search
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Chess board example
the second attempt leads to the searched resultmuch faster than the search-backtrack approach
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Constraints in bioinformatics
problems accessible for CP
where (not) to use CPsequence alignment - no
all of the alignments allowedoptimization case, not for the constraint programming
clustering, classification - nomany possible waysoptimization case, not for the constraint programming
sequence assembly - not exactlywhile based on constraints, not a global filtering
higher structure prediction - yessuitable connections of short sequences required
RNA gene prediction - yesspecific sequence characteristics required
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Filters
filtering as CP examples
non-coding RNA predictioneach RNA gene contain base-paired sequencescomplementary sequences with a limited separation(7nt, 70nt)-stack used in FastR software
structure compositionpredicted secondary structures should be concatenatablefirst, to thread short sequences to gain building blockscombinatorial search to concatenate the sequences
Martin Saturka www.Bioplexity.org Bioinformatics - Databases
Items to remember
Nota bene:
file types, data access
Database sitessequences, structuresexpressions, ontologies
Programmingbioperl, conversionconstraints, filters
Martin Saturka www.Bioplexity.org Bioinformatics - Databases