Bioinformatics Resources - Structural Resources / SQL

BioinfRes SoSe 17 - Lecture & Exercises
Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12

Upload: vannguyet

Post on 31-Mar-2018



Preliminary Schedule

Apr 28th   Intro, General Overview (1. sh.)
May 5th    Sequence Databases (2. sh.)
May 12th   Sequence Databases (3. sh.)
May 19th   Structure Databases (4. sh.)
May 26th   No Lecture
Jun 2nd    SQL (5. sh.)
Jun 9th    SQL, NoSql (6. sh.)
Jun 16th   No Lecture
Jun 23rd   NoSql 2 (7. sh.)
Jun 30th   MongoDB, JavaScript (8. sh.)
Jul 7th    Node.js Applications (9. sh.)
Jul 14th   PredictProtein
Jul 21st   Wrap Up, Q&A
Jul 28th   Exam

* These exercises can earn you a bonus


Orga - Exam Date

●  Exam scheduled for Friday, Jul 28th

●  Time: 16:30-18:00

●  Room: MW 0350, Egbert-von-Hoyer Lecture Hall (Mechanical Engineering Building)

●  Registration is MANDATORY

●  so far 6 students registered


Secondary Databases

●  Databases which digest and structure data from primary databases

●  Not always “true” database systems

●  SCOP/CATH

●  PFAM

●  PROSITE


Classification of Structures: CATH-Gene3D / SCOP

●  came up in the middle of the 1990s

●  both are quite similar

●  aim: organize the protein structures available in PDB, based on single domains

●  hierarchical system (roughly):
-  secondary structure content
-  fold
-  superfamilies
-  families


SCOP: a Structural Classification of Proteins

●  Murzin, A., Brenner, S. E., Hubbard, T. J. P. and Chothia, C. (1995) J. Mol. Biol., 247, 536-540

●  Hubbard, T. P., Murzin, A., Brenner, S. E. and Chothia, C. (1997), Nucl. Acids Res. 25(1), 236-239 (easier to obtain)

●  fully manually curated, driven by expert analysis

●  associated with the ASTRAL compendium

●  latest news: SCOPe (UC Berkeley), SCOP2 (MRC Lab Mol Biol, Cambridge, UK)


SCOP: a Structural Classification of Proteins

●  J.-M. Chandonia, et al., SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins – extended Database, J. Mol. Biol. (2016), http://dx.doi.org/10.1016/j.jmb.2016.11.023

●  A. Andreeva, D. Howorth, C. Chothia, E. Kulesha, A. Murzin. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 2014 Jan 1; 42(Database issue): D310–D314. Published online 2013 Nov 29. doi: 10.1093/nar/gkt1242


Hierarchical Level

1.  Classes: Consider secondary structure composition (all α, all β, α/β, α+β, multi-domain, membrane/cell surface/peptides, ...)

2.  Fold: Shape of a domain. Proteins of the same fold have the same major secondary structure elements in the same arrangement with the same topological features

3.  Superfamily: Groups of domains which have at least a distant common ancestor


4.  Family: Groups within superfamilies with a more recent common ancestor (>30% sequence identity, or >15% seq. id. plus same function)

5.  Protein domain: Groups within families, essentially the same protein (isoforms, or the same protein from different species)

6.  Species: Protein domains according to species

7.  Domain: the single domain
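
The top four levels of this hierarchy can be read off a SCOP identifier. The sketch below assumes SCOP's concise classification strings (sccs, e.g. a.1.1.2), which the slides do not mention; the function name split_sccs and the field names are made up for illustration.

```python
from typing import NamedTuple


class SccsLevels(NamedTuple):
    """Prefixes of an sccs string, one per hierarchy level (illustrative names)."""
    scop_class: str
    fold: str
    superfamily: str
    family: str


def split_sccs(sccs: str) -> SccsLevels:
    """Split an sccs string such as 'a.1.1.2' into class, fold, superfamily, family."""
    parts = sccs.strip().split(".")
    well_formed = (
        len(parts) == 4
        and parts[0].isalpha()
        and all(p.isdigit() for p in parts[1:])
    )
    if not well_formed:
        raise ValueError(f"expected something like 'a.1.1.2', got {sccs!r}")
    return SccsLevels(
        scop_class=parts[0],
        fold=".".join(parts[:2]),
        superfamily=".".join(parts[:3]),
        family=".".join(parts),
    )


levels = split_sccs("a.1.1.2")
print(levels)
```

Two domains share a fold when their fold prefixes are equal, and likewise for superfamily and family; the levels below family (protein, species, single domain) are not encoded in this string.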


Development starting from year 2000

taken from http://scop.berkeley.edu/help/ver=2.06#scopchanges


taken from http://scop.berkeley.edu/help/ver=2.06#scopchanges


CATH - Faces

taken from http://www.ebi.ac.uk/about/people/janet-thornton

taken from http://www.tgac.ac.uk/scientific-advisory-board/


Publications

●  Sillitoe I, Lewis TE, Cuff AL, Das S, Ashford P, Dawson NL, Furnham N, Laskowski RA, Lee D, Lees J, Lehtinen S, Studer R, Thornton JM, Orengo CA. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 2015 Jan. doi: 10.1093/nar/gku947

●  Lam SD, Dawson NL, Das S, Sillitoe I, Ashford P, Lee D, Lehtinen S, Orengo CA, Lees JG. Gene3D: expanding the utility of domain assignments. Nucleic Acids Res. 2016 Jan. doi: 10.1093/nar/gkv1231


CATH

●  semi-automatic procedure for deriving a novel hierarchical classification of protein domain structures

●  four main levels:
-  C: protein class, mainly secondary structure composition of each domain
-  A: architecture, summarizes shapes based on orientation of secondary structure elements
-  T: topology, sequential connectivity is considered
-  H: homologous superfamily, high similarity with similar functions, evolutionary relationship assumed


some nine highly populated families (‘superfolds’ [1]), with important implications for prediction algorithms, and it illustrated the insights to be gained from ordering the data in this way.

Several other groups have also classified the known structures, focusing on a variety of local and global topological features and employing a range of algorithms (structure comparison algorithms and classification generally are reviewed in [13–16]). The SCOP database, developed by Murzin et al. [17], groups proteins having significant sequence similarity into homologous families, whereas more distant structural similarities are largely identified manually. This database places emphasis on evolutionary relationships and information from the literature relating to well-studied fold families is also incorporated (e.g. the β trefoils [18] and the OB fold [19]). By contrast Holm and Sander, use the structure comparison algorithm DALI to recognise structural neighbours, whether motif or fold based, without formally ordering proteins in the PDB into families [20]. The ENTREZ database of Hogue et al. [21], uses a similar approach to DALI, listing neighbours by a fast vector-based comparison algorithm (VAST).

The task of defining structural relationships is further complicated by the existence of multidomain proteins; more than 30% of non-identical structures in the current PDB contain two or more domains. A number of domain recognition algorithms have appeared recently to address this problem [22–26]. The 3Dee database of Siddiqui and Barton (http://snail.biop.ox.ac.uk:8080/3Dee) separates the constituent folds of multidomain proteins using the DOMAK algorithm. Similarly, Sowdhamini et al. have constructed a database of single domain families [27], using the domain recognition algorithm DIAL [26] and the structural comparison procedure SEA [28]. Both databases contain data that is generated largely automatically, but is subsequently checked and where appropriate reordered manually.

In recognition of the need to regularly maintain and update data on structural relatives, we have further developed our automatic procedures for identifying and classifying structural families [6] to construct a database of single-domain fold families. Any multidomain proteins are first divided into their constituent domain folds by an automatic consensus procedure which is in agreement between three independent algorithms (SJ et al. unpublished data). As well as clustering proteins by sequence and structure, recognised families are also grouped according to similarity in protein class (i.e. secondary structure composition and contacts). Finally, the architecture (shape, defined by the assembly of secondary structures, regardless of their connectivity) adopted by each protein fold, is assigned manually. Although this is a somewhat subjective process, based largely on commonly used descriptions in the literature (e.g. sandwich, barrel and propellor), it is an essential first step towards ordering the known folds in a useful and practical way.

1094 Structure 1997, Vol 5 No 8

Figure 1

Annual increase in the numbers of protein domain structures in the PDB (top plot, [11,12]). The lower lines show the numbers of identical families (I-level, 100% sequence identity between structures within the family and 100% overlap), non-identical families (N-level, > 95% sequence identity, 85% overlap), sequence families (S-level, > 35% sequence identity, 60% overlap), homologous superfamilies (H-level, > 25% sequence identity, SSAP > 80 and 60% overlap), and topological or fold families (T-level, SSAP > 70), where SSAP is a structural comparison score.

[Plot "Domain fold distribution": number of domains (y-axis, 0 to 7500) against deposition date (x-axis, '85 to '96); curves for Domain, Identical, Non-identical, Sequence family, Homologous superfamily and Topology]

from Structure 15, August 1997, 5:1093–1108 http://biomednet.com/elecref/0969212600501093


Current Release

●  CATHDB version: 4.0

●  235,000 domains

●  25 million protein predictions

●  new:
-  improved prediction of functional families
-  current putative domain assignments (CATH-B)
-  CATH-40: a non-redundant set of CATH domains for homology benchmarking experiments (<40% seq. id. with 60% overlap)

●  http://www.cathdb.info/wiki/doku/?id=release_notes#cath_release_notes


Numbering Scheme

●  C: 1, 2, 3, 4 (alpha, beta, alpha/beta, none) (4)

●  A: same architecture, different topology (40)

●  T: Topology (connection of secondary structure elements) (1373)

●  H: Homology (families) (2737)
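
The numbering scheme can be sketched as a small parser for dotted C.A.T.H codes. The class names are the four listed on the slide; the function name split_cath_code, the field names and the sample code 3.40.50.720 are illustrative choices and not taken from the lecture.

```python
from typing import NamedTuple

# Class numbers and names exactly as listed on the slide.
CATH_CLASSES = {1: "alpha", 2: "beta", 3: "alpha/beta", 4: "none"}


class CathLevels(NamedTuple):
    """One entry per CATH level (illustrative field names)."""
    class_name: str
    architecture: str
    topology: str
    homologous_superfamily: str


def split_cath_code(code: str) -> CathLevels:
    """Split a code like '3.40.50.720' into its C, A, T and H prefixes."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    class_number = int(parts[0])
    if class_number not in CATH_CLASSES:
        raise ValueError(f"unknown class number {class_number}")
    return CathLevels(
        class_name=CATH_CLASSES[class_number],
        architecture=".".join(parts[:2]),
        topology=".".join(parts[:3]),
        homologous_superfamily=".".join(parts),
    )


levels = split_cath_code("3.40.50.720")
print(levels)
```

Each level is kept as a prefix of the full code, so two domains with the same architecture but different topology share the A prefix and differ from the T prefix onwards.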


appear less distinct and may reflect the tolerance of helix packing modes that allows diverse combinations of two- and three-helix motifs. This gives rise to a continuum of folds within which helix packing angles range from aligned through to orthogonal. Despite this variety, certain motifs appear to recur frequently: the aligned α hairpin and the two-helix and three-helix orthogonal motifs common in the repressor and globin-like folds. Therefore, in this class, it may ultimately be more appropriate to separate fold families into architectural groups that reflect specific combinations of these common motifs.

By contrast to the mainly α class, in the mainly β class, the constraints on β strands to be hydrogen bonded within sheets and also on sheet–sheet packing gives rise to some very distinct and easily recognisable architectures. In particular, the β prisms, β propellors and β solenoids demonstrate the symmetry and regularity of structures satisfying these preferred packing constraints. In contrast to the few architectures observed within the mainly α class, at least 16 different, relatively simple, architectures can be discerned in the mainly β class.

The diversity of the mainly β class is not currently observed within the α−β class, in which only eight regular architectures are apparent to date. This may simply reflect a bias in the structures determined or could suggest that in this class the preferred motifs are more constrained in the ways in which they combine. The βαβ motif appears to be highly favoured and is observed within a large proportion of folds. In some topologies, the β strands are adjacent in space (classic motif) but in others they are separated by a third antiparallel strand, forming a three-stranded β sheet (split motif) [31]. Although both the classic and the split βαβ motifs are most commonly found in two and three-layer architectures, the classic motif is also found to recur within barrel and semi-barrel or horseshoe architectures (Figure 4).

The structures that fall outside these rather simple layer architectures tend to be quite complex. Compared to the

1098 Structure 1997, Vol 5 No 8

Table 2

The numbers of fold families (T-level), homologous superfamilies (H-level) and domain structures in different architectures are shown for the mainly α, mainly β and α−β classes.

Class     Architecture                T-levels  % of T*   H-levels  % of H*   Domains   % of domains*
Mainly α  Non-bundle                  86        17.03     93        14.42     1455      18.01
          Bundle                      34        6.73      39        6.05      226       2.80
          Few SS                      25        4.95      25        3.88      112       1.39
Mainly β  Ribbon                      17        3.37      17        2.64      114       1.41
          Single sheet                5         0.99      6         0.93      56        0.69
          Roll                        6         1.19      6         0.93      55        0.68
          Barrel                      22        4.36      29        4.50      861       10.66
          Clam                        1         0.20      1         0.16      1         0.01
          Sandwich                    21        4.16      43        6.67      1236      15.30
          Distorted sandwich          14        2.77      14        2.17      83        1.03
          Trefoil                     1         0.20      4         0.62      49        0.61
          Orthogonal prism            1         0.20      1         0.16      4         0.05
          Aligned prism               1         0.20      2         0.31      3         0.04
          Four-propellor              1         0.20      1         0.16      3         0.04
          Six-propellor               1         0.20      1         0.16      37        0.46
          Seven-propellor             2         0.40      2         0.31      11        0.14
          Eight-propellor             1         0.20      1         0.16      2         0.02
          Two-solenoid                2         0.40      3         0.47      5         0.06
          Three-solenoid              1         0.20      1         0.16      1         0.01
          Complex                     5         0.99      5         0.78      104       1.29
α–β       Roll                        24        4.75      33        5.12      469       5.81
          Barrel                      8         1.58      20        3.10      365       4.52
          Two-layer sandwich          77        15.25     112       17.36     957       11.85
          Three-layer (αβα) sandwich  78        15.45     115       17.83     1396      17.28
          Three-layer (ββα) sandwich  3         0.59      3         0.47      11        0.14
          Four-layer sandwich         4         0.79      4         0.62      12        0.15
          Box                         1         0.20      1         0.16      2         0.02
          Horseshoe                   1         0.20      1         0.16      1         0.01
          Complex                     34        6.73      34        5.27      253       3.13
          Few SS                      14        2.77      14        2.17      96        1.19
Few SS    Irregular                   14        2.77      14        2.17      98        1.21

*The percentages of total fold families, total homologous superfamilies and total domain structures adopting a particular architecture are shown.


Pfam

●  current version is 31.0, March 2017, 16712 families in 604 clans

●  hosted by the EBI

●  Citation: “The Pfam protein families database: towards a more sustainable future” Nucl. Acids Res. (04 January 2016) 44(D1): D279-D285. doi: 10.1093/nar/gkv1344


Pfam

●  Pfam-A: curated seed alignment derived from Pfamseq (UniProtKB based), profile HMMs for the seed alignment, full alignment with all HMM detected sequences

●  Pfam-B: un-annotated, automatically generated from non-redundant clusters from ADDA

●  focuses on single domains


Terms

●  Family: collection of related protein regions

●  Domain: structural unit

●  Repeat: short unit which is unstable in isolation but forms a stable structure when found in multiple copies

●  Motif: short unit found outside globular domains

●  Clans: related group of Pfam entries based on similarity in sequence, structure or profile-HMM


Pfam Numbers (rel. 31)

●  16712 Pfam-A families

●  36% of the families are classified into 604 clans

●  the Pfam-A release matches 73% of the 26.7 million sequences in the corresponding UniProt reference proteome database

●  coverage of 90.5% of SwissProt human

●  use of jackhmmer (from HMMER3 package)

●  consider CATH and PDB


does not currently present a scalability problem, aiding human interpretation through visualization has become increasingly difficult. Most approaches for facilitating alignment visualization natively in the browser do not scale well. Applets, such as the Jalview alignment viewer (12), partly solve the problem, but require Java to be installed and coupled to the browser.

For example, the largest Pfam-A family (version 27.0) with >363 000 matches to the profile HMM is the ABC transporters family (ABC_tran, accession PF00005); its full alignment is thus too large to be useful for most purposes. The seed alignment, by contrast, contains just 55 representative sequences, which may be an insufficient number to represent the sequence diversity within the family. To provide more useable samples of the sequence diversity within a family, we now calculate model-matches for four additional sequence sets, based on ‘Representative Proteomes’ (RPs) (13). For the ABC_tran family, the RP alignments range in size from approximately a quarter of the size of the full alignment to less than one tenth.

In an RP set, each member proteome is selected from a grouping of similar proteomes. The selected proteome is chosen to best represent the set of grouped proteomes in terms of both sequence and annotation information. The grouping of proteomes is based on a clustering of UniProt, UniRef50, and includes all complete proteome sequences. In each cluster, sequences have ≥50% identity and have at least an 80% overlap with the longest sequence. The similarity of two proteomes is determined by considering just the clusters containing sequences from either of the two proteomes. The two proteomes are grouped when the fraction of clusters that contain sequences from both proteomes out of the subset of proteome-specific clusters exceeds a given threshold. This threshold is termed the co-membership threshold. The percentage threshold of co-membership (or common clusters) can be adjusted down to produce larger groupings, and hence less redundant sequence sets.

We use the RP sequence sets constructed using co-membership thresholds of 75, 55, 35 and 15%, giving a range of sequence redundancy for each family. Using representative proteomes has the advantage that it still allows for organism-specific copy numbers to be assessed, a feature that can be lost when using global non-redundancy thresholds on an entire sequence database. However, the major advantage for Pfam is the dramatic reduction in the size of the family full alignments, as shown in Table 1, which illustrates the reductions with increasingly redundant RPs for the 10 biggest families in Pfam. The RP sets do not currently include viruses, and so for some families such as GP120, there may not be a match to the RP sets.

The reduction in the size of the full alignments varies from family to family, reflecting in part the bias in the sequence database. Overall, across the whole of the database, using RP at 75, 55, 35 and 15% co-membership thresholds results in average alignment sizes that are, respectively, 38.8, 29.7, 20.4 and 11.6% of the full alignment size. As the number of sequences in the sequence database increases, we anticipate that the alignments based on RPs will grow at a more linear rate and provide a more convenient way of sampling the full alignment sequence diversity.
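
The co-membership measure described in this excerpt can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions: each cluster is reduced to the set of proteome labels that contribute sequences to it, and the function name co_membership and the labels A, B, C are invented for the example; it is not the actual Representative Proteomes implementation.

```python
from typing import Iterable, Set


def co_membership(clusters: Iterable[Set[str]], proteome_a: str, proteome_b: str) -> float:
    """Fraction of clusters containing both proteomes, among clusters containing either."""
    either = [c for c in clusters if proteome_a in c or proteome_b in c]
    if not either:
        return 0.0
    both = [c for c in either if proteome_a in c and proteome_b in c]
    return len(both) / len(either)


# Toy clusters: each set lists the proteomes with at least one sequence in that cluster.
toy_clusters = [{"A", "B"}, {"A"}, {"B", "C"}, {"A", "B", "C"}, {"C"}]

score = co_membership(toy_clusters, "A", "B")
grouped_at_35 = score > 0.35   # the lower threshold groups the two proteomes
grouped_at_75 = score > 0.75   # the stricter threshold keeps them apart
print(score, grouped_at_35, grouped_at_75)
```

Four of the five toy clusters involve A or B and two of those contain both, so the pair is merged at a 35% threshold but not at 75%, which is the sense in which lower thresholds produce larger groupings.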

As illustrated in Table 1, the full alignment size for the top 10 families ranges from 129 000 to 363 000 sequences. With alignments of this size, it is no longer practical to calculate the neighbour-joining trees provided in previous Pfam releases. Before release 27.0, these approximate neighbour-joining phylogenetic trees (with bootstrapping values based on 100 replicas) were used to order the alignments, such that phylogenetically related sequences would be grouped together. From release 27.0 onwards, the full alignments are ordered according to the HMMER bit score of the match, with the highest scoring sequence found at the top of the alignment. The same phylogenetic trees are still provided for the seed alignments, but are merely a guide as they are calculated with the FastTree approximation algorithm (14). The seed alignment sequences remain ordered according to the calculated tree.

In the Pfam website, we use two different colouring schemes when displaying our alignments in a web browser: the Clustal scheme (15), based on the chemical properties of the amino acids found in the column, and a heat-map scheme that reflects the posterior

Table 1. The reduction in size of RP versus full alignments

Family identifier (accession)  Seed   Full      RP75            RP55            RP35           RP15
ABC_tran (PF00005)             55     363 409   26% (93 265)    21% (77 150)    16% (57 358)   8% (28 903)
COX1 (PF00115)                 94     254 351   1% (2006)       0.7% (1661)     0.4% (1218)    0.2% (538)
zf-H2C2_2 (PF13465)            163    227 898   61% (138 033)   27% (60 664)    15% (34 039)   9% (21 562)
WD40 (PF00400)                 1804   193 252   65% (125 805)   52% (100 531)   36% (69 386)   23% (21 562)
MFS_1 (PF07690)                195    181 668   30% (55 719)    25% (55 719)    17% (55 719)   8% (55 719)
RVT_1 (PF00078)                152    172 360   5% (8257)       4% (6662)       3% (5373)      2% (3604)
BPD_transp_1 (PF00528)         81     156 339   23% (36 523)    19% (29 422)    14% (22 134)   7% (10 630)
Response_reg (PF00072)         57     151 337   29% (44 329)    25% (37 848)    20% (29 453)   10% (15 208)
GP120 (PF00516)                24     146 453   N/A             N/A             N/A            N/A
HATPase_c (PF02518)            659    129 386   28% (36 085)    24% (30 935)    19% (24 121)   10% (12 473)

The seed alignment is used to construct the profile HMM and contains a representative set of sequences of the family. The full alignment contains all hits in pfamseq scoring above the gathering threshold. In Pfam 27.0, we have introduced four additional alignments based on RPs, which contain decreasing amounts of sequence redundancy from RP75 to RP15. For each RP data set, the percentage reduction in the size of the full alignment is shown, with the number of sequences given in brackets.

D224 Nucleic Acids Research, 2014, Vol. 42, Database issue

from release 27
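
A quick arithmetic check on the ABC_tran row of Table 1. The percentages in the table appear to give the RP alignment size relative to the full alignment, rather than the amount removed; the snippet recomputes them from the sequence counts printed in the table. The variable names are illustrative.

```python
# Sequence counts for ABC_tran (PF00005), copied from Table 1.
full_alignment = 363_409
rp_alignments = {"RP75": 93_265, "RP55": 77_150, "RP35": 57_358, "RP15": 28_903}

# Size of each RP alignment as a rounded percentage of the full alignment.
percent_of_full = {
    name: round(100 * size / full_alignment)
    for name, size in rp_alignments.items()
}
print(percent_of_full)
```

The results reproduce the 26%, 21%, 16% and 8% printed in the table, consistent with the text's statement that the RP alignments range from about a quarter to less than one tenth of the full alignment.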


String

●  Protein Interaction Networks: “STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.”

●  2031 organisms

●  9.6 million proteins

●  1,380 million interactions



String

●  Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, Jensen LJ, von Mering C. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017 Jan; 45: D362-68.


Prosite

●  PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

●  Sigrist CJA, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I. New and continuing developments at PROSITE. Nucleic Acids Res. 2012; doi: 10.1093/nar/gks1067 PubMed: 23161676


ENCODE / UCSC Genome Browser

●  The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012 Sep 6; 489(7414): 57–74. doi: 10.1038/nature11247

●  Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG, Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, Haussler D, Kent WJ. ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013 Jan; 41(Database issue): D56-63.


ENCODE / UCSC Genome Browser

●  UCSC Genome Browser: Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002 Jun; 12(6): 996-1006.


ENCODE / UCSC Genome Browser

●  ENCODE: Encyclopedia of DNA Elements

●  international collaboration of research groups

●  funded by the National Human Genome Research Institute (NHGRI)

●  build a comprehensive parts list of functional elements in the human genome

●  includes elements that act on protein and RNA level and regulatory elements

●  The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome


taken from https://www.encodeproject.org/


taken from https://www.encodeproject.org/


Genomic Annotations

taken from https://www.encodeproject.org/data/annotations


UCSC Genome Browser

●  actually a collection of integrated services

●  https://genome.ucsc.edu/index.html

●  provides a more graphical interface to access the ENCODE data and a lot of additional tools

taken from https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&...


Databases - SQL

●  Overlap with database lecture

●  “SQL crash course”

●  no design theory

●  no normalization

●  standard books like:

●  A. Kemper & A. Eickler: Datenbanksysteme – Eine Einführung, 9th ed., 2013, Oldenbourg Verlag, Munich


More Books

●  R. Elmasri, S. B. Navathe: Fundamentals of Database Systems, Benjamin Cummings, Redwood City, CA, USA, 5th ed., 2006

●  R. Ramakrishnan, J. Gehrke: Database Management Systems, 3rd ed., 2009.

●  G. Vossen: Datenmodelle, Datenbanksprachen und Datenbank-Management-Systeme. 5th ed., Oldenbourg, 2008.

●  C. J. Date: An Introduction to Database Systems. McGraw-Hill, 8th ed., 2003.


Selected SQL Topics

●  Table modifications
-  insert, update, create, alter

●  Data retrieval and reporting/aggregation
-  select, average, sum

●  Combination and Performance
-  join

●  Access control and permissions
-  grant

●  Backup and Restore / Input-output
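
A few of these topics can be tried without a database server. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for whatever DBMS the exercises use; the Students table borrows its column names from the later slides, and the semester values are made up. SQLite has no GRANT statement, so access control is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# create / alter: define a table, then add a column afterwards
cur.execute("CREATE TABLE Students (Matric INTEGER PRIMARY KEY, Name TEXT)")
cur.execute("ALTER TABLE Students ADD COLUMN Semester INTEGER")

# insert / update (semester values are hypothetical)
cur.executemany(
    "INSERT INTO Students (Matric, Name, Semester) VALUES (?, ?, ?)",
    [(123455, "Mayer", 2), (233457, "Huber", 4)],
)
cur.execute("UPDATE Students SET Semester = 3 WHERE Matric = 123455")

# select with aggregation: average and sum over one column
avg_semester, sum_semester = cur.execute(
    "SELECT AVG(Semester), SUM(Semester) FROM Students"
).fetchone()
print(avg_semester, sum_semester)

conn.commit()
```

After the update the two students are in semesters 3 and 4, so the query returns an average of 3.5 and a sum of 7.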


Reasons for DBMS

●  redundancy, consistency

●  limited access

●  difficult multi-user access

●  loss of information

●  loss of integrity

●  security issues

●  expensive application development


Abstraction layers

Physical Layer

Logical Layer

View 1 View 2


Various Data Models

●  Network model

●  Hierarchical model

●  Relational Model

●  XML schema

●  Object-oriented model

●  Deductive model


Relational Model

Students
Matric   Name
123455   Mayer
233457   Huber
...      ...

Attends
Matric   LectureNo
123455   2
233457   5
...      ...

Lectures
LectureNo   Title
2           Bioinformatics
5           Genomics
...         ...

SELECT Name
FROM Students, Attends, Lectures
WHERE Students.Matric = Attends.Matric
  AND Attends.LectureNo = Lectures.LectureNo
  AND Lectures.Title = 'Genomics';

UPDATE Lectures SET Title = 'Genomics of Mammalian' WHERE LectureNo = 5;
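The two statements above can be tried out directly. The following is a minimal sketch that uses Python's built-in sqlite3 module and an in-memory database; the column types are assumptions, since the slide only shows names and values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Students (Matric INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Lectures (LectureNo INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Attends  (Matric INTEGER, LectureNo INTEGER);
INSERT INTO Students VALUES (123455, 'Mayer'), (233457, 'Huber');
INSERT INTO Lectures VALUES (2, 'Bioinformatics'), (5, 'Genomics');
INSERT INTO Attends  VALUES (123455, 2), (233457, 5);
""")

# Names of all students attending the lecture titled 'Genomics'.
genomics_students = con.execute("""
    SELECT Name
    FROM Students, Attends, Lectures
    WHERE Students.Matric = Attends.Matric
      AND Attends.LectureNo = Lectures.LectureNo
      AND Lectures.Title = 'Genomics'
""").fetchall()

# Rename lecture 5, then read the new title back.
con.execute("UPDATE Lectures SET Title = 'Genomics of Mammalian' WHERE LectureNo = 5")
new_title = con.execute("SELECT Title FROM Lectures WHERE LectureNo = 5").fetchone()[0]

print(genomics_students, new_title)
```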

Entity Relationship Model

● Graphical notation
● Models real world "entities" and "relations"
● allows for "attributes"
● allows for functionalities (1:1, 1:n, n:m)
● allows to define keys
● key: a set of attributes whose combination of values allows unambiguous identification of an instance

Notation

● (strong) Entity, e.g. Student
● Attribute (key: underlined), e.g. Name
● Relation, e.g. Attends
● weak Entity (depends on others)

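ER Example

[Figure: ER diagram. The entity Student (attributes Matric, Name, Semester) is connected via the relation Attends to the entity Lecture (attributes LectureNo, Title, Reader).]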

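Functionality

[Figure: ER diagram. The relation Attends connects Student and Lecture with functionality N:M; Grade appears as an additional attribute.]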

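Functionalities (Funktionalitäten)

[Figure: ER diagram of a university schema, labels in German. Entities: Studenten (students: MatrNr, Name, Semester), Vorlesungen (lectures: VorlNr, Titel, SWS), Professoren (professors: PersNr, Name, Rang, Raum), Assistenten (assistants: PersNr, Name, Fachgebiet). Relations: hören (attend), prüfen (examine, with the attribute Note = grade), lesen (teach), voraussetzen (require, with the roles Vorgänger = predecessor and Nachfolger = successor), arbeitenFür (work for). The edges are annotated with the functionalities 1, N and M.]

taken from Prof. Kemper's database lecture WS 13/14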

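Exams (Prüfungen) as a Weak Entity Type

[Figure: ER diagram, labels in German. The weak entity Prüfungen (exams: PrüfTeil, Note) depends on Studenten (MatrNr) via the relation ablegen (take), annotated 1:N. Prüfungen is connected to Vorlesungen (VorlNr) via umfassen (cover) and to Professoren (PersNr) via abhalten (conduct), both annotated N:M.]

● Several examiners in one exam
● Several lectures are examined in one exam

taken from Prof. Kemper's database lecture WS 13/14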

SQL

● standardized, e.g. SQL:1999 (also known as SQL3) and SQL:2003
● implemented by most available database management system manufacturers
● but: not always all specified features are implemented
● not everything is specified!
● especially admin/server maintenance is often vendor specific

SQL Data Types

● char
● varchar
● binary and varbinary
● blob and text
● numeric, decimal, integer (exact)
● approximate: float, double

SQL Data Types

● various formats for time and date
● enum: one out of a defined set
● set: zero or more items out of a predefined list

For more information see the live tour through

http://dev.mysql.com/doc/refman/5.6/en/index.html

ACID Principle for Transactions

● A: Atomicity: all-or-nothing, i.e. a sequence of operations is executed like a single atomic operation which cannot be interrupted
● C: Consistency: after every transaction the database is consistent, i.e. all conditions and constraints about context and relationships are fulfilled

ACID Principle II

● I: Isolation: concurrent operations do not affect each other
● D: Durability: upon successful completion of a transaction it is guaranteed that all modifications are persistent, i.e. they are stored in the database, even in case of an unexpected power loss.
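Atomicity and consistency can be observed in a small experiment. The sketch below uses Python's sqlite3 module; the `account` table, its CHECK constraint and the `transfer` helper are hypothetical and serve only as an illustration. A transfer either applies both updates or, if a constraint is violated, none of them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
con.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
con.commit()

def transfer(amount):
    """Move `amount` from account a to account b as one transaction."""
    try:
        with con:  # commits on success, rolls back if an exception is raised
            con.execute("UPDATE account SET balance = balance + ? WHERE name = 'b'", (amount,))
            con.execute("UPDATE account SET balance = balance - ? WHERE name = 'a'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

first = transfer(60)   # fits: a has 100
second = transfer(60)  # would make a negative, so the whole transfer is undone
balances = dict(con.execute("SELECT name, balance FROM account"))
print(first, second, balances)
```

In the second call the credit to b has already been executed when the debit fails; the rollback removes it again, which is the all-or-nothing behaviour described above.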

Relational Algebra

● σ Selection
● π Projection
● ρ Rename
● × Cross Product
● ⋈ Join
● − Difference
● ÷ Division

Relational Algebra

● ∪ Union
● ∩ Intersection
● ⋉ Semi Join (left)
● ⟕ Left Outer Join
● ⟗ (Full) Outer Join

Demonstration Table

gene    indiv   organism   function     status
cytox   1       mouse                   prep
gapdh   1       human      glycolysis   completed
gapdh   2       human      glycolysis   completed
ttn     2       human      muscle       ongoing
unkno   3       human      NULL         prep

Selection

● The SELECT operation (denoted by σ (sigma)) is used to select a subset of the tuples from a relation based on a selection condition
● It acts as a (row) filter
● Specified in the WHERE clause
● σ_{status = "ongoing"}(STATUS)

Selection

● General: the select operation is denoted by σ_{<selection condition>}(R) where:
- the σ (sigma) is used to denote the select operator
- the selection condition is a Boolean (conditional) expression specified on the attributes of relation R
- tuples that make the condition true are selected (appear in the result of the operation)
- tuples that make the condition false are filtered out (discarded from the result of the operation)

Selection

● The Boolean expression specified in <selection condition> is made up of a number of clauses of the form:
<attribute name> <comparison op> <constant value>
or
<attribute name> <comparison op> <attribute name>
● <attribute name> is the name of an attribute of R, <comparison op> is normally one of the operations {=, >, >=, <, <=, !=}
● Clauses can be arbitrarily connected by the Boolean operators and, or and not

Selection

● NULL is tested for with special operators
● Select σ is commutative:
σ_{<condition1>}(σ_{<condition2>}(R)) = σ_{<condition2>}(σ_{<condition1>}(R))
● a cascade of select operations can be replaced by a single selection with a conjunction of the conditions:
σ_{<cond1>}(σ_{<cond2>}(σ_{<cond3>}(R))) = σ_{<cond1> AND <cond2> AND <cond3>}(R)
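The selection rules can be checked on the demonstration table. This is a sketch with Python's sqlite3 module; the table name `gene_status`, the column name `func` (for "function") and the empty string for the blank cell of cytox are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE gene_status "
    "(gene TEXT, indiv INTEGER, organism TEXT, func TEXT, status TEXT)"
)
con.executemany("INSERT INTO gene_status VALUES (?, ?, ?, ?, ?)", [
    ("cytox", 1, "mouse", "", "prep"),  # blank on the slide, stored as '' here
    ("gapdh", 1, "human", "glycolysis", "completed"),
    ("gapdh", 2, "human", "glycolysis", "completed"),
    ("ttn", 2, "human", "muscle", "ongoing"),
    ("unkno", 3, "human", None, "prep"),  # None is stored as SQL NULL
])

# Selection: a row filter expressed in the WHERE clause.
ongoing = con.execute(
    "SELECT gene FROM gene_status WHERE status = 'ongoing'").fetchall()

# A cascade of two selections and one selection with AND give the same rows.
cascade = con.execute("""
    SELECT gene, indiv
    FROM (SELECT * FROM gene_status WHERE organism = 'human')
    WHERE status = 'completed' ORDER BY indiv""").fetchall()
conjunction = con.execute("""
    SELECT gene, indiv FROM gene_status
    WHERE status = 'completed' AND organism = 'human' ORDER BY indiv""").fetchall()

# NULL needs the special operator IS NULL; '= NULL' never evaluates to true.
is_null = con.execute("SELECT gene FROM gene_status WHERE func IS NULL").fetchall()
equals_null = con.execute("SELECT gene FROM gene_status WHERE func = NULL").fetchall()

print(ongoing, cascade, is_null, equals_null)
```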

Projection

● The PROJECT operation is denoted by π (pi)
● use PROJECT to retrieve specific attributes of relation R
● It acts as a (column) filter of the tuples
● Example: π_{gene, status}(STATUS)
● Project removes duplicates which might occur (in SQL: SELECT DISTINCT instead of simple SELECT)
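The difference between a plain SELECT and the duplicate-free projection can be seen on a reduced copy of the demonstration table. The sketch uses Python's sqlite3 module; the table name `gene_status` is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene_status (gene TEXT, indiv INTEGER, status TEXT)")
con.executemany("INSERT INTO gene_status VALUES (?, ?, ?)", [
    ("cytox", 1, "prep"),
    ("gapdh", 1, "completed"),
    ("gapdh", 2, "completed"),
    ("ttn", 2, "ongoing"),
    ("unkno", 3, "prep"),
])

# Plain SELECT keeps duplicates: one output row per input row.
plain = con.execute("SELECT gene, status FROM gene_status").fetchall()

# SELECT DISTINCT corresponds to the projection of relational algebra.
projected = con.execute(
    "SELECT DISTINCT gene, status FROM gene_status ORDER BY gene").fetchall()

print(len(plain), projected)
```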

Single Expression vs. Sequence of Relational Operations

● To retrieve completed genes from our example:
● Single expression:
π_{gene, status}(σ_{status = "completed"}(STATUS))
● Sequence of operations:
ALL_COMP ← σ_{status = "completed"}(STATUS)
RESULT ← π_{gene, status}(ALL_COMP)
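In SQL the two styles correspond roughly to one query versus a named intermediate result. The sketch below uses Python's sqlite3 module and a temporary view for ALL_COMP; the table name `gene_status` and the use of a view (rather than, say, a temporary table) are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene_status (gene TEXT, indiv INTEGER, status TEXT)")
con.executemany("INSERT INTO gene_status VALUES (?, ?, ?)", [
    ("cytox", 1, "prep"),
    ("gapdh", 1, "completed"),
    ("gapdh", 2, "completed"),
    ("ttn", 2, "ongoing"),
    ("unkno", 3, "prep"),
])

# Single expression: selection and projection in one statement.
single = con.execute("""
    SELECT DISTINCT gene, status FROM gene_status
    WHERE status = 'completed'""").fetchall()

# Sequence of operations: name the intermediate relation, then project it.
con.execute("""
    CREATE TEMP VIEW all_comp AS
    SELECT * FROM gene_status WHERE status = 'completed'""")
result = con.execute("SELECT DISTINCT gene, status FROM all_comp").fetchall()

print(single, result)
```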

Rename

● RENAME is denoted by ρ (rho)
● In some cases, we may want to rename the attributes of a relation or the relation name or both
- Useful when a query requires multiple operations
- Necessary in some cases (see JOIN operation later)

RENAME

● The RENAME operation ρ can be expressed in any of the following forms:
- ρ_S(R) changes the relation name only, to S
- ρ_(B1, B2, …, Bn)(R) changes the column (attribute) names only, to B1, B2, …, Bn
- ρ_S(B1, B2, …, Bn)(R) changes both: the relation name to S, and the column (attribute) names to B1, B2, …, Bn
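In SQL, renaming is done with AS. The sketch below (Python's sqlite3 module, table name `gene_status` assumed) renames two columns and then shows a case where renaming is necessary: joining a relation with itself, here to find pairs of individuals that share a gene.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene_status (gene TEXT, indiv INTEGER, status TEXT)")
con.executemany("INSERT INTO gene_status VALUES (?, ?, ?)", [
    ("cytox", 1, "prep"),
    ("gapdh", 1, "completed"),
    ("gapdh", 2, "completed"),
    ("ttn", 2, "ongoing"),
    ("unkno", 3, "prep"),
])

# Rename attributes (gene -> locus, status -> state) and the relation (-> s).
cur = con.execute("SELECT gene AS locus, status AS state FROM gene_status AS s")
renamed_columns = [column[0] for column in cur.description]

# A self-join only works if the two copies get different names.
pairs = con.execute("""
    SELECT s1.gene, s1.indiv, s2.indiv
    FROM gene_status AS s1, gene_status AS s2
    WHERE s1.gene = s2.gene AND s1.indiv < s2.indiv""").fetchall()

print(renamed_columns, pairs)
```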

Relational Operators from Set Theory

● Union
● Intersection
● Minus
● Cartesian Product

Union

● It is a binary operation, denoted by ∪
● The result of R ∪ S is a relation that includes all tuples that are either in R or in S or in both R and S
● Duplicate tuples are eliminated
● R and S have to be type compatible:
- they must have the same number of attributes
- corresponding attributes are type compatible

Intersection

● INTERSECTION is denoted by ∩
● The result of the operation R ∩ S is a relation that includes all tuples that are in both R and S
● The attribute names in the result will be the same as the attribute names in R
● The two operand relations R and S must be "type compatible"

Set Difference

● SET DIFFERENCE (also called MINUS or EXCEPT) is denoted by −
● The result of R − S is a relation that includes all tuples that are in R but not in S
● The attribute names in the result will be the same as the attribute names in R
● The two operand relations R and S must be "type compatible"

Properties of Union, Intersection and Difference

● Both union and intersection are commutative; that is:
R ∪ S = S ∪ R, and R ∩ S = S ∩ R
● Union and intersection are associative operations; that is:
R ∪ (S ∪ T) = (R ∪ S) ∪ T
(R ∩ S) ∩ T = R ∩ (S ∩ T)
● The minus operation is not commutative; that is, in general:
R − S ≠ S − R
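These properties can be checked with the SQL set operators UNION, INTERSECT and EXCEPT. The sketch uses Python's sqlite3 module; the one-column relations `r` and `s` and the helper `q` are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (x INTEGER);
CREATE TABLE s (x INTEGER);
INSERT INTO r VALUES (1), (2), (3);
INSERT INTO s VALUES (3), (4);
""")

def q(sql):
    """Run a one-column query and return its values as a list."""
    return [row[0] for row in con.execute(sql)]

union_rs = q("SELECT x FROM r UNION SELECT x FROM s ORDER BY x")
union_sr = q("SELECT x FROM s UNION SELECT x FROM r ORDER BY x")
inter_rs = q("SELECT x FROM r INTERSECT SELECT x FROM s ORDER BY x")
inter_sr = q("SELECT x FROM s INTERSECT SELECT x FROM r ORDER BY x")
r_minus_s = q("SELECT x FROM r EXCEPT SELECT x FROM s ORDER BY x")
s_minus_r = q("SELECT x FROM s EXCEPT SELECT x FROM r ORDER BY x")

print(union_rs, inter_rs, r_minus_s, s_minus_r)
```

The value 3 occurs in both relations but only once in the union, because duplicates are eliminated.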

Cross Product (Cartesian Product)

● CROSS PRODUCT operation
● Used to combine tuples from two relations in a combinatorial fashion
● Denoted by R(A1, A2, ..., An) × S(B1, B2, ..., Bm)
● Result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm)

Cartesian Product (Cross Product)

● The resulting relation contains every possible combination of the tuples from R and S, one from R and one from S
● Hence, if R has n_R tuples (denoted as |R| = n_R), and S has n_S tuples, then R × S will have n_R * n_S tuples
● The two operands do NOT have to be "type compatible"
● Generally, CARTESIAN PRODUCT is not a meaningful operation, but can become meaningful when followed by other operations
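A small sketch of the size rule, using Python's sqlite3 module with two made-up relations `r` and `s` whose attributes have different types:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (a INTEGER);
CREATE TABLE s (b TEXT);
INSERT INTO r VALUES (1), (2), (3);
INSERT INTO s VALUES ('x'), ('y');
""")

# Every tuple of r is combined with every tuple of s.
product = con.execute("SELECT a, b FROM r CROSS JOIN s").fetchall()

print(len(product), sorted(product))
```

With 3 tuples in `r` and 2 in `s`, the product has 3 * 2 = 6 tuples.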

Join

● JOIN operation (denoted by ⋈)
● A sequence of CARTESIAN PRODUCT followed by SELECT is used to identify and select related tuples from two relations
● very important for any relational database with more than a single relation, because it allows us to combine related tuples from various relations

Join

● The general form of a join operation on two relations R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is:
R ⋈_{<join condition>} S
● where R and S can be any relations that result from general relational algebra expressions

Join

● Consider the following JOIN operation:
- Given R(A1, A2, ..., An) and S(B1, B2, ..., Bm), take the join condition R.Ai = S.Bj
- Result is a relation Q with degree n + m attributes: Q(A1, A2, ..., An, B1, B2, ..., Bm)
- The resulting relation state has one tuple for each combination of tuples, r from R and s from S, but only if they satisfy the join condition r[Ai] = s[Bj]
- if R has n_R tuples, and S has n_S tuples, then the join result will generally have fewer than n_R * n_S tuples

Join (more precise)

● The general case of the JOIN operation is called a theta-join: R ⋈_θ S
● The join condition is called theta (θ)
● Theta can be any general Boolean expression on the attributes of R and S; for example:
R.Ai < S.Bj AND (R.Ak = S.Bl OR R.Ap < S.Bq)
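A join can be written either as a filtered cross product or with the JOIN ... ON syntax. The sketch below (Python's sqlite3 module) uses two hypothetical relations, `measurement(gene, indiv)` and `individual(id, organism)`, and also shows a theta-join with a condition other than equality.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE measurement (gene TEXT, indiv INTEGER);
CREATE TABLE individual (id INTEGER, organism TEXT);
INSERT INTO measurement VALUES ('gapdh', 1), ('gapdh', 2), ('ttn', 2);
INSERT INTO individual VALUES (1, 'human'), (2, 'human'), (3, 'mouse');
""")

# Cross product followed by a selection ...
via_product = con.execute("""
    SELECT gene, indiv, organism FROM measurement, individual
    WHERE measurement.indiv = individual.id
    ORDER BY gene, indiv""").fetchall()

# ... is the same as the explicit join with the same condition.
via_join = con.execute("""
    SELECT gene, indiv, organism
    FROM measurement JOIN individual ON measurement.indiv = individual.id
    ORDER BY gene, indiv""").fetchall()

# Theta-join: any Boolean condition is allowed, here '<' instead of '='.
theta = con.execute("""
    SELECT gene, indiv, id
    FROM measurement JOIN individual ON measurement.indiv < individual.id
    """).fetchall()

print(via_join, len(theta))
```

The cross product of the two relations has 3 * 3 = 9 tuples; the join with the equality condition keeps only 3 of them.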

Equijoin

● The most common use of join involves join conditions with equality comparisons only
● Such a join, where the only comparison operator used is =, is called an EQUIJOIN
● The JOIN seen in the previous example was an EQUIJOIN

Natural Join

● Another variation of JOIN, called NATURAL JOIN, is denoted by * or by ⋈ without a condition
● It was created to get rid of the second (superfluous) attribute in an EQUIJOIN condition.
● Q ← R(A, B, C, D) * S(C, D, E)
● the implicit join condition includes each pair of attributes with the same name, "AND"ed together: R.C = S.C AND R.D = S.D
● keeps only one attribute of each such pair: Q(A, B, C, D, E)
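The example R(A, B, C, D) * S(C, D, E) can be reproduced with SQL's NATURAL JOIN. This sketch uses Python's sqlite3 module; the tuple values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (a INTEGER, b TEXT, c INTEGER, d INTEGER);
CREATE TABLE s (c INTEGER, d INTEGER, e TEXT);
INSERT INTO r VALUES (1, 'x', 10, 20), (2, 'y', 10, 21);
INSERT INTO s VALUES (10, 20, 'e1'), (11, 20, 'e2');
""")

# The join condition r.c = s.c AND r.d = s.d is implied by the column names.
cur = con.execute("SELECT * FROM r NATURAL JOIN s")
columns = [column[0] for column in cur.description]
rows = cur.fetchall()

print(columns, rows)
```

Only the first tuple of `r` agrees with a tuple of `s` on both C and D, and each shared attribute appears once in the result.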

Semi Join

● acts like a filter based on a specified attribute
● R ⋉ S means: if R and S have a common attribute C, the result consists of all tuples from R whose C value also occurs in S; n_Q ≤ n_R tuples
● Q ← R(A, B, C) ⋉ S(C, D, E)
● Q(A, B, C) has the same attributes as R
● π_{A, B, C}(σ_{R.C = S.C}(R × S))
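SQL has no dedicated semi-join keyword; a common way to express it is a subquery with EXISTS (or IN). The sketch below uses Python's sqlite3 module with invented tuples and compares this form with the algebra expression π(σ(R × S)).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (a INTEGER, b TEXT, c INTEGER);
CREATE TABLE s (c INTEGER, d TEXT, e TEXT);
INSERT INTO r VALUES (1, 'x', 10), (2, 'y', 11), (3, 'z', 12);
INSERT INTO s VALUES (10, 'd1', 'e1'), (10, 'd2', 'e2'), (12, 'd3', 'e3');
""")

# Semi join as a filter on r: keep tuples whose c value occurs in s.
semi = con.execute("""
    SELECT a, b, c FROM r
    WHERE EXISTS (SELECT 1 FROM s WHERE s.c = r.c)
    ORDER BY a""").fetchall()

# The algebra expression: project the join back onto the attributes of r.
# DISTINCT is needed because tuple 1 has two join partners in s.
projected_join = con.execute("""
    SELECT DISTINCT r.a, r.b, r.c FROM r, s
    WHERE r.c = s.c ORDER BY r.a""").fetchall()

unprojected = con.execute("SELECT r.a FROM r, s WHERE r.c = s.c").fetchall()

print(semi, len(unprojected))
```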

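Left Outer Join

● The right version is analogous
● adds information to corresponding left side tuples
● R ⟕ S means: if R and S have a common attribute C, the result consists of all combined tuples from R and S where R.C = S.C and, in addition, all remaining tuples from R; n_Q ≥ n_R tuples (n_Q = n_R if every tuple of R matches at most one tuple of S)
● Q ← R(A, B, C) ⟕ S(C, D, E)
● Q(A, B, C, D, E) has the attributes of R ∪ S
● if no matching tuple is found in S, the attributes D and E contain no values (NULL)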
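A sketch of the left outer join with Python's sqlite3 module and invented tuples. Tuple 2 of `r` has no partner in `s` and is padded with NULL (None in Python), while tuple 1 has two partners and therefore appears twice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (a INTEGER, b TEXT, c INTEGER);
CREATE TABLE s (c INTEGER, d TEXT, e TEXT);
INSERT INTO r VALUES (1, 'x', 10), (2, 'y', 11), (3, 'z', 12);
INSERT INTO s VALUES (10, 'd1', 'e1'), (10, 'd2', 'e2'), (12, 'd3', 'e3');
""")

left_outer = con.execute("""
    SELECT r.a, r.b, r.c, s.d, s.e
    FROM r LEFT OUTER JOIN s ON r.c = s.c
    ORDER BY r.a, s.d""").fetchall()

for row in left_outer:
    print(row)
```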

(Full) Outer Join

● combines corresponding tuples from R and S where possible, else the attributes are left blank
● R ⟗ S means: if R and S have a common attribute C, the result consists of all combined tuples from R and S where R.C = S.C and, in addition, all remaining tuples from R and S; n_Q ≤ n_R + n_S tuples if C is a key in R or in S
● Q ← R(A, B, C) ⟗ S(C, D, E)
● Q(A, B, C, D, E) has the attributes of R ∪ S
● if no matching tuple is found in R or S, the attributes A, B or D and E contain no values (NULL)
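Not every system offers a FULL OUTER JOIN keyword (MySQL, for example, does not), so the sketch below emulates it: a left outer join plus the tuples of `s` that found no partner. It uses Python's sqlite3 module with invented tuples; the emulation is one common approach, not the only one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (a INTEGER, b TEXT, c INTEGER);
CREATE TABLE s (c INTEGER, d TEXT, e TEXT);
INSERT INTO r VALUES (1, 'x', 10), (2, 'y', 11);
INSERT INTO s VALUES (10, 'd1', 'e1'), (12, 'd3', 'e3');
""")

full_outer = set(con.execute("""
    -- all tuples of r, with their partners from s where they exist
    SELECT r.a, r.b, r.c, s.d, s.e
    FROM r LEFT OUTER JOIN s ON r.c = s.c
    UNION ALL
    -- the remaining tuples of s, padded with NULL on the r side
    SELECT NULL, NULL, s.c, s.d, s.e
    FROM s
    WHERE NOT EXISTS (SELECT 1 FROM r WHERE r.c = s.c)
"""))

for row in full_outer:
    print(row)
```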

Division

● Gives all tuples over the attributes R − S (the attributes of R that are not in S) whose values co-occur in R with all tuples in S
● R(A, B) and S(B)
● R ÷ S: Q(A), where each result tuple in Q can be found in R in combination with every tuple from S
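SQL has no division operator. A widely used formulation is a double NOT EXISTS: keep every A value for which no tuple of S is missing as a partner. The sketch uses Python's sqlite3 module with invented values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE r (a TEXT, b TEXT);
CREATE TABLE s (b TEXT);
INSERT INTO r VALUES ('a1', 'b1'), ('a1', 'b2'),
                     ('a2', 'b1'),
                     ('a3', 'b1'), ('a3', 'b2'), ('a3', 'b3');
INSERT INTO s VALUES ('b1'), ('b2');
""")

# "There is no b in s such that (a, b) is not in r."
quotient = [row[0] for row in con.execute("""
    SELECT DISTINCT r1.a
    FROM r AS r1
    WHERE NOT EXISTS (
        SELECT 1 FROM s
        WHERE NOT EXISTS (
            SELECT 1 FROM r AS r2
            WHERE r2.a = r1.a AND r2.b = s.b))
    ORDER BY r1.a""")]

print(quotient)
```

Here a2 is dropped because it occurs with b1 only, whereas a1 and a3 occur with both b1 and b2.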

Complete Set of Relational Operations

● The set of operations including SELECT σ, PROJECT π, UNION ∪, DIFFERENCE −, RENAME ρ, and CARTESIAN PRODUCT × is called a complete set because any other relational algebra expression can be expressed by a combination of these operations.
● Examples:
- R ∩ S = (R ∪ S) − ((R − S) ∪ (S − R))
- R ⋈_{<join condition>} S = σ_{<join condition>}(R × S)
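Both example identities can be verified on small relations modelled as Python sets of tuples. The relation contents are invented for this sketch.

```python
# Relations as sets: one-attribute relations for the intersection identity.
R = {1, 2, 3}
S = {3, 4}

# R ∩ S expressed with union and difference only.
intersection_via_identity = (R | S) - ((R - S) | (S - R))

# Relations R2(A, B) and S2(C, D) for the join identity.
R2 = {(1, "a"), (2, "b")}
S2 = {("a", "x"), ("c", "y")}

# Cartesian product: concatenate every tuple of R2 with every tuple of S2.
product = {r + s for r in R2 for s in S2}

# Join with condition R2.B = S2.C, written as a selection over the product.
join = {t for t in product if t[1] == t[2]}

print(intersection_via_identity, join)
```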

Beyond Classical Algebra

● Grouping: group by
● Aggregation: count, sum, average, min, max
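A sketch of grouping and aggregation on a reduced copy of the demonstration table, using Python's sqlite3 module (table name `gene_status` assumed). For each organism it counts the rows, the distinct genes, and reports the smallest and largest individual number.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene_status (gene TEXT, indiv INTEGER, organism TEXT)")
con.executemany("INSERT INTO gene_status VALUES (?, ?, ?)", [
    ("cytox", 1, "mouse"),
    ("gapdh", 1, "human"),
    ("gapdh", 2, "human"),
    ("ttn", 2, "human"),
    ("unkno", 3, "human"),
])

per_organism = con.execute("""
    SELECT organism,
           COUNT(*)             AS n_rows,
           COUNT(DISTINCT gene) AS n_genes,
           MIN(indiv)           AS first_indiv,
           MAX(indiv)           AS last_indiv
    FROM gene_status
    GROUP BY organism
    ORDER BY organism""").fetchall()

print(per_organism)
```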

Keys and Indexes

● Each relation represents a subset of the Cartesian product of its domains (attributes)
● Some values might be unique for a row, others are not
● To address and access a specific tuple in a relation we need to define a primary key
● A primary key is a set of attributes whose combination of values allows us to unambiguously identify a certain row in the relation

Keys and Indexes

● Consequences:
- Each primary key (combination) can occur only once in a table
- Entries that lack a value for one of these attributes are not allowed (NOT NULL)
- Default values for these attributes make no sense
- The system has to keep track of this with the help of an index
● The key depends on the modeling and the domain

Indexes/Constraints

● PRIMARY KEY: UNIQUE, NOT NULL
● UNIQUE: if there is a value it must be unique; NULL (no value) can occur multiple times
● INDEX: a search structure which allows finding tuples (rows) with a specific attribute value efficiently
- must be explicitly requested in the table structure
- for character types you can specify the prefix length
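The three kinds of constraints can be tried out as follows. The sketch uses Python's sqlite3 module; the column `accession`, the index name `idx_status` and the helper `rejected` are hypothetical. NOT NULL is written out for the key columns because SQLite, unlike the rule on the slide, does not imply it for every primary key. Prefix lengths for character indexes are not shown, since SQLite has no such option.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE gene_status (
        gene      TEXT    NOT NULL,
        indiv     INTEGER NOT NULL,
        accession TEXT    UNIQUE,       -- hypothetical extra column
        status    TEXT,
        PRIMARY KEY (gene, indiv)       -- composite key
    )""")
con.execute("CREATE INDEX idx_status ON gene_status (status)")

# Two rows without accession: NULL may repeat in a UNIQUE column.
con.execute("INSERT INTO gene_status VALUES ('gapdh', 1, NULL, 'completed')")
con.execute("INSERT INTO gene_status VALUES ('gapdh', 2, NULL, 'completed')")

def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        con.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

duplicate_key = rejected("INSERT INTO gene_status VALUES ('gapdh', 1, 'X1', 'ongoing')")
missing_key = rejected("INSERT INTO gene_status VALUES ('ttn', NULL, 'X2', 'ongoing')")
null_count = con.execute(
    "SELECT COUNT(*) FROM gene_status WHERE accession IS NULL").fetchone()[0]
index_names = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'gene_status'")]

print(duplicate_key, missing_key, null_count, index_names)
```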

Performance Considerations

● There are three relations to join, A * B * C:
- A (1,000,000 rows)
- B (100 rows)
- C (10,000 rows)

Performance Considerations

● (Worst) case without indexes and with a bad sequence:
- A * C: 10,000,000,000 comparisons, O(n*m) → D (10,000,000,000 rows)
- D * B: 1,000,000,000,000 comparisons, O(n*m)
- of course, tuples might be dropped in reality because of missing join partners
● Case with indexes and a clever sequence:
- B * A: 100 * log(10,000,000) comparisons → D (10,000,000 rows)
- C * D: 10,000 * log(10,000,000) comparisons
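The orders of magnitude can be recomputed with a few lines of Python. This is a rough cost sketch, not a model of a real query optimizer: it takes the row counts of A, B and C from the slide above, assumes a nested-loop join without indexes, and assumes that an index lookup costs about log2(n) comparisons. The helper `indexed_cost` is hypothetical.

```python
import math

rows_a, rows_b, rows_c = 1_000_000, 100, 10_000

# Nested-loop joins without indexes: every tuple is compared with every tuple.
comparisons_ac = rows_a * rows_c        # A * C
rows_d_worst = comparisons_ac           # worst case: every pair survives
comparisons_db = rows_d_worst * rows_b  # D * B

def indexed_cost(outer_rows, inner_rows):
    """Scan the outer relation; each tuple does one index lookup in the inner one."""
    return outer_rows * math.log2(inner_rows)

small_outer = indexed_cost(rows_b, rows_a)  # scan B, probe an index on A
large_outer = indexed_cost(rows_a, rows_b)  # scan A, probe an index on B

print(f"{comparisons_ac:,} {comparisons_db:,}")
print(f"{small_outer:,.0f} vs {large_outer:,.0f}")
```

Scanning the small relation and probing the index of the large one needs far fewer comparisons than the opposite order, and both are far below the nested-loop figures.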

Performance Considerations

● The sequence of evaluation can be optimized by the database engine
- clever order, exploiting associativity and commutativity
- example: 100 * log(10,000,000) vs 10,000,000 * log(100)
- may not help in the worst case, but does in practically every other case