Joint EBI-Wellcome Trust Summer School, 14-18 June 2010
04/19/23 2
Concepts, historical milestones & the central place of bioinformatics in modern biology: a European perspective
Teresa K. Attwood, University of Manchester
Concepts, historical milestones & the central place of bioinformatics in modern biology: a personal perspective from a European
Overview
• Where the concept of bioinformatics originated
• Some key milestones & key people
• Its place in ‘the new biology’
Disclaimer
• Bear in mind that this is a personal view
• That it’s hard
  – to step out of a situation & look back in
    • & remain objective
  – to separate the European & American histories
• Observers from different perspectives will see & tell the story differently!
• So this is just my perspective…
  – & it’s bound up with sequences & dbs
Origin of bioinformatics
• The origins of bioinformatics are rooted in sequence analysis
• And driven by the desire to
  – collect sequences
  – annotate them
  – & analyse them
    • systematically (i.e., using computers)!
The concept ‘bioinformatics’ was barely known pre 1990…
Key milestones
[Timeline figure, 1950-2020. Labels: insulin; ribonuclease; Dayhoff Atlas; ARPAnet]
Margaret Dayhoff, 1925-1983
• Pioneered development of computer methods to compare protein sequences
  – & to derive evolutionary histories from alignments
• Particularly interested in deducing evolutionary connections from sequence evidence
Margaret Dayhoff
• Collected all the known protein sequences
  – made them available to the scientific community
• In 1965, she compiled a book
  – the 1st Atlas of Protein Sequence and Structure
Key milestones
[Timeline figure, 1950-2020. Labels added: 65 sequences; Auto protein sequencers; DNA sequencing; PDB; 7 structures; Auto DNA sequencing; Internet]
Data overload in the USA
Data overload in Europe
• The data overload problem had also been noticed in Europe
• The solution was to create the 1st nucleotide sequence database
  – this was the EMBL databank
    • this preceded the 1st release of GenBank by ~6 months
Key milestones
[Timeline figure, 1950-2020. Labels added: EMBL, GenBank; 568 sequences; PIR-PSD; 859 sequences]
Enter Amos Bairoch
• A crazy postgrad student in Switzerland
  – interested in space exploration & the search for ET life
• His project was to develop software to analyse protein & nucleotide sequences
  – PC/Gene
Amos Bairoch
• He published his 1st paper in 1982
• A letter to the BJ suggesting the use of checksums to “facilitate the detection of typographical & keyboard errors”
  – a true computer nerd!
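As a rough illustration of the checksum idea (not Bairoch's actual scheme, which these slides don't describe), here is a minimal Python sketch using the standard library's CRC-32. The function name `sequence_checksum` and the two example sequences are made up for illustration.

```python
import zlib


def sequence_checksum(sequence: str) -> int:
    """Return a CRC-32 checksum of a sequence, ignoring whitespace and case.

    CRC-32 is an assumption chosen for illustration; the slides do not say
    which checksum was proposed.
    """
    normalised = "".join(sequence.split()).upper()
    return zlib.crc32(normalised.encode("ascii"))


# Hypothetical example: a sequence as published, and a retyped copy with one typo.
published = "MKTAYIAKQR"
retyped = "MKTAYIAKQK"  # last residue mistyped

if sequence_checksum(published) != sequence_checksum(retyped):
    print("checksum mismatch: check the retyped sequence for errors")
```

If the checksum is published alongside the sequence, anyone who types the sequence in by hand can recompute it and see whether the copy matches.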
Amos Bairoch
• Why did he do this?
• In the process of developing PC/Gene, he typed in >1,000 protein sequences
  – some from the literature, most from the Atlas
    • by 1981, this was a large book & several supplements, & listed 1,660 proteins
    • it was not then available electronically
Amos Bairoch
• In 1983, he acquired a computer tape of the EMBL databank
  – this was version 2, with 811 sequences
• In 1984, he received the 1st available computer tape copy of the Atlas
  – (which quickly became the PIR-PSD)
  – but he was deeply unhappy with the PIR format
Amos Bairoch
• So he decided to convert the PIR database into the semi-structured format of EMBL
  – part manually & part automatically
  – the result was PIR+
  – it was distributed as part of PC/Gene (now commercial)
• In summer 1986, he decided to release the database independently of PC/Gene
  – so that it would be available to all, free of charge
Amos Bairoch
• The new database was called Swiss-Prot
• The 1st release was made on 21 July 1986
  – the exact number of entries is unknown, as he can’t find the original floppy disks!
Key milestones
[Timeline figure, 1950-2020. Labels added: PIR; DDBJ, Swiss-Prot; ~3,900 sequences; PROSITE; PRINTS; 58 entries; 30 entries]
Global data overload
• The number of sequences was growing
• The number of structures was growing
• So was the number of protein family signatures
• Two extraordinary developments had yet to take place
  – what were they?
Key milestones
[Timeline figure, 1950-2020. Labels added: www; FlyBase]
Key milestones
[Timeline figure, 1950-2020. Labels added: HT DNA sequencing; H. influenzae genome; M. jannaschii genome; S. cerevisiae genome; D. melanogaster genome; H. sapiens genome; C. elegans genome; Pfam; InterPro; 2,423 entries; TrEMBL; 70,000 sequences]
Original InterPro partners
[Figure labels: InterPro; Pfam; Profiles; ProDom; PRINTS; Prosite]
What is InterPro?
“InterPro is an integrated documentation resource for protein families, domains & sites. By uniting databases that use different methodologies & a varying degree of biological information, InterPro capitalises on their individual strengths, producing a powerful integrated database & diagnostic tool.”
The vision?
• Naïvely, we wanted to make life easier!
• We aimed to
  – simplify & rationalise protein family analysis
  – centralise & streamline the annotation process
    • & reduce manual annotation burdens
  – &, in the wake of all the genome projects, to facilitate automatic functional annotation of uncharacterised proteins
In fact (& now with 11 partners) we made life a lot harder! But that’s another story…
Key milestones
[Timeline figure, 1950-2020. Label added: UniProt]
Key milestones
[Timeline figure, 1950-2020. Labels added: 10,867,798 sequences; 185,231,366 sequences; ENA; 517,100 sequences]
Key milestones
[Timeline figure, 1950-2020. Labels added: hundreds more; billions more; hundreds more]
The central place of bioinformatics in modern biology
• Hopefully, this potted history speaks for itself
• In the last 30 years, bioinformatics has given us
  – the first ‘complete’ catalogues of DNA & protein sequences
    • including genomes & proteomes of organisms across the Tree of Life
  – software to analyse biological data on an unprecedented scale
  – & hence tools to help understand
    • more about evolutionary processes in general
    • our place on the Tree of Life in particular
    • &, ultimately, more about health & disease
• It isn’t a panacea, but its contribution has been huge
Recommended reading
A.B.Richon. A short history of bioinformatics (http://www.netsci.org/Science/Bioinform/feature06.html)
A.Bairoch (2000) Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times. Bioinformatics, 16(1), 48-64.
M.Ashburner (2006) Won for all – How the Drosophila genome was sequenced. Cold Spring Harbor Laboratory Press.
B.J.Strasser (2008) GenBank – Natural history in the 21st century? Science, 322, 537-538.