Comparative Bacterial Genomics Workshop, Centers for Disease Control, Atlanta, Georgia, USA 30 August, 2012
Comparative Bacterial Genomics
Exercises for Day 4 - core/pan genomes
Pimlapas Leekitcharoenphon (Shinny), 30 August 2012
http://www.cbs.dtu.dk/staff/dave/CDC_2012.php
Thursday, August 30, 2012
http://cge.cbs.dtu.dk/services/
National Food Institute, Technical University of Denmark
Protein homology in clonal strains (outbreak)
SNPs
Single Nucleotide Polymorphisms
• DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.
• SNPs can occur in both coding and non-coding regions of the genome.
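To make the definition concrete, here is a minimal Python sketch that lists the SNPs between two sequences that are already aligned to the same length. The function name `find_snps` and the toy sequences are invented for illustration; gaps and ambiguous bases are skipped by assumption, which a real SNP caller may handle differently.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each substitution."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    snps = []
    for pos, (ref_base, alt_base) in enumerate(
        zip(reference.upper(), sample.upper()), start=1
    ):
        # Only unambiguous single-base substitutions are counted;
        # gaps ('-') and ambiguity codes such as 'N' are ignored.
        if ref_base in bases and alt_base in bases and ref_base != alt_base:
            snps.append((pos, ref_base, alt_base))
    return snps


# Toy sequences, invented for illustration.
reference = "ATGCGTACGTTAGC"
sample = "ATGCGTACATTAGT"
print(find_snps(reference, sample))  # [(9, 'G', 'A'), (14, 'C', 'T')]
```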
SNP tree
• Download 6 genomes from the following link
• Choose Salmonella Typhimurium D23580 as the reference genome
• Construct a SNP tree
‣ http://cge.cbs.dtu.dk/services/snpTree/
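A SNP tree is built from differences between genomes relative to a shared reference. The sketch below is not the snpTree service's pipeline; it only illustrates, under simplifying assumptions, how pairwise SNP distances could be tabulated from sequences already aligned to the reference. The names `snp_distance` and `distance_matrix`, and the toy alignment, are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences carry different unambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in bases and b in bases and a != b
    )


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], int]:
    """Pairwise SNP distances for every pair of genomes in the alignment."""
    return {
        (name_a, name_b): snp_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# Toy alignment; the names and sequences are invented for illustration.
alignment = {
    "reference": "ATGCGTACGTTAGC",
    "isolate_1": "ATGCGTACATTAGC",
    "isolate_2": "ATGAGTACATTAGT",
}

for pair, distance in distance_matrix(alignment).items():
    print(pair, distance)
```

A distance table like this is the kind of input a distance-based tree method would take; isolates separated by fewer SNPs end up closer together in the tree.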
[Figure: Base Atlas of V. cholerae O1 biovar El Tor str. N16961, chromosome I (2,961,149 bp); position axis from 0M to 2.5M. Source: Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]
Tracks and colour-scale ranges:
• G Content: 0.18 to 0.30
• A Content: 0.20 to 0.32
• T Content: 0.21 to 0.32
• C Content: 0.17 to 0.30
• Annotations: CDS +, CDS -, rRNA, tRNA
• AT Skew: -0.04 to 0.04
• GC Skew: -0.08 to 0.08
• Percent AT: 0.46 to 0.59
Resolution: 1185
Workflow tool labels: genomeStatistics, rnammer
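The Base Atlas legend names Percent AT, AT Skew, and GC Skew tracks. As a rough illustration, the sketch below computes these quantities per window using the common definitions GC skew = (G - C) / (G + C) and AT skew = (A - T) / (A + T). That the atlas software uses exactly these formulas and this windowing is an assumption; `base_stats` and `sliding_windows` are hypothetical names.

```python
from typing import Iterator


def base_stats(window: str) -> dict[str, float]:
    """Percent AT, GC skew, and AT skew for one stretch of DNA."""
    w = window.upper()
    a, c, g, t = (w.count(base) for base in "ACGT")
    total = a + c + g + t
    if total == 0:
        raise ValueError("window contains no A, C, G, or T")
    return {
        "percent_at": (a + t) / total,
        # Skews are set to 0.0 when the relevant bases are absent.
        "gc_skew": (g - c) / (g + c) if (g + c) else 0.0,
        "at_skew": (a - t) / (a + t) if (a + t) else 0.0,
    }


def sliding_windows(
    sequence: str, size: int, step: int
) -> Iterator[tuple[int, dict[str, float]]]:
    """Yield (0-based start, stats) for each full window along the sequence."""
    for start in range(0, len(sequence) - size + 1, step):
        yield start, base_stats(sequence[start:start + size])


# Toy sequence, invented for illustration.
for start, stats in sliding_windows("GGGCAATTGCGCATAT", size=8, step=4):
    print(start, stats)
```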
[Figure: Pan and core genome plot. x-axis: genomes 1 to 46, numbered as in the key below; y-axis ticks: 0, 5000, 10000, 15000. Series: New genes, New gene families, Core genome, Pan genome.]
Genome key:
1 : Ecoli_042 2 : Ecoli_536 3 : Ecoli_55989 4 : Ecoli_ABU_83972 5 : Ecoli_APEC_O1 6 : Ecoli_ATCC_8739 7 : Ecoli_BL21_DE3_28965 8 : Ecoli_BL21_DE3_30681 9 : Ecoli_BW2952 10 : Ecoli_B_str_REL606 11 : Ecoli_DH1 12 : Ecoli_E24377A 13 : Ecoli_ED1a 14 : Ecoli_ETEC_H10407 15 : Ecoli_HS 16 : Ecoli_IAI1 17 : Ecoli_IAI39 18 : Ecoli_IHE3034 19 : Ecoli_KO11 20 : Ecoli_O103H2_str_12009 21 : Ecoli_O111H_str_11128 22 : Ecoli_O127H6_str_E2348_69 23 : Ecoli_O157H7_str_EDL933 24 : Ecoli_O157H7_str_TW14359 25 : Ecoli_O26H11_str_11368 26 : Ecoli_O55H7_str_CB9615 27 : Ecoli_O83H1_str_NRG_857C 28 : Ecoli_S88 29 : Ecoli_SE11 30 : Ecoli_SE15 31 : Ecoli_SMS35 32 : Ecoli_UM146 33 : Ecoli_UMN026 34 : Ecoli_UTI89 35 : Ecoli_W 36 : Ecoli_str_K12_substr_DH10B 37 : Ecoli_str_K12_substr_MG1655 38 : Ecoli_str_K12_substr_W3110 39 : Vatypica_ACS_049_V_Sch6 40 : Vatypica_ACS_134_V_Col7a 41 : Vdispar_ATCC_17748 42 : Vparvula_ATCC_17745 43 : Vparvula_DSM_2008 44 : Vsp_3_1_44 45 : Vsp_6_1_27 46 : Vsp_str_F0412
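The plot's four series can be understood as running tallies taken while genomes are added one at a time. The sketch below shows that bookkeeping on toy data, assuming each genome has already been reduced to a set of gene family identifiers; it is not the pancoreplot program, and the name `pan_core_curve` and the family IDs are hypothetical.

```python
def pan_core_curve(genomes: list[tuple[str, set[str]]]) -> list[dict]:
    """Track pan and core genome sizes as genomes are added in the given order."""
    pan: set[str] = set()
    core: set[str] | None = None
    rows = []
    for name, families in genomes:
        new_families = families - pan  # families not seen in any earlier genome
        pan |= families                # pan genome: union of all families so far
        # Core genome: families present in every genome so far.
        core = set(families) if core is None else core & families
        rows.append(
            {
                "genome": name,
                "new_families": len(new_families),
                "pan": len(pan),
                "core": len(core),
            }
        )
    return rows


# Toy input; genome names and family IDs are invented for illustration.
toy = [
    ("genome_A", {"f1", "f2", "f3", "f4"}),
    ("genome_B", {"f1", "f2", "f3", "f5"}),
    ("genome_C", {"f1", "f2", "f6"}),
]
for row in pan_core_curve(toy):
    print(row)
```

In this toy run the pan genome grows from 4 to 6 families while the core shrinks from 4 to 2, which is the general shape of the two curves: the pan genome can only grow and the core genome can only shrink as genomes are added.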
BLAST matrix
[Figure: BLAST matrix of 22 proteomes (Aliivibrio, Photobacterium, and Vibrio). Each cell shows a percentage and two counts. Colour scales: homology within proteomes, 1.8 % to 5.0 %; homology between proteomes, 25.5 % to 81.6 %.]

Proteomes, in matrix order:
1. Aliivibrio salmonicida LFI1238: 3,915 proteins, 3,378 families
2. Photobacterium profundum SS9: 5,480 proteins, 4,897 families
3. Vibrio fischeri ES114: 3,818 proteins, 3,691 families
4. Vibrio fischeri MJ11: 4,039 proteins, 3,894 families
5. Vibrio splendidus LGP32: 4,431 proteins, 4,277 families
6. Vibrio species MED222 1099517005441: 4,590 proteins, 4,463 families
7. Vibrio campbellii AND4 1103602000595: 3,935 proteins, 3,822 families
8. Vibrio species Ex25: 4,004 proteins, 3,886 families
9. Vibrio shilonii AK1 1103207002036: 5,360 proteins, 5,078 families
10. Vibrio vulnificus YJ016: 5,028 proteins, 4,773 families
11. Vibrio vulnificus CMCP6: 4,538 proteins, 4,337 families
12. Vibrio harveyi ATCC BAA-1116: 6,064 proteins, 5,116 families
13. Vibrio parahaemolyticus 16: 3,780 proteins, 3,683 families
14. Vibrio parahaemolyticus RIMD 2210633: 4,832 proteins, 4,662 families
15. Vibrio cholerae AM-19226: 3,407 proteins, 3,305 families
16. Vibrio cholerae 2740-80: 3,771 proteins, 3,567 families
17. Vibrio cholerae 1587: 3,758 proteins, 3,597 families
18. Vibrio cholerae MZO-2: 3,425 proteins, 3,311 families
19. Vibrio cholerae MO10: 3,421 proteins, 3,353 families
20. Vibrio cholerae 0395: 3,875 proteins, 3,665 families
21. Vibrio cholerae V52: 3,815 proteins, 3,599 families
22. Vibrio cholerae O1 biovar eltor str. N16961: 3,828 proteins, 3,665 families

Cell values by row. Each row starts with the proteome's own cell (homology within the proteome), followed by its cells against each later proteome in the order listed above:
Row 1: 3.3 % (111 / 3,378) | 28.3 % (1,980 / 6,989) | 55.5 % (2,683 / 4,838) | 52.4 % (2,666 / 5,085) | 34.9 % (2,114 / 6,065) | 33.1 % (2,074 / 6,269) | 30.3 % (1,795 / 5,923) | 30.5 % (1,813 / 5,939) | 26.7 % (1,916 / 7,168) | 30.5 % (2,050 / 6,715) | 32.6 % (2,040 / 6,250) | 28.3 % (2,095 / 7,406) | 32.3 % (1,842 / 5,705) | 31.9 % (2,074 / 6,494) | 33.6 % (1,805 / 5,377) | 30.2 % (1,747 / 5,786) | 29.9 % (1,736 / 5,802) | 31.9 % (1,743 / 5,469) | 34.4 % (1,846 / 5,360) | 32.5 % (1,873 / 5,769) | 30.6 % (1,777 / 5,804) | 32.1 % (1,846 / 5,747)
Row 2: 5.0 % (243 / 4,897) | 30.3 % (2,110 / 6,968) | 29.7 % (2,127 / 7,169) | 29.5 % (2,198 / 7,456) | 28.1 % (2,155 / 7,667) | 25.5 % (1,872 / 7,339) | 28.0 % (2,022 / 7,222) | 25.9 % (2,170 / 8,370) | 27.8 % (2,222 / 7,979) | 29.4 % (2,212 / 7,534) | 26.1 % (2,254 / 8,624) | 27.9 % (1,972 / 7,061) | 29.6 % (2,295 / 7,753) | 28.1 % (1,904 / 6,782) | 25.7 % (1,850 / 7,198) | 25.6 % (1,841 / 7,205) | 26.9 % (1,851 / 6,869) | 28.7 % (1,944 / 6,766) | 27.5 % (1,971 / 7,179) | 26.3 % (1,893 / 7,208) | 27.2 % (1,946 / 7,165)
Row 3: 2.6 % (96 / 3,691) | 75.0 % (3,261 / 4,346) | 38.7 % (2,246 / 5,808) | 36.6 % (2,201 / 6,016) | 33.6 % (1,915 / 5,695) | 34.5 % (1,963 / 5,692) | 30.4 % (2,085 / 6,866) | 34.2 % (2,205 / 6,448) | 36.3 % (2,179 / 6,005) | 29.6 % (2,214 / 7,478) | 36.2 % (1,976 / 5,464) | 35.9 % (2,233 / 6,219) | 36.7 % (1,906 / 5,192) | 32.8 % (1,843 / 5,611) | 33.0 % (1,848 / 5,596) | 34.9 % (1,843 / 5,282) | 37.7 % (1,947 / 5,165) | 35.3 % (1,972 / 5,581) | 33.6 % (1,884 / 5,612) | 35.0 % (1,949 / 5,561)
Row 4: 2.9 % (112 / 3,894) | 38.1 % (2,277 / 5,979) | 35.7 % (2,219 / 6,213) | 32.5 % (1,919 / 5,903) | 33.9 % (1,991 / 5,874) | 29.4 % (2,083 / 7,082) | 33.1 % (2,209 / 6,672) | 35.3 % (2,191 / 6,211) | 29.3 % (2,244 / 7,665) | 34.5 % (1,965 / 5,696) | 35.5 % (2,270 / 6,400) | 35.6 % (1,922 / 5,398) | 31.9 % (1,857 / 5,817) | 32.1 % (1,861 / 5,806) | 34.2 % (1,872 / 5,473) | 36.6 % (1,964 / 5,371) | 34.2 % (1,983 / 5,797) | 32.5 % (1,896 / 5,827) | 34.0 % (1,963 / 5,771)
Row 5: 2.8 % (118 / 4,277) | 72.3 % (3,688 / 5,101) | 38.6 % (2,289 / 5,931) | 42.3 % (2,451 / 5,795) | 36.7 % (2,562 / 6,982) | 40.8 % (2,680 / 6,565) | 43.7 % (2,670 / 6,112) | 36.7 % (2,759 / 7,516) | 45.4 % (2,507 / 5,523) | 43.9 % (2,762 / 6,293) | 41.8 % (2,264 / 5,418) | 38.0 % (2,213 / 5,823) | 37.9 % (2,209 / 5,822) | 39.9 % (2,202 / 5,514) | 42.9 % (2,314 / 5,388) | 40.4 % (2,345 / 5,808) | 38.6 % (2,251 / 5,839) | 40.3 % (2,326 / 5,771)
Row 6: 2.3 % (103 / 4,463) | 36.9 % (2,259 / 6,124) | 40.2 % (2,413 / 5,999) | 36.5 % (2,593 / 7,105) | 39.7 % (2,672 / 6,728) | 41.9 % (2,637 / 6,301) | 34.6 % (2,682 / 7,762) | 43.7 % (2,492 / 5,705) | 41.4 % (2,698 / 6,523) | 39.9 % (2,238 / 5,609) | 36.9 % (2,208 / 5,989) | 36.3 % (2,186 / 6,014) | 38.0 % (2,171 / 5,707) | 40.6 % (2,270 / 5,592) | 38.5 % (2,311 / 6,004) | 37.0 % (2,227 / 6,026) | 38.4 % (2,291 / 5,971)
Row 7: 2.3 % (88 / 3,822) | 46.2 % (2,452 / 5,307) | 30.9 % (2,144 / 6,948) | 37.5 % (2,396 / 6,387) | 39.9 % (2,372 / 5,942) | 45.0 % (3,018 / 6,702) | 37.8 % (2,081 / 5,503) | 47.0 % (2,741 / 5,827) | 38.1 % (1,994 / 5,228) | 34.4 % (1,944 / 5,645) | 34.8 % (1,952 / 5,617) | 36.4 % (1,935 / 5,317) | 38.7 % (2,021 / 5,225) | 36.4 % (2,055 / 5,647) | 34.7 % (1,968 / 5,677) | 35.8 % (2,018 / 5,637)
Row 8: 2.7 % (103 / 3,886) | 34.5 % (2,335 / 6,762) | 43.2 % (2,655 / 6,143) | 46.1 % (2,626 / 5,697) | 43.4 % (2,981 / 6,875) | 45.0 % (2,357 / 5,232) | 64.9 % (3,385 / 5,213) | 41.6 % (2,134 / 5,135) | 38.2 % (2,104 / 5,504) | 37.2 % (2,064 / 5,548) | 39.1 % (2,048 / 5,244) | 41.6 % (2,140 / 5,139) | 38.8 % (2,162 / 5,566) | 37.9 % (2,110 / 5,560) | 38.7 % (2,143 / 5,536)
Row 9: 3.9 % (200 / 5,078) | 33.0 % (2,516 / 7,615) | 34.4 % (2,472 / 7,184) | 30.1 % (2,581 / 8,574) | 34.3 % (2,276 / 6,634) | 35.2 % (2,581 / 7,333) | 32.4 % (2,098 / 6,481) | 30.3 % (2,079 / 6,856) | 29.6 % (2,044 / 6,898) | 31.2 % (2,045 / 6,565) | 33.0 % (2,137 / 6,467) | 31.5 % (2,169 / 6,884) | 30.4 % (2,098 / 6,893) | 31.2 % (2,143 / 6,862)
Row 10: 3.1 % (150 / 4,773) | 67.5 % (3,741 / 5,540) | 37.0 % (2,900 / 7,832) | 43.2 % (2,597 / 6,013) | 46.4 % (3,042 / 6,550) | 43.0 % (2,483 / 5,781) | 39.4 % (2,432 / 6,172) | 39.1 % (2,418 / 6,182) | 40.1 % (2,373 / 5,919) | 44.1 % (2,533 / 5,743) | 41.9 % (2,575 / 6,151) | 40.0 % (2,473 / 6,185) | 41.7 % (2,552 / 6,116)
Row 11: 2.8 % (121 / 4,337) | 38.7 % (2,880 / 7,439) | 47.2 % (2,608 / 5,524) | 48.9 % (2,994 / 6,128) | 46.3 % (2,464 / 5,326) | 42.2 % (2,409 / 5,711) | 41.3 % (2,372 / 5,746) | 43.5 % (2,367 / 5,437) | 47.1 % (2,503 / 5,310) | 44.5 % (2,539 / 5,707) | 42.8 % (2,449 / 5,718) | 44.3 % (2,515 / 5,683)
Row 12: 3.9 % (202 / 5,116) | 34.9 % (2,496 / 7,160) | 46.4 % (3,371 / 7,266) | 33.3 % (2,327 / 6,984) | 31.0 % (2,282 / 7,362) | 30.7 % (2,271 / 7,389) | 32.1 % (2,268 / 7,062) | 34.3 % (2,377 / 6,932) | 33.1 % (2,415 / 7,299) | 31.7 % (2,323 / 7,337) | 32.5 % (2,385 / 7,336)
Row 13: 2.1 % (79 / 3,683) | 43.5 % (2,547 / 5,858) | 46.0 % (2,220 / 4,821) | 41.1 % (2,153 / 5,242) | 41.1 % (2,152 / 5,239) | 42.7 % (2,113 / 4,953) | 45.9 % (2,223 / 4,842) | 42.3 % (2,236 / 5,283) | 41.3 % (2,181 / 5,277) | 42.2 % (2,215 / 5,254)
Row 14: 3.2 % (147 / 4,662) | 42.3 % (2,399 / 5,675) | 37.9 % (2,313 / 6,099) | 38.1 % (2,320 / 6,091) | 39.7 % (2,303 / 5,796) | 42.4 % (2,408 / 5,683) | 40.0 % (2,440 / 6,094) | 38.4 % (2,348 / 6,120) | 40.0 % (2,421 / 6,055)
Row 15: 2.5 % (84 / 3,305) | 68.5 % (2,844 / 4,150) | 70.4 % (2,886 / 4,098) | 73.1 % (2,818 / 3,854) | 81.0 % (2,989 / 3,688) | 72.2 % (2,986 / 4,136) | 68.5 % (2,869 / 4,191) | 70.4 % (2,922 / 4,153)
Row 16: 3.5 % (125 / 3,567) | 64.5 % (2,847 / 4,414) | 68.3 % (2,820 / 4,126) | 74.3 % (2,987 / 4,018) | 81.6 % (3,264 / 4,000) | 77.5 % (3,153 / 4,066) | 76.9 % (3,165 / 4,117)
Row 17: 2.8 % (99 / 3,597) | 67.8 % (2,806 / 4,137) | 67.6 % (2,836 / 4,195) | 67.4 % (2,983 / 4,424) | 65.0 % (2,880 / 4,434) | 64.6 % (2,888 / 4,474)
Row 18: 2.2 % (73 / 3,311) | 71.5 % (2,801 / 3,915) | 69.7 % (2,916 / 4,183) | 69.0 % (2,860 / 4,145) | 68.7 % (2,874 / 4,181)
Row 19: 1.8 % (59 / 3,353) | 80.2 % (3,169 / 3,953) | 75.1 % (3,024 / 4,028) | 79.6 % (3,139 / 3,944)
Row 20: 4.3 % (157 / 3,665) | 80.2 % (3,271 / 4,079) | 80.4 % (3,303 / 4,109)
Row 21: 3.3 % (120 / 3,599) | 77.1 % (3,186 / 4,134)
Row 22: 3.0 % (110 / 3,665)
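Each BLAST matrix cell pairs a percentage with two counts, for example "55.5 %" with "2,683 / 4,838". The short Python check below confirms that the printed percentage equals the first count divided by the second. Reading the first count as shared protein families and the second as the total for the pair is an assumption, and `cell_percent` is a hypothetical helper name.

```python
def cell_percent(first_count: int, second_count: int) -> float:
    """Percentage shown in a matrix cell, rounded to one decimal place."""
    if second_count <= 0:
        raise ValueError("second count must be positive")
    return round(100 * first_count / second_count, 1)


# (first count, second count, percentage printed on the slide)
cells = [
    (2683, 4838, 55.5),  # between two proteomes
    (3264, 4000, 81.6),  # highest between-proteome value in the legend
    (1872, 7339, 25.5),  # lowest between-proteome value in the legend
    (243, 4897, 5.0),    # highest within-proteome value in the legend
    (59, 3353, 1.8),     # lowest within-proteome value in the legend
]

for first, second, printed in cells:
    print(f"{first:>5} / {second:>5} = {cell_percent(first, second):.1f} % (slide: {printed} %)")
```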
Workflow tool labels: grep, ls -1, gawk, pancoreplot, makebmdest, blastmatrix, saco_extract, saco_convert, Prodigal
Workflow step: Copy and download GenBank and DNA files
Book page excerpt (page 4, Chapter 1, Sequences as Biological Information):
organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.
From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of
Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance
[Figure 1.1 labels: Bacteria, Archaea, Eucarya; unicellular eukaryotes, protozoans, macro-organisms, animals, plants; Flavobacterium, Crenarchaeota, Euryarchaeota, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Giardia, Saccharomyces, Trypanosoma, slime mold, Babesia, Aquificae, Thermotoga, Thermus, Deinococcus, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes]
Workflow diagram labels:
• Days: Monday, Tuesday, Wednesday, Thursday
• Steps: Raw DNA sequence; Published annotated genes/proteins; Examine GenBank files; Genefinding, genes/proteins; Basic genome statistics; Number of genes/proteins; Subset specific gene counts; Amino acid and codon usage; locate rRNA sequences; 16S rRNA phylogenetic tree; Genome atlas; Pan and core genome plot
• Tools: njplot, extractseqs, clustalw, genomeAtlas, sed, chmod, genewiz, mousepad, basicgenomeanalysis
• Information table for all genomes. Add information to this table as you do the exercises.
BLAST atlasComparative Bacterial Genomics Workshop, Centers for Disease Control, Atlanta, Georgia, USA 27 August, 2012 13
Work flow:
1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 42 45
05
00
01
00
00
15
00
0
New genes
New gene families
Core genome
Pan genome
1 : Ecoli_042 2 : Ecoli_536 3 : Ecoli_55989 4 : Ecoli_ABU_83972 5 : Ecoli_APEC_O1 6 : Ecoli_ATCC_8739 7 : Ecoli_BL21_DE3_28965 8 : Ecoli_BL21_DE3_30681 9 : Ecoli_BW2952 10 : Ecoli_B_str_REL606 11 : Ecoli_DH1 12 : Ecoli_E24377A 13 : Ecoli_ED1a 14 : Ecoli_ETEC_H10407 15 : Ecoli_HS 16 : Ecoli_IAI1 17 : Ecoli_IAI39 18 : Ecoli_IHE3034 19 : Ecoli_KO11 20 : Ecoli_O103H2_str_12009 21 : Ecoli_O111H_str_11128 22 : Ecoli_O127H6_str_E2348_69 23 : Ecoli_O157H7_str_EDL933 24 : Ecoli_O157H7_str_TW14359 25 : Ecoli_O26H11_str_11368 26 : Ecoli_O55H7_str_CB9615 27 : Ecoli_O83H1_str_NRG_857C 28 : Ecoli_S88 29 : Ecoli_SE11 30 : Ecoli_SE15 31 : Ecoli_SMS35 32 : Ecoli_UM146 33 : Ecoli_UMN026 34 : Ecoli_UTI89 35 : Ecoli_W 36 : Ecoli_str_K12_substr_DH10B 37 : Ecoli_str_K12_substr_MG1655 38 : Ecoli_str_K12_substr_W3110 39 : Vatypica_ACS_049_V_Sch6 40 : Vatypica_ACS_134_V_Col7a 41 : Vdispar_ATCC_17748 42 : Vparvula_ATCC_17745 43 : Vparvula_DSM_2008 44 : Vsp_3_1_44 45 : Vsp_6_1_27 46 : Vsp_str_F0412
Pan and core
genome plot
3.3 %111 / 3,378
28.3 %1,980 / 6,989
55.5 %2,683 / 4,838
52.4 %2,666 / 5,085
34.9 %2,114 / 6,065
33.1 %2,074 / 6,269
30.3 %1,795 / 5,923
30.5 %1,813 / 5,939
26.7 %1,916 / 7,168
30.5 %2,050 / 6,715
32.6 %2,040 / 6,250
28.3 %2,095 / 7,406
32.3 %1,842 / 5,705
31.9 %2,074 / 6,494
33.6 %1,805 / 5,377
30.2 %1,747 / 5,786
29.9 %1,736 / 5,802
31.9 %1,743 / 5,469
34.4 %1,846 / 5,360
32.5 %1,873 / 5,769
30.6 %1,777 / 5,804
32.1 %1,846 / 5,747
5.0 %243 / 4,897
30.3 %2,110 / 6,968
29.7 %2,127 / 7,169
29.5 %2,198 / 7,456
28.1 %2,155 / 7,667
25.5 %1,872 / 7,339
28.0 %2,022 / 7,222
25.9 %2,170 / 8,370
27.8 %2,222 / 7,979
29.4 %2,212 / 7,534
26.1 %2,254 / 8,624
27.9 %1,972 / 7,061
29.6 %2,295 / 7,753
28.1 %1,904 / 6,782
25.7 %1,850 / 7,198
25.6 %1,841 / 7,205
26.9 %1,851 / 6,869
28.7 %1,944 / 6,766
27.5 %1,971 / 7,179
26.3 %1,893 / 7,208
27.2 %1,946 / 7,165
2.6 %96 / 3,691
75.0 %3,261 / 4,346
38.7 %2,246 / 5,808
36.6 %2,201 / 6,016
33.6 %1,915 / 5,695
34.5 %1,963 / 5,692
30.4 %2,085 / 6,866
34.2 %2,205 / 6,448
36.3 %2,179 / 6,005
29.6 %2,214 / 7,478
36.2 %1,976 / 5,464
35.9 %2,233 / 6,219
36.7 %1,906 / 5,192
32.8 %1,843 / 5,611
33.0 %1,848 / 5,596
34.9 %1,843 / 5,282
37.7 %1,947 / 5,165
35.3 %1,972 / 5,581
33.6 %1,884 / 5,612
35.0 %1,949 / 5,561
2.9 %112 / 3,894
38.1 %2,277 / 5,979
35.7 %2,219 / 6,213
32.5 %1,919 / 5,903
33.9 %1,991 / 5,874
29.4 %2,083 / 7,082
33.1 %2,209 / 6,672
35.3 %2,191 / 6,211
29.3 %2,244 / 7,665
34.5 %1,965 / 5,696
35.5 %2,270 / 6,400
35.6 %1,922 / 5,398
31.9 %1,857 / 5,817
32.1 %1,861 / 5,806
34.2 %1,872 / 5,473
36.6 %1,964 / 5,371
34.2 %1,983 / 5,797
32.5 %1,896 / 5,827
34.0 %1,963 / 5,771
2.8 %118 / 4,277
72.3 %3,688 / 5,101
38.6 %2,289 / 5,931
42.3 %2,451 / 5,795
36.7 %2,562 / 6,982
40.8 %2,680 / 6,565
43.7 %2,670 / 6,112
36.7 %2,759 / 7,516
45.4 %2,507 / 5,523
43.9 %2,762 / 6,293
41.8 %2,264 / 5,418
38.0 %2,213 / 5,823
37.9 %2,209 / 5,822
39.9 %2,202 / 5,514
42.9 %2,314 / 5,388
40.4 %2,345 / 5,808
38.6 %2,251 / 5,839
40.3 %2,326 / 5,771
2.3 %103 / 4,463
36.9 %2,259 / 6,124
40.2 %2,413 / 5,999
36.5 %2,593 / 7,105
39.7 %2,672 / 6,728
41.9 %2,637 / 6,301
34.6 %2,682 / 7,762
43.7 %2,492 / 5,705
41.4 %2,698 / 6,523
39.9 %2,238 / 5,609
36.9 %2,208 / 5,989
36.3 %2,186 / 6,014
38.0 %2,171 / 5,707
40.6 %2,270 / 5,592
38.5 %2,311 / 6,004
37.0 %2,227 / 6,026
38.4 %2,291 / 5,971
2.3 %88 / 3,822
46.2 %2,452 / 5,307
30.9 %2,144 / 6,948
37.5 %2,396 / 6,387
39.9 %2,372 / 5,942
45.0 %3,018 / 6,702
37.8 %2,081 / 5,503
47.0 %2,741 / 5,827
38.1 %1,994 / 5,228
34.4 %1,944 / 5,645
34.8 %1,952 / 5,617
36.4 %1,935 / 5,317
38.7 %2,021 / 5,225
36.4 %2,055 / 5,647
34.7 %1,968 / 5,677
35.8 %2,018 / 5,637
2.7 %103 / 3,886
34.5 %2,335 / 6,762
43.2 %2,655 / 6,143
46.1 %2,626 / 5,697
43.4 %2,981 / 6,875
45.0 %2,357 / 5,232
64.9 %3,385 / 5,213
41.6 %2,134 / 5,135
38.2 %2,104 / 5,504
37.2 %2,064 / 5,548
39.1 %2,048 / 5,244
41.6 %2,140 / 5,139
38.8 %2,162 / 5,566
37.9 %2,110 / 5,560
38.7 %2,143 / 5,536
3.9 %200 / 5,078
33.0 %2,516 / 7,615
34.4 %2,472 / 7,184
30.1 %2,581 / 8,574
34.3 %2,276 / 6,634
35.2 %2,581 / 7,333
32.4 %2,098 / 6,481
30.3 %2,079 / 6,856
29.6 %2,044 / 6,898
31.2 %2,045 / 6,565
33.0 %2,137 / 6,467
31.5 %2,169 / 6,884
30.4 %2,098 / 6,893
31.2 %2,143 / 6,862
3.1 %150 / 4,773
67.5 %3,741 / 5,540
37.0 %2,900 / 7,832
43.2 %2,597 / 6,013
46.4 %3,042 / 6,550
43.0 %2,483 / 5,781
39.4 %2,432 / 6,172
39.1 %2,418 / 6,182
40.1 %2,373 / 5,919
44.1 %2,533 / 5,743
41.9 %2,575 / 6,151
40.0 %2,473 / 6,185
41.7 %2,552 / 6,116
2.8 %121 / 4,337
38.7 %2,880 / 7,439
47.2 %2,608 / 5,524
48.9 %2,994 / 6,128
46.3 %2,464 / 5,326
42.2 %2,409 / 5,711
41.3 %2,372 / 5,746
43.5 %2,367 / 5,437
47.1 %2,503 / 5,310
44.5 %2,539 / 5,707
42.8 %2,449 / 5,718
44.3 %2,515 / 5,683
3.9 %202 / 5,116
34.9 %2,496 / 7,160
46.4 %3,371 / 7,266
33.3 %2,327 / 6,984
31.0 %2,282 / 7,362
30.7 %2,271 / 7,389
32.1 %2,268 / 7,062
34.3 %2,377 / 6,932
33.1 %2,415 / 7,299
31.7 %2,323 / 7,337
32.5 %2,385 / 7,336
2.1 %79 / 3,683
43.5 %2,547 / 5,858
46.0 %2,220 / 4,821
41.1 %2,153 / 5,242
41.1 %2,152 / 5,239
42.7 %2,113 / 4,953
[Figure: BLAST matrix of pairwise proteome comparisons. Each cell gives the percent homology and the number of shared families out of the total. Color scales: homology within proteomes, 5.0 % to 1.8 %; homology between proteomes, 81.6 % to 25.5 %.]
Genomes in the matrix:
• Aliivibrio salmonicida LFI1238: 3,915 proteins, 3,378 families
• Photobacterium profundum SS9: 5,480 proteins, 4,897 families
• Vibrio fischeri ES114: 3,818 proteins, 3,691 families
• Vibrio fischeri MJ11: 4,039 proteins, 3,894 families
• Vibrio splendidus LGP32: 4,431 proteins, 4,277 families
• Vibrio species MED222 1099517005441: 4,590 proteins, 4,463 families
• Vibrio campbellii AND4 1103602000595: 3,935 proteins, 3,822 families
• Vibrio species Ex25: 4,004 proteins, 3,886 families
• Vibrio shilonii AK1 1103207002036: 5,360 proteins, 5,078 families
• Vibrio vulnificus YJ016: 5,028 proteins, 4,773 families
• Vibrio vulnificus CMCP6: 4,538 proteins, 4,337 families
• Vibrio harveyi ATCC BAA-1116: 6,064 proteins, 5,116 families
• Vibrio parahaemolyticus 16: 3,780 proteins, 3,683 families
• Vibrio parahaemolyticus RIMD 2210633: 4,832 proteins, 4,662 families
• Vibrio cholerae AM-19226: 3,407 proteins, 3,305 families
• Vibrio cholerae 2740-80: 3,771 proteins, 3,567 families
• Vibrio cholerae 1587: 3,758 proteins, 3,597 families
• Vibrio cholerae MZO-2: 3,425 proteins, 3,311 families
• Vibrio cholerae MO10: 3,421 proteins, 3,353 families
• Vibrio cholerae 0395: 3,875 proteins, 3,665 families
• Vibrio cholerae V52: 3,815 proteins, 3,599 families
• Vibrio cholerae O1 biovar eltor str. N16961: 3,828 proteins, 3,665 families
[Figure: Genome atlas (BASE ATLAS) of V. cholerae O1 biovar El Tor str. N16961 chromosome I, 2,961,149 bp. Tracks: G content (0.18 to 0.30), A content (0.20 to 0.32), T content (0.21 to 0.32), C content (0.17 to 0.30), annotations (CDS +, CDS -, rRNA, tRNA), AT skew (-0.04 to 0.04), GC skew (-0.08 to 0.08), percent AT (0.46 to 0.59). Resolution: 1185. Source: http://www.cbs.dtu.dk/]
4 1 Sequences as Biological Information
organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.
From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of
Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance
[Figure: 16S rRNA phylogenetic tree. Labels under BACTERIA: Flavobacterium, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Aquificae, Thermotoga, Thermus, Deinococcus, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes. Labels under ARCHAEA: Crenarchaeota, Euryarchaeota. Labels under EUCARYA: unicellular eukaryotes (Protozoans, Giardia, Saccharomyces, Trypanosoma, Slime mold, Babesia) and macro-organisms (Animals, Plants).]
[Figure: Amino acid usage plot for CP001139. Axis: percentage, 1.01 to 10.20. Amino acids shown: G, A, V, L, I, F, Y, W, H, K, R, D, E, N, Q, S, T, M, C, P.]
[Figure: Workflow diagram of the workshop exercises.
STEP 1: List of genomes, NCBI GenBank id numbers, GPID.
STEP 2: Download genomes in the form of GenBank files.
Data and results: raw DNA sequence; published annotated genes/proteins; genefinding, local annotation of genes/proteins; number of genes/proteins; basic genome statistics; amino acid and codon usage; locate rRNA sequences.
Programs and commands: getgbk, saco_extract, saco_convert, grep, sed, chmod, ls -1, gawk, extractname, extractseqs, aminoacidUsagePlot, genomeAtlas, genewiz, genomeStatistics, basicgenomeanalysis, prodigalrunner, rnammer, clustalw, njplot, pancoreplot, makebmdest, blastmatrix.]
[Figure: Genome atlas of S. Typhi str. Ty2, 4,791,961 bp (scale 0M to 4.5M), with the labeled regions SPI-1, SPI-2, SPI-3, SPI-4, SPI-5, SPI-6, SPI-7, SPI-9, SPI-10, SPI-11 and SPI-12.]
[Figure: Pan genomic dendrogram. Axis: relative Manhattan distance, 0.15 to 0.00. Leaf labels, in the order shown: S.arizonae serovar 62:z4,z23: str. RSK2980; S.Montevideo str. MB110209 0055; S.Montevideo str. OH_2009072675; S.Montevideo str. 556152; S.Montevideo str. MB102109 0047; S.Montevideo str. IA_2010008284; S.Montevideo str. NC_MB110209 0054; S.Montevideo str. IA_2010008283; S.Montevideo str. 366867; S.Montevideo str. 556150 1; S.Montevideo str. MB101509 0077; S.Montevideo str. IA_2010008282; S.Montevideo str. 495297 1; S.Montevideo str. 446600; S.Montevideo str. 413180; S.Montevideo str. 19N; S.Montevideo str. 609460; S.Montevideo str. IA_2010008287; S.Montevideo str. IA_2009159199; S.Montevideo str. CASC_09SCPH15965; S.Montevideo str. 81038 01; S.Montevideo str. 609458 1; S.Montevideo str. MB111609 0052; S.Montevideo str. 2009085258; S.Montevideo str. MD_MDA09249507; S.Montevideo str. 515920 2; S.Montevideo str. 315996572; S.Montevideo str. 515920 1; S.Montevideo str. 315731156; S.Montevideo str. IA_2010008285; S.Montevideo str. 2009083312; S.Montevideo str. 495297 3; S.Montevideo str. 507440 20; S.Montevideo str. 414877; S.Montevideo str. 495297 4; S.Javiana str. GA_MM04042433; S.Montevideo str. 531954; S.Schwarzengrund str. CVM19633; S.Schwarzengrund str. SL480; S.Paratyphi A str. AKU_12601; S.Paratyphi A str. ATCC 9150; S.Typhi str. Ty2; S.Typhi str. CT18; S.Weltevreden str. HI_N05 537; S.Saintpaul str. SARA29; S.Tennessee str. CDC07 0191; S.Kentucky str. CDC 191; S.Kentucky str. CVM29188; S.Virchow str. SL491; S.Agona str. SL483; S.Paratyphi C str. RKS4594; S.Choleraesuis str. A50; S.Choleraesuis str. SC B67; S.Dublin str. 3246; S.Dublin str. CT_02021853; S.Enteritidis str. P125109; S.Gallinarum str. 287/91; S.Gallinarum str. 9; S.Paratyphi B str. SPB7; S.Heidelberg str. SL476; S.Heidelberg str. SL486; S.4,[5],12:i: str. CVM23701; S.Typhimurium str. D23580; S.Typhimurium str. TN061786; S.Typhimurium str. LT2; S.Typhimurium str. 4/74; S.Typhimurium str. SL1344; S.Saintpaul str. SARA23; S.Typhimurium str. DT104; S.Typhimurium str. 14028S; S.Hadar str. RI_05P066; S.Newport str. SL317; S.Newport str. SL254.]
Pan-genome family tree
Output of the day:
• core/pan genomes plot
• pan-genome family tree
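As a rough illustration of what the core/pan genomes plot summarizes: the pan genome grows as each genome adds new gene families, while the core genome shrinks to the families shared by all genomes so far. The sketch below is not how pancoreplot works internally. It assumes a hypothetical tab-separated input (genome number, family ID) in which the families have already been assigned, and the function name pan_core is made up for this example.

```shell
#!/bin/sh
# pan_core: reads "genome_number<TAB>family_id" lines on stdin (hypothetical
# input format) and prints, for k = 1..N genomes:
#   k <TAB> pan genome size <TAB> core genome size
pan_core() {
    awk -F '\t' '
        NF >= 2 {
            g = $1 + 0
            if (g > n) n = g          # highest genome number seen
            present[g, $2] = 1        # family $2 occurs in genome g
            fams[$2] = 1
        }
        END {
            for (f in fams) {
                # first genome in which the family appears
                for (k = 1; k <= n; k++) if ((k, f) in present) break
                first[f] = k
                # how many leading genomes (1..k) all contain the family
                for (k = 1; k <= n; k++) if (!((k, f) in present)) break
                run[f] = k - 1
            }
            for (k = 1; k <= n; k++) {
                pan = 0; core = 0
                for (f in fams) {
                    if (first[f] <= k) pan++    # seen in at least one of 1..k
                    if (run[f] >= k) core++     # seen in every one of 1..k
                }
                printf "%d\t%d\t%d\n", k, pan, core
            }
        }
    '
}
```

For example, piping a small table into pan_core prints one line per added genome, which corresponds to the two curves of the plot (pan genome rising, core genome falling).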
Pan-core genome plot command lines:
• Make a new folder for the plot
• Copy protein files to the new folder using cp
• Enter the new directory
• Create an input file for pancoreplot program
• Construct pan-core genome plot
‣ mkdir panCorePlot
‣ cp <name>_prodigal.orf.fsa panCorePlot
‣ cd panCorePlot
‣ ls -1 *orf.fsa | gawk '{print $1 "\t" $1}' > pancore.list
‣ pancoreplot -keep blastOutPut pancore.list > pancoreplot.ps
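The list-building step can be tried on its own before running pancoreplot. The sketch below wraps it in a helper with a made-up name (make_pancore_list). It uses plain awk instead of gawk, on the assumption that either is available, and it uses plain ASCII quotes, which the shell requires. It writes two tab-separated columns, both set to the file name, exactly as in the command above. File names containing spaces are not handled.

```shell
#!/bin/sh
# make_pancore_list DIR
# Writes DIR/pancore.list with one "file<TAB>file" line per *orf.fsa file
# found in DIR. Returns non-zero if no such file exists.
make_pancore_list() {
    dir=$1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$dir" && ls -1 *orf.fsa | awk '{print $1 "\t" $1}' > pancore.list )
    # An empty list means the glob matched nothing.
    [ -s "$dir/pancore.list" ] || {
        echo "no *orf.fsa files found in $dir" >&2
        return 1
    }
}
```

Running make_pancore_list panCorePlot and then inspecting panCorePlot/pancore.list with cat is a quick way to confirm that every genome is listed once before starting the BLAST comparisons.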
Extract genes from pan-core genome plot
• Extract all the core genes
• Extract all pan-genome genes
• Extract the core genes of genomes 1,2,3,5,6,7
• Extract the core genes of genomes 1,3,4 and 5 which are not present in any of the genomes from 6 to the last genome
‣ specificGenes -i 1: <blastOutPutFolder> > <output>.fsa
‣ specificGenes -u 1: <blastOutPutFolder> > <output>.fsa
‣ specificGenes -i 1:3,5:7 <blastOutPutFolder> > <output>.fsa
‣ specificGenes -i 1,3:5 -c 6: <blastOutPutFolder> > <output>.fsa
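The genome selections follow a compact pattern: commas separate items, a colon gives a range, and a range with nothing after the colon runs to the last genome. The sketch below spells that reading out. It is inferred only from the examples on this slide, it is not the parser used by specificGenes, and the helper name expand_spec is hypothetical.

```shell
#!/bin/sh
# expand_spec SPEC TOTAL
# Expands a selection such as "1:3,5:7" or "6:" into explicit genome numbers.
# TOTAL is the number of genomes, used when a range has no end ("6:").
# The body runs in a subshell so the changed IFS does not leak to the caller.
expand_spec() (
    spec=$1
    total=$2
    out=""
    IFS=,
    for part in $spec; do
        case $part in
            *:*)
                start=${part%%:*}          # text before the colon
                end=${part#*:}             # text after the colon
                [ -n "$end" ] || end=$total
                ;;
            *)
                start=$part
                end=$part
                ;;
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            out="$out $i"
            i=$((i + 1))
        done
    done
    echo "${out# }"
)
```

With this reading, expand_spec 1:3,5:7 10 lists genomes 1 2 3 5 6 7, which matches the third bullet, and expand_spec 6: 10 lists genomes 6 through 10, matching "from 6 to the last genome".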
Pan-genome family tree:
• Copy tree.pl from Download directory to /usr/biotools
• Make program executable
• Enter the folder where you saved all the BLAST results from the pan/core plot
• Construct pan-genome family tree
‣ cp tree.pl /usr/biotools
‣ chmod +x /usr/biotools/tree.pl
‣ cd panCorePlot
‣ tree.pl -m <shell or cloud> <blastOutputFolder> > panGenomeTree.ps
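Because the tree is written by redirecting standard output to a file, a failed run can leave an empty or invalid panGenomeTree.ps behind. A minimal check is sketched below. It assumes the program writes ordinary PostScript, which by convention begins with the characters "%!"; the helper name looks_like_postscript is made up for this example.

```shell
#!/bin/sh
# looks_like_postscript FILE
# Succeeds if FILE is non-empty and starts with the PostScript marker "%!".
looks_like_postscript() {
    [ -s "$1" ] || return 1                 # missing or empty file
    [ "$(head -c 2 "$1")" = "%!" ]          # compare the first two bytes
}
```

For example, looks_like_postscript panGenomeTree.ps || echo "tree output looks wrong" gives a quick warning before the file is opened in a viewer. The same check applies to pancoreplot.ps.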