Comparative Bacterial Genomics - DTU Bioinformatics

Comparative Bacterial Genomics Workshop, Centers for Disease Control, Atlanta, Georgia, USA

Comparative Bacterial Genomics

Exercises for Day 4 - core/pan genomes

Pimlapas Leekitcharoenphon (Shinny)

http://www.cbs.dtu.dk/staff/dave/CDC_2012.php

Thursday, August 30, 2012



http://cge.cbs.dtu.dk/services/


National Food Institute, Technical University of Denmark

Protein homology in clonal strains (outbreak)


SNPs

Single Nucleotide Polymorphisms

• DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered.

• SNPs can occur in both coding and non-coding regions of the genome.
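The definition above can be illustrated with a minimal Python sketch that lists single-nucleotide differences between two sequences. It assumes the two sequences are already aligned to the same length; the function name `find_snps` and the toy sequences are hypothetical, and this is not the method used by any particular SNP-calling service.

```python
def find_snps(reference: str, query: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, query base) for each substitution."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    snps = []
    for pos, (ref_base, qry_base) in enumerate(
        zip(reference.upper(), query.upper()), start=1
    ):
        # Gaps and ambiguous symbols are skipped: only A/C/G/T substitutions count here.
        if ref_base in bases and qry_base in bases and ref_base != qry_base:
            snps.append((pos, ref_base, qry_base))
    return snps


# Toy, made-up sequences (assumption: pre-aligned, equal length).
reference = "ATGGCGTACGTTAG"
query = "ATGACGTACGTCAG"
for pos, ref_base, qry_base in find_snps(reference, query):
    print(f"position {pos}: {ref_base} -> {qry_base}")
```

Whether a given SNP falls in a coding or non-coding region would need the genome annotation in addition to the positions reported here.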


SNPs tree

• Download 6 genomes from the following link

• Choose Salmonella Typhimurium D23580 as a reference genome

• Construct a SNP tree

‣ http://cge.cbs.dtu.dk/services/snpTree/


[Figure: BASE ATLAS of V. cholerae O1 biovar El Tor str. N16961, chromosome I, 2,961,149 bp. Position axis: 0M to 2.5M. Resolution: 1185. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

Tracks and colour-scale ranges:

• G Content: 0.18 to 0.30

• A Content: 0.20 to 0.32

• T Content: 0.21 to 0.32

• C Content: 0.17 to 0.30

• Annotations: CDS +, CDS -, rRNA, tRNA

• AT Skew: -0.04 to 0.04

• GC Skew: -0.08 to 0.08

• Percent AT: 0.46 to 0.59

Workflow labels: genomeStatistics, rnammer
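The atlas tracks named above (Percent AT, AT Skew, GC Skew) can be sketched with a short Python example. This is only an illustration under stated assumptions: the skews use the common definitions (A - T)/(A + T) and (G - C)/(G + C), the non-overlapping window scheme and the function name `window_stats` are hypothetical, and nothing here is taken from the genomeAtlas tool itself.

```python
def window_stats(seq: str, window: int) -> list[dict[str, float]]:
    """Per-window base statistics over non-overlapping windows of a DNA sequence."""
    seq = seq.upper()
    stats = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        a, t, g, c = (chunk.count(base) for base in "ATGC")
        stats.append({
            "start": start,
            # Fraction of the window that is A or T (other symbols stay in the denominator).
            "percent_at": (a + t) / window,
            # Skews are set to 0.0 when the window contains neither base of the pair.
            "at_skew": (a - t) / (a + t) if a + t else 0.0,
            "gc_skew": (g - c) / (g + c) if g + c else 0.0,
        })
    return stats


# Toy sequence and window size, both made up for illustration.
for row in window_stats("AAATGGGC", window=4):
    print(row)
```

A real atlas would use much larger windows along the chromosome; the window size here is chosen only so the arithmetic is easy to follow by hand.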

[Figure: pan- and core-genome plot. x-axis: genomes 1 to 46, numbered as in the key below. y-axis ticks: 0, 5000, 10000, 15000. Series: New genes, New gene families, Core genome, Pan genome.]

Genome key: 1: Ecoli_042; 2: Ecoli_536; 3: Ecoli_55989; 4: Ecoli_ABU_83972; 5: Ecoli_APEC_O1; 6: Ecoli_ATCC_8739; 7: Ecoli_BL21_DE3_28965; 8: Ecoli_BL21_DE3_30681; 9: Ecoli_BW2952; 10: Ecoli_B_str_REL606; 11: Ecoli_DH1; 12: Ecoli_E24377A; 13: Ecoli_ED1a; 14: Ecoli_ETEC_H10407; 15: Ecoli_HS; 16: Ecoli_IAI1; 17: Ecoli_IAI39; 18: Ecoli_IHE3034; 19: Ecoli_KO11; 20: Ecoli_O103H2_str_12009; 21: Ecoli_O111H_str_11128; 22: Ecoli_O127H6_str_E2348_69; 23: Ecoli_O157H7_str_EDL933; 24: Ecoli_O157H7_str_TW14359; 25: Ecoli_O26H11_str_11368; 26: Ecoli_O55H7_str_CB9615; 27: Ecoli_O83H1_str_NRG_857C; 28: Ecoli_S88; 29: Ecoli_SE11; 30: Ecoli_SE15; 31: Ecoli_SMS35; 32: Ecoli_UM146; 33: Ecoli_UMN026; 34: Ecoli_UTI89; 35: Ecoli_W; 36: Ecoli_str_K12_substr_DH10B; 37: Ecoli_str_K12_substr_MG1655; 38: Ecoli_str_K12_substr_W3110; 39: Vatypica_ACS_049_V_Sch6; 40: Vatypica_ACS_134_V_Col7a; 41: Vdispar_ATCC_17748; 42: Vparvula_ATCC_17745; 43: Vparvula_DSM_2008; 44: Vsp_3_1_44; 45: Vsp_6_1_27; 46: Vsp_str_F0412
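The pan- and core-genome curves named in this plot can be sketched as running set operations: the pan genome is the union of gene families seen so far, and the core genome is their intersection. The Python below is a minimal illustration, assuming each genome has already been reduced to a set of gene-family identifiers; the function name `pan_core_curve` and the toy family names are hypothetical, and this is not the pancoreplot implementation. It tracks new gene families only, not the separate "New genes" series.

```python
def pan_core_curve(genomes):
    """For each genome, in order, return (name, new families, pan size, core size)."""
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new_families = families - pan      # families not seen in any earlier genome
        pan |= families                    # pan genome: union so far
        # Core genome: families present in every genome added so far.
        core = set(families) if core is None else core & families
        rows.append((name, len(new_families), len(pan), len(core)))
    return rows


# Toy input (made-up family identifiers).
toy_genomes = [
    ("genome_1", {"famA", "famB", "famC"}),
    ("genome_2", {"famA", "famB", "famD"}),
    ("genome_3", {"famA", "famD", "famE"}),
]
for name, new, pan_size, core_size in pan_core_curve(toy_genomes):
    print(f"{name}: new={new} pan={pan_size} core={core_size}")
```

Because both curves depend on the order in which genomes are added, the numbered genome key matters when reading such a plot.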

[Figure: BLAST matrix of 22 proteomes. Each cell gives a homology percentage and the counts behind it (count / total). Colour scales: homology within proteomes, 5.0 % to 1.8 %; homology between proteomes, 81.6 % to 25.5 %.]

The extracted cells appear to run row by row across the upper triangle of the matrix: each genome's within-proteome cell first, then its cells against each later genome in the numbered order below.

1. Aliivibrio salmonicida LFI1238 (3,915 proteins, 3,378 families)
Within: 3.3 % (111 / 3,378)
Against 2 to 22: 28.3 % (1,980 / 6,989); 55.5 % (2,683 / 4,838); 52.4 % (2,666 / 5,085); 34.9 % (2,114 / 6,065); 33.1 % (2,074 / 6,269); 30.3 % (1,795 / 5,923); 30.5 % (1,813 / 5,939); 26.7 % (1,916 / 7,168); 30.5 % (2,050 / 6,715); 32.6 % (2,040 / 6,250); 28.3 % (2,095 / 7,406); 32.3 % (1,842 / 5,705); 31.9 % (2,074 / 6,494); 33.6 % (1,805 / 5,377); 30.2 % (1,747 / 5,786); 29.9 % (1,736 / 5,802); 31.9 % (1,743 / 5,469); 34.4 % (1,846 / 5,360); 32.5 % (1,873 / 5,769); 30.6 % (1,777 / 5,804); 32.1 % (1,846 / 5,747)

2. Photobacterium profundum SS9 (5,480 proteins, 4,897 families)
Within: 5.0 % (243 / 4,897)
Against 3 to 22: 30.3 % (2,110 / 6,968); 29.7 % (2,127 / 7,169); 29.5 % (2,198 / 7,456); 28.1 % (2,155 / 7,667); 25.5 % (1,872 / 7,339); 28.0 % (2,022 / 7,222); 25.9 % (2,170 / 8,370); 27.8 % (2,222 / 7,979); 29.4 % (2,212 / 7,534); 26.1 % (2,254 / 8,624); 27.9 % (1,972 / 7,061); 29.6 % (2,295 / 7,753); 28.1 % (1,904 / 6,782); 25.7 % (1,850 / 7,198); 25.6 % (1,841 / 7,205); 26.9 % (1,851 / 6,869); 28.7 % (1,944 / 6,766); 27.5 % (1,971 / 7,179); 26.3 % (1,893 / 7,208); 27.2 % (1,946 / 7,165)

3. Vibrio fischeri ES114 (3,818 proteins, 3,691 families)
Within: 2.6 % (96 / 3,691)
Against 4 to 22: 75.0 % (3,261 / 4,346); 38.7 % (2,246 / 5,808); 36.6 % (2,201 / 6,016); 33.6 % (1,915 / 5,695); 34.5 % (1,963 / 5,692); 30.4 % (2,085 / 6,866); 34.2 % (2,205 / 6,448); 36.3 % (2,179 / 6,005); 29.6 % (2,214 / 7,478); 36.2 % (1,976 / 5,464); 35.9 % (2,233 / 6,219); 36.7 % (1,906 / 5,192); 32.8 % (1,843 / 5,611); 33.0 % (1,848 / 5,596); 34.9 % (1,843 / 5,282); 37.7 % (1,947 / 5,165); 35.3 % (1,972 / 5,581); 33.6 % (1,884 / 5,612); 35.0 % (1,949 / 5,561)

4. Vibrio fischeri MJ11 (4,039 proteins, 3,894 families)
Within: 2.9 % (112 / 3,894)
Against 5 to 22: 38.1 % (2,277 / 5,979); 35.7 % (2,219 / 6,213); 32.5 % (1,919 / 5,903); 33.9 % (1,991 / 5,874); 29.4 % (2,083 / 7,082); 33.1 % (2,209 / 6,672); 35.3 % (2,191 / 6,211); 29.3 % (2,244 / 7,665); 34.5 % (1,965 / 5,696); 35.5 % (2,270 / 6,400); 35.6 % (1,922 / 5,398); 31.9 % (1,857 / 5,817); 32.1 % (1,861 / 5,806); 34.2 % (1,872 / 5,473); 36.6 % (1,964 / 5,371); 34.2 % (1,983 / 5,797); 32.5 % (1,896 / 5,827); 34.0 % (1,963 / 5,771)

5. Vibrio splendidus LGP32 (4,431 proteins, 4,277 families)
Within: 2.8 % (118 / 4,277)
Against 6 to 22: 72.3 % (3,688 / 5,101); 38.6 % (2,289 / 5,931); 42.3 % (2,451 / 5,795); 36.7 % (2,562 / 6,982); 40.8 % (2,680 / 6,565); 43.7 % (2,670 / 6,112); 36.7 % (2,759 / 7,516); 45.4 % (2,507 / 5,523); 43.9 % (2,762 / 6,293); 41.8 % (2,264 / 5,418); 38.0 % (2,213 / 5,823); 37.9 % (2,209 / 5,822); 39.9 % (2,202 / 5,514); 42.9 % (2,314 / 5,388); 40.4 % (2,345 / 5,808); 38.6 % (2,251 / 5,839); 40.3 % (2,326 / 5,771)

6. Vibrio species MED222 1099517005441 (4,590 proteins, 4,463 families)
Within: 2.3 % (103 / 4,463)
Against 7 to 22: 36.9 % (2,259 / 6,124); 40.2 % (2,413 / 5,999); 36.5 % (2,593 / 7,105); 39.7 % (2,672 / 6,728); 41.9 % (2,637 / 6,301); 34.6 % (2,682 / 7,762); 43.7 % (2,492 / 5,705); 41.4 % (2,698 / 6,523); 39.9 % (2,238 / 5,609); 36.9 % (2,208 / 5,989); 36.3 % (2,186 / 6,014); 38.0 % (2,171 / 5,707); 40.6 % (2,270 / 5,592); 38.5 % (2,311 / 6,004); 37.0 % (2,227 / 6,026); 38.4 % (2,291 / 5,971)

7. Vibrio campbellii AND4 1103602000595 (3,935 proteins, 3,822 families)
Within: 2.3 % (88 / 3,822)
Against 8 to 22: 46.2 % (2,452 / 5,307); 30.9 % (2,144 / 6,948); 37.5 % (2,396 / 6,387); 39.9 % (2,372 / 5,942); 45.0 % (3,018 / 6,702); 37.8 % (2,081 / 5,503); 47.0 % (2,741 / 5,827); 38.1 % (1,994 / 5,228); 34.4 % (1,944 / 5,645); 34.8 % (1,952 / 5,617); 36.4 % (1,935 / 5,317); 38.7 % (2,021 / 5,225); 36.4 % (2,055 / 5,647); 34.7 % (1,968 / 5,677); 35.8 % (2,018 / 5,637)

8. Vibrio species Ex25 (4,004 proteins, 3,886 families)
Within: 2.7 % (103 / 3,886)
Against 9 to 22: 34.5 % (2,335 / 6,762); 43.2 % (2,655 / 6,143); 46.1 % (2,626 / 5,697); 43.4 % (2,981 / 6,875); 45.0 % (2,357 / 5,232); 64.9 % (3,385 / 5,213); 41.6 % (2,134 / 5,135); 38.2 % (2,104 / 5,504); 37.2 % (2,064 / 5,548); 39.1 % (2,048 / 5,244); 41.6 % (2,140 / 5,139); 38.8 % (2,162 / 5,566); 37.9 % (2,110 / 5,560); 38.7 % (2,143 / 5,536)

9. Vibrio shilonii AK1 1103207002036 (5,360 proteins, 5,078 families)
Within: 3.9 % (200 / 5,078)
Against 10 to 22: 33.0 % (2,516 / 7,615); 34.4 % (2,472 / 7,184); 30.1 % (2,581 / 8,574); 34.3 % (2,276 / 6,634); 35.2 % (2,581 / 7,333); 32.4 % (2,098 / 6,481); 30.3 % (2,079 / 6,856); 29.6 % (2,044 / 6,898); 31.2 % (2,045 / 6,565); 33.0 % (2,137 / 6,467); 31.5 % (2,169 / 6,884); 30.4 % (2,098 / 6,893); 31.2 % (2,143 / 6,862)

10. Vibrio vulnificus YJ016 (5,028 proteins, 4,773 families)
Within: 3.1 % (150 / 4,773)
Against 11 to 22: 67.5 % (3,741 / 5,540); 37.0 % (2,900 / 7,832); 43.2 % (2,597 / 6,013); 46.4 % (3,042 / 6,550); 43.0 % (2,483 / 5,781); 39.4 % (2,432 / 6,172); 39.1 % (2,418 / 6,182); 40.1 % (2,373 / 5,919); 44.1 % (2,533 / 5,743); 41.9 % (2,575 / 6,151); 40.0 % (2,473 / 6,185); 41.7 % (2,552 / 6,116)

11. Vibrio vulnificus CMCP6 (4,538 proteins, 4,337 families)
Within: 2.8 % (121 / 4,337)
Against 12 to 22: 38.7 % (2,880 / 7,439); 47.2 % (2,608 / 5,524); 48.9 % (2,994 / 6,128); 46.3 % (2,464 / 5,326); 42.2 % (2,409 / 5,711); 41.3 % (2,372 / 5,746); 43.5 % (2,367 / 5,437); 47.1 % (2,503 / 5,310); 44.5 % (2,539 / 5,707); 42.8 % (2,449 / 5,718); 44.3 % (2,515 / 5,683)

12. Vibrio harveyi ATCC BAA-1116 (6,064 proteins, 5,116 families)
Within: 3.9 % (202 / 5,116)
Against 13 to 22: 34.9 % (2,496 / 7,160); 46.4 % (3,371 / 7,266); 33.3 % (2,327 / 6,984); 31.0 % (2,282 / 7,362); 30.7 % (2,271 / 7,389); 32.1 % (2,268 / 7,062); 34.3 % (2,377 / 6,932); 33.1 % (2,415 / 7,299); 31.7 % (2,323 / 7,337); 32.5 % (2,385 / 7,336)

13. Vibrio parahaemolyticus 16 (3,780 proteins, 3,683 families)
Within: 2.1 % (79 / 3,683)
Against 14 to 22: 43.5 % (2,547 / 5,858); 46.0 % (2,220 / 4,821); 41.1 % (2,153 / 5,242); 41.1 % (2,152 / 5,239); 42.7 % (2,113 / 4,953); 45.9 % (2,223 / 4,842); 42.3 % (2,236 / 5,283); 41.3 % (2,181 / 5,277); 42.2 % (2,215 / 5,254)

14. Vibrio parahaemolyticus RIMD 2210633 (4,832 proteins, 4,662 families)
Within: 3.2 % (147 / 4,662)
Against 15 to 22: 42.3 % (2,399 / 5,675); 37.9 % (2,313 / 6,099); 38.1 % (2,320 / 6,091); 39.7 % (2,303 / 5,796); 42.4 % (2,408 / 5,683); 40.0 % (2,440 / 6,094); 38.4 % (2,348 / 6,120); 40.0 % (2,421 / 6,055)

15. Vibrio cholerae AM-19226 (3,407 proteins, 3,305 families)
Within: 2.5 % (84 / 3,305)
Against 16 to 22: 68.5 % (2,844 / 4,150); 70.4 % (2,886 / 4,098); 73.1 % (2,818 / 3,854); 81.0 % (2,989 / 3,688); 72.2 % (2,986 / 4,136); 68.5 % (2,869 / 4,191); 70.4 % (2,922 / 4,153)

16. Vibrio cholerae 2740-80 (3,771 proteins, 3,567 families)
Within: 3.5 % (125 / 3,567)
Against 17 to 22: 64.5 % (2,847 / 4,414); 68.3 % (2,820 / 4,126); 74.3 % (2,987 / 4,018); 81.6 % (3,264 / 4,000); 77.5 % (3,153 / 4,066); 76.9 % (3,165 / 4,117)

17. Vibrio cholerae 1587 (3,758 proteins, 3,597 families)
Within: 2.8 % (99 / 3,597)
Against 18 to 22: 67.8 % (2,806 / 4,137); 67.6 % (2,836 / 4,195); 67.4 % (2,983 / 4,424); 65.0 % (2,880 / 4,434); 64.6 % (2,888 / 4,474)

18. Vibrio cholerae MZO-2 (3,425 proteins, 3,311 families)
Within: 2.2 % (73 / 3,311)
Against 19 to 22: 71.5 % (2,801 / 3,915); 69.7 % (2,916 / 4,183); 69.0 % (2,860 / 4,145); 68.7 % (2,874 / 4,181)

19. Vibrio cholerae MO10 (3,421 proteins, 3,353 families)
Within: 1.8 % (59 / 3,353)
Against 20 to 22: 80.2 % (3,169 / 3,953); 75.1 % (3,024 / 4,028); 79.6 % (3,139 / 3,944)

20. Vibrio cholerae 0395 (3,875 proteins, 3,665 families)
Within: 4.3 % (157 / 3,665)
Against 21 to 22: 80.2 % (3,271 / 4,079); 80.4 % (3,303 / 4,109)

21. Vibrio cholerae V52 (3,815 proteins, 3,599 families)
Within: 3.3 % (120 / 3,599)
Against 22: 77.1 % (3,186 / 4,134)

22. Vibrio cholerae O1 biovar eltor str. N16961 (3,828 proteins, 3,665 families)
Within: 3.0 % (110 / 3,665)
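Each BLAST matrix cell pairs a percentage with two counts, for example "3.3 %" with "111 / 3,378". A small Python sketch shows that the percentage appears to be simply the first count divided by the second. The reading of the counts as shared gene families over total gene families is an assumption, and the function name `homology_percent` is hypothetical; this is not the blastmatrix program's code.

```python
def homology_percent(shared: int, total: int) -> float:
    """Percentage of `total` represented by `shared`, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * shared / total, 1)


# Count pairs copied from cells of the matrix (commas in the numbers removed).
cells = [
    (111, 3378),    # shown as 3.3 %
    (59, 3353),     # shown as 1.8 %
    (1980, 6989),   # shown as 28.3 %
    (3264, 4000),   # shown as 81.6 %
]
for shared, total in cells:
    print(f"{shared} / {total} = {homology_percent(shared, total)} %")
```

Note that the denominator of a within-proteome cell matches that genome's family count, whereas the denominator of a between-proteome cell is larger than either genome's family count.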

[Workflow labels: BLAST matrix; Copy and download, GenBank and DNA files. Tools named: grep, ls -1, gawk, pancoreplot, makebmdest, blastmatrix, saco_extract, saco_convert, Prodigal.]

[Book page 4, Chapter 1: Sequences as Biological Information]

organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.

From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of

Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance

[Figure 1.1 labels. Super-kingdoms: BACTERIA, ARCHAEA, EUCARYA. Groupings: Unicellular eukaryotes, Macro-organisms (Animals, Plants), Protozoans. Taxa: Flavobacterium, Crenarchaeota, Euryarchaeota, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Giardia, Saccharomyces, Trypanosoma, Slime mold, Babesia, Aquificae, Thermotoga, Thermus, Deinococcus, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes.]

[Figure: workflow diagram for the week. Days: Monday, Tuesday, Wednesday, Thursday.]

Boxes in the diagram:

• 16S rRNA phylogenetic tree

• Locate rRNA sequences

• Basic genome statistics

• Genome atlas

• Published annotated genes/proteins

• Examine GenBank files

• Genefinding, genes/proteins

• Amino acid and codon usage

• Number of genes/proteins

• Subset specific gene counts

• Pan and core genome plot

• Raw DNA sequence

Tools named: njplot, extractseqs, clustalw, genomeAtlas, sed, chmod, genewiz, mousepad, basicgenomeanalysis

Information table for all genomes. Add information to this table as you do the exercises.


BLAST atlasComparative Bacterial Genomics Workshop, Centers for Disease Control, Atlanta, Georgia, USA 27 August, 2012 13

Work flow:

1 3 5 7 9 12 15 18 21 24 27 30 33 36 39 42 45

05

00

01

00

00

15

00

0

New genes

New gene families

Core genome

Pan genome

1 : Ecoli_042 2 : Ecoli_536 3 : Ecoli_55989 4 : Ecoli_ABU_83972 5 : Ecoli_APEC_O1 6 : Ecoli_ATCC_8739 7 : Ecoli_BL21_DE3_28965 8 : Ecoli_BL21_DE3_30681 9 : Ecoli_BW2952 10 : Ecoli_B_str_REL606 11 : Ecoli_DH1 12 : Ecoli_E24377A 13 : Ecoli_ED1a 14 : Ecoli_ETEC_H10407 15 : Ecoli_HS 16 : Ecoli_IAI1 17 : Ecoli_IAI39 18 : Ecoli_IHE3034 19 : Ecoli_KO11 20 : Ecoli_O103H2_str_12009 21 : Ecoli_O111H_str_11128 22 : Ecoli_O127H6_str_E2348_69 23 : Ecoli_O157H7_str_EDL933 24 : Ecoli_O157H7_str_TW14359 25 : Ecoli_O26H11_str_11368 26 : Ecoli_O55H7_str_CB9615 27 : Ecoli_O83H1_str_NRG_857C 28 : Ecoli_S88 29 : Ecoli_SE11 30 : Ecoli_SE15 31 : Ecoli_SMS35 32 : Ecoli_UM146 33 : Ecoli_UMN026 34 : Ecoli_UTI89 35 : Ecoli_W 36 : Ecoli_str_K12_substr_DH10B 37 : Ecoli_str_K12_substr_MG1655 38 : Ecoli_str_K12_substr_W3110 39 : Vatypica_ACS_049_V_Sch6 40 : Vatypica_ACS_134_V_Col7a 41 : Vdispar_ATCC_17748 42 : Vparvula_ATCC_17745 43 : Vparvula_DSM_2008 44 : Vsp_3_1_44 45 : Vsp_6_1_27 46 : Vsp_str_F0412

Pan and core

genome plot

3.3 %111 / 3,378

28.3 %1,980 / 6,989

55.5 %2,683 / 4,838

52.4 %2,666 / 5,085

34.9 %2,114 / 6,065

33.1 %2,074 / 6,269

30.3 %1,795 / 5,923

30.5 %1,813 / 5,939

26.7 %1,916 / 7,168

30.5 %2,050 / 6,715

32.6 %2,040 / 6,250

28.3 %2,095 / 7,406

32.3 %1,842 / 5,705

31.9 %2,074 / 6,494

33.6 %1,805 / 5,377

30.2 %1,747 / 5,786

29.9 %1,736 / 5,802

31.9 %1,743 / 5,469

34.4 %1,846 / 5,360

32.5 %1,873 / 5,769

30.6 %1,777 / 5,804

32.1 %1,846 / 5,747

5.0 %243 / 4,897

30.3 %2,110 / 6,968

29.7 %2,127 / 7,169

29.5 %2,198 / 7,456

28.1 %2,155 / 7,667

25.5 %1,872 / 7,339

28.0 %2,022 / 7,222

25.9 %2,170 / 8,370

27.8 %2,222 / 7,979

29.4 %2,212 / 7,534

26.1 %2,254 / 8,624

27.9 %1,972 / 7,061

29.6 %2,295 / 7,753

28.1 %1,904 / 6,782

25.7 %1,850 / 7,198

25.6 %1,841 / 7,205

26.9 %1,851 / 6,869

28.7 %1,944 / 6,766

27.5 %1,971 / 7,179

26.3 %1,893 / 7,208

27.2 %1,946 / 7,165

2.6 %96 / 3,691

75.0 %3,261 / 4,346

38.7 %2,246 / 5,808

36.6 %2,201 / 6,016

33.6 %1,915 / 5,695

[Figure: pairwise comparison of 22 Vibrionaceae proteomes. Each cell gives a percentage and the ratio of counts behind it, for example 34.5 % (1,963 / 5,692).]

Genomes compared:

Aliivibrio salmonicida LFI1238: 3,915 proteins, 3,378 families
Photobacterium profundum SS9: 5,480 proteins, 4,897 families
Vibrio fischeri ES114: 3,818 proteins, 3,691 families
Vibrio fischeri MJ11: 4,039 proteins, 3,894 families
Vibrio splendidus LGP32: 4,431 proteins, 4,277 families
Vibrio species MED222 1099517005441: 4,590 proteins, 4,463 families
Vibrio campbellii AND4 1103602000595: 3,935 proteins, 3,822 families
Vibrio species Ex25: 4,004 proteins, 3,886 families
Vibrio shilonii AK1 1103207002036: 5,360 proteins, 5,078 families
Vibrio vulnificus YJ016: 5,028 proteins, 4,773 families
Vibrio vulnificus CMCP6: 4,538 proteins, 4,337 families
Vibrio harveyi ATCC BAA-1116: 6,064 proteins, 5,116 families
Vibrio parahaemolyticus 16: 3,780 proteins, 3,683 families
Vibrio parahaemolyticus RIMD 2210633: 4,832 proteins, 4,662 families
Vibrio cholerae AM-19226: 3,407 proteins, 3,305 families
Vibrio cholerae 2740-80: 3,771 proteins, 3,567 families
Vibrio cholerae 1587: 3,758 proteins, 3,597 families
Vibrio cholerae MZO-2: 3,425 proteins, 3,311 families
Vibrio cholerae MO10: 3,421 proteins, 3,353 families
Vibrio cholerae 0395: 3,875 proteins, 3,665 families
Vibrio cholerae V52: 3,815 proteins, 3,599 families
Vibrio cholerae O1 biovar eltor str. N16961: 3,828 proteins, 3,665 families

Legend: homology within proteomes, 5.0 % to 1.8 %; homology between proteomes, 81.6 % to 25.5 %

BLAST matrix
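Each cell of the BLAST matrix pairs a percentage with a ratio of two counts, and the percentage is that ratio expressed per hundred and rounded to one decimal. A minimal shell sketch of the arithmetic follows. The function name cell_percent and its parameter names are hypothetical, and reading the two counts as matched families over total families is an assumption, since the figure does not label them.

```shell
#!/bin/sh
# Hypothetical helper, not part of the blastmatrix program.
# Assumption: the first count is the number of matched families and the
# second is the total number of families for the pair of proteomes.
cell_percent() {
    awk -v matched="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", 100 * matched / total }'
}

# Two cells taken from the figure (thousands separators removed).
cell_percent 1963 5692   # prints 34.5
cell_percent 3264 4000   # prints 81.6
```

Recomputing a few cells this way is a quick check that a matrix was transcribed correctly.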

[Figure: Genome atlas (BASE ATLAS) of V. cholerae O1 biovar El Tor str. N16961 chromosome I, 2,961,149 bp, with a position scale from 0M to 2.5M. Lanes: G content (0.18 to 0.30), A content (0.20 to 0.32), T content (0.21 to 0.32), C content (0.17 to 0.30), annotations (CDS +, CDS -, rRNA, tRNA), AT skew (-0.04 to 0.04), GC skew (-0.08 to 0.08), percent AT (0.46 to 0.59). Resolution: 1185. Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

Chapter 1: Sequences as Biological Information

organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.

From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of

Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance

[Figure: 16S rRNA phylogenetic tree. BACTERIA: Flavobacterium, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Aquificae, Thermotoga, Thermus, Deinococcus, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes. ARCHAEA: Crenarchaeota, Euryarchaeota. EUCARYA: unicellular eukaryotes and protozoans (Giardia, Saccharomyces, Trypanosoma, slime mold, Babesia) and macro-organisms (animals, plants).]

[Figure: Amino acid usage plot for CP001139. Axis: percentage, with ticks at 1.01, 2.85, 4.69, 6.53, 8.36 and 10.20. Amino acids shown: G, A, V, L, I, F, Y, W, H, K, R, D, E, N, Q, S, T, M, C, P.]

[Figure: workflow of the exercises.]

STEP 1: List of genomes, NCBI GenBank id numbers, GPID

STEP 2: Download genomes in the form of GenBank files

Data and analyses shown: Raw DNA sequence; Published annotated genes/proteins; Genefinding, local annotation of genes/proteins; Number of genes/proteins; Basic genome statistics; Amino acid and codon usage; locate rRNA sequences

Programs shown: getgbk, saco_extract, saco_convert, grep, sed, chmod, gawk, ls -1, extractname, extractseqs, genewiz, genomeStatistics, basicgenomeanalysis, aminoacidUsagePlot, genomeAtlas, prodigalrunner, rnammer, clustalw, njplot, pancoreplot, makebmdest, blastmatrix

[Figure: Genome atlas of S. Typhi str. Ty2, 4,791,961 bp, with a position scale from 0M to 4.5M. Labeled regions: SPI-1, SPI-2, SPI-3, SPI-4, SPI-5, SPI-6, SPI-7, SPI-9, SPI-10, SPI-11, SPI-12.]

[Figure: Pan-genome family tree (pan-genomic dendrogram). Axis: relative Manhattan distance, from 0.15 to 0.00. Leaves, in the order shown:]

S. arizonae serovar 62:z4,z23: str. RSK2980
S. Montevideo str. MB110209 0055
S. Montevideo str. OH_2009072675
S. Montevideo str. 556152
S. Montevideo str. MB102109 0047
S. Montevideo str. IA_2010008284
S. Montevideo str. NC_MB110209 0054
S. Montevideo str. IA_2010008283
S. Montevideo str. 366867
S. Montevideo str. 556150 1
S. Montevideo str. MB101509 0077
S. Montevideo str. IA_2010008282
S. Montevideo str. 495297 1
S. Montevideo str. 446600
S. Montevideo str. 413180
S. Montevideo str. 19N
S. Montevideo str. 609460
S. Montevideo str. IA_2010008287
S. Montevideo str. IA_2009159199
S. Montevideo str. CASC_09SCPH15965
S. Montevideo str. 81038 01
S. Montevideo str. 609458 1
S. Montevideo str. MB111609 0052
S. Montevideo str. 2009085258
S. Montevideo str. MD_MDA09249507
S. Montevideo str. 515920 2
S. Montevideo str. 315996572
S. Montevideo str. 515920 1
S. Montevideo str. 315731156
S. Montevideo str. IA_2010008285
S. Montevideo str. 2009083312
S. Montevideo str. 495297 3
S. Montevideo str. 507440 20
S. Montevideo str. 414877
S. Montevideo str. 495297 4
S. Javiana str. GA_MM04042433
S. Montevideo str. 531954
S. Schwarzengrund str. CVM19633
S. Schwarzengrund str. SL480
S. Paratyphi A str. AKU_12601
S. Paratyphi A str. ATCC 9150
S. Typhi str. Ty2
S. Typhi str. CT18
S. Weltevreden str. HI_N05 537
S. Saintpaul str. SARA29
S. Tennessee str. CDC07 0191
S. Kentucky str. CDC 191
S. Kentucky str. CVM29188
S. Virchow str. SL491
S. Agona str. SL483
S. Paratyphi C str. RKS4594
S. Choleraesuis str. A50
S. Choleraesuis str. SC B67
S. Dublin str. 3246
S. Dublin str. CT_02021853
S. Enteritidis str. P125109
S. Gallinarum str. 287/91
S. Gallinarum str. 9
S. Paratyphi B str. SPB7
S. Heidelberg str. SL476
S. Heidelberg str. SL486
S. 4,[5],12:i: str. CVM23701
S. Typhimurium str. D23580
S. Typhimurium str. TN061786
S. Typhimurium str. LT2
S. Typhimurium str. 4/74
S. Typhimurium str. SL1344
S. Saintpaul str. SARA23
S. Typhimurium str. DT104
S. Typhimurium str. 14028S
S. Hadar str. RI_05P066
S. Newport str. SL317
S. Newport str. SL254


Output of the day:

• core/pan genomes plot

• pan-genome family tree


Pan-core genome plot command lines:

• Make a new folder for the plot

‣ mkdir panCorePlot

• Copy the protein files to the new folder using cp

‣ cp <name>_prodigal.orf.fsa panCorePlot

• Enter the new directory

‣ cd panCorePlot

• Create an input file for the pancoreplot program (two tab-separated columns, each holding the file name)

‣ ls -1 *orf.fsa | gawk '{print $1 "\t" $1}' > pancore.list

• Construct the pan-core genome plot

‣ pancoreplot -keep blastOutPut pancore.list > pancoreplot.ps
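As a rough illustration of what a pan/core plot summarizes, the sketch below accumulates gene families genome by genome: the pan-genome is taken as the union of families seen so far, and the core genome as the families present in every genome so far. These are the usual definitions and are assumed here; the slides do not describe how pancoreplot computes its curves. The function pan_core_counts and the toy family table are hypothetical, and the sketch does not read pancoreplot's BLAST output.

```shell
#!/bin/sh
# Toy illustration only: pan_core_counts is a hypothetical helper and the
# family assignments below are made up.
# Input (stdin): one "genome<TAB>family" pair per line, genomes in plot order.
# Output: one line per genome: name, pan-genome size, core-genome size.
pan_core_counts() {
    awk -F'\t' '
        {
            # Number the genomes in order of first appearance.
            if (!($1 in idx)) { idx[$1] = ++n; name[n] = $1 }
            has[idx[$1], $2] = 1
        }
        END {
            for (k = 1; k <= n; k++) {
                # Add the families of genome k to the running tallies.
                for (key in has) {
                    split(key, p, SUBSEP)
                    if (p[1] + 0 == k) count[p[2]]++
                }
                pan = 0; core = 0
                for (f in count) {
                    pan++                        # seen in at least one genome
                    if (count[f] == k) core++    # seen in all k genomes so far
                }
                printf "%s\t%d\t%d\n", name[k], pan, core
            }
        }'
}

# Three toy genomes with families A to E.
printf 'g1\tA\ng1\tB\ng1\tC\ng2\tA\ng2\tB\ng2\tD\ng3\tA\ng3\tC\ng3\tE\n' |
    pan_core_counts
```

In this toy run the pan-genome grows from 3 to 5 families while the core shrinks from 3 to 1, which is the general shape a pan/core plot is expected to show.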


Extract genes from pan-core genome plot

• Extract all the core genes

‣ specificGenes -i 1: <blastOutPutFolder> > <output>.fsa

• Extract all genes of the pan-genome

‣ specificGenes -u 1: <blastOutPutFolder> > <output>.fsa

• Extract the core genes of genomes 1, 2, 3, 5, 6 and 7

‣ specificGenes -i 1:3,5:7 <blastOutPutFolder> > <output>.fsa

• Extract the core genes of genomes 1, 3, 4 and 5 that are not present in any of the genomes from 6 to the last genome

‣ specificGenes -i 1,3:5 -c 6: <blastOutPutFolder> > <output>.fsa
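The genome selectors in the commands above follow a compact pattern: commas separate entries, a:b is a range, and an open end such as 6: runs to the last genome. The sketch below expands a selector into explicit genome numbers so a selection can be checked before running the extraction. It is inferred only from the four examples on the slide; expand_selector is a hypothetical helper, not part of specificGenes, and the total number of genomes is passed in by hand.

```shell
#!/bin/sh
# Hypothetical helper: expands a selector such as "1:3,5:7" into genome
# numbers. Assumption: "a:" means "from a to the last genome", as in the
# slide's "6:" example. Arguments: selector, total number of genomes.
expand_selector() {
    awk -v spec="$1" -v total="$2" 'BEGIN {
        n = split(spec, parts, ",")
        out = ""
        for (i = 1; i <= n; i++) {
            first = parts[i]; last = parts[i]
            colon = index(parts[i], ":")
            if (colon > 0) {
                first = substr(parts[i], 1, colon - 1)
                last = substr(parts[i], colon + 1)
                if (first == "") first = 1
                if (last == "") last = total    # open-ended range
            }
            for (g = first + 0; g <= last + 0; g++)
                out = out (out == "" ? "" : " ") g
        }
        print out
    }'
}

expand_selector "1:3,5:7" 10   # prints: 1 2 3 5 6 7
expand_selector "1,3:5" 8      # prints: 1 3 4 5
expand_selector "6:" 8         # prints: 6 7 8
```

The genome numbers are presumably the line positions in pancore.list, so printing that file with line numbers next to the expanded selector helps confirm which genomes a command will include or exclude.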


Pan-genome family tree:

• Copy tree.pl from the Download directory to /usr/biotools

‣ cp tree.pl /usr/biotools

• Make the program executable

‣ chmod +x /usr/biotools/tree.pl

• Enter the folder where you saved all the BLAST results from the pan/core plot

‣ cd panCorePlot

• Construct the pan-genome family tree

‣ tree.pl -m <shell or cloud> <blastOutputFolder> > panGenomeTree.ps
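The pan-genomic dendrogram shown earlier in the deck labels its axis "Relative manhattan distance". One plausible reading, sketched below, is a Manhattan distance between gene-family presence/absence profiles divided by the number of families. This is an assumption for illustration: the slides do not say how tree.pl builds its profiles or normalizes the distance, and relative_manhattan is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper, not taken from tree.pl.
# Arguments: two presence/absence profiles of equal length, written as
# strings of 1 (family present) and 0 (family absent).
# Assumption: "relative" means dividing by the number of families.
relative_manhattan() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b) || length(a) == 0) exit 1
        sum = 0
        for (i = 1; i <= length(a); i++) {
            d = substr(a, i, 1) - substr(b, i, 1)
            sum += (d < 0 ? -d : d)    # absolute difference per family
        }
        printf "%.3f\n", sum / length(a)
    }'
}

# Families A to E: genome 1 has A, B, C and genome 2 has A, B, D.
relative_manhattan 11100 11010   # prints 0.400
```

Under this reading, two genomes with identical family content are at distance 0, and the distance grows with the number of families found in only one of the two.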
