Hierarchical Cluster Structures and
Symmetries in Genomic Sequences
Andrei Zinovyev
Institut des Hautes Études Scientifiques
Math@Bio group of M.Gromov
Plan of the talk
Genomic sequences: geometric approach, clustering
Genomic sequence as text
Basic 7-cluster structure
Global structure of codon frequencies
Internal structure of codon frequencies
Applications
Introduction
Frequency dictionaries
Genomic sequence as a text in an unknown language
tagggrcgcacgtggtgagctgatgctaggg
frequency dictionaries:
t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g ($N = 4 = 4^1$)
ta gg gr cg ca cg tg gt ga gc tg at gc ta gg ($N = 16 = 4^2$)
tag ggr cgc acg tgg tga gct gat gct agg ($N = 64 = 4^3$)
tagg grcg cacg tggt gagc tgat gcta gggr ($N = 256 = 4^4$)
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
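A minimal sketch of how such a frequency dictionary could be computed, assuming the text is cut into non-overlapping words of length k starting at a given shift. The function name `frequency_dictionary` and its parameters are illustrative and do not come from the talk.

```python
from collections import Counter

def frequency_dictionary(text, k, shift=0):
    """Relative frequencies of non-overlapping words of length k.

    `shift` is the position of the first word; both names are
    illustrative assumptions, not taken from the slides.
    """
    words = [text[i:i + k] for i in range(shift, len(text) - k + 1, k)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

text = "tagggrcgcacgtggtgagctgatgctaggg"  # example string from the slide
triplets = frequency_dictionary(text, 3)
for word, freq in sorted(triplets.items()):
    print(word, freq)
```

For a four-letter alphabet the dictionary of words of length k has at most 4^k entries, which is where N = 4, 16, 64, 256 come from.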
From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
$10^7$
cgtggtgagctgatgctagggrcgcacggtgagctgatgctagggrcgcacacttgagctgatgctagggrcgcacaattcgtgagctgatgctagggrcgcacggtg……gagctgatgctagggrcgcacaagtga
length ~300-400
3000-4000 fragments
$\mathbb{R}^N$
Method of visualization: principal components analysis
$\mathbb{R}^N \to \mathbb{R}^2$
PCA plot
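The pipeline on the last two slides (fragments, frequency vectors in R^N, projection to R^2) could be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the window step, the random test sequence, the power-iteration PCA, and all function names are hypothetical choices. Only the fragment length of about 300 comes from the slide.

```python
import itertools
import random

ALPHABET = "acgt"
TRIPLETS = ["".join(p) for p in itertools.product(ALPHABET, repeat=3)]
INDEX = {word: i for i, word in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Map a fragment to a point in R^64: non-overlapping triplet frequencies."""
    v = [0.0] * len(TRIPLETS)
    n = 0
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])  # words with other letters are skipped
        if j is not None:
            v[j] += 1.0
            n += 1
    return [x / n for x in v] if n else v

def fragments(sequence, width, step):
    """Sliding windows of the given width; the step is an assumed parameter."""
    return [sequence[i:i + width] for i in range(0, len(sequence) - width + 1, step)]

def pca_2d(points, iters=50):
    """Project points onto their first two principal axes (power iteration)."""
    m, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / m for j in range(d)]
    X = [[p[j] - mean[j] for j in range(d)] for p in points]
    axes = []
    for k in range(2):
        v = [1.0 + ((j * (k + 1)) % 7) for j in range(d)]  # arbitrary start vector
        for _ in range(iters):
            for a in axes:  # keep v orthogonal to the axes already found
                dot = sum(x * y for x, y in zip(v, a))
                v = [x - dot * y for x, y in zip(v, a)]
            scores = [sum(x * y for x, y in zip(row, v)) for row in X]
            w = [sum(scores[i] * X[i][j] for i in range(m)) for j in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        axes.append(v)
    return [[sum(x * y for x, y in zip(row, a)) for a in axes] for row in X]

random.seed(0)
sequence = "".join(random.choice(ALPHABET) for _ in range(3000))  # toy stand-in
points = [triplet_vector(f) for f in fragments(sequence, 300, 60)]
projection = pca_2d(points)  # one (x, y) pair per fragment
```

A random sequence shows no cluster structure; the sketch only illustrates the data flow from text to points in the plane.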
Chapter 1
Basic 7-cluster structure
(level 1 of non-randomness)
Caulobacter crescentus
singles N=4
doublets N=16
triplets N=64
quadruplets N=256
!!!
The information in a genomic sequence is encoded by non-overlapping triplets
First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg
gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
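The three triplet rows under each sequence correspond to the three possible phase shifts of a window relative to the codons. A minimal sketch, assuming a window is simply read in non-overlapping triplets from offsets 0, 1 and 2 (the function name `triplet_frames` is illustrative):

```python
def triplet_frames(seq):
    """Non-overlapping triplets of seq for each of the three phase shifts."""
    frames = []
    for shift in range(3):
        n = (len(seq) - shift) // 3
        frames.append([seq[shift + 3 * i: shift + 3 * i + 3] for i in range(n)])
    return frames

frames = triplet_frames("gtgagctgatgctagggrcgcacgtggtgagc")
for shift, frame in enumerate(frames):
    print(shift, " ".join(frame))
```

One plausible reading of the cluster count, not spelled out on the slide: three phases for genes on one strand, three for genes on the complementary strand, and one cluster for non-coding fragments give seven.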
Non-coding parts
gtgagctgatgctagggr cgcacgaat
Point mutations: insertions, deletions
a
Mean-field approximation for triplet frequencies
$F_{IJK} \approx P^1_I P^2_J P^3_K$
$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$)
$F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers
letter frequency + correlations
$P^j_I$: 12 numbers
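A small sketch of the mean-field approximation, assuming $P^j_I$ denotes the frequency of letter $I$ at position $j$ of a triplet (3 positions times 4 letters, hence 12 numbers) and that the 64 triplet frequencies are approximated by the product of the three position-specific frequencies. The toy codon list and the function names are illustrative only.

```python
from collections import Counter
from itertools import product

LETTERS = "ACGT"

def position_frequencies(triplets):
    """P[j][I]: frequency of letter I at position j (0, 1, 2) of the triplets."""
    P = []
    for j in range(3):
        counts = Counter(t[j] for t in triplets)
        P.append({I: counts[I] / len(triplets) for I in LETTERS})
    return P

def mean_field(P):
    """Approximate each of the 64 triplet frequencies by a product of three P values."""
    return {I + J + K: P[0][I] * P[1][J] * P[2][K]
            for I, J, K in product(LETTERS, repeat=3)}

# Toy list of codons, for illustration only (not real codon usage data).
codons = ["ATG", "GAT", "GCT", "AGG", "GTC", "GCA", "CGC", "TAA"]
P = position_frequencies(codons)
F = mean_field(P)
print(F["GCT"])
```

The difference between the observed triplet frequencies and this product is what the slide calls correlations.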
Why hexagonal symmetry?
0-+
-+0
+0-
+-0
-0+
0+-
GC-content = $P_C + P_G$
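For reference, GC-content as defined above can be computed directly from a sequence. A minimal sketch, assuming letter frequencies are taken over the whole string (the function name is illustrative):

```python
def gc_content(sequence):
    """Fraction of letters that are C or G, i.e. P_C + P_G.

    Letters other than a, c, g, t still count toward the length.
    """
    s = sequence.lower()
    return (s.count("c") + s.count("g")) / len(s)

print(gc_content("tagggrcgcacgtggtgagctgatgctaggg"))
```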
Chapter 2
Global structure of codon frequencies
(143 complete bacterial genomes)
Genome codon usage and mean-field approximation
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
…
correct frameshift
64 frequencies $F_{IJK}$
…
ggtgaATG gat gct agg … gtc gca cgc TAAtgagct
12 frequencies $P^1_I$, $P^2_J$, $P^3_K$
Global structure of codon frequencies
eubacteria
archaea
$P^j_I$ are linear functions of GC-content
Four symmetry types of the basic 7-cluster structure
eubacteria
flower-like
degenerated
perpendicular triangles
parallel triangles
Chapter 3
Internal structure of codon frequencies
(level 2 of non-randomness)
Second level of hierarchy
?
Distribution of genes
$\mathbb{R}^{64}$
function1 function2
function3
Fast-growing bacteria
IV
II
I
III
Genes of class I (most genes)
Genes of class II (highly expressed)
Genes of class III (unusual)
Genes of class IV (hydrophobic proteins)
Escherichia coli
Genes of class I (most genes)
Genes of class II (highly expressed)
Genes of class III (unusual)
Genes of class IV (hydrophobic proteins)
Chapter 4
Applications
Computational gene prediction
Accuracy >90%
Protein expression optimization
IV
II
I
III
gene sequence S, protein A
gene sequence S’, same protein A, higher expression
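The slide does not state how S’ is obtained. One plausible reading is synonymous recoding: each codon is replaced by another codon for the same amino acid, chosen to resemble the codon usage of highly expressed genes. The sketch below assumes that reading. The standard genetic code table is factual, but the `PREFERRED` map is a hypothetical placeholder, not measured codon usage, and the toy gene is for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(gene):
    """Translate a gene (length divisible by 3) into one-letter amino acids."""
    return "".join(CODE[gene[i:i + 3]] for i in range(0, len(gene) - 2, 3))

def recode(gene, preferred):
    """Replace each codon by the preferred synonymous codon, when one is given."""
    out = []
    for i in range(0, len(gene) - 2, 3):
        codon = gene[i:i + 3]
        out.append(preferred.get(CODE[codon], codon))
    return "".join(out)

# Hypothetical preference map (amino acid -> codon); real values would have to
# come from the codon usage of the target organism's highly expressed genes.
PREFERRED = {"L": "CTG", "R": "CGT", "A": "GCG"}
assert all(CODE[codon] == aa for aa, codon in PREFERRED.items())

S = "ATGGATGCTAGGGTCGCACGCTAA"   # toy gene
S_prime = recode(S, PREFERRED)
print(S_prime, translate(S) == translate(S_prime))
```

Whether a recoded gene is actually expressed at a higher level is an experimental question that the sketch does not address.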
Web-site
http://www.ihes.fr/~zinovyev/7clusters
cluster structures in genomic sequences
Papers
Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology. V.3, 0039.
Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).
People
Dr. Tanya Popova, Institute of Computational Modeling, Russia
Professor Alexander Gorban, University of Leicester, UK