Kolmogorov Complexity for analysis of DNA sequence
![Page 1: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/1.jpg)
Kolmogorov Complexity for analysis of DNA sequence
Shijun Tang
Thiraphat Meesumrarn
Gaith Albadarin
![Page 2: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/2.jpg)
Outline
• Kolmogorov Complexity
  - The Complexity of DNA
  - Methods
• Quantum Kolmogorov Complexity
  - Qubit and Definition of QKC
![Page 3: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/3.jpg)
Kolmogorov Complexity
The Kolmogorov complexity of any string x ∈ {0, 1}* is defined as:
C(x) := min{ℓ(p) | U(p) = x}
where U is a universal computer and ℓ(p) is the length of the program p. The Kolmogorov complexity of x is the length of the shortest program that produces x as its output.
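C(x) itself is not computable, so in practice it is usually bounded from above by the output length of a real compressor. A minimal sketch, assuming zlib as a stand-in for the universal computer U (the function name `compressed_length` and the test sequences are illustrative, not from the slides):

```python
import random
import zlib


def compressed_length(x: bytes) -> int:
    """Length in bytes of the zlib-compressed form of x (an upper-bound proxy for C(x))."""
    return len(zlib.compress(x, 9))


# A highly regular sequence compresses far better than an irregular one.
regular = b"ACGT" * 250
rng = random.Random(0)  # fixed seed so the run is repeatable
irregular = bytes(rng.choice(b"ACGT") for _ in range(1000))

print(compressed_length(regular), compressed_length(irregular))
```

The absolute numbers depend on the compressor; only the comparison between sequences of the same length is meaningful.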
![Page 4: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/4.jpg)
The Complexity of DNA
• “genetic language” in DNA sequences (A, C, G, and T)
• heterogeneity in DNA sequences (not random)
• the long-range correlation
• Compression
![Page 5: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/5.jpg)
Methods
• Entropy
• Spectral Analysis
• Kolmogorov Complexity
![Page 6: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/6.jpg)
Entropy
Clausius Entropy
Boltzmann Entropy
Shannon Entropy
Kolmogorov Entropy
Tsallis Entropy
-- Approximate Entropy
-- Sample Entropy
-- Multiscale Entropy
…
![Page 7: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/7.jpg)
Entropy
• Jensen-Shannon distance: the difference between the entropy calculated from the whole system and the weighted sum of the entropies calculated from the subsystems
• The Jensen-Shannon distance D(i) is computed for each possible partition point i along the DNA sequence
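The definition above can be sketched in a few lines. This assumes Shannon entropy of the nucleotide composition in bits and weights equal to the relative lengths of the two subsequences; the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts: Counter, total: int) -> float:
    """Shannon entropy in bits of a composition given as symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def jensen_shannon_profile(seq: str) -> list[float]:
    """D(i) for every partition point i = 1 .. len(seq) - 1."""
    n = len(seq)
    whole = Counter(seq)
    h_whole = shannon_entropy(whole, n)
    left: Counter = Counter()
    profile = []
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = whole - left  # counts of the right-hand subsequence
        d = (h_whole
             - (i / n) * shannon_entropy(left, i)
             - ((n - i) / n) * shannon_entropy(right, n - i))
        profile.append(d)
    return profile


profile = jensen_shannon_profile("AAAAAAAATTTTTTTT")
best_point = 1 + max(range(len(profile)), key=profile.__getitem__)
```

For this toy sequence the maximum of D(i) falls exactly at the boundary between the A-rich and T-rich halves.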
![Page 8: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/8.jpg)
[Figure: D(i) along a sequence of 230,208 nucleotides; annotation near ≈ 189,000; points labeled 1 to 4]
![Page 9: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/9.jpg)
• The larger D(i), the more the two subsequences obtained by partitioning at point i differ, and the better that point is for partitioning the sequence.
• The average value of D(i) for a random sequence is at least 10 times lower than that for the yeast sequence.
• The ups and downs in D(i) for the random sequence are purely random fluctuations.
![Page 10: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/10.jpg)
Spectral Analysis
• Power spectrum: represents the correlation structure in a sequence according to wavelength (or frequency f = c/wavelength).
• The power at a given frequency, P(f), is the contribution from that frequency component to the total variance of the fluctuation in the sequence.
• A random sequence lacks correlation at any length scale, and its power spectrum is flat.
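As an illustration of "contribution to the total variance", here is a small sketch that computes a power spectrum by a direct O(N²) Fourier sum. It assumes the DNA sequence is first mapped to a 0/1 indicator sequence for one nucleotide, a common choice that the slides do not specify; frequency index k corresponds to f = k/N cycles per position.

```python
import cmath


def power_spectrum(seq: str, symbol: str) -> list[float]:
    """P(k) for k = 0 .. N-1 of the indicator sequence of `symbol`."""
    n = len(seq)
    x = [1.0 if ch == symbol else 0.0 for ch in seq]
    mean = sum(x) / n
    x = [v - mean for v in x]  # remove the mean so that P sums to the variance
    spectrum = []
    for k in range(n):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))
        spectrum.append(abs(coeff) ** 2 / n**2)
    return spectrum


spectrum = power_spectrum("ACACACAC", "A")
```

With this normalization the values of P(k) add up to the variance of the indicator sequence, and the period-2 sequence puts all of its power at k = N/2.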
![Page 11: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/11.jpg)
![Page 12: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/12.jpg)
Kolmogorov Complexity for Analysis of DNA
The search for DNA regions with low complexity is one of the pivotal tasks of modern structural analysis of complete genomes.
The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats or by palindrome-hairpin structures, as well as by a combination of all these factors.
![Page 13: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/13.jpg)
Four types of repeat, differing by orientation and localization in direct or complementary chains, are considered: direct, symmetric, inverted and direct complementary.
• Direct and inverted repeats are the standard prototypes.
• Symmetric: the repeated sequence is oppositely oriented on the same DNA strand.
• Direct complementary: a direct repeat on the complementary DNA strand.
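The four repeat types can be made concrete by showing what the copy of a prototype fragment looks like when read along the same strand. This is a sketch under one assumption that the slides do not spell out: "inverted" is taken in its usual sense of reverse complement. The helper name `repeat_copies` is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_copies(prototype: str) -> dict[str, str]:
    """The copy that each repeat type would place elsewhere in the sequence."""
    return {
        "direct": prototype,
        "symmetric": prototype[::-1],                       # reversed, same strand
        "inverted": prototype.translate(COMPLEMENT)[::-1],  # reverse complement (assumed)
        "direct complementary": prototype.translate(COMPLEMENT),
    }


copies = repeat_copies("GTGCC")
```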
![Page 14: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/14.jpg)
![Page 15: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/15.jpg)
Nucleotide sequence: the AP2 transcription factor binding site, GTGCCCCGCGGGAACCCCGC.
Black and gray arrows mark the copied fragments and their prototypes. A tandem repeat characterized by partial overlapping of the prototype on the copied fragment is marked by a dotted line. In this decomposition, the first one-lettered components, G and T, are produced by an operation generating a novel symbol. The complexity of this 20-lettered sequence = 10 [the number of components in H(S)].
![Page 16: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/16.jpg)
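Lempel-Ziv complexity
• S and Q are two strings; SQ = S + Q is their concatenation; SQP is SQ with its last letter deleted; V(SQP) is the set of all substrings of SQP.
• Start with c(n) = 1. In general, assume S = s1 s2 … sr and Q = s(r+1).
• If Q ∈ V(SQP), keep S unchanged and extend Q to s(r+1) s(r+2).
• Continue until Q ∉ V(SQP), that is, Q = s(r+1) s(r+2) … s(r+i) is not a substring of s1 s2 … sr s(r+1) … s(r+i−1); then increase c(n) by 1.
• Update S = s1 s2 … sr s(r+1) … s(r+i) and Q = s(r+i+1).
• Repeat until Q takes in the final letter.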
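The procedure above can be sketched directly with substring tests. This is a plain, unoptimized version (each test rescans the prefix), and it assumes that a final Q which is still reproducible when the sequence ends counts as one more component. The function name is illustrative.

```python
def lz76_complexity(s: str) -> int:
    """Number of components c(n) in the Lempel-Ziv (1976) parsing of s."""
    n = len(s)
    if n == 0:
        return 0
    c = 1  # the first letter is always a new component
    r = 1  # S = s[:r]
    while r < n:
        i = 1  # Q = s[r:r + i]
        # Extend Q while it can be copied from SQP = s[:r + i - 1].
        while r + i <= n and s[r:r + i] in s[:r + i - 1]:
            i += 1
        c += 1   # Q is a new component (or the sequence ended)
        r += i   # S absorbs Q
    return c


print(lz76_complexity("10101010"))
```

Because Q is searched in SQP rather than in S alone, a copy may overlap the fragment being produced, which is what lets a periodic sequence be covered by a single long component.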
![Page 17: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/17.jpg)
b(n) is the complexity value of a random sequence as n → ∞:
b(n) = n / log2 n
Thus, C_LZ = c(n)/b(n)
the complexity of a random sequence ----> 1; the complexity of an ordered sequence ----> 0
The smaller the complexity, the more slowly new patterns appear ===> the data change regularly and are close to periodic.
![Page 18: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/18.jpg)
The calculation of c(n) (Lempel-Ziv complexity)
Lempel-Ziv complexity (1976), for the sequence S = 10101010:
1) S = s1 = 1, Q = s2 = 0, SQ = 10, SQP = 1, Q ∉ V(SQP), Q is an insertion, SQ = 1●0
2) S = s1s2 = 10, Q = s3 = 1, SQ = 101, SQP = 10, Q ∈ V(SQP), Q is a duplication, SQ = 1●0●1
3) S = s1s2 = 10, Q = s3s4 = 10, SQ = 1010, SQP = 101, Q ∈ V(SQP), Q is a duplication, SQ = 1●0●10
• Repeating 2) and 3), Q remains a duplication up to the last letter: S = 1●0●101010, so c(n) = 3.
• b(8) = 8/log2 8 = 8/3 ≈ 2.67, so for n = 8 the ratio c(8)/b(8) = 1.125 is not yet informative: the normalization is asymptotic. Extending the same pattern leaves c(n) = 3 while b(n) keeps growing, so C_LZ → 0.
• Thus, the complexity of this sequence is low because it is periodic.
![Page 19: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/19.jpg)
Other estimates of text complexity
The complexity of a text region, CWF, as evaluated by Wootton and Federhen (7), is given by the formula
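The formula itself did not survive in this transcript. The compositional complexity of Wootton and Federhen is usually written as C_WF = (1/N) log_K (N! / ∏ n_i!), where N is the window length, K the alphabet size and n_i the count of the i-th letter. Assuming that form, a small sketch (the function name is hypothetical):

```python
import math
from collections import Counter


def wootton_federhen_complexity(window: str, alphabet_size: int = 4) -> float:
    """C_WF = (1/N) * log_K( N! / prod(n_i!) ) for a window of length N."""
    n = len(window)
    # lgamma(m + 1) = ln(m!), which avoids building huge factorials.
    log_multinomial = math.lgamma(n + 1) - sum(
        math.lgamma(count + 1) for count in Counter(window).values()
    )
    return log_multinomial / (n * math.log(alphabet_size))
```

A homopolymer window gives 0, and the value grows as the composition of the window becomes more even.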
![Page 20: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/20.jpg)
Linguistic complexity can also be defined as the ratio of the total number of distinct words occurring in the analyzed sequence to the maximum possible number of such words (12):
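A sketch of this ratio, assuming the sums run over every word length k from 1 to N and that the maximum possible number of distinct k-letter words in a sequence of length N is min(K^k, N − k + 1). The exact range used on the slide is not recoverable from the transcript.

```python
def linguistic_complexity(seq: str, alphabet_size: int = 4) -> float:
    """Distinct words observed, divided by the maximum possible, over all word lengths."""
    n = len(seq)
    observed = 0
    possible = 0
    for k in range(1, n + 1):
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        possible += min(alphabet_size**k, n - k + 1)
    return observed / possible
```

This direct version stores every substring and is meant for short windows only.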
![Page 21: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/21.jpg)
![Page 22: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/22.jpg)
Implementation and Results
Calculation mode in a sliding window: (i) a single extended sequence; (ii) a group of relatively short sequences up to 1 kb in length.
A table of complexity values is constructed for a window of specified size N sliding along the sequence. The sequence complexity is assigned to the window center. The calculation mode in a sliding window (complexity profile) is demonstrated here using the example of the Borrelia burgdorferi genome. In Figure 2, complexity profiles for a window sliding along the sequence are illustrated.
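The sliding-window mode can be sketched as a function that takes any complexity measure and reports one value per window center. The entropy of the letter composition is used here only as a simple stand-in measure; it is not the measure used for the profiles in the slides, and all names are illustrative.

```python
import math
from collections import Counter
from typing import Callable


def composition_entropy(window: str) -> float:
    """Shannon entropy (bits) of the window's letter composition; a stand-in measure."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def complexity_profile(seq: str, window_size: int,
                       measure: Callable[[str], float] = composition_entropy
                       ) -> list[tuple[int, float]]:
    """(center position, complexity) for each window of length window_size."""
    half = window_size // 2
    return [(start + half, measure(seq[start:start + window_size]))
            for start in range(len(seq) - window_size + 1)]


profile = complexity_profile("AAAAACGT", window_size=4)
```

Any other measure with the same signature, for example a Lempel-Ziv component count, can be passed as `measure`.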
![Page 23: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/23.jpg)
![Page 24: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/24.jpg)
![Page 26: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/26.jpg)
Quantum Kolmogorov Complexity
Are quantum computers more powerful than classical computers?
Quantum Entanglement
Quantum Factorization of Integers
Quantum computers can solve some problems faster than classical computers (→ Shor’s factoring algorithm).
![Page 27: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/27.jpg)
86782348943904553258203876589276488467282764884783575788579901017459395793602387575786897646492020929237475675203798980000847736223445526263778374774774764657586879989889999531190642287653930057686950486950384756567438556574648876589005088573342257947602867756958696986758959511122344756900768768779957500472667899533045786777487657783691190875682046392930000272645583936857939487456884763747949631611893900540958687034763637485696997576535644578499596997665561098443348899046881020480568572231018209586704589944806808908069887677575969061234894390498988999953119064799010174593957936
>> 2^(129/2) = 2.6088e+019
>> 2^125 = 4.2535e+037
>> 2^200 = 1.6069e+060
![Page 28: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/28.jpg)
Prime factorization of large number
Principle of Current Cryptography
In 1994, 1,600 high-speed workstations obtained the prime factors of a number with L = 129 in about 8 months. For L = 250, the same approach would take about 800,000 years.
Factorization of Integers
• The number N has an approximate length of L bits (0 to 2^L − 1).
• If N is composite, it has a factor in the range (1, √N].
• Trying each number in this range to find a factor of N takes up to about √N steps: S ~ √N = 2^(L/2).
• But for Shor’s quantum computation, S ~ poly(log(N)).
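The brute-force search described on this slide is ordinary trial division. A small sketch that also counts the candidates tried, so that the growth with √N can be seen (the function name is hypothetical):

```python
import math


def trial_division(n: int) -> tuple[int, int]:
    """Smallest factor of n greater than 1 and the number of candidates tried."""
    steps = 0
    for candidate in range(2, math.isqrt(n) + 1):
        steps += 1
        if n % candidate == 0:
            return candidate, steps
    return n, steps  # no factor up to sqrt(n): n is prime


print(trial_division(91), trial_division(97))
```

For a prime input every candidate up to √N is tried, which is the worst case.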
![Page 29: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/29.jpg)
The Factoring Firestorm
188198812920607963838697239461650439807163563379417382700763356422988859715234665485319060606504743045317388011303396716199692321205734031879550656996221305168759307650257059
= 472772146107435302536223071973048224632914695302097116459852171130520711256363590397527
× 398075086424064937397125500550386491199064362342526708406385189575946388957261768583317
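Best known classical algorithm takes time that grows faster than any polynomial in the number of digits of N
Shor’s quantum algorithm takes time polynomial in the number of digits of N
An efficient algorithm for factoring breaks the RSA public key cryptosystem
Peter Shor, 1994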
![Page 30: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/30.jpg)
Qubit
• Pure state of a qubit: |ψ> = cos(θ/2)|0> + e^(iφ) sin(θ/2)|1>, with orthogonal state |ψ⊥> = cos(θ/2)|1> − e^(−iφ) sin(θ/2)|0>
• Basis: |0>, |1>
• Superposition of the states |0> and |1>
![Page 31: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/31.jpg)
Qubit:
• The element carrying information: the quantum state
• |0>, |1> and any linear combination (superposition) c1|0> + c2|1>
Definition (Qubit Strings)
A qubit string σ is a state vector or density operator
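A pure qubit state c1|0> + c2|1> can be held as a pair of complex amplitudes. The sketch below assumes the usual two-angle parametrization, c1 = cos(θ/2) and c2 = e^(iφ) sin(θ/2), and shows that measurement in the |0>, |1> basis only reveals |c1|² and |c2|². The function names are illustrative.

```python
import cmath
import math


def qubit_state(theta: float, phi: float) -> tuple[complex, complex]:
    """Amplitudes (c1, c2) of cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>."""
    return complex(math.cos(theta / 2)), cmath.exp(1j * phi) * math.sin(theta / 2)


def measurement_probabilities(state: tuple[complex, complex]) -> tuple[float, float]:
    """Probabilities of reading 0 and 1 when the qubit is measured in the |0>, |1> basis."""
    c1, c2 = state
    return abs(c1) ** 2, abs(c2) ** 2


p0, p1 = measurement_probabilities(qubit_state(math.pi / 3, math.pi / 4))
```

The phase φ does not affect these two probabilities, although it matters once qubits interfere.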
![Page 32: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/32.jpg)
A quantum computer can perform 2^n operations at the same time due to superposition:
F[000] F[001] F[010] . . F[111]
However, we get only one answer when we measure the result:
Only one answer F[a,b,c]
![Page 33: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/33.jpg)
The Discrete Fourier Transform
• Assume L qubits hold any number x, from 0 to 2^L − 1
• Any number x can be expressed as the state
• |x> = |x_(L−1) x_(L−2) … x_1 x_0> = |x_(L−1)> ⊗ |x_(L−2)> ⊗ … ⊗ |x_1> ⊗ |x_0>
• where x = Σ_j x_j 2^j and ⊗ is a tensor product
• A_j acts only on the qubit represented by the j-th atom
• The operator |i_j><k_j| on the state |n_j>:
• |i_j><k_j| |n_j> = δ_(kn) |i_j>
• A_j|0_j> = (1/√2)(|0_j> + |1_j>),  A_j|1_j> = (1/√2)(|0_j> − |1_j>)
• B_jk|0_jk> = |0_jk>,  B_jk|1_jk> = |1_jk>,  B_jk|2_jk> = |2_jk>
• B_jk|3_jk> = exp(iπ/2^(k−j)) |3_jk>
![Page 34: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/34.jpg)
A0 B01 B02 A1 B12 A2 |x> = (1/√8) {(|0> + |4>) − (|2> + |6>) + i(|1> + |5>) − i(|3> + |7>)}
==> A0 B01 B02 A1 B12 A2 performs a discrete Fourier transform!
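The amplitudes of a discrete Fourier transform of a basis state can be checked classically. The sketch assumes the convention |x> → 2^(−L/2) Σ_c exp(2πi·x·c/2^L) |c>. Under that convention the sign pattern on the slide corresponds to x = 2 with L = 3; this identification is an inference and is not stated in the slides. The code evaluates the formula directly and does not simulate the A and B gates.

```python
import cmath
import math


def dft_amplitudes(x: int, num_qubits: int) -> list[complex]:
    """Amplitude of each basis state |c> in the transform of |x>."""
    q = 2 ** num_qubits
    return [cmath.exp(2j * cmath.pi * x * c / q) / math.sqrt(q) for c in range(q)]


amplitudes = dft_amplitudes(2, 3)
for c, amp in enumerate(amplitudes):
    print(c, complex(round(amp.real, 4), round(amp.imag, 4)))
```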
![Page 35: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/35.jpg)
![Page 36: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/36.jpg)
Shor’s algorithm
[1] Quantum Fourier Transform ==> finding the period of a periodic function
[2] Use the period to form two candidate numbers, then find the greatest common divisor of each with N
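The formulas on this slide were lost in extraction. In the standard formulation of Shor's algorithm the periodic function is f(a) = x^a mod N with period r, and the candidate factors are gcd(x^(r/2) − 1, N) and gcd(x^(r/2) + 1, N). Assuming that formulation, the classical part can be sketched as follows; the period is found here by brute force, which is exactly the step the quantum Fourier transform is meant to speed up.

```python
import math


def find_period(x: int, n: int) -> int:
    """Smallest r > 0 with x**r % n == 1, by brute force (x must be coprime to n)."""
    r, value = 1, x % n
    while value != 1:
        value = (value * x) % n
        r += 1
    return r


def factors_from_period(x: int, n: int) -> tuple[int, int]:
    """gcd(x^(r/2) - 1, n) and gcd(x^(r/2) + 1, n) for an even period r.

    The results can be trivial (1 or n); in that case another x has to be tried.
    """
    r = find_period(x, n)
    if r % 2:
        raise ValueError("odd period: pick another x")
    half_power = pow(x, r // 2, n)
    return math.gcd(half_power - 1, n), math.gcd(half_power + 1, n)


print(find_period(7, 15), factors_from_period(7, 15))
```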
![Page 37: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/37.jpg)
N-gate
![Page 38: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/38.jpg)
Quantum computers U: input qubit string σ → output qubit string U(σ)
• Definition (Quantum Kolmogorov Complexity): Let U be a universal quantum computer and δ > 0. Then, for every qubit string ρ, define QC^δ(ρ) = min{ℓ(σ) | ||ρ − U(σ)||Tr ≤ δ}
• To measure the difference between two qubit strings, it is natural to use the trace distance, which is defined as ||ρ − σ||Tr := (1/2) Tr|ρ − σ|
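For a single qubit the trace distance can be computed by hand, because the difference of two density operators is a 2×2 Hermitian matrix whose eigenvalues have a closed form. A minimal sketch restricted to that 2×2 case (the type alias and function name are illustrative):

```python
import math

Matrix2 = tuple[tuple[complex, complex], tuple[complex, complex]]


def trace_distance(rho: Matrix2, sigma: Matrix2) -> float:
    """(1/2) Tr|rho - sigma| for 2x2 Hermitian matrices, via the eigenvalues of the difference."""
    a = (rho[0][0] - sigma[0][0]).real
    d = (rho[1][1] - sigma[1][1]).real
    b = rho[0][1] - sigma[0][1]
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, abs(b))  # eigenvalues are mean ± radius
    return 0.5 * (abs(mean + radius) + abs(mean - radius))


zero = ((1, 0), (0, 0))          # |0><0|
one = ((0, 0), (0, 1))           # |1><1|
plus = ((0.5, 0.5), (0.5, 0.5))  # |+><+| with |+> = (|0> + |1>)/sqrt(2)

print(trace_distance(zero, one), trace_distance(zero, plus))
```

Orthogonal states are at distance 1, identical states at distance 0, so the tolerance δ in QC^δ says how far the computer's output may be from the target string.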
![Page 39: Kolmogorov Complexity for analysis of DNA sequence](https://reader030.vdocuments.net/reader030/viewer/2022032612/5681339d550346895d9ab04b/html5/thumbnails/39.jpg)
References:
• Y. L. Orlov and V. N. Potapov, Complexity: an internet resource for analysis of DNA sequence complexity, Nucleic Acids Research, 2004, Vol. 32
• Fabio Benatti, Tyll Krüger, Markus Müller, Rainer Siegmund-Schultze, Arleta Szkoła. Entropy and Quantum Kolmogorov Complexity: A Quantum Brudno’s Theorem, Communications in Mathematical Physics 265, 437–461 (2006)