CpG Island Identification with Hidden Markov Models
TRANSCRIPT
CpG Island Identification with Hidden Markov Models
- Kshitij Tayal
1
CpG Island
• A region of the genome with a higher frequency of CpG sites than the rest of the genome.
• Formal definition: a CpG island is a region at least 200 bp long with a GC percentage greater than 50%.
• CpG is shorthand for “C-phosphate-G”, that is, a cytosine and a guanine separated by only one phosphate.
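The two criteria in the formal definition can be checked mechanically. The following is a minimal Python sketch, not a full island finder: the function names (`gc_fraction`, `cpg_count`, `meets_island_criteria`) are hypothetical, and only the length and GC-percentage thresholds stated above are applied.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_count(seq: str) -> int:
    """Count CpG sites, i.e. a C immediately followed by a G."""
    # "CG" cannot overlap with itself, so str.count gives the exact number.
    return seq.upper().count("CG")


def meets_island_criteria(seq: str, min_len: int = 200, min_gc: float = 0.5) -> bool:
    """Apply the stated criteria: at least min_len bp and GC fraction above min_gc."""
    return len(seq) >= min_len and gc_fraction(seq) > min_gc


if __name__ == "__main__":
    candidate = "CG" * 100  # toy 200 bp sequence, for illustration only
    print(len(candidate), gc_fraction(candidate), cpg_count(candidate))
    print(meets_island_criteria(candidate))
```

A real scan would slide a window along the genome and apply such a check to each window; that step is left out here.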
2
The human genome is ~3 billion characters long. How do we find a gene?
3
Importance of CpG Islands
• A CpG island acts as a proxy for identifying a gene.
• CpG islands often occur at the start of a gene.
• Cytosines in CpG dinucleotides can be methylated (have a methyl group attached) to form 5-methylcytosine.
4
5
Importance of Methylation
• Our body consists of trillions of cells. Every cell contains the same copy of DNA with the same genetic blueprint, so how do cells decide which function each one has to perform?
• How does a heart cell know it is a heart cell?
• How does a skin cell know it is a skin cell?
• They need outside instructions, which come from small carbon-hydrogen groups called methyl groups.
• Methylation also helps explain how characteristics can change across generations without changes to the DNA sequence itself.
6
Epigenetics & CpG Islands
• The literal meaning of epigenetics is ‘above genetics’. Epigenetic mechanisms determine the methylation of CpG islands.
• CpG islands regulate expression of nearby genes.
• Proteins involved in gene expression can be repelled or attracted by the methyl group.
7
Background: Epigenetics
• Environmental factors such as what we do, what we eat, what we smoke and how stressed we are influence where methyl groups bind.
• A bad diet can lead methyl groups to bind in the wrong place; with these bad instructions, cells can become abnormal and diseased.
• Epigenetics is also controlled by histones. Histones are proteins that act as spools around which DNA winds itself. Histones can change how tightly or loosely the DNA is wound around them.
• If the DNA is wound loosely, the gene is expressed more.
• If the DNA is wound tightly, the gene is expressed less.
8
9
Background: Epigenetics
• So the methyl group is more like a ‘switch’ and histones are more like a ‘knob’.
• Every cell of your body has a distinct methylation and histone pattern that gives the cell its marching orders.
• DNA can be thought of as the body's ‘hardware’, and the epigenome is more like software that tells the hardware what work it has to do, which fits its literal meaning of ‘above genetics’.
10
Now Some Computer Science...
• Task: design a method that, given a candidate string (k-mer), scores it according to how confident we are that it came from a CpG island.
• Approach: apply a sequence model, a probabilistic model that associates probabilities with sequences.
11
Sequence Models
• Sequence models learn from examples.
• Say we have sampled 100K 5-mers from inside CpG islands and 100K 5-mers from outside.
• Can we guess whether CGCGC came from a CpG island?
• P(inside) = 315/(315 + 12) ≈ 0.96
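The count-based estimate can be written in a few lines. This is a minimal sketch that uses the example counts for CGCGC (315 inside, 12 outside); the function name `estimate_p_inside` and the use of `Counter` tables are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def estimate_p_inside(kmer, inside_counts, outside_counts):
    """Estimate P(inside | kmer) as inside count / (inside count + outside count).

    Returns None when the k-mer never appears in the training data,
    because the ratio is undefined in that case.
    """
    n_in = inside_counts[kmer]    # Counter returns 0 for unseen k-mers
    n_out = outside_counts[kmer]
    total = n_in + n_out
    if total == 0:
        return None
    return n_in / total


# Counts for CGCGC taken from the example; a real table would hold every sampled 5-mer.
inside_counts = Counter({"CGCGC": 315})
outside_counts = Counter({"CGCGC": 12})

p_inside = estimate_p_inside("CGCGC", inside_counts, outside_counts)
print(round(p_inside, 3))
```

The `None` case is the weakness of pure counting: a k-mer that was never sampled gets no estimate at all.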
12
Number of times CGCGC appears inside CpG islands: 315
Number of times CGCGC appears outside CpG islands: 12
Sequence Models
• To estimate p(x), we count the number of times x appears in the training set labelled INSIDE and divide by the total number of times x appears in the training set.
• But for sufficiently large k, we might see no occurrences of x, or very few. To overcome this limitation we model the joint probability distribution over the individual characters of the sequence.
• P(X) = P(X_k, X_{k-1}, ..., X_1), where P(X) is the probability of the sequence X.
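One standard way to make this joint probability tractable is sketched below. The first two lines are the chain rule of probability, which is exact. The last line is an assumption: it applies the first-order Markov property, under which each character depends only on the one immediately before it.

```latex
\begin{align}
P(X) &= P(X_k, X_{k-1}, \ldots, X_1) \\
     &= P(X_k \mid X_{k-1}, \ldots, X_1)\, P(X_{k-1} \mid X_{k-2}, \ldots, X_1) \cdots P(X_2 \mid X_1)\, P(X_1) \\
     &\approx P(X_1) \prod_{i=2}^{k} P(X_i \mid X_{i-1})
\end{align}
```

Under this assumption only the pairwise conditional probabilities P(X_i | X_{i-1}) need to be estimated, and these can be counted reliably even when the full k-mer is rare.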
13
14
15
16
• P(X) now equals the product of all the Markov chain edge weights along our string-driven walk through the chain.
• Node labels are symbols and transition labels are conditional probabilities.
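The walk through the chain can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the toy training strings are invented, the add-one pseudocount and the uniform initial probability of 0.25 are arbitrary choices, and the function names are hypothetical. Scoring a string as the log-ratio of its probability under an "inside" chain and an "outside" chain is one common way to use such a model.

```python
import math

ALPHABET = "ACGT"


def train_transitions(seqs, pseudocount=1.0):
    """Estimate P(next | prev) from adjacent character pairs, with smoothing."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_prob(seq, trans, initial=0.25):
    """Log of P(first symbol) times the product of edge weights along the walk."""
    lp = math.log(initial)
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(trans[prev][cur])
    return lp


def log_odds(seq, inside, outside):
    """Positive when seq is more probable under the inside chain."""
    return log_prob(seq, inside) - log_prob(seq, outside)


# Hypothetical toy training data, far too small for real use.
inside_train = ["CGCGCGGCGC", "GCGCGCCGCG", "CGGCGCGCGA"]
outside_train = ["ATTATGCATA", "TTACATGTAA", "AATGCTTATT"]

inside = train_transitions(inside_train)
outside = train_transitions(outside_train)

for kmer in ("CGCGC", "ATTAT"):
    print(kmer, log_odds(kmer, inside, outside))
```

Working in log space turns the product of edge weights into a sum, which avoids numerical underflow on long strings.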
17
18
19
20
Hidden Markov Model
• In simpler Markov models (like a Markov chain), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters.
• In a hidden Markov model, the state is not directly visible, but the output, dependent on the state, is visible. Each state has a probability distribution over the possible output tokens. The adjective 'hidden' refers to the state sequence through which the model passes.
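A hidden Markov model can be written down as three tables: initial, transition and emission probabilities. The sketch below uses the dealer-with-a-loaded-coin example that the Viterbi slide refers to. All the numbers are made-up assumptions, and `joint_probability` is a hypothetical helper showing how a known state path and the observed flips combine into one probability.

```python
# Hidden states: F = fair coin, L = loaded coin. Observations: H or T.
# Every probability below is an illustrative assumption, not a fitted value.
STATES = ("F", "L")
INITIAL = {"F": 0.5, "L": 0.5}
TRANSITION = {
    "F": {"F": 0.9, "L": 0.1},
    "L": {"F": 0.1, "L": 0.9},
}
EMISSION = {
    "F": {"H": 0.5, "T": 0.5},
    "L": {"H": 0.8, "T": 0.2},
}


def joint_probability(path, flips):
    """Return P(path, flips): the probability of a state path and its emissions."""
    if len(path) != len(flips) or not path:
        raise ValueError("path and flips must be non-empty and of equal length")
    p = INITIAL[path[0]] * EMISSION[path[0]][flips[0]]
    for i in range(1, len(path)):
        p *= TRANSITION[path[i - 1]][path[i]] * EMISSION[path[i]][flips[i]]
    return p


if __name__ == "__main__":
    # Same flips, two candidate explanations of which coin was in use.
    print(joint_probability("FF", "HH"))
    print(joint_probability("LL", "HH"))
```

In practice the path is not known, which is exactly why the state sequence is called hidden. The observer sees only the flips.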
21
22
23
24
25
26
27
28
Hidden Markov Model - Viterbi Algorithm
• Given the flips, can we say when the dealer was using the loaded coin?
• We want to find p*, the most likely path given the emissions.
• The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events.
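The following is a minimal sketch of the Viterbi algorithm for the fair/loaded coin setting. The model parameters are illustrative assumptions, not values from the slides, and the function name `viterbi` is hypothetical. Scores are kept in log space to avoid underflow on long sequences of flips.

```python
import math

# F = fair coin, L = loaded coin. All probabilities are made-up assumptions.
STATES = ("F", "L")
INITIAL = {"F": 0.5, "L": 0.5}
TRANSITION = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMISSION = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.8, "T": 0.2}}


def viterbi(flips):
    """Return the most likely hidden state path for a string of H/T flips."""
    if not flips:
        return ""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(INITIAL[s]) + math.log(EMISSION[s][flips[0]])
             for s in STATES}
    backpointers = []
    for obs in flips[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor state that maximizes the score of reaching s.
            best_prev = max(STATES,
                            key=lambda r: score[r] + math.log(TRANSITION[r][s]))
            new_score[s] = (score[best_prev]
                            + math.log(TRANSITION[best_prev][s])
                            + math.log(EMISSION[s][obs]))
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state to recover the full path.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path)


if __name__ == "__main__":
    flips = "TTTTTTHHHHHHHHHH"
    print(flips)
    print(viterbi(flips))
```

The running time is proportional to the number of flips times the square of the number of states, because each step considers every pair of previous and current state.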
29
30
31
32
33
34
35
36
37
38
39
Hidden Markov Model
40
Hidden Markov Model
41
42
EMISSIONS
43
44
Hidden Markov Model
45
46
THANK YOU
47