
8/29/06 1

Temporal Chaos Game Representation (TCGR) for DNA/RNA Sequence Visualization

Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, and Yu Meng

Southern Methodist University

Dallas, Texas 75275

mhd@engr.smu.edu

Outline

• Introduction
  • CGR/FCGR
  • miRNA
  • Motivation
  • Research Objective
• TCGR
• EMM
• miRNA Prediction using TCGR/EMM
• Conclusion/Future Work


Chaos Game Representation (CGR)

Scatter plot showing occurrence of patterns of nucleotides.

University of the Basque Country http://insilico.ehu.es/genomics/my_words/

Chaos Game Representation (CGR)

2D technique to visually see the distribution of subpatterns

Our technique is based on the following:

• Generate totals for each subpattern
• Scale totals to a [0,1] range. (Note: scaling can be a problem.)
• Convert range to red/blue
  • 0-0.5: White to Blue
  • 0.5-1: Blue to Red

[Figure: FCGR grid; each cell is labeled with its nucleotides A, C, G, U]
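The three steps listed for the coloring (totals, scaling, color conversion) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, scaling by the largest total is an assumption (the slide only notes that scaling can be a problem), and the linear RGB interpolation is likewise assumed.

```python
from collections import Counter
from itertools import product


def subpattern_totals(seq, k, alphabet="ACGU"):
    """Count every overlapping length-k subpattern over the given alphabet."""
    seq = seq.upper()
    seen = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    patterns = ("".join(p) for p in product(alphabet, repeat=k))
    return {p: seen.get(p, 0) for p in patterns}


def scale_totals(totals):
    """Scale totals to [0, 1] by dividing by the largest total (assumed rule)."""
    top = max(totals.values())
    return {p: (c / top if top else 0.0) for p, c in totals.items()}


def to_rgb(v):
    """Map 0-0.5 from white to blue and 0.5-1 from blue to red (assumed linear)."""
    if v <= 0.5:
        t = v / 0.5
        fade = round(255 * (1 - t))
        return (fade, fade, 255)
    t = (v - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))
```

Each scaled value would then color the cell of its subpattern in the FCGR grid.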

FCGR

a) Nucleotides

A C
G T

b) Dinucleotides

AA AC CA CC
AG AT CG CT
GA GC TA TC
GG GT TG TT

c) Trinucleotides

AAA AAC ACA ACC CAA CAC CCA CCC
AAG AAT ACG ACT CAG CAT CCG CCT
AGA AGC ATA ATC CGA CGC CTA CTC
AGG AGT ATG ATT CGG CGT CTG CTT
GAA GAC GCA GCC TAA TAC TCA TCC
GAG GAT GCG GCT TAG TAT TCG TCT
GGA GGC GTA GTC TGA TGC TTA TTC
GGG GGT GTG GTT TGG TGT TTG TTT
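The cell position of each pattern in these grids follows a recursive rule: the first letter picks a quadrant, the next letter picks a quadrant inside it, and so on. A small sketch of that rule, inferred from the dinucleotide grid on the slide (the names `OFFSETS`, `cell`, and `layout` are illustrative only):

```python
from itertools import product

# Row/column offset of each nucleotide inside a 2x2 block, as in panel a):
#   A C
#   G T
OFFSETS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def cell(pattern):
    """Return (row, col) of a pattern in the 2^k x 2^k FCGR grid."""
    row = col = 0
    for ch in pattern:
        r, c = OFFSETS[ch]
        row = 2 * row + r  # earlier letters select coarser quadrants
        col = 2 * col + c
    return row, col


def layout(k):
    """Build the full grid of length-k patterns."""
    n = 2 ** k
    grid = [[None] * n for _ in range(n)]
    for p in product("ACGT", repeat=k):
        s = "".join(p)
        r, c = cell(s)
        grid[r][c] = s
    return grid


for line in layout(2):
    print(" ".join(line))
```

For k = 2 this prints the same four rows as the dinucleotide panel.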

FCGR Example

Homo sapiens, all mature miRNA

Patterns of length 3

[Figure: FCGR image; labeled patterns: UUC, GUG]

miRNA

• Short (20-25 nt) sequence of noncoding RNA
• Single strand
• Previously assumed to be garbage
• Impact/prevent translation of mRNA
• Conserved across species (sometimes)
• Reduce protein levels without impacting mRNA levels
• Bind to target areas in mRNA; the problem is that this binding is not perfect (particularly in animals)
• mRNA may have multiple (nonoverlapping) binding sites for one miRNA

miRNA Functions

• Causes some cancers
• Embryo development
• Cell differentiation
• Cell death
• Prevents the production of a protein that causes lung cancer
• Control brain development in zebrafish
• Associated with HIV

miRNA Research Issues

• Predict/find miRNA
• Predict miRNA targets
• Identify miRNA functions
• Identify how miRNAs work

Motivation

2000 bp Flanking Upstream Region of mir-258.2 in C. elegans

a) All 2000 bp b) First 240 bp c) Last 240 bp

Research Objectives

• Identify, develop, and implement algorithms which can be used for identifying potential miRNA functions.
• Create an online tool which can be used by other researchers to apply our algorithms to new data.


Temporal CGR (TCGR)

• Temporal version of Frequency CGR
  • In our context, temporal means the starting location of a window
• 2D array
  • Each row represents counts for a particular window in the sequence
    • First row: first window
    • Last row: last window
    • We start successive windows at the next character location
  • Each column represents the counts for the associated pattern in that window
    • Initially we have assumed the order of patterns is alphabetic
• Size of TCGR depends on sequence length and subpattern length
• As sequence lengths vary, we only examine complete windows
• We only count patterns completely contained in each window.
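The size statement can be made concrete. As a sketch, with symbols introduced here for illustration (the slides do not name them): n is the sequence length, w the window length, and k the subpattern length over a 4-letter alphabet.

```latex
\[
\text{rows} = n - w + 1, \qquad
\text{columns} = 4^{k}, \qquad
\text{patterns counted per window} = w - k + 1
\]
```

For the 43-character example sequence with w = 9 and k = 1, this gives 35 rows (Pos 0-8 through Pos 34-42), 4 columns, and 9 counts per row, which agrees with the first example row (2 + 3 + 3 + 1 = 9).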

TCGR Example

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window (counts)

Window       A     C     G     T
Pos 0-8      2     3     3     1
Pos 1-9      1     3     3     2
…
Pos 34-42    2     4     2     1

Moving Window (scaled)

Window       A     C     G     T
Pos 0-8      0.4   0.6   0.6   0.2
Pos 1-9      0.2   0.6   0.6   0.4
…
Pos 34-42    0.4   0.8   0.4   0.2
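The example tables can be reproduced with a short sliding-window sketch. The function names `tcgr` and `scale` are hypothetical. The scaling rule is an assumption: the slide does not state it, but the scaled values shown are consistent with dividing every count by the largest count in the whole array (5 for this sequence).

```python
from itertools import product


def tcgr(seq, window, k, alphabet="ACGT"):
    """Return (patterns, rows): one row of pattern counts per complete window."""
    seq = seq.upper()
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]  # alphabetic order
    index = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character; only complete windows are used.
    for start in range(len(seq) - window + 1):
        w = seq[start:start + window]
        row = [0] * len(patterns)
        # Only patterns fully contained in the window are counted.
        for i in range(window - k + 1):
            j = index.get(w[i:i + k])
            if j is not None:
                row[j] += 1
        rows.append(row)
    return patterns, rows


def scale(rows):
    """Divide by the largest count in the array (assumed scaling rule)."""
    top = max(max(r) for r in rows) or 1
    return [[c / top for c in r] for r in rows]


seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
patterns, counts = tcgr(seq, window=9, k=1)
print(patterns)           # ['A', 'C', 'G', 'T']
print(counts[0])          # [2, 3, 3, 1], the Pos 0-8 row
print(scale(counts)[0])   # [0.4, 0.6, 0.6, 0.2]
```

Passing k=2 or k=3 yields the 16- and 64-column arrays for longer subpatterns.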


TCGR Example (cont’d)

TCGRs for Sub-patterns of length 1, 2, and 3

TCGR Example (cont’d)

Window 0: Pos 0-8: acgtgcacg
Window 1: Pos 1-9: cgtgcacgt
Window 17: Pos 17-25: tccggaacc
Window 18: Pos 18-26: ccggaacca
Window 34: Pos 34-42: ccacgtcga

[Figure: TCGR image with columns A, C, G, T]

TCGR – Viruses miRNA (Window=9; Pattern=1;2;3)

[Figure: TCGR images for Epstein Barr, Human Cytomegalovirus, Kaposi sarc Herpesvirus, and Mouse Gammaherpesvirus, shown for Pattern=1, Pattern=2, and Pattern=3]

TCGR – Mature miRNA (Window=5; Pattern=3)

[Figure: TCGR images for All Mature, Mus musculus, Homo sapiens, and C. elegans; labeled patterns: ACG, CGC, GCG, UCG]


EMM Overview

• Time Varying Discrete First Order Markov Model
• Nodes are clusters of real world states.
• Learning continues during prediction phase.
• Learning:
  • Transition probabilities between nodes
  • Node labels (centroid of cluster)
  • Nodes are added and removed as data arrives

EMM Definition

Extensible Markov Model (EMM): at any time t, EMM consists of an MC with designated current node, Nn, and algorithms to modify it, where algorithms include:

• EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
• EMMIncrement algorithm, which updates MC at time t + 1 given the MC at time t and clustering measure result at time t + 1.
• EMMDecrement algorithm, which removes nodes from the EMM when needed.

EMM Cluster

• Find closest node to incoming event.
• If none “close”, create new node
• Labeling of cluster is centroid of members in cluster
• O(n)
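A nearest-node matching step of this kind might look like the sketch below. It is an illustration under stated assumptions, not the EMMCluster algorithm itself: Euclidean distance, a fixed `threshold` for "close", and the dictionary layout of a node are all choices made here, and `emm_cluster` is a hypothetical name.

```python
import math


def emm_cluster(nodes, event, threshold):
    """Match an event to its closest node, or create a node if none is close.

    nodes: list of {"centroid": [...], "size": int}; modified in place.
    Returns the index of the matched (or newly created) node.
    """
    best, best_d = None, math.inf
    for i, node in enumerate(nodes):  # linear scan over n existing nodes: O(n)
        d = math.dist(node["centroid"], event)
        if d < best_d:
            best, best_d = i, d

    if best is None or best_d > threshold:
        # Nothing "close": the event starts a new cluster.
        nodes.append({"centroid": list(event), "size": 1})
        return len(nodes) - 1

    # Update the label so it stays the centroid of all members.
    node = nodes[best]
    node["size"] += 1
    n = node["size"]
    node["centroid"] = [c + (x - c) / n for c, x in zip(node["centroid"], event)]
    return best
```

The incremental centroid update avoids storing every member of a cluster, which matters when learning continues during prediction.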

EMM Increment

Input vectors:

<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>

[Figure: EMM graphs over nodes N1, N2, and N3, with transition probabilities (1/1, 1/2, 1/3, 2/3, 2/2, 1) labeling the edges]
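The fractional edge labels suggest transition probabilities kept as ratios of counts. A minimal sketch of such bookkeeping, assuming each probability is the number of observed transitions from one node to another divided by the number of transitions leaving the source node (the class and method names are hypothetical, and the node sequence in the usage lines is made up rather than taken from the figure):

```python
from collections import defaultdict


class TransitionCounts:
    """Count-based transition probabilities between node labels."""

    def __init__(self):
        self.leaves = defaultdict(int)  # node -> number of transitions out of it
        self.links = defaultdict(int)   # (src, dst) -> number of such transitions
        self.current = None             # designated current node

    def increment(self, node):
        """Record a move from the current node to `node`."""
        if self.current is not None:
            self.leaves[self.current] += 1
            self.links[(self.current, node)] += 1
        self.current = node

    def probability(self, src, dst):
        total = self.leaves.get(src, 0)
        return self.links.get((src, dst), 0) / total if total else 0.0


emm = TransitionCounts()
for label in ["N1", "N1", "N2", "N1"]:  # illustrative node sequence
    emm.increment(label)
print(emm.probability("N1", "N2"))  # 0.5
```

Storing counts rather than probabilities means each new event needs only two integer updates.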

Research Objectives

• Identify, develop, and implement algorithms which can be used for identifying potential miRNA functions.
• Create an online tool which can be used by other researchers to apply our algorithms to new data.

Our approach:

1. Represent potential miRNA sequence with TCGR sequence of count vectors
2. Create EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets)
3. Predict unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on normalized product of transition probabilities along clustering path in EMM
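Step 3 scores a sequence by a normalized product of transition probabilities along its clustering path. The slides do not say how the product is normalized; the sketch below assumes a geometric mean (the product raised to 1 over the number of transitions) so that scores are comparable across sequence lengths. `path_score` and the `prob` callback are hypothetical names.

```python
import math


def path_score(path, prob):
    """Geometric mean of transition probabilities along a node path.

    path: list of node ids visited in order.
    prob: callable (src, dst) -> transition probability in [0, 1].
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    log_sum = 0.0
    for src, dst in steps:
        p = prob(src, dst)
        if p == 0.0:
            return 0.0  # one unseen transition zeroes the whole product
        log_sum += math.log(p)  # summing logs avoids underflow on long paths
    return math.exp(log_sum / len(steps))
```

A sequence would then be predicted as miRNA when its score under an EMM trained on known miRNA exceeds a chosen cutoff.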


Prediction of miRNA Precursors¹

• Predicted occurrence of pre-miRNA segments from a set of hairpin sequences
• No assumptions about biological function or conservation across species.
• Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not.
• Sensitivity of 93.3%
• Specificity of 88.1%
• No report of false positives

¹ C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.

Preliminary Test Data¹

• Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
• Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions, it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
• Positive Test: This dataset contains 30 pre-miRNAs.
• Negative Test: This dataset contains 1000 randomly chosen sequences from coding regions.

¹ C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.

TCGRs for Xue Training Data

[Figure: TCGR images in two panels, POSITIVE and NEGATIVE]

TCGRs for Xue Test Data

[Figure: TCGR images in two panels, POSITIVE and NEGATIVE]

Predictive Probabilities with Xue’s Data

EMM        Test Data    Mean      Std Dev    Max       Min
Negative   Test-Neg     0         0          0         0
Negative   Test-Pos     0         0          0         0
Negative   Train-Neg    0.37963   0.050085   0.91256   0.2945
Negative   Train-Pos    0         0          0         0
Positive   Test-Neg     0         0          0         0
Positive   Test-Pos     0.25894   0.18701    0.42075   0
Positive   Train-Neg    0         0          0         0
Positive   Train-Pos    0.38926   0.048439   0.91155   0.32209

Preliminary Test Results

• Positive EMM
  • Cutoff Probability = 0.3
  • False Positive Rate = 0%
  • True Positive Rate = 66%
• Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.
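For reference, rates of this kind come from thresholding per-sequence scores. The sketch below shows the computation only; the scores in the usage lines are made-up placeholders rather than the study's data, `rates` is a hypothetical name, and treating a score equal to the cutoff as a positive prediction is an assumption.

```python
def rates(pos_scores, neg_scores, cutoff):
    """Return (true positive rate, false positive rate) at a score cutoff."""
    tp = sum(s >= cutoff for s in pos_scores)  # positives predicted positive
    fp = sum(s >= cutoff for s in neg_scores)  # negatives predicted positive
    return tp / len(pos_scores), fp / len(neg_scores)


# Placeholder scores for illustration only.
tpr, fpr = rates(pos_scores=[0.4, 0.1], neg_scores=[0.0, 0.0], cutoff=0.3)
print(tpr, fpr)  # 0.5 0.0
```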


Conclusion/Future Work

Results, although promising, are preliminary. Research is ongoing.

Future Research

1. Obtain all known mature miRNA sequences for a species, initially the 119 C. elegans miRNAs.

2. Create TCGR count vectors for each sequence and each sub-pattern length (1, 2, 3, 4, 5).

3. Train EMMs using this positive data for each sub-pattern length. Thus five EMMs will be created.

4. Obtain negative data (much as Xue et al. did in their research) from coding regions for C. elegans.

5. Train EMMs using this negative data for each sub-pattern length. Thus five more EMMs will be created.

6. Construct a meta-classifier based on the combined results of prediction from each of these ten EMMs.

7. Apply the EMM classifier to the existing ~75 x 10^6 base pairs of non-exonic sequence in the C. elegans genome to search for miRNAs. Note: all 119 validated C. elegans miRNAs are contained in the non-exonic part of the genome, and thus the first pass of the algorithm will be tested for its ability to detect all 119 validated miRNAs.

8. Validate the prediction of novel miRNAs using molecular biology.
