Temporal Chaos Game Representation (TCGR) for DNA/RNA Sequence Visualization


Page 1:

8/29/06

Temporal Chaos Game Representation (TCGR) for DNA/RNA Sequence Visualization

Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, and Yu Meng

Southern Methodist University

Dallas, Texas 75275

[email protected]

Page 2:

Outline

Introduction
• CGR/FCGR
• miRNA
• Motivation
• Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion/Future Work

Page 3:

Outline

Introduction
• CGR/FCGR
• miRNA
• Motivation
• Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion/Future Work

Page 4:

Chaos Game Representation (CGR)

Scatter plot showing the occurrence of patterns of nucleotides.

University of the Basque Country, http://insilico.ehu.es/genomics/my_words/

Page 5:

Chaos Game Representation (CGR)

A 2D technique for visualizing the distribution of subpatterns.

Our technique is based on the following:
• Generate totals for each subpattern.
• Scale the totals to the [0,1] range. (Note: scaling can be a problem.)
• Convert the scaled values to red/blue:
  • 0-0.5: white to blue
  • 0.5-1: blue to red

[Figure: FCGR; panels with corners labeled A, C, G, U]
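The three steps listed on this slide (totals, scaling, color) could be sketched as follows. This is only an illustration: the slides do not say how the totals are scaled, so division by the largest total is an assumption, as is linear blending in RGB. The function names are hypothetical.

```python
from collections import Counter

WHITE, BLUE, RED = (255, 255, 255), (0, 0, 255), (255, 0, 0)


def subpattern_totals(seq, k):
    """Count every length-k subpattern in seq (overlapping occurrences)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def scale_totals(totals):
    """Scale counts to [0, 1] by dividing by the largest count (assumed rule)."""
    top = max(totals.values(), default=0)
    if top == 0:
        return {p: 0.0 for p in totals}
    return {p: n / top for p, n in totals.items()}


def _blend(a, b, t):
    """Linear interpolation between two RGB colors, with t in [0, 1]."""
    return tuple(round(x + (y - x) * t) for x, y in zip(a, b))


def to_color(value):
    """Map [0, 0.5] to white-to-blue and [0.5, 1] to blue-to-red."""
    if value <= 0.5:
        return _blend(WHITE, BLUE, value / 0.5)
    return _blend(BLUE, RED, (value - 0.5) / 0.5)


if __name__ == "__main__":
    scaled = scale_totals(subpattern_totals("ACGUACGGAC", 2))
    for pattern, value in sorted(scaled.items()):
        print(pattern, round(value, 2), to_color(value))
```

Dividing by the maximum lets a single very frequent subpattern push every other cell toward white, which may be one form of the scaling problem the slide mentions.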

Page 6:

FCGR

a) Nucleotides

A  C
G  T

b) Dinucleotides

AA  AC  CA  CC
AG  AT  CG  CT
GA  GC  TA  TC
GG  GT  TG  TT

c) Trinucleotides

AAA  AAC  ACA  ACC  CAA  CAC  CCA  CCC
AAG  AAT  ACG  ACT  CAG  CAT  CCG  CCT
AGA  AGC  ATA  ATC  CGA  CGC  CTA  CTC
AGG  AGT  ATG  ATT  CGG  CGT  CTG  CTT
GAA  GAC  GCA  GCC  TAA  TAC  TCA  TCC
GAG  GAT  GCG  GCT  TAG  TAT  TCG  TCT
GGA  GGC  GTA  GTC  TGA  TGC  TTA  TTC
GGG  GGT  GTG  GTT  TGG  TGT  TTG  TTT
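The layout on this slide follows a simple recursive rule: the first letter of a pattern selects a quadrant, the next letter a quadrant within it, and so on. A minimal sketch of that rule, with the corner assignment read off panel (a); the function names are hypothetical and not taken from the slides:

```python
from itertools import product

# Corner assignment from panel (a), as (row, column) offsets:
# A top-left, C top-right, G bottom-left, T bottom-right.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def fcgr_cell(pattern):
    """Return the (row, column) of a pattern in the 2^k x 2^k FCGR grid."""
    row = col = 0
    for base in pattern.upper().replace("U", "T"):  # treat RNA U as T
        r, c = CORNER[base]
        row = 2 * row + r  # each letter halves the remaining square
        col = 2 * col + c
    return row, col


def fcgr_grid(k):
    """Build the grid of all length-k patterns as a list of rows."""
    size = 2 ** k
    grid = [[None] * size for _ in range(size)]
    for letters in product("ACGT", repeat=k):
        pattern = "".join(letters)
        r, c = fcgr_cell(pattern)
        grid[r][c] = pattern
    return grid


if __name__ == "__main__":
    for row in fcgr_grid(2):
        print("  ".join(row))
```

Running it with k = 2 prints the dinucleotide grid of panel (b).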

Page 7:

FCGR Example

Homo sapiens: all mature miRNA

Patterns of length 3

[Figure: FCGR with the cells UUC and GUG labeled]

Page 8:

miRNA

• Short (20-25 nt) sequence of noncoding RNA
• Single strand
• Previously assumed to be garbage
• Impact/prevent translation of mRNA
• Conserved across species (sometimes)
• Reduce protein levels without impacting mRNA levels
• Bind to target areas in mRNA
  • Problem is that this binding is not perfect (particularly in animals)
  • mRNA may have multiple (nonoverlapping) binding sites for one miRNA

Page 9:

miRNA Functions

• Causes some cancers
• Embryo development
• Cell differentiation
• Cell death
• Prevents the production of a protein that causes lung cancer
• Control brain development in zebrafish
• Associated with HIV

Page 10:

miRNA Research Issues

• Predict/find miRNA
• Predict miRNA targets
• Identify miRNA functions
• Identify how miRNAs work

Page 11:

Motivation

2000 bp flanking upstream region of mir-258.2 in C. elegans

[Figure panels: a) All 2000 bp; b) First 240 bp; c) Last 240 bp]

Page 12:

Research Objectives

• Identify, develop, and implement algorithms which can be used for identifying potential miRNA functions.
• Create an online tool which can be used by other researchers to apply our algorithms to new data.

Page 13:

Outline

Introduction
• CGR/FCGR
• miRNA
• Motivation
• Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion/Future Work

Page 14:

Temporal CGR (TCGR)

Temporal version of Frequency CGR
• In our context, temporal means the starting location of a window.
2D array
• Each row represents the counts for a particular window in the sequence.
  • First row: first window
  • Last row: last window
  • We start successive windows at the next character location.
• Each column represents the counts for the associated pattern in that window.
  • Initially we have assumed the order of patterns is alphabetic.
Size of the TCGR depends on sequence length and subpattern length.
As sequence lengths vary, we only examine complete windows.
We only count patterns completely contained in each window.
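The construction described on this slide could be sketched as below. It is a sketch under stated assumptions rather than the authors' implementation: the function name `tcgr` and its parameters are hypothetical, the alphabet defaults to DNA, and subpatterns containing other characters are simply skipped.

```python
from itertools import product


def tcgr(seq, window, k, alphabet="ACGT"):
    """Return (patterns, rows): one row of pattern counts per window.

    patterns lists all length-k patterns in alphabetic order (the columns);
    rows[i] holds the counts for the window starting at position i.
    """
    seq = seq.upper()
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    column = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character; only complete windows.
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        counts = [0] * len(patterns)
        # Count only patterns that lie completely inside the window.
        for i in range(window - k + 1):
            j = column.get(chunk[i:i + k])
            if j is not None:  # skip patterns with unexpected characters
                counts[j] += 1
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    patterns, rows = tcgr("ACGTACGGAC", window=5, k=2)
    print(len(rows), "windows x", len(patterns), "patterns")
```

With a sequence of length n, window length w, and pattern length k, the array has n - w + 1 rows and 4^k columns, which is the size dependence the slide refers to.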

Page 15:

TCGR Example

Sequence: acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving window (counts)

            A    C    G    T
Pos 0-8     2    3    3    1
Pos 1-9     1    3    3    2
…
Pos 34-42   2    4    2    1

Moving window (scaled)

            A    C    G    T
Pos 0-8     0.4  0.6  0.6  0.2
Pos 1-9     0.2  0.6  0.6  0.4
…
Pos 34-42   0.4  0.8  0.4  0.2
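The numbers on this slide can be checked with a few lines of code. The window length of 9 and the sequence come from the slide; the scaling rule is an assumption. The scaled table is consistent with dividing every count by the largest count found in any window (5 for this sequence), but the slides do not state the rule.

```python
SEQ = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
WINDOW = 9
BASES = "acgt"

# One row of single-nucleotide counts per complete window.
counts = [
    [SEQ[start:start + WINDOW].count(base) for base in BASES]
    for start in range(len(SEQ) - WINDOW + 1)
]

# Assumed scaling: divide by the largest count found in any window.
largest = max(max(row) for row in counts)
scaled = [[c / largest for c in row] for row in counts]

# Print the three windows shown on the slide.
for start in (0, 1, len(counts) - 1):
    print(f"Pos {start}-{start + WINDOW - 1}", counts[start], scaled[start])
```

The three printed rows match the first, second, and last rows of both tables above.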

Page 16:

TCGR Example (cont’d)

TCGRs for Sub-patterns of length 1, 2, and 3

Page 17:

TCGR Example (cont’d)

Window 0: Pos 0-8      acgtgcacg
Window 1: Pos 1-9      cgtgcacgt
Window 17: Pos 17-25   tccggaacc
Window 18: Pos 18-26   ccggaacca
Window 34: Pos 34-42   ccacgtcga

[Figure: TCGR with columns A, C, G, T]

Page 18:

TCGR - Viruses miRNA (Window=9; Pattern=1;2;3)

[Figure panels: Epstein Barr, Human Cytomegalovirus, Kaposi sarc Herpesvirus, Mouse Gammaherpesvirus; shown for Pattern=1, Pattern=2, and Pattern=3]

Page 19:

TCGR - Mature miRNA (Window=5; Pattern=3)

[Figure panels: All Mature, Mus musculus, Homo sapiens, C. elegans; columns ACG, CGC, GCG, and UCG are labeled]

Page 20:

Outline

Introduction
• CGR/FCGR
• miRNA
• Motivation
• Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion/Future Work

Page 21:

EMM Overview

• Time-varying discrete first-order Markov model
• Nodes are clusters of real-world states.
• Learning continues during the prediction phase.
• Learning:
  • Transition probabilities between nodes
  • Node labels (centroid of cluster)
  • Nodes are added and removed as data arrives

Page 22:

EMM Definition

Extensible Markov Model (EMM): at any time t, an EMM consists of a Markov chain (MC) with a designated current node, Nn, and algorithms to modify it, where the algorithms include:

• EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
• EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
• EMMDecrement, which removes nodes from the EMM when needed.

Page 23:

EMM Cluster

• Find the closest node to the incoming event.
• If none is “close”, create a new node.
• The label of a cluster is the centroid of the members in the cluster.
• O(n)
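A minimal sketch of the clustering step on this slide. The slides do not give the distance measure or what counts as “close”, so Euclidean distance and a fixed threshold are assumptions here, and the names `Node` and `emm_cluster` are hypothetical.

```python
import math


class Node:
    """A cluster of observed vectors, labeled by the centroid of its members."""

    def __init__(self, vector):
        self.total = list(vector)  # running sum of the member vectors
        self.size = 1

    @property
    def centroid(self):
        return [t / self.size for t in self.total]

    def add(self, vector):
        self.total = [t + v for t, v in zip(self.total, vector)]
        self.size += 1


def emm_cluster(nodes, vector, threshold):
    """Return the index of the node that absorbs vector, creating one if needed."""
    best, best_dist = None, math.inf
    for i, node in enumerate(nodes):  # linear scan: O(n) in the number of nodes
        d = math.dist(node.centroid, vector)  # Euclidean distance (assumed)
        if d < best_dist:
            best, best_dist = i, d
    if best is None or best_dist > threshold:  # nothing "close": new node
        nodes.append(Node(vector))
        return len(nodes) - 1
    nodes[best].add(vector)  # the centroid label shifts toward the new member
    return best


if __name__ == "__main__":
    nodes = []
    for v in ([18, 10, 3, 3, 1, 0, 0], [17, 10, 2, 3, 1, 0, 0], [14, 8, 2, 3, 0, 0, 0]):
        print(v, "-> node", emm_cluster(nodes, v, threshold=2.0))
```

The threshold of 2.0 in the usage example is arbitrary and chosen only so that the third vector opens a new node.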

Page 24:

EMM Increment

Input vectors: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>

[Figure: successive snapshots of the EMM with nodes N1, N2, and N3 and transition probabilities (such as 1/3, 2/3, 1/2, 1/1) labeled on the arcs]
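One way to keep the transition probabilities up to date as inputs arrive is to store link counts and derive probabilities on demand. The sketch below illustrates that bookkeeping only; the class and method names are hypothetical, and the slides do not say how EMMIncrement is implemented.

```python
from collections import defaultdict


class TransitionCounts:
    """Transition bookkeeping for a Markov chain that grows over time."""

    def __init__(self):
        # links[a][b] = number of observed moves from node a to node b
        self.links = defaultdict(lambda: defaultdict(int))
        self.current = None  # designated current node

    def increment(self, node):
        """Record a move from the current node to node, then make it current."""
        if self.current is not None:
            self.links[self.current][node] += 1
        self.current = node

    def probability(self, a, b):
        """Estimate P(b | a) as count(a -> b) over all moves leaving a."""
        leaving = self.links.get(a, {})
        total = sum(leaving.values())
        return leaving.get(b, 0) / total if total else 0.0


if __name__ == "__main__":
    chain = TransitionCounts()
    # Hypothetical sequence of cluster assignments, not the one in the figure.
    for node in ["N1", "N2", "N1", "N3", "N1", "N2"]:
        chain.increment(node)
    print(chain.probability("N1", "N2"), chain.probability("N1", "N3"))
```

Storing counts rather than probabilities means a new node or link never requires renormalizing the existing entries.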

Page 25:

Research Objectives

• Identify, develop, and implement algorithms which can be used for identifying potential miRNA functions.
• Create an online tool which can be used by other researchers to apply our algorithms to new data.

Our approach:
1. Represent a potential miRNA sequence with a TCGR sequence of count vectors.
2. Create an EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets).
3. Predict an unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on the normalized product of transition probabilities along the clustering path in the EMM.
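Step 3 could be sketched as follows. The slides do not define how the product is normalized; the geometric mean used here (the n-th root of a product of n probabilities) is an assumption chosen so that long and short paths are comparable. The function name and the dictionary layout are hypothetical.

```python
import math


def path_score(path, prob):
    """Geometric mean of transition probabilities along a node path.

    path: node ids visited while clustering a sequence's TCGR rows.
    prob: dict mapping (from_node, to_node) to a transition probability.
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    log_sum = 0.0
    for step in steps:
        p = prob.get(step, 0.0)
        if p == 0.0:
            return 0.0  # an unseen transition zeroes the whole product
        log_sum += math.log(p)  # sum logs to avoid underflow on long paths
    return math.exp(log_sum / len(steps))


if __name__ == "__main__":
    prob = {("N1", "N2"): 0.5, ("N2", "N3"): 0.5, ("N3", "N1"): 1.0}
    print(path_score(["N1", "N2", "N3", "N1"], prob))
```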

Page 26:

Outline

Introduction
• CGR/FCGR
• miRNA
• Motivation
• Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion/Future Work

Page 27:

Prediction of miRNA Precursors¹

• Predicted occurrence of pre-miRNA segments from a set of hairpin sequences.
• No assumptions about biological function or conservation across species.
• Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not.
• Sensitivity of 93.3%
• Specificity of 88.1%
• No report of false positives

¹ C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.

Page 28:

Preliminary Test Data¹

• Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
• Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions, it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
• Positive Test: This dataset contains 30 pre-miRNAs.
• Negative Test: This dataset contains 1000 randomly chosen sequences from coding regions.

¹ C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol. 6, no. 310.

Page 29:

TCGRs for Xue Training Data

[Figure panels: POSITIVE, NEGATIVE]

Page 30:

TCGRs for Xue Test Data

[Figure panels: POSITIVE, NEGATIVE]

Page 31:

Predictive Probabilities with Xue’s Data

EMM       Test Data   Mean     Std Dev   Max      Min
Negative  Test-Neg    0        0         0        0
          Test-Pos    0        0         0        0
          Train-Neg   0.37963  0.050085  0.91256  0.2945
          Train-Pos   0        0         0        0
Positive  Test-Neg    0        0         0        0
          Test-Pos    0.25894  0.18701   0.42075  0
          Train-Neg   0        0         0        0
          Train-Pos   0.38926  0.048439  0.91155  0.32209

Page 32:

Preliminary Test Results

• Positive EMM
• Cutoff Probability = 0.3
• False Positive Rate = 0%
• True Positive Rate = 66%
• Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.
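For reference, the two rates reported on this slide are computed from per-sequence scores and a cutoff as in the sketch below. The function name is hypothetical, treating a score at or above the cutoff as a positive prediction is an assumption, and the scores in the usage example are invented for illustration rather than taken from the experiments.

```python
def rates_at_cutoff(pos_scores, neg_scores, cutoff):
    """Return (true positive rate, false positive rate) for score >= cutoff.

    pos_scores: scores of sequences known to be positive (e.g. pre-miRNAs).
    neg_scores: scores of sequences known to be negative.
    """
    tp = sum(s >= cutoff for s in pos_scores)  # positives correctly accepted
    fp = sum(s >= cutoff for s in neg_scores)  # negatives wrongly accepted
    tpr = tp / len(pos_scores) if pos_scores else 0.0
    fpr = fp / len(neg_scores) if neg_scores else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Invented scores, for illustration only.
    print(rates_at_cutoff([0.40, 0.35, 0.10], [0.0, 0.0, 0.0, 0.0], cutoff=0.3))
```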

Page 33:

Outline

Introduction
• CGR/FCGR
• miRNA
• Motivation
• Research Objective
TCGR
EMM
miRNA Prediction using TCGR/EMM
Conclusion/Future Work

Page 34:

Conclusion/Future Work

This is ongoing research. Results, although promising, are preliminary. More research is ongoing.

Page 35:

Future Research

1. Obtain all known mature miRNA sequences for a species, initially the 119 C. elegans miRNAs.

2. Create TCGR count vectors for each sequence and each sub-pattern length (1, 2, 3, 4, 5).

3. Train EMMs using this data for each sub-pattern length. Thus five EMMs will be created.

4. Obtain negative data (much as Xue did in his research) from coding regions for C. elegans.

5. Train EMMs using this negative data for each sub-pattern length. Thus five more EMMs will be created.

6. Construct a meta-classifier based on the combined results of prediction from each of these ten EMMs.

7. Apply the EMM classifier to the existing ~75 x 10^6 base pairs of non-exonic sequence in the C. elegans genome to search for miRNAs. Note: all 119 validated C. elegans miRNAs are contained in the non-exonic part of the genome, and thus the first pass of the algorithm will be tested for its ability to detect all 119 validated miRNAs.

8. Validate the prediction of novel miRNAs using molecular biology.
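The slides leave the design of the meta-classifier in step 6 open. Purely as an illustration of how ten scores (five positive EMMs and five negative EMMs, one pair per sub-pattern length) might be combined, the sketch below uses a majority vote. The rule, the function name, and the example scores are all assumptions, not the authors' design.

```python
def meta_classify(pos_scores, neg_scores):
    """Majority vote over sub-pattern lengths (an assumed combination rule).

    pos_scores, neg_scores: dicts mapping sub-pattern length to the score a
    sequence receives from the EMM trained on positive or negative data.
    Returns True when the positive EMM wins for more than half the lengths.
    """
    votes = sum(pos_scores[k] > neg_scores[k] for k in pos_scores)
    return votes > len(pos_scores) / 2


if __name__ == "__main__":
    # Invented scores for sub-pattern lengths 1 to 5, for illustration only.
    pos = {1: 0.41, 2: 0.38, 3: 0.00, 4: 0.35, 5: 0.00}
    neg = {1: 0.00, 2: 0.10, 3: 0.20, 4: 0.00, 5: 0.30}
    print(meta_classify(pos, neg))
```

Other combinations, such as feeding the ten scores to a trained classifier, would fit the same interface.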