Efficient Algorithms for Locating the Length-Constrained Heaviest Segments, with Applications to Biomolecular Sequence Analysis

Yaw-Ling Lin* Tao Jiang Kun-Mao Chao

* Dept CS & Info Mngmt, Providence Univ, Taiwan
Dept CS & Engineering, UC Riverside, USA
Dept CS & Info Engnr, Nat. Taiwan Univ, Taiwan

Post on 20-Dec-2015


Efficient Algorithms for Locating the Length-Constrained Heaviest Segments, with Applications to Biomolecular Sequence Analysis

Yaw-Ling Lin* Tao Jiang Kun-Mao Chao

* Dept CS & Info Mngmt, Providence Univ, Taiwan

Dept CS & Engineering, UC Riverside, USA

Dept CS & Info Engnr, Nat. Taiwan Univ, Taiwan

Outline

• Introduction

• Applications to Biomolecular Sequence Analysis

• Maximum Sum Consecutive Subsequence

• Maximum Average Consecutive Subsequence

• Implementation and Preliminary Experiments

• Concluding Remarks

Introduction

• Two fundamental algorithms in searching for interesting regions in sequences:

• Given a sequence of n real numbers and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum: an O(n)-time algorithm.

• Given a sequence of n real numbers and a lower bound L, find a consecutive subsequence of length at least L with the maximum average: an O(n log L)-time algorithm.
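
To make the first problem concrete, here is a minimal Python sketch of a linear-time solution. It uses the standard prefix-sum plus sliding-window-minimum (monotonic deque) technique, which is not the partition-based method presented in these slides. The function name `max_sum_at_most` and the half-open `(start, end)` return convention are assumptions made for illustration.

```python
from collections import deque
from itertools import accumulate


def max_sum_at_most(a, U):
    """Return (best_sum, start, end) for the segment a[start:end] with
    1 <= end - start <= U and maximum sum (hypothetical helper)."""
    if not a or U < 1:
        raise ValueError("need a non-empty sequence and U >= 1")
    prefix = [0, *accumulate(a)]   # prefix[j] = a[0] + ... + a[j-1]
    window = deque()               # candidate starts, prefix values increasing
    best = None
    for j in range(1, len(prefix)):
        i = j - 1                  # newest admissible start for end j
        # Drop starts that can never beat i (prefix value not smaller).
        while window and prefix[window[-1]] >= prefix[i]:
            window.pop()
        window.append(i)
        # Drop starts that would make the segment longer than U.
        while window[0] < j - U:
            window.popleft()
        candidate = prefix[j] - prefix[window[0]]
        if best is None or candidate > best[0]:
            best = (candidate, window[0], j)
    return best


print(max_sum_at_most([5, -3, 4, -1, 2, -6], 3))
```

Each index enters and leaves the deque at most once, so the loop runs in O(n) time regardless of U.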

Applications to Biomolecular Sequence Analysis (I)

• Locating GC-Rich Regions
– Finding GC-rich regions: an important problem in gene recognition and comparative genomics.
– CpG islands (200 ~ 1400 bp)
– [Huang’94]: O(nL)-time algorithm.

• Post-Processing Sequence Alignments
– Comparative analysis of human and mouse DNA: useful in gene prediction in human genome.
– Mosaic effect: bad inner sequence.
– Normalized local alignment.
– Post-processing local aligned subsequences
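
For the GC-rich region problem, an O(nL) baseline can be sketched as follows. It relies on a standard observation: any segment of length at least 2L splits into two pieces of length at least L, one of which has an average no smaller than the whole, so only lengths L to 2L-1 need to be tried. Whether this matches [Huang’94] in detail is not confirmed by the slides. The names `max_avg_at_least` and `gc_indicator` are hypothetical, and integer scores are assumed so that `Fraction` gives exact comparisons.

```python
from fractions import Fraction
from itertools import accumulate


def max_avg_at_least(a, L):
    """Return (best_avg, start, end) for the segment a[start:end] with
    end - start >= L and maximum average; O(n * L) baseline sketch.
    Assumes integer scores (Fraction needs rational inputs)."""
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = [0, *accumulate(a)]
    best = None
    for start in range(n - L + 1):
        # Lengths L .. 2L-1 suffice (see the splitting argument above).
        max_end = min(n, start + 2 * L - 1)
        for end in range(start + L, max_end + 1):
            avg = Fraction(prefix[end] - prefix[start], end - start)
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best


def gc_indicator(dna):
    """Score 1 for G or C and 0 otherwise (illustrative scoring only)."""
    return [1 if base in "GCgc" else 0 for base in dna]


print(max_avg_at_least(gc_indicator("ATGCGCAT"), 4))
```

With the 0/1 scoring above, the maximum average is the GC fraction of the richest window of length at least L.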

Applications to Biomolecular Sequence Analysis (II)

• Annotating Multiple Sequence Alignments
– [Stojanovic’99]: conserved regions in biomolecular sequences.
– Numerical scores for columns of a multiple alignment; each column score shall be adjusted by subtracting an anchor value.

• Ungapped Local Alignments with Length Constraints
– Computing the length-constrained segment of each diagonal in the matrix with the largest sum (or average) of scores.
– Applications in motif identification.

Maximum Sum Consecutive Subsequence

<-4, 1, -2, 3> is left-negative; <5, -3, 4, -1, 2, -6> is not.

<5> <-3, 4> <-1, 2> <-6> is minimal left-negative partitioned.
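
The partition in this example can be reproduced with a short Python sketch. It assumes the definition that a sequence is left-negative when every proper prefix has a sum of at most zero; that definition is not spelled out in the surviving slide text, so treat it as an assumption. The function name `mln_partition` is hypothetical. The scan runs right to left and extends each segment over already-built segments while its running sum is non-positive.

```python
def mln_partition(a):
    """Split a into the fewest left-negative segments (sketch; assumes
    'left-negative' means every proper prefix sums to <= 0)."""
    n = len(a)
    end = [0] * n     # end[i]: last index of the segment starting at i
    total = [0] * n   # total[i]: sum of that segment
    for i in range(n - 1, -1, -1):
        end[i], total[i] = i, a[i]
        # While the running sum is non-positive, absorb the next segment.
        while end[i] < n - 1 and total[i] <= 0:
            nxt = end[i] + 1
            total[i] += total[nxt]
            end[i] = end[nxt]
    parts, i = [], 0
    while i < n:
        parts.append(a[i:end[i] + 1])
        i = end[i] + 1
    return parts


print(mln_partition([5, -3, 4, -1, 2, -6]))
```

Because each extension jumps over a whole previously built segment rather than a single element, the scan stays cheap, which is consistent with the linear-time bound the slides state for the MLN-partition.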

Minimal left-negative partition

MLN-partition: linear time

Max-Sum with LC

Analysis of MSLC

Max Average Subsequence

<4, 2, 3, 8> is right-skew; <5, 3, 4, 1, 2, 6> is not.

<5> <3, 4> <1, 2, 6> is decreasing right-skew partitioned.
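
The same right-to-left idea yields the partition in this example. The sketch below assumes that a sequence is right-skew when the average of every proper prefix is at most the average of the remaining suffix, and that a decreasing right-skew partition has right-skew segments with strictly decreasing averages; neither definition survives in the slide text, so both are assumptions. The function name `drs_partition` is hypothetical.

```python
def drs_partition(a):
    """Split a into right-skew segments with strictly decreasing averages
    (sketch under the definitions assumed in the lead-in)."""
    n = len(a)
    end = [0] * n     # end[i]: last index of the segment starting at i
    total = [0] * n   # total[i]: sum of that segment
    width = [0] * n   # width[i]: length of that segment
    for i in range(n - 1, -1, -1):
        end[i], total[i], width[i] = i, a[i], 1
        while end[i] < n - 1:
            nxt = end[i] + 1
            # Stop once avg(current) > avg(next segment); the comparison is
            # cross-multiplied to avoid division (widths are positive).
            if total[i] * width[nxt] > total[nxt] * width[i]:
                break
            total[i] += total[nxt]
            width[i] += width[nxt]
            end[i] = end[nxt]
    parts, i = [], 0
    while i < n:
        parts.append(a[i:end[i] + 1])
        i = end[i] + 1
    return parts


print(drs_partition([5, 3, 4, 1, 2, 6]))
```

On the slide's example the segment averages come out as 5, 3.5 and 3, which are strictly decreasing as required.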

Decreasing right-skew partition

DRS-partition: linear time

Max-Avg-Seq with LC

Locate good-partner

Analysis of MaxAvgSeq

Implementation and Preliminary Experiments

Implementation and Preliminary Experiments

Conclusion

• Finding a max-sum subsequence of length at most U can be done in O(n) time.

• Finding a max-avg subsequence of length at least L can be done in O(n log L) time.

Recent Progress

• Lu (CMCT’2002): finding the max-avg subsequence of length at least L on binary (0,1) sequences. O(n)-time.

• Goldwasser, Kao, Lu (2002, manuscripts): finding the max-avg subsequence of length at least L and at most U on real sequences. O(n)-time.

• Tools: finding CpG islands using MAVG (joint work with Huang, X., Jiang, T. and Chao, K.-M.)
http://deepc2.zool.iastate.edu/aat/mavg/cgdoc.html
http://deepc2.zool.iastate.edu/aat/mavg/cg.html

Future Research

• Best k (nonintersecting) subsequences?

• Normalized local alignment?

• Measurement of goodness?