Research Projects

Tao Jiang’s Lab

Algorithms and Computational Biology Laboratory

Department of Computer Science and Engineering
University of California, Riverside

March, 2013

Project Overview

Predicting Operons by a Comparative Genomics Approach (DOE GtL)

Evolutionary Dynamics of Myb Gene DNA-binding Domains (NSF ITR)

Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes (NIH/NSF)

Efficient Selection of Unique and Popular Oligos for Large EST Databases (USDA/NSF)

Oligonucleotide Fingerprinting of Ribosomal RNA Genes and Microorganism Classification (NSF BDI/NIH)

Efficient Haplotyping Algorithms for Pedigree Data and Gene Association Mapping (NSF CCF and NIH)

High Throughput Ortholog Assignment via Genome Rearrangement (NSF IIS)

Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq short reads

Metagenomic Data Analysis

Predicting operons by a comparative genomics approach

This project aims to predict candidate operons in the genome of Synechococcus sp. WH8102 using a comparative genomics approach. These candidate operons may provide helpful information for constructing protein-protein interaction networks and functional pathways.

Xin Chen

Collaboration: Ying Xu (ORNL)
Fund: DOE GtL

Operon structures

Operons represent a basic organizational unit of genes in the complex hierarchical structure of biological processes in a cell. They are mainly used to facilitate efficient implementation of transcriptional regulation, especially in bacteria.

Biological characteristics of genes in an operon include:
• sharing certain regulatory elements
• arranged in tandem on the same strand
• separated by short distances
• well conserved across phylogenetically related species
• their functions are usually related

Existing methods for operon prediction

• Overbeek et al. (1999): gene pairs of close bidirectional best hits
• Salgado et al. (2000): close gene distances and gene functional classes
• Ermolaeva et al. (2001): the likelihood of conserved genes being an operon
• Craven et al. (2002): a probabilistic learning approach on whole genome
• Sabatti et al. (2002): a Bayesian classification scheme on gene microarray
• Zheng et al. (2002): based on information from metabolic pathways

Our approach based on comparative genomics

Comparative analysis is based on the idea that functional segments tend to evolve at a lower rate than nonfunctional segments, so well-conserved regions are likely to be of particular interest (Overbeek et al., 1999).

Pipeline:
1. Genome sequences with annotated genes
2. Pairwise comparison (running the blastp program with E-value = 1e-20), producing gene matches (homolog information)
3. Cluster conserved, nearby genes into candidate operons
4. Scoring, producing a ranked list of operons as output

Constraints:
1. neighbor genes separated by 100 bases or less
2. genes in an operon located on the same strand
3. gene sets conserved across two or more genomes
4. full matching required for a candidate operon
5. promoter and terminator to be considered
6. pathway information to be considered

A score is given by:
1. product of E-values of gene matches involved in an operon
2. intergenic distances in an operon, to be considered
3. predictive reliability of promoter or terminator, to be considered

Implementation details

• Data preparation: data for three genomes downloaded from the ORNL website (http://compbio.ornl.gov/channel/index.html).

• Pairwise comparison: blastp with E-value < 1e-20, giving a bipartite gene matching graph. Matching genes have the same COG ID.

• Gene clustering:
  – neighbor genes separated by 100 bases or less
  – genes in an operon located on the same strand
  – gene sets conserved across two or more genomes
  – full matching required for a candidate operon

• Scoring: product of E-values of all gene matches involved; operons with lower scores are output earlier
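The clustering and scoring steps above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the lab's implementation: the `Gene` record, the `matched` set (genes with a blastp match in another genome), and the `evalue` dictionary are hypothetical inputs, and the check that the whole gene set is conserved in the other genomes is reduced to requiring that every gene in a run has a match.

```python
from dataclasses import dataclass
from math import prod

MAX_GAP = 100  # neighbor genes separated by 100 bases or less


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based start coordinate
    end: int     # inclusive end coordinate
    strand: str  # "+" or "-"


def cluster_neighbors(genes, matched):
    """Group consecutive matched genes on the same strand with gaps <= MAX_GAP."""
    clusters, current = [], []

    def flush():
        # Only runs of two or more genes are kept as candidate operons.
        if len(current) >= 2:
            clusters.append(list(current))
        current.clear()

    for gene in sorted(genes, key=lambda g: g.start):
        if gene.name not in matched:
            flush()  # full matching required: an unmatched gene ends the run
            continue
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand != prev.strand or gap > MAX_GAP:
                flush()
        current.append(gene)
    flush()
    return clusters


def score(cluster, evalue):
    """Product of the E-values of the gene matches; lower means more reliable."""
    return prod(evalue[g.name] for g in cluster)
```

Sorting the clusters with `sorted(clusters, key=lambda c: score(c, evalue))` then outputs the lower-scoring candidates first, as described in the scoring step.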

[Figure: schematic gene matching graph linking genes a1 to a8 of Genome a, b1 to b7 of Genome b, and c1 to c7 of Genome c.]

The gene matching graph for three cyanobacterial genomes

The numbers of gene matching pairs:
• 1593 between syn_wh and par_med
• 2242 between syn_wh and par_mit
• 1579 between par_med and par_mit

Three genomes with their gene numbers:
• Synechococcus sp. WH8102 (2520)
• Prochlorococcus marinus sp. MED4 (1700)
• Prochlorococcus marinus sp. MIT9313 (2267)

Predicted operons in Synechococcus sp. WH8102

A total of 242 operons were output for Synechococcus sp. WH8102:
• 126 operons shared with both of the other two genomes
• 26 operons shared with pmar_med only
• 90 operons shared with pmar_mit only
(See operons at http://www.cs.ucr.edu/~xinchen/operons.htm)

Several observations on the putative operons

• The average size of putative operons is 2.88, very close to 3;
• The two most frequent intergenic distances are -4 and -1 (overlapping genes);
• All operons in Synechococcus sp. WH8102 are on the positive strand;
• Matching genes have the same COG IDs across three genomes.

Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, and Y. Xu. GIW’2003.
X. Chen, Z. Su, Y. Xu, and T. Jiang. GIW’2004 (the best paper award).
X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang. Nucleic Acids Research, 2004.

Ongoing work

• Look for a way of predicting promoters and terminators upstream and downstream of candidate operons.
• Find a method to validate/score putative operons by promoter/terminator results.
• Incorporate additional information like intergenic distances and predicted promoters into the scoring system.
• Pathway information to be considered.

Evolutionary Dynamics of Myb Gene DNA-binding Domains

Li Jia

Collaboration: Michael Clegg (Botany)
Fund: NSF ITR

Motivation

Natural selection on changes of “regulatory genes” that regulate the timing or rate of development must be required for evolution. (Britten and Davidson, 1969 and 1971)

Natural selection on transcription factors should provide one of the predominant mechanisms for the generation of novel phenotypes.

Organism          Total number of genes   Genes coding for transcriptional regulators
                                          Total number   Percentage of total gene number
A. thaliana       ~25,000                 ~1,500         ~5%
O. sativa         ~50,000                 ~200           ~4%
C. elegans        ~18,000                 ~700           ~5%
D. melanogaster   ~15,000                 ~800           ~6%
H. sapiens        ~35,000                 ~3,000         ~9%
M. musculus       ~30,000                 ~1,800         ~6%

The Crucial Role of TFs

[Figure: signaling molecules act on TFs, which in turn control target genes, determining WHEN, WHERE, and HOW they are expressed.]

R2R3-MYB

Structure:
[Figure: an R2R3-MYB protein consists of a DNA-binding domain (repeats R2 and R3, each with Helix1, Helix2, and Helix3), a flexible domain, and an activation domain.]

Functions:
[Figure: MYB regulates target genes involved in differentiation, proliferation, and metabolism, including secondary metabolism, cell shape, disease resistance, and stress response.]

OBJECTIVE

To unveil the molecular dynamics that underlie the evolution of TFs (Myb)

[Figure: bar chart of positive selection counts (0 to 20) by amino acid position (1 to 96) across the R2 and R3 repeats, with Helix1, Helix2, and Helix3 marked in each repeat.]

Infer Positive Selection Sites

(based on dN/dS analysis, i.e. synonymous vs nonsynonymous mutation rates, in the duplication history of the R2R3-Myb gene family)

A. thaliana

Jia, Clegg and Jiang (2003) Plant Mol. Biol.

Positive Selection Sites

Jia, Clegg and Jiang (2003) Plant Mol. Biol.
Jia, Clegg and Jiang (2004) Plant Physiol.

A. thaliana (dicot)

Category            Sites   Counts   Percentage   Counts/site
Full R2R3 region    1       531      100%         5.4
R2 domain
  Helix1            14      173      33%          12.4*
  Helix2            7       83       16%          11.6*
  Helix3            10      8        1.5%         0.8
R3 domain
  Helix1            14      119      22%          8.5**
  Helix2            7       33       6%           4.7*
  Helix3            10      1        0.2%         0.1

O. sativa (monocot)

Category            Sites   Count              Percentage         Count/site
                            indica  japonica   indica  japonica   indica  japonica
Full R2R3 region    103     52      380        100%    100%       0.5     3.7
R2 domain
  Helix1            15      12      61         23%     16%        0.8**   4.1**
  Helix2            7       14      73         27%     19%        2.0**   10.4**
  Helix3            10      0       0          0%      0%         0.0     0.0
R3 domain
  Helix1            14      16      197        31%     52%        1.1*    14.1**
  Helix2            6       2       9          4%      2%         0.3     1.5**
  Helix3            10      0       0          0%      0%         0.0     0.0

Co-evolved Helices

                  japonica   indica    Arabidopsis
r (R2, R3)        0.69**     0.68**    0.69**
r (R2-1, R3-1)    N/A        0.15      0.11
r (R2-2, R3-2)    0.40**     0.62**    0.65**
r (R2-3, R3-3)    0.38       0.29      0.2

Jia, Clegg and Jiang (2004) Plant Physiol.
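The r values reported for the helices read like correlation coefficients between the per-position positive selection counts of R2 and R3. The slides do not say exactly how r was computed, so the following is only a sketch under the assumption that r is a Pearson correlation over aligned positions; `pearson_r` and its input vectors are hypothetical names for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length count vectors.

    Returns None when either vector has zero variance, since the
    coefficient is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)
```

With this convention, counts for the positions of one helix in R2 would be paired with the counts for the corresponding positions in R3.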

SUMMARY

1) Positive selection sites: positive selection pressure acts through the first and second helices of the R2R3 repeats rather than the third helices, owing to their structural characteristics.

2) Co-evolution patterns: the pairing correlations between the related secondary structures (the second helices) are functionally important in preserving the conformation of the specific protein folding pocket.

APPLICATIONS:

• Determine protein-DNA interaction regions of transcription factors based on their primary codon sequences

• Genetically modify MYB structure to improve economically important traits

Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes

Chuhu Yang

Collaboration: F.M. Sladek (Cell Biology, Neuroscience)

Fund: NIH/NSF

HNF4: an important TF

• An important TF that regulates the expression of many genes, especially some liver-specific genes; it also plays an important role in development.

• It has been demonstrated to regulate the expression of over 60 genes.

• Researchers expect to find more HNF4 target genes.

Related to many human diseases such as diabetes, hemophilia, hepatitis, etc.

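[Figure: HNF4 at the center, linked to target genes (apolipoproteins, PEPCK, L-PK, HNF1, CYP genes, ACO, HBV, BPG, OTC, MCAD, EPO, anti-thrombin, coagulation factors) and to associated conditions and processes (thrombosis, atherosclerosis, diabetes, hemophilia, hypoxia, cancer, MCAD deficiency, OTC deficiency, drug metabolism).]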

HNF4 is highly conserved in many different organisms

[Figure: domain structure of HNF4 in human, rat/mouse, Xenopus, and Drosophila, showing the DNA-binding domain (Zn++) and the putative ligand-binding/transactivation domain, each annotated with percent amino acid identity (% = amino acid identity).]

Our previous work

• Collected 71 HNF4 binding sequences from literature.

• Developed software based on an optimized (or permuted) Markov model and trained it with the 71 known sequences.

• Searched –500 to +100 regions (relative to transcription start sites) of all the human genes in UCSC database.

• Predicted 840 potential HNF4 binding sites in the human genome.

• Verified in vitro 77 new HNF4 binding sequences, resulting in a total of 137 HNF4 binding sequences.

• This work was published in Bioinformatics (Vol. 18, Suppl. 2, 2002).
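To make the train-then-scan idea concrete, here is a minimal sketch that uses a plain first-order Markov chain with pseudocounts and a uniform background. It is not the optimized (or permuted) Markov model used in the project; the function names, the pseudocount, and the background frequency of 0.25 are all assumptions made for illustration.

```python
from math import log

ALPHABET = "ACGT"


def train_markov(seqs, pseudo=1.0):
    """Estimate initial and transition probabilities from known binding sequences."""
    init = {b: pseudo for b in ALPHABET}
    trans = {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        init[s[0]] += 1
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    total = sum(init.values())
    init = {b: c / total for b, c in init.items()}
    for a in ALPHABET:
        row = sum(trans[a].values())
        trans[a] = {b: c / row for b, c in trans[a].items()}
    return init, trans


def log_odds(seq, model, background=0.25):
    """Log-likelihood ratio of seq under the model vs a uniform background."""
    init, trans = model
    total = log(init[seq[0]] / background)
    for a, b in zip(seq, seq[1:]):
        total += log(trans[a][b] / background)
    return total


def scan(region, model, width, threshold):
    """Slide a window over a promoter region and report windows scoring >= threshold."""
    hits = []
    for i in range(len(region) - width + 1):
        window = region[i:i + width]
        s = log_odds(window, model)
        if s >= threshold:
            hits.append((i, window, s))
    return hits
```

In practice the region passed to `scan` would be the –500 to +100 window around each transcription start site, and the threshold would be calibrated on the known binding sequences.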

Current work

• Search the promoter regions of all the human genes with 137 HNF4 binding sequences for potential HNF4 target genes in human.

• Search the promoter regions of all the mouse genes with 137 HNF4 binding sequences for potential HNF4 target genes in mouse.

• Compare HNF4 target genes in both human and mouse genomes.

• Perform in vivo experiments to verify potential HNF4 target genes.

Future work

• Optimize the current software so that it can predict HNF4 binding sites more accurately.

• Study the functions of all HNF4 target genes, cluster them into different functional groups, and study the relationships between the groups.

• Set up regulatory networks of all HNF4 target genes in human and mouse genomes.

• Sequence weighting: a new approach to constructing PSSM (or PWM) for motif finding from ChIP and gene expression data.

Efficient Selection of Unique and Popular Oligos for Large EST Databases

Jie Zheng
Collaboration: Stefano Lonardi and Timothy J. Close (Botany)
Funding: USDA / NSF

Problems of Oligo Selection

(for the Barley EST data in HarvEST)

• Unique Oligo Problem
  – Selection of oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately) in any other EST

• Popular Oligo Problem
  – Selection of oligos that appear (exactly or approximately) in many ESTs

Applications

• Unique oligos
  – PCR primer designs
  – Microarray probe designs

• Popular oligos
  – Useful in screening genomic libraries (such as BAC libraries) for gene-rich regions

Methods

• Basic idea
  – Separate dissimilar strings as early as possible to reduce the search space

• Algorithm for unique oligos
  – Group similar oligos by hashing 11-mer seeds, and disqualify oligos similar to oligos in other ESTs

• Algorithm for popular oligos
  – Cluster similar oligos by hashing 20-mer cores and comparing regions outside cores
  – Identify centers in clusters
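The seed-hashing idea for unique oligos can be sketched as follows. This is a simplified illustration, not the published algorithm: similarity is taken to mean Hamming distance at most `max_mismatch` at the same alignment offset (no indels), and the function name and all parameter values are assumptions. By the pigeonhole principle, two oligos within d mismatches must share an exact seed at the same offset whenever the oligo length is at least (d + 1) times the seed length, so only oligos that collide on a seed need to be compared.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def unique_oligos(ests, oligo_len=8, seed_len=3, max_mismatch=1):
    """Return (est_id, start, oligo) for oligos with no similar copy in another EST."""
    # Hash every seed occurrence so that dissimilar strings are never compared.
    seed_index = defaultdict(list)
    for eid, seq in enumerate(ests):
        for p in range(len(seq) - seed_len + 1):
            seed_index[seq[p:p + seed_len]].append((eid, p))

    def similar_elsewhere(eid, oligo):
        for off in range(oligo_len - seed_len + 1):
            for other, p in seed_index[oligo[off:off + seed_len]]:
                start = p - off  # align the candidate on the shared seed
                if other == eid or start < 0:
                    continue
                cand = ests[other][start:start + oligo_len]
                if len(cand) == oligo_len and hamming(oligo, cand) <= max_mismatch:
                    return True
        return False

    return [
        (eid, s, seq[s:s + oligo_len])
        for eid, seq in enumerate(ests)
        for s in range(len(seq) - oligo_len + 1)
        if not similar_elsewhere(eid, seq[s:s + oligo_len])
    ]
```

The tiny default lengths are chosen only so the example is easy to trace by hand; the slides use 11-mer seeds on much longer oligos.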

Performance

• Input data:
  – 46145 Barley EST sequences, about 28 million base pairs, from the HarvEST database

• Time and space:
  – A couple of hours on a machine with a 1.2GHz CPU and 1GB RAM

• Accuracy in simulation
  – Relative error is below 2%

Oligonucleotide Fingerprinting of Ribosomal RNA Genes (OFRG) and Microorganism Classification

Andres Figueroa and Zheng Liu
Collaboration: J. Borneman (Plant Pathology) and M. Chrobak (CSE)
Fund: NSF BDI/NIH R01

Basic Idea

• rRNA genes (rDNA) can be used as an ID of species, especially microorganisms.

• Use microarray technology to identify the rDNAs of the microbes in a community. Oligonucleotide probes are designed to hybridize with the (unknown) rDNA clones in a sample.

• Analyze the hybridization result to obtain fingerprints.

Project Flowchart

[Flowchart: Sample (soil, mouse gut, plant tissue, etc.); extract rDNA to obtain sample rDNA; PCR to obtain a mixture of rDNA; clone (ligate and transform) into a clone library; PCR to obtain individual rDNA; print onto an array; hybridization with probes to obtain signal intensities; normalization to obtain normalized signal intensities; fingerprint assignment to obtain fingerprints; cluster to obtain a taxonomic tree.]

Project Structure

[Diagram: a web-based integrated platform combining the rDNA sequence DB, the OFRG management DB, genomic DBs, and expression data with probe set design, normalization, binarization of fingerprints, clustering, labeling of unknown clones, and the taxonomic tree.]

Future Work

• Complete rDNA sequence database (done)

• Create the OFRG management database (done)

• Intensity normalization/binarization using control information (partially done)

• Extend to [0,t], for t = 2,3,4,…

• Combine tools into an integrated platform

• A higher throughput system based on microbeads and polony sequencing technologies (NIH)

[Figures: a polony (polymerase colony); polonies hybridizing with different probes.]

Efficient Haplotyping Algorithms for Pedigree Data

Lan Liu, Bob Wang and Jing Xiao
Collaboration: Jing Li (CWRU) and Tim Chen (USC)
Fund: NSF CCR/NIH R01

An Example Pedigree: The British Royal Family

[Figure: pedigree chart with the following members: Elizabeth II of the United Kingdom; Prince Philip, Duke of Edinburgh; Prince Charles, Prince of Wales; Diana, Princess of Wales; Camilla, Duchess of Cornwall; Princess Anne, Princess Royal; Captain Mark Phillips; Commander Timothy Laurence; Prince Andrew, Duke of York; Sarah Margaret Ferguson; Prince Edward, Earl of Wessex; Sophie Rhys-Jones; Prince William of Wales; Prince Henry of Wales; Peter Phillips; Zara Phillips; Princess Beatrice of York; Princess Eugenie of York; Lady Louise Windsor.]

MRHC Problem

Find a minimum recombinant haplotype configuration from a given pedigree with genotype data

Assumptions:
• Mendelian law (no mutations)
• Recombination events are rare

Input (unphased data), one line per pedigree member:
(1 2)(1 2)(1 2) …
(1 2)(1 2)(2 2) …
(1 1)(1 2)(2 2) …
(1 2)(1 2)(1 2) …
(1 2)(2 2)(2 2) …
(1 1)(1 2)(2 2) …
(1 2)(1 2)(1 2) …
(1 1)(1 2)(2 2) …

Output (phased data), in the same order:
1|2 1|2 2|1 …
1|2 2|1 2|2 …
1|1 1|2 2|2 …
1|2 1|2 2|1 …
1|2 2|2 2|2 …
1|1 2|1 2|2 …
1|2 1|2 2|1 …
1|1 1|2 2|2 …

Motivations

• Haplotypes are more biologically meaningful than genotypes, since each haplotype of a child is inherited from one parent. Haplotype data are more informative and more valuable in determining the association between diseases and genes and in the study of human histories.

• The human genome project gave us the consensus genotype sequence of humans, but to understand the genetic effects on many complex diseases such as cancers, diabetes, and osteoporosis, genetic variations, which can be represented by haplotypes, are more important.

• Current techniques collect genotype data, so computational methods for deriving haplotypes from genotypes are in high demand.

• The international HapMap project is ongoing.

• It is generally believed that with parent/pedigree information, we can get more accurate haplotype and frequency estimates than from data without such information.

• Family-based association studies have been widely used. We would expect more family-based gene mapping methods that assume accurate haplotype information.

• Model-based statistical methods are not only computationally intensive but may also rely on assumptions that do not hold in real datasets.

Results

• MRHC is NP-hard
• Heuristic: block-extension algorithm
• Exact algorithms: member-based and locus-based dynamic programming
• ILP algorithm for MRHC with missing alleles
• Software: PedPhase
• Special cases:
  – Efficient algorithms for ZRHC based on systems of linear equations and low-stretch spanning trees
  – Locus-based dynamic programming for loopless pedigrees
• A data mining approach to gene association mapping
• Several results on genome-wide tag SNP selection via linkage disequilibrium

ILP Formulation

Objective function:

$\min \sum_{i \in \text{Non-Founders}} \sum_{j=1}^{m-1} \left( r_{i,j,1} + r_{i,j,2} \right)$

Subject to:
• Genotype constraints (0 means missing allele)
• Mendelian law of inheritance constraints
• Constraints for the r variables

[Slide equations: the formulas for the three constraint groups.]

Test Results on Real Data

The ZRHC Problem

Problem definition: Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Some Constraints

[Figure: example pedigrees (members 1 to 6) with the alleles 1 and 2 at each locus, illustrating the constraints on phase.]

The Constraints as Linear Equations

Note: The variables represent phase and the equations are over F(2) (that is, addition is mod 2).

The Final Linear System

O(mn) equations on O(mn) variables.

Standard Gaussian elimination gives rise to an $O(m^3 n^3)$ time algorithm.
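Solving a linear system with addition mod 2 by standard Gaussian elimination can be sketched as below. This is a generic Gauss-Jordan solver over the two-element field, shown only to illustrate the cubic-time baseline; how the actual ZRHC phase equations are built from a pedigree is not reproduced here, and the function name is an assumption.

```python
def solve_gf2(rows, rhs):
    """Solve A x = b with arithmetic mod 2.

    rows is a list of 0/1 coefficient lists, rhs a list of 0/1 values.
    Returns one solution (free variables set to 0) or None if inconsistent.
    """
    n_vars = len(rows[0]) if rows else 0
    aug = [list(r) + [b] for r, b in zip(rows, rhs)]
    pivot_cols = []
    r = 0
    for c in range(n_vars):
        pivot = next((i for i in range(r, len(aug)) if aug[i][c]), None)
        if pivot is None:
            continue  # no pivot in this column: the variable is free
        aug[r], aug[pivot] = aug[pivot], aug[r]
        for i in range(len(aug)):
            if i != r and aug[i][c]:
                # Adding rows mod 2 is a bitwise XOR.
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[r])]
        pivot_cols.append(c)
        r += 1
    # A zero row with right-hand side 1 means the system has no solution.
    if any(row[-1] for row in aug[r:]):
        return None
    x = [0] * n_vars
    for i, c in enumerate(pivot_cols):
        x[c] = aug[i][-1]
    return x
```

For a system with N equations and N unknowns this takes on the order of N^3 bit operations, which matches the stated bound when N = O(mn).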

A Faster Algorithm for ZRHC

• We have recently devised a faster algorithm for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$

[Figure: the $O(mn) \times O(mn)$ matrix is transformed into an $O(mn) \times O(n)$ matrix, and reducing redundancy then yields an $O(n \log^2 n \log\log n) \times O(n)$ matrix.]

Some Open Problems

• Faster (and reliable) method than ILP for large pedigrees

• The k-RHC problem for small k

• Probabilistic models for k-RHC (Jing Xiao)

• Incorporation of population models into pedigrees
  – Combine with the parsimony model
  – Combine with the perfect phylogeny model
  – Population of trios?

• Dealing with mutations, errors, and missing data

• Association mapping on/using pedigree data?

A High-Throughput Combinatorial Approach to Genome-Wide Ortholog Assignment

Zheng Fu, Wilson Shi, Vincent Peng
Collaboration: Liqing Zhang (Virginia Tech)
Fund: NSF IIS

Joint work with X. Chen, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi

Orthology

(from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)

[Figure: gene tree for mouse, chicken, and frog.]

• Homolog
  – Gene family

• Duplication
  – Paralog

• Speciation
  – Ortholog


Orthology – the more complicated picture

[Figure: gene tree with leaves A1, B1, C1, B2, C2, C3 from genomes G1, G2, G3, annotated with Speciation 1, Gene duplication 1, Speciation 2, and Gene duplication 2.]

• Outparalogs evolved via a duplication prior to a given speciation event.

• Inparalogs evolved via a duplication posterior to a given speciation event.

• A true exemplar is the direct descendant of the ancestral gene of a given set of inparalogs. A main ortholog pair is defined as the two true exemplar genes of two co-orthologous gene sets.

Significance

• Orthologous genes in different species are evolutionary and functional counterparts.

• Many methods use orthologs in a critical way:
  – Function inference
  – Protein structure prediction
  – Motif finding
  – Phylogenetic analysis
  – Pathway reconstruction
  – and more ...

• Identification of orthologs, especially exemplar genes, is a fundamental and challenging problem.

Ortholog Assignment Methods

• BBH: Best Bidirectional Hit (by BLASTn / BLASTp)

• COG: Cluster of Orthologous Groups (Tatusov et al., Science, 278: 631-637, 1997; Nucleic Acids Res., 28:33-36, 2000)

• TOGA: TIGR Orthologous Gene Alignments (Lee et al., Genome Res, 12: 493-502, 2002)

• INPARANOID: Identify Orthologs & Inparalogs (Remm et al., J Mol Biol. 314:1041-1052, 2001)

• OrthoMCL: a Markov Cluster algorithm (Li et al., Genome Res, 13: 2178-2189, 2003)

• Reconciled Tree: Gene tree vs. species tree (Yuan et al., Bioinformatics, 14:285-289, 2001)

• OrthoParaMap: Synteny regions (Cannon et al., BMC Bioinformatics 4(1):35, 2003)

• Shared Genomic Synteny: Synteny anchors and synteny blocks (Zheng et al., Bioinformatics 21:703-710, 2004)

• SOAR: System of Ortholog Assignment by Reversal (Chen et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2005)

Molecular Evolution

• Local mutation
  – Base substitution
  – Base insertion
  – Base deletion

• Global rearrangement and duplication
  – Inversion/Reversal
  – Translocation
  – Transposition
  – Fusion/Fission
  – Duplication/Loss

A complete ortholog assignment system should make use of information from both levels of molecular evolution.

Example

[Figure: an evolutionary scenario. The ancestral genome a1 b c a2 d e f g splits by speciation into two genomes. One lineage undergoes a duplication (giving a1 b c a2 d e f g a3) and a fission; the other undergoes a reversal and a duplication (adding a4).]

Given the evolutionary scenario in terms of gene order, main ortholog pairs and inparalogs could be identified in a straightforward way.

The Parsimony Approach

• Identify homologs using BLASTp.

• Reconstruct the evolutionary scenario on the basis of the parsimony principle: to assign orthologs, postulate the minimum possible number of rearrangement events and duplication events in the evolution of two closely related genomes since their split.

• The ortholog assignment problem can be formulated as a problem of finding a most parsimonious transformation from one genome into the other, without explicitly inferring their ancestral genome.

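RD (Reversal-Duplication) Distance

• RD distance: $RD(\cdot,\cdot) = R(\cdot,\cdot) + D(\cdot,\cdot)$
  – $R(\cdot,\cdot)$ denotes the number of rearrangement events in a most parsimonious transformation
  – $D(\cdot,\cdot)$ denotes the number of gene duplications in a most parsimonious transformation

[Slide example: two genomes over the genes a, b, c, d, e, f, g, with duplicated copies of a indexed 1 to 4, for which $RD(\cdot,\cdot) = 4$.]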

The Key Algorithmic Problem - SRDD

• Two related (unichromosomal) genomes
  – No inparalogs, i.e. no post-speciation duplications
  – No gene losses
  – Equal gene content

• Signed Reversal Distance with Duplicates
  – Given two related genomes
  – Only reversals have occurred
  – How to find a shortest sequence of reversals

• Almost untouched in the literature
  – Duplicated genes are present
  – Generalizes the problem of sorting by reversal

Sorting By Reversals Problem

• Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n)

• Input: Permutation π

• Output: A series of reversals ρ1, …, ρt transforming π into the identity permutation such that t is minimum


Sorting by Reversals: Example

• t = d(π) is the reversal distance of π
• Example:
  π = 3 4 2 1 5 6 7 10 9 8
      4 3 2 1 5 6 7 10 9 8
      4 3 2 1 5 6 7 8 9 10
      1 2 3 4 5 6 7 8 9 10
  So d(π) = 3
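The example can be checked mechanically. The sketch below applies the three reversals to the permutation and uses the standard breakpoint lower bound (each unsigned reversal removes at most two breakpoints) to confirm that no shorter series exists. The helper names and the 0-based inclusive index convention are choices made for this illustration.

```python
from math import ceil


def reverse(perm, i, j):
    """Reverse the segment perm[i..j] (0-based, inclusive)."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def breakpoints(perm):
    """Count adjacent pairs in 0, perm, n+1 that are not consecutive integers."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(abs(a - b) != 1 for a, b in zip(ext, ext[1:]))


pi = [3, 4, 2, 1, 5, 6, 7, 10, 9, 8]
steps = [(0, 1), (7, 9), (0, 3)]  # the three reversals of the example

current = pi
for i, j in steps:
    current = reverse(current, i, j)

lower_bound = ceil(breakpoints(pi) / 2)
print(current, lower_bound)
```

Here the permutation has 5 breakpoints, so at least 3 reversals are needed, and the 3 reversals shown reach the identity; hence the reversal distance is exactly 3.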

The MCSP Problem

• Minimum Common Substring Partition

• This may help eliminate many duplicates, but is different from syntenic blocks.

• Given two related genomes G and H, we have

$\frac{L(G,H) - 1}{2} \le d(G,H) \le L(G,H) - 1$

Example:
G: 3 1 2 -1 4
H: -4 1 2 3 1

An Outline of MSOAR

Input: Dataset A and Dataset B

Homology search:
1. Apply all-vs.-all comparison by BLASTp
2. Only select the BLAST hits with similarity score above the cutoff
3. Keep the top five bi-directional best hits

Assign orthologs by minimizing RD distance:
1. Apply suboptimal rules
2. Apply minimum common substring partition
3. Maximum cycle decomposition
4. Detect inparalogs by identifying “noise” gene pairs

Output: list of orthologous gene pairs
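The homology search step can be illustrated with a small filter over precomputed hit lists. This is a sketch under one reading of "top five bi-directional best hits", namely that a pair is kept when each gene is among the other's top k hits above the cutoff; the input dictionaries, function names, and default values are assumptions, and no BLAST call is made.

```python
def top_k(hits, k, cutoff):
    """Names of the k best-scoring hits whose score is at least cutoff."""
    ranked = sorted((h for h in hits if h[1] >= cutoff), key=lambda h: -h[1])
    return [gene for gene, _ in ranked[:k]]


def bidirectional_hits(hits_ab, hits_ba, k=5, cutoff=0.0):
    """Pairs (a, b) where b is in a's top k hits and a is in b's top k hits.

    hits_ab maps each gene of dataset A to a list of (gene_in_B, score);
    hits_ba is the same in the other direction.
    """
    top_ab = {a: set(top_k(h, k, cutoff)) for a, h in hits_ab.items()}
    top_ba = {b: set(top_k(h, k, cutoff)) for b, h in hits_ba.items()}
    return {
        (a, b)
        for a, bs in top_ab.items()
        for b in bs
        if a in top_ba.get(b, set())
    }
```

Setting `k=1` reduces this to the classical bidirectional best hit criterion; a larger k keeps more candidate pairs for the rearrangement-based assignment to choose from.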

Real Data

• Homo sapiens:
  – Build 36.1 human genome assembly (UCSC hg18, March 2006)
  – 20161 protein sequences in total

• Mus musculus:
  – Build 36 mouse genome assembly (UCSC mm8, February 2006)
  – 19199 protein sequences in total

MSOAR vs Inparanoid

• Validation: Official gene symbols extracted from the UniProt release 6.0 (September 2005)

• For 20161 human protein sequences and 19199 mouse protein sequences, MSOAR assigned 14362 orthologs between human and mouse, among which 11050 are true positives, 1748 are unknown pairs and 1508 are false positives, resulting in a sensitivity of 92.26% and a specificity of 87.99%.

• The comparison between MSOAR and Inparanoid (Remm et al., J. Mol. Biol., 2001)
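The reported specificity can be reproduced from the counts above if specificity is taken to mean TP / (TP + FP), with unknown pairs excluded. That definition is an assumption, although it matches the stated 87.99%. The size of the reference set behind the sensitivity figure is not given in the slides, so the value computed below is only a back-calculation, not a reported number.

```python
tp = 11050  # true positives reported for MSOAR
fp = 1508   # false positives reported for MSOAR

# Assumed definition: fraction of assessed assignments that are correct.
specificity = tp / (tp + fp)
print(f"specificity = {specificity:.2%}")

# Back-calculated (not from the slides): the number of reference ortholog
# pairs implied by a sensitivity of 92.26%, assuming sensitivity = TP / reference.
implied_reference_pairs = round(tp / 0.9226)
print(f"implied reference pairs = {implied_reference_pairs}")
```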

MSOAR vs INPARANOID

The ortholog pair SNRPB (human) and Snrpb (mouse) are not bi-directional best hits, so they could be missed by sequence-similarity-based ortholog assignment methods such as Inparanoid.

[Figure: genes shown on human chromosome 20 (SNRPB, STK35, TGM3, TGM6, ZNF343, TMC2, NOL5A, IDH3B) and on mouse chromosome 2 (Snrpb, Stk35, Tgm3, Tgm6, Tmc2, Nol5a, Idh3b).]

Validation by HCOP

• The HGNC Comparison of Orthology Predictions (HCOP) is a tool that integrates and displays the human-mouse orthology assertions made by Ensembl, Homologene, Inparanoid, PhIGS, MGD and HGNC. (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)

[Figure: distribution of the number of supports from HCOP. Horizontal axis: the number of supports (0 to 6). Vertical axis: the number of orthologs assigned by MSOAR (0 to 6000).]

Future Work

– More efficient algorithms for MCSP and MCD. The best approximation algorithm for MCSP has ratio O(k) (Kolman and Walen’06). Can the ratio be improved to O(1)?

– Refine the evolutionary model for MSOAR (transposition, tandem duplication, gene loss, etc.). How would the DCJ model fit in?

– Ortholog assignment for multiple genome comparison. The median problem.

– More explicit treatment of one-to-many and many-to-many orthology relationships.

– Take advantage of other sources of genomic information such as unique sequence tags, syntenic blocks, etc.

Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq Reads

Wei Li (UCR/Harvard) and Jianxing Feng (Tsinghua/Tongji)

This method was also reported in SLIDE later (Li et al., PNAS, 2011).

Separating Metagenomic Short Reads into Genomes via Clustering

Olga Tanaseichuk and James Borneman (UCR)

Metagenomics

• Genomics
  – Study of an organism's genome
  – Relies upon cultivation and isolation
  – > 99% of bacteria cannot be cultivated

• Metagenomics
  – Study of all organisms in an environmental sample by simultaneous sequencing of their genomes
  – Makes it possible to study organisms that cannot be isolated or are difficult to grow in a lab

Metagenomic Projects

The Acid Mine Drainage Project

• Motivation: to understand the mechanisms by which the microbes tolerate extremely acidic environments

• Simple community: 5 dominant species (3 bacteria and 2 archaea)

[Photo: The Tinto River in Spain (Credit: Carol Stoker)]

The Sargasso Sea Project

[Photo: A coral reef off the coast of Malden Island in Kiribati]

• A large-scale sequencing project in an environmental setting

• Identified more than 1 million putative genes (10 times more than in all databases at that time)

• ~1800 species

The Human-Microbiome Project

• Microbial community living in a host

• 100 trillion microbes

• 100 times more microbial than human genes

• Is there a core human microbiome?

• How do changes in the microbiome correlate with human health?

DNA Sequencing

• Next Generation Sequencing (NGS)
  – High-throughput
  – Cost- and time-effective
  – No cloning (reduced clonal biases)
  – Shorter read length compared to Sanger reads (~1000 bps)
    • Roche/454 (~450 bps)
    • Illumina/Solexa (35-100 bps)
    • ABI SOLiD (35–50 bps)
  – Due to rapid progress, sequencing lengths will increase

Goals of Metagenomics

• Phylogenetic diversity

• Metabolic pathways

• Genes that predominate in a given environment

• Genes for desirable enzymes

• ...

Ultimate goal: complete genomic sequences

Problem Formulation

• Given metagenomic reads, separate reads from different species (or groups of related species)

Difficulties

• Repeats in genomic sequences

• Sequencing errors

• Unknown number of species and abundance levels

• Common repeats in different genomes due to homologous sequences

[Figure: genomics versus metagenomics.]

Existing Approaches

• Similarity-Based
  – Similarity search against databases of known genomes or genes/proteins

• Composition-Based
  – Binning based on conserved compositional features of genomes

• Abundance-Based
  – Separate genomes by abundance levels

Algorithm: Overview

• Purpose: separating short paired-end reads from different genomes in a metagenomic dataset

• Two-phase heuristic algorithm
  – short reads
  – similar abundance levels
  – arbitrary abundance levels (in combination with AbundanceBin [Wu and Ye, RECOMB, 2010])

Algorithm: Definitions and Observations

Observation 1: Most of the l-mers in a bacterial genome are unique (l ~ 20, for most complete genomes)

[Figure: the ratio of unique l-mers to distinct l-mers. Unique l-mers occur only once; repeated l-mers occur more than once.]
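The ratio of unique l-mers to distinct l-mers is straightforward to compute for a single sequence. The sketch below counts l-mers on the forward strand only and ignores reverse complements, which is a simplifying assumption; the function name is hypothetical.

```python
from collections import Counter


def unique_lmer_ratio(genome, l=20):
    """Fraction of distinct l-mers that occur exactly once in the sequence."""
    counts = Counter(genome[i:i + l] for i in range(len(genome) - l + 1))
    distinct = len(counts)
    if distinct == 0:
        return 0.0
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / distinct
```

For instance, with `l=4` the sequence `ACGTACGTTT` has six distinct 4-mers, of which only `ACGT` is repeated, giving a ratio of 5/6.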


Observation 2: Most l-mers in a metagenome are unique (for l ~ 20 and genomes separated by sufficient phylogenetic distances)

[Figure: proportions of unique l-mers and repeated l-mers.]


Observation 3: Most of the repeats in a metagenome are individual (for l ~ 20 and genomes separated by sufficient phylogenetic distances)

[Figure: repeated l-mers divided into individual repeats and common repeats.]

Flowchart

Arbitrary Abundance Levels

• A significant abundance ratio is defined by the expected misclassification rate (>3%)

Experimental Results: Overview

• Lack of NGS metagenomic benchmarks

• Lack of algorithms in the literature to separate short NGS reads from different genomes

• Datasets
  – Tests on a variety of synthetic datasets with different numbers of genomes, phylogenetic distances and abundance ratios
  – Performance on a real metagenomic dataset from gut bacteriocytes of a glassy-winged sharpshooter

• Comparison
  – We modify the Velvet assembler [Zerbino and Birney, Genome Research, 2008] to work as a genome separator (clusters in Phase I are replaced by sets of l-mers from the Velvet contigs)
  – With CompostBin on longer reads

Experimental Results

• 182 synthetic datasets of 4 categories
  – 79 experiments for the same genus
  – 66 experiments for the same family
  – 29 experiments for the same order
  – 8 experiments for the same class

• Read length: 80 bps

• Coverage depth: ~15-30

• Equal abundance levels

• 2-10 genomes in each dataset

• Simulation: Metasim [Richter et al., PloS ONE, 2008]

• Phylogeny: NCBI taxonomy

Experimental Results

Experimental Results: Genomes with Different Abundance Levels

Experimental Results: Comparison with CompostBin

• Simulated paired-end Sanger reads from [Chatterji et al., RECOMB, 2008]
  – Handling longer reads (1000 bps)
    • Cut long reads into short reads of 80 bps
    • Linkage information is recovered in Phase II
  – Handling lower coverage depth (~3-6)
    • Choose a higher threshold K to separate repeats and unique l-mers in preprocessing

• Simulated paired-end Illumina reads
  – 80 bps, high coverage depth (~15-30)

Experimental Results: Comparison with CompostBin

Test     Abundance ratio   Phylogenetic distance
Test1    1:1               Species
Test2    1:1               Genus
Test3    1:1               Genus
Test4    1:1               Family
Test5    1:1               Family
Test6    1:1               Order
Test7    1:1:8             Family, Order
Test8    1:1:8             Order, Phylum
Test9    1:1:1:1:2:14      Species, Order, Family, Phylum, Kingdom

Experimental Results: Real Dataset

• Gut bacteriocytes of the glassy-winged sharpshooter, Homalodisca coagulata
  – Consists of reads from:
    • Baumannia cicadellinicola
    • Sulcia muelleri
    • Miscellaneous unclassified reads

• Sanger reads

• Performance is measured by the ability to separate reads from B. cicadellinicola and S. muelleri

• Performance
  – TOSS: sensitivity ~92% and error rate ~1.6%
  – CompostBin: error rate ~9%

Implementation of TOSS

• Implemented in C

• Running time and memory depend on

– Number and length of reads

– Total length of the genomes

• For 80 bps reads -- 0.5 GB of RAM per 1 Mbps

– 2-4 genomes, total length 2-6 Mbps – 1-3 h, 2-4 GB of RAM

– 15 genomes, total length 40 Mbps – 14 h, 20 GB of RAM

Questions? Comments?

Contact: Tao Jiang

Department of Computer Science and Engineering

University of California – Riverside

[email protected]/

~jiang
