Post on 22-Dec-2015
Research Projects
Tao Jiang’s Lab
Algorithms and Computational Biology Laboratory
Department of Computer Science and Engineering
University of California, Riverside
March, 2013
Project Overview
Predicting Operons by a Comparative Genomics Approach (DOE GtL)
Evolutionary Dynamics of Myb Gene DNA-binding Domains (NSF ITR)
Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes (NIH/NSF)
Efficient Selection of Unique and Popular Oligos for Large EST Databases (USDA/NSF)
Oligonucleotide Fingerprinting of Ribosomal RNA Genes and Microorganism Classification (NSF BDI/NIH)
Efficient Haplotyping Algorithms for Pedigree Data and Gene Association Mapping (NSF CCF and NIH)
High Throughput Ortholog Assignment via Genome Rearrangement (NSF IIS)
Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq short reads
Metagenomic Data Analysis
Predicting operons by a comparative genomics approach
This project aims to predict candidate operons in the genome of Synechococcus sp. WH8102, based on a comparative genomics approach. These candidate operons may provide helpful information for the construction of protein-protein interaction networks and functional pathways.
Xin Chen
Collaboration: Ying Xu (ORNL)
Fund: DOE GtL
Operon structures
Operons represent a basic organizational unit of genes in the complex hierarchical structure of biological processes in a cell. They are mainly used to facilitate efficient implementation of transcriptional regulation, especially in bacteria.
Biological characteristics of genes in an operon include:
• sharing certain regulatory elements
• arranged in tandem on the same strand
• separated by short distances
• well conserved across phylogenetically related species
• their functions are usually related
Existing methods for operon prediction
• Overbeek et al. (1999): gene pairs of close bidirectional best hits
• Salgado et al. (2000): close gene distances and gene functional classes
• Ermolaeva et al. (2001): the likelihood of conserved genes being an operon
• Craven et al. (2002): a probabilistic learning approach on whole genome
• Sabatti et al. (2002): a Bayesian classification scheme on gene microarray
• Zheng et al. (2002): based on information from metabolic pathways
Our approach based on comparative genomics
Comparative analysis is based on the idea that functional segments tend to evolve at a lower rate than nonfunctional segments, making well-conserved regions likely to be of great interest (Overbeek et al., 1999).
Pipeline:
1. Input: genome sequences with annotated genes
2. Pairwise comparison: run the blastp program with E-value = 1e-20 to obtain gene matches (homolog information)
3. Cluster conserved, nearby genes into candidate operons
4. Scoring: output a list of ranked operons

A score is given by:
1. product of E-values of the gene matches involved in an operon
2. intergenic distances in an operon (to be considered)
3. predictive reliability of promoter or terminator (to be considered)

Constraints:
1. neighbor genes separated by 100 bases or less
2. genes in an operon located on the same strand
3. gene sets conserved across two or more genomes
4. full matching required for a candidate operon
5. promoter and terminator (to be considered)
6. pathway information (to be considered)
Implementation details
• Data preparation: data for three genomes downloaded from the ORNL website (http://compbio.ornl.gov/channel/index.html).
• Pairwise comparison: blastp with E-value < 1e-20, a bipartite gene matching graph. Same COG ID.
• Gene clustering:
– neighbor genes separated by 100 bases or less
– genes in an operon located on the same strand
– gene sets conserved across two or more genomes
– full matching required for a candidate operon
• Scoring: product of E-values of all gene matches involved; operons with lower scores are output earlier
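The clustering and scoring rules above can be illustrated with a minimal Python sketch. It is not the lab's implementation: the `Gene` record, the function names, and the use of one E-value per gene (its best match in another genome) are assumptions made for illustration, and the cross-genome conservation and full-matching checks are left out.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Gene:
    # Hypothetical record; coordinates are 1-based and inclusive.
    name: str
    start: int
    end: int
    strand: str    # "+" or "-"
    evalue: float  # assumed: E-value of the gene's match in another genome


def candidate_runs(genes, max_gap=100):
    """Group genes into runs of same-strand neighbors whose intergenic
    distance is at most max_gap bases (overlaps give negative distances)."""
    runs, current = [], []
    for g in sorted(genes, key=lambda gene: gene.start):
        same_run = (
            current
            and g.strand == current[-1].strand
            and g.start - current[-1].end - 1 <= max_gap
        )
        if same_run:
            current.append(g)
        else:
            if len(current) >= 2:  # a candidate needs at least two genes
                runs.append(current)
            current = [g]
    if len(current) >= 2:
        runs.append(current)
    return runs


def score(run):
    """Product of the E-values of the gene matches; lower is better."""
    return prod(g.evalue for g in run)


def rank_candidates(genes, max_gap=100):
    """Candidate runs ordered so that lower scores come first."""
    return sorted(candidate_runs(genes, max_gap), key=score)
```

Calling `rank_candidates` on a list of `Gene` records returns the candidate runs with the smallest E-value product first, mirroring the "lower scores output earlier" rule.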
[Figure: The gene matching graph for three cyanobacterial genomes. Genome a has genes a1 to a8, Genome b has genes b1 to b7, and Genome c has genes c1 to c7.]
The numbers of gene matching pairs:
• 1593 between syn_wh and par_med
• 2242 between syn_wh and par_mit
• 1579 between par_med and par_mit
Three genomes with their gene numbers:
• Synechococcus sp. WH8102 (2520)
• Prochlorococcus marinus sp. MED4 (1700)
• Prochlorococcus marinus sp. MIT9313 (2267)
Predicted operons in Synechococcus sp. WH8102
A total of 242 operons output from Synechococcus sp. WH8102:
• 126 operons shared with both of the other two genomes
• 26 operons shared with pmar_med only
• 90 operons shared with pmar_mit only
(See operons at http://www.cs.ucr.edu/~xinchen/operons.htm)
Several observations on the putative operons
• The average size of putative operons is 2.88, very close to 3;
• The two most frequent intergenic distances are –4 and –1 (overlaps);
• All operons in Synechococcus sp. WH8102 are on the positive strand;
• Matching genes have the same COG IDs across three genomes.
Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, and Y. Xu. GIW’2003.
X. Chen, Z. Su, Y. Xu, and T. Jiang. GIW’2004 (the best paper award).
X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang. Nucleic Acids Research, 2004.
Ongoing work
• Look for a way of predicting promoters and terminators upstream and downstream of candidate operons.
• Find a method to validate/score putative operons by promoter/terminator results.
• Incorporate additional information like intergenic distances and predicted promoters into the scoring system.
• Pathway information to be considered.
Evolutionary Dynamics of Myb Gene DNA-binding Domains
Li Jia
Collaboration: Michael Clegg (Botany)
Fund: NSF ITR
Motivation
Natural selection on changes of “regulatory genes” that regulate the timing or rate of development must be required for evolution. (Britten and Davidson, 1969 and 1971)
Natural selection on transcription factors should provide one of the predominant mechanisms for the generation of novel phenotypes.
Genes coding for transcriptional regulators, by organism:

Organism          Total number of genes   Transcriptional regulators   Percentage of total gene number
A. thaliana       ~25,000                 ~1,500                       ~5%
O. sativa         ~50,000                 ~200                         ~4%
C. elegans        ~18,000                 ~700                         ~5%
D. melanogaster   ~15,000                 ~800                         ~6%
H. sapiens        ~35,000                 ~3,000                       ~9%
M. musculus       ~30,000                 ~1,800                       ~6%
The Crucial Role of TFs
[Diagram: signaling molecules, TFs and target genes; WHEN? WHERE? HOW?]
R2R3-MYB
Structure:
• DNA-binding domain: repeats R2 and R3 (Helix1, Helix2, Helix3)
• Flexible domain
• Activation domain
Functions (through MYB target genes):
• Differentiation
• Proliferation
• Metabolism
• Secondary metabolism
• Cell shape
• Disease resistance
• Stress response
OBJECTIVE
To unveil the molecular dynamics that underlie the evolution of TFs (Myb)
Infer Positive Selection Sites
(based on dN/dS analysis, i.e. synonymous vs. nonsynonymous mutation rates, in the duplication history of the R2R3-Myb gene family)
[Figure: positive selection counts (y-axis, 0 to 20) by amino acid position (x-axis, 1 to 96) across the R2 and R3 repeats, each with Helix1, Helix2 and Helix3, for A. thaliana.]
Jia, Clegg and Jiang (2003) Plant Mol. Biol.
Positive Selection Sites
Jia, Clegg and Jiang (2003) Plant Mol. Biol.
Jia, Clegg and Jiang (2004) Plant Physiol.

A. thaliana (dicot)
Category            Sites   Counts   Percentage   Counts/site
Full R2R3 region    1       531      100%         5.4
R2 domain
  Helix1            14      173      33%          12.4*
  Helix2            7       83       16%          11.6*
  Helix3            10      8        1.5%         0.8
R3 domain
  Helix1            14      119      22%          8.5**
  Helix2            7       33       6%           4.7*
  Helix3            10      1        0.2%         0.1

O. sativa (monocot)
Category            Sites   Count               Percentage          Count/site
                            indica   japonica   indica   japonica   indica   japonica
Full R2R3 region    103     52       380        100%     100%       0.5      3.7
R2 domain
  Helix1            15      12       61         23%      16%        0.8**    4.1**
  Helix2            7       14       73         27%      19%        2.0**    10.4**
  Helix3            10      0        0          0%       0%         0.0      0.0
R3 domain
  Helix1            14      16       197        31%      52%        1.1*     14.1**
  Helix2            6       2        9          4%       2%         0.3      1.5**
  Helix3            10      0        0          0%       0%         0.0      0.0
Co-evolved Helices

                   japonica   indica    Arabidopsis
r (R2, R3)         0.69**     0.68**    0.69**
r (R2-1, R3-1)     N/A        0.15      0.11
r (R2-2, R3-2)     0.40**     0.62**    0.65**
r (R2-3, R3-3)     0.38       0.29      0.2

Jia, Clegg and Jiang (2004) Plant Physiol.
SUMMARY
1) Positive selection sites: positive selection pressure works through the first and second helices of the R2R3 repeats rather than the third helices, due to their structural characteristics.
2) Co-evolution patterns: the pairing correlations between the related secondary structures are functionally important in preserving the conformation of the specific protein folding pocket (the second helices).
APPLICATIONS:
• determine protein-DNA interaction regions of transcription factors based on their primary codon sequences
• genetically modify MYB structure to improve economically important traits
Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes
Chuhu Yang
Collaboration: F.M. Sladek (Cell Biology, Neuroscience)
Fund: NIH/NSF
HNF4: an important TF
• An important TF that regulates the expression of many genes, especially some liver-specific genes; it also plays an important role in development.
• It has been demonstrated to regulate the expression of over 60 genes.
• Researchers expect to find more HNF4 target genes.
Related to many human diseases such as diabetes, hemophilia, hepatitis, etc.
[Figure: HNF4 and its targets (apolipoproteins, PEPCK, L-PK, HNF1, CYP genes, ACO, HBV, BPG, OTC, MCAD, EPO, anti-thrombin, coagulation factors, drug metabolism) linked to diseases: thrombosis, atherosclerosis, diabetes, hemophilia, hypoxia, cancer, MCAD deficiency, OTC deficiency.]
HNF4 is highly conserved in many different organisms
[Figure: domain comparison of HNF4 from human, rat/mouse, Xenopus and Drosophila, showing the DNA binding (Zn++) and transactivation (ligand?) domains; % = amino acid identity.]
Our previous work
• Collected 71 HNF4 binding sequences from literature.
• Developed software based on an optimized (or permuted) Markov model and trained it with the 71 known sequences.
• Searched –500 to +100 regions (relative to transcription start sites) of all the human genes in UCSC database.
• Predicted 840 potential HNF4 binding sites in the human genome.
• Verified in vitro 77 new HNF4 binding sequences, resulting in a total of 137 HNF4 binding sequences.
• This work has been summarized in a paper published in Bioinformatics (Vol. 18 Suppl. 2, 2002).
Current work
Search the promoter regions of all the human genes with 137 HNF4 binding sequences for potential HNF4 target genes in human.
Search the promoter regions of all the mouse genes with 137 HNF4 binding sequences for potential HNF4 target genes in mouse.
Compare HNF4 target genes in both human and mouse genomes.
Do in vivo experiment to verify potential HNF4 target genes.
Future work
Optimize current software so that it can predict HNF4 binding sites more accurately.
Study the functions of all HNF4 target genes, cluster them into different functional groups and study the relationship between different groups.
Set up regulatory networks of all HNF4 target genes in human and mouse genomes.
Sequence weighting: a new approach to constructing a PSSM (or PWM) for motif finding from ChIP and gene expression data.
Efficient Selection of Unique and Popular Oligos for Large EST Databases
Jie Zheng
Collaboration: Stefano Lonardi and Timothy J. Close (Botany)
Funding: USDA / NSF
Problems of Oligo Selection
(for the Barley EST data in HarvEST)
• Unique Oligo Problem
– Selection of oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately) in any other EST
• Popular Oligo Problem
– Selection of oligos that appear (exactly or approximately) in many ESTs
Applications
• Unique oligos
– PCR primer designs
– Microarray probe designs
• Popular oligos
– Useful in screening genomic libraries (such as BAC libraries) for gene-rich regions
Methods
• Basic idea
– Separate dissimilar strings as early as possible to reduce the search space
• Algorithm for unique oligos
– Group similar oligos by hashing 11-mer seeds, and disqualify oligos similar to oligos in other ESTs
• Algorithm for popular oligos
– Cluster similar oligos by hashing 20-mer cores and comparing regions outside cores
– Identify centers in clusters
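The seed-hashing idea for unique oligos can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the published algorithm: similarity is taken to be Hamming distance at most `max_mismatch`, seeds are non-overlapping substrings at fixed offsets, and the defaults (8-mer oligos, 4-mer seeds) are far smaller than the 11-mer seeds mentioned above. The function and parameter names are hypothetical. By the pigeonhole principle, two oligos with at most `max_mismatch` mismatches share at least one exact seed at the same offset when there are more seeds than allowed mismatches, so only oligos in the same bucket need to be compared.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def unique_oligos(ests, q=8, seed_len=4, max_mismatch=1):
    """Return {est_index: set of q-mers} that have no match within
    max_mismatch mismatches in any other EST.
    Assumes q // seed_len > max_mismatch so the pigeonhole argument holds."""
    offsets = range(0, q - seed_len + 1, seed_len)
    buckets = defaultdict(set)  # (offset, seed) -> {(est_index, oligo)}
    oligos = set()
    for est_id, seq in enumerate(ests):
        for i in range(len(seq) - q + 1):
            oligo = seq[i:i + q]
            oligos.add((est_id, oligo))
            for off in offsets:
                buckets[(off, oligo[off:off + seed_len])].add((est_id, oligo))

    result = defaultdict(set)
    for est_id, oligo in oligos:
        # Only oligos sharing a seed bucket can be within max_mismatch.
        clash = any(
            other_id != est_id and hamming(oligo, other) <= max_mismatch
            for off in offsets
            for other_id, other in buckets[(off, oligo[off:off + seed_len])]
        )
        if not clash:
            result[est_id].add(oligo)
    return dict(result)
```

Bucketing by seed is what "separates dissimilar strings early": oligos that share no seed are never compared at all.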
Performance
• Input data:
– 46145 Barley EST sequences, about 28 million base pairs, from the HarvEST database
• Time and space:
– A couple of hours on a machine with a 1.2 GHz CPU and 1 GB of RAM
• Accuracy in simulation:
– Relative error is below 2%
Oligonucleotide Fingerprinting of Ribosomal RNA Genes (OFRG) and Microorganism Classification
Andres Figueroa and Zheng Liu
Collaboration: J. Borneman (Plant Pathology) and M. Chrobak (CSE)
Fund: NSF BDI/NIH R01
Basic Idea
• rRNA genes (rDNA) can be used as an ID of species, especially microorganisms.
• Use microarray technology to identify the rDNAs of the microbes in a community. Oligonucleotide probes are designed to hybridize with the (unknown) rDNA clones in a sample.
• Analyze the hybridization result to obtain fingerprints.
Project Flowchart
[Flowchart: Sample (soil, mouse gut, plant tissue, etc.) → Extract rDNA → Sample rDNA → PCR → Mixture of rDNA → Clone: ligate and transform → Clone library → PCR → Individual rDNA → Array → Hybridization with probes → Signal intensities → Normalization → Normalized signal intensities → Fingerprint assignment → Fingerprints → Cluster → Taxonomic tree]
Project Structure
[Diagram: a web-based integrated platform connecting the rDNA sequence DB, genomic DBs, expression data, the OFRG management DB, probe set design, fingerprint binarization, clustering, the taxonomic tree, and labeling of unknown clones]
Future Work
• Complete rDNA sequence database (done)
• Create the OFRG management database (done)
• Intensity normalization/binarization using control information (partially done)
• Extend to [0,t], for t = 2,3,4,…
• Combine tools into an integrated platform
• A higher throughput system based on microbeads and polony sequencing technologies (NIH)
[Figure: Polony (polymerase colony)]
[Figure: Polony hybridizing with different probes]
Efficient Haplotyping Algorithms for Pedigree Data
Lan Liu, Bob Wang and Jing Xiao
Collaboration: Jing Li (CWRU) and Tim Chen (USC)
Fund: NSF CCR/NIH R01
An Example Pedigree: The British Royal Family
[Figure: pedigree of Elizabeth II of the United Kingdom and Prince Philip, Duke of Edinburgh, with their children and grandchildren]
• Prince Charles, Prince of Wales (spouses: Diana, Princess of Wales; Camilla, Duchess of Cornwall); children: Prince William of Wales, Prince Henry of Wales
• Princess Anne, Princess Royal (spouses: Captain Mark Phillips; Commander Timothy Laurence); children: Peter Phillips, Zara Phillips
• Prince Andrew, Duke of York (spouse: Sarah Margaret Ferguson); children: Princess Beatrice of York, Princess Eugenie of York
• Prince Edward, Earl of Wessex (spouse: Sophie Rhys-Jones); child: Lady Louise Windsor
MRHC Problem
Find a minimum recombinant haplotype configuration from a given pedigree with genotype data.
Assumptions:
• Mendelian law (no mutations)
• Recombination events are rare
[Figure: example pedigree with eight members, showing the genotypes given as input and the haplotypes produced as output]

Input (unphased data)     Output (phased data)
(1 2)(1 2)(1 2) …         1|2 1|2 2|1 …
(1 2)(1 2)(2 2) …         1|2 2|1 2|2 …
(1 1)(1 2)(2 2) …         1|1 1|2 2|2 …
(1 2)(1 2)(1 2) …         1|2 1|2 2|1 …
(1 2)(2 2)(2 2) …         1|2 2|2 2|2 …
(1 1)(1 2)(2 2) …         1|1 2|1 2|2 …
(1 2)(1 2)(1 2) …         1|2 1|2 2|1 …
(1 1)(1 2)(2 2) …         1|1 1|2 2|2 …
Motivations
• Haplotypes are more biologically meaningful than genotypes since each haplotype of a child is inherited from one parent. Haplotype data are more informative and more valuable in determining the association between diseases and genes and in the study of human histories.
• The human genome project gave us the consensus genotype sequence of humans, but to understand the genetic effects on many complex diseases such as cancers, diabetes and osteoporosis, the genetic variations, which can be represented by haplotypes, are more important.
• Current techniques collect genotype data. Computational methods for deriving haplotypes from genotypes are in high demand.
• The international HapMap project is ongoing.
• It is generally believed that with parent/pedigree information, we could get more accurate haplotype and frequency estimates than from data without such information.
• Family-based association studies have been widely used. We would expect more family-based gene mapping methods that assume accurate haplotype information.
• Model-based statistical methods are not only computationally intensive but may also rely on assumptions that do not hold in real datasets.
Results
• MRHC is NP-hard
• Heuristic: block-extension algorithm
• Exact algorithms: member-based and locus-based dynamic programming
• ILP algorithm for MRHC with missing alleles
• Software: PedPhase
• Special cases:
– Efficient algorithms for ZRHC based on systems of linear equations and low stretch spanning trees
– Locus-based dynamic programming for loopless pedigrees
• A data mining approach to gene association mapping
• Several results on genome-wide TagSNP selection via linkage disequilibrium
ILP Formulation
Objective function:
$$\min \sum_{i \in \text{Non-Founders}} \sum_{j=1}^{m-1} \left( r_{i,1}^{j} + r_{i,2}^{j} \right)$$
Subject to:
• Genotype constraints (0 means missing allele)
• Mendelian law of inheritance constraints
• Constraints for the r variables
[The constraint equations are not legible in this transcript.]
Test Results on Real Data
The ZRHC Problem
Problem definition: Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.
Some Constraints
[Figure: small example pedigrees with members labeled 3 to 6 and genotypes such as (1 2) and (1 1), illustrating the constraints on the phases.]
The Constraints as Linear Equations
Note: The variables represent phase and the equations are over F(2) (that is, addition is mod 2).
The Final Linear System
O(mn) equations on O(mn) variables.
Standard Gaussian elimination gives rise to an $O(m^3 n^3)$ time algorithm.
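As an illustration of solving such a system, here is a minimal Python sketch of Gaussian elimination over the two-element field. It is generic and not the ZRHC-specific algorithm: the function name, the equation encoding, and the bitmask representation are choices made for this example, and how the phase equations are derived from a pedigree is not shown.

```python
def solve_gf2(equations, num_vars):
    """Solve a linear system with addition mod 2.

    equations: list of (variable_indices, rhs_bit), meaning that the sum
    of the listed variables equals rhs_bit (mod 2).
    Returns one solution as a list of 0/1 values (free variables set to 0),
    or None if the system is inconsistent.
    """
    pivots = []  # (pivot_column, row_mask, rhs), kept fully reduced
    for variables, rhs in equations:
        mask = 0
        for v in variables:
            mask ^= 1 << v  # x + x = 0, so repeated variables cancel
        rhs &= 1
        # Eliminate every existing pivot column from the new row.
        for col, pmask, prhs in pivots:
            if (mask >> col) & 1:
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs:
                return None  # reduced to 0 = 1: no solution
            continue         # redundant equation
        col = mask.bit_length() - 1
        # Remove the new pivot column from the older pivot rows.
        pivots = [
            (c, pm ^ mask, pr ^ rhs) if (pm >> col) & 1 else (c, pm, pr)
            for c, pm, pr in pivots
        ]
        pivots.append((col, mask, rhs))

    solution = [0] * num_vars
    for col, _, rhs in pivots:
        solution[col] = rhs  # all free variables are 0
    return solution
```

For example, `solve_gf2([([0, 1], 1), ([1, 2], 0), ([0, 2], 1)], 3)` returns an assignment satisfying all three equations, and redundant equations like the third one are detected and skipped along the way.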
A Faster Algorithm for ZRHC
• We have recently devised a faster algorithm for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$
[Figure: matrices of sizes O(mn) × O(mn), O(n) × O(mn) and O(n) × O(n log^2 n log log n), connected by the steps "Transform" and "Reduce redundancy".]
Some Open Problems
• Faster (and reliable) method than ILP for large pedigrees
• The k-RHC problem for small k
• Probabilistic models for k-RHC (Xiao Jing)
• Incorporation of population models into pedigrees
– Combine with the parsimony model
– Combine with the perfect phylogeny model
– Population of trios?
• Dealing with mutations, errors, and missing data
• Association mapping on/using pedigree data?
A High-Throughput Combinatorial Approach to Genome-Wide Ortholog Assignment
Zheng Fu, Wilson Shi, Vincent Peng
Collaboration: Liqing Zhang (Virginia Tech)
Fund: NSF IIS
Joint work with X. Chen, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi
Orthology
(from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)
[Figure: gene tree for mouse, chicken and frog]
• Homolog
– Gene family
• Duplication
– Paralog
• Speciation
– Ortholog
Orthology – the more complicated picture
[Figure: gene tree for species A, B and C with genes A1, B1, C1, B2, C2, C3 (groups G1, G2, G3), marked with the events Speciation 1, Gene duplication 1, Gene duplication 2 and Speciation 2.]
• Outparalogs evolved via a duplication prior to a given speciation event.
• Inparalogs evolved via a duplication posterior to a given speciation event.
• A true exemplar is the direct descendant of the ancestral gene of a given set of inparalogs. A main ortholog pair is defined as the two true exemplar genes of two co-orthologous gene sets.
Significance
• Orthologous genes in different species are evolutionary and functional counterparts.
• Many methods use orthologs in a critical way:
– Function inference
– Protein structure prediction
– Motif finding
– Phylogenetic analysis
– Pathway reconstruction
– and more ...
• Identification of orthologs, especially exemplar genes, is a fundamental and challenging problem.
Ortholog Assignment Methods
• BBH: Best Bidirectional Hit (by BLASTn / BLASTp)
• COG: Cluster of Orthologous Groups (Tatusov et al., Science, 278: 631-637, 1997; Nucleic Acids Res., 28:33-36, 2000)
• TOGA: TIGR Orthologous Gene Alignments (Lee et al., Genome Res, 12: 493-502, 2002)
• INPARANOID: Identify Orthologs & Inparalogs (Remm et al., J Mol Biol. 314:1041-1052, 2001)
• OrthoMCL: a Markov Cluster algorithm (Li et al., Genome Res, 13: 2178-2189, 2003)
• Reconciled Tree: Gene tree vs. species tree (Yuan et al., Bioinformatics, 14:285-289, 2001)
• OrthoParaMap: Synteny regions (Cannon et al., BMC Bioinformatics 4(1):35, 2003)
• Shared Genomic Synteny: Synteny anchors and synteny blocks (Zheng et al., Bioinformatics 21:703-710, 2004)
• SOAR: System of Ortholog Assignment by Reversal (Chen et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2005)
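The simplest method in the list, the bidirectional best hit, can be sketched in a few lines of Python. The input format here is an assumption for illustration: each dictionary maps a gene to the similarity scores of its hits in the other genome (higher is better), as might be parsed from BLAST output. Ties are broken arbitrarily by `max`.

```python
def bidirectional_best_hits(hits_ab, hits_ba):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    hits_ab: {gene_a: {gene_b: score}} for searches of genome A against B
    hits_ba: {gene_b: {gene_a: score}} for searches of genome B against A
    """
    best_ab = {a: max(t, key=t.get) for a, t in hits_ab.items() if t}
    best_ba = {b: max(t, key=t.get) for b, t in hits_ba.items() if t}
    return sorted(
        (a, b) for a, b in best_ab.items() if best_ba.get(b) == a
    )
```

A gene whose best hit prefers a different partner is left unassigned, which is one way a purely similarity-based method can miss an ortholog pair.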
Molecular Evolution
• Local mutation
– Base substitution
– Base insertion
– Base deletion
• Global rearrangement and duplication
– Inversion/Reversal
– Translocation
– Transposition
– Fusion/Fission
– Duplication/Loss
A complete ortholog assignment system should make use of information from both levels of molecular evolution.
Example
[Figure: starting from the ancestral genome (a1 b c a2 d e f g), speciation produces two genomes that then undergo duplications (creating a3 and a4), a reversal and a fission.]
Given the evolutionary scenario in terms of gene order, main ortholog pairs and inparalogs could be identified in a straightforward way.
The Parsimony Approach
• Identify homologs using BLASTp.
• Reconstruct the evolutionary scenario on the basis of the parsimony principle: postulate the minimum possible number of rearrangement events and duplication events in the evolution of two closely related genomes since their splitting so as to assign orthologs.
• Ortholog assignment problem could be formulated as a problem of finding a most parsimonious transformation from one genome into the other, without explicitly inferring their ancestral genome.
RD (Reversal-Duplication) Distance
• RD distance: $RD(G, H) = R(G, H) + D(G, H)$
– $R(G, H)$ denotes the number of rearrangement events in a most parsimonious transformation
– $D(G, H)$ denotes the number of gene duplications in a most parsimonious transformation
• Example: for two genomes over the genes a, b, c, d, e, f, g, one containing the copies a1, a2, a3 and the other containing a1, a2, a4, $RD(G, H) = 4$
The Key Algorithmic Problem: SRDD
• Two related (unichromosomal) genomes
– No inparalogs, i.e. no post-speciation duplications
– No gene losses
– Equal gene content
• Signed Reversal Distance with Duplicates
– Given two related genomes
– Only reversals have occurred
– How to find a shortest sequence of reversals
• Almost untouched in the literature
– Duplicated genes are present
– Generalizes the problem of sorting by reversals
Sorting by Reversals Problem
• Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n)
• Input: Permutation π
• Output: A series of reversals ρ1, …, ρt transforming π into the identity permutation such that t is minimum
Sorting by Reversals: Example
• t = d(π): the reversal distance of π
• Example:
π = 3 4 2 1 5 6 7 10 9 8
    4 3 2 1 5 6 7 10 9 8
    4 3 2 1 5 6 7 8 9 10
    1 2 3 4 5 6 7 8 9 10
So d(π) = 3
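The reversal distance in this small example can be checked by exhaustive search. The Python sketch below is a plain breadth-first search over unsigned permutations, written for illustration only (the function name is hypothetical). Its cost grows exponentially with the distance, so it is only practical for short permutations with small distances, and it does not handle signs or duplicated genes.

```python
from collections import deque


def reversal_distance(perm):
    """Minimum number of segment reversals that sort perm (unsigned),
    found by breadth-first search over permutations."""
    start = tuple(perm)
    target = tuple(sorted(perm))
    if start == target:
        return 0
    n = len(start)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        p, dist = queue.popleft()
        # Try every reversal of a segment p[i:j] of length at least 2.
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                q = p[:i] + p[i:j][::-1] + p[j:]
                if q == target:
                    return dist + 1
                if q not in seen:
                    seen.add(q)
                    queue.append((q, dist + 1))
    return None  # not reached: every permutation can be sorted
```

Because states are visited in order of increasing distance, the first time the sorted permutation is generated gives the minimum number of reversals.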
The MCSP Problem
• Minimum Common Substring Partition
• This may help eliminate many duplicates, but is different from syntenic blocks.
• Given two related genomes G and H, we have
$$\lceil (L(G,H) - 1)/2 \rceil \le d(G,H) \le L(G,H) - 1$$
• Example:
G: 3 1 2 -1 4
H: -4 1 2 3 1
An Outline of MSOAR
Input: Dataset A and Dataset B
Homology search:
1. Apply all-vs.-all comparison by BLASTp
2. Only select the BLAST hits with similarity score above cutoff
3. Keep the top five bi-directional best hits
Assign orthologs by minimizing RD distance:
1. Apply suboptimal rules
2. Apply minimum common substring partition
3. Maximum cycle decomposition
4. Detect inparalogs by identifying “noise” gene pairs
Output: list of orthologous gene pairs
Real Data
• Homo sapiens:
– Build 36.1 human genome assembly (UCSC hg18, March 2006)
– 20161 protein sequences in total
• Mus musculus:
– Build 36 mouse genome assembly (UCSC mm8, February 2006)
– 19199 protein sequences in total
MSOAR vs Inparanoid
• Validation: Official gene symbols extracted from the UniProt release 6.0 (September 2005)
• For 20161 human protein sequences and 19199 mouse protein sequences, MSOAR assigned 14362 orthologs between Human and Mouse, among which 11050 are true positives, 1748 are unknown pairs and 1508 are false positives, resulting in a sensitivity of 92.26% and a specificity of 87.99%.
• The comparison between MSOAR and Inparanoid (Remm et al., J. Mol. Biol., 2001)
MSOAR vs INPARANOID
The ortholog pair SNRPB (Human) and Snrpb (Mouse) are not bi-directional best hits, so they could be missed by sequence-similarity based ortholog assignment methods like Inparanoid.
Human chromosome 20: SNRPB, STK35, TGM3, TGM6, ZNF343, TMC2, NOL5A, IDH3B
Mouse chromosome 2: Snrpb, Stk35, Tgm3, Tgm6, Tmc2, Nol5a, Idh3b
Validation by HCOP
• The HGNC Comparison of Orthology Predictions (HCOP) is a tool that integrates and displays the human-mouse orthology assertions made by Ensembl, Homologene, Inparanoid, PhIGS, MGD and HGNC. (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)
[Figure: Distribution of the number of supports from HCOP. x-axis: the number of supports (0 to 6); y-axis: the number of orthologs assigned by MSOAR (0 to 6000).]
Future Work
– More efficient algorithms for MCSP and MCD. The best approximation algorithm for MCSP has ratio O(k) (Kolman and Walen’06). Can the ratio be improved to O(1)?
– Refine the evolutionary model for MSOAR (transposition, tandem duplication, gene loss, etc.). How would the DCJ model fit in?
– Ortholog assignment for multiple genome comparison. The median problem.
– More explicit treatment of one-to-many and many-to-many orthology relationships.
– Take advantage of other sources of genomic information such as unique sequence tags, syntenic blocks, etc.
Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq Reads
Wei Li (UCR/Harvard) and Jianxing Feng (Tsinghua/Tongji)
This method was also reported in SLIDE later (Li et al., PNAS, 2011).
Separating Metagenomic Short Reads into Genomes via Clustering
Olga Tanaseichuk and James Borneman (UCR)
Metagenomics
• Genomics
– Study of an organism's genome
– Relies upon cultivation and isolation
– > 99% of bacteria cannot be cultivated
• Metagenomics
– Study of all organisms in an environmental sample by simultaneous sequencing of their genomes
– Makes it possible to study organisms that can’t be isolated or are difficult to grow in a lab
Metagenomic Projects
The Acid Mine Drainage Project
• Motivation: to understand the mechanisms by which the microbes tolerate extremely acidic environments
• Simple community: 5 dominant species (3 bacteria and 2 archaea)
[Image: The Tinto River in Spain (Credit - Carol Stoker)]
The Sargasso Sea Project
[Image: A coral reef off the coast of Malden Island in Kiribati]
• Large-scale sequencing in an environmental setting
• Identified more than 1 million putative genes (10 times more than in all databases at that time)
• ~1800 species
The Human-Microbiome Project
• Microbial community living in a host
• 100 trillion microbes
• 100 times more microbial than human genes
• Is there a core human microbiome?
• How do changes in the microbiome correlate with human health?
DNA Sequencing
• Next Generation Sequencing (NGS)
– High-throughput
– Cost- and time-effective
– No cloning (reduced clonal biases)
– Shorter read length compared to Sanger reads (~1000 bps)
  • Roche/454 (~450 bps)
  • Illumina/Solexa (35-100 bps)
  • ABI SOLiD (35–50 bps)
– Due to rapid progress, sequencing lengths will increase
Goals of Metagenomics
• Phylogenetic diversity
• Metabolic pathways
• Genes that predominate in a given environment
• Genes for desirable enzymes
• ...
Ultimate goal: complete genomic sequences
Problem Formulation
• Given metagenomic reads, separate reads from different species (or groups of related species)
Difficulties
• Repeats in genomic sequences
• Sequencing errors
• Unknown number of species and abundance levels
• Common repeats in different genomes due to homologous sequences
Existing Approaches
• Similarity-Based
– Similarity search against databases of known genomes or genes/proteins
• Composition-Based
– Binning based on conserved compositional features of genomes
• Abundance-Based
– Separate genomes by abundance levels
Algorithm: Overview
• Purpose: separating short paired-end reads from different genomes in a metagenomic dataset
• Two-phase heuristic algorithm
– short reads
– similar abundance levels
– arbitrary abundance levels (in combination with AbundanceBin [Wu and Ye, RECOMB, 2010])
Algorithm: Definitions and Observations
• Repeated l-mers occur more than once; unique l-mers occur only once.
Observation 1: Most of the l-mers in a bacterial genome are unique (l ~ 20, for most complete genomes)
[Figure: the ratio of unique l-mers to distinct l-mers]
Observation 2: Most l-mers in a metagenome are unique (for l ~ 20 and genomes separated by sufficient phylogenetic distances)
[Figure: unique l-mers vs. repeated l-mers]
Observation 3: Most of the repeats in a metagenome are individual (for l ~ 20 and genomes separated by sufficient phylogenetic distances)
[Figure: repeated l-mers divided into individual repeats and common repeats]
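The quantities behind these observations can be computed directly on small inputs. The Python sketch below is for illustration, not the TOSS implementation: the function names are hypothetical, reverse complements are ignored, and it assumes that an individual repeat is a repeated l-mer found in only one genome while a common repeat is found in more than one.

```python
from collections import Counter


def lmer_counts(sequences, l):
    """Count every l-mer (substring of length l) over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            counts[seq[i:i + l]] += 1
    return counts


def unique_ratio(sequences, l):
    """Ratio of unique l-mers (count 1) to distinct l-mers."""
    counts = lmer_counts(sequences, l)
    if not counts:
        return 0.0
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)


def classify_repeats(genomes, l):
    """Split repeated l-mers into individual repeats (present in one
    genome only) and common repeats (present in several genomes)."""
    per_genome = [lmer_counts([g], l) for g in genomes]
    total = Counter()
    for counts in per_genome:
        total.update(counts)
    individual, common = set(), set()
    for lmer, n in total.items():
        if n > 1:
            owners = sum(1 for counts in per_genome if lmer in counts)
            (individual if owners == 1 else common).add(lmer)
    return individual, common
```

On real genomes the same counts would normally be taken with a dedicated k-mer counter, since a Python `Counter` over all 20-mers of several megabases is memory-hungry.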
Flowchart
Arbitrary Abundance Levels
• A significant abundance ratio is defined by the expected misclassification rate (>3%)
Experimental Results: Overview
• Lack of NGS metagenomic benchmarks
• Lack of algorithms in the literature to separate short NGS reads from different genomes
• Datasets
– Tests on a variety of synthetic datasets with different numbers of genomes, phylogenetic distances and abundance ratios
– Performance on a real metagenomic dataset from gut bacteriocytes of a glassy-winged sharpshooter
• Comparison
– We modify the Velvet assembler [Zerbino and Birney, Genome Research, 2008] to work as a genome separator (clusters in Phase I are replaced by sets of l-mers from the Velvet contigs)
– With CompostBin on longer reads
Experimental Results
• 182 synthetic datasets of 4 categories
– 79 experiments for the same genus
– 66 – same family
– 29 – same order
– 8 – same class
• Read length: 80 bps
• Coverage depth: ~15-30
• Equal abundance levels
• 2-10 genomes in each dataset
• Simulation: Metasim [Richter et al., PloS ONE, 2008]
• Phylogeny: NCBI taxonomy
Experimental Results
Experimental Results: Genomes with Different Abundance Levels
Experimental Results: Comparison with CompostBin
• Simulated paired-end Sanger reads from [Chatterji et al., RECOMB, 2008]
– Handling longer reads (1000 bps)
  • Cut long reads into short reads of 80 bps
  • Linkage information is recovered in Phase II
– Handling lower coverage depth (~3-6)
  • Choose a higher threshold K to separate repeats and unique l-mers in preprocessing
• Simulated paired-end Illumina reads
– 80 bps, high coverage depth (~15-30)
Experimental Results: Comparison with CompostBin

Test    Abundance ratio   Phylogenetic distance
Test1   1:1               Species
Test2   1:1               Genus
Test3   1:1               Genus
Test4   1:1               Family
Test5   1:1               Family
Test6   1:1               Order
Test7   1:1:8             Family, Order
Test8   1:1:8             Order, Phylum
Test9   1:1:1:1:2:14      Species, Order, Family, Phylum, Kingdom
Experimental Results: Real Dataset
• Gut bacteriocytes of glassy-winged sharpshooter, Homalodisca coagulata
– Consists of reads from:
  • Baumannia cicadellinicola
  • Sulcia muelleri
  • Miscellaneous unclassified reads
• Sanger reads
• Performance is measured on the ability to separate reads from B. cicadellinicola and S. muelleri
• Performance
– TOSS: sensitivity ~92% and error rate ~1.6%
– CompostBin: error rate ~9%
Implementation of TOSS
• Implemented in C
• Running time and memory depend on
– Number and length of reads
– Total length of the genomes
• For 80 bps reads -- 0.5 GB of RAM per 1 Mbps
– 2-4 genomes, total length 2-6 Mbps – 1-3 h, 2-4 GB of RAM
– 15 genomes, total length 40 Mbps – 14 h, 20 GB of RAM
Questions? Comments?
Contact: Tao Jiang
Department of Computer Science and Engineering
University of California – Riverside
~jiang