Compare and Contrast the Effects of Using Less Stringent Criteria in BLASTCLUST to a Novel Iterative Method for Identifying Gene Families
Virginia Earl-Mirowski
Agenda
- Introduction
  - Gene Family Extension Iterative (GFEI) Workflow
  - BLASTCLUST
- Goal
- Methodology
- Results
- Conclusions
- Future Direction
Terms
- Heuristic: An Algorithm That Will Yield Reasonable Results, Even If It Is Not Provably Optimal
- Filtration: The Idea That Good Alignments Contain Short Stretches With A High Degree Of Similarity
- High Scoring Pair (HSP): The Segment Pair With The Best Score Over All Segment Pairs In The Two Sequences
- Type I Error: Also Known As A False Positive. Rejecting The Null Hypothesis (H0) When It Is True.
- Type II Error: Also Known As A False Negative. Not Rejecting H0 When It Is False.
Introduction
- A Gene Family is a group of genes united by similar structure or function and evolved from a common ancestor.
- Gene Families Help To Describe How Genes Relate to Each Other
- Gene Families Provide a Mechanism to Predict the Function of Newly Identified Genes
Introduction Cont'd
- Novel Iterative Method Study Done By Loretta Goldberg (CBS '06)
- Utilized Various Techniques That Identify Members Of Gene Families Through Ancestral Predictions.
- Concluded That Additional Homologies Could Be Captured To Existing Clusters Using This Method
- Used The Default Settings Of Blastclust For The All Vs. All Comparisons. Restrictive Cut Offs May Result In Some Cluster Members Being Missed.
Gene Family Extension Iterative (GFEI) Workflow
1. Protein Sequences
2. All-vs-All Comparison
3. Sequence Alignment
4. Phylogenetic Tree Construction
5. Ancestral Sequence Prediction
6. Determine Effect of Ancestral Sequences
BLASTCLUST
- Rapidly Clusters Sequences Together Based On Similarity And Coverage Thresholds.
- Uses A Single-linkage Clustering Algorithm
- If A Sequence Is Considered A "Neighbor" To At Least One Sequence In The Cluster, Then It Will Be Placed In That Cluster
- Based On The Heuristic Technique Of Filtration
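The single-linkage rule can be illustrated with a small union-find sketch. This is not BLASTCLUST's own code; the function name and the toy sequence IDs are hypothetical, and the neighbor pairs are assumed to have been computed already.

```python
def single_linkage(ids, neighbor_pairs):
    """Group ids so that any two neighbors end up in the same cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in neighbor_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # one shared neighbor is enough to merge

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, members sorted for stable output.
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda m: (-len(m), m))


# "a" and "c" are not neighbors of each other, but both are neighbors of "b".
print(single_linkage(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]))
```

Because one neighbor link is sufficient, relaxing a threshold can only merge clusters, never split them.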
BLASTCLUST cont'd
- Coverage = HSP Len. / Sequence Len.
- The Coverage Threshold Is Applied To The Max. Or Min. Coverage Value Between The Two Sequences
- Blastclust -L Option Sets Minimum Length Coverage
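A minimal sketch of the coverage test, assuming that the minimum of the two coverage values is used when coverage is required on both sequences and the maximum when it is required on only one. The function names are hypothetical; the 0.9 default is the one quoted in these slides.

```python
def coverage(hsp_len, seq_len):
    """Coverage = HSP length / sequence length."""
    return hsp_len / seq_len


def passes_coverage(hsp_len_a, seq_len_a, hsp_len_b, seq_len_b,
                    threshold=0.9, both=True):
    """Apply the length coverage threshold to a pair of sequences.

    Assumption: 'both' means the smaller coverage must reach the threshold,
    otherwise the larger one is enough.
    """
    cov_a = coverage(hsp_len_a, seq_len_a)
    cov_b = coverage(hsp_len_b, seq_len_b)
    decisive = min(cov_a, cov_b) if both else max(cov_a, cov_b)
    return decisive >= threshold


# A 95-residue HSP covers 95% of a 100-residue protein but under 50% of a
# 200-residue one, so the pair passes only when one-sided coverage is allowed.
print(passes_coverage(95, 100, 95, 200, both=True))
print(passes_coverage(95, 100, 95, 200, both=False))
```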
BLASTCLUST cont'd
- Similarity Threshold Is Based On Either The BLAST Score Density Or The Percentage Of Identical Residues
- Score Density = BLAST Score / Min. HSP Length Of The Two Sequences
- Perc. Id. Res. = # Id. Res. / Align. Length
- Blastclust Option -S Sets The Similarity Threshold
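The two similarity measures can be sketched as follows. The function names are hypothetical, and the rule that a threshold below 3 is read as a score density while 3 or more is read as a percentage is an assumption taken from the legacy blastclust documentation as I recall it; check it against the installed version.

```python
def score_density(blast_score, hsp_len_a, hsp_len_b):
    """BLAST score divided by the shorter of the two HSP lengths."""
    return blast_score / min(hsp_len_a, hsp_len_b)


def percent_identity(identical_residues, alignment_length):
    """Identical residues over alignment length, expressed as a percentage."""
    return 100.0 * identical_residues / alignment_length


def meets_similarity(threshold, blast_score, hsp_len_a, hsp_len_b,
                     identical_residues, alignment_length):
    # Assumption: thresholds below 3 are score densities, otherwise percentages.
    if threshold < 3:
        return score_density(blast_score, hsp_len_a, hsp_len_b) >= threshold
    return percent_identity(identical_residues, alignment_length) >= threshold


# Score 180 over a shortest HSP of 100 residues gives a density of 1.8,
# which passes the default threshold of 1.75.
print(meets_similarity(1.75, 180, 100, 120, 45, 50))
```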
BLASTCLUST Defaults
- Minimum Length Coverage Threshold (-L): 0.9
- Similarity Threshold (-S): 1.75
- Required Coverage On Both Sequences (-b): True
Goal
- Compare And Contrast The Effects Of Using Less Stringent Criteria In Blastclust To The Novel Iterative Method, Which Made Use Of The Predicted Ancestral Sequences.
- Utilize The Neighbor / Hit List In The Blastclust Settings To Determine If This Would Reduce The Processing Time Of The Project Workflow.
Methodology
- Used The Species Database For Rattus norvegicus (Rat)
- Only Clusters Of Size ≥ 6 Were Used In Processing
- Executed The GFEI Workflow Where All Blastclust Threshold Options Used The Default Values.
- Created The Neighbor/Hit List In The Initial Execution Of Blastclust.
Methodology Cont'd
- Executed The GFEI Workflow With Different Settings Of The Blastclust Threshold Options (Threshold Runs)
- The Neighbor/Hit List Was Used In The Initial Clustering In All Threshold Runs.
- The CompareClust.pl Program Output Was Used For Analysis
BLASTCLUST Threshold Settings
Run #   Similarity Setting (-S)   Min. Length Coverage (-L)   Both / One Coverage (-b)
0       Default                   Default                     Both
1       Default                   0.8                         Both
2       Default                   Default                     One
3       1.65                      Default                     Default
4       1.65                      0.8                         Default
5       Default                   0.7                         Default
6       1.55                      0.7                         Default
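As an illustration only, the seven runs could be turned into reclustering command lines like this. The -L, -S and -b options and their defaults are the ones named in these slides; the -r (restore saved neighbors) and -o (output file) flags follow my reading of the legacy NCBI blastclust documentation and should be verified, and the file names are hypothetical. The sketch only builds argument lists and does not run anything.

```python
# Defaults quoted in the slides: -L 0.9, -S 1.75, coverage on both sequences.
DEFAULTS = {"L": 0.9, "S": 1.75, "b": "T"}

# Per-run overrides; anything not listed stays at its default.
RUNS = {
    0: {},
    1: {"L": 0.8},
    2: {"b": "F"},            # coverage required on one sequence only
    3: {"S": 1.65},
    4: {"L": 0.8, "S": 1.65},
    5: {"L": 0.7},
    6: {"L": 0.7, "S": 1.55},
}


def blastclust_args(run, neighbors="neighbors.bcl", out_prefix="run"):
    """Build an argument list that reclusters from a saved neighbor/hit list."""
    settings = {**DEFAULTS, **RUNS[run]}
    return [
        "blastclust",
        "-r", neighbors,                    # assumed flag: restore neighbors
        "-L", str(settings["L"]),
        "-S", str(settings["S"]),
        "-b", settings["b"],
        "-o", f"{out_prefix}{run}.clusters",  # assumed flag: output file
    ]


for run in sorted(RUNS):
    print(" ".join(blastclust_args(run)))
```

Reusing the saved neighbor list is what lets each threshold run skip the all-vs-all comparison.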
Results - Initial Clustering
Run #   Clusters Size ≥ 6   Largest Cluster   Largest Cluster Processed   Sequences Processed   Ave. # Sequences / Cluster
0       164                 195               195                         1780                  11
1       206                 240               240                         2287                  11
2       387                 354               354                         4963                  13
3       211                 260               260                         2308                  11
4       261                 270               270                         2891                  11
5       230                 270               270                         2702                  12
6       348                 272               272                         4106                  12
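A quick arithmetic check on the initial clustering results, assuming the average column is simply sequences processed divided by the number of clusters of size ≥ 6, rounded to the nearest whole number. The variable and function names are hypothetical.

```python
# run: (clusters of size >= 6, sequences processed, reported average)
INITIAL = {
    0: (164, 1780, 11),
    1: (206, 2287, 11),
    2: (387, 4963, 13),
    3: (211, 2308, 11),
    4: (261, 2891, 11),
    5: (230, 2702, 12),
    6: (348, 4106, 12),
}


def average_cluster_size(clusters, sequences):
    """Mean number of sequences per processed cluster."""
    return sequences / clusters


for run, (clusters, sequences, reported) in sorted(INITIAL.items()):
    computed = average_cluster_size(clusters, sequences)
    print(f"run {run}: {computed:.2f} (reported {reported})")
```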
Results - Clustering After Ancestor Sequences
Run #   Clusters Size ≥ 6   Largest Cluster   Ancestral Sequences Added   Sequences Processed   Additional Sequences Collected   Ave. # Sequences / Cluster
0       162                 403               466                         2256                  10                               14
1       205                 402               459                         2755                  9                                13
2       384                 354               512                         5537                  12                               14
3       209                 437               584                         2922                  30                               14
4       259                 447               626                         3555                  38                               14
5       229                 446               497                         3207                  12                               14
6       346                 453               844                         4987                  37                               14
Results - Comparison of Default Additional Sequences
Cluster Size Found in Threshold Run, Initial Cluster
Sequences Gained in Default Run (Cluster Size After Anc.)   Run 1   Run 2   Run 3   Run 4   Run 5   Run 6
51261002 (12)                                               1       18      8       9       1       9
47578043 (15)                                               1       1       10      10      1       15
47577567 (17)                                               3       3       17      17      3       21
47577553 (17)                                               3       3       17      17      3       21
47577549 (17)                                               3       3       17      17      3       21
47577509 (17)                                               2       2       17      17      2       21
47577501 (17)                                               2       2       17      17      2       21
47576157 (31)                                               1       3       20      20      3       23
47576153 (31)                                               1       1       20      20      1       23
47576151 (31)                                               1       1       20      20      1       23
Results - Threshold Runs 1, 2, 5
Cluster Size Found in Threshold Run, After Reclustering
Sequences Gained in Default Run (Cluster Size After Anc.)   Run 1   Run 2   Run 5
51261002 (12)                                               1       18      1
47578043 (15)                                               15      15      15
47577567 (17)                                               17      17      17
47577553 (17)                                               17      17      17
47577549 (17)                                               17      17      17
47577509 (17)                                               17      17      17
47577501 (17)                                               17      17      17
47576157 (31)                                               31      37      37
47576153 (31)                                               31      37      37
47576151 (31)                                               31      37      37
Results - New Sequences Captured in Threshold Runs
Threshold Run #   # of Additional Sequences Other Than Default Sequences   # of Sequence Overlap with Other Threshold Runs   Threshold Run # (Overlap)
1                 0                                                        0                                                 n/a
2                 2                                                        1                                                 5
3                 30                                                       27                                                4
4                 38                                                       27                                                3
5                 2                                                        1                                                 2
6                 37                                                       0                                                 n/a
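The two counts reported for each threshold run can be read as plain set operations. The sketch below uses made-up sequence IDs and hypothetical function names; it does not reproduce CompareClust.pl, whose internals are not described here.

```python
def new_versus_default(run_additional, default_additional):
    """Additional sequences a threshold run collected that the default run did not."""
    return set(run_additional) - set(default_additional)


def overlap(new_a, new_b):
    """New sequences shared by two threshold runs."""
    return set(new_a) & set(new_b)


# Toy IDs for illustration only.
default_run = {"s1", "s2", "s3"}
run_a = {"s1", "s4", "s5", "s6"}
run_b = {"s5", "s6", "s7"}

new_a = new_versus_default(run_a, default_run)
new_b = new_versus_default(run_b, default_run)
print(sorted(new_a), sorted(new_b), sorted(overlap(new_a, new_b)))
```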
Results - Threshold Run 6 Add. Sequences Distribution
Cluster Size   # of Additional Sequences Added   Composite Cluster?   Family
16             3                                 No                   G-protein-coupled Olfactory Receptor
18             2                                 No                   G-protein-coupled Olfactory Receptor
21             5                                 No                   Immunoglobulin Domain Variable Region
22             2                                 No                   G-protein-coupled Olfactory Receptor
30             5                                 No                   G-protein-coupled Olfactory Receptor
31             3                                 No                   G-protein-coupled Olfactory Receptor
34             5                                 Yes                  G-protein-coupled Olfactory Receptor
37             5                                 No                   Vomeronasal Organ Pheromone Receptor Family
37             5                                 Yes                  Zinc Finger Protein
Conclusions
- Relaxing The Coverage Option Only Produced Results That Were Very Similar To The Default Settings, With Few Additional New Sequences
- The Less Stringent Similarity Option Grouped More Sequences Initially And Clustered Additional Singletons Into Clusters Of Significant Size That Were Overlooked By Blastclust With Its Default Settings
- The Neighbor / Hit List Improved Processing Time
Future Directions
- Further Investigation of PAML Errors With Several Cluster Sizes.
- Type I and Type II error determination with the various alternative approaches.
Acknowledgements
I would like to thank my advisor, Dr. Michael Rosenberg, for his guidance, support, and direction regarding my project. I would also like to thank my husband Joe for his loving support and for the help he provided in preparing this report, and my committee members, Dr. Jeffrey Touchman and Dr. Martín Wojciechowski, not only for their feedback on this report but also for their enthusiasm and mentoring as instructors in the Computational Biosciences Program.
References
[1] Durbin R., Eddy S., Krogh A., Mitchison G., Biological Sequence Analysis, Cambridge University Press, 2004.
[2] Edwards R.V., Shields D.C., BADASP: predicting functional specificity in protein families using ancestral sequences, Bioinformatics, 2005, 21(22):4190-4191.
[3] Goldberg L., Rosenberg M., Extending Gene Families Via Predicted Ancestral Sequences, Internship Report, April 28, 2006.
[4] Henikoff S., Greene E.A., Pietrokovski S., Bork P., Attwood T.K., Hood L., Gene Families: The Taxonomy of Protein Paralogs and Chimeras, Science, 1997, 278:609-614.
[5] Korf I., Yandell M., Bedell J., BLAST, O'Reilly & Associates, CA, 2003.
[6] Masseroli M., Bellistri E., Franceschini A., Pinciroli F., Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists, BMC Bioinformatics, 2007, 8(Supp 1):S14.
[7] Jones N.C., Pevzner P.A., An Introduction to Bioinformatics Algorithms, A Bradford Book, Massachusetts, 2004.
[8] Tatusov R.L., Koonin E.V., Lipman D.J., A Genomic Perspective on Protein Families, Science, 1997, 278:631-637.
[9] http://genomes.ucsd.edu/gaasterlandlab/manuals/blast/blastclust.html
[10] http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html
[11] http://ghr.nlm.nih.gov/handbook/howgeneswork/genefamilies;jsessionid=728605737242A6C8FA97CD4FD5450BD0