Page 1: Compare and Contrast the Effects of Using Less Stringent ...cbs/projects/2007... · Goldberg (CBS ™06) Utilized Various Techniques That Identify ... to thank my husband Joe for

Compare and Contrast the Effects of Using Less Stringent Criteria in BLASTCLUST to a Novel Iterative Method for Identifying Gene Families

Virginia Earl-Mirowski

Agenda

- Introduction
  - Gene Family Extension Iterative (GFEI) Workflow
  - BLASTCLUST
- Goal
- Methodology
- Results
- Conclusions
- Future Direction

Terms

- Heuristic - An Algorithm That Will Yield Reasonable Results, Even If It Is Not Provably Optimal
- Filtration - The Idea That Good Alignments Contain Short Stretches With A High Degree Of Similarity
- High Scoring Pair (HSP) - The Segment Pair With The Best Score Over All Segment Pairs In The Two Sequences
- Type I Error - Also Known As A False Positive. Rejecting H0 When It Is True.
- Type II Error - Also Known As A False Negative. Not Rejecting H0 When It Is False.

Introduction

- A Gene Family is a group of genes united by similar structure or function and evolved from a common ancestor.
  - Help To Describe How Genes Relate to Each Other
  - Provide Mechanism to Predict Function of Newly Identified Genes

Introduction Cont'd

- Novel Iterative Method Study Done By Loretta Goldberg (CBS '06)
  - Utilized Various Techniques That Identify Members Of Gene Families Through Ancestral Predictions.
  - Concluded Additional Homologies Could Be Captured To Existing Clusters Using This Method
  - Used Default Settings Of Blastclust For The All Vs. All Comparisons. Restrictive Cut Offs May Result In Some Cluster Members Being Missed.

Gene Family Extension Iterative (GFEI) Workflow

1. Protein Sequences
2. All-vs-All Comparison
3. Sequence Alignment
4. Phylogenetic Tree Construction
5. Ancestral Sequence Prediction
6. Determine Effect of Ancestral Sequences

BLASTCLUST

- Rapidly Clusters Sequences Together Based On Similarity And Coverage Thresholds.
- Uses A Single-linkage Clustering Algorithm
  - If A Sequence Is Considered A "Neighbor" To At Least One Sequence In The Cluster, Then It Will Be Placed In That Cluster
- Based On The Heuristic Technique Of Filtration
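The single-linkage rule above can be sketched with a small union-find structure. This is only an illustration of the idea, not BLASTCLUST's actual implementation: the function name `single_linkage` and the sequence identifiers are hypothetical, and the neighbor pairs are assumed to have already passed the similarity and coverage thresholds.

```python
def single_linkage(sequence_ids, neighbor_pairs):
    """Group sequences so that any two neighbors end up in the same cluster."""
    parent = {s: s for s in sequence_ids}

    def find(s):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in neighbor_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), []).append(s)
    # Largest clusters first, members sorted for stable output.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical identifiers: A-B and B-C are neighbors, so A and C share a
# cluster even though they are not direct neighbors; D stays a singleton.
clusters = single_linkage(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(clusters)
```

Because one neighbor is enough to join a cluster, loosening the thresholds can chain otherwise distant sequences together, which is worth keeping in mind when reading the composite clusters reported later.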

BLASTCLUST cont'd

- Coverage = HSP Len. / Sequence Len.
- Coverage Threshold = Max. Or Min. Coverage Value Between The Two Sequences
- Blastclust -L Option Sets Minimum Length Coverage
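A minimal sketch of the coverage test, under simplifying assumptions: the function names are hypothetical, a single HSP length is used for both sequences, and the "max or min" choice is assumed to correspond to requiring coverage on one sequence versus both.

```python
def coverage(hsp_length, sequence_length):
    """Fraction of a sequence covered by the HSP."""
    return hsp_length / sequence_length


def passes_coverage(hsp_length, len_a, len_b, min_coverage=0.9, both=True):
    """Apply the minimum length coverage to both sequences or to at least one."""
    cov_a = coverage(hsp_length, len_a)
    cov_b = coverage(hsp_length, len_b)
    if both:
        # Both sequences must be covered, so the smaller coverage decides.
        return min(cov_a, cov_b) >= min_coverage
    # One covered sequence is enough, so the larger coverage decides.
    return max(cov_a, cov_b) >= min_coverage


# Hypothetical lengths: a 190-residue HSP between a 200- and a 300-residue protein.
print(passes_coverage(190, 200, 300, both=True))   # False: 190/300 is below 0.9
print(passes_coverage(190, 200, 300, both=False))  # True: 190/200 is 0.95
```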

BLASTCLUST cont'd

- Similarity Threshold Is Based On Either The BLAST Score Density Or The Percentage Of Identical Residues
  - Score Density = BLAST Score / Min. HSP Length Of The Two Sequences
  - Perc. Id. Res. = # Id. Res. / Align. Length
- Blastclust Option -S Sets The Similarity Threshold
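The two similarity measures can be written out directly. The sketch below uses hypothetical function names and made-up alignment values; it only restates the two formulas and does not reproduce how BLASTCLUST computes them internally.

```python
def score_density(blast_score, hsp_len_a, hsp_len_b):
    """BLAST score divided by the shorter of the two HSP lengths."""
    return blast_score / min(hsp_len_a, hsp_len_b)


def percent_identity(identical_residues, alignment_length):
    """Identical residues as a percentage of the alignment length."""
    return 100.0 * identical_residues / alignment_length


# Hypothetical alignment: score 350, HSP lengths 200 and 210,
# 150 identical residues over a 200-column alignment.
density = score_density(350, 200, 210)
identity = percent_identity(150, 200)
print(density, identity)
print(density >= 1.75)  # compared against the default similarity threshold
```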

BLASTCLUST Defaults

- Minimum Length Coverage Threshold (-L) - 0.9
- Similarity Threshold (-S) - 1.75
- Required Coverage On Both Sequences (-b) - True

Goal

- Compare And Contrast The Effects Of Using Less Stringent Criteria In Blastclust To The Novel Iterative Method, Which Made Use Of The Predicted Ancestral Sequences.
- Utilized The Neighbor / Hit List In Its Blastclust Settings To Determine If This Would Reduce The Processing Time Of The Project Workflow.

Methodology

- Used The Species Database For Rattus norvegicus (Rat)
- Cluster Sizes ≥ 6 Used In Processing
- Executed The GFEI Workflow Where All Blastclust Threshold Options Used The Default Values.
- Created Neighbor / Hit List In The Initial Execution Of Blastclust.

Methodology Cont'd

- Executed The GFEI Workflow For Different Settings Of The Blastclust Threshold Options (Threshold Runs)
- The Neighbor / Hit List Was Used In The Initial Clustering In All Threshold Runs.
- The CompareClust.pl Program Output Was Used For Analysis

BLASTCLUST Threshold Settings

Run #   Similarity Setting (-S)   Min. Length Coverage (-L)   Both / One Coverage (-b)
0       Default                   Default                     Both
1       Default                   0.8                         Both
2       Default                   Default                     One
3       1.65                      Default                     Default
4       1.65                      0.8                         Default
5       Default                   0.7                         Default
6       1.55                      0.7                         Default
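As a rough illustration of how the threshold runs map onto command lines, the sketch below builds one argument list per run. The -S, -L and -b flags are the ones named in the slides; the -i and -o flags are the input and output options of the legacy NCBI blastclust program, and the file names are hypothetical. Nothing is executed, and the neighbor / hit list options are left out.

```python
# Threshold runs transcribed from the settings table; None means
# "do not pass the flag, so the BLASTCLUST default applies".
RUNS = {
    0: {"S": None, "L": None, "b": "T"},
    1: {"S": None, "L": 0.8, "b": "T"},
    2: {"S": None, "L": None, "b": "F"},
    3: {"S": 1.65, "L": None, "b": None},
    4: {"S": 1.65, "L": 0.8, "b": None},
    5: {"S": None, "L": 0.7, "b": None},
    6: {"S": 1.55, "L": 0.7, "b": None},
}


def build_command(run, fasta="rat_proteins.fasta"):
    """Return the argument list for one run (hypothetical file names)."""
    settings = RUNS[run]
    command = ["blastclust", "-i", fasta, "-o", f"run{run}.clusters"]
    for flag in ("S", "L", "b"):
        value = settings[flag]
        if value is not None:
            command += [f"-{flag}", str(value)]
    return command


for run in sorted(RUNS):
    print(" ".join(build_command(run)))
```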

Results - Initial Clustering

Run #   Clusters Size ≥ 6   Largest Cluster   Largest Cluster Processed   Sequences Processed   Ave. # Sequences / Cluster
0       164                 195               195                         1780                  11
1       206                 240               240                         2287                  11
2       387                 354               354                         4963                  13
3       211                 260               260                         2308                  11
4       261                 270               270                         2891                  11
5       230                 270               270                         2702                  12
6       348                 272               272                         4106                  12
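The last column of the initial clustering table appears to be the number of sequences processed divided by the number of clusters of size ≥ 6, rounded to a whole number. The short check below, with values transcribed from the table, is a sketch under that assumption.

```python
# run: (clusters of size >= 6, sequences processed, reported average)
INITIAL = {
    0: (164, 1780, 11),
    1: (206, 2287, 11),
    2: (387, 4963, 13),
    3: (211, 2308, 11),
    4: (261, 2891, 11),
    5: (230, 2702, 12),
    6: (348, 4106, 12),
}

for run, (clusters, sequences, reported) in sorted(INITIAL.items()):
    average = sequences / clusters
    print(f"run {run}: {average:.2f} sequences per cluster (reported {reported})")

# True when every reported average matches the rounded ratio.
consistent = all(round(seqs / clus) == rep for clus, seqs, rep in INITIAL.values())
print(consistent)
```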

Results - Clustering After Ancestor Sequences

Run #   Clusters Size ≥ 6   Largest Cluster   Ancestral Sequences Added   Sequences Processed   Additional Sequences Collected   Ave. # Sequences / Cluster
0       162                 403               466                         2256                  10                               14
1       205                 402               459                         2755                  9                                13
2       384                 354               512                         5537                  12                               14
3       209                 437               584                         2922                  30                               14
4       259                 447               626                         3555                  38                               14
5       229                 446               497                         3207                  12                               14
6       346                 453               844                         4987                  37                               14

Results - Comparison of Default Additional Sequences

Cluster Size Found in Threshold Run, Initial Cluster

Sequences Gained in Default Run
(Cluster Size After Anc.)         Run 1   Run 2   Run 3   Run 4   Run 5   Run 6
47576151 (31)                     1       1       20      20      1       23
47576153 (31)                     1       1       20      20      1       23
47576157 (31)                     1       3       20      20      3       23
47577501 (17)                     2       2       17      17      2       21
47577509 (17)                     2       2       17      17      2       21
47577549 (17)                     3       3       17      17      3       21
47577553 (17)                     3       3       17      17      3       21
47577567 (17)                     3       3       17      17      3       21
47578043 (15)                     1       1       10      10      1       15
51261002 (12)                     1       18      8       9       1       9

Results - Threshold Runs 1,2,5

Cluster Size Found in Threshold Run, After Reclustering

Sequences Gained in Default Run
(Cluster Size After Anc.)         Run 1   Run 2   Run 5
47576151 (31)                     31      37      37
47576153 (31)                     31      37      37
47576157 (31)                     31      37      37
47577501 (17)                     17      17      17
47577509 (17)                     17      17      17
47577549 (17)                     17      17      17
47577553 (17)                     17      17      17
47577567 (17)                     17      17      17
47578043 (15)                     15      15      15
51261002 (12)                     1       18      1

Results - New Sequences Captured in Threshold Runs

Threshold Run #   # of Additional Sequences Other Than Default Sequences   # of Sequence Overlap with Other Threshold Runs   Threshold Run #
1                 0                                                        0                                                 n/a
2                 2                                                        1                                                 5
3                 30                                                       27                                                4
4                 38                                                       27                                                3
5                 2                                                        1                                                 2
6                 37                                                       0                                                 n/a
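The counts in this table can be thought of as set operations on the sequence identifiers each run collected. The sketch below uses made-up identifiers and a hypothetical helper name; it is not the CompareClust.pl logic, only one way such counts could be derived.

```python
def new_versus_default(threshold_ids, default_ids):
    """Sequences collected in a threshold run that the default run did not collect."""
    return set(threshold_ids) - set(default_ids)


# Made-up identifiers standing in for the additional sequences of each run.
default_run = {"seq01", "seq02"}
run_3 = {"seq01", "seq10", "seq11", "seq12"}
run_4 = {"seq02", "seq11", "seq12", "seq13"}

new_3 = new_versus_default(run_3, default_run)
new_4 = new_versus_default(run_4, default_run)
shared = new_3 & new_4  # overlap between the two threshold runs

print(len(new_3), len(new_4), len(shared))
```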

Results - Threshold Run 6 Add. Sequences Distribution

Cluster Size   # of Additional Sequences Added   Composite Cluster ?   Family
37             5                                 Yes                   Zinc Finger Protein
37             5                                 No                    Vomeronasal Organ Pheromone Receptor Family
34             5                                 Yes                   G-protein-coupled Olfactory Receptor
31             3                                 No                    G-protein-coupled Olfactory Receptor
30             5                                 No                    G-protein-coupled Olfactory Receptor
22             2                                 No                    G-protein-coupled Olfactory Receptor
21             5                                 No                    Immunoglobulin Domain Variable Region
18             2                                 No                    G-protein-coupled Olfactory Receptor
16             3                                 No                    G-protein-coupled Olfactory Receptor

Conclusions

- Relaxing The Coverage Option Only Produced Results That Were Very Similar To The Default Settings, With Few Additional New Sequences
- The Less Stringent Similarity Option Grouped More Sequences Initially And Clustered Additional Singletons Into Significant Cluster Sizes Overlooked By Blastclust With Its Default Settings
- Neighbor / Hit List Improved Processing Time

Future Directions

- Further Investigation of PAML Errors With Several Cluster Sizes.
- Type I and Type II error determination with the various alternative approaches.

Acknowledgements

I would like to thank my advisor, Dr. Michael Rosenberg, for his guidance, support, and direction regarding my project. I would also like to thank my husband Joe for his loving support and for the help he provided in doing this report, and my committee members, Dr. Jeffrey Touchman and Dr. Martín Wojciechowski; not only for their feedback for this report but also for their enthusiasm and mentoring as instructors in the Computational Biosciences Program.

References

[1] Durbin R., Eddy S., Krogh A., Mitchison G., Biological Sequence Analysis, Cambridge University Press, 2004.

[2] Edwards R.V., Shields D.C., BADASP: predicting functional specificity in protein families using ancestral sequences, Bioinformatics, 2005, 21(22):4190-4191.

[3] Goldberg L., Rosenberg M., Extending Gene Families Via Predicted Ancestral Sequences, Internship Report, April 28 2006.

[4] Henikoff S., Greene E.A., Pietrokovski S., Bork P., Attwood T.K., Hood L., Gene Families: The Taxonomy of Protein Paralogs and Chimeras, Science, 1997, 278, 609-614.

[5] Korf I., Yandell M., Bedell J., BLAST, O'Reilly & Associates, CA 2003.

[6] Masseroli M., Bellistri E., Franceschini A., Pinciroli F., Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists, BMC Bioinformatics, 2007, 8(Suppl 1):S14.

[7] Jones N.C. and Pevzner P.A., An Introduction to Bioinformatics Algorithms, A Bradford Book, Massachusetts 2004.

[8] Tatusov R.L., Koonin E.V., Lipman D.J., A Genomic Perspective on Protein Families, Science, 1997, 278, 631-637.

[9] http://genomes.ucsd.edu/gaasterlandlab/manuals/blast/blastclust.html

[10] http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

[11] http://ghr.nlm.nih.gov/handbook/howgeneswork/genefamilies;jsessionid=728605737242A6C8FA97CD4FD5450BD0