Assignment 12: Substitution Rates and Identifying Selection 4/13/17 Modified from slides 2015


Page 1: Assignment 12: Substitution Rates and Identifying Selection (genetics.wustl.edu/bio5488/files/2017/04/SP2017_assignment12_Intro-1.pdf)

Assignment 12: Substitution Rates and Identifying Selection

4/13/17

Modified from slides 2015


Detecting selection using the nucleotide substitution rate

• dN or Ka = nonsynonymous substitution rate = (# nonsyn. changes) / (# nonsyn. sites)

• dS or Ks = synonymous substitution rate = (# syn. changes) / (# syn. sites)

• dN/dS ratio is a measure of the selective pressure on a protein-coding gene:

dN/dS   Interpretation
= 1     No constraint on protein sequence, i.e., nonsyn. changes are neutral (neutral selection)
< 1     Functional constraint on the protein sequence, i.e., nonsyn. mutations are deleterious (purifying selection)
> 1     Change in the function of the protein sequence, i.e., nonsyn. mutations are adaptive (positive selection)

Adapted from Dr. Fay's lecture slides
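The interpretation table can be sketched as a small function. This is only an illustration: the function name and the `tolerance` parameter are assumptions (the slides do not say how close to 1 a ratio must be to count as neutral), and the input values are the Scer vs. Spar dn and ds from the example dnds file in these slides.

```python
def interpret_dnds(dn, ds, tolerance=0.05):
    """Return the selection regime suggested by a dN/dS ratio.

    `tolerance` is an illustrative assumption (not from the slides): ratios
    within this distance of 1 are reported as neutral.
    """
    if ds == 0:
        return "undefined (dS is zero)"
    ratio = dn / ds
    if abs(ratio - 1) <= tolerance:
        return "neutral"
    if ratio < 1:
        return "purifying selection"
    return "positive selection"


# dn and ds for Scer vs. Spar, taken from the example dnds file.
print(interpret_dnds(0.0117, 0.4253))
```

A real analysis would normally test whether the ratio differs significantly from 1 rather than rely on a fixed cutoff.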


Assignment 12: Substitution Rates and Identifying Selection

• Goal
  • Investigate synonymous and nonsynonymous substitution rates across the genomes of several yeast species

• Input
  • Alignments of 5k genes from 4 yeast species
  • Synonymous Non-synonymous Analysis Program (SNAP)
  • Gene annotation file for S. cerevisiae

• Output
  • dN/dS ratio for every gene
  • Summary statistics of the dN/dS distribution
  • Visualization of the dN/dS distribution
  • Average dN/dS ratio for all GO terms


Input files

• Alignments of 5k genes from 4 yeast species: S. cerevisiae, S. paradoxus, S. mikatae, & S. bayanus
  • Fasta format
  • Data comes from Kellis et al. Nature (2003)

Example nucleotide alignment file:

>Scer
ATGTCAAAAGCTGTCGGGCTCCA-----------CCGTTGAAGAAGTTGAT
>Smik
ATGTCAAAAGCTGTCGGGCTCCAGGAGCTGCTCCCTGTTGAAGAAGTTGAT

Adapted from Dr. Cohen's lecture slides
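As a rough sketch of how such an alignment file could be read in Python, the helper below (a hypothetical `read_fasta`, not part of the assignment) collects each sequence under its header name. It assumes every header is a single `>` line followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {name: sequence}."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            sequences[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so append to the current record.
            sequences[name] += line
    return sequences


example = """>Scer
ATGTCAAAAGCTGTCGGGCTCCA-----------CCGTTGAAGAAGTTGAT
>Smik
ATGTCAAAAGCTGTCGGGCTCCAGGAGCTGCTCCCTGTTGAAGAAGTTGAT
"""
alignment = read_fasta(example.splitlines())
for species, seq in alignment.items():
    print(species, len(seq), "gaps:", seq.count("-"))
```

SNAP.pl reads the alignment files itself, so a parser like this is only needed for tasks such as measuring gene length.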


Synonymous Non-synonymous Analysis Program (SNAP)

• Perl script that calculates the syn. and nonsyn. substitution rates in a nucleotide alignment of a gene
• Usage

$ perl /home/assignments/assignment10/SNAP.pl <nucleotide alignment fasta> <output directory>

• Creates an output file (*.dnds) containing substitution metrics for each pair of species in the alignment
• Output file is written to the output directory


dnds file format

• Whitespace-separated text file
• Table of substitution metrics for each pairwise comparison
• Contains header
• See assignment for complete description of file format

Example dnds file:

Compare  Sequences_names  Sd      Nd     S       N        ps      pn      ds      dn      dn/ds
0 1      Scer Spar        291.00  37.00  896.50  3177.50  0.3246  0.0116  0.4253  0.0117  0.0276
0 2      Scer Smik        424.50  71.50  891.33  3170.67  0.4763  0.0226  0.7559  0.0229  0.0303
1 2      Spar Smik        369.33  77.67  891.83  3170.17  0.4141  0.0245  0.6025  0.0249  0.0413

Table of abbreviations:

Sd = Synonymous differences
Nd = Nonsynonymous differences
S  = Synonymous sites
N  = Nonsynonymous sites
ps = Synonymous rate (Sd/S)
pn = Nonsynonymous rate (Nd/N)
ds = Synonymous rate (corrected)
dn = Nonsynonymous rate (corrected)
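The relationship between the columns can be checked by hand for the Scer vs. Spar row. The slides do not say which correction SNAP applies to obtain ds and dn; the sketch below assumes a Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3), which is consistent with the example row to the four decimals shown. The function name `jukes_cantor` is illustrative.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    return -0.75 * math.log(1 - 4 * p / 3)


# Scer vs. Spar row of the example dnds file.
Sd, Nd, S, N = 291.00, 37.00, 896.50, 3177.50

ps = Sd / S            # uncorrected synonymous rate
pn = Nd / N            # uncorrected nonsynonymous rate
ds = jukes_cantor(ps)  # corrected synonymous rate (assumed correction)
dn = jukes_cantor(pn)  # corrected nonsynonymous rate (assumed correction)

print(f"ps={ps:.4f} pn={pn:.4f} ds={ds:.4f} dn={dn:.4f} dn/ds={dn / ds:.4f}")
```

The correction matters most for ds here: the synonymous sites are far more diverged, so multiple substitutions at the same site are more likely to be hidden.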


Assignment TODOs

• Write run_SNAP.py
  • Run SNAP.pl on every alignment file
  • Create an output file of dN/dS ratios
  • Calculate dN/dS summary statistics

• Write plot_gene_length_vs_dnds.py
  • Create scatter plot of gene length vs. dN/dS ratio

• Write calc_average_go_dnds.py
  • Calculate the average dN/dS ratio for each GO term in a GFF

• Answer follow-up questions
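For the summary statistics step, one possible approach is Python's standard `statistics` module. The assignment handout defines which statistics are actually required; the set below is an assumption, and the three input values are simply the dn/ds column of the example dnds file, used as stand-ins for per-gene ratios.

```python
import statistics

# Illustrative dN/dS values only; a real run would have one value per gene.
dnds_values = [0.0276, 0.0303, 0.0413]

summary = {
    "n": len(dnds_values),
    "mean": statistics.mean(dnds_values),
    "median": statistics.median(dnds_values),
    "stdev": statistics.stdev(dnds_values),  # sample standard deviation
    "min": min(dnds_values),
    "max": max(dnds_values),
}
for key, value in summary.items():
    print(f"{key}\t{value:.4f}")
```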


Executing external commands in Python

• Use subprocess.call to execute an external command from within a Python script

• Python will wait until the command completes before moving to the next line of code

• See http://stackoverflow.com/questions/89228/calling-an-external-command-in-python for alternatives

Example shell command:

$ SNAP.pl YAL003W.fasta dnds_out

Python code template:

import subprocess
subprocess.call(<list_of_arguments>)
# <list_of_arguments> is a list of the words in the command
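A minimal sketch of the template applied to SNAP.pl follows. The helper `build_snap_command` is hypothetical, and the script path is the one shown on the SNAP usage slide. Because SNAP.pl only exists on the class server, the call that actually runs here uses the Python interpreter as a stand-in command.

```python
import subprocess
import sys


def build_snap_command(alignment_fasta, output_dir,
                       snap_path="/home/assignments/assignment10/SNAP.pl"):
    """Return the SNAP.pl command as a list of words for subprocess.call."""
    return ["perl", snap_path, alignment_fasta, output_dir]


command = build_snap_command("YAL003W.fasta", "dnds_out")
print(command)
# On the class server this would be run with: subprocess.call(command)

# Stand-in command so the sketch runs anywhere: call the Python interpreter.
exit_code = subprocess.call([sys.executable, "-c", "print('done')"])
print("exit code:", exit_code)
```

subprocess.call returns the command's exit code, so a nonzero value is one sign that the command failed.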


File and directory manipulation in Python

• Use os.listdir to list all files and subdirectories in a given directory:

Code template:

import os
<list_of_files> = os.listdir(<directory>)

Example output:

['YAL002W.fasta', 'YAL008W.fasta']

• os.listdir returns the name of the file/dir, not the path
• Use os.path.join to construct the path:

Code template:

import os
<file_path> = os.path.join(<directory>, <file>)

Example output:

../alignments/YAL002W.fasta
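The two calls are typically used together. The sketch below builds a temporary directory so that it runs anywhere; the file names mirror the slide's example, and the `dnds_out` subdirectory is an illustrative addition to show that os.listdir returns directories too.

```python
import os
import tempfile

# Build a throwaway directory with two empty alignment files and a subdirectory.
with tempfile.TemporaryDirectory() as directory:
    for name in ("YAL002W.fasta", "YAL008W.fasta"):
        open(os.path.join(directory, name), "w").close()
    os.mkdir(os.path.join(directory, "dnds_out"))

    # os.listdir returns names only, in arbitrary order, so sort them.
    names = sorted(os.listdir(directory))

    # Keep the alignment files and turn each name into a usable path.
    fasta_paths = [
        os.path.join(directory, name)
        for name in names
        if name.endswith(".fasta")
    ]

print(names)
print(len(fasta_paths), "alignment files")
```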


File and directory manipulation in Python

• Use os.path.isfile to check if a file exists:

Code template:

import os
if os.path.isfile(<filename>):
    # Do something

Example output:

The file exists!
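A small runnable illustration of the check, using a temporary directory and hypothetical file names. It also shows that os.path.isfile is False for a directory, not only for a missing path.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    present = os.path.join(directory, "YAL002W.dnds")
    missing = os.path.join(directory, "YAL008W.dnds")
    open(present, "w").close()  # create an empty file

    present_is_file = os.path.isfile(present)
    missing_is_file = os.path.isfile(missing)
    directory_is_file = os.path.isfile(directory)

    if present_is_file:
        print("The file exists!")

print(present_is_file, missing_is_file, directory_is_file)
```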

• Where to learn more
  • Python.org documentation: https://docs.python.org/3.4/library/os.html


Tips for writing run_SNAP.py

• Running SNAP.pl on 5k files takes a long time
  → Before you start, make a tester folder w/ 20 alignment files. Run run_SNAP.py on this directory to save time when writing/debugging.

• The class server may get very slow if everyone runs their script at the same time
  → Don't wait until Thursday night to start

• Some of the input alignment files are not formatted correctly. (Data can be messy.)
  → SNAP.pl may or may not produce a dnds file for these genes. Ignore these genes in your output file, since they have no dN/dS ratio.

• Hints
  • Use os.path.isfile
  • Check if the file contains the scer vs. spar comparison
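The two hints could be combined along the following lines. This is a sketch, not the required implementation: the function name is hypothetical, and the parsing assumes the layout of the example dnds file in these slides (a header line, then rows of two indices, two sequence names, and nine metrics ending in dn/ds). Real SNAP output may contain extra lines, which the length check skips.

```python
import os
import tempfile


def scer_spar_dnds(dnds_path):
    """Return the Scer vs. Spar dn/ds ratio from a dnds file, or None."""
    if not os.path.isfile(dnds_path):
        return None  # SNAP.pl produced no output for this gene
    with open(dnds_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 13:
                continue  # header or any other short line
            if {fields[2].lower(), fields[3].lower()} != {"scer", "spar"}:
                continue
            try:
                return float(fields[12])
            except ValueError:
                return None  # non-numeric ratio, treated as missing
    return None


example = """Compare Sequences_names Sd Nd S N ps pn ds dn dn/ds
0 1 Scer Spar 291.00 37.00 896.50 3177.50 0.3246 0.0116 0.4253 0.0117 0.0276
0 2 Scer Smik 424.50 71.50 891.33 3170.67 0.4763 0.0226 0.7559 0.0229 0.0303
1 2 Spar Smik 369.33 77.67 891.83 3170.17 0.4141 0.0245 0.6025 0.0249 0.0413
"""

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "example.dnds")
    with open(path, "w") as handle:
        handle.write(example)
    found = scer_spar_dnds(path)
    absent = scer_spar_dnds(os.path.join(directory, "missing.dnds"))

print(found, absent)
```

Returning None for both failure cases lets the calling script skip the gene with a single check.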




General Feature Format (GFF)

• See https://www.sanger.ac.uk/resources/software/gff/spec.html#t_2 for a complete description
• Attribute column contains additional information about the feature, e.g., GO IDs
  • Semicolon-separated list of key-value pairs

Example gff file:

chrI SGD gene 335 649 . + . Name=YAL069W;Ontology_term=GO:0003674,GO:0005575
chrI SGD gene 7235 9016 . - . Name=YAL067C;Ontology_term=GO:0005215
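One way to go from the attribute column to a per-GO-term average is sketched below. The function names are hypothetical, the columns are assumed to be tab-separated as in the GFF specification, and the dN/dS values are made up purely for illustration.

```python
from collections import defaultdict


def parse_go_terms(gff_lines):
    """Map gene name -> list of GO IDs from the GFF attribute column."""
    gene_to_go = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        # Attribute column: semicolon-separated key=value pairs.
        attributes = dict(
            pair.split("=", 1) for pair in columns[8].split(";") if "=" in pair
        )
        if "Name" in attributes and "Ontology_term" in attributes:
            gene_to_go[attributes["Name"]] = attributes["Ontology_term"].split(",")
    return gene_to_go


def average_dnds_per_go(gene_to_go, gene_to_dnds):
    """Average dN/dS over the genes annotated with each GO term."""
    values = defaultdict(list)
    for gene, go_ids in gene_to_go.items():
        if gene in gene_to_dnds:  # skip genes without a dN/dS ratio
            for go_id in go_ids:
                values[go_id].append(gene_to_dnds[gene])
    return {go_id: sum(v) / len(v) for go_id, v in values.items()}


gff_lines = [
    "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tName=YAL069W;Ontology_term=GO:0003674,GO:0005575",
    "chrI\tSGD\tgene\t7235\t9016\t.\t-\t.\tName=YAL067C;Ontology_term=GO:0005215",
]
# Hypothetical dN/dS values, for illustration only.
gene_to_dnds = {"YAL069W": 0.25, "YAL067C": 0.5}

gene_to_go = parse_go_terms(gff_lines)
averages = average_dnds_per_go(gene_to_go, gene_to_dnds)
print(averages)
```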


Assignment 12: requirements

• TIP: start early!
• Due: 21 April 2017, 10 am
• Comment your code
  • Write docstrings for scripts (include usage)
  • Write docstrings for functions

• Submission directory should contain
  • README.txt
  • All scripts
    • run_SNAP.py
    • plot_gene_length_vs_dnds.py
    • calc_average_go_dnds.py
  • Most* output files
    • alignments_all_dnds.txt
    • alignments.err
    • gene_length_vs_dnds.png
    • average_go_dnds.txt
  • *You do not need to turn in the dnds output files