How to make a monkey: functional adaptation in the primate genome
DESCRIPTION
Presentation to the "Workshop on Parallel and Distributed Processing of Large Genome Data", 22 February 2011, DBCLS, Tokyo (http://mlab.cb.k.u-tokyo.ac.jp/en/events/lgd/). The presentation describes the methodological issues surrounding the design of a workflow for assigning orthology among primate genomes, testing them for evidence of selection and interpreting the results using the Gene Ontology.
TRANSCRIPT
How to make a monkey: functional adaptation in the primate genome
Rutger Vos
Marie Curie Research Fellow
Outline
• Introduction
  – The question
  – Primate genomes
  – Homology across genomes
  – Finding evidence for natural selection
  – Characterizing gene function
• Methods
  – Computational infrastructure
  – Basic workflow steps
  – Workflow design
• Results
  – Preliminary findings
• Conclusions
• Acknowledgements
The question
Which gene functions were under directional selection in primate evolutionary history?
Primate genomes
• Homo sapiens (human)
• Pongo pygmaeus (orangutan)
• Tarsius syrichta (Philippine tarsier)
• Pan troglodytes (chimpanzee)
• Macaca mulatta (rhesus monkey)
• Otolemur garnettii (greater galago)
• Gorilla gorilla (gorilla)
• Callithrix jacchus (common marmoset)
• Microcebus murinus (gray mouse lemur)
Primate genomes
[Figure: primate phylogeny rooted at ~65 MYA (K/T boundary); clade labels: Apes, Old World monkeys, New World monkeys, Tarsiers, Lemurs, Bush babies]
Homology: Orthologs and paralogs
Evidence of selection: dN/dS ratio
• Also written Ka/Ks or ω: the ratio of the non-synonymous to the synonymous substitution rate
  – dN/dS > 1: positive selection
  – dN/dS ≈ 1: neutral evolution?
  – dN/dS < 1: purifying (stabilizing) selection
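The three regimes above can be illustrated with a minimal sketch. The function names and the tolerance band around 1 are hypothetical choices made only for illustration; a real analysis (such as the branch tests run later in the workflow) would rely on a likelihood-based significance test rather than a fixed cutoff.

```python
def omega(dn: float, ds: float) -> float:
    """Return dN/dS. The ratio is undefined when dS is not positive."""
    if ds <= 0:
        raise ValueError("dS must be positive for dN/dS to be defined")
    return dn / ds


def classify(w: float, tolerance: float = 0.1) -> str:
    """Map a dN/dS value onto the three regimes.

    The tolerance is an arbitrary illustrative band around 1,
    not a statistical test.
    """
    if w > 1 + tolerance:
        return "positive selection"
    if w < 1 - tolerance:
        return "purifying (stabilizing) selection"
    return "approximately neutral"
```

For example, a branch with dN = 0.3 and dS = 0.1 gives a ratio of about 3 and would be labelled as positive selection under these assumptions.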
Gene function: the Gene Ontology
• GO is a hierarchical database of terms for genes
• Terms are structured in a directed acyclic graph
• Terms are organized in three domains: biological process, cellular component and molecular function
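Because the terms form a directed acyclic graph rather than a tree, a term can have several parents, and a gene annotated with a term is implicitly annotated with all of its ancestors. A minimal sketch of that traversal follows. The term names and parent links are a toy example invented for illustration; they are not taken from the actual Gene Ontology files.

```python
from collections import deque

# Toy child -> parents map (illustrative only, not real GO content).
PARENTS = {
    "forebrain development": ["brain development"],
    "brain development": ["nervous system development", "organ development"],
    "nervous system development": ["developmental process"],
    "organ development": ["developmental process"],
    "developmental process": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # reached via another path; a DAG allows multiple routes
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen
```

Here "developmental process" is reached through two different parents but is reported once, which is the property that distinguishes a DAG traversal from a simple walk up a tree.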
Methods: Basic workflow steps
1. Protein BLAST all vs. all
2. Find reciprocal best protein hit (RBH) clusters
3. Protein-align the RBH clusters
4. Back-translate protein alignments to cDNAs
5. Perform dN/dS ratio tests on all branches
6. Look up GO terms for sequence GIs
7. Interpret results
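Step 2 can be sketched for a single pair of species as follows. This is a minimal illustration, not the scripts used in the talk: it assumes the BLAST results have already been reduced to (query, subject, bit score) tuples, and the sequence identifiers in the usage note are hypothetical.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest-scoring subject per query.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    On a tie, the hit seen first is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

With hits such as ("hs1", "pt1", 200.0) in one direction and ("pt1", "hs1", 198.0) in the other, the pair ("hs1", "pt1") is kept, while a query whose best subject points back to a different sequence is dropped. Extending pairs to clusters across nine species would need an additional grouping step that is not shown here.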
Methods: Basic workflow design
• Build a single BLAST database of all genomes
• Then, to parallelize the analysis:
  – Split the data into nine sets (for nine species)
  – Split each of the nine genomes into one file per gene (~20k files per species)
  – Process files in parallel
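The split-then-process design can be sketched as below. This is an illustrative stand-in rather than the workflow from the talk, which used shell scripts and make on a cluster: the function names are hypothetical, the worker only measures sequence length, and the sketch assumes well-formed FASTA whose identifiers are safe to use as file names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(fasta_text, out_dir):
    """Write one FASTA file per record into out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = {}
    name = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier = first word of the header
            records[name] = []
        elif name is not None and line.strip():
            records[name].append(line.strip())
    paths = []
    for name, chunks in records.items():
        path = out_dir / f"{name}.fasta"
        path.write_text(f">{name}\n{''.join(chunks)}\n")
        paths.append(path)
    return paths


def sequence_length(path):
    """Placeholder per-file task: count residues in a single-record file."""
    lines = Path(path).read_text().splitlines()
    return sum(len(line) for line in lines if not line.startswith(">"))


def process_in_parallel(paths, worker=sequence_length, jobs=4):
    """Apply worker to every file using `jobs` threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(worker, paths))
```

Because each file is independent, the per-file task can be swapped for any step of the workflow without changing how the work is distributed.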
Methods: File processing
[Diagram: per-species shell scripts (…, Homo_sapiens.sh, Pan_troglodytes.sh, …) and a Makefile; arrow labels: "qsub setenv", "make -j 4 all"]
Methods: Software used
• NCBI standalone BLAST (formatdb, blastp, fastacmd)
• Muscle
• GeneWise
• HyPhy
• BioPerl/Bio::Phylo (for parsing, logging and wrapping, all scripts under svn)
Methods: Project organization
From: Noble, W.S., 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput. Biol. 5(7).
Methods: ThamesBlue hardware
• One of the 100 fastest supercomputers in the world
• IBM BladeCenter cluster
• JS21 and JS20 Blade servers with 60TB of storage connected via a Myrinet 2G network.
• SuSE Linux Enterprise Server
• General Parallel File System
• Batch jobs managed with Torque.
Results
• 5952 loci with ≥ 2 RBHs relative to humans
• 2346 loci with dN/dS deviation somewhere (p < 0.05)
[Figure: species labels Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo pygmaeus, Macaca mulatta, Callithrix jacchus, Tarsius syrichta, Microcebus murinus, Otolemur garnettii]
Results: some interesting terms
• Forebrain development, lifespan (and apoptosis), learning and social behavior in apes, including “deep” nodes
• Eye development in “higher” monkeys
• Terms to do with pregnancy
• Terms to do with male-male competition
• Etc. Etc. (…lots of hard to interpret molecular processes, of course…)
“Brain genes”
Visual system
• Primates have a highly variable visual system:
  – Old World monkeys: three types of cones (unique among mammals)
  – New World monkeys: females trichromatic, males dichromatic
Biological conclusions
• Very, very, very, very preliminary: highest dN/dS ratios in functions for which there are multiple “optima” among primates:
  – Different placentation systems
  – Different mating systems
  – Different visual systems
  – Different life histories and brain mass investments
Methodological conclusions
• Nine genomes is not that much. As FASTA files, it’s a 14 GB zipped archive (AA + cDNA).
• The problem was trivially parallelizable, so I didn’t use MPI versions of any software.
• Simple, consistent workflow and project design conventions are a lifesaver.
• Make each step small enough so you can rerun it, because you will.
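The last point can be illustrated with a small sketch of a rerunnable step that skips work whose output is already newer than its input, similar in spirit to what make does with timestamps. The function names are hypothetical and the step is reduced to a text-to-text transform for illustration; this is not the code used in the project.

```python
from pathlib import Path


def is_up_to_date(source, target):
    """True when target exists and is at least as new as source."""
    source, target = Path(source), Path(target)
    return target.exists() and target.stat().st_mtime >= source.stat().st_mtime


def run_step(source, target, transform):
    """Run transform(text) -> text only if target is missing or stale.

    Output is written to a temporary file and then renamed, so an
    interrupted run does not leave a half-written target behind.
    Returns True when the step actually ran.
    """
    if is_up_to_date(source, target):
        return False
    tmp = Path(str(target) + ".tmp")
    tmp.write_text(transform(Path(source).read_text()))
    tmp.replace(target)
    return True
```

Calling run_step twice in a row on the same files does the work once and returns False the second time, so a failed batch can simply be resubmitted.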
Summary
• I discussed:
  – Primate evolution and adaptation
  – Ortholog-finding
  – Alignment (multiple proteins, cDNA to protein)
  – Tree-based dN/dS ratio tests
  – Gene Ontology term enrichment
  – Methodological challenges
Acknowledgements
• Funding: FP7-PEOPLE-IEF-2008/N°237046
• DBCLS for their kind invitation
• Mark Pagel, Andrew Meade for discussion and help designing the workflow