How to make a monkey: functional adaptation in the primate genome

Post on 06-May-2015


Presentation to the "Workshop on Parallel and Distributed Processing of Large Genome Data", 22 February 2011, DBCLS, Tokyo (http://mlab.cb.k.u-tokyo.ac.jp/en/events/lgd/). The presentation describes the methodological issues surrounding the design of a workflow for assigning orthology among primate genomes, testing them for evidence of selection and interpreting the results using the Gene Ontology.

How to make a monkey: functional adaptation in the primate genome

Rutger Vos

Marie Curie Research Fellow

Outline

• Introduction
– The question
– Primate genomes
– Homology across genomes
– Finding evidence for natural selection
– Characterizing gene function
• Methods
– Computational infrastructure
– Basic workflow steps
– Workflow design
• Results
– Preliminary findings
• Conclusions
• Acknowledgements

The question

Which gene functions were under directional selection in primate evolutionary history?

Primate genomes

• Homo sapiens (human)
• Pongo pygmaeus (orangutan)
• Tarsius syrichta (Philippine tarsier)
• Pan troglodytes (chimpanzee)
• Macaca mulatta (rhesus monkey)
• Otolemur garnettii (greater galago)
• Gorilla gorilla (gorilla)
• Callithrix jacchus (common marmoset)
• Microcebus murinus (gray mouse lemur)

Primate genomes

[Figure: primate groups, dated to ~65 MYA (K/T boundary): apes, Old World monkeys, New World monkeys, tarsiers, lemurs, bush babies]

Homology: Orthologs and paralogs

Evidence of selection: dN/dS ratio

• Also written Ka/Ks or ω: the ratio of non-synonymous to synonymous substitution rates
– dN/dS > 1: positive selection
– dN/dS ≈ 1: neutral evolution?
– dN/dS < 1: stabilizing (purifying) selection
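The three regimes above can be illustrated with a minimal Python sketch. The function name `classify_omega` and the `tolerance` band around 1 are assumptions made for illustration only; the tests in the talk were run with HyPhy, which uses statistical tests rather than a fixed cutoff like this one.

```python
def classify_omega(dn: float, ds: float, tolerance: float = 0.1) -> str:
    """Label a branch by its dN/dS ratio (omega).

    `tolerance` is an arbitrary, illustrative band around 1 that is
    treated as "approximately neutral"; it is not a statistical test.
    """
    if ds <= 0:
        # No synonymous change observed, so the ratio cannot be computed.
        return "undefined"
    omega = dn / ds
    if omega > 1 + tolerance:
        return "positive selection"
    if omega < 1 - tolerance:
        return "purifying selection"
    return "approximately neutral"
```

For example, `classify_omega(0.5, 0.1)` gives a ratio of 5 and is labeled as positive selection, while a branch with no synonymous substitutions is reported as undefined rather than forced into a category.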

Gene function: the Gene Ontology

• GO is a hierarchical database of terms for genes

• Terms are structured in a directed acyclic graph

• Terms are organized in three domains: biological process, cellular component and molecular function
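Because terms form a directed acyclic graph, a term can have more than one parent, and annotating a gene with a term implies all of that term's ancestors. The sketch below shows this with a toy graph; the term names and the `PARENTS` mapping are hypothetical stand-ins, not real GO identifiers or the real GO structure.

```python
from typing import Dict, List, Set

# Hypothetical toy graph: each term maps to its direct parents.
# These are illustrative labels, not real GO identifiers.
PARENTS: Dict[str, List[str]] = {
    "forebrain development": ["brain development"],
    "brain development": ["organ development", "nervous system development"],
    "organ development": ["biological process"],
    "nervous system development": ["biological process"],
    "biological process": [],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path through the DAG
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen
```

Here "brain development" has two parents, so "biological process" is reached by two paths but is reported only once.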

Methods: Basic workflow steps

1. Protein BLAST all vs. all
2. Find reciprocal best hit (RBH) protein clusters
3. Protein align RBH clusters
4. Backtranslate protein alignments to cDNAs
5. Perform dN/dS ratio tests on all branches
6. Look up GO terms for sequence GIs
7. Interpret results
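Steps 2 and 4 can be sketched in a few lines of Python. This is a simplified illustration, not the talk's actual scripts (which were written with BioPerl/Bio::Phylo): the function names are hypothetical, hits are assumed to be `(query, subject, bit score)` tuples already parsed from BLAST output, the best hit is taken overall rather than per target species, and the cDNA is assumed to be in frame with exactly three bases per aligned residue.

```python
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring non-self subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # ignore self-hits from the all-vs-all search
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits: Iterable[Hit]) -> Set[Tuple[str, ...]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best = best_hits(hits)
    pairs: Set[Tuple[str, ...]] = set()
    for query, subject in best.items():
        if best.get(subject) == query:
            pairs.add(tuple(sorted((query, subject))))
    return pairs


def backtranslate(aligned_protein: str, cdna: str) -> str:
    """Thread an in-frame cDNA onto a gapped protein alignment row."""
    codons = []
    position = 0
    for residue in aligned_protein:
        if residue == "-":
            codons.append("---")  # a protein gap becomes a codon-sized gap
        else:
            codons.append(cdna[position:position + 3])
            position += 3
    return "".join(codons)
```

For instance, `backtranslate("M-K", "ATGAAA")` yields `"ATG---AAA"`, so gaps placed by the protein aligner are preserved at codon resolution in the nucleotide alignment.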

Methods: Basic workflow design

• Build a single BLAST database of all genomes, then
• To parallelize the analysis:
– Split the data into nine sets (for nine species)
– Split each of nine genomes into files for each gene (~20k files per species)
– Process files in parallel
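Splitting a genome into one file per gene might look like the following Python sketch. The function names, the `.fasta` suffix and the file-name sanitizing rule are assumptions for illustration; the talk does not show how its per-gene files were named or produced.

```python
import re
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (sequence id, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else "unnamed"
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(text: str, out_dir: Path) -> List[Path]:
    """Write each record to its own file in `out_dir`; return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for seq_id, sequence in read_fasta(text):
        # Replace characters such as "|" or "/" that are awkward in file names.
        safe_name = re.sub(r"[^A-Za-z0-9_.-]", "_", seq_id)
        path = out_dir / f"{safe_name}.fasta"
        path.write_text(f">{seq_id}\n{sequence}\n")
        written.append(path)
    return written
```

With one small file per gene, each downstream step only needs a single input path, which is what makes the per-file jobs independent of each other.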

Methods: File processing

[Diagram: Homo_sapiens.sh, Pan_troglodytes.sh, … ; qsub setenv ; Makefile: make -j 4 all]

Methods: Software used

• NCBI standalone BLAST (formatdb, blastp, fastacmd)

• Muscle
• GeneWise
• HyPhy
• BioPerl/Bio::Phylo (for parsing, logging and wrapping, all scripts under svn)

Methods: Project organization

From: Noble, W.S., 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput. Biol. 5(7).

Methods: ThamesBlue hardware

• One of the 100 fastest supercomputers in the world

• IBM BladeCenter cluster
• JS21 and JS20 Blade servers with 60TB of storage connected via a Myrinet 2G network
• SuSE Linux Enterprise Server
• General Parallel File System
• Batch jobs managed with Torque

Results

• 5952 loci with >= 2 RBHs relative to humans
• 2346 loci with dN/dS deviation somewhere (p < 0.05)

[Figure labels: Homo sapiens, Pan troglodytes, Gorilla gorilla, Pongo pygmaeus, Macaca mulatta, Callithrix jacchus, Tarsius syrichta, Microcebus murinus, Otolemur garnettii]

Results: some interesting terms

• Forebrain development, lifespan (and apoptosis), learning and social behavior in apes, including “deep” nodes
• Eye development in “higher” monkeys
• Terms to do with pregnancy
• Terms to do with male-male competition
• Etc. Etc. (…lots of hard to interpret molecular processes, of course…)

“Brain genes”

Visual system

• Primates have a highly variable visual system:
– Old World monkeys: three types of cones (unique among mammals)
– New World monkeys: females trichromatic, males dichromatic

Biological conclusions

• Very, very, very, very preliminary: highest dN/dS ratios in functions for which there are multiple “optima” among primates:
– Different placentation systems
– Different mating systems
– Different visual systems
– Different life histories and brain mass investments

Methodological conclusions

• Nine genomes is not that much. As FASTA files, it’s a 14 GB zipped archive (AA+cDNA).

• The problem was trivially parallelizable, so I didn’t use MPI versions of any software.

• Simple, consistent workflow and project design conventions are a lifesaver.

• Make each step small enough so you can rerun it, because you will.
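Since every per-gene file can be processed independently, a simple worker pool is enough and no message passing is needed. The talk did this with `make -j 4` inside batch jobs; the Python sketch below is only an analogue of that idea, and the name `run_all` and the default of four workers are assumptions. Threads are used here on the assumption that each step mostly waits on an external program.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_all(step: Callable[[T], R], inputs: Iterable[T], jobs: int = 4) -> List[R]:
    """Apply `step` to every input independently, `jobs` at a time.

    Results come back in input order. Because no input depends on
    another, any single one can be rerun on its own if it fails.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(step, inputs))
```

For example, `run_all(len, ["ATG", "ATGAAA"], jobs=2)` returns the two lengths in the same order as the inputs.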

Summary

• I discussed:
– Primate evolution and adaptation
– Ortholog-finding
– Alignment (multiple proteins, cDNA to protein)
– Tree-based dN/dS ratio tests
– Gene Ontology term enrichment
– Methodological challenges

Acknowledgements

• Funding: FP7-PEOPLE-IEF-2008/N°237046
• DBCLS for their kind invitation
• Mark Pagel, Andrew Meade for discussion and help designing the workflow