Anatomy of a Genome Project A.Sequencing 1. De novo vs. ‘resequencing’ 2.Sanger WGS versus...

Anatomy of a Genome Project

A. Sequencing
1. De novo vs. ‘resequencing’
2. Sanger WGS versus ‘next generation’ sequencing
3. High versus low sequence coverage

B. Assembly
1. Draft assembly
2. Gap closure

C. Annotation
1. Gene, intron, RNA prediction
2. De novo vs. homology-based prediction
3. Assessing confidence

D. Comparison
1. Comparing gene content, lineage-specific gene loss, gain, emergence
2. Comparing genome structure (chromosomes, breakpoints, etc.)
3. Comparing evolutionary rates of change (rates of amino-acid, nucleotide substitution)

Anatomy of a Genome Project: non-Model challenges

A. Sequencing
1. De novo vs. ‘resequencing’ … resequencing not possible without a close, syntenic relative
2. Sanger WGS versus ‘next generation’ sequencing
3. High versus low sequence coverage … need high coverage and long reads (or mate-pair reads) to assemble

B. Assembly
1. Draft assembly
2. Gap closure … time consuming no matter what

C. Annotation
1. Gene, intron, RNA prediction
2. De novo vs. homology-based prediction
3. Assessing confidence
De novo predictions challenging if gene models are different in your species … can rely less on homology for identifications and assessing confidence

D. Comparison
1. Comparing gene content, lineage-specific gene loss, gain, emergence
2. Comparing genome structure (chromosomes, breakpoints, etc.)
3. Comparing evolutionary rates of change (rates of amino-acid, nucleotide substitution)
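
The coverage trade-off in A.3 can be made concrete with the usual back-of-the-envelope calculation: mean coverage is total sequenced bases divided by genome size, and under the idealized Lander-Waterman assumption of uniformly random reads, the expected fraction of bases never sampled is roughly e^(-coverage). The sketch below is a minimal illustration only; the function names, genome size, and read counts are assumptions made up for the example, and real projects fall short of this idealization because of repeats and sequencing bias.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_uncovered_fraction(coverage):
    """Idealized (uniform random reads) fraction of bases hit by no read."""
    return math.exp(-coverage)


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical project: 100 Mb genome, 20 million reads of 100 bp each.
genome_size = 100_000_000
cov = mean_coverage(n_reads=20_000_000, read_length=100, genome_size=genome_size)
print(f"mean coverage: {cov:.1f}x")
print(f"expected uncovered fraction: {expected_uncovered_fraction(cov):.2e}")
print(f"reads needed for 30x: {reads_needed(30, 100, genome_size):,}")
```

Low coverage leaves many gaps even in this best case (at 1x, about 37% of bases are expected to be unsampled), which is one reason de novo assembly of a non-model genome needs both depth and long or paired reads.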

The power of comparison

For many non-model organisms, most of the predicted genes will be uncharacterized & may not have homology to known genes.

But comparison within and between species can still reveal interesting features

1. Comparing gene content, lineage-specific gene loss, gain, emergence

2. Comparing genome structure (chromosomes, breakpoints, etc.)

3. Comparing evolutionary rates of change (rates of amino-acid, nucleotide substitution)

4. Comparing population data (SNPs, expression response, phenotypic variation … mapping studies)
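
Item 3 can be illustrated with the simplest possible rate comparison: the proportion of differing sites between two aligned sequences (the p-distance), and the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3), which accounts for multiple substitutions at the same site. This is a minimal sketch; the function names and toy sequences are invented for illustration, and Jukes-Cantor (equal base frequencies and equal substitution rates) is a stand-in for the richer models used in real analyses.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned, unambiguous sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip gaps and ambiguous characters; keep only A/C/G/T columns.
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance (substitutions per site)."""
    if p >= 0.75:
        raise ValueError("sequences are saturated; distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy alignment (made-up sequences, 10 sites, 2 differences).
reference = "ACGTACGTAC"
other = "ACGTTCGAAC"
p = p_distance(reference, other)
print(f"p-distance: {p:.3f}, Jukes-Cantor distance: {jukes_cantor(p):.3f}")
```

Comparing such distances for the same gene across several lineages is one simple way to flag a lineage that is evolving unusually fast or slowly.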

Science April 25, 2014

Tsetse fly: blood feeding insect that gives birth to live larvae & ‘lactates’

- 366 Mb genome = double the size of Drosophila melanogaster

- Identified orthologs across 5 insects … comparison of ortholog presence/absence suggests unique evolutionary trajectories

- blood feeding evolved independently 12 times in Diptera … identified shared proteins unique to several blood-suckers

- Some gene families have been expanded, others contracted in numbers … functional annotations (“GO” = gene ontology predictions) suggest selection
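
The ortholog presence/absence comparison described above reduces to set operations once each species has a list of ortholog groups. The sketch below is illustrative only: the species labels, ortholog group IDs, and function names are all hypothetical, not data or methods from the tsetse paper.

```python
def unique_to(presence, focal):
    """Ortholog groups found in the focal species and nowhere else."""
    others = set().union(*(g for s, g in presence.items() if s != focal))
    return presence[focal] - others


def missing_only_in(presence, focal):
    """Groups present in every other species but absent from the focal one."""
    others = [g for s, g in presence.items() if s != focal]
    return set.intersection(*others) - presence[focal]


def shared_exclusively(presence, subset):
    """Groups present in all species of `subset` and in no species outside it."""
    inside = set.intersection(*(presence[s] for s in subset))
    outside = set().union(*(g for s, g in presence.items() if s not in subset))
    return inside - outside


# Hypothetical presence/absence calls: species -> set of ortholog group IDs.
presence = {
    "blood_feeder_1": {"OG1", "OG2", "OG4", "OG5"},
    "blood_feeder_2": {"OG1", "OG2", "OG3", "OG5"},
    "non_feeder_1": {"OG1", "OG2", "OG3"},
    "non_feeder_2": {"OG1", "OG3", "OG6"},
}

print(unique_to(presence, "blood_feeder_1"))        # candidate lineage-specific gain
print(missing_only_in(presence, "blood_feeder_1"))  # candidate lineage-specific loss
print(shared_exclusively(presence, {"blood_feeder_1", "blood_feeder_2"}))
```

In practice, an apparent absence may reflect an assembly gap or a missed gene model rather than a true loss, so calls like these are candidates to check, not conclusions.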

- sequenced 4 bat genomes & compared orthologs across 22 mammals

- used phylogenetic analysis and protein trees to identify cases of lineage-specific evolution

To detect convergent evolution, look for proteins with unusual sequence relationships

Found ~2,300 genes with signatures of convergent evolution.

- enriched for genes linked to hearing, ear development, and … vision

Evolutionary Genetics Recap

Evolutionary Genetics: Recurring Themes

* Duplication facilitates change

- Duplications can be tandem, segmental, or whole genome
- Most duplications lost quickly through neutral (or selective) processes
- Facilitates subfunctionalization and neofunctionalization
- Baker et al. 2013 paper: paralog interference could drive evolution

- Benefits of duplication operate at all levels

- Gene duplication for novel functions
- Gene duplication for novel regulation
- Gene duplication for novel network rewiring
- Regulatory element duplication for novel gene regulation
- Regulatory protein duplication for novel module regulation
- Regulatory system duplication for novel network rewiring

* Biological systems are more plastic than we might think

- Much of the genome is under constraint from evolution … purifying selection removes variation

- Many features of cellular systems appear to evolve, even if the cellular function or output is conserved

stabilizing selection can explain poor conservation of important features, if the cell finds a ‘quick fix’ to maintain the phenotype

Examples: pervasive evidence of positive selection in fly and rodent coding genes … transcription factor binding-site turnover … phospho-site turnover … genetic/protein rewiring??

strongest constraints may promote wholesale rewiring as stabilizing evolution (e.g. rewiring of ribosomal protein regulon)

De novo genes also appear to emerge frequently from the genomic ether

* Evolutionary pressures vary over time and space

Neutral variation can suddenly become advantageous … therefore accumulation of neutral variation can be a conduit for future adaptation

Deleterious polymorphisms can be stabilized in the presence of other polymorphisms

splitting up alleles by recombination can unmask deleterious alleles

* Use a model for null/neutral expectation for your tests

- Likelihood ratio: comparing how likely one model is versus another (QTL analysis, motif model vs background model, selection model vs neutral model, etc.)

- Random sampling or simulations to assess what you expect by chance

- More complicated simulations (e.g. coalescence)

This is especially true for whole-genome scans … many things look striking until you do the statistics
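
The first two bullets can be combined in one small worked example: compute a likelihood-ratio statistic for an alternative model against a null model, then judge it both with the asymptotic chi-square approximation and with simulations under the null. The sketch below assumes a toy binomial setting (k events in n observations versus a background rate); the function names and numbers are made up for illustration and are not taken from any analysis in the course.

```python
import math
import random


def binom_loglik(k, n, p):
    """Binomial log-likelihood (the constant binomial coefficient is omitted)."""
    if p <= 0.0 or p >= 1.0:
        # Boundary estimates are only compatible with all-failure/all-success data.
        if (p <= 0.0 and k == 0) or (p >= 1.0 and k == n):
            return 0.0
        return float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def lrt_statistic(k, n, p_null):
    """2 * (logL at the maximum-likelihood rate - logL at the null rate)."""
    p_hat = k / n
    return 2.0 * (binom_loglik(k, n, p_hat) - binom_loglik(k, n, p_null))


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


def simulated_p_value(k, n, p_null, n_sim=2000, seed=0):
    """Empirical p-value: how often the null alone gives a statistic this large."""
    rng = random.Random(seed)
    observed = lrt_statistic(k, n, p_null)
    hits = 0
    for _ in range(n_sim):
        k_sim = sum(rng.random() < p_null for _ in range(n))
        if lrt_statistic(k_sim, n, p_null) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one so the estimate is never exactly 0


# Toy data: 65 events in 100 observations, null (background) rate 0.5.
stat = lrt_statistic(65, 100, 0.5)
print(f"LRT statistic: {stat:.2f}")
print(f"chi-square p-value: {chi2_sf_df1(stat):.4f}")
print(f"simulated p-value:  {simulated_p_value(65, 100, 0.5):.4f}")
```

In a whole-genome scan the same test is run thousands of times, so a p-value like this one still needs multiple-testing correction before any single hit is taken seriously.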

* Value of a phylogenetic perspective

- use the tree if you have one … but it may not be the same tree across the entire genome

- inferring the state of the common ancestor can aid in analysis

Can be very useful for inferring evolutionary trajectory, timing, order of events
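
One simple way to infer the state of a common ancestor is Fitch parsimony: working from the tips to the root, each internal node takes the intersection of its children's state sets, or their union (counting one change) when the intersection is empty. The sketch below implements only this bottom-up pass on a tree written as nested tuples; the tree, species labels, and trait are hypothetical, and likelihood-based reconstruction is the more common choice in real analyses.

```python
def fitch(tree, leaf_states):
    """Return (candidate root states, minimum number of changes) for `tree`.

    `tree` is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: keep both possibilities and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical tree (((A, B), C), (D, E)) and a presence/absence trait.
tree = ((("A", "B"), "C"), ("D", "E"))
states = {"A": "present", "B": "present", "C": "absent",
          "D": "absent", "E": "absent"}

root_states, n_changes = fitch(tree, states)
print(root_states, n_changes)
```

Here the most parsimonious reading is that the ancestor lacked the trait and it was gained once, on the branch leading to A and B. That is the kind of inference about order and direction of events the slide refers to.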

* Control for co-variates

Example: controlling for expression levels re. rate of protein evolution

Often hard to know what to even look/control for

* Best evidence if >1 test is significant

* Know your dataset

Know how the data were collected, what types of noise are associated

e.g. genome sequences by short-read deep sequencing, protein-protein interaction data
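
The covariate point above (controlling for expression level when studying the rate of protein evolution) can be sketched as a partial correlation: regress both variables of interest on the covariate and correlate the residuals. The example below is a minimal illustration in plain Python; the function names and the six data points are invented so that the effect is easy to see, and `trait` stands for any hypothetical gene property being tested.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    slope, intercept = linear_fit(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(y, z, covariate):
    """Correlation between y and z after controlling for the covariate."""
    return pearson(residuals(covariate, y), residuals(covariate, z))


# Made-up values for six genes.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # e.g. log expression level
rate = [0.92, 0.82, 0.68, 0.58, 0.52, 0.42]   # e.g. rate of protein evolution
trait = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]        # hypothetical gene property

print(f"raw correlation:     {pearson(rate, trait):.2f}")
print(f"partial correlation: {partial_correlation(rate, trait, expression):.2f}")
```

In this toy data the strong raw correlation between rate and trait disappears once expression is accounted for, which is exactly the kind of artifact an uncontrolled covariate can produce.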

Evolutionary Genetics: Remaining Questions & Challenges

What is the relative contribution of adaptive vs. neutral evolution?

Epistasis & Environmental interactions
- how much does epistasis contribute in nature?
- challenges associated with gene-gene/gene-environment signals

Detecting signatures of selection, esp. recent/transient
- human evolution
- how will tests, statistics, caveats change with 10,000 genomes?

What is the relative contribution of regulatory vs. coding evolution?

What features contribute to the evolution of new forms and functions?