Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov
Date posted: 15-Jan-2016

Page 1:

Gene Recognition

Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov

Page 2:

The Central Dogma

DNA → (transcription) → RNA → (translation) → Protein

DNA: CCTGAGCCAACTATTGATGAA

RNA: CCUGAGCCAACUAUUGAUGAA

Protein: PEPTIDE
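The example sequences on this slide can be checked with a short script. This is a minimal Python sketch, assuming the standard genetic code; the names `transcribe` and `translate` are illustrative and do not come from the slides.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Write the coding-strand DNA as RNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA codon by codon, stopping at the first stop codon."""
    dna = rna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The one-letter amino acid codes of the seven codons spell out the word on the slide.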

Page 3:

Gene structure

Gene layout: exon1, intron1, exon2, intron2, exon3

Processing steps: transcription, splicing, translation

exon = protein-coding

intron = non-coding

Codon: A triplet of nucleotides that is converted to one amino acid

Page 4:

Locating Genes

• We have a genome sequence, maybe with related genomes aligned to it… where are the genes?

• Yeast genome is about 70% protein coding

  • About 6000 genes

• Human genome is about 1.5% protein coding

  • About 22,000 genes

Page 5:

Finding Genes in Yeast

[Diagram, 5’ to 3’: Intergenic | Coding | Intergenic, with the transcript drawn below]

Start codon: ATG

Stop codon: TAG/TGA/TAA

Mean coding length about 1500 bp (500 codons)

Page 6:

Finding Genes in Yeast

• ORF scanning: look for long open reading frames (ORFs)

  • ORFs start with ATG and contain no in-frame stop codons

  • Long ORFs are unlikely to occur by chance (i.e., they are probably genes)
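ORF scanning as described here can be sketched in a few lines. The function below is an illustration only: it scans the three forward-strand frames (a real scanner would also check the reverse complement), and the name `find_orfs` and the `min_codons` threshold are assumptions rather than anything specified in the slides.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    An ORF runs from the first ATG after the previous in-frame stop to the
    next in-frame stop, and is kept if it has at least `min_codons` codons
    before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(example, min_codons=3))   # [(2, 23)]
print(find_orfs(example, min_codons=10))  # []
```

Raising `min_codons` trades sensitivity for fewer chance hits, which is the idea behind looking only for long ORFs.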

Page 7:

Finding Genes in Yeast

Yeast ORF distribution

Page 8:

Introns: The Bane of ORF Scanning

[Diagram, 5’ to 3’: Intergenic | Exon | Intron | Exon | Intron | Exon | Intergenic, with splice sites marked at the exon/intron boundaries and the transcript drawn below]

Start codon: ATG

Stop codon: TAG/TGA/TAA

Page 9:

Introns: The Bane of ORF Scanning

• Drosophila:

  • 3.4 introns per gene on average

  • mean intron length 475 bp, mean exon length 397 bp

• Human:

  • 8.8 introns per gene on average

  • mean intron length 4400 bp, mean exon length 165 bp

• ORF scanning is defeated
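A rough calculation suggests why exon-sized ORFs carry little signal. The sketch below assumes independent, uniformly distributed bases, so that 3 of the 64 codons are stops; that simplification, and the name `prob_no_stop`, are not from the slides.

```python
def prob_no_stop(n_codons, stop_fraction=3 / 64):
    """Chance that n_codons random codons in a row contain no stop codon."""
    return (1 - stop_fraction) ** n_codons


# A typical yeast coding region: about 500 codons.
print(f"{prob_no_stop(500):.1e}")

# A typical human exon: 165 bp, i.e. 55 codons.
print(f"{prob_no_stop(55):.3f}")
```

Under this model an open stretch of 500 codons is vanishingly unlikely by chance, while a stretch the size of a mean human exon turns up by chance roughly 7% of the time, so exon-length open frames are too common to stand out.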

Page 10:

Where are the genes?

Page 11:
Page 12:

Needles in a Haystack

Page 13:

Now What?

• We need to use more information to help recognize genes

  • Regular structure

  • Exon/intron lengths

  • Nucleotide composition

  • Biological signals: start codon, stop codon, splice sites

  • Patterns of conservation

Page 14:

Regular Gene Structure

• Protein coding region starts with ATG, ends with TAA/TAG/TGA

• Exons alternate with introns

• Introns start with GT/GC, end with AG

• Each exon has a reading frame determined by the codon position at the end of the last exon
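The frame bookkeeping in the last bullet is simple modular arithmetic. In this sketch, "frame" is taken to mean the number of bases of an unfinished codon carried into the exon, which is an assumed convention (the slides do not define the numbering); `exon_frames` is an illustrative name.

```python
def exon_frames(exon_lengths, first_frame=0):
    """Return the frame in which each exon starts.

    Frame k means the exon begins k bases into a codon, so the next exon's
    frame is (current frame + current exon length) mod 3.
    """
    frames = []
    frame = first_frame
    for length in exon_lengths:
        frames.append(frame)
        frame = (frame + length) % 3
    return frames


lengths = [100, 50, 30]
print(exon_frames(lengths))  # [0, 1, 0]
print(sum(lengths) % 3)      # 0, so the coding region is a whole number of codons
```

Intron lengths do not enter the calculation at all, which is why a gene model only has to track the frame across exons.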

Page 15:

Next Exon: Frame 0

Next Exon: Frame 1

Page 16:

Exon/Intron Lengths

Page 17:

Nucleotide Composition

• Base composition in exons is characteristic due to the genetic code

Amino Acid      SLC   DNA Codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG

Page 18:

Biological Signals

• How does the cell recognize start/stop codons and splice sites? In part, from characteristic base composition

• Donor site (start of intron) is recognized by a section of U1 snRNA

U1 snRNA:             GUCCAUUCA
Donor site consensus: MAGGTRAGT

M means “A or C”, R means “A or G”
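The ambiguity codes in the consensus map naturally onto a regular expression. The sketch below covers only the codes used on this slide (M and R); the helper names are illustrative. It also checks that the U1 segment, as written on the slide, base-pairs position by position with one sequence that fits the consensus.

```python
import re

# Only the ambiguity codes that appear on the slide are included.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "[AC]", "R": "[AG]"}

DONOR_CONSENSUS = "MAGGTRAGT"
U1_SEGMENT = "GUCCAUUCA"

# Watson-Crick partner (as a DNA base) of each RNA base.
PAIR = {"G": "C", "C": "G", "A": "T", "U": "A"}


def matches_consensus(site, consensus=DONOR_CONSENSUS):
    """True if `site` fits the consensus, treating M and R as wildcards."""
    pattern = re.compile("".join(IUPAC[code] for code in consensus))
    return pattern.fullmatch(site.upper()) is not None


def paired_dna(rna):
    """DNA sequence that pairs position by position with `rna`."""
    return "".join(PAIR[base] for base in rna)


print(matches_consensus("CAGGTAAGT"))  # True
print(matches_consensus("TAGGTAAGT"))  # False: position 1 must be A or C
print(paired_dna(U1_SEGMENT))          # CAGGTAAGT
```

Real donor sites rarely match the consensus exactly, which is one reason to score sites with position-specific frequencies rather than a yes/no pattern.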

Page 19:

[Figure: sequence annotated with signal motifs: atg, tga, ggtgag (three times), caggtg, cagatg, cagttg, caggccggtgag]

Page 20:

Splice Sites

Donor site (5’ to 3’), base frequency (%) by position:

Position   -8   …   -2   -1     0    1    2   …   17
A          26   …   60    9     0    0   54   …   21
C          26   …   15    5     0    1    2   …   27
G          25   …   12   78   100    0   41   …   27
T          23   …   13    8     0   99    3   …   25
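A frequency table like this one can be turned into a weight matrix score. The sketch below uses only the five contiguous columns printed on the slide (positions -2 to 2) and leaves the elided columns out. The pseudocount of 0.5 percentage points and the uniform 25% background are assumptions made for the example, and `wmm_score` is an illustrative name.

```python
import math

POSITIONS = (-2, -1, 0, 1, 2)

# Percent of donor sites with each base at each position (from the slide).
PERCENT = {
    "A": (60, 9, 0, 0, 54),
    "C": (15, 5, 0, 1, 2),
    "G": (12, 78, 100, 0, 41),
    "T": (13, 8, 0, 99, 3),
}

PSEUDO = 0.5  # assumed pseudocount, avoids log(0) for unseen bases


def wmm_score(site, background=0.25):
    """Log-odds score (in bits) of a 5-base window against the background."""
    if len(site) != len(POSITIONS):
        raise ValueError("site must cover positions -2 to 2")
    score = 0.0
    for k, base in enumerate(site.upper()):
        p = (PERCENT[base][k] + PSEUDO) / (100 + 4 * PSEUDO)
        score += math.log2(p / background)
    return score


for site in ("AGGTA", "AGCTA", "CCCCC"):
    print(site, round(wmm_score(site), 2))
```

Each position contributes independently to the score, which is exactly the limitation that models with dependencies between positions try to address.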

Page 21:

Splice Sites

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)

Page 22:

Splice Sites

• WMM: weight matrix model = PSSM (Staden 1984)

• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)

• MDD: maximal dependence decomposition (Burge & Karlin 1997)

Decision-tree algorithm to take pairwise dependencies into account

• For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$, where $C_i$ indicates whether position i matches the consensus and $X_j$ is the nucleotide at position j

• Choose i* such that $S_{i^*}$ is maximal and partition into two subsets, until

  • No significant dependencies left, or

  • Not enough sequences in subset

Train separate WMM models for each subset

Decision tree (G5 means G at position 5, G-1 means G at position -1, and so on):

All donor splice sites
  • not G5
  • G5
    • G5, not G-1
    • G5 G-1
      • G5 G-1, not A2
      • G5 G-1 A2
        • G5 G-1 A2, not U6
        • G5 G-1 A2 U6
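One splitting step of MDD can be sketched as follows. This is an illustration under simplifying assumptions: it computes a plain chi-square statistic on the 2x4 table (consensus match at position i versus base at position j), picks the position with the largest summed statistic, and splits once. The significance test and the minimum subset size that stop the recursion are left out, and all function names are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def chi_square(seqs, i, j, consensus):
    """Chi-square statistic for (position i matches consensus) vs (base at j)."""
    n = len(seqs)
    table = Counter((s[i] == consensus[i], s[j]) for s in seqs)
    row = Counter()
    col = Counter()
    for (is_match, base), count in table.items():
        row[is_match] += count
        col[base] += count
    stat = 0.0
    for is_match in (True, False):
        for base in BASES:
            expected = row[is_match] * col[base] / n
            if expected > 0:
                stat += (table[(is_match, base)] - expected) ** 2 / expected
    return stat


def dependence_scores(seqs, consensus):
    """S_i = sum over j != i of chi-square(C_i, X_j), for every position i."""
    length = len(consensus)
    return [
        sum(chi_square(seqs, i, j, consensus) for j in range(length) if j != i)
        for i in range(length)
    ]


def best_split(seqs, consensus):
    """Split on the position with the largest S_i (first one on ties)."""
    scores = dependence_scores(seqs, consensus)
    i_star = max(range(len(scores)), key=scores.__getitem__)
    matching = [s for s in seqs if s[i_star] == consensus[i_star]]
    rest = [s for s in seqs if s[i_star] != consensus[i_star]]
    return i_star, matching, rest


# Toy data: positions 0 and 2 always agree, position 1 is independent.
toy = ["GAG", "GCG", "GAG", "GCG", "TAT", "TCT", "TAT", "TCT"]
print(dependence_scores(toy, "GAG"))  # [8.0, 0.0, 8.0]
print(best_split(toy, "GAG")[0])      # 0
```

In the full procedure each subset would be split again in the same way, and a separate WMM would be trained on every leaf.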

Page 23:

Patterns of Conservation

• Functional sequences are much more conserved than nonfunctional sequences

• Signal sequences show compensatory mutations

  • If one position mutates away from consensus, often a different one will mutate to consensus

• Coding sequence shows three-periodic pattern of conservation

Page 24:

Three Periodicity

• Most amino acids can be coded for by more than one DNA triplet

• Usually, the degeneracy is in the last position

Species   Sequence   Amino acids
Human     CCTGTT     Proline, Valine
Mouse     CCAGTC     Proline, Valine
Rat       CCAGTC     Proline, Valine
Dog       CCGGTA     Proline, Valine
Chicken   CCCGTG     Proline, Valine
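The three-periodic pattern shows up directly when mismatches are tallied by codon position. This is a small sketch using the five sequences from the slide; the function name is illustrative, and the sequences are assumed to be aligned in frame with no gaps.

```python
def mismatches_by_codon_position(a, b):
    """Count mismatches between two in-frame, ungapped sequences
    at codon positions 1, 2 and 3."""
    counts = [0, 0, 0]
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            counts[i % 3] += 1
    return counts


human = "CCTGTT"
others = {"Mouse": "CCAGTC", "Rat": "CCAGTC", "Dog": "CCGGTA", "Chicken": "CCCGTG"}

total = [0, 0, 0]
for species, seq in others.items():
    counts = mismatches_by_codon_position(human, seq)
    total = [t + c for t, c in zip(total, counts)]
    print(species, counts)
print("Total", total)  # Total [0, 0, 8]
```

Every difference from human falls in the third codon position, so the encoded amino acids stay the same.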

Page 25:

Hidden Markov Models for Gene Finding

GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA

Sequence annotation: Intergenic | Exon | Intron | Exon | Intron | Exon | Intergenic

States shown: Intergene State, First Exon State, Intron State
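Labelling a sequence with hidden states is done with the Viterbi algorithm. The toy model below has three states loosely named after those on the slide. Every probability in it is invented for illustration and comes neither from the slides nor from any trained gene finder, so the resulting labels carry no biological meaning.

```python
import math

STATES = ("intergenic", "exon", "intron")

# Illustrative numbers only.
START = {"intergenic": 1.0, "exon": 0.0, "intron": 0.0}
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon": {"intergenic": 0.05, "exon": 0.90, "intron": 0.05},
    "intron": {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.25, "C": 0.15, "G": 0.15, "T": 0.45},
}


def log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Most probable state path for a non-empty DNA sequence."""
    scores = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + log(TRANS[p][s]))
            new_scores[s] = scores[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


seq = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
print("".join(state[0].upper() if state != "intron" else "i" for state in viterbi(seq)))
```

The printout uses one character per base (I for intergenic, E for exon, i for intron). A real gene finder has many more states, for example one per exon frame, and trained parameters.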

Page 27:

GENSCAN

Page 28:

GENSCAN

• Burge and Karlin, Stanford, 1997

• Before the Human Genome Project

  • No alignments available

  • Estimated human gene count was 100,000

• Explicit state duration HMM (with tricks)

  • Intergenic and intronic regions have geometric length distribution

  • Exons are only possible when correct flanking sequences are present
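The geometric length distribution follows from modelling a region as a single state with a self-loop. As a sketch, with $p$ denoting that state's self-transition probability (a symbol introduced here, not one from the slides), the length $L$ of a run in the state satisfies:

$$P(L = \ell) = p^{\ell - 1}(1 - p), \quad \ell \ge 1, \qquad E[L] = \frac{1}{1 - p}$$

For example, a mean length equal to the 4400 quoted earlier for human introns would correspond to $p = 1 - 1/4400 \approx 0.99977$.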

Page 29:

GENSCAN

• Output probabilities for NC (noncoding) and CDS (coding sequence) depend on previous 5 bases (5th-order)

  $P(X_i \mid X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})$

• Each CDS frame has its own model

• WAM models for start/stop codons and acceptor sites

• MDD model for donor sites

• Separate parameters for regions of different GC content
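A 5th-order output model of the kind described in the first bullet can be estimated by counting which base follows each 5-base context. The sketch below is a generic k-th order Markov chain with add-one smoothing; the smoothing, the single shared model (rather than one per CDS frame or GC-content class) and the function names are all simplifying assumptions.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=5, pseudocount=1.0):
    """Return prob(context, base), estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(prob, seq, order=5):
    """Sum of log P(X_i | previous `order` bases) over the sequence."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


model = train_markov(["ACGT" * 50])
print(log_likelihood(model, "ACGTACGTACGT"))
print(log_likelihood(model, "AAAAAAAAAAAA"))
```

A gene finder would train one such model on coding sequence and one on noncoding sequence, then compare their likelihoods over a candidate region.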

Page 30:

GENSCAN Performance

• First program to do well on realistic sequences

  • Long, multiple genes in both orientations

• Pretty good sensitivity, poor specificity

  • 70% exon Sn, 40% exon Sp

• Not enough exons per gene

• Was the best gene predictor for about 4 years

Page 31:

TWINSCAN

• Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001

• Uses an informant sequence to help predict genes

  • For human, informant is normally mouse

• Informant alignment is encoded as a conservation sequence of three characters

  • Match: |

  • Mismatch: :

  • Unaligned: .

• Informant sequence assumed independent of target sequence
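The three-character encoding can be sketched as a function of a target and an informant that has already been aligned to it. The input convention here is an assumption: the informant string has the same length as the target, with `.` marking unaligned positions and `_` marking gaps, and gaps are counted as mismatches. The function name is illustrative.

```python
def conservation_sequence(target, informant):
    """Encode an aligned informant as match (|), mismatch (:), unaligned (.)."""
    if len(target) != len(informant):
        raise ValueError("informant must be aligned to the target")
    symbols = []
    for t, q in zip(target.upper(), informant.upper()):
        if q == ".":
            symbols.append(".")  # no alignment at this position
        elif q == t:
            symbols.append("|")  # same base
        else:
            symbols.append(":")  # different base, or a gap (assumed)
    return "".join(symbols)


print(conservation_sequence("GGTGAG", "GGTCA."))  # |||:|.
print(conservation_sequence("GGTGAG", "GG__AG"))  # ||::||
```

The gene finder then treats the target base and the conservation symbol at each position as a pair of observations.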

Page 32:

The TWINSCAN Model

• Just like GENSCAN, except adds models for conservation sequence

• 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites

  • One CDS model for all frames

• Many informants tried, but mouse seems to be at the “sweet spot”

Page 33:

TWINSCAN Performance

• Slightly more sensitive than GENSCAN, much more specific

  • Exon sensitivity/specificity about 75%

• Much better at the gene level

  • Most genes are mostly right, about 25% exactly right

• Was the best gene predictor for about 4 years

Page 34:

N-SCAN

• Gross and Brent, Washington University in St. Louis, 2005

• If one informant sequence is good, let’s try more!

• Also several other improvements on TWINSCAN

Page 35:

N-SCAN Improvements

• Multiple informants

• Richer models of sequence evolution

• Frame-specific CDS conservation model

• Conserved noncoding sequence model

• 5’ UTR structure model

Page 36:

HMM Outputs

• GENSCAN

Target                  GGTGAGGTGACCAAGAACGTGTTGACAGTA

• TWINSCAN

Target                  GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation sequence   |||:||:||:|||||:||||||||......

• N-SCAN

Target                  GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1              GGTCAGC___CCAAGAACGTGTAG......
Informant2              GATCAGC___CCAAGAACGTGTAG......
Informant3              GGTGAGCTGACCAAGATCGTGTTGACACAA
...

Page 37:

Phylogenetic Bayesian Network Models

$$P(H, C, M, R, A_1, A_2, A_3) = P(A_1)\,P(C \mid A_1)\,P(A_2 \mid A_1)\,P(H \mid A_2)\,P(A_3 \mid A_2)\,P(M \mid A_3)\,P(R \mid A_3)$$
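The factorization on this slide describes a tree: ancestor A1 is the parent of C and A2, A2 is the parent of H and A3, and A3 is the parent of M and R (presumably human, chicken, mouse and rat). The sketch below evaluates that joint distribution for one alignment column and sums out the unobserved ancestors by brute force. The substitution model, in which a child keeps its parent's base with probability 0.9, and the uniform root distribution are invented for illustration.

```python
from itertools import product

BASES = "ACGT"


def edge(child, parent, stay=0.9):
    """Toy substitution model: P(child base | parent base)."""
    return stay if child == parent else (1 - stay) / 3


def joint(h, c, m, r, a1, a2, a3):
    """P(H, C, M, R, A1, A2, A3) following the tree factorization."""
    return (
        0.25                 # P(A1), uniform root
        * edge(c, a1)        # P(C | A1)
        * edge(a2, a1)       # P(A2 | A1)
        * edge(h, a2)        # P(H | A2)
        * edge(a3, a2)       # P(A3 | A2)
        * edge(m, a3)        # P(M | A3)
        * edge(r, a3)        # P(R | A3)
    )


def column_likelihood(h, c, m, r):
    """P(H, C, M, R): sum the joint over all ancestral bases."""
    return sum(
        joint(h, c, m, r, a1, a2, a3)
        for a1, a2, a3 in product(BASES, repeat=3)
    )


print(column_likelihood("A", "A", "A", "A"))
print(column_likelihood("A", "C", "G", "T"))
```

With only three ancestors the sum has 64 terms; larger trees are normally handled with a dynamic-programming pass over the tree instead of enumeration.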

Page 38:

Homology-Based Gene Prediction

• Idea: Try to predict a gene in one organism using a known orthologous gene or protein from another organism

• Genewise: protein homology

• Projector: gene structure homology

• Very accurate if (and only if??) homology is high

Page 39:

Evaluating Performance

• Three main levels of performance: gene, exon, nucleotide

• Two measures of performance:

  • Sensitivity: what fraction of the true features did we predict correctly?

  • Specificity: what fraction of our predicted features were correct?

• Testing standard is whole-genome prediction

  • Predicting on single-gene sequences is easier and less interesting
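At the exon level, both measures reduce to set arithmetic on exon coordinates. In this sketch an exon counts as correct only if both boundaries match exactly; the coordinates are made up for the example and `exon_accuracy` is an illustrative name.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact exon matches.

    Sensitivity = correct predictions / true exons.
    Specificity = correct predictions / predicted exons.
    """
    true_set, predicted_set = set(true_exons), set(predicted_exons)
    correct = len(true_set & predicted_set)
    sensitivity = correct / len(true_set) if true_set else 0.0
    specificity = correct / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


true_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 600)]

sn, sp = exon_accuracy(true_exons, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # Sn = 0.50, Sp = 0.67
```

Note that "specificity" as defined on this slide is the fraction of predictions that are correct, which other fields call precision.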

Page 40:

Exact Exon Accuracy

[Figure: chart of Exon Sn and Exon Sp (axis from 0.2 to 0.9) for GENSCAN, EXONIPHY, SGP2, TWINSCAN 2.0, and N-SCAN]

Page 41:

Exact Gene Accuracy

[Figure: chart of Gene Sn and Gene Sp (axis from 0 to 0.5) for GENSCAN, SGP2, TWINSCAN 2.0, and N-SCAN]

Page 42:

Intron Sensitivity By Length

[Figure: intron sensitivity (axis from 0 to 1) by intron length in 10 Kb bins from 0-10 to 90-100 Kb, for N-SCAN, SGP2, GENSCAN, and TWINSCAN]

Page 43:

Human Informant Effectiveness

[Figure: chart of Gene Sn, Gene Sp, Exon Sn, and Exon Sp (axis from 0 to 0.9) for informants Chicken, Rat, Mouse, and All]

Page 44:

Drosophila Informant Effectiveness

[Figure: chart of Gene Sn, Gene Sp, Exon Sn, and Exon Sp (axis from 0.3 to 0.9) for informants A. gambiae, D. yakuba, D. pseudoobscura, and All]

Page 45:

The Future

• Many new genomes are being sequenced, and they will need annotations!

  • Current experimental “shotgun” methods not enough

  • However, cheap targeted experiments are available to verify predicted genes

• Promising directions in gene prediction:

  • Conditional random fields

  • Multiple informants: can we actually get them to work???