Computational Gene Finding Greg Voronin Hui Zhao Xueyi(Judy) Xiao CIS786 Intro to Comp Biol Instructor: Dr. Barry Cohen
Date posted: 19-Dec-2015


Computational Gene Finding

Greg Voronin

Hui Zhao

Xueyi(Judy) Xiao

CIS786 Intro to Comp Biol

Instructor: Dr. Barry Cohen


The Challenge

Generate predictions of gene locations from primary genomic sequence by computational means

Two principal means:
– Database searching
– Statistical methods

Presented By Greg Voronin


The Biological Model


The Computational Model

Representing the biology in a framework amenable to mathematical/statistical methods

Exon classification, sequence features, signal profiles
– What is an exon, and what properties does the sequence of an exon hold?
– How is an exon recognized and processed?


Exon Classification Scheme


The Nature of The Data

What is the primary genomic sequence?

• “Nor is the available sequence a single continuous and exact sequence for each chromosome… [the HGP] is represented by a set of sequences that cover the genome in a statistical sense but have a very large number of gaps.”

– Many genes are as large or larger than the contigs in the HGP

– Finding genes will depend on the accuracy of the scaffold of their contigs


Back to Beginning

What is a gene?
– A biological model, a mathematical model and a computational representation

The programs we evaluate take these factors into account in their underlying model


MZEF

Michael Zhang’s Exon Finder

Utilizes quadratic discriminant analysis (QDA) to classify sequence into gene and non-gene groups
– QDA is a multivariate statistical pattern recognition method
– “Draws” a curved boundary between groups of different classes


QDA


Key Elements of QDA

Entities are represented by an n-dimensional vector of feature values

Two classes of entities are characterized by their respective multivariate normal distributions
– Each class has its own mean vector: the mean of each feature

An appropriate distance function is central to the calculation of the posterior probability of group membership of an unknown entity given its specific feature vector.


Mahalanobis Distance

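The actual posterior probability function is more complex, but this is the distance component:

$$(x - \mu_i)^T \, \Sigma_i^{-1} \, (x - \mu_i)$$

where $x$ is the feature vector, and $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of class $i$.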
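To make the distance component concrete, here is a minimal Python sketch for a two-feature case. It is not MZEF's code: the class means, covariance matrices and priors are hypothetical values chosen for illustration, and the score used for classification is the usual QDA discriminant (log prior, minus half the log determinant, minus half the squared Mahalanobis distance).

```python
import math

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^{-1} (x - mean), 2 features."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # Row vector dx times the inverse, then dot product with dx.
    tmp = (dx[0] * inv[0][0] + dx[1] * inv[1][0],
           dx[0] * inv[0][1] + dx[1] * inv[1][1])
    return tmp[0] * dx[0] + tmp[1] * dx[1]

def qda_score(x, mean, cov, prior):
    """Log of prior times Gaussian density, minus the constant shared by all classes."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * mahalanobis_sq(x, mean, cov)

def classify(x, classes):
    """Return the name of the class with the highest QDA score."""
    return max(classes, key=lambda name: qda_score(x, *classes[name]))

# Hypothetical parameters: (mean vector, covariance matrix, prior) per class.
CLASSES = {
    "exon": ((2.0, 3.0), ((1.0, 0.0), (0.0, 1.0)), 0.5),
    "non-exon": ((0.0, 0.0), ((4.0, 0.0), (0.0, 4.0)), 0.5),
}

print(classify((2.0, 3.0), CLASSES))
print(classify((-3.0, -3.0), CLASSES))
```

Because each class has its own covariance matrix, the set of points where the two scores are equal is a quadratic curve rather than a straight line, which is the "curved boundary" mentioned earlier.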


MZEF Specifics

MZEF uses the following features:
– Exon length, exon-intron transition, branch site score, 3’ss score, exon score, strand score, frame score, 5’ss score, intron-exon transition

9-dimensional feature vector

Training sets of known exons and “non-exons” are used to establish the class characteristics
– Supervised learning


…GATC… to Gene

Cells recognize genes from DNA sequence.

Can we?

The Hidden Markov Model Method

HMMgene Presented By Hui Zhao


HMMs are Statistical Models

Definition:
– Any mathematical construct that attempts to parameterize a random process

Example: A normal distribution
– Assumptions
– Parameters
– Estimation
– Usage

HMMs are just a little more complicated…
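The normal-distribution example can be walked through in a few lines of Python. This is only an illustration of the four steps named on the slide; the observations are made-up numbers (think of them as exon lengths), and `NormalDist` comes from the standard library's `statistics` module.

```python
from statistics import NormalDist

# Assumption: the observations are independent draws from one normal distribution.
# Hypothetical data, e.g. exon lengths in base pairs.
observations = [120, 135, 150, 110, 140, 125, 130]

# Parameters and estimation: the model has two parameters (mean and standard
# deviation), both estimated from the sample.
model = NormalDist.from_samples(observations)

# Usage: ask how plausible a new observation is under the fitted model.
typical = model.pdf(130)
unusual = model.pdf(400)
print(model.mean, model.stdev, typical > unusual)
```

An HMM follows the same pattern, with transition and emission probabilities in place of the mean and standard deviation.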


Primary HMM Assumptions

• Observations are ordered
• Random processes can be represented by a stochastic finite state machine with emitting states
– transition probabilities and emission probabilities


How do we find the model probabilities?

• This is called training
• We start with an architecture and a set of observed sequences
• The training process iteratively alters the model's parameters to fit the training set
• The trained model will assign the training sequences high probability, but can it generalize?


HMM Usage – two major tasks

Evaluate the probability of an observed sequence given the model (Forward)

Find the most likely path through the model for a given observation sequence (Viterbi)
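Both tasks can be sketched on a toy two-state model. The states ("E" for exon, "I" for intron) and all probabilities below are hypothetical values picked so that "E" favours G/C and "I" favours A/T; the sketch assumes a non-empty sequence and multiplies raw probabilities, whereas real gene finders work in log space to avoid underflow.

```python
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def forward(seq):
    """Probability of the observed sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

def viterbi(seq):
    """Return (probability, path) for the single most likely state path."""
    best = {s: (START[s] * EMIT[s][seq[0]], s) for s in STATES}
    for symbol in seq[1:]:
        new_best = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state.
            prob, path = max(
                (best[p][0] * TRANS[p][s], best[p][1]) for p in STATES
            )
            new_best[s] = (prob * EMIT[s][symbol], path + s)
        best = new_best
    return max(best.values())

print(forward("GCGCGCATATAT"))
print(viterbi("GCGCGCATATAT")[1])
```

The Viterbi path is the labeling of the sequence; the forward value is the total probability of the sequence under the model, so it is never smaller than the probability of the Viterbi path.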


Gene Finding: An Ideal HMM Application

Our Objective:
– To find the coding and non-coding regions of an unlabeled string of DNA nucleotides

Our Motivation:
– Assist in the annotation of genomic data produced by genome sequencing methods
– Gain insight into the mechanisms involved in transcription, splicing and other processes


Why HMMs might be a good fit for Gene Finding

• The observations within a sequence are ordered
• A DNA sequence is a set of ordered observations
• Designing the architecture is straightforward
• Easy to measure success
• Training data is available from various genome annotation projects


An HMM gene finder

• States represent standard gene features: intergenic region, exon, intron, perhaps more (promoter, 5’UTR, 3’UTR, Poly-A, ...).
• Observations are things like state-dependent base composition.
• In an HMM, the length of each state must be included as well.
• Finally, reading frames and both strands must be dealt with.


Several problems can occur

[Figure: the correct gene structure, drawn 5’ to 3’, compared with faulty predictions: extended exon, missing exon, additional exon, missing intron, extended gene model]


HMMgene

• Predicts whole genes in any given stretch of DNA
• Uses Hidden Markov Models (HMM) to maximize the probability of accurate prediction
• This allows confidence levels to be determined, and a "Best Prediction" as well as potential alternative splicing predictions to be reported
• Outputs splice sites, start and stop codons, alternative predictions
• Trained for human and C. elegans

Krogh (1997) In Proc. 5th Conf. Intel. Sys. Mol Biol. pp179-186


HMMGene

• Uses an extended HMM called a CHMM
• CHMM = HMM with classes
• Takes full advantage of being able to modify the statistical algorithms
• Uses high-order states
• Trains everything at once


How does HMMGene work?

1) A 5th-order HMM assumes that $P(x_i \mid x_{i-1}, x_{i-2}, x_{i-3}, x_{i-4}, x_{i-5})$ is different in introns, exons, etc.

e.g.: P(G, I | A,C,G,G,T) and P(G, E | A,C,G,G,T)

2) Construct the model


How does HMMGene work? (continued)

3) In a CHMM, states emit a pair (class label, nucleotide), e.g. (E, G) or (I, G)

4) Use Viterbi (n-best) to find a path through the CHMM = a labeled gene

5) Use the forward algorithm to measure P(gene | model), using n-best.


A DNA sequence containing one gene. For each nucleotide its label is written below. The coding regions are labeled ‘C’, the introns ‘I’, and the intergenic regions ‘0’. HMMGene calls these class labels in a CHMM.
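A labeled sequence of this kind is enough to estimate quantities such as P(G, I | A,C,G,G,T) by simple counting. The Python sketch below does exactly that and nothing more: the sequence and labels are invented, the function names are hypothetical, and this relative-frequency estimate is not HMMgene's actual training procedure, which the slides describe as maximizing the probability of the correct labeling.

```python
from collections import Counter, defaultdict

def conditional_counts(seq, labels, order):
    """Count (nucleotide, label) pairs following each context of `order` nucleotides."""
    if len(seq) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        counts[context][(seq[i], labels[i])] += 1
    return counts

def conditional_prob(counts, context, nucleotide, label):
    """Relative-frequency estimate of P(nucleotide, label | context)."""
    total = sum(counts[context].values())
    if total == 0:
        return 0.0
    return counts[context][(nucleotide, label)] / total

# Hypothetical labeled sequence: '0' intergenic, 'C' coding, 'I' intron.
seq    = "ACGGTGACGGTA"
labels = "000CCCCIIIII"

counts = conditional_counts(seq, labels, order=5)
print(conditional_prob(counts, "ACGGT", "G", "C"))
print(conditional_prob(counts, "ACGGT", "A", "I"))
```

In this toy input the context ACGGT occurs twice, once followed by a coding G and once by an intronic A, so each estimate is one half. With real training data the counts would come from many annotated genes.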


HMMGene

• Does not use the standard maximum likelihood (ML) method, which optimizes the probability of the observed sequence; instead it maximizes the probability of the correct prediction.
• Only one conference paper describes the algorithm. There is a web site to run the algorithm, and its performance has been compared to other algorithms.
• No complete description of the algorithm is available. In the 1997 paper the author states "… the details of HMMGene will be described elsewhere (in prep)", but unfortunately the detailed paper has not been published.


HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)


HMMgene and HMM Disadvantages

Markov Chains
– States should be independent
– P(y) must be independent of P(x), which is usually not true

Local maxima
– Model may not converge to the optimal parameter set

Over-fitting
– More training is not always good; the training set may be too small

P(x) … P(y)


Summary

• HMMgene finds whole genes in anonymous DNA with correctly spliced exons.
• It can predict several whole or partial genes in one sequence.
• If some features of a sequence are known, such as hits to ESTs, proteins, or repeat elements, these regions can be locked as coding or non-coding and then the program will find the best gene structure under these constraints.


GENSCAN (v1.0)

A computer program identifying complete exon & intron structures of genes in genomic DNA.

Developed by Chris Burge (Burge 1997), in the research group of Samuel Karlin, Dept of Mathematics, Stanford Univ.

Servers
– Original server @Stanford
– New server @MIT (seq_len <= 500 kb)
– Servers are also maintained by the Pasteur Institute, Paris, and at DKFZ/EMBnet, Heidelberg

Implementations
– web server: http://genes.mit.edu/GENSCAN.html
– email server: http://genes.mit.edu/GENSCANM.html
– local copy: downloaded under a license agreement

Presented By Xueyi (Judy) Xiao


How does It Work?

Designed to predict complete gene structures
– Introns and exons
– Promoter sites
– Polyadenylation signals

Larger predictive scope
– Partial and complete genes
– Multiple genes separated by intergenic DNA in a seq
– Consistent sets of genes on either/both DNA strands

Does not use similarity-based methods

Based on a general probabilistic model of genomic sequence composition and gene structure


Model of Genomic Sequence Structure

Fig. 3, Burge and Karlin 1997


Input http://genes.mit.edu/GENSCAN.html


Output


Graphic View

[Figure legend: Initial Exon, Internal Exon, Terminal Exon, Single-Exon gene; Optimal Exon, Suboptimal Exon]


Is It Good?

Accuracy: Substantially higher accuracy than existing methods when tested on standardized sets of human & vertebrate genes, with 75-80% of exons identified exactly.

Reliability: Able to indicate fairly accurately the reliability of each predicted exon.

Consistency: Consistently high levels of accuracy, for seqs of differing C+G content and for distinct groups of vertebrates.


Why not Perfect?

Gene Number
– Usually approximately correct, but may not be exact

Organism
– Primarily for human/vertebrate seqs; may be less accurate for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs

Exon and Feature Type
– Accuracy: internal exons > initial or terminal exons; exons > polyadenylation or promoter signals (‘NNPP’)

Biases in Test Set
– The Burset/Guigó (1996) dataset: biased toward short genes with relatively simple exon/intron structure
– The Rogic (2001) dataset: DNA seqs from GenBank r-111.0 (04/1999 <- 08/1997); source organism specified; only genomic seqs containing exactly one gene were considered; seqs > 200 kb were discarded; mRNA seqs and seqs containing pseudogenes or alternatively spliced genes were excluded.


What are They doing NOW?

The research group @MIT is currently developing another program, GenomeScan, which is more accurate when a moderately or closely related protein seq is available.


TEST OF METHODS

Sample Tests reported by Literature
– Test on the set of 570 vertebrate gene seqs (Burset & Guigó 1996) as a standard for comparison of gene finding methods.
– Test on the set of 195 seqs of human, mouse or rat origin (named HMR195) (Rogic 2001).

Self-Test done by our group
– Dataset: intron-less (single-exon), intron-rich (multi-exon), intron-poor (random)
– Organism: Human
– Methods: all of the three

Steps


Where to get the dataset for Self-Test?

http://www.ncbi.nlm.nih.gov/genome/guide/human/


Accuracy Measures

Sensitivity vs. Specificity (adapted from Burset & Guigó 1996)

Sensitivity (Sn): fraction of actual coding regions that are correctly predicted as coding

$$Sn = \frac{TP}{TP + FN}$$

Specificity (Sp): fraction of the prediction that is actually correct

$$Sp = \frac{TP}{TP + FP}$$

Correlation Coefficient (CC): combined measure of Sensitivity & Specificity. Range: -1 (always wrong) to +1 (always right)

Confusion matrix:

                        Actual coding   Actual non-coding
Predicted coding             TP               FP
Predicted non-coding         FN               TN
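The three measures can be computed directly from per-nucleotide labels. The sketch below is an illustration, not the evaluation code used for the tables: the label strings are invented ('C' for coding, 'N' for non-coding), the function names are hypothetical, and CC is taken as the correlation coefficient defined by Burset & Guigó (1996), (TP·TN − FN·FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)). The measures are undefined, and this code raises ZeroDivisionError, when a denominator is zero (for example when nothing is predicted as coding).

```python
import math

def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN per nucleotide; 'C' marks coding, anything else non-coding."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == "C" and a == "C":
            tp += 1
        elif p == "C":
            fp += 1
        elif a == "C":
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) at the nucleotide level."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc

# Hypothetical annotation and prediction for a 10-nucleotide sequence.
actual    = "NNCCCCNNNN"
predicted = "NNNCCCCNNN"

counts = confusion_counts(actual, predicted)
print(counts)
print(accuracy(*counts))
```

Here the prediction is shifted by one position relative to the annotation, so one coding nucleotide is missed and one non-coding nucleotide is wrongly called coding.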


Results: Accuracy Statistics

Table: Relative Performance (adapted & added from Rogic 2001)

Test by Rogic 2001 (nucleotide accuracy: Sn, Sp, CC; exon accuracy: ESn, ESp)

Program   # of seqs  Sn    Sp    CC    ESn   ESp
Genscan   195(3)     0.95  0.90  0.91  0.70  0.70
HMMgene   195(5)     0.93  0.93  0.91  0.76  0.77
MZEF      119(8)     0.70  0.73  0.66  0.58  0.59

Self-Test 2002 (exon accuracy)

          Multi-Exon               Single-Exon
Program   # of seqs  ESn   ESp     # of seqs  ESn   ESp
Genscan   5          0.57  0.63    5          0.60  0.50
HMMgene   5          0.42  0.42    5          0.60  0.30
MZEF      5          0.76  0.62    5          0.40  0.40

# of seqs - number of seqs effectively analyzed by each program; in parentheses is the number of seqs where the absence of a gene was predicted

Sn - nucleotide level sensitivity; Sp - nucleotide level specificity

CC - correlation coefficient

ESn - exon level sensitivity; ESp - exon level specificity


Testing ‘Random’ Sequences

These gene finding programs model statistical trends and properties
– Can they be fooled by ‘random’ sequences?
– Generate a preliminary measure of accuracy

Java program written to generate ‘random’ sequences of a, t, g, c

3 groups of sequences: 5k, 10k & 30k

Sent to BLAST, then GeneMachine
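The group's generator was written in Java and is not reproduced in the slides. As a rough stand-in, the Python sketch below produces sequences of the three lengths and formats them as FASTA for submission. It assumes each base is drawn independently with equal probability, which the slides do not state, and the function names and seed are hypothetical.

```python
import random

def random_dna(length, seed=None):
    """Return a string of `length` bases drawn uniformly from a, t, g, c."""
    rng = random.Random(seed)
    return "".join(rng.choice("atgc") for _ in range(length))

def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)

# One sequence per group size used in the test: 5k, 10k and 30k bases.
for size in (5_000, 10_000, 30_000):
    record = to_fasta(f"random_{size}", random_dna(size, seed=size))
    print(record.splitlines()[0], len(record))
```

Passing a seed makes each sequence reproducible, which is useful if the same input has to be sent to several programs.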

Presented By Greg Voronin


Testing Results

BLAST:

          bit score   E-value
– 5k      42          5.7
– 10k     44          3.0
– 30k     42          8.7

GeneMachine:

            5k    10k   30k
– MZEF      1     5     14
– GenScan   3     11    26
– HMMgene   7     11    42


New directions

• Computational gene finding has rapidly evolved since it started 20 years ago.
• The advent of full-length genomic sequences has provided data and increased the requirements.
• Gene annotation has direct medical implications on the design of pharmaceuticals and the understanding of the genetic component of diseases.
• Gene finding remains largely an unsolved problem.

Presented By Hui Zhao


New directions

• The growing quantities of training data for the models should improve their performance.
• Algorithms that combine the inputs from several models in a weighted voting scheme should be considered to try to get the best from all of the methods.
• Many other AI approaches can be used to meet this challenge, including decision trees, neural networks and rule-based systems.
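One way to picture the weighted voting idea is a per-nucleotide vote among the three programs. The sketch below is only an illustration: the predictions are invented label strings ('C' coding, 'N' non-coding), the function name and threshold are hypothetical, and using each program's nucleotide-level CC from the Rogic 2001 test as its weight is an assumption made for the example, not a scheme proposed in the slides.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine per-nucleotide coding calls ('C' or 'N') from several programs."""
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all predictions must cover the same sequence")
    total = sum(weights[name] for name in predictions)
    combined = []
    for i in range(lengths.pop()):
        # Sum the weights of the programs that call position i coding.
        score = sum(weights[name] for name, p in predictions.items() if p[i] == "C")
        combined.append("C" if score / total > threshold else "N")
    return "".join(combined)

# Hypothetical calls for a 5-nucleotide stretch.
predictions = {"Genscan": "NCCCN", "HMMgene": "NCCNN", "MZEF": "NNCCC"}
# Assumed weights: the CC values reported in the Rogic 2001 test.
weights = {"Genscan": 0.91, "HMMgene": 0.91, "MZEF": 0.66}

print(weighted_vote(predictions, weights))
```

A position is called coding when the programs voting for it hold more than half of the total weight, so a single low-weight program cannot carry a call on its own.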


Challenges and Discoveries Ahead

Eukaryotic gene finding continues to be an active and important area – more research is required into algorithms with greater accuracy

Expertise in computational biology is also required, which means training in both computer science and molecular biology

More classes like this…