Computational Genomics

Computational Genomics Izabela Makalowska July 15, 2006


Transcript of Computational Genomics

Page 1: Computational Genomics

Computational Genomics

Izabela Makalowska, July 15, 2006

Page 2: Computational Genomics

The main task in modern biology is to find out how this…

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCGTGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGGCGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAGCGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAAGGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTCTCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGTAGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCTGCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTAAAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACTAGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTCCTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCTAAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC

Page 3: Computational Genomics

…becomes this

Page 4: Computational Genomics

DNA sequence contains all information but we need to decipher it.

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCGTGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGGCGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAGCGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAAGGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTCTCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGTAGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCTGCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTAAAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACTAGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTCCTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCTAAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC

Page 5: Computational Genomics

Program for life

DNA in our cells stores information in a way that is very similar to the way computers do.

Instead of being a binary memory, where everything is either 0 or 1, DNA uses a 4-letter alphabet: A, C, G, T

Using the computer metaphor we can say that:

A plant cell does not look like a mouse cell because their “programs” are different
Liver cells work differently than lung cells because of different input to the program
Children look like their parents because their program is a “revision” of the parents' program
Many diseases are caused by “bugs” in the program:
  Tay-Sachs disease: a simple mistake in one line of code
  Huntington’s disease: a “line” of code gets repeated a bunch of times by accident
Different ways to solve the same problem:
  Plants: photosynthesis = turn light into sugar
  Animals: eat plants or other animals

Page 6: Computational Genomics

What exactly are we looking for in the DNA sequence?

Genes
  Protein coding
  RNA genes
  Retrogenes
Regulatory elements
  Promoters
  Enhancers
  siRNA
Repetitive elements
  LINEs
  SINEs
  Simple repeats

Page 7: Computational Genomics

Are genes just protein or RNA coding elements?

Makorin1-p1 is a non-coding pseudogene of Makorin1. Makorin1-p1 regulates the expression of its related coding gene. It acts by stabilizing the Makorin1 gene by blocking a cis-acting RNA decay element within the 5’ region of Makorin1.

Page 8: Computational Genomics

Are repeats just junk DNA?

Translation of mRNA containing Alu-cassette results in soluble form of the protein

Caras, I.W., Davitz, M.A., Rhee, L., Weddell, G., Martin Jr., D.W., Namba, T., Sugimoto, Y., Negishi, M., Irie, A., Ushikubi, F., Kaki-Nussenzweig,V., Cloning of decay-accelerating factor suggests novel use of splicing to generate two proteins. Nature. 1987 Feb 5-11;325(6104):545-9.

Page 9: Computational Genomics

Getting all genes

The most direct way to identify a gene is to document the transcription of a fragment of the genome: EST sequencing
  Requires less sequencing since it is focused on coding sequence only
  Small rate of false positives, although even 10% of EST sequences could be artifacts
  Genes with very restricted expression may never be discovered
  In most cases gives only partial sequences

Genome sequencing
  Access to the entire genome, which allows us to learn more about genome organization
  Regulatory elements
  Only a small percentage of the genome codes for genes
  Hard to identify less typical genes
  High rate of false positives

Page 10: Computational Genomics

Constructing EST

Cell or tissue
Isolate mRNA and reverse transcribe into cDNA
Clone cDNA into a vector to make a cDNA library
Pick individual clones
Sequence the 5' and 3' ends of cDNA inserts (5' EST, 3' EST)
Analyze

Page 11: Computational Genomics

Problems with EST data

Contamination
Low quality: the error rates are high in individual ESTs
Highly redundant: for highly expressed genes we can have hundreds of ESTs representing a single gene
The databases are skewed for sequences near the 3’ end of mRNA
For most ESTs there is no indication as to the gene from which it was derived
Overlapping genes
Splice variants

Page 12: Computational Genomics

An example of a good chromatogram showing well-resolved peaks and no ambiguities

This is a region of a chromatogram fairly far along the sequence where some bases in runs of 2 or more are no longer visible as single peaks. Many peaks are beginning to broaden and smear into one another, interpretation of the peaks has become more difficult, and the basecalling software has begun to use 'N's.

This is a region of a chromatogram where the traces have become too ambiguous for accurate basecalling. While some parts of this region of the chromatogram can be useful for linking to existing sequences following manual editing, it should not be considered accurate.

Chromatogram

Page 13: Computational Genomics

Sequence quality screening

ABI sequencing software contains a program for quality screening

PHRED - reads DNA sequence data, calls bases, and writes the base calls and quality values to output files

Page 14: Computational Genomics

Quality file

Page 15: Computational Genomics

Phred quality scores

Phred quality score    Probability that the base is called wrong    Accuracy of the base call
10                     1 in 10                                      90%
20                     1 in 100                                     99%
30                     1 in 1,000                                   99.9%
40                     1 in 10,000                                  99.99%
50                     1 in 100,000                                 99.999%
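The values in this table follow the standard Phred definition, Q = -10 log10(P), where P is the probability that the base call is wrong. The short Python sketch below reproduces the table from that formula; the function names are chosen here for illustration and are not part of PHRED itself.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        # Print the same three columns as the table: score, error rate, accuracy.
        print(f"Q{q}: 1 in {round(1 / p):,} wrong, accuracy {100 * (1 - p):.3f}%")
```

A common use of this conversion is to pick a trimming threshold: for example, keeping only bases with Q of 20 or more means accepting at most a 1 in 100 chance of error per base.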

Page 16: Computational Genomics

Contamination

Vectors - DNA/cDNAs from the biological source organism/organelle are usually inserted into a cloning vector so that they can be cloned, propagated, and manipulated. Sequencing of such constructs frequently produces raw sequences that include segments derived from vector.

Adapters, linkers, and PCR primers - Various oligonucleotides can be attached to the DNA/RNA under investigation as part of the cloning or amplification process.

Impurities in the DNA/RNA - Nucleic acid preparations may contain DNA/RNA from sources other than the intended one.

  nucleic acids from an organelle
  mRNA/DNA present in a reagent used in the isolation, purification, or cloning procedures
  nucleic acids from other organisms present in the material from which the DNA/RNA was isolated
  other DNAs/RNAs used in the laboratory (e.g., from accidental mixing of samples or cross contamination from dirty pipettes, tips, tubes, or equipment)

Page 17: Computational Genomics

Consequences of contamination

Time and effort wasted on meaningless analyses

Erroneous conclusions drawn about the biological significance of the sequence

Misassembly of sequence contigs and false clustering of Expressed Sequence Tags (ESTs)

Page 18: Computational Genomics

Why contamination is causing problems

Gene A Gene B

Assembly

Contigs

Page 19: Computational Genomics

VecScreen

http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen_docs.html

Page 20: Computational Genomics

Cross_match

Cross_match is a general purpose application for comparing any two DNA sequence sets. For example, it can be used to compare a set of reads to a set of vector sequences and produce vector-masked versions of the reads. It is slower but more sensitive than BLAST.

GCACGCACAACCAGACCATGCTCGGACGACCCGCTGTACATCGGCCTGCGGCAGAGGCGCGTGCGCGGCGCCGCGTACGACGAGTTCGTCGACGAGTTCATGCAGGCGGTCGTCAAGCGCTTCGGGCAGAACTGCCTCATACAGTTCGAGGACTTCGCCAACGCGAACGCGTTCCGCCTGCTCGAGAAGTACCGCGGCAGGTACTGCACGTTCAACGACGACCTCCAGGGCACGGCGGCGGTGGCGGTGGCCGGGCTGCTCGCGTCGCTGCGCATCACCGGCAAGCGGCTCTCCGACAACGTGTTCGTGTTCCAGGGAGCCGGCGAGGCATCTCTGGGTATCGCCGAGCTGTGCGTGATGGCGATGAAGAACGAGGGTACATCGGACGCCGATGCCCGCTGCAAGATTTGGATGGTGGACTCCAAGGGTCTCATCGTGAAGAACCGTCCTGAAGGTGGACTGAACGAACACAAGGAGAAGTTTGCCCAGAACTGCTCCCCCATTCGGACACTTGCCGAAGTTATAAATGTTGCTAAGCCTTCTGTACTGATTGGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXGCTCGACGGCGGAAGGAAAA
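The read above ends with a run of X characters where vector sequence was masked. The idea of masking can be sketched in a few lines of Python. This is only a toy illustration: Cross_match itself works by sequence alignment and tolerates mismatches, whereas the hypothetical `mask_vector` function below only catches exact matches of length `k` (an assumed parameter).

```python
def mask_vector(read: str, vector: str, k: int = 12) -> str:
    """Replace every read position covered by an exact k-mer shared with the vector by 'X'."""
    read_upper = read.upper()
    vector_upper = vector.upper()
    # All k-mers present in the vector sequence.
    vector_kmers = {vector_upper[i:i + k] for i in range(len(vector_upper) - k + 1)}
    masked = list(read)
    for i in range(len(read_upper) - k + 1):
        if read_upper[i:i + k] in vector_kmers:
            masked[i:i + k] = "X" * k  # mask the whole matching k-mer
    return "".join(masked)


if __name__ == "__main__":
    vector = "GGATCCTCTAGAGTCGACCTGCAG"               # made-up vector fragment
    read = "ACGTACGTAC" + vector[4:20] + "TTGACCA"    # insert flanked by vector
    print(mask_vector(read, vector))
```

Masking with X rather than deleting the bases keeps read coordinates unchanged, so quality values and trace positions still line up with the sequence.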

Page 21: Computational Genomics

Qscreen

Sequencing data management and quality checking system
  Quality screening
  Relational database for sequence data management and archive
  Easy data access via web interface
  Easy data sharing
  Sequence and trace view
  Project statistics
  Password protected

Page 22: Computational Genomics
Page 23: Computational Genomics
Page 24: Computational Genomics
Page 25: Computational Genomics
Page 26: Computational Genomics

Qscreen: project and user management

Page 27: Computational Genomics

Gene discovery strategy

Cluster and assemble EST sequences to lower redundancy and to increase the length of transcripts
Find coding regions and reading frame
Use the deduced protein to search databases and assign function to the gene

Page 28: Computational Genomics

Cluster and assemble EST - resources

Assembly tools:
  TGICL http://www.tigr.org/tdb/tgi/software/
  Cap 3 http://genome.cs.mtu.edu/sas.html
  Phrap http://www.phrap.org/
EST clusters databases:
  UniGene http://www.ncbi.nlm.nih.gov/
  TGI http://www.tigr.org/
EST analysis pipeline:
  SMAPi http://smapi.cbio.psu.edu

Page 29: Computational Genomics

Assembling ESTs

Two stage process:
  Clustering ESTs based on the similarity and clone ID
  Assembling inside each cluster

Page 30: Computational Genomics

Important parameters

Criteria too stringent = many ESTs will not be assembled and genes will stay fragmented

Criteria too loose = ESTs from genes from the same family will be assembled into one gene

Length and similarity level of overlapping fragments

Length of overhanging fragments
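How these parameters change the result can be illustrated with a small Python sketch. It is not the algorithm of any particular assembler: the `Overlap` record, the `accept` and `cluster` functions, and the default thresholds are all assumptions made for illustration. ESTs are joined by single linkage whenever a pairwise overlap passes the length, similarity, and overhang criteria.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    a: str            # name of the first EST
    b: str            # name of the second EST
    length: int       # length of the overlapping region, bp
    identity: float   # fraction of identical positions in the overlap
    overhang: int     # longest unaligned overhanging fragment, bp


def accept(o, min_len=40, min_identity=0.95, max_overhang=30):
    """Decide whether an overlap is good enough to join two ESTs."""
    return (o.length >= min_len
            and o.identity >= min_identity
            and o.overhang <= max_overhang)


def cluster(names, overlaps, **criteria):
    """Single-linkage clustering with a union-find structure."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for o in overlaps:
        if accept(o, **criteria):
            parent[find(o.a)] = find(o.b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    names = ["e1", "e2", "e3", "e4"]
    overlaps = [
        Overlap("e1", "e2", 120, 0.98, 5),
        Overlap("e2", "e3", 60, 0.90, 0),   # e.g. two members of a gene family
        Overlap("e3", "e4", 45, 0.96, 10),
    ]
    print(cluster(names, overlaps))                      # default criteria
    print(cluster(names, overlaps, min_identity=0.99))   # too stringent
    print(cluster(names, overlaps, min_identity=0.85))   # too loose
```

With the made-up data in the example, the stringent setting leaves every EST as a singleton, while the loose setting merges all four ESTs, including the pair that only shares 90% identity.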

Page 31: Computational Genomics

Contig quality

ATGTCTCTNTCACTGA TCTGTCCC-CAGTCACGATCGANATGTCTCTGTCNCTNAGTCACGATCGAN

ATGTCTCTNTCACTGA TCTGTCCC-CAGTCACGATCGANATGTCTCGGTCAC-CAGTCACGATCGATATGTCTCGGTCAC-CAGTCACGATCGATTTGTCTGGGTCAC-CTCC GGTGGC-CAGTCACGATNGANATGTCTCGGTCAC-CAGTCACGATCGAT

ATGTCTCTGTCNCTNAGTCACGATCGANATGTCTCGGTCAC-CAGTCACGATCGAT

Page 32: Computational Genomics

One gene one cluster?

5’ ESTs: joined based on similarity
3’ ESTs: joined based on similarity
5’ and 3’ ESTs: joined based on clone ID

Page 33: Computational Genomics

One cluster one gene?

Page 34: Computational Genomics

Cap3

Use of forward-reverse constraints to correct assembly errors and link contigs.

Use of base quality values in alignment of sequence reads.

Automatic clipping of 5' and 3' poor regions of reads.

Generation of assembly results in ‘ace’ file format for Consed.

Page 35: Computational Genomics

Input files

CAP3 takes as input a file of sequence reads in FASTA format. CAP3 takes two optional files: a file of quality values in FASTA format and a file of forward-reverse constraints.

The file of quality values must be named "xyz.qual", and the file of forward-reverse constraints must be named "xyz.con", where "xyz" is the name of the sequence file. CAP3 uses the same format of a quality file as Phrap.

Page 36: Computational Genomics

Web interface
http://bio.ifom-firc.it/ASSEMBLY/assemble.html

Page 37: Computational Genomics

Output files

Page 38: Computational Genomics

Clustering – uses modified megablast program to cluster sequences together

Assembly – CAP3 is used to assemble sequences inside each cluster

TGICL – TIGR Gene Indices clustering tool

Page 39: Computational Genomics

TGICL – TIGR Gene Indices clustering tool

Sequences need to be cleaned before using TGICL (Lucy, UniVec, SeqClean)
mRNA sequences may be used as ‘seeds’ for clustering. Caution: partial mRNAs mislabeled as complete can prevent cluster extension beyond the seed.
Difficulty with highly expressed genes that have several thousand ESTs in a single cluster (assembly program may run out of memory)

Page 40: Computational Genomics

Phrap

Part of the Phred/Phrap/Consed program for assembling shotgun DNA sequence data.
  allows use of the entire read and not just the trimmed high quality part
  uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy
  constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus
  provides extensive assembly information to assist in troubleshooting assembly problems
  handles large datasets

It is strongly recommended that phrap be used in conjunction with the base calls and base quality values produced by the basecaller, phred; and with the sequence editor/assembly viewer, consed.

Page 41: Computational Genomics

Phrap output in Consed

Page 42: Computational Genomics

Apple EST assembly

183,732 ESTs

Phrap: 24,199 contigs; 9,765 singletons
CAP3: 19,791 contigs; 18,927 singletons
TGICL: 22,481 contigs; 28,279 singletons

Page 43: Computational Genomics

NCBI UniGene

Page 44: Computational Genomics
Page 45: Computational Genomics

Clustering at NCBI

An EST must have at least 100 bp after removing contaminants
The overlap between similar ESTs must be at least 70 bp
Similarity must be at least 96% over 70% of the overlapping region

>70% of overlap with at least 96% of similarity

>70bp

>100bp

>100bp

Page 46: Computational Genomics

www.tigr.org

TIGR Gene Indices

Page 47: Computational Genomics

What are TIGR gene indices?

Clustered and assembled ESTs and mRNA sequences

Each gene and each splice variant is represented by a single consensus sequence

Provides ORF annotation, genome mapping, expression profiles, domain annotation, unique oligomers

>40bp of overlap with at least 95% of similarity

<30bp <30bp

Page 48: Computational Genomics
Page 49: Computational Genomics

TGI annotations

Page 50: Computational Genomics

TGI sequence annotations

Page 51: Computational Genomics

BLAST search

Page 52: Computational Genomics

SMAPi – Sequence Management and Analysis Pipeline

Online access to data
Easy sequence management
Custom as well as automated annotation
Sharing among researchers
Integrated usage of various analysis tools (BLAST, Phred/Phrap.)

Supported by grant from Pennsylvania Greenhouse for Life Sciences

The Parasitic Plant Genome Project
  Comparative genomics of related parasitic genera
  Host-parasite lateral transfer

Page 53: Computational Genomics
Page 54: Computational Genomics
Page 55: Computational Genomics

NCBI ORF finder

Page 56: Computational Genomics
Page 57: Computational Genomics

Mapping contigs to the genome Splign http://www.ncbi.nlm.nih.gov/sutils/splign/

Page 58: Computational Genomics
Page 59: Computational Genomics
Page 60: Computational Genomics
Page 61: Computational Genomics

Genome sequencing

Access to entire genome
Regulatory elements
Only a small percentage of the genome codes for genes
Hard to identify less typical genes
High rate of false positives

TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCGTGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGGCGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAGCGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAAGGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTCTCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGTAGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCTGCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTAAAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACTAGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTCCTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCTAAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC

Page 62: Computational Genomics

Genome sequencing strategies

Page-by-page sequencing strategy

Sequence = determining the letters of each word on each piece of paper

Assembly = fitting the words back together in the correct order

All-at-once sequencing strategy

Find small pieces of paper

Decipher the words on each fragment

Look for overlaps to assemble

Page 63: Computational Genomics

Gene identification methods

Molecular techniques
  Very laborious
  Time consuming
  Expensive
  Low rate of false positives

Computational methods
  Fast
  Relatively low cost
  High rate of false positives
  Poor performance on less typical genes

Page 64: Computational Genomics

Before we start analysis…

We have to:
  Check sequence quality
  Remove contamination
  Assemble sequence reads into longer contigs
  Close gaps (in a perfect situation)

Page 65: Computational Genomics

Gene finding strategies

Genomic Sequence

Content-Based
Bulk properties of sequence:
  Open reading frame
  Codon usage
  Repeat periodicity
  Compositional complexity

Site-Based
Absolute properties of sequence:
  Consensus sequences
  Donor and acceptor splice sites
  Transcription factor binding sites
  Polyadenylation signals
  Start codon
  Stop codon

Comparative
Inferences based on sequence homology:
  Protein sequence with similarity to translated product of query
  Similarity to EST sequences
  Conserved regions between species (comparative genomics)

Model based (content-based and site-based)

Page 66: Computational Genomics

Gene finding methods classification

Similarity based predictors: make use of similarity to already known genes and proteins coded by these genes as well as expression data including sequences from cDNAs and data from hybridization experiments (tiling arrays for example)

Dual- and multi-genome predictors: rely on the fact that functional regions of a genome sequence are more conserved during evolution

Model based predictors: use a single genome sequence and exon/intron structure is predicted based on absolute and bulk properties of the sequence

Page 67: Computational Genomics

http://pipmaker.bx.psu.edu/pipmaker/

Comparative genomics - MultiPipmaker

Page 68: Computational Genomics

Model based methods

We take advantage of what we have already learned about gene structures and features of coding sequences. Based on this knowledge we can build a theoretical model, develop an algorithm to search for important features, train it on known data, and use it to search for coding sequences in anonymous genomic fragments.

Page 69: Computational Genomics

Pattern recognition and matching

The ability of a program to compare novel and known patterns and determine the degree of similarity forms the basis of sequence analysis, including gene identification. In similarity based methods we search the genome directly for a nucleotide or amino acid pattern observed in one or more already known genes; in comparative genomics we look for similar sequence patterns in two or more genomes; and in model based prediction we look for patterns in sequence composition and signals.

One of the major challenges of pattern matching is that, in most cases, we need to identify patterns that are ‘similar’ to a target pattern, but the concept of similarity isn’t well defined in either a programmatic or a biological sense. Also, only already known patterns can be used for searches, so genes with unusual patterns may not be discovered using these methods.

Page 70: Computational Genomics

Coding regions in Prokaryotes

INTERGENIC REGION - START - CODING SEQUENCE - STOP

Page 71: Computational Genomics

Genetic code

Page 72: Computational Genomics

Gene Model in Prokaryotes

[Figure: prokaryotic gene model with START, CODON, and STOP states]

Page 73: Computational Genomics

Open Reading Frames

[Figure: open reading frames shown in the six reading frames +1, +2, +3, -1, -2, -3]
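Scanning the six reading frames for ORFs can be sketched in Python as follows. This is a minimal illustration, not the method of any tool named in these slides: the function names, the ATG-only start rule, and the `min_len` cutoff are assumptions. Frames on the minus strand are scanned on the reverse complement, and their coordinates refer to that reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30):
    """Return (frame, start, end, orf) for every ATG...stop ORF of at least min_len bases.

    start and end are 0-based, end-exclusive positions on the scanned strand.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((f"{strand}{offset + 1}", start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    example = "CC" + "ATGAAACCCGGGTTTTAA" + "G"
    for orf in find_orfs(example, min_len=18):
        print(orf)
```

Raising `min_len` reduces the number of short, chance ORFs at the cost of missing short genes, which is the same trade-off as a minimum gene length option in a gene finder.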

Page 74: Computational Genomics

Properties of the sequence

We can check whether the sequence in a particular ORF has other features that tell us if it is a putative coding sequence or a false positive. We can look at the sequence content, compare it with known coding and non-coding sequences, and check which of the two the ORF sequence resembles more.

Page 75: Computational Genomics

Markov chain

A Markov chain describes at successive times the states of a system. At these times the system may have changed from the state it was in the moment before to another or stayed in the same state. The changes of state are called transitions. The Markov property means the system is memoryless, i.e. it does not "remember" the states it was in before, just "knows" its present state, and hence bases its "decision" to which future state it will transit purely on the present, not considering the past.

Page 76: Computational Genomics

Examples of Markov chain

A game of Monopoly, snakes and ladders or any other game whose moves are determined entirely by dice is a Markov chain. This is in contrast to card games such as poker or blackjack, where the cards represent a 'memory' of the past moves. To see the difference, consider the probability for a certain event in the game. In the above mentioned dice games, the only thing that matters is the current state of the board. The next state of the board depends on the current state, and the next roll of the dice. It doesn't depend on how things got to their current state. In a game such as poker or blackjack, a player can gain an advantage by remembering which hands have already been played (and hence which cards are no longer in the deck), so the next state (or hand) of the game is not independent of the past states.

Page 77: Computational Genomics

Markov chain – weather model

The matrix P represents the weather model in which a sunny day is 90% likely to be followed by another sunny day, and a rainy day is 50% likely to be followed by another rainy day.

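$$P = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}$$

Rows give today's weather (sunny, rainy) and columns give tomorrow's weather (sunny, rainy).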
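The weather model can be explored numerically with a few lines of Python. The sketch below assumes the row and column order (sunny, rainy) and uses hypothetical helper names `step` and `forecast`; it multiplies a probability distribution over today's weather by P to get the distribution for the following days.

```python
P = [[0.9, 0.1],   # sunny today -> (sunny, rainy) tomorrow
     [0.5, 0.5]]   # rainy today -> (sunny, rainy) tomorrow


def step(dist, matrix):
    """One transition: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def forecast(dist, matrix, days):
    """Distribution over states after the given number of days."""
    for _ in range(days):
        dist = step(dist, matrix)
    return dist


if __name__ == "__main__":
    today = [1.0, 0.0]  # it is sunny today
    for days in (1, 2, 7, 30):
        sunny, rainy = forecast(today, P, days)
        print(f"day {days}: sunny {sunny:.3f}, rainy {rainy:.3f}")
```

Because the chain is memoryless, the forecast depends only on the current distribution; after many days it settles at about 5/6 sunny and 1/6 rainy regardless of today's weather.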

Page 78: Computational Genomics

Markov chain – sequence properties

A T

GC

Page 79: Computational Genomics

How can we do this?

CGATCGATCGATCGGCCGATGCTGATGAGTCTCCTACGTCTCGTAGCTCGCGTCGTATATCGCGATCGCATCGCGTACGCGATCGCATCTCGCATCGCGTACTGCAT

Count how often each base follows G:

G    20
GA    7    7/20
GG    1    1/20
GT    5    5/20
GC    7    7/20

For non-coding sequence we assume that the probability of each transition is equal. The more ‘popular’ a transition is in coding sequence, the higher the probability that the sequence is coding.
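Estimating such transition probabilities from a training sequence amounts to counting dinucleotides. The Python sketch below shows one way to do it; the function name `transition_probs` and the toy input sequence are assumptions for illustration, and the second part simply converts the counts given on the slide (7, 1, 5, 7 out of 20) into probabilities.

```python
from collections import Counter


def transition_probs(seq: str, first: str) -> dict:
    """Estimate P(next base | current base == first) from dinucleotide counts."""
    seq = seq.upper()
    counts = Counter(seq[i + 1] for i in range(len(seq) - 1) if seq[i] == first)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in "ACGT"}


if __name__ == "__main__":
    # Toy sequence: G is followed once each by A, G, T and C.
    print(transition_probs("GAGGTGC", "G"))

    # Counts taken from the slide: 20 occurrences of G in total.
    slide_counts = {"GA": 7, "GG": 1, "GT": 5, "GC": 7}
    total = sum(slide_counts.values())
    print({pair: n / total for pair, n in slide_counts.items()})
```

In practice the counts would be collected separately from known coding and known non-coding sequences, giving one transition table per class.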

Page 80: Computational Genomics

Markov Chains

GCGCTAGCGCCGATCATCTACTCG

[Figure: the same sequence repeated with brackets marking the bases used as context in first order, second order, and fifth order Markov chains]

Page 81: Computational Genomics

How far can we go?

The order of our model influences the specificity and sensitivity of our program. Chains that are too short may not be specific enough, and the program may return a lot of false positives. Chains that are too long may be too specific, so the program will not be sensitive enough and will return a lot of false negatives.

Page 82: Computational Genomics

Probability matrix

First order Markov model: matrix of 16 probabilities

p(A/A), p(A/T), p(A/C), p(A/G)
p(T/A), p(T/T), p(T/C), p(T/G)
p(C/A), p(C/T), p(C/C), p(C/G)
p(G/A), p(G/T), p(G/C), p(G/G)

First order: $4^{1+1} = 4^{2} = 16$
Second order: $4^{2+1} = 4^{3} = 64$
Third order: $4^{3+1} = 4^{4} = 256$
Order k: $4^{k+1}$

Page 83: Computational Genomics

Probabilities in different reading frames

GCGCTAGCGCCGATCATCTACTCG

GCG CTA GCG CCG ATC ATC TAC TCG

G CGC TAG CGC CGA TCA TCT ACT CG

GC GCT AGC GCC GAT CAT CTA CTC G

Does G more often follow GC when it is in the third position in a codon, or when it is in the second or first position?

Page 84: Computational Genomics

Number of probabilities

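One matrix for all positions: $4^{1+1} = 4^{2} = 16$

A separate matrix for each of the three codon positions: $3 \times 4^{1+1} = 3 \times 16 = 48$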

Page 85: Computational Genomics

Estimating coding potential

$$LP(S) = \log \frac{P^{i}(S)}{P^{0}(S)}$$

To estimate whether the sequence is coding, we calculate the probability of the sequence under the coding model, $P^{i}(S)$, and under the non-coding model, $P^{0}(S)$. Next we calculate the logarithm of the ratio of these two probabilities.

If the calculated value is > 0, the sequence is more likely to be coding than non-coding. If the value is < 0, the sequence is more likely to be non-coding.

Page 86: Computational Genomics

Markov Models - probabilities

S = AGGACG

$$P^{1}(S) = f(A,1)\,F^{1}(G,A)\,F^{2}(G,G)\,F^{3}(A,G)\,F^{1}(C,A)\,F^{2}(G,C)$$

$$P^{1}(S) = 0.27 \times 0.19 \times 0.27 \times 0.24 \times 0.21 \times 0.12 = 0.00008377$$

$$P^{0}(S) = 0.25 \times 0.25 \times 0.25 \times 0.25 \times 0.25 \times 0.25 = 0.0002441$$

$$LP(S) = \log \frac{P^{1}(S)}{P^{0}(S)} = \log \frac{0.00008377}{0.0002441} = -0.4644$$

Page 87: Computational Genomics

Calculating LP

$$LP(S) = \log \frac{P^{i}(S)}{P^{0}(S)}$$

$$LP(S) = \log \frac{0.27}{0.25} + \log \frac{0.19}{0.25} + \log \frac{0.27}{0.25} + \log \frac{0.24}{0.25} + \log \frac{0.21}{0.25} + \log \frac{0.12}{0.25}$$

LP(S) = log 1.08 + log 0.76 + log 1.08 + log 0.96 + log 0.84 + log 0.48

LP(S) = 0.0334 + (-0.1191) + 0.0334 + (-0.0177) + (-0.0757) + (-0.3187)

LP(S) = -0.4644
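The calculation can be checked with a short Python sketch. The six coding-model probabilities are the ones used in the worked example for S = AGGACG; the function name `log_likelihood_ratio` is an assumption for illustration, and the logarithm is base 10, as in the slide's numbers.

```python
import math


def log_likelihood_ratio(coding_probs, noncoding_prob=0.25):
    """Sum of per-position log10 ratios between coding and non-coding models."""
    return sum(math.log10(p / noncoding_prob) for p in coding_probs)


if __name__ == "__main__":
    coding = [0.27, 0.19, 0.27, 0.24, 0.21, 0.12]  # values from the worked example
    lp = log_likelihood_ratio(coding)
    print(f"LP(S) = {lp:.4f}")
    print("coding" if lp > 0 else "non-coding", "model is more likely")
```

Summing unrounded terms gives about -0.4645; the slide's -0.4644 comes from adding terms that were first rounded to four decimals. Working with sums of logarithms rather than products of probabilities also avoids numerical underflow on long sequences.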

Page 88: Computational Genomics

GLIMMER

Gene finding program for prokaryotes
Salzberg et al., 1998
For prediction uses:
  Start
  Stop
  Sequence composition
  Interpolated Markov Models

Page 89: Computational Genomics

The GLIMMER system

Part 1: the program is trained for a given data set (species)
Part 2: the program identifies putative genes in the genomic sequence
  Identify all ORFs longer than a threshold
  Score each ORF in each reading frame and select those that get the highest scores in the correct reading frame
  Score overlapping genes in each frame separately to see which frame scores the highest

Page 90: Computational Genomics

Running the program

First run build-imm on a set of sequences (long ORFs from the same or closely related species) to make the Markov models:

  build-imm train.seq

Then run GLIMMER to find genes in your sequence:

  glimmer your.seq train.seq <options>

Page 91: Computational Genomics

GLIMMER options

-g       set minimum gene length
-o       set minimum overlap
-p       set minimum overlap percentage
+r/-r    independent probability score ON/OFF
-t       set threshold score for calling as gene

Page 92: Computational Genomics

GLIMMER output

Minimum gene length = 180
Minimum overlap length = 30
Minimum overlap percent = 10.0%
Threshold score = 90
Use independent scores = True
Use first start codon = True

ID# Fr OrfStart GeneStart    End OrfLen GeneLen GeneScore  F1  F2  F3  R1  R2  R3 IndepScore
    F2      302       305    616    315     312         0   _   0   _  99   _   _          0
  1 R1      660       633    220    441     414        99   _   _   _  99   _   _          0
    F2      620       650    901    282     252         0   _   0   _   _   _  99          0
  2 R3     1114      1105    638    477     468        99   _   _   _   _   _  99          0
    F3     1119      1140   1466    348     327         0   _   _   0   _   _  99          0
  3 R3     2026      1999   1118    909     882        99   _   _   _   _   _  99          0
  4 F3     1815      1830   2054    240     225        99   _   _  99   _   _   _          0
    *** Overlaps #3 by 170   Overlap Region Scores: _ _ 0 _ _ 99 0
    Reject #4
  5 R2     2600      2597   1935    666     663        99   _   _   _   _  99   _          0
    *** Overlaps #4 by 120   Overlap Region Scores: _ _ 0 _ 99 _ 0
  6 F1     2710      2719   3399    690     681        99  99   _   _   _   _   _          0
    R3     4153      4153   3962    192     192         0  99   _   _   _   _   0          0
  7 F1     3403      3403   4230    828     828        99  99   _   _   _   _   _          0
    R2     4700      4679   4455    246     225         0   _   _   _   _   0   _         99

    R2    68906     68897  68670    237     228        13   _   _   _   _  13   _         86
  8 R1   101574    101544 101296    279     249        96   _   _   _  96   _   _          3
    R3   193228    193204 193022    207     183        56   _   _   _   _   _  56         43

Page 93: Computational Genomics

List of putative genes

Putative Genes:

  1      633      220
  2     1105      638
  3     1999     1118
  5     2597     1935
  6     2719     3399
  7     3403     4230
...
 39    38472    38741  [Shorter 40 80 74]
 40    38662    39450  [Bad Overlap 39 80 25]
...
482   464206   464424  [Shadowed by 483]
...
636   616213   615965  [Delay by 33 637 50 0]

Page 94: Computational Genomics

Output description

39 38472 38741 [Shorter 40 80 74]

40 38662 39450 [Bad Overlap 39 80 25]

[Bad Overlap a b c] means that gene number a overlapped this one and was shorter but scored higher on the overlap region. b is the length of the overlap region and c is the score of *this* gene on the overlap region. There should be a [Shorter ...] notation with gene a giving its score.

[Shorter a b c] means that gene number a overlapped this one and was longer but scored lower on the overlap region. b is the length of the overlap region and c is the score of *this* gene on the overlap region. There should be a [Bad overlap ...] notation with gene a giving its score.

Page 95: Computational Genomics

Output description - 2

482 464206 464424 [Shadowed by 483]

... 636 616213 615965 [Delay by 33 637 50 0]

[Shadowed by a] means that this gene was completely contained as part of gene a's region, but in another frame.

[Delay by a b c d] means that this gene was tentatively rejected because of an overlap with gene b , but if the start codon is postponed by a positions, then this would be a valid gene. The start position reported for this gene includes the delay. c is the length of the overlap region that caused the rejection and d is the score in this gene's frame on that overlap region.

Page 96: Computational Genomics

Prokaryotic vs. Eukaryotic Genes

Prokaryotes
  small genomes
  high gene density
  no introns (or splicing)
  no RNA processing
  similar promoters
  terminators important
  overlapping genes

Eukaryotes
  large genomes
  low gene density
  introns (splicing)
  RNA processing
  heterogeneous promoters
  terminators not important
  polyadenylation

Page 97: Computational Genomics

Coding regions in Prokaryotes

INTERGENIC REGION - START - CODING SEQUENCE - STOP

Page 98: Computational Genomics

From DNA to protein

Page 99: Computational Genomics

Gene structure

Page 100: Computational Genomics

Pseudogenes and repetitive elements

Page 101: Computational Genomics

Complicated gene structures

Page 102: Computational Genomics

Overlapping genes

Page 103: Computational Genomics

Eukaryotic gene structure

DNA (5’ to 3’): Exon1, Intron1, Exon2, Intron2, Exon3, Intron3, Exon4

Promoter: TATA
Translation initiation: ATG
Splice site: GGTGAG
Branchpoint: CTGAC
Pyrimidine tract
Splice site: CAG
Stop codon: TAG/TGA/TAA
polyA signal

Page 104: Computational Genomics

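Markov Models

[Figure: gene model with states for the intergenic region, promoter, 5’ UTR, single exon gene, Einit, E0, E1, E2, I0, I1, I2, Eterm, 3’ UTR, and poly A signal; above it, an example gene with exons Ex1 to Ex5 and introns In1 to In4, with GT and AG marking intron boundaries]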

Page 105: Computational Genomics

Searching for coding sequences using Markov chain

In this case we do not want to check whether a given sequence fragment is coding; rather, we want to identify coding fragments in a long sequence. In most cases this is done by calculating statistics in overlapping windows.

The coding potentials of the windows are used to create profiles. This example shows a profile for a sequence analyzed using 120 bp windows with a 10 bp distance between them.

AGTACGATATTAGCGGCAATCGTATGACTACGTCTTGCTACGTCTTCTCTCGTCTGCTCTAG
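A window-based profile can be sketched in Python as below. The 120 bp window and 10 bp step come from the example above; everything else is an assumption for illustration, in particular the hypothetical `TOY_CODING_FREQ` table and the simple per-base `toy_score`, which stand in for a trained Markov model.

```python
import math

# Hypothetical base frequencies in coding sequence (illustration only).
TOY_CODING_FREQ = {"A": 0.24, "C": 0.26, "G": 0.29, "T": 0.21}


def toy_score(fragment: str) -> float:
    """Log10-likelihood ratio of a fragment against a uniform non-coding model."""
    return sum(math.log10(TOY_CODING_FREQ[base] / 0.25) for base in fragment)


def window_profile(seq: str, score=toy_score, window: int = 120, step: int = 10):
    """Return (window start, score) pairs for overlapping windows along seq."""
    seq = seq.upper()
    return [(start, score(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    sequence = "ACGT" * 50  # 200 bp toy sequence
    for start, value in window_profile(sequence):
        print(f"{start:4d}  {value:+.4f}")
```

Plotting the score against the window start gives the profile; regions where the score stays above zero over several consecutive windows are candidate coding fragments.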

Page 106: Computational Genomics

Codon usage

Each entry: amino acid, codon, frequency per 1000 codons, fraction among synonymous codons

Arg AGG 12.09 0.22   Arg AGA 11.73 0.21   Ser AGT 10.18 0.14   Ser AGC 18.54 0.25
Lys AAG 33.79 0.60   Lys AAA 22.32 0.40   Asn AAT 16.43 0.44   Asn AAC 21.30 0.56
Met ATG 21.86 1.00   Ile ATA  6.05 0.14   Ile ATT 15.03 0.35   Ile ATC 22.47 0.52
Thr ACG  6.80 0.12   Thr ACA 15.04 0.27   Thr ACT 13.24 0.23   Thr ACC 21.52 0.38

Trp TGG 14.74 1.00   End TGA  2.64 0.61   Cys TGT  9.99 0.42   Cys TGC 13.86 0.58
End TAG  0.73 0.17   End TAA  0.95 0.22   Tyr TAT 11.80 0.42   Tyr TAC 16.48 0.58
Leu TTG 11.43 0.12   Leu TTA  5.55 0.06   Phe TTT 15.36 0.43   Phe TTC 20.72 0.57
Ser TCG  4.38 0.06   Ser TCA 10.96 0.15   Ser TCT 13.51 0.18   Ser TCC 17.37 0.23

Gly GGG 17.08 0.23   Gly GGA 19.31 0.26   Gly GGT 13.66 0.18   Gly GGC 24.94 0.33
Glu GAG 38.82 0.59   Glu GAA 27.51 0.41   Asp GAT 21.45 0.44   Asp GAC 27.06 0.56
Val GTG 28.60 0.48   Val GTA  6.09 0.10   Val GTT 10.30 0.17   Val GTC 15.01 0.25
Ala GCG  7.27 0.10   Ala GCA 15.50 0.22   Ala GCT 20.23 0.28   Ala GCC 28.43 0.40

Arg CGG 10.40 0.19   Arg CGA  5.63 0.10   Arg CGT  5.16 0.09   Arg CGC 10.82 0.19
Gln CAG 32.95 0.73   Gln CAA 11.94 0.27   His CAT  9.56 0.41   His CAC 14.00 0.59
Leu CTG 39.93 0.43   Leu CTA  6.42 0.07   Leu CTT 11.24 0.12   Leu CTC 19.14 0.20
Pro CCG  7.02 0.11   Pro CCA 17.11 0.27   Pro CCT 18.03 0.29   Pro CCC 20.51 0.33
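The last number of each entry can be recovered from the third one: it is the codon's share among all codons for the same amino acid. A small sketch using the six arginine codons from the table (the function name is an assumption made for this example):

```python
# Frequencies per 1000 codons for the six Arg codons, copied from the table.
arg_per_thousand = {
    "AGG": 12.09, "AGA": 11.73,
    "CGG": 10.40, "CGA": 5.63, "CGT": 5.16, "CGC": 10.82,
}


def synonymous_fractions(per_thousand):
    """Share of each codon among the codons encoding the same amino acid."""
    total = sum(per_thousand.values())
    return {codon: round(freq / total, 2)
            for codon, freq in per_thousand.items()}


fractions = synonymous_fractions(arg_per_thousand)
```

The resulting fractions match the values listed for Arg in the table (0.22, 0.21, 0.19, 0.10, 0.09, 0.19).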

Page 107: Computational Genomics

Codon usage

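A DNA sequence S = AGGACGGGATCA can be divided into non-overlapping codons C = C1C2...Cm in three reading frames:

Frame 1: AGG ACG GGA TCA
Frame 2: A GGA CGG GAT CA
Frame 3: AG GAC GGG ATC A

With the frame as superscript and the codon position as subscript: $C^1_1 = \mathrm{AGG}$, $C^2_1 = \mathrm{GGA}$, $C^3_2 = \mathrm{GGG}$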
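Splitting a sequence into codons in each frame can be sketched as follows. The function name is hypothetical, and incomplete codons at either end are simply dropped.

```python
def codons_in_frame(seq, frame):
    """Complete codons of seq in reading frame 1, 2 or 3."""
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


S = "AGGACGGGATCA"
frames = {f: codons_in_frame(S, f) for f in (1, 2, 3)}
```

Here `frames[3][1]` is the second codon in frame 3, which is GGG.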

Page 108: Computational Genomics

Probability that sequence is coding

The probability that a sequence is coding equals the probability that its sequence of codons is coding. Assuming independence between adjacent codons, the probability that the sequence is coding equals the product of the codon frequencies.

$P(C) = F(C_1) F(C_2) \cdots F(C_m)$

Frame 1: AGG ACG GGA TCA
Frame 2: A GGA CGG GAT CA
Frame 3: AG GAC GGG ATC A

For the first two codons of frame 1, with F(AGG) = 0.022 and F(ACG) = 0.038:

$P(C) = F(\mathrm{AGG}) F(\mathrm{ACG}) = 0.022 \times 0.038 = 0.000836$
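The product can be reproduced in code. A minimal sketch, assuming the two illustrative frequencies used on this slide; the dictionary and function names are made up for the example.

```python
import math

# Illustrative codon frequencies taken from this slide.
F = {"AGG": 0.022, "ACG": 0.038}


def coding_probability(codons, freq):
    """Product of codon frequencies, assuming independent adjacent codons."""
    return math.prod(freq[c] for c in codons)


p_coding = coding_probability(["AGG", "ACG"], F)  # 0.022 * 0.038
```

For long sequences the product becomes very small, so in practice sums of logarithms are usually preferred.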

Page 109: Computational Genomics

Probability that sequence is non-coding

If the sequence is non-coding, codon usage is random and every codon is equally probable. With 64 possible codons, the frequency of each codon is 1/64 ≈ 0.0156.

Therefore the probability that the sequence is non-coding will be:

$P_0(C) = F_0(\mathrm{AGG}) F_0(\mathrm{ACG}) = 0.0156 \times 0.0156 \approx 0.000244$

Page 110: Computational Genomics

Log-likelihood ratio

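$LP(S) = \log \frac{P_i(S)}{P_0(S)}$

where $P_i(S)$ is the probability of the sequence under the coding model in frame $i$ and $P_0(S)$ is its probability under the non-coding model. A positive value means the coding model explains the sequence better.

For the two codons AGG and ACG, with base-10 logarithms:

$LP(S) = \log \frac{0.022}{0.0156} + \log \frac{0.038}{0.0156} = \log 1.4102 + \log 2.4358 = 0.1493 + 0.3866 \approx 0.54 > 0$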
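The same ratio can be computed as a sum of per-codon log ratios. A minimal sketch, assuming base-10 logarithms, the coding frequencies 0.022 and 0.038 from the earlier slide, and the rounded non-coding frequency 0.0156; all names are hypothetical.

```python
import math

CODING_FREQ = {"AGG": 0.022, "ACG": 0.038}  # illustrative values from the slides
NONCODING_FREQ = 0.0156                     # 1/64, rounded as on the slide


def log_likelihood_ratio(codons, coding_freq, noncoding_freq):
    """Sum over codons of log10(coding frequency / non-coding frequency)."""
    return sum(math.log10(coding_freq[c] / noncoding_freq) for c in codons)


lp = log_likelihood_ratio(["AGG", "ACG"], CODING_FREQ, NONCODING_FREQ)
is_coding_like = lp > 0
```

Summing logarithms avoids the numerical underflow that the raw product of many small frequencies would cause.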

Page 111: Computational Genomics
Page 112: Computational Genomics

Rule based method: minimal length ORF, splicing sites, codon usage -> putative exon

Neural network: minimal length ORF, splicing sites, codon usage -> putative exon

Page 113: Computational Genomics

GRAIL

Gene Recognition and Analysis Internet Link
Uberbacher and Mural, 1991
First gene prediction program

GRAIL1
  Neural network recognizing coding potential within a fixed-size window (100 bp)
  Evaluates coding potential without looking for additional features (e.g. splice junctions, start and stop codons)

GRAIL2
  Variable window size
  Incorporates genomic context information (splice sites, start and stop codons, polyadenylation signals)

GRAILEXP
  http://compbio.ornl.gov/grailexp/

Page 114: Computational Genomics

GRAILEXP

Page 115: Computational Genomics

MZEF

M. Zhang, 1997
Predicts exons only, does not build gene structure
Uses "quadratic discriminant analysis"
Variable measures:
  Exon length
  Intron-exon transition
  Branch site scores

http://rulai.cshl.org/tools/genefinder/

Page 116: Computational Genomics

MZEF

Page 117: Computational Genomics

FGENES

Solovyev et al., 1994
Predicts internal exons
Linear discriminant analysis
  Donor and acceptor splice sites
  Putative coding regions
  Intronic regions both 5’ and 3’ to the putative exon
Passes results to a dynamic programming algorithm to come up with a coherent gene model

http://www.softberry.com/berry.phtml

Page 118: Computational Genomics

FGENESH

Page 119: Computational Genomics

FGENESH

Page 120: Computational Genomics

GENSCAN

Burge and Karlin
Searches for general and specific compositional properties of distinct functional units in eukaryotic genes
General fifth-order Markov model of coding regions
Analyzes both DNA strands
Sequences may contain multiple and/or partial genes

http://genes.mit.edu/GENSCAN
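In a fifth-order Markov model, the probability of each base depends on the five bases before it. The sketch below estimates such conditional probabilities from training sequences by counting. It is a generic illustration, not GENSCAN's implementation: the function name is made up, and there is no smoothing or codon-position periodicity.

```python
from collections import defaultdict


def train_markov(seqs, order=5):
    """Estimate P(next base | previous `order` bases) from raw counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    # Normalise the counts for each context into probabilities.
    return {
        ctx: {base: n / sum(nxt.values()) for base, n in nxt.items()}
        for ctx, nxt in counts.items()
    }


# Usage on a toy sequence (assumption: real training uses known coding regions).
model = train_markov(["ACGTACGTACGT"], order=5)
```

In the toy model, the context ACGTA is always followed by C.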

Page 121: Computational Genomics

GENSCAN options

Organism: vertebrate, Maize, Arabidopsis

Output: predicted peptides only, or predicted CDS and peptides

Suboptimal exon cutoff: 1.00, 0.50, 0.25, ..., 0.01

Page 122: Computational Genomics

GENSCAN output

Prom = promoter
Init = Initial exon
Intr = Internal exon
Term = Terminal exon
Sngl = Single exon gene
PlyA = poly-A signal

P-range     Accuracy
0.00-0.50   29.8%
0.50-0.75   54.1%
0.75-0.90   74.8%
0.95-0.99   92.4%
0.99-1.00   97.7%

Page 123: Computational Genomics

Graphical output

Page 124: Computational Genomics

Are predictions the same?

GRAILEXP

MZEF

FGENESH

GenScan

Page 125: Computational Genomics

Evaluation statistic

TP - true positive

FP - false positive

FN - false negative

TN - true negative

Sensitivity: fraction of actual coding regions that are correctly predicted as coding, ranging from 0 to 1

Sn = TP/(TP+FN)

Specificity: fraction of the prediction that is actually correct, ranging from 0 to 1

Sp = TP/(TP+FP)

Correlation: combined measure of sensitivity and specificity, ranging from -1 (always wrong) to +1 (always right)

$CC = \frac{TP \times TN - FP \times FN}{\sqrt{(PP)(PN)(AP)(AN)}}$

where PP = TP + FP (predicted positives), PN = TN + FN (predicted negatives), AP = TP + FN (actual positives) and AN = TN + FP (actual negatives)
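All three statistics follow directly from the four counts. A minimal sketch, assuming the standard form of the correlation coefficient (a minus sign in the numerator and a square root in the denominator); the function name and the counts in the usage line are made up.

```python
import math


def evaluate(tp, fp, fn, tn):
    """Return (sensitivity, specificity, correlation coefficient)."""
    sn = tp / (tp + fn)          # fraction of actual coding that is predicted
    sp = tp / (tp + fp)          # fraction of the prediction that is correct
    pp, pn = tp + fp, tn + fn    # predicted positives / negatives
    ap, an = tp + fn, tn + fp    # actual positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    return sn, sp, cc


# Usage with hypothetical nucleotide-level counts.
sn, sp, cc = evaluate(tp=80, fp=20, fn=10, tn=90)
```

The sketch does not guard against division by zero, which occurs when a program predicts nothing or everything as coding.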

Page 126: Computational Genomics

Experimental validation of predicted genes

20 not annotated human BAC clones
  3 finished
  17 unfinished

Genes that had at least two exons, each predicted by at least two programs
  the overlap of the predicted exons did not have to be perfect
  similarity to ESTs or known genes was used as supporting evidence but was not required

40 genes (number of exons 2-11)

Six single exons predicted by three or four programs

Three two-exon genes predicted by one program only but strongly supported by similarities to EST sequences

12 genes were eliminated from further studies as they contained repetitive elements and were most likely false positives

Total: 37 putative transcripts

Page 127: Computational Genomics

Prediction programs performance

Program   Predicted exons   Specificity   Sensitivity
MZEF      34                0.51          0.56
GRAIL     11                0.48          0.19
GENSCAN   52                0.46          0.91
FGENES    45                0.37          0.75

37 genes were tested; 16 of them (43%) were confirmed. At the exon level there were 159 exons, and 58 (36%) were found to be real.

Page 128: Computational Genomics

Gene structure and alternative splicing

Page 129: Computational Genomics

Repetitive elements

12 genes containing large fragments of repetitive elements were eliminated from experimental validation. However, some proteins are partially coded by Alu or MER elements (Makalowski 1994, 2000).

Gene gm121 contains a MER at the 5' end of its cDNA, and it is an experimentally confirmed gene.

Gene gm124 was eliminated from our experimental validation based on the presence of a repetitive element. Further analysis showed that this gene is part of transcript FLJ23129, GenBank accession AK026782.

Page 130: Computational Genomics

Repetitive elements

50% of the mammalian genome consists of repeats:
  DNA transposons
  retrotransposons
  LINEs
  SINEs
  tandem repeats

Masking before a similarity search helps avoid similarities caused by the presence of repetitive elements rather than by sequence homology

Predicted genes with repetitive elements are less likely to be real, although sometimes repeats are true parts of a coding sequence

Page 131: Computational Genomics

RepeatMasker

Searches for Alu, MIR, LINE, LTR and other repeats by comparison to sequences in the RepBase library

RepBase is a database of repetitive DNA sequence elements found in a variety of eukaryotic organisms, including primates, rodents, cow, dog, chicken, fugu, Drosophila, Arabidopsis and rice

Accepts local databases of repetitive elements

Repetitive elements are masked by replacing nucleotides with a string of the letter N

Masking can be limited to certain types of repeats only
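The masking step itself is simple once the repeat coordinates are known. The sketch below is not RepeatMasker's code; the function name and the interval format (0-based, end-exclusive) are assumptions made for the example.

```python
def mask_repeats(seq, intervals):
    """Replace every base inside the given (start, end) intervals with N."""
    masked = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = "N"
    return "".join(masked)


# Usage: mask a hypothetical repeat covering positions 2-4.
masked = mask_repeats("ACGTACGTAC", [(2, 5)])
```

The masked sequence keeps its original length, so coordinates of any later similarity hits still refer to the unmasked sequence.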

Page 132: Computational Genomics

RepeatMasker

http://www.repeatmasker.org/

Page 133: Computational Genomics

RepeatMasker outputs

Page 134: Computational Genomics

Functional annotation - Gene Ontology

Gene Ontology: a controlled vocabulary to describe gene products - proteins and RNA - in any organism.

A gene is described in three categories:
  Cellular component: where a gene product acts
  Molecular function: activities or “jobs” of a gene product
  Biological process: a commonly recognized series of events

Page 135: Computational Genomics
Page 136: Computational Genomics
Page 137: Computational Genomics

Annotation source

ISS  Inferred from Sequence/Structural Similarity
IDA  Inferred from Direct Assay
IPI  Inferred from Physical Interaction
TAS  Traceable Author Statement
NAS  Non-traceable Author Statement
IMP  Inferred from Mutant Phenotype
IGI  Inferred from Genetic Interaction
IEP  Inferred from Expression Pattern
IC   Inferred by Curator
ND   No Data available
IEA  Inferred from Electronic Annotation

Page 138: Computational Genomics

Gene annotation

Page 139: Computational Genomics

Annotating genes

Model organism: look at curated databases

Page 140: Computational Genomics

Annotating novel genes

Similarity search against a curated and well-annotated database: functional annotations are deduced from a close and significant match

Functional domains identification: mapping domains and terms to Gene Ontology

Page 141: Computational Genomics

Annotating novel genes

Similarity search against a curated and well-annotated database: functional annotations are deduced from a close and significant match

Identification of functional domains: mapping domain to GO term

Page 142: Computational Genomics

Regulatory sequences

Difficult to identify using computer programs

Most regulatory elements are still not identified

They are usually very short: 6-10 base pairs

Instances of the same element may differ at one or more positions

Page 143: Computational Genomics

Finding regulatory elements

Phylogenetic footprinting – comparative genomics used to identify conserved non-coding sequences (MultiPipMaker)

Phylogenetic shadowing – conceptually similar to phylogenetic footprinting, with the difference that closely related species are compared (eShadow)

Gibbs sampling – iterative random resampling of alignments of promoter regions to identify blocks of conserved residues (Gibbs Motif Sampler)

Hidden Markov Models – searching for known regulatory elements using probability matrices (Cister)

Page 144: Computational Genomics

Phylogenetic shadowing

Page 145: Computational Genomics

Gibbs sampling

ACGGTACGTTGG

GGTACGTAGGAC

TGGCTACCTTGG

CGGTACGTAGGT

******* **
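The idea can be sketched with a very simplified sampler: hold one sequence out, build column-wise base counts from the motif positions in the others, and resample the held-out position in proportion to how well each candidate fits. This is a teaching sketch and not the Gibbs Motif Sampler program; it uses no background model, and the function name, motif width, iteration count and seed are all assumptions.

```python
import random


def gibbs_motif_search(seqs, width, iterations=200, seed=0):
    """Return one motif start position per sequence (simplified sampler)."""
    rng = random.Random(seed)
    # Random initial motif position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Column-wise base counts (pseudocount 1) from the other sequences.
            counts = [{b: 1.0 for b in "ACGT"} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != held_out:
                    site = other[starts[j]:starts[j] + width]
                    for col, base in enumerate(site):
                        counts[col][base] += 1.0
            total = len(seqs) - 1 + 4
            # Weight of each candidate start: product of column probabilities.
            weights = []
            for pos in range(len(seq) - width + 1):
                w = 1.0
                for col, base in enumerate(seq[pos:pos + width]):
                    w *= counts[col][base] / total
                weights.append(w)
            # Sample the new position in proportion to its weight.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["ACGGTACGTTGG", "GGTACGTAGGAC", "TGGCTACCTTGG", "CGGTACGTAGGT"]
starts = gibbs_motif_search(seqs, width=7)
```

Because the procedure is stochastic, different seeds can give different positions; real tools run many iterations and restarts and score against a background model.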

Page 146: Computational Genomics

Building model from existing alignment

An HMM for a DNA motif alignment. The transitions are shown as arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Figure labels: transition probabilities, output probabilities, insertion
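Output and transition probabilities can be estimated from an alignment by counting. The sketch below uses the five-row alignment from this slide and assumes that alignment columns 4-6 form the insertion region; the function names are made up, and no pseudocounts are added.

```python
from collections import Counter

alignment = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
insert_cols = range(3, 6)  # assumption: 0-based columns 3-5 are the insertion


def emission_probabilities(alignment, col):
    """Base frequencies in one alignment column, ignoring gaps."""
    counts = Counter(row[col] for row in alignment if row[col] != "-")
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def insertion_transitions(alignment, insert_cols):
    """Probability of entering the insert state and of staying in it."""
    inserted = [sum(row[c] != "-" for c in insert_cols) for row in alignment]
    entering = sum(1 for n in inserted if n > 0)
    residues = sum(inserted)
    p_enter = entering / len(alignment)
    p_stay = (residues - entering) / residues
    return p_enter, p_stay


first_state = emission_probabilities(alignment, 0)
p_enter, p_stay = insertion_transitions(alignment, insert_cols)
```

Three of the five sequences have inserted bases, so the transition into the insertion state gets probability 0.6; of the five inserted bases, two are followed by another inserted base, giving a self-transition of 0.4.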

Page 147: Computational Genomics

Cister

Detects clusters of cis-elements using a Hidden Markov Model

For each element, uses a separate matrix with the frequencies of each nucleotide at each position; the user can input a matrix for elements not included in the basic options

User can specify:
  distance between neighboring cis-elements within a cluster
  number of cis-elements in the cluster
  distance between clusters
  half-width of the sliding window

Page 148: Computational Genomics

Frequency matrix

Sequences of experimentally identified elements are aligned and frequencies in each position are calculated

TGTGGT
TGCGGT
TGTGGT
AGTGGT
TGTGGC

NA  AML-1a
XX
DE  runt-factor AML-1
XX
BF  T02256; AML1a; Species: human, Homo sapiens.
XX
P0   A   C   G   T
01   5   1   2  49   T
02   2   2  52   1   G
03   4  14   1  38   T
04   0   0  57   0   G
05   1   0  55   1   G
06   1   4   0  52   T
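Counting bases per position is all that is needed to build such a matrix. A minimal sketch using the five example sites shown on this slide (the matrix above is built from more sites, so its counts are larger); the function names are hypothetical.

```python
sites = ["TGTGGT", "TGCGGT", "TGTGGT", "AGTGGT", "TGTGGC"]


def count_matrix(sites):
    """One dict of base counts per alignment position."""
    width = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) for b in "ACGT"}
            for i in range(width)]


def consensus(matrix):
    """Most frequent base at each position."""
    return "".join(max(col, key=col.get) for col in matrix)


matrix = count_matrix(sites)
motif = consensus(matrix)
```

Dividing each count by the number of sites turns the counts into the per-position frequencies used for scoring.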

Page 149: Computational Genomics

http://zlab.bu.edu/~mfrith/cister.shtml

Cister