Gene Finding. Biological Background The Central Dogma Transcription RNA Translation Protein DNA.

Date posted: 21-Dec-2015

Gene Finding

Biological Background

The Central Dogma

DNA → (transcription) → RNA → (translation) → Protein

Background

*Essential Cell Biology; p.268

Non-coding regions: gene regulation
Vicinity of the TSS: direct interactions with the Pol II complex
Larger vicinity: indirect interactions (chromatin remodelling)

The Genetic Code

[Figure: codon table indexed by first letter, second letter and third letter]

tRNA – Responsible for Translation

Adapted from Genetic Analysis V, p.388

Frame Shifts

Code triplets (“codons”) do not overlap.
There are 3 x 2 = 6 possible ways of reading a sequence, depending on the strand and on the relative position where reading starts.
This is not just our concern when looking for genes; it is also the cell’s concern in terms of mutations:
    Original: THE FAT CAT ATE THE BIG RAT
    Delete C: THE FAT ATA TET HEB IGR AT
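The six reading frames and the effect of a single deletion can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper names (`codons`, `six_frames`, `reverse_complement`) are made up for this example and do not come from the lecture.

```python
# Translation table used to complement a DNA strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def codons(seq, offset=0):
    """Split seq into non-overlapping triplets starting at offset.
    An incomplete triplet at the end is dropped."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]

def six_frames(seq):
    """Map (strand, offset) -> list of codons for all 3 x 2 reading frames."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = codons(s, offset)
    return frames

# The frame-shift example from the slide, using letters instead of bases.
original = "THEFATCATATETHEBIGRAT"
mutated = original[:6] + original[7:]  # delete the C of CAT

print(" ".join(codons(original)))
print(" ".join(codons(mutated)))
```

Every triplet downstream of the deletion changes, which is why a single-base insertion or deletion is usually far more damaging than a substitution.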

Prokaryotes Gene Finding

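No nucleus
Most DNA is coding (e.g. 70% in H. influenzae)
Each gene is one continuous DNA sequence (no introns)
RNA polymerases (the eukaryotic classes, for comparison; prokaryotes use a single RNA polymerase): Pol I for rRNA, Pol II for mRNA, Pol III for tRNA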

Detecting ORFs

Simple idea:
If no gene is encoded, STOP codons are expected at a frequency of 3/64, i.e. about one in every 21 codons.
ORF (open reading frame): a sequence of codons with no STOP codon.

Simple algorithm:
1. Scan until you find a stop codon, in all reading frames.
2. Scan back to find a start codon.
3. If it is long enough, report this ORF as a putative gene.

Cons:
Can’t detect short genes
High false-positive rate (E. coli has 6500 ORFs but only 1100 genes)
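The three steps can be sketched in Python as follows. Several choices here are assumptions rather than part of the slide: ATG is taken as the only start codon, "scan back" is read as keeping the start codon furthest upstream after the previous stop, and the `min_codons` threshold of 100 is an arbitrary illustrative value.

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"  # assumption: ATG is the only start codon considered
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for each ORF of at least
    min_codons codons (stop codon not counted). Coordinates are
    0-based, end-exclusive, and refer to the strand being scanned."""
    seq = seq.upper()
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    orfs = []
    for strand, s in strands:
        for frame in range(3):
            start = None  # first ATG seen since the last stop codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
                elif codon == START and start is None:
                    start = i
    return orfs

# Toy sequence: one ORF of 6 codons (ATG + 5 x GCT) followed by TAA.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))
```

Raising `min_codons` lowers the false-positive rate at the cost of missing short genes, which is exactly the trade-off listed under "Cons".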

Coding vs. Non-coding Regions: Codon Frequencies

Codon usage in coding regions is different.
Leucine, Alanine and Tryptophan are encoded by 6, 4 and 1 different codons, respectively.
In a random sequence we expect them to appear in a 6:4:1 ratio.
In proteins they appear in a 6.9:6.5:1 ratio.
Another example: A or T appears in 90% of cases as the last letter of a codon in protein-coding regions.
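The 6:4:1 figure, and the 3/64 stop-codon frequency used earlier, can be checked by counting codons per amino acid in the standard genetic code. The sketch below builds the table from the usual compact 64-letter encoding (bases in T, C, A, G order); the variable names are chosen for this example only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ...
# order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of distinct codons that encode each amino acid (or stop).
degeneracy = Counter(CODON_TABLE.values())

print(degeneracy["L"], degeneracy["A"], degeneracy["W"])  # Leu, Ala, Trp
print(degeneracy["*"], "stop codons out of", len(CODON_TABLE))
```

If all 64 codons were equally likely, amino acids would appear in proportion to these counts; the observed 6.9:6.5:1 ratio shows that real coding sequence deviates from that expectation.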

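Nucleotide Markov Model (MM) for Gene Detection

2nd-Order MM

Idea: extend the model to capture codons

Results: poor. Codons overlap in this model: each nucleotide is conditioned on the previous two, so every triplet is scored regardless of the reading frame.

MM over codons

Idea: transform the sequence into codons, then use a 1st-order MM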

Why not use codon frequencies directly?
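A first-order Markov model over codons can be sketched as below. This is a minimal illustration, not the model used in the lecture: the add-one pseudocount, the function names, and the restriction to sequences containing only A, C, G and T are all assumptions.

```python
import math
from collections import defaultdict
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def to_codons(seq):
    """Split a sequence into non-overlapping codons (frame 0)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def train(sequences, pseudocount=1.0):
    """Estimate P(codon_i | codon_{i-1}) from in-frame training sequences.
    The pseudocount keeps unseen transitions from getting probability 0."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        cs = to_codons(seq)
        for prev, cur in zip(cs, cs[1:]):
            counts[prev][cur] += 1
    model = {}
    for prev in CODONS:
        total = sum(counts[prev].values()) + pseudocount * len(CODONS)
        model[prev] = {
            cur: (counts[prev][cur] + pseudocount) / total for cur in CODONS
        }
    return model

def log_likelihood(model, seq):
    """Log-probability of the codon transitions in seq under the model."""
    cs = to_codons(seq)
    return sum(math.log(model[p][c]) for p, c in zip(cs, cs[1:]))

model = train(["ATGGCTGCTGCTTAA"])
print(log_likelihood(model, "GCTGCTGCT"), log_likelihood(model, "GCTAAAGCT"))
```

In practice one would train one model on known coding sequence and one on non-coding sequence and compare the two log-likelihoods; because the sequence is first cut into codons, the overlap problem of the nucleotide model does not arise.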

“Codon Preferences” program:

Uses a window of 25 codons around each point

Score: $\log\left(\frac{P}{1-P}\right)$
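A sliding-window score of this kind can be sketched as follows. The sketch assumes a per-codon log-odds of a coding-region codon frequency against a uniform 1/64 background, averaged over the window; this is a stand-in chosen for illustration and may differ from the score the program actually computes. The `codon_freq` table, the `floor` value and the function name are hypothetical.

```python
import math

def window_scores(seq, codon_freq, window=25, background=1 / 64, floor=1e-6):
    """Average log-odds score in a window of codons centred on each codon.

    codon_freq: dict mapping codon -> frequency in known coding regions
                (hypothetical input; unseen codons are floored).
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    log_odds = [
        math.log(max(codon_freq.get(c, 0.0), floor) / background)
        for c in codons
    ]
    half = window // 2
    scores = []
    for k in range(len(codons)):
        # The window is truncated at the sequence ends.
        lo, hi = max(0, k - half), min(len(codons), k + half + 1)
        scores.append(sum(log_odds[lo:hi]) / (hi - lo))
    return scores

# Toy usage: a made-up table in which GCT is strongly preferred.
freq = {"GCT": 0.5}
scores = window_scores("GCT" * 10 + "AAA" * 10, freq, window=5)
print(scores[0], scores[-1])
```

Positive values mark windows whose codon usage looks like the coding table, negative values mark windows that look less like it than a uniform background.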

Using the Promoter’s Signal

We are still far from perfect…
Idea: try to detect signals in the promoter regions, to help discriminate real genes among ORFs.

Prokaryotes:
~-35 from the TSS: TTGACA
~-10 from the TSS: TATAAT (“TATA box” signal)

No single promoter has the exact consensus.
Nearly all promoters have 2-3 of the conserved bases in TAxyzT.
80-90% have all 3.
In 50%, xyz = TAA.
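Because no promoter matches the consensus exactly, a search has to tolerate mismatches. The sketch below looks for a near-match to TTGACA followed, after a spacer, by a near-match to TATAAT. The spacer range of 16 to 18 bases and the mismatch limit are assumptions added for this example (they are not given on the slide), and the function names are hypothetical.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"

def mismatches(site, consensus):
    """Number of positions where site differs from the consensus."""
    return sum(a != b for a, b in zip(site, consensus))

def find_promoter_candidates(seq, max_mismatches=2, min_spacer=16, max_spacer=18):
    """Return (pos35, pos10, mm35, mm10) for each pair of hexamers that
    approximately match the -35 and -10 consensus sequences.
    Positions are 0-based starts of the hexamers on the given strand."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], MINUS_35)
        if m35 > max_mismatches:
            continue
        for spacer in range(min_spacer, max_spacer + 1):
            j = i + 6 + spacer
            site = seq[j:j + 6]
            if len(site) < 6:  # ran off the end of the sequence
                break
            m10 = mismatches(site, MINUS_10)
            if m10 <= max_mismatches:
                hits.append((i, j, m35, m10))
    return hits

toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoter_candidates(toy, max_mismatches=0))
```

A hit upstream of an ORF is supporting evidence that the ORF is a real gene; on its own, such a short degenerate pattern will still match many places by chance.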

Summary So Far

We have seen the problems in trying to find genes in a genome-wide scan, even in prokaryotes!

The bottom line is that the problem is not really solved, but most research in gene finding focuses on eukaryotes, where the main interest lies…

Next lecture: much more sophisticated models to handle the much more complex situation in eukaryotes in general, and in human in particular.