15-20 september WABI03
A Method to Detect Gene Structure and Alternative
Splice Sites by Agreeing ESTs to a Genomic Sequence
Paola Bonizzoni, Graziano Pesole*, Raffaella Rizzi
DISCo, University of Milan-Bicocca, Italy
*Department of Physiology and Biochemistry, University of Milan, Italy
Supported by FIRB Bioinformatics: Genomics and Proteomics
Outline
Gene structure and alternative splicing (AS)
Problem definition and algorithm
ASPic program
Experimental results and discussion
Mechanism of Splicing
[Figure: DNA (5’–3’ double strand) → TRANSCRIPTION → pre-mRNA (exon 1, exon 2, exon 3) → SPLICING by spliceosome → mRNA splicing product (exon 1 exon 2 exon 3). EST (Expressed Sequence Tag, cDNA): exon 1 exon 2 exon 3.]
Modes of Alternative Splicing
[Figure: genomic sequence with exons 1, 2, 3 separated by introns. First splicing mode: 1 2 3. Second splicing mode: 1 3. Third splicing mode: 2 3.]
Modes of Alternative Splicing
[Figure: exons 1, 2, 2b, 3, illustrating competing 5’–3’ splice sites and exclusive exons.]
Why is AS important?
AS occurs in 59% of human genes (Graveley, 2001)
AS expands protein diversity (a single gene generates multiple transcripts)
AS is tissue-specific (Graveley, 2001)
AS is related to human diseases
Motivations
Regulation of AS is still an open problem
NEED tools to
predict alternative splicing forms
analyze such a mechanism by a representation of splicing forms
What is available?
Fast programs that produce a single EST alignment to a genomic sequence:
Spidey (Wheelan et al., 2001)
Squall (Ogasawara & Morishita, 2002)
But predicting the exon-intron gene structure is a complicated goal because
sequencing errors in ESTs make it difficult to locate splice sites by alignment
duplications and repeated sequences may produce more than one possible EST alignment
Open Problems
Formal definition of AS prediction problem …
Combined analysis of the EST alignments related to the same gene, by agreeing ESTs to a common exon-intron gene structure
Optimization criteria
Formal Definitions
Def 1 Genomic sequence: $G = I_1 f_1 I_2 f_2 I_3 f_3 \dots I_n f_n I_{n+1}$, where $I_i$ ($i = 1, 2, \dots, n+1$) are introns and $f_i$ ($i = 1, 2, \dots, n$) are exons
Def 2 Exon factorization of $G$: $G_E = f_1 f_2 f_3 \dots f_n$
Def 3 An EST factorization of an EST $S$ compatible with $G_E$ is $S = s_1 s_2 \dots s_k$ s.t. there exist $1 \le i_1 < i_2 < \dots < i_k \le n$ with:
$s_t = f_{i_t}$ for $t = 2, 3, \dots, k-1$; $s_1$ is a suffix of $f_{i_1}$ and $s_k$ is a prefix of $f_{i_k}$
Splice variant: $s_t = \mathrm{suff}(f_{i_t})$ or $s_t = \mathrm{pref}(f_{i_t})$
Def 3 (with errors) The same factorization, with equality replaced by bounded edit distance:
$\mathrm{edit}(s_t, f_{i_t}) \le \mathit{error}$ for $t = 2, 3, \dots, k-1$; $\mathrm{edit}(s_1, \mathrm{suff}(f_{i_1})) \le \mathit{error}$ and $\mathrm{edit}(s_k, \mathrm{pref}(f_{i_k})) \le \mathit{error}$
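Def 3 with the error bound can be checked mechanically once a choice of exon indices is given. The Python sketch below is an illustration only, not the ASPic code: the function names, the 0-based indices, the use of unit-cost Levenshtein distance for edit(·,·), and the restriction to k ≥ 2 are all assumptions, and the splice-variant case for internal factors is not handled.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed meaning of edit(., .))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def is_compatible(factors: List[str], exons: List[str],
                  indices: List[int], max_error: int) -> bool:
    """Check Def 3 (with errors) for factors s_1..s_k placed on exons[indices].

    Hypothetical helper: indices are 0-based and k >= 2 is assumed.
    """
    k = len(factors)
    if k < 2 or k != len(indices):
        return False
    if indices[0] < 0 or indices[-1] >= len(exons):
        return False
    if any(a >= b for a, b in zip(indices, indices[1:])):
        return False  # exon indices must be strictly increasing
    for t, (s, i) in enumerate(zip(factors, indices)):
        f = exons[i]
        if t == 0:        # s_1 is compared with the closest suffix of f
            d = min(edit_distance(s, f[j:]) for j in range(len(f) + 1))
        elif t == k - 1:  # s_k is compared with the closest prefix of f
            d = min(edit_distance(s, f[:j]) for j in range(len(f) + 1))
        else:             # internal factors are compared with the whole exon
            d = edit_distance(s, f)
        if d > max_error:
            return False
    return True
```

For instance, with exons ["ACGTAC", "GGCTA", "TTGACC"], the factors ["TAC", "GGATA", "TTG"] on indices [0, 1, 2] are rejected with max_error = 0 and accepted with max_error = 1, because the middle factor differs from its exon by one substitution.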
The Problem
Input
- A genomic sequence G
- A set of EST sequences S = {S1, S2, …, Sn}
Output
An exon factorization GE of G (GE = f1, f2, …, fn) and a set of EST factorizations compatible with GE
Objective: minimize n, the number of exons in GE
Example
Genomic sequence G
EST set S = {S1, S2, S3}
S1: A2 D1 C1
S2: A1A2 B D1
S3: A2 D1D2 C1C2
Exon factorization with 7 exons: A2, A1A2, B, D1, C1, D1D2, C1C2
Exon factorization with 4 exons: A1A2, B, D1D2, C1C2
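The 4-exon factorization in this example can be checked against the three ESTs symbolically. The sketch below is only an illustration under assumptions: blocks such as A1 or D2 are treated as atomic symbols, exons are taken in the order A1A2, B, D1D2, C1C2, the first factor must be a suffix of its exon, the last a prefix, and internal factors may be either (the splice-variant case). The names `compatible`, `EXONS` and `ESTS` are hypothetical.

```python
def is_prefix(x, y):
    """True if the non-empty tuple x is a prefix of tuple y."""
    return len(x) > 0 and y[:len(x)] == x


def is_suffix(x, y):
    """True if the non-empty tuple x is a suffix of tuple y."""
    return len(x) > 0 and y[-len(x):] == x


def compatible(est, exons):
    """Greedy left-to-right check that the EST factors map to exons in order."""
    pos = 0
    last = len(est) - 1
    for t, factor in enumerate(est):
        while pos < len(exons):
            exon = exons[pos]
            pos += 1
            if t == 0:
                ok = is_suffix(factor, exon)
            elif t == last:
                ok = is_prefix(factor, exon)
            else:  # internal factor: whole exon, or a splice variant of it
                ok = is_prefix(factor, exon) or is_suffix(factor, exon)
            if ok:
                break
        else:
            return False  # ran out of exons before placing this factor
    return True


# The 4-exon factorization and the three ESTs of the example.
EXONS = [("A1", "A2"), ("B",), ("D1", "D2"), ("C1", "C2")]
ESTS = {
    "S1": [("A2",), ("D1",), ("C1",)],
    "S2": [("A1", "A2"), ("B",), ("D1",)],
    "S3": [("A2",), ("D1", "D2"), ("C1", "C2")],
}

for name, est in ESTS.items():
    print(name, compatible(est, EXONS))
```

All three ESTs agree with the same four exons, which is why the 7 separate factors can be merged into 4.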
Results
MEFC is MAX-SNP-hard (linear reduction from NODE-COVER)
Heuristic algorithm:
Iterative process to factorize each EST
Backtracking to recompute previous EST factors if they are not compatible with GE
The algorithm
[Figure: EST Si with factors si1 … si j-1 sij placed on exons e1, e2, …, em of G; previous EST Si-1 with factors si-1 1 … si-1 j-1, si-1 j, …, si-1 n.]
Iterative jth step: partial EST factorization of Si (compute factor sij, placed on genomic region em)
if (Compatible(em, exon_list)) then add em to exon_list;
otherwise try to place sij elsewhere;
If not possible then backtrack;
After placing all the factors sij for the set S, place the external factors;
The algorithm (more details)
[Figure: factor sij of EST Si, split into components c1 c2 c3 c4 c5, placed on G between an ag and a gt pattern that delimit the exon.]
Compute factor sij
sij can be divided into n components ck (k = 1, 2, …, n)
At least one of these components, for k from 1 to (n-1), is error-free and can be placed on G
The algorithm searches for a perfect match of c1 on G
Suppose that c1 has no perfect match on G
Then the algorithm searches for a perfect match of c2 on G
Suppose that c2 has a perfect match on G
Then the entire factor sij can be placed on G
Find the canonical ag pattern on the left
Find the rightmost gt pattern such that the edit distance between sijy and the genomic substring from ag to gt is bounded
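The component search above can be sketched in a few lines of Python. This is an illustration under assumptions, not the ASPic code: the helper names are hypothetical, components are cut to nearly equal length, only the first exact occurrence on G is used (repeated sequences would need all occurrences), and only the left ag step is shown; the bounded-edit-distance search for the rightmost gt is omitted.

```python
def split_components(factor, n):
    """Split factor into n contiguous components of nearly equal length.

    Returns a list of (offset_in_factor, component) pairs.
    """
    size, extra = divmod(len(factor), n)
    parts, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        parts.append((start, factor[start:end]))
        start = end
    return parts


def find_seed(factor, genome, n):
    """Find the first component c_k (k = 1..n-1) with a perfect match on G.

    Returns (k, offset_in_factor, genome_position), or None if no component
    among the first n-1 occurs exactly in the genome.
    """
    parts = split_components(factor, n)
    for k, (offset, comp) in enumerate(parts[:n - 1], start=1):
        if not comp:
            continue
        pos = genome.find(comp)  # first exact occurrence only (assumption)
        if pos != -1:
            return k, offset, pos
    return None


def exon_start_from_seed(genome, seed_pos, offset):
    """Locate the nearest 'ag' ending at or before the projected factor start.

    Returns the position just after that 'ag' (candidate exon start),
    or None if there is no 'ag' on the left.
    """
    guess = max(seed_pos - offset, 0)
    ag = genome.rfind("ag", 0, guess)
    return None if ag == -1 else ag + 2


# Toy genome: the exon "acgtacggtca" sits between "ag" and "gt".
genome = "ccctttagacgtacggtcagtaaacc"
factor = "aggtacggtca"  # EST factor with one error in its first component

seed = find_seed(factor, genome, 3)
if seed is not None:
    k, offset, pos = seed
    print(k, offset, pos, exon_start_from_seed(genome, pos, offset))
```

In the toy input the first component contains the error and has no exact match, so the second component anchors the factor, and the ag to its left fixes the candidate exon start.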
ASPic (Alternative Splicing PredICtion)
Input
- A minimum length of an exon
- A maximum number of exons in the exon factorization of the genomic sequence
- An error percentage
- A genomic sequence
- An EST set (or cluster)
Output
- A text file for all EST alignments
- An HTML file for the exon factorization of the genomic sequence
ASPic data validation
ASPic INPUT:
Genomic sequences from ASAP database
EST clusters of human chromosome 1 from UniGene database
Validation Database:
ASAP (Lee et al., 2003)
Experimental Results
[Table columns: Genomic sequence (official gene name) | Introns detected by ASAP | ASAP introns detected by ASPic | Novel introns detected by ASPic | Genomic shift detected by ASPic]
Execution times
Pentium IV, 1600 MHz, 256 MB, running Linux
An example of data (gene HNRPR)
ASPic finds a novel intron from 2144 to 5333, confirmed by 18 EST sequences
Positions are counted from 0 for ASPic and from 1 for ASAP
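Because the two tools count positions from different origins, intron coordinates have to be shifted before they are compared. A minimal sketch, assuming that the only difference is the origin (0 for ASPic, 1 for ASAP) and that both tools treat interval ends in the same way; the function names are hypothetical:

```python
def aspic_to_asap(start, end):
    """Shift a 0-based ASPic interval to 1-based ASAP numbering (assumed)."""
    return start + 1, end + 1


def asap_to_aspic(start, end):
    """Shift a 1-based ASAP interval to 0-based ASPic numbering (assumed)."""
    return start - 1, end - 1


# The novel HNRPR intron reported by ASPic, expressed in ASAP numbering.
print(aspic_to_asap(2144, 5333))
```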
An example of data (gene HNRPR, intron 2144-5333)
[Figure labels: EST ID; left and right ends of the two exons; EST exons; genomic exons]
WEB site
Project leaders: Prof. Paola Bonizzoni, Prof. Graziano Pesole
Software design lead: Raffaella Rizzi
Web site: Gabriele Ravanelli
Graphical representation: Francesco Perego, Anna Redondi
Data analysis: Francesca Rossin
Other contributions: Gianluca Dellavedova
THANK YOU!