Groningen Bioinformatics Centre
Yang Li and Rainer Breitling
Dagstuhl seminar, March 2007
Analyzing genome tiling microarrays for the detection of novel expressed genes
Preliminary version 23 Feb 2007
Outline
• Introduction to tiling arrays
• Published research on exon finding
• Our data set
• Machine learning for exon finding
• Results
Background
• Genomic tiling array
Probes are designed to blanket an entire genomic region of interest and used to detect the presence or absence of transcription.
• Tiling
A sequence of probes spanning a genomic region is called a “tile path”, or a “tiling”.
Groningen Bioinformatics Centre
Two types of tiling array construction:
1) Oligonucleotide tiling array
2) Tiling array constructed using PCR products
Trends in Genetics 2005, 21:466
Detection of transcription
1) Discovery of novel genes
2) Discovery of novel non-coding RNAs
3) Alternative splicing study
Advantages:
1) The sensitivity of microarrays enables rare transcripts to be detected;
2) The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed;
3) The experimental design is not dependent on current genome annotations.
Recent Research
Surprising amounts of genomic ‘dark matter’
• More than 50% of animal genomes may be transcribed
• Novel protein-coding genes
• Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…)
• Antisense transcripts
• Alternative isoforms and gene ‘extensions’
• Leaky transcription
• Technical noise/artifacts
Exon-intron discriminators
Kampa et al.: Hodges–Lehmann estimator (pseudo-median)
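A minimal sketch of the Hodges–Lehmann estimator named on this slide: the median of all pairwise (Walsh) averages. The input values and the idea of feeding it the probe signals of one window are assumptions for illustration; the slide does not say which values Kampa et al. pass to the estimator.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """One-sample Hodges-Lehmann estimator (pseudo-median).

    It is the median of all Walsh averages (a + b) / 2, taken over
    every pair of values, including each value paired with itself.
    """
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


# Hypothetical probe signals in one window, with a single outlier.
window = [1.0, 2.0, 3.0, 4.0, 100.0]
print(hodges_lehmann(window))   # pseudo-median, robust to the outlier
print(sum(window) / len(window))  # the mean is pulled up by the outlier
```

The estimator is less sensitive to a single aberrant probe than the mean, which is presumably why it is attractive for noisy probe windows.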
Exon-intron discriminators
Schadt et al.: PCA
1. Probes are separated into 15 kb sliding windows
2. Calculate robust principal components (from the between-sample correlation matrix)
3. Calculate the Mahalanobis distance of each probe from the center of the data in the first two dimensions of the principal component scores (PCS)
4. Decide on exon vs. intron
5. Assign probes to transcriptional units
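Steps 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the published procedure: it uses ordinary PCA (via NumPy) instead of a robust variant, the function name is hypothetical, and the simulated window is made up. The decision rule of step 4 is not given on the slide and is left out.

```python
import numpy as np


def mahalanobis_on_first_two_pcs(intensities):
    """Distance of each probe from the data center in PC1/PC2 space.

    intensities: array of shape (probes, samples) for one window.
    Ordinary PCA is used here as a stand-in for robust PCA.
    """
    x = np.asarray(intensities, dtype=float)
    # Standardize each sample so that PCA acts on the
    # between-sample correlation matrix.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    scores = z @ top2                      # shape (probes, 2)
    centered = scores - scores.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(scores, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(d2)


# Simulated window: 50 background probes and 10 expressed probes, 6 samples.
rng = np.random.default_rng(0)
background = rng.normal(7.0, 0.3, size=(50, 6))
exons = rng.normal(11.0, 0.3, size=(10, 6))
window = np.vstack([background, exons])
dist = mahalanobis_on_first_two_pcs(window)
print(dist[:50].mean(), dist[50:].mean())
```

Probes far from the center are candidates for expressed sequence; any cutoff on the distance would have to be chosen separately.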
Exon-intron discriminators
Our collaborators’ approach (Andrew Fraser and Tom Gingeras):
• Use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region
• Apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters: maxgap, the maximal distance between two positive probes, and minrun, the minimal size of a transfrag
• A minrun of 40 (two positive probes) or 80 (three positive probes) is a good starting point for this parameter
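The thresholding and joining steps can be sketched as below. Several details are assumptions made for this example and are not taken from the slide: the function names, measuring maxgap between probe start positions, measuring transfrag size from the first probe start to the last probe end, and a probe length of 25 bp.

```python
def intensity_threshold(control_intensities, false_positive_rate=0.05):
    """Intensity exceeded by at most the given fraction of negative controls."""
    ordered = sorted(control_intensities)
    n_allowed = int(false_positive_rate * len(ordered))
    return ordered[len(ordered) - n_allowed - 1]


def call_transfrags(positions, intensities, threshold,
                    maxgap, minrun, probe_length=25):
    """Join positive probes into transfrags.

    positions: probe start coordinates, sorted along the genome.
    Two consecutive positive probes are joined when their starts are
    at most maxgap apart; transfrags shorter than minrun are dropped.
    Returns (start, end) pairs with inclusive coordinates.
    """
    positive = [p for p, v in zip(positions, intensities) if v > threshold]
    groups = []
    start = end = None
    for pos in positive:
        if start is not None and pos - end <= maxgap:
            end = pos                      # extend the current transfrag
        else:
            if start is not None:
                groups.append((start, end))
            start = end = pos              # open a new transfrag
    if start is not None:
        groups.append((start, end))
    return [(s, e + probe_length - 1) for s, e in groups
            if e + probe_length - s >= minrun]


# Hypothetical data: 20 control probes and 6 genomic probes.
threshold = intensity_threshold(list(range(1, 21)))
positions = [100, 125, 150, 300, 325, 500]
intensities = [50, 60, 10, 70, 80, 90]
print(threshold)
print(call_transfrags(positions, intensities, threshold, maxgap=50, minrun=40))
```

In this toy run the isolated positive probe at position 500 is discarded because a single 25 bp probe is shorter than the minrun of 40.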
About our tiling data
• Affymetrix C. elegans Tiling 1.0R Array
• Genome-wide gene expression: Chr I–V, Chr X and Chr M (mitochondrion)
• Resolution: on average 25 bp
• Negative bacterial controls
• Samples: 21 samples across development (plus mutant)
• Probes: 2,942,364 PM/MM pairs
About tiling data
Developmental time course: number of samples per strain and stage

Strain    L2   L3   L4   Young adult   Gravid adult   Total
N2        2    2    3    3             3              13
smg-1*    -    -    3    2             3              8

* smg-1: deficient in nonsense-mediated decay
Examples
LAP-1 (ZK353.6)
Genomic position: III:8401845..8399119 bp
LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression but this appears to be unrelated to moulting.
Example
[Figure: LAP-1, chrIII. Probe intensity against genome position (bp), 8399000 to 8401500, in two panels (PM and MM), with a legend distinguishing intron and exon.]
Example
[Figure: heatmaps of LAP-1 (chrIII) probe intensities, one panel for PM and one for MM. Probe positions 8399109 to 8401884 are plotted against the 21 samples (N2 at L2, L3, L4, young adult and gravid adult; smg-1 at L4, young adult and gravid adult). Each panel has a color key and histogram.]
Example 2
[Figure: nhx-4, chrX. Probe intensity against genome position (bp), 6914000 to 6921000, in two panels (PM and MM).]
Example 2
[Figure: heatmaps of nhx-4 (chrX) probe intensities, one panel for PM and one for MM. Probe positions 6914251 to 6921088 are plotted against the 21 samples (N2 at L2, L3, L4, young adult and gravid adult; smg-1 at L4, young adult and gravid adult). Each panel has a color key and histogram.]
General impression
Chr III, 2866 genes
[Figure: two panels (PM and MM) plotting -log10(p) for the Wilcoxon test against the exon-intron difference. The PM panel is annotated 74.25 %, the MM panel 69.43 %.]
General impression
[Figure: histograms (frequency) of the exon-intron difference, one panel for PM and one for MM.]
General impression
[Figure: four histograms (frequency), x-axis range 6 to 14: PM_Exon (ex.pm.all), PM_Intron (in.pm.all), MM_Exon (ex.mm.all), MM_Intron (in.mm.all).]
PCA
[Figure: two scatter plots of principal component 2 against principal component 1.]
Methods: machine learning
Aim
• Find the most effective (correct) machine learning method that distinguishes between true exons and true introns
• Find the simplest (fastest, most intuitive) method that achieves this task
Methods: machine learning
Main challenge
True exons and True introns are not known:
Annotated exons may be unexpressed
Annotated introns may be novel transcripts
Our approach
Ignore the problem and optimize supervised performance
Assumption
True novel transcripts will be similar to known ones
Methods: machine learning
1. Classification and regression tree (CART)
binary recursive partitioning
Advantages:
• Easy to understand
• Easy to implement
• Computationally cheap
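Binary recursive partitioning can be sketched in a few lines. This is a toy illustration, not the implementation used for the results: the Gini impurity criterion, the depth limit, the function names and the two example features (a median intensity and a neighbor correlation) are all assumptions chosen for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature index, threshold) with the lowest weighted impurity."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow(rows, labels, depth=0, max_depth=2):
    """Grow a tree; a leaf is the fraction of exon probes that reach it."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return sum(labels) / len(labels)
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    """Follow the splits down to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical probes: (median intensity, correlation with neighbor); 1 = exon.
rows = [(7.0, 0.1), (7.5, 0.2), (8.0, 0.0), (11.0, 0.8), (12.0, 0.9), (10.5, 0.7)]
labels = [0, 0, 0, 1, 1, 1]
tree = grow(rows, labels)
print(tree)
print(predict(tree, (11.5, 0.5)), predict(tree, (7.2, 0.9)))
```

The fitted tree is a readable nested rule, which illustrates why the method is easy to understand and cheap to apply to millions of probes.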
Methods: machine learning
2. Support vector machines (SVM)
[Figure: scatter plot of two classes of points; one symbol denotes +1, the other denotes 0.]
How would you classify this data?
Maximum Margin
The classifier with the maximum margin is the ideal one.
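To make the margin idea concrete, the sketch below compares two separating lines on a toy data set by their geometric margin, i.e. the distance from the line to the closest point. The points, the two candidate lines and the function name are invented for illustration, and the class labeled 0 on the slide is coded as -1 here so that the sign of the distance indicates correctness.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance of any point to the line w.x + b = 0.

    labels are +1 or -1. A negative result means that at least one
    point lies on the wrong side of the line.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in zip(points, labels))


# Toy data: three points per class.
points = [(3, 3), (4, 3), (3, 4), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

margin_a = geometric_margin((1, 1), -3.5, points, labels)  # line x + y = 3.5
margin_b = geometric_margin((1, 0), -2.0, points, labels)  # line x = 2
print(margin_a, margin_b)
```

Both lines separate the classes without error, but the first keeps every point further away, so a maximum-margin classifier would prefer it.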
Evaluation
Receiver Operating Characteristic curve (ROC curve)
[Figure: ROC curve; x-axis: false positive rate (1-specificity), y-axis: true positive rate (sensitivity).]
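A small sketch of how a ROC curve and its area (AUC) can be computed from classifier scores. The scores and labels are made up, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores; label 1 = exon probe, 0 = intron probe.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

The AUC equals the fraction of exon/intron pairs in which the exon probe receives the higher score, so 0.5 corresponds to guessing and 1.0 to perfect ranking.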
Selection of informative features – intensities
[Figure: heatmap of AUC with color key and histogram. Labels: Raw, Normalized; Mean, Median, Max, Max_1.]
Features: pm.i, pm.1, pm_1, pm.2, pm_2, mm.i, mm.1, mm_1, mm.2, mm_2
Selection of informative features – correlation
[Figure: heatmap of AUC with color key and histogram. Labels: Raw, Normalized; Pearson, Spearman.]
Features: pm1, pm-1, mm1, mm-1
Selection of informative features
Summary
• Almost all reasonable features are informative
• No striking difference between mean and median, but they seem better than max, max_1
• CC also informative. No striking difference between Pearson and Spearman
• Quantile normalization doesn’t improve the result
Decision
• Median, CC (Pearson) of non-normalized data are used to generate features
• GC content or melting temperature can also be informative
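The two chosen feature types could be computed per probe roughly as follows. This is a guess at the construction, since the slides do not define the features exactly: here the median is taken over samples, CC is the Pearson correlation between a probe's intensity profile across samples and that of its direct neighbors, and the names and data are hypothetical.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def probe_features(profiles, i):
    """Features for probe i.

    profiles: one list of intensities per probe (one value per sample),
    ordered along the genome.
    """
    features = {"median": median(profiles[i])}
    for offset in (-1, 1):
        j = i + offset
        if 0 <= j < len(profiles):
            features[f"cc{offset:+d}"] = pearson(profiles[i], profiles[j])
    return features


# Three neighboring probes measured in four samples (made-up values).
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [8, 6, 4, 2]]
feats = probe_features(profiles, 1)
print(feats)
```

Probes in the same transcript are expected to rise and fall together across samples, which is the intuition behind using the correlation with neighbors as a feature.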
Selection of informative features – neighbors
[Figure: CART AUC against the neighboring probes used (x-axis from X.10nb to X7nb), in four panels: PM, MM, CC.PM, CC.MM.]
Selection of informative features
• Neighbours
• MM
• CC.PM
• CC.MM
• Tm
• ANOVA results
AUC ~ ( other factors )
[Figure: three CART panels showing AUC (0.0 to 1.0) against length(exon) (axis ticks 107, 171, 259, 394, 2449), Tm (axis ticks 55 to 85) and withinexon.posi.]
• expression
• exon length
• melting temperature
• relative position
Maxgap and minrun optimization
[Figures: for each value of thres, two heatmaps over minrun (0 to 6) and maxgap (0 to 6), titled Minrun/maxgap and Maxgap/minrun, each with a color key and histogram.]

thres    ccr      fpr      tpr
0.936    0.806    0.009    0.464
0.718    0.850    0.030    0.627
0.500    0.856    0.059    0.700
0.300    0.815    0.216    0.851

Maxgap and minrun optimization
1 - maxgap
2 - minrun
Order: minrun/maxgap
Maxgap and minrun conclusion
• A minrun of 0 and a maxgap of 1 give the best overall result for our classifier
• Minrun and maxgap have minimal influence on the results if the classifier already uses neighboring probe information
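The effect of maxgap and minrun on per-probe calls can be explored with a sketch like the one below. It rests on assumptions that the slides do not confirm: both parameters are counted in probes, maxgap is applied before minrun, ccr is read as the correct classification rate, and all names and data are hypothetical.

```python
def runs(calls, value):
    """Yield (start, end) index pairs, end exclusive, of runs equal to value."""
    start = None
    for i, c in enumerate(list(calls) + [None]):
        if c == value and start is None:
            start = i
        elif c != value and start is not None:
            yield start, i
            start = None


def fill_gaps(calls, maxgap):
    """Turn interior runs of at most maxgap negative calls into positives."""
    out = list(calls)
    for s, e in runs(calls, 0):
        if s > 0 and e < len(calls) and e - s <= maxgap:
            out[s:e] = [1] * (e - s)
    return out


def drop_short_runs(calls, minrun):
    """Turn runs of fewer than minrun positive calls into negatives."""
    out = list(calls)
    for s, e in runs(calls, 1):
        if e - s < minrun:
            out[s:e] = [0] * (e - s)
    return out


def rates(predicted, truth):
    """Correct classification, false positive and true positive rates."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    return {"ccr": (tp + tn) / len(pairs),
            "fpr": fp / (fp + tn),
            "tpr": tp / (tp + fn)}


# Made-up annotation and thresholded classifier calls for ten probes.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
raw = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
smoothed = drop_short_runs(fill_gaps(raw, maxgap=1), minrun=2)
print(rates(raw, truth))
print(rates(smoothed, truth))
```

Repeating this over a grid of maxgap and minrun values, and for both orders of application, would give tables comparable in spirit to the heatmaps on the optimization slides.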
Future work
• Joining of transfrags into transcriptional units (genes)
• Differential gene expression between developmental stages and strains (ANOVA)
• Detection of alternative splicing (ANOVA)