Groningen Bioinformatics Centre

Yang Li and Rainer Breitling

Dagstuhl seminar, March 2007

Analyzing genome tiling microarrays for the detection of novel expressed genes

Preliminary version 23 Feb 2007

Outline

• Introduction to tiling arrays

• Published research on exon finding

• Our data set

• Machine learning for exon finding

• Results

Background

• Genomic tiling array

Probes are designed to blanket an entire genomic region of interest and used to detect the presence or absence of transcription.

• Tiling

A sequence of probes spanning a genomic region is called a “tile path”, or a “tiling”.

Two types of tiling array construction:

1) Oligonucleotide tiling array

2) Tiling array constructed using PCR products

Trends in Genetics 2005 v21 466

1) Discovery of novel genes

2) Discovery of novel non-coding RNAs

3) Alternative splicing study

Advantages:

1) The sensitivity of microarrays enables rare transcripts to be detected;

2) The parallel nature of the arrays enables numerous samples and genomic sequences to be analyzed.

3) The experimental design is not dependent on current genome annotations.

Detection of transcription

Recent Research

• Surprising amounts of genomic ‘dark matter’

• More than 50% of animal genomes may be transcribed

• Novel protein-coding genes

• Novel non-coding genes (rRNA, tRNA, snoRNA, miRNA…)

• Antisense transcripts

• Alternative isoforms and gene ‘extensions’

• Leaky transcription

• Technical noise/artifacts

Kampa et al.: Hodges–Lehmann estimator (pseudo-median)
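As an illustration of the estimator named here, the following is a minimal sketch of the one-sample Hodges–Lehmann estimator, computed as the median of all pairwise (Walsh) averages. The input values are invented, and how Kampa et al. applied the estimator to probe signals is not specified on the slide.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(values):
    """One-sample Hodges-Lehmann estimator (pseudo-median).

    Median of the Walsh averages (x_i + x_j) / 2 over all pairs i <= j.
    """
    if not values:
        raise ValueError("need at least one value")
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


# Hypothetical signal values for a few neighbouring probes, one of them an outlier.
signal = [1.0, 2.0, 3.0, 100.0]
pseudo_median = hodges_lehmann(signal)
plain_mean = sum(signal) / len(signal)
print(pseudo_median, plain_mean)
```

The pseudo-median stays close to the bulk of the values, whereas the mean is pulled towards the outlier.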

Exon-intron discriminators

Schadt et al. PCA

1. Probes are separated into 15 kb sliding windows

2. Calculate robust principal components (from the between-sample correlation matrix)

3. Calculate the Mahalanobis distance of each probe from the center of the data in the first two dimensions of the principal component scores (PCS)

4. Decide on exon vs. intron

5. Assign probes to transcriptional units
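Steps 2 and 3 above can be sketched as follows. This is only an illustration under stated assumptions: it uses ordinary (non-robust) PCA computed by power iteration, whereas the slide refers to a robust variant, and the function names and example window are hypothetical. The cutoff used in step 4 is not given on the slide and is left out.

```python
import math


def standardize(sample):
    """Scale one sample's probe intensities to mean 0 and variance 1."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    return [(v - mean) / sd for v in sample]


def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpairs(matrix, k=2, iterations=500):
    """Leading k eigenpairs of a symmetric matrix (power iteration with deflation)."""
    work = [row[:] for row in matrix]
    size = len(work)
    pairs = []
    for _ in range(k):
        vec = [1.0 + 0.1 * i for i in range(size)]  # arbitrary fixed start vector
        for _ in range(iterations):
            nxt = mat_vec(work, vec)
            norm = math.sqrt(sum(v * v for v in nxt))
            vec = [v / norm for v in nxt]
        value = sum(a * b for a, b in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Remove the component just found so the next round finds the following one.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs


def mahalanobis_in_pc_space(window):
    """window: one list of probe intensities per sample (same probes, same order).

    Returns each probe's Mahalanobis distance from the data center, measured
    in the first two principal component scores.
    """
    z = [standardize(sample) for sample in window]
    n_probes = len(z[0])
    # Between-sample correlation matrix.
    corr = [[sum(a * b for a, b in zip(zi, zj)) / (n_probes - 1) for zj in z]
            for zi in z]
    pairs = top_eigenpairs(corr, k=2)
    distances = []
    for p in range(n_probes):
        d2 = 0.0
        for value, vec in pairs:
            score = sum(w * sample[p] for w, sample in zip(vec, z))
            d2 += score ** 2 / value  # PC scores are uncorrelated, variance = eigenvalue
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical window: 4 samples x 8 probes; probes 5 and 6 are bright in every sample.
window = [
    [1, 2, 1, 2, 9, 10, 1, 2],
    [2, 1, 2, 1, 10, 9, 2, 1],
    [1, 1, 2, 2, 9, 9, 1, 3],
    [2, 2, 1, 1, 10, 10, 3, 1],
]
print([round(d, 2) for d in mahalanobis_in_pc_space(window)])
```

Power iteration is used only to keep the sketch free of external dependencies; a linear algebra library would normally be the more natural choice.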

Exon-intron discriminators

Our collaborators’ approach (Andrew Fraser and Tom Gingeras):

• use negative bacterial controls to calculate an intensity threshold corresponding to a 5% false positive rate in a given region

• apply these intensity thresholds to generate positive probe maps, which are then joined together using two parameters: maxgap, the maximal distance between two positive probes, and minrun, the minimal size of a transfrag

• a minrun of 40 (two positive probes) or 80 (three positive probes) is a good starting point for these parameters
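The thresholding and joining steps above might look roughly like this. It is a sketch under assumptions that the slide does not state: distances are measured in bp between probe start positions, a transfrag's size runs from the first probe start to the end of the last probe, and probes are 25 bp long. All names and example values are hypothetical.

```python
def control_threshold(control_intensities, false_positive_rate=0.05):
    """Intensity exceeded by at most the given fraction of negative-control probes."""
    ordered = sorted(control_intensities)
    allowed = int(false_positive_rate * len(ordered))  # controls allowed above threshold
    if not 0 <= allowed < len(ordered):
        raise ValueError("false_positive_rate must be in [0, 1)")
    return ordered[len(ordered) - allowed - 1]


def call_transfrags(positions, intensities, threshold, maxgap, minrun, probe_length=25):
    """Join positive probes into transfrags.

    positions: probe start coordinates in bp, sorted ascending.
    A probe is positive if its intensity is above the threshold. Consecutive
    positive probes at most `maxgap` bp apart are joined, and runs shorter
    than `minrun` bp are discarded.
    """
    positive = [pos for pos, value in zip(positions, intensities) if value > threshold]
    runs = []
    for pos in positive:
        if runs and pos - runs[-1][1] <= maxgap:
            runs[-1][1] = pos          # extend the current run
        else:
            runs.append([pos, pos])    # start a new run
    return [(start, last + probe_length) for start, last in runs
            if last + probe_length - start >= minrun]


controls = list(range(1, 21))                 # hypothetical negative-control intensities
threshold = control_threshold(controls)
positions = [0, 35, 70, 105, 300, 500, 535]   # hypothetical probe starts (bp)
intensities = [30, 25, 4, 40, 22, 28, 35]
print(call_transfrags(positions, intensities, threshold, maxgap=40, minrun=40))
```

With these settings a single isolated positive probe is dropped, while two adjacent positive probes form a transfrag.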

Exon-intron discriminators

• Affymetrix C. elegans Tiling 1.0R Array

• Genome-wide gene expression: Chr I–V, Chr X and Chr M (mitochondrion)

• Resolution: on average 25 bp

• Negative bacterial controls

• Samples: 21 samples across development (plus mutant)

• Probes: 2,942,364 PM/MM pairs

About our tiling data

Developmental time course (number of samples)

Strain    L2    L3    L4    Young adult    Gravid adult    Total
N2        2     2     3     3              3               13
smg-1*    -     -     3     2              3               8

* smg-1: deficient in nonsense-mediated decay

About tiling data

LAP-1(ZK353.6)

Genomic Position: III:8401845..8399119 bp

LAP-1 is expressed throughout the life cycle. While there appears to be marginally less LAP-1 message at 2 h and 40 h, corresponding to early L1 and young adult stages respectively, LAP-1 appears to be constitutively expressed. Densitometric analysis of LAP-1 expression compared to the housekeeping gene ama-1 shows some variation in LAP-1 expression but this appears to be unrelated to moulting.

Examples

[Figure: LAP-1, chrIII. Probe intensity versus genome position (bp, 8399000 to 8401500), shown separately for PM and MM probes; legend: intron, exon]

Example

[Figure: heatmaps of LAP-1 (chrIII) PM and MM probe intensities, each with a color key and histogram. Columns: probe positions from 8399109 to 8401884 bp. Rows: the 21 samples (N2 at L2, L3, L4, young adult and gravid adult; smg-1 at L4, young adult and gravid adult)]

Example

[Figure: nhx-4, chrX. Probe intensity versus genome position (bp, 6914000 to 6921000), shown separately for PM and MM probes]

Example 2

[Figure: heatmaps of nhx-4 (chrX) PM and MM probe intensities, each with a color key and histogram. Columns: probe positions from 6914251 to 6921088 bp. Rows: the 21 samples (N2 at L2, L3, L4, young adult and gravid adult; smg-1 at L4, young adult and gravid adult)]

Example 2

Chr III, 2866 genes

[Figure: two panels, PM and MM. x-axis: exon-intron; y-axis: -log10(p) for Wilcoxon test. Annotated percentages: 74.25 % (PM), 69.43 % (MM)]

General impression
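The two quantities on these plots, the exon-intron difference and -log10(p) for a Wilcoxon test, could be computed per gene roughly as follows. This is a sketch assuming a two-sided rank-sum test with a normal approximation (no continuity or tie correction) and a difference of medians; the slides do not say which variants were used, and the intensities below are invented.

```python
import math
from statistics import median


def rank_sum_test(exon, intron):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). z > 0 means exon values tend to be larger than intron values.
    """
    pooled = sorted([(v, 0) for v in exon] + [(v, 1) for v in intron])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1   # tied values share their average rank
        i = j + 1
    n1, n2 = len(exon), len(intron)
    exon_rank_sum = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = exon_rank_sum - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical log-scale probe intensities for one gene.
exon_probes = [10, 11, 12, 13, 14, 15, 16, 17]
intron_probes = [1, 2, 3, 4, 5, 6, 7, 8]
z, p = rank_sum_test(exon_probes, intron_probes)
difference = median(exon_probes) - median(intron_probes)
print(difference, -math.log10(p))
```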

[Figure: histograms of the exon-intron difference for PM and MM probes; y-axis: frequency]

General impression

[Figure: four intensity histograms: PM_Exon (ex.pm.all), PM_Intron (in.pm.all), MM_Exon (ex.mm.all), MM_Intron (in.mm.all); x-axis range 6 to 14; y-axis: frequency]

General impression

[Figure: two PCA score plots; x-axis: principal component 1; y-axis: principal component 2]

PCA

Methods: machine learning

Aim

• Find the most effective (correct) machine learning method that distinguishes between true exons and true introns

• Find the simplest (fastest, most intuitive) method that achieves this task

Main challenge

True exons and true introns are not known:

• Annotated exons may be unexpressed

• Annotated introns may be novel transcripts

Our approach

Ignore the problem and optimize supervised performance

Assumption

True novel transcripts will be similar to known ones

1. Classification and regression tree (CART)

binary recursive partitioning

Advantages:

• Easy to understand

• Easy to implement

• Computationally cheap
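To make "binary recursive partitioning" concrete, here is a small sketch of a classification tree grown with the Gini impurity. It is an assumption-laden toy rather than the implementation used for the talk: the two features (a median PM intensity and a correlation coefficient), the depth limit and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Best (feature index, threshold, weighted impurity), or None if no split exists."""
    best = None
    for f in range(len(rows[0])):
        # Every observed value except the largest leaves both sides non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


def grow(rows, labels, depth=0, max_depth=2):
    """Recursively partition the data; leaves hold the majority class (ties give 0)."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return int(2 * sum(labels) > len(labels))
    f, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    """Follow the splits down to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical probes: [median PM intensity, correlation with neighbour]; 1 = exon.
rows = [[12.0, 0.9], [11.5, 0.8], [11.0, 0.7], [7.0, 0.1], [7.5, 0.2], [8.0, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
tree = grow(rows, labels)
print(tree, predict(tree, [11.2, 0.5]))
```

The resulting tree is a nested tuple of (feature, threshold, left subtree, right subtree), which is easy to read off as a set of rules.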

2. Support vector machines (SVM)

[Figure: scatter plot of two classes of points; one symbol denotes +1, the other denotes 0]

How would you classify this data?

Maximum Margin

The classifier with the maximum margin is the ideal one.
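The maximum-margin criterion can be illustrated by comparing the geometric margins of a few candidate separating lines. This sketch does not train an SVM; it only evaluates the quantity that an SVM maximizes. The points and candidate lines are made up, and the class shown as 0 on the slides is coded as -1 here.

```python
import math


def margin(w, b, points, labels):
    """Geometric margin of the line w.x + b = 0 for labels in {+1, -1}.

    This is the smallest signed distance of any point to the line; it is
    negative if at least one point lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))


points = [(2, 2), (3, 3), (0, 0), (-1, 0)]
labels = [1, 1, -1, -1]

# Three hypothetical lines that all separate the two classes.
candidates = {
    "diagonal": ((1, 1), -2),
    "vertical": ((1, 0), -1),
    "horizontal": ((0, 1), -1),
}
margins = {name: margin(w, b, points, labels) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(best, margins)
```

All three lines classify the points correctly, but the diagonal one keeps the largest distance to the closest points and would be preferred.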

Receiver Operating Characteristic curve (ROC curve)

Evaluation

[Figure: ROC curve; x-axis: false positive rate (1 - specificity); y-axis: true positive rate (sensitivity)]

The Area Under an ROC Curve (AUC)
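One way to compute the AUC without drawing the curve is to use its rank interpretation: the fraction of (exon, intron) pairs in which the exon probe receives the higher score, counting ties as one half. The sketch below assumes labels of 1 for exon and 0 for intron; the scores are invented.

```python
def auc(scores, labels):
    """Area under the ROC curve from classifier scores and 0/1 labels."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


scores = [0.9, 0.8, 0.4, 0.35, 0.1]   # hypothetical classifier outputs
labels = [1, 1, 0, 1, 0]              # 1 = exon probe, 0 = intron probe
print(auc(scores, labels))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to a perfect ranking of exon probes above intron probes.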

[Figure: heatmap of AUC (color key roughly 0.7 to 0.8) with color key and histogram. Rows: Mean, Median, Max, Max_1. Columns: raw and normalized data for the features pm.i, pm.1, pm_1, pm.2, pm_2, mm.i, mm.1, mm_1, mm.2, mm_2]

Selection of informative features – intensities

[Figure: heatmap of AUC (color key roughly 0.7 to 0.8) with color key and histogram. Rows: Pearson, Spearman. Columns: raw and normalized data for the features pm1, pm-1, mm1, mm-1]

Selection of informative features – correlation

Summary

• Almost all reasonable features are informative

• No striking difference between mean and median, but both seem better than max and max_1

• The correlation coefficient (CC) is also informative; no striking difference between Pearson and Spearman

• Quantile normalization doesn’t improve the result

Decision

• The median and the CC (Pearson) of non-normalized data are used to generate features

• GC content or melting temperature can also be informative

Selection of informative features
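The two feature types chosen here could be derived per probe roughly as follows. This is a sketch under the assumption that the median is taken across samples and that the CC is the Pearson correlation, across samples, between a probe and its immediate neighbours; the feature names and the data are hypothetical.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def probe_features(intensities, index):
    """Features for one probe.

    intensities: probes in genomic order, each a list of per-sample values.
    Returns the median intensity across samples and the Pearson CC with the
    previous and next probe, where those neighbours exist.
    """
    probe = intensities[index]
    features = {"median": median(probe)}
    for offset in (-1, 1):
        neighbour = index + offset
        if 0 <= neighbour < len(intensities):
            features[f"cc{offset:+d}"] = pearson(probe, intensities[neighbour])
    return features


# Hypothetical data: 3 neighbouring probes measured in 4 samples.
intensities = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
print(probe_features(intensities, 1))
```

Probes of the same exon are expected to rise and fall together across samples, which is why a high correlation with a neighbour can carry information beyond the intensity itself.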

Selection of informative features – neighbors

[Figure: CART. AUC for different numbers of neighboring probes, in four panels: PM, MM, CC.PM and CC.MM; x-axis labels run from X.10nb to X7nb]

[Figure: selection of informative features – neighbors; panels: SVM, CART]

Selection of informative features

• Neighbours

• MM

• CC.PM

• CC.MM

• Tm

• ANOVA results

Results

Example tree

AUC ~ ( expression level )

AUC ~ length( exon )

AUC ~ Tm

AUC ~ probe position within exon

AUC ~ ( other factors )

[Figure: CART. AUC (y-axis, 0.0 to 1.0) plotted against exon length (length(exon); ticks 107, 171, 259, 394, 2449), melting temperature (Tm; ticks 55, 62, 70, 77, 85) and relative probe position within the exon (withinexon.posi); panel labels: expression, exon length, melting temperature, relative position]

Can minrun and maxgap improve the results?

maxgap = 1

minrun = 3

Maxgap and minrun optimization

[Figures: for each classifier threshold, heatmaps over minrun (0 to 6) and maxgap (0 to 6), one per order of application (Minrun/maxgap and Maxgap/minrun), each with a color key and histogram]

thres    ccr      fpr      tpr
0.936    0.806    0.009    0.464
0.718    0.850    0.030    0.627
0.500    0.856    0.059    0.700
0.300    0.815    0.216    0.851

1 - maxgap, 2 - minrun. Order: minrun/maxgap
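The thres, ccr, fpr and tpr values reported on the optimization slides can be related to classifier output as in the sketch below. It assumes that ccr stands for the correct classification rate and that a probe is called an exon when its predicted probability is at least thres; the probabilities and labels are invented.

```python
def threshold_metrics(probabilities, labels, thres):
    """Correct classification rate, false positive rate and true positive rate."""
    tp = fp = tn = fn = 0
    for prob, label in zip(probabilities, labels):
        called_exon = prob >= thres
        if called_exon and label == 1:
            tp += 1
        elif called_exon:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "thres": thres,
        "ccr": (tp + tn) / (tp + fp + tn + fn),
        "fpr": fp / (fp + tn),
        "tpr": tp / (tp + fn),
    }


probabilities = [0.95, 0.8, 0.6, 0.4, 0.3, 0.1]   # hypothetical exon probabilities
labels = [1, 1, 0, 1, 0, 0]                       # 1 = annotated exon
for thres in (0.9, 0.5):
    print(threshold_metrics(probabilities, labels, thres))
```

Lowering the threshold raises both tpr and fpr, which is the trade-off visible across the four threshold settings.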

Maxgap and minrun conclusion

• A minrun of 0 and a maxgap of 1 give the best overall result for our classifier

• minrun and maxgap have minimal influence on the results if the classifier already uses neighboring probe information

Future work

• Joining of transfrags into transcriptional units (genes)

• Differential gene expression between developmental stages and strains (ANOVA)

• Detect alternative splicing (ANOVA)

Acknowledgements

Yang Li and Ritsert Jansen, Groningen Bioinformatics Centre

Andrew Fraser, Wellcome Trust Sanger Institute, Cambridge

Tom Gingeras, Affymetrix, Santa Clara

Jan Kammenga, Nematology, Wageningen University