Population-based Detection of Structural Variants in ...

38
Population-based Detection of Structural Variants in Normal and Aberrant Genomes. Jean Monlong Guillaume Bourque’s group Canadian Human and Statistical Genetics Meeting April 19, 2015 Human Genetics Dept. 1 / 14

Transcript of Population-based Detection of Structural Variants in ...

Population-based Detection of StructuralVariants in Normal and Aberrant Genomes.

Jean Monlong

Guillaume Bourque’s group

Canadian Human and Statistical Genetics MeetingApril 19, 2015

Human Genetics Dept.

1 / 14

Structural variation

Genetic variation involving more than 500bp.

Raphael Lab, Brown University.

Structural Variant: SV; Copy Number Variation: CNV.

2 / 14

SV detection using High-Throughput Sequencing

Baker 2012, Nature Methods.

3 / 14

LimitationMappability issues

I Noisy or reduced signal in repeat-rich regions, centromeres, telomeres.I Unpredictable segmentation → reduced sensitivity/specificity.I Filtering problematic regions reduces the genome range tested.

genomic window

num

ber

of r

eads

map

ped

genomic window

num

ber

of r

eads

map

ped

4 / 14

Objective

Test the entire genome, including low-mappability regions, anddetect subtle abnormal coverage.

PopSV: Population-based approach

Use a set of reference experiments to detect abnormal patterns.

genomic window

num

ber

of r

eads

map

ped

samplereferencetested

5 / 14

PopSV: Population-based approach

genomic window

num

ber

of r

eads

map

ped

samplereferencetested

Workflow

1. Genome is fragmented in bins.

2. Reads in each bin are counted, for each sample.

3. Normalization of the bin counts.

4. Each sample and each bin is tested for divergence fromreference samples (Z-score).

5. P-value estimation and multiple test correction.

6 / 14

Application

CageKid : Renal Cell Carcinoma

I Whole-Genome Sequencing of 100 individuals.

I Normal and tumor paired samples.

I Reference samples : normal samples.

Twin family dataset

I Whole-Genome Sequencing of 45 individuals.

I 10 families (2 parents + 2 twins).

I Reference samples : all the samples.

∼ 40X coverage, Illumina paired-end 100bp

7 / 14

Example : Partial signal supporting tumoral deletion

●● ● ●

●●

●●

●●

●●

● ● ●

0

2000

4000

100.75 100.80 100.85 100.90 100.95 101.00position (Mb)

read

cov

erag

e

tumor sample: D000GMU●●

abnormalnormal

normal samples

Chr.1, overlapping CDC14A gene (cell division cycle), not detected by other approaches.

8 / 14

Evaluating PopSV performance

Germline events detected in tumor samples

●●●● ●● ●● ●●●●● ● ●●●● ●●●●●●●●

● ●●

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

all eventslow

mappability

0 200 400 600number of germline events in tumor

●● ●

●●● ●

●●● ●

all eventslow

mappability

0.00 0.25 0.50 0.75 1.00proportion of germline events in tumor

Results

PopSV detected more consistent calls than other methods withsimilar specificity.

9 / 14

Other validation and benchmark

I Consistent with SNP-array calls ?

I Twin dataset: concordant between twins ?

I Concordant calls when using different bin sizes ?

For more details/discussion come see poster 30 tomorrow !

10 / 14

Twin dataset : PopSV on normal genomes

0.0

2.5

5.0

7.5

10.0

sample

abno

rmal

reg

ion

(Mbp

)

coverage in reference samples low expected high

16-fold enrichment in low coverage regions.

11 / 14

Twin dataset : PopSV calls in low coverage regions

● ● ● ● ● ● ● ● ● ●0.0

0.2

0.4

0.6

0.8

1323

−Fa

ther

1323

−M

othe

r

1323

−Tw

in1

1323

−Tw

in2

1301

−Fa

ther

1301

−M

othe

r

1301

−Tw

in1

1301

−Tw

in2

1286

−Fa

ther

1286

−M

othe

r

1286

−Tw

in1

1286

−Tw

in2

1490

−Fa

ther

1490

−M

othe

r

1490

−Tw

in1

1490

−Tw

in2

1480

−Fa

ther

1480

−M

othe

r

1480

−Tw

in2

1480

−Tw

in1

1389

−Fa

ther

1389

−M

othe

r

1389

−Tw

in1

1389

−Tw

in2

1652

−Fa

ther

1652

−M

othe

r

1652

−Tw

in1

1652

−Tw

in2

1443

−M

othe

r

1443

−Fa

ther

1443

−Tw

in2

1443

−Tw

in1

1121

−Fa

ther

1121

−M

othe

r

1121

−Tw

in1

1121

−Tw

in2

unkn

own4

unkn

own2

unkn

own3

unkn

own1

unkn

own5

1207

−Fa

ther

1207

−M

othe

r

1207

−Tw

in1

1207

−Tw

in2

sample

family ● ● ● ● ● ● ● ● ● ● ●1121 1207 1286 1301 1323 1389 1443 1480 1490 1652 NA

●Father Mother NA Twin1 Twin2

All calls, flag distance

● ● ● ● ● ● ● ● ● ●0.0

0.2

0.4

0.6

0.8

1652

−M

othe

r

1652

−Fa

ther

1652

−Tw

in1

1652

−Tw

in2

unkn

own4

unkn

own2

unkn

own3

1121

−M

othe

r

1121

−Tw

in1

1121

−Tw

in2

1443

−M

othe

r

1443

−Fa

ther

1443

−Tw

in2

1443

−Tw

in1

1490

−Fa

ther

1490

−M

othe

r

1490

−Tw

in1

1490

−Tw

in2

1480

−M

othe

r

1480

−Tw

in2

1480

−Tw

in1

1121

−Fa

ther

1323

−Fa

ther

1323

−M

othe

r

1323

−Tw

in1

1323

−Tw

in2

1301

−Fa

ther

1301

−M

othe

r

1301

−Tw

in1

1301

−Tw

in2

1389

−Fa

ther

1480

−Fa

ther

1207

−M

othe

r

1207

−Fa

ther

1207

−Tw

in1

1207

−Tw

in2

unkn

own5

1286

−M

othe

r

1286

−Fa

ther

1286

−Tw

in1

1286

−Tw

in2

unkn

own1

1389

−M

othe

r

1389

−Tw

in1

1389

−Tw

in2

sample

family ● ● ● ● ● ● ● ● ● ● ●1121 1207 1286 1301 1323 1389 1443 1480 1490 1652 NA

●Father Mother NA Twin1 Twin2

Low coverage (<10 reads), flag distance

12 / 14

Conclusion

PopSV has been applied to

I Whole-Genome Sequencing of normal genomes.I Whole-Genome Sequencing of tumor genomes.I Whole-Exome Sequencing data.

PopSV robustly detects

I variants in high and low mappability regionsI variants with partial signal (e.g. in tumors).

R package available at github.com/jmonlong/PopSV.

Future directionI Other types of SVs as excess of discordant read pairs.I Combination with orthogonal approaches (PEM, Assembly).

13 / 14

Acknowledgment

I Guillaume BourqueI Mathieu BourgeyI Louis LetourneauI Francois LefebvreI Eric AudemardI Toby Hocking

I Simon GravelI Mathieu Blanchette

I Simon GirardI Guy RouleauI Michel Boivin

14 / 14

Thank You !

15 / 14

Low-mappability regions overlap functional elements

0

50000

100000

150000

200000

0.00 0.25 0.50 0.75 1.00mappability of the bin

num

ber

of b

ins

mappability

high − 72%

low − 21%

no − 7%

gene DNase GWAS

0.00

0.25

0.50

0.75

1.00

lincRNA other

protein_coding

pseudogene

prop

ortio

n ov

erla

ppin

g at

leas

t one

bin

mappability high low no

16 / 14

Unknown technical bias

0.000

0.002

0.004

0.006

4000 5000 6000 7000 8000inter−sample mean coverage

dens

ity

WGSsimulatedshuffled

0.0000

0.0025

0.0050

0.0075

0.0100

0 250 500 750 1000inter−sample standard deviation

dens

ity

WGSsimulatedshuffled

17 / 14

PopSV: importance of normalization

I Experiment-specific technical bias.

I Naive normalization (linear, quantile) is often not enough.

0.00

0.05

0.10

0.15

0.20

RS

1146

77K

2310

006

LR35

4R

S11

4676

RS

1146

04R

S11

4528

K23

1007

8K

2310

004

RS

1146

74LR

398

RS

1146

05LR

405

K21

1008

9K

2310

061

LR41

7R

S11

4585

LR34

0K

2150

051

LR36

4K

2310

024

LR42

2K

2310

030

K23

1000

8K

2150

053

LR38

0R

S11

4636

K21

5005

2K

2310

001

K21

5004

5K

2310

090

K23

1008

0R

S11

4624

RS

1145

39R

S11

4606

LR37

7LR

370.

2LR

370

K23

1003

8K

2110

093

LR40

7R

S11

4646

RS

1144

94K

2310

007

K21

5004

7LR

390

LR34

4K

2110

118

LR37

1R

S11

4527

LR38

2K

2310

025

K21

1006

0LR

357

K21

1007

8R

S11

4472

LR42

0K

2150

024

K21

1010

6R

S11

4511

RS

1145

41R

S11

4563

LR40

4LR

389

RS

1149

12R

S11

4728

RS

1147

19LR

426

LR42

3LR

358

K21

1006

8LR

413

K21

1006

1K

2110

073

K21

1005

6R

S11

4532

K21

5000

6K

2110

059

K21

1012

6K

2110

085

K21

1011

2LR

396

K16

3002

8K

2110

079

K16

1035

9K

1620

380

RS

1146

70

sample

prop

otio

n of

the

stud

ied

geno

me

coveragehighestlowest

18 / 14

PopSV: importance of normalization

I PCA-based normalization (Krumm, 2012; Boeva, 2014).I Targeted normalization: linear using a subset of the genome.

Ref1

Ref2

Ref3

Ref4

Test

Test

19 / 14

PopSV: Z-score and test

For a sample s:

I For each bin b: z =BCb

s −BCbreference

sdbreference

I pv = P(|z | ≤ |Z |) with Z ∼ N (0, σ) where σ is estimated from the zdistribution across all bins.

0.0

0.1

0.2

0.3

0.4

0.5

−5.0 −2.5 0.0 2.5 5.0Z−scores

dens

ity

normalizationtargetedmedianmedian+variancequantile

20 / 14

Z-scores : Normal versus Tumor

−20

−10

0

10

20

−20 −10 0 10 20normal sample Z−score

tum

or s

ampl

e Z

−sc

ore

nb of bins(0,1](1,5](5,10](10,100](100,1e+03](1e+03,Inf]

“funky snowman” plot

21 / 14

Z-scores : contamination detection

−20

−10

0

10

20

−20 −10 0 10 20D000G19 z−score

D00

0G1D

tum

or z

−sc

ore

nb of bins

(0,1]

(1,5]

(5,10]

(10,100]

(100,1e+03]

(1e+03,Inf]

D000G19 vs D000G1D − z−scores

22 / 14

Example: Telomeric region

●●

0

2000

4000

6000

135.11 135.13 135.15position (Mb)

read

cov

erag

e

normal sample: D000GQ9●●

abnormalnormal

normal samples

Chr.10, overlapping genes (PRAP1, CALY), not detected by other approaches.

23 / 14

Example: NAHR candidatech

r chr20

chr20y

gene

s

RP4−576H24.4

SIRPB1

SIRPD SIRPG

repe

ats

DNALINE

Low_complexityLTR

Simple_repeatSINE

bin

coun

t

●●

● ●

● ● ●●

0

2500

5000

7500

10000

norm

aliz

ed c

over

age

group

normal

tumor

Pop

SV 0

10

20

30

40

0

10

20

30

40

deletion

duplication

nb s

ampl

es group

normal

tumor

1520000 1560000 1600000 1640000

24 / 14

500bp Z-scores within 10kb calls

0

5

10

15

>20

−10 −8 −6 −4 −2 0 2 4 6 8 >10median Z−score in 500bp bins

Z−

scor

e in

10k

b ca

lls

number of calls (0,1] (1,10] (10,50] (50,100] (100,1e+03]

25 / 14

500bp Z-scores within 10kb calls

0

−5

−10

−15

<−20

1086420−2−4−6−8<−10median Z−score in 500bp bins

Z−

scor

e in

10k

b ca

lls

number of calls (0,1] (1,10] (10,50] (50,100] (100,1e+03] (1e+03,Inf]

26 / 14

SNP array methods concordance

1 2 3 4 5

6 7 8 9 10

11 12 13 14 15

16 17 18 19 20

21 22

ASCAT

GAP

GS−loose

GS−strigent

ASCAT

GAP

GS−loose

GS−strigent

ASCAT

GAP

GS−loose

GS−strigent

ASCAT

GAP

GS−loose

GS−strigent

ASCAT

GAP

GS−loose

GS−strigent

0.0e+005.0e+071.0e+081.5e+082.0e+082.5e+080.0e+005.0e+071.0e+081.5e+082.0e+082.5e+08start

CN

0

1

2

3

4

5

6

7

8

D000G0L 0.636 0.161 0.899 tumor

27 / 14

SNP array concordance

● ●● ●● ●

●● ● ●

● ●

●●● ●

● ●● ●● ●●● ●

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

loosestringent

0.00 0.25 0.50 0.75 1.00proportion of WGS calls in SNP−array GS

28 / 14

SNP array concordance

●●●

●●

● ●

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

loosestringent

0.00 0.25 0.50 0.75 1.00proportion of SNP−array GS event also in WGS calls

29 / 14

Twin concordance

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

all eventslarge events

low m

appability

0 50 100 150 200number of concordant calls per twin pair

30 / 14

Twin concordance

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

cn.MOPS

FREEC

PopSV

all eventslarge events

low m

appability

0.5 0.6 0.7 0.8 0.9 1.0proportion of concordant calls per twin pair

31 / 14

Twins and clustering quality

●●

● ●

● ●

●●

● ●●

●● ●

● ●●

● ●

● ●

●●

● ● ●

● ●● ● ● ● ●

● ●● ● ● ● ●

● ●

● ●●

●● ●

●●

●●

●● ●

●●

●● ●

●● ●

●●

●●

● ●

● ● ● ●●

●● ●

● ● ●

0.00

0.25

0.50

0.75

1.00

0 10 20 30 40number of groups derived from CNV clustering

Ran

d in

dex

usin

g pe

digr

ee in

form

atio

nmethod

cn.MOPSFREECPopSV

clustering linkage● average

completeWard

32 / 14

Many variants in low coverage regions

0e+00

1e+05

2e+05

3e+05

0 1000 2000 >3000mean coverage in reference samples (normalized read counts)

num

ber

of b

ins

coverage in reference

low

expected

high

33 / 14

Many variants in low coverage regions

0

2000

4000

6000

0 1000 2000 >3000mean coverage in reference samples (normalized read counts)

num

ber

of c

alls

coverage in reference

low

expected

high

34 / 14

Twins dataset : copy number estimation

0

300

600

900

0 1 2 3 4 >5estimated copy number

num

ber

of c

alls

coverage inreference samples

low

expected

high

estimated copy number = 2 × coverage in sample / coverage in reference samples 35 / 14

Many calls in segmental duplications but also in genes

0.00

0.25

0.50

0.75

1.00

protein−codingexon

pseudogeneexon

segmentalduplication

DNA satellite

genomic feature

prop

ortio

n of

reg

ions

ove

rlapp

ing

a ge

nom

ic fe

atur

e

low coverage

expected coverage

high coverage

random regions

36 / 14

Distance to centromere/telomere/gaps

0.00

0.25

0.50

0.75

1.00

0 1 2 3 4 >5distance to centromere/telomere/gap (Mb)

cum

ulat

ive

prop

ortio

n of

abn

orm

al s

eque

nce

type

ambiguous

deletion

duplication

DGV

random

More SV detected near centromere/telomere/gaps.

37 / 14

Mappability

0.00

0.25

0.50

0.75

1.00

0.0 0.2 0.4 0.6 0.8 1.0proportion of mappable sequence

cum

ulat

ive

even

t pro

port

ion

type

ambiguous

deletion

duplication

DGV

random

38 / 14