Population-based Detection of Structural Variants in ...
Transcript of Population-based Detection of Structural Variants in ...
Population-based Detection of StructuralVariants in Normal and Aberrant Genomes.
Jean Monlong
Guillaume Bourque’s group
Canadian Human and Statistical Genetics MeetingApril 19, 2015
Human Genetics Dept.
1 / 14
Structural variation
Genetic variation involving more than 500bp.
Raphael Lab, Brown University.
Structural Variant: SV; Copy Number Variation: CNV.
2 / 14
LimitationMappability issues
I Noisy or reduced signal in repeat-rich regions, centromeres, telomeres.I Unpredictable segmentation → reduced sensitivity/specificity.I Filtering problematic regions reduces the genome range tested.
genomic window
num
ber
of r
eads
map
ped
genomic window
num
ber
of r
eads
map
ped
4 / 14
Objective
Test the entire genome, including low-mappability regions, anddetect subtle abnormal coverage.
PopSV: Population-based approach
Use a set of reference experiments to detect abnormal patterns.
genomic window
num
ber
of r
eads
map
ped
samplereferencetested
5 / 14
PopSV: Population-based approach
genomic window
num
ber
of r
eads
map
ped
samplereferencetested
Workflow
1. Genome is fragmented in bins.
2. Reads in each bin are counted, for each sample.
3. Normalization of the bin counts.
4. Each sample and each bin is tested for divergence fromreference samples (Z-score).
5. P-value estimation and multiple test correction.
6 / 14
Application
CageKid : Renal Cell Carcinoma
I Whole-Genome Sequencing of 100 individuals.
I Normal and tumor paired samples.
I Reference samples : normal samples.
Twin family dataset
I Whole-Genome Sequencing of 45 individuals.
I 10 families (2 parents + 2 twins).
I Reference samples : all the samples.
∼ 40X coverage, Illumina paired-end 100bp
7 / 14
Example : Partial signal supporting tumoral deletion
●
●● ● ●
●
●●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●
● ● ●
0
2000
4000
100.75 100.80 100.85 100.90 100.95 101.00position (Mb)
read
cov
erag
e
tumor sample: D000GMU●●
abnormalnormal
normal samples
Chr.1, overlapping CDC14A gene (cell division cycle), not detected by other approaches.
8 / 14
Evaluating PopSV performance
Germline events detected in tumor samples
●●●● ●● ●● ●●●●● ● ●●●● ●●●●●●●●
● ●●
●
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
all eventslow
mappability
0 200 400 600number of germline events in tumor
●● ●
●
●●● ●
●●● ●
all eventslow
mappability
0.00 0.25 0.50 0.75 1.00proportion of germline events in tumor
Results
PopSV detected more consistent calls than other methods withsimilar specificity.
9 / 14
Other validation and benchmark
I Consistent with SNP-array calls ?
I Twin dataset: concordant between twins ?
I Concordant calls when using different bin sizes ?
For more details/discussion come see poster 30 tomorrow !
10 / 14
Twin dataset : PopSV on normal genomes
0.0
2.5
5.0
7.5
10.0
sample
abno
rmal
reg
ion
(Mbp
)
coverage in reference samples low expected high
16-fold enrichment in low coverage regions.
11 / 14
Twin dataset : PopSV calls in low coverage regions
● ● ● ● ● ● ● ● ● ●0.0
0.2
0.4
0.6
0.8
1323
−Fa
ther
1323
−M
othe
r
1323
−Tw
in1
1323
−Tw
in2
1301
−Fa
ther
1301
−M
othe
r
1301
−Tw
in1
1301
−Tw
in2
1286
−Fa
ther
1286
−M
othe
r
1286
−Tw
in1
1286
−Tw
in2
1490
−Fa
ther
1490
−M
othe
r
1490
−Tw
in1
1490
−Tw
in2
1480
−Fa
ther
1480
−M
othe
r
1480
−Tw
in2
1480
−Tw
in1
1389
−Fa
ther
1389
−M
othe
r
1389
−Tw
in1
1389
−Tw
in2
1652
−Fa
ther
1652
−M
othe
r
1652
−Tw
in1
1652
−Tw
in2
1443
−M
othe
r
1443
−Fa
ther
1443
−Tw
in2
1443
−Tw
in1
1121
−Fa
ther
1121
−M
othe
r
1121
−Tw
in1
1121
−Tw
in2
unkn
own4
unkn
own2
unkn
own3
unkn
own1
unkn
own5
1207
−Fa
ther
1207
−M
othe
r
1207
−Tw
in1
1207
−Tw
in2
sample
family ● ● ● ● ● ● ● ● ● ● ●1121 1207 1286 1301 1323 1389 1443 1480 1490 1652 NA
●Father Mother NA Twin1 Twin2
All calls, flag distance
● ● ● ● ● ● ● ● ● ●0.0
0.2
0.4
0.6
0.8
1652
−M
othe
r
1652
−Fa
ther
1652
−Tw
in1
1652
−Tw
in2
unkn
own4
unkn
own2
unkn
own3
1121
−M
othe
r
1121
−Tw
in1
1121
−Tw
in2
1443
−M
othe
r
1443
−Fa
ther
1443
−Tw
in2
1443
−Tw
in1
1490
−Fa
ther
1490
−M
othe
r
1490
−Tw
in1
1490
−Tw
in2
1480
−M
othe
r
1480
−Tw
in2
1480
−Tw
in1
1121
−Fa
ther
1323
−Fa
ther
1323
−M
othe
r
1323
−Tw
in1
1323
−Tw
in2
1301
−Fa
ther
1301
−M
othe
r
1301
−Tw
in1
1301
−Tw
in2
1389
−Fa
ther
1480
−Fa
ther
1207
−M
othe
r
1207
−Fa
ther
1207
−Tw
in1
1207
−Tw
in2
unkn
own5
1286
−M
othe
r
1286
−Fa
ther
1286
−Tw
in1
1286
−Tw
in2
unkn
own1
1389
−M
othe
r
1389
−Tw
in1
1389
−Tw
in2
sample
family ● ● ● ● ● ● ● ● ● ● ●1121 1207 1286 1301 1323 1389 1443 1480 1490 1652 NA
●Father Mother NA Twin1 Twin2
Low coverage (<10 reads), flag distance
12 / 14
Conclusion
PopSV has been applied to
I Whole-Genome Sequencing of normal genomes.I Whole-Genome Sequencing of tumor genomes.I Whole-Exome Sequencing data.
PopSV robustly detects
I variants in high and low mappability regionsI variants with partial signal (e.g. in tumors).
R package available at github.com/jmonlong/PopSV.
Future directionI Other types of SVs as excess of discordant read pairs.I Combination with orthogonal approaches (PEM, Assembly).
13 / 14
Acknowledgment
I Guillaume BourqueI Mathieu BourgeyI Louis LetourneauI Francois LefebvreI Eric AudemardI Toby Hocking
I Simon GravelI Mathieu Blanchette
I Simon GirardI Guy RouleauI Michel Boivin
14 / 14
Low-mappability regions overlap functional elements
0
50000
100000
150000
200000
0.00 0.25 0.50 0.75 1.00mappability of the bin
num
ber
of b
ins
mappability
high − 72%
low − 21%
no − 7%
gene DNase GWAS
0.00
0.25
0.50
0.75
1.00
lincRNA other
protein_coding
pseudogene
prop
ortio
n ov
erla
ppin
g at
leas
t one
bin
mappability high low no
16 / 14
Unknown technical bias
0.000
0.002
0.004
0.006
4000 5000 6000 7000 8000inter−sample mean coverage
dens
ity
WGSsimulatedshuffled
0.0000
0.0025
0.0050
0.0075
0.0100
0 250 500 750 1000inter−sample standard deviation
dens
ity
WGSsimulatedshuffled
17 / 14
PopSV: importance of normalization
I Experiment-specific technical bias.
I Naive normalization (linear, quantile) is often not enough.
0.00
0.05
0.10
0.15
0.20
RS
1146
77K
2310
006
LR35
4R
S11
4676
RS
1146
04R
S11
4528
K23
1007
8K
2310
004
RS
1146
74LR
398
RS
1146
05LR
405
K21
1008
9K
2310
061
LR41
7R
S11
4585
LR34
0K
2150
051
LR36
4K
2310
024
LR42
2K
2310
030
K23
1000
8K
2150
053
LR38
0R
S11
4636
K21
5005
2K
2310
001
K21
5004
5K
2310
090
K23
1008
0R
S11
4624
RS
1145
39R
S11
4606
LR37
7LR
370.
2LR
370
K23
1003
8K
2110
093
LR40
7R
S11
4646
RS
1144
94K
2310
007
K21
5004
7LR
390
LR34
4K
2110
118
LR37
1R
S11
4527
LR38
2K
2310
025
K21
1006
0LR
357
K21
1007
8R
S11
4472
LR42
0K
2150
024
K21
1010
6R
S11
4511
RS
1145
41R
S11
4563
LR40
4LR
389
RS
1149
12R
S11
4728
RS
1147
19LR
426
LR42
3LR
358
K21
1006
8LR
413
K21
1006
1K
2110
073
K21
1005
6R
S11
4532
K21
5000
6K
2110
059
K21
1012
6K
2110
085
K21
1011
2LR
396
K16
3002
8K
2110
079
K16
1035
9K
1620
380
RS
1146
70
sample
prop
otio
n of
the
stud
ied
geno
me
coveragehighestlowest
18 / 14
PopSV: importance of normalization
I PCA-based normalization (Krumm, 2012; Boeva, 2014).I Targeted normalization: linear using a subset of the genome.
Ref1
Ref2
Ref3
Ref4
Test
Test
19 / 14
PopSV: Z-score and test
For a sample s:
I For each bin b: z =BCb
s −BCbreference
sdbreference
I pv = P(|z | ≤ |Z |) with Z ∼ N (0, σ) where σ is estimated from the zdistribution across all bins.
0.0
0.1
0.2
0.3
0.4
0.5
−5.0 −2.5 0.0 2.5 5.0Z−scores
dens
ity
normalizationtargetedmedianmedian+variancequantile
20 / 14
Z-scores : Normal versus Tumor
−20
−10
0
10
20
−20 −10 0 10 20normal sample Z−score
tum
or s
ampl
e Z
−sc
ore
nb of bins(0,1](1,5](5,10](10,100](100,1e+03](1e+03,Inf]
“funky snowman” plot
21 / 14
Z-scores : contamination detection
−20
−10
0
10
20
−20 −10 0 10 20D000G19 z−score
D00
0G1D
tum
or z
−sc
ore
nb of bins
(0,1]
(1,5]
(5,10]
(10,100]
(100,1e+03]
(1e+03,Inf]
D000G19 vs D000G1D − z−scores
22 / 14
Example: Telomeric region
●
●
●
●●
●
●
0
2000
4000
6000
135.11 135.13 135.15position (Mb)
read
cov
erag
e
normal sample: D000GQ9●●
abnormalnormal
normal samples
Chr.10, overlapping genes (PRAP1, CALY), not detected by other approaches.
23 / 14
Example: NAHR candidatech
r chr20
chr20y
gene
s
RP4−576H24.4
SIRPB1
SIRPD SIRPG
repe
ats
DNALINE
Low_complexityLTR
Simple_repeatSINE
bin
coun
t
●
●●
●
●
● ●
●
● ● ●●
0
2500
5000
7500
10000
norm
aliz
ed c
over
age
group
normal
tumor
Pop
SV 0
10
20
30
40
0
10
20
30
40
deletion
duplication
nb s
ampl
es group
normal
tumor
1520000 1560000 1600000 1640000
24 / 14
500bp Z-scores within 10kb calls
0
5
10
15
>20
−10 −8 −6 −4 −2 0 2 4 6 8 >10median Z−score in 500bp bins
Z−
scor
e in
10k
b ca
lls
number of calls (0,1] (1,10] (10,50] (50,100] (100,1e+03]
25 / 14
500bp Z-scores within 10kb calls
0
−5
−10
−15
<−20
1086420−2−4−6−8<−10median Z−score in 500bp bins
Z−
scor
e in
10k
b ca
lls
number of calls (0,1] (1,10] (10,50] (50,100] (100,1e+03] (1e+03,Inf]
26 / 14
SNP array methods concordance
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22
ASCAT
GAP
GS−loose
GS−strigent
ASCAT
GAP
GS−loose
GS−strigent
ASCAT
GAP
GS−loose
GS−strigent
ASCAT
GAP
GS−loose
GS−strigent
ASCAT
GAP
GS−loose
GS−strigent
0.0e+005.0e+071.0e+081.5e+082.0e+082.5e+080.0e+005.0e+071.0e+081.5e+082.0e+082.5e+08start
CN
0
1
2
3
4
5
6
7
8
D000G0L 0.636 0.161 0.899 tumor
27 / 14
SNP array concordance
● ●● ●● ●
●● ● ●
● ●
●●● ●
● ●● ●● ●●● ●
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
loosestringent
0.00 0.25 0.50 0.75 1.00proportion of WGS calls in SNP−array GS
28 / 14
SNP array concordance
●●●
●●
● ●
●
●
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
loosestringent
0.00 0.25 0.50 0.75 1.00proportion of SNP−array GS event also in WGS calls
29 / 14
Twin concordance
●
●
●
●
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
all eventslarge events
low m
appability
0 50 100 150 200number of concordant calls per twin pair
30 / 14
Twin concordance
●
●
●
●
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
cn.MOPS
FREEC
PopSV
all eventslarge events
low m
appability
0.5 0.6 0.7 0.8 0.9 1.0proportion of concordant calls per twin pair
31 / 14
Twins and clustering quality
●●
● ●
●
● ●
●
●
●
●
●
●
●
●
●●
●
● ●●
●● ●
●
● ●●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
● ● ●
● ●● ● ● ● ●
● ●● ● ● ● ●
● ●
● ●●
●● ●
●
●
●●
●
●●
●
●
●
●
●
●
●● ●
●
●●
●
●
●
●● ●
●● ●
●
●●
●●
●
● ●
● ● ● ●●
●
●● ●
● ● ●
●
●
●
●
●
●
0.00
0.25
0.50
0.75
1.00
0 10 20 30 40number of groups derived from CNV clustering
Ran
d in
dex
usin
g pe
digr
ee in
form
atio
nmethod
●
●
●
cn.MOPSFREECPopSV
clustering linkage● average
completeWard
32 / 14
Many variants in low coverage regions
0e+00
1e+05
2e+05
3e+05
0 1000 2000 >3000mean coverage in reference samples (normalized read counts)
num
ber
of b
ins
coverage in reference
low
expected
high
33 / 14
Many variants in low coverage regions
0
2000
4000
6000
0 1000 2000 >3000mean coverage in reference samples (normalized read counts)
num
ber
of c
alls
coverage in reference
low
expected
high
34 / 14
Twins dataset : copy number estimation
0
300
600
900
0 1 2 3 4 >5estimated copy number
num
ber
of c
alls
coverage inreference samples
low
expected
high
estimated copy number = 2 × coverage in sample / coverage in reference samples 35 / 14
Many calls in segmental duplications but also in genes
0.00
0.25
0.50
0.75
1.00
protein−codingexon
pseudogeneexon
segmentalduplication
DNA satellite
genomic feature
prop
ortio
n of
reg
ions
ove
rlapp
ing
a ge
nom
ic fe
atur
e
low coverage
expected coverage
high coverage
random regions
36 / 14
Distance to centromere/telomere/gaps
0.00
0.25
0.50
0.75
1.00
0 1 2 3 4 >5distance to centromere/telomere/gap (Mb)
cum
ulat
ive
prop
ortio
n of
abn
orm
al s
eque
nce
type
ambiguous
deletion
duplication
DGV
random
More SV detected near centromere/telomere/gaps.
37 / 14