Estimating chromosomal copy numbers from Affymetrix SNP & CN chips
Estimating chromosomal copy numbers
from Affymetrix SNP & CN chips
Henrik Bengtsson & Terry Speed
Dept of Statistics, UC Berkeley
September 13, 2007
"Statistics and Genomics Seminar"
Size = 264 kb, Number of SNPs = 72
What are copy numbers and segmentation?
Objectives
• Total copy number estimation/segmentation
• Estimate single-locus CNs well (segmentation method takes it from there)
• All generations of Affymetrix SNP arrays:
  – SNP chips: 10K, 100K, 500K
  – SNP & CN chips: 5.0, 6.0
• Small and very large data sets
Available in aroma.affymetrix
“Infinite” number of arrays: 1-1,000s
Requirements: 1-2GB RAM
Arrays: SNP, exon, expression, (tiling).
Dynamic HTML reports
Import/export to existing methods
Open source: R
Cross platform: Windows, Linux, Mac
Acknowledgments
WEHI, Melbourne: Ken Simpson
UC Berkeley: James Bullard, Kasper Hansen, Elizabeth Purdom
ISREC, Lausanne: “Asa” Wirapati
Johns Hopkins: Benilton Carvalho, Rafael Irizarry
Affymetrix, California: Ben Bolstad, Simon Cawley, Luis Jevons, Chuck Sugnet
Affymetrix chips
Generic Affymetrix chip
Chip size: 1.28 cm × 1.28 cm
6.5 million probes/chip
Feature size: 100 µm to 18 µm to 11 µm and now 5 µm. Soon: 1 µm, 0.8 µm, with a huge increase in number of probes.
Each 5 µm × 5 µm feature: > 1 million identical 25bp sequences
Abbreviated generic assay description
1. Start with target gDNA (genomic DNA) or mRNA.
2. Obtain labeled single-stranded target DNA fragments for hybridization to the probes on the chip.
3. After hybridization, washing, staining and scanning we get a digital image. This is summarized across pixels to probe-level intensities before we begin. They are our raw data.
Affymetrix probe terminology
Target DNA:         ...CGTAGCCATCGGTAAGTACTCAATGATAG...
                         |||||||||||||||||||||||||
Perfect match (PM):      ATCGGTAGCCATTCATGAGTTACTA
Mis-match (MM):          ATCGGTAGCCATACATGAGTTACTA
                         (25 nucleotides)
[Figure labels: PM, MM, other PMs; target seq., other DNA, other seq.]
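The relationship between the target and the two probes shown above can be checked with a few lines of Python. This is only a sketch under the convention used on the slide: the PM probe is written as the base-by-base complement of the 25-base target segment, and the MM probe is the PM with its middle (13th) base complemented. The function names `complement` and `make_probes` are hypothetical helpers, not part of any Affymetrix or aroma.affymetrix API.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Base-by-base complement (not reversed), as the slide aligns probe to target."""
    return "".join(COMPLEMENT[base] for base in seq)


def make_probes(target25: str) -> tuple[str, str]:
    """Return (PM, MM) for a 25-base target segment (hypothetical helper)."""
    if len(target25) != 25:
        raise ValueError("expected a 25-base target segment")
    pm = complement(target25)
    mid = 12  # 0-based index of the 13th (middle) base
    mm = pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]
    return pm, mm


# The 25 bases of the slide's target that sit under the probe.
target = "TAGCCATCGGTAAGTACTCAATGAT"
pm, mm = make_probes(target)
print("PM:", pm)
print("MM:", mm)
```

Running it on the slide's target segment reproduces the PM and MM sequences listed above.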
Affymetrix SNP chips (Mapping 10K, 100K, 500K)
Single Nucleotide Polymorphism (SNP)
Definition: A sequence variation such that two genomes may differ by a single nucleotide (A, T, C, or G).
Allele A: ...CGTAGCCATCGGTA[A]GTACTCAATGATAG...
Allele B: ...CGTAGCCATCGGTA[G]GTACTCAATGATAG...
A person is either AA, AB, or BB at this SNP.
Probes for SNPs
PMA:           ATCGGTAGCCATTCATGAGTTACTA
Allele A: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
Allele B: ...CGTAGCCATCGGTAGGTACTCAATGATAG...
PMB:           ATCGGTAGCCATCCATGAGTTACTA
(Also MMs, but not in the newer chips, so we will not use these!)
AA: PMA >> PMB
BB: PMA << PMB
AB: PMA ≈ PMB
Affymetrix SNP & CN chips (Genome-Wide Human SNP Array 5.0 & 6.0)
Copy-number/non-polymorphic probes (CNPs)
CN locus: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
PM:            ATCGGTAGCCATTCATGAGTTACTA
CN=1: PM = c
CN=2: PM = 2·c
CN=3: PM = 3·c
Genome-Wide Human SNP Array 6.0 includes frequently requested properties
• > 906,600 SNPs:
  – Unbiased selection of 482,000 SNPs: historical SNPs from the SNP Array 5.0 (== 500K)
  – Selection of additional 424,000 SNPs:
    • Tag SNPs
    • SNPs from chromosomes X and Y
    • Mitochondrial SNPs
    • Recent SNPs added to the dbSNP database
    • SNPs in recombination hotspots
• > 946,000 copy number probes (CNPs):
  – 202,000 probes targeting 5,677 CNV regions from the Toronto Database of Genomic Variants. Regions resolve into 3,182 distinct, non-overlapping segments; on average 61 probe sets per region
  – 744,000 probes, evenly spaced along the genome
Large increase in density
[Figure: number of loci by release year. Average distance between loci: 10K 294kb, 100K 26kb, 500K 6.0kb, 5.0 3.6kb, 6.0 1.6kb. Annotation: "4× further out…"]
History of SNP & CNP chips
Affymetrix & Illumina are competing

                                  10K        100K        500K       5.0        6.0
Released                          July 2003  April 2004  Sept 2005  Feb 2007   May 2007
# SNPs                            10,204     116,204     500,568    500,568    934,946
# CNPs                            -          -           -          340,742    946,371
# loci                            10,204     116,204     500,568    841,310    1,878,317
Distance                          294kb      25.8kb      6.0kb      3.6kb      1.6kb
Price / chip set                  65 USD     400 USD     260 USD    175 USD    300 USD
# loci / cup of espresso ($1.35)  116 loci   216 loci    1426 loci  3561 loci  4638 loci
Price source: Affymetrix Pricing Information, http://www.affymetrix.com/, September 2007.
Copy-number analysis with SNP arrays
(10K, 100K, 500K)
SNP chips can be used to determine copy number
Some sample figures based on a 250K SNP chip showing deletions and amplifications
Size = 424 kb, Number of SNPs = 118
Size = 168 kb, Number of SNPs = 55
Results using dChip and GLAD.
Our method:
CRMA (10K, 100K, 500K)
Copy-number estimation using Robust Multichip Analysis (CRMA)

                                       CRMA
Preprocessing (probe signals)          allelic crosstalk (or quantile)
Total CN                               PM = PMA + PMB
Summarization (SNP signals $\theta$)   log-additive, PM only
Post-processing                        fragment-length (GC-content)
Raw total CNs                          $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$; R = Reference, chip i, probe j
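The last row of the pipeline, the raw total copy number as a log-ratio against a reference, can be sketched as follows. The slide does not say how the reference R is formed; the sketch assumes it is the per-locus median across all samples, which is only one possible choice. The function name `raw_copy_numbers` and the input values are made up for illustration.

```python
import math
from statistics import median


def raw_copy_numbers(theta):
    """theta[i][j]: summarized signal for sample i at locus j.

    Returns M[i][j] = log2(theta[i][j] / theta_R[j]).
    ASSUMPTION: the reference theta_R[j] is the median over samples.
    """
    n_loci = len(theta[0])
    ref = [median(row[j] for row in theta) for j in range(n_loci)]
    return [[math.log2(row[j] / ref[j]) for j in range(n_loci)] for row in theta]


# Made-up signals for three samples at two loci.
theta = [
    [1000.0, 2000.0],
    [1000.0, 2000.0],
    [500.0, 3000.0],  # half the signal at locus 0, 1.5x at locus 1
]
M = raw_copy_numbers(theta)
for row in M:
    print([round(m, 3) for m in row])
```

A sample with half the reference signal gets M = -1, which is what one copy instead of two would give if signals were exactly proportional to copy number.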
Cross-hybridization:
Allele A: TCGGTAAGTACTC
Allele B: TCGGTATGTACTC
AA: PMA >> PMB
AB: PMA ≈ PMB
BB: PMA << PMB
[Figure labels: AA, TT, AT]
[Figures: PMT versus PMA; the offset is marked with +]
Crosstalk calibration corrects for differences in distributions too
[Figures: distributions of log2 PM]
Total CNs: PM = PMA + PMB
AA: PM = PMA + PMB
AB: PM = PMA + PMB
BB: PM = PMA + PMB
The log-additive model:
$\log_2(PM_{ijk}) = \log_2\theta_{ij} + \log_2\phi_{jk} + \varepsilon_{ijk}$
sample i, SNP j, probe k.
Fit using robust linear models (rlm)
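To make the log-additive model concrete, here is a small sketch that fits a sample effect and a probe effect for one SNP. Note the swap: the slide fits the model with robust linear models (rlm), whereas this sketch uses median polish, a simpler robust procedure for the same row-plus-column structure. The function name `median_polish` and the data are hypothetical.

```python
from statistics import median


def median_polish(y, n_iter=10):
    """Fit y[i][k] ~ overall + row[i] + col[k] by alternating median sweeps.

    y[i][k] is log2(PM) for sample i and probe k of one SNP.
    Returns (overall, row_effects, col_effects, residuals).
    """
    n, m = len(y), len(y[0])
    resid = [row[:] for row in y]
    row_eff, col_eff, overall = [0.0] * n, [0.0] * m, 0.0
    for _ in range(n_iter):
        for i in range(n):  # sweep row medians into the sample effects
            d = median(resid[i])
            row_eff[i] += d
            resid[i] = [r - d for r in resid[i]]
        d = median(col_eff)
        overall += d
        col_eff = [c - d for c in col_eff]
        for k in range(m):  # sweep column medians into the probe effects
            d = median(resid[i][k] for i in range(n))
            col_eff[k] += d
            for i in range(n):
                resid[i][k] -= d
        d = median(row_eff)
        overall += d
        row_eff = [r - d for r in row_eff]
    return overall, row_eff, col_eff, resid


# Made-up data: log2(theta) = 10, 11, 12 and log2(phi) = -1, 0, 1,
# plus one outlying probe signal (+5) in the first sample.
y = [[9.0, 10.0, 16.0],
     [10.0, 11.0, 12.0],
     [11.0, 12.0, 13.0]]
overall, row_eff, col_eff, resid = median_polish(y)
log2_theta = [overall + r for r in row_eff]
print(log2_theta, col_eff, resid[0][2])
```

The outlier ends up in the residuals instead of distorting the sample effects, which is the point of a robust fit.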
Longer fragments ⇒ less amplified by PCR ⇒ weaker SNP signals
[Figures: 100K and 500K]
Normalize to get the same fragment-length effect for all hybridizations
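One way to picture this normalization: estimate, per array, how the log2 signal depends on fragment length, remove that array-specific trend, and add back a common target trend. The sketch below makes two assumptions that are not from the slides: the trend is a straight line fitted by least squares (a stand-in for whatever smooth curve is used in practice), and the target is the average of the per-array fits. All names and numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ~ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def normalize_fragment_length(log_theta, frag_len):
    """log_theta[i][j]: log2 signal of array i at SNP j; frag_len[j] in bp.

    ASSUMPTION: linear trend per array; target = mean of the fitted lines.
    """
    fits = [fit_line(frag_len, row) for row in log_theta]
    t_int = sum(f[0] for f in fits) / len(fits)
    t_slope = sum(f[1] for f in fits) / len(fits)
    return [
        [y - (b0 + b1 * x) + (t_int + t_slope * x) for x, y in zip(frag_len, row)]
        for row, (b0, b1) in zip(log_theta, fits)
    ]


# Made-up example: two arrays with different fragment-length slopes.
frag_len = [200.0, 400.0, 600.0, 800.0]
log_theta = [
    [12.0 - 0.001 * x for x in frag_len],
    [12.0 - 0.003 * x for x in frag_len],
]
normalized = normalize_fragment_length(log_theta, frag_len)
print([round(v, 3) for v in normalized[0]])
print([round(v, 3) for v in normalized[1]])
```

After normalization both arrays show the same dependence on fragment length, so the effect cancels when taking log-ratios against a reference.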
Comparison (other methods)
Other methods

                                      CRMA                          dChip (Li & Wong 2001)  CNAG (Nannya et al 2005)     CNAT v4 (Affymetrix 2006)
Preprocessing (probe signals)         allelic crosstalk (quantile)  invariant-set           scale                        quantile
Total CNs                             PM=PMA+PMB                    PM=PMA+PMB, MM=MMA+MMB  PM=PMA+PMB                   $\theta = \theta_A + \theta_B$
Summarization (SNP signals $\theta$)  log-additive (PM-only)        multiplicative (PM-MM)  sum (PM-only)                log-additive (PM-only)
Post-processing                       fragment-length (GC-content)  -                       fragment-length, GC-content  fragment-length, GC-content
Raw total CNs (all four methods)      $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$
How well can we differentiate between one and two copies?
HapMap (CEU): Mapping250K Nsp data
30 males and 29 females (no children; one excl. female)
Chromosome X is known: Males (CN=1) & females (CN=2)
5,608 SNPs
Classification rule:
Mij < threshold ⇒ CNij = 1, otherwise CNij = 2.
Number of calls: 59 × 5,608 = 330,872
Calling samples for SNP_A-1920774
# males: 30
# females: 29
Call rule: If Mi < threshold, a male
Calling a male male:
# True-positives: 30, TP rate: 30/30 = 100%
Calling a female male:
# False-positives: 5, FP rate: 5/29 = 17%
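The call rule and the two rates can be written down directly. In the sketch below the log-ratios are invented so that the counts match the slide (30 of 30 males and 5 of 29 females fall below the threshold); they are not the actual values for SNP_A-1920774, and the function name `call_rates` is hypothetical.

```python
def call_rates(m_males, m_females, threshold):
    """Call a sample male when its log-ratio M is below the threshold.

    Returns (TP rate, FP rate): the fraction of males called male and
    the fraction of females called male.
    """
    tp = sum(m < threshold for m in m_males)
    fp = sum(m < threshold for m in m_females)
    return tp / len(m_males), fp / len(m_females)


# Invented log-ratios chosen only to reproduce the counts on the slide.
males = [-1.0] * 30
females = [-0.6] * 5 + [0.0] * 24
tp_rate, fp_rate = call_rates(males, females, threshold=-0.5)
print(f"TP rate: {tp_rate:.0%}, FP rate: {fp_rate:.0%}")
```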
Receiver Operating Characteristic (ROC)
[Figure: ROC curve. x-axis: FP rate (incorrectly calling females male); y-axis: TP rate (correctly calling males male). The curve is traced by increasing the threshold; the marked point is (17%, 100%).]
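An ROC curve of this kind is obtained by sweeping the threshold and recording the (FP rate, TP rate) pair at each step. The sketch below assumes the same call rule as before (male when M is below the threshold) and uses invented log-ratios; `roc_points` is a hypothetical helper name.

```python
def roc_points(m_males, m_females):
    """Return (FP rate, TP rate) pairs for increasing thresholds.

    Thresholds are the observed values plus +infinity, so the curve
    runs from (0, 0) to (1, 1).
    """
    thresholds = sorted(set(m_males) | set(m_females)) + [float("inf")]
    points = []
    for t in thresholds:
        tp = sum(m < t for m in m_males) / len(m_males)
        fp = sum(m < t for m in m_females) / len(m_females)
        points.append((fp, tp))
    return points


# Invented log-ratios for three males and three females.
males = [-1.2, -1.0, -0.8]
females = [-0.9, 0.0, 0.1]
points = roc_points(males, females)
for fp, tp in points:
    print(f"FP rate {fp:.2f}  TP rate {tp:.2f}")
```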
Single-SNP comparison: a random SNP
[Figure: ROC curves; x-axis: FP rate (incorrectly calling females male), y-axis: TP rate (correctly calling males male)]
Single-SNP comparison: a non-differentiating SNP
[Figure: ROC curves; same axes]
Performance of an average SNP with a common threshold
CRMA & dChip perform better for an average SNP (common threshold)
Number of calls: 59 individuals × 5,608 SNPs = 330,872
[Figure: ROC curves for CRMA, dChip, CNAG and CNAT; x-axis: FP rate (incorrectly calling females male), y-axis: TP rate (correctly calling males male). Zoom in: FP rate 0.00 to 0.15, TP rate 0.85 to 1.00.]
"Smoothing"
Average across SNPs in non-overlapping windows:
No averaging (R=1)
Averaging two and two (R=2)
Averaging three and three (R=3)
[Figure: averaged signals against the threshold; one point is marked "A false-positive (or real?!?)"]
Better detection rate when averaging (with risk of missing short regions)
[Figure: ROC curves for R=1 (no avg.), R=2, R=3, R=4]
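Averaging in non-overlapping windows of R consecutive SNPs is a one-liner. The sketch below assumes the log-ratios are already ordered along the chromosome and that a trailing incomplete window is dropped; the slides do not say how the tail is handled. `window_means` is a hypothetical name.

```python
def window_means(m, R):
    """Average log-ratios in non-overlapping windows of R consecutive SNPs.

    ASSUMPTION: a final window with fewer than R SNPs is discarded.
    """
    return [sum(m[s:s + R]) / R for s in range(0, len(m) - R + 1, R)]


# Invented log-ratios along a chromosome.
m = [1.0, 3.0, 5.0, 7.0, 9.0]
print(window_means(m, 1))  # no averaging
print(window_means(m, 2))  # two and two
print(window_means(m, 3))  # three and three
```

Larger R reduces noise but also reduces the number of data points, which is why short aberrant regions can be missed.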
CRMA does better than dChip
Control for FP rate: 1.0%

       CRMA    dChip
R=1    69.6%   63.1%
R=2    96.0%   93.8%
R=3    98.7%   98.0%
R=4    99.8%   99.6%
…      …       …
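Controlling for the FP rate means choosing the most permissive threshold whose FP rate does not exceed the limit and then reporting the TP rate at that threshold. A minimal sketch, assuming the call rule "male when M is below the threshold" and using invented data; `tp_at_fp` is a hypothetical helper.

```python
def tp_at_fp(m_males, m_females, max_fp=0.01):
    """Best TP rate among thresholds whose FP rate is at most max_fp."""
    best = 0.0
    for t in sorted(set(m_males) | set(m_females)) + [float("inf")]:
        fp = sum(m < t for m in m_females) / len(m_females)
        tp = sum(m < t for m in m_males) / len(m_males)
        if fp <= max_fp:
            best = max(best, tp)
    return best


# Invented data: 100 female values (two of them low) and 10 male values.
females = [-0.9, -0.6] + [0.0] * 98
males = [-1.0] * 8 + [-0.5] * 2
rate = tp_at_fp(males, females, max_fp=0.01)
print(f"TP rate at FP rate <= 1.0%: {rate:.1%}")
```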
Comparing methods by “resolution”, controlling for FP rate
@ FP rate: 1.0%
[Figure: curves for CRMA, dChip, CNAG and CNAT]
Early work:
CRMA 6 (SNP 5.0 & 6.0 chips)
CRMA with CN probes

                                       CRMA
Preprocessing (probe signals)          allelic crosstalk (or quantile)
Total CN                               SNPs: PM = PMA + PMB; CNs: PM
Summarization (SNP signals $\theta$)   single-array averaging
Post-processing                        fragment-length (GC-content)
Raw total CNs                          $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$; R = Reference, chip i, probe j
Allelic crosstalk calibration - incorporating CN probes
SNPs: For each allele pair in {AC, AG, AT, CG, CT, GT}:
1) Estimate crosstalk model:
   offset: $a_{SNP} = (a_A, a_B)$
   crosstalk matrix: $S = [S_{AA}, S_{AB}; S_{BA}, S_{BB}]$
2) Calibrate probe pairs {PM} = {(PMA, PMB)}:
   $PM' \leftarrow S^{-1}(PM - a_{SNP})$
3) Rescale {PM'A} and {PM'B} to have average 2200.
CN probes:
1) Calculate the average offset across all alleles:
   offset: $a_{CN} = \frac{1}{6} \sum_k w_k (a_A + a_B)/2$, with weights $w_k$ corresponding to the n's (above).
2) Calibrate CN probes {y}: $PM' \leftarrow PM - a_{CN}$
3) Rescale {PM'} to have average 2200.
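The calibration steps can be sketched in a few lines once an offset and a crosstalk matrix are available. The sketch skips step 1 (estimating the model) and simply takes the offset and S as given; the numeric values are invented, and the function names are hypothetical. It assumes the observed signals follow PM = a + S·x for true allele signals x.

```python
def calibrate_pair(pm_a, pm_b, offset, S):
    """Apply PM' <- S^-1 (PM - a) to one (PMA, PMB) probe pair."""
    (s_aa, s_ab), (s_ba, s_bb) = S
    det = s_aa * s_bb - s_ab * s_ba
    if det == 0:
        raise ValueError("crosstalk matrix is singular")
    x_a = pm_a - offset[0]
    x_b = pm_b - offset[1]
    # Explicit inverse of a 2x2 matrix applied to (x_a, x_b).
    return ((s_bb * x_a - s_ab * x_b) / det,
            (-s_ba * x_a + s_aa * x_b) / det)


def calibrate_cn(pm, a_cn):
    """CN probes have no allele pair: only the offset is removed."""
    return pm - a_cn


def rescale(values, target=2200.0):
    """Rescale signals so that their average equals the target."""
    scale = target / (sum(values) / len(values))
    return [v * scale for v in values]


# Invented model: offsets (100, 120) and 20-30% crosstalk.
offset = (100.0, 120.0)
S = ((1.0, 0.2), (0.3, 1.0))
# An AA sample with true signals (1000, 0) is observed with crosstalk.
obs_a = offset[0] + S[0][0] * 1000.0 + S[0][1] * 0.0
obs_b = offset[1] + S[1][0] * 1000.0 + S[1][1] * 0.0
cal_a, cal_b = calibrate_pair(obs_a, obs_b, offset, S)
rescaled = rescale([1000.0, 3000.0])
print(obs_a, obs_b, "->", round(cal_a, 6), round(cal_b, 6))
```

In this toy case the calibration recovers the true allele signals, removing the apparent B signal that crosstalk produced in an AA sample.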
Probe-level modelling (PLM)
SNPs:
* Technical replicates: PMA = (PMA1, PMA2, PMA3) and PMB = (PMB1, PMB2, PMB3)
  All should have the same probe affinities => No probe-affinity model(!)
* Suggestion:
  PMA = median {PMAk}
  PMB = median {PMBk}
  PM = PMA + PMB (compare to CN probes!)
  $\theta$ = PM
CN probes:
* (Mostly) single probe units, i.e. nothing much to do; $\theta$ = PM
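The suggested single-array summarization for SNPs is simple enough to state as code: take the median over the replicate probes of each allele, then sum the two alleles. The function name `snp_total_signal` and the probe values below are made up for illustration.

```python
from statistics import median


def snp_total_signal(pm_a_reps, pm_b_reps):
    """theta = median(PMA replicates) + median(PMB replicates)."""
    return median(pm_a_reps) + median(pm_b_reps)


# Invented replicate signals for one SNP on one array.
pm_a = [900.0, 1000.0, 1100.0]
pm_b = [150.0, 100.0, 120.0]
theta = snp_total_signal(pm_a, pm_b)
print(theta)
```

No other arrays enter the calculation, which is what distinguishes this from the multi-array log-additive fit used for the earlier chip types.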
CRMA with and without CN probes

                                        CRMA (SNP)                    CRMA6 (SNP & CN)
Preprocessing (probe signals)           allelic crosstalk (quantile)  allelic crosstalk (quantile)
Summarization (locus signals $\theta$)  Total CN: PM=PMA+PMB          Averaging SNPs: PMA = median{PMA}, PMB = median{PMB}
                                        log-additive (PM-only)        Total CN: PM = PMA + PMB
                                        $\theta$ = "chip effects"     $\theta$ = PM
Post-processing                         fragment-length (GC-content)  fragment-length (GC-content)
Raw total CNs (both)                    $M_{ij} = \log_2(\theta_{ij}/\theta_{Rj})$
Comparison across generations
(100K - 500K - 6.0)
Average-locus ROC: 100K → 500K → 6.0
100K: Hind240, Xba240 & both
500K: Nsp, Sty & both
6.0: SNP, CN probes & all
Resolution comparison - at 1.0% FP
At any given resolution (kb), we have:
TP(6.0, SNP) > TP(500K) > TP(100K)
Note, the differences may be due to lab effects (the HapMap 100K, 500K & 6.0 hybridizations were done in different years/labs). In either case, the trend is in the right direction.
Summary
Conclusions
• It helps to:
  – Control for allelic crosstalk.
  – Sum alleles at PM level: PM = PMA + PMB.
  – Control for fragment-length effects.
• Resolution: 6.0 (SNPs) > 500K > 100K (or lab effects).
• Currently estimates from CN probes are poor. Not unexpected. Better preprocessing might help.
Appendix
Density of TP rates when controlling for FP rate (5,608 SNPs)
FP rate: 1.0% (incorrectly calling females male)
[Figure: density of TP rates (correctly calling males male) for CRMA, dChip, CNAG and CNAT. Annotation: "CNAT: 10% poor SNPs"]
Effect of different normalization steps
[Figure panels: Allelic-crosstalk calibration + Fragment-length norm.; Allelic-crosstalk calibration; Quantile normalization]