Comparing CNV detection methods
Comparing CNV detection methods for SNP arrays
Laura Winchester, Christopher Yau and Jiannis Ragoussis
Abstract
Data from whole genome association studies can now be used for dual purposes: genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.
Keywords: copy number; SNP array
INTRODUCTION
Structural variation in the human genome has been intensely studied in recent years [1-5]. Publications have shown rare copy number variations (CNV) with a relationship to certain diseases, and much has also been done to study copy number polymorphisms (CNP) in the population, their contribution to structural variation and their possible association with complex disease. Multiple methods for the detection of these structural variants exist [6, 7], but we focus on methods designed to interpret results from SNP arrays.
The most prominent SNP array types are available from the commercial vendors Affymetrix and Illumina. Both companies sell competing arrays and continue to offer increased coverage for detecting copy number events and SNP assays simultaneously. Assay techniques for the arrays differ [8, 9], but the signal-intensity output from both platforms presents similar analysis and interpretation problems.
Successful application of these technologies has yielded a number of interesting individual CNVs with relationships to complex disease. For example, rare CNVs have been linked to schizophrenia [10] in a study where microdeletions and duplications were shown to be responsible for disrupting genes involved in neurodevelopment. The UGT2B17 gene on chromosome 4q13.2 was linked to osteoporosis in a case-control study of 727 CNV regions in a Chinese sample set [11].
One approach to copy number event detection has been to investigate common events. Studies such as that of McCarroll et al. [12] involved the characterization of deletion variations in the genome, while Redon et al. [2] have mapped the location of events found in multiple samples. Information about identified copy number events is recorded in databases such as the Database of Genomic Variants (DGV) [1]. Using the prior information about CNP location, we can investigate copy number events as we would use SNP information in genotyping. Known CNPs can be genotyped in case-control populations with methods similar to those of the SNP-based association study. With the diversity of approaches and analysis options, it is important to decide on the method most suited to the particular experimental needs. This review presents methods suggested for germline CNV analysis, including both CNP analysis and the detection of rare CNVs.
Laura Winchester is a DPhil student at Oxford University, where her research involves detection of copy number events in genetic disorders, in particular Specific Language Impairment.
Christopher Yau is a Postdoctoral Research Fellow in the Department of Statistics at Oxford University.
Jiannis Ragoussis is Head of Genomics at WTCHG. Interests: gene expression regulation in hypoxia and inflammation, genotyping and sequencing technology, identification of chromosomal aneuploidies and CNVs associated with disease.
Corresponding author: Jiannis Ragoussis, Genomics, Wellcome Trust Centre For Human Genetics, Roosevelt Drive, Oxford OX3 7BN, UK. Tel: (01865) 287526; Fax: (01865) 287501; E-mail: ioannisr@well.ox.ac.uk
BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS. page 1 of 14. doi:10.1093/bfgp/elp017
The Author 2009. Published by Oxford University Press. For permissions, please email: journals.permissions@oxfordjournals.org
Briefings in Functional Genomics and Proteomics Advance Access published September 8, 2009. Downloaded from http://bfg.oxfordjournals.org/ by guest on February 21, 2014.
CNV DISCOVERY AND DETECTION USING SNP CHIPS
The use of SNP arrays in copy number event detection has a number of advantages. As well as the two applications for the data, SNP genotyping and copy number analysis, there are other aspects that promote their use over other techniques. SNP arrays use less sample per experiment than other techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of the method. The SNP array is a cost-effective technique which allows the user to increase the number of samples tested on a limited budget. Although advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.
One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs are selected for genotyping arrays, certain factors are considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome because of the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy-Weinberg tests. For example, in a situation where the father is (A, B) and the mother is (B, -), with "-" denoting a deleted allele, the child could be (A, B) or (A, -) or (B, -). However, in SNP genotyping results the mother would appear to be called (B, B) and the child would be called either (A, B) or (A, A) or (B, B). If the child is really (A, -), then an (A, A) call would seem to violate Mendelian inheritance patterns and would often cause the SNP to be rejected.
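The inheritance example above can be traced in a few lines of Python. This is only an illustrative sketch: the `NULL` symbol and the function names are hypothetical, and the "apparent call" rule (a hemizygous sample is reported as homozygous) is an assumption standing in for a SNP-only genotype caller.

```python
from itertools import product

NULL = "-"  # hypothetical symbol for a deleted (null) allele


def apparent_call(genotype):
    """Call a SNP-only genotyper would report, ignoring null alleles (assumed rule)."""
    observed = [allele for allele in genotype if allele != NULL]
    if not observed:
        return None  # homozygous deletion: no call possible
    if len(observed) == 1:
        observed = observed * 2  # a hemizygous sample looks homozygous
    return tuple(sorted(observed))


def child_genotypes(father, mother):
    """Enumerate true child genotypes: one allele from each parent."""
    return {
        tuple(sorted(pair, key=lambda a: (a == NULL, a)))
        for pair in product(father, mother)
    }


def mendelian_consistent(child_call, father_call, mother_call):
    """Can the child's call be formed from one allele of each parental call?"""
    return any(
        tuple(sorted((f, m))) == child_call
        for f, m in product(father_call, mother_call)
    )


father, mother = ("A", "B"), ("B", NULL)
father_call, mother_call = apparent_call(father), apparent_call(mother)

# Map each true child genotype to (apparent call, passes inheritance check)
report = {}
for child in child_genotypes(father, mother):
    call = apparent_call(child)
    report[child] = (call, mendelian_consistent(call, father_call, mother_call))
```

Under these assumptions the (A, -) child is reported as (A, A), which cannot be formed from parents called (A, B) and (B, B), so the SNP would fail the inheritance check even though the underlying biology is consistent.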
Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning the final result may produce a noisy signal which, although it does not per se affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data is typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number, but at locations of common CNV this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.
Several of the new array assay selections have taken copy number detection into account; for example, Illumina includes 'unSNPable' genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with the aim of assessing SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques, the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.
ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as the software's quality checking and genotype-calling functions, it calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots. The log R ratio value is then calculated from the expected normalized intensity of a sample and the observed normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms.
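As a rough illustration of the two quantities just described, the sketch below computes a log R ratio as the base-2 log of observed over expected intensity, and a BAF by linear interpolation between the expected cluster positions. The interpolation scheme, the `theta` angle and the cluster-centre arguments are assumptions based on how these values are commonly described for Illumina data; the text above does not give the exact formulas used by BeadStudio.

```python
import math


def log_r_ratio(r_observed, r_expected):
    """Log2 ratio of observed to expected normalized intensity (assumed form)."""
    return math.log2(r_observed / r_expected)


def b_allele_frequency(theta, theta_aa, theta_ab, theta_bb):
    """Interpolate a sample's position between the expected AA, AB and BB clusters.

    theta is the sample's allelic angle; theta_aa < theta_ab < theta_bb are the
    expected cluster positions (all hypothetical inputs for this sketch).
    """
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)


# A probe at half the expected intensity sits one log2 unit below zero
halved = log_r_ratio(0.5, 1.0)
# A sample exactly on the heterozygote cluster has BAF 0.5
het = b_allele_frequency(0.5, 0.1, 0.5, 0.9)
```

In this picture a heterozygous deletion would typically show a negative log R ratio together with a loss of BAF values near 0.5, which is why the detection algorithms use the two values jointly.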
Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNV Partition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser, allowing the
user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).
AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated, and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experiment signal-intensity values against a reference set of data and evaluates copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events.
Tools for analysis of the different Affymetrix chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes;
Figure 1. BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for sample NA10861. Chromosome 22 is shown with an event at 23 999 142-24 239 255, confirmed by all statistics. CNV value produced by the CNV Partition algorithm.
data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).
HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION
Limitations of the copy number analyses available within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix, which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be learnt efficiently from data in both Bayesian and non-Bayesian frameworks, for example with the expectation maximization (EM) algorithm, whose computations rely on dynamic programming. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
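The decoding step described above can be sketched with a small Viterbi implementation. Everything numeric here is an assumption chosen for illustration (three states, their mean log R ratios, a shared standard deviation and a single "stay" probability); published tools such as QuantiSNP and PennCNV use richer state spaces and also model BAF.

```python
import math

STATES = ("loss", "normal", "gain")                  # hypothetical copy number states
MEANS = {"loss": -0.6, "normal": 0.0, "gain": 0.4}   # assumed mean log R ratio per state
SD = 0.2                                             # assumed emission standard deviation
STAY = 0.98                                          # assumed probability of keeping the same state


def log_gaussian(x, mean, sd):
    """Log density of a Gaussian emission distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_transition(prev, curr):
    """Log entry of the transition matrix: stay with STAY, else share the remainder."""
    if prev == curr:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Return the most probable state sequence for a list of log R ratios."""
    scores = {
        s: math.log(1 / len(STATES)) + log_gaussian(observations[0], MEANS[s], SD)
        for s in STATES
    }
    backpointers = []
    for x in observations[1:]:
        new_scores, pointers = {}, {}
        for curr in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + log_transition(p, curr))
            new_scores[curr] = (
                scores[best_prev]
                + log_transition(best_prev, curr)
                + log_gaussian(x, MEANS[curr], SD)
            )
            pointers[curr] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Made-up log R ratios with a four-probe dip in the middle
lrr = [0.02, -0.05, 0.01, -0.62, -0.58, -0.65, -0.60, 0.03, -0.01]
path = viterbi(lrr)
```

The high "stay" probability encodes the assumption that neighbouring loci share a state, so a single noisy probe is less likely to be called as an event than a run of consistent probes.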
Figure 2. Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by the CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.
With prior knowledge of modelling statistics, there is a multitude of options for copy number detection. HMMSeg [17] is a command-line algorithm designed to apply the HMM to genomic data. However, applying correct modelling procedures is not an obvious process for non-statisticians. For this reason, software has been developed which allows guided application of these types of advanced methods.
GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often causes false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run using command-line options or integrated into Illumina's BeadStudio as a plug-in, and both have unique features to recommend them.
The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the
Figure 3. Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by the PennCNV log file are also shown. NB: values shown relate to the Illumina 1M Duo array.
standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
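Ranking and filtering calls by their log Bayes factor, as described for QuantiSNP above, might look like the following sketch. The record layout (dictionaries with a `log_bf` field), the example values and the cut-off of 10 are all hypothetical; the actual QuantiSNP output format and a sensible threshold should be taken from the tool's own documentation and the dataset at hand.

```python
def rank_events(events, min_log_bf=10.0):
    """Keep events at or above the cut-off, strongest evidence first."""
    kept = [event for event in events if event["log_bf"] >= min_log_bf]
    return sorted(kept, key=lambda event: event["log_bf"], reverse=True)


# Made-up calls for illustration only
calls = [
    {"chrom": "22", "start": 1_000_000, "end": 1_250_000, "log_bf": 85.2},
    {"chrom": "4", "start": 40_000, "end": 48_000, "log_bf": 3.1},
    {"chrom": "1", "start": 500_000, "end": 520_000, "log_bf": 27.4},
]
ranked = rank_events(calls)
```

Because the score is attached to each event, the cut-off can be tightened or relaxed after the run without re-analysing the intensity data.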
PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. PennCNV also integrates a pipeline for Affymetrix data analysis. The package includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).
Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions, in particular the quality control options, are best suited to values generated on the Affymetrix platform. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses the HMM to analyse tumour samples.
APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm recursively divides a region into segments, stopping when it can no longer find a segment which differs from its neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single changed region to be defined inside a large segment: the segment ends are joined, forming a circle, to allow a further likelihood ratio test of whether the contents have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
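The segmentation idea can be illustrated with a deliberately simplified sketch. It searches for the arc of probes whose mean differs most from the remaining probes (the "circular" comparison), splits there if a z-like statistic exceeds a fixed threshold, and recurses. The fixed threshold and the caller-supplied `noise_sd` are simplifying assumptions; the published CBS method assesses significance differently, so this should not be read as that implementation.

```python
import math
from statistics import median


def best_arc(values, noise_sd):
    """Find the arc values[i:j] whose mean differs most from the probes outside it."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave some probes outside it
            inside = prefix[j] - prefix[i]
            outside = prefix[n] - inside
            diff = inside / k - outside / (n - k)
            stat = abs(diff) / (noise_sd * math.sqrt(1 / k + 1 / (n - k)))
            if stat > best[0]:
                best = (stat, i, j)
    return best


def segment(values, noise_sd, threshold=5.0, offset=0):
    """Recursively split until no arc exceeds the (assumed) threshold."""
    n = len(values)
    stat, i, j = best_arc(values, noise_sd)
    if stat < threshold:
        return [(offset, offset + n)]
    segments = []
    for a, b in ((0, i), (i, j), (j, n)):
        if b > a:
            segments.extend(segment(values[a:b], noise_sd, threshold, offset + a))
    return segments


# Made-up log ratios with a four-probe deletion-like dip
lrr = [0.0, 0.1, -0.1, 0.05, -0.05, -0.6, -0.7, -0.65, -0.6, 0.0, 0.1, -0.1, 0.05]
segments = segment(lrr, noise_sd=0.1)
# Cluster value per segment: the median log ratio of its probes
levels = [median(lrr[a:b]) for a, b in segments]
```

The exhaustive arc search is quadratic in the number of probes, which is acceptable for a toy example but is one reason practical implementations and faster variants put effort into speeding up this step.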
An alternative to the CBS algorithm was developed by Pique-Regi et al. [24] and can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear and we were able to analyse data within a few minutes. Many other algorithms have been developed that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.
CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly.
The ITALICS [26] software focuses on the removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regression to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.
COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability
to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and work flows, optimized computational speed and technical support. These factors are all extremely useful to labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, and so may lack access to the latest methods.
For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase the speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.
COMBINING COPY NUMBER PREDICTION AND GENOTYPING
The copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.
Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
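A deviation from HWE of the kind used as an indicator above can be measured with the standard chi-square statistic for a biallelic SNP. The sketch below shows that textbook test only; it is not TriTyper's maximum likelihood procedure, and the genotype counts are made up.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts in perfect equilibrium give a statistic of zero
balanced = hwe_chi_square(25, 50, 25)
# A deficit of heterozygotes, as a common null allele could produce, inflates it
deficit = hwe_chi_square(40, 20, 40)
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05, so the second example would be flagged; a hemizygous deletion called as a homozygote is one way such a heterozygote deficit can arise.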
COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.
Assessing Software
To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms; 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more confirmed events in total.
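The comparison described above, counting detected events that overlap a published event and expressing the count as a percentage of all detected events, can be sketched as follows. The tuple layout `(chromosome, start, end)`, the "any overlap" rule and the example coordinates are assumptions for illustration; they are not the exact matching criteria used to build Table 2.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share a chromosome and any position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def confirmed_events(detected, reference):
    """Count detected events overlapping any reference event; return (count, fraction)."""
    confirmed = [d for d in detected if any(overlaps(d, r) for r in reference)]
    fraction = len(confirmed) / len(detected) if detected else 0.0
    return len(confirmed), fraction


# Made-up coordinates: two of the four detected events match a published event
published = [("1", 1_000, 5_000), ("22", 20_000, 30_000)]
detected = [
    ("1", 4_500, 6_000),    # overlaps the first published event
    ("1", 8_000, 9_000),    # no match
    ("22", 10_000, 19_999), # ends just before the published event: no match
    ("22", 25_000, 26_000), # nested inside the second published event
]
count, fraction = confirmed_events(detected, published)
```

Reporting both the count and the fraction mirrors the point made above: an algorithm with a high confirmed percentage may still recover fewer confirmed events in absolute terms than one that makes many more calls.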
We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in
all comparisons and not unique to the algorithms we tested.
The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based
Table 1. Summary of SNP array detection algorithms
Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach, single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary, run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary, run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM, PC or Linux command line | Bayes factor score for events, flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions
Table 2. Comparison of algorithms
Algorithm | Platform and array | Total copy number events detected | Copy number events confirmed by Kidd et al. [28]
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%)
Detected events from CEPH sample NA12156 are compared to events published in the sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and data confirmed by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. The percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4. Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the 1M array Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with the base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects the most events of all three software packages, owing to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3. Overlap between events detected by SNP array algorithms using multiple publication data
Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%)
GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%)
PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%)
QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%)
Data from CEPH sample NA15510 on the 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with the base event in published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.
Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also shows an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found within one event called by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
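The reason two values arise for each comparison can be seen by counting overlaps in each direction separately, as in the sketch below. The `(chromosome, start, end)` tuples and coordinates are made up, and "any shared position" is an assumed overlap rule.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share a chromosome and any position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def directional_overlap(events_a, events_b):
    """Return (events in A overlapping B, events in B overlapping A)."""
    a_hits = sum(any(overlaps(a, b) for b in events_b) for a in events_a)
    b_hits = sum(any(overlaps(b, a) for a in events_a) for b in events_b)
    return a_hits, b_hits


# Algorithm A reports one large event; algorithm B splits the same region in three
algorithm_a = [("22", 100, 900)]
algorithm_b = [("22", 150, 200), ("22", 300, 400), ("22", 700, 800), ("3", 10, 50)]
a_hits, b_hits = directional_overlap(algorithm_a, algorithm_b)
```

Here one event from the first algorithm accounts for three events from the second, so the two directional counts differ even though both algorithms agree that the region is altered.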
During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries, or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
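As a hedged sketch of how two call sets for the same dataset might be combined, the snippet below pairs overlapping events, reports the region both algorithms agree on and the size of the boundary discrepancy, and lists events seen by only one algorithm as candidates for closer inspection. The function names and all coordinates are hypothetical and do not correspond to the output format of any of the packages discussed.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def consensus_events(first: List[Event], second: List[Event]) -> List[Dict]:
    """Pair overlapping events and report the shared region and boundary gaps."""
    paired = []
    for chrom_a, start_a, end_a in first:
        for chrom_b, start_b, end_b in second:
            if chrom_a != chrom_b:
                continue
            shared_start = max(start_a, start_b)
            shared_end = min(end_a, end_b)
            if shared_start <= shared_end:
                paired.append({
                    "chrom": chrom_a,
                    "shared": (shared_start, shared_end),
                    "start_diff": abs(start_a - start_b),
                    "end_diff": abs(end_a - end_b),
                })
    return paired


def unsupported_events(first: List[Event], second: List[Event]) -> List[Event]:
    """Events in `first` with no overlapping event in `second`."""
    return [
        a for a in first
        if not any(a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2] for b in second)
    ]


# Hypothetical output from two algorithms run on the same sample.
calls_one = [("chr22", 24000000, 24240000), ("chr3", 100000, 150000)]
calls_two = [("chr22", 24010000, 24230000), ("chr7", 500000, 600000)]

supported = consensus_events(calls_one, calls_two)
unsupported = unsupported_events(calls_one, calls_two)
print(supported, unsupported)
```

Events in `unsupported` are not necessarily false positives; as noted above, only separate biological replication can confirm or reject a call.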
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number events, it becomes
Table 4: Comparison of event numbers detected for a single sample (NA10861)
Algorithm                              Platform and array   Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye)    Affymetrix 6.0       137
CNAT (Genome Console 3.0.2)            Affymetrix 6.0       10
CNVPartition 1.2.1                     Illumina 1M Duo      16
GADA (R 0.7-5)                         Affymetrix 6.0       613
GADA (R 0.7-5)                         Illumina 1M Duo      87
Nexus Biodiscovery 4.0.1               Affymetrix 6.0       111
Nexus Biodiscovery 4.0.1               Illumina 1M Duo      8
PennCNV (2009Jan06)                    Affymetrix 6.0       67
PennCNV (2009Jan06)                    Illumina 1M Duo      43
QuantiSNP v2.0                         Affymetrix 6.0       193
QuantiSNP v1.1                         Illumina 1M Duo      60
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
page 10 of 14 Winchester et al by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
Table 5: Comparison of software event predictions
Column key: Redon = published results (Redon); Bird = Birdsuite; CNVP = CNVPartition; Nex = Nexus; Penn = PennCNV; QSNP = QuantiSNP; -A = Affymetrix; -I = Illumina
Software (rows)         Redon     Bird-A    CNAT-A    CNVP-I    GADA-A    GADA-I    Nex-A     Nex-I     Penn-A    Penn-I    QSNP-A    QSNP-I
Published data (Redon)  -         17 (4)    4 (40)    3 (19)    32 (5)    2 (2)     11 (10)   2 (25)    12 (18)   7 (16)    18 (9)    8 (13)
Birdsuite Affymetrix    17 (44)   -         9 (90)    13 (81)   135 (22)  21 (24)   62 (56)   6 (75)    43 (64)   20 (47)   97 (50)   20 (33)
CNAT Affymetrix         4 (10)    15 (4)    -         4 (25)    34 (6)    0         23 (21)   1 (13)    13 (19)   2 (5)     17 (9)    5 (8)
CNVPartition Illumina   3 (8)     16 (4)    4 (40)    -         37 (6)    7 (8)     20 (18)   7 (88)    9 (13)    11 (26)   16 (8)    16 (27)
GADA Affymetrix         17 (44)   106 (28)  9 (90)    13 (81)   -         32 (37)   91 (82)   7 (88)    58 (87)   23 (53)   153 (79)  27 (45)
GADA Illumina           2 (5)     96 (25)   0         13 (81)   208 (34)  -         25 (23)   2 (25)    26 (30)   17 (40)   67 (35)   23 (38)
Nexus Affymetrix        7 (18)    57 (15)   10 (100)  7 (44)    116 (19)  8 (9)     -         4 (50)    45 (67)   15 (35)   78 (40)   17 (28)
Nexus Illumina          2 (5)     6 (2)     1 (10)    7 (44)    22 (4)    2 (2)     4 (4)     -         6 (9)     7 (16)    10 (5)    9 (15)
PennCNV Affymetrix      11 (28)   51 (13)   10 (100)  9 (56)    105 (17)  10 (11)   65 (59)   6 (75)    -         19 (44)   71 (37)   21 (35)
PennCNV Illumina        6 (15)    25 (7)    2 (20)    11 (69)   44 (7)    9 (10)    23 (21)   6 (75)    18 (27)   -         26 (13)   28 (47)
QuantiSNP Affymetrix    14 (36)   97 (25)   10 (100)  10 (63)   199 (32)  18 (21)   86 (77)   7 (88)    65 (97)   21 (49)   -         24 (40)
QuantiSNP Illumina      6 (15)    14 (4)    5 (50)    15 (94)   55 (9)    10 (11)   30 (27)   8 (100)   23 (34)   32 (74)   31 (16)   -
Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore, overlap of events depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.
Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association
Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach remains a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
As is often noted in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.
FUNDING
J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
Key Points
- A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal-intensity data.
- Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
- After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
- An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
CNV DISCOVERY AND DETECTION USING SNP CHIPS
The use of SNP arrays in copy number event detection has a number of advantages. As well as the two applications for the data, SNP genotyping and copy number analysis, there are other aspects that promote their use over other techniques. SNP arrays use less sample per experiment than other techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of the method. The SNP array is a cost-effective technique which allows the user to increase the number of samples tested on a limited budget. Although advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.
One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs were selected for genotyping arrays, certain factors were considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome due to the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy–Weinberg tests. For example, in a situation where a father is (A, B) and the mother (B, -), where - denotes a deleted allele, the child could be (A, B) or (A, -) or (B, -). However, in SNP genotyping results the mother would appear to be called (B, B) and the child would be called either (A, B) or (A, A) or (B, B). If the child is really (A, -), then an (A, A) call would seem to violate Mendelian inheritance patterns and often cause the SNP to be rejected.
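The trio example above can be checked mechanically. The sketch below is illustrative only: the helper names `observed_call` and `mendelian_consistent` are hypothetical, "-" marks a deleted allele, and we assume the genotyping assay reports a single remaining allele as a homozygous call.

```python
from itertools import product


def observed_call(true_alleles):
    """Genotype a SNP assay would report when deleted alleles ('-') are invisible."""
    present = [a for a in true_alleles if a != "-"]
    if not present:
        return None  # homozygous deletion: no call can be made
    if len(present) == 1:
        present = present * 2  # hemizygous sample looks homozygous
    return tuple(sorted(present))


def mendelian_consistent(child, father, mother):
    """True if the child's call can be built from one allele of each parent call."""
    return any(tuple(sorted((f, m))) == child for f, m in product(father, mother))


father_obs = observed_call(("A", "B"))  # called (A, B)
mother_obs = observed_call(("B", "-"))  # called (B, B)
child_obs = observed_call(("A", "-"))   # called (A, A)

violation = not mendelian_consistent(child_obs, father_obs, mother_obs)
print(father_obs, mother_obs, child_obs, violation)
```

The child's true genotype is perfectly compatible with the parents, but the observed calls are not, so a standard inheritance filter would flag the SNP.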
Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning the final result may produce a noisy signal which, although it does not per se affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data is typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number, but at locations of common CNV this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.
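A small numerical sketch shows why this assumption matters. It is an idealized illustration, not any vendor's normalization routine: we assume a probe's expected intensity is taken as the median of a reference panel, that a one-copy deletion halves the signal, and all intensity values are invented.

```python
import math
from statistics import median


def log2_ratio(sample_intensity, reference_intensities):
    """Log2 ratio of a sample against the median of a reference panel."""
    return math.log2(sample_intensity / median(reference_intensities))


# Hypothetical reference panels for one probe (1.0 = two copies, 0.5 = one copy).
rare_deletion_panel = [1.0, 1.0, 1.0, 1.0, 0.5]    # deletion in 1 of 5 references
common_deletion_panel = [0.5, 0.5, 0.5, 1.0, 1.0]  # deletion in 3 of 5 references

deleted_vs_rare = log2_ratio(0.5, rare_deletion_panel)      # deletion is visible
deleted_vs_common = log2_ratio(0.5, common_deletion_panel)  # deletion is masked
normal_vs_common = log2_ratio(1.0, common_deletion_panel)   # looks like a gain
print(deleted_vs_rare, deleted_vs_common, normal_vs_common)
```

When most of the reference panel carries the deletion, a deleted sample appears normal and a two-copy sample appears to carry a gain, which is the failure described above for locations of common CNV.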
Several of the new array assay selections have taken copy number detection into account; for example, Illumina includes 'unSNPable' genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with the aim of assessing SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques, the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.
ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as the software's quality checking and genotype-calling functions, it calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots. The log R ratio value is then calculated from the expected normalized intensity of a sample and the observed normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms.
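The two quantities can be illustrated with a short sketch. It is written under stated assumptions rather than taken from BeadStudio: we assume the log R ratio is the base-2 logarithm of observed over expected intensity, and that BAF is a linear interpolation of the allelic angle theta between the expected AA, AB and BB cluster positions, as commonly described for Illumina data. The function names and cluster positions are hypothetical.

```python
import math


def log_r_ratio(r_observed: float, r_expected: float) -> float:
    """Log2 ratio of observed to expected normalized intensity."""
    return math.log2(r_observed / r_expected)


def b_allele_frequency(theta: float, theta_aa: float, theta_ab: float,
                       theta_bb: float) -> float:
    """Interpolate theta between the expected genotype cluster positions."""
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)


# Hypothetical cluster positions for one SNP.
AA, AB, BB = 0.1, 0.5, 0.9

lrr_loss = log_r_ratio(0.5, 1.0)               # halved intensity
baf_het = b_allele_frequency(0.5, AA, AB, BB)  # sits on the AB cluster
baf_shift = b_allele_frequency(0.3, AA, AB, BB)
print(lrr_loss, baf_het, baf_shift)
```

A negative log R ratio together with BAF values that avoid 0.5 is the kind of joint signal the detection algorithms below look for when calling a deletion.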
Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNVPartition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser, allowing the
user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).
AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated, and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experiment signal-intensity values against a reference set of data and evaluates copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events.
Tools for analysis of the different Affymetrix chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes;
Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for sample NA10861. Chromosome 22 shown with an event at 23 999 142–24 239 255, confirmed by all statistics. CNV value produced by CNVPartition algorithm.
data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).
HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION
Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix, which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be learnt efficiently from data in both Bayesian and non-Bayesian frameworks by using dynamic programming-based algorithms such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
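The decoding step described above can be sketched with a toy three-state model. This is not the model used by QuantiSNP, PennCNV or any other package: the state means, the common standard deviation, the probability of staying in a state and the uniform starting distribution are all assumed values chosen for illustration, and only the log R ratio is modelled.

```python
import math

STATES = [1, 2, 3]                 # copy number states (assumed)
MEANS = {1: -0.6, 2: 0.0, 3: 0.4}  # assumed mean log R ratio per state
SD = 0.2                           # assumed common emission standard deviation
STAY = 0.98                        # assumed probability of keeping the same state


def log_gauss(x, mean, sd):
    """Log density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_trans(prev, cur):
    """Log transition probability: stay with STAY, otherwise split the remainder."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Most probable copy number state sequence for a run of log R ratios."""
    score = {s: math.log(1 / len(STATES)) + log_gauss(observations[0], MEANS[s], SD)
             for s in STATES}
    back = []
    for x in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + log_trans(best_prev, s)
                            + log_gauss(x, MEANS[s], SD))
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


lrr = [0.02, -0.05, 0.01, -0.55, -0.62, -0.58, -0.60, 0.03, 0.00, -0.04]
path = viterbi(lrr)
single_outlier = viterbi([0.0, 0.0, -0.6, 0.0, 0.0])
print(path, single_outlier)
```

With these assumed parameters a run of four low probes is called as a one-copy region, whereas a single low probe is absorbed into the surrounding two-copy state, which illustrates the dependence between neighbouring loci.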
Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.
With prior knowledge of modelling statistics, there are a multitude of options for copy number detection. HMMSeg [17] is a command-line algorithm designed to apply HMMs to genomic data. However, application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons, software has been developed which allows guided application of these types of advanced methods.
GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often cause false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run from the command line or integrated into Illumina's BeadStudio as plug-ins, and both have unique features to recommend them.
The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the
Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by PennCNV log file also shown. NB: Values shown relate to Illumina 1M Duo array.
standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to change the format to be suitable for a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).
Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions, in particular the quality control options, are best suited to values generated on the Affymetrix platform. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses an HMM to analyse tumour samples.
APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm continues to divide a region into segments for as long as it finds a segment which differs from its neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single changed region to be defined inside a larger segment: segment ends are joined, forming a circle, to allow a likelihood ratio test of whether the enclosed arc and the remainder have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
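The idea of testing an arc against the rest of the region can be sketched as below. This is a simplified illustration and not the published CBS procedure [23]: the permutation-based significance test is replaced by a fixed threshold on a z-like statistic, the noise standard deviation is supplied as an assumed constant, and the names `best_arc` and `segment` and all parameter values are hypothetical.

```python
import math
from statistics import median


def best_arc(values, noise_sd, min_size):
    """Find the arc [i, j) whose mean differs most from the rest of the region."""
    n, total = len(values), sum(values)
    best_z, best_i, best_j = 0.0, None, None
    for i in range(n):
        for j in range(i + min_size, n + 1):
            inside, outside = j - i, n - (j - i)
            if outside < min_size:
                continue
            arc_sum = sum(values[i:j])
            diff = arc_sum / inside - (total - arc_sum) / outside
            z = abs(diff) / (noise_sd * math.sqrt(1 / inside + 1 / outside))
            if z > best_z:
                best_z, best_i, best_j = z, i, j
    return best_z, best_i, best_j


def segment(values, offset=0, noise_sd=0.2, threshold=4.0, min_size=2):
    """Return (start, end_exclusive, median_log_ratio) for each segment."""
    z, i, j = best_arc(values, noise_sd, min_size)
    if i is None or z < threshold:
        return [(offset, offset + len(values), median(values))]
    result = []
    for start, end in [(0, i), (i, j), (j, len(values))]:
        if end > start:  # skip empty flanks when the arc touches an edge
            result += segment(values[start:end], offset + start,
                              noise_sd, threshold, min_size)
    return result


# Hypothetical log ratios with a four-probe deletion in the middle.
log_ratios = [0.0, 0.1, -0.1, 0.0, -0.6, -0.5, -0.7, -0.6, 0.1, 0.0, -0.1, 0.0]
segments = segment(log_ratios)
print(segments)
```

In this example no single change-point separates the data strongly enough to pass the threshold, but the arc covering the four low probes does, which is the situation the circular extension was introduced to handle.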
An alternative to the CBS algorithm was developed by Pique-Regi et al. [24], which can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage in speed of data processing was clear, and we were able to analyse data within a few minutes.
There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.
CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly.
The ITALICS [26] software focuses analysis on the removal of unwanted events found in Affymetrix data. Rigaill et al. developed ITALICS (Iterative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regression to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.
COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability
to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no core bioinformatics support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU Public Licence, and so are prevented from having access to the latest methods.
For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.
COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.
Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy–Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
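To make the HWE indicator concrete, the sketch below computes the textbook chi-square statistic for deviation from Hardy–Weinberg proportions. It is not TriTyper's implementation; the function name and the genotype counts are hypothetical, and the counts are chosen to mimic a SNP inside a common deletion, where hemizygous samples are called as homozygotes.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts for 100 samples with allele frequency 0.5.
in_equilibrium = hwe_chi_square(25, 50, 25)
excess_homozygotes = hwe_chi_square(40, 20, 40)
print(in_equilibrium, excess_homozygotes)
```

A statistic well above 3.84, the 5% critical value for one degree of freedom, flags the SNP as deviating from HWE; under the approach described above, such a SNP becomes a candidate for triallelic genotyping instead of being discarded.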
COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.
Assessing Software
To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.
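The difference between ranking by proportion and ranking by count can be checked directly from the numbers in Table 2. The short sketch below uses only those published counts; the variable and function names are our own.

```python
# (algorithm, platform, total events detected, events confirmed by Kidd et al.)
# Values copied from Table 2 for sample NA12156.
TABLE2 = [
    ("Birdsuite", "Affymetrix", 386, 76),
    ("CNAT", "Affymetrix", 8, 2),
    ("GADA", "Affymetrix", 546, 128),
    ("GADA", "Illumina", 511, 157),
    ("PennCNV", "Affymetrix", 57, 28),
    ("PennCNV", "Illumina", 57, 21),
    ("QuantiSNP", "Affymetrix", 131, 53),
    ("QuantiSNP", "Illumina", 75, 23),
]


def proportion_confirmed(row):
    """Fraction of an algorithm's detected events that were confirmed."""
    _, _, total, confirmed = row
    return confirmed / total


by_proportion = max(TABLE2, key=proportion_confirmed)
by_count = max(TABLE2, key=lambda row: row[3])

print(by_proportion[:2], round(100 * proportion_confirmed(by_proportion)))
print(by_count[:2], by_count[3])
```

The two rankings pick different rows, which is why both the proportion and the absolute number of confirmed events are worth reporting.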
We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that, although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in
all comparisons and not unique to the algorithms we tested.
The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based
Table 1: Summary of SNP array detection algorithms

| Software | Platform | Related publication | Details | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach: single association of SNPs and CN | Availability limited to Affymetrix data |
| CNAT | Affymetrix | Technical notes | Proprietary, run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events) |
| CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary, run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events) |
| Dchip SNP | Affymetrix or Illumina | [22] | Stand-alone software | Free viewer for all data | Limited applications for Illumina data |
| GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker |
| HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific |
| ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format) |
| Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available |
| PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood |
| QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis |
| SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use |
| TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions |
Table 2: Comparison of algorithms

| Algorithm | Platform and array | Total copy number events detected | Copy number events confirmed by Kidd et al. [28] |
|---|---|---|---|
| Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%) |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%) |
| GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%) |
| GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%) |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%) |
| PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%) |
| QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%) |
| QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%) |

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

| Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%) |
| GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%) |
| PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%) |
| QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%) |

Data from CEPH sample NA15510 on the Illumina 1M array platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.
Similarities of events detected between different Software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
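The reason each comparison yields two values can be seen in a small example. The sketch below is illustrative only: the coordinates are invented, and the any-overlap rule follows the counting described for Table 5 rather than any particular software's implementation.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), coordinates inclusive


def overlaps(a: Event, b: Event) -> bool:
    """True if two events share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_found_in(query: List[Event], other: List[Event]) -> int:
    """Number of events in `query` that overlap at least one event in `other`."""
    return sum(any(overlaps(q, o) for o in other) for q in query)


# Hypothetical calls: one large event from algorithm A spans three
# smaller events from algorithm B; B also has one unmatched event.
algorithm_a = [("chr1", 1000, 9000)]
algorithm_b = [
    ("chr1", 1200, 2000),
    ("chr1", 3000, 4000),
    ("chr1", 8000, 8500),
    ("chr2", 500, 900),
]

a_in_b = count_found_in(algorithm_a, algorithm_b)
b_in_a = count_found_in(algorithm_b, algorithm_a)
```

Here one event from A is matched, while three events from B are matched, so the overlap count depends on which algorithm is treated as the query.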
During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that the different predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
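One simple way to act on this recommendation is to keep only events called by two algorithms and to record both the shared region and the widest extent, so that boundary discrepancies stay visible. The following is a sketch under our own assumptions: the `Consensus` record and the example coordinates are hypothetical, and no published tool is being reproduced.

```python
from typing import List, NamedTuple, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), coordinates inclusive


class Consensus(NamedTuple):
    chrom: str
    core_start: int   # region called by both algorithms
    core_end: int
    outer_start: int  # widest extent called by either algorithm
    outer_end: int


def consensus_events(calls_a: List[Event], calls_b: List[Event]) -> List[Consensus]:
    """Pair up overlapping calls from two algorithms and report both boundaries."""
    out = []
    for chrom_a, start_a, end_a in calls_a:
        for chrom_b, start_b, end_b in calls_b:
            if chrom_a == chrom_b and start_a <= end_b and start_b <= end_a:
                out.append(Consensus(
                    chrom_a,
                    max(start_a, start_b), min(end_a, end_b),
                    min(start_a, start_b), max(end_a, end_b),
                ))
    return out


# Hypothetical calls from two algorithms on the same sample.
first = [("chr22", 100, 500), ("chr3", 10, 20)]
second = [("chr22", 150, 650)]
shared = consensus_events(first, second)
```

The event on chr3 is dropped because only one algorithm called it; the gap between the core and outer coordinates on chr22 shows how far the two predicted breakpoints disagree.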
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number events, it becomes
Table 4: Comparison of event numbers detected for a single sample (NA10861)

| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5: Comparison of software event predictions

| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | - | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%) |
| Birdsuite Affymetrix | 17 (44%) | - | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%) |
| CNAT Affymetrix | 4 (10%) | 15 (4%) | - | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%) |
| CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | - | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%) |
| GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | - | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%) |
| GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | - | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%) |
| Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | - | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%) |
| Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | - | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%) |
| PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | - | 19 (44%) | 71 (37%) | 21 (35%) |
| PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | - | 26 (13%) | 28 (47%) |
| QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | - | 24 (40%) |
| QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | - |

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser UCSC (http://www.genome.ucsc.edu) and events can be compared to known copy number data in the DGV, such as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.
Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association
Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
As is much commented upon in literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.
FUNDING
J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
Key Points
- A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal-intensity data.
- Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
- After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
- An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).
AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experiment signal-intensity values against a reference set of data and evaluates copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events.
Tools for analysis of the different Affymetrix chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improves detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes;
Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for sample NA10861. Chromosome 22 is shown with an event at 23 999 142–24 239 255 confirmed by all statistics. CNV value produced by the CNVPartition algorithm.
data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).
HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION
Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix, which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data in both Bayesian and non-Bayesian frameworks by using dynamic programming-based algorithms such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
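The decoding step can be sketched in a few lines. This is a minimal illustration and not the implementation used by any of the packages reviewed here: the state means, the shared standard deviation and the `stay_prob` parameter are hypothetical values chosen for the example, the start distribution is assumed uniform, and the parameters are fixed rather than learnt by EM.

```python
import math
from typing import List, Sequence


def gaussian_logpdf(x: float, mean: float, sd: float) -> float:
    """Log density of a Gaussian emission distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs: Sequence[float], means: Sequence[float], sd: float,
            stay_prob: float) -> List[int]:
    """Most likely state path for a Gaussian-emission HMM (needs >= 2 states)."""
    k = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (k - 1))  # leave mass split evenly
    # score[s] = best log-probability of any path ending in state s
    score = [-math.log(k) + gaussian_logpdf(obs[0], m, sd) for m in means]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        new_score, pointers = [], []
        for s in range(k):
            best_prev = max(
                range(k),
                key=lambda p: score[p] + (log_stay if p == s else log_move),
            )
            trans = log_stay if best_prev == s else log_move
            new_score.append(score[best_prev] + trans + gaussian_logpdf(x, means[s], sd))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the most probable final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical log R ratio means for copy number 1, 2 and 3.
COPY_NUMBERS = [1, 2, 3]
log_r = [0.02, -0.05, -0.62, -0.58, -0.65, 0.01, 0.03]
states = viterbi(log_r, means=[-0.6, 0.0, 0.4], sd=0.15, stay_prob=0.99)
calls = [COPY_NUMBERS[s] for s in states]
```

The high `stay_prob` encodes the assumption that neighbouring loci share a copy number state, so a single noisy probe is less likely to be reported as an event than a run of consistently shifted probes.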
Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by the CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.
With prior knowledge of modelling statistics there is a multitude of options for copy number detection. HMMSeg [17] is a command line operated algorithm that is designed to apply an HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For this reason, software has been developed which allows guided application of these types of advanced methods.
GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user to apply it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often cause false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot average and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run using command line options or as a plug-in integrated into Illumina's BeadStudio, and have unique features to recommend them.
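A per-chromosome check of this kind can be sketched as follows. The outlier rule (distance from the median standard deviation, measured in median absolute deviations) and the `n_mads` threshold are our own assumptions for illustration and are not the criteria used by QuantiSNP or PennCNV. We also assume the BAF values passed in have already been restricted to a comparable set of SNPs, for example the heterozygous band.

```python
import statistics
from typing import Dict, List


def flag_outlier_chromosomes(baf_by_chrom: Dict[str, List[float]],
                             n_mads: float = 3.0) -> List[str]:
    """Return chromosomes whose BAF standard deviation is unusually far from the median."""
    sds = {c: statistics.pstdev(values) for c, values in baf_by_chrom.items()}
    centre = statistics.median(sds.values())
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = statistics.median(abs(sd - centre) for sd in sds.values())
    return sorted(c for c, sd in sds.items() if abs(sd - centre) > n_mads * mad)


# Hypothetical BAF values for four chromosomes of one sample.
baf = {
    "chr1": [0.48, 0.50, 0.52, 0.50],
    "chr2": [0.49, 0.50, 0.51, 0.50],
    "chr3": [0.47, 0.50, 0.53, 0.50],
    "chr4": [0.30, 0.70, 0.35, 0.65],
}
noisy = flag_outlier_chromosomes(baf)
```

The same function can be applied across samples instead of chromosomes by keying the dictionary on sample identifiers.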
The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the
Figure 3: Graphical representation of quality control data from the PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by the PennCNV log file are also shown. NB: values shown relate to the Illumina 1M Duo array.
standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
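Ranking events by log Bayes factor and applying a user-chosen cut-off is straightforward once the calls are in a table. The sketch below assumes a hypothetical `Call` record and an arbitrary cut-off of 10; neither reflects QuantiSNP's actual output format or a recommended threshold.

```python
from typing import List, NamedTuple


class Call(NamedTuple):
    chrom: str
    start: int
    end: int
    copy_number: int
    log_bayes_factor: float


def rank_calls(calls: List[Call], min_log_bf: float) -> List[Call]:
    """Keep calls at or above the cut-off, strongest evidence first."""
    kept = [c for c in calls if c.log_bayes_factor >= min_log_bf]
    return sorted(kept, key=lambda c: c.log_bayes_factor, reverse=True)


# Hypothetical calls for one sample.
calls = [
    Call("chr22", 23999142, 24239255, 1, 45.2),
    Call("chr4", 1000, 5000, 3, 3.1),
    Call("chr7", 200000, 260000, 1, 17.8),
]
ranked = rank_calls(calls, min_log_bf=10.0)
```

A suitable cut-off depends on the array and the noise level of the dataset, so it is best chosen after inspecting the distribution of scores for a few well characterized samples.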
PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based genome browser UCSC (http://www.genome.ucsc.edu).
Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to values generated on the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses an HMM to analyse tumour samples.
APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection were originally
developed for arrayCGH analysis but have been modified for SNP array
analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such
method. It was designed to convert noisy intensity values into regions of
equal copy number. The algorithm recursively divides a region into segments
for as long as it can find a segment whose mean differs significantly from
that of the neighbouring region. This change-point detection is designed to
identify all the places which partition the chromosome into segments of the
same copy number. An addition to the binary segmentation algorithm was made
to allow a single changed region inside a large segment to be defined: the
segment ends are joined to form a circle, which allows a likelihood ratio
test of whether the enclosed arc and the rest of the circle have different
means. Final segments are then given a cluster value, which is the median
log-ratio value of the probes within the region, and this value is used to
define the copy number status.
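The segmentation idea can be illustrated with a minimal sketch in Python. This is a simplified, non-circular binary segmentation and not the CBS implementation of Olshen et al.: it assumes a fixed score threshold in place of the permutation-based significance test, and the function names, the threshold and the log-ratio values are all hypothetical.

```python
import statistics


def best_split(values, min_len):
    """Return (index, score) of the split with the largest scaled mean difference."""
    n = len(values)
    total = sum(values)
    best_idx, best_score = None, 0.0
    for i in range(min_len, n - min_len + 1):
        left_sum = sum(values[:i])
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Scale the mean difference by the segment sizes (a t-like statistic
        # with the noise variance treated as constant).
        score = abs(left_mean - right_mean) * ((i * (n - i)) / n) ** 0.5
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score


def segment(values, start=0, threshold=0.3, min_len=2):
    """Recursively split until no split scores above the (assumed) threshold.

    Returns a list of (start, end, median_log_ratio) with end exclusive.
    """
    n = len(values)
    if n >= 2 * min_len:
        idx, score = best_split(values, min_len)
        if idx is not None and score > threshold:
            left = segment(values[:idx], start, threshold, min_len)
            right = segment(values[idx:], start + idx, threshold, min_len)
            return left + right
    # No acceptable change-point: report one segment with its median value.
    return [(start, start + n, statistics.median(values))]


# Hypothetical log-ratio values with a deletion-like dip in the middle.
log_r = [0.0, 0.1, -0.1, 0.05, -0.6, -0.7, -0.65, -0.55, 0.0, 0.05, -0.05, 0.1]
segments = segment(log_r)
for seg_start, seg_end, median in segments:
    print(seg_start, seg_end, round(median, 3))
```

Plain binary segmentation can miss a short altered region in the middle of a long segment, which is the case the circular modification described above was introduced to handle.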
An alternative to the CBS algorithm was developed by Pique-Regi et al. [24],
which can now be applied to SNP arrays. The Genome Alteration Detection
Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For
our testing we used a package designed for use in the R environment, with
helpful processing options and detailed instructions for Affymetrix and
Illumina data. The speed of data processing was a clear advantage, and we
were able to analyse data within a few minutes.
There are many other algorithms that could potentially be applied to SNP
array data. Other reviews [6, 25] focused on the arrayCGH format present
the reader with a variety of alternative options.
CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection
are common in the literature. SNP conditional mixture modelling (SCIMM),
developed by Cooper et al. [13], is based on the observation that samples
with deletions appear to have unique signal-intensity clusters. They applied
a mixture-likelihood clustering method within the R statistical package to
identify deletions. A secondary algorithm (SCIMM-Search) was developed to
help discover probes which detect copy number changes within an array
dataset. The algorithms require knowledge of modelling techniques to carry
out the analysis correctly.
The ITALICS [26] software focuses on the removal of unwanted events found
in Affymetrix data. Rigaill et al. developed ITALICS (Iterative and
Alternative normaLIsation and Copy number calling for affymetrix Snp arrays)
to remove probes with abnormal intensities. Each iteration of the algorithm
estimates the biological signal and then uses multiple linear regression to
estimate the non-linear effects on the signal. The algorithm can be run in R
and has the potential to analyse the Affymetrix Human Mapping 500K and
Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from
chip formats containing perfect match and mismatch probes.
COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available
to purchase lies in a number of traits: the ability
to combine data from other platforms for comparison, graphical user
interfaces, integrated pipelines for analysis and workflows, optimized
computational speed and technical support. These factors are all extremely
useful to labs with limited or no bioinformatics core support.
Unfortunately, commercial companies are limited in their use of some of the
methods developed in the academic environment. They are often prevented
from building user interfaces and other features around academic software
by restrictions imposed by free software licences such as the GNU General
Public License, and are thereby prevented from having access to the latest
methods.
For our own purposes we have chosen to look in detail at the Nexus
Biodiscovery software. This uses the rank segmentation approach for
detection, which is based on CBS but has been modified to increase the speed
of processing. It can be used for Affymetrix, arrayCGH or Illumina data and,
although weaker for Illumina event detection, is an extremely useful tool
for practically trained scientists.
COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked only at a
single aspect of the data. The Birdsuite set developed by Korn et al. [16]
combines SNP genotyping and copy number detection as well as independently
genotyping common CNPs. It uses four different methods to analyse an
Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and
Birdseed, which carries out SNP genotyping, are included in the Affymetrix
Genotyping Console. Birdseye is used to discover rare CNVs; it uses an HMM
to identify and assess previously unknown CNVs in the data. Fawkes is the
final stage of Birdsuite and merges the results from the other three stages.
Combining data in this way gives a more complete picture of structural
variation in a sample and allows the user to proceed with a single stage of
association analysis with increased coverage of the data. Korn et al.
compared their software to commercially available algorithms, including
Nexus, and report higher detection rates for Birdsuite.
Franke et al. [27] have also presented a combined approach which focuses on
single SNP interpretation. TriTyper uses maximum likelihood estimation to
detect deletions in Illumina SNP data in unrelated samples. It incorporates
an extra null allele into its genotyping clusters and uses deviations from
Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic
genotyping. It can also use neighbouring SNP data to impute the success of
the caller, which increases the accuracy of the output.
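To show why a deviation from HWE can point to a null allele, the sketch below computes the standard one-degree-of-freedom chi-square statistic from biallelic genotype counts. It is not TriTyper's implementation; the function name and the genotype counts are hypothetical. The assumption illustrated is that carriers of a deleted (null) allele tend to be called as homozygotes, producing a deficit of heterozygotes.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical SNP in equilibrium, and one with a heterozygote deficit.
in_equilibrium = hwe_chi_square(25, 50, 25)
het_deficit = hwe_chi_square(40, 20, 40)

# 3.84 is the 5% critical value of the chi-square distribution with 1 df.
for label, stat in (("equilibrium", in_equilibrium), ("deficit", het_deficit)):
    print(label, round(stat, 2), "deviates from HWE" if stat > 3.84 else "consistent with HWE")
```

A SNP flagged in this way would be a candidate for genotyping with an additional null-allele cluster rather than evidence of a deletion in itself.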
COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy
number event detection. Table 1 shows a summary of the software discussed in
this review. A number of these software packages have been tested during the
review and a brief synopsis of the results is presented here.
Assessing software
To assess the accuracy of the algorithms we compared our data to the results
from a well-characterized sample. The sample NA12156 is the basis for our
comparison (Table 2); it is from the HapMap collection and was sequenced for
structural variation by Kidd et al. [28]. We have chosen to record the
number of similar events between software and published data. We assume the
samples with low numbers of similar events have higher false positive rates;
however, we have not experimentally validated the results. While there is no
faultless software, we found that at least 20% of events were confirmed by
Kidd et al. for all algorithms. 27% of the overlapping detected events were
found by more than one algorithm (Supplementary Table 1). Although some
algorithms have a lower percentage of overlapping events, it is important to
consider the number of events found as well as the proportion: 49% of
PennCNV-detected events were confirmed, but other algorithms have actually
detected more in total.
We carried out a secondary comparison using the CEPH sample NA15510, which
has been characterized in a number of publications [2, 7, 28]. Table 3 shows
the variation of results between studies. Further investigation of event
replication across studies is represented in the Venn diagrams (Figure 4).
PennCNV and Illumina show similar patterns of overlap, although we note an
increased similarity between the Korbel et al. data and QuantiSNP output. We
conclude that, although we found a difference between detected events in our
data and published results, we found similar variation between different
publications, suggesting this is a problem in
all comparisons and not unique to the algorithms we tested.
The overlap of events between the tested software is below 50% in all cases.
We used default parameters for all our algorithms for ease of replication,
which means some algorithms were not run at their optimal level for our
data. We deliberately chose data which did not use an array-based
Table 1: Summary of SNP array detection algorithms
Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach; single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions
Table 2: Comparison of algorithms
Algorithm | Platform and array | Total copy number events detected | Copy number events confirmed by Kidd et al. [28]
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%)
Detected events from CEPH sample NA12156 are compared to events published in the sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and data confirmed by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. The percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3: Overlap between events detected by SNP array algorithms using multiple publication data
Algorithm | Total events found in NA15510 by algorithm | Copy number events (Kidd) [28] | Copy number events (Korbel) [7] | Copy number events (Redon) [2]
Events in paper | | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%)
GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%)
PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%)
QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%)
Data from CEPH sample NA15510 on the Illumina 1M array platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison to prevent a bias between Affymetrix
and Illumina, but in doing so we accepted an increase in the number of
differently detected events. Kidd et al. have shown similar data when
comparing studies and found only a 125 overlap of events larger than 5 kb
between their results and CN data generated by the Affymetrix 6.0 array.
Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available
algorithms to compare the similarity between event detection. In all cases
we found the academically developed software to be more sensitive and to
detect more events than proprietary algorithms (Table 4). The data also show
an increased number of events found from the sample using the Affymetrix SNP
6.0 array; we assume this reflects the increase in the number of CNP probes
on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results
for each comparison, counting the number of events overlapping for each
algorithm separately. The difference in values reflects cases where several
smaller events from one algorithm fall within a single event found by a
different algorithm. In general we found a higher number of overlapping
events between algorithms run on Affymetrix 6.0 array data. We expected low
resemblance between data generated on different platforms as a result of the
different probe sets; however, we are pleased to find some overlap. We have
included a comparison to events published by Redon et al. [2]; although the
study does not include a comprehensive list for this sample, it does show
that the algorithms are detecting confirmed events.
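The dependence of the overlap count on the direction of comparison can be made concrete with a short Python sketch. It assumes events are stored as (chromosome, start, end) tuples and that any shared base counts as an overlap; the event lists and function names are hypothetical and are not output from the software tested.

```python
def overlaps(event_1, event_2):
    """True if two (chrom, start, end) events share at least one position."""
    return (event_1[0] == event_2[0]
            and event_1[1] <= event_2[2]
            and event_2[1] <= event_1[2])


def count_overlapping(events_a, events_b):
    """Number of events in events_a that overlap any event in events_b."""
    return sum(1 for a in events_a if any(overlaps(a, b) for b in events_b))


# Hypothetical calls: algorithm B reports one large event where A reports two.
algorithm_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
algorithm_b = [("chr1", 150, 350), ("chr3", 10, 20)]

a_in_b = count_overlapping(algorithm_a, algorithm_b)
b_in_a = count_overlapping(algorithm_b, algorithm_a)
print(a_in_b, round(100 * a_in_b / len(algorithm_a)))  # counted from A's side
print(b_in_a, round(100 * b_in_a / len(algorithm_b)))  # counted from B's side
```

Two of the three events from the first list are matched, but only one of the two from the second, which is why each pairing is reported in both orientations.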
During our comparison we often saw a difference in the size of the predicted
event between algorithms (Figure 5). This was to be expected when using
different platforms, as probe locations vary, but was also seen when
analysing an identical dataset. This kind of effect can even be produced
simply by altering algorithm parameters and should be a consideration when
looking at the breakpoints of detected events. We found that the available
software tends to target and support one particular platform for analysis,
which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of
detection algorithms, but it is also important to take into account that
differing predictions can be informative in showing false positives caused
by noisy data and, conversely, that those in agreement are the strongest
candidates for events. Multiple predictions from different software for the
same event increase confidence in the data and give clearer indications of
the event boundaries or any discrepancy in this information. We would
recommend using a second algorithm on a single dataset to produce the most
informative results and to utilize the different advantages of each
software. We also suggest using software designed specifically for the
platform which generated the data, as several of the dual-use algorithms
have been shown to be weaker in one format. We have selected a range of
algorithms to discuss and test; the list in Table 1 is not exhaustive, only
an overview of some of the possibilities. It is also important to state
that, even using different algorithms, one cannot definitively confirm the
presence of a CN event without separate biological replication, and it is
unlikely that any list of events detected will contain all CNVs in a sample.
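One way to combine calls from two algorithms along these lines is sketched below in Python. It is an illustrative assumption rather than a published procedure: for every overlapping pair of calls it reports the shared core and the outer bounds, so that boundary discrepancies are visible, and it lists calls with no support from the second algorithm separately. All names and coordinates are hypothetical.

```python
def consensus(events_a, events_b):
    """Pair overlapping (chrom, start, end) calls from two algorithms.

    Returns (pairs, unsupported): one dict per overlapping pair giving the
    shared core and outer bounds, and the events_a calls with no partner.
    """
    pairs, unsupported = [], []
    for chrom, start, end in events_a:
        partners = [b for b in events_b
                    if b[0] == chrom and start <= b[2] and b[1] <= end]
        if not partners:
            unsupported.append((chrom, start, end))
        for _, b_start, b_end in partners:
            pairs.append({
                "chrom": chrom,
                "core": (max(start, b_start), min(end, b_end)),
                "outer": (min(start, b_start), max(end, b_end)),
            })
    return pairs, unsupported


# Hypothetical calls from two algorithms run on the same sample.
calls_first = [("chr22", 1000, 5000), ("chr4", 200, 900)]
calls_second = [("chr22", 1500, 5600)]

pairs, unsupported = consensus(calls_first, calls_second)
print(pairs)
print(unsupported)
```

Unsupported calls are not necessarily false positives, so they are kept as a separate list for follow-up rather than discarded.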
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for
the detection of copy number events, it becomes
Table 4: Comparison of event numbers detected for a single sample (NA10861)
Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5: Comparison of software event predictions
Algorithm | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%)
Birdsuite Affymetrix | 17 (44%) | - | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%)
CNAT Affymetrix | 4 (10%) | 15 (4%) | - | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%)
CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | - | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%)
GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | - | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%)
GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | - | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%)
Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | - | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%)
Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | - | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%)
PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | - | 19 (44%) | 71 (37%) | 21 (35%)
PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | - | 26 (13%) | 28 (47%)
QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | - | 24 (40%)
QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | -
Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use this data. Initially
we are often interested in looking for novel events in certain genes or
regions. Tracks of events can be viewed in databases such as the web-based
UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared
to known copy number data in the DGV, such as displayed in Figure 3.
Importing several tracks of data into a browser simultaneously allows the
user to compare different result sets.
Analysis of multiple events per sample is a more complicated procedure.
Events and samples can be explored using pathway analysis tools to look for
interesting groups or combinations of events in different genes, but methods
of confirming the significance of an event are required. A number of
publications exist presenting ways of applying association study methods to
copy number data. Barnes et al. [29] developed an R package, CNVtools, which
allows the user to carry out case-control association
Figure 5: Image from the UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series of five
alternative modelling methods before recommending a likelihood ratio test
which combines CNV calling and association testing into a single model. This
method was designed to eliminate problems with signal noise, which is a
known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to
apply genome-wide family-based association studies to raw-intensity data.
The Birdsuite package includes a pipeline to prepare the data for PLINK
analysis. Other sources have suggested similar association study-based
strategies, but an agreed approach is still a subject of great discussion.
Calls have been made by authors such as Scherer et al. [31] to decide on a
single technique, but future decisions in the field will be extremely
enlightening.
As is much commented upon in the literature describing SNP association study
techniques, sample size and power of tests are major factors in a successful
study [32]. This must also be considered when analysing copy number data. As
we have discussed, there are a number of analysis options available for SNP
array CNV detection: pipelines to allow guided analysis and stand-alone
options for more flexible analysis. Some of these applications are platform
targeted, but we have found that the best outcome is given by using multiple
algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the
manuscript.
FUNDING
J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK
Medical Research Council Special Training Fellowship in Biomedical
Informatics (Ref No. G0701810).
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949-51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444-54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727-32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525-8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783-94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16-21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420-6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233-7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136-48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237-41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663-74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86-92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199-203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37-42.
Key Points
A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal-intensity data.
Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20-49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166-74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253-60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424-6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013-25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665-74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557-72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309-18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763-70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768-74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316-33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56-64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245-52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273-84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7-15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91-9.
data from these are summarized and then compared
to a reference set to produce the final call. Results
can be viewed in the Integrated Genome Browser
(IGB) (Figure 2).
HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION
Limitations of available copy number analyses within proprietary software
led to the use of other methods to analyse data. The HMM assumes that
observed intensities are related to an unobserved copy number state at each
locus via an emission distribution (often assumed to be Gaussian). The copy
number states are assumed to have a dependence structure such that
neighbouring loci are assumed to have similar copy number states.
Transitions between copy number states are determined by a transition
matrix, which describes the probability of moving from one state to another.
The probabilistic structure of the HMM allows parameters in the model to be
efficiently learnt from data in both Bayesian and non-Bayesian frameworks by
using algorithms such as expectation maximization (EM), which for HMMs
relies on dynamic programming. When applied to event detection, each copy
number possibility is assigned a state and the Viterbi algorithm is used to
predict the state for each observation value.
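The model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any of the packages reviewed: it assumes three copy number states with Gaussian emissions on the log-ratio scale, a single shared standard deviation and a uniform switching probability, and every parameter value below is hypothetical.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(observations, states, means, sd, stay_prob):
    """Most probable copy number state sequence for the observed intensities."""
    k = len(states)
    switch_prob = (1.0 - stay_prob) / (k - 1)
    # Transition matrix (log scale): neighbouring loci favour the same state.
    log_trans = [[math.log(stay_prob if i == j else switch_prob) for j in range(k)]
                 for i in range(k)]
    score = [math.log(1.0 / k) + gaussian_logpdf(observations[0], means[s], sd)
             for s in range(k)]
    back_pointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + gaussian_logpdf(x, means[j], sd))
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(k), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back_pointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return [states[s] for s in path]


# Hypothetical log-ratio values containing a four-probe deletion.
log_r = [0.02, -0.05, 0.01, -0.62, -0.58, -0.66, -0.60, 0.03, -0.01, 0.04]
calls = viterbi(log_r, states=[1, 2, 3], means=[-0.6, 0.0, 0.4], sd=0.15, stay_prob=0.95)
print(calls)
```

In practice the emission and transition parameters would be estimated from the data (for example with EM) rather than fixed by hand, and published tools also model the B allele frequency alongside the log ratio.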
Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by the CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.
With prior knowledge of modelling statistics there are a multitude of
options for copy number detection. HMMSeg [17] is a command line-operated
algorithm that is designed to apply an HMM to genomic data. Application of
correct modelling procedures is not an obvious process to non-statisticians.
For these reasons software has been developed which allows guided
application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often cause false positive predictions, and stringent checks at this stage are highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run using command line options or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it.
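
The kind of per-chromosome quality check described above can be sketched as follows. This is an illustration only and does not reproduce the QuantiSNP or PennCNV output formats: the BAF values (taken here to be heterozygous SNPs only), the chromosome names and the rule of flagging a standard deviation more than twice the median are all assumptions.

```python
from statistics import mean, median, pstdev


def baf_summary(baf_by_chrom):
    """Mean and standard deviation of BAF for each chromosome."""
    return {chrom: (mean(v), pstdev(v)) for chrom, v in baf_by_chrom.items()}


def flag_outliers(summary, factor=2.0):
    """Chromosomes whose BAF SD exceeds `factor` times the median SD (hypothetical rule)."""
    typical = median(sd for _, sd in summary.values())
    return sorted(chrom for chrom, (_, sd) in summary.items() if sd > factor * typical)


# Hypothetical BAF values at heterozygous SNPs for one sample.
baf = {
    "chr1": [0.48, 0.52, 0.50, 0.49, 0.51],
    "chr2": [0.50, 0.47, 0.53, 0.51, 0.49],
    "chr3": [0.49, 0.51, 0.50, 0.52, 0.48],
    "chr4": [0.30, 0.70, 0.45, 0.62, 0.38],
}
summary = baf_summary(baf)
noisy = flag_outliers(summary)
print(noisy)
```

In practice the thresholds recommended in each tool's own guidelines should take precedence over an ad hoc rule such as this one.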

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by PennCNV log file also shown. NB: Values shown relate to Illumina 1M Duo array.

The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the standard Illumina Infinium array: they can be used to process Affymetrix data and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
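
Ranking events by log Bayes factor and applying a user-chosen cut-off can be sketched as below. The record layout, field names, event coordinates and the cut-off of 10 are hypothetical and do not reflect the actual QuantiSNP output file.

```python
from dataclasses import dataclass


@dataclass
class Event:
    chrom: str
    start: int
    end: int
    copy_number: int
    log_bf: float  # log Bayes factor reported with the call


def rank_events(events, min_log_bf):
    """Keep events at or above the cut-off, strongest evidence first."""
    kept = [e for e in events if e.log_bf >= min_log_bf]
    return sorted(kept, key=lambda e: e.log_bf, reverse=True)


# Hypothetical calls for a single sample.
events = [
    Event("chr22", 17_000_000, 17_200_000, 1, 45.2),
    Event("chr4", 500_000, 510_000, 3, 3.1),
    Event("chr1", 1_000_000, 1_050_000, 1, 12.7),
]
ranked = rank_events(events, min_log_bf=10.0)
for e in ranked:
    print(e.chrom, e.start, e.end, e.copy_number, e.log_bf)
```

The appropriate cut-off depends on the array and the noise level of the sample, so it is best chosen after inspecting the distribution of scores in the dataset at hand.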

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to values generated on the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAYCGH
A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm recursively divides a region into segments, testing whether a candidate segment differs from the neighbouring region, and stops when no further significant change is found. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined: segment ends were joined, forming a circle, to allow a further likelihood ratio test of whether the content has different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
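
The idea of joining the segment ends into a circle, so that an event lying inside a larger segment can be found, is sketched below. This is a deliberately simplified stand-in and not the published CBS procedure: it uses a two-sample t-like statistic with a fixed, hypothetical threshold in place of the permutation-based significance test, and the log-ratio values are invented.

```python
import math
from statistics import median


def best_arc(values, min_len=2):
    """Find the arc [i, j) whose mean differs most from the rest of the circle."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    best = (0.0, 0, n)
    for i in range(n):
        arc_sum = arc_sq = 0.0
        for j in range(i + 1, n + 1):
            arc_sum += values[j - 1]
            arc_sq += values[j - 1] ** 2
            k = j - i
            if k < min_len or n - k < min_len:
                continue
            rest_sum, rest_sq = total - arc_sum, total_sq - arc_sq
            # Pooled within-group variance of the arc and the remainder.
            ss = (arc_sq - arc_sum ** 2 / k) + (rest_sq - rest_sum ** 2 / (n - k))
            var = max(ss / (n - 2), 1e-12)
            diff = abs(arc_sum / k - rest_sum / (n - k))
            stat = diff / math.sqrt(var * (1 / k + 1 / (n - k)))
            if stat > best[0]:
                best = (stat, i, j)
    return best


def segment(values, offset=0, threshold=5.0):
    """Recursively split until no arc exceeds the (hypothetical) threshold."""
    n = len(values)
    if n < 4:
        return [(offset, offset + n)]
    stat, i, j = best_arc(values)
    if stat < threshold:
        return [(offset, offset + n)]
    pieces = []
    for a, b in ((0, i), (i, j), (j, n)):
        if b > a:
            pieces += segment(values[a:b], offset + a, threshold)
    return pieces


# Hypothetical log-ratios: a deletion spanning probes 6 to 11.
log_ratio = [0.0, 0.1, -0.1, 0.0, 0.1, -0.1,
             -0.6, -0.5, -0.7, -0.6, -0.5, -0.7,
             0.0, 0.1, -0.1, 0.0, 0.1, -0.1]
segments = segment(log_ratio)
medians = [median(log_ratio[a:b]) for a, b in segments]
print(segments, medians)
```

A plain binary split of this example would compare one flank against the deletion plus the other flank, which dilutes the difference in means; comparing the inner arc with both flanks together is what lets the interior event stand out.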

An alternative to the CBS algorithm was developed by Pique-Regi et al. [24], which can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear, and we were able to analyse data within a few minutes.

There are many other algorithms developed that could potentially be applied to SNP array data. Other reviews [6, 25], focused on the arrayCGH format, present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to correctly carry out the analysis.

The ITALICS [26] software focuses analysis on the removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (Iterative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regression to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide array 5.0 and 6.0 formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and work flows, optimized computational speed and technical support. These factors are all extremely useful to those labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public Licence, and are thereby prevented from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs. This uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage on the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report the higher detection rates of Birdsuite.

Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
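
To illustrate why a deviation from HWE can hint at a hidden null allele, the sketch below applies a standard one-degree-of-freedom chi-square test to biallelic genotype counts. It is a textbook test shown under stated assumptions, not TriTyper's maximum likelihood procedure, and the genotype counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


CRITICAL_1DF_5PCT = 3.84  # chi-square critical value, 1 degree of freedom, alpha = 0.05

# Hypothetical counts: an excess of apparent homozygotes, as might arise
# when A/null and B/null individuals are called as AA and BB.
suspect = hwe_chi_square(60, 20, 20)
# Hypothetical counts that match Hardy-Weinberg proportions for p = 0.7.
balanced = hwe_chi_square(49, 42, 9)
print(suspect > CRITICAL_1DF_5PCT, balanced > CRITICAL_1DF_5PCT)
```

A SNP failing such a test is only a candidate: genotyping error and population structure also produce HWE deviations, which is why a dedicated model with an explicit null allele is needed to make the call.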

COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software
To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% for all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Table 1: Summary of SNP array detection algorithms

| Software | Platform | Related publication | Details | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach: single association of SNPs and CN | Availability limited to Affymetrix data |
| CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events) |
| CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events) |
| Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data |
| GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker |
| HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific |
| ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format) |
| Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available |
| PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood |
| QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis |
| SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use |
| TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions |

Table 2: Comparison of algorithms

| Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28] (%) |
|---|---|---|---|
| Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20) |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25) |
| GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23) |
| GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31) |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49) |
| PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37) |
| QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41) |
| QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31) |

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.

Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

| Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4) |
| GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19) |
| PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14) |
| QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13) |

Data from CEPH sample NA15510 on 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.

Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
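
The reason each comparison yields two values can be made concrete with a short sketch. It assumes events are simple (chromosome, start, end) tuples and that any shared base counts as an overlap; the coordinates are hypothetical and this is not the script used to produce Table 5.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query, reference):
    """Number of events in `query` that overlap any event in `reference`."""
    return sum(any(overlaps(q, r) for r in reference) for q in query)


# Hypothetical calls: algorithm A reports one large event, algorithm B
# reports three smaller events inside it plus one unrelated event.
calls_a = [("chr1", 1000, 9000)]
calls_b = [("chr1", 1500, 2000), ("chr1", 4000, 5000),
           ("chr1", 8000, 8500), ("chr2", 100, 200)]

a_in_b = count_overlapping(calls_a, calls_b)
b_in_a = count_overlapping(calls_b, calls_a)
print(a_in_b, b_in_a)
```

The count depends on which call set is taken as the query, so a single large prediction that covers several small ones gives different totals in the two orientations.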

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms
Comparison of events in a dataset is a good way of assessing accuracy of detection algorithms, but it is also important to take into account that the different predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and also to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.

Table 4: Comparison of event numbers detected for a single sample (NA10861)

| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.

Table 5: Comparison of software event predictions

| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | - | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13) |
| Birdsuite Affymetrix | 17 (44) | - | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33) |
| CNAT Affymetrix | 4 (10) | 15 (4) | - | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8) |
| CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | - | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27) |
| GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | - | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45) |
| GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | - | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38) |
| Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | - | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28) |
| Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | - | 6 (9) | 7 (16) | 10 (5) | 9 (15) |
| PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | - | 19 (44) | 71 (37) | 21 (35) |
| PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | - | 26 (13) | 28 (47) |
| QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | - | 24 (40) |
| QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | - |

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents number of events for software on horizontal axis found in the other software dataset; bracketed value shows percentage of events detected by same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.

FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number events, it becomes increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, such as displayed in Figure 5. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.

As is much commented upon in literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING
JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

Key Points
- A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
- Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
- After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20-49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
- An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.

References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949-51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444-54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727-32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525-8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783-94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16-21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420-6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233-7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136-48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237-41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663-74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86-92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199-203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37-42.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166-74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253-60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424-6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013-25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665-74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557-72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309-18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763-70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768-74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316-33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56-64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245-52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273-84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7-15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91-9.
With prior knowledge of modelling statistics, there are a multitude of options for copy number detection. HMMSeg [17] is a command-line algorithm designed to apply a hidden Markov model (HMM) to genomic data. Applying correct modelling procedures is not an obvious process for non-statisticians. For this reason, software has been developed which guides the user through the application of these advanced methods.
GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection from SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important for accurate event prediction. Data with high signal noise often cause false positive predictions, and stringent checks at this stage are highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot the average and standard deviation of the B allele frequency (BAF) by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data on the software's support website. Both can be run from the command line or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it.
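The per-chromosome quality check described above can be sketched as follows. This is a minimal illustration and not QuantiSNP or PennCNV code: the function name, the input dictionary of BAF standard deviations and the three-MAD cut-off are all assumptions made for the example, and the values are invented.

```python
from statistics import median


def flag_outlier_chromosomes(baf_sd_by_chrom, n_mads=3.0):
    """Return chromosomes whose BAF standard deviation lies far from the median.

    baf_sd_by_chrom: hypothetical mapping of chromosome name -> BAF standard deviation.
    n_mads: assumed cut-off, in median absolute deviations (MADs) from the median.
    """
    values = list(baf_sd_by_chrom.values())
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be called an outlier
    return sorted(
        chrom for chrom, v in baf_sd_by_chrom.items()
        if abs(v - centre) > n_mads * mad
    )


# Invented values for illustration only.
example = {"1": 0.040, "2": 0.042, "3": 0.041, "4": 0.090, "5": 0.039, "X": 0.120}
print(flag_outlier_chromosomes(example))
```

A robust statistic such as the MAD is used here so that the outlying chromosomes do not inflate the threshold that is meant to detect them; any real cut-off would need tuning for the array and sample preparation in use.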
Figure 3: Graphical representation of quality control data from the PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. The plot shows the BAF score for each chromosome from analysis of sample NA10861; chromosomes 4 and X are outliers. Values produced by the PennCNV log file are also shown. NB: values shown relate to the Illumina 1M Duo array.

The QuantiSNP output gives a log Bayes factor with each prediction, which allows the user to rank events in order of likelihood and set their own cut-off for acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different-sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the standard Illumina Infinium array: they can be used to process Affymetrix data and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
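Ranking events by log Bayes factor and applying a user-chosen cut-off can be sketched as below. The field names and the cut-off of 10 are hypothetical choices for illustration; they are not QuantiSNP's output columns or a recommended threshold.

```python
def rank_events(events, min_log_bf=10.0):
    """Keep events at or above an assumed log Bayes factor cut-off, strongest first."""
    kept = [e for e in events if e["log_bf"] >= min_log_bf]
    return sorted(kept, key=lambda e: e["log_bf"], reverse=True)


# Invented calls; the dictionary keys are hypothetical.
calls = [
    {"chrom": "4", "start": 1_200_000, "end": 1_260_000, "log_bf": 35.2},
    {"chrom": "X", "start": 500_000, "end": 510_000, "log_bf": 4.1},
    {"chrom": "7", "start": 9_000_000, "end": 9_050_000, "log_bf": 12.8},
]
for event in rank_events(calls):
    print(event["chrom"], event["log_bf"])
```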
PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. Using trio information in event prediction allows easier detection of events novel to probands. PennCNV also integrates a pipeline for Affymetrix data analysis. The package includes further options for analysing event results, such as a script to compare events with known gene libraries, or to convert the output into a format suitable for viewers such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).
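As an illustration of the kind of format conversion mentioned above, the sketch below writes event calls as a BED custom track of the sort the UCSC browser accepts. It is not the PennCNV script: the function name, the tuple layout and the assumption that input coordinates are 1-based and inclusive are all choices made for the example.

```python
def events_to_bed(events, track_name="cnv_calls"):
    """Format (chrom, start, end, label) tuples as a BED custom track.

    Assumes 1-based inclusive input coordinates; BED uses 0-based starts
    and exclusive ends, so only the start is shifted.
    """
    lines = [f'track name="{track_name}"']
    for chrom, start, end, label in events:
        lines.append(f"chr{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines)


# Invented event for illustration only.
print(events_to_bed([("4", 1001, 2000, "loss")]))
```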
Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces a loss of heterozygosity (LOH) score which can be plotted against chromosome, but its functions, in particular the quality control options, are best suited to values generated on the Affymetrix platform. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses the HMM to analyse tumour samples.
APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection were originally developed for array CGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm continues to divide a region into segments until it finds a segment which is different from the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm allows a single change inside a large segment to be defined: the segment ends are joined to form a circle, which permits a further likelihood ratio test of whether the enclosed content has a different mean. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
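The final step described above, in which each segment receives the median log-ratio of its probes, can be sketched as follows. The change-points are taken as given, so this is not an implementation of CBS itself, and the loss and gain thresholds of -0.2 and 0.2 are assumptions made purely for illustration.

```python
from statistics import median


def summarise_segments(log_ratios, breakpoints, loss=-0.2, gain=0.2):
    """Assign each segment its median log-ratio and a copy number status.

    breakpoints: indices at which a new segment starts (assumed already found).
    loss, gain: hypothetical thresholds on the segment median.
    """
    bounds = [0, *breakpoints, len(log_ratios)]
    summary = []
    for lo, hi in zip(bounds, bounds[1:]):
        value = median(log_ratios[lo:hi])
        if value <= loss:
            status = "loss"
        elif value >= gain:
            status = "gain"
        else:
            status = "neutral"
        summary.append((lo, hi, value, status))
    return summary


# Invented probe values for illustration only.
ratios = [0.01, -0.03, 0.02, -0.6, -0.55, -0.7, 0.0, 0.05, 0.4, 0.45, 0.5]
for segment in summarise_segments(ratios, [3, 6, 8]):
    print(segment)
```

The median is used rather than the mean so that a single noisy probe within a segment has little influence on the status assigned to it.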
An alternative to the CBS algorithm was developed by Pique-Regi et al. [24] and can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The speed of data processing was a clear advantage, and we were able to analyse data within a few minutes.

Many other algorithms have been developed that could potentially be applied to SNP array data. Other reviews [6, 25], focused on the array CGH format, present the reader with a variety of alternative options.
CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly.
The ITALICS [26] software focuses its analysis on the removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.
COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no bioinformatics core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, and are thereby prevented from having access to the latest methods.
For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase the speed of processing. It can be used for Affymetrix, array CGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.
COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software with commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.
Franke et al. [27] have also presented a combined approach, which focuses on single-SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data from unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
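The use of HWE deviation as a trigger can be illustrated with the standard chi-square statistic on biallelic genotype counts. This is a generic sketch rather than TriTyper's own procedure, and the function name and genotype counts are hypothetical. A null allele typically makes carriers appear homozygous, so an excess of apparent homozygotes inflates the statistic.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Invented counts: the first SNP matches HWE, the second lacks heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(40, 20, 40))
```

A value above about 3.84 corresponds to p < 0.05 for one degree of freedom; where such a threshold should sit in a genome-wide setting is a separate question that this sketch does not address.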
COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages were tested during the review, and a brief synopsis of the results is presented here.
Assessing Software
To assess the accuracy of the algorithms, we compared our data with the results for a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume that the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we found that at least 20% of events were confirmed by Kidd et al. for all algorithms. 27 of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV-detected events were confirmed, but other algorithms have actually detected more in total.
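The comparison of predicted events against a published list can be sketched as below. This is only an illustration of the overlap-by-coordinates idea: the tuple layout, function names and example coordinates are assumptions, and "overlap" is taken here to mean sharing at least one base on the same chromosome.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def confirmed_events(predicted, published):
    """Return (count, percentage) of predicted events overlapping a published event."""
    n_confirmed = sum(
        1 for p in predicted if any(overlaps(p, q) for q in published)
    )
    percent = (100.0 * n_confirmed / len(predicted)) if predicted else 0.0
    return n_confirmed, percent


# Invented coordinates for illustration only.
predicted = [("1", 100, 200), ("1", 500, 600), ("2", 100, 200)]
published = [("1", 150, 300), ("2", 500, 900)]
print(confirmed_events(predicted, published))

# The percentages in Table 2 follow the same arithmetic, e.g. PennCNV on
# Affymetrix: 28 confirmed of 57 detected.
print(round(100 * 28 / 57))  # -> 49
```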
We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events for the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based
Table 1: Summary of SNP array detection algorithms
Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach; single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand-alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions
Table 2: Comparison of algorithms
Algorithm | Platform and array | Total copy number events detected | Copy number events confirmed by Kidd et al. [28] (%)
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31)
Detected events from CEPH sample NA12156 are compared with events published in the sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and data confirmed by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. The percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three, owing to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3: Overlap between events detected by SNP array algorithms using multiple publication data
Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4)
GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19)
PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14)
QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13)
Data from CEPH sample NA15510 on the Illumina 1M array platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies, and found only a 125 overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.
Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity of event detection. In all cases we found the academically developed software to be more sensitive, detecting more events than the proprietary algorithms (Table 4). The data also show an increased number of events found in the sample using the Affymetrix SNP 6.0 array; we assume this reflects the larger number of CNP probes on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found within one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison with events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
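Why each comparison yields two values can be seen in a small sketch. The function names, tuple layout and coordinates below are invented for illustration; the point is only that counting overlaps from each side gives different totals when one algorithm reports a single large event where another reports several small ones.

```python
def events_overlap(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def directional_overlap(events_a, events_b):
    """Return (events of A found in B, events of B found in A)."""
    a_in_b = sum(1 for a in events_a if any(events_overlap(a, b) for b in events_b))
    b_in_a = sum(1 for b in events_b if any(events_overlap(b, a) for a in events_a))
    return a_in_b, b_in_a


# Hypothetical calls: one large event versus three fragments of it.
algorithm_a = [("1", 100, 900)]
algorithm_b = [("1", 150, 200), ("1", 300, 400), ("1", 700, 800), ("2", 100, 200)]
print(directional_overlap(algorithm_a, algorithm_b))  # -> (1, 3)
```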
During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters, and should be a consideration when looking at the breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries, or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software package. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test; the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
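The recommendation to run a second algorithm can be sketched as a simple consensus step. The names and data below are hypothetical: events reported by both callers are kept, and for each pair the shared "core" and the combined "span" are recorded so that any disagreement over the boundaries stays visible.

```python
def consensus_events(calls_a, calls_b):
    """Pair up overlapping (chrom, start, end) calls from two algorithms."""
    consensus = []
    for chrom_a, start_a, end_a in calls_a:
        for chrom_b, start_b, end_b in calls_b:
            if chrom_a == chrom_b and start_a <= end_b and start_b <= end_a:
                consensus.append({
                    "chrom": chrom_a,
                    "core": (max(start_a, start_b), min(end_a, end_b)),  # agreed by both
                    "span": (min(start_a, start_b), max(end_a, end_b)),  # widest extent
                })
    return consensus


# Invented calls for illustration only.
first = [("1", 100, 500), ("2", 100, 200)]
second = [("1", 300, 700)]
print(consensus_events(first, second))
```

Events that appear in only one list (here the chromosome 2 call) are the candidates for closer inspection as possible false positives; as noted above, agreement between algorithms still does not replace biological validation.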
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number events, it becomes
Table 4: Comparison of event numbers detected for a single sample (NA10861)
Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5: Comparison of software event predictions
Software | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13)
Birdsuite Affymetrix | 17 (44) | - | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33)
CNAT Affymetrix | 4 (10) | 15 (4) | - | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8)
CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | - | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27)
GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | - | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45)
GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | - | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38)
Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | - | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28)
Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | - | 6 (9) | 7 (16) | 10 (5) | 9 (15)
PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | - | 19 (44) | 71 (37) | 21 (35)
PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | - | 26 (13) | 28 (47)
QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | - | 24 (40)
QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | -
Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For overall algorithm totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; the overlap of events therefore depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found that the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use these data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared with known copy number data in the Database of Genomic Variants (DGV), as displayed in Figure 5. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.

Figure 5: Image from the UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
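For orientation, the simplest form of case-control test on a single CNV is a chi-square test on carrier counts, sketched below. This is a naive baseline and not the CNVtools method: it assumes carriers have already been called without error, which is exactly the assumption the combined calling-and-testing likelihood ratio model is designed to avoid. The function name and counts are invented.

```python
def carrier_chi_square(case_carriers, case_total, control_carriers, control_total):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 carrier table."""
    table = [
        [case_carriers, case_total - case_carriers],
        [control_carriers, control_total - control_carriers],
    ]
    n = case_total + control_total
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    if 0 in row_totals or 0 in col_totals:
        return 0.0  # no variation in one margin, so no evidence either way
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Invented counts: 30 of 100 cases and 10 of 100 controls carry the deletion.
print(carrier_chi_square(30, 100, 10, 100))
```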
As is much commented upon in the literature describing SNP association study techniques, sample size and the power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING
JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
Key Points
A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal-intensity data.
Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
standard Illumina Infinium array and can be used to
process Affymetrix data, and have proven accuracy
on Illumina GoldenGate data [20] where SNP
coverage is suitable.
PennCNV has a number of downstream analysis
options. Most important to highlight is the use of
family trio data in analysis [21]. The use of trio
information in event prediction allows easier detection of
events novel to probands. It also integrates a pipeline
for Affymetrix data analysis. The PennCNV package
also includes a number of options to allow further
analysis of event results, such as a script to compare
events to known gene libraries or to change the
format to suit a viewer such as BeadStudio's
Chromosome Browser or the web-based UCSC genome
browser (http://www.genome.ucsc.edu).
Dchip SNP [22] was originally developed for
Affymetrix data but has been modified to allow the
viewing of Illumina data. It produces an LOH score
which can be plotted against chromosome, but its
functions, in particular the quality control options,
are best suited to values generated on the Affymetrix
platform. The software also has options to carry out
paired analysis for cancer data; major copy proportion
analysis [22] uses an HMM to analyse tumour
samples.
APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection
were originally developed for array CGH analysis
but have been modified for SNP array analysis.
The Circular Binary Segmentation (CBS) [23] algorithm
is one such method. It was designed to convert
noisy intensity values into regions of equal copy
number. The algorithm recursively divides a region
into segments for as long as it finds a segment
whose mean differs from that of the neighbouring region.
This change-point detection is designed to identify all
the places which partition the chromosome into
segments of the same copy number. An addition to
the binary segmentation algorithm was made to
allow a single change inside a large segment to be
defined. Segment ends were joined, forming a
circle, to allow a further likelihood ratio test that
the contents have different means. Final segments are
then given a cluster value, which is the median
log-ratio value of the probes within the region, and this
value is used to define the copy number status.
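The recursion described above can be illustrated with a minimal Python sketch. It is not the CBS implementation of Olshen et al. [23]: it assumes a known noise standard deviation (`sigma`) and a fixed threshold in place of a formal significance test, it uses a simple exhaustive search, and the names `best_arc` and `segment` are hypothetical. Each step looks for the arc of probes whose mean log-ratio differs most from the remaining probes, splits there, and finally labels every segment with its median.

```python
from statistics import median


def best_arc(x, sigma):
    """Find the arc x[i:j] whose mean differs most from the remaining probes.

    Returns (statistic, i, j). Comparing an arc with everything outside it is
    equivalent to joining the two ends of the region into a circle.
    """
    n = len(x)
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            n_in, n_out = j - i, n - (j - i)
            if n_out == 0:
                continue  # the arc must leave some probes outside it
            arc_sum = prefix[j] - prefix[i]
            diff = arc_sum / n_in - (total - arc_sum) / n_out
            stat = abs(diff) / (sigma * (1.0 / n_in + 1.0 / n_out) ** 0.5)
            if stat > best[0]:
                best = (stat, i, j)
    return best


def segment(x, sigma, threshold=3.0, offset=0):
    """Split x recursively; return (start, end, median) tuples, end exclusive."""
    stat, i, j = best_arc(x, sigma)
    if stat <= threshold:
        return [(offset, offset + len(x), median(x))]
    segments = []
    for a, b in ((0, i), (i, j), (j, len(x))):
        if b > a:
            segments.extend(segment(x[a:b], sigma, threshold, offset + a))
    return segments


# Assumed toy data: 25 probes with a five-probe loss in the middle.
log_ratios = [0.0] * 10 + [-1.0] * 5 + [0.0] * 10
print(segment(log_ratios, sigma=0.2))
```

On this toy input the sketch returns three segments, with the middle one carrying a median log-ratio of -1.0. The exhaustive search is quadratic in the number of probes, so it is only suitable for illustration.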
An alternative to the CBS algorithm was developed
by Pique-Regi et al. [24], which can now be
applied to SNP arrays. The Genome Alteration
Detection Algorithm (GADA) uses sparse Bayesian
learning to predict CN changes. For our testing we
used a package designed for use in the R environment,
with helpful processing options and detailed instructions
for Affymetrix and Illumina data. The advantage
of the speed of data processing was clear, and we
were able to analyse data within a few minutes.
There are many other algorithms that
could potentially be applied to SNP array data.
Other reviews [6, 25] focused on the array CGH
format present the reader with a variety of alternative
options.
CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to
address CN event detection are common in the
literature. SNP conditional mixture modelling
(SCIMM), developed by Cooper et al. [13],
is based on the observation that samples with
deletions appear to have unique signal-intensity clusters.
They applied a mixture-likelihood clustering
method within the R statistical package to identify
deletions. A secondary algorithm (SCIMM-Search)
was developed to help discover probes which detect
copy number changes within an array dataset. The
algorithms require knowledge of modelling techniques
to correctly carry out the analysis.
The ITALICS [26] software focuses analysis on
removal of unwanted effects found in Affymetrix
data. Rigaill et al. developed ITALICS (ITerative
and Alternative normaLIsation and Copy number
calling for affymetrix Snp arrays) to remove probes
with abnormal intensities. Each iteration of the
algorithm estimates the biological signal and then
uses multiple linear regressions to estimate the
non-linear effects on the signal. The algorithm can be run
in R and has the potential to analyse the Affymetrix
Human Mapping 500K and Genome-Wide array 5.0 and
6.0 formats, but was designed to process data from
chip formats containing perfect match and mismatch
probes.
COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available
to purchase lies in a number of traits: the ability
to combine data from other platforms for comparison,
graphical user interfaces, integrated pipelines for
analysis and workflows, optimized computational
speed, and technical support. These factors are all
extremely useful to labs with limited or no
bioinformatics core support. Unfortunately, commercial
companies are limited in their use of some of the
methods developed in the academic environment.
They are often prevented from building user interfaces
and other features around academic software
by restrictions imposed by free software licences
such as the GNU General Public License, and so lack
access to the latest methods.
For our own purposes we have chosen to look in
detail at the Nexus Biodiscovery software. This uses
the rank segmentation approach for detection. This
approach is based on CBS but has been modified to
increase speed of processing. It can be used for
Affymetrix, array CGH or Illumina data and, although
weaker for Illumina event detection, is an extremely
useful tool for practically trained scientists.
COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus
far have looked only at a single aspect of the data.
The Birdsuite set developed by Korn et al. [16]
combines SNP genotyping and copy number detection,
as well as independently genotyping common
CNPs. It uses four different methods to analyse an
Affymetrix dataset. The Canary algorithm, which
genotypes common CNPs, and Birdseed, which
carries out SNP genotyping, are included in the
Affymetrix Genotyping Console. Birdseye is used
to discover rare CNVs; it uses an HMM to identify
and assess previously unknown CNVs in the
data. Fawkes is the final stage of Birdsuite; it
merges all the results from the other three stages.
Combining data in this way gives a more complete
picture of structural variation in a sample and allows
the user to proceed with a single stage of association
analysis with increased coverage of the data. Korn
et al. compared their software to commercially available
algorithms, including Nexus, and report
higher detection rates for Birdsuite.
Franke et al. [27] have also presented a combined
approach, which focuses on single SNP interpretation.
TriTyper uses maximum likelihood estimation
to detect deletions in Illumina SNP data in unrelated
samples. It incorporates an extra null allele into its
genotyping clusters and uses deviations from
Hardy-Weinberg equilibrium (HWE) as an indicator of
when to use triallelic genotyping. It can also use
neighbouring SNP data to impute the success of the
caller, which increases the accuracy of the output.
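The HWE check mentioned above can be written out as a short worked example. The sketch below is not TriTyper's code; it only shows a standard chi-square test for deviation from HWE at one biallelic SNP. The 3.84 cutoff (1 degree of freedom, P = 0.05), the flagging rule and the function names are assumptions made for illustration. A deficit of heterozygotes is the pattern expected when a null allele makes deletion carriers appear homozygous.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def needs_triallelic_calling(n_aa, n_ab, n_bb, critical=3.84):
    """Assumed rule: flag the SNP when the HWE deviation exceeds the cutoff."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > critical


print(hwe_chi_square(25, 50, 25))            # genotype counts in HWE
print(needs_triallelic_calling(40, 20, 40))  # heterozygote deficit
```

With 100 samples and equal allele frequencies, counts of 25/50/25 match the HWE expectation exactly, whereas 40/20/40 gives a statistic of 36 and would be flagged under the assumed cutoff.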
COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software
available for copy number event detection. Table 1
shows a summary of the software discussed in this
review. A number of these software packages have
been tested during the review, and a brief synopsis of
the results is presented here.
Assessing Software
To assess the accuracy of the algorithms we
compared our data to the results of a well-characterized
sample. The sample NA12156 is the basis for our
comparison (Table 2); it is from the HapMap collection
and was sequenced for structural variation by
Kidd et al. [28]. We have chosen to record the
number of similar events between software and
published data. We assume the samples with low
numbers of similar events have higher false positive rates;
however, we have not experimentally validated the
results. While there is no faultless software, we have
found that at least 20% of events were confirmed by
Kidd et al. in all algorithms. 27 of the overlapping
detected events were found by more than one
algorithm (Supplementary Table 1). Although some
algorithms have a lower percentage of overlapping
events, it is important to consider the number of
events found as well as the proportion: 49% of
PennCNV detected events were confirmed, but
other algorithms have actually detected more in
total.
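The bookkeeping behind the confirmed counts in Table 2 can be sketched in a few lines of Python. The sketch assumes events are (chromosome, start, end) tuples with inclusive coordinates and that any shared base counts as an overlap; the function names and the example coordinates are hypothetical and are not taken from the study data.

```python
def overlaps(a, b):
    """True when two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def percent_confirmed(detected, published):
    """Return (confirmed count, percentage of detected events confirmed)."""
    confirmed = [d for d in detected if any(overlaps(d, p) for p in published)]
    if not detected:
        return 0, 0.0
    return len(confirmed), 100.0 * len(confirmed) / len(detected)


# Hypothetical coordinates, for illustration only.
detected = [("chr1", 100, 200), ("chr1", 500, 600),
            ("chr2", 100, 200), ("chr3", 10, 20)]
published = [("chr1", 150, 300), ("chr2", 50, 99)]
print(percent_confirmed(detected, published))
```

The same arithmetic applied to the PennCNV Affymetrix row of Table 2 (28 confirmed events out of 57 detected) gives the 49% quoted in the text.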
We carried out a secondary comparison using the
CEPH sample NA15510, which has been characterized
in a number of publications [2, 7, 28]. Table 3
shows the variation of results between studies.
Further investigation of event replication across studies
is represented in the Venn diagrams (Figure 4).
PennCNV and Illumina show similar patterns of
overlap, although we note an increased similarity
between the Korbel et al. data and QuantiSNP
output. We conclude that although we found a
difference between detected events in our data and
published results, we found similar variation between
different publications, suggesting this is a problem in
all comparisons and not unique to the algorithms we
tested.
The overlap of algorithm events of the tested
software is below 50% for all cases. We used default
parameters for all our algorithms for ease of replication,
which means some algorithms were not run at
their optimal level for our data. We deliberately
chose data which did not use an array-based
Table 1 Summary of SNP array detection algorithms
Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach, single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary, run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary, run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM, PC or LINUX command line | Bayes factor score for events, flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions
Table 2 Comparison of algorithms
Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28] (percentage)
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31)
Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4 Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the 1M array Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3 Overlap between events detected by SNP array algorithms using multiple publication data
Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | - | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4)
GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19)
PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14)
QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13)
Data from CEPH sample NA15510 on the 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison, to prevent a
bias between Affymetrix and Illumina, but in doing
so we accepted an increase in the number of
differently detected events. Kidd et al. have shown similar
data when comparing studies and found only a
125 overlap of events larger than 5 kb between
their results and CN data generated by the Affymetrix
6.0 array.
Similarities of events detected between different Software
We chose to test a single sample (NA10861) on
a range of the available algorithms to compare the
similarity between event detection. In all cases we
found the academically developed software to be
more sensitive and to detect more events than
proprietary algorithms (Table 4). The data also show an
increased number of events found from the sample
using the Affymetrix SNP 6.0 array; we assume this
reflects the increase in the number of CNP probes
on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event
prediction. We show two results for each comparison,
counting the number of events overlapping for
each algorithm separately. The difference in values
represents the number of smaller events often found
in one event by a different algorithm. In general,
we found a higher number of overlapping events
between algorithms run on Affymetrix 6.0 array
data. We expected the low resemblance between
data generated on different platforms as a result of
the different probe sets; however, we are pleased to
find some overlap. We have included a comparison
to events published by Redon et al. [2]; although the
study does not include a comprehensive list for this
sample, it does show that the algorithms are detecting
confirmed events.
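Why each comparison yields two numbers can be seen in a small Python sketch. It assumes an any-overlap rule on hypothetical (chromosome, start, end) events; `count_found_in` is an illustrative name and is not part of any package discussed here.

```python
def count_found_in(events, other):
    """Count events that overlap at least one event in `other`."""
    def overlaps(a, b):
        return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
    return sum(1 for e in events if any(overlaps(e, o) for o in other))


# Hypothetical calls: algorithm A reports one large event, algorithm B
# reports three smaller events inside it plus one event elsewhere.
calls_a = [("chr1", 1000, 9000)]
calls_b = [("chr1", 1500, 2000), ("chr1", 3000, 3500),
           ("chr1", 8000, 8500), ("chr2", 100, 900)]
print(count_found_in(calls_a, calls_b))  # events of A seen by B
print(count_found_in(calls_b, calls_a))  # events of B seen by A
```

Here one event from A is matched while three events from B are matched, so the count depends on the orientation of the comparison.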
During our comparison we often saw a difference
in the size of the predicted event between algorithms
(Figure 5). This was to be expected when using
different platforms, as probe locations vary, but was
also seen when analysing an identical dataset. This
kind of effect can even be produced when simply
altering algorithm parameters and should be a
consideration when looking at breakpoints of detected
events. We found that the available software tends to
target and support one particular platform for
analysis, which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of
assessing the accuracy of detection algorithms, but it is
also important to take into account that differing
predictions can be informative in showing false
positives caused by noisy data and, conversely, that
those in agreement are the strongest candidates for
events. Multiple predictions from different software
for the same event increase confidence in the data
and give clearer indications of the event boundaries
or of any discrepancy in this information. We would
recommend using a second algorithm on a single
dataset to produce the most informative results and
to utilize the different advantages of each software.
We also suggest using software designed specifically
for the platform which generated the data, as several
of the dual-use algorithms have been shown to be
weaker in one format. We have selected a range of
algorithms to discuss and test; the list in Table 1 is
not exhaustive, only an overview of some of the
possibilities. It is also important to state that, even using
different algorithms, one cannot definitively confirm
the presence of a CN event without separate biological
replication, and it is unlikely that any list of events
detected will contain all CNVs in a sample.
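One way to act on this recommendation is to keep only events reported by both algorithms and to record how far their boundaries disagree. The Python sketch below assumes hypothetical (chromosome, start, end) calls and an any-overlap rule; it illustrates the idea and is not a feature of any package reviewed here.

```python
def replicated_events(calls_a, calls_b):
    """Pair overlapping calls; report the shared region and boundary gaps."""
    pairs = []
    for chrom_a, start_a, end_a in calls_a:
        for chrom_b, start_b, end_b in calls_b:
            if chrom_a == chrom_b and start_a <= end_b and start_b <= end_a:
                pairs.append({
                    "chrom": chrom_a,
                    "shared_start": max(start_a, start_b),
                    "shared_end": min(end_a, end_b),
                    "start_gap": abs(start_a - start_b),
                    "end_gap": abs(end_a - end_b),
                })
    return pairs


# Hypothetical calls from two algorithms on the same dataset.
calls_a = [("chr7", 1000, 5000), ("chr9", 200, 400)]
calls_b = [("chr7", 1200, 5600)]
for pair in replicated_events(calls_a, calls_b):
    print(pair)
```

Events that appear in only one list (the chr9 call in this example) are left out and remain candidates for false positives, while large boundary gaps mark breakpoints that deserve a closer look.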
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for
the detection of copy number events, it becomes
Table 4 Comparison of event numbers detected for a single sample (NA10861)
Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5 Comparison of software event predictions
Software | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13)
Birdsuite Affymetrix | 17 (44) | - | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33)
CNAT Affymetrix | 4 (10) | 15 (4) | - | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8)
CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | - | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27)
GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | - | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45)
GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | - | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38)
Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | - | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28)
Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | - | 6 (9) | 7 (16) | 10 (5) | 9 (15)
PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | - | 19 (44) | 71 (37) | 21 (35)
PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | - | 26 (13) | 28 (47)
QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | - | 24 (40)
QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | -
Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore, overlap of events depends on analysis orientation. Total value represents number of events for software on horizontal axis found in the other software dataset; bracketed value shows percentage of events detected by same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and
use this data. Initially we are often interested in
looking for novel events in certain genes or regions.
Tracks of events can be viewed in databases such as
the web-based UCSC genome browser
(http://www.genome.ucsc.edu), and events can be
compared to known copy number data in the DGV,
as displayed in Figure 3. Importing several
tracks of data into a browser simultaneously will
allow the user to compare different result sets.
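Event lists can be prepared for a genome browser as custom tracks. The Python sketch below writes events in BED format, which the UCSC browser accepts; it assumes the input coordinates are 1-based and inclusive (BED starts are 0-based), and the function name, track name and example event are hypothetical.

```python
def to_bed_track(events, name, description):
    """Format (chrom, start, end, label) events as a BED custom track."""
    lines = ['track name="%s" description="%s"' % (name, description)]
    for chrom, start, end, label in events:
        # BED uses a 0-based start and an exclusive end.
        lines.append("%s\t%d\t%d\t%s" % (chrom, start - 1, end, label))
    return "\n".join(lines) + "\n"


print(to_bed_track([("chr1", 1000, 2000, "loss")], "AlgorithmA", "demo calls"))
```

Writing one such track per algorithm and loading them together gives a side-by-side view of the kind shown in Figure 5.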
Analysis of multiple events per sample is a more
complicated procedure. Events and samples can
be explored using pathway analysis tools to look
for interesting groups or combinations of events in
different genes, but methods of confirming the
significance of an event are required. A number of
publications exist presenting ways of applying
association study methods to copy number data. Barnes
et al. [29] developed an R package, CNVtools, which
allows the user to carry out case-control association
Figure 5 Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication
tests a series of five alternative modelling methods
before recommending a likelihood ratio test
which combines CNV calling and association testing
into a single model. This method was designed
to eliminate problems with signal noise, which is a
known trait of SNP assay data. Ionita-Laza et al. [30]
suggested a method to apply genome-wide
family-based association studies on raw-intensity data. The
Birdsuite package includes a pipeline to prepare
the data for PLINK analysis. Other sources have
suggested similar association study-based strategies,
but an agreed approach is a subject of great discussion.
Calls have been made by authors such as
Scherer et al. [31] to decide on a single technique,
and future decisions in the field will be extremely
enlightening.
As is much commented upon in literature
describing SNP association study techniques,
sample size and power of tests are major factors in
a successful study [32]. This must also be considered
when analysing copy number data. As we have
discussed, there are a number of analysis options
available for SNP array CNV detection: pipelines to
allow guided analysis, and stand-alone options for
more flexible analysis. Some of these applications
are platform targeted, but we have found that the
best outcome is given by using multiple algorithms
and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at
http://bib.oxfordjournals.org
Acknowledgements
The authors thank Dr Helen Butler for her ideas and
contributions to the manuscript.
FUNDING
JR and LW are funded by Wellcome Trust Grants.
CY is funded by a UK Medical Research Council
Special Training Fellowship in Biomedical
Informatics (Ref No. G0701810).
23 Olshen AB Venkatraman ES Lucito R Wigler M Circularbinary segmentation for the analysis of array-based DNAcopy number data Biostatistics 20045(4)557ndash72
24 Pique-Regi R Monso-Varona J Ortega A et al Sparserepresentation and Bayesian detection of genome copynumber alterations from microarray data Bioinformatics200824(3)309ndash18
25 Lai WR Johnson MD Kucherlapati R Park PJComparative analysis of algorithms for identifying amplifi-cations and deletions in array CGH data Bioinformatics 200521(19)3763ndash70
26 Rigaill G Hupe P Almeida A et al ITALICS analgorithm for normalization and DNA copy number callingfor Affymetrix SNP arrays Bioinformatics 200824(6)768ndash74
27 Franke L de Kovel CG Aulchenko YS et al Detectionimputation and association analysis of small deletions andnull alleles on oligonucleotide arrays AmJHumGenet 200882(6)1316ndash33
28 Kidd JM Cooper GM Donahue WF et al Mapping andsequencing of structural variation from eight human gen-omes Nature 2008453(7191)56ndash64
29 Barnes C Plagnol V Fitzgerald T et al A robuststatistical method for case-control association testingwith copy number variation Nat Genet 200840(10)1245ndash52
30 Ionita-Laza I Perry GH Raby BA et al On the analysisof copy-number variations in genome-wide associationstudies a translation of the family-based association testGenet Epidemiol 200832(3)273ndash84
31 Scherer SW Lee C Birney E etal Challenges and standardsin integrating surveys of structural variation NatGenet 200739(7 Suppl)S7ndash15
32 Cardon LR Bell JI Association study designs for complexdiseases Nat Rev Genet 20012(2)91ndash9
page 14 of 14 Winchester et al by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no bioinformatics core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, which also keeps them from having access to the latest methods.
For our own purposes, we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase the speed of processing. It can be used for Affymetrix, array CGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.
COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked at only a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses an HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; it merges the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.
Franke et al. [27] have also presented a combined approach, which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
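
The HWE check mentioned above can be illustrated with the textbook chi-square test on genotype counts. This is only a sketch of the general idea, not TriTyper's implementation; the function name and the example counts are hypothetical. An excess of apparent homozygotes, as in the second example, is the pattern that a hidden null allele would tend to produce.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for deviation from HWE.

    Arguments are observed genotype counts for a biallelic SNP.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Allele frequency of A estimated from the observed genotypes.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts: the first SNP is in equilibrium, the second
# shows a deficit of heterozygotes.
in_equilibrium = hwe_chi_square(25, 50, 25)
het_deficit = hwe_chi_square(40, 20, 40)
```

A statistic above 3.84 corresponds to P < 0.05 with one degree of freedom; a SNP flagged in this way would be a candidate for triallelic genotyping.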
COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review, and a brief synopsis of the results is presented here.
Assessing Software
To assess the accuracy of the algorithms, we compared our data to the results for a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. for all algorithms. 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV-detected events were confirmed, but other algorithms have actually detected more in total.
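
The confirmation percentages quoted above can be reproduced with a simple interval comparison. The sketch below is illustrative only: it assumes events are (chromosome, start, end) tuples with inclusive coordinates and treats any shared base as an overlap, which may be looser than the start and end point comparison actually used for Table 2. All names and coordinates are hypothetical.

```python
def overlaps(event_a, event_b):
    """Return True if two (chrom, start, end) events share at least one base."""
    chrom_a, start_a, end_a = event_a
    chrom_b, start_b, end_b = event_b
    return chrom_a == chrom_b and start_a <= end_b and start_b <= end_a


def confirmed_events(detected, published):
    """Return the detected events that overlap any published event."""
    return [ev for ev in detected if any(overlaps(ev, ref) for ref in published)]


def confirmed_percentage(detected, published):
    """Percentage of detected events confirmed by the published list."""
    if not detected:
        return 0.0
    return 100.0 * len(confirmed_events(detected, published)) / len(detected)


# Hypothetical coordinates, for illustration only.
detected = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
published = [("chr1", 150, 300), ("chr2", 500, 900)]
n_confirmed = len(confirmed_events(detected, published))
pct = confirmed_percentage(detected, published)
```

The same arithmetic applied to Table 2 gives 28 confirmed out of 57 detected, roughly 49%, for PennCNV on the Affymetrix array.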
We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that, although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.
The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based
Table 1: Summary of SNP array detection algorithms
Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach: single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
dChip SNP | Affymetrix or Illumina | [22] | Stand-alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions
Table 2: Comparison of algorithms
Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28]
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%)
Detected events from CEPH sample NA12156 are compared to events published in the sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the 1M array Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3: Overlap between events detected by SNP array algorithms using multiple publication data
Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | - | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%)
GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%)
PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%)
QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%)
Data from CEPH sample NA15510 on the 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar
data when comparing studies and found only a
125 overlap of events larger than 5 kb between
their results and CN data generated by the Affymetrix
6.0 array.
Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases, we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
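
The reason each comparison yields two numbers can be seen in a small sketch. It assumes events are (chromosome, start, end) tuples and counts an event as shared if any part of it overlaps an event in the other list, as described for Table 5; the function names and coordinates are hypothetical.

```python
def overlaps(event_a, event_b):
    """Return True if two (chrom, start, end) events share at least one base."""
    return (event_a[0] == event_b[0]
            and event_a[1] <= event_b[2]
            and event_b[1] <= event_a[2])


def count_found_in(query, target):
    """Number of events in `query` that overlap at least one event in `target`."""
    return sum(1 for q in query if any(overlaps(q, t) for t in target))


# Hypothetical calls: algorithm A reports one large event, while
# algorithm B splits the same region into three smaller events.
calls_a = [("chr1", 1000, 9000)]
calls_b = [("chr1", 1000, 2000), ("chr1", 3000, 4000),
           ("chr1", 8000, 8500), ("chr2", 1, 10)]

a_in_b = count_found_in(calls_a, calls_b)  # events of A also seen by B
b_in_a = count_found_in(calls_b, calls_a)  # events of B also seen by A
```

Here one event from A matches three events from B, so the count depends on the direction of the comparison.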
During our comparison, we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
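
One way to act on this recommendation is to keep only the events supported by two algorithms and to record how far their boundaries disagree. The sketch below is one possible approach under simple assumptions, not a method taken from any of the packages discussed: events are (chromosome, start, end) tuples, and the names `consensus_events`, `core` and `outer` are hypothetical.

```python
def consensus_events(calls_a, calls_b):
    """Pair up overlapping events from two algorithms.

    For each overlapping pair, `core` is the region both algorithms
    agree on and `outer` is the widest span either of them reports.
    """
    paired = []
    for chrom_a, start_a, end_a in calls_a:
        for chrom_b, start_b, end_b in calls_b:
            if chrom_a == chrom_b and start_a <= end_b and start_b <= end_a:
                paired.append({
                    "chrom": chrom_a,
                    "core": (max(start_a, start_b), min(end_a, end_b)),
                    "outer": (min(start_a, start_b), max(end_a, end_b)),
                })
    return paired


# Hypothetical calls from two algorithms run on the same dataset.
calls_a = [("chr1", 100, 500), ("chr3", 10, 20)]
calls_b = [("chr1", 300, 700)]
supported = consensus_events(calls_a, calls_b)
```

Events that appear in only one list, such as the chr3 call here, are not discarded as false by this step; they are simply the ones with less support and would need separate validation.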
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number events, it becomes
Table 4: Comparison of event numbers detected for a single sample (NA10861)
Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5: Comparison of software event predictions
Compared with | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%)
Birdsuite Affymetrix | 17 (44%) | - | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%)
CNAT Affymetrix | 4 (10%) | 15 (4%) | - | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%)
CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | - | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%)
GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | - | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%)
GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | - | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%)
Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | - | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%)
Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | - | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%)
PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | - | 19 (44%) | 71 (37%) | 21 (35%)
PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | - | 26 (13%) | 28 (47%)
QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | - | 24 (40%)
QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | -
Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC Genome Browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.
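
Event lists can be prepared for a browser by writing them as a BED custom track, one track per algorithm. The sketch below is a minimal illustration: the function name and the example events are hypothetical, and the input coordinates are assumed to be 1-based and inclusive, so the start is shifted to the 0-based, half-open convention that BED uses.

```python
def to_bed_track(events, track_name):
    """Format (chrom, start, end, label) events as a BED custom track string."""
    lines = ['track name="%s" description="%s CNV calls"' % (track_name, track_name)]
    for chrom, start, end, label in sorted(events):
        # BED is 0-based and half-open; inputs are assumed 1-based inclusive.
        lines.append("%s\t%d\t%d\t%s" % (chrom, start - 1, end, label))
    return "\n".join(lines) + "\n"


# Hypothetical events for one algorithm.
track_text = to_bed_track([("chr1", 101, 200, "del1")], "PennCNV")
```

Writing one such block per algorithm into a single file gives a set of tracks that can be uploaded together and compared side by side.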
Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association
Figure 5: Image from the UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
As is much commented upon in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.
FUNDING
J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).
Key Points
A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
all comparisons and not unique to algorithms we
tested
The overlap of algorithm events of the tested soft-
ware is below 50 for all cases We used default
parameters for all our algorithms for ease of replica-
tion which means some algorithms were not run at
their optimal level for our data We deliberately
chose data which did not use an array-based
Table 1. Summary of SNP array detection algorithms
| Software | Platform | Related publication | Details | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach, single association of SNPs and CN | Availability limited to Affymetrix data |
| CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events) |
| CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events) |
| Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data |
| GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker |
| HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific |
| ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format) |
| Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available |
| PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood |
| QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events, flexibility of run parameters | Limited support for further event analysis |
| SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use |
| TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions |
Table 2. Comparison of algorithms
| Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28] |
|---|---|---|---|
| Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%) |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%) |
| GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%) |
| GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%) |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%) |
| PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%) |
| QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%) |
| QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%) |
Detected events from CEPH sample NA12156 are compared to events published in the sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and data confirmed by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. The percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
Figure 4. Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events overall, owing to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.
Table 3. Overlap between events detected by SNP array algorithms using multiple publication data
| Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%) |
| GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%) |
| PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%) |
| QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%) |
Data from CEPH sample NA15510 on the Illumina 1M array platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
technique for our NA12156 comparison to prevent a bias between Affymetrix and
Illumina, but in doing so we accepted an increase in the number of differently
detected events. Kidd et al. have shown similar data when comparing studies
and found only a 12.5% overlap of events larger than 5 kb between their
results and CN data generated by the Affymetrix 6.0 array.
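The comparison summarized in Table 2 can be illustrated with a short sketch. It assumes that each event is a (chromosome, start, end) tuple with closed coordinates and that any shared base on the same chromosome counts as confirmation. The function names and the toy coordinates are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), assumed closed coordinates


def overlaps(a: Event, b: Event) -> bool:
    """Return True if two events share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def confirmed_fraction(detected: List[Event], published: List[Event]) -> Tuple[int, float]:
    """Count detected events that overlap any published event.

    Returns the number of confirmed events and that number as a fraction
    of all detected events (0.0 when nothing was detected).
    """
    confirmed = sum(1 for d in detected if any(overlaps(d, p) for p in published))
    fraction = confirmed / len(detected) if detected else 0.0
    return confirmed, fraction


# Toy data, invented for illustration only.
detected = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr3", 10, 20)]
published = [("chr1", 150, 400), ("chr2", 300, 400)]

count, frac = confirmed_fraction(detected, published)
print(count, round(100 * frac))  # 1 25
```

A stricter criterion, such as a minimum shared length, would lower the confirmed fraction, so the overlap rule should be reported alongside any such percentage.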
Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available
algorithms to compare the similarity between event detection. In all cases we
found the academically developed software to be more sensitive and to detect
more events than the proprietary algorithms (Table 4). The data also show an
increased number of events found from the sample using the Affymetrix SNP 6.0
array; we assume this reflects the increase in the number of CNP probes on the
array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results
for each comparison, counting the number of events overlapping for each
algorithm separately. The difference in values reflects cases where several
smaller events called by one algorithm fall within a single event called by
another. In general we found a higher number of overlapping events between
algorithms run on Affymetrix 6.0 array data. We expected the low resemblance
between data generated on different platforms as a result of the different
probe sets; however, we are pleased to find some overlap. We have included a
comparison to events published by Redon et al. [2]; although the study does
not include a comprehensive list for this sample, it does show that the
algorithms are detecting confirmed events.
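The reason each comparison yields two values can be shown with a minimal sketch. It assumes events on a single chromosome given as (start, end) pairs with closed coordinates; the function name and the coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on a single chromosome, assumed closed


def count_found_in(query: List[Interval], reference: List[Interval]) -> int:
    """Number of query events that overlap at least one reference event."""
    return sum(
        1
        for q_start, q_end in query
        if any(q_start <= r_end and r_start <= q_end for r_start, r_end in reference)
    )


# Algorithm A reports one large event. Algorithm B splits the same region into
# three smaller events and calls one extra event elsewhere.
# Coordinates are invented for illustration only.
calls_a = [(1000, 9000)]
calls_b = [(1200, 2000), (3500, 4800), (7000, 8500), (20000, 21000)]

a_in_b = count_found_in(calls_a, calls_b)  # events of A found in B
b_in_a = count_found_in(calls_b, calls_a)  # events of B found in A
print(a_in_b, b_in_a)  # 1 3
```

Counting in only one direction would hide this fragmentation, which is why the orientation of the comparison matters when reading an overlap table.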
During our comparison we often saw a difference in the size of the predicted
event between algorithms (Figure 5). This was to be expected when using
different platforms, as probe locations vary, but was also seen when analysing
an identical dataset. This kind of effect can even be produced simply by
altering algorithm parameters and should be a consideration when looking at
the breakpoints of detected events. We found that the available software tends
to target and support one particular platform for analysis, which
unfortunately can limit options.
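One way to quantify such size differences is to compare the boundaries of two calls for the same event. The sketch below is an illustration under stated assumptions: events are (start, end) pairs on the same chromosome with end greater than start, lengths are taken as end minus start, and the "reciprocal overlap" measure is a commonly used convention rather than a measure used in this comparison. All names and coordinates are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) on the same chromosome, with end > start


def boundary_difference(a: Interval, b: Interval) -> Tuple[int, int]:
    """Absolute difference between the two start and the two end coordinates."""
    return abs(a[0] - b[0]), abs(a[1] - b[1])


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Shared length divided by the length of the longer event (0.0 if disjoint)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return shared / max(a[1] - a[0], b[1] - b[0])


# Two hypothetical calls for the same deletion from different algorithms.
call_x = (10_000, 50_000)
call_y = (12_000, 45_000)

print(boundary_difference(call_x, call_y))  # (2000, 5000)
print(reciprocal_overlap(call_x, call_y))   # 0.825
```

Dividing by the longer event means the score is high only when both calls cover most of each other, so a small call nested inside a large one scores low.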
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of
detection algorithms, but it is also important to take into account that
differing predictions can be informative in showing false positives caused by
noisy data and, conversely, that those in agreement are the strongest
candidates for events. Multiple predictions from different software for the
same event increase confidence in the data and give clearer indications of the
event boundaries or any discrepancy in this information. We would recommend
using a second algorithm on a single dataset to produce the most informative
results and also to utilize the different advantages of each software. We also
suggest using software designed specifically for the platform which generated
the data, as several of the dual-use algorithms have been shown to be weaker
in one format. We have selected a range of algorithms to discuss and test, and
the list in Table 1 is not exhaustive, only an overview of some of the
possibilities. It is also important to state that, even using different
algorithms, one cannot definitively confirm the presence of a CN event without
separate biological replication, and it is unlikely that any list of detected
events will contain all CNVs in a sample.
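The two-algorithm recommendation can be sketched as a simple partition of one call set by support from a second. This is a minimal illustration, assuming events are (chromosome, start, end) tuples with closed coordinates and that any overlap counts as support; the function name and the data are hypothetical. Supported calls remain candidates only and would still require separate biological replication.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), assumed closed coordinates


def split_by_support(
    primary: List[Event], secondary: List[Event]
) -> Tuple[List[Event], List[Event]]:
    """Partition primary calls into those overlapped by a secondary call and the rest."""
    supported: List[Event] = []
    unsupported: List[Event] = []
    for event in primary:
        chrom, start, end = event
        hit = any(c == chrom and s <= end and start <= e for c, s, e in secondary)
        (supported if hit else unsupported).append(event)
    return supported, unsupported


# Invented calls from two hypothetical algorithms run on the same dataset.
calls_first = [("chr1", 1000, 5000), ("chr4", 200, 900), ("chr7", 40_000, 41_000)]
calls_second = [("chr1", 1500, 4500), ("chr7", 90_000, 95_000)]

supported, unsupported = split_by_support(calls_first, calls_second)
print(supported)    # [('chr1', 1000, 5000)]
print(unsupported)  # [('chr4', 200, 900), ('chr7', 40000, 41000)]
```

Unsupported calls are not necessarily false positives; they may reflect differences in sensitivity between the two algorithms and can be flagged for closer inspection rather than discarded.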
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number
events, it becomes
Table 4. Comparison of event numbers detected for a single sample (NA10861)
| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5. Comparison of software event predictions
| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%) |
| Birdsuite Affymetrix | 17 (44%) | | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%) |
| CNAT Affymetrix | 4 (10%) | 15 (4%) | | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%) |
| CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%) |
| GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%) |
| GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%) |
| Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%) |
| Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%) |
| PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | | 19 (44%) | 71 (37%) | 21 (35%) |
| PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | | 26 (13%) | 28 (47%) |
| QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | | 24 (40%) |
| QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | |
Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use this data. Initially we
are often interested in looking for novel events in certain genes or regions.
Tracks of events can be viewed in databases such as the web-based UCSC genome
browser (http://www.genome.ucsc.edu), and events can be compared to known copy
number data in the DGV, as displayed in Figure 3. Importing several tracks of
data into a browser simultaneously will allow the user to compare different
result sets.
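Preparing calls for display as a browser track can be sketched as follows. The example writes events in BED format, which the UCSC browser accepts for custom tracks, with one track per algorithm. It assumes the input events use 1-based inclusive coordinates, as many CNV callers report them; the function name, track name and event labels are hypothetical.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end, label), assumed 1-based inclusive coordinates
Event = Tuple[str, int, int, str]


def to_bed_track(track_name: str, events: Iterable[Event]) -> List[str]:
    """Return the lines of a BED custom track for one set of CNV calls."""
    lines = [f'track name="{track_name}" description="{track_name} CNV calls"']
    for chrom, start, end, label in events:
        # BED uses a 0-based start and an exclusive end, so only the start shifts.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return lines


# Invented calls for illustration only.
calls = [("chr1", 1000, 2000, "del_1"), ("chr5", 30_500, 42_000, "dup_1")]

for line in to_bed_track("algorithm_A", calls):
    print(line)
```

Writing one such track per algorithm, and loading them together, gives the side-by-side view described above.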
Analysis of multiple events per sample is a more complicated procedure. Events
and samples can be explored using pathway analysis tools to look for
interesting groups or combinations of events in different genes, but methods
of confirming the significance of an event are required. A number of
publications present ways of applying association study methods to copy number
data. Barnes et al. [29] developed an R package, CNVtools, which allows the
user to carry out case-control association
Figure 5. Image from the UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series of five
alternative modelling methods before recommending a likelihood ratio test
which combines CNV calling and association testing into a single model. This
method was designed to eliminate problems with signal noise, which is a known
trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply
genome-wide family-based association studies to raw-intensity data. The
Birdsuite package includes a pipeline to prepare the data for PLINK analysis.
Other sources have suggested similar association study-based strategies, but
an agreed approach is still a subject of great discussion. Calls have been
made by authors such as Scherer et al. [31] to decide on a single technique,
and future decisions in the field will be extremely enlightening.
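For orientation, the simplest form of a likelihood ratio test for case-control data is sketched below. This is not the CNVtools method: it applies a standard G-test to a 2x2 table of already-called CNV carriers and therefore ignores the calling uncertainty that the combined model of Barnes et al. was designed to handle. The function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def g_test_2x2(
    case_carriers: int, case_total: int, control_carriers: int, control_total: int
) -> Tuple[float, float]:
    """Likelihood ratio (G) statistic and 1-df chi-square p-value for a 2x2 table."""
    observed = [
        [case_carriers, case_total - case_carriers],
        [control_carriers, control_total - control_carriers],
    ]
    grand_total = case_total + control_total
    row_totals = [case_total, control_total]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]

    g = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / grand_total
                g += 2.0 * observed[i][j] * math.log(observed[i][j] / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error

    # Survival function of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


# Invented counts: 30 of 100 cases and 10 of 100 controls carry the CNV.
g_stat, p = g_test_2x2(30, 100, 10, 100)
print(f"G = {g_stat:.2f}, p = {p:.4g}")
```

Because the test treats every call as certain, noisy intensity data that produce different error rates in cases and controls can inflate the statistic, which is the problem a combined calling and association model is meant to address.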
As is often noted in the literature describing SNP association study
techniques, sample size and the power of tests are major factors in a
successful study [32]. This must also be considered when analysing copy number
data. As we have discussed, there are a number of analysis options available
for SNP array CNV detection: pipelines to allow guided analysis and
stand-alone options for more flexible analysis. Some of these applications are
platform targeted, but we have found that the best outcome is given by using
multiple algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.
Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the
manuscript.
FUNDING
JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical
Research Council Special Training Fellowship in Biomedical Informatics
(Ref No G0701810).
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949-51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444-54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727-32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525-8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783-94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16-21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420-6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233-7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136-48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237-41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663-74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86-92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199-203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37-42.
Key Points
A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20-49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166-74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253-60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424-6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013-25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665-74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557-72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309-18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763-70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768-74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316-33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56-64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245-52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273-84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7-15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91-9.
more flexible analysis Some of these applications
are platform targeted but we have found that the
best outcome is given by using multiple algorithms
and comparing data
SUPPLEMENTARYDATASupplementary data are available online at http
biboxfordjournalsorg
AcknowledgementsThe authors thank Dr Helen Butler for her ideas and contribu-
tions to the manuscript
FUNDINGJR and LW are funded by Wellcome Trust Grants
CY is funded by a UK Medical Research Council
Special Training Fellowship in Biomedical
Informatics (Ref No G0701810)
References1 Iafrate AJ Feuk L Rivera MN et al Detection of large-
scale variation in the human genome Nat Genet 200436(9)949ndash51
2 Redon R Ishikawa S Fitch KR et al Global variation incopy number in the human genome Nature 2006444(7118)444ndash54
3 Tuzun E Sharp AJ Bailey JA et al Fine-scale structuralvariation of the human genome Nat Genet 200537(7)727ndash32
4 Sebat J Lakshmi B Troge J et al Large-scale copy numberpolymorphism in the human genome Science 2004305(5683)525ndash8
5 de Smith AJ Tsalenko A Sampas N et al Array CGHanalysis of copy number variation identifies 1284 newgenes variant in healthy white males implications for asso-ciation studies of complex diseases Hum Mol Genet 200716(23)2783ndash94
6 Carter NP Methods and strategies for analyzing copynumber variation using DNA microarrays Nat Genet200739(7 Suppl)S16ndash21
7 Korbel JO Urban AE Affourtit JP et al Paired-end map-ping reveals extensive structural variation in the humangenome Science 2007318(5849)420ndash6
8 Kennedy GC Matsuzaki H Dong S etal Large-scale geno-typing of complex DNA NatBiotechnol 200321(10)1233ndash7
9 Peiffer DA Le JM Steemers FJ etal High-resolution geno-mic profiling of chromosomal aberrations using Infiniumwhole-genome genotyping Genome Res 200616(9)1136ndash48
10 International Schizophrenia Consortium Rare chromoso-mal deletions and duplications increase risk of schizophreniaNature 2008455(7210)237ndash41
11 Yang TL Chen XD Guo Y et al Genome-wide copy-number-variation study identified a susceptibility geneUGT2B17 for osteoporosis Am J Hum Genet 200883(6)663ndash74
12 McCarroll SA Hadnott TN Perry GH et al Commondeletion polymorphisms in the human genome Nat Genet200638(1)86ndash92
13 Cooper GM Zerr T Kidd JM et al Systematic assessmentof copy number variant detection via genome-wide SNPgenotyping Nat Genet 200840(10)1199ndash203
14 McCarroll SA Altshuler DM Copy-number variation andassociation studies of human disease Nat Genet 200739(7 Suppl)S37ndash42
Key Points Awide variety of software is available for CNVdetection from
data produced by SNP arrays This review seeks to discussoptions and statistical methods currently available for analysisof signal intensity data
Changes in assay selection techniques for SNP arrays havemadethemmore appealing for copynumber detection aswell as geno-typingTargeted probe design has made the SNP array a reliableand cheaper option for copy number analysis
After testing a selection of the available software comparisonswere performed using Hapmap samples and publishedcopy number data Of the events found in our data 20^49were replicated in previously published studies but the resultsclearly showed variation in data caused by differences inalgorithms
An important recommendation when choosing software foranalysis is the use of a second algorithm on a dataset to producemore informative results This enables the user to eliminatefalse positives not found by both software and increases confi-dence in replicated events
Comparing CNVdetection methods for SNParrays page 13 of 14 by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
15 McCarroll SA Kuruvilla FG Korn JM et al Integrateddetection and population-genetic analysis of SNPs andcopy number variation Nat Genet 200840(10)1166ndash74
16 Korn JM Kuruvilla FG McCarroll SA et al Integratedgenotype calling and association analysis of SNPscommon copy number polymorphisms and rare CNVsNat Genet 200840(10)1253ndash60
17 Day N Hemmaplardh A Thurman RE et al Unsupervisedsegmentation of continuous genomic data Bioinformatics200723(11)1424ndash6
18 Colella S Yau C Taylor JM etal QuantiSNP an objectiveBayes Hidden-Markov Model to detect and accurately mapcopy number variation using SNP genotyping data NucleicAcids Res 200735(6)2013ndash25
19 Wang K Li M Hadley D et al PennCNV an integratedhidden Markov model designed for high-resolution copynumber variation detection in whole-genome SNP geno-typing data Genome Res 200717(11)1665ndash74
20 Maestrini E Pagnamenta AT Lamb JA et al High-densitySNP association study and copy number variation analysisof the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility MolPsychiatry2009
21 Wang K Chen Z Tadesse MG et al Modeling geneticinheritance of copy number variations Nucleic Acids Res200836(21)e138
22 Li C Beroukhim R Weir BA et al Major copy propor-tion analysis of tumor samples using SNP arrays BMCBioinformatics 20089204
23 Olshen AB Venkatraman ES Lucito R Wigler M Circularbinary segmentation for the analysis of array-based DNAcopy number data Biostatistics 20045(4)557ndash72
24 Pique-Regi R Monso-Varona J Ortega A et al Sparserepresentation and Bayesian detection of genome copynumber alterations from microarray data Bioinformatics200824(3)309ndash18
25 Lai WR Johnson MD Kucherlapati R Park PJComparative analysis of algorithms for identifying amplifi-cations and deletions in array CGH data Bioinformatics 200521(19)3763ndash70
26 Rigaill G Hupe P Almeida A et al ITALICS analgorithm for normalization and DNA copy number callingfor Affymetrix SNP arrays Bioinformatics 200824(6)768ndash74
27 Franke L de Kovel CG Aulchenko YS et al Detectionimputation and association analysis of small deletions andnull alleles on oligonucleotide arrays AmJHumGenet 200882(6)1316ndash33
28 Kidd JM Cooper GM Donahue WF et al Mapping andsequencing of structural variation from eight human gen-omes Nature 2008453(7191)56ndash64
29 Barnes C Plagnol V Fitzgerald T et al A robuststatistical method for case-control association testingwith copy number variation Nat Genet 200840(10)1245ndash52
30 Ionita-Laza I Perry GH Raby BA et al On the analysisof copy-number variations in genome-wide associationstudies a translation of the family-based association testGenet Epidemiol 200832(3)273ndash84
31 Scherer SW Lee C Birney E etal Challenges and standardsin integrating surveys of structural variation NatGenet 200739(7 Suppl)S7ndash15
32 Cardon LR Bell JI Association study designs for complexdiseases Nat Rev Genet 20012(2)91ndash9
page 14 of 14 Winchester et al by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.
Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.
Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of overlapping events for each algorithm separately. The two values differ when one algorithm reports several smaller events inside a single event called by the other. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
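The two-way counting behind Table 5 can be illustrated with a short Python sketch. This is not the code used to produce the table; it only assumes, as the Table 5 legend states, that two events count as common when any part of their regions overlaps. The Event type, the function names and the example coordinates are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    """A predicted copy number event (hypothetical layout, inclusive coordinates)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Event, b: Event) -> bool:
    # Two events overlap if they share a chromosome and at least one base.
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def overlap_summary(query: List[Event], reference: List[Event]) -> Tuple[int, int]:
    """Count query events found in the reference set, plus that count
    as a rounded percentage of all query events."""
    found = sum(any(overlaps(q, r) for r in reference) for q in query)
    percent = round(100 * found / len(query)) if query else 0
    return found, percent


# Hypothetical calls: one large event from algorithm A is reported
# as two smaller events by algorithm B, which also calls one extra event.
algo_a = [Event("chr1", 100, 500)]
algo_b = [Event("chr1", 120, 200), Event("chr1", 300, 450), Event("chr2", 10, 50)]

print(overlap_summary(algo_a, algo_b))  # (1, 100)
print(overlap_summary(algo_b, algo_a))  # (2, 67)
```

The counts depend on which set is used as the query, which is why each comparison in Table 5 is reported in both orientations.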
During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.
Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and also to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
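As a rough illustration of this recommendation, the Python sketch below separates the calls of one algorithm into those supported by a second algorithm and those that are not, and reports how far the predicted boundaries disagree. The function names, tuple layout and coordinates are assumptions made for the example, and a supported event remains only a candidate until it is confirmed biologically.

```python
from typing import List, Tuple

# (chromosome, start, end); a hypothetical layout with inclusive coordinates
Call = Tuple[str, int, int]


def _partners(event: Call, other: List[Call]) -> List[Call]:
    # Calls in the other result set that share at least one base with the event.
    chrom, start, end = event
    return [(c, s, e) for c, s, e in other if c == chrom and s <= end and start <= e]


def split_by_support(primary: List[Call], secondary: List[Call]):
    """Split primary calls into those overlapped by a secondary call and the rest."""
    supported, unsupported = [], []
    for event in primary:
        (supported if _partners(event, secondary) else unsupported).append(event)
    return supported, unsupported


def boundary_spread(event: Call, secondary: List[Call]) -> Tuple[int, int]:
    """Range of predicted start and end positions across the event and its partners."""
    partners = _partners(event, secondary)
    starts = [event[1]] + [s for _, s, _ in partners]
    ends = [event[2]] + [e for _, _, e in partners]
    return max(starts) - min(starts), max(ends) - min(ends)


# Hypothetical calls from two algorithms run on the same dataset.
first = [("chr1", 1000, 5000), ("chr3", 200, 900)]
second = [("chr1", 1200, 5200)]

supported, unsupported = split_by_support(first, second)
print(supported)    # [('chr1', 1000, 5000)]
print(unsupported)  # [('chr3', 200, 900)]
print(boundary_spread(first[0], second))  # (200, 200)
```

Unsupported calls are not necessarily false positives; they are simply the events that merit closer inspection first.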
FURTHER ANALYSIS OF DETECTED CNVs
With a number of reliable options available for the detection of copy number events, it becomes
Table 4 Comparison of event numbers detected for a single sample (NA10861)

Algorithm                             Platform and array    Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye)   Affymetrix 6.0        137
CNAT (Genome Console 3.0.2)           Affymetrix 6.0        10
CNVPartition 1.2.1                    Illumina 1M Duo       16
GADA (R 0.7-5)                        Affymetrix 6.0        613
GADA (R 0.7-5)                        Illumina 1M Duo       87
Nexus Biodiscovery 4.0.1              Affymetrix 6.0        111
Nexus Biodiscovery 4.0.1              Illumina 1M Duo       8
PennCNV (2009Jan06)                   Affymetrix 6.0        67
PennCNV (2009Jan06)                   Illumina 1M Duo       43
QuantiSNP v2.0                        Affymetrix 6.0        193
QuantiSNP v1.1                        Illumina 1M Duo       60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.
Table 5 Comparison of software event predictions

Column key: Redon = published results (Redon); Bird = Birdsuite; CNVP = CNVPartition; Nex = Nexus; Penn = PennCNV; Quan = QuantiSNP; -A = Affymetrix; -I = Illumina. A dash marks a dataset compared with itself.

                        Redon    Bird-A   CNAT-A   CNVP-I   GADA-A   GADA-I   Nex-A    Nex-I    Penn-A   Penn-I   Quan-A   Quan-I
Published data (Redon)  -        17(4)    4(40)    3(19)    32(5)    2(2)     11(10)   2(25)    12(18)   7(16)    18(9)    8(13)
Birdsuite Affymetrix    17(44)   -        9(90)    13(81)   135(22)  21(24)   62(56)   6(75)    43(64)   20(47)   97(50)   20(33)
CNAT Affymetrix         4(10)    15(4)    -        4(25)    34(6)    0        23(21)   1(13)    13(19)   2(5)     17(9)    5(8)
CNVPartition Illumina   3(8)     16(4)    4(40)    -        37(6)    7(8)     20(18)   7(88)    9(13)    11(26)   16(8)    16(27)
GADA Affymetrix         17(44)   106(28)  9(90)    13(81)   -        32(37)   91(82)   7(88)    58(87)   23(53)   153(79)  27(45)
GADA Illumina           2(5)     96(25)   0        13(81)   208(34)  -        25(23)   2(25)    26(30)   17(40)   67(35)   23(38)
Nexus Affymetrix        7(18)    57(15)   10(100)  7(44)    116(19)  8(9)     -        4(50)    45(67)   15(35)   78(40)   17(28)
Nexus Illumina          2(5)     6(2)     1(10)    7(44)    22(4)    2(2)     4(4)     -        6(9)     7(16)    10(5)    9(15)
PennCNV Affymetrix      11(28)   51(13)   10(100)  9(56)    105(17)  10(11)   65(59)   6(75)    -        19(44)   71(37)   21(35)
PennCNV Illumina        6(15)    25(7)    2(20)    11(69)   44(7)    9(10)    23(21)   6(75)    18(27)   -        26(13)   28(47)
QuantiSNP Affymetrix    14(36)   97(25)   10(100)  10(63)   199(32)  18(21)   86(77)   7(88)    65(97)   21(49)   -        24(40)
QuantiSNP Illumina      6(15)    14(4)    5(50)    15(94)   55(9)    10(11)   30(27)   8(100)   23(34)   32(74)   31(16)   -

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser UCSC (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, such as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.
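One way to load several result sets into the UCSC browser is as custom tracks in BED format. The sketch below is an illustration rather than part of any package discussed here; the function name, track names and coordinates are hypothetical, and input events are assumed to use 1-based inclusive coordinates, which are converted to the 0-based half-open intervals that BED uses.

```python
from typing import List, Tuple

# (chromosome, start, end, label); hypothetical layout, 1-based inclusive coordinates
LabelledCall = Tuple[str, int, int, str]


def to_bed_track(track_name: str, events: List[LabelledCall]) -> str:
    """Format events as a BED custom track with one header line."""
    lines = [f'track name="{track_name}" description="CNV calls from {track_name}"']
    for chrom, start, end, label in sorted(events):
        # BED start is 0-based, BED end is exclusive, so only the start shifts.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical calls from one algorithm; repeat per algorithm to get one track each.
calls = [("chr2", 5001, 9000, "gain"), ("chr1", 1001, 5000, "loss")]
bed_text = to_bed_track("QuantiSNP_Affy", calls)
print(bed_text)
```

Writing one such block per algorithm into a single file gives one browser track per result set, so differences in predicted boundaries can be seen side by side.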
Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association
Figure 5 Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, and future decisions in the field will be extremely enlightening.
As is much commented upon in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/

Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING
JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No G0701810).
Key Points
• A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
• Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
• After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
• An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.

References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene UGT2B17 for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
page 14 of 14 Winchester et al by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
Table5
Com
parison
ofsoftwareeventpredictio
ns
Pub
lishe
dresults
(Red
on)
Birdsuite
Affym
etrix
CNAT
Affym
etrix
CNV
Par
tition
Illum
ina
GADA
Affym
etrix
GADA
Illum
ina
Nex
usAffym
etrix
Nex
usIllum
ina
Pen
nCNV
Affym
etrix
Pen
nCNV
Illum
ina
Qua
ntiSNP
Affym
etrix
Qua
ntiSNP
Illum
ina
Publishe
ddata
(Red
on)
17(4)
4(40
)3(19
)32
(5)
2(2)
11(10
)2(25
)12
(18
)7(16
)18
(9)
8(13
)
Birdsuite
Affy
metrix
17(44
)9(90
)13
(81
)135(22
)21
(24
)62
(56
)6(75
)43
(64
)20
(47
)97
(50
)20
(33
)CNAT
Affy
metrix
4(10
)15
(4)
4(25
)34
(6)
023
(21
)1(13
)
13(19
)2(5)
17(9)
5(8)
CNVPartition
Illum
ina
3(8)
16(4)
4(40
)37
(6)
7(8)
20(18
)7(88
)9(13
)
11(26
)16
(8)
16(27
)GADA
Affy
metrix
17(44
)106(28
)9(90
)13
(81
)32
(37
)91
(82
)7(88
)58
(87
)23
(53
)153(79
)27
(45
)GADA
Illum
ina
2(5)
96(25
)0
13(81
)20
8(34
)25
(23
)2(25
)26
(30
)17
(40
)67
(35
)23
(38
)Nexus
Affy
metrix
7(18
)57
(15
)10
(100
)
7(44
)116(19
)8(9)
4(50
)45
(67
)15
(35
)78
(40
)17
(28
)Nexus
Illum
ina
2(5)
6(2)
1(10
)7(44
)22
(4)
2(2)
4(4)
6(9)
7(16
)10
(5)
9(15
)Penn
CNV
Affy
metrix
11(28
)51
(13)
10(100
)
9(56
)105(17
)10
(11)
65(59
)6(75
)19
(44
)71
(37
)21
(35
)Penn
CNV
Illum
ina
6(15
)25
(7)
2(20
)11
(69
)44
(7)
9(10
)23
(21
)6(75
)18
(27
)26
(13)
28(47
)QuantiSNP
Affy
metrix
14(36
)97
(25
)10
(100
)
10(63
)199(32
)18
(21
)86
(77
)7(88
)65
(97
)21
(49
)24
(40
)QuantiSNP
Illum
ina
6(15
)14
(4)
5(50
)15
(94
)55
(9)
10(11)
30(27
)8(100
)
23(34
)32
(74
)31
(16
)
Algorithm
swererunon
demon
stratio
ndataforsampleNA108
61on
Affy
metrix60chipsa
ndIllum
ina1MDuo
arraysD
efaultparametersw
ereused
andanyY
chromosom
edatawas
omittedFo
ralgorithmoverall
totalsseeTable4Events
detected
inbo
thsoftwareareshow
nEvents
coun
tedas
common
betw
eenalgorithmsifpart
ofregion
predictedoverlaps
withtheotherEach
comparisoniscarriedou
ttw
ice
toshow
caseswhere
smallereventswithinon
ealgorithm
makeup
oneeventintheotherthereforeoverlapof
eventsdepe
ndson
analysisorientationTotalvalue
representsnumberof
eventsforsoftwareon
horizontalaxisfoun
dintheothersoftwaredatasetbracketedvalueshow
spercentageofeventsdetected
bysamesoftwareWehave
foun
dthemostsim
ilaritie
sare
betw
eendatafrom
similarplatform
soralgo
-rithm
metho
dforexam
pleAffy
metrixPenn
CNVandQuantiSNParebo
thbasedon
theHMM
algorithm
andas
such
eventpredictio
nshou
ldbe
very
similarWehave
also
notedahigher
numberof
similar
eventsfrom
algorithmsu
singAffy
metrixdata
Comparing CNVdetection methods for SNParrays page 11 of 14 by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
increasingly important to be able to summarize and
use this data Initially we are often interested in
looking for novel events in certain genes or regions
Tracks of events can be viewed in databases such as
the web-based genome browser UCSC (http
wwwgenomeucscedu) and events can be com-
pared to known copy number data in the DGV
such as displayed in Figure 3 Importing several
tracks of data into a browser simultaneously will
allow the user to compare different result sets
Analysis of multiple events per sample is a more
complicated procedure Events and samples can
be explored using pathway analysis tools to look
for interesting groups or combinations of events in
different genes but methods of confirming the
significance of an event are required A number of
publications exist presenting ways of applying asso-
ciation study methods to copy number data Barnes
etal [29] developed an R package CNVtools which
allows the user to carry out case-control association
Figure 5 Image from UCSC Browser showing the detection of a single event using different algorithmsThe deletion described is a known CNP and is recorded several times in the DGV Each track represents a differ-ent algorithm or platform All results for detection algorithms shown used default parameters and test sampleNA10861
page 12 of 14 Winchester et al by guest on February 21 2014
httpbfgoxfordjournalsorgD
ownloaded from
analysis on a single CNV of interest The publica-
tion tests a series of five alternative modelling meth-
ods before recommending a likelihood ratio test
which combines CNV calling and association testing
into a single model This method was designed
to eliminate problems with signal noise which is a
known trait of SNP assay data Ionita-Laza et al [30]
suggested a method to apply genome-wide family-
based association studies on raw-intensity data The
Birdsuite package includes a pipeline to prepare
the data for PLINK analysis Other sources have
suggested similar association study-based strategies
but an agreed approach is a subject of great discus-
sion Calls have been made by authors such as
Scherer et al [31] to decide on a single technique
but future decisions in the field will be extremely
enlightening
As is commented much upon in literature
describing SNP association study techniques
sample size and power of tests are major factors in
a successful study [32] This must also be considered
when analysing copy number data As we have dis-
cussed there are a number of analysis options avail-
able for SNP array CNV detection pipelines to
allow guided analysis and stand alone options for
more flexible analysis Some of these applications
are platform targeted but we have found that the
best outcome is given by using multiple algorithms
and comparing data
SUPPLEMENTARYDATASupplementary data are available online at http
biboxfordjournalsorg
AcknowledgementsThe authors thank Dr Helen Butler for her ideas and contribu-
tions to the manuscript
FUNDINGJR and LW are funded by Wellcome Trust Grants
CY is funded by a UK Medical Research Council
Special Training Fellowship in Biomedical
Informatics (Ref No G0701810)
References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
Key Points
A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software packages and increases confidence in replicated events.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
increasingly important to be able to summarize and use this data.
Initially, we are often interested in looking for novel events in
certain genes or regions. Tracks of events can be viewed in databases
such as the web-based genome browser UCSC (http://www.genome.ucsc.edu),
and events can be compared to known copy number data in the DGV, as
displayed in Figure 3. Importing several tracks of data into a browser
simultaneously allows the user to compare different result sets.
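One way to prepare result sets for side-by-side viewing is to write each algorithm's events as a BED custom track, a plain-text format the UCSC browser accepts. The sketch below is only an illustration: the function name, the tuple layout of the calls and the track names are assumptions, and the coordinates are assumed to be already 0-based and half-open as BED expects.

```python
from typing import Iterable, Tuple

# (chromosome, start, end, label): hypothetical layout for one event
Call = Tuple[str, int, int, str]


def to_bed_track(track_name: str, calls: Iterable[Call],
                 description: str = "") -> str:
    """Render one algorithm's events as a BED4 custom track.

    The track line names the track so that several algorithms can be
    told apart in the browser; each following line is one event.
    """
    header = 'track name="{}" description="{}"'.format(
        track_name, description or track_name)
    lines = [header]
    # Sorting keeps events grouped by chromosome and ordered by start.
    for chrom, start, end, label in sorted(calls):
        lines.append("{}\t{}\t{}\t{}".format(chrom, start, end, label))
    return "\n".join(lines) + "\n"
```

Tracks produced this way for several algorithms can be concatenated into a single file, since each block starts with its own track line, and then uploaded as custom tracks.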
Analysis of multiple events per sample is a more complicated
procedure. Events and samples can be explored using pathway analysis
tools to look for interesting groups or combinations of events in
different genes, but methods of confirming the significance of an
event are required. A number of publications present ways of applying
association study methods to copy number data. Barnes et al. [29]
developed an R package, CNVtools, which allows the user to carry out
case-control association
Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.
analysis on a single CNV of interest. The publication tests a series
of five alternative modelling methods before recommending a likelihood
ratio test which combines CNV calling and association testing into
a single model. This method was designed to eliminate problems with
signal noise, which is a known trait of SNP assay data. Ionita-Laza
et al. [30] suggested a method to apply genome-wide family-based
association studies on raw-intensity data. The Birdsuite package
includes a pipeline to prepare the data for PLINK analysis. Other
sources have suggested similar association study-based strategies,
but an agreed approach is still a subject of great discussion. Authors
such as Scherer et al. [31] have called for agreement on a single
technique, and future decisions in the field will be extremely
enlightening.
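To make the idea of a likelihood ratio test for case-control association concrete, the sketch below applies a likelihood-ratio (G) test to a table of hard copy number calls. This is a deliberately simplified stand-in and not the CNVtools model: it takes the copy number calls as given rather than fitting calling and association jointly, so it does not address signal noise. The function names and the example counts are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def g_test(table: Sequence[Sequence[int]]) -> Tuple[float, int]:
    """Likelihood-ratio (G) statistic for a copy number by status table.

    Each row is one copy number class (for example 0, 1 and 2 copies)
    and each column is one group (cases, controls). Returns the
    statistic and its degrees of freedom.
    """
    row_totals: List[int] = [sum(row) for row in table]
    col_totals: List[int] = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to G
                expected = row_totals[i] * col_totals[j] / total
                g += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(col_totals) - 1)
    return g, df


def chi2_sf_df2(x: float) -> float:
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2.0)


# Invented counts: rows are 0, 1 and 2 copies; columns are cases, controls.
counts = [[30, 10], [40, 40], [30, 50]]
g_stat, dof = g_test(counts)
p_value = chi2_sf_df2(g_stat)  # valid here because dof == 2
```

The closed-form p-value used here holds only for two degrees of freedom, which is the case for three copy number classes and two groups. Other table shapes need a general chi-square tail function.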