Comparing CNV detection methods for SNP arrays

Laura Winchester, Christopher Yau and Jiannis Ragoussis

Abstract
Data from whole genome association studies can now be used for dual purposes: genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.

Keywords: copy number; SNP array

INTRODUCTION
Structural variation in the human genome has been intensely studied in recent years [1–5]. Publications have shown rare copy number variations (CNV) with a relationship to certain diseases, and much has also been done to study copy number polymorphisms (CNP) in the population, their contribution to structural variation and possible association with complex disease. Multiple methods for the detection of these structural variants exist [6, 7], but we focus on methods designed to interpret results from SNP arrays.

The most prominent SNP array types are available from the commercial vendors Affymetrix and Illumina. Both companies sell competing arrays and continue to offer increased coverage for detecting copy number events and SNP assays simultaneously. Assay techniques for the arrays differ [8, 9], but the signal-intensity output from both platforms presents similar analysis and interpretation problems.

Successful application of these technologies has yielded a number of interesting individual CNVs with relationships to complex disease. For example, rare CNVs have been linked to schizophrenia [10] in a study where microdeletions and duplications were shown to be responsible for disrupting genes involved in neurodevelopment. The UGT2B17 gene on chromosome 4q13.2 was linked to osteoporosis in a case–control study of 727 CNV regions in a Chinese sample set [11].

One approach to copy number event detection has been to investigate common events. Studies such as McCarroll et al. [12] involved the characterization of deletion variations in the genome, while Redon et al. [2] have mapped the location of events found in multiple samples. Information about identified copy number events is recorded in databases such as the Database of Genomic Variants (DGV) [1]. Using the prior information about CNP location we can investigate copy number events as we would use SNP information in genotyping. Known CNPs can be genotyped in case–control populations with methods similar to the SNP-based association study. With the diversity of approaches and analysis options it is important to decide on the method most suited to the particular experimental needs. This review presents methods suggested for germline CNV analysis, including both CNP analysis and the detection of rare CNVs.

Laura Winchester is a DPhil student at Oxford University, where her research involves detection of copy number events in genetic disorders, in particular Specific Language Impairment.
Christopher Yau is a Postdoctoral Research Fellow in the Department of Statistics at Oxford University.
Jiannis Ragoussis is Head of Genomics at WTCHG. Interests: gene expression regulation in hypoxia and inflammation, genotyping and sequencing technology, identification of chromosomal aneuploidies and CNVs associated with disease.
Corresponding author. Jiannis Ragoussis, Genomics, Wellcome Trust Centre For Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK. Tel: (01865) 287526; Fax: (01865) 287501; E-mail: ioannisr@well.ox.ac.uk

Briefings in Functional Genomics and Proteomics. doi:10.1093/bfgp/elp017. Advance Access published September 8, 2009. © The Author 2009. Published by Oxford University Press. For permissions, please email: journals.permissions@oxfordjournals.org



CNV DISCOVERY AND DETECTION USING SNP CHIPS
The use of SNP arrays in copy number event detection has a number of advantages. As well as the two applications for the data, SNP genotyping and copy number analysis, there are other aspects that promote their use over other techniques. SNP arrays use less sample per experiment than other techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of the method. The SNP array is a cost-effective technique which allows the user to increase the number of samples tested on a limited budget. Although advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.

One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs are selected for genotyping arrays, certain factors are considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome due to the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy–Weinberg tests. For example, in a situation where a father is (A, B) and the mother (B, -), where '-' marks a deleted allele, the child could be (A, B) or (A, -) or (B, -). However, in SNP genotyping results the mother would appear to be called (B, B) and the child would be called either (A, B) or (A, A) or (B, B). If the child is really (A, -), then an (A, A) call would seem to violate Mendelian inheritance patterns and would often cause the SNP to be rejected.
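
The inheritance example above can be illustrated with a short Python sketch. It is not taken from any of the tools reviewed here; the '-' symbol for a deleted allele and the function names are assumptions made purely for illustration. It shows how a caller that is unaware of deletions reports a hemizygous genotype as homozygous, and how the resulting trio then fails a simple Mendelian check.

```python
from itertools import product

NULL = "-"  # illustrative symbol for a deleted (null) allele


def called_genotype(true_genotype):
    """Genotype as reported by a SNP caller that knows nothing about deletions."""
    present = [allele for allele in true_genotype if allele != NULL]
    if not present:
        return None  # homozygous deletion: no call is possible
    if len(present) == 1:
        return (present[0], present[0])  # hemizygous sample looks homozygous
    return tuple(sorted(present))


def mendelian_consistent(father, mother, child):
    """True if the child genotype can be built from one allele of each parent."""
    possible = {tuple(sorted(pair)) for pair in product(father, mother)}
    return tuple(sorted(child)) in possible


# True genotypes from the example in the text
father_true, mother_true, child_true = ("A", "B"), ("B", NULL), ("A", NULL)

# What a standard SNP genotyping call would show
father_call = called_genotype(father_true)
mother_call = called_genotype(mother_true)
child_call = called_genotype(child_true)

print(mendelian_consistent(father_true, mother_true, child_true))
print(mendelian_consistent(father_call, mother_call, child_call))
```

The true genotypes are consistent with inheritance, whereas the called genotypes (A, B), (B, B) and (A, A) are not, which is the situation that can lead to the SNP being rejected.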

Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning the final result may produce a noisy signal which, although it does not per se affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data is typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number, but at locations of common CNV this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.

Several of the new array assay selections have taken copy number detection into account; for example, Illumina includes 'unSNPable' genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with the aim of assessing SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.

ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as the software's quality checking and genotype-calling functions, it calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots. The log R ratio value is then calculated from the expected normalized intensity of a sample and the observed normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms.
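
As a rough illustration of these two quantities, the sketch below computes a log R ratio as the base-2 logarithm of observed over expected intensity, and a BAF by linear interpolation between genotype cluster positions. This follows a commonly described formulation and is an assumption on our part; BeadStudio's internal calculation is proprietary and is not specified here. The function names, the `theta` allelic-angle parameters and the cluster positions are all hypothetical.

```python
import math


def log_r_ratio(r_observed, r_expected):
    """log2 of observed over expected normalized intensity (0 means no change)."""
    return math.log2(r_observed / r_expected)


def b_allele_frequency(theta, theta_aa, theta_ab, theta_bb):
    """Interpolate a sample's position between the AA, AB and BB cluster centres.

    Returns 0.0 at or below the AA cluster, 0.5 at the AB cluster
    and 1.0 at or above the BB cluster.
    """
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)


# Hypothetical values: half the expected intensity, as for a one-copy deletion
print(log_r_ratio(0.5, 1.0))
# Hypothetical cluster centres at 0.1 (AA), 0.5 (AB) and 0.9 (BB)
print(b_allele_frequency(0.5, 0.1, 0.5, 0.9))
```

Under these assumptions a heterozygous sample sitting exactly on the AB cluster has a BAF of 0.5, and halving the intensity gives a log R ratio of -1.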

Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNV Partition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser, allowing the user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).

AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated, and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experimental signal-intensity values against a reference set of data and evaluates copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events.

Tools for analysis of the different Affymetrix chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the BirdSuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes; data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

Figure 1: BeadStudio Chromosome Viewer. Image from the BeadStudio Chromosome Browser showing copy number values for sample NA10861. Chromosome 22 is shown with an event at 23 999 142–24 239 255, confirmed by all statistics. CNV value produced by the CNV Partition algorithm.

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION
Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data in both Bayesian and non-Bayesian frameworks by using dynamic programming-based algorithms such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
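
The structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the reviewed tools: the three states, their mean log R ratios, the shared standard deviation and the probability of staying in a state are hypothetical values fixed by hand, whereas real tools learn such parameters from data (for example with EM).

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a Gaussian emission distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(observations, means, sd, stay_prob):
    """Most likely state path for a Gaussian-emission HMM with a uniform start."""
    n_states = len(means)
    switch = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [[math.log(stay_prob if i == j else switch) for j in range(n_states)]
                 for i in range(n_states)]
    score = [math.log(1.0 / n_states) + gaussian_logpdf(observations[0], m, sd)
             for m in means]
    back = []  # back[t][j]: best previous state when observation t+1 is in state j
    for x in observations[1:]:
        prev, score, pointers = score, [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j]
                         + gaussian_logpdf(x, means[j], sd))
        back.append(pointers)
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical states: copy number 1, 2 and 3 with illustrative mean log R ratios
copy_numbers = [1, 2, 3]
state_means = [-0.6, 0.0, 0.4]
lrr = [0.02, -0.05, 0.01, -0.62, -0.55, -0.70, -0.58, 0.03, -0.45, 0.04]

calls = [copy_numbers[s] for s in viterbi(lrr, state_means, sd=0.2, stay_prob=0.99)]
print(calls)
```

With these illustrative parameters the run of four low values is called as a one-copy region, while the single low value near the end is absorbed as noise, because a high probability of staying in the same state penalizes very short events.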

Figure 2: Genotyping Console Genome Viewer. Image from the Affymetrix Genotyping Console showing sample NA10861. The event on chromosome 22 is confirmed by the CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.

With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command line operated algorithm that is designed to apply the HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons, software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user to apply it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often causes false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run using command line options or integrated into Illumina's BeadStudio as a plug-in, and both have unique features to recommend them.
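
The kind of per-chromosome check described above can be sketched as follows. This is only an illustration of the idea of summarizing BAF by chromosome and looking for outliers; it does not reproduce the QC output of QuantiSNP or PennCNV, and the function names, the toy BAF values and the `tolerance` threshold are assumptions chosen for the example.

```python
from statistics import mean, median, pstdev


def baf_summary(baf_by_chrom):
    """Mean and standard deviation of BAF for each chromosome."""
    return {chrom: (mean(values), pstdev(values))
            for chrom, values in baf_by_chrom.items()}


def flag_outliers(summary, tolerance=0.05):
    """Chromosomes whose BAF SD differs from the median SD by more than `tolerance`."""
    centre = median(sd for _, sd in summary.values())
    return sorted(chrom for chrom, (_, sd) in summary.items()
                  if abs(sd - centre) > tolerance)


# Toy data: autosomes with a mix of AA, AB and BB calls, and an X chromosome
# with no heterozygous calls (as expected for a male sample)
baf_by_chrom = {
    "1": [0.0, 0.5, 1.0, 0.5, 0.0, 1.0],
    "2": [0.0, 0.5, 1.0, 0.5, 1.0, 0.0],
    "3": [0.5, 0.0, 1.0, 0.5, 0.0, 1.0],
    "X": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
}

summary = baf_summary(baf_by_chrom)
print(flag_outliers(summary))
```

In practice the threshold would need to be chosen for the array type and sample set in question, and flagged chromosomes or samples would be inspected before being excluded.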

The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the standard Illumina Infinium array: they can be used to process Affymetrix data and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.

Figure 3: Graphical representation of quality control data from the PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. The plot shows the BAF score for each chromosome from analysis of sample NA10861; we can see that chromosomes 4 and X are outliers. Values produced by the PennCNV log file are also shown. NB: values shown relate to the Illumina 1M Duo array.

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions, in particular the quality control options, are best suited to values generated by the Affymetrix platform. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses the HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm will continue to divide a region into segments until it finds a segment which is different to the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined. Segment ends are joined to form a circle, allowing a further likelihood ratio test of whether the contents have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
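
The idea of testing an arc of the circle against the remainder, and recursing on the resulting pieces, can be sketched as below. This is a simplified illustration and not the published CBS procedure: it uses a fixed, hypothetical score `threshold` in place of the significance test on the likelihood ratio statistic, runs an exhaustive search over arcs, and uses noise-free toy data.

```python
from statistics import median


def best_arc(values):
    """Return (score, i, j) for the arc values[i:j] whose mean differs most from the rest."""
    n = len(values)
    total = sum(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave something outside it
            inside = prefix[j] - prefix[i]
            diff = inside / k - (total - inside) / (n - k)
            score = abs(diff) / (1.0 / k + 1.0 / (n - k)) ** 0.5
            if score > best[0]:
                best = (score, i, j)
    return best


def segment(values, threshold, offset=0):
    """Recursively split `values`; return (start, end, median) with end exclusive."""
    if len(values) >= 2:
        score, i, j = best_arc(values)
        if score > threshold:
            pieces = []
            for a, b in ((0, i), (i, j), (j, len(values))):
                if b > a:
                    pieces.extend(segment(values[a:b], threshold, offset + a))
            return pieces
    return [(offset, offset + len(values), median(values))]


# Toy log-ratio profile: 5 normal probes, 4 probes in a deletion, 6 normal probes
log_ratios = [0.0] * 5 + [-0.6] * 4 + [0.0] * 6
segments = segment(log_ratios, threshold=0.5)
print(segments)
```

Each returned segment carries the median log ratio of its probes, corresponding to the cluster value described in the text; with real, noisy data the stopping rule would be based on a significance test rather than a fixed threshold.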

An alternative to the CBS algorithm, which can now be applied to SNP arrays, was developed by Pique-Regi et al. [24]. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear, and we were able to analyse data within a few minutes.

There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to correctly carry out the analysis.

The ITALICS [26] software focuses analysis on the removal of unwanted events found in Affymetrix data. Rigaill et al. developed ITALICS (Iterative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide array 5.0 and 6.0 formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and work flows, optimized computational speed and technical support. These factors are all extremely useful to those labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU Public Licence, and are thereby prevented from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage on the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach, which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy–Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
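
To illustrate what a deviation from HWE looks like for a single SNP, the following sketch computes the standard chi-square goodness-of-fit statistic from biallelic genotype counts. It is a generic textbook calculation and not TriTyper's own procedure; the function name and the genotype counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts in perfect equilibrium for allele frequencies of 0.5
print(hwe_chi_square(25, 50, 25))
# An excess of apparent homozygotes, as might arise if hemizygous
# deletion carriers are called as homozygous
print(hwe_chi_square(40, 20, 40))
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05, so the second set of counts would stand out as a candidate for closer inspection.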

COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software
To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27 of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that, although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% for all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

Software | Platform | Related publication | Details | Strengths | Weaknesses
--- | --- | --- | --- | --- | ---
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach, single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events by likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions

Table 2: Comparison of algorithms

Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28]
--- | --- | --- | ---
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%)

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.

page 8 of 14 Winchester et al. Downloaded from http://bfg.oxfordjournals.org/ by guest on February 21, 2014

Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array Illumina platform used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

| Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4) |
| GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19) |
| PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14) |
| QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13) |

Data from CEPH sample NA15510 on 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
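The two-way counting described above can be illustrated with a minimal sketch. It assumes events are simple (chromosome, start, end) tuples and uses the "any overlap" rule; the function names and the example coordinates are hypothetical and are not taken from any of the packages tested.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def overlaps(a: Event, b: Event) -> bool:
    """True if two events lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query: List[Event], other: List[Event]) -> int:
    """Number of events in `query` that overlap any event in `other`."""
    return sum(any(overlaps(q, o) for o in other) for q in query)


# Hypothetical calls: algorithm A reports one large deletion, while
# algorithm B splits the same region into three smaller events.
calls_a = [("chr22", 1000, 9000), ("chr1", 500, 800)]
calls_b = [("chr22", 1200, 2000), ("chr22", 3000, 4500), ("chr22", 7000, 8800)]

a_in_b = count_overlapping(calls_a, calls_b)  # events of A supported by B
b_in_a = count_overlapping(calls_b, calls_a)  # events of B supported by A
print(a_in_b, b_in_a)
```

Because one large event in A covers three small events in B, the count depends on which algorithm is treated as the query, which is why each comparison is reported in both orientations.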

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing accuracy of detection algorithms, but it is also important to take into account that the different predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and also to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
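As an illustration of the two-algorithm recommendation, the sketch below separates events replicated by a second algorithm from those that are not, and reports how far the predicted boundaries disagree. The function name, the record fields and the coordinates are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end)


def compare_calls(calls_a: List[Event], calls_b: List[Event]):
    """Split algorithm A's events into those replicated by B and those not."""
    replicated: List[Dict] = []
    unreplicated: List[Event] = []
    for chrom, start, end in calls_a:
        # Events from B on the same chromosome sharing at least one base.
        partners = [(s, e) for c, s, e in calls_b
                    if c == chrom and s <= end and start <= e]
        if not partners:
            unreplicated.append((chrom, start, end))
            continue
        for s, e in partners:
            replicated.append({
                "chrom": chrom,
                "shared": (max(start, s), min(end, e)),  # region both agree on
                "start_diff": abs(start - s),            # boundary discrepancy
                "end_diff": abs(end - e),
            })
    return replicated, unreplicated


# Hypothetical output from two algorithms run on the same sample.
calls_a = [("chr22", 1000, 5000), ("chr4", 100, 200)]
calls_b = [("chr22", 1200, 5300)]
replicated, unreplicated = compare_calls(calls_a, calls_b)
print(replicated)
print(unreplicated)
```

Events in the unreplicated list are not necessarily false positives, but they are the natural candidates for closer inspection of signal quality before any follow-up.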

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events it becomes

Table 4: Comparison of event numbers detected for a single sample (NA10861)

| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5: Comparison of software event predictions

| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13) |
| Birdsuite Affymetrix | 17 (44) | | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33) |
| CNAT Affymetrix | 4 (10) | 15 (4) | | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8) |
| CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27) |
| GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45) |
| GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38) |
| Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28) |
| Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | | 6 (9) | 7 (16) | 10 (5) | 9 (15) |
| PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | | 19 (44) | 71 (37) | 21 (35) |
| PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | | 26 (13) | 28 (47) |
| QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | | 24 (40) |
| QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | |

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.


increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser UCSC (http://www.genome.ucsc.edu) and events can be compared to known copy number data in the DGV, such as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.


analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.

As is commented much upon in literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.

2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.

3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.

4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.

5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.

6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.

7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.

8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.

9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.

10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.

11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.

12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.

13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.

14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

• A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.

• Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

• After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

• An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.

16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.

17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.

18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.

19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.

20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.

21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.

22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.

23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.

24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.

25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.

26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.

27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.

28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.

29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.

30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.

31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.

32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.


CNV DISCOVERY AND DETECTION USING SNP CHIPS

The use of SNP arrays in copy number event detection has a number of advantages. As well as the two applications for the data, which are SNP genotyping and copy number analysis, there are other aspects that promote their use over other techniques. SNP arrays use less sample per experiment compared to other techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of the method. The SNP array is a cost-effective technique which allows the user to increase the number of samples tested on a limited budget. Although the advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.

One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs are selected for genotyping arrays, certain factors are considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome due to the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy–Weinberg tests. For example, in a situation where a father is (A, B) and the mother (B, -), where '-' denotes a deleted allele, the child could be (A, B) or (A, -) or (B, -). However, in SNP genotyping results the mother would appear to be called (B, B) and the child would be called either (A, B) or (A, A) or (B, B). If the child is really (A, -), then an (A, A) call would seem to violate Mendelian inheritance patterns and often cause the SNP to be rejected.
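The inheritance example above can be worked through in a short sketch. It assumes that a genotyping call cannot see a deleted allele, so a hemizygous genotype is reported as homozygous; the function names and the '-' symbol for the null allele are choices made for this illustration.

```python
from itertools import product

NULL = "-"  # deleted (null) allele, invisible to the genotype call


def observed_call(genotype):
    """Genotype that would be called when null alleles cannot be seen."""
    present = sorted(a for a in genotype if a != NULL)
    if not present:
        return None                       # homozygous deletion: no call
    if len(present) == 1:
        return (present[0], present[0])   # hemizygous looks homozygous
    return tuple(present)


def mendelian_consistent(child, father, mother):
    """True if the called child genotype can arise from the called parents."""
    return any(tuple(sorted(pair)) == tuple(sorted(child))
               for pair in product(father, mother))


father, mother = ("A", "B"), ("B", NULL)
father_call, mother_call = observed_call(father), observed_call(mother)

# Enumerate every true child genotype and the call it would produce.
for child in product(father, mother):
    call = observed_call(child)
    ok = mendelian_consistent(call, father_call, mother_call)
    print(child, "->", call, "consistent" if ok else "apparent Mendelian error")
```

Only the true (A, -) child produces an apparent error, because its (A, A) call cannot be derived from parents called (A, B) and (B, B).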

Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning the final result may produce a noisy signal which, although it does not per se affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data is typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number, but at locations of common CNV this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.

Several of the new array assay selections have taken copy number detection into account; for example, Illumina includes ‘unSNPable’ genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with an aim to assess SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques, the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.

ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as the software's quality checking and genotype-calling functions, it calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots. The log R ratio value is then calculated from the expected normalized intensity of a sample and the observed normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms.
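The two quantities can be sketched as follows. This is not BeadStudio's implementation; it assumes the commonly described definitions, in which the log R ratio is the base-2 logarithm of observed over expected intensity and the BAF is a linear interpolation of the allelic angle theta between the canonical AA, AB and BB cluster positions. The cluster positions in the usage lines are hypothetical.

```python
import math


def log_r_ratio(r_observed: float, r_expected: float) -> float:
    """Log2 ratio of observed to expected normalized intensity."""
    return math.log2(r_observed / r_expected)


def b_allele_frequency(theta: float, theta_aa: float,
                       theta_ab: float, theta_bb: float) -> float:
    """Interpolate theta between cluster centres onto a 0..1 scale."""
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)


# Hypothetical cluster centres for one SNP: AA=0.1, AB=0.5, BB=0.9.
print(log_r_ratio(0.5, 1.0))                      # halved intensity
print(b_allele_frequency(0.5, 0.1, 0.5, 0.9))     # sample on the AB cluster
```

Under these definitions a single-copy deletion tends to show a negative log R ratio together with the loss of BAF values near 0.5, which is the combined signal the detection algorithms exploit.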

Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNV Partition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser, allowing the


user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).

AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experiment signal-intensity values against a reference set of data and evaluates copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events.

Tools for analysis of the different Affymetrix chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the BirdSuite package [16], which dramatically improves detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes;

Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for sample NA10861. Chromosome 22 shown with an event at 23 999 142–24 239 255 confirmed by all statistics. CNV value produced by CNV Partition algorithm.


data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION

Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci are assumed to have similar copy number states. Transitions between copy number states are determined by a transition matrix which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data in both Bayesian and non-Bayesian frameworks by using dynamic programming-based algorithms such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
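The decoding step can be illustrated with a minimal, generic Viterbi sketch. It is not the model used by any of the packages discussed here: it assumes three copy number states, Gaussian emissions on the log R ratio only, a single shared standard deviation and a fixed probability of staying in the same state, and all parameter values are hypothetical rather than learnt from data.

```python
import math

STATES = [1, 2, 3]                    # copy number: loss, normal, gain
MEANS = {1: -0.6, 2: 0.0, 3: 0.4}     # hypothetical mean log R ratio per state
SD = 0.2                              # hypothetical emission standard deviation
STAY = 0.98                           # hypothetical probability of staying put


def log_gauss(x, mean, sd):
    """Log density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_trans(i, j):
    """Log transition probability; leaving a state is rare."""
    return math.log(STAY if i == j else (1 - STAY) / (len(STATES) - 1))


def viterbi(lrr):
    """Most probable sequence of copy number states for the observations."""
    score = {s: math.log(1 / len(STATES)) + log_gauss(lrr[0], MEANS[s], SD)
             for s in STATES}
    back = []
    for x in lrr[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log_trans(p, s))
            pointers[s] = best
            score[s] = prev[best] + log_trans(best, s) + log_gauss(x, MEANS[s], SD)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


lrr = [0.02, -0.05, 0.01, -0.55, -0.62, -0.58, -0.60, 0.03, 0.00, -0.04]
print(viterbi(lrr))
```

The high probability of staying in a state is what encodes the assumption that neighbouring loci share a copy number, so a run of low values is called as one deletion rather than several isolated ones.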

Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.


With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command line operated algorithm that is designed to apply HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons, software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM

A number of solutions for guided accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user to apply it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often causes false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot average and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can run using command line options or as a plug-in integrated into Illumina's BeadStudio, and have unique features to recommend them.

The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosome 4 and X are outliers. Values produced by PennCNV log file also shown. NB: Values shown relate to Illumina 1M Duo array.


standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow more analysis of event results, such as a script to compare events to known gene libraries or for changing the format to be suitable for a viewer such as BeadStudio's Chromosome Browser or the web-based genome browser UCSC (http://www.genome.ucsc.edu).

dChip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to the Affymetrix platform generated values, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH

A number of methods for copy number event detection were originally developed for array CGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm recursively divides a region into segments for as long as it can find a segment that differs from its neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm allows a single changed region to be defined inside a larger segment: the segment ends are joined to form a circle, and a likelihood ratio test is used to assess whether the enclosed content has a different mean. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
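The change-point idea can be sketched in a few lines. The example below is a deliberately simplified, plain recursive binary segmentation with a fixed score threshold; it omits the circular step of CBS and the significance testing used in the published method [23], and the function names and the threshold are assumptions made for illustration only. Each final segment is summarized by the median of its values, mirroring the cluster value described above.

```python
from statistics import median
from typing import List, Sequence, Tuple


def best_split(values: Sequence[float]) -> Tuple[int, float]:
    """Return the split index and score that maximise the weighted mean difference."""
    n = len(values)
    total = sum(values)
    best_index, best_score = 0, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Weight the mean difference by the sizes of the two candidate segments.
        score = abs(left_mean - right_mean) * (i * (n - i) / n) ** 0.5
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


def segment(values: Sequence[float], threshold: float,
            offset: int = 0) -> List[Tuple[int, int, float]]:
    """Split recursively; return (start, end_exclusive, median) per segment."""
    if len(values) >= 2:
        index, score = best_split(values)
        if score > threshold:
            return (segment(values[:index], threshold, offset)
                    + segment(values[index:], threshold, offset + index))
    return [(offset, offset + len(values), median(values))]
```

On a toy series of ten probes at 0.0, five at -1.0 and ten at 0.0, a threshold of 0.5 recovers the three segments. A short interior change scores relatively weakly against the whole series in plain binary segmentation, which is the situation the circular extension was introduced to address.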

An alternative to the CBS algorithm, which can now be applied to SNP arrays, was developed by Pique-Regi et al. [24]. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage in speed of data processing was clear, and we were able to analyse data within a few minutes.

There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25], focused on the array CGH format, present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS

Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to correctly carry out the analysis.

The ITALICS [26] software focuses on the removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K, Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE

The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, and this can prevent them from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase the speed of processing. It can be used for Affymetrix, array CGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses an HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach, which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
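To illustrate why a deviation from HWE can hint at a null allele, the sketch below computes the standard one-degree-of-freedom chi-square statistic for a biallelic SNP from observed genotype counts. This is a generic textbook calculation and not TriTyper's implementation; the function name and the example counts are hypothetical. An excess of apparent homozygotes, as would arise if carriers of a hemizygous deletion were called as homozygous, inflates the statistic.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype call is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs) to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

With counts of 25, 50 and 25 the statistic is zero, whereas 50, 0 and 50 (no heterozygotes at all) gives a value far above 3.84, the conventional 5% critical value for one degree of freedom.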

COMPARING THE DETECTION ALGORITHMS

There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results of a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume that the result sets with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we found that at least 20% of events were confirmed by Kidd et al. for all algorithms, and 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV-detected events were confirmed, but other algorithms have actually detected more in total.
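The confirmation counts discussed here can be sketched as a simple interval comparison. The example below assumes that an event counts as confirmed when it shares at least one base with any published event on the same chromosome; the exact criterion used for the comparison is not specified beyond comparing start and end points, so this rule, the tuple layout and the function names are assumptions for illustration.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def overlaps(a: Event, b: Event) -> bool:
    """True when two events on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def confirmed_events(detected: List[Event],
                     reference: List[Event]) -> Tuple[int, float]:
    """Return the count and percentage of detected events found in the reference."""
    confirmed = sum(
        1 for event in detected
        if any(overlaps(event, ref) for ref in reference)
    )
    percentage = 100.0 * confirmed / len(detected) if detected else 0.0
    return confirmed, percentage
```

Reporting both values matters for the reason given above: an algorithm with a high percentage may still have confirmed fewer events in total than one with a lower percentage.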

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting that this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events for the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Table 1: Summary of SNP array detection algorithms

Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach: single association of SNPs and CN | Availability: limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand-alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions

Table 2: Comparison of algorithms

Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28]
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%)

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.

Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the 1M array (Illumina platform), used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | - | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%)
GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%)
PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%)
QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%)

Data from CEPH sample NA15510 on the 1M array (Illumina platform) is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and un-validated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.

Similarities of events detected between different Software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive, detecting more events than the proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the greater number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values reflects cases where several smaller events from one algorithm fall within a single event found by a different algorithm. In general we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
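The reason each comparison yields two values can be made concrete with a small sketch. Counting overlaps from the point of view of each algorithm in turn gives different totals whenever one algorithm reports a single large event where the other reports several smaller ones. The any-overlap rule, the tuple layout and the function names below are assumptions for illustration and are not taken from any of the tested packages.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def overlaps(a: Event, b: Event) -> bool:
    """True when two events on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def directional_overlap(first: List[Event],
                        second: List[Event]) -> Tuple[int, int]:
    """Count events of each algorithm that overlap any event of the other."""
    first_in_second = sum(
        1 for a in first if any(overlaps(a, b) for b in second)
    )
    second_in_first = sum(
        1 for b in second if any(overlaps(b, a) for a in first)
    )
    return first_in_second, second_in_first
```

For example, one predicted event spanning positions 100 to 1000 compared against three smaller events inside the same region gives counts of 1 and 3, so the overlap depends on the orientation of the comparison.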

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries, or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software package. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test; the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of detected events will contain all CNVs in a sample.

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, such as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.

Table 4: Comparison of event numbers detected for a single sample (NA10861)

Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.

Table 5: Comparison of software event predictions

Dataset | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%)
Birdsuite Affymetrix | 17 (44%) | - | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%)
CNAT Affymetrix | 4 (10%) | 15 (4%) | - | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%)
CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | - | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%)
GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | - | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%)
GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | - | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%)
Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | - | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%)
Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | - | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%)
PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | - | 19 (44%) | 71 (37%) | 21 (35%)
PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | - | 26 (13%) | 28 (47%)
QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | - | 24 (40%)
QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | -

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found that the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is still a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, and future decisions in the field will be extremely enlightening.

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.

As is much commented upon in literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

• A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
• Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
• After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
• An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.


Page 3: Comparing CNVdetection methods

user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).

AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into 'In Bounds' (good samples) and 'Out of Bounds' (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated, and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experiment signal-intensity values against a reference set of data to evaluate copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events.

Tools for analysis of the different Affymetrix chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes; data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for sample NA10861. Chromosome 22 is shown with an event at 23 999 142–24 239 255, confirmed by all statistics. CNV value produced by the CNV Partition algorithm.

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION

Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci are likely to have similar copy number states. Transitions between copy number states are determined by a transition matrix, which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data in both Bayesian and non-Bayesian frameworks by using dynamic programming-based algorithms such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
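The decoding step described above can be illustrated with a minimal Python sketch of the Viterbi algorithm applied to log R ratio values with Gaussian emissions. The three states, their emission means, the shared standard deviation and the probability of remaining in a state are illustrative assumptions only; they are not parameters taken from any of the packages reviewed here.

```python
import math

# Hypothetical copy number states and emission means for the log R ratio.
STATES = [1, 2, 3]                      # deletion, normal, duplication
MEANS = {1: -0.6, 2: 0.0, 3: 0.4}       # assumed values, for illustration only
SD = 0.2                                # assumed shared standard deviation
STAY = 0.98                             # assumed probability of staying in a state


def log_gauss(x, mean, sd):
    """Log density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def log_trans(a, b):
    """Log transition probability: neighbouring loci favour the same state."""
    if a == b:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(lrr):
    """Return the most probable copy number state for each locus."""
    if not lrr:
        return []
    # Uniform initial distribution over states.
    score = {s: -math.log(len(STATES)) + log_gauss(lrr[0], MEANS[s], SD) for s in STATES}
    back = []
    for x in lrr[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log_trans(p, s))
            pointers[s] = best
            score[s] = prev[best] + log_trans(best, s) + log_gauss(x, MEANS[s], SD)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy log R ratio values with a run of four low-intensity probes.
lrr = [0.02, -0.05, 0.01, -0.58, -0.65, -0.61, -0.55, 0.03, 0.00, -0.02]
calls = viterbi(lrr)
```

With these assumed parameters the four consecutive low values are called as a deletion, whereas a single isolated low value would not be, because the cost of two state transitions outweighs the gain from one better-fitting emission. In practice the emission and transition parameters would be learnt from the data rather than fixed by hand.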

Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by the CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.


With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command-line-operated algorithm that is designed to apply HMMs to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons, software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM

A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often cause false positive predictions, and stringent checks at this stage are highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data, included on the software's support website. Both can be run from the command line or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it.
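As an illustration of the kind of per-chromosome quality check described above, the following Python sketch summarizes BAF values by chromosome and flags chromosomes whose spread is unusual. It does not reproduce the QC output of QuantiSNP or PennCNV; the function names, the toy data and the tolerance of 0.05 are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def baf_summary(records):
    """Group (chromosome, BAF) pairs and return {chromosome: (mean, sd)}."""
    by_chrom = defaultdict(list)
    for chrom, baf in records:
        by_chrom[chrom].append(baf)
    return {c: (mean(v), pstdev(v)) for c, v in by_chrom.items()}


def flag_outliers(summary, tolerance=0.05):
    """Flag chromosomes whose BAF SD differs from the median SD by more than `tolerance`.

    The tolerance is an arbitrary illustrative threshold, not a published cut-off.
    """
    typical = median(sd for _, sd in summary.values())
    return sorted(c for c, (_, sd) in summary.items() if abs(sd - typical) > tolerance)


# Toy data: chromosome "4" has an unusually wide spread of BAF values.
records = [
    ("1", 0.0), ("1", 0.5), ("1", 1.0), ("1", 0.5),
    ("2", 0.0), ("2", 0.5), ("2", 1.0), ("2", 0.5),
    ("4", 0.0), ("4", 0.0), ("4", 1.0), ("4", 1.0),
]
summary = baf_summary(records)
outliers = flag_outliers(summary)
```

Comparing each chromosome with the median across chromosomes, rather than with a fixed value, keeps the check relative to the overall noise level of the sample. Whole samples can be screened in the same way by grouping on sample identifier instead of chromosome.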

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by the PennCNV log file are also shown. NB: values shown relate to the Illumina 1M Duo array.

The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the standard Illumina Infinium array: they can be used to process Affymetrix data and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
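Ranking events by log Bayes factor and applying a user-chosen cut-off can be sketched in a few lines of Python. The `Event` record, the default cut-off of 10 and the example values are hypothetical; they do not reflect the QuantiSNP output format or any recommended threshold.

```python
from typing import NamedTuple


class Event(NamedTuple):
    chrom: str
    start: int
    end: int
    log_bf: float


def rank_events(events, min_log_bf=10.0):
    """Keep events at or above the cut-off, strongest evidence first.

    The default cut-off of 10 is an arbitrary illustrative value.
    """
    kept = [e for e in events if e.log_bf >= min_log_bf]
    return sorted(kept, key=lambda e: e.log_bf, reverse=True)


# Invented example calls; the log Bayes factors are not real output.
events = [
    Event("22", 23999142, 24239255, 85.2),
    Event("4", 1200000, 1215000, 3.1),
    Event("7", 5400000, 5460000, 27.9),
]
ranked = rank_events(events)
```

Raising the cut-off trades sensitivity for fewer false positives, so a suitable value depends on the noise level of the dataset.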

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. PennCNV also integrates a pipeline for Affymetrix data analysis. The package includes a number of options for further analysis of event results, such as a script to compare events to known gene libraries, or scripts to change the output format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).
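Reformatting an event list for a genome browser can be illustrated with a short Python sketch that writes calls as a BED custom track, a format accepted by the UCSC browser. This is not the conversion script shipped with PennCNV; the function name, the input tuple layout and the colour scheme are assumptions made for the example.

```python
def events_to_bed(events, track_name="CNV calls"):
    """Format (chrom, start, end, copy_number) tuples as a UCSC BED custom track.

    Input coordinates are assumed to be 1-based and inclusive; BED uses a
    0-based start and an exclusive end, so only the start is shifted.
    """
    lines = [f'track name="{track_name}" itemRgb="On"']
    for chrom, start, end, cn in events:
        rgb = "255,0,0" if cn < 2 else "0,0,255"   # red for losses, blue otherwise
        fields = [f"chr{chrom}", str(start - 1), str(end), f"CN{cn}",
                  "0", ".", str(start - 1), str(end), rgb]
        lines.append("\t".join(fields))
    return "\n".join(lines)


# The chromosome 22 event from Figure 1, written as a copy number 1 call.
bed_text = events_to_bed([("22", 23999142, 24239255, 1)])
```

Writing one such track per algorithm makes it straightforward to load several result sets side by side and compare predicted boundaries by eye.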

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome position, but its functions, in particular the quality control options, are best suited to values generated on the Affymetrix platform. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses an HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH

A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm recursively divides a region into segments, stopping when it can no longer find a segment which differs from the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined: the segment ends are joined, forming a circle, to allow a further likelihood ratio test of whether the two resulting arcs have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
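The recursive splitting described above can be sketched in Python as follows. This is a simplified illustration and not the published CBS implementation: the published method assesses significance against a permutation-based reference distribution, whereas this sketch substitutes a fixed threshold on a standardized mean difference and assumes the noise standard deviation `sigma` is known. The function names and the toy data are hypothetical.

```python
from itertools import accumulate
from statistics import median


def best_arc(x):
    """Find the arc [i, j) whose mean differs most from the rest of the segment.

    Treating the segment as a circle lets one test cover both a single
    change-point (i == 0) and a short event embedded in a longer segment.
    Returns (z, i, j), where z is the largest scaled absolute mean difference.
    """
    n = len(x)
    csum = [0.0] + list(accumulate(x))
    total = csum[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue                      # the arc must leave something outside
            inside = csum[j] - csum[i]
            diff = inside / k - (total - inside) / (n - k)
            z = abs(diff) / (1.0 / k + 1.0 / (n - k)) ** 0.5
            if z > best[0]:
                best = (z, i, j)
    return best


def segment(x, sigma, threshold=4.0, offset=0):
    """Recursively split x; return (start, end, median) for each final segment."""
    if len(x) >= 2:
        z, i, j = best_arc(x)
        if z / sigma > threshold:             # fixed cut-off stands in for a permutation test
            out = []
            for a, b in [(0, i), (i, j), (j, len(x))]:
                if b > a:
                    out.extend(segment(x[a:b], sigma, threshold, offset + a))
            return out
    return [(offset, offset + len(x), median(x))]


# Toy log-ratio profile: ten normal probes, five deleted probes, ten normal probes.
noise = [0.01, -0.02, 0.0, 0.02, -0.01]
x = noise * 2 + [-0.6, -0.62, -0.58, -0.61, -0.59] + noise * 2
segs = segment(x, sigma=0.05)
```

Each returned tuple gives the half-open probe index range of a segment together with its median log-ratio, the cluster value used to assign copy number status. The exhaustive search over arcs is quadratic in the number of probes, which is acceptable for a demonstration but would need the usual speed-ups for whole-genome data.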

An alternative to the CBS algorithm was developed by Pique-Regi et al. [24], which can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage in speed of data processing was clear and we were able to analyse data within a few minutes.

There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25], focused on the arrayCGH format, present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS

Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly.

The ITALICS [26] software focuses on the removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (Iterative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE

The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and work flows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, and may not have access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach, which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy–Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
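The idea of using deviation from HWE as an indicator can be illustrated with the textbook chi-square test on biallelic genotype counts. This Python sketch is not TriTyper's own test, and the example counts are invented; it only shows why a heterozygote deficit, as would arise if carriers of a null allele were miscalled as homozygotes, produces a large statistic.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 d.f.) for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Invented counts: too few heterozygotes for the observed allele frequencies.
deficit = hwe_chi_square(60, 20, 20)
# Invented counts in exact Hardy-Weinberg proportions (p = 0.7).
balanced = hwe_chi_square(49, 42, 9)
```

For the first set of counts the expected genotype numbers are 49, 42 and 9, so the statistic is well above 3.84, the 5% critical value for one degree of freedom, and the SNP would be a candidate for triallelic genotyping under this simple rule.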

COMPARING THE DETECTION ALGORITHMS

There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the algorithms with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV-detected events were confirmed, but other algorithms have actually detected more confirmed events in total.
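The comparison against published events can be sketched in Python as a simple interval overlap test followed by a percentage. The sketch assumes that any shared base between a detected event and a published event counts as confirmation; the coordinates are invented and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def confirmed(detected, published):
    """Return the detected events that overlap any published event."""
    return [d for d in detected if any(overlaps(d, p) for p in published)]


# Invented coordinates for illustration.
detected = [("1", 100, 200), ("1", 500, 600), ("2", 100, 200), ("3", 50, 80)]
published = [("1", 150, 250), ("2", 300, 400), ("3", 10, 60)]

hits = confirmed(detected, published)
pct = round(100 * len(hits) / len(detected))
```

Here two of the four detected events are confirmed, giving 50%. Reporting both the count and the percentage, as in Table 2, matters because an algorithm that makes few calls can show a high percentage while confirming fewer events overall.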

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that, although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

| Software | Platform | Related publication | Details | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach; single association of SNPs and CN | Availability limited to Affymetrix data |
| CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events) |
| CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events) |
| Dchip SNP | Affymetrix or Illumina | [22] | Stand-alone software | Free viewer for all data | Limited applications for Illumina data |
| GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker |
| HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific |
| ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format) |
| Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available |
| PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood |
| QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis |
| SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use |
| TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions |

Table 2: Comparison of algorithms

| Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28] |
|---|---|---|---|
| Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%) |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%) |
| GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%) |
| GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%) |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%) |
| PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%) |
| QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%) |
| QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%) |

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.


Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with the base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

| | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%) |
| GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%) |
| PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%) |
| QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%) |

Data from CEPH sample NA15510 on 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with the base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive, detecting more events than the proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values arises because one algorithm often reports several smaller events within a single event found by another. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
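Why each comparison yields two values can be seen in a small Python sketch. One algorithm reports a single large event while a second reports several smaller events inside it, so the count depends on which list is taken as the starting point. The coordinates and function names are invented for illustration, and any shared base is assumed to count as an overlap.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_found(events, other):
    """Number of `events` that overlap at least one event in `other`."""
    return sum(any(overlaps(e, o) for o in other) for e in events)


# Invented calls: algorithm A makes one large call, algorithm B splits it up.
algo_a = [("22", 1000, 9000)]
algo_b = [("22", 1500, 2500), ("22", 4000, 5000), ("22", 7000, 8500), ("5", 100, 200)]

a_in_b = count_found(algo_a, algo_b)   # events of A also seen by B
b_in_a = count_found(algo_b, algo_a)   # events of B also seen by A
```

In this example all of algorithm A's events (one of one) are found by B, but only three of B's four events are found by A, so the two directions give 100% and 75% for the same underlying region.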

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries, or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test; the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events it becomes

Table 4: Comparison of event numbers detected for a single sample (NA10861)

| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5: Comparison of software event predictions

| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | - | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%) |
| Birdsuite Affymetrix | 17 (44%) | - | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%) |
| CNAT Affymetrix | 4 (10%) | 15 (4%) | - | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%) |
| CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | - | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%) |
| GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | - | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%) |
| GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | - | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%) |
| Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | - | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%) |
| Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | - | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%) |
| PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | - | 19 (44%) | 71 (37%) | 21 (35%) |
| PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | - | 26 (13%) | 28 (47%) |
| QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | - | 24 (40%) |
| QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | - |

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; bracketed value shows percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.


increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in resources such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, as displayed in Figure 5. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is still a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, and future decisions in the field will be extremely enlightening.

As is much commented upon in literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.

2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.

3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.

4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.

5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.

6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.

7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.

8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.

9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.

10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.

11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.

12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.

13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.

14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

• A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
• Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
• After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
• An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.


data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION

Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix, which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data in both Bayesian and non-Bayesian frameworks, by using dynamic programming-based algorithms such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the most likely state for each observation.
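The decoding step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the packages reviewed here: the three states, their mean log-ratio values, the shared standard deviation and the probability of staying in the same state are all assumed values chosen for the example, and the function names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a Gaussian emission distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(observations, means, sd, stay_prob):
    """Most likely state path for a Gaussian-emission HMM.

    means[k] is the assumed mean intensity for state k; stay_prob is the
    assumed probability that neighbouring loci share the same state.
    """
    n_states = len(means)
    switch_prob = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [
        [math.log(stay_prob if i == j else switch_prob) for j in range(n_states)]
        for i in range(n_states)
    ]
    # Uniform prior over the starting state.
    score = [
        math.log(1.0 / n_states) + gaussian_logpdf(observations[0], m, sd)
        for m in means
    ]
    back_pointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(
                score[best_i] + log_trans[best_i][j] + gaussian_logpdf(x, means[j], sd)
            )
        score = new_score
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical states: index 0 = one copy, 1 = two copies, 2 = three copies.
STATE_MEANS = [-0.6, 0.0, 0.4]
log_ratios = [0.02, -0.05, 0.01, -0.58, -0.63, -0.55, 0.03, -0.02]
print(viterbi(log_ratios, STATE_MEANS, sd=0.15, stay_prob=0.99))
```

A high value of `stay_prob` encodes the assumption that neighbouring loci share a copy number state, so isolated noisy probes are less likely to be called as events than a run of consistent values.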

Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.

With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command line-operated algorithm that is designed to apply the HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons, software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM

A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user to apply it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often cause false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for the B allele frequency (BAF) by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run from the command line or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it.
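As an illustration of this kind of quality check, the sketch below flags chromosomes whose BAF standard deviation lies far from the sample median. The input values, the function name and the cut-off of three median absolute deviations are assumptions made for the example; they are not taken from the QuantiSNP or PennCNV documentation.

```python
from statistics import median


def flag_outlier_chromosomes(baf_sd_by_chrom, n_mads=3.0):
    """Return chromosomes whose BAF standard deviation is an outlier.

    baf_sd_by_chrom maps a chromosome name to the standard deviation of
    BAF on that chromosome. A chromosome is flagged when it lies more
    than n_mads median absolute deviations from the median.
    """
    values = list(baf_sd_by_chrom.values())
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []
    return [
        chrom
        for chrom, value in baf_sd_by_chrom.items()
        if abs(value - centre) > n_mads * mad
    ]


# Invented per-chromosome values for a single sample.
example = {"1": 0.040, "2": 0.041, "3": 0.039, "4": 0.075,
           "5": 0.040, "6": 0.042, "X": 0.090}
print(flag_outlier_chromosomes(example))
```

A median-based rule is used here so that the outlying chromosomes themselves do not inflate the spread estimate; a mean and standard deviation rule would be an equally reasonable starting point.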

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by PennCNV log file also shown. NB: values shown relate to Illumina 1M Duo array.

The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the standard Illumina Infinium array: they can be used to process Affymetrix data and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries, or for changing the format to be suitable for viewers such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to values generated on the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses the HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH

A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm will continue to divide a region into segments until it finds a segment which is different to the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow the defining of a single change inside a large segment. Segment ends were joined, forming a circle, to allow a further likelihood ratio test that the contents have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
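The segmentation idea can be sketched as follows. This is a deliberately simplified illustration and not the published CBS procedure: the noise standard deviation is assumed to be known, a fixed z-score threshold stands in for the permutation-based reference distribution described by Olshen et al. [23], and the function names, threshold and data are hypothetical.

```python
import math
from statistics import median


def best_arc(values):
    """Find the arc values[i:j] whose mean differs most from the rest.

    Returns (score, i, j), where score is the absolute difference in
    means scaled by sqrt(1/k + 1/(n - k)) for an arc of length k.
    """
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave something outside it
            inside = prefix[j] - prefix[i]
            diff = inside / k - (total - inside) / (n - k)
            score = abs(diff) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if score > best[0]:
                best = (score, i, j)
    return best


def segment(values, noise_sd, threshold=4.0, offset=0):
    """Recursively split values into segments of similar mean.

    Returns a list of (start, end) index pairs, end exclusive.
    """
    score, i, j = best_arc(values)
    if score / noise_sd < threshold:
        return [(offset, offset + len(values))]
    pieces = []
    for start, end in ((0, i), (i, j), (j, len(values))):
        if end > start:
            pieces.extend(segment(values[start:end], noise_sd, threshold, offset + start))
    return pieces


# Invented log-ratio values with a four-probe loss in the middle.
log_ratios = [0.02, -0.03, 0.01, 0.04, -0.02,
              -0.55, -0.62, -0.58, -0.60,
              0.03, -0.01, 0.02, -0.04, 0.01]
for start, end in segment(log_ratios, noise_sd=0.05):
    # The segment median plays the role of the cluster value in the text.
    print(start, end, median(log_ratios[start:end]))
```

Testing an arc against everything outside it, rather than testing a single split point, is what allows a short change inside a long segment to be picked up in one step.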

An alternative to the CBS algorithm, which can now be applied to SNP arrays, was developed by Pique-Regi et al. [24]. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear and we were able to analyse data within a few minutes.

There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS

Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to correctly carry out the analysis.

The ITALICS [26] software focuses analysis on removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regression to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE

The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, and are thereby prevented from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs. This uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy–Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
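A generic check for deviation from HWE can be sketched as follows. This is the standard one degree of freedom chi-square test on biallelic genotype counts, shown only to illustrate why a SNP inside a deletion, where hemizygous samples are called as homozygotes, shows an apparent excess of homozygotes. It is not TriTyper's own procedure, and the genotype counts are invented.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg equilibrium.

    Takes observed counts of the AA, AB and BB genotypes and compares
    them with the counts expected from the observed allele frequency.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    return sum(
        (obs - exp) ** 2 / exp
        for obs, exp in zip(observed, expected)
        if exp > 0
    )


# 3.84 is the 5% critical value of the chi-square distribution with 1 df.
CRITICAL_5_PERCENT = 3.84

in_equilibrium = hwe_chi_square(49, 42, 9)
excess_homozygotes = hwe_chi_square(60, 20, 20)
print(in_equilibrium > CRITICAL_5_PERCENT, excess_homozygotes > CRITICAL_5_PERCENT)
```

A SNP that fails such a test is a candidate for re-examination with a null allele rather than proof of a deletion, since genotyping error and population structure also produce deviations.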

COMPARING THE DETECTION ALGORITHMS

There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results of a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV-detected events were confirmed, but other algorithms have actually detected more in total.
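The confirmation percentages discussed here can be reproduced in outline with a simple interval comparison. The sketch below is an assumption about how such a comparison might be coded, not the script used for this review: events are represented as hypothetical (chromosome, start, end) tuples, any shared base counts as an overlap, and the coordinates are invented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def confirmation_rate(detected, published):
    """Count detected events that overlap any published event.

    Returns (number confirmed, percentage of detected events confirmed).
    """
    confirmed = sum(
        1 for event in detected if any(overlaps(event, ref) for ref in published)
    )
    percentage = 100.0 * confirmed / len(detected) if detected else 0.0
    return confirmed, percentage


# Invented coordinates for illustration only.
detected = [("chr1", 100, 500), ("chr1", 900, 1200),
            ("chr2", 50, 80), ("chr22", 1000, 3000)]
published = [("chr1", 450, 700), ("chr22", 2500, 2600), ("chr3", 10, 20)]
print(confirmation_rate(detected, published))
```

Because the percentage is taken over the detected events, an algorithm that calls few events can score well while confirming fewer published events in absolute terms, which is the point made about PennCNV above.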

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% for all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

| Software | Platform | Related publication | Details | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach: single association of SNPs and CN | Availability limited to Affymetrix data |
| CNAT | Affymetrix | Technical notes | Proprietary, run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events) |
| CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary, run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events) |
| Dchip SNP | Affymetrix or Illumina | [22] | Stand-alone software | Free viewer for all data | Limited applications for Illumina data |
| GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker |
| HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific |
| ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format) |
| Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available |
| PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood |
| QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or Linux command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis |
| SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use |
| TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions |

Table 2: Comparison of algorithms

| Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28] |
|---|---|---|---|
| Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%) |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%) |
| GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%) |
| GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%) |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%) |
| PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%) |
| QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%) |
| QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%) |

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.


Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array Illumina platform used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with the base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

| | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4%) | 22 (5%) | 9 (4%) |
| GADA (R 0.7-5) | 69 | 68 (23%) | 85 (18%) | 42 (19%) |
| PennCNV (2009Jan06) | 81 | 18 (6%) | 28 () | 30 (14%) |
| QuantiSNP v1.1 | 64 | 18 (6%) | 41 (9%) | 29 (13%) |

Data from CEPH sample NA15510 on 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with the base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 125 overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
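The reason each pairwise comparison yields two numbers can be seen in a small sketch. The event lists and function names below are hypothetical and the any-overlap rule is an assumption based on the description of the comparison; the point is only that counting from each side gives different totals when one algorithm reports a single large event that another splits into several smaller ones.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_found_in(query, other):
    """Number of events in query that overlap at least one event in other."""
    return sum(1 for q in query if any(overlaps(q, o) for o in other))


# Invented calls: algorithm A reports one large event on chr4,
# algorithm B reports the same region as two smaller events.
algorithm_a = [("chr4", 1000, 9000), ("chr7", 100, 200)]
algorithm_b = [("chr4", 1000, 3000), ("chr4", 5000, 9000), ("chr9", 1, 50)]

a_in_b = count_found_in(algorithm_a, algorithm_b)
b_in_a = count_found_in(algorithm_b, algorithm_a)
print(a_in_b, b_in_a)
```

Reporting both directions, as in Table 5, keeps this fragmentation visible instead of hiding it in a single symmetric overlap count.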

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes

Table 4: Comparison of event numbers detected for a single sample (NA10861)

| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5: Comparison of software event predictions

| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%) |
| Birdsuite Affymetrix | 17 (44%) | | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%) |
| CNAT Affymetrix | 4 (10%) | 15 (4%) | | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%) |
| CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%) |
| GADA Affymetrix | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%) |
| GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%) |
| Nexus Affymetrix | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%) |
| Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%) |
| PennCNV Affymetrix | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | | 19 (44%) | 71 (37%) | 21 (35%) |
| PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | | 26 (13%) | 28 (47%) |
| QuantiSNP Affymetrix | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | | 24 (40%) |
| QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | |

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore, overlap of events depends on analysis orientation. Total value represents the number of events for the software on the horizontal axis found in the other software dataset; bracketed value shows percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.

Comparing CNVdetection methods for SNParrays page 11 of 14 by guest on February 21 2014

httpbfgoxfordjournalsorgD

ownloaded from

increasingly important to be able to summarize and use these data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared with known copy number data in the DGV, as displayed in Figure 5. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.


analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, and future decisions in the field will be extremely enlightening.

As is often noted in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis, and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.

2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.

3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.

4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.

5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.

6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.

7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.

8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.

9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.

10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.

11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.

12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.

13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.

14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.

Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.

16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.

17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.

18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.

19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.

20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.

21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.

22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.

23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.

24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.

25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.

26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.

27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.

28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.

29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.

30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.

31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.

32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.


With prior knowledge of modelling statistics, there are a multitude of options for copy number detection. HMMSeg [17] is a command-line operated algorithm that is designed to apply an HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For this reason, software has been developed which allows guided application of these types of advanced methods.
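To illustrate the kind of model these tools wrap for the user, the following is a minimal sketch of HMM decoding (the Viterbi algorithm) applied to log R ratio values. It is not the HMMSeg, QuantiSNP or PennCNV implementation: the function name `viterbi_copy_number`, the state means, the noise level `sd` and the `stay` probability are illustrative assumptions, and the published tools also model BAF and estimate their parameters from the data.

```python
import math


def viterbi_copy_number(lrr, means, sd=0.2, stay=0.99):
    """Return the most likely copy-number state for each probe.

    lrr   : list of log R ratio values, ordered along the chromosome
    means : dict mapping copy-number state -> assumed mean log R ratio
    sd    : assumed Gaussian noise level (hypothetical value)
    stay  : assumed probability of remaining in the same state
    """
    states = list(means)
    n = len(states)
    switch = (1.0 - stay) / (n - 1)  # probability of moving to each other state

    def log_emit(x, mu):
        # Log density of a Gaussian emission centred on the state mean.
        return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    # Uniform prior over states for the first probe.
    score = [log_emit(lrr[0], means[s]) - math.log(n) for s in states]
    back = []
    for x in lrr[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(states):
            cands = [score[i] + math.log(stay if i == j else switch)
                     for i in range(n)]
            best = max(range(n), key=cands.__getitem__)
            pointers.append(best)
            new_score.append(cands[best] + log_emit(x, means[s]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    idx = max(range(n), key=score.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [states[i] for i in path]


# Hypothetical state means: one-copy deletion, normal, one-copy gain.
example_means = {1: -0.6, 2: 0.0, 3: 0.4}
example_lrr = [0.0] * 5 + [-0.6] * 5 + [0.0] * 5
print(viterbi_copy_number(example_lrr, example_means))
```

A high `stay` probability makes the decoder reluctant to change state on a single noisy probe, which is one reason HMM-based callers report contiguous events rather than isolated probes.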

GUIDED APPLICATION OF THE HMM

A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user to apply it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often cause false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot average and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run using command line options or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it.
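The quality check described above can be sketched as follows. This is an illustrative outlier screen and not the QC procedure of QuantiSNP or PennCNV: the function name `flag_noisy_chromosomes`, the input layout (a dict of BAF values per chromosome) and the threshold `k` are assumptions made for the example.

```python
from statistics import median, pstdev


def flag_noisy_chromosomes(baf_by_chrom, k=3.0):
    """Flag chromosomes whose BAF standard deviation is unusually high.

    baf_by_chrom : dict mapping chromosome name -> list of BAF values
    k            : hypothetical cut-off, in median absolute deviations
    """
    sds = {chrom: pstdev(values) for chrom, values in baf_by_chrom.items()}
    centre = median(sds.values())
    # Median absolute deviation gives a spread estimate that the
    # outliers themselves do not inflate.
    mad = median(abs(sd - centre) for sd in sds.values())
    return sorted(chrom for chrom, sd in sds.items() if sd > centre + k * mad)


example = {
    "1": [0.48, 0.52, 0.50, 0.50],
    "2": [0.49, 0.51, 0.50, 0.50],
    "3": [0.47, 0.53, 0.50, 0.50],
    "4": [0.20, 0.80, 0.30, 0.70],
}
print(flag_noisy_chromosomes(example))
```

A flagged chromosome is a prompt to inspect the sample, since a genuine large event and a noisy hybridisation can both raise the BAF spread.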

The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosome 4 and X are outliers. Values produced by PennCNV log file also shown. NB: Values shown relate to Illumina 1M Duo array.


standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.
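Ranking events by log Bayes factor and applying a user-chosen cut-off can be sketched as below. The record layout, the field name `log_bf` and the cut-off of 10 are assumptions for illustration only; they are not the QuantiSNP output format or a recommended threshold.

```python
def rank_events(events, min_log_bf=10.0):
    """Keep events at or above a log Bayes factor cut-off, strongest first.

    events     : list of dicts, each with a 'log_bf' entry (hypothetical layout)
    min_log_bf : user-chosen cut-off (hypothetical default)
    """
    kept = [event for event in events if event["log_bf"] >= min_log_bf]
    return sorted(kept, key=lambda event: event["log_bf"], reverse=True)


calls = [
    {"id": "a", "chrom": "4", "start": 69_000_000, "end": 69_120_000, "log_bf": 3.2},
    {"id": "b", "chrom": "1", "start": 1_500_000, "end": 1_650_000, "log_bf": 45.0},
    {"id": "c", "chrom": "X", "start": 6_400_000, "end": 6_450_000, "log_bf": 12.5},
]
for event in rank_events(calls):
    print(event["id"], event["log_bf"])
```

Because the cut-off is applied after calling, it can be tightened or relaxed for a given sample set without rerunning the detection step.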

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries, or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu).

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to values generated on the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses an HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH

A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm will continue to divide a region into segments until it finds a segment which is different to the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow the defining of a single change inside a large segment. Segment ends were joined, forming a circle, to allow a further likelihood ratio test that the content has different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
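The core search step of circular binary segmentation can be sketched as follows: every contiguous arc of probes is compared with the rest of the (circularised) segment, and the arc with the largest standardised difference in means is the candidate change. This is a simplified illustration, not the published implementation: the function name `best_circular_split` and the `min_len` parameter are assumptions, the statistic is not scaled by a noise estimate, and the permutation test used to decide whether a split is significant is omitted.

```python
from math import sqrt


def best_circular_split(values, min_len=2):
    """Find the arc [start, end) whose mean differs most from the remainder.

    values  : list of log-ratio values ordered along the chromosome
    min_len : hypothetical minimum number of probes on each side of a split
    Returns (start, end, statistic); start and end are None if no arc qualifies.
    """
    n = len(values)
    total = sum(values)
    best = (None, None, 0.0)
    for i in range(n):
        arc_sum = 0.0
        for j in range(i + 1, n + 1):
            arc_sum += values[j - 1]
            k = j - i  # number of probes inside the arc
            if k < min_len or n - k < min_len:
                continue
            diff = arc_sum / k - (total - arc_sum) / (n - k)
            stat = abs(diff) / sqrt(1.0 / k + 1.0 / (n - k))
            if stat > best[2]:
                best = (i, j, stat)
    return best


profile = [0.0] * 6 + [-1.0] * 4 + [0.0] * 6
print(best_circular_split(profile))
```

In a full implementation the split is accepted only if the statistic is significant, and the procedure is then repeated recursively on each resulting segment until no further change-points are found.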

An alternative to the CBS algorithm was developed by Pique-Regi et al. [24], which can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear and we were able to analyse data within a few minutes.

There are many other algorithms developed that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS

Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to correctly carry out the analysis.

The ITALICS [26] software focuses analysis on removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE

The strength of the software packages available to purchase lies in a number of traits: the ability


to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and work flows, optimized computational speed and technical support. These factors are all extremely useful to those labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public License, and are thereby prevented from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs. This uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage on the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from the HWE as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
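The idea of using deviation from HWE as a flag can be illustrated with the standard chi-square test on biallelic genotype counts. This is a generic textbook calculation and not the TriTyper procedure; the function name `hwe_chi_square` is an assumption for the example. A deletion allele tends to produce apparent homozygote excess, which inflates the statistic.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb : observed counts of the three biallelic genotypes
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# With one degree of freedom, a value above 3.84 corresponds to P < 0.05.
print(hwe_chi_square(25, 50, 25))  # genotype counts in equilibrium
print(hwe_chi_square(40, 20, 40))  # marked heterozygote deficit
```

A SNP that fails this test is not necessarily inside a deletion, since genotyping error and population structure also cause departures, so the statistic is better treated as a trigger for closer inspection than as a call in itself.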

COMPARING THE DETECTION ALGORITHMS

There are a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27% of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion. 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in


all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach: single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions

Table 2: Comparison of algorithms

Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28]
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31)

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.


Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array Illumina platform used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4)
GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19)
PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14)
QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13)

Data from CEPH sample NA15510 on 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and un-validated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis that although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different Software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases, we found the academically developed software to be more sensitive and detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
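The orientation dependence of the overlap counts in Table 5 can be reproduced with a short sketch. It is not the script used to generate the table: the event representation as `(chromosome, start, end)` tuples and the function names `overlaps` and `count_overlapping` are assumptions for illustration, and any shared base is treated as an overlap.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) events share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query, reference):
    """Count query events that overlap any reference event.

    Returns (count, percentage of query events), so swapping the two
    arguments gives the second orientation of the comparison.
    """
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    percent = round(100 * hits / len(query)) if query else 0
    return hits, percent


# One algorithm splits a region into three calls; the other reports one.
fragmented = [("1", 100, 200), ("1", 300, 400), ("1", 500, 600)]
merged = [("1", 50, 650), ("2", 1, 10)]

print(count_overlapping(fragmented, merged))
print(count_overlapping(merged, fragmented))
```

In this example three fragmented calls all fall inside one merged call, so the count is three in one orientation and one in the other, which is the asymmetry the paired values in Table 5 are intended to expose.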

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters, and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing accuracy of detection algorithms, but it is also important to take into account that the different predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes

Table 4: Comparison of event numbers detected for a single sample (NA10861)

Algorithm                              Platform and array    Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye)    Affymetrix 6.0        137
CNAT (Genome Console 3.0.2)            Affymetrix 6.0        10
CNVPartition 1.2.1                     Illumina 1M Duo       16
GADA (R 0.7-5)                         Affymetrix 6.0        613
GADA (R 0.7-5)                         Illumina 1M Duo       87
Nexus Biodiscovery 4.0.1               Affymetrix 6.0        111
Nexus Biodiscovery 4.0.1               Illumina 1M Duo       8
PennCNV (2009Jan06)                    Affymetrix 6.0        67
PennCNV (2009Jan06)                    Illumina 1M Duo       43
QuantiSNP v2.0                         Affymetrix 6.0        193
QuantiSNP v1.1                         Illumina 1M Duo       60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5: Comparison of software event predictions

        Redon     Bird-A    CNAT-A    CNVP-I    GADA-A    GADA-I    Nex-A     Nex-I     Penn-A    Penn-I    Quan-A    Quan-I
Redon   -         17 (4)    4 (40)    3 (19)    32 (5)    2 (2)     11 (10)   2 (25)    12 (18)   7 (16)    18 (9)    8 (13)
Bird-A  17 (44)   -         9 (90)    13 (81)   135 (22)  21 (24)   62 (56)   6 (75)    43 (64)   20 (47)   97 (50)   20 (33)
CNAT-A  4 (10)    15 (4)    -         4 (25)    34 (6)    0         23 (21)   1 (13)    13 (19)   2 (5)     17 (9)    5 (8)
CNVP-I  3 (8)     16 (4)    4 (40)    -         37 (6)    7 (8)     20 (18)   7 (88)    9 (13)    11 (26)   16 (8)    16 (27)
GADA-A  17 (44)   106 (28)  9 (90)    13 (81)   -         32 (37)   91 (82)   7 (88)    58 (87)   23 (53)   153 (79)  27 (45)
GADA-I  2 (5)     96 (25)   0         13 (81)   208 (34)  -         25 (23)   2 (25)    26 (30)   17 (40)   67 (35)   23 (38)
Nex-A   7 (18)    57 (15)   10 (100)  7 (44)    116 (19)  8 (9)     -         4 (50)    45 (67)   15 (35)   78 (40)   17 (28)
Nex-I   2 (5)     6 (2)     1 (10)    7 (44)    22 (4)    2 (2)     4 (4)     -         6 (9)     7 (16)    10 (5)    9 (15)
Penn-A  11 (28)   51 (13)   10 (100)  9 (56)    105 (17)  10 (11)   65 (59)   6 (75)    -         19 (44)   71 (37)   21 (35)
Penn-I  6 (15)    25 (7)    2 (20)    11 (69)   44 (7)    9 (10)    23 (21)   6 (75)    18 (27)   -         26 (13)   28 (47)
Quan-A  14 (36)   97 (25)   10 (100)  10 (63)   199 (32)  18 (21)   86 (77)   7 (88)    65 (97)   21 (49)   -         24 (40)
Quan-I  6 (15)    14 (4)    5 (50)    15 (94)   55 (9)    10 (11)   30 (27)   8 (100)   23 (34)   32 (74)   31 (16)   -

Key: Redon = published data (Redon et al.); Bird = Birdsuite; CNVP = CNVPartition; Nex = Nexus; Penn = PennCNV; Quan = QuantiSNP; -A = Affymetrix; -I = Illumina. A dash marks the diagonal.

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; the overlap of events therefore depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
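Because each comparison in Table 5 is made in both directions, the two counts for a pair of algorithms can differ. A minimal sketch of that counting rule follows; the function names, the tuple layout and the coordinates are illustrative assumptions, not the script used to build the table.

```python
def overlaps(a, b):
    """Events are (chrom, start, end) tuples; any shared base counts as overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def found_in(query, reference):
    """Return the number and percentage of query events that overlap a reference event."""
    hits = sum(1 for q in query if any(overlaps(q, r) for r in reference))
    percent = round(100 * hits / len(query)) if query else 0
    return hits, percent


# Hypothetical example: one algorithm reports a single large event,
# the other splits the same region into three calls and adds one elsewhere.
large_calls = [("chr3", 1000, 9000)]
small_calls = [
    ("chr3", 1200, 2000),
    ("chr3", 3000, 4500),
    ("chr3", 6000, 8800),
    ("chr7", 50, 400),
]

print(found_in(large_calls, small_calls))  # (1, 100)
print(found_in(small_calls, large_calls))  # (3, 75)
```

The asymmetry between the two printed results is the effect described in the legend: several small events in one output can fall inside a single event in the other.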


increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser UCSC (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.
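Event lists can be prepared for a genome browser as a custom track. The sketch below writes events in BED format, which the UCSC browser accepts for custom tracks. It assumes, hypothetically, that the input coordinates are 1-based and inclusive (BED start positions are 0-based); the function name and the sample events are illustrative only.

```python
def to_bed_track(events, track_name, description):
    """Format (chrom, start, end, label) events as a BED custom track string."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, label in sorted(events):
        # Assumed input: 1-based inclusive coordinates, converted to BED's
        # 0-based start and exclusive end.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical events from one algorithm.
events = [("chr2", 5001, 9000, "gain"), ("chr1", 1001, 2000, "loss")]
track = to_bed_track(events, "algorithm_A", "demo calls")
print(track)
```

Writing one such track per algorithm and loading them together gives the side-by-side view described above.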

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.


analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
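As a point of reference for the association strategies above, the simplest possible test treats each sample as a carrier or non-carrier of a CNV and compares cases with controls. The sketch below uses a two-sided Fisher's exact test for this. It is a deliberately simpler stand-in and is not the likelihood ratio test of CNVtools: it assumes discrete, error-free calls and so does not address the signal noise that the combined model was designed for. The function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(case_carriers, case_non, control_carriers, control_non):
    """Two-sided Fisher's exact test on a 2x2 table of CNV carrier counts."""
    cases = case_carriers + case_non
    controls = control_carriers + control_non
    carriers = case_carriers + control_carriers
    total = cases + controls

    def probability(k):
        # Hypergeometric probability of k carriers among the cases.
        return comb(cases, k) * comb(controls, carriers - k) / comb(total, carriers)

    observed = probability(case_carriers)
    lowest = max(0, carriers - controls)
    highest = min(cases, carriers)
    # Sum every table that is no more likely than the observed one.
    return sum(
        p
        for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 8 of 10 cases and 1 of 6 controls carry the deletion.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))  # 0.035
```

With realistic sample sizes the same caveats on power and sample size apply as for SNP association studies.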

As is often commented upon in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org.

Acknowledgements
The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING
J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

Key Points
• A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
• Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
• After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
• An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.

References
1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.


Page 6: Comparing CNVdetection methods

standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for Affymetrix data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based genome browser UCSC (http://www.genome.ucsc.edu).

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to values generated on the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses an HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAY CGH
A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm will continue to divide a region into segments until it finds a segment which is different to the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined: segment ends are joined, forming a circle, to allow a further likelihood ratio test of whether the contents have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
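The idea of recursive change-point detection can be sketched in a few lines. The example below is plain binary segmentation with a fixed threshold, not the circular variant and not the published CBS implementation: it omits the step that joins segment ends into a circle and uses a hypothetical `min_gain` cut-off in place of a formal test. Each final segment is summarised by the median of its values, as described above.

```python
from statistics import median


def best_split(values):
    """Return (index, gain) for the split that most reduces the residual sum of squares."""
    n = len(values)
    total = sum(values)
    best_index, best_gain = None, 0.0
    left = 0.0
    for i in range(1, n):
        left += values[i - 1]
        right = total - left
        # Reduction in squared error when the means of the two halves differ.
        gain = left * left / i + right * right / (n - i) - total * total / n
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def segment(values, min_gain, offset=0):
    """Recursively split values; return a list of (start, end, median) segments."""
    index, gain = best_split(values)
    if index is None or gain < min_gain:
        return [(offset, offset + len(values), median(values))]
    return (
        segment(values[:index], min_gain, offset)
        + segment(values[index:], min_gain, offset + index)
    )


# Hypothetical noise-free log-ratios with a four-probe deletion in the middle.
log_ratios = [0.0] * 5 + [-1.0] * 4 + [0.0] * 5
segments = segment(log_ratios, min_gain=0.5)
print(segments)  # [(0, 5, 0.0), (5, 9, -1.0), (9, 14, 0.0)]
```

On real, noisy intensities the choice of threshold strongly affects where breakpoints fall, which is one reason predicted event sizes vary between runs and algorithms.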

An alternative to the CBS algorithm, which can now be applied to SNP arrays, was developed by Pique-Regi et al. [24]. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage in speed of data processing was clear, and we were able to analyse data within a few minutes.

Many other algorithms have been developed that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly.

The ITALICS [26] software focuses analysis on the removal of unwanted events found in Affymetrix data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and work flows, optimized computational speed and technical support. These factors are all extremely useful to labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public Licence, and so are prevented from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection, which is based on CBS but has been modified to increase the speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING
Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses an HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach, which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy–Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
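The use of HWE deviation as a trigger can be illustrated with a standard goodness-of-fit test. This is a sketch only: the exact criterion used by TriTyper is not given here, so the chi-square statistic, the 5% threshold and the function names below are assumptions chosen for illustration. A deletion allele tends to make hemizygous samples look homozygous, producing the heterozygote deficit shown in the second example.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for departure from HWE proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


CRITICAL_5_PERCENT = 3.84  # chi-square critical value for 1 degree of freedom


def flag_for_triallelic_calling(n_aa, n_ab, n_bb):
    """Assumed rule: flag a SNP when genotype counts deviate from HWE at the 5% level."""
    return hwe_chi_square(n_aa, n_ab, n_bb) > CRITICAL_5_PERCENT


# Hypothetical genotype counts for two SNPs in 100 unrelated samples.
print(flag_for_triallelic_calling(25, 50, 25))  # False: counts match HWE exactly
print(flag_for_triallelic_calling(40, 20, 40))  # True: strong heterozygote deficit
```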

COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software
To assess the accuracy of the algorithms we compared our data to the results for a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27 of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.
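The distinction between proportion and count can be checked directly from the figures in Table 2. The short sketch below recomputes the confirmed percentage for each run and then ranks the runs both ways; the rounding convention is an assumption, so a value may differ by a point from the printed table.

```python
# (total events detected, events confirmed by Kidd et al.) taken from Table 2.
table2 = {
    "Birdsuite, Affymetrix 6.0": (386, 76),
    "CNAT, Affymetrix 6.0": (8, 2),
    "GADA, Affymetrix 6.0": (546, 128),
    "GADA, Illumina 1M Duo": (511, 157),
    "PennCNV, Affymetrix 6.0": (57, 28),
    "PennCNV, Illumina 1M Duo": (57, 21),
    "QuantiSNP, Affymetrix 6.0": (131, 53),
    "QuantiSNP, Illumina 1M Duo": (75, 23),
}


def confirmed_percent(total, confirmed):
    """Percentage of detected events that were confirmed, rounded to a whole number."""
    return round(100 * confirmed / total)


rates = {name: confirmed_percent(*counts) for name, counts in table2.items()}
highest_rate = max(rates, key=rates.get)
most_confirmed = max(table2, key=lambda name: table2[name][1])

print(highest_rate, rates[highest_rate])           # PennCNV, Affymetrix 6.0 49
print(most_confirmed, table2[most_confirmed][1])   # GADA, Illumina 1M Duo 157
```

The run with the highest confirmation rate is not the run with the most confirmed events, which is why both figures are reported.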

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

Birdsuite (Birdseye and Canary)
  Platform: Affymetrix. Related publication: [15].
  Details: Combined tool set to genotype SNPs & CNPs.
  Strengths: Unique approach; single association of SNPs and CN.
  Weaknesses: Availability limited to Affymetrix data.

CNAT
  Platform: Affymetrix. Related publication: Technical notes.
  Details: Proprietary; run in Genome Console.
  Strengths: Integral part of Genome Console.
  Weaknesses: Accuracy of event prediction (missed events).

CNVPartition 1.2.1
  Platform: Illumina. Related publication: Technical notes.
  Details: Proprietary; run in BeadStudio.
  Strengths: Integral part of BeadStudio.
  Weaknesses: Accuracy of event prediction (missed events).

Dchip SNP
  Platform: Affymetrix or Illumina. Related publication: [22].
  Details: Stand-alone software.
  Strengths: Free viewer for all data.
  Weaknesses: Limited applications for Illumina data.

GADA
  Platform: Affymetrix or Illumina. Related publication: [24].
  Details: Model uses Sparse Bayesian Learning.
  Strengths: Speed of processing and application within R.
  Weaknesses: Accuracy on Illumina weaker.

HMMSeg
  Platform: Multiple. Related publication: [17].
  Details: HMM application tool to any genomic data.
  Strengths: Flexibility to any dataset.
  Weaknesses: Statistical knowledge required for correct use. Not CN specific.

ITALICS
  Platform: Affymetrix. Related publication: [26].
  Details: R package for normalization and CN detection in Affymetrix data.
  Strengths: Focus on removal of non-relevant effects.
  Weaknesses: Designed to work on Affymetrix 100K + 500K chip (MM probe format).

Nexus Biodiscovery
  Platform: Multiple. Related publication: [23].
  Details: Commercial segmentation detection tool.
  Strengths: Allows combined data from different platforms. Integrated viewer.
  Weaknesses: Freeware alternatives are available.

PennCNV
  Platform: Illumina or Affymetrix. Related publication: [19].
  Details: Perl script based.
  Strengths: Multiple downstream tools for output.
  Weaknesses: No way of ranking events due to likelihood.

QuantiSNP
  Platform: Illumina or Affymetrix. Related publication: [18].
  Details: HMM; PC or Linux command line.
  Strengths: Bayes factor score for events; flexibility of run parameters.
  Weaknesses: Limited support for further event analysis.

SCIMM and SCIMM-Search
  Platform: Illumina. Related publication: [13].
  Details: Modelling algorithm applied in R.
  Strengths: High detection rates compared to sequence data.
  Weaknesses: Statistical knowledge required for correct use.

TriTyper
  Platform: Illumina. Related publication: [27].
  Details: Identify and genotype SNPs with null allele.
  Strengths: Able to interpret single SNPs.
  Weaknesses: Only genotypes deletions.

Table 2: Comparison of algorithms

Algorithm                              Platform and array    Total CN events detected    CN events confirmed by Kidd et al. [28] (%)
Birdsuite 1.5.5 (Birdseye & Canary)    Affymetrix 6.0        386                         76 (20)
CNAT (Genome Console 3.0.2)            Affymetrix 6.0        8                           2 (25)
GADA (R 0.7-5)                         Affymetrix 6.0        546                         128 (23)
GADA (R 0.7-5)                         Illumina 1M Duo       511                         157 (31)
PennCNV (2009Jan06)                    Affymetrix 6.0        57                          28 (49)
PennCNV (2009Jan06)                    Illumina 1M Duo       57                          21 (37)
QuantiSNP v2.0                         Affymetrix 6.0        131                         53 (41)
QuantiSNP v1.1                         Illumina 1M Duo       75                          23 (31)

Detected events from CEPH sample NA12156 are compared to events published in the sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and data confirmed by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. The percentage shows the number of confirmed CN events compared to the total detected by the algorithm.


Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the 1M array Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

                       Total events found in      Number of CN events    Number of CN events    Number of CN events
                       NA15510 by algorithm       (Kidd) [28]            (Korbel) [7]           (Redon) [2]
Events in paper        -                          299                    466                    219
CNVPartition 1.2.1     39                         12 (4)                 22 (5)                 9 (4)
GADA (R 0.7-5)         69                         68 (23)                85 (18)                42 (19)
PennCNV (2009Jan06)    81                         18 (6)                 28 ()                  30 (14)
QuantiSNP v1.1         64                         18 (6)                 41 (9)                 29 (13)

Data from CEPH sample NA15510 on the 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.

During our comparison we often saw a difference

in the size of the predicted event between algorithms

(Figure 5) This was to be expected when using

different platforms as probe locations vary but was

also seen when analysing an identical dataset This

kind of effect can even be produced when simply

altering algorithm parameters and should be a con-

sideration when looking at breakpoints of detected

events We found that the available software tend to

target and support one particular platform for analy-

sis which unfortunately can limit options

Recommending algorithms

Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software package. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
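As a rough sketch of the two-algorithm recommendation, the calls from one algorithm can be split into those supported by a second algorithm and those that are not, while recording how far the predicted breakpoints disagree. The tuple layout, function names and coordinates below are hypothetical, and the overlap rule (any shared base) is an assumption rather than a rule taken from any of the packages discussed.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end); assumed layout


def overlaps(a: Event, b: Event) -> bool:
    """True if two events on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def compare_calls(primary: List[Event], secondary: List[Event]):
    """Split `primary` calls into those supported by `secondary` and those not.

    Supported calls are returned with the largest breakpoint disagreement
    (in base pairs) against any overlapping call from `secondary`.
    """
    supported, unsupported = [], []
    for event in primary:
        partners = [s for s in secondary if overlaps(event, s)]
        if partners:
            shift = max(
                max(abs(event[1] - p[1]), abs(event[2] - p[2])) for p in partners
            )
            supported.append((event, shift))
        else:
            unsupported.append(event)
    return supported, unsupported


# Hypothetical output from two algorithms run on the same dataset.
first_calls = [("chr2", 5000, 12000), ("chr7", 100, 900)]
second_calls = [("chr2", 5600, 11800)]

supported, unsupported = compare_calls(first_calls, second_calls)
print(supported)    # candidates with increased confidence
print(unsupported)  # possible false positives, or events missed by the second algorithm
```

Unsupported calls are not necessarily false positives; as noted above, only separate biological replication can confirm or reject an event.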

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes

Table 4 Comparison of event numbers detected for a single sample (NA10861)

Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5 Comparison of software event predictions

Software | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13)
Birdsuite Affymetrix | 17 (44) | - | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33)
CNAT Affymetrix | 4 (10) | 15 (4) | - | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8)
CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | - | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27)
GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | - | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45)
GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | - | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38)
Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | - | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28)
Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | - | 6 (9) | 7 (16) | 10 (5) | 9 (15)
PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | - | 19 (44) | 71 (37) | 21 (35)
PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | - | 26 (13) | 28 (47)
QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | - | 24 (40)
QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | -

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore, overlap of events depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found that the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.


increasingly important to be able to summarize and use these data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.
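One way to prepare several result sets for simultaneous viewing is to write each as a BED-format custom track. The sketch below assumes events are held as `(chromosome, 1-based start, end, label)` tuples; the track names, labels and coordinates are hypothetical, and only the basic four BED columns and a minimal `track` line are used.

```python
from typing import Iterable, Tuple

# (chromosome, 1-based start, end, label); this layout is an assumption.
Event = Tuple[str, int, int, str]


def to_bed_track(name: str, events: Iterable[Event]) -> str:
    """Render one algorithm's calls as a BED custom track with a header line."""
    lines = [f'track name="{name}" description="CNV calls from {name}"']
    for chrom, start, end, label in events:
        # BED uses 0-based, half-open coordinates, so shift the start by one.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical calls for the same deletion from two algorithms.
tracks = {
    "algorithm_A": [("chr1", 1001, 9000, "loss")],
    "algorithm_B": [("chr1", 1201, 8800, "loss")],
}

# Concatenating the tracks gives one file with one track per algorithm.
bed_text = "".join(to_bed_track(name, events) for name, events in tracks.items())
print(bed_text)
```

Each algorithm then appears as its own track, so differences in predicted boundaries can be inspected side by side against known copy number data.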

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association

Figure 5 Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.


analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, and future decisions in the field will be extremely enlightening.

As is frequently commented upon in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.

2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.

3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.

4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.

5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.

6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.

7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.

8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.

9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.

10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.

11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.

12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.

13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.

14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.

Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.

16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.

17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.

18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.

19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.

20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.

21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.

22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.

23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.

24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.

25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.

26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.

27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.

28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.

29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.

30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.

31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.

32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.



to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to those labs with no or limited bioinformatic core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public Licence, and so are prevented from having access to the latest methods.

For our own purposes we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection, as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage on the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report the higher detection rates of Birdsuite.

Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from the HWE as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.
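To illustrate why deviation from HWE can flag a possible null allele, the sketch below applies a standard one-degree-of-freedom chi-square test to biallelic genotype counts. This is not TriTyper's own procedure; the function name and the example counts are hypothetical, and the test assumes a non-empty sample. A deletion carried on one chromosome makes heterozygous carriers appear homozygous, which shows up as a deficit of heterozygotes.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test (1 d.f.) of observed genotype counts against HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical SNP in equilibrium versus one with a heterozygote deficit.
balanced = hwe_chi_square(25, 50, 25)
deficit = hwe_chi_square(40, 20, 40)
print(balanced, deficit)
```

A SNP like the second one would be a candidate for triallelic genotyping with an added null allele, although other causes of HWE deviation, such as genotyping error, would need to be excluded.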

COMPARING THE DETECTION ALGORITHMS

There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results of a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. 27 of the overlapping detected events were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.
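The trade-off between the proportion confirmed and the absolute number confirmed can be made concrete with a few rows copied from Table 2. The dictionary layout and function name below are assumptions for illustration only.

```python
# (total events detected, events confirmed by Kidd et al.), copied from Table 2.
table2 = {
    "Birdsuite / Affymetrix": (386, 76),
    "GADA / Illumina": (511, 157),
    "PennCNV / Affymetrix": (57, 28),
}


def confirmed_percentage(total: int, confirmed: int) -> int:
    """Percentage of detected events that were confirmed, rounded to a whole number."""
    return round(100 * confirmed / total)


summary = {
    name: (confirmed, confirmed_percentage(total, confirmed))
    for name, (total, confirmed) in table2.items()
}

for name, (confirmed, percent) in summary.items():
    print(f"{name}: {confirmed} confirmed events ({percent}% of calls)")
```

PennCNV on Affymetrix has the highest confirmed proportion of these three, yet the other two algorithms each recover more confirmed events in absolute terms.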

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in


all comparisons and not unique to the algorithms we tested.

The overlap of algorithm events of the tested software is below 50% for all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1 Summary of SNP array detection algorithms

Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach; single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions

Table 2 Comparison of algorithms

Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28]
Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25%)
GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23%)
GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49%)
PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37%)
QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41%)
QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31%)

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.


Figure 4 Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3 Overlap between events detected by SNP array algorithms using multiple publication data

Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4)
GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19)
PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14)
QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13)

Data from CEPH sample NA15510 on the Illumina 1M array platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 125 overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different Software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events that a different algorithm often reports as a single event. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.


httpbfgoxfordjournalsorgD

ownloaded from

analysis on a single CNV of interest The publica-

tion tests a series of five alternative modelling meth-

ods before recommending a likelihood ratio test

which combines CNV calling and association testing

into a single model This method was designed

to eliminate problems with signal noise which is a

known trait of SNP assay data Ionita-Laza et al [30]

suggested a method to apply genome-wide family-

based association studies on raw-intensity data The

Birdsuite package includes a pipeline to prepare

the data for PLINK analysis Other sources have

suggested similar association study-based strategies

but an agreed approach is a subject of great discus-

sion Calls have been made by authors such as

Scherer et al [31] to decide on a single technique

but future decisions in the field will be extremely

enlightening

As is commented much upon in literature

describing SNP association study techniques

sample size and power of tests are major factors in

a successful study [32] This must also be considered

when analysing copy number data As we have dis-

cussed there are a number of analysis options avail-

able for SNP array CNV detection pipelines to

allow guided analysis and stand alone options for

more flexible analysis Some of these applications

are platform targeted but we have found that the

best outcome is given by using multiple algorithms

and comparing data

SUPPLEMENTARYDATASupplementary data are available online at http

biboxfordjournalsorg

AcknowledgementsThe authors thank Dr Helen Butler for her ideas and contribu-

tions to the manuscript

FUNDINGJR and LW are funded by Wellcome Trust Grants

CY is funded by a UK Medical Research Council

Special Training Fellowship in Biomedical

Informatics (Ref No G0701810)

References1 Iafrate AJ Feuk L Rivera MN et al Detection of large-

scale variation in the human genome Nat Genet 200436(9)949ndash51

2 Redon R Ishikawa S Fitch KR et al Global variation incopy number in the human genome Nature 2006444(7118)444ndash54

3 Tuzun E Sharp AJ Bailey JA et al Fine-scale structuralvariation of the human genome Nat Genet 200537(7)727ndash32

4 Sebat J Lakshmi B Troge J et al Large-scale copy numberpolymorphism in the human genome Science 2004305(5683)525ndash8

5 de Smith AJ Tsalenko A Sampas N et al Array CGHanalysis of copy number variation identifies 1284 newgenes variant in healthy white males implications for asso-ciation studies of complex diseases Hum Mol Genet 200716(23)2783ndash94

6 Carter NP Methods and strategies for analyzing copynumber variation using DNA microarrays Nat Genet200739(7 Suppl)S16ndash21

7 Korbel JO Urban AE Affourtit JP et al Paired-end map-ping reveals extensive structural variation in the humangenome Science 2007318(5849)420ndash6

8 Kennedy GC Matsuzaki H Dong S etal Large-scale geno-typing of complex DNA NatBiotechnol 200321(10)1233ndash7

9 Peiffer DA Le JM Steemers FJ etal High-resolution geno-mic profiling of chromosomal aberrations using Infiniumwhole-genome genotyping Genome Res 200616(9)1136ndash48

10 International Schizophrenia Consortium Rare chromoso-mal deletions and duplications increase risk of schizophreniaNature 2008455(7210)237ndash41

11 Yang TL Chen XD Guo Y et al Genome-wide copy-number-variation study identified a susceptibility geneUGT2B17 for osteoporosis Am J Hum Genet 200883(6)663ndash74

12 McCarroll SA Hadnott TN Perry GH et al Commondeletion polymorphisms in the human genome Nat Genet200638(1)86ndash92

13 Cooper GM Zerr T Kidd JM et al Systematic assessmentof copy number variant detection via genome-wide SNPgenotyping Nat Genet 200840(10)1199ndash203

14 McCarroll SA Altshuler DM Copy-number variation andassociation studies of human disease Nat Genet 200739(7 Suppl)S37ndash42

Key Points Awide variety of software is available for CNVdetection from

data produced by SNP arrays This review seeks to discussoptions and statistical methods currently available for analysisof signal intensity data

Changes in assay selection techniques for SNP arrays havemadethemmore appealing for copynumber detection aswell as geno-typingTargeted probe design has made the SNP array a reliableand cheaper option for copy number analysis

After testing a selection of the available software comparisonswere performed using Hapmap samples and publishedcopy number data Of the events found in our data 20^49were replicated in previously published studies but the resultsclearly showed variation in data caused by differences inalgorithms

An important recommendation when choosing software foranalysis is the use of a second algorithm on a dataset to producemore informative results This enables the user to eliminatefalse positives not found by both software and increases confi-dence in replicated events

Comparing CNVdetection methods for SNParrays page 13 of 14 by guest on February 21 2014

httpbfgoxfordjournalsorgD

ownloaded from

15 McCarroll SA Kuruvilla FG Korn JM et al Integrateddetection and population-genetic analysis of SNPs andcopy number variation Nat Genet 200840(10)1166ndash74

16 Korn JM Kuruvilla FG McCarroll SA et al Integratedgenotype calling and association analysis of SNPscommon copy number polymorphisms and rare CNVsNat Genet 200840(10)1253ndash60

17 Day N Hemmaplardh A Thurman RE et al Unsupervisedsegmentation of continuous genomic data Bioinformatics200723(11)1424ndash6

18 Colella S Yau C Taylor JM etal QuantiSNP an objectiveBayes Hidden-Markov Model to detect and accurately mapcopy number variation using SNP genotyping data NucleicAcids Res 200735(6)2013ndash25

19 Wang K Li M Hadley D et al PennCNV an integratedhidden Markov model designed for high-resolution copynumber variation detection in whole-genome SNP geno-typing data Genome Res 200717(11)1665ndash74

20 Maestrini E Pagnamenta AT Lamb JA et al High-densitySNP association study and copy number variation analysisof the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility MolPsychiatry2009

21 Wang K Chen Z Tadesse MG et al Modeling geneticinheritance of copy number variations Nucleic Acids Res200836(21)e138

22 Li C Beroukhim R Weir BA et al Major copy propor-tion analysis of tumor samples using SNP arrays BMCBioinformatics 20089204

23 Olshen AB Venkatraman ES Lucito R Wigler M Circularbinary segmentation for the analysis of array-based DNAcopy number data Biostatistics 20045(4)557ndash72

24 Pique-Regi R Monso-Varona J Ortega A et al Sparserepresentation and Bayesian detection of genome copynumber alterations from microarray data Bioinformatics200824(3)309ndash18

25 Lai WR Johnson MD Kucherlapati R Park PJComparative analysis of algorithms for identifying amplifi-cations and deletions in array CGH data Bioinformatics 200521(19)3763ndash70

26 Rigaill G Hupe P Almeida A et al ITALICS analgorithm for normalization and DNA copy number callingfor Affymetrix SNP arrays Bioinformatics 200824(6)768ndash74

27 Franke L de Kovel CG Aulchenko YS et al Detectionimputation and association analysis of small deletions andnull alleles on oligonucleotide arrays AmJHumGenet 200882(6)1316ndash33

28 Kidd JM Cooper GM Donahue WF et al Mapping andsequencing of structural variation from eight human gen-omes Nature 2008453(7191)56ndash64

29 Barnes C Plagnol V Fitzgerald T et al A robuststatistical method for case-control association testingwith copy number variation Nat Genet 200840(10)1245ndash52

30 Ionita-Laza I Perry GH Raby BA et al On the analysisof copy-number variations in genome-wide associationstudies a translation of the family-based association testGenet Epidemiol 200832(3)273ndash84

31 Scherer SW Lee C Birney E etal Challenges and standardsin integrating surveys of structural variation NatGenet 200739(7 Suppl)S7ndash15

32 Cardon LR Bell JI Association study designs for complexdiseases Nat Rev Genet 20012(2)91ndash9

page 14 of 14 Winchester et al by guest on February 21 2014

httpbfgoxfordjournalsorgD

ownloaded from


all comparisons and not unique to algorithms we tested.

The overlap of algorithm events of the tested software is below 50% for all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

| Software | Platform | Related publication | Details | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach; single association of SNPs and CN | Availability limited to Affymetrix data |
| CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events) |
| CNVPartition 1.2.1 | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events) |
| Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data |
| GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker |
| HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use. Not CN specific |
| ITALICS | Affymetrix | [26] | R package for normalization and CN detection in Affymetrix data | Focus on removal of non-relevant effects | Designed to work on Affymetrix 100K + 500K chip (MM probe format) |
| Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms. Integrated viewer | Freeware alternatives are available |
| PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood |
| QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events; flexibility of run parameters | Limited support for further event analysis |
| SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use |
| TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions |

Table 2: Comparison of algorithms

| Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28] |
|---|---|---|---|
| Birdsuite 1.5.5 (Birdseye & Canary) | Affymetrix 6.0 | 386 | 76 (20) |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 8 | 2 (25) |
| GADA (R 0.7-5) | Affymetrix 6.0 | 546 | 128 (23) |
| GADA (R 0.7-5) | Illumina 1M Duo | 511 | 157 (31) |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 57 | 28 (49) |
| PennCNV (2009Jan06) | Illumina 1M Duo | 57 | 21 (37) |
| QuantiSNP v2.0 | Affymetrix 6.0 | 131 | 53 (41) |
| QuantiSNP v1.1 | Illumina 1M Duo | 75 | 23 (31) |

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.
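The overlap rule in the note to Table 2 (comparing the start and end points of events) and the bracketed percentage can be illustrated with a short Python sketch. It is not the procedure used to build the table: the `Event` type, the function names, the inclusive base-pair coordinate convention and the example coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class Event(NamedTuple):
    chrom: str
    start: int  # first base of the event (assumed inclusive)
    end: int    # last base of the event (assumed inclusive)


def overlaps(a: Event, b: Event) -> bool:
    """True if both events lie on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def confirmed_summary(detected, reference):
    """Return (confirmed count, percentage of detected events that are confirmed)."""
    confirmed = sum(1 for d in detected if any(overlaps(d, r) for r in reference))
    pct = round(100 * confirmed / len(detected)) if detected else 0
    return confirmed, pct


# Hypothetical coordinates, for illustration only.
detected = [
    Event("chr1", 100, 200),
    Event("chr1", 500, 600),
    Event("chr2", 50, 80),
    Event("chr3", 10, 20),
]
reference = [Event("chr1", 150, 300), Event("chr2", 10, 60)]

print(confirmed_summary(detected, reference))
```

Under this rule any shared base counts as confirmation, so a stricter criterion (for example a minimum overlap length) would give lower percentages.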


Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array Illumina platform used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram is comprised of all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all the three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn Diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

| Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2] |
|---|---|---|---|---|
| Events in paper | | 299 | 466 | 219 |
| CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4) |
| GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19) |
| PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14) |
| QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13) |

Data from CEPH sample NA15510 on 1M array Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and un-validated events). An event was counted if any overlap was found with base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from GADA analysis, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by Affymetrix 6.0 array.

Similarities of events detected between different software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and detect more events than proprietary algorithms (Table 4). The data also shows an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.
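One way to put a number on this kind of size difference is to compare the two predicted intervals directly. The sketch below is only an illustration: the function name, the inclusive coordinates, the example values and the use of reciprocal overlap as a summary are assumptions, not a measure used in this comparison.

```python
def compare_calls(call_a, call_b):
    """Compare two (start, end) predictions of the same event on one chromosome.

    Coordinates are assumed to be inclusive base-pair positions.
    """
    shared = max(0, min(call_a[1], call_b[1]) - max(call_a[0], call_b[0]) + 1)
    len_a = call_a[1] - call_a[0] + 1
    len_b = call_b[1] - call_b[0] + 1
    return {
        "start_shift": call_b[0] - call_a[0],  # difference in left breakpoint
        "end_shift": call_b[1] - call_a[1],    # difference in right breakpoint
        # fraction of the shared region relative to each call; the smaller is reported
        "reciprocal_overlap": min(shared / len_a, shared / len_b),
    }


# Hypothetical predictions of one deletion by two algorithms.
result = compare_calls((10_000, 19_999), (12_000, 21_999))
print(result)
```

A reciprocal overlap close to 1 indicates that both calls describe nearly the same interval, while large breakpoint shifts with a low value suggest that the boundaries should be treated with caution.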

Recommending algorithms

Comparison of events in a dataset is a good way of assessing accuracy of detection algorithms, but it is also important to take into account that the different predictions can also be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and also utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
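The recommendation to run a second algorithm can be sketched as a simple filtering step: calls from one algorithm are split into those supported by an overlapping call from the other and those found by one algorithm only. This is a minimal sketch under stated assumptions; the function name, the tuple layout `(chrom, start, end)` and the example calls are hypothetical, and no particular software output format is implied.

```python
def split_by_support(calls_a, calls_b):
    """Split calls_a into calls overlapped by calls_b and calls with no support.

    Each call is a (chrom, start, end) tuple with inclusive coordinates.
    """
    supported, unsupported = [], []
    for call in calls_a:
        chrom, start, end = call
        partners = [p for p in calls_b if p[0] == chrom and p[1] <= end and start <= p[2]]
        if partners:
            # Record whether any partner agrees exactly on both breakpoints.
            same_boundaries = any(p == call for p in partners)
            supported.append((call, partners, same_boundaries))
        else:
            unsupported.append(call)
    return supported, unsupported


# Hypothetical calls from two algorithms run on the same dataset.
calls_a = [("chr1", 100, 500), ("chr4", 2000, 2600), ("chr9", 50, 90)]
calls_b = [("chr1", 100, 500), ("chr4", 2100, 2900)]

supported, unsupported = split_by_support(calls_a, calls_b)
print(len(supported), unsupported)
```

Calls in `unsupported` are candidates for false positives, while supported calls with differing boundaries flag events whose breakpoints remain uncertain. Neither outcome replaces biological validation.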

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events it becomes

Table 4: Comparison of event numbers detected for a single sample (NA10861)

| Algorithm | Platform and array | Number of CN events detected |
|---|---|---|
| Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137 |
| CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10 |
| CNVPartition 1.2.1 | Illumina 1M Duo | 16 |
| GADA (R 0.7-5) | Affymetrix 6.0 | 613 |
| GADA (R 0.7-5) | Illumina 1M Duo | 87 |
| Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111 |
| Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8 |
| PennCNV (2009Jan06) | Affymetrix 6.0 | 67 |
| PennCNV (2009Jan06) | Illumina 1M Duo | 43 |
| QuantiSNP v2.0 | Affymetrix 6.0 | 193 |
| QuantiSNP v1.1 | Illumina 1M Duo | 60 |

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5: Comparison of software event predictions

| | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Published data (Redon) | | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13) |
| Birdsuite Affymetrix | 17 (44) | | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33) |
| CNAT Affymetrix | 4 (10) | 15 (4) | | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8) |
| CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27) |
| GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45) |
| GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38) |
| Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28) |
| Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | | 6 (9) | 7 (16) | 10 (5) | 9 (15) |
| PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | | 19 (44) | 71 (37) | 21 (35) |
| PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | | 26 (13) | 28 (47) |
| QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | | 24 (40) |
| QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | |

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events counted as common between algorithms if part of region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. Total value represents number of events for software on horizontal axis found in the other software dataset; bracketed value shows percentage of events detected by same software. We have found the most similarities are between data from similar platforms or algorithm method; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
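The dependence of the overlap count on analysis orientation can be shown with a small Python sketch. It is an illustration under assumptions rather than the script used for Table 5: the function names, the `(chrom, start, end)` tuples with inclusive coordinates and the example calls are hypothetical.

```python
def count_found_in(events, other):
    """Count events that share at least one base with some event in `other`."""
    return sum(
        1
        for chrom, start, end in events
        if any(c == chrom and s <= end and start <= e for c, s, e in other)
    )


def two_way_overlap(calls_a, calls_b):
    """Return ((a_in_b, pct_of_a), (b_in_a, pct_of_b)), one pair per orientation."""
    a_in_b = count_found_in(calls_a, calls_b)
    b_in_a = count_found_in(calls_b, calls_a)
    pct_a = round(100 * a_in_b / len(calls_a)) if calls_a else 0
    pct_b = round(100 * b_in_a / len(calls_b)) if calls_b else 0
    return (a_in_b, pct_a), (b_in_a, pct_b)


# Hypothetical example: one large call in A is reported as three smaller calls in B.
calls_a = [("chr1", 1000, 9000), ("chr2", 100, 200)]
calls_b = [
    ("chr1", 1500, 2000),
    ("chr1", 4000, 5000),
    ("chr1", 8000, 8500),
    ("chr3", 1, 10),
]

print(two_way_overlap(calls_a, calls_b))
```

Here one of the two calls in A is found in B, whereas three of the four calls in B are found in A, which is why each pairwise comparison yields two values.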


increasingly important to be able to summarize and use this data. Initially we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser UCSC (http://www.genome.ucsc.edu/) and events can be compared to known copy number data in the DGV, such as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.
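Event lists are commonly loaded into the UCSC browser as custom tracks in BED format, one track per algorithm. The following Python sketch shows one way to produce such a file. The function name, the input layout `(chrom, start, end, label)`, the assumption that inputs use 1-based inclusive coordinates and the example calls are all hypothetical; only the BED conventions (tab-separated fields, 0-based start, exclusive end, a leading `track` line) come from the format itself.

```python
def to_bed_track(track_name, events, description=""):
    """Format events as text for a UCSC custom track in BED format.

    events: iterable of (chrom, start, end, label), assumed 1-based and inclusive.
    BED starts are 0-based and BED ends are exclusive, so only the start is shifted.
    """
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, label in events:
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical calls from one algorithm.
example_calls = [("chr1", 1001, 2000, "loss"), ("chr8", 501, 900, "gain")]
bed_text = to_bed_track("algorithm_A_calls", example_calls, "hypothetical CNV calls")
print(bed_text, end="")
```

Writing one such block per algorithm into a single file gives the browser several tracks to display side by side.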

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.


analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
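To make the idea of a likelihood ratio test concrete, the sketch below applies a much simpler version to hard CNV calls: a G-test on a 2 x 2 table of carriers and non-carriers in cases and controls. It is explicitly not the CNVtools model, which combines calling and testing and works from the signal data. The function name and the example counts are assumptions, and testing hard calls in this way ignores the signal noise discussed above.

```python
import math


def g_test_2x2(case_carriers, case_total, control_carriers, control_total):
    """Likelihood ratio (G) test of CNV carrier status against case-control status.

    Returns (G statistic, p-value), using a chi-square distribution with 1 degree of freedom.
    """
    observed = [
        [case_carriers, case_total - case_carriers],
        [control_carriers, control_total - control_carriers],
    ]
    n = case_total + control_total
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the statistic
                expected = row_sums[i] * col_sums[j] / n
                g += 2 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    p_value = math.erfc(math.sqrt(g / 2))  # chi-square survival function for 1 df
    return g, p_value


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry the deletion.
g_stat, p = g_test_2x2(30, 100, 10, 100)
print(g_stat, p)
```

With small carrier counts the chi-square approximation becomes unreliable, which ties in with the point on sample size and power.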

As is commented much upon in literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.

Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.

page 14 of 14 Winchester et al by guest on February 21 2014

httpbfgoxfordjournalsorgD

ownloaded from

Page 9: Comparing CNVdetection methods

Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events, owing to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper | | 299 | 466 | 219
CNVPartition 1.2.1 | 39 | 12 (4) | 22 (5) | 9 (4)
GADA (R 0.7-5) | 69 | 68 (23) | 85 (18) | 42 (19)
PennCNV (2009Jan06) | 81 | 18 (6) | 28 () | 30 (14)
QuantiSNP v1.1 | 64 | 18 (6) | 41 (9) | 29 (13)

Data from CEPH sample NA15510 on the Illumina 1M array is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.


technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.
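The two-way counting described for Table 5 can be illustrated with a minimal Python sketch. It is not the code used to produce the table: the event tuples, coordinates and function names are hypothetical, and an event is simply counted as found if it shares any base with an event in the other list.

```python
from typing import List, Tuple

# An event is (chromosome, start, end); coordinates are treated as inclusive.
Event = Tuple[str, int, int]


def overlaps(a: Event, b: Event) -> bool:
    """True if the two events share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_found(events: List[Event], other: List[Event]) -> Tuple[int, int]:
    """Count events in `events` overlapped by any event in `other`.

    Returns (count, percentage of `events` rounded to an integer).
    Each event is counted once, however many events in `other` overlap it.
    """
    count = sum(1 for e in events if any(overlaps(e, o) for o in other))
    percent = round(100 * count / len(events)) if events else 0
    return count, percent


# Hypothetical calls: algorithm A reports one large deletion where
# algorithm B reports three smaller ones, plus one event unique to each.
calls_a = [("chr1", 1_000, 9_000), ("chr2", 500, 800)]
calls_b = [("chr1", 1_200, 2_000), ("chr1", 3_000, 4_500),
           ("chr1", 7_000, 8_500), ("chr3", 100, 400)]

a_in_b = count_found(calls_a, calls_b)  # events of A also seen by B
b_in_a = count_found(calls_b, calls_a)  # events of B also seen by A
print(a_in_b, b_in_a)
```

In this toy example one event of A is matched by three events of B, so the two directions give different counts, which is why each comparison is reported twice.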

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which unfortunately can limit options.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries, or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of events detected will contain all CNVs in a sample.
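The recommendation to cross-check calls with a second algorithm can be sketched as follows. This is an illustrative outline only: the data structures and the `compare_calls` function are hypothetical, and replication by a second algorithm is treated purely as supporting evidence, not as biological confirmation.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def overlaps(a: Event, b: Event) -> bool:
    """True if the two events share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def compare_calls(primary: List[Event], secondary: List[Event]) -> List[Dict]:
    """Annotate each primary call with support from a second algorithm."""
    report = []
    for call in primary:
        hits = [s for s in secondary if overlaps(call, s)]
        if not hits:
            # Not replicated: a candidate false positive (or a missed call).
            report.append({"call": call, "replicated": False})
            continue
        # Outer limits of all supporting calls from the second algorithm.
        sec_start = min(h[1] for h in hits)
        sec_end = max(h[2] for h in hits)
        report.append({
            "call": call,
            "replicated": True,
            # Part of the primary call lying within those outer limits.
            "shared": (call[0], max(call[1], sec_start), min(call[2], sec_end)),
            # Disagreement at each breakpoint, in base pairs.
            "start_shift": abs(call[1] - sec_start),
            "end_shift": abs(call[2] - sec_end),
        })
    return report


# Hypothetical calls from two algorithms run on the same dataset.
primary_calls = [("chr1", 1_000, 5_000), ("chr4", 200, 900)]
secondary_calls = [("chr1", 1_500, 5_600)]
report = compare_calls(primary_calls, secondary_calls)
for row in report:
    print(row)
```

Replicated calls with small breakpoint shifts are the strongest candidates for follow-up, while unreplicated calls are the first to inspect for noise.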

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes

Table 4: Comparison of event numbers detected for a single sample (NA10861)

Algorithm | Platform and array | Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye) | Affymetrix 6.0 | 137
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 10
CNVPartition 1.2.1 | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 613
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery 4.0.1 | Affymetrix 6.0 | 111
Nexus Biodiscovery 4.0.1 | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 67
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v2.0 | Affymetrix 6.0 | 193
QuantiSNP v1.1 | Illumina 1M Duo | 60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.


Table 5: Comparison of software event predictions

Dataset | Published results (Redon) | Birdsuite Affymetrix | CNAT Affymetrix | CNVPartition Illumina | GADA Affymetrix | GADA Illumina | Nexus Affymetrix | Nexus Illumina | PennCNV Affymetrix | PennCNV Illumina | QuantiSNP Affymetrix | QuantiSNP Illumina
Published data (Redon) | - | 17 (4) | 4 (40) | 3 (19) | 32 (5) | 2 (2) | 11 (10) | 2 (25) | 12 (18) | 7 (16) | 18 (9) | 8 (13)
Birdsuite Affymetrix | 17 (44) | - | 9 (90) | 13 (81) | 135 (22) | 21 (24) | 62 (56) | 6 (75) | 43 (64) | 20 (47) | 97 (50) | 20 (33)
CNAT Affymetrix | 4 (10) | 15 (4) | - | 4 (25) | 34 (6) | 0 | 23 (21) | 1 (13) | 13 (19) | 2 (5) | 17 (9) | 5 (8)
CNVPartition Illumina | 3 (8) | 16 (4) | 4 (40) | - | 37 (6) | 7 (8) | 20 (18) | 7 (88) | 9 (13) | 11 (26) | 16 (8) | 16 (27)
GADA Affymetrix | 17 (44) | 106 (28) | 9 (90) | 13 (81) | - | 32 (37) | 91 (82) | 7 (88) | 58 (87) | 23 (53) | 153 (79) | 27 (45)
GADA Illumina | 2 (5) | 96 (25) | 0 | 13 (81) | 208 (34) | - | 25 (23) | 2 (25) | 26 (30) | 17 (40) | 67 (35) | 23 (38)
Nexus Affymetrix | 7 (18) | 57 (15) | 10 (100) | 7 (44) | 116 (19) | 8 (9) | - | 4 (50) | 45 (67) | 15 (35) | 78 (40) | 17 (28)
Nexus Illumina | 2 (5) | 6 (2) | 1 (10) | 7 (44) | 22 (4) | 2 (2) | 4 (4) | - | 6 (9) | 7 (16) | 10 (5) | 9 (15)
PennCNV Affymetrix | 11 (28) | 51 (13) | 10 (100) | 9 (56) | 105 (17) | 10 (11) | 65 (59) | 6 (75) | - | 19 (44) | 71 (37) | 21 (35)
PennCNV Illumina | 6 (15) | 25 (7) | 2 (20) | 11 (69) | 44 (7) | 9 (10) | 23 (21) | 6 (75) | 18 (27) | - | 26 (13) | 28 (47)
QuantiSNP Affymetrix | 14 (36) | 97 (25) | 10 (100) | 10 (63) | 199 (32) | 18 (21) | 86 (77) | 7 (88) | 65 (97) | 21 (49) | - | 24 (40)
QuantiSNP Illumina | 6 (15) | 14 (4) | 5 (50) | 15 (94) | 55 (9) | 10 (11) | 30 (27) | 8 (100) | 23 (34) | 32 (74) | 31 (16) | -

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore, overlap of events depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found that the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.


increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based UCSC genome browser (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.
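One simple way to prepare such tracks is to write each algorithm's calls as a BED-format custom track, which the UCSC browser accepts. The sketch below is illustrative only: the function name, track name and events are hypothetical, and the input coordinates are assumed to be 1-based and inclusive, so the start is shifted to the 0-based BED convention.

```python
from typing import Iterable, Tuple

# (chromosome, start, end, label); assumed 1-based, inclusive coordinates.
Event = Tuple[str, int, int, str]


def to_bed_track(name: str, description: str, events: Iterable[Event]) -> str:
    """Return the text of a BED custom track, one event per line."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in sorted(events):
        # BED starts are 0-based and ends are exclusive.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical calls for one algorithm; one track per algorithm can be uploaded.
track = to_bed_track(
    "PennCNV_NA10861",
    "Hypothetical PennCNV calls",
    [("chr2", 5001, 9000, "del"), ("chr1", 101, 200, "dup")],
)
print(track)
```

Writing one track per algorithm or platform keeps the result sets separate in the browser, so differences in event size can be seen side by side.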

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.


analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is still a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, and future decisions in the field will be extremely enlightening.
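To give a feel for likelihood-ratio testing in a case-control setting, the sketch below applies a plain likelihood-ratio (G) test to a 2x2 table of CNV carriers and non-carriers. This is a deliberately simplified stand-in and not the method of Barnes et al. [29]: it assumes copy number calls have already been made without error, whereas their model combines calling and association testing. The function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def g_test_2x2(case_carriers: int, case_total: int,
               control_carriers: int, control_total: int) -> Tuple[float, float]:
    """Likelihood-ratio (G) test of independence on a 2x2 carrier table.

    Returns (G statistic, p-value from a chi-square with 1 degree of freedom).
    """
    observed = [
        [case_carriers, case_total - case_carriers],
        [control_carriers, control_total - control_carriers],
    ]
    n = case_total + control_total
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / n
                g += 2.0 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Survival function of chi-square(1): P(X > g) = erfc(sqrt(g / 2)).
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


# Hypothetical counts: 30 of 100 cases and 10 of 100 controls carry a deletion.
g_stat, p_val = g_test_2x2(30, 100, 10, 100)
print(round(g_stat, 2), p_val)
```

Because the test treats calls as certain, noisy intensity data could inflate or mask an association, which is the problem the combined model is intended to address.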

As is much commented upon in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis, and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
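The effect of sample size on power noted above can be explored with a rough calculation. The sketch below uses the standard normal approximation for a two-sided comparison of two proportions with equal group sizes; it is an assumption-laden illustration (hypothetical carrier frequencies, perfect CNV calls, frequencies strictly between 0 and 1), not a substitute for a proper study design.

```python
from statistics import NormalDist


def power_two_proportions(p_case: float, p_control: float,
                          n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test, equal groups."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    p_bar = (p_case + p_control) / 2
    # Standard error under the null (pooled) and under the alternative.
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = ((p_case * (1 - p_case)
               + p_control * (1 - p_control)) / n_per_group) ** 0.5
    return z.cdf((abs(p_case - p_control) - z_alpha * se_null) / se_alt)


# Hypothetical rare CNV carried by 2% of cases and 1% of controls.
power_small = power_two_proportions(0.02, 0.01, 500)
power_large = power_two_proportions(0.02, 0.01, 5000)
print(round(power_small, 2), round(power_large, 2))
```

For rare events of this kind, the calculation suggests that a few hundred samples per group leaves a study badly underpowered, while several thousand may be needed.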

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.

2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444–54.

3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.

4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.

5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783–94.

6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16–21.

7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.

8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233–7.

9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136–48.

10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.

11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663–74.

12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86–92.

13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199–203.

14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37–42.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.

Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.


15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.

16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.

17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.

18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.

19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.

20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.

21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.

22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.

23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.

24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.

25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.

26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.

27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.

28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.

29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.

30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.

31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.

32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.


Key Points Awide variety of software is available for CNVdetection from

data produced by SNP arrays This review seeks to discussoptions and statistical methods currently available for analysisof signal intensity data

Changes in assay selection techniques for SNP arrays havemadethemmore appealing for copynumber detection aswell as geno-typingTargeted probe design has made the SNP array a reliableand cheaper option for copy number analysis

After testing a selection of the available software comparisonswere performed using Hapmap samples and publishedcopy number data Of the events found in our data 20^49were replicated in previously published studies but the resultsclearly showed variation in data caused by differences inalgorithms

An important recommendation when choosing software foranalysis is the use of a second algorithm on a dataset to producemore informative results This enables the user to eliminatefalse positives not found by both software and increases confi-dence in replicated events

Comparing CNVdetection methods for SNParrays page 13 of 14 by guest on February 21 2014

httpbfgoxfordjournalsorgD

ownloaded from

15 McCarroll SA Kuruvilla FG Korn JM et al Integrateddetection and population-genetic analysis of SNPs andcopy number variation Nat Genet 200840(10)1166ndash74

16 Korn JM Kuruvilla FG McCarroll SA et al Integratedgenotype calling and association analysis of SNPscommon copy number polymorphisms and rare CNVsNat Genet 200840(10)1253ndash60

17 Day N Hemmaplardh A Thurman RE et al Unsupervisedsegmentation of continuous genomic data Bioinformatics200723(11)1424ndash6

18 Colella S Yau C Taylor JM etal QuantiSNP an objectiveBayes Hidden-Markov Model to detect and accurately mapcopy number variation using SNP genotyping data NucleicAcids Res 200735(6)2013ndash25

19 Wang K Li M Hadley D et al PennCNV an integratedhidden Markov model designed for high-resolution copynumber variation detection in whole-genome SNP geno-typing data Genome Res 200717(11)1665ndash74

20 Maestrini E Pagnamenta AT Lamb JA et al High-densitySNP association study and copy number variation analysisof the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility MolPsychiatry2009

21 Wang K Chen Z Tadesse MG et al Modeling geneticinheritance of copy number variations Nucleic Acids Res200836(21)e138

22 Li C Beroukhim R Weir BA et al Major copy propor-tion analysis of tumor samples using SNP arrays BMCBioinformatics 20089204

23 Olshen AB Venkatraman ES Lucito R Wigler M Circularbinary segmentation for the analysis of array-based DNAcopy number data Biostatistics 20045(4)557ndash72

24 Pique-Regi R Monso-Varona J Ortega A et al Sparserepresentation and Bayesian detection of genome copynumber alterations from microarray data Bioinformatics200824(3)309ndash18

25 Lai WR Johnson MD Kucherlapati R Park PJComparative analysis of algorithms for identifying amplifi-cations and deletions in array CGH data Bioinformatics 200521(19)3763ndash70

26 Rigaill G Hupe P Almeida A et al ITALICS analgorithm for normalization and DNA copy number callingfor Affymetrix SNP arrays Bioinformatics 200824(6)768ndash74

27 Franke L de Kovel CG Aulchenko YS et al Detectionimputation and association analysis of small deletions andnull alleles on oligonucleotide arrays AmJHumGenet 200882(6)1316ndash33

28 Kidd JM Cooper GM Donahue WF et al Mapping andsequencing of structural variation from eight human gen-omes Nature 2008453(7191)56ndash64

29 Barnes C Plagnol V Fitzgerald T et al A robuststatistical method for case-control association testingwith copy number variation Nat Genet 200840(10)1245ndash52

30 Ionita-Laza I Perry GH Raby BA et al On the analysisof copy-number variations in genome-wide associationstudies a translation of the family-based association testGenet Epidemiol 200832(3)273ndash84

31 Scherer SW Lee C Birney E etal Challenges and standardsin integrating surveys of structural variation NatGenet 200739(7 Suppl)S7ndash15

32 Cardon LR Bell JI Association study designs for complexdiseases Nat Rev Genet 20012(2)91ndash9

page 14 of 14 Winchester et al by guest on February 21 2014

httpbfgoxfordjournalsorgD

ownloaded from

Page 11: Comparing CNVdetection methods

Table 5: Comparison of software event predictions

Column key: A = Affymetrix, I = Illumina; Redon = Published results (Redon); Bird = Birdsuite; CNVP = CNVPartition; Nex = Nexus; Penn = PennCNV; QSNP = QuantiSNP. Each cell gives the number of shared events with the percentage in brackets. A dash marks the self-comparison cell.

                        Redon     Bird-A    CNAT-A    CNVP-I    GADA-A    GADA-I    Nex-A     Nex-I     Penn-A    Penn-I    QSNP-A    QSNP-I
Published data (Redon)  -         17 (4)    4 (40)    3 (19)    32 (5)    2 (2)     11 (10)   2 (25)    12 (18)   7 (16)    18 (9)    8 (13)
Birdsuite Affymetrix    17 (44)   -         9 (90)    13 (81)   135 (22)  21 (24)   62 (56)   6 (75)    43 (64)   20 (47)   97 (50)   20 (33)
CNAT Affymetrix         4 (10)    15 (4)    -         4 (25)    34 (6)    0         23 (21)   1 (13)    13 (19)   2 (5)     17 (9)    5 (8)
CNVPartition Illumina   3 (8)     16 (4)    4 (40)    -         37 (6)    7 (8)     20 (18)   7 (88)    9 (13)    11 (26)   16 (8)    16 (27)
GADA Affymetrix         17 (44)   106 (28)  9 (90)    13 (81)   -         32 (37)   91 (82)   7 (88)    58 (87)   23 (53)   153 (79)  27 (45)
GADA Illumina           2 (5)     96 (25)   0         13 (81)   208 (34)  -         25 (23)   2 (25)    26 (30)   17 (40)   67 (35)   23 (38)
Nexus Affymetrix        7 (18)    57 (15)   10 (100)  7 (44)    116 (19)  8 (9)     -         4 (50)    45 (67)   15 (35)   78 (40)   17 (28)
Nexus Illumina          2 (5)     6 (2)     1 (10)    7 (44)    22 (4)    2 (2)     4 (4)     -         6 (9)     7 (16)    10 (5)    9 (15)
PennCNV Affymetrix      11 (28)   51 (13)   10 (100)  9 (56)    105 (17)  10 (11)   65 (59)   6 (75)    -         19 (44)   71 (37)   21 (35)
PennCNV Illumina        6 (15)    25 (7)    2 (20)    11 (69)   44 (7)    9 (10)    23 (21)   6 (75)    18 (27)   -         26 (13)   28 (47)
QuantiSNP Affymetrix    14 (36)   97 (25)   10 (100)  10 (63)   199 (32)  18 (21)   86 (77)   7 (88)    65 (97)   21 (49)   -         24 (40)
QuantiSNP Illumina      6 (15)    14 (4)    5 (50)    15 (94)   55 (9)    10 (11)   30 (27)   8 (100)   23 (34)   32 (74)   31 (16)   -

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected in both software are shown. Events are counted as common between algorithms if part of the region predicted overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; therefore overlap of events depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found that the most similarities are between data from similar platforms or algorithm methods; for example, Affymetrix PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.


increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser UCSC (http://www.genome.ucsc.edu), and events can be compared to known copy number data in the DGV, such as displayed in Figure 3. Importing several tracks of data into a browser simultaneously will allow the user to compare different result sets.
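One way to prepare several result sets for side-by-side viewing is to write each as a BED custom track, a format the UCSC browser accepts for upload. The sketch below is a minimal illustration: the algorithm names, call tuples and labels are hypothetical, and real output from each detection program would first need parsing into this form.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end, label); start and end are 1-based inclusive positions.
# The tuple layout and the example values below are assumptions for this sketch.
Call = Tuple[str, int, int, str]


def bed_track(name: str, calls: Iterable[Call]) -> str:
    """Format calls as one custom track: a 'track' header line plus four-column BED rows."""
    lines: List[str] = [f'track name="{name}" description="{name} CNV calls" visibility=pack']
    for chrom, start, end, label in sorted(calls):
        # BED uses 0-based, half-open intervals, so only the start shifts by one.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Hypothetical result sets from two algorithms for the same sample.
results = {
    "AlgorithmA": [("chr1", 1000, 9000, "loss")],
    "AlgorithmB": [("chr1", 1200, 3000, "loss"), ("chr1", 5000, 8500, "loss")],
}
custom_tracks = "".join(bed_track(name, calls) for name, calls in results.items())
print(custom_tracks)
```

Concatenating the tracks into one text block means a single upload shows one row per algorithm, which makes disagreements in event boundaries easy to see.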

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening.
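For orientation, the general form of a likelihood ratio test is given below. This is the generic textbook statistic, not the specific model of Barnes et al. [29]; the symbols are notation introduced here. The null model fixes the parameters linking copy number to disease status, and the alternative leaves them free.

```latex
\Lambda = 2\left[\ell(\hat{\theta}_{1}) - \ell(\hat{\theta}_{0})\right],
\qquad
\Lambda \sim \chi^{2}_{d} \quad \text{asymptotically under } H_{0}
```

Here $\ell$ is the log-likelihood, $\hat{\theta}_{0}$ and $\hat{\theta}_{1}$ are the maximum-likelihood estimates under the null and alternative models, and $d$ is the number of additional free parameters in the alternative.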

As is often noted in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.
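The recommendation to compare algorithms can be sketched as a simple replication filter that keeps only events from one algorithm supported by a second. The code is an illustration under stated assumptions: the call layout, the requirement that the copy number state matches, and the `min_fraction` threshold are choices made here, not rules taken from any of the packages discussed.

```python
from typing import List, Tuple

# (chromosome, start, end, state); state is "loss" or "gain".
# Layout and example values are assumptions for this sketch.
Call = Tuple[str, int, int, str]


def overlap_fraction(a: Call, b: Call) -> float:
    """Fraction of call a (in bases) covered by call b; 0.0 if chromosome or state differ."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1]) + 1
    return max(shared, 0) / (a[2] - a[1] + 1)


def replicated(primary: List[Call], secondary: List[Call], min_fraction: float = 0.0) -> List[Call]:
    """Keep primary calls supported by at least one secondary call of the same state."""
    return [
        p for p in primary
        if any(overlap_fraction(p, s) > min_fraction for s in secondary)
    ]


# Hypothetical calls from two algorithms run on the same sample.
primary = [("chr1", 1000, 9000, "loss"), ("chr2", 500, 800, "gain"), ("chr4", 100, 400, "loss")]
secondary = [("chr1", 1200, 3000, "loss"), ("chr2", 600, 700, "loss")]

any_overlap = replicated(primary, secondary)           # any shared base counts as support
strict = replicated(primary, secondary, min_fraction=0.5)  # over half the event must be covered
print(any_overlap, strict)
```

The default mirrors the lenient "any overlap" rule. Raising `min_fraction` trades sensitivity for confidence, and the fraction is measured against a single secondary call, so an event covered by several small fragments may still be dropped under a strict threshold.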

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

J.R. and L.W. are funded by Wellcome Trust Grants. C.Y. is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.

Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20-49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software and increases confidence in replicated events.

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949-51.

2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):444-54.

3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727-32.

4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525-8.

5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):2783-94.

6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S16-21.

7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420-6.

8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):1233-7.

9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):1136-48.

10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237-41.

11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):663-74.

12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):86-92.

13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):1199-203.

14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37-42.

15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166-74.

16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253-60.

17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424-6.

18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013-25.

19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665-74.

20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.

21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.

22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.

23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557-72.

24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309-18.

25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763-70.

26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768-74.

27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316-33.

28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56-64.

29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245-52.

30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273-84.

31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7-15.

32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91-9.



Page 13: Comparing CNVdetection methods

analysis on a single CNV of interest The publica-

tion tests a series of five alternative modelling meth-

ods before recommending a likelihood ratio test

which combines CNV calling and association testing

into a single model This method was designed

to eliminate problems with signal noise which is a

known trait of SNP assay data Ionita-Laza et al [30]

suggested a method to apply genome-wide family-

based association studies on raw-intensity data The

Birdsuite package includes a pipeline to prepare

the data for PLINK analysis Other sources have

suggested similar association study-based strategies

but an agreed approach is a subject of great discus-

sion Calls have been made by authors such as

Scherer et al [31] to decide on a single technique

but future decisions in the field will be extremely

enlightening

As is commented much upon in literature

describing SNP association study techniques

sample size and power of tests are major factors in

a successful study [32] This must also be considered

when analysing copy number data As we have dis-

cussed there are a number of analysis options avail-

able for SNP array CNV detection pipelines to

allow guided analysis and stand alone options for

more flexible analysis Some of these applications

are platform targeted but we have found that the

best outcome is given by using multiple algorithms

and comparing data

SUPPLEMENTARYDATASupplementary data are available online at http

biboxfordjournalsorg

AcknowledgementsThe authors thank Dr Helen Butler for her ideas and contribu-

tions to the manuscript

FUNDINGJR and LW are funded by Wellcome Trust Grants

CY is funded by a UK Medical Research Council

Special Training Fellowship in Biomedical

Informatics (Ref No G0701810)

References1 Iafrate AJ Feuk L Rivera MN et al Detection of large-

scale variation in the human genome Nat Genet 200436(9)949ndash51

2 Redon R Ishikawa S Fitch KR et al Global variation incopy number in the human genome Nature 2006444(7118)444ndash54

3 Tuzun E Sharp AJ Bailey JA et al Fine-scale structuralvariation of the human genome Nat Genet 200537(7)727ndash32

4 Sebat J Lakshmi B Troge J et al Large-scale copy numberpolymorphism in the human genome Science 2004305(5683)525ndash8

5 de Smith AJ Tsalenko A Sampas N et al Array CGHanalysis of copy number variation identifies 1284 newgenes variant in healthy white males implications for asso-ciation studies of complex diseases Hum Mol Genet 200716(23)2783ndash94

6 Carter NP Methods and strategies for analyzing copynumber variation using DNA microarrays Nat Genet200739(7 Suppl)S16ndash21

7 Korbel JO Urban AE Affourtit JP et al Paired-end map-ping reveals extensive structural variation in the humangenome Science 2007318(5849)420ndash6

8 Kennedy GC Matsuzaki H Dong S etal Large-scale geno-typing of complex DNA NatBiotechnol 200321(10)1233ndash7

9 Peiffer DA Le JM Steemers FJ etal High-resolution geno-mic profiling of chromosomal aberrations using Infiniumwhole-genome genotyping Genome Res 200616(9)1136ndash48

10 International Schizophrenia Consortium Rare chromoso-mal deletions and duplications increase risk of schizophreniaNature 2008455(7210)237ndash41

11 Yang TL Chen XD Guo Y et al Genome-wide copy-number-variation study identified a susceptibility geneUGT2B17 for osteoporosis Am J Hum Genet 200883(6)663ndash74

12 McCarroll SA Hadnott TN Perry GH et al Commondeletion polymorphisms in the human genome Nat Genet200638(1)86ndash92

13 Cooper GM Zerr T Kidd JM et al Systematic assessmentof copy number variant detection via genome-wide SNPgenotyping Nat Genet 200840(10)1199ndash203

14 McCarroll SA Altshuler DM Copy-number variation andassociation studies of human disease Nat Genet 200739(7 Suppl)S37ndash42

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal-intensity data.

Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.

After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.

An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both programs and increases confidence in replicated events.
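The two-algorithm recommendation can be illustrated with a minimal sketch that keeps only the calls from one program that are supported by a second. The `CnvCall` record, the function names and the 50% reciprocal-overlap threshold are assumptions made for illustration; they are not taken from any of the packages reviewed here, and a suitable threshold will depend on array density and event size.

```python
from typing import List, NamedTuple


class CnvCall(NamedTuple):
    """One CNV call (hypothetical record layout, coordinates in bp, inclusive)."""
    chrom: str
    start: int
    end: int
    state: str  # "loss" or "gain"


def reciprocal_overlap(a: CnvCall, b: CnvCall) -> float:
    """Return the smaller of the two fractions of each call covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    len_a = a.end - a.start + 1
    len_b = b.end - b.start + 1
    return min(shared / len_a, shared / len_b)


def replicated_calls(calls_a: List[CnvCall],
                     calls_b: List[CnvCall],
                     min_overlap: float = 0.5) -> List[CnvCall]:
    """Keep calls from algorithm A that a call from algorithm B supports.

    A call is treated as replicated when B reports an event of the same
    direction whose reciprocal overlap reaches min_overlap (assumed threshold).
    """
    return [
        a for a in calls_a
        if any(a.state == b.state and reciprocal_overlap(a, b) >= min_overlap
               for b in calls_b)
    ]


# Invented example calls from two hypothetical programs.
algorithm_a = [
    CnvCall("chr1", 1000, 2000, "loss"),
    CnvCall("chr2", 5000, 6000, "gain"),
    CnvCall("chr3", 100, 200, "loss"),
]
algorithm_b = [
    CnvCall("chr1", 1200, 2100, "loss"),
    CnvCall("chr2", 5900, 9000, "gain"),
]

for call in replicated_calls(algorithm_a, algorithm_b):
    print(call)
```

In this toy input only the chr1 deletion is retained: the chr2 gains overlap by about a tenth of the shorter call, and the chr3 deletion has no counterpart. Requiring reciprocal rather than one-sided overlap is a design choice that stops a very large call from one program "confirming" many small calls from the other.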

15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):1166–74.

16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.

17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):1424–6.

18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):2013–25.

19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):1665–74.

20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009.

21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e138.

22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:204.

23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):557–72.

24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):309–18.

25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):3763–70.

26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008;24(6):768–74.

27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):1316–33.

28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.

29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):1245–52.

30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):273–84.

31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S7–15.

32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91–9.
