
FACULTY OF SCIENCES

Development of (graphical) web applications for the processing and interpretation of arrayCGH data

Tom SANTE

Master dissertation submitted to obtain the degree of
Master of Biochemistry and Biotechnology
Major Bioinformatics and Systems Biology

Academic year 2009-2010

Promoters: Prof. Dr. Ir. Björn Menten, Prof. Dr. Frank Speleman
Department of Pediatrics and Medical Genetics
UZ Gent, Center for Medical Genetics Ghent

Acknowledgments

... you can’t connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future - Steve Jobs

An academic career is a series of dots. Since you’re reading this section, I’ve come to the end of my Master’s training and I can look back to thank the people who helped me connect the dots along the way. I want to express my gratitude to both my promoters, Prof. Dr. Ir. Björn Menten and Prof. Dr. Frank Speleman, for giving me the chance to work on this project. I am especially grateful to Björn Menten for introducing me to the field of structural variation. You taught me that bioinformatics doesn’t have to be abstract and theoretical, but that it’s an important tool to help researchers and doctors provide better patient care. I thank you for giving me the freedom to connect the dots independently. You helped me stay on track with your expertise and feedback.

I am grateful to everyone at the Center for Medical Genetics who provided assistance with my work: to Lies for showing me the techniques that produce the arrayCGH data that I worked with, and to Steve and Joris for helping me get set up on the ’Mellfire’. I also want to thank the students at the lab; it was fun working together in our packed little students’ office. Special thanks to my classmate Ken and last year’s graduate Greet, for helping me with their LaTeX expertise and sparing me the horrors of writing this thesis in Word.

The Web was a big help. Today a researcher is not only working in a lab, but is also connected to an online world that knows no boundaries. Even though they will probably never read this thesis, I want to use this space to recognise some who inspired and helped me along the way. I want to thank the developers and users of the Mojolicious and CouchDB communities, who helped me to ’Try again, fail again. Fail better.’ Their work on these open source projects is very much appreciated. My thanks also go out to the people of the FriendFeed Life Scientists group, like Jan Aerts. His posts taught me that sometimes you have to think outside the box, and use a tree.

Of course I also thank my friends for sticking with me when I spent more time behind my computer than with them. To my parents I want to say: thank you for the chances you gave me and for supporting me all these years. I am also incredibly grateful to Pieter; your support means a lot to me. Because of you I didn’t forget to trust the future, and managed to finish this thesis.


Contents

List of figures
List of tables
Abbreviations
Abstract

1 Introduction
  1.1 Variation in the human genome
  1.2 Copy number variants
    1.2.1 Formation
    1.2.2 Detection
    1.2.3 Clinical significance
    1.2.4 Interpretation
  1.3 CNV data analysis

2 Aims

3 Materials and methods
  3.1 General technology choices
    3.1.1 Client side
    3.1.2 Server side
  3.2 Data: input and storage
  3.3 Visualization
  3.4 Analysis

4 Results
  4.1 List
  4.2 View
  4.3 Analysis

5 Discussion

6 Conclusion

Bibliography

Addendum: ArrayCGH protocol


List of Figures

1.1 Principle of arrayCGH
1.2 Molecular mechanisms by which chromosomal rearrangements influence phenotypes
3.1 arrayCGHbase organizational scheme
3.2 The tree data structure
4.1 Screenshot of the arrayCGHbase: list and form
4.2 Screenshot of the arrayCGHbase: the view section


List of Tables

1.1 Techniques for detecting structural variation
1.2 Common disease - Common variant
3.1 CouchDB RESTful HTTP API CRUD functions


Abbreviations

AJAX       asynchronous JavaScript and XML
API        application programming interface
arrayCGH   array based comparative genomic hybridization
AWS        Adaptive Weights Smoothing
BAC        bacterial artificial chromosome
CBS        circular binary segmentation
CGH        comparative genomic hybridization
CNV        copy number variant
CSS        cascading style sheets
FISH       fluorescent in situ hybridization
FoSTeS     Fork Stalling and Template Switching
GD         Graphics Library for dynamically manipulating images
GLAD       Gain and Loss Analysis of DNA
GWA study  genome wide association study
HTML       HyperText Markup Language
HTTP       HyperText Transfer Protocol
JSON       JavaScript object notation
LOWESS     Locally Weighted Scatter plot Smoothing
MAPH       multiplex amplifiable probe hybridization
MLPA       multiplex ligation-dependent probe amplification
MMBIR      Microhomology Mediated Break Induced Replication
NAHR       Non-Allelic Homologous Recombination
NHEJ       Non-Homologous End Joining
OMIM       Online Mendelian Inheritance in Man
QMPSF      quantitative multiplex PCR of short fluorescent fragments
qPCR       quantitative polymerase chain reaction
RE         regulatory elements
REST       REpresentational State Transfer
SLE        systemic lupus erythematosus
SNP        single-nucleotide polymorphism
SQL        structured query language
STR        short tandem repeat
SVG        scalable vector graphics
TIFF       tagged image file format
VCFS       velocardiofacial syndrome
VNTR       variable number of tandem repeat
XML        eXtensible Markup Language


Abstract

Cytogenetics is important in diagnostics and genomic research. It has evolved from traditional chromosome banding analysis to a microarray based technology for comparative genomic hybridization (arrayCGH). The improving resolution and rising usage of arrayCGH bring with them an ever growing amount of data to be analyzed. With the rise of next-generation sequencing, the next revolution in molecular cytogenetics, this growth in data will accelerate. The challenge today is to build the tools necessary to process and interpret the data, and to help researchers differentiate clinically relevant chromosomal aberrations from neutral structural variation. Because structural variation is ubiquitous in the human genome, aggregation of existing research and use of CNV databases are essential for interpretation. Starting from scratch, a new version of the existing analysis platform arrayCGHbase was developed. This lays a foundation that allows the tool to handle a growing number of data points. The web application is built to help researchers and doctors use this data from benchside to bedside.

* * *

Cytogenetics is important in diagnostics and genetic research. Its techniques have evolved from karyotyping of chromosomes by microscopy, later to the study of chromosome banding patterns, and most recently to the use of microarray based technology (array based comparative genomic hybridization, arrayCGH). The improvements in resolution and the increasing use of arrayCGH mean that ever more data has to be analyzed. The new generation of DNA sequencing applications, which are now entering cytogenetics, will accelerate this growth in the amount of data even further. The challenge is to interpret this growing amount of data, and to help researchers find out which structural variants may be of clinical importance and which are part of the ubiquitous neutral variation in the human genome. It is therefore essential that information from research and from CNV databases can be used optimally in the interpretation. To help the current web application for arrayCGH data analysis, arrayCGHbase, cope with larger data sets, a completely new version was built. It lays the basis for further development of the application in the future, so that it can keep helping researchers to efficiently process data into information that is usable by care providers.


Chapter 1

Introduction

1.1 Variation in the human genome

The Human Genome Project released the first near-complete human genome sequence in 2003, a milestone in human genetics (IHGSC, 2004). This 3 × 10^9 base pair sequence has been a key tool in biomedical research. But a single reference version of the human genome does not account for the genetic variation present in the human population. Although two individuals are more than 99.9% similar at the sequence level, the small difference mediates the large phenotypic diversity and variable disease susceptibility in our species. This genetic variation manifests at different levels. Even before genome sequencing was available, conventional cytogenetic techniques such as karyotyping showed large morphological changes in chromosomes. These large-scale chromosomal rearrangements or numerical aberrations can have severe phenotypic consequences (e.g. Trisomy 21, Down syndrome) or be benign variants (e.g. Robertsonian translocation). At a much finer scale, single-nucleotide polymorphisms (SNPs), repeat structures of DNA (variable number of tandem repeats, VNTRs), mini/micro-satellites (short tandem repeats, STRs) and transposable elements comprise the variable genomic landscape. For a long time, SNPs were recognized as the major source of genomic variation. In contrast, structural aberrations like copy number variants (CNVs) were long an underappreciated source of variation. Resequencing efforts and genomic microarrays have, however, highlighted the importance of CNVs.

1.2 Copy number variants

Copy number variants are defined as quantitative changes caused by deletions or duplications of DNA sequences 1 kilobase or longer in length. The advent of array comparative genomic hybridization and paired end fosmid sequencing has allowed the identification of many classes of CNVs (Iafrate et al., 2004; Sebat et al., 2004; Tuzun et al., 2005). CNVs smaller than 500 bp are still underrepresented in most studies because of technical difficulties in their detection, although their capture will probably become possible with the latest applications of next-generation sequencing.

1.2.1 Formation

CNVs can be inherited or formed de novo. Formation of a CNV can be caused by rearrangements ranging from a few nucleotides to several Mb. Based on experimental observations, different mechanisms are thought to underlie the formation of CNVs. Non-Allelic Homologous Recombination (NAHR), Non-Homologous End Joining (NHEJ), Fork Stalling and Template Switching (FoSTeS) and Microhomology Mediated Break Induced Replication (MMBIR) are the currently known mechanisms. Inherited CNVs and most de novo formed CNVs are constitutional, so they are present in all cells. De novo formed CNVs can also be somatic and lead to mosaicism for the CNV. Somatic CNVs are common in tumors and are challenging to detect because the tissue sample will contain a mosaic mixture of cells.

1.2.2 Detection

In the early days of cytogenetics, visual inspection of chromosomes under the microscope led to the discovery of one of the first large chromosomal aberrations, Trisomy 21 (Lejeune et al., 1959). Later, chromosome banding analysis (e.g. G-banding (Seabright, 1971)) became the gold standard in cytogenetic diagnostics. This technique allowed the identification of numerous aberrations underlying many diseases and syndromes. Important examples are velocardiofacial syndrome (VCFS), caused by a microdeletion on the long arm of chromosome 22 (band 22q11.2) (Driscoll et al., 1992), and deletions in the 15q11 region in Angelman and Prader-Willi syndrome (Butler et al., 1986; Magenis et al., 1990). Although chromosome analysis discovered many of these syndromes, its applications were still limited. Because of technical issues, detection of CNVs was only possible at a resolution of >5 Mb and for simple structural changes. The development of fluorescent in situ hybridization (FISH) alleviated some of these limitations and allowed the study of more complex rearrangements (Langer-Safer, 1982; Van Prooijen-Knegt et al., 1982). The biggest drawback of FISH is the need for a priori knowledge of the locus implicated in the disease. The development of comparative genomic hybridization (CGH), first used for cytogenetic analysis of solid tumors (Kallioniemi et al., 1992), allowed mapping of relative DNA copy number between complete genomes, but it still needs metaphase spreads of chromosomes and hence has the same limited resolution (>5 Mb).

The extent of copy number variation of submicroscopic segments only became clear with the introduction of arrayCGH (Solinas-Toldo et al., 1997). By combining CGH with microarray technology, efficient genome wide analysis of diseased and phenotypically normal patients became possible. In arrayCGH, differentially fluorescently labeled test and control DNA samples are hybridized with BAC, P1, cDNA or cosmid clones, or more recently synthetic oligonucleotides, that are fixed to a glass slide. High throughput measurement of the arrays results in a copy number profile represented by the log2 ratio of the two fluorescent dyes.
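As a worked illustration of what such a log2 ratio profile looks like, consider an idealized case. The assumptions here are not taken from the text: a noise-free measurement, a reference sample with two copies of the region, and a non-mosaic test sample with n copies. The expected log2 ratio is then:

```latex
\log_2\frac{n}{2}:\qquad
n = 1 \;\Rightarrow\; \log_2\tfrac{1}{2} = -1,\qquad
n = 2 \;\Rightarrow\; \log_2\tfrac{2}{2} = 0,\qquad
n = 3 \;\Rightarrow\; \log_2\tfrac{3}{2} \approx 0.58
```

Under these assumptions a heterozygous deletion appears as a stretch of probes around -1 and a single-copy gain as a stretch around 0.58. Measured ratios are typically closer to zero than these theoretical values.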

The first arrays mostly used large-insert clones like bacterial artificial chromosomes (BACs). Spotting 3000 clones on an array allowed the interrogation of the whole genome at ∼1 Mb resolution. Over the past 10 years the arrays improved by using smaller insert clones, and now mostly oligonucleotides. Spot density increased from 3000 to 44000 and now even millions of probes on a single array, which also allows the investigation of multiple independent samples on one glass slide. Smaller probes and higher density arrays result in higher resolutions in the range of 1-50 kb. Oligonucleotide arrays now offer the highest resolutions and greatest flexibility in array design. Using available genome sequences, one can develop targeted arrays with probes specially designed to cover specific areas of interest (e.g. high repeat regions, segmental duplications, unstable genome regions) that are otherwise hard to study with standard whole genome arrays.

As the cytogenetic tools evolved from traditional chromosome banding analysis to a microarray based technology (arrayCGH), the improved resolution, high density, lower cost and less labour intensive benchwork made arrayCGH an important tool in diagnostics and genomic research. An overview of the techniques for detecting structural variation is given in table 1.1.

Figure 1.1: Principle of arrayCGH. 1. Extraction of the DNA, 2. Differential labeling of the DNA with Cy3 & Cy5, 3. Mixing the samples (+ cot-1), 4. Hybridization on the array, 5. Scanning the array and measuring the signal intensities of the dyes, 6. Analysis of the copy number profile. (Buysse, 2009)

1.2.3 Clinical significance

Initially, many genome wide association (GWA) studies of CNVs used data from SNP arrays to link copy number variants to disease. A CNV can be detected indirectly by looking for linkage disequilibrium or Mendelian inconsistencies. That way the already available SNP datasets could be computationally analyzed to detect CNVs undetectable by the low resolution BAC arrays available at the time. Although this information inferred from SNP data is useful, it is heavily biased towards regions that are well genotyped. This is one of the reasons why the contribution of CNVs to complex and common disease was originally underestimated. With the use of arrayCGH and targeted resequencing it became possible to discover CNVs in regions not covered by SNP arrays, sometimes called the unSNPable genome.

These GWA study approaches for CNVs spurred the discovery of a growing number of new microdeletion syndromes and CNVs associated with disease. These common risk variants (minor allele frequency, MAF > 5%) implicated in disease lead to a clear phenotypic effect. They can cause a dosage change in genes or affect regulatory elements; other molecular mechanisms by which these aberrations can cause a functional effect are shown in figure 1.2. The results of the studies led to the ’common disease - common variant’ hypothesis: a common disease is an additive or multiplicative effect of the risk variants (Risch and Merikangas, 1996). A low copy number of the human beta-defensin 2 gene (Fellermann et al., 2006) and deletions altering the IRGM gene (McCarroll et al., 2008) have been associated with Crohn disease, and a high presence of de novo CNVs is associated with autism spectrum disorders (Sebat et al., 2007; Weiss et al., 2008). Other examples of common variants associated with common disorders are listed in table 1.2.

Table 1.1: Techniques for detecting structural variation (Feuk et al., 2006)

Method                          Translocation  Inversion   CNV >50kb   CNV <50kb
Genome-wide scans
  Karyotyping                   Yes (>2Mb)     Yes (>2Mb)  Yes (>2Mb)  No
  arrayCGH (BAC)                No             No          Yes         No
  arrayCGH (oligo)              No             No          Yes         Yes
  SNP array                     No             No          Yes         Yes (SNPs)
Targeted scans
  Microsatellite genotyping     No             No          Yes (del)   Yes (del)
  MAPH                          No             No          Yes         Yes
  MLPA                          No             No          Yes         Yes
  QMPSF                         No             No          Yes         Yes
  Real-Time qPCR                No             No          Yes         Yes
  FISH                          Yes            Yes         Yes         Yes
  Southern blotting             Yes            Yes         Yes         Yes
Targeted and genome-wide scans
  Targeted resequencing         Yes (del)      Yes         Yes         Yes
  Sequence-assembly comparison  Yes            Yes         Yes         Yes
  Paired end fosmid sequencing  Yes            Yes         Yes         Yes

Table 1.2: Common disease - Common variant, examples (Estivill and Armengol, 2007)

Disease                                                     Reference(s)
HIV-1/AIDS susceptibility                                   (Gonzalez et al., 2005)
Rheumatoid arthritis and type 1 diabetes                    (McKinney et al., 2008)
SLE, microscopic polyangiitis and Wegener granulomatosis    (Aitman et al., 2006) (Fanciulli et al., 2007)
SLE                                                         (Yang et al., 2004)
Crohn disease                                               (Fellermann et al., 2006) (McCarroll et al., 2008)
Bipolar disorder                                            (Lachman et al., 2007)
Autism spectrum disorders                                   (Sebat et al., 2007) (Project Consortium, 2007) (Weiss et al., 2008)
Familial breast cancer                                      (Frank et al., 2007)

Common variants discovered in GWA studies that are associated with disease did not always cause common, early onset or penetrant diseases. They might go without consequence if they cover dosage insensitive genes or intergenic regions, or they might only modulate disease susceptibility without a clear biological effect.

The use of larger test sample populations and higher resolution data in GWA analysis made it possible to differentiate between rare CNVs and polymorphic CNVs with a low frequency. Recently, a GWA study of 16,000 cases (Wellcome Trust Case Control Consortium, 2010) showed that the total number of common CNVs was insufficient to account for the numerous common diseases that exist. Therefore a new hypothesis, ’common disease - rare variant’, was postulated, which is better supported by the current data on known common CNVs (e.g. rare de novo deletions and duplications increasing the risk of schizophrenia (International Schizophrenia Consortium, 2008)).

Figure 1.2: Molecular mechanisms by which chromosomal rearrangements influence phenotypes. A rearrangement can encompass a dosage sensitive gene, causing disease (A), or cause dysregulation of a gene by affecting a regulatory element (E). A deletion (B), duplication, translocation or inversion (C) can disrupt a gene and cause disease. When the two breakpoints of a deletion are situated within different genes, a fusion gene is created (D). The rearrangement can unmask a recessive allele (F) or a functional polymorphism, or affect repression or activation at an epigenetic level by interrupting communication between alleles. (Genes are depicted as green or red rectangles, regulatory elements (RE) as purple rectangles, and a point mutation is marked with an asterisk (*). Figure from (Buysse, 2009), based on (Lupski and Stankiewicz, 2005; Feuk et al., 2006).)

Soon it became clear that not only had the impact of copy number variation on disease long been underestimated, but extensive copy number variation is also present in healthy individuals. The variation seen in arrayCGH surveys is extensive: between any two individuals an average cumulative CNV locus length of 24 Mb (0.78% of the entire human genome) is seen, and within the population all CNVs found collectively span 12-16% of the autosomal genome (Conrad et al., 2010; Itsara et al., 2009). Conrad et al. estimate that 80-90% of common CNVs are now known, and more and more CNVs are found each day. The cumulative CNV locus length doesn’t increase at the same pace, indicating that the more recently discovered CNVs are smaller. Therefore the average size of CNVs is expected to drop as technical improvements overcome the detection bias.

In this context, the discovery that previously undetectable submicroscopic CNVs can lead to variable phenotypes in mental retardation illustrates that arrayCGH is valuable as a diagnostic tool, and in research to elucidate the role of CNVs in idiopathic cases of mental retardation (Buysse, 2009).

1.2.4 Interpretation

If all humans carry CNVs and no clear rules are available to know whether a CNV is implicated in a clinical phenotype, how do we decide which CNVs found in a diagnostic genome wide assay are clinically relevant? Based on experience with arrayCGH in a diagnostic setting, several researchers have proposed a workflow to help answer this question (Koolen et al., 2009; Buysse et al., 2009).

Starting from the complete set of CNVs obtained after analysis, the first step is to identify whether any known causal CNV is present in the set. To be able to do this, a database is needed that links common and rare CNVs with clinical data. The DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER) is one such database that aggregates clinical research data to aid scientists in interpreting the relevance of a CNV; ECARUCA is a database with a similar aim. If a CNV is not a known causal CNV, the presence of OMIM genes within the CNV might imply possible clinical significance. In this context, gene ontologies can help in deciding whether dosage changes of the contained genes might explain the phenotype of a patient.

Next, removal of common or normal CNVs is attempted by cross referencing the set of CNVs with a database of known genomic variants in a healthy population. The DGV (Database of Genomic Variants (Zhang et al., 2006), data from healthy human samples) and the recently started dbVar project by NCBI (Database of Genomic Structural Variation, data from all species plus clinical information) attempt to collect datasets of genomic variation for this purpose. When using the data of these public databases it is important to remember that they also contain older low-resolution datasets; often the CNV length noted in those sets is an overestimate of the actual segment length. To alleviate the problem of low-resolution data and other platform specific parameters, it is advisable to also build an internal control dataset based on data obtained with the same platform used for the diagnostic testing. Of course common and normal CNVs can still be associated with disease. Studies of the gene content of CNVs have shown that they are often enriched for genes associated with environmental interaction and response to specific stimuli. Even though they might not directly result in a clinical phenotype, they may still contribute to the severity of a disease, influence disease susceptibility or predisposition, and play a role in drug response.
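The cross-referencing step can be illustrated with a minimal sketch. It is not the method used by arrayCGHbase or by any of the databases named above. The `Region` type, the function names and the 50% reciprocal overlap threshold are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A genomic interval (hypothetical helper type); end must exceed start."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlap_length(a: Region, b: Region) -> int:
    """Number of bases shared by two regions (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def reciprocal_overlap(a: Region, b: Region) -> float:
    """Smaller of the two fractions: shared bases over each region's length."""
    shared = overlap_length(a, b)
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def filter_known(cnvs: List[Region], known: List[Region],
                 min_overlap: float = 0.5) -> List[Region]:
    """Keep only CNVs that match no known variant.

    min_overlap = 0.5 is an assumed threshold, not a value from the text.
    """
    return [c for c in cnvs
            if not any(reciprocal_overlap(c, k) >= min_overlap for k in known)]
```

Requiring the overlap to be reciprocal is one possible way to keep a single oversized low-resolution database entry from masking a much smaller CNV that merely falls inside it.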

If the CNVs left in the set are not known to be causal, common or normal CNVs, studying their inheritance is helpful in judging their significance. De novo rearrangements, not found in the parents, are more likely to be the cause of a clinically abnormal phenotype. The arrayCGH test of the child is then followed by testing both parents, each versus a control sample (representing a ’normal’ genome). Alternatively, the parent samples can be compared directly by hybridizing them with the child sample instead of a control sample. Even with these results, it is not always easy to elucidate the inheritance and molecular mechanism by which CNVs influence phenotypes.

After the above steps are followed, it is important to validate the possible causal CNV, e.g. with a targeted scan (Table 1.1). Platform dependent bias, and bias caused by sample collection or the choice of tissue used for the DNA sample, should be avoided.

In the end, many CNVs are left with unknown significance. Advances in arrayCGH technology, paired-end sequencing and whole genome sequencing will help, but the biggest hurdle to improving our understanding of the functional importance of CNVs is the analysis of the data. To tackle that challenge, analysis will have to integrate all available data, like genotype information, allele state of CNVs and neighboring sequence context.

1.3 CNV data analysis

Normalization

The data obtained from an arrayCGH experiment is a list of the log2 intensity ratios of the two dyes used to mark the test and reference sample, for each probe on the array. The first step of the analysis is quality control (QC) and normalization of the data. The QC is a first assessment of the quality of the experiment, to detect artifacts caused by array processing (e.g. during the labeling, wash or hybridization phase). Visual inspection of the scanned array image will show some of the more obvious artifacts. Plotting the spatial distribution of outliers on the array (M-XY plot) can show possible spatial bias. The signal intensity distribution for both dyes can be assessed in an M-A plot (M = log ratio versus A = log mean intensity), in which intensity bias and/or background may become evident. These elements of systematic variation have to be compensated for by normalization of the intensities, and many published methods can correct for some of the bias. First the log2 ratios are median normalized by subtracting the median log intensity of the whole array ($m_i$) from that of each spot ($x_i$):

$$\log x'_i = \log x_i - \log m_i$$

Background bias correction can be done by subtracting the estimated background intensity from the estimated foreground intensity for each spot. Intensity bias can be corrected with Robust Locally Weighted Scatter plot Smoothing (robust LOWESS), which combines a local regression weight function with a robust weight function to make it resistant to outliers (Cleveland, 1979). The copy number profiles of tumor samples often contain a wave bias, detectable by visual inspection of the profile, that interferes with segmentation algorithms. The NoWaves R-package can be used to remove the waves and improve the accuracy of CNV detection (van de Wiel et al., 2009).

The end goal of this step of the analysis is an intensity ratio profile for the array that can be compared with other samples and is usable in the next step of the analysis.
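The M-A transformation and the median normalization described above can be sketched in a few lines. This is only an illustration using the Python standard library, not the arrayCGHbase implementation. The function names are hypothetical, and A is assumed to be the mean of the two log2 intensities, a common convention that the text does not spell out.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def ma_values(test: Sequence[float],
              ref: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (M, A) per probe from raw, strictly positive dye intensities.

    M is the log2 ratio test/reference; A is the mean log2 intensity
    (assumed definition, see lead-in).
    """
    m = [math.log2(t) - math.log2(r) for t, r in zip(test, ref)]
    a = [0.5 * (math.log2(t) + math.log2(r)) for t, r in zip(test, ref)]
    return m, a


def median_normalize(log_ratios: Sequence[float]) -> List[float]:
    """Subtract the array-wide median so the profile is centred on zero."""
    centre = median(log_ratios)
    return [x - centre for x in log_ratios]
```

Plotting M against A before and after normalization is one way to check whether an intensity-dependent bias remains. If it does, a LOWESS correction as described above would still be needed, which this sketch does not attempt.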

Segmentation and smoothing

The next step is to take the normalized data and transform it into useful biological information about the copy number variants in the genome. Every probe of the array can be mapped to a position in the genome, so each log2-ratio represents a relative measure of the copy number of the sample for that specific region. In these terms a CNV is a series of consecutive probes that share a similar ratio, one that differs from the ratio of their surrounding region.

Because of small variations in signal inherent to the technique (not related to the copy number of the sample) it can be useful to apply smoothing. By reducing the noise, smoothing makes it easier to find probes with similar values. Next comes a process called segmentation: a set of consecutive probes with similar log2-ratios is grouped into one segment with specific borders or breakpoints, to which a single common ratio is assigned. The most commonly used algorithms are CBS (circular binary segmentation (Olshen et al., 2004)) and GLAD (Gain and Loss Analysis of DNA (Hupé et al., 2004)). CBS is a modification of the binary segmentation algorithm, and GLAD uses an Adaptive Weights Smoothing (AWS) procedure to find the borders or breakpoints of the segments.
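To make the idea of smoothing followed by segmentation concrete, the sketch below uses a running median and a deliberately naive greedy grouping rule. It is neither CBS nor GLAD, which use statistical tests and adaptive weights respectively. The window size and the tolerance are arbitrary assumptions chosen for illustration.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple


def running_median(values: Sequence[float], window: int = 3) -> List[float]:
    """Smooth a profile with a centred running median (window shrinks at the ends)."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def segment(values: Sequence[float],
            tolerance: float = 0.3) -> List[Tuple[int, int, float]]:
    """Greedily group consecutive probes into (start, end, mean) segments.

    A new segment starts when a probe deviates from the running mean of the
    current segment by more than `tolerance` (an assumed, arbitrary cut-off).
    `end` is exclusive.
    """
    segments = []
    start = 0
    for i in range(1, len(values) + 1):
        at_end = i == len(values)
        if at_end or abs(values[i] - mean(values[start:i])) > tolerance:
            segments.append((start, i, mean(values[start:i])))
            start = i
    return segments
```

For example, `segment(running_median(profile))` returns one tuple per segment, with the breakpoints given as probe indices and the common ratio as the segment mean. A real analysis would map these indices back to genomic coordinates and use one of the published algorithms instead.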

Calling

When the experiment compares a test with a reference sample, the reference sample is assumed to have no aberrations. This assumption makes it essential to choose a good reference sample, as this will impact the assignment of a discrete copy number based on the relative ratio. The reference is often a pool of multiple DNA samples because, as mentioned earlier, CNVs are found in all humans, so no single definitive reference genome exists. The reference should be chosen based on the type of sample used or the goal of the experiment.

Calling is the process of categorizing the log2-ratios as a ’loss’, ’normal’ or ’gain’ copy number, representing deletions or duplications in the genome. Traditionally this is done with thresholds based on the results of earlier self-hybridization experiments, which hybridize the test sample with itself instead of using the sample-reference combination. More advanced methods (e.g. CGHcall (van de Wiel et al., 2007)) have been developed that take into account biological information in the arrayCGH data, like segment breakpoints and the natural distribution of log2-ratios, or that use clustering techniques or heuristic models.

As a final result the dataset of log2-ratios is transformed into a list of genome segments, each assigned a copy number.
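The traditional threshold-based calling described above could look like the following sketch. The function name and the threshold values of ±0.3 are placeholders, not values from this work. In practice the thresholds would be derived from self-hybridization experiments on the platform in use.

```python
def call_segment(log2_ratio: float,
                 loss_threshold: float = -0.3,
                 gain_threshold: float = 0.3) -> str:
    """Classify a segment's log2 ratio as 'loss', 'normal' or 'gain'.

    The default thresholds are illustrative assumptions only.
    """
    if log2_ratio <= loss_threshold:
        return "loss"
    if log2_ratio >= gain_threshold:
        return "gain"
    return "normal"
```

Applied to every segment produced by the segmentation step, this yields the final list of segments with a discrete call. Methods such as CGHcall replace the fixed cut-offs with a statistical model.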


Chapter 2

Aims

Because of the increasing use and better resolution of arrayCGH technology, investigators routinely have to handle large data volumes in research as well as in diagnostics. The growth in data size is expected to accelerate greatly as next generation sequencing technology becomes the tool of the future in genetic diagnostics and research, generating an ever increasing amount of sequence reads.

The ability to take this data, process it, visualize it and turn it into useful information is a challenge. We’ve reached a tipping point in science where the experimental techniques and sample collection are no longer the limiting factor for advancing our knowledge. Instead, data analysis is becoming the bottleneck. Data in itself is only useful if the researcher or diagnostician can use their knowledge and expertise to extract the useful information from it. The raw arrayCGH data listing all probe intensities means nothing until it has been normalized, segmented and further analyzed, distilling it to a potential set of clinically relevant CNVs. Only then can we speak of information instead of data, and only then has it become useful in a clinical context to help reach a better diagnosis and better patient care. But the cycle doesn’t stop there: after data analysis, data management is important so that we can keep learning from this information to advance our knowledge of structural variation in the human genome and in turn improve our ability to analyze future data.

To be able to keep up with the technological advancements and tackle the mountain of data, it is essential that we have the proper tools to optimize the analysis pipeline. When arrayCGH was developed, the first studies soon demonstrated its potential in discovering new genetic disorders, finding critical areas implicated in existing diseases, and learning the mechanisms at the source of structural genomic variation. As a pioneer in Belgium for the application of arrayCGH, the Center for Medical Genetics Gent developed arrayCGHbase to facilitate the analysis and interpretation of arrayCGH data (Menten et al., 2005). Different individual tools exist that each handle part of the workflow: R modules like aCGH, aCGH-smooth and DNAcopy analyze CNV data, database systems like MySQL store data, and Laboratory Information Management Systems manage biomedical annotations. The merit of arrayCGHbase is that it bundles the features of these tools in one integrated open source analysis platform. ArrayCGHbase is web-based with a server back-end, so it can be used throughout the lab on any computer. The integration of the tools in a user-friendly platform allows investigators to analyze their data without needing a bioinformatics background or R expertise.

As the array resolutions keep rising and with sequencing techniques looming around the corner as a diagnostic tool, it was clear that the existing arrayCGHbase back-end would soon no longer be able to efficiently handle these larger datasets. The thesis project attempts to tackle that problem by developing a new data storage back-end with an accompanying web interface. The three main focus points of the project are:

Storage size: The new database needs to be able to handle more data points for each experiment. A routine array experiment on an Agilent Human Genome CGH Microarray 4x180K generates more than a gigabyte (GB) of raw image data, which after feature extraction can be reduced to four 60 megabyte (MB) text files with intensity measurements for each probe. The next generation sequencing platforms are able to produce terabytes of image data, resulting in an output of multiple gigabases of reads. These amounts are no longer manageable in the original arrayCGHbase MySQL database, which will be replaced by a system suitable for large datasets.

Data flexibility: Not only the amount of data but also the format is important. As techniques improve and data formats advance, the new database needs to be flexible enough to handle diverse data formats. The goal is to make it data format agnostic, shifting from an inflexible fixed column approach in MySQL to a more free-form data store. This would allow arrayCGHbase to make the transition to sequencing data in the future.

Visualization: Besides storing large and diverse data, the new system needs a way to handle the processing of the data, so it can be analyzed and visualized. One pixel on the computer screen often represents thousands of data points. The data store should therefore be able to aggregate the data and optimize it for visualization in the browser. In the last few years, the ubiquitous use of dynamic browser interfaces has shown that the browser platform is a powerful tool to deliver information in a user friendly way, while using a powerful server back-end to handle the data itself.

The end goal of the project is not a finished product, but a prototype implementation that serves as a foundation for a new version of arrayCGHbase. This should allow the tool to evolve in the future and allow investigators to use new techniques and algorithms to process raw data into useful information, without sacrificing the ease of use of the original project. It takes advantage of recent innovations in data storage and web browser technology to deliver large data sets to the user.

Chapter 3

Materials and methods

This chapter details the most important technologies and design decisions of the project. A prototype implementation, referred to as arrayCGHbase 4, is available at http://mellfire.ugent.be/arrayCGHbase4/. Figure 3.1 gives an overview of the organizational structure of the arrayCGHbase implementation.

3.1 General technology choices

3.1.1 Client side

The new arrayCGHbase system implements a client side browser interface to the server side database and application layer. The choice for a web interface allows this system to be used from any modern browser (tested with Mozilla Firefox 3.5), thereby avoiding the complex task of deploying a desktop application to each user, and allowing easy incremental deployment of updates to the system. Because no experiment data is kept on the client side, we have full control over how the data is served, stored and secured on the server. The interface is built according to the HTML5 draft specification¹ with CSS styling, and uses JavaScript (JS) for the dynamic parts of the interface. JavaScript is a widely used scripting language for making dynamic websites. Most browser implementations converge onto the standard specification by ECMA (European Computer Manufacturers Association²) while still providing some browser specific features. To facilitate the writing of concise, cross-browser compatible JavaScript code, the jQuery JavaScript library³ is used. This HTML5+CSS+JS client interface communicates with the server via HTTP requests in JavaScript (a technique often called AJAX). The data itself is exchanged in the JSON data format⁴. The images for CNV visualization are in the SVG (Scalable Vector Graphics) format, as this delivers high resolution images that can be used in publications and edited further as vector images.

¹ W3C: http://dev.w3.org/html5/spec/Overview.html
² ECMA-262: http://www.ecma-international.org/publications/standards/Ecma-262.htm
³ jQuery: http://jquery.com
⁴ JSON: http://www.json.org

Table 3.1: CouchDB RESTful HTTP API CRUD functions

Operation   HTTP method   URI
Create      PUT/POST      /database/doc_id
Read        GET           /database/doc_id
Update      PUT           /database/doc_id
Delete      DELETE        /database/doc_id

3.1.2 Server side

The program running on the server side is written in the Perl 5⁵ programming language and uses the Mojolicious web framework⁶. Mojolicious was chosen because it has no Perl module dependencies besides those that are part of the Perl core, and because it implements most modern HTTP protocols for easy communication with the database back-end and the client side interface. The Apache CouchDB database⁷ serves the data to the Perl application via HTTP. It is a NoSQL database using a schema-free document store and is known for its ability to handle huge data sets.

3.2 Data: input and storage

The data used in the making and testing of the project was obtained from an arrayCGH 8x60K microarray (Agilent G4450A) experiment, following the standard CMGG protocol as used for diagnostics (see Appendix). After the labeling, precipitation, hybridization and washing steps, the microarray is scanned by an Agilent SureScan high resolution microarray scanner. The TIFF image file from the scanner is then processed by the accompanying Agilent Feature Extraction Software to calculate signal intensity and perform signal correction for background and signal bias. The output is a text file containing experiment data, quality control parameters and a signal intensity measurement expressed as a log-ratio. These files serve as input for arrayCGHbase.

The data is imported into CouchDB, where each experiment gets a separate database, which contains a JSON document for each feature. Each document is identified by an id based on its chromosomal position and is structured as illustrated in note 1. The database interface is a RESTful HTTP API that implements all basic functions of persistent storage, see Table 3.1. The database system gives us several advantages. The first advantage is the replication functionality via HTTP that is built into CouchDB. It allows continuous replication to a second server for backup. The replication data can be filtered to implement horizontal partitioning (sharding) of huge datasets over several servers. Second, since CouchDB is written in the Erlang programming language, it is highly concurrent. Because of this, a long running analysis (doing many read and write operations) won't block database access, and will have little impact on performance for the users of arrayCGHbase accessing the database. Lastly, CouchDB internally uses a B-tree storage structure to save the data to an append-only file, and most REST operations are atomic, making it a very robust system that prevents data corruption in case of a server crash.
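
To illustrate the position-based document id, the sketch below builds an id in the format shown in Note 1. The zero-padding widths (2 digits for the chromosome, 10 for the start position) are inferred from that single example, and the function names, the integer chromosome encoding and the idea of deriving a key range from two ids are assumptions made for illustration.

```python
def feature_doc_id(chromosome, start, counter=0):
    """Build an id of the form d:<chromosome>:<start>:<counter>.

    Padding widths are inferred from the example in Note 1.
    """
    return "d:{:02d}:{:010d}:{:d}".format(chromosome, start, counter)


def id_range(chromosome, range_start, range_end):
    """Return (startkey, endkey) enclosing all ids that start in the range.

    Hypothetical helper: it relies on zero-padded ids sorting in genomic order.
    """
    return (feature_doc_id(chromosome, range_start, 0),
            feature_doc_id(chromosome, range_end, 9))


doc_id = feature_doc_id(1, 247179232)
startkey, endkey = id_range(1, 247000000, 247200000)
print(doc_id)                        # d:01:0247179232:0
print(startkey <= doc_id <= endkey)  # True
```

Because the numbers are zero-padded, plain string comparison of two ids agrees with their genomic order, which is what makes a simple start/end key range usable for position lookups.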

Note 1: Example feature JSON document

{
  // d:chromosome nr:start position(bp):counter[0-9]
  "_id": "d:01:0247179232:0",
  "e": "247179291",      // feature end (bp)
  "n": "A_14_P105372",   // feature name
  "v": {
    "raw": "0.036"       // log2 ratio
  },
  "f": 0                 // quality flag: false=0/true=1
}

⁵ Perl: http://www.perl.org
⁶ Mojolicious: http://mojolicious.org
⁷ CouchDB: http://couchdb.apache.org

The database system doesn't allow standard SQL queries but uses map and reduce functions to process data. This system is based on the MapReduce framework introduced by Google (Pike et al., 2005) for processing large data sets. The map function is run for each document, emitting a list of key/value pairs. These pairs can then be reduced by an optional reduce function, which aggregates the values of pairs that have the same key. The result of these operations is written to disk in what CouchDB calls a view file. A view query only allows simple start and stop key range parameters. Once generated, the view file is only regenerated when the data changes, providing fast access when performing queries on the view.
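
The map and reduce steps can be sketched in a few lines of Python. This is an in-memory illustration of the idea only, not CouchDB's implementation (which runs JavaScript functions and persists the result in a view file); `run_view` and `map_chromosome` are hypothetical names, and the example documents are made up.

```python
from collections import defaultdict


def run_view(documents, map_fn, reduce_fn=None):
    """Apply map_fn to every document, then reduce values that share a key."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    if reduce_fn is None:
        return dict(grouped)
    return {key: reduce_fn(values) for key, values in grouped.items()}


def map_chromosome(doc):
    """Emit (chromosome, 1) for a feature document."""
    chromosome = doc["_id"].split(":")[1]
    yield chromosome, 1


docs = [
    {"_id": "d:01:0000010500:0"},
    {"_id": "d:01:0247179232:0"},
    {"_id": "d:02:0000500000:0"},
]
counts = run_view(docs, map_chromosome, sum)
print(counts)  # {'01': 2, '02': 1}
```

Here the reduce step is simply `sum`, giving a feature count per chromosome; any function that collapses a list of values into one aggregate could take its place.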

The most common query on our experiment data is listing all features within a given chromosomal range. Linearly scanning all features to find those within the range is very inefficient and is no longer usable for the data sets of high resolution arrayCGH microarrays. To solve this issue, arrayCGHbase now uses a tree data structure to optimize data access for range queries (Figure 3.2).

The tree data structure was inspired by the LocusTree library written in Ruby (Aerts, 2010). It is an adaptation of an R-tree that groups data into bins of a fixed size, which in turn are grouped into larger bins until the root bin is reached, which covers all data. The tree structure in arrayCGHbase groups all features into bins of 10000 bp; this is referred to as level 0. This is done individually for each chromosome instead of using the genome-wide binning scheme described by Kent et al. (2002). These 'level 0' bins are grouped, ten at a time, into a new larger bin on the next level up. The process is then repeated until the highest level is reached, which covers the whole chromosome. The tree is built by a map and a reduce function as described in algorithms 1 and 2. These functions are written in JavaScript and saved in a design document to generate a view in CouchDB. Each bin, or tree node, contains an aggregate value for the ratios of the features it contains, the minimum and maximum ratio values, and a feature count. To fetch all features within a range, we descend the tree from the root node down through all levels, retrieving only the nodes that overlap the query range.
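
The descent for a range query can be sketched as follows. The bin size and number of children come from the text; the function names, the `(level, node_nr)` representation of a node and the choice of `top_level=2` in the example (to keep the output short) are assumptions for illustration.

```python
BASE_SIZE = 10000   # bp per level-0 bin, as described in the text
NR_CHILDREN = 10    # bins grouped into one bin on the next level up


def node_size(level):
    """Size in bp of one node at the given level."""
    return BASE_SIZE * NR_CHILDREN ** level


def overlapping_nodes(start, end, level):
    """Node numbers at `level` overlapping the 1-based range [start, end]."""
    size = node_size(level)
    first = (start - 1) // size
    last = (end - 1) // size
    return list(range(first, last + 1))


def descend(start, end, top_level):
    """List the (level, node_nr) pairs visited from top_level down to 0."""
    visited = []
    for level in range(top_level, -1, -1):
        for node_nr in overlapping_nodes(start, end, level):
            visited.append((level, node_nr))
    return visited


path = descend(247170001, 247195000, top_level=2)
print(path)
# [(2, 247), (1, 2471), (0, 24717), (0, 24718), (0, 24719)]
```

Only five nodes are touched for this 25 kb query, regardless of how many features lie elsewhere on the chromosome.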

Besides the dye intensity ratios for each experiment, arrayCGHbase also needs to store the metadata. For each experiment it keeps the DNA sample information (dye, patient id), which can be further annotated with lab protocols, extra data sources, text remarks and quality parameters, and it can group experiments into projects. Since the new arrayCGHbase uses schema-free documents to store all information, any type of metadata can easily be expanded with additional annotation fields. Static HTML forms for metadata input don't have the flexibility to take advantage of the schema-free aspect of the data. The project therefore generates the forms dynamically, based on a form template that is also stored in the database, and automatically populates the form fields when existing information is edited. Some data fields link to other documents, for example an experiment is linked with a project. To handle these data relations, fields that act as a foreign key to another document are shown in the form with an option selector element. As more and more data gets added, these lists can get very long, making it hard to find the item you need. Therefore the list is combined with a text field that automatically offers suggestions as you type, offering an efficient way of filling out the form.

CouchDB isn't built to handle queries like textual searches and advanced selection on multiple parameters. To perform these kinds of queries, an external indexer called couchdb-lucene is used. It is made up of two parts. The first is an index server written in Java that indexes the metadata as soon as it changes by monitoring the CouchDB "_changes" feed. This part can be run on a separate server, since the "_changes" feed functions over HTTP and only needs a network connection to communicate with the database server. The second part is a Python script acting as the middleman between the database and the index server. It handles incoming queries, forwards them to the index server and sends back the results as a JSON response.

Algorithm 1: Tree building map function pseudocode

input: ChromosomeLength and feature start, end, ratio

BaseSize ← 10000;
NrChildren ← 10;
level ← log(ChromosomeLength ÷ BaseSize) ÷ log(NrChildren);
for level down to 0 do
    NodeSize ← BaseSize × NrChildren^level;
    StartNodeNr ← (start − 1) ÷ NodeSize;
    EndNodeNr ← (end − 1) ÷ NodeSize;
    if EndNodeNr > StartNodeNr then
        end;  // feature spans multiple nodes
    else
        emit: StartNodeNr, ratio;  // key, value

Algorithm 2: Pseudocode of the tree building reduce function run on key, value pairs emitted by the map function

input: list of values having the same key

initialize min, max, sum, count;
foreach value in values do
    if value < min then
        min ← value;
    if value > max then
        max ← value;
    sum ← sum + value;
    count ← count + 1;
average ← sum ÷ count;
return min, max, average
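
Algorithms 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the JavaScript stored in the design document: the top level is computed with an integer loop that rounds up so that one node covers the whole chromosome, the emitted key is a `(level, node_nr)` pair rather than the node number alone, the reduce step also returns the feature count mentioned in the text, and the chromosome length in the example is only an illustrative value.

```python
BASE_SIZE = 10000
NR_CHILDREN = 10


def top_level(chromosome_length):
    """Smallest level at which a single node covers the whole chromosome."""
    level, size = 0, BASE_SIZE
    while size < chromosome_length:
        size *= NR_CHILDREN
        level += 1
    return level


def tree_map(chromosome_length, start, end, ratio):
    """Yield ((level, node_nr), ratio) for each level where the feature fits in one node."""
    for level in range(top_level(chromosome_length), -1, -1):
        size = BASE_SIZE * NR_CHILDREN ** level
        start_node = (start - 1) // size
        end_node = (end - 1) // size
        if end_node > start_node:
            break  # feature spans multiple nodes at this and all lower levels
        yield (level, start_node), ratio


def tree_reduce(values):
    """Aggregate the ratios of one node into min, max, average and count."""
    return {
        "min": min(values),
        "max": max(values),
        "average": sum(values) / len(values),
        "count": len(values),
    }


# Feature from Note 1 on a chromosome of illustrative length.
pairs = list(tree_map(247249719, 247179232, 247179291, 0.036))
print([key for key, _ in pairs])
# [(5, 0), (4, 2), (3, 24), (2, 247), (1, 2471), (0, 24717)]
summary = tree_reduce([0.036, -0.25, 0.5])
print(summary["min"], summary["max"], summary["count"])  # -0.25 0.5 3
```

One design point worth noting: CouchDB may call a reduce function again on its own intermediate results (a rereduce), so a production version would more safely carry the sum and count forward and derive the average at query time.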

3.3 Visualization

Visualization is important for the interpretation of arrayCGH results. ArrayCGHbase provides a cytogenetic browser to inspect the copy number profile. The browser uses a zoomable sliding window interface. When exploring the results, the browser automatically requests image tiles from the server as you scroll or zoom. The tiles are sent to the browser as SVG images, which are then rendered by the browser. The cytogenetic browser was inspired by the Google Maps API⁸ and the open source OpenLayers⁹ project.

The tree storage structure described above makes this visualization smoother. To draw all features within the displayed region, the application descends the tree only until it reaches the level that has more nodes than there are pixels to draw them on. This removes the inefficiency of needlessly drawing hundreds of features onto one pixel. When viewing larger regions, it does not draw individual features but uses the aggregate values stored in the tree nodes to draw a vertical interval from the minimum to the maximum ratio, with a mark at the average ratio.
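
One possible reading of this level selection is sketched below: descend from the top level and stop at the first level whose node count inside the displayed region exceeds the pixel width. The function name, the `None` return value meaning "draw individual features", and the example widths are assumptions; the actual stopping rule in the application may differ in detail.

```python
BASE_SIZE = 10000
NR_CHILDREN = 10


def drawing_level(region_start, region_end, width_px, top_level):
    """Return the tree level to draw from, or None to draw single features."""
    region_length = region_end - region_start + 1
    for level in range(top_level, -1, -1):
        size = BASE_SIZE * NR_CHILDREN ** level
        nodes_in_view = -(-region_length // size)  # ceiling division
        if nodes_in_view > width_px:
            return level
    return None


# Whole chromosome on a 1000 pixel wide panel: aggregate nodes are drawn.
whole = drawing_level(1, 247249719, 1000, top_level=5)
# A 2 Mb region on the same panel: even level 0 has fewer bins than pixels.
zoomed = drawing_level(1, 2000000, 1000, top_level=5)
print(whole, zoomed)  # 1 None
```

For the whole chromosome, level 1 (100000 bp nodes) is the first level with more nodes than pixels, so nothing finer needs to be fetched.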

3.4 Analysis

For the analysis of the arrayCGH data, arrayCGHbase uses a job/worker system. Long running tasks, such as data import or segmentation, require intensive database access. CouchDB's high concurrency handles this easily without blocking simultaneous data access by other clients. The previous arrayCGHbase version was already easily extensible with new analysis algorithms but could not offload the processing to the background. Now all analysis and data import tasks are saved as a job document in the CouchDB jobs database, and the users can continue working with the browser client. Each job contains a workflow of the tasks it needs to run to complete the job. On the server, a continuously running Perl worker script periodically checks the jobs database and iterates over the jobs waiting for processing. New functionality can be added to the worker script by writing a task handler as a simple Perl module. This first implementation has an Agilent CGH array data importer and an R CBS segmentation handler. If preferred, the worker can be configured to spawn multiple processes, each handling a different job. This parallel job processing makes optimal use of multi-core servers.
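
The job/worker idea can be sketched in Python as below. The real worker is a Perl script reading job documents from CouchDB; here an in-memory list stands in for the jobs database, and all field names (`status`, `workflow`, `log`), handler names and status values are hypothetical.

```python
def import_handler(job):
    """Hypothetical task handler standing in for the Agilent data importer."""
    job["log"].append("imported " + job["file"])


def segment_handler(job):
    """Hypothetical task handler standing in for the CBS segmentation step."""
    job["log"].append("segmented " + job["experiment"])


# Registry of task handlers; adding functionality means adding an entry.
TASK_HANDLERS = {"import": import_handler, "segment": segment_handler}


def process_waiting_jobs(jobs):
    """Run the workflow of every waiting job and update its status."""
    for job in jobs:
        if job["status"] != "waiting":
            continue
        job["status"] = "running"
        try:
            for task in job["workflow"]:
                TASK_HANDLERS[task](job)
            job["status"] = "done"
        except Exception as error:  # keep the worker alive if one job fails
            job["status"] = "failed"
            job["log"].append(str(error))


jobs = [
    {"experiment": "exp1", "file": "exp1.txt",
     "workflow": ["import", "segment"], "status": "waiting", "log": []},
    {"experiment": "exp2", "file": "exp2.txt",
     "workflow": ["export"], "status": "waiting", "log": []},
]
process_waiting_jobs(jobs)
print([job["status"] for job in jobs])  # ['done', 'failed']
```

The second job fails because no handler is registered for its task, which shows how a single bad job is recorded without stopping the worker.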

⁸ Google Maps API: http://code.google.com/apis/maps
⁹ OpenLayers: http://openlayers.org

Figure 3.1: arrayCGHbase organizational scheme. The client side is a browser interface using HTML5, CSS and JavaScript technology. It communicates with the server side via AJAX for dynamic exchange of data in JSON format or SVG images. The client requests are handled by the Mojolicious web application, which uses the CouchDB database to organize the data. Another server side component is the worker script, which performs parallel execution of jobs. Each job is a series of tasks, like data import or R analysis. Each type of task has its own Perl module. This worker system is easily expandable by adding new task handlers.

Figure 3.2: The tree data structure. (A) Illustration of the LocusTree system with the number of children for each node set to 2. To find all features within the blue area, only the green nodes need to be searched for features (Aerts, 2010). (B) The arrayCGHbase tree structure for chromosome 1 uses 10 children per node and a node size of 10000 bp at the lowest level, level 0. Ten bins are grouped into one bin on the level above until the root node is reached, which covers the whole chromosome.

Chapter 4

Results

The web application implemented during this master thesis can be tested online¹ and is detailed in this chapter. The application is divided into three main sections: list, view and analysis.

4.1 List

The first section handles data input and information management. If we follow the standard array analysis workflow, the first step is creating a new experiment. Sample information, reporter selection, project assignment, data file import and other experiment annotation are integrated into one form. When available, form fields linking to other items (e.g. projects, patients) will auto-complete while typing, to speed up form entry. After filling out the form and clicking the save button, the user is asked to confirm the entered information to avoid errors. As soon as the experiment form data is confirmed and sent to the server, a meta document is created in the database and the application also creates a job entry that will handle the data import in the background. The user doesn't need to wait for the import job to finish and may immediately continue to enter more experiments. When the worker has finished the job, the experiment is added to the list section and is available for viewing and analysis.

Existing information is shown in the list and can be browsed by any authenticated user. To help find specific information, a filtering panel is present to query the database index. Experiments can be visualized by clicking the view link, which redirects the user to the view section.

4.2 View

The view section serves as a genome browser for copy number information, or cytogenetic browser. The top panel displays the ideogram for the experiment and, if CBS data is available, CNVs are marked in green (gain) or red (loss). Clicking on a chromosome loads that chromosome in the second panel. The second panel is the actual browser part. The copy number information can be explored by panning left and right using the scroll wheel, or by drawing a semi-transparent zoom box with the mouse over a region of interest. The status bar just above the second panel shows the experiment name and the bp coordinates of the displayed region.

¹ arrayCGHbase development server: http://mellfire.ugent.be/arrayCGHbase4

18

Chapter 4. Results

Figure 4.1: Screenshot of arrayCGHbase: list and form. The list section (A) and the form for creating new experiments (B).

The browser panel plots the array features by chromosome location on the horizontal axis and the log2 scaled ratio on the vertical axis. Feature spots and vertical interval lines (min, max, average) are marked as loss (red) or gain (green) based on a threshold that can be adjusted as preferred. The chromosomal position of features is marked by a numeric axis, and the cytobands², as seen on Giemsa stained chromosomes, are drawn.

The bottom panel is called the tracks area. This area can be used to display genome features imported from external sources (e.g. segmental duplications, OMIM genes). The tracks functionality can be extended with new tracks by importing them into the CouchDB database (the system requires a chromosome, a start and end position, and a feature id).

² UCSC: http://hgdownload.cse.ucsc.edu/downloads.html

Figure 4.2: Screenshot of arrayCGHbase: the view section. The top panel displays an ideogram, the middle panel is the copy number profile with cytobands and the lowest panel shows the tracks (not shown here).

4.3 Analysis

This section controls the data analysis. The functionality currently available in the web interface is limited, since the work on this part concentrated on the back-end functionality. The most important feature of the analysis framework is the job/worker system, detailed in chapter 3. This system can also be used as a basis for building a report module that aggregates and formats experiment data in the form of a report for use in diagnostic practice.

Chapter 5

Discussion

Copy-number variations play a significant role in mediating genetic diversity and are linked with many diseases. Since the development of arrayCGH, a lot of CNVs have been found, estimated to encompass a cumulative length of 360-460 Mb. The ubiquitous presence of CNVs in all humans, healthy and sick, is a challenge when studying the clinical significance of CNVs in a diagnostic setting. Careful analysis of potential causative CNVs is essential for the elucidation of their phenotypic effects. This project attempts a software re-engineering of the original arrayCGHbase, so it can handle the challenges of recent and future advances in genome research and genetic diagnostics. It provides a server side application to handle these growing data sets, and a client side interface using the latest web technologies. The visualization system is an informative and easily extensible interface to interpret CNV data within its genomic context.

This project is built as a web application, to avoid the need for desktop application deployment and to enable the use of powerful servers instead of a client computer to handle analysis. The evolution of web technologies during the last decade has shown that the browser is a more than adequate platform to build highly dynamic, user friendly graphical interfaces. Numerous copy number analysis systems for arrayCGH data are available today, as desktop applications [SnoopCGH, CNVDetector, CGHAnalyser, CGHPro, ChARMView] or as web tools [WebaCGH, ArrayCyGHt, ISACGH]. Some web applications are built as a front end for R packages and implement a specific set of algorithms [e.g. CGHweb, ADaCGH]; others only focus on a specific part of the analysis workflow and need pre- or post-processing with other tools. VAMP attempts to blur the boundary between desktop and web applications by using Java applets, but this essentially uses the browser only as a delivery mechanism for a desktop application.

Only a few systems provide an integrated system for analysis and visualization. CAPweb is an analysis platform that combines different tools developed at the Institut Curie, including project and data management functions. It is built with a mix of technologies, making it hard to install on your own server for internal use. The recently published WaviCGH does provide a comprehensive analysis system with an integrated cytogenetic browser. Although it is open and free to use for all, to this day no source code is available, making it unsuitable for internal use in a diagnostic lab setting. ArrayCGHbase therefore differs from most tools existing today, as it will be released as open source and is built to be a customizable framework for data visualization, analysis and storage.

The new arrayCGHbase implementation contains some radical technological changes compared to its predecessor. The most significant is the move from MySQL to CouchDB as the database. MySQL has its merits for managing highly relational data and performing complex SQL queries, but it isn't optimized for the many large data sets produced in science today. It is not very concurrent, locks tables when modifying data, and requires a lot of RAM for its indexes to maintain performance as the database tables grow.

In contrast, CouchDB is built to be a flexible data store and has no notion of data relationships, imposing no restrictions on the kind of data it can store, as long as it can be expressed in JSON format. It has a small memory footprint as it keeps all of its data only on the hard disk. One issue when using CouchDB is that it has no built-in indexing and doesn't use SQL. This is solved by using couchdb-lucene as an external indexer that provides query functionality on par with MySQL. To speed up data access and build the custom designed tree data structure for experimental data, arrayCGHbase uses CouchDB views. Views use map and reduce functions to extract and aggregate information from a database. During development this requires a change of mindset when you're used to working with relational database systems, but once you make that change, it is a powerful way of efficiently processing larger data sets. Because of these advantages, NoSQL database systems like CouchDB are coming into use in genomics and other data intensive scientific disciplines. The CouchDB view files generated by the map and reduce functions are saved to disk and take up more disk space than the old MySQL database. In the CouchDB world this is not regarded as a disadvantage. It chooses to sacrifice disk space in return for faster data access, since extra hard drives are cheaper than adding processing power to a server.

The second important part of arrayCGHbase is the visualization. Different genome viewers exist that can be used to display copy number variations; the most popular are the NCBI Map Viewer, the Ensembl contig viewer, the UCSC Genome Browser, GBrowse and JBrowse. The NCBI Map Viewer cannot be installed on a private web server, making it unsuitable for use in this project. The UCSC Genome Browser and the Ensembl contig viewer are open source and could be installed on a private server. Neither browser was built for displaying arrayCGH data, and both would require extensive modifications to provide this functionality. They are hard to integrate, would be overkill for this project and would have a negative impact on the ease of use. NCBI, Ensembl and UCSC use static images and don't support dynamic panning and zooming. This leaves GBrowse, a genome browser based on BioPerl, but it is deprecated in favor of its successor JBrowse. JBrowse uses JavaScript and JSON data to do the graphics rendering in the browser, in contrast with the other genome browsers, which use static images generated on the server. While it could be used for displaying arrayCGH data, it uses a static data format that has to be generated beforehand on the server. Visualization of data dynamically loaded from a database is not possible.

For the above reasons, the choice was made not to use any of the existing tools. The tree data structure was built with the visualization in mind. This unique system, not found in alternative tools, allows the script to determine, based on the zoom level, which tree level corresponds best and which bins of that level fall into the interval shown on screen. When zoomed out it uses the aggregate data stored in the nodes; when zoomed in it uses the tree structure to efficiently fetch only the individual feature documents it needs to draw. As the user pans the chromosome browser, new nodes are fetched from the tree and converted to an SVG image. To our knowledge, arrayCGHbase is the only system using SVG images instead of static raster images. Pre-generating and caching thousands of images for each experiment is not feasible, as it would require huge amounts of disk space. SVG is an XML-based format, so the files can be generated on the fly by the web application without using special graphics libraries, like the GD library needed for raster images. This also means that the browser is responsible for rendering the SVG file as an image. A limitation is that SVG rendering is only supported in WebKit-based browsers (Safari, Google Chrome), Firefox and Opera.
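
Because SVG is plain XML, a tile can be assembled with ordinary string formatting. The sketch below draws, for each aggregated node, a vertical line from the minimum to the maximum ratio and a dot at the average. The function name, tile dimensions, colors and the vertical scale of ±2 are illustrative assumptions; the standard library XML parser is used only to check that the result is well-formed.

```python
import xml.etree.ElementTree as ET


def svg_tile(nodes, width=200, height=100, max_abs_ratio=2.0):
    """Render a non-empty list of nodes ('min', 'max', 'average') as SVG text."""
    def y(ratio):
        # Map +max_abs_ratio to the top edge and -max_abs_ratio to the bottom.
        return height / 2 - ratio / max_abs_ratio * (height / 2)

    step = width / len(nodes)
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">'
             % (width, height)]
    for i, node in enumerate(nodes):
        x = (i + 0.5) * step
        parts.append('<line x1="%.1f" y1="%.1f" x2="%.1f" y2="%.1f" stroke="black"/>'
                     % (x, y(node["min"]), x, y(node["max"])))
        parts.append('<circle cx="%.1f" cy="%.1f" r="1.5" fill="red"/>'
                     % (x, y(node["average"])))
    parts.append("</svg>")
    return "\n".join(parts)


tile = svg_tile([
    {"min": -0.4, "max": 0.3, "average": 0.0},
    {"min": -1.2, "max": -0.6, "average": -0.9},
])
root = ET.fromstring(tile)  # raises ParseError if the SVG is not well-formed
print(len(root))  # 4 child elements: one line and one circle per node
```

No graphics library is involved on the server; the browser does the actual rendering when it receives the tile.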

The new arrayCGHbase implementation proves that a non-traditional database like CouchDB and the new interactive JavaScript cytogenetic browser can be powerful tools for building a web application for arrayCGH data analysis and visualization. This is of course only a first implementation and, although the result is promising, not all functionality from the previous arrayCGHbase version is implemented. The result of this thesis can serve as a basis for future additions and improvements such as:

• Build more data import handlers, to allow the use of formats other than the Agilent Feature Extraction Software files.

• Improve the tree data structure by using a hybrid approach to save disk space: store aggregate data only in the nodes of the upper levels, and store the feature documents in the 10000 bp bins of level 0 instead of as individual database documents.

• Add export handlers besides the currently implemented tab-separated file export.

• Expand the authentication module to allow more granular control over security settings.

• Use APIs from NCBI, UCSC and other sources to make the View section more informative and allow the interface to link directly to these sites.

• Build a filter system for the genome browser to search for and automatically zoom to specific features.

• Add more analysis algorithms besides CBS.

• Build a report module that uses the analysis job/worker system and tightly integrates with the metadata documents.

• Add more tracks to the View section to help investigate the functional context.

• Add more tracks to the View section containing data from new CNV collections.

• Test the application with data from other organisms (e.g. mouse and rat).

• Build an interface for running very CPU-intensive analysis jobs with cloud services such as Amazon Elastic Compute Cloud and private Hadoop clusters (both systems are especially suited for massively parallel processing with map/reduce functions like the CouchDB views).

Chapter 6

Conclusion

Copy number variations play an important role in genetic diversity and are involved in various diseases. Since the development of arrayCGH, more and more CNVs have been found, and it has become clear that they are ubiquitous in both healthy and sick people. This makes it difficult to determine whether CNVs found in a diagnostic arrayCGH study are clinically significant or not. Careful analysis of potentially causative CNVs is essential, and this requires the right tools. In this project a web application was developed for the analysis of arrayCGH data, the successor of the existing arrayCGHbase application.

A web application was chosen to avoid the problems of installing and updating desktop applications. Recent developments in browser technology make it possible to build a user-friendly interface combined with a powerful server for data storage and analysis. Many separate arrayCGH data analysis tools already exist, but most of them target a specific sub-problem or are not available for installation on a private server for internal use. The new arrayCGHbase version tries to integrate the standard analysis workflow into one tool. The intention is that it can be used by researchers without bioinformatics training and without requiring experience with statistical analysis in R.

The new arrayCGHbase version differs strongly from its predecessor on a number of important points. It does not use a relational database but the schema-free database CouchDB. CouchDB is suited to storing large amounts of data and uses a query system based on MapReduce technology. Using a map and a reduce function, it generates a B-tree data structure when the data changes and stores it on disk, so that it does not have to be regenerated for every query. This enables fast concurrent access to the data, so that analyses running in the background on the server do not slow down other users who want to consult the data. The system accordingly implements a new analysis system. An analysis job can be stored in the database, after which it automatically ends up in a work list that the server processes while the user continues to use the web application. If desired, multi-processor systems can be used to full effect by running several analyses at the same time.
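The idea of a view index that is built from a map function and reused until the data changes can be sketched as follows. This is a toy model in Python under stated assumptions, not CouchDB's implementation: a sorted list stands in for the on-disk B-tree, the index is rebuilt in full rather than updated incrementally, and the class `CachedView`, the map function `by_position` and the document fields (`type`, `chrom`, `start`, `log2`) are hypothetical.

```python
import bisect


def by_position(doc):
    """Hypothetical map function: index probe documents by (chromosome, start)."""
    if doc.get("type") == "probe":
        yield (doc["chrom"], doc["start"]), doc["log2"]


class CachedView:
    """Toy view index: a sorted list standing in for a persisted B-tree."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.docs = {}
        self.update_seq = 0      # incremented on every write
        self.indexed_seq = None  # update_seq at which the index was last built
        self.rows = []           # (key, doc_id, value), sorted by key
        self.keys = []           # keys only, used for binary search
        self.rebuilds = 0        # counts how often the index was regenerated

    def save(self, doc_id, doc):
        self.docs[doc_id] = doc
        self.update_seq += 1

    def _refresh(self):
        # Reuse the stored index as long as no document has changed.
        if self.indexed_seq == self.update_seq:
            return
        rows = [
            (key, doc_id, value)
            for doc_id, doc in self.docs.items()
            for key, value in self.map_fn(doc)
        ]
        rows.sort(key=lambda row: (row[0], row[1]))
        self.rows = rows
        self.keys = [row[0] for row in rows]
        self.indexed_seq = self.update_seq
        self.rebuilds += 1

    def query(self, startkey, endkey):
        """Return all rows with startkey <= key <= endkey via binary search."""
        self._refresh()
        lo = bisect.bisect_left(self.keys, startkey)
        hi = bisect.bisect_right(self.keys, endkey)
        return self.rows[lo:hi]
```

Repeated calls to `query` between writes only perform the binary search; the map function runs again only after `save` has changed the data.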

For the visualization an interactive cytogenetic browser was built, inspired by existing genome browsers and by the functionality of map interfaces such as Google Maps. The JavaScript application lets the user zoom in on the copy number profile and explore it at the desired level of detail. In contrast to other genome browsers, the images are not static: they are generated dynamically on the server in SVG format and rendered by the browser itself. This new visualization system is easy to extend with contextual information such as gene information, known CNVs from databases, and segmental duplications, to help researchers analyze their arrayCGH data efficiently.
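As an illustration of server-side SVG generation for a zoomable profile, the sketch below renders probes inside a genomic window as an SVG string that a browser could draw directly. It is a simplified Python example under assumptions: the function name `profile_to_svg`, its parameters, and the pixel dimensions are hypothetical and do not describe the thesis code.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def profile_to_svg(probes, start, end, width=600, height=200, max_abs_ratio=2.0):
    """Render (position, log2 ratio) probes within [start, end] as an SVG string."""
    if end <= start:
        raise ValueError("end must be greater than start")
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(width), "height": str(height),
         "viewBox": f"0 0 {width} {height}"},
    )
    mid = height / 2
    # Horizontal baseline at log2 ratio 0.
    ET.SubElement(
        svg, f"{{{SVG_NS}}}line",
        {"x1": "0", "y1": f"{mid:.1f}", "x2": str(width), "y2": f"{mid:.1f}",
         "stroke": "grey"},
    )
    for position, log2_ratio in probes:
        if not start <= position <= end:
            continue  # probe lies outside the requested window
        x = (position - start) / (end - start) * width
        # Clip extreme ratios so every dot stays inside the drawing area.
        clipped = max(-max_abs_ratio, min(max_abs_ratio, log2_ratio))
        y = mid - clipped / max_abs_ratio * mid
        ET.SubElement(
            svg, f"{{{SVG_NS}}}circle",
            {"cx": f"{x:.1f}", "cy": f"{y:.1f}", "r": "2"},
        )
    return ET.tostring(svg, encoding="unicode")
```

Zooming then amounts to requesting the same profile with a narrower `start`/`end` window; only the probes inside that window are drawn, at a correspondingly higher horizontal resolution.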

Bibliography

Aerts J (2010). LocusTree (http://github.com/jandot/locustree)

Aitman T. J, Dong R, Vyse T. J, Norsworthy P. J, Johnson M. D, Smith J, Mangion J, Roberton-Lowe C, Marshall A. J, Petretto E, Hodges M. D, Bhangal G, Patel S. G, Sheehan-Rooney K, Duda M, Cook P. R, Evans D. J, Domin J, Flint J, Boyle J. J, Pusey C. D, Cook H. T (2006) Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature 439: 851–855

Butler M. G, Meaney F. J, Palmer C. G (1986) Clinical and cytogenetic survey of 39 individuals with Prader-Labhart-Willi syndrome. American journal of medical genetics 23: 793–809

Buysse K, Delle Chiaie B, Van Coster R, Loeys B, De Paepe A, Mortier G, Speleman F, Menten B (2009) Challenges for CNV interpretation in clinical molecular karyotyping: lessons learned from a 1001 sample experience. European journal of medical genetics 52: 398–403

Buysse K (2009). Molecular karyotyping: a powerful tool for the study of genomic defects in patients with mental retardation. PhD thesis, Ghent University

Cleveland W (1979) Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 74: 829–836

Conrad D. F, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews T. D, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm C. H, Kristiansson K, Macarthur D. G, Macdonald J. R, Onyiah I, Pang A. W. C, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter N. P, Lee C, Scherer S. W, Hurles M. E (2010) Origins and functional impact of copy number variation in the human genome. Nature 464: 704–712

Driscoll D. A, Spinner N. B, Budarf M. L, McDonald-McGinn D. M, Zackai E. H, Goldberg R. B, Shprintzen R. J, Saal H. M, Zonana J, Jones M. C (1992) Deletions and microdeletions of 22q11.2 in velo-cardio-facial syndrome. American journal of medical genetics 44: 261–8

Estivill X, Armengol L (2007) Copy number variants and common disorders: filling the gaps and exploring complexity in genome-wide association studies. PLoS genetics 3: 1787–99

Fanciulli M, Norsworthy P. J, Petretto E, Dong R, Harper L, Kamesh L, Heward J. M, Gough S. C. L, de Smith A, Blakemore A. I. F, Froguel P, Owen C. J, Pearce S. H. S, Teixeira L, Guillevin L, Graham D. S. C, Pusey C. D, Cook H. T, Vyse T. J, Aitman T. J (2007) FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nat Genet 39: 721–723

Fellermann K, Stange D. E, Schaeffeler E, Schmalzl H, Wehkamp J, Bevins C. L, Reinisch W, Teml A, Schwab M, Lichter P, Radlwimmer B, Stange E. F (2006) A chromosome 8 gene-cluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon. American journal of human genetics 79: 439–48

Feuk L, Carson A. R, Scherer S. W (2006) Structural variation in the human genome. Nature reviews. Genetics 7: 85–97

Frank B, Hemminki K, Meindl A, Wappenschmidt B, Sutter C, Kiechle M, Bugert P, Schmutzler R. K, Bartram C. R, Burwinkel B (2007) BRIP1 (BACH1) variants and familial breast cancer risk: a case-control study. BMC Cancer 7: 83

Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs R. J, Freedman B. I, Quinones M. P, Bamshad M. J, Murthy K. K, Rovin B. H, Bradley W, Clark R. A, Anderson S. A, O’Connell R. J, Agan B. K, Ahuja S. S, Bologna R, Sen L, Dolan M. J, Ahuja S. K (2005) The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science 307: 1434–1440

Hupé P, Stransky N, Thiery J.-P, Radvanyi F, Barillot E (2004) Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 20: 3413–3422

Iafrate A. J, Feuk L, Rivera M. N, Listewnik M. L, Donahoe P. K, Qi Y, Scherer S. W, Lee C (2004) Detection of large-scale variation in the human genome. Nat Genet 36: 949–951

IHGSC (2004) Finishing the euchromatic sequence of the human genome. Nature 431: 931–945

International Schizophrenia Consortium (2008) Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455: 237–241

Itsara A, Cooper G. M, Baker C, Girirajan S, Li J, Absher D, Krauss R. M, Myers R. M, Ridker P. M, Chasman D. I, Mefford H, Ying P, Nickerson D. A, Eichler E. E (2009) Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet 84: 148–161

Kallioniemi A, Kallioniemi O. P, Sudar D, Rutovitz D, Gray J. W, Waldman F, Pinkel D (1992) Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258: 818–821

Kent W. J, Sugnet C. W, Furey T. S, Roskin K. M, Pringle T. H, Zahler A. M, Haussler D (2002) The Human Genome Browser at UCSC. Genome Research 12: 996–1006

Koolen D. A, Pfundt R, de Leeuw N, Hehir-Kwa J. Y, Nillesen W. M, Neefs I, Scheltinga I, Sistermans E, Smeets D, Brunner H. G, van Kessel A. G, Veltman J. A, de Vries B. B. A (2009) Genomic microarrays in mental retardation: a practical workflow for diagnostic applications. Hum Mutat 30: 283–292

Lachman H. M, Pedrosa E, Petruolo O. A, Cockerham M, Papolos A, Novak T, Papolos D. F, Stopkova P (2007) Increase in GSK3beta gene copy number variation in bipolar disorder. Am J Med Genet B Neuropsychiatr Genet 144B: 259–265

Langer-Safer P. R (1982) Immunological Method for Mapping Genes on Drosophila Polytene Chromosomes. Proceedings of the National Academy of Sciences 79: 4381–4385

Lejeune J, Gautier M, Turpin R (1959) Les Chromosomes humains en culture de tissus. Comptes Rendus Seances Acad Sci 248: 602–603

Lupski J. R, Stankiewicz P (2005) Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS genetics 1: e49

Magenis R. E, Toth-Fejel S, Allen L. J, Black M, Brown M. G, Budden S, Cohen R, Friedman J. M, Kalousek D, Zonana J (1990) Comparison of the 15q deletions in Prader-Willi and Angelman syndromes: specific regions, extent of deletions, parental origin, and clinical consequences. American journal of medical genetics 35: 333–49

McCarroll S. A, Huett A, Kuballa P, Chilewski S. D, Landry A, Goyette P, Zody M. C, Hall J. L, Brant S. R, Cho J. H, Duerr R. H, Silverberg M. S, Taylor K. D, Rioux J. D, Altshuler D, Daly M. J, Xavier R. J (2008) Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn’s disease. Nat Genet 40: 1107–1112

McKinney C, Merriman M. E, Chapman P. T, Gow P. J, Harrison A. A, Highton J, Jones P. B. B, McLean L, O’Donnell J. L, Pokorny V, Spellerberg M, Stamp L. K, Willis J, Steer S, Merriman T. R (2008) Evidence for an influence of chemokine ligand 3-like 1 (CCL3L1) gene copy number on susceptibility to rheumatoid arthritis. Ann Rheum Dis 67: 409–413

Menten B, Pattyn F, De Preter K, Robbrecht P, Michels E, Buysse K, Mortier G, De Paepe A, van Vooren S, Vermeesch J, Moreau Y, De Moor B, Vermeulen S, Speleman F, Vandesompele J (2005) arrayCGHbase: an analysis platform for comparative genomic hybridization microarrays. BMC Bioinformatics 6: 124

Olshen A. B, Venkatraman E. S, Lucito R, Wigler M (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5: 557–572

Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: Parallel analysis with Sawzall. Scientific Programming 13: 277–298

Autism Genome Project Consortium (2007) Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nat Genet 39: 319–328

Risch N, Merikangas K (1996) The Future of Genetic Studies of Complex Human Diseases. Science 273: 1516–1517

Seabright M (1971) A rapid banding technique for human chromosomes. Lancet 2: 971–2

Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam T. C, Trask B, Patterson N, Zetterberg A, Wigler M (2004) Large-scale copy number polymorphism in the human genome. Science 305: 525–528

Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, Yamrom B, Yoon S, Krasnitz A, Kendall J, Leotta A, Pai D, Zhang R, Lee Y.-H, Hicks J, Spence S. J, Lee A. T, Puura K, Lehtimäki T, Ledbetter D, Gregersen P. K, Bregman J, Sutcliffe J. S, Jobanputra V, Chung W, Warburton D, King M.-C, Skuse D, Geschwind D. H, Gilliam T. C, Ye K, Wigler M (2007) Strong association of de novo copy number mutations with autism. Science (New York, N.Y.) 316: 445–9

Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Döhner H, Cremer T, Lichter P (1997) Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 20: 399–407

Tuzun E, Sharp A. J, Bailey J. A, Kaul R, Morrison V. A, Pertz L. M, Haugen E, Hayden H, Albertson D, Pinkel D, Olson M. V, Eichler E. E (2005) Fine-scale structural variation of the human genome. Nat Genet 37: 727–732

van de Wiel M. A, Kim K. I, Vosse S. J, van Wieringen W. N, Wilting S. M, Ylstra B (2007) CGHcall: calling aberrations for array CGH tumor profiles. Bioinformatics 23: 892–894

van de Wiel M. A, Brosens R, Eilers P. H. C, Kumps C, Meijer G. A, Menten B, Sistermans E, Speleman F, Timmerman M. E, Ylstra B (2009) Smoothing waves in array CGH tumor profiles. Bioinformatics (Oxford, England) 25: 1099–104

Van Prooijen-Knegt A, Van Hoek J, Bauman J, Van Duijn P, Wool I, Van der Ploeg M (1982) In situ hybridization of DNA sequences in human metaphase chromosomes visualized by an indirect fluorescent immunocytochemical procedure. Experimental Cell Research 141: 397–407

Weiss L. A, Shen Y, Korn J. M, Arking D. E, Miller D. T, Fossdal R, Saemundsen E, Stefansson H, Ferreira M. A. R, Green T, Platt O. S, Ruderfer D. M, Walsh C. A, Altshuler D, Chakravarti A, Tanzi R. E, Stefansson K, Santangelo S. L, Gusella J. F, Sklar P, Wu B.-L, Daly M. J, Autism Consortium (2008) Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med 358: 667–675

Wellcome Trust Case Control Consortium (2010) Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464: 713–720

Yang Y, Chung E. K, Zhou B, Lhotta K, Hebert L. A, Birmingham D. J, Rovin B. H, Yu C. Y (2004) The intricate role of complement component C4 in human systemic lupus erythematosus. Curr Dir Autoimmun 7: 98–132

Zhang J, Feuk L, Duggan G. E, Khaja R, Scherer S. W (2006) Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res 115: 205–214

Addendum: ArrayCGH protocol
