CLONALITY INFERENCE IN MULTIPLE TUMOR SAMPLES USING...


  • CLONALITY INFERENCE IN MULTIPLE TUMOR

    SAMPLES USING PHYLOGENY

    by

    Salem Malikić

    B.Sc., University of Sarajevo, Bosnia and Herzegovina, 2011

    a Thesis submitted in partial fulfillment

    of the requirements for the degree of

    Master of Science

    in the

    School of Computing Science

    Faculty of Applied Sciences

    © Salem Malikić 2014

    SIMON FRASER UNIVERSITY

    Summer 2014

    All rights reserved.

    However, in accordance with the Copyright Act of Canada, this work may be

    reproduced without authorization under the conditions for “Fair Dealing.”

    Therefore, limited reproduction of this work for the purposes of private study,

    research, criticism, review and news reporting is likely to be in accordance

    with the law, particularly if cited appropriately.

  • APPROVAL

    Name: Salem Malikić

    Degree: Master of Science

    Title of Thesis: Clonality Inference in Multiple Tumor Samples using Phylogeny

    Examining Committee: Dr. Andrei Bulatov,

    Professor, Chair

    Dr. Süleyman Cenk Şahinalp,

    Professor, Senior Supervisor

    Dr. Jian Pei,

    Professor, Supervisor

    Dr. Ayşe Funda Ergün,

    Professor, Internal Examiner

    Date Approved: August 22nd, 2014


  • Partial Copyright Licence


  • Abstract

    Intra-tumor heterogeneity presents itself through the evolution of subclones during cancer

    progression. While recent research suggests that this clonal diversity is a key factor in

    therapeutic failure, the determination of subclonal architecture of human tumors remains a

    challenge. To address the problem of accurately determining subclonal frequencies in tumors

    as well as their evolutionary history, we have developed a novel combinatorial method named

    CITUP (Clonality Inference in Tumors Using Phylogeny). An important feature of CITUP

    is its ability to exploit data from multiple time-point and/or regional samples from a single

    patient in order to improve estimates of mutational profiles and subclonal frequencies. Using

    extensive simulations and real datasets comprising tumor samples from two leukemia

    drug-response studies, we show that CITUP can infer the evolutionary trajectory of human

    tumors with high accuracy.

    keywords: Cancer progression, intra-tumor heterogeneity, combinatorial methods


  • To my beloved parents Faiz and Sadeta,

    and my dear sister Faiza


  • You can’t connect the dots looking forward; you can only connect them looking backwards.

    So you have to trust that the dots will somehow connect in your future.

    You have to trust in something: your gut, destiny, life, karma, whatever.

    This approach has never let me down, and it has made all the difference in my life.

    — Steve Jobs


  • Acknowledgements

    First and foremost, I would like to thank my supervisor Dr. S. Cenk Sahinalp for his

    extensive guidance, support and patience during my studies. I especially thank him for

    the endless effort he put into training me in the scientific field. I am also very thankful to

    Andrew McPherson and Dr. Nilgün Donmez for their immense contribution to this work,

    their valuable advice and help with writing the thesis. I would like to acknowledge the

    insightful feedback I have received from my colleagues from the Lab for Computational

    Biology at Simon Fraser University.

    Also, I am indebted to Nermin Suljić, Ali Lafçioğlu, Dr. Hasan Jamak and Dino Oglić

    for helping me develop my passion and enthusiasm towards the fields of Mathematics and

    Science. Furthermore, I thank all of the people from Bosna Sema Educational Institutions for

    providing an excellent environment and support during my high school and undergraduate

    studies.

    I am very grateful to my dear aunt Faiza and uncle Dževad together with their family

    and my beloved partner Fatima for providing moral support. Last but not least, I would

    like to express my deep gratitude to my parents and my sister for their unconditional love

    and support.


  • Contents

    Approval
    Partial Copyright License
    Abstract
    Dedication
    Quotation
    Acknowledgements
    Contents
    List of Tables
    List of Figures

    1 Introduction

    2 Background
    2.1 DNA and mutations
    2.2 Cancer onset and evolution
    2.3 Methods for analysing and sequencing DNA and their applications in tumor heterogeneity studies
    2.4 Identifying mutations and their frequencies from HTS data

    3 Model assumptions and problem description
    3.1 Phylogenetic tree model
    3.2 Input data
    3.3 Model assumptions
    3.4 Problem description

    4 Methods
    4.1 Combinatorial Formulation
    4.2 Method Outline
    4.3 Quadratic Integer Programming (QIP) method
    4.4 QIP optimizations
    4.5 Heuristic Iterative Method
    4.6 Enumerating rooted trees
    4.7 Model selection

    5 Results
    5.1 Datasets
    5.2 Evaluation criteria
    5.3 Evaluation on simulated datasets
    5.4 Comparison with Rec-BTP
    5.5 Results on Chronic Lymphocytic Leukemia datasets
    5.6 Results on Acute Myeloid Leukemia datasets
    5.7 Computing environment and running parameters

    6 Conclusions and future work

    Bibliography

  • List of Tables

    3.1 Simple input to the CITUP algorithm consisting of 2 samples. In total, 10 somatic mutations have been identified. Their frequencies are estimated from the alignment of sequencing data, and most of them deviate from the true values due to the presence of noise. A mutation might be undetected in, or completely absent from, some, but not all, samples.

    5.1 Summary of CITUP's results on the CLL dataset. The second column refers to the number of mutations as reported by [17]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

    5.2 Summary of CITUP's results on the AML dataset. The second column refers to the total number of indel and single nucleotide mutations as reported by [3]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

  • List of Figures

    2.1 Simple evolutionary tree showing emergence of different clonal subpopulations as a consequence of mutations in cells' DNA.

    2.2 An example of heterogeneous tumor tissue consisting of several different clonal subpopulations of cells used as the input for DNA sequencing.

    2.3 Frequencies of mutations present in the sample shown in Figure 2.2.

    3.1 Arbitrary rooted tree representation of the binary tree given in Figure 2.1.

    3.2 One of the possible interpretations of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 1. For each node except the root, the first number in the node label represents the proportion of cells harbouring mutations that occurred along the edge connecting that node with its parent. For root nodes this number is always 1. The number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

    3.3 One of the possible interpretations of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 2. For each node except the root, the first number in the node label represents the proportion of cells harbouring mutations that occurred along the edge connecting that node with its parent. For root nodes this number is always 1. The number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

    5.1 Simulation results for TrAp, PhyloSub and CITUP (QIP and iterative procedures) under the four evaluation criteria. The rows depict measures M0 to M3. The first column investigates the effect of the number of subclones/nodes in the dataset, the second investigates the effect of the number of samples, the third investigates the effect of noise added to the mutation frequencies and the fourth investigates the effect of non-uniformity among subclone frequencies. The figure is drawn using the boxplot function in Python's matplotlib library: the line within each box is the mean and the box boundaries mark the 25% and 75% values. The extreme outliers are depicted with + symbols. Note that we were unable to run PhyloSub on 7 samples, so the corresponding bars are absent from this column.

    5.2 Sensitivity analysis of CITUP iter with respect to starting points. Top: Distribution of errors in the objective for different restarts of the algorithm, where error is defined as the difference between the local minimum objective value reached by CITUP iter and the global minimum reached by CITUP qip. Bottom: The proportion of iterative restarts that reach the global minimum within 10^-9.

    5.3 CITUP predictions for patient CLL003. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.

    5.4 CITUP predictions for patient CLL006. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.

    5.5 CITUP predictions for patient CLL077. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.

    5.6 Tumor purities predicted by [3] and CITUP in primary and relapse samples of AML patients. For the three patients with multiple reported solutions, UPN758168 had the same root frequencies in both solutions. For UPN452198 and UPN573988, we pick the frequencies closest to the ones given in [3].

    5.7 CITUP predictions for patient UPN869586. Left: The estimated subclonal proportions for tumor (inner) and relapse (outer) samples. Right: The predicted evolutionary tree and the coding mutations assigned to each subclone. The numbers in parentheses give the total number of (i.e. coding and non-coding) mutations for each subclone.

  • Chapter 1

    Introduction

    Most human tumors exhibit a large degree of heterogeneity. This heterogeneity is not only

    apparent in histology but also presents itself in various features such as gene expression

    changes, genomic copy number alterations and structural rearrangements as well as other

    aberrations. While the origins of the intra-tumor heterogeneity are still debated, research

    suggests that this diversity is likely to have clinical implications. For instance, Merlo et

    al. [11] have reported a correlation between clonal diversity and progression to esophageal

    adenocarcinoma in Barrett’s esophagus.

    The implications of tumor heterogeneity are not limited to diagnostics. It has been

    suggested that clonal diversity may also be linked to metastatic potential and drug response.

    Looking at biopsies from pancreas and prostate adenocarcinomas, Ruiz et al. [15] found that

    metastatic tumors were derived from certain clonal populations. In colorectal cancer, Kreso

    et al. [7] reported that clonal diversity affects chemotherapy tolerance. By tracking 150

    lentivirus-marked lineages from 10 human colorectal cancers, they found that previously

    minor or dormant clones were promoted by chemotherapy, thus reducing the effectiveness

    of the treatment.

    Although a multi-clonal nature is common to most tumor samples, determining the clonal subpopulations is a challenging process. This problem could potentially be alleviated by single-cell sequencing; however, the current cost of these methods is prohibitive at the scales that would be necessary to representatively sample a tumor tissue. Methods such as Fluorescence in Situ Hybridization (FISH) or Silver in Situ Hybridization (SISH) can also assess a small number of probes in individual cells of a tumor sample. However, these methods are quite limited in scope and cannot offer the same genome-wide perspective as high-throughput sequencing methods.

    In silico separation of the clonal subpopulations may also provide a viable alternative to these methods. Despite the importance of clonal diversity and its clinical implications, however, relatively few computational methods have been developed to date. In a pioneering paper, Schwartz and Shackney [18] developed an unmixing method based on a geometric model to distinguish a small number of cancer subtypes in gene expression data. After determining cell types and their relative frequencies in different tumor samples, this method infers a phylogenetic tree that best fits the cell types identified. An alternative method, named TrAp [19], generates possible phylogenetic trees following certain parsimony and sparsity conditions using a greedy approach. More recently, PhyloSub [6], which is based on Bayesian inference, was developed. This method relies on the well-known Markov chain Monte Carlo (MCMC) sampling paradigm to infer a distribution over all possible phylogenies. Another statistical method, named PyClone [14] and also based on MCMC sampling, leverages copy number genotypes to estimate subclonal frequencies more reliably. Unlike PhyloSub, however, this method does not infer phylogenies.

    In this work we present a combinatorial algorithm, named CITUP (Clonality Inference in Tumors Using Phylogeny), that can exploit data obtained from multiple loci from a single patient to infer the tumor phylogeny more accurately. Our framework also involves generating all possible phylogenetic trees; however, unlike the previous approaches mentioned above, CITUP has the ability to find optimal solutions based on an exact Quadratic Integer Programming formulation.

    Another tree-based method named Rec-BTP [5], developed concurrently with ours, is closely related to our framework. In this approach, mutations are subjected to a binary-tree partition, where a binary tree with the least number of conflicting triplets is sought using an approximation algorithm. In contrast to our framework, this method cannot handle multiple samples.

    Our work is also related to the studies of [16] and [13]; however, their goals are different. While THetA [13] aims to predict subclonal populations and their proportions given a sample from high-throughput sequencing data, it does not aim to infer any phylogenetic relationship between the subclones. Although the method proposed by Salari et al. [16] infers tumor phylogenies from multiple samples like CITUP, the goal of that study is to improve somatic SNV calls. Moreover, their model places the samples as leaves in a single phylogenetic tree (thus leaves do not represent clonal subpopulations but rather samples, each of which is a mixture of subclones), whereas our model assumes a shared tree between samples with different clonal frequencies.

  • Chapter 2

    Background

    2.1 DNA and mutations

    The human body is formed from between 50 and 100 trillion cells. Each cell contains all of the organism's genetic instructions stored as deoxyribonucleic acid (DNA), which is made up of molecules called nucleotides. There are four types of these molecules: adenine (A), cytosine (C), thymine (T) and guanine (G), and their order defines a particular DNA sequence and its function. For the sake of illustration, we can consider nucleotides as the alphabet of some language. Just as the letters of an alphabet combine to form the words of a language, nucleotides form genes that contain instructions for the different tasks that a cell performs. Inside the cell, DNA is packed into structures called chromosomes. Under normal conditions, chromosomes in human cells come in matching pairs, with one chromosome of each pair inherited from each parent. As a consequence, the human genome is diploid and genomic loci also come in pairs, where the term genome is used for an organism's complete set of DNA, including all of its genes. The term genotype, on the other hand, describes the nucleotides present at a specific location in the two copies. These two loci do not have to be fully identical, and each of the alternative forms of the same genetic locus is referred to as an allele.

    Changing, deleting or altering the position of even a single letter in a word can completely change its meaning. For example, in English, if we change the first letter of the word "keep" into 'd' we get the word "deep", which has a completely different meaning, and changing 'k' into 'e' results in "eeep", which has no meaning at all. Similarly, changes in the DNA sequence of a cell, usually referred to as mutations or aberrations, can change a DNA segment coding for some genes, which might result in improperly functioning genes or in a complete loss of gene function. Many different types of mutations have been reported. Single nucleotide mutations, where a single nucleotide is exchanged for another, are the most prevalent type. Copy number aberrations occur when part of the genome gets deleted or amplified, and an indel (a blend of "insertion" and "deletion") is the insertion or deletion of nucleotides in a DNA sequence.

    During the lifetime of an individual, many cells undergo programmed cell death (apoptosis) and get replaced by new cells in a process called somatic cell division. In this process one cell, called the mother cell, is replaced by two daughter cells. During replication, the DNA of the mother cell is copied into each of the daughter cells and, in a perfect division, each daughter cell's DNA sequence is an exact copy of the mother cell's DNA sequence. Occasionally, mutations occur during somatic cell division, resulting in one or both daughter cells having a DNA sequence different from that of the mother cell. Somatic division is not the only process whereby mutations can be acquired. Various external factors, such as radiation, tobacco consumption, ultraviolet light and many others, can also have a deleterious effect on a cell and cause severe damage to its DNA.

    All of the mutations acquired in non-germ cells during the lifetime of an individual are commonly referred to as somatic mutations. Germline or hereditary mutations comprise another major class of mutations. Mutations from this class occur in germ cells and can later be passed on to the individual's progeny. If inherited, such a mutation is typically present in all cells of the human body and does not give valuable information in heterogeneity studies. For these reasons, in this work we focus only on somatic mutations.

    More than 98% of human DNA is noncoding, and most noncoding DNA does not have any known biological function, so somatic mutations can be divided into coding and non-coding mutations. The latter are usually non-deleterious and do not have a serious impact on cell functioning. Coding somatic mutations, on the other hand, might be fatal for proper cell functioning and result in various diseases, with cancer being one of the most frequent among them. For these reasons, some studies use only coding mutations.

    Through somatic cell division, a mutation acquired in some cell is passed on to all of its descendants unless it gets reverted, which is very unlikely considering that human DNA consists of several billion nucleotides. In addition to the mutations inherited from the mother cell, daughter cells might acquire additional mutations. This can result in the emergence of several clonal subpopulations of cells, each one uniquely identifiable by the set of somatic mutations it has acquired, as shown in Figure 2.1. This figure shows a simple evolutionary tree, also called a phylogenetic tree, where, starting from a normal, healthy cell, six different subpopulations of mutated cells have emerged. Note that some clonal subpopulations can die out or get completely replaced by more differentiated subpopulations descending from them. One such clonal subpopulation of cells, harbouring only the red-colored mutation, is shown in Figure 2.1.

    Figure 2.1: Simple evolutionary tree showing emergence of different clonal subpopulations as a consequence of mutations in cells' DNA.
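
    The inheritance rule described above can be illustrated with a small Python sketch. It assumes, as the text does, that mutations are never reverted, so the genotype of a subpopulation is the union of the mutations acquired along the path from the normal cell to that subpopulation. The tree, the node names and the mutation labels below are hypothetical and are not taken from Figure 2.1.

```python
from typing import Dict, Optional, Set


def genotypes(parent: Dict[str, Optional[str]],
              new_mutations: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return the full set of mutations harboured by every node of a rooted tree.

    parent maps each node to its parent (None for the root, i.e. the normal cell).
    new_mutations maps a node to the mutations acquired on the edge leading to it.
    Assumption: mutations are inherited by all descendants and never reverted.
    """
    result: Dict[str, Set[str]] = {}

    def resolve(node: str) -> Set[str]:
        if node not in result:
            p = parent[node]
            inherited = resolve(p) if p is not None else set()
            result[node] = inherited | new_mutations.get(node, set())
        return result[node]

    for node in parent:
        resolve(node)
    return result


# Hypothetical toy tree: normal -> A -> {B, C}
parent = {"normal": None, "A": "normal", "B": "A", "C": "A"}
new_mutations = {"A": {"m1"}, "B": {"m2"}, "C": {"m3", "m4"}}
clone_genotypes = genotypes(parent, new_mutations)
```

    In this toy tree every subpopulation has a distinct mutation set, which is what makes the subpopulations uniquely identifiable.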

    2.2 Cancer onset and evolution

    Cancer is a disorder in which some of the body's cells begin to grow uncontrollably to form a mass of cells called a tumor. The term covers more than a hundred diseases that share two main characteristics: uncontrolled cell growth and the ability of these cells to invade other tissues. The growth of a tumor can be thought of as an evolutionary process. Malignant (life-threatening) tumors usually contain many mutations, which do not happen all at once.

    There are two different models explaining evolutionary processes in tumors: the clonal evolution model and the cancer stem cell model. A detailed explanation of these two models can be found in [4]. In this study we adopt the clonal evolution model of tumor progression, first introduced in [12]. According to this model, a tumor begins with a cell that has sustained a single mutation that offers it a growth advantage over its neighbors. This advantage might be manifested in many different ways: for instance, the cell can have a higher rate of somatic division compared to normal cells, or genetic mechanisms might be damaged in a way that delays programmed cell death (apoptosis). For example, a mutation that inactivates a pro-apoptotic gene might delay the cell's death, giving it a longer lifespan. Somatic divisions of this mutated cell and its descendants lead to the formation of a clonal subpopulation of cells harbouring the advantageous mutation. After some time, one cell in this subpopulation may sustain another such mutation, which can result in the emergence of a new clone, possibly with higher proliferative power. Over the course of time several clonal subpopulations might emerge, resulting in a highly heterogeneous tissue at the time of clinical diagnosis. Figure 2.2 illustrates a tumor tissue consisting of several distinct clonal subpopulations. For simplicity, we assume that they are related by the evolutionary tree given in Figure 2.1. We note that at the time of clinical diagnosis, the only observable clonal subpopulations are the ones corresponding to leaf nodes in the evolutionary tree. Therefore, none of the cells in Figure 2.2 has a genotype harbouring only the red-colored mutation.

    The first tumor that develops in the body is called the primary tumor. If it is not detected and removed at an early stage of the disease, tumor cells usually migrate through blood or lymph and start growing in distant organs. This new tumor is known as a metastatic tumor. It is important to mention that a metastatic tumor, although growing in a different part of the body, still has the characteristics of the primary tumor, and the same tumor evolutionary tree can be used to explain the primary tumor and all of its metastatic outgrowths. The same applies to tumor samples from early and later stages of disease progression, since clonal subpopulations at later stages are either identical to, or evolved from, clonal subpopulations from early stages through the acquisition of additional mutations between the time points when the samples were obtained.

    Figure 2.2: An example of heterogeneous tumor tissue consisting of several different clonal subpopulations of cells used as the input for DNA sequencing.

    2.3 Methods for analysing and sequencing DNA and their

    applications in tumor heterogeneity studies

    Several methods for DNA analysis and sequencing (determining the order of the nucleotides A, C, G, T) have been invented to date. Although there are many different ways to classify them, in cancer heterogeneity studies they are usually classified into two categories, based on the number of cells that are used as the input.

    The first category comprises methods that use a single cell as the input for DNA analysis or sequencing. Fluorescence in Situ Hybridization (FISH) and Spectral Karyotyping (SKY) have been used for decades to study the DNA of single tumor cells and have shown great variability in DNA content among them. These methods are quite limited in scope and can only assess a small number of probes from single cells of a tumor sample. Ideally, whole-genome sequencing of a sufficient number of single cells could be used to identify tumor heterogeneity. Single cell sequencing (SCS) and single nucleus sequencing (SNS) are two other promising approaches for revealing intra-tumor heterogeneity. However, both of them have several limitations. Because they require an amount of DNA far exceeding the amount present in a single cell, the DNA must be amplified prior to sequencing. This amplification usually introduces biases, where some regions of DNA are not amplified to the desired extent whereas other regions are over-amplified. As a consequence, the number of reads (small subsequences of DNA generated as the output of sequencing) covering under-amplified regions is usually low, and the effect of noise is very high in the downstream analysis of reads covering these regions. Some of these regions are not covered by any reads, hence some mutations remain undetected. This bias also makes it difficult to detect copy number aberrations (aberrations where some part of the genome gets deleted or amplified), since it is hard to tell whether a DNA segment covered by a large number of reads is amplified in the sequenced cell or whether the read count is just a consequence of over-amplification during the DNA amplification step. Also, in order to get a statistically representative sample of the underlying tumor, which typically consists of millions of cells, a large number of cells from many different sections of the tumor must be isolated and sequenced. Due to these limitations and the prohibitive cost of single cell sequencing of a large number of cells, these methods are still mainly used for academic purposes in analyzing intra-tumor heterogeneity.

    The second major category of DNA analysis and sequencing methods uses a bulk of tumor cells as the input to obtain short reads of DNA. High Throughput Sequencing (HTS) methods are currently the most widely used for this task due to their ability to generate a large number of short reads from multiple tumor cells at low cost and with good accuracy (the percentage of correctly identified nucleotides among all reads), which varies among platforms but is typically above 99.9%. Since multiple cells are used as the input, only an average signal of the DNA content of the underlying cells is obtained as the output. Therefore, some post-processing is required in order to obtain the number, evolutionary history, genotypes and proportions of the clonal subpopulations present in the sequenced tumor.
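
    As an illustration of why bulk sequencing yields only an average signal, the following Python sketch computes, for a hypothetical mixture, the fraction of cells carrying each mutation as the sum of the proportions of the clonal subpopulations whose genotype contains it. The clone names, genotypes and proportions are made-up values, and the function name is illustrative; this is a noise-free toy model, not the inference method developed in this thesis.

```python
from typing import Dict, Set


def mutation_frequencies(genotype_by_clone: Dict[str, Set[str]],
                         proportion_by_clone: Dict[str, float]) -> Dict[str, float]:
    """Fraction of cells in a mixed sample that harbour each mutation.

    Assumption: proportions are fractions of cells and sum to 1.
    """
    total = sum(proportion_by_clone.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("clone proportions must sum to 1")

    freqs: Dict[str, float] = {}
    for clone, mutations in genotype_by_clone.items():
        for mutation in mutations:
            # Every cell of this clone carries the mutation.
            freqs[mutation] = freqs.get(mutation, 0.0) + proportion_by_clone[clone]
    return freqs


# Hypothetical sample: 20% normal cells, 30% clone A, 50% clone B (a descendant of A).
sample_genotypes = {"normal": set(), "A": {"m1"}, "B": {"m1", "m2"}}
sample_proportions = {"normal": 0.2, "A": 0.3, "B": 0.5}
sample_freqs = mutation_frequencies(sample_genotypes, sample_proportions)
```

    Here m1 is carried by 80% of the cells and m2 by 50%. The post-processing mentioned above has to solve the inverse problem: recovering the clones and their proportions from such aggregate values.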

    Sequencing coverage of DNA segment is defined as the average number of reads covering

    that particular segment. Sequencing of whole DNA usually results in coverage that is low

    in order to accurately identify frequencies of somatic mutations. High degree of confidence

    in measuring frequencies of single nucleotide mutations and small indels can be achieved

    using targeted deep sequencing. In brief, after the mutation is detected at some DNA locus,

    the region encompassing this locus is PCR amplified from a bulk tumor sample, and then

    sequenced to high depth (>1000× coverage) using HTS. Technological advances now allow

    many variants to be amplified and sequenced in parallel, speeding up the sequencing process.

    2.4 Identifying mutations and their frequencies from HTS data

    In addition to cells from tumor mass, sequencing of some healthy tissue is also performed

    in order to obtain genome of normal cells. All of the short reads obtained from tumor cells

    are then compared against the normal genome and somatic mutations are identified.

    For each mutation, we define its frequency as the proportion of cells from tumor sample

    harbouring that mutation. For somatic mutations from diploid loci of genome, it is very

    unlikely that both of the alleles are mutated, so we assume that all of such mutations are

    heterozygous, i.e. only one allele is mutated. As calculating frequencies of single nucleotide mutations

    from regions that have been affected by copy number aberrations is a complicated task, in this

    study we only focus on heterozygous somatic mutations outside of copy number aberrated

    regions. Their frequency is calculated as $2 \cdot q_{var} / (q_{var} + q_{ref})$, where $q_{var}$ is the number of

    reads with the variant allele and $q_{ref}$ is the number of reads with the reference allele. Allele

    specific copy number measurements, obtained using sequencing or arrays, can be used to

    exclude mutations from genomic regions that are not diploid heterozygous throughout the

    population of tumor cells.
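
    As a minimal sketch of the frequency formula above, assuming variant and reference read counts are already available for a locus, the calculation could look as follows. The function name `mutation_frequency` and the clipping of estimates above 1 are illustrative assumptions, not part of the pipeline described in the text.

```python
def mutation_frequency(q_var, q_ref):
    """Estimate the fraction of cells carrying a heterozygous diploid mutation.

    q_var: number of reads supporting the variant allele
    q_ref: number of reads supporting the reference allele
    """
    depth = q_var + q_ref
    if depth == 0:
        raise ValueError("locus has no coverage")
    # A mutated cell carries one variant and one reference allele, so the
    # variant allele fraction is half of the cell fraction.
    freq = 2.0 * q_var / depth
    # Sampling noise can push the estimate above 1; clipping here is an
    # illustrative choice (assumption), not something the text prescribes.
    return min(freq, 1.0)


# 160 variant reads out of 1000 total reads
print(mutation_frequency(160, 840))
```

    With 160 variant reads at a depth of 1000, the variant allele fraction is 0.16, which corresponds to an estimated 32% of cells harbouring the mutation.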

    In conclusion, using HTS methods of DNA sequencing we can obtain a set of mutations

    present in sequenced tissue together with proportions of cells harbouring each mutation. For

    the purpose of illustration, in Figure 2.3 a simple example of output is given, where we list

    the frequencies of mutations present in hypothetical sample given in Figure 2.2, assuming

    no noise is present in estimating frequency values (note that this is not a valid assumption

    for frequencies obtained from real sequencing data where estimates are usually affected by

    noise present as a consequence of errors and biases occurring during DNA sequencing step).

    Figure 2.3: Frequencies of mutations present in sample shown in Figure 2.2.

  • Chapter 3

    Model assumptions and problem

    description

    3.1 Phylogenetic tree model

    As we have already explained in previous chapter, every successful somatic cell division

    transforms a single cell into two daughter cells. The genomes of the daughter cells are

    copies of the original cell’s genome, with the addition of mutations that occurred during

    replication. Furthermore, all somatic cells originated from a single germline cell. Thus a

    natural representation of somatic cellular evolution is a rooted full binary tree (see Rec-BTP

    [5] and Figure 2.1). In such a model only the leaves of the tree are observable; internal nodes represent

    unobservable ancestral cells.

    For a full binary tree model, not all edges would be identifiable by mutations, either

    because no mutations occurred from one cell division to the next, or because distinguishing

    mutations were not detected. Thus, a more concise model [6] uses arbitrary rooted trees,

    implicitly collapsing unidentifiable edges. Collapsing internal edges has the effect of allowing

    nodes to have an arbitrary number of children. Furthermore, collapsing leaf edges implies

    that internal nodes are observable. Figure 3.1 shows the arbitrary rooted tree representation of

    the binary tree given in Figure 2.1.

    Using arbitrary rooted trees has two major implications for CITUP. First, it limits

    the number of fundamentally equivalent solutions produced by CITUP, allowing for easier

    interpretability of the results. Second, since arbitrary rooted trees are more concise, CITUP

    Figure 3.1: Arbitrary rooted tree representation of binary tree given in Figure 2.1.

    can consider fewer trees while still maintaining the same accuracy.

    3.2 Input data

    As a consequence of evolutionary processes and genomic instability present in most of the

    tumors, chemotherapy or other types of treatment, epigenetic and many other factors, pro-

    portions of cells harbouring specific genotype usually differ among different timepoints of

    disease progression or different anatomical sites. Sequencing a set of tumor samples obtained

    from different timepoints of disease progression or different anatomical sites, or both, can

    give us valuable information that can be exploited in solving problems defined in the follow-

    ing sections of this chapter. Denote by S the set of all samples that have been sequenced

    and M the set of heterozygous somatic mutations identified in at least one of the samples

    from S. For the reasons already discussed in previous chapter, we only consider mutations

    from diploid regions of genome.

    Hence, the only input to our algorithm can be given as an $|M| \times |S|$ matrix $F$, where $F_{ij}$

    denotes the frequency of mutation i in sample j. A simple example of input data, where

    $|M| = 10$ and $|S| = 2$, is given in Table 3.1.

                       Sample 1    Sample 2

    Mutation 1           0.32        0.04
    Mutation 2           0.23        0.24
    Mutation 3           0.80        1.00
    Mutation 4           0.06        0.55
    Mutation 5           0.19        0.28
    Mutation 6           0.30        0.00
    Mutation 7           0.20        0.67
    Mutation 8           0.77        0.95
    Mutation 9           0.19        0.66
    Mutation 10          0.20        0.24

    Table 3.1: Simple input to the CITUP algorithm consisting of 2 samples. In total, 10 somatic mutations have been identified. Their frequencies are estimated from alignment of sequencing data and most of them deviate from true values due to the presence of noise. A mutation might be undetected or completely absent from some, but not all, samples.

    3.3 Model assumptions

    Similar to [6], in this work we make the infinite sites assumption about tumor evolution.

    Somatic mutations are gained at most once per individual and cannot be lost via a subse-

    quent reversion mutation. Additionally, we assume the tumor exhibits minimal aneuploidy,

    thus mutations cannot be lost by deletion of the encompassing chromosomal region.

    Assuming mutations cannot be lost or reverted, a mutation gained in a tumor cell will

    be present in all of the descendants of that tumor cell. Trivially, a mutation that occurred

    in the single common ancestor of a tumor will be present in 100% of the tumor cells. A

    mutation that occurred in a specific lineage of the tumor phylogeny will be present in a

    smaller proportion, providing all other lineages have not died out.

    Based on the arguments mentioned in Section 2.2, we also impose the same phylogenetic

    tree on all samples.

    3.4 Problem description

    Three common problems arise with the interpretation of input data:

    • determination of number and genotypes of major clonal subpopulations of tumor cells;

    This problem consists of inferring the number of different clonal subpopulations present

    in the sequenced tumor and identifying the set of mutations present in each subpop-

    ulation. Note that clonal subpopulation can be uniquely identified by this set.

    • inference of phylogeny relating clonal subpopulations;

    This problem consists of identifying tumor evolutionary history tree that best explains

    the given input data. This tree is also known as tumor phylogenetic tree. Each mu-

    tation has to be placed along one and only one edge of the tree and this placement

    corresponds to its first appearance in tumor evolutionary history. At least one mu-

    tation has to be placed along each edge of the tree, otherwise we have unidentifiable

    edge that would be collapsed in our arbitrary rooted tree model. Also in this model,

    the normal cells can be represented by the root node. Each node corresponds to one

    and only one clonal subpopulation that is uniquely identified by the set of mutations

    assigned to the root node combined with the mutations appearing along the edges

    that form the path from root to that node.

    • estimation of proportion of each subpopulation over all samples;

    This problem consists of assigning a real number $\alpha_{is} \in [0, 1]$ to each subpopulation

    i for each sample s. This number represents the proportion of tumor cells in sample s

    harbouring the genotype of clonal subpopulation i. As there is a one-to-one correspondence

    between nodes of the tumor phylogenetic tree and subclonal populations, this is equivalent

    to assigning the number $\alpha_{is}$ to the node corresponding to subpopulation i in sample

    s. Although all samples share common evolutionary tree, hence mutation placement

    is shared among them, the frequencies assigned to nodes of the tree change among

    samples.

    In the next chapter we give details of our novel algorithmic approaches for solving the problems

    introduced in this section. Figures 3.2 and 3.3 show one of the possible interpretations

    of example data given in Table 3.1.

    Figure 3.2: One of the possible interpretations of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 1. For each node, except the root, the first number in the node label represents the proportion of cells harbouring mutations that occurred along the edge connecting that node with its parent. For the root node this number is always 1. The number inside brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

    Figure 3.3: One of the possible interpretations of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 2. For each node, except the root, the first number in the node label represents the proportion of cells harbouring mutations that occurred along the edge connecting that node with its parent. For the root node this number is always 1. The number inside brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

  • Chapter 4

    Methods

    4.1 Combinatorial Formulation

    Let $\mathcal{T}$ represent the space of all rooted trees and let $T \in \mathcal{T}$ be a hypothetical phylogenetic tree

    relating $N = |V(T)|$ genetically distinct subpopulations. Let $D(v)$ be the set of descendants

    of node v. As already explained in previous chapters, in our formulation, genotypes are

    represented with nodes (also referred to as clonal subpopulations or subclones in the text)

    while subtrees rooted at a specific node are named clones. A mutation occurring at a node

    in the tree is inherited by its descendants. Thus an assignment of the set of mutations to

    their node of origin is sufficient to describe the genotypes of all nodes.

    Define the clone proportion βvs as the proportion of the clone rooted at v in sample

    s. Similarly, define the subclonal proportion αvs as the proportion of genotype v in sample

    s. Subclonal proportions add up to 1 in each sample (equation 4.1). Furthermore, clone

    proportions are related to subclonal proportions via the sum rule (equation 4.2).

    $$\forall s \in S: \quad \sum_{v \in V} \alpha_{vs} = 1 \qquad (4.1)$$

    $$\beta_{vs} = \alpha_{vs} + \sum_{u \in D(v)} \alpha_{us} \qquad (4.2)$$
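
    The sum rule can be illustrated with a short sketch for a single sample. The tree encoding (a dictionary from each node to its list of children) and the function names are assumptions made for this example only.

```python
def descendants(children, v):
    """Return D(v): the set of all proper descendants of node v."""
    found = set()
    stack = list(children.get(v, []))
    while stack:
        u = stack.pop()
        found.add(u)
        stack.extend(children.get(u, []))
    return found


def clone_proportions(children, alpha):
    """Apply the sum rule (equation 4.2) for one sample.

    children: dict mapping a node to the list of its children (hypothetical encoding)
    alpha:    dict mapping a node to its subclonal proportion in the sample
    """
    return {
        v: alpha[v] + sum(alpha[u] for u in descendants(children, v))
        for v in alpha
    }


# Root 0 has children 1 and 2; node 1 has child 3.
children = {0: [1, 2], 1: [3]}
alpha = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# Equation 4.1: subclonal proportions sum to 1 within the sample.
assert abs(sum(alpha.values()) - 1.0) < 1e-9

beta = clone_proportions(children, alpha)
print(beta)
```

    In this example the clone rooted at node 1 has proportion 0.3 + 0.2 = 0.5, and the clone rooted at the root always has proportion 1 (up to floating point rounding).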

    The expected value of the frequency of a mutation is equal to the clone proportion of the

    node to which the mutation was assigned. Thus, the squared error incurred by assigning a

    single mutation i to a node v in sample s is given by equation 4.3.

    $$e_{ivs} = (F_{is} - \beta_{vs})^2 \qquad (4.3)$$

    Let $\Delta$ be an $|M| \times N$ binary matrix such that $\delta_{iv} = 1$ iff mutation i first appeared

    at node v, otherwise $\delta_{iv} = 0$. We also introduce matrix $A$ of dimensions $N \times |S|$, where

    $A_{vs} = \alpha_{vs}$. Given $T \in \mathcal{T}$, $\Delta$ and $A$, the total squared error is given by equation 4.4.

    $$E(T, \Delta, A) = \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta_{iv} e_{ivs} \qquad (4.4)$$

    Minimization of squared error may result in overfitting, assigning each mutation to a

    unique node in a very large tree. Instead, we minimize the Bayesian Information Criterion

    (BIC) under the assumption that the noise is normally distributed with known variance $\sigma^2$.

    The negative log likelihood can be expressed (within an additive factor) as given by equation 4.5.

    $$L(F \mid T, \Delta, A) = \frac{E(T, \Delta, A)}{2\sigma^2} \qquad (4.5)$$

    Finally, BIC can be expressed as given by equation 4.6.

    $$BIC(T, \Delta, A) = 2 \cdot L(F \mid T, \Delta, A) + |S| \cdot (N - 1) \cdot \log |M| \qquad (4.6)$$
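
    Equations 4.4 to 4.6 can be combined into a small scoring routine. The sketch below assumes that the mutation assignment is stored as a list (one node index per mutation) rather than as the binary matrix, and that the natural logarithm is intended in equation 4.6; both are assumptions of this example, as are the function names.

```python
import math


def total_squared_error(F, assignment, beta):
    """Equation 4.4 with the binary matrix replaced by a list of node indices.

    F[i][s]:       observed frequency of mutation i in sample s
    assignment[i]: node to which mutation i is assigned
    beta[v][s]:    clone proportion of node v in sample s
    """
    return sum(
        (F[i][s] - beta[assignment[i]][s]) ** 2
        for i in range(len(F))
        for s in range(len(F[i]))
    )


def bic(F, assignment, beta, sigma2):
    """Equations 4.5 and 4.6 for a tree with len(beta) nodes."""
    n_mutations = len(F)
    n_samples = len(F[0])
    n_nodes = len(beta)
    likelihood_term = total_squared_error(F, assignment, beta) / (2.0 * sigma2)
    # Natural logarithm is assumed for the penalty term.
    penalty = n_samples * (n_nodes - 1) * math.log(n_mutations)
    return 2.0 * likelihood_term + penalty


F = [[1.0, 1.0], [0.5, 0.3], [0.4, 0.3]]   # 3 mutations, 2 samples
beta = [[1.0, 1.0], [0.45, 0.3]]           # 2 nodes, 2 samples
assignment = [0, 1, 1]
print(total_squared_error(F, assignment, beta))
print(bic(F, assignment, beta, sigma2=0.0025))
```

    Adding a node always lowers or preserves the squared error, so the penalty term is what prevents the selected tree from growing until every mutation has its own node.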

    We propose to identify the optimal genotypes $\Delta_{opt}$, the subclone proportions $A_{opt}$ and

    phylogenetic relationship $T_{opt}$ as given by equation 4.7.

    $$\Delta_{opt}, A_{opt}, T_{opt} = \operatorname*{argmin}_{T, \Delta, A} BIC(T, \Delta, A) \qquad (4.7)$$

    We refer to the above optimization problem as the mutation phylogeny problem. We pro-

    pose two methods to solve this problem, namely “CITUP qip” and “CITUP iter”. CITUP qip

    uses an exact Quadratic Integer Programming formulation; while CITUP iter implements an

    iterative heuristic. A detailed description of these implementations for solving the mutation

    phylogeny problem is given in the following sections.

    4.2 Method Outline

    Given a fixed tree topology, define the mutation assignment problem as the problem of iden-

    tifying A and ∆ that minimize mutation frequency error (equation 4.4). CITUP solves the

    mutation phylogeny problem by iterating through all tree topologies up to a fixed number

    of nodes Nmax, and solving the mutation assignment problem for each tree:

    1. for each $N \in \{1, \ldots, N_{max}\}$, for each $T \in \mathcal{T}_N$

    (a) identify A and ∆ that minimizes equation 4.4 (mutation assignment problem)

    (b) calculate BIC for T using equation 4.6

    2. select T , A and ∆ that minimize BIC

    We propose two methods for solving the mutation assignment problem: a Quadratic Integer

    Programming based approach (CITUP qip), and an iterative heuristic approach (CITUP iter)

    as explained below.

    4.3 Quadratic Integer Programming (QIP) method

    QIP based approaches guarantee an optimal solution but limit the feasible problem size. To

    ensure a reasonable running time for the QIP approach on larger (>20 mutations) problem

    sizes, we first cluster the mutations into N sets by their mutation frequency, where N is

    the number of nodes in the current tree topology. We then limit the solution space for ∆

    by adding the constraint that all mutations in a cluster must be assigned, en masse, to a

    single node. We use multivariate k-means clustering implemented in the Python scikit-learn

    package to cluster mutations.

    Let $c : M \to \{1, \ldots, N\}$ be a mapping from mutations to clusters. Let $\Delta'$ be an $N \times N$

    binary matrix such that $\delta'_{c(i)v} = 1$ iff mutation i assigned to cluster $c(i)$ originated at node

    v, otherwise $\delta'_{c(i)v} = 0$. The total squared error given by equation 4.4 can be rewritten as

    equation 4.8.

    $$E(T, \Delta, A) = \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta'_{c(i)v} e_{ivs} \qquad (4.8)$$

    Requiring that each cluster must be assigned to exactly one node adds the constraint

    given by equation 4.9.

    $$\forall n \in \{1, \ldots, N\}: \quad \sum_{v \in V} \delta'_{nv} = 1 \qquad (4.9)$$

    Additionally, we require that all non-root nodes must have at least one cluster of muta-

    tions assigned to them, resulting in the constraint given by equation 4.10,

    $$\forall v \in V \setminus \{r\}: \quad \sum_{n \in \{1, \ldots, N\}} \delta'_{nv} \ge 1 \qquad (4.10)$$

    where r denotes the root node.

    The QIP approach minimizes the squared error objective (equation 4.8), subject to the

    subclonal proportion constraints (equation 4.1), the clone proportion constraints (equation

    4.2), and the cluster assignment constraints (equations 4.9 and 4.10).

    4.4 QIP optimizations

    The objective given by equation 4.8 is not well suited for QIP solvers. Below, we introduce

    auxiliary variables and constraints to convert our objective function to a form that is easier

    to solve. For mutation i, node v and sample s, introduce variable $x_{ivs}$ subject to the

    following constraints.

    $$x_{ivs} \ge F_{is} - \beta_{vs} \qquad (4.11)$$

    $$x_{ivs} \ge \beta_{vs} - F_{is} \qquad (4.12)$$

    Similarly, introduce variable $y_{ivs}$ subject to the following constraints:

    $$y_{ivs} \ge \delta'_{c(i)v} - 1 + x_{ivs} \qquad (4.13)$$

    $$y_{ivs} \ge 0 \qquad (4.14)$$

    The modified QIP minimizes the objective given by equation 4.15, subject to the additional

    constraints for $x_{ivs}$ and $y_{ivs}$.

    $$\sum_{i \in M} \sum_{v \in V} \sum_{s \in S} y_{ivs}^2 \qquad (4.15)$$

    It is easy to see that, at the optimum, whenever $\delta'_{c(i)v} = 1$, $y_{ivs}$ will be set to $x_{ivs}$; otherwise, it will be set

    to 0. Hence, minimizing the objective given in equation 4.15 is equivalent to minimizing the

    objective given in equation 4.8. It can also be easily verified that the Hessian of 4.15 is positive

    semidefinite, implying its convexity.
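
    The equivalence argument can be spelled out case by case. The derivation below is a sketch that writes the mutation frequency as $F_{is}$ (as in equation 4.3) and relies on the observation, assumed here, that $|F_{is} - \beta_{vs}| \le 1$ because both quantities are proportions in $[0, 1]$.

```latex
\begin{align*}
\delta'_{c(i)v} = 1 &: \quad y_{ivs} \ge x_{ivs} \ge |F_{is} - \beta_{vs}|
  \;\Longrightarrow\; \min y_{ivs}^2 = (F_{is} - \beta_{vs})^2 = e_{ivs}, \\
\delta'_{c(i)v} = 0 &: \quad y_{ivs} \ge \max(0,\; x_{ivs} - 1), \quad
  x_{ivs} = |F_{is} - \beta_{vs}| \le 1 \text{ is feasible}
  \;\Longrightarrow\; \min y_{ivs}^2 = 0, \\
\text{hence} &: \quad \min \sum_{i \in M} \sum_{v \in V} \sum_{s \in S} y_{ivs}^2
  = \min \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta'_{c(i)v}\, e_{ivs}.
\end{align*}
```

    In both cases the minimizer pushes $x_{ivs}$ and $y_{ivs}$ down to their lower bounds, which is why the inequality constraints behave like the absolute value and the product with $\delta'_{c(i)v}$ in the original objective.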

    4.5 Heuristic Iterative Method

    We also propose a heuristic iterative method for solving the mutation assignment problem.

    The iterative heuristic is significantly faster than the QIP with only a small degradation in

    performance observed in our evaluations.

    In brief, the iterative heuristic solves two subproblems iteratively until convergence.

    Problem 1: given a fixed ∆ calculate the (necessarily unique) A that minimizes equation

    4.4. Problem 2: with A fixed to the value calculated in the previous step, calculate the ∆

    that minimizes equation 4.4. Each step is guaranteed to not increase the objective given by

    equation 4.4, thus the algorithm is guaranteed to converge to a local optimum.

    Problem 1 is a convex quadratic programming problem and can be solved efficiently with

    existing convex optimization software. The objective given by equation 4.4 is solved subject

    to constraints given by equations 4.1 and 4.2. Problem 2 can be solved by independently

    assigning each mutation to the node v that minimizes equation 4.3.

    The iterative heuristic is not guaranteed to identify a globally optimal solution, and

    as such, results depend heavily on initialization. We mitigate this problem using multiple

    restarts with random initializations of ∆. A random ∆ is generated by independently

    assigning each mutation to a node, with mutations assigned uniformly at random to

    any node in the tree. We perform 1000 restarts with different random seeds, and select the

    solution that minimizes equation 4.4.
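
    The random initialization and the mutation reassignment step (Problem 2) can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the quadratic program of Problem 1 is not shown, and the sketch does not enforce that every non-root node receives at least one mutation.

```python
import random


def random_assignment(n_mutations, n_nodes, seed):
    """Random initial assignment: each mutation goes to a uniformly chosen node."""
    rng = random.Random(seed)
    return [rng.randrange(n_nodes) for _ in range(n_mutations)]


def reassign_mutations(F, beta):
    """Problem 2: with clone proportions fixed, pick the best node per mutation.

    F[i][s]:    observed frequency of mutation i in sample s
    beta[v][s]: clone proportion of node v in sample s
    Each mutation is handled independently by minimizing its squared error
    (equation 4.3 summed over samples).
    """
    n_nodes = len(beta)
    assignment = []
    for freqs in F:
        costs = [
            sum((f - b) ** 2 for f, b in zip(freqs, beta[v]))
            for v in range(n_nodes)
        ]
        assignment.append(min(range(n_nodes), key=costs.__getitem__))
    return assignment


beta = [[1.0, 1.0], [0.5, 0.3], [0.2, 0.6]]
F = [[0.95, 1.0], [0.45, 0.35], [0.25, 0.55], [0.5, 0.3]]
print(random_assignment(len(F), len(beta), seed=0))
print(reassign_mutations(F, beta))
```

    Because each mutation contributes a separate term to equation 4.4 once the clone proportions are fixed, the per-mutation minimum is also the joint minimum for this step.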

    4.6 Enumerating rooted trees

    We use the Beyer-Hedetniemi algorithm [2] to enumerate rooted tree topologies up to the

    user-defined number of nodes (Nmax). The number of non-isomorphic rooted trees for the

    N = 1, . . . , 10 nodes are as follows: 1, 1, 2, 4, 9, 20, 48, 115, 286, 719.
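
    For illustration, the level-sequence successor rule of Beyer and Hedetniemi can be sketched as below. This is a plain (not constant-time) rendering written for this text, with hypothetical names; each tree is represented by the depths of its nodes in preorder, with the root at level 1.

```python
def rooted_trees(n):
    """Yield one canonical level sequence per non-isomorphic rooted tree on n nodes.

    A level sequence lists node depths in preorder (root = 1). The first tree
    is the path 1, 2, ..., n and the last is the star 1, 2, 2, ..., 2.
    """
    if n < 1:
        return
    level = list(range(1, n + 1))
    while True:
        yield tuple(level)
        # p: last position whose level exceeds 2 (none left means we are at the star)
        p = max((i for i in range(n) if level[i] > 2), default=None)
        if p is None:
            return
        # q: parent of p, i.e. the last earlier position one level above p
        q = max(i for i in range(p) if level[i] == level[p] - 1)
        shift = p - q
        # Overwrite the tail with repeated copies of the block starting at q
        for i in range(p, n):
            level[i] = level[i - shift]


counts = [sum(1 for _ in rooted_trees(n)) for n in range(1, 7)]
print(counts)
print(list(rooted_trees(4)))
```

    A level sequence can be turned back into parent pointers by scanning left to right and attaching each node to the most recent node that is one level higher.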

    4.7 Model selection

    In practice, the variance $\sigma^2$ required to calculate equation 4.5 is often unknown and must

    be estimated from the data. We estimate $\sigma^2$ by clustering the mutation frequencies using

    a k-component Gaussian Mixture Model (GMM) with spherical covariance matrix, where

    k is selected to minimize the BIC of the GMM. We then use the estimated variance of the

    GMM as $\sigma^2$.

    We remark that this model selection procedure can only distinguish trees with the same

    number of nodes if they have different objective function scores. In practice, we have found

    that two distinct trees with an equal number of nodes can have identical objective scores.

    Following other tools developed for this problem [19, 6], in such cases we report all solutions

    with the best score.

  • Chapter 5

    Results

    In this chapter we present the performance of our algorithm on simulated and real datasets.

    For the simulations and CLL datasets we have used Nmax = 7 and for the AML datasets

    we have set Nmax = 8.

    5.1 Datasets

    To evaluate our method, we use both simulated and real datasets. For simulations, we

    experiment with a variety of trees with differing number of subclones and model parameters.

    We report the performances of both CITUP qip and CITUP iter, using several measures

    that are explained in the following section. We compare the performance of CITUP to

    the performances of TrAp [19] and PhyloSub [6], which can handle multi-sample datasets.

    Additionally, we report a separate comparison between CITUP and Rec-BTP [5] on a smaller

    set of single-sample simulations. We limit our comparison to these tools since our model

    does not support the type of input required by [18] and [13]. While the method of [16] also

    works with SNV data, their model is not directly comparable to ours due to incompatible

    assumptions and goals.

    We also evaluate the utility of our method on two real datasets. The first dataset is

    taken from a Chronic Lymphocytic Leukemia (CLL) study by Schuh et al. [17]. This

    dataset contains targeted deep sequencing measurements of 3 CLL patients sampled at

    5 time points. The second dataset consists of a study involving Acute Myeloid Leukemia

    (AML) patients by Ding et al. [3]. This dataset features a large number of somatic indels and

    single nucleotide variants (SNVs), however only 3 sample points (designated as “normal”,

    “tumor” and “relapse”) are available per patient. Since the simulations show the QIP and

    iterative versions to have similar performance, we only report the results of CITUP qip on

    the real datasets.

    5.2 Evaluation criteria

    We evaluate the performance of CITUP on the simulation sets using several measures. To

    compute these measures, we first obtain a matching between the predicted tree and the true

    tree as explained below.

    Let T = (V,E) denote the simulated tree, which we are trying to find, and let T ′ =

    (V ′, E′) denote the tree predicted by CITUP. We first check whether T and T ′ have identical

    topologies as a measure of success. In general, however, the topology of T may be different

    from T ′. In such cases, computing the correspondence of the nodes in each tree is not

    trivial. To accomplish this, we first create a complete bipartite graph G, where one partition,

    denoted by A, consists of the nodes of T and the other partition, B consists of the nodes

    of $T'$. If $|V| \ne |V'|$, then we add dummy nodes to the partition with fewer nodes until

    both partitions have exactly $\max(|V|, |V'|)$ nodes.

    We denote by $A_i$ the set of mutations assigned to node i in $T$. Similarly, we define $B_j$ to

    be the set of mutations assigned to node j in $T'$. If i (or j) is a dummy node, then $A_i = \emptyset$

    (resp. $B_j = \emptyset$). For each edge (i, j) in G, we calculate its weight as the number of mutations

    that are assigned to exactly one of i or j. We denote this weight by $c(i, j)$. We then search for

    a matching $f : A \to B$ that minimizes:

    $$\sum_{i \in A} c(i, f(i)) \qquad (5.1)$$
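
    The edge weights and the matching objective can be illustrated with a small sketch. Note that it uses exhaustive search over permutations, which is a swapped-in technique that is only practical for the small trees considered here; it is not the polynomial time algorithm referred to in the text. All names are hypothetical.

```python
from itertools import permutations


def match_nodes(true_sets, pred_sets):
    """Match nodes of two trees by minimizing the total weight of equation 5.1.

    true_sets[i]: set of mutations assigned to node i of the true tree
    pred_sets[j]: set of mutations assigned to node j of the predicted tree
    Returns (f, cost), where f[i] is the predicted node matched to true node i.
    Dummy nodes with empty mutation sets pad the smaller side.
    """
    size = max(len(true_sets), len(pred_sets))
    A = list(true_sets) + [set()] * (size - len(true_sets))
    B = list(pred_sets) + [set()] * (size - len(pred_sets))
    # c(i, j): mutations assigned to exactly one of the two nodes
    cost = [[len(a ^ b) for b in B] for a in A]

    def total(perm):
        return sum(cost[i][perm[i]] for i in range(size))

    # Brute force over all one-to-one matchings (factorial time, small inputs only)
    best = min(permutations(range(size)), key=total)
    return list(best), total(best)


true_sets = [{1, 2}, {3}, {4, 5}]
pred_sets = [{3}, {1, 2, 4}, {5}]
print(match_nodes(true_sets, pred_sets))
```

    In this example the best matching pairs true node 0 with predicted node 1, true node 1 with predicted node 0 and true node 2 with predicted node 2, for a total weight of 2.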

    This problem is known as “Minimum Bipartite Matching”, for which efficient

    polynomial time algorithms exist [8]. Once we obtain a one-to-one matching between the

    nodes of the two trees, we calculate the following scores:

    1. Correct tree proportion: (M0) This is the proportion of correctly identified tree topolo-

    gies to the total number of simulations in each experiment.

    2. Clone proportion error: (M1) For this measure, we compute:

    $$\frac{\sum_{u \in T^*} \left| \beta^{T^*}_{u} - \beta^{T^{**}}_{g(u)} \right|}{|V^*|} \qquad (5.2)$$

    Above, $T^*$ denotes the smaller of the trees $T$ and $T'$ while $T^{**}$ denotes the larger one. $V^*$

    is defined to be the set of nodes in $T^*$, and $\beta^X_n$ represents the frequency of clone n in tree

    X. If $T^*$ is the true tree, we define $g \equiv f$. Otherwise we set $g \equiv f^{-1}$.

    3. Misplaced mutation proportion: (M2) Suppose a mutation m is assigned to a node u

    in the true tree T . If it is assigned to f(u) in T ′, we say that m is correctly placed,

    otherwise we say it is misplaced. M2 is set to the number of misplaced mutations

    divided by the total number of mutations in the dataset. This measure essentially

    evaluates the mutation clustering accuracy.

    4. Phylogenetic accuracy: (M3) For this measure, we count the number of phylogenetic

    relationships that are preserved. We use two types of mutually exclusive relationships:

    ancestor/descendant and non-ancestor/descendant. For example, if a mutation A

    emerges at a clone that is an ancestor of another clone where mutation B emerges,

    we say that A is an ancestor of B (or alternatively B is a descendant of A). If this

    relationship is reversed in the predictions, it is counted as non-preserved. If two

    mutations do not have an ancestor/descendant relationship, they are marked as a non-

    ancestor/descendant pair. If such a pair is predicted to have an ancestor/descendant

    relationship, this pair is also counted as non-preserved.
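
    Measures M2 and M3 can be sketched as follows. The sketch makes several assumptions that the text does not settle: trees are encoded as parent dictionaries, two mutations at the same node are treated as a non-ancestor/descendant pair, and M3 is normalized by the total number of mutation pairs. All names are hypothetical.

```python
from itertools import combinations


def ancestors(parent, v):
    """Proper ancestors of node v; parent maps node -> parent, root -> None."""
    out = set()
    while parent[v] is not None:
        v = parent[v]
        out.add(v)
    return out


def relation(parent, node_of, a, b):
    """Relationship between mutations a and b given their nodes of origin."""
    u, w = node_of[a], node_of[b]
    if u in ancestors(parent, w):
        return "ancestor"
    if w in ancestors(parent, u):
        return "descendant"
    return "unrelated"  # includes mutations at the same node (assumption)


def phylogenetic_accuracy(true_parent, true_node, pred_parent, pred_node):
    """M3 sketch: fraction of mutation pairs whose relationship is preserved."""
    pairs = list(combinations(sorted(true_node), 2))
    preserved = sum(
        relation(true_parent, true_node, a, b) == relation(pred_parent, pred_node, a, b)
        for a, b in pairs
    )
    return preserved / len(pairs)


def misplaced_proportion(true_node, pred_node, f):
    """M2 sketch: mutation m at true node u is misplaced unless predicted at f[u]."""
    misplaced = sum(pred_node[m] != f[true_node[m]] for m in true_node)
    return misplaced / len(true_node)


# True tree is a chain 0 -> 1 -> 2; predicted tree is a star with root 0.
true_parent = {0: None, 1: 0, 2: 1}
pred_parent = {0: None, 1: 0, 2: 0}
node_of = {"m1": 0, "m2": 1, "m3": 2}  # same placement in both trees
print(phylogenetic_accuracy(true_parent, node_of, pred_parent, node_of))
print(misplaced_proportion(node_of, node_of, {0: 0, 1: 1, 2: 2}))
```

    In this example no mutation is misplaced, yet only two of the three pairwise relationships are preserved, which shows that the two measures capture different aspects of a prediction.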

    5.3 Evaluation on simulated datasets

    We evaluate the performances of CITUP qip and CITUP iter compared to TrAp and Phy-

    loSub using a large set of simulations. For these simulations, we generate random tree

    topologies T with 3 to 6 subclones with 3 to 7 samples. The frequencies of subclones are

    simulated using a Dirichlet distribution with parameter α, ranging from 0.1 to 10.0. For

    each simulation, we generate a set of 500 mutations that are uniformly distributed to the

    subclones. The frequencies of these mutations are then altered through additive Gaussian

    noise with deviation between 0.02 and 0.1.
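
    A simulation along these lines could be sketched as below. The details are assumptions made for illustration: a symmetric Dirichlet distribution is used, the noise parameter is treated as a standard deviation, noisy frequencies are clipped to [0, 1], mutations may also land on the root, and the tree is given as a parent list in which every parent has a smaller index than its child. This is not the simulation code used for the experiments.

```python
import random


def dirichlet(n, alpha, rng):
    """Draw from a symmetric Dirichlet(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def clone_proportions(parent, alpha_vec):
    """Sum rule: add each node's proportion to its parent, deepest indices first.

    parent[v] is the parent of node v (None for the root at index 0) and is
    assumed to satisfy parent[v] < v.
    """
    beta = list(alpha_vec)
    for v in range(len(parent) - 1, 0, -1):
        beta[parent[v]] += beta[v]
    return beta


def simulate(parent, n_mutations, n_samples, alpha, noise_sd, seed=0):
    """Return (node_of, F): mutation origins and noisy frequency matrix."""
    rng = random.Random(seed)
    n_nodes = len(parent)
    # Mutations are distributed uniformly over the nodes (root included here)
    node_of = [rng.randrange(n_nodes) for _ in range(n_mutations)]
    F = [[0.0] * n_samples for _ in range(n_mutations)]
    for s in range(n_samples):
        beta = clone_proportions(parent, dirichlet(n_nodes, alpha, rng))
        for i, v in enumerate(node_of):
            noisy = beta[v] + rng.gauss(0.0, noise_sd)
            F[i][s] = min(1.0, max(0.0, noisy))  # clipping is an assumption
    return node_of, F


parent = [None, 0, 0, 1]  # root 0 with children 1 and 2; node 3 under node 1
node_of, F = simulate(parent, n_mutations=500, n_samples=4, alpha=1.0, noise_sd=0.05)
print(len(F), len(F[0]))
```

    Small values of the Dirichlet parameter concentrate most of the mass on a few subclones, while large values give nearly equal proportions, which is what the frequency imbalance experiments vary.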

    We compare each true tree T with the trees obtained by the tools based on the four eval-

    uation criteria introduced above. For CITUP qip, we first cluster the mutations as described

    in Methods. As the current version of TrAp does not have a module for clustering and we

    were unable to run it on the individual mutations, we use our own clustering method for

    TrAp as well. Since our model selection procedure is unlikely to work with TrAp’s heuris-

    tic model, we had to provide TrAp with the clustering of the correct size. We emphasize

    that despite this significant advantage, TrAp performs worse than CITUP with respect to

    most of our criteria (see below). For PhyloSub and CITUP iter we use the individual set of

    mutations.

    Since all four methods can output multiple solutions, we devise the following protocol

    in order to compute the evaluation measures. For CITUP qip, CITUP iter and TrAp, we

    randomly choose up to 3 trees out of all (top scoring) solutions reported by the tool. If there

    are only 1 or 2 reported solutions, we pick only these. Since PhyloSub reports 3 solutions

    by default, we simply pick these solutions. For each tool, if one of the chosen solutions has

    the correct tree topology, we use that solution to calculate all the measures for that tool.

    Otherwise, we select one of them randomly.

    Figure 5.1 summarises the results of these simulations. Note that for each selection of

    parameters, we repeat the experiment 10 times.

    The first column of the figure demonstrates the effect of the number of subclones/nodes

    on all four criteria. The number of nodes varies between 3 and 6; in all cases the number of

    samples is set to 4, the Gaussian noise deviation is set to 0.05 and the frequency imbalance,

    as determined by the parameter α of the Dirichlet distribution, is set to 1.0.

    The second column demonstrates the effect of the number of samples on the four criteria.

    The number of samples now varies between 3 and 7; in all cases the number of nodes is set

    to 5, and again, the Gaussian noise deviation is set to 0.05 and α is set to 1.0. We note that

    we were unable to run PhyloSub for 7 samples due to limitations of this software. Hence,

    in this case the comparison is only between the other methods.

    The third column depicts the effect of increasing noise (primarily due to sequence coverage

    variation). The Gaussian noise deviation now varies between 0.02 and 0.1, for 4 samples,

    5 subclones and α = 1.0.

    The fourth column depicts the effect of imbalance in subclones where α varies between

    0.1 and 10.0, again for 4 samples, 5 subclones and noise deviation of 0.05.

    From Figure 5.1, we see that both CITUP qip and CITUP iter find the correct tree

    topology more often than TrAp, despite the fact that TrAp is already provided with the

    correct number of clusters. In other words, while the other tools have to simultaneously

    identify the right tree size and topology, TrAp only has to find the right topology of the

    given tree size. Compared to CITUP and TrAp, PhyloSub performs poorly with respect to

    [Figure 5.1 plot: a 4 × 4 grid of boxplots. Rows (y-axes): correct tree proportion, clone proportion error, misplaced mutation proportion, phylogenetic accuracy. Columns (x-axes): number of nodes (3 to 6), number of samples (3, 5, 7), mut. frequency noise (0.02 to 0.1), sample dirichlet alpha (0.1, 1.0, 10.0). Legend: CITUP_qip, CITUP_iter, TrAp, PhyloSub.]

    Figure 5.1: Simulation results for TrAp, PhyloSub and CITUP (QIP and iterative procedures) under the four evaluation criteria. The rows depict measures M0 to M3. The first column investigates the effect of the number of subclones/nodes in the dataset, the second investigates the effect of the number of samples, the third investigates the effect of noise added to the mutation frequencies and the fourth investigates the effect of non-uniformity among subclone frequencies. The figure is drawn using the boxplot function in Python’s matplotlib library: the line within each box is the mean and the box boundaries mark the 25% and 75% values. The extreme outliers are depicted with + symbols. Note that we were unable to run PhyloSub on 7 samples, so the corresponding bars are absent from this column.

    this measure.

    Similarly, CITUP typically performs better than the other tools in terms of phylogenetic

    accuracy with a score of 60% or more in most cases. This suggests that even when the correct

    tree is not found, the majority of phylogenetic relationships are preserved.

    In estimating clonal frequencies, we see that CITUP outperforms both TrAp and Phy-

    loSub, while TrAp performs best with respect to the ratio of misplaced mutations. We

    remark that this is likely due to TrAp’s unfair advantage of being given the clustering with

    the correct number of clusters. Note that this measure is evaluated by a one-to-one match-

    ing between the nodes of the predicted and the true tree using only the mutations assigned

    to (but not inherited by) the node. Hence, even when the predicted topology is not identical

    to the correct tree, this measure can have a perfect score as long as the initial clustering

    groups the mutations correctly. This, by definition, can only happen when the clustering

    is performed with the correct number of clusters. Indeed, Figure 5.1 shows that whenever

    CITUP identifies the correct tree topology (hence, the correct tree size) 10 out of 10 times,

    it performs on par with TrAp. This suggests that TrAp’s apparent superiority to CITUP

    in this measure is simply due to the high accuracy of our clustering method.
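    The one-to-one node matching behind the misplaced-mutation measure can be illustrated with a small sketch. It is not the evaluation code used for the experiments: the function name `misplaced_ratio`, the padding with empty nodes and the brute-force search over permutations (standing in for a proper assignment solver) are all assumptions made for illustration, and it is only practical for trees with a handful of nodes.

```python
from itertools import permutations


def misplaced_ratio(true_nodes, pred_nodes):
    """Fraction of mutations left unmatched by the best one-to-one node matching.

    Each argument is a list of sets. A set holds only the mutations that first
    appear at that node; inherited mutations are excluded (assumed input format).
    The true tree is assumed to contain at least one mutation.
    """
    total = len(set().union(*true_nodes))
    # Pad the shorter list with empty nodes so that every node has a partner.
    size = max(len(true_nodes), len(pred_nodes))
    true_pad = list(true_nodes) + [set()] * (size - len(true_nodes))
    pred_pad = list(pred_nodes) + [set()] * (size - len(pred_nodes))
    best_overlap = 0
    # Brute force over all matchings; fine for the 3 to 8 node trees considered here.
    for perm in permutations(range(size)):
        overlap = sum(len(true_pad[i] & pred_pad[j]) for i, j in enumerate(perm))
        best_overlap = max(best_overlap, overlap)
    return (total - best_overlap) / total


true_tree = [{"m1", "m2"}, {"m3"}, {"m4"}]
# Same clusters in a different node order: nothing is misplaced.
print(misplaced_ratio(true_tree, [{"m3"}, {"m4"}, {"m1", "m2"}]))
# m3 merged into the wrong cluster: one of four mutations is misplaced.
print(misplaced_ratio(true_tree, [{"m1", "m2", "m3"}, set(), {"m4"}]))
```

    Because only the mutation sets are compared, the score ignores the edges of the tree, which is why a correct clustering alone can yield a perfect value.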

    Sensitivity analysis of CITUP iter on the same set of simulated data with respect to

    starting points is given in Figure 5.2. Overall, we see that CITUP qip and CITUP iter

    perform similarly under most conditions, although CITUP qip seems to be slightly more

    resilient to extreme values of simulation parameters (e.g. sample Dirichlet alpha and mu-

    tational frequency noise). Hence, we have chosen to proceed with CITUP qip for the real

    datasets.
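    The two quantities summarised in the sensitivity analysis (the objective error of each restart, and the proportion of restarts that reach the global minimum within 10^-9) can be computed as in the sketch below. The function name `restart_summary` and the input layout are hypothetical; only the definitions of error and tolerance come from the caption of Figure 5.2.

```python
def restart_summary(restart_objectives, global_minimum, tolerance=1e-9):
    """Summarise iterative restarts against a known global minimum.

    restart_objectives: local-minimum objective values, one per restart
                        (assumed to come from the iterative procedure).
    global_minimum:     objective value of the exact (QIP) solution.
    Returns the per-restart errors and the proportion within `tolerance`.
    """
    errors = [value - global_minimum for value in restart_objectives]
    hits = sum(1 for err in errors if abs(err) <= tolerance)
    return errors, hits / len(errors)


errors, proportion_optimal = restart_summary([1.0, 1.0 + 5e-10, 1.5, 3.0], 1.0)
print(errors)
print(proportion_optimal)
```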

    5.4 Comparison with Rec-BTP

    We have also performed a separate comparison between CITUP qip and Rec-BTP. Since

    Rec-BTP does not support multi-sample datasets, for these experiments we have simulated

    single-sample datasets with 500 mutations for 4, 5 and 6 node trees. In each case, we

    generate 10 simulations adding up to 30 datasets in total. The topologies of the trees were

    chosen randomly as before. Since the current version of Rec-BTP does not report which

    mutations are assigned to each subclone, we were restricted to a very limited evaluation of

    the performance of this tool. Briefly, we compare the results of the two methods based on

    i) the number of subclones predicted and ii) an RMSD measure of the predicted subclonal

    [Figure 5.2 plot panels. Columns: number of nodes (3 to 6), number of samples (3, 5, 7), mut. frequency noise (0.02 to 0.1), sample dirichlet alpha (0.1, 1.0, 10.0). Top row y-axis: objective value error. Bottom row y-axis: proportion optimal.]

    Figure 5.2: Sensitivity analysis of CITUP iter with respect to starting points. Top: Distribution of errors in the objective for different restarts of the algorithm, where error is defined as the difference between the local minimum objective value reached by CITUP iter and the global minimum reached by CITUP qip. Bottom: The proportion of iterative restarts that reach the global minimum within 10^{-9}.


    Patient   No. of      No. of      No. of      Wall-clock
              mutations   subclones   solutions   time (min)
    CLL003    19          5           1           1.64
    CLL006    9           5           2           0.32
    CLL077    15          5           1           0.84

    Table 5.1: Summary of CITUP's results on the CLL dataset. The second column refers to the number of mutations as reported by [17]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

    frequencies similar to the one employed in [5]. In terms of the first measure, CITUP was

    able to find the correct number of subclones in 50% of the simulations (15 out of 30). In

    contrast, Rec-BTP only identified the correct number of subclones in 23.3% of the cases

    (7 out of 30). CITUP also outperformed Rec-BTP with respect to the RMSD measure:

    the average RMSD values for CITUP and Rec-BTP over the 30 simulations were 0.02 and 0.05,

    respectively.
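    The RMSD measure follows [5] and its exact definition is not restated here. As a rough sketch of one plausible form, the code below sorts the true and predicted subclonal frequencies, pads the shorter list with zeros when the two methods disagree on the number of subclones, and takes the root-mean-square of the differences. The sorting and zero-padding steps and the function name `frequency_rmsd` are assumptions, not the procedure of [5].

```python
import math


def frequency_rmsd(true_freqs, pred_freqs):
    """Root-mean-square deviation between two sets of subclonal frequencies.

    Assumption: frequencies are compared in descending order and the shorter
    list is padded with zeros (a missing subclone counts as frequency 0).
    """
    size = max(len(true_freqs), len(pred_freqs))
    true_sorted = sorted(true_freqs, reverse=True) + [0.0] * (size - len(true_freqs))
    pred_sorted = sorted(pred_freqs, reverse=True) + [0.0] * (size - len(pred_freqs))
    squared = [(t - p) ** 2 for t, p in zip(true_sorted, pred_sorted)]
    return math.sqrt(sum(squared) / size)


# Same frequencies reported in a different order give zero deviation.
print(frequency_rmsd([0.5, 0.3, 0.2], [0.2, 0.5, 0.3]))
# One predicted subclone against two true subclones.
print(frequency_rmsd([0.5, 0.5], [1.0]))
```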

    5.5 Results on Chronic Lymphocytic Leukemia datasets

    We evaluate the performance of CITUP qip on the Chronic Lymphocytic Leukemia (CLL)

    dataset of [17]. This dataset consists of single nucleotide and small indel mutations as

    inferred from Whole-Genome Sequencing (WGS) data from 3 CLL patients. Each patient is

    sampled at five time points while receiving a variety of treatments. The authors also perform

    targeted deep sequencing for a limited number of mutations found through WGS. Although

    a high number of somatic mutations are detected for each patient, only the frequencies of

    coding mutations are made available by Schuh et al. Hence, we are only able to use the

    coding mutations as input to our algorithm. Since the number of mutations is small for

    these datasets, we manually removed mutations that are not heterozygous as reported by

    [17].

    Table 5.1 gives a summary of CITUP’s performance on all three patients.

    The trees (Figures 5.3, 5.4 and 5.5) and the clonal frequencies reported by CITUP for

    these patients match the results reported in [17] very closely: the mean absolute deviations

    are 0.0088, 0.0016 and 0.0048 for patients CLL003, CLL006 and CLL077 respectively. Note

    that while CITUP does not assign mutations to the root nodes in CLL003 and CLL077, the


    root node in CLL006 is assigned 5 mutations. This is in agreement with the observation in

    [17] that the normal contamination in this patient is insignificant and suggests that CITUP

    is able to automatically handle presence or absence of healthy cell contamination.

    Although CITUP finds two distinct topologies for patient CLL006 (a chain topology and

    a branching topology), the clonal frequencies remain the same in both cases. We note that the

    number of deep sequencing mutations is quite small for this dataset, possibly resulting in an

    ambiguity with respect to the tree topology. To see if additional mutations can help identify

    the true tree, we also ran CITUP on the WGS predictions for this dataset, containing 16

    mutations. In this case, CITUP reported a single solution with a chain topology (data not

    shown). Thus we conclude that the true solution is likely to be the one reported in Figure 5.4,

    which also matches the tree topology predicted by [17].

    Figure 5.3 suggests a switch between subclones ‘d’ and ‘e’ (referred to as subclones 4

    and 2 in [17]) around time-point 3. This is also in agreement with the disease progression

    as reported by Schuh et al., where the third time-point is classified as “complete response

    + minimal residual disease”. Meanwhile, subclone ‘d’ starts gaining

    dominance. The fourth and fifth time-points (as well as the first two time-points) are

    designated as “progressive disease”, suggesting that subclone ‘d’ replaces ‘e’ as the driver

    subclone while the tumor relapses. In contrast, Figures 5.4 and 5.5 imply a more stable

    subclonal composition over the time points. We note that the survival times of these patients

    are also longer than that of CLL003 (6+ and 9 versus 3 years), which may be linked to this slower

    pace of the clonal dynamics.

    5.6 Results on Acute Myeloid Leukemia datasets

    Next, we evaluate CITUP qip on an Acute Myeloid Leukemia (AML) dataset [3]. This

    dataset contains sequencing data from primary tumor and relapse samples after chemother-

    apy treatment, in addition to matched normal tissue for each patient. Although the normal

    tissue is typically sampled to distinguish somatic mutations, we also consider it as a sample

    since some of these tissues contain various degrees of cancer contamination and thus can be

    helpful in identifying subclones. Similar to the CLL dataset, we preprocess the mutations

    based on their copy-number analysis as reported by [3]. Briefly, we only keep autosomal

    mutations that are copy-number neutral. A summary of CITUP’s performance on 8 patients

    taken from this dataset is given in Table 5.2.


    Figure 5.3: CITUP predictions for patient CLL003. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.


    Figure 5.4: CITUP predictions for patient CLL006. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.


    Figure 5.5: CITUP predictions for patient CLL077. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.


    Patient     No. of      No. of      No. of      Wall-clock
                mutations   subclones   solutions   time (hours)
    UPN400220   265         7           1           1.71
    UPN426980   822         7           1           23.00
    UPN452198   97          5           4           0.14
    UPN573988   144         3           2           1.02
    UPN758168   412         7           2           3.33
    UPN804168   589         8           1           6.89
    UPN869586   1160        8           1           23.00
    UPN933124   270         6           1           3.75

    Table 5.2: Summary of CITUP's results on the AML dataset. The second column refers to the total number of indel and single nucleotide mutations as reported by [3]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

    Due to the large number of mutations, CITUP qip requires considerably more CPU time

    to run on this dataset compared to the CLL dataset. Nonetheless, we note that CITUP was

    able to optimize all but two datasets to an exact solution when a wall-clock time limit of 23

    hours is imposed for each dataset. Moreover, the total CPU time taken on these datasets

    indicates a quadratic to sub-quadratic practical running time.

    The number of subclones identified per patient is also higher than the number of sub-

    clones predicted for CLL patients. We believe this is likely due to the increased ability to

    detect subclones that differ by non-coding somatic mutations. To investigate this, we have

    also obtained CITUP qip results on 3 of the AML datasets (UPN426980, UPN804168 and

    UPN869586) using coding mutations only. Although the number of subclones predicted

    was smaller in all 3 cases, the overall clonal architecture in the newly predicted trees was

    typically similar to the trees estimated from the full set of mutations.

    While it is unknown whether the non-coding mutations play an important role in cancer

    progression, some may be hitchhiker mutations which represent subclones that differ by

    other types of aberrations such as gene fusions. Furthermore, some non-coding mutations

    may still be functional; for example, some intronic mutations are known to affect splicing.

    Thus, we believe that phylogenetic trees derived from the full set of mutations may have

    better potential to represent the true cancer progression.

    Since a full phylogenetic relationship analysis is absent from [3] and the ground truth

    solutions are not known, we cannot directly evaluate our predicted trees. Figure 5.6 shows,


    however, that the tumor purities inferred by CITUP generally agree with those reported

    by [3] for primary and relapse samples. Note that since CITUP does not explicitly predict

    tumor purity, for each sample this value is estimated as $(1.0 - \alpha_{rs})$ if the root node is

    not assigned any mutations, where $\alpha_{rs}$ is the predicted genome frequency of the root node

    in that sample. Otherwise, the tumor purity is considered to be 1.0 (assuming germline

    mutations have been excluded from the study).
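    The purity rule described above can be written as a few lines of code. This is an illustrative sketch only: the function name `tumor_purity` and its arguments are hypothetical and are not part of CITUP's interface.

```python
def tumor_purity(root_frequencies, root_mutation_count):
    """Estimate tumor purity per sample from the root node of a predicted tree.

    root_frequencies:    predicted frequency of the root node in each sample
                         (alpha_rs for sample s).
    root_mutation_count: number of mutations assigned to the root node.
    """
    if root_mutation_count == 0:
        # A mutation-free root is interpreted as normal (healthy) cells.
        return [1.0 - alpha for alpha in root_frequencies]
    # A root that carries mutations is itself a tumor clone, so purity is 1.0.
    return [1.0 for _ in root_frequencies]


# Root without mutations: purity is the complement of the root frequency.
print(tumor_purity([0.25, 0.5], root_mutation_count=0))
# Root carrying 5 mutations: every sample is treated as pure tumor.
print(tumor_purity([0.25, 0.5], root_mutation_count=5))
```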

    The only striking difference between the tumor purities inferred by [3] and CITUP is

    in the relapse sample of patient UPN869586. CITUP prediction for this patient is given

    in Figure 5.7. The figure suggests that while the founder clone ‘b’ (and its descendants) is

    present at a lower abundance in the relapse sample, which may correspond to the tumor

    purity of 40% reported by [3], CITUP predicts another emerging clone in the relapse sample

    (i.e. clone ‘g’). Interestingly, although no coding mutation is assigned to clone ‘g’, we have

    found that some of the mutations assigned to this clone are located in the intronic regions

    of several genes including IL15 and GPC5. Moreover, the tumor purity estimate in the

    relapse sample using coding-only mutations for this patient is closer to the purity estimate

    reported in [3].

    5.7 Computing environment and running parameters

    For each simulated dataset, we converted our simulated mutation frequencies to PhyloSub’s

    input format as follows. We assumed each mutation had sequencing depth of 1000 reads,

    and set the number of variant reads to 1000 · f, where f is the simulated mutation frequency.

    We assumed a sequencing error rate of 0.001. PhyloSub takes a significant amount of

    computation time, and thus it was necessary to make a minor modification to evolve.py to

    provide the ability to specify a maximum allowable computation time on the command line.

    If the specified computation time is exceeded, the sampler exits cleanly, reporting the top k

    trees identified thus far. For each simulated dataset, we ran PhyloSub using the evolve.py

    command specifying 1000 MH iterations and 1000 MCMC samples, and 3 restarts with

    different seeds. Computation time was limited to 95 hours per restart. PhyloSub completed

    on average 784.7 MCMC samples within the allowable computation time (standard deviation

    90.6 samples). For TrAp, we use the cluster frequencies as mentioned above and run it in

    multi-sample mode with default parameters. For Rec-BTP, the clustering of the mutations

    was performed using AVDPGM with the same parameters as described in [5]. Rec-BTP

    [Figure 5.6 plot. Y-axis: tumor purity (0 to 1). Series: Primary (Ding et al.), Primary (CITUP), Relapse (Ding et al.), Relapse (CITUP).]

    Figure 5.6: Tumor purities predicted by [3] and CITUP in primary and relapse samples of AML patients. For the three patients with multiple reported solutions, UPN758168 had the same root frequencies in both solutions. For UPN452198 and UPN573988, we pick the frequencies closest to the ones given in [3].

    [Figure 5.7 graphic: subclonal proportion charts and the predicted evolutionary tree for subclones a to h.]

    Figure 5.7: CITUP predictions for patient UPN869586. Left: The estimated subclonal proportions for tumor (inner) and relapse (outer) samples. Right: The predicted evolutionary tree and the coding mutations assigned to each subclone. The numbers in parentheses give the total number of (i.e. coding and non-coding) mutations for each subclone.


    was run with default parameters.
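    The conversion of simulated mutation frequencies into read counts for PhyloSub, as described in this section, amounts to the following sketch. Only the depth of 1000 reads and the 1000 · f rule come from the text; the function name, the rounding to whole reads and the tuple layout are assumptions, and PhyloSub's actual input file format is not reproduced here.

```python
SEQUENCING_DEPTH = 1000      # reads per mutation, as stated in the text
SEQUENCING_ERROR = 0.001     # assumed error rate, as stated in the text


def to_read_counts(frequencies, depth=SEQUENCING_DEPTH):
    """Turn simulated mutation frequencies into (variant_reads, total_reads) pairs.

    Rounding to the nearest integer is an assumption; the text only specifies
    that the number of variant reads is depth * f.
    """
    return [(round(depth * f), depth) for f in frequencies]


print(to_read_counts([0.125, 0.5, 0.0375]))
```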

    CITUP qip and CITUP iter are implemented in Python and C++ and CITUP qip is

    run using the IBM ILOG Cplex Optimizer. All CITUP runs were performed on a Linux

    server with a memory limit of up to 16GB per job.

  • Chapter 6

    Conclusions and future work

    In this work, we present CITUP, a novel combinatorial algorithm to determine clonal fre-

    quencies in tumors as well as their evolutionary history using one or more samples from

    the same patient. Our comparisons to other state-of-the-art tools show that CITUP con-

    sistently reports fewer solutions with better accuracy. This feature is very important for

    real cancer datasets where additional experiments may be required to validate the predic-

    tions. For example, predictions that involve contradictory assignments reported by TrAp

    (referred to as “non-sparse” solutions in [19]) complicate the downstream analysis of identifying

    potential drivers of cancer. Similarly, the partial order plots reported by PhyloSub

    [6] can involve many connections, making it difficult to interpret the solutions reported

    by this tool. Although our QIP framework is already able to handle a large number of

    mutations and is significantly faster than PhyloSub, we acknowledge that it is considerably

    slower than TrAp. On the other hand, the iterative heuristic version of CITUP exhibits

    comparable accuracy, while achieving significant reduction in computation time. Moreover,

    our ability to run CITUP separately on each tree topology means that parallel computing

    can be utilized to quickly obtain high accuracy results on large datasets. As mentioned

    above, CITUP assumes infinite sites, which may be violated under certain conditions. For

    instance, a functional mutation may be selected against during changes to the tumor en-

    vironment, such as the reversion of BRCA2 mutation in therapy resistant ovarian cancer

    [1]. Furthermore, lineages that die out before the first sampling of the tumor or emerge

    and disappear between two time points are not detectable by CITUP or any other method

    aiming to construct phylogenies. In these cases, the evolutionary history of the tumor can

    only be partially constructed. In addition, CITUP and similar methods are only applicable


    to tumors with limited copy number changes. On the other hand, this limitation can be

    partially overcome by considering a restricted number of copy-number corrected genotypes

    similar to the approach of PyClone [14]. Extension of CITUP to exploit these types of changes

    would lead to its broader applicability and detection of subclonal populations characterized

    by copy number aberrations.

  • Bibliography

    [1] Ashworth, A. Drug resistance caused by reversion mutation. Cancer Research 68, 24 (2008), 10021–10023.

    [2] Beyer, T., and Hedetniemi, S. M. Constant time generation of rooted trees. SIAM J. Comput. 9, 4 (1980), 706–712.

    [3] Ding, L., Ley, T. J., Larson, D. E., Miller, C. A., Koboldt, D. C., Welch, J. S., Ritchey, J. K., Young, M. A., Lamprecht, T., McLellan, M. D., McMichael, J. F., Wallis, J. W., Lu, C., Shen, D., Harris, C. C., Dooling, D. J., Fulton, R. S., Fulton, L. L., Chen, K., Schmidt, H., Kalicki-Veizer, J., Magrini, V. J., Cook, L., McGrath, S. D., Vickery, T. L., Wendl, M. C., Heath, S., Watson, M. A., Link, D. C., Tomasson, M. H., Shannon, W. D., Payton, J. E., Kulkarni, S., Westervelt, P., Walter, M. J., Graubert, T. A., Mardis, E. R., Wilson, R. K., and DiPersio, J. F. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 7382 (2012), 506–510.

    [4] Ding, L., Raphael, B. J., Chen, F., and Wendl, M. C. Advances for studying clonal evolution in cancer. Cancer Letters 340, 2 (2013), 212–219.

    [5] Hajirasouliha, I., Mahmoody, A., and Raphael, B. J. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Proceedings of the International Conference on Intelligent Systems of Molecular Biology (2014).

    [6] Jiao, W., Vembu, S., Deshwar, A., Stein, L., and Morris, Q. Inferring clonalevolution of tumors from single nucleotide soma