CLONALITY INFERENCE IN MULTIPLE TUMOR

SAMPLES USING PHYLOGENY

by

Salem Malikić

B.Sc., University of Sarajevo, Bosnia and Herzegovina, 2011

a Thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Science

in the

School of Computing Science

Faculty of Applied Sciences

© Salem Malikić 2014

SIMON FRASER UNIVERSITY

Summer 2014

All rights reserved.

However, in accordance with the Copyright Act of Canada, this work may be

reproduced without authorization under the conditions for “Fair Dealing.”

Therefore, limited reproduction of this work for the purposes of private study,

research, criticism, review and news reporting is likely to be in accordance

with the law, particularly if cited appropriately.

APPROVAL

Name: Salem Malikić

Degree: Master of Science

Title of Thesis: Clonality Inference in Multiple Tumor Samples using Phylogeny

Examining Committee: Dr. Andrei Bulatov,

Professor, Chair

Dr. Süleyman Cenk Şahinalp,

Professor, Senior Supervisor

Dr. Jian Pei,

Professor, Supervisor

Dr. Ayşe Funda Ergün,

Professor, Internal Examiner

Date Approved: August 22nd, 2014

Partial Copyright Licence

Abstract

Intra-tumor heterogeneity presents itself through the evolution of subclones during cancer

progression. While recent research suggests that this clonal diversity is a key factor in

therapeutic failure, the determination of subclonal architecture of human tumors remains a

challenge. To address the problem of accurately determining subclonal frequencies in tumors

as well as their evolutionary history, we have developed a novel combinatorial method named

CITUP (Clonality Inference in Tumors Using Phylogeny). An important feature of CITUP

is its ability to exploit data from multiple time-point and/or regional samples from a single

patient in order to improve estimates of mutational profiles and subclonal frequencies. Using

extensive simulations and real datasets comprising tumor samples from two leukemia drug-

response studies, we show that CITUP can infer the evolutionary trajectory of human

tumors with high accuracy.

keywords: Cancer progression, intra-tumor heterogeneity, combinatorial methods

To my beloved parents Faiz and Sadeta,

and my dear sister Faiza

You can’t connect the dots looking forward; you can only connect them looking backwards.

So you have to trust that the dots will somehow connect in your future.

You have to trust in something: your gut, destiny, life, karma, whatever.

This approach has never let me down, and it has made all the difference in my life.

— Steve Jobs

Acknowledgements

First and foremost, I would like to thank my supervisor Dr. S. Cenk Sahinalp for his

extensive guidance, support and patience during my studies. I especially thank him for

the endless effort he put into training me in the scientific field. I am also very thankful to

Andrew McPherson and Dr. Nilgün Donmez for their immense contribution to this work,

their valuable advice and help with writing the thesis. I would like to acknowledge the

insightful feedback I have received from my colleagues from the Lab for Computational

Biology at Simon Fraser University.

Also, I am indebted to Nermin Suljić, Ali Lafçioğlu, Dr. Hasan Jamak and Dino Oglić

for helping me develop my passion and enthusiasm towards the fields of Mathematics and

Science. Furthermore, I thank all of the people from Bosna Sema Educational Institutions for

providing an excellent environment and support during my high school and undergraduate

studies.

I am very grateful to my dear aunt Faiza and uncle Dževad together with their family

and my beloved partner Fatima for providing moral support. Last but not least, I would

like to express my deep gratitude to my parents and my sister for their unconditional love

and support.

Contents

Approval
Partial Copyright License
Abstract
Dedication
Quotation
Acknowledgements
Contents
List of Tables
List of Figures

1 Introduction

2 Background
2.1 DNA and mutations
2.2 Cancer onset and evolution
2.3 Methods for analysing and sequencing DNA and their applications in tumor heterogeneity studies
2.4 Identifying mutations and their frequencies from HTS data

3 Model assumptions and problem description
3.1 Phylogenetic tree model
3.2 Input data
3.3 Model assumptions
3.4 Problem description

4 Methods
4.1 Combinatorial Formulation
4.2 Method Outline
4.3 Quadratic Integer Programming (QIP) method
4.4 QIP optimizations
4.5 Heuristic Iterative Method
4.6 Enumerating rooted trees
4.7 Model selection

5 Results
5.1 Datasets
5.2 Evaluation criteria
5.3 Evaluation on simulated datasets
5.4 Comparison with Rec-BTP
5.5 Results on Chronic Lymphocytic Leukemia datasets
5.6 Results on Acute Myeloid Leukemia datasets
5.7 Computing environment and running parameters

6 Conclusions and future work

Bibliography

ix

List of Tables

3.1 Simple input to the CITUP algorithm consisting of 2 samples. In total, 10 somatic mutations have been identified. Their frequencies are estimated from the alignment of sequencing data, and most of them deviate from the true values due to the presence of noise. A mutation might be undetected in, or completely absent from, some, but not all, samples.

5.1 Summary of CITUP’s results on the CLL dataset. The second column refers to the number of mutations as reported by [17]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

5.2 Summary of CITUP’s results on the AML dataset. The second column refers to the total number of indel and single nucleotide mutations as reported by [3]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

List of Figures

2.1 Simple evolutionary tree showing the emergence of different clonal subpopulations as a consequence of mutations in cells’ DNA.

2.2 An example of heterogeneous tumor tissue consisting of several different clonal subpopulations of cells used as the input for DNA sequencing.

2.3 Frequencies of mutations present in the sample shown in Figure 2.2.

3.1 Arbitrary rooted tree representation of the binary tree given in Figure 2.1.

3.2 One of the possible interpretations of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 1. For each node except the root, the first number in the node label represents the proportion of cells harbouring the mutations that occurred along the edge connecting that node with its parent. For the root node this number is always 1. The number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

3.3 One of the possible interpretations of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 2. For each node except the root, the first number in the node label represents the proportion of cells harbouring the mutations that occurred along the edge connecting that node with its parent. For the root node this number is always 1. The number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

5.1 Simulation results for TrAp, PhyloSub and CITUP (QIP and iterative procedures) under the four evaluation criteria. The rows depict measures M0 to M3. The first column investigates the effect of the number of subclones/nodes in the dataset, the second investigates the effect of the number of samples, the third investigates the effect of noise added to the mutation frequencies and the fourth investigates the effect of non-uniformity among subclone frequencies. The figure is drawn using the boxplot function in Python’s matplotlib library: the line within each box is the mean and the box boundaries mark the 25% and 75% values. The extreme outliers are depicted with + symbols. Note that we were unable to run PhyloSub on 7 samples, so the corresponding bars are absent from this column.

5.2 Sensitivity analysis of CITUP iter with respect to starting points. Top: Distribution of errors in the objective for different restarts of the algorithm, where error is defined as the difference between the local minimum objective value reached by CITUP iter and the global minimum reached by CITUP qip. Bottom: The proportion of iterative restarts that reach the global minimum within $10^{-9}$.

5.3 CITUP predictions for patient CLL003. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.





at its ancestors.

at its ancestors.

5.6 Tumor purities predicted by [3] and CITUP in primary and relapse samples of AML patients. For the three patients with multiple reported solutions, UPN758168 had the same root frequencies in both solutions. For UPN452198 and UPN573988, we pick the frequencies closest to the ones given in [3].

5.7 CITUP predictions for patient UPN869586. Left: The estimated subclonal proportions for tumor (inner) and relapse (outer) samples. Right: The predicted evolutionary tree and the coding mutations assigned to each subclone. The numbers in parentheses give the total number of (i.e. coding and non-coding) mutations for each subclone.

Chapter 1

Introduction

Most human tumors exhibit a large degree of heterogeneity. This heterogeneity is not only

apparent in histology but also presents itself in various features such as gene expression

changes, genomic copy number alterations and structural rearrangements as well as other

aberrations. While the origins of the intra-tumor heterogeneity are still debated, research

suggests that this diversity is likely to have clinical implications. For instance, Merlo et

al. [11] have reported a correlation between clonal diversity and progression to esophageal

adenocarcinoma in Barrett’s esophagus.

The implications of tumor heterogeneity are not limited to diagnostics. It has been

suggested that clonal diversity may also be linked to metastatic potential and drug response.

Looking at biopsies from pancreas and prostate adenocarcinomas, Ruiz et al. [15] found that

metastatic tumors were derived from certain clonal populations. In colorectal cancer, Kreso

et al. [7] reported that clonal diversity affects chemotherapy tolerance. By tracking 150

lentivirus-marked lineages from 10 human colorectal cancers, they found that previously

minor or dormant clones were promoted by chemotherapy, thus reducing the effectiveness

of the treatment.

Although multi-clonality is common to most tumor samples, determining

the clonal subpopulations is a challenging process. This problem could potentially

be alleviated by single-cell sequencing; however, the current cost of these methods is

prohibitive at the scales that would be necessary to representatively sample a tumor tissue.

Methods such as Fluorescence in Situ Hybridization (FISH) or Silver in Situ Hybridization

(SISH) can also assess a small number of probes in individual cells of a tumor sample. On

the other hand, these methods are quite limited in scope and cannot offer the same genome-wide

perspective as high-throughput sequencing methods.

In silico separation of the clonal subpopulations may also provide a viable alternative

to these methods. Despite the importance of clonal diversity and its clinical implications,

however, relatively few computational methods have been developed to date. In a pioneering

paper, Schwartz and Shackney [18] developed an unmixing method based on a geometric

model to distinguish a small number of cancer subtypes in gene expression data. After

determining cell types and their relative frequencies in different tumor samples, this method

infers a phylogenetic tree that best fits the cell types identified. An alternative method,

named TrAp [19], generates possible phylogenetic trees following certain parsimony and

sparsity conditions using a greedy approach. More recently, PhyloSub [6], which is based on

Bayesian inference, was developed. This method relies on the well-known Markov Chain

Monte Carlo (MCMC) sampling paradigm to infer a distribution over all possible phylogenies.

Another statistical method named PyClone [14], also based on MCMC sampling, leverages

copy number genotypes to estimate subclonal frequencies more reliably. Unlike PhyloSub,

however, this method does not infer phylogenies.

In this work we present a combinatorial algorithm, named CITUP (Clonality Inference

in Tumors Using Phylogeny), that can exploit data obtained from multiple loci from a single

patient to infer the tumor phylogeny more accurately. Our framework also involves generating

all possible phylogenetic trees; however, unlike the approaches mentioned

above, CITUP has the ability to find optimal solutions based on an exact Quadratic Integer

Programming formulation.

Another tree-based method named Rec-BTP [5], developed concurrently with our work, is

closely related to our framework. In this approach, mutations are subjected to a binary-tree

partition, where a binary tree with the fewest conflicting triplets is sought using

an approximation algorithm. In contrast to our framework, this method cannot handle

multiple samples.

Our work is also related to the studies of [16] and [13]; however, their goals are different.

While THetA [13] aims to predict subclonal populations and their proportions given a

sample from high-throughput sequencing data, it does not aim to infer any phylogenetic

relationship between the subclones. Although the method proposed by Salari et al. [16]

infers tumor phylogenies from multiple samples like CITUP, the goal of that study is to

improve somatic SNV calls. Moreover, their model places the samples as leaves in a single

phylogenetic tree (thus leaves do not represent clonal subpopulations but rather samples,

each of which is a mixture of subclones) whereas our model assumes a shared tree between

samples with different clonal frequencies.

Chapter 2

Background

2.1 DNA and mutations

The human body is formed from between 50 and 100 trillion cells. Each cell contains all of the organism’s genetic instructions, stored as deoxyribonucleic acid (DNA), which is made up of molecules called nucleotides. There are four types of these molecules: adenine (A), cytosine (C), thymine (T) and guanine (G), and their order defines a particular DNA sequence and its function. For the sake of illustration, we can consider nucleotides as the alphabet of some language. Just as the letters of an alphabet combine to form the words of a language, nucleotides form genes that contain instructions for the different tasks that a cell performs. Inside the cell, DNA is packed into structures called chromosomes. Under normal conditions, chromosomes in human cells come in matching pairs, with one chromosome of each pair inherited from each parent. As a consequence, the human genome is diploid and genomic loci also come in pairs, where the term genome refers to an organism’s complete set of DNA, including all of its genes. The genotype, on the other hand, describes the nucleotides present at a specific location in the two copies. These two copies do not have to be identical, and each of the alternative forms of the same genetic locus is referred to as an allele.

Changing, deleting or altering the position of even a single letter in a word can completely change its meaning. For example, in English, if we change the first letter of the word “keep” into ‘d’, we get the word “deep”, which has a completely different meaning, and changing ‘k’ into ‘e’ results in “eeep”, which has no meaning at all. Similarly, changes in the DNA sequence of a cell, usually referred to as mutations or aberrations, can alter a DNA segment coding for some gene, which might result in an improperly functioning gene or in a complete loss of gene function. Many different types of mutations have been reported. A single nucleotide mutation, where a single nucleotide is exchanged for another, is the most prevalent type. Copy number aberrations occur when part of the genome gets deleted or amplified, and an indel (a blend of “insertion” and “deletion”) is the insertion or deletion of nucleotides in a DNA sequence.

During the lifetime of an individual, many cells undergo programmed cell death (apoptosis) and get replaced by new cells in a process called somatic cell division. In this process one cell, called the mother cell, is replaced by two daughter cells. During replication, the DNA of the mother cell is copied into each of the daughter cells and, in a perfect division, each daughter cell’s DNA sequence is an exact copy of the mother cell’s DNA sequence. Occasionally, mutations occur during a somatic cell division, resulting in one or both daughter cells having a DNA sequence that differs from that of the mother cell. Somatic division is not the only process whereby mutations can be acquired. Various external factors, such as radiation, tobacco consumption, ultraviolet light and many others, can also have a deleterious effect on a cell and cause severe damage to its DNA.

All of the mutations acquired in non-germ cells during the lifetime of an individual are commonly referred to as somatic mutations. Germline or hereditary mutations comprise another major class of mutations. Mutations from this class occur in germ cells and can later be passed on to the progeny. If inherited, such a mutation is typically present in all cells of the human body and does not give valuable information in heterogeneity studies. For this reason, in this work we focus only on somatic mutations.

More than 98% of human DNA is noncoding, and most noncoding DNA does not have any known biological function. Somatic mutations can accordingly be divided into coding and non-coding mutations. The latter are usually non-deleterious and do not have a serious impact on cell functioning. Coding somatic mutations, on the other hand, might be fatal for proper cell functioning and result in various diseases, with cancer being one of the most frequent among them. For these reasons, some studies use only coding mutations.

Through somatic cell division, a mutation acquired in some cell is passed on to all of its descendants unless it gets reverted, which is very unlikely considering that human DNA consists of several billion nucleotides. In addition to the mutations inherited from the mother cell, daughter cells might acquire additional mutations. This can result in the emergence of several clonal subpopulations of cells, each one uniquely identifiable by the set of somatic mutations it has acquired, as shown in Figure 2.1. This figure shows a simple evolutionary tree, also called a phylogenetic tree, where, starting from a normal, healthy cell, six different subpopulations of mutated cells have emerged. Note that some clonal subpopulations can die out or get completely replaced by more differentiated subpopulations descending from them. One such clonal subpopulation of cells, harbouring only the red-colored mutation, is shown in Figure 2.1.

Figure 2.1: Simple evolutionary tree showing the emergence of different clonal subpopulations as a consequence of mutations in cells’ DNA.
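The idea that each clonal subpopulation is identified by the mutations it inherits plus the ones it acquires can be illustrated with a minimal sketch. The tree, node names and mutation labels below are hypothetical and do not reproduce Figure 2.1; the only assumption is that mutations are never reverted.

```python
# Hypothetical toy tree: each node maps to
# (parent, mutations acquired on the edge leading to the node).
TREE = {
    "normal": (None, set()),
    "A": ("normal", {"m1"}),
    "B": ("A", {"m2"}),
    "C": ("A", {"m3", "m4"}),
}


def genotype(node, tree):
    """Return all mutations carried by `node`: its own plus those of every ancestor."""
    mutations = set()
    while node is not None:
        parent, acquired = tree[node]
        mutations |= acquired
        node = parent
    return mutations


# Each subpopulation ends up with a distinct set of somatic mutations.
genotypes = {node: genotype(node, TREE) for node in TREE}
```

Here subpopulation C carries m1 (inherited from A) together with m3 and m4, while the normal cell carries no somatic mutations.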

2.2 Cancer onset and evolution

Cancer is a disorder in which some of the body’s cells begin to grow uncontrollably to form

a mass of cells called a tumor. It is used as a term for more than a hundred diseases all

having in common two main characteristics: uncontrolled cell growth and the ability of these


cells to invade other tissues. The growth of a tumor can be thought of as an evolutionary

process. Malignant (life-threatening) tumors usually contain many mutations, which do not

happen all at once.

There are two different models explaining evolutionary processes in tumors: the clonal evolution model and the cancer stem cell model. A detailed explanation of these two models can be found in [4]. In this study we adopt the clonal evolution model of tumor progression, first introduced in [12]. According to this model, a tumor begins with a cell that has sustained a single mutation that offers it a growth advantage over its neighbors. This advantage might manifest in many different ways: for example, the cell can have a higher rate of somatic division than normal cells, or some genetic mechanisms might be damaged, delaying programmed cell death (apoptosis). For instance, a mutation that inactivates a pro-apoptotic gene might delay the cell’s death, giving it a longer lifespan. Somatic divisions of this mutated cell and its descendants lead to the formation of a clonal subpopulation of cells harbouring the advantageous mutation. After some time, one cell in this subpopulation may sustain another such mutation, which can result in the emergence of a new clone, possibly with higher proliferative power. Over the course of time several clonal subpopulations might emerge, resulting in a highly heterogeneous tissue at the time of clinical diagnosis. Figure 2.2 illustrates a tumor tissue consisting of several distinct clonal subpopulations. For simplicity, we assume that they are related by the evolutionary tree given in Figure 2.1. We note that at the time of clinical diagnosis, the only observable clonal subpopulations are the ones corresponding to leaf nodes in the evolutionary tree. Therefore, none of the cells in Figure 2.2 has a genotype harbouring only the red-colored mutation.

The first tumor that develops in the body is called the primary tumor. If it is not detected and removed at an early stage of the disease, tumor cells usually migrate through the blood or lymph and start growing in distant organs. Such a new tumor is known as a metastatic tumor. It is important to mention that a metastatic tumor, although growing in a different part of the body, still has the characteristics of the primary tumor, and the same tumor evolutionary tree can be used to explain the primary tumor and all of its metastatic outgrowths. The same applies to tumor samples from early and later stages of disease progression, since clonal subpopulations at later stages are either identical to, or evolved from, clonal subpopulations from early stages by the acquisition of additional mutations between the timepoints when the samples were obtained.


Figure 2.2: An example of heterogeneous tumor tissue consisting of several different clonal subpopulations of cells used as the input for DNA sequencing.


2.3 Methods for analysing and sequencing DNA and their

applications in tumor heterogeneity studies

Several methods for DNA analysis and sequencing (determining the order of nucleotides

A, C, G, T) have been invented to date. Although there are many different ways to

classify them, in cancer heterogeneity studies they are usually classified into two categories,

based on the number of cells that are used as the input.

The first category comprises methods that use a single cell as the input for DNA analysis or sequencing. Fluorescence in Situ Hybridization (FISH) and Spectral Karyotyping (SKY) have been used for decades to study the DNA of single tumor cells and have shown great variability in DNA content among them. These methods are quite limited in scope and can only assess a small number of probes in single cells of a tumor sample. Ideally, whole-genome sequencing of a sufficient number of single cells would be used to identify tumor heterogeneity. Single cell sequencing (SCS) and single nucleus sequencing (SNS) are two further promising approaches for revealing intra-tumor heterogeneity. However, both of them have several limitations. Namely, as they require an amount of DNA far exceeding the amount present in a single cell, the DNA must be amplified prior to sequencing. This amplification usually introduces biases: some regions of DNA are not amplified to the desired extent, whereas other regions are over-amplified. As a consequence, the number of reads (short subsequences of DNA generated as the output of sequencing) covering under-amplified regions is usually low, and the effect of noise is very high in downstream analysis of the reads covering these regions. Some of these regions are not covered by any reads, hence some mutations remain undetected. This bias also makes it difficult to detect copy number aberrations (aberrations where some part of the genome gets deleted or amplified), since it is hard to distinguish whether a particular DNA segment covered by a large number of reads is amplified in the sequenced cell, or whether this number is just a consequence of over-amplification during the DNA amplification step. Also, in order to get a statistically representative sample of the underlying tumor, which typically consists of millions of cells, a large number of cells from many different sections of the tumor must be isolated and sequenced. Due to these limitations and the prohibitive cost of single cell sequencing of a large number of cells, these methods are still mainly used for academic purposes in analyzing intra-tumor heterogeneity.

The second major category of DNA analysis and sequencing methods uses a bulk of tumor cells as the input to obtain short reads of DNA. High Throughput Sequencing (HTS) methods are currently the most widely used for this task due to their ability to generate a large number of short reads from multiple tumor cells at low cost and with good accuracy (the percentage of correctly identified nucleotides among all reads), which varies among platforms but is typically above 99.9%. Since multiple cells are used as the input, only an average signal of the DNA content of the underlying cells is obtained as the output. Therefore, some post-processing is required in order to obtain the number, evolutionary history, genotypes and proportions of the clonal subpopulations present in the sequenced tumor.

The sequencing coverage of a DNA segment is defined as the average number of reads covering that segment. Sequencing of the whole genome usually results in coverage that is too low to accurately identify the frequencies of somatic mutations. A high degree of confidence in measuring the frequencies of single nucleotide mutations and small indels can be achieved using targeted deep sequencing. In brief, after a mutation is detected at some DNA locus, the region encompassing this locus is PCR amplified from a bulk tumor sample and then sequenced to high depth (>1000× coverage) using HTS. Technological advances now allow many variants to be amplified and sequenced in parallel, speeding up the sequencing process.
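The effect of coverage on the reliability of a frequency estimate can be illustrated with a rough sketch. It rests on an assumption that is ours and not part of the thesis: each read covering a locus independently shows the variant with a fixed probability, so the observed variant read fraction follows a binomial model. The depth of 30 is a hypothetical stand-in for low-coverage whole-genome sequencing.

```python
import math


def read_fraction_standard_error(variant_fraction, coverage):
    """Binomial standard error of the observed fraction of variant reads.

    Assumption (for illustration only): reads are independent draws, each
    showing the variant with probability `variant_fraction`.
    """
    return math.sqrt(variant_fraction * (1.0 - variant_fraction) / coverage)


# Same underlying variant read fraction, two sequencing depths.
low_depth_error = read_fraction_standard_error(0.15, 30)
deep_error = read_fraction_standard_error(0.15, 1000)
```

Under this simplified model the standard error is roughly 0.065 at a depth of 30 and roughly 0.011 at a depth of 1000, which is in line with the motivation for targeted deep sequencing given above.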

2.4 Identifying mutations and their frequencies from HTS

data

In addition to cells from the tumor mass, some healthy tissue is also sequenced in order to obtain the genome of normal cells. All of the short reads obtained from tumor cells are then compared against the normal genome, and somatic mutations are identified.

For each mutation, we define its frequency as the proportion of cells in the tumor sample harbouring that mutation. For somatic mutations at diploid loci of the genome, it is very unlikely that both alleles are mutated, so we assume that all such mutations are heterozygous, i.e. only one allele is mutated. As calculating the frequencies of single nucleotide mutations in regions affected by copy number aberrations is a complicated task, in this study we focus only on heterozygous somatic mutations outside of copy number aberrated regions. Their frequency is calculated as $(2 \cdot q_{var}) / (q_{var} + q_{ref})$, where $q_{var}$ is the number of reads with the variant allele and $q_{ref}$ is the number of reads with the reference allele. The factor of 2 reflects that each cell carrying a heterozygous mutation contributes one variant and one reference allele. Allele-specific copy number measurements, obtained using sequencing or arrays, can be used to exclude mutations from genomic regions that are not diploid heterozygous throughout the population of tumor cells.
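The frequency calculation described above can be written as a short function. This is a minimal sketch: the function name and the read counts in the example are hypothetical, and it applies only to heterozygous mutations at diploid loci, as assumed in the text.

```python
def mutation_frequency(q_var, q_ref):
    """Estimate the proportion of cells harbouring a heterozygous mutation.

    q_var: number of reads with the variant allele
    q_ref: number of reads with the reference allele
    """
    total = q_var + q_ref
    if total == 0:
        raise ValueError("no reads cover this locus")
    return 2.0 * q_var / total


# Hypothetical locus: 150 variant reads and 850 reference reads.
example_frequency = mutation_frequency(150, 850)
```

With these counts, 15% of the reads carry the variant, so the estimated proportion of cells harbouring the mutation is 30%.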

In conclusion, using HTS methods we can obtain the set of mutations present in the sequenced tissue together with the proportion of cells harbouring each mutation. For the purpose of illustration, Figure 2.3 gives a simple example of such output, listing the frequencies of the mutations present in the hypothetical sample given in Figure 2.2, assuming that the frequency estimates are noise-free. Note that this is not a valid assumption for frequencies obtained from real sequencing data, where estimates are usually affected by noise arising from errors and biases in the DNA sequencing step.

Figure 2.3: Frequencies of mutations present in sample shown in Figure 2.2.
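To make the link between clonal proportions and noise-free mutation frequencies concrete, the following sketch computes the frequencies for a small hypothetical tree. The tree, node names and proportions are invented for illustration and are not those of Figures 2.2 and 2.3. It assumes that a mutation is carried by the clone in which it arose and by all of its descendants.

```python
def mutation_frequencies(parent, new_mutations, proportion):
    """Noise-free frequency of each mutation in one sample.

    parent:        node -> parent node (None for the root)
    new_mutations: node -> mutations that first appear in that node
    proportion:    node -> proportion of sampled cells with exactly that genotype
    """
    def subtree_fraction(node):
        # Cells of this clone plus cells of all clones descending from it.
        children = [c for c, p in parent.items() if p == node]
        return proportion[node] + sum(subtree_fraction(c) for c in children)

    return {
        mutation: subtree_fraction(node)
        for node, mutations in new_mutations.items()
        for mutation in mutations
    }


# Hypothetical sample: four genotypes whose proportions sum to 1.
PARENT = {"normal": None, "A": "normal", "B": "A", "C": "A"}
NEW_MUTATIONS = {"normal": [], "A": ["m1"], "B": ["m2"], "C": ["m3"]}
PROPORTION = {"normal": 0.2, "A": 0.1, "B": 0.3, "C": 0.4}

frequencies = mutation_frequencies(PARENT, NEW_MUTATIONS, PROPORTION)
```

In this toy sample, m1 is present in clones A, B and C, so its frequency is about 0.8, whereas m2 and m3 are confined to single clones with frequencies of about 0.3 and 0.4.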

Chapter 3

Model assumptions and problem

description

3.1 Phylogenetic tree model

As explained in the previous chapter, every successful somatic cell division

transforms a single cell into two daughter cells. The genomes of the daughter cells are

copies of the original cell’s genome, with the addition of mutations that occurred during

replication. Furthermore, all somatic cells originated from a single germline cell. Thus a

natural representation of somatic cellular evolution is a rooted full binary tree (see Rec-BTP

[5], 2.1). In such a model only the leaves of the tree are observable; internal nodes represent

unobservable ancestral cells.

In a full binary tree model, not all edges would be identifiable by mutations, either

because no mutations occurred from one cell division to the next, or because distinguishing

mutations were not detected. Thus, a more concise model [6] uses arbitrary rooted trees,

implicitly collapsing unidentifiable edges. Collapsing internal edges has the effect of allowing

nodes to have an arbitrary number of children. Furthermore, collapsing leaf edges implies

that internal nodes are observable. Figure 3.1 shows the arbitrary rooted tree representation of the binary tree given in Figure 2.1.

Using arbitrary rooted trees has two major implications for CITUP. First, it limits

the number of fundamentally equivalent solutions produced by CITUP, allowing for easier

interpretability of the results. Second, since arbitrary rooted trees are more concise, CITUP


Figure 3.1: Arbitrary rooted tree representation of binary tree given in Figure 2.1.


can consider fewer trees while still maintaining the same accuracy.

3.2 Input data

As a consequence of evolutionary processes and the genomic instability present in most tumors, chemotherapy or other types of treatment, epigenetic and many other factors, the proportions of cells harbouring a specific genotype usually differ among timepoints of disease progression and among anatomical sites. Sequencing a set of tumor samples obtained from different timepoints of disease progression, from different anatomical sites, or both, can give us valuable information that can be exploited in solving the problems defined in the following sections of this chapter. Denote by S the set of all samples that have been sequenced and by M the set of heterozygous somatic mutations identified in at least one of the samples from S. For the reasons already discussed in the previous chapter, we only consider mutations from diploid regions of the genome.

Hence, the only input to our algorithm can be given as an |M| × |S| matrix F, where $F_{ij}$ denotes the frequency of mutation i in sample j. A simple example of input data, where |M| = 10 and |S| = 2, is given in Table 3.1.

               Sample 1   Sample 2

Mutation 1       0.32       0.04
Mutation 2       0.23       0.24
Mutation 3       0.80       1.00
Mutation 4       0.06       0.55
Mutation 5       0.19       0.28
Mutation 6       0.30       0.00
Mutation 7       0.20       0.67
Mutation 8       0.77       0.95
Mutation 9       0.19       0.66
Mutation 10      0.20       0.24

Table 3.1: Simple input to the CITUP algorithm consisting of 2 samples. In total, 10 somatic mutations have been identified. Their frequencies are estimated from the alignment of sequencing data and most of them deviate from the true values due to the presence of noise. A mutation might be undetected in, or completely absent from, some, but not all, samples.


3.3 Model assumptions

Similar to [6], in this work we make the infinite sites assumption about tumor evolution.

Somatic mutations are gained at most once per individual and cannot be lost via a subse-

quent reversion mutation. Additionally, we assume the tumor exhibits minimal aneuploidy,

thus mutations cannot be lost by deletion of the encompassing chromosomal region.

Assuming mutations cannot be lost or reverted, a mutation gained in a tumor cell will

be present in all of the descendants of that tumor cell. Trivially, a mutation that occurred

in the single common ancestor of a tumor will be present in 100% of the tumor cells. A

mutation that occurred in a specific lineage of the tumor phylogeny will be present in a

smaller proportion, providing all other lineages have not died out.

Based on the arguments mentioned in Section 2.2, we also impose the same phylogenetic tree on all samples.

3.4 Problem description

Three common problems arise with the interpretation of input data:

• determination of the number and genotypes of the major clonal subpopulations of tumor cells;

This problem consists of inferring the number of different clonal subpopulations present in the sequenced tumor and identifying the set of mutations present in each subpopulation. Note that a clonal subpopulation can be uniquely identified by this set.

• inference of the phylogeny relating the clonal subpopulations;

This problem consists of identifying the tumor evolutionary history tree that best explains the given input data. This tree is also known as the tumor phylogenetic tree. Each mutation has to be placed along one and only one edge of the tree, and this placement corresponds to its first appearance in the tumor evolutionary history. At least one mutation has to be placed along each edge of the tree; otherwise we have an unidentifiable edge that would be collapsed in our arbitrary rooted tree model. Also in this model, the normal cells can be represented by the root node. Each node corresponds to one and only one clonal subpopulation, which is uniquely identified by the set of mutations assigned to the root node combined with the mutations appearing along the edges that form the path from the root to that node.


• estimation of the proportion of each subpopulation over all samples;

This problem consists of assigning a real number $\alpha_{is} \in [0, 1]$ to each subpopulation i for each sample s. This number represents the proportion of tumor cells in sample s harbouring the genotype of clonal subpopulation i. As there is a one-to-one correspondence between the nodes of the tumor phylogenetic tree and the subclonal populations, this is equivalent to assigning the number $\alpha_{is}$ to the node corresponding to subpopulation i in sample s. Although all samples share a common evolutionary tree, and hence the mutation placement is shared among them, the frequencies assigned to the nodes of the tree change among samples.

In the next chapter we give the details of our novel algorithmic approaches for solving the problems introduced in this section. Figures 3.2 and 3.3 show one possible interpretation of the example data given in Table 3.1.


Figure 3.2: One possible interpretation of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 1. For each node except the root, the first number in the node label represents the proportion of cells harbouring the mutations that occurred along the edge connecting that node with its parent. For the root node this number is always 1. The number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.


Figure 3.3: One possible interpretation of the input data given in Table 3.1. This figure shows the frequency assignment for Sample 2. For each node except the root, the first number in the node label represents the proportion of cells harbouring the mutations that occurred along the edge connecting that node with its parent. For the root node this number is always 1. The number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified by this node.

Chapter 4

Methods

4.1 Combinatorial Formulation

Let $\mathcal{T}$ represent the space of all rooted trees and let $T \in \mathcal{T}$ be a hypothetical phylogenetic tree relating N = |V(T)| genetically distinct subpopulations. Let D(v) be the set of descendants of node v. As already explained in previous chapters, in our formulation genotypes are represented by nodes (also referred to as clonal subpopulations or subclones in the text), while subtrees rooted at a specific node are named clones. A mutation occurring at a node in the tree is inherited by its descendants. Thus an assignment of the set of mutations to their node of origin is sufficient to describe the genotypes of all nodes.

Define the clone proportion $\beta_{vs}$ as the proportion of the clone rooted at v in sample s. Similarly, define the subclonal proportion $\alpha_{vs}$ as the proportion of genotype v in sample s. Subclonal proportions add up to 1 in each sample (equation 4.1). Furthermore, clone proportions are related to subclonal proportions via the sum rule (equation 4.2).

\[ \forall s \in S : \sum_{v \in V} \alpha_{vs} = 1 \tag{4.1} \]

\[ \beta_{vs} = \alpha_{vs} + \sum_{u \in D(v)} \alpha_{us} \tag{4.2} \]

The expected value of the frequency of a mutation is equal to the clone proportion of the node to which the mutation was assigned. Thus, the squared error incurred by assigning a single mutation i to a node v in sample s is given by equation 4.3.

\[ e_{ivs} = (F_{is} - \beta_{vs})^2 \tag{4.3} \]
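To make the sum rule and the per-mutation error concrete, here is a minimal Python sketch for a single sample. The tree, the proportions and the function names are hypothetical and chosen only for illustration; the sketch is not taken from the CITUP implementation.

```python
def clone_proportions(children, alpha):
    """Return beta[v] = alpha[v] + sum of alpha over all descendants of v (equation 4.2).

    children: dict mapping a node to the list of its children
    alpha: dict mapping every node to its subclonal proportion in one sample
    """
    beta = {}

    def visit(v):
        # The clone rooted at v is v itself plus the clones rooted at its children.
        beta[v] = alpha[v] + sum(visit(c) for c in children.get(v, []))
        return beta[v]

    all_children = {c for cs in children.values() for c in cs}
    for root in set(alpha) - all_children:
        visit(root)
    return beta


def squared_error(freq, beta_v):
    """Squared error of placing a mutation with observed frequency freq at a node (equation 4.3)."""
    return (freq - beta_v) ** 2


# Hypothetical 4-node tree: 0 is the root, 1 and 2 are its children, 3 is a child of 1.
children = {0: [1, 2], 1: [3]}
alpha = {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.1}  # sums to 1 (equation 4.1)
beta = clone_proportions(children, alpha)
```

In this toy tree the clone rooted at node 1 has proportion 0.3 + 0.1 = 0.4, and the clone rooted at the root covers the whole sample.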


Let ∆ be an |M| × N binary matrix such that $\delta_{iv} = 1$ iff mutation i first appeared at node v, otherwise $\delta_{iv} = 0$. We also introduce the matrix A of dimensions N × |S|, where $A_{is} = \alpha_{is}$. Given $T \in \mathcal{T}$, ∆ and A, the total squared error is given by equation 4.4.

\[ E(T, \Delta, A) = \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta_{iv} e_{ivs} \tag{4.4} \]

Minimization of the squared error alone may result in overfitting, assigning each mutation to a unique node in a very large tree. Instead, we minimize the Bayesian Information Criterion (BIC) under the assumption that the noise is normally distributed with known variance $\sigma^2$. The negative log likelihood can be expressed (up to an additive constant) as given by equation 4.5.

\[ L(F \mid T, \Delta, A) = \frac{E(T, \Delta, A)}{2\sigma^2} \tag{4.5} \]

Finally, BIC can be expressed as given by equation 4.6.

\[ \mathrm{BIC}(T, \Delta, A) = 2 \cdot L(F \mid T, \Delta, A) + |S| \cdot (N - 1) \cdot \log |M| \tag{4.6} \]
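The following Python sketch evaluates equations 4.4 to 4.6 for a given assignment. It is only an illustration: the function names and the toy numbers are hypothetical, the mutation-to-node assignment is stored as a list rather than as the binary matrix ∆, and the natural logarithm is assumed in the penalty term.

```python
import math


def total_squared_error(F, beta, assignment):
    """Equation 4.4 with delta encoded as assignment[i] = node of mutation i.

    F[i][s]: observed frequency of mutation i in sample s
    beta[v][s]: clone proportion of node v in sample s
    """
    return sum(
        (F[i][s] - beta[assignment[i]][s]) ** 2
        for i in range(len(F))
        for s in range(len(F[i]))
    )


def bic(error, sigma2, n_samples, n_nodes, n_mutations):
    """Equations 4.5 and 4.6 (natural logarithm assumed)."""
    fit_term = error / (2.0 * sigma2)  # equation 4.5
    return 2.0 * fit_term + n_samples * (n_nodes - 1) * math.log(n_mutations)


# Hypothetical toy instance: 2 mutations, 2 samples, 2 nodes.
F = [[1.0, 1.0], [0.5, 0.4]]
beta = [[1.0, 1.0], [0.5, 0.5]]
assignment = [0, 1]
error = total_squared_error(F, beta, assignment)
score = bic(error, sigma2=0.01, n_samples=2, n_nodes=2, n_mutations=2)
```

The penalty term grows with the number of nodes and samples, so a larger tree is preferred only when it reduces the squared error enough to pay for its extra free proportions.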

We propose to identify the optimal genotypes $\Delta_{opt}$, the subclone proportions $A_{opt}$ and the phylogenetic relationship $T_{opt}$ as given by equation 4.7.

\[ \Delta_{opt}, A_{opt}, T_{opt} = \operatorname*{argmin}_{T, \Delta, A} \mathrm{BIC}(T, \Delta, A) \tag{4.7} \]

We refer to the above optimization problem as the mutation phylogeny problem. We pro-

pose two methods to solve this problem, namely “CITUP qip” and “CITUP iter”. CITUP qip

uses an exact Quadratic Integer Programming formulation; while CITUP iter implements an

iterative heuristic. A detailed description of these implementations for solving the mutation

phylogeny problem is given in the following sections.

4.2 Method Outline

Given a fixed tree topology, define the mutation assignment problem as the problem of iden-

tifying A and ∆ that minimize mutation frequency error (equation 4.4). CITUP solves the

mutation phylogeny problem by iterating through all tree topologies up to a fixed number

of nodes Nmax, and solving the mutation assignment problem for each tree:

1. for each $N \in \{1, \dots, N_{max}\}$ and each $T \in \mathcal{T}_N$:

(a) identify A and ∆ that minimize equation 4.4 (mutation assignment problem)

(b) calculate the BIC for T using equation 4.6

2. select the T, A and ∆ that minimize the BIC
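The outline above amounts to an exhaustive search with a pluggable solver. A minimal Python sketch of that control flow follows; `solve_assignment` and `bic` are hypothetical callables standing in for the mutation assignment solver and for equation 4.6, and the toy values exist only to exercise the loop.

```python
def select_model(trees_by_size, solve_assignment, bic):
    """Return (score, tree, A, delta) with the lowest BIC over all candidate trees.

    trees_by_size: dict mapping a node count N to an iterable of tree topologies
    solve_assignment: hypothetical callable, tree -> (A, delta, squared_error)
    bic: hypothetical callable, (squared_error, n_nodes) -> float
    """
    best = None
    for n_nodes in sorted(trees_by_size):
        for tree in trees_by_size[n_nodes]:
            A, delta, error = solve_assignment(tree)  # step 1(a)
            score = bic(error, n_nodes)               # step 1(b)
            if best is None or score < best[0]:
                best = (score, tree, A, delta)
    return best                                       # step 2


# Toy stand-ins, used only to show the control flow.
toy_errors = {"t1": 5.0, "t2a": 1.0, "t2b": 2.0}
best = select_model(
    {1: ["t1"], 2: ["t2a", "t2b"]},
    solve_assignment=lambda tree: (None, None, toy_errors[tree]),
    bic=lambda error, n_nodes: error + 1.5 * (n_nodes - 1),
)
```

Keeping the solver as a parameter mirrors the text: the same outer loop can be driven by either the QIP-based or the iterative solver.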

We propose two methods for solving the mutation assignment problem: a Quadratic Integer

Programming based approach (CITUP qip), and an iterative heuristic approach (CITUP iter)

as explained below.

4.3 Quadratic Integer Programming (QIP) method

QIP based approaches guarantee an optimal solution but limit the feasible problem size. To

ensure a reasonable running time for the QIP approach on larger (>20 mutations) problem

sizes, we first cluster the mutations into N sets by their mutation frequency, where N is

the number of nodes in the current tree topology. We then limit the solution space for ∆

by adding the constraint that all mutations in a cluster must be assigned, en masse, to a

single node. We use the multivariate k-means clustering implemented in the Python scikit-learn package to cluster mutations.

Let c : M → {1, ..., N} be a mapping from mutations to clusters. Let ∆′ be an N × N binary matrix such that $\delta'_{c(i)v} = 1$ iff mutation i, assigned to cluster c(i), originated at node v, otherwise $\delta'_{c(i)v} = 0$. The total squared error given by equation 4.4 can be rewritten as equation 4.8.

\[ E(T, \Delta, A) = \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta'_{c(i)v} e_{ivs} \tag{4.8} \]

Requiring that each cluster be assigned to exactly one node adds the constraint given by equation 4.9.

\[ \forall n \in \{1, \dots, N\} : \sum_{v \in V} \delta'_{nv} = 1 \tag{4.9} \]

Additionally, we require that all non-root nodes have at least one cluster of mutations assigned to them, resulting in the constraint given by equation 4.10,

\[ \forall v \in V \setminus \{r\} : \sum_{n \in \{1, \dots, N\}} \delta'_{nv} \ge 1 \tag{4.10} \]

where r denotes the root node.


The QIP approach minimizes the squared error objective (equation 4.8), subject to the

subclonal proportion constraints (equation 4.1), the clone proportion constraints (equation

4.2), and the cluster assignment constraints (equations 4.9 and 4.10).

4.4 QIP optimizations

The objective given by equation 4.8 is not well suited to QIP solvers. Below, we introduce auxiliary variables and constraints to convert our objective function to a form that is easier to solve. For mutation i, node v and sample s, introduce a variable $x_{ivs}$ subject to the following constraints.

\[ x_{ivs} \ge F_{is} - \beta_{vs} \tag{4.11} \]

\[ x_{ivs} \ge \beta_{vs} - F_{is} \tag{4.12} \]

Similarly, introduce a variable $y_{ivs}$ subject to the following constraints:

\[ y_{ivs} \ge \delta'_{c(i)v} - 1 + x_{ivs} \tag{4.13} \]

\[ y_{ivs} \ge 0 \tag{4.14} \]

The modified QIP minimizes the objective given by equation 4.15, subject to the additional constraints for $x_{ivs}$ and $y_{ivs}$.

\[ \sum_{i \in M} \sum_{v \in V} \sum_{s \in S} y_{ivs}^2 \tag{4.15} \]

It is easy to see that, whenever $\delta'_{c(i)v} = 1$, $y_{ivs}$ will be set to $x_{ivs}$; otherwise, it will be set to 0. Hence, minimizing the objective given in equation 4.15 is equivalent to minimizing the objective given in equation 4.8. It can also be easily verified that the Hessian of equation 4.15 is positive definite, implying its convexity.
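The equivalence claimed above can be checked with a short case analysis. The sketch below assumes that observed frequencies and clone proportions both lie in [0, 1], so that their absolute difference never exceeds 1; this is an assumption about the input rather than something enforced by the constraints.

```latex
\begin{align*}
&\text{By (4.11) and (4.12), } x_{ivs} \ge |F_{is} - \beta_{vs}|,
  \text{ and } x_{ivs} = |F_{is} - \beta_{vs}| \le 1 \text{ is feasible.} \\[4pt]
&\delta'_{c(i)v} = 1: \quad y_{ivs} \ge x_{ivs} \ge 0
  \;\Longrightarrow\; \min y_{ivs}^2 = |F_{is} - \beta_{vs}|^2 = e_{ivs}, \\[4pt]
&\delta'_{c(i)v} = 0: \quad y_{ivs} \ge x_{ivs} - 1 \text{ with } x_{ivs} - 1 \le 0
  \;\Longrightarrow\; \min y_{ivs}^2 = 0 .
\end{align*}
```

Summing the two cases over all i, v and s recovers the terms $\delta'_{c(i)v} e_{ivs}$ of equation 4.8.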

4.5 Heuristic Iterative Method

We also propose a heuristic iterative method for solving the mutation assignment problem.

The iterative heuristic is significantly faster than the QIP with only a small degradation in

performance observed in our evaluations.

In brief, the iterative heuristic solves two subproblems iteratively until convergence.

Problem 1: given a fixed ∆ calculate the (necessarily unique) A that minimizes equation


4.4. Problem 2: with A fixed to the value calculated in the previous step, calculate the ∆

that minimizes equation 4.4. Each step is guaranteed to not increase the objective given by

equation 4.4, thus the algorithm is guaranteed to converge to a local optimum.

Problem 1 is a convex quadratic programming problem and can be solved efficiently with

existing convex optimization software. The objective given by equation 4.4 is solved subject

to constraints given by equations 4.1 and 4.2. Problem 2 can be solved by independently

assigning each mutation to the node v that minimizes equation 4.3.
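Problem 2 reduces to an independent nearest-node search per mutation. The Python sketch below illustrates it with hypothetical clone proportions; the per-sample error of equation 4.3 is summed over samples, and the requirement that every non-root node receives at least one mutation is not enforced here.

```python
def assign_mutations(F, beta):
    """For each mutation pick the node whose clone proportions are closest.

    F[i][s]: observed frequency of mutation i in sample s
    beta[v][s]: clone proportion of node v in sample s (held fixed in this step)
    Returns a list whose i-th entry is the node chosen for mutation i.
    """
    assignment = []
    for freqs in F:
        # Cost of placing this mutation at each node: equation 4.3 summed over samples.
        costs = [sum((f - b) ** 2 for f, b in zip(freqs, row)) for row in beta]
        assignment.append(min(range(len(beta)), key=costs.__getitem__))
    return assignment


# Mutations 3, 7 and 10 of Table 3.1, with hypothetical clone proportions for 3 nodes.
F = [[0.80, 1.00], [0.20, 0.67], [0.20, 0.24]]
beta = [[1.00, 1.00], [0.30, 0.60], [0.20, 0.25]]
assignment = assign_mutations(F, beta)
```

Because each mutation is handled independently, this step costs on the order of |M| · N · |S| operations, which is why the heuristic scales to individual mutations without prior clustering.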

The iterative heuristic is not guaranteed to identify a globally optimal solution, and

as such, results depend heavily on initialization. We mitigate this problem using multiple

restarts with random initializations of ∆. A random ∆ is generated by independently assigning each mutation to a node chosen uniformly at random from all nodes in the tree. We perform 1000 restarts with different random seeds, and select the solution that minimizes equation 4.4.

4.6 Enumerating rooted trees

We use the Beyer-Hedetniemi algorithm [2] to enumerate rooted tree topologies up to the user-defined number of nodes ($N_{max}$). The numbers of non-isomorphic rooted trees with N = 1, ..., 10 nodes are as follows: 1, 1, 2, 4, 9, 20, 48, 115, 286, 719.
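For readers unfamiliar with this enumeration, the following Python sketch generates rooted trees as canonical level sequences using the successor rule commonly attributed to Beyer and Hedetniemi. It is an illustrative reimplementation, not the code used in CITUP, and the function name is hypothetical.

```python
def rooted_trees(n):
    """Yield one canonical level sequence per non-isomorphic rooted tree with n nodes.

    A level sequence lists the depth of each node in preorder, with the root at level 1.
    """
    levels = list(range(1, n + 1))  # start from the path, the deepest tree
    while True:
        yield tuple(levels)
        # p: last position whose level exceeds 2
        p = n - 1
        while p >= 0 and levels[p] <= 2:
            p -= 1
        if p < 0:
            return  # every non-root node is a child of the root: the star, the last tree
        # q: parent of p, the last earlier position exactly one level above p
        q = p - 1
        while levels[q] != levels[p] - 1:
            q -= 1
        # Replace the tail by repeated copies of the segment that starts at q.
        shift = p - q
        for i in range(p, n):
            levels[i] = levels[i - shift]


counts = [sum(1 for _ in rooted_trees(n)) for n in range(1, 11)]
```

Each level sequence can be turned into a parent array by linking every node to the most recent earlier node that is one level above it.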

4.7 Model selection

In practice, the variance $\sigma^2$ required to calculate equation 4.5 is often unknown and must be estimated from the data. We estimate $\sigma^2$ by clustering the mutation frequencies using a k-component Gaussian Mixture Model (GMM) with spherical covariance matrix, where k is selected to minimize the BIC of the GMM. We then use the estimated variance of the GMM as $\sigma^2$.

We remark that this model selection procedure can only distinguish trees with the same

number of nodes if they have different objective function scores. In practice, we have found

that two distinct trees with an equal number of nodes can have identical objective scores.

Following other tools developed for this problem [19, 6], in such cases we report all solutions

with the best score.

Chapter 5

Results

In this chapter we present the performance of our algorithm on simulated and real datasets.

For the simulations and CLL datasets we have used Nmax = 7 and for the AML datasets

we have set Nmax = 8.

5.1 Datasets

To evaluate our method, we use both simulated and real datasets. For simulations, we

experiment with a variety of trees with differing number of subclones and model parameters.

We report the performances of both CITUP qip and CITUP iter, using several measures

that are explained in the following section. We compare the performance of CITUP to

the performances of TrAp [19] and PhyloSub [6], which can handle multi-sample datasets.

Additionally, we report a separate comparison between CITUP and Rec-BTP [5] on a smaller

set of single-sample simulations. We limit our comparison to these tools since our model

does not support the type of input required by [18] and [13]. While the method of [16] also

works with SNV data, their model is not directly comparable to ours due to incompatible

assumptions and goals.

We also evaluate the utility of our method on two real datasets. The first dataset is

taken from a Chronic Lymphocytic Leukemia (CLL) study by Schuh et al. [17]. This

dataset contains targeted deep sequencing measurements of 3 CLL patients sampled at

5 time points. The second dataset consists of a study involving Acute Myeloid Leukemia

(AML) patients by Ding et al. [3]. This dataset features a large number of somatic indels and

single nucleotide variants (SNVs), however only 3 sample points (designated as “normal”,


“tumor” and “relapse”) are available per patient. Since the simulations show the QIP and

iterative versions to have similar performance, we only report the results of CITUP qip on

the real datasets.

5.2 Evaluation criteria

We evaluate the performance of CITUP on the simulation sets using several measures. To

compute these measures, we first obtain a matching between the predicted tree and the true

tree as explained below.

Let T = (V,E) denote the simulated tree, which we are trying to find, and let T ′ =

(V ′, E′) denote the tree predicted by CITUP. We first check whether T and T ′ have identical

topologies as a measure of success. In general, however, the topology of T may be different

from T ′. In such cases, computing the correspondence of the nodes in each tree is not

trivial. To accomplish this, we first create a complete bipartite graph G, where one partition,

denoted by A, consists of the nodes of T and the other partition, B consists of the nodes

of T′. If |V| ≠ |V′|, then we add dummy nodes to the partition with fewer nodes until both partitions have exactly max(|V|, |V′|) nodes.

We denote by $A_i$ the set of mutations assigned to node i in T. Similarly, we define $B_j$ to be the set of mutations assigned to node j in T′. If i (or j) is a dummy node, then $A_i = \emptyset$ (resp. $B_j = \emptyset$). For each edge (i, j) in G, we calculate its weight as the number of mutations that are assigned to exactly one of i or j. We denote this weight by c(i, j). We then search for a matching f : A → B that minimizes:

\[ \sum_{i \in A} c(i, f(i)) \tag{5.1} \]

This problem is known as “Minimum Bipartite Matching”, for which efficient polynomial time algorithms exist [8]. Once we obtain a one-to-one matching between the nodes of the two trees, we calculate the following scores:

1. Correct tree proportion: (M0) This is the proportion of correctly identified tree topolo-

gies to the total number of simulations in each experiment.

2. Clone proportion error: (M1) For this measure, we compute:

\[ \frac{\sum_{u \in T^{*}} \left| \beta^{T^{*}}_{u} - \beta^{T^{**}}_{g(u)} \right|}{|V^{*}|} \tag{5.2} \]

Above, T* denotes the smaller of the trees T and T′, while T** denotes the larger one. V* is defined to be the set of nodes in T*, and $\beta^{X}_{n}$ represents the frequency of clone n in tree X. If T* is the true tree, we define g ≡ f. Otherwise we set g ≡ $f^{-1}$.

3. Misplaced mutation proportion: (M2) Suppose a mutation m is assigned to a node u

in the true tree T . If it is assigned to f(u) in T ′, we say that m is correctly placed,

otherwise we say it is misplaced. M2 is set to the number of misplaced mutations

divided by the total number of mutations in the dataset. This measure essentially

evaluates the mutation clustering accuracy.

4. Phylogenetic accuracy: (M3) For this measure, we count the number of phylogenetic

relationships that are preserved. We use two types of mutually exclusive relationships:

ancestor/descendant and non-ancestor/descendant. For example, if a mutation A

emerges at a clone that is an ancestor of another clone where mutation B emerges,

we say that A is an ancestor of B (or alternatively B is a descendant of A). If this

relationship is reversed in the predictions, it is counted as non-preserved. If two

mutations do not have an ancestor/descendant relationship, they are marked as a non-

ancestor/descendant pair. If such a pair is predicted to have an ancestor/descendant

relationship, this pair is also counted as non-preserved.
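The node matching and measure M2 described above can be sketched in a few lines of Python. For brevity this sketch finds the minimum-cost matching by brute force over all permutations, which is adequate for trees of up to about 8 nodes but is not the polynomial time algorithm cited in the text; the function names and the example mutation sets are hypothetical.

```python
from itertools import permutations


def match_nodes(true_sets, pred_sets):
    """Brute-force minimum-cost one-to-one matching between the nodes of two trees.

    true_sets[i] / pred_sets[j]: set of mutations assigned to node i of T / node j of T'.
    Returns (matching, cost), where matching[i] is the index matched to node i.
    """
    size = max(len(true_sets), len(pred_sets))
    # Pad the smaller side with dummy nodes that carry no mutations.
    A = list(true_sets) + [set()] * (size - len(true_sets))
    B = list(pred_sets) + [set()] * (size - len(pred_sets))
    # c(i, j): number of mutations assigned to exactly one of i and j.
    cost = [[len(a ^ b) for b in B] for a in A]
    best = min(
        permutations(range(size)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(size)),
    )
    return list(best), sum(cost[i][best[i]] for i in range(size))


def misplaced_proportion(true_sets, pred_sets, matching):
    """Measure M2: fraction of mutations not placed at the node matched to their true node."""
    B = list(pred_sets) + [set()] * (len(matching) - len(pred_sets))
    n_mutations = sum(len(a) for a in true_sets)
    correct = sum(len(a & B[matching[i]]) for i, a in enumerate(true_sets))
    return (n_mutations - correct) / n_mutations


# Hypothetical example: mutation 3 is placed at the wrong node in the prediction.
true_sets = [set(), {1, 2, 3}, {4, 5}]
pred_sets = [set(), {1, 2}, {3, 4, 5}]
matching, cost = match_nodes(true_sets, pred_sets)
m2 = misplaced_proportion(true_sets, pred_sets, matching)
```

In this example the identity matching is optimal with total cost 2, and one of the five mutations is misplaced.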

5.3 Evaluation on simulated datasets

We evaluate the performances of CITUP qip and CITUP iter compared to TrAp and Phy-

loSub using a large set of simulations. For these simulations, we generate random tree

topologies T with 3 to 6 subclones and 3 to 7 samples. The frequencies of subclones are simulated using a Dirichlet distribution with parameter α, ranging from 0.1 to 10.0. For each simulation, we generate a set of 500 mutations that are uniformly distributed among the subclones. The frequencies of these mutations are then perturbed with additive Gaussian noise with deviation between 0.02 and 0.1.
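A single-sample version of this simulation procedure might look like the Python sketch below. It is a hedged reconstruction from the description above, not the simulator used for the reported results: the function name, the parent-array encoding of the tree, the symmetric Dirichlet parameter and the choice to allow mutations at every node (including the root) are all assumptions.

```python
import random


def simulate_sample(parent, n_mutations, alpha, noise_sd, rng):
    """Simulate noisy mutation frequencies for one sample on a fixed tree.

    parent[v]: parent of node v, with parent[0] = None for the root; parents precede children.
    alpha: symmetric Dirichlet parameter controlling imbalance among subclones.
    noise_sd: standard deviation of the additive Gaussian noise.
    """
    n_nodes = len(parent)
    # A Dirichlet sample is a vector of independent Gamma draws, normalised to sum to 1.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_nodes)]
    total = sum(gammas)
    props = [g / total for g in gammas]
    # Clone proportions: add each node's subtree total to its parent, children first.
    beta = props[:]
    for v in range(n_nodes - 1, 0, -1):
        beta[parent[v]] += beta[v]
    # Assign mutations to nodes uniformly at random and perturb their expected frequency.
    nodes = [rng.randrange(n_nodes) for _ in range(n_mutations)]
    freqs = [beta[v] + rng.gauss(0.0, noise_sd) for v in nodes]
    return props, beta, nodes, freqs


rng = random.Random(0)
# Hypothetical 5-node tree given as a parent array.
props, beta, nodes, freqs = simulate_sample([None, 0, 0, 1, 1], 500, 1.0, 0.05, rng)
```

Note that the noisy frequencies are not clipped, so values slightly below 0 or above 1 can occur; whether the original simulations clipped them is not stated in the text.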

We compare each true tree T with the trees obtained by the tools based on the four eval-

uation criteria introduced above. For CITUP qip, we first cluster the mutations as described

in Methods. As the current version of TrAp does not have a module for clustering and we

were unable to run it on the individual mutations, we use our own clustering method for


TrAp as well. Since our model selection procedure is unlikely to work with TrAp’s heuris-

tic model, we had to provide TrAp with the clustering of the correct size. We emphasize

that despite this significant advantage, TrAp performs worse than CITUP with respect to

most of our criteria (see below). For PhyloSub and CITUP iter we use the individual set of

mutations.

Since all four methods can output multiple solutions, we devise the following protocol

in order to compute the evaluation measures. For CITUP qip, CITUP iter and TrAp, we

randomly choose up to 3 trees out of all (top scoring) solutions reported by the tool. If there

are only 1 or 2 reported solutions, we pick only these. Since PhyloSub reports 3 solutions

by default, we simply pick these solutions. For each tool, if one of the chosen solutions has

the correct tree topology, we use that solution to calculate all the measures for that tool.

Otherwise, we select one of them randomly.

Figure 5.1 summarises the results of these simulations. Note that for each selection of

parameters, we repeat the experiment 10 times.

The first column of the figure demonstrates the effect of the number of subclones/nodes on all four criteria. The number of nodes varies between 3 and 6; in all cases the number of samples is set to 4, the Gaussian noise deviation is set to 0.05 and the frequency imbalance, as determined by the parameter α of the Dirichlet distribution, is set to 1.0.

The second column demonstrates the effect of the number of samples on the four criteria. The number of samples now varies between 3 and 7; in all cases the number of nodes is set to 5, and again, the Gaussian noise deviation is set to 0.05 and α is set to 1.0. We note that we were unable to run PhyloSub for 7 samples due to limitations of this software. Hence, in this case the comparison is only between the other methods.

The third column depicts the effect of increasing noise (primarily due to sequence coverage variation). The Gaussian noise deviation now varies between 0.02 and 0.1, for 4 samples, 5 subclones and α = 1.0.

The fourth column depicts the effect of imbalance in subclones, where α varies between 0.1 and 10.0, again for 4 samples, 5 subclones and a noise deviation of 0.05.

From Figure 5.1, we see that both CITUP qip and CITUP iter find the correct tree

topology more often than TrAp, despite the fact that TrAp is already provided with the

correct number of clusters. In other words, while the other tools have to simultaneously

identify the right tree size and topology, TrAp only has to find the right topology of the

given tree size. Compared to CITUP and TrAp, PhyloSub performs poorly with respect to


[Figure 5.1: grid of box plots. Rows (y-axes): correct tree proportion, clone proportion error, misplaced mutation proportion, phylogenetic accuracy. Columns (x-axes): number of nodes, number of samples, mut. frequency noise, sample dirichlet alpha. Legend: CITUP_qip, CITUP_iter, TrAp, PhyloSub.]

Figure 5.1: Simulation results for TrAp, PhyloSub and CITUP (QIP and iterative procedures) under the four evaluation criteria. The rows depict measures M0 to M3. The first column investigates the effect of the number of subclones/nodes in the dataset, the second investigates the effect of the number of samples, the third investigates the effect of noise added to the mutation frequencies and the fourth investigates the effect of non-uniformity among subclone frequencies. The figure is drawn using the boxplot function in Python’s matplotlib library: the line within each box is the mean and the box boundaries mark the 25% and 75% values. The extreme outliers are depicted with + symbols. Note that we were unable to run PhyloSub on 7 samples, so the corresponding bars are absent from this column.


this measure.

Similarly, CITUP performs typically better than the other tools in terms of phylogenetic

accuracy with a score of 60% or more in most cases. This suggests that even when the correct

tree is not found, the majority of phylogenetic relationships are preserved.

In estimating clonal frequencies, we see that CITUP outperforms both TrAp and Phy-

loSub, while TrAp performs best with respect to the ratio of misplaced mutations. We

remark that this is likely due to TrAp’s unfair advantage of being given the clustering with

the correct number of clusters. Note that this measure is evaluated by a one-to-one match-

ing between the nodes of the predicted and the true tree using only the mutations assigned

to (but not inherited by) the node. Hence, even when the predicted topology is not identical

to the correct tree, this measure can have a perfect score as long as the initial clustering

groups the mutations correctly. This, by definition, can only happen when the clustering

is performed with the correct number of clusters. Indeed, Figure 5.1 shows that whenever

CITUP identifies the correct tree topology (hence, the correct tree size) 10 out of 10 times,

it performs on par with TrAp. This suggests that TrAp’s apparent superiority to CITUP

in this measure is simply due to the high accuracy of our clustering method.

A sensitivity analysis of CITUP iter with respect to starting points, on the same set of simulated data, is given in Figure 5.2. Overall, we see that CITUP qip and CITUP iter

perform similarly under most conditions, although CITUP qip seems to be slightly more

resilient to extreme values of simulation parameters (e.g. sample Dirichlet alpha and mu-

tational frequency noise). Hence, we have chosen to proceed with CITUP qip for the real

datasets.

5.4 Comparison with Rec-BTP

We have also performed a separate comparison between CITUP qip and Rec-BTP. Since

Rec-BTP does not support multi-sample datasets, for these experiments we have simulated

single-sample datasets with 500 mutations for 4, 5 and 6 node trees. In each case, we

generate 10 simulations adding up to 30 datasets in total. The topologies of the trees were

chosen randomly as before. Since the current version of Rec-BTP does not report which

mutations are assigned to each subclone, we were restricted to a very limited evaluation of

the performance of this tool. Briefly, we compare the results of the two methods based on

i) the number of subclones predicted and ii) an RMSD measure of the predicted subclonal


[Figure 5.2: two rows of plots. Top row (y-axis): objective value error. Bottom row (y-axis): proportion optimal (is_minimum). Columns (x-axes): number of nodes, number of samples, mut. frequency noise, sample dirichlet alpha.]

Figure 5.2: Sensitivity analysis of CITUP iter with respect to starting points. Top: Distribution of errors in the objective for different restarts of the algorithm, where error is defined as the difference between the local minimum objective value reached by CITUP iter and the global minimum reached by CITUP qip. Bottom: The proportion of iterative restarts that reach the global minimum within $10^{-9}$.


Patient   No. of      No. of      No. of      Wall-clock
          mutations   subclones   solutions   time (min)

CLL003    19          5           1           1.64
CLL006     9          5           2           0.32
CLL077    15          5           1           0.84

Table 5.1: Summary of CITUP’s results on the CLL dataset. The second column refers to the number of mutations as reported by [17]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

frequencies similar to the one employed in [5]. In terms of the first measure, CITUP was

able to find the correct number of subclones in 50% of the simulations (15 out of 30). In

contrast, Rec-BTP only identified the correct number of subclones in 23.3% of the cases

(7 out of 30). CITUP also outperformed Rec-BTP with respect to the RMSD measure:

The average RMSD values for CITUP and Rec-BTP in 30 simulations were 0.02 and 0.05

respectively.

5.5 Results on Chronic Lymphocytic Leukemia datasets

We evaluate the performance of CITUP qip on the Chronic Lymphocytic Leukemia (CLL)

dataset of [17]. This dataset consists of single nucleotide and small indel mutations as

inferred from Whole-Genome Sequencing (WGS) data from 3 CLL patients. Each patient is

sampled at five time points while receiving a variety of treatments. The authors also perform

targeted deep sequencing for a limited number of mutations found through WGS. Although a high number of somatic mutations are detected for each patient, only the frequencies of coding mutations are made available by Schuh et al. Hence, we are only able to use the coding mutations as input to our algorithm. Since the number of mutations is small for these datasets, we manually removed mutations that are not heterozygous as reported by [17].

Table 5.1 gives a summary of CITUP’s performance on all three patients.

The trees (Figures 5.3, 5.4 and 5.5) and the clonal frequencies reported by CITUP for

these patients match the results reported in [17] very closely: the mean absolute deviations

are 0.0088, 0.0016 and 0.0048 for patients CLL003, CLL006 and CLL077 respectively. Note

that while CITUP does not assign mutations to the root nodes in CLL003 and CLL077, the


root node in CLL006 is assigned 5 mutations. This is in agreement with the observation in

[17] that the normal contamination in this patient is insignificant and suggests that CITUP

is able to automatically handle the presence or absence of healthy cell contamination.
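As a rough illustration of the comparison above, the mean absolute deviation between two sets of clonal frequencies can be computed as follows. This is a minimal sketch and not CITUP's own code: the function name, the matrix layout (one row per time point, one column per subclone) and the toy numbers are assumptions made for illustration, and the subclones are assumed to be already matched between the two sources.

```python
def mean_absolute_deviation(predicted, reported):
    """Average |predicted - reported| over all (time point, subclone) entries."""
    if len(predicted) != len(reported):
        raise ValueError("both matrices need the same number of time points")
    total = 0.0
    count = 0
    for p_row, r_row in zip(predicted, reported):
        if len(p_row) != len(r_row):
            raise ValueError("both matrices need the same number of subclones")
        for p, r in zip(p_row, r_row):
            total += abs(p - r)
            count += 1
    if count == 0:
        raise ValueError("empty frequency matrices")
    return total / count


# Toy frequencies (hypothetical; not taken from the CLL data).
predicted = [[0.10, 0.60, 0.30], [0.05, 0.25, 0.70]]
reported = [[0.11, 0.59, 0.30], [0.05, 0.27, 0.68]]
mad = mean_absolute_deviation(predicted, reported)
```

A small value of `mad` indicates that the two frequency estimates agree closely across all time points.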

Although CITUP finds two distinct topologies for patient CLL006 (a chain topology and

a branching topology), the clonal frequencies remain the same in both cases. We note that the

number of deep sequencing mutations is quite small for this dataset, possibly resulting in an

ambiguity with respect to the tree topology. To see if additional mutations can help identify

the true tree, we also ran CITUP on the WGS predictions for this dataset, containing 16

mutations. In this case, CITUP reported a single solution with a chain topology (data not

shown). Thus, we conclude that the true solution is likely to be the one reported in Figure 5.4,

which also matches the tree topology predicted by [17].

Figure 5.3 suggests a switch between subclones ‘d’ and ‘e’ (referred to as subclones 4

and 2 in [17]) around time-point 3. This is also in agreement with the disease progression

as reported by Schuh et al., where the third time-point is classified as “complete response

+ minimal residual disease”. At the same time, subclone ‘d’ starts gaining

dominance. The fourth and fifth time-points (as well as the first two time-points) are

designated as “progressive disease”, suggesting that subclone ‘d’ replaces ‘e’ as the driver

subclone while the tumor relapses. In contrast, Figures 5.4 and 5.5 imply a more stable

subclonal composition over the time points. We note that the survival times of these patients

are also longer than that of CLL003 (6+ and 9 versus 3 years), which may be linked to this slower

pace of the clonal dynamics.

5.6 Results on Acute Myeloid Leukemia datasets

Next, we evaluate CITUP qip on an Acute Myeloid Leukemia (AML) dataset [3]. This

dataset contains sequencing data from primary tumor and relapse samples after chemother-

apy treatment, in addition to matched normal tissue for each patient. Although the normal

tissue is typically sampled to distinguish somatic mutations, we also consider it as a sample

since some of these tissues contain various degrees of cancer contamination and thus can be

helpful in identifying subclones. Similar to the CLL dataset, we preprocess the mutations

based on their copy-number analysis as reported by [3]. Briefly, we only keep autosomal

mutations that are copy-number neutral. A summary of CITUP’s performance on 8 patients

taken from this dataset is given in Table 5.2.



Figures

Figure 1 A comparison of the full-binary vs. arbitrary rooted tree formulations. In all trees, mutations are depicted with colored squares. Left: An illustration of the complete binary tree formulation. Here, each internal node has exactly two children and only the leaf clones are assumed to be present in the sample. Ancestral clones can only be represented through paths which acquire no additional mutations. Middle: An equivalent representation of the full-binary tree on the left. Right: The same phylogenetic information represented by the arbitrary rooted tree model. Here, each internal node can have one or more children but each clone must acquire at least one additional mutation.


Table 1 Summary of CITUP’s results on the CLL dataset. The second column refers to the number of mutations as reported by [11]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

Patient    No. of mutations    No. of subclones    No. of solutions    Wall-clock time (min)
CLL003     19                  5                   1                   1.64
CLL006     9                   5                   2                   0.32
CLL077     15                  5                   1                   0.84

Additional Files

Additional file 1: The full set of CITUP predictions on the CLL dataset.

Additional file 2: The full set of CITUP predictions on the AML dataset.

Additional file 3: Detailed time requirement of CITUP qip on the AML dataset; performance analysis of CITUP qip and CITUP iter; comparison between CITUP and Rec-BTP.

Additional file 4: The raw simulation results used for comparison.

Figure 5.3: CITUP predictions for patient CLL003. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.



Patient      No. of mutations    No. of subclones    No. of solutions    Wall-clock time (hours)
UPN400220    265                 7                   1                   1.71
UPN426980    822                 7                   1                   23.00
UPN452198    97                  5                   4                   0.14
UPN573988    144                 3                   2                   1.02
UPN758168    412                 7                   2                   3.33
UPN804168    589                 8                   1                   6.89
UPN869586    1160                8                   1                   23.00
UPN933124    270                 6                   1                   3.75

Table 5.2: Summary of CITUP’s results on the AML dataset. The second column refers to the total number of indel and single nucleotide mutations as reported by [3]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.

Due to the large number of mutations, CITUP qip requires considerably more CPU time

to run on this dataset compared to the CLL dataset. Nonetheless, we note that CITUP was

able to optimize all but two datasets to an exact solution when a wall-clock time limit of 23

hours is imposed for each dataset. Moreover, the total CPU time taken on these datasets

indicates a quadratic to sub-quadratic practical running time.
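One way to sanity-check a scaling claim of this kind is a least-squares fit on a log-log scale, where the slope approximates the exponent of the running time in the number of mutations. The sketch below is only an illustration under stated assumptions: it uses the wall-clock times from Table 5.2 as a stand-in for the total CPU times referred to in the text (which are not listed in this section), and it excludes the two datasets that reached the 23-hour limit. The function name is hypothetical.

```python
import math

# (number of mutations, wall-clock hours) taken from Table 5.2.
# UPN426980 and UPN869586 are left out because they hit the 23-hour limit.
runs = [(265, 1.71), (97, 0.14), (144, 1.02), (412, 3.33), (589, 6.89), (270, 3.75)]


def loglog_slope(points):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(hours) for _, hours in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Empirical exponent k in time ~ size**k for the completed runs.
exponent = loglog_slope(runs)
```

With only six points and wall-clock rather than CPU time, the fitted exponent is a coarse indication at best and should not be read as a complexity result.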

The number of subclones identified per patient is also higher than the number of sub-

clones predicted for CLL patients. We believe this is likely due to the increased ability to

detect subclones that differ by non-coding somatic mutations. To investigate this, we have

also obtained CITUP qip results on 3 of the AML datasets (UPN426980, UPN804168 and

UPN869586) using coding mutations only. Although the number of subclones predicted

was smaller in all 3 cases, the overall clonal architecture in the newly predicted trees was

typically similar to the trees estimated from the full set of mutations.

While it is unknown whether the non-coding mutations play an important role in cancer

progression, some may be hitchhiker mutations which represent subclones that differ by

other types of aberrations such as gene fusions. Furthermore, some non-coding mutations

may still be functional; for example, some intronic mutations are known to affect splicing.

Thus, we believe that phylogenetic trees derived from the full set of mutations may have

better potential to represent the true cancer progression.

Since a full phylogenetic relationship analysis is absent from [3] and the ground truth

solutions are not known, we cannot directly evaluate our predicted trees. Figure 5.6 shows,

however, that the tumor purities inferred by CITUP generally agree with those reported

by [3] for primary and relapse samples. Note that since CITUP does not explicitly predict

tumor purity, for each sample this value is estimated as (1.0 − α_{rs}) if the root node is

not assigned any mutations, where α_{rs} is the predicted genome frequency of the root node

in that sample. Otherwise, the tumor purity is considered to be 1.0 (assuming germline

mutations have been excluded from the study).
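The purity rule described above can be written as a small function. This is a minimal sketch of that rule only, not code from CITUP; the function and parameter names are assumptions introduced for illustration.

```python
def estimate_tumor_purity(root_frequency, root_mutation_count):
    """Purity of one sample, derived from the root node of the predicted tree.

    root_frequency: predicted genome frequency of the root node in the sample.
    root_mutation_count: number of mutations assigned to the root node.
    """
    if not 0.0 <= root_frequency <= 1.0:
        raise ValueError("root_frequency must lie in [0, 1]")
    if root_mutation_count < 0:
        raise ValueError("root_mutation_count must be non-negative")
    if root_mutation_count == 0:
        # A mutation-free root is interpreted as the normal cell population.
        return 1.0 - root_frequency
    # A root carrying mutations is itself tumor, so the sample counts as pure.
    return 1.0


# Hypothetical values: a mutation-free root at 25% frequency.
purity = estimate_tumor_purity(0.25, 0)
```

The second branch relies on the assumption stated in the text that germline mutations have been excluded, so any mutation at the root is somatic.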

The only striking difference between the tumor purities inferred by [3] and CITUP is

in the relapse sample of patient UPN869586. The CITUP prediction for this patient is given

in Figure 5.7. The figure suggests that while the founder clone ‘b’ (and its descendants) is

present at a lower abundance in the relapse sample, which may correspond to the tumor

purity of 40% reported by [3], CITUP predicts another emerging clone in the relapse sample

(i.e. clone ‘g’). Although no coding mutation is assigned to clone ‘g’, we have

found that some of the mutations assigned to this clone are located in the intronic regions

of several genes including IL15 and GPC5. Interestingly, the tumor purity estimate in the

relapse sample using coding-only mutations for this patient is closer to the purity estimate

reported in [3].

5.7 Computing environment and running parameters

For each simulated dataset, we converted our simulated mutation frequencies to PhyloSub’s

input format as follows. We assumed each mutation had a sequencing depth of 1000 reads,

and set the number of variant reads to 1000 · f, where f is the simulated mutation frequency.

We assumed a sequencing error rate of 0.001. PhyloSub takes a significant amount of

computation time, and thus it was necessary to make a minor modification to evolve.py to

provide the ability to specify a maximum allowable computation time on the command line.

If the specified computation time is exceeded, the sampler exits cleanly, reporting the top k

trees identified thus far. For each simulated dataset, we ran PhyloSub using the evolve.py

command specifying 1000 MH iterations and 1000 MCMC samples, and 3 restarts with

different seeds. Computation time was limited to 95 hours per restart. PhyloSub completed

on average 784.7 MCMC samples within the allowable computation time (standard deviation

90.6 samples). For TrAp, we use the cluster frequencies as mentioned above and run it in

multi-sample mode with default parameters. For Rec-BTP, the clustering of the mutations

was performed using AVDPGM with the same parameters as described in [5]. Rec-BTP

was run with default parameters.

Figure 5.6: Tumor purities predicted by [3] and CITUP in primary and relapse samples of AML patients. For the three patients with multiple reported solutions, UPN758168 had the same root frequencies in both solutions. For UPN452198 and UPN573988, we pick the frequencies closest to the ones given in [3]. [Plot omitted; legend: Primary (Ding et al.), Primary (CITUP), Relapse (Ding et al.), Relapse (CITUP).]

Figure 5.7: CITUP predictions for patient UPN869586. Left: The estimated subclonal proportions for tumor (inner) and relapse (outer) samples. Right: The predicted evolutionary tree and the coding mutations assigned to each subclone. The numbers in parentheses give the total number of (i.e. coding and non-coding) mutations for each subclone.

CITUP qip and CITUP iter are implemented in Python and C++ and CITUP qip is

run using the IBM ILOG Cplex Optimizer. All CITUP runs were performed on a Linux

server with a memory limit of up to 16GB per job.
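The conversion of simulated frequencies into read counts described at the start of this section can be sketched as follows. This is an illustration under stated assumptions rather than the script actually used: the function name and dictionary keys are hypothetical, rounding to the nearest integer is an assumption (the text only gives 1000 · f), and the exact column layout of PhyloSub's input file is not reproduced here.

```python
DEPTH = 1000        # sequencing depth assumed for every mutation
ERROR_RATE = 0.001  # sequencing error rate assumed in the text


def to_read_counts(frequency, depth=DEPTH):
    """Turn a simulated mutation frequency into variant/reference read counts."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie in [0, 1]")
    # Rounding is an assumption; read counts have to be integers.
    variant_reads = int(round(depth * frequency))
    return {
        "variant_reads": variant_reads,
        "reference_reads": depth - variant_reads,
        "depth": depth,
        "error_rate": ERROR_RATE,
    }


# One mutation observed in three samples (hypothetical frequencies).
records = [to_read_counts(f) for f in (0.25, 0.1234, 0.0)]
```

Fixing the depth at 1000 for every mutation removes sampling noise from the simulated input, so differences between tools reflect the inference method rather than coverage variation.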

Chapter 6

Conclusions and future work

In this work, we present CITUP, a novel combinatorial algorithm to determine clonal fre-

quencies in tumors as well as their evolutionary history using one or more samples from

the same patient. Our comparisons to other state-of-the-art tools show that CITUP con-

sistently reports fewer solutions with better accuracy. This feature is very important for

real cancer datasets where additional experiments may be required to validate the predic-

tions. For example, predictions that involve contradictory assignments reported by TrAp

(referred to as “non-sparse” solutions in [19]) complicate the downstream analysis of identifying

potential drivers of cancer. Similarly, the partial order plots reported by PhyloSub

[6] can involve many connections, making it difficult to interpret the solutions reported

by this tool. Although our QIP framework is already able to handle a large number of

mutations and is significantly faster than PhyloSub, we acknowledge that it is considerably

slower than TrAp. On the other hand, the iterative heuristic version of CITUP exhibits

comparable accuracy, while achieving significant reduction in computation time. Moreover,

our ability to run CITUP separately on each tree topology means that parallel computing

can be utilized to quickly obtain high accuracy results on large datasets. As mentioned

above, CITUP assumes infinite sites, which may be violated under certain conditions. For

instance, a functional mutation may be selected against during changes to the tumor environment,

such as the reversion of a BRCA2 mutation in therapy-resistant ovarian cancer

[1]. Furthermore, lineages that die out before the first sampling of the tumor or emerge

and disappear between two time points are not detectable by CITUP or any other method

aiming to construct phylogenies. In these cases, the evolutionary history of the tumor can

only be partially constructed. In addition, CITUP and similar methods are only applicable


to tumors with limited copy number changes. On the other hand, this limitation can be

partially overcome by considering a restricted number of copy-number corrected genotypes

similar to the approach of PyClone [14]. Extension of CITUP to exploit these types of changes

would lead to its broader applicability and detection of subclonal populations characterized

by copy number aberrations.

Bibliography

[1] Ashworth, A. Drug resistance caused by reversion mutation. Cancer Research 68, 24 (2008), 10021–10023.

[2] Beyer, T., and Hedetniemi, S. M. Constant time generation of rooted trees. SIAM J. Comput. 9, 4 (1980), 706–712.

[3] Ding, L., Ley, T. J., Larson, D. E., Miller, C. A., Koboldt, D. C., Welch, J. S., Ritchey, J. K., Young, M. A., Lamprecht, T., McLellan, M. D., McMichael, J. F., Wallis, J. W., Lu, C., Shen, D., Harris, C. C., Dooling, D. J., Fulton, R. S., Fulton, L. L., Chen, K., Schmidt, H., Kalicki-Veizer, J., Magrini, V. J., Cook, L., McGrath, S. D., Vickery, T. L., Wendl, M. C., Heath, S., Watson, M. A., Link, D. C., Tomasson, M. H., Shannon, W. D., Payton, J. E., Kulkarni, S., Westervelt, P., Walter, M. J., Graubert, T. A., Mardis, E. R., Wilson, R. K., and DiPersio, J. F. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 7382 (2012), 506–510.

[4] Ding, L., Raphael, B. J., Chen, F., and Wendl, M. C. Advances for studying clonal evolution in cancer. Cancer Letters 340, 2 (2013), 212–219.

[5] Hajirasouliha, I., Mahmoody, A., and Raphael, B. J. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Proceedings of the International Conference on Intelligent Systems of Molecular Biology (2014).

[6] Jiao, W., Vembu, S., Deshwar, A., Stein, L., and Morris, Q. Inferring clonal evolution of tumors from single nucleotide soma
