CLONALITY INFERENCE IN MULTIPLE TUMOR
SAMPLES USING PHYLOGENY
by
Salem Malikić
B.Sc., University of Sarajevo, Bosnia and Herzegovina, 2011
a Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in the
School of Computing Science
Faculty of Applied Sciences
© Salem Malikić 2014
SIMON FRASER UNIVERSITY
Summer 2014
All rights reserved.
However, in accordance with the Copyright Act of Canada, this work may be
reproduced without authorization under the conditions for “Fair Dealing.”
Therefore, limited reproduction of this work for the purposes of private study,
research, criticism, review and news reporting is likely to be in accordance
with the law, particularly if cited appropriately.
-
APPROVAL
Name: Salem Malikić
Degree: Master of Science
Title of Thesis: Clonality Inference in Multiple Tumor Samples using Phylogeny
Examining Committee: Dr. Andrei Bulatov,
Professor, Chair
Dr. Süleyman Cenk Şahinalp,
Professor, Senior Supervisor
Dr. Jian Pei,
Professor, Supervisor
Dr. Ayşe Funda Ergün,
Professor, Internal Examiner
Date Approved: August 22nd, 2014
-
Partial Copyright Licence
-
Abstract
Intra-tumor heterogeneity presents itself through the evolution of subclones during cancer
progression. While recent research suggests that this clonal diversity is a key factor in
therapeutic failure, the determination of subclonal architecture of human tumors remains a
challenge. To address the problem of accurately determining subclonal frequencies in tumors
as well as their evolutionary history, we have developed a novel combinatorial method named
CITUP (Clonality Inference in Tumors Using Phylogeny). An important feature of CITUP
is its ability to exploit data from multiple time-point and/or regional samples from a single
patient in order to improve estimates of mutational profiles and subclonal frequencies. Using
extensive simulations and real datasets comprising tumor samples from two leukemia drug-
response studies, we show that CITUP can infer the evolutionary trajectory of human
tumors with high accuracy.
keywords: Cancer progression, intra-tumor heterogeneity, combinatorial methods
-
To my beloved parents Faiz and Sadeta,
and my dear sister Faiza
-
You can’t connect the dots looking forward; you can only connect them looking backwards.
So you have to trust that the dots will somehow connect in your future.
You have to trust in something: your gut, destiny, life, karma, whatever.
This approach has never let me down, and it has made all the difference in my life.
— Steve Jobs
-
Acknowledgements
First and foremost, I would like to thank my supervisor Dr. S. Cenk Sahinalp for his
extensive guidance, support and patience during my studies. I especially thank him for
the endless effort he put into training me in the scientific field. I am also very thankful to
Andrew McPherson and Dr. Nilgün Donmez for their immense contribution to this work,
their valuable advice and help with writing the thesis. I would like to acknowledge the
insightful feedback I have received from my colleagues from the Lab for Computational
Biology at Simon Fraser University.
Also, I am indebted to Nermin Suljić, Ali Lafçioğlu, Dr. Hasan Jamak and Dino Oglić
for helping me develop my passion and enthusiasm towards the fields of Mathematics and
Science. Furthermore, I thank all of the people from Bosna Sema Educational Institutions for
providing an excellent environment and support during my high school and undergraduate
studies.
I am very grateful to my dear aunt Faiza and uncle Dževad together with their family
and my beloved partner Fatima for providing moral support. Last but not least, I would
like to express my deep gratitude to my parents and my sister for their unconditional love
and support.
-
Contents
Approval
Partial Copyright Licence
Abstract
Dedication
Quotation
Acknowledgements
Contents
List of Tables
List of Figures
1 Introduction
2 Background
2.1 DNA and mutations
2.2 Cancer onset and evolution
2.3 Methods for analysing and sequencing DNA and their applications in tumor
heterogeneity studies
2.4 Identifying mutations and their frequencies from HTS data
3 Model assumptions and problem description
3.1 Phylogenetic tree model
3.2 Input data
3.3 Model assumptions
3.4 Problem description
4 Methods
4.1 Combinatorial Formulation
4.2 Method Outline
4.3 Quadratic Integer Programming (QIP) method
4.4 QIP optimizations
4.5 Heuristic Iterative Method
4.6 Enumerating rooted trees
4.7 Model selection
5 Results
5.1 Datasets
5.2 Evaluation criteria
5.3 Evaluation on simulated datasets
5.4 Comparison with Rec-BTP
5.5 Results on Chronic Lymphocytic Leukemia datasets
5.6 Results on Acute Myeloid Leukemia datasets
5.7 Computing environment and running parameters
6 Conclusions and future work
Bibliography
-
List of Tables
3.1 Simple input to the CITUP algorithm consisting of 2 samples. In total, 10 somatic
mutations have been identified. Their frequencies are estimated from the
alignment of sequencing data and most of them deviate from the true values due
to the presence of noise. A mutation might be undetected in, or completely absent
from, some, but not all, samples.
5.1 Summary of CITUP’s results on the CLL dataset. The second column refers
to the number of mutations as reported by [17]. The third column reports the
number of subclones (including normal cells) found in the best solution. The
number of solutions column shows how many distinct solutions are found
with the best score.
5.2 Summary of CITUP’s results on the AML dataset. The second column refers
to the total number of indel and single nucleotide mutations as reported by
[3]. The third column reports the number of subclones (including normal
cells) found in the best solution. The number of solutions column shows how
many distinct solutions are found with the best score.
-
List of Figures
2.1 Simple evolutionary tree showing the emergence of different clonal subpopulations
as a consequence of mutations in the cells’ DNA.
2.2 An example of heterogeneous tumor tissue consisting of several different clonal
subpopulations of cells used as the input for DNA sequencing.
2.3 Frequencies of mutations present in the sample shown in Figure 2.2.
3.1 Arbitrary rooted tree representation of the binary tree given in Figure 2.1.
3.2 One possible interpretation of the input data given in Table 3.1. This figure
shows the frequency assignment for Sample 1. For each node except the root, the
first number in the node label represents the proportion of cells harbouring
mutations that occurred along the edge connecting that node with its parent.
For the root node this number is always 1. The number inside the brackets shows
the proportion of cells harbouring the genotype uniquely identified by this node.
3.3 One possible interpretation of the input data given in Table 3.1. This figure
shows the frequency assignment for Sample 2. For each node except the root, the
first number in the node label represents the proportion of cells harbouring
mutations that occurred along the edge connecting that node with its parent.
For the root node this number is always 1. The number inside the brackets shows
the proportion of cells harbouring the genotype uniquely identified by this node.
5.1 Simulation results for TrAp, PhyloSub and CITUP (QIP and iterative
procedures) under the four evaluation criteria. The rows depict measures M0 to
M3. The first column investigates the effect of the number of subclones/nodes
in the dataset, the second investigates the effect of the number of samples, the
third investigates the effect of noise added to the mutation frequencies and
the fourth investigates the effect of non-uniformity among subclone
frequencies. The figure is drawn using the boxplot function in Python’s matplotlib
library: the line within each box is the median and the box boundaries mark
the 25% and 75% values. The extreme outliers are depicted with + symbols.
Note that we were unable to run PhyloSub on 7 samples, so the corresponding
bars are absent from this column.
5.2 Sensitivity analysis of CITUP iter with respect to starting points. Top:
Distribution of errors in the objective for different restarts of the algorithm, where
error is defined as the difference between the local minimum objective value
reached by CITUP iter and the global minimum reached by CITUP qip. Bottom:
The proportion of iterative restarts that reach the global minimum within
10^-9.
5.3 CITUP predictions for patient CLL003. Left: Estimated subclonal proportions
for the five time points (ordered from inner to outer circles). Right:
The predicted evolutionary tree and the mutations assigned to each subclone.
Note that each node is also assumed to inherit mutations that emerge
at its ancestors.
5.4 CITUP predictions for patient CLL006. Left: Estimated subclonal proportions
for the five time points (ordered from inner to outer circles). Right:
The predicted evolutionary tree and the mutations assigned to each subclone.
Note that each node is also assumed to inherit mutations that emerge
at its ancestors.
5.5 CITUP predictions for patient CLL077. Left: Estimated subclonal proportions
for the five time points (ordered from inner to outer circles). Right:
The predicted evolutionary tree and the mutations assigned to each subclone.
Note that each node is also assumed to inherit mutations that emerge
at its ancestors.
5.6 Tumor purities predicted by [3] and CITUP in primary and relapse samples
of AML patients. For the three patients with multiple reported solutions,
UPN758168 had the same root frequencies in both solutions. For UPN452198
and UPN573988, we pick the frequencies closest to the ones given in [3].
5.7 CITUP predictions for patient UPN869586. Left: The estimated subclonal
proportions for tumor (inner) and relapse (outer) samples. Right: The predicted
evolutionary tree and the coding mutations assigned to each subclone.
The numbers in parentheses give the total number of (i.e. coding and
non-coding) mutations for each subclone.
-
Chapter 1
Introduction
Most human tumors exhibit a large degree of heterogeneity. This heterogeneity is not only
apparent in histology but also presents itself in various features such as gene expression
changes, genomic copy number alterations and structural rearrangements as well as other
aberrations. While the origins of the intra-tumor heterogeneity are still debated, research
suggests that this diversity is likely to have clinical implications. For instance, Merlo et
al. [11] have reported a correlation between clonal diversity and progression to esophageal
adenocarcinoma in Barrett’s esophagus.
The implications of tumor heterogeneity are not limited to diagnostics. It has been
suggested that clonal diversity may also be linked to metastatic potential and drug response.
Looking at biopsies from pancreas and prostate adenocarcinomas, Ruiz et al. [15] found that
metastatic tumors were derived from certain clonal populations. In colorectal cancer, Kreso
et al. [7] reported that clonal diversity affects chemotherapy tolerance. By tracking 150
lentivirus-marked lineages from 10 human colorectal cancers, they found that previously
minor or dormant clones were promoted by chemotherapy, thus reducing the effectiveness
of the treatment.
Although a multi-clonal nature is common to most tumor samples, determining
the clonal subpopulations is a challenging process. This problem could potentially
be alleviated by single-cell sequencing; however, the current cost of these methods is
prohibitive at the scales that would be necessary to representatively sample a tumor tissue.
Methods such as Fluorescence in Situ Hybridization (FISH) or Silver in Situ Hybridization
(SISH) can also assess a small number of probes in individual cells of a tumor sample. On
the other hand, these methods are quite limited in scope and cannot offer the same
genome-wide perspective as high-throughput sequencing methods.
In silico separation of the clonal subpopulations may also provide a viable alternative
to these methods. Despite the importance of clonal diversity and its clinical implications,
however, relatively few computational methods have been developed to date. In a pioneering
paper, Schwartz and Shackney [18] developed an unmixing method based on a geometric
model to distinguish a small number of cancer subtypes in gene expression data. After
determining cell types and their relative frequencies in different tumor samples, this method
infers a phylogenetic tree that best fits the cell types identified. An alternative method,
named TrAp [19], generates possible phylogenetic trees following certain parsimony and
sparsity conditions using a greedy approach. More recently, PhyloSub [6], which is based on
Bayesian inference, was developed. This method relies on the well-known Markov Chain
Monte Carlo (MCMC) sampling paradigm to infer a distribution over all possible phylogenies.
Another statistical method named PyClone [14], also based on MCMC sampling, leverages
copy number genotypes to estimate subclonal frequencies more reliably. Unlike PhyloSub,
however, this method does not infer phylogenies.
In this work we present a combinatorial algorithm, named CITUP (Clonality Inference
in Tumors Using Phylogeny), that can exploit data obtained from multiple loci from a single
patient to infer the tumor phylogeny more accurately. Our framework also involves
generating all possible phylogenetic trees; however, unlike the previous approaches mentioned
above, CITUP has the ability to find optimal solutions based on an exact Quadratic Integer
Programming formulation.
Another tree-based method named Rec-BTP [5], developed concurrently with ours, is
closely related to our framework. In this approach, mutations are subjected to a binary-tree
partition, where a binary tree with the least number of conflicting triplets is sought using
an approximation algorithm. In contrast to our framework, this method cannot handle
multiple samples.
Our work is also related to the studies of [16] and [13]; however, their goals are different.
While THetA [13] aims to predict subclonal populations and their proportions given a
sample from high-throughput sequencing data, it does not aim to infer any phylogenetic
relationship between the subclones. Although the method proposed by Salari et al. [16]
infers tumor phylogenies from multiple samples like CITUP, the goal of that study is to
improve somatic SNV calls. Moreover, their model places the samples as leaves in a single
phylogenetic tree (thus leaves do not represent clonal subpopulations but rather samples,
each of which is a mixture of subclones), whereas our model assumes a shared tree between
samples with different clonal frequencies.
-
Chapter 2
Background
2.1 DNA and mutations
The human body is formed from between 50 and 100 trillion cells. Each cell contains all of the
organism’s genetic instructions stored as deoxyribonucleic acid (DNA), which is made up of
molecules called nucleotides. There are four types of these molecules: adenine (A), cytosine
(C), thymine (T) and guanine (G), and their order defines a particular DNA sequence and its
function. For the sake of illustration, we can consider nucleotides as the alphabet of some
language. Just as the letters of an alphabet combine to form the words
of a language, nucleotides form genes that contain instructions for the different tasks that a cell
performs. Inside the cell, DNA is packed into structures called chromosomes. In normal
conditions, chromosomes in human cells come in matching pairs, with one chromosome of each
pair inherited from each parent. As a consequence, the human genome is diploid and genomic loci
also come in pairs, where the term genome is used for an organism’s complete set of DNA, including all
of its genes. The term genotype, on the other hand, describes the nucleotides present at a specific location
in the two copies. These two loci do not have to be fully identical, and each of the
alternative forms of the same genetic locus is referred to as an allele.
Changing, deleting or altering the position of even a single letter in a word
can completely change its meaning. For example, in English, if we change the first
letter of the word “keep” into ‘d’ we get the word “deep”, which has a completely different meaning, and
changing ‘k’ into ‘e’ results in “eeep”, which has no meaning at all. Similarly, changes
in the DNA sequence of a cell, usually referred to as mutations or aberrations, can change a DNA
segment coding for some gene, which might result in an improperly functioning gene or in a
complete loss of gene function. Many different types of mutations have been reported.
A single nucleotide mutation, where a single nucleotide is exchanged for another, is the most
prevalent type of mutation. Copy number aberrations occur when part of the genome gets
deleted or amplified, and an indel (a blend of “insertion” and “deletion”) is a mutation in which
nucleotides are inserted into or deleted from the DNA sequence.
During the lifetime of an individual, many cells undergo programmed cell death (apoptosis)
and are replaced by new cells in a process called somatic cell division. In this process
one cell, called the mother cell, is replaced by two daughter cells. During replication, the
DNA of the mother cell is copied into each of the daughter cells and, in a perfect division, each daughter
cell’s DNA sequence is an exact copy of the mother cell’s DNA sequence. Occasionally, during
a somatic cell division, mutations occur, resulting in one or both daughter cells having a
different DNA sequence from the mother cell. Somatic division is not the only process
whereby mutations can be acquired. Various external factors, such as radiation, tobacco
consumption, ultraviolet light and many others, can also have a deleterious effect on a cell and
cause severe damage to its DNA.
All of the mutations acquired in non-germ cells during the lifetime of an individual are
commonly referred to as somatic mutations. Germline or hereditary mutations comprise
another major class of mutations. Mutations from this class occur in germ cells and can
later be passed on to the individual’s progeny. If inherited, such a mutation is typically present in all cells
of the human body and does not provide valuable information in heterogeneity studies. For this
reason, in this work we focus only on somatic mutations.
Since more than 98% of human DNA is noncoding and most noncoding DNA does
not have any known biological function, somatic mutations can be divided into coding and
noncoding. The latter are usually non-deleterious and do not have a serious impact on
cell functioning. On the other hand, coding somatic mutations might be fatal for proper
cell functioning and result in various diseases, with cancer being one of the most frequent
among them. For these reasons, some studies use only coding mutations.
Through somatic cell division, a mutation acquired in some cell is passed on to all of its descendants
unless it gets reverted, which is very unlikely considering that human DNA consists of
several billion nucleotides. In addition to mutations inherited from the mother cell, daughter
cells might acquire additional mutations. This can result in the emergence of several clonal
subpopulations of cells, each one uniquely identifiable by the set of somatic mutations it
has acquired, as shown in Figure 2.1. This figure shows a simple evolutionary tree, also called a
phylogenetic tree, where, starting from a normal, healthy cell, six different subpopulations
of mutated cells have emerged. Note that some clonal subpopulations can die out or get
completely replaced by more differentiated subpopulations descending from them. One such
clonal subpopulation of cells, harbouring only the red-colored mutation, is shown in Figure 2.1.
Figure 2.1: Simple evolutionary tree showing the emergence of different clonal subpopulations
as a consequence of mutations in the cells’ DNA.
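The inheritance rule described in this section, where each clonal subpopulation carries its own newly acquired mutations together with everything inherited from its ancestors, can be sketched in a few lines of Python. This is only an illustration and not part of CITUP: the function name clone_genotypes, the node labels and the mutation names are hypothetical, and the tree is assumed to be given as a child-to-parent map.

```python
def clone_genotypes(parent, new_mutations):
    """Return the full mutation set of every node in a rooted tree.

    parent: dict mapping each node to its parent (the root maps to None).
    new_mutations: dict mapping a node to the mutations acquired on the
        edge leading into it (hypothetical labels; nodes may be omitted).
    """
    genotypes = {}

    def genotype(node):
        # A node inherits everything its parent carries, then adds its own.
        if node not in genotypes:
            inherited = set() if parent[node] is None else genotype(parent[node])
            genotypes[node] = inherited | set(new_mutations.get(node, ()))
        return genotypes[node]

    for node in parent:
        genotype(node)
    return genotypes


# Hypothetical four-node tree: normal -> A -> B, and normal -> C.
parent = {"normal": None, "A": "normal", "B": "A", "C": "normal"}
new_mutations = {"A": {"m1"}, "B": {"m2", "m3"}, "C": {"m4"}}
genotypes = clone_genotypes(parent, new_mutations)
print(sorted(genotypes["B"]))  # ['m1', 'm2', 'm3']
```

Because no mutation is ever removed along a path, every subpopulation is identified by a distinct mutation set as long as each edge carries at least one mutation.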
2.2 Cancer onset and evolution
Cancer is a disorder in which some of the body’s cells begin to grow uncontrollably to form
a mass of cells called a tumor. It is used as a term for more than a hundred diseases that
share two main characteristics: uncontrolled cell growth and the ability of these
cells to invade other tissues. The growth of a tumor can be thought of as an evolutionary
process. Malignant (life-threatening) tumors usually contain many mutations, which do not
happen all at once.
There are two different models explaining evolutionary processes in tumors: the clonal
evolution model and the cancer stem cell model. A detailed explanation of these two models can be found
in [4]. In this study we adopt the clonal evolution model of tumor progression, first
introduced in [12]. According to this model, a tumor begins with a cell that has sustained a single
mutation that offers it a growth advantage over its neighbors. This advantage might be
manifested in many different ways: the cell can have a higher rate of somatic division compared
to normal cells, or some genetic mechanisms might be damaged, delaying programmed cell
death (apoptosis), among many others. For example, a mutation that inactivates a pro-apoptotic
gene might delay the cell’s death, giving it a longer lifespan. Somatic divisions of this
mutated cell and its descendants lead to the formation of a clonal subpopulation formed from
cells harbouring the advantageous mutation. After some time, one cell in this subpopulation
may sustain another such mutation, which can result in the emergence of a new clone, possibly with
higher proliferative power. Over the course of time several clonal subpopulations might
emerge, resulting in a highly heterogeneous tissue at the time of clinical diagnosis.
Figure 2.2 illustrates a tumor tissue consisting of several distinct clonal subpopulations. For
simplicity, we assume that they are related by the evolutionary tree given in Figure 2.1. We
note that at the time of clinical diagnosis, the only observable clonal subpopulations are the
ones corresponding to leaf nodes in the evolutionary tree. Therefore, none of the cells in
Figure 2.2 has a genotype harbouring only the red-colored mutation.
The first tumor that develops in the body is called the primary tumor. If not detected
and removed at an early stage of the disease, tumor cells usually migrate through blood or lymph
and start growing in distant organs. Such a new tumor is known as a metastatic tumor. It
is important to mention that a metastatic tumor, although growing in a different part of the
body, still has the characteristics of the primary tumor, and the same tumor evolutionary tree
can be used to explain the primary tumor and all of its metastatic outgrowths. The same
applies to tumor samples from early and later stages of disease progression, since clonal
subpopulations at later stages are either identical to or evolved from clonal subpopulations
from early stages by the acquisition of additional mutations between the timepoints when the
samples were obtained.
Figure 2.2: An example of heterogeneous tumor tissue consisting of several different clonal
subpopulations of cells used as the input for DNA sequencing.
2.3 Methods for analysing and sequencing DNA and their
applications in tumor heterogeneity studies
Several methods for DNA analysis and sequencing (determining the order of nucleotides
A, C, G, T) have been invented to date. Although there are many different ways to
classify them, in cancer heterogeneity studies they are usually classified into two categories,
based on the number of cells that are used as the input.
The first category comprises methods using a single cell as the input for DNA analysis or
sequencing. Fluorescence in Situ Hybridization (FISH) and Spectral Karyotyping (SKY) have
been used for decades in studying DNA sequences of single tumor cells and have shown great
variability in DNA content among them. These methods are quite limited in scope and can
only assess a small number of probes from single cells of a tumor sample. Ideally, whole-genome
sequencing of a sufficient number of single cells would be used to identify
tumor heterogeneity. Single cell sequencing (SCS) and single nucleus sequencing (SNS)
are two other promising approaches for revealing intra-tumor heterogeneity. However,
both of them have several limitations. Namely, as they require an amount of DNA far
exceeding the amount present in a single cell, amplification of DNA is required prior to
sequencing. This amplification usually results in biases where some regions of DNA do
not get amplified to the desired extent, whereas other regions get over-amplified. As a
consequence, the number of reads (small subsequences of DNA generated as the output of
sequencing) covering under-amplified regions is usually low and the effect of noise is very
high in downstream analysis of reads covering these regions. Some of these regions do not
get any reads covering them, hence some mutations remain undetected. This bias also
makes it difficult to detect copy number aberrations (aberrations where some part
of the genome gets deleted or amplified), since it is hard to distinguish whether a particular
DNA segment covered by a large number of reads is amplified in the sequenced cell, or whether this
number is just a consequence of over-amplification during the DNA amplification step. Also, in
order to get a statistically representative sample of the underlying tumor, which typically consists of
millions of cells, isolating and sequencing a large number of cells from many different sections
of the tumor is required. Due to these limitations and the prohibitive cost of single cell sequencing
of a large number of cells, these methods are still mainly used only for academic purposes in
analyzing intra-tumor heterogeneity.
The second major category of DNA analysis and sequencing methods uses a bulk of
tumor cells as the input to obtain short reads of DNA. High Throughput Sequencing (HTS)
methods are currently the most widely used for this task due to their ability to generate a
large number of short reads from multiple tumor cells at low cost and with good accuracy
(percentage of correctly identified nucleotides among all reads), which varies among
platforms but is typically above 99.9%. Since multiple cells are used as the input, only an
average signal of DNA content from the underlying cells is obtained as the output. Therefore,
some post-processing is required in order to obtain the number, evolutionary history, genotypes
and proportions of the clonal subpopulations present in the sequenced tumor.
The sequencing coverage of a DNA segment is defined as the average number of reads covering
that particular segment. Sequencing of whole DNA usually results in coverage that is too low
to accurately identify the frequencies of somatic mutations. A high degree of confidence
in measuring frequencies of single nucleotide mutations and small indels can be achieved
using targeted deep sequencing. In brief, after a mutation is detected at some DNA locus,
the region encompassing this locus is PCR amplified from a bulk tumor sample and then
sequenced to high depth (>1000× coverage) using HTS. Technological advances now allow
many variants to be amplified and sequenced in parallel, speeding up the sequencing process.
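To make the definition of coverage concrete, the following minimal sketch computes the average number of reads covering each position of a segment. It assumes that reads are given as half-open (start, end) intervals on a single reference coordinate system and interprets "average number of reads" as mean per-base depth; the function name segment_coverage and the example coordinates are hypothetical and chosen purely for illustration.

```python
def segment_coverage(reads, seg_start, seg_end):
    """Mean per-base read depth over the half-open segment [seg_start, seg_end).

    reads: iterable of (start, end) half-open intervals (assumed layout).
    """
    length = seg_end - seg_start
    if length <= 0:
        raise ValueError("segment must have positive length")
    covered = 0
    for start, end in reads:
        # Number of segment positions this read overlaps (0 if none).
        overlap = min(end, seg_end) - max(start, seg_start)
        covered += max(0, overlap)
    return covered / length


# Three hypothetical reads overlapping a 100-base segment.
reads = [(0, 100), (50, 150), (120, 220)]
print(segment_coverage(reads, 50, 150))  # 1.8
```

At a coverage this low, a heterozygous variant would be supported by only a handful of reads, which is why frequency estimation relies on the much deeper targeted sequencing described above.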
2.4 Identifying mutations and their frequencies from HTS
data
In addition to cells from the tumor mass, some healthy tissue is also sequenced
in order to obtain the genome of normal cells. All of the short reads obtained from tumor cells
are then compared against the normal genome and somatic mutations are identified.
For each mutation, we define its frequency as the proportion of cells from the tumor sample
harbouring that mutation. For somatic mutations at diploid loci of the genome, it is very
unlikely that both alleles are mutated, so we assume that all such mutations are
heterozygous, i.e. only one allele is mutated. As calculating the frequencies of single nucleotide mutations
in regions that have been affected by copy number aberrations is a complicated task, in this
study we focus only on heterozygous somatic mutations outside of copy number aberrated
regions. Their frequency is calculated as $(2 \cdot q_{var})/(q_{var} + q_{ref})$, where $q_{var}$ is the number of
reads with the variant allele and $q_{ref}$ is the number of reads with the reference allele. Allele
specific copy number measurements, obtained using sequencing or arrays, can be used to
exclude mutations from genomic regions that are not diploid heterozygous throughout the
population of tumor cells.
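The frequency formula above translates directly into code. The sketch below is a minimal illustration under the same assumptions as the text (a heterozygous mutation at a diploid locus outside copy number aberrated regions); the function name mutation_frequency and the read counts are hypothetical.

```python
def mutation_frequency(q_var, q_ref):
    """Estimated proportion of cells carrying a heterozygous mutation.

    q_var: number of reads supporting the variant allele.
    q_ref: number of reads supporting the reference allele.
    The factor 2 reflects that a mutated cell contributes the variant
    on only one of its two alleles.
    """
    total = q_var + q_ref
    if total == 0:
        raise ValueError("no reads cover this locus")
    return 2.0 * q_var / total


# Hypothetical counts: 150 variant reads out of 1000 covering the locus.
print(mutation_frequency(150, 850))  # 0.3
```

With noisy read counts the estimate can exceed 1 (whenever q_var is larger than q_ref); this sketch deliberately does not clip such values.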
In conclusion, using HTS methods of DNA sequencing we can obtain the set of mutations
present in the sequenced tissue together with the proportions of cells harbouring each mutation. For
the purpose of illustration, Figure 2.3 gives a simple example of such output, listing
the frequencies of mutations present in the hypothetical sample given in Figure 2.2, assuming
no noise is present in the estimated frequency values (note that this is not a valid assumption
for frequencies obtained from real sequencing data, where estimates are usually affected by
noise arising from errors and biases that occur during the DNA sequencing step).
Figure 2.3: Frequencies of mutations present in the sample shown in Figure 2.2.
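In the noise-free setting, the frequency of a mutation follows from the clonal proportions: since a mutation is inherited by all descendants of the node where it first appears, its frequency is the sum of the proportions of that node and all nodes below it. The sketch below illustrates this relationship only; it is not the CITUP inference procedure. The function name, tree and proportions are hypothetical, and every node is assumed to have an entry in both dictionaries.

```python
def mutation_frequencies(parent, proportion):
    """Frequency of the mutations introduced on the edge into each node.

    parent: dict mapping each node to its parent (the root maps to None).
    proportion: dict mapping each node to the fraction of cells whose
        genotype is exactly that node (assumed to sum to 1).
    """
    def depth(node):
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    freq = dict(proportion)
    # Visit the deepest nodes first so that each subtree total is complete
    # before it is added to the parent.
    for node in sorted(parent, key=depth, reverse=True):
        if parent[node] is not None:
            freq[parent[node]] += freq[node]
    return freq


# Hypothetical tree: normal -> A -> B, and normal -> C.
parent = {"normal": None, "A": "normal", "B": "A", "C": "normal"}
proportion = {"normal": 0.2, "A": 0.3, "B": 0.4, "C": 0.1}
freq = mutation_frequencies(parent, proportion)
print({node: round(value, 3) for node, value in freq.items()})
# {'normal': 1.0, 'A': 0.7, 'B': 0.4, 'C': 0.1}
```

The value at the root is always 1, because every cell in the sample descends from the root.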
-
Chapter 3
Model assumptions and problem
description
3.1 Phylogenetic tree model
As explained in the previous chapter, every successful somatic cell division
transforms a single cell into two daughter cells. The genomes of the daughter cells are
copies of the original cell’s genome, with the addition of mutations that occurred during
replication. Furthermore, all somatic cells originate from a single germline cell. Thus a
natural representation of somatic cellular evolution is a rooted full binary tree (see Rec-BTP
[5] and Figure 2.1). In such a model only the leaves of the tree are observable; internal nodes represent
unobservable ancestral cells.
For a full binary tree model, not all edges would be identifiable by mutations, either
because no mutations occurred from one cell division to the next, or because distinguishing
mutations were not detected. Thus, a more concise model [6] uses arbitrary rooted trees,
implicitly collapsing unidentifiable edges. Collapsing internal edges has the effect of allowing
nodes to have an arbitrary number of children. Furthermore, collapsing leaf edges implies
that internal nodes are observable. Figure 3.1 shows the arbitrary rooted tree representation of
the binary tree given in Figure 2.1.
Using arbitrary rooted trees has two major implications for CITUP. First, it limits
the number of fundamentally equivalent solutions produced by CITUP, allowing for easier
interpretability of the results. Second, since arbitrary rooted trees are more concise, CITUP
can consider fewer trees while still maintaining the same accuracy.
Figure 3.1: Arbitrary rooted tree representation of binary tree given in Figure 2.1.
3.2 Input data
As a consequence of evolutionary processes and the genomic instability present in most
tumors, chemotherapy or other types of treatment, epigenetic and many other factors, the
proportions of cells harbouring a specific genotype usually differ among timepoints of
disease progression and among anatomical sites. Sequencing a set of tumor samples obtained
from different timepoints of disease progression or different anatomical sites, or both, can
give us valuable information that can be exploited in solving the problems defined in the
following sections of this chapter. Denote by S the set of all samples that have been sequenced
and by M the set of heterozygous somatic mutations identified in at least one of the samples
from S. For the reasons already discussed in the previous chapter, we only consider mutations
from diploid regions of the genome.
Hence, the only input to our algorithm is an |M | × |S| matrix F , where Fij
denotes the frequency of mutation i in sample j. A simple example of input data, where
|M | = 10 and |S| = 2, is given in Table 3.1.
              Sample 1   Sample 2
Mutation 1    0.32       0.04
Mutation 2    0.23       0.24
Mutation 3    0.80       1.00
Mutation 4    0.06       0.55
Mutation 5    0.19       0.28
Mutation 6    0.30       0.00
Mutation 7    0.20       0.67
Mutation 8    0.77       0.95
Mutation 9    0.19       0.66
Mutation 10   0.20       0.24
Table 3.1: Simple input to the CITUP algorithm consisting of 2 samples. In total, 10 somatic
mutations have been identified. Their frequencies are estimated from the alignment of sequencing
data and most of them deviate from their true values due to the presence of noise. A mutation
might be undetected in, or completely absent from, some, but not all, samples.
3.3 Model assumptions
Similar to [6], in this work we make the infinite sites assumption about tumor evolution.
Somatic mutations are gained at most once per individual and cannot be lost via a subse-
quent reversion mutation. Additionally, we assume the tumor exhibits minimal aneuploidy,
thus mutations cannot be lost by deletion of the encompassing chromosomal region.
Assuming mutations cannot be lost or reverted, a mutation gained in a tumor cell will
be present in all of the descendants of that tumor cell. Trivially, a mutation that occurred
in the single common ancestor of a tumor will be present in 100% of the tumor cells. A
mutation that occurred in a specific lineage of the tumor phylogeny will be present in a
smaller proportion, provided that the other lineages have not all died out.
Based on the arguments mentioned in Section 2.2, we also impose the same phylogenetic
tree on all samples.
3.4 Problem description
Three common problems arise with the interpretation of input data:
• determination of the number and genotypes of major clonal subpopulations of tumor cells;
This problem consists of inferring the number of different clonal subpopulations present
in the sequenced tumor and identifying the set of mutations present in each subpopulation.
Note that a clonal subpopulation can be uniquely identified by this set.
• inference of the phylogeny relating clonal subpopulations;
This problem consists of identifying the tumor evolutionary history tree that best explains
the given input data. This tree is also known as the tumor phylogenetic tree. Each mutation
has to be placed along one and only one edge of the tree, and this placement
corresponds to its first appearance in tumor evolutionary history. At least one mutation
has to be placed along each edge of the tree; otherwise we have an unidentifiable
edge that would be collapsed in our arbitrary rooted tree model. Also in this model,
the normal cells can be represented by the root node. Each node corresponds to one
and only one clonal subpopulation that is uniquely identified by the set of mutations
assigned to the root node combined with the mutations appearing along the edges
that form the path from the root to that node.
• estimation of the proportion of each subpopulation over all samples;
This problem consists of assigning a real number αis ∈ [0, 1] to each subpopulation
i for each sample s. This number represents the proportion of tumor cells in sample s
harbouring the genotype of clonal subpopulation i. As there is a one-to-one correspondence
between nodes of the tumor phylogenetic tree and subclonal populations, this is equivalent
to assigning the number αis to the node corresponding to subpopulation i in sample
s. Although all samples share a common evolutionary tree, and hence the mutation placement
is shared among them, the frequencies assigned to nodes of the tree change among
samples.
In the next chapter we give details of our novel algorithmic approaches for solving the
problems introduced in this section. Figures 3.2 and 3.3 show one of the possible interpretations
of the example data given in Table 3.1.
Figure 3.2: One of the possible interpretations of the input data given in Table 3.1. This figure
shows the frequency assignment for Sample 1. For each node, except the root, the first number in
the node label represents the proportion of cells harbouring mutations that occurred along the
edge connecting that node with its parent. For the root node this number is always 1. The
number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified
by this node.
Figure 3.3: One of the possible interpretations of the input data given in Table 3.1. This figure
shows the frequency assignment for Sample 2. For each node, except the root, the first number in
the node label represents the proportion of cells harbouring mutations that occurred along the
edge connecting that node with its parent. For the root node this number is always 1. The
number inside the brackets shows the proportion of cells harbouring the genotype uniquely identified
by this node.
-
Chapter 4
Methods
4.1 Combinatorial Formulation
Let T represent the space of all rooted trees and let T ∈ T be a hypothetical phylogenetic tree
relating N = |V (T )| genetically distinct subpopulations. Let D(v) be the set of descendants
of node v. As already explained in previous chapters, in our formulation, genotypes are
represented with nodes (also referred to as clonal subpopulations or subclones in the text)
while subtrees rooted at a specific node are named clones. A mutation occurring at a node
in the tree is inherited by its descendants. Thus an assignment of the set of mutations to
their node of origin is sufficient to describe the genotypes of all nodes.
Define the clone proportion βvs as the proportion of the clone rooted at v in sample
s. Similarly, define the subclonal proportion αvs as the proportion of genotype v in sample
s. Subclonal proportions add up to 1 in each sample (equation 4.1). Furthermore, clone
proportions are related to subclonal proportions via the sum rule (equation 4.2).
\[ \forall s \in S: \quad \sum_{v \in V} \alpha_{vs} = 1 \tag{4.1} \]
\[ \beta_{vs} = \alpha_{vs} + \sum_{u \in D(v)} \alpha_{us} \tag{4.2} \]
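As a small illustration of the sum rule (equation 4.2), the sketch below computes clone proportions from subclonal proportions for a single sample. The parent-array encoding of the tree, the function name and the example values are assumptions made for illustration only.

```python
def clone_proportions(parent, alpha):
    """Return beta[v] = alpha[v] + sum of alpha[u] over all descendants u of v.

    parent[v] is the index of the parent of node v (None for the root, node 0).
    Nodes are assumed to be numbered so that every parent precedes its children.
    """
    beta = list(alpha)
    # Visit nodes from last to first, so each subtree total is complete
    # before it is added to the parent's total.
    for v in range(len(parent) - 1, 0, -1):
        beta[parent[v]] += beta[v]
    return beta


# Hypothetical 4-node tree: root 0 has children 1 and 2; node 3 is a child of 1.
parent = [None, 0, 0, 1]
alpha = [0.125, 0.25, 0.125, 0.5]   # subclonal proportions, summing to 1 (eq. 4.1)
print(clone_proportions(parent, alpha))  # [1.0, 0.75, 0.125, 0.5]
```

The root always receives clone proportion 1, since its clone is the whole tree.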
The expected value of the frequency of a mutation is equal to the clone proportion of the
node to which the mutation was assigned. Thus, the squared error incurred by assigning a
single mutation i to a node v in sample s is given by equation 4.3.
\[ e_{ivs} = (F_{is} - \beta_{vs})^2 \tag{4.3} \]
Let ∆ be an |M | × N binary matrix such that δiv = 1 iff mutation i first appeared
at node v, otherwise δiv = 0. We also introduce matrix A of dimensions N × |S|, where
Avs = αvs. Given T ∈ T, ∆ and A, the total squared error is given by equation 4.4.
\[ E(T, \Delta, A) = \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta_{iv} \, e_{ivs} \tag{4.4} \]
Minimization of squared error may result in overfitting, assigning each mutation to a
unique node in a very large tree. Instead, we minimize the Bayesian Information Criterion
(BIC) under the assumption that the noise is normally distributed with known variance σ².
The negative log likelihood can be expressed (up to an additive constant) as given by equation 4.5.
\[ L(F \mid T, \Delta, A) = \frac{E(T, \Delta, A)}{2\sigma^2} \tag{4.5} \]
Finally, BIC can be expressed as given by equation 4.6.
BIC(T,∆, A) = 2 · L(F |T,∆, A) + |S| · (N − 1) · log |M | (4.6)
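Equations 4.4 to 4.6 can be evaluated directly once a candidate solution is fixed. The following is a minimal sketch under stated assumptions: ∆ is stored as a list giving the node of each mutation, the logarithm is taken to be natural, and all names and numbers are hypothetical.

```python
import math


def total_squared_error(F, beta, assignment):
    """Equation 4.4. F[i][s] is the observed frequency of mutation i in sample s,
    beta[v][s] the clone proportion of node v in sample s, and assignment[i]
    the node to which mutation i is assigned (the single 1-entry in row i of Delta)."""
    return sum((F[i][s] - beta[v][s]) ** 2
               for i, v in enumerate(assignment)
               for s in range(len(F[i])))


def bic(F, beta, assignment, sigma2):
    """Equations 4.5 and 4.6 with |M| mutations, |S| samples and N nodes."""
    n_mut, n_samples, n_nodes = len(F), len(F[0]), len(beta)
    neg_log_lik = total_squared_error(F, beta, assignment) / (2.0 * sigma2)
    # Natural logarithm is an assumption; the text does not state the base.
    return 2.0 * neg_log_lik + n_samples * (n_nodes - 1) * math.log(n_mut)


# Hypothetical toy input: 2 mutations, 2 samples, 2 nodes.
F = [[1.0, 1.0], [0.5, 0.25]]
beta = [[1.0, 1.0], [0.5, 0.5]]
print(total_squared_error(F, beta, [0, 1]))  # 0.0625
print(bic(F, beta, [0, 1], sigma2=0.0625))
```

The penalty term grows with the number of nodes, so a larger tree is preferred only if it reduces the squared error enough to pay for the extra parameters.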
We propose to identify the optimal genotypes ∆opt, the subclone proportions Aopt and
phylogenetic relationship Topt as given by equation 4.7.
\[ \Delta_{opt}, A_{opt}, T_{opt} = \operatorname*{argmin}_{T, \Delta, A} \mathrm{BIC}(T, \Delta, A) \tag{4.7} \]
We refer to the above optimization problem as the mutation phylogeny problem. We propose
two methods to solve this problem, namely “CITUP_qip” and “CITUP_iter”. CITUP_qip
uses an exact Quadratic Integer Programming formulation, while CITUP_iter implements an
iterative heuristic. A detailed description of these implementations for solving the mutation
phylogeny problem is given in the following sections.
4.2 Method Outline
Given a fixed tree topology, define the mutation assignment problem as the problem of iden-
tifying A and ∆ that minimize mutation frequency error (equation 4.4). CITUP solves the
mutation phylogeny problem by iterating through all tree topologies up to a fixed number
of nodes Nmax, and solving the mutation assignment problem for each tree:
1. for each N ∈ {1, . . . , Nmax}, for each T ∈ TN
(a) identify A and ∆ that minimize equation 4.4 (mutation assignment problem)
(b) calculate BIC for T using equation 4.6
2. select T , A and ∆ that minimize BIC
We propose two methods for solving the mutation assignment problem: a Quadratic Integer
Programming based approach (CITUP_qip), and an iterative heuristic approach (CITUP_iter)
as explained below.
4.3 Quadratic Integer Programming (QIP) method
QIP based approaches guarantee an optimal solution but limit the feasible problem size. To
ensure a reasonable running time for the QIP approach on larger (>20 mutations) problem
sizes, we first cluster the mutations into N sets by their mutation frequency, where N is
the number of nodes in the current tree topology. We then limit the solution space for ∆
by adding the constraint that all mutations in a cluster must be assigned, en masse, to a
single node. We use multivariate k-means clustering as implemented in the Python scikit-learn
package to cluster mutations.
Let c : M → {1, . . . , N} be a mapping from mutations to clusters. Let ∆′ be an N × N
binary matrix such that δ′c(i)v = 1 iff mutation i assigned to cluster c(i) originated at node
v, otherwise δ′c(i)v = 0. The total squared error given by equation 4.4 can be rewritten as
equation 4.8.
\[ E(T, \Delta, A) = \sum_{i \in M} \sum_{s \in S} \sum_{v \in V} \delta'_{c(i)v} \, e_{ivs} \tag{4.8} \]
Requiring that each cluster must be assigned to exactly one node adds the constraint
given by equation 4.9.
\[ \forall n \in \{1, \ldots, N\}: \quad \sum_{v \in V} \delta'_{nv} = 1 \tag{4.9} \]
Additionally, we require that all non-root nodes must have at least one cluster of muta-
tions assigned to them, resulting in the constraint given by equation 4.10,
\[ \forall v \in V \setminus \{r\}: \quad \sum_{n \in \{1, \ldots, N\}} \delta'_{nv} \ge 1 \tag{4.10} \]
where r denotes the root node.
The QIP approach minimizes the squared error objective (equation 4.8), subject to the
subclonal proportion constraints (equation 4.1), the clone proportion constraints (equation
4.2), and the cluster assignment constraints (equations 4.9 and 4.10).
4.4 QIP optimizations
The objective given by equation 4.8 is not well suited for QIP solvers. Below, we introduce
auxiliary variables and constraints to convert our objective function to a form that is easier
to solve. For mutation i, node v and sample s, introduce variable xivs subject to the
following constraints.
\[ x_{ivs} \ge F_{is} - \beta_{vs} \tag{4.11} \]
\[ x_{ivs} \ge \beta_{vs} - F_{is} \tag{4.12} \]
Similarly, introduce variable yivs subject to the following constraints:
\[ y_{ivs} \ge \delta'_{c(i)v} - 1 + x_{ivs} \tag{4.13} \]
\[ y_{ivs} \ge 0 \tag{4.14} \]
The modified QIP minimizes the objective given by equation 4.15, subject to the additional
constraints for xivs and yivs.
\[ \sum_{i \in M} \sum_{v \in V} \sum_{s \in S} y_{ivs}^2 \tag{4.15} \]
It is easy to see that, at the optimum, whenever δ′c(i)v = 1, yivs will be set to xivs; otherwise, it will be set
to 0. Hence, minimizing the objective given in equation 4.15 is equivalent to minimizing the
objective given in equation 4.8. It can also be easily verified that the Hessian of 4.15 is positive
semidefinite, implying its convexity.
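The equivalence between the two objectives can be checked numerically for a single (i, v, s) triple. This is only an illustrative sketch: the function name is hypothetical, and the check assumes that both the observed frequency and the clone proportion lie in [0, 1], so that the tightest x never exceeds 1.

```python
def tightest_y(delta, F, beta):
    """Smallest y permitted by constraints 4.11-4.14 for one (i, v, s) triple."""
    x = abs(F - beta)               # smallest x satisfying 4.11 and 4.12
    return max(0.0, delta - 1 + x)  # smallest y satisfying 4.13 and 4.14


# With values in [0, 1], y squared reproduces delta * (F - beta)^2,
# i.e. one term of the objective in equation 4.8.
for delta in (0, 1):
    for F, beta in [(0.25, 0.75), (1.0, 0.0), (0.5, 0.5)]:
        assert tightest_y(delta, F, beta) ** 2 == delta * (F - beta) ** 2
```

Because the objective is minimized, a solver pushes every x and y down to these tightest values, which is why the auxiliary variables do not change the optimum.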
4.5 Heuristic Iterative Method
We also propose a heuristic iterative method for solving the mutation assignment problem.
The iterative heuristic is significantly faster than the QIP with only a small degradation in
performance observed in our evaluations.
In brief, the iterative heuristic solves two subproblems iteratively until convergence.
Problem 1: given a fixed ∆, calculate the (necessarily unique) A that minimizes equation
4.4. Problem 2: with A fixed to the value calculated in the previous step, calculate the ∆
that minimizes equation 4.4. Each step is guaranteed to not increase the objective given by
equation 4.4, thus the algorithm is guaranteed to converge to a local optimum.
Problem 1 is a convex quadratic programming problem and can be solved efficiently with
existing convex optimization software. The objective given by equation 4.4 is solved subject
to constraints given by equations 4.1 and 4.2. Problem 2 can be solved by independently
assigning each mutation to the node v that minimizes equation 4.3.
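Problem 2 reduces to a nearest-node search per mutation. The sketch below illustrates that step only (Problem 1 would need a convex quadratic programming solver and is not shown). The function name and the clone proportions are hypothetical; the frequencies are four rows taken from Table 3.1.

```python
def assign_mutations(F, beta):
    """For fixed clone proportions beta[v][s], assign each mutation i to the
    node v with the smallest squared error summed over samples (equation 4.3)."""
    def cost(i, v):
        return sum((f - b) ** 2 for f, b in zip(F[i], beta[v]))

    n_nodes = len(beta)
    return [min(range(n_nodes), key=lambda v: cost(i, v)) for i in range(len(F))]


# Hypothetical clone proportions for a 3-node tree in 2 samples.
beta = [[1.0, 1.0], [0.3, 0.05], [0.2, 0.65]]
# Mutations 3, 1, 7 and 9 of Table 3.1.
F = [[0.80, 1.00], [0.32, 0.04], [0.20, 0.67], [0.19, 0.66]]
print(assign_mutations(F, beta))  # [0, 1, 2, 2]
```

Since each mutation is handled independently, this step costs time proportional to |M| · N · |S|.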
The iterative heuristic is not guaranteed to identify a globally optimal solution, and
as such, results depend heavily on initialization. We mitigate this problem using multiple
restarts with random initializations of ∆. A random ∆ is generated by independently
assigning each mutation to a node chosen uniformly at random from the
nodes of the tree. We perform 1000 restarts with different random seeds, and select the
solution that minimizes equation 4.4.
4.6 Enumerating rooted trees
We use the Beyer-Hedetniemi algorithm [2] to enumerate rooted tree topologies up to the
user-defined number of nodes (Nmax). The numbers of non-isomorphic rooted trees for
N = 1, . . . , 10 nodes are as follows: 1, 1, 2, 4, 9, 20, 48, 115, 286, 719.
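The enumeration can be sketched with the level-sequence successor rule of Beyer and Hedetniemi. The code below is an illustrative reimplementation, not the code used in CITUP: each tree is represented by the depths of its nodes in preorder (root at level 1), and the generator name is an assumption.

```python
def rooted_trees(n):
    """Yield the canonical level sequence of every non-isomorphic rooted
    tree on n >= 1 nodes, starting from the path and ending with the star."""
    levels = list(range(1, n + 1))  # the path: node k sits at level k
    while True:
        yield tuple(levels)
        # p: last position whose level is greater than 2.
        p = next((i for i in range(n - 1, -1, -1) if levels[i] > 2), None)
        if p is None:
            return  # only the root and its children remain: the star
        # q: last position before p at the parent level of p.
        q = next(i for i in range(p - 1, -1, -1) if levels[i] == levels[p] - 1)
        # Replace the tail by repeated copies of the segment starting at q.
        shift = p - q
        for i in range(p, n):
            levels[i] = levels[i - shift]


counts = [sum(1 for _ in rooted_trees(n)) for n in range(1, 11)]
print(counts)  # [1, 1, 2, 4, 9, 20, 48, 115, 286, 719]
```

For example, `list(rooted_trees(4))` yields `(1, 2, 3, 4)`, `(1, 2, 3, 3)`, `(1, 2, 3, 2)` and `(1, 2, 2, 2)`, matching the four rooted trees on 4 nodes.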
4.7 Model selection
In practice, the variance σ² required to calculate equation 4.5 is often unknown and must
be estimated from the data. We estimate σ² by clustering the mutation frequencies using
a k-component Gaussian Mixture Model (GMM) with spherical covariance matrix, where
k is selected to minimize the BIC of the GMM. We then use the estimated variance of the
GMM as σ².
We remark that this model selection procedure can only distinguish trees with the same
number of nodes if they have different objective function scores. In practice, we have found
that two distinct trees with an equal number of nodes can have identical objective scores.
Following other tools developed for this problem [19, 6], in such cases we report all solutions
with the best score.
-
Chapter 5
Results
In this chapter we present the performance of our algorithm on simulated and real datasets.
For the simulations and CLL datasets we have used Nmax = 7 and for the AML datasets
we have set Nmax = 8.
5.1 Datasets
To evaluate our method, we use both simulated and real datasets. For simulations, we
experiment with a variety of trees with differing number of subclones and model parameters.
We report the performances of both CITUP_qip and CITUP_iter, using several measures
that are explained in the following section. We compare the performance of CITUP to
the performances of TrAp [19] and PhyloSub [6], which can handle multi-sample datasets.
Additionally, we report a separate comparison between CITUP and Rec-BTP [5] on a smaller
set of single-sample simulations. We limit our comparison to these tools since our model
does not support the type of input required by [18] and [13]. While the method of [16] also
works with SNV data, their model is not directly comparable to ours due to incompatible
assumptions and goals.
We also evaluate the utility of our method on two real datasets. The first dataset is
taken from a Chronic Lymphocytic Leukemia (CLL) study by Schuh et al. [17]. This
dataset contains targeted deep sequencing measurements of 3 CLL patients sampled at
5 time points. The second dataset consists of a study involving Acute Myeloid Leukemia
(AML) patients by Ding et al. [3]. This dataset features a large number of somatic indels and
single nucleotide variants (SNVs); however, only 3 sample points (designated as “normal”,
“tumor” and “relapse”) are available per patient. Since the simulations show the QIP and
iterative versions to have similar performance, we only report the results of CITUP_qip on
the real datasets.
5.2 Evaluation criteria
We evaluate the performance of CITUP on the simulation sets using several measures. To
compute these measures, we first obtain a matching between the predicted tree and the true
tree as explained below.
Let T = (V,E) denote the simulated tree, which we are trying to find, and let T ′ =
(V ′, E′) denote the tree predicted by CITUP. We first check whether T and T ′ have identical
topologies as a measure of success. In general, however, the topology of T may be different
from T ′. In such cases, computing the correspondence of the nodes in each tree is not
trivial. To accomplish this, we first create a complete bipartite graph G, where one partition,
denoted by A, consists of the nodes of T and the other partition, B consists of the nodes
of T ′. If |V | ≠ |V ′|, then we add dummy nodes to the partition with the fewer nodes until
both partitions have exactly max(|V |, |V ′|) nodes.
We denote by Ai the set of mutations assigned to node i in T . Similarly, we define Bj to
be the set of mutations assigned to node j in T ′. If i (or j) is a dummy node, then Ai = ∅
(resp. Bj = ∅). For each edge (i, j) in G, we calculate its weight as the number of mutations
that are assigned to exactly one of i or j. We denote this weight by c(i, j). We then search for
a matching f : A→ B that minimizes:
\[ \sum_{i \in A} c(i, f(i)) \tag{5.1} \]
This problem is known as “Minimum Bipartite Matching”, for which efficient
polynomial time algorithms exist [8]. Once we obtain a one-to-one matching between the
nodes of the two trees, we calculate the following scores:
1. Correct tree proportion: (M0) This is the proportion of correctly identified tree topolo-
gies to the total number of simulations in each experiment.
2. Clone proportion error: (M1) For this measure, we compute:
\[ \frac{\sum_{u \in T^*} \left| \beta^{T^*}_{u} - \beta^{T^{**}}_{g(u)} \right|}{|V^*|} \tag{5.2} \]
Above, T ∗ denotes the smaller of the trees T and T ′ while T ∗∗ denotes the larger one. V ∗
is defined to be the set of nodes in T ∗ and βXn represents the frequency of clone n in tree
X. If T ∗ is the true tree, we define g ≡ f . Otherwise we set g ≡ f−1.
3. Misplaced mutation proportion: (M2) Suppose a mutation m is assigned to a node u
in the true tree T . If it is assigned to f(u) in T ′, we say that m is correctly placed,
otherwise we say it is misplaced. M2 is set to the number of misplaced mutations
divided by the total number of mutations in the dataset. This measure essentially
evaluates the mutation clustering accuracy.
4. Phylogenetic accuracy: (M3) For this measure, we count the number of phylogenetic
relationships that are preserved. We use two types of mutually exclusive relationships:
ancestor/descendant and non-ancestor/descendant. For example, if a mutation A
emerges at a clone that is an ancestor of another clone where mutation B emerges,
we say that A is an ancestor of B (or alternatively B is a descendant of A). If this
relationship is reversed in the predictions, it is counted as non-preserved. If two
mutations do not have an ancestor/descendant relationship, they are marked as a non-
ancestor/descendant pair. If such a pair is predicted to have an ancestor/descendant
relationship, this pair is also counted as non-preserved.
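The node matching (equation 5.1) and measure M2 can be sketched as follows. For brevity this illustration searches all permutations by brute force, which is only practical for the small trees considered here; it is a stand-in for the polynomial time matching algorithms cited above, and all names and example sets are hypothetical.

```python
from itertools import permutations


def match_nodes(true_sets, pred_sets):
    """Return (f, cost), where f[i] is the node of T' matched to node i of T.

    true_sets[i] and pred_sets[j] are the sets of mutations assigned to node i
    of T and node j of T'. The shorter side is padded with empty dummy nodes.
    """
    n = max(len(true_sets), len(pred_sets))
    A = list(true_sets) + [set()] * (n - len(true_sets))
    B = list(pred_sets) + [set()] * (n - len(pred_sets))

    def total(f):
        # c(i, j) = number of mutations assigned to exactly one of i and j.
        return sum(len(A[i] ^ B[f[i]]) for i in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)


def misplaced_proportion(true_sets, pred_sets, f):
    """Measure M2: share of mutations not placed at the matched node."""
    padded = list(pred_sets) + [set()] * (len(f) - len(pred_sets))
    n_mut = sum(len(s) for s in true_sets)
    correct = sum(len(s & padded[f[i]]) for i, s in enumerate(true_sets))
    return (n_mut - correct) / n_mut


true_sets = [{1, 2}, {3, 4}, {5}]   # hypothetical true tree with 3 nodes
pred_sets = [{3, 4, 5}, {1, 2}]     # hypothetical prediction with 2 nodes
f, cost = match_nodes(true_sets, pred_sets)
print(f, cost)                                        # [1, 0, 2] 2
print(misplaced_proportion(true_sets, pred_sets, f))  # 0.2
```

In this example mutation 5 is the only misplaced mutation: its true node is matched to a dummy node because the prediction merged it into another cluster.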
5.3 Evaluation on simulated datasets
We evaluate the performances of CITUP_qip and CITUP_iter compared to TrAp and PhyloSub
using a large set of simulations. For these simulations, we generate random tree
topologies T with 3 to 6 subclones and 3 to 7 samples. The frequencies of subclones are
simulated using a Dirichlet distribution with parameter α, ranging from 0.1 to 10.0. For
each simulation, we generate a set of 500 mutations that are uniformly distributed to the
subclones. The frequencies of these mutations are then altered through additive Gaussian
noise with standard deviation between 0.02 and 0.1.
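A simulation of this kind can be sketched with the Python standard library alone. The sketch is an assumption-laden illustration rather than the simulation code used in the thesis: it uses a symmetric Dirichlet distribution drawn via normalized gamma variates, lets mutations originate at any node including the root, and all names are hypothetical.

```python
import random


def simulate_frequencies(parent, n_samples, n_mut, alpha, noise_sd, seed=0):
    """Simulate noisy mutation frequencies on a fixed tree.

    parent[v] is the parent of node v (None for the root, node 0), with
    parents numbered before their children. Returns (origin, F), where
    origin[i] is the node of mutation i and F[i][s] its noisy frequency.
    """
    rng = random.Random(seed)
    n = len(parent)
    # Each mutation originates at a node chosen uniformly at random.
    origin = [rng.randrange(n) for _ in range(n_mut)]
    F = [[0.0] * n_samples for _ in range(n_mut)]
    for s in range(n_samples):
        # Symmetric Dirichlet draw: normalized independent gamma variates.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
        total = sum(gammas)
        beta = [g / total for g in gammas]   # starts as subclonal proportions
        for v in range(n - 1, 0, -1):        # sum rule: add subtrees to parents
            beta[parent[v]] += beta[v]
        for i in range(n_mut):
            F[i][s] = beta[origin[i]] + rng.gauss(0.0, noise_sd)
    return origin, F


origin, F = simulate_frequencies([None, 0, 0, 1], n_samples=4, n_mut=500,
                                 alpha=1.0, noise_sd=0.05)
print(len(F), len(F[0]))  # 500 4
```

Small values of alpha concentrate most of each sample in a few subclones, while large values make the subclonal proportions nearly equal.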
We compare each true tree T with the trees obtained by the tools based on the four evaluation
criteria introduced above. For CITUP_qip, we first cluster the mutations as described
in Methods. As the current version of TrAp does not have a module for clustering and we
were unable to run it on the individual mutations, we use our own clustering method for
TrAp as well. Since our model selection procedure is unlikely to work with TrAp’s heuristic
model, we had to provide TrAp with the clustering of the correct size. We emphasize
that despite this significant advantage, TrAp performs worse than CITUP with respect to
most of our criteria (see below). For PhyloSub and CITUP_iter we use the individual set of
mutations.
Since all four methods can output multiple solutions, we devise the following protocol
in order to compute the evaluation measures. For CITUP_qip, CITUP_iter and TrAp, we
randomly choose up to 3 trees out of all (top scoring) solutions reported by the tool. If there
are only 1 or 2 reported solutions, we pick only these. Since PhyloSub reports 3 solutions
by default, we simply pick these solutions. For each tool, if one of the chosen solutions has
the correct tree topology, we use that solution to calculate all the measures for that tool.
Otherwise, we select one of them randomly.
Figure 5.1 summarises the results of these simulations. Note that for each selection of
parameters, we repeat the experiment 10 times.
The first column of the figure demonstrates the effect of the number of subclones/nodes
on all four criteria. The number of nodes varies between 3 and 6; in all cases the number of
samples is set to 4, the Gaussian noise deviation is set to 0.05 and the frequency imbalance,
as determined by the parameter α of the Dirichlet distribution, is set to 1.0.
The second column demonstrates the effect of the number of samples on the four criteria.
The number of samples now varies between 3 and 7; in all cases the number of nodes is set
to 5, and again, the Gaussian noise deviation is set to 0.05 and α is set to 1.0. We note that
we were unable to run PhyloSub for 7 samples due to limitations of this software. Hence,
in this case the comparison is only between the other methods.
The third column depicts the effect of increasing noise (primarily due to sequence coverage
variation). The Gaussian noise deviation now varies between 0.02 and 0.1, for 4 samples,
5 subclones and α = 1.0.
The fourth column depicts the effect of imbalance in subclones where α varies between
0.1 and 10.0, again for 4 samples, 5 subclones and noise deviation of 0.05.
From Figure 5.1, we see that both CITUP_qip and CITUP_iter find the correct tree
topology more often than TrAp, despite the fact that TrAp is already provided with the
correct number of clusters. In other words, while the other tools have to simultaneously
identify the right tree size and topology, TrAp only has to find the right topology of the
given tree size. Compared to CITUP and TrAp, PhyloSub performs poorly with respect to
this measure.
[Figure 5.1: 4 × 4 grid of boxplots. Rows (y-axes): correct tree proportion, clone proportion error, misplaced mutation proportion, phylogenetic accuracy. Columns (x-axes): number of nodes, number of samples, mut. frequency noise, sample dirichlet alpha. Legend: CITUP_qip, CITUP_iter, TrAp, PhyloSub.]
Figure 5.1: Simulation results for TrAp, PhyloSub and CITUP (QIP and iterative procedures)
under the four evaluation criteria. The rows depict measures M0 to M3. The first
column investigates the effect of the number of subclones/nodes in the dataset, the second
investigates the effect of the number of samples, the third investigates the effect of noise
added to the mutation frequencies and the fourth investigates the effect of non-uniformity
among subclone frequencies. The figure is drawn using the boxplot function in Python’s
matplotlib library: the line within each box is the mean and the box boundaries mark the
25% and 75% values. The extreme outliers are depicted with + symbols. Note that we
were unable to run PhyloSub on 7 samples, so the corresponding bars are absent from this
column.
Similarly, CITUP typically performs better than the other tools in terms of phylogenetic
accuracy with a score of 60% or more in most cases. This suggests that even when the correct
tree is not found, the majority of phylogenetic relationships are preserved.
In estimating clonal frequencies, we see that CITUP outperforms both TrAp and Phy-
loSub, while TrAp performs best with respect to the ratio of misplaced mutations. We
remark that this is likely due to TrAp’s unfair advantage of being given the clustering with
the correct number of clusters. Note that this measure is evaluated by a one-to-one match-
ing between the nodes of the predicted and the true tree using only the mutations assigned
to (but not inherited by) the node. Hence, even when the predicted topology is not identical
to the correct tree, this measure can have a perfect score as long as the initial clustering
groups the mutations correctly. This, by definition, can only happen when the clustering
is performed with the correct number of clusters. Indeed, Figure 5.1 shows that whenever
CITUP identifies the correct tree topology (hence, the correct tree size) 10 out of 10 times,
it performs on par with TrAp. This suggests that TrAp’s apparent superiority to CITUP
in this measure is simply due to the high accuracy of our clustering method.
Sensitivity analysis of CITUP_iter on the same set of simulated data with respect to
starting points is given in Figure 5.2. Overall, we see that CITUP_qip and CITUP_iter
perform similarly under most conditions, although CITUP_qip seems to be slightly more
resilient to extreme values of simulation parameters (e.g. sample Dirichlet alpha and
mutational frequency noise). Hence, we have chosen to proceed with CITUP_qip for the real
datasets.
5.4 Comparison with Rec-BTP
We have also performed a separate comparison between CITUP_qip and Rec-BTP. Since
Rec-BTP does not support multi-sample datasets, for these experiments we have simulated
single-sample datasets with 500 mutations for 4, 5 and 6 node trees. In each case, we
generate 10 simulations adding up to 30 datasets in total. The topologies of the trees were
chosen randomly as before. Since the current version of Rec-BTP does not report which
mutations are assigned to each subclone, we were restricted to a very limited evaluation of
the performance of this tool. Briefly, we compare the results of the two methods based on
i) the number of subclones predicted and ii) an RMSD measure of the predicted subclonal
frequencies similar to the one employed in [5]. In terms of the first measure, CITUP was
able to find the correct number of subclones in 50% of the simulations (15 out of 30). In
contrast, Rec-BTP only identified the correct number of subclones in 23.3% of the cases
(7 out of 30). CITUP also outperformed Rec-BTP with respect to the RMSD measure:
the average RMSD values for CITUP and Rec-BTP in 30 simulations were 0.02 and 0.05
respectively.
[Figure 5.2: 2 × 4 grid of plots. Top row (y-axis): objective value error. Bottom row (y-axis): proportion optimal. Columns (x-axes): number of nodes, number of samples, mut. frequency noise, sample dirichlet alpha.]
Figure 5.2: Sensitivity analysis of CITUP_iter with respect to starting points. Top: Distribution
of errors in the objective for different restarts of the algorithm, where error is defined
as the difference between the local minimum objective value reached by CITUP_iter and
the global minimum reached by CITUP_qip. Bottom: The proportion of iterative restarts
that reach the global minimum within 10^−9.
Patient   No. of mutations   No. of subclones   No. of solutions   Wall-clock time (min)
CLL003    19                 5                  1                  1.64
CLL006    9                  5                  2                  0.32
CLL077    15                 5                  1                  0.84
Table 5.1: Summary of CITUP’s results on the CLL dataset. The second column refers
to the number of mutations as reported by [17]. The third column reports the number of
subclones (including normal cells) found in the best solution. The number of solutions
column shows how many distinct solutions are found with the best score.
5.5 Results on Chronic Lymphocytic Leukemia datasets
We evaluate the performance of CITUP_qip on the Chronic Lymphocytic Leukemia (CLL)
dataset of [17]. This dataset consists of single nucleotide and small indel mutations as
inferred from Whole-Genome Sequencing (WGS) data from 3 CLL patients. Each patient is
sampled at five time points while receiving a variety of treatments. The authors also perform
targeted deep sequencing for a limited number of mutations found through WGS. Although
a high number of somatic mutations are detected for each patient, only the frequencies of
coding mutations are made available by Schuh et al. Hence, we are only able to use the
coding mutations as input to our algorithm. Since the number of mutations is small for
these datasets, we manually removed mutations that are not heterozygous as reported by
[17].
Table 5.1 gives a summary of CITUP’s performance on all three patients.
The trees (Figures 5.3, 5.4 and 5.5) and the clonal frequencies reported by CITUP for
these patients match the results reported in [17] very closely: the mean absolute deviations
are 0.0088, 0.0016 and 0.0048 for patients CLL003, CLL006 and CLL077 respectively. Note
that while CITUP does not assign mutations to the root nodes in CLL003 and CLL077, the
root node in CLL006 is assigned 5 mutations. This is in agreement with the observation in
[17] that the normal contamination in this patient is insignificant and suggests that CITUP
is able to automatically handle the presence or absence of healthy cell contamination.
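The mean absolute deviation used above to compare clonal frequencies can be sketched as follows. This is an illustrative sketch only: it assumes the two frequency tables are laid out identically (rows as subclones, columns as time points, with subclones already matched), and the values are hypothetical rather than taken from the CLL results.

```python
def mean_absolute_deviation(reference, predicted):
    """Mean of |reference - predicted| over all subclone/time-point entries."""
    diffs = [
        abs(r - p)
        for ref_row, pred_row in zip(reference, predicted)
        for r, p in zip(ref_row, pred_row)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical frequencies: rows are subclones, columns are time points.
reference = [[0.60, 0.40], [0.40, 0.60]]
predicted = [[0.61, 0.40], [0.39, 0.60]]
mad = mean_absolute_deviation(reference, predicted)
print(f"mean absolute deviation = {mad:.4f}")
```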
Although CITUP finds two distinct topologies for patient CLL006, a chain topology and
a branching topology, the clonal frequencies remain the same in both cases. We note that the
number of deep sequencing mutations is quite small for this dataset, possibly resulting in an
ambiguity with respect to the tree topology. To see if additional mutations can help identify
the true tree, we also ran CITUP on the WGS predictions for this dataset, containing 16
mutations. In this case, CITUP reported a single solution with a chain topology (data not
shown). Thus we conclude that the true solution is likely to be the one reported in Figure 5.4,
which also matches the tree topology predicted by [17].
Figure 5.3 suggests a switch between subclones ‘d’ and ‘e’ (referred to as subclones 4
and 2 in [17]) around time-point 3. This is also in agreement with the disease progression
as reported by Schuh et al., where the third time-point is classified as “complete response
+ minimal residual disease”. At the same time, subclone ‘d’ starts gaining
dominance. The fourth and fifth time-points (as well as the first two time-points) are
designated as “progressive disease”, suggesting that subclone ‘d’ replaces ‘e’ as the driver
subclone while the tumor relapses. In contrast, Figures 5.4 and 5.5 imply a more stable
subclonal composition over the time points. We note that the survival times of these patients
are also longer than that of CLL003 (6+ and 9 versus 3 years), which may be linked to this slower
pace of the clonal dynamics.
5.6 Results on Acute Myeloid Leukemia datasets
Next, we evaluate CITUP qip on an Acute Myeloid Leukemia (AML) dataset [3]. This
dataset contains sequencing data from primary tumor and relapse samples after chemother-
apy treatment, in addition to matched normal tissue for each patient. Although the normal
tissue is typically sampled to distinguish somatic mutations, we also consider it as a sample
since some of these tissues contain various degrees of cancer contamination and thus can be
helpful in identifying subclones. Similar to the CLL dataset, we preprocess the mutations
based on their copy-number analysis as reported by [3]. Briefly, we only keep autosomal
mutations that are copy-number neutral. A summary of CITUP’s performance on 8 patients
taken from this dataset is given in Table 5.2.
Figure 5.3: CITUP predictions for patient CLL003. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.
Figure 5.4: CITUP predictions for patient CLL006. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.
Figure 5.5: CITUP predictions for patient CLL077. Left: Estimated subclonal proportions for the five time points (ordered from inner to outer circles). Right: The predicted evolutionary tree and the mutations assigned to each subclone. Note that each node is also assumed to inherit mutations that emerge at its ancestors.
Patient     No. of      No. of      No. of      Wall-clock
            mutations   subclones   solutions   time (hours)
UPN400220   265         7           1           1.71
UPN426980   822         7           1           23.00
UPN452198   97          5           4           0.14
UPN573988   144         3           2           1.02
UPN758168   412         7           2           3.33
UPN804168   589         8           1           6.89
UPN869586   1160        8           1           23.00
UPN933124   270         6           1           3.75
Table 5.2: Summary of CITUP’s results on the AML dataset. The second column refers to the total number of indel and single nucleotide mutations as reported by [3]. The third column reports the number of subclones (including normal cells) found in the best solution. The number of solutions column shows how many distinct solutions are found with the best score.
Due to the large number of mutations, CITUP qip requires considerably more CPU time
to run on this dataset compared to the CLL dataset. Nonetheless, we note that CITUP was
able to optimize all but two datasets to an exact solution when a wall-clock time limit of 23
hours is imposed for each dataset. Moreover, the total CPU time taken on these datasets
indicates a quadratic to sub-quadratic practical running time.
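One rough way to examine such a scaling claim is a least-squares fit on a log-log scale. The sketch below is an illustration only: it uses the wall-clock times from Table 5.2 as a stand-in for the total CPU times (which are not listed in the table), it excludes the two datasets that hit the 23-hour limit, and it ignores the differing numbers of subclones, so the fitted exponent should be read as indicative at best.

```python
import math

# (number of mutations, wall-clock hours) for the six AML datasets in
# Table 5.2 that finished before the 23-hour limit. Wall-clock time is an
# assumed proxy for CPU time here.
runs = [(265, 1.71), (97, 0.14), (144, 1.02), (412, 3.33), (589, 6.89), (270, 3.75)]


def loglog_slope(points):
    """Least-squares slope of log(time) against log(number of mutations)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope = loglog_slope(runs)
print(f"estimated exponent: {slope:.2f}")
```

An exponent near or below 2 would be consistent with a quadratic to sub-quadratic running time, although six data points give only a coarse estimate.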
The number of subclones identified per patient is also higher than the number of sub-
clones predicted for CLL patients. We believe this is likely due to the increased ability to
detect subclones that differ by non-coding somatic mutations. To investigate this, we have
also obtained CITUP qip results on 3 of the AML datasets (UPN426980, UPN804168 and
UPN869586) using coding mutations only. Although the number of subclones predicted
was smaller in all 3 cases, the overall clonal architecture in the newly predicted trees was
typically similar to the trees estimated from the full set of mutations.
While it is unknown whether the non-coding mutations play an important role in cancer
progression, some may be hitchhiker mutations which represent subclones that differ by
other types of aberrations such as gene fusions. Furthermore, some non-coding mutations
may still be functional; for example, some intronic mutations are known to affect splicing.
Thus, we believe that phylogenetic trees derived from the full set of mutations may have
better potential to represent the true cancer progression.
Since a full phylogenetic relationship analysis is absent from [3] and the ground truth
solutions are not known, we cannot directly evaluate our predicted trees. Figure 5.6 shows,
however, that the tumor purities inferred by CITUP generally agree with those reported
by [3] for primary and relapse samples. Note that since CITUP does not explicitly predict
tumor purity, for each sample this value is estimated as (1.0 − α_{rs}) if the root node is
not assigned any mutations, where α_{rs} is the predicted genome frequency of the root node
in that sample. Otherwise, the tumor purity is considered to be 1.0 (assuming germline
mutations have been excluded from the study).
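The purity rule described above can be written as a short sketch. The function and argument names are hypothetical and are not part of CITUP; the logic simply follows the two cases stated in the text.

```python
def estimate_tumor_purity(root_frequency, root_has_mutations):
    """Estimate tumor purity for one sample from the root node's frequency.

    root_frequency: predicted frequency of the root node in the sample.
    root_has_mutations: True if the root node is assigned any mutations.
    """
    if not 0.0 <= root_frequency <= 1.0:
        raise ValueError("root_frequency must lie in [0, 1]")
    if root_has_mutations:
        # The root carries somatic mutations, so it is treated as tumor.
        return 1.0
    # A mutation-free root is interpreted as normal cell contamination.
    return 1.0 - root_frequency


print(estimate_tumor_purity(0.25, root_has_mutations=False))
print(estimate_tumor_purity(0.25, root_has_mutations=True))
```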
The only striking difference between the tumor purities inferred by [3] and CITUP is
in the relapse sample of patient UPN869586. The CITUP prediction for this patient is given
in Figure 5.7. The figure suggests that while the founder clone ‘b’ (and its descendants) is
present at a lower abundance in the relapse sample, which may correspond to the tumor
purity of 40% reported by [3], CITUP predicts another emerging clone in the relapse sample
(i.e. clone ‘g’). Interestingly, although no coding mutation is assigned to clone ‘g’, we have
found that some of the mutations assigned to this clone are located in the intronic regions
of several genes including IL15 and GPC5. Notably, the tumor purity estimate in the
relapse sample using coding-only mutations for this patient is closer to the purity estimate
reported in [3].
5.7 Computing environment and running parameters
For each simulated dataset, we converted our simulated mutation frequencies to PhyloSub’s
input format as follows. We assumed each mutation had sequencing depth of 1000 reads,
and set the number of variant reads to 1000 · f, where f is the simulated mutation frequency.
We assumed a sequencing error rate of 0.001. PhyloSub takes a significant amount of
computation time, and thus it was necessary to make a minor modification to evolve.py to
provide the ability to specify a maximum allowable computation time on the command line.
If the specified computation time is exceeded, the sampler exits cleanly, reporting the top k
trees identified thus far. For each simulated dataset, we ran PhyloSub using the evolve.py
command specifying 1000 MH iterations and 1000 MCMC samples, and 3 restarts with
different seeds. Computation time was limited to 95 hours per restart. PhyloSub completed
on average 784.7 MCMC samples within the allowable computation time (standard deviation
90.6 samples). For TrAp, we used the cluster frequencies as mentioned above and ran it in
multi-sample mode with default parameters. For Rec-BTP, the clustering of the mutations
was performed using AVDPGM with the same parameters as described in [5]. Rec-BTP
[Figure 5.6: plot with a vertical axis from 0 to 1 and four series: Primary (Ding et al.), Primary (CITUP), Relapse (Ding et al.), Relapse (CITUP).]
Figure 5.6: Tumor purities predicted by [3] and CITUP in primary and relapse samples of AML patients. For the three patients with multiple reported solutions, UPN758168 had the same root frequencies in both solutions. For UPN452198 and UPN573988, we pick the frequencies closest to the ones given in [3].
[Figure 5.7: subclonal proportion chart and evolutionary tree; subclones are labeled a to h.]
Figure 5.7: CITUP predictions for patient UPN869586. Left: The estimated subclonal proportions for tumor (inner) and relapse (outer) samples. Right: The predicted evolutionary tree and the coding mutations assigned to each subclone. The numbers in parentheses give the total number of mutations (i.e. coding and non-coding) for each subclone.
was run with default parameters.
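The conversion of simulated frequencies to read counts described in this section can be sketched as follows. The depth of 1000 reads and the error rate of 0.001 come from the text; rounding to the nearest integer, the function name and the dictionary keys are assumptions made for illustration, and the actual PhyloSub input file layout is not reproduced here.

```python
DEPTH = 1000        # assumed sequencing depth per mutation (from the text)
ERROR_RATE = 0.001  # assumed sequencing error rate (from the text)


def to_read_counts(frequency, depth=DEPTH):
    """Convert a simulated mutation frequency into read counts.

    Rounding to an integer is an assumption of this sketch; the text only
    states that the number of variant reads is set to 1000 * f.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie in [0, 1]")
    variant_reads = round(depth * frequency)
    return {
        "variant_reads": variant_reads,
        "reference_reads": depth - variant_reads,
        "depth": depth,
        "error_rate": ERROR_RATE,
    }


counts = to_read_counts(0.25)
print(counts)
```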
CITUP qip and CITUP iter are implemented in Python and C++, and CITUP qip is
run using the IBM ILOG CPLEX Optimizer. All CITUP runs were performed on a Linux
server with a memory limit of up to 16GB per job.
-
Chapter 6
Conclusions and future work
In this work, we present CITUP, a novel combinatorial algorithm to determine clonal fre-
quencies in tumors as well as their evolutionary history using one or more samples from
the same patient. Our comparisons to other state-of-the-art tools show that CITUP con-
sistently reports fewer solutions with better accuracy. This feature is very important for
real cancer datasets where additional experiments may be required to validate the predic-
tions. For example, predictions that involve contradictory assignments reported by TrAp
(referred to as “non-sparse” solutions in [19]), complicate the downstream analysis of
identifying potential drivers of cancer. Similarly, the partial order plots reported by PhyloSub
[6] can involve many connections, making it difficult to interpret the solutions reported
by this tool. Although our QIP framework is already able to handle a large number of
mutations and is significantly faster than PhyloSub, we acknowledge that it is considerably
slower than TrAp. On the other hand, the iterative heuristic version of CITUP exhibits
comparable accuracy, while achieving significant reduction in computation time. Moreover,
our ability to run CITUP separately on each tree topology means that parallel computing
can be utilized to quickly obtain high accuracy results on large datasets. As mentioned
above, CITUP assumes infinite sites, which may be violated under certain conditions. For
instance, a functional mutation may be selected against during changes to the tumor en-
vironment, such as the reversion of BRCA2 mutation in therapy resistant ovarian cancer
[1]. Furthermore, lineages that die out before the first sampling of the tumor or emerge
and disappear between two time points are not detectable by CITUP or any other method
aiming to construct phylogenies. In these cases, the evolutionary history of the tumor can
only be partially constructed. In addition, CITUP and similar methods are only applicable
to tumors with limited copy number changes. However, this limitation can be
partially overcome by considering a restricted number of copy-number corrected genotypes
similar to the approach of PyClone [14]. Extension of CITUP to exploit these types of changes
would lead to its broader applicability and detection of subclonal populations characterized
by copy number aberrations.
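The remark above about running CITUP separately on each tree topology can be illustrated with a small sketch. Everything here is hypothetical: `solve` stands in for whatever per-topology optimization is used, the topologies are toy parent vectors, and the objective values are made up. A thread pool is used only to keep the sketch self-contained; independent processes or cluster jobs would be the more natural choice for CPU-bound solvers.

```python
from concurrent.futures import ThreadPoolExecutor


def best_over_topologies(topologies, solve, max_workers=4, tol=1e-9):
    """Evaluate each topology independently and keep the best-scoring ones.

    solve: callable mapping a topology to its objective value (lower is better).
    Returns the best objective and every topology within tol of it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        objectives = list(pool.map(solve, topologies))
    best = min(objectives)
    winners = [t for t, obj in zip(topologies, objectives) if obj - best <= tol]
    return best, winners


# Toy example: topologies as parent vectors with made-up objective values.
toy_topologies = [(0, 0, 0), (0, 0, 1), (0, 1, 2)]
toy_objective = {(0, 0, 0): 0.30, (0, 0, 1): 0.10, (0, 1, 2): 0.10}
best, winners = best_over_topologies(toy_topologies, toy_objective.get)
print(best, winners)
```

Returning all topologies that tie for the best objective mirrors the "number of solutions" columns in Tables 5.1 and 5.2.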
-
Bibliography
[1] Ashworth, A. Drug resistance caused by reversion mutation. Cancer Research 68, 24 (2008), 10021–10023.
[2] Beyer, T., and Hedetniemi, S. M. Constant time generation of rooted trees. SIAM J. Comput. 9, 4 (1980), 706–712.
[3] Ding, L., Ley, T. J., Larson, D. E., Miller, C. A., Koboldt, D. C., Welch, J. S., Ritchey, J. K., Young, M. A., Lamprecht, T., McLellan, M. D., McMichael, J. F., Wallis, J. W., Lu, C., Shen, D., Harris, C. C., Dooling, D. J., Fulton, R. S., Fulton, L. L., Chen, K., Schmidt, H., Kalicki-Veizer, J., Magrini, V. J., Cook, L., McGrath, S. D., Vickery, T. L., Wendl, M. C., Heath, S., Watson, M. A., Link, D. C., Tomasson, M. H., Shannon, W. D., Payton, J. E., Kulkarni, S., Westervelt, P., Walter, M. J., Graubert, T. A., Mardis, E. R., Wilson, R. K., and DiPersio, J. F. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481, 7382 (2012), 506–510.
[4] Ding, L., Raphael, B. J., Chen, F., and Wendl, M. C. Advances for studying clonal evolution in cancer. Cancer Letters 340, 2 (2013), 212–219.
[5] Hajirasouliha, I., Mahmoody, A., and Raphael, B. J. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Proceedings of the International Conference on Intelligent Systems for Molecular Biology (2014).
[6] Jiao, W., Vembu, S., Deshwar, A., Stein, L., and Morris, Q. Inferring clonalevolution of tumors from single nucleotide soma