Non-redundant Clustering, Principal Feature Selection and Learning Methods
Applied to Lung Tumor Image-guided Radiotherapy
A Dissertation Presented
by
Ying Cui
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
(January, 2009)
To those I love ...
To those unforgettable days in Boston ...
Acknowledgments
I would like to take this opportunity to express my great gratitude to my advisor,
Dr. Jennifer G. Dy, for her supervision, support and encouragement throughout my
academic research. I sincerely appreciate her invaluable help in suggesting research
topics, revising my technical reports, papers, thesis and helping me prepare my
presentations. She provided many inspirations and insights into research that will
always remain helpful to me.
I am especially grateful to Dr. Steve B. Jiang for serving on my committee
and for working with me on the lung tumor Image-Guided Radiotherapy (IGRT)
research. He provided me with the precious opportunity to do applied research
in the medical domain. His patient guidance, wealth of knowledge and deep
insight into this area have truly benefited me. I would also like to thank
Dr. David Brady, Dr. Gregory C. Sharp and Dr. Mario Sznaier for their time,
advice, and encouragement during my studies. It has been a great experience
working with all of them.
Besides the guidance from my committee members, I feel very fortunate
to have met my friends at Northeastern University, Xin Huang and Hongyan Liu,
with whom I went through so many unforgettable moments, joyful or frustrating.
There are also many friends to whom I would like to show my deep appreciation,
and with whom I had many fruitful conversations about my research: Xiaojun
Wang, Yujuan Cheng, Yanjun Xiang, Qiuzhao Dong, Yan Yan, Donglin Niu and
Guan Yue. I thank all those who gave me generous help and shared beautiful
memories with me. I cannot imagine how my life would be without you.
This thesis once again finds me indebted to my family, especially my par-
ents, for their love, support and encouragement throughout my life. I never had a
chance to show you how much I appreciate what you have done for me, certainly
not as much as you deserve. Thank you! Last but not least, I cannot express
my gratitude in words to my boyfriend, Wei Wang. His love and support will be
cherished in my heart forever.
YING CUI
Northeastern University
Boston, MA
December 2008
ABSTRACT
Cui, Ying, Ph.D., Northeastern University, December 2008. Non-redundant clus-
tering, principal feature selection and learning methods applied to lung tumor image-
guided radiotherapy. Major advisor: Jennifer G. Dy.
This thesis is divided into two parts. The first part is about non-redundant
clustering and feature selection for high dimensional data. The second part is on
applying learning techniques to lung tumor image-guided radiotherapy.
In the first part, we investigate a new clustering paradigm for exploratory
data analysis: find all non-redundant clustering views of the data, where data points
of one cluster can belong to different clusters in other views. Typical clustering al-
gorithms output a single clustering of the data. However, in real world applications,
data can have different groupings that are reasonable and interesting from different
perspectives. This is especially true for high-dimensional data, where different fea-
ture subspaces may reveal different structures of the data. We present a framework
to solve this problem and suggest two approaches: (1) orthogonal clustering, and
(2) clustering in orthogonal subspaces.
The idea of removing redundancy between clustering solutions was inspired
by our preliminary work on solving the feature selection problem via
transformation methods. In particular, we developed a feature selection
method based on the popular transformation approach, principal component
analysis (PCA). PCA is a dimensionality reduction algorithm that does not
explicitly indicate which variables are important. We designed a method that
utilizes the PCA result to select the original features that are most
correlated with the principal components and, through orthogonalization, are
as uncorrelated with each other as possible. We show that our feature
selection method, as a consequence of orthogonalization, preserves the
special property of PCA that the retained variance can be expressed as the
sum of the orthogonal feature variances that are kept.
In the second part, we design machine learning algorithms to aid lung tu-
mor image-guided radiotherapy (IGRT). Precise target localization in real-time is
particularly important for gated radiotherapy. However, it is difficult to gate or
track the lung tumors due to the uncertainties when using external surrogates and
the risk of pneumothorax when using implanted fiducial markers. We investigate
algorithms for gating and for directly tracking the tumor. For gated radiotherapy,
a previous approach utilized template matching to localize the tumor position. Here,
we investigate two ways to improve the precision of tumor target localization by
applying: (1) an ensemble of templates where the representative templates are se-
lected by Gaussian mixture clustering, and (2) a support vector machine (SVM)
classifier with radial basis kernels. Template matching only considers images inside
the gating window, but images outside the gating window might provide additional
information. We take advantage of both states and re-cast the gating problem into
a classification problem. For the tracking problem, we explore a multiple-template
matching method to capture the varying tumor appearance throughout the different
phases of the breathing cycle.
Contents
Acknowledgments iv
List of Tables xii
List of Figures xiii
Chapter 1 Introduction 1
1.1 Non-redundant Multiview Clustering Through Orthogonalization . . 2
1.2 Non-redundant Principal Feature Selection . . . . . . . . . . . . . . 4
1.3 Robust Gating and Tracking the Lung Tumor Mass Without Mark-
ers for Image-guided Radiotherapy . . . . . . . . . . . . . . . . . . 5
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2 Review of Related Literature 10
2.1 Review of Related Clustering Algorithms . . . . . . . . . . . . . . 10
2.2 Review of Related Feature Selection Techniques . . . . . . . . . . . 15
2.3 Current Image-Guided Radiotherapy (IGRT) Approaches for Lung
Tumor Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Problems with Respiratory Tumor Motion in Radiotherapy . 18
2.3.2 Review of the Techniques of Gated Radiotherapy . . . . . . 20
Chapter 3 Non-redundant Multi-view Clustering 24
3.1 Multi-View Orthogonal Clustering . . . . . . . . . . . . . . . . . . 24
3.1.1 Orthogonal Clustering . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Clustering in Orthogonal Subspaces . . . . . . . . . . . . . 28
3.1.3 Relationship Between Orthogonal Clustering and Cluster-
ing in Orthogonal Subspaces . . . . . . . . . . . . . . . . . 31
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Experiments on Synthetic Data . . . . . . . . . . . . . . . . 34
3.2.2 Experiments on Real-World Benchmark Data . . . . . . . . 38
3.3 Automatically Finding the Number of Clusters and Stopping Criteria 47
3.3.1 Finding the Number of Clusters by Gap Statistics . . . . . . 47
3.3.2 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.3 Case Studies for Synthetic Data II, Face and Text Data . . . 50
3.4 Conclusions for Multi-View Clustering Methods . . . . . . . . . . . 54
Chapter 4 Orthogonal Principal Feature Selection via Component Analysis 56
4.1 Background and Notations . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Background Review on SVD and Definition of Terms . . . . 57
4.1.3 Dual Space Representation of a Data Matrix and Statistical
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Feature Selection via PCA . . . . . . . . . . . . . . . . . . . . . . 59
4.2.1 PCA Orthogonal Feature Selection . . . . . . . . . . . . . . 60
4.2.2 Orthogonal Feature Search . . . . . . . . . . . . . . . . . . 61
4.2.3 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4 Illustrative Example . . . . . . . . . . . . . . . . . . . . . 68
4.3 Sparse Principal Component Analysis (SPCA) and PFS . . . . . . . 70
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . 74
4.4.4 Time Complexity Analysis . . . . . . . . . . . . . . . . . . 76
4.5 Extension to Linear Discriminant Analysis (LDA) . . . . . . . . . . 79
4.6 Conclusion for Principal Feature Selection . . . . . . . . . . . . . . 80
Chapter 5 Robust Fluoroscopic Respiratory Gating for Lung Cancer Radiotherapy without Implanted Fiducial Markers 83
5.1 Data Acquisition and Pre-Processing . . . . . . . . . . . . . . . . . 84
5.1.1 Image Acquisition . . . . . . . . . . . . . . . . . . . . . . 84
5.1.2 Building Training Data . . . . . . . . . . . . . . . . . . . . 84
5.1.3 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2 Clustering Ensemble Template Matching and Gaussian Mixture Clus-
tering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Ensemble/Multiple Template Method . . . . . . . . . . . . 89
5.2.2 Finding Representative Templates by Clustering . . . . . . 90
5.2.3 Generating the Gating Signal . . . . . . . . . . . . . . . . . 92
5.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.1 Experiments by Clustering Ensemble Template Method . . . 99
5.4.2 Experiments by Support Vector Machine . . . . . . . . . . 99
5.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . 100
5.5 Conclusion for Robust Markerless Gated Radiotherapy . . . . . . . 103
Chapter 6 Multiple Template-based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers 104
6.1 Basic Ideas of Multiple Template Tracking . . . . . . . . . . . . . . 105
6.1.1 Building Multiple Templates . . . . . . . . . . . . . . . . . 107
6.1.2 Search Mechanism . . . . . . . . . . . . . . . . . . . . . . 108
6.1.3 Template Matching . . . . . . . . . . . . . . . . . . . . . . 109
6.1.4 Voting Mechanism . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Experiments Setup for Direct Tumor Tracking . . . . . . . . . . . . 114
6.3 Results and Discussion on Tumor Tracking . . . . . . . . . . . . . 115
6.4 Summary for Multiple Template Tracking . . . . . . . . . . . . . . 123
Chapter 7 Concluding Remarks 124
Bibliography 127
List of Tables
3.1 Confusion Matrix for Synthetic Data1 . . . . . . . . . . . . . . . . 35
3.2 Confusion Matrix for Synthetic Data2 . . . . . . . . . . . . . . . . 38
3.3 Confusion Matrix for the Digits Data . . . . . . . . . . . . . . . . . 41
3.4 Confusion Matrix for the Mini-Newsgroups Data . . . . . . . . . . 45
3.5 Confusion Matrix for WebKB Data . . . . . . . . . . . . . . . . . . 46
3.6 Confusion Matrix for WebKB Data based on Gap Statistics . . . . . 54
4.1 PC Loadings Applied to Glass Data Using SPCA . . . . . . . . . . 71
4.2 Example: SPCA Confusion for Feature Selection . . . . . . . . . . 72
4.3 Computational Complexity Analysis . . . . . . . . . . . . . . . . . 76
4.4 Computational Time in Seconds . . . . . . . . . . . . . . . . . . . 78
6.1 Experimental results for the proposed multiple tracking methods. e
is the mean localization error and e95 is the maximum localization
error at a 95% confidence level. . . . . . . . . . . . . . . . . . . . . 121
List of Figures
1.1 This is a scatter plot of the data in (a) subspace {F1, F2} and (b)
subspace {F3, F4}. Note that the two subspaces lead to different
clustering structures. . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 An example of an on-board imaging system for respiratory gated
radiotherapy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1 The general framework for generating multiple orthogonal cluster-
ing views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Scatter plots of synthetic data 1. The two columns show the results
of methods 1 and 2 respectively. The colors represent different class
labels and the ellipses represent the clusters found. Row 1 and 2
show the results for iteration 1 and 2 respectively; Row 3 shows
SSE as a function of iteration. . . . . . . . . . . . . . . . . . . . . 36
3.3 These are scatter plots of synthetic data 2 and the clusters found
by methods 1 (a1, a2) and 2 (b1, b2). The color of the data points
reflect different class labels and the ellipses represent the clusters
found. a1, b1 are the results for iteration 1; a2, b2 are the results for
iteration 2; a3 and b3 are SSE as a function of iteration for methods
1 and 2 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 The average digit for images within each cluster found by method
2 in iterations/views 1, 2 and 3. These clustering views correspond
to different digits. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 The average face image for each cluster in iteration 1. This cluster-
ing view corresponds to different persons. . . . . . . . . . . . . . . 43
3.6 The average face image for each cluster in iteration 2. This cluster-
ing view corresponds to different poses. . . . . . . . . . . . . . . . 44
3.7 ΔGap and SSE results in each iteration for the Synthetic II data set. 51
3.8 Different partitionings for face data in different iterations . . . . . . 52
3.9 The drop in singular values: si − si+1. Left: the gap of consecutive
singular values in iteration 1. Right: the gap of consecutive singular
values in iteration 2. . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Three uncorrelated data points in 2D space: (a) in the data space
view, and (b) in the feature space view. Three correlated data points
in 2D space: (c) in the data space view, and (d) in the feature space
view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 SSE between selected features and all the features. . . . . . . . . . 61
4.3 The general framework for our feature selection process. . . . . . . 62
4.4 A simple illustrative example of PFS. . . . . . . . . . . . . . . . . 69
4.5 SSE and retained variance for HRCT data on top. SSE for Chart,
face, 20 mini-newsgroup and gene data respectively. Each figure
plots the eight SSE curves for the eight methods: blue line by
simple threshold, green line by PFA, red line with 'o' by SPCA,
light blue line by Jolliffe i, purple line with '·' by Jolliffe ni,
yellow line with '+' by SFS, grey line with 'x' by LSE-fw and
black line by our PFS. . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Block diagram for showing the process of the proposed clinical pro-
cedure for generating the gating signal. . . . . . . . . . . . . . . . . 84
5.2 The Integrated Radiotherapy Imaging System (IRIS), used as the
hardware platform for the proposed gating technology in this chapter. 85
5.3 Tumor contour and the region of interest (ROI). Left: original fluo-
roscopic image. Right: motion-enhanced image. . . . . . . . . . . . 86
5.4 The top left figure is the breathing waveform represented by the
tumor location. To the left of the vertical dotted line is the training
period. To the right of the vertical line is the treatment or testing
period. Under the horizontal dotted line (threshold corresponding
to a given duty cycle) is the end-of-exhale phase. Bottom figures
show different end-of-exhale images during the training session,
which are averaged to generate a single template. . . . . . . . . . 87
5.5 Ensemble/multiple template method. Here, each image is an end-
of-exhale template. We match the incoming image with each tem-
plate and get a set of correlation scores s1, s2, · · · , sK . Then we
apply a weighted average of these scores to generate the final cor-
relation score s for gating. . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Scatter plot of our image data for patient 4 and 35% duty cycle
in 2D with the clustering result. The “o” and “x” each represent
different clusters, with the means represented by the “$” symbol in
bold and the covariances in ellipses. . . . . . . . . . . . . . . . . 92
5.7 Results from different methods for an example patient. (a) sin-
gle template method; (b) ensemble/multiple templates method with
Gaussian mixture clustering. For each figure, the top curve is the
correlation score and the bottom plot is the gating signal generated
by the correlation score. Here we use 35% duty cycle. . . . . . . . . 93
5.8 Re-casting the gating problem as a classification problem (a) and (b).
(c) presents the decision boundary created by single template
matching and (d) displays the decision boundary of an SVM classifier. . 95
5.9 Experiment results in TD and DC for 35% proposed duty cycle.
Blue bars: metric by SVM method. Red bars: metric by clustering
ensemble template matching method. . . . . . . . . . . . . . . . . . 100
5.10 Experiment results in TD and DC for 50% proposed duty cycle.
Blue bars: metric by SVM method. Red bars: metric by clustering
ensemble template matching method. . . . . . . . . . . . . . . . . . 101
5.11 Example of estimated gating signals on patient 4 for proposed 35%
duty cycle. Top: the predicted gating signal by SVM classifier. Bot-
tom: the gating signal generated by clustering ensemble template
matching method. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.1 Outline of the proposed multiple template tracking procedure. . . . 106
6.2 A fluoroscopic image with a region of interest (ROI) (blue rectan-
gle) and a tumor contour (red curve). . . . . . . . . . . . . . . . . . 107
6.3 Twelve motion-enhanced tumor templates built by averaging the
ROI images (as shown in figure 2) falling in the same time bin.
The intensity waveform is divided into twelve equal time bins,
and one template was built for each bin. . . . . . . . . . . . . . . 108
6.4 Tumor contour and region of interest (ROI). Left: Original fluoro-
scopic image. Right: Motion-enhanced image. . . . . . . . . . . . 110
6.5 The correlation score (in gray scale) as functions of template ID
(y-axis) and the incoming image frame ID (x-axis). . . . . . . . . . 115
6.6 A comparison of the tracking results with and without voting for
Method 2 and patient 3. The tumor position (y-axis) as a function
of time (x-axis). Black solid line: the reference tumor location.
Blue dotted line: Method 2 without voting. Red dots: Method 2
with voting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.7 Experiment results for a) patient 1 and b) patient 2. Black solid
line: Reference tumor motion trajectory. Red dots: tracking results
using Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.8 Experiment results for c) patient 3 and d) patient 4. Black solid
line: Reference tumor motion trajectory. Red dots: tracking results
using Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.9 Experiment results for e) patient 5 and f) patient 6. Black solid line:
Reference tumor motion trajectory. Red dots: tracking results using
Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.10 Top: the average localization error (blue bar) and max localization
error at 95% confidence level (red bar) for Method 1. Bottom: same
errors for Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.11 A comparison between Method 1 and Method 2, for patient 3. The
tumor position (y-axis) as a function of time (x-axis). Black solid
line: the reference tumor location. Blue dotted line: Method 1. Red
dots: Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Chapter 1
Introduction
My dissertation has two components. The first involves basic research in ma-
chine learning and data mining. In particular, I study methods for non-redundant
multi-view clustering for exploratory data analysis and methods for feature selec-
tion through feature transformation. The second involves designing and applying
machine learning to improve the robustness of image guided radiotherapy without
markers using fluoroscopic image sequences.
In this chapter, I begin by defining clustering and then motivate the need
for non-redundant multi-view clustering in Section 1.1. In the next section, Section
1.2, I define the feature selection problem and explain the benefit of utilizing feature
transformation for feature selection. Then, in Section 1.3, I describe image-guided
radiotherapy, explain gating and tracking, and motivate the importance of marker-
less gating and tracking. Finally, I provide a guide to this dissertation in
Section 1.4.
1.1 Non-redundant Multiview Clustering Through Or-
thogonalization
Many applications are characterized by data in high dimensions. Examples include
text, image, and gene data. Automatically extracting interesting structure in such
data has been heavily studied in a number of different areas including data mining,
machine learning and statistical data analysis. One approach for extracting informa-
tion from unlabeled data is through clustering. Given a data set, typical clustering
algorithms group similar objects together based on some fixed notion of similarity
(distance) and output a single clustering solution. However, in real-world
applications, data can often be interpreted in many different ways, and there
may exist multiple groupings of the data that are all reasonable from some
perspective.
Figure 1.1: This is a scatter plot of the data in (a) subspace {F1, F2} and (b) subspace {F3, F4}. Note that the two subspaces lead to different clustering structures.
This problem is often more prominent for high dimensional data, where each
object is described by a large number of features. In such cases, different feature
subspaces can often warrant different ways to partition the data, each presenting
the user a different view of the data’s structure. Figure 1.1 illustrates one such sce-
nario. In particular, Figure 1.1a shows a scatter plot of the data in feature subspace
{F1, F2}. Figure 1.1b shows how the data looks in feature subspace {F3, F4}.
Note that each subspace leads to a different clustering structure. When faced with
such a situation, which features should we select (i.e., which clustering solution is
better)? Why do we have to choose? Why not keep both solutions? In fact both
clustering solutions could be important, and provide different interpretations of the
same data. For example, for the same medical data, what is interesting to physicians
might be different from what is interesting to insurance companies.
The goal of exploratory data analysis is to find structures in data, which
may be multi-faceted by nature. Traditional clustering methods seek a single
unified clustering solution and are thus inherently limited in achieving this goal. In this
research, we suggest a new exploratory clustering paradigm: the goal is to find a
set of non-redundant clustering views from data, where data points belonging to the
same cluster in one view can belong to different clusters in another view.
Toward this goal, we propose a framework that extracts multiple clustering
views of high-dimensional data that are orthogonal to each other. Note that there
are k^N possible ways to assign N data points to k disjoint clusters. Not all
of them are meaningful. We wish to find good clustering solutions based on a clustering objective
function. Meanwhile, we would like to minimize the redundancy among the ob-
tained solutions. Thus, we include an orthogonality constraint in our search for
new clustering views to avoid providing the user with redundant clustering results.
The proposed framework works iteratively, at each step adding one clustering view
by searching for solutions in a space that is orthogonal to the space of the exist-
ing solutions. Within this framework, we develop two general approaches. The
first approach seeks orthogonality in the cluster space, while the second one seeks
orthogonality in the feature subspace. We present all the multiple view clustering
solutions to the user.
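The iterative framework behind the first approach (orthogonal clustering) can be sketched in a few lines of numpy. This is a simplified reading of the idea, not the exact algorithm developed in Chapter 3, and all function names are illustrative: after each clustering view, every point is replaced by its residual from its cluster centroid, so the next view is sought in the part of the data the current view leaves unexplained.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means with random initialization."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute centroids, keeping the old one if a cluster empties
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels, C

def orthogonal_clustering_views(X, k, n_views=2):
    """Sketch of orthogonal clustering: cluster, subtract each point's
    centroid, and cluster the residuals for the next view."""
    views, R = [], X.copy()
    for _ in range(n_views):
        labels, C = kmeans(R, k)
        views.append(labels)
        R = R - C[labels]   # remove the structure captured by this view
    return views
```

Here k-means stands in for any clustering objective; the residual step is what discourages redundancy between successive views.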
1.2 Non-redundant Principal Feature Selection
Feature selection is a dimensionality reduction technique which selects a subset of
features from the original set. It is very useful because in some applications it is
desirable not only to reduce the dimension of the space, but also to reduce the num-
ber of variables that are to be considered or measured in the future. However, it
is an NP-hard combinatorial optimization problem. As such, practical approaches
involve greedy searches that guarantee only local optima. Besides feature selection, an-
other well studied topic in dimensionality reduction is feature transformation. Fea-
ture transformation is a process through which a new set of features is created. It
can be expressed as an optimization problem over a continuous feature space solu-
tion and classical feature transformation approaches (such as, principal component
analysis (PCA) [1] and linear discriminant analysis (LDA) [2]) provide global so-
lutions. Here, we propose a non-standard approach to feature selection by utilizing
feature transformations to perform feature search. In a sense, feature transforma-
tion performs a search that takes a global view and considers the interactions among
all the features.
PCA is a widely used transformation approach. It has been successfully ap-
plied to many real world applications, including face recognition [3], latent seman-
tic indexing for text retrieval [4], and gene sequence recognition [5]. An important
property of PCA is that the transformation vectors are orthogonal to each other.
Orthogonality is desired, because it assures that the transformed features are not
correlated with each other, and in some sense non-redundant. In fact, the success of
PCA can be attributed to two important optimality properties: (1) the principal
components sequentially capture the maximum variability in the data, thereby
minimizing information loss, and (2) the principal components are uncorrelated
[6]. The problem with transformation methods, such as PCA, is that they do not
explicitly indicate which features are important.
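Both optimality properties follow directly from the singular value decomposition of the centered data matrix. The following is a generic numpy sketch of standard PCA (not code from this thesis): it returns the top-m orthonormal component directions and the fraction of total variance they retain.

```python
import numpy as np

def pca(X, m):
    """PCA of an n-by-d data matrix via SVD.  Returns the top-m
    orthonormal component directions (as rows) and the fraction of
    the total variance they retain."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    retained = (s[:m] ** 2).sum() / (s ** 2).sum()
    return Vt[:m], retained
```

The rows of the returned matrix are mutually orthogonal, and the squared singular values give the variance captured by each component.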
In this thesis, we present a novel approach to feature selection that sequen-
tially selects original features based on the transformation-based method, PCA, to
optimize an objective criterion, while trying to keep the selected features as non-
redundant (uncorrelated) as possible through orthogonalization. We call this ap-
proach principal feature selection (PFS). In developing PFS, we present a new
objective function for PCA feature selection that takes feature redundancy
into account, analogous to an orthogonality constraint in feature transformation
approaches. We show that PFS, as a consequence of orthogonalization,
preserves the special property in PCA that the retained variance can be expressed
as the sum of orthogonal feature variances that are kept. This property is important
as it helps decide how many features to keep in terms of the proportion of variance
retained.
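The greedy select-then-orthogonalize loop behind PFS can be sketched as follows. This is a loose illustration of the idea only; Chapter 4 gives the exact objective criterion. The tie to PCA is that each pick follows the leading principal direction of the current, deflated data.

```python
import numpy as np

def principal_feature_selection(X, m):
    """Greedy sketch of PFS: repeatedly pick the original feature
    (column) most aligned with the leading principal direction of the
    deflated data, then orthogonalize every remaining column against
    the chosen one so later picks are non-redundant."""
    A = X - X.mean(axis=0)              # centered copy; columns = features
    remaining = list(range(A.shape[1]))
    selected = []
    for _ in range(m):
        # leading right-singular vector of the deflated data
        _, _, Vt = np.linalg.svd(A[:, remaining], full_matrices=False)
        j = remaining[int(np.argmax(np.abs(Vt[0])))]
        selected.append(j)
        q = A[:, j] / (np.linalg.norm(A[:, j]) + 1e-12)
        A = A - np.outer(q, q @ A)      # remove q's span from all columns
        remaining.remove(j)
    return selected
```

Because each selected feature's span is subtracted from the rest, a near-duplicate of an already-selected feature carries almost no residual variance and will not be picked again.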
1.3 Robust Gating and Tracking the Lung Tumor Mass
Without Markers for Image-guided Radiother-
apy
Image Guided Radiation Therapy (IGRT) combines scanning and radiation equipment
to provide images of the patient's organs in the treatment position at the time
of treatment, optimizing the accuracy and precision of the radiotherapy. Treatment
errors related to respiratory organ motion may greatly degrade the effectiveness
of conformal radiotherapy for the management of thoracic and abdominal lesions.
This has become a pressing issue in image-guided radiation therapy (IGRT). For
patients with significant inter- and intra-fractional tumor motion, large treatment
margins are needed to provide full target coverage. Large margins limit the dose
that can be prescribed for tumor control and can cause complications from
over-irradiation of normal tissue. Motion management techniques, such as
respiratory gating or dynamic multi-leaf collimator (DMLC) beam tracking, hold
promise to reduce the incidence and severity of normal tissue complications
and to increase local control through dose escalation for mobile tumors in
the thorax and abdomen. For those techniques, precise target localization in
real time is particularly important due to the reduced clinical tumor volume
(CTV) to planning target volume (PTV) margin and/or the escalated dose.

Figure 1.2: An example of an on-board imaging system for respiratory gated radiotherapy.
In this research, we investigate two approaches for lung tumor IGRT using
fluoroscopic images. One is to generate robust real-time gating signals for respi-
ratory gated radiotherapy. Another is to perform position estimation of the tumor
mass. By these two methods, we try to precisely deliver a lethal dose to the tumor,
while minimizing the incidence and severity of normal tissue complications, for
mobile tumors in the thorax and the abdomen [7].
Respiratory gating is a method of synchronizing radiation with respiration,
during the imaging and treatment processes. In computer-driven respiratory-gated
radiotherapy, a small plastic box with reflective markers is placed on the patient’s
abdomen. The reflective markers move during breathing, and a digital camera
hooked up to a central processing unit monitors these movements in real time. A
computer program analyzes the movements and triggers the scanner (simulation of
treatment), or the treatment beam, always at the same moment of the respiratory
cycle. With this technique, it is also possible to choose the respiratory phase: de-
pending on its location, the tumor will be treated during inspiration or expiration
so as to avoid exposure of critical organs. Figure 1.2 shows an imaging system
mounted with two orthogonal x-ray tubes and fast amorphous silicon flat panels on
the gantry of a medical linear accelerator (linac), which is used in respiratory gated
radiotherapy.
In an idealized gated treatment, the tumor position would be directly detected,
and radiation would be delivered only when the tumor is at the right position.
However, direct detection of the tumor mass in real-time during the treatment is
often difficult. Various surrogates, both external and internal, are used to identify
the tumor position. Depending on the surrogate used, respiratory gating can be
categorized into external (optical) gating and internal (fluoroscopic) gating. During
gated treatment, the internal or external surrogate signal is continuously compared
against a pre-specified range of values, called the gating window. When the surro-
gate signal is within the gating window, a gating signal is sent to the linac to turn
on the radiation beam [7].
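The gating-window comparison described above amounts to thresholding the surrogate signal. A minimal sketch, with illustrative names and a toy window:

```python
import numpy as np

def gating_signal(surrogate, low, high):
    """Beam-on (True) whenever the surrogate signal lies inside the
    pre-specified gating window [low, high]; beam-off otherwise."""
    s = np.asarray(surrogate, dtype=float)
    return (s >= low) & (s <= high)

def duty_cycle(gate):
    """Fraction of time the beam is on."""
    return float(np.mean(gate))
```

In practice the window is chosen to trade off the duty cycle against the residual tumor motion inside the window.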
External gating techniques rely on the correlation between tumor location
and the external surrogates, such as markers placed on the patients’ abdomen [8, 9].
The major weakness in external gating is the uncertainty in the correlation between
the external marker position and internal target position [7]. Current internal gating
uses internal tumor motion surrogates such as implanted fiducial markers, as es-
tablished by the Hokkaido group [10, 11, 12]. It has been shown that internal
surrogates can generate accurate gating signals. However, due to the risk of pneu-
mothorax, the implantation of radiopaque markers in patients’ lungs is unlikely
to become a widely accepted clinical procedure [13, 14, 15]. Therefore, it is crucial
to be able to perform accurate gated treatment or direct tracking of the lung tumor
mass without implanted markers.
For gated treatment, [16] has shown the feasibility of a template matching
method to generate gating signals for lung radiotherapy without implanted markers.
The basic idea is (1) to generate a reference template which corresponds to
the treatment position of the target in the gating window using fluoroscopic im-
ages acquired during patient setup, (2) to calculate the correlation scores between
the reference template and the incoming fluoroscopic images acquired during treat-
ment delivery, and (3) to convert the correlation score into gating signals. Here,
in this research, we explore ways to improve the accuracy and robustness of tem-
plate matching for gating. From our experiments, a single template is not enough
to generate robust gating signals. Thus, we look at all the images corresponding
to the treatment position of the target in the gating window, use each of them
as a template, and combine the correlation scores. However, this multiple-template
method, although accurate, is very time consuming. Therefore, we develop a method
that lies between the single- and multiple-template methods: we group the templates
into clusters and use the cluster means as the representative templates. This
leads to our template clustering ensemble method. Furthermore, template matching
only considers images inside the gating window, but images outside the gating window
might provide additional information for improving the precision in localizing
the tumor. We assign images inside the gating window to the “ON” class and those
outside to the “OFF” class, re-casting the gating problem as a classification problem.
Then, as another approach, we apply a support vector machine (SVM) classifier to
gated radiotherapy.
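The three template-matching steps above can be sketched as follows (a minimal illustration: Pearson's correlation is used as the score, but the threshold value and image sizes are arbitrary assumptions, not clinical settings):

```python
import numpy as np

def correlation_score(template, frame):
    """Pearson correlation between a reference template and an incoming frame."""
    t = template.ravel() - template.mean()
    f = frame.ravel() - frame.mean()
    return float(t @ f / (np.linalg.norm(t) * np.linalg.norm(f)))

def gating_signal(templates, frame, threshold=0.9):
    """Beam ON if the best score over all reference templates clears the threshold."""
    score = max(correlation_score(t, frame) for t in templates)
    return score >= threshold

rng = np.random.default_rng(0)
template = rng.random((16, 16))
print(gating_signal([template], template))      # identical frame: score 1.0 -> True
print(gating_signal([template], 1 - template))  # inverted frame: score -1.0 -> False
```

Passing several templates to `gating_signal` is the multiple-template variant; passing the cluster means instead gives the clustering-ensemble variant at a fraction of the cost.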
On the other hand, we also investigate a direct beam-tracking method to
track the tumor location throughout the whole breathing cycle. The basic idea is as
follows: (i) during the patient setup session, a pair of orthogonal fluoroscopic image
sequences are taken and processed off-line to generate a set of reference templates
that correspond to different breathing phases and tumor positions; (ii) during treat-
ment delivery, fluoroscopic images are continuously acquired and processed; (iii)
the similarity between each reference template and the processed incoming image is
calculated; (iv) the tumor position in the incoming image is then estimated by combining
the tumor centroid coordinates in the reference templates with proper weights
based on the measured similarities. With different image representation and sim-
ilarity calculations, two such multiple-template tracking techniques have been de-
veloped: one based on motion-enhanced templates and Pearson’s correlation score
while the other based on eigen templates and mean-squared error.
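Step (iv) can be sketched as a similarity-weighted average of the template centroids (the softmax weighting here is an illustrative assumption, not the exact weighting scheme of either technique):

```python
import numpy as np

def estimate_centroid(scores, template_centroids):
    """Estimate the tumor centroid as a weighted average of the centroids of the
    reference templates, weighted by each template's similarity to the frame."""
    w = np.exp(np.asarray(scores, dtype=float))   # softmax-style positive weights
    w /= w.sum()
    return w @ np.asarray(template_centroids, dtype=float)

# Three templates with known centroids; the middle one matches the frame best.
est = estimate_centroid([0.2, 0.9, 0.4], [[10.0, 5.0], [12.0, 6.0], [14.0, 7.0]])
print(est)  # a point between the template centroids, pulled toward (12, 6)
```

Because the weights are positive and sum to one, the estimate always lies within the convex hull of the reference centroids, which keeps the prediction stable even when several templates score similarly.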
In both gating and tracking methods for radiotherapy, we perform a vali-
dation study by comparing the gating signals and the tumor locations generated
with the proposed techniques against those determined manually by clinicians us-
ing multiple patient data sets. For the gating problem, our case study on these pa-
tients shows that both the clustering ensemble template matching method and the SVM are
reasonable tools for image-guided markerless gated radiotherapy. For the tracking
problem, the tumor centroid coordinates automatically detected using both meth-
ods agree well with the manually marked reference locations, with the eigenspace
tracking method performing slightly better than the motion-enhanced method.
1.4 Overview
The remainder of this dissertation is organized as follows. Chapter 2 provides
a review of related literature. In Chapter 3, we describe in detail the non-redundant
multi-view clustering paradigm. Then, Chapter 4 presents principal feature selec-
tion. In Chapters 5 and 6, we illustrate the robust gating and direct tracking methods
for lung tumor treatment respectively. Finally, we provide concluding remarks in
Chapter 7.
Chapter 2
Review of Related Literature
In this chapter, we first review the various related clustering algorithms in Section
2.1. Specifically, we point out the differences between our clustering paradigm and
the traditional clustering scheme. Next, we review the related feature subset selection
methods in Section 2.2. In particular, we review feature subset selection
methods which take advantage of feature transformation for feature selection. Finally,
in Section 2.3, we provide an overview of the various techniques that have
been developed to achieve precise target localization in real time for image-guided
radiotherapy of mobile tumors in the thorax and abdomen.
2.1 Review of Related Clustering Algorithms
In this section, we review the literature related to our non-redundant multi-view
clustering problem in different aspects.
Hierarchical clustering [17] presents a hierarchical grouping of the objects.
It can be subdivided into two categories: the agglomerative methods, which suc-
cessively merge small clusters into larger ones until a stopping criterion is met; and
divisive methods, which start with all the objects in one cluster and successively split
them into finer groupings until a stopping criterion is satisfied. However, although
a certain object can have a different label at different stages, hierarchical clustering
is quite different from our multi-view clustering problem. For hierarchical clustering, the dif-
ferent clustering solutions obtained at different hierarchical levels differ only in
their granularity – objects belonging to the same cluster in fine resolutions remain
in the same cluster at the coarser levels. For our multi-view clustering, objects in
the same cluster can be in different clusters in different views.
On the other hand, a different but related problem is the cluster ensemble
problem [18, 19]. The key idea of an ensemble approach is to improve the clustering
performance by combining several clustering results. While an ensemble method
creates a set of cluster solutions for a data set, the final objective is to generate a
single consolidated clustering. In contrast, the objective of our multi-view
clustering method is to provide users with different meaningful clustering solutions.
The term “multi-view” is also utilized in semi-supervised learning [20, 21].
There, the feature space is broken into two independent subsets to generate two
hypotheses, and the two independent hypotheses bootstrap each other by providing
labels for the unlabeled data in a semi-supervised learning setting. The
authors provided partitioning and agglomerative hierarchical multi-view clustering
algorithms for text data in [20]. The objective of their method is to maximize the
agreement between the two independent hypotheses. In contrast, our multi-view
method tries to find meaningful partitions that disagree as much as possible with
previous solutions so as to reveal distinct structures of the data.
In addition, the problem we are trying to solve is different from the multi-
labeling problem [22, 23, 24]. First of all, our non-redundant multi-view clustering
is performed in a totally unsupervised manner, whereas the multi-labeling
problem is mainly one of classification, i.e., supervised learning. In the multi-labeling
problem, a given instance can be assigned to more than one class. For
example, in text document retrieval [22], a document can be tagged with a set of
labels, where the classes can overlap semantically. [22] solved the multi-
labeling problem by ranking the relevance of a topic based on a top-ranking func-
tion. Then a certain instance is marked with all the topic labels above a threshold.
Another multi-label method is the MFoM [23] learning approach, which is a dis-
criminative multi-label multiclass classifier that maximizes the pair-wise discrimi-
nation power among all competing classes in labeling the topics of text documents.
In [24], Boutell et al. applied a multi-label classifier to scene classification, and gave an
extensive comparative study of possible approaches to training and testing multi-label
classifiers. They used cross-training as the training strategy and defined base-class
and α-evaluation metrics for testing. Although our multi-view non-
redundant clustering can also assign samples to multiple labels, our objective is dif-
ferent. Instead of looking for all the high-relevance labels describing an instance,
we look at the data set as a whole, and provide multiple partitionings for the whole
data set. In each of the different partitionings, the class labels for each instance
are mutually exclusive. Furthermore, the partitioning solutions in different views
should be non-redundant.
The idea of non-redundant clustering was introduced in [25, 26]. In non-
redundant clustering, we are typically given a set of data objects together with an
existing clustering solution and the goal is to learn an alternative clustering that cap-
tures new information about the structure of the data. Existing non-redundant clustering
techniques include the conditional information bottleneck approach [26, 27,
28], the conditional ensemble based approach [29] and the constrained model based
approach [30]. In [26], they suggest that by minimizing the information about irrelevant
structures, one can better identify the relevant structures. They provided
a new formulation called information bottleneck with side information (IBSI) to
remove the irrelevant structures. IBSI finds a stochastic map of the original data to
a new variable space which maximizes the mutual information between the labels
associated with the new variables and the desired relevant structure from side information,
and minimizes the mutual information with the irrelevant ones. Gondek and
Hofmann [27, 28, 29] used conditional information bottleneck to remove the effect
of the undesired a priori solutions to try to discover new interesting structures. However,
these existing methods are limited to finding only one alternative partitioning, and
the default a priori solution is given beforehand. Another approach for seeking an
alternative view is COALA [31]. It finds a clustering different from an already
known clustering by creating a cannot-link constraint for each pair of objects that are
in the same cluster in the previously given partitioning. It then proceeds in an
agglomerative fashion and merges single objects into clusters based on a dissimilarity
measure to generate another view. Compared to these existing non-redundant
clustering techniques, the critical differences of our proposed research are:
(1) Focusing on searching for orthogonal clustering of high-dimensional data,
our research combines dimensionality reduction and orthogonal clustering
into a unifying framework and seeks lower dimensional representations of
the data that reveal non-redundant information about the data.
(2) Existing non-redundant clustering techniques are limited to finding one alter-
native structure given a known structure. Our framework works successively
to reveal a sequence of different clusterings of the data.
Another related work is meta clustering. Meta clustering [32], inspired by
ensemble methods, first generates a diverse set of candidate clusterings by either
random initialization or random feature weighting. It then applies agglomerative
clustering at the meta level to merge the clustering solutions, using the Rand index
to measure similarity. Contrary to our framework, they create multiple clustering
solutions randomly. Our framework, on the other hand, generates multiple views
by orthogonalization so as to directly seek out non-redundant solutions.
Recently, [33] solved the problem of disparate clusterings by optimizing an
objective function that penalizes the correlation between different clusterings
and minimizes the sum-squared distances within each clustering. Similar to our
method, it can find multiple alternative clusterings. However, their approach looks at
the same feature space (i.e., they use all the original features to generate the different
partitionings); whereas, our method mines alternative clustering structures in
different subspaces. In addition, although their method can be extended to generate
more than two alternative clusterings T of the data, extending to more than two
clusterings increases the complexity of their problem by T(T − 1)/2, and the
heavy computational burden of optimization using gradient descent may make this
extension unrealistic for large data sets.
Our framework produces a set of different clustering solutions, which is
similar to cluster ensembles [18, 19]. A clear distinction of our work from cluster
ensembles is that we intentionally search for orthogonal clusterings and do not seek
to find a consensus clustering as our end product.
An integral part of our framework is to search for a clustering structure in a
high-dimensional space and find the corresponding subspace that best reveals the
clustering structure. This topic has been widely studied, and one closely related
work was conducted by [34]. In this work, after a clustering solution is obtained,
a subspace is computed to best capture the clustering, and the clustering is then
refined using the data projected onto the new subspace. Our framework works in
the opposite direction. We look at a subspace that is orthogonal to the space in
which the original clustering is embedded to search for non-redundant clustering
solutions.
Finally, while we search for different subspaces in our framework, it is dif-
ferent from the concept of subspace clustering in [35, 36]. [35] is interested in
automatically finding subspaces with high-density clusters, which are not revealed
in the full space. They divide the data space into a grid and measure the density
of the data points in each cell (unit). Dense units are recognized as clusters. In
subspace clustering, the goal is still to learn a single clustering, where each cluster
can be embedded in its own subspace. In contrast, our method searches for multiple
clustering solutions, each revealed in a different subspace.
As mentioned previously, our proposed method for redundancy removal is
inspired by our work on non-redundant feature selection. In the next section, we
review feature selection methods that are closely related to our proposed feature
selection approach.
2.2 Review of Related Feature Selection Techniques
Feature selection algorithms are described by an objective function for evaluating
features and the search strategy for exploring candidate features. Feature selection
methods can be classified as wrapper (which takes the final learning algorithm into
account for feature evaluation) or filter (which evaluates features based on char-
acteristics of the data alone) methods [37]. Wrapper methods perform better than
filters for a particular learning algorithm, but filter methods are more efficient and
the features selected are typically not tied to a learning algorithm. Our method is
a filter approach with objective functions for PCA/LDA feature selection with an
embedded redundancy penalty and a search strategy based on PCA/LDA transfor-
mation and orthogonalization. This work reflects and combines
ideas from both machine learning and statistics.
Feature selection is an NP-hard combinatorial optimization problem [38].
Feature transformation can be expressed as an optimization problem over a contin-
uous feature space solution. Feature selection, on the other hand, is an optimization
over a discrete space, and typically involves combinatorial search. An exhaustive
search of the 2^d possible feature subsets (where d is the number of features) for the
best feature subset is computationally impractical. More realistic search strategies,
such as greedy approaches (e.g., sequential forward search (SFS) [39] and sequential
backward search (SBS) [40, 41]), can lead to local optima. SFS starts with
an empty feature set and adds features one by one based on the criterion function.
SBS starts from the whole set and removes the least important feature one by
one based on its criterion function. Random search methods, such as genetic al-
gorithms [42], add some randomness to the search procedure to help escape from
local optima. These searches are subset search methods because they evaluate each
candidate subset with respect to the evaluation criterion. In some cases when the
dimensionality is very high, one can only afford an individual search. Individual
search methods evaluate each feature individually (as opposed to feature subsets)
according to a criterion [43]. They describe various criteria for ranking a feature,
including correlation, single-feature classifiers and information-theoretic criteria.
Then, they select the features which either satisfy a condition or are top-ranked.
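Such an individual search can be sketched with the absolute correlation coefficient as the ranking criterion (the function name and interface are illustrative assumptions):

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Score each feature individually by |corr(feature, y)| and return the
    feature indices sorted from most to least relevant."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return [int(i) for i in np.argsort(scores)[::-1]]

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
y = X[:, 1].copy()                            # the target is exactly feature 1
print(rank_features_by_correlation(X, y)[0])  # -> 1
```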
However, individual search is limited in that it does not consider the interaction
among features. The performance of individual search can be improved by remov-
ing redundancy after feature ranking. [44] suggests removing the features that are
highly correlated with the currently selected subset. They measure the relevance
between features based on the correlation coefficient and mutual information. [45]
applies Gram-Schmidt orthogonalization to candidate features ranked in order of
decreasing relevance to a measured process output and suggests a stopping
criterion based on a random probe method. They repeatedly select the feature which
has the minimum angle with the output vector, stopping when the probability that a realization
of the probe would be selected exceeds a predefined risk. Similar to [45], we apply Gram-Schmidt
to remove redundancy. But in contrast to [45], we incorporate PCA/LDA transfor-
mation in searching and selecting features. Feature transformation to select features
takes a global view and considers the interactions among features. Principal com-
ponent analysis (PCA) [1] and linear discriminant analysis (LDA) [2] are classical
feature transformation approaches, which provide global solutions to the best sub-
space based on various optimality criteria [46]. Moreover, unlike individual search,
our combination inherits the desirable properties of PCA/LDA.
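Similar in spirit to [45], the Gram-Schmidt redundancy-removal step can be sketched for a supervised setting as follows (a simplified illustration, not our PCA/LDA-based method; the exhaustion threshold and interface are assumptions):

```python
import numpy as np

def gram_schmidt_select(X, y, k):
    """Greedily pick the feature most correlated with the target y, then deflate
    y and the remaining features against it so that subsequent picks carry only
    non-redundant information."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(X, axis=0)
        scores = np.abs(X.T @ y) / np.maximum(norms, 1e-12)
        scores[norms < 1e-9] = -np.inf        # fully deflated columns are exhausted
        j = int(np.argmax(scores))
        selected.append(j)
        v = X[:, j] / np.linalg.norm(X[:, j])
        X = X - np.outer(v, v @ X)            # remove v's component from every feature
        y = y - v * (v @ y)
    return selected

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X[:, 3] = X[:, 2]                    # feature 3 is an exact duplicate of feature 2
y = 2.0 * X[:, 2] + 0.1 * X[:, 0]
print(gram_schmidt_select(X, y, 2))  # picks feature 2 first; its duplicate 3 is never chosen
```

After a feature is chosen, its duplicate contributes nothing in the deflated space, which is exactly the redundancy-removal behavior that pure individual ranking lacks.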
There has been some work on performing feature selection through PCA.
Recently, there has been growing interest in sparsifying PCA, with methods such as rotation tech-
niques [47], SCoTLASS [48], and sparse PCA [6]. These derive from the classical
regression method LASSO [49]. By adding different penalty terms, including L1 and
L2 norms, they achieve sparsity and at the same time perform a rotation to match
the optimal regression subspace to the original data space. However, they are not
exactly feature selection methods: the original features are combined to form new
variables, each a combination of several of them. A closely related approach is
the variable selection method based on PCA [1], which selects variables with the
highest coefficient (or loading) in absolute value of the first q principal eigenvec-
tors. It can be implemented both iteratively and non-iteratively. Another approach
is by Krzanowski [50], which tries to minimize the error between the principal components
(PCs) calculated with all the original features and those calculated with the selected
feature subset, via forward search, backward elimination and Procrustes analysis
(minimizing the sum-squared error under translation, rotation and reflection). Mao
[51] provided a modified, faster version of Krzanowski’s method. Mao applied
forward selection and backward elimination using least-squares estimates: it builds
a linear model for each feature by minimizing the least-squares error between the
feature subset and the original PCs. The iterative PCA approach by Jolliffe does
not take redundancy into account. The other methods do, but not explicitly, contrary
to our approach. Moreover, Krzanowski and Mao apply sequential search techniques,
which are very slow and almost unrealistic for very large
data sets. Lu et al. [52] pointed out the importance of removing redundancy in
image recognition applications; they performed k-means clustering on the loadings
of the first several PCs and selected the features closest to each cluster’s centroid,
a method called principal feature analysis (PFA). This method depends on the performance
of the clustering method.
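Jolliffe's loading-based variable selection can be sketched in a few lines (an illustrative, non-iterative variant; ties between loadings and duplicate picks are not handled):

```python
import numpy as np

def pca_loading_select(X, q):
    """For each of the first q principal eigenvectors, keep the index of the
    variable with the largest absolute loading."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :q]             # leading q eigenvectors
    return [int(np.argmax(np.abs(top[:, i]))) for i in range(q)]

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3))
X[:, 0] *= 10.0                    # variable 0 dominates the variance
print(pca_loading_select(X, 1))    # -> [0]
```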
2.3 Current Image-Guided Radiotherapy (IGRT) Ap-
proaches for Lung Tumor Treatment
In gated radiotherapy, precise and real time tumor localization is extremely impor-
tant because tighter CTV-to-PTV margins are often applied based on the expecta-
tion of reduced tumor motion [53]. In an idealized gated treatment, tumor position
should be directly detected and the delivery of radiation is only allowed when the
tumor is at the right position.
However, direct detection of the tumor mass in real-time during the treat-
ment is often difficult. Instead, various surrogates are used to identify the tumor
position. Depending on the surrogates used, we may categorize respiratory gating into
two classes: internal gating and external gating. Internal gating uses internal tumor
motion surrogates such as implanted fiducial markers while external gating relies
on external respiratory surrogates such as markers placed on the patient’s abdomen.
During gated treatment, the internal or external surrogate signal is continuously
compared against a pre-specified range of values, called the gating window. When
the surrogate signal is within the gating window, a gating signal is sent to the linac
to turn on the radiation beam.
In the following sections, first we will describe the problem with respiratory
tumor motion in IGRT in Section 2.3.1. Then we will review the history and current
techniques of gated radiotherapy for mobile tumors in Section 2.3.2.
2.3.1 Problems with Respiratory Tumor Motion in Radiother-
apy
Radiation therapy is a treatment modality directed towards local control of cancer.
The primary goal is to precisely deliver a lethal dose to the tumor while minimizing
the dose to surrounding healthy tissues and critical structures. However, treatment
errors related to internal organ motion may greatly degrade the effectiveness of
conformal radiotherapy for the management of thoracic and abdominal lesions, es-
pecially when the treatment is done in a hypo-fraction or single fraction manner
[54, 55]. This has become a pressing issue in the emerging era of image-guided
radiation therapy (IGRT).
Intra-fraction organ motion is mainly caused by patient respiration, and sometimes
also by the skeletal muscular, cardiac, or gastrointestinal systems. Respiration-
induced organ motion has been studied by directly tracking the movement of the tu-
mor [56, 57], the host organ [58, 59], radio-opaque markers implanted at the tumor
site [60, 11], radioactive tracer targeting the tumor [61, 62], and surrogate struc-
tures such as diaphragm and chest wall [63, 64]. It has been shown that the motion
magnitude can be clinically significant (e.g., of the order of 2 - 3 cm), depending
on tumor sites and individual patients.
One category of methods to account for respiratory motion is to minimize the
tumor motion, using techniques such as breath holding and forced shallow breathing
(such as jet ventilation) [65, 66]. These techniques require patient compliance,
active participation and, often, extra therapist participation. They may not be well
tolerated by patients with compromised lung function, which is the case for most
lung cancer patients [67].
An alternative strategy is to allow free tumor motion while adapting the ra-
diation beam to the tumor position by either respiratory gating or beam tracking.
Respiratory gating limits radiation exposure to the portion of the breathing cycle
when the tumor is in the path of the beam [68, 66]. The beam tracking technique fol-
lows the target dynamically with the radiation beam [69], and was first implemented
in a robotic radiosurgery system (CyberKnife)[70]. For linac-based radiotherapy,
tumor motion can be compensated for using a dynamic multi-leaf collimator (MLC)
[71]. Linac based beam tracking is not used in clinical practice because its imple-
mentation and quality assurance are technically challenging. In contrast, respiratory
gating is more practical, and has been adopted in clinical practice by a limited
number of cancer centers. We believe that if the proper tools were available, safe
and effective gated radiotherapy could be widely adopted for treating tumors in the
thorax and abdomen.
2.3.2 Review of the Techniques of Gated Radiotherapy
Respiratory gated radiation therapy was first developed in Japan in the late 1980s
and early 1990s [72, 73]. Various external surrogates were used to monitor respi-
ratory motion, including a combination airbag and strain gauge taped on the pa-
tient’s abdomen or back (for prone treatments) to gate a proton beam [74], and
position sensors placed on the patient [72, 73]. A major advancement of the gated
radiotherapy was the real-time tumor tracking (RTRT) system developed by Mit-
subishi Electronics Co., Ltd., Tokyo, in collaboration with the Hokkaido University
[75, 10]. The RTRT system uses real-time fluoroscopic tracking of gold markers
implanted in the tumor.
Around the mid 1990s, Kubo and his colleagues at the University of Cali-
fornia at Davis introduced the gated radiotherapy technique into the United States.
They reported the first feasibility study of gated radiotherapy with a Varian 2100C
accelerator, as well as an evaluation of different external surrogate signals to moni-
tor respiratory motion [68]. They also reported a gated radiotherapy system which
tracks infrared reflective markers on the patient’s abdomen using a video camera,
developed jointly with Varian Medical Systems, Inc. (Palo Alto, CA) [67]. This
system was later commercialized by Varian and called real-time position manage-
ment (RPM) respiratory gating system. The RPM system has been implemented
and investigated clinically at a number of centers [64].
Currently, the Mitsubishi/Hokkaido RTRT system is the only internal gating
system used in clinical routine, while the Varian RPM system can be considered as
the representative external gating system. Each system has its strengths and weak-
nesses, but their weaknesses have been barriers to a broad adoption of gated radio-
therapy. For the RPM system, a lightweight plastic block with two passive infrared
reflective markers is placed on the patient’s anterior abdominal surface and moni-
tored by a charge-coupled-device (CCD) video camera mounted on the treatment
room wall. The surrogate signal is the abdominal surface motion. Both amplitude
and phase gating are allowed by the RPM system. A periodicity filter checks the
regularity of the breathing waveform and immediately disables the beam when the
breathing waveform becomes irregular, for example due to patient movement or coughing, and
re-enables the beam after establishing that breathing is again regular. The RPM can
also be used during treatment simulation at a radiotherapy simulator or a CT scan-
ner to acquire the patient treatment geometry in the gating window and to set up the
gating window.
The major strength of external gating systems is that they are non-invasive
and that tracking external markers is relatively easy. However, tracking the exter-
nal marker is not equivalent to tracking the tumor, and naively trusting the external
surrogate can cause significant errors. In particular, the relationship between the
tumor motion and the surrogate signal may change over time, which requires fre-
quent re-calibration of this relationship. The major weakness in external gating is
the uncertainty in the correlation between the external marker position and internal
target position.
The Mitsubishi/Hokkaido RTRT system, as well as its application in radiotherapy,
has been extensively documented by the Hokkaido group [75, 10]. The system
consists of four sets of diagnostic x-ray camera systems, where each system con-
sists of an x-ray tube mounted under the floor, a 9-inch image intensifier mounted
in the ceiling, and a high-voltage x-ray generator. The four x-ray tubes are placed
at right caudal, right cranial, left caudal, and left cranial position with respect to the
patient couch at a distance of 280 cm from the isocenter. The image intensifiers are
mounted on the ceiling, opposite to the x-ray tubes, at a distance of 180 cm from
the isocenter, with beam central axes intersecting at the isocenter. At a given time
during patient treatment, depending on the linac gantry angle, two out of the four
x-ray systems are enabled to provide a pair of unblocked orthogonal fluoroscopic
images. To reduce the scatter radiation from the therapeutic beam to the imagers,
the x-ray units and the linac are synchronized, i.e., the MV beam is gated off while the
kV x-ray units are pulsed.
Using this system, the fiducial markers implanted at the tumor site can be
directly tracked fluoroscopically at a video frame rate [76]. The linear accelerator
is gated to irradiate the tumor only when the marker is within the internal gating
window. The size of the gating window is set at ±1 to ±3 mm according to the
patient’s characteristics and the margin used in treatment planning [75]. Techniques
for the insertion of gold markers of 1.5-2.0 mm diameter into or near the tumor
were developed for various tumor sites, including bronchoscopic insertion for the
peripheral lung, image-guided transcutaneous insertion for the liver, cystoscopic
and image-guided percutaneous insertion for the prostate, and surgical implantation for
spinal/paraspinal lesions [10].
Percutaneously implanting fiducial markers is an invasive procedure with
potential risks of infection. Many clinicians are reluctant to use this procedure for
lung cancer patients because puncturing the chest wall may cause pneumothorax.
The insertion of gold markers using bronchofiberscopy is feasible and safe only
for peripheral-type lung tumors, not for central lung lesions [10]. The Hokkaido
group found that the markers fixed into the bronchial tree may significantly change
their relationship with the tumor after 2 weeks of insertion [77]. Therefore, bron-
choscopic insertion of markers is not an ideal solution for lung tumor treatment,
especially for a large number of fractions.
In summary, the major strength of the internal gating systems represented
by the RTRT system is the precise and real-time localization of the tumor position
during the treatment. The implanted internal markers are often good surrogates
for tumor position, and marker migration usually is not an issue if the simulation
images are acquired a few days after marker implantation [10]. It is even less of a
concern if multiple markers are used. The two major weaknesses of internal gating
are the risk of pneumothorax for implantation of markers in the lungs and the high
imaging dose required for fluoroscopic tracking.
Here, in this research, we apply machine learning algorithms to improve
gating and tracking based radiotherapy. More specifically, we aim to precisely gate
the mobile tumor without implanted fiducial markers using fluoroscopic images.
Methods for gating and tracking are described in Chapters 5 and 6 respectively.
Chapter 3
Non-redundant Multi-view Clustering
In this chapter, we will present our non-redundant multi-view clustering framework
in detail. To be specific, in Section 3.1, we present the two clustering approaches
based on our framework. Interestingly, these two proposed approaches are related;
we analyze their relationship in Section 3.1.3. Because our framework
operates in a totally unsupervised fashion, we provide an approach to automatically
find the number of clusters in each iteration, and we develop a corresponding stopping
criterion in Section 3.3. Then, we perform a set of experiments on both
synthetic and real-world data sets; the results are presented in Section 3.2. Finally,
we present our conclusions in Section 3.4.
3.1 Multi-View Orthogonal Clustering
Given data $X \in \mathbb{R}^{d \times N}$ with N instances and d features, our goal is to learn a set of non-redundant clustering views from the data.
There are a number of ways to find different clustering views [18, 19]. One
can apply different clustering algorithms (each with varying objective functions),
utilize different similarity measures, or apply the same algorithm on different ran-
domly sampled (either in instance space or feature space) data from X . Note that
such methods produce each individual clustering independently from all the other
clustering views. While the differences in the objective functions, similarity mea-
sures, density models, or different data samples may lead to clustering results that
differ from one another, it is common to see high redundancy in the obtained mul-
tiple clustering views. Below, we present a framework for successively generating
multiple clustering views that are orthogonal from one another and thus contain
limited redundancy.
Figure 3.1: The general framework for generating multiple orthogonal clustering views.
Figure 3.1 shows the general framework of our approach. We first cluster
the data (this can include dimensionality reduction followed by clustering when
necessary); then we orthogonalize the data to a space that is not covered by the
existing clustering solutions. We repeat this process until we have covered most of the data space or until no structure can be found in the remaining space.
We developed two different approaches within this framework: (1) orthogo-
nal clustering, and (2) clustering in orthogonal subspaces.
These two approaches differ primarily in how they represent the existing
clustering solutions and consequently how to orthogonalize the data based on exist-
ing solutions. Specifically, the first approach represents a clustering solution using
its k cluster centroids. The second approach represents a clustering solution using
the feature subspace that best captures the clustering result. In the next two subsec-
tions, we describe these two different representations in detail and explain how to
obtain orthogonal clustering solutions based on these two representations.
3.1.1 Orthogonal Clustering
Clustering can be viewed as a way of compressing the data X. For example, in k-means [78, 79], the objective is to minimize the sum-squared-error (SSE) criterion:

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2,$$

where $x_i \in \mathbb{R}^d$ is a data point assigned to cluster $C_j$, and $\mu_j$ is the mean of $C_j$. We
represent xi and µj as column vectors. The outputs for k-means clustering are the
cluster means and the cluster membership of each data point xi. One can consider
k-means clustering as a compression of data X to the k cluster means µj .
Following the compression viewpoint, each data point xi is represented by
its cluster mean µj . Given k µj’s for representing X , what is not captured by these
µj’s? Let us consider the space spanned by xi, i = 1 . . . N; we refer to this as the original data space. In contrast, the subspace spanned by the mean vectors µj, j = 1 . . . k, is considered the compressed data space. Assigning data points to
their corresponding cluster means can be essentially considered as projecting the
data points from the original data space onto the compressed data space. What is
not covered by the current clustering solution (i.e., its compressed data space) is
captured by its residue space. In this paper, we define the residue space as the data
projected onto the space orthogonal to our current representation. Thus, to find
alternative clustering solutions not covered in the current solution, we can simply
perform clustering in the space that is orthogonal to the compressed data space.
Given the current data $X^{(t)}$, and the clustering solution found on $X^{(t)}$ (i.e., $M^{(t)} = [\mu_1^{(t)} \ \mu_2^{(t)} \cdots \mu_k^{(t)}]$ and the cluster assignments), we describe two variations for representing the data in the residue space, $X^{(t+1)}$: the single-mean and all-mean representations.
Single-mean representation. In hard clustering, each data point belongs to a single cluster and is thus represented by a single mean vector. For a data point $x_i^{(t)}$ belonging to cluster j, we project it onto its center $\mu_j^{(t)}$ as its representation in the current clustering view. We consider two different methods to compute the residue in this case. In the first method, the residue $x_i^{(t+1)}$ is defined as $x_i^{(t)} - \mu_j^{(t)}$ (i.e., the difference between a data point and its mean). In the second method, the residue $x_i^{(t+1)}$ is defined as the projection of $x_i^{(t)}$ onto the subspace orthogonal to $\mu_j^{(t)}$:

$$x_i^{(t+1)} = \left(I - \frac{\mu_j^{(t)} \mu_j^{(t)T}}{\mu_j^{(t)T} \mu_j^{(t)}}\right) x_i^{(t)}.$$
Note that empirically we observed that method two is more effective in pro-
ducing non-redundant clustering solutions. This can be attributed to the fact
that method two’s residual representation of a data point in iteration t + 1 is
orthogonal to its cluster center in iteration t. This proved to be beneficial in
achieving our goal of producing non-redundant solutions. In the remainder
of the paper, we will focus on the second method for hard clustering.
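As an illustration, the second residue computation can be sketched in a few lines of NumPy. This is a minimal sketch, not the thesis implementation; the function name and the d × N column-vector layout are our own conventions:

```python
import numpy as np

def single_mean_residue(X, labels, means):
    """Project each column of X onto the subspace orthogonal to its own
    cluster mean (the second single-mean residue described above).

    X:      d x N data matrix (points are columns)
    labels: length-N array of cluster indices
    means:  d x k matrix of cluster centroids
    """
    X_res = np.empty_like(X, dtype=float)
    for j in range(means.shape[1]):
        mu = means[:, j:j + 1]                                  # d x 1 centroid
        P = np.eye(X.shape[0]) - (mu @ mu.T) / float(mu.T @ mu)
        idx = labels == j
        X_res[:, idx] = P @ X[:, idx]                           # residue of cluster j
    return X_res
```

After this step, every residual point has zero inner product with its own centroid, which is exactly the orthogonality property exploited in the next iteration.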
All-mean representation.¹ We achieve this by projecting the data onto the subspace spanned by all cluster means and computing the residue, $X^{(t+1)}$, as the projection of $X^{(t)}$ onto the subspace orthogonal to all the cluster centroids.

¹The solution in clustering (hard or soft) can be represented by using all cluster centers.
This can be formalized by the following formula:

$$X^{(t+1)} = \left(I - M^{(t)}\left(M^{(t)T} M^{(t)}\right)^{-1} M^{(t)T}\right) X^{(t)}.$$
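A sketch of the all-mean residue, under the assumption that the centroid matrix M has full column rank (otherwise a pseudo-inverse would be needed); the function name is ours:

```python
import numpy as np

def all_mean_residue(X, M):
    """Residue of X (d x N) after removing the span of all k centroids
    (columns of M, d x k): X_res = (I - M (M^T M)^{-1} M^T) X.
    Assumes M has full column rank; use np.linalg.pinv otherwise."""
    d = X.shape[0]
    P = np.eye(d) - M @ np.linalg.solve(M.T @ M, M.T)
    return P @ X
```

By construction, the columns of the result are orthogonal to every centroid at once, not just to each point's own cluster mean.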
The algorithm for orthogonal clustering is summarized in Algorithm 1. The
data is first centered to have zero mean. We then create the first view by clustering
the original data X . Since most of the data in our experiments are high-dimensional,
we apply principal components analysis [80] to reduce the dimensionality, followed
by k-means. Note that one can apply other clustering methods within our frame-
work. We chose PCA followed by k-means because they are popular techniques. In
step 2, we project the data to the space orthogonal to the current cluster representa-
tion (using cluster centers) to obtain our residue X(t+1). The next clustering view is
then obtained by clustering in this residue space. We repeat steps 1 (clustering) and 2 (orthogonalization) until the desired number of views is obtained or until the SSE is very small; a small SSE signifies that the existing views already cover most of the data. In Section 3.3, we discuss in detail how to stop automatically.
3.1.2 Clustering in Orthogonal Subspaces
In this approach, given a clustering solution with means µj , j = 1 . . . k, we would
like to find a feature subspace that best captures the clustering structure, or, in other
words, discriminates these clusters well. One well-known method for finding a
reduced dimensional space that discriminates classes (clusters here) is linear dis-
criminant analysis (LDA) [2, 81]. Another approach is by applying singular value
decomposition (SVD) on the k mean vectors µj’s [34].
Below we explain the mathematical differences between these two approaches.
Algorithm 1 Orthogonal Clustering.
Inputs: The data matrix $X \in \mathbb{R}^{d \times N}$, and the number of clusters $k^{(t)}$ for each iteration t.
Output: The multiple partitioning views of the data into $k^{(t)}$ clusters at each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and $X^{(1)} = X$.
Step 1: Cluster $X^{(t)}$. In our experiments, we performed PCA followed by k-means. The compressed solution is the k means $\mu_j^{(t)}$. Each $\mu_j^{(t)}$ is a column vector in $\mathbb{R}^d$ (the original feature space).
Step 2: Project each $x_i^{(t)}$ in $X^{(t)}$ to the space orthogonal to its cluster mean (for the single-mean version) or to all the means (for the all-mean version), to form the residue space representation, $X^{(t+1)}$.
Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, $\sum_{j=1}^{k} \sum_{x_i^{(t)} \in C_j^{(t)}} \|x_i^{(t)} - \mu_j^{(t)}\|^2$, is very small.
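To make the loop concrete, here is a compact end-to-end sketch of Algorithm 1 (single-mean version). We substitute a bare-bones Lloyd's k-means for the PCA + k-means base clustering used in our experiments, and the function names are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Bare-bones Lloyd's k-means on a d x N matrix; returns (means, labels)."""
    rng = np.random.default_rng(seed)
    means = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    labels = np.zeros(X.shape[1], dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - means[:, :, None]) ** 2).sum(axis=0)  # k x N distances
        labels = d2.argmin(axis=0)
        for j in range(k):
            if np.any(labels == j):
                means[:, j] = X[:, labels == j].mean(axis=1)
    return means, labels

def orthogonal_clustering(X, ks, tol=1e-6):
    """Algorithm 1 (single-mean version): cluster, record the view, then
    project each point orthogonal to its own centroid and repeat."""
    X = X.astype(float) - X.mean(axis=1, keepdims=True)  # pre-processing: center
    views = []
    for t, k in enumerate(ks):                           # one k per requested view
        means, labels = kmeans(X, k, seed=t)             # step 1: cluster X(t)
        views.append(labels)
        sse = sum(((X[:, labels == j] - means[:, j:j + 1]) ** 2).sum()
                  for j in range(k))
        if sse < tol:                                    # stop: views cover the data
            break
        for j in range(k):                               # step 2: orthogonalize
            mu = means[:, j:j + 1]
            nrm2 = float(mu.T @ mu)
            if nrm2 > 1e-12:
                P = np.eye(X.shape[0]) - (mu @ mu.T) / nrm2
                X[:, labels == j] = P @ X[:, labels == j]
    return views
```

On data like synthetic data 1 below, successive views with k = 2 can recover the two orthogonal groupings of the four underlying clusters.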
LDA finds a linear projection $Y = A^T X$ that maximizes

$$\mathrm{trace}\left(S_w^{-1} S_b\right),$$
where Sw is the within-class-scatter matrix and Sb is the between-class-scatter ma-
trix defined as follows.
$$S_w = \sum_{j=1}^{k} \sum_{y_i \in C_j} (y_i - \mu_j)(y_i - \mu_j)^T,$$

$$S_b = \sum_{j=1}^{k} n_j (\mu_j - \mu)(\mu_j - \mu)^T,$$
where yi’s are projected data points; µj’s are projected cluster centers; nj is the total
number of points in cluster j and µ is the center of the entire set of projected data.
In essence, LDA finds the subspace that maximizes the scatter between the cluster
means normalized by the scatter within each cluster.
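The two scatter matrices can be computed directly from their definitions; a small NumPy sketch (our own helper, not part of the thesis code):

```python
import numpy as np

def scatter_matrices(Y, labels):
    """Within-class scatter Sw and between-class scatter Sb of the
    projected data Y (d x N), following the definitions above."""
    d = Y.shape[0]
    mu = Y.mean(axis=1, keepdims=True)           # overall projected mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for j in np.unique(labels):
        Yj = Y[:, labels == j]
        mj = Yj.mean(axis=1, keepdims=True)      # projected cluster center
        Sw += (Yj - mj) @ (Yj - mj).T            # spread within cluster j
        Sb += Yj.shape[1] * (mj - mu) @ (mj - mu).T  # weighted spread of centers
    return Sw, Sb
```

A quick sanity check is the standard identity $S_w + S_b = S_t$, where $S_t$ is the total scatter of Y about its overall mean.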
Similarly, the SVD approach in [34] seeks a linear projection $Y = A^T X$, but maximizes a different objective function, $\mathrm{trace}(M M^T)$, where $M = [\mu_1 - \mu, \mu_2 - \mu, \cdots, \mu_k - \mu]$, the $\mu_j$'s are the projected cluster centers, and $\mu$ is the center of the entire set of projected data.
For both methods, the solution can be represented as $A = [\alpha_1 \ \alpha_2 \cdots \alpha_q]$, which contains the q most important eigenvectors (those corresponding to the q largest eigenvalues) of $S_w^{-1} S_b$ for LDA and of $M M^T$ for SVD, respectively.

Note that $\mathrm{trace}(S_b) = \mathrm{trace}(M' M'^T)$, where $M' = [\sqrt{n_1} M_1 \ \sqrt{n_2} M_2 \cdots \sqrt{n_k} M_k]$. The only difference between M and M' is the weighting of each column $M_j$ of M by the square root of $n_j$, the number of data points in cluster $C_j$. But both M and M' span the same space. Thus, in practice, maximizing $\mathrm{trace}(M M^T)$ and maximizing $\mathrm{trace}(S_b)$ produce similar results. What distinguishes maximizing $\mathrm{trace}(S_b)$ (or $\mathrm{trace}(M M^T)$) from the standard LDA objective, $\mathrm{trace}(S_w^{-1} S_b)$, is the normalization by the within-class scatter, $S_w^{-1}$. For computational reasons, we choose the SVD approach on the means $\mu_j$ and set $q = k - 1$, the rank of $M M^T$. In general, one may keep any number of dimensions $q \le k - 1$, and use any other dimensionality reduction algorithm to determine $A^{(t)}$.
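In NumPy terms, the subspace A can be read off from an SVD of the centroid matrix; a minimal sketch with an illustrative function name:

```python
import numpy as np

def subspace_from_means(M, q=None):
    """Return A, the top-q left singular vectors of the centroid matrix
    M (d x k); these are the leading eigenvectors of M M^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    if q is None:
        q = M.shape[1] - 1        # q = k - 1, the rank of M M^T for centered data
    return U[:, :q]
```

The returned columns are orthonormal, so projecting onto the space orthogonal to A reduces to $I - A A^T$.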
Once we have obtained a feature subspace $A = [\alpha_1 \ \alpha_2 \cdots \alpha_q]$ that captures the clustering structure well, we project $X^{(t)}$ onto the subspace orthogonal to A to obtain the residue $X^{(t+1)} = P^{(t)} X^{(t)}$. The orthogonal projection operator, P, is:

$$P^{(t)} = I - A^{(t)}\left(A^{(t)T} A^{(t)}\right)^{-1} A^{(t)T}.$$
Algorithm 2 presents the pseudo-code for clustering in orthogonal subspaces.
We first pre-process the data to have zero mean. In step 1, we apply a clustering
algorithm (PCA followed by k-means in our experiments). We then represent this
clustering solution using the subspace that best separates these clusters. In step 2,
we project the data to the space orthogonal to the computed subspace representation. We repeat steps 1 and 2 until the desired number of views is obtained or
Algorithm 2 Clustering in Orthogonal Subspaces.
Inputs: The data matrix $X \in \mathbb{R}^{d \times N}$, and the number of clusters $k^{(t)}$ for each iteration t.
Output: The multiple partitioning views of the data into $k^{(t)}$ clusters, and a reduced-dimensional subspace $A^{(t)}$ for each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and $X^{(1)} = X$.
Step 1: Cluster $X^{(t)}$. In our experiments, we performed PCA followed by k-means. Then, apply a dimensionality reduction algorithm to obtain the subspace, $A^{(t)}$, that captures the current clustering.
Step 2: Project $X^{(t)}$ to the space orthogonal to $A^{(t)}$ to produce $X^{(t+1)} = P^{(t)} X^{(t)}$, where the projection operator $P^{(t)}$ is:

$$P^{(t)} = I - A^{(t)}\left(A^{(t)T} A^{(t)}\right)^{-1} A^{(t)T}.$$

Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, $\sum_{i=1}^{N} \|x_i^{(t)} - A^{(t)} y_i^{(t)}\|^2$, is very small.
the SSE is very small. An automated approach for determining when to stop is
proposed in Section 3.3.2.
3.1.3 Relationship Between Orthogonal Clustering and Clustering in Orthogonal Subspaces
We have illustrated two approaches to represent a clustering view and search for
orthogonal clustering solutions in the two previous subsections. In this section, we
discuss the relationship between them.
In general, these two methods are different. However, under certain condi-
tions, the all-mean version of method 1 can be equivalent to method 2.
Assume that we obtain the same clustering results with K clusters for both
methods; consequently, the same mean vectors. The projection matrix for method
1, orthogonal clustering (all-mean version), is:

$$P_1 = I - M^{(t)}\left(M^{(t)T} M^{(t)}\right)^{-1} M^{(t)T},$$

where $M^{(t)} = [\mu_1^{(t)}, \cdots, \mu_K^{(t)}]$ and $X^{(t+1)} = P_1 X^{(t)}$. In contrast, for method 2, clustering in orthogonal subspaces, the projection matrix is:

$$P_2 = I - A^{(t)}\left(A^{(t)T} A^{(t)}\right)^{-1} A^{(t)T},$$

where $X^{(t+1)} = P_2 X^{(t)}$, and $A^{(t)}$ is the matrix of eigenvectors of $M'^{(t)} M'^{(t)T}$, with $M'^{(t)} = [\mu_1^{(t)} - \mu^{(t)}, \cdots, \mu_K^{(t)} - \mu^{(t)}]$. Note that the total mean $\mu^{(t)}$ is zero because X is zero-centered initially and the linear projections simply rotate X, keeping the center of the residue spaces $X^{(t)}$ at zero. Therefore, we have $M' = M$. It follows that A and M span the same space. As a result, we have $P_1 = P_2$. More specifically, substituting the singular value decomposition of M,² $M = A S V^T$, we get

$$M\left(M^T M\right)^{-1} M^T = A S V^T \left(V S^T A^T A S V^T\right)^{-1} V S^T A^T \quad (3.1)$$
$$= A S V^T (V^T)^{-1} S^{-1} A^{-1} (A^T)^{-1} (S^T)^{-1} V^{-1} V S^T A^T$$
$$= A\left(A^T A\right)^{-1} A^T.$$

Thus, $P_1 = P_2$. Therefore, after projection, the residue space generated by the orthogonal clustering approach is the same as the one generated by the orthogonal subspace algorithm. The two methods lead to the same multi-view clustering results.
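This equivalence is easy to verify numerically for a random full-rank centroid matrix; the following check is ours, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))                # a random full-rank centroid matrix

# Method 1 (all-mean version): projector built directly from the centroids.
P1 = np.eye(5) - M @ np.linalg.solve(M.T @ M, M.T)

# Method 2 (SVD on the means): projector built from the left singular
# vectors of M, which span the same space as the columns of M.
A = np.linalg.svd(M, full_matrices=False)[0]
P2 = np.eye(5) - A @ np.linalg.solve(A.T @ A, A.T)

assert np.allclose(P1, P2, atol=1e-8)          # the two residue maps coincide
```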
In the above paragraphs, we showed when the two methods are equal. Here, we explain when they differ. In method 2, one also has the option of keeping fewer than K − 1 eigenvectors. In such a case, the two methods will be different, with method 2 converging more slowly toward zero SSE and consequently producing more iterations/views.

²We remove the superscript (t) here to simplify notation.
In addition, note that the equivalence of these two methods only holds when
we apply SVD on the means in method 2. If other dimensionality reduction tech-
niques are used, including LDA, these two methods will lead to different clustering
solutions in each iteration.
Finally, the above equivalence of the two methods only holds for the all-
mean version of method 1. The single-mean version of method 1 leads to different
residue representations of the data compared to method 2.
Again, let us assume the same clustering results with K clusters for both methods. Without loss of generality, we select a data sample $x_i^{(t)}$ from $X^{(t)}$ and assume it is clustered into class c. In the single-mean version of the orthogonal clustering method, each data point belongs to a single cluster, and the projection matrix is:

$$P_1 = I - \frac{\mu_c^{(t)} \mu_c^{(t)T}}{\mu_c^{(t)T} \mu_c^{(t)}}, \qquad x_i^{(t+1)} = P_1 x_i^{(t)}.$$

For method 2, clustering in orthogonal subspaces, the projection matrix is:

$$P_2 = I - M^{(t)}\left(M^{(t)T} M^{(t)}\right)^{-1} M^{(t)T}, \qquad M^{(t)} = [\mu_1^{(t)}, \cdots, \mu_K^{(t)}], \qquad x_i^{(t+1)} = P_2 x_i^{(t)}.$$
Comparing the two projection matrices, we see that only when $\mathrm{span}\{\mu_1^{(t)}, \cdots, \mu_K^{(t)}\} = \mathrm{span}\{\mu_c^{(t)}\}$ (i.e., all current mean vectors lie along a single line/1D vector equal to $\mu_c^{(t)}$) do we have $P_1 = P_2$, so that the two methods produce the same residue $x_i^{(t+1)}$. In practice, however, because $\mu_c^{(t)}$ generally covers less space than the span of all the mean vectors, which has dimensionality K − 1, the single-mean version of method 1 converges more slowly than method 2 in each iteration. In other words, method 2 removes the entire subspace covered by the current K-cluster partitioning, while the single-mean version of method 1 only removes the direction of each sample's clus-
ter mean. Another difference is that method 2 always obtains a residue space that
is orthogonal to all the previous iterations; the single-mean version of method 1, on
the other hand, leads to a residue that is not orthogonal to all previous iterations.
3.2 Experiments
In this section, we investigate whether our multi-view orthogonal clustering frame-
work can provide us with reasonable and orthogonal clustering views of the data.
We start by performing experiments on synthetic data in Section 3.2.1 to get a better
understanding of the methods, then we test the methods on benchmark data in Sec-
tion 3.2.2. In these experiments, we chose PCA followed by k-means as our base clustering: we first reduce the dimensionality with PCA, keeping enough dimensions to retain at least 90% of the original variance, and then apply k-means clustering. Because the all-mean version of method 1 is
equivalent to method 2, we implement orthogonal clustering with the single-mean
version. In this section, we refer to the single-mean version of orthogonal cluster-
ing approach as method 1, and the clustering in orthogonal subspaces approach as
method 2.
3.2.1 Experiments on Synthetic Data
We would like to see whether our two methods can find diverse groupings of the data. We generate two synthetic data sets.
Data 1 We generate a four-cluster data set in two dimensions with N = 500 instances, as shown in Figure 3.2, where each cluster contains 125 data points. We test our methods by setting k = 2 for k-means. If the methods group the clusters one way in the first iteration, they should group them the other way in the next iteration. This data set tests whether the methods can find orthogonal clusters.
Data 2 We generate a second synthetic data set in four dimensions, with N = 500 instances, as shown in Figure 3.3. We generate three Gaussian clusters in
features F1 and F2 with 100, 100 and 300 data points and means µ1 =
(12.5, 12.5), µ2 = (19, 10.5), and µ3 = (6, 17.5), and identity covariances.
We generate another mixture of three Gaussian clusters in features F3 and F4
with 200, 200 and 100 data points and means µ1 = (2, 17), µ2 = (17.5, 9),
and µ3 = (1.2, 5), and identity covariances. This data tests whether the meth-
ods can find different clustering solutions in different subspaces.
Table 3.1: Confusion Matrix for Synthetic Data 1

                 METHOD 1       METHOD 2
ITERATION 1      C1     C2      C1     C2
L1               125    0       125    0
L2               0      125     0      125
L3               125    0       125    0
L4               0      125     0      125
ITERATION 2      C1     C2      C1     C2
L1               125    0       125    0
L2               125    0       125    0
L3               0      125     0      125
L4               0      125     0      125
Results for Synthetic Data 1
The confusion matrix in Table 3.1 shows the experimental results for synthetic data
1 for methods 1 and 2, in two iterations. We can see that for the first iteration, both
methods grouped classes L1 and L3 into a single cluster C1, and classes L2 and
L4 into another cluster C2. For the second iteration, the data was partitioned in a
[Figure 3.2 appears here: scatter-plot panels in features F1/F2 for iteration 1 (a1: method 1, b1: method 2) and iteration 2 (a2, b2), and SSE-versus-iteration panels (a3, b3).]

Figure 3.2: Scatter plots of synthetic data 1. The two columns show the results of methods 1 and 2, respectively. The colors represent different class labels and the ellipses represent the clusters found. Rows 1 and 2 show the results for iterations 1 and 2, respectively; row 3 shows SSE as a function of iteration.
different way, which grouped classes L1 and L2 into one cluster, and classes L3 and
L4 into another cluster. Figure 3.2 shows the scatter plot of the clustering results
of both methods in the original 2D data space for the two iterations. Different
colors are used to signify the true classes, and the ellipses show the clusters found
by k-means. The figure confirms the result summarized in the confusion matrix.
Both methods 1 and 2 give similar results, as shown. In subfigures a3 and b3 of
Figure 3.2, we plot the sum-squared-error (SSE) as a function of iteration. Note
that, as expected, SSE for both methods decreases monotonically until convergence.
Moreover, the SSE reaches zero at iteration 2 meaning that the first two clustering
views have covered the data space completely.
Results for Synthetic Data 2
Table 3.2 shows the confusion matrix for our clustering with the two different label-
ings: labeling 1 is for features 1 and 2, and labeling 2 is for features 3 and 4. A high number of common occurrences means that the cluster corresponds to those labels.
Observe that both methods 1 and 2 found the clusters in labeling 2 (features 3 and 4) perfectly, with zero confusion in the off-diagonal elements, in the first
iteration/view. In the second iteration/view, methods 1 and 2 found the clusters in
labeling 1 (features 1 and 2) perfectly also with zero confusion. This result confirms
that indeed our multi-view approach can discover multiple clustering solutions in
different subspaces. Figure 3.3 shows scatter plots of the data. The left column
((a1), (a2), (a3)) is the plot for method 1. (a1) shows the clustering in ellipses
found by method 1 in iteration 1. The left sub-figure shows the groupings in the
original features 1 and 2, and the data points are colored based on true labeling 1.
The right sub-figure shows the clusterings in the original features 3 and 4, and the
color of the data points are based on true labeling 2. (a2) is the same scatter plot of
the original data X with the clusters found by method 1 as shown by the ellipses in
iteration 2. Similarly, (b1) and (b2) show the results of method 2. (a3) and (b3) are
37
the SSE for the two methods in each iteration. Method 2 converges much faster than
method 1 here. Note that SSE monotonically decreases with iteration and that the
algorithm captures most of the information in two clustering views. From these re-
sults, iteration 1 finds the right partition based on features 3 and 4 but groups the clusters in features 1 and 2 incorrectly. On the other hand, iteration 2 groups the clusters based on features 1 and 2 correctly, but the partition for the clusters in
features 3 and 4 is wrong. The results confirm that indeed our multi-view approach
can discover multiple clustering solutions in different subspaces.
Table 3.2: Confusion Matrix for Synthetic Data 2

                 METHOD 1            METHOD 2
ITERATION 1
LABELLING 1      C1    C2    C3      C1    C2    C3
L1               41    40    19      41    40    19
L2               44    34    22      44    34    22
L3               115   126   59      115   126   59
LABELLING 2      C1    C2    C3      C1    C2    C3
L1               200   0     0       200   0     0
L2               0     200   0       0     200   0
L3               0     0     100     0     0     100
ITERATION 2
LABELLING 1      C1    C2    C3      C1    C2    C3
L1               100   0     0       100   0     0
L2               0     100   0       0     100   0
L3               0     0     300     0     0     300
LABELLING 2      C1    C2    C3      C1    C2    C3
L1               126   34    40      126   34    40
L2               115   44    41      115   44    41
L3               59    22    19      59    22    19
3.2.2 Experiments on Real-World Benchmark Data
We have shown that our two methods work on synthetic data. Here, we investigate
whether they reveal interesting and diverse clustering solutions on real benchmark
[Figure 3.3 appears here: for each iteration and method, paired scatter-plot panels in features F1/F2 (colored by label 1) and F3/F4 (colored by label 2) (a1, b1: iteration 1; a2, b2: iteration 2), plus SSE-versus-iteration panels (a3, b3).]

Figure 3.3: These are scatter plots of synthetic data 2 and the clusters found by methods 1 (a1, a2) and 2 (b1, b2). The color of the data points reflects different class labels and the ellipses represent the clusters found. a1, b1 are the results for iteration 1; a2, b2 are the results for iteration 2; a3 and b3 show SSE as a function of iteration for methods 1 and 2, respectively.
data. We select data sets that have high-dimensionality and that have multiple pos-
sible partitions. Since the two methods have similar results, we only need to show
the results once. In particular, we report the results for method 2.
In this section, we investigate the performance of our multi-view orthog-
onal clustering algorithms on four real-world data sets, including the digits data
set from the UCI machine learning repository [82], the face data set from the UCI
KDD repository [83], and two text data sets: the mini-newsgroups data [83] and the
WebKB data set [84].
The digits data is a data set for an optical recognition problem of handwritten
digits with ten classes, 5620 cases, and 64 attributes (all input attributes are integers
from 0 . . . 16). The face data consists of 640 face images of 20 people taken with
varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes
(wearing sunglasses or not). Each person has 32 images capturing every combination of features. The image resolution is 32 × 30. We removed the missing data and formed a 960 × 624 data matrix. Each of the 960 features represents a pixel value.
The mini-newsgroups data comes from the UCI KDD repository which contains
2000 articles from 20 newsgroups. The second text data is the CMU four university
WebKB data set as described in [84]. Both text data sets were processed following
the standard procedure, including stemming and removing stopwords.
Results for the Digit Data
Table 3.3 shows the confusion matrix for the digit data. For all three iterations,
we partition the data into three clusters. In iteration 1, the resulting partition clus-
tered digits {1, 7, 8}, {2, 3, 5, 9} and {0, 4, 6} into different groups. In iteration 2,
our method clustered {2, 6, 8}, {1, 4} and {0, 3, 5, 7, 9} into another set of clusters.
And, in iteration 3, the clusters we found are {3, 6, 7}, {0, 1, 2, 8, 9} and {4, 5}.
These results show that in each iteration we can find a different way of partitioning
the ten classes (digits).
Table 3.3: Confusion Matrix for the Digits Data

     ITERATION 1              ITERATION 2              ITERATION 3
DIGIT  C1   C2   C3      DIGIT  C1   C2   C3      DIGIT  C1   C2   C3
"1"    477  82   12      "2"    488  11   58      "3"    393  151  28
"7"    566  0    0       "6"    532  14   12      "6"    486  47   25
"8"    321  218  15      "8"    350  31   173     "7"    500  54   12
"2"    27   525  5       "1"    285  286  0       "0"    1    545  8
"3"    41   531  0       "4"    49   381  138     "1"    77   394  100
"5"    236  287  35      "0"    2    4    548     "2"    21   522  14
"9"    152  408  2       "3"    66   158  348     "8"    213  317  24
"0"    0    4    550     "5"    212  24   322     "9"    143  265  154
"4"    199  0    369     "7"    67   7    492     "4"    159  187  222
"6"    2    1    555     "9"    9    95   458     "5"    5    18   535
In Figure 3.4, we present the mean image of each cluster obtained by method
2 in three iterations. Below each image we show the dominant digits contained in
the cluster. For a digit to be considered as contained in a cluster, we require that at
least 70% of its data points fall into the cluster. It is interesting to note that digits
4 and 5 were not well captured by any of the clusters in iteration 1. In contrast, in
iteration 2, we see digit 4 well-separated and captured by cluster 2. In iteration 3,
we were able to capture digit 5 nicely in a single cluster. This further demonstrated
that our method is capable of discovering multiple reasonable structures from data.
Results for the Face Data
Face data is a very interesting data set because it can be grouped in several different
ways (e.g., by person, pose, etc.). We design the experiment to see if we can obtain
different clustering information in different iterations.
First, we begin with our number of clusters K = 20 in the first iteration,
hopefully to find the 20 persons in the database. Then, from the second iteration to
the rest of the iterations, we set K = 4 to see if the partitions found in the remaining
iterations can tell us any useful information. Figure 3.5 shows the average image
[Figure 3.4 appears here: the average digit image for each of the three clusters in iterations 1, 2 and 3, with dominant digits ("1","7"), ("0","6"), ("2","3","9") in iteration 1; ("2","6"), ("4"), ("0","7","9") in iteration 2; and ("3","6","7"), ("0","1","2"), ("5") in iteration 3.]

Figure 3.4: The average digit for images within each cluster found by method 2 in iterations/views 1, 2 and 3. These clustering views correspond to different digits.
for each cluster we find in iteration 1. We observed from this figure that iteration
1 leads to a clustering corresponding to the different persons in the database. The
number below the image is the percentage this person appears in the cluster. The
images clearly show different persons. In the second iteration, the four clusters we
found are shown in Figure 3.6. Each image is an average image of the images within
each cluster. It is clear that the clustering in iteration 2 groups the data based on
different poses. This suggested that our method was able to find different clustering
views from the face data.
Results for the Mini-Newsgroups Data
The mini-newsgroups data set originally contains 20 classes. We removed the
classes that are under the “misc” category because it does not correspond to a clear
[Figure 3.5 appears here: 20 average face images, one per cluster, each annotated with the fraction of the cluster accounted for by its dominant person (values range from 0.25 to 1).]

Figure 3.5: The average face image for each cluster in iteration 1. This clustering view corresponds to different persons.
concept class. We also pre-processed the data to remove stop words, words that appeared in fewer than 40 documents, and words that had low variance of occurrence across documents. After pre-processing, the data contains 1700 documents from 17
classes. Each document is represented by a 500-dimensional term frequency vector.
Note that PCA followed by k-means does not work well for text data. Here,
we apply the spherical k-means method [85] instead, which considers the corre-
lation between documents rather than the Euclidean distance. Our experiments
showed that this method provided a reasonable clustering of the text data sets.
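A minimal sketch of spherical k-means as we understand it from [85]: documents are normalized to unit length, assignments use cosine similarity (the dot product) instead of Euclidean distance, and centroids are re-normalized to the unit sphere each round. The function name and details are illustrative, not the exact implementation used in our experiments:

```python
import numpy as np

def spherical_kmeans(X, k, iters=30, seed=0):
    """Spherical k-means on a d x N term-frequency matrix (documents are
    columns): unit-normalize documents, assign by cosine similarity, and
    keep each centroid on the unit sphere."""
    X = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    labels = np.zeros(X.shape[1], dtype=int)
    for _ in range(iters):
        labels = (C.T @ X).argmax(axis=0)        # cosine similarity = dot product
        for j in range(k):
            if np.any(labels == j):
                c = X[:, labels == j].sum(axis=1)
                C[:, j] = c / (np.linalg.norm(c) + 1e-12)  # re-normalize centroid
    return C, labels
```

Because both documents and centroids live on the unit sphere, the assignment step maximizes the total cosine similarity between each document and its cluster centroid.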
Table 3.4 shows the confusion matrices by method 2 for three iterations.
For the first iteration, we set K = 3. The results show that cluster C1 groups to-
[Figure 3.6 appears here: four average face images annotated 0.90, 0.42, 0.51 and 0.67.]

Figure 3.6: The average face image for each cluster in iteration 2. This clustering view corresponds to different poses.
gether the recreation and computer categories. The ten most frequent words from
this cluster suggested that the documents here share information related to enter-
tainment. Cluster C2 groups science and talks together, and the frequent words
confirm that it groups science and the religion part of the talk. Cluster C3 is a
mixture of different topics.
In iteration 2, we set K = 4 to see if we can partition the data to capture the
four categories “computer”, “recreation”, “talk” and “science”. From the confusion
matrix, we see that we were able to find these high level categories. C1 is about
computers; C2 contains news about recreation; and C3 groups those files related to
science. The last one C4 contains documents from the talk category that are related
to politics.
In iteration 3, two of the computer classes (graphics, os.ms) were grouped together with the talk category, while the remaining three computer classes were grouped
44
Table 3.4: Confusion Matrix for the Mini-Newsgroups DataITERATION1 (K=3) C1 C2 C3COMP.GRAPHICS 88 0 12COMP.OS.MS 95 0 5COMP.SYS.IBM.PC.HARDWARE 94 0 6COMP.SYS.MAC.HARDWARE 88 0 12COMP.WINDOWS.X 87 0 13REC.AUTOS 81 0 19REC.MOTORCYCLES 82 0 18REC.SPORT.BASEBALL 81 0 19REC.SPORT.HOCKEY 71 2 27SCI.CRYPT 0 68 32SCI.ELECTRONICS 0 76 24SCI.MED 0 78 22SCI.SPACE 0 74 26TALK.POLITICS.GUNS 0 70 30TALK.POLITICS.MIDEAST 0 61 39TALK.POLITICS.MISC 0 72 28TALK.RELIGION.MISC 0 77 23ITERATION2 (K=4) C1 C2 C3 C4COMP.GRAPHICS 98 0 2 0COMP.OS.MS 94 0 0 6COMP.SYS.IBM.PC.HARDWARE 78 15 3 4COMP.SYS.MAC.HARDWARE 66 20 3 11COMP.WINDOWS.X 43 39 5 13REC.AUTOS 28 51 6 15REC.MOTORCYCLES 17 59 6 18REC.SPORT.BASEBALL 10 67 4 19REC.SPORT.HOCKEY 5 62 4 29SCI.CRYPT 5 8 57 30SCI.ELECTRONICS 1 9 65 25SCI.MED 1 22 61 16SCI.SPACE 0 16 58 26TALK.POLITICS.GUNS 0 37 20 43TALK.POLITICS.MIDEAST 5 39 11 45TALK.POLITICS.MISC 3 45 6 46TALK.RELIGION.MISC 1 58 3 38ITERATION3 (K=4) C1 C2 C3 C4COMP.GRAPHICS 33 32 6 29COMP.OS.MS 42 23 10 25COMP.SYS.IBM.PC.HARDWARE 17 45 11 27COMP.SYS.MAC.HARDWARE 15 41 20 24COMP.WINDOWS.X 19 40 18 23REC.AUTOS 15 47 27 11REC.MOTORCYCLES 10 54 22 14REC.SPORT.BASEBALL 7 51 33 9REC.SPORT.HOCKEY 5 66 21 8SCI.CRYPT 5 15 68 12SCI.ELECTRONICS 10 9 65 16SCI.MED 31 8 46 15SCI.SPACE 15 24 48 13TALK.POLITICS.GUNS 49 19 18 14TALK.POLITICS.MIDEAST 45 24 16 15TALK.POLITICS.MISC 55 12 12 21TALK.RELIGION.MISC 56 8 20 16
Table 3.5: Confusion Matrix for WebKB Data

Iteration 1:
                C1    C2    C3    C4
  COURSE       134    12    81    17
  FACULTY        2    78    61    12
  PROJECT        1    47    28    10
  STUDENT        2    68   402    86

Iteration 2:
                C1    C2    C3    C4
  CORNELL      103    86    27    10
  TEXAS         50    87    83    32
  WASHINGTON    35    77   138     5
  WISCONSIN     60    86    30   132
together with the recreation category (i.e., auto, motorcycles and sports). This sug-
gests that our method continued to find interesting clustering structure that is dif-
ferent from the existing results.
Results for the WebKB Text Data
This data contains 1041 html documents, from four webpage topics: course, faculty,
project and student. Alternatively, the webpages can also be grouped by their
regions/universities, which include four universities: Cornell University, the University
of Texas at Austin, the University of Washington and the University of Wisconsin-Madison. Following the
same pre-processing procedure used for the mini-newsgroups data, we removed the
rare words, stop words, and words with low variances. Finally, we obtained 350
words in the vocabulary. The final data matrix is of size 350 × 1041.
The experimental results are quite interesting. For the first iteration, we
see our method found the partition that mostly corresponds to the different topics,
which can be seen in Table 3.5. Cluster 1 contains course webpages, cluster 2 is
a mix of faculty and project pages, and clusters 3 and 4 both consist of a majority of
student webpages. In the second iteration, our method found a different clustering
that corresponds to the universities, as shown in Table 3.5.
3.3 Automatically Finding the Number of Clusters
and Stopping Criteria
In this section, we investigate how we can fully automate the process of finding
non-redundant multiple clustering views by addressing the following two model
selection issues: (1) how to automatically determine the number of clusters in each
view, and (2) how to automatically determine when to stop generating alternative
views.
3.3.1 Finding the Number of Clusters by Gap Statistics
There are several ways to find the number of clusters, K, automatically. For exam-
ple, the Bayesian information criterion (BIC) [86] and Akaike's information criterion
(AIC) [87] find K by adding a model complexity penalty to the maximum likelihood
estimation of mixture models for clustering. X-means [88] extends the BIC score
to K-means clustering for finding K. Resampling methods [89, 90] attempt to find
the correct number of clusters by clustering on diversified samples of the data set,
and select the K which gives the most “stable” clustering result. Another approach
introduced in [91] is gap statistics. Gap statistics selects K to be the minimum
K for which the gap between the distribution of the observed data samples and a
non-structured null distribution is statistically significant. Note that any one of these
approaches can work with our framework. In this paper, we chose gap statistics
[91] in our experiments to automatically find the number of clusters K.
The basic idea of gap statistics is to compare the observed distribution of
the data samples to a null reference distribution; K is then selected to be the smallest
K whose gap from the null is statistically significant. The error measure

W_K = Σ_{r=1}^{K} (1/(2n_r)) D_r

(the within-cluster dispersion) decreases monotonically as the number of clusters K
increases. Statistical studies show that for some K the decrease flattens markedly,
and such an "elbow" indicates the appropriate number of clusters [91]. The gap at K
is defined as

Gap_N(K) = E*_N{log(W_K)} − log(W_K),

where E*_N denotes the expectation under a sample of size N from the reference
distribution. Here, we set our null reference to be a uniform distribution, as suggested
in [91]. D_r = Σ_{i,i'∈C_r} Σ_j (x_{ij} − x_{i'j})², so W_K is the sum of within-cluster
error, which is consistent with the objective function in k-means clustering (used in our
experiments).
To computationally implement gap statistics, we generate B copies of the null
reference by uniformly sampling each virtual feature over the range of the observed
values for that feature. Varying the total number of clusters K from 1 to K_max,
we cluster both the observations and the references. We compute the expected value
E*_N{log(W_K)} by averaging over the B copies, l̄ = (1/B) Σ_b log(W*_{Kb}), take W_K
as the sum of within-cluster error resulting from clustering the observed data, and
compute the gap

Gap(K) = l̄ − log(W_K).
Then, the clustering result with K clusters is said to be statistically significantly
different from the null reference if

Gap(K) ≥ Gap(K+1) − σ_{K+1}√(1 + 1/B),

where σ_K is the standard deviation, calculated as

σ_K = [(1/B) Σ_b {log(W*_{Kb}) − l̄}²]^{1/2}.

We thus select the optimal K_opt to be the smallest K that achieves this statistical
significance, i.e., Gap(K) ≥ Gap(K+1) − σ_{K+1}√(1 + 1/B). We apply this estimation
method to find the number of clusters in each iteration/view of our multi-view
clustering framework.
3.3.2 Stopping Criteria
A major difference between our proposed method and previous works on non-
redundant clustering [25] is that we aim to generate a sequence of non-redundant
alternative views of the data instead of just one alternative view. Consequently, we
need an approach to guide our algorithm to know when to stop. We propose three
stopping criteria for our framework. When one of the three criteria is met, we stop
the process and return all the partitionings we currently have as the output result.
The first two criteria are based on model estimation for finding K as dis-
cussed in the previous section. Specifically, when Kopt = 1, we know that there
is only one cluster in the remaining data. That means there is no more interesting
structure in the residue space and we should stop.
Secondly, we are also not interested in views with very small clusters which
only contain very few data points. This implies that the data is non-structured and
the clustering algorithm is simply trying to memorize it (an example of over-fitting
in unsupervised learning). Thus, when Kopt is very large compared to the number of
data samples, we should also stop to keep the clustering algorithm from breaking the
data into negligibly small clusters. Similarly, when none of the candidate values of K from
1 to a high K_max³ achieves a statistically significant gap from the null distribution,
we should also stop. This indicates that the residue that is left is simply uniform
random noise.
The third criterion we track at each iteration is the sum-square-error (SSE)
we defined for the two methods. When the SSE is very small, we know that the
existing partitionings already cover most of the original space. Since there is no
residue component left, we should stop. In this work, when the ratio between the
first singular value of the original space and the current subspace is very small (i.e.,
less than 10%), we conclude that the SSE loss is very small compared to the original
variance of the data, hence we stop iterating.

³In our experiments, we set a large K_max such that the average number of data points in each
cluster is less than 1.5% of the total number of data samples.
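This singular-value-ratio check is straightforward to express in code. The sketch below is illustrative, assuming numpy; the function name and the default 10% threshold mirror the text but are otherwise our own.

```python
import numpy as np

def should_stop(X_orig, X_residual, tol=0.10):
    """Stop iterating when the leading singular value of the residual space
    drops below `tol` (10%) of the leading singular value of the original
    data, i.e. when the SSE left to explain is negligible."""
    s_orig = np.linalg.svd(X_orig, compute_uv=False)[0]
    s_res = np.linalg.svd(X_residual, compute_uv=False)[0]
    return s_res / s_orig < tol
```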
3.3.3 Case Studies for Synthetic Data II, Face and Text Data
The two approaches for finding non-redundant clustering views combined with
finding K and the stopping criteria, provide us with a completely automated frame-
work to solve the non-redundant multi-view clustering problem. To successfully
use this technique for mining interesting structures, the data itself needs to be
rich. For example, applying our proposed automated framework on the synthetic
data I, we will find K = 4 and stop at the first iteration. A similar situation holds
for the digit data. In this section, we will demonstrate the effectiveness of the auto-
mated framework for finding K and stopping the search for views on data that are
known to contain complex structures (such as, synthetic data II, face and the We-
bKB text data). In these experiments, we set the gap statistics parameter B = 50
for all data except the face data. Since the face data has a high dimensionality over
1000, we set B = 80 for a higher confidence in the obtained statistical significance
results.
Synthetic Data
In order to show if our approach can appropriately stop finding alternative views
when only noise is left, we modified the synthetic data II in our experiment by
adding two more dimensions, feature 5 and feature 6, with uniformly distributed
random noise. Figure 3.7 shows the entire procedure and the results. Sub-figures
a1-a3 show the data structures in the different subspaces {f1, f2}, {f3, f4} and
{f5, f6}, respectively. Sub-figures b1-b3 show the bar plot of
ΔGap(K) = Gap(K) − (Gap(K+1) − σ_{K+1}√(1 + 1/B)) for K from 1 to K_max = 4 for each
iteration. As we discussed in the previous section, the minimum K which leads to
ΔGap(K) > 0 is our optimal K. We observed that in iteration 1, K = 3 is the
first to satisfy ΔGap(K) > 0. Thus, the optimal K found by gap statistics is 3,
and the corresponding clustering revealed the three clusters in the feature 3 and 4
subspace. In iteration 2, the optimal K found is also three and we found the clusters
in the feature 1 and 2 subspace. For the third iteration, we obtained an optimal K
of one. We, thus, stopped correctly. In contrast, in the experiment shown in Section
3.2.1, without knowing K, the process continued to find a meaningless structure
in iteration 3. Note too that the gap statistics method for finding K was able to
discover the correct number of clusters for the two alternative views.
[Figure: scatter plots of the synthetic data in the {f1, f2}, {f3, f4} and {f5, f6}
subspaces (a1-a3); bar plots of ΔGap(K) for K = 1, ..., 4 in iterations 1-3 (b1-b3);
and the sum-square-error for each iteration (c).]

Figure 3.7: ΔGap and SSE results in each iteration for the Synthetic II data set.
(a. clustering in iteration1) (b. clustering in iteration2)
Figure 3.8: Different partitionings for the face data in different iterations.
Face Data
Face data is a rich data set containing various interesting partitionings. In the first
iteration, we run gap statistics for the original data set, and the optimal K is 14.
Then, in the second iteration, the optimal K found is four, corresponding
to the four poses. In the third iteration, by gap statistics, we find that
Gap(K) − (Gap(K+1) − σ_{K+1}√(1 + 1/B)) = 0.0121 > 0, with K = 1, B = 80.
That means only one cluster is left in the current residue space, indicating that
we should stop. Figure 3.8 displays the cluster means in each iteration by our
automated scheme.
Text Data
Finally, we study a text data, the WebKB data in particular. It is a complex data
set and has high dimensions. Rather than simply keeping 90% of the variance in
determining the number of dimensions to keep in PCA, we instead heuristically ex-
amined the singular values of the data to select the dimensions to keep. Figure 3.9a
presents a plot of the gap between consecutive singular values, i.e., s_i − s_{i+1}, of
this data. This figure reveals that the gap reaches its peak at s_3 − s_4 in iteration 1.
That means the 4th singular value drops abruptly. We, thus, project the data onto
the first three principal eigenvectors. By gap statistics, we estimate that the optimal
K is four, corresponding to the clusters course, faculty, project, student. In the second
iteration, we perform singular value decomposition again and find that the largest
decrease in the singular values occurs between the second and third singular values,
as shown in Figure 3.9b, and thus keep two dimensions. Gap statistics determined
the optimal K for the second iteration to be four corresponding to clusters based on
institution. These results agree well with the true structures of the data.
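This largest-consecutive-drop heuristic is easy to automate. The following is an illustrative numpy sketch, not the thesis code; the function name is ours, and we assume the data matrix is already zero-centered as elsewhere in this work.

```python
import numpy as np

def dims_by_sv_gap(X, max_check=20):
    """Keep i dimensions, where s_i - s_{i+1} is the largest drop between
    consecutive singular values (X assumed zero-centered)."""
    s = np.linalg.svd(X, compute_uv=False)[:max_check]
    drops = s[:-1] - s[1:]              # drops[i-1] = s_i - s_{i+1} (1-indexed)
    return int(np.argmax(drops)) + 1    # peak at s_3 - s_4 -> keep 3 dims
```

For instance, a matrix whose singular values fall as 10, 8, 6, 0.1, ... yields 3.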
In the third iteration, we do not find the optimal K within the range of 70.
This implies that the residue space has a distribution close to uniform noise and is
thus non-interesting. Hence, we stop after iteration 2. Table 3.6 shows the confusion
matrix we obtain in each iteration. We see that it coincides well with the original
clustering results generated on the true number of clusters.
[Figure: two plots of the consecutive singular-value drops s_i − s_{i+1} for the WebKB data.]

Figure 3.9: The drop in singular values, s_i − s_{i+1}. Left: the gap of consecutive
singular values in iteration 1. Right: the gap of consecutive singular values in iteration 2.
Table 3.6: Confusion Matrix for WebKB Data based on Gap Statistics

Iteration 1:
                C1    C2    C3    C4
  COURSE       111    17    99    17
  FACULTY        3    74    63    13
  PROJECT        2    45    29    10
  STUDENT       11    53   402    92

Iteration 2:
                C1    C2    C3    C4
  CORNELL       88    81    37    20
  TEXAS         63    83    71    35
  WASHINGTON    74    69   102    10
  WISCONSIN     42    89    40   137
3.4 Conclusions for Multi-View Clustering Methods
Given data, the goal of exploratory data analysis is to find interesting structures,
which may be multi-faceted by nature. Clustering is a popular tool for exploring
data. However, many clustering algorithms only find one single clustering solution.
Our main contribution in this paper is that we introduced a new paradigm for ex-
ploratory data clustering that seeks to extract all non-redundant clustering views
from a given set of data.
We presented a general framework for extracting multiple clustering views
from high dimensional data. In essence, this framework works by incorporating
orthogonality constraints into a clustering algorithm. In other words, the clustering
algorithm will search for a new clustering in a space that is orthogonal to what has
been covered by existing clustering solutions. We described two different ways for
introducing orthogonality and conducted a variety of experiments on both synthetic
and real-world benchmark data sets to evaluate these methods. Our results show
that our proposed framework was able to find the different substructures of the
data or different structures embedded in different subspaces in different views on
synthetic data. Similarly, on the benchmark data sets, our methods not only found
different clustering structures in different iterations, but also discovered clustering
structures that are sensible, judging from the various evaluation criteria reported
(such as, confusion matrices and scatter plots). For example, on the face data,
PCA+K-means identified individuals in the first view/iteration, and in the second
view/iteration, our methods discovered clusters that correspond nicely to different
facial poses. For the other data sets, we observed similar results in the sense that
different concept classes were identified in different views.
Furthermore, we presented a fully automated version of the proposed framework
by automatically estimating the number of clusters K in each iteration through
gap statistics, and by automatically determining when to stop searching for alternative
views based on the estimated K and the sum-square-error of the clustering
solution in each iteration. Experiments on synthetic and benchmark data showed
that the proposed framework stopped searching for alternative views appropriately,
when the residue space left was just noise.
Note that in this paper we use k-means and spherical k-means as the basic
clustering algorithms. However, the framework is not limited to these choices and can
be instantiated using any clustering algorithm. Future directions will be to explore
the framework with other clustering methods.
Chapter 4
Orthogonal Principal Feature Selection via Component Analysis
In this chapter, we present a feature selection algorithm based on principal component
analysis (PCA) and orthogonalization, which we call principal feature selection
(PFS). The non-redundant orthogonal framework we use in clustering described in
Chapter 3 was originally inspired by our study on this feature subset selection prob-
lem. In Section 4.1, we define our notations and provide a review on singular value
decomposition. In addition, we describe two ways of viewing a data matrix, which
will be useful in motivating our orthogonal feature selection approach. Then, we
present our orthogonal feature selection method via PCA in Section 4.2. In Sec-
tion 4.3, we explain the differences in goals between sparse principal component
analysis and PFS. We report our experimental results in Section 4.4, a discussion
of extending orthogonal feature selection to linear discriminant analysis in Section
4.5 and finally conclude in Section 4.6.
4.1 Background and Notations
In this section, we provide: the notations that will be used throughout this chapter,
a short background review on singular value decomposition (SVD), and a presen-
tation of the dual space representation of a data matrix, which will be illustrative in
motivating our approach.
4.1.1 Notations
Let X = [x_1 x_2 ··· x_N] = [f_1 f_2 ··· f_d]^T denote a set of N samples in R^d (i.e.,
x_i ∈ R^d) or d features in R^N (i.e., f_j ∈ R^N), where (·)^T denotes the transpose of
the matrix (·). span{·} represents the spanning space of (·). Note that x_i and f_j are
column vectors and X is of size d × N. X^{(t)} is the data matrix in the t-th iteration,
after t − 1 projections. f_i^{(t)} is the i-th feature, or i-th row, of X^{(t)}. f̂_i is the
projection of the i-th feature onto the subspace spanned by the selected features.
Without loss of generality, and to avoid cluttered notation, in this work we arrange
the features in the order they are selected. For example, we permute the first selected
f_i to the first row of X and call it f_1.
4.1.2 Background Review on SVD and Definition of Terms
PCA can be solved using singular value decomposition (SVD). Without loss of
generality, assume the data set X is a d × N zero-centered matrix, X = [f_1 ··· f_d]^T =
[x_1 ··· x_N]. The SVD [92] solution of X is:

X = U S V^T.

Accordingly, we have X v_k = s_k u_k, where u_k is the k-th column of U (also called
the k-th principal eigenvector, whose entries are the loadings of the principal
components), v_k is the k-th column of V (also called the k-th principal component)
and s_k is the k-th singular
(a) Uncorrelated in data space  (b) Uncorrelated in feature space
(c) Correlated in data space  (d) Correlated in feature space

Figure 4.1: Three uncorrelated data points in 2D space: (a) in the data space view,
and (b) in the feature space view. Three correlated data points in 2D space: (c) in
the data space view, and (d) in the feature space view.
value. Furthermore, v_k = X^T u_k / s_k: the k-th principal component (PC) is equal to
the projection of X onto the k-th principal eigenvector divided by s_k, and u_k contains
the loadings of all the variables in the k-th eigenvector.
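The relation v_k = X^T u_k / s_k follows directly from X = U S V^T and is easy to confirm numerically. The snippet below is a small numpy check with illustrative variable names, not part of the thesis itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
X -= X.mean(axis=1, keepdims=True)      # zero-center each feature (row)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# v_k = X^T u_k / s_k: the k-th principal component equals the projection of X
# onto the k-th principal eigenvector, scaled by 1/s_k.
for k in range(len(s)):
    assert np.allclose(X.T @ U[:, k] / s[k], Vt[k])
```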
4.1.3 Dual Space Representation of a Data Matrix and Statisti-
cal Correlation
To build our orthogonal principal feature selection algorithm, we need to look at
data in a dual space representation. Let us consider a data matrix X ∈ R^{d×N}, with
d features and N samples. Its columns can be viewed as data points, x_i ∈ R^d.
An alternative way of viewing X is as a scatter or configuration of d points in RN ,
with each axis associated with an individual data point and each of the d vectors
representing a variable, fj .
Statistical correlation between features is the non-random association between
two variables. Figure 4.1 illustrates this dependence with scatter plots of two
data sets that contain three data points in 2D space. In Figure 4.1a, the data points
are evenly spread out in the data space, which means the two features are not
correlated. In Figure 4.1c, the data points lie almost on a line in the data space,
which means the two features are highly correlated. This property can be clearly
viewed in the feature space, as shown in Figures 4.1b and 4.1d. Note that the original
features, f_i and f_j, may be statistically correlated even though they are shown as
orthogonal axes in Figure 4.1c (data space view). Two features are statistically
uncorrelated if ⟨f_i, f_j⟩ = 0, and they are shown as orthogonal vectors in Figure
4.1b (feature space view). The statistical correlation between features is defined
as corr(f_i, f_j) = cos(f_i, f_j) = ⟨f_i, f_j⟩ / (||f_i|| ||f_j||), assuming that the features are zero-centered.
Note that without loss of generality, we assume that X is zero-centered throughout
this chapter. We take advantage of this dual space view in describing and motivating
our approach.
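In code, this feature-space view of correlation is just the cosine between zero-centered feature vectors, and it matches the usual Pearson correlation. A small numpy sketch (the function name is ours):

```python
import numpy as np

def feature_corr(fi, fj):
    """corr(f_i, f_j) = cos(f_i, f_j) between zero-centered feature vectors."""
    fi = fi - fi.mean()
    fj = fj - fj.mean()
    return float(fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj)))
```

This agrees with `np.corrcoef(fi, fj)[0, 1]`.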
4.2 Feature Selection via PCA
Principal component analysis (PCA) is a popular tool for data analysis and dimen-
sionality reduction. Dimensionality reduction transformations can have a variety of
optimality criteria, and ten of them discussed in [46] lead to the principal compo-
nents solutions, which is one reason for the popularity of this analysis. We name
two of the most common criteria here. PCA finds linear combinations of the vari-
ables, the so-called principal components (PCs), which correspond to the subspace
of maximal variance in the data. This property is to keep the maximum “spread”
in the selected lower dimensional subspace. Another optimality criterion is to find
the linear transformation such that the sum-squared-error between the original data
and the predicted data is minimized. Given a data set X ∈ R^{d×N} of N observed
d-dimensional vectors x_i, PCA finds the linear transformation Y = A^T X, where
A ∈ R^{d×q}, q < d, that minimizes the sum-squared-error ||X − AY||², subject to
the constraint A^T A = I (the columns of A are orthonormal). The PCA solution is
A equal to the q dominant eigenvectors corresponding to the q largest eigenvalues
of the covariance matrix of X, Σ_x. The sum-squared-error is then Σ_{j=q+1}^{d} λ_j,
where the λ_j are the remaining eigenvalues.
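This error identity is easy to verify numerically: reconstruct the data from the q dominant eigenvectors and compare the squared error with the discarded eigenvalues. The snippet below is an illustrative check (using a 1/N covariance, so the error equals N times the discarded eigenvalue sum), not thesis code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 200))
X -= X.mean(axis=1, keepdims=True)       # zero-center each feature

cov = X @ X.T / X.shape[1]               # covariance (1/N convention)
lam, E = np.linalg.eigh(cov)             # eigenvalues in ascending order
A = E[:, -2:]                            # q = 2 dominant eigenvectors
Y = A.T @ X                              # projected data
sse = np.linalg.norm(X - A @ Y) ** 2

# ||X - AY||^2 equals the sum of the discarded eigenvalues (times N, since
# cov was scaled by 1/N).
assert np.isclose(sse, lam[:2].sum() * X.shape[1])
```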
4.2.1 PCA Orthogonal Feature Selection
The goal of PCA orthogonal feature selection is to find a subset of features that min-
imizes the sum-squared-error between the original data and the data represented by
the selected features, subject to the constraint that the selected features are as
uncorrelated as possible.¹ We would like to have an objective function that penalizes
both squared-error and correlation. Two features x and y are correlated if one can
linearly predict y from x and vice versa. We can, thus, incorporate the penalty for
correlation to the sum-squared-error criterion as follows: The error in feature selec-
tion is the error due to features not selected minus the portion of the features not se-
lected that can be linearly predicted (correlated) by the selected features (spanned
by the selected features). More formally, we can express the objective function as
presented below. Assume that, after permutation, f_1, ···, f_q are the q selected features,
the unselected features are f_{q+1}, ···, f_d, and f̂_{q+1}, ···, f̂_d are the projections of the
unselected features onto the space spanned by those selected. The sum-squared-error is:

SSE = ||[f_1, ···, f_d] − [f_1, ···, f_q, f̂_{q+1}, ···, f̂_d]||²
    = ||[f_{q+1}, ···, f_d] − [f̂_{q+1}, ···, f̂_d]||²    (4.1)

¹Note that we cannot constrain the selected features to be orthogonal because we are not allowed
to select transformed (or new) features, only a subset of the original.
Referring to the dual space representation in Figure 4.1, the goal of the or-
thogonal feature selection problem is to find the minimum subset of features that
spans the space of all feature vectors in the feature space view. Note that sum-
squared-error is calculated in the feature space. It is the sum of the squared distance
of those unselected features to the subspace spanned by those selected ones, as
shown in Figure 4.2. In the example displayed in Figure 4.2, there are three fea-
tures in all. Features 1 and 2 are selected; thus, the error e is the squared distance
from feature 3 to the plane spanned by features 1 and 2.
Figure 4.2: SSE between selected features and all the features.
4.2.2 Orthogonal Feature Search
There are several ways to search the feature subset space to optimize an objective
function as pointed out in the related work section, Section 2.2. Here, we apply
feature transformation (the solution from PCA) to perform our search. Let X ∈ R^{d×N}
be our original data and X^{(t)} be the remaining data orthogonal to the chosen
features. We set the iteration index t = 1, X^{(1)} = X, and the data is zero-centered. Figure 4.3
shows the framework of our feature selection process. Our method has three main
steps: (1) perform PCA on X^{(t)} to get the first eigenvector, (2) pick the feature most
correlated with the eigenvector, (3) project the data X^{(t)} onto the space orthogonal to
the chosen feature to get the residual space, X^{(t+1)}. We repeat these steps until the
number of features desired is obtained. We motivate each of these steps below.
Figure 4.3: The general framework for our feature selection process.
Step 1: PCA to get the first eigenvector. Our approach selects the features one
at a time. However, instead of looking at the effect of one feature at a time, PCA
provides a global view (i.e., takes feature interaction into account) on which feature
combination provides the largest variance (most relevant with respect to our objec-
tive function). In some sense, PCA projection performs some kind of “look-ahead”
in the feature search process. In the data space view, the first eigenvector, u1, is the
direction of largest spread (variance) among the data samples xi.
Step 2: After finding the largest eigenvector, which feature should we pick? We
select the feature which is most correlated with the largest eigenvector. In
this work, we call this selected feature the principal feature. A feature f_j is in R^N,
whereas the eigenvector u_1 is in R^d. What do we mean by correlation here? u_1
is actually a transformation that leads to an extracted new feature f_new = X^{(t)T} u_1
(the projection of X^{(t)} onto u_1), where f_new ∈ R^N. We select the feature f_j from
the original set in X which has the largest correlation with f_new (i.e., the feature
f_j closest to f_new in terms of cosine distance in the feature space view). In [1],
the authors select the feature with the largest loading (coefficient in the eigenvector) of
the largest eigenvector, whereas PFS utilizes correlation. Note that the feature with
the largest loading does not necessarily correspond to the feature with the highest
correlation with f_new. Proposition 1 below explains their relationship.
Proposition 1: The feature with the largest loading is not, in general, the feature
with the largest correlation with the projection of an eigenvector.
Maximizing the correlation is equivalent to maximizing the loading,
argmax_j corr(f_j, v_k) = argmax_j |α_j|, only if the features f_j have unit
norm, ||f_j|| = 1.
Proof: Let X = U S V^T be the SVD solution of X. For U = [u_1, ···, u_d], each
u_k = [α_1, ···, α_d]^T, where α_j, j ∈ {1, ···, d}, is the loading of the j-th variable
in the k-th eigenvector. Then, we get [f_1 ··· f_d]^T v_k = s_k [α_1, ···, α_d]^T
(i.e., [f_1^T v_k, ···, f_d^T v_k]^T = [s_k α_1, ···, s_k α_d]^T). Looking at it element-wise,
we have f_j^T v_k = s_k α_j. Therefore, if |α_j| is the maximum, |s_k α_j| is the
maximum, and |f_j^T v_k| is the maximum. The correlation between the j-th feature
and the k-th PC in the feature space is then

corr(f_j, v_k) = |f_j^T v_k| / (||f_j|| ||v_k||) = |s_k α_j| / ||f_j||,

because ||v_k|| = 1. Thus, if ||f_j|| = 1, i.e., feature j has unit norm, then
corr(f_j, v_k) = |f_j^T v_k| = |s_k α_j| is the maximum since |α_j| is the maximum.
Based on Proposition 1, we can speed up our correlation computation by
using the loading of a feature divided by the norm of that feature, |α_j| / ||f_j||, to
select features. In Section 4.2.3, Property 2, we prove that this property holds even
in the residual spaces, X^{(t+1)}.
Step 3: After selecting a feature, how do we reduce the search space? To
keep the features as uncorrelated as possible, we project the current data at time
t, X^{(t)}, onto the subspace orthogonal to the selected feature f_t^{(t)}. Here, f_t^{(t)} is the
component of the currently selected feature f_t in span{X^{(t)}}. This makes
X^{(t+1)} uncorrelated to all the features selected from time 1 to time t; refer to
Property 3 in Section 4.2.3.
Algorithm 3 Pseudo-code for orthogonal feature selection via PCA (PFS).

Inputs: The data matrix X ∈ R^{d×N}, and the number of features q to retain.
Outputs: The q selected features.
Pre-processing: Zero-center the data set.
Initialization: t = 1, X^{(1)} = X, Q_select = {}.

Step 1 (Find the first eigenvector): View the data set as N samples in R^d.
Perform SVD on X^{(t)} and find the principal component (PC) v_1^{(t)} with
the largest eigenvalue.

Step 2 (Select principal feature): Select the feature f_i from the original features
in X that correlates best with v_1^{(t)}. Add f_i to the set of selected features Q_select,
and remove f_i from the original set in X. The highest correlation corresponds to
the maximum absolute value of the loading in u_1^{(t)} of a feature divided by the norm
of that feature. If the norm of a feature is zero, we set the correlation to zero,
meaning that it is farthest from the PC.

Step 3 (Orthogonalize): View the data set as d features in R^N. Find the subspace
orthogonal to the variable selected in the current space, f_i^{(t)}, by
P^{(t)} = I − f_i^{(t)} f_i^{(t)T} / (f_i^{(t)T} f_i^{(t)}). Project the data set X^{(t)} onto that subspace:
X^{(t+1)} = X^{(t)} P^{(t)}; t = t + 1. Note that the i-th row of X^{(t+1)} is 0, and all the
rows corresponding to previously selected features remain 0.

Step 4 (Repeat): Repeat Steps 1-3 until q features are selected.
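Steps 1-3 translate almost line for line into numpy. The sketch below is an illustrative implementation of Algorithm 3 under the stated assumptions (zero-centered d × N input, rows as features); the function name is ours and this is not the thesis code.

```python
import numpy as np

def principal_feature_selection(X, q):
    """PFS sketch: X is d x N (rows are zero-centered features).
    Returns the indices of the q selected features, in selection order."""
    Xt = np.asarray(X, dtype=float).copy()
    selected = []
    for _ in range(q):
        # Step 1: leading left singular vector = loadings of the first PC.
        U, _, _ = np.linalg.svd(Xt, full_matrices=False)
        u1 = np.abs(U[:, 0])
        # Step 2: highest correlation = |loading| / feature norm (Proposition 1);
        # zero-norm features get score 0 (farthest from the PC).
        norms = np.linalg.norm(Xt, axis=1)
        scores = u1 / np.where(norms > 0, norms, np.inf)
        scores[selected] = -np.inf
        i = int(np.argmax(scores))
        selected.append(i)
        # Step 3: project every feature onto the subspace orthogonal to f_i^(t).
        fi = Xt[i]
        Xt = Xt - np.outer(Xt @ fi, fi) / (fi @ fi)
    return selected
```

On a toy matrix whose first two rows are identical, the second iteration skips the duplicate (its residual is zero) and picks the remaining informative feature.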
As we have pointed out in Section 4.1.3, uncorrelated features are orthogonal
in the feature space view of the dual space representation of a data matrix. It is
convenient to look at X^{(t)} in the feature space view, X^{(t)} = [f_1^{(t)} ··· f_d^{(t)}]^T.
To compute the subspace orthogonal to a feature f_i^{(t)}, we use the projection matrix
P_{⊥f_i^{(t)}} = I − f_i^{(t)} f_i^{(t)T} / (f_i^{(t)T} f_i^{(t)}). The residual feature space X^{(t+1)} is X^{(t)}
projected onto this orthogonal subspace, X^{(t+1)} = X^{(t)} P^{(t)}. This means we project
each of the remaining features, f_j^{(t)}, onto the subspace orthogonal to f_i^{(t)}. All the
previously selected features, including i, remain all zeros. Thus, we remove
the component of f_j^{(t)} that can be linearly predicted by f_i^{(t)}. What is left is the
residual X^{(t+1)} that cannot be linearly explained by f_i^{(t)}.
In the next iteration, we reapply PCA on X^{(t+1)} and select, from the original
set in X (excluding the features selected so far), the feature that correlates best with the
first principal component of X^{(t+1)}. We repeat the process until we have selected
q features or the error is smaller than a threshold. Our orthogonal feature selection
method via PCA is summarized in Algorithm 3.
4.2.3 Properties
Note that since we have already selected fi, we do not want features that are in the
spanning space of fi (correlated to fi). Orthogonal projection presented above will
solve this problem. Several nice properties of this approach are:
Property 1: Let f̃_2 be the projection of f_2 onto the null space of f_1; then
span{f_1, f_2} = span{f_1, f̃_2}.
Property 2: Correlation and Loading. We can utilize the loading divided by the
norm of the feature to speed up the correlation computation even in the residual
space, because f_i^T v_1^{(t)} = f_i^{(t)T} v_1^{(t)}. Since v_1^{(t)} and f_i^{(t)} are in the residual space,
which is orthogonal to the previously selected features, we have f_i − f_i^{(t)}
orthogonal to v_1^{(t)}, i.e., ⟨f_i − f_i^{(t)}, v_1^{(t)}⟩ = 0.
Property 3: X^{(t+1)} is uncorrelated to f_1, ···, f_t.

Proof: By mathematical induction.

1. After the first iteration, we pick the first feature f_1 and project the data onto
its orthogonal space to get X^{(2)}. Thus, ⟨X^{(2)}, f_1⟩ = 0.

2. For each t > 1, assume X^{(t)} is orthogonal to the currently selected feature
set f_1, f_2, ···, f_{t−1}, i.e., ⟨X^{(t)}, f_1⟩ = 0, ···, ⟨X^{(t)}, f_{t−1}⟩ = 0, and in the
current iteration f_t is selected.

After performing orthogonalization, as indicated in Step 3 of our algorithm,
we have X^{(t+1)} = X^{(t)} − X^{(t)} f_t^{(t)} f_t^{(t)T} / (f_t^{(t)T} f_t^{(t)}). For all 1 ≤ i < t, we have

⟨X^{(t+1)}, f_i⟩ = ⟨X^{(t)}, f_i⟩ − X^{(t)} f_t^{(t)} (f_t^{(t)T} f_i) / (f_t^{(t)T} f_t^{(t)}) = 0 − 0 = 0,

since f_t^{(t)} is just the t-th row (after permutation) of X^{(t)}, and is therefore
orthogonal to f_1, f_2, ···, f_{t−1}.

For i = t, ⟨X^{(t+1)}, f_t⟩ = ⟨X^{(t+1)}, Σ_{i=1}^{t−1} c_{ti} f_i + f_t^{(t)}⟩, where the c_{ti} are
constants, since f_t can be written as the part covered by the selected features,
which belongs to span{f_1, ···, f_{t−1}}, plus the residue part f_t^{(t)}. This leads to

⟨X^{(t+1)}, f_t⟩ = X^{(t)} f_t^{(t)} − X^{(t)} f_t^{(t)} (f_t^{(t)T} f_t^{(t)}) / (f_t^{(t)T} f_t^{(t)}) = X^{(t)} f_t^{(t)} − X^{(t)} f_t^{(t)} = 0.

Therefore, ⟨X^{(t+1)}, f_1⟩ = 0, ···, ⟨X^{(t+1)}, f_t⟩ = 0 (i.e., X^{(t+1)}
is uncorrelated to f_1, ···, f_t). The f_i^{(i)} form an orthogonal basis for the
selected features.
Property 4: Residual space has zero mean. After each projection, the data X^(t+1) still has zero mean.

Proof Without loss of generality, and to simplify notation, normalize the selected feature to unit norm, f_n = f/||f||. Then the projection matrix is P = I − f f^T/||f||^2 = I − f_n f_n^T. X has zero mean, i.e., Σ_{k=1}^N x_k = 0. Now Y = XP = X − (X f_n) f_n^T = X − [(f_1^T f_n) f_n, ..., (f_d^T f_n) f_n]^T, so

\[ \sum_{k=1}^{N} y_k = \sum_{k=1}^{N} x_k - \sum_{k=1}^{N} f_{nk} (X f_n) = 0 - 0 \cdot (X f_n) = 0, \]

where f_{nk} is the k-th element of the N-by-1 vector f_n; the elements of f_n sum to zero because the features themselves have zero mean.
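Properties 3 and 4 can be checked numerically on toy data (our own small example, not from the thesis):

```python
import numpy as np

# After projecting zero-mean data onto the null space of a selected feature,
# the residual is orthogonal to that feature (Property 3) and every residual
# feature still has zero mean (Property 4).
np.random.seed(1)
X = np.random.randn(6, 40)
X = X - X.mean(axis=1, keepdims=True)      # zero-mean features (rows)

f1 = X[2]                                  # suppose feature 2 was selected
X2 = X - np.outer(X @ f1, f1) / (f1 @ f1)  # orthogonal projection step

assert np.allclose(X2 @ f1, 0)             # Property 3: residual is uncorrelated to f1
assert np.allclose(X2.mean(axis=1), 0)     # Property 4: zero mean preserved
```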
Property 5: Retained Variance. Similar to PCA, where the variance retained is the sum of the eigenvalues corresponding to the eigenvectors that are kept, in PFS the variance retained is the sum of the variances of each feature when projected onto the space in which that feature is selected (due to the orthogonality of the spaces):

\[ RetainedVariance = \sum_{t=1}^{q} var(f_t^{(t)}), \]

where var(·) is the variance and f_t^{(t)} is the feature f_t in the orthogonal space span{X^{(t)}}. In fact, {f_1^{(1)}, f_2^{(2)}, ..., f_q^{(q)}} form an orthogonal basis that spans the selected q original variables. One can use the desired proportion of variance to select the number of features to retain, as in conventional PCA.
Property 6: Convergence. SSE^(t) ≥ SSE^(t+1), and SSE is bounded below by zero. Thus, the algorithm is guaranteed to converge to a local minimum.

Proof Since the SSE is a sum of squared errors, it is greater than or equal to zero. Let the f_i be the original features sorted in the order of our selection, and normalize each orthogonal basis vector obtained by PFS, {f_1^{(1)}, ..., f_q^{(q)}}, to obtain an orthonormal basis f̂_1, ..., f̂_q with f̂_1 = f_1^{(1)}/||f_1^{(1)}|| and so on. Then, at iterations t and t + 1,

\[ SSE^{(t)} = \sum_{j=t+1}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t} \langle f_j, \hat f_i \rangle^2 \Big\}, \qquad SSE^{(t+1)} = \sum_{j=t+2}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t+1} \langle f_j, \hat f_i \rangle^2 \Big\}, \]

where the projection of f_j onto f̂_i is ⟨f_j, f̂_i⟩ since f̂_i has unit length. Hence

\[ SSE^{(t)} - SSE^{(t+1)} = \Big( \|f_{t+1}\|^2 - \sum_{i=1}^{t} \langle f_{t+1}, \hat f_i \rangle^2 \Big) + \sum_{j=t+2}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t} \langle f_j, \hat f_i \rangle^2 \Big\} - \sum_{j=t+2}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t} \langle f_j, \hat f_i \rangle^2 - \langle f_j, \hat f_{t+1} \rangle^2 \Big\} \]
\[ = \underbrace{\|f_{t+1}\|^2 - \sum_{i=1}^{t} \langle f_{t+1}, \hat f_i \rangle^2}_{\ge 0} + \underbrace{\sum_{j=t+2}^{d} \langle f_j, \hat f_{t+1} \rangle^2}_{\ge 0} \;\ge\; 0, \]

where the first term is non-negative by Bessel's inequality.
4.2.4 Illustrative Example
In order to clearly show how we perform our algorithm, in this section, we describe
a simple illustrative example on the Iris data from the UCI repository [82]. Iris is
a small data set with 4 features and 150 data points. Figure 4.4 demonstrates the
whole feature selection process. Starting from the upper left, we show the Iris data set X, whose rows are the features f_1 to f_4 and whose columns are the data points x_1 to x_150.
We initialize by setting X^(1) = X − µ, i.e., X with its mean µ subtracted. In iteration 1, we perform SVD on X^(1). We obtain the loading α_j of each feature on the first PC, v_1, and compute |α_j|/||f_j||. We can see that f_3 has the maximum value; thus, we select f_3 in the first iteration. After f_3 is selected, we project our data onto the space orthogonal to f_3, which gives us X^(2). In X^(2), the third row goes to zero.
In iteration 2, we repeat the process. We first perform SVD and compute |α_j^(2)|/||f_j^(2)||. This time f_1^(2) has the maximum value and should be selected. Then X^(2) is projected onto the orthogonal space of f_1^(2) to obtain X^(3). Now, the first
Figure 4.4: A simple illustrative example of PFS.
row in X(3) goes to zero. Note that the third row in X(3) remains zero. Then f1
is added to our selected feature set. The SSE for each step is shown at the bottom.
When all four features are selected, the SSE converges to zero. We repeat this process, selecting features one by one, until the SSE reaches zero or the desired number of features q is selected.
Observe that the residual data matrix X^(t) has zero rows corresponding to the t − 1 selected features. In our implementation, instead of keeping these zero rows, we simply remove them for simplicity and computational efficiency.
In this figure, we also provide the correlation of each feature with the first PC to illustrate Property 2 and Proposition 1. Observe that the correlation is equal to |α_j|/||f_j|| times a constant s (the corresponding singular value of the first PC). This shows that selecting the feature that maximizes |α_j|/||f_j|| is equivalent to selecting the feature that maximizes the correlation, and hence the former can be used to speed up our calculations.
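This equivalence can be verified numerically on toy data (our own small example):

```python
import numpy as np

# For zero-mean X = U S V^T, the correlation of feature j with the first PC
# scores v1 is (f_j . v1)/||f_j|| = s1 * u1[j] / ||f_j||, so ranking features
# by |loading|/norm equals ranking by |correlation|.
np.random.seed(2)
X = np.random.randn(4, 100)
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
u1, v1, s1 = U[:, 0], Vt[0], s[0]       # loadings, PC scores, singular value

for j in range(X.shape[0]):
    corr = (X[j] @ v1) / np.linalg.norm(X[j])   # v1 has unit norm and zero mean
    assert np.isclose(corr, s1 * u1[j] / np.linalg.norm(X[j]))
```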
4.3 Sparse Principal Component Analysis (SPCA) and PFS
In this section, we explain why SPCA does not exactly perform global feature selection; its goal is different from ours. The goal in SPCA is to make the PCs
interpretable by finding sparse loadings. Sparsity allows one to determine which features explain which PCs. PFS, on the other hand, takes a direct approach to
feature selection: our goal is to determine which set of q original features best captures the variance of the data while at the same time being as non-redundant as possible.
The SPCA criterion is as follows:

\[ (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 + \lambda \sum_{j=1}^{k} \| \beta_j \|^2 + \sum_{j=1}^{k} \lambda_{1,j} \| \beta_j \|_1 \quad \text{s.t.} \quad \alpha^T \alpha = I_k, \tag{4.2} \]

where k is the number of principal components, I_k is the k × k identity matrix, λ controls the weight on the ridge-regression (2-norm) penalty [93], and the λ_{1,j} are weight parameters that control the sparsity. α and β are d × k matrices, with β_i ∝ u_i for i = 1, ..., k.
SPCA is a good way to sparsify the loadings of principal components or
determine which features correspond to each PC; however, it is not appropriate for global feature selection (i.e., finding a set of features to represent all the PCs). To illustrate this, we provide an example using the Glass data from the UCI repository
[94]. Glass data has a total of nine features and 214 samples. Table 4.1 presents
the PC loadings obtained by applying SPCA with a sparsity of five and the number
of PCs, k, set to two. Note that although each PC has sparse loadings, every feature has a non-zero loading in at least one of the two PCs. In this case, all features are still
needed and no reduction in features is achieved. SPCA has a tendency to spread the
non-zero loadings to different features in different PCs because the sparse PCs are
constrained to be orthogonal.
Table 4.1: PC Loadings Applied to Glass Data Using SPCA
LOADINGS     PC1     PC2
FEATURE 1    0       -0.76
FEATURE 2    0.35    0
FEATURE 3    -0.60   0
FEATURE 4    0.47    0
FEATURE 5    0       0.03
FEATURE 6    0       0.11
FEATURE 7    0       -0.64
FEATURE 8    0.53    0
FEATURE 9    -0.14   -0.07
From Table 4.1, it is not clear how one can utilize SPCA to select features.
Let us say we wish to retain two features. Which two should we keep: features 3 and 8, based on the loadings in PC1, or features 3 and 1, the top loadings of PC1 and PC2 respectively? Another complication is that in SPCA one can tweak
the sparsity parameter and the number of components to keep. Changing those
parameters modifies the loadings and the features with the non-zero loadings as
shown in Table 4.2, where sparsity is set to two and the number of PCs is set to
nine.
4.4 Experiments
The goal of this section is to investigate the performance of orthogonal principal feature selection (PFS) against other methods. We examine whether or not orthogonalization helps in achieving the PCA objective, and study PFS as a search technique. We describe the data used in our experiments in Subsection 4.4.1 and the competing methods in Subsection 4.4.2, and present and discuss the results in Subsection 4.4.3. In addition, we provide a time complexity analysis in Subsection
Table 4.2: Example: SPCA Confusion for Feature Selection
LOADINGS     PC1     PC2    PC3    PC4    PC5    PC6    PC7     PC8     PC9
FEATURE 1    0       0      -0.01  -0.01  0      0      0       -1.0    0.04
FEATURE 2    0       0      0      0      -1.0   0      0       0       0
FEATURE 3    -0.98   0      0      0      0      0      0       0       0
FEATURE 4    0       0      0      0      0      0      1.0     0       0
FEATURE 5    0       0      0      1.0    0      0      0       0       0
FEATURE 6    0       -1     0      0      0      0      0       0       0
FEATURE 7    0.19    0.07   0      0      0.1    0.05   -0.04   -0.03   1.0
FEATURE 8    0       0      0      0      0      -1.0   0       0       0
FEATURE 9    0       0      -1.0   0      0      0      0       0       0
4.4.4.
4.4.1 Data
We investigate the performance of orthogonal principal feature selection (PFS) on
five real-world datasets: chart, HRCT, face, 20 mini-newsgroups and gene microarray data. The chart data is from the UCI repository [95], with six classes, 60 features and 600 instances. The face data, from the UCI KDD repository [83], consists of 640 face images of 20 people; each person has 32 images with image resolution 32 × 30. We remove the missing data to form a 960 × 624 data matrix. HRCT is a high-resolution computed tomography lung image (HRCT-lung) data set [96] with eight disease classes, 1545 instances and 183 features. The mini-newsgroups data comes from the UCI KDD repository, with 2,000 documents from 20 categories and 4,374 words as features. The last data set we studied is the gene data for lung cancer from http://sdmc.lit.org.sg/GEDatasets/Datasets.html; it contains 327 instances with a dimensionality of 12,558.
4.4.2 Methods
We compare our approach to the simple threshold method (which simply ranks
features based on variance and keeps those with variances larger than a threshold)
and two eigenvector-loading-based methods by Jolliffe [1]: (1) Jolliffe i, which iteratively computes the first eigenvector and selects the feature with the largest loading, and (2) Jolliffe ni, which computes all the eigenvectors at once and selects the q features corresponding to the largest loadings in the first q eigenvectors. Jolliffe i is similar to PFS, except that it does not take the statistical correlation between features into consideration. Jolliffe ni computes all the PCA eigenvectors at once, thus implicitly taking
correlation into account. Furthermore, we compare our PFS method to SPCA. We
set SPCA with sparsity equal to q (the number of features to be selected) and set
the number of PCs to one (to avoid ambiguities). This version of feature selection
will be aggressive in selecting features that maximize variance. The other extreme
is to keep q PCs and sparsity equal to one. This version will be aggressive in removing redundancy and provides results similar to Jolliffe ni. It is not clear how to select features with SPCA in a way that lies between these two extremes. In these
experiments, we test the importance of the orthogonality constraint.
Besides loading-based methods, we also compare our PFS with sequential forward search (SFS) applied to our SSE objective function in Eqn. 4.1, which takes both error and correlation into account. SFS is a greedy subset search technique that finds the single feature that, when combined with the current subset, minimizes our SSE objective function; it starts with the empty set and sequentially adds one feature at a time. This comparison tests PFS as a search technique. In addition,
we compare with Mao’s least squares estimate (LSE) method [51], which is an improved, fast version of Krzanowski’s Procrustes approach [50]. We apply the LSE-based forward selection method and call it LSE-fw. Furthermore, we also compare to principal feature analysis (PFA) [52], which performs Kmeans clustering on the
loadings of the first m PCs. In our experiments, we set the number of clusters equal to q (the number of features to keep); m is set so as to retain 90% of the variance captured by the first q PCs, which typically results in m between q − 5 and q − 1. We also run Kmeans with 10 random starts and pick the best solution in terms of minimum SSE, as indicated in their work.
For all methods, we calculate the sum-squared-error defined in Equation 4.1.
Essentially, we measure how much the selected features span the original feature
space.
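This evaluation can be computed directly as the least-squares residual of regressing every (centered) feature on the selected subset. The sketch below is our own reading of Eqn. 4.1, with our own function name:

```python
import numpy as np

def sse(X, selected):
    """SSE sketch: squared residual of reconstructing every zero-mean
    feature of X (d x N, rows = features) from the selected subset."""
    Xc = X - X.mean(axis=1, keepdims=True)
    F = Xc[selected]                     # q x N matrix of selected features
    P = F.T @ np.linalg.pinv(F.T)        # N x N projector onto span of selected features
    R = Xc - Xc @ P                      # component of each feature outside that span
    return float(np.sum(R ** 2))
```

Selecting all features drives the SSE to zero, and adding a feature can never increase it, matching the behavior described for Fig. 4.5.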
4.4.3 Results and Discussion
Fig. 4.5 shows the SSE, as defined in Eqn. 4.1, for all datasets and all eight methods. We also show the retained variance for the HRCT data in the top-right sub-figure of Fig. 4.5. The retained variance measures how much the selected features span the original feature space; it equals the total variance minus the error. Thus, the smaller the SSE the better, and the larger the retained variance the better. Because the two plots present redundant information, we only present the SSE plots for the other data sets; their retained-variance curves would simply be the mirror image of the SSE plots. We observe that the SSE between all features and the features selected by our PFS is consistently smaller than that of simple threshold, Jolliffe i, Jolliffe ni, SPCA and PFA, for any number of retained features and for all datasets.
PFS vs. simple threshold, Jolliffe i and Jolliffe ni Simple thresholding is an individual search method that considers features one at a time. PFS takes feature interaction into account through the PCA transformation and thus achieves lower SSE. Neither simple threshold nor Jolliffe i takes the correlation between features into account; without a redundancy-removal scheme, they have higher SSE than our PFS. Jolliffe ni takes correlation into account indirectly through the orthogonality constraint among the PCs. PFS addresses redundancy directly by orthogonalizing the data with respect to the selected features, which leads to lower SSE than Jolliffe ni.
PFS vs. SPCA SPCA has SSE values between those of Jolliffe i and Jolliffe ni. As mentioned in the previous section, SPCA does not exactly perform global feature selection; it cannot explicitly tell which features are important overall because it has a different objective. SPCA tries to obtain sparse loadings to interpret the dominant PCs, whereas our objective is to find the features that capture the most variance of the data while being as uncorrelated as possible. This leads PFS to a smaller SSE than SPCA.
PFS vs. PFA PFA, using a clustering approach, also tries to select uncorrelated features, as our method does. However, keeping only the first m PCs (with m slightly smaller than q, as indicated in PFA) and performing Kmeans clustering on their loadings is not accurate. We can see that when q is relatively small, the first several features selected by PFA lead to high SSE. In addition, the use of Kmeans clustering adds more uncertainty to the feature partitioning because of its unstable initialization.
PFS vs. LSE-fw and SFS Note that PFS optimizes the SSE through PCA. We compare the performance of PFS against SFS (sequential forward search), which directly optimizes the SSE by adding the single best feature to the current selected feature set at each step. Similarly, LSE-fw performs a sequential forward search and adds the best feature at every step to minimize the least squares error between the selected features and a Procrustes transformation of the reduced PCA space. Fig. 4.5 shows that our method, plotted as a solid black line, achieves the same small SSE as SFS and LSE-fw. However, LSE-fw and SFS are very slow in picking the desired number of features. In general, they are 10^2 to 10^3 times slower than the other
Table 4.3: Computational Complexity Analysis

METHODS      SIMPLETHRESH     JOLLIFFE I        JOLLIFFE NI      SPCA
COMPLEXITY   qNd              max(N, qK)d^2     max(N, d)d^2     max(N, d)d^2

METHODS      PFA              LSE-FW            SFS              PFS
COMPLEXITY   max(Nd, Kmq)d    q^2 Ndp           max(N, q)qd^3    max(N, qK)d^2
methods on the data we used in our experiments. In particular, for the gene data with a dimensionality of more than 12,000, SFS ran out of memory immediately, and LSE-fw ran for a week before also running out of memory. All our experiments were run on a Pentium 4 computer with 512 megabytes of RAM. These two methods are not practical for very large data sets. A detailed complexity analysis of all the methods is given in the next section.
In summary, PFS performs as well as the slower SFS and LSE-fw feature selection methods in terms of SSE reconstruction error and retained variance, and consistently better than the other transformation-based methods (Jolliffe i, Jolliffe ni, PFA and SPCA) and simple thresholding.
4.4.4 Time Complexity Analysis
In this section, we discuss the time complexity of each method for selecting q features. For our PFS method, we first compute the sample covariance matrix, O(Nd^2). Then, in step 1, we find the first eigenvector, which takes O(Kd^2) by the power method [92], where K is the number of iterations to converge; K is usually small compared to N. In the second step, we perform the feature search, which takes O(Nd + d). For step 3, performing the orthogonal projection is equivalent to updating the covariance matrix,

\[ \Sigma_{X^{(t+1)}} = \Sigma_{X^{(t)}} - \frac{(X^{(t)} f_i^{(t)})(f_i^{(t)T} X^{(t)T})}{\|f_i^{(t)}\|^2}, \]

because, using the fact that P^{(t)} is symmetric and idempotent,

\[ \Sigma_{X^{(t+1)}} = X^{(t+1)} X^{(t+1)T} = X^{(t)} P^{(t)} P^{(t)T} X^{(t)T} = X^{(t)} \Big( I - \frac{f_i^{(t)} f_i^{(t)T}}{\|f_i^{(t)}\|^2} \Big) X^{(t)T} = \Sigma_{X^{(t)}} - \frac{(X^{(t)} f_i^{(t)})(f_i^{(t)T} X^{(t)T})}{\|f_i^{(t)}\|^2}. \]

This procedure takes O(d^2 + Nd) time. We repeat steps 1 to 3 until q features are selected, so the total time needed is O(Nd^2 + qKd^2) = O(max(N, qK)d^2).
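The rank-one covariance update above can be verified numerically on toy data (our own example):

```python
import numpy as np

# The explicit projection X P and the rank-one covariance update give the
# same residual covariance, since P is symmetric and idempotent.
np.random.seed(5)
X = np.random.randn(5, 60)
X = X - X.mean(axis=1, keepdims=True)
C = X @ X.T                                  # (unnormalized) covariance

f = X[3]                                     # feature selected at this step
Xn = X - np.outer(X @ f, f) / (f @ f)        # explicit orthogonal projection
Cn = Xn @ Xn.T

Cu = C - np.outer(X @ f, X @ f) / (f @ f)    # rank-one update, no new data matrix
assert np.allclose(Cn, Cu)
```

This is why step 3 costs only O(d^2 + Nd) instead of re-forming the projected data and recomputing the full covariance.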
For comparison, we also show the time complexity of the other methods. The Jolliffe i method has a cost similar to our PFS: compute the covariance matrix first and then find the first eigenvector; if f_i is selected, remove the i-th column and row of the covariance matrix and repeat the process. The time complexity is O(Nd^2 + qKd^2) = O(max(N, qK)d^2). For the Jolliffe ni method, only one SVD is performed on the covariance matrix, so the time cost is O(Nd^2 + d^3) = O(max(N, d)d^2). For the SPCA method, when d > N, the biggest cost is running SPCA to obtain the first q PCs, which takes O(Nd lg(d) + p^2 d) time, where p is the size of the sparse basis found; since p/d → 0, the time cost is o(d^3). If N > d, SPCA takes O(Nd^2). It is therefore similar to Jolliffe ni, i.e., O(max(N, d)d^2). The major cost for PFA, besides computing the original covariance matrix, is the clustering; thus the time cost is O(Nd^2 + Kmqd), where m is the number of PCs retained and K is the number of iterations needed for the Kmeans clustering to converge. Mao's LSE-fw method requires estimating the least squares estimate (LSE) model in each iteration and takes O(q^2 Ndp) time, where p is the number of PCs the user wants to retain. Keeping more PCs (p close to d) results in a more accurate LSE model and smaller SSE but, as a tradeoff, a higher time cost. The SFS method has a time cost of O(max(N, q)qd^3), which is the slowest. Thus, it is notable that PFS performs as well as SFS and LSE-fw in terms of SSE while being much faster, as shown in Table 4.3, which summarizes the time complexity of each method.
In Table 4.4, we provide the actual time in seconds that the different methods need to select q features covering 90% of the total variance of the original data. Note that in these experiments we did not optimize the computation of the first eigenvector, which could speed up PFS and Jolliffe i. The results show that Jolliffe i and our PFS have similar speeds. Jolliffe ni and SPCA are similar to each other and faster. PFA
Table 4.4: Computational Time in Seconds

TIME (S)       HRCT        CHART      FACE        TEXT        GENE
               (q = 126)   (q = 42)   (q = 140)   (q = 315)   (q = 312)
SIMPLETHRESH   0.05        0.02       0.3         0.2         2.8
JOLLIFFE I     25.07       1.14       75.6        182.8       1231.05
JOLLIFFE NI    4.12        0.24       10.48       37.05       284.98
SPCA           5.09        0.2        22.61       42.11       330.78
PFA            5.4         0.58       24.95       44.72       807.18
LSE-FW         409.69      3.53       3.33E+04    2.52E+04    N/A
SFS            980.13      9.13       1.23E+05    8.13E+05    N/A
PFS            26.4        1.18       82.7        251.4       1612.05
is in between, while LSE-fw and SFS are very slow.
Even though SPCA and PFA are faster than PFS, they must be re-run from scratch whenever a different number of features is needed. In contrast, because PFS is a sequential method, a single run yields the optimal feature sets for every size from 1 up to q. For example, given 312 features selected by our PFS, if we only need 200 features, we can simply take the first 200 features in the selected set. Conversely, if we start with 200 features and later need 312, we can continue the process from the residual space after the first 200 features. SPCA and PFA, however, would need to re-run the whole experiment.
In our experiment with the gene data, we select 312 features out of 12,558, which covers 90% of the total variance using only about 2.5% of the original features. To obtain a relatively accurate LSE model, we need at least p = q. Thus, the time needed by LSE-fw is at least on the order of 10^6 seconds even with the smallest p, about 1,000 times that of our PFS, which runs on the order of 10^3 seconds. For SFS, the gap is a factor of qd ≈ 10^6 relative to our PFS method. In practice, SFS and LSE-fw ran out of memory and could not produce results for the gene data.
4.5 Extension to Linear Discriminant Analysis (LDA)
PCA reduces the dimensionality in an unsupervised fashion. When class labels are available, one wishes instead to find a reduced space in which the classes are well separated. In this section, we discuss a possible extension of our PFS method to the LDA case.
A popular supervised dimensionality reduction method is linear discriminant analysis (LDA) [2]. LDA computes the optimal transformation that simultaneously minimizes the within-class scatter and maximizes the between-class scatter. The goal in discriminant feature selection is to select the subset of features that maximizes J = trace(S_m^{-1} S_b), subject to the constraint that the features are as "uncorrelated" as possible. The relationship between PCA and LDA has been studied in [2, 97, 98]. Thus, we can express the LDA problem as a data representation problem similar to PCA, where the data to be compressed are represented by their class means. We form the matrix M consisting of the means of each class, M = [M_1, ..., M_L]. To solve the LDA optimization problem based on the trace(S_m^{-1} S_b) criterion, we form a matrix D ∈ R^{d×L}, D = [D_1 D_2 ... D_L], with D_i = √(n_i) S_m^{-1/2} M_i, where L is the number of labeled classes and n_i the number of samples in class i. The solution of LDA is then obtained by performing PCA on D, treating each D_i as a data point. We have thus transformed the LDA problem into a PCA problem on the normalized class-mean matrix D. We can now apply a technique similar to that in Section 4.2, except that D is not in the original feature space. Note that D is simply a rotated, normalized version of M under the transformation matrix S_m^{-1/2}, and M is in the original feature space. Since M is in the original feature space, we need to select the feature from M that correlates best with the first principal component of D. Our new objective function for LDA becomes:

\[ SSE_D = \| [\hat f_1^D, \cdots, \hat f_q^D] - [f_1^D, \cdots, f_q^D] \|^2 \tag{4.3} \]

where \hat f_i^D is the projection of f_i^D onto the subspace spanned by the selected f_i^M's, i.e., span{f_1^{M(1)}, f_2^{M(2)}, ..., f_q^{M(q)}}. SSE_D measures how well the selected features f_i^M span the space span{D}: the more the selected features span span{D}, the smaller the SSE.
Based on the above discussion, the same technique can be applied to span{D} to obtain a non-redundant feature set that best separates the classes. However, instead of keeping the feature with the largest normalized loading on the first PC of the residual space, we keep the feature in M (in the original feature space) that correlates best with the first PC of the residual space of D.
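One step of this selection can be sketched as follows. This is our own rough illustration under simplifying assumptions (we use the mixture scatter of the centered data for S_m and a regularized eigendecomposition for S_m^{-1/2}); it is not the thesis implementation:

```python
import numpy as np

def lda_pfs_step(X, y):
    """One hedged sketch of the LDA-based selection step: build
    D_i = sqrt(n_i) * S_m^{-1/2} M_i, take the first PC of D, and pick the
    original-space feature (row of M) most correlated with it.

    X : d x N data (rows = features), y : length-N class labels.
    """
    classes = np.unique(y)
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    Sm = Xc @ Xc.T / X.shape[1]                       # mixture scatter (assumption)
    M = np.stack([X[:, y == c].mean(axis=1) - mu[:, 0] for c in classes], axis=1)
    n = np.array([(y == c).sum() for c in classes])
    w, V = np.linalg.eigh(Sm)                         # S_m^{-1/2} via eigendecomposition
    Sm_inv_half = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    D = Sm_inv_half @ (M * np.sqrt(n))                # normalized class-mean matrix
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    pc = Vt[0]                                        # first PC scores of D
    corr = np.abs(M @ pc) / (np.linalg.norm(M, axis=1) + 1e-12)
    return int(np.argmax(corr))
```

Subsequent features would then be chosen from the residual space of D, as in Section 4.2.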
This section provides an example of how to extend orthogonal feature search via a transformation method to LDA. However, a detailed analysis of orthogonal feature selection via LDA as a supervised feature selection approach is outside the scope of this work and is a topic for future research.
4.6 Conclusion for Principal Feature Selection
We have developed an orthogonal feature selection algorithm: principal feature
selection (PFS). This algorithm selects features based on the results from transfor-
mation approaches, where transformation serves as a search technique to find the
direction that optimizes our objective function. At the same time, we incorporate
orthogonalization to remove redundancy.
The resulting feature selection algorithm, PFS, is as simple to implement as PCA; it obtains the principal features sequentially (analogous to the PCs), together with their corresponding non-redundant variance contributions (analogous to the eigenvalues) with respect to the previously selected features, as a by-product of the approach. With these similar and important properties, we hope PFS can become as widely applied
as its transformation-based counterpart. Experiments show that PFS is consistently closer to the optimal SSE than the loading-based approaches that do not take orthogonality into account. The experiments also show that PFS provides a good compromise as a search technique between sequential forward search and individual search with respect to speed and SSE: its speed is close to that of the faster individual search, while its SSE values are almost the same as those of sequential forward search and consistently the best among the other methods.
There exist several criteria for evaluating features in the feature selection literature. Here, we explored the criterion optimized by the popular transformation method, PCA. We hope that this work inspires future research on taking advantage of continuous-space transformations to improve the search for optimal solutions of the combinatorial feature selection problem.
Figure 4.5: SSE and retained variance for the HRCT data (top row); SSE for the chart, face, 20 mini-newsgroups and gene data. Each figure plots the SSE curves for the eight methods: simple threshold (blue, 'v'), PFA (green, '^'), SPCA (red, 'o'), Jolliffe i (light blue, '*'), Jolliffe ni (purple, '.'), SFS (yellow, '+'), LSE-fw (grey, 'x') and our PFS (black, solid).
Chapter 5
Robust Fluoroscopic Respiratory Gating for Lung Cancer Radiotherapy without Implanted Fiducial Markers
In this chapter, we investigate machine learning algorithms for markerless gated
radiotherapy with fluoroscopic images. The framework of the proposed clinical
treatment procedure is shown in Figure 5.1. We start by describing the gating problem and how the data are acquired and pre-processed in Section 5.1. Section 5.2 presents a detailed description of our clustering ensemble template matching method. Section 5.3 re-frames gating as a classification problem and provides a solution through the support vector machine (SVM). To test our algorithms, Section 5.4 presents the evaluation metrics and validation results on five patient datasets. Finally, in Section 5.5, we conclude and outline future directions.
Figure 5.1: Block diagram showing the process of the proposed clinical procedure for generating the gating signal.
5.1 Data Acquisition and Pre-Processing
In this section, we describe how the data are acquired and pre-processed, the first two components of the patient set-up stage in Figure 5.1.
5.1.1 Image Acquisition
In this study, the raw fluoroscopic image data come from the Integrated Radiotherapy Imaging System (IRIS) [99], which consists of two pairs of gantry-mounted diagnostic x-ray tubes and flat-panel imagers, shown in Figure 5.2. The system can acquire pairs of real-time orthogonal fluoroscopic images for lung tumor tracking.
5.1.2 Building Training Data
Before treatment, a sequence of orthogonal fluoroscopic images (approximately ten seconds long in our experiments) is taken and used for patient set-up as training images. The tumor position in the gating window, where the treatment beam should be turned on, is identified either manually by a clinician or automatically by matching digitally reconstructed radiographs (DRRs) from the simulation 4D CT scan
Figure 5.2: The Integrated Radiotherapy Imaging System (IRIS), used as the hardware platform for the proposed gating technology in this chapter.
[100]. In this investigation, the images were manually contoured. A rectangular
region of interest (ROI) is then created in those images (see Figure 5.3). This ROI
is set to be large enough to contain tumor motion in the training period.
Lung tumors move primarily due to the patient's breathing. Typically, the gating window is set at the end-of-exhale (EOE) phase of the breathing cycle because of its longer duration and better stability. The top left plot of Figure 5.4 illustrates the tumor motion in the up-and-down direction as a function of time. Motion in the left-and-right direction is very small (1-2 mm) compared to motion in the up-and-down direction (13-18 mm). The tumor position is taken as the centroid of the manually contoured region. Tumors in the lower positions correspond to the exhale phase, and those in the higher positions correspond to the inhale phase. A measure of radiation treatment efficiency is the gating duty cycle: the total time the beam is turned ON divided by the total time (beam ON plus beam OFF). Assuming that the desired gating duty cycle is given (in our experiments we set it at 35% and
Figure 5.3: Tumor contour and the region of interest (ROI). Left: original fluoroscopic image. Right: motion-enhanced image.
50%, which are typically used in radiotherapy), a corresponding threshold can be determined to define the gating window, shown as the horizontal line in the top left plot of Figure 5.4. All the images in the gating window, i.e., those with tumor locations below this threshold, are labeled as EOE images.
5.1.3 Pre-Processing
We pre-process our images by first applying motion enhancement and then reduce
the dimensionality by principal component analysis (PCA).
Motion Enhancement We apply a simple pre-processing technique called motion
enhancement [101] to our training images. Given a sequence of images I[t],
where t = 1, · · · , N is the sequence number, we compute the average image,
(1/N) Σ_{t=1}^{N} I[t], and the motion-enhanced image (MEI) is the difference between the
original image and the average image, I[t] − (1/N) Σ_{t=1}^{N} I[t]. The intuition behind
MEI is that the average captures the static structures and smears the moving struc-
tures, so the difference amplifies the moving structures. Figure 5.3 shows the
original fluoroscopic image of a tumor in an ROI together with a motion-enhanced
view of it. We see that the tumor is clearer in the MEI.
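As an illustration, the motion-enhancement step above can be sketched in a few lines of NumPy (the function name and the (N, H, W) array layout are our own assumptions, not part of the original implementation):

```python
import numpy as np

def motion_enhance(frames):
    """Motion-enhanced images: subtract the temporal average from each frame.

    frames: array of shape (N, H, W) holding N fluoroscopic frames I[t].
    The average captures static anatomy; the difference amplifies
    moving structures such as the tumor.
    """
    frames = np.asarray(frames, dtype=float)
    average = frames.mean(axis=0)   # (1/N) * sum_t I[t]
    return frames - average         # MEI[t] = I[t] - average
```

Note that subtracting one shared average makes each MEI depend on the whole training sequence, which is why the average can also be updated as new frames arrive.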
Figure 5.4: The top left figure is the breathing waveform represented by the tumor location. To the left of the vertical dotted line is the training period. To the right of the vertical line is the treatment or testing period. Under the horizontal dotted line (the threshold corresponding to a given duty cycle) is the end-of-exhale phase. The bottom figures show different end-of-exhale images during the training session, which are averaged to generate a single template.
Principal Components Analysis A typical EOE image has a size of 100 × 100
pixels. This leads to a dimensionality of 10,000. To reduce the dimensionality,
we apply principal component analysis (PCA) [102]. PCA finds a linear trans-
formation, Y = A^T X, that projects the original high-dimensional data X with d
dimensions to lower-dimensional data Y with q dimensions, where q < d, such that
the mean squared reconstruction error is as small as possible. X here is d × N,
where N is the number of data points, and A is a d × q matrix. The solution is the
transformation matrix A whose columns correspond to the q eigenvectors with the
q largest eigenvalues of the data covariance. PCA thus projects the high-dimensional
dataset to the lower-dimensional subspace in which the original dataset has the
largest variance (i.e., it restricts attention to those directions along which the scatter
of the data points is greatest). PCA is applied only as a pre-processing step to clus-
tering and to the support vector machine. It is not applied to template matching in
this research because during treatment, projecting each image onto span{A} is quite
time consuming.
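The eigendecomposition-based PCA described above can be sketched as follows (a minimal NumPy version; function names and the column-per-sample layout are ours):

```python
import numpy as np

def pca_fit(X, q):
    """Fit PCA on X (d x N, one data point per column).

    Returns (mean, A): A is the d x q matrix whose columns are the q
    eigenvectors of the data covariance with the largest eigenvalues.
    """
    mean = X.mean(axis=1, keepdims=True)
    cov = np.cov(X)                          # d x d sample covariance
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    A = vecs[:, np.argsort(vals)[::-1][:q]]  # keep q leading eigenvectors
    return mean, A

def pca_project(X, mean, A):
    """Y = A^T (X - mean): project to the q-dimensional subspace."""
    return A.T @ (X - mean)
```

Centering by the mean before projecting is what makes the retained directions the directions of largest variance.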
5.2 Clustering Ensemble Template Matching and Gaussian Mixture Clustering
In this section, we describe our clustering ensemble template matching method for
gated treatment of lung cancer using fluoroscopy images. Before we proceed, we
define our notation for clarity: R is the ROI in an incoming fluoroscopic image,
Ti is the ith reference template (such as one of the EOE templates in the EOE gating
window), the symbol ⊗ represents correlation, and s is the score used for generating the
gating signal, i.e., g = H(s − s0), where s0 is the threshold score and H(x) is the
Heaviside step function: H(x) = 0 for x < 0 and H(x) = 1 for x ≥ 0. The gating
signal g = 1 means beam ON while g = 0 means beam OFF.
In our previous work, we built a single EOE template by simply averaging
all the motion-enhanced EOE training images, as shown in Figure 5.4. During
treatment, we compute the correlation score between the reference template and
each incoming MEI. Assuming that the image R and template T are of the same
size m × n, the normalized correlation coefficient (correlation score s) is defined as

s = [Σm Σn (Rmn − R̄)(Tmn − T̄)] / sqrt{ [Σm Σn (Rmn − R̄)²] · [Σm Σn (Tmn − T̄)²] }     (5.1)

Here R̄ and T̄ are the average intensity values of the image R and template T,
respectively. A high correlation score indicates that the incoming image is similar
to the reference and gating is enabled.
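Equation 5.1 is the standard normalized cross-correlation and can be written directly in NumPy (the function name is ours):

```python
import numpy as np

def correlation_score(R, T):
    """Normalized correlation coefficient between image R and template T,
    both m x n arrays, following the form of equation 5.1."""
    Rc = R - R.mean()                  # subtract average intensity R-bar
    Tc = T - T.mean()                  # subtract average intensity T-bar
    return (Rc * Tc).sum() / np.sqrt((Rc ** 2).sum() * (Tc ** 2).sum())
```

The score is 1 for a perfect match, −1 for an inverted match, and near 0 for unrelated images; the threshold s0 then turns it into a gating decision.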
In essence, the template building procedure is to generate the representative
patterns defined in the EOE window. Then during matching, we use these represen-
tatives to recognize the images that have the same pattern and should be treated.
Therefore, accurate representative templates are the key to accurate gating signals.
We have noted that using only the mean of the EOE training images as
the template may not be enough. Our experiments showed that it sometimes leads
to erratic gating signals, because the correlation score curve may not be
smooth, causing the gating signal to be noisy at the transition regions.
Figure 5.5: Ensemble/multiple template method. Here, each image is an end-of-exhale template. We match the incoming image with each template and get a set of correlation scores s1, s2, · · · , sK. Then we apply a weighted average of these scores to generate the final correlation score s for gating.
5.2.1 Ensemble/Multiple Template Method
Inspired by the success of ensemble methods in improving the classification accu-
racies of weak base classifiers [103], we explore applying an ensemble or multiple
template matching for gating. Hopefully, an ensemble of templates can smooth out
the resulting gating signals. One way of generating an ensemble of templates is to
set each EOE image as a template, and a correlation score is computed for each
template. After a set of correlation scores are computed, it is necessary to choose
an intelligent way to combine them to get a robust gating signal. There are several
ways to combine the correlation scores (such as, taking the maximum score, taking
the weighted average score). We found that applying a weighted average gave us
the best results. We define these weights in Subsection 5.2.3. This procedure is
explained in Figure 5.5. Each image in the figure is an end-of-exhale template. We
match the incoming image with each template and get a set of correlation scores
s1, s2, · · · , sK . Then we apply a weighted average of these scores to generate the
final correlation score s for gating.
Although we can use the entire set of motion-enhanced images in the EOE
gating window as reference templates, this approach is computationally expensive
during treatment due to the need for computing several correlation scores. Many of
the reference templates are very similar. Therefore, we want to find a way some-
where in between a single template method and using all the templates, which hope-
fully has the merits of both methods, i.e., computational efficiency and robustness.
Here, instead of using all the templates, we would like to find a set of representative
EOE templates. Ideally, these templates should carry all of the useful information
of the original frames and discard the noise. Clustering methods are ideally suited
to this task. Clustering algorithms group similar objects together and summarize
each group with a representative template. We will use clustering to find a small
set of templates and apply weighted averaging to combine the scores. We need to
determine the number of templates and a method to find the clusters.
5.2.2 Finding Representative Templates by Clustering
To cluster the EOE templates, we apply Gaussian mixture clustering [104] to the
PCA dimensionality-reduced images. We denote the parameters of this model by
Θ. In this model, we assume that each cluster comes from a multivariate Gaus-
sian distribution, and our data (the image templates) come from a finite mixture of
Gaussians. A Gaussian model assumes that there is a cluster template and the mem-
bers of that cluster are variations of that template. Let X denote our data set, which
has d dimensions, and n denote the number of data points in X; we are trying to
group the n data points into K clusters. Each of the K clusters is represented by its
model parameters, θj = (πj, µj, Σj), where πj is the prior probability (or mixture
proportion) of cluster j, µj is the mean of cluster j, and Σj is the covariance matrix
of cluster j. More formally, we say that the probability of the data given the model
is

P(X|Θ) = Σ_{j=1}^{K} πj P(X|θj),

P(X|θj) = [1 / sqrt((2π)^d |Σj|)] exp{ −(1/2) (X − µj)^T Σj^{−1} (X − µj) }
To estimate the parameters πj, µj, and Σj for each cluster, we apply the expectation-
maximization (EM) algorithm [105]. The EM algorithm alternates between an
expectation step and a maximization step until convergence. In the expectation
step, we estimate the cluster to which each image template belongs, given that
the parameters (πj, µj, and Σj) are fixed. In the maximization step, we estimate
the parameters by maximizing the complete log-likelihood (the log-likelihood as-
suming we know the cluster memberships). The cluster means µj then become our
representative templates.
To automatically determine the number of clusters, we apply the Bayesian
information criterion (BIC) [86] score to penalize the log-likelihood function. We
now maximize

log P(X|Θ_ML) − (f/2) log(n),     (5.2)

where f is the number of free parameters in the model. In our problem, we have

f = dK + (d(d + 1)/2)K + (K − 1)     (5.3)
Figure 5.6: Scatter plot of our image data for patient 4 and 35% duty cycle in 2D with the clustering result. The "o" and "x" markers represent different clusters, with the means shown in bold and the covariances as ellipses.
to be estimated. We run the Gaussian mixture clustering from K = 1 to Kmax
(Kmax = 4 in our experiments), then pick the K with the largest BIC score. Note
that if we do not add a penalty term, the log-likelihood increases as K increases.
This can lead to the trivial result of picking K equal to the number of data samples
(i.e., each data point is considered its own cluster). A scatter plot of the clusters with
their means and covariances is shown in Figure 5.6.
5.2.3 Generating the Gating Signal
By the template clustering method, we can build a set of accurate representative
templates. Accordingly, we will have a set of correlation scores for each incoming
new image in the template matching step. Therefore, we need a way to combine the
scores. As mentioned earlier, we use a weighted average, with the weights given by
the prior probability πj of each mixture component. This amounts to a voting
procedure in which mixtures with more members have higher weights for the
final correlation score than mixtures with fewer members. We generate the
gating signal based on the final correlation score. From the final correlation score of
the training images, we determine a threshold that corresponds to the pre-set duty
cycle. This threshold is then applied to the correlation scores calculated in real-
time during treatment to generate the gating signal. When the score is above this
threshold value, it indicates that the therapy beam should be enabled. Otherwise,
the therapy beam should be turned off.
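The weighted voting and thresholding steps above can be sketched as follows (function names are ours; the threshold rule is one plausible way to realize a preset duty cycle):

```python
import numpy as np

def gating_signal(scores, weights, s0):
    """Weighted-average ensemble score and Heaviside gating.

    scores : (K, T) correlation scores of T frames against K templates
    weights: (K,) mixture proportions pi_j used as voting weights
    s0     : threshold score chosen on training data
    Returns g[t] = 1 (beam ON) when the combined score is >= s0, else 0.
    """
    s = np.asarray(weights) @ np.asarray(scores)  # final score per frame
    return (s >= s0).astype(int)

def threshold_for_duty_cycle(train_scores, duty):
    """Pick s0 so that a fraction `duty` of training frames lies above it."""
    return np.quantile(train_scores, 1.0 - duty)
```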
Figure 5.7: Results from different methods for an example patient. (a) single template method; (b) ensemble/multiple templates method with Gaussian mixture clustering. For each figure, the top curve is the correlation score and the bottom plot is the gating signal generated by the correlation score. Here we use a 35% duty cycle.
Figure 5.7 shows an example of gating signals generated by this cluster-
ing ensemble template method and the previous single template method. We can
see that the ensemble approach can achieve a reasonable gating signal and smooth
correlation score, which coincides well with the smooth tumor motion. As demon-
strated in this figure, through the voting process, the effect of errors caused by one
template is compensated by the other templates. Thus the clustering ensemble tem-
plate method is less sensitive to noise compared to a single template.
5.3 Support Vector Machines
Template matching as described in the previous section only looks at images
the gating window. However, instead of only using the images inside the gating
window, can we also take advantage of images outside the gating window? Will
this additional information be helpful to make decisions? This strategy is espe-
cially helpful for distinguishing images in the transition of the gating signal. We
can measure how similar they are with the EOE images as well as how dissimilar
they are from the non-EOE frames. Indeed, this can be viewed as a two-class clas-
sification problem. In this section, we re-cast the gating problem as a classification
problem and present the support vector machine classifier as a solution.
Gating as a Classification Problem The goal for an automated method for gated
radiotherapy is to decide when to turn the beam ON or OFF. This exclusivity con-
dition provides us with a clue that the gating problem can actually be re-cast as a
classification problem. We treat the image frames that correspond to a beam-on
signal as one class and those that correspond to a beam-off signal as another class.
For simplicity, we will call them beam-on and beam-off images. By this way, we
reformulate the original gating problem into a two-class classification problem, as
shown in Figure 5.8. In Figure 5.8a, ten time points on the gating signal are shown
Figure 5.8: Re-casting the gating problem as a classification problem, (a) and (b). (c) presents the decision boundary created by single template matching, and (d) displays the decision boundary of an SVM classifier.
from time t1 to t10, and each time point corresponds to an image frame, x1 to x10,
in Figure 5.8b. We represent each image frame as a vector, xt. For this illustrative
example, we project each of x1 to x10 onto two dimensions, as shown in Figure 5.8b.
The set of images, {x1, x2, x3, x8, x9, x10} are examples of beam-off images (class
OFF) and {x4, x5, x6, x7} are examples of beam-on images (class ON). The goal
of classification is to build a classifier that outputs ON or OFF given a new input
image xt by learning parameters of this classifier from training examples such as x1
to x10.
The template matching method can be considered as a classifier that only
takes advantage of the positive (ON) class. Correlation provides a measure of how
close or similar each new image is from our template representing the ON class.
The threshold is our decision boundary and turns the correlation score into a deci-
sion: scores higher than a threshold classify as ON and OFF otherwise. We select
this threshold based on the desired duty cycle. In Figure 5.8c, a template would
be the average of the ON examples, shown as a red dot in the center of the ellipse,
and the threshold would be a fixed distance from this template. Note that a better
way of automatically determining the decision boundary (threshold) is to consider
both positive (ON) and negative (OFF) examples. In this research work, we build
our classifier by looking at both ON and OFF training examples. In particular, we
design a support vector machine (SVM) classifier to solve this gating problem. Fig-
ure 5.8d displays the decision boundary created by a linear SVM on this simple
illustrative example data.
Support Vector Machine Classifier There are several possible classification al-
gorithms for this task. Among them, SVM is one of the most popular learning
methods for binary classification. SVM was originally designed by Vapnik [106].
It learns an optimal bound on the expected error, and finds a global optimum, as
opposed to many learning algorithms that provide only local optima (such as neural
networks [107]). The SVM objective is to find a boundary that maximizes the mar-
gin between two classes as well as separates them with the minimum empirical
classification error. An SVM first projects instances into high dimensional space
via kernels and then learns a linear separator that maximizes the margin between
the two classes.
The SVM problem can be formulated as follows: suppose we have the train-
ing data X = {x1, · · · , xn} and let {y1, · · · , yn} be the class labels of X. Without loss
of generality, we assign class labels to take the value of either +1 or −1. We want
a large margin and a small error penalty (slack variables ξi) for misclassifi-
cations, as shown in Figure 5.8d:

Minimize ||w||² + C (Σi ξi)
Subject to yi(xi · w + b) ≥ +1 − ξi     (5.4)

Here, C is a user-defined parameter, where larger values mean a higher penalty for
errors. In addition, we apply the kernel trick to allow nonlinear decision boundaries.
For the gating problem, we apply the radial basis function (RBF) kernel:

K(x, x′) = exp(−γ ||x − x′||²), for γ > 0.
During patient treatment (refer to Figure 5.1), each incoming image is pre-
processed by projecting the pixels in the ROI to the reduced dimensional space by
PCA as explained in Section 5.1. The decision function will be of the following
form:

D(x) = sgn( Σ_{t=1}^{n} yt αt K(x, xt) + b ),

K(x, xt) = exp(−γ ||x − xt||²)     (5.5)
Here yt is the class label for image vector t, and the αt and b are derived for a given
C by solving the optimization problem described in equation 5.4 using quadratic
programming [108]. The parameters γ and C of this function are learned during
patient set-up or training. During treatment, these parameters are fixed. We create
the gating signal based on the output of the decision function in equation 5.5 above,
with input the reduced-dimensional representation of our incoming fluoroscopy
image at time t.
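The dissertation trains its RBF-SVM with LIBSVM; scikit-learn's SVC wraps the same library, so the classifier can be sketched as follows (the synthetic PCA-reduced data and the γ, C values are stand-ins, not the experiment's values):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical PCA-reduced training images (n_frames x 50), labeled
# +1 for beam-on and -1 for beam-off frames.
rng = np.random.default_rng(0)
X_red = np.vstack([rng.normal(-1.0, 0.3, (40, 50)),
                   rng.normal(1.0, 0.3, (40, 50))])
labels = np.array([-1] * 40 + [1] * 40)

# gamma and C would come from the grid search during patient set-up.
clf = SVC(kernel="rbf", gamma=1e-2, C=10.0)
clf.fit(X_red, labels)

# During treatment, predict() evaluates the decision function of
# equation 5.5, D(x) = sgn(sum_t yt * alpha_t * K(x, xt) + b).
decision = clf.predict(X_red[:1])
```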
5.4 Experiments and Results
We collected fluoroscopic image data from five patients to evaluate the performance
of the following two methods: (1) an ensemble template matching method where
the representative templates are selected by Gaussian mixture clustering, and (2) a
support vector machine (SVM) classifier with radial basis kernels. For each patient,
we get a sequence of image frames by sampling at ten frames per second. Typically,
each sequence contains 300–400 frames, corresponding to 30–40 seconds. A
region of interest of around 100 × 100 pixels in size was selected to include the
tumor positions at various breathing phases. The training period we used to build
our templates consisted of two cycles, which corresponds to about 60–80 image
frames.
To validate the algorithms, we compared our estimated gating signal with the
reference gating signal. A radiation oncologist manually contoured the tumor in the
first frame of the images. Then in the following frames, the contour was dragged
to the correct places manually using a computer mouse by the radiation oncologist.
The tumor centroid position in each image frame was calculated and used to gener-
ate the gating signal based on various duty cycles. Here, in our experiment, we used
35% and 50% duty cycles. Assuming t0 is the total time (beam on and off), t1 is
the beam-on time (based on our estimation), t2 is the correctly predicted beam-on
time (true positive), we define the evaluation metrics: delivered target dose (TD)
TD = t2/t1 and real duty cycle (DC) of the treatment DC = t1/t0.
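These two metrics can be computed directly from the estimated and reference gating signals (the function name is ours):

```python
def td_dc(pred, ref):
    """Delivered target dose TD = t2/t1 and real duty cycle DC = t1/t0.

    pred, ref: 0/1 gating signals of equal length (estimated and reference).
    t0 = total frames, t1 = predicted beam-on frames,
    t2 = frames where beam-on is predicted correctly (true positives).
    """
    t0 = len(pred)
    t1 = sum(pred)
    t2 = sum(p and r for p, r in zip(pred, ref))
    return t2 / t1, t1 / t0
```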
5.4.1 Experiments by Clustering Ensemble Template Method
For the ensemble/multiple template matching method, we found the EOE training
images in the first two to three breathing cycles (the training period). We reduced
these EOE images to two to three dimensions and performed clustering.
We then computed the log-likelihood based on the Gaussian mixture model to get
our BIC scores and determine the number of clusters. We obtained two or three
clusters (depending on the patient) with this data. This means that only two or three
templates are enough for this application, which makes our method more efficient
than using all the EOE training images as templates. The final correlation
score for a new incoming image is the weighted average of the scores from the
reference templates, with the prior probabilities of the mixture components as weights.
5.4.2 Experiments by Support Vector Machine
Image frames acquired during the set-up session are pre-processed and used to train
the SVM classifier. We reduced the dimension from 100 × 100 to 1 × 50 for the SVM.
We then labeled the training images with +1 for beam-on images and −1 for beam-
off images. We used LIBSVM [109] in our experiments. To train our SVM model,
we applied a coarse-to-fine grid search to determine the parameters γ and C for our
radial basis function SVM (RBF-SVM) model. We found that trying exponentially
growing sequences of γ and C is a practical method to identify the parameters (γ =
10⁻¹⁰, 10⁻⁸, · · · , 10³; C = 10⁻⁵, 10⁻³, · · · , 10¹⁰). Furthermore, to prevent overfitting in
tuning the parameters, a ten-fold cross-validation procedure is performed on the
training images to find a better model. Basically, the (γ, C) pair which provides
the best ten-fold cross-validation accuracy on the training data is selected. We use
the rest of the image data (data during treatment) as our testing samples. We pre-
process the test data by PCA, and the predicted labels given by the SVM classifier
serve as our estimated gating signals.
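The grid search with ten-fold cross-validation can be sketched with scikit-learn's GridSearchCV (a stand-in for the LIBSVM tooling the dissertation used; the data and the grid ranges here are illustrative, not the experiment's):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical PCA-reduced training frames and +1/-1 gating labels.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1.0, 0.5, (30, 5)),
                     rng.normal(1.0, 0.5, (30, 5))])
y_train = np.array([-1] * 30 + [1] * 30)

# Exponentially growing candidate sequences, as in the text.
param_grid = {"gamma": [10.0 ** k for k in range(-4, 1)],
              "C": [10.0 ** k for k in range(-1, 3)]}

# cv=10: ten-fold cross-validation on the training images.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X_train, y_train)
best_gamma = search.best_params_["gamma"]
best_C = search.best_params_["C"]
```

A coarse grid like this is typically followed by a finer grid around the best pair, which is the coarse-to-fine strategy the text describes.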
5.4.3 Results and Discussion
The experimental results are shown in Figures 5.9-5.10. In each figure, the upper
figure shows the delivered target dose (TD) as a bar plot, with the SVM results in
blue bars and the template-matching method results in red bars. Delivered target
dose measures the true positive rate. The lower figure shows the real duty cycle
(DC) in the same format; this is a measure of efficiency. For the proposed duty
cycle of 35%, SVM achieves 95.8% average TD and 40.8% average DC, while the
clustering ensemble template matching method achieves 94.9% average TD and
34.7% average DC. When the proposed duty cycle equals 50%, SVM has an average
TD of 98.4% and average DC of 53.1%, while clustering ensemble template method
has an average TD of 97.6% and 49.5% DC.
Figure 5.9: Experimental results in TD and DC for the 35% proposed duty cycle. Blue bars: metric by the SVM method. Red bars: metric by the clustering ensemble template matching method.
Over all five patients we found that both methods are able to deliver most of
the target dose correctly, both have a high TD, while SVM is more efficient than
Figure 5.10: Experimental results in TD and DC for the 50% proposed duty cycle. Blue bars: metric by the SVM method. Red bars: metric by the clustering ensemble template matching method.
clustering ensemble template matching. In general, SVM has an average DC that is
4–6 percentage points higher. This is because SVM makes use of more information, both beam-on and
beam-off images, while clustering ensemble template matching only captures the
information from the beam-on images. However, template matching has the advan-
tage of detecting the aperiodic intra-fraction motion that occurs for some patients,
although not seen in the current datasets. This is because the beam is only turned on
when the correlation score is high, meaning the ROI of the incoming image is simi-
lar to the reference template and the tumor is at the right position. If the tumor drifts
away, or if shifts, rotations, or deformations of the anatomy occur between patient setup and
treatment, the correlation score will remain low and the beam will not
be turned on. We then need to re-position the patient.
The time complexity of both methods is almost the same. We recorded the
CPU time for executing both methods on a Pentium 4 Windows machine with 1 GB
of RAM. The average CPU time for predicting the gating signal during testing is
Figure 5.11: Example of estimated gating signals on patient 4 for the proposed 35% duty cycle. Top: the gating signal predicted by the SVM classifier. Bottom: the gating signal generated by the clustering ensemble template matching method.
0.06 sec/frame for both methods. To obtain the gating signal for each time point,
SVM needs to perform PCA and apply the reduced image vector as an input to the
decision function, equation 5.5. The clustering ensemble template matching method
needs to calculate the correlation between the original high-dimensional motion-
enhanced image and two or three templates (cluster means).
Figure 5.11 shows an example of the estimated gating signals. It plots the
gating results by both methods. The top figure displays the predicted gating signal
in red and the reference signal in black for SVM. The bottom figure shows the
gating and reference signals for clustering ensemble template matching. We can
see that both gating signals coincide very well with the reference, and all errors
occur at the edges.
5.5 Conclusion for Robust Markerless Gated Radio-
therapy
This research work provides a case study where machine learning techniques have
been successfully applied to an important real-world problem: gated radiotherapy.
Working closely with the domain expert, we carefully selected the appropriate ma-
chine learning and data mining tools in developing the ensemble/multiple template
matching method. Through our collaboration, we also provided our domain expert
with a different view of the gating problem and re-cast it as a classification prob-
lem. Our study showed the feasibility of solving the gating problem by classifica-
tion techniques. This provides us with wider resources for gated radiotherapy. We
can try other classification techniques, such as Bayesian classifiers, neural networks,
and hidden Markov models, in our future work.
For our next step, we will (1) test the algorithms using more and longer
patient data, (2) find a better way to get reference gating signal for validation, and
(3) evaluate the dosimetric consequence of the current error level to see if there is a
need to further lower the error rates. Then, we will consider clinical implementation
of our methods.
Chapter 6
Multiple Template-based Fluoroscopic Tracking of Lung Tumor Mass without Implanted
Fiducial Markers
In the previous chapter, we intensively studied the template matching method to
generate robust and accurate gating signals for lung radiotherapy without implanted
markers. Here, in this chapter, we extend this idea to directly track the tumor loca-
tion throughout the whole breathing cycle. In Section 6.1, we describe two template
matching methods to track tumors using the fluoroscopic images without markers:
the motion-enhanced method and eigenspace tracking. Section 6.2 describes the
experimental setup and evaluation metrics used to test the proposed methods. Sec-
tion 6.3 reports the experimental results and provides a discussion of those results.
Finally, Section 6.4 presents our conclusion.
6.1 Basic Ideas of Multiple Template Tracking
A good tracking algorithm should take into account the tumor motion characteris-
tics. Relevant lung tumor motion characteristics can be summarized as follows.
1. Lung tumor motion is mainly caused by patient respiration and is thus pe-
riodic. Accordingly, tumor shape and appearance projected in the images
should vary more or less as a function of the breathing phase.
2. Although breath coaching can improve the regularity and reproducibility of
patient breathing [110], the projected tumor images at the same breathing
phase in different breathing cycles can still vary.
3. The fluoroscopic image intensities change with the chest expansion and con-
traction. The images are brighter during the inhale phase and darker during
the exhale phase. This intensity change should also be considered in the
tracking algorithm.
In theory, tumor positions in fluoroscopic images can be detected using a
single-template matching method. A basic single-template tracking approach is to
simply perform an exhaustive search and to find the highest correlation between a
tumor template and an image region. Due to the above-mentioned tumor motion
characteristics, i.e., tumor appearance in, and the intensity of, the projected images
can vary from frame to frame, this simple approach does not work well for lung
tumor mass tracking in fluoroscopy video. It leads to erratic locations and is quite
time consuming. Using an adaptive template method, i.e., updating the reference
template using the tumor image in the previous image frame, the tracking results
can be improved. However, such a method is not robust to errors made in previous
frames and the tracking may drift. In short, a single-template approach for lung
tumor tracking is not sufficient.
Figure 6.1: Outline of the proposed multiple template tracking procedure.
If we use multiple templates, instead of one, for lung tumor tracking, the
first and third tumor motion characteristics can be naturally handled. If we further
allow some fine-tuning for the tumor position in each template, the second motion
characteristic may also be taken into account. The multiple-template method has
been developed in object recognition to detect the object at various poses (due to
changes in rotation, scale, illumination and other factors) [111]. Here, for our prob-
lem, we apply multiple templates to represent different poses (position and shape)
of the tumor in the projection images at different breathing phases.
The multiple-template tracking has four components: template generation,
search mechanism, scoring function and a voting mechanism. We first need to build
the set of templates to represent the tumor's various poses using images acquired
during patient setup. Then, during patient treatment, we find the best match between
the incoming image and each reference template. The quality of a match is based on
the scoring function utilized. Finally, we combine the results from all the templates
to determine the tumor location using a voting scheme. Figure 6.1 outlines the
tracking procedure using multiple templates.
Figure 6.2: A fluoroscopic image with a region of interest (ROI) (blue rectangle) and a tumor contour (red curve).
6.1.1 Building Multiple Templates
For this study, we use about 6–7 s of fluoroscopy video, which normally contains two
breathing cycles, as our setup or training images. The tumor contour in each of the
training images is either manually marked by clinicians or automatically transferred
from digitally reconstructed fluoroscopy (DRF) images by image registration [100].
Then a rectangular region of interest (ROI), as shown in Figure 6.2, is automatically
created in the image that contains the tumor throughout the whole breathing cycle.
We sample N templates, for example N = 12 in our experiments for a certain
patient, T1, T2, · · · , T12 at equal time intervals based on the breathing waveform
from the setup session, as shown in Figure 6.3. Berbeco et al. have observed that
the average fluoroscopic image intensity changes with the breathing cycle (i.e., the
image is darker at the exhale phase and brighter at the inhale phase) [16]. We utilize
the intensity waveform to determine the breathing cycle. We divide the intensity
waveform into equal bins as shown in Figure 6.3. Setup image frames falling in
a specific bin are averaged, and an ROI that contains the tumor is selected to be
the reference template for that phase bin. Each reference template corresponds to
Figure 6.3: Twelve motion-enhanced tumor templates built by averaging the ROI images (as shown in Figure 6.2) falling in the same time bin. The intensity waveform is divided into twelve equal time bins, corresponding to which twelve templates were built.
a known tumor position (from the patient setup). This set of templates gives us a
better representation of the tumor movement because they cover most of the motion
information. To get a set of smoother templates, we may use several breathing
cycles as a training period by dividing each cycle into the same number of bins,
then averaging the images in the same bin at the same phase.
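The phase-binned template building described above can be sketched as follows (the function name and the phase-in-[0, 1) convention are our own simplifications; the text derives phase from the intensity waveform):

```python
import numpy as np

def build_phase_templates(frames, phase, n_bins=12):
    """Average the ROI frames falling into each of n_bins equal phase bins.

    frames: (N, H, W) motion-enhanced ROI images from the setup session
    phase : (N,) breathing phase in [0, 1), e.g. derived from the
            mean-intensity waveform
    Returns a list of up to n_bins templates (bins with no frames are
    skipped), one representative per breathing phase.
    """
    bins = np.minimum((np.asarray(phase) * n_bins).astype(int), n_bins - 1)
    return [frames[bins == b].mean(axis=0)
            for b in range(n_bins) if np.any(bins == b)]
```

Averaging frames from several breathing cycles that fall in the same bin gives the smoother templates the text mentions.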
6.1.2 Search Mechanism
Ideally, even taking into account the second tumor motion characteristic, if the
breathing is reasonably stable, at a particular breathing phase the tumor position
should be close to that represented by the reference template at the same phase.
Therefore, instead of an exhaustive search in the whole image, which can take a
long time, we only search in a small neighboring region around the reference lo-
cation of the template. That is to say, we measure the similarity between each
template, Ti (i = 1, 2, · · · , 12), and each incoming fluoroscopic image by allowing
the template to have a small shift (Δx, Δy) along the x- and y-axes. There are two
reasons for allowing the shift: (i) templates built during the training period may not
provide a complete description of tumor motion during the whole treatment fraction,
and (ii) the tumor movement trajectory may vary slightly from period to period.
6.1.3 Template Matching
The x! and y! coordinates of the tumor centroid (xi, yi), for a template Ti , are
calculated by averaging the coordinates of the pixels in the tumor contour. The tu-
mor location at a given time point during treatment delivery, estimated using this
template, is then given as (xt, yt) = (xi + ,xi, yi + ,yi), where (,xi, ,yi) is the
shift required for template Ti to produce the best match to the image. This is done
for all the templates. The tumor location at this time point is then determined by
combining all estimated positions through a voting procedure described later. Two
methods for template representation and similarity calculation have been developed.
One method uses motion-enhanced images (MEIs) and calculates Pearson's correlation score as the scoring function. The other is eigenspace tracking, which represents the images in a reduced-dimension eigenspace and applies the mean-squared error as the scoring function. In the following sections, we discuss these two methods in
detail.
Figure 6.4: Tumor contour and region of interest (ROI). Left: original fluoroscopic image. Right: motion-enhanced image.
Motion Enhancement and Pearson's Correlation Score.
Motion enhancement [101] isolates the moving tumor from the rest of the static
anatomy. Given a sequence of images I[t], where t = 1, ..., N is the sequence number, we compute the average Σ_{t=1}^{N} I[t]/N, and the motion-enhanced image (MEI) is then the difference between the original image and the average image, I[t] − Σ_{t=1}^{N} I[t]/N. In a clinical scenario, the average image is determined using the training images and can be updated using the images acquired during treatment delivery. The intuition behind MEI is that the average captures the static structures and blurs the moving structures, thus the difference amplifies the moving structures. Figure 6.4 shows the original fluoroscopic image of a tumor in an ROI together with its motion-enhanced view.
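As an illustrative sketch (not the thesis implementation, which was written in MATLAB), the motion-enhancement step can be expressed in a few lines of NumPy; the array layout and toy data here are assumptions:

```python
import numpy as np

def motion_enhance(frames):
    """Motion-enhanced images (MEIs): subtract the temporal average
    from each frame, suppressing static anatomy and amplifying
    structures that move across the sequence."""
    avg = frames.mean(axis=0)   # average of the N training frames
    return frames - avg         # MEI[t] = I[t] - average image

# toy sequence: one static bright pixel, one pixel that moves each frame
frames = np.zeros((3, 4, 4))
frames[:, 0, 0] = 5.0           # static structure: cancels in the MEI
for t in range(3):
    frames[t, 1, t] = 10.0      # moving structure: enhanced in the MEI
meis = motion_enhance(frames)
```

In practice the average image would be updated with frames acquired during treatment delivery, as described above.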
We apply Pearson's correlation score as the similarity measure between an incoming image and a reference template. Assuming the image R and template T are of the same size m × n, Pearson's correlation score s is defined as

s = Σm Σn (Rmn − R̄)(Tmn − T̄) / sqrt[(Σm Σn (Rmn − R̄)²)(Σm Σn (Tmn − T̄)²)]    (6.1)
It is assumed that if the score is high, the incoming image has a high similarity with
the reference template, and the tumor location is close to that in the template.
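A minimal sketch of this scoring and shifted-search step, assuming NumPy arrays; the function names and the search-window sizes used as defaults here are illustrative:

```python
import numpy as np

def pearson_score(R, T):
    """Pearson's correlation score s (Eq. 6.1) between two equally
    sized patches R and T."""
    r = R - R.mean()
    t = T - T.mean()
    denom = np.sqrt((r ** 2).sum() * (t ** 2).sum())
    return (r * t).sum() / denom if denom > 0 else 0.0

def best_shift(image, template, ref_xy, max_dx=5, max_dy=10):
    """Slide the template over a small window around its reference
    position (x0, y0); return the best shift and its score."""
    m, n = template.shape
    x0, y0 = ref_xy
    best_s, best_d = -np.inf, (0, 0)
    for dy in range(-max_dy, max_dy + 1):
        for dx in range(-max_dx, max_dx + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0:                  # shift leaves the image
                continue
            patch = image[y:y + m, x:x + n]
            if patch.shape != template.shape:   # shift runs off the edge
                continue
            s = pearson_score(patch, template)
            if s > best_s:
                best_s, best_d = s, (dx, dy)
    return best_d, best_s

# toy check: template embedded at (x=6, y=7), reference guess at (5, 5)
T = np.arange(9.0).reshape(3, 3)
img = np.zeros((20, 20))
img[7:10, 6:9] = T
shift, score = best_shift(img, T, ref_xy=(5, 5))
```

The exhaustive inner loop is affordable precisely because the window is small, which is the point of restricting the search to a neighborhood of the reference location.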
Eigenspace Representation.
Motivation and overview. Eigenspace representation, based on principal component analysis (PCA), can reduce redundant information, keep the most distinctive features and thus reduce noise in the image sequence. Here we use this method to improve the matching accuracy and robustness [1]. The idea of applying PCA to images, or eigenimages, was introduced by Turk and Pentland for face recognition and has been shown to be successful in other object recognition tasks [3]. Motion enhancement represents the image by amplifying the moving structures and keeps the image's original dimensionality. PCA, on the other hand, is a feature extraction method that projects the original image to a lower dimensional space and represents images by capturing the features with high variability.
Eigenspace tracking can be described in two steps. First, find the lower di-
mensional space (eigenspace) by PCA and project the current frame to that
space, i.e., find the eigenspace representation of the current frame. Second,
match the current frame to the templates by minimizing the mean-squared
error between them. We will describe this method in detail below.
Template building in eigenspace. Given a sequence of images, eigenspace tracking
constructs a small set of basis images that characterize the majority of the
variation in the image set. These basis images are used to approximate the
original images. Assume that the tumor template size is m × n (in our study, a typical tumor template is around 100 × 100 pixels).
Suppose we have N images in our training period. For each of the N images,
we construct a 1D image vector, e, by scanning the image in the standard raster scan order. This vector has dimension d = m × n and becomes a column in a d × N matrix X. Using PCA, we can find a linear transformation, Y = AᵀX, that projects the original d × N matrix X to a lower dimensional q × N matrix Y, where q < d, such that the mean-squared error between X and its reconstruction from Y is as small as possible. Here, A is a d × q transformation matrix.
To find the transformation matrix A, we use the singular value decomposition (SVD) [1] to decompose the matrix X as X = UΣVᵀ. U is an orthogonal matrix of d × d dimensions and V is another orthogonal matrix of N × N dimensions. Σ is a diagonal matrix of d × N dimensions with the singular values sorted in decreasing order. The transformation matrix A then consists of the first q columns of U, i.e., the first q eigenvectors with the largest singular values.
Now we can project the original d-dimensional image vector e into q dimensions as e′. Here, e′ is a q-dimensional vector (a column of matrix Y) and can be represented as e′ = Σ_{i=1}^{q} ci Ai, where the ci are the new coordinates of the original image vector e in this subspace and ci = Aiᵀe. In this study, we kept the smallest number of eigenvectors that retains 95% of the original variance (i.e., the sum of the retained eigenvalues divided by the sum of all eigenvalues is at least 95%). For our data, this corresponds to q = 50. Thus, we reduce the dimension of an image from about 10,000 to 50.
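The template-building step just described can be sketched with NumPy's SVD; this is a simplified illustration under assumed names and toy data, and, matching the description above, the image vectors are not mean-centered:

```python
import numpy as np

def build_eigenspace(X, var_keep=0.95):
    """Return the d x q transformation A: the first q left singular
    vectors of X, where q is the smallest number of components whose
    squared singular values retain `var_keep` of the total."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    q = int(np.searchsorted(energy, var_keep) + 1)
    return U[:, :q]

def project(A, e):
    """Eigenspace coordinates c of an image vector e: c_i = A_i^T e."""
    return A.T @ e

# toy data: N = 5 image vectors of dimension d = 10 lying in a 2D subspace
d = 10
basis = np.zeros((d, 2))
basis[0, 0] = basis[1, 1] = 1.0
coeffs = np.array([[3.0, 0.0, 1.0, 2.0, -1.0],
                   [0.0, 2.0, 1.0, -1.0, 2.0]])
X = basis @ coeffs          # d x N matrix of training image vectors
A = build_eigenspace(X)     # q = 2 retains over 95% of the variance here
```

For rank-deficient toy data like this, projecting onto A and reconstructing recovers X exactly; for real fluoroscopic images the same call simply truncates the spectrum at the 95% energy level.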
Tracking in eigenspace. Let eN be an m × n ROI in a new incoming image measured during treatment delivery, represented as a d × 1 vector, where d = m × n. We wish to compute the similarity between this ROI and the tumor templates in eigenspace. First, we project eN onto the reduced eigenspace: we compute ci = Aiᵀ eN to get e′N = Σ_{i=1}^{q} ci Ai. Then the mean-squared error E(c) = ||e′N − e′T||² is calculated between e′N and the projection of each template in eigenspace, e′T. For each reference template, the ROI is
allowed to move in the incoming image around the template position within
a pre-set small range, which is ±10 pixels in the y-direction and ±5 pixels in the x-direction. Now, the tumor position is determined for this specific
template by minimizing the mean-squared error. This is done for all the tem-
plates. Therefore, every template will have a corresponding tumor position
and a minimized mean-squared error value. The smaller the mean-squared
error, the higher the similarity between the image and the template.
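The matching criterion itself is just a squared distance between eigenspace projections; a brief sketch (the transformation and vectors below are illustrative):

```python
import numpy as np

def eigenspace_error(A, roi_vec, template_vec):
    """Squared error E(c) = ||e'_N - e'_T||^2 between the eigenspace
    projections of an incoming ROI vector and a template vector.
    Smaller values mean a better match."""
    diff = A.T @ roi_vec - A.T @ template_vec
    return float(diff @ diff)

# toy check in a 2D eigenspace spanned by the first two coordinate axes
A = np.eye(4)[:, :2]
roi = np.array([1.0, 2.0, 3.0, 4.0])      # components 3 and 4 are discarded
t_same = np.array([1.0, 2.0, 0.0, 0.0])   # identical to roi in the subspace
t_far = np.array([2.0, 3.0, 0.0, 0.0])
```

Evaluating this error over the same small shift window as before, and taking the minimizing shift per template, yields one candidate tumor position and one error value per template.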
6.1.4 Voting Mechanism
For each patient, we have multiple, say 12, reference templates representing vari-
ous breathing phases and tumor positions. Using methods discussed in the previous
sections, for an incoming image, each template will be associated with an esti-
mated tumor position and a similarity measure (correlation score or mean-squared
error). The detected tumor position in the incoming image is a combination of
the estimated tumor positions using all templates, weighted with the corresponding
similarity measures.
The detected tumor position in each incoming image, (xt, yt), could be estimated using the tumor position corresponding to the template with the maximum score. A more robust method is to combine the positions from all the templates as

(xt, yt) = Σi wi (xi + Δxi, yi + Δyi) / Σi wi,    (6.2)

where (xi, yi) is the reference tumor position corresponding to template Ti, determined during patient setup, (Δxi, Δyi) is the small shift of the template during matching, and wi is the weighting factor for template Ti. There are many ways to
determine the weighting factors for the combination process. The weighting scheme
used in this work is to set wi = 1 for the templates with scores above a threshold
and wi = 0 for the templates with scores below the threshold. The threshold is set
empirically to be within 85–95% of the maximum similarity score. It varies from
patient to patient. We search from 85% to 95% and find the best threshold based on
the training data and use it on the incoming new image data during testing.
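The thresholded weighting scheme just described can be sketched as follows; the threshold fraction and names here are illustrative:

```python
import numpy as np

def vote(positions, scores, frac=0.90):
    """Combine per-template position estimates (Eq. 6.2): weight 1 for
    templates scoring at least frac * (maximum score), 0 otherwise."""
    positions = np.asarray(positions, dtype=float)   # one (x, y) per template
    scores = np.asarray(scores, dtype=float)
    w = (scores >= frac * scores.max()).astype(float)
    return (w[:, None] * positions).sum(axis=0) / w.sum()

# two high-scoring templates agree closely; the low scorer is ignored
pos = vote([(10, 20), (12, 22), (30, 40)], [1.00, 0.95, 0.50])
```

For the eigenspace method, where a smaller mean-squared error means a better match, the error would first be converted to a score in which larger is better before applying the same rule.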
6.2 Experimental Setup for Direct Tumor Tracking
The performance of the two proposed multiple-template tracking methods was evaluated in a set of experiments. The first method (Method 1) computes the correlation scores between motion-enhanced fluoroscopic images and multiple templates. The second method (Method 2) performs multiple-template matching in the eigenspace.
We have used six patients' fluoroscopic image data to evaluate the performance of these two methods. A region of interest of around 100 × 100 pixels in size was selected to include the tumor positions at various breathing phases. The training period we used to build our templates consisted of two cycles, which corresponds to about 60 image frames. A breathing cycle was equally divided into 10–12 bins and training images were placed in the corresponding bins based on the breathing phases. Then, we took the average of all frames in the same bin as one template, generating 10–12 templates for matching. Each reference template corresponds to a tumor position as determined in the patient setup process, either manually by a clinician or automatically by matching to the reference digitally reconstructed radiographs (DRRs) from the patient's 4D CT data [100].
To validate the algorithms, we need to compare our tracking results with the
reference tumor positions. A radiation oncologist manually contoured the tumor
projection in the first frame of the images. In the following frames, the radiation oncologist manually dragged the contour to the correct position using a computer mouse. The tumor centroid position in each image frame was calculated and used as the reference for comparison. A MATLAB (The MathWorks, Inc., Natick, MA, USA) program was written to facilitate this procedure. The performance of the
algorithms is measured by calculating the absolute distance between the algorithm-determined tumor position and the reference tumor position. Since the tumor motion in the superior–inferior (S–I) direction dominates, in this work we only present and evaluate the tracking results in the S–I direction (y-direction). The metrics we use for evaluation include the mean localization error (e) and the maximum localization error at a 95% confidence level (e95). Here, e is the average of the distances between tracked tumor centroids and reference centroids over all testing image frames, and e95 means that, among all testing image frames, only 5% of the frames have tracking errors larger than e95.
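Both metrics are straightforward to compute; a sketch under the assumption that the tracked and reference positions are 1D (S–I) arrays of equal length:

```python
import numpy as np

def localization_errors(tracked, reference):
    """Mean localization error e and e95, the error not exceeded by
    95% of the frames (95th percentile of the per-frame errors)."""
    err = np.abs(np.asarray(tracked, float) - np.asarray(reference, float))
    return err.mean(), np.percentile(err, 95)

# toy check: per-frame errors of 0, 1, 2, 3 and 4 pixels
e, e95 = localization_errors([0, 2, 4, 6, 8], [0, 1, 2, 3, 4])
```

Note that NumPy's default percentile uses linear interpolation between order statistics, which is one reasonable reading of "maximum error at a 95% confidence level"; taking the smallest error exceeded by at most 5% of frames would be an equally valid convention.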
6.3 Results and Discussion on Tumor Tracking
Figure 6.5: The correlation score (in gray scale) as a function of template ID (y-axis) and incoming image frame ID (x-axis).
Figure 6.5 is a plot showing the correlation score as a function of templates and incoming fluoroscopic images for Patient 4. The y-axis is the template ID, the x-axis is the incoming fluoroscopic image ID (time), and the gray scale value signifies the correlation score. The brighter the pixel, the higher the score. Note that the
the correlation score. The brighter the pixel, the higher the score. Note that the
correlation score in general cycles through the different templates as one would
expect. Each incoming image is highly correlated with several templates (high
intensity values along a vertical line in the figure). This is due to the fact that tumor
positions in the neighboring templates are close and we allow small shifts of the
templates during the matching procedure.
Figure 6.6: A comparison of the tracking results with and without voting for Method 2 and Patient 3. The tumor position (y-axis) as a function of time (x-axis). Black solid line: the reference tumor location. Blue dotted line: Method 2 without voting. Red dots: Method 2 with voting.
Figure 6.6 shows an example of the tracking results using Method 2 with
and without voting. The black solid line is the reference tumor location and the blue
dotted line is the original eigenspace tracking result without voting. By averaging
locations defined by those templates with correlation scores higher than 87% of the
maximum score, we get a smoother tumor motion trajectory which is marked by
the red dots. We can see it is less noisy and closer to the black line, especially at
the peaks and valleys of each breathing cycle. For all six patient data sets, we found
that voting can improve the tracking results. The computational cost for using the
voting scheme, which is on the order of 0.01 s per frame, is negligible compared to the search mechanism.
Figures 6.7–6.9 show our experimental results for Method 2 (eigenspace tracking with voting), in comparison with the reference results from an expert observer, for six patients. For all six patients, the motion range of the tumor mass is 17–36 pixels, corresponding to 8.5–18 mm, as shown in Table 6.1. The red dots are our tracking results for the tumor locations and the black solid line shows the reference trajectory. We can see that our tracking results are in good agreement with the reference results, as shown in Table 6.1 and Figures 6.7–6.9. For Method 1, the mean localization error (e) and the maximum localization error at a 95% confidence level (e95), averaged over six patients, are 1.5 and 4.2 pixels, while for Method 2, the corresponding numbers are 1.2 and 3.2 pixels. Note that the pixel size is about 0.5 mm. This means that for both proposed tracking methods, the mean tracking error is less than 1 mm and the maximum tracking error at a 95% confidence level is less than 3 mm. Figure 6.10 shows a bar plot of the experimental results for both Method 1 and Method 2.
As we can see, both methods are quite promising for lung tumor track-
ing without implanted fiducial markers. The voting scheme used in this work
makes the algorithms less sensitive to noise, thus the tracking results are smoother
and more robust. Method 2 (eigenspace tracking) performs slightly better than
Method 1 (motion-enhanced templates and correlation score). This is because, with
eigenspace tracking, we reduce the noise, remove the redundant information and
only keep the most informative dimensions from our image sequences, so that the
template matching is improved. We found the improvement is significant at the end
of inhale or exhale. Figure 6.11 shows an example.
Figure 6.7: Experimental results for a) Patient 1 and b) Patient 2. Black solid line: reference tumor motion trajectory. Red dots: tracking results using Method 2.
Figure 6.8: Experimental results for c) Patient 3 and d) Patient 4. Black solid line: reference tumor motion trajectory. Red dots: tracking results using Method 2.
Figure 6.9: Experimental results for e) Patient 5 and f) Patient 6. Black solid line: reference tumor motion trajectory. Red dots: tracking results using Method 2.
Table 6.1: Experimental results for the proposed multiple-template tracking methods. e is the mean localization error and e95 is the maximum localization error at a 95% confidence level.

                          Patient 1  Patient 2  Patient 3  Patient 4  Patient 5  Patient 6  Average
  Moving range (pixels)      36         23         18         18         25         17       22.5
  Method 1  e (pixels)       2.2        1.3        1.4        1.1        1.7        1.2      1.5
            e95 (pixels)     6          4          4          3          5          3        4.2
  Method 2  e (pixels)       1.6        0.9        1.1        0.9        1.6        1.1      1.2
            e95 (pixels)     4          3          3          2          4          3        3.2
Figure 6.10: Top: the average localization error (blue bars) and the maximum localization error at a 95% confidence level (red bars) for Method 1. Bottom: the same errors for Method 2.
Figure 6.11: A comparison between Method 1 and Method 2 for Patient 3. The tumor position (y-axis) as a function of time (x-axis). Black solid line: the reference tumor location. Blue dotted line: Method 1. Red dots: Method 2.
Another interesting observation is that the tracking error increases roughly linearly as the tumor excursion increases. Using a simple linear extrapolation, we estimated the tracking error to be about 1.5 mm for a 2 cm tumor excursion, and about 1.7 mm for a 3 cm excursion. It might therefore be worthwhile to make the number of reference templates proportional to the tumor motion range, which will be explored in future work.
The remaining tracking errors of the proposed algorithms may come from the following sources. Since our reference results were produced manually, some human error is inevitable. Ideally, we should have multiple expert observers generate multiple reference data sets so that the inter-observer variation can be estimated. However, manually identifying the tumor position in each of 300–400 image frames for every one of six patients is very labor-intensive work. Therefore, instead of using multiple expert observers, the same expert observer marked the tumor position twice for most of the images. We found that the difference between the two marked tumor positions is within 2 pixels on average, which is comparable with our tracking errors. This means that the proposed algorithms seem to have the same order of accuracy as human experts in terms of tumor tracking in fluoroscopic images. This work is limited to a feasibility study; we leave more comprehensive evaluation and validation for future work.
Another error source is the instability of patients' breathing cycles. Templates developed during patient setup may not cover all tumor states and positions during treatment delivery. When the tumor drifts out of the region of movement seen in the training session, there is no template that it can be matched to exactly. Thus, errors occur at the end of inhale/exhale, as shown in Figures 6.7–6.9. In this case, the correlation score will be low and the mean-squared error will be high, which gives us a clue that the tumor is drifting. This weakness of the proposed algorithms will be addressed in our future work.
6.4 Summary for Multiple Template Tracking
We have demonstrated the feasibility of tracking a tumor mass or nearby anatomic
feature in fluoroscopic images. Two multiple-template matching algorithms have been proposed and evaluated, one based on motion-enhanced templates and the correlation score, and the other based on eigenspace tracking. For both methods,
a voting scheme has been used to improve the smoothness and robustness of the
tracking results, resulting in accuracies within 3 mm. These methods can be poten-
tially used for the precise treatment of lung cancer using either respiratory gating
or beam tracking.
Chapter 7
Concluding Remarks
Machine learning and data mining are two emerging fields whose applications will
ultimately touch every aspect of human life. In this thesis, we have successfully
applied the algorithms from these fields to an important domain of medical image-
guided radiotherapy for the treatment and control of lung cancer. In addition,
we advanced the field of machine learning and data mining by introducing a new
paradigm for exploratory data analysis with multi-view orthogonal clustering, and
developing a novel method for feature selection via transformation based methods.
The goal of data mining is to extract information and discover structures
from large databases. A popular technique for mining patterns from data is clus-
tering. However, traditional clustering algorithms only find one clustering solution
even though many applications have data which are multi-faceted by nature. We
have introduced a new paradigm for exploratory data clustering that seeks to ex-
tract all non-redundant clustering views from a given set of data. We introduced a
non-redundant multi-view clustering framework that discovers different meaning-
ful partitions/clusterings by iterative orthogonalization. We have shown both the
orthogonal clustering algorithm and the clustering in orthogonal subspaces algo-
rithm instantiations of this framework worked successfully toward finding multiple
non-redundant clustering views. We also developed a fully automated version of
this framework by combining our algorithm with gap statistics to automatically find
the number of clusters in each iteration, and for determining when to stop looking
for alternative views. Our experiments confirmed that the proposed new clustering
paradigm is applicable to different types of data, such as text and images.
Clustering is closely related to dimensionality reduction, because different
clusterings often lie in different subspaces. If the right subspace can be discovered,
the clustering task will become much easier. Redundant features and irrelevant
features can significantly degrade clustering performance. Feature subset se-
lection as a dimensionality reduction technique serves as a tool to remove unwanted
features, as well as to keep the original meaning of the features for explanatory pur-
poses. We have introduced principal feature selection through principal component
analysis (PFS-PCA) and shown that the method effectively removes redundancy
among the features while at the same time selecting the most informative features.
In the second half of this dissertation, we presented a successful application
of our learning techniques to the medical application of lung tumor image-guided
radiotherapy. There are two ways to perform image-guided radiotherapy: one is
through gating, and the other through tracking. We developed techniques to tackle
both. Current practice utilize external markers to locate the tumors during treat-
ment. However, external markers are not accurate. There are ongoing research in
the use of internal surrogates, but they have the associated risk of pneumothorax.
In this thesis, we avoid markers altogether and study the feasibility of radiotherapy
through fluoroscopy images.
To perform markerless gated radiotherapy, we investigated four different
methods: single template matching, multiple template matching, clustering tem-
plate method, and a support vector machine (SVM) classifier. Our feasibility study on five different patients has shown that the clustering template method and the SVM classifier generate robust, accurate and efficient gating signals from fluoroscopic image sequences without fiducial markers for lung tumor treatment. At a 35% duty cycle, they achieve average delivered target doses of 94.9% and 95.8%, with real duty cycles of 34.7% and 40.8%, respectively. At a 50% duty cycle, they obtain average delivered target doses of 97.6% and 98.4%, with real duty cycles of 49.5% and 53.1%, respectively. This study further shows that the gating problem can be recast as a classification problem. This opens up new directions for improving gating by exploring other classification techniques, such as Bayesian classifiers [112], neural networks [113], and hidden Markov models [114, 115]. All these demonstrate that we have successfully applied machine learning and data mining algorithms to the real-world problem of lung tumor treatment in IGRT.
In addition to research on gated radiotherapy, we also developed methods
for directly tracking the tumor mass without markers. We investigated using our
multiple-template matching method and eigenspace tracking. These two tracking
methods have been carefully evaluated against the physician marked tumor loca-
tions. The tracking error for multiple-template matching method is 1.5 pixel on
average, corresponding to around 0.75mm. For eigenspace tracking, the average
error is 1.2 pixel, corresponding to 0.6mm.
To turn this research into a real product, the next steps will be (1) testing the
algorithms using more and longer patient data, (2) finding a better way to get refer-
ence gating signals for validation, and (3) evaluating the dosimetric consequence of
the current error level to see if there is a need to further lower the error rates. Then, clinical implementation of these methods will be considered.
Bibliography
[1] I. T. Jolliffe. Principal Component Analysis. Springer, second edition, 2002.
[2] K. Fukunaga. Statistical Pattern Recognition (second edition). Academic
Press, San Diego, CA, 1990.
[3] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. Computer Vision and Pattern Recognition (Maui, HI), pages 586–591, 1991.
[4] S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. Journal of the American Society for
Information Science, 41(6):391–407, 1990.
[5] K. Y. Yeung and W. L. Ruzzo. Principal component analysis for clustering
gene expression data. Bioinformatics, 17(9):763–74, 2001.
[6] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
[7] S. B. Jiang. Radiotherapy of mobile tumors. Semin Radiat Oncol, 16(4):239–
48, 2006.
[8] Q. S. Chen, M. S. Weinhous, F. C. Deibel, J. P. Ciezki, and R. M. Macklis. Fluoroscopic study of tumor motion due to breathing: facilitating precise radiation therapy for lung cancer patients. Med Phys, 28(9):1850–6, 2001.
[9] H. Shirato, T. Harada, T. Harabayashi, K. Hida, H. Endo, K. Kita-
mura, R. Onimaru, K. Yamazaki, N. Kurauchi, T. Shimizu, N. Shinohara,
M. Matsushita, H. Dosaka-Akita, and K. Miyasaka. Feasibility of inser-
tion/implantation of 2.0-mm-diameter gold internal fiducial markers for pre-
cise setup and real-time tumor tracking in radiotherapy. Int J Radiat Oncol
Biol Phys, 56(1):240–7., 2003.
[10] T. Harada, H. Shirato, S. Ogura, S. Oizumi, K. Yamazaki, S. Shimizu, R. On-
imaru, K. Miyasaka, M. Nishimura, and H. Dosaka-Akita. Real-time tumor-
tracking radiation therapy for lung carcinoma by the aid of insertion of a gold
marker using bronchofiberscopy. Cancer, 95(8):1720–7., 2002.
[11] Y. Seppenwoolde, H. Shirato, K. Kitamura, S. Shimizu, M. van Herk, J. V. Lebesque, and K. Miyasaka. Precise and real-time measurement of 3D tumor motion in lung due to breathing and heartbeat, measured during radiotherapy. Int J Radiat Oncol Biol Phys, 53(4):822–34, 2002.
[12] D. P. Gierga, J. Brewer, G. C. Sharp, M. Betke, C. G. Willett, and G. T.
Chen. The correlation between internal and external markers for abdominal
tumors: implications for respiratory gating. Int J Radiat Oncol Biol Phys,
61(5):1551–8, 2005.
[13] F. Laurent, V. Latrabe, B. Vergier, and P. Michel. Percutaneous CT-guided biopsy of the lung: comparison between aspiration and automated cutting needles using a coaxial technique. Cardiovasc Intervent Radiol, 23(4):266–72, 2000.
[14] S. Arslan, A. Yilmaz, B. Bayramgurler, O. Uzman, E. Nver, and E. Akkaya. CT-guided transthoracic fine needle aspiration of pulmonary lesions: accuracy and complications in 294 patients. Med Sci Monit, 8(7):CR493–7, 2002.
[15] P. R. Geraghty, S. T. Kee, G. McFarlane, M. K. Razavi, D. Y. Sze, and M. D. Dake. CT-guided transthoracic needle aspiration biopsy of pulmonary nodules: needle size and pneumothorax rate. Radiology, 229(2):475–81, 2003.
[16] R. I. Berbeco, H. Mostafavi, G. C. Sharp, and S. B. Jiang. Towards fluoro-
scopic respiratory gating for lung tumours without radiopaque markers. Phys
Med Biol, 50(19):4481–90, 2005.
[17] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM
Computing Surveys, 31(3):264–323, 1999.
[18] Alexander Strehl and Joydeep Ghosh. Cluster ensembles – a knowledge
reuse framework for combining multiple partitions. Journal on Machine
Learning Research (JMLR), pages 583–617, 2002.
[19] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data
clustering: A cluster ensemble approach. In Proceedings of the International
Conference on Machine Learning, 2003.
[20] S. Bickel and T. Scheffer. Multi-view clustering. In Proceedings of the IEEE
International Conference on Data Mining, 2004.
[21] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Conference on Computational Learning Theory, pages 92–100, 1998.
[22] K. Crammer and Y. Singer. A family of additive online algorithms for cate-
gory ranking. Journal of Machine Learning Research, 3:1025–1058, 2003.
[23] S. Gao, W. Wu, C. H. Lee, and T. S. Chua. An MFoM learning approach to robust multiclass multi-label text categorization. In Proceedings of the 21st International Conference on Machine Learning, page 42, 2004.
[24] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
[25] D Gondek. Non-redundant clustering. PhD thesis, Brown University, 2005.
[26] G. Chechik and N. Tishby. Extracting relevant structures with side informa-
tion. In Advances in Neural Information Processing Systems 15 (NIPS-2002),
2003.
[27] D. Gondek and T. Hofmann. Conditional information bottleneck clustering.
In The 3rd IEEE Intl. Conf. on Data Mining, Workshop on Clustering Large
Data Sets, 2003.
[28] D. Gondek and T. Hofmann. Non-redundant data clustering. In Proc. of the
4th Intl. Conf. on Data Mining, 2004.
[29] David Gondek and Thomas Hofmann. Non-redundant clustering with condi-
tional ensembles. In Proc. of the 11th ACM SIGKDD Intl. Conf. on Knowl-
edge Discovery and Data Mining (KDD’05), pages 70–77, 2005.
[30] D. Gondek, S. Vaithyanathan, and A. Garg. Clustering with model-level
constraints. In Proc. of SIAM International Conference on Data Mining,
2005.
[31] E. Bae and J. Bailey. Coala: A novel approach for the extraction of an
alternate clustering of high quality and high dissimilarity. In Proceedings
of the Sixth International Conference on Data Mining, pages 53–62, Hong
Kong, December 2006.
[32] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. In
Proceedings of the Sixth International Conference on Data Mining, pages
107–118, Hong Kong, December 2006.
[33] P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. In Proceedings of the Seventh SIAM International Conference on Data Mining, 2008.
[34] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive di-
mension reduction for clustering high dimensional data. In Proc. of the 2nd
IEEE Int’l Conf. on Data Mining, pages 147–154, 2002.
[35] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar
Raghavan. Automatic subspace clustering of high dimensional data for data
mining applications. In Proceedings of the 1998 ACM SIGMOD Int’l Conf.
on Management of Data, pages 94–105, 1998.
[36] Lance Parsons, Ehtesham Haque, and Huan Liu. Subspace clustering for
high dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90–105,
2004.
[37] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial
Intelligence, 97(1-2):273–324, 1997.
[38] E. Amaldi and V. Kann. On the approximability of minimizing nonzero
variables and unsatisfied relations in linear systems. Theoret. Comput. Sci.,
209:237–260, 1998.
[39] A. W. Whitney. A direct method of nonparametric measurement selection.
IEEE Transactions Computers, 20:1100–1103, 1971.
[40] T. Marill and D. M. Green. On the effectiveness of receptors in recognition
systems. IEEE Transactions on Information Theory, 9:11–17, 1963.
[41] J. Kittler. Feature set search algorithms. In Pattern Recognition and Signal
Processing, pages 41–60, 1978.
[42] J. Doak. An evaluation of feature selection methods and their application to
computer security. Technical report, University of California at Davis, 1992.
[43] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[44] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and
redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.
[45] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar. Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, 3:1399–1414, 2003.
[46] G.P. McCabe. Principal variables. Technometrics, 26:127–134, 1984.
[47] J. Cadima, O. Cerdeira, and M. Minhoto. Rotation of principal components:
choice of normalization constraints. Journal of Applied Statistics, 22:29–35,
1995.
[48] I. T. Jolliffe and M Uddin. A modified principal component technique based
on the lasso. Journal of Computational and Graphical Statistics, 12:531–
547, 2003.
[49] R Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B, 58:267–288, 1996.
[50] W. J. Krzanowski. Selection of variables to preserve multivariate data struc-
ture, using principal components. Applied Statistics, 36(1):22–33, 1987.
[51] K. Z. Mao. Identifying critical variables of principal components for unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(2):339–344, 2005.
[52] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian. Feature selection using principal feature analysis. In Proceedings of the 15th International Conference on Multimedia, pages 301–304, 2007.
[53] P. J. Keall, V. R. Kini, S. S. Vedam, and R. Mohan. Potential radiotherapy
improvements with respiratory gating. Australas Phys Eng Sci Med, 25(1):1–
6, 2002.
[54] I. Jacobs, J. Vanregemorter, and P. Scalliet. Influence of respiration on calcu-
lation and delivery of the prescribed dose in external radiotherapy. Radiother
Oncol, 39(2):123–8, 1996.
[55] S. B. Jiang, T. Bortfeld, A. Trofimov, E. Rietzel, G. C. Sharp, N. Choi, and G. T. Y. Chen. Synchronized moving aperture radiation therapy (SMART): treatment planning using 4D CT data. In The 14th International Conference
on the Use of Computers in Radiation Therapy, Seoul, Korea, 2004.
[56] M. Engelsman, E. M. Damen, K. De Jaeger, K. M. van Ingen, and B. J.
Mijnheer. The effect of breathing and set-up errors on the cumulative dose
to a lung tumor. Radiother Oncol, 60(1):95–105, 2001.
[57] C.W. Stevens, R.F. Munden, K.M. Forster, J.F. Kelly, Z. Liao, G. Starkschall,
S. Tucker, and R. Komaki. Respiratory-driven lung tumor motion is inde-
pendent of tumor size, tumor location, and pulmonary function. Int J Radiat
Oncol Biol Phys, 51(1):62–8, 2001.
[58] S. C. Davies, A. L. Hill, R. B. Holmes, M. Halliwell, and P. C. Jackson. Ultrasound quantitation of respiratory organ motion in the upper abdomen. British Journal of Radiology, 67:1096–1102, 1994.
[59] P.J. Bryan, S. Custar, J.R. Haaga, and V. Balsara. Respiratory movement of
the pancreas: an ultrasonic study. J Ultrasound Med, 3(7):317–20, 1984.
[60] C. Ozhasoglu and M. J. Murphy. Issues in respiratory motion compen-
sation during external-beam radiotherapy. Int J Radiat Oncol Biol Phys,
52(5):1389–99, 2002.
[61] P.H. Weiss, J.M. Baker, and E.J. Potchen. Assessment of hepatic respiratory
excursion. J Nucl Med, 13(10):758–9, 1972.
[62] G. Harauz and M.J. Bronskill. Comparison of the liver’s respiratory motion
in the supine and upright positions: concise communication. J Nucl Med,
20(7):733–5, 1979.
[63] P. Giraud, Y. De Rycke, B. Dubray, S. Helfre, D. Voican, L. Guo, J. C. Rosen-
wald, K. Keraudy, M. Housset, E. Touboul, and J. M. Cosset. Conformal
radiotherapy (crt) planning for lung cancer: analysis of intrathoracic organ
motion during extreme phases of breathing. Int J Radiat Oncol Biol Phys,
51(4):1081–92, 2001.
[64] E. C. Ford, G. S. Mageras, E. Yorke, K. E. Rosenzweig, R. Wagman, and
C. C. Ling. Evaluation of respiratory movement during gated radiother-
apy using film and electronic portal imaging. Int J Radiat Oncol Biol Phys,
52(2):522–31, 2002.
[65] J. Hanley, M. M. Debois, D. Mah, G. S. Mageras, A. Raben, K. Rosenzweig,
B. Mychalczak, L. H. Schwartz, P. J. Gloeggler, W. Lutz, C. C. Ling, S. A.
Leibel, Z. Fuks, and G. J. Kutcher. Deep inspiration breath-hold technique
for lung tumors: the potential value of target immobilization and reduced
lung density in dose escalation. Int J Radiat Oncol Biol Phys, 45(3):603–11,
1999.
[66] V.M. Remouchamps, N. Letts, D. Yan, F.A. Vicini, M. Moreau, J.A. Zielin-
ski, J. Liang, L.L. Kestin, A.A. Martinez, and J.W. Wong. Three-dimensional
evaluation of intra- and interfraction immobilization of lung and chest wall
using active breathing control: A reproducibility study with breast cancer
patients. Int J Radiat Oncol Biol Phys, 57(4):968–78, 2003.
[67] H. D. Kubo, P. M. Len, S. Minohara, and H. Mostafavi. Breathing-synchronized radiotherapy program at the University of California Davis Cancer Center. Med Phys, 27(2):346–53, 2000.
[68] H. D. Kubo and B. C. Hill. Respiration gated radiotherapy treatment: a
technical study. Phys Med Biol, 41(1):83–91, 1996.
[69] M. J. Murphy. Tracking moving organs in real time. Semin Radiat Oncol,
14(1):91–100, 2004.
[70] M. J. Murphy, S. D. Chang, I. C. Gibbs, Q. T. Le, J. Hai, D. Kim, D. P. Martin, and J. R. Adler, Jr. Patterns of patient movement during frameless
image-guided radiosurgery. Int J Radiat Oncol Biol Phys, 55(5):1400–8,
2003.
[71] S. Webb. Limitations of a simple technique for movement compensation via
movement-modified fluence profiles. Phys. Med. Biol., 50(14):N155–N161,
2005.
[72] S. Minohara, T. Kanai, M. Endo, K. Noda, and M. Kanazawa. Respiratory
gated irradiation system for heavy-ion radiotherapy. Int J Radiat Oncol Biol
Phys, 47(4):1097–103, 2000.
[73] T. Tada, K. Minakuchi, T. Fujioka, M. Sakurai, M. Koda, I. Kawase, T. Naka-
jima, M. Nishioka, T. Tonai, and T. Kozuka. Lung cancer: intermittent irra-
diation synchronized with respiratory motion–results of a pilot study. Radi-
ology, 207(3):779–83.
[74] T. Okumura, H. Tsuji, and Y. Hayakawa. Respiration-gated irradiation system
for proton radiotherapy. In Proceedings of the 11th international conference
on the use of computers in radiation therapy, pages 358–359, Manchester,
1994. North Western Medical Physics Dept. Christie Hospital.
[75] S. Shimizu, H. Shirato, S. Ogura, H. Akita-Dosaka, K. Kitamura, T. Nish-
ioka, K. Kagei, M. Nishimura, and K. Miyasaka. Detection of lung tumor
movement in real-time tumor-tracking radiotherapy. Int J Radiat Oncol Biol
Phys, 51(2):304–10.
[76] H. Shirato, S. Shimizu, T. Kunieda, K. Kitamura, M. van Herk, K. Kagei,
T. Nishioka, S. Hashimoto, K. Fujita, H. Aoyama, K. Tsuchiya, K. Kudo,
and K. Miyasaka. Physical aspects of a real-time tumor-tracking system for
gated radiotherapy. Int J Radiat Oncol Biol Phys, 48(4):1187–95, 2000.
[77] M. Imura, K. Yamazaki, H. Shirato, R. Onimaru, M. Fujino, S. Shimizu,
T. Harada, S. Ogura, H. Dosaka-Akita, K. Miyasaka, and M. Nishimura. In-
sertion and fixation of fiducial markers for setup and tracking of lung tumors
in radiotherapy. Int J Radiat Oncol Biol Phys, 63(5):1442–7, 2005.
[78] E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability
of classifications. Biometrics, 21:768, 1965.
[79] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.
[80] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New-York,
1986.
[81] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley
& Sons, NY, 1973.
[82] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[83] S. D. Bay. The UCI KDD archive, 1999.
[84] CMU. CMU 4 universities WebKB data, 1997.
[85] I. S. Dhillon and D. M. Modha. Concept decompositions for large sparse text
data using clustering. Machine Learning, 42(1):143–175, 2001.
[86] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics,
6(2):461–464, 1978.
[87] H. Akaike. A new look at the statistical model identification. IEEE Transac-
tions on Automatic Control, AC-19(6):716–723, December 1974.
[88] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, pages 727–734. Morgan Kaufmann, 2000.
[89] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A
resampling-based method for class discovery and visualization of gene ex-
pression microarray data. Machine Learning, 52:91–118, 2003.
[90] V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to
cluster validation. In Intl. Conf. on Computational Statistics, pages 123–129,
2002.
[91] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters
in a dataset via the gap statistic. J. R. Statist. Soc., 63(2):411–423, 2001.
[92] G. Golub and C. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, 1996.
[93] H. Zou and T. Hastie. Regression shrinkage and selection via the elastic net, with applications to microarrays. Technical report, Stanford University, 2003.
[94] P. Murphy and D. Aha. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1994.
[95] D. Aha, P. Murphy, and C. Merz. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1997.
[96] J. G. Dy, C. E. Brodley, A. Kak, L. S. Broderick, and A. M. Aisen. Un-
supervised feature selection applied to content-based retrieval of lung im-
ages. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(3):373–378, March 2003.
[97] J. Ye and T. Xiong. Computational and theoretical analysis of null space
and orthogonal linear discriminant analysis. Journal of Machine Learning
Research, 7:1183–1204, 2006.
[98] Chris Ding and Tao Li. Adaptive dimension reduction using discriminant
analysis and k-means clustering. In Proceedings of the 24th international
conference on Machine learning, volume 227, pages 521–528, 2007.
[99] R. I. Berbeco, S. B. Jiang, G. C. Sharp, G. T. Chen, H. Mostafavi, and H. Shirato. Integrated radiotherapy imaging system (IRIS): design considerations of
tumour tracking with linac gantry-mounted diagnostic x-ray systems with
flat-panel detectors. Phys Med Biol, 49(2):243–55, 2004.
[100] X. Tang, G. C. Sharp, and S. B. Jiang. Patient setup based on lung tumor mass
for gated radiotherapy. Med. Phys. (abstract), 33:2244, 2006.
[101] R. Jain, R. Kasturi, and B. G. Schunck. Machine Vision. McGraw-Hill, New York, 1995.
[102] I. T. Jolliffe. Principal Component Analysis. Springer, Berlin, 2002.
[103] L. Breiman. Bagging predictors. Machine Learning, 26(2):123–140, 1996.
[104] G. J. McLachlan and K. E. Basford. Mixture Models, Inference and Applica-
tions to Clustering. Marcel Dekker, New York, 1988.
[105] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[106] V. Vapnik. The nature of statistical learning theory. Springer, New York,
1995.
[107] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, 1995.
[108] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[109] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2003. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[110] T. Neicu, R. Berbeco, J. Wolfgang, and S. B. Jiang. Synchronized moving
aperture radiation therapy (SMART): improvement of breathing pattern reproducibility using respiratory coaching. Phys Med Biol, 51(3):617–36, 2006.
[111] H. Murase and S. Nayar. Visual learning and recognition of 3-d objects from
appearance. Int. J. Comput. Vis, 14:5–24, 1995.
[112] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and
results. In Advances in Knowledge Discovery and Data Mining, pages 153–
180, Cambridge, MA, 1996. AAAI/MIT Press.
[113] J. A. Freeman and D. M. Skapura. Neural networks: Algorithms, Applica-
tions, and Programming Techniques. Addison-Wesley, 1991.
[114] C. D. Mitchell. Improving Hidden Markov Models for Speech Recognition.
PhD thesis, Purdue University, W. Lafayette, Indiana, May 1995.
[115] P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer,
M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Pro-
cessing 9. MIT Press, 1997.