Non-redundant Clustering, Principal Feature Selection and Learning Methods
Applied to Lung Tumor Image-guided Radiotherapy
A Dissertation Presented
by
Ying Cui
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the field of
Electrical and Computer Engineering
Northeastern University
Boston, Massachusetts
(January, 2009)
To those I love ...
To those unforgettable days in Boston ...
Acknowledgments
I would like to take this opportunity to express my great gratitude to my advisor,
Dr. Jennifer G. Dy, for her supervision, support and encouragement throughout my
academic research. I sincerely appreciate her invaluable help in suggesting research
topics, revising my technical reports, papers, thesis and helping me prepare my
presentations. She provided many inspirations and insights into research that will
always remain helpful to me.
I am especially grateful to Dr. Steve B. Jiang for serving on my committee
and for working with me on the lung tumor Image-Guided Radiotherapy (IGRT)
research. He provided me with the precious opportunity to do applied research
in the medical domain. His patient guidance, wealth of knowledge and deep
insight into this area have truly benefited me. I would also like to thank
Dr. David Brady, Dr. Gregory C. Sharp and Dr. Mario Sznaier for their time,
advice, and encouragement during my studies. It has been a great experience
working with all of them.
Besides the guidance from my committee members, I feel very fortunate
to have met my friends at Northeastern University, Xin Huang and Hongyan Liu,
with whom I went through so many unforgettable moments, joyful or frustrating.
There are also many friends to whom I would like to show my deep appreciation,
and with whom I had many fruitful conversations about my research: Xiaojun
Wang, Yujuan Cheng, Yanjun Xiang, Qiuzhao Dong, Yan Yan, Donglin Niu and
Guan Yue. I thank all those who gave me generous help and shared beautiful
memories with me. I cannot imagine how my life would be without you.
This thesis once again finds me indebted to my family, especially my par-
ents, for their love, support and encouragement throughout my life. I never had a
chance to show you how much I appreciate what you have done for me, certainly
not as much as you deserve. Thank you! Last but not least, I cannot express
my gratitude in words to my boyfriend, Wei Wang. His love and support will be
cherished in my heart forever.
YING CUI
Northeastern University
Boston, MA
December 2008
ABSTRACT
Cui, Ying, Ph.D., Northeastern University, December 2008. Non-redundant clus-
tering, principal feature selection and learning methods applied to lung tumor image-
guided radiotherapy. Major advisor: Jennifer G. Dy.
This thesis is divided into two parts. The first part is about non-redundant
clustering and feature selection for high dimensional data. The second part is on
applying learning techniques to lung tumor image-guided radiotherapy.
In the first part, we investigate a new clustering paradigm for exploratory
data analysis: find all non-redundant clustering views of the data, where data points
of one cluster can belong to different clusters in other views. Typical clustering al-
gorithms output a single clustering of the data. However, in real world applications,
data can have different groupings that are reasonable and interesting from different
perspectives. This is especially true for high-dimensional data, where different fea-
ture subspaces may reveal different structures of the data. We present a framework
to solve this problem and suggest two approaches: (1) orthogonal clustering, and
(2) clustering in orthogonal subspaces.
The idea of removing redundancy between clustering solutions was inspired
by our preliminary work on solving the feature selection problem via
transformation methods. In particular, we developed a feature selection
method based on the popular transformation approach, principal component
analysis (PCA). PCA is a dimensionality reduction algorithm that does not
explicitly indicate which variables are important. We designed a method that
utilizes the PCA result to select the original features that are most
correlated with the principal components and, through orthogonalization, are
as uncorrelated with each other as possible. We show that our feature
selection method, as a consequence of orthogonalization, preserves the
special property of PCA that the retained variance can be expressed as the
sum of the orthogonal feature variances that are kept.
In the second part, we design machine learning algorithms to aid lung tu-
mor image-guided radiotherapy (IGRT). Precise target localization in real-time is
particularly important for gated radiotherapy. However, it is difficult to gate or
track the lung tumors due to the uncertainties when using external surrogates and
the risk of pneumothorax when using implanted fiducial markers. We investigate
algorithms for gating and for directly tracking the tumor. For gated radiotherapy,
a previous approach utilized template matching to localize the tumor position. Here,
we investigate two ways to improve the precision of tumor target localization by
applying: (1) an ensemble of templates where the representative templates are se-
lected by Gaussian mixture clustering, and (2) a support vector machine (SVM)
classifier with radial basis kernels. Template matching only considers images inside
the gating window, but images outside the gating window might provide additional
information. We take advantage of both states and re-cast the gating problem into
a classification problem. For the tracking problem, we explore a multiple-template
matching method to capture the varying tumor appearance throughout the different
phases of the breathing cycle.
Contents
Acknowledgments iv
List of Tables xii
List of Figures xiii
Chapter 1 Introduction 1
1.1 Non-redundant Multiview Clustering Through Orthogonalization . . 2
1.2 Non-redundant Principal Feature Selection . . . . . . . . . . . . . . 4
1.3 Robust Gating and Tracking the Lung Tumor Mass Without Mark-
ers for Image-guided Radiotherapy . . . . . . . . . . . . . . . . . . 5
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 2 Review of Related Literature 10
2.1 Review of Related Clustering Algorithms . . . . . . . . . . . . . . 10
2.2 Review of Related Feature Selection Techniques . . . . . . . . . . . 15
2.3 Current Image-Guided Radiotherapy (IGRT) Approaches for Lung
Tumor Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Problems with Respiratory Tumor Motion in Radiotherapy . 18
2.3.2 Review of the Techniques of Gated Radiotherapy . . . . . . 20
Chapter 3 Non-redundant Multi-view Clustering 24
3.1 Multi-View Orthogonal Clustering . . . . . . . . . . . . . . . . . . 24
3.1.1 Orthogonal Clustering . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Clustering in Orthogonal Subspaces . . . . . . . . . . . . . 28
3.1.3 Relationship Between Orthogonal Clustering and Cluster-
ing in Orthogonal Subspaces . . . . . . . . . . . . . . . . . 31
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Experiments on Synthetic Data . . . . . . . . . . . . . . . . 34
3.2.2 Experiments on Real-World Benchmark Data . . . . . . . . 38
3.3 Automatically Finding the Number of Clusters and Stopping Criteria 47
3.3.1 Finding the Number of Clusters by Gap Statistics . . . . . . 47
3.3.2 Stopping Criteria . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.3 Case Studies for Synthetic Data II, Face and Text Data . . . 50
3.4 Conclusions for Multi-View Clustering Methods . . . . . . . . . . . 54
Chapter 4 Orthogonal Principal Feature Selection via Component Analysis 56
4.1 Background and Notations . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.2 Background Review on SVD and Definition of Terms . . . . 57
4.1.3 Dual Space Representation of a Data Matrix and Statistical
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 Feature Selection via PCA . . . . . . . . . . . . . . . . . . . . . . 59
4.2.1 PCA Orthogonal Feature Selection . . . . . . . . . . . . . . 60
4.2.2 Orthogonal Feature Search . . . . . . . . . . . . . . . . . . 61
4.2.3 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4 Illustrative Example . . . . . . . . . . . . . . . . . . . . . 68
4.3 Sparse Principal Component Analysis (SPCA) and PFS . . . . . . . 70
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . 74
4.4.4 Time Complexity Analysis . . . . . . . . . . . . . . . . . . 76
4.5 Extension to Linear Discriminant Analysis (LDA) . . . . . . . . . . 79
4.6 Conclusion for Principal Feature Selection . . . . . . . . . . . . . . 80
Chapter 5 Robust Fluoroscopic Respiratory Gating for Lung Cancer Radiotherapy without Implanted Fiducial Markers 83
5.1 Data Acquisition and Pre-Processing . . . . . . . . . . . . . . . . . 84
5.1.1 Image Acquisition . . . . . . . . . . . . . . . . . . . . . . 84
5.1.2 Building Training Data . . . . . . . . . . . . . . . . . . . . 84
5.1.3 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2 Clustering Ensemble Template Matching and Gaussian Mixture Clus-
tering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Ensemble/Multiple Template Method . . . . . . . . . . . . 89
5.2.2 Finding Representative Templates by Clustering . . . . . . 90
5.2.3 Generating the Gating Signal . . . . . . . . . . . . . . . . . 92
5.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.1 Experiments by Clustering Ensemble Template Method . . . 99
5.4.2 Experiments by Support Vector Machine . . . . . . . . . . 99
5.4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . 100
5.5 Conclusion for Robust Markerless Gated Radiotherapy . . . . . . . 103
Chapter 6 Multiple Template-based Fluoroscopic Tracking of Lung Tumor Mass without Implanted Fiducial Markers 104
6.1 Basic Ideas of Multiple Template Tracking . . . . . . . . . . . . . . 105
6.1.1 Building Multiple Templates . . . . . . . . . . . . . . . . . 107
6.1.2 Search Mechanism . . . . . . . . . . . . . . . . . . . . . . 108
6.1.3 Template Matching . . . . . . . . . . . . . . . . . . . . . . 109
6.1.4 Voting Mechanism . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Experiments Setup for Direct Tumor Tracking . . . . . . . . . . . . 114
6.3 Results and Discussion on Tumor Tracking . . . . . . . . . . . . . 115
6.4 Summary for Multiple Template Tracking . . . . . . . . . . . . . . 123
Chapter 7 Concluding Remarks 124
Bibliography 127
List of Tables
3.1 Confusion Matrix for Synthetic Data1 . . . . . . . . . . . . . . . . 35
3.2 Confusion Matrix for Synthetic Data2 . . . . . . . . . . . . . . . . 38
3.3 Confusion Matrix for the Digits Data . . . . . . . . . . . . . . . . . 41
3.4 Confusion Matrix for the Mini-Newsgroups Data . . . . . . . . . . 45
3.5 Confusion Matrix for WebKB Data . . . . . . . . . . . . . . . . . . 46
3.6 Confusion Matrix for WebKB Data based on Gap Statistics . . . . . 54
4.1 PC Loadings Applied to Glass Data Using SPCA . . . . . . . . . . 71
4.2 Example: SPCA Confusion for Feature Selection . . . . . . . . . . 72
4.3 Computational Complexity Analysis . . . . . . . . . . . . . . . . . 76
4.4 Computational Time in Seconds . . . . . . . . . . . . . . . . . . . 78
6.1 Experimental results for the proposed multiple tracking methods. e
is the mean localization error and e95 is the maximum localization
error at a 95% confidence level. . . . . . . . . . . . . . . . . . . . . 121
List of Figures
1.1 This is a scatter plot of the data in (a) subspace {F1, F2} and (b)
subspace {F3, F4}. Note that the two subspaces lead to different
clustering structures. . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 An example of an on-board imaging system for respiratory gated
radiotherapy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1 The general framework for generating multiple orthogonal cluster-
ing views. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Scatter plots of synthetic data 1. The two columns show the results
of methods 1 and 2 respectively. The colors represent different class
labels and the ellipses represent the clusters found. Row 1 and 2
show the results for iteration 1 and 2 respectively; Row 3 shows
SSE as a function of iteration. . . . . . . . . . . . . . . . . . . . . 36
3.3 These are scatter plots of synthetic data 2 and the clusters found
by methods 1 (a1, a2) and 2 (b1, b2). The color of the data points
reflect different class labels and the ellipses represent the clusters
found. a1, b1 are the results for iteration 1; a2, b2 are the results for
iteration 2; a3 and b3 are SSE as a function of iteration for methods
1 and 2 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 The average digit for images within each cluster found by method
2 in iterations/views 1, 2 and 3. These clustering views correspond
to different digits. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 The average face image for each cluster in iteration 1. This cluster-
ing view corresponds to different persons. . . . . . . . . . . . . . . 43
3.6 The average face image for each cluster in iteration 2. This cluster-
ing view corresponds to different poses. . . . . . . . . . . . . . . . 44
3.7 ΔGap and SSE results in each iteration for the Synthetic II data set. 51
3.8 Different partitionings for face data in different iterations . . . . . . 52
3.9 The drop in singular values: si − si+1. Left: the gap of consecutive
singular values in iteration 1. Right: the gap of consecutive singular
values in iteration 2. . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Three uncorrelated data points in 2D space: (a) in the data space
view, and (b) in the feature space view. Three correlated data points
in 2D space: (c) in the data space view, and (d) in the feature space
view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 SSE between selected features and all the features. . . . . . . . . . 61
4.3 The general framework for our feature selection process. . . . . . . 62
4.4 A simple illustrative example of PFS. . . . . . . . . . . . . . . . . 69
4.5 SSE and retained variance for HRCT data on top. SSE for Chart,
face, 20 mini-newsgroup and gene data respectively. Each figure
plots the eight SSE curves for the eight methods: blue line by
simple threshold, green line by PFA, red line with 'o' by SPCA,
light blue line by Jolliffe i, purple line with '·' by Jolliffe ni,
yellow line with '+' by SFS, grey line with 'x' by LSE-fw and
black line by our PFS. . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Block diagram for showing the process of the proposed clinical pro-
cedure for generating the gating signal. . . . . . . . . . . . . . . . . 84
5.2 The Integrated Radiotherapy Imaging System (IRIS), used as the
hardware platform for the proposed gating technology in this chapter. 85
5.3 Tumor contour and the region of interest (ROI). Left: original fluo-
roscopic image. Right: motion-enhanced image. . . . . . . . . . . . 86
5.4 The top left figure is the breathing waveform represented by the
tumor location. To the left of the vertical dotted line is the training
period. To the right of the vertical line is the treatment or testing
period. Under the horizontal dotted line (threshold corresponding
to a given duty cycle) is the end-of-exhale phase. Bottom figures
show different end-of-exhale images during the training session,
which are averaged to generate a single template. . . . . . . . . . 87
5.5 Ensemble/multiple template method. Here, each image is an end-
of-exhale template. We match the incoming image with each tem-
plate and get a set of correlation scores s1, s2, · · · , sK . Then we
apply a weighted average of these scores to generate the final cor-
relation score s for gating. . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Scatter plot of our image data for patient 4 and 35% duty cycle
in 2D with the clustering result. The “o” and “x” each represent
different clusters, with the means represented by the “$” symbol in
bold and the covariances in ellipses. . . . . . . . . . . . . . . . . 92
5.7 Results from different methods for an example patient. (a) sin-
gle template method; (b) ensemble/multiple templates method with
Gaussian mixture clustering. For each figure, the top curve is the
correlation score and the bottom plot is the gating signal generated
by the correlation score. Here we use 35% duty cycle. . . . . . . . . 93
5.8 Re-casting the gating problem as a classification problem (a) and (b).
(c) presents the decision boundary created by single template
matching and (d) displays the decision boundary of an SVM classifier. . 95
5.9 Experiment results in TD and DC for 35% proposed duty cycle.
Blue bars: metric by SVM method. Red bars: metric by clustering
ensemble template matching method. . . . . . . . . . . . . . . . . . 100
5.10 Experiment results in TD and DC for 50% proposed duty cycle.
Blue bars: metric by SVM method. Red bars: metric by clustering
ensemble template matching method. . . . . . . . . . . . . . . . . . 101
5.11 Example of estimated gating signals on patient 4 for proposed 35%
duty cycle. Top: the predicted gating signal by SVM classifier. Bot-
tom: the gating signal generated by clustering ensemble template
matching method. . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.1 Outline of the proposed multiple template tracking procedure. . . . 106
6.2 A fluoroscopic image with a region of interest (ROI) (blue rectan-
gle) and a tumor contour (red curve). . . . . . . . . . . . . . . . . . 107
6.3 Twelve motion-enhanced tumor templates built by averaging the
ROI images (as shown in figure 2) falling in the same time bin.
The intensity waveform is divided into twelve equal time bins,
and one template was built for each bin. . . . . . . . . . . . . . . 108
6.4 Tumor contour and region of interest (ROI). Left: Original fluoro-
scopic image. Right: Motion-enhanced image. . . . . . . . . . . . 110
6.5 The correlation score (in gray scale) as functions of template ID
(y-axis) and the incoming image frame ID (x-axis). . . . . . . . . . 115
6.6 A comparison of the tracking results with and without voting for
Method 2 and patient 3. The tumor position (y-axis) as a function
of time (x-axis). Black solid line: the reference tumor location.
Blue dotted line: Method 2 without voting. Red dots: Method 2
with voting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.7 Experiment results for a) patient 1 and b) patient 2. Black solid
line: Reference tumor motion trajectory. Red dots: tracking results
using Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.8 Experiment results for c) patient 3 and d) patient 4. Black solid
line: Reference tumor motion trajectory. Red dots: tracking results
using Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.9 Experiment results for e) patient 5 and f) patient 6. Black solid line:
Reference tumor motion trajectory. Red dots: tracking results using
Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.10 Top: the average localization error (blue bar) and max localization
error at 95% confidence level (red bar) for Method 1. Bottom: same
errors for Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.11 A comparison between Method 1 and Method 2, for patient 3. The
tumor position (y-axis) as a function of time (x-axis). Black solid
line: the reference tumor location. Blue dotted line: Method 1. Red
dots: Method 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Chapter 1
Introduction
My dissertation has two components. The first involves basic research in ma-
chine learning and data mining. In particular, I study methods for non-redundant
multi-view clustering for exploratory data analysis and methods for feature selec-
tion through feature transformation. The second involves designing and applying
machine learning to improve the robustness of image guided radiotherapy without
markers using fluoroscopic image sequences.
In this chapter, I begin by defining clustering and then motivate the need
for non-redundant multi-view clustering in Section 1.1. In the next section, Section
1.2, I define the feature selection problem and explain the benefit of utilizing feature
transformation for feature selection. Then, in Section 1.3, I describe image-guided
radiotherapy, explain gating and tracking, and motivate the importance of marker-
less gating and tracking. Finally, I provide a guide to this dissertation in
Section 1.4.
1.1 Non-redundant Multiview Clustering Through Or-
thogonalization
Many applications are characterized by data in high dimensions. Examples include
text, image, and gene data. Automatically extracting interesting structure in such
data has been heavily studied in a number of different areas including data mining,
machine learning and statistical data analysis. One approach for extracting informa-
tion from unlabeled data is through clustering. Given a data set, typical clustering
algorithms group similar objects together based on some fixed notion of similarity
(distance) and output a single clustering solution. However, in real-world
applications, data can often be interpreted in many different ways, and there
may exist multiple groupings of the data that are all reasonable from some
perspective.
Figure 1.1: This is a scatter plot of the data in (a) subspace {F1, F2} and (b) subspace {F3, F4}. Note that the two subspaces lead to different clustering structures.
This problem is often more prominent for high dimensional data, where each
object is described by a large number of features. In such cases, different feature
subspaces can often warrant different ways to partition the data, each presenting
the user a different view of the data’s structure. Figure 1.1 illustrates one such sce-
nario. In particular, Figure 1.1a shows a scatter plot of the data in feature subspace
{F1, F2}. Figure 1.1b shows how the data looks in feature subspace {F3, F4}.
Note that each subspace leads to a different clustering structure. When faced with
such a situation, which features should we select (i.e., which clustering solution is
better)? Why do we have to choose? Why not keep both solutions? In fact both
clustering solutions could be important, and provide different interpretations of the
same data. For example, for the same medical data, what is interesting to physicians
might be different from what is interesting to insurance companies.
The goal of exploratory data analysis is to find structures in data, which
may be multi-faceted by nature. Traditional clustering methods seek a single
unified clustering solution and are thus inherently limited in achieving this goal. In this
research, we suggest a new exploratory clustering paradigm: the goal is to find a
set of non-redundant clustering views from data, where data points belonging to the
same cluster in one view can belong to different clusters in another view.
Toward this goal, we propose a framework that extracts multiple clustering
views of high-dimensional data that are orthogonal to each other. Note that there
are k^N possible ways to assign N data points to k disjoint clusters. Not all
of them are meaningful. We wish to find good clustering solutions based on a clustering objective
function. Meanwhile, we would like to minimize the redundancy among the ob-
tained solutions. Thus, we include an orthogonality constraint in our search for
new clustering views to avoid providing the user with redundant clustering results.
The proposed framework works iteratively, at each step adding one clustering view
by searching for solutions in a space that is orthogonal to the space of the exist-
ing solutions. Within this framework, we develop two general approaches. The
first approach seeks orthogonality in the cluster space, while the second one seeks
orthogonality in the feature subspace. We present all the multiple view clustering
solutions to the user.
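The iterative framework behind the first approach (orthogonal clustering) can be sketched in a few lines of numpy. This is a simplified reading of the idea, not the exact algorithm developed in Chapter 3, and all function names are illustrative: after each clustering view, every point is replaced by its residual from its cluster centroid, so the next view is sought in the part of the data the current view leaves unexplained.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means with random initialization."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute centroids, keeping the old one if a cluster empties
        C = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                      for j in range(k)])
    return labels, C

def orthogonal_clustering_views(X, k, n_views=2):
    """Sketch of orthogonal clustering: cluster, subtract each point's
    centroid, and cluster the residuals for the next view."""
    views, R = [], X.copy()
    for _ in range(n_views):
        labels, C = kmeans(R, k)
        views.append(labels)
        R = R - C[labels]   # remove the structure captured by this view
    return views
```

Here k-means stands in for any clustering objective; the residual step is what discourages redundancy between successive views.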
1.2 Non-redundant Principal Feature Selection
Feature selection is a dimensionality reduction technique which selects a subset of
features from the original set. It is very useful because in some applications it is
desirable not only to reduce the dimension of the space, but also to reduce the num-
ber of variables that are to be considered or measured in the future. However, it
is an NP-hard combinatorial optimization problem. As such, practical approaches
involve greedy searches that guarantee only local optima. Besides feature selection, an-
other well studied topic in dimensionality reduction is feature transformation. Fea-
ture transformation is a process through which a new set of features is created. It
can be expressed as an optimization problem over a continuous feature space solu-
tion and classical feature transformation approaches (such as, principal component
analysis (PCA) [1] and linear discriminant analysis (LDA) [2]) provide global so-
lutions. Here, we propose a non-standard approach to feature selection by utilizing
feature transformations to perform feature search. In a sense, feature transforma-
tion performs a search that takes a global view and considers the interactions among
all the features.
PCA is a widely used transformation approach. It has been successfully ap-
plied to many real world applications, including face recognition [3], latent seman-
tic indexing for text retrieval [4], and gene sequence recognition [5]. An important
property of PCA is that the transformation vectors are orthogonal to each other.
Orthogonality is desired, because it assures that the transformed features are not
correlated with each other, and in some sense non-redundant. In fact, the success of
PCA can be attributed to two important optimality properties: (1) the principal
components sequentially capture the maximum variability in the data, thereby
minimizing information loss, and (2) the principal components are uncorrelated
[6]. The problem with transformation methods, such as PCA, is that they do not
explicitly indicate which features are important.
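Both optimality properties follow directly from the singular value decomposition of the centered data matrix. The following is a generic numpy sketch of standard PCA (not code from this thesis): it returns the top-m orthonormal component directions and the fraction of total variance they retain.

```python
import numpy as np

def pca(X, m):
    """PCA of an n-by-d data matrix via SVD.  Returns the top-m
    orthonormal component directions (as rows) and the fraction of
    the total variance they retain."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    retained = (s[:m] ** 2).sum() / (s ** 2).sum()
    return Vt[:m], retained
```

The rows of the returned matrix are mutually orthogonal, and the squared singular values give the variance captured by each component.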
In this thesis, we present a novel approach to feature selection that sequen-
tially selects original features based on the transformation-based method, PCA, to
optimize an objective criterion, while trying to keep the selected features as non-
redundant (uncorrelated) as possible through orthogonalization. We call this ap-
proach principal feature selection (PFS). In developing PFS, we present a new
objective function for PCA feature selection that takes feature redundancy
into account, analogous to an orthogonality constraint in feature transformation
approaches. We show that PFS, as a consequence of orthogonalization,
preserves the special property in PCA that the retained variance can be expressed
as the sum of orthogonal feature variances that are kept. This property is important
as it helps decide how many features to keep in terms of the proportion of variance
retained.
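The greedy select-then-orthogonalize loop behind PFS can be sketched as follows. This is a loose illustration of the idea only; Chapter 4 gives the exact objective criterion. The tie to PCA is that each pick follows the leading principal direction of the current, deflated data.

```python
import numpy as np

def principal_feature_selection(X, m):
    """Greedy sketch of PFS: repeatedly pick the original feature
    (column) most aligned with the leading principal direction of the
    deflated data, then orthogonalize every remaining column against
    the chosen one so later picks are non-redundant."""
    A = X - X.mean(axis=0)              # centered copy; columns = features
    remaining = list(range(A.shape[1]))
    selected = []
    for _ in range(m):
        # leading right-singular vector of the deflated data
        _, _, Vt = np.linalg.svd(A[:, remaining], full_matrices=False)
        j = remaining[int(np.argmax(np.abs(Vt[0])))]
        selected.append(j)
        q = A[:, j] / (np.linalg.norm(A[:, j]) + 1e-12)
        A = A - np.outer(q, q @ A)      # remove q's span from all columns
        remaining.remove(j)
    return selected
```

Because each selected feature's span is subtracted from the rest, a near-duplicate of an already-selected feature carries almost no residual variance and will not be picked again.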
1.3 Robust Gating and Tracking the Lung Tumor Mass
Without Markers for Image-guided Radiother-
apy
Image Guided Radiation Therapy (IGRT) combines scanning and radiation equipment
to provide images of the patient's organs in the treatment position at the time
of treatment, optimizing the accuracy and precision of the radiotherapy. Treatment
errors related to respiratory organ motion may greatly degrade the effectiveness
of conformal radiotherapy for the management of thoracic and abdominal lesions.
This has become a pressing issue in image-guided radiation therapy (IGRT). For
patients with significant inter- and intra-fractional tumor motion, large treatment
margins are needed to provide full target coverage. Large margins limit the dose
that can be prescribed for tumor control and can cause complications from
over-irradiation of normal tissue. Motion management techniques, such as
respiratory gating or dynamic multi-leaf collimator (DMLC) beam tracking, hold
promise to reduce the incidence and severity of normal tissue complications
and to increase local control through dose escalation for mobile tumors in
the thorax and abdomen. For those techniques, precise target localization in
real time is particularly important due to the reduced clinical tumor volume
(CTV) to planning target volume (PTV) margin and/or the escalated dose.

Figure 1.2: An example of an on-board imaging system for respiratory gated radiotherapy.
In this research, we investigate two approaches for lung tumor IGRT using
fluoroscopic images. One is to generate robust real-time gating signals for respi-
ratory gated radiotherapy. Another is to perform position estimation of the tumor
mass. By these two methods, we try to precisely deliver a lethal dose to the tumor,
while minimizing the incidence and severity of normal tissue complications, for
mobile tumors in the thorax and the abdomen [7].
Respiratory gating is a method of synchronizing radiation with respiration,
during the imaging and treatment processes. In computer-driven respiratory-gated
radiotherapy, a small plastic box with reflective markers is placed on the patient’s
abdomen. The reflective markers move during breathing, and a digital camera
hooked up to a central processing unit monitors these movements in real time. A
computer program analyzes the movements and triggers the scanner (simulation of
treatment), or the treatment beam, always at the same moment of the respiratory
cycle. With this technique, it is also possible to choose the respiratory phase: de-
pending on its location, the tumor will be treated during inspiration or expiration
so as to avoid exposure of critical organs. Figure 1.2 shows an imaging system
mounted with two orthogonal x-ray tubes and fast amorphous silicon flat panels on
the gantry of a medical linear accelerator (linac), which is used in respiratory gated
radiotherapy.
In an idealized gated treatment, the tumor position would be directly detected,
and radiation would be delivered only when the tumor is at the right position.
However, direct detection of the tumor mass in real-time during the treatment is
often difficult. Various surrogates, both external and internal, are used to identify
the tumor position. Depending on the surrogate used, respiratory gating can be
categorized into external (optical) gating and internal (fluoroscopic) gating. During
gated treatment, the internal or external surrogate signal is continuously compared
against a pre-specified range of values, called the gating window. When the surro-
gate signal is within the gating window, a gating signal is sent to the linac to turn
on the radiation beam [7].
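The gating-window comparison described above amounts to thresholding the surrogate signal. A minimal sketch, with illustrative names and a toy window:

```python
import numpy as np

def gating_signal(surrogate, low, high):
    """Beam-on (True) whenever the surrogate signal lies inside the
    pre-specified gating window [low, high]; beam-off otherwise."""
    s = np.asarray(surrogate, dtype=float)
    return (s >= low) & (s <= high)

def duty_cycle(gate):
    """Fraction of time the beam is on."""
    return float(np.mean(gate))
```

In practice the window is chosen to trade off the duty cycle against the residual tumor motion inside the window.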
External gating techniques rely on the correlation between tumor location
and the external surrogates, such as markers placed on the patients’ abdomen [8, 9].
The major weakness in external gating is the uncertainty in the correlation between
the external marker position and internal target position [7]. Current internal gating
uses internal tumor motion surrogates such as implanted fiducial markers, as es-
tablished by the Hokkaido group [10, 11, 12]. It has been shown that internal
surrogates can generate accurate gating signals. However, due to the risk of pneu-
mothorax, the implantation of radiopaque markers in patients’ lungs is unlikely
to become a widely accepted clinical procedure [13, 14, 15]. Therefore, it is crucial
to be able to perform accurate gated treatment or direct tracking of the lung tumor
mass without implanted markers.
For gated treatment, [16] has shown the feasibility of a template matching
method to generate gating signals for lung radiotherapy without implanted markers.
The basic idea is (1) to generate a reference template which corresponds to
the treatment position of the target in the gating window using fluoroscopic im-
ages acquired during patient setup, (2) to calculate the correlation scores between
the reference template and the incoming fluoroscopic images acquired during treat-
ment delivery, and (3) to convert the correlation score into gating signals. Here,
in this research, we explore ways to improve the accuracy and robustness of tem-
plate matching for gating. From our experiments, a single template is not enough
to generate robust gating signals. Thus, we look at all the images corresponding
to the treatment position of the target in the gating window, use each of them
as a template, and combine the correlation scores. However, this multiple-template
method, although accurate, is very time consuming. Therefore, we develop a method
that lies between the single- and multiple-template methods: we group the templates
into clusters and use the cluster means as the representative templates. This
leads to our template clustering ensemble method. Furthermore, template matching
only considers images inside the gating window, but images outside the gating window
might provide additional information for improving the precision in localizing
the tumor. We assign images inside the gating window to the “ON” class and those
outside to the “OFF” class, re-casting the gating problem as a classification problem.
Then, as another approach, we apply a support vector machine (SVM) classifier to
gated radiotherapy.
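The three template-matching steps above can be sketched as follows (a minimal illustration: Pearson's correlation is used as the score, but the threshold value and image sizes are arbitrary assumptions, not clinical settings):

```python
import numpy as np

def correlation_score(template, frame):
    """Pearson correlation between a reference template and an incoming frame."""
    t = template.ravel() - template.mean()
    f = frame.ravel() - frame.mean()
    return float(t @ f / (np.linalg.norm(t) * np.linalg.norm(f)))

def gating_signal(templates, frame, threshold=0.9):
    """Beam ON if the best score over all reference templates clears the threshold."""
    score = max(correlation_score(t, frame) for t in templates)
    return score >= threshold

rng = np.random.default_rng(0)
template = rng.random((16, 16))
print(gating_signal([template], template))      # identical frame: score 1.0 -> True
print(gating_signal([template], 1 - template))  # inverted frame: score -1.0 -> False
```

Passing several templates to `gating_signal` is the multiple-template variant; passing the cluster means instead gives the clustering-ensemble variant at a fraction of the cost.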
On the other hand, we also investigate a direct beam-tracking method to
track the tumor location throughout the whole breathing cycle. The basic idea is as
follows: (i) during the patient setup session, a pair of orthogonal fluoroscopic image
sequences are taken and processed off-line to generate a set of reference templates
that correspond to different breathing phases and tumor positions; (ii) during treat-
ment delivery, fluoroscopic images are continuously acquired and processed; (iii)
the similarity between each reference template and the processed incoming image is
calculated; (iv) the tumor position in the incoming image is then estimated by combining
the tumor centroid coordinates in the reference templates with proper weights
based on the measured similarities. With different image representation and sim-
ilarity calculations, two such multiple-template tracking techniques have been de-
veloped: one based on motion-enhanced templates and Pearson’s correlation score
while the other based on eigen templates and mean-squared error.
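Step (iv) can be sketched as a similarity-weighted average of the template centroids (the softmax weighting here is an illustrative assumption, not the exact weighting scheme of either technique):

```python
import numpy as np

def estimate_centroid(scores, template_centroids):
    """Estimate the tumor centroid as a weighted average of the centroids of the
    reference templates, weighted by each template's similarity to the frame."""
    w = np.exp(np.asarray(scores, dtype=float))   # softmax-style positive weights
    w /= w.sum()
    return w @ np.asarray(template_centroids, dtype=float)

# Three templates with known centroids; the middle one matches the frame best.
est = estimate_centroid([0.2, 0.9, 0.4], [[10.0, 5.0], [12.0, 6.0], [14.0, 7.0]])
print(est)  # a point between the template centroids, pulled toward (12, 6)
```

Because the weights are positive and sum to one, the estimate always lies within the convex hull of the reference centroids, which keeps the prediction stable even when several templates score similarly.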
In both gating and tracking methods for radiotherapy, we perform a vali-
dation study by comparing the gating signals and the tumor locations generated
with the proposed techniques against those determined manually by clinicians us-
ing multiple patient data sets. For the gating problem, our case study on these pa-
tients shows that both the clustering ensemble template matching method and the SVM are
reasonable tools for image-guided markerless gated radiotherapy. For the tracking
problem, the tumor centroid coordinates automatically detected using both meth-
ods agree well with the manually marked reference locations, with the eigenspace
tracking method performing slightly better than the motion-enhanced method.
1.4 Overview
The remainder of this dissertation is organized as follows. Chapter 2 provides
a review of related literature. In Chapter 3, we describe in detail the non-redundant
multi-view clustering paradigm. Then, Chapter 4 presents principal feature selec-
tion. In Chapters 5 and 6, we illustrate the robust gating and direct tracking methods
for lung tumor treatment respectively. Finally, we provide concluding remarks in
Chapter 7.
Chapter 2
Review of Related Literature
In this chapter, we first review the various related clustering algorithms in Section
2.1. Specifically, we point out the differences between our clustering paradigm and
the traditional clustering scheme. Next, we review the related feature subset selection
methods in Section 2.2. In particular, we review feature subset selection
methods which take advantage of feature transformation for feature selection. Finally,
in Section 2.3, we provide an overview of the various techniques that have
been developed to achieve precise target localization in real time for image-guided
radiotherapy of mobile tumors in the thorax and abdomen.
2.1 Review of Related Clustering Algorithms
In this section, we review the literature related to our non-redundant multi-view
clustering problem in different aspects.
Hierarchical clustering [17] presents a hierarchical grouping of the objects.
It can be subdivided into two categories: the agglomerative methods, which suc-
cessively merge small clusters into larger ones until a stopping criterion is met; and
divisive methods, which start with all the objects in one cluster and successively split
them into finer groupings until a stopping criterion is satisfied. However, although
a certain object can have a different label at different stages, hierarchical clustering
is quite different from our multi-view clustering problem. For hierarchical clustering, the dif-
ferent clustering solutions obtained at different hierarchical levels differ only in
their granularity – objects belonging to the same cluster in fine resolutions remain
in the same cluster at the coarser levels. For our multi-view clustering, objects in
the same cluster can be in different clusters in different views.
On the other hand, a different but related problem is the cluster ensemble
problem [18, 19]. The key idea of an ensemble approach is to improve the clustering
performance by combining several clustering results. While an ensemble method
creates a set of cluster solutions for a data set, the final objective is to generate a
single consolidated clustering. In contrast, the objective of our multi-view
clustering method is to provide users with different meaningful clustering solutions.
The term “multi-view” is also utilized in semi-supervised learning [20, 21].
There, the feature space is broken into two independent subsets to generate two
hypotheses, and the two independent hypotheses bootstrap each other by providing
labels for the unlabeled data in a semi-supervised learning setting. The
authors provided partitioning and agglomerative hierarchical multi-view clustering
algorithms for text data in [20]. The objective of their method is to maximize the
agreement between the two independent hypotheses. In contrast, our multi-view
method tries to find meaningful partitions that disagree as much as possible with
previous solutions so as to reveal distinct structures of the data.
In addition, the problem we are trying to solve is different from the multi-
labeling problem [22, 23, 24]. First of all, our non-redundant multi-view clustering
is performed in a totally unsupervised manner, whereas the multi-labeling
problem is mainly one of classification, i.e., supervised learning. In the multi-labeling
problem, a given instance can be assigned to more than one class. For
example, in text document retrieval [22], a document can be tagged with a set of
labels, where the classes can overlap semantically. [22] solved the multi-
labeling problem by ranking the relevance of a topic based on a top-ranking func-
tion. Then a certain instance is marked with all the topic labels above a threshold.
Another multi-label method is the MFoM [23] learning approach, which is a dis-
criminative multi-label multiclass classifier that maximizes the pair-wise discrimi-
nation power among all competing classes in labeling the topics of text documents.
In [24], Boutell et al. applied a multi-label classifier to scene classification, and gave an
extensive comparative study of possible approaches to training and testing multi-label
classifiers. They used cross-training as the training strategy and defined base-class
and α-evaluation metrics for testing. Although our multi-view non-
redundant clustering can also assign samples to multiple labels, our objective is dif-
ferent. Instead of looking for all the high-relevance labels describing an instance,
we look at the data set as a whole, and provide multiple partitionings for the whole
data set. In each of the different partitionings, the class labels for each instance
are mutually exclusive. Furthermore, the partitioning solutions in different views
should be non-redundant.
The idea of non-redundant clustering was introduced in [25, 26]. In non-
redundant clustering, we are typically given a set of data objects together with an
existing clustering solution and the goal is to learn an alternative clustering that cap-
tures new information about the structure of the data. Existing non-redundant clustering
techniques include the conditional information bottleneck approach [26, 27,
28], the conditional ensemble based approach [29] and the constrained model based
approach [30]. In [26], they suggest that by minimizing the information about irrelevant
structures, one can better identify the relevant structures. They provided
a new formulation called information bottleneck with side information (IBSI) to
remove the irrelevant structures. IBSI finds a stochastic map of the original data to
a new variable space which maximizes the mutual information between the labels
associated with the new variables and the desired relevant structure from side information,
and minimizes the mutual information with the irrelevant ones. Gondek and
Hofmann [27, 28, 29] used conditional information bottleneck to remove the effect
of the undesired a priori solutions to try to discover new interesting structures. However,
these existing methods are limited to finding only one alternative partitioning, and
the default a priori solution is given beforehand. Another approach for seeking an
alternative view is COALA [31]. It finds a clustering different from an already
known clustering by creating a cannot-link constraint for each pair of objects that are
in the same cluster in the previously given partitioning. It then proceeds in an
agglomerative fashion and merges single objects into clusters based on a dissimilarity
measure to generate another view. Compared to these existing non-redundant
clustering techniques, the critical differences of our proposed research are:
(1) Focusing on searching for orthogonal clustering of high-dimensional data,
our research combines dimensionality reduction and orthogonal clustering
into a unifying framework and seeks lower dimensional representations of
the data that reveal non-redundant information about the data.
(2) Existing non-redundant clustering techniques are limited to finding one alter-
native structure given a known structure. Our framework works successively
to reveal a sequence of different clusterings of the data.
Another related work is meta clustering. Meta clustering [32], inspired by
ensemble methods, first generates a diverse set of candidate clusterings by either
random initialization or random feature weighting. It then applies agglomerative
clustering at the meta level to merge the clustering solutions, using the Rand index
to measure similarity. Contrary to our framework, they create multiple clustering
solutions randomly. Our framework, on the other hand, generates multiple views
by orthogonalization so as to directly seek out non-redundant solutions.
Recently, [33] solved the problem of disparate clusterings by optimizing an
objective function that penalizes the correlation between different clusterings
and minimizes the sum-squared distances within each clustering. Similar to our
method, it can find multiple alternative clusterings. However, their approach looks at
the same feature space (i.e., they use all the original features to generate the different
partitionings); whereas, our method mines alternative clustering structures in
different subspaces. In addition, although their method can be extended to generate
more than two alternative clusterings T of the data, extending to more than two
clusterings increases the complexity of their problem by T(T − 1)/2, and the
heavy computational burden of optimization using gradient descent may make this
extension unrealistic for large data sets.
Our framework produces a set of different clustering solutions, which is
similar to cluster ensembles [18, 19]. A clear distinction of our work from cluster
ensembles is that we intentionally search for orthogonal clusterings and do not seek
to find a consensus clustering as our end product.
An integral part of our framework is to search for a clustering structure in a
high-dimensional space and find the corresponding subspace that best reveals the
clustering structure. This topic has been widely studied, and one closely related
work was conducted by [34]. In this work, after a clustering solution is obtained,
a subspace is computed to best capture the clustering, and the clustering is then
refined using the data projected onto the new subspace. Our framework works in
the opposite direction. We look at a subspace that is orthogonal to the space in
which the original clustering is embedded to search for non-redundant clustering
solutions.
Finally, while we search for different subspaces in our framework, it is dif-
ferent from the concept of subspace clustering in [35, 36]. [35] is interested in
automatically finding subspaces with high-density clusters, which are not revealed
in the full space. They divide the data space into a grid and measure the density
of the data points in each cell (unit). Dense units are recognized as clusters. In
subspace clustering, the goal is still to learn a single clustering, where each cluster
can be embedded in its own subspace. In contrast, our method searches for multiple
clustering solutions, each revealed in a different subspace.
As mentioned previously, our proposed method for redundancy removal is
inspired by our work on non-redundant feature selection. In the next section, we
review feature selection methods that are closely related to our proposed feature
selection approach.
2.2 Review of Related Feature Selection Techniques
Feature selection algorithms are described by an objective function for evaluating
features and the search strategy for exploring candidate features. Feature selection
methods can be classified as wrapper (which takes the final learning algorithm into
account for feature evaluation) or filter (which evaluates features based on char-
acteristics of the data alone) methods [37]. Wrapper methods perform better than
filters for a particular learning algorithm, but filter methods are more efficient and
the features selected are typically not tied to a learning algorithm. Our method is
a filter approach with objective functions for PCA/LDA feature selection with an
embedded redundancy penalty and a search strategy based on PCA/LDA transfor-
mation and orthogonalization. This work reflects and combines
ideas from both machine learning and statistics.
Feature selection is an NP-hard combinatorial optimization problem [38].
Feature transformation can be expressed as an optimization problem over a contin-
uous feature space solution. Feature selection, on the other hand, is an optimization
over a discrete space, and typically involves combinatorial search. An exhaustive
search of the 2^d possible feature subsets (where d is the number of features) for the
best feature subset is computationally impractical. More realistic search strategies,
such as greedy approaches (e.g., sequential forward search (SFS) [39] and sequential
backward search (SBS) [40, 41]), can lead to local optima. SFS starts with
an empty feature set and adds features one by one based on the criterion function.
SBS starts from the whole set and removes the least important feature one by
one based on its criterion function. Random search methods, such as genetic al-
gorithms [42], add some randomness to the search procedure to help escape from
local optima. These searches are subset search methods because they evaluate each
candidate subset with respect to the evaluation criterion. In some cases when the
dimensionality is very high, one can only afford an individual search. Individual
search methods evaluate each feature individually (as opposed to feature subsets)
according to a criterion [43]. They describe various criteria for ranking a feature,
including correlation, single-feature classifiers and information-theoretic criteria.
Then, they select the features which either satisfy a condition or are top-ranked.
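Such an individual search can be sketched with the absolute correlation coefficient as the ranking criterion (the function name and interface are illustrative assumptions):

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Score each feature individually by |corr(feature, y)| and return the
    feature indices sorted from most to least relevant."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return [int(i) for i in np.argsort(scores)[::-1]]

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
y = X[:, 1].copy()                            # the target is exactly feature 1
print(rank_features_by_correlation(X, y)[0])  # -> 1
```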
However, individual search is limited in that it does not consider the interaction
among features. The performance of individual search can be improved by remov-
ing redundancy after feature ranking. [44] suggests removing the features that are
highly correlated with the currently selected subset. They measure the relevance
between features based on the correlation coefficient and mutual information. [45]
applies Gram-Schmidt orthogonalization to candidate features ranked in order of
decreasing relevance to a measured process output and suggests a stopping
criterion based on a random probe method. They repeatedly select the feature which
has the minimum angle with the output vector, stopping when the probability that a realization
of the probe would be selected exceeds a predefined risk. Similar to [45], we apply Gram-Schmidt
to remove redundancy. But in contrast to [45], we incorporate PCA/LDA transfor-
mation in searching and selecting features. Feature transformation to select features
takes a global view and considers the interactions among features. Principal com-
ponent analysis (PCA) [1] and linear discriminant analysis (LDA) [2] are classical
feature transformation approaches, which provide global solutions to the best sub-
space based on various optimality criteria [46]. Moreover, unlike individual search,
our combination inherits the desirable properties of PCA/LDA.
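Similar in spirit to [45], the Gram-Schmidt redundancy-removal step can be sketched for a supervised setting as follows (a simplified illustration, not our PCA/LDA-based method; the exhaustion threshold and interface are assumptions):

```python
import numpy as np

def gram_schmidt_select(X, y, k):
    """Greedily pick the feature most correlated with the target y, then deflate
    y and the remaining features against it so that subsequent picks carry only
    non-redundant information."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(X, axis=0)
        scores = np.abs(X.T @ y) / np.maximum(norms, 1e-12)
        scores[norms < 1e-9] = -np.inf        # fully deflated columns are exhausted
        j = int(np.argmax(scores))
        selected.append(j)
        v = X[:, j] / np.linalg.norm(X[:, j])
        X = X - np.outer(v, v @ X)            # remove v's component from every feature
        y = y - v * (v @ y)
    return selected

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X[:, 3] = X[:, 2]                    # feature 3 is an exact duplicate of feature 2
y = 2.0 * X[:, 2] + 0.1 * X[:, 0]
print(gram_schmidt_select(X, y, 2))  # picks feature 2 first; its duplicate 3 is never chosen
```

After a feature is chosen, its duplicate contributes nothing in the deflated space, which is exactly the redundancy-removal behavior that pure individual ranking lacks.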
There has been some work on performing feature selection through PCA.
Recently, there has been growing interest in sparsifying PCA, with methods such as rotation tech-
niques [47], SCoTLASS [48], and sparse PCA [6]. These derive from the classical
regression method LASSO [49]. By adding different penalty terms, including L1 and
L2 norms, they achieve sparsity and at the same time perform a rotation to match
the optimal regression subspace to the original data space. However, they are not
exactly feature selection methods: the original features are combined to form new
variables, each a combination of several of them. A closely related approach is
the variable selection method based on PCA [1], which selects variables with the
highest coefficient (or loading) in absolute value of the first q principal eigenvec-
tors. It can be implemented both iteratively and non-iteratively. Another approach
is by Krzanowski [50], which tries to minimize the error between the principal components
(PCs) calculated with all the original features and those calculated with the selected
feature subset, via forward search, backward elimination and Procrustes analysis
(minimizing the sum-squared error under translation, rotation and reflection). Mao
[51] provided a modified, faster version of Krzanowski’s method. Mao applied
forward selection and backward elimination using least-squares estimates: it builds
a linear model for each feature by minimizing the least-squares error between the
feature subset and the original PCs. The iterative PCA approach by Jolliffe does
not take redundancy into account. The other methods do, but not explicitly, contrary
to our approach. Moreover, Krzanowski and Mao apply sequential search techniques,
which are very slow and almost unrealistic for very large
data sets. Lu et al. [52] pointed out the importance of removing redundancy in
image recognition applications; they performed k-means clustering on the loadings
of the first several PCs and selected the features closest to each cluster’s centroid,
a method called principal feature analysis (PFA). This method depends on the performance
of the clustering method.
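Jolliffe's loading-based variable selection can be sketched in a few lines (an illustrative, non-iterative variant; ties between loadings and duplicate picks are not handled):

```python
import numpy as np

def pca_loading_select(X, q):
    """For each of the first q principal eigenvectors, keep the index of the
    variable with the largest absolute loading."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :q]             # leading q eigenvectors
    return [int(np.argmax(np.abs(top[:, i]))) for i in range(q)]

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3))
X[:, 0] *= 10.0                    # variable 0 dominates the variance
print(pca_loading_select(X, 1))    # -> [0]
```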
2.3 Current Image-Guided Radiotherapy (IGRT) Ap-
proaches for Lung Tumor Treatment
In gated radiotherapy, precise and real time tumor localization is extremely impor-
tant because tighter CTV-to-PTV margins are often applied based on the expecta-
tion of reduced tumor motion [53]. In an idealized gated treatment, tumor position
should be directly detected and the delivery of radiation is only allowed when the
tumor is at the right position.
However, direct detection of the tumor mass in real-time during the treat-
ment is often difficult. Instead, various surrogates are used to identify the tumor
position. Depending on the surrogates used, we may categorize respiratory gating into
two classes: internal gating and external gating. Internal gating uses internal tumor
motion surrogates such as implanted fiducial markers while external gating relies
on external respiratory surrogates such as markers placed on the patient’s abdomen.
During gated treatment, the internal or external surrogate signal is continuously
compared against a pre-specified range of values, called the gating window. When
the surrogate signal is within the gating window, a gating signal is sent to the linac
to turn on the radiation beam.
In the following sections, first we will describe the problem with respiratory
tumor motion in IGRT in Section 2.3.1. Then we will review the history and current
techniques of gated radiotherapy for mobile tumors in Section 2.3.2.
2.3.1 Problems with Respiratory Tumor Motion in Radiother-
apy
Radiation therapy is a treatment modality directed towards local control of cancer.
The primary goal is to precisely deliver a lethal dose to the tumor while minimizing
the dose to surrounding healthy tissues and critical structures. However, treatment
errors related to internal organ motion may greatly degrade the effectiveness of
conformal radiotherapy for the management of thoracic and abdominal lesions, es-
pecially when the treatment is done in a hypo-fraction or single fraction manner
[54, 55]. This has become a pressing issue in the emerging era of image-guided
radiation therapy (IGRT).
Intra-fraction organ motion is mainly caused by patient respiration, and sometimes
also by the skeletal muscular, cardiac, or gastrointestinal systems. Respiration-
induced organ motion has been studied by directly tracking the movement of the tu-
mor [56, 57], the host organ [58, 59], radio-opaque markers implanted at the tumor
site [60, 11], radioactive tracer targeting the tumor [61, 62], and surrogate struc-
tures such as diaphragm and chest wall [63, 64]. It has been shown that the motion
magnitude can be clinically significant (e.g., of the order of 2 - 3 cm), depending
on tumor sites and individual patients.
One category of methods to account for respiratory motion is to minimize the
tumor motion, using techniques such as breath holding and forced shallow breathing
(such as jet ventilation) [65, 66]. These techniques require patient compliance,
active participation and, often, extra therapist participation. They may not be well
tolerated by patients with compromised lung function, which is the case for most
lung cancer patients [67].
An alternative strategy is to allow free tumor motion while adapting the ra-
diation beam to the tumor position by either respiratory gating or beam tracking.
Respiratory gating limits radiation exposure to the portion of the breathing cycle
when the tumor is in the path of the beam [68, 66]. The beam tracking technique fol-
lows the target dynamically with the radiation beam [69], and was first implemented
in a robotic radiosurgery system (CyberKnife)[70]. For linac-based radiotherapy,
tumor motion can be compensated for using a dynamic multi-leaf collimator (MLC)
[71]. Linac based beam tracking is not used in clinical practice because its imple-
mentation and quality assurance are technically challenging. In contrast, respiratory
gating is more practical, and has been adopted in clinical practice by a limited
number of cancer centers. We believe that if the proper tools were available, safe
and effective gated radiotherapy could be widely adopted for treating tumors in the
thorax and abdomen.
2.3.2 Review of the Techniques of Gated Radiotherapy
Respiratory gated radiation therapy was first developed in Japan in the late 1980s
and early 1990s [72, 73]. Various external surrogates were used to monitor respi-
ratory motion, including a combination airbag and strain gauge taped on the pa-
tient’s abdomen or back (for prone treatments) to gate a proton beam [74], and
position sensors placed on the patient [72, 73]. A major advancement of the gated
radiotherapy was the real-time tumor tracking (RTRT) system developed by Mit-
subishi Electronics Co., Ltd., Tokyo, in collaboration with the Hokkaido University
[75, 10]. The RTRT system uses real-time fluoroscopic tracking of gold markers
implanted in the tumor.
Around the mid 1990s, Kubo and his colleagues at the University of Cali-
fornia at Davis introduced the gated radiotherapy technique into the United States.
They reported the first feasibility study of gated radiotherapy with a Varian 2100C
accelerator, as well as an evaluation of different external surrogate signals to moni-
tor respiratory motion [68]. They also reported a gated radiotherapy system which
tracks infrared reflective markers on the patient’s abdomen using a video camera,
developed jointly with Varian Medical Systems, Inc. (Palo Alto, CA) [67]. This
system was later commercialized by Varian and called real-time position manage-
ment (RPM) respiratory gating system. The RPM system has been implemented
and investigated clinically at a number of centers [64].
Currently, the Mitsubishi/Hokkaido RTRT system is the only internal gating
system used in clinical routine, while the Varian RPM system can be considered as
the representative external gating system. Each system has its strengths and weak-
nesses, but their weaknesses have been barriers to a broad adoption of gated radio-
therapy. For the RPM system, a lightweight plastic block with two passive infrared
reflective markers is placed on the patient’s anterior abdominal surface and moni-
tored by a charge-coupled-device (CCD) video camera mounted on the treatment
room wall. The surrogate signal is the abdominal surface motion. Both amplitude
and phase gating are allowed by the RPM system. A periodicity filter checks the
regularity of the breathing waveform and immediately disables the beam when the
breathing waveform becomes irregular, for example due to patient movement or coughing, and
re-enables the beam after establishing that breathing is again regular. The RPM can
also be used during treatment simulation at a radiotherapy simulator or a CT scan-
ner to acquire the patient treatment geometry in the gating window and to set up the
gating window.
The major strength of external gating systems is that they are non-invasive
and that tracking external markers is relatively easy. However, tracking the exter-
nal marker is not equivalent to tracking the tumor, and naively trusting the external
surrogate can cause significant errors. In particular, the relationship between the
tumor motion and the surrogate signal may change over time, which requires fre-
quent re-calibration of this relationship. The major weakness in external gating is
the uncertainty in the correlation between the external marker position and internal
target position.
The Mitsubishi/Hokkaido RTRT system, as well as its application in radiotherapy,
has been extensively documented by the Hokkaido group [75, 10]. The system
consists of four sets of diagnostic x-ray camera systems, where each system con-
sists of an x-ray tube mounted under the floor, a 9-inch image intensifier mounted
in the ceiling, and a high-voltage x-ray generator. The four x-ray tubes are placed
at right caudal, right cranial, left caudal, and left cranial position with respect to the
patient couch at a distance of 280 cm from the isocenter. The image intensifiers are
mounted on the ceiling, opposite to the x-ray tubes, at a distance of 180 cm from
the isocenter, with beam central axes intersecting at the isocenter. At a given time
during patient treatment, depending on the linac gantry angle, two out of the four
x-ray systems are enabled to provide a pair of unblocked orthogonal fluoroscopic
images. To reduce the scatter radiation from the therapeutic beam to the imagers,
the x-ray units and the linac are synchronized, i.e., the MV beam is gated off while the
kV x-ray units are pulsed.
Using this system, the fiducial markers implanted at the tumor site can be
directly tracked fluoroscopically at a video frame rate [76]. The linear accelerator
is gated to irradiate the tumor only when the marker is within the internal gating
window. The size of the gating window is set at ±1 to ±3 mm according to the
patient’s characteristics and the margin used in treatment planning [75]. Techniques
for the insertion of gold markers of 1.5-2.0 mm diameter into or near the tumor
were developed for various tumor sites, including bronchoscopic insertion for the
peripheral lung, image-guided transcutaneous insertion for the liver, cystoscopic
and image-guided percutaneous insertion for the prostate, and surgical implantation for
spinal/paraspinal lesions [10].
Percutaneously implanting fiducial markers is an invasive procedure with
potential risks of infection. Many clinicians are reluctant to use this procedure for
lung cancer patients because puncturing the chest wall may cause pneumothorax.
The insertion of gold markers using bronchofiberscopy is feasible and safe only
for peripheral-type lung tumors, not for central lung lesions [10]. The Hokkaido
group found that the markers fixed into the bronchial tree may significantly change
their relationship with the tumor after 2 weeks of insertion [77]. Therefore, bron-
choscopic insertion of markers is not an ideal solution for lung tumor treatment,
especially for a large number of fractions.
In summary, the major strength of the internal gating systems represented
by the RTRT system is the precise and real-time localization of the tumor position
during the treatment. The implanted internal markers are often good surrogates
for tumor position, and marker migration usually is not an issue if the simulation
images are acquired a few days after marker implantation [10]. It is even less of a
concern if multiple markers are used. The two major weaknesses of internal gating
are the risk of pneumothorax for implantation of markers in the lungs and the high
imaging dose required for fluoroscopic tracking.
Here, in this research, we apply machine learning algorithms to improve
gating and tracking based radiotherapy. More specifically, we aim to precisely gate
the mobile tumor without implanted fiducial markers using fluoroscopic images.
Methods for gating and tracking are described in Chapters 5 and 6 respectively.
Chapter 3
Non-redundant Multi-view Clustering
In this chapter, we will present our non-redundant multi-view clustering framework
in detail. To be specific, in Section 3.1, we present the two clustering approaches
based on our framework. Interestingly, these two proposed approaches are related;
we analyze their relationship in Section 3.1.3. Because our framework
operates in a totally unsupervised fashion, we provide an approach to automatically
find the number of clusters in each iteration, and we develop a corresponding stopping
criterion in Section 3.3. Then, we perform a set of experiments on both
synthetic and real-world data sets; the results are presented in Section 3.2. Finally,
we present our conclusions in Section 3.4.
3.1 Multi-View Orthogonal Clustering
Given data $X \in \mathbb{R}^{d \times N}$ with N instances and d features, our goal is to learn a set of non-redundant clustering views from the data.
There are a number of ways to find different clustering views [18, 19]. One
can apply different clustering algorithms (each with varying objective functions),
utilize different similarity measures, or apply the same algorithm on different ran-
domly sampled (either in instance space or feature space) data from X . Note that
such methods produce each individual clustering independently from all the other
clustering views. While the differences in the objective functions, similarity mea-
sures, density models, or different data samples may lead to clustering results that
differ from one another, it is common to see high redundancy in the obtained mul-
tiple clustering views. Below, we present a framework for successively generating
multiple clustering views that are orthogonal from one another and thus contain
limited redundancy.
Figure 3.1: The general framework for generating multiple orthogonal clustering views.
Figure 3.1 shows the general framework of our approach. We first cluster
the data (this can include dimensionality reduction followed by clustering when
necessary); then we orthogonalize the data to a space that is not covered by the
existing clustering solutions. We repeat this process until we have covered most of the data space or until no structure can be found in the remaining space.
We developed two different approaches within this framework: (1) orthogo-
nal clustering, and (2) clustering in orthogonal subspaces.
These two approaches differ primarily in how they represent the existing
clustering solutions and consequently how to orthogonalize the data based on exist-
ing solutions. Specifically, the first approach represents a clustering solution using
its k cluster centroids. The second approach represents a clustering solution using
the feature subspace that best captures the clustering result. In the next two subsec-
tions, we describe these two different representations in detail and explain how to
obtain orthogonal clustering solutions based on these two representations.
3.1.1 Orthogonal Clustering
Clustering can be viewed as a way of compressing the data X. For example, in k-means [78, 79], the objective is to minimize the sum-squared-error (SSE) criterion:

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - \mu_j\|^2,$$

where $x_i \in \mathbb{R}^d$ is a data point assigned to cluster $C_j$, and $\mu_j$ is the mean of $C_j$. We
represent xi and µj as column vectors. The outputs for k-means clustering are the
cluster means and the cluster membership of each data point xi. One can consider
k-means clustering as a compression of data X to the k cluster means µj .
Following the compression viewpoint, each data point xi is represented by
its cluster mean µj . Given k µj’s for representing X , what is not captured by these
µj’s? Let us consider the space spanned by xi, i = 1 . . . N; we refer to this as the original data space. In contrast, the subspace spanned by the mean vectors µj, j = 1 . . . k, is considered the compressed data space. Assigning data points to
their corresponding cluster means can be essentially considered as projecting the
data points from the original data space onto the compressed data space. What is
not covered by the current clustering solution (i.e., its compressed data space) is
captured by its residue space. In this paper, we define the residue space as the data
projected onto the space orthogonal to our current representation. Thus, to find
alternative clustering solutions not covered in the current solution, we can simply
perform clustering in the space that is orthogonal to the compressed data space.
Given the current data $X^{(t)}$, and the clustering solution found on $X^{(t)}$ (i.e., $M^{(t)} = [\mu_1^{(t)} \ \mu_2^{(t)} \cdots \mu_k^{(t)}]$ and the cluster assignments), we describe two variations for representing the data in the residue space, $X^{(t+1)}$: the single-mean and all-mean representations.
Single-mean representation. In hard clustering, each data point belongs to a single cluster and is thus represented by a single mean vector. For a data point $x_i^{(t)}$ belonging to cluster j, we project it onto its center $\mu_j^{(t)}$ as its representation in the current clustering view. We consider two different methods to compute the residue in this case. In the first method, the residue $x_i^{(t+1)}$ is defined as $x_i^{(t)} - \mu_j^{(t)}$ (i.e., the difference between a data point and its mean). In the second method, the residue $x_i^{(t+1)}$ is defined as the projection of $x_i^{(t)}$ onto the subspace orthogonal to $\mu_j^{(t)}$:

$$x_i^{(t+1)} = \left(I - \frac{\mu_j^{(t)} \mu_j^{(t)T}}{\mu_j^{(t)T} \mu_j^{(t)}}\right) x_i^{(t)}.$$
Note that empirically we observed that method two is more effective in pro-
ducing non-redundant clustering solutions. This can be attributed to the fact
that method two’s residual representation of a data point in iteration t + 1 is
orthogonal to its cluster center in iteration t. This proved to be beneficial in
achieving our goal of producing non-redundant solutions. In the remainder
of the paper, we will focus on the second method for hard clustering.
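As an illustration, the second residue computation can be sketched in a few lines of NumPy. This is a minimal sketch, not the thesis implementation; the function name and the d × N column-vector layout are our own conventions:

```python
import numpy as np

def single_mean_residue(X, labels, means):
    """Project each column of X onto the subspace orthogonal to its own
    cluster mean (the second single-mean residue described above).

    X:      d x N data matrix (points are columns)
    labels: length-N array of cluster indices
    means:  d x k matrix of cluster centroids
    """
    X_res = np.empty_like(X, dtype=float)
    for j in range(means.shape[1]):
        mu = means[:, j:j + 1]                                  # d x 1 centroid
        P = np.eye(X.shape[0]) - (mu @ mu.T) / float(mu.T @ mu)
        idx = labels == j
        X_res[:, idx] = P @ X[:, idx]                           # residue of cluster j
    return X_res
```

After this step, every residual point has zero inner product with its own centroid, which is exactly the orthogonality property exploited in the next iteration.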
All-mean representation.¹ We achieve this by projecting the data onto the subspace spanned by all cluster means and computing the residue, $X^{(t+1)}$, as the projection of $X^{(t)}$ onto the subspace orthogonal to all the cluster centroids.

¹The solution in clustering (hard or soft) can be represented by using all cluster centers.
This can be formalized by the following formula:

$$X^{(t+1)} = \left(I - M^{(t)}\left(M^{(t)T} M^{(t)}\right)^{-1} M^{(t)T}\right) X^{(t)}.$$
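A sketch of the all-mean residue, under the assumption that the centroid matrix M has full column rank (otherwise a pseudo-inverse would be needed); the function name is ours:

```python
import numpy as np

def all_mean_residue(X, M):
    """Residue of X (d x N) after removing the span of all k centroids
    (columns of M, d x k): X_res = (I - M (M^T M)^{-1} M^T) X.
    Assumes M has full column rank; use np.linalg.pinv otherwise."""
    d = X.shape[0]
    P = np.eye(d) - M @ np.linalg.solve(M.T @ M, M.T)
    return P @ X
```

By construction, the columns of the result are orthogonal to every centroid at once, not just to each point's own cluster mean.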
The algorithm for orthogonal clustering is summarized in Algorithm 1. The
data is first centered to have zero mean. We then create the first view by clustering
the original data X . Since most of the data in our experiments are high-dimensional,
we apply principal components analysis [80] to reduce the dimensionality, followed
by k-means. Note that one can apply other clustering methods within our frame-
work. We chose PCA followed by k-means because they are popular techniques. In
step 2, we project the data to the space orthogonal to the current cluster representa-
tion (using cluster centers) to obtain our residue X(t+1). The next clustering view is
then obtained by clustering in this residue space. We repeat steps 1 (clustering) and 2 (orthogonalization) until the desired number of views is obtained or until the SSE is very small; a small SSE signifies that the existing views already cover most of the data. In Section 3.3, we discuss in detail how to stop automatically.
3.1.2 Clustering in Orthogonal Subspaces
In this approach, given a clustering solution with means µj , j = 1 . . . k, we would
like to find a feature subspace that best captures the clustering structure, or, in other
words, discriminates these clusters well. One well-known method for finding a
reduced dimensional space that discriminates classes (clusters here) is linear dis-
criminant analysis (LDA) [2, 81]. Another approach is by applying singular value
decomposition (SVD) on the k mean vectors µj’s [34].
Below we explain the mathematical differences between these two approaches.
Algorithm 1 Orthogonal Clustering.
Inputs: The data matrix $X \in \mathbb{R}^{d \times N}$, and the number of clusters $k^{(t)}$ for each iteration t.
Output: The multiple partitioning views of the data into $k^{(t)}$ clusters at each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and $X^{(1)} = X$.
Step 1: Cluster $X^{(t)}$. In our experiments, we performed PCA followed by k-means. The compressed solution is the k means $\mu_j^{(t)}$. Each $\mu_j^{(t)}$ is a column vector in $\mathbb{R}^d$ (the original feature space).
Step 2: Project each $x_i^{(t)}$ in $X^{(t)}$ to the space orthogonal to its cluster mean (for the single-mean version) or to all the means (for the all-mean version), to form the residue space representation, $X^{(t+1)}$.
Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, $\sum_{j=1}^{k} \sum_{x_i^{(t)} \in C_j^{(t)}} \|x_i^{(t)} - \mu_j^{(t)}\|^2$, is very small.
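To make the loop concrete, here is a compact end-to-end sketch of Algorithm 1 (single-mean version). We substitute a bare-bones Lloyd's k-means for the PCA + k-means base clustering used in our experiments, and the function names are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Bare-bones Lloyd's k-means on a d x N matrix; returns (means, labels)."""
    rng = np.random.default_rng(seed)
    means = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    labels = np.zeros(X.shape[1], dtype=int)
    for _ in range(iters):
        d2 = ((X[:, None, :] - means[:, :, None]) ** 2).sum(axis=0)  # k x N distances
        labels = d2.argmin(axis=0)
        for j in range(k):
            if np.any(labels == j):
                means[:, j] = X[:, labels == j].mean(axis=1)
    return means, labels

def orthogonal_clustering(X, ks, tol=1e-6):
    """Algorithm 1 (single-mean version): cluster, record the view, then
    project each point orthogonal to its own centroid and repeat."""
    X = X.astype(float) - X.mean(axis=1, keepdims=True)  # pre-processing: center
    views = []
    for t, k in enumerate(ks):                           # one k per requested view
        means, labels = kmeans(X, k, seed=t)             # step 1: cluster X(t)
        views.append(labels)
        sse = sum(((X[:, labels == j] - means[:, j:j + 1]) ** 2).sum()
                  for j in range(k))
        if sse < tol:                                    # stop: views cover the data
            break
        for j in range(k):                               # step 2: orthogonalize
            mu = means[:, j:j + 1]
            nrm2 = float(mu.T @ mu)
            if nrm2 > 1e-12:
                P = np.eye(X.shape[0]) - (mu @ mu.T) / nrm2
                X[:, labels == j] = P @ X[:, labels == j]
    return views
```

On data like synthetic data 1 below, successive views with k = 2 can recover the two orthogonal groupings of the four underlying clusters.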
LDA finds a linear projection $Y = A^T X$ that maximizes

$$\mathrm{trace}\left(S_w^{-1} S_b\right),$$
where Sw is the within-class-scatter matrix and Sb is the between-class-scatter ma-
trix defined as follows.
$$S_w = \sum_{j=1}^{k} \sum_{y_i \in C_j} (y_i - \mu_j)(y_i - \mu_j)^T,$$

$$S_b = \sum_{j=1}^{k} n_j (\mu_j - \mu)(\mu_j - \mu)^T,$$
where yi’s are projected data points; µj’s are projected cluster centers; nj is the total
number of points in cluster j and µ is the center of the entire set of projected data.
In essence, LDA finds the subspace that maximizes the scatter between the cluster
means normalized by the scatter within each cluster.
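The two scatter matrices can be computed directly from their definitions; a small NumPy sketch (our own helper, not part of the thesis code):

```python
import numpy as np

def scatter_matrices(Y, labels):
    """Within-class scatter Sw and between-class scatter Sb of the
    projected data Y (d x N), following the definitions above."""
    d = Y.shape[0]
    mu = Y.mean(axis=1, keepdims=True)           # overall projected mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for j in np.unique(labels):
        Yj = Y[:, labels == j]
        mj = Yj.mean(axis=1, keepdims=True)      # projected cluster center
        Sw += (Yj - mj) @ (Yj - mj).T            # spread within cluster j
        Sb += Yj.shape[1] * (mj - mu) @ (mj - mu).T  # weighted spread of centers
    return Sw, Sb
```

A quick sanity check is the standard identity $S_w + S_b = S_t$, where $S_t$ is the total scatter of Y about its overall mean.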
Similarly, the SVD approach in [34] seeks a linear projection $Y = A^T X$, but maximizes a different objective function, $\mathrm{trace}(M M^T)$, where $M = [\mu_1 - \mu, \mu_2 - \mu, \cdots, \mu_k - \mu]$, the $\mu_j$'s are the projected cluster centers, and $\mu$ is the center of the entire set of projected data.
For both methods, the solution can be represented as $A = [\alpha_1 \ \alpha_2 \cdots \alpha_q]$, which contains the q most important eigenvectors (those corresponding to the q largest eigenvalues) of $S_w^{-1} S_b$ for LDA and of $M M^T$ for SVD, respectively.

Note that $\mathrm{trace}(S_b) = \mathrm{trace}(M' M'^T)$, where $M' = [\sqrt{n_1} M_1 \ \sqrt{n_2} M_2 \cdots \sqrt{n_k} M_k]$. The only difference between M and M' is the weighting of each column $M_j$ of M by the square root of $n_j$, the number of data points in cluster $C_j$. But both M and M' span the same space. Thus, in practice, maximizing $\mathrm{trace}(M M^T)$ and maximizing $\mathrm{trace}(S_b)$ produce similar results. What distinguishes maximizing $\mathrm{trace}(S_b)$ (or $\mathrm{trace}(M M^T)$) from the standard LDA objective, $\mathrm{trace}(S_w^{-1} S_b)$, is the normalization by the within-class scatter, $S_w^{-1}$. For computational reasons, we choose the SVD approach on the means $\mu_j$ and set $q = k - 1$, the rank of $M M^T$. In general, one may keep any number of dimensions $q \le k - 1$, and use any other dimensionality reduction algorithm to determine $A^{(t)}$.
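In NumPy terms, the subspace A can be read off from an SVD of the centroid matrix; a minimal sketch with an illustrative function name:

```python
import numpy as np

def subspace_from_means(M, q=None):
    """Return A, the top-q left singular vectors of the centroid matrix
    M (d x k); these are the leading eigenvectors of M M^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    if q is None:
        q = M.shape[1] - 1        # q = k - 1, the rank of M M^T for centered data
    return U[:, :q]
```

The returned columns are orthonormal, so projecting onto the space orthogonal to A reduces to $I - A A^T$.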
Once we have obtained a feature subspace $A = [\alpha_1 \ \alpha_2 \cdots \alpha_q]$ that captures the clustering structure well, we project $X^{(t)}$ onto the subspace orthogonal to A to obtain the residue $X^{(t+1)} = P^{(t)} X^{(t)}$. The orthogonal projection operator, P, is:

$$P^{(t)} = I - A^{(t)}\left(A^{(t)T} A^{(t)}\right)^{-1} A^{(t)T}.$$
Algorithm 2 presents the pseudo-code for clustering in orthogonal subspaces.
We first pre-process the data to have zero mean. In step 1, we apply a clustering
algorithm (PCA followed by k-means in our experiments). We then represent this
clustering solution using the subspace that best separates these clusters. In step 2,
we project the data to the space orthogonal to the computed subspace representation. We repeat steps 1 and 2 until the desired number of views is obtained or
Algorithm 2 Clustering in Orthogonal Subspaces.
Inputs: The data matrix $X \in \mathbb{R}^{d \times N}$, and the number of clusters $k^{(t)}$ for each iteration t.
Output: The multiple partitioning views of the data into $k^{(t)}$ clusters, and a reduced-dimensional subspace $A^{(t)}$ for each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and $X^{(1)} = X$.
Step 1: Cluster $X^{(t)}$. In our experiments, we performed PCA followed by k-means. Then, apply a dimensionality reduction algorithm to obtain the subspace, $A^{(t)}$, that captures the current clustering.
Step 2: Project $X^{(t)}$ to the space orthogonal to $A^{(t)}$ to produce $X^{(t+1)} = P^{(t)} X^{(t)}$, where the projection operator $P^{(t)}$ is:

$$P^{(t)} = I - A^{(t)}\left(A^{(t)T} A^{(t)}\right)^{-1} A^{(t)T}.$$

Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, $\sum_{i=1}^{N} \|x_i^{(t)} - A^{(t)} y_i^{(t)}\|^2$, is very small.
the SSE is very small. An automated approach for determining when to stop is
proposed in Section 3.3.2.
3.1.3 Relationship Between Orthogonal Clustering and Clustering in Orthogonal Subspaces
We have illustrated two approaches to represent a clustering view and search for
orthogonal clustering solutions in the two previous subsections. In this section, we
discuss the relationship between them.
In general, these two methods are different. However, under certain condi-
tions, the all-mean version of method 1 can be equivalent to method 2.
Assume that we obtain the same clustering results with K clusters for both
methods; consequently, the same mean vectors. The projection matrix for method
1, orthogonal clustering (all-mean version), is:

$$P_1 = I - M^{(t)}\left(M^{(t)T} M^{(t)}\right)^{-1} M^{(t)T},$$

where $M^{(t)} = [\mu_1^{(t)}, \cdots, \mu_K^{(t)}]$ and $X^{(t+1)} = P_1 X^{(t)}$. In contrast, for method 2, clustering in orthogonal subspaces, the projection matrix is:

$$P_2 = I - A^{(t)}\left(A^{(t)T} A^{(t)}\right)^{-1} A^{(t)T},$$

where $X^{(t+1)} = P_2 X^{(t)}$, and $A^{(t)}$ is the matrix of eigenvectors of $M'^{(t)} M'^{(t)T}$, with $M'^{(t)} = [\mu_1^{(t)} - \mu^{(t)}, \cdots, \mu_K^{(t)} - \mu^{(t)}]$. Note that the total mean $\mu^{(t)}$ is zero because X is zero-centered initially and the linear projections simply rotate X, keeping the center of the residue spaces $X^{(t)}$ at zero. Therefore, we have $M' = M$. It follows that A and M span the same space. As a result, we have $P_1 = P_2$. More specifically, substituting the singular value decomposition of M,² $M = A S V^T$, we get

$$M\left(M^T M\right)^{-1} M^T = A S V^T \left(V S^T A^T A S V^T\right)^{-1} V S^T A^T \quad (3.1)$$
$$= A S V^T (V^T)^{-1} S^{-1} A^{-1} (A^T)^{-1} (S^T)^{-1} V^{-1} V S^T A^T$$
$$= A\left(A^T A\right)^{-1} A^T.$$

Thus, $P_1 = P_2$. Therefore, after projection, the residue space generated by the orthogonal clustering approach is the same as the one generated by the orthogonal subspace algorithm. The two methods lead to the same multi-view clustering results.
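This equivalence is easy to verify numerically for a random full-rank centroid matrix; the following check is ours, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))                # a random full-rank centroid matrix

# Method 1 (all-mean version): projector built directly from the centroids.
P1 = np.eye(5) - M @ np.linalg.solve(M.T @ M, M.T)

# Method 2 (SVD on the means): projector built from the left singular
# vectors of M, which span the same space as the columns of M.
A = np.linalg.svd(M, full_matrices=False)[0]
P2 = np.eye(5) - A @ np.linalg.solve(A.T @ A, A.T)

assert np.allclose(P1, P2, atol=1e-8)          # the two residue maps coincide
```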
In the above paragraphs, we showed when the two methods are equal. Here, we explain when they differ. In method 2, one also has the option of keeping fewer than K − 1 eigenvectors. In such a case, the two methods will be different, with method 2 converging more slowly toward zero SSE and consequently producing more iterations/views.

²We remove the superscript (t) here to simplify notation.
In addition, note that the equivalence of these two methods only holds when
we apply SVD on the means in method 2. If other dimensionality reduction tech-
niques are used, including LDA, these two methods will lead to different clustering
solutions in each iteration.
Finally, the above equivalence of the two methods only holds for the all-
mean version of method 1. The single-mean version of method 1 leads to different
residue representations of the data compared to method 2.
Again, let us assume the same clustering results with K clusters for both methods. Without loss of generality, we select a data sample $x_i^{(t)}$ from $X^{(t)}$ and assume it is clustered into class c. In the single-mean version of the orthogonal clustering method, each data point belongs to a single cluster, and the projection matrix is:

$$P_1 = I - \frac{\mu_c^{(t)} \mu_c^{(t)T}}{\mu_c^{(t)T} \mu_c^{(t)}}, \qquad x_i^{(t+1)} = P_1 x_i^{(t)}.$$

For method 2, clustering in orthogonal subspaces, the projection matrix is:

$$P_2 = I - M^{(t)}\left(M^{(t)T} M^{(t)}\right)^{-1} M^{(t)T}, \qquad M^{(t)} = [\mu_1^{(t)}, \cdots, \mu_K^{(t)}], \qquad x_i^{(t+1)} = P_2 x_i^{(t)}.$$
Comparing the two projection matrices, we see that only when $\mathrm{span}\{\mu_1^{(t)}, \cdots, \mu_K^{(t)}\} = \mathrm{span}\{\mu_c^{(t)}\}$ (i.e., all current mean vectors lie along a single line/1D vector equal to $\mu_c^{(t)}$) do we have $P_1 = P_2$, so that the two methods produce the same residue $x_i^{(t+1)}$. In practice, however, because $\mu_c^{(t)}$ generally covers less space than the span of all the mean vectors, which has dimensionality K − 1, the single-mean version of method 1 converges more slowly than method 2 in each iteration. In other words, method 2 removes the entire subspace covered by the current K-cluster partitioning, while the single-mean version of method 1 only removes the direction of each sample's clus-
ter mean. Another difference is that method 2 always obtains a residue space that
is orthogonal to all the previous iterations; the single-mean version of method 1, on
the other hand, leads to a residue that is not orthogonal to all previous iterations.
3.2 Experiments
In this section, we investigate whether our multi-view orthogonal clustering frame-
work can provide us with reasonable and orthogonal clustering views of the data.
We start by performing experiments on synthetic data in Section 3.2.1 to get a better
understanding of the methods, then we test the methods on benchmark data in Sec-
tion 3.2.2. In these experiments, we chose PCA followed by k-means as our base clustering: we first reduce the dimensionality with PCA, keeping enough dimensions to retain at least 90% of the original variance, and then apply k-means clustering. Because the all-mean version of method 1 is
equivalent to method 2, we implement orthogonal clustering with the single-mean
version. In this section, we refer to the single-mean version of orthogonal cluster-
ing approach as method 1, and the clustering in orthogonal subspaces approach as
method 2.
3.2.1 Experiments on Synthetic Data
We would like to see whether our two methods can find diverse groupings of the data. We generate two synthetic data sets.
Data 1 We generate a four-cluster data set in two dimensions with N = 500 instances, as shown in Figure 3.2, where each cluster contains 125 data points. We test our methods by setting k = 2 for k-means. If the methods group the clusters one way in the first iteration, they should group them the other way in the next iteration. This data set tests whether the methods can find orthogonal clusters.
Data 2 We generate a second synthetic data set in four dimensions, with N = 500 instances, as shown in Figure 3.3. We generate three Gaussian clusters in
features F1 and F2 with 100, 100 and 300 data points and means µ1 =
(12.5, 12.5), µ2 = (19, 10.5), and µ3 = (6, 17.5), and identity covariances.
We generate another mixture of three Gaussian clusters in features F3 and F4
with 200, 200 and 100 data points and means µ1 = (2, 17), µ2 = (17.5, 9),
and µ3 = (1.2, 5), and identity covariances. This data tests whether the meth-
ods can find different clustering solutions in different subspaces.
Table 3.1: Confusion Matrix for Synthetic Data 1

                 METHOD 1       METHOD 2
ITERATION 1      C1     C2      C1     C2
L1               125    0       125    0
L2               0      125     0      125
L3               125    0       125    0
L4               0      125     0      125
ITERATION 2      C1     C2      C1     C2
L1               125    0       125    0
L2               125    0       125    0
L3               0      125     0      125
L4               0      125     0      125
Results for Synthetic Data 1
The confusion matrix in Table 3.1 shows the experimental results for synthetic data
1 for methods 1 and 2, in two iterations. We can see that for the first iteration, both
methods grouped classes L1 and L3 into a single cluster C1, and classes L2 and
L4 into another cluster C2. For the second iteration, the data was partitioned in a
[Figure 3.2 appears here: scatter-plot panels in features F1/F2 for iteration 1 (a1: method 1, b1: method 2) and iteration 2 (a2, b2), and SSE-versus-iteration panels (a3, b3).]

Figure 3.2: Scatter plots of synthetic data 1. The two columns show the results of methods 1 and 2, respectively. The colors represent different class labels and the ellipses represent the clusters found. Rows 1 and 2 show the results for iterations 1 and 2, respectively; row 3 shows SSE as a function of iteration.
different way, which grouped classes L1 and L2 into one cluster, and classes L3 and
L4 into another cluster. Figure 3.2 shows the scatter plot of the clustering results
of both methods in the original 2D data space for the two iterations. Different
colors are used to signify the true classes, and the ellipses show the clusters found
by k-means. The figure confirms the result summarized in the confusion matrix.
Both methods 1 and 2 give similar results, as shown. In subfigures a3 and b3 of
Figure 3.2, we plot the sum-squared-error (SSE) as a function of iteration. Note
that, as expected, SSE for both methods decreases monotonically until convergence.
Moreover, the SSE reaches zero at iteration 2 meaning that the first two clustering
views have covered the data space completely.
Results for Synthetic Data 2
Table 3.2 shows the confusion matrix for our clustering with the two different label-
ings: labeling 1 is for features 1 and 2, and labeling 2 is for features 3 and 4. A high number of common occurrences means that the cluster corresponds to those labels.
Observe that both methods 1 and 2 found the clusters in labeling 2 (features 3 and 4) perfectly, with zero confusion in the off-diagonal elements, in the first
iteration/view. In the second iteration/view, methods 1 and 2 found the clusters in
labeling 1 (features 1 and 2) perfectly also with zero confusion. This result confirms
that indeed our multi-view approach can discover multiple clustering solutions in
different subspaces. Figure 3.3 shows scatter plots of the data. The left column
((a1), (a2), (a3)) is the plot for method 1. (a1) shows the clustering in ellipses
found by method 1 in iteration 1. The left sub-figure shows the groupings in the
original features 1 and 2, and the data points are colored based on true labeling 1.
The right sub-figure shows the clusterings in the original features 3 and 4, and the
color of the data points are based on true labeling 2. (a2) is the same scatter plot of
the original data X with the clusters found by method 1 as shown by the ellipses in
iteration 2. Similarly, (b1) and (b2) show the results of method 2. (a3) and (b3) are
37
the SSE for the two methods in each iteration. Method 2 converges much faster than
method 1 here. Note that SSE monotonically decreases with iteration and that the
algorithm captures most of the information in two clustering views. From these re-
sults, iteration 1 finds the right partition based on features 3 and 4 but groups the clusters in features 1 and 2 incorrectly. On the other hand, iteration 2 groups the clusters based on features 1 and 2 correctly, but the partition for the clusters in
features 3 and 4 is wrong. The results confirm that indeed our multi-view approach
can discover multiple clustering solutions in different subspaces.
Table 3.2: Confusion Matrix for Synthetic Data 2

                 METHOD 1            METHOD 2
ITERATION 1
LABELLING 1      C1    C2    C3      C1    C2    C3
L1               41    40    19      41    40    19
L2               44    34    22      44    34    22
L3               115   126   59      115   126   59
LABELLING 2      C1    C2    C3      C1    C2    C3
L1               200   0     0       200   0     0
L2               0     200   0       0     200   0
L3               0     0     100     0     0     100
ITERATION 2
LABELLING 1      C1    C2    C3      C1    C2    C3
L1               100   0     0       100   0     0
L2               0     100   0       0     100   0
L3               0     0     300     0     0     300
LABELLING 2      C1    C2    C3      C1    C2    C3
L1               126   34    40      126   34    40
L2               115   44    41      115   44    41
L3               59    22    19      59    22    19
3.2.2 Experiments on Real-World Benchmark Data
We have shown that our two methods work on synthetic data. Here, we investigate
whether they reveal interesting and diverse clustering solutions on real benchmark
[Figure 3.3 appears here: for each iteration and method, paired scatter-plot panels in features F1/F2 (colored by label 1) and F3/F4 (colored by label 2) (a1, b1: iteration 1; a2, b2: iteration 2), plus SSE-versus-iteration panels (a3, b3).]

Figure 3.3: These are scatter plots of synthetic data 2 and the clusters found by methods 1 (a1, a2) and 2 (b1, b2). The color of the data points reflects different class labels and the ellipses represent the clusters found. a1, b1 are the results for iteration 1; a2, b2 are the results for iteration 2; a3 and b3 show SSE as a function of iteration for methods 1 and 2, respectively.
data. We select data sets that have high-dimensionality and that have multiple pos-
sible partitions. Since the two methods have similar results, we only need to show
the results once. In particular, we report the results for method 2.
In this section, we investigate the performance of our multi-view orthog-
onal clustering algorithms on four real-world data sets, including the digits data
set from the UCI machine learning repository [82], the face data set from the UCI
KDD repository [83], and two text data sets: the mini-newsgroups data [83] and the
WebKB data set [84].
The digits data is a data set for an optical recognition problem of handwritten
digits with ten classes, 5620 cases, and 64 attributes (all input attributes are integers
from 0 . . . 16). The face data consists of 640 face images of 20 people taken with
varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes
(wearing sunglasses or not). Each person has 32 images capturing every combination of features. The image resolution is 32 × 30. We removed the missing data and formed a 960 × 624 data matrix. Each of the 960 features represents a pixel value.
The mini-newsgroups data comes from the UCI KDD repository which contains
2000 articles from 20 newsgroups. The second text data is the CMU four university
WebKB data set as described in [84]. Both text data sets were processed following
the standard procedure, including stemming and removing stopwords.
Results for the Digit Data
Table 3.3 shows the confusion matrix for the digit data. For all three iterations,
we partition the data into three clusters. In iteration 1, the resulting partition clus-
tered digits {1, 7, 8}, {2, 3, 5, 9} and {0, 4, 6} into different groups. In iteration 2,
our method clustered {2, 6, 8}, {1, 4} and {0, 3, 5, 7, 9} into another set of clusters.
And, in iteration 3, the clusters we found are {3, 6, 7}, {0, 1, 2, 8, 9} and {4, 5}.
These results show that in each iteration we can find a different way of partitioning
the ten classes (digits).
Table 3.3: Confusion Matrix for the Digits Data

     ITERATION 1              ITERATION 2              ITERATION 3
DIGIT  C1   C2   C3      DIGIT  C1   C2   C3      DIGIT  C1   C2   C3
"1"    477  82   12      "2"    488  11   58      "3"    393  151  28
"7"    566  0    0       "6"    532  14   12      "6"    486  47   25
"8"    321  218  15      "8"    350  31   173     "7"    500  54   12
"2"    27   525  5       "1"    285  286  0       "0"    1    545  8
"3"    41   531  0       "4"    49   381  138     "1"    77   394  100
"5"    236  287  35      "0"    2    4    548     "2"    21   522  14
"9"    152  408  2       "3"    66   158  348     "8"    213  317  24
"0"    0    4    550     "5"    212  24   322     "9"    143  265  154
"4"    199  0    369     "7"    67   7    492     "4"    159  187  222
"6"    2    1    555     "9"    9    95   458     "5"    5    18   535
In Figure 3.4, we present the mean image of each cluster obtained by method
2 in three iterations. Below each image we show the dominant digits contained in
the cluster. For a digit to be considered as contained in a cluster, we require that at
least 70% of its data points fall into the cluster. It is interesting to note that digits
4 and 5 were not well captured by any of the clusters in iteration 1. In contrast, in
iteration 2, we see digit 4 well-separated and captured by cluster 2. In iteration 3,
we were able to capture digit 5 nicely in a single cluster. This further demonstrated
that our method is capable of discovering multiple reasonable structures from data.
Results for the Face Data
Face data is a very interesting data set because it can be grouped in several different
ways (e.g., by person, pose, etc.). We design the experiment to see if we can obtain
different clustering information in different iterations.
First, we begin with our number of clusters K = 20 in the first iteration,
hopefully to find the 20 persons in the database. Then, from the second iteration to
the rest of the iterations, we set K = 4 to see if the partitions found in the remaining
iterations can tell us any useful information. Figure 3.5 shows the average image
[Figure 3.4 appears here: the average digit image for each of the three clusters in iterations 1, 2 and 3, with dominant digits ("1","7"), ("0","6"), ("2","3","9") in iteration 1; ("2","6"), ("4"), ("0","7","9") in iteration 2; and ("3","6","7"), ("0","1","2"), ("5") in iteration 3.]

Figure 3.4: The average digit for images within each cluster found by method 2 in iterations/views 1, 2 and 3. These clustering views correspond to different digits.
for each cluster we find in iteration 1. We observed from this figure that iteration
1 leads to a clustering corresponding to the different persons in the database. The
number below the image is the percentage this person appears in the cluster. The
images clearly show different persons. In the second iteration, the four clusters we
found are shown in Figure 3.6. Each image is an average image of the images within
each cluster. It is clear that the clustering in iteration 2 groups the data based on
different poses. This suggested that our method was able to find different clustering
views from the face data.
Results for the Mini-Newsgroups Data
The mini-newsgroups data set originally contains 20 classes. We removed the
classes that are under the “misc” category because it does not correspond to a clear
[Figure 3.5 appears here: 20 average face images, one per cluster, each annotated with the fraction of the cluster accounted for by its dominant person (values range from 0.25 to 1).]

Figure 3.5: The average face image for each cluster in iteration 1. This clustering view corresponds to different persons.
concept class. We also pre-processed the data to remove stop words, words that appeared in fewer than 40 documents, and words that had low variance of occurrence across documents. After pre-processing, the data contains 1700 documents from 17
classes. Each document is represented by a 500-dimensional term frequency vector.
Note that PCA followed by k-means does not work well for text data. Here,
we apply the spherical k-means method [85] instead, which considers the corre-
lation between documents rather than the Euclidean distance. Our experiments
showed that this method provided a reasonable clustering of the text data sets.
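A minimal sketch of spherical k-means as we understand it from [85]: documents are normalized to unit length, assignments use cosine similarity (the dot product) instead of Euclidean distance, and centroids are re-normalized to the unit sphere each round. The function name and details are illustrative, not the exact implementation used in our experiments:

```python
import numpy as np

def spherical_kmeans(X, k, iters=30, seed=0):
    """Spherical k-means on a d x N term-frequency matrix (documents are
    columns): unit-normalize documents, assign by cosine similarity, and
    keep each centroid on the unit sphere."""
    X = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    rng = np.random.default_rng(seed)
    C = X[:, rng.choice(X.shape[1], size=k, replace=False)].copy()
    labels = np.zeros(X.shape[1], dtype=int)
    for _ in range(iters):
        labels = (C.T @ X).argmax(axis=0)        # cosine similarity = dot product
        for j in range(k):
            if np.any(labels == j):
                c = X[:, labels == j].sum(axis=1)
                C[:, j] = c / (np.linalg.norm(c) + 1e-12)  # re-normalize centroid
    return C, labels
```

Because both documents and centroids live on the unit sphere, the assignment step maximizes the total cosine similarity between each document and its cluster centroid.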
Table 3.4 shows the confusion matrices by method 2 for three iterations.
For the first iteration, we set K = 3. The results show that cluster C1 groups to-
[Figure 3.6 appears here: four average face images annotated 0.90, 0.42, 0.51 and 0.67.]

Figure 3.6: The average face image for each cluster in iteration 2. This clustering view corresponds to different poses.
gether the recreation and computer categories. The ten most frequent words from
this cluster suggested that the documents here share information related to enter-
tainment. Cluster C2 groups science and talks together, and the frequent words
confirm that it groups science and the religion part of the talk. Cluster C3 is a
mixture of different topics.
In iteration 2, we set K = 4 to see if we can partition the data to capture the
four categories “computer”, “recreation”, “talk” and “science”. From the confusion
matrix, we see that we were able to find these high level categories. C1 is about
computers; C2 contains news about recreation; and C3 groups those files related to
science. The last one C4 contains documents from the talk category that are related
to politics.
In iteration 3, two of the computer classes (graphics, os.ms) were grouped together with the talk category, while the remaining three computer classes were grouped
44
Table 3.4: Confusion Matrix for the Mini-Newsgroups DataITERATION1 (K=3) C1 C2 C3COMP.GRAPHICS 88 0 12COMP.OS.MS 95 0 5COMP.SYS.IBM.PC.HARDWARE 94 0 6COMP.SYS.MAC.HARDWARE 88 0 12COMP.WINDOWS.X 87 0 13REC.AUTOS 81 0 19REC.MOTORCYCLES 82 0 18REC.SPORT.BASEBALL 81 0 19REC.SPORT.HOCKEY 71 2 27SCI.CRYPT 0 68 32SCI.ELECTRONICS 0 76 24SCI.MED 0 78 22SCI.SPACE 0 74 26TALK.POLITICS.GUNS 0 70 30TALK.POLITICS.MIDEAST 0 61 39TALK.POLITICS.MISC 0 72 28TALK.RELIGION.MISC 0 77 23ITERATION2 (K=4) C1 C2 C3 C4COMP.GRAPHICS 98 0 2 0COMP.OS.MS 94 0 0 6COMP.SYS.IBM.PC.HARDWARE 78 15 3 4COMP.SYS.MAC.HARDWARE 66 20 3 11COMP.WINDOWS.X 43 39 5 13REC.AUTOS 28 51 6 15REC.MOTORCYCLES 17 59 6 18REC.SPORT.BASEBALL 10 67 4 19REC.SPORT.HOCKEY 5 62 4 29SCI.CRYPT 5 8 57 30SCI.ELECTRONICS 1 9 65 25SCI.MED 1 22 61 16SCI.SPACE 0 16 58 26TALK.POLITICS.GUNS 0 37 20 43TALK.POLITICS.MIDEAST 5 39 11 45TALK.POLITICS.MISC 3 45 6 46TALK.RELIGION.MISC 1 58 3 38ITERATION3 (K=4) C1 C2 C3 C4COMP.GRAPHICS 33 32 6 29COMP.OS.MS 42 23 10 25COMP.SYS.IBM.PC.HARDWARE 17 45 11 27COMP.SYS.MAC.HARDWARE 15 41 20 24COMP.WINDOWS.X 19 40 18 23REC.AUTOS 15 47 27 11REC.MOTORCYCLES 10 54 22 14REC.SPORT.BASEBALL 7 51 33 9REC.SPORT.HOCKEY 5 66 21 8SCI.CRYPT 5 15 68 12SCI.ELECTRONICS 10 9 65 16SCI.MED 31 8 46 15SCI.SPACE 15 24 48 13TALK.POLITICS.GUNS 49 19 18 14TALK.POLITICS.MIDEAST 45 24 16 15TALK.POLITICS.MISC 55 12 12 21TALK.RELIGION.MISC 56 8 20 16
Table 3.5: Confusion Matrix for WebKB Data

Iteration 1:
                C1    C2    C3    C4
  COURSE       134    12    81    17
  FACULTY        2    78    61    12
  PROJECT        1    47    28    10
  STUDENT        2    68   402    86

Iteration 2:
                C1    C2    C3    C4
  CORNELL      103    86    27    10
  TEXAS         50    87    83    32
  WASHINGTON    35    77   138     5
  WISCONSIN     60    86    30   132
together with the recreation category (i.e., auto, motorcycles and sports). This sug-
gests that our method continued to find interesting clustering structure that is dif-
ferent from the existing results.
Results for the WebKB Text Data
This data contains 1041 html documents, from four webpage topics: course, faculty,
project and student. Alternatively, the webpages can also be grouped by their
regions/universities, which include four universities: Cornell University, the University
of Texas at Austin, the University of Washington and the University of Wisconsin-Madison. Following the
same pre-processing procedure used for the mini-newsgroups data, we removed the
rare words, stop words, and words with low variances. Finally, we obtained 350
words in the vocabulary. The final data matrix is of size 350 × 1041.
The experimental results are quite interesting. For the first iteration, we
see our method found the partition that mostly corresponds to the different topics,
which can be seen in Table 3.5. Cluster 1 contains course webpages, cluster 2 is
a mix of faculty and project pages, and clusters 3 and 4 both consist of a majority of
student webpages. In the second iteration, our method found a different clustering
that corresponds to the universities, as shown in Table 3.5.
3.3 Automatically Finding the Number of Clusters
and Stopping Criteria
In this section, we investigate how we can fully automate the process of finding
non-redundant multiple clustering views by addressing the following two model
selection issues: (1) how to automatically determine the number of clusters in each
view, and (2) how to automatically determine when to stop generating alternative
views.
3.3.1 Finding the Number of Clusters by Gap Statistics
There are several ways to find the number of clusters, K, automatically. For exam-
ple, the Bayesian information criterion (BIC) [86] and Akaike's information criterion
(AIC) [87] find K by adding a model complexity penalty to the maximum likelihood
estimation of mixture models for clustering. X-means [88] extends the BIC score
to K-means clustering for finding K. Resampling methods [89, 90] attempt to find
the correct number of clusters by clustering on diversified samples of the data set,
and select the K which gives the most “stable” clustering result. Another approach
introduced in [91] is gap statistics. Gap statistics selects K to be the minimum
K for which the gap between the distribution of the observed data samples and a
non-structured null distribution is statistically significant. Note that any one of these
approaches can work with our framework. In this paper, we chose gap statistics
[91] in our experiments to automatically find the number of clusters K.
The basic idea of gap statistics is to compare the observed distribution of
the data samples to a null reference distribution; K is then selected to be the smallest
K whose gap from the null is statistically significant. The error measure

W_K = Σ_{r=1}^{K} (1/(2n_r)) D_r

(the within-cluster dispersion) decreases monotonically as the number of clusters K
increases. Statistical studies show that for some K the decrease flattens markedly,
and such an "elbow" indicates the appropriate number of clusters [91]. The gap at K
is defined as

Gap_N(K) = E*_N{log(W_K)} − log(W_K),

where E*_N denotes the expectation under a sample of size N from the reference
distribution. Here, we set our null reference to be a uniform distribution, as suggested
in [91]. D_r = Σ_{i,i'∈C_r} Σ_j (x_{ij} − x_{i'j})², so W_K is the sum of within-cluster
error, which is consistent with the objective function in k-means clustering (used in our
experiments).
To computationally implement gap statistics, we generate B copies of the null
reference by uniformly sampling each virtual feature over the range of the observed
values for that feature. Varying the total number of clusters K from 1 to K_max,
we cluster both the observations and the references. We compute the expected value
E*_N{log(W_K)} by averaging over the B copies, l̄ = (1/B) Σ_b log(W*_{Kb}), take W_K
as the sum of within-cluster error resulting from clustering the observed data, and
compute the gap

Gap(K) = l̄ − log(W_K).
Then, the clustering result with K clusters is said to be statistically significantly
different from the null reference if

Gap(K) ≥ Gap(K+1) − σ_{K+1}√(1 + 1/B),

where σ_K is the standard deviation, calculated as

σ_K = [(1/B) Σ_b {log(W*_{Kb}) − l̄}²]^{1/2}.

We thus select the optimal K_opt to be the smallest K that achieves this statistical
significance, i.e., Gap(K) ≥ Gap(K+1) − σ_{K+1}√(1 + 1/B). We apply this estimation
method to find the number of clusters in each iteration/view of our multi-view
clustering framework.
3.3.2 Stopping Criteria
A major difference between our proposed method and previous works on non-
redundant clustering [25] is that we aim to generate a sequence of non-redundant
alternative views of the data instead of just one alternative view. Consequently, we
need an approach to guide our algorithm to know when to stop. We propose three
stopping criteria for our framework. When one of the three criteria is met, we stop
the process and return all the partitionings we currently have as the output result.
The first two criteria are based on model estimation for finding K as dis-
cussed in the previous section. Specifically, when Kopt = 1, we know that there
is only one cluster in the remaining data. That means there is no more interesting
structure in the residue space and we should stop.
Secondly, we are also not interested in views with very small clusters which
only contain very few data points. This implies that the data is non-structured and
the clustering algorithm is simply trying to memorize it (an example of over-fitting
in unsupervised learning). Thus, when Kopt is very large compared to the number of
data samples, we should also stop to keep the clustering algorithm from breaking the
data into negligibly small clusters. Similarly, when none of the candidate values of K from
1 to a high K_max³ achieves a statistically significant gap from the null distribution,
we should also stop. This indicates that the residue that is left is simply uniform
random noise.
The third criterion we track at each iteration is the sum-square-error (SSE)
we defined for the two methods. When the SSE is very small, we know that the
existing partitionings already cover most of the original space. Since there is no
residue component left, we should stop. In this work, when the ratio between the
first singular value of the original space and the current subspace is very small (i.e.,
less than 10%), we conclude that the SSE loss is very small compared to the original
variance of the data, hence we stop iterating.

³In our experiments, we set a large K_max such that the average number of data points in each
cluster is less than 1.5% of the total number of data samples.
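This singular-value-ratio check is straightforward to express in code. The sketch below is illustrative, assuming numpy; the function name and the default 10% threshold mirror the text but are otherwise our own.

```python
import numpy as np

def should_stop(X_orig, X_residual, tol=0.10):
    """Stop iterating when the leading singular value of the residual space
    drops below `tol` (10%) of the leading singular value of the original
    data, i.e. when the SSE left to explain is negligible."""
    s_orig = np.linalg.svd(X_orig, compute_uv=False)[0]
    s_res = np.linalg.svd(X_residual, compute_uv=False)[0]
    return s_res / s_orig < tol
```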
3.3.3 Case Studies for Synthetic Data II, Face and Text Data
The two approaches for finding non-redundant clustering views combined with
finding K and the stopping criteria, provide us with a completely automated frame-
work to solve the non-redundant multi-view clustering problem. To successfully
use this technique for mining interesting structures, the data itself needs to be
rich. For example, applying our proposed automated framework on the synthetic
data I, we will find K = 4 and stop at the first iteration. A similar situation holds
for the digit data. In this section, we will demonstrate the effectiveness of the auto-
mated framework for finding K and stopping the search for views on data that are
known to contain complex structures (such as, synthetic data II, face and the We-
bKB text data). In these experiments, we set the gap statistics parameter B = 50
for all data except the face data. Since the face data has a high dimensionality over
1000, we set B = 80 for a higher confidence in the obtained statistical significance
results.
Synthetic Data
In order to show if our approach can appropriately stop finding alternative views
when only noise is left, we modified the synthetic data II in our experiment by
adding two more dimensions, feature 5 and feature 6, with uniformly distributed
random noise. Figure 3.7 shows the entire procedure and the results. Sub-figures
a1-a3 show the data structures in the different subspaces {f1, f2}, {f3, f4} and
{f5, f6}, respectively. Sub-figures b1-b3 show the bar plot of
ΔGap(K) = Gap(K) − (Gap(K+1) − σ_{K+1}√(1 + 1/B)) for K from 1 to K_max = 4 for each
iteration. As we discussed in the previous section, the minimum K which leads to
ΔGap(K) > 0 is our optimal K. We observed that in iteration 1, K = 3 is the
first to satisfy ΔGap(K) > 0. Thus, the optimal K found by gap statistics is 3,
and the corresponding clustering revealed the three clusters in the feature 3 and 4
subspace. In iteration 2, the optimal K found is also three and we found the clusters
in the feature 1 and 2 subspace. For the third iteration, we obtained an optimal K
of one. We, thus, stopped correctly. In contrast, in the experiment shown in Section
3.2.1, without knowing K, the process continued to find a meaningless structure
in iteration 3. Note too that the gap statistics method for finding K was able to
discover the correct number of clusters for the two alternative views.
[Figure: scatter plots of the synthetic data in the {f1, f2}, {f3, f4} and {f5, f6}
subspaces (a1-a3); bar plots of ΔGap(K) for K = 1, ..., 4 in iterations 1-3 (b1-b3);
and the sum-square-error for each iteration (c).]

Figure 3.7: ΔGap and SSE results in each iteration for the Synthetic II data set.
(a. clustering in iteration1) (b. clustering in iteration2)
Figure 3.8: Different partitionings for the face data in different iterations.
Face Data
Face data is a rich data set containing various interesting partitionings. In the first
iteration, we run gap statistics for the original data set, and the optimal K is 14.
Then, in the second iteration, the optimal K found is four, corresponding
to the four poses. In the third iteration, by gap statistics, we find that
Gap(K) − (Gap(K+1) − σ_{K+1}√(1 + 1/B)) = 0.0121 > 0, with K = 1, B = 80.
That means only one cluster is left in the current residue space, indicating that
we should stop. Figure 3.8 displays the cluster means in each iteration by our
automated scheme.
Text Data
Finally, we study a text data, the WebKB data in particular. It is a complex data
set and has high dimensions. Rather than simply keeping 90% of the variance in
determining the number of dimensions to keep in PCA, we instead heuristically ex-
amined the singular values of the data to select the dimensions to keep. Figure 3.9a
presents a plot of the gap between consecutive singular values, i.e., s_i − s_{i+1}, of
this data. This figure reveals that the gap reaches its peak at s_3 − s_4 in iteration 1.
That means the 4th singular value drops abruptly. We, thus, project the data onto
the first three principal eigenvectors. By gap statistics, we estimate that the optimal
K is four, corresponding to the clusters course, faculty, project, student. In the second
iteration, we perform singular value decomposition again and find that the largest
decrease in the singular values occurs between the second and third singular values,
as shown in Figure 3.9b, and thus keep two dimensions. Gap statistics determined
the optimal K for the second iteration to be four corresponding to clusters based on
institution. These results agree well with the true structures of the data.
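This largest-consecutive-drop heuristic is easy to automate. The following is an illustrative numpy sketch, not the thesis code; the function name is ours, and we assume the data matrix is already zero-centered as elsewhere in this work.

```python
import numpy as np

def dims_by_sv_gap(X, max_check=20):
    """Keep i dimensions, where s_i - s_{i+1} is the largest drop between
    consecutive singular values (X assumed zero-centered)."""
    s = np.linalg.svd(X, compute_uv=False)[:max_check]
    drops = s[:-1] - s[1:]              # drops[i-1] = s_i - s_{i+1} (1-indexed)
    return int(np.argmax(drops)) + 1    # peak at s_3 - s_4 -> keep 3 dims
```

For instance, a matrix whose singular values fall as 10, 8, 6, 0.1, ... yields 3.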
In the third iteration, we do not find the optimal K within the range of 70.
This implies that the residue space has a distribution close to uniform noise and is
thus non-interesting. Hence, we stop after iteration 2. Table 3.6 shows the confusion
matrix we obtain in each iteration. We see that it coincides well with the original
clustering results generated on the true number of clusters.
[Figure: two plots of the consecutive singular-value drops s_i − s_{i+1} for the WebKB data.]

Figure 3.9: The drop in singular values, s_i − s_{i+1}. Left: the gap of consecutive
singular values in iteration 1. Right: the gap of consecutive singular values in iteration 2.
Table 3.6: Confusion Matrix for WebKB Data based on Gap Statistics

Iteration 1:
                C1    C2    C3    C4
  COURSE       111    17    99    17
  FACULTY        3    74    63    13
  PROJECT        2    45    29    10
  STUDENT       11    53   402    92

Iteration 2:
                C1    C2    C3    C4
  CORNELL       88    81    37    20
  TEXAS         63    83    71    35
  WASHINGTON    74    69   102    10
  WISCONSIN     42    89    40   137
3.4 Conclusions for Multi-View Clustering Methods
Given data, the goal of exploratory data analysis is to find interesting structures,
which may be multi-faceted by nature. Clustering is a popular tool for exploring
data. However, many clustering algorithms only find one single clustering solution.
Our main contribution in this paper is that we introduced a new paradigm for ex-
ploratory data clustering that seeks to extract all non-redundant clustering views
from a given set of data.
We presented a general framework for extracting multiple clustering views
from high dimensional data. In essence, this framework works by incorporating
orthogonality constraints into a clustering algorithm. In other words, the clustering
algorithm will search for a new clustering in a space that is orthogonal to what has
been covered by existing clustering solutions. We described two different ways for
introducing orthogonality and conducted a variety of experiments on both synthetic
and real-world benchmark data sets to evaluate these methods. Our results show
that our proposed framework was able to find the different substructures of the
data or different structures embedded in different subspaces in different views on
synthetic data. Similarly, on the benchmark data sets, our methods not only found
different clustering structures in different iterations, but also discovered clustering
structures that are sensible, judging from the various evaluation criteria reported
(such as, confusion matrices and scatter plots). For example, on the face data,
PCA+K-means identified individuals in the first view/iteration, and in the second
view/iteration, our methods discovered clusters that correspond nicely to different
facial poses. For the other data sets, we observed similar results in the sense that
different concept classes were identified in different views.
Furthermore, we presented a fully automated version of the proposed framework
by automatically estimating the number of clusters K in each iteration through
gap statistics, and by automatically determining when to stop searching for alternative
views based on the estimated K and the sum-square-error of the clustering
solution in each iteration. Experiments on synthetic and benchmark data showed
that the proposed framework stopped searching for alternative views appropriately,
when the residue space left was just noise.
Note that in this paper we use k-means and spherical k-means as the basic
clustering algorithms. However, the framework is not limited to these choices and can
be instantiated using any clustering algorithm. Future directions will be to explore
the framework with other clustering methods.
Chapter 4
Orthogonal Principal Feature Selection via Component Analysis
In this chapter, we present a feature selection algorithm based on principal component
analysis (PCA) and orthogonalization, which we call principal feature selection
(PFS). The non-redundant orthogonal framework we use in clustering described in
Chapter 3 was originally inspired by our study on this feature subset selection prob-
lem. In Section 4.1, we define our notations and provide a review on singular value
decomposition. In addition, we describe two ways of viewing a data matrix, which
will be useful in motivating our orthogonal feature selection approach. Then, we
present our orthogonal feature selection method via PCA in Section 4.2. In Sec-
tion 4.3, we explain the differences in goals between sparse principal component
analysis and PFS. We report our experimental results in Section 4.4, a discussion
of extending orthogonal feature selection to linear discriminant analysis in Section
4.5 and finally conclude in Section 4.6.
4.1 Background and Notations
In this section, we provide: the notations that will be used throughout this chapter,
a short background review on singular value decomposition (SVD), and a presen-
tation of the dual space representation of a data matrix, which will be illustrative in
motivating our approach.
4.1.1 Notations
Let X = [x_1 x_2 ··· x_N] = [f_1 f_2 ··· f_d]^T denote a set of N samples in R^d (i.e.,
x_i ∈ R^d) or d features in R^N (i.e., f_j ∈ R^N), where (·)^T denotes the transpose of
the matrix (·). span{·} represents the spanning space of (·). Note that x_i and f_j are
column vectors and X is of size d × N. X^{(t)} is the data matrix in the t-th iteration,
after t − 1 projections. f_i^{(t)} is the i-th feature, or i-th row, of X^{(t)}. f̂_i is the
projection of the i-th feature onto the subspace spanned by the selected features.
Without loss of generality, and to avoid cluttered notation, in this work we arrange
the features in the order they are selected. For example, we permute the first selected
f_i to the first row of X and call it f_1.
4.1.2 Background Review on SVD and Definition of Terms
PCA can be solved using singular value decomposition (SVD). Without loss of
generality, assume the data set X is a d × N zero-centered matrix, X = [f_1 ··· f_d]^T =
[x_1 ··· x_N]. The SVD [92] solution of X is:

X = U S V^T.

Accordingly, we have X v_k = s_k u_k, where u_k is the k-th column of U (also called
the k-th principal eigenvector, whose entries are the loadings of the principal
components), v_k is the k-th column of V (also called the k-th principal component)
and s_k is the k-th singular
(a) Uncorrelated in data space  (b) Uncorrelated in feature space
(c) Correlated in data space  (d) Correlated in feature space

Figure 4.1: Three uncorrelated data points in 2D space: (a) in the data space view,
and (b) in the feature space view. Three correlated data points in 2D space: (c) in
the data space view, and (d) in the feature space view.
value. Furthermore, v_k = X^T u_k / s_k: the k-th principal component (PC) is equal to
the projection of X onto the k-th principal eigenvector divided by s_k, and u_k contains
the loadings of all the variables in the k-th eigenvector.
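The relation v_k = X^T u_k / s_k follows directly from X = U S V^T and is easy to confirm numerically. The snippet below is a small numpy check with illustrative variable names, not part of the thesis itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
X -= X.mean(axis=1, keepdims=True)      # zero-center each feature (row)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# v_k = X^T u_k / s_k: the k-th principal component equals the projection of X
# onto the k-th principal eigenvector, scaled by 1/s_k.
for k in range(len(s)):
    assert np.allclose(X.T @ U[:, k] / s[k], Vt[k])
```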
4.1.3 Dual Space Representation of a Data Matrix and Statisti-
cal Correlation
To build our orthogonal principal feature selection algorithm, we need to look at
data in a dual space representation. Let us consider a data matrix X ∈ R^{d×N}, with
d features and N samples. Its columns can be viewed as data points, x_i ∈ R^d.
An alternative way of viewing X is as a scatter or configuration of d points in RN ,
with each axis associated with an individual data point and each of the d vectors
representing a variable, fj .
Statistical correlation between features is the non-random association between
two variables. Figure 4.1 illustrates this dependence with scatter plots of two
data sets that contain three data points in 2D space. In Figure 4.1a, the data points
are evenly spread out in the data space, which means the two features are not
correlated. In Figure 4.1c, the data points lie almost on a line in the data space,
which means the two features are highly correlated. This property can be clearly
viewed in the feature space, as shown in Figures 4.1b and 4.1d. Note that the original
features, f_i and f_j, may be statistically correlated even though they are shown as
orthogonal axes in Figure 4.1c (data space view). Two features are statistically
uncorrelated if ⟨f_i, f_j⟩ = 0, and they are shown as orthogonal vectors in Figure
4.1b (feature space view). The statistical correlation between features is defined
as corr(f_i, f_j) = cos(f_i, f_j) = ⟨f_i, f_j⟩ / (||f_i|| ||f_j||), assuming that the features are zero-centered.
Note that without loss of generality, we assume that X is zero-centered throughout
this chapter. We take advantage of this dual space view in describing and motivating
our approach.
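In code, this feature-space view of correlation is just the cosine between zero-centered feature vectors, and it matches the usual Pearson correlation. A small numpy sketch (the function name is ours):

```python
import numpy as np

def feature_corr(fi, fj):
    """corr(f_i, f_j) = cos(f_i, f_j) between zero-centered feature vectors."""
    fi = fi - fi.mean()
    fj = fj - fj.mean()
    return float(fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj)))
```

This agrees with `np.corrcoef(fi, fj)[0, 1]`.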
4.2 Feature Selection via PCA
Principal component analysis (PCA) is a popular tool for data analysis and dimen-
sionality reduction. Dimensionality reduction transformations can have a variety of
optimality criteria, and ten of them discussed in [46] lead to the principal compo-
nents solutions, which is one reason for the popularity of this analysis. We name
two of the most common criteria here. PCA finds linear combinations of the vari-
ables, the so-called principal components (PCs), which correspond to the subspace
of maximal variance in the data. This property is to keep the maximum “spread”
in the selected lower dimensional subspace. Another optimality criterion is to find
the linear transformation such that the sum-squared-error between the original data
and the predicted data is minimized. Given a data set X ∈ R^{d×N} of N observed
d-dimensional vectors x_i, PCA finds the linear transformation Y = A^T X, where
A ∈ R^{d×q}, q < d, that minimizes the sum-squared-error ||X − AY||², subject to
the constraint A^T A = I (the columns of A are orthonormal). The PCA solution is
A equal to the q dominant eigenvectors corresponding to the q largest eigenvalues
of the covariance matrix of X, Σ_x. The sum-squared-error is then Σ_{j=q+1}^{d} λ_j,
where the λ_j are the remaining eigenvalues.
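This error identity is easy to verify numerically: reconstruct the data from the q dominant eigenvectors and compare the squared error with the discarded eigenvalues. The snippet below is an illustrative check (using a 1/N covariance, so the error equals N times the discarded eigenvalue sum), not thesis code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 200))
X -= X.mean(axis=1, keepdims=True)       # zero-center each feature

cov = X @ X.T / X.shape[1]               # covariance (1/N convention)
lam, E = np.linalg.eigh(cov)             # eigenvalues in ascending order
A = E[:, -2:]                            # q = 2 dominant eigenvectors
Y = A.T @ X                              # projected data
sse = np.linalg.norm(X - A @ Y) ** 2

# ||X - AY||^2 equals the sum of the discarded eigenvalues (times N, since
# cov was scaled by 1/N).
assert np.isclose(sse, lam[:2].sum() * X.shape[1])
```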
4.2.1 PCA Orthogonal Feature Selection
The goal of PCA orthogonal feature selection is to find a subset of features that min-
imizes the sum-squared-error between the original data and the data represented by
the selected features, subject to the constraint that the selected features are as
uncorrelated as possible.¹ We would like to have an objective function that penalizes
both squared-error and correlation. Two features x and y are correlated if one can
linearly predict y from x and vice versa. We can, thus, incorporate the penalty for
correlation to the sum-squared-error criterion as follows: The error in feature selec-
tion is the error due to features not selected minus the portion of the features not se-
lected that can be linearly predicted (correlated) by the selected features (spanned
by the selected features). More formally, we can express the objective function as
presented below. Assume that, after permutation, f_1, ···, f_q are the q selected features,
the unselected features are f_{q+1}, ···, f_d, and f̂_{q+1}, ···, f̂_d are the projections of the
unselected features onto the space spanned by those selected. The sum-squared-error is:

SSE = ||[f_1, ···, f_d] − [f_1, ···, f_q, f̂_{q+1}, ···, f̂_d]||²
    = ||[f_{q+1}, ···, f_d] − [f̂_{q+1}, ···, f̂_d]||²    (4.1)

¹Note that we cannot constrain the selected features to be orthogonal because we are not allowed
to select transformed (or new) features, only a subset of the original.
Referring to the dual space representation in Figure 4.1, the goal of the or-
thogonal feature selection problem is to find the minimum subset of features that
spans the space of all feature vectors in the feature space view. Note that sum-
squared-error is calculated in the feature space. It is the sum of the squared distance
of those unselected features to the subspace spanned by those selected ones, as
shown in Figure 4.2. In the example displayed in Figure 4.2, there are three fea-
tures in all. Features 1 and 2 are selected; thus, the error e is the squared distance
from feature 3 to the plane spanned by features 1 and 2.
Figure 4.2: SSE between selected features and all the features.
4.2.2 Orthogonal Feature Search
There are several ways to search the feature subset space to optimize an objective
function as pointed out in the related work section, Section 2.2. Here, we apply
feature transformation (the solution from PCA) to perform our search. Let X ∈ R^{d×N}
be our original data and X^{(t)} be the remaining data orthogonal to the chosen
features. We set the iteration index t = 1, X^{(1)} = X, and the data is zero-centered. Figure 4.3
shows the framework of our feature selection process. Our method has three main
steps: (1) perform PCA on X^{(t)} to get the first eigenvector, (2) pick the feature most
correlated with the eigenvector, (3) project the data X^{(t)} onto the space orthogonal to
the chosen feature to get the residual space, X^{(t+1)}. We repeat these steps until the
number of features desired is obtained. We motivate each of these steps below.
Figure 4.3: The general framework for our feature selection process.
Step 1: PCA to get the first eigenvector. Our approach selects the features one
at a time. However, instead of looking at the effect of one feature at a time, PCA
provides a global view (i.e., takes feature interaction into account) on which feature
combination provides the largest variance (most relevant with respect to our objec-
tive function). In some sense, PCA projection performs some kind of “look-ahead”
in the feature search process. In the data space view, the first eigenvector, u1, is the
direction of largest spread (variance) among the data samples xi.
Step 2: After finding the largest eigenvector, which feature should we pick? We
select the feature which is most correlated with the largest eigenvector. In
this work, we call this selected feature the principal feature. A feature f_j is in R^N,
whereas the eigenvector u_1 is in R^d. What do we mean by correlation here? u_1
is actually a transformation that leads to an extracted new feature f_new = X^{(t)T} u_1
(the projection of X^{(t)} onto u_1), where f_new ∈ R^N. We select the feature f_j from
the original set in X which has the largest correlation with f_new (i.e., the feature
f_j closest to f_new in terms of cosine distance in the feature space view). In [1],
the authors select the feature with the largest loading (coefficient in the eigenvector) of
the largest eigenvector, whereas PFS utilizes correlation. Note that the feature with
the largest loading does not necessarily correspond to the feature with the highest
correlation with f_new. Proposition 1 below explains their relationship.
Proposition 1: The feature with the largest loading is not, in general, the feature
with the largest correlation with the projection of an eigenvector.
Maximizing the correlation is equivalent to maximizing the loading,
argmax_j corr(f_j, v_k) = argmax_j |α_j|, only if the features f_j have unit
norm, ||f_j|| = 1.
Proof: Let X = U S V^T be the SVD solution of X. For U = [u_1, ···, u_d], each
u_k = [α_1, ···, α_d]^T, where α_j, j ∈ {1, ···, d}, is the loading of the j-th variable
in the k-th eigenvector. Then, we get [f_1 ··· f_d]^T v_k = s_k [α_1, ···, α_d]^T
(i.e., [f_1^T v_k, ···, f_d^T v_k]^T = [s_k α_1, ···, s_k α_d]^T). Looking at it element-wise,
we have f_j^T v_k = s_k α_j. Therefore, if |α_j| is the maximum, |s_k α_j| is the
maximum, and |f_j^T v_k| is the maximum. The correlation between the j-th feature
and the k-th PC in the feature space is then

corr(f_j, v_k) = |f_j^T v_k| / (||f_j|| ||v_k||) = |s_k α_j| / ||f_j||,

because ||v_k|| = 1. Thus, if ||f_j|| = 1, i.e., feature j has unit norm, then
corr(f_j, v_k) = |f_j^T v_k| = |s_k α_j| is the maximum since |α_j| is the maximum.
Based on Proposition 1, we can speed up our correlation computation by
using the loading of a feature divided by the norm of that feature, |α_j| / ||f_j||, to
select features. In Section 4.2.3, Property 2, we prove that this property holds even
in the residual spaces, X^{(t+1)}.
Step 3: After selecting a feature, how do we reduce the search space? To
keep the features as uncorrelated as possible, we project the current data at time
t, X^{(t)}, onto the subspace orthogonal to the selected feature f_t^{(t)}. Here, f_t^{(t)} is the
component of the currently selected feature f_t in span{X^{(t)}}. This makes
X^{(t+1)} uncorrelated to all the features selected from time 1 to time t; refer to
Property 3 in Section 4.2.3.
Algorithm 3 Pseudo-code for orthogonal feature selection via PCA (PFS).

Inputs: The data matrix X ∈ R^{d×N}, and the number of features q to retain.
Outputs: The q selected features.
Pre-processing: Zero-center the data set.
Initialization: t = 1, X^{(1)} = X, Q_select = {}.

Step 1 (Find the first eigenvector): View the data set as N samples in R^d.
Perform SVD on X^{(t)} and find the principal component (PC) v_1^{(t)} with
the largest eigenvalue.

Step 2 (Select principal feature): Select the feature f_i from the original features
in X that correlates best with v_1^{(t)}. Add f_i to the set of selected features Q_select,
and remove f_i from the original set in X. The highest correlation corresponds to
the maximum absolute value of the loading in u_1^{(t)} of a feature divided by the norm
of that feature. If the norm of a feature is zero, we set the correlation to zero,
meaning that it is farthest from the PC.

Step 3 (Orthogonalize): View the data set as d features in R^N. Find the subspace
orthogonal to the variable selected in the current space, f_i^{(t)}, by
P^{(t)} = I − f_i^{(t)} f_i^{(t)T} / (f_i^{(t)T} f_i^{(t)}). Project the data set X^{(t)} onto that subspace:
X^{(t+1)} = X^{(t)} P^{(t)}; t = t + 1. Note that the i-th row of X^{(t+1)} is 0, and all the
rows corresponding to previously selected features remain 0.

Step 4 (Repeat): Repeat Steps 1-3 until q features are selected.
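Steps 1-3 translate almost line for line into numpy. The sketch below is an illustrative implementation of Algorithm 3 under the stated assumptions (zero-centered d × N input, rows as features); the function name is ours and this is not the thesis code.

```python
import numpy as np

def principal_feature_selection(X, q):
    """PFS sketch: X is d x N (rows are zero-centered features).
    Returns the indices of the q selected features, in selection order."""
    Xt = np.asarray(X, dtype=float).copy()
    selected = []
    for _ in range(q):
        # Step 1: leading left singular vector = loadings of the first PC.
        U, _, _ = np.linalg.svd(Xt, full_matrices=False)
        u1 = np.abs(U[:, 0])
        # Step 2: highest correlation = |loading| / feature norm (Proposition 1);
        # zero-norm features get score 0 (farthest from the PC).
        norms = np.linalg.norm(Xt, axis=1)
        scores = u1 / np.where(norms > 0, norms, np.inf)
        scores[selected] = -np.inf
        i = int(np.argmax(scores))
        selected.append(i)
        # Step 3: project every feature onto the subspace orthogonal to f_i^(t).
        fi = Xt[i]
        Xt = Xt - np.outer(Xt @ fi, fi) / (fi @ fi)
    return selected
```

On a toy matrix whose first two rows are identical, the second iteration skips the duplicate (its residual is zero) and picks the remaining informative feature.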
As we have pointed out in Section 4.1.3, uncorrelated features are orthogonal
in the feature space view of the dual space representation of a data matrix. It is
convenient to look at X^{(t)} in the feature space view, X^{(t)} = [f_1^{(t)} ··· f_d^{(t)}]^T.
To compute the subspace orthogonal to a feature f_i^{(t)}, we use the projection matrix
P_{⊥f_i^{(t)}} = I − f_i^{(t)} f_i^{(t)T} / (f_i^{(t)T} f_i^{(t)}). The residual feature space X^{(t+1)} is X^{(t)}
projected onto this orthogonal subspace, X^{(t+1)} = X^{(t)} P^{(t)}. This means we project
each of the remaining features, f_j^{(t)}, onto the subspace orthogonal to f_i^{(t)}. All the
previously selected features, including i, remain all zeros. Thus, we remove
the component of f_j^{(t)} that can be linearly predicted by f_i^{(t)}. What is left is the
residual X^{(t+1)} that cannot be linearly explained by f_i^{(t)}.
In the next iteration, we reapply PCA on X^{(t+1)} and select, from the original
set in X (excluding the features selected so far), the feature that correlates best with the
first principal component of X^{(t+1)}. We repeat the process until we have selected
q features or the error is smaller than a threshold. Our orthogonal feature selection
method via PCA is summarized in Algorithm 3.
4.2.3 Properties
Note that since we have already selected fi, we do not want features that are in the
spanning space of fi (correlated to fi). Orthogonal projection presented above will
solve this problem. Several nice properties of this approach are:
Property 1: Let f̃_2 be the projection of f_2 onto the null space of f_1; then
span{f_1, f_2} = span{f_1, f̃_2}.
Property 2: Correlation and Loading. We can utilize the loading divided by the
norm of the feature to speed up the correlation computation even in the residual
space, because f_i^T v_1^{(t)} = f_i^{(t)T} v_1^{(t)}. Since v_1^{(t)} and f_i^{(t)} are in the residual space,
which is orthogonal to the previously selected features, we have f_i − f_i^{(t)}
orthogonal to v_1^{(t)}, i.e., ⟨f_i − f_i^{(t)}, v_1^{(t)}⟩ = 0.
Property 3: X^{(t+1)} is uncorrelated to f_1, ···, f_t.

Proof: By mathematical induction.

1. After the first iteration, we pick the first feature f_1 and project the data onto
its orthogonal space to get X^{(2)}. Thus, ⟨X^{(2)}, f_1⟩ = 0.

2. For each t > 1, assume X^{(t)} is orthogonal to the currently selected feature
set f_1, f_2, ···, f_{t−1}, i.e., ⟨X^{(t)}, f_1⟩ = 0, ···, ⟨X^{(t)}, f_{t−1}⟩ = 0, and in the
current iteration f_t is selected.

After performing orthogonalization, as indicated in Step 3 of our algorithm,
we have X^{(t+1)} = X^{(t)} − X^{(t)} f_t^{(t)} f_t^{(t)T} / (f_t^{(t)T} f_t^{(t)}). For all 1 ≤ i < t, we have

⟨X^{(t+1)}, f_i⟩ = ⟨X^{(t)}, f_i⟩ − X^{(t)} f_t^{(t)} (f_t^{(t)T} f_i) / (f_t^{(t)T} f_t^{(t)}) = 0 − 0 = 0,

since f_t^{(t)} is just the t-th row (after permutation) of X^{(t)}, and is therefore
orthogonal to f_1, f_2, ···, f_{t−1}.

For i = t, ⟨X^{(t+1)}, f_t⟩ = ⟨X^{(t+1)}, Σ_{i=1}^{t−1} c_{ti} f_i + f_t^{(t)}⟩, where the c_{ti} are
constants, since f_t can be written as the part covered by the selected features,
which belongs to span{f_1, ···, f_{t−1}}, plus the residue part f_t^{(t)}. This leads to

⟨X^{(t+1)}, f_t⟩ = X^{(t)} f_t^{(t)} − X^{(t)} f_t^{(t)} (f_t^{(t)T} f_t^{(t)}) / (f_t^{(t)T} f_t^{(t)}) = X^{(t)} f_t^{(t)} − X^{(t)} f_t^{(t)} = 0.

Therefore, ⟨X^{(t+1)}, f_1⟩ = 0, ···, ⟨X^{(t+1)}, f_t⟩ = 0 (i.e., X^{(t+1)}
is uncorrelated to f_1, ···, f_t). The f_i^{(i)} form an orthogonal basis for the
selected features.
Property 4: Residual space has zero mean. After each projection, the data X^(t+1) still has zero mean.

Proof Without loss of generality, and to simplify notation, normalize the selected feature to unit norm, f_n = f/||f||. Then the projection matrix is P = I − f f^T/||f||^2 = I − f_n f_n^T. X has zero mean, i.e., Σ_{k=1}^N x_k = 0. Now Y = XP = X − (X f_n) f_n^T = X − [(f_1^T f_n) f_n, ..., (f_d^T f_n) f_n]^T, so

\[ \sum_{k=1}^{N} y_k = \sum_{k=1}^{N} x_k - \sum_{k=1}^{N} f_{nk} (X f_n) = 0 - 0 \cdot (X f_n) = 0, \]

where f_{nk} is the k-th element of the N-by-1 vector f_n; the elements of f_n sum to zero because the features themselves have zero mean.
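Properties 3 and 4 can be checked numerically on toy data (our own small example, not from the thesis):

```python
import numpy as np

# After projecting zero-mean data onto the null space of a selected feature,
# the residual is orthogonal to that feature (Property 3) and every residual
# feature still has zero mean (Property 4).
np.random.seed(1)
X = np.random.randn(6, 40)
X = X - X.mean(axis=1, keepdims=True)      # zero-mean features (rows)

f1 = X[2]                                  # suppose feature 2 was selected
X2 = X - np.outer(X @ f1, f1) / (f1 @ f1)  # orthogonal projection step

assert np.allclose(X2 @ f1, 0)             # Property 3: residual is uncorrelated to f1
assert np.allclose(X2.mean(axis=1), 0)     # Property 4: zero mean preserved
```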
Property 5: Retained Variance. Similar to PCA, where the variance retained is the sum of the eigenvalues corresponding to the eigenvectors that are kept, in PFS the variance retained is the sum of the variances of each feature when projected onto the space in which that feature is selected (due to the orthogonality of the spaces):

\[ RetainedVariance = \sum_{t=1}^{q} var(f_t^{(t)}), \]

where var(·) is the variance and f_t^{(t)} is the feature f_t in the orthogonal space span{X^{(t)}}. In fact, {f_1^{(1)}, f_2^{(2)}, ..., f_q^{(q)}} form an orthogonal basis that spans the selected q original variables. One can use the desired proportion of variance to select the number of features to retain, as in conventional PCA.
Property 6: Convergence. SSE^(t) ≥ SSE^(t+1), and SSE is bounded below by zero. Thus, the algorithm is guaranteed to converge to a local minimum.

Proof Since the SSE is a sum of squared errors, it is greater than or equal to zero. Let the f_i be the original features sorted in the order of our selection, and normalize each orthogonal basis vector obtained by PFS, {f_1^{(1)}, ..., f_q^{(q)}}, to obtain an orthonormal basis f̂_1, ..., f̂_q with f̂_1 = f_1^{(1)}/||f_1^{(1)}|| and so on. Then, at iterations t and t + 1,

\[ SSE^{(t)} = \sum_{j=t+1}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t} \langle f_j, \hat f_i \rangle^2 \Big\}, \qquad SSE^{(t+1)} = \sum_{j=t+2}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t+1} \langle f_j, \hat f_i \rangle^2 \Big\}, \]

where the projection of f_j onto f̂_i is ⟨f_j, f̂_i⟩ since f̂_i has unit length. Hence

\[ SSE^{(t)} - SSE^{(t+1)} = \Big( \|f_{t+1}\|^2 - \sum_{i=1}^{t} \langle f_{t+1}, \hat f_i \rangle^2 \Big) + \sum_{j=t+2}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t} \langle f_j, \hat f_i \rangle^2 \Big\} - \sum_{j=t+2}^{d} \Big\{ \|f_j\|^2 - \sum_{i=1}^{t} \langle f_j, \hat f_i \rangle^2 - \langle f_j, \hat f_{t+1} \rangle^2 \Big\} \]
\[ = \underbrace{\|f_{t+1}\|^2 - \sum_{i=1}^{t} \langle f_{t+1}, \hat f_i \rangle^2}_{\ge 0} + \underbrace{\sum_{j=t+2}^{d} \langle f_j, \hat f_{t+1} \rangle^2}_{\ge 0} \;\ge\; 0, \]

where the first term is non-negative by Bessel's inequality.
4.2.4 Illustrative Example
In order to clearly show how we perform our algorithm, in this section, we describe
a simple illustrative example on the Iris data from the UCI repository [82]. Iris is
a small data set with 4 features and 150 data points. Figure 4.4 demonstrates the
whole feature selection process. Starting from the upper left, we show the Iris data set X, whose rows are the features f_1 to f_4 and whose columns are the data points x_1 to x_150.
We initialize by setting X^(1) = X − µ, i.e., X with its mean µ subtracted. In iteration 1, we perform SVD on X^(1). We obtain the loading α_j of each feature on the first PC, v_1, and compute |α_j|/||f_j||. We can see that f_3 has the maximum value; thus, we select f_3 in the first iteration. After f_3 is selected, we project our data onto the space orthogonal to f_3, which gives us X^(2). In X^(2), the third row goes to zero.
In iteration 2, we repeat the process. We first perform SVD and compute |α_j^(2)|/||f_j^(2)||. This time f_1^(2) has the maximum value and should be selected. Then X^(2) is projected onto the orthogonal space of f_1^(2) to obtain X^(3). Now, the first
Figure 4.4: A simple illustrative example of PFS.
row in X(3) goes to zero. Note that the third row in X(3) remains zero. Then f1
is added to our selected feature set. The SSE for each step is shown at the bottom.
When all four features are selected, the SSE converges to zero. We repeat this process, selecting features one by one, until the SSE reaches zero or the desired number of features q is selected.
Observe that the residual data matrix X^(t) has zero rows corresponding to the t − 1 selected features. In our implementation, instead of keeping these zero rows, we simply remove them for simplicity and computational efficiency.
In this figure, we also provide the correlation of each feature with the first PC to illustrate Property 2 and Proposition 1. Observe that the correlation is equal to |α_j|/||f_j|| times a constant s (the corresponding singular value of the first PC). This shows that selecting the feature that maximizes |α_j|/||f_j|| is equivalent to selecting the feature that maximizes the correlation, and hence the former can be used to speed up our calculations.
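This equivalence can be verified numerically on toy data (our own small example):

```python
import numpy as np

# For zero-mean X = U S V^T, the correlation of feature j with the first PC
# scores v1 is (f_j . v1)/||f_j|| = s1 * u1[j] / ||f_j||, so ranking features
# by |loading|/norm equals ranking by |correlation|.
np.random.seed(2)
X = np.random.randn(4, 100)
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
u1, v1, s1 = U[:, 0], Vt[0], s[0]       # loadings, PC scores, singular value

for j in range(X.shape[0]):
    corr = (X[j] @ v1) / np.linalg.norm(X[j])   # v1 has unit norm and zero mean
    assert np.isclose(corr, s1 * u1[j] / np.linalg.norm(X[j]))
```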
4.3 Sparse Principal Component Analysis (SPCA) and PFS
In this section, we explain why SPCA does not exactly perform global feature selection; its goal is different from ours. The goal in SPCA is to make the PCs
interpretable by finding sparse loadings. Sparsity allows one to determine which features explain which PCs. PFS, on the other hand, takes a direct approach to
feature selection: our goal is to determine which set of q original features best captures the variance of the data while at the same time being as non-redundant as possible.
The SPCA criterion is as follows:

\[ (\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} \| x_i - \alpha \beta^T x_i \|^2 + \lambda \sum_{j=1}^{k} \| \beta_j \|^2 + \sum_{j=1}^{k} \lambda_{1,j} \| \beta_j \|_1 \quad \text{s.t.} \quad \alpha^T \alpha = I_k, \tag{4.2} \]

where k is the number of principal components, I_k is the k × k identity matrix, λ controls the weight on the ridge-regression (2-norm) penalty [93], and the λ_{1,j} are weight parameters that control the sparsity. α and β are d × k matrices, with β_i ∝ u_i for i = 1, ..., k.
SPCA is a good way to sparsify the loadings of principal components or
determine which features correspond to each PC; however, it is not appropriate for global feature selection (i.e., finding a set of features to represent all the PCs). To illustrate this, we provide an example using the Glass data from the UCI repository
[94]. Glass data has a total of nine features and 214 samples. Table 4.1 presents
the PC loadings obtained by applying SPCA with a sparsity of five and the number
of PCs, k, set to two. Note that although each PC has sparse loadings, every feature has a non-zero loading in at least one of the two PCs. In this case, all features are still
needed and no reduction in features is achieved. SPCA has a tendency to spread the
non-zero loadings to different features in different PCs because the sparse PCs are
constrained to be orthogonal.
Table 4.1: PC Loadings Applied to Glass Data Using SPCA
LOADINGS     PC1     PC2
FEATURE 1    0       -0.76
FEATURE 2    0.35    0
FEATURE 3    -0.60   0
FEATURE 4    0.47    0
FEATURE 5    0       0.03
FEATURE 6    0       0.11
FEATURE 7    0       -0.64
FEATURE 8    0.53    0
FEATURE 9    -0.14   -0.07
From Table 4.1, it is not clear how one can utilize SPCA to select features.
Let us say we wish to retain two features. Which two should we keep: features 3 and 8, based on the loadings in PC1, or features 3 and 1, the top loadings of PC1 and PC2 respectively? Another complication is that in SPCA one can tweak
the sparsity parameter and the number of components to keep. Changing those
parameters modifies the loadings and the features with the non-zero loadings as
shown in Table 4.2, where sparsity is set to two and the number of PCs is set to
nine.
4.4 Experiments
The goal of this section is to investigate the performance of orthogonal principal feature selection (PFS) against other methods. We examine whether or not orthogonalization helps in achieving the PCA objective, and study PFS as a search technique. We describe the data used in our experiments in Subsection 4.4.1 and the competing methods in Subsection 4.4.2, and present and discuss the results in Subsection 4.4.3. In addition, we provide a time complexity analysis in Subsection
Table 4.2: Example: SPCA Confusion for Feature Selection
LOADINGS     PC1     PC2    PC3    PC4    PC5    PC6    PC7     PC8     PC9
FEATURE 1    0       0      -0.01  -0.01  0      0      0       -1.0    0.04
FEATURE 2    0       0      0      0      -1.0   0      0       0       0
FEATURE 3    -0.98   0      0      0      0      0      0       0       0
FEATURE 4    0       0      0      0      0      0      1.0     0       0
FEATURE 5    0       0      0      1.0    0      0      0       0       0
FEATURE 6    0       -1     0      0      0      0      0       0       0
FEATURE 7    0.19    0.07   0      0      0.1    0.05   -0.04   -0.03   1.0
FEATURE 8    0       0      0      0      0      -1.0   0       0       0
FEATURE 9    0       0      -1.0   0      0      0      0       0       0
4.4.4.
4.4.1 Data
We investigate the performance of orthogonal principal feature selection (PFS) on
five real-world datasets: chart, HRCT, face, 20 mini-newsgroups and gene microarray data. The chart data is from the UCI repository [95], with six classes, 60 features and 600 instances. The face data, from the UCI KDD repository [83], consists of 640 face images of 20 people; each person has 32 images with image resolution 32 × 30. We remove the missing data to form a 960 × 624 data matrix. HRCT is a high-resolution computed tomography lung image (HRCT-lung) data set [96] with eight disease classes, 1545 instances and 183 features. The mini-newsgroups data comes from the UCI KDD repository, with 2,000 documents from 20 categories and 4,374 words as features. The last data set we studied is the gene data for lung cancer from http://sdmc.lit.org.sg/GEDatasets/Datasets.html; it contains 327 instances with a dimensionality of 12,558.
4.4.2 Methods
We compare our approach to the simple threshold method (which simply ranks
features based on variance and keeps those with variances larger than a threshold)
and two eigenvector-loading-based methods by Jolliffe [1]: (1) Jolliffe i, which iteratively computes the first eigenvector and selects the feature with the largest loading, and (2) Jolliffe ni, which computes all the eigenvectors at once and selects the q features corresponding to the largest loadings in the first q eigenvectors. Jolliffe i is similar to PFS, except that it does not take the statistical correlation between features into consideration. Jolliffe ni computes all the PCA eigenvectors at once, thus implicitly taking
correlation into account. Furthermore, we compare our PFS method to SPCA. We
set SPCA with sparsity equal to q (the number of features to be selected) and set
the number of PCs to one (to avoid ambiguities). This version of feature selection
will be aggressive in selecting features that maximize variance. The other extreme
is to keep q PCs and sparsity equal to one. This version will be aggressive in removing redundancy and provides results similar to Jolliffe ni. It is not clear how to select features with SPCA in a way that lies between these two extremes. In these
experiments, we test the importance of the orthogonality constraint.
Besides loading-based methods, we also compare our PFS with sequential forward search (SFS) applied to our SSE objective function in Eqn. 4.1, which takes both error and correlation into account. SFS is a greedy subset search technique that finds the single feature that, when combined with the current subset, minimizes our SSE objective function; it starts with the empty set and sequentially adds one feature at a time. This comparison tests PFS as a search technique. In addition,
we compare with Mao’s least squares estimate (LSE) method [51], which is an improved, fast version of Krzanowski’s Procrustes approach [50]. We apply the LSE-based forward selection method and call it LSE-fw. Furthermore, we also compare to principal feature analysis (PFA) [52], which performs Kmeans clustering on the
loadings of the first m PCs. In our experiments, we set the number of clusters equal to q (the number of features to keep); m is set so as to retain 90% of the variance captured by the first q PCs, which typically results in m between q − 5 and q − 1. We also run Kmeans with 10 random starts and pick the best solution in terms of minimum SSE, as indicated in their work.
For all methods, we calculate the sum-squared-error defined in Equation 4.1.
Essentially, we measure how much the selected features span the original feature
space.
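This evaluation can be computed directly as the least-squares residual of regressing every (centered) feature on the selected subset. The sketch below is our own reading of Eqn. 4.1, with our own function name:

```python
import numpy as np

def sse(X, selected):
    """SSE sketch: squared residual of reconstructing every zero-mean
    feature of X (d x N, rows = features) from the selected subset."""
    Xc = X - X.mean(axis=1, keepdims=True)
    F = Xc[selected]                     # q x N matrix of selected features
    P = F.T @ np.linalg.pinv(F.T)        # N x N projector onto span of selected features
    R = Xc - Xc @ P                      # component of each feature outside that span
    return float(np.sum(R ** 2))
```

Selecting all features drives the SSE to zero, and adding a feature can never increase it, matching the behavior described for Fig. 4.5.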
4.4.3 Results and Discussion
Fig. 4.5 shows the SSE, as defined in Eqn. 4.1, for all datasets and all eight methods. We also show the retained variance for the HRCT data in the top-right sub-figure of Fig. 4.5. The retained variance measures how much the selected features span the original feature space; it equals the total variance minus the error. Thus, the smaller the SSE the better, and the larger the retained variance the better. Because the two plots present redundant information, we only present the SSE plots for the other data sets; their retained-variance curves would simply be the mirror image of the SSE plots. We observe that the SSE between all features and the features selected by our PFS is consistently smaller than that of simple threshold, Jolliffe i, Jolliffe ni, SPCA and PFA, for any number of retained features and for all datasets.
PFS vs. simple threshold, Jolliffe i and Jolliffe ni Simple thresholding is an individual search method that considers features one at a time. PFS takes feature interaction into account through the PCA transformation and thus achieves lower SSE. Neither simple threshold nor Jolliffe i takes the correlation between features into account; without a redundancy-removal scheme, they have higher SSE than our PFS. Jolliffe ni takes correlation into account indirectly through the orthogonality constraint among the PCs. PFS addresses redundancy directly by orthogonalizing the data with respect to the selected features, which leads to lower SSE than Jolliffe ni.
PFS vs. SPCA SPCA has SSE values between those of Jolliffe i and Jolliffe ni. As mentioned in the previous section, SPCA does not exactly perform global feature selection; it cannot explicitly tell which features are important overall because it has a different objective. SPCA tries to obtain sparse loadings to interpret the dominant PCs, whereas our objective is to find the features that capture the most variance of the data while being as uncorrelated as possible. This leads PFS to a smaller SSE than SPCA.
PFS vs. PFA PFA, using a clustering approach, also tries to select uncorrelated features, as our method does. However, keeping only the first m PCs (with m slightly smaller than q, as indicated in PFA) and performing Kmeans clustering on their loadings is not accurate. We can see that when q is relatively small, the first several features selected by PFA lead to high SSE. In addition, the use of Kmeans clustering adds more uncertainty to the feature partitioning because of its unstable initialization.
PFS vs. LSE-fw and SFS Note that PFS optimizes the SSE through PCA. We compare the performance of PFS against SFS (sequential forward search), which directly optimizes the SSE by adding the single best feature to the current selected feature set at each step. Similarly, LSE-fw performs a sequential forward search and adds the best feature at every step to minimize the least squares error between the selected features and a Procrustes transformation of the reduced PCA space. Fig. 4.5 shows that our method, plotted as a solid black line, achieves the same small SSE as SFS and LSE-fw. However, LSE-fw and SFS are very slow in picking the desired number of features. In general, they are 10^2 to 10^3 times slower than the other
Table 4.3: Computational Complexity Analysis

METHODS      SIMPLETHRESH     JOLLIFFE I        JOLLIFFE NI      SPCA
COMPLEXITY   qNd              max(N, qK)d^2     max(N, d)d^2     max(N, d)d^2

METHODS      PFA              LSE-FW            SFS              PFS
COMPLEXITY   max(Nd, Kmq)d    q^2 Ndp           max(N, q)qd^3    max(N, qK)d^2
methods on the data we used in our experiments. In particular, for the gene data with a dimensionality of more than 12,000, SFS ran out of memory immediately, and LSE-fw ran for a week before also running out of memory. All our experiments were run on a Pentium 4 computer with 512 megabytes of RAM. These two methods are not practical for very large data sets. A detailed complexity analysis of all the methods is given in the next section.
In summary, PFS performs as well as the slower SFS and LSE-fw feature selection methods in terms of SSE reconstruction error and retained variance, and consistently better than the other transformation-based methods (Jolliffe i, Jolliffe ni, PFA and SPCA) and simple thresholding.
4.4.4 Time Complexity Analysis
In this section, we discuss the time complexity of each method for selecting q features. For our PFS method, we first compute the sample covariance matrix, O(Nd^2). Then, in step 1, we find the first eigenvector, which takes O(Kd^2) by the power method [92], where K is the number of iterations to converge; K is usually small compared to N. In the second step, we perform the feature search, which takes O(Nd + d). For step 3, performing the orthogonal projection is equivalent to updating the covariance matrix,

\[ \Sigma_{X^{(t+1)}} = \Sigma_{X^{(t)}} - \frac{(X^{(t)} f_i^{(t)})(f_i^{(t)T} X^{(t)T})}{\|f_i^{(t)}\|^2}, \]

because, using the fact that P^{(t)} is symmetric and idempotent,

\[ \Sigma_{X^{(t+1)}} = X^{(t+1)} X^{(t+1)T} = X^{(t)} P^{(t)} P^{(t)T} X^{(t)T} = X^{(t)} \Big( I - \frac{f_i^{(t)} f_i^{(t)T}}{\|f_i^{(t)}\|^2} \Big) X^{(t)T} = \Sigma_{X^{(t)}} - \frac{(X^{(t)} f_i^{(t)})(f_i^{(t)T} X^{(t)T})}{\|f_i^{(t)}\|^2}. \]

This procedure takes O(d^2 + Nd) time. We repeat steps 1 to 3 until q features are selected, so the total time needed is O(Nd^2 + qKd^2) = O(max(N, qK)d^2).
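The rank-one covariance update above can be verified numerically on toy data (our own example):

```python
import numpy as np

# The explicit projection X P and the rank-one covariance update give the
# same residual covariance, since P is symmetric and idempotent.
np.random.seed(5)
X = np.random.randn(5, 60)
X = X - X.mean(axis=1, keepdims=True)
C = X @ X.T                                  # (unnormalized) covariance

f = X[3]                                     # feature selected at this step
Xn = X - np.outer(X @ f, f) / (f @ f)        # explicit orthogonal projection
Cn = Xn @ Xn.T

Cu = C - np.outer(X @ f, X @ f) / (f @ f)    # rank-one update, no new data matrix
assert np.allclose(Cn, Cu)
```

This is why step 3 costs only O(d^2 + Nd) instead of re-forming the projected data and recomputing the full covariance.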
For comparison, we also show the time complexity of the other methods. The Jolliffe i method has a cost similar to our PFS: compute the covariance matrix first and then find the first eigenvector; if f_i is selected, remove the i-th column and row of the covariance matrix and repeat the process. The time complexity is O(Nd^2 + qKd^2) = O(max(N, qK)d^2). For the Jolliffe ni method, only one SVD is performed on the covariance matrix, so the time cost is O(Nd^2 + d^3) = O(max(N, d)d^2). For the SPCA method, when d > N, the biggest cost is running SPCA to obtain the first q PCs, which takes O(Nd lg(d) + p^2 d) time, where p is the size of the sparse basis found; since p/d → 0, the time cost is o(d^3). If N > d, SPCA takes O(Nd^2). It is therefore similar to Jolliffe ni, i.e., O(max(N, d)d^2). The major cost for PFA, besides computing the original covariance matrix, is the clustering; thus the time cost is O(Nd^2 + Kmqd), where m is the number of PCs retained and K is the number of iterations needed for the Kmeans clustering to converge. Mao's LSE-fw method requires estimating the least squares estimate (LSE) model in each iteration and takes O(q^2 Ndp) time, where p is the number of PCs the user wants to retain. Keeping more PCs (p close to d) results in a more accurate LSE model and smaller SSE but, as a tradeoff, a higher time cost. The SFS method has a time cost of O(max(N, q)qd^3), which is the slowest. Thus, it is notable that PFS performs as well as SFS and LSE-fw in terms of SSE while being much faster, as shown in Table 4.3, which summarizes the time complexity of each method.
In Table 4.4, we provide the actual time in seconds that the different methods need to select q features covering 90% of the total variance of the original data. Note that in these experiments we did not optimize the computation of the first eigenvector, which could speed up PFS and Jolliffe i. The results show that Jolliffe i and our PFS have similar speeds. Jolliffe ni and SPCA are similar to each other and faster. PFA
Table 4.4: Computational Time in Seconds

TIME (S)       HRCT        CHART      FACE        TEXT        GENE
               (q = 126)   (q = 42)   (q = 140)   (q = 315)   (q = 312)
SIMPLETHRESH   0.05        0.02       0.3         0.2         2.8
JOLLIFFE I     25.07       1.14       75.6        182.8       1231.05
JOLLIFFE NI    4.12        0.24       10.48       37.05       284.98
SPCA           5.09        0.2        22.61       42.11       330.78
PFA            5.4         0.58       24.95       44.72       807.18
LSE-FW         409.69      3.53       3.33E+04    2.52E+04    N/A
SFS            980.13      9.13       1.23E+05    8.13E+05    N/A
PFS            26.4        1.18       82.7        251.4       1612.05
is in between, while LSE-fw and SFS are very slow.
Even though SPCA and PFA are faster than PFS, they must be re-run from scratch whenever a different number of features is needed. In contrast, because PFS is a sequential method, a single run yields the optimal feature sets for every size from 1 up to q. For example, given 312 features selected by our PFS, if we only need 200 features, we can simply take the first 200 features in the selected set. Conversely, if we start with 200 features and later need 312, we can continue the process from the residual space after the first 200 features. SPCA and PFA, however, would need to re-run the whole experiment.
In our experiment with the gene data, we select 312 features out of 12,558, which covers 90% of the total variance using only about 2.5% of the original features. To obtain a relatively accurate LSE model, we need at least p = q. Thus, the time needed by LSE-fw is at least on the order of 10^6 seconds even with the smallest p, about 1,000 times that of our PFS, which runs on the order of 10^3 seconds. For SFS, the gap is a factor of qd ≈ 10^6 relative to our PFS method. In practice, SFS and LSE-fw ran out of memory and could not produce results for the gene data.
4.5 Extension to Linear Discriminant Analysis (LDA)
PCA reduces the dimensionality in an unsupervised fashion. When class labels are available, one wishes instead to find a reduced space in which the classes are well separated. In this section, we discuss a possible extension of our PFS method to the LDA case.
A popular supervised dimensionality reduction method is linear discriminant analysis (LDA) [2]. LDA computes the optimal transformation that simultaneously minimizes the within-class scatter and maximizes the between-class scatter. The goal in discriminant feature selection is to select the subset of features that maximizes J = trace(S_m^{-1} S_b), subject to the constraint that the features are as "uncorrelated" as possible. The relationship between PCA and LDA has been studied in [2, 97, 98]. Thus, we can express the LDA problem as a data representation problem similar to PCA, where the data to be compressed are represented by their class means. We form the matrix M consisting of the means of each class, M = [M_1, ..., M_L]. To solve the LDA optimization problem based on the trace(S_m^{-1} S_b) criterion, we form a matrix D ∈ R^{d×L}, D = [D_1 D_2 ... D_L], with D_i = √(n_i) S_m^{-1/2} M_i, where L is the number of labeled classes and n_i the number of samples in class i. The solution of LDA is then obtained by performing PCA on D, treating each D_i as a data point. We have thus transformed the LDA problem into a PCA problem on the normalized class-mean matrix D. We can now apply a technique similar to that in Section 4.2, except that D is not in the original feature space. Note that D is simply a rotated, normalized version of M under the transformation matrix S_m^{-1/2}, and M is in the original feature space. Since M is in the original feature space, we need to select the feature from M that correlates best with the first principal component of D. Our new objective function for LDA becomes:

\[ SSE_D = \| [\hat f_1^D, \cdots, \hat f_q^D] - [f_1^D, \cdots, f_q^D] \|^2 \tag{4.3} \]

where \hat f_i^D is the projection of f_i^D onto the subspace spanned by the selected f_i^M's, i.e., span{f_1^{M(1)}, f_2^{M(2)}, ..., f_q^{M(q)}}. SSE_D measures how well the selected features f_i^M span the space span{D}: the more the selected features span span{D}, the smaller the SSE.
Based on the above discussion, the same technique can be applied to span{D} to obtain a non-redundant feature set that best separates the classes. However, instead of keeping the feature with the largest normalized loading on the first PC of the residual space, we keep the feature in M (in the original feature space) that correlates best with the first PC of the residual space of D.
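One step of this selection can be sketched as follows. This is our own rough illustration under simplifying assumptions (we use the mixture scatter of the centered data for S_m and a regularized eigendecomposition for S_m^{-1/2}); it is not the thesis implementation:

```python
import numpy as np

def lda_pfs_step(X, y):
    """One hedged sketch of the LDA-based selection step: build
    D_i = sqrt(n_i) * S_m^{-1/2} M_i, take the first PC of D, and pick the
    original-space feature (row of M) most correlated with it.

    X : d x N data (rows = features), y : length-N class labels.
    """
    classes = np.unique(y)
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    Sm = Xc @ Xc.T / X.shape[1]                       # mixture scatter (assumption)
    M = np.stack([X[:, y == c].mean(axis=1) - mu[:, 0] for c in classes], axis=1)
    n = np.array([(y == c).sum() for c in classes])
    w, V = np.linalg.eigh(Sm)                         # S_m^{-1/2} via eigendecomposition
    Sm_inv_half = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    D = Sm_inv_half @ (M * np.sqrt(n))                # normalized class-mean matrix
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    pc = Vt[0]                                        # first PC scores of D
    corr = np.abs(M @ pc) / (np.linalg.norm(M, axis=1) + 1e-12)
    return int(np.argmax(corr))
```

Subsequent features would then be chosen from the residual space of D, as in Section 4.2.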
This section provides an example of how to extend orthogonal feature search via a transformation method to LDA. However, a detailed analysis of orthogonal feature selection via LDA as a supervised feature selection approach is outside the scope of this work and is a topic for future research.
4.6 Conclusion for Principal Feature Selection
We have developed an orthogonal feature selection algorithm: principal feature
selection (PFS). This algorithm selects features based on the results from transfor-
mation approaches, where transformation serves as a search technique to find the
direction that optimizes our objective function. At the same time, we incorporate
orthogonalization to remove redundancy.
The resulting feature selection algorithm, PFS, is as simple to implement as PCA; it obtains the principal features sequentially (analogous to the PCs), together with their corresponding non-redundant variance contributions (analogous to the eigenvalues) with respect to the previously selected features, as a by-product of the approach. With these similar and important properties, we hope PFS can become as widely applied
as its transformation-based counterpart. Experiments show that PFS is consistently closer to the optimal SSE than the loading-based approaches that do not take orthogonality into account. The experiments also show that PFS provides a good compromise as a search technique between sequential forward search and individual search with respect to speed and SSE: its speed is close to that of the faster individual search, while its SSE values are almost the same as those of sequential forward search and consistently the best among the other methods.
There exist several criteria for evaluating features in the feature selection literature. Here, we explored the criterion optimized by the popular transformation method, PCA. We hope that this work inspires future research on taking advantage of continuous-space transformations to improve the search for optimal solutions of the combinatorial feature selection problem.
Figure 4.5: SSE and retained variance for the HRCT data (top row); SSE for the chart, face, 20 mini-newsgroups and gene data. Each figure plots the SSE curves for the eight methods: simple threshold (blue, 'v'), PFA (green, '^'), SPCA (red, 'o'), Jolliffe i (light blue, '*'), Jolliffe ni (purple, '.'), SFS (yellow, '+'), LSE-fw (grey, 'x') and our PFS (black, solid).
Chapter 5
Robust Fluoroscopic Respiratory Gating for Lung Cancer Radiotherapy without Implanted Fiducial Markers
In this chapter, we investigate machine learning algorithms for markerless gated
radiotherapy with fluoroscopic images. The framework of the proposed clinical
treatment procedure is shown in Figure 5.1. We start by describing the gating problem and how the data are acquired and pre-processed in Section 5.1. Section 5.2 presents a detailed description of our clustering ensemble template matching method. Section 5.3 re-frames gating as a classification problem and provides a solution through the support vector machine (SVM). To test our algorithms, Section 5.4 presents the evaluation metrics and validation results on five patient datasets. Finally, in Section 5.5, we conclude and outline future directions.
Figure 5.1: Block diagram showing the process of the proposed clinical procedure for generating the gating signal.
5.1 Data Acquisition and Pre-Processing
In this section, we describe how the data are acquired and pre-processed, the first two components of the patient set-up stage in Figure 5.1.
5.1.1 Image Acquisition
In this study, the raw fluoroscopic image data come from the Integrated Radiotherapy Imaging System (IRIS) [99], which consists of two pairs of gantry-mounted diagnostic x-ray tubes and flat-panel imagers, shown in Figure 5.2. The system can acquire pairs of real-time orthogonal fluoroscopic images for lung tumor tracking.
5.1.2 Building Training Data
Before treatment, a sequence of orthogonal fluoroscopic images (approximately ten seconds long in our experiments) is taken and used for patient set-up as training images. The tumor position in the gating window, where the treatment beam should be turned on, is identified either manually by a clinician or automatically by matching digitally reconstructed radiographs (DRRs) from the simulation 4D CT scan
Figure 5.2: The Integrated Radiotherapy Imaging System (IRIS), used as the hardware platform for the proposed gating technology in this chapter.
[100]. In this investigation, the images were manually contoured. A rectangular
region of interest (ROI) is then created in those images (see Figure 5.3). This ROI
is set to be large enough to contain tumor motion in the training period.
Lung tumors move primarily due to the patient's breathing. Typically, the gating window is set at the end-of-exhale (EOE) phase of the breathing cycle because of its longer duration and better stability. The top left plot of Figure 5.4 illustrates the tumor motion in the up-and-down direction as a function of time. Motion in the left-and-right direction is very small (1-2 mm) compared to motion in the up-and-down direction (13-18 mm). The tumor position is taken as the centroid of the manually contoured region. Tumors in the lower positions correspond to the exhale phase, and those in the higher positions correspond to the inhale phase. A measure of radiation treatment efficiency is the gating duty cycle: the total time the beam is turned ON divided by the total time (beam ON plus beam OFF). Assuming that the desired gating duty cycle is given (in our experiments we set it at 35% and
Figure 5.3: Tumor contour and the region of interest (ROI). Left: original fluoroscopic image. Right: motion-enhanced image.
50%, which are typically used in radiotherapy), a corresponding threshold can be determined to define the gating window, shown as the horizontal line in the top left plot of Figure 5.4. All the images in the gating window, i.e., those with tumor locations below this threshold, are labeled as EOE images.
5.1.3 Pre-Processing
We pre-process our images by first applying motion enhancement and then reduce
the dimensionality by principal component analysis (PCA).
Motion Enhancement We apply a simple pre-processing technique called motion
enhancement [101] to our training images. Given a sequence of images I[t],
where t = 1, · · · , N is the sequence number, we compute the average image,
(1/N) Σ_{t=1}^{N} I[t], and the motion-enhanced image (MEI) is the difference between the
original image and the average image, I[t] − (1/N) Σ_{t=1}^{N} I[t]. The intuition behind
MEI is that the average captures the static structures and smears the moving struc-
tures, so the difference amplifies the moving structures. Figure 5.3 shows the
original fluoroscopic image of a tumor in an ROI together with a motion-enhanced
view of it. We see that the tumor is clearer in the MEI.
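As an illustration, the motion-enhancement step above can be sketched in a few lines of NumPy (the function name and the (N, H, W) array layout are our own assumptions, not part of the original implementation):

```python
import numpy as np

def motion_enhance(frames):
    """Motion-enhanced images: subtract the temporal average from each frame.

    frames: array of shape (N, H, W) holding N fluoroscopic frames I[t].
    The average captures static anatomy; the difference amplifies
    moving structures such as the tumor.
    """
    frames = np.asarray(frames, dtype=float)
    average = frames.mean(axis=0)   # (1/N) * sum_t I[t]
    return frames - average         # MEI[t] = I[t] - average
```

Note that subtracting one shared average makes each MEI depend on the whole training sequence, which is why the average can also be updated as new frames arrive.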
Figure 5.4: The top left figure is the breathing waveform represented by the tumor location. To the left of the vertical dotted line is the training period. To the right of the vertical line is the treatment or testing period. Under the horizontal dotted line (the threshold corresponding to a given duty cycle) is the end-of-exhale phase. The bottom figures show different end-of-exhale images during the training session, which are averaged to generate a single template.
Principal Components Analysis A typical EOE image has a size of 100 × 100
pixels. This leads to a dimensionality of 10,000. To reduce the dimensionality,
we apply principal component analysis (PCA) [102]. PCA finds a linear trans-
formation, Y = A^T X, that projects the original high-dimensional data X with d
dimensions to lower-dimensional data Y with q dimensions, where q < d, such that
the mean squared reconstruction error is as small as possible. X here is d × N,
where N is the number of data points, and A is a d × q matrix. The solution is the
transformation matrix A whose columns correspond to the q eigenvectors with the
q largest eigenvalues of the data covariance. PCA thus projects the high-dimensional
dataset to the lower-dimensional subspace in which the original dataset has the
largest variance (i.e., it restricts attention to those directions along which the scatter
of the data points is greatest). PCA is applied only as a pre-processing step to clus-
tering and to the support vector machine. It is not applied to template matching in
this research because during treatment, projecting each image onto span{A} is quite
time consuming.
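The eigendecomposition-based PCA described above can be sketched as follows (a minimal NumPy version; function names and the column-per-sample layout are ours):

```python
import numpy as np

def pca_fit(X, q):
    """Fit PCA on X (d x N, one data point per column).

    Returns (mean, A): A is the d x q matrix whose columns are the q
    eigenvectors of the data covariance with the largest eigenvalues.
    """
    mean = X.mean(axis=1, keepdims=True)
    cov = np.cov(X)                          # d x d sample covariance
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    A = vecs[:, np.argsort(vals)[::-1][:q]]  # keep q leading eigenvectors
    return mean, A

def pca_project(X, mean, A):
    """Y = A^T (X - mean): project to the q-dimensional subspace."""
    return A.T @ (X - mean)
```

Centering by the mean before projecting is what makes the retained directions the directions of largest variance.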
5.2 Clustering Ensemble Template Matching and Gaussian Mixture Clustering
In this section, we describe our clustering ensemble template matching method for
gated treatment of lung cancer using fluoroscopy images. Before we proceed, we
define our notation for clarity: R is the ROI in an incoming fluoroscopic image,
Ti is the ith reference template (such as one of the EOE templates in the EOE gating
window), the symbol ⊗ represents correlation, and s is the score used for generating the
gating signal, i.e., g = H(s − s0), where s0 is the threshold score and H(x) is the
Heaviside step function: H(x) = 0 for x < 0 and H(x) = 1 for x ≥ 0. The gating
signal g = 1 means beam ON while g = 0 means beam OFF.
In our previous work, we built a single EOE template by simply averaging
all the motion-enhanced EOE training images, as shown in Figure 5.4. During
treatment, we compute the correlation score between the reference template and
each incoming MEI. Assuming that the image R and template T are of the same
size m × n, the normalized correlation coefficient (correlation score s) is defined as

s = [Σm Σn (Rmn − R̄)(Tmn − T̄)] / sqrt{ [Σm Σn (Rmn − R̄)²] · [Σm Σn (Tmn − T̄)²] }     (5.1)

Here R̄ and T̄ are the average intensity values of the image R and template T,
respectively. A high correlation score indicates that the incoming image is similar
to the reference and gating is enabled.
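Equation 5.1 is the standard normalized cross-correlation and can be written directly in NumPy (the function name is ours):

```python
import numpy as np

def correlation_score(R, T):
    """Normalized correlation coefficient between image R and template T,
    both m x n arrays, following the form of equation 5.1."""
    Rc = R - R.mean()                  # subtract average intensity R-bar
    Tc = T - T.mean()                  # subtract average intensity T-bar
    return (Rc * Tc).sum() / np.sqrt((Rc ** 2).sum() * (Tc ** 2).sum())
```

The score is 1 for a perfect match, −1 for an inverted match, and near 0 for unrelated images; the threshold s0 then turns it into a gating decision.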
In essence, the template building procedure is to generate the representative
patterns defined in the EOE window. Then during matching, we use these represen-
tatives to recognize the images that have the same pattern and should be treated.
Therefore, accurate representative templates are the key to accurate gating signals.
We have noted that using only the mean of the EOE training images as
the template may not be enough. Our experiments showed that it sometimes leads
to erratic gating signals, because the correlation score curve may not be
smooth, causing the gating signal to be noisy at the transition regions.
Figure 5.5: Ensemble/multiple template method. Here, each image is an end-of-exhale template. We match the incoming image with each template and get a set of correlation scores s1, s2, · · · , sK. Then we apply a weighted average of these scores to generate the final correlation score s for gating.
5.2.1 Ensemble/Multiple Template Method
Inspired by the success of ensemble methods in improving the classification accu-
racies of weak base classifiers [103], we explore applying an ensemble or multiple
template matching for gating. Hopefully, an ensemble of templates can smooth out
the resulting gating signals. One way of generating an ensemble of templates is to
set each EOE image as a template, and a correlation score is computed for each
template. After a set of correlation scores are computed, it is necessary to choose
an intelligent way to combine them to get a robust gating signal. There are several
ways to combine the correlation scores (such as, taking the maximum score, taking
the weighted average score). We found that applying a weighted average gave us
the best results. We define these weights in Subsection 5.2.3. This procedure is
explained in Figure 5.5. Each image in the figure is an end-of-exhale template. We
match the incoming image with each template and get a set of correlation scores
s1, s2, · · · , sK . Then we apply a weighted average of these scores to generate the
final correlation score s for gating.
Although we can use the entire set of motion-enhanced images in the EOE
gating window as reference templates, this approach is computationally expensive
during treatment due to the need for computing several correlation scores. Many of
the reference templates are very similar. Therefore, we want to find a way some-
where in between a single template method and using all the templates, which hope-
fully has the merits of both methods, i.e., computational efficiency and robustness.
Here, instead of using all the templates, we would like to find a set of representative
EOE templates. Ideally, these templates should carry all of the useful information
of the original frames and discard the noise. Clustering methods are ideally suited
to this task. Clustering algorithms group similar objects together and summarize
each group with a representative template. We will use clustering to find a small
set of templates and apply weighted averaging to combine the scores. We need to
determine the number of templates and a method to find the clusters.
5.2.2 Finding Representative Templates by Clustering
To cluster the EOE templates, we apply Gaussian mixture clustering [104] to the
PCA dimensionality-reduced images. We denote the parameters of this model by
Θ. In this model, we assume that each cluster comes from a multivariate Gaus-
sian distribution, and our data (the image templates) come from a finite mixture of
Gaussians. A Gaussian model assumes that there is a cluster template and the mem-
bers of that cluster are variations of that template. Let X denote our data set, which
has d dimensions, and n denote the number of data points in X; we are trying to
group the n data points into K clusters. Each of the K clusters is represented by its
model parameters, θj = (πj, µj, Σj), where πj is the prior probability (or mixture
proportion) of cluster j, µj is the mean of cluster j, and Σj is the covariance matrix
of cluster j. More formally, we say that the probability of the data given the model
is

P(X|Θ) = Σ_{j=1}^{K} πj P(X|θj),

P(X|θj) = [1 / sqrt((2π)^d |Σj|)] exp{ −(1/2) (X − µj)^T Σj^{−1} (X − µj) }
To estimate the parameters πj, µj, and Σj for each cluster, we apply the expectation-
maximization (EM) algorithm [105]. The EM algorithm alternates between an
expectation step and a maximization step until convergence. In the expectation
step, we estimate the cluster to which each image template belongs, given that
the parameters (πj, µj, and Σj) are fixed. In the maximization step, we estimate
the parameters by maximizing the complete log-likelihood (the log-likelihood as-
suming we know the cluster memberships). The cluster means µj then become our
representative templates.
To automatically determine the number of clusters, we apply the Bayesian
information criterion (BIC) [86] score to penalize the log-likelihood function. We
now maximize

log P(X|Θ_ML) − (f/2) log(n),     (5.2)

where f is the number of free parameters in the model. In our problem, we have

f = dK + (d(d + 1)/2)K + (K − 1)     (5.3)
Figure 5.6: Scatter plot of our image data for patient 4 and 35% duty cycle in 2D with the clustering result. The "o" and "x" markers represent different clusters, with the means shown in bold and the covariances as ellipses.
to be estimated. We run the Gaussian mixture clustering from K = 1 to Kmax
(Kmax = 4 in our experiments), then pick the K with the largest BIC score. Note
that if we do not add a penalty term, the log-likelihood increases as K increases.
This can lead to the trivial result of picking K equal to the number of data samples
(i.e., each data point is considered its own cluster). A scatter plot of the clusters with
their means and covariances is shown in Figure 5.6.
5.2.3 Generating the Gating Signal
By the template clustering method, we can build a set of accurate representative
templates. Accordingly, we will have a set of correlation scores for each incoming
new image in the template matching step. Therefore, we need a way to combine the
scores. As mentioned earlier, we use a weighted average, with the weights given by
the prior probability πj of each mixture component. This amounts to a voting
procedure in which mixtures with more members have higher weights for the
final correlation score than mixtures with fewer members. We generate the
gating signal based on the final correlation score. From the final correlation score of
the training images, we determine a threshold that corresponds to the pre-set duty
cycle. This threshold is then applied to the correlation scores calculated in real-
time during treatment to generate the gating signal. When the score is above this
threshold value, it indicates that the therapy beam should be enabled. Otherwise,
the therapy beam should be turned off.
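The weighted voting and thresholding steps above can be sketched as follows (function names are ours; the threshold rule is one plausible way to realize a preset duty cycle):

```python
import numpy as np

def gating_signal(scores, weights, s0):
    """Weighted-average ensemble score and Heaviside gating.

    scores : (K, T) correlation scores of T frames against K templates
    weights: (K,) mixture proportions pi_j used as voting weights
    s0     : threshold score chosen on training data
    Returns g[t] = 1 (beam ON) when the combined score is >= s0, else 0.
    """
    s = np.asarray(weights) @ np.asarray(scores)  # final score per frame
    return (s >= s0).astype(int)

def threshold_for_duty_cycle(train_scores, duty):
    """Pick s0 so that a fraction `duty` of training frames lies above it."""
    return np.quantile(train_scores, 1.0 - duty)
```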
Figure 5.7: Results from different methods for an example patient. (a) single template method; (b) ensemble/multiple templates method with Gaussian mixture clustering. For each figure, the top curve is the correlation score and the bottom plot is the gating signal generated by the correlation score. Here we use a 35% duty cycle.
Figure 5.7 shows an example of gating signals generated by this cluster-
ing ensemble template method and the previous single template method. We can
see that the ensemble approach can achieve a reasonable gating signal and smooth
correlation score, which coincides well with the smooth tumor motion. As demon-
strated in this figure, through the voting process, the effect of errors caused by one
template is compensated by the other templates. Thus the clustering ensemble tem-
plate method is less sensitive to noise compared to a single template.
5.3 Support Vector Machines
Template matching as described in the previous section only looks at images
the gating window. However, instead of only using the images inside the gating
window, can we also take advantage of images outside the gating window? Will
this additional information be helpful to make decisions? This strategy is espe-
cially helpful for distinguishing images in the transition of the gating signal. We
can measure how similar they are with the EOE images as well as how dissimilar
they are from the non-EOE frames. Indeed, this can be viewed as a two-class clas-
sification problem. In this section, we re-cast the gating problem as a classification
problem and present the support vector machine classifier as a solution.
Gating as a Classification Problem The goal for an automated method for gated
radiotherapy is to decide when to turn the beam ON or OFF. This exclusivity con-
dition provides us with a clue that the gating problem can actually be re-cast as a
classification problem. We treat the image frames that correspond to a beam-on
signal as one class and those that correspond to a beam-off signal as another class.
For simplicity, we will call them beam-on and beam-off images. By this way, we
reformulate the original gating problem into a two-class classification problem, as
shown in Figure 5.8. In Figure 5.8a, ten time points on the gating signal are shown
Figure 5.8: Re-casting the gating problem as a classification problem, (a) and (b). (c) presents the decision boundary created by single template matching, and (d) displays the decision boundary of an SVM classifier.
from time t1 to t10, and each time point corresponds to an image frame, x1 to x10,
in Figure 5.8b. We represent each image frame as a vector, xt. For this illustrative
example, we project each of x1 to x10 onto two dimensions, as shown in Figure 5.8b.
The set of images, {x1, x2, x3, x8, x9, x10} are examples of beam-off images (class
OFF) and {x4, x5, x6, x7} are examples of beam-on images (class ON). The goal
of classification is to build a classifier that outputs ON or OFF given a new input
image xt by learning parameters of this classifier from training examples such as x1
to x10.
The template matching method can be considered as a classifier that only
takes advantage of the positive (ON) class. Correlation provides a measure of how
close or similar each new image is from our template representing the ON class.
The threshold is our decision boundary and turns the correlation score into a deci-
sion: scores higher than a threshold classify as ON and OFF otherwise. We select
this threshold based on the desired duty cycle. In Figure 5.8c, a template would
be the average of the ON examples, shown as a red dot in the center of the ellipse,
and the threshold would be a fixed distance from this template. Note that a better
way of automatically determining the decision boundary (threshold) is to consider
both positive (ON) and negative (OFF) examples. In this research work, we build
our classifier by looking at both ON and OFF training examples. In particular, we
design a support vector machine (SVM) classifier to solve this gating problem. Fig-
ure 5.8d displays the decision boundary created by a linear SVM on this simple
illustrative example data.
Support Vector Machine Classifier There are several possible classification al-
gorithms for this task. Among them, SVM is one of the most popular learning
methods for binary classification. SVM was originally designed by Vapnik [106].
It learns an optimal bound on the expected error, and finds a global optimum, as
opposed to many learning algorithms that provide only local optima (such as neural
networks [107]). The SVM objective is to find a boundary that maximizes the mar-
gin between two classes as well as separates them with the minimum empirical
classification error. An SVM first projects instances into high dimensional space
via kernels and then learns a linear separator that maximizes the margin between
the two classes.
The SVM problem can be formulated as follows: suppose we have the train-
ing data X = {x1, · · · , xn} and let {y1, · · · , yn} be the class labels of X. Without loss
of generality, we assign class labels to take the value of either +1 or −1. We want
a large margin and a small error penalty (slack variables ξi) for misclassifi-
cations, as shown in Figure 5.8d:

Minimize ||w||² + C (Σi ξi)
Subject to yi(xi · w + b) ≥ +1 − ξi     (5.4)

Here, C is a user-defined parameter, where larger values mean a higher penalty for
errors. In addition, we apply the kernel trick to allow nonlinear decision boundaries.
For the gating problem, we apply the radial basis function (RBF) kernel:

K(x, x′) = exp(−γ ||x − x′||²), for γ > 0.
During patient treatment (refer to Figure 5.1), each incoming image is pre-
processed by projecting the pixels in the ROI to the reduced dimensional space by
PCA as explained in Section 5.1. The decision function will be of the following
form:

D(x) = sgn( Σ_{t=1}^{n} yt αt K(x, xt) + b ),

K(x, xt) = exp(−γ ||x − xt||²)     (5.5)
Here yt is the class label for image vector t, and the αt and b are derived for a given
C by solving the optimization problem described in equation 5.4 using quadratic
programming [108]. The parameters γ and C of this function are learned during
patient set-up or training. During treatment, these parameters are fixed. We create
the gating signal based on the output of the decision function in equation 5.5 above,
with input the reduced-dimensional representation of our incoming fluoroscopy
image at time t.
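The dissertation trains its RBF-SVM with LIBSVM; scikit-learn's SVC wraps the same library, so the classifier can be sketched as follows (the synthetic PCA-reduced data and the γ, C values are stand-ins, not the experiment's values):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical PCA-reduced training images (n_frames x 50), labeled
# +1 for beam-on and -1 for beam-off frames.
rng = np.random.default_rng(0)
X_red = np.vstack([rng.normal(-1.0, 0.3, (40, 50)),
                   rng.normal(1.0, 0.3, (40, 50))])
labels = np.array([-1] * 40 + [1] * 40)

# gamma and C would come from the grid search during patient set-up.
clf = SVC(kernel="rbf", gamma=1e-2, C=10.0)
clf.fit(X_red, labels)

# During treatment, predict() evaluates the decision function of
# equation 5.5, D(x) = sgn(sum_t yt * alpha_t * K(x, xt) + b).
decision = clf.predict(X_red[:1])
```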
5.4 Experiments and Results
We collected fluoroscopic image data from five patients to evaluate the performance
of the following two methods: (1) an ensemble template matching method where
the representative templates are selected by Gaussian mixture clustering, and (2) a
support vector machine (SVM) classifier with radial basis kernels. For each patient,
we get a sequence of image frames by sampling at ten frames per second. Typically,
each sequence contains 300–400 frames, corresponding to 30–40 seconds. A
region of interest of around 100 × 100 pixels in size was selected to include the
tumor positions at various breathing phases. The training period we used to build
our templates consisted of two cycles, which corresponds to about 60–80 image
frames.
To validate the algorithms, we compared our estimated gating signal with the
reference gating signal. A radiation oncologist manually contoured the tumor in the
first frame of the images. Then in the following frames, the contour was dragged
to the correct places manually using a computer mouse by the radiation oncologist.
The tumor centroid position in each image frame was calculated and used to gener-
ate the gating signal based on various duty cycles. Here, in our experiment, we used
35% and 50% duty cycles. Assuming t0 is the total time (beam on and off), t1 is
the beam-on time (based on our estimation), t2 is the correctly predicted beam-on
time (true positive), we define the evaluation metrics: delivered target dose (TD)
TD = t2/t1 and real duty cycle (DC) of the treatment DC = t1/t0.
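These two metrics can be computed directly from the estimated and reference gating signals (the function name is ours):

```python
def td_dc(pred, ref):
    """Delivered target dose TD = t2/t1 and real duty cycle DC = t1/t0.

    pred, ref: 0/1 gating signals of equal length (estimated and reference).
    t0 = total frames, t1 = predicted beam-on frames,
    t2 = frames where beam-on is predicted correctly (true positives).
    """
    t0 = len(pred)
    t1 = sum(pred)
    t2 = sum(p and r for p, r in zip(pred, ref))
    return t2 / t1, t1 / t0
```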
5.4.1 Experiments by Clustering Ensemble Template Method
For the ensemble/multiple template matching method, we found the EOE training
images in the first two to three breathing cycles (the training period). We reduced
these EOE images to two to three dimensions and performed clustering.
We then computed the log-likelihood based on the Gaussian mixture model to get
our BIC scores and determine the number of clusters. We obtained two or three
clusters (depending on the patient) with this data. This means that only two or three
templates are enough for this application, which makes our method more efficient
than using all the EOE training images as templates. The final correlation
score for a new incoming image is the weighted average of the scores from the
reference templates, with the prior probabilities of the mixture components as weights.
5.4.2 Experiments by Support Vector Machine
Image frames acquired during the set-up session are pre-processed and used to train
the SVM classifier. We reduced the dimension from 100 × 100 to 1 × 50 for the SVM.
We then labeled the training images with +1 for beam-on images and −1 for beam-
off images. We used LIBSVM [109] in our experiments. To train our SVM model,
we applied a coarse-to-fine grid search to determine the parameters γ and C for our
radial basis function SVM (RBF-SVM) model. We found that trying exponentially
growing sequences of γ and C is a practical method to identify the parameters (γ =
10⁻¹⁰, 10⁻⁸, · · · , 10³; C = 10⁻⁵, 10⁻³, · · · , 10¹⁰). Furthermore, to prevent overfitting in
tuning the parameters, a ten-fold cross-validation procedure is performed on the
training images to find a better model. Basically, the (γ, C) pair which provides
the best ten-fold cross-validation accuracy on the training data is selected. We use
the rest of the image data (data during treatment) as our testing samples. We pre-
process the test data by PCA, and the predicted labels given by the SVM classifier
serve as our estimated gating signals.
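The grid search with ten-fold cross-validation can be sketched with scikit-learn's GridSearchCV (a stand-in for the LIBSVM tooling the dissertation used; the data and the grid ranges here are illustrative, not the experiment's):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical PCA-reduced training frames and +1/-1 gating labels.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1.0, 0.5, (30, 5)),
                     rng.normal(1.0, 0.5, (30, 5))])
y_train = np.array([-1] * 30 + [1] * 30)

# Exponentially growing candidate sequences, as in the text.
param_grid = {"gamma": [10.0 ** k for k in range(-4, 1)],
              "C": [10.0 ** k for k in range(-1, 3)]}

# cv=10: ten-fold cross-validation on the training images.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X_train, y_train)
best_gamma = search.best_params_["gamma"]
best_C = search.best_params_["C"]
```

A coarse grid like this is typically followed by a finer grid around the best pair, which is the coarse-to-fine strategy the text describes.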
5.4.3 Results and Discussion
The experimental results are shown in Figures 5.9-5.10. In each figure, the upper
figure shows the delivered target dose (TD) as a bar plot, with the SVM results in
blue bars and the template-matching method results in red bars. Delivered target
dose measures the true positive rate. The lower figure shows the real duty cycle
(DC) in the same format; this is a measure of efficiency. For the proposed duty
cycle of 35%, SVM achieves 95.8% average TD and 40.8% average DC, while the
clustering ensemble template matching method achieves 94.9% average TD and
34.7% average DC. When the proposed duty cycle equals 50%, SVM has an average
TD of 98.4% and average DC of 53.1%, while clustering ensemble template method
has an average TD of 97.6% and 49.5% DC.
Figure 5.9: Experimental results in TD and DC for the 35% proposed duty cycle. Blue bars: metric by the SVM method. Red bars: metric by the clustering ensemble template matching method.
Over all five patients we found that both methods are able to deliver most of
the target dose correctly, both have a high TD, while SVM is more efficient than
Figure 5.10: Experimental results in TD and DC for the 50% proposed duty cycle. Blue bars: metric by the SVM method. Red bars: metric by the clustering ensemble template matching method.
clustering ensemble template matching. In general, SVM has an average DC that is
4–6 percentage points higher. This is because SVM makes use of more information, both beam-on and
beam-off images, while clustering ensemble template matching only captures the
information from the beam-on images. However, template matching has the advan-
tage of detecting the aperiodic intra-fraction motion that occurs for some patients,
although not seen in the current datasets. This is because the beam is only turned on
when the correlation score is high, meaning the ROI of the incoming image is simi-
lar to the reference template and the tumor is at the right position. If the tumor drifts
away, or if shifts, rotations, or deformations of the anatomy occur between patient setup and
treatment, the correlation score will remain low and the beam will not
be turned on. We then need to re-position the patient.
The time complexity of both methods is almost the same. We recorded the
CPU time for executing both methods on a Pentium 4 Windows machine with 1 GB
of RAM. The average CPU time for predicting the gating signal during testing is
Figure 5.11: Example of estimated gating signals on patient 4 for the proposed 35% duty cycle. Top: the gating signal predicted by the SVM classifier. Bottom: the gating signal generated by the clustering ensemble template matching method.
0.06 sec/frame for both methods. To obtain the gating signal for each time point,
SVM needs to perform PCA and apply the reduced image vector as an input to the
decision function, equation 5.5. The clustering ensemble template matching method
needs to calculate the correlation between the original high-dimensional motion-
enhanced image and two or three templates (cluster means).
Figure 5.11 shows an example of the estimated gating signals. It plots the
gating results by both methods. The top figure displays the predicted gating signal
in red and the reference signal in black for SVM. The bottom figure shows the
gating and reference signals for clustering ensemble template matching. We can
see that both gating signals coincide very well with the reference, and all errors
occur at the edges.
5.5 Conclusion for Robust Markerless Gated Radio-
therapy
This research work provides a case study where machine learning techniques have
been successfully applied to an important real-world problem: gated radiotherapy.
Working closely with the domain expert, we carefully selected the appropriate ma-
chine learning and data mining tools in developing the ensemble/multiple template
matching method. Through our collaboration, we also provided our domain expert
with a different view of the gating problem and re-cast it as a classification prob-
lem. Our study showed the feasibility of solving the gating problem by classifica-
tion techniques. This provides us with wider resources for gated radiotherapy. We
can try other classification techniques, such as Bayesian classifiers, neural networks,
and hidden Markov models, in our future work.
For our next step, we will (1) test the algorithms using more and longer
patient data, (2) find a better way to get reference gating signal for validation, and
(3) evaluate the dosimetric consequence of the current error level to see if there is a
need to further lower the error rates. Then, we will consider clinical implementation
of our methods.
Chapter 6
Multiple Template-based Fluoroscopic Tracking of Lung Tumor Mass without Implanted
Fiducial Markers
In the previous chapter, we intensively studied the template matching method to
generate robust and accurate gating signals for lung radiotherapy without implanted
markers. Here, in this chapter, we extend this idea to directly track the tumor loca-
tion throughout the whole breathing cycle. In Section 6.1, we describe two template
matching methods to track tumors using the fluoroscopic images without markers:
the motion-enhanced method and eigenspace tracking. Section 6.2 describes the
experimental setup and evaluation metrics used to test the proposed methods. Sec-
tion 6.3 reports the experimental results and provides a discussion of those results.
Finally, Section 6.4 presents our conclusion.
6.1 Basic Ideas of Multiple Template Tracking
A good tracking algorithm should take into account the tumor motion characteris-
tics. Relevant lung tumor motion characteristics can be summarized as follows.
1. Lung tumor motion is mainly caused by patient respiration and is thus pe-
riodic. Accordingly, tumor shape and appearance projected in the images
should vary more or less as a function of the breathing phase.
2. Although breath coaching can improve the regularity and reproducibility of
patient breathing [110], the projected tumor images at the same breathing
phase in different breathing cycles can still vary.
3. The fluoroscopic image intensities change with the chest expansion and con-
traction. The images are brighter during the inhale phase and darker during
the exhale phase. This intensity change should also be considered in the
tracking algorithm.
In theory, tumor positions in fluoroscopic images can be detected using a
single-template matching method. A basic single-template tracking approach is to
simply perform an exhaustive search and to find the highest correlation between a
tumor template and an image region. Due to the above-mentioned tumor motion
characteristics, i.e., tumor appearance in, and the intensity of, the projected images
can vary from frame to frame, this simple approach does not work well for lung
tumor mass tracking in fluoroscopy video. It leads to erratic locations and is quite
time consuming. Using an adaptive template method, i.e., updating the reference
template using the tumor image in the previous image frame, the tracking results
can be improved. However, such a method is not robust to errors made in previous
frames and the tracking may drift. In short, a single-template approach for lung
tumor tracking is not sufficient.
Figure 6.1: Outline of the proposed multiple template tracking procedure.
If we use multiple templates, instead of one, for lung tumor tracking, the
first and third tumor motion characteristics can be naturally handled. If we further
allow some fine-tuning for the tumor position in each template, the second motion
characteristic may also be taken into account. The multiple-template method has
been developed in object recognition to detect the object at various poses (due to
changes in rotation, scale, illumination and other factors) [111]. Here, for our prob-
lem, we apply multiple templates to represent different poses (position and shape)
of the tumor in the projection images at different breathing phases.
The multiple-template tracking has four components: template generation,
search mechanism, scoring function and a voting mechanism. We first need to build
the set of templates to represent the tumor's various poses using images acquired
during patient setup. Then, during patient treatment, we find the best match between
the incoming image and each reference template. The quality of a match is based on
the scoring function utilized. Finally, we combine the results from all the templates
to determine the tumor location using a voting scheme. Figure 6.1 outlines the
tracking procedure using multiple templates.
Figure 6.2: A fluoroscopic image with a region of interest (ROI) (blue rectangle) and a tumor contour (red curve).
6.1.1 Building Multiple Templates
For this study, we use about 6–7 s of fluoroscopy video, which normally contains two
breathing cycles, as our setup or training images. The tumor contour in each of the
training images is either manually marked by clinicians or automatically transferred
from digitally reconstructed fluoroscopy (DRF) images by image registration [100].
Then a rectangular region of interest (ROI), as shown in Figure 6.2, is automatically
created in the image that contains the tumor throughout the whole breathing cycle.
We sample N templates, for example N = 12 in our experiments for a certain
patient, T1, T2, · · · , T12 at equal time intervals based on the breathing waveform
from the setup session, as shown in Figure 6.3. Berbeco et al. have observed that
the average fluoroscopic image intensity changes with the breathing cycle (i.e., the
image is darker at the exhale phase and brighter at the inhale phase) [16]. We utilize
the intensity waveform to determine the breathing cycle. We divide the intensity
waveform into equal bins as shown in Figure 6.3. Setup image frames falling in
a specific bin are averaged, and an ROI that contains the tumor is selected to be
the reference template for that phase bin. Each reference template corresponds to
Figure 6.3: Twelve motion-enhanced tumor templates built by averaging the ROI images (as shown in Figure 6.2) falling in the same time bin. The intensity waveform is divided into twelve equal time bins, corresponding to which twelve templates were built.
a known tumor position (from the patient setup). This set of templates gives us a
better representation of the tumor movement because they cover most of the motion
information. To get a set of smoother templates, we may use several breathing
cycles as a training period by dividing each cycle into the same number of bins,
then averaging the images in the same bin at the same phase.
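The phase-binned template building described above can be sketched as follows (the function name and the phase-in-[0, 1) convention are our own simplifications; the text derives phase from the intensity waveform):

```python
import numpy as np

def build_phase_templates(frames, phase, n_bins=12):
    """Average the ROI frames falling into each of n_bins equal phase bins.

    frames: (N, H, W) motion-enhanced ROI images from the setup session
    phase : (N,) breathing phase in [0, 1), e.g. derived from the
            mean-intensity waveform
    Returns a list of up to n_bins templates (bins with no frames are
    skipped), one representative per breathing phase.
    """
    bins = np.minimum((np.asarray(phase) * n_bins).astype(int), n_bins - 1)
    return [frames[bins == b].mean(axis=0)
            for b in range(n_bins) if np.any(bins == b)]
```

Averaging frames from several breathing cycles that fall in the same bin gives the smoother templates the text mentions.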
6.1.2 Search Mechanism
Ideally, even taking into account the second tumor motion characteristic, if the
breathing is reasonably stable, at a particular breathing phase the tumor position
should be close to that represented by the reference template at the same phase.
Therefore, instead of an exhaustive search in the whole image, which can take a
long time, we only search in a small neighboring region around the reference lo-
cation of the template. That is to say, we measure the similarity between each
template, Ti (i = 1, 2, · · · , 12), and each incoming fluoroscopic image by allowing
the template to have a small shift (Δx, Δy) along the x- and y-axes. There are two
reasons for allowing the shift: (i) templates built during the training period may not
provide a complete description of tumor motion during the whole treatment fraction,
and (ii) the tumor movement trajectory may vary slightly from period to period.
6.1.3 Template Matching
The x! and y! coordinates of the tumor centroid (xi, yi), for a template Ti , are
calculated by averaging the coordinates of the pixels in the tumor contour. The tu-
mor location at a given time point during treatment delivery, estimated using this
template, is then given as (xt, yt) = (xi + ,xi, yi + ,yi), where (,xi, ,yi) is the
shift required for template Ti to produce the best match to the image. This is done
for all the templates. The tumor location at this time point is then determined by
combining all estimated positions through a voting procedure described later. Two
methods for template representation and similarity calculation have been developed.
One method uses motion-enhanced images (MEIs) and calculates Pearson's correlation score as the scoring function. The other is eigenspace tracking, which represents the images in a reduced-dimension eigenspace and applies the mean-squared error as the scoring function. In the following sections, we discuss these two methods in
detail.
Figure 6.4: Tumor contour and region of interest (ROI). Left: original fluoroscopic image. Right: motion-enhanced image.
Motion Enhancement and Pearson's Correlation Score.
Motion enhancement [101] isolates the moving tumor from the rest of the static
anatomy. Given a sequence of images I[t], where t = 1, ..., N is the sequence number, we compute the average Σ_{t=1}^{N} I[t]/N, and the motion-enhanced image (MEI) is then the difference between the original image and the average image, I[t] − Σ_{t=1}^{N} I[t]/N. In a clinical scenario, the average image is determined using the training images and can be updated using the images acquired during treatment delivery. The intuition behind MEI is that the average captures the static structures and blurs the moving structures, thus the difference amplifies the moving structures. Figure 6.4 shows the original fluoroscopic image of a tumor in an ROI together with its motion-enhanced view.
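As an illustrative sketch (not the thesis implementation, which was written in MATLAB), the motion-enhancement step can be expressed in a few lines of NumPy; the array layout and toy data here are assumptions:

```python
import numpy as np

def motion_enhance(frames):
    """Motion-enhanced images (MEIs): subtract the temporal average
    from each frame, suppressing static anatomy and amplifying
    structures that move across the sequence."""
    avg = frames.mean(axis=0)   # average of the N training frames
    return frames - avg         # MEI[t] = I[t] - average image

# toy sequence: one static bright pixel, one pixel that moves each frame
frames = np.zeros((3, 4, 4))
frames[:, 0, 0] = 5.0           # static structure: cancels in the MEI
for t in range(3):
    frames[t, 1, t] = 10.0      # moving structure: enhanced in the MEI
meis = motion_enhance(frames)
```

In practice the average image would be updated with frames acquired during treatment delivery, as described above.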
We apply Pearson's correlation score as the similarity measure between an incoming image and a reference template. Assuming the image R and template T are of the same size m × n, Pearson's correlation score s is defined as

s = Σm Σn (Rmn − R̄)(Tmn − T̄) / sqrt[(Σm Σn (Rmn − R̄)²)(Σm Σn (Tmn − T̄)²)]    (6.1)
It is assumed that if the score is high, the incoming image has a high similarity with
the reference template, and the tumor location is close to that in the template.
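A minimal sketch of this scoring and shifted-search step, assuming NumPy arrays; the function names and the search-window sizes used as defaults here are illustrative:

```python
import numpy as np

def pearson_score(R, T):
    """Pearson's correlation score s (Eq. 6.1) between two equally
    sized patches R and T."""
    r = R - R.mean()
    t = T - T.mean()
    denom = np.sqrt((r ** 2).sum() * (t ** 2).sum())
    return (r * t).sum() / denom if denom > 0 else 0.0

def best_shift(image, template, ref_xy, max_dx=5, max_dy=10):
    """Slide the template over a small window around its reference
    position (x0, y0); return the best shift and its score."""
    m, n = template.shape
    x0, y0 = ref_xy
    best_s, best_d = -np.inf, (0, 0)
    for dy in range(-max_dy, max_dy + 1):
        for dx in range(-max_dx, max_dx + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0:                  # shift leaves the image
                continue
            patch = image[y:y + m, x:x + n]
            if patch.shape != template.shape:   # shift runs off the edge
                continue
            s = pearson_score(patch, template)
            if s > best_s:
                best_s, best_d = s, (dx, dy)
    return best_d, best_s

# toy check: template embedded at (x=6, y=7), reference guess at (5, 5)
T = np.arange(9.0).reshape(3, 3)
img = np.zeros((20, 20))
img[7:10, 6:9] = T
shift, score = best_shift(img, T, ref_xy=(5, 5))
```

The exhaustive inner loop is affordable precisely because the window is small, which is the point of restricting the search to a neighborhood of the reference location.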
Eigenspace Representation.
Motivation and overview. Eigenspace representation, based on principal component analysis (PCA), can reduce redundant information, keep the most distinctive features and thus reduce noise in the image sequence. Here we use this method to improve the matching accuracy and robustness [1]. The idea of applying PCA to images, or eigenimages, was introduced by Turk and Pentland for face recognition and has been shown to be successful in other object recognition tasks [3]. Motion enhancement represents the image by amplifying the moving structures and keeps the image's original dimensionality. PCA, on the other hand, is a feature extraction method that projects the original image to a lower dimensional space and represents images by capturing the features with high variability.
Eigenspace tracking can be described in two steps. First, find the lower di-
mensional space (eigenspace) by PCA and project the current frame to that
space, i.e., find the eigenspace representation of the current frame. Second,
match the current frame to the templates by minimizing the mean-squared
error between them. We will describe this method in detail below.
Template building in eigenspace. Given a sequence of images, eigenspace tracking
constructs a small set of basis images that characterize the majority of the
variation in the image set. These basis images are used to approximate the
original images. Assume that the tumor template size is m × n (in our study, a typical tumor template is around 100 × 100 pixels).
Suppose we have N images in our training period. For each of the N images,
we construct a 1D image vector, e, by scanning the image in the standard raster scan order. This vector has dimension d = m × n and becomes a column in a d × N matrix X. Using PCA, we can find a linear transformation, Y = AᵀX, that projects the original d × N matrix X to a lower dimensional q × N matrix Y, where q < d, such that the mean-squared error between X and its reconstruction from Y is as small as possible. Here, A is a d × q transformation matrix.
To find the transformation matrix A, we use the singular value decomposition (SVD) [1] to decompose the matrix X as X = UΣVᵀ. U is an orthogonal matrix of d × d dimensions and V is another orthogonal matrix of N × N dimensions. Σ is a diagonal matrix of d × N dimensions with the singular values sorted in decreasing order. The transformation matrix A then consists of the first q columns of U, i.e., the first q eigenvectors with the largest singular values.
Now we can project the original d-dimensional image vector e into q dimensions as e′. Here, e′ is a q-dimensional vector (a column of matrix Y) and can be represented as e′ = Σ_{i=1}^{q} ci Ai, where the ci are the new coordinates of the original image vector e in this subspace and ci = Aiᵀe. In this study, we kept the smallest number of eigenvectors that retains 95% of the original variance (i.e., the sum of the retained eigenvalues divided by the sum of all eigenvalues is at least 95%). For our data, this corresponds to q = 50. Thus, we reduce the dimension of an image from about 10,000 to 50.
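The template-building step just described can be sketched with NumPy's SVD; this is a simplified illustration under assumed names and toy data, and, matching the description above, the image vectors are not mean-centered:

```python
import numpy as np

def build_eigenspace(X, var_keep=0.95):
    """Return the d x q transformation A: the first q left singular
    vectors of X, where q is the smallest number of components whose
    squared singular values retain `var_keep` of the total."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    q = int(np.searchsorted(energy, var_keep) + 1)
    return U[:, :q]

def project(A, e):
    """Eigenspace coordinates c of an image vector e: c_i = A_i^T e."""
    return A.T @ e

# toy data: N = 5 image vectors of dimension d = 10 lying in a 2D subspace
d = 10
basis = np.zeros((d, 2))
basis[0, 0] = basis[1, 1] = 1.0
coeffs = np.array([[3.0, 0.0, 1.0, 2.0, -1.0],
                   [0.0, 2.0, 1.0, -1.0, 2.0]])
X = basis @ coeffs          # d x N matrix of training image vectors
A = build_eigenspace(X)     # q = 2 retains over 95% of the variance here
```

For rank-deficient toy data like this, projecting onto A and reconstructing recovers X exactly; for real fluoroscopic images the same call simply truncates the spectrum at the 95% energy level.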
Tracking in eigenspace. Let eN be an m × n ROI in a new incoming image measured during treatment delivery, represented as a d × 1 vector, where d = m × n. We wish to compute the similarity between this ROI and the tumor templates in eigenspace. First, we project eN onto the reduced eigenspace: we compute ci = Aiᵀ eN to get e′N = Σ_{i=1}^{q} ci Ai. Then the mean-squared error E(c) = ||e′N − e′T||² is calculated between e′N and the projection of each template in eigenspace, e′T. For each reference template, the ROI is
allowed to move in the incoming image around the template position within
a pre-set small range, which is ±10 pixels in the y-direction and ±5 pixels in the x-direction. Now, the tumor position is determined for this specific
template by minimizing the mean-squared error. This is done for all the tem-
plates. Therefore, every template will have a corresponding tumor position
and a minimized mean-squared error value. The smaller the mean-squared
error, the higher the similarity between the image and the template.
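The matching criterion itself is just a squared distance between eigenspace projections; a brief sketch (the transformation and vectors below are illustrative):

```python
import numpy as np

def eigenspace_error(A, roi_vec, template_vec):
    """Squared error E(c) = ||e'_N - e'_T||^2 between the eigenspace
    projections of an incoming ROI vector and a template vector.
    Smaller values mean a better match."""
    diff = A.T @ roi_vec - A.T @ template_vec
    return float(diff @ diff)

# toy check in a 2D eigenspace spanned by the first two coordinate axes
A = np.eye(4)[:, :2]
roi = np.array([1.0, 2.0, 3.0, 4.0])      # components 3 and 4 are discarded
t_same = np.array([1.0, 2.0, 0.0, 0.0])   # identical to roi in the subspace
t_far = np.array([2.0, 3.0, 0.0, 0.0])
```

Evaluating this error over the same small shift window as before, and taking the minimizing shift per template, yields one candidate tumor position and one error value per template.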
6.1.4 Voting Mechanism
For each patient, we have multiple, say 12, reference templates representing vari-
ous breathing phases and tumor positions. Using methods discussed in the previous
sections, for an incoming image, each template will be associated with an esti-
mated tumor position and a similarity measure (correlation score or mean-squared
error). The detected tumor position in the incoming image is a combination of
the estimated tumor positions using all templates, weighted with the corresponding
similarity measures.
The detected tumor position in each incoming image, (xt, yt), could be estimated using the tumor position corresponding to the template with the maximum score. A more robust method is to combine the positions from all the templates as

(xt, yt) = Σi wi (xi + Δxi, yi + Δyi) / Σi wi,    (6.2)

where (xi, yi) is the reference tumor position corresponding to template Ti, determined during patient setup, (Δxi, Δyi) is the small shift of the template during matching, and wi is the weighting factor for template Ti. There are many ways to
determine the weighting factors for the combination process. The weighting scheme
used in this work is to set wi = 1 for the templates with scores above a threshold
and wi = 0 for the templates with scores below the threshold. The threshold is set
empirically to be within 85–95% of the maximum similarity score. It varies from
patient to patient. We search from 85% to 95% and find the best threshold based on
the training data and use it on the incoming new image data during testing.
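The thresholded weighting scheme just described can be sketched as follows; the threshold fraction and names here are illustrative:

```python
import numpy as np

def vote(positions, scores, frac=0.90):
    """Combine per-template position estimates (Eq. 6.2): weight 1 for
    templates scoring at least frac * (maximum score), 0 otherwise."""
    positions = np.asarray(positions, dtype=float)   # one (x, y) per template
    scores = np.asarray(scores, dtype=float)
    w = (scores >= frac * scores.max()).astype(float)
    return (w[:, None] * positions).sum(axis=0) / w.sum()

# two high-scoring templates agree closely; the low scorer is ignored
pos = vote([(10, 20), (12, 22), (30, 40)], [1.00, 0.95, 0.50])
```

For the eigenspace method, where a smaller mean-squared error means a better match, the error would first be converted to a score in which larger is better before applying the same rule.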
6.2 Experimental Setup for Direct Tumor Tracking
The performance of the two proposed multiple-template tracking methods was evaluated in a set of experiments. The first method (Method 1) computes the correlation scores between motion-enhanced fluoroscopic images and multiple templates. The second method (Method 2) performs multiple-template matching in the eigenspace.
We have used six patients' fluoroscopic image data to evaluate the performance of these two methods. A region of interest of around 100 × 100 pixels in size was selected to include the tumor positions at various breathing phases. The training period we used to build our templates consisted of two cycles, which corresponds to about 60 image frames. A breathing cycle was equally divided into 10–12 bins and training images were placed in the corresponding bins based on the breathing phases. Then, we took the average of all frames in the same bin as one template, generating 10–12 templates for matching. Each reference template corresponds to a tumor position as determined in the patient setup process, either manually by a clinician or automatically by matching to the reference digitally reconstructed radiographs (DRRs) from the patient's 4D CT data [100].
To validate the algorithms, we need to compare our tracking results with the
reference tumor positions. A radiation oncologist manually contoured the tumor
projection in the first frame of the images. In the following frames, the radiation oncologist manually dragged the contour to the correct position using a computer mouse. The tumor centroid position in each image frame was calculated and used as the reference for comparison. A MATLAB (The MathWorks, Inc., Natick, MA, USA) program was written to facilitate this procedure. The performance of the
algorithms is measured by calculating the absolute distance between the algorithm-determined tumor position and the reference tumor position. Since the tumor motion in the superior–inferior (S–I) direction dominates, in this work we only present and evaluate the tracking results in the S–I direction (y-direction). The metrics we use for evaluation include the mean localization error (e) and the maximum localization error at a 95% confidence level (e95). Here, e is the average of the distances between tracked tumor centroids and reference centroids over all testing image frames, and e95 means that, among all testing image frames, only 5% of the frames have tracking errors larger than e95.
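Both metrics are straightforward to compute; a sketch under the assumption that the tracked and reference positions are 1D (S–I) arrays of equal length:

```python
import numpy as np

def localization_errors(tracked, reference):
    """Mean localization error e and e95, the error not exceeded by
    95% of the frames (95th percentile of the per-frame errors)."""
    err = np.abs(np.asarray(tracked, float) - np.asarray(reference, float))
    return err.mean(), np.percentile(err, 95)

# toy check: per-frame errors of 0, 1, 2, 3 and 4 pixels
e, e95 = localization_errors([0, 2, 4, 6, 8], [0, 1, 2, 3, 4])
```

Note that NumPy's default percentile uses linear interpolation between order statistics, which is one reasonable reading of "maximum error at a 95% confidence level"; taking the smallest error exceeded by at most 5% of frames would be an equally valid convention.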
6.3 Results and Discussion on Tumor Tracking
Figure 6.5: The correlation score (in gray scale) as a function of template ID (y-axis) and incoming image frame ID (x-axis).
Figure 6.5 is a plot showing the correlation score as a function of templates and incoming fluoroscopic images for Patient 4. The y-axis is the template ID, the x-axis is the incoming fluoroscopic image ID (time), and the gray scale value signifies the correlation score. The brighter the pixel, the higher the score. Note that the
the correlation score. The brighter the pixel, the higher the score. Note that the
correlation score in general cycles through the different templates as one would
expect. Each incoming image is highly correlated with several templates (high
intensity values along a vertical line in the figure). This is due to the fact that tumor
positions in the neighboring templates are close and we allow small shifts of the
templates during the matching procedure.
Figure 6.6: A comparison of the tracking results with and without voting for Method 2 and Patient 3. The tumor position (y-axis) as a function of time (x-axis). Black solid line: the reference tumor location. Blue dotted line: Method 2 without voting. Red dots: Method 2 with voting.
Figure 6.6 shows an example of the tracking results using Method 2 with
and without voting. The black solid line is the reference tumor location and the blue
dotted line is the original eigenspace tracking result without voting. By averaging
locations defined by those templates with correlation scores higher than 87% of the
maximum score, we get a smoother tumor motion trajectory which is marked by
the red dots. We can see it is less noisy and closer to the black line, especially at
the peaks and valleys of each breathing cycle. For all six patient data sets, we found
that voting can improve the tracking results. The computational cost for using the
voting scheme, which is on the order of 0.01 s per frame, is negligible compared to the search mechanism.
Figures 6.7–6.9 show our experimental results for Method 2 (eigenspace tracking with voting), in comparison with the reference results from an expert observer, for six patients. For all six patients, the motion range of the tumor mass is 17–36 pixels, corresponding to 8.5–18 mm, as shown in Table 6.1. The red dots are our tracking results for the tumor locations and the black solid line shows the reference trajectory. We can see that our tracking results are in good agreement with the reference results, as shown in Table 6.1 and Figures 6.7–6.9. For Method 1, the mean localization error (e) and the maximum localization error at a 95% confidence level (e95), averaged over six patients, are 1.5 and 4.2 pixels, while for Method 2, the corresponding numbers are 1.2 and 3.2 pixels. Note that the pixel size is about 0.5 mm. This means that for both proposed tracking methods, the mean tracking error is less than 1 mm and the maximum tracking error at a 95% confidence level is less than 3 mm. Figure 6.10 shows a bar plot of the experimental results for both Method 1 and Method 2.
As we can see, both methods are quite promising for lung tumor track-
ing without implanted fiducial markers. The voting scheme used in this work
makes the algorithms less sensitive to noise, thus the tracking results are smoother
and more robust. Method 2 (eigenspace tracking) performs slightly better than
Method 1 (motion-enhanced templates and correlation score). This is because, with
eigenspace tracking, we reduce the noise, remove the redundant information and
only keep the most informative dimensions from our image sequences, so that the
template matching is improved. We found the improvement is significant at the end
of inhale or exhale. Figure 6.11 shows an example.
Figure 6.7: Experimental results for a) Patient 1 and b) Patient 2. Black solid line: reference tumor motion trajectory. Red dots: tracking results using Method 2.
Figure 6.8: Experimental results for c) Patient 3 and d) Patient 4. Black solid line: reference tumor motion trajectory. Red dots: tracking results using Method 2.
Figure 6.9: Experimental results for e) Patient 5 and f) Patient 6. Black solid line: reference tumor motion trajectory. Red dots: tracking results using Method 2.
Table 6.1: Experimental results for the proposed multiple-template tracking methods. e is the mean localization error and e95 is the maximum localization error at a 95% confidence level.

                          Patient 1  Patient 2  Patient 3  Patient 4  Patient 5  Patient 6  Average
  Moving range (pixels)      36         23         18         18         25         17       22.5
  Method 1  e (pixels)       2.2        1.3        1.4        1.1        1.7        1.2      1.5
            e95 (pixels)     6          4          4          3          5          3        4.2
  Method 2  e (pixels)       1.6        0.9        1.1        0.9        1.6        1.1      1.2
            e95 (pixels)     4          3          3          2          4          3        3.2
Figure 6.10: Top: the average localization error (blue bars) and the maximum localization error at a 95% confidence level (red bars) for Method 1. Bottom: the same errors for Method 2.
Figure 6.11: A comparison between Method 1 and Method 2 for Patient 3. The tumor position (y-axis) as a function of time (x-axis). Black solid line: the reference tumor location. Blue dotted line: Method 1. Red dots: Method 2.
Another interesting observation is that the tracking error increases roughly linearly as the tumor excursion increases. Using a simple linear extrapolation, we estimated the tracking error to be about 1.5 mm for a 2 cm tumor excursion, and about 1.7 mm for a 3 cm excursion. It might therefore be worthwhile to make the number of reference templates proportional to the tumor motion range, which will be explored in future work.
The remaining tracking errors of the proposed algorithms may come from the following sources. Since our reference results were produced manually, some human error is inevitable. Ideally, we should have multiple expert observers generate multiple reference data sets so that the inter-observer variation can be estimated. However, manually identifying the tumor position in each of 300–400 image frames for every one of six patients is very labor-intensive work. Therefore, instead of using multiple expert observers, the same expert observer marked the tumor position twice for most of the images. We found that the difference between the two marked tumor positions is within 2 pixels on average, which is comparable with our tracking errors. This means that the proposed algorithms seem to have the same order of accuracy as human experts in terms of tumor tracking in fluoroscopic images. This work is limited to a feasibility study; we leave more comprehensive evaluation and validation for future work.
Another error source is the instability of patients' breathing cycles. Templates developed during patient setup may not cover all tumor states and positions during treatment delivery. When the tumor drifts out of the region of movement seen in the training session, there is no template that it can be matched to exactly. Thus, errors occur at the end of inhale/exhale, as shown in Figures 6.7–6.9. In this case, the correlation score will be low and the mean-squared error will be high, which gives us a clue that the tumor is drifting. This weakness of the proposed algorithms will be addressed in our future work.
6.4 Summary for Multiple Template Tracking
We have demonstrated the feasibility of tracking a tumor mass or nearby anatomic
feature in fluoroscopic images. Two multiple-template matching algorithms have been proposed and evaluated, one based on motion-enhanced templates and the correlation score, and the other based on eigenspace tracking. For both methods,
a voting scheme has been used to improve the smoothness and robustness of the
tracking results, resulting in accuracies within 3 mm. These methods can be poten-
tially used for the precise treatment of lung cancer using either respiratory gating
or beam tracking.
Chapter 7
Concluding Remarks
Machine learning and data mining are two emerging fields whose applications will
ultimately touch every aspect of human life. In this thesis, we have successfully
applied the algorithms from these fields to an important domain of medical image-
guided radiotherapy for the treatment and control of lung cancer. In addition,
we advanced the field of machine learning and data mining by introducing a new
paradigm for exploratory data analysis with multi-view orthogonal clustering, and
developing a novel method for feature selection via transformation based methods.
The goal of data mining is to extract information and discover structures
from large databases. A popular technique for mining patterns from data is clus-
tering. However, traditional clustering algorithms only find one clustering solution
even though many applications have data which are multi-faceted by nature. We
have introduced a new paradigm for exploratory data clustering that seeks to ex-
tract all non-redundant clustering views from a given set of data. We introduced a
non-redundant multi-view clustering framework that discovers different meaning-
ful partitions/clusterings by iterative orthogonalization. We have shown both the
orthogonal clustering algorithm and the clustering in orthogonal subspaces algo-
rithm instantiations of this framework worked successfully toward finding multiple
non-redundant clustering views. We also developed a fully automated version of
this framework by combining our algorithm with gap statistics to automatically find
the number of clusters in each iteration, and for determining when to stop looking
for alternative views. Our experiments confirmed that the proposed new clustering
paradigm is applicable to different types of data, such as text and images.
Clustering is closely related to dimensionality reduction, because different
clusterings often lie in different subspaces. If the right subspace can be discovered,
the clustering task will become much easier. Redundant features and irrelevant
features can significantly degrade clustering performance. Feature subset se-
lection as a dimensionality reduction technique serves as a tool to remove unwanted
features, as well as to keep the original meaning of the features for explanatory pur-
poses. We have introduced principal feature selection through principal component
analysis (PFS-PCA) and shown that the method effectively removes redundancy
among the features while at the same time selecting the most informative features.
In the second half of this dissertation, we presented a successful application
of our learning techniques to the medical application of lung tumor image-guided
radiotherapy. There are two ways to perform image-guided radiotherapy: one is
through gating, and the other through tracking. We developed techniques to tackle
both. Current practice utilize external markers to locate the tumors during treat-
ment. However, external markers are not accurate. There are ongoing research in
the use of internal surrogates, but they have the associated risk of pneumothorax.
In this thesis, we avoid markers altogether and study the feasibility of radiotherapy
through fluoroscopy images.
To perform markerless gated radiotherapy, we investigated four different
methods: single template matching, multiple template matching, clustering tem-
plate method, and a support vector machine (SVM) classifier. Our feasibility study on five different patients has shown that the clustering template method and the SVM classifier generate robust, accurate and efficient gating signals from fluoroscopic image sequences without fiducial markers for lung tumor treatment. At a 35% duty cycle, they achieve average delivered target doses of 94.9% and 95.8%, with real duty cycles of 34.7% and 40.8%, respectively. At a 50% duty cycle, they obtain average delivered target doses of 97.6% and 98.4%, with real duty cycles of 49.5% and 53.1%, respectively. This study further shows that the gating problem can be recast as a classification problem. This opens up new directions for improving gating by exploring other classification techniques, such as Bayesian classifiers [112], neural networks [113], and hidden Markov models [114, 115]. All these demonstrate that we have successfully applied machine learning and data mining algorithms to the real-world problem of lung tumor treatment in IGRT.
In addition to research on gated radiotherapy, we also developed methods
for directly tracking the tumor mass without markers. We investigated using our
multiple-template matching method and eigenspace tracking. These two tracking
methods have been carefully evaluated against the physician marked tumor loca-
tions. The tracking error for multiple-template matching method is 1.5 pixel on
average, corresponding to around 0.75mm. For eigenspace tracking, the average
error is 1.2 pixel, corresponding to 0.6mm.
To turn this research into a real product, the next steps will be (1) testing the
algorithms using more and longer patient data, (2) finding a better way to get refer-
ence gating signals for validation, and (3) evaluating the dosimetric consequence of
the current error level to see if there is a need to further lower the error rates. Then, clinical implementation of these methods will be considered.
Bibliography
[1] I. T. Jolliffe. Principal Component Analysis. Springer, second edition, 2002.
[2] K. Fukunaga. Statistical Pattern Recognition (second edition). Academic
Press, San Diego, CA, 1990.
[3] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. Computer Vision and Pattern Recognition (Maui, HI), pages 586–591, 1991.
[4] S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman.
Indexing by latent semantic analysis. Journal of the American Society for
Information Science, 41(6):391–407, 1990.
[5] K. Y. Yeung and W. L. Ruzzo. Principal component analysis for clustering
gene expression data. Bioinformatics, 17(9):763–74, 2001.
[6] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
[7] S. B. Jiang. Radiotherapy of mobile tumors. Semin Radiat Oncol, 16(4):239–
48, 2006.
[8] Q. S. Chen, M. S. Weinhous, F. C. Deibel, J. P. Ciezki, and R. M. Macklis. Fluoroscopic study of tumor motion due to breathing: facilitating precise radiation therapy for lung cancer patients. Med Phys, 28(9):1850–6, 2001.
[9] H. Shirato, T. Harada, T. Harabayashi, K. Hida, H. Endo, K. Kita-
mura, R. Onimaru, K. Yamazaki, N. Kurauchi, T. Shimizu, N. Shinohara,
M. Matsushita, H. Dosaka-Akita, and K. Miyasaka. Feasibility of inser-
tion/implantation of 2.0-mm-diameter gold internal fiducial markers for pre-
cise setup and real-time tumor tracking in radiotherapy. Int J Radiat Oncol
Biol Phys, 56(1):240–7., 2003.
[10] T. Harada, H. Shirato, S. Ogura, S. Oizumi, K. Yamazaki, S. Shimizu, R. On-
imaru, K. Miyasaka, M. Nishimura, and H. Dosaka-Akita. Real-time tumor-
tracking radiation therapy for lung carcinoma by the aid of insertion of a gold
marker using bronchofiberscopy. Cancer, 95(8):1720–7., 2002.
[11] Y. Seppenwoolde, H. Shirato, K. Kitamura, S. Shimizu, M. van Herk, J. V. Lebesque, and K. Miyasaka. Precise and real-time measurement of 3D tumor motion in lung due to breathing and heartbeat, measured during radiotherapy. Int J Radiat Oncol Biol Phys, 53(4):822–34, 2002.
[12] D. P. Gierga, J. Brewer, G. C. Sharp, M. Betke, C. G. Willett, and G. T.
Chen. The correlation between internal and external markers for abdominal
tumors: implications for respiratory gating. Int J Radiat Oncol Biol Phys,
61(5):1551–8, 2005.
[13] F. Laurent, V. Latrabe, B. Vergier, and P. Michel. Percutaneous CT-guided biopsy of the lung: comparison between aspiration and automated cutting needles using a coaxial technique. Cardiovasc Intervent Radiol, 23(4):266–72, 2000.
[14] S. Arslan, A. Yilmaz, B. Bayramgurler, O. Uzman, E. Nver, and E. Akkaya. CT-guided transthoracic fine needle aspiration of pulmonary lesions: accuracy and complications in 294 patients. Med Sci Monit, 8(7):CR493–7, 2002.
[15] P. R. Geraghty, S. T. Kee, G. McFarlane, M. K. Razavi, D. Y. Sze, and M. D. Dake. CT-guided transthoracic needle aspiration biopsy of pulmonary nodules: needle size and pneumothorax rate. Radiology, 229(2):475–81, 2003.
[16] R. I. Berbeco, H. Mostafavi, G. C. Sharp, and S. B. Jiang. Towards fluoro-
scopic respiratory gating for lung tumours without radiopaque markers. Phys
Med Biol, 50(19):4481–90, 2005.
[17] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM
Computing Surveys, 31(3):264–323, 1999.
[18] Alexander Strehl and Joydeep Ghosh. Cluster ensembles – a knowledge
reuse framework for combining multiple partitions. Journal on Machine
Learning Research (JMLR), pages 583–617, 2002.
[19] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data
clustering: A cluster ensemble approach. In Proceedings of the International
Conference on Machine Learning, 2003.
[20] S. Bickel and T. Scheffer. Multi-view clustering. In Proceedings of the IEEE
International Conference on Data Mining, 2004.
[21] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Conference on Computational Learning Theory, pages 92–100, 1998.
[22] K. Crammer and Y. Singer. A family of additive online algorithms for cate-
gory ranking. Journal of Machine Learning Research, 3:1025–1058, 2003.
[23] S. Gao, W. Wu, C. H. Lee, and T. S. Chua. An MFoM learning approach to robust multiclass multi-label text categorization. In Proceedings of the 21st International Conference on Machine Learning, page 42, 2004.
[24] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
[25] D Gondek. Non-redundant clustering. PhD thesis, Brown University, 2005.
[26] G. Chechik and N. Tishby. Extracting relevant structures with side informa-
tion. In Advances in Neural Information Processing Systems 15 (NIPS-2002),
2003.
[27] D. Gondek and T. Hofmann. Conditional information bottleneck clustering.
In The 3rd IEEE Intl. Conf. on Data Mining, Workshop on Clustering Large
Data Sets, 2003.
[28] D. Gondek and T. Hofmann. Non-redundant data clustering. In Proc. of the
4th Intl. Conf. on Data Mining, 2004.
[29] David Gondek and Thomas Hofmann. Non-redundant clustering with condi-
tional ensembles. In Proc. of the 11th ACM SIGKDD Intl. Conf. on Knowl-
edge Discovery and Data Mining (KDD’05), pages 70–77, 2005.
[30] D. Gondek, S. Vaithyanathan, and A. Garg. Clustering with model-level
constraints. In Proc. of SIAM International Conference on Data Mining,
2005.
[31] E. Bae and J. Bailey. Coala: A novel approach for the extraction of an
alternate clustering of high quality and high dissimilarity. In Proceedings
of the Sixth International Conference on Data Mining, pages 53–62, Hong
Kong, December 2006.
[32] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith. Meta clustering. In
Proceedings of the Sixth International Conference on Data Mining, pages
107–118, Hong Kong, December 2006.
[33] P. Jain, R. Meka, and I. S. Dhillon. Simultaneous unsupervised learning of disparate clusterings. In Proceedings of the Seventh SIAM International Conference on Data Mining, 2008.
[34] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. Adaptive di-
mension reduction for clustering high dimensional data. In Proc. of the 2nd
IEEE Int’l Conf. on Data Mining, pages 147–154, 2002.
[35] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar
Raghavan. Automatic subspace clustering of high dimensional data for data
mining applications. In Proceedings of the 1998 ACM SIGMOD Int’l Conf.
on Management of Data, pages 94–105, 1998.
[36] Lance Parsons, Ehtesham Haque, and Huan Liu. Subspace clustering for
high dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90–105,
2004.
[37] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial
Intelligence, 97(1-2):273–324, 1997.
[38] E. Amaldi and V. Kann. On the approximability of minimizing nonzero
variables and unsatisfied relations in linear systems. Theoret. Comput. Sci.,
209:237–260, 1998.
[39] A. W. Whitney. A direct method of nonparametric measurement selection.
IEEE Transactions Computers, 20:1100–1103, 1971.
[40] T. Marill and D. M. Green. On the effectiveness of receptors in recognition
systems. IEEE Transactions on Information Theory, 9:11–17, 1963.
[41] J. Kittler. Feature set search algorithms. In Pattern Recognition and Signal
Processing, pages 41–60, 1978.
[42] J. Doak. An evaluation of feature selection methods and their application to
computer security. Technical report, University of California at Davis, 1992.
[43] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[44] L. Yu and H. Liu. Efficient feature selection via analysis of relevance and
redundancy. Journal of Machine Learning Research, 5:1205–1224, 2004.
[45] H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar. Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, 3:1399–1414, 2003.
[46] G.P. McCabe. Principal variables. Technometrics, 26:127–134, 1984.
[47] J. Cadima, O. Cerdeira, and M. Minhoto. Rotation of principal components:
choice of normalization constraints. Journal of Applied Statistics, 22:29–35,
1995.
[48] I. T. Jolliffe and M Uddin. A modified principal component technique based
on the lasso. Journal of Computational and Graphical Statistics, 12:531–
547, 2003.
[49] R Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B, 58:267–288, 1996.
[50] W. J. Krzanowski. Selection of variables to preserve multivariate data struc-
ture, using principal components. Applied Statistics, 36(1):22–33, 1987.
[51] K. Z. Mao. Identifying critical variables of principal components for unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(2):339–344, 2005.
[52] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian. Feature selection using principal feature analysis. In Proceedings of the 15th International Conference on Multimedia, pages 301–304, 2007.
[53] P. J. Keall, V. R. Kini, S. S. Vedam, and R. Mohan. Potential radiotherapy
improvements with respiratory gating. Australas Phys Eng Sci Med, 25(1):1–
6, 2002.
[54] I. Jacobs, J. Vanregemorter, and P. Scalliet. Influence of respiration on calcu-
lation and delivery of the prescribed dose in external radiotherapy. Radiother
Oncol, 39(2):123–8, 1996.
[55] S. B. Jiang, T. Bortfeld, A. Trofimov, E. Rietzel, G. C. Sharp, N. Choi, and G. T. Y. Chen. Synchronized moving aperture radiation therapy (SMART): treatment planning using 4D CT data. In The 14th International Conference
on the Use of Computers in Radiation Therapy, Seoul, Korea, 2004.
[56] M. Engelsman, E. M. Damen, K. De Jaeger, K. M. van Ingen, and B. J.
Mijnheer. The effect of breathing and set-up errors on the cumulative dose
to a lung tumor. Radiother Oncol, 60(1):95–105, 2001.
[57] C.W. Stevens, R.F. Munden, K.M. Forster, J.F. Kelly, Z. Liao, G. Starkschall,
S. Tucker, and R. Komaki. Respiratory-driven lung tumor motion is inde-
pendent of tumor size, tumor location, and pulmonary function. Int J Radiat
Oncol Biol Phys, 51(1):62–8, 2001.
[58] S. C. Davies, A. L. Hill, R. B. Holmes, M. Halliwell, and P. C. Jackson. Ultrasound quantitation of respiratory organ motion in the upper abdomen. British Journal of Radiology, 67:1096–1102, 1994.
[59] P.J. Bryan, S. Custar, J.R. Haaga, and V. Balsara. Respiratory movement of
the pancreas: an ultrasonic study. J Ultrasound Med, 3(7):317–20, 1984.
[60] C. Ozhasoglu and M. J. Murphy. Issues in respiratory motion compen-
sation during external-beam radiotherapy. Int J Radiat Oncol Biol Phys,
52(5):1389–99, 2002.
[61] P.H. Weiss, J.M. Baker, and E.J. Potchen. Assessment of hepatic respiratory
excursion. J Nucl Med, 13(10):758–9, 1972.
[62] G. Harauz and M.J. Bronskill. Comparison of the liver’s respiratory motion
in the supine and upright positions: concise communication. J Nucl Med,
20(7):733–5, 1979.
[63] P. Giraud, Y. De Rycke, B. Dubray, S. Helfre, D. Voican, L. Guo, J. C. Rosen-
wald, K. Keraudy, M. Housset, E. Touboul, and J. M. Cosset. Conformal
radiotherapy (crt) planning for lung cancer: analysis of intrathoracic organ
motion during extreme phases of breathing. Int J Radiat Oncol Biol Phys,
51(4):1081–92, 2001.
[64] E. C. Ford, G. S. Mageras, E. Yorke, K. E. Rosenzweig, R. Wagman, and
C. C. Ling. Evaluation of respiratory movement during gated radiother-
apy using film and electronic portal imaging. Int J Radiat Oncol Biol Phys,
52(2):522–31, 2002.
[65] J. Hanley, M. M. Debois, D. Mah, G. S. Mageras, A. Raben, K. Rosenzweig,
B. Mychalczak, L. H. Schwartz, P. J. Gloeggler, W. Lutz, C. C. Ling, S. A.
Leibel, Z. Fuks, and G. J. Kutcher. Deep inspiration breath-hold technique
for lung tumors: the potential value of target immobilization and reduced
lung density in dose escalation. Int J Radiat Oncol Biol Phys, 45(3):603–11,
1999.
[66] V.M. Remouchamps, N. Letts, D. Yan, F.A. Vicini, M. Moreau, J.A. Zielin-
ski, J. Liang, L.L. Kestin, A.A. Martinez, and J.W. Wong. Three-dimensional
evaluation of intra- and interfraction immobilization of lung and chest wall
using active breathing control: A reproducibility study with breast cancer
patients. Int J Radiat Oncol Biol Phys, 57(4):968–78, 2003.
[67] H. D. Kubo, P. M. Len, S. Minohara, and H. Mostafavi. Breathing-synchronized radiotherapy program at the University of California Davis Cancer Center. Med Phys, 27(2):346–53, 2000.
[68] H. D. Kubo and B. C. Hill. Respiration gated radiotherapy treatment: a
technical study. Phys Med Biol, 41(1):83–91, 1996.
[69] M. J. Murphy. Tracking moving organs in real time. Semin Radiat Oncol,
14(1):91–100, 2004.
[70] M. J. Murphy, S. D. Chang, I. C. Gibbs, Q. T. Le, J. Hai, D. Kim, D. P. Martin, and J. R. Adler, Jr. Patterns of patient movement during frameless
image-guided radiosurgery. Int J Radiat Oncol Biol Phys, 55(5):1400–8,
2003.
[71] S. Webb. Limitations of a simple technique for movement compensation via
movement-modified fluence profiles. Phys. Med. Biol., 50(14):N155–N161,
2005.
[72] S. Minohara, T. Kanai, M. Endo, K. Noda, and M. Kanazawa. Respiratory
gated irradiation system for heavy-ion radiotherapy. Int J Radiat Oncol Biol
Phys, 47(4):1097–103, 2000.
[73] T. Tada, K. Minakuchi, T. Fujioka, M. Sakurai, M. Koda, I. Kawase, T. Naka-
jima, M. Nishioka, T. Tonai, and T. Kozuka. Lung cancer: intermittent irra-
diation synchronized with respiratory motion–results of a pilot study. Radi-
ology, 207(3):779–83.
[74] T. Okumura, H. Tsuji, and Y. Hayakawa. Respiration-gated irradiation system
for proton radiotherapy. In Proceedings of the 11th international conference
on the use of computers in radiation therapy, pages 358–359, Manchester,
1994. North Western Medical Physics Dept. Christie Hospital.
[75] S. Shimizu, H. Shirato, S. Ogura, H. Akita-Dosaka, K. Kitamura, T. Nish-
ioka, K. Kagei, M. Nishimura, and K. Miyasaka. Detection of lung tumor
movement in real-time tumor-tracking radiotherapy. Int J Radiat Oncol Biol
Phys, 51(2):304–10.
[76] H. Shirato, S. Shimizu, T. Kunieda, K. Kitamura, M. van Herk, K. Kagei,
T. Nishioka, S. Hashimoto, K. Fujita, H. Aoyama, K. Tsuchiya, K. Kudo,
and K. Miyasaka. Physical aspects of a real-time tumor-tracking system for
gated radiotherapy. Int J Radiat Oncol Biol Phys, 48(4):1187–95, 2000.
[77] M. Imura, K. Yamazaki, H. Shirato, R. Onimaru, M. Fujino, S. Shimizu,
T. Harada, S. Ogura, H. Dosaka-Akita, K. Miyasaka, and M. Nishimura. In-
sertion and fixation of fiducial markers for setup and tracking of lung tumors
in radiotherapy. Int J Radiat Oncol Biol Phys, 63(5):1442–7, 2005.
[78] E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability
of classifications. Biometrics, 21:768, 1965.
[79] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.
[80] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New-York,
1986.
[81] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley
& Sons, NY, 1973.
[82] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[83] S. D. Bay. The UCI KDD archive, 1999.
[84] CMU. CMU 4 universities WebKB data, 1997.
[85] I. S. Dhillon and D. M. Modha. Concept decompositions for large sparse text
data using clustering. Machine Learning, 42(1):143–175, 2001.
[86] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics,
6(2):461–464, 1978.
[87] H. Akaike. A new look at the statistical model identification. IEEE Transac-
tions on Automatic Control, AC-19(6):716–723, December 1974.
[88] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, pages 727–734. Morgan Kaufmann, 2000.
[89] S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A
resampling-based method for class discovery and visualization of gene ex-
pression microarray data. Machine Learning, 52:91–118, 2003.
[90] V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to
cluster validation. In Intl. Conf. on Computational Statistics, pages 123–129,
2002.
[91] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters
in a dataset via the gap statistic. J. R. Statist. Soc., 63(2):411–423, 2001.
[92] G. Golub and C. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, 1996.
[93] H. Zou and T. Hastie. Regression shrinkage and selection via the elastic net, with applications to microarrays. Technical report, Stanford University, 2003.
[94] P. Murphy and D. Aha. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1994.
[95] D. Aha, P. Murphy, and C. Merz. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1997.
[96] J. G. Dy, C. E. Brodley, A. Kak, L. S. Broderick, and A. M. Aisen. Un-
supervised feature selection applied to content-based retrieval of lung im-
ages. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(3):373–378, March 2003.
[97] J. Ye and T. Xiong. Computational and theoretical analysis of null space
and orthogonal linear discriminant analysis. Journal of Machine Learning
Research, 7:1183–1204, 2006.
[98] Chris Ding and Tao Li. Adaptive dimension reduction using discriminant
analysis and k-means clustering. In Proceedings of the 24th international
conference on Machine learning, volume 227, pages 521–528, 2007.
[99] R. I. Berbeco, S. B. Jiang, G. C. Sharp, G. T. Chen, H. Mostafavi, and H. Shirato. Integrated radiotherapy imaging system (IRIS): design considerations of
tumour tracking with linac gantry-mounted diagnostic x-ray systems with
flat-panel detectors. Phys Med Biol, 49(2):243–55, 2004.
[100] X. Tang, G. C. Sharp, and S. B. Jiang. Patient setup based on lung tumor mass
for gated radiotherapy. Med. Phys. (abstract), 33:2244, 2006.
[101] R. Jain, R. Kasturi, and B. G. Schunck. Machine Vision. McGraw-Hill, New York, 1995.
[102] I. T. Jolliffe. Principal Component Analysis. Springer, Berlin, 2002.
[103] L. Breiman. Bagging predictors. Machine Learning, 26(2):123–140, 1996.
[104] G. J. McLachlan and K. E. Basford. Mixture Models, Inference and Applica-
tions to Clustering. Marcel Dekker, New York, 1988.
[105] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[106] V. Vapnik. The nature of statistical learning theory. Springer, New York,
1995.
[107] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, 1995.
[108] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[109] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2003. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[110] T. Neicu, R. Berbeco, J. Wolfgang, and S. B. Jiang. Synchronized moving
aperture radiation therapy (SMART): improvement of breathing pattern reproducibility using respiratory coaching. Phys Med Biol, 51(3):617–36, 2006.
[111] H. Murase and S. Nayar. Visual learning and recognition of 3-d objects from
appearance. Int. J. Comput. Vis, 14:5–24, 1995.
[112] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): theory and
results. In Advances in Knowledge Discovery and Data Mining, pages 153–
180, Cambridge, MA, 1996. AAAI/MIT Press.
[113] J. A. Freeman and D. M. Skapura. Neural networks: Algorithms, Applica-
tions, and Programming Techniques. Addison-Wesley, 1991.
[114] C. D. Mitchell. Improving Hidden Markov Models for Speech Recognition.
PhD thesis, Purdue University, W. Lafayette, Indiana, May 1995.
[115] P. Smyth. Clustering sequences with hidden Markov models. In M. C. Mozer,
M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Pro-
cessing 9. MIT Press, 1997.