Mining Exhaled Volatile Organic Compounds for Breast...


Page 1: Mining Exhaled Volatile Organic Compounds for Breast ...zoe.bme.gatech.edu/~klee7/docs/cha/chemcanc_HDM2008_v2.pdf · 2 Mining Exhaled Volatile Organic Compounds for Breast Cancer

2

Mining Exhaled Volatile Organic Compounds for Breast Cancer Detection

Kichun Sky Lee1, M. Forrest Abouelnasr1, Charlene Bayer1, Sheryl G.A. Gabram2, Boris Mizaikoff3, André Rogatko2, and Brani Vidakovic1,2

Georgia Institute of Technology, Atlanta, USA1; Emory University, Atlanta, USA2; University of Ulm, Germany3

CONTENTS
2.1 Introduction
2.2 What are Volatile Organic Compounds?
2.3 Dimension reduction
2.4 Conclusion and Discussion
References

This paper provides a new application of a nonlinear dimension reduction technique to analyze Volatile Organic Compounds (VOCs) in human breath, which are measured as high-dimensional and sparse data. In the breast cancer diagnostics context, researchers are interested in classifying subjects into cases and controls based on the exhaled VOC content. It is demonstrated that principal component analysis is inferior in capturing the relations between VOCs and incidence of breast cancer to a nonlinear dimension reduction methodology that considers the geometric manifold of the observed VOCs. To take the chemical properties into account, our dimension reduction approach introduces measures of chemical closeness between VOCs. We demonstrate that the nonlinear dimension reduction technique in our context results in a weak classifier that, in conjunction with other diagnostic modalities, can improve the accuracy of diagnosis of breast cancer.

2.1 Introduction

Women in the U.S. have a one-in-eight lifetime chance of developing invasive breast cancer and a 1 in 33 chance of dying from breast cancer (ACS, 2006). However, if the disease is detected early, it is treatable. Mortality can be significantly reduced with the currently used modalities (Henderson, 1991). A major problem with early detection of breast cancer is that mammography can be uncomfortable or unavailable due to a lack of financial and technical resources. A promising new diagnostic method is the analysis of exhaled Volatile Organic Compounds (Cao and Duan, 2006, Poli et al., 2005). While the exact reactions and chemical effects which accompany breast cancer are not accurately known, links may still be found between the incidence of the disease and the occurrences of certain exhaled VOCs. Recent research involving canine scent detection with various types of cancer demonstrated high sensitivity and specificity of these nature-made classifiers (Horvath et al., 2008, Pickel et al., 2004, Dobson, 2003, Willis et al., 2004, McCulloch et al., 2006, Gordon et al., 2008).

1-58488-344-8/04/$0.00+$1.50 © 2004 by CRC Press LLC

This study was performed to investigate this link, using summaries of breath mass spectrometric analysis as informative descriptors. Two groups were examined: a group of women who had recently been diagnosed with breast cancer (but who had not yet started any treatment) and a control group of 24 women. Using the mass-spectrometric summaries in the context of Laplacian eigenmap dimension reduction, we were able to predictively distinguish between the two groups with a correct classification rate of 0.7580, on average. One of the novelties of our research is the guidance of this classification scheme by chemical closeness between the VOCs, expressed by the so-called closeness matrix, which is critical in tuning the Laplacian eigenmaps.

2.2 What are Volatile Organic Compounds?

Volatile organic compounds are organic chemical compounds whose vapor pressure under normal conditions is high enough for them to vaporize significantly and enter the atmosphere (EPA, 2007). Common artificial VOCs include paint thinners, pharmaceuticals, refrigerants, dry cleaning solvents, and some constituents of petroleum fuels (e.g., gasoline and natural gas). Flora and fauna are also important biological sources of VOCs; for example, it is known that trees emit large amounts of VOCs, especially isoprene and terpenes. In this research, human-exhaled VOCs were collected for diagnostics of breast cancer in the following way.

Human subjects: Two groups of subjects were examined. The case group consisted of women who had recently been diagnosed with breast cancer at Stages II, III, or IV, prior to receiving any treatment. The control group consisted of healthy women in a similar age range confirmed to be cancer-free by a mammogram taken less than six months prior to the sample collection. Subjects were not allowed to eat or drink for at least two hours prior to breath sample collection.

Breath Collection and Assay: Human alveolar breath and background air were sampled separately. The alveolar breath of each subject was collected by the subject breathing five times at five-minute intervals into a valved Teflon® sampling bulb (Markes Bio-VOC Sampler) containing a Radiello® passive sampler. Analysis was via thermal desorption/gas chromatography/mass spectrometry (Markes International Ltd. ULTRA thermal desorber/Finnigan Trace GC ULTRA/Finnigan Trace DSQ mass spectrometer) (Lechner and Rieder, 2007, Manini et al., 2006). Background air was collected passively with Radiello samplers placed in the rooms during the breath sample collection time period. The breath data were corrected for the background concentration data.

2.2.1 Description of Data

Our observations consist of expressions of 378 VOCs for each of a total of 41 subjects ($x_i \in \mathbb{R}^{378}$, $i = 1, \ldots, 41$). Of the 41 subjects, 17 came from the case group ($y_i = +1$) and 24 from the control group ($y_i = -1$), that is,

$$(x_i, y_i) \in \mathbb{R}^m \times \{-1, +1\}, \quad i = 1, \ldots, n, \tag{2.1}$$

with $m = 378$ and $n = 41$. Our goal is to construct a classifying function $C$,

$$C : x \in \mathbb{R}^m \mapsto \{-1, +1\}, \tag{2.2}$$

where a label for a new observation $x_{new}$ is predicted as $C(x_{new})$. Since the dimension of each observation $x_i$ exceeds the number of observations, for inference purposes we have a so-called "small $n$, large $p$" type of problem. In addition, the data are sparse, that is, many VOCs for a particular subject are not observed (Figure 2.1(a-d)).

2.3 Dimension reduction

Two approaches to dimension reduction (linear and non-linear) are discussed and compared. The construction of the classifying function $C$ is discussed in the subsequent sections.

2.3.1 Linear Dimension Reduction

Principal component analysis (PCA), also called Karhunen-Loève decomposition in engineering fields, is a widely used technique in multivariate data analysis as a linear dimension reduction method (Jolliffe, 2002). In PCA, directions $u_i$ defining linear combinations of the data are iteratively chosen so that they are mutually orthogonal and "compress" the variance. Formally, the principal components $u_i$ are defined as follows.

$$u_i = \arg\max_{\substack{\|u_i\| = 1 \\ u_i \perp u_j,\ j = 1, \ldots, i-1}} \mathrm{var}(u_i^T x), \tag{2.3}$$

where for $u_1$ the orthogonality constraint is removed by definition. These $u_i$ are the optimal linear representation in the least-squares sense. By defining $X = [x_1\ x_2\ \cdots\ x_n]$, an empirical version of equation 2.3 can be written as

$$u_i = \arg\max_{\substack{\|u_i\| = 1 \\ u_i \perp u_j,\ j = 1, \ldots, i-1}} u_i^T X X^T u_i \tag{2.4}$$

FIGURE 2.1
Plots of VOC data for Case and Control groups. Panels: (a) scaled image of VOC data for the case group (VOCs vs. subject index); (b) scaled image for the control group; (c) magnitude of VOC vs. index of VOCs for the 2nd subject in the case group, i.e., the 2nd column of the matrix in (a); (d) magnitude of VOC vs. index of VOCs for the 10th subject in the control group, i.e., the 10th column of the matrix in (b).

which is solved by the eigenvector associated with the $i$th largest eigenvalue of $XX^T$. The dimension reduction is achieved by taking a subset $\{u_1, u_2, \ldots, u_k\}$, $k < \min\{n, p\}$, as a basis. An illustration for $x \in \mathbb{R}^2$ is shown in Figure 2.2. The first principal component corresponds to the direction of largest variability in $x$; the second is orthogonal to the first. In this case, dimension reduction can be achieved by replacing the two-dimensional data with the scores along the first principal component.

The scores of the first and second principal components for the VOC data are shown in Figure 2.3. As is evident, the discriminative power of the first component is highly influenced by outliers.
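The eigenvector characterization in (2.4) can be illustrated with a minimal NumPy sketch. The function name `pca_scores`, the toy data, and the explicit centering step are assumptions made for illustration; they are not taken from the study.

```python
import numpy as np

def pca_scores(X, k):
    """Return the top-k principal directions and scores for X (m features x n subjects)."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each feature (assumption of this sketch)
    S = Xc @ Xc.T                             # m x m matrix X X^T as in (2.4)
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    U = eigvecs[:, order]                     # columns u_1, ..., u_k
    return U, U.T @ Xc                        # scores have shape k x n

# Toy data standing in for the VOC matrix: 5 features, 40 subjects.
rng = np.random.default_rng(0)
scales = np.array([[3.0], [1.0], [0.5], [0.2], [0.1]])
X = scales * rng.normal(size=(5, 40))
U, scores = pca_scores(X, k=2)
```

Each row of `scores` holds the coordinates of all subjects along one principal direction, which is the kind of quantity plotted in Figure 2.3.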

2.3.2 Nonlinear Dimension Reduction

FIGURE 2.2
Illustration of PCA with 2 dimensional data, showing the directions of the 1st and 2nd components.

FIGURE 2.3
Scores of the principal components for VOC data; a few outliers dominate the geometry. Panels: (a) 1st vs. 2nd principal coordinate; (b) 1st vs. 3rd principal coordinate. Legend: Control, Case.

Many nonlinear dimension reduction techniques guided by a non-linear manifold in which "data live" have been proposed. These include Local Linear Embedding (LLE) (Roweis and Saul, 2000), ISOMAP (Tenenbaum et al., 2000), Hessian Eigenmaps (Donoho et al., 2003), Laplacian Eigenmaps (Belkin et al., 2003), and Diffusion Maps (Lafon, 2004), to name a few. Unlike PCA, the listed non-linear methods make their transformations in such a way as to preserve the distance structure of the data, thus capturing its intrinsic dimensionality. These learning algorithms can be unified by the following framework, describing the computation of an embedding for the given data set.

Algorithm

1. Start from the data set $\{x_1, \ldots, x_n\}$ of size $n$ where each observation belongs to $\mathbb{R}^m$. Construct an $n \times n$ "neighborhood" or similarity matrix $W$.

2. Optionally normalize $W$ as $\tilde{W}$.

3. Compute the $p$ largest nontrivial eigenvalues and eigenvectors $v_j$ of $\tilde{W}$.

4. The embedding of each observation $x_i$ is the vector $y_i$ with elements $y_{ij}$ representing the $i$-th element of the $j$-th eigenvector $v_j$ of $\tilde{W}$.
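The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the symmetric degree normalization $D^{-1/2} W D^{-1/2}$, the Gaussian similarity with width 8, and the six toy points are choices made here, and the name `spectral_embedding` is hypothetical.

```python
import numpy as np

def spectral_embedding(W, p):
    """Embed n points using the p leading nontrivial eigenvectors of a normalized similarity matrix."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    W_tilde = W / np.sqrt(np.outer(d, d))        # step 2: D^{-1/2} W D^{-1/2} (assumed normalization)
    eigvals, eigvecs = np.linalg.eigh(W_tilde)   # step 3: eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1]
    keep = order[1:p + 1]                        # skip the trivial leading eigenvector
    return eigvecs[:, keep]                      # step 4: row i is the embedding y_i

# Step 1 on toy data: two groups of points on a line, Gaussian similarity.
x = np.array([0.0, 1.0, 2.0, 6.0, 7.0, 8.0])
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 8.0)
Y = spectral_embedding(W, p=1)
```

For this toy configuration the single embedding coordinate takes one sign on the first three points and the opposite sign on the last three, so the two groups are separated by the first nontrivial eigenvector.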


Nitrogen-containing compounds
Alcohols
Alkanes
Aldehydes
Ketones
Cyclo-ethers or furans
Alkenes
Halocarbons
Esters
Organosulfur compounds
Organosilicon compounds
Carboxylic Acids
Aromatic Rings
Miscellaneous

TABLE 2.1
Chemical grouping of 378 VOCs.

In what follows we focus on Laplacian Eigenmaps. They can be described under the above framework with an appropriate definition of the neighborhood matrix $W$. We will discuss their properties and their application to our data set in detail.

2.3.2.1 Distance between VOCs

Defining a neighborhood for each observation requires the notion of distance between two VOCs. Each VOC is presumed to be chemically related or connected to some other VOC from the set of 378 VOCs. This consideration led us to the following distance metric between two observations,

$$\|x_i - x_j\|^2 \overset{\mathrm{def}}{=} \sum_{k=1}^{378} \sum_{l=1}^{378} a_{kl} (x_{i,k} - x_{j,l})^2, \tag{2.5}$$

where $a_{kl}$ represents the connection strength between the $k$-th and the $l$-th VOC. All $a_{kl}$ form the VOC closeness matrix $A$ that is derived from the chemical properties of VOCs.

The closeness matrix $A$ was found by grouping all compounds into fourteen groups. Each compound's closeness factor with itself is one, and each compound's closeness to others in the same group is $0 < \lambda < 1$. A compound's closeness to compounds outside of its group is zero. The compounds are sorted into groups based upon their chemical properties. The proposed groups are shown in Table 2.1.

In each case where a compound may belong to more than one group, the group with the most active or significant component was chosen. Then, we define

$$a_{kl} = \begin{cases} 1 & \text{if } k = l, \\ \lambda & \text{if the } k\text{-th and } l\text{-th VOCs are in the same group}, \\ 0 & \text{otherwise}, \end{cases}$$

with $\lambda$ between 0 and 1. The parameter $\lambda$ can be interpreted as the effect of the group in calculating the distance. The closeness matrix for the 378 VOCs is shown in Figure 2.4. Notice that the VOCs are ordered by the retention time from chromatographic analysis of the breath collection tubes.
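As a small worked illustration of the closeness matrix and of the distance (2.5), the sketch below uses three hypothetical VOCs, the first two in the same group. The function names and the toy group labels are assumptions made for this example only.

```python
import numpy as np

def closeness_matrix(groups, lam):
    """a_kl = 1 if k == l, lam if VOCs k and l share a group, 0 otherwise."""
    g = np.asarray(groups)
    A = np.where(g[:, None] == g[None, :], lam, 0.0)
    np.fill_diagonal(A, 1.0)
    return A

def closeness_distance_sq(xi, xj, A):
    """Squared distance of (2.5): sum over k, l of a_kl * (x_{i,k} - x_{j,l})^2."""
    diff = xi[:, None] - xj[None, :]   # diff[k, l] = x_{i,k} - x_{j,l}
    return float(np.sum(A * diff ** 2))

# Hypothetical example: VOCs 0 and 1 share a group, VOC 2 is in another group.
groups = [0, 0, 1]
A = closeness_matrix(groups, lam=0.5)
xi = np.array([1.0, 0.0, 2.0])
xj = np.array([0.0, 0.0, 2.0])
d_half = closeness_distance_sq(xi, xj, A)                          # 1.5
d_zero = closeness_distance_sq(xi, xj, closeness_matrix(groups, 0.0))  # 1.0
```

With $\lambda = 0$ the closeness matrix is the identity and (2.5) reduces to the squared Euclidean distance; a positive $\lambda$ adds cross terms between VOCs of the same group.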

FIGURE 2.4
Closeness matrix with $\lambda = 0.5$; black represents strong affinity between two VOCs.

This grouping is justifiable from a chemical standpoint for several reasons. Most compounds in a group (e.g., alkanes, alcohols) behave chemically similarly. For those which do not behave similarly, like the nitrogen-containing compounds, their presence may indicate the utilization of similar biochemical substrates or pathways, which may themselves be symptoms of the disease. Finally, several members of each group may be in equilibrium with each other.

2.3.2.2 Laplacian Eigenmaps

The embedding provided by the Laplacian Eigenmaps preserves local geometric information optimally by mapping neighboring points as closely together as possible. Identifying neighboring points amounts to introducing an "adjacency" metric between the points. The adjacency matrix consists of entries $W_{ij}$ representing the closeness between the $i$th and $j$th observations as

$$W_{ij} = e^{-\|x_i - x_j\|^2 / t}, \tag{2.6}$$

where $t$ is a "diffusion" parameter. Alternatively, one can simply set $W_{ij} = 1$ if two nodes are connected or within $k$-nearest-neighborhood range.

A mapping from $x_1, \ldots, x_n$ to $y_1, \ldots, y_n$ is obtained by solving the following optimization problem,

$$\min_{\substack{y^T D y = 1 \\ y^T D \mathbf{1} = 0}} \sum_{i,j} (y_i - y_j)^2 W_{ij}, \tag{2.7}$$


where the diagonal matrix $D$ is defined as $D_{ii} = \sum_j W_{ij}$. We put a penalty proportional to the adjacency $W_{ij}$ if $x_i$ and $x_j$ are mapped apart. By spectral decomposition, equation 2.7 is simplified to the following generalized eigenvalue problem:

$$(D - W) f = \mu D f. \tag{2.8}$$

The $f$ can be viewed as optimal "cuts" in the geometric structure of the $x_i$. The matrix $D - W$ is called the graph Laplacian of $x_1, \ldots, x_n$. By taking the eigenvectors associated with the $l$ smallest nonzero eigenvalues, one gets

$$x_i \to (f_1(i), \ldots, f_l(i)). \tag{2.9}$$

The Laplacian Eigenmaps for VOCs are depicted in Figure 2.5. The first 3 eigenvectors separate the two groups visibly better than the principal components in Figure 2.3. We employ several classification techniques on the summaries in (2.9). The results are provided in the following section.
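A minimal NumPy sketch of (2.6)-(2.9) with 0/1 weights on a $k$-nearest-neighbor graph is given below. It assumes the graph is connected (so that zero is a simple eigenvalue) and solves the generalized problem (2.8) through the equivalent symmetric problem for $D^{-1/2}(D - W)D^{-1/2}$; the function names and the toy points are illustrative assumptions.

```python
import numpy as np

def knn_adjacency(dist_sq, k):
    """Symmetric 0/1 adjacency: W_ij = 1 if j is among the k nearest neighbors of i, or vice versa."""
    n = dist_sq.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        order = [j for j in np.argsort(dist_sq[i]) if j != i]
        W[i, order[:k]] = 1.0
    return np.maximum(W, W.T)

def laplacian_eigenmaps(W, n_components):
    """Solve (D - W) f = mu D f and return the smallest nonzero mu with their eigenvectors."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * (np.diag(d) - W) * d_inv_sqrt[None, :]
    mu, G = np.linalg.eigh(L_sym)          # ascending eigenvalues; mu[0] is ~0 for a connected graph
    F = d_inv_sqrt[:, None] * G            # map back: f = D^{-1/2} g
    return mu[1:n_components + 1], F[:, 1:n_components + 1]

# Toy data: 8 equally spaced points on a line, 2 nearest neighbors each.
x = np.arange(8.0)
dist_sq = (x[:, None] - x[None, :]) ** 2
W = knn_adjacency(dist_sq, k=2)
mu, F = laplacian_eigenmaps(W, n_components=2)
```

Row $i$ of `F` plays the role of $(f_1(i), \ldots, f_l(i))$ in (2.9). In the setting of this chapter, `dist_sq` would instead be built from the closeness-weighted distance (2.5).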

FIGURE 2.5
New coordinates from the Laplacian Eigenmaps for VOC data; the intrinsic geometry of the data can be detected. Panels: (a) 1st vs. 2nd component with $\lambda = 0$, 5 neighbors; (b) 1st vs. 3rd component with $\lambda = 0$, 5 neighbors; (c) 1st vs. 2nd component with $\lambda = 0.005$, 4 neighbors; (d) 2nd vs. 3rd component with $\lambda = 0.005$, 4 neighbors. Legend: Control, Case.

2.3.3 Classification, Comparison

The classification function $C$ can be approximated by a linear combination of the new transformed coordinates, since eigenfunctions of the Laplacian constitute an orthogonal basis for the Hilbert space $L^2$ defined on the manifold of data (Belkin et al., 2004). Four classification methods, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machines (SVM), and Least Square Estimate (LSE), have been applied to the VOC data in the transformed domain.

For each classification method, stratified random choices for training and testing were made 1000 times. At each test, for both transformations, eigenvectors were computed for the training set together with one test observation. In particular, the one test observation was treated as having an unknown label for the Laplacian Eigenmaps. An out-of-sample extension that avoids recomputing eigenvectors can be found in Bengio et al. (2004). Each stratified random training sample consists of 60% of the case observations and 60% of the control observations, thus leaving 40% of cases and 40% of controls for testing. Prediction for testing samples was done in a semi-supervised manner, treating training samples as those with known labels and testing samples as those with unknown labels.

For LSE, the classification function can be put in a closed form. Suppose a training set has $L$ pairs, $(x_1, y_1), \ldots, (x_L, y_L)$, and $U$ unknown observations $x_{L+1}, \ldots, x_{L+U}$, where $x_i \in \mathbb{R}^{378}$, $y_i \in \{-1, +1\}$. After the transformation we obtain $f^{(i)} \in \mathbb{R}^l$ corresponding to $x_i$:

$$(x_i, y_i) \mapsto (f^{(i)}, y_i), \quad i = 1, \ldots, L \tag{2.10}$$

$$x_i \mapsto f^{(i)}, \quad i = L+1, \ldots, L+U. \tag{2.11}$$

The classification function $C(x)$ is approximated by a linear combination of $l$ components as the new coordinates:

$$C(x) = \sum_{k=1}^{l} c_k f^{(x)}_k,$$

where $f^{(x)}$ is the transformation of $x$. Let $\mathbf{f} = [f^{(1)}\ f^{(2)}\ \ldots\ f^{(L)}]$ and $\mathbf{y} = [y_1\ \ldots\ y_L]$ (taken as a column vector) correspond only to training samples. Then, the $c_k$ are estimated by

$$\mathbf{c} = (\mathbf{f}\mathbf{f}^T)^{-1}\mathbf{f}\mathbf{y}$$

from the least square criterion. Since the $y_i$ are either $-1$ or $+1$, $\sum_{k=1}^{l} c_k f^{(x)}_k$ can be used as a predictor in the classification of $x$:

$$\hat{y} = \begin{cases} 1 & \text{if } \sum_{k=1}^{l} c_k f^{(x)}_k \geq 0, \\ -1 & \text{if } \sum_{k=1}^{l} c_k f^{(x)}_k < 0. \end{cases}$$
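The closed-form LSE classifier above can be written in a few lines. In this sketch the embedded training points are made-up numbers with $l = 2$, the labels are treated as a column vector, and the function names `lse_fit` and `lse_predict` are hypothetical.

```python
import numpy as np

def lse_fit(F_train, y_train):
    """Solve c = (f f^T)^{-1} f y, where the columns of F_train are the f^(i) (shape l x L)."""
    return np.linalg.solve(F_train @ F_train.T, F_train @ y_train)

def lse_predict(c, F_new):
    """Label +1 where sum_k c_k f_k >= 0 and -1 otherwise; columns of F_new are embedded points."""
    return np.where(c @ F_new >= 0, 1, -1)

# Made-up embedded training data: 4 subjects, l = 2 coordinates each.
F_train = np.array([[-2.0, -1.0, 1.0, 2.0],
                    [ 1.0, -1.0, 1.0, -1.0]])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
c = lse_fit(F_train, y_train)              # equals [2/3, 1/3] for this toy data

F_new = np.array([[3.0, -0.5],
                  [0.0, -1.0]])
y_hat = lse_predict(c, F_new)              # [1, -1]
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience; it computes the same coefficients as the formula for $\mathbf{c}$.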

Figure 2.6 illustrates decision boundaries for the PCA mapping and Laplacian Eigenmaps with LDA, QDA, and SVM. Notice that LDA produces a linear separation whereas QDA has a quadratic separation boundary based on the assumption that the variances in the two groups are different from each other. The SVM boundary varies according to its kernel. In Figure 2.6-(e), the radial basis kernel with width parameter 105, penalty parameter 100, and 5 neighbors was used, while in Figure 2.6-(f) the radial basis kernel with width 0.4, penalty 10, and 5 neighbors was applied.

FIGURE 2.6
Illustration of classifications with PCA and Laplacian Eigenmaps with $\lambda = 0.1$; points represent a random training set. Responses for a test set are estimated by the decision boundary inferred from the training set. Panels (1st vs. 2nd component): (a) LDA with PCA; (b) LDA with Laplacian; (c) QDA with PCA; (d) QDA with Laplacian; (e) SVM with PCA; (f) SVM with Laplacian. Legend: case, control.

Table 2.2 shows the average ratio of correct classifications for each method. As is evident, the average ratios for the Laplacian Eigenmaps exceed those for PCA. The classification rate as a function of the parameter $\lambda$ in the closeness matrix is given in Figure 2.7, which supports the use of the closeness matrix in the classification.


Method    PCA       Laplacian Eigenmaps
LDA       0.6126    0.7668
QDA       0.6030    0.7648
SVM       0.7000    0.7346
LSE       0.6224    0.7658

TABLE 2.2
Average correct classification ratio for several classification methods based on 1000 random partitions of data to training and validation sets.
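The repeated stratified partitioning behind Table 2.2 can be sketched as below. The rounding of the 60% fraction, the callable `fit_predict`, and all function names are assumptions of this sketch; in particular, it does not reproduce the recomputation of eigenvectors with each test observation described in the text.

```python
import numpy as np

def stratified_split(y, train_frac, rng):
    """Pick train_frac of each class for training; the remaining indices form the test set."""
    train = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        n_train = int(round(train_frac * idx.size))   # rounding rule is an assumption
        train.extend(idx[:n_train])
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test

def repeated_accuracy(fit_predict, X, y, n_repeats=1000, train_frac=0.6, seed=0):
    """Average correct classification ratio over repeated stratified partitions."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_repeats):
        tr, te = stratified_split(y, train_frac, rng)
        y_hat = fit_predict(X[:, tr], y[tr], X[:, te])  # hypothetical classifier callable
        rates.append(np.mean(y_hat == y[te]))
    return float(np.mean(rates))

# Class sizes as in the study: 17 cases (+1) and 24 controls (-1).
y = np.array([1] * 17 + [-1] * 24)
train, test = stratified_split(y, 0.6, np.random.default_rng(0))
```

With these class sizes and this rounding, each partition trains on 10 cases and 14 controls and tests on 7 cases and 10 controls.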

FIGURE 2.7
Classification ratio with 2 Laplacian components as a function of $\lambda$ in the closeness matrix, for LDA, QDA, SVM, and LSE. The ratio is maximized at $\lambda = 0.085$.

2.4 Conclusion and Discussion

This study demonstrated a significant link between the incidence of breast cancer and exhaled VOCs. As a consequence, predictive screening is possible with this method. While the sensitivity and specificity of this breath test may be low, it is inexpensive, noninvasive, and independent of other modalities such as mammography and breast MRA. As a "weak classifier", the VOC analysis can be combined with other independent classifiers to increase the confidence level of the diagnostic decision. The acceptance of breath volatiles as a viable clinical screening technique has been slow due to a number of issues:

1. largely unsuccessful attempts to find a single or small number of low level VOCs as biomarker compounds for individual diseases;

2. lack of understanding of the physiological meaning of the detected volatiles; and

3. results and methods cannot be easily converted into a clinically useful form.

With a larger sample size, we plan to focus attention on those compounds which are found to be the strongest indicators of breast cancer, or on the relationships found between compounds. This will help to better understand the specific chemical and metabolic changes involved with the disease. While this method might be made more accurate using laboratory-based equipment and techniques such as those employed by Phillips et al. (1997), the instruments used in this study are much more common at medical facilities, and so the method described here has the potential to be more practicable for the diagnosis of breast cancer.

2.4.1 Chemistry considerations

Several assumptions and steps which were taken in the analysis of the data must be justified from a chemical perspective.

The proposed methodology operates under the assumption that a few linear or nonlinear combinations of the compounds dominate the predictive power of data to diagnose cancer. This is inherent, for example, in PCA, where the dimension of the data is reduced from 378 to two. While the inherent mathematical/chemical space of disease-related VOCs may have many dimensions, two optimized dimensions can model this link well. Chemically, this may be because only a few reactions dominate the influence that the disease has on the presence of each compound. This suggests that a similar number of nonlinear combinations can effectively model the entire system.

Chemical "closeness" was defined in a rather simplistic way, without full use of biochemical knowledge, and used in calculations. With multiple reactions and reactants, many co-products and co-reactants would exist, and the dominant reactions would most closely control these relationships. Chemical closeness is an effective way to model this behavior, and can be a useful tool to augment the computation steps. However, if any two compounds are found to have little correlation, then the "closeness" can be set to zero, and that relationship will not contribute to the calculation step. In this way, the inclusion of the closeness step in the calculation does not preclude "non-closeness" interactions.

Acknowledgment. The research of Lee, Abouelnasr and Vidakovic was supported by NSF Grants XXXX and XXXX at Georgia Institute of Technology. [ADD YOUR GRANTS]

References

ACS (2006). What Are the Key Statistics for Breast Cancer? American Cancer Society, Detailed Guide: Cancer, Retrieved on 2007-04-26.

Henderson, C. (1991). Breast Cancer, New York: McGraw-Hill, 12th edition.

EPA (2007). Indoor Air Quality: Organic Gases (Volatile Organic Compounds - VOCs), http://www.epa.gov/iaq/voc.html, Retrieved on 2008-05-01.

Lechner, M. and Rieder, J. (2007). Mass spectrometric profiling of low-molecular-weight volatile compounds: diagnostic potential and latest applications, Current Medical Chemistry, 14:987-995.

Manini, P., De Palma, G., Andreoli, R., Poli, D., Mozzoni, P., Folesani, G., Mutti, A., and Apostoli, P. (2006). Environmental and biological monitoring of benzene exposure in a cohort of Italian taxi drivers, Toxicology Letters, 167:142-151.

Cao, W. and Duan, Y. (2006). Breath Analysis: Potential for Clinical Diagnosis and Exposure Assessment, Clinical Chemistry, 52:800-811.

Poli, D., Carbognani, P., Corradi, M., Goldoni, M., Acampa, O., Balbi, B., Bianchi, L., Rusca, M., and Mutti, A. (2005). Exhaled volatile organic compounds in patients with non-small cell lung cancer: cross sectional and nested short-term follow-up study, Respiratory Research, 6:71.

Pickel, D., Manucy, G., Walker, D., Hall, S., and Walker, J. (2004). Evidence for canine olfactory detection of melanoma, Applied Animal Behaviour Science, 89-107.

Dobson, R. (2003). Dogs can sniff out first signs of men's cancer, Sunday Times, 27, 5.

Willis, C., Church, S., Guest, C., et al. (2004). Olfactory detection of human bladder cancer by dogs: proof of principle study, BMJ, 329, 712.

McCulloch, M., Jezierski, T., Broffman, M., Hubbard, A., Turner, K., and Janecki, T. (2006). Diagnostic accuracy of canine scent detection in early- and late-stage lung and breast cancers, Integrative Cancer Therapies, 5, 30-39.

Gordon, R., Shatz, C., Myers, L., et al. (2008). The use of Canines in the Detection of Human Cancers, The Journal of Alternative and Complementary Medicine, 14, 61-67.

Jolliffe, I.T. (2002). Principal Component Analysis, Series: Springer Series in Statistics, Springer, NY, 2nd ed.

Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding, Science, Vol. 290, No. 5500, 2323-2326, December 2000.

Tenenbaum, J., Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction, Science, Vol. 290, No. 5500, December 2000.

Belkin, M. and Niyogi, P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, Vol. 15, No. 6, 1373-1396, June 2003.

Belkin, M. and Niyogi, P. (2004). Semi-supervised Learning on Riemannian Manifolds, Machine Learning, Vol. 56, Invited: Special Issue on Clustering, 209-239.

Donoho, D. and Grimes, C. (2003). Hessian Eigenmaps: new tools for nonlinear dimensionality reduction, Proceedings of National Academy of Science, 5591-5596.

Lafon, S. (2004). Diffusion maps and geometric harmonics, Yale University, Ph.D. dissertation, May 2004.

Bengio, Y., Delalleau, O., Roux, N. L., Paiement, J., Vincent, P., and Ouimet, M. (2004). Learning eigenfunctions links spectral embedding and kernel PCA, Neural Computation, Vol. 16, No. 10, 2197-2219.

Phillips, M. (1997). Method for the collection and assay of volatile organic compounds in breath, Analytical Biochemistry, 247:272-278.

Sorensen, M., Skov, H., Autrup, H., Hertel, O., and Loft, S. (2003). Urban benzene exposure and oxidative DNA damage: influence of genetic polymorphisms in metabolism genes, The Science of the Total Environment, 309:69-80.