
Unsupervised Learning in Hyperspectral Classifiers Using Hidden Markov Models

Vikram Jayaram and Bryan Usevitch

Dept. of Electrical and Computer Engineering, The University of Texas at El Paso

500 W. University Ave., El Paso, TX 79968-0523

ABSTRACT

Hyperspectral data represents a mixture of several component spectra from many classifiable sources. Knowledge of the contributions of the underlying sources to the recorded spectra is valuable in many remote sensing applications. Traditional hyperspectral classification and segmentation algorithms have used Markov random field (MRF) based estimation in recent investigations. Although this method reflects plausible local, spatial correlation in the true scene, it is limited to using supervised learning schemes for parameter estimation. Moreover, expectation-maximization (EM) for the hidden MRF is considerably more difficult to realize due to the absence of a closed-form formulation. In this paper, we propose a hidden Markov model (HMM) based probability density function (PDF) classifier for a reduced dimensional feature space. Our approach uses an unsupervised learning scheme for maximum-likelihood (ML) parameter estimation that combines both model selection and estimation in a single algorithm. The proposed method accurately models and synthesizes the approximate observations of the true data in a reduced dimensional feature space.

Keywords: Unsupervised Learning, hidden Markov model, expectation-maximization, maximum-likelihood.

1. INTRODUCTION

Hyperspectral images exploit the fact that each material radiates a different amount of electromagnetic energy throughout the spectrum. This unique characteristic of the material is commonly known as its spectral signature, and we can read this signature from images obtained by airborne or spaceborne detectors. The spectral coverage of these sensors ranges from the visible region (0.4-0.7 μm) through the near infrared (about 2.4 μm) in hundreds of narrow contiguous bands about 10 nm wide.1

Classification of hyperspectral imagery (HSI) data is a challenging problem for two main reasons. First, due to the limited spatial resolution of HSI sensors and/or the distance of the observed scene, the images invariably contain pixels composed of several materials. It is desirable to resolve the contributions of the constituents from the observed image without relying on high spatial resolution images. Remote sensing cameras have been designed to capture a wide spectral range, motivating the use of post-processing techniques to distinguish materials via their spectral signatures. Secondly, the available training data for most pattern recognition problems in HSI processing is severely inadequate. Under the framework of statistical classifiers, Hughes2 was able to demonstrate the impact of this problem on a theoretical basis. Concerning the second problem, feature extraction and optimal band selection are the methods most commonly used for finding useful features in high-dimensional data.3, 4 On the other hand, reduced dimensionality algorithms suffer from a theoretical loss of performance. This performance loss occurs due to the reduction of data to features, and to further approximating the theoretical features by PDFs. Figure 1 hypothetically illustrates the different types of feature representation. However, it is beneficial to understand the trade-off between the need to retain as much information as possible (increased feature space) and the need to obtain better PDF estimates (reduced feature dimensionality) for a wide array of HSI applications.

In statistical pattern recognition, finite mixtures allow a probabilistic approach to unsupervised learning (clustering).5, 6 Our problem of interest is to introduce a new unsupervised algorithm for learning a finite mixture model from multivariate HSI data using an HMM.

Further author information: (Send correspondence to Vikram Jayaram)
V. Jayaram: E-mail: [email protected], Telephone: 1 915 747 5869

Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XV, edited by Sylvia S. Shen, Paul E. Lewis, Proc. of SPIE Vol. 7334, 73340F · © 2009 SPIE

CCC code: 0277-786X/09/$18 · doi: 10.1117/12.820325


This HMM approach estimates the proportion of each HSI class present in a finite mixture model by incorporating both the estimation step and model selection in a single algorithm. The model selection step that was previously introduced in7 automatically assigns mixture components for a GM. Our technique utilizes a reduced dimensional feature space to model and synthesize the approximate observations of the true HSI data. To motivate the use of a finite mixture model for HSI, consider a random variable X; finite mixture models decompose its PDF f(x) into a sum of K class PDFs. A general density function f(x) is considered semiparametric, since it may be decomposed into K components. Let f_k(x) denote the kth class PDF. The finite mixture model with K components expands as

f(x) = \sum_{k=1}^{K} a_k f_k(x), \qquad (1)

where a_k denotes the proportion of the kth class. The proportion a_k may be interpreted as the prior probability of observing a sample from class k. Furthermore, the prior probabilities a_k for each distribution must be nonnegative and sum to one, or

a_k \geq 0 \quad \text{for } k = 1, \cdots, K, \qquad (2)

where

\sum_{k=1}^{K} a_k = 1. \qquad (3)
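As a concrete illustration of Eqs. (1)-(3), the following minimal sketch evaluates a K-component mixture density. It is not from the paper: the one-dimensional Gaussian class PDFs and all variable names are our own illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, weights, means, stds):
    """Evaluate f(x) = sum_k a_k f_k(x) for a 1D Gaussian mixture sketch."""
    weights = np.asarray(weights, dtype=float)
    # Enforce Eqs. (2)-(3): nonnegative proportions that sum to one.
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return sum(a * norm.pdf(x, mu, sd)
               for a, mu, sd in zip(weights, means, stds))

# Example: a hypothetical three-class mixture evaluated on a grid.
x = np.linspace(-5.0, 10.0, 200)
f = mixture_pdf(x, weights=[0.5, 0.3, 0.2], means=[0.0, 3.0, 6.0], stds=[1.0, 0.5, 2.0])
```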

Since the underlying probability densities of the mixture are initially unknown, one must estimate the densities from samples of each class iteratively. Thus, we formally extend the PDF based classification approach to the analysis of HSI data (dependent data). In our approach we adopt a stationary Markovian model, a powerful stochastic model that can closely approximate many naturally occurring phenomena. One famous example is the approximation of human speech.8 While a very powerful stochastic model, a single HMM cannot easily act as a good classifier between a wide variety of signal classes. Instead, it is best to design HMMs specifically for each signal type and feature type.

The rest of the paper is organized as follows. In Section II, we mention some examples of previous work in the literature related to our experiments. In Section III, we review the basics of the HMM formulation, the problem of mixture learning, and density estimation. Section IV briefly describes the minimum noise fraction (MNF) transform. Section V reports experimental results, and Section VI ends the paper by presenting some concluding remarks.

2. EARLIER WORK

A few investigations have introduced HMMs into HSI processing in recent times. In Du et al.,9 hidden Markov model information divergence (HMMID) was introduced as a discriminatory measure among target spectra. Comparisons were made to deterministic distance metrics such as the spectral angle mapper (SAM) and minimum Euclidean distance (MED). More recently, Bali et al.10 address the problem of joint segmentation of hyperspectral images in the Bayesian framework. This approach is based on an HMM of the images with common segmentation, or equivalently with common hidden classification label variables, modeled by a Potts Markov random field. In a related work, Li et al.11 proposed a two-dimensional HMM for image classification. This method provided a structured way to incorporate context information into classification. All the above mentioned approaches come under the domain of supervised techniques. The modest development of unsupervised classification techniques in the HSI regime has been the primary source of motivation for the proposed work.

Multidimensional data such as HSI can be modeled by a multidimensional Gaussian mixture (GM).12

Normally, a GM in the form of the PDF for z \in R^P is given by

p(z) = \sum_{i=1}^{L} \alpha_i \, \mathcal{N}(z; \mu_i, \Sigma_i),


Figure 1. An illustration of image and feature space representation.

where

\mathcal{N}(z; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (z-\mu_i)' \Sigma_i^{-1} (z-\mu_i) \right\}.

Here L is the number of mixture components and P the number of spectral channels (bands). The GM parameters are denoted by \lambda = \{\alpha_i, \mu_i, \Sigma_i\}. The parameters of the GM are estimated using maximum likelihood by means of the EM algorithm. In7 we show the structural learning of a GM that is employed to model and classify HSI data. This methodology utilizes a fast and automatic assignment of mixture components to model PDFs. Later on, we employ the same mechanism to estimate parameters and to model the state PDFs of an HMM.

Consider data that consist of K samples of dimension P; it is not necessary or even desirable to group all the data together into a single K \times P-dimensional sample. In the simplest case, all K samples are independent and we may regard them as samples of the same RV. For most practical cases, they are not independent. The Markovian principle assumes consecutive samples are statistically independent when conditioned on knowing the samples that preceded them. This leads to an elegant HMM solution which employs a set of M PDFs of dimension P. The HMM regards each of the K samples as having originated from one of the M possible states, and there is a distinct probability that the underlying model “jumps” from one state to another. In our approach, the HMM uses a GM to model each state PDF.5 We have focused on an unsupervised learning algorithm for ML parameter estimation which in turn is used as a reduced dimensional PDF based classifier.
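To make this generative mechanism concrete, here is a minimal sketch that draws a sequence of observations from an HMM whose state PDFs are Gaussian mixtures. It is illustrative only; the function name, array shapes, and parameter layout are our own assumptions, not the paper's code.

```python
import numpy as np

def sample_hmm_gm(T, pi, A, c, mu, U, seed=0):
    """Draw T observations from an HMM with Gaussian-mixture state PDFs.

    pi : (N,) initial state probabilities
    A  : (N, N) state transition matrix, rows summing to one
    c  : (N, M) mixture weights per state
    mu : (N, M, P) component means;  U : (N, M, P, P) component covariances
    """
    rng = np.random.default_rng(seed)
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)
    for _ in range(T):
        m = rng.choice(c.shape[1], p=c[q])           # pick a mixture component
        obs.append(rng.multivariate_normal(mu[q, m], U[q, m]))
        states.append(q)
        q = rng.choice(len(pi), p=A[q])              # Markov "jump" to the next state
    return np.array(states), np.array(obs)
```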

3. UNSUPERVISED LEARNING OF MARKOV SOURCES

In this section, we give the general formulation of the HMM and the re-estimation of the HMM parameters, the observation PDFs, and the GM parameters. Following the notational approach of Rabiner,8 consider T observation times. At each time 1 \leq t \leq T, there is a discrete state variable q_t which takes one of N values q_t \in \{S_1, S_2, \cdots, S_N\}. According to the Markovian assumption, the probability distribution of q_{t+1} depends only on the value of q_t. This is described compactly by a state transition probability matrix A whose elements a_{ij} represent the probability that q_{t+1} equals S_j given that q_t equals S_i. The initial state probabilities are denoted \pi_i, the probability that q_1 equals S_i. It is a hidden Markov model because the states q_t are hidden from view; that is, we cannot observe them. But we can observe the random data O_t, which is generated according to a PDF dependent on the state at time t, as illustrated in Figure 2. We denote the PDF of O_t under state j as b_j(O_t). The complete set of model parameters that define the HMM is \lambda = \{\pi_i, a_{ij}, b_j\}.

The EM algorithm, also known as the Baum-Welch algorithm, calculates new estimates \bar{\lambda} given an observation sequence O = O_1, O_2, O_3, \cdots, O_T and a previous estimate of \lambda. The algorithm is composed of two parts: the forward/backward procedure and the re-estimation of parameters.

Using Gaussian Mixtures for bj(Ot)

We model the PDFs bj(Ot) as GM,


Figure 2. A hidden Markov model. The observer makes observations whose PDF depends on the hidden state.

b_j(O) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(O; \mu_{jm}, U_{jm}), \quad 1 \leq j \leq N,

where

\mathcal{N}(O; \mu_{jm}, U_{jm}) = \frac{1}{(2\pi)^{P/2} |U_{jm}|^{1/2}} \exp\left\{ -\frac{1}{2} (O-\mu_{jm})' U_{jm}^{-1} (O-\mu_{jm}) \right\},

and P is the dimension of O. We will refer to these GM parameters collectively as b_j \triangleq \{c_{jm}, \mu_{jm}, U_{jm}\}.

Forward/Backward Procedure

We wish to compute the probability of the observation sequence O = O_1, O_2, \cdots, O_T given the model \lambda = \{\pi_i, a_{ij}, b_j\}. The forward procedure for p(O|\lambda) is

• Initialization:

\alpha_1(i) = \pi_i b_i(O_1), \quad 1 \leq i \leq N

• Induction:

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(O_{t+1}), \quad 1 \leq t \leq T-1, \quad 1 \leq j \leq N

• Termination:

p(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)

The backward procedure is

• Initialization:

\beta_T(i) = 1, \quad 1 \leq i \leq N


• Induction:

\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(O_{t+1}) \beta_{t+1}(j), \quad t = T-1, T-2, \cdots, 1, \quad 1 \leq i \leq N
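A vectorized sketch of both recursions, under the assumption that the state likelihoods b_j(O_t) have been precomputed into a T \times N matrix, is given below (unscaled; for long sequences the scaled or log-domain variants8 are needed to avoid underflow):

```python
import numpy as np

def forward_backward(B, pi, A):
    """Unscaled forward/backward recursions.

    B  : (T, N) matrix with B[t, j] = b_j(O_t)
    pi : (N,) initial state probabilities;  A : (N, N) transition matrix
    Returns alpha, beta, and p(O | lambda).
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[0]                         # alpha_1(i) = pi_i b_i(O_1)
    for t in range(T - 1):                       # forward induction
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]
    beta[T - 1] = 1.0                            # beta_T(i) = 1
    for t in range(T - 2, -1, -1):               # backward induction
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()          # termination
```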

Re-estimation of HMM parameters

The re-estimation procedure calculates new estimates of \lambda given the observation sequence O = O_1, O_2, O_3, \cdots, O_T. We first define

\xi_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}

and

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j).

The updated state priors are

\bar{\pi}_i = \gamma_1(i).

The updated state transition matrix is

\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.
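These updates translate directly into code. The sketch below reuses alpha, beta, and B from the forward/backward sketch above; it is our own illustrative rendering of the update equations, not the authors' implementation.

```python
import numpy as np

def reestimate_transitions(alpha, beta, A, B):
    """Update the state priors and transition matrix from xi and gamma."""
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # num[i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j)
        num = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()                  # normalize over all (i, j)
    gamma = xi.sum(axis=2)                       # gamma_t(i) = sum_j xi_t(i, j)
    pi_new = gamma[0]                            # pi_i = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
    return pi_new, A_new
```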

Re-estimation of Observation PDFs13

In order to update the observation PDFs, it is necessary to maximize

Q_j = \sum_{t=1}^{T} w_{tj} \log b_j(O_t)

over the PDF b_j, where

w_{tj} = \frac{\alpha_t(j) \beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}.

This is a “weighted” maximum-likelihood (ML) procedure, since if w_{tj} = c_j the results are strict ML estimates. The weights w_{tj} are interpreted as the probability that the Markov chain is in state j at time t.

Re-estimation of Gaussian Mixture Parameters

In our experiments, the b_j(O) are modeled as GMs by simply determining the weighted ML estimates of the GM parameters. This would require iterating to convergence at each step. A more global approach is possible if the mixture component assignments are regarded as “missing data”.13 The result is that the quantity

Q_j = \sum_{t=1}^{T} \sum_{m=1}^{M} \gamma_t(j,m) \log b_j(O_t)

is maximized, where


\gamma_t(j,m) = w_{tj} \left[ \frac{c_{jm} \mathcal{N}(O_t; \mu_{jm}, U_{jm})}{\sum_{k=1}^{M} c_{jk} \mathcal{N}(O_t; \mu_{jk}, U_{jk})} \right].

Here, the weights \gamma_t(j,m) are interpreted as the probability that the Markov chain is in state j and the observation is from mixture component m at time t. The resulting update equations for c_{jm}, \mu_{jm}, and U_{jm} are computed as follows:

\bar{c}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j,m)}{\sum_{t=1}^{T} \sum_{l=1}^{M} \gamma_t(j,l)}.

The above expression is similar to the re-estimation of a GM.5 This means that the algorithms designed for GMs are applicable for updating the state PDFs of the HMM. Therefore,

\bar{\mu}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j,m) O_t}{\sum_{t=1}^{T} \gamma_t(j,m)},

\bar{U}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j,m) (O_t - \bar{\mu}_{jm})(O_t - \bar{\mu}_{jm})'}{\sum_{t=1}^{T} \gamma_t(j,m)}.
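A sketch of these three updates for a single state j follows; the responsibilities \gamma_t(j,m) are assumed to be stored in a T \times N \times M array, a layout of our own choosing rather than anything specified in the paper.

```python
import numpy as np

def reestimate_gm(O, gamma, j):
    """Weighted updates of (c_jm, mu_jm, U_jm) for state j.

    O     : (T, P) observation sequence
    gamma : (T, N, M) responsibilities gamma_t(j, m) defined above
    """
    g = gamma[:, j, :]                            # (T, M) weights for state j
    c_new = g.sum(axis=0) / g.sum()               # mixture proportions
    mu_new = (g.T @ O) / g.sum(axis=0)[:, None]   # weighted means
    M, P = g.shape[1], O.shape[1]
    U_new = np.empty((M, P, P))
    for m in range(M):
        d = O - mu_new[m]
        # Weighted outer-product sum, normalized by the total weight.
        U_new[m] = (g[:, m, None] * d).T @ d / g[:, m].sum()
    return c_new, mu_new, U_new
```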

4. MINIMUM NOISE FRACTION TRANSFORM

Before we begin the section on experiments, we shall define the minimum noise fraction (MNF) transform, since we use it to obtain a 2D feature plot of the true data as shown in Figure 3 (right). The MNF transformation is a highly useful spectral processing tool in HSI analysis.14 It is used to determine the inherent dimensionality of image data, to segregate noise in the data, and to reduce the computational requirements for subsequent processing. This transform is essentially two cascaded principal components transformations. The first transformation, based on an estimated noise covariance matrix, decorrelates and rescales the noise in the data. This first step results in transformed data in which the noise has unit variance and no band-to-band correlations. The second step is a standard principal components transformation of the noise-whitened data. For the purposes of further spectral processing, the inherent dimensionality of the data is determined by examination of the final eigenvalues and the associated images. The data space can be divided into two parts: one part associated with large eigenvalues and coherent eigenimages, and a complementary part with near-unity eigenvalues and noise-dominated images. By using only the coherent portions, the noise is separated from the data; thus the image bands are ranked by signal-to-noise ratio (SNR). A sketch of the two cascaded steps follows.
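This is a bare-bones rendering of the two cascaded principal components transformations; how the noise covariance is estimated (e.g., from shift differences, as ENVI does) is left as an assumption, and the function name and array shapes are illustrative.

```python
import numpy as np

def mnf(X, noise_cov):
    """Minimum noise fraction as two cascaded PCAs (sketch).

    X         : (num_pixels, P) HSI data matrix
    noise_cov : (P, P) estimated noise covariance
    Returns the transformed data with bands ordered by decreasing SNR.
    """
    # Step 1: decorrelate and rescale the noise (noise whitening).
    evals, evecs = np.linalg.eigh(noise_cov)
    W = evecs / np.sqrt(evals)                   # columns scaled by 1/sqrt(eigenvalue)
    Xw = (X - X.mean(axis=0)) @ W
    # Step 2: standard PCA of the noise-whitened data.
    _, _, Vt = np.linalg.svd(Xw, full_matrices=False)
    return Xw @ Vt.T
```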

5. EXPERIMENTS

The remote sensing data sets used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral bands with wavelengths corresponding to 0.4-2.5 μm. AVIRIS is flown all across the US, Canada, and Europe. Figure 3 shows the data sets used in our experiments, which belong to an Indian Pines scene in northwest Indiana. The spatial bands of this scene are of size 169 × 169 pixels. Since HSI imagery is highly correlated in the spectral direction, the MNF rotation is an obvious choice for decorrelation among the bands. This also yields a 2D “scatter” plot of the first two MNF components of the data, as shown in Figure 4. The scatter plots used in the paper are similar to a PDF marginalized onto a 2D plane, and marginalization is an easy way to visualize the state PDFs of the HMM. To illustrate this visualization scheme, let z = [z_1, z_2, z_3, z_4]. For example, to visualize on the (z_2, z_4) plane, we would need to compute

p(z_2, z_4) = \int_{z_1} \int_{z_3} p(z_1, z_2, z_3, z_4) \, dz_1 \, dz_3.
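On gridded PDF estimates (such as the state PDFs plotted later), this integral reduces to summing out the unwanted axes. A small sketch, assuming the PDF has been evaluated on a regular 4D grid:

```python
import numpy as np

def marginal_z2_z4(p, dz1, dz3):
    """Marginalize a gridded p(z1, z2, z3, z4) onto the (z2, z4) plane.

    p        : 4D array of PDF values on a regular grid (axes z1, z2, z3, z4)
    dz1, dz3 : grid spacings of the summed-out axes
    """
    # Discrete stand-in for integrating over z1 and z3.
    return p.sum(axis=(0, 2)) * dz1 * dz3
```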


Figure 3. (Left) Composite image of the Indian Pines scene. (Right) Composite image of the MNF transformed scene.

This utility is very useful when visualizing high-dimensional PDFs. On the basis of an initial analysis using ISODATA and K-means unsupervised classifiers, it was found that the scene consisted of 3 prominent mixture classes. Therefore, we begin the training by considering a tri-state (corresponding to the 3 mixture classes identified) uniform state transition matrix A and prior probability vector π to initialize the HMM parameters, as in the sketch below. The PDF of the feature vector in each state is approximated by Gaussian mixtures. The automatic learning and initialization of the Gaussian mixtures are dealt with explicitly in our earlier work.7 The algorithm outputs the total log likelihood at each iteration.
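The uniform initialization is trivial to write down; a sketch (with our own variable names) is:

```python
import numpy as np

N = 3                              # tri-state HMM from the initial ISODATA/K-means analysis
A0 = np.full((N, N), 1.0 / N)      # uniform state transition matrix
pi0 = np.full(N, 1.0 / N)          # uniform prior state probabilities
```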

Training an HMM is an iterative process that seeks to maximize the probability that the HMM accounts for the example sequences. However, there is a chance of running into a “local maximum” problem: the model, though converged to some locally optimal choice of parameters, is not guaranteed to be the best possible model. In an attempt to avoid this pitfall, we use a simulated annealing procedure alongside training. This step is performed by expanding the covariance matrices of the PDF estimates and by pushing the state transition matrix and prior state probabilities closer to “uniform”. We attempt to escape a “bad” stationary point by re-running the above sequence whenever one is found. The PDF plots of the three state PDFs after convergence are shown in Figures 5, 6 and 7. In our experiments (both the modeling and synthesis stages) we use the Viterbi algorithm8 to estimate the most likely state sequence; a sketch is given below. A few outliers are also observed in one or more state PDFs. Now that we have mixtures modeled by their corresponding state PDFs, we would like to test the model by generating synthetic observations. In Figure 8 we synthesize 100 observations. We clearly notice that the synthetic observations closely approximate the true data observations. This result is also exemplified in Figure 9, where we compare the true states of the data with the estimated states of the synthetic observations. Similarly, in Figures 10 and 11 we show instances that compare 300 and 600 synthetic observations to the true data. These comparisons show that the underlying mixture densities were adequately modeled using the HMM.
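For reference, a log-domain sketch of the Viterbi decoder (standard textbook form,8 not the authors' code) that recovers the most likely state sequence:

```python
import numpy as np

def viterbi(B, pi, A):
    """Most likely state sequence, computed in log space for stability.

    B : (T, N) state likelihoods b_j(O_t);  pi, A as before.
    """
    T, N = B.shape
    logB, logA = np.log(B), np.log(A)
    delta = np.log(pi) + logB[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA           # scores[i, j] for transition i -> j
        psi[t] = scores.argmax(axis=0)           # best predecessor for each state j
        delta = scores.max(axis=0) + logB[t]
    q = np.empty(T, dtype=int)
    q[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):               # backtrack
        q[t] = psi[t + 1, q[t + 1]]
    return q
```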

6. CONCLUSIONS

In this paper, we proposed the use of a hidden Markov model that uses structural learning for approximating underlying mixture densities. Algorithm tests were carried out using real hyperspectral data consisting of a scene from the Indian Pines of northwest Indiana. In our experiments, we utilized only the first two components of the MNF transformed bands to ensure feature learning in a reduced representation of the data. We show that mixture learning for multivariate Gaussians is very similar to learning HMM parameters. In fact, unsupervised learning of the GM parameters for each class is seamlessly integrated to model the state PDFs of an HMM in a single algorithm. This technique could be applied to any type of parametric mixture model that utilizes the EM algorithm. Our experiments show that the proposed method models and synthesizes well the observations of the HSI data in a reduced dimensional feature space. This technique can be considered a new paradigm of reduced dimensional classifiers for processing HSI data.


Figure 4. (Left) 2D scatter plot of MNF transformed Band 1 vs. Band 2. (Right) 2D histogram of MNF bands 1 and 2.

Figure 5. (Top) 2D scatter plot of true data. (Bottom) PDF of State 1 after convergence. (Axes: MNF1 vs. MNF2.)


Figure 6. (Top) 2D scatter plot of true data. (Bottom) PDF of State 2 after convergence. (Axes: MNF1 vs. MNF2.)

Figure 7. (Top) 2D scatter plot of true data. (Bottom) PDF of State 3 after convergence. (Axes: MNF1 vs. MNF2.)


Figure 8. Comparison of true data vs. 100 synthetic observations (axes: MNF Band 1 vs. MNF Band 2).

Figure 9. Comparison of true states vs. estimated states from synthetic observations (state index vs. sample number).


Figure 10. Comparison of true data vs. 300 synthetic observations (axes: MNF Band 1 vs. MNF Band 2).

Figure 11. Comparison of true data vs. 600 synthetic observations (axes: MNF Band 1 vs. MNF Band 2).


ACKNOWLEDGMENTS

We would like to thank the Department of Geological Sciences at UTEP for providing access to the ENVI software, and LARS, Purdue University, for making the HSI data15 available. This work was supported by a NASA Earth System Science (ESS) doctoral fellowship at the University of Texas at El Paso.

REFERENCES

[1] Schott, J. R., [Remote Sensing: The Image Chain Approach], Oxford University Press.
[2] Hughes, G. F., “On the mean accuracy of statistical pattern recognizers,” IEEE Transactions on Information Theory 14, 55–63 (1968).
[3] Shaw, G. and Manolakis, D., “Signal processing for hyperspectral image exploitation,” IEEE Signal Processing Magazine 19, 12–16 (2002).
[4] Keshava, N., “Distance metrics and band selection in hyperspectral processing with applications to material identification and spectral libraries,” IEEE Transactions on Geoscience and Remote Sensing 42, No. 7, 1552–1565 (July 2004).
[5] McLachlan, G. and Peel, D., [Finite Mixture Models], Wiley Series in Probability and Statistics, New York, NY, second ed. (2000).
[6] Figueiredo, M. A. T. and Jain, A. K., “Unsupervised learning of finite mixture models,” IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 381–396 (2002).
[7] Jayaram, V. and Usevitch, B., “Dynamic mixing kernels in Gaussian mixture classifier for hyperspectral classification,” in [Mathematics of Data/Image Pattern Recognition, Compression, and Encryption with Applications XI, Proceedings of the SPIE], 70750L–70750L–8 (2008).
[8] Rabiner, L. R., “A tutorial on hidden Markov models and selected applications in speech recognition,” in [Proceedings of the IEEE], 257–286 (1989).
[9] Du, Q. and Chang, C.-I., “A hidden Markov model approach to spectral analysis for hyperspectral imagery,” Optical Engineering 40, No. 10, 2277–2284 (2001).
[10] Bali, N. and Mohammad-Djafari, A., “Bayesian approach with hidden Markov modeling and mean field approximation for hyperspectral data analysis,” IEEE Transactions on Image Processing 17, No. 2, 217–225 (2008).
[11] Li, J., Najmi, A., and Gray, R. M., “Image classification by a two-dimensional hidden Markov model,” IEEE Trans. Signal Processing 48, 517–533 (2000).
[12] Marden, D. B. and Manolakis, D. G., “Modeling hyperspectral imaging data,” in [Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery IX, edited by Shen, Sylvia S. and Lewis, Paul E., Proceedings of the SPIE], 253–262 (2003).
[13] Juang, B. H., “Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains,” AT&T Technical Journal, 1235–1249 (1985).
[14] Green, A. A., Berman, M., Switzer, P., and Craig, M. D., “A transformation for ordering multispectral data in terms of image quality with implications for noise removal,” IEEE Transactions on Geoscience and Remote Sensing 26, 65–74 (1988).
[15] Landgrebe, D., “AVIRIS derived Northwest Indiana's Indian Pines 1992 hyperspectral dataset,” http://dynamo.ecn.purdue.edu/~biehl/MultiSpec/documentation.html.
