
Probabilistic Latent Factor Induction and Statistical Factor Analysis

A Comparison of Methods

Stefan Conrady, [email protected]

Dr. Lionel Jouffe, [email protected]

April 7, 2011

Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting


Table of Contents

Introduction
About the Authors
Stefan Conrady
Lionel Jouffe
Key Concepts from Information Theory
Entropy
Chain Rule Theorem
Conditional Entropy
Mutual Information
Relative Entropy (Kullback-Leibler Divergence)
Example 1
Example 2
Comparison of Methods
Approach
Notation
Key Terminology
Data Set
Probabilistic Latent Factor Induction with BayesiaLab
Data Import
Variable Clustering
Latent Factor Induction
Statistical Factor Analysis
Factor Analysis with STATISTICA
Conclusion
References
Contact Information
Conrady Applied Science, LLC
Bayesia SAS
Copyright


Introduction

Bayesian networks have been gaining prominence among scientists over the past decade, and the new insights generated by this powerful research approach can now be found in studies that circulate well beyond the academic communities. As a result, many practitioners and managerial decision-makers see more and more references to Bayesian networks in all kinds of scientific and business research, ranging from biostatistics to marketing analytics.

It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the field of market research, for instance, long-established methods, such as factor analysis, remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side-by-side and thus help researchers to correctly compare and interpret the respective results. More specifically, we want to establish the semantic equivalents between the traditional statistical factor analysis approach and BayesiaLab's method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction.

Factor Analysis is a statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors. It is possible, for example, that variations in three or four observed variables mainly reflect the variations in a single unobserved variable, or in a reduced number of unobserved variables. The observed variables can be seen as manifestations of abstract, underlying (and unobserved) dimensions, i.e. latent factors.

Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with a large number of variables in their data.

Probabilistic Latent Factor Induction is a workflow within the BayesiaLab software package, which has the same objective as traditional factor analysis, i.e. variable reduction, but works entirely within the framework of Bayesian networks and is based on principles derived from information theory.

It is important to point out that this comparison is not meant to favor one approach over the other (and to declare a winner and a loser), although it is clearly in the authors' interest to promote Bayesian networks in general and BayesiaLab in particular. Rather, this paper should serve as a reference for research practitioners and those who use research results in their decision-making processes, so they can correctly interpret insights generated with either approach.


About the Authors

Stefan Conrady

Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America.

Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing, and analytics, working at Daimler and BMW Group in Europe, North America, and Asia. Prior to establishing his own firm, he headed the Analytics & Forecasting group at Nissan North America.

Lionel Jouffe

Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999, and it has emerged as the leading software package for knowledge discovery, data mining, and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia's strategic partnership with Procter & Gamble, which has deployed BayesiaLab globally since 2007.


Key Concepts from Information Theory

Before we proceed to the direct comparison of methods, it is important to establish several key concepts relating to the knowledge representation in Bayesian networks.

Entropy

The concept of entropy provides the underpinning for all structural learning and analysis algorithms in BayesiaLab. Entropy measures the uncertainty inherent in the distribution of a random variable.

The entropy H(X) of a random variable X is defined as:

H(X) = -\sum_{x \in X} p(x) \log_2 p(x),

where x stands for the states that variable X can take. Note that the logarithm is taken to base 2, so the value of entropy is expressed in bits.

An example can perhaps illustrate this: if variable X represents the outcome of a coin toss, X can have one of two states, Heads and Tails, i.e. the set of potential outcomes is X = {Heads, Tails}. Given a fair coin, the probabilities of Heads and Tails will both be 0.5, i.e. p(Heads) = 0.5 and p(Tails) = 0.5.

We can now compute the entropy H(X_fair) based on these values:

H(X_fair) = -p(Heads) \log_2 p(Heads) - p(Tails) \log_2 p(Tails) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 bit

This means our uncertainty prior to a fair coin toss is equivalent to an entropy value of 1 bit, which is the maximum entropy for a two-state variable, reached under a uniform distribution.

If we had a biased coin instead, with p(Heads) = 0.7 and p(Tails) = 0.3, it is intuitive to think that the uncertainty would be lower, as one outcome of the coin toss is more probable. Indeed, computing the entropy H(X_biased) yields a lower value:

H(X_biased) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.881 bits

To complete this idea, we can also plot H(X) as a function of the bias, p(Heads) = 1 - p(Tails), with p(Heads) ∈ [0, 1], i.e. ranging from impossible, p(Heads) = 0, to certain, p(Heads) = 1.


[Plot: entropy H(X) as a function of p(Heads), rising from 0 at p(Heads) = 0 to a maximum of 1 bit at p(Heads) = 0.5 and falling back to 0 at p(Heads) = 1]

Clearly, anything other than a perfectly fair coin reduces the entropy and thus our uncertainty regarding the outcome of the coin toss.
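As an aside (not part of the original paper's workflow), the entropy values above are easy to verify numerically. The following minimal Python sketch assumes nothing beyond NumPy and the definition of H(X) given earlier:

import numpy as np

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # terms with p(x) = 0 contribute nothing
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))   # fair coin:   1.0 bit (maximum for two states)
print(entropy_bits([0.7, 0.3]))   # biased coin: approx. 0.881 bits
print(entropy_bits([1.0, 0.0]))   # certain outcome: 0.0 bits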

Chain Rule Theorem

The chain rule for joint entropy states that the total uncertainty about the values of X and Y is equal to the uncertainty about X plus the (average) uncertainty about Y once X is known:

H(X, Y) = H(X) + H(Y|X)

The proof of this theorem follows:

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
= -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 [p(y|x) p(x)]
= -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y|x) - \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x)
= -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y|x) - \sum_{x \in X} p(x) \log_2 p(x)
= H(Y|X) + H(X)

Conditional Entropy

Perhaps the single most important concept for computations in BayesiaLab is conditional entropy. Conditional entropy refers to the entropy of a random variable given that we have information on another variable.

The conditional entropy H(Y|X) is defined as


H(Y|X) = \sum_{x \in X} p(x) H(Y|X = x)
= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)
= -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y|x)

In other words, the conditional entropy of Y given X is the expected entropy of Y once the value of X is known.

Mutual Information

The mutual information I(X, Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y:

I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Note that the mutual information is a symmetric metric, which reflects the uncertainty reduction of X by knowing Y as well as of Y by knowing X.
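To make the chain rule and the symmetry of mutual information concrete, here is a small numerical sketch in Python. The joint distribution p(x, y) used below is purely hypothetical (it does not come from the Perfume Study data):

import numpy as np

# Hypothetical joint distribution p(x, y) of two binary variables (rows: x, columns: y).
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])

def H(p):
    """Shannon entropy in bits; zero-probability cells are ignored."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_x = p_xy.sum(axis=1)    # marginal distribution of X
p_y = p_xy.sum(axis=0)    # marginal distribution of Y

# Conditional entropy H(Y|X), computed directly from its definition.
H_y_given_x = sum(px * H(p_xy[i] / px) for i, px in enumerate(p_x))

# Chain rule: H(X,Y) equals H(X) + H(Y|X).
print(round(H(p_xy), 4), round(H(p_x) + H_y_given_x, 4))

# Mutual information, computed both ways; the two values are identical (symmetry).
print(round(H(p_x) - (H(p_xy) - H(p_y)), 4))   # H(X) - H(X|Y)
print(round(H(p_y) - H_y_given_x, 4))          # H(Y) - H(Y|X)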

Relative Entropy (Kullback-Leibler Divergence)

A closely related concept is the relative entropy, also referred to as the Kullback-Leibler Divergence (DKL) or sometimes cross entropy. The Kullback-Leibler Divergence is a measure of the difference between two probability distributions p and q.

For probability distributions p and q of a discrete random variable X, their K-L divergence is defined as

D_{KL}(p(X) || q(X)) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}

In words, it is the expected value (under p) of the logarithmic difference between the probability distributions p(X) and q(X). In contrast to the mutual information, the relative entropy is non-symmetric.

Example 1

We once again use coin tossing as an example. By default, we would expect that any given coin is fair and assume a model q(Heads) = q(Tails) = 0.5. As it turns out, in repeated coin tosses we observe a probability of p(Heads) = 0.75 and of p(Tails) = 0.25. We can now use the Kullback-Leibler Divergence to establish the "distance" or "distortion" between the originally assumed distribution q(x) and the observed distribution p(x).

D_{KL}(p(X) || q(X)) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}
= p(Heads) \log_2 \frac{p(Heads)}{q(Heads)} + p(Tails) \log_2 \frac{p(Tails)}{q(Tails)}
= 0.75 \log_2 \frac{0.75}{0.5} + 0.25 \log_2 \frac{0.25}{0.5}
= 0.188722 bits
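The value above is straightforward to reproduce; the following short NumPy computation (a sketch, not BayesiaLab output) applies the definition of D_KL to the observed and assumed coin distributions:

import numpy as np

p = np.array([0.75, 0.25])   # observed distribution: Heads, Tails
q = np.array([0.50, 0.50])   # assumed (fair) model

print(np.sum(p * np.log2(p / q)))   # approx. 0.1887 bits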


Example 2

For another illustration, we use an example from the field of meteorology. More specifically, we look at rainfall in two cities in the state of Victoria, Australia. We used daily rainfall data measured at Geelong Airport and at Melbourne Tullamarine Airport, which are approximately 80 kilometers apart, over the entire calendar year of 2010. Given the proximity of these locations, one would generally expect similar weather. Perhaps the Geelong weather isn't reported in the Melbourne newspapers, and so a traveler wants to use the Melbourne weather as a proxy. However, the actual weather station observations tell us that there is rain in Melbourne on 40.3% of the days, whereas Geelong sees rainfall on 47.4% of the days in the year.

We can now compute the Kullback-Leibler Divergence for these two distributions, where p_Geelong(x) and p_Melbourne(x) stand for the Geelong and Melbourne rain probability distributions, respectively.

D_{KL}(p_{Geelong}(X) || p_{Melbourne}(X)) = \sum_{x \in X} p_{Geelong}(x) \log_2 \frac{p_{Geelong}(x)}{p_{Melbourne}(x)}
= p_{Geelong}(No Rain) \log_2 \frac{p_{Geelong}(No Rain)}{p_{Melbourne}(No Rain)} + p_{Geelong}(Rain) \log_2 \frac{p_{Geelong}(Rain)}{p_{Melbourne}(Rain)}
= 0.526 \log_2 \frac{0.526}{0.597} + 0.474 \log_2 \frac{0.474}{0.403}
= 0.0148958 bits

D_{KL}(p_{Melbourne}(X) || p_{Geelong}(X)) = \sum_{x \in X} p_{Melbourne}(x) \log_2 \frac{p_{Melbourne}(x)}{p_{Geelong}(x)}
= p_{Melbourne}(Rain) \log_2 \frac{p_{Melbourne}(Rain)}{p_{Geelong}(Rain)} + p_{Melbourne}(No Rain) \log_2 \frac{p_{Melbourne}(No Rain)}{p_{Geelong}(No Rain)}
= 0.403 \log_2 \frac{0.403}{0.474} + 0.597 \log_2 \frac{0.597}{0.526}
= 0.0147077 bits
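Using the same rain probabilities, a short Python sketch makes the non-symmetry of the relative entropy visible; the two directions yield close, but not identical, values:

import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

geelong   = [0.526, 0.474]   # P(No Rain), P(Rain) at Geelong
melbourne = [0.597, 0.403]   # P(No Rain), P(Rain) at Melbourne Tullamarine

print(kl_bits(geelong, melbourne))   # approx. 0.0149 bits
print(kl_bits(melbourne, geelong))   # approx. 0.0147 bits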

BayesiaLab's primary metric, the Arc Force, is directly proportional to the relative entropy and describes the strength of the directional link between two variables. More specifically, it describes the difference between the joint probability distributions with and without the particular arc.


Comparison of Methods

Approach

We believe that we can best facilitate a comparison of statistical factor analysis and latent factor induction by working through an example. We draw upon the familiar dataset from the previously presented case study from the perfume industry (Conrady and Jouffe, 2010), hereafter referred to as the "Perfume Study."

We begin our tutorial with the Data Import process for BayesiaLab, although it is not meant to be at the core of the comparison. It is important, though, to spell out the data pre-processing steps in BayesiaLab, as they highlight some of the fundamental differences between probabilistic and statistical approaches.

Once the data preparation is complete, we first present the probabilistic latent factor induction workflow with BayesiaLab and then provide an example of a statistical factor analysis. For the statistical factor analysis, we will use STATISTICA 10 as the software platform, although most steps are fairly generic and could be reproduced with a number of other statistical software packages as well.

Notation

To clearly distinguish between natural language, software-specific functions, and study-specific variable names, the following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.
• Names of attributes, variables, nodes, and factors are italicized.
• At appropriate points in the text, grey boxes highlight parallels between the two presented methods:

Probabilistic Latent Factor Induction ↔ Statistical Factor Analysis

Key Terminology

• "Observed" and "manifest" are used interchangeably and describe those random variables which have been measured by the researcher.
• The terms "latent" and "unobserved" are used interchangeably in the context of hidden concepts or factors, which cannot be measured but can potentially be extracted or induced. In our context, the term factor stands exclusively for latent variables. Consequently, the terms "factor", "factor variable", "latent variable", and "unobserved variable" are equivalent.


Data Set

The Perfume Study is based on a monadic consumer survey about a range of fragrances, which was conducted in France. In this example we use survey responses from 1,321 women, who evaluated a total of 11 fragrances on a wide range of attributes:

• 27 ratings on fragrance-related attributes, such as "sweet", "flowery", "feminine", etc., measured on a 1-to-10 scale.
• 12 ratings on projected imagery related to someone who would be wearing the respective fragrance, e.g. "is sexy", "is modern", measured on a 1-to-10 scale.
• 1 variable for Intensity, a measure reflecting the level of intensity, measured on a 1-to-5 scale.
• 1 variable for Purchase Intent, measured on a 1-to-6 scale.
• 1 nominal variable, Product, for product identification purposes.


Probabilistic Latent Factor Induction with BayesiaLab

Data Import

To start the process with BayesiaLab, we first import the data set, which is formatted as a CSV file (comma-separated values, a common format for text-based data files). With Data>Open Data Source>Text File, we start the Data Import wizard, which immediately provides a preview of the data file.

The table displayed in the Data Import wizard shows the individual variables as columns and the survey responses as rows. There are a number of options available, e.g. for sampling. However, this is not necessary in our example given the relatively small size of the database.

Clicking the Next button prompts a data type analysis, which provides BayesiaLab's best guess regarding the data type of each variable.

Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing values, filtered states, etc. (There are no missing values in our database, and filtered states are not applicable in this survey.)


For this example, we will need to override the default data type for the Product variable, as each value is a nominal product identifier rather than a numerical scale value. We can change the data type by highlighting the Product variable and clicking the Discrete check box, which changes the color of the Product column to red.

We will also define Purchase Intent and Intensity as discrete variables, as the default number of states of these variables is already adequate for our purposes. (The desired number of variable states is largely a function of the analyst's judgment.)

The next screen provides options as to how to treat any missing values. In our case, there are no missing values, so the corresponding panel is grayed out.

Clicking the small upside-down triangle next to the variable names brings up a window with key statistics of the selected variable, in this case Fresh.


The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization to be performed on all continuous variables. (BayesiaLab requires discrete distributions for all variables.) For this survey, and given the number of observations, it is appropriate to reduce the number of states from the original 10 states (1 through 10) to a smaller number. One could, for instance, bin the 1-to-10 rating into low, mid, and high, or apply any other method deemed appropriate by the analyst.

The screenshot shows the dialogue for the Manual selection of discretization steps, which permits the analyst to select binning thresholds by point-and-click.


For this particular example, we select Equal Distances with 5 intervals for all continuous variables. This was the analyst's choice in order to be consistent with prior research.
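For readers who want to see what "Equal Distances" binning amounts to conceptually, the following Python sketch bins a hypothetical 1-to-10 rating column into 5 equal-width intervals. It is only an illustration of the binning principle, not BayesiaLab's implementation:

import numpy as np

def equal_distance_bins(values, k=5):
    """Assign each value to one of k equal-width intervals spanning the observed range."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), k + 1)   # k+1 edges define k intervals
    return np.digitize(values, edges[1:-1], right=True) + 1  # bin labels 1..k

ratings = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])   # hypothetical 1-to-10 ratings
print(equal_distance_bins(ratings, k=5))               # -> [1 1 2 2 3 3 4 4 5 5]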

Clicking Select All Continuous followed by Finish completes the import process, and the 49 variables (columns) from our database are now shown as blue nodes in the Graph Panel, which is the main window for network editing. By default, all variables are represented as nodes. This initial view represents a fully unconnected Bayesian network.

Note

For choosing discretization algorithms beyond this example, the following rule of thumb may be helpful:

• For supervised learning, choose Decision Tree.

• For unsupervised learning, choose, in the order of priority, K-Means, Equal Distances or Equal Frequencies.


In the above graph, two variables play a fundamentally different role. The values of Product represent categories, and Purchase Intent is the overall target variable, i.e. the dependent variable of the Perfume Study. Thus both will be excluded from the factor generation process.

While correlation and covariance are the central measures for statistical factor analysis, learning Bayesian networks with BayesiaLab (and thus probabilistic factor induction) is based on measures from information theory, such as the Kullback-Leibler Divergence, which was introduced in the first chapter.

The Kullback-Leibler Divergence can be obtained after learning an initial Bayesian network with one of BayesiaLab's unsupervised learning algorithms. "Unsupervised" implies that the learning algorithm searches for an overall representation of the joint distribution of the underlying data rather than the characterization of an individual target variable.

In our example, we use BayesiaLab’s EQ algorithm to obtain a Bayesian network.


As this view of the network is not easily readable, BayesiaLab has numerous built-in layout algorithms, of which the Force Directed Layout is perhaps the most commonly used. It can be invoked by View>Automatic Layout>Force Directed Layout or, alternatively, through the keyboard shortcut "p".

The resulting network will look similar to the following screenshot.


Completed Bayesian Network upon EQ Learning

With the network established, we can now further examine the probabilistic relationships between the nodes, which are represented as arcs, i.e. directed links (edges) between nodes that appear as arrows in the graph. By selecting Analysis>Graphic>Arc Force, we can show the probabilistic strength of the arcs, which is visualized by the thickness of the arcs.


Network with Arc Force

The numeric values of the Arc Force can be shown by selecting View>Display Arc Comments. In the network shown below, the Arc Force values are presented in yellow boxes attached to each arc.


Network with Arc Force

Arc Force ↔ Covariance

In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.


Variable Clustering

With Arc Force established as the key measure across the entire network, BayesiaLab can determine clusters of variables which are "close" in a probabilistic sense. This can be initiated from the menu via Analysis>Graphic>Variable Clustering.

The clustering algorithm is iterative and starts with the two variables whose connecting arc has the strongest Arc Force. The following sequence of screenshots illustrates this algorithm conceptually in "slow motion," as the analyst would not see these individual steps in the actual workflow.

As a starting point, every manifest variable is treated as a distinct cluster, and so we have 47 clusters. Using the Kullback-Leibler Divergence as a measure, the "closest" variables are then merged into one concept. As a result, we first obtain 46 clusters, then 45, etc., as shown in the array of dendrograms below. BayesiaLab proposes to conclude this algorithm upon finding 15 clusters. However, the analyst has the ability to override this automatic selection. As the choice of clusters appears to be generally compatible with our interpretation of the variable names, we accept this recommendation.
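BayesiaLab's variable clustering operates on the Arc Forces of the learned network; the details of that algorithm are not reproduced here. As a rough conceptual analogue, the sketch below performs bottom-up (agglomerative) clustering of variables using pairwise mutual information as the similarity measure, on hypothetical discretized data. All names and parameters are illustrative assumptions:

import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical discretized survey responses: rows = respondents, columns = manifest variables.
rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(1321, 10))

n_vars = data.shape[1]
dist = np.zeros((n_vars, n_vars))
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        mi = mutual_info_score(data[:, i], data[:, j])   # pairwise mutual information
        dist[i, j] = dist[j, i] = 1.0 / (1.0 + mi)        # convert similarity to a distance

# Agglomerative clustering: start with one cluster per variable and iteratively merge the closest pair.
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")           # stop at, e.g., 4 variable clusters
print(labels)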


Sequence of Dendrograms: 47, 46, 45, 44, ..., 16, 15 clusters

Because of the importance of this process, we will also show it from another angle, i.e. by looking at sequential views of the graph.


Step 0 - 47 Clusters

Step 1 - 46 Clusters: Pleasure merged with Corresponds

The strongest Arc Force exists between Pleasure and Corresponds, and BayesiaLab will form an interim concept from them. The next-highest Arc Force then determines whether another variable is merged with the first concept or whether a new concept is created. In our case, Radiant and In Love are combined as a new concept.


Step 2 - 45 Clusters: Radiant merged with In Love

In the third step, we see Sensual and Romantic merged into a new latent concept, and so on.

Step 3 - 44 Clusters: Sensual merged with Romantic

Upon completion of this process, BayesiaLab forms variable/node clusters from these common concepts and color-codes them accordingly.


Network with Color-Coded Variable Clusters

By clicking the Validate Clustering button, we can now formally fix the new latent factor variables. The new latent factors are shown in the following table with their associated observed variables. By default, they are given the name "Factor" plus a numeric suffix.


Latent Factor Induction

Upon definition of the new latent factor variables, we now want to make them available for modeling purposes. Although these latent factors exist as new concepts and are conceptually linked to the manifest variables, the factors do not yet have any values or states.

This will now happen in the Multiple Clustering process, which creates discrete states for each latent factor variable by performing data clustering over the linked manifest variables.

More specifically, the states of each latent factor will be created in such a way that they best summarize the joint probability distribution defined by the manifest variables. Factor 0 and its linked manifest variables are shown below.

Subnetwork for Factor 0


The following Monitors display the marginal probability distributions of the variables associated with Factor 1, plus, highlighted in red, Factor 1 itself with its states. We can see that 5 states were created for Factor 1, labelled C1 through C5, and each has an expected value, which is shown in parentheses. For instance, state C2 has an expected value of 9.21. That means that, given that C2 is observed, the mean value of the manifest variables, weighted by their relation with C2, is equal to 9.21. In other words, C2 corresponds to high ratings with regard to those 5 dimensions.
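The mechanics behind such factor states can be illustrated with a rough analogue in Python: cluster the respondents on the manifest variables linked to one factor into 5 groups, and report each group's mean rating. This is a simplified stand-in for BayesiaLab's Multiple Clustering, using hypothetical data, and the unweighted mean is used instead of the relation-weighted expected value described above:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-to-10 ratings for the five manifest variables linked to one latent factor.
rng = np.random.default_rng(1)
manifest = rng.integers(1, 11, size=(1321, 5)).astype(float)

# Create 5 discrete factor states by clustering the respondent records.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(manifest)
states = km.labels_                      # one factor state per respondent record

# Characterize each state by the mean of the linked manifest ratings.
for s in range(5):
    print(f"State C{s + 1}: mean rating approx. {manifest[states == s].mean():.2f}")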

By selecting specific states of Factor 0 in the Monitor Panel, we can see the conditional distributions of the manifest variables. The states C2 and C3 are displayed for reference below. They can be easily interpreted by looking at the associated values, e.g. state C2 appears to reflect high ratings of the manifest variables, whereas state C3 captures very low ratings.

A more general analysis of the relationships between manifest variables and latent factors can be obtained through Analysis>Reports>Relationship Analysis:

This chart summarizes the values of key clustering measures, such as the Kullback-Leibler Divergence, for every manifest variable associated with Factor 0. For reference only, it also includes Pearson's Correlation Coefficient R.


It is also possible to visualize the mean values of the manifest variables (x-axis) along with the Mutual Information (y-axis, left panel) and the Standardized Total Effect (y-axis, right panel) for the latent factor variable.

Although we have now defined new factor variables, we have not yet seen the original matrix of survey responses in terms of the new factor variables. For instance, every respondent record has a value for Active, Fulfilled, Trust, etc., as these variables were observed and recorded in the survey, but how do we find the values (or states) of the new latent factors for each respondent record?

Actually, at the conclusion of the Multiple Clustering process, BayesiaLab has already introduced the new factors into the original network. By using BayesiaLab's imputation process, which is based on maximum likelihood, they were added as new nodes to the graph and also saved as new columns (or fields) in the database.

Relationship Analysis ↔ Factor Loadings

This summary of clustering measures in the Relationship Analysis allows an interpretation very similar to the one provided by factor loadings.


Latent Factors Introduced into Network

We can easily verify that each new factor has a value for each respondent record. We start Inference>Interactive Inference, which allows us to scroll through the survey records and view the values of any variable, including the values of the new latent factors.

Factor Induction ↔ Saving Factor Scores

Introducing the new latent factors into the network is equivalent to adding the factor scores to the original observation matrix.


For instance, survey record #0 is expressed as state C4 in terms of Factor 0. The states of the manifest variables are shown for reference.

Record #8, for example, is assigned to state C3:

Now we have the entire set of respondent records re-expressed in terms of 15 latent factors, which allows us to use them for all kinds of modeling purposes.


Given the importance of latent factors for interpretation, we will assign descriptive labels to each of them. BayesiaLab can visually aid in this process by showing the latent factors and their relationships to the original manifest variables. This means we will simply learn a new network that includes both factor variables and manifest variables.


Network including Latent Factors and Manifest Variables

The emerging network structure clearly lends itself to defining descriptive labels, which are applied to the factors in the following graph. (See Conrady and Jouffe (2010) for a more detailed explanation of the interpretation process.)


Network including Latent Factors and Manifest Variables plus Factor Labels

It is important to reiterate that the latent factors generated here are not orthogonal, which means that probabilistic relationships exist between the factors. For illustration purposes, we can highlight the latent factors and exclude the manifest variables from being displayed. In addition, the following graph also displays the Arc Force between the latent factors, providing further confirmation that the latent factors are not independent.


Network with Latent Factors and Arc Forces


Statistical Factor Analysis

Perhaps the most common approach for extracting factors from a set of observed variables is Principal Components Analysis (PCA), and it is frequently considered a synonym for factor analysis. (There are differences between PCA and the more general concept of factor analysis, but explaining those goes beyond the scope of this paper.) For our purpose, we look at PCA as a prototypical tool for factor extraction, which lends itself to comparison with the latent factor induction with BayesiaLab presented earlier.

Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables, represented by matrix X, into a set of values of uncorrelated variables called principal components, represented by a new matrix Y. The goal of this transformation is to minimize redundancy (measured by covariance) and to maximize the signal (measured by variance).

This transformation is defined in such a way that the first principal component has the highest possible variance, i.e. it accounts for as much of the variability in the data as possible. In turn, each succeeding component has the next-highest variance while being orthogonal to (uncorrelated with) the preceding components.

Conceptual Illustration of Principal Component Vectors

More formally, PCA creates a re-expression of the original data set on the basis of a new set of orthonormal vectors, replacing the original set of "naive" basis vectors, which resulted from the choice of measurements. (Any observed variable automatically establishes a basis vector; measuring 47 variables would thus result in a 47-dimensional coordinate system.)

In matrix notation, this can be expressed as follows:

PX = Y


with X being the matrix of original observations and P being a yet-to-be-determined orthonormal matrix that transforms X into Y. Interpreting this geometrically, P is a rotation and stretch that generates Y. The rows of P, {p_1, ..., p_m}, are the new set of basis vectors for expressing the columns of X. Writing out the explicit dot products may better illustrate this:

PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}

Y = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix}

This provides us with the general framework, but we have yet to determine what matrix P should be.

This is the point where we need to introduce the concept of the covariance matrix C_X. It is defined as

C_X = \frac{1}{n-1} X X^T

• C_X is a square, symmetric m × m matrix.
• The elements on the diagonal of C_X represent the variances of the observed variables.
• The off-diagonal elements of C_X represent the covariances between observed variables.

As a result, C_X captures the correlations between all possible pairs of observed variables.

This relates directly to our objective of minimizing redundancy (measured by covariance) and maximizing the signal (measured by variance) of the target matrix Y. Achieving these goals optimally would imply a diagonal covariance matrix of Y, i.e. with all off-diagonal elements being zero, and our objective thus translates into stipulating that C_Y must be diagonal. Fortunately, linear algebra provides several tools for diagonalizing a matrix.

More formally, the objective becomes finding some orthonormal matrix P, where Y = PX, such that C_Y is diagonalized. The rows of P are then the principal components.

Without providing further detail, the solution is:

• The principal components of X are the eigenvectors of XX^T, i.e. the rows of P.
• The i-th diagonal value of C_Y is the variance of X along p_i.

A numerical sketch of this eigenvector computation follows below.
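The sketch uses NumPy on hypothetical standardized data, with the same m x n orientation of X as in the derivation above (one row per variable); the covariance matrix is built, its eigenvectors become the rows of P, and the eigenvalues give each component's share of the total variance:

import numpy as np

# Hypothetical centered observation matrix X: m = 47 variables (rows) x n = 1,321 respondents (columns).
rng = np.random.default_rng(2)
X = rng.normal(size=(47, 1321))
X = X - X.mean(axis=1, keepdims=True)          # center each variable

n = X.shape[1]
C_X = (X @ X.T) / (n - 1)                      # covariance matrix (m x m)

# Principal components: eigenvectors of C_X, sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(C_X)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

P = eigenvectors.T                             # rows of P are the principal components
Y = P @ X                                      # re-expressed data; cov(Y) is (nearly) diagonal

explained = eigenvalues / eigenvalues.sum()    # share of total variance per component
print(np.round(explained[:5], 4))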


Factor Analysis with STATISTICA

Upon loading the survey data into STATISTICA, the respondent records are presented as a data table, with the variable names shown as column headers and case numbers shown as row headers. (We skip a detailed description of the data import steps, as they are fairly generic, and readers may use any of a wide array of statistical programs.) This represents our observation matrix X.

Observation Matrix X

As a starting point of the PCA process, we can display C_X, the covariance matrix of X:


Covariance Matrix

As expected, there is a high amount of covariance, i.e. redundancy, between many of the observed variables. To get a better sense of the magnitude of these pairwise relationships, it helps to display the correlation matrix for reference:

Arc Force ↔ Covariance

In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.


Correlation Matrix

STATISTICA, like many other statistical software packages, has built-in routines that can compute the matrix P of principal components automatically. There are several methods available for solving the PCA, including the approach using the eigenvectors of the covariance matrix, which was shown earlier.

Regardless of the computational method used, the solution of the PCA provides as many eigenvalues as there are observed variables. The sum of all eigenvalues equals the number of observed variables, in our case 47. This allows us to determine the share of variance attributable to each factor. For instance, the first factor has an eigenvalue of 29.6, which means that it accounts for 29.6/47 = 62.98% of the variance. Proceeding down the list, the eigenvalues decline in value, and so does their corresponding contribution to the total variance.


List of Eigenvalues

Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain, as the overall objective of this exercise is variable reduction. The precise number of factors to be retained is ultimately an arbitrary decision of the analyst, but factors with eigenvalues greater than 1 are typically considered candidates. A scree plot is commonly used to illustrate the eigenvalues of the extracted factors. (The name is metaphorical: "scree" is the term for the accumulation of broken rock at the base of mountain cliffs, and in the scree plot we want to distinguish the substantial eigenvalues from the "rubble" at the bottom.) Sometimes this provides a visual indication of a natural cutoff point between higher and lower eigenvalues. Here such a distinction cannot be made easily, so we defer to the rule of thumb and retain eigenvalues greater than 1.


Scree Plot

In the next step we turn to the interpretation of the extracted factors. The table below shows the factor loadings, which are the correlations of each observed variable with the extracted factors.

Factor Loadings


Given the high eigenvalue of factor 1, it is not surprising that many variables are highly correlated with it. In our particular case, however, this correlation is mostly negative, which may be counterintuitive for interpretation purposes.

It is common practice to rotate factors in order to aid in the interpretation process. Intuitively speaking, the rotation is typically chosen in such a way that the principal factor, i.e. factor 1, aligns with what is commonly understood as the "positive x-axis."
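STATISTICA's rotation options are not reproduced here, but the idea can be sketched in Python with a generic varimax rotation, one of the most commonly used orthogonal rotation methods. The loadings matrix below is purely hypothetical:

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (variables x factors) loadings matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Standard varimax update: the rotation matrix is obtained from an SVD at each step.
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

rng = np.random.default_rng(3)
raw_loadings = rng.normal(size=(10, 3))        # hypothetical loadings: 10 variables, 3 factors
print(np.round(varimax(raw_loadings), 2))      # rotated loadings, easier to interpret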

Such a factor rotation, for which several methods exist, was also performed with STATISTICA, and the results appear in the table below. In addition, factor loadings higher than 0.7 are highlighted.

Loadings on Rotated Factors

The analyst can now use these factor loadings to assign meaningful names to each factor. Some are quite obvious in their characterization, such as factor 3, which could be called "pleasant," or factor 4, which is quite obviously "classical." It is also interesting to see that only one variable, i.e. Intensity, has a high loading on factor 2.

Relationship Analysis ↔ Factor Loadings

The summary of clustering measures in BayesiaLab's Relationship Analysis allows an interpretation very similar to the one provided by factor loadings.


This implies that Intensity is perhaps a standalone concept with little redundancy. At the other extreme, many variables have high loadings on factor 1, which makes identifying a distinct concept more elusive.

Without completing this interpretation process, we turn to the "reduction" part by introducing the extracted factors as variables into the original data set, i.e. replacing 47 variables with 6 variables. This is often referred to as "saving factor scores," with the factor scores being the coordinates of the original observations in the new coordinate system created by the extracted factors. Our observations now have coordinates in a 6-dimensional coordinate system rather than in one with 47 dimensions.
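For principal components, "saving factor scores" amounts to projecting the standardized observations onto the retained components and appending the resulting coordinates to the data set. A minimal sketch with hypothetical data follows (note that scores for a common factor model would normally be estimated differently, e.g. by regression):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1321, 47))                      # hypothetical respondents x variables

# Standardize the variables and compute components from the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:6]]              # retain the first 6 components

# Component scores: each respondent's coordinates on the 6 retained components.
scores = Z @ components                              # shape (1321, 6)
data_with_scores = np.hstack([X, scores])            # appended as 6 new columns
print(data_with_scores.shape)                        # (1321, 53)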

Factor Scores

We now have the ability to create a wide range of models, for instance, modeling Purchase Intent as a function of the 6 new factors. This will undoubtedly be easier to interpret than a model that includes all 47 original observed variables.

Latent Factor Induction ↔ Saving Factor Scores

Introducing the latent factors into the network is equivalent to adding the factor scores to the original observation matrix.


Conclusion

Although fundamentally different in their frameworks, statistical factor analysis and probabilistic latent factor induction have many parallels, which lend themselves to direct comparative interpretation. Given these parallels, analysts familiar with either domain should find it easy to translate their research workflow from one framework into the other. Equally, end users of research results, who may be less familiar with the underlying computations, should be in a position to interpret the findings from both methods in a very similar manner.


References

Conrady, Stefan, and Lionel Jouffe. "Driver Analysis & Product Optimization, A Case Study from the Perfume Industry." December 1, 2010. http://www.conradyscience.com/index.php/driver-analysis.

Cover, T. M., and J. A. Thomas. "Entropy, Relative Entropy and Mutual Information." Elements of Information Theory (1991): 12–49.

Kachigan, Sam Kash. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. Radius Press, 1991.

MacKay, David J. C. Information Theory, Inference and Learning Algorithms. 1st ed. Cambridge University Press, 2003.

Shlens, J. "A Tutorial on Principal Component Analysis." Systems Neurobiology Laboratory, University of California at San Diego (2005).

StatSoft, Inc. "Electronic Statistics Textbook." 2011. http://www.statsoft.com/textbook/.


Contact Information

Conrady Applied Science, LLC
312 Hamlet's End Way
Franklin, TN 37067
USA
+1 888-386-8383
[email protected]
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
[email protected]
www.bayesia.com

Copyright

© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.

Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:

• You may print or download this document for your personal and noncommercial use only.

• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady Applied Science, LLC and Bayesia SAS as the source of the material.

• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.
