
-Long version (Supplementary Material)- Latent Confusion Analysis by Normalized Gamma Construction

Abstract

We developed a flexible framework for modeling the annotation and judgment processes of humans, which we called "normalized gamma construction of a confusion matrix." This framework enabled us to model three properties: (1) the abilities of humans, (2) a confusion matrix with labeling, and (3) the difficulty with which items are correctly annotated. We also provided the concept of "latent confusion analysis (LCA)," whose main purpose was to analyze the principal confusions behind human annotations and judgments. It is assumed in LCA that confusion matrices are shared between persons; we call these shared matrices "latent confusions," in tribute to the "latent topics" of topic modeling. We aim at summarizing the workers' confusion matrices with a small number of latent principal confusion matrices because a large set of personal confusion matrices is difficult to analyze. We used LCA to analyze latent confusions regarding the effects of radioactivity on fish and shellfish following the Fukushima Daiichi nuclear disaster in 2011.

An important theme in collective intelligence is modeling the annotation and judgment processes of humans. We focus on modeling a confusion matrix with labeling. Extracting a confusion matrix is useful not just for obtaining better (closer to the ground truth) aggregation of labels but also for obtaining diagnostic information on human annotations and judgments.

Dawid and Skene (1979) proposed a probabilistic generative model for subjective labeling. Their model can estimate individual confusion matrices even when the true label is not available. Each worker in this model has a confusion matrix in which, if an item (e.g., an image) has true label $u$, worker $j$ assigns another label $l$ with probability $\pi^{(j)}_{u,l}$. Smyth et al. (1994) applied the Dawid and Skene (DS) model to the image labeling problem.


Snow et al. (2008) applied the DS model to the analysis of opinions in natural language processing. Liu and Wang (2012) applied the DS model to judge the quality of (query, URL) pairs.

Whitehill et al. (2009) proposed the Generative model of Labels, Abilities, and Difficulties (GLAD), which simultaneously estimates the expertise of each worker and the difficulty of each task. Unlike the DS model, GLAD is beneficial in that it models the difficulty with which items are correctly annotated. However, it suffers from a critical issue: when we apply GLAD to a task with multiple labels, the confusion matrix of a worker cannot be constructed (see Sec. 2.2 for the details).

Contributions

This paper makes three main contributions.

1. We propose the normalized gamma construction (NGC) of a confusion matrix to model the annotation and judgment processes of humans. This framework easily enables us to model a confusion matrix with labeling in a multi-label setting, like the DS model, and to take into account a task's difficulty, like GLAD.

2. We provide a novel concept in data science, latent confusion analysis (LCA), which we developed with the NGC framework and latent Dirichlet enhanced modeling. The main aim of LCA is to extract latent (principal) confusions behind the annotation and judgment processes of humans. LCA summarizes the workers' confusion matrices with a small number of latent principal confusion matrices because a large set of personal confusion matrices is difficult to analyze.

3. The proposed learning algorithm is based on variational Bayes (VB) inference. Due to the normalization term of NGC, we had to devise a way of optimizing the variational lower bound that enables us to obtain closed-form solutions in the M-step (see Sec. 4). Moreover, we provide point estimations of the prior parameters, i.e., we do not need to tune the prior parameters for each dataset.


Figure 1. Motivational example and input/output of our model. We created rating data in which the safety level of fish and shellfish was annotated by using Japanese crowdsourcing. We asked people in Japanese crowds to assess the safety levels of items that consisted of fish names and their fishing grounds in Japan. We aimed at gaining insights into what kinds of confusions Japanese people had about the effect of radioactivity on fish and shellfish. Unlike Dawid and Skene (1979), we modeled confusion matrices for the workforce, not for each worker, i.e., we shared latent confusion matrices between workers. When the workers' confusions could be described with a combination of a small number of principal confusion matrices, the number of principal confusion matrices was smaller than the number of workers. Therefore, we expected to only have to analyze the small set of principal confusion matrices. Moreover, we modeled each worker's ability, denoted by $a$, and the difficulty of correctly annotating each item, denoted by $d$. This information helped us understand the properties of the rating data. The details are described in the experimental section (Sec. 5).

Motivation behind LCA: the Fukushima Daiichi nuclear disaster

We are often interested in obtaining diagnostic information on the types of confusions people experience. In this situation, it is more useful to extract the confusions shared among people than to extract the individual confusion matrix of each person, as in existing work. It is ideal to analyze the confusion matrix of each person as a combination of latent confusions.

Figure 1 outlines our motivation for latent confusion analysis. On March 11, 2011, the Tohoku earthquake and tsunami occurred, followed by a series of equipment failures, nuclear meltdowns, and releases of radioactive materials at the Fukushima I Nuclear Power Plant; this is called the Fukushima Daiichi disaster. This disaster is considered to be the largest nuclear disaster since the Chernobyl disaster of 1986 and was only the second disaster, along with Chernobyl, to measure Level 7 on the International Nuclear Event Scale.

Unsurprisingly, there was a great deal of concern in Japan about the risk to health and the food chain caused by radioactivity. A huge social issue related to trustworthiness emerged, called "Fuhyo Higai" in Japanese. Farmers, fishermen, and related businesses faced this risk because consumers stopped buying products that might be affected by radioactivity. There is now a growing need to analyze how people are confused about the effect radioactivity has on foods. We are therefore under pressure to analyze latent confusions from a questionnaire survey on the effects of radioactivity on foods. The details of the dataset are described in the experimental section.

This paper is organized as follows. We describe existing models in Sec. 2. We propose our novel framework in Sec. 3. We provide the variational Bayes inference for LCA in Sec. 4. We present comparative experimental results in Sec. 5.

1. Preliminaries and notations

The bold notation of a variable indicates a set of the variables in terms of its subscripts, e.g., $\mathbf{z}_j = \{z_{j,i}\}_{i=1}^{n_j}$ and $\mathbf{z} = \{\mathbf{z}_j\}_{j=1}^{N}$. $\mathbb{E}[x]$ denotes the expectation of $x$ under its distribution. In particular, $\mathbb{E}_q[x]$ denotes the expectation of $x$ under its variational posterior. $\mathrm{KL}[\cdot\|\cdot]$ denotes the Kullback-Leibler (KL) divergence. $\mathrm{Multi}(\cdot)$ denotes the multinomial distribution. $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution. $\mathrm{Gamma}(\cdot)$ denotes the gamma distribution, whose probability density function is $\mathrm{Gamma}(x; a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}$. The expectations of $x$ and $\log x$ are $\mathbb{E}[x] = a/b$ and $\mathbb{E}[\log x] = \Psi(a) - \log b$. $x \sim P$ expresses that a random variable $x$ is distributed according to the probability distribution $P$. $\Psi(\cdot)$ is the digamma (psi) function. $\delta(c)$ is the delta function, which takes a value of one if condition $c$ is satisfied and zero otherwise. $\sum_{l(\neq u)}$ means $\sum_{l=1}^{L} \delta(u \neq l)$.

Suppose that we have $M$ items to annotate and $L$ annotation labels. $N$ denotes the number of workers. Each item has a true label from the set of labels $\{1, 2, \cdots, L\}$, where the true label is in fact not available. For example, in the case of customers rating books on a scale from one to five stars, we have $M$ books and $L$ is five.

Since the true labels cannot be observed, we formulate them as latent variables. $\tau_m = l$ denotes that the true label of item $m$ is $l \in \{1, \cdots, L\}$. $x_{j,i}$ is the $i$-th item that worker $j$ labels. $y_{j,i} \in \{1, \cdots, L\}$ denotes the label assigned by worker $j$ to item $x_{j,i}$. $\mathbf{x}$ is a bag of $x_{j,i}$, $\mathbf{y}$ is a bag of $y_{j,i}$, and $\boldsymbol{\tau}$ is a bag of $\tau_m$. $n_j$ is the number of items that worker $j$ annotates.


2. Existing models

Here, we describe the two models most related to our work.

2.1. Dawid and Skene (DS) model

Dawid and Skene (1979) considered the problem of measuring observer error; they analyzed patient records in which a patient was seen by different clinicians and different responses might be obtained to the same questions. They proposed a model that allows an individual confusion matrix to be estimated even when the true response is not available. We call this model the DS model.

The key idea is to introduce the confusion matrix given by the probability $\pi^{(j)}_{u,l}$ that worker $j$ will assign label $l$ when $u$ is the true label. That is, worker $j$ assigns label $y_{j,i}$ to item $x_{j,i}$ by

$$y_{j,i} = \begin{cases} u & \text{with probability } \pi^{(j)}_{u,u} \\ l\,(\neq u) & \text{with probability } \pi^{(j)}_{u,l} \end{cases} \quad (1)$$

The probabilities $\pi^{(j)}_{u,l}$ ($u \neq l$) indicate the individual error rates of worker $j$, and $\pi^{(j)}_{u,u}$ is the probability that worker $j$ will annotate the true label $u$. Note that the error rates are conditional probabilities, with $\sum_{l=1}^{L} \pi^{(j)}_{u,l} = 1$ for each $j$ and $u$. $\boldsymbol{\pi}$ is the set of all $\pi^{(j)}_{u,l}$.

The likelihood of the workers' annotation data $\mathbf{y}$ and the true labels $\boldsymbol{\tau}$, given $\boldsymbol{\pi} = \{\pi^{(j)}\}_{j=1}^{N}$, is

$$p(\mathbf{y}, \boldsymbol{\tau} \,|\, \mathbf{x}, \boldsymbol{\pi}, \boldsymbol{\mu}) = \prod_{m=1}^{M} \mu_{\tau_m} \prod_{j=1}^{N} \prod_{i=1}^{n_j} \pi^{(j)}_{\tau_{x_{j,i}}, y_{j,i}}, \quad (2)$$

where $\boldsymbol{\mu}$ is the true label prior, i.e., $\mu_l = p(\tau_m = l)$, and we denote $\boldsymbol{\mu} = (\mu_1, \cdots, \mu_L)$. Dawid and Skene (1979) used the Expectation-Maximization (EM) algorithm to estimate $p(\tau_m|\mathbf{x}, \mathbf{y}, \boldsymbol{\pi}, \boldsymbol{\mu})$, $\pi^{(j)}_{u,l}$, and $\mu_l$.
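To make the estimation concrete, the following is a minimal numpy sketch of one EM iteration of the DS model, under the simplifying assumption that every worker labels every item; the names (`ds_em_step`, `labels`, `pi`, `mu`) are ours, not from Dawid and Skene (1979).

```python
import numpy as np

def ds_em_step(labels, pi, mu):
    """One EM iteration of the Dawid-Skene model (sketch).

    labels: (N, M) int array; labels[j, m] is the label worker j gave item m.
    pi:     (N, L, L) array; pi[j, u, l] = p(worker j assigns l | true label u).
    mu:     (L,) array; prior p(tau_m = l).
    """
    N, M = labels.shape
    L = mu.shape[0]

    # E-step: posterior q(tau_m = u) over the true label of each item, Eq. (2).
    log_post = np.tile(np.log(mu), (M, 1))                  # (M, L)
    for j in range(N):
        log_post += np.log(pi[j][:, labels[j]]).T           # log pi[j, u, y_{j,m}]
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)

    # M-step: re-estimate each confusion matrix and the label prior.
    for j in range(N):
        counts = np.zeros((L, L))
        for l in range(L):
            counts[:, l] = post[labels[j] == l].sum(axis=0) # expected (u, l) counts
        pi[j] = counts / (counts.sum(axis=1, keepdims=True) + 1e-12)
    mu = post.mean(axis=0)
    return pi, mu, post
```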

2.2. Generative Model of Labels, Abilities, and Difficulties (GLAD)

Whitehill et al. (2009) formulated a probabilistic model of the binary labeling process, i.e., $L = 2$, by modeling the true labels, the workers' abilities, and the difficulty with which items are correctly annotated; it is called the Generative model of Labels, Abilities, and Difficulties (GLAD).

The ability (expertise) of each worker $j$ is modeled by the parameter $a_j \in (-\infty, \infty)$. $a_j = \infty$ ($-\infty$) means the worker always labels items correctly (incorrectly). $a_j = 0$ means that the worker has no information about the true label. The difficulty of annotating item $m$ correctly is modeled by $d_m$, which is positive; $d_m = \infty$ means the item is very ambiguous, so even the most highly skilled worker has only a 50% chance of labeling it correctly, and $1/d_m = \infty$ means the item is so easy to annotate that most workers always label it correctly.

The label $y_{j,i} = l$ that worker $j$ assigns to the $i$-th item $x_{j,i} = m$ given true label $\tau_m$ is generated according to

$$p(y_{j,i} = l \,|\, a_j, d_m, \tau_m) = \sigma(a_j/d_m)^{\delta(\tau_m = l)}\, \big(1 - \sigma(a_j/d_m)\big)^{\delta(\tau_m \neq l)}, \quad (3)$$

where $\sigma(\cdot)$ is the sigmoid function. Equation (3) means that if the true label $\tau_m = l$, the probability of $y_{j,i} = l$ is $\sigma(a_j/d_m)$; otherwise, it is $1 - \sigma(a_j/d_m)$. It is ideal for us to have closed-form solutions in the M-step. However, we have to numerically solve an optimization problem, e.g., by gradient ascent, to estimate worker $j$'s ability $a_j$ and task $m$'s difficulty $d_m$ in each M-step of the EM algorithm, which requires tuning a step-size parameter.

Whitehill et al. formulated a multiple-label variant of GLAD (mGLAD) in their paper's supplementary material. It is assumed with mGLAD that the probability of worker $j$ assigning label $l$ given true label $\tau_m$ is

$$p(y_{j,i} = l \,|\, a_j, d_m, \tau_m) = \sigma(a_j/d_m)^{\delta(\tau_m = l)} \left( \frac{1 - \sigma(a_j/d_m)}{L-1} \right)^{\delta(\tau_m \neq l)}. \quad (4)$$

Equation (4) means that if the true label $\tau_m = u$, the probability of $y_{j,i} = u$ is $\sigma(a_j/d_m)$, and that of each other label is $(1 - \sigma(a_j/d_m))/(L-1)$. The workers' ability parameters $a_j$ ($j = 1, \cdots, N$) are shared across all items. The problem is that the workers' labeling confusions cannot be modeled, because the off-diagonal confusion is modeled as uniform, i.e., $1/(L-1)$.

2.3. Other Related Work

Various studies have investigated crowdsourcing. The following studies differ from our work in that they do not aim at analyzing confusion matrices, in particular latent confusions, and in most cases their setting is binary labeling.

Classification: Raykar et al. (2010) studied learning a binary classifier while estimating annotator accuracy and the actual true label. Yan et al. modeled annotators' expertise as a function of the item's feature vector (Yan et al., 2010b;a; 2011). Welinder et al. (2010) modeled a binary annotation process by considering a low-dimensional feature vector of each image in an image labeling task. Wauthier et al. (2011) proposed a Bayesian model to account for annotator bias. Liu et al. (2012) connected the aggregation of binary labels in crowdsourcing with belief propagation and a mean field algorithm. These studies are along the lines of the DS and GLAD models, but their setting is binary labeling and they do not deal with the confusion matrix.


Zhou et al. (2012) proposed a minimax entropy principle to improve the quality of noisy labels.

Clustering and Ranking: Gomes et al. (2011) presented "crowdclustering," which clusters items by using a worker's label (same or different) on a pair of items. Yi et al. (2012) combined metric learning and the manual annotations obtained via crowdsourcing for clustering. Raykar and Yu (2011) proposed a score to rank annotators.

Confusion Matrix Modeling: Venanzi et al. (2014) proposed a community-based Bayesian aggregation model, which assumes that each worker belongs to a certain community and that the worker's confusion matrix is similar to the community's confusion matrix. (Venanzi et al., 2014) is the most similar work to ours.¹ The differences between their model and ours are as follows: (1) each of their workers belongs to one community; (2) they model one confusion matrix for each worker; (3) they do not model the abilities of workers and the difficulty of items simultaneously.

¹(Venanzi et al., 2014) was published after this paper was submitted.

3. Proposed framework

We describe the proposed framework in this section. First, we present the normalized gamma construction (NGC) of a confusion matrix. Then, we propose latent confusion analysis (LCA).

3.1. Normalized Gamma Construction of a Confusion Matrix

The confusion matrix of each worker is constructed with probability vectors in the DS model. It is common to assume that a probability vector is distributed according to the Dirichlet distribution. That is, the process that generates the confusion matrix in Eq. (1) for worker $j$ in a Bayesian manner is

$$\pi^{(j)}_u \sim \mathrm{Dir}(\gamma_1, \cdots, \gamma_L), \quad (u = 1, \cdots, L), \quad (5)$$

where $\gamma_l$ ($l = 1, \cdots, L$) are the parameters of the Dirichlet distribution. However, in this formulation, we cannot model the difficulty with which items are correctly annotated. Therefore, we need a novel process for generating probability vectors that introduces this difficulty, as in GLAD.

Here, we consider the following relationship between the Dirichlet distribution and the gamma distribution. If the $g_\ell$ ($\ell = 1, \cdots, L$) are independently distributed according to $\mathrm{Gamma}(\gamma_\ell, 1)$, i.e.,

$$g_\ell \sim \mathrm{Gamma}(\gamma_\ell, 1), \quad (6)$$

then the vector $(g_1/s, \cdots, g_L/s)$, where $s = \sum_{\ell=1}^{L} g_\ell$, follows the Dirichlet distribution with parameters $\gamma_1, \cdots, \gamma_L$, i.e.,

$$s = \sum_{\ell=1}^{L} g_\ell \sim \mathrm{Gamma}\Bigg(\sum_{\ell=1}^{L} \gamma_\ell,\, 1\Bigg), \quad (7)$$

$$(\pi_1, \cdots, \pi_L) = \left(\frac{g_1}{s}, \cdots, \frac{g_L}{s}\right) \sim \mathrm{Dir}(\gamma_1, \cdots, \gamma_L). \quad (8)$$
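As a quick numerical sanity check of Eqs. (6)-(8) (our own sketch, with illustrative parameter values), normalized gamma draws and direct Dirichlet draws have matching statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_params = np.array([2.0, 0.5, 1.0])      # (gamma_1, ..., gamma_L), illustrative

# Eqs. (6)-(8): normalize independent Gamma(gamma_l, 1) draws.
g = rng.gamma(shape=gamma_params, scale=1.0, size=(100_000, 3))
pi_from_gamma = g / g.sum(axis=1, keepdims=True)

# Direct Dirichlet sampling with the same parameters.
pi_direct = rng.dirichlet(gamma_params, size=100_000)

# Both means approach gamma_params / gamma_params.sum() = (0.571, 0.143, 0.286).
print(pi_from_gamma.mean(axis=0))
print(pi_direct.mean(axis=0))
```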

This reformulation inspired us to construct each worker's confusion matrix $\pi^{(j)}$ from random variables distributed according to the gamma distribution, which provides a flexible framework for modeling the annotation processes of workers in the next section and an efficient inference algorithm based on VB inference.

The DS model's generation process in Eq. (1), and the generation process of $\pi^{(j)}_u$ with the Dirichlet distribution in Eq. (5), can be reformulated as follows. Let $c_{j,u,l}$ be a confusion variable for worker $j$ assigning label $l$ to an item that has true label $u$, and

$$c_{j,u,l} \sim \mathrm{Gamma}(\gamma_c, 1) \quad (u, l = 1, \cdots, L). \quad (9)$$

Given true label $u$, we have

$$y_{j,i} = \begin{cases} u & \text{with probability } \pi^{(j)}_{u,u} = \dfrac{c_{j,u,u}}{\sum_{v=1}^{L} c_{j,u,v}} \\[2ex] l\,(\neq u) & \text{with probability } \pi^{(j)}_{u,l} = \dfrac{c_{j,u,l}}{\sum_{v=1}^{L} c_{j,u,v}} \end{cases} \quad (10)$$

This idea enables us to easily introduce the ability of humans, a confusion matrix, and the difficulty with which items are correctly annotated into modeling human annotation and judgment processes, and to model the concept of latent confusion analysis (LCA), which is described in the next section.

3.2. Latent Confusion Analysis

It is assumed in the DS model that a confusion matrix is formulated for each worker. In this section, we consider confusion matrices that are shared between workers, which we call "latent confusions" in tribute to the "latent topics" of latent Dirichlet allocation (Blei et al., 2003). Figure 2 outlines the graphical model of the proposed model.

Worker $j$ has a latent variable for the $i$-th item to be annotated, denoted by $z_{j,i}$; $z_{j,i} = k$ indicates that worker $j$ is affected by the $k$-th latent confusion matrix when annotating the $i$-th item. Let $K$ be the number of latent confusions, which are given, for $k = 1, \cdots, K$, by

$$c_{k,u,l} \sim \mathrm{Gamma}(\gamma_c, 1) \quad (u, l = 1, \cdots, L). \quad (11)$$

Moreover, we introduce the ability of each worker $j$, denoted by $a_j$, and the difficulty with which item $m$ can be correctly annotated, denoted by $d_m$, as

$$a_j \sim \mathrm{Gamma}(\gamma_a, 1) \quad (j = 1, \cdots, N), \quad (12)$$

$$d_m \sim \mathrm{Gamma}(\gamma_d, 1) \quad (m = 1, \cdots, M). \quad (13)$$

Therefore, worker $j$ assigns label $y_{j,i}$ to item $x_{j,i} = m$ when $z_{j,i} = k$ and the true label is $u \in \{1, \cdots, L\}$ as

$$y_{j,i} = \begin{cases} u & \text{with probability } \pi^{(j,m,k)}_{u,u} = \dfrac{a_j c_{k,u,u}}{a_j c_{k,u,u} + d_m \sum_{v=1}^{L} \delta(v \neq u)\, c_{k,u,v}} \\[2ex] l\,(\neq u) & \text{with probability } \pi^{(j,m,k)}_{u,l} = \dfrac{d_m c_{k,u,l}}{a_j c_{k,u,u} + d_m \sum_{v=1}^{L} \delta(v \neq u)\, c_{k,u,v}} \end{cases}$$

Large $a_j$ and $c_{k,u,u}$ mean that worker $j$ tends to correctly label items when the true label is $u$. A large $c_{k,u,l}$ means that a worker tends to assign label $l$ when the true label is $u$. A large $d_m$ means that item $m$ is so difficult to annotate correctly that even the most expert worker will usually label it incorrectly according to its confusion matrix $c_k$.
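The following sketch (our notation; `ngc_confusion_row` is not from the paper) computes one row of $\pi^{(j,m,k)}$ from an ability $a_j$, a difficulty $d_m$, and a latent confusion matrix $c_k$, and illustrates how difficulty shifts mass away from the diagonal:

```python
import numpy as np

def ngc_confusion_row(a_j, d_m, c_k, u):
    """Row u of the NGC confusion matrix pi^(j,m,k) for worker ability a_j,
    item difficulty d_m, and latent confusion matrix c_k (L x L positives)."""
    L = c_k.shape[0]
    off = np.ones(L, dtype=bool)
    off[u] = False
    denom = a_j * c_k[u, u] + d_m * c_k[u, off].sum()
    row = np.empty(L)
    row[u] = a_j * c_k[u, u] / denom          # probability of the correct label
    row[off] = d_m * c_k[u, off] / denom      # probabilities of the other labels
    return row                                 # sums to one

# Example: high ability dominates an easy item; difficulty amplifies confusion.
c = np.ones((3, 3))
print(ngc_confusion_row(a_j=5.0, d_m=0.2, c_k=c, u=0))  # ~[0.926, 0.037, 0.037]
print(ngc_confusion_row(a_j=5.0, d_m=5.0, c_k=c, u=0))  # ~[0.333, 0.333, 0.333]
```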

The remaining problem is how to model the latent variables $\mathbf{z}$. We use latent Dirichlet modeling for the latent variables. That is, for each worker $j$, $\theta_j \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is the $K$-dimensional parameter vector of the Dirichlet distribution, and for the $i$-th item to be annotated, $z_{j,i} \sim \mathrm{Multi}(\theta_j)$.

Moreover, we model the probability distribution over item $x_{j,i}$ by $\phi_{k,m}$, which indicates the probability that the item a worker annotates is $m$ when the worker is affected by the $k$-th confusion matrix, i.e., $\sum_{m=1}^{M} \phi_{k,m} = 1$, as follows: when $z_{j,i} = k$, $x_{j,i} \sim \mathrm{Multi}(\phi_k)$. This generation process is easy to understand when $x_{j,i}$ is regarded as a word in latent Dirichlet allocation.

The reason we model the process that generates items is that we want to analyze the relationship between items and latent confusions. It is useful for gaining insights into what types of items are affected by the $k$-th latent confusion in the annotation process. When $\phi_{k,m}$ takes a large value, we find that the annotation of item $m$ is greatly affected by the $k$-th latent confusion. We assume that $\phi_k \sim \mathrm{Dir}(\beta)$ ($k = 1, \cdots, K$), where $\mathrm{Dir}(\beta)$ is a symmetric Dirichlet prior with scalar parameter $\beta$.
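Putting the pieces together, here is a sketch of the full LCA generative process under assumed toy sizes (all names and sizes are ours; the true-label prior is taken as uniform for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L, K = 5, 20, 3, 2                       # toy sizes, not from the paper
gamma_a = gamma_c = gamma_d = 1.0
alpha, beta = np.full(K, 0.5), 0.1

# Global draws: latent confusions, abilities, difficulties, item distributions.
c = rng.gamma(gamma_c, 1.0, size=(K, L, L))    # Eq. (11)
a = rng.gamma(gamma_a, 1.0, size=N)            # Eq. (12)
d = rng.gamma(gamma_d, 1.0, size=M)            # Eq. (13)
phi = rng.dirichlet(np.full(M, beta), size=K)  # item distribution per confusion
tau = rng.integers(L, size=M)                  # true labels (uniform prior here)

def label(j, m, k, u):
    """Sample y_{j,i} from pi^(j,m,k) given true label u."""
    w = d[m] * c[k, u].copy()
    w[u] = a[j] * c[k, u, u]                   # ability acts on the diagonal entry
    return rng.choice(L, p=w / w.sum())

data = []
for j in range(N):
    theta_j = rng.dirichlet(alpha)             # worker j's mixture over confusions
    for i in range(10):                        # worker j annotates 10 items
        k = rng.choice(K, p=theta_j)           # z_{j,i}
        m = rng.choice(M, p=phi[k])            # x_{j,i}
        data.append((j, m, label(j, m, k, tau[m])))
```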

4. Variational Bayes Inference for LCA

We provide the variational Bayes (VB) inference for the proposed model. Due to the normalization term of NGC, we have to devise a way of optimizing the variational lower bound. We present the key idea used to derive the VB inference for the proposed model; this idea promises to enable the NGC framework to be applied in many applications.

Figure 2. Graphical model of LCA

For simplicity, we consider the simple form $\frac{g_\ell}{\sum_\ell g_\ell}$, where $g_\ell$ is distributed according to the gamma distribution. We typically need expectation calculations in the VB inference, i.e., $\mathbb{E}[\log \frac{g_\ell}{\sum_\ell g_\ell}] = \mathbb{E}[\log g_\ell] - \mathbb{E}[\log \sum_\ell g_\ell]$. The problem is that we cannot analytically calculate the expectation of the log of the normalization term, $\mathbb{E}[\log \sum_\ell g_\ell]$.

Here, we return to the definition of the gamma distribution. The gamma density

$$p(\xi; a, b) = \frac{b^a}{\Gamma(a)} \xi^{a-1} e^{-b\xi} \quad (14)$$

indicates that

$$1 = \int p(\xi; a, b)\, d\xi = \int \frac{b^a}{\Gamma(a)} \xi^{a-1} e^{-b\xi}\, d\xi, \qquad b^{-a} = \int \frac{1}{\Gamma(a)} \xi^{a-1} e^{-b\xi}\, d\xi. \quad (15)$$

When we set $b = \sum_\ell g_\ell$ and $a = 1$, we have

$$\frac{1}{\sum_\ell g_\ell} = \int e^{-(\sum_\ell g_\ell)\xi}\, d\xi. \quad (16)$$

Therefore,

$$\log \frac{1}{\sum_\ell g_\ell} = \log \int e^{-(\sum_\ell g_\ell)\xi}\, d\xi. \quad (17)$$

By introducing a probability distribution $q(\xi)$ and applying Jensen's inequality, we have

$$\log \frac{1}{\sum_\ell g_\ell} = \log \int q(\xi)\, \frac{e^{-(\sum_\ell g_\ell)\xi}}{q(\xi)}\, d\xi \geq \int q(\xi) \log \frac{e^{-(\sum_\ell g_\ell)\xi}}{q(\xi)}\, d\xi. \quad (18)$$
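A small numerical check (ours, assuming $q(\xi)$ is an exponential distribution with rate $b$) shows that the Jensen bound of Eq. (18) is always below $\log(1/\sum_\ell g_\ell)$ and becomes tight when the rate matches $\sum_\ell g_\ell$:

```python
import math

s = 3.0                      # stands for sum_l g_l
true_value = math.log(1.0 / s)

def bound(b):
    # q(xi) = Exponential(rate=b): E[xi] = 1/b, entropy = 1 - log(b),
    # so the Jensen bound of Eq. (18) is -s * E[xi] + H[q].
    return -s / b + 1.0 - math.log(b)

for b in [1.0, 2.0, 3.0, 5.0]:
    print(b, bound(b), "<=", true_value)   # equality exactly at b = s
```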

The expectation of this lower bound has an analytic solution; thus, this lower bound and the estimation of $q(\xi)$ enable us to obtain closed-form solutions for the VB inference of NGC and LCA.

Table 1. Basic information on the datasets. $N$ denotes the number of workers, $M$ the number of item types, and $L$ the number of label types. $\mathbb{E}[n_{\mathrm{worker}}] = \sum_{j=1}^{N} n_j / N$ denotes the average number of items that a worker annotated. $N_{\mathrm{item}}$ denotes the number of workers who annotated an item.

Dataset        N    M    L   E[n_worker]   N_item
Safety-level   62   97   5   78.2          50
Preposition    47   100  5   63.8          30
Bluebird       39   108  2   108           39

We estimate the variational posteriors $q(\boldsymbol{\tau})$, $q(\mathbf{a})$, $q(\mathbf{c})$, $q(\mathbf{d})$, $q(\mathbf{z})$, $q(\boldsymbol{\theta})$, $q(\boldsymbol{\phi})$, and $q(\xi)$, and use point estimation for $\gamma_a$, $\gamma_c$, $\gamma_d$, $\boldsymbol{\mu}$, $\alpha$, and $\beta$ because we do not want to tune the hyper-parameters for each task. The update equations are described in Appendix A. The estimate of $\boldsymbol{\mu}$ is the same as that in the Dawid and Skene model (Dawid & Skene, 1979).

4.1. Computational Complexity

Let $T$ be the total number of labeled items. The computational costs per iteration of the DS model and LCA are $O(NL^2 + TL + ML)$ and $O(TKL^2 + ML + MK)$, respectively. At first glance, LCA appears less scalable, because it reveals a much more informative latent structure than existing models. However, the scalability of LCA is the same as that of LDA, and our learning algorithm is deterministic. Therefore, we can easily apply recent advances in scaling up LDA to LCA, such as those in the literature (Hoffman et al., 2010; Zhai et al., 2012).

5. Experiments

We empirically analyzed the proposed model in this section.

Since our problem setting was unsupervised, i.e., the true labels and confusion matrices were not available, it was difficult to evaluate the models. Therefore, we used datasets in which the correct answers (labels or scores) were known. Here, we call a "gold label" a correct label that is actually known in the datasets. We only use a gold label to evaluate an estimated label that has the maximum probability under $q(\tau_m)$ for each model, i.e.,

$$\tau^*_m = \operatorname*{argmax}_{\tau_m}\, q(\tau_m). \quad (19)$$

MV indicates majority voting, DS indicates the Dawid and Skene model, and GLAD/mGLAD is the multi-label variant of GLAD described in Sec. 2. LCA is our model described in Sec. 3.2.

5.1. Datasets and Evaluation Metrics

We applied the models to three datasets: (1) safety-level data, (2) preposition data, and (3) bluebird data. We created datasets (1) and (2) by using crowdsourcing and published them.² Table 1 summarizes the basic information on the datasets.

Safety-Level Data: Analyzing the safety-level data is the main purpose of this study. We analyzed public confusion regarding the effects of radioactivity on food products following the Fukushima Daiichi nuclear disaster. We used Japanese crowdsourcing to prepare this dataset. We asked crowd workers to judge the safety level of an item by using two pieces of information on each item: "the name of the fish or shellfish" and "where its fishing grounds are in Japan," as outlined in Fig. 1. The number of labels for annotation was L = 5, in which "safety level 1" meant "It's dangerous. I will not eat this food" and "safety level 5" meant "It's safe. I will eat this food." We used items described in a safety manual on the effects of radioactivity on food products published in 2012.³ The number of items was M = 97. The items in the safety manual had a safety level from 1 (dangerous) to 100 (safe), which was calculated by using radioactivity measurements and expert knowledge. We used this information as a gold safety level to evaluate the models. Each item was annotated by 50 workers. The evaluation metric was the correlation coefficient between the gold safety level and the estimated safety level with the maximum probability under $q(\tau_m)$.

We used the other datasets to compare our model with the other models in several settings.

Preposition Data: The use of prepositions in English is often a headache for non-native English speakers. We analyzed public confusion in the use of prepositions. We collected 100 sentences as fill-in-the-blank questions from the Special English programs of Voice of America (VOA),⁴ where M = 100. We asked crowd workers to select a preposition by choosing from the labels "on," "at," "in," "for," and "to," which are prepositions that cause confusion for Japanese people, i.e., L = 5. The number of workers was N = 47, and each item was annotated by 30 workers. The evaluation metric was accuracy, measured by using a gold label and an estimated answer that had the maximum probability under $q(\tau_m)$, i.e., accuracy = (number of correct answers) / M.

Bluebird Data: We used a dataset called "bluebird," published by Welinder et al. (2010).

²http://www.r.dl.itc.u-tokyo.ac.jp/~sato/icml2014/
³"Complete manual on the effects of radioactivity on food products" (in Japanese), ISBN-10: 4796696857.
⁴http://learningenglish.voanews.com/


Table 2. Empirical results. Larger values indicate better performance. If there is an equality of votes in majority voting (MV), we select a label at random; we tried five random seeds in MV. Note that the other models determine one label even if there is an equality of votes, because we use the maximum of $q(\tau_m)$ for labeling items. $K^+$ denotes the effective number of latent confusions, estimated via the implicit sparsity of the VB inference. Note that we set $K = N$ in these experiments.

Dataset       Evaluation metric        MV (five random seeds)              LCA (K+/K)     DS     GLAD/mGLAD
Safety-level  Correlation coefficient  0.525, 0.525, 0.508, 0.510, 0.528   0.571 (20/62)  0.505  0.472
Preposition   Accuracy                 0.709, 0.719, 0.700, 0.710, 0.710   0.770 (7/47)   0.739  0.750
Bluebird      Accuracy                 0.759 (no equality of votes)        0.898 (6/39)   0.898  0.722

Figure 3. Top ten frequent latent confusions behind the safety-level data (panels: Top 1 through Top 10 confusion). The size of the gray squares indicates the magnitude of the confusion probabilities. For simple visualization, a row with no gray squares means that the confusion probability is uniform (see Sec. 5.5).

This dataset included M = 108 items and N = 39 workers on a fully connected bipartite assignment graph, where the workers were asked whether the presented images contained an Indigo Bunting or a Blue Grosbeak, i.e., L = 2. Each item was annotated by 39 workers. The evaluation metric was accuracy.

5.2. Initialization

The results obtained from the DS and GLAD models depend only on the initialization of $q(\tau_m)$. We initialized $q(\tau_m)$ with an empirical distribution obtained by worker voting, following Dawid and Skene (1979), who observed that this initialization was more effective than random initialization. We also found in pilot experiments that voting initialization was more effective than random initialization for the DS and GLAD models. Because we used voting initialization for $q(\tau_m)$ rather than random initialization, we only had to run each experiment once.

The results from LCA depend on the initialization of $q(z_{j,i})$ (or $q(c_{k,u,v})$) as well as $q(\tau_m)$ and the number of latent confusions $K$. Therefore, it was ideal for LCA not to use randomization in the initialization. We devised the following strategy to initialize LCA; note that we actually considered initializing $q(c_{k,u,v})$ instead of $q(z_{j,i})$.

1. We set $K = N$ and $q(z_{j,i} = j) = 1$ (0 otherwise), which means that each worker has a personal confusion matrix, as in the DS model. We initialized $q(\tau_m)$ with an empirical distribution by voting, as in the DS model.

2. We ran the VB inference with $q(z_{j,i} = j) = 1$ fixed, which means that we did not use the latent Dirichlet enhanced modeling. The results in this step depend only on the initialization of $q(\tau_m)$, as in the DS model.

3. We reset $q(z_{j,i} = k) = 1/K$ and initialized $q(\theta_j)$ and $q(\phi_k)$ with their prior distributions.


Figure 4. Expected frequency of latent confusions in the safety-level data.

After these three steps, we ran the VB inference for LCA. This initialization scheme depends only on $q(\tau_m)$, as with the DS and GLAD models.
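A minimal sketch of this three-step initialization (our array layout; step 2, the pre-estimation run of VB, is elided):

```python
import numpy as np

def init_lca(vote_counts, N, n_items_per_worker):
    """Sketch of the three-step LCA initialization.

    vote_counts: (M, L) array of label votes per item, giving q(tau) by voting."""
    q_tau = vote_counts / vote_counts.sum(axis=1, keepdims=True)

    # Step 1: K = N and q(z_{j,i} = j) = 1, i.e., personal confusion matrices.
    K = N
    q_z = np.zeros((N, n_items_per_worker, K))
    q_z[np.arange(N), :, np.arange(N)] = 1.0

    # Step 2 (elided): run VB with q(z) fixed to pre-estimate q(c) and E_q[a_j].

    # Step 3: reset q(z_{j,i} = k) = 1/K; q(theta), q(phi) start at their priors.
    q_z[:] = 1.0 / K
    return q_tau, q_z
```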

When we set $K < N$, we select the workers' personal confusions in descending order of their expected ability $\mathbb{E}_q[a_j]$, which is pre-estimated in initialization step 2.

We initialized $\gamma_a = \gamma_c = \gamma_d = 1$.

5.3. Empirical results for the evaluation metrics

Table 2 summarizes the experimental results. Our model, LCA, outperformed or was competitive with the other models in terms of each evaluation metric on each dataset. The results reveal that using a family of confusion matrices helps to recover the ground truth. We think one of the reasons is that our approach is intuitively a kind of multi-task learning: we can complement the latent judgment tendencies of workers who assessed a small number of items by sharing confusion matrices among workers. Moreover, when a dataset has genres or categories, e.g., in rating movies, a worker has biased knowledge, which can be modeled by a family of confusion matrices. In this case, the confusion matrix of a worker cannot be represented as one confusion matrix but rather as a combination of various latent confusion matrices. For example, there can be bias in a worker's knowledge due to where they live in the safety-level data.

5.4. Effective number of latent confusions

It is well known that VB inference induces implicit sparsity, which is called a zero-forcing effect (Minka, 2005). This effect enables us to estimate the effective number of latent confusions. This property can be seen in the update equation of $q(z_{j,i})$.

Let $K^+$ be the effective number of latent confusions, i.e., the number of principal confusions. We calculated $K^+$ as the number of latent confusions whose expected frequency $\mathbb{E}[n_k] = \sum_{j,i} q(z_{j,i} = k)$ was greater than 0.5 in Table 2. The performance of LCA and the DS model was the same on the bluebird data; however, LCA used a smaller number of confusion matrices than the DS model did.
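Computing $K^+$ from the responsibilities is a one-liner; a sketch with our array layout:

```python
import numpy as np

def effective_num_confusions(q_z, threshold=0.5):
    """K+ = number of latent confusions whose expected frequency
    E[n_k] = sum_{j,i} q(z_{j,i} = k) exceeds the threshold.

    q_z: (total_annotations, K) responsibilities, one row per (j, i)."""
    expected_freq = q_z.sum(axis=0)
    return int(np.sum(expected_freq > threshold))
```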

Figure 4 plots the frequency of latent confusions before and after the VB inference of LCA on the safety-level data. "INIT" indicates the frequency in the initialization step, i.e., each frequency indicates the number of items that each worker annotated (see Sec. 5.2). "LCA" means that the frequency was given after the VB inference, where the frequency is the expected frequency, i.e., $\mathbb{E}[n_k]$. The latent confusions are sorted in descending order of the "LCA" frequency.

5.5. Visualizing Top 10 latent confusions

We analyzed the top 10 latent confusions by frequency extracted by LCA in Fig. 3, which revealed "pessimistic" and "optimistic" confusions in the safety-level data.

We normalized the $k$-th confusion matrix $c_k$ to make each row a confusion probability, i.e., $\tilde{c}_{k,u,\ell} = \mathbb{E}[c_{k,u,\ell}] / \sum_{\ell=1}^{L} \mathbb{E}[c_{k,u,\ell}]$. The element in the $u$-th row and $\ell$-th column, $\tilde{c}_{k,u,\ell}$, expresses the probability that if an item has true label $u$, label $\ell$ will be annotated. The size of the gray squares indicates the magnitude of these elements. We subtracted $\min_\ell \tilde{c}_{k,u,\ell}$ from each row to enable simple visualization. Therefore, a row with no gray squares means that the confusion probability is uniform.

If there are gray squares below the dashed line, the confusion indicates that the safety level has been overestimated, which means "optimistic." The top 1, 2, 7, 8, and 9 confusions seem to be pessimistic, and the others seem to be optimistic. The top 1 confusion was less biased than the other confusions. The top 4 confusion was much more optimistic, because safety levels 2, 3, and 4 could be confused with safety level 5. The top 7 confusion was, in comparison, much more pessimistic. The frequency of each confusion is plotted in Fig. 4.

5.6. Analyzing item and worker

Here, we analyze the first listed item in the safety manual, the "Greenling," whose item ID is 1 in the safety-level dataset. Figure 5 explains how we analyzed the item based on the estimated variational posterior $q(d_1)$, the prior $p(d|\gamma_d)$, and the proportion of related latent confusions.


Figure 5. Analysis of “Greenling” item

Figure 6. Analysis of workers with highest and lowest abilities

Since we used the point estimation of the prior parameter $\gamma_d$ via Eq. (42), the prior $p(d|\gamma_d)$ serves as a base distribution for analyzing the difficulty posterior $q(d_1)$. Figure 5 indicates that it was more difficult to evaluate the safety level of this item than the baseline. We investigated the ranking of this item in terms of the average value of $d_m$, i.e., $\mathbb{E}[d_m]$; the difficulty rank of this item was 6/97.

The pie chart indicates the proportion of related latent confusions, and the numbers in the figure indicate the rankings of the frequencies in Fig. 4. This proportion was calculated by using $\mathbb{E}[n_{k,m}] = \sum_{j=1}^{N} \sum_{i=1}^{n_j} \delta(x_{j,i} = m)\, q(z_{j,i} = k)$, which is the expected number of times workers were affected by the $k$-th latent confusion when annotating item $m$. A large part of the pie chart is the top 2 confusion, which is a "pessimistic" latent confusion as shown in Fig. 3. Thus, the item "Greenling" suffered from "pessimistic" confusion. As with the "Sockeye salmon" item in Fig. 1, the main fishing ground of the "Greenling" item is around Hokkaido, which is far from Fukushima. However, the "Greenling" item has a large fishing ground, which seems to be a reason for the "pessimistic" confusion.

Our model can also analyze the confusion matrix of each worker by using a combination of latent confusions. We analyzed the two workers with the highest and lowest abilities (see Fig. 6) in terms of their posteriors $q(a_1)$ (highest) and $q(a_{62})$ (lowest), and the prior $p(a|\gamma_a)$, where $\gamma_a$ was estimated with Eq. (40). The pie charts indicate the proportion of latent confusions for each worker, calculated with $\mathbb{E}[\theta]$. The indices of the latent confusions are the same as those in Fig. 3.

6. Conclusion

We proposed modeling the annotation and judgment processes of humans by using the normalized gamma construction (NGC) of a confusion matrix. The NGC framework flexibly enables various properties of data to be modeled, and it provides an efficient learning algorithm based on VB inference and a fixed point iteration algorithm to estimate the prior parameters. Therefore, it has a wide range of applications beyond those described in this paper. We also provided the concept of "latent confusion analysis (LCA)," which is used to analyze the principal confusions behind human annotations and judgments. We modeled LCA by using NGC and latent Dirichlet modeling. LCA seems increasingly important because there is an information overload in real life and people are easily taken in by it. We believe that this work represents a first step toward further work on LCA. Interesting directions for research include a time-series analysis of latent confusions and more sophisticated modeling techniques from topic modeling.

A. Update equations

The likelihood of workers’ annotation data is

$$p(\mathbf{y}|\mathbf{x}, \mathbf{z}, \boldsymbol{\tau}, \mathbf{a}, \mathbf{c}, \mathbf{d}) = \prod_{j=1}^{N} \prod_{i=1}^{n_j} \pi^{(j, x_{j,i}, z_{j,i})}_{\tau_{x_{j,i}}, y_{j,i}}. \quad (20)$$

We estimate a factorized variational posterior over $\boldsymbol{\tau}$, $\mathbf{a}$, $\mathbf{c}$, $\mathbf{d}$, $\mathbf{z}$, $\boldsymbol{\theta}$, and $\boldsymbol{\phi}$, i.e.,

$$q(\boldsymbol{\tau}, \mathbf{a}, \mathbf{c}, \mathbf{d}, \mathbf{z}, \boldsymbol{\theta}, \boldsymbol{\phi}) = \left[\prod_{m=1}^{M} q(d_m)\, q(\tau_m)\right] \left[\prod_{k=1}^{K} q(\phi_k) \prod_{u=1}^{L} \prod_{l=1}^{L} q(c_{k,u,l})\right] \prod_{j=1}^{N} \left[ q(a_j)\, q(\theta_j) \prod_{i=1}^{n_j} q(z_{j,i}) \right]. \quad (21)$$


Thus, the variational lower bound is obtained by

$$\begin{aligned} \mathcal{L}[q(\boldsymbol{\tau}, \mathbf{a}, \mathbf{c}, \mathbf{d}, \mathbf{z}, \boldsymbol{\theta}, \boldsymbol{\phi})] = {} & \mathbb{E}_q[\log p(\mathbf{y}|\mathbf{x}, \mathbf{z}, \boldsymbol{\tau}, \mathbf{a}, \mathbf{c}, \mathbf{d})\, p(\mathbf{x}|\mathbf{z}, \boldsymbol{\phi})\, p(\mathbf{z}|\boldsymbol{\theta})] \\ & - \mathrm{KL}[q(\mathbf{a})\|p(\mathbf{a}|\gamma_a)] - \mathrm{KL}[q(\mathbf{c})\|p(\mathbf{c}|\gamma_c)] - \mathrm{KL}[q(\mathbf{d})\|p(\mathbf{d}|\gamma_d)] \\ & - \mathrm{KL}[q(\boldsymbol{\tau})\|p(\boldsymbol{\tau}|\boldsymbol{\mu})] - \mathrm{KL}[q(\boldsymbol{\theta})\|p(\boldsymbol{\theta}|\alpha)] - \mathrm{KL}[q(\boldsymbol{\phi})\|p(\boldsymbol{\phi}|\beta)], \end{aligned} \quad (22)$$

where the bold notation in the KL divergence means the sum over the corresponding subscripts, e.g., $\mathrm{KL}[q(\mathbf{a})\|p(\mathbf{a}|\gamma_a)] = \sum_{j=1}^{N} \mathrm{KL}[q(a_j)\|p(a_j|\gamma_a)]$.

The problem is that this variational lower bound includes the following intractable expectation:

$$\mathbb{E}_q\Bigg[\log \frac{1}{a_j c_{z_{j,i}, \tau_{x_{j,i}}, \tau_{x_{j,i}}} + d_{x_{j,i}} \sum_{l(\neq \tau_{x_{j,i}})} c_{z_{j,i}, \tau_{x_{j,i}}, l}}\Bigg]. \quad (23)$$

Here, let $g_{j,i} = a_j c_{z_{j,i}, \tau_{x_{j,i}}, \tau_{x_{j,i}}} + d_{x_{j,i}} \sum_{l(\neq \tau_{x_{j,i}})} c_{z_{j,i}, \tau_{x_{j,i}}, l}$, and we have

$$\frac{1}{g_{j,i}} = \int e^{-g_{j,i} \xi_{j,i}}\, d\xi_{j,i}, \quad (24)$$

where $\xi_{j,i}$ can be regarded as an auxiliary variable. Therefore, by introducing a variational distribution $q(\xi_{j,i})$ over $\xi_{j,i}$, we have

$$\mathbb{E}_q\left[\log \frac{1}{g_{j,i}}\right] = \mathbb{E}_q\left[\log \int e^{-g_{j,i}\xi_{j,i}}\, d\xi_{j,i}\right] \geq \mathbb{E}_q\left[\int q(\xi_{j,i}) \log \frac{e^{-g_{j,i}\xi_{j,i}}}{q(\xi_{j,i})}\, d\xi_{j,i}\right] = -\,\mathbb{E}_q[g_{j,i}]\,\mathbb{E}_q[\xi_{j,i}] - \int q(\xi_{j,i}) \log q(\xi_{j,i})\, d\xi_{j,i}, \quad (25)$$

where

$$\mathbb{E}_q[g_{j,i}] = \mathbb{E}_q[a_j]\, \mathbb{E}_q[c_{z_{j,i}, \tau_{x_{j,i}}, \tau_{x_{j,i}}}] + \mathbb{E}_q[d_{x_{j,i}}]\, \mathbb{E}_q\Bigg[\sum_{l(\neq \tau_{x_{j,i}})} c_{z_{j,i}, \tau_{x_{j,i}}, l}\Bigg], \quad (26)$$

$$\mathbb{E}_q[c_{z_{j,i}, \tau_{x_{j,i}}, \tau_{x_{j,i}}}] = \sum_{k=1}^{K} \sum_{u=1}^{L} q(z_{j,i} = k)\, q(\tau_{x_{j,i}} = u)\, \mathbb{E}_q[c_{k,u,u}], \quad (27)$$

$$\mathbb{E}_q\Bigg[\sum_{\ell(\neq \tau_{x_{j,i}})} c_{z_{j,i}, \tau_{x_{j,i}}, \ell}\Bigg] = \sum_{k=1}^{K} \sum_{u=1}^{L} \sum_{\ell=1}^{L} q(z_{j,i} = k)\, q(\tau_{x_{j,i}} = u)\, q(\tau_{x_{j,i}} \neq \ell)\, \mathbb{E}_q[c_{k,u,\ell}]. \quad (28)$$

We obtain the following variational posteriors by taking the derivatives of the lower bound with respect to $q(\boldsymbol{\tau})$, $q(\mathbf{a})$, $q(\mathbf{c})$, $q(\mathbf{d})$, $q(\mathbf{z})$, $q(\boldsymbol{\theta})$, $q(\boldsymbol{\phi})$, and $q(\xi)$, and equating them to zero:

$$q(\tau_m = u) \propto \mu_u \exp\Bigg\{\sum_{j=1}^{N}\sum_{i=1}^{n_j} \delta(x_{j,i} = m)\Big[\delta(y_{j,i} = u)\,\mathbb{E}_q[\log a_j c_{z_{j,i},u,u}] + \delta(y_{j,i} \neq u)\,\mathbb{E}_q[\log d_m c_{z_{j,i},u,y_{j,i}}] - \mathbb{E}_q\big[\xi_{j,i} a_j c_{z_{j,i},u,u}\big] - \sum_{\ell(\neq u)} \mathbb{E}_q\big[\xi_{j,i} d_m c_{z_{j,i},u,\ell}\big]\Big]\Bigg\}. \quad (29)$$

$q(a_j) = \mathrm{Gamma}(a_j; \tilde{\gamma}^{(j)}_{a,1}, \tilde{\gamma}^{(j)}_{a,2})$, where

$$\tilde{\gamma}^{(j)}_{a,1} = \gamma_a + \sum_{i=1}^{n_j} q(\tau_{x_{j,i}} = y_{j,i}), \quad (30)$$

$$\tilde{\gamma}^{(j)}_{a,2} = 1 + \sum_{i=1}^{n_j} \mathbb{E}_q\big[\xi_{j,i} c_{z_{j,i}, \tau_{x_{j,i}}, \tau_{x_{j,i}}}\big]. \quad (31)$$

$q(c_{k,u,l}) = \mathrm{Gamma}(c_{k,u,l}; \tilde{\gamma}^{(k,u,l)}_{c,1}, \tilde{\gamma}^{(k,u,l)}_{c,2})$, where

$$\tilde{\gamma}^{(k,u,l)}_{c,1} = \gamma_c + \sum_{j=1}^{N}\sum_{i=1}^{n_j} q(z_{j,i} = k)\, q(\tau_{x_{j,i}} = u)\, \delta(y_{j,i} = l), \quad (32)$$

$$\tilde{\gamma}^{(k,u,u)}_{c,2} = 1 + \sum_{j=1}^{N}\sum_{i=1}^{n_j} \mathbb{E}_q[\xi_{j,i} a_j]\, q(z_{j,i} = k)\, q(\tau_{x_{j,i}} = u), \quad (33)$$

$$\tilde{\gamma}^{(k,u,l(\neq u))}_{c,2} = 1 + \sum_{j=1}^{N}\sum_{i=1}^{n_j} \mathbb{E}_q\big[\xi_{j,i} d_{x_{j,i}}\big]\, q(z_{j,i} = k)\, q(\tau_{x_{j,i}} = u). \quad (34)$$

$q(d_m) = \mathrm{Gamma}(d_m; \tilde{\gamma}^{(m)}_{d,1}, \tilde{\gamma}^{(m)}_{d,2})$, where

$$\tilde{\gamma}^{(m)}_{d,1} = \gamma_d + \sum_{j=1}^{N}\sum_{i=1}^{n_j} (1 - q(\tau_m = y_{j,i}))\, \delta(x_{j,i} = m), \quad (35)$$

$$\tilde{\gamma}^{(m)}_{d,2} = 1 + \sum_{j=1}^{N}\sum_{i=1}^{n_j}\sum_{u=1}^{L} q(\tau_{x_{j,i}} = u) \sum_{l(\neq u)} \delta(x_{j,i} = m)\, \mathbb{E}_q\big[\xi_{j,i} c_{z_{j,i},u,l}\big]. \quad (36)$$

$q(\xi_{j,i}) = \mathrm{Gamma}(\xi_{j,i}; 1, \tilde{\gamma}^{(j,i)}_{\xi})$, where the update for the auxiliary variable $\xi_{j,i}$ is

$$\tilde{\gamma}^{(j,i)}_{\xi} = \mathbb{E}_q\Bigg[a_j c_{z_{j,i}, \tau_{x_{j,i}}, \tau_{x_{j,i}}} + d_{x_{j,i}} \sum_{l(\neq \tau_{x_{j,i}})} c_{z_{j,i}, \tau_{x_{j,i}}, l}\Bigg]. \quad (37)$$


Finally, we have

$$q(z_{j,i} = k) \propto \exp\big\{\mathbb{E}_q[\log \phi_{k,x_{j,i}}] + \mathbb{E}_q[\log \theta_{j,k}]\big\}\, \exp\big\{\mathbb{E}_q[\delta(\tau_{x_{j,i}} = y_{j,i}) \log a_j c_{k,\tau_{x_{j,i}},\tau_{x_{j,i}}}]\big\}\, \exp\big\{\mathbb{E}_q[\delta(\tau_{x_{j,i}} \neq y_{j,i}) \log d_{x_{j,i}} c_{k,\tau_{x_{j,i}},y_{j,i}}]\big\}\, \exp\Bigg\{-\mathbb{E}_q\Bigg[\xi_{j,i}\Big(a_j c_{k,\tau_{x_{j,i}},\tau_{x_{j,i}}} + d_{x_{j,i}} \sum_{l(\neq \tau_{x_{j,i}})} c_{k,\tau_{x_{j,i}},l}\Big)\Bigg]\Bigg\}, \quad (38)$$

and

$$q(\theta_j) \propto \prod_{k=1}^{K} \theta_{j,k}^{\alpha_k - 1 + \sum_{i=1}^{n_j} q(z_{j,i} = k)}, \quad (39)$$

$$q(\phi_k) \propto \prod_{m=1}^{M} \phi_{k,m}^{\beta - 1 + \sum_{j=1}^{N}\sum_{i=1}^{n_j} \delta(x_{j,i} = m)\, q(z_{j,i} = k)}.$$
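For concreteness, a sketch of the closed-form updates in Eq. (39), collecting the Dirichlet parameters of $q(\theta_j)$ and $q(\phi_k)$ from the responsibilities (our array layout, assuming each worker annotates the same number of items):

```python
import numpy as np

def update_theta_phi_params(q_z, x, alpha, beta, M):
    """Dirichlet parameters of q(theta_j) and q(phi_k), Eq. (39).

    q_z:   (N, n, K) responsibilities q(z_{j,i} = k).
    x:     (N, n) int array of item indices x_{j,i}.
    alpha: (K,) Dirichlet parameter; beta: scalar symmetric prior."""
    N, n, K = q_z.shape
    theta_param = alpha[None, :] + q_z.sum(axis=1)      # (N, K)
    phi_param = np.full((K, M), beta)                   # (K, M)
    for j in range(N):
        np.add.at(phi_param.T, x[j], q_z[j])            # scatter-add counts per item
    return theta_param, phi_param
```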

The parameters $\alpha_k$ and $\beta$ can be estimated by using the fixed point iteration algorithms described by Asuncion et al. (2009) and Minka (2000).

We use the exponential of the digamma function, $\exp(\Psi(x))$, when estimating $q(z_{j,i})$ by VB inference, e.g., to calculate $\exp \mathbb{E}[\log \theta_{j,k}] \propto \exp(\Psi(\mathbb{E}[n_{j,k}] + \alpha_k))$, where $\mathbb{E}[n_{j,k}] = \sum_i q(z_{j,i} = k)$. This is a squashing function that makes $q(z_{j,i})$ a sharper distribution; it is similar to simulated annealing with a temperature smaller than one, which makes the Gibbs distribution sharper. When $x$ takes a value smaller than one, $\exp(\Psi(x))$ is strongly non-linear. For example, $\exp\Psi(1)$ is about five times greater than $\exp\Psi(0.5)$, and $\exp\Psi(0.5)$ is about 5,000 times greater than $\exp\Psi(0.1)$. This property and the constraint $\sum_k q(z_{j,i} = k) = 1$ lead to some latent confusions having small probabilities, which makes $\mathbb{E}[n_{j,k}]$ small for those confusions. Moreover, $\alpha_k$ will take a small value for some topics in the fixed point iteration algorithm if $\sum_j \mathbb{E}[n_{j,k}]$ takes a small value. As a result, $\mathbb{E}[n_{j,k}] + \alpha_k$ will also become smaller for some latent confusions, which makes $q(z_{j,i} = k)$ more sharply distributed and gives those latent confusions almost no chance of becoming larger in the future. This is the so-called "rich get richer and the poor get poorer" effect. Therefore, we expect that even if we set $K$ to a large value, we can extract a sparse representation of latent confusions as the principal confusions behind the data.
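The magnitude of this squashing effect is easy to verify numerically (a sketch using SciPy's digamma):

```python
import numpy as np
from scipy.special import digamma

for x in [1.0, 0.5, 0.1]:
    print(x, np.exp(digamma(x)))
# exp(Psi(1.0)) ~ 0.56
# exp(Psi(0.5)) ~ 0.14      (roughly 4-5x smaller)
# exp(Psi(0.1)) ~ 3.0e-5    (roughly 5,000x smaller again)
```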

We use the following fixed point iteration algorithm to update $\gamma_a$, $\gamma_c$, and $\gamma_d$, obtained by taking the derivatives of the related terms of the variational lower bound, $\sum_{j=1}^{N} \mathrm{KL}[q(a_j)\|p(a_j|\gamma_a, 1)]$, $\sum_{k=1}^{K}\sum_{u=1}^{L}\sum_{l=1}^{L} \mathrm{KL}[q(c_{k,u,l})\|p(c_{k,u,l}|\gamma_c, 1)]$, and $\sum_{m=1}^{M} \mathrm{KL}[q(d_m)\|p(d_m|\gamma_d, 1)]$, with respect to $\gamma_a$, $\gamma_c$, and $\gamma_d$, and equating them to zero:

$$\gamma_a = \Psi^{-1}\Bigg(\frac{1}{N}\sum_{j=1}^{N} \mathbb{E}_q[\log a_j]\Bigg), \quad (40)$$

$$\gamma_c = \Psi^{-1}\Bigg(\frac{1}{KL^2}\sum_{k=1}^{K}\sum_{u=1}^{L}\sum_{l=1}^{L} \mathbb{E}_q[\log c_{k,u,l}]\Bigg), \quad (41)$$

$$\gamma_d = \Psi^{-1}\Bigg(\frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_q[\log d_m]\Bigg), \quad (42)$$

where $\Psi^{-1}(x)$ is the inverse digamma function. We use the following approximation for $\Psi^{-1}$. By using the asymptotic formulas for $\Psi(x)$ (Minka, 2000),

$$\Psi(x) \approx \log(x - 1/2) \quad (\text{if } x \geq 0.6), \quad (43)$$

$$\Psi(x) \approx -\frac{1}{x} + \Psi(1) \quad (\text{if } x < 0.6), \quad (44)$$

we have

$$\Psi^{-1}(x) \approx \exp(x) + 1/2 \quad (\text{if } x \geq -2.22), \quad (45)$$

$$\Psi^{-1}(x) \approx -\frac{1}{x - \Psi(1)} \quad (\text{if } x < -2.22). \quad (46)$$
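A sketch of the resulting point estimation, using the approximation of Eqs. (43)-(46) (our function names; `E_log_a` stands for the values $\mathbb{E}_q[\log a_j]$):

```python
import numpy as np
from scipy.special import digamma

def inv_digamma(x):
    """Piecewise approximate inverse of the digamma function, Eqs. (45)-(46)."""
    if x >= -2.22:
        return np.exp(x) + 0.5             # inverts Psi(x) ~ log(x - 1/2)
    return -1.0 / (x - digamma(1.0))       # inverts Psi(x) ~ -1/x + Psi(1)

def update_gamma_a(E_log_a):
    """Fixed-point update of gamma_a, Eq. (40)."""
    return inv_digamma(np.mean(E_log_a))

# Rough round-trip check of the approximation.
for v in [0.3, 1.0, 5.0]:
    print(v, inv_digamma(digamma(v)))
```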

References

Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. On smoothing and inference for topic models. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2009.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28(1):20–28, 1979.

Gomes, Ryan G., Welinder, Peter, Krause, Andreas, and Perona, Pietro. Crowdclustering. In Advances in Neural Information Processing Systems 24, pp. 558–566, 2011.

Hoffman, Matthew D., Blei, David M., and Bach, Francis R. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23, pp. 856–864, 2010.

Liu, Chao and Wang, Yi-Min. TrueLabel + Confusions: A spectrum of probabilistic models in analyzing multiple ratings. In Proceedings of the 29th International Conference on Machine Learning, 2012.


Liu, Qiang, Peng, Jian, and Ihler, Alex. Variational inference for crowdsourcing. In Advances in Neural Information Processing Systems 25, pp. 701–709, 2012.

Minka, Thomas. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

Minka, Thomas P. Estimating a Dirichlet distribution. Technical report, Microsoft, 2000.

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.

Raykar, Vikas C. and Yu, Shipeng. Ranking annotators for crowdsourced labeling tasks. In Advances in Neural Information Processing Systems 24, pp. 1809–1817, 2011.

Smyth, P., Fayyad, U., Burl, M., Perona, P., and Baldi, P. Inferring ground truth from subjective labelling of Venus images. In Advances in Neural Information Processing Systems 7, pp. 1085–1092, 1994.

Snow, Rion, O'Connor, Brendan, Jurafsky, Daniel, and Ng, Andrew Y. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263, 2008.

Venanzi, Matteo, Guiver, John, Kazai, Gabriella, Kohli, Pushmeet, and Shokouhi, Milad. Community-based Bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd International World Wide Web Conference, pp. 155–164, 2014.

Wauthier, Fabian L. and Jordan, Michael I. Bayesian bias mitigation for crowdsourcing. In Advances in Neural Information Processing Systems 24, pp. 1800–1808, 2011.

Welinder, Peter, Branson, Steve, Belongie, Serge, and Perona, Pietro. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23, pp. 2424–2432, 2010.

Whitehill, Jacob, Ruvolo, Paul, Wu, Ting-fan, Bergsma, Jacob, and Movellan, Javier. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22, pp. 2035–2043, 2009.

Yan, Yan, Rosales, Romer, Fung, Glenn, and Dy, Jennifer. Modeling multiple annotator expertise in the semi-supervised learning scenario. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pp. 674–682, 2010a.

Yan, Yan, Rosales, Romer, Fung, Glenn, Schmidt, Mark W., Valadez, Gerardo Hermosillo, Bogoni, Luca, Moy, Linda, and Dy, Jennifer G. Modeling annotator expertise: Learning when everybody knows a bit of something. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 932–939, 2010b.

Yan, Yan, Rosales, Romer, Fung, Glenn, and Dy, Jennifer. Active learning from crowds. In Proceedings of the 28th International Conference on Machine Learning, pp. 1161–1168, 2011.

Yi, Jinfeng, Jin, Rong, Jain, Anil, Jain, Shaili, and Yang, Tianbao. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In Advances in Neural Information Processing Systems 25, pp. 1781–1789, 2012.

Zhai, Ke, Boyd-Graber, Jordan L., Asadi, Nima, and Alkhouja, Mohamad L. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In Proceedings of the 21st International Conference on World Wide Web, pp. 879–888, 2012.

Zhou, Dengyong, Platt, John, Basu, Sumit, and Mao, Yi. Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems 25, pp. 2204–2212, 2012.