Mallet Tutorial

Machine Learning with MALLET http://mallet.cs.umass.edu David Mimno, Information Extraction and Synthesis Laboratory, Department of CS, UMass, Amherst

Transcript of Mallet Tutorial

Page 1: Mallet Tutorial

Machine Learning with MALLET

http://mallet.cs.umass.edu

David Mimno

Information Extraction and Synthesis Laboratory, Department of CS

UMass, Amherst

Page 2: Mallet Tutorial

Outline

• About MALLET

• Representing Data

• Classification

• Sequence Tagging

• Topic Modeling

Page 3: Mallet Tutorial

Outline

• About MALLET

• Representing Data

• Classification

• Sequence Tagging

• Topic Modeling

Page 4: Mallet Tutorial

Who?

• Andrew McCallum (most of the work)

• Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia…

• Fernando Pereira, others at Penn…

Page 5: Mallet Tutorial

Who am I?

• Chief maintainer of MALLET

• Primary author of MALLET topic modeling package

Page 6: Mallet Tutorial

Why?

• Motivation: text classification and information extraction

• Commercial machine learning (Just Research, WhizBang)

• Analysis and indexing of academic publications: Cora, Rexa

Page 7: Mallet Tutorial

What?

• Text focus: data is discrete rather than continuous, even when values could be continuous:

double value = 3.0

Page 8: Mallet Tutorial

How?

• Command line scripts:

– bin/mallet [command] --[option] [value] …

– Text User Interface (“tui”) classes

• Direct Java API

– http://mallet.cs.umass.edu/api

Most of this talk

Page 9: Mallet Tutorial

History

• Version 0.4: c. 2004

– Classes in edu.umass.cs.mallet.base.*

• Version 2.0: c. 2008

– Classes in cc.mallet.*

– Major changes to finite state transducer package

– bin/mallet vs. specialized scripts

– Java 1.5 generics

Page 10: Mallet Tutorial

Learning More

• http://mallet.cs.umass.edu

– “Quick Start” guides, focused on command line processing

– Developers’ guides, with Java examples

• mallet‐[email protected]

– Low volume, but can be bursty

Page 11: Mallet Tutorial

Outline

• About MALLET

• Representing Data

• Classification

• Sequence Tagging

• Topic Modeling

Page 12: Mallet Tutorial

Models for Text Data

• Generative models (Multinomials)

– Naïve Bayes

– Hidden Markov Models (HMMs)

– Latent Dirichlet Topic Models

• Discriminative Regression Models

– MaxEnt/Logistic regression

– Conditional Random Fields (CRFs)

Page 13: Mallet Tutorial

Representations

• Transform text documents to vectors x1, x2, …

• Retain meaning of vector indices

• Ideally sparsely

Call me Ishmael. …

Document

Page 14: Mallet Tutorial

Representations

• Transform text documents to vectors x1, x2, …

• Retain meaning of vector indices

• Ideally sparsely

1.0 0.0 … 0.0 6.0 0.0 … 3.0 …

Call me Ishmael. …

xi

Document

Page 15: Mallet Tutorial

Representations

• Elements of vector are called feature values

• Example: Feature at row 345 is number of times “dog” appears in document

1.0 0.0 … 0.0 6.0 0.0 … 3.0 …

xi

Page 16: Mallet Tutorial

Documents to Vectors

Call me Ishmael.

Document

Page 17: Mallet Tutorial

Documents to Vectors

Call me Ishmael.

Document

Call me Ishmael

Tokens

Page 18: Mallet Tutorial

Documents to Vectors

Call me Ishmael

Tokens

call me ishmael

Tokens

Page 19: Mallet Tutorial

Documents to Vectors

call me ishmael

Tokens

473, 3591, 17

Features

17 ishmael … 473 call … 3591 me

Page 20: Mallet Tutorial

Documents to Vectors

17 1.0, 473 1.0, 3591 1.0

Features (bag)

17 ishmael … 473 call … 3591 me

473, 3591, 17

Features (sequence)

17 ishmael … 473 call … 3591 me
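The document-to-vector steps on these slides can be sketched in plain Java. This is a minimal illustration that does not use MALLET: the class `DocumentToVectorSketch`, its method names, and the letter-based tokenizer are assumptions made for the example, and indices are assigned in order of first appearance rather than matching the 17/473/3591 values on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class DocumentToVectorSketch {
    // Hypothetical stand-in for an alphabet: token text -> feature index.
    private final Map<String, Integer> alphabet = new LinkedHashMap<>();

    // Document -> tokens: split on runs of non-letters, then lowercase.
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String piece : document.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Tokens -> feature sequence: one index per token, order preserved.
    int[] toFeatureSequence(List<String> tokens) {
        int[] sequence = new int[tokens.size()];
        for (int i = 0; i < sequence.length; i++) {
            String token = tokens.get(i);
            Integer index = alphabet.get(token);
            if (index == null) {
                index = alphabet.size();
                alphabet.put(token, index);
            }
            sequence[i] = index;
        }
        return sequence;
    }

    // Feature sequence -> bag: order is lost, duplicates are counted.
    static Map<Integer, Double> toFeatureBag(int[] sequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int feature : sequence) {
            bag.merge(feature, 1.0, Double::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        DocumentToVectorSketch sketch = new DocumentToVectorSketch();
        int[] sequence = sketch.toFeatureSequence(tokenize("Call me Ishmael."));
        System.out.println(Arrays.toString(sequence));
        System.out.println(toFeatureBag(sequence));
    }
}
```

The sequence form keeps token order (useful for sequence tagging and topic models), while the bag form keeps only counts (useful for classification).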

Page 21: Mallet Tutorial

Instances

Email message, web page, sentence, journal abstract…

• Name: What is it called?

• Data: What is the input?

• Target/Label: What is the output?

• Source: What did it originally look like?

Page 22: Mallet Tutorial

Instances

• Name

• Data

• Target

• Source

String

TokenSequence: ArrayList<Token>

FeatureSequence: int[]

FeatureVector: int -> double map

cc.mallet.types

Page 23: Mallet Tutorial

Alphabets

TObjectIntHashMap map
ArrayList entries

int lookupIndex(Object o, boolean shouldAdd)

Object lookupObject(int index)

cc.mallet.types, gnu.trove

17 ishmael … 473 call … 3591 me

Page 24: Mallet Tutorial

Alphabets

TObjectIntHashMap map
ArrayList entries

int lookupIndex(Object o, boolean shouldAdd)

Object lookupObject(int index)

cc.mallet.types, gnu.trove

17 ishmael … 473 call … 3591 me

for

Page 25: Mallet Tutorial

Alphabets

TObjectIntHashMap map
ArrayList entries

cc.mallet.types, gnu.trove

17 ishmael … 473 call … 3591 me

void stopGrowth()

void startGrowth()

Do not add entries for new Objects -- default is to allow growth.
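The two-way mapping and the growth switch described on the Alphabet slides can be sketched as follows. This is a simplified stand-in, not MALLET's implementation: `AlphabetSketch` uses `java.util.HashMap` instead of `TObjectIntHashMap`, and returning -1 for an unknown entry is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>(); // object -> index
    private final List<Object> entries = new ArrayList<>();   // index -> object
    private boolean growthStopped = false;

    // Returns the index of entry, adding it if allowed; -1 if absent and not added.
    public int lookupIndex(Object entry, boolean shouldAdd) {
        Integer index = map.get(entry);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1; // assumption: unknown entries are signalled with -1
        }
        int newIndex = entries.size();
        map.put(entry, newIndex);
        entries.add(entry);
        return newIndex;
    }

    public Object lookupObject(int index) {
        return entries.get(index);
    }

    public void stopGrowth() {
        growthStopped = true;
    }

    public void startGrowth() {
        growthStopped = false;
    }

    public int size() {
        return entries.size();
    }
}
```

Stopping growth is typically what you want before processing test data, so that words unseen in training do not silently extend the feature space.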

Page 26: Mallet Tutorial

Creating Instances

• Instance constructor method

• Iterators

new Instance(data, target, name, source)

Iterator<Instance>
FileIterator(File[], …)
CsvIterator(FileReader, Pattern…)
ArrayIterator(Object[])
…

cc.mallet.pipe.iterator

Page 27: Mallet Tutorial

Creating Instances

• FileIterator

cc.mallet.pipe.iterator

/data/bad/

/data/good/

Label from dir name

Each instance in its own file

Page 28: Mallet Tutorial

Creating Instances

• CsvIterator

cc.mallet.pipe.iterator

Name, label, data from regular expression groups. “CSV” is a lousy name. LineRegexIterator?

Each instance on its own line

1001 Melville Call me Ishmael. Some years ago…
1002 Dickens It was the best of times, it was…

^([^\t]+)\t([^\t]+)\t(.*)
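The regular expression on this slide can be tried directly with `java.util.regex`. The sketch below is plain Java rather than MALLET's `CsvIterator`; the class `LineRegexSketch` and the mapping of groups 1, 2, 3 to name, label, and data are assumptions that follow the order of the example lines, with the fields taken to be tab-separated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineRegexSketch {
    // Three groups: name, label, data, separated by tab characters.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println("name=" + fields[0] + " label=" + fields[1] + " data=" + fields[2]);
    }
}
```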

Page 29: Mallet Tutorial

Instance Pipelines

• Sequential transformations of instance fields (usually Data)

• Pass an ArrayList<Pipe> to SerialPipes

cc.mallet.pipe

// “data” is a String
CharSequence2TokenSequence // tokenize with regexp
TokenSequenceLowercase // modify each token’s text
TokenSequenceRemoveStopwords // drop some tokens
TokenSequence2FeatureSequence // convert token Strings to ints
FeatureSequence2FeatureVector // lose order, count duplicates

Page 30: Mallet Tutorial

Instance Pipelines

• A small number of pipes modify the “target” field

• There are now two alphabets: data and label

cc.mallet.pipe, cc.mallet.types

// “target” is a String
Target2Label // convert String to int
// “target” is now a Label

Alphabet > LabelAlphabet

Page 31: Mallet Tutorial

Label objects

• Weights on a fixed set of classes

• For training data, weight for correct label is 1.0, all others 0.0

cc.mallet.types

implements Labeling

int getBestIndex()
Label getBestLabel()

You cannot create a Label; they are only produced by LabelAlphabet

Page 32: Mallet Tutorial

InstanceLists

• A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet

cc.mallet.types

InstanceList instances = new InstanceList(pipe);

instances.addThruPipe(iterator);

Page 33: Mallet Tutorial

Putting it all together

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

pipeList.add(new Target2Label());
pipeList.add(new CharSequence2TokenSequence());
pipeList.add(new TokenSequence2FeatureSequence());
pipeList.add(new FeatureSequence2FeatureVector());

InstanceList instances = new InstanceList(new SerialPipes(pipeList));

instances.addThruPipe(new FileIterator(. . .));

Page 34: Mallet Tutorial

Persistent Storage

• Most MALLET classes use Java serialization to store models and data

java.io

ObjectOutputStream oos = new ObjectOutputStream(…);
oos.writeObject(instances);
oos.close();

Pipes, data objects, labelings, etc. all need to implement Serializable.

Be sure to include custom classes in classpath, or you get a StreamCorruptedException

Page 35: Mallet Tutorial

Review

• What are the four main fields in an Instance?

Page 36: Mallet Tutorial

Review

• What are the four main fields in an Instance?

• What are two ways to generate Instances?

Page 37: Mallet Tutorial

Review

• What are the four main fields in an Instance?

• What are two ways to generate Instances?

• How do we modify the value of Instance fields?

Page 38: Mallet Tutorial

Review

• What are the four main fields in an Instance?

• What are two ways to generate Instances?

• How do we modify the value of Instance fields?

• Name some classes that appear in the “data” field.

Page 39: Mallet Tutorial

Outline

• About MALLET

• Representing Data

• Classification

• Sequence Tagging

• Topic Modeling

Page 40: Mallet Tutorial

Classifier objects

• Classifiers map from instances to distributions over a fixed set of classes

• MaxEnt, Naïve Bayes, Decision Trees…

cc.mallet.classify

Given data: watery

Which class is best? NN, JJ, PRP, VB, CC

(this one!)

Page 41: Mallet Tutorial

Classifier objects

• Classifiers map from instances to distributions over a fixed set of classes

• MaxEnt, Naïve Bayes, Decision Trees…

cc.mallet.classify

Labeling labeling = classifier.classify(instance);

Label l = labeling.getBestLabel();

System.out.print(instance + "\t");
System.out.println(l);

Page 42: Mallet Tutorial

Training Classifier objects

cc.mallet.classify

ClassifierTrainer trainer = new MaxEntTrainer();

Classifier classifier = trainer.train(instances);

• Each type of classifier has one or more ClassifierTrainer classes

Page 43: Mallet Tutorial

Training Classifier objects

cc.mallet.optimize

log P(Labels | Data) = log f(label1, data1, w) + log f(label2, data2, w) + log f(label3, data3, w) + …

• Some classifiers require numerical optimization of an objective function.

Maximize w.r.t. w!
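The objective on this slide is a sum of per-instance log terms. As a hedged illustration, the sketch below takes f to be the MaxEnt (softmax) probability of the label under linear scores; the slide leaves f unspecified, so that choice, the class `LogLikelihoodSketch`, and the weight layout `w[class][feature]` are assumptions of the example.

```java
public class LogLikelihoodSketch {
    // log f(label, data, w), assuming f is a softmax over linear class scores.
    static double logProb(double[][] w, double[] data, int label) {
        double[] scores = new double[w.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < w.length; k++) {
            for (int n = 0; n < data.length; n++) {
                scores[k] += w[k][n] * data[n];
            }
            max = Math.max(max, scores[k]);
        }
        // Subtract the max before exponentiating for numerical stability.
        double sumExp = 0.0;
        for (double score : scores) {
            sumExp += Math.exp(score - max);
        }
        return scores[label] - max - Math.log(sumExp);
    }

    // log P(Labels | Data): the sum of the per-instance log terms.
    static double logLikelihood(double[][] w, double[][] data, int[] labels) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += logProb(w, data[i], labels[i]);
        }
        return total;
    }
}
```

An optimizer would repeatedly adjust `w` to increase the value returned by `logLikelihood`.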

Page 44: Mallet Tutorial

Parameters w

• Association between feature, class label

• How many parameters for K classes and N features?

action     NN   0.13
action     VB  -0.1
action     JJ  -0.21
SUFF-tion  NN   1.3
SUFF-tion  VB  -2.1
SUFF-tion  JJ  -1.7
SUFF-on    NN   0.01
SUFF-on    VB  -0.02
…

Page 45: Mallet Tutorial

Training Classifier objects

cc.mallet.optimize

interface Optimizer
boolean optimize()

interface Optimizable
interface ByValue
interface ByValueGradient

Limited-memory BFGS, Conjugate gradient…

Specific objective functions

Page 46: Mallet Tutorial

Training Classifier objects

cc.mallet.classify

MaxEntOptimizableByLabelLikelihood
double[] getParameters()
void setParameters(double[] parameters)
…

double getValue()
void getValueGradient(double[] buffer)

Log likelihood and its first derivative

For Optimizable interface

Page 47: Mallet Tutorial

Evaluation of Classifiers

• Create random test/train splits

cc.mallet.types

InstanceList[] instanceLists = instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

90% training

10% testing

0% validation
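What a proportional random split does can be sketched in plain Java. This is not MALLET's `InstanceList.split`; the class `SplitSketch`, the rounding rule, and giving any remainder to the last part are assumptions of the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {
    // Shuffle the items, then cut them into consecutive parts by proportion.
    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);

        double total = 0.0;
        for (double p : proportions) {
            total += p;
        }

        List<List<T>> parts = new ArrayList<>();
        double cumulative = 0.0;
        int start = 0;
        for (int i = 0; i < proportions.length; i++) {
            cumulative += proportions[i];
            int end = (i == proportions.length - 1)
                    ? shuffled.size() // the last part takes whatever remains
                    : (int) Math.round(shuffled.size() * cumulative / total);
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

Passing a seeded `Random` makes the split reproducible, which helps when comparing classifiers on the same partition.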

Page 48: Mallet Tutorial

Evaluation of Classifiers

• The Trial class stores the results of classifications on an InstanceList (testing or training)

cc.mallet.classify

Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(…)
double getRecall(…)
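The quantities named by these methods have standard definitions, sketched below over arrays of gold and predicted label indices. This is not the `Trial` class itself: `TrialSketch` and its signatures are hypothetical, and returning 0.0 when a denominator is zero is an assumption of the sketch.

```java
public class TrialSketch {
    // Fraction of instances whose predicted label equals the gold label.
    static double accuracy(int[] gold, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) {
                correct++;
            }
        }
        return gold.length == 0 ? 0.0 : (double) correct / gold.length;
    }

    // Of the instances predicted as label, the fraction that truly are label.
    static double precision(int[] gold, int[] predicted, int label) {
        int truePositives = 0;
        int predictedCount = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == label) {
                predictedCount++;
                if (gold[i] == label) {
                    truePositives++;
                }
            }
        }
        return predictedCount == 0 ? 0.0 : (double) truePositives / predictedCount;
    }

    // Of the instances that truly are label, the fraction predicted as label.
    static double recall(int[] gold, int[] predicted, int label) {
        int truePositives = 0;
        int goldCount = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == label) {
                goldCount++;
                if (predicted[i] == label) {
                    truePositives++;
                }
            }
        }
        return goldCount == 0 ? 0.0 : (double) truePositives / goldCount;
    }

    // Harmonic mean of precision and recall.
    static double f1(int[] gold, int[] predicted, int label) {
        double p = precision(gold, predicted, label);
        double r = recall(gold, predicted, label);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```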

Page 49: Mallet Tutorial

Review

• I have invented a new classifier: David regression.

– What class should I implement to classify instances?

Page 50: Mallet Tutorial

Review

• I have invented a new classifier: David regression.

– What class should I implement to train a David regression classifier?

Page 51: Mallet Tutorial

Review

• I have invented a new classifier: David regression.

– I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?

Page 52: Mallet Tutorial

Review

• I have invented a new classifier: David regression.

– How would I check whether my new classifier works better than Naïve Bayes?

Page 53: Mallet Tutorial

Outline

• About MALLET

• Representing Data

• Classification

• Sequence Tagging

• Topic Modeling

Page 54: Mallet Tutorial

Sequence Tagging

• Data occurs in sequences

• Categorical labels for each position

• Labels are correlated

DET NN VBS VBG
the dog likes running

Page 55: Mallet Tutorial

Sequence Tagging

• Data occurs in sequences

• Categorical labels for each position

• Labels are correlated

?? ?? ?? ??
the dog likes running

Page 56: Mallet Tutorial

Sequence Tagging

• Classification: n-way

• Sequence Tagging: n^T-way

[Figure: lattice of candidate labels NN, JJ, PRP, VB, CC at each position of “or red dogs on blue trees”]

Page 57: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property

• Dynamic programming

Andrei Markov

Page 58: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property

• Dynamic programming

[Figure: tag sequence DET JJ NN VB, annotated “This one, given this one, is independent of these”]

Andrei Markov

Page 59: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property

• Dynamic programming

[Figure: lattice of candidate labels NN, JJ, PRP, VB, CC over “or red dogs on blue trees”]

Andrei Markov

Page 60: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property

• Dynamic programming

[Figure: lattice of candidate labels NN, JJ, PRP, VB, CC over “red dogs on blue trees”]

Andrei Markov

Page 61: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property

• Dynamic programming

[Figure: lattice of candidate labels NN, JJ, PRP, VB, CC over “dogs on blue trees”]

Andrei Markov
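One common form of the dynamic program these slides allude to is the Viterbi algorithm, which finds the best label sequence in time proportional to T·n² instead of enumerating all n^T sequences. The sketch below is a generic textbook version under a first-order Markov assumption, not MALLET's transducer code; the class `ViterbiSketch` and the three log-score arrays are hypothetical inputs.

```java
public class ViterbiSketch {
    // logInit[k]: score of starting in label k
    // logTrans[j][k]: score of moving from label j to label k
    // logEmit[t][k]: score of label k at position t
    static int[] bestPath(double[] logInit, double[][] logTrans, double[][] logEmit) {
        int length = logEmit.length;
        int numLabels = logInit.length;
        double[][] best = new double[length][numLabels]; // best score ending in k at t
        int[][] back = new int[length][numLabels];       // previous label on that path

        for (int k = 0; k < numLabels; k++) {
            best[0][k] = logInit[k] + logEmit[0][k];
        }
        for (int t = 1; t < length; t++) {
            for (int k = 0; k < numLabels; k++) {
                best[t][k] = Double.NEGATIVE_INFINITY;
                // Markov property: only the previous label matters.
                for (int j = 0; j < numLabels; j++) {
                    double score = best[t - 1][j] + logTrans[j][k];
                    if (score > best[t][k]) {
                        best[t][k] = score;
                        back[t][k] = j;
                    }
                }
                best[t][k] += logEmit[t][k];
            }
        }

        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int k = 1; k < numLabels; k++) {
            if (best[length - 1][k] > best[length - 1][last]) {
                last = k;
            }
        }
        int[] path = new int[length];
        path[length - 1] = last;
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

Setting a transition score to negative infinity forbids that transition, which is one way to express disallowed label pairs.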

Page 62: Mallet Tutorial

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: fully generative

P(Labels | Data) = P(Data, Labels) / P(Data)

• Conditional Random Field: conditional

P(Labels | Data)

Page 63: Mallet Tutorial

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: simple (independent) output space

• Conditional Random Field: arbitrarily complicated outputs

“NSF-funded”

“NSF-funded” CAPITALIZED HYPHENATED ENDS-WITH-ed ENDS-WITH-d …

Page 64: Mallet Tutorial

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: simple (independent) output space

FeatureSequence: int[]

• Conditional Random Field: arbitrarily complicated outputs

FeatureVectorSequence: FeatureVector[]

Page 65: Mallet Tutorial

Importing Data

• SimpleTagger format: one word per line, with instances delimited by a blank line

Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS
…

Page 66: Mallet Tutorial

Importing Data

• SimpleTagger format: one word per line, with instances delimited by a blank line

Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS
…

Page 67: Mallet Tutorial

Importing Data

LineGroupIterator

SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels

TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors

cc.mallet.pipe, cc.mallet.pipe.iterator

Page 68: Mallet Tutorial

Importing Data

LineGroupIterator

SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels

[Pipes that modify tokens]

TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors

cc.mallet.pipe, cc.mallet.pipe.iterator

Page 69: Mallet Tutorial

Importing Data

// Ishmael
TokenTextCharSuffix("C2=", 2)

// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*")) // must match entire string

// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false) // file has one name per line; false: ignore case?

// Ishmael C2=el CAP NAME

cc.mallet.pipe.tsf
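The effect of these three token pipes on a single token can be imitated in plain Java. This sketch does not call the MALLET classes: `TokenFeatureSketch` is hypothetical, the lexicon is passed as an in-memory set instead of a file, and adding the suffix feature only for tokens longer than two characters is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class TokenFeatureSketch {
    // An uppercase letter followed by anything.
    static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}.*");

    static List<String> features(String token, Set<String> lexicon) {
        List<String> out = new ArrayList<>();
        out.add(token);
        // Two-character suffix feature (assumed only for longer tokens).
        if (token.length() > 2) {
            out.add("C2=" + token.substring(token.length() - 2));
        }
        // matches() requires the pattern to cover the entire token.
        if (CAPITALIZED.matcher(token).matches()) {
            out.add("CAP");
        }
        // Case-sensitive lexicon lookup.
        if (lexicon.contains(token)) {
            out.add("NAME");
        }
        return out;
    }
}
```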

Page 70: Mallet Tutorial

Sliding window features

a red dog on a blue tree

Page 71: Mallet Tutorial

Sliding window features

a red dog on a blue tree

Page 72: Mallet Tutorial

Sliding window features

a red dog on a blue tree

red@-1

Page 73: Mallet Tutorial

Sliding window features

a red dog on a blue tree

red@-1 a@-2

Page 74: Mallet Tutorial

Sliding window features

a red dog on a blue tree

red@-1 a@-2 on@1

Page 75: Mallet Tutorial

Sliding window features

a red dog on a blue tree

red@-1 a@-2 on@1 a@-2_&_red@-1

Page 76: Mallet Tutorial

Importing Data

int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 };     // previous position
conjunctions[1] = new int[] { 1 };      // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two

OffsetConjunctions(conjunctions)

// a@-2_&_red@-1 on@1

cc.mallet.pipe.tsf
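How an offset array turns into feature names such as `a@-2_&_red@-1` can be sketched in plain Java. This imitates the naming shown on the slides and is not the `OffsetConjunctions` pipe; the class `OffsetConjunctionSketch` is hypothetical, and skipping conjunctions that reach outside the sentence is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetConjunctionSketch {
    // Builds one feature per conjunction for the token at the given position.
    static List<String> features(String[] tokens, int position, int[][] conjunctions) {
        List<String> out = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int index = position + offset;
                if (index < 0 || index >= tokens.length) {
                    inRange = false; // assumption: drop windows that leave the sentence
                    break;
                }
                if (name.length() > 0) {
                    name.append("_&_");
                }
                name.append(tokens[index]).append('@').append(offset);
            }
            if (inRange) {
                out.add(name.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "a red dog on a blue tree".split(" ");
        int[][] conjunctions = { { -1 }, { 1 }, { -2, -1 } };
        System.out.println(features(tokens, 2, conjunctions)); // features for "dog"
    }
}
```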

Page 77: Mallet Tutorial

Importing Data

int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 };     // previous position
conjunctions[1] = new int[] { 1 };      // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two

TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)

// a@-2_&_red@-1 a@-2_&_C1=d@-1

cc.mallet.pipe.tsf

Page 78: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Page 79: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

DET

P(DET)

Page 80: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

DET
the

P(the | DET)

Page 81: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

DET NN
the

P(NN | DET)

Page 82: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

DET NN
the dog

P(dog | NN)

Page 83: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

DET NN VBS
the dog

P(VBS | NN)

Page 84: Mallet Tutorial

How many parameters?

• Determines efficiency of training

• Too many leads to overfitting

Trick: Don’t allow certain transitions

P(VBS | DET) = 0

Page 85: Mallet Tutorial

How many parameters?

• Determines efficiency of training

• Too many leads to overfitting

[Figure: three diagrams of the labels DET NN VBS over the words “the dog runs”]

Page 86: Mallet Tutorial

Finite State Transducers

abstract class Transducer
CRF
HMM

abstract class TransducerTrainer
CRFTrainerByLabelLikelihood
HMMTrainerByLikelihood

cc.mallet.fst

Page 87: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

First order: one weight for every pair of labels and observations.

CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates(); // or
crf.addStatesForLabelsConnectedAsIn(instances);

DET NN VBS
the dog runs

Page 88: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

“three-quarter” order: one weight for every pair of labels and observations.

crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);

DET NN VBS
the dog runs

Page 89: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

Second order: one weight for every triplet of labels and observations.

crf.addStatesForBiLabelsConnectedAsIn(instances);

DET NN VBS
the dog runs

Page 90: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

“Half” order: equivalent to independent classifiers, except some transitions may be illegal.

crf.addStatesForHalfLabelsConnectedAsIn(instances);

DET NN VBS
the dog runs

Page 91: Mallet Tutorial

Training a transducer

CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);

trainer.train();

cc.mallet.fst

Page 92: Mallet Tutorial

Evaluating a transducer

CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(transducer);

TransducerEvaluator evaluator = new TokenAccuracyEvaluator(testing, "testing");

trainer.addEvaluator(evaluator);

trainer.train();

cc.mallet.fst

Page 93: Mallet Tutorial

Applying a transducer

Sequence output = transducer.transduce(input);

for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}

cc.mallet.fst

Page 94: Mallet Tutorial

Review

• How do you add new features to TokenSequences?

Page 95: Mallet Tutorial

Review

• How do you add new features to TokenSequences?

• What are three factors that affect the number of parameters in a model?

Page 96: Mallet Tutorial

Outline

• About MALLET

• Representing Data

• Classification

• Sequence Tagging

• Topic Modeling

Page 97: Mallet Tutorial

Topics: “Semantic Groups”

News Article

Page 98: Mallet Tutorial

Topics: “Semantic Groups”

“Sports” “Negotiation”

News Article

Page 99: Mallet Tutorial

Topics: “Semantic Groups”

“Sports” “Negotiation”

News Article

team, player, game, strike, deadline, union

Page 100: Mallet Tutorial

Topics: “Semantic Groups”

News Article

team, player, game, strike, deadline, union

Page 101: Mallet Tutorial

Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams

Page 102: Mallet Tutorial

players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining

Page 103: Mallet Tutorial

Training a Topic Model

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

Page 104: Mallet Tutorial

Evaluating a Topic Model

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

MarginalProbEstimator evaluator = lda.getProbEstimator();

double logLikelihood = evaluator.evaluateLeftToRight(testing, 10, false, null);

Page 105: Mallet Tutorial

Inferring topics for new documents

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

TopicInferencer inferencer = lda.getInferencer();

double[] topicProbs = inferencer.getSampledDistribution(instance, 100, 10, 10);

Page 106: Mallet Tutorial

More than words…

• Text collections mix free text and structured data

David Mimno
Andrew McCallum
UAI
2008
…

Page 107: Mallet Tutorial

More than words…

• Text collections mix free text and structured data

David Mimno
Andrew McCallum
UAI
2008

“Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. …”

Page 108: Mallet Tutorial

Dirichlet-multinomial Regression (DMR)

The corpus specifies a vector of real-valued features (x) for each document, of length F.

Each topic has an F-length vector of parameters.
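As a rough illustration of how the two F-length vectors combine: in the DMR paper cited on the previous slide, as I understand it, each document's Dirichlet prior weight for a topic is the exponential of the dot product of the document's features and that topic's parameters. The slide itself does not state this formula, so the relation below, the class `DmrPriorSketch`, and the array layout `lambda[topic][feature]` should be read as assumptions.

```java
public class DmrPriorSketch {
    // alpha[t] = exp(sum over f of x[f] * lambda[t][f]) for each topic t.
    static double[] topicPrior(double[] x, double[][] lambda) {
        double[] alpha = new double[lambda.length];
        for (int t = 0; t < lambda.length; t++) {
            double dot = 0.0;
            for (int f = 0; f < x.length; f++) {
                dot += x[f] * lambda[t][f];
            }
            alpha[t] = Math.exp(dot);
        }
        return alpha;
    }

    public static void main(String[] args) {
        // Features: [always-on default, second feature on, third feature off].
        double[] x = { 1.0, 1.0, 0.0 };
        double[][] lambda = { { -3.76, 2.88, 0.0 }, { 0.0, 0.0, 0.0 } };
        double[] alpha = topicPrior(x, lambda);
        System.out.println(alpha[0] + " " + alpha[1]);
    }
}
```

Under this reading, a large positive parameter raises the prior weight of a topic for documents that have the feature, and a negative default parameter keeps the topic rare otherwise.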

Page 109: Mallet Tutorial

Topic parameters for feature “published in JMLR”

user, users, user interface, interactive, interface  -1.44
web, web pages, web page, world wide web, web sites  -1.36
retrieval, information retrieval, query, query expansion  -1.23
strategies, strategy, adaptation, adaptive, driven  -1.21
agent, agents, multiagent, autonomous agents  -1.12
nearest neighbor, boosting, nearest neighbors, adaboost  1.37
blind source separation, source separation, separation, channel  1.40
reinforcement learning, learning, reinforcement  1.41
bounds, vc dimension, bound, upper bound, lower bounds  1.74
kernel, kernels, rational kernels, string kernels, fisher kernel  2.27

Page 110: Mallet Tutorial

Feature parameters for RL topic

<default>  -3.76
COLING  -1.64
IEEE Trans. PAMI  -1.54
CVPR  -1.47
ACL  -1.38
Machine Learning Journal  2.19
ECML  2.45
Kenji Doya  2.56
ICML  2.88
Sridhar Mahadevan  2.99

Page 111: Mallet Tutorial

Topic parameters for feature “published in UAI”

nearest neighbor, boosting, nearest neighbors, adaboost  -1.50
descriptions, description, top, bottom, top bottom  -1.50
workshop report, invited talk, international conference, report  -1.37
digital libraries, digital library, digital, library  -1.36
shape, deformable, shapes, contour, active contour  -1.29
reasoning, logic, default reasoning, nonmonotonic reasoning  2.11
uncertainty, symbolic, sketch, primal sketch, uncertain, [email protected]
probability, probabilities, probability distributions,  2.25
qualitative, reasoning, qualitative reasoning, qualita@[email protected]
bayesian networks, bayesian network, belief networks  2.88

Page 112: Mallet Tutorial

Feature parameters for Bayes nets topic

<default>  -3.36
ICRA  -2.24
Neural Networks  -1.50
COLING  -1.38
Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989)  -1.16
Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999)  2.04
Philippe Smets  2.15
Ashraf M. Abdelbar  2.23
Mary-Anne Williams  2.41
UAI  2.88

Page 113: Mallet Tutorial

Dirichlet-multinomial Regression

• Arbitrary observed features of documents

• Target contains FeatureVector

DMRTopicModel dmr = new DMRTopicModel(numTopics);

dmr.addInstances(training);
dmr.estimate();

dmr.writeParameters(new File("dmr.parameters"));

Page 114: Mallet Tutorial

Polylingual Topic Modeling

• Topics exist in more languages than you could possibly learn

• Topically comparable documents are much easier to get than translation sets

• Translation dictionaries

– cover pairs, not sets of languages

– miss technical vocabulary

– aren’t available for low-resource languages

Page 115: Mallet Tutorial

Topics from European Parliament Proceedings

Page 116: Mallet Tutorial

Topics from European Parliament Proceedings

Page 117: Mallet Tutorial

Topics from Wikipedia

Page 118: Mallet Tutorial

Aligned instance lists

dog… chien… hund…
cat… chat…
pig… schwein…

Page 119: Mallet Tutorial

Polylingual Topics

InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };

PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);

pltm.addInstances(training);

Page 120: Mallet Tutorial

MALLET hands-on tutorial

http://mallet.cs.umass.edu/mallet-handson.tar.gz