mallet tutorial

Machine Learning with MALLET, http://mallet.cs.umass.edu. David Mimno, Information Extraction and Synthesis Laboratory, Department of CS, UMass, Amherst

Post on 24-Oct-2014

TRANSCRIPT

Page 1: Mallet Tutorial

Machine Learning with MALLET

http://mallet.cs.umass.edu

David Mimno

Information Extraction and Synthesis Laboratory, Department of CS

UMass, Amherst

Page 2: Mallet Tutorial

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Page 4: Mallet Tutorial

Who?

• Andrew McCallum (most of the work)
• Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia…
• Fernando Pereira, others at Penn…

Page 5: Mallet Tutorial

Who am I?

• Chief maintainer of MALLET
• Primary author of MALLET topic modeling package

Page 6: Mallet Tutorial

Why?

• Motivation: text classification and information extraction
• Commercial machine learning (Just Research, WhizBang)
• Analysis and indexing of academic publications: Cora, Rexa

Page 7: Mallet Tutorial

What?

• Text focus: data is discrete rather than continuous, even when values could be continuous:

double value = 3.0

Page 8: Mallet Tutorial

How?

• Command line scripts:
– bin/mallet [command] --[option] [value] …
– Text User Interface ("tui") classes
• Direct Java API (most of this talk)
– http://mallet.cs.umass.edu/api

Page 9: Mallet Tutorial

History

• Version 0.4: c. 2004
– Classes in edu.umass.cs.mallet.base.*
• Version 2.0: c. 2008
– Classes in cc.mallet.*
– Major changes to finite state transducer package
– bin/mallet vs. specialized scripts
– Java 1.5 generics

Page 10: Mallet Tutorial

Learning More

• http://mallet.cs.umass.edu
– "Quick Start" guides, focused on command line processing
– Developers' guides, with Java examples
• mallet‐[email protected]
– Low volume, but can be bursty

Page 11: Mallet Tutorial

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Page 12: Mallet Tutorial

Models for Text Data

• Generative models (Multinomials)
– Naïve Bayes
– Hidden Markov Models (HMMs)
– Latent Dirichlet Topic Models
• Discriminative Regression Models
– MaxEnt / Logistic regression
– Conditional Random Fields (CRFs)

Page 14: Mallet Tutorial

Representations

• Transform text documents to vectors x1, x2, …
• Retain meaning of vector indices
• Ideally sparsely

[Figure: a Document ("Call me Ishmael. …") mapped to a vector xi = (1.0, 0.0, …, 0.0, 6.0, 0.0, …, 3.0, …)]

Page 15: Mallet Tutorial

Representations

• Elements of vector are called feature values
• Example: Feature at row 345 is number of times "dog" appears in document

[Figure: vector xi = (1.0, 0.0, …, 0.0, 6.0, 0.0, …, 3.0, …)]

Page 16: Mallet Tutorial

Documents to Vectors

Document: Call me Ishmael.

Page 17: Mallet Tutorial

Documents to Vectors

Document: Call me Ishmael.
Tokens: Call, me, Ishmael

Page 18: Mallet Tutorial

Documents to Vectors

Tokens: Call, me, Ishmael
Tokens (lowercased): call, me, ishmael

Page 19: Mallet Tutorial

Documents to Vectors

Tokens: call, me, ishmael
Features: 473, 3591, 17
Alphabet: 17 ishmael, … 473 call, … 3591 me

Page 20: Mallet Tutorial

Documents to Vectors

Features (sequence): 473, 3591, 17
Features (bag): 17 1.0, 473 1.0, 3591 1.0
Alphabet: 17 ishmael, … 473 call, … 3591 me
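The difference between the sequence and the bag view can be sketched in a few lines of plain Java. This is only an illustration, not MALLET's FeatureSequence2FeatureVector pipe; the class name BagOfFeaturesSketch and the use of a sorted map as the sparse vector are assumptions made here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: turn a feature sequence (indices in token order) into a sparse bag of counts. */
final class BagOfFeaturesSketch {

    /** Counts how often each feature index occurs; the order of the input is discarded. */
    static Map<Integer, Double> toBag(int[] featureSequence) {
        Map<Integer, Double> bag = new TreeMap<>();
        for (int feature : featureSequence) {
            bag.merge(feature, 1.0, Double::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        int[] sequence = {473, 3591, 17};   // call, me, ishmael
        System.out.println(toBag(sequence));
    }
}
```

Only indices that actually occur are stored, which is what "ideally sparsely" asks for.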

Page 21: Mallet Tutorial

Instances

Email message, web page, sentence, journal abstract…

• Name: What is it called?
• Data: What is the input?
• Target / Label: What is the output?
• Source: What did it originally look like?

Page 22: Mallet Tutorial

Instances

• Name: String
• Data: TokenSequence (ArrayList<Token>), FeatureSequence (int[]), FeatureVector (int -> double map)
• Target
• Source

cc.mallet.types

Page 23: Mallet Tutorial

Alphabets

cc.mallet.types, gnu.trove

TObjectIntHashMap map
ArrayList entries

int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)

Example: 17 ishmael, … 473 call, … 3591 me

Page 25: Mallet Tutorial

Alphabets

cc.mallet.types, gnu.trove

TObjectIntHashMap map
ArrayList entries

Example: 17 ishmael, … 473 call, … 3591 me

void stopGrowth()
void startGrowth()

stopGrowth(): do not add entries for new Objects. The default is to allow growth.
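The two-way mapping and the growth switch can be sketched as below. This is a stand-alone illustration rather than the cc.mallet.types.Alphabet source: it uses java.util collections instead of gnu.trove, and returning -1 for an entry that is absent and not added is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an object <-> index mapping with a growth switch (not MALLET's class). */
final class AlphabetSketch {
    private final Map<Object, Integer> map = new HashMap<>();
    private final List<Object> entries = new ArrayList<>();
    private boolean growthStopped = false;

    /** Returns the index of entry, adding it if allowed; -1 if it is absent and not added. */
    int lookupIndex(Object entry, boolean shouldAdd) {
        Integer index = map.get(entry);
        if (index != null) {
            return index;
        }
        if (!shouldAdd || growthStopped) {
            return -1;
        }
        entries.add(entry);
        map.put(entry, entries.size() - 1);
        return entries.size() - 1;
    }

    Object lookupObject(int index) { return entries.get(index); }

    void stopGrowth() { growthStopped = true; }

    void startGrowth() { growthStopped = false; }

    int size() { return entries.size(); }
}
```

Stopping growth before processing test data keeps unseen words from silently extending the feature space the model was trained on.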

Page 26: Mallet Tutorial

Creating Instances

cc.mallet.pipe.iterator

• Instance constructor method
new Instance(data, target, name, source)

• Iterators
Iterator<Instance>
FileIterator(File[], …)
CsvIterator(FileReader, Pattern…)
ArrayIterator(Object[])
…

Page 27: Mallet Tutorial

Creating Instances

cc.mallet.pipe.iterator

• FileIterator
/data/bad/
/data/good/

Label from dir name
Each instance in its own file

Page 28: Mallet Tutorial

Creating Instances

cc.mallet.pipe.iterator

• CsvIterator

"CSV" is a lousy name. LineRegexIterator?

Each instance on its own line (fields separated by tabs):
1001 Melville Call me Ishmael. Some years ago…
1002 Dickens It was the best of times, it was…

Name, label, data from regular expression groups:
^([^\t]+)\t([^\t]+)\t(.*)
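How the three groups of that expression map to name, label and data can be checked with java.util.regex alone. The sketch below does not use CsvIterator itself; the class name LineRegexSketch and the group order (1 = name, 2 = label, 3 = data) are assumptions that simply follow the example lines above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: split one tab-separated line into name, label and data with a regex. */
final class LineRegexSketch {
    // Same expression as on the slide, escaped for a Java string literal.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    /** Returns {name, label, data}, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("1001\tMelville\tCall me Ishmael. Some years ago");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
    }
}
```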

Page 29: Mallet Tutorial

Instance Pipelines

cc.mallet.pipe

• Sequential transformations of instance fields (usually Data)
• Pass an ArrayList<Pipe> to SerialPipes

// "data" is a String
CharSequence2TokenSequence      // tokenize with regexp
TokenSequenceLowercase          // modify each token's text
TokenSequenceRemoveStopwords    // drop some tokens
TokenSequence2FeatureSequence   // convert token Strings to ints
FeatureSequence2FeatureVector   // lose order, count duplicates

Page 30: Mallet Tutorial

Instance Pipelines

cc.mallet.pipe, cc.mallet.types

• A small number of pipes modify the "target" field
• There are now two alphabets: data and label

// "target" is a String
Target2Label    // convert String to int
// "target" is now a Label

Alphabet > LabelAlphabet

Page 31: Mallet Tutorial

Label objects

cc.mallet.types

• Weights on a fixed set of classes
• For training data, weight for correct label is 1.0, all others 0.0

implements Labeling

int getBestIndex()
Label getBestLabel()

You cannot create a Label; they are only produced by LabelAlphabet.

Page 32: Mallet Tutorial

InstanceLists

cc.mallet.types

• A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet

InstanceList instances = new InstanceList(pipe);

instances.addThruPipe(iterator);

Page 33: Mallet Tutorial

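Putting it all together

import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.FileIterator;
import cc.mallet.types.InstanceList;
import java.util.ArrayList;

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Target2Label());                    // target String to Label
pipeList.add(new CharSequence2TokenSequence());      // data String to tokens
pipeList.add(new TokenSequence2FeatureSequence());   // token Strings to ints
pipeList.add(new FeatureSequence2FeatureVector());   // lose order, count duplicates

InstanceList instances =
    new InstanceList(new SerialPipes(pipeList));
instances.addThruPipe(new FileIterator(. . .));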

Page 34: Mallet Tutorial

Persistent Storage

java.io

• Most MALLET classes use Java serialization to store models and data

ObjectOutputStream oos = new ObjectOutputStream(…);
oos.writeObject(instances);
oos.close();

Pipes, data objects, labelings, etc. all need to implement Serializable.

Be sure to include custom classes in classpath, or you get a StreamCorruptedException.
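A write-then-read round trip with plain java.io looks roughly like this. The sketch serializes an ordinary ArrayList into memory rather than a MALLET InstanceList to a file; the class name SerializationSketch and the choice to rethrow checked exceptions as unchecked ones are assumptions made for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

/** Sketch: Java serialization round trip for any Serializable object. */
final class SerializationSketch {

    /** Serializes the object into a byte array. */
    static byte[] write(Object object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back an object; its class must be on the classpath. */
    static Object read(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class missing from classpath", e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> tokens = new ArrayList<>(Arrays.asList("call", "me", "ishmael"));
        System.out.println(read(write(tokens)));
    }
}
```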

Page 38: Mallet Tutorial

Review

• What are the four main fields in an Instance?
• What are two ways to generate Instances?
• How do we modify the value of Instance fields?
• Name some classes that appear in the "data" field.

Page 39: Mallet Tutorial

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Page 40: Mallet Tutorial

Classifier objects

cc.mallet.classify

• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…

[Figure: given data ("watery"), which class is best: NN, JJ, PRP, VB, or CC? One class is marked "this one!"]

Page 41: Mallet Tutorial

Classifier objects

cc.mallet.classify

• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…

// classify() returns a Classification, which wraps the Labeling
Labeling labeling = classifier.classify(instance).getLabeling();

Label l = labeling.getBestLabel();

System.out.print(instance + "\t");
System.out.println(l);

Page 42: Mallet Tutorial

Training Classifier objects

cc.mallet.classify

• Each type of classifier has one or more ClassifierTrainer classes

ClassifierTrainer trainer = new MaxEntTrainer();

Classifier classifier = trainer.train(instances);

Page 43: Mallet Tutorial

Training Classifier objects

cc.mallet.optimize

• Some classifiers require numerical optimization of an objective function.

log P(Labels | Data) =
    log f(label1, data1, w) +
    log f(label2, data2, w) +
    log f(label3, data3, w) + …

Maximize w.r.t. w!

Page 44: Mallet Tutorial

Parameters w

• Association between feature, class label
• How many parameters for K classes and N features?

Feature     Class   Weight
action      NN       0.13
action      VB      -0.1
action      JJ      -0.21
SUFF-tion   NN       1.3
SUFF-tion   VB      -2.1
SUFF-tion   JJ      -1.7
SUFF-on     NN       0.01
SUFF-on     VB      -0.02

Page 45: Mallet Tutorial

Training Classifier objects

cc.mallet.optimize

interface Optimizer
    boolean optimize()
Limited-memory BFGS, Conjugate gradient…

interface Optimizable
    interface ByValue
    interface ByValueGradient
Specific objective functions

Page 46: Mallet Tutorial

Training Classifier objects

cc.mallet.classify

MaxEntOptimizableByLabelLikelihood

For Optimizable interface:
    double[] getParameters()
    void setParameters(double[] parameters)
    …

Log likelihood and its first derivative:
    double getValue()
    void getValueGradient(double[] buffer)
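What getValue() and getValueGradient() have to compute for a MaxEnt model can be sketched without MALLET. The class below mirrors the four method names listed above but does not implement MALLET's interfaces; the dense double[][] data layout, the K × N flattening of the weights, and the absence of any prior or regularization term are assumptions of this sketch.

```java
import java.util.Arrays;

/** Sketch: label log likelihood of a softmax (MaxEnt) model and its gradient. */
final class LabelLikelihoodSketch {
    private final double[][] data;   // data[i][n]: value of feature n in instance i
    private final int[] labels;      // labels[i]: index of the correct class
    private final int numClasses;
    private final int numFeatures;
    private double[] parameters;     // weight of (class k, feature n) at k * numFeatures + n

    LabelLikelihoodSketch(double[][] data, int[] labels, int numClasses) {
        this.data = data;
        this.labels = labels;
        this.numClasses = numClasses;
        this.numFeatures = data[0].length;
        this.parameters = new double[numClasses * numFeatures];  // K * N weights
    }

    double[] getParameters() { return parameters.clone(); }

    void setParameters(double[] newParameters) { parameters = newParameters.clone(); }

    /** P(class | x) for one instance, using a numerically stable softmax. */
    private double[] classProbabilities(double[] x) {
        double[] scores = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < numClasses; k++) {
            for (int n = 0; n < numFeatures; n++) {
                scores[k] += parameters[k * numFeatures + n] * x[n];
            }
            max = Math.max(max, scores[k]);
        }
        double sum = 0.0;
        for (int k = 0; k < numClasses; k++) {
            scores[k] = Math.exp(scores[k] - max);
            sum += scores[k];
        }
        for (int k = 0; k < numClasses; k++) {
            scores[k] /= sum;
        }
        return scores;
    }

    /** Sum over instances of log P(correct label | data). */
    double getValue() {
        double logLikelihood = 0.0;
        for (int i = 0; i < data.length; i++) {
            logLikelihood += Math.log(classProbabilities(data[i])[labels[i]]);
        }
        return logLikelihood;
    }

    /** Gradient: observed minus expected feature value for every (class, feature) pair. */
    void getValueGradient(double[] buffer) {
        Arrays.fill(buffer, 0.0);
        for (int i = 0; i < data.length; i++) {
            double[] p = classProbabilities(data[i]);
            for (int k = 0; k < numClasses; k++) {
                double difference = (k == labels[i] ? 1.0 : 0.0) - p[k];
                for (int n = 0; n < numFeatures; n++) {
                    buffer[k * numFeatures + n] += difference * data[i][n];
                }
            }
        }
    }
}
```

An optimizer such as L-BFGS only needs these four operations: read the weights, write new weights, and ask for the value and gradient at the current point.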

Page 47: Mallet Tutorial

Evaluation of Classifiers

cc.mallet.types

• Create random test/train splits

InstanceList[] instanceLists =
    instances.split(new Randoms(),
                    new double[] {0.9, 0.1, 0.0});

90% training
10% testing
0% validation
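The idea behind a proportional random split can be sketched with java.util alone. This is not InstanceList.split; the generic SplitSketch class, the rounding of boundaries, and the normalization of the proportions by their sum are assumptions of this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: shuffle a list and cut it into parts with the given proportions. */
final class SplitSketch {

    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, random);

        double total = 0.0;
        for (double proportion : proportions) {
            total += proportion;
        }

        List<List<T>> parts = new ArrayList<>();
        int start = 0;
        double cumulative = 0.0;
        for (double proportion : proportions) {
            cumulative += proportion;
            // Cut at the cumulative fraction so that the parts cover every item exactly once.
            int end = (int) Math.round(shuffled.size() * cumulative / total);
            parts.add(new ArrayList<>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }
}
```

Passing a seeded Random makes the split reproducible across runs.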

Page 48: Mallet Tutorial

Evaluation of Classifiers

cc.mallet.classify

• The Trial class stores the results of classifications on an InstanceList (testing or training)

Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(…)
double getRecall(…)
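For reference, the quantities behind getAccuracy, getPrecision, getRecall and getF1 can be written out as follows. This sketch works on plain arrays of gold and predicted label indices and is not MALLET's Trial class; returning 0.0 when a denominator is zero is an assumption made here.

```java
/** Sketch: accuracy and per-label precision, recall and F1 from gold and predicted labels. */
final class TrialSketch {
    private final int[] gold;
    private final int[] predicted;

    TrialSketch(int[] gold, int[] predicted) {
        if (gold.length != predicted.length) {
            throw new IllegalArgumentException("gold and predicted must have the same length");
        }
        this.gold = gold.clone();
        this.predicted = predicted.clone();
    }

    /** Fraction of instances whose predicted label equals the gold label. */
    double getAccuracy() {
        int correct = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == predicted[i]) correct++;
        }
        return gold.length == 0 ? 0.0 : (double) correct / gold.length;
    }

    /** Of the instances predicted as label, the fraction that really have it. */
    double getPrecision(int label) {
        int truePositives = 0;
        int predictedAsLabel = 0;
        for (int i = 0; i < gold.length; i++) {
            if (predicted[i] == label) {
                predictedAsLabel++;
                if (gold[i] == label) truePositives++;
            }
        }
        return predictedAsLabel == 0 ? 0.0 : (double) truePositives / predictedAsLabel;
    }

    /** Of the instances that really have label, the fraction predicted as such. */
    double getRecall(int label) {
        int truePositives = 0;
        int goldWithLabel = 0;
        for (int i = 0; i < gold.length; i++) {
            if (gold[i] == label) {
                goldWithLabel++;
                if (predicted[i] == label) truePositives++;
            }
        }
        return goldWithLabel == 0 ? 0.0 : (double) truePositives / goldWithLabel;
    }

    /** Harmonic mean of precision and recall. */
    double getF1(int label) {
        double precision = getPrecision(label);
        double recall = getRecall(label);
        return precision + recall == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }
}
```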

Page 49: Mallet Tutorial

Review

• I have invented a new classifier: David regression.
– What class should I implement to classify instances?

Page 50: Mallet Tutorial

Review

• I have invented a new classifier: David regression.
– What class should I implement to train a David regression classifier?

Page 51: Mallet Tutorial

Review

• I have invented a new classifier: David regression.
– I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?

Page 52: Mallet Tutorial

Review

• I have invented a new classifier: David regression.
– How would I check whether my new classifier works better than Naïve Bayes?

Page 53: Mallet Tutorial

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Page 54: Mallet Tutorial

Sequence Tagging

• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated

DET  NN   VBS    VBG
the  dog  likes  running

Page 55: Mallet Tutorial

Sequence Tagging

• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated

??   ??   ??     ??
the  dog  likes  running

Page 56: Mallet Tutorial

Sequence Tagging

• Classification: n-way
• Sequence Tagging: n^T-way

[Figure: lattice of candidate tags (NN, JJ, PRP, VB, CC) at every position of "… or red dogs on blue trees"]

Page 57: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property
• Dynamic programming

[Photo: Andrei Markov]

Page 58: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property
• Dynamic programming

[Figure: tag sequence DET JJ NN VB with callouts "This one", "Given this one", "Is independent of these"; photo of Andrei Markov]

Page 59: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property
• Dynamic programming

[Figure: lattice of candidate tags (NN, JJ, PRP, VB, CC) over "or red dogs on blue trees"; photo of Andrei Markov]

Page 60: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property
• Dynamic programming

[Figure: lattice of candidate tags (NN, JJ, PRP, VB, CC) over "red dogs on blue trees"; photo of Andrei Markov]

Page 61: Mallet Tutorial

Avoiding Exponential Blowup

• Markov property
• Dynamic programming

[Figure: lattice of candidate tags (NN, JJ, PRP, VB, CC) over "dogs on blue trees"; photo of Andrei Markov]
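The Markov property is what lets dynamic programming replace the n^T enumeration with work proportional to T × n². Below is a minimal sketch of the standard Viterbi recursion, assuming log-domain scores and a first-order chain; it is not the cc.mallet.fst implementation, and the class name and array layout are choices made for this example.

```java
/** Sketch: Viterbi decoding over a first-order chain with log-domain scores. */
final class ViterbiSketch {

    /**
     * @param start      start[s]: score of beginning in label s
     * @param transition transition[r][s]: score of moving from label r to label s
     * @param emission   emission[t][s]: score of label s at position t (at least one position)
     * @return the highest-scoring label sequence
     */
    static int[] bestPath(double[] start, double[][] transition, double[][] emission) {
        int length = emission.length;
        int numLabels = start.length;
        double[][] best = new double[length][numLabels];  // best score of any path ending in s at t
        int[][] back = new int[length][numLabels];        // predecessor label on that best path

        for (int s = 0; s < numLabels; s++) {
            best[0][s] = start[s] + emission[0][s];
        }
        for (int t = 1; t < length; t++) {
            for (int s = 0; s < numLabels; s++) {
                best[t][s] = Double.NEGATIVE_INFINITY;
                for (int r = 0; r < numLabels; r++) {
                    // Only the previous label matters: this is the Markov property at work.
                    double score = best[t - 1][r] + transition[r][s] + emission[t][s];
                    if (score > best[t][s]) {
                        best[t][s] = score;
                        back[t][s] = r;
                    }
                }
            }
        }

        // Pick the best final label, then follow the back-pointers.
        int[] path = new int[length];
        for (int s = 1; s < numLabels; s++) {
            if (best[length - 1][s] > best[length - 1][path[length - 1]]) {
                path[length - 1] = s;
            }
        }
        for (int t = length - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

Each column of the lattice is processed once and then never revisited, which matches the shrinking lattices in the figures above.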

Page 62: Mallet Tutorial

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: fully generative
P(Labels | Data) = P(Data, Labels) / P(Data)

• Conditional Random Field: conditional
P(Labels | Data)

Page 63: Mallet Tutorial

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: simple (independent) output space
"NSF-funded"

• Conditional Random Field: arbitrarily complicated outputs
"NSF-funded", CAPITALIZED, HYPHENATED, ENDS-WITH-ed, ENDS-WITH-d, …

Page 64: Mallet Tutorial

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: simple (independent) output space
FeatureSequence: int[]

• Conditional Random Field: arbitrarily complicated outputs
FeatureVectorSequence: FeatureVector[]

Page 65: Mallet Tutorial

Importing Data

• SimpleTagger format: one word per line, with instances delimited by a blank line

Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS
…

Page 66: Mallet Tutorial

Importing Data

• SimpleTagger format: one word per line, with instances delimited by a blank line

Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS
…

Page 68: Mallet Tutorial

Importing Data

cc.mallet.pipe, cc.mallet.pipe.iterator

LineGroupIterator

SimpleTaggerSentence2TokenSequence()    // String to Tokens, handles labels

[Pipes that modify tokens]

TokenSequence2FeatureVectorSequence()   // Token objects to FeatureVectors

Page 69: Mallet Tutorial

Importing Data

cc.mallet.pipe.tsf

// Ishmael
TokenTextCharSuffix("C2=", 2)
// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*"))    // must match entire string
// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false)  // file: one name per line; boolean: ignore case?
// Ishmael C2=el CAP NAME

Page 75: Mallet Tutorial

Sliding window features

a red dog on a blue tree

Features added at "dog": red@-1, a@-2, on@1, a@-2_&_red@-1

Page 76: Mallet Tutorial

Importing Data

cc.mallet.pipe.tsf

int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 };      // previous position
conjunctions[1] = new int[] { 1 };       // next position
conjunctions[2] = new int[] { -2, -1 };  // previous two

OffsetConjunctions(conjunctions)

// a@-2_&_red@-1 on@1

Page 77: Mallet Tutorial

Importing Data

cc.mallet.pipe.tsf

int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 };      // previous position
conjunctions[1] = new int[] { 1 };       // next position
conjunctions[2] = new int[] { -2, -1 };  // previous two

TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)

// a@-2_&_red@-1 a@-2_&_C1=d@-1
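How an offsets array turns into feature names such as a@-2_&_red@-1 can be sketched as follows. This is an illustration of the naming scheme shown in the comments above, not the OffsetConjunctions pipe itself: it looks only at token text (the real pipe conjoins existing token features), and skipping a conjunction when any offset falls outside the sentence is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: build sliding-window conjunction feature names for one token position. */
final class OffsetConjunctionSketch {

    static List<String> featuresAt(String[] tokens, int position, int[][] conjunctions) {
        List<String> features = new ArrayList<>();
        for (int[] offsets : conjunctions) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : offsets) {
                int index = position + offset;
                if (index < 0 || index >= tokens.length) {
                    inRange = false;   // window runs off the sentence: drop this conjunction
                    break;
                }
                if (name.length() > 0) {
                    name.append("_&_");
                }
                name.append(tokens[index]).append('@').append(offset);
            }
            if (inRange) {
                features.add(name.toString());
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "red", "dog", "on", "a", "blue", "tree"};
        int[][] conjunctions = { {-1}, {1}, {-2, -1} };
        System.out.println(featuresAt(tokens, 2, conjunctions));
    }
}
```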

Page 79: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Hidden: DET

P(DET)

Page 80: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Hidden: DET
Observed: the

P(the | DET)

Page 81: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Hidden: DET NN
Observed: the

P(NN | DET)

Page 82: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Hidden: DET NN
Observed: the dog

P(dog | NN)

Page 83: Mallet Tutorial

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Hidden: DET NN VBS
Observed: the dog

P(VBS | NN)

Page 84: Mallet Tutorial

How many parameters?

• Determines efficiency of training
• Too many leads to overfitting

Trick: Don't allow certain transitions

P(VBS | DET) = 0

Page 85: Mallet Tutorial

How many parameters?

• Determines efficiency of training
• Too many leads to overfitting

[Figure: the tag chain DET NN VBS over "the dog runs", drawn three times]

Page 86: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

abstract class Transducer
    CRF
    HMM

abstract class TransducerTrainer
    CRFTrainerByLabelLikelihood
    HMMTrainerByLikelihood

Page 87: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

First order: one weight for every pair of labels and observations.

CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates();
// or
crf.addStatesForLabelsConnectedAsIn(instances);

[Figure: DET NN VBS over "the dog runs"]

Page 88: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

"Three-quarter" order: one weight for every pair of labels and observations.

crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);

[Figure: DET NN VBS over "the dog runs"]

Page 89: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

Second order: one weight for every triplet of labels and observations.

crf.addStatesForBiLabelsConnectedAsIn(instances);

[Figure: DET NN VBS over "the dog runs"]

Page 90: Mallet Tutorial

Finite State Transducers

cc.mallet.fst

"Half" order: equivalent to independent classifiers, except some transitions may be illegal.

crf.addStatesForHalfLabelsConnectedAsIn(instances);

[Figure: DET NN VBS over "the dog runs"]

Page 91: Mallet Tutorial

Training a transducer

cc.mallet.fst

CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer =
    new CRFTrainerByLabelLikelihood(crf);

trainer.train(trainingInstances);

Page 92: Mallet Tutorial

Evaluating a transducer

cc.mallet.fst

CRFTrainerByLabelLikelihood trainer =
    new CRFTrainerByLabelLikelihood(transducer);

TransducerEvaluator evaluator =
    new TokenAccuracyEvaluator(testing, "testing");

trainer.addEvaluator(evaluator);

trainer.train(trainingInstances);

Page 93: Mallet Tutorial

Applying a transducer

cc.mallet.fst

Sequence output = transducer.transduce(input);

// Print each input element with its predicted label, e.g. the/DET
for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}

Page 95: Mallet Tutorial

Review

• How do you add new features to TokenSequences?
• What are three factors that affect the number of parameters in a model?

Page 96: Mallet Tutorial

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Page 99: Mallet Tutorial

Topics: "Semantic Groups"

[Figure: a News Article linked to the topics "Sports" and "Negotiation" through the words team, player, game, strike, deadline, union]

Page 100: Mallet Tutorial

Topics: "Semantic Groups"

[Figure: a News Article and its words team, player, game, strike, deadline, union, without topic labels]

Page 101: Mallet Tutorial

Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams

Page 102: Mallet Tutorial

players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining

Page 103: Mallet Tutorial

Training a Topic Model

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

Page 104: Mallet Tutorial

Evaluating a Topic Model

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

MarginalProbEstimator evaluator = lda.getProbEstimator();

double logLikelihood =
    evaluator.evaluateLeftToRight(testing, 10, false, null);

Page 105: Mallet Tutorial

Inferring topics for new documents

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

TopicInferencer inferencer = lda.getInferencer();

double[] topicProbs =
    inferencer.getSampledDistribution(instance, 100, 10, 10);

Page 107: Mallet Tutorial

More than words…

• Text collections mix free text and structured data

David Mimno, Andrew McCallum
UAI 2008
"Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. …"

Page 108: Mallet Tutorial

Dirichlet-multinomial Regression (DMR)

The corpus specifies a vector of real-valued features (x) for each document, of length F.

Each topic has an F-length vector of parameters.
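In the DMR formulation of the paper cited on the previous slide, as I read it, these two vectors combine into a document-specific Dirichlet prior: the weight for topic t is exp(x · λt), where λt is that topic's parameter vector. The sketch below only computes this prior; it is not DMRTopicModel, and the class name and array layout are assumptions.

```java
/** Sketch: document-specific Dirichlet prior in a DMR-style model, alpha[t] = exp(x . lambda[t]). */
final class DmrPriorSketch {

    /**
     * @param x      document features, length F
     * @param lambda lambda[t]: parameter vector of topic t, length F
     * @return prior weight of every topic for this document
     */
    static double[] documentPrior(double[] x, double[][] lambda) {
        double[] alpha = new double[lambda.length];
        for (int t = 0; t < lambda.length; t++) {
            double dot = 0.0;
            for (int f = 0; f < x.length; f++) {
                dot += x[f] * lambda[t][f];
            }
            alpha[t] = Math.exp(dot);
        }
        return alpha;
    }
}
```

A positive parameter therefore makes a topic more likely in documents that carry the feature, and a negative one makes it less likely, which is how the parameter tables on the following slides can be read.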

Page 109: Mallet Tutorial

Topic parameters for feature "published in JMLR"

user, users, user interface, interactive, interface   -1.44
web, web pages, web page, world wide web, web sites   -1.36
retrieval, information retrieval, query, query expansion   -1.23
strategies, strategy, adaptation, adaptive, driven   -1.21
agent, agents, multiagent, autonomous agents   -1.12
nearest neighbor, boosting, nearest neighbors, adaboost   1.37
blind source separation, source separation, separation, channel   1.40
reinforcement learning, learning, reinforcement   1.41
bounds, vc dimension, bound, upper bound, lower bounds   1.74
kernel, kernels, rational kernels, string kernels, fisher kernel   2.27

Page 110: Mallet Tutorial

Feature parameters for RL topic

<default>   -3.76
COLING   -1.64
IEEE Trans. PAMI   -1.54
CVPR   -1.47
ACL   -1.38
Machine Learning Journal   2.19
ECML   2.45
Kenji Doya   2.56
ICML   2.88
Sridhar Mahadevan   2.99

Page 111: Mallet Tutorial

Topic parameters for feature "published in UAI"

nearest neighbor, boosting, nearest neighbors, adaboost   -1.50
descriptions, description, top, bottom, top bottom   -1.50
workshop report, invited talk, international conference, report   -1.37
digital libraries, digital library, digital, library   -1.36
shape, deformable, shapes, contour, active contour   -1.29
reasoning, logic, default reasoning, nonmonotonic reasoning   2.11
uncertainty, symbolic, sketch, primal sketch, uncertain, [email protected]
probability, probabilities, probability distributions,   2.25
qualitative, reasoning, qualitative reasoning, qualitati[email protected]
bayesian networks, bayesian network, belief networks   2.88

Page 112: Mallet Tutorial

Feature parameters for Bayes nets topic

<default>   -3.36
ICRA   -2.24
Neural Networks   -1.50
COLING   -1.38
Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989)   -1.16
Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999)   2.04
Philippe Smets   2.15
Ashraf M. Abdelbar   2.23
Mary-Anne Williams   2.41
UAI   2.88

Page 113: Mallet Tutorial

Dirichlet-multinomial Regression

• Arbitrary observed features of documents
• Target contains FeatureVector

DMRTopicModel dmr = new DMRTopicModel(numTopics);

dmr.addInstances(training);
dmr.estimate();

dmr.writeParameters(new File("dmr.parameters"));

Page 114: Mallet Tutorial

Polylingual Topic Modeling

• Topics exist in more languages than you could possibly learn
• Topically comparable documents are much easier to get than translation sets
• Translation dictionaries
– cover pairs, not sets of languages
– miss technical vocabulary
– aren't available for low-resource languages

Page 115: Mallet Tutorial

Topics from European Parliament Proceedings

Page 117: Mallet Tutorial

Topics from Wikipedia

Page 118: Mallet Tutorial

Aligned instance lists

dog…  chien…  hund…
cat…  chat…
pig…  schwein…

Page 119: Mallet Tutorial

Polylingual Topics

InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };

PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);

pltm.addInstances(training);

Page 120: Mallet Tutorial

MALLET hands-on tutorial

http://mallet.cs.umass.edu/mallet-handson.tar.gz