mallet tutorial

Post on 24-Oct-2014

Machine Learning with MALLET

http://mallet.cs.umass.edu

David Mimno

Information Extraction and Synthesis Laboratory, Department of CS

UMass, Amherst

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Who?

• Andrew McCallum (most of the work)
• Charles Sutton, Aron Culotta, Greg Druck, Kedar Bellare, Gaurav Chandalia…
• Fernando Pereira, others at Penn…

Who am I?

• Chief maintainer of MALLET
• Primary author of MALLET topic modeling package

Why?

• Motivation: text classification and information extraction
• Commercial machine learning (Just Research, WhizBang)
• Analysis and indexing of academic publications: Cora, Rexa

What?

• Text focus: data is discrete rather than continuous, even when values could be continuous:

double value = 3.0

How?

• Command line scripts:
– bin/mallet [command] --[option] [value] …
– Text User Interface (“tui”) classes
• Direct Java API (the focus of most of this talk)
– http://mallet.cs.umass.edu/api

History

• Version 0.4: c. 2004
– Classes in edu.umass.cs.mallet.base.*
• Version 2.0: c. 2008
– Classes in cc.mallet.*
– Major changes to finite state transducer package
– bin/mallet vs. specialized scripts
– Java 1.5 generics

Learning More

• http://mallet.cs.umass.edu
– “Quick Start” guides, focused on command line processing
– Developers’ guides, with Java examples
• mallet-dev@cs.umass.edu mailing list
– Low volume, but can be bursty

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Models for Text Data

• Generative models (Multinomials)
– Naïve Bayes
– Hidden Markov Models (HMMs)
– Latent Dirichlet Topic Models
• Discriminative Regression Models
– MaxEnt/Logistic regression
– Conditional Random Fields (CRFs)

Representations

• Transform text documents to vectors x1, x2, …
• Retain meaning of vector indices
• Ideally sparsely

Document: Call me Ishmael. …
xi: 1.0 0.0 … 0.0 6.0 0.0 … 3.0 …

Representations

• Elements of vector are called feature values
• Example: Feature at row 345 is number of times “dog” appears in document

xi: 1.0 0.0 … 0.0 6.0 0.0 … 3.0 …

Documents to Vectors

Document: Call me Ishmael.
Tokens: Call, me, Ishmael
Tokens (lowercased): call, me, ishmael
Features (sequence): 473, 3591, 17
Features (bag): 17 1.0, 473 1.0, 3591 1.0

Alphabet: 17 ishmael … 473 call … 3591 me
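The document-to-vector steps can be sketched in plain Java without MALLET. This is only an illustration: the class and method names are hypothetical, and because the toy alphabet starts empty the indices it assigns will differ from the 17, 473, and 3591 shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; MALLET performs these steps with Pipe classes.
class DocumentToVector {
    // Document -> lowercased tokens (runs of letters).
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<String>();
        for (String t : document.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    // Tokens -> feature sequence: one integer per token, order preserved.
    // Unseen tokens are added to the alphabet with the next free index.
    static int[] toFeatureSequence(List<String> tokens, Map<String, Integer> alphabet) {
        int[] features = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            Integer index = alphabet.get(tokens.get(i));
            if (index == null) {
                index = alphabet.size();
                alphabet.put(tokens.get(i), index);
            }
            features[i] = index;
        }
        return features;
    }

    // Feature sequence -> bag: order is lost, duplicates are counted.
    static Map<Integer, Double> toFeatureVector(int[] featureSequence) {
        Map<Integer, Double> vector = new TreeMap<Integer, Double>();
        for (int f : featureSequence) {
            Double old = vector.get(f);
            vector.put(f, old == null ? 1.0 : old + 1.0);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> alphabet = new HashMap<String, Integer>();
        int[] sequence = toFeatureSequence(tokenize("Call me Ishmael. Call me."), alphabet);
        System.out.println(Arrays.toString(sequence));
        System.out.println(toFeatureVector(sequence));
    }
}
```

The sparse map stores only the features that occur, which is the "ideally sparsely" point from the Representations slides.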

Instances

Email message, web page, sentence, journal abstract…

• Name: What is it called?
• Data: What is the input?
• Target/Label: What is the output?
• Source: What did it originally look like?

Instances

cc.mallet.types

• Name
• Data
• Target
• Source

Types that can appear in these fields:
String
TokenSequence: ArrayList<Token>
FeatureSequence: int[]
FeatureVector: int -> double map

Alphabets

cc.mallet.types, gnu.trove

TObjectIntHashMap map
ArrayList entries

int lookupIndex(Object o, boolean shouldAdd)
Object lookupObject(int index)

void stopGrowth()
void startGrowth()

stopGrowth(): do not add entries for new Objects. The default is to allow growth.

17 ishmael … 473 call … 3591 me
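A minimal sketch of the two-way mapping an alphabet provides, using java.util collections in place of gnu.trove. ToyAlphabet is a hypothetical stand-in, not MALLET's class, and returning -1 for an unknown object that is not added is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an alphabet: object -> index and index -> object.
class ToyAlphabet {
    private final Map<Object, Integer> map = new HashMap<Object, Integer>();
    private final List<Object> entries = new ArrayList<Object>();
    private boolean growthStopped = false;

    // Returns the index of o. Unknown objects are added only when shouldAdd is
    // true and growth has not been stopped; otherwise -1 is returned (assumed).
    int lookupIndex(Object o, boolean shouldAdd) {
        Integer index = map.get(o);
        if (index != null) return index;
        if (!shouldAdd || growthStopped) return -1;
        map.put(o, entries.size());
        entries.add(o);
        return entries.size() - 1;
    }

    Object lookupObject(int index) { return entries.get(index); }

    void stopGrowth() { growthStopped = true; }

    void startGrowth() { growthStopped = false; }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        ToyAlphabet alphabet = new ToyAlphabet();
        System.out.println(alphabet.lookupIndex("call", true));
        System.out.println(alphabet.lookupIndex("me", true));
        alphabet.stopGrowth(); // e.g. before processing test data
        System.out.println(alphabet.lookupIndex("ishmael", true));
        System.out.println(alphabet.lookupObject(1));
    }
}
```

Stopping growth before reading held-out data keeps the feature space fixed at whatever was seen in training.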

Creating Instances

cc.mallet.pipe.iterator

• Instance constructor method

new Instance(data, target, name, source)

• Iterators

Iterator<Instance>
FileIterator(File[], …)
CsvIterator(FileReader, Pattern…)
ArrayIterator(Object[])
…

Creating Instances

cc.mallet.pipe.iterator

• FileIterator

Each instance in its own file
Label from dir name

/data/bad/
/data/good/

Creating Instances

cc.mallet.pipe.iterator

• CsvIterator

Each instance on its own line
Name, label, data from regular expression groups. “CSV” is a lousy name. LineRegexIterator?

Example lines (the pattern below expects tab-separated fields):

1001 Melville Call me Ishmael. Some years ago…
1002 Dickens It was the best of times, it was…

^([^\t]+)\t([^\t]+)\t(.*)
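How the three regular expression groups yield name, label, and data can be checked with java.util.regex alone. The class and method names below are hypothetical; only the pattern comes from the slide.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing how one line splits into three fields.
class LineRegexDemo {
    // Same pattern as on the slide: three tab-separated fields.
    static final Pattern LINE = Pattern.compile("^([^\\t]+)\\t([^\\t]+)\\t(.*)");

    // Returns {name, label, data}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "1001\tMelville\tCall me Ishmael. Some years ago";
        System.out.println(Arrays.toString(parse(line)));
    }
}
```

Because the third group is `(.*)`, any further tabs inside the text stay part of the data field.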

Instance Pipelines

cc.mallet.pipe

• Sequential transformations of instance fields (usually Data)
• Pass an ArrayList<Pipe> to SerialPipes

// “data” is a String
CharSequence2TokenSequence // tokenize with regexp
TokenSequenceLowercase // modify each token’s text
TokenSequenceRemoveStopwords // drop some tokens
TokenSequence2FeatureSequence // convert token Strings to ints
FeatureSequence2FeatureVector // lose order, count duplicates

Instance Pipelines

cc.mallet.pipe, cc.mallet.types

• A small number of pipes modify the “target” field
• There are now two alphabets: data and label

// “target” is a String
Target2Label // convert String to int
// “target” is now a Label

Alphabet > LabelAlphabet (LabelAlphabet is a subclass of Alphabet)

Label objects

cc.mallet.types

• Weights on a fixed set of classes
• For training data, weight for correct label is 1.0, all others 0.0

implements Labeling

int getBestIndex()
Label getBestLabel()

You cannot create a Label; they are only produced by LabelAlphabet.

InstanceLists

cc.mallet.types

• A List of Instance objects, along with a Pipe, data Alphabet, and LabelAlphabet

InstanceList instances = new InstanceList(pipe);
instances.addThruPipe(iterator);

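Putting it all together

import java.util.ArrayList;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.FileIterator;
import cc.mallet.types.InstanceList;

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
pipeList.add(new Target2Label()); // target String -> Label
pipeList.add(new CharSequence2TokenSequence()); // data String -> tokens
pipeList.add(new TokenSequence2FeatureSequence()); // tokens -> feature indices
pipeList.add(new FeatureSequence2FeatureVector()); // indices -> counts

InstanceList instances = new InstanceList(new SerialPipes(pipeList));
instances.addThruPipe(new FileIterator(. . .));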

Persistent Storage

java.io

• Most MALLET classes use Java serialization to store models and data

ObjectOutputStream oos = new ObjectOutputStream(…);
oos.writeObject(instances);
oos.close();

Pipes, data objects, labelings, etc. all need to implement Serializable.

Be sure to include custom classes in classpath, or you get a StreamCorruptedException.
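A small round trip through java.io serialization, as a sketch of what storing and reloading an object involves. The Doc class and the in-memory byte array are assumptions made for illustration; with MALLET the object written would be an InstanceList or a model, usually to a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

class SerializationDemo {
    // Hypothetical data class; it must implement Serializable to be written.
    static class Doc implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int[] features;

        Doc(String name, int[] features) {
            this.name = name;
            this.features = features;
        }
    }

    // Serializes any Serializable object to a byte array.
    static byte[] save(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bytes);
            oos.writeObject(o);
            oos.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the object back; its class must be on the classpath.
    static Object load(byte[] data) {
        try {
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data));
            try {
                return ois.readObject();
            } finally {
                ois.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class missing from classpath", e);
        }
    }

    public static void main(String[] args) {
        Doc copy = (Doc) load(save(new Doc("1001", new int[] {473, 3591, 17})));
        System.out.println(copy.name + " " + Arrays.toString(copy.features));
    }
}
```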

Review

• What are the four main fields in an Instance?
• What are two ways to generate Instances?
• How do we modify the value of Instance fields?
• Name some classes that appear in the “data” field.

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Classifier objects

cc.mallet.classify

• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…

[Figure: given the data “watery”, which class is best? Candidate labels NN, JJ, PRP, VB, CC, with one marked “(this one!)”]

Classifier objects

cc.mallet.classify

• Classifiers map from instances to distributions over a fixed set of classes
• MaxEnt, Naïve Bayes, Decision Trees…

// classify() returns a Classification, which wraps the Labeling
Labeling labeling = classifier.classify(instance).getLabeling();
Label l = labeling.getBestLabel();
System.out.print(instance + "\t");
System.out.println(l);

Training Classifier objects

cc.mallet.classify

• Each type of classifier has one or more ClassifierTrainer classes

ClassifierTrainer trainer = new MaxEntTrainer();
Classifier classifier = trainer.train(instances);

Training Classifier objects

cc.mallet.optimize

• Some classifiers require numerical optimization of an objective function.

log P(Labels | Data) =
  log f(label1, data1, w) +
  log f(label2, data2, w) +
  log f(label3, data3, w) + …

Maximize w.r.t. w!

Parameters w

• Association between feature, class label
• How many parameters for K classes and N features?

Feature | Class | Weight
action | NN | 0.13
action | VB | -0.1
action | JJ | -0.21
SUFF-tion | NN | 1.3
SUFF-tion | VB | -2.1
SUFF-tion | JJ | -1.7
SUFF-on | NN | 0.01
SUFF-on | VB | -0.02
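As a sketch of how such weights turn into a prediction: with one weight per (feature, class) pair, N features and K classes give N × K weights (implementations often add one bias weight per class as well). The code below sums the weights of the active features for each class and normalizes with a softmax. Class and method names are hypothetical, and the SUFF-on/JJ weight, which the slide does not show, is set to 0.0 as a placeholder.

```java
class MaxEntScoring {
    // One weight per (feature, class) pair.
    static int parameterCount(int numFeatures, int numClasses) {
        return numFeatures * numClasses;
    }

    // P(class | features) is proportional to exp(sum of weights of active features).
    // weights[f][c] is the weight linking feature f to class c.
    static double[] classProbabilities(double[][] weights, int[] activeFeatures) {
        int numClasses = weights[0].length;
        double[] scores = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            for (int f : activeFeatures) scores[c] += weights[f][c];
            max = Math.max(max, scores[c]);
        }
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            scores[c] = Math.exp(scores[c] - max); // subtract max for numerical stability
            sum += scores[c];
        }
        for (int c = 0; c < numClasses; c++) scores[c] /= sum;
        return scores;
    }

    public static void main(String[] args) {
        // Rows: action, SUFF-tion, SUFF-on. Columns: NN, VB, JJ.
        double[][] w = {
            {0.13, -0.1, -0.21},
            {1.3, -2.1, -1.7},
            {0.01, -0.02, 0.0} // last value is a placeholder, not from the slide
        };
        double[] p = classProbabilities(w, new int[] {0, 1, 2});
        System.out.println("NN=" + p[0] + " VB=" + p[1] + " JJ=" + p[2]);
    }
}
```

For the word "action" all three features fire, and the large SUFF-tion weight for NN dominates the result.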

Training Classifier objects

cc.mallet.optimize

interface Optimizer
boolean optimize()
(Limited-memory BFGS, Conjugate gradient…)

interface Optimizable
interface ByValue
interface ByValueGradient
(Specific objective functions)

Training Classifier objects

cc.mallet.classify

MaxEntOptimizableByLabelLikelihood

For Optimizable interface:
double[] getParameters()
void setParameters(double[] parameters)
…

Log likelihood and its first derivative:
double getValue()
void getValueGradient(double[] buffer)
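The division of labor between an objective function and an optimizer can be sketched with a toy example. The Optimizable interface below is hypothetical and only mirrors the method names on the slide; the objective is a simple concave quadratic rather than a log likelihood, and fixed-step gradient ascent stands in for L-BFGS.

```java
import java.util.Arrays;

class GradientAscentDemo {
    // Hypothetical interface that mirrors the method names listed on the slide.
    interface Optimizable {
        double[] getParameters();
        void setParameters(double[] parameters);
        double getValue();
        void getValueGradient(double[] buffer);
    }

    // Toy objective f(w) = -(w0 - 1)^2 - (w1 + 2)^2, maximized at w = (1, -2).
    static class Quadratic implements Optimizable {
        private double[] w = {0.0, 0.0};

        public double[] getParameters() { return w.clone(); }

        public void setParameters(double[] parameters) { w = parameters.clone(); }

        public double getValue() {
            return -(w[0] - 1) * (w[0] - 1) - (w[1] + 2) * (w[1] + 2);
        }

        // Writes the first derivative of getValue() into buffer.
        public void getValueGradient(double[] buffer) {
            buffer[0] = -2 * (w[0] - 1);
            buffer[1] = -2 * (w[1] + 2);
        }
    }

    // The optimizer only sees parameters, value, and gradient.
    static void optimize(Optimizable f, double stepSize, int iterations) {
        double[] gradient = new double[f.getParameters().length];
        for (int i = 0; i < iterations; i++) {
            double[] w = f.getParameters();
            f.getValueGradient(gradient);
            for (int j = 0; j < w.length; j++) w[j] += stepSize * gradient[j];
            f.setParameters(w);
        }
    }

    public static void main(String[] args) {
        Quadratic f = new Quadratic();
        optimize(f, 0.1, 200);
        System.out.println(Arrays.toString(f.getParameters()) + " value=" + f.getValue());
    }
}
```

Because the optimizer never looks inside the objective, the same loop works for any function that supplies a value and a gradient.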

Evaluation of Classifiers

cc.mallet.types

• Create random test/train splits

InstanceList[] instanceLists =
    instances.split(new Randoms(), new double[] {0.9, 0.1, 0.0});

90% training
10% testing
0% validation
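A proportional random split can be sketched with the standard library: shuffle a copy of the data, then cut it at the cumulative proportions. This is a hypothetical illustration of the idea, not MALLET's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitDemo {
    // Shuffles a copy of the items and cuts it into consecutive parts by proportion.
    // Proportions are assumed to be non-negative and to sum to at most 1.
    static <T> List<List<T>> split(List<T> items, Random random, double[] proportions) {
        List<T> shuffled = new ArrayList<T>(items);
        Collections.shuffle(shuffled, random);
        List<List<T>> parts = new ArrayList<List<T>>();
        double cumulative = 0.0;
        int start = 0;
        for (double p : proportions) {
            cumulative += p;
            int end = (int) Math.round(cumulative * shuffled.size());
            end = Math.max(start, Math.min(end, shuffled.size()));
            parts.add(new ArrayList<T>(shuffled.subList(start, end)));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<Integer>();
        for (int i = 0; i < 100; i++) data.add(i);
        List<List<Integer>> parts = split(data, new Random(1), new double[] {0.9, 0.1, 0.0});
        System.out.println(parts.get(0).size() + " training, "
            + parts.get(1).size() + " testing, "
            + parts.get(2).size() + " validation");
    }
}
```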

Evaluation of Classifiers

cc.mallet.classify

• The Trial class stores the results of classifications on an InstanceList (testing or training)

Trial(Classifier c, InstanceList list)
double getAccuracy()
double getAverageRank()
double getF1(int/Label/Object)
double getPrecision(…)
double getRecall(…)
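The quantities behind accuracy, precision, recall, and F1 can be worked through on a handful of predictions. The sketch below is hypothetical and independent of MALLET; returning 0.0 when a denominator is zero is a convention chosen here, and a library may handle that case differently.

```java
class MetricsDemo {
    // Fraction of positions where the predicted label equals the true label.
    static double accuracy(String[] truth, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i].equals(predicted[i])) correct++;
        }
        return (double) correct / truth.length;
    }

    // Of the instances predicted as `label`, the fraction that truly are `label`.
    static double precision(String[] truth, String[] predicted, String label) {
        int truePositives = 0, predictedCount = 0;
        for (int i = 0; i < truth.length; i++) {
            if (predicted[i].equals(label)) {
                predictedCount++;
                if (truth[i].equals(label)) truePositives++;
            }
        }
        return predictedCount == 0 ? 0.0 : (double) truePositives / predictedCount;
    }

    // Of the instances that truly are `label`, the fraction predicted as `label`.
    static double recall(String[] truth, String[] predicted, String label) {
        int truePositives = 0, actualCount = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i].equals(label)) {
                actualCount++;
                if (predicted[i].equals(label)) truePositives++;
            }
        }
        return actualCount == 0 ? 0.0 : (double) truePositives / actualCount;
    }

    // Harmonic mean of precision and recall.
    static double f1(String[] truth, String[] predicted, String label) {
        double p = precision(truth, predicted, label);
        double r = recall(truth, predicted, label);
        return p + r == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        String[] truth = {"good", "good", "bad", "bad", "good"};
        String[] predicted = {"good", "bad", "bad", "bad", "good"};
        System.out.println("accuracy=" + accuracy(truth, predicted));
        System.out.println("precision(good)=" + precision(truth, predicted, "good"));
        System.out.println("recall(good)=" + recall(truth, predicted, "good"));
        System.out.println("f1(good)=" + f1(truth, predicted, "good"));
    }
}
```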

Review

• I have invented a new classifier: David regression.
– What class should I implement to classify instances?
– What class should I implement to train a David regression classifier?
– I want to train using ByValueGradient. What mathematical functions do I need to code up, and what class should I put them in?
– How would I check whether my new classifier works better than Naïve Bayes?

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Sequence Tagging

• Data occurs in sequences
• Categorical labels for each position
• Labels are correlated

DET NN VBS VBG
the dog likes running

?? ?? ?? ??
the dog likes running

Sequence Tagging

• Classification: n-way
• Sequence Tagging: n^T-way

[Figure: a lattice with one column of candidate labels (NN, JJ, PRP, VB, CC) for each position in “… or red dogs on blue trees”]

Avoiding Exponential Blowup

• Markov property
• Dynamic programming

Markov property (labels DET JJ NN VB): this one, given this one, is independent of these.

[Figure: portrait of Andrei Markov. The label lattice (NN, JJ, PRP, VB, CC per position) over “… or red dogs on blue trees” is shown repeatedly, with the leftmost column dropped on each successive slide: “red dogs on blue trees”, then “dogs on blue trees”.]
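The standard dynamic program that exploits the Markov property is the Viterbi algorithm, named here as the usual technique rather than something the slides spell out. It finds the best of the n^T label sequences in roughly T × n² steps by keeping, for each position and label, only the best score of any path ending there. The sketch below is hypothetical and uses made-up probabilities for three labels.

```java
import java.util.Arrays;

class ViterbiDemo {
    // Converts probabilities to log scores.
    static double[] logs(double... probabilities) {
        double[] result = new double[probabilities.length];
        for (int i = 0; i < probabilities.length; i++) result[i] = Math.log(probabilities[i]);
        return result;
    }

    // Returns the highest-scoring label sequence under a first-order model.
    // start[j]: log score of beginning in label j
    // trans[i][j]: log score of moving from label i to label j
    // emit[t][j]: log score of label j for the observation at position t
    static int[] viterbi(double[] start, double[][] trans, double[][] emit) {
        int length = emit.length, n = start.length;
        double[][] best = new double[length][n]; // best score of any path ending in (t, j)
        int[][] back = new int[length][n];       // previous label on that best path
        for (int j = 0; j < n; j++) best[0][j] = start[j] + emit[0][j];
        for (int t = 1; t < length; t++) {
            for (int j = 0; j < n; j++) {
                best[t][j] = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < n; i++) {
                    double score = best[t - 1][i] + trans[i][j] + emit[t][j];
                    if (score > best[t][j]) {
                        best[t][j] = score;
                        back[t][j] = i;
                    }
                }
            }
        }
        int[] path = new int[length];
        for (int j = 1; j < n; j++) {
            if (best[length - 1][j] > best[length - 1][path[length - 1]]) path[length - 1] = j;
        }
        for (int t = length - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Labels: 0 = DET, 1 = NN, 2 = VB. Observations: "the dog runs".
        double[] start = logs(0.8, 0.1, 0.1);
        double[][] trans = { logs(0.05, 0.9, 0.05), logs(0.1, 0.3, 0.6), logs(0.4, 0.4, 0.2) };
        double[][] emit = { logs(0.9, 0.05, 0.05), logs(0.05, 0.8, 0.15), logs(0.05, 0.35, 0.6) };
        System.out.println(Arrays.toString(viterbi(start, trans, emit)));
    }
}
```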

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: fully generative

P(Labels | Data) = P(Data, Labels) / P(Data)

• Conditional Random Field: conditional

P(Labels | Data)

Hidden Markov Models and Conditional Random Fields

• Hidden Markov Model: simple (independent) output space
– “NSF-funded”
– FeatureSequence: int[]
• Conditional Random Field: arbitrarily complicated outputs
– “NSF-funded” CAPITALIZED HYPHENATED ENDS-WITH-ed ENDS-WITH-d …
– FeatureVectorSequence: FeatureVector[]

Importing Data

• SimpleTagger format: one word per line, with instances delimited by a blank line

Call VB
me PPN
Ishmael NNP
. .

Some JJ
years NNS
…

The same data with extra features on each line:

Call SUFF-ll VB
me TWO_LETTERS PPN
Ishmael BIBLICAL_NAME NNP
. PUNCTUATION .

Some CAPITALIZED JJ
years TIME SUFF-s NNS
…
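Reading this format amounts to grouping lines at blank lines and splitting each line on whitespace, with the last field taken as the label. The parser below is a hypothetical sketch using only the standard library, not MALLET's own reader.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleTaggerFormatDemo {
    // One token: its features (the word itself comes first) and its label (last field).
    static class TaggedToken {
        final List<String> features = new ArrayList<String>();
        String label;
    }

    // Splits the text into sentences at blank lines; each non-blank line is one token.
    static List<List<TaggedToken>> parse(String text) {
        List<List<TaggedToken>> sentences = new ArrayList<List<TaggedToken>>();
        List<TaggedToken> current = new ArrayList<TaggedToken>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the current sentence, if there is one.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<TaggedToken>();
                }
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            TaggedToken token = new TaggedToken();
            for (int i = 0; i < fields.length - 1; i++) token.features.add(fields[i]);
            token.label = fields[fields.length - 1];
            current.add(token);
        }
        if (!current.isEmpty()) sentences.add(current);
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Call SUFF-ll VB\nme TWO_LETTERS PPN\nIshmael BIBLICAL_NAME NNP\n. PUNCTUATION .\n\n"
            + "Some CAPITALIZED JJ\nyears TIME SUFF-s NNS\n";
        for (List<TaggedToken> sentence : parse(text)) {
            for (TaggedToken token : sentence) {
                System.out.print(token.features + "/" + token.label + " ");
            }
            System.out.println();
        }
    }
}
```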

Importing Data

cc.mallet.pipe, cc.mallet.pipe.iterator

LineGroupIterator

SimpleTaggerSentence2TokenSequence() // String to Tokens, handles labels
[Pipes that modify tokens]
TokenSequence2FeatureVectorSequence() // Token objects to FeatureVectors

Importing Data

cc.mallet.pipe.tsf

// Ishmael
TokenTextCharSuffix("C2=", 2)
// Ishmael C2=el
RegexMatches("CAP", Pattern.compile("\\p{Lu}.*")) // must match entire string
// Ishmael C2=el CAP
LexiconMembership("NAME", new File("names"), false) // file has one name per line; boolean: ignore case?
// Ishmael C2=el CAP NAME
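The three kinds of token feature on this slide (character suffix, whole-string regex match, lexicon membership) can be mimicked with the standard library. The helper below is hypothetical; adding a suffix only for tokens longer than the suffix, and the case-sensitive lexicon lookup, are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class TokenFeaturesDemo {
    // An uppercase letter followed by anything.
    static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}.*");

    // Returns the token text followed by the names of the features that fire.
    static List<String> features(String token, Set<String> names) {
        List<String> features = new ArrayList<String>();
        features.add(token);
        // Two-character suffix, only for tokens longer than the suffix (assumed).
        if (token.length() > 2) features.add("C2=" + token.substring(token.length() - 2));
        // matches() succeeds only when the whole token matches the pattern.
        if (CAPITALIZED.matcher(token).matches()) features.add("CAP");
        // Case-sensitive lexicon lookup.
        if (names.contains(token)) features.add("NAME");
        return features;
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<String>(Arrays.asList("Ishmael", "Ahab"));
        System.out.println(features("Ishmael", names));
        System.out.println(features("me", names));
    }
}
```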

Sliding window features

a red dog on a blue tree

Features added for “dog”: red@-1, a@-2, on@1, a@-2_&_red@-1

Importing Data

cc.mallet.pipe.tsf

int[][] conjunctions = new int[3][];
conjunctions[0] = new int[] { -1 }; // previous position
conjunctions[1] = new int[] { 1 }; // next position
conjunctions[2] = new int[] { -2, -1 }; // previous two

OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 on@1

With a suffix feature added first:

TokenTextCharSuffix("C1=", 1)
OffsetConjunctions(conjunctions)
// a@-2_&_red@-1 a@-2_&_C1=d@-1
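The naming scheme for offset conjunctions can be reproduced in a few lines. This hypothetical sketch treats each token as having a single feature (its text); a full implementation would combine every feature present at each offset, as the C1= example indicates.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetConjunctionDemo {
    // Builds conjunction features for the token at `position`.
    // Each inner array lists relative offsets; the features found at those
    // offsets are joined with "_&_". Groups that reach outside the sequence
    // are skipped.
    static List<String> conjunctions(String[] tokens, int position, int[][] offsets) {
        List<String> result = new ArrayList<String>();
        for (int[] group : offsets) {
            StringBuilder name = new StringBuilder();
            boolean inRange = true;
            for (int offset : group) {
                int index = position + offset;
                if (index < 0 || index >= tokens.length) {
                    inRange = false;
                    break;
                }
                if (name.length() > 0) name.append("_&_");
                name.append(tokens[index]).append('@').append(offset);
            }
            if (inRange) result.add(name.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "red", "dog", "on", "a", "blue", "tree"};
        int[][] offsets = { {-1}, {1}, {-2, -1} };
        // Features for "dog" at position 2.
        System.out.println(conjunctions(tokens, 2, offsets));
    }
}
```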

Finite State Transducers

• Finite state machine over two alphabets (observed, hidden)

Building up “the dog …” one step at a time:

Start in state DET: P(DET)
DET emits “the”: P(the | DET)
DET moves to NN: P(NN | DET)
NN emits “dog”: P(dog | NN)
NN moves to VBS: P(VBS | NN)

How many parameters?

• Determines efficiency of training
• Too many leads to overfitting

Trick: Don’t allow certain transitions

P(VBS | DET) = 0

[Figure: three alternative model structures over the labels DET NN VBS and the words “the dog runs”]

Finite State Transducers

cc.mallet.fst

abstract class Transducer
CRF
HMM

abstract class TransducerTrainer
CRFTrainerByLabelLikelihood
HMMTrainerByLikelihood

Finite State Transducers

cc.mallet.fst

First order: one weight for every pair of labels and observations.

CRF crf = new CRF(pipe, null);
crf.addFullyConnectedStates();
// or
crf.addStatesForLabelsConnectedAsIn(instances);

“Three-quarter” order: one weight for every pair of labels and observations.

crf.addStatesForThreeQuarterLabelsConnectedAsIn(instances);

Second order: one weight for every triplet of labels and observations.

crf.addStatesForBiLabelsConnectedAsIn(instances);

“Half” order: equivalent to independent classifiers, except some transitions may be illegal.

crf.addStatesForHalfLabelsConnectedAsIn(instances);

[Figure on each slide: the labels DET NN VBS over the words “the dog runs”]
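To get a feel for how quickly the parameter count grows with model order, here is a back-of-the-envelope calculation. It reads the slide descriptions literally (one weight per label pair or label triplet, times the number of observation features) and is a hypothetical sketch; the exact count in a real CRF depends on how its states and transitions are built.

```java
class TransducerParameterCount {
    // First order: one weight per (previous label, current label, observation feature).
    static long firstOrder(long labels, long features) {
        return labels * labels * features;
    }

    // Second order: one weight per (label, label, label, observation feature).
    static long secondOrder(long labels, long features) {
        return labels * labels * labels * features;
    }

    // Disallowing transitions: only the allowed label pairs carry weights.
    static long firstOrderRestricted(long allowedTransitions, long features) {
        return allowedTransitions * features;
    }

    public static void main(String[] args) {
        long labels = 45, features = 100000; // illustrative sizes, not from the slides
        System.out.println("first order:  " + firstOrder(labels, features));
        System.out.println("second order: " + secondOrder(labels, features));
        System.out.println("restricted:   " + firstOrderRestricted(500, features));
    }
}
```

Connecting states only as they appear in the training instances is one way of keeping the allowed-transition count well below the full square of the label count.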

Training a transducer

cc.mallet.fst

CRF crf = new CRF(pipe, null);
crf.addStatesForLabelsConnectedAsIn(trainingInstances);
CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(crf);
trainer.train(trainingInstances); // train on the same InstanceList

Evaluating a transducer

cc.mallet.fst

CRFTrainerByLabelLikelihood trainer = new CRFTrainerByLabelLikelihood(transducer);
TransducerEvaluator evaluator = new TokenAccuracyEvaluator(testing, "testing");
trainer.addEvaluator(evaluator);
trainer.train(training); // "training" is an assumed name for the training InstanceList

Applying a transducer

cc.mallet.fst

Sequence output = transducer.transduce(input);

// Print each input element with its predicted label
for (int index = 0; index < input.size(); index++) {
    System.out.print(input.get(index) + "/");
    System.out.print(output.get(index) + " ");
}

Review

• How do you add new features to TokenSequences?
• What are three factors that affect the number of parameters in a model?

Outline

• About MALLET
• Representing Data
• Classification
• Sequence Tagging
• Topic Modeling

Topics: “Semantic Groups”

News Article

“Sports”: team, player, game
“Negotiation”: strike, deadline, union

Example topic word lists:

Series Yankees Sox Red World League game Boston team games baseball Mets Game series won Clemens Braves Yankee teams

players League owners league baseball union commissioner Baseball Association labor Commissioner Football major teams Selig agreement strike team bargaining

Training a Topic Model

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

Evaluating a Topic Model

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

MarginalProbEstimator evaluator = lda.getProbEstimator();
double logLikelihood = evaluator.evaluateLeftToRight(testing, 10, false, null);

Inferring topics for new documents

cc.mallet.topics

ParallelTopicModel lda = new ParallelTopicModel(numTopics);
lda.addInstances(trainingInstances);
lda.estimate();

TopicInferencer inferencer = lda.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(instance, 100, 10, 10);

More than words…

• Text collections mix free text and structured data

David Mimno
Andrew McCallum
UAI
2008
“Topic models conditioned on arbitrary features using Dirichlet-multinomial regression. …”

Dirichlet-multinomial Regression (DMR)

The corpus specifies a vector of real-valued features (x) for each document, of length F.

Each topic has an F-length vector of parameters.
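One way to see how the per-topic parameters act on a document's features: in the DMR formulation as I understand it from the cited paper (the slides do not state the formula, so treat it as an assumption), the Dirichlet prior weight of topic t in a document is the exponential of the dot product between the document's feature vector and the topic's parameter vector. A hypothetical sketch, using the “<default>” and “ICML” values listed for the RL topic in this deck:

```java
class DmrPriorDemo {
    // alpha[t] = exp(sum over f of x[f] * lambda[t][f])
    // x: document features (length F); lambda[t]: parameters of topic t (length F).
    static double[] topicPrior(double[] x, double[][] lambda) {
        double[] alpha = new double[lambda.length];
        for (int t = 0; t < lambda.length; t++) {
            double dot = 0.0;
            for (int f = 0; f < x.length; f++) dot += x[f] * lambda[t][f];
            alpha[t] = Math.exp(dot);
        }
        return alpha;
    }

    public static void main(String[] args) {
        // Feature order: <default>, "published in ICML". One topic (RL).
        double[][] lambda = { {-3.76, 2.88} };
        double[] icmlPaper = {1.0, 1.0};
        double[] otherPaper = {1.0, 0.0};
        System.out.println("ICML paper:  " + topicPrior(icmlPaper, lambda)[0]);
        System.out.println("other paper: " + topicPrior(otherPaper, lambda)[0]);
    }
}
```

Under this reading, a positive parameter raises the prior weight of the topic for documents that have the feature, and a negative one lowers it.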

Topic parameters for feature “published in JMLR”

Topic words | Parameter
user, users, user interface, interactive, interface | -1.44
web, web pages, web page, world wide web, web sites | -1.36
retrieval, information retrieval, query, query expansion | -1.23
strategies, strategy, adaptation, adaptive, driven | -1.21
agent, agents, multiagent, autonomous agents | -1.12
nearest neighbor, boosting, nearest neighbors, adaboost | 1.37
blind source separation, source separation, separation, channel | 1.40
reinforcement learning, learning, reinforcement | 1.41
bounds, vc dimension, bound, upper bound, lower bounds | 1.74
kernel, kernels, rational kernels, string kernels, fisher kernel | 2.27

Feature parameters for RL topic

Feature | Parameter
<default> | -3.76
COLING | -1.64
IEEE Trans. PAMI | -1.54
CVPR | -1.47
ACL | -1.38
Machine Learning Journal | 2.19
ECML | 2.45
Kenji Doya | 2.56
ICML | 2.88
Sridhar Mahadevan | 2.99

Topic parameters for feature “published in UAI”

Topic words | Parameter
nearest neighbor, boosting, nearest neighbors, adaboost | -1.50
descriptions, description, top, bottom, top bottom | -1.50
workshop report, invited talk, international conference, report | -1.37
digital libraries, digital library, digital, library | -1.36
shape, deformable, shapes, contour, active contour | -1.29
reasoning, logic, default reasoning, nonmonotonic reasoning | 2.11
uncertainty, symbolic, sketch, primal sketch, uncertain, connectionist | 2.25
probability, probabilities, probability distributions | 2.25
qualitative, reasoning, qualitative reasoning, qualitative simulation | 2.26
bayesian networks, bayesian network, belief networks | 2.88

Feature parameters for Bayes nets topic

Feature | Parameter
<default> | -3.36
ICRA | -2.24
Neural Networks | -1.50
COLING | -1.38
Probabilistic Semantics for Nonmonotonic Reasoning (Pearl, KR, 1989) | -1.16
Loopy Belief Propagation for Approximate Inference (Murphy, Weiss, and Jordan, UAI, 1999) | 2.04
Philippe Smets | 2.15
Ashraf M. Abdelbar | 2.23
Mary-Anne Williams | 2.41
UAI | 2.88

Dirichlet-multinomial Regression

• Arbitrary observed features of documents
• Target contains FeatureVector

DMRTopicModel dmr = new DMRTopicModel(numTopics);
dmr.addInstances(training);
dmr.estimate();
dmr.writeParameters(new File("dmr.parameters"));

Polylingual Topic Modeling

• Topics exist in more languages than you could possibly learn
• Topically comparable documents are much easier to get than translation sets
• Translation dictionaries
– cover pairs, not sets of languages
– miss technical vocabulary
– aren’t available for low-resource languages

[Figure: Topics from European Parliament Proceedings]

[Figure: Topics from Wikipedia]

Aligned instance lists

dog… chien… hund…
cat… chat…
pig… schwein…

Polylingual Topics

InstanceList[] training = new InstanceList[] { english, german, arabic, mahican };

PolylingualTopicModel pltm = new PolylingualTopicModel(numTopics);

pltm.addInstances(training);

MALLET hands-on tutorial

http://mallet.cs.umass.edu/mallet-handson.tar.gz
