Topic Models for Morphologically Rich Languages

Michael Elhadad, Meni Adler, Yoav Goldberg, Rafi Cohen

23 Jan 2011, Haifa
Machine Translation and Morphologically-rich Languages

Topic Models
- Unsupervised discovery of topics in a text collection
- Useful to browse/explore large corpora by theme
- Topic evolution over time
- Author-topic models
- Difficult to evaluate / task-based evaluations help:
  - WSD
  - Summarization
  - IR
  - Sentiment analysis
- Multilingual LDA could help as a feature for MT

Topic Models and Rich Morphology
- Topic models from text in Hebrew
  - Rich morphology
  - High number of distinct word forms
  - High ambiguity

- Halakhic Domain (Jewish Religious Law)
  - Mixture of languages (Hebrew / Aramaic)
  - Various historical / geographical / subdomains
  - Existing metadata: can we exploit it?
- Medical Domain
  - Patient letters / eHealth QA site
  - High level of English/Hebrew mixture (transliterations)
  - Existing metadata (UMLS): can we exploit it?

Work in progress

Outline
- Topic Analysis with LDA
- Domain: Halakhic Sources / Medical dataset
- Combining LDA and Morphological Analysis
- Combining Semantic Priors and LDA
- Multilingual Topic Models
- Evaluating Topic Models

Objectives
- Input:
  - Domain-specific text corpus in Hebrew
  - Metadata on documents (tags, alignment to English tags)
- Output:
  - Topic model: discover topics discussed in the corpus
  - Recognize topics in unseen text
  - Index text collection by topic
- Task: something where topics help
  - WSD, IR, text categorization, clustering
  - Some part of MT?

Term Ambiguity and What is a Topic?
The Hebrew term for ox/bull refers to many complex halakhic topics:
- Damages (goring ox)
- Kosher meat (slaughter)
- Sacrifices
- Shabbat (domestic animals must rest)
- Calendar (Zodiac sign Taurus)

What are these topics?

Terms are disambiguated in context (Ox + Shabbat)

- Associate a word to a topic
- Associate a document to topics

Discovering Topic Models: LDA
- Latent Dirichlet Allocation (Blei, Ng and Jordan 2003)
- Discovers (unsupervised) topic structures in a document collection
- Topics are modeled as distributions over words
- Probabilistic generative model of text

What can be done with an LDA Topic Model?
Source: D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. 2009

Structure of an LDA Model
[Figure from (Blei 2008)]

The LDA Model
- Observations: documents are composed of words.
- Latent variable: each document expresses a few topics.
- Generative probabilistic model:
  - Each document is a mixture of topics
  - Each word is drawn from the topics active in the document
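The generative story above can be made concrete with a small sampler. This is a minimal sketch rather than code from the talk: the function name `generate_corpus`, the symmetric Dirichlet hyperparameters `alpha` and `eta`, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    """Sample a toy corpus following the LDA generative story."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic: a K x V matrix whose rows sum to 1.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)
    docs, mixtures = [], []
    for _ in range(n_docs):
        # Each document is a mixture of topics.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Each word first picks a topic, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
        mixtures.append(theta)
    return topics, np.array(mixtures), docs

topics, mixtures, docs = generate_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=10)
```

Estimation runs this story in reverse: only `docs` is observed, and the topic matrix and the per-document mixtures have to be inferred.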

LDA Graphical Model
[Figure from (Blei 2008)]

LDA Generative Process
[Figure from (Blei 2008); annotations: words (w1, w2, ..., wV), topics (t1, ..., tK)]

LDA Estimation
[Figure from (Blei 2008); annotation: matrix K x V]

LDA Approximation
[Figure from (Blei 2008)]
- Generally use Gibbs sampling to estimate

Gibbs Sampling
- Represent corpus as:
  - Array of words w[i]: fixed
  - Document indices d[i]: fixed
  - Topics z[i]: change
- Markov chain where states = topic assignments to words
- Macro-steps: assign a new topic to all the words
- Micro-steps: assign a new topic to each word w[i]
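The macro-step / micro-step description above matches the commonly used collapsed Gibbs sampler for LDA. The sketch below assumes that variant, with symmetric priors `alpha` and `eta` that the slides do not specify; `gibbs_lda` and the count-array names are hypothetical.

```python
import numpy as np

def gibbs_lda(w, d, n_topics, vocab_size, n_docs,
              alpha=0.1, eta=0.01, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling over the arrays w[i], d[i], z[i]."""
    rng = np.random.default_rng(seed)
    w, d = np.asarray(w), np.asarray(d)          # fixed
    z = rng.integers(n_topics, size=len(w))      # changes during sampling
    n_dk = np.zeros((n_docs, n_topics))          # document-topic counts
    n_kw = np.zeros((n_topics, vocab_size))      # topic-word counts
    n_k = np.zeros(n_topics)                     # tokens per topic
    for i in range(len(w)):
        n_dk[d[i], z[i]] += 1
        n_kw[z[i], w[i]] += 1
        n_k[z[i]] += 1
    for _ in range(n_sweeps):                    # macro-step: all words
        for i in range(len(w)):                  # micro-step: one word
            k = z[i]
            # Remove the current assignment of word i from the counts.
            n_dk[d[i], k] -= 1
            n_kw[k, w[i]] -= 1
            n_k[k] -= 1
            # Conditional distribution of z[i] given all other assignments.
            p = ((n_dk[d[i]] + alpha) * (n_kw[:, w[i]] + eta)
                 / (n_k + vocab_size * eta))
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            n_dk[d[i], k] += 1
            n_kw[k, w[i]] += 1
            n_k[k] += 1
    return z, n_dk, n_kw

# Toy corpus: two documents of four tokens over a four-word vocabulary.
w = [0, 1, 0, 1, 2, 3, 2, 3]
d = [0, 0, 0, 0, 1, 1, 1, 1]
z, n_dk, n_kw = gibbs_lda(w, d, n_topics=2, vocab_size=4, n_docs=2)
```

The final counts `n_kw` and `n_dk`, smoothed by the priors and normalized, give estimates of the topic-word and document-topic distributions.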

LDA in Hebrew

- Explore various datasets in Hebrew
- How well does LDA work on Hebrew?

Domain: Halakhic Sources
- Various historical / geographical backgrounds

The Mishna
- Mishna (Tanaim)
- Exhaustive code of Jewish Law
- Written by R. Yehuda Hanasi (220 CE)
- 6 orders, 63 tractates, 524 chapters, 6K paragraphs, 350K words
- Hierarchical thematic organization by topics

Rambam's Mishne Torah
- Corpus of Mishne Torah (Rishonim)
- Exhaustive code of Halakha
- Written by Maimonides, 1170-1180
- 14 books, 85 sections, 1,000 chapters, 15K articles, 600K words

Responsa Corpus
- We manually constructed a reference corpus for testing purposes.
- Team of 5 Jewish Law experts, with metadata associated to each QA document.

Documents
- 8,000 responsa from 35 distinct books of various origins (geographical, historical)
- 3.6M words (avg 450 tokens per document)
- On average 4.5 tags per document (from the ontology)

Ontology of Halakha
- ~2,000 concepts
- ~5,000 relations among concepts, of 14 distinct types

Metadata
- Per book: author, location, publication date
- Per document:
  - Topics from index
  - References to "sources" (Bavli, Yerushalmi, Mishna, Tanakh, Shulhan 'arukh) (in progress)
  - References to other responsa (in progress)

Halakhic Corpus Specificity
- Language
  - Mixture (Hebrew + Aramaic)
  - Semitic languages: rich morphology
  - Many acronyms / abbreviations
- Wide variety of domains / historical background
- Various genres
  - Codes (hierarchical, synthetic)
  - Commentaries (segmented, linear)
  - Responsa (implicitly hypertextual, complex citations)
- Layers of corpus (derivation, authority)
  - Mishna -> Gmara -> Mishne Torah -> Responsa

Medical Corpus
- Infomed.co.il
  - Popular QA health site
  - 2M words / 4K documents
  - Annotated by site categories
  - 6,000 concepts / 3,000 mapped to UMLS
- Hospital patient release letters
  - Neurology department
  - 150K words / 1K documents
  - Manual UMLS concept annotation (in progress)

Medical Corpus Specificity
- Many unknown words (~20% token types)
- Many transliterations (Rafi's talk)
- Many named entities

Hebrew Morphological Analysis
Possible readings of a single Hebrew token:
- (name of an association)
- (while taking a picture)
- (their onion)
- (under their shades)
- (in a photographer)
- (in the photographer)
- (in an idol)
- (in the idol)

LDA and Morphology

Morphological Analysis
- proper noun
- verb, infinitive
- noun, singular, masculine
- noun, singular, masculine
- noun, singular, masculine, absolute
- noun, singular, masculine, construct
- noun, definite, singular, masculine

Many morphological variants
- One word: about 50 distinct forms in the corpus (12 forms on average)

Combining LDA and Morphology
- LDA picks up patterns of word co-occurrence in documents.
- Heavy morphological variation in Hebrew could mean we miss co-occurrences if we do not first analyze morphology.

What is the best method to combine LDA and morphological analysis?

Combining LDA and Morphology
3 options:
- Ignore morphology: token-based LDA
  - English LDA: stemming, filter POS (nouns)
- Pipeline: resolve morphological ambiguities, then learn LDA
  - Morphology: lemma is ambiguous
- Joint: learn LDA on distributions of lemmas conditioned by morphological analysis
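The difference between the first two options shows up in the bag of words handed to LDA. The sketch below is illustrative only: `TAGGER_OUTPUT` stands in for the output of a morphological disambiguator, and the transliterated strings are invented examples, not corpus data.

```python
from collections import Counter

# Hypothetical disambiguated analyzer output: surface form -> (lemma, POS).
# The transliterated strings are illustrative, not taken from the corpus.
TAGGER_OUTPUT = {
    "hshwr": ("shwr", "noun"),    # "the ox"
    "wshwr": ("shwr", "noun"),    # "and an ox"
    "shwrym": ("shwr", "noun"),   # "oxen"
    "ngx": ("ngx", "verb"),       # "gored"
}

def token_bag(tokens):
    """Option 1: ignore morphology and count surface forms."""
    return Counter(tokens)

def lemma_bag(tokens, keep_pos=("noun",)):
    """Option 2: pipeline. Count lemmas of the selected POS after disambiguation."""
    bag = Counter()
    for tok in tokens:
        lemma, pos = TAGGER_OUTPUT.get(tok, (tok, "unknown"))
        if pos in keep_pos:
            bag[lemma] += 1
    return bag

doc = ["hshwr", "ngx", "wshwr", "shwrym"]
surface_counts = token_bag(doc)   # four distinct types, one occurrence each
lemma_counts = lemma_bag(doc)     # a single noun lemma counted three times
```

With surface forms the three variants never share a count, so their co-occurrence with other words is split three ways. The pipeline merges them, at the cost of committing to one analysis per token, which is the commitment the joint option avoids.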

Joint LDA-Morphology Learning
[Figure: standard token-based LDA]

Joint LDA-Morphology Learning
[Figure: joint Morphology-LDA, constrained by tagger decision]

Joint LDA-Morphology works
- Token-based LDA in Hebrew gives no useful topics:
  - No semantic coherence (fewer than 1/3 of topics)
  - No alignment with semantic annotations
- LDA-Morphology works
  - Semantic coherence
  - More on evaluation

Semantic Coherence Evaluation
- Ask experts if they recognize a topic as coherent and to label it.
- Test on Rambam: 128 topics
  - 108 coherent topics with short label
  - 20 unrecognized [2 taggers / high agreement]
- Test on Medical Data: 128 topics
  - 115 coherent topics
- Test on Mishna: 128 topics
  - 60 coherent topics

Morphology Variants
- Variant models on Mishna dataset
  - LDA on nouns only
  - LDA on nouns and compound nouns (smixut)
- Semantic coherence only for the compound model
  - 80 coherent topics / 128 topics
  - Unstable: 75 coherent / 150 topics
- Marked compounds
  - 45 compounds appear as top terms in topics (out of 6,500 distinct compounds)
  - All recognized as key concepts by domain experts
  - More evaluation needed on term extraction
- Why such a difference with Rambam?

Outline
- Topic Analysis with LDA
- Domain: Halakhic Sources / Medical dataset
- Combining LDA and Morphological Analysis
- Evaluating Topic Models
- Combining Semantic Priors and LDA
- Multilingual Topic Models

How Good are Discovered Topics?
- Difficult to evaluate LDA topics
  - Many parameters
  - Each run gives slightly different results
  - How to compare topic models?
- Methods
  - Data-oriented evaluation
  - Semantic coherence
  - Ontology alignment evaluation
  - Task-based evaluation

Topic Evaluation Methods 1
- Data-oriented: measure fit between dataset and generative model seen as a language model (perplexity)
  - Seems to miss what is good about topics
- Semantic coherence
  - Subjective judgment
  - Individual topics meaningful? Can be labeled?
  - Topic/document assignments meaningful?
  - "Find the intruder" tests
  - Rank best word / worst word; find the intruder word
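A "find the intruder" item can be assembled mechanically from two topics. The sketch below is a simplified version of the idea: it takes the intruder from the top words of another topic, whereas published protocols also require the intruder to have low probability in the tested topic. The function name and the English word lists are hypothetical.

```python
import random

def word_intrusion_item(topic_top_words, other_topic_top_words,
                        n_shown=5, seed=0):
    """Build one test item: n_shown top words plus one intruder, shuffled."""
    rng = random.Random(seed)
    shown = list(topic_top_words[:n_shown])
    # Candidate intruders: top words of the other topic that are not
    # among the top words of the topic being tested.
    candidates = [word for word in other_topic_top_words
                  if word not in topic_top_words]
    intruder = rng.choice(candidates)
    item = shown + [intruder]
    rng.shuffle(item)
    return item, intruder

# Invented English glosses, for illustration only.
damages = ["ox", "damage", "owner", "pay", "pit", "fire"]
sacrifices = ["altar", "offering", "priest", "ox", "blood"]
item, intruder = word_intrusion_item(damages, sacrifices)
```

If judges reliably pick out `intruder` from `item`, the topic is coherent in the sense used above. If their choices look random, it is not.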

Evaluating Topic Model
The Hebrew term for ox/bull refers to many complex halakhic topics:
- Damages (goring ox)
- Kosher meat (slaughter)
- Sacrifices
- Shabbat (domestic animals must rest)
- Calendar (Zodiac sign Taurus)

Topics for (Ox) on Rambam Corpus
- Damages
- Sacrifices (?)
- Calendar
- Meat
- Sacrifices (again)
- Meat (again)
- Damages (again)
- Sacrifices (again)
- Shabbat + Lighting candles (?)
- Wine + Sacrifices (?)

Topics for a Document
- Document: Rambam, Book of Damages, Damages by Property, Chapter 12
- Topics: Damages, Damages, Units (??)

Topic Evaluation Methods 2
- Alignment Topic Model / Ontology
  - Does the topic model reproduce the existing metadata classification?
- Task-based evaluation
  - Do topics facilitate search or navigation?
  - For IR: relevance models with semantic smoothing
  - Do multilingual topics capture word alignments?

Semantic Coherence
- Subjective evaluation
  - Topic is meaningful / can be labeled?
  - Highly positive on Rambam and Medical
  - Low on Mishna until restricted to Compound+N / marked morphologically

Can topic semantic coherence be predicted?
- (Newman et al 2010) using a PMI measure
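A PMI-based coherence score averages the pointwise mutual information of all pairs among a topic's top words. The sketch below is a simplified variant under stated assumptions: co-occurrence is counted at the document level in a reference corpus passed in by the caller, and `eps` is a smoothing constant added here to avoid taking the log of zero. Neither choice is taken from the slides.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, reference_docs, eps=1e-12):
    """Mean pairwise PMI of a topic's top words (document-level co-occurrence)."""
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = [
        math.log((prob(a, b) + eps) / (prob(a) * prob(b) + eps))
        for a, b in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)

# Toy reference corpus with invented English glosses.
reference = [["ox", "damage"], ["ox", "damage"],
             ["altar", "priest"], ["altar", "priest"]]
coherent = pmi_coherence(["ox", "damage"], reference)
mixed = pmi_coherence(["ox", "altar"], reference)
```

Here `coherent` is close to log 2, since the two words always co-occur, while `mixed` is strongly negative because the pair never appears together.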

Ontology Alignment
- Rambam's Mishne Torah has existing structure
  - Hierarchy of Book/Section/Chapter
- We find good alignment Topic/Book
- Some topics are cross-concern (witnesses)

Topic -> Documents
- Fits the Rambam's classification

Alignment Topic / Books
- Document -> Book on Rambam's topic model
- Document = (book[1-14] / section[1-85] / document)

[Figure: Books x Topics]
- Topics: 5 general topics / 20 focus on 2 books / 30 skinny / 65 focus on 1 book
- Books: 1 book covers many topics / 2 books very few
- Book labels: ZRAIM, MADA, ZMANIM, NZIKIN, AVODA, KINYAN, TAHARA, KORBANOT, AHAVA, MISHPATIM, SHOFTIM, NASHIM, HAFLAA, KDUSHA
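One way to quantify how many books a topic "focuses on" is to count the books needed to cover most of the topic's weight. The sketch below is an assumption about how such a table could be computed, not the procedure used for the figure: the 0.8 `mass` threshold, the function name, and the toy inputs are all hypothetical.

```python
import numpy as np

def books_per_topic(doc_topic, doc_book, n_books, mass=0.8):
    """For each topic, count the books needed to cover `mass` of its weight."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    doc_book = np.asarray(doc_book)
    # book_topic[b, k] = total weight of topic k in the documents of book b.
    book_topic = np.zeros((n_books, doc_topic.shape[1]))
    np.add.at(book_topic, doc_book, doc_topic)
    share = book_topic / book_topic.sum(axis=0, keepdims=True)
    # Sort each topic's book shares in decreasing order, then count how many
    # books are needed before the cumulative share reaches `mass`.
    ordered = -np.sort(-share, axis=0)
    cumulative = np.cumsum(ordered, axis=0)
    return (cumulative < mass).sum(axis=0) + 1, share

# Toy input: four documents, three books, two topics.
doc_topic = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
doc_book = [0, 0, 1, 2]
n_needed, share = books_per_topic(doc_topic, doc_book, n_books=3)
```

In this toy input the first topic is concentrated in one book and the second is spread over two, so `n_needed` separates single-book topics from multi-book ones.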

Semantics and LDA
- LDA is fully unsupervised
- Learn better models with underlying semantic knowledge?
- Active field of research
- Excellent survey: Incorporating domain knowledge in latent topic models (Andrzejewski 2010)

Semantics and LDA: 3 Types of Approaches
- LDA+X: model additional observed data (Document+Tag)
  - Supervised LDA, Author-Topic, Topic-Link LDA
- Word-Topic constraints: prior constraints on word-topic association
  - Syntax: Syntactic Topic Model, HMM-LDA
  - Concept-Topic Model (semantic fields), LDAWN, Dirichlet Forest, Topic-in-Set
- Document-Topic constraints: prior constraints on document-topic association and among topics
  - Topic relations: hLDA, Correlated Topic Models, PAM
  - Document-Topic: Dirichlet Multinomial Regression, Labeled LDA, Logic LDA
  - Topics over time: DTM, TOT

[Figure: Word-Topic conditions / Topic-Topic conditions / Document-Tag observed]

Which Method for our domain?
- Document tags are available: Labeled LDA and DMR
- Hierarchical topic models (PAM)
- Hyperlinks exist but are difficult to extract: LinkLDA

Currently experimenting with Labeled-LDA on our datasets.
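The core constraint in Labeled LDA is that a document may only use the topics corresponding to its own tags. A minimal sketch of that restriction in a sampling step follows, assuming one topic per tag. The function name, the prior `alpha`, and the toy numbers are hypothetical and are not taken from the experiments mentioned above.

```python
import numpy as np

def labeled_topic_probs(doc_topic_counts, topic_word_probs, word,
                        doc_labels, alpha=0.1):
    """Topic distribution for one word, restricted to the document's tags."""
    doc_topic_counts = np.asarray(doc_topic_counts, dtype=float)
    topic_word_probs = np.asarray(topic_word_probs, dtype=float)
    # Mask out every topic whose tag is not attached to the document.
    mask = np.zeros(len(doc_topic_counts))
    mask[list(doc_labels)] = 1.0
    p = (doc_topic_counts + alpha) * topic_word_probs[:, word] * mask
    return p / p.sum()

# Toy example: three tag-topics, two-word vocabulary, document tagged {0, 2}.
counts = [2.0, 0.0, 1.0]
word_probs = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
p = labeled_topic_probs(counts, word_probs, word=0, doc_labels={0, 2})
```

Topic 1 gives word 0 the highest probability, yet it receives zero mass because the document does not carry that tag.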

Multilingual Topic Models
- Assume a bilingual document set (d_i, l_i)
- Can we catch patterns of word co-occurrence across languages?

MUTO (Boyd-Graber & Blei 2009)
- Combines 2 aspects in one generative model:
  - Align words across languages
  - Group words into topics

Multilingual LDA

MUTO Generative Process
- Choose matching m (m_st weight of (w_s, w_t))
- Choose multinomial term distributions:
  - For each language l in (S, T), choose a background distribution ρ_l for words not in m
  - For i in (1..K), choose topic T_i ~ Dir(λ) over the pairs in m
- For each document d in (1..D) with language l_d:
  - Choose topics θ_d ~ Dir(α)
  - For each n in (1..M_d):
    - Choose topic assignment z_n ~ Mult(θ_d)
    - Choose c_n from (matched, unmatched) uniformly
    - If c_n = matched: choose a pair ~ Mult(T_{z_n}(m)) / project on l_d
    - If c_n = unmatched: choose w_n ~ Mult(ρ_{l_d})

Learned bilingual topic (En/Ge)
- time:schatten
- world:kontakt
- history:roemisch
- number:nummer
- math:with
- term:zero
- axiom:axiom
- system:system
- theory:theorie

- Edit distance prior
- A bilingual dictionary helps
- Does much better on aligned corpora

Could topic models over documents help MT with document level features?

Conclusions
- Morphological analysis is critical to start exploring topic models in MRLs
- Topic models are hard to evaluate
- Semi-supervised topic models improve the quality of topics
- Multilingual topics can be learned
- Could help provide document-level direction in MT