topic models for morphologically rich languages

Topic Models for Morphologically Rich Languages Michael Elhadad, Meni Adler, Yoav Goldberg, Rafi Cohen 23 Jan 2011, Haifa Machine Translation and Morphologically-rich Languages

  • Topic Models for Morphologically Rich Languages

    Michael Elhadad, Meni Adler, Yoav Goldberg, Rafi Cohen

    23 Jan 2011, Haifa
    Machine Translation and Morphologically-rich Languages

  • Topic Models

    Unsupervised discovery of topics in a text collection
    Useful to browse/explore large corpora by theme
    Topic evolution over time
    Author-topic models
    Difficult to evaluate / task-based evaluations help:
      WSD
      Summarization
      IR
      Sentiment analysis
    Multilingual LDA could help as a feature for MT

  • Topic Models and Rich Morphology

    Topic models from text in Hebrew
      Rich morphology
      High number of distinct word forms
      High ambiguity

    Halakhic domain (Jewish religious law)
      Mixture of languages (Hebrew / Aramaic)
      Various historical / geographical / subdomains
      Existing metadata: can we exploit it?
    Medical domain
      Patient letters / eHealth QA site
      High level of English/Hebrew mixture (transliterations)
      Existing metadata (UMLS): can we exploit it?

    Work in progress

  • Outline

    Topic Analysis with LDA
    Domain: Halakhic Sources / Medical dataset
    Combining LDA and Morphological Analysis
    Combining Semantic Priors and LDA
    Multilingual Topic Models
    Evaluating Topic Models

  • Objectives

    Input:
      Domain-specific text corpus in Hebrew
      Metadata on documents (tags, alignment to English tags)
    Output:
      Topic model: discover topics discussed in the corpus
      Recognize topics in unseen text
      Index text collection by topic
    Task:
      Something where topics help: WSD, IR, text categorization, clustering
      Some part of MT?

  • Term Ambiguity and What is a Topic?

    The Hebrew word for ox/bull refers to many complex halakhic topics:
      Damages (goring ox)
      Kosher meat (slaughter)
      Sacrifices
      Shabbat (domestic animals must rest)
      Calendar (Zodiac sign Taurus)

    What are these topics?

    Terms are disambiguated in context (Ox + Shabbat)

    Associate a word to a topic
    Associate a document to topics

  • Discovering Topic Models: LDA

    Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003)
    Discovers (unsupervised) topic structures in a document collection
    Topics are modeled as distributions over words
    Probabilistic generative model of text

  • What can be done with an LDA Topic Model?

    D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Theory and Applications. 2009

  • Structure of an LDA Model

    Figure from (Blei 2008)

  • The LDA Model

    Observations: documents are composed of words.
    Latent variable: each document expresses a few topics.
    Generative probabilistic model:
      Each document is a mixture of topics
      Each word is drawn from the topics active in the document

  • LDA Graphical Model

    Figure from (Blei 2008)

  • LDA Generative Process

    Figure from (Blei 2008)

  • LDA Generative Process

    Figure from (Blei 2008), annotated with (w1, w2, ..., wV) and (t1, ..., tK)
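The generative story on the slides above (each document is a mixture over the topics t1..tK, each topic is a distribution over the vocabulary w1..wV) can be sketched in a few lines. This is a minimal illustration assuming symmetric Dirichlet priors; the function names and the default values of `alpha` and `eta` are assumptions chosen for the example, not values from the talk.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_index(probs, rng):
    """Draw an index according to the discrete distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, num_topics, vocab, alpha=0.5, eta=0.5, seed=0):
    """Generate a toy corpus by following the LDA generative process."""
    rng = random.Random(seed)
    # One distribution over the vocabulary per topic: a K x V matrix.
    topics = [sample_dirichlet(eta, len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One distribution over the K topics per document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)      # topic assigned to this word slot
            w = sample_index(topics[z], rng)  # word drawn from that topic
            doc.append(vocab[w])
        corpus.append(doc)
    return topics, corpus
```

Calling `generate_corpus(3, 5, 2, ["ox", "shabbat", "meat", "damages"])` returns the K x V topic matrix and three five-word documents. Estimation runs this story backwards: given only the documents, recover the topics and the per-document mixtures.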

  • LDA Estimation

    Figure from (Blei 2008)

  • LDA Estimation

    Figure from (Blei 2008), annotated: matrix K x V

  • LDA Approximation

    Figure from (Blei 2008)
    Generally use Gibbs sampling to estimate

  • Gibbs Sampling

    Represent the corpus as:
      Array of words w[i] (fixed)
      Document indices d[i] (fixed)
      Topics z[i] (change)
    Markov chain where states = topic assignments to words
    Macro-steps: assign a new topic to all the words
    Micro-steps: assign a new topic to each word w[i]
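The w[i] / d[i] / z[i] representation above maps directly onto a collapsed Gibbs sampler. The sketch below assumes the common collapsed variant with symmetric priors `alpha` and `eta`; the slides do not say which variant was used, and the function name and defaults are illustrative.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, sweeps=50, seed=0):
    """docs: list of documents, each a list of word ids in range(vocab_size)."""
    rng = random.Random(seed)
    # Flatten the corpus: w[i] and d[i] stay fixed, z[i] is resampled.
    w = [word for doc in docs for word in doc]
    d = [j for j, doc in enumerate(docs) for _ in doc]
    z = [rng.randrange(num_topics) for _ in w]

    # Count tables kept in sync with the current assignments z.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for i in range(len(w)):
        doc_topic[d[i]][z[i]] += 1
        topic_word[z[i]][w[i]] += 1
        topic_total[z[i]] += 1

    for _ in range(sweeps):          # macro-step: one pass over all words
        for i in range(len(w)):      # micro-step: resample z[i]
            k_old = z[i]
            doc_topic[d[i]][k_old] -= 1
            topic_word[k_old][w[i]] -= 1
            topic_total[k_old] -= 1
            # Unnormalized conditional probability of each topic for word i.
            weights = [
                (doc_topic[d[i]][k] + alpha)
                * (topic_word[k][w[i]] + eta) / (topic_total[k] + vocab_size * eta)
                for k in range(num_topics)
            ]
            k_new = rng.choices(range(num_topics), weights=weights, k=1)[0]
            z[i] = k_new
            doc_topic[d[i]][k_new] += 1
            topic_word[k_new][w[i]] += 1
            topic_total[k_new] += 1
    return z, doc_topic, topic_word
```

After enough sweeps, normalizing the rows of `topic_word` (plus `eta`) gives an estimate of the K x V topic matrix, and normalizing the rows of `doc_topic` (plus `alpha`) gives the per-document topic mixtures.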

  • LDA in Hebrew

    Explore various datasets in Hebrew
    How well does LDA work on Hebrew?

  • Domain: Halakhic Sources

    Various historical / geographical backgrounds

  • The Mishna

    Mishna (Tanaim)
    Exhaustive code of Jewish Law
    Written by R. Yehuda Hanasi (220 CE)
    6 orders, 63 tractates, 524 chapters, 6K paragraphs, 350K words
    Hierarchical thematic organization by topics

  • Rambam's Mishne Torah

    Corpus of Mishne Torah (Rishonim)
    Exhaustive code of Halakha
    Written by Maimonides, 1170-1180
    14 books, 85 sections, 1,000 chapters, 15K articles, 600K words

  • Responsa Corpus

    We manually constructed a reference corpus for testing purposes.
    Team of 5 Jewish Law experts, with metadata associated to each QA document.

    Documents
      8,000 responsa from 35 distinct books of various origins (geographical, historical)
      3.6M words (avg 450 tokens per document)
      On average 4.5 tags per document (from the ontology)

    Ontology of Halakha
      ~2,000 concepts
      ~5,000 relations among concepts, of 14 distinct types

    Metadata
      Per book: Author, Location, Publication Date
      Per document:
        Topics from index
        References to "sources" (Bavli, Yerushalmi, Mishna, Tanakh, Shulhan 'arukh) (in progress)
        References to other responsa (in progress)

  • Halakhic Corpus Specificity

    Language
      Mixture (Hebrew + Aramaic)
      Semitic languages: rich morphology
      Many acronyms / abbreviations
    Wide variety of domains / historical background
    Various genres
      Codes (hierarchical, synthetic)
      Commentaries (segmented, linear)
      Responsa (implicitly hypertextual, complex citations)
    Layers of corpus (derivation, authority)
      Mishna, Gmara, Mishne Tora, Responsa

  • Medical Corpus

    Infomed.co.il
      Popular QA health site
      2M words / 4K documents
      Annotated by site categories
      6,000 concepts / 3,000 mapped to UMLS
    Hospital patient release letters
      Neurology department
      150K words / 1K documents
      Manual UMLS concept annotation (in progress)

  • Medical Corpus Specificity

    Many unknown words (~20% of token types)
    Many transliterations (Rafi's talk)
    Many named entities

  • Hebrew Morphological Analysis

    Possible readings of one ambiguous Hebrew word form:
      name of an association
      while taking a picture
      their onion
      under their shades
      in a photographer
      in the photographer
      in an idol
      in the idol

  • Morphological Analysis

    proper noun
    verb, infinitive
    noun, singular, masculine
    noun, singular, masculine
    noun, singular, masculine, absolute
    noun, singular, masculine, construct
    noun, definite, singular, masculine

  • Many morphological variants

    One word: about 50 distinct forms in the corpus (12 forms on average)

  • Combining LDA and Morphology

    LDA picks up patterns of word co-occurrence in documents.
    Heavy morphological variation in Hebrew could mean we miss co-occurrences if we do not first analyze morphology.

    What is the best method to combine LDA and morphological analysis?

  • Combining LDA and Morphology

    3 options:
      Ignore morphology: token-based LDA
        English LDA: stemming, filter POS (nouns)
      Pipeline: resolve morphological ambiguities, then learn LDA
        Morphology: lemma is ambiguous
      Joint: learn LDA on distributions of lemmas conditioned by morphological analysis
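To make the difference between the token-based and pipeline options concrete, here is a small sketch of the two preprocessing choices. The lexicon is a hypothetical stand-in with transliterated forms; a real pipeline would call a Hebrew morphological analyzer and disambiguator, which are not shown here.

```python
from collections import Counter

# Hypothetical toy lexicon: surface form -> candidate (lemma, part of speech).
# A real system would call a morphological analyzer and disambiguator here.
TOY_LEXICON = {
    "shor": [("shor", "noun")],
    "hashor": [("shor", "noun")],
    "veshor": [("shor", "noun")],
    "lashor": [("shor", "noun")],
    "shabbat": [("shabbat", "noun")],
    "bashabbat": [("shabbat", "noun")],
}

def token_features(doc):
    """Option 1: ignore morphology and keep the surface tokens."""
    return list(doc)

def lemma_features(doc, keep_pos=("noun",)):
    """Option 2 (pipeline): pick one analysis per token, then keep its lemma."""
    features = []
    for token in doc:
        analyses = TOY_LEXICON.get(token, [(token, "unknown")])
        lemma, pos = analyses[0]  # stand-in for a tagger's disambiguation
        if pos in keep_pos:
            features.append(lemma)
    return features

docs = [["hashor", "bashabbat"], ["veshor", "shabbat"], ["lashor", "shor"]]
token_vocab = Counter(t for doc in docs for t in token_features(doc))
lemma_vocab = Counter(t for doc in docs for t in lemma_features(doc))
```

On this toy input the token-based vocabulary has six entries that never repeat across documents, while the lemma-based vocabulary has two entries that co-occur. That is the co-occurrence signal LDA needs. The joint option would keep all candidate analyses instead of committing to `analyses[0]`.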

  • Joint LDA-Morphology Learning

    Figure: standard token-based LDA

  • Joint LDA-Morphology Learning

    Figure: joint Morphology-LDA, constrained by tagger decision

  • Joint LDA-Morphology works

    Token-based LDA in Hebrew gives no useful topics:
      No semantic coherence (less than 1/3 of topics)
      No alignment with semantic annotations
    LDA-Morphology works:
      Semantic coherence
      More on evaluation

  • Morphology Variants

    Semantic Coherence Evaluation
    Ask experts if they recognize a topic as coherent and to label it.
    Test on Rambam, 128 topics:
      108 coherent topics with short label
      20 unrecognized [2 taggers / high agreement]
    Test on Medical Data, 128 topics:
      115 coherent topics
    Test on Mishna, 128 topics:
      60 coherent topics

  • Morphology Variants

    Variant models on Mishna dataset
      LDA on nouns only
      LDA on nouns and compound nouns (smixut)
    Semantic coherence only for Compound model
      80 coherent topics / 128 topics
      Unstable: 75 coherent / 150 topics
    Marked compounds
      45 compounds appear as top terms in topics (out of 6,500 distinct compounds)
      All recognized as key concepts by domain experts
      More evaluation needed on term extraction
    Why such a difference with Rambam?

  • Outline

    Topic Analysis with LDA
    Domain: Halakhic Sources / Medical dataset
    Combining LDA and Morphological Analysis
    Evaluating Topic Models
    Combining Semantic Priors and LDA
    Multilingual Topic Models

  • How Good are Discovered Topics?

    Difficult to evaluate LDA topics
      Many parameters
      Each run gives slightly different results
      How to compare topic models?
    Methods
      Data-oriented evaluation
      Semantic coherence
      Ontology alignment evaluation
      Task-based evaluation

  • Topic Evaluation Methods 1

    Data-oriented: measure fit between the dataset and the generative model seen as a language model (perplexity)
      Seems to miss what is good about topics
    Semantic coherence
      Subjective judgment
        Individual topics meaningful? Can be labeled?
        Topic/document assignments meaningful?
      "Find the intruder" tests
        Rank best word / worst word; find the intruder word
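For the data-oriented measure, perplexity treats the topic model as a language model: the lower the value, the better the model predicts the words. A minimal sketch, assuming the per-document topic weights `theta` and the topic-word probabilities `phi` have already been estimated (in practice on held-out documents); the names are illustrative.

```python
import math

def perplexity(docs, theta, phi):
    """docs: lists of word ids; theta[d][k]: topic weights; phi[k][w]: word probabilities."""
    log_likelihood = 0.0
    num_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of the word under the document's mixture of topics.
            p_word = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_word)
            num_tokens += 1
    return math.exp(-log_likelihood / num_tokens)
```

A model that spreads probability uniformly over a four-word vocabulary has perplexity 4. The number says nothing about whether a human can label the topics, which is why the slides turn to semantic coherence.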

  • Evaluating Topic Model

    The Hebrew word for ox/bull refers to many complex halakhic topics:
      Damages (goring ox)
      Kosher meat (slaughter)
      Sacrifices
      Shabbat (domestic animals must rest)
      Calendar (Zodiac sign Taurus)

  • Topics for Ox on Rambam Corpus

    Topics labeled: Damages, Sacrifices (?), Calendar, Meat, Sacrifices (again), Meat (again), Damages (again), Sacrifices (again)
    Harder to label: ? / Shabbat + Lighting candles / ?? / Wine + Sacrifices / ???

  • Topics for a Document

    Rambam, Book of Damages, Damages by Property, Chapter 12
    Topics: Damages, Damages, Units (??)

  • Topic Evaluation Methods 2

    Alignment topic model / ontology
      Does the topic model reproduce the existing metadata classification?
    Task-based evaluation
      Do topics facilitate search or navigation?
      For IR: relevance models with semantic smoothing
      Do multilingual topics capture word alignments?

  • Semantic Coherence

    Subjective evaluation
      Topic is meaningful / can be labeled?
      Highly positive on Rambam and Medical
      Low on Mishna until restricted to Compound+N / marked morphologically

    Can topic semantic coherence be predicted?
      (Newman et al 2010) using PMI measure
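A PMI-based coherence score can be sketched as the average pointwise mutual information over pairs of a topic's top words. The version below is a simplification that counts co-occurrence at the document level in a reference corpus; the exact counting scheme of Newman et al. is not reproduced here, and the function name and the smoothing constant `eps` are assumptions.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, reference_docs, eps=1e-12):
    """Mean pairwise PMI of a topic's top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in reference_docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        joint = prob(a, b)
        scores.append(math.log((joint + eps) / (prob(a) * prob(b) + eps)))
    return sum(scores) / len(scores)
```

Words that tend to appear in the same documents score high, while words that never co-occur score strongly negative, so the score can be compared against the expert judgments of coherence.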

  • Ontology Alignment

    Rambam's Mishne Torah has existing structure
      Hierarchy of Book/Section/Chapter
    We find good alignment Topic/Book
    Some topics are cross-concern (witnesses)

  • Topic Documents

    Fits the Rambam's classification

  • Alignment Topic / Books

    Document / Book on Rambam's topic model
    Document = (book[1-14] / section[1-85] / document)

    Figure axes: Books, Topics
    5 general topics / 20 focus on 2 books / 30 skinny / 65 focus on 1 book
    1 book covers many topics / 2 books very few
    Book labels: ZRAIM, MADA, ZMANIM, NZIKIN, AVODA, KINYAN, TAHARA, KORBANOT, AHAVA, MISHPATIM, SHOFTIM, NASHIM, HAFLAA, KDUSHA
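One way to quantify the topic/book alignment described above is to assign each document to its dominant topic and then check how concentrated each topic is in a single book. This is an illustrative sketch: the "purity" measure and the function names are assumptions, not the measure used in the talk.

```python
from collections import Counter, defaultdict

def topic_book_table(doc_books, doc_topic_weights):
    """Count, per dominant topic, how many documents fall in each book."""
    table = defaultdict(Counter)
    for book, weights in zip(doc_books, doc_topic_weights):
        dominant = max(range(len(weights)), key=weights.__getitem__)
        table[dominant][book] += 1
    return table

def topic_purity(book_counts):
    """Share of a topic's documents that belong to its most frequent book."""
    return max(book_counts.values()) / sum(book_counts.values())
```

A topic with purity close to 1 focuses on one book. A low-purity topic is either general or cross-concern, like the witnesses topic mentioned earlier.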

  • Semantics and LDA

    LDA is fully unsupervised
    Learn better models with underlying semantic knowledge?
    Active field of research
    Excellent survey: Incorporating Domain Knowledge in Latent Topic Models (Andrzejewski 2010)

  • Semantics and LDA: 3 Types of Approaches

    LDA+X: model additional observed data (Document+Tag)
      Supervised LDA, Author-Topic, Topic-Link LDA
    Word-Topic constraints: prior constraints on word-topic association
      Syntax: Syntactic Topic Model, HMM-LDA
      Concept-Topic Model (semantic fields), LDAWN, Dirichlet Forest, Topic-in-Set
    Document-Topic constraints: prior constraints on document-topic association and among topics
      Topic relations: hLDA, Correlated Topic Models, PAM
      Document-Topic: Dirichlet Multinomial Regression, Labeled LDA, Logic LDA
      Topics over time: DTM, TOT

  • Semantics and LDA: 3 Types of Approaches

    Figure labels: Word-Topic conditions, Topic-Topic conditions, Document-Tag observed

  • Which Method for our Domain?

    Document-Tags are available: Labeled LDA and DMR
    Hierarchical topic models (PAM)
    Hyperlinks exist but are difficult to extract: LinkLDA

    Currently experimenting with Labeled-LDA on our datasets.
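The core idea of Labeled LDA is that each topic corresponds to a tag and a document may only use the topics of its own tags. A minimal sketch of that constraint, applied to a vector of topic sampling weights; the helper name is hypothetical and this is not the authors' implementation.

```python
def restrict_to_labels(weights, doc_labels):
    """Zero out topics that are not among the document's tags, then renormalize."""
    masked = [w if k in doc_labels else 0.0 for k, w in enumerate(weights)]
    total = sum(masked)
    return [w / total for w in masked]
```

In a Gibbs sampler, applying this mask to the per-word topic weights of each document is enough to turn unsupervised LDA into the label-constrained variant, as long as every document has at least one tag.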

  • Multilingual Topic Models

    Assume a bilingual document set (d_i, l_i)
    Can we catch patterns of word co-occurrence across languages?

    MUTO (Boyd-Graber & Blei 2009)
    Combine 2 aspects in one generative model:
      Align words across languages
      Group words into topics

  • MUTO Generative Process

    Choose a matching m (m_st: weight of the pair (w_s, w_t))
    Choose multinomial term distributions:
      For each language l in (S, T), choose a background distribution ρ_l ~ Dir(γ) over the words not in m
      For each i in (1..K), choose a topic T_i ~ Dir(λ) over the pairs in m
    For each document d in (1..D) with language l_d:
      Choose topic proportions θ_d ~ Dir(α)
      For each n in (1..M_d):
        Choose a topic assignment z_n ~ Mult(θ_d)
        Choose c_n from (matched, unmatched) uniformly
        If c_n = matched: choose a pair ~ Mult(T_{z_n}(m)) and project it on l_d
        If c_n = unmatched: choose w_n ~ Mult(ρ_{l_d})
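The per-document part of the generative process above can be sketched as follows. To keep the example short, the matching is given as a fixed list of word pairs rather than sampled, and all names and hyperparameter values are assumptions for illustration.

```python
import random

def dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def generate_muto_doc(lang, length, matching, unmatched, topics, background, alpha, rng):
    """lang: 0 for source, 1 for target; matching: list of (w_s, w_t) pairs."""
    theta = dirichlet(alpha, len(topics), rng)  # topic proportions of the document
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta, k=1)[0]
        if rng.random() < 0.5:  # c_n = matched
            pair = rng.choices(matching, weights=topics[z], k=1)[0]
            words.append(pair[lang])  # project the pair on the document language
        else:                   # c_n = unmatched
            words.append(rng.choices(unmatched[lang], weights=background[lang], k=1)[0])
    return words

rng = random.Random(0)
matching = [("number", "nummer"), ("theory", "theorie"), ("system", "system")]
unmatched = [["the", "of"], ["der", "und"]]
topics = [dirichlet(0.5, len(matching), rng) for _ in range(2)]   # topics over pairs
background = [dirichlet(0.5, len(ws), rng) for ws in unmatched]    # one per language
english_doc = generate_muto_doc(0, 8, matching, unmatched, topics, background, 0.5, rng)
german_doc = generate_muto_doc(1, 8, matching, unmatched, topics, background, 0.5, rng)
```

Because topics are distributions over matched pairs, an English and a German document that share a topic draw from the same pairs and differ only in which side of each pair is emitted.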

  • Learned bi-lingual topic (En/Ge)

    time:schatten, world:kontakt, history:roemisch, number:nummer, math:with, term:zero, axiom:axiom, system:system, theory:theorie

    Edit distance prior
    A bilingual dictionary helps
    Does much better on aligned corpora

  • Could topic models over documents help MT with document-level features?

  • Conclusions

    Morphological analysis is critical to start exploring topic models in MRLs
    Topic models are hard to evaluate
    Semi-supervised topic models improve quality of topics
    Multilingual topics can be learned

    Could help provide document-level direction in MT