Ontology-based Faceted Search Engine for Halakhic text
Meni Adler, Yoav Goldberg, Michael Elhadad
Ben Gurion University
ISCOL 2010
Motivation
• Query: שבת (Shabbat)
• Full text
  • Google (many documents…)
  • פרויקט השו"ת (Responsa Project, ~100K documents)
  • מפעל המילון ההיסטורי (Historical Dictionary Project, ~20K documents)
  • מאגר ספרות הקודש (Sacred Literature Database, ~10K documents)
  • מכון ממרא (Mechon Mamre, ~10K documents)
• Scanned (OCR full text)
  • אוצר החכמה (Otzar HaHochma, ~42K books)
  • מאגר ספרים סרוקים (Scanned Books Repository, ~1.5K books)
  • Hebrew Books (~42K books)
Motivation
• Bar-Ilan University, Responsa Project (בר אילן, פרויקט השו"ת)
  • Morphology based
  • Indexing of all possible analyses for each token
  • Query expansion
• Mixture of lexemes
  • שבת (37689 results)
    • שַׁבָּת (Sabbath), שָׁבַת (rested), שֶׁבֶת (sitting), שַׁבָּת (tractate name, in a source reference), שֶׁבֶת (dill, a plant)
    • בַּת (noun), בת (prefix)
    • שָׁב
Motivation
• Academy of the Hebrew Language, Historical Dictionary Project (האקדמיה ללשון, מפעל המילון ההיסטורי)
  • Lexeme-based search
  • Covers Hebrew literature, 300 B.C. - 1400 A.D.
  • 10,287 results for the specific שַׁבָּת query
• Mixture of 'topics', unclassified data
• The indexing is based on the manually tagged lemma (and part of speech) of each token
Selected results for query שַׁבָּת
How can we improve search in a specific domain?
• Morphology
  • Next talk…
• Semantics
  • Topic models
  • Ontology
  • Combination
    • Model
    • Resource construction
Faceted Search
• Faceted Search
  • Implementation of Exploratory Search
  • Unfamiliar domain, uncertain goals
  • Navigation through well-defined facets
  • Query → list of facets → selection → list of facets → selection … → relevant documents
• Topic Modeling
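The navigation loop above (query, list of facets, selection, narrower list of facets) can be sketched in a few lines. This is a minimal illustration, not the engine described in the talk: the toy collection `DOCS`, the facet names, and the helpers `matching` and `remaining_facets` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy collection: each document id maps to its set of facet values.
DOCS = {
    "d1": {"topic:shabbat", "genre:responsa", "region:spain"},
    "d2": {"topic:shabbat", "genre:code", "region:north-africa"},
    "d3": {"topic:damages", "genre:responsa", "region:spain"},
    "d4": {"topic:shabbat", "genre:responsa", "region:germany"},
}

def matching(docs, selected):
    """Return the sorted ids of documents that carry every selected facet value."""
    return sorted(d for d, facets in docs.items() if selected <= facets)

def remaining_facets(docs, selected):
    """Count the facet values that can still narrow the current result set."""
    counts = Counter()
    for d in matching(docs, selected):
        counts.update(docs[d] - selected)
    return counts

# One navigation step: the user picks a facet, then sees what can be picked next.
selected = {"topic:shabbat"}
print(matching(DOCS, selected))
print(remaining_facets(DOCS, selected).most_common())
```

Each selection only adds a value to `selected`, so the result set shrinks monotonically and the facet list shown next never offers a choice that leads to zero documents.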
Latent Dirichlet Allocation (LDA)
• Fully unsupervised method
• Describes topics as distributions over words
• Benefits
  • Can recognize topics in documents
  • Can cluster documents by topic
  • Recently used for text summarization
LDA – Generative Model
[Figure: plate diagram of LDA. Nodes: α, θ_m, z_{m,n}, w_{m,n}, φ_k, β. Plates: k in [1,K] (K topics), m in [1,M] (M documents), n in [1,N_m].]
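Written out, the standard LDA generative process that this plate notation encodes is shown below. The symbols follow the diagram labels; Dir and Mult denote the Dirichlet and multinomial distributions.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dir}(\beta),              & k &\in [1, K] \\
\theta_m  &\sim \mathrm{Dir}(\alpha),             & m &\in [1, M] \\
z_{m,n}   &\sim \mathrm{Mult}(\theta_m),          & n &\in [1, N_m] \\
w_{m,n}   &\sim \mathrm{Mult}(\varphi_{z_{m,n}})
\end{align*}
```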
Implementation
• Inference/unsupervised learning by Gibbs sampling
• Dataset
  • משנה תורה לרמב"ם (Mishneh Torah by Maimonides)
  • 1000 documents (chapters)
  • 128 topics: http://www.cs.bgu.ac.il/~adlerm/rambam/128/topics/
• How should it be evaluated?
• Can it be improved?
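For readers unfamiliar with Gibbs sampling for LDA, the following is a minimal sketch of the standard collapsed sampler, written with the Python standard library only. It is not the implementation used for the Mishneh Torah experiment; the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total words per topic
    z = []                                                # topic assignment of every token
    for m, doc in enumerate(docs):                        # random initialization
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[m].append(k)
            n_mk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]                               # remove the token's current assignment
                n_mk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | rest) is proportional to (n_mt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (n_mk[m][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[m][n] = k                               # record the newly sampled topic
                n_mk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimates of the document-topic and topic-word distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(n_mk, docs)]
    phi = [[(c + beta) / (n_k[k] + vocab_size * beta) for c in n_kw[k]]
           for k in range(num_topics)]
    return theta, phi

# Toy corpus (assumed): two documents about words 0-1, two about words 2-3.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 0, 1], [3, 2, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=4, iters=50, seed=1)
```

Each row of `theta` is a document's distribution over topics and each row of `phi` is a topic's distribution over the vocabulary; listing the highest-probability words in a row of `phi` gives the kind of topic summary published at the URL above.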
Selected Topics
Lexeme-based LDA
• Hebrew
  • Rich morphology
  • Affixation
  • Inflection
• Morphological analysis helps
  • Named-entity recognition
  • Noun phrase chunking
  • Parsing
    • CFG
    • Dependency trees
  • Text summarization
Lexeme-based LDA
[Figure: plate diagram of lexeme-based LDA. Nodes: α, θ_m, z_{m,n}, L_{m,n}, w_{m,n}, β, φ_k, δ, Ψ_l. Plates: n in [1,N_m], m in [1,M] (M documents), k in [1,K] (K topics), l in [1,R] (R lexemes).]
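The slide gives only the diagram labels, so the generative process below is one plausible reading of them rather than the authors' stated model. It assumes that each topic is a distribution over the R lexemes, that each lexeme has its own distribution over surface tokens, and that β is the prior of φ while δ is the prior of Ψ.

```latex
\begin{align*}
\theta_m  &\sim \mathrm{Dir}(\alpha) \\
\varphi_k &\sim \mathrm{Dir}(\beta)  && \text{(assumed: topic } k \text{ over the } R \text{ lexemes)} \\
\Psi_l    &\sim \mathrm{Dir}(\delta) && \text{(assumed: lexeme } l \text{ over surface tokens)} \\
z_{m,n}   &\sim \mathrm{Mult}(\theta_m) \\
L_{m,n}   &\sim \mathrm{Mult}(\varphi_{z_{m,n}}) \\
w_{m,n}   &\sim \mathrm{Mult}(\Psi_{L_{m,n}})
\end{align*}
```

Under this reading the lexeme L_{m,n} is latent, so an ambiguous token such as שבת contributes to whichever lexeme best fits the topics of its document.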
Selected Topics
Top results for query שור (ox)
Can we make it better?
• Ontology
  • Concepts
  • Relations
• Research question
  • Can an ontology improve topic modeling?
  • Ontology construction vs. document labeling
• Combination of a topic model with existing knowledge
  • Dataset preparation
  • Ontology construction
  • Design of an algorithm that learns a topic model compatible with an ontology
The Dataset
• Responsa literature
  • Medieval era
  • Book → answers → paragraphs
• Germany, France, Italy
  • 151 books
• Spain, North Africa
  • 100 books
Halakhic Ontology - Concepts
• A hard decision…
• The juridical Halakhic index (the Institute for Research in Jewish Law)
  • 120 entries
  • Subcategories
Methodology
• Given: entry, subcategory
• Define the entry as a concept
• Generalize the subcategory into a concept
  • הטלת מס על מציאה (imposing a tax on a found object) → מס (tax), מציאה (found object), הטלה (imposition)?
  • Composed concept
    • פירעון (repayment), חוב (debt), פירעון-חוב (debt repayment)
• Define the relevant relation between the two concepts
• Currently: ~1000 concepts, ~2000 relations
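The methodology above amounts to growing a graph of concepts connected by labeled relations. A minimal sketch of such a structure follows. The class names, the `add_entry` and `compose` helpers, the generic `"relation"` label, and the choice of "tax" as the index entry are all hypothetical; English glosses stand in for the Hebrew terms.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concept:
    name: str            # e.g. "tax" for מס
    parts: tuple = ()    # non-empty only for a composed concept

@dataclass
class Ontology:
    concepts: set = field(default_factory=set)
    relations: set = field(default_factory=set)   # (source, label, target) triples

    def add_entry(self, entry, subcategory_concepts, label="relation"):
        """Register an index entry and the concepts generalized from one subcategory."""
        head = Concept(entry)
        self.concepts.add(head)
        for name in subcategory_concepts:
            concept = Concept(name)
            self.concepts.add(concept)
            if concept != head:
                # Relate the entry concept to each concept found in the subcategory.
                self.relations.add((head, label, concept))

    def compose(self, *names):
        """Create a composed concept and keep its parts as concepts too."""
        parts = tuple(Concept(n) for n in names)
        composed = Concept("-".join(names), parts)
        self.concepts.update(parts)
        self.concepts.add(composed)
        return composed

onto = Ontology()
# Subcategory "imposing a tax on a found object" under an assumed entry "tax".
onto.add_entry("tax", ["tax", "found object", "imposition"])
debt_repayment = onto.compose("repayment", "debt")
```

Storing relations as plain triples keeps the label open, so a generic label can later be refined into a semantic one without changing the structure.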
Halakhic Ontology – A sample
Halakhic Ontology - Relations
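• נכסים המשועבדים לאישה נשואה (assets encumbered to a married woman)
• בעלות אלמנה על נכסי מלוג (a widow's ownership of melog property)
• אישה-סוגשל-אלמנה (type-of relation between אלמנה "widow" and אישה "woman")
• Semantic
  • אישה-שיעבוד-נכס (encumbrance relation between אישה "woman" and נכס "asset")
  • אלמנה-שייכות-נכסי מלוג (ownership relation between אלמנה "widow" and נכסי מלוג "melog property")
• Syntactic
  • אישה-יחס-נכס (generic relation between אישה "woman" and נכס "asset")
  • אלמנה-יחס-נכסי מלוג (generic relation between אלמנה "widow" and נכסי מלוג "melog property")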
Halakhic Ontology - Relations
• gerund direct-object noun
• gerund indirect-object noun
• noun relation noun
• noun object gerund
• noun role noun
• noun/gerund conjunction noun/gerund
• noun attribute adjective
• noun typeof noun
Ontology-based LDA
• How to combine the ontology into the model?
  • Use the ontology to construct the topic distribution of each document (instead of a Dirichlet distribution)
  • Ontology relations as covariance matrix
  • Multivariate normal distribution
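The slides do not say how the covariance matrix is derived from the ontology, so the following is only one possible construction. It assumes one topic per concept, treats relations as undirected and unweighted, and uses a hypothetical strength parameter `rho`; the function name `ontology_covariance` is likewise an assumption. Related concepts get a positive covariance, and the diagonal is inflated so that the matrix is strictly diagonally dominant and therefore positive definite.

```python
def ontology_covariance(concepts, relations, rho=0.5):
    """Build a symmetric positive-definite matrix from undirected concept relations."""
    index = {c: i for i, c in enumerate(concepts)}
    n = len(concepts)
    sigma = [[0.0] * n for _ in range(n)]
    for a, b in relations:
        i, j = index[a], index[b]
        if i != j:
            sigma[i][j] = sigma[j][i] = rho   # related concepts co-vary
    for i in range(n):
        # Diagonal = 1 + sum of the row's off-diagonal entries (strict diagonal dominance).
        sigma[i][i] = 1.0 + sum(v for j, v in enumerate(sigma[i]) if j != i)
    return sigma

concepts = ["woman", "widow", "asset"]
relations = [("widow", "woman"), ("woman", "asset")]
sigma = ontology_covariance(concepts, relations)
for row in sigma:
    print(row)
```

A matrix built this way can serve as the Σ of a multivariate normal, so topics tied to related concepts tend to be active in the same documents.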
Correlated Topic Model
[Figure: plate diagram of the Correlated Topic Model. Nodes: µ, Σ, θ_m, z_{m,n}, w_{m,n}, φ_k, β. Plates: k in [1,K] (K topics), m in [1,M] (M documents), n in [1,N_m] (N_m words in document m). Σ is the covariance matrix.]
for all documents m in [1,M] do
    generate parameter vector α ~ N(µ, Σ)
    transform α into θ_m: θ_{m,t} = exp(α_t) / sum_{t'} exp(α_{t'})
    sample document length N_m ~ Poiss(ξ)
    for all words n in [1,N_m] in document m do
        sample topic index z_{m,n} ~ Mult(θ_m)
        sample term for word w_{m,n} ~ Mult(φ_{z_{m,n}})
Model: [Blei, Lafferty, 2006]
Gibbs Sampling: [Mimno, Wallach, McCallum, 2008]
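The generative steps of the correlated topic model can be turned into a small runnable sketch using only the Python standard library. It illustrates the sampling process, not the inference procedure of the cited papers. The helper names (`cholesky`, `sample_poisson`, `generate_document`) and the toy values of µ, Σ, φ, and ξ are assumptions.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def generate_document(rng, mu, sigma, phi, xi):
    """Sample one document (a list of term ids) and its topic proportions theta."""
    low = cholesky(sigma)
    noise = [rng.gauss(0.0, 1.0) for _ in mu]
    # alpha ~ N(mu, sigma), obtained as mu + L * noise
    alpha = [m + sum(low[i][k] * noise[k] for k in range(i + 1)) for i, m in enumerate(mu)]
    peak = max(alpha)                          # subtract the max for numerical stability
    weights = [math.exp(a - peak) for a in alpha]
    total = sum(weights)
    theta = [w / total for w in weights]       # softmax of alpha
    words = []
    for _ in range(sample_poisson(rng, xi)):   # document length N_m ~ Poiss(xi)
        z = rng.choices(range(len(theta)), weights=theta)[0]    # topic index
        w = rng.choices(range(len(phi[z])), weights=phi[z])[0]  # term for the word
        words.append(w)
    return theta, words

rng = random.Random(1)
mu = [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]               # toy covariance between two topics
phi = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]       # toy topic-term distributions
theta, words = generate_document(rng, mu, sigma, phi, xi=8)
```

The only difference from the LDA process is the first step: θ_m comes from a softmax-transformed normal draw, which lets Σ encode which topics tend to occur together.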
Testset
• Selected index entries
  • גזילה (robbery), מוחזקות (presumptive possession), נישואין (marriage), ערבות (suretyship), עדות (testimony), פיקדון (deposit), שטרות (deeds), שבועה (oath), שכירות (hire)
• Each subcategory induces a concept intersection query
• Currently
  • 43 concept intersection queries
  • 1848 results
Summary
• Combination of topic modeling and knowledge of the domain
  • Token-based
  • Lexeme-based
  • Ontology-based
• Ontology construction
  • Concepts, relations
• Dataset
• Testset
Future Work
• Complete the ontology
• Evaluation
  • Complete the feeding of queries
  • Compare the F-measure of the various topic models
• Ontology-based topic modeling
  • Richer representation
    • Relation labels
    • Relation frequencies
    • Transitive relations
  • Matching of manual topics
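The planned comparison by F-measure can be made concrete for a single query. The sketch below uses the standard definition of precision, recall, and F-measure over sets of document ids. The slides do not say how scores are aggregated over the test queries, so the function name `f_measure` and the toy ids are assumptions.

```python
def f_measure(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0, 0.0, 0.0        # also covers empty retrieved or relevant sets
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Toy example: documents returned by a model vs. documents listed in the testset.
precision, recall, f1 = f_measure(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
print(precision, recall, f1)
```

Running the same function over every concept intersection query for each topic model, and averaging, would give one comparable score per model.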
Acknowledgements
• Ministry of Science and Technology
• Deutsche Telekom Laboratories
Thank you (תודה)