The Semantic Quilt


DESCRIPTION

A talk on the Semantic Quilt, which combines various methods of "doing semantics" into a more unified framework.

TRANSCRIPT

  • 1. The Semantic Quilt: Contexts, Co-occurrences, Kernels, and Ontologies
    Ted Pedersen, University of Minnesota, Duluth
    http://www.d.umn.edu/~tpederse

2. Create by stitching together

3. Sew together different materials

4. Ontologies, Co-Occurrences, Kernels, Contexts

5. Semantics in NLP

  • Potentially useful for many applications
    • Machine Translation
    • Document or Story Understanding
    • Text Generation
    • Web Search
  • Can come from many sources
  • Not well integrated
  • Not well defined?

6. What do we mean by semantics? It depends on our resources

  • Ontologies: relationships among concepts
    • similar / related concepts are connected
  • Dictionaries: definitions of senses / concepts
    • similar / related senses have similar / related definitions
  • Contexts: short passages of words
    • similar / related words occur in similar / related contexts
  • Co-occurrences
    • a word is defined by the company it keeps
    • words that occur with the same kinds of words are similar / related (a small sketch follows this list)
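A minimal sketch of the "company it keeps" idea: build co-occurrence vectors for two words and compare them with cosine similarity. The corpus, window size, and function names here are illustrative assumptions, not anything from the talk.

    from collections import Counter
    from math import sqrt

    def cooccurrence_vector(target, tokens, window=2):
        # Count the words appearing within `window` positions of `target`.
        vec = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                vec.update(tokens[j] for j in range(lo, hi) if j != i)
        return vec

    def cosine(u, v):
        # Cosine similarity between two sparse count vectors.
        dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
        norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
        return dot / norm if norm else 0.0

    tokens = "i bought some food at the store i purchased some food at the market".split()
    print(cosine(cooccurrence_vector("store", tokens),
                 cooccurrence_vector("market", tokens)))

Here "store" and "market" come out similar because they keep the same company ("... at the ___"), even though the words themselves never co-occur.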

7. What level of granularity?

  • words
  • terms / collocations
  • phrases
  • sentences
  • paragraphs
  • documents
  • books

8. The Terrible Tension: Ambiguity versus Granularity

  • Words are potentially very ambiguous
    • But we can list them (sort of)
    • we can define their meanings (sort of)
    • often not ambiguous to a human reader, but hard for a computer to know which meaning is intended
  • Terms / collocations are less ambiguous
    • Difficult to enumerate because there are so many, but can be done for a domain (e.g., medicine)
  • Phrases (short contexts) can still be ambiguous, but not to the same degree as words or terms/collocations

9. The Current State of Affairs

  • Most resources and methods focus on word or term semantics
    • makes it possible to build resources (manually or automatically) with reasonable coverage, but
    • techniques become very resource dependent
    • resources become language dependent
    • introduces a lot of ambiguity
    • not clear how to bring together resources
  • Similarity is a useful organizing principle, but
    • there are lots of ways to be similar

10. Similarity as Organizing Principle

  • Measure word association using knowledge-lean methods based on co-occurrence information from large corpora
  • Measure contextual similarity using knowledge-lean methods based on co-occurrence information from large corpora
  • Measure conceptual similarity / relatedness using a structured repository of knowledge (a sketch using WordNet follows this list)
    • Lexical database WordNet
    • Unified Medical Language System (UMLS)
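The talk names WordNet::Similarity (a Perl package); a rough equivalent of the concept-similarity step can be sketched in Python with NLTK's WordNet interface. NLTK is my substitution here, not a tool from the talk.

    # Assumes: pip install nltk, then nltk.download('wordnet') once.
    from nltk.corpus import wordnet as wn

    # Take the first listed sense of each word; picking the right
    # sense is itself a disambiguation problem (see slide 8).
    frog = wn.synsets('frog')[0]
    amphibian = wn.synsets('amphibian')[0]

    # Path similarity: shorter is-a paths between synsets score higher.
    print(frog.path_similarity(amphibian))
    # Wu-Palmer similarity: based on the depth of the least common subsumer.
    print(frog.wup_similarity(amphibian))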

11. Things we can do now

  • Identify associated words
    • fine wine
    • baseball bat
  • Identify similar contexts
    • I bought some food at the store
    • I purchased something to eat at the market
  • Assign meanings to words
    • I went to the bank/[financial-inst.] to deposit my check
  • Identify similar (or related) concepts
    • frog : amphibian
    • Duluth : snow

12. Things we want to do

  • Integrate different resources and methods
  • Solve bigger problems
    • some of what we do now is a means to an unclear end
  • Be Language Independent
  • Offer Broad Coverage
  • Reduce dependence on manually built resources
    • ontologies, dictionaries, labeled training data

13. Semantic Patches to Sew Together

  • Contexts
    • SenseClusters: measures similarity between written texts (i.e., contexts)
  • Co-Occurrences
    • Ngram Statistics Package: measures association between words and identifies collocations or terms
  • Kernels
    • WSD-Shell: supervised learning for word sense disambiguation; in the process of adding SVMs with user-defined kernels
  • Ontologies
    • WordNet::Similarity: measures similarity between concepts found in WordNet
    • UMLS::Similarity
  • All of these are projects at the University of Minnesota, Duluth

14. Ontologies, Co-Occurrences, Kernels, Contexts

15. Ngram Statistics Package, http://ngram.sourceforge.net (Co-Occurrences)

16. Things we can do now

  • Identify associated words
    • fine wine
    • baseball bat
  • Identify similar contexts
    • I bought some food at the store
    • I purchased something to eat at the market
  • Assign meanings to words
    • I went to the bank/[financial-inst.] to deposit my check
  • Identify similar (or related) concepts
    • frog : amphibian
    • Duluth : snow

17. Co-occurrences and semantics?

  • individual words (esp. common ones) are very ambiguous
    • bat
    • line
  • pairs of words disambiguate each other
    • baseball bat
    • vampire Transylvania
    • product line
    • speech line

18. Why pairs of words?

  • Zipf's Law
    • most words are rare, most bigrams are rarer, and longer ngrams are rarer still
    • the more common a word, the more senses it will have
  • Co-occurrences are less frequent than individual words, tend to be less ambiguous as a result
    • Mutually disambiguating

19. Bigrams

  • Window Size of 2
    • baseball bat, fine wine, apple orchard, Bill Clinton
  • Window Size of 3
    • house of representatives, bottle of wine
  • Window Size of 4
    • president of the republic, whispering in the wind
  • Selected using a small window size (2-4 words)
  • The objective is to capture a regular or localized pattern between two words (a collocation?)
  • If order doesn't matter, then these are co-occurrences (a sketch of window-based extraction follows this list)
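A sketch of window-based pair extraction. My reading of the slide: a window of size k pairs two words separated by at most k-1 positions, so window 2 gives ordinary bigrams and window 3 lets one word intervene ("house of representatives" yields the pair house / representatives). The function name and defaults are mine.

    from collections import Counter

    def window_pairs(tokens, window=2, ordered=True):
        # Collect pairs of words that co-occur within `window` tokens.
        pairs = Counter()
        for i, w1 in enumerate(tokens):
            for j in range(i + 1, min(i + window, len(tokens))):
                w2 = tokens[j]
                pairs[(w1, w2) if ordered else tuple(sorted((w1, w2)))] += 1
        return pairs

    tokens = "the president of the republic spoke".split()
    print(window_pairs(tokens, window=4).most_common(3))

With ordered=False the counts are co-occurrences rather than bigrams, matching the distinction on the slide.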

20. Occur together more often than expected by chance

  • Observed frequencies for two words occurring together and alone are stored in a 2x2 contingency table
  • Expected values are calculated from the observed marginal totals, under the model of independence
    • How often would you expect these words to occur together, if they only occurred together by chance?
    • If two words occur together significantly more often than the expected value, then they are unlikely to be occurring together by chance (a worked sketch follows this list)
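A worked sketch of the observed-versus-expected calculation. The cell layout is the standard 2x2 contingency table; the counts are invented for illustration.

    # Counts for a word pair (w1, w2) over n total bigrams:
    # n11 = w1 and w2 together, n12 = w1 without w2,
    # n21 = w2 without w1,      n22 = neither.
    n11, n12, n21, n22 = 30, 70, 120, 9780
    n = n11 + n12 + n21 + n22

    n1p = n11 + n12    # bigrams starting with w1 (row marginal)
    np1 = n11 + n21    # bigrams ending with w2 (column marginal)

    # Under independence P(w1 w2) = P(w1) * P(w2), so the expected
    # count of the pair is n * (n1p / n) * (np1 / n) = n1p * np1 / n.
    m11 = n1p * np1 / n
    print(f"observed {n11}, expected {m11:.2f}")   # 30 vs 1.50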

21. Measures and Tests of Association (http://ngram.sourceforge.net); two of these are sketched after the list

  • Log-likelihood Ratio
  • Mutual Information
  • Pointwise Mutual Information
  • Pearson's Chi-squared Test
  • Phi coefficient
  • Fisher's Exact Test
  • T-test
  • Dice Coefficient
  • Odds Ratio
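Two of the listed measures computed from the same 2x2 counts: the G-squared log-likelihood ratio and pointwise mutual information. The formulas are the standard ones; the example counts carry over from the previous sketch.

    from math import log

    def loglikelihood_and_pmi(n11, n12, n21, n22):
        # G^2 = 2 * sum over cells of observed * ln(observed / expected);
        # PMI = ln( p(w1, w2) / (p(w1) * p(w2)) ).
        n = n11 + n12 + n21 + n22
        rows = (n11 + n12, n21 + n22)
        cols = (n11 + n21, n12 + n22)
        observed = ((n11, n12), (n21, n22))
        g2 = 0.0
        for i in range(2):
            for j in range(2):
                expected = rows[i] * cols[j] / n
                if observed[i][j] > 0:
                    g2 += 2 * observed[i][j] * log(observed[i][j] / expected)
        pmi = log(n11 * n / (rows[0] * cols[0]))
        return g2, pmi

    print(loglikelihood_and_pmi(30, 70, 120, 9780))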

22. What do we get at the end?

  • A list of bigrams or co-occurrences that are significant or interesting (meaningful?)
    • automatic
    • language independent
  • These can be used as building blocks for systems that do semantic processing
    • relatively unambiguous
    • often very informative about topic or domain
    • can serve as a fingerprint for a document or book (a sketch follows this list)
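Putting the pieces together: score every extracted pair and keep the top few as a rough fingerprint. This sketch reuses the window_pairs and loglikelihood_and_pmi helpers defined in the earlier sketches and is only meant to show the shape of the pipeline, not any tool from the talk.

    from collections import Counter

    def fingerprint(tokens, window=2, top=5):
        # Rank co-occurring pairs by G^2; highly ranked pairs are
        # relatively unambiguous and tend to be topical.
        pairs = window_pairs(tokens, window)
        n = sum(pairs.values())
        first, second = Counter(), Counter()
        for (w1, w2), c in pairs.items():
            first[w1] += c
            second[w2] += c
        scored = {}
        for (w1, w2), n11 in pairs.items():
            n12 = first[w1] - n11        # w1 first, second word is not w2
            n21 = second[w2] - n11       # w2 second, first word is not w1
            n22 = n - n11 - n12 - n21    # neither word in its position
            scored[(w1, w2)] = loglikelihood_and_pmi(n11, n12, n21, n22)[0]
        return sorted(scored, key=scored.get, reverse=True)[:top]

Because the ranking uses only counts, the whole pipeline stays automatic and language independent, as the slide claims.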

23. Ontologies, Co-Occurrences, Kernels, Contexts

24. SenseClusters, http://senseclusters.sourceforge.net (Contexts)

25. Things we can do now

  • Identify associated words
    • fine wine
    • baseball bat
  • Identify similar contexts
    • I bought some food at the store
