multilingual ner using wiki


Upload: svitlana-volkova

Post on 08-Jun-2015

TRANSCRIPT

Page 1: Multilingual Ner Using Wiki

Laboratory for Knowledge Discovery in Databases

Department of Computing and Information Sciences

Kansas State University

http://www.kddresearch.org/tikiwiki/tiki-index.php

Presenter: Svitlana O. Volkova

Instructor: William Hsu

Multilingual Named Entity Recognition

using Wikipedia

Page 2: Multilingual Ner Using Wiki

AGENDA

I. Project Overview

II. Crawling Wikipedia

III. Synonymy Discovery with Google Sets

IV. Experiment Design

V. Conclusions

Page 4: Multilingual Ner Using Wiki

PROJECT MILESTONES

Input: Crawler Functionality

CRAWLING WIKIPEDIA

Output: Set of Multilingual Gazetteers

Input: Initial Gazetteer in one Language

RELATIONSHIP DISCOVERY WITH GOOGLESETS

Output: Extended Gazetteer with Synonyms

Input: Extended Gazetteer with Synonyms + Content

MULTILINGUAL NER TASK

Output: Extracted Entities from the Content

Page 5: Multilingual Ner Using Wiki

KEY IDEA - WIKIPEDIA

Apply Wikipedia knowledge representation for

multilingual information extraction

http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis

English Wiki Concepts of Interest

…, anthrax, bovine virus, …, camelpox, surra, …

Russian Wiki Concepts of Interest

…, Зоонозы, Классическая чума свиней, Лептоспироз, …

Page 7: Multilingual Ner Using Wiki

CRAWLING WIKIPEDIA

Multilingual NER (article + category + interwiki links)

Wiki Category Graph and Article Graph
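The slides do not say how the crawler was implemented. As a minimal sketch, assuming the public MediaWiki Action API (`list=categorymembers` for the category graph, `prop=langlinks` for interwiki links), the snippet below builds the two query URLs and turns one langlinks response into per-language gazetteer entries. The helper names, the `Category:Zoonoses` example, and the hand-written `SAMPLE_RESPONSE` are illustrative assumptions, not part of the original project; nothing is fetched over the network.

```python
from urllib.parse import urlencode

API_TEMPLATE = "https://{lang}.wikipedia.org/w/api.php"


def category_members_url(lang, category, limit=500):
    """Build a MediaWiki API URL that lists the pages in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    return API_TEMPLATE.format(lang=lang) + "?" + urlencode(params)


def langlinks_url(lang, title, limit=500):
    """Build a MediaWiki API URL that lists an article's interwiki links."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": limit,
        "format": "json",
    }
    return API_TEMPLATE.format(lang=lang) + "?" + urlencode(params)


def parse_langlinks(response):
    """Map language code -> article title from a prop=langlinks JSON response."""
    links = {}
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            links[link["lang"]] = link["*"]
    return links


def add_to_gazetteers(gazetteers, source_lang, title, links, wanted_langs):
    """Add a concept and its available translations to per-language gazetteers."""
    gazetteers.setdefault(source_lang, set()).add(title)
    for lang in wanted_langs:
        if lang in links:
            gazetteers.setdefault(lang, set()).add(links[lang])
    return gazetteers


# Hand-written sample in the shape of a langlinks response (assumption, not fetched).
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "0": {
                "title": "Anthrax",
                "langlinks": [
                    {"lang": "de", "*": "Milzbrand"},
                    {"lang": "ru", "*": "Сибирская язва"},
                ],
            }
        }
    }
}

if __name__ == "__main__":
    print(category_members_url("en", "Category:Zoonoses"))
    print(langlinks_url("en", "Anthrax"))
    links = parse_langlinks(SAMPLE_RESPONSE)
    print(add_to_gazetteers({}, "en", "Anthrax", links, ["de", "ru", "ja"]))
```

A language with no interwiki link for a concept (Japanese in this sample) simply gets no entry, which is one way per-language gazetteers can end up with very different sizes.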

Page 8: Multilingual Ner Using Wiki

GAZETTEER EXAMPLES IN DIFFERENT LANGUAGES

Page 9: Multilingual Ner Using Wiki

GAZETTEER SIZE IN DIFFERENT LANGUAGES

English: 86

Japanese: 20

German: 37

Russian: 19

Decision: the dictionaries are too small, so we need to find a way to extend them.

Page 11: Multilingual Ner Using Wiki

GAZETTEER EXAMPLES: GERMAN GOOGLE SETS OUTPUT

Page 13: Multilingual Ner Using Wiki

EXPERIMENT SETUP

Purpose: to perform a named entity recognition task in a specific domain and report the accuracy of extraction using

a) Wiki knowledge

b) Extended lists with synonyms from Google Sets

Hypothesis: the synonym extraction phase is essential for increasing the accuracy of the information extraction task
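The slides do not define how accuracy is measured. One common choice for NER, assumed here, is exact-match precision, recall, and F1 over extracted spans. The sketch below scores two runs against the same annotations, one using only the Wiki gazetteer and one using the extended gazetteer. All spans are made-up placeholders that only show the mechanics; they are not results from this project.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted spans against gold spans using exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (first_index, last_index, canonical_name).
GOLD = {
    (0, 6, "anthrax"),
    (20, 41, "foot-and-mouth disease"),
    (60, 62, "foot-and-mouth disease"),  # an abbreviation in the text
}

# Hypothetical run a): Wiki gazetteer only, so the abbreviation is missed.
WIKI_ONLY = {
    (0, 6, "anthrax"),
    (20, 41, "foot-and-mouth disease"),
}

# Hypothetical run b): extended gazetteer finds it but adds one false positive.
EXTENDED = GOLD | {(80, 84, "surra")}

if __name__ == "__main__":
    for name, run in [("wiki only", WIKI_ONLY), ("extended", EXTENDED)]:
        p, r, f = precision_recall_f1(run, GOLD)
        print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under this framing, the hypothesis corresponds to run b) reaching a higher recall (and ideally a higher F1) than run a) on the same annotated text.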

Page 14: Multilingual Ner Using Wiki

DISEASE EXTRACTOR MODULE: INPUT AND OUTPUT

Input: Text from file

Output (per match from the Disease Extractor Module):

Index of the first character

Index of the last character

Length of the matched text

Matched text

Canonical disease name

Disease Extraction Task

The task of disease recognition can be considered an NER/information extraction (IE) task.

The main purpose is to retrieve tokens that match at least one term, including synonyms and abbreviations, from the list of animal disease names.
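The matching step and the five output fields can be sketched as a small dictionary lookup. This is a minimal illustration, not the module's actual implementation: the `Match` type, the function names, the toy `GAZETTEER`, and the choice of an inclusive last-character index are all assumptions.

```python
import re
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    first: int      # index of the first character
    last: int       # index of the last character (inclusive, by assumption)
    length: int     # length of the matched text
    text: str       # matched text as it appears in the input
    canonical: str  # canonical disease name


# Toy gazetteer: canonical name -> synonyms and abbreviations (illustrative).
GAZETTEER: Dict[str, List[str]] = {
    "foot-and-mouth disease": ["foot and mouth disease", "FMD"],
    "Rift Valley fever": ["RVF"],
    "leptospirosis": ["Leptospirose", "Лептоспироз"],
}


def build_pattern(gazetteer):
    """Compile one regex over all surface forms and a form -> canonical lookup."""
    lookup = {}
    for canonical, synonyms in gazetteer.items():
        for form in [canonical, *synonyms]:
            lookup[form.lower()] = canonical
    # Longest forms first so a long name wins over a shorter overlapping one.
    forms = sorted(lookup, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in forms)
    # Lookarounds keep a form from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern, lookup


def extract(text, pattern, lookup):
    """Return one Match per gazetteer term found in the text."""
    results = []
    for m in pattern.finditer(text):
        results.append(
            Match(
                first=m.start(),
                last=m.end() - 1,
                length=m.end() - m.start(),
                text=m.group(0),
                canonical=lookup[m.group(0).lower()],
            )
        )
    return results


if __name__ == "__main__":
    pattern, lookup = build_pattern(GAZETTEER)
    sample = "Foot and mouth disease is one of the most contagious diseases."
    for match in extract(sample, pattern, lookup):
        print(match)
```

The word-boundary lookarounds suit languages that separate words with spaces or punctuation; for a language such as Japanese they would have to be dropped or replaced by a tokenizer. Case-insensitive matching of short abbreviations can also produce false positives.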

Page 15: Multilingual Ner Using Wiki

CONTEXT EXAMPLES IN DIFFERENT LANGUAGES

DUTCH

Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog. Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.

CZECH

Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.

GERMAN

Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.

ITALIAN

Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei casi si verifica in rianimazione grave e richiesti.

UKRAINIAN

Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока. Більше половини випадків відбувається в суворих і необхідність реанімації.

RUSSIAN

Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость высокая. Более половины случаев происходит в суровых и необходимости реанимации.

Page 16: Multilingual Ner Using Wiki

DISEASE EXTRACTOR MODULE DEMO

http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/

Page 18: Multilingual Ner Using Wiki

RESULTS FOR DISEASE EXTRACTOR MODULE

Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals…

[Figure: INPUT A and OUTPUT A]

Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease …

[Figure: INPUT B and OUTPUT B]

Page 20: Multilingual Ner Using Wiki

CONCLUSIONS

Applying Wikipedia knowledge to the multilingual NER task

Phase 1: Crawling Wiki – completed

Phase 2: Google Sets Expansion – completed

Phase 3: Multilingual Disease Extraction – in progress

Novelty: Overcome Wiki limitations by applying the Google Sets expansion approach

To estimate accuracy, we need annotated data in different languages

Page 21: Multilingual Ner Using Wiki

REFERENCES

Torsten Zesch and Iryna Gurevych. Analysis of the Wikipedia Category Graph for NLP Applications. In: Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007), pp. 1-8, April 2007. http://elara.tk.informatik.tu-darmstadt.de/publications/2007/hlt-textgraphs.pdf

Yotaro Watanabe, Masayuki Asahara, and Yuji Matsumoto. A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 649-657, 2007. http://www.aclweb.org/anthology/D/D07/D07-1068

C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.

Page 22: Multilingual Ner Using Wiki

ACKNOWLEDGEMENTS

Dr. William Hsu for meaningful guidance

John Drouhard for building extraction architecture

Landon Fowles for expanding gazetteers using Google Sets