The impact of standardized terminologies and domain ontologies in multilingual information processing
Maruf Hasan, D.Eng.
Senior Researcher
Thai Computational Linguistics Laboratory, Thailand
National Institute of Information and Communication Technology, Japan
The 5th AOS Workshop, Beijing, April 27-29, 2004
Outline
• Natural Language Processing (NLP) Research
• Cross Language Information Retrieval
• Named Entity Extraction
• Integrated Knowledge Management Scenario
• Terminology and Ontology Initiatives
• The Future: Bootstrapping NiCT resources and technologies
• Conclusions
NLP Research
• Corpus-based statistical NLP has become a popular research theme in recent years
  • Many smart applications exist (e.g., the Google search engine, MS Word’s grammar checking)
  • Semantics and knowledge still remain obscured behind words (symbols)
  • Meaning and concepts are difficult to extract or build with statistics alone
• Bootstrapping helps
New Research Trends
• While relying heavily on sophisticated NLP techniques, researchers are paying increasing attention to semi-automatically built lexical and knowledge resources
• Outcomes
  • Increasing number of monolingual lexical resources
  • Increasing number of multilingual dictionaries, thesauri, and generalized ontologies
  • Increasing number of specialized ontologies
  • Increasing number of bootstrapping approaches that get the best from both ends
    • Augmenting statistically extracted knowledge with manually encoded knowledge, and vice versa
Two perspectives of Information/Knowledge
• Content Management Perspective
  • Metadata (e.g., Dublin Core Metadata, 13 fields)
  • Taxonomy/thesauri (augmenting the Keyword field)
  • Analogy: HTML (fixed set of tags)
• Content Harnessing Perspective
  • Machine-understandable content
  • Conceptual and associative hierarchy based on content
  • Ontology (modeling a domain with concepts and their relationships from the domain expert’s perspective)
  • Analogy: XML (tags are not fixed)
Interoperability
• XML technology revolutionized the computing industry in terms of data interoperability and exchange
• Ontology has started bringing new dimensions to modeling information and knowledge in the same way
  • Traditional dictionaries and thesauri suffered badly from interoperability problems
  • Ontology offers a flexible framework for knowledge modeling (similar to what XML offers for data manipulation)
Bootstrapping: How It Helps
• Two major pitfalls with ontology
  • Developing ontologies (expensive! requires Knowledge Engineers)
  • Populating ontologies (labor intensive! semi-automatic means exist)
• Bootstrapping: a simple example
  • X is identified as a Person in the ontology but Y is not
  • Analyzing a piece of text with NLP tools, we find evidence that X and Y are, for example, conducting research in an organization for some projects
  • It is then easy to infer that Y is a person (and also her affiliation, research interests, etc., through similar analysis)
  • NLP techniques help in semi-automatically populating an ontology
  • NLP tools and algorithms can be further augmented with the help of ontology-driven knowledge
• What if the text offers no such evidence (e.g., that X is also a family friend of Y)? How can we deal with such cases? An example follows later
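The simple example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the talk: the ontology is a plain dictionary, the relation triples stand in for the output of NLP tools, and the names `PERSON_RELATIONS`, `bootstrap`, and all entities are hypothetical.

```python
# Toy ontology: entity -> set of known types (hypothetical data).
ontology = {
    "X": {"Person"},
    "NiCT": {"Organization"},
}

# Relation triples that an NLP pipeline might extract from text (assumed output).
extracted = [
    ("X", "conducts_research_with", "Y"),
    ("Y", "works_at", "NiCT"),
]

# Assumption: these relations only hold between two persons.
PERSON_RELATIONS = {"conducts_research_with"}


def bootstrap(ontology, triples):
    """Repeatedly add Person types implied by person-to-person relations."""
    changed = True
    while changed:
        changed = False
        for subj, rel, obj in triples:
            if rel not in PERSON_RELATIONS:
                continue
            # Check the relation in both directions.
            for known, other in ((subj, obj), (obj, subj)):
                known_is_person = "Person" in ontology.get(known, set())
                other_is_person = "Person" in ontology.get(other, set())
                if known_is_person and not other_is_person:
                    ontology.setdefault(other, set()).add("Person")
                    changed = True
    return ontology


bootstrap(ontology, extracted)
print(sorted(ontology["Y"]))
```

The loop runs until nothing new is inferred, so a newly typed entity can in turn license further inferences. That is the bootstrapping idea in miniature.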
Human factors: Why MT fails but IR wins
• So far, Information Retrieval (IR) applications, including search engines such as Google, have been largely successful, but Machine Translation (MT) systems have not. Reasons include:
  • Failures in modeling linguistic and extra-linguistic phenomena, context, concepts, etc.
  • Human tolerance varies between finding information and judging translation quality
    • Human tolerance: [(low) Written < Audio < Video (high)]
• Case study: Telstra Voice-operated Directory Service, a failure from the user’s perspective but a successful investment from Telstra’s point of view
  • Many queries (70%) are repeats, and the system can handle them quickly (success from Telstra’s perspective). But when a user enquires about rare entities, the system fails (failure from the user’s perspective).
Cross-Language Information Retrieval
• Cross-language Information Retrieval is crucial
  • Why: Querying in one’s native language is comfortable, but every now and then the most valuable information related to our search is probably available in another language
  • How: Translating the queries or the document collection (using a simplified MT model) to find information in other languages
  • Economic factor: Finding relevant information at a low cost (using noisy translation) is possible. After receiving a list of documents and selecting the relevant ones (as we often do with Google), we can make the (costly) decision of whether or not to translate the information.
  • That is, even if our foreign-language proficiency is limited, we can still make sense of information from other cues (tables, graphs, etc.) and make the right decision.
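The query-translation route described above can be illustrated with a minimal sketch. It assumes a toy bilingual dictionary and a tiny pre-tokenized document collection; `BILINGUAL`, `DOCS`, `translate_query`, and `rank` are hypothetical names, and the overlap score is a stand-in for a real retrieval model.

```python
# Hypothetical English -> Japanese dictionary; each term keeps every candidate
# translation, so the translated query is deliberately noisy.
BILINGUAL = {
    "rice": ["米", "稲"],
    "disease": ["病気", "病害"],
}

# Tiny pre-tokenized target-language collection (illustrative only).
DOCS = {
    "d1": ["稲", "病害", "防除"],
    "d2": ["米", "価格"],
    "d3": ["大学", "研究"],
}


def translate_query(terms):
    """Replace each source term by all of its dictionary translations."""
    translated = []
    for term in terms:
        translated.extend(BILINGUAL.get(term, []))  # unknown terms are dropped
    return translated


def rank(terms):
    """Rank documents by how many translated query terms they contain."""
    query = set(translate_query(terms))
    scores = {doc_id: len(query & set(tokens)) for doc_id, tokens in DOCS.items()}
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


print(rank(["rice", "disease"]))
```

Keeping all candidate translations trades precision for recall, which matches the economic argument: the cheap, noisy step only has to produce a shortlist, and the costly translation decision is made afterwards.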
Cross-Language Information Retrieval (2)
• Multilingual dictionaries or simplistic MT models are typically used
  • Although noisy to some extent, a language pair such as Chinese and Japanese can take advantage of Hanzi (Kanji) semantics
  • This is also applicable to alphabetic languages if we map words to their root forms
  • Further enhancements, for example Latent Semantic Indexing (or other conceptual retrieval techniques), help in mapping symbolic words to abstract concepts
  • Statistically built dictionaries (based on statistical correlation) have also proved effective in CLIR
• CLIR Demo
  • In CLIR, the best effect can be achieved if a user is guided through a correlation dictionary (statistically created) and an ontology (manually crafted)
    • Associative relationships are better captured by statistical correlation
    • Hierarchical relationships are better captured in ontologies or KBs
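As a rough illustration of a statistically built correlation dictionary, the sketch below scores cross-language word pairs from a sentence-aligned corpus with the Dice coefficient. The choice of Dice is an assumption made for this example (the slides do not name a measure), and the corpus `ALIGNED` and both function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned corpus: (English tokens, Japanese tokens).
ALIGNED = [
    (["rice", "disease"], ["稲", "病害"]),
    (["rice", "price"], ["稲", "価格"]),
    (["plant", "disease"], ["植物", "病害"]),
]


def correlation_dictionary(pairs):
    """Score every cross-language word pair with the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)            # sentences containing each source word
        tgt_count.update(tgt_set)            # sentences containing each target word
        joint.update(product(src_set, tgt_set))  # aligned pairs containing both
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, scores):
    """Return the target word most strongly correlated with `word`."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = correlation_dictionary(ALIGNED)
print(best_translation("rice", scores), best_translation("disease", scores))
```

Such scores capture association rather than hierarchy, which is why the slide pairs the correlation dictionary with a manually crafted ontology.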
Searching Idiosyncrasies (pseudo CLIR)
• Experiment with Kanji semantics
  • Searching “大学” on Google
    • 大学 site:cn
    • 大学 site:jp
    • The word 大学 has the same meaning in both Japanese and Chinese
• Experiment with different servers
  • Searching “DNA” on different Google local sites
    • www.google.co.jp
    • www.google.co.th
    • www.google.com
    • The retrieved results are quite different
• When it comes to information, we prefer to harness it in an integrated fashion.
• Communication and connectivity are no longer barriers, but languages are!
Dilemma in Named-Entity Extraction
• Named entities play an important role in harnessing information
• Significant research effort has been channeled into automatic Named Entity Extraction, using simple heuristics as well as sophisticated machine learning algorithms
• For some reason, the task has remained restricted to a few entity types
  • Organization, Person, Location, Date, Time, Money, Percent
• In specific domains such as bio- or agro-informatics, the notion of named entities is broader (and different from the above, of course)
  • Domain-specific entities are important. With carefully designed tools (using NLP techniques), it is possible to identify domain-specific entities
  • Event extraction is more difficult but crucial in harnessing information
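The "simple heuristics" end of the spectrum can be sketched as follows: regular expressions for a few of the classic types, plus a gazetteer lookup for domain-specific entities. The patterns, the gazetteer entries, and the function name `extract_entities` are assumptions made for illustration and are far from exhaustive.

```python
import re

# Simple patterns for three of the classic entity types (illustrative only).
PATTERNS = {
    "Percent": re.compile(r"\b\d+(?:\.\d+)?%"),
    "Money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "Date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

# Hypothetical domain gazetteer, e.g. disease and pest names for agro-informatics.
GAZETTEER = {
    "rice blast": "Disease",
    "brown planthopper": "Pest",
}


def extract_entities(text):
    """Return (type, surface string) pairs found by patterns and gazetteer lookup."""
    found = []
    for label, pattern in PATTERNS.items():
        found.extend((label, match.group()) for match in pattern.finditer(text))
    lowered = text.lower()
    for term, label in GAZETTEER.items():
        if term in lowered:  # case-insensitive substring match
            found.append((label, term))
    return found


sample = "On 2004-04-27, rice blast reduced yields by 12.5% and cost $1,200."
print(extract_entities(sample))
```

Extending coverage to a new domain here means extending the gazetteer, which is exactly where a domain ontology or standardized terminology can feed the extraction tool.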
Integrated Knowledge Management
• In an optimal scenario, we need to elicit knowledge from three different sources and manage it in an integrated fashion
  • Knowledge extracted from symbolic systems (written text, utterances, etc.): relatively explicit but not so precise!
  • More precise knowledge encoded in ontologies and KBs (semi-automatic): converted from implicit towards explicit forms!
  • Experts’ tacit knowledge: possible to capture in a system if the experts cooperate
• Ontology-based knowledge representation is the most appropriate representation so far, because it can be understood by both humans and machines
• Ontologies, if not maintained regularly, can soon become outdated
• There are certain other pitfalls, which can be circumvented through sophisticated NLP techniques, bootstrapping, and an indexing scheme
  • See examples in the following slides
An Integrated KM Scenario
• An “academic ontology” about people, projects, organisations, project reports, etc. within an organization (precise knowledge: ontologies are populated semi-automatically, sometimes from databases)
• A set of sophisticated “NLP tools” for tokenizing, parsing, text classification, etc. (non-precise knowledge: extracted from text automatically)
• A group of users/experts who are motivated to make things better by giving feedback (tacit knowledge)
• A spreading-activation-based indexing scheme is used to capture and propagate changes in a bootstrapped fashion
  • cf. Hasan, M.M. (2004). Spreading Activation Framework for Ontology-enhanced Effective Information Access within Organisations. In van Elst, L. et al. (eds.): "Agent-Mediated Knowledge Management". Springer’s Lecture Notes in Computer Science, Vol. 2926, pp. 288-296. Also published in the proceedings of the AAAI Spring Symposium, AMKM-2003, USA.
But, Integrated Manipulation
• Underneath, there is a spreading-activation-based indexing structure that changes over time
• Experts’ feedback is also captured and propagated into the network
• Commercial systems have been developed using similar techniques (e.g., TeSSI® from L&C Global in the pharmaceutical domain, using a multilingual pharmaceutical ontology developed under an EU initiative)
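To make the idea of a spreading-activation index more concrete, here is a generic textbook-style sketch. It is not the framework from the cited paper: the graph, the weights, the `decay` and `threshold` parameters, and the `feedback` update rule are all assumptions chosen for illustration.

```python
# Weighted association network (hypothetical nodes and weights).
GRAPH = {
    "ontology": {"project_A": 0.8, "Dr_X": 0.6},
    "project_A": {"report_1": 0.9},
    "Dr_X": {"report_1": 0.5, "report_2": 0.7},
}


def spread(seeds, graph, decay=0.5, threshold=0.05, max_steps=3):
    """Propagate activation from seed nodes along weighted links."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                if passed < threshold:
                    continue  # too weak to propagate further
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


def feedback(graph, source, target, delta):
    """Record expert feedback by strengthening (or weakening) one link."""
    links = graph.setdefault(source, {})
    links[target] = min(1.0, max(0.0, links.get(target, 0.0) + delta))


before = spread({"ontology": 1.0}, GRAPH)
feedback(GRAPH, "Dr_X", "report_2", 0.3)  # an expert marks report_2 as relevant
after = spread({"ontology": 1.0}, GRAPH)
print(before["report_2"], after["report_2"])
```

After the feedback call, the same query activates `report_2` more strongly, which illustrates how expert input can change the index over time without rebuilding it.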
Lexical and Ontological Resources
• China: HowNet (similar to WordNet, with broader conceptual coverage)
• Japan: EDR Dictionary, a set of dictionaries including bilingual E-J dictionaries and technical term and concept dictionaries; NTT Goi Taikei, etc.
• Korea: KORTERM initiative
• Thailand: TCL’s Computational Lexicon
Lexical and Ontological Resources (2)
• GENIA Annotated Corpus and GENIA Ontology from the University of Tokyo, for bioinformatics research based on Medline abstracts
• Multilingual specialized ontologies are still rare but likely to appear
• Similar resources in the agricultural domain, including the AGROVOC thesaurus and related ontologies and resources (corpora)
• FAO’s Bio-Safety Ontology:
  • Frequent verbs (free-text corpus), arguments (NPs), KAON concepts, domain experts
  • A bootstrapping approach to creating an ontology
NiCT Language Resources
• EDR Lexicons
  • NiCT acquired all copyrights to the EDR electronic dictionary in 2002 and is able to distribute it for a nominal handling fee
  • Word Dictionaries
    • Japanese Word Dictionary (260,000)
    • English Word Dictionary (190,000)
  • Bilingual Dictionaries
    • Jpn.-Eng. Bilingual Dictionary (240,000)
    • Eng.-Jpn. Bilingual Dictionary (160,000)
  • Concept Dictionary (410,000)
  • Co-occurrence Dictionary
    • Japanese Co-occurrence Dictionary (930,000), with 20,000 Japanese example sentences
    • English Co-occurrence Dictionary (460,000), with 12,000 English example sentences
  • Technical Terminology Dictionary (110,000 Japanese & 70,000 English entries)
NiCT Language Resources (2)
• Multilingual Annotated Corpus
  • 40,000 Japanese sentences from the Mainichi Newspaper (i.e., Kyoto University Corpus)
    • Morphologically and syntactically annotated
  • English translation (manually translated); phrase alignment done
    • Syntactic annotation (based on Penn Treebank)
    • 10,000 sentences will be translated and aligned at the phrase level in April 2004 (tentative)
  • Chinese translation (manually translated)
    • 10,000 sentences are already translated
• And many other tools and linguistic resources
  • Project Gutenberg Corpus (English-Japanese bilingual sentence-aligned corpus)
  • SST Learner Corpora (with error annotation)
Conclusions
• In this new era of ubiquitous connectivity, integrated processing of information is a necessity.
• Language (not physical communication/bandwidth) remains the strongest barrier.
• Multilingual resources (dictionaries, thesauri, corpora) are either rare or incomplete
  • AGROVOC still doesn’t cover many languages (including Japanese)
• Effective processing of multilingual information needs a concerted effort in resource building and standardization
  • Especially in terminology and interoperable ontology standards
• Multilingual resources, along with an effective bootstrapping strategy, will help us overcome the difficulties in NLP and multilingual information processing
• With the resources and technologies we have at NiCT, it could be worthwhile to try extending AGROVOC and related ontologies to cover Japanese
  • AGROTERM from AFFRC-Japan contains 57,000 agricultural terms extracted from a corpus using NLP tools.
  • Aligning AGROTERM or other similar resources with AGROVOC semi-automatically is a useful challenge.
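One way such a semi-automatic alignment might start is sketched below: translate each Japanese term through a bilingual dictionary, accept unambiguous matches against the thesaurus descriptors, and route everything else to a human expert. All data here is invented for illustration; it does not reflect the actual contents or formats of AGROTERM or AGROVOC, and `normalise` and `align` are hypothetical helpers.

```python
# Hypothetical inputs: Japanese terms (AGROTERM-style), English thesaurus
# descriptors (AGROVOC-style), and a Japanese -> English bilingual dictionary.
JAPANESE_TERMS = ["稲", "土壌", "害虫", "棚田"]
THESAURUS_DESCRIPTORS = {"rice", "soil", "pests", "irrigation"}
BILINGUAL = {
    "稲": ["rice", "rice plant"],
    "土壌": ["soil"],
    "害虫": ["pest"],
}


def normalise(term):
    """Crude normalisation: lowercase and strip a plural 's' (an assumption)."""
    term = term.lower().strip()
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def align(terms, descriptors, dictionary):
    """Split terms into automatically aligned ones and ones needing review."""
    index = {normalise(d): d for d in descriptors}
    aligned, review = {}, []
    for term in terms:
        matches = {
            index[normalise(t)]
            for t in dictionary.get(term, [])
            if normalise(t) in index
        }
        if len(matches) == 1:
            aligned[term] = matches.pop()
        else:
            review.append(term)  # no match or ambiguous: leave to a human expert
    return aligned, review


aligned, review = align(JAPANESE_TERMS, THESAURUS_DESCRIPTORS, BILINGUAL)
print(aligned, review)
```

The review list is the "semi" in semi-automatic: domain experts confirm or correct the uncertain cases, and their decisions can be fed back into the dictionary for the next pass.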