Ontology-Aware Information Extraction http://gate.ac.uk/ Hamish Cunningham, Kalina Bontcheva Department of Computer Science, University of Sheffield OntoWeb 4, SIG 5, 2002


Ontology-Aware Information Extraction

http://gate.ac.uk/

Hamish Cunningham, Kalina Bontcheva

Department of Computer Science, University of Sheffield

OntoWeb 4, SIG 5, 2002


GATE, a General Architecture for Text Engineering

GATE is…

• An architecture: a macro-level organisational picture for LE software systems.

• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.

• A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.

• Free software (LGPL). Mature, robust software (in development since 1995). Download at http://gate.ac.uk/download

Comes with…

• Some free components...

• ...and wrappers for other people's components

• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.

Applications; languages

GATE has been used for a variety of applications, including:

• MUMIS: automatic creation of semantic indexes for multimedia programme material

• MUSE: a multi-genre IE system

• EMILLE: a 70 million word corpus of Indic languages

• Metadata for Medline (at Merck)

• Creation of metadata for Semantic Web Services; documentation using NLG

• HSE: summarisation of health and safety information from company reports

• OldBaileyIE: NE recognition on 17th century Old Bailey Court reports.

• AKT: language technology in knowledge management

• AMITIES: call centre automation

• Digital libraries / e-philology for ancient languages researchers

• Various Medical Informatics and database technology projects

• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)


Some users…

At the time of writing, a representative fraction of GATE users includes:

• Longman Pearson publishing, UK;

• BT Exact Technologies, UK;

• Merck KGaA, Germany;

• Canon Europe, UK;

• Knight Ridder (the second biggest US news publisher);

• BBN Technologies, US;

• Sirma AI Ltd., Bulgaria;

• Resco AB, Sweden/Finland/Germany;

• GlaxoSmithKline plc: drug-based navigation of Medline abstracts;

• Master Foods NV: extraction of commodities events from news;

• the American National Corpus project, US;

• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU universities;

• the Perseus Digital Library project, Tufts University, US.


Scientific method and HLT

• How do we really know that this stuff works?!

• Open source systems make experimental repeatability easier and therefore cut down on site-specific skew effects.

• GATE's IE tools have competed in MUC, TREC (QA), ACE, and DUC. TIDES Surprise Language exercise next year.

• GATE includes markup and automated evaluation tools: easier quantitative evaluation.
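
As an illustration of the kind of quantitative evaluation meant here, the sketch below scores a system's annotations against a manually marked-up key using precision, recall and F1. It is a minimal sketch under stated assumptions, not GATE's own evaluation tool: the names `Annotation`, `Scores` and `evaluate` are hypothetical, annotations are simplified to (start, end, type) triples, only exact matches count as correct, and the sample data is invented.

```python
from typing import NamedTuple, Set, Tuple

# Simplified annotation: (start offset, end offset, annotation type).
# This representation is an assumption made for the example.
Annotation = Tuple[int, int, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def evaluate(key: Set[Annotation], response: Set[Annotation]) -> Scores:
    """Strict scoring: a response annotation is correct only if the key
    contains an annotation with the same span and the same type."""
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


# Invented sample data: the key is the human markup, the response is system output.
key = {(0, 8, "Person"), (20, 29, "Location"), (35, 41, "Organization")}
response = {(0, 8, "Person"), (20, 29, "Organization"), (50, 55, "Date")}

scores = evaluate(key, response)
print(f"P={scores.precision:.2f} R={scores.recall:.2f} F1={scores.f1:.2f}")
```

Only one of the three response annotations matches the key exactly (the second has the right span but the wrong type), so precision, recall and F1 are each 1/3 in this example. A lenient variant that gives credit for overlapping spans would be a natural extension.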


Collaboration opportunities

• Interoperation, integration, not re-invention: collaboration not competition

• Take the code, do what you like with it, perhaps contribute something back

• Involve us in your 6th Framework projects

• Join KITShare: a network of excellence in Knowledge and Interface Tool Sharing.


The Holy Grail

• Problem: gap between many current IE tools and SemWeb needs


What is needed?

• Content Extraction, not Information Extraction

– Identify the ontological reference, not just the class

– Maintain referential integrity (coreference)

• Ontology-aware IE tools

– Use instances already in the ontology

– React to changes in the ontology

• Support experienced users in changing the IE tools


GATE and Content Extraction

ANNIE - Open-source IE system in GATE, providing modules needed for content extraction

• Pre-processing

• Named entity recognition

• Coreference resolution

– ANNIE handles proper names, pronouns, and nominals

• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
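
To make the pattern-action idea concrete, here is a small sketch in Python. It does not use the syntax of GATE's rule language, and it matches regular expressions over raw text rather than over existing annotations. The names `Rule`, `Annotation`, `apply_rules` and the `TitlePerson` rule are all hypothetical. Each rule pairs a pattern (what to look for) with an action (what annotation to create when the pattern matches).

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Annotation(NamedTuple):
    start: int
    end: int
    type: str
    features: Dict[str, str]


class Rule(NamedTuple):
    name: str
    pattern: re.Pattern                       # left-hand side: what to match
    action: Callable[[re.Match], Annotation]  # right-hand side: what to create


def title_person_action(m: re.Match) -> Annotation:
    # Annotate only the name, and record the title and rule name as features.
    return Annotation(m.start("name"), m.end("name"), "Person",
                      {"rule": "TitlePerson", "title": m.group("title")})


# Hypothetical rule: a title followed by one or more capitalised words is a Person.
RULES: List[Rule] = [
    Rule("TitlePerson",
         re.compile(r"\b(?P<title>Mr|Mrs|Ms|Dr|Prof)\.? "
                    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
         title_person_action),
]


def apply_rules(text: str, rules: List[Rule]) -> List[Annotation]:
    """Run every rule over the text and collect the annotations its action creates."""
    found = [rule.action(m) for rule in rules for m in rule.pattern.finditer(text)]
    return sorted(found, key=lambda a: a.start)


text = "Dr Kalina Bontcheva and Prof. Hamish Cunningham presented at OntoWeb."
annotations = apply_rules(text, RULES)
for a in annotations:
    print(a.type, repr(text[a.start:a.end]), a.features)
```

Customising the system then amounts to editing or adding rules in `RULES`, without touching the code that applies them.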


Populating Ontologies with ANNIE


Ontologies as explicit IE resources

• Reuse, not reinvention:

– Protégé for ontology maintenance

– Sesame/KAON for storage and reasoning

• Ontology-aware gazetteers

– Provide the ontological class of each entry

– Use instances from the ontology for IE
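
A minimal sketch of what an ontology-aware gazetteer could look like follows. It is illustrative only and is not GATE's gazetteer implementation. The `ex:` identifiers, the `Entry` and `Lookup` types, and the function `gazetteer_lookup` are assumptions made for the example. The entry list stands in for instances read from an ontology, so each match carries both the instance it refers to and that instance's ontological class.

```python
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    instance: str    # identifier of the ontology instance (hypothetical)
    onto_class: str  # ontological class of that instance (hypothetical)


class Lookup(NamedTuple):
    start: int
    end: int
    instance: str
    onto_class: str


# Toy stand-in for instances exported from an ontology store.
ONTOLOGY_INSTANCES: Dict[str, Entry] = {
    "University of Sheffield": Entry("ex:UniversityOfSheffield", "ex:University"),
    "Sheffield": Entry("ex:Sheffield", "ex:City"),
    "Merck": Entry("ex:Merck", "ex:Company"),
}


def gazetteer_lookup(text: str, entries: Dict[str, Entry]) -> List[Lookup]:
    """Scan left to right, preferring the longest entry at each position.
    Simplification: no tokenisation or word-boundary checks are done."""
    names = sorted(entries, key=len, reverse=True)
    lookups: List[Lookup] = []
    pos = 0
    while pos < len(text):
        for name in names:
            if text.startswith(name, pos):
                entry = entries[name]
                lookups.append(Lookup(pos, pos + len(name),
                                      entry.instance, entry.onto_class))
                pos += len(name)
                break
        else:
            pos += 1  # no entry starts here; move on one character
    return lookups


text = "Merck funded work at the University of Sheffield."
lookups = gazetteer_lookup(text, ONTOLOGY_INSTANCES)
for lk in lookups:
    print(repr(text[lk.start:lk.end]), lk.instance, lk.onto_class)
```

Because the longest entry wins, "University of Sheffield" is returned as a university rather than "Sheffield" as a city. Since the entries are read at lookup time, an instance added to the dictionary is picked up on the next call.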


Ontology-aware IE

• The IE tools can use available formal knowledge and reasoning

• Ontology-based anaphora resolution

– G. Bush, G. Brown, the president

• The correct ontological classes are assigned to the recognised entities

• Changes in the ontology are available to the IE tools
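
The "G. Bush, G. Brown, the president" example can be sketched as follows. This is a toy illustration, not the algorithm used in GATE. The class hierarchy, the instance assignments, the head-noun mapping and the function names `is_a` and `resolve_nominal` are all assumptions. The idea is that a definite nominal such as "the president" is linked to the most recent earlier mention whose ontology class is compatible with the nominal's class.

```python
from typing import Dict, List, Optional

# Toy ontology (all identifiers and facts are assumptions for this example).
SUBCLASS_OF: Dict[str, str] = {
    "ex:President": "ex:Politician",
    "ex:Chancellor": "ex:Politician",
    "ex:Politician": "ex:Person",
}
INSTANCE_OF: Dict[str, str] = {
    "G. Bush": "ex:President",
    "G. Brown": "ex:Chancellor",
}
# Mapping from the head noun of a nominal to an ontology class.
NOMINAL_CLASS: Dict[str, str] = {
    "president": "ex:President",
    "politician": "ex:Politician",
}


def is_a(cls: Optional[str], ancestor: str) -> bool:
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def resolve_nominal(nominal: str, antecedents: List[str]) -> Optional[str]:
    """Return the most recent antecedent whose class fits the nominal."""
    head = nominal.lower().split()[-1]
    target = NOMINAL_CLASS.get(head)
    if target is None:
        return None
    for mention in reversed(antecedents):  # most recent mention first
        cls = INSTANCE_OF.get(mention)
        if cls is not None and is_a(cls, target):
            return mention
    return None


mentions = ["G. Bush", "G. Brown"]  # in order of appearance in the text
resolved = resolve_nominal("the president", mentions)
print("the president ->", resolved)
```

With this toy data "the president" resolves to G. Bush even though G. Brown is the more recent mention, because only G. Bush is an instance of the president class. "The politician" would resolve to G. Brown, the most recent compatible mention. Because the resolver consults the ontology tables on every call, a change such as reassigning an instance to another class takes effect at once, which is one reading of the last bullet above.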