
Page 1: Text mining and the Semantic Web

Text mining and the Semantic Web

Dr Diana Maynard, NLP Group
Department of Computer Science, University of Sheffield
http://nlp.shef.ac.uk

University of Manchester, 15 March 2005

Page 2: Text mining and the Semantic Web


Structure of this lecture

• Text Mining and the Semantic Web
• Text Mining Components / Methods
• Information Extraction
• Evaluation
• Visualisation
• Summary

Page 3: Text mining and the Semantic Web

Introduction to Text Mining and the Semantic Web

Page 4: Text mining and the Semantic Web


What is Text Mining?

• Text mining is about knowledge discovery from large collections of unstructured text.

• It’s not the same as data mining, which is more about discovering patterns in structured data stored in databases.

• Similar techniques are sometimes used; however, text mining has many additional constraints caused by the unstructured nature of the text and the use of natural language.

• Information extraction (IE) is a major component of text mining.

• IE is about extracting facts and structured information from unstructured text.

Page 5: Text mining and the Semantic Web


Challenge of the Semantic Web

• The Semantic Web requires machine-processable, repurposable data to complement hypertext

• Such metadata can be divided into two types of information: explicit and implicit. IE is mainly concerned with implicit (semantic) metadata.

• More on this later…

Page 6: Text mining and the Semantic Web

Text mining components and methods

Page 7: Text mining and the Semantic Web


Text mining stages

• Document selection and filtering (IR techniques)

• Document pre-processing (NLP techniques)

• Document processing (NLP / ML / statistical techniques)

Page 8: Text mining and the Semantic Web


Stages of document processing

• Document selection involves identification and retrieval of potentially relevant documents from a large set (e.g. the web) in order to reduce the search space. Standard or semantically enhanced IR techniques can be used for this.
• Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc. (a sketch follows after this list)
• Document processing consists mainly of information extraction
• For the Semantic Web, this is realised in terms of metadata extraction
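As a concrete illustration of the pre-processing stage, here is a minimal sketch using NLTK as a stand-in toolkit (the slides themselves do not prescribe a tool); the markup-stripping regex and the sample text are assumptions for illustration:

```python
# Minimal document pre-processing sketch: cleanup, tokenisation, POS tagging.
# Assumes NLTK is installed together with its tokeniser and tagger data.
import re
import nltk

def preprocess(raw_html: str):
    # Crude removal of extraneous markup (illustrative only).
    text = re.sub(r"<[^>]+>", " ", raw_html)
    tokens = nltk.word_tokenize(text)   # tokenisation
    tagged = nltk.pos_tag(tokens)       # POS tagging
    return tagged

print(preprocess("<p>Dr Maynard works at the University of Sheffield.</p>"))
```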

Page 9: Text mining and the Semantic Web


Metadata extraction

• Metadata extraction consists of two types:
• Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.); a sketch of this case follows after this list
• Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
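To illustrate the explicit case, a minimal sketch that pulls the title and meta tags out of an HTML header using only the Python standard library; the sample document and the fields it contains are invented for illustration:

```python
# Extract explicit metadata (title, <meta> name/content pairs) from an HTML header.
from html.parser import HTMLParser

class HeaderMetadataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data.strip()

parser = HeaderMetadataParser()
parser.feed('<html><head><title>Annual Report</title>'
            '<meta name="author" content="J. Smith"></head><body/></html>')
print(parser.metadata)   # {'title': 'Annual Report', 'author': 'J. Smith'}
```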

Page 10: Text mining and the Semantic Web

Information Extraction (IE)

Page 11: Text mining and the Semantic Web


IE is not IR

IE pulls facts and structured information from the content of large text collections. You analyse the facts.

IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.

Page 12: Text mining and the Semantic Web


IE for Document Access

• With traditional query engines, getting the facts can be hard and slow
• Where has the Queen visited in the last year?
• Which places on the East Coast of the US have had cases of West Nile Virus?
• Which search terms would you use to get this kind of information?
• How can you specify you want someone’s home page?
• IE returns information in a structured way
• IR returns documents containing the relevant information somewhere (if you’re lucky)

Page 13: Text mining and the Semantic Web


IE as an alternative to IR

• IE returns knowledge at a much deeper level than traditional IR

• Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool (see the sketch below).

• Even if results are not always accurate, they can be valuable if linked back to the original text
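A minimal sketch of this idea, using SQLite from the Python standard library to store extracted facts alongside pointers back to the source documents; the table layout, identifiers and sample fact are assumptions for illustration, not the design of any system described here:

```python
# Store IE results in a database, keeping a link back to the source document.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE facts (
                  entity TEXT, relation TEXT, value TEXT,
                  doc_id TEXT, start_offset INTEGER, end_offset INTEGER)""")

# A fact as produced by an IE system, anchored to a span in a document.
conn.execute("INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)",
             ("Dr Head", "director_of", "Shiny Rockets Corp", "news-042", 120, 178))

# Query the structured view, then follow doc_id/offsets back to the original text.
for row in conn.execute("SELECT doc_id, start_offset, end_offset FROM facts "
                        "WHERE relation = 'director_of'"):
    print("supporting evidence in document", row)
```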

Page 14: Text mining and the Semantic Web


Some example applications

• HaSIE• KIM• Threat Trackers

Page 15: Text mining and the Semantic Web


HaSIE

• Application developed by the University of Sheffield, which aims to find out how companies report health and safety information
• Answers questions such as:
  “How many members of staff died or had accidents in the last year?”
  “Is there anyone responsible for health and safety?”
  “What measures have been put in place to improve health and safety in the workplace?”

Page 16: Text mining and the Semantic Web


HaSIE

• Identification of such information is too time-consuming and arduous to be done manually

• IR systems can’t cope with this because they return whole documents, which could be hundreds of pages

• System identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information

Page 17: Text mining and the Semantic Web


HaSIE

Page 18: Text mining and the Semantic Web


KIM

• KIM is a software platform developed by Ontotext for semantic annotation of text.

• KIM performs automatic ontology population and semantic annotation for Semantic Web and KM applications

• Indexing and retrieval (an IE-enhanced search technology)

• Query and exploration of formal knowledge

Page 19: Text mining and the Semantic Web


KIM: Ontotext’s KIM query and results

Page 20: Text mining and the Semantic Web


Threat tracker

• Application developed by Alias-I which finds and relates information in documents

• Intended for use by Information Analysts who use unstructured news feeds and standing collections as sources

• Used by DARPA for tracking possible information about terrorists etc.

• Identification of entities, aliases, relations etc. enables you to build up chains of related people and things

Page 21: Text mining and the Semantic Web


Threat tracker

Page 22: Text mining and the Semantic Web

What is Named Entity Recognition?

• Identification of proper names in texts, and their classification into a set of predefined categories of interest

• Persons
• Organisations (companies, government organisations, committees, etc.)
• Locations (cities, countries, rivers, etc.)
• Date and time expressions
• Various other types as appropriate (see the example below)
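A small example of this kind of tagging, using spaCy as a present-day stand-in for the NE tools discussed in the talk (which uses GATE/ANNIE, introduced later); the model name and sample sentence are assumptions:

```python
# Named Entity Recognition: identify proper names and classify them.
# Assumes spaCy and the small English model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr Head became the new director of Shiny Rockets Corp "
          "in Sheffield on 15 March 2005.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. PERSON, ORG, GPE, DATE
```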

Page 23: Text mining and the Semantic Web


Why is NE important?

• NE provides a foundation from which to build more complex IE systems

• Relations between NEs can provide tracking, ontological information and scenario building

• Tracking (co-reference): “Dr Head, John, he”
• Ontologies: “Manchester, CT”
• Scenario: “Dr Head became the new director of Shiny Rockets Corp”

Page 24: Text mining and the Semantic Web


Two kinds of approaches

Knowledge Engineering
• rule based (a toy example follows below)
• developed by experienced language engineers
• make use of human intuition
• require only a small amount of training data
• development can be very time consuming
• some changes may be hard to accommodate

Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• require large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
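A toy sketch of the knowledge-engineering style: a hand-written gazetteer plus one contextual rule, expressed in plain Python rather than any particular grammar formalism; the word lists and the rule are invented for illustration:

```python
# Toy rule-based NE tagger: gazetteer lookup plus one hand-written contextual rule.
import re

GAZETTEER = {
    "Sheffield": "Location",
    "Manchester": "Location",
    "Ontotext": "Organisation",
}
# Rule written by a "language engineer": a capitalised word after a title is a Person.
PERSON_RULE = re.compile(r"\b(?:Dr|Mr|Ms|Prof)\.?\s+([A-Z][a-z]+)")

def tag(text):
    entities = [(m.group(0), "Person") for m in PERSON_RULE.finditer(text)]
    for word, label in GAZETTEER.items():
        if word in text:
            entities.append((word, label))
    return entities

print(tag("Dr Maynard of Sheffield gave a talk in Manchester."))
```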

Page 25: Text mining and the Semantic Web


Typical NE pipeline

• Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
• Entity finding (gazetteer lookup, NE grammars)
• Coreference (alias finding, orthographic coreference etc.)
• Export to database / XML (a skeleton of such a pipeline is sketched below)
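A skeleton of such a pipeline, chaining the stages as plain Python functions and exporting the result to XML with the standard library; the stage implementations are deliberately trivial placeholders, not ANNIE's actual components:

```python
# Skeleton NE pipeline: pre-process -> find entities -> coreference -> export to XML.
import xml.etree.ElementTree as ET

def preprocess(text):            # tokenisation stand-in
    return text.split()

def find_entities(tokens):       # gazetteer-lookup stand-in
    gazetteer = {"Sheffield": "Location", "GATE": "Organisation"}
    return [(t, gazetteer[t]) for t in tokens if t in gazetteer]

def coreference(entities):       # alias-finding stand-in (no-op here)
    return entities

def export_xml(entities):
    root = ET.Element("annotations")
    for text, etype in entities:
        ET.SubElement(root, "entity", type=etype).text = text
    return ET.tostring(root, encoding="unicode")

tokens = preprocess("GATE was developed in Sheffield")
print(export_xml(coreference(find_entities(tokens))))
```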

Page 26: Text mining and the Semantic Web


GATE and ANNIE

• GATE (General Architecture for Text Engineering) is a framework for language processing
• ANNIE (A Nearly New Information Extraction system) is a suite of language processing tools, which provides NE recognition

GATE also includes:
• plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages etc.
• tools for visualising and manipulating ontologies
• ontology-based information extraction tools
• evaluation and benchmarking tools

Page 27: Text mining and the Semantic Web


GATE

Page 28: Text mining and the Semantic Web


Information Extraction for the Semantic Web

• Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc.

• For the Semantic Web, we need information in a hierarchical structure

• Idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology

• Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology (see the sketch below)
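A minimal sketch of this export step, writing one extracted instance as ontology-linked RDF with the rdflib library; the namespace, class names and instance are invented for illustration and are not taken from the slides:

```python
# Export an extracted instance as ontology-linked metadata (RDF/Turtle).
# Assumes rdflib is installed (pip install rdflib).
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

EX = Namespace("http://example.org/ontology#")   # hypothetical domain ontology

g = Graph()
g.bind("ex", EX)

# IE found the string "University of Sheffield" and classified it under ex:University.
instance = URIRef("http://example.org/instances/University_of_Sheffield")
g.add((instance, RDF.type, EX.University))
g.add((instance, RDFS.label, Literal("University of Sheffield")))
g.add((instance, EX.locatedIn, EX.Sheffield))

print(g.serialize(format="turtle"))
```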

Page 29: Text mining and the Semantic Web


Richer NE Tagging

• Attachment of instances in the text to concepts in the domain ontology

• Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK

Page 30: Text mining and the Semantic Web


Magpie

• Developed by the Open University
• Plugin for a standard web browser
• Automatically associates an ontology-based semantic layer with web resources, allowing relevant services to be linked
• Provides a means for structured and informed exploration of web resources
• e.g. looking at a list of publications, we can find information about an author such as projects they work on, other people they work with, etc.

Page 31: Text mining and the Semantic Web


MAGPIE in action

Page 32: Text mining and the Semantic Web


MAGPIE in action

Page 33: Text mining and the Semantic Web

Evaluation

Page 34: Text mining and the Semantic Web


Evaluation metrics and tools

• Evaluation metrics mathematically define how to measure the system’s performance against a human-annotated gold standard
• Scoring program implements the metric and provides performance measures
  – for each document and over the entire corpus
  – for each type of NE
  – may also evaluate changes over time

• A gold standard reference set also needs to be provided – this may be time-consuming to produce

• Visualisation tools show the results graphically and enable easy comparison

Page 35: Text mining and the Semantic Web


Methods of evaluation

• Traditional IE is evaluated in terms of Precision and Recall

• Precision: how accurate were the answers the system produced?
  correct answers / answers produced
• Recall: how good was the system at finding everything it should have found?
  correct answers / total possible correct answers
• There is usually a trade-off between precision and recall, so a weighted average of the two (F-measure) is generally also used (see the sketch below).
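These measures as a small code sketch, using the balanced form of the F-measure (harmonic mean of precision and recall); the counts in the example are made up:

```python
# Precision, Recall and balanced F-measure for an IE system.
def precision(correct, produced):
    return correct / produced if produced else 0.0

def recall(correct, possible):
    return correct / possible if possible else 0.0

def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0   # harmonic mean (F1)

# e.g. the system produced 80 answers, 60 were correct, 100 were possible.
p, r = precision(60, 80), recall(60, 100)
print(f"P={p:.2f} R={r:.2f} F={f_measure(p, r):.2f}")   # P=0.75 R=0.60 F=0.67
```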

Page 36: Text mining and the Semantic Web


GATE AnnotationDiff Tool

Page 37: Text mining and the Semantic Web


Metrics for Richer IE

• Precision and Recall are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious

• Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong

• Similarity metrics additionally need to be integrated, so that wrong answers that are closer to the correct concept in the hierarchy are given a higher score (a toy sketch of this idea follows below)

• Also possible is a cost-based approach, where different weights can be given to each concept in the hierarchy, and to different types of error, and combined to form a single score
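A toy sketch of the distance-based idea, with a hand-written concept hierarchy; the hierarchy, discount factor and scoring function are invented for illustration and are not the specific metric referred to here:

```python
# Hierarchy-aware scoring: a wrong label close to the gold concept scores
# higher than a distant one (1.0 for an exact match, 0.0 if unrelated).
PARENT = {                       # child -> parent in a toy concept hierarchy
    "Lecturer": "AcademicStaff",
    "ResearchAssistant": "AcademicStaff",
    "AcademicStaff": "Person",
    "Person": "Entity",
    "Location": "Entity",
}

def ancestors(concept):
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def similarity(gold, predicted, discount=0.5):
    if gold == predicted:
        return 1.0
    a, b = ancestors(gold), ancestors(predicted)
    common = next((c for c in a if c in b), None)      # lowest shared ancestor
    if common is None or common == "Entity":           # only the root in common
        return 0.0
    distance = a.index(common) + b.index(common)       # edges via shared ancestor
    return discount ** distance

print(similarity("Lecturer", "ResearchAssistant"))     # 0.25: a near miss
print(similarity("Lecturer", "Location"))              # 0.0: clearly wrong
```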

Page 38: Text mining and the Semantic Web

Visualisation of Results

Page 39: Text mining and the Semantic Web


Visualisation of Results

• Cluster Map example
• Traditionally used to show documents classified according to topic
• Here shows instances classified according to concept
• Enables analysis, comparison and querying of results
• Examples here created by Marta Sabou (Free University of Amsterdam) using Aduna software

Page 40: Text mining and the Semantic Web


The principle – Venn Diagrams

Documents classified according to topic

Page 41: Text mining and the Semantic Web


Jobs by region

Instances classified by concept

Page 42: Text mining and the Semantic Web


Concept distribution

Shows the relative importance of different concepts

Page 43: Text mining and the Semantic Web


Correct and incorrect instances attached to concepts

Page 44: Text mining and the Semantic Web


Summary

• Introduction to text mining and the semantic web

• How traditional information extraction techniques, including visualisation and evaluation, can be extended to deal with the complexity of the Semantic Web

• How text mining can help the progression of the Semantic Web

Page 45: Text mining and the Semantic Web


Research questions

• Automatic annotation tools are currently mainly domain and ontology-dependent, and work best on a small scale

• Tools designed for large scale applications lose out on accuracy

• Ontology population works best when the ontology already exists, but how do we ensure accurate ontology generation?

• Need large scale evaluation programs

Page 46: Text mining and the Semantic Web


Some useful links

• NaCTeM (National Centre for Text Mining): http://www.nactem.ac.uk
• GATE: http://gate.ac.uk
• KIM: http://www.ontotext.com/kim/
• h-TechSight: http://www.h-techsight.org
• Magpie: http://www.kmi.open.ac.uk/projects/magpie