CNI, 3rd April 2006 Slide 1

UK National Centre for Text Mining: Activities and Plans

Dr. Robert Sanderson, Dept. of Computer Science, University of Liverpool

[email protected]

http://www.nactem.ac.uk

CNI, 3rd April 2006 Slide 2

Overview

Text Mining?

NaCTeM

Consortium Components

Service Infrastructure

Future Work

CNI, 3rd April 2006 Slide 3

Centre for ...

National Centre for ... what was that?

Ticks Mining! TEXT

CNI, 3rd April 2006 Slide 4

... Text Mining?

Text Mining: No canonical definition

Commonly used definition based on Data Mining:

“The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”

“The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”

CNI, 3rd April 2006 Slide 5

... Text Mining?

Typical Data Mining Functions:

Classification

Association Rule Mining

Clustering

Useful when applied to texts, but they don't fulfill the definition, as they don't discover “facts”.

Information Retrieval also doesn't discover facts.

CNI, 3rd April 2006 Slide 6

... Text Mining?

Need to understand the meaning of the text:

Part of Speech tagging

Clauses

Named Entity Recognition

Find correlations of entities

Infer information from logical chains

Result: New Knowledge
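
As a rough illustration of the "find correlations of entities" step, the following minimal sketch counts how often two entities are mentioned in the same document. It assumes the earlier steps (tagging and named entity recognition) have already produced a set of entities per document; the function name and the sample entities are hypothetical and are not taken from any NaCTeM tool.

```python
from collections import Counter
from itertools import combinations


def entity_cooccurrence(docs):
    """Count how often each pair of entities appears in the same document."""
    counts = Counter()
    for entities in docs:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Made-up recogniser output: one set of entity names per document.
docs = [
    {"p53", "apoptosis"},
    {"p53", "apoptosis", "MDM2"},
    {"MDM2", "p53"},
]
pairs = entity_cooccurrence(docs)
```

Pairs that co-occur far more often than chance would be candidates for the logical chains mentioned on the slide; a real system would need a proper significance measure rather than raw counts.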

CNI, 3rd April 2006 Slide 7

Other Benefits

Plus a lot more:

Improved document classification

Automatic semantic annotation of documents

Improved access -- search by semantics and concepts

Improved clustering of documents by concept

Summarization

Visualization techniques

CNI, 3rd April 2006 Slide 8

Event Extraction

Extract events from the text along with information about the participants

Can be modeled as relationships between named entities

Extracting events allows discovery of hidden temporal correlations

e.g. Google refuses to announce plans. Google's stock falls.

Improves understanding of the semantics, improving the functions based around those semantics
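
One possible way to picture events as relationships between named entities is sketched below, using the Google example from this slide. The `Event` structure, the action labels, and the dates are all invented for illustration; the slide does not describe how any NaCTeM tool represents events.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    when: date    # when the event happened (dates below are made up)
    actor: str    # named entity taking part in the event
    action: str   # hypothetical normalised event label


def events_followed_by(events, first, second, actor):
    """Return (earlier, later) pairs where `actor` does `first` and then `second`."""
    ordered = sorted((e for e in events if e.actor == actor), key=lambda e: e.when)
    return [
        (a, b)
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if a.action == first and b.action == second
    ]


events = [
    Event(date(2006, 3, 1), "Google", "refuses_to_announce_plans"),
    Event(date(2006, 3, 2), "Google", "stock_falls"),
]
chains = events_followed_by(
    events, "refuses_to_announce_plans", "stock_falls", "Google"
)
```

Counting such ordered pairs across a large collection is one simple way a temporal correlation could surface; it says nothing about causation.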

CNI, 3rd April 2006 Slide 9

NaCTeM

Hosted at University of Manchester

Participants: Universities of Manchester, Liverpool, Salford

Plus: San Diego Supercomputer Centre, University of Tokyo, University of Geneva, University of California Berkeley

Six full-time posts for 3 years (2005-2007)

Plus active board of directors and experts

Current Director: Professor Jun'ichi Tsujii from U.Tokyo

Funding: JISC, BBSRC, EPSRC

CNI, 3rd April 2006 Slide 10

NaCTeM Aims

Provide text mining oriented services

Facilitate access to text mining resources

User support, advice, training and consultancy

Participate in international research

Formulate best practice guidelines

Increase awareness of text mining in all domains

Develop links with industrial partners involved in text mining

CNI, 3rd April 2006 Slide 11

Components

Liverpool: Cheshire3 (Information framework)

Manchester: CAFETIERE (Entity recognition, event extraction)

Salford: TerMine (Automatic term recognition)

SDSC: Storage Resource Broker (Data grid)

UC Berkeley: Cheshire, TM/IR expertise

U.Tokyo: GENIA, ENJU (Text analysis tools)

U.Geneva: User studies and evaluation

CNI, 3rd April 2006 Slide 12

Cheshire3

Information Processing Framework

Liverpool and UC Berkeley

Standards based: XML, SRU, Unicode, etc.

Scalable: Single machine to Grid (PVM, MPI, SRB)

Extensible: Python + C, Object Oriented with stable API

Work ongoing to integrate Data Mining tools and other information processing applications
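
Since SRU is listed among the supported standards, here is a minimal sketch of what an SRU 1.1 searchRetrieve request URL looks like. The base URL is a placeholder and the `dc.title` index is an assumption about what a given server exposes; neither refers to an actual Cheshire3 or NaCTeM endpoint.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,                 # CQL query string
        "maximumRecords": maximum_records,  # cap on records returned
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint, not a real service.
url = sru_search_url("http://sru.example.org/db", 'dc.title = "text mining"')
```

The response to such a request is an XML document, which fits the XML-based design described on this slide.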

CNI, 3rd April 2006 Slide 13

Cheshire3 Examples

Integrated tools from other participants in preparation for NaCTeM service infrastructure.

Medline: 4350 records/second using 60 concurrent processes on SDSC's Teragrid cluster

440 seconds to index 1 field from 16 million MARC records

Distributed network of Archival Descriptions in the UK

NARA ERA prototype system with SDSC
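
To put the two throughput figures on this slide into comparable units, the arithmetic can be worked through as follows. Only the four input numbers come from the slide; the slide does not say how many processes were used for the MARC run, so no per-process figure is derived for it.

```python
# Medline run: figures quoted on the slide.
medline_rate = 4350        # records per second, whole cluster
processes = 60             # concurrent processes
per_process_rate = medline_rate / processes   # records/second per process

# MARC run: figures quoted on the slide.
marc_records = 16_000_000
marc_seconds = 440
marc_rate = marc_records / marc_seconds       # records/second overall

print(f"Medline: {per_process_rate:.1f} records/s per process")
print(f"MARC:    {marc_rate:.0f} records/s for a single indexed field")
```

The two workloads are not directly comparable: the Medline figure presumably covers full record processing, while the MARC figure covers indexing a single field.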

CNI, 3rd April 2006 Slide 14

CAFETIERE

Entity Recognition and Annotation

University of Manchester

Discovers named entities in part of speech tagged text

Discovers temporal events referring to those entities

Integration of ontologies and term processing

Rule-based
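
To give a flavour of what a rule over part-of-speech-tagged text can look like, here is a deliberately tiny sketch with a single rule: a run of consecutive proper nouns forms one entity. This is not CAFETIERE's rule language or one of its rules; the function name is hypothetical and the input assumes Penn Treebank style tags such as `NNP`.

```python
def proper_noun_entities(tagged_tokens):
    """Group runs of consecutive proper nouns (tag 'NNP') into entity strings."""
    entities, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NNP":
            current.append(word)
        elif current:
            # A non-proper-noun token closes the entity being built.
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tagged = [
    ("Robert", "NNP"), ("Sanderson", "NNP"), ("works", "VBZ"),
    ("in", "IN"), ("Liverpool", "NNP"), (".", "."),
]
entities = proper_noun_entities(tagged)
```

A real rule-based system would combine many such rules with the ontologies and term processing mentioned above in order to type the entities (person, place, protein, and so on).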

CNI, 3rd April 2006 Slide 15

CAFETIERE Example

CNI, 3rd April 2006 Slide 16

TerMine

Automatic Term Recognition

University of Salford/Manchester

Discovers important terms

Assigns 'C-value' score to rank terms

Interaction with terminology databases for term management
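
The slide only names the 'C-value' score. As a rough guide to how it ranks terms, the sketch below follows the commonly published C-value definition: a candidate's frequency is weighted by the log of its length in words, and discounted by the average frequency of the longer candidates that contain it. This is a simplified illustration, not TerMine's implementation; the function names and sample frequencies are made up.

```python
import math


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_values(freq):
    """Score candidate terms; `freq` maps word tuples to corpus frequency."""
    scores = {}
    for term, f in freq.items():
        # Frequencies of longer candidates in which this term is nested.
        nesting = [f_b for b, f_b in freq.items() if contains(b, term)]
        if nesting:
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores


# Made-up frequencies for illustration only.
freq = {
    ("soft", "contact", "lens"): 4,
    ("contact", "lens"): 10,
}
scores = c_values(freq)
```

Note that with this weighting single-word candidates always score zero, since log2(1) = 0, and that the full published algorithm processes candidates longest first and adjusts nested counts more carefully than this sketch does.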

CNI, 3rd April 2006 Slide 17

TerMine Example

CNI, 3rd April 2006 Slide 18

U. Tokyo Tools

Natural Language Parsing

University of Tokyo

Tagger, Chunker, ENJU, GENIA

Necessary for any text mining application

Fast and accurate

http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/

http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/

CNI, 3rd April 2006 Slide 19

Tokyo Tools Example

CNI, 3rd April 2006 Slide 20

Tokyo Tools Example 2

CNI, 3rd April 2006 Slide 21

Service Infrastructure

NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (e.g. other data sets, ontologies, etc.)

Requirements:

Lots of processing power

Lots of storage capacity

Easily extensible/configurable service framework

Access to cutting edge TM, DM and IR tools

CNI, 3rd April 2006 Slide 22

Service Infrastructure

Processing provided by UK National Grid Service

Data Storage via SDSC's Storage Resource Broker

Important to store multiple versions of each document

Cheshire3 provides the Grid enabled information infrastructure

Plus information retrieval and data mining tools

Manchester and Tokyo provide the text mining tools

Stable tools integrated into Cheshire3 already

CNI, 3rd April 2006 Slide 23

Service Infrastructure

Initial NaCTeM services will be focused on the bio domain:

Bio-informatics is a growing field

Interest from both academic and corporate sectors

Large datasets/services available (MeSH, Medline, ...)

Web portal interaction

Then expand into other areas, such as Social Sciences and Historical text analysis.

CNI, 3rd April 2006 Slide 24

Future Work

Services for other domains

GUI Workflow configuration

Integration of user developed services and applications

Maximizing workflow potential with 'smart' components

Standardizing annotation schemas

Conference/Workshop

Other?

CNI, 3rd April 2006 Slide 25

Thank You

Questions?

...

Reception!