Post on 31-Mar-2015

Page 1: An Introduction to GATE Presented by Lin. What is GATE? Stands for General Architecture for Text Engineering. The theory behind GATE is SALE (Software

An Introduction to GATE

Presented by Lin Lin

What is GATE?

Stands for General Architecture for Text Engineering.

The theory behind GATE is SALE (Software Architecture for Language Engineering):
– computer processing of human language
– computer infrastructure for software development

Who Uses GATE?

Scientists performing experiments that involve processing human language

Developers developing applications with language processing components

Teachers and students of courses about language and language computation

How Can GATE Help?

Specify an architecture, or organizational structure, for language processing software

Provide a framework, or class library, that implements the architecture and can be used to embed language processing capabilities in diverse applications

Provide a development environment built on top of the framework made up of convenient graphical tools for developing components

What are GATE Components?

Reusable software chunks with well-defined interfaces

The same notion of a component is used in JavaBeans and Microsoft’s .NET

GATE as an architecture

Breaks down into three types of components:
– LanguageResources (LRs) represent entities such as lexicons, corpora, or ontologies;
– ProcessingResources (PRs) represent entities that are primarily algorithmic, such as parsers, generators, or n-gram modelers;
– VisualResources (VRs) represent visualization and editing components that participate in GUIs.

LRs: Corpora, Documents, and Annotations

A Corpus in GATE is a Java Set whose members are Documents.

Documents are modeled as content plus annotations plus features.

Annotations are organized in graphs, which are modeled as Java sets of Annotation.
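
As a rough illustration of this data model, the Java sketch below represents a document as content plus annotations plus features, and a corpus as a set of documents. The class names SketchAnnotation, SketchDocument and CorpusSketch are hypothetical stand-ins and not GATE's own classes; representing an annotation by a type and two character offsets is also an assumption made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT GATE's classes.
final class SketchAnnotation {
    final String type;                                    // e.g. "Token"
    final int start, end;                                 // character offsets into the content
    final Map<String, Object> features = new HashMap<>(); // per-annotation features

    SketchAnnotation(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class SketchDocument {
    final String content;                                          // the text itself
    final Set<SketchAnnotation> annotations = new HashSet<>();     // a Java set of annotations
    final Map<String, Object> features = new HashMap<>();          // document-level features

    SketchDocument(String content) { this.content = content; }

    /** Returns the piece of content that an annotation covers. */
    String textOf(SketchAnnotation a) { return content.substring(a.start, a.end); }
}

final class CorpusSketch {
    /** Builds a one-document corpus: a Java Set whose members are documents. */
    static Set<SketchDocument> buildCorpus() {
        SketchDocument doc = new SketchDocument("GATE processes text.");
        doc.features.put("language", "en");
        SketchAnnotation token = new SketchAnnotation("Token", 0, 4);
        token.features.put("kind", "word");
        doc.annotations.add(token);
        Set<SketchDocument> corpus = new HashSet<>();
        corpus.add(doc);
        return corpus;
    }

    public static void main(String[] args) {
        for (SketchDocument d : buildCorpus())
            for (SketchAnnotation a : d.annotations)
                System.out.println(a.type + " covers: " + d.textOf(a));
    }
}
```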

Document Processing in GATE

Documents:
– Supported formats include XML, RTF, email, HTML, SGML, and plain text.
– The format is identified and the document is converted into GATE annotation format.
– The document is processed by PRs.
– Results are stored in a serial data store (based on Java serialization) or as XML.

Built-in GATE Components

Resources for common LE data structures and algorithms, including documents, corpora and various annotation types

A set of language analysis components for Information Extraction (e.g. ANNIE)

A range of data visualization and editing components

Develop Language Processing Functionality using GATE

Done by programming, by developing Language Resources such as grammars that are used by existing Processing Resources, or by a mixture of both.

The development environment is used for:
– visualization of the data structures produced and consumed during processing
– debugging
– performance measurement

CREOLE

A Collection of REusable Objects for Language Engineering

The set of resources integrated with GATE

All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data.

PRs: ANNIE

A family of Processing Resources for language analysis included with GATE

Stands for A Nearly-New Information Extraction system.

Uses finite state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.

ANNIE IE Modules

ANNIE Components

Tokenizer

Gazetteer

Sentence Splitter

Part of Speech Tagger
– produces a part-of-speech tag as an annotation on each word or symbol.

Semantic Tagger

OrthoMatcher Coreference Module

ANNIE Component: Tokenizer

Token Types
– word, number, symbol, punctuation, and spaceToken.

A tokenizer rule has a left hand side and a right hand side.

Tokenizer Rule

Operations used on the LHS:
– | (or)
– * (0 or more occurrences)
– ? (0 or 1 occurrences)
– + (1 or more occurrences)

The RHS uses ’;’ as a separator, and has the following format:
{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
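
To make the RHS format concrete, here is a small Java sketch that splits an RHS string on ';' into its annotation type and its attribute=value pairs. The class RuleRhsSketch and its methods are hypothetical names chosen for this illustration; this is not how GATE itself parses rule files.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hand-written parser for the RHS format, not GATE's parser.
final class RuleRhsSketch {
    /** The annotation type is the first ';'-separated field. */
    static String annotationType(String rhs) {
        return rhs.split(";")[0].trim();
    }

    /** Every later field of the form attribute=value becomes one map entry. */
    static Map<String, String> attributes(String rhs) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String[] parts = rhs.split(";");
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2) attrs.put(kv[0].trim(), kv[1].trim());
        }
        return attrs;
    }

    public static void main(String[] args) {
        String rhs = "Token;orth=upperInitial;kind=word;";
        System.out.println(annotationType(rhs) + " " + attributes(rhs));
    }
}
```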

Example Tokenizer Rule

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

– The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
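
The effect of this rule can be approximated with a regular expression. The sketch below is an emulation using java.util.regex, not GATE's tokenizer: the class name UpperInitialRuleSketch is hypothetical, the Unicode classes \p{Lu} and \p{Ll} are assumed to correspond to UPPERCASE_LETTER and LOWERCASE_LETTER, and the lookarounds that stop a match from starting or ending inside a longer word are an added assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: emulates one rule with java.util.regex, not GATE's tokenizer.
final class UpperInitialRuleSketch {
    // LHS: one uppercase letter followed by zero or more lowercase letters.
    // The lookarounds (an assumption) reject matches that are part of a longer word.
    private static final Pattern LHS =
        Pattern.compile("(?<!\\p{L})\\p{Lu}\\p{Ll}*(?!\\p{L})");

    // RHS: the annotation type and attributes the rule assigns to each match.
    static final String RHS = "Token;orth=upperInitial;kind=word;";

    /** Returns "start-end:text" for every span the rule would annotate. */
    static List<String> apply(String text) {
        List<String> spans = new ArrayList<>();
        Matcher m = LHS.matcher(text);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return spans;
    }

    public static void main(String[] args) {
        for (String span : apply("Lin uses GATE in Sheffield")) {
            System.out.println(span + " > " + RHS);
        }
    }
}
```

In this sketch an all-capitals word such as "GATE" is not matched, because the rule only allows lowercase letters after the initial capital; a real rule set would need a separate rule for that case.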

ANNIE Component: Gazetteer

The gazetteer lists used are plain text files, with one entry per line.

Each list represents a set of names, such as names of cities, organizations, days of the week, etc.

Example Gazetteer List

A small section of the list for units of currency:
……
Ecu
European Currency Units
FFr
Fr
German mark
German marks
New Taiwan dollar
New Taiwan dollars
NT dollar
NT dollars
……
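
A list like this can be used for simple lookup. The Java sketch below is a naive illustration, not GATE's gazetteer implementation: the class GazetteerSketch is a hypothetical name, the entries are the ones shown in the excerpt, and the longest-match and whole-word policies are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a naive longest-match lookup, not GATE's gazetteer.
final class GazetteerSketch {
    // Entries taken from the currency-unit excerpt; in a list file each is one line.
    static final List<String> CURRENCY_UNITS = Arrays.asList(
        "Ecu", "European Currency Units", "FFr", "Fr", "German mark", "German marks",
        "New Taiwan dollar", "New Taiwan dollars", "NT dollar", "NT dollars");

    /** True if position idx does not fall inside a word (assumed boundary rule). */
    private static boolean wordEdge(String text, int idx) {
        boolean before = idx > 0 && Character.isLetterOrDigit(text.charAt(idx - 1));
        boolean after = idx < text.length() && Character.isLetterOrDigit(text.charAt(idx));
        return !(before && after);
    }

    /** Scans left to right, keeping the longest whole-word entry at each position. */
    static List<String> lookup(String text) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String best = null;
            for (String entry : CURRENCY_UNITS) {
                boolean matches = text.startsWith(entry, i)
                    && wordEdge(text, i) && wordEdge(text, i + entry.length());
                if (matches && (best == null || entry.length() > best.length())) {
                    best = entry;
                }
            }
            if (best != null) {
                found.add(best);
                i += best.length();
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(lookup("It cost 50 German marks or 900 NT dollars."));
    }
}
```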

ANNIE Component: Semantic Tagger

Based on the JAPE language, which contains rules that act on annotations assigned in earlier phases.

Produces outputs of annotated entities.

ANNIE Component: Sentence Splitter

Segments the text into sentences.

This module is required for the tagger.

The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
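
The role of the abbreviation list can be shown with a deliberately simple Java sketch. It is not the ANNIE splitter: the class SentenceSplitterSketch is a hypothetical name, the four abbreviations are invented for the example (the real list is not shown in these slides), and splitting on whitespace-separated words is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy splitter, not the ANNIE sentence splitter.
final class SentenceSplitterSketch {
    // Hypothetical abbreviation list, invented for this example.
    static final Set<String> ABBREVIATIONS =
        new HashSet<>(Arrays.asList("Dr.", "Mr.", "e.g.", "etc."));

    /** Ends a sentence at '.', '!' or '?' unless the word is a listed abbreviation. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0) current.append(' ');
            current.append(word);
            boolean endsWithStop =
                word.endsWith(".") || word.endsWith("!") || word.endsWith("?");
            if (endsWithStop && !ABBREVIATIONS.contains(word)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Dr. Smith uses GATE. It runs ANNIE.")) {
            System.out.println(s);
        }
    }
}
```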

ANNIE Component: OrthoMatcher

Adds identity relations between named entities found by the semantic tagger, in order to perform coreference.

Does not find new named entities, but it may assign a type to an unclassified proper name.

Create a New Resource

Write a Java class that implements GATE’s beans model.

Compile the class, and any others that it uses, into a Java Archive (JAR) file.

Write some XML configuration data for the new resource.

Tell GATE the URL of the new JAR and XML files.

Example: Create a New Component Called GoldFish

GoldFish:
– Is a processing resource
– Looks for all instances of the word “fish” in the document
– Adds an annotation of type “GoldFish”
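
The core logic of such a component can be sketched in plain Java. This is not the class the BootStrap Wizard generates and it does not use the GATE API: GoldFishSketch and Span are hypothetical names, and matching "fish" case-sensitively as a substring (so that it is also found inside "goldfish") is an assumption, since the slide does not say how instances are matched.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, not the code generated by the BootStrap Wizard.
final class GoldFishSketch {
    /** Minimal stand-in for an annotation: a type plus start/end character offsets. */
    static final class Span {
        final String type;
        final int start, end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Finds every occurrence of "fish" and records a "GoldFish" span for it. */
    static List<Span> annotate(String content) {
        List<Span> spans = new ArrayList<>();
        String word = "fish";
        int from = content.indexOf(word);
        while (from >= 0) {
            spans.add(new Span("GoldFish", from, from + word.length()));
            from = content.indexOf(word, from + word.length());
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : annotate("One fish, two fish, goldfish")) {
            System.out.println(s.type + " " + s.start + "-" + s.end);
        }
    }
}
```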

Example: Create GoldFish Using BootStrap Wizard

GoldFish: default files created

The default Java code created for the GoldFish resource looks like:
– GoldFish.java

The default XML configuration for GoldFish looks like:
– resource.xml

Create an Application with PRs

Applications model a control strategy for the execution of PRs.

Currently only pipeline execution is supported.
– Simple pipelines: group a set of PRs together in order and execute them in turn.
– Corpus pipelines: open each document in the corpus in turn, set that document as a runtime parameter on each PR, run all the PRs on the corpus, then close the document.
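
The two control strategies can be sketched in a few lines of Java. The types below are hypothetical and are not GATE's controller classes: Resource stands in for a PR, a StringBuilder stands in for a document, and resetting the document reference to null stands in for closing it.

```java
import java.util.List;

// Illustrative only: hypothetical types, not GATE's controller classes.
final class PipelineSketch {
    /** Stand-in for a PR: receives the current document as a runtime parameter, then executes. */
    interface Resource {
        void setDocument(StringBuilder document);
        void execute();
    }

    /** Simple pipeline: execute the PRs in the order given. */
    static void runSimple(List<Resource> prs) {
        for (Resource pr : prs) pr.execute();
    }

    /** Corpus pipeline: for each document, set it on every PR, run them all, release it. */
    static void runCorpus(List<Resource> prs, List<StringBuilder> corpus) {
        for (StringBuilder doc : corpus) {
            for (Resource pr : prs) pr.setDocument(doc);
            runSimple(prs);
            for (Resource pr : prs) pr.setDocument(null); // stands in for closing the document
        }
    }

    /** Example PR that appends a tag to the document, making the execution order visible. */
    static Resource marker(final String tag) {
        return new Resource() {
            private StringBuilder doc;
            public void setDocument(StringBuilder d) { doc = d; }
            public void execute() { doc.append(tag); }
        };
    }

    public static void main(String[] args) {
        List<StringBuilder> corpus =
            java.util.Arrays.asList(new StringBuilder("a"), new StringBuilder("b"));
        runCorpus(java.util.Arrays.asList(marker("[tok]"), marker("[gaz]")), corpus);
        System.out.println(corpus);
    }
}
```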

Additional Facilities

JAPE
– a Java Annotation Patterns Engine, which provides regular-expression based pattern/action rules over annotations.
– The file “Main.jape” contains a list of the grammars to be used for Named Entity Recognition, in the correct processing order.
– Used in ANNIE.

Additional Facilities

The ‘annotation diff’ tool in the development environment
– implements performance metrics such as precision and recall for comparing annotations.

GUK (the GATE Unicode Kit)
– fills in some of the gaps in the JDK’s support for Unicode.
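
For the precision and recall metrics mentioned for the annotation diff tool, the Java sketch below shows the standard definitions: precision is the fraction of response annotations that are correct, and recall is the fraction of key annotations that were found. It is not the tool's implementation: AnnotationDiffSketch is a hypothetical name, annotations are encoded as "type:start-end" strings for the example, and only exact matches count as correct (partial matches are ignored).

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: exact-match scoring, not GATE's annotation diff tool.
final class AnnotationDiffSketch {
    /** Number of annotations present in both the key (gold) set and the response set. */
    static int correct(Set<String> key, Set<String> response) {
        Set<String> both = new HashSet<>(key);
        both.retainAll(response);
        return both.size();
    }

    /** Precision = correct / number of response annotations. */
    static double precision(Set<String> key, Set<String> response) {
        return response.isEmpty() ? 0.0 : (double) correct(key, response) / response.size();
    }

    /** Recall = correct / number of key annotations. */
    static double recall(Set<String> key, Set<String> response) {
        return key.isEmpty() ? 0.0 : (double) correct(key, response) / key.size();
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(java.util.Arrays.asList(
            "Person:0-3", "Location:17-26", "Organization:9-13"));
        Set<String> response = new HashSet<>(java.util.Arrays.asList(
            "Person:0-3", "Location:17-26", "Date:30-34", "Money:40-44"));
        System.out.println("precision=" + precision(key, response)
            + " recall=" + recall(key, response));
    }
}
```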

Embedding ANNIE

Create a standalone ANNIE extraction system.

Example code that will embed ANNIE in an application that takes URLs as inputs and produces named entities as outputs.

Additional Features

Add support for a new document format

Create a new annotation schema

Write your own algorithm to dump results to file

Work with Unicode

Work with Oracle and PostgreSQL

Other VRs That Can Be Used in GATE

Ontogazetteer
– makes ontologies “visible” in GATE.

Protégé
– makes use of developed Protégé ontologies in GATE, and also takes advantage of being able to read different format ontology files in Protégé.

Link to GATE web page

http://gate.ac.uk

Documentation and download

GATE Demo

GATE graphical development environment

Do information extraction with ANNIE

Create and run an application.....