
Machine Learning in GATE

Valentin Tablan


Machine Learning in GATE

• Uses classification: [Attr1, Attr2, Attr3, … Attrn] → Class

• Classifies annotations. (Documents can be classified as well, using a simple trick.)

• Annotations of a particular type are selected as instances.

• Attributes refer to instance annotations.

• Attributes have a position relative to the instance annotation they refer to.
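The idea of attributes positioned relative to an instance can be sketched as follows. This is only an illustration: the class `RelativeAttributes`, the method `collect` and the use of plain strings in place of annotations are assumptions, not GATE code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GATE: builds the attribute vector for one instance.
class RelativeAttributes {
    // values  : one value per instance annotation, in document order
    //           (for example the POS category of each Token)
    // i       : index of the instance being described
    // offsets : relative positions, e.g. -1 = previous instance, 0 = the instance itself
    // Offsets that fall outside the sequence yield null (treated here as a missing value).
    static List<String> collect(List<String> values, int i, int[] offsets) {
        List<String> out = new ArrayList<String>();
        for (int off : offsets) {
            int j = i + off;
            out.add(j >= 0 && j < values.size() ? values.get(j) : null);
        }
        return out;
    }
}
```

With the sequence DT, NN, VBZ and offsets -1, 0, 1, the instance at index 1 is described by the vector [DT, NN, VBZ].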


Attributes

Attributes can be:

– Boolean: the presence [or absence] of an annotation of a particular type [partially] overlapping the referred instance annotation.

– Nominal: the value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.

– Numeric: the numeric value (converted from String) of a particular feature of the referred instance annotation.
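The three attribute kinds can be illustrated with a small sketch. The class and method names are hypothetical. Returning -1 for an undeclared nominal value is an assumption made for the example, not documented GATE behaviour.

```java
import java.util.List;

// Hypothetical illustration of the three attribute kinds; not GATE code.
class AttributeValues {
    // Boolean: true when an annotation of the wanted type overlaps the instance.
    // overlappingTypes stands in for the types of all annotations overlapping it.
    static boolean booleanValue(List<String> overlappingTypes, String wantedType) {
        return overlappingTypes.contains(wantedType);
    }

    // Nominal: the value must be one of the values declared up front.
    // Returns its index in the declared list, or -1 if it was not declared (assumption).
    static int nominalValue(List<String> declaredValues, String value) {
        return declaredValues.indexOf(value);
    }

    // Numeric: the feature value is stored as a String and converted to a number.
    static double numericValue(String value) {
        return Double.parseDouble(value.trim());
    }
}
```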


Implementation

Machine Learning PR in GATE.

Has two functioning modes:

– training

– application

Uses an XML file for configuration:

    <?xml version="1.0" encoding="windows-1252"?>
    <ML-CONFIG>
      <DATASET> … </DATASET>
      <ENGINE> … </ENGINE>
    </ML-CONFIG>


<DATASET>

    <DATASET>
      <INSTANCE-TYPE>Token</INSTANCE-TYPE>
      <ATTRIBUTE>
        <NAME>POS_category(0)</NAME>
        <TYPE>Token</TYPE>
        <FEATURE>category</FEATURE>
        <POSITION>0</POSITION>
        <VALUES>
          <VALUE>NN</VALUE>
          <VALUE>NNP</VALUE>
          <VALUE>NNPS</VALUE>
          …
        </VALUES>
        [<CLASS/>]
      </ATTRIBUTE>
      …
    </DATASET>
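To make the structure of the dataset definition concrete, here is a minimal sketch that reads such a fragment with the standard Java DOM parser. It is not how GATE itself loads the file. The class `DatasetConfigReader`, the sample XML and the attribute name `POS_category(-1)` are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for a <DATASET> fragment; illustration only.
class DatasetConfigReader {
    // A small, complete sample: one context attribute and one class attribute.
    static final String SAMPLE =
        "<DATASET>"
        + "<INSTANCE-TYPE>Token</INSTANCE-TYPE>"
        + "<ATTRIBUTE><NAME>POS_category(-1)</NAME><TYPE>Token</TYPE>"
        + "<FEATURE>category</FEATURE><POSITION>-1</POSITION>"
        + "<VALUES><VALUE>NN</VALUE><VALUE>NNP</VALUE></VALUES></ATTRIBUTE>"
        + "<ATTRIBUTE><NAME>POS_category(0)</NAME><TYPE>Token</TYPE>"
        + "<FEATURE>category</FEATURE><POSITION>0</POSITION>"
        + "<VALUES><VALUE>NN</VALUE><VALUE>NNP</VALUE></VALUES><CLASS/></ATTRIBUTE>"
        + "</DATASET>";

    // Text of the first descendant element with the given tag, or null if absent.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Summarises the instance type and each attribute; marks the class attribute.
    static String describe(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            StringBuilder sb = new StringBuilder("instances=" + text(root, "INSTANCE-TYPE"));
            NodeList attrs = root.getElementsByTagName("ATTRIBUTE");
            for (int i = 0; i < attrs.getLength(); i++) {
                Element a = (Element) attrs.item(i);
                boolean isClass = a.getElementsByTagName("CLASS").getLength() > 0;
                sb.append("; ").append(text(a, "NAME"))
                  .append(" pos=").append(text(a, "POSITION"))
                  .append(isClass ? " [class]" : "");
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException("could not read dataset definition", e);
        }
    }
}
```

The attribute carrying the empty `<CLASS/>` element is the one to be predicted. All others serve as inputs.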


<ENGINE>

    <ENGINE>
      <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
      <OPTIONS>
        <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
        <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
        <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
      </OPTIONS>
    </ENGINE>
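The slides do not say how the confidence threshold is applied. One plausible reading, sketched below purely as an assumption, is that a prediction is kept only when the classifier's probability for the winning class reaches the threshold. `ConfidenceFilter` and `accept` are hypothetical names, not part of GATE or WEKA.

```java
// Hypothetical illustration of a confidence threshold; not GATE or WEKA code.
class ConfidenceFilter {
    // labels[i] has estimated probability probabilities[i].
    // Returns the most probable label, or null when its probability
    // is below the threshold (the prediction is then discarded).
    static String accept(String[] labels, double[] probabilities, double threshold) {
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return probabilities[best] >= threshold ? labels[best] : null;
    }
}
```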


Attributes Position

Instance type: Token


Machine Learning PR

• Can save a learnt model to an external file for later use. Saves the actual model and the collected dataset.

• Can export the collected dataset in .arff format.
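For readers unfamiliar with .arff, the sketch below writes a header and data rows for nominal attributes in WEKA's ARFF text format. It is a hand-rolled illustration, not the PR's exporter. `ArffSketch`, `header` and `row` are hypothetical names, and quoting every attribute name is a precaution taken here, not something the slides specify.

```java
// Hypothetical ARFF writer for nominal attributes; illustration only.
class ArffSketch {
    // Builds the header: relation name, one nominal attribute per entry, then @data.
    static String header(String relation, String[] names, String[][] values) {
        StringBuilder sb = new StringBuilder("@relation " + relation + "\n\n");
        for (int i = 0; i < names.length; i++) {
            sb.append("@attribute '").append(names[i]).append("' {")
              .append(String.join(",", values[i])).append("}\n");
        }
        return sb.append("\n@data\n").toString();
    }

    // Builds one data row; a null value is written as '?', ARFF's missing-value marker.
    static String row(String[] instance) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < instance.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(instance[i] == null ? "?" : instance[i]);
        }
        return sb.toString();
    }
}
```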


Standard Use Scenario

Training

• Prepare training data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).

• Run the ML PR in training mode.

• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.

• Update the configuration file accordingly.

• Run the ML PR again to collect the actual data.

• [ Save the learnt model. ]

Application

• Prepare data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).

• [ Load the previously saved model. ]

• Run the ML PR in application mode.

• [ Save the learnt model. ]


An Example

Learn POS category from POS context.


Using Other ML Libraries

The MLEngine Interface

Method Summary

• void addTrainingInstance(List attributes)
  Adds a new training instance to the dataset.

• Object classifyInstance(List attributes)
  Classifies a new instance.

• void init()
  This method will be called after an engine is created and has its dataset and options set.

• void setDatasetDefinition(DatasetDefintion definition)
  Sets the definition for the dataset used.

• void setOptions(org.jdom.Element options)
  Sets the options from an XML JDom element.

• void setOwnerPR(ProcessingResource pr)
  Registers the PR using the engine with the engine.
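To give a feel for what an engine has to do, here is a toy sketch built around a stand-in interface. `SketchEngine` mirrors only three of the methods listed above and is not the GATE `MLEngine` interface. It uses `List<?>` instead of the raw `List` and leaves out the dataset, options and owner setters so that the example has no GATE or JDOM dependency. It also assumes that the class value is the last element of each training instance. That is a simplification for the example, not how the dataset definition necessarily identifies the class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for three of the methods in the summary above; NOT the GATE interface.
interface SketchEngine {
    void init();
    void addTrainingInstance(List<?> attributes);
    Object classifyInstance(List<?> attributes);
}

// Toy engine: remembers which class was seen most often for each attribute vector
// and falls back to the overall most frequent class for unseen vectors.
class MemorisingEngine implements SketchEngine {
    private Map<List<Object>, Map<Object, Integer>> counts;
    private Map<Object, Integer> overall;

    public void init() {
        counts = new HashMap<List<Object>, Map<Object, Integer>>();
        overall = new HashMap<Object, Integer>();
    }

    // Assumption: the last element of the list is the class value.
    public void addTrainingInstance(List<?> attributes) {
        int last = attributes.size() - 1;
        List<Object> key = new ArrayList<Object>(attributes.subList(0, last));
        Object label = attributes.get(last);
        Map<Object, Integer> perKey = counts.get(key);
        if (perKey == null) {
            perKey = new HashMap<Object, Integer>();
            counts.put(key, perKey);
        }
        perKey.merge(label, 1, Integer::sum);
        overall.merge(label, 1, Integer::sum);
    }

    // Here the list holds only the input attributes (no class value).
    public Object classifyInstance(List<?> attributes) {
        Map<Object, Integer> seen = counts.get(new ArrayList<Object>(attributes));
        return mostFrequent(seen != null ? seen : overall);
    }

    // Label with the highest count, or null if nothing has been seen yet.
    private static Object mostFrequent(Map<Object, Integer> m) {
        Object best = null;
        int bestCount = 0;
        for (Map.Entry<Object, Integer> e : m.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

A wrapper around a real library would do the same bookkeeping but hand the collected instances to that library's classifier instead of counting them.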