Machine Learning in GATE
Valentin Tablan
Machine Learning in GATE
• Uses classification: [Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations. (Documents can be classified as well, using a simple trick.)
• Annotations of a particular type are selected as instances.
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance annotation they refer to.
Attributes
Attributes can be:
– Boolean
  The presence (or absence) of an annotation of a particular type [partially] overlapping the referred instance annotation.
– Nominal
  The value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.
– Numeric
  The numeric value (converted from String) of a particular feature of the referred instance annotation.
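The three attribute kinds can be illustrated with a minimal Java sketch. The `Ann` class is a hypothetical stand-in for an annotation (it is not GATE's own annotation class), and the choice to throw on an undeclared nominal value is an assumption of this sketch, not documented behaviour of the ML PR.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an annotation: a type, a character span, and string features.
final class Ann {
    final String type;
    final long start;
    final long end;
    final Map<String, String> features;

    Ann(String type, long start, long end, Map<String, String> features) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.features = features;
    }

    // True if the two spans share at least one character (partial overlap counts).
    boolean overlaps(Ann other) {
        return start < other.end && other.start < end;
    }
}

final class AttributeValues {
    // Boolean attribute: is there an annotation of the given type overlapping the instance?
    static boolean booleanValue(Ann instance, List<Ann> all, String type) {
        for (Ann a : all) {
            if (a.type.equals(type) && a.overlaps(instance)) {
                return true;
            }
        }
        return false;
    }

    // Nominal attribute: the feature value, which must be one of the values declared up front.
    // Throwing on an undeclared value is an assumption made for this sketch.
    static String nominalValue(Ann instance, String feature, List<String> allowed) {
        String value = instance.features.get(feature);
        if (value == null || !allowed.contains(value)) {
            throw new IllegalArgumentException("Undeclared nominal value: " + value);
        }
        return value;
    }

    // Numeric attribute: the feature value converted from String to a number.
    static double numericValue(Ann instance, String feature) {
        return Double.parseDouble(instance.features.get(feature));
    }
}
```

For a `Token` instance with features `category=NNP` and `length=5`, the nominal value of `category` is `NNP` and the numeric value of `length` is `5.0`.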
Implementation
Machine Learning PR in GATE.
Has two functioning modes:
– training
– application
Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> … </DATASET>
  <ENGINE> … </ENGINE>
</ML-CONFIG>
<DATASET>
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      <VALUE>NNPS</VALUE>
      …
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  …
</DATASET>
<ENGINE>
<ENGINE>
  <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
    <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
    <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  </OPTIONS>
</ENGINE>
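To make the structure of the configuration file concrete, the sketch below reads an ML-CONFIG document with the standard Java DOM API and pulls out the instance type, the attribute names, the class attribute and the engine options. It is only an illustration: the `MlConfigReader` class and its method names are hypothetical, and this is not how the ML PR itself loads its configuration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting an ML-CONFIG document; not part of GATE.
final class MlConfigReader {
    // Parses the XML text; checked parser exceptions are wrapped so callers stay simple.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse configuration", e);
        }
    }

    // Text of the first descendant element with the given tag, or null if there is none.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    // Names of all <ATTRIBUTE> elements, in document order.
    static List<String> attributeNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList attrs = doc.getElementsByTagName("ATTRIBUTE");
        for (int i = 0; i < attrs.getLength(); i++) {
            names.add(text((Element) attrs.item(i), "NAME"));
        }
        return names;
    }

    // Name of the attribute marked with <CLASS/>, or null if none is marked.
    static String classAttributeName(Document doc) {
        NodeList attrs = doc.getElementsByTagName("ATTRIBUTE");
        for (int i = 0; i < attrs.getLength(); i++) {
            Element attr = (Element) attrs.item(i);
            if (attr.getElementsByTagName("CLASS").getLength() > 0) {
                return text(attr, "NAME");
            }
        }
        return null;
    }
}
```

With a parsed document `doc`, `MlConfigReader.text(doc.getDocumentElement(), "CLASSIFIER")` returns the classifier class name, and `Double.parseDouble(MlConfigReader.text(doc.getDocumentElement(), "CONFIDENCE-THRESHOLD"))` returns the threshold as a number.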
Attribute Positions
Instance type: Token
Machine Learning PR
• Can save a learnt model to an external file for later use. Saves the actual model and the collected dataset.
• Can export the collected dataset in .arff format.
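For readers unfamiliar with .arff, it is WEKA's plain-text dataset format: a `@relation` line, one `@attribute` line per attribute (nominal values in braces, or `numeric`), then a `@data` section with one comma-separated row per instance. The sketch below writes such a file from in-memory rows. The `ArffSketch` class is hypothetical and deliberately simplified (values are not escaped, attribute names are always single-quoted); it is not the ML PR's exporter.

```java
import java.util.List;

// Hypothetical, simplified ARFF writer; values containing commas or quotes are not handled.
final class ArffSketch {
    // nominalValues.get(i) lists the allowed values of attribute i;
    // an empty list marks attribute i as numeric.
    static String toArff(String relation,
                         List<String> names,
                         List<List<String>> nominalValues,
                         List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int i = 0; i < names.size(); i++) {
            List<String> values = nominalValues.get(i);
            sb.append("@attribute '").append(names.get(i)).append("' ");
            if (values.isEmpty()) {
                sb.append("numeric");
            } else {
                sb.append('{').append(String.join(",", values)).append('}');
            }
            sb.append('\n');
        }
        sb.append("\n@data\n");
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }
}
```

By WEKA's default convention the last attribute is treated as the class, which is why the class column is placed last in the rows passed to this sketch.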
Standard Use Scenario
Training
• Prepare training data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
• Update the configuration file accordingly.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]
Application
• Prepare data by enriching the documents with annotations for attributes (e.g. run Tokeniser, POS tagger, Gazetteer, etc.).
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]
An Example
Learn POS category from POS context.
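One way to picture this example: every Token is an instance, the attributes are the POS categories of the tokens at relative positions such as -1 and +1, and the class is the POS category at position 0. The sketch below builds such instances from a plain list of tags. The `PosContextSketch` class and the `NONE` padding value for positions that fall outside the sequence are assumptions made for illustration, not behaviour taken from the ML PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "POS category from POS context".
final class PosContextSketch {
    // Placeholder for context positions before the first or after the last token (an assumption).
    static final String PAD = "NONE";

    // For each token, emits one row: the tags at the given relative positions,
    // followed by the token's own tag as the class (last column).
    static List<List<String>> buildInstances(List<String> tags, int[] contextPositions) {
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < tags.size(); i++) {
            List<String> row = new ArrayList<>();
            for (int offset : contextPositions) {
                int j = i + offset;
                row.add(j >= 0 && j < tags.size() ? tags.get(j) : PAD);
            }
            row.add(tags.get(i));
            rows.add(row);
        }
        return rows;
    }
}
```

For the tag sequence `DT NN VBZ` with context positions -1 and +1, the middle token yields the row `DT, VBZ, NN`: previous tag, next tag, then the class.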
Using Other ML Libraries
The MLEngine Interface
Method Summary
• void addTrainingInstance(List attributes)
  Adds a new training instance to the dataset.
• Object classifyInstance(List attributes)
  Classifies a new instance.
• void init()
  This method will be called after an engine is created and has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition)
  Sets the definition for the dataset used.
• void setOptions(org.jdom.Element options)
  Sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr)
  Registers the PR using the engine with the engine.
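As a rough idea of what plugging in another learner involves, the sketch below implements a trivial majority-class baseline against a stand-in interface. `EngineSketch` is hypothetical: it mirrors only `addTrainingInstance`, `classifyInstance` and `init` from the method summary, and leaves out `setDatasetDefinition`, `setOptions` and `setOwnerPR` because those need GATE and JDOM types. Passing the class column index to the constructor is also an assumption; a real engine would presumably take it from the dataset definition.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in mirroring three of the MLEngine methods; not the GATE interface itself.
interface EngineSketch {
    void addTrainingInstance(List<Object> attributes);
    Object classifyInstance(List<Object> attributes);
    void init();
}

// Baseline learner: always predicts the class seen most often during training.
final class MajorityClassEngine implements EngineSketch {
    private final int classIndex; // position of the class value in each attribute list (assumed)
    private final Map<Object, Integer> counts = new HashMap<>();

    MajorityClassEngine(int classIndex) {
        this.classIndex = classIndex;
    }

    // Resets any previously collected statistics.
    @Override
    public void init() {
        counts.clear();
    }

    // Records the class value of one training instance.
    @Override
    public void addTrainingInstance(List<Object> attributes) {
        counts.merge(attributes.get(classIndex), 1, Integer::sum);
    }

    // Ignores the attributes and returns the most frequent class, or null if untrained.
    @Override
    public Object classifyInstance(List<Object> attributes) {
        Object best = null;
        int bestCount = 0;
        for (Map.Entry<Object, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

A baseline like this is mainly useful as a sanity check: any real classifier configured in the ENGINE section should beat it on the same dataset.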