
Extracting Why Text Segment from Web Based on Grammar-gram

Iulia Nagy, Master student, 2010-02-27


Summary
- Introduction
- Related work
  - Rule Based Methods
  - Machine Learning Approach
- "Bag of Function Words" method
  - Method outline
  - Adaptation of "Bag of Function Words" to English
- Experiments and Evaluation
- Conclusion and Remarks


Problem
The tremendous growth of the Internet makes information hard to find.


Solution: create a QA system
- capable of giving an exact answer to an exact question
- able to detect the answer in arbitrary corpora

Purpose: obtain viable information rapidly


Purpose of our research
- Create a why-QA system with an automatically built classifier
- Classifier: use a model presented in the Japanese literature, created using machine learning based on the Bag of Grammar approach

Purpose of this paper
- adapt the Japanese method to English
- test the effectiveness of the method on English


Related work

Two main trends:

Rule Based methods
1. Preprocess text
2. Detect patterns
3. Create a set of rules
4. Apply rules to identify the why-answer in text

Machine Learning methods
1. Preprocess text
2. Identify and extract relevant features
3. Create a classification scheme
4. Classify


Rule based in why-QA

Suzanne Verberne's approach: improve performance by re-ranking
Method: weight the score assigned to a QA-pair by QAP with a number of syntactic features.

Pros:
- Importance of syntax
- Effective

Cons:
- Hardly adaptable to various languages
- Requires deep grammar knowledge
- Labour intensive


Machine Learning method

Higashinaka and Isozaki's approach: acquire causal expressions from the Japanese EDR dictionary
Method: train a ranker based on clause structures extracted from EDR

Pros:
- Partially automated
- Effective

Cons:
- Hardly adaptable to various languages
- Not fully automated: based on EDR
- EDR rather high priced


Machine Learning method

Tanaka's approach: build a why-classifier with function words as features
Method: Bag of function words

Pros:
- Adaptable to different languages
- Domain independent
- Scalable
- Effective
- Fully automated


Bag of function words
Machine learning approach to automatically build a domain-independent why-classifier based on function words.

Conditions to obtain domain independence (fulfilled by the class of function words):
- Convergence and reasonable size of the feature space
- Generality of features in the feature space
- Ability of features to discriminate causality


Bag of function words
Method: same baseline for Japanese and English.

Pipeline over text segments Ts 1, Ts 2, …, Ts n:
1. Tag: label all words with a POS tagger
2. Extract function words: determine the POS of function words
3. Create the feature space, e.g.: for, because, at, after, in, under, which, that, why, to, therefore
4. Create feature vectors Fv 1, Fv 2, …, Fv n by mapping with "tf-idf" on function words
5. Trainer: LogitBoost with weak learners produces the classification scheme
6. Classify

Vectors' format: {(x⃗_i, y_i)}
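The feature mapping in the pipeline above can be sketched in a few lines. This is an illustrative reconstruction only: it assumes pre-tokenised segments and the eleven function words listed on the slide, and uses standard tf and idf definitions, which are not necessarily the exact variants used by Tanaka.

```python
import math

# Feature space from the slide: a fixed list of English function words.
FUNCTION_WORDS = ["for", "because", "at", "after", "in", "under",
                  "which", "that", "why", "to", "therefore"]

def tf_idf_vectors(segments):
    """Map each tokenised text segment to a tf-idf vector over FUNCTION_WORDS."""
    n = len(segments)
    # document frequency: in how many segments each function word appears
    df = {w: sum(1 for s in segments if w in s) for w in FUNCTION_WORDS}
    vectors = []
    for seg in segments:
        vec = []
        for w in FUNCTION_WORDS:
            tf = seg.count(w) / len(seg) if seg else 0.0
            idf = math.log(n / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Toy example: one why-style segment, one definition-style segment.
segments = [
    "the server failed because the disk was full".split(),
    "a queue is a structure that stores items in order".split(),
]
fvs = tf_idf_vectors(segments)
```

Each resulting vector x⃗_i, paired with its label y_i (why / non-why), forms one training instance for the boosting step.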


Adaptation to English

Differences:
- Japanese: forms phrases by adding new words at the end of the phrase; uses particles to define syntactic roles in a phrase
- English: forms phrases by adding new words at the beginning of the phrase; words do not belong to only one grammatical category

Adjustment: identify eligible function words in English
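One plausible way to carry out that adjustment is to filter a POS-tagged corpus for closed-class tags. This is a hedged sketch: the Penn Treebank tag list used as the filter is my assumption, not the authors' exact criterion.

```python
# Penn Treebank tags for closed word classes (prepositions, determiners,
# conjunctions, wh-words, modals). Assumed filter, not the paper's own list.
FUNCTION_POS = {"IN", "DT", "CC", "TO", "WDT", "WP", "WRB", "MD"}

def eligible_function_words(tagged_tokens):
    """Return the set of words whose POS tag marks a closed word class."""
    return {word.lower() for word, tag in tagged_tokens
            if tag in FUNCTION_POS}

# Toy POS-tagged sentence (tags hand-assigned for the example).
tagged = [("The", "DT"), ("engine", "NN"), ("failed", "VBD"),
          ("because", "IN"), ("of", "IN"), ("overheating", "NN")]
```

Running `eligible_function_words(tagged)` keeps "the", "because" and "of" and discards the content words, which is exactly the separation the English adaptation needs.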


Experiment Data

Dataset: 432 text segments (216 why-answers, 216 definitions)

Processing:
- Label all words with POS and extract function words
- Calculate tf-idf for each function word
- Map features from the feature set into feature vectors


Experiment Classifier

- Used LogitBoost (Weka) with Decision Stump
- Created 5 classifiers (50, 100, 150, 200, 250 iterations)

Evaluation:
- 10-fold cross validation
- Models trained on 9 folds and tested on 1
- Measured precision, recall and F-measure
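The deck runs the evaluation inside Weka; purely as an illustration of the protocol, here is a minimal stand-alone sketch of the 10-fold split and of the precision/recall/F-measure computation, assuming gold and predicted labels are available (the LogitBoost classifier itself is not reproduced here).

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle indices 0..n-1 and split them into 10 disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def precision_recall_f(gold, pred, positive="why"):
    """Precision, recall and F-measure for one class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

For each of the 10 rounds, one fold serves as the test set and the other nine as training data, and the three scores are averaged over rounds.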


Results – why text segments

[Chart: evaluation of classifiers for why-TS — precision, recall and F-measure (0.62–0.78) against number of iterations (50, 100, 150, 200, 250)]


Results – non why text segments (NWTS)

[Chart: evaluation of classifiers for NWTS — precision, recall and F-measure (0.66–0.80) against number of iterations (50, 100, 150, 200, 250)]


Conclusion

Results:
- 321 instances out of 432 correctly classified
- 76.1% precision and 70.6% recall on WTS
- 72.6% precision and 77.9% recall on NWTS

[Chart: global results — precision, recall and F-measure (0.65–0.80) by type of TS (WTS, NWTS)]

Method effective on English
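As a quick arithmetic check, the overall accuracy and the F-measures implied by the reported precision/recall pairs can be computed directly from the figures above:

```python
# 321 of 432 instances correctly classified.
accuracy = 321 / 432

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

f_wts = f_measure(0.761, 0.706)   # why text segments
f_nwts = f_measure(0.726, 0.779)  # non-why text segments
```

This gives an accuracy of roughly 74.3% and F-measures of about 0.73 (WTS) and 0.75 (NWTS), consistent with the reported precision/recall pairs.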


Future works

- Experiment with an increased dataset (> 5000); use the Yahoo!Answers database to extract the dataset
- Identify the optimal number of iterations
- Make a better selection of the function words to be used
- Include causative constructions in the analysis: English often expresses cause by a closed set of verbs or nouns
- Increase the accuracy of the classifier


Questions and remarks

Thank you for your attention !
