Post on 24-Feb-2016
Extracting Why Text Segment from Web Based on Grammar-gram
Iulia Nagy, Master student, 2010-02-27
www.***.com
Summary
Introduction
Related work
  Rule-based methods
  Machine learning approach
"Bag of Function Words" method
  Method outline
  Adaptation of "Bag of Function Words" to English
Experiments and evaluation
Conclusion and remarks
Problem: the tremendous growth of information on the Internet makes relevant information hard to find
Solution: create a QA system, i.e. a system capable of giving an exact answer to an exact question and of detecting the answer in arbitrary corpora
Purpose: obtain viable information rapidly
Purpose of our research
Create a why-QA system with an automatically built classifier
Classifier: uses a model presented in the Japanese literature, created with machine learning based on the Bag of Grammar approach
Purpose of this paper:
  adapt the Japanese method to English
  test the effectiveness of the method on English
Related work
Two main trends:
Rule-based methods: preprocess the text, detect patterns, create a set of rules, apply the rules to identify the why-answer in the text
Machine learning methods: preprocess the text, identify and extract relevant features, create a classification scheme, classify
Rule-based methods in why-QA
Suzanne Verberne's approach: improve performance by re-ranking
Method: weight the score assigned to a QA-pair by the QAP with a number of syntactic features
Pros: importance of syntax; effective
Cons: hardly adaptable to various languages; requires deep grammar knowledge; labour-intensive
Machine learning methods in why-QA
Higashinaka and Isozaki's approach: acquire causal expressions from the Japanese EDR dictionary
Method: train a ranker based on clause structures extracted from EDR
Pros: partially automated; effective
Cons: hardly adaptable to various languages; not fully automated (based on EDR); EDR is rather high-priced
Machine learning methods in why-QA
Tanaka's approach: build a why-classifier with function words as features
Method: bag of function words
Pros: adaptable to different languages; domain independent; scalable; effective; fully automated
Bag of function words: a machine learning approach to automatically build a domain-independent why-classifier based on function words
Conditions to obtain domain independence:
  convergence and reasonable size of the feature space
  generality of the features in the feature space
  ability of the features to discriminate causality
Word class fulfilling these conditions: function words
Bag of function words – method (same baseline for Japanese and English)
  Tag: label all words of the text segments (Ts 1, Ts 2, … Ts n) with a POS tagger
  Extract function words: determine the POS of the function words
  Create the feature space, e.g. for, because, at, after, in, under, which, that, why, to, therefore
  Create the feature vectors (Fv 1, Fv 2, … Fv n): mapping using tf-idf on the function words
  Train: LogitBoost with weak learners produces the classification scheme
  Classify
Vectors' format: {(x⃗ᵢ, yᵢ)}
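The mapping step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function-word list is just the examples given on the slide, the sample segments are invented, and a real system would first POS-tag the text to pick out function words.

```python
import math
import re
from collections import Counter

# Example function words taken from the slide's feature-space illustration.
FUNCTION_WORDS = ["for", "because", "at", "after", "in", "under",
                  "which", "that", "why", "to", "therefore"]

def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real POS-tagging pipeline)."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf_vectors(segments):
    """Map each text segment to a tf-idf feature vector over the function words."""
    token_lists = [tokenize(s) for s in segments]
    n = len(segments)
    # document frequency of each function word across the segments
    df = {w: sum(1 for toks in token_lists if w in toks) for w in FUNCTION_WORDS}
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks) or 1
        vec = []
        for w in FUNCTION_WORDS:
            tf = counts[w] / total
            # idf is 0 for absent words and for words present in every segment
            idf = math.log(n / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Hypothetical why-answer vs. definition segments, mirroring the dataset design.
segments = [
    "The bridge collapsed because the cables failed under load.",
    "A bridge is a structure built to span a physical obstacle.",
]
vecs = tf_idf_vectors(segments)
```

The resulting vectors (one per segment, one dimension per function word) are what the boosted classifier is trained on.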
Adaptation to English
Differences:
  Japanese forms phrases by adding new words at the end of the phrase, and uses particles to define syntactic roles within a phrase
  English forms phrases by adding new words at the beginning of the phrase, and English words do not belong to only one grammatical category
Adjustment: identify the eligible function words in English
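One way to read the adjustment above: select function words by closed-class POS categories. The sketch below uses a hypothetical toy lexicon; precisely because English words can belong to more than one grammatical category (the slide's point), a real system would tag words in context with a POS tagger rather than look them up.

```python
# Penn Treebank-style closed-class (function word) tags.
CLOSED_CLASS_TAGS = {"IN", "DT", "CC", "TO", "WDT", "WRB", "MD"}

# TOY_LEXICON is a hypothetical stand-in for a POS tagger; it ignores the
# fact that many English words are ambiguous across categories.
TOY_LEXICON = {
    "because": "IN", "under": "IN", "after": "IN",
    "that": "WDT", "why": "WRB", "to": "TO",
    "the": "DT", "and": "CC",
    "bridge": "NN", "collapsed": "VBD", "cables": "NNS", "failed": "VBD",
}

def eligible_function_words(tokens):
    """Keep only tokens whose POS tag marks a closed (function word) class."""
    return [t for t in tokens if TOY_LEXICON.get(t) in CLOSED_CLASS_TAGS]

words = eligible_function_words(
    ["the", "bridge", "collapsed", "because", "the", "cables", "failed"])
```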
Experiment – data
Dataset: 432 text segments (216 why-answers, 216 definitions)
Processing:
  Label all words with POS and extract the function words
  Calculate tf-idf for each function word
  Map features from the feature set into feature vectors
Experiment – classifier
Used LogitBoost (Weka) with DecisionStump
Created 5 classifiers (50, 100, 150, 200, 250 iterations)
Evaluation: 10-fold cross-validation, with models trained on 9 folds and tested on 1; measured precision, recall and F-measure
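The Weka LogitBoost classifier itself is not reproduced here, but the evaluation protocol above can be sketched in a few lines: split the instances into 10 folds, then score each held-out fold with precision, recall and F-measure. This is an illustrative stand-alone sketch, not the authors' code.

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle instance indices and deal them into 10 near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def precision_recall_f1(y_true, y_pred, positive="why"):
    """Precision, recall and F-measure for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# With 432 instances, each fold holds 43 or 44 instances; in each round one
# fold is the test set and the remaining nine are the training set.
folds = ten_fold_indices(432)
```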
Results – why text segments (WTS)
[Chart: evaluation of the classifiers for why-TS; precision, recall and F-measure (roughly 0.62–0.78) plotted against the number of iterations (50, 100, 150, 200, 250)]
Results – non-why text segments (NWTS)
[Chart: evaluation of the classifiers for NWTS; precision, recall and F-measure (roughly 0.66–0.80) plotted against the number of iterations (50, 100, 150, 200, 250)]
Conclusion
Results:
  321 instances out of 432 correctly classified
  76.1% precision and 70.6% recall on WTS
  72.6% precision and 77.9% recall on NWTS
[Chart: global results; precision, recall and F-measure per type of text segment (WTS, NWTS)]
The method is effective on English
Future work
  Experiment with an increased dataset (>5000); use the Yahoo!Answers database to extract the dataset
  Interest: identify the optimal number of iterations; make a better selection of the function words to be used
  Include causative constructions in the analysis: English often expresses cause with a closed set of verbs or nouns
  Increase the accuracy of the classifier
Questions and remarks
Thank you for your attention!