Post on 24-Feb-2016
Extracting Why Text Segment from Web Based on Grammar-gram
Iulia Nagy, Master student, 2010-02-27
www.***.com
Summary
Introduction
Related work
  Rule-based methods
  Machine learning approach
"Bag of Function Words" method
  Method outline
  Adaptation of "Bag of Function Words" to English
Experiments and evaluation
Conclusion and remarks
Problem: the tremendous growth of information on the Internet makes relevant information hard to find
Solution: create a QA system, i.e. a system capable of giving an exact answer to an exact question and of detecting the answer in arbitrary corpora
Purpose: obtain viable information rapidly
Purpose of our research
Create a why-QA system with an automatically built classifier
Classifier: uses a model presented in the Japanese literature, created with machine learning based on the Bag of Grammar approach
Purpose of this paper:
  adapt the Japanese method to English
  test the effectiveness of the method on English
Related work
Two main trends:
Rule-based methods: preprocess the text, detect patterns, create a set of rules, apply the rules to identify the why-answer in the text
Machine learning methods: preprocess the text, identify and extract relevant features, create a classification scheme, classify
Rule-based methods in why-QA
Suzanne Verberne's approach: improve performance by re-ranking
Method: weight the score assigned to a QA-pair by the QAP with a number of syntactic features
Pros: importance of syntax; effective
Cons: hardly adaptable to various languages; requires deep grammar knowledge; labour-intensive
Machine learning methods in why-QA
Higashinaka and Isozaki's approach: acquire causal expressions from the Japanese EDR dictionary
Method: train a ranker based on clause structures extracted from EDR
Pros: partially automated; effective
Cons: hardly adaptable to various languages; not fully automated (based on EDR); EDR is rather high-priced
Machine learning methods in why-QA
Tanaka's approach: build a why-classifier with function words as features
Method: bag of function words
Pros: adaptable to different languages; domain independent; scalable; effective; fully automated
Bag of function words: a machine learning approach to automatically build a domain-independent why-classifier based on function words
Conditions to obtain domain independence:
  convergence and reasonable size of the feature space
  generality of the features in the feature space
  ability of the features to discriminate causality
Word class fulfilling these conditions: function words
Bag of function words – method (same baseline for Japanese and English)
  Tag: label all words of the text segments (Ts 1, Ts 2, … Ts n) with a POS tagger
  Extract function words: determine the POS of the function words
  Create the feature space, e.g. for, because, at, after, in, under, which, that, why, to, therefore
  Create the feature vectors (Fv 1, Fv 2, … Fv n): mapping using tf-idf on the function words
  Train: LogitBoost with weak learners produces the classification scheme
  Classify
Vectors' format: {(x⃗ᵢ, yᵢ)}
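The mapping step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function-word list is just the examples given on the slide, the sample segments are invented, and a real system would first POS-tag the text to pick out function words.

```python
import math
import re
from collections import Counter

# Example function words taken from the slide's feature-space illustration.
FUNCTION_WORDS = ["for", "because", "at", "after", "in", "under",
                  "which", "that", "why", "to", "therefore"]

def tokenize(text):
    """Lowercase word tokenizer (a stand-in for a real POS-tagging pipeline)."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf_vectors(segments):
    """Map each text segment to a tf-idf feature vector over the function words."""
    token_lists = [tokenize(s) for s in segments]
    n = len(segments)
    # document frequency of each function word across the segments
    df = {w: sum(1 for toks in token_lists if w in toks) for w in FUNCTION_WORDS}
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        total = len(toks) or 1
        vec = []
        for w in FUNCTION_WORDS:
            tf = counts[w] / total
            # idf is 0 for absent words and for words present in every segment
            idf = math.log(n / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Hypothetical why-answer vs. definition segments, mirroring the dataset design.
segments = [
    "The bridge collapsed because the cables failed under load.",
    "A bridge is a structure built to span a physical obstacle.",
]
vecs = tf_idf_vectors(segments)
```

The resulting vectors (one per segment, one dimension per function word) are what the boosted classifier is trained on.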
Adaptation to English
Differences:
  Japanese forms phrases by adding new words at the end of the phrase, and uses particles to define syntactic roles within a phrase
  English forms phrases by adding new words at the beginning of the phrase, and English words do not belong to only one grammatical category
Adjustment: identify the eligible function words in English
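One way to read the adjustment above: select function words by closed-class POS categories. The sketch below uses a hypothetical toy lexicon; precisely because English words can belong to more than one grammatical category (the slide's point), a real system would tag words in context with a POS tagger rather than look them up.

```python
# Penn Treebank-style closed-class (function word) tags.
CLOSED_CLASS_TAGS = {"IN", "DT", "CC", "TO", "WDT", "WRB", "MD"}

# TOY_LEXICON is a hypothetical stand-in for a POS tagger; it ignores the
# fact that many English words are ambiguous across categories.
TOY_LEXICON = {
    "because": "IN", "under": "IN", "after": "IN",
    "that": "WDT", "why": "WRB", "to": "TO",
    "the": "DT", "and": "CC",
    "bridge": "NN", "collapsed": "VBD", "cables": "NNS", "failed": "VBD",
}

def eligible_function_words(tokens):
    """Keep only tokens whose POS tag marks a closed (function word) class."""
    return [t for t in tokens if TOY_LEXICON.get(t) in CLOSED_CLASS_TAGS]

words = eligible_function_words(
    ["the", "bridge", "collapsed", "because", "the", "cables", "failed"])
```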
Experiment – data
Dataset: 432 text segments (216 why-answers, 216 definitions)
Processing:
  Label all words with POS and extract the function words
  Calculate tf-idf for each function word
  Map features from the feature set into feature vectors
Experiment – classifier
Used LogitBoost (Weka) with DecisionStump
Created 5 classifiers (50, 100, 150, 200, 250 iterations)
Evaluation: 10-fold cross-validation, with models trained on 9 folds and tested on 1; measured precision, recall and F-measure
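The Weka LogitBoost classifier itself is not reproduced here, but the evaluation protocol above can be sketched in a few lines: split the instances into 10 folds, then score each held-out fold with precision, recall and F-measure. This is an illustrative stand-alone sketch, not the authors' code.

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle instance indices and deal them into 10 near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def precision_recall_f1(y_true, y_pred, positive="why"):
    """Precision, recall and F-measure for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# With 432 instances, each fold holds 43 or 44 instances; in each round one
# fold is the test set and the remaining nine are the training set.
folds = ten_fold_indices(432)
```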
Results – why text segments (WTS)
[Chart: evaluation of the classifiers for why-TS; precision, recall and F-measure (roughly 0.62–0.78) plotted against the number of iterations (50, 100, 150, 200, 250)]
Results – non-why text segments (NWTS)
[Chart: evaluation of the classifiers for NWTS; precision, recall and F-measure (roughly 0.66–0.80) plotted against the number of iterations (50, 100, 150, 200, 250)]
Conclusion
Results:
  321 instances out of 432 correctly classified
  76.1% precision and 70.6% recall on WTS
  72.6% precision and 77.9% recall on NWTS
[Chart: global results; precision, recall and F-measure per type of text segment (WTS, NWTS)]
The method is effective on English
Future work
  Experiment with an increased dataset (>5000); use the Yahoo!Answers database to extract the dataset
  Interest: identify the optimal number of iterations; make a better selection of the function words to be used
  Include causative constructions in the analysis: English often expresses cause with a closed set of verbs or nouns
  Increase the accuracy of the classifier
Questions and remarks
Thank you for your attention!