Classification of unanswerable questions: the rhetoric of Twitter

David Tomás

Department of Software and Computing Systems

University of Alicante, Spain

CERI 2012

This presentation is about…

Twitter

Questions that look like questions when in fact they are not

Corpus-based question classification

A preliminary evaluation

A lot of future work

Twitter

friend

mention @

hashtag #

url

follower

tweet

140 characters

RT

Twitter

A perfect way to spread information

Fast, fast, fast

Immediacy: many people are asking questions

Proposal

Wouldn’t it be nice if someone came to your aid when you need an answer?

New paradigm: systems going to the user

First problem: who really needs an answer?

Proposal

Question classification problem

Real questions vs. rhetorical questions

Supervised / corpus-based

Corpus + Features + Algorithms

Corpus

Real question: expects an answer, from the crowd or from an individual

Rhetorical question: all the others

Question words: what, who, whom, whose, which, when, where, why, how

9 question words x 100 questions each = 900 questions = 220 real + 680 rhetorical
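The corpus arithmetic, and the accuracy a trivial classifier would reach by always answering "rhetorical", can be checked with a short sketch. Treating the baseline as a majority-class baseline is an assumption here; the slides do not say how their baseline was computed.

```python
# Class counts as stated on the slide.
counts = {"real": 220, "rhetorical": 680}

total = sum(counts.values())                  # 9 question words x 100 questions
majority_label = max(counts, key=counts.get)  # the most frequent class
baseline_accuracy = counts[majority_label] / total

print(f"{total} questions, majority class: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

With these counts, always predicting the majority class is already right on roughly three out of four questions, which is why raw accuracy on this corpus has to be read against that figure.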

Features

Punctuation marks: ? ! “

Twitter language: @, #, links, words, interjections

Entities: named entity recognition

Part-of-speech: NN, NP, V

WordNet: average length, % terms found, total terms found

Polarity: sentiment analysis

Relations: friends, followers, friends/followers
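As an illustration of the simplest feature groups (punctuation marks and Twitter language), the following is a minimal sketch of how such counts could be extracted. The function name, the feature names, and the regular expressions are hypothetical; the slides do not give the actual feature definitions.

```python
import re

# Hypothetical patterns; the actual feature definitions are not given in the slides.
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
LINK = re.compile(r"https?://\S+")

def surface_features(tweet: str) -> dict:
    """Count punctuation marks and Twitter-specific markers in a tweet."""
    return {
        "question_marks": tweet.count("?"),
        "exclamation_marks": tweet.count("!"),
        "quotes": tweet.count('"') + tweet.count("“") + tweet.count("”"),
        "mentions": len(MENTION.findall(tweet)),
        "hashtags": len(HASHTAG.findall(tweet)),
        "links": len(LINK.findall(tweet)),
    }

# Invented example tweet, for illustration only.
example = "@anna where can I buy a cheap bike? #lazyweb http://example.com"
print(surface_features(example))
```

The remaining groups (entities, part-of-speech, WordNet, polarity, relations) would need external tools or Twitter account data and are left out of this sketch.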

Algorithm

Experiments and results

[Figure: accuracy (%) of SVM, NB, IB1 and RF on the real + rhetorical corpus, with a baseline for comparison]


[Figure: precision (%) of SVM, NB, IB1 and RF, for real and rhetorical questions]

[Figure: recall (%) of SVM, NB, IB1 and RF, for real and rhetorical questions]
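Per-class precision and recall, as plotted for real and rhetorical questions, can be computed from gold and predicted labels as in the sketch below. The helper name and the toy labels are invented for illustration and are not the results from the slides.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall of `label`, given parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels, invented for illustration; not the results from the slides.
gold = ["real", "real", "rhetorical", "rhetorical", "rhetorical", "rhetorical"]
predicted = ["real", "rhetorical", "rhetorical", "rhetorical", "rhetorical", "real"]

for label in ("real", "rhetorical"):
    p, r = precision_recall(gold, predicted, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

Reporting both classes separately matters on a skewed corpus: a classifier can score well on the rhetorical class while missing most real questions.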

Corpus (2nd attempt)

An unbalanced corpus biases classification

Problem: need for more real questions

Solution: #lazyweb


Balanced corpus of 1,360 questions:

680 rhetorical

680 real (from a set of 2,800 #lazyweb tweets)
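One way to obtain a balanced corpus of this shape is to downsample the larger pool to the size of the smaller class. The sketch below assumes uniform random sampling; the slides do not say how the 680 real questions were chosen from the 2,800, and the identifiers are placeholders rather than real tweets.

```python
import random

def balance(real, rhetorical, seed=0):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n = min(len(real), len(rhetorical))
    return rng.sample(real, n), rng.sample(rhetorical, n)

# Placeholder identifiers standing in for tweets; sizes follow the slide.
lazyweb_pool = [f"lazyweb_{i}" for i in range(2800)]
rhetorical_set = [f"rhetorical_{i}" for i in range(680)]

real_sample, rhetorical_sample = balance(lazyweb_pool, rhetorical_set)
print(len(real_sample), len(rhetorical_sample))
```

With equal class sizes, a majority-class guess is right half of the time, so a 50% baseline is the natural reference point for the balanced experiments.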


[Figure: accuracy (%) of SVM, NB, IB1 and RF on the balanced real + rhetorical corpus; baseline at 50]


[Figure: accuracy (%) in the ablation study for Punctuation, Language, Entities, POS, WordNet, Polarity, Relations, Selection and All]
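The structure of an ablation study over the seven feature groups can be sketched as a simple loop. This assumes the usual leave-one-group-out reading, which the slides do not confirm, and it uses a hypothetical stand-in scorer with invented numbers; a real run would train and evaluate a classifier at each step.

```python
# Feature groups named on the ablation chart.
GROUPS = ["punctuation", "language", "entities", "pos",
          "wordnet", "polarity", "relations"]

def ablation(groups, evaluate):
    """Score the full feature set, then each set with one group removed."""
    results = {"all": evaluate(groups)}
    for removed in groups:
        kept = [g for g in groups if g != removed]
        results[f"without {removed}"] = evaluate(kept)
    return results

def toy_evaluate(kept):
    """Hypothetical scorer: invented integer gains, NOT results from the slides."""
    gains = {"punctuation": 3, "language": 2, "entities": 1, "pos": 1,
             "wordnet": 1, "polarity": 1, "relations": 2}
    return sum(gains[g] for g in kept)

for name, score in ablation(GROUPS, toy_evaluate).items():
    print(f"{name}: {score}")
```

The larger the drop when a group is removed, the more that group contributes on top of the others.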

Conclusions and future work

Just a first step

Room for improvement

Augment the corpus

Truly analyze the rhetoric of Twitter

Integrate in a QA system

Thank you very much
