AUTOMATIC GENERATION OF THE
DOMAIN-SPECIFIC SENTIMENT
RUSSIAN DICTIONARIES
Alina Dubatovka / SPbSU
Yurii Kurochkin / Yandex
Elena Mikhailova / SPbSU
Dialogue 2016, Moscow, June 1-4, 2016
Goals
• Automatic extraction of sentiment words
• Automatic polarity detection
• Unsupervised
Methodology
• Hatzivassiloglou, McKeown 1997
– "Tasty and healthy breakfast"
– "Cheap but nice hotel"
• The better the node is connected with other
"positive" nodes and the worse with the
"negative", the more positive it is
Graph builder
• (ADV | NEG)* ADJ ( ,? (AND | BUT)? (ADV | NEG)* ADJ )+
• AND – conjunction "and"
• BUT – one of the adversative conjunctions ("but", "instead", "however", "nevertheless")
• NEG – negation
• ADV – an adverb of measure and degree ("very", "quite", "too", "completely")
• ADJ – adjective
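A minimal sketch of how such a chain pattern could be matched, assuming the tokens are already tagged. The one-character tag codes, the COMMA tag name and the helper find_chains are illustrative assumptions, not part of the original system, and the regular expression is only one reading of the pattern on this slide.

```python
import re

# Hypothetical one-character codes, so that a match span equals a token span.
TAG_CODES = {"ADV": "v", "NEG": "n", "ADJ": "a", "COMMA": ",", "AND": "&", "BUT": "b"}

# (ADV | NEG)* ADJ ( ,? (AND | BUT)? (ADV | NEG)* ADJ )+
CHAIN = re.compile(r"[vn]*a(?:,?[&b]?[vn]*a)+")


def find_chains(tags):
    """Return (start, end) token index pairs for every adjective chain."""
    # Any tag outside the pattern alphabet becomes "x" and breaks a chain.
    encoded = "".join(TAG_CODES.get(tag, "x") for tag in tags)
    return [match.span() for match in CHAIN.finditer(encoded)]


# Tags for "Tasty, plentiful but not very varied and expensive breakfast"
tags = ["ADJ", "COMMA", "ADJ", "BUT", "NEG", "ADV", "ADJ", "AND", "ADJ", "NOUN"]
chains = find_chains(tags)
print(chains)
```

A single adjective followed by a noun produces no chain, because the pattern requires at least two adjectives.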
Example
• "Tasty, plentiful but not very varied and expensive breakfast"
• positive links: (tasty, plentiful), (tasty, varied), (plentiful, varied)
• negative links: (tasty, expensive), (plentiful, expensive), (varied, expensive)
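The links in this example can be reproduced with a short sketch. It rests on two assumed rules that are consistent with the example but not stated on the slide: an adversative conjunction flips the orientation of everything that follows it, and a negation flips the orientation of the next adjective only. The function name extract_links and the tag names are hypothetical.

```python
from itertools import combinations


def extract_links(chain):
    """Split all adjective pairs of a tagged chain into positive and negative links.

    chain is a list of (token, tag) pairs with tags ADJ, ADV, NEG, AND, BUT, COMMA.
    """
    oriented = []      # (adjective, orientation relative to the first phrase)
    phrase_sign = 1    # flipped by every BUT
    negated = False    # set by NEG, consumed by the next ADJ
    for token, tag in chain:
        if tag == "BUT":
            phrase_sign = -phrase_sign
        elif tag == "NEG":
            negated = not negated
        elif tag == "ADJ":
            oriented.append((token, -phrase_sign if negated else phrase_sign))
            negated = False
    positive, negative = [], []
    for (first, s1), (second, s2) in combinations(oriented, 2):
        (positive if s1 == s2 else negative).append((first, second))
    return positive, negative


chain = [
    ("tasty", "ADJ"), (",", "COMMA"), ("plentiful", "ADJ"), ("but", "BUT"),
    ("not", "NEG"), ("very", "ADV"), ("varied", "ADJ"), ("and", "AND"),
    ("expensive", "ADJ"),
]
positive, negative = extract_links(chain)
print(positive)
print(negative)
```

Under these rules "not very varied" after "but" is flipped twice, so "varied" ends up on the same side as "tasty", as in the example.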
Particle “not” and prefix “un-”
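[Figure: two graph fragments with the nodes Good (хороший), Pleasant (приятный), Free (бесплатный) and Big (большой). In the first fragment the edge labels are "1486; -40" and "6; 0". The second fragment adds the node Unpleasant (неприятный), and its edge labels are "1556; -113" and "6; 0".]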
Graph Analyzer
• Initialization
• Weight of the graph edges
– $\mathrm{weight}(word_1, word_2) = \#(word_1\ \mathrm{AND}\ word_2) - K \cdot \#(word_1\ \mathrm{BUT}\ word_2)$
• Distance to the final set
– The heaviest edge
– The sum of the weights of edges
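A small sketch of the edge weight and of the two distance variants listed above. The function names, the dictionary layout and the co-occurrence counts are invented for illustration. How the original analyzer initializes and grows its final sets is not shown on the slide, so only the scoring step is sketched.

```python
def edge_weight(and_count, but_count, k):
    """weight = #(word1 AND word2) - K * #(word1 BUT word2)."""
    return and_count - k * but_count


def closeness(word, final_set, weights, mode="sum"):
    """Score a word against a set of already classified words.

    weights maps frozenset({w1, w2}) to an edge weight.
    mode="heaviest" uses the heaviest edge, mode="sum" the sum of all edges.
    """
    edges = [
        weights[frozenset((word, other))]
        for other in final_set
        if frozenset((word, other)) in weights
    ]
    if not edges:
        return 0.0
    return max(edges) if mode == "heaviest" else sum(edges)


# Invented counts for illustration: (AND co-occurrences, BUT co-occurrences)
counts = {
    ("clean", "cosy"): (10, 1),
    ("clean", "bright"): (4, 0),
    ("clean", "noisy"): (1, 4),
}
K = 2
weights = {frozenset(pair): edge_weight(a, b, K) for pair, (a, b) in counts.items()}

to_positive_sum = closeness("clean", {"cosy", "bright"}, weights, mode="sum")
to_positive_max = closeness("clean", {"cosy", "bright"}, weights, mode="heaviest")
to_negative_sum = closeness("clean", {"noisy"}, weights, mode="sum")
print(to_positive_sum, to_positive_max, to_negative_sum)
```

A larger K makes every "but" co-occurrence count more strongly against a link, which is why the results can depend on its value.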
Description of experiments
• 259023 depersonalized unlabeled reviews
• Dataset size – 660 MB
• Hotel domain
• Texts by real users
– Misspellings
– Grammatical errors
– Informal words
– Unrelated information concerning flights, excursions, places of interest, etc.
“Large” dictionaries
                                             Positive  Negative  Neutral  Total
Algorithm without removing the "un-" prefix      5252      2815        -   8067
Algorithm after removing the "un-" prefix        4936      2695        -   7631
“Large” dictionary                               1948      1946     4951   8845
“Small” dictionaries
                                         Positive dictionary  Negative dictionary  Total
“Manual” dictionary                                      173                  127    300
Algorithm without “un-” prefix removing                  164                   74    238
Algorithm with “un-” prefix removing                     163                   83    246
Results without removing the "un-" prefix
Metric                            Positive dictionary  Negative dictionary  Total dictionary
Recall                                          0.806                0.684             0.754
Precision                                       0.309                0.521             0.381
Precision without neutral words                 0.77                 0.827             0.796
F1-measure                                      0.447                0.591             0.506
F1-measure without neutral words                0.788                0.749             0.774
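As a consistency check, the F1 values in this table follow from the reported precision and recall when the usual harmonic-mean definition is assumed. The helper name f1 and the dictionary layout are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, neutral words included
reported = {
    "positive": (0.309, 0.806),
    "negative": (0.521, 0.684),
    "total": (0.381, 0.754),
}
scores = {name: round(f1(p, r), 3) for name, (p, r) in reported.items()}
print(scores)

# Same check with the "without neutral words" precision for the positive dictionary
positive_without_neutral = round(f1(0.77, 0.806), 3)
print(positive_without_neutral)
```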
Results after removing the "un-" prefix
Metric                            Positive dictionary  Negative dictionary  Total dictionary
Recall                                          0.793                0.683             0.746
Precision                                       0.314                0.502             0.38
Precision without neutral words                 0.779                0.82              0.799
F1-measure                                      0.45                 0.579             0.504
F1-measure without neutral words                0.786                0.745             0.772
Precision@n for positive dictionary
[Figure: precision@n for the positive dictionary. Horizontal axis: n, from 1 to 5201. Vertical axis: precision, from 0.3 to 1.]
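Precision@n can be sketched as follows, assuming the generated dictionary is a list ranked from the most to the least confident word and the reference dictionary is a set. The function name and the toy data are illustrative, not taken from the experiments.

```python
def precision_at_n(ranked_words, reference, n):
    """Share of the first n ranked words that occur in the reference set."""
    top = ranked_words[:n]
    if not top:
        return 0.0
    return sum(word in reference for word in top) / len(top)


# Toy data, invented for illustration
ranked = ["good", "clean", "big", "cosy"]
reference_positive = {"good", "clean", "cosy"}

curve = [precision_at_n(ranked, reference_positive, n) for n in range(1, len(ranked) + 1)]
print(curve)
```

Plotting such a curve for n up to the dictionary size gives the kind of chart shown on this slide.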
Precision@n for negative dictionary
[Figure: precision@n for the negative dictionary. Horizontal axis: n, from 1 to 2801. Vertical axis: precision, from 0.5 to 1.]
Dependence on K
[Figure: four plots, each with two series, "With neutral words" and "Without neutral words". Two plots have a horizontal axis from 1 to 10 and a vertical axis from 0.5 to 0.9. The third has a horizontal axis from 0.76 to 0.88 and a vertical axis from 0 to 1. The fourth has a horizontal axis from 0.75 to 0.95 and a vertical axis from 0.3 to 0.9.]