Chapter 6: Tagging
POS Tagging (part-of-speech tagging)
• Labeling each word in a sentence with its appropriate part of speech (POS)
• Information sources in tagging:
– Tags of other words in the context
– The word itself
• Different approaches:
– Rule-based Tagger
– Stochastic POS Tagger
• Simplest stochastic Tagger
• HMM Tagger
• …
Simplest Stochastic Tagger
• Each word is assigned its most frequent tag (most frequently encountered in the training set)
– For English, this achieves over 90% accuracy
• Problem: may assign a valid tag to each word but produce an unacceptable tag sequence
– Time flies like an arrow
NN VBZ VB DT NN
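A minimal sketch of this baseline (the toy training data below is illustrative; "flies" occurs more often as NNS there, so the tagger mislabels it in this sentence):

from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Count (word, tag) pairs and keep each word's most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, word_to_tag, default="NN"):
    """Tag each word independently; unknown words get a default tag."""
    return [word_to_tag.get(w, default) for w in words]

# "flies" is tagged NNS twice and VBZ once in this toy data, so the
# baseline picks NNS here even though VBZ fits this sentence.
model = train_baseline([[("Time", "NN"), ("flies", "VBZ"), ("like", "IN"),
                         ("an", "DT"), ("arrow", "NN")],
                        [("fruit", "NN"), ("flies", "NNS")],
                        [("fruit", "NN"), ("flies", "NNS")]])
print(tag_baseline(["Time", "flies", "like", "an", "arrow"], model))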
Markov Models (MM)
• In a Markov chain, the next element of the sequence depends only on the current element, not on the past elements
• X = (X1, …, XT) is a sequence of random variables, S = {s1, …, sN} is the state space
and the transition probabilities are
a_ij = P(X_{t+1} = s_j | X_t = s_i), with a_ij ≥ 0 for all i, j and Σ_{j=1}^{N} a_ij = 1 for all i
ex.
P(X_5 | X_1, X_2, X_3, X_4) = P(X_5 | X_4)
P(X_3 | X_1, X_2) = P(X_3 | X_2)
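As a quick illustration of the factorization this property allows (toy probabilities, not from the slides), the probability of a whole state sequence is the initial probability times a product of transition probabilities:

# Toy first-order Markov chain over POS states (made-up probabilities).
init = {"DT": 0.6, "NN": 0.4}
trans = {("DT", "NN"): 0.9, ("DT", "DT"): 0.1,
         ("NN", "VBZ"): 0.5, ("NN", "NN"): 0.5}

def sequence_prob(states):
    """P(X1..XT) = P(X1) * prod_t P(X_{t+1} | X_t) under the Markov assumption."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans.get((prev, cur), 0.0)
    return p

print(sequence_prob(["DT", "NN", "VBZ"]))  # 0.6 * 0.9 * 0.5 = 0.27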
Example of Markov Models (MM)
Hidden Markov Model (HMM)
• In a (visible) MM, we know the state sequence the model passes through, so the state sequence itself is regarded as the output
• In an HMM, we don’t know the state sequence, only some probabilistic function of it
– In the POS tagging problem, only the words are given; the actual POS tags of the words are unknown
– The probabilistic function itself is known
• Markov models can be used wherever one wants to model the probability of a linear sequence of events
• HMM can be trained from unannotated text
– In theory, it can be trained without any labeled training data
– In practice, however, this gives poor accuracy, so labeled training data is used in most cases
(Visible) Markov Model (figure)
HMM Example
State:  NN    VBZ    IN    DT   NN     (other candidate states in the figure: NNS, VB)
Output: Time  flies  like  an   arrow
HMM Tagger
• Assumption: a word’s tag depends only on the previous tag, and this dependency does not change over time
– P(S_{t+1} | S_t), 0 < t < n+1
• HMM tagger uses states to represent POS tags and outputs (symbol emission) to represent the words.
• Tagging task is to find the most probable tag sequence for a sequence of words.
Finding the most probable sequence
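Under the bigram assumption above, the most probable tag sequence is argmax_{t_1 … t_n} ∏_i P(t_i | t_{i-1}) P(w_i | t_i), which the Viterbi algorithm computes in O(n · |tagset|²) time. A minimal sketch, assuming dict-based probability tables (the names init, trans, and emit are hypothetical, not from the slides):

import math

def viterbi(words, tags, init, trans, emit):
    """Most probable tag sequence under a bigram HMM.
    init[t] = P(t) for the first position, trans[(t_prev, t)] = P(t | t_prev),
    emit[(t, w)] = P(w | t); computed in log space to avoid underflow."""
    def lp(p):  # log-probability, with -inf for unseen events
        return math.log(p) if p > 0 else float("-inf")

    # delta[t]: best log-probability of any partial sequence ending in tag t
    delta = {t: lp(init.get(t, 0)) + lp(emit.get((t, words[0]), 0)) for t in tags}
    backpointers = []
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda tp: delta[tp] + lp(trans.get((tp, t), 0)))
            ptr[t] = prev
            new_delta[t] = (delta[prev] + lp(trans.get((prev, t), 0))
                            + lp(emit.get((t, w), 0)))
        delta = new_delta
        backpointers.append(ptr)
    # Backtrace from the best final tag
    best = max(tags, key=lambda t: delta[t])
    seq = [best]
    for ptr in reversed(backpointers):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))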
HMM tagging – an example
Calculating the most likely sequence (figure: green = transition probabilities, blue = emission probabilities)
Dealing with unknown words
• The simplest model: assume unknown words can take any POS tag, or assign them the most frequent tag in the tagset
• In practice, morphological information such as suffixes is used as a hint (a sketch follows below)
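A hedged sketch of such a fallback (the suffix-to-tag rules below are illustrative English heuristics, not from the slides):

def guess_unknown_tag(word, default="NN"):
    """Heuristic POS guess for out-of-vocabulary words using
    morphological hints (illustrative English suffix rules)."""
    if word[0].isupper():
        return "NNP"          # capitalized -> proper noun
    if word.endswith("ing"):
        return "VBG"          # gerund / present participle
    if word.endswith("ly"):
        return "RB"           # adverb
    if word.endswith("ed"):
        return "VBD"          # past-tense verb
    return default            # fall back to the most frequent tag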
TnT (Trigrams’n’Tags)
• A statistical tagger using Markov Models: states represent tags and outputs represent words
• Finding the best tag sequence amounts to calculating:
argmax_{t_1 … t_T} [ ∏_{i=1}^{T} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) ] P(t_{T+1} | t_T)
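A hedged sketch of this objective in log space (the <s> padding tags and the trans3, lex, and eos tables are illustrative assumptions, not TnT’s actual data structures):

import math

def tnt_score(tags, words, trans3, lex, eos):
    """log of prod_{i=1..T} P(t_i | t_{i-2}, t_{i-1}) * P(w_i | t_i) * P(t_{T+1} | t_T),
    with <s> padding tags for the first two positions."""
    padded = ["<s>", "<s>"] + list(tags)
    score = 0.0
    for i, (t, w) in enumerate(zip(tags, words)):
        score += math.log(trans3[(padded[i], padded[i + 1], t)])  # trigram transition
        score += math.log(lex[(t, w)])                            # lexical emission
    return score + math.log(eos[tags[-1]])                        # sentence-end factor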
Transition and emission probabilities
• Transition and output probabilities are estimated from a tagged corpus:
Bigrams:  P̂(t_3 | t_2) = f(t_2, t_3) / f(t_2)
Trigrams: P̂(t_3 | t_1, t_2) = f(t_1, t_2, t_3) / f(t_1, t_2)
Lexical:  P̂(w_3 | t_3) = f(w_3, t_3) / f(t_3)
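A minimal sketch of these relative-frequency estimates (the estimate helper and its Counter-based layout are illustrative, not from the slides):

from collections import Counter

def estimate(tagged_sents):
    """Relative-frequency (MLE) estimates: P(t3|t2) = f(t2,t3)/f(t2),
    P(t3|t1,t2) = f(t1,t2,t3)/f(t1,t2), P(w3|t3) = f(w3,t3)/f(t3)."""
    f_t, f_tt, f_ttt, f_wt = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        for w, t in sent:
            f_t[t] += 1
            f_wt[(w, t)] += 1
        for a, b in zip(tags, tags[1:]):
            f_tt[(a, b)] += 1
        for a, b, c in zip(tags, tags[1:], tags[2:]):
            f_ttt[(a, b, c)] += 1
    total = sum(f_t.values())
    P_uni = {t: n / total for t, n in f_t.items()}
    P_bi = {(a, b): n / f_t[a] for (a, b), n in f_tt.items()}
    P_tri = {(a, b, c): n / f_tt[(a, b)] for (a, b, c), n in f_ttt.items()}
    P_lex = {(w, t): n / f_t[t] for (w, t), n in f_wt.items()}
    return P_uni, P_bi, P_tri, P_lex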
Smoothing Technique
• Needed due to sparse-data problem
• Trigram counts are likely to be zero in a limited corpus
– Without smoothing, a single zero trigram makes the probability of the complete sequence zero
• Smoothing by linear interpolation:
P(t_3 | t_1, t_2) = λ_1 P̂(t_3) + λ_2 P̂(t_3 | t_2) + λ_3 P̂(t_3 | t_1, t_2)
where λ_1 + λ_2 + λ_3 = 1
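A sketch of the interpolated estimate, reusing the dictionaries from the estimation sketch above (the λ values shown are placeholders; TnT derives them from the corpus by deleted interpolation):

def smoothed_trigram(t1, t2, t3, P_uni, P_bi, P_tri,
                     lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) = l1*P^(t3) + l2*P^(t3|t2) + l3*P^(t3|t1,t2),
    with l1 + l2 + l3 = 1. Unseen events contribute 0 to the sum
    instead of zeroing out the whole sequence probability."""
    l1, l2, l3 = lambdas
    return (l1 * P_uni.get(t3, 0.0)
            + l2 * P_bi.get((t2, t3), 0.0)
            + l3 * P_tri.get((t1, t2, t3), 0.0))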
Other techniques
• Handling unknown words (see the sketch after this list)
– Use the longest suffix (the final sequence of characters of a word) as a strong predictor for word classes
– Calculate the probability of a tag t given the last m letters l_i of an n-letter word
• m depends on the specific word
• Capitalization
– Works better for English than for German
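A minimal sketch of the suffix idea, simplified relative to TnT’s actual successive-abstraction scheme (the helpers below are illustrative):

from collections import Counter, defaultdict

def build_suffix_model(tagged_words, max_len=5):
    """Collect f(tag, suffix) for every suffix up to max_len letters."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for m in range(1, min(max_len, len(word)) + 1):
            counts[word[-m:]][tag] += 1
    return counts

def suffix_tag_probs(word, counts, max_len=5):
    """P(tag | last m letters), using the longest suffix seen in training."""
    for m in range(min(max_len, len(word)), 0, -1):
        c = counts.get(word[-m:])
        if c:
            total = sum(c.values())
            return {tag: n / total for tag, n in c.items()}
    return {}  # no suffix evidence at all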
Learning Curve for Penn Treebank
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF (ACL16)
• LSTM-CRF model + char-level representation
– Char: CNN
End-to-End Korean Morphological Analysis (Winter Conference ’16)
Attention + Input-feeding + Copying mechanism
Korean Morphological Analysis and POS Tagging Using a BERT-based LSTM-CRF Model (HCLT19)
Input: 신문 시장은 여느 때보다 탈법과 불법이 난무하고 있다.
Syllable-level input sentence: 신문시장은여느때보다탈법과불법이난무하고있다 .
KorBERT input: [CLS] 신문_ 시장은_ 여느_ 때보다_ 탈법과_ 불법이_ 난무하고_ 있다 ._ [SEP]
Eojeol-boundary features: E B E B I E B E B I E B I E B I E B I I EB I E E
Morphological analysis and POS tagging output: B-NNG I-NNG B-NNG I-NNG B-JX B-MM I-MM B-NNG B-JKB I-JKB B-NNG I-NNG B-JC B-NNG I-NNG B-JKS B-NNG I-NNG B-XSV B-EC B-VX B-EF B-SF
Models                              F1
나승훈 [18]: CRF*                   97.65
이건일 [6]: Sequence-to-sequence*   97.15
이창기 [15]: structural SVM         98.03
RNN-search [13]                     95.92
황현선 [29]: Copying mechanism      97.08
BERT + LSTM-CRF (ours)              98.74
Recent Tagging Results – English
• Data: WSJ corpus (Penn tag set, Penn Treebank)
• Hidden Markov Model (Viterbi algorithm)
– TnT (2000): 96.5% ~ 96.7%
• Brill Tagger (rule-based, 1995): 97.2%
• Averaged Perceptron
– Collins (2002): 97.1%, COMPOST (2009): 97.2%
• SVM (2004): 97.2%
• Maximum Entropy / Conditional Random Fields
– CRF: 97%, Melt (2009): 97.0%, GENiA Tagger (2005): 97.1%
– Stanford Tagger (2011): 97.3%
• Deep Learning
– SENNA: 97.55%
– Bi-directional LSTM-CNNs-CRF (ACL16): 97.55%
– Bi-LSTM RNN + char representation (Ling et al., 2015): 97.78%
– Morpho-syntactic Tagging with a Meta-BiLSTM Model (Bohnet et al., 2018): 97.96%
Recent Tagging Results – Korean
• Changwon National University (HMM + eojeol patterns, Sejong corpus, 2009)
– Eojeol accuracy = 90.5%, per-morpheme POS accuracy = 95.9%, 20K eojeols/sec
• KOMORAN (HMM?, 2013)
– Eojeol accuracy = 84.8%, per-morpheme POS accuracy = 91.2%, 30K eojeols/sec (200KB/sec)
• KAIST (HMM, 2010)
– Eojeol accuracy = 89%
• University of Ulsan (HMM + pre-analyzed dictionary, 2012)
– Per-morpheme POS accuracy = 95.8% (including morpheme sense numbers), 48K eojeols/sec
• CRF + syllable-based
– Prof. 심광섭: eojeol accuracy 96.6% (different tag set and test set)
– 나승훈 (ETRI): POS tagging accuracy 96.2% (Sejong corpus, per-eojeol performance?)
• Structural SVM + syllable-based
– 이창기 (Kangwon National University, 2013): per-morpheme POS accuracy = 97.96% (Sejong corpus), 100KB/sec
• Deep Learning (Sequence-to-sequence, End-to-end)
– 이창기 (Kangwon National University, 2016): per-morpheme POS accuracy = 97.08% (Sejong corpus)
• Deep Learning (BERT + LSTM-CRF)
– 박천음, 이창기 (Kangwon National University, 2019): per-morpheme POS accuracy = 98.74% (Sejong corpus)