VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYỄN THỊ THỎA
VIETNAMESE SENTENCE CLASSIFICATION
AND ITS APPLICATION TO QUESTION ANSWERING
MASTER'S THESIS IN INFORMATION TECHNOLOGY
Hanoi - 2015
Field: Information Technology
Specialization: Information Systems
Code: 60 48 01 04
SCIENTIFIC SUPERVISOR: DR. PHAN XUÂN HIẾU
Student    Supervisor    Thesis examination committee
DECLARATION
I, Nguyễn Thị Thỏa, declare that the content of this thesis is my own research
and creative work, carried out under the supervision of Dr. Phan Xuân Hiếu.
The data and results presented in this thesis are entirely truthful and have
not been published in any previous scientific work. Where images are taken
from external sources, I have cited those sources clearly and in full.
Hanoi, 2015
Student
Nguyễn Thị Thỏa
ACKNOWLEDGMENTS
First of all, I would like to sincerely thank Dr. Phan Xuân Hiếu. He inspired
my studies and my enthusiasm for scientific research, and he led me to this
field of research. He also wholeheartedly helped me overcome the challenges I
met while working on this thesis.
I would like to sincerely thank Professor Hà Quang Thụy. The more I worked
with him, the more I came to cherish and value my time as a student.
I would like to express my sincere gratitude to the lecturers who taught me
during my two years at the University of Engineering and Technology, Vietnam
National University, Hanoi. Every one of them gave me engaging and useful
lectures.
I would like to thank the staff of the Academic Affairs Office, the Student
Affairs Office, the Finance Office, and the other staff of the university.
Thanks to their dedicated work, we have one of the top-ranked universities in
the country in which to study and train.
I would like to express my deep gratitude to the members of MDN-Team, for the
time we spent together sharing the difficulties of building VAV, the virtual
assistant application for Vietnamese users. In particular, I will never forget
Nguyễn Văn Hợp and Vũ Thị Hải Yến, who enthusiastically helped me during the
experiments.
I would like to sincerely thank my colleagues at the National Agency for
Science and Technology Information, Ministry of Science and Technology, who
helped me complete my work at the office so that I could study with peace of
mind.
I would also like to thank the members of the Knowledge Technology Laboratory
for their detailed comments at each weekly seminar, which helped me improve
this thesis.
Finally, I would like to sincerely thank my parents and my brothers and
sisters. They are an indispensable source of encouragement in my life.
Hanoi, 2015
Student
Nguyễn Thị Thỏa
TABLE OF CONTENTS
PROBLEM STATEMENT
Chapter I. Introduction to sentence classification and its applications
1.1. Research on sentence classification
1.2. Vietnamese sentence classification
1.2.1. Introduction to the Vietnamese sentence classification problem
1.2.2. Approaches to solving the problem
Chapter II. Vietnamese sentence classification with machine learning methods
2.2. The Naïve Bayes method
2.3. The SVMs method
2.4. The Maximum Entropy algorithm
Chapter III. Experiments
3.1. Experimental method
3.2. Experimental data
3.3. Feature selection
3.4. Experimental results and analysis
3.4.1. The MaxEnt model
3.4.2. The Naïve Bayes model
3.4.4. Comparison of MaxEnt, Naïve Bayes, and SVMs
CONCLUSION
REFERENCES
APPENDIX
LIST OF FIGURES
Figure 0.1 Interface of the VAV application, a virtual assistant for Vietnamese users
Figure 0.2 Data sources for Big Data
Figure 0.3 Interface of the VOS software
Figure 1.1 Simple model of the Vietnamese sentence classification problem
Figure 1.2 Illustrative example of the Vietnamese sentence classification problem
Figure 1.3 Overall model of the Vietnamese sentence classification problem
Figure 2.1 The SVMs model
Figure 3.1 The cross-validation test method
Figure 3.2 Number of sentences of each type collected through the ASR service (Google Voice)
Figure 3.3 Chart comparing the F1 score of the MaxEnt model on the two feature sets
at the 4th iteration
Figure 3.4 Chart comparing the F1 score of the Naïve Bayes model between the two
feature sets n-grams and n-grams + Dictionary
Figure 3.5 Chart comparing the F1 score of the SVMs model between the two feature
sets n-grams and n-grams + Dictionary after 4 folds
Figure 3.6 Chart comparing the F1 score of the three models MaxEnt, Naïve Bayes, and
SVMs at the 4th iteration on the n-grams feature set
Figure 3.7 Chart comparing the F1 score of the three models MaxEnt, Naïve Bayes, and
SVMs at the 4th iteration on the n-grams + Dictionary feature set
Figure PL.1 Data distribution when classifying with the Naïve Bayes method
Figure PL.2 Classification results with the Naïve Bayes method
Figure PL.3 Data distribution when classifying with the SVMs method
Figure PL.4 Classification results with the SVMs method
Figure PL.5 Input data for fold 4 with the MaxEnt method
Figure PL.6 Training data for fold 4
Figure PL.7 Test data for fold 4
Figure PL.8 Evaluation results of the MaxEnt model
Figure PL.9 Data distribution when classifying with the Naïve Bayes method
Figure PL.10 Classification results with the Naïve Bayes method
Figure PL.11 Data distribution when classifying with the SVMs method
Figure PL.12 Classification results with the SVMs method
Figure PL.13 Training data for fold 4
Figure PL.14 Test data for fold 4
Figure PL.15 Evaluation results of the MaxEnt model
LIST OF TABLES
Table 1.1 Description of common sentence types
Table 3.1 Sample features used to train the sentence classification model
Table 3.2 Results of the 4th iteration of the MaxEnt model with the n-grams feature set
Table 3.3 Results of the 4th iteration of the MaxEnt model with the n-grams
+ Dictionary feature set
Table 3.4 Results of each iteration of the MaxEnt model with the n-grams feature set
Table 3.5 Results of each iteration of the MaxEnt model with the n-grams
+ Dictionary feature set
Table 3.6 Results after 4 iterations of the Naïve Bayes model with the n-grams
feature set
Table 3.7 Results after 4 iterations of the Naïve Bayes model with the n-grams
+ Dictionary feature set
Table 3.8 Results after 4 iterations of the SVMs model with the n-grams feature set,
with C = 0.1, gamma = 0.5, Kernel = exp(-gamma*|u-v|^2)
Table 3.9 Results after 4 iterations of the SVMs model with the n-grams + Dictionary
feature set, with C = 0.1, gamma = 0.5, Kernel = exp(-gamma*|u-v|^2)
PROBLEM STATEMENT
According to Assoc. Prof. Dr. Bùi Mạnh Hùng [1], to realize the purpose of an
utterance, speakers usually use a characteristic syntactic structure combined
with specific linguistic devices such as particles, adverbs, affixes, word
order, intonation, and ellipsis. In other words, there is a fairly regular
correlation between the form of a sentence and the purpose for which it is
used. This gives rise to the notion of sentence type; the most commonly cited
sentence types are declarative, interrogative, imperative, and exclamatory
sentences (see J. Sadock & A. Zwicky 1990: 155-156).
Automatic classification of Vietnamese sentences is a fundamental problem and
a prerequisite for more advanced research on natural language processing and
understanding. Sentence classification is one of the core processing
components of question answering systems such as VAV (Virtual Assistant for
Vietnamese), an application founded by MDN Team at the University of
Engineering and Technology, Vietnam National University, Hanoi; of social
media analysis systems for market research, such as Big Data processing
systems; and of speech synthesis systems such as VOS (Tiếng nói Phương Nam),
founded by Vietnam National University, Ho Chi Minh City.
Figure 0.1 Interface of the VAV application, a virtual assistant for Vietnamese users
VAV is a smart mobile application that lets users interact by voice to set an
alarm, schedule a meeting, turn on location services, call someone, open any
web page, find a route on a map, locate the nearest ATM of a given bank, or
play a favorite song. Designed and developed on artificial intelligence
techniques (machine learning, natural language analysis and understanding),
VAV can understand the user's intent even when a command is phrased in many
different ways, without requiring the user to follow any predefined template.
VAV has received considerable attention on social networks and technology
forums. Sentence classification lets VAV filter out the sentences that are
questions or imperatives and pass them on to the subsequent processing phases.
If a sentence is exclamatory or declarative, VAV instead replies to the user
immediately, without further processing, through a support module built into
VAV.
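The filtering role described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the names SentenceType,
NEEDS_FURTHER_PROCESSING, and route, as well as the two returned labels, are
hypothetical and are not taken from VAV itself, and the sentence type is
assumed to come from an upstream classifier.

```python
from enum import Enum


class SentenceType(Enum):
    """The four common sentence types named in the problem statement."""
    DECLARATIVE = "declarative"
    INTERROGATIVE = "interrogative"
    IMPERATIVE = "imperative"
    EXCLAMATORY = "exclamatory"


# Assumption: questions and imperatives go on to the later processing phases.
NEEDS_FURTHER_PROCESSING = {SentenceType.INTERROGATIVE, SentenceType.IMPERATIVE}


def route(sentence_type: SentenceType) -> str:
    """Return a hypothetical label for the next step after classification."""
    if sentence_type in NEEDS_FURTHER_PROCESSING:
        return "continue_pipeline"
    # Exclamatory and declarative sentences are answered right away.
    return "reply_directly"
```

For example, route(SentenceType.IMPERATIVE) selects the later processing
phases, while route(SentenceType.EXCLAMATORY) selects the immediate reply.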
Big data refers to data sets so large and varied that they cannot be processed
manually or with conventional software. Collecting, managing, and analyzing
such data has become a discipline of its own within information technology,
and its potential has attracted the attention of the business community in
recent years.
Figure 0.2 Data sources for Big Data
In only a short time, social media has produced as much data as the whole
world held a generation ago: Facebook processes 500 terabytes of data every
day and Twitter processes 12 terabytes every day, while the New York Stock
Exchange processes only 1 terabyte. Data from social media is a gold mine for
businesses that want to understand the behavior of their customers, how they
make purchasing decisions, and what they will need in the near future.
In this setting, sentence classification helps the system filter out the
sentences that express the user's state of mind and the sentences that convey
praise or criticism, so that businesses can improve their products or adopt
timely strategies to attract customers.
Similarly, in speech synthesis, VOS (Tiếng nói Phương Nam) is a Vietnamese
speech synthesis system, built for Vietnamese users, that generates an
artificial human voice on a computer from text input. Here sentence
classification helps the system add the appropriate nuance to each sentence in
the text.
In communications, VOS can be used in applications that answer information
queries through a telephone switchboard: the application receives the user's
request and processes it into text, and VOS converts the resulting information
into audio and returns it to the user. Such systems are highly practical
because processing is fully automatic, they can run continuously, and they
meet users' information needs, especially for breaking and frequently updated
news.
In automation, VOS can be integrated with GPS navigation in route-finding
applications mounted in cars to provide spoken directions. Drivers then no
longer have to keep glancing at the GPS screen, which improves their safety.
In education, VOS can be used to teach Vietnamese to the children of overseas
Vietnamese living abroad, in particular how to read and pronounce Vietnamese
words. It is an effective tool for practicing Vietnamese, especially in an
environment where the language in everyday use is not Vietnamese.
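REFERENCES
Vietnamese references
[1] Bùi Mạnh Hùng (2011), Bàn về vấn đề phân loại câu theo mục đích phát
ngôn [On the classification of sentences by utterance purpose], Faculty of
Linguistics, Vietnam National University, Ho Chi Minh City.
[2] Bùi Đức Tịnh (1995), Văn phạm Việt Nam [Vietnamese Grammar]. Ho Chi Minh
City: Văn hóa.
[3] Hoàng Trọng Phiến (1980), Ngữ pháp tiếng Việt: Câu [Vietnamese Grammar:
The Sentence]. Hanoi: Đại học & Trung học chuyên nghiệp.
[4] Nguyễn Hà Nam (2013), Giáo trình Khai phá dữ liệu [Data Mining Textbook],
Vietnam National University Press, Hanoi.
English references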
[5] Adam L. Berger & Stephen A. Della Pietra & Vincent J. Della Pietra (1996),
A Maximum Entropy Approach to Natural Language Processing.
[6] Adwait Ratnaparkhi (1997), A Simple Introduction to Maximum Entropy
Models for Natural Language Processing.
[7] Ashequl Qadir (2011), Classifying Sentences as Speech Acts in Message
Board Posts, University of Utah, In Proceedings of the 2011 Conference
on Empirical Methods in Natural Language Processing.
[8] Arpit Trived (2013), Implementation of Bayesian Theory in Sentence
Classification for Online Subjective Test, International Journal of
Advanced Research in Computer Science and Software Engineering,
Volume 3, Issue 12.
[9] Anthony Khoo (2006), Experiments with Sentence Classification, Monash
University, Australia.
[10] Ben Hachey & Claire Grover (2004), Sentence Classification Experiments
for Legal Text Summarisation, University of Edinburgh, In Proceedings of
the 17th Annual Conference on Legal Knowledge and Information
Systems.
[11] Diego Molla (2012), Experiments with Clustering-based Features for
Sentence Classification in Medical Publications: Macquarie Test's
participation in the ALTA 2012 shared task, In Proceedings of the Australasian
Language Technology Association Workshop, pages 139-142.
[12] Helen Kwong (2012), Detection of Imperative and Declarative Question-
Answer Pairs in Email Conversations, Stanford University, Journal AI
Communications archive, Volume 25 Issue 4, Pages 271-283.
[13] Martina Naughton (2008), Sentence-Level Event Classification in
Unstructured Texts, University College Dublin, Ireland.
[14] Menno van Zaanen (2005), Classifying Sentences using Induced Structure,
Macquarie University, Volume 3772 of the series Lecture Notes in
Computer Science, pp 139-150, 12th International Conference, SPIRE
2005, Buenos Aires, Argentina.
[15] Nal Kalchbrenner (2014), A Convolutional Neural Network for Modelling
Sentences, University of Oxford, In Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics.
[16] William Gardner Hale (1913), The Classification of Sentences and Clauses,
The School Review, The University of Chicago Press, Vol. 21, No. 6, pp.
388-397.
[17] Ulf Hermjakob (2001), Parsing and Question Classification for Question
Answering, University of Southern California, USA, Proceeding ODQA '01
Proceedings of the workshop on Open-domain question answering -
Volume 12, Pages 1-6.
[18] Yoon Kim (2014), Convolutional Neural Networks for Sentence
Classification, New York University.
[19] Emile de Maat (2008), Automatic Classification of Sentences in Dutch
Laws, University of Amsterdam, Proceedings of the 2008 conference on
Legal Knowledge and Information Systems, The Twenty-First Annual
Conference, Pages 207-216.
[20] Janyce Wiebe (2005), Creating Subjective and Objective Sentence
Classifiers from Unannotated Texts, University of Pittsburgh, CICLing'05
Proceedings of the 6th international conference on Computational
Linguistics and Intelligent Text Processing, Pages 486-497.
[21] Nitin Jindal (2006), Identifying Comparative Sentences in Text Documents,
University of Illinois at Chicago, SIGIR'06.
[22] Thomasson, Amie, "Categories", The Stanford Encyclopedia of Philosophy
(Fall 2013 Edition), First published Thu Jun 3, 2004, URL =
.