Thematic report: Information Extraction, version 1.0


Upload: compaq1501

Post on 14-Sep-2015


INFORMATION EXTRACTION

CHAPTER 1: OVERVIEW OF INFORMATION EXTRACTION
1.1 Objectives and scope of the report
1.2 Introduction to information extraction (IE)
1.3 Information extraction (IE) and information retrieval (IR)
1.4 Related research and applications
1.5 Basic steps of an IE system
1.6 Information extraction methods
1.7 Evaluation methods
CHAPTER 2: INFORMATION EXTRACTION PROBLEMS AND METHODS
2.1 Introduction
2.2 Keyphrase extraction
2.2.1 Introduction
2.2.2 Scope of application
2.2.3 The automatic keyphrase generation problem
2.2.4 The KEA algorithm
2.2.4.1 Selecting candidate phrases
2.2.4.2 Feature calculation
2.2.4.3 Training
2.2.4.4 Extracting keyphrases
2.2.5 The KIP algorithm
2.3 Named entity recognition
2.3.1 Concept
2.3.2 Approaches and popular systems
2.4 Relation recognition
2.4.1 Concept
2.4.2 Approaches and related research
CHAPTER 3: METADATA EXTRACTION
3.1 Introduction
3.2 The concept of metadata
3.3 The Dublin Core Metadata standard
3.4 Metadata extraction and related research
3.5 Approach of this project
3.5.1 System architecture
3.5.2 Rule-based metadata extraction
3.5.3 JAPE rules for extracting metadata from scientific papers
3.6 Experiments and evaluation
CHAPTER 4: CONCLUSION AND FUTURE WORK
4.1 Conclusion
4.2 Future work
REFERENCES

CHAPTER 1: OVERVIEW OF INFORMATION EXTRACTION

1.1 Objectives and scope of the report

The overall goal is to find and propose a knowledge representation model for text documents. The model comprises knowledge components such as metadata describing a document's origin and structure (title, author, place of publication, year of publication, subject, storage location, ...), keyphrases, entities, and the relations between entities that represent the document's content. Such a model supports intelligent querying and the retrieval of related information and documents from a collected, organized document repository. The work in this report is to study and identify methods and tools for extracting information and knowledge from documents and loading them into the model, in preparation for organizing textual knowledge to support query processing.

Based on this goal, we survey the problems, methods and tools for extracting information from text, namely:

* Keyword and keyphrase extraction
* Entity extraction (named and unnamed)
* Relation extraction
* Extraction of a document's structural components and metadata

1.2 Introduction to information extraction (IE)

Commonly cited definitions of information extraction include the following.

According to Jim Cowie and Yorick Wilks [2], IE is the name given to the process of selectively structuring and combining data that is found, explicitly stated, in one or more text documents.

According to Line Eikvil [1], IE is a subfield of natural language processing concerned with identifying specific information in a natural-language document. The purpose of information extraction is to transform text into a structured form. Information is extracted from different source documents and represented in a unified form. IE systems do not aim to understand the input text; their main task is to find the relevant information we are looking for. Also according to Eikvil [1], the core component of an IE system is a set of rules and patterns used to identify the relevant information to be extracted.

According to Dr. Alexander Yates of the University of Washington [3], information extraction is the process of retrieving structured information from unstructured text.

According to the information extraction experts behind GATE, IE systems analyse text in order to extract the required information in predefined forms, such as events, entities and relations.

In summary, information extraction can be understood as a technique and research field related to information retrieval, data mining and natural language processing. Its main goal is to find structured information in unstructured or semi-structured text. IE seeks to convert the information in such text into a structured form that can be represented formally, for example as a structured XML file or a structured table (such as a database table).
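As a minimal sketch of the structured target representation described above, the following Python fragment holds the entities and relations of one sentence as records and serialises them both to XML and to table rows. The sentence, the field names (`id`, `type`, `source`, `target`) and the relation labels are illustrative assumptions, and the extraction step itself is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical output of an IE system for one sentence. The records are
# written by hand here; a real system would produce them automatically.
sentence = "James Gosling joined Sun Microsystems, located in Silicon Valley, in 1984."
entities = [
    {"id": "e1", "type": "PERSON", "text": "James Gosling"},
    {"id": "e2", "type": "ORGANIZATION", "text": "Sun Microsystems"},
    {"id": "e3", "type": "LOCATION", "text": "Silicon Valley"},
]
relations = [
    {"type": "works_for", "source": "e1", "target": "e2"},
    {"type": "located_in", "source": "e2", "target": "e3"},
]


def to_xml(sentence, entities, relations):
    """Serialise the extracted records as a small XML document."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = sentence
    ents = ET.SubElement(root, "entities")
    for e in entities:
        node = ET.SubElement(ents, "entity", id=e["id"], type=e["type"])
        node.text = e["text"]
    rels = ET.SubElement(root, "relations")
    for r in relations:
        ET.SubElement(rels, "relation", type=r["type"],
                      source=r["source"], target=r["target"])
    return ET.tostring(root, encoding="unicode")


def to_rows(entities, relations):
    """Flatten the relations into (subject, relation, object) table rows."""
    by_id = {e["id"]: e["text"] for e in entities}
    return [(by_id[r["source"]], r["type"], by_id[r["target"]])
            for r in relations]


xml_text = to_xml(sentence, entities, relations)
rows = to_rows(entities, relations)
```

The rows could be inserted into a database table unchanged, while the XML form keeps entities and relations in one self-describing file.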

Once data and information from different sources, including the internet, can be represented formally and in a structured way, data analysis and data mining techniques can be applied to discover useful patterns. For example, restructuring advertisements and sales listings on the internet can help advise and guide users when they shop. Extracting and restructuring recruitment and job-seeking postings helps analyse occupational information and employment trends, which supports both job seekers and employers.

Information extraction does not require the system to read and understand the content of a text document, but the system must be able to analyse the document and locate the relevant information it is looking for. IE techniques can be applied to any document collection from which we need to draw out the key information and related facts. Text repositories about a particular domain on the internet are a typical example: the information may reside in many different places and in many different formats. Surveys and applications in that domain benefit greatly if the relevant information is extracted, integrated into a unified form and represented in a structured way. Information from the internet can then be loaded into a structured database that serves various analysis and mining applications.

Current research on extracting information from text focuses on:

* Terminology extraction: finding the main relevant terms that express the semantics, content and topic of a document or a collection of documents.

* Named entity recognition: identifying objects and entities such as the names of people, companies and organizations, and place names.

* Relationship extraction: determining the relations between the entities recognized in a document, for example the location of an organization or company, or the workplace of a person.

For example, given the text "James Gosling joined Sun Microsystems, located in Silicon Valley, in 1984", the question is how IE methods and techniques can recognize the entities, the entity types and the relations between them, as follows:

* PERSON works-for ORGANIZATION: two entities are recognized, James Gosling and Sun Microsystems, and the relation between them is "works for".

* ORGANIZATION located-in LOCATION: two entities are recognized, Sun Microsystems and Silicon Valley, and the relation between them is "located in".

1.3 Information extraction (IE) and information retrieval (IR)

Information extraction finds the structured, required information within a document, whereas information retrieval finds the relevant documents, or relevant parts of documents, in a local repository such as a digital library, or on the internet, and returns them to the user in response to a specific query.

Intelligent text retrieval aims to optimize retrieval, or to find methods that return results that match or closely match the user's need. For example, given a user query, the system can identify which parts of a document (such as a paragraph or a sentence) fit the query; a more intelligent system can answer the user's query or question with the exact information.

1.4 Related research and applications

Most artificial intelligence systems depend heavily on their knowledge sources and inference mechanisms. Alongside inference capability, the richer the knowledge source, the better the system can respond. The Web is a huge, practically endless data store holding useful knowledge from many domains, updated and extended by people; however, this knowledge is scattered and exists in many different forms. The question is how to extract the necessary, useful knowledge and organize and manage it effectively so that it helps solve the problems people pose. The answer is to develop Web information extraction systems [8][9]. According to Dr. Alexander Yates of the University of Washington [3], Web Information Extraction (WIE) systems promise to bridge the gap between the Web and artificial intelligence. WIE helps build knowledge bases from the WWW, on which further research and applications can be based. Some typical examples of WIE research and applications follow.

Job search support [4]. When a user looks for a job with Google Search, the search engine clearly does not really understand or satisfy the user's requirements. What the user actually cares about includes: which companies are recruiting for a given position or occupation, information about the recruiting companies, whom to contact, each company's policies and benefits, feedback and reviews from past and current employees, and so on. All of this information needs to be extracted, aggregated and presented to the user systematically (Figure 1).

Figure 1: Information extraction supporting job search (source: [4])

Another application is extracting and filtering relevant information to optimize information search [4]. In the example in Figure 2, a user looking for jobs as a baker enters the string "baker job opening" into Google. Google returns many irrelevant results, such as job postings from the MtBaker school and the company Baker Hostetler. These have nothing to do with the job being sought, that of a baker. The system should instead return links to pages or companies recruiting bakers. In this case the task of IE is to extract the links relevant to the user's search need.

Figure 2: Job search based on a search engine (source: [4])

IE is applied to finding answers in question answering (QA) systems based on search engine results. A recent approach builds QA systems by analysing the results returned by search engines in order to find the exact answer to the input question. For example, if the user asks "Which city is the capital of Vietnam?", search engines return a large number of results, and the system must extract the answer the user expects, namely "Hà Nội" or "Thành phố Hà Nội" (Hanoi City). This is one way IE techniques are applied in QA (Figure 3).

Figure 3: Question answering based on search engine results

IE is applied in shopping support and recommendation systems. When a user wants to buy an item, the information of interest includes product information (prices at different shops, product quality, user feedback), supplier information (after-sales service, service quality, ...), and so on. Users have to spend a lot of time searching, extracting and aggregating this information by themselves before they can make a purchase decision. An IE system that extracts and aggregates information according to given requirements and criteria is therefore very valuable in such intelligent commerce systems.

IE is used to extract information from scientific papers, such as author names and titles from the paper header and the entries of the reference section, in order to build indexing and search systems for scientific papers. A widely used search system for scientific papers is CiteSeer (Figure 4).

Figure 4: The CiteSeer scientific paper search system

Another project, DBLP, at the University of Trier in Germany, builds a database of scientific papers from conferences and journals, together with links to researchers' home pages, to support searching for papers. According to its author, the database is built from proceedings and journals manually (students are hired to check and update the data). The DBLP database currently holds about 1.4 million papers from reputable conferences, journals and publishers such as ACM, IEEE, Springer and ScienceDirect (Figure 5).

Figure 5: The DBLP index database

1.5 Basic steps of an IE system

According to Dr. Diana Maynard [5], most IE systems generally carry out the following steps:

Preprocessing
* Format detection
* Tokenization
* Word segmentation
* Sense disambiguation
* Sentence splitting
* POS tagging

Named entity detection
* Entity detection
* Coreference

1.6 Information extraction methods

According to [1][5], current extraction methods fall into two main approaches: the knowledge engineering approach and the automatic training (machine learning) approach.

Knowledge engineering approach
* Based on rules and patterns built by hand.
* Developed by experienced linguists and domain experts.
* Relies on intuition and observation.
* Achieves better performance.
* Development can take a lot of time.
* Hard to adjust when something changes.

Automatic training approach
* Based on statistical machine learning.
* Developers do not need to be experts in the language or the domain.
* Needs a large amount of well-labelled training data.
* When something changes, the whole training set may have to be relabelled.

1.7 Evaluation methods

According to [1], the evaluation of information extraction tasks was addressed and attracted much attention at the Message Understanding Conferences (MUC), initiated and funded by the agency of the US Department of Defense that manages defense research projects. MUC invested in and encouraged research into new methods for information extraction. To evaluate extracted information, experts defined measures based on those used in information retrieval (IR): precision and recall.

Recall (R): the fraction of the correct information that is actually extracted, that is, the ratio of the number of correct answers found to the total number of correct answers possible:

$$R = \frac{tp}{tp + tn}$$

Precision (P): a measure of how reliable the extracted information is, that is, the ratio of the number of correct answers found to the total number of answers found:

$$P = \frac{tp}{tp + fp}$$

where
* tp: the number of correct results that were found
* tn: the number of correct results that were not found (usually called false negatives)
* fp: the number of results found that are not correct

P and R lie in the interval [0, 1], and the best value is 1. P and R are related and affect each other: lowering R can yield a higher P, and vice versa. When comparing or evaluating a system or a method, both P and R must be considered. According to Line Eikvil [1], comparing two parameters at once is neither simple nor easy. The two measures are therefore combined into a single new measure, the F-measure (F):

$$F = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

The parameter β determines the relative weighting of recall R and precision P. IE experts usually use β = 1 when computing F. P and R are then weighted equally, and the performance of a system at different values of R and P is summarized in a single number, which makes comparison easy.

With β = 1 the F-measure is:

$$F = \frac{2PR}{P + R}$$
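The measures in this section can be sketched in a few lines of Python, assuming the standard definitions P = tp / (tp + fp), R = tp / (tp + tn) and F = (β² + 1)PR / (β²P + R). The function names and the parameter name `missed` (the count this section calls tn) are illustrative choices, and the counts in the usage lines are made up.

```python
def precision(tp, fp):
    """Share of the extracted answers that are correct."""
    found = tp + fp
    return tp / found if found else 0.0


def recall(tp, missed):
    """Share of the correct answers that were extracted.

    `missed` is the number of correct answers not found (the text's tn).
    """
    possible = tp + missed
    return tp / possible if possible else 0.0


def f_measure(p, r, beta=1.0):
    """Combine precision and recall; beta = 1 weights them equally."""
    denominator = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denominator if denominator else 0.0


# Made-up counts: 8 correct answers found, 2 wrong answers found, 8 missed.
p = precision(tp=8, fp=2)
r = recall(tp=8, missed=8)
f1 = f_measure(p, r)
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; some evaluations prefer to leave the value undefined.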

CHAPTER 2: INFORMATION EXTRACTION PROBLEMS AND METHODS

2.1 Introduction

As noted, information extraction is a specialized research area within natural language processing. Its problems and methods therefore originate from, and resemble, the techniques used in natural language processing.

This chapter briefly surveys the problems related to extracting information from text (keywords, keyphrases, named entities, relations between entities, ...) and the corresponding approaches.

2.2 Keyphrase extraction

2.2.1 Introduction

Keyphrases are regarded as a main component, or a form of metadata, that expresses the content of a text document [7]. Most keyphrase extraction research aims to find good features for encoding text [19][20][21], for use in text classification, clustering, summarization and search systems. Different methods are needed depending on the characteristics of each language. Most are based on traditional natural language processing techniques such as text preprocessing, paragraph, sentence and word segmentation, syntactic analysis, semantic analysis, statistics and machine learning.

As far as we have observed, research on extracting phrases as features of Vietnamese text, for classification, summarization and document search systems, started in the 2000s. Commonly cited results include:

* Đinh Điền and Hoàng Kiếm (2001) on Vietnamese word segmentation [27].
* Hoàng Kiếm and Nguyễn Tuấn Đăng (2002) on finding frequent phrases for encoding and clustering Vietnamese text, based on phrase segmentation and n-gram statistics [26].
* Hoàng Kiếm and Huỳnh Ngọc Tín (2003) on extracting frequent phrases by analysing text with a list of Vietnamese function words and n-gram statistics [22][25].
* Đồng Thị Bích Thủy and Hồ Bảo Quốc (2003), who proposed n-gram phrases combined with a word list as encoding features for a Vietnamese text retrieval system [24].
* Đỗ Phúc and Hoàng Kiếm (2004) on finding frequent word sequences with suffix trees to extract main ideas for Vietnamese text summarization [23].

Earlier extraction work relied mostly on syntactic analysis, sentence splitting and tf*idf frequency statistics to extract phrases. The results are still not really good and contain a lot of noise (meaningless phrases, or phrases that do not convey the semantics of the document). Accurately identifying keyphrases, and determining the boundaries of keywords and keyphrases in Vietnamese documents, remains a hard problem and an active research topic.

For English, the classical approach is still tf*idf frequency, together with statistical machine learning algorithms and natural language processing techniques such as POS tagging and syntactic analysis combined with domain dictionaries. Algorithms widely used in the English keyphrase extraction community include KEA [17][18] and KIP [7][14].

2.2.2 Scope of application

The applications of keywords and keyphrases include the following:

* Large text repositories such as digital libraries are growing very fast, which increases the value of summary information.
* Helping users grasp the content of a document or a document collection.
* Information retrieval: describing the documents returned by a query.
* Guiding users' searches.
* Serving as the basis for search indexes.
* Serving as features in document classification and clustering.

Assigning keyphrases to documents: keyphrases are usually assigned by hand, that is, authors assign keyphrases to the documents they write, while professional indexers usually choose phrases from a predefined controlled vocabulary. This leaves a problem for documents that have no keyphrases. Manual assignment takes a lot of time and effort and requires domain expertise. Automatic extraction techniques are therefore much needed.

2.2.3 The automatic keyphrase generation problem

Keyphrase assignment: find and select, from a predefined controlled vocabulary, the keyphrases that best describe the document. The training data is a set of documents associated with each phrase in the vocabulary, from which a classifier is built.

Keyphrase extraction: use information retrieval and lexical processing techniques to select keyphrases from the document itself, rather than from a predefined controlled vocabulary.

2.2.4 The KEA algorithm

Turney (2000) is considered the first to tackle keyphrase extraction with supervised learning [15][16], whereas other studies used heuristics, n-gram analysis and methods such as neural networks [11][12][13]. KEA [17][18] is an algorithm for extracting keyphrases from text. KEA identifies a list of candidate phrases using lexical methods, computes feature values for each candidate, and then uses a machine learning algorithm to predict which candidates are keyphrases. KEA is currently regarded as simple and among the most effective algorithms for extracting keyphrases [6][11]. It uses Naïve Bayes for training and extraction.

According to its authors, KEA is language-independent. The algorithm can be summarized in the following steps:

Step 1: Candidate phrase extraction. KEA extracts candidate n-grams (1 to 3 words long) that do not begin or end with a stopword. For keyphrase assignment with a predefined vocabulary (controlled indexing), KEA keeps only the candidates that match terms defined in the vocabulary. KEA then removes stopwords from the resulting n-grams and reduces them to their stems (stemming).

Figure 7: The KEA algorithm (source: http://www.nzdl.org/Kea/description.html)

Step 2: Feature calculation. For each candidate phrase, KEA computes the following four feature values:

* TFIDF: how important a candidate is in the current document compared with the other documents in the collection. The higher its TFIDF, the more likely a candidate is to be a keyphrase.
* First occurrence: according to the authors, candidates that first appear near the beginning or the end of a document are more likely to be keyphrases.
* Phrase length: the number of words in the phrase. According to the authors, phrases of length 2 are usually preferred.
* Relatedness: the number of phrases in the candidate list that are semantically related to the current phrase, computed from the predefined vocabulary. The higher its relatedness, the more likely a candidate is to be a keyphrase.

Step 3: Training and model building. A model is built from a set of training documents whose keyphrases were assigned by their authors. Given the list of candidates identified by the n-gram, stopword removal and stemming techniques above, KEA marks each one as positive (a keyphrase) or negative (not a keyphrase). The model is built by computing the feature values described above for the positive and negative phrases, and it reflects the distribution of feature values for each kind of phrase.

Step 4: Keyphrase extraction. KEA uses the model built in step 3 and computes feature values for the candidate phrases, then computes the probability that each candidate is a keyphrase. The candidates with the highest-ranked probabilities are selected as keyphrases. The user can specify the number of keyphrases per document.

2.2.4.1 Selecting candidate phrases

Candidate phrases are selected in three steps:

Input cleaning: the input files are cleaned up and normalized, and the initial phrase boundaries are determined. The input string is split into tokens as follows:

* Punctuation marks, brackets and numbers are replaced by phrase boundaries.
* Apostrophes are removed.
* Hyphenated words are split in two.
* Remaining non-token characters are deleted, as are tokens that contain no letters.

The result is:

* A set of lines.
* Each line is a sequence of tokens, each containing at least one letter.
* Acronyms containing separators are kept as single tokens (for example C4.5).

Phrase identification: KEA considers all subsequences in each line and decides which are suitable candidate phrases. Some other methods try to identify noun phrases, but KEA uses the following rules:

* Maximum length: a candidate phrase usually has at most 3 words.
* A candidate phrase cannot be a proper name.
* A candidate phrase may not begin or end with a stopword.

All contiguous word sequences in each line are tested against these three rules. The result is a set of candidate phrases. For example:

Line: the programming by demonstration method

Candidate phrases:
* programming
* demonstration
* method
* programming by demonstration
* demonstration method
* programming by demonstration method
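The length and stopword rules can be sketched as follows. This is a simplified illustration, not KEA's implementation: the stopword list is a tiny made-up sample, the input is assumed to be an already cleaned line, and the proper-name rule and stemming are left out.

```python
# Tiny illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"the", "by", "of", "a", "an", "and", "in", "for", "to", "is"}


def candidate_phrases(line, max_len=3):
    """Return the word sequences of up to max_len words that neither
    begin nor end with a stopword, in order of increasing length."""
    tokens = line.lower().split()
    found = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue  # violates the stopword rule
            phrase = " ".join(gram)
            if phrase not in found:
                found.append(phrase)
    return found


candidates = candidate_phrases("the programming by demonstration method")
```

With `max_len=3` this yields the first five candidates of the example; the four-word candidate "programming by demonstration method" only appears when `max_len` is raised to 4.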

Stemming: the last step in identifying candidate phrases is stemming with the Lovins (1968) algorithm, which strips suffixes. This lets the system treat different variants of a phrase as the same phrase (for example, "cut elimination" becomes "cut elim"). The system also uses stemming when comparing the keyphrases output by KEA with the keyphrases defined by the author.

2.2.4.2 Feature calculation

Features are calculated for each candidate phrase and used in training and extraction. Two features are used: the tf*idf frequency and the position of the phrase's first occurrence.

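TF*IDF frequency (t): this feature compares the frequency of a phrase in a document with its frequency across the whole corpus. The fewer documents contain a phrase, the more likely it is to be a keyphrase of the current document. KEA creates a file that stores the document frequencies needed for this feature.

$$t = \frac{\mathrm{freq}(P, D)}{\mathrm{size}(D)} \times \left(-\log_2 \frac{\mathrm{df}(P)}{N}\right)$$

where
* freq(P, D) is the number of times phrase P occurs in document D
* size(D) is the number of words in document D
* df(P) is the number of documents in the corpus that contain phrase P
* N is the size of the corpus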
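A minimal sketch of this feature, assuming the form t = freq(P, D) / size(D) × (−log2(df(P) / N)) used in the KEA description. The function and parameter names are illustrative, and the counts are passed in directly rather than computed from a real corpus.

```python
import math


def tf_idf(freq_in_doc, doc_size, doc_freq, corpus_size):
    """TF*IDF of a phrase: term frequency in the document multiplied by
    the negative base-2 log of the share of documents containing it."""
    if doc_size <= 0 or not 0 < doc_freq <= corpus_size:
        raise ValueError("counts must be positive and doc_freq <= corpus_size")
    tf = freq_in_doc / doc_size
    idf = -math.log2(doc_freq / corpus_size)
    return tf * idf


# Made-up counts: the same phrase frequency scores higher when fewer
# documents in the corpus contain the phrase.
rare = tf_idf(freq_in_doc=5, doc_size=1000, doc_freq=10, corpus_size=1000)
common = tf_idf(freq_in_doc=5, doc_size=1000, doc_freq=100, corpus_size=1000)
```

A phrase that occurs in every document gets a score of zero, which matches the intuition that it cannot distinguish one document from another.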

First occurrence (d: distance): the second feature is the number of words that precede the phrase's first occurrence, divided by the size of the document (its total number of words). Its value lies in the interval [0, 1].

2.2.4.3 Training

The training step uses a set of training documents whose keyphrases have been identified by their authors. For each training document, the candidate phrases are identified and the feature values of each candidate are calculated. To reduce the size of the training set, the authors ignore phrases that occur only once in a document. Each candidate is labelled as a keyphrase or not a keyphrase according to the author-assigned keyphrases. Training produces a model that predicts the class of new examples from the values of the two features. The authors experimented with several machine learning methods and chose Naïve Bayes for KEA because this probabilistic method is simple yet gives quite good results.

2.2.4.4 Extracting keyphrases

To extract keyphrases from a new document, KEA identifies the candidate phrases and their feature values, then applies the model built during training. The model determines the probability that each candidate is a keyphrase. KEA then performs post-processing to select the best possible set of keyphrases.

When the Naïve Bayes model is applied to a candidate with feature values t (TF*IDF) and d (distance), the following two quantities are computed:

$$P[\mathrm{yes}] = \frac{Y}{Y + N}\, P_{\mathrm{TF \times IDF}}[t \mid \mathrm{yes}]\, P_{\mathrm{distance}}[d \mid \mathrm{yes}], \qquad P[\mathrm{no}] = \frac{N}{Y + N}\, P_{\mathrm{TF \times IDF}}[t \mid \mathrm{no}]\, P_{\mathrm{distance}}[d \mid \mathrm{no}] \qquad (1)$$

where
* Y: the number of phrases that are keyphrases (author-assigned)
* N: the number of candidate phrases that are not keyphrases

The overall probability that a candidate is a keyphrase is then:

$$p = \frac{P[\mathrm{yes}]}{P[\mathrm{yes}] + P[\mathrm{no}]} \qquad (2)$$
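The two quantities and the overall probability can be sketched as below, assuming the Naïve Bayes form P[yes] = Y / (Y + N) · P[t | yes] · P[d | yes], the analogous P[no], and p = P[yes] / (P[yes] + P[no]). The conditional probabilities are passed in as plain numbers; estimating them from training data is not shown, and all names and values are illustrative.

```python
def keyphrase_probability(p_t_yes, p_d_yes, p_t_no, p_d_no, n_yes, n_no):
    """Combine the class priors and the per-feature likelihoods.

    p_t_*  : likelihood of the candidate's TF*IDF value in each class
    p_d_*  : likelihood of its first-occurrence value in each class
    n_yes  : number of training phrases that are keyphrases (Y)
    n_no   : number of training phrases that are not keyphrases (N)
    """
    total = n_yes + n_no
    score_yes = (n_yes / total) * p_t_yes * p_d_yes
    score_no = (n_no / total) * p_t_no * p_d_no
    return score_yes / (score_yes + score_no)


# Made-up values: keyphrases are rare (20 of 1000 training phrases), but
# this candidate's feature values are far more typical of keyphrases.
p = keyphrase_probability(0.6, 0.5, 0.05, 0.1, n_yes=20, n_no=980)
```

Because keyphrases are a small minority of the candidates, the prior strongly favours "no", so a candidate needs feature values that are much more typical of keyphrases to reach a high p.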

Once the probability p has been computed, the candidates are ranked by this value. Two post-processing steps follow. First, TF*IDF is used as the tie-breaker when two candidates have the same probability p. Second, the authors remove from the list any phrase that is a subphrase of a higher-probability phrase. From the remaining list, the algorithm selects the r phrases with the highest probability, where r is the requested number of keyphrases.

2.2.5 The KIP algorithm

2.2.5.1 Idea

A noun phrase that contains keywords or keyphrases of a particular domain is likely to be a keyphrase in that domain. The more keywords or keyphrases a noun phrase contains, the more likely it is to be a keyphrase. The system builds in advance a lexical database that stores the keywords and keyphrases of a particular domain. The keywords in this predefined dictionary are used to compute a score, or weight, for each noun phrase. Candidates with higher weights are then selected as keyphrases.

2.2.5.2 Description of the algorithm

KIP consists of the following simple steps: extract candidate noun phrases from the input document, then examine the composition of each candidate and compute its score. Candidates with higher scores are selected as keyphrases.

The score of a noun phrase is based on the following factors:

* Its frequency of occurrence in the document
* Its composition (which words and subphrases it contains)
* How the words and subphrases that make up the noun phrase relate to the domain of the document
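One way these factors could combine is sketched below, assuming the product form the algorithm description arrives at: the phrase frequency multiplied by the summed dictionary weights of its single words and of its contiguous subphrases. The dictionaries, their weights, and the decision to count the whole phrase as one of its own subphrases are assumptions made for this illustration.

```python
def noun_phrase_weight(phrase, frequency, word_weights, subphrase_weights):
    """Score = frequency * (sum of word weights + sum of subphrase weights).

    Words and subphrases missing from the domain dictionaries weigh 0.
    """
    words = phrase.lower().split()
    total = sum(word_weights.get(w, 0.0) for w in words)
    # Every contiguous run of two or more words counts as a subphrase.
    for n in range(2, len(words) + 1):
        for i in range(len(words) - n + 1):
            total += subphrase_weights.get(" ".join(words[i:i + n]), 0.0)
    return frequency * total


# Hypothetical domain dictionary entries and weights.
word_weights = {"digital": 0.5, "library": 0.75, "system": 0.25}
subphrase_weights = {"digital library": 1.0}

score = noun_phrase_weight("digital library system", 3,
                           word_weights, subphrase_weights)
```

A candidate that contains a known domain keyphrase ("digital library") scores higher than one built from the same number of unknown words, which is the intuition stated in section 2.2.5.1.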

KIP has the following main components: a POS tagger, a noun phrase extractor, and a keyphrase extraction tool.

* POS tagger: KIP uses the widely used POS tagging method of Brill [32].

* Noun phrase extractor: based on the POS tags assigned in the previous step, it extracts noun phrases that match the pattern {[A]} {N} (A: adjective; N: noun; {}: repeated any number of times; []: optional).

* Keyphrase extraction: to weight noun phrases, the algorithm builds a dictionary of the keywords and keyphrases of a particular domain, with initial weights. The dictionary consists of two lists: a list of keyphrases (of one or more words) and a list of keywords (single words obtained by splitting the keyphrases in the first list).

The weight of a noun phrase is W_NP = F × S, where F is the frequency of the noun phrase in the document and S is the sum of the weights of the single words and of the possible combinations within the candidate phrase:

$$S = \sum_{i} W_i + \sum_{j} P_j$$

where
* W_i: the weight of a word in the noun phrase
* P_j: the weight of a subphrase in the noun phrase

The purpose of weighting all single words and subphrases is to determine whether a subphrase is a keyphrase already defined in the dictionary. If it is, the noun phrase under consideration becomes more important. KIP looks up the keyword and keyphrase lists in the domain dictionary to obtain the weights of the single words (W_i) and subphrases (P_j).

2.3 Named entity recognition

2.3.1 Concept

Named entity recognition (NER) is an information extraction task that locates, identifies and classifies elements in unstructured text into predefined entity categories such as names of persons, organizations, locations, time expressions, quantities, monetary values, percentages, and so on. Named entities have many applications, especially in text understanding, machine translation, information retrieval and automatic question answering.

2.3.2 Approaches and popular systems

Most current named entity recognition systems apply text mining and natural language processing techniques and follow one of these main approaches:

* Grammar-based techniques: the rules and grammars are built by hand with the help of linguists, which is time-consuming. The grammar rules have to change whenever the application domain or the language changes.
* Statistical learning models: less dependent on the language and not dependent on domain experts, but they need a large, well-prepared training set in order to build an optimal classifier.
* A combination of machine learning and natural language processing techniques.
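The hand-written-rule approach can be illustrated with a toy tagger that combines a gazetteer (a list of known names) with two regular-expression rules. Everything here is an assumption made for the example: the gazetteer entries, the title-based person rule, the year rule and the label names. Real rule-based systems use far larger resources and richer rule languages.

```python
import re

# Toy gazetteer: known names per entity type (illustrative entries only).
GAZETTEER = {
    "ORGANIZATION": ["Sun Microsystems"],
    "LOCATION": ["Silicon Valley"],
}
# Rule: a title followed by two or more capitalized words is a person.
PERSON_RULE = re.compile(r"\b(?:Mr|Ms|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)+)")
# Rule: a four-digit number starting with 19 or 20 is a year.
YEAR_RULE = re.compile(r"\b(?:19|20)\d{2}\b")


def tag_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                found.append((m.start(), m.group(), label))
    for m in PERSON_RULE.finditer(text):
        found.append((m.start(1), m.group(1), "PERSON"))
    for m in YEAR_RULE.finditer(text):
        found.append((m.start(), m.group(), "DATE"))
    return [(entity, label) for _, entity, label in sorted(found)]


tags = tag_entities(
    "Dr. James Gosling joined Sun Microsystems in Silicon Valley in 1984."
)
```

The sketch also shows the weakness noted above: each new domain or language needs new gazetteer entries and new rules, all written by hand.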

Popular named entity recognition systems today include:

* Stanford NER: builds the CRFClassifier based on the conditional random field (CRF) model.
* GATE-ANNIE: a subsystem of the GATE framework (General Architecture for Text Engineering), one of the largest projects of the Department of Computer Science at the University of Sheffield, UK. It is based on dictionaries, ontologies and hand-built rules for annotating elements in text. Named entities are identified as part of the annotation process.

2.4 Relation recognition

2.4.1 Concept

Research on entity and relation extraction has been funded and promoted by the MUC (Message Understanding Conferences) and ACE (Automatic Content Extraction) programmes. Relation extraction first drew attention at the 7th MUC in 1998 and has received increasing attention since. Relation extraction determines the semantic relations between entities in a text or a sentence, for example the location of an organization or company, or the workplace of a person. For example, from the text "James Gosling joined Sun Microsystems, located in Silicon Valley, in 1984" we can recognize the entities, the entity types and the relations between them as follows:

* PERSON works-for ORGANIZATION: two entities are recognized, James Gosling and Sun Microsystems, and the relation between them is "works for".

* ORGANIZATION located-in LOCATION: two entities are recognized, Sun Microsystems and Silicon Valley, and the relation between them is "located in".

2.4.2 Approaches and related research

Most relation extraction methods are rule-based, feature-based or kernel-based. Some related work:

* Rule-based and linguistic feature-based methods rely mainly on natural language processing techniques, linguistic rules, syntax, and lexical, syntactic and semantic features to identify relations. Typical systems are [28][29].
* Kernel methods rely on separate tree kernels to exploit structural features. Typical studies [30][31] build relation kernels over parse trees. The kernel matches nodes from the root down to the leaves, layer by layer from the top down, recursively.

Most current research focuses on extracting relations between named entities. Relations between unnamed entities, or between named and unnamed entities, have not yet received much attention. Ontology-based entity and relation extraction is an approach that currently attracts interest in the research community, and it is the approach taken in this project.

CHAPTER 3: METADATA EXTRACTION

3.1 Introduction

Metadata (title, author name, place of publication, year of publication, ...) is widely used in digital libraries to describe resources (books, newspapers, journals, documents, theses, dissertations, ...). Metadata makes it easy to classify and search for documents in a targeted way. In our view, in a knowledge representation model for text, metadata can be regarded as one component of the model, alongside other components such as keyphrases, entities and relations.

The digital libraries of educational organizations and universities keep expanding, with electronic resources that are diverse in genre, format and subject. Investing in standards, methods and software to organize, collect, classify, manage and exploit these documents effectively is necessary and useful, and many individuals, research groups and organizations have worked on it in recent years [33][34][40].

According to [35], the current standard for data exchange on the internet, adopted by the US national standards organization to replace older standards that are no longer suitable, is ANSI/NISO Z39.85-2001. Its core is a description of data using 15 fields known as Dublin Core Metadata. These are the most common and useful fields accompanying digitized documents exchanged over the internet.

Extracting and creating metadata for electronic documents helps arrange documents systematically and helps users find them easily. Creating metadata by hand costs a lot of time and effort. According to [41], it would take one person 60 years to create metadata for one million documents.

Our research aim is to find methods and build tools that identify the metadata components of an electronic document. Automatic metadata identification actively supports building a knowledge model of text documents and cataloguing electronic documents. With document metadata we can also look for links between documents. For example, once the metadata of a paper is known, we can find out which documents cite it and how often. On that basis each paper can be assigned a measure that helps rank papers in search results. In addition, the metadata of documents in a given domain can help enrich a domain ontology. For example, the metadata of computer science publications can be used to enrich a computer science ontology (Computer Science Ontology, CSOnt).

This chapter presents an approach to extracting metadata from scientific papers based on layout structure information and on rules built from patterns. We also build an automatic metadata extraction tool that can be used together with digital library software.

Section 3.2 presents the basic concepts of metadata. Section 3.3 introduces the Dublin Core Metadata standard, which is gradually being adopted in digital libraries and replacing earlier standards. Section 3.4 reviews research on automatic metadata extraction for organizing digital data. Section 3.5 presents our approach, the architecture of the extraction system, and the rules defined with JAPE Grammar and the ANNIE plug-in of GATE. Section 3.6 presents the experimental results of the proposed method and the tool.

3.2 The Concept of Metadata

Metadata is used to describe information resources. The term meta comes from a Greek word denoting something of a more fundamental or higher nature. The most general definition, and the one most widely used in the Information Technology community, is: metadata is data about other data, or more briefly, data about data.

In specific contexts, experts offer different views of metadata:

According to Chris Taylor, director of library information access services at the University of Queensland, metadata is structured data used to describe the characteristics of a resource. A metadata record consists of a number of predefined elements that describe the properties of the resource. Each element can have one or more values.

According to Dr. Warwick Cathro of the National Library of Australia, a metadata element describes an information resource or supports access to an information resource.

In short, metadata can be understood as information used to describe information resources.

3.3 The Dublin Core Metadata Standard

Dublin Core Metadata is a well-known metadata standard that is widely used by researchers and digital library experts. It was first proposed in 1995 by the Dublin Core Metadata Initiative. Dublin refers to Dublin, Ohio, USA, which hosted the OCLC/NCSA Metadata Workshop in 1995. Core means a list of core elements used to describe resources (element metadata); these elements can be extended.

According to [35], in September 2001 the Dublin Core element set was issued as a US standard, The Dublin Core Metadata Element Set, ANSI/NISO Z39.85-2001.

Dublin Core Metadata comprises 15 basic elements [35], described in detail in the table below.

Table: Basic elements of the Dublin Core Metadata standard

No. | Element | Description
1 | Title | Title or heading of the document.
2 | Creator | Author of the document, including both individual and corporate authors.
3 | Subject | Topic addressed by the document, used for classification. May be expressed as words or phrases (subject headings) or as classification numbers (classification scheme).
4 | Description | Summary or description of the document's content. May include an abstract, notes, a table of contents, or a passage clarifying the content.
5 | Publisher | Publisher or issuing body of the document. May be the name of a person, agency, organization, service, and so on.
6 | Contributor | Names of those who contributed to the content of the document. May be individuals or organizations.
7 | Date | Date on which the document was issued.
8 | Type | Nature of the document, described with terms for category types: home page, article, report, dictionary, and so on.
9 | Format | Physical presentation of the document. May include the medium, size or length, and data type (.doc, .html, .jpg, .xls, software, and so on).
10 | Identifier | Identifying information for the document, sources it references, or a resource locator string: URL (Uniform Resource Locator, beginning with http://), URN (Uniform Resource Name), ISBN (International Standard Book Number), ISSN (International Standard Serial Number), SICI (Serial Item & Contribution Identifier), and so on.
11 | Source | Provenance of the document: a reference to the source from which the described document was extracted or created. The source may also be a URL, URN, ISBN, ISSN, and so on.
12 | Language | Main language of the document.
13 | Relation | Information about related documents. May use a URL, URN, ISBN, ISSN, and so on.
14 | Coverage | Scope, scale, or extent of coverage of the document. The scope may be a place, a spatial or temporal range, coordinates, and so on.
15 | Rights | Information about the copyright of the document.
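To make the element set concrete, the following is a minimal sketch of how a record using these 15 elements might be held in memory and serialized as XML. It uses Python's standard `xml.etree.ElementTree` and the Dublin Core 1.1 element namespace. The wrapper element name `metadata`, the function name `to_dc_xml`, and the sample record are assumptions made for illustration and do not come from the report.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements, lower-cased as they appear in dc: XML tags.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dc_xml(record):
    """Serialize a dict of element name -> list of values as dc: XML."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:  # emit elements in the order of the table
        for value in record.get(name, []):  # each element may repeat
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
example = {
    "title": ["An Example Paper"],
    "creator": ["A. Author", "B. Author"],
    "date": ["2010"],
    "language": ["en"],
}
xml_text = to_dc_xml(example)
```

Storing every element as a list reflects the point made in Section 3.2 that an element can have one or more values, for example several creators.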

3.4 Metadata Extraction and Related Work

Metadata extraction is a narrower research area within information extraction. Most current metadata extraction methods fall into two main approaches: methods based on machine learning [10][36][38][42] and methods based on rules [39][41][43]. Both are applied in combination with dictionaries and ontologies as these have emerged and developed.

According to [36], typical machine learning methods for metadata extraction include logic programming, Hidden Markov Models, Support Vector Machines (SVM), and other statistical learning methods. In [36], the authors use SVM to extract metadata from scientific papers. Their extraction process has two steps. In the first step, they use SVM to classify the lines in the header of each document (everything above the introduction). In the second step, they extract metadata from the classified lines using rules on punctuation and capitalization combined with dictionaries. The experiments reported in [36] show that this method gives better results than the other machine learning methods they compared against.

In [38], the authors propose a metadata extraction method using Conditional Random Fields (CRF). Their experimental evaluation shows results comparable to those of the SVM method in [36]. Across [36] and [38], the reported Precision ranges from 86% to 99%, Recall from 45% to 100%, and accuracy from 96% to 100%, with results varying by metadata field.

In [42], the author builds a package named PDF2gsdl, which extracts only titles and authors from papers in PDF format. The package can be combined with the Greenstone digital library software to create metadata automatically for the documents in a digital library. The method applies machine learning, building a neural classifier that uses features such as layout information, font size, and position. Tested on a dataset of 45 papers taken from conference proceedings, it reaches an accuracy of about 93% for titles and about 70% for authors.

The machine learning methods mentioned above give quite impressive results for metadata extraction. However, creating a labeled training set for them costs considerable effort in sampling and labeling. This is the motivation for developing methods and systems based on rules, dictionaries, and ontologies [37][39][41][43].

In [37], the authors propose a method for extracting logical structure (title, authors, sections, definitions, theorems, and so on) from papers in mathematics. On that basis they build a browser that helps users read mathematical papers more easily. Their algorithm has two steps. First, it identifies special areas in the document (page numbers, headings, footnotes, and captions of tables and figures) using keywords, font styles, and spacing in the layout. Then it determines the detailed information in each area from its style, position, and layout. The authors experimented on 29 mathematical papers and report an accuracy of 93%.

In [39], the authors propose a method for enriching an ontology about artists by searching for and extracting related personal information (date of birth, place of birth, affiliation, date of marriage, work history, and so on) from internet search results. To do this, they split the text of the search results into sentences and use the GATE Framework to recognize entities such as PERSON, LOCATION, and TIME. They then combine these with an existing ontology, the Artequakt Ontology (CONCEPT-RELATION-CONCEPT) [39], to recognize the relations between those entities in the sentences.

Each approach has its own strengths and weaknesses. Machine learning methods require a lot of time for sampling and labeling, and good results need a large amount of training data. Rule-based and pattern-based methods are simpler and easier to implement, but good results also require substantial expert effort in surveying the data and defining rules. The rules must also be changed when new kinds of data appear that the existing rules cannot handle. In practice, the approach and method are chosen to suit the specific problem at hand.

3.5 My Approach

My approach builds rules and patterns from the structural and layout information of a document and combines them with the dictionaries, ontologies, and libraries available in GATE to extract metadata from scientific documents.

3.5.1 System Architecture

Figure 8: Architecture of the metadata extraction system

3.5.2 Rule-Based Metadata Extraction

Extracting metadata from the header of a scientific document

Figure 9: Steps for extracting metadata from the header of a paper

Extracting metadata from the reference section of a scientific document

Figure 10: Steps for extracting metadata from the reference section of a paper

3.5.3 JAPE Rules for Extracting Metadata from Scientific Papers

3.5.3.1 Rule for identifying the Abstract keyword

// One or more line breaks, then "Abstract" or "ABSTRACT",
// optionally with a trailing em dash (\u2014) or period.
Rule: AbstractKeyword
Priority: x
(
  ({SpaceToken.kind=="control"})+
  (
    {Token.string=="Abstract\u2014"} | {Token.string=="ABSTRACT\u2014"} |
    {Token.string=="Abstract"} | {Token.string=="ABSTRACT"}
  )
  ({Token.string=="."})?
):abstract_Keyword
-->
:abstract_Keyword.AbstractKeyword = {rule = "AbstractKeyword"}
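For readers without a GATE installation, the pattern matched by this rule can be approximated with a regular expression. The sketch below is a rough Python analogue and not the JAPE rule itself. It works on raw characters instead of GATE `Token` and `SpaceToken` annotations, and the function name `find_abstract_keyword` is hypothetical.

```python
import re

# Rough analogue of the AbstractKeyword rule: one or more line breaks,
# then "Abstract" or "ABSTRACT" as a whole word, optionally followed by
# an em dash (U+2014) and/or a period.
ABSTRACT_KEYWORD = re.compile(r"[\r\n]+(Abstract|ABSTRACT)(?!\w)\u2014?\.?")


def find_abstract_keyword(text):
    """Return the (start, end) span of the keyword itself, or None."""
    match = ABSTRACT_KEYWORD.search(text)
    if match is None:
        return None
    # Start at the keyword, not at the preceding line breaks.
    return match.start(1), match.end()
```

Requiring a preceding line break mirrors the `SpaceToken.kind=="control"` condition in the rule: the keyword must open a line, so an occurrence of the word in running text is not matched.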

3.5.3.2 Rule for identifying the References keyword

// One or more line breaks, an optional section number such as "7." or "7",
// then the word "References" in one of the listed spellings.
Rule: ReferencesKeyword
Priority: x
(
  ({SpaceToken.kind=="control"})+
  (
    {Token.kind=="number"}
    ({Token.string=="."})?
    ({SpaceToken.kind=="space"})+
  )?
  (
    {Token.string=="References"} | {Token.string=="REFERENCES"} |
    {Token.string=="reference"} | {Token.string=="REFERENCE"}
  )
):referencesKeyword
-->
:referencesKeyword.ReferencesKeyword = {rule= "ReferencesKeyword"}

3.5.3.3 Rule for splitting the references

// Marks the boundary between two reference entries. Alternatives 1 to 3
// match an entry label at the start of a line: "[...]", "(n)" or "n.".
// Alternative 4 matches a line break that follows the References heading,
// a period, a number or a year and is followed by a person name.
Rule:ReferencesBreak
Priority: x
(
  (
    {SpaceToken.kind=="control"}
    (
      (
        ({Token.string=="["})
        ({Token} | {SpaceToken.kind=="space"})+
        ({Token.string=="]"})
      ):referenceBreak_1
      |
      (
        ({Token.string=="("})
        {Token.kind=="number", Token.length < 3}
        ({Token.string==")"})
      ):referenceBreak_2
      |
      (
        {Token.kind=="number", Token.length < 3}
        {Token.string=="."}
      ):referenceBreak_3
    )
  )
  |
  (
    ({Token.string=="References"} | {Token.string=="REFERENCES"} |
     {Token.string=="."} | {Token.kind=="number"} | {Lookup.majorType=="year"})
    (({SpaceToken.kind=="control"})+):referenceBreak_4
    ({Person} | {Lookup.majorType=="person_first"})
  )
)
-->
:referenceBreak_1.ReferenceBreak_1 = {rule = "ReferencesBreak"},
:referenceBreak_2.ReferenceBreak_2 = {rule = "ReferencesBreak"},
:referenceBreak_3.ReferenceBreak_3 = {rule = "ReferencesBreak"},
:referenceBreak_4.ReferenceBreak_4 = {rule = "ReferencesBreak"}
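The combined effect of the References keyword rule and the first three alternatives of the splitting rule can be illustrated outside GATE. The following Python sketch is an approximation under stated assumptions: it works on plain text instead of GATE annotations, it handles only the "[12]", "(12)" and "12." entry labels, and it omits the fourth alternative, which relies on `Person` and `Lookup` annotations. The names `HEADING`, `MARKER`, and `split_references` are hypothetical.

```python
import re

# Heading line: optional section number, then "References" or "REFERENCES".
HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(?:References|REFERENCES)[ \t]*$", re.M
)

# Entry labels at the start of a line, mirroring the first three
# alternatives of ReferencesBreak: "[12]", "(12)" and "12.".
MARKER = re.compile(r"^[ \t]*(?:\[[^\]\n]+\]|\(\d{1,2}\)|\d{1,2}\.)", re.M)


def split_references(text):
    """Return the reference entries that follow the References heading."""
    heading = HEADING.search(text)
    if heading is None:
        return []
    body = text[heading.end():]
    starts = [m.start() for m in MARKER.finditer(body)]
    entries = []
    for begin, end in zip(starts, starts[1:] + [len(body)]):
        # Join the wrapped lines of one entry into a single line.
        entries.append(" ".join(body[begin:end].split()))
    return entries
```

Limiting numeric labels to one or two digits follows the `Token.length < 3` condition in the rule, which keeps a wrapped line that starts with a four-digit year from being mistaken for a new entry.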

3.5.3.4 Rule for identifying the email line

// A line containing an e-mail address. The optional first group covers the
// grouped form, where several names are listed inside braces before "@".
Rule:LineEmailAnnotation
Priority: x
(
  (
    {Token.string=="{"}
    (
      {Token}
      ({SpaceToken.kind=="space"})?
    )+
    ({SpaceToken.kind=="control"})?
  )?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )+
  (
    {Token.string=="@"} | {Address.kind=="email"} | {Token.string=="}"}
  )
  ({SpaceToken.kind=="space"})?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )+
):lineEmailAnnotation
-->
:lineEmailAnnotation.LineEmailAnnotation = {rule = "LineEmailAnnotation"}
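The rule above only marks a line as an email line; turning it into individual addresses is a separate step that the report does not show. A possible follow-up is sketched below in Python, assuming that the brace form means several local parts sharing one domain, as in `{alice, bob}@example.edu`. The function name and the sample addresses are hypothetical.

```python
import re

# "{alice, bob}@example.edu": several local parts sharing one domain.
GROUPED = re.compile(r"\{([^{}]+)\}\s*@\s*([\w.-]+\.\w+)")
# Plain "alice@example.edu" form.
PLAIN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def emails_in_line(line):
    """Return the e-mail addresses found on one header line."""
    found = []
    for names, domain in GROUPED.findall(line):
        for local in re.split(r"[,;\s]+", names.strip()):
            if local:
                found.append(f"{local}@{domain}")
    # Blank out grouped forms so their tail is not matched a second time.
    rest = GROUPED.sub(" ", line)
    found.extend(PLAIN.findall(rest))
    return found
```

Grouped addresses are returned first and plain addresses second, so the order in the result can differ from the order on the line when both forms are mixed.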

3.5.3.5 Rule for identifying the affiliation line

// A line that starts with an affiliation cue and runs to the end of the line.
// Cues: a keyword (Dept, University, Faculty), a gazetteer Lookup (location,
// organization, facility other than "Hall"), a number of three or more
// digits, or a pair of numbers joined by a dash.
Rule:LineAffiliationAnnotation
Priority: x
(
  (
    {Token.string=="Dept"} | {Token.string=="dept"} |
    {Token.string=="University"} | {Token.string=="university"} |
    {Token.string=="Faculty"} | {Token.string=="FACULTY"} |
    {Lookup.majorType=="location"} |
    {Lookup.majorType=="org_key"} | {Lookup.majorType=="org_base"} |
    {Lookup.majorType=="cdg"} | {Lookup.majorType=="facility_key", !Token.string=="Hall"} |
    (
      (
        {Token.kind=="number", Token.length>=3}
        {SpaceToken.kind=="space"}
      )
      |
      (
        {Token.kind=="number"}
        ({SpaceToken.kind=="space"})?
        ({Token.kind== "punctuation", Token.subkind =="dashpunct"})
        ({SpaceToken.kind=="space"})?
        {Token.kind=="number"}
      )
    )
  )
  ({SpaceToken.kind=="space"})?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )*
):lineAffiliationAnnotation
-->
:lineAffiliationAnnotation.LineAffiliationAnnotation = {rule = "LineAffiliationAnnotation"}

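3.5.3.6 Rule for splitting the authors in the author line

// An author is either a Person annotation or a run of tokens that
// contains no comma, no "and", and no number.
Rule: Author
Priority: 40
(
  (
    {Person}
    |
    (
      {Token.string!=",", Token.string!="and", Token.kind!="number"}
    )+
  ):author
)
-->
:author.Author = {rule= "Author"}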
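The fallback branch of this rule (a run of tokens with no comma, no "and", and no number) amounts to splitting the author line on those separators. A minimal Python sketch of that idea follows. It does not reproduce the `Person` branch, which depends on GATE's named entity annotations, and the function name `split_authors` is hypothetical.

```python
import re


def split_authors(author_line):
    """Split an author line on commas and the word 'and'."""
    # Drop digits, which often appear as affiliation markers after a name.
    cleaned = re.sub(r"\d+", " ", author_line)
    # \b keeps "and" inside a name such as "Sandra" from acting as a separator.
    parts = re.split(r",|\band\b", cleaned)
    return [" ".join(part.split()) for part in parts if part.strip()]
```

For example, `split_authors("Alice Smith1, Bob Jones2 and Carol Lee1")` yields the three names without their numeric markers.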

3.6 Experiments and Evaluation

For the experiments, we downloaded scientific papers from Computer Science digital libraries and journals such as ACM, Springer, IEEE, and Citeseer, and ran the evaluation on 200 of these papers. To evaluate the approach we use the traditional measures from information retrieval: Recall (R), Precision (P), and F-measure (F).

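$$P = \frac{tp}{tp + fp}; \qquad R = \frac{tp}{tp + fn}; \qquad F = \frac{2PR}{P + R}$$

where

tp: the number of correct results that were found

fn: the number of correct results that were not found

fp: the number of results that were found but are not correct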

The experimental results were measured on several of the main metadata attributes of the Dublin Core Metadata standard and are shown in the table below:

Metadata | Precision (%) | Recall (%) | F-Measure (%)
Title | 100.00 | 100.00 | 100.00
Authors | 92.72 | 89.47 | 91.07
Affiliation | 95.83 | 92.00 | 93.87
Email | 100.00 | 100.00 | 100.00
Abstract | 96.55 | 93.33 | 94.92
References | 97.44 | 88.05 | 92.51
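The three measures are straightforward to compute. The Python sketch below assumes the usual definitions, with `fn` standing for correct results that were not found, and takes F as the harmonic mean of Precision and Recall. That choice of F is consistent with the reported table: recomputing F from the rounded P and R values reproduces each reported F to within 0.01. The function names are hypothetical, and the `REPORTED` values are copied from the table.

```python
def precision(tp, fp):
    """Share of found results that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of correct results that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# (precision %, recall %, reported F %) taken from the results table.
REPORTED = {
    "Title": (100.00, 100.00, 100.00),
    "Authors": (92.72, 89.47, 91.07),
    "Affiliation": (95.83, 92.00, 93.87),
    "Email": (100.00, 100.00, 100.00),
    "Abstract": (96.55, 93.33, 94.92),
    "References": (97.44, 88.05, 92.51),
}

# Largest gap between the recomputed and the reported F-measure.
max_gap = max(abs(f_measure(p, r) - f) for p, r, f in REPORTED.values())
```

The small remaining gap is what one would expect if the reported F values were computed from unrounded Precision and Recall.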

CHAPTER 4: CONCLUSION AND FUTURE WORK

4.1 Conclusion

This report aims to find and build a knowledge model for text documents, and to extract the related knowledge components from text into that model, with a view to building a smarter search and retrieval system. It surveys the field of information extraction from text, including the methods, systems, and applications related to keyphrase extraction, metadata extraction, and the extraction of entities and the relations between them. The main research contribution is an approach for automatically extracting metadata from Information Technology papers published in conference proceedings and journals, based on patterns that use the neighboring context (prefix and suffix) of the component to be extracted. The results of the report can be summarized as follows:

Basic knowledge of information extraction from text.

Related research and application problems of information extraction from text.

Methods for extracting keyphrases, entities, and relations between entities, and methods for extracting metadata from scientific papers.

A proposed metadata extraction method based on rules and patterns combined with dictionaries and prefix and suffix information.

A dataset of Information Technology papers collected from journals and digital libraries such as ACM, IEEE, Springer, and CiteSeer for the experiments. The results obtained are comparable to those of machine learning methods (the detailed experimental results and their assessment are in Section 3.6 of Chapter 3).

Two papers published at international conferences (one at ICEMT 2010, organized by IEEE, and one at IT@EDU 2010) [44][45].

4.2 Future Work

Study and improve the methods for extracting keyphrases, entities, and relations from documents.

Build a knowledge model for text documents whose main components are metadata, keyphrases, entities, and relationships.

Build measures for the text knowledge model, to be applied in an intelligent document retrieval system (search and question answering).

REFERENCES

[1] Line Eikvil. Information Extraction from World Wide Web - A Survey. Norwegian Computing Center, PB, Citeseer. July 1999.

[2] Jim Cowie and Yorick Wilks. Information Extraction, 1996.

[3] Alexander Yates. Information Extraction from the Web: Techniques and Applications. PhD thesis, University of Washington, 2007.

[4] Kamal Nigam, Google Pittsburgh. Machine Learning for Information Extraction: An Overview, 2007. (Slides)

[5] Dr Diana Maynard, Computer Science Department, University of Sheffield. http://gate.ac.uk/g8/page/print/2/demos/talks/maynard_diana_01.wmv. (Slides & video)

[6] Eleni Mangina, John Kilbride. Evaluation of keyphrase extraction algorithm and tiling process for a document/resource recommender within e-learning environments. Edu Elsevier. 2008.

[7] Yi-fang Brook Wu, Quanzhi Li. Document keyphrases as subject metadata: incorporating document key concepts in search results. Inf Retrieval, Springer. 2008.

[8] Mo Chen, Jian-Tao Sun, Hua-Jun Zeng, Kwok-Yan Lam. A Practical System of Keyphrase Extraction for Web Pages. ACM SIGIR 2005.

[9] Raymond J. Mooney and Razvan Bunescu. Mining knowledge Using Information Extraction. ACM SIGKDD 2005.

[10] K. Seymore, A. McCallum, R. Rosenfeld, Learning hidden Markov model structure for information extraction, In: AAAI, Workshop on Machine Learning for Information Extraction, 1999.

[11] Su Nam Kim (University of Melbourne), Min-Yen Kan (National University of Singapore), Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles, Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP 2009, Singapore, 6 August 2009, ACL and AFNLP, pages 9-16.

[12] Niraj Kumar & Kannan Srinathan, Automatic Keyphrase Extraction from Scientific Documents Using N-gram Filtration Technique, Proceeding of the eighth ACM symposium on Document engineering. Information extraction in documents, 2008, page 199-208.

[13] Jiabing Wang et al, Ensemble Learning for Keyphrases Extraction from Scientific Document, in Advances in Neural Networks - ISNN 2006, Springer Berlin/Heidelberg, 2006, pages 1267-1272.

[14] Yi-fang Brook Wu, Quanzhi Li, Razvan Stefan Bot, Xin Chen, Domain-specific Keyphrase Extraction. CIKM05, October 31-November 5, 2005, Bremen, Germany, ACM-2005.

[15] P.D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval, vol. 2, no. 4, pp. 303- 336, 2000.

[16] P.D. Turney, Learning to Extract Keyphrases from Text. National Research Council, Institute for Information Technology, Technical Report ERB-1057, 1999.

[17] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin and C.G. Nevill-Manning. KEA: Practical automatic Keyphrase Extraction. The proceedings of Digital Libraries '99: The Fourth ACM Conference on Digital Libraries, pp. 254-255, 1999.

[18] Web link for KEA5.0 source code: http://www.nzdl.org./Kea/download.html

[19] Teuvo Kohonen, et al. Self-Organizing Maps, Third edition, Springer, 2002.

[20] A. Rauber, D. Merkl, and M. Dittenbach: The Growing Hierarchical Self-Organizing Map: Exploratory Analysis of High-Dimensional Data in: IEEE Transactions on Neural Networks, Vol. 13, No 6, pp. 1331-1341, IEEE, November 2002.

[21] Michael Dittenbach, Andreas Rauber, Dieter Merkl, Uncovering Hierarchical Structure in Data Using the Growing Hierarchical Self-Organizing Map, Institute of Software Technology, Vienna University of Technology, Vienna, Austria, 24 July 2002.

[22] Hoang Kiem Huynh Ngoc Tin. Organization, management and knowledge discovery from the English, Vietnamese text collection. Proceedings JCIS2003-USA. (7th Joint Conference on Information Sciences, September 2003, North Carolina, USA), page 1613-1616.

[23] Do Phuc, Hoang Kiem. Extracting the main ideas from Vietnamese text to support content summarization (in Vietnamese). Journal of Research and Development on Telecommunications and Information Technology, no. 13, 2004.

[24] Dong Thi Bich Thuy, Ho Bao Quoc. Applying natural language processing in an information retrieval system for Vietnamese text (in Vietnamese). University of Science, 2003.

[25] Huynh Ngoc Tin. Content management and knowledge discovery on Vietnamese text (in Vietnamese). Master's thesis, University of Science, VNU-HCM, 2003.

[26] Nguyen Tuan Dang. Mining Vietnamese text data with SOM (Self-Organizing Map) (in Vietnamese). Master's thesis, Faculty of Information Technology, University of Science, VNU-HCM, 2002.

[27] Dinh Dien, Hoang Kiem, Nguyen Van Toan. Vietnamese Word Segmentation. Proceedings of the NLPRS2001, Tokyo, Japan, 27-30 November 2001, pages 749-756.

[28] Scott Miller, Heidi Fox, et al. A Novel use of statistical parsing to extract information from Text, In 6th Applied Natural Language Processing Conference, 2000.

[29] Zhou GuoDong, Su Jian, et al. Exploring Various Knowledge in Relation Extraction. Proceedings of the 43rd Annual Meeting of ACL, pages 427-434, Association for Computational Linguistics, 2005.

[30] Dmitry Zelenko, Chinatsu Aone, Anthony Richardella. Kernel Methods for Relation Extraction. Journal of Machine Learning Research 3, pages 1083-1106, 2003.

[31] Razvan C. Bunescu, Raymond J. Mooney. Subsequence Kernels for Relation Extraction. In Advances in Neural Information Processing Systems, 2006.

[32] Brill, E. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4), 543-565, 1995.

[33] D. Bainbridge, J. Thompson, and I. Witten, Assembling and enriching digital library collections, In Proc. Joint Conference on Digital Libraries, pages 323-334, 2003.

[34] D. Bainbridge, K. J. Don, G. R. Buchanan, I. H. Witten, S. Jones, M. Jones, and M. I. Barr, Dynamic digital library construction and configuration, In Proc. European Conference on Digital Libraries, pages 1-16, 2004.

[35] http://www.nlv.gov.vn/nlv/index.php/en/2008060697/DUBLIN-CORE/XML-Metadata-va-Dublin-Core-Metadata.html

[36] H. Han, C.L. Giles, E. Manavoglu, H. Zha, Z. Zhang, E.A. Fox, Automatic document metadata extraction using support vector machines, In: Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 37-48. IEEE Computer Society Press, Washington, DC, 2003.

[37] K. Nakagawa, A. Nomura, and M. Suzuki, Extraction of Logical Structure from Articles in Mathematics, MKM, LNCS 3119, pages 276-289, Springer Berlin Heidelberg, 2004.

[38] F. Peng, A. McCallum, Accurate Information Extraction from Research Papers using Conditional Random Fields, Information Processing and Management: an International Journal, pages 963-979, 2006.

[39] H. Alani, S. Kim, D. E. Millard, M. J. Weal, P. H. Lewis, W. Hall and N. R Shadbolt, Automatic Extraction of Knowledge from Web Documents, In: 2nd International Semantic Web Conference - Workshop on Human Language Technology for the Semantic Web and Web Services, October 20-23, Sanibel Island, Florida, USA, 2003.

[40] J. Greenberg, K. Spurgin, A. Crystal, Final Report for the Automatic Metadata Generation Applications (AMeGA) Project, UNC School of Information and Library Science. http://ils.unc.edu/mrc/amega/, 2005. Last visited 30/04/2010.

[41] P. Flynn, L. Zhou, K. Maly, S. Zeil, and M. Zubair, Automated Template-Based Metadata Extraction Architecture, ICADL 2007, LNCS 4822, pages 327-336, Springer-Verlag Berlin Heidelberg, 2007.

[42] S. Marinai, Metadata Extraction from PDF Papers for Digital Library Ingest, 10th International Conference on Document Analysis and Recognition. ICDAR-IEEE, pages 251-255, 2009.

[43] B. A. Ojokoh, O. S. Adewale and S. O. Falaki, Automated document metadata extraction. Journal of Information Science, pages 563-570, 2009.

[44] Tin Huynh, Kiem Hoang. Automatic Metadata Extraction from scientific papers. Proceeding of IT@EDU, Phan Thiet, VietNam, 2010.

[45] Tin Huynh, Kiem Hoang. GATE Framework Based Metadata Extraction from Scientific Papers, Proceeding of ICEMT Egypt, IEEE, 2010.


http://gate.ac.uk/ie/

http://dblp.uni-trier.de/

http://en.wikipedia.org/wiki/DARPA

http://en.wikipedia.org/wiki/Named_entity_recognition

http://nlp.stanford.edu/ner/index.shtml

http://gate.ac.uk/ie/annie.html

http://dublincore.org/

http://www.library.uq.edu.au/iad/ctmeta4.html

http://www.nla.gov.au/nla/staffpaper/cathro3.html

http://dublincore.org/

http://www.greenstone.org/
