A MALAY STEMMING TECHNIQUE BASED ON
THE PORTER ALGORITHM
ANIZAH BINTI SAMSUDIN
This report is submitted in partial fulfilment of the requirements for the award of the
degree of Bachelor of Computer Science
FAKULTI SAINS KOMPUTER DAN SISTEM MAKLUMAT
UNIVERSITI TEKNOLOGI MALAYSIA
OCTOBER 2003
ABSTRAK
Stemming is a process carried out to obtain the root word of a given word. It is a
technique widely used for information searching, particularly in the field of
Information Retrieval (IR). Many algorithms have been developed for the stemming
process in order to obtain the required information accurately. Most stemmers in use
today are for English, given the widespread use of English, especially in Internet
search systems. Stemming for Malay is not yet widely used, and few stemming techniques
for it are generally known. Only a handful of studies on stemming for Malay could be
identified. This study performs stemming on affixed Malay words, namely words with
prefixes, suffixes and circumfixes. The objectives of the project are to study Malay
grammar, to study the capability of the Porter Algorithm in performing stemming, to
produce rules for Malay based on the Porter Algorithm, and to test and correct the
weaknesses that arise. The development methodology for this stemming technique
consists of six phases: the initial phase, the Malay grammar study phase, the Porter
Algorithm study phase, the rule production phase, the coding and implementation phase,
and the testing phase. The outcome of the study and of the implementation of Project II
is a stemming technique for Malay words based on the Porter Algorithm. The technique
successfully produces the correct root word for most forms of derived words obtained
from the Kamus Dewan Bahasa dan Pustaka, and it also resolves the understemming
problem.
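The stripping of prefixes, suffixes and circumfixes described above can be illustrated with a minimal sketch. The affix lists, the small root dictionary and the function name `stem` are assumptions made for illustration only and are not the rule set produced in this project. The sketch also validates candidates against a root dictionary, which is a simplification and differs from the measure-based conditions of the original Porter Algorithm.

```python
# Illustrative affix lists (assumed, far from complete).
PREFIXES = ("meng", "mem", "men", "me", "ber", "ter", "di", "ke", "pe", "se")
SUFFIXES = ("kan", "an", "i")

# Tiny hypothetical root dictionary used to validate candidate stems.
ROOTS = {"ajar", "main", "baca", "makan", "jalan", "adil"}


def stem(word, roots=ROOTS):
    """Return a root for `word` by stripping a prefix, a suffix, or both.

    Candidates are tried in this order: prefix only, suffix only, then
    prefix and suffix together (circumfix). The first candidate found in
    `roots` is returned; otherwise the word is returned unchanged.
    """
    word = word.lower()
    if word in roots:
        return word

    candidates = []

    # Prefix only, e.g. "bermain" -> "main".
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])

    # Suffix only, e.g. "makanan" -> "makan".
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[:-len(suffix)])

    # Circumfix (prefix and suffix together), e.g. "keadilan" -> "adil".
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = word[len(prefix):]
            for suffix in SUFFIXES:
                if rest.endswith(suffix):
                    candidates.append(rest[:-len(suffix)])

    for candidate in candidates:
        if candidate in roots:
            return candidate
    return word


if __name__ == "__main__":
    for w in ("bermain", "makanan", "keadilan", "membaca", "pelajaran"):
        print(w, "->", stem(w))
```

With these assumed lists, "pelajaran" is returned unchanged because no rule handles the "pel" prefix variant, so it is not conflated with "ajar". This is the kind of understemming case that additional rules would need to cover.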
ABSTRACT
Stemming is a morphological process of normalizing word tokens down to their
essential roots. It is widely used in information retrieval systems. A stemming
algorithm is a computational procedure that reduces all words with the same root to a
common form. Many types of stemming algorithms have been developed to improve
performance so that information can be retrieved effectively. Stemming is widely used
for the English language because English is an important language in search systems.
Stemming for the Malay language is not yet widely implemented, and only a few
techniques are commonly known. Only a few studies of stemming techniques for Malay
words could be identified. This research implements stemming for Malay words by
stripping prefixes and suffixes from the word. The study has four main objectives: to
study the grammar of Malay words, to study the capability of the Porter Algorithm, to
produce specific rules for Malay words according to the Porter Algorithm, and to test
the technique and correct the problems that surface. The methodology for producing
this technique has six phases: an initial phase, a study of Malay word grammar, a study
of the capability of the Porter Algorithm, production of the specific rules for Malay
words, implementation of the technique in program code, and finally testing and
improvement of the technique to achieve high performance. The output is a stemming
technique for Malay words based on the Porter Algorithm. The performance of this Malay
stemming algorithm was tested using a test collection of words extracted from the
Dewan Bahasa dan Pustaka dictionary. The results of this study show that the algorithm
successfully stemmed almost all affixed Malay words and also overcame the
understemming errors encountered in this project.
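Understemming occurs when words that share a root are not reduced to the same stem. The following minimal sketch shows one way such errors could be detected on a test collection. The function names `understemming_groups` and `strip_suffix_only`, the word groups and the deliberately weak suffix-only stemmer are all assumptions for illustration; the evaluation procedure actually used in this project is not described in this abstract.

```python
def understemming_groups(groups, stem):
    """Return the groups whose members do not all reduce to one stem.

    `groups` maps an expected root to a list of words derived from it.
    `stem` is any function that maps a word to its stem.
    """
    failures = {}
    for root, words in groups.items():
        stems = {stem(word) for word in words}
        if len(stems) > 1:
            # More than one distinct stem: the group was not conflated.
            failures[root] = sorted(stems)
    return failures


def strip_suffix_only(word):
    """Toy stemmer (assumed): removes one suffix and ignores prefixes."""
    for suffix in ("kan", "an", "i"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical test collection grouped by expected root.
GROUPS = {
    "ajar": ["ajar", "ajaran", "diajar"],
    "main": ["main", "mainan"],
}

if __name__ == "__main__":
    print(understemming_groups(GROUPS, strip_suffix_only))
```

Because the toy stemmer ignores prefixes, "diajar" is left unchanged and the "ajar" group is reported, while the "main" group is conflated correctly. Passing a stemmer that also strips prefixes would be expected to reduce the number of reported groups.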