a unified framework for automatic …...search engine knowledge portal processing www unstructured,...

50
Natural Language Processing and Intelligent Information System Technology Research Laboratory 1 A UNIFIED FRAMEWORK FOR AUTOMATIC A UNIFIED FRAMEWORK FOR AUTOMATIC METADATA EXTRACTION METADATA EXTRACTION FROM ELECTRONIC DOCUMENT FROM ELECTRONIC DOCUMENT Asanee Kawtrakul, Chaiyakorn Yingsaeree and Team NAiST Research Laboratory Dept of Computer Engineering, Faculty of Engineering Kasetsart University, THAILAND 26 August 2005, Nagoya IADLC 05 (International Advanced Digital Library Conference)

Upload: others

Post on 08-Jul-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory1

A UNIFIED FRAMEWORK FOR AUTOMATIC A UNIFIED FRAMEWORK FOR AUTOMATIC METADATA EXTRACTION METADATA EXTRACTION

FROM ELECTRONIC DOCUMENTFROM ELECTRONIC DOCUMENT

Asanee Kawtrakul, Chaiyakorn Yingsaeree and Team

NAiST Research LaboratoryDept of Computer Engineering, Faculty of Engineering

Kasetsart University, THAILAND

26 August 2005, Nagoya

IADLC 05(International Advanced Digital Library Conference)

Page 2: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory2

GoalGoal

Page 3: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory3

Ontology

Collection &Collection &AcquisitionAcquisition

MetadataExtraction Tools

Texts

ImageTexts

Multimedia DatabaseMultimedia Database(e.g. (e.g. DSpaceDSpace))

AccessAccess

ServicesServices

DocumentClustering

DeliveringSystem

Query

NAiST OCRThai OCR

KU Search EngineSmart Search Engine

DSpacePosgres

SQL

Title=Ginger

Domain=Plant

C F A D B E

Title=Cabbage Title=Cucumber

…Tracking by Domain, Title

Author = KU

C F A D B E

…Title=Ginger Title=Cabbage Title=Cucumber

Tracking by Author, Title

C F

A

D

B E

Document

Ontology

KW. TrackingProcess

Title=Cabbage

A D

Author=Doae Author=KU

…Tracking by Title, Author

Multi-viewpoint Knowledge Tracking

1 23 4 5

Computer

Author

A B C

2000 2002 2004 2001 2002

1 2 34 5

Computer

Year

2000 2001 2002

A B C A C

Another Knowledge Gain :-Author B is a new researcher.-Author C publishes papers continuously-Author A do not publish in year 2001-And more...

Another Knowledge Gain :-Author C is only one who published in year 2001-Author A and B are pioneer researchers in domain.-And more ...

Knowledge Tracking : Different Tracking Paths (Same Documents)Knowledge Tracking : Different Tracking Paths (Same Documents)

Page 4: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory4

TodayToday’’s Outlines Outline

IntroductionProblemsArchitectureCurrent StatusConclusionOngoing Projects

Page 5: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory5

IntroductionIntroductionWhat is metadata?

Data about dataEx:

About document:Traditional library card catalogue, About content: purpose, problem spaces, methodologies, and results.

Why is it important?Help people distinguish relevant from non-relevant documents,Multi-view point of Knowledge Tracking

Page 6: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory6

Examples of MetadataExamples of MetadataSome Meaning Procedures of Ontological Semantics

Marjorie McShane,Stephen Beale and Sergei Nirenburg

Institute of Language and Information TechnologiesUniversity of Maryland Baltimore Country

{marge,sbeale,sergei}@umbc.edu

Title

Authors

Affiliation

Graduated Student

Graduated YearSystematic & Fixed order

Page 7: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory7

Introduction (2)Introduction (2)

Where does it come from?By Human

Annotating the document manuallyBy Computer

Metadata HarvestingMetadata Extraction

Page 8: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory8

Introduction (3)Introduction (3)

Metadata HarvestingCollect metadata from previously defined metadata Usually performed by creating a parser to analyze source metadata and transform parsing results into an appropriated formatApplication includes interoperability between metadata of different systems and platforms

Page 9: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory9

Introduction (4)Introduction (4)

Metadata ExtractionExtract metadata from document contentUsually performed by machine learning, rule-based parser and Regular Expression Machine learning approaches are robust and adaptable, but require a large training exampleRule-based parsers and Regular Expression are dependent on an application domain, and no training example is required

Page 10: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory10

Introduction (5)Introduction (5)

ObjectiveCreate a framework for automatic metadata extraction from technical and thesis documents which have fixed format.

SolutionUse rule-based parser due to simplicity and cost

Page 11: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory11

ProblemsProblems

Variety of electronic document formatsE-Document can be stored in a variety of formats

e.g. Microsoft Word, Adobe Acrobat, Image of document, etc.

It is necessary to convert such document into text file in order to access document content

Quality of extracted metadataExtracted metadata may contain errors both from original documents and text conversion process,Some mechanisms are required to produce high-quality metadata

Page 12: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory12

ArchitectureArchitecture

Text ConversionModule

Text ConversionText ConversionModuleModule

Task-Oriented ParserModule

TaskTask--Oriented ParserOriented ParserModuleModule

Data VerificationModule

Data VerificationData VerificationModuleModule

ExtractedMetadata

CorrectedMetadata

LanguageModel

Dictionary ExistingMetadata

IDENTIFY&CORRECT ERRORS from- task-oriented parser

-controlled vocabularies-general vocaburaries

- existing metadata repository

Author’s Name Tanyaratana Dumka

Thesis’s Title Differential Expression of 1-Amniocyclopropane-1-……

Degree Master of Science

Major Field Genetic Engineering

Page 13: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory13

Text Conversion Module (1)Text Conversion Module (1)

AFPL Ghostscript (for PS & PDF)CATDOC (for Microsoft Word & Excel)OCR (for Image Document)

Document Skew CorrectionMarginal Noise RemovalSalt-and-Pepper Noise RemovalBroken Character Management

Page 14: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory14

Text Conversion Module (2)Text Conversion Module (2)

Skew

Marginal Noise

Salt-and-Pepper Noise

Broken Character

Page 15: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory15

Optical Character RecognitionOptical Character Recognition

Conversion from Image to Text

Optical Character Recognition

Optical Character Recognition

หญาแฝกหอมหรือแฝกลุม 4 พันธุ1. พันธุศรีลังกา ดินลูกรัง2. พันธุกําแพงเพชร 2 ดิน

ทรายถึงลูกรัง3. พันธุสุราษฎรธานี ดินรวน

เหนยีวถึงลูกรัง4. พันธุสงขลา 3 ดินรวนเหนยีวถึง

ลูกรัง

Page 16: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory16

Optical Character RecognitionOptical Character Recognition

Character Segmentation

Line Segmentation

Character Recognition

Text Generationหญาแฝกหอมหรือแฝกลุม 4 พันธุ

1. พันธุศรีลังกา ดนิลูกรัง

2. พันธุกําแพงเพชร 2 ดนิทรายถึงลูกรัง

3. พันธุสุราษฎรธานี ดนิรวนเหนียวถึงลูกรัง

4 พันธสงขลา 3 ดนิ

Page 17: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory17

Optical Character RecognitionOptical Character Recognition

Character Segmentation

Line Segmentation

Character Recognition

Text Generation

Page 18: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory18

Optical Character RecognitionOptical Character Recognition

Character Segmentation

Line Segmentation

Character Recognition

Text Generation

Page 19: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory19

Optical Character RecognitionOptical Character Recognition

Character Segmentation

Line Segmentation

Character Recognition

Text Generation ห ญ า

Page 20: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory20

Optical Character RecognitionOptical Character Recognition

Character Segmentation

Line Segmentation

Character Recognition

Text Generation

… ห อ ม ห ร อ แ ฝ ก ล ุ

หญาแฝกหอมหรือแฝกลุม 4พันธุ

Page 21: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory21

Optical Character RecognitionOptical Character Recognition

Page 22: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory22

Automatic Metadata ExtractionAutomatic Metadata Extraction

Needs Resources and Cost

…ฯลฯ...

การสรางดัชนี

การประมวลผลภาษาธรรมชาติ

คําสําคัญ

การสรางดัชนีหนังสืออัตโนมัติชือเรื่อง

นายวีร สัตยมาศตรผูแตง

Page 23: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory23

Extraction Meta Data for eExtraction Meta Data for e--thesisthesis

Student’s Name

Graduate YearThesis Title

Degree Name

Department Name

Advisor’s PositionISBN NumberAdvisor’s Name

Advisor’s Degree

Page 24: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory24

Automatic Metadata ExtractionAutomatic Metadata Extraction

TaskTask--OrientedOrientedParserParser

<sentence> :- <name> <year><name> :- <firstname> <lastname><firstname> :- [A-Z][a-z]+<lastname> :- [A-Z][a-z]+<year> :- [0-9]+

Regular ExpressionsRegular Expressions

Tanyaratana Dumkua 2000

<firstname>

<sentence>

tanyaratana Dumkua 2000

<lastname> <year>

<name>

Firstname: TanyaratanaLastname: DumkuaYear: 2000

Page 25: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory25

Extraction Result for eExtraction Result for e--thesisthesis

Name: อบุลวรรณ

Surname: นนทพันธุ

Year: 2543

Topic: การเรงปฏิกิริยาการยอยสลายมูลฝอยเศษอาหารในกระบวนการหมักแบบไรออกซเิจน

Major: วิทยาศาสตรสิ่งแวดลอม

Department: โครงการสหวิทยาการระดบับัณฑิตศึกษา

… ….

Page 26: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory26

Data Verification Module (1)Data Verification Module (1)

Error from Task-Oriented Parser ModuleControlled VocabulariesGeneral Vocabularies

Error in Existing Metadata Repository

Page 27: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory27

Data Verification Module (2)Data Verification Module (2)

Error from Task-Oriented Parser ModuleThe parser might not be able to parse some documents due to incomplete grammar, error from text conversion, or defect in the document itselfTo solve the problem, either creating new rules or fixing the defect is required

Page 28: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory28

Data Verification Module (3)Data Verification Module (3)

Error in Controlled VocabulariesSome metadata fields’ value can be only a word(s) in controlled vocabulariesError identification can be achieved by comparing extracted data with a dictionaryWhen error occurs, the correction process simply replace the error word with its closest word in the dictionary by means of Edit Distance

Page 29: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory29

Data Verification Module (4)Data Verification Module (4)

Error in General VocaburariesUse spelling correction technique to detect and correct the errors

OCR Error CorrectionTyping Error Correction

This module is under development

Page 30: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory30

Data Verification Module (5)Data Verification Module (5)

Error in Existing Metadata RepositoryHand-made metadata usually contained many errorsInstead of manually correcting the error, we can use automatic metadata extraction and alignment tool to ease data correction process

Page 31: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory31

Current StatusCurrent Status

Page 32: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory32

Extracting metadata from studentsExtracting metadata from students’’ thesis abstract (1)thesis abstract (1)

Page 33: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory33

Extracting metadata from studentsExtracting metadata from students’’ thesis abstract (2)thesis abstract (2)

The preliminary results with 3,712 thesis show that using this system greatly reduce the labor work of metadata creation process by correctly extracting metadata 91.41% of the documents.

Page 34: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory34

Extracting plant information from image of Thai Extracting plant information from image of Thai plant name dictionary(1)plant name dictionary(1)

Genusname

Family-Subfamily name

Specific epithet

Epithet’sauthor name

English pronunciation

Thai name

Plant habits

Province

Page 35: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory35

Extracting plant information from image of Thai Extracting plant information from image of Thai plant name dictionary (2)plant name dictionary (2)

Page 36: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory36

ConclusionConclusion

A Unified Framework for Automatic Metadata Extraction from Electronic DocumentConsists of three main components

text conversion moduletask-oriented parser moduledata verification module

The experimental result shown that using the framework greatly reduce the labor work of metadata creation process

Page 37: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory37

Ongoing ProjectsOngoing ProjectsAgricutural Knowledge Portal

Ontology MaintenanceInformation extractionKnowledge Mining

Open source Digital LibraryKnowledge collecting, sharing and

Accessing (DSpace)Library System Management (Koha)

Page 38: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory38

IntelligentSearch Engine

Knowledge Portal Processing

WWW

Unstructured,Semi-structured,

StructuredDocument

Meta DataAnnotation tools

KnowledgeStructure

Thai AGRISCorpus

Agricultural Information Bases

Real-World Ontology

OntologyTask Oriented

Ontology

MultilingualDictionary

MT KT

Rice

Diseases&How to protect?

How to plant in

the winter?

Follow up the price

etc.

Yield

Page 39: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory39

Ontology Development for Ontology Development for Enhancing ServiceEnhancing Service

Task Oriented ontology

disease control

cause from pathogen

cause from environment

Plant Diseases symptomcauseTreatment

ScorchBlight

. . .IS-A relation

concepts

instances

specific relations(e.g. Cause, hasSymptom)

. . .

Page 40: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory40

Dictionary Characteristic

Automatic Ontology ConstructionAutomatic Ontology ConstructionSystem Architecture

Ontology

Structured CorpusUnstructured Corpus

Raw Text Dictionary Thesaurus

Morphological Analysis

Term Extraction

Structure Analysis

Database Conversion

Thesaurus Recycling

Organizing System

VerificationSystem

Semantic Relation

IdentificationLexico-Syntactic Pattern,

Grammartical Rules, Heuristic Rules

AGROVOC Thesaurus

Cereals BT Plant ProductNT Oats

RiceMaize

Thai Plant Name Dictionary

Chirita GESNERIACEAEfulva Barnett H ดาดหอย Dathoi (Nakhon Si Thammarat).involucrata Craib H น้ําดับไฟ Nam dap fai (Surat Thani); มะและ Malae (Pattani).micromusa B. L. Burtt H คําหยาด Kham yat (Nakhon Ratchasima).Chisocheton MELIACEAEceramicus (Miq.) CDC. T ยมใหญ Yomyai (General).cumingianus (CDC.) Harms subsp. balansar (C.DC.) Mabb.T ยมมะกอก Yom makok (Chiang Mai).

Family/SubfamilyGenus

Specific epithet

Local Name

Habit

Formal Name

Author Name

Raw Text Example

ผกักาดหอมผกักาดหอมเปนผกัที่ใชบริโภคสวนใบ เปนผกัจําพวกผกัสลดัที่มี

คุณคาทางอาหารสูง นิยมบริโภคกันแพรหลายทีส่ดุในบรรดาผกัสลดัดวยกัน โดยสวนใหญนิยมรับประทานสดและนํามาประกอบอาหารหลายชนิด คนไทยนิยมใชผกักาดหอมกินกบัอาหารจาํพวกยําตางๆ สาคูหมู หรอืขาวเกรยีบปากหมอ เปนตน ประโยชนของผักกาดหอมนอกจากจะใชกินเปนผกัสดที่มีคุณคาทางอาหารสูงแลว ยังจัดเปนอาหารทางตาดวยโดยการนํามาตกแตงอาหารใหมีสสีันสวยงามนารับประทานมากขึ้น นอกจากนี้ผกักาดหอมยังมีคุณสมบัติในการเปนยาอีกดวย ความตองการผกักาดหอมมีอยูตลอดทั้งป โดยเฉพาะในชวงเทศกาลตางๆ จึงนับไดวาผกักาดหอมเปนผกัที่มคีวามสําคญัทางเศรษฐกจิชนิดหนึง่ที่นับวันจะทวีความตองการเพิ่มขึ้นเรื่อยๆ

ผกักาดหอมมีชื่อเรยีกอื่นๆ ไดหลายชื่อเชน ภาคเหนอืเรยีกวา ผักกาดยี ภาคกลางเรยีกวาผกัสลดั เปนตน ผักกาดหอมเปนพืชที่จัดอยูในตระกลู Compositae มีชื่อวิทยาศาสตรวา Lactuca sataiva มีถิ่นกําเนดิในทวีปเอเชยีและยุโรป มีปลูกในประเทศไทยมาชานานแลว

Page 41: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory41

Corpus Corpus based based Ontology Ontology ConstructionConstruction

Problems in this process:Many Candidate Terms

Ex1. Many herbs can be used as medicine and some of them are manufactured in the industry level, such as garlic, ginkgo biloba.

Candidate Terms => herbs, medicine, industry

NP1... NP2... NP3... such as NP, NP, ...

Ex2. Sun flower is rather enduring with dry season while comparing to other field crops such as corn, soy beanand green bean.

Candidate Terms => Sun flower, field crop

NP1... NP2... NP3... such as NP, NP, ...

Page 42: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory42

Ontological Term SelectionOntological Term Selection

MI (w1, w2) = log2 P(w1,w2)P(w1) P(w2)

Where w1 is a candidate term w2 is a related termP(wi) is probability of term wiP(wi, wj) is probability of co-occurrence of term wi and wj

• Statistical Technique– Mutual Information, the measure of word association

– ExampleMany herbs can be used as medicine and some of them aremanufactured in the industry level, such as garlic, ginkgo biloba

MI(herb, garlic) > MI(medicine, garlic), MI(industry, garlic) Results : HYPO(garlic, herb)

Page 43: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory43

Structured Corpus

Dictionaryin Electronic Format

Structure Analysis

Database Conversion

Forest Ontology

Printed Dictionary

OCRSystem

Dictionary Dictionary based Ontology based Ontology ConstructionConstruction

Dictionary Characteristic

•Technique:Applied task

oriented parser to extract relation terms by alphabet characteristic and position of terms

Family/SubfamilyGenus

Specific epithet

Local Name

Habit

Formal Name

Author Name

Page 44: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory44

Dictionary Dictionary based based Ontology Ontology ConstructionConstruction

Alphabet Characteristic of Dictionary.

Feature Database field Example

All upper case Family/Sub-Family EUPHORBIACEAE

Start with upper case Genus Acalypha

All lower case Specific epithet brachystachya

Thai alphabet with bold font

Formal Name ตําแยดอยใบบาง

Thai alphabet Local Name เกี้ยวเกลา

Limitation:Dictionary has only plant names

Page 45: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory45

AGROVOC Thesaurus AGROVOC Thesaurus based Ontology Constructionbased Ontology Construction

Technique:Convert BT/NT to IS-A Relation

Cereals BT Plant ProductNT Oats

RiceMaize

Plant Product

Cereals

Oats MaizeRice

IS-A

IS-A IS-AIS-A

Page 46: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory46

Experimental ResultsExperimental Results

By random checking with 1,000 united terms, the accuracy of the system is 87 %.

Source Number of Terms

Number of Relations

Accuracy

Raw Text (150 doc.) 3,720 3,312 73 %.

Dictionary 37,110 21,620 100%.

Thesaurus 27,540 15,628 91%.

3 Sources 43,073 31,387 87 %.

Page 47: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory47

Page 48: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory48

DeploymentDeployment

Knowledge portal as

One Stop Service

Better living condition of Better living condition of AgricultureAgriculture

Page 49: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory49

Finally,Finally,We have just initiated an open source Digital Library since it will be the back bone of e-learning for both formal and informal education. Especially, for informal education, we should thinking about extension to the root of grass such as farmers and also organization – workers for becoming Knowledge -workers.This open source DL will be added more advanced features such as assistant tools for collecting Knowledge, automatic cataloging, automatic indexing, information extraction and so on.

Knowledge based Society and Economy,

Acadamic Knowledge Factory and Knowledge Park

Page 50: A UNIFIED FRAMEWORK FOR AUTOMATIC …...Search Engine Knowledge Portal Processing WWW Unstructured, Semi-structured, Structured Document Meta Data Annotation tools Knowledge Structure

Natural Language Processing and Intelligent Information System Technology Research Laboratory50

AcknowledgementAcknowledgementKURDI: Kasetsart University Research and Development InstituteGraduate School of Kasetsart UniversityIADLC2005 Chairs and Organizer