
Post on 26-Dec-2015


Page 1: Natural Language Processing Verbatim Text Coding and Data Mining Report Generation Josef S.W. Leung (j.leung@ieee.org) Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)

Natural Language Processing

Verbatim Text Coding and Data Mining Report Generation

Josef S.W. Leung (j.leung@ieee.org)

Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)

NLP: One of the Top Priority Funding Items in Computer Science Research -- National Natural Science Foundation, China

Page 2:

Language

• Listen (Understand)

• Speak (Generate)

Page 3:

Natural Language Processing

Natural Language -> Analysis/Understanding -> Internal Representations

Internal Representations -> Generation -> Natural Language

Page 4:

Outline of Presentation

• NLP Introduction

– Natural Language Analysis/Understanding

– Natural Language Generation

• Case 1: Verbatim Text Coding

– May need NL analysis techniques

• Case 2: Data Mining Report Generation

– May need NL generation techniques

Page 5:

Modules of NL Understanding

Input sentence -> Pre-processing -> Tokens -> Parsing -> Syntactic structure -> Semantic Interpretation -> Semantic representation -> Contextual Interpretation -> Knowledge representation

Page 6:

Parsing for Syntactic Analysis

Grammar Rules:

S -> NP + VP

NP -> ART + N

VP -> V + NP

Lexicon:

N -> dog

N -> cat

V -> chased

ART -> the

Page 7:

Syntactic Structure

Sentence: the dog chased the cat

(S (NP (ART the) (N dog))
   (VP (V chased)
       (NP (ART the) (N cat))))
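The grammar rules and lexicon from the previous slide are small enough to run by hand. The following is a minimal sketch of a top-down parser for exactly that toy grammar; the function names `parse` and `parse_sentence` and the tuple-based tree format are assumptions made for illustration, not part of the original presentation.

```python
# Toy grammar and lexicon copied from the slides.
GRAMMAR = {
    "S": ["NP", "VP"],
    "NP": ["ART", "N"],
    "VP": ["V", "NP"],
}
LEXICON = {"dog": "N", "cat": "N", "chased": "V", "the": "ART"}


def parse(symbol, tokens, pos):
    """Try to parse `symbol` starting at tokens[pos].

    Returns (tree, next_pos) on success, or None on failure.
    """
    if symbol in GRAMMAR:
        # Non-terminal: every child on the right-hand side must parse in order.
        children = []
        for child in GRAMMAR[symbol]:
            result = parse(child, tokens, pos)
            if result is None:
                return None
            subtree, pos = result
            children.append(subtree)
        return (symbol, *children), pos
    # Lexical category: consume one word if the lexicon agrees.
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
        return (symbol, tokens[pos]), pos + 1
    return None


def parse_sentence(sentence):
    """Return the parse tree for a full sentence, or None if it does not parse."""
    tokens = sentence.lower().split()
    result = parse("S", tokens, 0)
    if result is None or result[1] != len(tokens):
        return None
    return result[0]


tree = parse_sentence("the dog chased the cat")
print(tree)
```

Because each non-terminal here has a single rule, no backtracking is needed; a realistic grammar would require a chart parser or similar.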

Page 8:

Structural Ambiguity

• Time flies like an arrow.

– The passage of time is as quick as an arrow.

– A species of flies called ‘time flies’ enjoys an arrow.

Page 9:

Structural Ambiguity

• The man saw the girl with the telescope.

– The man saw the girl who possessed the telescope.

– The man saw the girl with the aid of the telescope.
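The telescope example can be reproduced mechanically. The sketch below is a small CKY-style chart parser that enumerates every parse of the sentence. The prepositional-phrase rules (`NP -> NP PP`, `VP -> VP PP`, `PP -> P NP`) and the extra lexicon entries are assumptions added for illustration; the slides give only the simpler grammar shown earlier.

```python
from collections import defaultdict

# Assumed binary grammar: the slide grammar plus prepositional phrases.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "ART", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "the": "ART", "man": "N", "girl": "N",
    "telescope": "N", "saw": "V", "with": "P",
}


def all_parses(sentence):
    """Return every S tree covering the whole sentence (raises KeyError on unknown words)."""
    tokens = sentence.lower().rstrip(".").split()
    n = len(tokens)
    # chart[(i, j)][category] holds all trees spanning tokens[i:j].
    chart = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(tokens):
        category = LEXICON[word]
        chart[(i, i + 1)][category].append((category, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    for lt in chart[(i, k)].get(left, []):
                        for rt in chart[(k, j)].get(right, []):
                            chart[(i, j)][parent].append((parent, lt, rt))
    return chart[(0, n)].get("S", [])


def attachment(tree):
    """Label a parse by where the PP attaches inside the top-level VP."""
    vp = tree[2]
    if vp[1][0] == "VP":
        return "verb (instrument reading)"
    return "noun (possession reading)"


parses = all_parses("The man saw the girl with the telescope.")
print(len(parses))
for tree in parses:
    print(attachment(tree))
```

Under this assumed grammar the chart holds two complete parses, one per reading on the slide. Choosing between them requires knowledge beyond syntax.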

Page 10:

Natural Language Generation

User’s Goal -> Strategic Component (Text Planning) -> Tactical Component (Linguistic Realization) -> Surface Sentences

Knowledge sources: Domain KB, Planning Operators, User Model, Discourse Model, Linguistic Rules & Lexicon

Page 11:

Unification Grammar

Sentence: the man sees a sheep

S [numb=X, tense=T] -> NP [numb=X] VP [numb=X, tense=T]

VP [numb=N, tense=M] -> V [numb=N, tense=M] NP

NP [numb=Y] -> det [numb=Y] noun [numb=Y]

Lexicon:

man : noun [numb = sing]

a : det [numb = sing]

the : det

sheep : noun

sees : V [tense = pres, numb = sing]
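The agreement checks encoded by the shared variables X, Y, N, M, and T can be illustrated with a much simpler mechanism. This is a minimal sketch that unifies flat feature dictionaries; it has no variables or nested structures, so it is only an approximation of a real unification grammar. The function names and the plural entry `men` are assumptions added to show a failing case.

```python
def unify(a, b):
    """Unify two flat feature dicts; return the merged dict, or None on a clash."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


# Lexicon from the slide, plus one assumed plural noun for contrast.
LEXICON = {
    "man": ("noun", {"numb": "sing"}),
    "a": ("det", {"numb": "sing"}),
    "the": ("det", {}),
    "sheep": ("noun", {}),
    "sees": ("V", {"tense": "pres", "numb": "sing"}),
    "men": ("noun", {"numb": "plur"}),  # assumption, not on the slide
}


def noun_phrase(det, noun):
    """NP[numb=Y] -> det[numb=Y] noun[numb=Y]: determiner and noun must agree."""
    det_cat, det_feats = LEXICON[det]
    noun_cat, noun_feats = LEXICON[noun]
    if (det_cat, noun_cat) != ("det", "noun"):
        return None
    return unify(det_feats, noun_feats)


def sentence(det1, noun1, verb, det2, noun2):
    """S[numb=X, tense=T] -> NP[numb=X] VP[numb=X, tense=T], with VP -> V NP."""
    subject = noun_phrase(det1, noun1)
    obj = noun_phrase(det2, noun2)
    verb_cat, verb_feats = LEXICON[verb]
    # An empty dict is a valid (underspecified) result, so test against None.
    if subject is None or obj is None or verb_cat != "V":
        return None
    # The subject and the verb must share the same number.
    return unify(subject, verb_feats)


print(sentence("the", "man", "sees", "a", "sheep"))
print(sentence("the", "men", "sees", "a", "sheep"))
```

The first call succeeds because every feature is compatible; the second fails because the assumed plural subject clashes with the singular verb.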

Page 12:

Migraine abortive treatment is used to abort migraine.

((cat clause)
 (process ((lex "use") (type material)))
 (partic ((affected ((cat proper)
                     (lex "migraine abortive treatment")))
          (agent none)))
 (circum ((purpose ((cat clause)
                    (keep-in-order no) (keep-for no)
                    (position end)
                    (process ((lex "abort")
                              (effect-type creative)
                              (type material)))
                    (partic ((created ((lex "migraine")
                                       (countable no)
                                       (cat common))))))))))

Page 13:

Verbatim Text Coding

• A text content classification problem.

• Group semantically similar answer items.

• Develop a code list/tree to represent the answer item groups.

• Simple NL analysis techniques may help.

• Details will be given in the first example of NLP application.

Page 14:

Data Mining Report Generation

• Data mining results are usually in rule or tree formats with obscure notations.

• NL generation techniques may help translate the data mining results into plain natural language.

• Details will be given in the second example of NLP application.

Page 15:

Codia for Verbatim Text Coding

Screen panels: Answer Items, Code Tree

• Small screen/window/text

• Long list of answer items

• Difficult to browse/view

• Worse than paper form

Page 16:

Codia for Verbatim Text Coding

Key Terms

Page 17:

Ranking Answers by Similarity

Items with similar meaning

Page 18:

Text Similarity Measures

String + Semantics + Coverage -> Text Similarity Score
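The slide names three ingredients of the similarity score but does not define them. The following sketch shows one plausible way such a score could be assembled, purely as an illustration. Every definition in it is an assumption: string similarity via `difflib.SequenceMatcher`, semantic similarity as overlap of thesaurus categories, coverage as the share of the shorter answer's words found in the longer one, equal weights, and a hypothetical six-word thesaurus. None of this is confirmed as Codia's actual method.

```python
from difflib import SequenceMatcher

# Hypothetical mini-thesaurus: word -> semantic category (assumption).
THESAURUS = {
    "cheap": "PRICE", "inexpensive": "PRICE", "price": "PRICE",
    "tasty": "TASTE", "delicious": "TASTE", "flavour": "TASTE",
}


def words(text):
    return text.lower().split()


def string_similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_similarity(a, b):
    """Jaccard overlap of the thesaurus categories found in each answer."""
    cats_a = {THESAURUS[w] for w in words(a) if w in THESAURUS}
    cats_b = {THESAURUS[w] for w in words(b) if w in THESAURUS}
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)


def coverage(a, b):
    """Share of the shorter answer's distinct words that occur in the longer one."""
    wa, wb = set(words(a)), set(words(b))
    shorter, longer = (wa, wb) if len(wa) <= len(wb) else (wb, wa)
    return len(shorter & longer) / len(shorter) if shorter else 0.0


def similarity_score(a, b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three component scores (equal weights assumed)."""
    parts = (string_similarity(a, b), semantic_similarity(a, b), coverage(a, b))
    return sum(w * p for w, p in zip(weights, parts))


def rank(query, answers):
    """Order answer items from most to least similar to the query item."""
    return sorted(answers, key=lambda ans: similarity_score(query, ans), reverse=True)


answers = ["it is inexpensive", "very delicious", "nice packaging"]
print(rank("it is cheap", answers))
```

A ranking like this only brings likely matches to the top of the list; the coder still decides which items belong under the same code.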

Page 19:

Codia for Verbatim Text Coding

• A user interface for classifying answer items by drag-and-drop actions.

• NLP reduces time and effort in searching, browsing, and selecting multiple answer items for classification.

• There are still limitations, and the process is not fully automated.

Page 20:

Technical Issues of Codia

• Improve user interface.

• Use only simple NLP techniques.

• Ambiguity resolution by humans.

• Limited by thesaurus.

• Still cannot handle negation (e.g., ‘not’).

• Knowledge engineering is tedious.

Page 21:

Limitations and Future Improvements

Limitations:

• Thesaurus has only 60,000 terms classified into 3,900 semantic categories.

• Manual operation (ambiguity resolution relies on humans).

• Similarity measures are too mechanical.

Future improvements:

• Need to update and incorporate frequently used terms/categories.

• Move towards automation by using more AI, such as NLP, GA, and NN.

• More adaptive through rule-based or case-based reasoning.

Page 22:

Data Mining and Knowledge Discovery

Knowledge Discovery: Data -> Data Mining -> Patterns -> Interpretation -> Knowledge

Page 23:

If q12 = 4 and q31 = 6 and q35 = 3

then q38 = 3

Page 24:

If h/h_income = 4 and city = 6 and car_owner = 3

then user = 3
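The step from the coded rule on the previous slide to the labelled rule here amounts to a codebook lookup. A minimal sketch follows; the mapping of q12, q31, q35, and q38 to variable names is inferred from the position of terms on the two slides, and the data layout and function names are assumptions for illustration.

```python
# The mined rule as it appears in questionnaire codes.
RULE = {"if": [("q12", 4), ("q31", 6), ("q35", 3)], "then": ("q38", 3)}

# Codebook inferred from the two slides (question code -> variable name).
VARIABLE_LABELS = {
    "q12": "h/h_income",
    "q31": "city",
    "q35": "car_owner",
    "q38": "user",
}


def relabel(rule, labels):
    """Replace question codes with variable names; unknown codes are left as-is."""
    then_var, then_value = rule["then"]
    return {
        "if": [(labels.get(var, var), value) for var, value in rule["if"]],
        "then": (labels.get(then_var, then_var), then_value),
    }


def format_rule(rule):
    """Render a rule in the If ... then ... notation used on the slides."""
    conditions = " and ".join(f"{var} = {value}" for var, value in rule["if"])
    var, value = rule["then"]
    return f"If {conditions} then {var} = {value}"


print(format_rule(RULE))
print(format_rule(relabel(RULE, VARIABLE_LABELS)))
```

The numeric values remain opaque after this step; turning them into phrases such as a city name needs a second codebook for answer values, which the slides do not show.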

Page 25:

say(feature, [r1]).

Page 26:

The segment of respondents who are product X users is characterized by

residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

r1

say(feature, [r1]).

Page 27:

say(general, [r1]).

say(likely, [r1]).

say(reason, [r1]).

Page 28:

Basically, the respondents who are product X users have

residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

r1

say(general, [r1]).

Page 29:

The respondents who are product X users because they have

residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income.

r1

say(reason, [r1]).

Page 30:

It is likely that the people who have

residence in Shanghai,
consumption of brand Y cigarettes,
overseas travel in the past twelve months,
ownership of imported cars, and
high monthly household income

are product X users.

r1

say(likely, [r1]).
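The four `say(...)` calls differ only in how they frame the same rule content. The sketch below imitates that behaviour with plain string templates. It does no text planning or linguistic realization, so it is a stand-in for the idea rather than the authors' generator. The templates are loosely modelled on the sample outputs (the `reason` template is a grammatical paraphrase of the slide text), and the `RULES` table holding r1's features is an assumption reconstructed from those outputs.

```python
# Assumed content of rule r1, reconstructed from the sample outputs on the slides.
RULES = {
    "r1": {
        "segment": "product X users",
        "features": [
            "residence in Shanghai",
            "consumption of brand Y cigarettes",
            "overseas travel in the past twelve months",
            "ownership of imported cars",
            "high monthly household income",
        ],
    }
}

# One sentence frame per presentation style (assumed wording).
TEMPLATES = {
    "feature": "The segment of respondents who are {segment} is characterized by {features}.",
    "general": "Basically, the respondents who are {segment} have {features}.",
    "reason": "The respondents are {segment} because they have {features}.",
    "likely": "It is likely that the people who have {features} are {segment}.",
}


def join_features(features):
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(features) == 1:
        return features[0]
    if len(features) == 2:
        return " and ".join(features)
    return ", ".join(features[:-1]) + ", and " + features[-1]


def say(style, rule_ids):
    """Render each rule as one sentence in the requested style."""
    sentences = []
    for rule_id in rule_ids:
        rule = RULES[rule_id]
        sentences.append(
            TEMPLATES[style].format(
                segment=rule["segment"],
                features=join_features(rule["features"]),
            )
        )
    return " ".join(sentences)


for style in ("feature", "general", "reason", "likely"):
    print(say(style, ["r1"]))
```

Fixed templates of this kind produce exactly one sentence per rule, which matches the "single sentence for each rule" limitation listed on the following slide.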

Page 31:

Limitations and Future Improvements

Limitations:

• Pre-defined syntactic category of code labels.

• Single sentence for each rule.

• Lack of visualization.

• Almost no text planning.

• English only.

• Lack of knowledge for explanation.

Future improvements:

• Automatic recognition of the syntax.

• Describe rule relationships in multiple coherent sentences.

• Text + graphics or even multimedia generation.

• Implement text planning.

• Multilingual.

• Implement NL techniques for explanation.

Page 32:

Concluding Remarks

• NLP techniques are found useful in:

– Verbatim text coding and

– Data mining report generation.

• Group similar answer items.

• Write simple natural language text.

• A pricey technology because few tools are available.

Page 33:

Natural Language Processing

Josef Siu-Wai Leung (j.leung@ieee.org)

Ching-Long Yeh (chingyeh@cse.ttit.edu.tw)