Mining Domain Specific Words from Hierarchical Web Documents
Jing-Shin Chang (張景新)
Department of Computer Science & Information Engineering
National Chi-Nan (暨南) University
1, Univ. Road, Puli, Nantou 545, Taiwan, ROC
[email protected]
CJNLP-04, 2004/11/10~15, City U., H.K.
TOC
Motivation
What are DSW's?
Why DSW Mining? (Applications)
  WSD with DSW's without a sense-tagged corpus
  Constructing a Hierarchical Lexicon Tree w/o Clustering
  Other applications
How to Mine DSW's from Hierarchical Web Documents
Preliminary Results
Error Sources
Remarks
Motivation
"Is there a quick and easy (engineering) way to construct a large-scale WordNet or things like that... now that everyone is talking about ontological knowledge sources and X-WordNet (whatever you call it)...?"
...this question triggers a new view for constructing a lexicon tree with hierarchical semantic links...
...DSW identification turns out to be a key to such construction...
...and can be used in various applications, including DSW-based WSD without using sense-tagged corpora...
What Are Domain Specific Words (DSW's)?
Words that appear frequently in some particular domains:
(a) Multiple-sense words that are frequently used with special meanings or usages in particular domains
  E.g., piston: "活塞" (mechanics) or "活塞隊" (the Pistons team, sports)
(b) Single-sense words that are used frequently in particular domains
  Suggesting that some words in the current document might be related to this particular sense
  Serving as "anchor words/tags" in the context for disambiguating other multiple-sense words
What to Do in DSW Mining
DSW Mining Task
  Find lists of words that occur frequently in the same domain, and associate each list (and the words within it) with a domain (implicit sense) tag
  E.g., entertainment: 'singer', 'pop songs', 'rock & roll', 'Chang Hui-Mei' ('Ah-Mei'), 'album', ...
  As a side effect, find the hierarchical or network-like relationships between adjacent sets of DSW's
    When applied to mining the DSW's associated with each node of a hierarchical directory/document tree
    Each node being annotated with a domain tag
DSW Applications (1)
Technical term extraction:
  W(d) = { w | w ∈ DSW(d) }, d ∈ {computer, traveling, food, ...}
DSW Applications (2)
Generic WSD based on DSW's
  argmax_s Σ_d P(s|d,W) P(d|W) = argmax_s Σ_d P(s|d,W) P(W|d) P(d)
  Applicable when a large-scale sense-tagged corpus is not available, which is often the case
Machine translation
  Helps select translation lexicon candidates
  E.g., money bank (when used with "payment", "loan", etc.), river bank, memory bank (in PC, Intel, MS Windows domains)
DSW Applications
Generic WSD based on DSW's

For a target word w_0 in context w_1..w_n:

  s*_0 = argmax_s P(s | w_0, w_1..n)
       = argmax_s P(w_0, w_1..n | s) P(s)                          [sense-based models]
       = argmax_s Σ_d P(s, d | w_0, w_1..n)
       = argmax_s Σ_d P(s | d, w_0, w_1..n) P(d | w_0, w_1..n)
       = argmax_s Σ_d P(s | d, w_0, w_1..n) P(w_1..n | d) P(d)     [domain-based models]

Sense-based models need sense-tagged corpora for training (not widely available); implicitly domain-tagged corpora are widely available on the web.
The sum runs over the domains where w_0 is a DSW; P(s | d, w_0, ..., w_n) is almost deterministic ("one sense per context").
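To make the domain-based rule concrete, here is a minimal, hypothetical sketch (not the paper's implementation). It assumes the mining step has produced a (word, domain) → sense table, the normalized frequencies f_ij, and domain priors, and it factors P(w_1..n | d) with a unigram independence assumption; all names and numbers below are illustrative.

```python
import math
from collections import defaultdict

def disambiguate(w0, context, dsw_sense, f, p_domain, eps=1e-9):
    """Domain-based WSD sketch: argmax_s sum_d P(s|d,w0) P(context|d) P(d).

    dsw_sense: {(word, domain): sense}  -- near-deterministic per domain
    f:         {(word, domain): f_ij}   -- normalized frequencies
    p_domain:  {domain: P(d)}           -- domain priors
    """
    scores = defaultdict(float)
    for (word, d), sense in dsw_sense.items():
        if word != w0:
            continue  # sum only over domains where w0 is a DSW
        # log P(context | d) under a unigram independence assumption
        log_ctx = sum(math.log(f.get((w, d), eps)) for w in context)
        scores[sense] += p_domain.get(d, eps) * math.exp(log_ctx)
    return max(scores, key=scores.get) if scores else None

# Hypothetical usage: "活塞" in a basketball-flavored context
dsw_sense = {("活塞", "car"): "piston", ("活塞", "basketball"): "Pistons"}
f = {("引擎", "car"): 0.01, ("防守", "basketball"): 0.02,
     ("後衛", "basketball"): 0.015}
p_domain = {"car": 0.5, "basketball": 0.5}
print(disambiguate("活塞", ["防守", "後衛"], dsw_sense, f, p_domain))  # Pistons
```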
DSW Applications (3)
Document classification
  N-class classification based on DSW's
Anti-spamming (two-class classification)
  Words in spam (uninteresting) mails vs. normal (interesting) mails help block spam
  Interesting domains vs. uninteresting domains:
  P(W|S)P(S) vs. P(W|~S)P(~S)
DSW Applications (3.a)
Document classification based on DSW's
  d: document class label
  w_1..w_n: bag of words in the document
  |D| ≥ 2: number of document classes
Anti-spamming based on DSW's
  |D| = 2 (two-class classification)

  d* = argmax_d P(d | w_1..w_n)
     = argmax_d P(w_1..w_n | d) P(d)    [class-based models]
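A minimal naive-Bayes sketch of the class-based rule above (two-class anti-spamming when |D| = 2); the toy training data and the Laplace smoothing choice are illustrative, not from the paper.

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class: {class_label: [list of token lists]}."""
    priors, likelihoods = {}, {}
    total_docs = sum(len(ds) for ds in docs_by_class.values())
    for d, docs in docs_by_class.items():
        priors[d] = len(docs) / total_docs
        counts = Counter(w for doc in docs for w in doc)
        n, v = sum(counts.values()), len(counts)
        # Laplace-smoothed unigram likelihoods P(w | d)
        likelihoods[d] = {w: (c + 1) / (n + v + 1) for w, c in counts.items()}
        likelihoods[d]["<unk>"] = 1 / (n + v + 1)
    return priors, likelihoods

def classify(words, priors, likelihoods):
    def score(d):
        lk = likelihoods[d]
        return math.log(priors[d]) + sum(
            math.log(lk.get(w, lk["<unk>"])) for w in words)
    return max(priors, key=score)

# Hypothetical usage
spam_docs = [["free", "lotto", "win"], ["win", "cash"]]
ham_docs = [["meeting", "schedule"], ["project", "report"]]
priors, lk = train({"spam": spam_docs, "ham": ham_docs})
print(classify(["free", "cash"], priors, lk))  # -> 'spam'
```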
DSW Applications (4)
Building a large lexicon tree or wordnet-lookalike (semi-)automatically from hierarchical web documents
  Membership: semantic links among words of the same domain are close (context), similar (synonym, thesaurus), or negated concepts (antonym)
  Hierarchy: the hierarchy of the lexicon suggests some ontological relationships
Conventional Methods for Constructing Lexicon Trees
Construction by Clustering
  Collect words in a large corpus
  Evaluate word association as a distance (or closeness) measure for all word pairs
  Use clustering criteria to build the lexicon hierarchy
  Adjust the hierarchy and assign semantic/sense tags to nodes of the lexicon tree
    Thus assigning sense tags to the members of each node
Clustering Methods for Constructing Lexicon Trees
[Figure: a binary clustering tree built bottom-up over the flattened word lists {A0, A1, B}, {A0, A1, C}, {A0, A2, D}, {A0, A2, E}, merging them pairwise into internal nodes and a root whose members carry occurrence counts (A0 ×4, A1 ×2, A2 ×2)]
Clustering Methods for Constructing Lexicon Trees
Disadvantages
  Do not take advantage of the hierarchical information of the document tree (flattened when collecting words)
  Word association & clustering criteria are not related directly to human perception
    Most clustering algorithms conduct binary merging (or division) in each step, for simplicity
    The automatically generated semantic hierarchy may not reflect human perception
    Hierarchy boundaries are not clearly & automatically detected
  Adjustment of the hierarchy may not be easy (since human perception is not used to guide clustering)
  Pairwise association evaluation is costly
Hierarchical Information Loss when Collecting Words
[Figure: the document tree holds {A0^2, A1^2, B, C} and {A0^2, A2^2, D, E} at its internal nodes and {A0^4, A1^2, A2^2, B, C, D, E} at the root, but collecting words flattens the leaves {A0, A1, B}, {A0, A1, C}, {A0, A2, D}, {A0, A2, E} into plain word lists, discarding the hierarchy]
Clustering Methods for Constructing Lexicon Trees
[Figure: the same binary clustering tree over {A0, A1, B}, {A0, A1, C}, {A0, A2, D}, {A0, A2, E}, with every merge step marked "?" — Does it reflect human perception? Why binary? What hierarchy?]
Alternative View for Constructing Lexicon Trees
Construction by Retaining DSW's
  Preserve the hierarchical structure of web documents as the baseline of the semantic hierarchy, which is already mildly confirmed by webmasters
  Associate each node with DSW's as members, and tag each DSW with the directory/domain name
  Optionally adjust the tree hierarchy and the members of each node
Constructing Lexicon Trees by Preserving DSW's
[Figure: a document tree whose nodes carry word vectors with entries marked O (+DSW) or X (−DSW), e.g., root (O,O,O,O); internal nodes (X,O,X,O), (X,X,O,O); leaves (O,X,O,O), (O,X,O,X), (O,O,X,X), (O,O,X,O)]
Constructing Lexicon Trees by Preserving DSW's
[Figure: the same tree after dropping the X (−DSW) entries, so each node retains only its O (+DSW) words]
Constructing Lexicon Trees by Preserving DSW's
Advantages
  The hierarchy reflects human perception
    Adjustment could be easier, if necessary
  Directory names are highly correlated with sense tags
    Domain-based models can be used when sense-tagged corpora are not available
  Pairwise word association evaluation is replaced by computation of "domain specificity" against domains
    O(|W|×|W|) vs. O(|W|×|D|)
Requirements:
  A well-organized web site
  Mining DSW's from such a site
Constructing Lexicon Trees by Preserving DSW's
[Figure: the preserved directory tree — root {A0^4, A1^2, A2^2, B, C, D, E} over internal nodes {A0^2, A1^2, B, C} and {A0^2, A2^2, D, E} and leaves {A0, A1, B}, {A0, A1, C}, {A0, A2, D}, {A0, A2, E} — annotated with candidate lexical relationships: membership (closeness, similarity) within a node; is_a/hypernym links from a child node Y to its parent X (Y is_a X? e.g., B is_a X (or A1)); synonym and antonym links among members]
Alternative View for Constructing Lexicon Trees
Benefits:
  No similarity computation: closeness (incl. similarity) is already implicitly encoded by human judges
  No binary clustering: clustering is already done (implicitly) with human judgment
  Hierarchical links available: some well-developed relationships are already in place
  Although not perfect...
Proposed Method for Mining
Web hierarchy as a large document tree
  Each document was generated by applying DSW's to some generic document templates
Remove the non-specific words from the documents, leaving a lexicon tree with DSW's associated with each node
  Leaving only domain-specific words
  Forming a lexicon tree from a document tree
  Labeling domain-specific words
Characteristics:
  Get associated words by measuring domain specificity against a known and common domain, instead of measuring pairwise association plus clustering
Mining Criteria: Cross-Domain Entropy
Domain-independent terms tend to be distributed evenly in all domains.
Distributional "evenness" can be measured with the Cross-Domain Entropy (CDE), defined as follows:
  H_i = H(w_i) = -Σ_j P_ij log P_ij,  with  P_ij = f_ij / Σ_j f_ij
where P_ij is the probability of word i in domain j, and f_ij is its normalized frequency in domain j.
Mining Criteria: Cross-Domain Entropy
Example:
  w_i = "piston", with frequencies (normalized to [0,1]) in various domains:
  f_ij = (0.001, 0.62, 0.0003, 0.57, 0.0004)
  Domain-specific (unevenly distributed) in the 2nd and the 4th domains
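To make the numbers concrete, here is a short sketch that computes the CDE of the "piston" example; the slide does not specify the logarithm base, so base 2 is assumed here.

```python
import math

def cross_domain_entropy(f):
    """CDE: H(w) = -sum_j P_j log P_j, with P_j = f_j / sum_j f_j."""
    total = sum(f)
    probs = [x / total for x in f]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The slide's 'piston' frequencies over five domains:
f_piston = (0.001, 0.62, 0.0003, 0.57, 0.0004)
print(cross_domain_entropy(f_piston))  # ~1.0 bits: unevenly distributed
print(math.log2(5))                    # ~2.32 bits for a perfectly even word
```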
Mining Algorithm – Step 1
Step 1 (Data Collection): Acquire a large collection of web documents using a web spider while preserving the directory hierarchy of the documents. Strip unused markup tags from the web pages.
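A minimal sketch of a directory-preserving fetch, assuming the site's URL paths mirror its topic hierarchy; the URL handling and the crude regex-based tag stripping are illustrative only, not the paper's spider.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse

def save_preserving_hierarchy(url, out_root="corpus"):
    """Fetch one page and store its text under a path mirroring the URL."""
    path = urlparse(url).path.lstrip("/") or "index.html"
    local = os.path.join(out_root, path)      # keep /sports/baseball/x.html
    os.makedirs(os.path.dirname(local), exist_ok=True)
    html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    text = re.sub(r"<[^>]+>", " ", html)      # crude markup stripping
    with open(local, "w", encoding="utf-8") as fh:
        fh.write(text)
```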
Mining Algorithm – Step 2
Step 2 (Word Segmentation or Chunking): Identify word (or compound word) boundaries in the documents by applying a word segmentation process, such as (Chiang 92; Lin 93), to Chinese-like documents (where word boundaries are not explicit), or by applying a compound-word chunking algorithm to English-like documents, in order to identify the word entities of interest.
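For illustration only: the paper cites the segmenters of (Chiang 92; Lin 93), which are not publicly packaged, so this sketch substitutes the open-source jieba segmenter; the sample sentence and the shown output are hypothetical.

```python
import jieba  # pip install jieba

text = "日本職棒的投手表現出色"
words = list(jieba.cut(text))  # segment into a word sequence
print(words)  # e.g. ['日本', '職棒', '的', '投手', '表現', '出色']
```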
Mining Algorithm – Step 3
Step 3 (Acquiring Normalized Term Frequencies for All Words in Various Domains): For each subdirectory d_j, find the number of occurrences n_ij of each term w_i in all the documents, and derive the normalized term frequency f_ij = n_ij / N_j by normalizing n_ij with the total document size, N_j = Σ_i n_ij, in that directory. The directory is then associated with a set of <w_i, d_j, f_ij> tuples, where w_i is the i-th word of the complete word list for all documents, d_j is the j-th directory name (referred to as the domain hereafter), and f_ij is the normalized relative frequency of occurrence of w_i in domain d_j.
Mining Algorithm – Step 3
Input: <w_i, d_j, f_ij> (word, domain, normalized frequency) triples, where
  n_ij: frequency of w_i in domain d_j
  N_j = Σ_i n_ij: number of words in domain d_j
  f_ij = n_ij / N_j: normalized frequency of w_i in domain d_j
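A minimal sketch of Step 3, assuming the crawl from Step 1 left one whitespace-segmented text file per document under corpus/<domain>/...; treating only the top-level subdirectories as domains is a simplification (the paper also treats parent directories as domains), and all paths are hypothetical.

```python
import os
from collections import Counter, defaultdict

def term_frequencies(corpus_root="corpus"):
    """Return f[d][w] = f_ij = n_ij / N_j for each domain directory d."""
    n = defaultdict(Counter)                  # n[d][w] = n_ij
    for domain in os.listdir(corpus_root):
        ddir = os.path.join(corpus_root, domain)
        if not os.path.isdir(ddir):
            continue
        for root, _, files in os.walk(ddir):  # a domain covers its subtree
            for name in files:
                with open(os.path.join(root, name), encoding="utf-8") as fh:
                    n[domain].update(fh.read().split())
    f = {}
    for d, counts in n.items():
        N = sum(counts.values())              # N_j = sum_i n_ij
        f[d] = {w: c / N for w, c in counts.items()}
    return f
```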
Mining Algorithm – Step 4
Step 4 (Removing Domain-Independent Terms): Domain-independent terms are identified as those terms which are distributed evenly in all domains, that is, terms with a large Cross-Domain Entropy (CDE), defined as follows:
  H_i = H(w_i) = -Σ_j P_ij log P_ij,  with  P_ij = f_ij / Σ_j f_ij
Terms whose CDE is above a threshold can be removed from the lexicon tree, since such terms are unlikely to be associated with any domain closely. Terms with a low CDE are retained in the few domains with the highest normalized frequencies (e.g., top-1 and top-2).
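A minimal sketch of Step 4 under the same assumptions as the Step 3 sketch; the CDE threshold and the top-k retention count are free parameters here, not values fixed in the paper.

```python
import math
from collections import defaultdict

def mine_dsws(f, cde_threshold=1.5, top_k=2):
    """f: {domain: {word: f_ij}} -> {domain: set of retained DSW's}."""
    per_word = defaultdict(dict)              # per_word[w][d] = f_ij
    for d, freqs in f.items():
        for w, x in freqs.items():
            per_word[w][d] = x
    dsw = defaultdict(set)
    for w, by_dom in per_word.items():
        total = sum(by_dom.values())
        cde = -sum((x / total) * math.log2(x / total)
                   for x in by_dom.values() if x > 0)
        if cde > cde_threshold:
            continue                          # evenly spread: discard
        # keep w in its top-k domains by normalized frequency
        for d in sorted(by_dom, key=by_dom.get, reverse=True)[:top_k]:
            dsw[d].add(w)
    return dsw
```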
Experiments
Domains:
  News articles from a local news site
  138 distinct domains
    including leaf nodes of the directory tree and their parents
    leaves with the same name are considered to be in the same domain
  Examples: baseball, basketball, broadcasting, car, communication, culture, digital, edu(cation), entertainment (流星花園 'Meteor Garden'), finance, food (大魚大肉 'lavish fish-and-meat meals', 干貝 'dried scallops', 木耳 'wood-ear fungus', 錫箔紙 'tin foil', ...)...
Size: 200M bytes (HTML files)
  16K+ unique words after word segmentation
Domains (hierarchy not shown):
afternoon-news, all-baseball, america-topic, autumn, basketball, bnext, broadcasting-tv, buybooks, car, card, changhwa, chiayi, college, communication, culture, daily, day_starnews, digital, domestic, dswa.crp, east-taiwan, ec, edict, edu, entertainment, europe, europe2, family, finance, fish, focus, focusnews, food, fund-futures, game, global, golf, happy_worker, hardware, health-care, health-club, hot, hot-news, hot-topic, hot-topic2, hot-topic3, hsinchu, hwalen, ilan, important, important2, important3, important4, important5, infotech, insurance, interest-prose, internal-sport, international, international-sport, internet, japan, kaoshiung-city, kaoshiung-sentry, keelung, life, life-topic02, life-topic03, life-topic1, life_newtopic, lifestyle, listed-co, listed-elec, local-scene, lotto, main, mainland, management, medical, medical-news, miaoli, middle-taiwan, middlesouth-taiwan, miscellaneous, mixtravel, movie, music, nantou, national-travel, newbooks, north-taiwan, opinion, otc, out-activity, oversea-star, performance, personal, pintung, pl, politics, public-forum, readexcellent, readtopic, shopping, sitemap, sitemap_title, social-forum, society, south-taiwan, special, sport, star, stock, taichung-city, taichung-sentry, taiex, tainan, taipei-city, taipei-sentry, taitung, taiwan-china, taoyuan, tax-law, ti, todaynews, topic, topic2, trade, travel, travelwindow, udn, udn-supplement, udnbw, ue, usa-stock, world-econ, writers, yunlin
(<root> marks the root of the directory tree.)
Sample Output (4 Selected Domains)

baseball | broadcast-TV | basketball | car
日本職棒 | 有線電視 | 一分 | 千西西
棒球賽 | 東風 | 三秒 | 小型車
熱身 | 開工 | 女子組 | 中古
運動 | 節目中 | 包夾 | 引擎蓋
場次 | 廣電處 | 外線 | 水箱
價碼 | 收視 | 犯規 | 加裝
球團 | 和信 | 投籃 | 市場買氣
部長 | 新聞局 | 男子組 | 目的地
練球 | 開獎 | 防守 | 交車
興農 | 頻道 | 冠軍戰 | 同級
球場 | 電視 | 後衛 | 合作開發
投手 | 電影 | 活塞 | 安全系統
球季 | 熱門 | 國男 | 行李
賽程 | 影視 | 華勒 | 行李廂
太陽 | 娛樂 | 費城 | 西西

Table 1. Sampled domain-specific words with low entropies.
Preliminary Results
Domain-specific words and the assigned domain tags are well associated (e.g., "投手" (pitcher) is specifically used in the "baseball" domain.)
  Extraction with the cross-domain entropy (CDE) metric is well founded.
Domain-independent (or irrelevant) words (such as those for webmasters' advertisements) are correctly rejected as DSW candidates due to their high cross-domain entropy
DSW's are mostly nouns and verbs (open-class words)
Preliminary Results
Low cross-domain-entropy words (DSW's) in the respective domains are generally highly correlated (e.g., "日本職棒" (Japanese professional baseball), "部長" (minister))
New usages of words, such as "活塞" (the Pistons) with the "basketball" sense, could also be identified
  Both are good for WSD tasks that use the DSW's as contextual evidence
Error Sources
A single CDE metric may not be sufficient to capture all characteristics of "domain specificity"
  Type II error: some general (non-specific) words may have low entropy simply because they appear in only one domain (CDE = 0)
    Probably due to low occurrence counts (a kind of estimation error)
  Type I error: some multiple-sense words may have too many senses and thus be mis-recognized as non-specific in each domain (although the senses are unique in their respective domains)
Error Sources
The "well-organized website" assumption may not hold all the time
  The hierarchical directory tags may not be appropriate representatives for the document words within a website
  The hierarchies may not be consistent from website to website
Future Works
Use other knowledge sources, beyond the single CDE measure, to co-train the model in a manner similar to [Chang 97b, c]
  E.g., with other term-weighting metrics
  E.g., a stop-list acquisition metric for identifying common words (for Type II errors)
Explore methods and criteria to adjust the hierarchy of a single directory tree
Explore methods to merge directory trees from different sites
Concluding Remarks
A simple metric for automatic/semi-automatic identification of DSW's
  At low sense-tagging cost
    Rich web resources, almost free
    Implicit semantic tagging implied by the directory hierarchy (an imperfect hierarchy, but free)
A simple method to build semantic links and degrees of closeness among DSW's
  May be helpful for building large semantically tagged lexicon trees or network-linked x-wordnets
A good knowledge source for WSD-related applications
  WSD, machine translation, document classification, anti-spamming, ...
Thanks for your attention!!