Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information
Authors: Yutaka Matsuo & Mitsuru Ishizuka. Designed by CProDM Team.
![Page 1: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/1.jpg)
![Page 2: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/2.jpg)
Outline
- Introduction
- Study Algorithm
- Algorithm Implementation
- Evaluation
![Page 3: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/3.jpg)
Introduction
Preprocessing: Discard stop words → Stem → Extract frequency
Processing: Select frequent terms → Clustering → Expected probability → Calculate χ'² value → Output
![Page 4: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/4.jpg)
Study Algorithm: Preprocessing
Goal:
- Remove unnecessary words from the document.
- Get the terms that are candidate keywords.
Stop words: function words such as "and", "the", and "of", or other words with minimal lexical meaning.
Stemming: remove suffixes from words.
![Page 5: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/5.jpg)
Study Algorithm: Preprocessing
Discard stop words
Before: It might be urged that when playing the "imitation game" the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind. In any case there is no intention to investigate here the theory of the game, and it will be assumed that the best strategy is to try to provide answers that would naturally be given by a man.
After: urged playing "imitation game" best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.
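The stop-word step can be sketched in a few lines of Python. This is only an illustration: the `STOP_WORDS` set and the `discard_stop_words` helper are hypothetical names, and the list is far smaller than one a real system would use.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {
    "it", "might", "be", "that", "when", "the", "for", "may", "something",
    "other", "than", "of", "a", "this", "but", "i", "is", "there", "any",
    "in", "no", "to", "here", "and", "will", "would", "by",
}

def discard_stop_words(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "It might be urged that the best strategy for the machine may be imitation of a man."
print(discard_stop_words(sentence))
```

Only the content-bearing words survive, which is what the later frequency and co-occurrence steps operate on.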
![Page 6: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/6.jpg)
Study Algorithm: Preprocessing
Stem
Before: urged playing "imitation game" best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.
After: urge play "imitation game" best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.
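As a rough illustration of suffix removal, the sketch below strips a handful of common English suffixes. It is a toy rule set, not the stemmer used for the slide, so its output will not match the stemmed text above in every case; `SUFFIXES` and `crude_stem` are hypothetical names.

```python
# Suffixes are tried in order; a hypothetical toy rule set, far simpler
# than a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["playing", "urged", "answers", "man"]])
```

The point of stemming is that variants such as "play" and "playing" are counted as one term in the steps that follow.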
![Page 7: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/7.jpg)
Study Algorithm: Preprocessing
Extract frequency
urge play "imitation game" best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.
imitation best strategi man best strategi
![Page 8: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/8.jpg)
Study Algorithm: Term Co-occurrence and Importance
The top ten frequent terms (denoted as G) and their probability of occurrence, normalized so that the sum is 1.
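A minimal sketch of how such a table could be computed, assuming the document has already been reduced to a flat list of stemmed terms. The function name and the sample data are made up for illustration.

```python
from collections import Counter

def top_terms_with_probability(terms, k=10):
    """Return the k most frequent terms with counts normalized to sum to 1."""
    counts = Counter(terms).most_common(k)
    total = sum(count for _, count in counts)
    return {term: count / total for term, count in counts}

# Made-up term list; k=2 keeps the example small.
terms = ["game", "game", "man", "machine", "game", "man"]
probabilities = top_terms_with_probability(terms, k=2)
print(probabilities)
```

Note that the normalization is over the selected frequent terms only, so the probabilities of the top terms sum to 1 by construction.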
![Page 9: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/9.jpg)
Study Algorithm: Term Co-occurrence and Importance
Two terms that appear in the same sentence are counted as co-occurring once.
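The counting rule can be sketched as follows, assuming each sentence is given as a list of already preprocessed terms. `cooccurrence_counts` is a hypothetical helper, and pairs are stored in alphabetical order so that (a, b) and (b, a) share one entry.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for each unordered term pair, the sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        # Use a set so a pair is counted once per sentence, however often
        # each term repeats inside that sentence.
        for a, b in combinations(sorted(set(sentence)), 2):
            counts[(a, b)] += 1
    return counts

sentences = [
    ["imitation", "game", "machine"],
    ["game", "machine", "game"],
    ["man", "imitation"],
]
counts = cooccurrence_counts(sentences)
print(counts[("game", "machine")])
```

Here "game" and "machine" share two sentences, so their count is 2 even though "game" appears twice in the second sentence.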
![Page 10: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/10.jpg)
Study Algorithm: Term Co-occurrence and Importance
Co-occurrence probability distribution of some terms and the frequent terms.
![Page 11: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/11.jpg)
Study Algorithm: Term Co-occurrence and Importance
The statistical value of χ² is defined as
$$\chi^2(w) = \sum_{g \in G} \frac{\left(\mathrm{freq}(w, g) - n_w p_g\right)^2}{n_w p_g}$$
where:
- $p_g$: unconditional probability of a frequent term $g \in G$ (the expected probability)
- $n_w$: the total number of co-occurrences of term $w$ with the frequent terms $G$
- $\mathrm{freq}(w, g)$: frequency of co-occurrence of term $w$ and term $g$
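A small sketch of the χ² computation, assuming it takes the usual form of a sum over frequent terms of (observed − expected)² / expected, with the expected count being the total co-occurrence count of w times the probability of g. The dictionary layout and the numbers are invented for illustration.

```python
def chi_square(w, frequent_terms, freq, p):
    """Sum over g of (freq(w, g) - n_w * p_g)^2 / (n_w * p_g)."""
    # n_w: total co-occurrences of w with the frequent terms.
    n_w = sum(freq.get((w, g), 0) for g in frequent_terms)
    total = 0.0
    for g in frequent_terms:
        expected = n_w * p[g]
        if expected > 0:  # skip terms that would divide by zero
            total += (freq.get((w, g), 0) - expected) ** 2 / expected
    return total

# Invented counts: "w" is biased toward "a", "v" is evenly spread.
G = ["a", "b"]
p = {"a": 0.5, "b": 0.5}
freq = {("w", "a"): 8, ("w", "b"): 2, ("v", "a"): 5, ("v", "b"): 5}
print(chi_square("w", G, freq, p), chi_square("v", G, freq, p))
```

A term whose co-occurrences follow the expected distribution scores 0, while a term biased toward particular frequent terms scores higher, which is why a high χ² value is taken as a sign of importance.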
![Page 12: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/12.jpg)
Study Algorithm: Term Co-occurrence and Importance
![Page 13: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/13.jpg)
Study Algorithm: Algorithm improvement
If a term appears in a long sentence, it is likely to co-occur with many terms; if it appears in a short sentence, it is less likely to co-occur with other terms.
We therefore take the length of each sentence into account and revise the definitions:
- $p_g$: (the sum of the total number of terms in sentences where $g$ appears) divided by (the total number of terms in the document)
- $n_w$: the total number of terms in the sentences where $w$ appears, including $w$
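The revised, sentence-length-aware quantities might be computed as in this sketch, again assuming sentences are lists of preprocessed terms. The function name and sample sentences are hypothetical.

```python
def sentence_length_statistics(sentences, frequent_terms):
    """Return (p, n): revised expected probabilities and per-term totals."""
    n_total = sum(len(s) for s in sentences)  # terms in the whole document

    def terms_in_sentences_with(term):
        # Total length of every sentence that contains the term.
        return sum(len(s) for s in sentences if term in s)

    p = {g: terms_in_sentences_with(g) / n_total for g in frequent_terms}
    n = {w: terms_in_sentences_with(w) for s in sentences for w in s}
    return p, n

sentences = [["game", "machine", "man"], ["game", "imitation"], ["man"]]
p, n = sentence_length_statistics(sentences, ["game", "man"])
print(p["game"], n["man"])
```

With these inputs, "game" appears in sentences of length 3 and 2 out of 6 terms in total, and "man" appears in sentences of length 3 and 1.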
![Page 14: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/14.jpg)
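Study Algorithm: Algorithm improvement
The following function is used to measure the robustness of bias values:
$$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \left\{ \frac{\left(\mathrm{freq}(w, g) - n_w p_g\right)^2}{n_w p_g} \right\}$$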
![Page 15: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/15.jpg)
Study Algorithm: Algorithm improvement
To improve the quality of the extracted keywords, we cluster terms.
Two major approaches (Hofmann & Puzicha 1998) are:
- Similarity-based clustering: if terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster.
- Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster.
![Page 16: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/16.jpg)
Study Algorithm: Algorithm improvement
Similarity-based clustering centers on the red circles.
Pairwise clustering focuses on the yellow circles.
![Page 17: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/17.jpg)
Study Algorithm: Algorithm improvement
Similarity-based clustering
Cluster a pair of terms whose Jensen-Shannon divergence is (the threshold and the defining formulas appear only in the slide image)
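For reference, the standard Jensen-Shannon divergence between two co-occurrence distributions can be sketched as below. This is the textbook definition with natural logarithms, written with the helper h(x) = −x log x; whether it matches the slide's formula term for term is an assumption, and no threshold is assumed here.

```python
import math

def jensen_shannon_divergence(p, q):
    """JS divergence (natural log) between two distributions given as dicts."""
    def h(x):
        return -x * math.log(x) if x > 0 else 0.0

    keys = set(p) | set(q)
    return math.log(2) + 0.5 * sum(
        h(p.get(k, 0.0) + q.get(k, 0.0)) - h(p.get(k, 0.0)) - h(q.get(k, 0.0))
        for k in keys
    )

same = jensen_shannon_divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
disjoint = jensen_shannon_divergence({"a": 1.0}, {"b": 1.0})
print(same, disjoint)
```

The value ranges from 0 for identical distributions to log 2 for distributions with no overlap, which gives a scale against which a clustering threshold can be set.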
![Page 18: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/18.jpg)
Study Algorithm: Algorithm improvement
Pairwise clustering
Cluster a pair of terms whose mutual information is (the threshold and the defining formula appear only in the slide image)
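A minimal sketch of pointwise mutual information estimated from raw counts, assuming the usual definition log(P(w1, w2) / (P(w1) P(w2))). The parameter names are hypothetical and no clustering threshold is assumed.

```python
import math

def mutual_information(freq_w1, freq_w2, freq_both, n_total):
    """log(P(w1, w2) / (P(w1) * P(w2))) estimated from raw counts."""
    if freq_both == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log(n_total * freq_both / (freq_w1 * freq_w2))

# Invented counts: the first pair co-occurs five times more often than
# independence would predict, the second exactly as often.
print(mutual_information(10, 10, 5, 100))
print(mutual_information(10, 10, 1, 100))
```

Positive values indicate that two terms co-occur more often than chance, which is the signal pairwise clustering relies on.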
![Page 19: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/19.jpg)
Study Algorithm: Algorithm improvement
![Page 20: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/20.jpg)
Algorithm Implementation
Step 1: Preprocessing
Step 2: Selection of frequent terms
Step 3: Clustering frequent terms
Step 4: Calculation of expected probability
Step 5: Calculation of χ'² value
Step 6: Output keywords
![Page 21: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/21.jpg)
Algorithm Implementation, Step 1: Preprocessing
Discard stop words → Stem → Extract frequency
![Page 22: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/22.jpg)
Algorithm Implementation, Step 2: Selection of frequent terms
Select the top frequent terms, up to 30% of the number of running terms, as a standard set of terms.
Count the number of terms in the document ($N_{total}$).
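One possible reading of the 30% rule is cumulative: keep adding terms in order of frequency while their combined count stays within 30% of the running terms. The sketch below follows that reading, which is an assumption; `select_frequent_terms` and its `coverage` parameter are hypothetical names.

```python
from collections import Counter

def select_frequent_terms(terms, coverage=0.30):
    """Pick top terms while their total count stays within the coverage."""
    n_total = len(terms)  # number of running terms in the document
    selected, covered = [], 0
    for term, count in Counter(terms).most_common():
        if covered + count > coverage * n_total:
            break
        selected.append(term)
        covered += count
    return selected

# Invented document: 17 running terms, so the limit is 5.1 occurrences.
terms = ["game"] * 3 + ["man"] * 2 + [f"t{i}" for i in range(12)]
print(select_frequent_terms(terms))
```

Other readings are possible, for example taking a fixed fraction of the distinct terms, so the cut-off rule is best treated as a tunable choice.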
![Page 23: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/23.jpg)
Algorithm Implementation, Step 3: Clustering frequent terms
• Similarity-based clustering
• Pairwise clustering
![Page 24: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/24.jpg)
Algorithm Implementation, Step 4: Calculate expected probability
Count the number of terms co-occurring with $c \in C$, denoted as $n_c$, to yield the expected probability $p_c = n_c / N_{total}$.
![Page 25: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/25.jpg)
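Algorithm Implementation, Step 5: Calculate χ'² value
$$\chi'^2(w) = \sum_{c \in C} \frac{\left(\mathrm{freq}(w, c) - n_w p_c\right)^2}{n_w p_c} - \max_{c \in C} \left\{ \frac{\left(\mathrm{freq}(w, c) - n_w p_c\right)^2}{n_w p_c} \right\}$$
Where:
- $\mathrm{freq}(w, c)$: the co-occurrence frequency of $w$ with $c \in C$
- $n_w$: the total number of terms in the sentences including $w$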
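The cluster-based score can be sketched as the sum of the per-cluster bias terms minus the single largest one, so that a term biased toward only one cluster is not overrated. This assumes the expected count for cluster c is the sentence-length total for w times the expected probability of c; the data layout and the numbers are invented.

```python
def chi_square_prime(w, clusters, freq, p, n_w):
    """Sum of per-cluster bias terms minus the largest single term."""
    terms = []
    for c in clusters:
        expected = n_w * p[c]
        if expected > 0:  # skip clusters that would divide by zero
            terms.append((freq.get((w, c), 0) - expected) ** 2 / expected)
    return sum(terms) - max(terms) if terms else 0.0

# Invented data: three clusters and a term that favours the first one.
clusters = ["c1", "c2", "c3"]
p = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
freq = {("w", "c1"): 9, ("w", "c2"): 1}
score = chi_square_prime("w", clusters, freq, p, n_w=10)
print(score)
```

Ranking all terms by this score and keeping the largest ones corresponds to the final output step.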
![Page 26: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/26.jpg)
Algorithm Implementation, Step 6: Output keywords
![Page 27: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/27.jpg)
Evaluation
![Page 28: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/28.jpg)
Evaluation
In this paper, we developed an algorithm to extract keywords from a single document.
The main advantages of our method are its simplicity, as it requires no corpus, and its high performance, which is comparable to that of the tfidf algorithm.
As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.
![Page 29: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information](https://reader036.vdocuments.net/reader036/viewer/2022062302/56816711550346895ddb7ad3/html5/thumbnails/29.jpg)
Thank you for your attention