Query and Analysis on the Document and Customer/Item Bag Card of the DataDex

Kellie Erickson

Outline

• Proposed idea

• Background information

• General steps toward my idea

• Objectives to achieve for next month

• Q/A, comments, and suggestions

Proposed Idea:

Identifying the number of times an author cuts and pastes in a document

[Figure: the DataDex, showing entities (Customer, Item, ItemBag, Author, Doc, term, Gene, Exp, PI, People) and the cards that relate them: cust itembag card, authordoc card, termdoc card, docdoc card, termterm card (share stem?), expgene card, genegene card (ppi), expPI card, itembag itembag card]

Possible applications:

• Detect plagiarism

• Assess the quality of an author’s paper

• Sentence query techniques

• Automatic Key Sentence Generator

Steps towards proposed idea:

• information retrieval

• identify if authors cut-and-paste using MBR

Background Information

• keyword generator (KWL)

• summarization generators (KSL)

• anti-plagiarism tools (sentence similarity)

• sentence query (sentence similarity)

Keyword/Key Phrase Generator

• What are keywords?

– Words that relate to a particular topic

• Identify keywords:

– position of the word in the document

– tf-idf score [4]:

• term frequency in a given document measures the importance of a term within that document

• inverse document frequency measures the general importance of the term across the document collection; terms that appear in many documents score lower

• Automatic keyword extractors:

– Kea (Bayes' theorem) [8]

– GenEx (genetic algorithm) [7]
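The tf-idf score described on this slide can be sketched in a few lines. This is a minimal illustration, assuming the common definition tf × log(N / df) with term frequency normalized by document length; the function name `tf_idf` and the toy documents are hypothetical, and Kea and GenEx combine such features in their own ways.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # tf (relative frequency in this document) times idf.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy collection of three tokenized documents.
docs = [
    "the gene regulates the protein".split(),
    "the author wrote the paper".split(),
    "the protein binds the gene".split(),
]
scores = tf_idf(docs)
```

A word such as "the" that occurs in every document gets an idf of log(1) = 0, so it drops out as a keyword candidate, while "author", which occurs in only one document, scores higher than "gene", which occurs in two.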

Key Sentence Generator

• Identify key sentences:

– Sentence position

– Sentence length

– tf-idf: sentences containing more keywords are more likely to be relevant

– Similarity to the title: the more words in a sentence that match the title, the more important the sentence [2]

– Complete sentences [3]

– Indicators (In conclusion…, We found…)

• Key Sentence Extractor

– Use a scoring function [2]
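One way to combine the features listed above into a single scoring function is sketched here. The weights (1, 1, 0.5, 0.5), the helper names, and the sample sentences are assumptions made for illustration; they are not the scoring function of [2].

```python
import re

# Cue phrases taken from the slide; a real list would be longer.
INDICATORS = ("in conclusion", "we found")

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score_sentence(sentence, index, total, title, keywords):
    """Score one sentence; the weights are hypothetical."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    title_words = set(tokenize(title))
    # Fraction of the sentence's words that are keywords.
    keyword_score = sum(w in keywords for w in words) / len(words)
    # Fraction of the title's words that the sentence contains.
    title_score = len(title_words & set(words)) / len(title_words) if title_words else 0.0
    # First and last sentences receive a position bonus.
    position_score = 1.0 if index in (0, total - 1) else 0.0
    indicator_score = 1.0 if sentence.lower().startswith(INDICATORS) else 0.0
    return keyword_score + title_score + 0.5 * position_score + 0.5 * indicator_score

def key_sentences(sentences, title, keywords, top_n=2):
    """Return the top_n highest-scoring sentences, best first."""
    scored = [
        (score_sentence(s, i, len(sentences), title, keywords), s)
        for i, s in enumerate(sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_n]]

title = "Detecting cut-and-paste in documents"
keywords = {"cut", "paste", "documents", "sentences"}
sentences = [
    "We study cut and paste in documents.",
    "The weather was pleasant that day.",
    "In conclusion, key sentences reveal cut and paste.",
]
top = key_sentences(sentences, title, keywords)
```

With these toy inputs the concluding sentence ranks first because it earns the keyword, title, position, and indicator contributions, and the off-topic middle sentence scores zero.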

Similarity Between Key Sentences

• Should identify semantically similar sentences and sentences with equal or similar scores

• Methods:

1) Dice coefficient: based on the number of words shared between two sentences

– 3 types of weights for each word:

• 1 if the word appears in a sentence, otherwise 0

• tf of a word

• tf-idf of the word [2]
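The Dice coefficient with the weighting options above can be sketched as follows. The vector form used here, 2·Σ aᵢbᵢ / (Σ aᵢ² + Σ bᵢ²), is one common generalization and is an assumption; with 0/1 weights it reduces to the familiar 2|A ∩ B| / (|A| + |B|). The helper names are hypothetical, and tf-idf weights could be passed in the same way as any other {word: weight} mapping.

```python
from collections import Counter

def dice(weights_a, weights_b):
    """Dice coefficient for two {word: weight} mappings."""
    shared = weights_a.keys() & weights_b.keys()
    numerator = 2 * sum(weights_a[w] * weights_b[w] for w in shared)
    denominator = (sum(v * v for v in weights_a.values())
                   + sum(v * v for v in weights_b.values()))
    return numerator / denominator if denominator else 0.0

def binary_weights(words):
    # Weight 1 if the word appears in the sentence (absent words are 0).
    return {w: 1 for w in words}

def tf_weights(words):
    # Weight is the number of times the word appears in the sentence.
    return dict(Counter(words))

s1 = "the author pasted the paragraph".split()
s2 = "the author copied the paragraph".split()

binary_similarity = dice(binary_weights(s1), binary_weights(s2))  # 0.75
tf_similarity = dice(tf_weights(s1), tf_weights(s2))              # 6/7
```

The two sentences share three of their four distinct words, giving 0.75 under binary weights; under tf weights the repeated word "the" counts more heavily and the similarity rises to 6/7.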

Similarity Between Key Sentences

• Methods (cont.):

2) Number of keywords shared between sentences [1]

3) Keyword-set overlap: two sentences are similar if the intersection of their keyword sets is the same size as, or only slightly smaller than, the sets themselves [5]
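Methods 2 and 3 both reduce to set operations on the keywords found in each sentence. The sketch below is one reading of them; the `tolerance` parameter, which encodes "slightly smaller", and the function names are assumptions, not values taken from [1] or [5].

```python
def shared_keyword_count(words_a, words_b, keywords):
    """Method 2: number of keywords that occur in both sentences."""
    return len(set(words_a) & set(words_b) & keywords)

def similar_by_keywords(words_a, words_b, keywords, tolerance=1):
    """Method 3: the intersection of the keyword sets is as large as,
    or at most `tolerance` smaller than, the larger keyword set."""
    kw_a = set(words_a) & keywords
    kw_b = set(words_b) & keywords
    shared = kw_a & kw_b
    if not shared:
        # Sentences with no keyword in common are never similar.
        return False
    return len(shared) >= max(len(kw_a), len(kw_b)) - tolerance

keywords = {"plagiarism", "sentence", "keyword", "document"}
a = "plagiarism detection compares each sentence in a document".split()
b = "each sentence of the document is checked for plagiarism".split()
c = "a keyword list is built first".split()
```

Here `a` and `b` contain the same three keywords and are reported as similar, while `a` and `c` share none and are not.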

Assumption:

• Focusing on the database aspect, not on the linguistic point of view

Steps towards proposed idea:

• information retrieval

– Generate keyword/key phrase list (KWL)

– Generate key sentence list (KSL)

• identify if authors cut-and-paste

– Similarity between sentences

– MBR

Generate Keyword/Key Phrase List

• Use an approach already available (Kea)

• Need to consider:

– Stop words (e.g. the, to, and, a, is, in, with, be)

– Stem words (e.g. agree, agreed, agreeable) [9]
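Stop-word removal and stemming can be illustrated with a deliberately naive preprocessing step. The stop-word set is the one from the slide; the suffix rules and the names `naive_stem` and `candidate_terms` are hypothetical stand-ins for a real stemmer and would mis-stem many English words.

```python
import re

# Stop words listed on the slide; a real list would be much longer.
STOP_WORDS = {"the", "to", "and", "a", "is", "in", "with", "be"}

# (suffix, replacement) rules, tried in order. Illustrative only.
SUFFIX_RULES = (("eed", "ee"), ("able", ""), ("ed", ""), ("ing", ""), ("s", ""))

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 letters of stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def candidate_terms(text):
    """Lowercase, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

terms = candidate_terms("The authors agreed to be agreeable and agree")
# terms == ["author", "agree", "agree", "agree"]
```

After this step the three forms agree, agreed, and agreeable count as the same term, so their frequencies add up when keywords are scored.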

Generate Key Sentence List

• Use the KWL and the key phrase list (KPL) to help identify key sentences

– Frequency of KWL and KPL entries found in a sentence

• Identify heuristics to determine key sentences

– Introduction

– First and last sentence in a paragraph

– Conclusion [6]

• Identify grammar rules to determine key sentences

Identify if authors cut and paste

• Implement MBR

– Paragraphs considered transactions

– Key sentences considered items in an item set

– Find frequent item sets to determine the amount of cut-and-paste


• Implement MBR

– Key sentence list {ks1, ks2, ks3}

– Paragraphs in the document {p1, p2, p3, p4}

      ks1  ks2  ks3
p1     1    1    1
p2     0    0    0
p3     1    0    1
p4     0    0    1


• Identify sentence-to-sentence similarity
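The paragraph-by-key-sentence table above can be mined for frequent item sets directly. The sketch below uses brute-force enumeration rather than an optimized algorithm, and the `min_support` threshold of 2 is an assumed value chosen for illustration. The transactions reproduce the table: each paragraph is a transaction and each key sentence it contains is an item.

```python
from itertools import combinations

# Transactions taken from the table on the slide.
transactions = {
    "p1": {"ks1", "ks2", "ks3"},
    "p2": set(),
    "p3": {"ks1", "ks3"},
    "p4": {"ks3"},
}

def frequent_item_sets(transactions, min_support):
    """Return {item set: support} for every item set that occurs
    in at least min_support transactions."""
    items = sorted(set().union(*transactions.values()))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            # Support: number of paragraphs containing every item.
            support = sum(set(candidate) <= t for t in transactions.values())
            if support >= min_support:
                frequent[candidate] = support
                found = True
        if not found:
            # No frequent set of this size, so no larger one can exist.
            break
    return frequent

result = frequent_item_sets(transactions, min_support=2)
# result == {("ks1",): 2, ("ks3",): 3, ("ks1", "ks3"): 2}
```

For this table, ks3 appears in three paragraphs and the pair {ks1, ks3} recurs in two, which is the kind of repeated co-occurrence the proposal would count as evidence of cut-and-paste.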

Objectives to achieve next month

• Find an adequate automatic KWL/KPL extractor and run it on a training set

• Identify heuristics and rules to create KSL

• Identify rules for similarity between sentences

References

[1] E. Park, S. Moon, and D. Ra. Web Document Retrieval Using Sentence-query Similarity. http://citeseer.ist.psu.edu/park02web.html.

[2] C. Nobata, S. Sekine, K. Uchimoto, and H. Isahara. A Summarization System with Categorization of Document Sets. http://www.cs.nyu.edu/~sekine/papers/tsc2nova2.ps.

[3] E. Alfonseca, J. Guirao, and A. Moreno-Sandoval. Description of UAM System for Generating Very Short Summaries at DUC-2004. http://www-nlpir.nist.gov/projects/duc/pubs/2004papers/uautonoma2.alfonseca.pdf.

[4] Y. Uzun. Keyword Extraction Using Naïve Bayes. http://www.cs.bilkent.edu.tr/~guvenir/courses/cs550/Workshop/Yasin_Uzun.pdf.

[5] C. Collberg, S. Kobourov, J. Louie, and T. Slattery. SPLAT: A System for Self-Plagiarism Detection. http://splat.cs.arizona.edu/icwi_plag.pdf.

[6] A. Fox. Armando’s Paper Writing and Presentations Page. http://swig.standford.edu/~fox/paper_writing.html.

[7] P.D. Turney. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 1999.

[8] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, and C.G. Nevill-Manning. Domain-specific keyphrase extraction. In IJCAI, pages 668-673, 1999.

[9] J. Callan. Text Data Mining. http://hartford.lti.cs.cmu.edu/classes/95-779.
