Query and Analysis on the document and customer/item bag card of the DataDex
Kellie Erickson
Outline
• Proposed idea
• Background information
• General steps toward my idea
• Objectives to achieve for next month
• Q/A, comments, and suggestions
Proposed Idea:
Identifying the number of times an author cuts and pastes in a document
[Figure: DataDex schema diagram. The axis ticks and cell values from the original slide are not recoverable. Entity labels: Customer, Item, Bag, ItemBag, Author, Doc, Term, Gene, Exp, PI, People.]
Cards shown in the diagram:
• cust itembag card
• authordoc card
• termdoc card
• docdoc card
• termterm card (share stem?)
• expgene card
• genegene card (ppi)
• expPI card
• itembag itembag card
Possible applications:
• Detect plagiarism
• Quality of an author’s paper
• Sentence query techniques
• Automatic Key Sentence Generator
Steps towards proposed idea:
• information retrieval
• identify if authors cut-and-paste using MBR
Background Information
• keyword generator (KWL)
• summarization generators (KSL)
• anti-plagiarism tools (sentence similarity)
• sentence query (sentence similarity)
Keyword/Key Phrase Generator
• What are keywords?
– Words that relate to a particular topic
• Identify keywords:
– position of the word in the document
– tf-idf score [4]:
• term frequency in a given document gives a measure of the importance of a term within a particular document
• inverse document frequency is a measure of the general importance of the term
• Automatic keyword extractor
– Kea (Bayes' theorem) [8]
– GenEx (genetic algorithm) [7]
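The tf-idf score described above can be sketched in a few lines. This is a minimal illustration, not the scoring used by Kea or GenEx: it assumes tf is the term count divided by document length and idf is log(N / df), which is one common variant. The function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. Assumed variant: tf = count / document
    length, idf = log(N / df), where df is the number of documents that
    contain the term.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical three-document collection.
docs = [
    "gene expression data mining".split(),
    "text data mining".split(),
    "gene interaction network".split(),
]
scores = tf_idf(docs)
```

In the first document, "expression" occurs in only one document of the collection, so it scores higher than "data", which occurs in two.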
Key Sentence Generator
• Identify Key Sentences:
– Sentence position
– Sentence length
– tf-idf: sentences containing more keywords are more likely to be relevant
– Similarity to the title: the more words in a sentence that match the title, the more important the sentence [2]
– Complete sentences [3]
– Indicators (In conclusion…, We found…)
• Key Sentence Extractor
– Use scoring function [2]
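A scoring function of this kind might combine the features listed above as a weighted sum. The sketch below is an assumption-laden illustration, not the function from [2]: the weights, the `score_sentence` name, and the sample title and sentences are all hypothetical, and the sentence-length and complete-sentence features are left out for brevity.

```python
import re

INDICATORS = ("in conclusion", "we found")  # cue phrases from the slide

def tokenize(text):
    """Lowercase a string and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_sentence(sentence, index, n_sentences, keywords, title):
    """Score one sentence; higher means more likely to be a key sentence."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    title_words = set(tokenize(title))
    keyword_hits = sum(1 for w in words if w in keywords)
    title_hits = sum(1 for w in set(words) if w in title_words)
    is_edge = index == 0 or index == n_sentences - 1  # position feature
    has_cue = any(cue in sentence.lower() for cue in INDICATORS)
    # Hypothetical weights; a real system would tune them on training data.
    return 1.0 * keyword_hits + 1.0 * title_hits + 0.5 * is_edge + 1.0 * has_cue

title = "Detecting cut-and-paste in documents"
keywords = {"plagiarism", "sentence", "similarity"}
sentences = [
    "We study plagiarism in documents.",
    "The weather was nice.",
    "In conclusion, sentence similarity helps.",
]
scores = [
    score_sentence(s, i, len(sentences), keywords, title)
    for i, s in enumerate(sentences)
]
# scores == [3.5, 0.0, 4.5]
```

Note that the title match counts the stop word "in"; removing stop words first would make the title feature less noisy.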
Similarity Between Key Sentences
• Should identify semantically similar sentences and sentences with equal or similar scores
• Methods:
1) Dice coefficient: based on the number of words shared between two sentences
– 3 types of weights for each word:
• 1 if the word appears in a sentence, otherwise 0
• tf of the word
• tf-idf of the word [2]
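The Dice coefficient with per-word weights can be sketched as follows. The slides do not give the exact formula used in [2], so this assumes a common vector generalization, 2·Σ(a·b) / (Σa² + Σb²), which reduces to 2|A∩B| / (|A| + |B|) for the 0/1 weights. The helper names are hypothetical.

```python
def dice(weights_a, weights_b):
    """Dice coefficient between two {word: weight} dicts.

    Assumed form: 2 * sum(a*b) / (sum(a^2) + sum(b^2)). With 0/1 weights
    this equals 2 * |A & B| / (|A| + |B|).
    """
    shared = set(weights_a) & set(weights_b)
    numerator = 2.0 * sum(weights_a[w] * weights_b[w] for w in shared)
    denominator = (sum(v * v for v in weights_a.values())
                   + sum(v * v for v in weights_b.values()))
    return numerator / denominator if denominator else 0.0

def binary_weights(sentence):
    """Weight type 1: every word present in the sentence gets weight 1."""
    return {w: 1.0 for w in sentence.lower().split()}

a = binary_weights("the author copied this sentence")
b = binary_weights("the author rewrote this sentence")
similarity = dice(a, b)  # 4 shared words out of 5 + 5
```

To use the tf or tf-idf weighting instead, pass dicts that hold those values in place of the 0/1 weights.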
Similarity Between Key Sentences
• Methods (cont):
2) Number of keywords shared between sentences [1]
3) Keyword-set intersection: sentences are similar if the intersection of their keyword sets is the same size as the sets or only slightly smaller [5]
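Method 3 can be illustrated with plain set operations. The slide does not say which set the intersection is compared against or how much smaller counts as "slightly", so this sketch assumes the larger keyword set as the reference and a hypothetical `tolerance` parameter of one keyword.

```python
def similar_by_keywords(keywords_a, keywords_b, tolerance=1):
    """Return True if the keyword sets overlap almost completely.

    Assumption: the overlap is compared with the larger of the two sets,
    and "slightly smaller" means at most `tolerance` keywords short.
    """
    if not keywords_a or not keywords_b:
        return False
    overlap = len(keywords_a & keywords_b)
    larger = max(len(keywords_a), len(keywords_b))
    return overlap > 0 and overlap >= larger - tolerance

# Hypothetical keyword sets for three sentences.
s1 = {"plagiarism", "detection", "sentence"}
s2 = {"plagiarism", "detection", "similarity"}
s3 = {"gene", "expression"}
```

Here `similar_by_keywords(s1, s2)` is true because two of three keywords are shared, while `similar_by_keywords(s1, s3)` is false because nothing is shared.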
Assumption:
• Focusing on the database aspect, not on the linguistic point of view
Steps towards proposed idea:
• information retrieval
– Generate keyword/key phrase list (KWL)
– Generate key sentence list (KSL)
• identify if authors cut-and-paste
– Similarity between sentences
– MBR
Generate Keyword/Key Phrase List
• Use an approach already available (Kea)
• Need to consider:
– Stop words (ex. the, to, and, a, is, in, with, be)
– Stem words (ex. agree, agreed, agreeable) [9]
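Stop-word removal and stemming can be sketched as a small preprocessing step. The stop-word list is the one from the slide; `toy_stem` is a deliberately crude, hypothetical suffix stripper written only for this example, and a real pipeline would use an established algorithm such as the Porter stemmer instead.

```python
STOP_WORDS = {"the", "to", "and", "a", "is", "in", "with", "be"}
SUFFIXES = ("able", "ing", "ed", "s")  # toy list, far from complete

def toy_stem(word):
    """Strip one known suffix, then any trailing 'e' (toy rule only)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    stem = word.rstrip("e")
    return stem if len(stem) >= 3 else word

def normalize(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

terms = normalize("The author agreed to be agreeable and agree")
# terms == ["author", "agr", "agr", "agr"]
```

The three forms of "agree" collapse to one stem, so they count as the same term when keyword frequencies are computed.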
Generate Key Sentence List
• Use KWL and KPL to help identify key sentences
– Frequency of KWL and KPL entries found in a sentence
• Identify heuristics to determine key sentences
– Introduction
– First and last sentence in a paragraph
– Conclusion [6]
• Identify grammar rules to determine key sentences
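One way the position heuristic and the keyword list could work together is sketched below: take the first and last sentence of each paragraph as candidates and keep those that contain at least one keyword. The sentence splitter is naive, and the function names and sample text are hypothetical.

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]

def candidate_key_sentences(paragraphs, keywords):
    """Return first/last sentences of paragraphs that mention a keyword."""
    selected = []
    for paragraph in paragraphs:
        sentences = split_sentences(paragraph)
        if not sentences:
            continue
        # First and last sentence; a one-sentence paragraph yields one edge.
        edges = sentences[:1] if len(sentences) == 1 else [sentences[0], sentences[-1]]
        for sentence in edges:
            words = re.findall(r"[a-z]+", sentence.lower())
            if any(w in keywords for w in words):
                selected.append(sentence)
    return selected

paragraphs = [
    "Plagiarism is common. It rains a lot. Detection uses sentence similarity.",
    "This paragraph is filler.",
]
keywords = {"plagiarism", "detection", "similarity"}
ksl = candidate_key_sentences(paragraphs, keywords)
```

With this input, the first paragraph contributes its first and last sentences, and the second paragraph contributes nothing because it contains no keyword.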
Identify if authors cut and paste
• Implement MBR
– Paragraphs considered transactions
– Key sentences considered items in an item set
– Find frequent item sets to determine the amount of cut-and-paste
Identify if authors cut and paste
• Implement MBR
– Key sentence list {ks1, ks2, ks3}
– Paragraphs in the document {p1, p2, p3, p4}
      ks1  ks2  ks3
p1     1    1    1
p2     0    0    0
p3     1    0    1
p4     0    0    1
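The matrix above can be mined for frequent item sets as sketched here, treating each paragraph as a transaction and each key sentence as an item. This is a brute-force enumeration for illustration, not Apriori or any specific MBR implementation; `frequent_itemsets` and the `min_support` threshold of 2 are assumptions.

```python
from itertools import combinations

# Rows copied from the paragraph / key-sentence matrix on the slide.
matrix = {
    "p1": {"ks1": 1, "ks2": 1, "ks3": 1},
    "p2": {"ks1": 0, "ks2": 0, "ks3": 0},
    "p3": {"ks1": 1, "ks2": 0, "ks3": 1},
    "p4": {"ks1": 0, "ks2": 0, "ks3": 1},
}

def frequent_itemsets(matrix, min_support):
    """Return {itemset tuple: support} for sets in >= min_support rows."""
    transactions = [
        frozenset(item for item, present in row.items() if present)
        for row in matrix.values()
    ]
    items = sorted(set().union(*transactions))
    result = {}
    # Brute force: test every non-empty subset of items (fine for tiny inputs).
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result

frequent = frequent_itemsets(matrix, min_support=2)
# frequent == {("ks1",): 2, ("ks3",): 3, ("ks1", "ks3"): 2}
```

Here ks3 appears in three paragraphs, and the pair {ks1, ks3} appears together in two (p1 and p3), which is the kind of repeated co-occurrence the proposal would read as a sign of cut-and-paste.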
Identify if authors cut and paste
• Identify sentence-to-sentence similarity
Objectives to achieve next month
• Find an adequate automatic KWL/KPL extractor and run a training set.
• Identify heuristics and rules to create KSL
• Identify rules for similarity between sentences
References
[1] E. Park, S. Moon, and D. Ra. Web Document Retrieval Using Sentence-query Similarity. http://citeseer.ist.psu.edu/park02web.html.
[2] C. Nobata, S. Sekine, K. Uchimoto, and H. Isahara. A Summarization System with Categorization of Document Sets. http://www.cs.nyu.edu/~sekine/papers/tsc2nova2.ps.
[3] E. Alfonseca, J. Guirao, and A. Moreno-Sandoval. Description of UAM System for Generating Very Short Summaries at DUC-2004. http://www.nlpir.nist.gov/projects/duc/pubs/2004papers/uautonoma2.alfonseca.pdf.
[4] Y. Uzun. Keyword Extraction Using Naïve Bayes. http://www.cs.bilkent.edu.tr/~guvenir/courses/cs550/Workshop/Yasin_Uzun.pdf.
[5] C. Collberg, S. Kobourov, J. Louie, and T. Slattery. SPLAT: A System for Self-Plagiarism Detection. http://splat.cs.arizona.edu/icwi_plag.pdf.
[6] A. Fox. Armando’s Paper Writing and Presentations Page. http://swig.standford.edu/~fox/paper_writing.html.
[7] P.D. Turney. Learning Algorithms for Keyphrase Extraction. Information Retrieval, 1999.
[8] E. Frank, G.W. Paynter, I.H. Witten, C. Gutwin, and C.G. Nevill-Manning. Domain-specific keyphrase extraction. In IJCAI, pages 668-673, 1999.
[9] J. Callan. Text Data Mining. http://hartford.lti.cs.cmu.edu/classes/95-779.