Using Social Networking Techniques in Text Mining Document Summarization

Upload: june-ray

Post on 23-Dec-2015


Page 1: Using Social Networking Techniques in Text Mining Document Summarization

Using Social Networking Techniques in Text Mining

Document Summarization

Page 2: Using Social Networking Techniques in Text Mining Document Summarization

Using Social Networking Techniques in Summarization

Definition: Text document summarization is the task of extracting thematically or topically important sentences from one or more documents.

Points: The steps of a traditional summarization method can be given as:

1. Identify the signature terms.

2. Rank the sentences in the document or document set based upon their weight.

3. Choose the most highly ranked sentences.
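Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration: the function name `select_top_sentences` and the summary size `k` are hypothetical, the scores are assumed to have been computed already by some ranking method, and returning the chosen sentences in their original order is a design choice made here, not something the slides prescribe.

```python
def select_top_sentences(sentences, scores, k):
    """Return the k highest-scoring sentences, in original document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Step 2: rank sentence indices by descending score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Step 3: keep the top k, then restore document order (an assumption
    # of this sketch) so the summary reads in the order of the source text.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

For example, with scores `[0.1, 0.9, 0.3, 0.8]` and `k=2`, the second and fourth sentences are returned.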

Use of Social Networking Techniques: Social networking based techniques help in identifying signature terms and ranking the sentences, e.g.:

1. TextRank (Mihalcea and Tarau, 2004)

2. Degree centrality (Erkan and Radev, 2004)

3. LexRank with threshold (Erkan and Radev, 2004)

4. Continuous LexRank (Erkan and Radev, 2004)

Page 3: Using Social Networking Techniques in Text Mining Document Summarization

TextRank

3: BC-HurricaneGilbert, 09-11 339
4: BC-Hurricane Gilbert, 0348
5: Hurricane Gilbert heads toward Dominican Coast
6: By Ruddy Gonzalez
7: Associated Press Writer
8: Santo Domingo, Dominican Republic (AP)
9: Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains, and high seas.
10: The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.
11: "There is no need for alarm," Civil Defense Director Eugenio Cabral said in a television alert shortly after midnight Saturday.
12: Cabral said residents of the province of Barahona should closely follow Gilbert's movement.
13: An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.
14: Tropical storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night.
15: The National Hurricane Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.
16: The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a "broad area of cloudiness and heavy weather" rotating around the center of the storm.
17: The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.
18: Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds, and up to 12 feet to Puerto Rico's south coast.
19: There were no reports on casualties.
20: San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night.
21: On Saturday, Hurricane Florence was downgraded to a tropical storm, and its remnants pushed inland from the U.S. Gulf Coast.
22: Residents returned home, happy to find little damage from 90 mph winds and sheets of rain.
23: Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane.
24: The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month.

Figure: Representing Text as Graph

Page 4: Using Social Networking Techniques in Text Mining Document Summarization

TextRank

Preparing the sentence graph:

TextRank uses an undirected graph whose nodes are the sentences of the text.

Calculating similarity between sentences: Given two sentences $S_i$ and $S_j$, with a sentence represented by the set of $N_i$ words that appear in it, i.e. $S_i = W_1^i, W_2^i, \ldots, W_{N_i}^i$, the similarity of $S_i$ and $S_j$ is defined as:

$$\mathrm{Similarity}(S_i, S_j) = \frac{\left|\{W_k \mid W_k \in S_i \;\&\; W_k \in S_j\}\right|}{\log|S_i| + \log|S_j|}$$
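The similarity measure above can be sketched in Python as follows. The function name `textrank_similarity` is hypothetical, sentences are assumed to be already tokenized into lists of words, and returning 0 when the denominator is undefined or zero (empty or one-word sentences) is a choice of this sketch that the slides do not address.

```python
import math


def textrank_similarity(s_i, s_j):
    """Word-overlap similarity between two tokenized sentences."""
    # log(0) is undefined, so empty sentences get similarity 0 (assumption).
    if not s_i or not s_j:
        return 0.0
    # Denominator: log|S_i| + log|S_j|, using sentence lengths in words.
    denominator = math.log(len(s_i)) + math.log(len(s_j))
    # Two one-word sentences give log(1) + log(1) = 0; return 0 (assumption).
    if denominator == 0:
        return 0.0
    # Numerator: number of distinct words that occur in both sentences.
    overlap = len(set(s_i) & set(s_j))
    return overlap / denominator
```

The logarithms in the denominator normalize by sentence length, so that long sentences are not favored merely for containing more words.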

Page 5: Using Social Networking Techniques in Text Mining Document Summarization

TextRank

Important Points: Each vertex corresponds to a sentence. A weight $W_{ij}$ is assigned to the edge connecting the two vertices $V_i$ and $V_j$, and its value is the similarity score between sentences $S_i$ and $S_j$.

Ranking the Sentences: The score of $V_i$, $S(V_i)$, is initialized with a default value and is computed iteratively until convergence using this recursive formula:

$$S(V_i) = (1 - d) + d \sum_{V_j \in Adj(V_i)} \frac{W_{ij}}{\sum_{V_k \in Adj(V_j)} W_{jk}} \, S(V_j)$$

where $Adj(V_i)$ denotes the neighbors of $V_i$ and $d$ is the damping factor, set to 0.85 (Brin and Page, 1998).
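The iterative computation can be sketched as below. This is a minimal illustration, not a reference implementation: the function name `textrank_scores`, the initial score of 1.0, the tolerance, and the iteration cap are all assumptions, and the graph is assumed to be given as a symmetric weight matrix in which a zero entry means "no edge".

```python
def textrank_scores(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the weighted score formula on a symmetric weight matrix."""
    n = len(weights)
    # Inner sum of the formula: total weight on the edges of each vertex V_j
    # (self-loops, if present in the matrix, are ignored).
    out_sum = [
        sum(w for k, w in enumerate(row) if k != j)
        for j, row in enumerate(weights)
    ]
    scores = [1.0] * n  # default initial value (assumption)
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Each neighbor V_j passes on a share W_ij / sum_k W_jk of its score.
            incoming = sum(
                weights[i][j] / out_sum[j] * scores[j]
                for j in range(n)
                if j != i and weights[i][j] > 0 and out_sum[j] > 0
            )
            new_scores.append((1 - d) + d * incoming)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:  # converged
            break
    return scores
```

On a three-sentence graph where sentence 0 is linked to both others, `textrank_scores([[0, 1, 1], [1, 0, 0], [1, 0, 0]])` gives sentence 0 the highest score and the other two equal scores.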

Page 6: Using Social Networking Techniques in Text Mining Document Summarization

Degree centrality (Erkan and Radev, 2004)

Degree Centrality: the number of direct neighbors of node $V$, i.e.

$$d(V) = |N(V)|$$

where $N(V)$ is the set of neighbors of $V$.

Degree centrality based document summarizer:

• Uses a document cluster.
• Considers the sentences in the document cluster as nodes in a graph.
• Each link represents some relation among sentences (e.g. semantic similarity).

Main Points:

• Each edge is a vote.
• Totally democratic.
• The more connections a sentence has, the higher its degree and the higher its centrality.

Problems: (1) Off-topic (outlier) documents can influence the ranking and get their sentences into the summary; (2) the method neglects the influence of useful phrases or words.
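A degree centrality ranking can be sketched as follows. The function name `degree_centrality` and the threshold value 0.1 are assumptions made for illustration: the slides only say that a link represents some relation such as semantic similarity, so here two sentences are treated as linked when their pairwise similarity exceeds the threshold.

```python
def degree_centrality(similarity, threshold=0.1):
    """Count, for each sentence, the neighbors whose similarity exceeds threshold."""
    n = len(similarity)
    return [
        # Each qualifying edge is one vote; a sentence does not vote for itself.
        sum(1 for j in range(n) if j != i and similarity[i][j] > threshold)
        for i in range(n)
    ]
```

Every edge counts equally, which is what makes the method "totally democratic": a vote from an outlier sentence weighs as much as a vote from a central one.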

Page 7: Using Social Networking Techniques in Text Mining Document Summarization

LexRank

LexRank: Instead of being wholly democratic like the degree centrality method, give "prestigious" sentences more of a vote. The higher the centrality of a sentence, the more its vote counts.

Neighbors of prestigious sentences: Also, distribute a sentence's centrality to its neighboring nodes (propagate).

Distributed Centrality:

$$P(U) = \sum_{V \in adj[U]} \frac{P(V)}{\deg(V)}$$

where

$P(U)$ is the centrality of node $U$,

$adj[U]$ is the set of nodes adjacent to $U$,

$\deg(V)$ is the degree of node $V$.
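One way to solve the distributed centrality equation is to apply it repeatedly until the values stop changing. The sketch below assumes an undirected graph given as adjacency lists; the function name `lexrank_unweighted`, the uniform starting values, and the stopping parameters are hypothetical. Without a damping term, this plain iteration is only expected to settle when the graph is connected and not bipartite.

```python
def lexrank_unweighted(adj, tol=1e-10, max_iter=1000):
    """Iterate P(U) = sum over V in adj[U] of P(V) / deg(V)."""
    n = len(adj)
    deg = [len(neighbors) for neighbors in adj]
    p = [1.0 / n] * n  # start from a uniform distribution (assumption)
    for _ in range(max_iter):
        # Each node V splits its centrality evenly among its deg(V) neighbors.
        new_p = [sum(p[v] / deg[v] for v in adj[u]) for u in range(n)]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:  # converged
            break
    return p
```

For an undirected, unweighted graph the result is proportional to the node degrees; the prestige-weighted vote starts to differ from plain degree counting once edge weights or a damping term are introduced.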

Page 8: Using Social Networking Techniques in Text Mining Document Summarization

Weighted LexRank

Weighted LexRank: The degree centrality and LexRank methods described previously are unweighted. The weighted LexRank score can be calculated using the following equation:

$$P(U) = \frac{d}{N} + (1 - d) \sum_{V \in adj[U]} \frac{\text{idf\_modified\_cosine}(U, V)}{\sum_{Z \in adj[V]} \text{idf\_modified\_cosine}(Z, V)} \, P(V)$$
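The weighted score can be sketched in Python, taking the pairwise similarities as a precomputed matrix. Several things here are assumptions: the function name `weighted_lexrank`, the default `d=0.15`, the stopping parameters, and the decision to treat every pair with nonzero similarity (including a sentence with itself, if the diagonal is nonzero) as adjacent. Here $N$ is taken to be the number of sentences, and $d$ multiplies the uniform term $d/N$, which is the reverse of the role $d$ plays in the TextRank formula.

```python
def weighted_lexrank(sim, d=0.15, tol=1e-8, max_iter=1000):
    """Iterate the weighted LexRank equation on a similarity matrix."""
    n = len(sim)
    # Denominator for each V: the sum over Z of similarity(Z, V).
    col_sum = [sum(sim[z][v] for z in range(n)) for v in range(n)]
    p = [1.0 / n] * n  # uniform start (assumption)
    for _ in range(max_iter):
        new_p = []
        for u in range(n):
            # Each V passes on a share of P(V) proportional to similarity(U, V).
            received = sum(
                sim[u][v] / col_sum[v] * p[v]
                for v in range(n)
                if sim[u][v] > 0 and col_sum[v] > 0
            )
            new_p.append(d / n + (1 - d) * received)
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:  # converged
            break
    return p
```

Because no threshold is applied, weak similarities still contribute a small vote instead of being discarded.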

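NOTE: The "idf-modified cosine similarity" can be calculated using the following equation:

$$\text{idf\_modified\_cosine}(X, Y) = \frac{\sum_{W \in X, Y} tf_{W,X} \, tf_{W,Y} \, (idf_W)^2}{\sqrt{\sum_{X_i \in X} (tf_{X_i,X} \, idf_{X_i})^2} \times \sqrt{\sum_{Y_i \in Y} (tf_{Y_i,Y} \, idf_{Y_i})^2}}$$

where

$tf_{W,S}$ is the number of occurrences of word $W$ in sentence $S$,

$idf_W$ is the inverse document frequency of word $W$,

$W$ is a word shared by sentences $X$ and $Y$.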
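A minimal sketch of this similarity in Python follows. The function name `idf_modified_cosine` and the `idf` dictionary argument are hypothetical: the slides do not say how the inverse document frequencies are obtained, so they are assumed to be precomputed and passed in, with unknown words given an idf of 0.

```python
import math
from collections import Counter


def idf_modified_cosine(x, y, idf):
    """idf-modified cosine between two tokenized sentences x and y."""
    tf_x, tf_y = Counter(x), Counter(y)  # term frequencies per sentence
    # Numerator: sum over shared words of tf_x * tf_y * idf^2.
    shared = set(tf_x) & set(tf_y)
    numerator = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2 for w in shared)
    # Denominator: product of the tf-idf vector lengths of the two sentences.
    norm_x = math.sqrt(sum((tf_x[w] * idf.get(w, 0.0)) ** 2 for w in tf_x))
    norm_y = math.sqrt(sum((tf_y[w] * idf.get(w, 0.0)) ** 2 for w in tf_y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # no weighted terms to compare (assumption)
    return numerator / (norm_x * norm_y)
```

For instance, with `idf = {"storm": 1.0, "wind": 2.0, "rain": 2.0}`, the sentences `["storm", "wind", "storm"]` and `["storm", "rain"]` share only "storm", and the rarer words "wind" and "rain" enlarge the denominator, which lowers the similarity.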

Page 9: Using Social Networking Techniques in Text Mining Document Summarization

Test Your Understanding

• Point out the major differences between TextRank and Weighted LexRank.

(Note: just differentiate between the ranking schemes.)

• Can you prove the correctness of LexRank's ranking formula?

(Hint: read about, or watch a video on, the correctness of the PageRank-based equation.)

Page 10: Using Social Networking Techniques in Text Mining Document Summarization

References

• Mihalcea, Rada and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.

• Erkan, Günes and Dragomir R. Radev. 2004. LexPageRank: Prestige in multi-document text summarization. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 365–371.