Using Social Networking Techniques in Text Mining
Document Summarization
Using Social Networking Techniques in Summarization
Definition: Text document summarization is the task of extracting thematically or topically important sentences from one or more documents.
Points: The traditional summarization steps are:
1. Identify the signature terms.
2. Rank the sentences in the document or document set based upon their weight.
3. Choose the most highly ranked sentences.
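The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "signature terms" are approximated by the most frequent non-stopword terms, the tiny `STOPWORDS` set is made up for the example, and the names `tokenize` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was", "for", "on"}

def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def summarize(sentences, num_terms=5, num_sentences=2):
    # Step 1: signature terms (assumption: the most frequent non-stopword terms).
    counts = Counter(w for s in sentences for w in tokenize(s))
    signature = {w for w, _ in counts.most_common(num_terms)}
    # Step 2: weight each sentence by its number of signature-term occurrences.
    weights = [sum(1 for w in tokenize(s) if w in signature) for s in sentences]
    # Step 3: keep the highest-weighted sentences, restored to document order.
    top = sorted(range(len(sentences)), key=lambda i: -weights[i])[:num_sentences]
    return [sentences[i] for i in sorted(top)]

sentences = [
    "The storm hit the coast.",
    "Residents returned home.",
    "The storm brought heavy storm winds to the coast.",
]
print(summarize(sentences, num_terms=2, num_sentences=1))
```

The graph-based methods that follow replace the simple weighting in step 2 with a centrality score computed over a sentence graph.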
Use of Social Networking Techniques: Social-networking-based techniques help in identifying signature terms and ranking the sentences, e.g.:
1. Text Rank (Mihalcea and Tarau, 2004)
2. Degree centrality (Erkan and Radev, 2004)
3. LexRank with threshold (Erkan and Radev, 2004)
4. Continuous LexRank (Erkan and Radev, 2004)
Text Rank
3: BC−HurricaineGilbert, 09−11 339
4: BC−Hurricaine Gilbert, 0348
5: Hurricaine Gilbert heads toward Dominican Coast
6: By Ruddy Gonzalez
7: Associated Press Writer
8: Santo Domingo, Dominican Republic (AP)
9: Hurricaine Gilbert Swept towrd the Dominican Republic Sunday, and the Civil Defense alerted its heavily populated south coast to prepare for high winds, heavy rains, and high seas.
10: The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph.
11: "There is no need for alarm," Civil Defense Director Eugenio Cabral said in a television alert shortly after midnight Saturday.
12: Cabral said residents of the province of Barahona should closely follow Gilbert’s movement.
13: An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo.
14: Tropical storm Gilbert formed in the eastern Carribean and strenghtened into a hurricaine Saturday night.
15: The National Hurricaine Center in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo.
16: The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westard at 15 mph with a "broad area of cloudiness and heavy weather" rotating around the center of the storm.
17: The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday.
18: Strong winds associated with the Gilbert brought coastal flooding, strong southeast winds, and up to 12 feet to Puerto Rico’s south coast.
19: There were no reports on casualties.
20: San Juan, on the north coast, had heavy rains and gusts Saturday, but they subsided during the night.
21: On Saturday, Hurricane Florence was downgraded to a tropical storm, and its remnants pushed inland from the U.S. Gulf Coast.
22: Residents returned home, happy to find little damage from 90 mph winds and sheets of rain.
23: Florence, the sixth named storm of the 1988 Atlantic storm season, was the second hurricane.
24: The first, Debby, reached minimal hurricane strength briefly before hitting the Mexican coast last month.
Figure: Representing Text as Graph
Text Rank
Preparing the sentence graph:
Uses an undirected graph in which each sentence is a node.
Calculating similarity between sentences: Given two sentences $S_i$ and $S_j$, with a sentence represented by the set of $N_i$ words that appear in it, i.e. $S_i = W_1^i, W_2^i, \ldots, W_{N_i}^i$, the similarity of $S_i$ and $S_j$ is defined as:
$$\mathrm{Similarity}(S_i, S_j) = \frac{|\{W_k \mid W_k \in S_i \;\&\; W_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$
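The similarity measure can be written as a short function. The sketch below assumes each sentence has already been tokenized into a list of words, takes $|S_i|$ to be the number of words in the sentence, and returns 0 when the denominator would be undefined or zero; the name `textrank_similarity` is hypothetical.

```python
import math

def textrank_similarity(words_i, words_j):
    """Count of distinct shared words divided by log|S_i| + log|S_j|."""
    if not words_i or not words_j:
        return 0.0  # assumption: an empty sentence has no similarity to anything
    denom = math.log(len(words_i)) + math.log(len(words_j))
    if denom == 0:
        return 0.0  # both sentences contain a single word, so log(1) + log(1) = 0
    overlap = len(set(words_i) & set(words_j))
    return overlap / denom

s5 = ["hurricaine", "gilbert", "heads", "toward", "coast"]
s9 = ["gilbert", "swept", "toward", "coast"]
print(textrank_similarity(s5, s9))  # 3 shared words over log(5) + log(4)
```

The logarithms in the denominator act as a length normalization, so long sentences are not favoured merely because they contain more words.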
Text Rank
Important Points: Each vertex corresponds to a sentence. A weight $W_{ij}$ is assigned to the edge connecting the two vertices $V_i$ and $V_j$, and its value is the similarity score between sentences $S_i$ and $S_j$.
Ranking the Sentences: The score of $V_i$, $S(V_i)$, is initialized with a default value and is computed iteratively until convergence using this recursive formula:
$$S(V_i) = (1 - d) + d \sum_{V_j \in Adj(V_i)} \frac{W_{ij}}{\sum_{V_k \in Adj(V_j)} W_{jk}} \, S(V_j)$$
where $Adj(V_i)$ denotes the neighbors of $V_i$ and $d$ is the damping factor, set to 0.85 (Brin and Page, 1998).
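The recursive formula can be evaluated by simple fixed-point iteration. The sketch below assumes a symmetric weight matrix with a zero diagonal (an undirected graph without self-loops), an initial score of 1.0 for every vertex, and a convergence tolerance of `1e-6`; these defaults and the name `textrank_scores` are illustrative choices, not values from the slides.

```python
def textrank_scores(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate S(V_i) = (1 - d) + d * sum_j W_ij / (sum_k W_jk) * S(V_j)."""
    n = len(weights)
    scores = [1.0] * n                       # assumed default initial value
    out_sum = [sum(row) for row in weights]  # sum_k W_jk for every vertex j
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and weights[i][j] > 0 and out_sum[j] > 0:
                    rank += weights[i][j] / out_sum[j] * scores[j]
            new.append((1 - d) + d * rank)
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores

# Vertex 0 is similar to vertices 1 and 2, which are not similar to each other.
weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
print(textrank_scores(weights))
```

In this small graph the central vertex 0 ends up with the highest score, since both other vertices pass all of their weight to it.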
Degree centrality (Erkan and Radev, 2004)
Degree Centrality: the number of direct neighbors of node $V$, i.e. $d(V) = |N(V)|$, where $N(V)$ is the set of neighbors of $V$.
Degree centrality based Document summarizer:
• Uses a document cluster.
• Consider the sentences in a document cluster as nodes in a graph.
• Each link represents some relation among sentences (e.g. semantic similarity).
Main Points:
• Each edge is a vote.
• Totally democratic.
• The more connections a sentence has, the higher its degree and the higher its centrality.
Problems:
(1) Off-topic (or outlier) documents can influence the summary and make it into the summary.
(2) Neglects the influence of useful phrases or words.
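Degree centrality is easy to compute once the sentence graph is available. The sketch below assumes the graph is given as a pairwise similarity matrix and that an edge exists whenever the similarity exceeds a cutoff; the `threshold` parameter, its default of 0.1, and the name `degree_centrality` are assumptions made for the example.

```python
def degree_centrality(similarity, threshold=0.1):
    """Count, for each sentence, the other sentences whose similarity exceeds threshold."""
    n = len(similarity)
    return [
        sum(1 for j in range(n) if j != i and similarity[i][j] > threshold)
        for i in range(n)
    ]

similarity = [
    [1.00, 0.30, 0.20],
    [0.30, 1.00, 0.05],
    [0.20, 0.05, 1.00],
]
print(degree_centrality(similarity))  # [2, 1, 1]
```

Every edge counts exactly once regardless of which sentence it comes from, which is the "totally democratic" behaviour described above.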
LexRank
LexRank: Instead of being wholly democratic like the degree centrality method, give "prestigious" sentences more of a vote. The higher the centrality of a sentence, the more its vote counts.
Neighbors of prestigious sentences: Also, distribute a sentence's centrality to its neighboring nodes (propagate).
Distributed Centrality:
$$P(U) = \sum_{V \in adj[U]} \frac{P(V)}{\deg(V)}$$
Where,
$P(U)$ is the centrality of node $U$
$adj[U]$ is the set of nodes adjacent to $U$
$\deg(V)$ is the degree of node $V$
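The distributed centrality can be approximated by repeatedly applying the formula until the values stop changing. The sketch below assumes an undirected graph given as adjacency lists in which every node has at least one neighbor, a uniform starting vector, and a graph that is connected and not bipartite (otherwise this undamped iteration may not settle); the name `lexrank_unweighted` is hypothetical.

```python
def lexrank_unweighted(adj, tol=1e-10, max_iter=1000):
    """Iterate P(U) = sum over V in adj[U] of P(V) / deg(V)."""
    n = len(adj)
    p = [1.0 / n] * n  # assumed uniform initial centrality
    for _ in range(max_iter):
        new = [sum(p[v] / len(adj[v]) for v in adj[u]) for u in range(n)]
        if max(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

# A triangle (0, 1, 2) with one extra sentence (3) attached to node 2.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
print(lexrank_unweighted(adj))
```

Each node splits its current centrality equally among its neighbors, so a vote from a well-connected, central sentence is passed on rather than counted only once.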
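Weighted LexRank
Weighted LexRank: The previous degree centrality and LexRank methods were unweighted. The weighted LexRank can be calculated using the following equation:
$$P(U) = \frac{d}{N} + (1 - d) \sum_{V \in adj[U]} \frac{\mathrm{idf\_modified\_cosine}(U, V)}{\sum_{Z \in adj[V]} \mathrm{idf\_modified\_cosine}(Z, V)} \, P(V)$$
where $N$ is the total number of nodes in the graph and $d$ is the damping factor. Note that $d$ here multiplies the uniform term, the reverse of its placement in the Text Rank formula.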
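The weighted formula can be iterated in the same way as the unweighted one. The sketch below assumes the pairwise similarities are already available as a symmetric matrix `sim`, leaves self-similarity out of the sums, and uses `d=0.15` as an illustrative default; these choices and the name `weighted_lexrank` are assumptions, not part of the slides.

```python
def weighted_lexrank(sim, d=0.15, tol=1e-8, max_iter=1000):
    """Iterate P(U) = d/N + (1 - d) * sum_V sim(U,V) / (sum_Z sim(Z,V)) * P(V)."""
    n = len(sim)
    # Denominator for each V: total similarity between V and every other node Z.
    col_sum = [sum(sim[z][v] for z in range(n) if z != v) for v in range(n)]
    p = [1.0 / n] * n  # assumed uniform initial centrality
    for _ in range(max_iter):
        new = []
        for u in range(n):
            total = sum(
                sim[u][v] / col_sum[v] * p[v]
                for v in range(n)
                if v != u and col_sum[v] > 0
            )
            new.append(d / n + (1 - d) * total)
        converged = max(abs(a - b) for a, b in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p

sim = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
print(weighted_lexrank(sim))
```

Because stronger similarities carry a larger share of each node's centrality, sentence 0, which is strongly similar to both others, receives the highest score in this example.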
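NOTE: The “idf-modified cosine similarity” can be calculated using the following equation:
$$\mathrm{idf\_modified\_cosine}(X, Y) = \frac{\sum_{W \in X, Y} tf_{W,X} \, tf_{W,Y} \, (idf_W)^2}{\sqrt{\sum_{X_i \in X} (tf_{X_i,X} \, idf_{X_i})^2} \times \sqrt{\sum_{Y_i \in Y} (tf_{Y_i,Y} \, idf_{Y_i})^2}}$$
Where,
$tf_{W,S}$ is the number of occurrences of word $W$ in sentence $S$
$idf_W$ is the inverse document frequency of word $W$
$W$ is a word shared by sentences $X$ and $Y$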
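The idf-modified cosine can be computed directly from term counts. The sketch below assumes sentences are lists of tokens and that an `idf` dictionary with a value for every word is supplied by the caller; the idf values in the example are made up for illustration, and the function name is hypothetical.

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    """Cosine similarity between sentences x and y with tf * idf term weights."""
    tf_x, tf_y = Counter(x), Counter(y)
    # Numerator: only words shared by both sentences contribute.
    shared = tf_x.keys() & tf_y.keys()
    numerator = sum(tf_x[w] * tf_y[w] * idf[w] ** 2 for w in shared)
    # Denominator: product of the tf-idf vector lengths of the two sentences.
    norm_x = math.sqrt(sum((tf * idf[w]) ** 2 for w, tf in tf_x.items()))
    norm_y = math.sqrt(sum((tf * idf[w]) ** 2 for w, tf in tf_y.items()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty or zero-weight sentence has similarity 0
    return numerator / (norm_x * norm_y)

idf = {"storm": 1.0, "coast": 2.0, "winds": 2.0}  # illustrative values only
x = ["storm", "storm", "coast"]
y = ["storm", "winds"]
print(idf_modified_cosine(x, y, idf))
```

Squaring the idf in the numerator mirrors the denominator, where each sentence vector already carries one idf factor per term, so rare shared words count for more than common ones.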
Test Your Understanding
• Point out the major differences between Text Rank and Weighted LexRank.
(Note: differentiate only between the ranking schemes)
• Can you prove the correctness of LexRank’s ranking formula?
(Hint: read or watch a video about the correctness of the PageRank-based equation)
References
• Mihalcea, Rada and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
• Erkan, Günes and Dragomir R. Radev. 2004. LexPageRank: Prestige in multi-document text summarization. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 365–371.