Human Aided Text Summarizer “SAAR” using Reinforcement Learning


Paper ID: ISCMI2014-1-031E

Human Aided Text Summarizer “SAAR” using Reinforcement Learning

By : Chandra Prakash

ABV-IIITM Gwalior

&

Dr. Anupam Shukla

Professor, ABV-IIITM, Gwalior

2014 Intl. Conference on Soft Computing & Machine Intelligence (ISCMI 2014)



Approach

Problem Definition

Motivation

Literature survey

Scope of Project

Methodology/Approach

Tools used

Result



Introduction



Real time Problem

Imagine

You download 1000+ papers and now want a summary of them.

You have a list of emails about a sports event and want a one-paragraph summary of them.

You have to study many books for an exam, and the summarizer gives the key concepts of the books as a few pages of notes.

Value for researchers

Get me everything the papers say about “Automatic Text Summarization”.



Definition

Automatic Summaries

• An active research area in which a computer automatically summarizes text from single or multiple documents

• A short summary that conveys the essence of the document

• Should be less than half the length of the original text

• Can be extractive or abstractive

• May be produced from single or multiple documents

Dipanjan Das, Andre F. T. Martins (2007). A Survey on Automatic Text Summarization. Literature Survey for the Language and Statistics II course at CMU, Pittsburgh.



Problem definition

With the advent of the information revolution (WWW), electronic documents are becoming a principal medium of business and academic information.

Thousands of electronic documents are produced and made available on the internet each day.

It is not easy to read each and every document.

Information access agents:

Search engines: Google, Yahoo, etc.

The amount of information retrieved is far greater than a user can handle and manage.

The user has to analyze the search results one by one until satisfied, which is time-consuming and inefficient.

What could be a possible solution, then?



Problem definition (cont..)

Text summarization is not tailored to user specifications.

A generic summary is not possible, because the desired summary changes from user to user.

Even two humans cannot generate the same summary from a given document.

Internal factors (background, education, etc.) play a vital role in generating a summary.

What could be a possible solution now?



Solution: Human Aided Text Summarization

Benefits of summarization include:

Saves reading time

Value for researchers

Abstracts for scientific and other articles

Facilitates fast literature searches

Facilitates classification of articles and other written data

Improves search engines' indexing efficiency for web pages

Assists in storing text in much less space

Headings for a given article/document

News summarization

Opinion mining and sentiment analysis

Enables cell phones to access Web information

With human feedback: a user-oriented summary



Previous Approaches:

1958: Automatic creation of literature abstracts was proposed by H. P. Luhn (IBM).

Text Mining: Includes discovery of patterns and trends in data and associations among entities in a document. Consists of three steps: text preparation, text processing, and text analysis.

Text Summarization methods:

Extraction: Construct the summary by taking the most important sentences.

Abstraction: Construct the summary by paraphrasing sections of the original document.




Types of Techniques:

Statistical techniques:

Based on term frequency.

Stop-word filtering: removes unwanted noise.

Stemming or lemmatization: maps different forms of the same word to one term.

Determine term importance: Term Frequency/Inverse Document Frequency (TF-IDF) weighting scheme, etc.

Linguistic techniques:

Look at text semantics.

Extract sentences by parsing and other natural language processing (NLP) steps.

Part-of-speech tagging is among the first steps.



Scope of Project

Problem Definition: Extractive Text Summarization

Single Document

Fully Automated Summarization (FAS)

Human Aided Machine Summarization (HAMS)

Machine Learning

Reinforcement Learning

Tools used:

Matlab

Java



Earlier Methodology proposed (FAS)

Chandra Prakash, Anupam Shukla, “Automated summary generation from single document using information gain”, Springer, Contemporary Computing, Communications in Computer and Information Science, Volume 94, pp. 152-159, 2010.



Methodology proposed (HAMS)



Keyword Significant Factor




Solution

Approach for the Problem

Input: A document with text is fed into the system.

Preprocessing:

Tokenization: divides the character sequence into words

Sentence splitting: further divides sequences of words into sentences, and so on

Stemming or lemmatization

Stop-word filtering

Feature Extraction

Sentence Ranking: Machine Learning

Human Feedback

Output/Result: Generated summary (an abstract).
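The preprocessing stage listed above (sentence splitting, tokenization, stop-word filtering, stemming) might look roughly like the following Python sketch. The stop-word list, the suffix-stripping rule, and all function names are hypothetical stand-ins chosen for illustration; the slides do not say which tokenizer or stemmer SAAR uses.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to", "it"}
# Crude suffix stripping as a stand-in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def crude_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stop-word-free terms per sentence."""
    return [
        [crude_stem(w) for w in tokenize(sentence) if w not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]


text = "The agent reads documents. It ranks the sentences!"
processed = preprocess(text)
# processed == [["agent", "read", "document"], ["rank", "sentenc"]]
```

The output shows why a real stemmer or lemmatizer is preferable: the crude rule turns "sentences" into "sentenc", which is acceptable for matching terms but not for display.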




Methodology Steps..

The methodology for text summarization involves:

Term selection using pre-processing:

Tokenization or segmentation

Stop-word filtering

Stemming or lemmatization

Term weighting:

Term Frequency (TF):

$$W_i(T_j) = f_{ij}$$

where $f_{ij}$ is the frequency of the $j$-th term in sentence $i$.

Inverse Sentence Frequency (ISF):

$$W_i(T_j) = f_{ij} \times \log\frac{N}{n_j}$$

where $N$ is the number of sentences in the collection and $n_j$ is the number of sentences in which term $j$ appears.
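As a rough illustration of the term weighting above, the following Python sketch computes, for every sentence, the term frequency multiplied by the log of the inverse sentence frequency. The function name `tf_isf`, the use of the natural logarithm, and the toy sentences are assumptions made for this example; the slides do not specify the log base.

```python
import math
from collections import Counter


def tf_isf(sentences):
    """sentences: list of token lists.

    Returns one {term: f_ij * log(N / n_j)} dict per sentence, where f_ij is
    the count of term j in sentence i, N the number of sentences, and n_j the
    number of sentences containing term j.
    """
    n_sentences = len(sentences)
    # Count each term once per sentence to obtain n_j.
    sentence_freq = Counter(t for tokens in sentences for t in set(tokens))
    weights = []
    for tokens in sentences:
        tf = Counter(tokens)
        weights.append(
            {t: f * math.log(n_sentences / sentence_freq[t]) for t, f in tf.items()}
        )
    return weights


# Hypothetical, already preprocessed sentences.
sentences = [
    ["summary", "sentence", "summary"],
    ["sentence", "rank"],
    ["agent", "reward"],
    ["sentence", "agent"],
]
w = tf_isf(sentences)
# "summary" occurs twice in sentence 0 and in no other sentence: 2 * log(4 / 1).
# "sentence" occurs in 3 of 4 sentences, so its weight is small: log(4 / 3).
```

A term that appears in every sentence receives weight zero, which is the intended effect: it does not help distinguish one sentence from another.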



Methodology Steps (cont…)

Information Gain is calculated as

$$IG = (TFW)_i + ISFS(T_j)_i + (NSL)_i + (SPS)_i + (PNS)_i$$

where $i$ is the sentence and $j$ is the term.

Term-Sentence matrix after IG:

$$TSM = \begin{pmatrix} IG(W_{11}) & IG(W_{12}) & \cdots & IG(W_{1n}) \\ IG(W_{21}) & IG(W_{22}) & \cdots & IG(W_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ IG(W_{m1}) & IG(W_{m2}) & \cdots & IG(W_{mn}) \end{pmatrix}$$
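To make the structure of the matrix concrete, here is a minimal Python sketch that sums the five components named in the Information Gain formula into one entry per sentence-term pair. It assumes that rows are sentences and columns are terms, that only the ISFS component depends on the term, and that the component scores have already been computed; the numbers and the function name `build_tsm` are hypothetical.

```python
def build_tsm(tfw, isfs, nsl, sps, pns):
    """Build the term-sentence matrix of IG values.

    tfw, nsl, sps, pns: one score per sentence i.
    isfs: isfs[i][j] is the term-dependent score of term j in sentence i.
    Entry [i][j] is tfw[i] + isfs[i][j] + nsl[i] + sps[i] + pns[i].
    """
    tsm = []
    for i, term_scores in enumerate(isfs):
        # Part of the sum that is shared by every term in sentence i.
        sentence_part = tfw[i] + nsl[i] + sps[i] + pns[i]
        tsm.append([sentence_part + s for s in term_scores])
    return tsm


# Hypothetical component scores for 2 sentences and 3 terms.
tfw = [0.5, 0.25]
isfs = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.5]]
nsl = [0.25, 0.5]
sps = [1.0, 0.5]
pns = [0.0, 0.25]

tsm = build_tsm(tfw, isfs, nsl, sps, pns)
# tsm == [[2.75, 1.75, 2.25], [1.5, 3.5, 2.0]]
```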



Elements of Reinforcement Learning

Agent: intelligent program

Environment: external condition

Policy: defines the agent’s behavior at a given time; a mapping from states to actions; lookup tables or a simple function

An agent learns behavior through trial-and-error interactions with a dynamic environment.

[Figure: agent-environment interaction diagram with the labels Agent, Environment, State, Reward, Action, and Policy]



Methodology Steps (cont…)

Processing Step:

Action: sentence scoring using Reinforcement Learning

Selection policy: ε-greedy

$$a_t = \begin{cases} a_t^{*} & \text{with probability } 1 - \varepsilon \\ \text{random action} & \text{with probability } \varepsilon \end{cases}$$

where $a_t^{*}$ denotes the greedy (highest-valued) action.

In our approach we consider:

State: sentences

Action: updating the term weight

Policy: update the term weights to maximize the sentence rank

Reward: scalar value of the term (IG)

Learning method: Q-Learning
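The ε-greedy selection policy named above can be sketched in a few lines of Python: with probability 1 - ε the action with the highest current value is chosen, otherwise an action is picked uniformly at random. The function name, the example values, and the choice ε = 0.1 are assumptions for illustration; the slides do not state the value of ε used in SAAR.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by the epsilon-greedy rule.

    q_values: current value estimate for each action.
    epsilon:  exploration probability in [0, 1].
    rng:      source of randomness (the random module or a random.Random).
    """
    if rng.random() < epsilon:
        # Explore: uniformly random action.
        return rng.randrange(len(q_values))
    # Exploit: action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.2, 0.9, 0.4]  # hypothetical action values

# With epsilon = 0 the rule is purely greedy and always returns action 1.
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)

# With epsilon = 0.1 the greedy action dominates but others still occur.
rng = random.Random(0)
choices = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
```

Passing a seeded `random.Random` instance keeps experiments repeatable, which helps when comparing runs with different ε values.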



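Processing Step:

Matrix Q: learning matrix.

Term-sentence matrix before the update:

$$TSM = \begin{pmatrix} IG(W_{11}) & IG(W_{12}) & \cdots & IG(W_{1n}) \\ IG(W_{21}) & IG(W_{22}) & \cdots & IG(W_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ IG(W_{m1}) & IG(W_{m2}) & \cdots & IG(W_{mn}) \end{pmatrix}$$

Term-sentence matrix after the update:

$$TSM_{updated} = \begin{pmatrix} IG_{updated}(W_{11}) & IG_{updated}(W_{12}) & \cdots & IG_{updated}(W_{1n}) \\ IG_{updated}(W_{21}) & IG_{updated}(W_{22}) & \cdots & IG_{updated}(W_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ IG_{updated}(W_{m1}) & IG_{updated}(W_{m2}) & \cdots & IG_{updated}(W_{mn}) \end{pmatrix}$$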
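The slides do not spell out the update rule applied to the learning matrix. For orientation only, the sketch below shows the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', ·) − Q(s, a)]. Treating rows as states and columns as actions, and the values chosen for the learning rate `alpha` and discount `gamma`, are assumptions and not taken from the SAAR system.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one standard tabular Q-learning update in place.

    q:          list of rows, q[state][action] is the current estimate.
    reward:     scalar reward observed after taking the action.
    next_state: state reached after the action.
    alpha:      learning rate (assumed value).
    gamma:      discount factor (assumed value).
    Returns the updated q[state][action].
    """
    best_next = max(q[next_state])
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Hypothetical 2-state, 2-action table.
q = [[0.0, 0.0], [0.0, 1.0]]
new_value = q_update(q, state=0, action=1, reward=2.0, next_state=1)
# new_value is 0.5 * (2.0 + 0.9 * 1.0 - 0.0), i.e. about 1.45
```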



Summary Generation:

Sentence selection: Euclidean n-space

$P = (p_1, p_2, \ldots, p_n)$

$Q = (q_1, q_2, \ldots, q_n)$
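For two points P and Q in Euclidean n-space, the distance is the square root of the summed squared coordinate differences. The Python sketch below computes it; which vectors SAAR actually compares during sentence selection is not stated on the slide, so the two example vectors are purely hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Distance between two points of equal dimension in Euclidean n-space."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Hypothetical 3-dimensional vectors.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]
d = euclidean_distance(p, q)  # sqrt(9 + 16 + 0) = 5.0
```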

Dataset:

Articles from “The Hindu” (June 2013)

DUC’06 sets of documents:

12 document sets

Number of documents in each set: 25

Average number of sentences: 32

300 document summaries



Evaluation

Evaluation Techniques

$$\text{Precision } (P) = 100 \times \frac{r}{K_m}$$

$$\text{Recall } (R) = 100 \times \frac{r}{K_h}$$

$$F\text{-score} = \frac{2PR}{P + R} = 100 \times \frac{2r}{K_h + K_m}$$

where $r$ is the number of common sentences, $K_m$ is the length of the machine-generated summary, and $K_h$ is the length of the human-generated summary.

Available automated text summarizers: Open Text Summarizer (OTS), Pertinence Summarizer (PS), and Extractor Test Summarizer Software (ETSS).

The compression ratio is 30%.
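The evaluation measures above can be checked with a short Python sketch. It takes the number of common sentences and the two summary lengths and returns precision, recall, and F-score as percentages. The function name and the counts in the example are hypothetical and are not data from the experiments.

```python
def precision_recall_f(r, k_m, k_h):
    """Return (precision, recall, f_score) as percentages.

    r:   number of sentences common to both summaries.
    k_m: length (in sentences) of the machine-generated summary.
    k_h: length (in sentences) of the human-generated summary.
    """
    precision = 100.0 * r / k_m
    recall = 100.0 * r / k_h
    f_score = 100.0 * 2 * r / (k_h + k_m)
    return precision, recall, f_score


# Hypothetical counts: 6 shared sentences, 8 machine and 10 human sentences.
p, rec, f = precision_recall_f(r=6, k_m=8, k_h=10)
# p == 75.0, rec == 60.0, f is about 66.67

# The count-based form agrees with the harmonic mean 2PR / (P + R).
harmonic = 2 * p * rec / (p + rec)
```

Working from the raw counts avoids rounding P and R before combining them, so the two forms of the F-score stay consistent.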



Comparison of generated text summary for HAMS

Comparison of Recall, Precision Value and F-score for HAMS

Methods                 Precision Value (P)   Recall Value (R)   F-score
SAAR (user feedback)    90                    85                 87.42
IG summary              75                    65                 70.57
OTS                     75                    60                 66.66
PS                      75                    60                 66.66
ETSS                    75                    60                 66.66

Result

[Figure: bar chart of F-score, Recall Value (R), and Precision Value (P) for SAAR Based, IG Summary, OTS, PS, and ETSS; axis from 0 to 100]

Compared with some available automated text summarizers: Open Text Summarizer (OTS), Pertinence Summarizer (PS), and Extractor Test Summarizer Software (ETSS).



Conclusion and future scope

A novel approach for human-aided text summarization of a single document using user feedback.

This summarization by extraction will be good enough for a reader to understand the main idea of a document, though the understandability might not be as good as that of a summary by abstraction.

As future work, this approach can be extended to multi-document summarization using machine learning.

We can introduce the concept of multiple agents into the system. This will increase its speed as well as make the summary or abstract more generic.



References

1. Verma R., Chen P., “Integrating Ontology Knowledge into a Query-based Information Summarization System”, DUC 2007, Rochester, NY, 2007.

2. Luhn H. P., “The automatic creation of literature abstracts”, IBM Journal of Research and Development, vol. 2, pp. 159-165, 1958.

3. Edmundson H. P., “New Methods in Automatic Extracting”, Journal of the ACM (JACM), vol. 16, no. 2, pp. 264-285, 1969.

4. Salton G., Buckley C., “Term-Weighting Approaches in Automatic Text Retrieval”, Information Processing & Management, vol. 24, pp. 513-523, 1988.

5. Luhn H. P., “A Statistical Approach to Mechanical Encoding and Searching of Literary Information”, IBM Journal of Research and Development, pp. 309-317, 1975.

6. Salton G., Buckley C., “Term-Weighting Approaches in Automatic Text Retrieval”, Information Processing & Management, vol. 24, pp. 513-523, 1988.

7. Kupiec J. et al., “A trainable document summarizer”, In Proceedings of SIGIR, 1995.

8. Conroy J. M., O'Leary D. P., “Text summarization via hidden Markov model”, In Proceedings of SIGIR '01, pp. 406-407, New York, NY, USA, 2001.

9. Agarwal N., Ford K. H., Shneider M., “Sentence Boundary Detection using a MaxEnt Classifier”.

10. García-Hernández R. A., Ledeneva Y., “Word Sequence Models for Single Text Summarization”, 2009 Second International Conferences on Advances in Computer-Human Interactions, pp. 44-48, 2009.

11. The Hindu [http://www.hinduonnet.com/]. Accessed on 23rd June 2009.

12. Van Rijsbergen C. J., Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.



References

13. V. A. Yatsko and T. N. Vishnyakov (2006). A Method for Evaluating Modern Systems of Automatic Text Summarization.

14. S. Hariharan and R. Srinivasan (2008). Investigations in single document summarization by extraction method.

15. René Arnulfo García-Hernández and Yulia Ledeneva (2009). Word Sequence Models for Single Text Summarization.

16. Kyoomarsi, F.; Khosravi, H.; Eslami, E.; Dehkordy, P. K.; Tajoddin, A. Optimizing Text Summarization Based on Fuzzy Logic. In Proceedings of Computer and Information Science, 2008. ICIS 08.

17. Sparck-Jones, K. Automatic summarizing: factors and directions. In Mani, I.; Maybury, M. Advances in Automatic Text Summarization. The MIT Press (1999) 1-12.

18. Hovy, E. and C.-Y. Lin (1997). Automated Text Summarization in SUMMARIST. In Proceedings of the ACL97/EACL97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain.

19. Mani, I. and M. T. Maybury (editors) (1999). Advances in Automatic Text Summarization. MIT Press, Cambridge, MA.

20. Sparck-Jones, K. (1999). Automatic Summarizing: Factors and Directions. In Mani, I. and M. T. Maybury (editors), Advances in Automatic Text Summarization, pp. 1-13. The MIT Press.

21. Lin, C.-Y. and E. Hovy (2000). The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th COLING Conference, Saarbrücken, Germany.

22. Baldwin, B., R. Donaway, E. Hovy, E. Liddy, I. Mani, D. Marcu, K. McKeown, V. Mittal, M. Moens, D. Radev, K. Sparck-Jones, B. Sundheim, S. Teufel, R. Weischedel, and M. White (2000). An Evaluation Road Map for Summarization Research. http://www-nlpir.nist.gov/projects/duc/papers/summarization.roadmap.doc.
