Intelligent Web Agent Under the guidance of Dr. S. B. Nair Presented by Sandeep Kumar (001127) Dept. of CSE, IIT Guwahati

Post on 27-Mar-2015

Page 1: Intelligent Web Agent

Under the guidance of Dr. S. B. Nair

Presented by Sandeep Kumar (001127), Dept. of CSE, IIT Guwahati

Page 2: Goals

To develop an intelligent web agent that learns about the user’s behavior without any user intervention and uses the information gathered to help the user find pages of interest by re-ranking Google results according to those interests.

Page 3: Intelligent Web Agent

Agent: something that perceives its environment through its sensors and acts upon it through its effectors.

Web Agent: an agent whose environment is the World Wide Web.

Intelligent Web Agent: a rational web agent, i.e. one that, when given a choice, can make a rational decision leading to its goal.

‘Personalized’ Intelligent Web Agent (PIWA): learns user preferences and behavior over a length of time and applies them in its decisions.

Page 4: Problem

The explosive growth of the internet has made it the largest knowledge repository mankind has ever had.

It needs to be used efficiently, providing users with the information they seek.

Searching for information is a problem.

The structure of the internet, a massive mess of hyperlinks pointing to HTML pages, makes searching difficult.

Search engines, though excellent in their working, fail to satisfy a user’s needs: they return millions of results, most of which are unwanted.

Page 5: How can we help the user

A web-based search engine lacks information about the user.

Search engines are based on algorithms that need more information about the search query; a normal user provides just one or two words.

We can have a software agent on the user’s desktop.

It has more access to the user’s browsing behavior.

It can learn from that behavior.

Using its knowledge base, it can help the user find information matching his interests.

Page 6: Profiling

Static: profiles are built beforehand, like templates.

Dynamic: profiles are generated dynamically by learning from the user’s behavior.

Need:

An intelligent search engine over a general-purpose search engine.

Real-time adaptive learning by monitoring the user’s habits, with no relevance feedback from the user.

The profile must change as the user’s interests change.

Page 7: The Algorithm

Salient features:

Representation of a user’s interest as a group of words.

Generation of the group of words by unobtrusive monitoring of the user’s browsing habits, with no relevance feedback from the user.

Dynamic membership of the group of words representing a particular interest of the user.

Using the group of words to improve web query generation.

Using the group of words to rank results from a general-purpose search engine.

Page 8: Interest

Basic knowledge block of the user profile.

Represented by a group of 10 words, each with an associated weight, together with a timestamp.

A weight represents the importance of the word in that particular interest.

E.g.: said 16, yesterday 15, city 15, sadr 14, iraq 12, holy 12, news 12, east 11, talks 11, tension 11

This represents the user’s interest in what Sadr said yesterday in a holy city of Iraq, and his talks about tension and the east.

Page 9: Generation of Interest

Key point: unobtrusive, user friendly

Page 10: Implementation

The agent acts as a proxy for the user’s browser.

Passive monitoring of incoming traffic.

The 10 words are extracted from the very pages the user browses through.

Extracting the top words from an HTML document:

Get the page.

Do feature extraction.

Do stop-word removal.

Do stemming.

Page 11: Generation of Interest

Interests are generated from the HTML pages browsed by the user.

Feature extraction is done using the latest features of HTMLEditorKit (available in JDK SE > 1.4).

HTML tags are given weights, e.g. title 10, meta names 6, block-quote 4, boldface and underline 2, font sizes, etc.

Plain content is given weight 1 (similar to term frequency (tf)).

The weights are summed up for every word.

Commonly used words are removed by stop-word elimination, along with words of length <= 2.

The top 10 words are selected.

Morphological analysis is not done, as many words (like yahoo) don’t occur in a dictionary and the process is still not very efficient.
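The weighting scheme on this slide can be sketched in a few lines. The original agent used Java's HTMLEditorKit; the sketch below swaps in Python's standard `html.parser` purely for illustration. The tag weights are the ones quoted above, while the class name `WeightedWordParser`, the helper `top_words`, the tiny stop-word list, and the choice to use the heaviest enclosing tag when tags are nested are all assumptions, not details from the slides. Font-size weights are omitted because the slide gives no values for them.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tag weights quoted on the slide; "b" and "u" stand in for "boldface and underline".
TAG_WEIGHTS = {"title": 10, "blockquote": 4, "b": 2, "u": 2}
META_WEIGHT = 6      # weight for the content of <meta name=...> tags
CONTENT_WEIGHT = 1   # plain content, similar to term frequency
# Tiny illustrative stop-word list (an assumption; the original list is not given).
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}


class WeightedWordParser(HTMLParser):
    """Hypothetical parser that accumulates a weighted score per word."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # weighted tags that are currently open
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.scores = Counter()

    def add_words(self, text, weight):
        for word in re.findall(r"[a-z]+", text.lower()):
            # Drop stop words and words of length <= 2, as on the slide.
            if len(word) > 2 and word not in STOP_WORDS:
                self.scores[word] += weight

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name"):
                self.add_words(attributes.get("content") or "", META_WEIGHT)
        elif tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            # Close the most recently opened tag of this name.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Assumption: nested tags contribute the heaviest enclosing weight.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=CONTENT_WEIGHT)
        self.add_words(data, weight)


def top_words(html, count=10):
    """Return the `count` highest-scoring (word, weight) pairs of a page."""
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    return parser.scores.most_common(count)
```

Calling `top_words(page_html)` on a fetched page would give the keyword list that the later slides treat as the page's 10 keywords.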

Page 12: Creation of Profile

Get 10 keywords from each page visited.

2 possible cases:

The current page (its keywords) matches or is similar to a past interest -> that interest is updated.

The current page (its keywords) is new -> a new interest is created.

Match: 3 or more words (>= 30%) in common between the keywords of the current page and a past interest.

Interest update: sum up the weights of the matched words and take the top 10 from the merged list.
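A minimal sketch of the match-and-merge rule above, with interests and page keywords both modelled as plain word-to-weight dictionaries. The function names are hypothetical, and two details are assumptions because the slides do not state them: the first interest that reaches the threshold is the one updated, and ties in weight are broken alphabetically.

```python
def count_matches(page_keywords, interest):
    """Number of words shared by a page's keywords and an interest."""
    return len(set(page_keywords) & set(interest))


def merge(page_keywords, interest, size=10):
    """Sum the weights of shared words and keep the `size` heaviest words."""
    merged = dict(interest)
    for word, weight in page_keywords.items():
        merged[word] = merged.get(word, 0) + weight
    # Sort by descending weight; alphabetical tie-break is an assumption.
    top = sorted(merged.items(), key=lambda item: (-item[1], item[0]))[:size]
    return dict(top)


def update_profile(profile, page_keywords, threshold=3):
    """Update the first matching interest, or append a new one.

    Returns the index of the interest that was updated or created.
    """
    for index, interest in enumerate(profile):
        if count_matches(page_keywords, interest) >= threshold:
            profile[index] = merge(page_keywords, interest)
            return index
    profile.append(dict(page_keywords))
    return len(profile) - 1
```

With 10 keywords per page, the default `threshold=3` corresponds to the 30% overlap quoted on the slide.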

Page 13: Maintenance of Profile

An optimum size is needed: too big a list will contain erroneous interests and cause performance problems, while too small a list may not cover all of the user’s interests. At present it is set to 20 interests.

When an interest is created or updated, its timestamp is updated (the timestamp is associated with the 11th word, the marker “1234567890”).

The product of the timestamp and the sum of the weights of the interest is used to determine which interests remain in the list and which are removed.
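The retention rule above can be sketched as follows. The dictionary layout (`"words"` plus `"timestamp"`), the function names, and the decision to evict the lowest-scoring interest one at a time are assumptions; the slides only state that the product of timestamp and weight sum decides which interests stay.

```python
MAX_INTERESTS = 20  # profile size quoted on the slide


def retention_score(interest):
    """Timestamp multiplied by the sum of the interest's word weights."""
    return interest["timestamp"] * sum(interest["words"].values())


def prune(profile, max_size=MAX_INTERESTS):
    """Return a copy of the profile with the lowest-scoring interests removed."""
    kept = list(profile)
    while len(kept) > max_size:
        kept.remove(min(kept, key=retention_score))
    return kept
```

Because the timestamp grows with every update, an interest that is both recent and heavily weighted scores highest, so stale or weak interests are the first to go.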

Page 14: Use of Profile for web searches

Direct searching: the user provides a search query.

3 cases:

The query matches one interest.

The query matches more than one interest.

The query does not match any interest.

No match: a simple Google search.

More than one match: for each matched interest, sum up the weights of the query words in it and select the interest for which the sum is maximum.

So now we have one matched interest.
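The three cases above collapse into one small function if a non-matching interest is given a score of zero. This is a sketch under assumptions: interests are word-to-weight dictionaries, the query is split on whitespace, and `match_interest` is a hypothetical name.

```python
def match_interest(query, profile):
    """Return the interest with the largest summed weight of query words.

    Returns None when no interest contains any query word, in which case
    the caller would fall back to a plain search.
    """
    terms = query.lower().split()
    best, best_score = None, 0
    for interest in profile:
        score = sum(interest.get(term, 0) for term in terms)
        if score > best_score:
            best, best_score = interest, score
    return best
```

When several interests tie on the maximum sum, this sketch keeps the earliest one in the profile; the slides do not say how ties are resolved.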

Page 15: Trigger Pair Model

Trigger pair: take some words from the matched interest whose weights are less than the smallest weight of any word of the query.

This is done to prevent the original query from being overshadowed by more popular (more heavily weighted) words.

1 word is added for a single-word query and 2 words for a query of two or more words, again to prevent overshadowing.

The trigger pair refines the results from Google to a great extent.
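A sketch of the trigger-pair selection described above. The slides give the eligibility rule (weight below the smallest query-word weight) and the number of words to add, but not which eligible words are chosen; taking the heaviest eligible words is an assumption here, as are the function names and the choice to ignore query words that are absent from the interest.

```python
def trigger_words(query, interest):
    """Pick words from the matched interest to append to the query."""
    terms = query.lower().split()
    known = [interest[term] for term in terms if term in interest]
    if not known:
        return []
    limit = min(known)  # smallest weight of any query word in the interest
    candidates = [
        (word, weight)
        for word, weight in interest.items()
        if weight < limit and word not in terms
    ]
    # Assumption: prefer the heaviest eligible words, alphabetical on ties.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    count = 1 if len(terms) == 1 else 2
    return [word for word, _ in candidates[:count]]


def expand_query(query, interest):
    """Append the trigger words to the original query string."""
    return " ".join([query] + trigger_words(query, interest))
```

For example, with a hypothetical interest `{"iraq": 42, "fallujah": 30, "marines": 25}`, `expand_query("iraq", interest)` appends a single word because the query has one term.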

Page 16: Ranking of results to user’s interest

Get the top 20 results from Google.

Get the top 10 keywords for each of the results.

Score each page by summing the products of the weights of the words common to the matched interest and the page’s keywords.

Get the top 10 pages based on the score: 1st result.

Take the arithmetic mean of the Google rank and the rank from the algorithm and get the top 10 pages: 2nd result.
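The two rankings above can be sketched as follows, assuming each search result is a `(url, keywords)` pair given in Google order, with keywords as a word-to-weight dictionary. The function names are hypothetical, and ties are resolved in favour of the original Google order because Python's sort is stable; the slides do not specify tie handling.

```python
def page_score(page_keywords, interest):
    """Sum of weight products over words shared with the matched interest."""
    return sum(
        weight * interest[word]
        for word, weight in page_keywords.items()
        if word in interest
    )


def rank_results(results, interest, top=10):
    """Return (piwa_top, mixed_top) lists of URLs.

    `results` is a list of (url, keywords) pairs in Google order.
    """
    indices = range(len(results))
    # Interest-based order: highest score first.
    order = sorted(indices, key=lambda i: -page_score(results[i][1], interest))
    piwa_rank = {i: position + 1 for position, i in enumerate(order)}
    # Mixed order: arithmetic mean of Google rank (i + 1) and interest-based rank.
    mixed = sorted(indices, key=lambda i: ((i + 1) + piwa_rank[i]) / 2)
    piwa_top = [results[i][0] for i in order[:top]]
    mixed_top = [results[i][0] for i in mixed[:top]]
    return piwa_top, mixed_top
```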

Page 17: Agent Architecture

Page 18: Implementation Summary

Keyword: String word, int val

Interest: 10 keywords + 1 marker with timestamp

Profile: 20 interests

Google element: String title, String snippet, String url
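The types listed above map naturally onto small record classes. The sketch below uses Python dataclasses for illustration only; the original implementation was in Java, and storing the timestamp as its own field (rather than as an 11th marker keyword) is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Keyword:
    word: str
    val: int  # weight of the word


@dataclass
class Interest:
    keywords: List[Keyword]  # up to 10 keywords
    timestamp: float         # assumption: kept as a field, not a marker keyword


@dataclass
class Profile:
    interests: List[Interest] = field(default_factory=list)  # up to 20


@dataclass
class GoogleElement:
    title: str
    snippet: str
    url: str
```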

Page 19: Results: 1. IRAQ

10 pages from news on Iraq dated April 14th 2004.

7 interests formed, 3 pages merged.

Page 20: Interests

Page 21: Points to note

1st interest: lexington and concorde came in because they were advertisements on the first page; the parser is not designed to ignore them and treats them as part of the page.

Ignoring them is possible if the structure of the page is known beforehand, but that is impossible in the present case.

Timestamps of the 2nd and 10th pages merge, as both pages are from Reuters.

Page 22: Search: IRAQ

Search query: iraq

Matched interest: the last one, as the weight of iraq is maximum in it (42).

The trigger pair causes Fallujah to be appended.

Page 23: Comparison

Page 24: Points to note

The 1st result of Google describes Fallujah and has fewer elements of fighting in it; hence it ranks a poor 10th in the PIWA ranking.

The 7th result of Google, which is basically a discussion board on the Iraq war with lots of discussion on “iraq, fallujah, marines, coalition, Baghdad” (words in the matched interest), ranks first in the PIWA ranking.

Mixed results: the rank-1 page is 2nd in Google and 5th in PIWA.

Page 25: 2. Sandeep

3 pages: 2 homepages and 1 resume.

3 interests formed.

Page 26: Search: Sandeep

Search query: sandeep

Kumar appended.

Page 27: Points to note

The matched interest was derived from the 10th result of Google, which ranked 2nd in the PIWA ranking and hence 5th in the mixed results.

A page visited in the past affects the results greatly.

Sandeep is a very general term, so the results are still not much inclined in my favour.

Page 28: Search: Sandeep Kumar

Appended by “2004 iitg”

Page 29: 3. India Pakistan Series

5 news pages dated 14th April 2004.

Page 30: Search: India Pakistan

Appended by “test series”

Page 31: Points to note

A normal search in Google without the trigger pair returned results on war, which the user did not want.

Google's 7th, 8th, 9th and 10th results don’t make it into the top 10 of PIWA, which shows ranking differences based on the user’s interests.

Page 32: Mixed Results

In classical AI search terminology:

Google: explore strategy, get new results.

PIWA: exploit strategy, use past information to decide new ranks.

Mixed: a 50-50 mix of both; this can be changed to explore more initially and then exploit more, as in any other learning process.
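The idea of shifting from exploring to exploiting can be sketched as a weighted mean of the two ranks. Everything here beyond the 50-50 default is hypothetical: the slides propose the adjustable mix but give no formula, so the linear ramp, the `ramp` and `cap` parameters, and the function names are illustrative assumptions.

```python
def mixed_rank(google_rank, piwa_rank, exploit=0.5):
    """Weighted mean of the two ranks; exploit=0.5 is the 50-50 mix."""
    return (1 - exploit) * google_rank + exploit * piwa_rank


def exploit_weight(pages_seen, ramp=100, cap=0.8):
    """Hypothetical schedule: trust the profile more as more pages are seen.

    Grows linearly from 0 to `cap` over the first `ramp` pages.
    """
    return min(cap, cap * pages_seen / ramp)
```

A result ranked 2nd by Google and 5th by the profile gets `mixed_rank(2, 5)` equal to 3.5 under the default mix, and moves toward 5 as `exploit` approaches 1.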

Page 33: Conclusion

A profile for a user was generated with absolutely no user relevance feedback.

Dynamic profile maintenance: the profile is continuously updated with new information.

The profile is used to improve the user’s web searches to suit his interests.

Page 34: Future Work

Improve GUI: especially for the search utility.

Support for plugins: to handle non-HTML documents.

Support for encoded pages: SSL, gzipped, etc.

A news reader can be made easily with an improved parser that knows the structure of news pages beforehand.

Page 35: Thank you