
Intelligent Web Agent

Under the guidance of

Dr. S. B. Nair

Presented by

Sandeep Kumar (001127), Dept. of CSE, IIT Guwahati

Goals

To develop an intelligent web agent that learns about the user’s behavior without any user intervention and uses the information gathered to help the user find pages of interest by re-ranking Google results according to those interests.

Intelligent Web Agent

Agent: something that perceives its environment through its sensors and acts upon it through its effectors.
Web Agent: an agent whose environment is the World Wide Web.
Intelligent Web Agent: a rational web agent, i.e. one that can make a rational decision, when given a choice, leading to its goal.
‘Personalized’ Intelligent Web Agent (PIWA): learns user preferences and behavior over a length of time and exhibits ample intelligence in its decisions.

Problem

The explosive growth of the internet has made it the largest knowledge repository mankind has ever had.
We need to use it efficiently, providing each person with the information they seek.
Searching for information is a problem.
The internet’s structure, a massive mess of hyperlinks pointing to HTML pages, makes searching difficult.
Search engines, though excellent at what they do, fail to satisfy a user’s needs: they return millions of results, most of which are unwanted.

How can we help the user

A web-based search engine lacks information about the user.
Search engines are based on algorithms that need more information about the search query; a typical user provides just one or two words.
We can have a software agent at the user’s desktop.
It has more access to the user’s browsing behavior.
It can learn from it.
Using its knowledge base, it can help the user find information matching his interests.

Profiling

Static: profiles are built beforehand, like templates.
Dynamic: profiles are generated dynamically by learning from the user’s behavior.

Need

An intelligent search engine on top of a general-purpose search engine.
Real-time adaptive learning by monitoring the user’s habits, with no relevance feedback from the user.
The profile must change as the user’s interests change.

The Algorithm

Salient features:
Representation of a user’s interest as a group of words.
Generation of the group of words by unobtrusive monitoring of the user’s browsing habits, with no relevance feedback from the user.
Dynamic membership of the group of words representing a particular user’s interest.
Using the group of words to improve web query generation.
Using the group of words to rank results from a general-purpose search engine.

Interest

Basic knowledge block of the user profile.
Represented by a group of 10 words, each having an associated weight and a timestamp.
The weight represents the importance of the word in that particular interest.
E.g.: said 16, yesterday 15, city 15, sadr 14, iraq 12, holy 12, news 12, east 11, talks 11, tension 11
This represents the user’s interest in what Sadr said yesterday in a holy city of Iraq, and his talks about tension and the east.
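As a rough illustration, an interest could be held in a structure like the one below. This is only a sketch: the class name `Interest`, the `top_words` helper, and the use of a floating-point epoch timestamp are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass, field
import time


@dataclass
class Interest:
    """One interest: a small bag of weighted words plus a last-updated time."""
    weights: dict                      # word -> integer weight
    timestamp: float = field(default_factory=time.time)  # assumed representation

    def top_words(self, n=10):
        # Highest weight first; ties broken alphabetically for stable output.
        return sorted(self.weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# The example interest from the slide.
example = Interest({
    "said": 16, "yesterday": 15, "city": 15, "sadr": 14, "iraq": 12,
    "holy": 12, "news": 12, "east": 11, "talks": 11, "tension": 11,
})
```

Calling `example.top_words(3)` returns the three heaviest words of the interest.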

Generation of Interest

Key point: unobtrusive, user friendly.

Implementation
The agent acts as a proxy for the user’s browser.
Passive monitoring of incoming traffic.
The 10 words are extracted from the very pages the user browses through.
Extracting the top words from an HTML document:
Get the page.
Do feature extraction.
Do stop-word removal.
Do stemming.

Generation of Interest

Keywords come from the HTML pages browsed by the user.
Feature extraction is done using the latest features of HTMLEditorKit (available in JDK SE > 1.4).
HTML tags are given weights, e.g. title 10, meta names 6, block quote 4, boldface and underline 2, font sizes, etc.
Plain content text is given weight 1 (similar to term frequency, tf).
The weights are summed over all occurrences of each word.
Commonly used words are removed by stop-word elimination, along with words of length <= 2.
The top 10 words are selected.
Morphological analysis is not done, as many words (like yahoo) don’t occur in a dictionary and the process is still not very efficient.
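The tag-weighted extraction described above can be sketched as follows. The slides describe a Java implementation built on HTMLEditorKit; this sketch uses Python's standard `html.parser` instead, purely for brevity. The stop-word list is a tiny illustrative stand-in, font-size weights are omitted because their values are not given, and taking the largest weight among the enclosing tags is an assumption about how nested tags combine.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tag weights from the slide; any other tag counts as plain content (weight 1).
TAG_WEIGHTS = {"title": 10, "blockquote": 4, "b": 2, "u": 2}
META_WEIGHT = 6
# Illustrative stop list only (assumption; the original list is not given).
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "from", "was", "are"}
SKIP_TAGS = {"script", "style"}
WORD_RE = re.compile(r"[a-z]+")


class KeywordExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self.open_tags = []  # stack of currently open tags

    def add_text(self, text, weight):
        for word in WORD_RE.findall(text.lower()):
            if len(word) > 2 and word not in STOP_WORDS:
                self.scores[word] += weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":  # void element: score its content attribute
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content"):
                self.add_text(attrs["content"], META_WEIGHT)
            return
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIP_TAGS for t in self.open_tags):
            return
        # Assumption: nested tags contribute the largest applicable weight.
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.open_tags], default=1)
        self.add_text(data, weight)


def top_keywords(html, n=10):
    """Return the n highest-scoring (word, weight) pairs of an HTML page."""
    parser = KeywordExtractor()
    parser.feed(html)
    parser.close()
    return parser.scores.most_common(n)
```

A word that appears in the title, in a meta tag and in bold body text accumulates all three weights, which is what pushes topical words to the top of the list.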

Creation of Profile

Get 10 keywords from each page visited.
2 possible cases:
The current page (its keywords) matches, or is similar to, a past interest -> that interest is updated.
The current page (its keywords) is new -> a new interest is created.
Match if 3 or more words (>= 30%) are common to the keywords of the current page and a past interest.
Interest update: sum up the weights of the matched words and take the top 10 from the merged list.
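A minimal sketch of the match-and-merge step follows, with interests and page keywords both represented as plain `word -> weight` dictionaries. The function names are hypothetical, and updating the first interest that matches (rather than, say, the best match) is an assumption; the slides do not say what happens when a page matches several interests.

```python
MATCH_THRESHOLD = 3   # 3 of 10 words, i.e. >= 30%
INTEREST_SIZE = 10


def find_match(page, interests):
    """Index of the first interest sharing >= 3 words with the page, else None."""
    for i, interest in enumerate(interests):
        if len(page.keys() & interest.keys()) >= MATCH_THRESHOLD:
            return i
    return None


def merge(page, interest):
    """Sum the weights of shared words, then keep the 10 heaviest words."""
    merged = dict(interest)
    for word, weight in page.items():
        merged[word] = merged.get(word, 0) + weight
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:INTEREST_SIZE]
    return dict(top)


def update_profile(page, interests):
    """Update the matching interest in place, or append a new one."""
    i = find_match(page, interests)
    if i is None:
        interests.append(dict(page))
    else:
        interests[i] = merge(page, interests[i])
    return interests
```

Because the merged list is cut back to 10 words, low-weight words drop out over time, which is what gives each interest its dynamic membership.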

Maintenance of Profile

An optimum size is needed: too big a list will hold erroneous interests and cause performance problems, while too small a list may not cover all of the user’s interests. At present it is set at 20 interests.
When an interest is created or updated, its timestamp is updated (the timestamp is stored with the 11th word, the marker “1234567890”).
The product of the timestamp and the sum of the weights of the interest is used to determine which interests remain in the list and which one is removed.
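The retention rule might look like the sketch below. It assumes the timestamp is a number that grows with time (for example seconds since the epoch) and that the interest with the lowest product is the one removed; the slides give the product but not these details, and the function names are hypothetical.

```python
MAX_INTERESTS = 20


def retention_score(weights, timestamp):
    """Product of the timestamp and the summed word weights of one interest."""
    return timestamp * sum(weights.values())


def evict_if_full(profile, max_size=MAX_INTERESTS):
    """profile: list of (weights, timestamp) pairs.

    Removes the lowest-scoring interests until the profile fits max_size.
    """
    while len(profile) > max_size:
        victim = min(range(len(profile)),
                     key=lambda i: retention_score(*profile[i]))
        del profile[victim]
    return profile
```

Under this rule an old but heavily weighted interest can outlive a recent but weak one, since both recency and weight feed the score.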

Use of Profile for web searches

Direct searching: the user provides a search query.
3 cases:
The query matches one interest.
The query matches more than one interest.
The query does not match any interest.
No match: simple Google search.
More than one match: for each matched interest, sum up the weights of the query words in it, and select the interest for which the sum is maximum.
So now we have one matched interest.
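The three cases collapse into one small selection routine, sketched here with interests as `word -> weight` dictionaries. The name `choose_interest` and the whitespace tokenisation of the query are assumptions for the example.

```python
def choose_interest(query, interests):
    """Return the interest with the largest summed weight of query words.

    Returns None when no interest contains any query word, in which case
    the caller would fall back to a plain search.
    """
    words = query.lower().split()
    best, best_sum = None, 0
    for interest in interests:
        total = sum(interest.get(w, 0) for w in words)
        if total > best_sum:
            best, best_sum = interest, total
    return best
```

With a single matching interest the loop simply returns it, so the one-match and many-match cases need no separate handling.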

Trigger Pair Model

Trigger pair: take some words from the matched interest whose weights are less than the smallest weight of any query word.
This is done to prevent the original query from being overshadowed by more popular (higher-weighted) words.
1 word is added for a single-word query and 2 words for a query of two or more words, again to prevent overshadowing.
The trigger pair refines the results from Google to a great extent.
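A sketch of the query expansion under stated assumptions: among the words lighter than the lightest query word, the heaviest ones are chosen (the slides do not say which eligible words are picked), and query words absent from the interest are ignored when computing the smallest weight. The function name is hypothetical.

```python
def expand_query(query, interest):
    """Append 1 or 2 trigger words from the matched interest to the query."""
    words = query.lower().split()
    present = [interest[w] for w in words if w in interest]
    if not present:
        return query  # nothing to anchor on; leave the query untouched
    floor = min(present)
    # Only words lighter than every matched query word are eligible.
    candidates = [(w, wt) for w, wt in interest.items()
                  if wt < floor and w not in words]
    # Assumption: prefer the heaviest eligible words.
    candidates.sort(key=lambda kv: (-kv[1], kv[0]))
    n_extra = 1 if len(words) == 1 else 2
    extra = [w for w, _ in candidates[:n_extra]]
    return " ".join([query] + extra)
```

For instance, with an illustrative interest `{"iraq": 42, "fallujah": 30, "marines": 25}`, the query `iraq` becomes `iraq fallujah`.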

Ranking of results to user’s interest

Get the top 20 results from Google.
Get the top 10 keywords for each result.
Score each page by summing the products of the weights of the words common to the matched interest and the page’s keywords.
Get the top 10 pages based on the score: 1st result.
Take the arithmetic mean of the Google rank and the rank given by the algorithm and get the top 10 pages: 2nd result.
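The scoring and the two result lists can be sketched as below. Each search result is assumed to arrive as a `(url, keywords)` pair in Google order, with `keywords` a `word -> weight` dictionary; breaking ties by Google rank is an assumption, as are the function names.

```python
def interest_score(page_keywords, interest):
    """Sum of weight products over words shared by the page and the interest."""
    return sum(w * interest[word]
               for word, w in page_keywords.items() if word in interest)


def rerank(results, interest, top_n=10):
    """results: list of (url, keywords) in Google order.

    Returns (piwa_top, mixed_top): the pages ordered by interest score, and
    the pages ordered by the mean of the Google rank and the PIWA rank.
    """
    google_rank = {url: i + 1 for i, (url, _) in enumerate(results)}
    by_score = sorted(
        results,
        key=lambda r: (-interest_score(r[1], interest), google_rank[r[0]]))
    piwa_rank = {url: i + 1 for i, (url, _) in enumerate(by_score)}
    mixed = sorted(
        google_rank,
        key=lambda u: ((google_rank[u] + piwa_rank[u]) / 2, google_rank[u]))
    return [u for u, _ in by_score][:top_n], mixed[:top_n]
```

The first list reflects only the user's interest, while the second keeps some of Google's original ordering.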

Agent Architecture

Implementation Summary

Keyword: String word, Int val
Interest: 10 keywords + 1 marker with timestamp
Profile: 20 interests
Google element: String title, String snippet, String

Results

1. IRAQ
10 pages from news on Iraq dated April 14th 2004.
7 interests formed, 3 pages merged.

Interests

Points to note

1st interest: “lexington” and “concorde” came in because they appeared in advertisements on the first page; the parser is not designed to ignore advertisements and treats them as part of the page.
Ignoring them is possible if the structure of the page is known beforehand, but that is impossible in the present case.
The timestamps show the 2nd and 10th pages merging, as both are from Reuters.

Search: IRAQ

Search query: iraq
Matched interest: the last one, as the weight of iraq is maximum in it (42).
The trigger pair causes Fallujah to be appended.

Comparison

Points to note

The 1st result of Google is about Fallujah and has few elements of fighting in it, hence it ranks a poor 10th in the PIWA ranking.
The 7th result of Google, which is basically a discussion board on the Iraq war with lots of discussion on “iraq, fallujah, marines, coalition, Baghdad” (words in the matched interest), ranks first in the PIWA ranking.
Mixed results: the page ranked 1st is 2nd in Google and 5th in PIWA.

2. Sandeep
3 pages: 2 homepages and 1 resume.
3 interests formed.

Search: Sandeep

Search query: sandeep
“Kumar” appended.

Points to note

The matched interest was derived from a page that is the 10th result of Google, but it ranked 2nd in the PIWA ranking and hence 5th in the mixed results.
A page visited in the past affects the results greatly.
“Sandeep” is a very general term, so the results are still not much inclined in my favour.

Search: Sandeep Kumar

Appended with “2004 iitg”.

3. India Pakistan Series
5 news pages dated 14th April 2004.

Search: India Pakistan

Appended with “test series”.

Points to note

A normal search in Google without the trigger pair returned results on war, which the user did not want.
Google’s 7th, 8th, 9th and 10th results don’t make it to the top 10 of PIWA, which shows the ranking differences based on the user’s interests.

Mixed Results

In classical AI search terminology:
Google: explore strategy, get new results.
PIWA: exploit strategy, use past information to decide new ranks.
Mixed: a 50-50 mix of both; this can be changed to explore more initially and then exploit more, as in any other learning process.
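One way to move away from the fixed 50-50 mix is to weight the two ranks and let the exploit share grow as the profile matures. The sketch below is purely illustrative: the schedule in `exploit_weight`, its parameters `half_point` and `ceiling`, and both function names are hypothetical and not part of the presented system.

```python
def mixed_rank(google_rank, piwa_rank, exploit=0.5):
    """Blend two 1-based ranks; exploit=0.5 reproduces the arithmetic mean."""
    return (1 - exploit) * google_rank + exploit * piwa_rank


def exploit_weight(pages_seen, half_point=50, ceiling=0.8):
    """Hypothetical schedule: starts at 0 and grows toward `ceiling`
    as more pages have been observed by the agent."""
    return ceiling * pages_seen / (pages_seen + half_point)
```

With `exploit=0.5`, a page ranked 2nd by Google and 5th by PIWA gets a blended rank of 3.5, the same value the arithmetic mean gives.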

Conclusion

A profile for a user was generated with absolutely no user relevance feedback.
Dynamic profile maintenance: the profile is continuously updated with new information.
The profile is used to improve the user’s web searches to suit his interests.

Future Work

Improve the GUI, especially for the search utility.
Support for plugins, to handle non-HTML documents.
Support for encoded pages: SSL, gzipped, etc.
A news reader can be made easily with an improved parser that knows the structure of news pages beforehand.

Thank you
