Post on 27-Mar-2015
Intelligent Web Agent
Under the guidance of
Dr. S. B. Nair
Presented by
Sandeep Kumar (001127)
Dept. of CSE, IIT Guwahati
Goals
To develop an intelligent web agent that learns about the user’s behavior without any user intervention and uses the gathered information to help him find pages of interest by ranking Google results according to his interests.
Intelligent Web Agent
Agent: something that perceives its environment through its sensors and acts upon it through its effectors.
Web Agent: an agent whose environment is the World Wide Web.
Intelligent Web Agent: a rational web agent, i.e. one that can make a rational decision, when given a choice, that leads to its goal.
‘Personalized’ Intelligent Web Agent (PIWA): learns user preferences and behavior over a length of time and exhibits ample intelligence in its decisions.
Problem
The explosive growth of the internet has made it the largest knowledge repository mankind has ever had.
It needs to be used efficiently, providing one with the information one seeks.
Searching for information is a problem.
The structure of the internet, a massive mess of hyperlinks pointing to HTML pages, makes searching difficult.
Search engines, though excellent in their working, fail to satisfy a user’s needs: they return millions of results, most of which are unwanted.
How can we help the user
A web-based search engine lacks information about the user.
Search engines are based on algorithms that need more information about the search query; a normal user provides just one or two words.
We can have a software agent at the user’s desktop.
It has more access to the user’s browsing behavior.
It can learn from it.
Using its knowledge base, it can help the user find information matching his interests.
Profiling
Static: profiles are built beforehand, like templates.
Dynamic: profiles are generated dynamically by learning from the user’s behavior.
Need
An intelligent search engine over a general-purpose search engine.
Real-time adaptive learning from monitoring the user’s habits, with no relevance feedback from the user.
The profile must change as the user’s interests change.
The Algorithm
Salient features:
Representation of a user’s interest as a group of words.
Generation of the group of words by unobtrusive monitoring of the user’s browsing habits, with no relevance feedback from the user.
Dynamic membership of the group of words representing a particular user’s interest.
Using the group of words to improve web query generation.
Using the group of words to rank results from a general-purpose search engine.
Interest
The basic knowledge block of the user profile.
Represented by a group of 10 words, each having an associated weight, and a timestamp.
The weight represents the importance of the word in that particular interest.
E.g.:
said 16, yesterday 15, city 15, sadr 14, iraq 12, holy 12, news 12, east 11, talks 11, tension 11
This represents the user’s interest in what Sadr said yesterday in a holy city of Iraq and his talks about tension and east.
Generation of Interest
Key point: unobtrusive, user friendly.
Implementation
The agent acts as a proxy for the user’s browser.
Passive monitoring of incoming traffic.
The 10 words are extracted from the very pages the user browses through.
Extracting the top words from an HTML document:
Get the page.
Do feature extraction.
Do stop-word removal.
Do stemming.
Generation of Interest
Interests are generated from the HTML pages browsed by the user.
Feature extraction is done using the latest features of HTMLEditorKit (available in JDK SE > 1.4).
HTML tags are given weights, e.g. title 10, meta-names 6, block-quote 4, boldface and underline 2, font sizes, etc.
Content tags are given weight 1 (similar to term frequency, tf).
Weights are summed up for all words.
Commonly used words are removed by stop-word elimination, along with words of length <= 2.
The top 10 words are selected.
Morphological analysis is not done, as many words (like yahoo) do not occur in a dictionary and the process is still not very efficient.
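The extraction steps can be illustrated with a small sketch. The original agent uses Java's HTMLEditorKit; the version below is only a Python stand-in built on the standard-library html.parser. The tag weights come from the slide, while the tiny stop-word list, the rule that a word takes the weight of its heaviest enclosing tag, and the omission of meta and font-size handling are assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tag weights from the slide; any other text counts as plain content (1).
TAG_WEIGHTS = {"title": 10, "blockquote": 4, "b": 2, "strong": 2, "u": 2}
# Tiny illustrative stop-word list (assumption; a real list is far longer).
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "was", "are"}
SKIPPED_TAGS = {"script", "style"}


class KeywordExtractor(HTMLParser):
    """Sums a tag-dependent weight for every word occurrence in a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open tags
        self.weights = Counter()   # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unbalanced HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # Assumption: a word takes the weight of its heaviest enclosing tag.
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        for word in re.findall(r"[a-z]+", data.lower()):
            # Drop stop words and words of length <= 2, as on the slide.
            if len(word) > 2 and word not in STOP_WORDS:
                self.weights[word] += weight


def top_keywords(html, n=10):
    """Return the n heaviest (word, weight) pairs of an HTML page."""
    parser = KeywordExtractor()
    parser.feed(html)
    return parser.weights.most_common(n)
```

Calling top_keywords(page_html) on a fetched page gives the 10 weighted keywords that the later profile steps work with.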
Creation of Profile
Get 10 keywords from each page visited.
2 possible cases:
The current page (its keywords) matches or is similar to a past interest -> the list of interests is updated.
The current page (its keywords) is new -> a new interest is created.
A match occurs if 3 or more words (>= 30%) are common to the keywords of the current page and a past interest.
Interest update: sum up the weights of the matched words and take the top 10 from the merged list.
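The match-and-update rule can be sketched as follows. This is a minimal Python illustration, not the original Java code: keyword lists are assumed to be dictionaries mapping word to weight, and merging into the first matching interest is an assumption, since the slides do not say what happens when several interests match.

```python
def matches(page_keywords, interest, min_common=3):
    """True if at least min_common words are shared (3 of 10 = 30%)."""
    return len(set(page_keywords) & set(interest)) >= min_common


def merge(page_keywords, interest, size=10):
    """Sum weights word by word, then keep the `size` heaviest words."""
    merged = dict(interest)
    for word, weight in page_keywords.items():
        merged[word] = merged.get(word, 0) + weight
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:size]
    return dict(top)


def update_profile(profile, page_keywords):
    """Merge the page into the first matching interest, else add a new one."""
    for i, interest in enumerate(profile):
        if matches(page_keywords, interest):
            profile[i] = merge(page_keywords, interest)
            return profile
    profile.append(dict(page_keywords))
    return profile
```

A page sharing only one or two words with every stored interest therefore starts a new interest instead of diluting an existing one.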
Maintenance of Profile
An optimum size is needed: too big a list will contain erroneous interests and cause performance problems, while a small list may not cover all of the user’s interests. At present the size is 20 interests.
When an interest is created or updated, its timestamp is updated (it is stored with the 11th word, the marker “1234567890”).
The product of the timestamp and the sum of the weights of an interest is used to determine which interests remain in the list and which are removed.
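One way to read the retention rule is sketched below in Python. The slides do not say which end of the ordering is dropped, so the sketch assumes the interest with the smallest product is removed. Storing the timestamp in a separate field (rather than in the marker word) and using seconds since the epoch are also assumptions, and the names retention_score and add_interest are hypothetical.

```python
import time

MAX_INTERESTS = 20  # profile size quoted on the slide


def retention_score(interest):
    """Timestamp multiplied by the sum of the keyword weights."""
    return interest["timestamp"] * sum(interest["keywords"].values())


def add_interest(profile, keywords, now=None):
    """Append a new interest, then drop the lowest-scoring ones if over capacity."""
    timestamp = time.time() if now is None else now
    profile.append({"keywords": dict(keywords), "timestamp": timestamp})
    while len(profile) > MAX_INTERESTS:
        # Assumption: the smallest product marks the interest to remove.
        profile.remove(min(profile, key=retention_score))
    return profile
```

Under this reading, old interests with low total weight are evicted first, while a recently refreshed or heavily weighted interest survives.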
Use of Profile for web searches
Direct searching: the user provides a search query.
3 cases:
The query matches one interest.
The query matches more than one interest.
The query does not match any interest.
No match: a simple Google search.
More than one match: sum up the weights of the query words in each matched interest and select the interest for which the sum is maximum.
So now we have one matched interest.
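The selection of the matched interest can be sketched in a few lines of Python. Interests are assumed to be dictionaries mapping word to weight, and the function name select_interest is hypothetical.

```python
def select_interest(query, profile):
    """Return the interest whose weights for the query words sum highest, or None."""
    words = query.lower().split()
    best, best_score = None, 0
    for interest in profile:
        # Words missing from an interest contribute nothing to its sum.
        score = sum(interest.get(w, 0) for w in words)
        if score > best_score:
            best, best_score = interest, score
    return best
```

A return value of None corresponds to the no-match case, where the query goes to Google unchanged.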
Trigger Pair Model
Trigger pair: take some words from the matched interest whose weights are less than the smallest weight of any word of the query.
This is done to prevent the original query from being overshadowed by more popular (heavier-weighted) words.
1 word is added for a single-word query, 2 words for a query of two or more words, again to prevent overshadowing.
The trigger pair refines the results from Google to a great extent.
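A minimal Python sketch of this query expansion follows. The slides do not say which of the lower-weighted words are picked, so the sketch assumes the heaviest ones below the threshold. Ignoring query words that are absent from the interest when computing the threshold is also an assumption, and expand_query is a hypothetical name.

```python
def expand_query(query, interest):
    """Append 1 or 2 lower-weighted words from the matched interest to the query."""
    words = query.lower().split()
    in_interest = [interest[w] for w in words if w in interest]
    if not in_interest:
        return query                  # no overlap: leave the query untouched
    threshold = min(in_interest)      # smallest weight of any query word
    # Assumption: among words below the threshold, prefer the heaviest.
    candidates = sorted(
        (w for w, wt in interest.items() if wt < threshold and w not in words),
        key=lambda w: interest[w],
        reverse=True,
    )
    extra = candidates[: 1 if len(words) == 1 else 2]
    return " ".join([query] + extra)
```

With an interest in which iraq has weight 42 and fallujah a smaller weight, the query "iraq" becomes "iraq fallujah", in line with the Iraq example in the results.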
Ranking of results to user’s interest
Get the top 20 results from Google.
Get the top 10 keywords for each result.
Score each page by summing the products of the weights of the words common to the matched interest and the page’s keywords.
Get the top 10 pages based on the score: 1st result.
Take the arithmetic mean of the Google rank and the rank from the algorithm, and get the top 10 pages: 2nd result.
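Both rankings can be sketched briefly in Python. The sketch assumes each result is a dictionary of keyword weights, that the results list is in Google order, and that ties keep Google order; the function names are hypothetical.

```python
def score_page(page_keywords, interest):
    """Sum of weight products over words shared by the page and the interest."""
    return sum(w * interest[word] for word, w in page_keywords.items() if word in interest)


def piwa_ranking(results, interest):
    """Indices of results ordered by descending score (input is in Google order)."""
    scores = [score_page(kw, interest) for kw in results]
    # sorted() is stable, so equal scores keep their Google order.
    return sorted(range(len(results)), key=lambda i: -scores[i])


def mixed_ranking(results, interest, top=10):
    """Order by the arithmetic mean of Google rank and PIWA rank."""
    piwa_rank = {idx: r for r, idx in enumerate(piwa_ranking(results, interest), start=1)}
    mean_rank = {i: ((i + 1) + piwa_rank[i]) / 2 for i in range(len(results))}
    return sorted(mean_rank, key=mean_rank.get)[:top]
```

Here piwa_ranking(results, interest)[:10] corresponds to the 1st result and mixed_ranking(results, interest) to the 2nd.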
Agent Architecture
Implementation Summary
Keyword: String word, Int val
Interest: 10 keywords + 1 marker with timestamp
Profile: 20 interests
Google element: String title, String snippet, String url
Results
1. IRAQ
10 pages from news on Iraq dated April 14th 2004.
7 interests formed, 3 pages merged.
Interests
Points to note
1st interest: lexington and concorde came in because they were advertisements on the first page; the parser is not designed to ignore them and treats them as part of the page.
Ignoring them is possible if the structure of the page is known beforehand, but impossible in the present case.
The timestamps show the 2nd and 10th pages merging, as both are from Reuters.
Search: IRAQ
Search query: iraq
Matched interest: the last one, as the weight of iraq is maximum in it (42).
The trigger pair causes Fallujah to be appended.
Comparison
Points to note
The 1st result of Google is about Fallujah and has fewer elements of fighting in it, hence it ranks a poor 10th in the PIWA ranking.
The 7th result of Google, which is basically a discussion board on the Iraq war with lots of discussion on “iraq, fallujah, marines, coalition, Baghdad” (words in the matched interest), ranks first in the PIWA rankings.
Mixed results: its rank 1 is the 2nd result of Google and the 5th of PIWA.
2. Sandeep
3 pages: 2 homepages and 1 resume.
3 interests formed.
Search: Sandeep
Search query: sandeep
“Kumar” appended.
Points to note
The page the matched interest was derived from is the 10th result of Google, but it ranked 2nd in the PIWA rankings and hence 5th in the mixed results.
A page visited in the past affects the results greatly.
Sandeep is a very general term, so the results are still not much inclined in my favour.
Search: Sandeep Kumar
Appended by “2004 iitg”.
3. India Pakistan Series
5 news pages dated 14th April 2004.
Search: India Pakistan
Appended by “test series”.
Points to note
A normal search in Google without the trigger pair returned results on war, which the user did not want.
Google's 7th, 8th, 9th and 10th results don’t make it to the top 10 of PIWA; this shows ranking differences based on the user’s interests.
Mixed Results
In classical AI search terminology:
Google: explore strategy, get new results.
PIWA: exploit strategy, use past information to decide new ranks.
Mixed: a 50-50 mix of both; it can be changed to explore more initially and then exploit more, as in any other learning process.
Conclusion
A profile for a user was generated with absolutely no user relevance feedback.
Dynamic profile maintenance: the profile is continuously updated with new information.
The profile is used to improve the user’s web searches to suit his interests.
Future Work
Improve the GUI, especially for the search utility.
Support for plugins, to handle non-HTML documents.
Support for encoded pages: SSL, gzipped, etc.
A news reader can be made easily with an improved parser that knows the structure of news pages beforehand.
Thank you