Using HTML Metadata to Retrieve Relevant Images from the World Wide Web
Ethan V. Munson
University of Wisconsin-Milwaukee
Why is image search important?
• The Web is becoming the world’s primary information source
• Images are one of the Web’s key features
• Few WWW image search engines exist currently
• Using textual search engines to find images manually is laborious
A Requirement for Web Image Search
• We need an efficient method of discovering and indexing image content.
• Two main sources of information about image content:
– image processing
– associated text
• text content
• markup
Related work
• QBIC (IBM Almaden Research Center)
– indexes and retrieves images according to:
• shape
• color
• texture
• object layout
– queries are formulated through visual examples:
• a sample image
• user-provided sketches
Related work QBIC system
QBIC: Advantages and Disadvantages
• Advantages
– well-developed visual query language
– interesting GUI
– queries are based on image appearance
• Disadvantages
– works only at the primitive feature level (color, texture, shape)
– doesn’t recognize the semantics of an image
• very sensitive to camera viewpoint
– doesn’t scale up to the Web
Related work
• WebSeek (J. Smith & S. Chang, Columbia University)
– performs a semi-automated classification of the images
• automatically extracts keywords from image file names
• computes the keyword histogram
• manually creates a subject hierarchy
• manually maps the images into the subject hierarchy
– User can
• browse the categories
• search the categories by keyword
• search the database using image features (color content)
WebSeek: Advantages/Disadvantages
• Advantages
– Large index of Web images
– Supports both text and image search
• Disadvantages
– Not clear that database can scale up
• Manual categorization is very expensive
– Relevance feedback mechanism is computationally expensive
Related work
• WebSeer (M. Swain et al., The University of Chicago)
– uses associated text and markup to supplement information derived from analyzing image content
– uses multiple kinds of metadata
• image file names
• alternate text
• text of a hyperlink
– decides which images are photographs, portraits, or computer-generated drawings
– research emphasized categorization, not metadata-based search
Why seek new image retrieval methods?
• The number of WWW documents is growing rapidly and constantly changing
• We need fast and efficient methods for finding images
• Image processing is
– complex
– computationally expensive
– limited (misses true image semantics)
– unnecessary
Research Goals
• Show that images can be found using HTML “metadata”
– textual content
– HTML tag structure
– attribute values
• Determine which metadata features are the best clues to image content
The URL Filter
• assembles a list of URLs from the results returned by Alta Vista
– parses the first page returned by Alta Vista
– follows the URLs of results pages, retrieves these pages, and parses them
– extracts the list of URLs from the results pages
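The extraction step above can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions, not the original URL Filter: the names `LinkCollector` and `extract_result_urls` are hypothetical, and the rule that links back to the search engine's own host are navigation rather than results is an assumption made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_result_urls(results_html, results_page_url):
    """Return absolute off-site URLs from a results page, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(results_html)
    own_host = urlparse(results_page_url).netloc
    seen, urls = set(), []
    for href in collector.hrefs:
        absolute = urljoin(results_page_url, href)
        parts = urlparse(absolute)
        # Assumption: links to the search engine itself are navigation, not results.
        if parts.scheme not in ("http", "https") or parts.netloc == own_host:
            continue
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Following the links to further results pages would repeat the same call on each retrieved page and concatenate the lists.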
The Crawler
• retrieves the pages
• saves each page’s HTML source code in a separate file
“Tidy”
• converts arbitrary and probably ill-formed HTML into XHTML
XHTML Parser
• parses an XHTML document
• builds an XHTML parse tree
The Document Analyzer
• scans the parse tree for image URLs
– an image URL appears in either an image or anchor element
• converts relative URLs into absolute URLs
• uses various heuristics to determine which URLs point to relevant images
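A minimal sketch of the scanning and URL-resolution steps, using `xml.etree.ElementTree` and `urllib.parse.urljoin` from the Python standard library. The function names and the file-extension heuristic for anchors are assumptions for illustration; the slides do not specify which heuristics the Document Analyzer used. The input is assumed to be well-formed XHTML without undefined named entities.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed list of image extensions; not taken from the slides.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}img'."""
    return tag.rsplit("}", 1)[-1].lower()


def find_image_urls(xhtml_source, page_url):
    """Return absolute image URLs taken from img src and image-like anchor href values."""
    root = ET.fromstring(xhtml_source)
    urls = []
    for element in root.iter():  # document order
        name = local_name(element.tag)
        if name == "img":
            candidate = element.get("src")
        elif name == "a":
            candidate = element.get("href")
            # Heuristic (assumption): an anchor points at an image only if
            # its target ends with a known image extension.
            if candidate and not candidate.lower().endswith(IMAGE_EXTENSIONS):
                candidate = None
        else:
            continue
        if candidate:
            absolute = urljoin(page_url, candidate)  # relative -> absolute
            if absolute not in urls:
                urls.append(absolute)
    return urls
```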
Search Strategies
• Image’s file name
• Textual content of the TITLE element
• Value of the ALT attribute of IMG elements
• Textual content of anchor elements
• Value of the title attribute of anchor elements
• Textual content of the paragraph surrounding an image
• Textual content of any paragraph located within the same center element as the image
• Textual content of heading elements
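To make the strategies concrete, the sketch below gathers five of the listed clues for each image in an XHTML document and reports which of them mention a query term. It is an illustration only: the function names are hypothetical, and case-insensitive substring matching is an assumption, since the slides do not say how query terms were matched against clue text.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def local_name(tag):
    """Element name without any XML namespace prefix, in lower case."""
    return tag.rsplit("}", 1)[-1].lower()


def text_of(element):
    """All text inside an element with whitespace collapsed, or '' if it is missing."""
    return "" if element is None else " ".join("".join(element.itertext()).split())


def clues_for_images(xhtml_source):
    """Map each img src to a dict of clue name -> clue text."""
    root = ET.fromstring(xhtml_source)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    title = next((e for e in root.iter() if local_name(e.tag) == "title"), None)

    def enclosing(element, name):
        # Walk up the tree to the nearest ancestor with the given element name.
        node = parent.get(element)
        while node is not None and local_name(node.tag) != name:
            node = parent.get(node)
        return node

    result = {}
    for img in root.iter():
        if local_name(img.tag) != "img" or not img.get("src"):
            continue
        src = img.get("src")
        result[src] = {
            "file name": posixpath.basename(urlparse(src).path),
            "TITLE": text_of(title),
            "ALT": img.get("alt", ""),
            "anchor text": text_of(enclosing(img, "a")),
            "paragraph": text_of(enclosing(img, "p")),
        }
    return result


def matching_clues(clues, query):
    """Names of the clues whose text contains the query, ignoring case (assumed rule)."""
    return [name for name, text in clues.items() if query.lower() in text.lower()]
```

The remaining clues (anchor title attribute, center element, headings) would follow the same pattern of walking up or across the parse tree.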
Image Retrieval Experiment
Experimental Questions
• Which HTML features reveal the most information about an image?
– Do particular patterns of HTML structure carry useful information?
• Do image search results depend on the type of query?
Informal Experiments
• Conducted extensive informal testing
– to check software correctness
– to investigate possible metadata clues
– to determine rules for filtering out images based on size
• images smaller than 65 pixels in either dimension almost never contained useful content
• reduced the number of images we had to classify
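The size rule above amounts to a one-line predicate. The sketch below assumes the pixel dimensions are already known (for example, read from the downloaded file); the function name and the way dimensions are obtained are not from the slides.

```python
MIN_DIMENSION = 65  # pixels; the threshold reported in the informal experiments


def is_large_enough(width, height, minimum=MIN_DIMENSION):
    """Keep an image only if neither dimension is smaller than the minimum."""
    return width >= minimum and height >= minimum
```

Under this rule a 468 x 60 banner is discarded even though it is wide, because one dimension falls below the threshold.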
Metadata Clues
1 Image’s file name
2 Textual content of the TITLE element
3 Value of the ALT attribute of IMG elements
4 Textual content of anchor elements
5 Value of the title attribute of anchor elements
6 Textual content of the paragraph surrounding an image
7 Textual content of any paragraph located within the same center element as the image
8 Textual content of heading elements
Query Categories
• Famous people: “Gorbachev”, “Yeltsin”, and “Streisand”
• Non-famous people: “Yelena” and “Ekaterina”
• Famous places: “Paris” and “London”
• Less-famous places: “Bremen” and “Spokane”
• Phenomena: “Explosion”, “Sunset”, and “Hurricane”
Experimental Procedure
• For each of the 12 queries
– Alta Vista returned 200 URLs (20 groups of 10)
– We used the first, middle, and last groups (30 URLs)
– Downloaded pages and all images on pages
• excluding small images (< 65 pixels in either dimension)
• 276 pages and 1578 images were accessible
– Manually determined relevance of each image
– Used our system to determine the effectiveness of each of the 8 metadata clues
• standard information retrieval measures: precision and recall
Information Retrieval Measures
• Recall = B/(A + B)
– Warning: our study does not really test recall
• We need a controlled sample of the Web, but instead, we are using Alta Vista’s biased sample
• Precision = B/(B + D)
              Not retrieved   Retrieved
Relevant            A             B
Nonrelevant         C             D
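The two measures can be computed directly from the four counts A, B, C, and D. The sketch below is illustrative: the function names and the set-based representation of images are assumptions, and returning 0.0 for an empty denominator is a convention chosen for the example.

```python
def contingency_counts(relevant, retrieved, all_images):
    """Return (A, B, C, D) for sets of image identifiers."""
    a = len(relevant - retrieved)               # relevant, not retrieved
    b = len(relevant & retrieved)               # relevant, retrieved
    c = len(all_images - relevant - retrieved)  # nonrelevant, not retrieved
    d = len(retrieved - relevant)               # nonrelevant, retrieved
    return a, b, c, d


def recall(a, b):
    """Recall = B / (A + B): the share of relevant images that were retrieved."""
    return b / (a + b) if (a + b) else 0.0


def precision(b, d):
    """Precision = B / (B + D): the share of retrieved images that are relevant."""
    return b / (b + d) if (b + d) else 0.0
```

Here `relevant` would hold the images judged relevant by hand, and `retrieved` the images a given metadata clue selects for the query.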
Recall Table
Precision Table
Key Results
• Image file name has poor recall for people’s names and excellent recall for less-famous cities
• Famous names have poorer precision than non-famous and place names
Metadata clue               Overall recall   Overall precision
Image file name                 43.5 %            70.7 %
Textual content of TITLE        62.1 %            58.2 %
Value of ALT                    13.7 %            87.5 %
Problems with this study
• This is a single, small study
– results must be replicated
• No standard corpus for testing Web image search
– our “recall” results are not reliable or truly sound
• Our choice of tools may bias our results
– Title tag may be important only because Alta Vista considers it important
– Tidy may remove some clues
• What is the structure of “<P> Text <IMG>”?
– Analysis of “header” clue is questionable
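[Figure: two candidate parse trees for “<P> Text <IMG>”: Body with P and IMG as siblings, and Body with IMG nested inside P]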
Conclusion
• Existing content-based image retrieval systems are not good models for Web image search
• HTML metadata is useful for Web image search
– Image file name and document title are most useful
– Alternate text is extremely precise, when present
• HTML metadata should provide faster image search than image processing approaches
– no need to download and analyze images
– can take advantage of existing search engines
Using HTML Metadata to Retrieve Relevant Images from the Web
Ethan V. Munson
Dept. of Electrical Engineering & Computer Science
University of Wisconsin - Milwaukee
http://www.cs.uwm.edu/~multimedia