Using HTML Metadata to Retrieve Relevant Images from the World Wide Web

Ethan V. Munson

University of Wisconsin-Milwaukee

Why is image search important?

• The Web is becoming the world’s primary information source

• Images are one of the Web’s key features

• Few WWW image search engines exist currently

• Using textual search engines to find images manually is laborious

A Requirement for Web Image Search

• We need an efficient method of discovering and indexing image content.

• Two main sources of information about image content:

– image processing

– associated text

• text content

• markup

Related work

• QBIC (the IBM Almaden Research Center)

– indexes and retrieves images according to:

– shape

– color

– texture

– object layout

– queries are formulated through visual examples

– a sample image

– user-provided sketches

Related work: QBIC system

QBIC: Advantages and Disadvantages

• Advantages

– well-developed visual query language

– interesting GUI

– queries are based on image appearance

• Disadvantages

– works only at the primitive feature level (color, texture, shape)

– doesn’t recognize image semantics

• very sensitive to camera viewpoint

– doesn’t scale up to the Web

Related work

• WebSeek (J. Smith & S. Chang, Columbia University)

– performs a semi-automated classification of the images

• automatically extracts keywords from image file names

• computes the keyword histogram

• manually creates a subject hierarchy

• manually maps the images into the subject hierarchy

– User can

• browse the categories

• search the categories by keyword

• search the database using image features (color content)

WebSeek: Advantages/Disadvantages

• Advantages

– Large index of Web images

– Supports both text and image search

• Disadvantages

– Not clear that database can scale up

• Manual categorization is very expensive

– Relevance feedback mechanism is computationally expensive

Related work

• WebSeer (M. Swain et al., The University of Chicago)

– uses associated text and markup to supplement information derived from analyzing image content

– uses multiple kinds of metadata

• image file names

• alternate text

• text of a hyperlink

– decides which images are photographs, portraits, or computer-generated drawings

– research emphasized categorization, not metadata-based search

Why seek new image retrieval methods?

• The number of WWW documents is growing rapidly and constantly changing

• We need fast and efficient methods for finding images

• Image processing is

– complex

– computationally expensive

– limited (misses true image semantics)

– unnecessary

Research Goals

• Show that images can be found using HTML “metadata”

– textual content

– HTML tag structure

– attribute values

• Determine which metadata features are the best clues to image content

The URL Filter

• assembles a list of URLs from the results returned by Alta Vista

– parses the first page returned by Alta Vista

– follows the URLs of results pages, retrieves these pages, and parses them

– extracts list of URLs from the results pages

The Crawler

• retrieves the pages

• saves each page’s HTML source code in a separate file
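The crawler step could be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the names `fetch_html` and `save_pages`, the `page_NNNN.html` file naming, and the decision to skip unreachable pages are all assumptions made for the example.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def save_pages(urls, out_dir, fetch=fetch_html):
    """Save each page's HTML source in a separate file; return {url: path}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        try:
            html = fetch(url)
        except OSError:
            # Assumption: inaccessible pages are skipped rather than retried.
            # urllib's URLError and HTTPError are subclasses of OSError.
            continue
        path = out / f"page_{index:04d}.html"
        path.write_bytes(html)
        saved[url] = path
    return saved
```

Passing the fetch function as a parameter keeps the file-saving logic testable without network access.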

“Tidy”

• converts arbitrary and probably ill-formed HTML into XHTML

XHTML Parser

• parses an XHTML document

• builds an XHTML parse tree

The Document Analyzer

• scans the parse tree for image URLs

– an image URL appears in either an image or anchor element

• converts relative URLs into absolute URLs

• uses various heuristics to determine which URLs point to relevant images
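The first two analyzer steps (collecting image URLs from image and anchor elements, then making them absolute) can be illustrated with a short sketch. It swaps in Python's event-based `html.parser` for the XHTML parse tree described in the slides. The class name `ImageURLCollector`, the function `find_image_urls`, and the file-extension test for anchors are assumptions, not the system's actual heuristics.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed list of extensions that mark an anchor target as an image.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


class ImageURLCollector(HTMLParser):
    """Collect absolute image URLs from IMG and A elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img":
            candidate = attributes.get("src")
        elif tag == "a":
            candidate = attributes.get("href")
            # Anchors count only when they link directly to an image file.
            if candidate and not candidate.lower().endswith(IMAGE_EXTENSIONS):
                candidate = None
        else:
            candidate = None
        if candidate:
            # Resolve relative URLs against the URL of the containing page.
            self.image_urls.append(urljoin(self.base_url, candidate))


def find_image_urls(html, base_url):
    """Return the absolute image URLs found in one page, in document order."""
    collector = ImageURLCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

The relevance heuristics from the last bullet are deliberately left out, since the slides do not spell them out.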

Search Strategies

• Image’s file name

• Textual content of the TITLE element

• Value of the ALT attribute of IMG elements

• Textual content of anchor elements

• Value of the title attribute of anchor elements

• Textual content of the paragraph surrounding an image

• Textual content of any paragraph located within the same center element as the image

• Textual content of heading elements

Image Retrieval Experiment

Experimental Questions

• Which HTML features reveal the most information about an image?

– Do particular patterns of HTML structure carry useful information?

• Do image search results depend on the type of query?

Informal Experiments

• Conducted extensive informal testing

– to check software correctness

– to investigate possible metadata clues

– to determine rules for filtering out images based on size

• images smaller than 65 pixels in either dimension almost never contained useful content

• reduced the number of images we had to classify
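The size rule can be written as a small predicate. In this sketch the 65-pixel threshold comes from the slides, while the function names and the `(url, width, height)` tuple layout are assumptions; how the dimensions are obtained is not shown.

```python
MIN_SIDE_PIXELS = 65  # threshold reported in the slides


def is_large_enough(width, height, min_side=MIN_SIDE_PIXELS):
    """Return False when the image is smaller than min_side in either dimension."""
    return width >= min_side and height >= min_side


def filter_small_images(images):
    """Keep the (url, width, height) tuples that pass the size rule."""
    return [image for image in images if is_large_enough(image[1], image[2])]
```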

Metadata Clues

1 Image’s file name

2 Textual content of the TITLE element

3 Value of the ALT attribute of IMG elements

4 Textual content of anchor elements

5 Value of the title attribute of anchor elements

6 Textual content of the paragraph surrounding an image

7 Textual content of any paragraph located within the same center element as the image

8 Textual content of heading elements
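One way to test the eight clues against a query is sketched below. The slides do not say how matching was done, so the case-insensitive substring test, the clue key names, and the function `matching_clues` are all assumptions for illustration.

```python
# Hypothetical keys, one per metadata clue, in the order listed above.
CLUES = (
    "file_name", "title", "alt", "anchor_text",
    "anchor_title", "paragraph", "center_paragraph", "heading",
)


def matching_clues(query, clue_texts):
    """Return the names of the clues whose text contains the query term.

    clue_texts maps clue names to the text extracted for one image;
    missing or None entries are treated as empty text.
    """
    needle = query.lower()
    return [
        clue for clue in CLUES
        if needle in (clue_texts.get(clue) or "").lower()
    ]
```

An image would then count as retrieved by a given clue when that clue's name appears in the returned list.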

Query Categories

• Famous people: “Gorbachev”, “Yeltsin”, and “Streisand”

• Non-famous people: “Yelena” and “Ekaterina”

• Famous places: “Paris” and “London”

• Less-famous places: “Bremen” and “Spokane”

• Phenomena: “Explosion”, “Sunset”, and “Hurricane”

Experimental Procedure

• For each of the 12 queries

– Alta Vista returned 200 URLs (20 groups of 10)

– We used first, middle, and last groups (30 URLs)

– Downloaded pages and all images on pages

• excluding small images (< 65 pixels in either dimension)

• 276 pages and 1578 images were accessible

– Manually determined relevance of each image

– Used our system to determine the effectiveness of each of the 8 metadata clues

• standard information retrieval measures: precision and recall

Information Retrieval Measures

• Recall = B/(A + B)

– Warning: our study does not really test recall

• We need a controlled sample of the Web, but instead, we are using Alta Vista’s biased sample

• Precision = B/(B + D)

              Not retrieved   Retrieved
Relevant      A               B
Nonrelevant   C               D
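The two measures follow directly from the counts A, B, and D. The sketch below uses the formulas from this slide; the function names and the example counts are made up for illustration and are not results from the study.

```python
def recall(relevant_retrieved, relevant_not_retrieved):
    """Recall = B / (A + B); None when there are no relevant images."""
    total = relevant_retrieved + relevant_not_retrieved
    return relevant_retrieved / total if total else None


def precision(relevant_retrieved, nonrelevant_retrieved):
    """Precision = B / (B + D); None when nothing was retrieved."""
    total = relevant_retrieved + nonrelevant_retrieved
    return relevant_retrieved / total if total else None


# Hypothetical counts: A = 6, B = 4, D = 1.
print(recall(4, 6))     # 4 / (6 + 4) = 0.4
print(precision(4, 1))  # 4 / (4 + 1) = 0.8
```

C (nonrelevant, not retrieved) appears in neither formula, so it does not need to be counted.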

Recall Table

Precision Table

Key Results

• Image file name has poor recall for people’s names and excellent recall for less-famous cities

• Famous names have poorer precision than non-famous and place names

Metadata clue               Overall recall   Overall precision
Image file name             43.5 %           70.7 %
Textual content of TITLE    62.1 %           58.2 %
Value of ALT                13.7 %           87.5 %

Problems with this study

• This is a single, small study

– results must be replicated

• No standard corpus for testing Web image search

– our “recall” results are not reliable or truly sound

• Our choice of tools may bias our results

– Title tag may be important only because Alta Vista considers it important

– Tidy may remove some clues

• What is the structure of “<P> Text <IMG>”?

– Analysis of “header” clue is questionable

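[Figure: two candidate parse trees for “<P> Text <IMG>”: Body with P and IMG as sibling children, and Body with IMG nested inside P]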

Conclusion

• Existing content-based image retrieval systems are not good models for Web image search

• HTML metadata is useful for Web image search

– Image file name and document title are most useful

– Alternate text is extremely precise, when present

• HTML metadata should provide faster image search than image processing approaches

– no need to download and analyze images

– can take advantage of existing search engines

Using HTML Metadata to Retrieve Relevant Images from the Web

Ethan V. Munson

Dept. of Electrical Engineering & Computer Science

University of Wisconsin - Milwaukee

[email protected]

http://www.cs.uwm.edu/~multimedia
