Using HTML Metadata to Retrieve Relevant Images from the World Wide Web

Ethan V. Munson

University of Wisconsin-Milwaukee


Why is image search important?

• The Web is becoming the world’s primary information source
• Images are one of the Web’s key features
• Few WWW image search engines exist currently
• Using textual search engines to find images manually is laborious


A Requirement for Web Image Search

• We need an efficient method of discovering and indexing image content.
• Two main sources of information about image content:
  – image processing
  – associated text
    • text content
    • markup


Related work

• QBIC (IBM Almaden Research Center)
  – indexes and retrieves images according to:
    • shape
    • color
    • texture
    • object layout
  – queries are formulated through visual examples:
    • a sample image
    • user-provided sketches


Related work QBIC system



QBIC: Advantages and Disadvantages

• Advantages
  – well-developed visual query language
  – interesting GUI
  – queries are based on image appearance
• Disadvantages
  – works only at the primitive feature level (color, texture, shape)
  – doesn’t recognize the semantics of an image
    • very sensitive to camera viewpoint
  – doesn’t scale up to the Web


Related work

• WebSeek (J. Smith & S. Chang, Columbia University)
  – performs a semi-automated classification of the images
    • automatically extracts keywords from image file names
    • computes the keyword histogram
    • manually creates a subject hierarchy
    • manually maps the images into the subject hierarchy
  – the user can
    • browse the categories
    • search the categories by keyword
    • search the database using image features
      – color content
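
The two automatic steps listed above (keyword extraction from image file names and the keyword histogram) might look roughly like the following. This is only a sketch, not WebSeek's actual implementation: the tokenization rule and the names keywords_from_url and keyword_histogram are assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def keywords_from_url(image_url):
    """Split the file name of an image URL into lowercase alphabetic keywords."""
    # Assumption: keywords are runs of letters in the file name, extension dropped.
    stem = PurePosixPath(urlparse(image_url).path).stem
    return [word.lower() for word in re.findall(r"[A-Za-z]+", stem)]


def keyword_histogram(image_urls):
    """Count how often each keyword occurs across a collection of image URLs."""
    histogram = Counter()
    for url in image_urls:
        histogram.update(keywords_from_url(url))
    return histogram
```

The histogram only counts terms; building the subject hierarchy and mapping images into it are the manual steps and are not sketched here.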


WebSeek: Advantages/Disadvantages

• Advantages
  – Large index of Web images
  – Supports both text and image search
• Disadvantages
  – Not clear that database can scale up
    • Manual categorization is very expensive
  – Relevance feedback mechanism is computationally expensive


Related work

• WebSeer (M. Swain et al., The University of Chicago)
  – uses associated text and markup to supplement information derived from analyzing image content
  – uses multiple kinds of metadata
    • image file names
    • alternate text
    • text of a hyperlink
  – decides which images are photographs, portraits, or computer-generated drawings
  – research emphasized categorization, not metadata-based search


Why seek new image retrieval methods?

• The number of WWW documents is growing rapidly, and the documents are constantly changing
• We need fast and efficient methods for finding images
• Image processing is
  – complex
  – computationally expensive
  – limited (misses true image semantics)
  – unnecessary


Research Goals

• Show that images can be found using HTML “metadata”
  – textual content
  – HTML tag structure
  – attribute values
• Determine which metadata features are the best clues to image content


The URL Filter

• assembles a list of URLs from the results returned by Alta Vista
  – parses the first page returned by Alta Vista
  – follows the URLs of results pages, retrieves these pages, and parses them
  – extracts list of URLs from the results pages
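
The URL extraction step could be sketched as follows with Python's standard html.parser module. This is an illustration only: the class and function names are hypothetical, and the sketch does not try to tell result links apart from navigation links, since the layout of the Alta Vista results pages is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor element on a results page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the results page's own URL.
            self.urls.append(urljoin(self.base_url, href))


def extract_result_urls(results_html, base_url):
    """Return the distinct URLs linked from one results page, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(results_html)
    return list(dict.fromkeys(collector.urls))
```

Following the links to further results pages would repeat the same call on each retrieved page and concatenate the lists.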


The Crawler

• retrieves the pages
• saves each page’s HTML source code in a separate file
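
A minimal sketch of such a crawler, using only the Python standard library. The file naming scheme (page_0000.html, ...) and the choice to skip inaccessible pages silently are assumptions for illustration, not details taken from the system described here.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, fetch=fetch_page):
    """Save each page's source to its own file; return a {url: path} mapping."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        try:
            source = fetch(url)
        except OSError:
            # Inaccessible pages are skipped rather than aborting the crawl.
            continue
        target = out_path / f"page_{index:04d}.html"
        target.write_bytes(source)
        saved[url] = target
    return saved
```

The fetch parameter is there so that the download step can be replaced, for example by a stub when testing without network access.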


“Tidy”

• converts arbitrary and probably ill-formed HTML into XHTML


XHTML Parser

• parses an XHTML document
• builds an XHTML parse tree
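
As an illustration of this step, well-formed XHTML can be parsed into a tree with Python's xml.etree.ElementTree. This is a stand-in, not the parser used in the system; it assumes the input is well-formed and contains no undeclared named entities, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def parse_xhtml(source):
    """Parse well-formed XHTML text and return the root of its parse tree."""
    return ET.fromstring(source)


def local_name(element):
    """Strip the XHTML namespace prefix that ElementTree adds to tag names."""
    return element.tag.replace(XHTML_NS, "", 1)


def outline(element, depth=0):
    """Return the tree as a list of indented element names, one per node."""
    lines = ["  " * depth + local_name(element)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling outline on the root gives a quick view of the tag structure that later stages walk.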


The Document Analyzer

• scans the parse tree for image URLs
  – an image URL appears in either an image or anchor element
• converts relative URLs into absolute URLs
• uses various heuristics to determine which URLs point to relevant images
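
The first two steps (finding image URLs in image and anchor elements, then making them absolute) can be sketched as below. The list of file extensions used to recognize image links in anchors is an assumption, and the relevance heuristics are not covered because they are not spelled out here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
# Assumption: a link counts as an image link if its path has one of these extensions.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def image_urls(root, page_url):
    """Collect absolute image URLs from img and a elements of a parse tree."""
    found = []
    for element in root.iter():
        tag = element.tag.replace(XHTML_NS, "", 1)
        if tag == "img":
            candidate = element.get("src")
        elif tag == "a":
            candidate = element.get("href")
            # Anchors qualify only when they point directly at an image file.
            if candidate and not urlparse(candidate).path.lower().endswith(IMAGE_EXTENSIONS):
                candidate = None
        else:
            candidate = None
        if candidate:
            # Relative URLs are resolved against the URL of the containing page.
            absolute = urljoin(page_url, candidate)
            if absolute not in found:
                found.append(absolute)
    return found
```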


Search Strategies

• Image’s file name

• Textual content of the TITLE element

• Value of the ALT attribute of IMG elements

• Textual content of anchor elements

• Value of the title attribute of anchor elements

• Textual content of the paragraph surrounding an image

• Textual content of any paragraph located within the same center element as the image

• Textual content of heading elements
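
A rough sketch of how a few of these clues could be gathered for each image and matched against a query term. It covers only five of the clues (file name, TITLE, ALT, surrounding paragraph, headings), uses plain case-insensitive substring matching, and all function names are hypothetical; the actual matching rules of the system are not given here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def _name(element):
    """Element name without the XHTML namespace prefix."""
    return element.tag.replace(XHTML_NS, "", 1)


def _text(element):
    """All text inside an element, with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


def clues_for_images(root):
    """Return, for each img element, a dict of a few textual clues about it."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    title = next((_text(e) for e in root.iter() if _name(e) == "title"), "")
    headings = " ".join(_text(e) for e in root.iter()
                        if _name(e) in ("h1", "h2", "h3", "h4", "h5", "h6"))
    results = []
    for img in (e for e in root.iter() if _name(e) == "img"):
        # Walk upward to find the paragraph that encloses the image, if any.
        node, paragraph = img, ""
        while node in parent_of:
            node = parent_of[node]
            if _name(node) == "p":
                paragraph = _text(node)
                break
        results.append({
            "file name": urlparse(img.get("src", "")).path.rsplit("/", 1)[-1],
            "title": title,
            "alt": img.get("alt", ""),
            "paragraph": paragraph,
            "headings": headings,
        })
    return results


def matching_clues(clues, query):
    """Names of the clues whose text contains the query (case-insensitive)."""
    return [name for name, text in clues.items() if query.lower() in text.lower()]
```

Keeping the clues separate, rather than merging them into one text field, is what allows each clue to be evaluated on its own.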


Image Retrieval Experiment


Experimental Questions

• Which HTML features reveal the most information about an image?
  – Do particular patterns of HTML structure carry useful information?
• Do image search results depend on the type of query?


Informal Experiments

• Conducted extensive informal testing
  – to check software correctness
  – to investigate possible metadata clues
  – to determine rules for filtering out images based on size
    • images smaller than 65 pixels in either dimension almost never contained useful content
    • reduced the number of images we had to classify
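
The size rule above amounts to a one-line test. The sketch below assumes the pixel dimensions of each image are already known (how they were obtained is not stated here), and the function names are illustrative.

```python
MIN_SIDE_PIXELS = 65  # threshold reported on the slide


def is_large_enough(width, height, min_side=MIN_SIDE_PIXELS):
    """Keep an image only if neither dimension is below the threshold."""
    return width >= min_side and height >= min_side


def filter_by_size(images, min_side=MIN_SIDE_PIXELS):
    """Filter (url, width, height) tuples, dropping images that are too small."""
    return [image for image in images if is_large_enough(image[1], image[2], min_side)]
```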


Metadata Clues

1 Image’s file name

2 Textual content of the TITLE element

3 Value of the ALT attribute of IMG elements

4 Textual content of anchor elements

5 Value of the title attribute of anchor elements

6 Textual content of the paragraph surrounding an image

7 Textual content of any paragraph located within the same center element as the image

8 Textual content of heading elements


Query Categories

• Famous people: “Gorbachev”, “Yeltsin”, and “Streisand”
• Non-famous people: “Yelena” and “Ekaterina”
• Famous places: “Paris” and “London”
• Less-famous places: “Bremen” and “Spokane”
• Phenomena: “Explosion”, “Sunset”, and “Hurricane”


Experimental Procedure

• For each of the 12 queries
  – Alta Vista returned 200 URLs (20 groups of 10)
  – We used first, middle, and last groups (30 URLs)
  – Downloaded pages and all images on pages
    • excluding small images (< 65 pixels in either dimension)
    • 276 pages and 1578 images were accessible
  – Manually determined relevance of each image
  – Used our system to determine the effectiveness of each of the 8 metadata clues
    • standard information retrieval measures: precision and recall


Information Retrieval Measures

• Recall = B/(A + B)
  – Warning: our study does not really test recall
    • We need a controlled sample of the Web, but instead, we are using Alta Vista’s biased sample
• Precision = B/(B + D)

                Not retrieved   Retrieved
Relevant              A             B
Nonrelevant           C             D
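
The two measures can be computed directly from the counts A, B, and D defined above. In this sketch the function names and the convention of returning None for an empty denominator are assumptions for illustration.

```python
def recall(relevant_not_retrieved, relevant_retrieved):
    """Recall = B / (A + B); returns None when there are no relevant images."""
    total_relevant = relevant_not_retrieved + relevant_retrieved
    return relevant_retrieved / total_relevant if total_relevant else None


def precision(relevant_retrieved, nonrelevant_retrieved):
    """Precision = B / (B + D); returns None when nothing was retrieved."""
    total_retrieved = relevant_retrieved + nonrelevant_retrieved
    return relevant_retrieved / total_retrieved if total_retrieved else None


def evaluate(relevant, retrieved):
    """Compute (recall, precision) from sets of relevant and retrieved image URLs."""
    b = len(relevant & retrieved)   # relevant, retrieved
    a = len(relevant - retrieved)   # relevant, not retrieved
    d = len(retrieved - relevant)   # nonrelevant, retrieved
    return recall(a, b), precision(b, d)
```

C (nonrelevant, not retrieved) does not enter either formula, which is why it is never counted.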


Recall Table


Precision Table


Key Results

• Image file name has poor recall for people’s names and excellent recall for less-famous cities

• Famous names have poorer precision than non-famous and place names

                               Image file name   Textual content of TITLE   Value of ALT
Overall percent of recall          43.5 %                62.1 %                13.7 %
Overall percent of precision       70.7 %                58.2 %                87.5 %


Problems with this study

• This is a single, small study
  – results must be replicated
• No standard corpus for testing Web image search
  – our “recall” results are not reliable or truly sound
• Our choice of tools may bias our results
  – Title tag may be important only because Alta Vista considers it important
  – Tidy may remove some clues
    • What is the structure of “<P> Text <IMG>”?
  – Analysis of “header” clue is questionable


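[Figure: two candidate parse trees for “<P> Text <IMG>”. Left: Body with children P and IMG. Right: Body with child P, which contains IMG.]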


Conclusion

• Existing content-based image retrieval systems are not good models for Web image search
• HTML metadata is useful for Web image search
  – Image file name and document title are most useful
  – Alternate text is extremely precise, when present
• HTML metadata should provide faster image search than image processing approaches
  – no need to download and analyze images
  – can take advantage of existing search engines


Using HTML Metadata to Retrieve Relevant Images from the Web

Ethan V. Munson

Dept. of Electrical Engineering & Computer Science

University of Wisconsin - Milwaukee


http://www.cs.uwm.edu/~multimedia