
Content Extraction from HTML Documents

A. Rahman, H. Alam, R. Hartono

Document Analysis and Recognition Team (DART)

BCL Computers Inc., Santa Clara, Calif., USA

Current need?

• Viewing websites on small-screen handheld devices

• Since web sites are written in HTML, they must be translated into formats that wireless devices can support.

Current Solutions

• Handcrafting: Custom web sites are typically crafted by hand by a set of content experts

• Transcoding: Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.)
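To make the transcoding idea concrete, here is a minimal sketch of tag replacement using Python's standard `html.parser`. The tag map is a small hypothetical one chosen for illustration, not the mapping of any real transcoding product, and the output is not validated against the WML DTD.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical, deliberately tiny HTML-to-WML tag map (an assumption for illustration).
TAG_MAP = {"p": "p", "b": "b", "strong": "b", "i": "i", "em": "i"}


class NaiveTranscoder(HTMLParser):
    """Replace mapped HTML tags with WML tags; drop unmapped tags but keep their text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_MAP:
            self.out.append(f"<{TAG_MAP[tag]}>")

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.out.append(f"</{TAG_MAP[tag]}>")

    def handle_data(self, data):
        # Collapse runs of whitespace and skip whitespace-only text.
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.out.append(escape(text))


def transcode(html):
    """Wrap the transcoded body in a single WML card."""
    parser = NaiveTranscoder()
    parser.feed(html)
    parser.close()
    return '<wml><card id="main">' + "".join(parser.out) + "</card></wml>"
```

Note that every piece of text on the page is passed through, which is exactly the behavior the later slides criticize: nothing is summarized or prioritized.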

Handcrafting

• Automation: Use of XML.

• There is no standard XML tagset (Document Type Definition – DTD) in use by vendors.

• XML has been available to web designers for the last 10 years. Examination of websites shows little use of document structural elements.

– Web masters see themselves as artists rather than programmers.

– XML may meet the same fate as SGML, an earlier attempt to create structured documents.

Handcrafting

• Take an existing website and make it available to wireless access. Aether Systems, Mshift and 2Roam currently offer these types of solutions.

• Use a proprietary graphical interface to ease the development of wireless applications from scratch. Covigo and iConverse offer these types of solutions.

• Let the user do all coding in languages such as C++ or Java. ThinAirApps offers this type of solution.

Handcrafting

• Labor intensive

• Expensive.

• Typically less than 1% of a web site gets converted to wireless content.

Transcoding

• Most web pages have a loose repeating visual structure. The wireless user gets the same repeating information with every screen.

• Browsing is an unfriendly experience.

• Transcoding sends all the information to the wireless device, which makes browsing substantially slower over the wireless network.

Transcoding

• Transcoding was introduced in Japan during 1999-2000. It was widely rejected by the Japanese users.

• Recently, Google and Pixo introduced this solution for the US market, but have so far failed to attract the attention of end users.

The Alternate Solution

• Separate the content into smaller segments

• Generate a summary of these segments

• Prioritize these summaries from individual segments

• Put together to form a summary of the overall document

Steps to Content Extraction

• Structural analysis: Understanding the relationship of the various segments with the document

• Decomposition: Breakdown of these segments into operational units

• Contextual Analysis: Employment of context to revise the segmentation
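The three steps above can be sketched roughly as follows. This is an illustrative approximation, not the authors' method: it assumes that a fixed set of block-level tags marks segment boundaries (structural analysis and decomposition), and it stands in for contextual analysis with a simple rule that attaches very short segments to the segment that follows them. The names `BLOCK_TAGS`, `Segmenter`, and `merge_short` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these block-level tags mark segment boundaries.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "table", "ul", "ol", "form"}


class Segmenter(HTMLParser):
    """Decompose an HTML page into text segments split at block-level tags."""

    def __init__(self):
        super().__init__()
        self.segments = []  # finished segments, in document order
        self._buf = []      # text pieces of the segment being built

    def _flush(self):
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def merge_short(segments, min_words=3):
    """Contextual revision: attach very short segments to the following one."""
    merged, carry = [], ""
    for seg in segments:
        seg = (carry + " " + seg).strip()
        if len(seg.split()) < min_words:
            carry = seg
        else:
            merged.append(seg)
            carry = ""
    if carry:
        merged.append(carry)
    return merged


def segment(html):
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    return merge_short(parser.segments)
```

With this rule a one-word heading is folded into the paragraph it introduces, while a trailing short segment such as a small menu is kept as its own unit.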

(Continued=>)

Steps to Content Extraction (Continued)

• Labeling => Segment Summary: Extraction of a low level summary of the segment

• Priority: Estimating importance of these segments

• Table of Contents (TOC) => Document Summary: Putting together a summary of the document
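A minimal sketch of labeling, priority estimation, and TOC assembly is given below. Both the label (the first few words of a segment) and the priority score (longer segments and earlier positions rank higher) are placeholder assumptions; the slides do not specify how either is computed. `Entry`, `label_of`, `priority_of`, and `build_toc` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    label: str       # low-level summary of the segment
    priority: float  # estimated importance
    index: int       # position of the segment in the original document


def label_of(segment, max_words=6):
    """Label: the first few words of the segment (a stand-in for real summarization)."""
    words = segment.split()
    label = " ".join(words[:max_words])
    return label + ("..." if len(words) > max_words else "")


def priority_of(segment, index):
    """Hypothetical score: longer text and earlier position rank higher."""
    return len(segment.split()) / (1 + index)


def build_toc(segments):
    """Return one entry per segment, most important first."""
    entries = [Entry(label_of(s), priority_of(s, i), i) for i, s in enumerate(segments)]
    return sorted(entries, key=lambda e: e.priority, reverse=True)
```

Each entry keeps the index of its segment, so a TOC line can later be resolved back to the full content it summarizes.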

Content Extraction

• Proximity Analysis: Relational analysis of content between segments

• Content Classification: Classification into various types, i.e. [stories], [navigation], [links], [images], [forms] etc.

• Relationship Analysis:

– Contextual grammar (Natural Language)

– Knowledge modes

– Information retrieval techniques
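As an illustration of content classification, the sketch below assigns one of the types listed above to an HTML segment using simple counts. The features and thresholds (link-text ratio above 0.6, anchors of at most two words treated as navigation, fewer than five words for an image segment) are assumptions chosen for the example, not values from the presentation.

```python
from html.parser import HTMLParser


class SegmentStats(HTMLParser):
    """Collect simple counts from one HTML segment."""

    def __init__(self):
        super().__init__()
        self.words = 0        # all text words
        self.link_words = 0   # text words inside <a>
        self.links = 0
        self.images = 0
        self.form_fields = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self._in_link += 1
        elif tag == "img":
            self.images += 1
        elif tag in ("input", "select", "textarea"):
            self.form_fields += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.words += n
        if self._in_link:
            self.link_words += n


def classify(segment_html):
    """Return one of: forms, images, navigation, links, stories."""
    s = SegmentStats()
    s.feed(segment_html)
    s.close()
    if s.form_fields:
        return "forms"
    if s.images and s.words < 5:
        return "images"
    if s.words and s.link_words / s.words > 0.6:
        # Mostly link text: short anchors look like a menu, longer ones like a link list.
        return "navigation" if s.link_words / max(s.links, 1) <= 2 else "links"
    return "stories"
```

A production classifier would likely combine such counts with the proximity and relationship analysis named on this slide; the rule order here simply checks the most distinctive signals first.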

Content Extraction: Why do we need it?

• Viewing any website: Any solution to web browsing has to be universal

• High network access: Any transformation has to be fast and on-the-fly

• Network Usage: Network traffic should not increase because of these systems

(Continued=>)

Content Extraction: Why do we need it (continued)?

• Easy Configurability: Any such system should be easily configurable

• Rapid Deployment: Should be rapidly deployable

• Non-intrusive Design: Should be possible to transform web sites without modifying the actual web site

• Multiple Views: System Integrators should be able to create multiple views of the same site

Advantages of Content Extraction

• Display size

• Locating information

• Important content can be on top

• Multiple levels of abstraction can be created

• The browsing can use a demand-driven model

• Faster download

• More efficient use of small display areas

• Mapping of the importance of content from the original document
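The demand-driven model and the efficient use of small displays can be sketched as a two-level view: the device first receives only a short table of contents, and a segment is wrapped and paginated only when the user asks for it. The class name `DemandDrivenView` and the 20 x 4 character display size are hypothetical values for illustration.

```python
import textwrap


class DemandDrivenView:
    """Serve a table of contents first, then one segment at a time on request."""

    def __init__(self, segments, labels, cols=20, rows=4):
        # cols x rows is a hypothetical handheld display size in characters.
        self.segments = segments
        self.labels = labels
        self.cols = cols
        self.rows = rows

    def toc(self):
        """Top-level view: one numbered label per segment, cut to the display width."""
        return [f"{i + 1}.{label}"[: self.cols] for i, label in enumerate(self.labels)]

    def pages(self, choice):
        """Detail view: the chosen segment wrapped and split into screen-sized pages."""
        lines = textwrap.wrap(self.segments[choice - 1], width=self.cols)
        return [lines[i : i + self.rows] for i in range(0, len(lines), self.rows)]
```

Only the TOC lines and the pages of the selected segment would need to cross the wireless link, which is where the faster download comes from in this sketch.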

Supported Devices and Formats

• PDAs (HTML3.2)

• Cell phones

– USA/Europe: WAP

– Japan: iMode (NTT DoCoMo), J-Sky (J-Phone), EZWeb (KDDI)

Conclusion

• Content from web documents can be extracted based on the:

– HTML structure

– Proximity analysis

– Logical relationship analysis

– Information retrieval techniques

• Content can be used effectively to summarize web documents:

– Better option compared to handcrafting or transcoding

– Produces a faster browsing experience