Content Extraction from HTML Documents A. Rahman H. Alam R. Hartono Document Analysis and Recognition Team (DART) BCL Computers Inc. Santa Clara, Calif, USA


Current need?

• Viewing websites on small-screen handheld devices

• Since websites are written in HTML, we need to translate them into formats that wireless devices can support.

Current Solutions

• Handcrafting:
  – Custom websites are typically crafted by hand by a set of content experts

• Transcoding:
  – Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.)
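
A minimal sketch of what tag-level transcoding looks like, using Python's standard `html.parser`. The `TAG_MAP` table, the `Transcoder` class and the `transcode` function are hypothetical names chosen for illustration, and the mapping is an assumption rather than a rule set from any real transcoder. The output is not validated against the WML DTD.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Assumed HTML -> WML tag mapping, for illustration only.
TAG_MAP = {
    "p": "p", "b": "b", "strong": "b", "i": "i", "em": "i",
    "h1": "big", "h2": "big", "br": "br",
}


class Transcoder(HTMLParser):
    """Replace known HTML tags with WML tags; drop unknown tags, keep text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        wml = TAG_MAP.get(tag)
        if wml == "br":
            self.out.append("<br/>")       # WML requires the XML empty-tag form
        elif wml:
            self.out.append(f"<{wml}>")

    def handle_endtag(self, tag):
        wml = TAG_MAP.get(tag)
        if wml and wml != "br":
            self.out.append(f"</{wml}>")

    def handle_data(self, data):
        self.out.append(escape(data))      # re-escape &, <, > for XML output


def transcode(html):
    """Wrap the translated markup in a single WML card."""
    t = Transcoder()
    t.feed(html)
    t.close()
    return '<wml><card id="main">' + "".join(t.out) + "</card></wml>"


print(transcode("<h1>News</h1><p>Hello <strong>world</strong></p>"))
```

Note that every piece of text on the page is carried through unchanged: nothing is selected, summarized or reordered.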

Handcrafting

• Automation: use of XML.

• There is no standard XML tagset (Document Type Definition – DTD) in use by vendors.

• XML has been available to web designers for the last 10 years. Examination of websites shows little use of document structural elements.

– Webmasters see themselves as artists rather than programmers.

– XML may meet the same fate as SGML, an earlier attempt to create structured documents.

Handcrafting

• Take an existing website and make it available for wireless access. Aether Systems, Mshift and 2Roam currently offer this type of solution.

• Use a proprietary graphical interface to ease the development of wireless applications from scratch. Covigo and iConverse offer this type of solution.

• Let the user do all coding in languages such as C++ or Java. ThinAirApps offers this type of solution.

Handcrafting

• Labor intensive

• Expensive

• Typically less than 1% of a web site gets converted to wireless content.

Transcoding

• Most web pages have a loose repeating visual structure. The wireless user gets the same repeating information with every screen.

• Browsing is an unfriendly experience.

• Transcoding sends all the information to the wireless device, which makes browsing substantially slower on the wireless network.

Transcoding

• Transcoding was introduced in Japan during 1999-2000. It was widely rejected by Japanese users.

• Recently, Google and Pixo introduced this solution for the US market, but have so far failed to attract the attention of end users.

The Alternate Solution

• Separate the content into smaller segments

• Generate a summary of these segments

• Prioritize these summaries from individual segments

• Combine these summaries to form a summary of the overall document
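
The four steps above can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions, not the authors' system: segments are taken to be blank-line-delimited blocks of plain text, a segment summary is its first sentence cut to a few words, and priority is simply word count. All function names are hypothetical.

```python
import re


def split_segments(text):
    """Separate the content into smaller segments (assumed: blank-line blocks)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def summarize(segment, max_words=8):
    """Segment summary (assumed rule): first sentence, truncated to max_words."""
    first = re.split(r"(?<=[.!?])\s+", segment, maxsplit=1)[0]
    words = first.split()
    return " ".join(words[:max_words]) + (" ..." if len(words) > max_words else "")


def priority(segment):
    """Assumed importance score: longer segments rank higher."""
    return len(segment.split())


def document_summary(text):
    """Summarize each segment and list the summaries in priority order."""
    ranked = sorted(split_segments(text), key=priority, reverse=True)
    return [summarize(s) for s in ranked]


page = (
    "Home | About | Contact\n\n"
    "Floods hit the coast on Monday. Thousands were evacuated from low-lying towns.\n\n"
    "Copyright 2002"
)
for line in document_summary(page):
    print(line)
```

With this scoring the story segment rises above the navigation bar and the copyright notice, which is the reordering effect the approach aims for; a real system would need a far better importance estimate than word count.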

Steps to Content Extraction

• Structural Analysis: Understanding how the various segments relate to the document

• Decomposition: Breaking these segments down into operational units

• Contextual Analysis: Using context to revise the segmentation

(Continued=>)

Steps to Content Extraction (Continued)

• Labeling => Segment Summary: Extraction of a low level summary of the segment

• Priority: Estimating importance of these segments

• Table of Contents (TOC) => Document Summary: Putting together a summary of the document
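
One way to picture these steps on real HTML is the sketch below. It is a simplified illustration, not the DART implementation: the `BLOCK_TAGS` set, the rule that merges very short segments (such as headings) into the segment that follows, the word-count priority and the four-word labels are all assumptions, and every name is hypothetical.

```python
from html.parser import HTMLParser

# Assumed segment boundaries: any of these tags starts or ends a segment.
BLOCK_TAGS = {"div", "p", "table", "td", "ul", "form", "h1", "h2", "h3"}


class Segmenter(HTMLParser):
    """Structural analysis and decomposition: cut text at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buf = []

    def flush(self):
        text = " ".join(" ".join(self._buf).split())  # normalize whitespace
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def decompose(html):
    s = Segmenter()
    s.feed(html)
    s.close()
    s.flush()  # emit any trailing text after the last tag
    return s.segments


def merge_short(segments, min_words=3):
    """Contextual revision (assumed rule): prepend a short segment to the next one."""
    merged, pending = [], ""
    for seg in segments:
        seg = (pending + " " + seg).strip()
        if len(seg.split()) < min_words:
            pending = seg
        else:
            merged.append(seg)
            pending = ""
    if pending:
        merged.append(pending)
    return merged


def table_of_contents(segments, label_words=4):
    """Label each segment with its first words; order by assumed priority (length)."""
    ranked = sorted(segments, key=lambda s: len(s.split()), reverse=True)
    return [" ".join(s.split()[:label_words]) for s in ranked]


html = ("<h1>Weather</h1><p>Rain is expected across the region today.</p>"
        "<h2>Markets</h2><p>Stocks closed higher after a volatile week of trading.</p>")
print(table_of_contents(merge_short(decompose(html))))
```

The resulting list plays the role of the TOC: each entry is a short label a small-screen user could select to request the full segment.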

Content Extraction

• Proximity Analysis: Relational analysis of content between segments

• Content Classification: Classification into various types, e.g. [stories], [navigation], [links], [images], [forms], etc.

• Relationship Analysis
  – Contextual grammar (Natural Language)
  – Knowledge modes
  – Information retrieval techniques
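
As a rough illustration of content classification, the sketch below assigns one of the five segment types from simple tag and word counts. The thresholds and rules are assumptions made for this example, not the classifier described in the talk, and `SegmentStats` and `classify` are hypothetical names.

```python
from html.parser import HTMLParser


class SegmentStats(HTMLParser):
    """Collect simple counts from one segment's HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = 0        # all words in the segment
        self.link_words = 0   # words that appear inside <a> ... </a>
        self.links = 0
        self.images = 0
        self.form_fields = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
            self._in_link += 1
        elif tag == "img":
            self.images += 1
        elif tag in ("form", "input", "select", "textarea"):
            self.form_fields += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.words += n
        if self._in_link:
            self.link_words += n


def classify(fragment):
    """Return 'forms', 'images', 'navigation', 'links' or 'stories' (assumed rules)."""
    s = SegmentStats()
    s.feed(fragment)
    s.close()
    if s.form_fields:
        return "forms"
    if s.images and s.words < 5:
        return "images"
    if s.words and s.link_words / s.words >= 0.6:
        # Mostly link text: short anchors look like a menu, long ones like headlines.
        return "navigation" if s.link_words / s.links <= 2 else "links"
    return "stories"


print(classify('<a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a>'))
print(classify('<p>Floods hit the coast on Monday and thousands were evacuated. '
               '<a href="/more">Read more</a></p>'))
```

Counting heuristics like these only cover the classification step; the proximity and relationship analysis listed above would need information from neighbouring segments as well.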

Content Extraction: Why do we need it?

• Viewing any website: Any solution to web browsing has to be universal

• High network access: Any transformation has to be fast and on-the-fly

• Network Usage: Network traffic should increase because of these systems

(Continued=>)

Content Extraction: Why do we need it (continued)?

• Easy Configurability: Any such system should be easily configurable

• Rapid Deployment: Should be rapidly deployable

• Non-intrusive Design: Should be possible to transform web sites without modifying the actual web site

• Multiple Views: System Integrators should be able to create multiple views of the same site

Advantages of Content Extraction

• Display size

• Locating information

• Important content can be on top

• Multiple levels of abstraction can be created

• The browsing can use a demand-driven model

• Faster download

• More efficient use of small display areas

• Mapping of the importance of content from the original document

Supported Devices and Formats

• PDAs (HTML 3.2)

• Cell phones
  – USA/Europe:
    • WAP
  – Japan:
    • iMode (NTT DoCoMo)
    • J-Sky (J-Phone)
    • EZWeb (KDDI)

Conclusion

• Content from web documents can be extracted based on the
  – HTML structure
  – Proximity analysis
  – Logical relationship analysis
  – Information retrieval techniques

• Content can be used effectively to summarize web documents
  – Better option compared to handcrafting or transcoding
  – Produces faster browsing experience