Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents?

Hassan Alam and Fuad Rahman

Human Computer Interaction Group
BCL Technologies Inc., Santa Clara, CA 95050
www.bcltechnologies.com
[email protected]



Overview of the Talk

Content Flow in Web Pages

Structural Flow vs. Logical Flow

Language Independence

Independence for Semantics

Content Flow from Purely Geometrical Information

Conclusion and Future Work

Related Work

Handcrafting

Transcoding

Adaptive Re-authoring

Handcrafting typically involves a set of content experts crafting web pages by hand for device-specific output.

Transcoding replaces HTML tags with suitable device-specific tags, such as HDML, WML and others.

Research on web page re-authoring can either use natural language processing explicitly or rely on non-NLP techniques.
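
As a rough illustration of the transcoding idea, the sketch below rewrites a handful of HTML tags into WML-like equivalents and keeps only the text of everything else. The tag table `TAG_MAP`, the class `ToyTranscoder` and the function `transcode` are hypothetical names made up for this example; a real transcoder would need a much fuller mapping and proper validation of the output deck.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical HTML-to-WML tag table, for illustration only.
TAG_MAP = {"p": "p", "b": "b", "i": "i", "br": "br", "a": "a"}


class ToyTranscoder(HTMLParser):
    """Rewrite a small subset of HTML tags; drop other tags but keep their text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # unsupported tag: only its text content survives
        if target == "br":
            self.out.append("<br/>")
        elif target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append(f'<a href="{escape(href)}">')
        else:
            self.out.append(f"<{target}>")

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target is not None and target != "br":
            self.out.append(f"</{target}>")

    def handle_data(self, data):
        # Re-escape text, since the parser decodes character references.
        self.out.append(escape(data, quote=False))


def transcode(html_text: str) -> str:
    """Wrap the rewritten fragment in a single WML-like card."""
    parser = ToyTranscoder()
    parser.feed(html_text)
    parser.close()
    return "<wml><card>" + "".join(parser.out) + "</card></wml>"
```

Here the unsupported `font` tag in `<p>Hi <font color="red"><b>there</b></font><br></p>` is dropped while its text and the nested `b` tag are kept.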

The HTML Table-based Structure

Rows are only used to arrange content

How is the Table Structure Exploited?

Most HTML sources use tables as the principal organizational method.

We assume that a geometric parser will give us the exact position of each table and sub-table.

Content is in the Columns.

We assume that content flow is language independent. Or is it?

Calculate Inclusion Criterion

How is the Table Structure Exploited?

Calculate xPreference list

Calculate yPreference list

Perform Proximity analysis: Know thy neighbors!

Quantify each table: Calculate area

Calculate table hierarchy based on Inclusion criterion and proximity analysis
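
The steps on this slide can be sketched as follows, assuming the geometric parser returns one axis-aligned rectangle per table. The slides do not define the lists or the criterion precisely, so everything here is an assumption: `x_preference` and `y_preference` simply order tables by left and top edge, `includes` treats inclusion as strict geometric containment, and `hierarchy` picks the smallest enclosing table as the parent. Proximity analysis is not covered in this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Box:
    """One table as a rectangle; origin assumed at the top-left of the page."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def x_preference(boxes: List[Box]) -> List[str]:
    """Assumed reading of the xPreference list: left-to-right, then top-to-bottom."""
    return [b.name for b in sorted(boxes, key=lambda b: (b.x0, b.y0))]


def y_preference(boxes: List[Box]) -> List[str]:
    """Assumed reading of the yPreference list: top-to-bottom, then left-to-right."""
    return [b.name for b in sorted(boxes, key=lambda b: (b.y0, b.x0))]


def includes(outer: Box, inner: Box) -> bool:
    """Assumed inclusion criterion: inner lies inside outer and is strictly smaller."""
    return (outer.area > inner.area
            and outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def hierarchy(boxes: List[Box]) -> Dict[str, Optional[str]]:
    """Map each table to its smallest enclosing table (None for top level)."""
    parents: Dict[str, Optional[str]] = {}
    for b in boxes:
        enclosing = [o for o in boxes if includes(o, b)]
        parents[b.name] = min(enclosing, key=lambda o: o.area).name if enclosing else None
    return parents
```

Using the smallest enclosing rectangle as the parent keeps the hierarchy a tree even when a table sits several levels deep.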

Continued …

Same Inclusion Criterion

How is the Table Structure Exploited?

Calculate TOC

Calculate Level of TOC

Calculate Merging Criterion

Lowest first

Sharing identical sides

Not if a border exists
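
One possible reading of the merging criterion is sketched below. It assumes that two cells may merge only when they sit at the same hierarchy level, share one complete side, and neither has a visible border; "lowest first" is interpreted here as processing the deepest level first. The names `Cell`, `can_merge` and `merge_all` are hypothetical, and the slides do not confirm this interpretation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    x0: float
    y0: float
    x1: float
    y1: float
    border: bool = False  # True if the cell is drawn with a visible border
    level: int = 0        # depth in the table hierarchy (assumed: larger is lower)


def shares_identical_side(a: Cell, b: Cell) -> bool:
    """True when the two rectangles touch along one complete, equal-length side."""
    same_rows = a.y0 == b.y0 and a.y1 == b.y1
    same_cols = a.x0 == b.x0 and a.x1 == b.x1
    side_by_side = same_rows and (a.x1 == b.x0 or b.x1 == a.x0)
    stacked = same_cols and (a.y1 == b.y0 or b.y1 == a.y0)
    return side_by_side or stacked


def can_merge(a: Cell, b: Cell) -> bool:
    return (a.level == b.level
            and not (a.border or b.border)
            and shares_identical_side(a, b))


def merge(a: Cell, b: Cell) -> Cell:
    """Bounding rectangle of the two cells, kept at the same level."""
    return Cell(min(a.x0, b.x0), min(a.y0, b.y0),
                max(a.x1, b.x1), max(a.y1, b.y1), False, a.level)


def merge_all(cells: List[Cell]) -> List[Cell]:
    """Greedily merge cells, deepest level first, until no pair qualifies."""
    cells = sorted(cells, key=lambda c: -c.level)
    merged = True
    while merged:
        merged = False
        for i in range(len(cells)):
            for j in range(i + 1, len(cells)):
                if can_merge(cells[i], cells[j]):
                    cells[i] = merge(cells[i], cells[j])
                    del cells[j]
                    merged = True
                    break
            if merged:
                break
    return cells
```

Because only whole, identical sides qualify, every merge result is itself a rectangle and can take part in further merges.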

The HTML Table-based Structure

Map of Table Layout

What is the Advantage of this Analysis?

Relative importance of content can be assessed, resulting in better re-authoring.

It becomes possible to capture the contextual relationships among the components of a document, such as which is a sidebar, which is an advertisement, and which is a top bar.

If needed, it is possible to use other natural language techniques to correlate tables by using semantics or other criteria.

Current Work

XML is being successfully used in many applications to mark up important information according to application-specific vocabularies.

Two W3C Recommendations, XSLT (Extensible Stylesheet Language Transformations) and XPath (the XML Path Language), meet the need to transform information marked up in one vocabulary into another form.

This is an exploratory paper offering a specific pathway to the future of web page re-authoring, provided accurate layout information is available.

It is probably better to use the XSLT language, which itself uses XPath, to specify how an implementation of an XSLT processor is to create a desired output from a given marked-up input.
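
To make the idea of selecting content from a marked-up layout concrete, here is a minimal sketch in Python. It does not run XSLT; it uses the limited XPath subset built into the standard library's `xml.etree.ElementTree` as a stand-in. The `page`/`block` vocabulary and its attributes are invented for this example and are not taken from the talk.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical layout vocabulary; element and attribute names are assumptions.
LAYOUT_XML = """
<page>
  <block class="topbar" x="0" y="0" w="800" h="60">Site banner</block>
  <block class="sidebar" x="0" y="60" w="160" h="540">Links</block>
  <block class="main" x="160" y="60" w="640" h="540">Main story text</block>
</page>
"""


def select_blocks(xml_text: str, cls: str) -> List[str]:
    """Return the text of every block whose class attribute equals cls."""
    root = ET.fromstring(xml_text)
    # ElementTree supports attribute predicates of the form [@name='value'].
    return [b.text for b in root.findall(f".//block[@class='{cls}']")]
```

An XSLT stylesheet would express the same selection as a template match and could also emit the device-specific output directly.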

Future Work

Exact location of each block, in rectangular coordinates, equivalent to rendition using a standard browser.

Size of each block of content.

Type of content, e.g. text, graphics etc.

Weight of content, in terms of size and placement within a page.

Continuity information, derived from physical association in terms of geometrical collocation.

Classification of content into a set of pre-defined classes, e.g. main story, sidebars, links and so on.

Linkage information from the XML representation, indicating the layers of information that can be hidden at a level of summary. This can represent the content in many levels, but more than two or three levels are unsuitable for easy navigation.
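
The per-block information listed above could be carried in a record such as the one sketched here. The field names, the `weight` formula (share of page area, scaled down the further the block starts from the top of the page) and the `visible_at` helper are all assumptions made for illustration; the talk does not specify how weight or summary levels would be computed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentBlock:
    x: float                      # left edge, as rendered by a standard browser
    y: float                      # top edge
    width: float
    height: float
    kind: str                     # type of content, e.g. "text" or "graphics"
    label: str = "unclassified"   # e.g. "main story", "sidebar", "links"
    neighbours: List[int] = field(default_factory=list)  # collocated block ids
    summary_level: int = 0        # 0 = always shown; higher = hidden deeper

    @property
    def size(self) -> float:
        return self.width * self.height


def weight(block: ContentBlock, page_width: float, page_height: float) -> float:
    """Hypothetical weight: area share times a top-of-page placement factor."""
    area_share = block.size / (page_width * page_height)
    placement = 1.0 - block.y / page_height
    return area_share * placement


def visible_at(blocks: List[ContentBlock], level: int) -> List[ContentBlock]:
    """Blocks shown when the summary is expanded down to the given level."""
    return [b for b in blocks if b.summary_level <= level]
```

Keeping `summary_level` small reflects the point that more than two or three levels are unsuitable for easy navigation.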

Conclusions

This paper offers a specific pathway to the future of web page re-authoring, provided accurate layout information is available.

This in no way represents a state-of-the-art discussion of the possible uses of layout information. Rather, it focuses on one small part within an array of possibilities.

It will be interesting to discuss other possibilities in this space during the DLIA workshop.