
Analysing and testing HTML5 parsers

A dissertation submitted to The University of Manchester for the degree of

Master of Science in the Faculty of Engineering and Physical Sciences

2015

Jose Carlos Anaya Bolaños

School of Computer Science


Contents

Contents ............................................................................................................................ 2

List of figures ..................................................................................................................... 4

List of tables ...................................................................................................................... 5

Abstract ............................................................................................................................. 6

Declaration ........................................................................................................................ 7

Intellectual property statement ........................................................................................ 8

Acknowledgements ........................................................................................................... 9

The author ....................................................................................................................... 10

1. Introduction ............................................................................................................ 11

1.1 Goal and objectives .......................................................................................... 12

2. Literature review ..................................................................................................... 14

2.1 HTML history .................................................................................................... 14

2.2 The HTML5 parsing process ............................................................................. 16

2.3 Testing HTML5 .................................................................................................. 20

2.4 HTML5 parsing implementations ..................................................................... 22

3. Project architecture ................................................................................................ 25

3.1 Overview ........................................................................................................... 25

3.2 Tasks distribution ............................................................................................. 27

3.3 Project evolution .............................................................................................. 29

4. Project implementation .......................................................................................... 31

4.1 The MScParser .................................................................................................. 31

4.1.1 Architecture .............................................................................................. 31

4.1.2 The custom HTML5 DOM .......................................................................... 33

4.1.3 Challenges ................................................................................................. 36

4.2 The specification tracer .................................................................................... 37


4.2.1 Architecture .............................................................................................. 37

4.2.2 Challenges ................................................................................................. 40

4.3 The harness for comparison ............................................................................. 40

4.3.1 The parser adaptors .................................................................................. 41

4.3.2 The script execution .................................................................................. 41

4.3.3 The comparison and report generation .................................................... 42

4.3.4 Challenges ................................................................................................. 47

4.4 The web application ......................................................................................... 49

4.4.1 Architecture .............................................................................................. 50

4.4.2 Parsing and tracing .................................................................................... 51

4.4.3 Comparing outputs ................................................................................... 53

4.4.4 Reviewing reports ..................................................................................... 55

4.4.5 Challenges ................................................................................................. 57

5. Analysis and Results ................................................................................................ 59

5.1 The html5lib test suite coverage ...................................................................... 59

5.2 The MScParser vs. the html5lib test suite ........................................................ 62

5.3 Comparing parsers with the html5lib test suite ............................................... 63

5.4 Tracing the web ................................................................................................ 65

6. Conclusions ............................................................................................................. 68

6.1 Reflection .......................................................................................................... 71

Bibliography .................................................................................................................... 73


List of figures

Figure 1 – Flow diagram of the HTML5 parsing process (adapted from [13]) ................ 17

Figure 2 – A cycle through the tokenizer to emit a token .............................................. 19

Figure 3 – A cycle through the tree constructor to process an empty string ................. 20

Figure 4 – Overview of the product architecture (adapted from [26]) .......................... 26

Figure 5 – Class diagram of the parser ............................................................................ 32

Figure 6 – Class diagram of the custom HTML5 DOM .................................................... 35

Figure 7 – Class diagram of the specification tracer. ...................................................... 38

Figure 8 – File example of tracerEvents.xml ................................................................... 39

Figure 9 – Comparator flow diagram .............................................................................. 43

Figure 10 – XML report sample ....................................................................................... 45

Figure 11 – Tracer input tab ............................................................................................ 51

Figure 12 – Tracer exclusion tabs .................................................................................... 52

Figure 13 – Tracer output tabs for the input string this is a <b>test .............................. 52

Figure 14 – Input form for the multi-parser comparator tool ........................................ 53

Figure 15 – Comparison details page .............................................................................. 54

Figure 16 - Comparison page displaying differences between outputs ......................... 54

Figure 17 – Format options tab ....................................................................................... 55

Figure 18 – Report class diagram .................................................................................... 56

Figure 19 – Report details page ...................................................................................... 57

Figure 20 – Html5lib tokenizer state tests results .......................................................... 62

Figure 21 – Html5lib tree construction tests results ...................................................... 62

Figure 22 – Insertion modes usage by websites ............................................................. 66

Figure 23 – Tokenizer states usage by websites ............................................................. 67


List of tables

Table 1 – Most popular HTML5 parsers in Github .......................................................... 24

Table 2 – Participation of the members in the project ................................................... 28

Table 3 – Example of inputs that are HTML5 valid but XML invalid ............................... 34

Table 4 – Example of diff encoding ................................................................................. 46

Table 5 – Code coverage of the tokenizer states by the html5lib test suite .................. 60

Table 6 – Code coverage of the insertion modes by the html5lib test suite .................. 61

Table 7 – Comparison of parsers vs. html5lib expected output ..................................... 64

Table 8 – Tracing details over websites .......................................................................... 65


Abstract

In the early days of the Web, websites contained only interlinked plain text and images. Over time,
websites turned into complex web applications offering diverse services such as

multimedia streaming, social networking, gaming, etc. HTML parsers have been

historically flexible and permissive with user inputs. Each parser had to define its

own way to parse and fix errors but, due to the increasing complexity of inputs,

disagreements and inconsistencies of outputs among different applications have been

rising. Those differences might cause missing or misplaced content or even uneven

behaviours because other technologies, such as AJAX, Javascript and CSS, rely on the

HTML content.

HTML5 is the latest version of the HTML standard and the specification includes, for

the first time, an algorithm for parsing and handling errors. The specification aims to

finally achieve full consistency and interoperability between parsing implementations.

However, the new parsing algorithm brought challenges for testing HTML parsers.

This dissertation presents a set of tools for analysing and comparing HTML5 parsers.

The tool set includes a specification compliant parser, a tracer of the specification

sections used when parsing and a harness to parse and compare outputs from

different parsers. In addition to the tool set, an analysis of a test suite (html5lib) is

included and discussed.


Declaration

No portion of the work referred to in this dissertation has been submitted in support

of an application for another degree or qualification of this or any other university or

other institute of learning.


Intellectual property statement

i. The author of this dissertation (including any appendices and/or schedules

to this dissertation) owns certain copyright or related rights in it (the

“Copyright”) and s/he has given The University of Manchester certain rights

to use such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard

or electronic copy, may be made only in accordance with the Copyright,

Designs and Patents Act 1988 (as amended) and regulations issued under it

or, where appropriate, in accordance with licensing agreements which the

University has entered into. This page must form part of any such copies

made.

iii. The ownership of certain Copyright, patents, designs, trademarks and other

intellectual property (the “Intellectual Property”) and any reproductions of

copyright works in the dissertation, for example graphs and tables

(“Reproductions”), which may be described in this dissertation, may not be

owned by the author and may be owned by third parties. Such Intellectual

Property and Reproductions cannot and must not be made available for use

without the prior written permission of the owner(s) of the relevant

Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication

and commercialisation of this dissertation, the Copyright and any

Intellectual Property and/or Reproductions described in it may take place is

available in the University IP Policy (see

http://documents.manchester.ac.uk/display.aspx?DocID=487), in any

relevant Dissertation restriction declarations deposited in the University

Library, The University Library’s regulations (see

http://www.manchester.ac.uk/library/aboutus/regulations) and in The

University’s Guidance for the Presentation of Dissertations.


Acknowledgements

Foremost, I would like to express my sincere gratitude to my supervisor Dr Bijan Parsia

for all his support and guidance in the research and developing of my MSc project. His

patience and continuous motivation were key for completing this project and

dissertation.

I thank my teammates, Jose Armando and Xiao, for all the debates and discussions that

contributed to the development and improvement of our project.

My sincere thanks goes to the professors of the School of Computer Science of the

University of Manchester whose passion and love for computer science inspired me to

never give up on learning and researching.

I am deeply thankful to the CONACyT for the sponsorship of my MSc course and to Dr

Alonso and her LAPP team for helping and guiding me on the admission process and

other formalities.

Last but not least, I would like to thank my family and friends in México who were

in constant communication with me despite the long distance that separates us.


The author

I graduated as a telematics engineer at the National Polytechnic Institute in Mexico

City in 2010. Currently I am pursuing the degree of MSc in Advanced Computer Science

with specialization in Advanced Web Technologies.

I have experience and knowledge of some web technologies and applications since I

have been working as a web system developer for almost three years.

During the MSc, I had a course called Semi-structured Data and the Web. It was mainly

related to XML and technologies/applications for creating, manipulating and querying

XML documents. I found the course quite interesting and I enjoyed it because of its

closeness to web system development.

One of the reasons I chose this project is that it is highly related to the

aforementioned course. Another reason was because the project was suitable for

applying an agile-based methodology. I had never worked following agile techniques

and this was a great opportunity to gain some experience.

Finally, the main reason I chose this project was because it promised a lot of

programming. I love programming. I made a great decision; the project was full of

programming.


1. Introduction

The World Wide Web Consortium (W3C) is an international organization that defines

standards regarding web technologies. The mission of the W3C is to “develop

protocols and guidelines that ensure the growth of the Web” [1]. Since its foundation,

in October 1994, several standards have been promoted for creating, interpreting,

rendering and displaying web pages.

The Hypertext Markup Language (HTML) is the most widely used language for web pages. It
was born between 1989 and 1990, taking as a base the Standard Generalized Mark-up

Language (SGML) [2]. The W3C has been promoting the use of HTML for achieving full

compatibility and agreement between different web vendors. Several HTML versions

have been created and in October 2014 the latest version, HTML5, reached the status

of Recommendation, i.e., the stage of highest maturity of a W3C standard [3].

Some prior versions of HTML, such as XHTML, are based on the Extensible Markup

Language (XML), thus the inputs can be easily parsed using an XML parser. A valid XML

document can be analysed and tested by using grammars; nevertheless, those

documents are restricted by a strict set of rules defined in a schema. When those rules

were not completely met, the parsers had to deal with erroneous inputs in order to

produce some output to the user.

With the aim to gain or maintain user acceptance, parsers increased their flexibility

and permissiveness. This situation caused the parsing and error-handling
processes to become more and more complex. Moreover, inconsistencies among

different parsers started to appear. Other web technologies such as Javascript, CSS and

AJAX rely on the DOM (a tree-based structure that represents a parsed HTML

document). Different DOMs for a single input might cause missing or misplaced

contents, uneven behaviours, etc.

The HTML5 specification includes several changes and improvements with respect to

its predecessors. One of those changes is that, for the first time, the parsing process is

defined as an algorithm and it includes error handling. This new parsing process is a

key feature of HTML5 because it ensures that every input stream of data has a well-defined
output (DOM). This certainty of the input-output relation is what drives the move
toward full consistency and interoperability of parsing implementations.

In order to be compliant with the HTML5 specification, an HTML5 parser may be

implemented with any technology or programming language as long as it guarantees

the same output as the parsing algorithm.

HTML5 brought new challenges for testing parsers. The new parsing algorithm is a

convoluted process that relies on finite state machines and a large set of complex data

structures. XML-based parsing and grammar-based testing cannot be used because

HTML5 has removed the XML restrictions and its grammar is not context-free [4]. In

[5], a testing method is proposed by using a process called reachability analysis.

However, the test approach is limited to a subset of the specification.

The use of test cases is the most common approach for testing. There are test suites, such
as the W3C and html5lib test suites, which contain test cases for specific sections of
the HTML5 specification. However, the use of test cases brings challenges such as
proneness to errors, specification coverage, complexity of the testing process and

uncertainty of expected outputs. Moreover, the Web evolves constantly and HTML5

evolves with it, thus constant maintenance of the test suites is required.

1.1 Goal and objectives

The goal of the project is to compare different HTML5 parsing implementations and

analyse the sources of agreement/disagreement.

The objectives are:

Develop a specification conformant HTML5 parser following the given pseudo

code.

Create a comparative tool for inspecting and evaluating outputs from different

parsers.

Manufacture an analysis tool for tracking parsing information and finding

sources of disagreement.

Perform a comparative review of some parsing implementations.

Analyse an HTML5 test suite.


The set of tools would help the analysis and comparison of different parser

implementations for finding useful information such as level of agreement, causes of

disagreement, specification coverage, percentage of use of HTML5 features, etc.

The html5lib [6] test suite was chosen for comparing parsers. The test suite is public,

well documented, constantly updated and it includes more than 8000 test cases for

the parsing process.

This dissertation is organized as follows: chapter 2 presents background research on the
HTML history, the HTML5 parsing process, current parsing implementations and
testing methodologies. Chapter 3 describes the project architecture and

the distribution of tasks among team members. The implementation of the project

tools is described in chapter 4. In the following chapter, results and analysis of some

parsers and the html5lib test suite are presented. Finally, the last chapter presents

conclusions and areas of opportunity for future work.


2. Literature review

This chapter presents a brief history of the HTML standard followed by a description of

the HTML5 parsing process. Next, some strategies for testing HTML5 are discussed.

Finally, information related to HTML5 parsers is presented.

2.1 HTML history

The Hypertext Markup Language (HTML) was born between 1989 and 1990 as an

application of the Standard Generalized Mark-up Language (SGML) [2]. The W3C was

born in 1994 with the aim of increasing the Web's potential through standards, and it rapidly
adopted HTML. In 1995 HTML was extended with new tags and a draft called HTML 3.0

appeared. In 1997 a stable HTML specification, named HTML 3.2, was approved by

Microsoft and Netscape (the major browser vendors from that time). In spring 1998

HTML 4.0 reached the status of W3C Recommendation.

HTML documents were validated against a DTD schema. A DTD schema describes the

structure of a document, the legal names for elements and attributes, etc. If a

document follows a schema's rules, it is said to be a valid document. When a document
is valid with respect to a DTD, it is guaranteed that the document can be parsed into a
unique Document Object Model (DOM). A DOM is an interface of a data structure,

represented as a tree, that allows applications to access and manipulate the structure,

content and style of documents. The W3C defined a specification for the DOM [7].

In 1996 the W3C presented the Extensible Markup Language (XML) specification (a

subset of the SGML). XML was designed to be generic, extensible and simple to use,

etc. [8]. The rules of a well-formed XML document are:

There is exactly one root element.

Tags are correct (i.e. between “<” and “>” characters) and properly nested.

Attributes are unique for each tag and attribute values are quoted.

No comments inside tags.

Special characters are escaped.

When a non-well-formed XML document is parsed, a fatal error is produced
(known as a draconian error), and consequently the document will not be parsed into a

DOM tree by the XML parser.
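
To make the contrast concrete, the following minimal Java sketch shows an XML parser rejecting a non-well-formed input with a fatal error; it uses the standard javax.xml.parsers API and is purely illustrative, not part of the project code.

    import java.io.StringReader;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.xml.sax.InputSource;
    import org.xml.sax.SAXParseException;

    public class DraconianErrorDemo {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Not well formed: the root element is never closed.
            String input = "<p>missing closing tag";
            try {
                builder.parse(new InputSource(new StringReader(input)));
            } catch (SAXParseException e) {
                // No DOM tree is produced; the parser aborts with a fatal error instead.
                System.out.println("Fatal error: " + e.getMessage());
            }
        }
    }

An HTML parser given the same input simply produces a DOM containing the p element.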


With the arrival of XML, XML Schema appeared as an alternative to DTD schemas.

Unlike DTD schemas, XML Schema included new features such as data types, element

enumerations, etc. Moreover, an XML Schema follows the XML syntax. “The W3C

believed the Web itself would eventually move to XML” [9] and, in January 2000, the

XHTML 1.0 spec was adopted as a W3C Recommendation. Version 1.1 became a
Recommendation in May 2001. XHTML is defined as an XML application (i.e. a

restricted subset of XML). The XHTML spec included three schemas (Strict, Transitional

and Frameset) in order to validate a document and guarantee the uniqueness of a

DOM tree.

With the schema validation and the rules for well-formedness, XHTML went against the
permissive and forgiving approach of HTML. A user would prefer an application that
produces some output despite a missing closing tag or an unquoted attribute value
over an application that fails or displays an error, as XHTML proposed.

Some of the W3C members were representatives of major browser vendors such as

Mozilla, Apple, Google, Opera, etc. According to them, web pages were turning into

something more “than text and images interconnected by links” [9]; they were

becoming web applications containing dynamic content and multimedia. To cope with

those new features, the W3C began to work on XHTML 2.0.

The first draft of the HTML5 spec (born as a proposal from Mozilla and Opera, called

Web Forms 2.0) was presented in 2004 to the W3C. The draft was voted and it was

rejected (8 in favour vs. 11 against). Despite the rejection, some members agreed to

continue working on the project and formed the Web Hypertext Application

Technology Working Group (WHATWG).

The W3C continued to work on XHTML 2.0. However, in 2007 they realised that the spec
proposed by the WHATWG had a promising future and asked the group to

work together. The idea of normalising the way to handle errors seemed more

plausible than forcing users to write valid, well-formed documents.

The drafts related to HTML were merged and renamed as HTML5. The first official

draft of HTML5 appeared in January 2008. Currently the W3C and the WHATWG

specifications are slightly different. The divergence began in 2012, when the W3C


introduced a group of editors to organize the draft and decide what should be included

in the HTML5 spec and what should be put into other specs. In the W3C

recommendation they claim that, “The W3C HTML working group actively pursues

convergence of the HTML specification with the WHATWG living standard” [3].

The WHATWG spec is a “living standard” named the HTML Standard [10]. Ian Hickson

had been (and continues to be) the sole editor of this spec [9]. That decision was

taken because web browsers are constantly experimenting with new behaviours and

features. According to Hickson, “The reality is that the browser vendors have the

ultimate veto on everything in the spec, since if they don’t implement it, the spec is

nothing but a work of fiction” [11].

In fact, the major web browsers (Opera, Google Chrome, Apple Safari, Mozilla Firefox

and Microsoft Internet Explorer) claim to be conformant with the WHATWG HTML

Standard and not the W3C HTML5 Recommendation. David Baron, a distinguished

engineer from Mozilla said “When the W3C’s and WHATWG’s HTML specifications

differ, we tend to follow the WHATWG one” [12].

Each web browser vendor defined its own way to parse and fix HTML when invalid or
problematic inputs were presented. Although "error handling is quite consistent in
browsers" [4], there were inconsistencies amongst them. In order to finally put an end to the
disagreements, the HTML5 spec includes a parsing algorithm and error handling.
Moreover, an HTML5 document is not an XML document; therefore it is not subject to the rules for
being a well-formed document. The algorithm uses finite state machines and ensures
that every input stream of data has a well-defined output (a few character encodings are
unsupported by the spec, so such data cannot be parsed).

2.2 The HTML5 parsing process

Figure 1 presents a simplified overview of the HTML5 parsing process.

The data input is a stream of octets. The flow of the parsing process begins with the

identification of the encoding of the input stream by using the encoding sniffing

algorithm. Typically the user agent explicitly defines the encoding. When no character

encoding is specified, the algorithm analyses the stream in order to try to determine
the encoding. The specification discourages the use of some character encodings and

suggests the use of UTF-8 as the default character encoding [3].

Figure 1 – Flow diagram of the HTML5 parsing process (adapted from [13])


The next stage is the pre-processing of the input stream. This stage manipulates some
characters and raises errors when control characters (Unicode characters with no visual
representation, used to control how text is displayed) are encountered. After the pre-

processing, the tokenizer consumes characters from the input data stream and

produces tokens. Those tokens are then consumed by the tree constructor. The tree

constructor creates and manipulates a DOM tree that will be the output of the parsing

process.

The tokenizer state machine is composed of 69 different states and the transitions are

mostly triggered by the data input. The execution of scripts may insert new characters

into the input stream. The tree constructor phase is defined by 23 states and the

transitions are triggered by the tokens produced by the tokenizer. The DOM is created

and manipulated by the tree constructor by using some algorithms and several data

structures and flags. Additionally, the tree constructor may also change the current

state of the tokenizer.

The tokenizer

There are six different types of tokens: character, comment, DOCTYPE, end of file, end

tag and start tag. A cycle through the tokenizer will consume one or more characters

and it will end by emitting one or more tokens. Most of the tokenizer states (62 out of

69) will consume and process one character from the input stream. Depending on the

character value, it might be ignored, produce or emit a token (or several), cause a state

transition and/or be reconsumed.

The default state of the tokenizer is the Data state (i.e. when a token is emitted, the

tokenizer will return to this state). Nevertheless, under some circumstances, the tree

construction stage may change the default state. Figure 2 presents a worked example

of a cycle for emitting a start tag token with one attribute.

Input

Tokenizer steps

1) Data state consumes a "<" character. Switches to tag open state.
2) Tag open state consumes an "a" character. Creates a start tag token with value equal to "a". Switches to tag name state.
3) Tag name state consumes a space. Switches to before attribute name state.
4) Before attribute name state consumes an "h" character. Creates an attribute for the token with name equal to "h". Switches to attribute name state.
5) Attribute name state consumes an "r" character. Appends the character to the current attribute name. Keeps consuming characters and, when the "=" character is consumed, switches to before attribute value state.
6) The remaining characters are consumed and appended to the current attribute value. When the ">" character is consumed, the current start tag token is emitted and the tokenizer switches back to the data state.

Figure 2 – A cycle through the tokenizer to emit a token

Recalling the previous example, if the input was the same except for the first character,

the transition to tag open state would never have happened and a character token would

have been emitted for each character.

The other states will attempt to consume several characters to identify character

references, comments, a DOCTYPE declaration or CDATA sections. It is an attempt

because the characters are consumed only if they truly represent one of the previously

mentioned values. For example, a transition to the markup declaration open state is

made. This state will attempt to consume characters matching DOCTYPE or [CDATA[. If

there is a match, the characters are consumed, and then a transition is made. If there

is no match, a transition is made without consuming the characters.
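
As a rough illustration of how a single tokenizer state can be modelled, the following Java sketch implements a simplified tag open state; the class and method names are hypothetical and the real state has many more branches (markup declaration, "!", "?", end-of-file, etc.).

    // Simplified, hypothetical sketch of the tag open state. Names are illustrative only.
    enum TokState { DATA, TAG_OPEN, END_TAG_OPEN, TAG_NAME }

    class MiniTokenizer {
        TokState state = TokState.DATA;
        StringBuilder tagName = new StringBuilder();

        // Called when the current state is TAG_OPEN and character c is consumed.
        void tagOpen(char c) {
            if (Character.isLetter(c)) {
                tagName.setLength(0);
                tagName.append(Character.toLowerCase(c)); // start a new start tag token
                state = TokState.TAG_NAME;
            } else if (c == '/') {
                state = TokState.END_TAG_OPEN;            // an end tag follows
            } else {
                // Parse error: treat the "<" as text and reprocess c in the data state.
                reportParseError("'<' not followed by a tag name");
                emitCharacter('<');
                state = TokState.DATA;
                reconsume(c);
            }
        }

        void reportParseError(String msg) { /* log the error */ }
        void emitCharacter(char c)        { /* hand a character token to the tree constructor */ }
        void reconsume(char c)            { /* process c again in the current (new) state */ }
    }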

The tree construction

When the tokenizer completes a cycle, one or more tokens have been generated and the
tree construction machine will process the token(s). The DOM tree is manipulated in
this stage. A pointer to the current node is used (initially null). The character tokens
will create text nodes; comment tokens will create comment nodes; start tag tokens
will produce element nodes; end tag tokens will be used for closing element nodes (i.e.
the pointer to the current node is updated to point to the parent node). The machine
has 23 states called insertion modes. The first state is the initial insertion mode. Figure
3 presents the cycle through the tree construction to process an end of file token (i.e.

an empty string input).

Input

“” (empty string)

Tree construction steps

1) Initial insertion mode switches to before HTML insertion mode and reprocesses the current token.
2) Before HTML insertion mode creates an html element and appends it to the document object (DOM tree). It pushes the html element into the stack of open elements, switches to before head insertion mode and reprocesses the current token.
3) Before head insertion mode creates a head element and appends it to the DOM tree. It pushes the head element into the stack of open elements, switches to in head insertion mode and reprocesses the current token.
4) In head insertion mode pops the head element from the stack of open elements, switches to after head insertion mode and reprocesses the current token.
5) After head insertion mode creates a body element and appends it to the DOM tree. It pushes the body element into the stack of open elements, switches to in body insertion mode and reprocesses the current token.
6) In body insertion mode performs some validations and finally stops parsing.

Figure 3 – A cycle through the tree constructor to process an empty string

The worked example depicts the simplest flow of the tree construction stage. It

produces the minimal DOM tree, i.e. a DOM tree that contains only an html element

(as root node) and a head and body element (as children elements of the html node).

The tree construction stage is very complex and it uses several data structures (stacks

and lists), flags, pointers and persistent status (the current insertion mode).

Additionally, it includes some other smaller algorithms that are used across insertion

modes.
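
A hypothetical sketch of the dispatch loop at the heart of this stage is shown below: each token is handed to the current insertion mode, which may build DOM nodes, switch the mode and request that the same token be reprocessed. All names are illustrative and only the first transitions of the empty-string example are hinted at.

    // Illustrative sketch of the tree-construction dispatch; not the actual MScParser classes.
    class Token { /* character, comment, DOCTYPE, start tag, end tag or end-of-file */ }

    enum Mode { INITIAL, BEFORE_HTML, BEFORE_HEAD, IN_HEAD, AFTER_HEAD, IN_BODY /* ... 23 in total */ }

    class MiniTreeConstructor {
        Mode mode = Mode.INITIAL;

        void process(Token token) {
            // An insertion mode may ask for the same token to be reprocessed after switching mode.
            while (dispatch(token)) { /* keep reprocessing until the token is consumed */ }
        }

        boolean dispatch(Token token) {
            switch (mode) {
                case INITIAL:
                    mode = Mode.BEFORE_HTML;      // e.g. no DOCTYPE token seen
                    return true;                  // reprocess the current token
                case BEFORE_HTML:
                    // create the html element, append it to the document and
                    // push it on the stack of open elements...
                    mode = Mode.BEFORE_HEAD;
                    return true;
                // ... remaining insertion modes ...
                default:
                    return false;                 // token consumed
            }
        }
    }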

2.3 Testing HTML5

Testing HTML5 is a complex task. Some reasons are:

The specifications are updated constantly due to the HTML5 evolving nature.

The specifications are prone to errors, omissions, etc.


Parsers and test suites require constant maintenance to cope with the spec

changes.

HTML5 has removed the XML restrictions and its grammar is not context-free
[4], thus XML parsing and grammar-based testing cannot be used.

The number of potential inputs is infinite. It is very hard for a single
document (the specification) to contemplate every possible scenario or
combination of inputs.

The implementation of the parsing algorithm into code is a subjective process

that depends on the style and experience of the programmer(s).

The W3C has its own test suite and defines it as “The Web Platform Tests Project is a

W3C-coordinated attempt to build a cross-browser test suite for the Web-platform

stack” [14][15]. The project is hosted in Github and it comprises test cases for the

complete HTML5 spec (parsing, HTML element interfaces, encodings, fonts, images,

media, events, etc.). It is focused on testing browsers rather than standalone parsing

implementations. The WHATWG has a test suite as well [16]. It includes test cases

from developers and companies: IE, Opera, Mozilla, Ian Hickson, etc. Neither test suite
is complete and both are referred to as works in progress.

The html5lib test suite [6] is public, well documented and constantly updated. It

contains more than 8000 entries detailing the input, expected output, expected

number of errors, etc. It includes tests for parsing (i.e. tokenizer and tree construction

stages), encoding, sanitizing, serializing and validating HTML5. Those test cases are

generally trusted as reliable and conformant with the WHATWG spec. The html5lib

project was started by four developers and it has contributions (test cases) from

several users, including developers of WebKit and Mozilla.

In [5], two researchers from the University of Tsukuba (Japan) present another

approach for testing HTML5 parsers. By using a method called reachability analysis and

conditional push down systems, they generated a set of HTML documents that covers

a subset of the specification. Then, they used the obtained set of documents for

comparing the outputs of different parsers. The limitation of their work is that it
cannot generate tests for the entire specification because of the complex behaviour of
the formatting elements and the adoption agency algorithm.


HTML5TEST [17] is a web application that tests browser support of HTML5. It runs
several tests and assigns a score. The tests cover various sections of HTML5 such as

multimedia, parsing rules, device access, connectivity, performance, etc. According to

the authors, they test “the official HTML5 specification, specifications that are related

to HTML5 and some experimental new features that are extensions of HTML5”.

2.4 HTML5 parsing implementations

Rendering (or layout) engines are the main type of applications that require HTML

parsing. Web browsers use those engines not only to parse HTML but also to parse CSS,
execute scripts, and render and display content. Usually each vendor of major

browsers has its own implementation of layout engines. For example Google Chrome

and Opera browsers use Blink, Apple Safari browser uses WebKit, Mozilla Firefox uses

Gecko, Microsoft Internet Explorer uses MSHTML (also known as Trident), etc. [4].

The new implementation of Gecko, Gecko 2 [18], implements an HTML5 parser
compliant with the spec. The parsing process is executed in a separate thread from the
main UI thread to improve the responsiveness of the browser. It features speculative
parsing in order to parallelize the HTML parsing and the script execution (it is not always
possible to parallelize those tasks, and taking real advantage of speculative parsing
requires following some suggestions), improving the performance of the rendering process.

In [19] a new browser engine called Servo is presented. It is written in the Rust
programming language instead of C++, unlike the previously mentioned rendering engines.
It aims to take advantage of parallel hardware and to achieve better performance, power
usage and concurrency management than other rendering engines. The authors state

that “Servo must be at least as fast as other browsers at similar tasks to succeed, even

if it provides additional memory safety”. It is still under development but so far they

managed to make Servo faster than Gecko in the layout stage.

Apart from web browsers, there are other applications that use rendering engines such

as email managers, Integrated Development Environments (IDEs), e-book readers, VoIP

and videoconference applications, etc. For example, Microsoft Outlook and Microsoft
Visual Studio both use Trident: the former for rendering emails and the latter for its
web page designer [20].

There are other applications that might require only a standalone HTML parser, i.e.

those might not need a complex render engine as they will not execute scripts or

render/display the HTML content. Among those applications are HTML debuggers,

validators, reporters, web crawlers, text-mining tools, sanitizers, pretty-printers, etc.

Live DOM Viewer [21] is an HTML5 parser developed by Ian Hickson, written in

Javascript. It can be accessed online and it displays the HTML output, the rendered

view and a representation of the DOM generated.

There are several standalone HTML5 parsers, each offering different features and

capabilities. Github claims to be the world’s largest code host. A search for HTML5

parser displays more than 130 repositories in more than 10 different programming

languages. According to the search result in Github, the top language used is Javascript,

followed by C and then PHP.

Backed by Google and “tested on over 2.5 billion pages from Google's index”, gumbo

parser [22] is the most popular and the third most forked HTML5 standalone parser

available in Github. It is written in C and it claims to be fully conformant with the

WHATWG spec. Moreover, it passes all the test cases from the html5lib test suite [6].

Another well positioned implementation is jsoup [23][24]. It is the most forked and the

third most popular HTML5 parser in Github. It is written in Java and, in addition to
HTML5 parsing, it features XML and CSS parsing, pretty printing and HTML cleaning. It

is conformant with the WHATWG spec. Table 1 presents the most popular (highest

number of stars) standalone HTML5 parsers in Github.

In [25] a standalone, parallel HTML5 parser is presented. According to the authors,

“HPar is the first pipelining and data-level parallel HTML parser”. Parallelization of the

parsing algorithm is hard because there are dependencies between the tree

construction and the tokenizer. Under some circumstances, a few insertion modes can

modify the next tokenizer state. Additionally there are some elements that can be self-

closing (for example the br element); in order to raise errors when a non-self-closing

element is self-closed, the tokenizer has to wait for feedback from the tree construction.

Initially, the HPar parser divides the input into chunks and each chunk is processed in
parallel, generating tokens and storing them in a buffer. The parsing process is
speculative and similar to a transaction: a snapshot stores the state of the
tokenizer at a given time and a hazard-detection flag is used; when the flag is true
(i.e. the tokenizer state was changed by the tree constructor), a rollback has to be
made (i.e. some tokens are discarded and new ones created).

To validate their parallel parser, the authors analysed over 1000 websites to find how

often the tree construction stage modified the tokenizer state and they found that it

was less than 0.01%. That means that the probability of a rollback is less than 0.01%.

To test their implementation, they compared it against jsoup (discussed
previously). HPar had a speed improvement of up to 2.4 times (1.73 on average) when

parsing some websites such as Facebook, YouTube, BBC, etc.

Name                      Stars   Forks (order)   Language      Spec
google/gumbo-parser       3251    399 (3)         C             WHATWG
jhy/jsoup                 1878    646 (1)         Java          WHATWG
inikulin/parse5           685     22 (10)         Javascript    WHATWG
aredridel/html5           479     73 (5)          Javascript    -
html5lib/html5lib-python  329     79 (4)          Python        WHATWG
masterminds/html5-php     269     39 (6)          PHP           WHATWG
FlorianRappl/AngleSharp   207     32 (7)          C#            W3C
servo/html5ever           167     31 (8)          Rust          WHATWG
tracy-e/OCGumbo           150     26 (9)          Objective-C   WHATWG

Table 1 – Most popular HTML5 parsers in Github

3. Project architecture

This chapter presents an overview of the product, the tasks distribution among team

members and a brief chronological description of how the project evolved.

3.1 Overview

Figure 4 displays a diagram of the architecture of the product. A brief description of

each module is presented below.

Input sources

For testing and comparing the parsers, two input sources were used: the Html5lib test

suite and web sites from the Common Crawl corpus. The html5lib test cases were

stored as binary files. The Common Crawl module is a sampler for obtaining random

web pages from Common Crawl.

Input pre-processing

The inputs to the test harness can be strings, URLs, binary files or WARC files.

Depending on the input type, the pre-processing process acts as follows:

When the input is a URL, it accesses it and stores the content (if available)

into a binary file. Then, the path to the saved file is sent to the script.

When the input is a WARC file, it extracts and saves the web pages to disk.

Then, the path to the directory is sent to the script.

When the input is a string or a file path, it is sent directly to the script.

Script execution

The script is used for executing the third party parsers and the MScParser. Each parser

is executed by using an adapter (implementing a defined interface) that accepts as

input either a string or a path to a file. The generated DOMs are serialized as formatted

strings and then saved temporarily as binary files.
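
A hedged sketch of what such an adapter interface could look like is given below; the names are assumptions for illustration and may not match the interface actually defined in the project.

    import java.nio.file.Path;

    // Hypothetical adapter contract: every wrapped parser exposes the same two entry points
    // and returns its DOM serialized as a formatted string.
    interface ParserAdapter {
        /** Parse an in-memory HTML string and return the serialized DOM. */
        String parseString(String html) throws Exception;

        /** Parse an HTML file on disk and return the serialized DOM. */
        String parseFile(Path htmlFile) throws Exception;

        /** Name used in reports, e.g. "jsoup" or "MScParser". */
        String name();
    }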


Figure 4 – Overview of the product architecture (adapted from [26])


Comparator

The comparator analyses the outputs generated by the parsers and generates an XML

report. The output with the most parser consensus is stored. Diff files of the other outputs

(if any) are generated and stored.

Web application

This application was developed as a graphical user interface for visualizing the reports

generated by the test harness and for analysing and comparing the outputs.

Additionally, it allows executing the spec tracer and the test harness for string and URL

inputs.

HTML5 Parser - MScParser

This parser is conformant with the W3C HTML5 Recommendation. It was designed as a

transliteration of the pseudo code presented in the specification. It will be referred to as
the MScParser.

Specification Tracer

The tracer is used for tracking the sections used and the parse errors generated during

parsing. Its goal is to produce a log of the parsing process that can be analysed for

finding useful information. An XML report containing the tracing details and

information about the output can be generated. The tracer runs over the MScParser.

The code of the project is hosted in Github [27]. The repository contains

documentation about the installation and usage of the modules.

3.2 Tasks distribution

This is a team project. Table 2 presents a summary of the activities in which each team

member participated. The table includes only the activities related to the product, i.e.,

the tasks that involved writing code. Other activities such as reading, researching or

project discussions are not contemplated.


                                             Estimated         % completed by team member
Module            Activity                   effort (days)     Carlos   Jose   Xiao
HTML5 Parser      Pre-processing              1                 100       0      0
                  Tokenizer                   15                 33      33     33
                  Tree constructor            15                 25      25     50
                  Algorithms                  10                 50      50      0
                  Test harness                 4                 25      75      0
                  Testing                     25                 40      40     20
                  DOM                          7                100       0      0
Adapters          Jsoup                        2                100       0      0
                  AngleSharp                   2                100       0      0
                  Html5lib                     2                  0      50     50
                  Parse5                       2                  0     100      0
                  Validator.nu                 2                  0     100      0
Comparator        Report generator             7                 25      75      0
                  Algorithm                    3                  0     100      0
                  Output processing            5                100       0      0
Web application   Comparator UI                5                100       0      0
                  Tracer UI                    5                100       0      0
                  Input form                   5                100       0      0
Common crawl      Design and implementation   30                  0     100      0
Tracer            Design and implementation   15                100       0      0

Table 2 – Participation of the members in the project

The estimated effort is measured in days of 8 hours each. Some tasks evolved

constantly, or were paused and then completed or updated some time after their creation,

thus the effort measure is estimated, i.e., the exact time spent on them was impossible

to track.


The percentage of task completion per member is also estimated because the code is

collective, i.e., several files were constantly manipulated (mainly while testing) by all

the team members.

The activities presented were developed from March to July 2015. The project had a

short pause during April and May due to the writing of the progress report and the

exams season. August and early September involved the writing of this document.

We considered including the number of lines of code written per person (LOC metrics).

However, the idea was discarded because there are files that we all manipulated, some

other files constantly evolved or were refactored, there is some pair-programmed

code, the adaptors are in different programming languages, there are several mapping

tables and xml files that could be arguably counted as code, etc. Additionally, LOC are

usually used as a predictor of development time rather than a measure of system

complexity (due to their dependence on language and style) [28].

3.3 Project evolution

This section presents a brief chronological description of the approach adopted for

working on the project and distributing tasks.

The team decided to work using some of the Agile Software Development principles. In

February we did two spikes for learning and understanding the HTML5 parsing process.

The parser development started in March, 2015 with the aim to finish it within 6 weeks.

We planned three sprints of two weeks each.

During the first sprint, we designed the general architecture of the parser and wrote

the tokenizer states. Each team member worked individually on 23 states (the

tokenizer stage consists of 69 states). The division was made in order to work on

related states.

The goal of the second sprint was to complete the tree construction stage. The

complexity of that module was higher than the tokenizer because it includes many

algorithms and data structures. Overall, there are dependencies between the insertion
modes and the algorithms. It was hard to find a way to divide the work minimizing code

dependencies in order to avoid conflicts. Each member worked on some insertion

modes and some algorithms.


The goal of the final sprint was to integrate and test the entire parser prototype.

The integration was almost completed during Sprint 2, thus we focused mainly on

testing and fixing errors.

At the end of May we started to work on the test harness and the adaptors of the

parsers. The goal was to have six third-party parsers working (each team member had

to work on two adaptors). Meanwhile, Jose Armando and I started to design the script

and comparison algorithm. Xiao committed to keep adding parsers and adaptors but

then he abandoned the team in mid-June.

By the end of June, Jose Armando started to focus on a module for getting random

samples of web pages (from the Common Crawl corpus) for testing.

On the other hand, I began to build the HTML5 DOM for passing the remaining failing

tests of the MScParser. Once the parser was completed, I focussed on the web

application and the spec tracer.

Despite our individual commitments, there were some activities in which Jose

Armando and I worked together:

Completing and fixing the parser for passing all the remaining tests. Around

70% of the failures and errors were fixed by me and 30% by Jose Armando.

The test harness changed constantly; we maintained constant communication

for improving and updating it.

Discussing errors or failures caused by the adaptors or the Comparer.


4. Project implementation

This chapter describes the implementation of the following modules:

The HTML5 parser (MScParser).

The specification tracer.

The harness for comparison of different HTML5 parsers.

The web application for tracing and comparing.

The Common Crawl sampler is fully covered by my teammate Jose Armando in his

dissertation “A testing strategy for HTML5 parsers” [29].

4.1 The MScParser

This section presents the architecture of the MScParser and the custom HTML5 DOM

implementation.

4.1.1 Architecture

The MScParser was written in Java. It was designed to be a transliteration of the

algorithm of the W3C recommendation in order to check the usability of the

specification.

JUnit was used for testing. The test suite of html5lib was chosen due to its simplicity and
quantity of test cases (more than 8000 for parsing). By using the html5lib test suite,

the testing method was dynamic white box. The test cases present an input and the

expected output of the whole tokenizer or the tree construction stages, thus they were

used as integration tests.
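
As an illustration of how such a test case can be exercised, the JUnit sketch below feeds one input to the parser under test and compares the serialized DOM against the expected tree; the html5lib-style expected output and the helper methods are assumptions for illustration, not the project's actual test code.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class TreeConstructionExampleTest {

        @Test
        public void parsesUnclosedBoldTag() {
            String input = "this is a <b>test";
            // Expected tree written in the html5lib serialization style (one node per line).
            String expected =
                  "| <html>\n"
                + "|   <head>\n"
                + "|   <body>\n"
                + "|     \"this is a \"\n"
                + "|     <b>\n"
                + "|       \"test\"\n";
            assertEquals(expected, serialize(parse(input)));
        }

        // Placeholders where the parser under test and its serializer would be plugged in.
        private Object parse(String html)    { return null; }
        private String serialize(Object dom) { return "";   }
    }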

In order to implement as clean a transliteration of the spec as possible, a class called
ParserContext was developed. That class acts as a container for all the variables, data
structures, DOM and states (both insertion modes and tokenizer states) required

while parsing. In Figure 5 the ParserContext class is presented along with the main

classes of the parsing process.


Figure 5 – Class diagram of the parser

The finite state machines for the tokenizer and the tree construction are handled by

using factories (applying the factory method pattern). This was the best approach we

found to translate the specification and handle 23 insertion modes and 69 tokenizer

states.
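
A hypothetical sketch of that factory approach is shown below: a factory caches one instance per state and hands it back by name, while the ParserContext carries everything the states need. The class names are illustrative and do not necessarily match the MScParser code.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only; the real MScParser classes may be organised differently.
    interface TokenizerState {
        void process(ParserContext context);   // consume character(s) from the context's stream
    }

    class ParserContext { /* input stream, DOM, stacks, flags, current tokenizer state and insertion mode */ }

    class DataState implements TokenizerState    { public void process(ParserContext c) { /* ... */ } }
    class TagOpenState implements TokenizerState { public void process(ParserContext c) { /* ... */ } }

    class TokenizerStateFactory {
        private static final Map<String, TokenizerState> CACHE = new HashMap<>();

        static TokenizerState getState(String name) {
            return CACHE.computeIfAbsent(name, TokenizerStateFactory::create);
        }

        private static TokenizerState create(String name) {
            switch (name) {
                case "DataState":    return new DataState();
                case "TagOpenState": return new TagOpenState();
                // ... one case per tokenizer state; insertion modes use an analogous factory ...
                default: throw new IllegalArgumentException("Unknown state: " + name);
            }
        }
    }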


The parser has the following limitations:

The sniffing algorithm was not implemented and the UTF-8 character encoding

was used by default. There are three reasons that led to that decision: first,

UTF-8 is the most widely used character encoding for websites (83.7% of

websites use it, according to w3techs.com [30]). Second, UTF-8 is the suggested

character encoding by the spec. Finally, the sniffing algorithm is a large and
complex procedure that does not guarantee 100% confidence in determining

the character encoding.

The execution of scripts was not implemented. That decision was taken for two

reasons: first, the script execution is not part of the parsing process (i.e. it is

part of the HTML5 spec but outside the parsing section). The second reason is

that a script execution engine is a complex and large system. Handling scripts
would have increased the complexity of the parser because scripts might

insert new data into the tokenizer. That new data could lead to a manipulation

of the DOM tree by inserting, removing or modifying elements; furthermore, it

could produce a change of character encoding (by inserting or modifying a

meta tag).

4.1.2 The custom HTML5 DOM

In order to create and manipulate an HTML5 DOM tree, the initial and naïve approach

was to use the org.w3c.dom [31] implementation which is part of the Java API and

provides interfaces for XML processing. It was used because of its simplicity and our

previous experience with it.

All the MScParser code was built around the usage of this implementation and it

worked well in early tests. When the parser was almost complete, a failing test
appeared. An attribute whose name contained a semicolon (;) provoked a fatal error (a

draconian error). That attribute name violates the well-formedness requirement of an

XML document.

Each Element node object has a property to associate an object (userData). That

property was used to solve the issue, i.e., the XML invalid attribute was stored there

and then retrieved when serializing.

That was a dirty trick because the invalid attributes were not part of the DOM tree
and additional processing was required to retrieve them. Nevertheless, the tests were
passed and the project continued. Later, more failing tests appeared. A draconian error
was generated when an element whose name contained a less-than sign (<) was encountered, e.g.
<ele<ment>. The HTML5 specification allows element names with special characters.

In another failing test, the input contained a space as the name of a DOCTYPE. This is

valid according to HTML5 but it causes conflicts with the XML DOM. Table 3

summarizes a list of inputs that led to errors and/or failing tests. At this point we

realized that an XML DOM was not suitable for an HTML5 DOM due to the restrictions
it imposes. A new DOM implementation was required.

Input              Details
<a ;=’value’>      Invalid attribute name.
<div<div>          Invalid element name.
<rdar:>            Invalid element name. rdar is considered to be a namespace and the
                   element name is considered to be empty. For example, an input
                   <rdar:a> is valid.
<!doctype >        Invalid doctype name (empty).

Table 3 – Example of inputs that are HTML5 valid but XML invalid

JDOM [32] and Dom4j [33] were considered and a few tests were made using them but

those are XML frameworks, thus they are not suitable for storing an HTML5 DOM. The
Jsoup parser contains its own DOM implementation (written in Java), thus it was a potential

solution for our problems.

After an analysis, the idea of using the Jsoup HTML5 DOM was discarded. It would

require a huge amount of effort to incorporate it into our code due to differences in

objects, property names, method implementations, etc. At this point we realized that

the only feasible solution was to implement a custom HTML5 DOM.

The custom DOM was designed following the structure and names of the org.w3c.dom

but removing the XML restrictions. This was done in order to minimize the impact on
the parser code. Figure 6 shows a class diagram of the DOM implementation, followed
by a description of each class and a short illustrative sketch. N.B. for readability, the
class methods (operations) are not displayed.

Figure 6 – Class diagram of the custom HTML5 DOM

Node – Abstract class. It defines generic properties and operations for nodes

such as getParentNode, appendChild, etc.

NodeType – Enumeration that defines the category of XML nodes. Seven types

of nodes are used for an HTML5 DOM: Attribute, CDATA Section, Comment,

Document, Document Type, Element and Text.

Attribute – Contains properties for name and value. Local name is available in

case the name contains a valid namespace.

CDATASection, Comment, Text – Nodes with a single property to store the

content.

Document – Class that represents an HTML5 document. It defines methods for

creating nodes, e.g. createComment, createElement, etc.

QuirksMode – Enumeration for the quirks mode status of a Document object.

DocumentType – Has properties for name, public ID and system ID.


Element – Represents an element (tag). Includes operations for creating,

getting and setting attributes.
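
The following condensed Java sketch illustrates the spirit of these classes (no name validation, direct serialization); it is an illustration only, and the actual implementation has more node types, properties and operations.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative subset of the custom DOM; pretty printing and several node types omitted.
    abstract class Node {
        protected Node parentNode;
        protected final List<Node> childNodes = new ArrayList<>();

        Node appendChild(Node child) {
            child.parentNode = this;
            childNodes.add(child);
            return child;
        }

        abstract String getOuterHtml();
    }

    class Element extends Node {
        final String tagName;                              // no XML name validation applied
        final Map<String, String> attributes = new LinkedHashMap<>();

        Element(String tagName) { this.tagName = tagName; }

        @Override
        String getOuterHtml() {
            StringBuilder sb = new StringBuilder("<").append(tagName);
            attributes.forEach((name, value) ->
                    sb.append(' ').append(name).append("=\"").append(value).append('"'));
            sb.append('>');
            for (Node child : childNodes) {
                sb.append(child.getOuterHtml());
            }
            return sb.append("</").append(tagName).append('>').toString();
        }
    }

    class Text extends Node {
        final String data;
        Text(String data) { this.data = data; }

        @Override
        String getOuterHtml() { return data; }
    }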

The org.w3c.dom API defines interfaces for nodes that are not used in HTML5, such as processing instructions or notations; therefore those implementations were ignored. Moreover, only the required operations and properties were included. For example, the org.w3c.dom Document node has methods such as getXmlVersion or normalizeDocument that are not used for HTML5 parsing.

The org.w3c.dom implementation does not have a direct way to serialize the DOM; a

transformer object is required. This custom HTML5 DOM was designed to serialize

directly by using the method getOuterHtml. A flag for pretty printing (indent and new

lines for each element) can be specified.
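To make this concrete, the following is a minimal sketch, assuming illustrative names, of how such an unrestricted Node/Element pair could look; it is not the actual MScParser code, it omits most node types and operations, and pretty printing is left out.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    abstract class Node {
        protected Node parentNode;
        protected final List<Node> childNodes = new ArrayList<>();

        Node getParentNode() { return parentNode; }

        Node appendChild(Node child) {
            child.parentNode = this;          // no XML well-formedness checks are performed
            childNodes.add(child);
            return child;
        }

        // Serializes the subtree directly, without requiring a transformer object.
        abstract String getOuterHtml(boolean prettyPrint);
    }

    class Element extends Node {
        private final String tagName;
        private final Map<String, String> attributes = new LinkedHashMap<>();

        Element(String tagName) { this.tagName = tagName; }                // "div<div" is accepted here
        void setAttribute(String name, String value) { attributes.put(name, value); } // ";" is a legal name here

        @Override
        String getOuterHtml(boolean prettyPrint) {                         // prettyPrint ignored in this sketch
            StringBuilder sb = new StringBuilder("<").append(tagName);
            attributes.forEach((k, v) -> sb.append(' ').append(k).append("=\"").append(v).append('"'));
            sb.append('>');
            for (Node child : childNodes) sb.append(child.getOuterHtml(prettyPrint));
            return sb.append("</").append(tagName).append('>').toString();
        }
    }

In the real implementation the same idea extends to the other node types (Document, DocumentType, Comment, Text, CDATASection) and to the quirks mode property of the Document discussed below.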

Another set of failing tests was related to the quirks mode of the HTML5 document. Depending on certain conditions of the DOCTYPE definition in the input, the quirks mode may change. The quirks mode affects the behaviour of the table element. The Document node of the org.w3c.dom library has no quirks mode property, thus the userData property was used for storing it.

To avoid using the userData property, an enumeration with the possible values of the

quirks mode was included. Then, the document node was updated to include a

property to get and set the quirks mode status. With this modification the code is

cleaner and easier to understand.

The impact of using this custom DOM was minimal because the structure and the names of objects, properties and methods were mostly maintained. Besides a few refactors, the only change required was to update the references (imports) in the parser classes.

4.1.3 Challenges

The last tokenizer state (Tokenizing character references) behaves differently from the other states. That state can read characters from the stream beyond the current character in an attempt to find character references. If a reference is found, the characters are consumed. If no reference is found, the characters are not consumed and the original state consumes them.


When we realised the unusual behaviour of that state, it was too late to rebuild the architecture and the other tokenizer states, so it was hacked to fit into the defined architecture. There are five tokenizer states that can consume character references, so they had to be adapted as well. This was the only part of the spec that could not be transliterated completely.

4.2 The specification tracer

The goal of the tracer is to log the spec sections used and the parse errors generated while parsing. The log of events can be used for analysing differences between outputs, for calculating spec coverage or for extracting useful information such as the number of parse errors or the existence of certain elements in the output. The MScParser has an option for enabling tracing (disabled by default).

4.2.1 Architecture

Figure 7 shows a class diagram of the tracer implementation. A brief description of the

classes is presented below.

Tracer – Is used for logging events generated while parsing. It includes

operations for filtering events and generating an XML report.

TracerSummary – Contains information such as the number of errors, used

insertion modes, presence of certain type of elements, etc.

Event – Each event has an associated type, a description and the specification section where it occurred.

EventType – Enumeration that defines four event types: algorithm, insertion

mode, tokenizer state and parse error.

The MScParser generates a list of parse errors whether tracing is enabled or not. When tracing is enabled, the parse errors are registered as trace events, i.e., the location of the errors is logged. The HTML5 specification defines how to handle errors. However, there is no explicit definition of the types of errors. An enumeration called ParseErrorType was defined to categorise the errors according to their nature and location.


Figure 7 – Class diagram of the specification tracer.

An XML document (tracerEvents.xml) is used for defining the list of sections to be tracked. The root element is called events and every event node has attributes for section, description and type, as shown in Figure 8. The events with no type are treated as informative, i.e., they are not tracked. When the parsing process begins, if the value of the tracing flag is set to true, a new Tracer instance is created and the tracerEvents.xml file is loaded into a hash map.

In order to trace, the places where events are raised have to be specified directly in the code. An operation called addParseEvent, with a required argument (the section number) and an optional argument (event details), was defined. Whenever the addParseEvent operation is executed, it checks whether tracing is enabled before logging the event; a sketch of how such a hook might look is shown below.
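As an illustration only, the hook might look roughly like the sketch below; the real signature and the Tracer API are not reproduced in this document, so all names here are assumptions.

    // Hypothetical tracer interface used by the sketch.
    interface Tracer {
        void logEvent(String section, String description);
    }

    class TracingSupport {
        private final boolean tracingEnabled;   // tracing is disabled by default
        private final Tracer tracer;

        TracingSupport(boolean tracingEnabled, Tracer tracer) {
            this.tracingEnabled = tracingEnabled;
            this.tracer = tracer;
        }

        // Mandatory section number, optional event details; does nothing when tracing is off.
        void addParseEvent(String section, String... details) {
            if (!tracingEnabled || tracer == null) return;
            tracer.logEvent(section, details.length > 0 ? details[0] : "");
        }
    }

    // Example call sites:
    //   addParseEvent("8.2.5.4.7");                                      // entering the "in body" insertion mode
    //   addParseEvent("8.2.5.4.7", "start tag whose tag name is 'a'");   // a specific path, if finer granularity is wanted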


Figure 8 – File example of tracerEvents.xml

The task of identifying the location for raising events is relatively trivial because the

MScParser was developed as a transliteration of the spec algorithm. This characteristic

allows for a high level of tracking granularity. Currently, the tracer granularity is coarse

because all the algorithms, insertion modes and tokenizer states register only one

event (when they are used).

Nevertheless, it might be the case that a deeper level of tracking is required. Let us

consider the InBody insertion mode. It has more than 50 possible paths depending on

the token type and value. Furthermore, some of those paths have more divergences

depending on other variables or current status.

An event for every path could be raised by just adding a line of code, i.e., a call to the addParseEvent method. To differentiate the paths, two options are available. The first option is to use as arguments the InBody section number (8.2.5.4.7) and the path details. The other option is to treat every path as a spec subsection, i.e., to use a unique section number, e.g. 8.2.5.4.7_n, where n is the number or name of the path. For the latter option, the description and section number of every path have to be added as an event node in the tracerEvents.xml file. This option is cleaner with respect to the code because the paths are defined as spec subsections in the configuration file.


The TracerSummary is a POJO4. It is used only for tracking details such as the number of emitted tokens, the existence of HTML5, SVG or MathML elements, etc. The tracer already has generic methods for tracking emitted tokens and created elements. By using those operations, adding a new tracking detail only requires a few extra validations and the respective property in the TracerSummary class.

The tracer has the capability to generate an XML report. This functionality was added to complement the test harness and retrieve useful data about the nature of the tested input pages. The tracer class has a method called toXML for generating a report; it only requires the file path where the report should be saved.

4.2.2 Challenges

These are the challenges faced during the tracer development.

Tracer granularity

Strictly, this was not a challenge during the development. Nevertheless, achieving a

finer granularity would have required a substantial amount of time.

XML invalid characters

The XML 1.0 specification defines the ranges and values of valid Unicode code points. Nevertheless, that list is not consistent with the HTML5 specification. When generating an XML report, it might be the case that the tracer logged an event containing an XML-invalid character, provoking an error. To solve this problem, all XML-invalid characters are escaped before generating the report.
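A minimal sketch of this escaping step is shown below; the validity check follows the XML 1.0 character ranges, while the escape notation itself is illustrative and not necessarily the one used in the tracer.

    final class XmlEscaper {
        // Replaces code points outside the XML 1.0 valid ranges
        // (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) with a textual escape.
        static String escapeXmlInvalid(String text) {
            StringBuilder sb = new StringBuilder(text.length());
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                        || (cp >= 0x20 && cp <= 0xD7FF)
                        || (cp >= 0xE000 && cp <= 0xFFFD)
                        || (cp >= 0x10000 && cp <= 0x10FFFF);
                if (valid) {
                    sb.appendCodePoint(cp);
                } else {
                    sb.append("\\u").append(Integer.toHexString(cp)); // illustrative escape notation
                }
                i += Character.charCount(cp);
            }
            return sb.toString();
        }
    }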

4.3 The harness for comparison

The harness goal is to parse, compare and generate a report of the outputs produced by the different parsers. It was designed to handle HTML5 parsers regardless of their programming language and to be horizontally scalable, i.e., new parsers can be plugged in. The test harness consists of three main modules: the parser adaptors, the script execution and the comparison and report generation process.

4 POJO stands for Plain Old Java Object. It is a simple class that does not extend another class, has no special implementation and only contains getters and setters.


4.3.1 The parser adaptors

In order to compare the DOM generated by each parser, serialized outputs were

required. The standard serialized outputs from the different parsers varied slightly as

some use formatting or pretty printing options. A standardization of output formats

was required in order to perform a comparison.

The html5lib output format was used because of its simplicity. Moreover, some parser

implementations use the html5lib test suite and consequently already have functions

that serialize in that format.

The interface for the adaptors was defined to be as simple as possible. It receives only two parameters and outputs the serialized DOM formatted according to the html5lib format (a sketch of such an entry point is shown after the list). The input parameters are:

Type of input: file path or string (-f and -s, respectively).

Input value.
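The fragment below sketches what such an adaptor entry point could look like for a Java-based parser; only the argument handling is concrete, while the parse-and-serialize step is left as a placeholder because every wrapped parser exposes a different API.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ParserAdaptor {
        public static void main(String[] args) throws Exception {
            if (args.length != 2 || !(args[0].equals("-f") || args[0].equals("-s"))) {
                System.err.println("usage: adaptor -f <file path> | -s <html string>");
                System.exit(1);
            }
            // -f: read the page from disk using UTF-8; -s: take the HTML directly from the argument.
            String html = args[0].equals("-f")
                    ? new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8)
                    : args[1];

            // Placeholder: parse `html` with the wrapped parser and print the resulting DOM
            // serialized in the html5lib tree format on standard output (UTF-8).
        }
    }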

In addition to the MScParser, the project has adaptors for the following parsers:

AngleSharp – C#

Html5lib – Python

Jsoup – Java

Parse5 – JavaScript

ValidatorNU – Java

4.3.2 The script execution

The script (bash for Linux, batch for Windows) is executed through the system’s

console. It receives three arguments:

a path to a directory for saving the outputs

the input type (string, URL, file path or WARC file)

the input value

Depending on the input type, the script might do some pre-processing:

When the input is a string, no pre-processing is performed. The string input is fed directly to the parser adaptors.

When the input is a URL, the web site content is saved into a binary file. Then, the parsers are executed receiving a file path.

When the input is a WARC file, it extracts and saves the web pages to disk.

When the input is a path to a directory, the script loops through the files recursively and executes the parsing process for each file.

The script runs all the parsers (by using the adaptors) and saves each output in a binary

file (named after the parser). Finally, it executes the process of comparison and report

generation.

4.3.3 The comparison and report generation

A Java program called Comparator is used for reading the files generated by the parsers, comparing the outputs, and generating an XML report and diff files (if required). Figure 9 presents a flow diagram of the Comparator.

When the Comparator is executed, it requires as a mandatory argument the path where the parser outputs were saved. The process begins with a loop through the directories (each representing an input or test case), followed by a loop through the directory files. The content of each file is stored in an object along with the parser name.

After reading all the files, the list of objects is processed. The output that has the highest consensus among the parsers is referred to as the majority tree or majority output. Then, the list is sorted by consensus rate. If any outputs differ from the majority tree, a diff file is generated per differing output. Next, the test case is added to the report and the report totals are updated. The XML report is saved to a file and the process ends when there are no more directories to analyse.

The modules for grouping and sorting outputs and for diff file generation are discussed below, along with the process for adding new parsers.


Figure 9 – Comparator flow diagram

Grouping and sorting outputs

The Comparator analyses and groups the outputs by applying a slight variation of the Boyer-Moore majority vote algorithm [36]. The algorithm proposed by Boyer and Moore considers a candidate the majority only when it constitutes more than half of the set. However, in this implementation, the group with the highest consensus (excluding ties) is considered the majority even if it does not constitute half of the set (a sketch of this selection is given after the examples), e.g.:


Parsers 1 and 2 had output A, parser 3 had output B, parser 4 had output C and parser 5 had output D. Output A is considered the majority even though it only represents 40% of the set.

Parsers 1 and 2 had output A, parsers 3 and 4 had output B and parser 5 had output C. Outputs A and B each have 40% consensus, thus there is no majority.
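The following is a simplified sketch of that grouping and selection rule (not the actual Comparator code): identical outputs are grouped, and the largest group wins unless the two largest groups are tied.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    final class MajoritySelector {

        // Returns the majority output, or null when the largest groups are tied (no majority).
        static String findMajority(List<String> outputs) {
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String output : outputs) {
                counts.merge(output, 1, Integer::sum);   // group identical outputs and count them
            }
            String best = null;
            int bestCount = 0, secondCount = 0;
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    secondCount = bestCount;
                    bestCount = entry.getValue();
                    best = entry.getKey();
                } else if (entry.getValue() > secondCount) {
                    secondCount = entry.getValue();
                }
            }
            return bestCount == secondCount ? null : best;   // a tie at the top means no majority
        }
    }

    // For the examples above: [A, A, B, C, D] -> A (highest consensus, 40%);
    //                         [A, A, B, B, C] -> null (A and B tied, no majority).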

Considering the report presented in Figure 10, the report values are calculated as follows:

Test 1 – All the parsers produced the same output. No diffs are generated and all passed the test.

Test 2 – Parsers 1, 2 and 3 had the same output; they are considered the majority and passed the test. Parser 4 had a different output, thus it failed the test and a diff file is generated.

Test 3 – Parsers 1 and 3 had the same output; they are considered the majority because parsers 2 and 4 had different outputs. Parsers 2 and 4 failed the test and a diff file is generated for each of their outputs.

Test 4 – All the parsers failed the test because there are two outputs, each backed by the same number of parsers. The majority attribute of both outputs is set to false. N.B. the first output is named majority; this is for convenience of the system, i.e., although that output is not a majority, diff files for the other outputs are generated with respect to it.

Test 5 – All parsers had different outputs, thus all failed the test. The majority attribute of all the outputs is set to false. The first output is called majority for convenience of the system and three diff files are generated.

The generalData node tracks the number of tests. The equals attribute represents the number of tests in which all the parsers produced the same output. The testResult node tracks the test results of each parser; the passed attribute represents the number of tests in which the parser was part of the majority group.


Figure 10 – XML report sample

Diffs generation and encoding

When testing URL pages, we found pages with sizes from a few kilobytes up to a couple of megabytes. Considering m inputs of n kilobytes each and p parsers, the disk space required for saving all the outputs would be m x n x p kilobytes plus the size of the XML report. For a large set of inputs this could be a potential problem.

With the aim of reducing the disk space required, the test harness was designed to store only one complete output and diff files (if any). This decision was taken considering that the outputs should tend to converge (one of the goals of HTML5).

In the ideal scenario all parsers would produce the same output, thus only one output

file would be required to be stored. In the worst scenario all the outputs would be

different, therefore storing diff files would be equivalent to storing all the outputs.

This harness uses a library called google-diff-match-patch [34] for generating a list of

differences between the majority output and the non-majority outputs. The list of

differences is then processed for storing it in a file.


A compression method called delta-compression is presented in [35]. The delta-

compression is used for “storing or transmitting only the changes relative to a related

artefact instead of storing or transmitting the complete content”. In order to perform

the compression, an encoding format is detailed.

In this implementation, a simple encoding based on the delta-encoding is used. The

encoding format is as follows:

Diff type : char (‘+’ for insertion, ‘-‘ for deletion)

Index : integer (the index in the majority output where the difference starts)

Separator : char (‘,’ is the designated separator)

Diff length : integer (byte length of the difference)

Separator : char

Diff content : string

End of entry : char (‘;’ is the designated char to denote the end of the entry)

An example of two different strings and the diff encoding is presented in Table 4. Two

differences were found between the outputs. The first diff is a deletion of one

character (x) at index 42. The second diff is at index 59 and represents an insertion of

13 characters (a new line plus the string | "x").

Diffs encoded:

-42,1,x;+59,13,
 | "x";

Table 4 – Example of diff encoding (majority output, different output, formatted output and the encoded diffs)

In order to reconstruct an original output, the diff file has to be decoded to obtain a list of diffs, and then the majority output has to be updated by inserting or removing characters as defined in that list.
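A simplified sketch of this decoding and reconstruction is shown below. It assumes the diff length is a character count and that indexes refer to the original majority output; the actual implementation may handle these details (in particular byte lengths) differently.

    import java.util.ArrayList;
    import java.util.List;

    // One decoded difference entry: type ('+' insertion, '-' deletion), index and content.
    class Diff {
        final char type;
        final int index;
        final String content;
        Diff(char type, int index, String content) {
            this.type = type; this.index = index; this.content = content;
        }
    }

    final class DiffCodec {

        // Decodes a diff file content into a list of Diff entries.
        static List<Diff> decode(String encoded) {
            List<Diff> diffs = new ArrayList<>();
            int pos = 0;
            while (pos < encoded.length()) {
                char type = encoded.charAt(pos++);                               // '+' or '-'
                int comma1 = encoded.indexOf(',', pos);
                int index = Integer.parseInt(encoded.substring(pos, comma1));    // start index
                int comma2 = encoded.indexOf(',', comma1 + 1);
                int length = Integer.parseInt(encoded.substring(comma1 + 1, comma2)); // treated as a char count here
                String content = encoded.substring(comma2 + 1, comma2 + 1 + length);
                diffs.add(new Diff(type, index, content));
                pos = comma2 + 1 + length + 1;                                   // skip the content and the ';' terminator
            }
            return diffs;
        }

        // Rebuilds a parser output from the majority output and its decoded diffs.
        static String restore(String majority, List<Diff> diffs) {
            StringBuilder output = new StringBuilder(majority);
            int offset = 0; // assumes indexes refer to the original majority output
            for (Diff d : diffs) {
                if (d.type == '-') {
                    output.delete(d.index + offset, d.index + offset + d.content.length());
                    offset -= d.content.length();
                } else {
                    output.insert(d.index + offset, d.content);
                    offset += d.content.length();
                }
            }
            return output.toString();
        }
    }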


Adding a new parser

In order to provide horizontal scalability, i.e., adding new parsers without re-parsing the inputs, a function called restoreOutputsFromDiffs is included in the Comparator. As the name suggests, the method restores (temporarily) the outputs generated by the parsers. Then, the comparison and report processes run normally. The comparison process automatically deletes the repeated outputs and generates diff files (if required); hence no further operations are required.

When running the Comparator, a parameter -u (for update) should be included. This solution only requires that the output(s) of the new parser(s) be placed in the path where the current output files are stored.

4.3.4 Challenges

While developing the test harness, the following difficulties and challenges were faced:

Command line arguments

The original interface for the adaptors was designed to receive a single string parameter. The web page inputs were read from the script and the content was passed (as an argument) to the parsers. However, some web pages produced errors. The cause was that there is a maximum size for command line arguments; moreover, that limit varies depending on the operating system. To avoid potential errors, the interface was modified to include file inputs and each adaptor was updated to read files.

Dynamic website content

A few inconsistencies were detected when comparing the outputs from URLs. The

differences were in text nodes or attribute values. We realized that the differences

were caused because some websites have dynamic content that changes depending

on the region, language, date, time, etc. Moreover, there are websites that generate

tokens or session ids every time a page is requested.

In order to address this kind of problem, a program was developed to save the web page content in a temporary binary file. Then, the script acts as if a file argument was received, i.e., it feeds the same file to all the parsers. Finally, the temporary file is deleted.


Encodings

Issues with encodings were constantly faced while parsing and reading/writing files.

Every single adaptor was required to explicitly read and write files using UTF-8

encoding. When outputting the serialized DOM, explicit UTF-8 encoding was required

as well.

UTF-8 was also set for compiling and generating executable files inside the IDE, as the IDE defaults to the encoding given by the operating system settings.

Line breaks

Some differences were found due to differences in line break characters between operating systems (considering Linux and Windows). A function was included to avoid this kind of problem.
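A minimal sketch of that kind of normalisation (the actual function is not shown in this document):

    final class LineBreaks {
        // Converts Windows CRLF and lone CR to LF so line endings never appear as differences.
        static String normalise(String text) {
            return text.replace("\r\n", "\n").replace("\r", "\n");
        }
    }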

RCDATA sections and XML documents

To make the comparison, the first approach was to generate an XML document containing the outputs from all the parsers directly from the bash/batch script. Then, the Java program could read the XML and process it. Each output is a formatted HTML string, thus it could not be saved directly in a text node of an XML document because the HTML elements would be considered XML elements, leading to a malformed XML document. N.B. the outputs could be escaped and inserted as text nodes. However, this approach was discarded (the next paragraphs discuss the reasons).

In order to avoid a malformed XML document, the outputs were saved as RCDATA nodes. This worked well for several tests until we realised that some inputs were ignored in the report. After an analysis, we found the source of the problem.

If a web page contains an RCDATA node, it produces an invalid XML document: the RCDATA end tag inside the output was considered the end tag of the RCDATA node of the XML document, and the remaining content of the output was inserted after the RCDATA node, leading to a malformed XML file.

The first idea for solving this issue was to escape the RCDATA elements. Nevertheless, it was soon discarded because it would require modifying the adaptors; moreover, the front-end application would need to un-escape the RCDATA elements.


Although performance was not one of the priorities, the escaping/un-escaping process

would hurt the performance as find-replace operations would be required for each

output over potentially large files (web pages).

Finally, by not using an XML file to save the outputs, a problem that had not yet been faced was avoided. As discussed in the challenges of the tracer (4.2.2), the XML specification defines a range of valid characters. A malformed XML document would have been produced if an output had contained any XML-invalid character.

Adaptors

The algorithm for structuring DOMs into the html5lib format is simple and easy to apply (a sketch is shown below). However, writing an adaptor for a new parser should not be considered a trivial task: the setup of some environments might be time-consuming, the documentation of the parser implementation could be scarce, the programming language could be hard to understand, etc.
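To illustrate the format, the sketch below writes a DOM in an html5lib-like tree layout (every node on its own line, prefixed with "| " and indented two spaces per level). It is shown against the standard org.w3c.dom API for brevity and simplifies attribute ordering and namespace handling, so it is not the exact serializer used by the adaptors.

    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;

    final class Html5LibSerializer {

        static String serialize(Node root) {
            StringBuilder out = new StringBuilder();
            for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
                write(child, 0, out);
            }
            return out.toString();
        }

        private static void write(Node node, int depth, StringBuilder out) {
            out.append("| ");
            for (int i = 0; i < depth; i++) out.append("  ");
            switch (node.getNodeType()) {
                case Node.ELEMENT_NODE:
                    out.append('<').append(node.getNodeName()).append(">\n");
                    NamedNodeMap attrs = node.getAttributes();
                    for (int i = 0; i < attrs.getLength(); i++) {            // attributes on their own lines
                        Node attr = attrs.item(i);
                        out.append("| ");
                        for (int j = 0; j <= depth; j++) out.append("  ");
                        out.append(attr.getNodeName()).append("=\"").append(attr.getNodeValue()).append("\"\n");
                    }
                    break;
                case Node.TEXT_NODE:
                    out.append('"').append(node.getNodeValue()).append("\"\n");
                    break;
                case Node.COMMENT_NODE:
                    out.append("<!-- ").append(node.getNodeValue()).append(" -->\n");
                    break;
                case Node.DOCUMENT_TYPE_NODE:
                    out.append("<!DOCTYPE ").append(node.getNodeName()).append(">\n");
                    break;
            }
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                write(child, depth + 1, out);
            }
        }
    }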

4.4 The web application

Initially, just a simple tool for visualizing the reports generated by the harness was

required. As the report was an XML document, an XSLT style sheet was used to

generate an HTML page to display the report highlights.

This approach was enough and useful for reports of a single input, or even a few dozen. A large report was hard to visualize and navigate correctly; pagination and sorting functions were required. Programming those operations with XSLT would have required a substantial amount of work. In the end we agreed on creating a web application.

The web application is a Spring MVC project written in Java. This decision was made because of our experience with the language and because most of the project (parser, tracer and comparator tool) was programmed using Java.

As the project continued, more features were included with the aim of improving the visualization of the reports and the comparison and tracing processes. Currently, the web application functions are:

Parse a string or URL input and run the specification tracer.


Analyse and filter tracing events.

Run the comparison harness for string and URL inputs.

Review reports generated by the comparison harness.

Analyse, format and minimize outputs from the Comparator.

4.4.1 Architecture

The web application follows the MVC design pattern. The model represents the

information of the report, test cases, outputs, etc. and the operations for generating

and accessing such information. The model also includes classes for handling user

requests, i.e., the inputs for the tracer or the minimizing options for the test harness.

A single controller, called ReportController, is used for handling user requests,

executing model operations and interacting with the views for presenting results.

Additionally, the controller handles the parsing and tracing processes directly (the

MScParser and tracer APIs are bundled in a jar file).

With respect to the views, the web application consists of five JSP files:

Report details – presents the report information.

Test details – displays the outputs of a given test case. Includes functionality to compare and minimize outputs.

Tracer form – page for executing the parser and the tracer.

Input form – page for capturing the input for the comparison harness.

Layout – also known as the master page. Used for defining a consistent layout throughout the site.

The web application uses a configuration file (WebConfig) for storing paths to the

bash/batch script files and reports directory. The values can be edited manually in the

mvc-dispatcher-servlet.xml file.

In addition to the classes previously presented, three classes with generic operations

are used:

FormattingUtils – includes operations for escaping, highlighting and searching

strings.


ProcessRunner – used for executing operating system processes. In this case, it

executes the script of the comparison harness.

RequestURL – given a URL, creates a connection and returns the content as a stream.
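A ProcessRunner of this kind can be sketched as follows (illustrative class name and argument order, not the actual implementation):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    final class ScriptRunner {
        // Runs the comparison script through the system console and waits for it to finish.
        static int run(String scriptPath, String outputDir, String inputType, String inputValue) throws Exception {
            ProcessBuilder builder = new ProcessBuilder(scriptPath, outputDir, inputType, inputValue);
            builder.redirectErrorStream(true);                    // merge stderr into stdout
            Process process = builder.start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);                     // forward the script output
                }
            }
            return process.waitFor();                             // exit code of the script
        }
    }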

4.4.2 Parsing and tracing

Figure 11 presents the input tab for the tracer. A drop-down list offers the option to switch between a string and a URL input. A check box allows the output to be presented formatted and highlighted (pretty code). This is done with the help of an open source JavaScript module called prettify [37].

Figure 11 – Tracer input tab

As mentioned in section 4.2, the tracer logs every event generated during the parsing process. Nevertheless, a specific event type or set of spec sections may need to be filtered out. The web application offers the option to exclude both events (by type) and specification sections. Figure 12 shows the tabs that allow the user to define exclusions. N.B. to improve readability, the list of sections is not displayed in full.


Figure 12 – Tracer exclusion tabs

Given the input string ‘this is a <b>test’, Figure 13 shows the tabs of the parser output,

the tracer log and the tracer summary, respectively.

Figure 13 – Tracer output tabs for the input string this is a <b>test

The parser output tab displays the HTML output of the parser. The tracer log shows the events (after exclusions, if any). Events are coloured to improve readability. Finally, the tracer summary displays detailed information about the used algorithms, tokenizer states and insertion modes, the existence of certain types of elements and the number of parse errors and emitted tokens.

When tracing a web page or a long input string, if there are no event exclusions, the log produced by the tracer can be very large. In such cases, the page might not be displayed correctly or the browser might behave unexpectedly (or even crash). To avoid those scenarios, a maximum log size (number of entries) is included in the WebConfig file. When the log size is exceeded, a message is presented to the user.

4.4.3 Comparing outputs

The web application allows parsing a string (or URL) input and then comparing the

output produced by all the available parsers. The input page is shown in Figure 14.

Figure 14 – Input form for the multi-parser comparator tool

Figure 15 shows the result of the parsing and comparison process for the input string 'this is a <b>test'. In this case, all the parsers had the same output; therefore there is only one tab with the names of all the parsers. When there are differences, a tab for each different output is generated, as shown in Figure 16. A panel for navigating through the differences (auto-scroll to the next or previous difference) is displayed when the output is too large to fit on the screen.

The report name is a number assigned by the system. This is because the output and

the xml report files have to be stored in a file directory (specified in the WebConfig).


Figure 15 – Comparison details page

Figure 16 - Comparison page displaying differences between outputs

Figure 17 presents the options for formatting and minimizing the outputs. The removals are applied only when all the outputs present no difference in the specified element. For example, assuming Remove script elements is enabled, if there are five script elements among the outputs and all are the same, then all five elements are removed. If one script element is different, only the four that are equal are removed.


The application currently allows removing link, script, style and meta elements.

Nevertheless, the removal process is generic as it only requires the name of the

element to be removed. The addition of new options for removal of elements is a

trivial task.

Finally, the option Show original output presents the reconstructed output tree without displaying the differences, i.e., the output as generated by the parser. This option was included because the user might want to check for differences manually or by using specialized software.

Figure 17 – Format options tab

4.4.4 Reviewing reports

The web application offers the opportunity to visualise the report generated by the

comparator harness. Considering the xml report presented previously in Figure 10,

four classes were designed for handling the report as shown in Figure 18.

The Report class represents the whole XML document. The TestResult class represents each of the parsers and its numbers of passed and failed tests. Each input (test input) is mapped to a TestCase object. The set of outputs of a test case is represented by a list of TestOutput objects. A class called ReportGenerator defines an operation for generating a report object given the path to the XML report file.


Figure 18 – Report class diagram

Figure 19 presents the report details page. The general information, test results and the test list are displayed. A jQuery plugin called DataTables [38] is used for the sorting and paging functions. Additionally, the plugin offers options for changing the number of elements displayed and for searching for specific entries.

When a test entry is selected, the application redirects to the comparison details page

(previously shown in Figure 15).

The report details page requires a parameter (a query string parameter in the URL)

which is the name of the report to be displayed. The application assumes a relative

path for all the report files (configured in the WebConfig bean – previously discussed).


Figure 19 – Report details page

4.4.5 Challenges

This subsection presents and discusses the issues that were faced during the web

application development.

Formatting and highlighting output differences

Initially, an HTML div element was used as the container of the output strings; nevertheless, we realized that it does not maintain the format (indentation and line breaks) of the strings. Other HTML elements such as p or span have the same limitation. After some research, the HTML pre element turned out to be the solution.

The process for highlighting differences (and lines containing them) represented a

challenge. The reason is that the reconstruction of the original output (using the diffs

and the majority output files) had to be mixed with a formatting process.

The reconstructed output had to be escaped (because it contains HTML elements that might be confused with the web page's own HTML elements). HTML span elements had to be inserted wherever a difference is present. Several indexes had to be used to track the start and end of the lines to highlight, because a difference may involve several lines or a line may have several differences.

Encoding

Although the tracer and the comparison harness were already using UTF-8, making the entire web application use UTF-8 encoding was a real challenge, as several modules had to be configured.

Tomcat 8 is used as the web server for the application. It uses the encoding defined in

the operating system settings for encoding the GET request parameters (parameters

specified in the URL). This setting had to be changed to use UTF-8 encoding.

The Java web application by default uses the encoding of the browser for handling requests and responses. A filter to override this behaviour and force the use of UTF-8 was developed; a sketch of such a filter is shown below.
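The filter is conceptually similar to the sketch below (Spring also ships an equivalent CharacterEncodingFilter); the exact implementation in the web application may differ.

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class Utf8EncodingFilter implements Filter {
        @Override
        public void init(FilterConfig filterConfig) { }

        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws IOException, ServletException {
            request.setCharacterEncoding("UTF-8");    // override the browser-provided request encoding
            response.setCharacterEncoding("UTF-8");   // force UTF-8 on the response
            chain.doFilter(request, response);
        }

        @Override
        public void destroy() { }
    }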

Every JSP page can have its own encoding. However, we set in the application configuration file (web.xml) that all the JSP pages must use UTF-8. Finally, the layout page (master page) has a meta tag where the charset attribute is set to UTF-8 as well.


5. Analysis and Results

This chapter presents the analysis and results of the following experiments:

The spec coverage of the html5lib test suite

The MScParser test results using the html5lib test suite

The parsers test results using the html5lib test suite

Running the tracer over websites

5.1 The html5lib test suite coverage

The code of the MScParser was used to measure the coverage of the spec by the html5lib test suite. The parser was developed as a transliteration of the specification algorithm, thus its granularity is roughly equivalent to that of the spec algorithm. It has to be taken into consideration that the coverage is based on lines of code and not on actual coverage of the spec; a line in the spec could imply a dozen lines of code or vice versa. EclEmma [39] is a code coverage plugin for the Eclipse IDE; the following analysis was performed using that tool.

The tokenizer comprises 69 states. The html5lib test suite contains 6665 test cases covering in total 68.21% of the tokenizer code. As shown in Table 5, there are 38 states with 100% coverage, 12 states that are not fully covered and 19 states with 0% coverage. Of the 19 tokenizer states that are not covered, 18 are related to scripts and one is related to CDATA sections.

An interesting tokenizer state to note is the tokenizing character references. It has

code coverage of 74.77%. The five states that can switch to this state are:

Attribute value double quoted state (94.87%)

Attribute value single quoted state (94.87%)

Character reference in data state (89.94%)

Attribute value unquoted state (74.26%)

Character reference in RCDATA state (51.57%)

As discussed in the MScParser challenges section (4.1.3), those states required a few changes to fit into the architecture of the system and are not a faithful transliteration of the spec; hence, those percentages might not be completely accurate.


Code coverage of the tokenizer states

100% coverage (38 states): After attribute name, After attribute value quoted, After DOCTYPE name, After DOCTYPE public identifier, After DOCTYPE public keyword, After DOCTYPE system identifier, After DOCTYPE system keyword, Attribute name, Before attribute name, Before attribute value, Before DOCTYPE name, Bogus comment, Bogus DOCTYPE, Character reference in attribute value, Comment end dash, Comment end, Comment start dash, Comment start, Comment, Data, DOCTYPE name, DOCTYPE public identifier double quoted, DOCTYPE public identifier single quoted, DOCTYPE, DOCTYPE system identifier double quoted, DOCTYPE system identifier single quoted, End tag open, RAWTEXT end tag name, RAWTEXT end tag open, RAWTEXT less than sign, RAWTEXT, RCDATA end tag name, RCDATA end tag open, RCDATA less than sign, RCDATA, Self-closing start tag, Tag name, Tag open.

Partial coverage (12 states): Attribute value double quoted (94.87%), Attribute value single quoted (94.87%), Character reference in data (89.94%), PLAINTEXT (80.36%), Comment end bang (78.07%), Tokenizing character references (74.77%), Attribute value unquoted (74.26%), Between DOCTYPE public and system identifiers (74.23%), Markup declaration open (69.42%), Before DOCTYPE public identifier (67.29%), Before DOCTYPE system identifier (57.01%), Character reference in RCDATA (51.57%).

0% coverage (19 states): CDATA section, Script data double escape end, Script data double escape start, Script data double escaped dash dash, Script data double escaped dash, Script data double escaped less than sign, Script data double escaped, Script data end tag name, Script data end tag open, Script data escape start dash, Script data escape start, Script data escaped dash dash, Script data escaped dash, Script data escaped end tag name, Script data escaped end tag open, Script data escaped less than sign, Script data escaped, Script data less than sign, Script data.

Table 5 – Code coverage of the tokenizer states by the html5lib test suite


With respect to the tree construction stage, the test suite contains 1555 test cases.

The total coverage of the 23 insertion modes is of 94.5% as shown in Table 6.

Insertion Mode            Coverage
After After Body          100.00%
After After Frameset      100.00%
After Body                100.00%
After Frameset            100.00%
Before Head               100.00%
Before HTML               100.00%
In Head                   100.00%
Text                      100.00%
In Frameset                97.88%
In Template                97.87%
In Body                    97.60%
In Table Text              97.14%
In Column Group            96.67%
After Head                 96.23%
In Table                   93.28%
In Table Body              93.02%
In Select                  91.56%
In Caption                 91.03%
Initial                    88.37%
In Cell                    87.24%
In Row                     87.24%
In Head No Script          75.60%
In Select In Table         65.32%
TOTAL                      94.50%

Table 6 – Code coverage of the insertion modes by the html5lib test suite

The tree construction involves the use of some algorithms that were defined in another package of the project. The coverage of those algorithms is 89.5%. The adoption agency algorithm is the most complex algorithm of the spec; it can manipulate the DOM by removing, adding or moving nodes. The code for that algorithm has 98.2% coverage. Nevertheless, the coverage of that algorithm should be considered 100% because, after a code review, one line of unreachable code was detected.

The In body insertion mode, which is the largest and most complex section of the spec, has a coverage of 97.52%. A code review showed that four very specific paths are not covered, e.g., an unclosed p tag inside an isindex node inside a template node, or misnested dd and dt elements inside some specific nodes.

When using the tree construction test cases, the tokenizer state coverage reached 79%. This is higher than with the tokenizer test cases alone (68.21%) because the tree constructor fully exercised the script-related tokenizer states (which previously had 0% coverage).


When running all the test cases (from both, the tokenizer and the tree constructor),

the tokenizer code reached a coverage of 94.3%.

The html5lib test suite covers 91% of the entire source code of the MScParser. However, the percentage rises to 93.02% when the spec tracer code, which is bundled with the project, is excluded.

5.2 The MScParser vs. the html5lib test suite

The MScParser was developed according to the W3C Recommendation. However, the html5lib test suite is conformant with the WHATWG specification. The differences between the specifications caused 8 failing tests, as described below.

The test suite contains 13 files with a total of 6665 test cases for the tokenizer. Our parser passes all tests except one, as shown in Figure 20.

Figure 20 – Html5lib tokenizer state tests results

The failing test appeared in the test file domjs.test and the test name is CRLFLF in

bogus comment state. The input of the test is a comment containing a CR character

followed by two LF characters. The expected output is a comment token containing

two LF characters. However, the output of our parser is a comment token containing

just one LF character.

There is a subtle difference in the processing of new line characters between the specifications. In section 8.2.2.5 (Pre-processing the input stream), the W3C spec states that "any LF characters that immediately follow a CR character must be ignored", while the WHATWG spec states that "any LF character that immediately follows a CR character must be ignored". This is the reason why our parser fails that test.

With respect to the tree construction tests, Figure 21 presents the test results of the

parser. There are 48 test files and the total of test cases is 1555.

Figure 21 – Html5lib tree construction tests results


As with the failing tokenizer test, the failing tests of the tree construction stage are due to differences between the specifications:

Test case 54 (tests1.dat) – There is one difference in the first step of the adoption agency algorithm that causes misnested elements.

Test case 14 (ruby.dat) – In the in body insertion mode the rtp, rp and rb elements are treated differently, causing misnested content.

Test case 3 (main-element.dat) – The WHATWG spec considers the main element foreign content whereas the W3C Recommendation does not. This leads to a misnested node.

Test case 78 (template.dat) – The list of special HTML5 elements is different (menu and menuitem). This could lead to misnested elements (and possibly missing elements) because of the adoption agency algorithm.

Test cases 1, 2 and 3 (tests11.dat) – The specs define a table for mapping the names of some attributes in the SVG namespace. The attribute list is not fully consistent between the specifications.

Considering 8 failures out of 8220 test cases, the MScParser achieves a pass rate of 99.90% on the html5lib test suite.

5.3 Comparing parsers with the html5lib test suite

Five third-party parsers and the MScParser were compared using the html5lib test cases for the tree construction. However, from the 1555 test cases available, only 1276 were used. The remaining test cases were excluded for the following reasons:

HTML fragment – Some parsers provide incomplete or no support for parsing HTML fragments. For example, Jsoup has a parse fragment operation, but the context node is set to the body element and it cannot be changed.

Template – The template element uses HTML fragments.

Scripting flag – According to the spec, the flag is set to true "if scripting was enabled for the Document with which the parser is associated when the parser was created" [3] and set to false otherwise. The Parse5 and Html5lib parsers enable scripting and provide no option to change it. In fact, those parsers lack the spec sections that handle noscript scenarios.


Table 7 presents the results of comparing the output of the parsers, considering the html5lib expected output as the gold standard. Of the 1276 test cases, all parsers passed 911, i.e., all had the same output as the html5lib expected output. This is equivalent to an agreement of 71.39%.

Number of tests: 1276
Equal: 911
Different: 365

Parser name     Failed    Passed    Conformance
AngleSharp          17      1259         98.67%
Html5lib            20      1256         98.43%
Html5parser          6      1270         99.53%
Jsoup              354       921         72.18%
Parse5              18      1258         98.59%
ValidatorNU          2      1274         99.84%

Table 7 – Comparison of parsers vs. html5lib expected output

With only two failed tests, ValidatorNU is the parser with the highest conformance with the html5lib test suite. Our parser follows closely; its failing tests are due to the spec differences mentioned in the previous subsection.

Parse5 and Html5lib also show high conformance. Several of their failing tests contain a ruby element (13 failing tests each). Like our parser, they also fail the 3 tests related to SVG attributes. It is highly likely that they have not updated their code recently. One of the tests (number 16, file tests26.dat) causes a stack size exceeded error in Parse5.

AngleSharp is based on the W3C specification. It presents some failures in the entities, domjs-unsafe and plain-text-unsafe test files (12 failing tests); those test cases are related to invalid Unicode characters and character references. This could be caused by the pre-processing stage or the tokenizing character references state. Like our parser, it also fails the 3 tests related to SVG attributes, and the remaining failures are due to tests containing a frameset element.


Jsoup was the parser with the lowest conformance. Several failed tests are due to its lack of support for namespaces. The HTML5 spec defines two types of elements: foreign elements (in the MathML or SVG namespace) and normal elements (in the HTML namespace). Jsoup does not provide any details of the namespace of the element nodes, thus all the tests that contain foreign elements failed.

Excluding Jsoup, from the 1276 total tests, the other parsers agree on 1240 tests

(97.18%).

5.4 Tracing the web

With the aim of analysing the coverage of the spec and the usage of HTML5 ‘in the

wild’, the specification tracer was run over 90 websites (taken from [40]). Table 8

presents a summary of the information collected by the tracer.

Data                        Times Used    Average
Algorithms                       1,870      19.28
Insertion Modes                    987      10.18
Parse Errors                     1,640      16.91
Tokenizer States                 3,408      35.13
MathML Elements                      -          -
SVG Elements                         9         9%
HTML5 Form Elements                  -          -
HTML5 Graphic Elements               1         1%
HTML5 Media Elements                 2         2%
HTML5 Semantic Elements             54        56%

Table 8 – Tracing details over websites

On average, each website uses 10.18 insertion modes (out of 23) and 35.13 tokenizer

states (out of 69). Each website generates almost 17 parse errors. 9 websites used SVG

elements and none use MathML elements.

The HTML5 specification defined new elements for forms, graphics and media content. However, those are rarely used. The new semantic elements are used by roughly one in two websites on average.


Figure 22 shows the distribution of usage of insertion modes by the websites. Every input (even an empty string) uses 6 insertion modes by default: initial, before html, before head, in head, after head and in body; the figure corroborates that statement. An interesting case is the text insertion mode, which can only be triggered when script or textarea elements are present. Hence we can say that all the websites used scripts (most likely) or textarea elements.

Figure 22 – Insertion modes usage by websites

The insertion modes related to tables (in table, in table text, in table body, in row, etc.) are used by around 20% of the websites. The frameset-related insertion modes were not used, and the in template insertion mode was not used by any website. This might be due to the novelty of the template element, which was introduced in HTML5.

With respect to the tokenizer states, Figure 23 presents the distribution of usage by websites. The states for processing tags, attributes and comments are widely used. The states for processing the public id and system id of doctype elements are barely used; those attributes are usually used for specifying schemas for validating the HTML document. Some script-related tokenizer states are also barely used; those are used for escaping script content. The last bar corresponds to the tokenizing character references state, which is used by more than 80 websites.


Figure 23 – Tokenizer states usage by websites


6. Conclusions

The developed product presents a new approach for analysing and comparing HTML5 parsers. It has two main characteristics: scalability and modularity. With respect to scalability, it is scalable vertically by allowing multiple inputs to be processed and scalable horizontally by allowing new parsers to be added. Modularity was achieved because each part of the system is an independent module that can be updated or replaced effortlessly, i.e., it is a low-maintenance system.

Although the main goal of the HTML5 standard is to achieve convergence of outputs between parsers, there are still differences. Using the html5lib test suite, six parsers were analysed and compared, revealing disagreements between their outputs.

Testing the HTML5 standard is a complex task. With our analysis, we found that the more than 8000 test cases of the html5lib test suite are not enough to cover the whole specification. The code coverage is around 93%, but this measure is not 100% reliable because code is prone to errors and there could be redundant or unreachable code. Moreover, a transliteration of an algorithm into code is a subjective process that depends on other factors such as the style and experience of the programmer, the programming language, etc.

A highly valuable contribution, not for the system but for the community, would be improving the html5lib test suite. Test cases can be written for both the tokenizer and the tree construction in order to increase the coverage of the spec. A higher coverage would increase the reliability of the test suite.

Reading and understanding the HTML5 specification represents a challenge. Although it is well organized and the writing style is clear, it is very long and tedious in places. The existence of two specifications (W3C and WHATWG) certainly complicates the testing of HTML5. According to our results, there are spec differences that have direct repercussions on the output DOM. The risk of further differences is higher because there might be differences in parts of the specs where the html5lib test suite has no coverage. As long as there is no full convergence between the specs, there will be no convergence amongst parsers.


Performance was not a priority when transliterating the algorithm proposed by the W3C specification. However, performance was not an impediment to the correct functionality of the parser in the system. At this point, it is still unclear whether a transliteration of the spec would be an effective approach for building a high-performance parser.

The spec tracer was conceived as an analysis tool. Several ideas were considered for easing the analysis of the parsing process and finding sources of disagreement between outputs. However, due to the limited time available to work on it, its current state does not fulfil that goal completely. The EclEmma code coverage plugin turned out to be a better option than the tracer for measuring spec coverage (because of the coarse granularity of spec tracking). The tracer nevertheless has the potential to be an extremely useful tool.

Some areas for improvement of the product are discussed below.

MScParser

Allow parsing according to the W3C or the WHATWG specifications – I

particularly consider this option would be really useful. It would allow tracking

and analysing spec differences. However, it would require constant

maintenance because of the frequent updates of the HTML5 living standard.

Moreover, the complexity of spec differences might involve significant code

changes.

Sniffing algorithm – This algorithm is used for parsing inputs with a character

encoding different from UTF8. By implementing this algorithm, more web

pages could be parsed without encoding issues.

Script execution – Although the script execution is not part of the parsing

section of the specification, it is closely related. The script engine could be

useful for analysing the structure of web pages with and without script

execution.

HTML5 DOM

Add operations for navigation through nodes – The HTML5 DOM was developed only for storing and manipulating the DOM tree generated while parsing. It lacks methods for easy navigation and node searching that a user might find useful, e.g. finding elements by tag name, finding attributes by id, etc.

Document it and publish it as an API – During the development of the parser no

independent HTML5 DOM implementations were found. This module can be

documented and offered as a public independent API for potential HTML5

parser developers.

Spec tracer

Increase the tracer granularity – The current level of tracing is at section level.

The granularity can be increased as much as the user requires. Moreover, being

ambitious, an option for selecting the granularity level would be extremely

useful.

Trace over substrings – It would work as a zoom-in and zoom-out option where

the user could select a block of the input for tracing. This will be particularly

useful when analysing large inputs.

Minimize repeated events – This could be an option for reducing repeated

events into a single event. For example, when the input has text content, every

character is emitted as a character token, leading to a loop of repeated events.

All those events could be merged as a single event for the entire text block.

Debugger – This option would allow the user to set breakpoints over the input

or to track the events generated step by step. This feature could also display

the output (and other parsing or context variables) at a given state of the

parsing process.

Include more details in the tracing summary – Currently, the summary is hard-

coded to trace some elements. A generic option for allowing the user to select

specific elements to track and count could be valuable.

Test harness

Add more parsers – This could be one of the most valuable improvements to the system. An analysis would be more valued and trusted by adding more parsers and, ideally, including the parsers of the major web browsers.


Performance and threading – The execution of the parsing processes is made

sequentially by the bash/batch script. An application for running each parser on

parallel processes could reduce the response time.

Web application

Add tables and graphs – The presentation of the report is very simple. Graphs

and tables could be included for analysing in more detail the tracer and

comparison reports.

Filter test cases by parser – The web application presents the report details and

the list of all the test cases. In the case the user wants to see the tests that a

particular parser passed or failed, he has to search manually through all the

tests. An option to filter the test cases would be useful.

Link output elements to events – In the tracer page, the HTML output and the

log of events are displayed. However, there is no way to link or relate elements

(in the HTML output) with tracer events. The proposed functionality would

highlight events when hovering (or clicking) over elements or vice versa.

6.1 Reflection

I am really glad about our project and accomplishments. However, I think that the system could have been better. There are two reasons that make me feel that way: we had trouble working as a team and we never had a specific and grounded goal.

I feel that we never worked as a team. Instead, we were just a group of individuals working on the same project. In fact, since the early spikes (in February) we realised that working as a team would be a challenge. Moreover, there were a few attempts to dissolve the team and work individually.

Xiao is an easy-going, friendly person. The difficulties in working with Xiao were due to language barriers. On the other hand, although Jose Armando and I are both Mexican and had no language barrier, we could not forge a friendship, or at least a comradeship, for working together. I feel that we both tried but our temperaments are simply not compatible.

During the project, we tried to apply some agile techniques such as pair programming, collective ownership of code, continuous delivery, using a backlog, etc. However, we gave up on most of them. This was probably because of our lack of experience with Agile and mostly, I think, due to our low commitment to teamwork.

For example, when we started to work on the parser, we had two days of discussions about the system architecture with little progress. I feel that there was no engagement from my teammates. I proposed some ideas but they just questioned them without trying to solve anything. In the end I wrote the code base for the parser and literally imposed it so we could start working. A similar situation happened with the harness for comparison.

I feel that we did not have a specific and grounded project goal until the very end. We had several discussions with our supervisor and he suggested plenty of ideas. Sadly, we as a team never agreed on anything in particular and the project changed constantly. We rambled between a parser in a new language, a high-performance parser, minimizers and debuggers, amongst other ideas.

From late April I expressed my desire to work on a spec tracer and debugger, but I did not manage to convince my partners to work on it; we ended up working on the comparator and parser adaptors. Later I started to work on the spec tracer, but there was not enough time to finish it as I would have liked.

I once read that we have to do what we can, with what we have and where we are. Although we, as a team, did not have adequate communication and did not apply some techniques correctly, the pseudo-agile approach we applied helped us. I feel confident that the agile toolset has real potential for improving software development.

Despite the issues discussed above, our project accomplishments make me feel satisfied. The product is usable and I hope that someone will find it useful. I have gained significant experience with version control software and my programming skills have improved. I have obtained useful knowledge of HTML and related topics. I am really enthusiastic about sharing and applying my new knowledge and experience back in my home country.

