Natural Language Generation for the Semantic Web: Unsupervised Template Extraction


Daniel Duma

MSc Speech and Language Processing
Philosophy, Psychology and Language Sciences
University of Edinburgh

2012


Abstract

I propose an architecture for a Natural Language Generation system that automatically learns sentence templates, together with statistical document planning, from parallel RDF data and text. To this end, I design, build and test a proof-of-concept system (“LOD-DEF”) trained on un-annotated text from the Simple English Wikipedia and RDF triples from DBpedia, with the communicative goal of generating short descriptions of entities in an RDF ontology. Inspired by previous work, I implement a baseline triple-to-text generation system and conduct a human evaluation of the LOD-DEF system against the baseline and human-generated output. LOD-DEF significantly outperforms the baseline on two of three measures: non-redundancy, and structure and coherence.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Daniel Duma)


Acknowledgements

I am indebted to the many people who have, directly or indirectly, contributed to this effort.

First, to my parents, Eugenia and Calin Duma, for without them I would not be here to tell this story, and to Decebal Duma, for his financial support and a wealth of stories to entertain friends with.

Second, to everyone who helped in some way towards the completion of this thesis, and especially to Austin Leirvik, Ben Dawson, Cristian Kliesch, Dan Maftei and Magda Aniol for supplementing my lack of knowledge with patient explanations and helpful hints.

Third, to my supervisor, Ewan Klein, for being a continuous source of encouragement and for his many helpful pointers along the way.

Finally, I want to thank everyone who has made this year of my life something more than one never-ending night in DSB. And to you, caffeine, for packing three days into one.


Table of Contents

Chapter 1 Introduction and background
  1.1 Introduction
  1.2 Overview of this thesis
  1.3 Semantic Web and Linked Data
  1.4 RDF triples: data format for the Semantic Web
  1.5 DBpedia: the hub of the LOD Cloud
  1.6 Natural Language Generation
    1.6.1 Shallow vs. Deep NLG
Chapter 2 Previous approaches
  2.1 Hand-coded, rule-based
    2.1.1 Assessment
  2.2 Generating directly from RDF
    2.2.1 Assessment
  2.3 Unsupervised trainable NLG
    2.3.1 Assessment
  2.4 Automatic summarisation
Chapter 3 Design
  3.1 Design overview
  3.2 Goal
  3.3 Tasks
    3.3.1 Aligning data and text
    3.3.2 Extracting templates
    3.3.3 Dealing with Linked Open Data
    3.3.4 Modelling different classes
    3.3.5 Document planning
  3.4 Baseline
    3.4.1 Coherent text
Chapter 4 Implementation: Training
  4.1 Obtaining the data
    4.1.1 Wikipedia text
    4.1.2 DBpedia triples
  4.2 Tokenizing and text normalisation
  4.3 Aligning: Named Entity Recognition
    4.3.1 Surface realisation generation
    4.3.2 Spotting
  4.4 Class selection
  4.5 Coreference resolution
  4.6 Parsing
  4.7 Syntactic pruning
  4.8 Store annotations
  4.9 Post-processing
    4.9.1 Cluster predicates into pools
    4.9.2 Purge and filter sentences
    4.9.3 Compute n-gram probabilities and store model
Chapter 5 Implementation: Generation
  5.1 Retrieve RDF triples
  5.2 Choose best class for entity
  5.3 Chart generation
  5.4 Viterbi generation
  5.5 Filling the slots
Chapter 6 Experiments
  6.1 Problems with the data
  6.2 Performance of the system
    6.2.1 Spotting performance
    6.2.2 Parser performance
    6.2.3 Class selection performance
    6.2.4 Template extraction
    6.2.5 Examples of errors in output
Chapter 7 Evaluation
  7.1 Approach
  7.2 Selection of data
  7.3 Human generation
  7.4 LOD-DEF generation
  7.5 Human rating
  7.6 Results
  7.7 Discussion
Chapter 8 Conclusion and future work
  8.1 Conclusion
  8.2 Future work
Appendix A: Human generation
Appendix B: Human evaluation
References


Chapter 1

Introduction and background

1.1 Introduction

The next generation of the web is in the making. The amount of information on the Semantic Web is growing fast; this open, structured, explicitly meaningful machine-readable data on the Web is already forming a web of data, a “giant global graph consisting of billions of RDF statements from numerous sources covering all sorts of topics” (Heath & Bizer, 2011).

This information space is, however, designed to be used by machines rather than humans (Gerber et al., 2006), and we humans are meant to access it via intelligent user agents, such as information brokers, search agents and information filters (Decker et al., 2000), which on the whole are expected to take the shape of question-answering systems (Bontcheva & Davis, 2009).

A crucial element of such a question-answering system is the ability to communicate with the user in natural language (Bontcheva & Davis, 2009), both understanding user input and generating natural language to relay information to the user.

This is why the role of Natural Language Generation is potentially key to the Semantic Web vision (Galanis & Androutsopoulos, 2007): for applications producing textual output, the text presented to users can be generated or updated by NLG systems using data on the web as input. However, Natural Language Generation systems have traditionally relied on hand-built templates and schemas and many expert-hours of work. While this has been a successful approach in several domains (e.g. Androutsopoulos et al., 2001), it is frequently observed that it does not scale well, is not easy to transfer across domains, and requires many expert man-hours, which makes it expensive and impractical for many applications (Busemann & Horacek, 1998). The scale and decentralised nature of the Semantic Web suggest it is one of these applications.

Recent initiatives by organisations and governments, coupled with efforts in text mining, have made large knowledge bases publicly available, such as census results, biomedical databases and more general ones like DBpedia. These now contain information also found in natural language texts, starting with the very ones the information was mined from. I propose here that, given the widespread availability on the web of these parallel text and data resources, and the mature state of key Natural Language Processing technologies, NLG systems could be automatically trained from these resources with little or no human intervention. The wealth of research done in trainable NLG and automatic summarisation suggests that it is feasible for these systems to learn how to generate natural language by analysing existing human-authored natural language text in an unsupervised fashion. This would make them inexpensive to build and deploy, easier to transfer to other domains, and potentially multilingual, which would contribute massively to making the Semantic Web vision a reality.

The aim of this project is to propose an architecture for an NLG system that can automatically learn sentence templates and document planning from parallel data and text, for the communicative goal of describing entities in an RDF ontology. The emphasis of the system is on expressing literal, non-temporal, factual data.

The system is trained from text and data by performing four main actions: given parallel data and text about a number of entities, first it aligns the text and the data by finding literal values in the text. Second, it extracts and modifies sentences that express these values, so as to use them as sentence templates. Third, it collects statistics about the frequency with which a spotted property follows another. Finally, it determines the class of entity that the text and data describe and builds a separate model for that class.

The nature of this project is exploratory rather than exhaustive. To this end, I design, build and evaluate a proof-of-concept system (LOD-DEF) trained on text from the Simple English Wikipedia and data from DBpedia. No part of its architecture is exhaustively optimised, and all the modules in the pipeline can be seen as baselines for their specific function. The system is thus in itself a baseline implementation, its goal being to explore the feasibility of this approach and, hopefully, to inspire others to improve upon it.

1.2 Overview of this thesis

In this chapter, I present the case for trainable Natural Language Generation for the Semantic Web and provide an overview of these two technology areas.

In Chapter 2 I review the recent approaches most relevant to the present one, on which this project is either based or to which it is theoretically related.

Chapter 3 considers the project from a design standpoint: it formulates the goals of the project and the criteria it must abide by, identifies the problem areas and lays out the design of the solutions. In this chapter I also present the full specification of the non-trainable baseline generation system implemented.

In Chapter 4 the training pipeline is laid out and technically detailed with step-by-step examples. The same is done for the generation pipeline in Chapter 5.

In Chapter 6 a number of experiments are discussed, reporting the performance of the system on different metrics and analysing some interesting findings.

Chapter 7 contains the detailed analysis of the human evaluation of the system. Chapter 8 contains the conclusion and suggests possible directions for future work.

1.3 Semantic Web and Linked Data

Where the terms Semantic Web and Web of Data refer to a vision of the web to come (Berners-Lee et al., 2001), the term Linked Data is more concrete, referring to “a set of best practices for publishing and interlinking structured data on the Web” (Heath & Bizer, 2011). These best practices (the “Linked Data Principles”) require the use of a number of web-based open formats for publishing this data, such as HTTP as a transport layer, URIs as identifiers for resources and RDF for linking datasets (see section 1.4 for a detailed explanation). Essentially, “the basic idea of Linked Data is to apply the general architecture of the World Wide Web to the task of sharing structured data on global scale” (Heath & Bizer, 2011). To realise the SW vision, inference must be added to this Linked Data.

Linked Data does not need to be open, i.e. accessible to anyone, but increasingly organisations are publishing Linked Open Data (LOD). The rate of publication of this information has been steadily increasing over the past years, forming a big “data cloud” containing, among others, “geographic locations, people, companies, books, scientific publications, films, music, television and radio programmes, genes, proteins, drugs and clinical trials, statistical data, census results, online communities and reviews” (Heath & Bizer, 2011).

Figure 1.1 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch (Cyganiak & Jentzsch, 2011)

As of late, there were over 31 billion (10⁹) RDF triples (statements) in datasets linked in the LOD Cloud (Bizer et al., 2011), from 295 different datasets. Figure 1.1 represents these nodes as bubbles, interconnected by edges.

It is these edges that are most interesting; perhaps the key aspect of this effort is that the published datasets are explicitly linked together by using common vocabularies and ontologies. Both the vocabularies and the data can be published by any organisation or individual, leading to the somewhat famous observation “Anyone can say Anything about Anything” (Klyne & Carroll, 2002). Throughout this thesis, “Semantic Web” and “Web of Data” are both taken to mean Linked Open Data, thus referring to data published in adherence to the Linked Data Principles¹.

¹ See http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/ for an intuitive overview of Linked Open Data.

1.4 RDF triples: data format for the Semantic Web

Resource Description Framework (RDF) is the default and recommended data format for the Web of Data (Heath & Bizer, 2011). RDF triples represent the simplest statement that can be made, involving two entities (nodes in a conceptual graph) and a relation between them (an edge). These are often called subject, predicate and object, and a triple must contain all three of them. Another way of reading this information is that the subject has a property (the predicate), the value of which is the object. I use both naming conventions throughout this thesis.

An example of a triple would be:

http://dbpedia.org/resource/United_States_of_America
http://dbpedia.org/property/leaderName
http://dbpedia.org/resource/Barack_Obama

Figure 1.1 Example of an RDF triple

where United_States_of_America has a property leaderName, the value of which is Barack_Obama. This could be conceptually read as “the name of the leader of the USA is Barack Obama”. This triple then connects two entities in the graph, United_States_of_America and Barack_Obama, via the edge leaderName.

A central aspect of RDF triples is the fact that subjects and predicates must necessarily be URIs (Uniform Resource Identifiers). The concept of a URL (Uniform Resource Locator) is perhaps a familiar one, given how commonplace the use of web addresses (e.g. http://www.google.com) has become. A URI differs from a URL in that, although it must also be globally unique, it does not need to be “dereferenceable”; that is, if we point a web browser at that address, the browser may not be able to load a web page to show. It is recommended good practice that URIs be made dereferenceable (Heath & Bizer, 2011), but it is not required.

Objects of triples can either be URIs (e.g. “http://dbpedia.org/resource/Barack_Obama” in the previous example) or literal values, such as character strings (e.g. “Hogwarts”), dates (e.g. “1066-10-12”^^xsd:date) or numbers in different formats. RDF literals may have either a data type or a language suffix (e.g. @en for English), but not both. A language suffix automatically identifies the value as a string literal.

A number of serialisations of RDF exist, for a number of purposes; the one used throughout this document is N3/Turtle (Berners-Lee & Connolly, 2011). This serialisation is intended to be easier for humans to read directly: among other things, it allows for the shortening of namespaces to make triples easier to read. In this serialisation, we can define a number of prefixes for namespaces:

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbprop: <http://dbpedia.org/property/> .
@prefix dbont: <http://dbpedia.org/ontology/> .

This allows us to write the previous triple as:

dbpedia:United_States_of_America dbprop:leaderName dbpedia:Barack_Obama.



When accessed, each of the URIs in this triple would be expanded back to the values in Figure 1.1. Finally, Notation 3 permits defining and using a default namespace, identified by a single colon. Henceforth, for ease of reading, the default namespace used in examples throughout this document is <http://dbpedia.org/resource/>:

:United_States_of_America dbprop:leaderName :Barack_Obama.
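
To illustrate how such Turtle is processed by machine, here is a minimal Python sketch that parses the example and prints the triple with its prefixed names expanded back to full URIs. It uses the rdflib library, which is one common choice rather than anything prescribed by this thesis.

    from rdflib import Graph

    turtle = """
    @prefix dbpedia: <http://dbpedia.org/resource/> .
    @prefix dbprop:  <http://dbpedia.org/property/> .

    dbpedia:United_States_of_America dbprop:leaderName dbpedia:Barack_Obama .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")
    for s, p, o in g:
        # Each term prints in its expanded, globally unique URI form.
        print(s, p, o)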

Ontologies can be built on top of RDF using a standard class inheritance mechanism via the predicate rdf:type. RDF implements multiple inheritance, that is, an instance can belong to any number of classes. On top of this basic framework, more complex mechanisms to allow for inheritance and reasoning have been implemented, most importantly for this project RDFS (RDF Schema) and OWL (Web Ontology Language). Different variants of OWL (OWL Lite, OWL DL, OWL Full) can encode different types of formal logic (Bechhofer et al., 2004), but this is outside the scope of this project.

A property, being identified by a URI, can have properties itself. For every property, its rdfs:domain restricts the classes and instances which can have this property, and its rdfs:range specifies what values the property can take.

There are many standard prefixes for namespaces defining vocabularies with widely used and well-defined semantics. Two examples are foaf (“Friend Of A Friend”) and dc (“Dublin Core”). Very important to the design of the system presented here are the widely used properties foaf:name, whose description simply states “a name for some thing” (Brickley & Miller, 2010), and rdfs:label, which “may be used to provide a human-readable version of a resource's name” (Brickley & Guha, 2004).

Beyond storage, the emphasis of the LOD approach is that this data can be queried in highly complex ways. The default query language for this is SPARQL (Prud'hommeaux & Seaborne, 2008), a structured query language based on matching patterns of triples, allowing for highly complex filters using logic, inference and regular expressions. Some examples of these queries, expressed in natural language, could be (see the sketch after this list):

• Skyscrapers in China that have more than 50 floors
• Albums from the Beach Boys that were released between 1980 and 1990
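
For the first of these, a query issued from Python might look roughly as follows. The property names (dbo:Skyscraper, dbo:location, dbo:floorCount) are illustrative guesses at DBpedia's vocabulary rather than verified identifiers, and SPARQLWrapper is simply one common client library.

    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT ?building WHERE {
        ?building a dbo:Skyscraper ;          # class and property names are illustrative
                  dbo:location dbr:China ;
                  dbo:floorCount ?floors .
        FILTER (?floors > 50)
    }
    """

    endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
    endpoint.setQuery(QUERY)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["building"]["value"])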

1.5 DBpedia: the hub of the LOD Cloud

At the centre of the LOD cloud (Figure 1.1) lies DBpedia. This multilingual knowledge base contains knowledge that has been extracted from the infobox systems of Wikipedias in 15 different languages (Mendes et al., 2012). Infoboxes are human-authored tables of information, akin to collections of attribute-value pairs, that appear on the side of an article on Wikipedia. They contain factual information such as dates, population sizes, titles of national anthems, etc., in a format that is easy to mine for data. This extracted data is stored as RDF triples, using a number of standard vocabularies for properties (e.g. foaf, dc).

As described in (Mendes et al., 2012), “the DBpedia Ontology organizes the knowledge on Wikipedia in 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. It features labels and abstracts for 3.64 million things in up to 97 different languages of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases.”

Due to its status as the unofficial hub of the LOD cloud (being used to interlink many datasets), and its breadth of coverage, the data in DBpedia seems to be an ideal starting dataset for any approach that aims to generate natural language using data in the LOD Cloud.

1.6 Natural Language Generation

Natural Language Generation is the process of creating text in natural language (e.g. English) from an input of conceptual information. NLG allows for adapting the text to specific communicative goals and to user preferences, and for generating text in different natural languages from the same underlying representation. In what is perhaps the standard reference of the field, Reiter & Dale (2000) propose and describe a standard architecture for a Natural Language Generation system. While the approach to NLG in the present project is far from this level of sophistication, it is relevant to present an overview of this architecture here to put the task at hand in context.

According to Reiter & Dale (2000), the architecture of a Natural Language Generation system is modularised, formed of a number of distinct, well-defined and easily integrated modules which perform different functions. A graphical representation of this architecture is provided in Figure 1.2.

It is sometimes observed that NLG consists in making choices (what to say, in what order, with what words, etc.). These choices depend on each other, but can be separated into different levels, forming a pipeline. In this pipeline, domain data in some internal semantic representation is input at one end and natural language text is output at the other.

Figure 1.2 Natural Language Generation Architecture (Based on Reiter and Dale, 2000)

The pipeline consists of three main stages, implemented by as many components:

1. In the document planning stage, the data to be included in the generated text is chosen (content determination), as is the order in which to present it (document structuring). These processes produce an intermediate representation of the structure of the document, labelled “document plan” in the diagram, typically a tree structure. Document planning takes domain data as input together with a communicative goal, that is, the purpose of the text that is to be generated, such as “describing an entity”, “recommending a restaurant” or “comparing flights”, depending on the application. The communicative goal typically determines both content determination and document structuring.

Document planning, as well as the other components in this pipeline, can be informed by, among others: discourse strategies (helping realise the communicative goal), dialogue history (in a dialogue system) and constraints upon the output (the resulting text might need to fit in a constrained space, etc.). Most importantly, however, it can be informed by a user model, capturing preferences or specific circumstances that characterise the target audience of the text. Depending on the application, system and communicative goal, this could mean e.g. a preference on sentence length, or for the ordering of an argument in the case of a recommendation.

2. In the microplanning stage, the document plan is taken as input to a number of sub-processes, which are to a great extent dependent on each other.

a. Lexicalisation is choosing the content words required to express the content selected by the previous module.

b. Aggregation consists in joining together short sentences or chunks of text to create longer sentences. Both coordination and subordination strategies may play a role in this.

c. Referring Expression Generation (REG) deals with how to refer to entities in the discourse. There are multiple ways in which we can refer to the same real-world entity. For example, “Barack Obama” might be referred to as “President Obama”, “Obama”, “the President”, or simply “he”, depending on the communicative context. A distinction is usually made between the first time an entity is mentioned (“initial reference”) and “subsequent reference”. Depending on other factors such as pragmatic considerations, we might want to avoid repetition by using personal pronouns and other referring expressions. These also depend on style considerations of the textual domain. For instance, in newswire it might be preferred to use “President Obama” and “the President” instead of “he”.

3. The surface realisation component takes text specifications as input and outputs surface text. Surface realisation often adopts an “overgenerate and rank” approach, where a number of possible surface realisations are generated and then ranked using a language model (i.e. how likely that sequence of words is) or other scoring functions, as sketched below.
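
As an illustration of the “overgenerate and rank” idea, the following is a minimal Python sketch that scores candidate realisations with a toy bigram language model and keeps the best one. The candidate sentences and probabilities are invented for the example; a real system would estimate the model from a corpus.

    import math

    # Toy bigram probabilities; a real system would estimate these from a corpus.
    TOY_BIGRAMS = {
        ("was", "born"): 0.9, ("born", "in"): 0.8,
        ("was", "birthed"): 0.01, ("birthed", "in"): 0.05,
    }

    def bigram_logprob(a: str, b: str) -> float:
        # Unseen bigrams get a small floor probability instead of zero.
        return math.log(TOY_BIGRAMS.get((a, b), 1e-6))

    def lm_score(sentence: list[str]) -> float:
        # Sum of bigram log-probabilities; higher means more fluent.
        return sum(bigram_logprob(a, b) for a, b in zip(sentence, sentence[1:]))

    def realise(candidates: list[list[str]]) -> str:
        # Overgenerate and rank: keep the candidate the language model prefers.
        return " ".join(max(candidates, key=lm_score))

    print(realise([["John", "was", "born", "in", "Leeds"],
                   ["John", "was", "birthed", "in", "Leeds"]]))
    # -> "John was born in Leeds"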

This pipeline allows for much control over the output text, permits a high degree of confidence that the text will be grammatical and semantically accurate by design, and, most importantly, allows for adapting and adjusting the output text according to a user model. This has been called the “deep” model for generation, in contrast with “shallow” methods based on templates, as outlined in the following.

1.6.1 Shallow vs. Deep NLG

It has been noted that there is a continuum between shallow and deep NLG methods (Busemann, 2011). Considering “canned text” to be at the shallow end of the scale and “deep” NLG to be at the other, a number of intermediate approaches can be situated between them, depending on what modules and functionality they implement, as represented in Figure 1.3:

• Prefabricated texts (shallow)
• “Fill in the slots”
• With flexible templates
• With aggregation
• With sentence planning
• With document planning (deep)

Figure 1.3 Shallow to deep NLG transition (based on Busemann, 2011)

The approach presented herein stands at the shallow end of this scale, as the generation is not inherently knowledge-based or theoretically motivated (Busemann & Horacek, 1998), but based on sentence templates with “slots” in them. These templates are sequences of text tokens of two types: static text (words or punctuation) and placeholders for values, linking each slot to the value of a property or variable. A template as we define it here could take the form of:

[name] was born on [dateOfBirth].

In this example, [name] and [dateOfBirth] are the names of properties whose values would be substituted into the sentence in place of the properties themselves (e.g. “John Doe was born on 14 October 1066.”).
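
A template of this kind can be filled with a few lines of code. The following Python sketch (with an invented value dictionary) illustrates the substitution; it shows the underlying idea only, not the LOD-DEF implementation.

    import re

    def fill(template: str, values: dict[str, str]) -> str:
        """Replace each [property] slot with the value of that property."""
        return re.sub(r"\[([^\]]+)\]", lambda m: values[m.group(1)], template)

    print(fill("[name] was born on [dateOfBirth].",
               {"name": "John Doe", "dateOfBirth": "14 October 1066"}))
    # -> "John Doe was born on 14 October 1066."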

Templates can deal to a large extent with the issue of lexicalisation, for they already contain many of the words used and as such are a lexical choice, and with that of aggregation, as they can contain complex grammatical structures where only values have to be substituted in.

The approach presented herein also incorporates characteristics of deeper NLG, by performing a kind of document planning as described in 3.3.5.


Chapter 2

Previous approaches

A number of previous approaches to Natural Language Generation for the Semantic Web have been adopted. Of these, the majority have been concerned with verbalising OWL ontologies (cf. Stevens et al., 2011; Hewlett & Kalyanpur, 2005; Liang et al., 2012), and the verbalisation of factual data has remained somewhat under-addressed. In the following I situate my work in the context of previous efforts by providing an overview of the most relevant ones.

2.1 Hand-coded, rule-based

In a first category, there have been a number of approaches to NLG for the SW that employed a deep NLG architecture like the one described in section 1.6. Perhaps the most interesting of these to date is the NaturalOWL system (Galanis & Androutsopoulos, 2007), which could potentially be more easily applied across domains. It builds upon the M-PIRO system (Androutsopoulos et al., 2001), used for multilingual generation of museum exhibit descriptions, but adapts it to use OWL ontologies and RDF data. The classes and properties in the ontologies are explicitly annotated with text in multiple languages to carry out the generation, which enables the system to generate multilingual text from RDF data.

2.1.1 Assessment

NaturalOWL is a versatile and powerful system, including a full NLG pipeline adapted from an already-successful system with commercial applications. It can achieve high-quality output, and is multilingual by design. In essence, this system is a solid implementation for Linked Open Data of the NLG architecture described in section 1.6. As such, we can see in it the same benefits and shortcomings. Great control over the output comes with a requirement for many expert man-hours and limited transferability between textual domains. Furthermore, the approach requires publishers of Linked Data to provide non-trivial annotations of the ontologies they publish. It remains to be seen to what extent this is a realistic expectation.

2.2 Generating directly from RDF

A competing approach is generating directly from RDF with few hand-coded rules, particularly representative of which is the Triple-Text system of Sun & Mellish (2007). The authors note that RDF predicates typically encode rich linguistic information; that is, their URIs are meaningful chunks of natural language. Sun & Mellish (2007) exploit this information to automatically generate natural English text from triples without using domain-dependent knowledge.

Their approach, the Triple-Text (TT) system, is based on processing the predicate of the triple. Words forming predicates that are meaningful in English are typically concatenated into one string with no spaces or space-equivalent characters (e.g. underscores), but they are also typically “camel-cased”, that is, uppercase characters are used to mark the word boundaries (e.g. “hasProperty”, “wasBornIn”). This makes it easy to tokenise the predicate into its building blocks (e.g. “has” + “property”, “was” + “born” + “in”). This sequence of tokens is then assigned part-of-speech (POS) tags and classified into one of six categories, depending on its format (e.g. “has” + [unit]* + [noun]). For each category a different rule is applied to build the output sentence (e.g. “has” + det* + “units” + “noun”).

As an example, given the triple “North industryOfArea manufacturing_sector”, the system generates the sentence “The industry of the North is manufacturing sector.”
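
To make the predicate-processing idea concrete, here is a small Python sketch of camel-case tokenisation plus one invented verbalisation rule in the spirit of TT; the actual categories and rules of Sun & Mellish (2007) are more elaborate.

    import re

    def split_camel_case(predicate: str) -> list[str]:
        # "industryOfArea" -> ["industry", "of", "area"]
        return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", predicate)]

    def verbalise(subject: str, predicate: str, obj: str) -> str:
        words = split_camel_case(predicate)
        obj_text = obj.replace("_", " ")
        if "of" in words:
            # Invented rule, loosely analogous to TT: "xOfY" -> "The x of the SUBJECT is OBJECT."
            head = " ".join(words[: words.index("of")])
            return f"The {head} of the {subject} is {obj_text}."
        return f"The {subject} {' '.join(words)} {obj_text}."

    print(verbalise("North", "industryOfArea", "manufacturing_sector"))
    # -> "The industry of the North is manufacturing sector."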

2.2.1 Assessment

Simple as it is, this approach is interesting in many ways: it is reasonably domain-independent, it is very fast, inexpensive and intuitive to deploy, and it can provide an immediate lexicalisation of triples to natural language text without the need for domain-dependent knowledge.

However, its shortcomings severely limit its applications. First, generation from single RDF triples is limited by the fact that the relations encoded in a triple hold between only two entities. Human discourse is on average much richer, often including relations that can involve more than two entities, like ditransitive verbs, which require a subject, a direct object and an indirect object (e.g. "John gave the book to Mary"). The authors point this out and suggest that the next step is generating from multiple triples. Second, no mechanism is provided to perform document planning (as described in 1.6) when dealing with a collection of triples that should be lexicalised together. Finally, the output is not always grammatically correct and cannot be easily adapted to a specific domain, as the system does not take into account the ambiguity inherent in natural language (e.g. polysemy) and relies on using the same words found in the RDF predicates for realisation.

The baseline implemented as part of this project (see section 3.4) draws much inspiration from this approach, extending it to use rdfs:label properties for verbalisation and combining it with a baseline Referring Expression Generation algorithm.

2.3 Unsupervised trainable NLG

Perhaps the most relevant previous work on trainable NLG is that of Duboue & McKeown (2003), who describe a system that learns content determination rules statistically from parallel text and semantic input data. They collect the information in the knowledge base for this application by crawling websites with celebrity fact-sheets and obtain the biography texts for these celebrities by crawling other websites.

They align the data with the text (i.e. the “matching” stage) using a two-step approach. In the first step they identify the spans of text that are verbatim copies of values in the data. The second step is building statistical language models and using their entropy to determine whether other spans of text are indicative of other values being expressed. There is an amount of inference and reasoning involved in this approach, such as deciding that “young” refers to someone within a certain span of ages. This is specifically applied to short biographies.


2.3.1 Assessment

This work focuses on a limited domain and only on content determination. The output of their system is still exclusively dependent on hand-written rules and is specifically targeted at the constrained domain of biography generation. Nonetheless, their approach to the automatic learning of content determination is undoubtedly far superior to what I present in this thesis. The approach taken in this project is equivalent to the baseline of Duboue & McKeown (2003), or the first matching step in their algorithm: only literal values found in the data are matched in the text.

2.4 Automatic summarisation

Closely related to the approach presented herein is the wealth of work done in the field of automatic summarisation, which consists in creating a summary of a text by automatically choosing the most relevant information in it and collating it². A subfield of automatic summarisation, frequently called “text-to-text” Natural Language Generation (to differentiate it from full “concept-to-text” NLG), deals with the generation of documents by extracting information from multiple input documents.

This is related to the present project insofar as multi-document summarisation also deals with extracting sentences and concatenating them in an organised way to create a new document. However, a main difference stands out: text-to-text NLG only deals with processing documents that are all about the same entity, subject or topic and extracting the most relevant sentences from those documents to create a new document. This stands in contrast with the problem we are tackling here: we want to generate natural language describing an instance in an ontology for which there may be no such text available. We then need to identify sentences about an entity that will be transferable, that is, that will be true of other entities of the same type; more particularly, sentences that express values of properties that other entities of that type will have.

Where such sentences are not directly available in the text, we can try to modify them to make them transferable. This is closely related to the frequently addressed task of sentence compression in automatic summarisation, which consists in creating a summary of a single sentence. Often this is addressed using tree operations, where a sentence is parsed and the parse tree is modified, the most frequent operation being the deletion of constituents. Where for summarisation these constituents are removed because they are deemed of lesser importance, in the present approach they are deleted where there is no evidence for them in the data.

A number of previous approaches to this deletion problem exist (e.g. Cohn & Lapata, 2009; Filippova & Strube, 2008), but here I specifically borrow the term syntactic pruning from the work of Gagnon & Da Sylva (2006). Their approach is to parse the sentences with a dependency parser and apply a number of hand-built pruning rules and filters to simplify and compress those sentences. The approach presented here is similarly rule-based.

² Methods for summarisation are generally classified into extractive, i.e. extracting sentences from the text based on their salience score and joining them, and abstractive, i.e. producing a new, shorter text (Gagnon & Da Sylva, 2006). Both of these categories are relevant here.


Chapter 3

Design

3.1 Design overview

The present approach is based on two main intuitions. One is that if we can identify sentences in text expressing factual information about an entity that are transferable (i.e. would also be true of another entity of the same class), we can use them as sentence templates for that class.

Figure 3.1 System overview

This requires that we first identify literal values in the text that are the values of properties in the data, and then select and edit the sentences to make sure they express no information that would not be true of another entity, i.e. that is not a value expressed from the input data. An example of an unsuitable template would be “[foaf:name] was one of the greatest [rdf:type].”, as it contains a value judgement that is not supported by the data.

The second intuition is that we can model the content of an article by collecting statistics about the properties whose values we have spotted in the text.


Figure 3.2 Aligning data and text: spotting property values

As an illustration of this intuition, consider the RDF triples and text shown in Figure 3.2. We can paraphrase the data in RDF as “:Johann_Sebastian_Bach is an entity of type German Composers, his death place is Leipzig, his birth place is Eisenach”, etc.

In this particular example, we can align all the values of all the RDF properties with spans of text (illustrated by arrows in Figure 3.2); this is here called spotting. Although it is often the case that string literals in the RDF can be matched with identical strings in the text, sometimes a conversion of these values is required. For example, the value “1750-07-28” is matched with the non-identical string “28 July 1750”, as these different formats for dates represent in fact the same value.
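
This conversion step can be handled by generating plausible surface variants of each literal before searching for them. A minimal Python sketch follows; the chosen date formats are illustrative, not an inventory of what LOD-DEF supports.

    from datetime import date

    def surface_forms(value: str) -> set[str]:
        """Return surface variants under which a literal might appear in text."""
        forms = {value}
        try:
            d = date.fromisoformat(value)                  # e.g. "1750-07-28"
            forms.add(d.strftime("%d %B %Y").lstrip("0"))  # "28 July 1750"
            forms.add(d.strftime("%B %d, %Y"))             # "July 28, 1750"
        except ValueError:
            pass                                           # not a date; keep the verbatim form
        return forms

    print(surface_forms("1750-07-28"))
    # -> {'1750-07-28', '28 July 1750', 'July 28, 1750'} (in some order)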

Having spotted the values, we can assume we have spotted the properties that generated them, which then allows us to replace each of those values in the text with a symbolic link to the property that generated it and extract the sentence template:

[foaf:name] (b. [dbont:birthPlace], [dbont:birthDate], d. [dbont:deathPlace], [dbont:deathDate]) was a [dbprop:shortDescription].

This template contains no information that is not supported by the data, and is therefore transferable: it can be instantiated for any other entity of the same class (in this case, yago:GermanComposers³) for which we have the same properties, e.g.:

Ludwig van Beethoven (b. Bonn, 17 December 1770, d. Vienna, 26 March 1827) was a German composer.
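
The extraction direction, replacing each spotted value with a slot naming its property, can be sketched in a few lines of Python. This is a simplification (the real pipeline works on parsed, pruned sentences rather than raw string replacement), and the property values below are illustrative.

    def extract_template(sentence: str, spotted: dict[str, str]) -> str:
        """Replace each spotted literal value with a slot naming its property."""
        # Replace longer values first so a long form is not clobbered by a substring.
        for prop, value in sorted(spotted.items(), key=lambda kv: -len(kv[1])):
            sentence = sentence.replace(value, f"[{prop}]")
        return sentence

    spotted = {
        "foaf:name": "Johann Sebastian Bach",
        "dbont:birthPlace": "Eisenach",
        "dbont:birthDate": "21 March 1685",
        "dbont:deathPlace": "Leipzig",
        "dbont:deathDate": "28 July 1750",
        "dbprop:shortDescription": "German composer",
    }
    print(extract_template(
        "Johann Sebastian Bach (b. Eisenach, 21 March 1685, d. Leipzig, "
        "28 July 1750) was a German composer.", spotted))
    # -> the template shown above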

This approach can be seen to a certain extent as a conceptual hybrid between shallow NLG systems, where the information to be represented is stored in a symbolic data structure, and text-to-text NLG, where the content determination and document structuring are automatically learned from text, and the surface realisation is carried out via templates that are also automatically learned from text.

³ Throughout this thesis, I refer to classes used by DBpedia with the namespace prefix yago:, defined as @prefix yago: <http://dbpedia.org/class/yago/>

The hypothesis presented here is that this system, trained to learn document planning and sentence templates, will perform better in subjective human evaluation than a simple baseline generating directly from English RDF predicates. The system is also ranked against human-generated text for the same data, which is expected to perform better, and so the hypothesis is that this constitutes an upper bound.

3.2 Goal

The system must generate descriptions, approximately equivalent to Wikipedia “stubs”, for any entity in an RDF ontology, focussing on factual data. This is therefore the one hard-wired communicative goal of the system.

This system must be inexpensive and fast to deploy, using readily available resources and avoiding any manual annotation of data, while also keeping hand-written rules to a minimum and avoiding domain-specific ones (e.g. rules specific to biographies, descriptions of cities, etc.).

At the same time, it should significantly overcome the shortcomings of direct generation from RDF triples by performing content determination and document structuring as described in section 1.6. Given that the rules for this are not present in the data, they must be statistically acquired from text aligned with the data.

Crucially, the system must be able to extract sentence templates from the training text for use in generation. These templates should verbalise the values of the properties identified in them in the training text. Most importantly, these templates should be transferable between instances of the same class: the system should only extract sentences that would hold true for other instances of the same class with different property values.

Also, this system is specifically targeted at Linked Open Data, which implies that it should exhibit a degree of robustness to inconsistencies and redundancies in the data. Linked Open Data, unlike a relational database, has a very flexible schema, so the system should be able to deal with, e.g., a property value being available for one instance of a class but not for another. Likewise, it should not depend on hand-picked lists of predicates, with some exceptions for very widely used ones (e.g. rdfs:label).

As opposed to e.g. Stevens et al. (2011), the aim is not to fully process OWL semantics; the focus here is exclusively on factual, literal, non-temporal data that can be expressed in quantities and string literals.

Finally, the system should be easy to adapt to other domains. A specific format of text and a specific schema for the data should not be required, as long as one article of descriptive text per entity is available, together with RDF triples whose properties are aligned with an instance in an ontology. It should also be conceivable to adapt the system to other languages for which the required resources exist (e.g. a trained parser).


3.3 Tasks

3.3.1 Aligning data and text

Automatically aligning the RDF data with the text can be seen as a case of Named Entity Recognition (NER), a well-established task used in processes such as Text Mining and Sentiment Analysis, which consists in identifying and annotating a number of “entities” in a text (Feldman & Sanger, 2007). These entities can be names or values, such as quantities (both as digits and as string literals), dates, etc.

General-purpose NER systems are usually based on three different matching techniques: dictionaries (also known as gazetteers), regular expressions and trained classifiers (Conditional Random Fields, MaxEnt, etc.). In this implementation, the NER task only uses gazetteers, regular expressions and heuristics for normalising and recognising quantities. Although this is a very simple approach, for this specific project it is adequate as a baseline: given that we have prior knowledge about what entities we expect to find in the text, the problem is limited to finding them. Cases of ambiguity are much reduced, and the problem is limited to recognising values when they appear.
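
A stripped-down version of gazetteer-based spotting might look as follows in Python; the property names and surface forms are illustrative, and a real gazetteer would be populated from the entity's triples (e.g. via the surface-form variants sketched earlier).

    import re

    def spot(text: str, gazetteer: dict[str, set[str]]) -> list[tuple[str, str, int]]:
        """Return (property, matched form, character offset) for each gazetteer hit."""
        hits = []
        for prop, forms in gazetteer.items():
            for form in forms:
                for m in re.finditer(re.escape(form), text):
                    hits.append((prop, form, m.start()))
        return sorted(hits, key=lambda h: h[2])

    gazetteer = {
        "dbont:birthPlace": {"Eisenach"},
        "dbont:deathPlace": {"Leipzig"},
    }
    print(spot("Bach was born in Eisenach and died in Leipzig.", gazetteer))
    # -> [('dbont:birthPlace', 'Eisenach', 17), ('dbont:deathPlace', 'Leipzig', 38)]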

Selecting the RDF properties whose values are to be spotted in the text depends on the domain of the text and on the way the data is encoded in RDF/OWL. We can think of this as a window over the graph, defined by a number of edges or relations. The ideal distance in edges to consider is dataset-specific: depending on the design of the dataset, very complex relationships may exist between nodes in the graph. In the present approach, only triples that have as subject the main topic of the article being processed (the title entity) are retrieved (i.e. s p o, where s = title entity URI). These triples can have any property and any value. This was deemed sufficient for this dataset, and retrieving longer paths through the graph was found to significantly increase the complexity of the extraction.

3.3.2 Extracting templates

The system must extract sentence templates with “slots” in them, as described in 1.6.1. These slots correspond to spotted properties of the class being described. Like others before me (Sripada et al., 2003), I find that Grice's Cooperative Principle and its maxims (Grice, 1975) capture fundamentally important aspects of an NLG system. Crucial to the approach presented herein is the maxim of quality: “contribute only what you know to be true; do not say that for which you lack evidence”. As per this maxim, the output of the system should be truthful, which means that we should make sure the textual output is supported by evidence in the input data. The sentence templates extracted should then be transferable, that is, they should hold true for any entity of a given class that has the same properties as the entity for which the original sentence was written.

In order to realise this, similarly to Gagnon & Da Sylva (2006), the system needs to parse the source sentences in order to prune them of constituents; in our case, these constituents are those for which we have no evidence in the data.

Parsing consists in taking a sentence in natural language and determining what its most likely

parse tree would be, i.e. how its grammatical constituents are clustered and nested. It is

important to note here that parsing is an area of active research and, due to natural language


ambiguity, far from a solved problem; it is therefore another step in the pipeline that is likely to be a significant source of errors.

3.3.3 Dealing with Linked Open Data

A brief examination of data from the DBpedia brings key issues to the fore. First, properties of

an instance can have multiple values. For example, let us examine the following triples:

:Carl_Maria_von_Weber dbont:birthPlace :Eutin.
:Carl_Maria_von_Weber dbont:birthPlace :Holstein.

The property dbont:birthPlace here has two values, both of them are correct, and both of them

may be necessary, as Eutin is contained in Holstein. It would perhaps have been better to store

this information only as the more finely-grained value (:Eutin) and to leave the task of performing inference to reach the coarser-grained container (:Holstein) to the smart agent. However, given

that both these values are in the data, the system must be able to deal with this. Our approach

is to group together spotted entities in text that are the values of the same property and to keep

track of properties spotted as lists in text. When generating, we can use this information to

determine if only one surface realisation must be chosen from the options given or if all of them

must be shown as a list.

Second, multiple properties can have the same value. These properties may have very different

meanings, as in the following example:
:Carl_Orff dbont:birthPlace :Munich.
:Carl_Orff dbont:deathPlace :Munich.

This creates a different problem with arguably more difficult solutions, as it requires disambiguating between surface realisations. A number of approaches could be applied to this, for instance some variant of the EM algorithm.

Third, there are many properties that have the same meaning and yet are often present for the

same entity, which makes them completely redundant. These redundant triples are either kept for

backwards compatibility or due to an incomplete alignment of vocabularies when aggregating different sources of data. Consider, for example, these triples:
:Johann_Sebastian_Bach dbont:birthPlace :Eisenach.
:Johann_Sebastian_Bach dbprop:birthPlace :Eisenach.
:Johann_Sebastian_Bach dbont:placeOfBirth :Eisenach.
:Johann_Sebastian_Bach dbprop:placeOfBirth :Eisenach.

It is immediately clear in this example that the four predicates shown are actually one and the

same in meaning, and their values confirm this. The equivalence of these predicates is well known

and documented (Mendes et al., 2012), but it is not retrievable from the triples themselves. OWL

implements a system to link equal URIs, via the owl:sameAs property (Bechhofer et al., 2004),

which is available for some properties but not for others4.

The solution to the last two problems is to compute predicate similarity (by essentially counting

the times the two predicates have the same value) and grouping significantly co-occurring

properties together into predicate “pools”. The similarity of predicates is their similarity

4 While we would expect this to be a solved problem when dealing only with data from a self-contained knowledge

base such as DBpedia, it can be seen as a good example of a common problem of LOD. As such, this is an opportunity

to rise to the challenge and provide a solution for it.


coefficient, conceptually identical to Dice’s coefficient. This is computed by dividing the number

of times these two properties for an entity of class C have the same value over the times they

appear for entities of that class. This is only computed for the set of entities of that class for

which both properties are defined (i.e. have a value other than a null string).

For example, the four predicates seen above are frequently grouped into a single pool, which

takes the name of the most frequent of these predicates (or the first in alphabetical order in case

of a tie).

Once this similarity metric has been computed, it is used to discard those sentence templates that contain conflicting predicates in their slots, e.g. where "birthPlace" and "deathPlace" have been spotted with the same value but belong to different "pools", that is, over all the text their Dice coefficient is smaller than a constant. Other options were considered, such as a semantic

similarity metric between the rdfs:label of the predicates and the context words. Duboue &

McKeown (2003) have a different approach to clustering which includes hand-input rules for

inference (e.g. people whose age is 1 < age < 25 are labelled “young”). For this baseline

implementation however, the simplest option was chosen.

3.3.4 Modelling different classes

The system has a single communicative goal: the description of an instance of a class. However,

descriptions of entities belonging to different classes can be seen as belonging to different textual

sub-domains. What is relevant in the description of, for instance, a rock band, is unlikely to be

relevant (or even apply) to the description of a species of animal. Similarly, the same predicates

need not be expressed in the same order for all classes. Finally, sentence templates can also be

expected to depend on the class of the entity, both because of the predicates they realise and of

the lexical items and structures used in them. To a certain degree, this also applies to classes

that belong to a common super-class, e.g. while both being instances of the super-class

“Company”, the airline KLM and the software company Oracle should intuitively be treated as

instances of different classes.

This means we need to choose, from those available, the class that an entity would be most

prototypical of, as defined by Prototype Theory (Rosch, 1973). This class must have the right

granularity or level of detail, both for training and generating. It must not be so general that the statements become generic or irrelevant, nor so specific that the extracted templates do not generalise to other entities of the same class.

This task is surprisingly nontrivial, as the standard class inheritance mechanism implemented by

RDF (i.e. “entity rdf:type class”) allows for multiple class inheritance, and as an initial

exploratory analysis of the data shows, this mechanism is very frequently used and a single entity

typically belongs to a number of classes (for example, J.S. Bach belongs to both

“ComposersForPipeOrgan” and “PeopleFromEisenach”5). This makes choosing the “right” class

for an entity a problem that requires nontrivial inference to solve.

5 It could be argued that this information would be much better encoded using properties, e.g.

”:Johann_Sebastian_Bach :composedForInstrument :PipeOrgan”

”:Johann_Sebastian_Bach :bornIn :Eisenach”

This, however, would require extracting (mining) this information from the names of Wikipedia categories.


For the present approach I develop a baseline class selection algorithm, essentially consisting in

computing a score for each class based on term frequency scores (i.e. the count of times a word

appears in the class names of the entity) and selecting the n-highest. The intuition is that

category overlap can help determine which class is more central to the entity, and help choose

the class with an adequate granularity6. That is, if an entity belongs to a number of classes with

“composer” in the name, the degree of confidence should be higher that the entity’s class should

be “composer” or a subclass of it. A detailed explanation of this is offered in section 4.4.

3.3.5 Document planning

N-grams are sequences of n successive items, and n-gram models capture the probability of a given sequence of items appearing (Jurafsky & Martin 2009, pp. 117-124). Such models are

frequently used to model language, and have also been applied to capturing the likelihood of a

sequence of concepts, rather than words, in text. This approach has been previously applied by

e.g. Galley et al. (2001), who used it to aid in document structuring for a dialogue system. Their

“word-concept n-grams” are in our situation equivalent to RDF predicates.

Duboue and McKeown (2003) also used n-grams to refine their statistical model for content

determination. Here I implement their baseline: we aim to learn content determination by

collecting unigrams (1-grams) of spotted predicates in the text: if the frequency of a predicate

was below a threshold in the articles for a given class of entity, even if an instance of this class

has this property in the data, the system should not output it.
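To make this concrete, the following is a minimal Python sketch of this unigram-based content determination; the function names and the per-article counting are my assumptions, not the exact LOD-DEF code.

from collections import Counter

def train_unigrams(spotted_predicates_per_article):
    """Count in how many training articles each predicate was spotted."""
    counts = Counter()
    for article_predicates in spotted_predicates_per_article:
        counts.update(set(article_predicates))  # at most one count per article
    return counts

def select_content(entity_predicates, unigram_counts, num_articles, threshold=0.5):
    """Keep only predicates spotted in at least `threshold` of the articles."""
    return [p for p in entity_predicates
            if unigram_counts[p] / num_articles >= threshold]

# Example: three training articles for one class, then filtering an entity's data.
training = [["dbont:birthDate", "dbont:birthPlace"],
            ["dbont:birthDate", "dbprop:weight"],
            ["dbont:birthDate", "dbont:birthPlace"]]
counts = train_unigrams(training)
print(select_content(["dbont:birthDate", "dbprop:weight", "dbont:abbreviation"],
                     counts, num_articles=len(training)))
# -> ['dbont:birthDate']: dbprop:weight (1/3) and dbont:abbreviation (0/3) fall
# below the threshold and are not output.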

According to Reiter and Dale (2000), document structuring carries out more complex selection

and ordering of information than just sequencing; it treats the text as a tree structure and

clusters related items of information. Given that the sentence templates extracted from the text express several properties, we can think of them as partial trees, part of the bigger tree

required for document structuring, so I expect that extracting these templates and ordering them

in the right way will yield good results.

3.4 Baseline

In order to evaluate this approach, a comparable baseline approach is necessary. The baseline

generation system I implement here is exclusively based on direct generation from RDF triples.

It is loosely based on Sun & Mellish (2007), in that single triples are used to generate single

sentences and I use a shallow linguistic analysis of the words in the predicate to determine the

structure of these sentences.

However, a number of differences stand out. First, as opposed to Sun & Mellish (2007), this

baseline does not directly split out the “camel case” in predicates. RDF predicates, being URIs

themselves, have properties like rdfs:label, and these triples are available on DBpedia. Given this,

I first attempt to retrieve the rdfs:label of the predicate for use in generation. It is an underlying

6 This does not attempt to provide a definitive solution to this problem, but to solve it to a satisfactory degree for the

present application. A more sophisticated approach would perhaps have to account for the fact that the “right” class

necessarily depends on the application and the context. Consider, for instance, deciding whether Mauritius is an IslandCountry or an AfricanCountry, when it is just as prototypical a member of both categories.


assumption of this approach that labels in all the languages we are concerned with (here

exclusively English) will be available in the triples. If a label is not available, the system then backs off to splitting the words in the predicate URI.

The first sentence created by the baseline is an expression of the class of the entity, formed by

the name of the entity (i.e. its rdfs:label) followed by “is a” and the rdfs:label of the class of the

entity, e.g. “Johann Sebastian Bach is a German composer.” This class is chosen using the class

selection algorithm detailed in section 4.4.

All subsequent sentences are composed according to the following logic:

• If the retrieved or created label starts with an auxiliary verb (i.e. “is” or “has”), the

article “a” is inserted after that first word, the first word of the sentence is made to be

the personal pronoun nominative (i.e. “he”, “she”, “it”), and the value(s) are appended

to the sentence, separated with a colon.

For example, from London_Heathrow_Airport foaf:isPrimaryTopicOf
<http://en.wikipedia.org/wiki/London_Heathrow_Airport>, the resulting sentence is:
"It is a primary topic of <http://en.wikipedia.org/wiki/London_Heathrow_Airport>"

• If the label starts with “was”, the sentence is created using the template “[personal pronoun] [property label] [values].”

• Otherwise, the sentence is created with "[possessive pronoun] [property label] [is/are] [values]."

The predicate label is converted to plural if there are several values, and these are presented as a list, e.g. "His names are X, Y and Z". It is an implied assumption that predicate labels will be in the singular. A sketch of this composition logic follows below.
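The following is a minimal Python sketch of the three-branch logic above, assuming English-only generation and a naive "-s" pluralisation; the helper names are hypothetical.

def join_values(values):
    """Render one value as-is, several as 'X, Y and Z'."""
    if len(values) == 1:
        return values[0]
    return ", ".join(values[:-1]) + " and " + values[-1]

def compose(label, values, pronoun="he", possessive="his"):
    words = label.split()
    if words[0] in ("is", "has"):
        # e.g. "is primary topic of" -> "It is a primary topic of <values>"
        rest = " ".join(words[1:])
        return f"{pronoun.capitalize()} {words[0]} a {rest} {join_values(values)}."
    if words[0] == "was":
        return f"{pronoun.capitalize()} {label} {join_values(values)}."
    copula = "is" if len(values) == 1 else "are"
    prop = label if len(values) == 1 else label + "s"  # naive pluralisation
    return f"{possessive.capitalize()} {prop} {copula} {join_values(values)}."

print(compose("birth place", ["Eisenach"]))   # His birth place is Eisenach.
print(compose("name", ["X", "Y", "Z"]))       # His names are X, Y and Z.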

3.4.1 Coherent text

As opposed to Sun & Mellish (2007), who only generated single sentences from single triples, we

are dealing with a collection of triples encoding information about a single entity, and we wish to

present this information as a coherent text, made up of several sentences connected using

coherence devices like coreference.

For this, the baseline implements a very simple Referring Expression Generation algorithm,

which operates in the following way: The initial reference to the title entity expresses the full

name of the entity being described, by retrieving its foaf:name or rdfs:label for the language we

are generating in (throughout this paper, always “en” for English). Subsequent references to that

entity will use a personal pronoun (e.g. “he”, “she”, “it”). The system specifically retrieves the

value for foaf:gender for the title entity whose description is being generated and chooses the

right pronoun and possessive based on its value (e.g. “he” and “his” for “Male”, “she” and “her”

for “Female”).
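A sketch of this gender-to-pronoun mapping; the fallback to "it"/"its" for missing or other foaf:gender values is my assumption.

# Pronoun choice from foaf:gender; the "it"/"its" fallback is an assumption.
PRONOUNS = {"male": ("he", "his"), "female": ("she", "her")}

def referring_pronouns(foaf_gender):
    return PRONOUNS.get((foaf_gender or "").lower(), ("it", "its"))

print(referring_pronouns("Male"))   # ('he', 'his')
print(referring_pronouns(None))     # ('it', 'its')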


This is a simple approach that produces acceptable results in English. It is perhaps relevant to

note here that different languages will have different requirements for the treatment of

grammatical gender and probably a more complex approach would be necessary. In the baseline

implementation of the system, no attempt to order the sentences is made. Also, no attempt is

made at document planning, with one exception: simple heuristics and an "ignore list" filter properties with inappropriate values out of the input. Specifically, values containing more than ten words and values which are integers below 31 are ignored, together with a number of predicates such as purl:subject. This essentially helps filter out output that would be too verbose and visually strident, with the aim of making the baseline competitive in evaluation.


Chapter 4

Implementation: Training

In this chapter I present an implementation of the training phase according to the design

principles outlined in Chapter 3: the LOD-DEF system. The system is trained on a corpus of

text documents, each of which is an article from the Simple English Wikipedia, and RDF triples

for the same entity, retrieved at runtime from DBpedia Live. The full training pipeline is

described in this chapter: what steps are taken to train the NLG model from the text and data.

The diagram in Figure 4.1 represents the pipeline for each article, while the post-processing steps

are represented in Figure 4.3.

4.1 Obtaining the data

4.1.1 Wikipedia text

The text from the Simple English Wikipedia was downloaded from the Wikipedia dump site7.

This is one large compressed XML file, containing the latest revision of all the articles on the

SEW, together with a small amount of metadata, e.g. author information for the most recent

edit. This text is in "wiki markup" format, including information like infoboxes, inter-language links, markers such as stub and redirection, etc.

This text then had to be processed to remove all the mark-up unnecessary to the task, leaving only the article text with the links to other articles. This step is done only once, prior to

running the training pipeline on any article. Steps 2-7 apply to every processed article.

4.1.2 DBpedia triples

RDF triples from DBpedia are retrieved at runtime for each article via SPARQL queries. The

standard SPARQL endpoints for this were the ones provided by DBpedia Live8. The endpoints

proved to be quite problematic, as they would quite frequently be overloaded, which meant slower response times, and several times they were offline for hours. To deal with

this and to speed up testing, I implemented an offline triple cache using SQLite3, as an

alternative to setting up a special triplestore for this project, which would have been a

considerable expenditure of time and resources.
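A minimal sketch of such a cache follows; the table schema and function names are assumptions, not the LOD-DEF code. Results of the per-entity query are stored in SQLite on first retrieval, so repeated runs avoid the endpoint entirely.

import sqlite3

conn = sqlite3.connect("triple_cache.db")
conn.execute("""CREATE TABLE IF NOT EXISTS triples
                (subject TEXT, predicate TEXT, object TEXT)""")

def cached_triples(subject_uri, fetch_from_endpoint):
    """Return [(predicate, object), ...], hitting the endpoint only on a miss."""
    rows = conn.execute("SELECT predicate, object FROM triples WHERE subject=?",
                        (subject_uri,)).fetchall()
    if rows:
        return rows                               # cache hit: no network call
    fetched = fetch_from_endpoint(subject_uri)    # callable returning [(p, o), ...]
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)",
                     [(subject_uri, p, o) for p, o in fetched])
    conn.commit()
    return fetched

# usage (my_sparql_fetch is a hypothetical function wrapping the SPARQL query):
# cached_triples("http://dbpedia.org/resource/Johann_Sebastian_Bach", my_sparql_fetch)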

Triples are retrieved using the SPARQL query:

SELECT ?pred, ?obj WHERE { <http://dbpedia.org/resource/%s> ?pred ?obj . }

7 http://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
8 http://live.dbpedia.org/sparql and http://dbpedia-live.openlinksw.com/sparql


This returns all the triples in the DBpedia with the entity marked by %s as subject9, using its

“wiki link” as search keyword. The triples returned by this query are then first filtered by

language, as we are only concerned with triples for English in this case10. Triples with literal

values in English often do not have a marked language suffix, as English is often considered the

“default” language, so we need to include both “en” and “” (blank string) as valid languages.

Second, triples using a predicate from an “exclude” list are filtered out. These are predicates that

we found to be useless for our purpose and to induce noise, add data overhead and increase

training time. Examples of these predicates are "http://www.w3.org/2002/07/owl#sameAs" and

"http://purl.org/dc/terms/subject" (the values of which are already available as rdf:type

properties connected to YAGO classes).
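A sketch of this client-side filtering, assuming triples arrive as (predicate, object, language) tuples (the representation is mine):

EXCLUDE = {"http://www.w3.org/2002/07/owl#sameAs",
           "http://purl.org/dc/terms/subject"}

def filter_triples(triples):
    """Keep triples whose literal language is 'en' or unmarked ('') and
    whose predicate is not on the exclude list."""
    return [(pred, obj) for pred, obj, lang in triples
            if lang in ("en", "") and pred not in EXCLUDE]

triples = [("http://purl.org/dc/terms/subject", ":Composers", ""),
           ("rdfs:label", "Johann Sebastian Bach", "en"),
           ("rdfs:label", "Jean-Sébastien Bach", "fr")]
print(filter_triples(triples))   # only the English rdfs:label survives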

Figure 4.1 Training pipeline for each article

9 The special %s marker is interpreted by a formatting function which substitutes it for a parameter string.
10 Using the FILTER keyword in SPARQL queries resulted in much longer response times from the endpoint, together

with occasional time-outs, more so the more complex a filter, so I decided to use it sparingly and filter the results

client-side for increased robustness and faster execution. This is perhaps less than ideal within the vision of the

Semantic Web, but due to the limitations of the available servers, it is often more practical to retrieve more

information than needed and filter it client-side rather than relying on filtering by the SPARQL endpoint.


4.2 Tokenizing and text normalisation

The first step in the pipeline is to tokenize the Wikipedia text – separate text into words,

punctuation and sentences. Several standard tokenizers were tested for this task, and their

results proved quite unsatisfactory11. Therefore a custom-built algorithm is used, which takes into account the format of Wikipedia mark-up. Tokens are considered to be individual elements of the

sentence, and so punctuation is individually tokenized: commas, colons, semicolons, parentheses,

brackets, etc., are all considered to be individual tokens. The exception is the apostrophe (’), so

clitics like “n’t” and the genitive “’s” will be tokenized as one single word, remaining attached to

the root word.

These rules are applied in order to facilitate the next processing step, the spotting of values in

text. To further ease this, a number of processing stages normalise values found in the text,

mainly using Regular Expression matching and replacement. As an example, if a number is

expressed in words (e.g. “fourteen”) it is converted to its equivalent in digits as a string (“14”).

The same date may appear in the Wikipedia in several formats, e.g. “17 Aug 2012”, “17 Aug,

2012”, “17 August 2012”. We could either normalise these first in the text or perform a RegEx

matching for each generated date for each processed text. In the interest of processing ease and

efficiency, we do it once and normalise all dates found to one standard format: YYYY-MM-DD

(year-month-day) using all digits. This is the same format used in xsd:date values (ISO 860112),

which further eases spotting.
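As an illustration, here is a sketch of such a date normaliser; the month table and regular expression are mine, not the system's actual rules.

import re

MONTHS = {m[:3].lower(): i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

DATE_RE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+),?\s+(\d{4})\b")

def normalise_dates(text):
    """Rewrite '17 Aug 2012', '17 Aug, 2012', '17 August 2012' as '2012-08-17'."""
    def repl(m):
        day, month, year = m.group(1), m.group(2)[:3].lower(), m.group(3)
        if month not in MONTHS:
            return m.group(0)          # not a recognisable month: leave as-is
        return f"{year}-{MONTHS[month]:02d}-{int(day):02d}"
    return DATE_RE.sub(repl, text)

print(normalise_dates("He died on 28 July, 1750 in Leipzig."))
# -> He died on 1750-07-28 in Leipzig.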

4.3 Aligning: Named Entity Recognition

4.3.1 Surface realisation generation

In a first step, we generate possible surface realisations for the triples retrieved. This is equivalent

to building a gazetteer list. The way it is done will depend on the type of object of each triple.

If the object is a URI, the rdfs:label for this URI (again identified by %s in the query) is retrieved

via a SPARQL query:

SELECT ?label WHERE { <%s> <http://www.w3.org/2000/01/rdf-schema#label> ?label .
FILTER (langMatches(lang(?label), "") || langMatches(lang(?label), "en")) }

If the object is a typed literal, that is, it has an associated data type, the conversion will depend

on the data type. Strings (xsd:string) are taken as they are with no modification, while xsd:int, xsd:decimal and xsd:double values are converted to integers, and xsd:float values are converted to floats and rounded to two decimals.

4.3.2 Spotting

A second step is clustering together tokens to facilitate a maximum span matching: all surface

realisations generated in the step above are ordered by decreasing length (longest first). For

11 As an example, the Punkt sentence splitter included with the NLTK Python library would consistently fail to

separate sentences like ".. was born in Bath.Later in life.." or "…was born in [Bath].[London] was his first…".
12 http://books.xmlschemata.org/relaxng/ch19-77041.html


each surface realisation r, if a series of tokens K matches r, assuming a whitespace character

between elements of K, then K are concatenated together into a single token t, with the insertion

of one whitespace character between every two tokens in K.

Flexible matching is then done between each token t and each surface realisation r using a

regular expression to accept whitespace, hyphens or any other character occurring between two

words.
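A simplified sketch of this longest-first spotting follows; the data structures are assumptions, and index bookkeeping across overlapping merges is ignored for brevity.

import re

def spot(tokens, realisations):
    """realisations: {surface string: predicate}; returns (tokens, annotations)."""
    annotations = {}
    for surface, pred in sorted(realisations.items(),
                                key=lambda kv: -len(kv[0])):   # longest first
        # flexible matching: whitespace or hyphen between the surface's words
        pattern = re.compile(r"[\s\-]+".join(map(re.escape, surface.split())),
                             re.IGNORECASE)
        i = 0
        while i < len(tokens):
            for j in range(len(tokens), i, -1):    # try the widest span first
                span = " ".join(tokens[i:j])
                if pattern.fullmatch(span):
                    tokens[i:j] = [span]           # merge the span into one token
                    annotations[i] = pred
                    break
            i += 1
    return tokens, annotations

toks = ["Johann", "Sebastian", "Bach", "was", "born", "in", "Eisenach", "."]
print(spot(toks, {"Johann Sebastian Bach": "foaf:name",
                  "Eisenach": "dbont:birthPlace"}))
# -> merged token 'Johann Sebastian Bach' at index 0, 'Eisenach' annotated at index 4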

4.4 Class selection

This step could in practice be located anywhere in the pipeline, as it only affects what “class

model” will be updated with the learned templates and n-gram counts (as detailed in 5.9). Class

models are the data structures holding the extracted sentence templates and stored annotations

from the training documents, together with other information after post-processing. Determining

what class models the title entity13 belongs to at this stage makes it more straightforward to save

this information with no intervening temporary data structure.

As pointed out earlier, determining the “right” class for an entity is not straightforward. As a

working example, consider the rdf:type triples available for Johann_Sebastian_Bach:

:Johann_Sebastian_Bach rdf:type :AnglicanSaints, yago:ComposersForViolin, foaf:Person,
yago:ComposersForCello, yago:GermanComposers, yago:GermanClassicalOrganists,
yago:PeopleCelebratedInTheLutheranLiturgicalCalendar, yago:ComposersForPipeOrgan,
yago:ComposersForLute, yago:OrganistsAndComposersInTheNorthGermanTradition,
yago:18th-centuryGermanPeople, yago:PeopleFromSaxe-Eisenach, yago:BaroquEComposers,
yago:PeopleFromEisenach, yago:ClassicalComposersOfChurchMusic.

As the example illustrates, entities in DBpedia are aligned with YAGO classes14, which are

automatically mined from crowd-sourced Wikipedia categories.

We choose the class using the following steps:

1. We retrieve the rdfs:label values for each of the classes. Using a bag-of-words approach,

we put all these labels in a single list of words.

2. We add to this vector the words from the first sentence of the Wikipedia article.

3. We remove the stopwords from this list, i.e. prepositions, conjunctions, articles (“for”,

“from”, “and”, “a”, “the”, etc.).

4. We compute term frequency (tf) scores for each word in this list, i.e. count how many

times they occur in it.

13 The title entity is the entity that is the main topic of the article on the Wikipedia and whose URI is the subject of

the triples in DBpedia. 14 YAGO is a freely available knowledge base, derived from Wikipedia, WordNet and GeoNames (Kasneci et al., 2008).


5. We compute a normalized sum of tf scores for every class label, using the formula:

\mathit{score}(w) = \frac{1}{M} \sum_{i=1}^{N} \mathit{tf}(w_i), \qquad M = \begin{cases} N & \text{if } N > 1 \\ 1.5 & \text{if } N = 1 \end{cases}

where w is the class label string, wi is the ith element (word) in the string, tf is the term

frequency score and N is the total number of elements in w. Note: tf(wi) will return 0

when wi is a stopword. M is adjusted here to reflect a dispreference against one-word

class names.

6. We order all the classes by their score in descending order and select the n-highest as the

classes the entity belongs to. We train for several models at the same time, given that we

cannot be confident the class we chose is the only one that the entity is prototypical of.

During the experiments, we set the value of this n to 5.

As an example, training for :Johann_Sebastian_Bach, the n-best list is shown in Table 4.1. For

each of these classes, a class model is created (or updated if already existent).

rdf:type Score

yago:GermanComposers 6.0

yago:BaroquEComposers15 3.3

yago:ComposersForViolin 3.0

yago:ComposersForCello 3.0

yago:ComposersForLute 3.0

Table 4.1 Class scores, n-best list
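A sketch of the scoring steps above: the stopword list is truncated, splitting the camel-cased YAGO names stands in for retrieving their rdfs:label, and the M = 1.5 penalty for one-word names follows my reconstruction of the formula in step 5.

from collections import Counter
import re

STOPWORDS = {"for", "from", "and", "a", "an", "the", "in", "of"}

def words_of(class_name):
    # "ComposersForViolin" -> ["composers", "for", "violin"]
    return [w.lower() for w in re.findall(r"[A-Z][a-z0-9]*|\d+", class_name)]

def score_classes(class_names, first_sentence="", n=5):
    bag = [w for c in class_names for w in words_of(c)]
    bag += [w.lower() for w in first_sentence.split()]
    tf = Counter(w for w in bag if w not in STOPWORDS)   # tf of stopwords is 0
    scores = {}
    for c in class_names:
        ws = words_of(c)
        m = len(ws) if len(ws) > 1 else 1.5   # dispreference for one-word names
        scores[c] = sum(tf[w] for w in ws) / m
    return sorted(scores.items(), key=lambda kv: -kv[1])[:n]

print(score_classes(["GermanComposers", "ComposersForViolin",
                     "ComposersForCello", "PeopleFromEisenach"],
                    "Johann Sebastian Bach was a German composer"))
# -> GermanComposers ranks first: "german" and "composers" recur in the bag.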

4.5 Coreference resolution

Coreference resolution can be a complex task and is an active area of research. A number of well-known algorithms exist for it, but given the domain of text we are dealing with, a very simple heuristic is sufficient for our purposes: the first pronoun appearing in the text

is assumed to refer to the entity the text is describing (the title entity), and so are all forms of it

throughout the text. The title entity, that is, the entity the article is about, its main topic, is

henceforth referred to as “$self”. Consider the following example, taking the first two sentences

from the article on Johann Sebastian Bach. Coreferent spans of text are in bold face:

“Johann Sebastian Bach (b. Eisenach, 21 March 1685; d. Leipzig, 28 July 1750) was a German composer and organist. He lived in the last part of the Baroque period.”

15 Note that the spelling “BaroquEComposers” is taken verbatim from the data. This just offers a hint of how careful

one must be when dealing with automatically-mined data like that of DBpedia and YAGO. We explore this issue in

more depth in 6.1.


The first pronoun to appear is the “He” starting the second sentence. From this moment on,

“he” will be assumed to be coreferential with the entity :Johann_Sebastian_Bach, and so will be

the form “his”. For each coreferent token, a coreference annotation is stored.
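A sketch of this first-pronoun heuristic; the pronoun inventory is an assumption.

PRONOUN_FORMS = {"he": {"he", "him", "his"},
                 "she": {"she", "her", "hers"},
                 "it": {"it", "its"}}

def annotate_coreference(tokens):
    """Return indices of tokens taken to refer to the title entity ($self)."""
    coreferent = set()
    forms = None
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if forms is None and low in PRONOUN_FORMS:
            forms = PRONOUN_FORMS[low]   # the first pronoun fixes the paradigm
        if forms and low in forms:
            coreferent.add(i)
    return coreferent

toks = "He lived in the last part of the Baroque period ; his music ...".split()
print(annotate_coreference(toks))   # {0, 11}: "He" and "his"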

4.6 Parsing

For parsing, I employ the Stanford parser with the pre-trained PCFG English model16 (Klein and

Manning, 2003), a widely-used, state-of-the-art, self-contained parser, which also provides a

number of pre-trained probabilistic models for other languages. Distributed as a Java Archive

(.jar), it is easy to interface or use from the command line or other programming languages17. A

number of freely licensed and open-sourced parsers were considered (e.g. C&C parser, NLTK

Viterbi parser, Berkeley parser) and the final choice was motivated by its robustness, speed, and

ease of interfacing.

Of all the sentences containing at least one coreferential token, the ones that also contain at least

one spotted property value that is not coreferential with $self are selected as template

candidates. For each of these sentences, a specially prepared pre-parse version is created, where

for each spotted entity (or each annotation) a placeholder variable is created. This variable takes

the name “var_n”, where n is an automatic counter, with values from 1 to N, the number of

tokens in the sentence with an annotated spotted entity. So, for instance, given the sentence:

“Carl Maria von Weber (born Eutin, Holstein, baptised 20 November 1786; died 5 June, 1826 in London) was one of the most important German composers of the early Romantic period.”

After date normalisation and spotted entity substitution, this sentence becomes:

“var_1 (born var_2, var_3, baptised 1786-11-20; died var_4 in var_5) was one of the most important var_6 of the early Romantic period.”

The parser assigns a noun (NN) Part-Of-Speech tag by default to unknown words, which all the

placeholders are in this case. This is conceptually consistent with the fact that they are spotted

entities in the text, and can therefore be nouns. This is done in order to preserve these spotted

entities as one unit each. The parser will often nest entities formed by more than one word in the

parse tree in ways that complicate the posterior retrieval of those entities and even more so the

pruning of the tree.

4.7 Syntactic pruning

While parsing can be helpful for a number of tasks (e.g. it can inform coreference resolution by

identifying the subject of the sentence), here it is only deemed necessary in order to carry out

syntactic pruning to ensure the templates are transferable.

16 Version 1.6.5, from 30/11/2010. The Python library used was designed for the older API and incompatible with

more recent versions. 17 This implementation uses jPype (http://jpype.sourceforge.net/) to interface the Java Virtual Machine from Python.


For this, it is considered here that the following grammatical categories require support in the

data: nouns, adjectives, adverbs and numerals. The corresponding tags of these categories

returned by the parser are: N* (e.g. NNS – plural), JJ*, RB*, CD*. The asterisk here is meant as a wildcard for zero or more characters: N* matches both NN and NNS. These tags are the

ones used in the Penn Treebank, the annotated corpus the Stanford parser English PCFG model

was trained on.

By “require support” it is meant that the words with those corresponding tags must have been

aligned to values found in the data, i.e. must have a “spotted” predicate. Note that several

grammatical categories do not require support, most relevantly verbs. This is because what verbs

do require is objects, and it is these that require support. This concept can be seen as very

related to techniques of Relation Extraction (Sarawagi, 2008).

The pruning proceeds in three stages:

• Stage 1: Each leaf of the tree (i.e. word in the sentence) whose Part-of-Speech tag

matches one of the masks (N*, JJ*, RB*, CD*) and which does not have a “spotted”

value in the data is deleted.

• Stage 2: Context-Free Grammar rules are inverted. For example, in a standard CFG a constituent is expanded via the rule NP -> DET + N (a Noun Phrase is formed by a head Noun and a DETerminer); if the head noun of an NP is deleted, then the whole NP must be deleted too, together with all the constituents it may contain. The rules used in

stage 2 are:

o VP requires V leaf

o NP requires N* leaf (N, NP, NN, NNS, all valid)

o PP requires P and NP

o Verb requires object: either VB* or VP containing it must have sister constituent

to the right18

o WH-phrase (WDT, WP, WP$) requires VP or S leaf

o Coordinating conjunction (CC) requires sister nodes to its left and right of the

same type

The rules in this stage are applied successively and repeatedly to the parse tree until no

modification is made.

• Stage 3:

o Resulting parses are first filtered by the number of spotted entities they still

contain. If there is not at least one token that refers to $self and one spotted

value, we discard the template candidate.

o Finally, we apply a number of rules to ensure correct punctuation by deleting

empty brackets, duplicate commas, etc.

An example of this processing applied to the previous sentence from section 4.6 can be seen in

Figure 4.2:

18 This accounts for inconsistencies in the parse trees returned by the Stanford parser.


• In stage 1 the following leaves require evidence in the data, and as this is missing, they

are deleted: “1786-11-20” (CD), “most” (RBS), “important” (JJ), “early” (JJ),

“Romantic” (JJ), “period” (NN).

• In stage 2, the NP containing “period”, having lost the only NN that supported it, is

deleted. The ADJP “most important”, having no leaves in it, is also deleted.

• In post-processing rules, “one of the” is substituted by “a”.

This pruning produces the following pruned template candidate:

[foaf:name] (born [dbont:birthPlace], baptised died [dbont:deathDate] in [dbont:deathPlace]) was a [rdf:type].

This is not a perfectly correct template, given the presence of “baptised” in it with no

complement. Although it is extracted and stored, this template is not judged grammatical

enough as per the criteria defined in section 6.2.4.

Figure 4.2 Parse tree, with removed constituents underlined.
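A sketch of stage 1 of this pruning over an nltk parse tree; the `supported` set and the parent-lookup helper are assumptions, and stages 2 and 3 are omitted.

from nltk import Tree

REQUIRE_SUPPORT = ("N", "JJ", "RB", "CD")   # POS-tag masks that need data support

def find_parent(tree, target):
    """Return the subtree that has `target` as a direct child, or None."""
    for sub in tree.subtrees():
        if any(child is target for child in sub):
            return sub
    return None

def prune_stage1(tree, supported):
    """Delete pre-terminals whose tag matches a mask and whose word is unsupported."""
    for sub in list(tree.subtrees(lambda t: t.height() == 2)):   # (TAG word) nodes
        if sub.label().startswith(REQUIRE_SUPPORT) and sub[0] not in supported:
            parent = find_parent(tree, sub)
            if parent is not None:
                parent.remove(sub)
    return tree

t = Tree.fromstring(
    "(S (NP (NN var_1)) (VP (VBD was) (NP (DT a) (JJ important) (NN composer))))")
print(prune_stage1(t, supported={"var_1", "composer"}))
# JJ "important" has no spotted value and is deleted; the supported leaves stay.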

4.8 Store annotations

At this stage, a list of annotations from step 3 (NER) is collected and saved for the model as a

separate list for each document processed. This is done by iterating through all the tokens in the

text, ignoring sentence boundaries and storing only annotations that do not refer to the title

entity (e.g. foaf:name, rdfs:label, dbprop:name). This will be used in step 4.9.3 to compute counts

and probabilities for predicate n-grams.

An example list of annotations for an article would be:

{foaf:name, rdfs:label, dbprop:name}, {dbont:birthDate}, {dbont:birthPlace}, {dbont:deathDate}, {dbont:deathPlace}, {rdf:type}, {dbont:knownFor}


Figure 4.3 Training: post-processing steps

4.9 Post-processing

Post-processing is applied to every class model independently, executing the following steps in

succession.

4.9.1 Cluster predicates into pools

First the owl:sameAs property is retrieved for every property spotted (i.e. for every entry in the

1-gram list) and it and its object are added to a single “predicate pool”. Second, the lists of

annotations for the analysed articles are processed and properties that are seen to have the same

value with high frequency are grouped together in pools. This is done by computing a similarity

coefficient based on the Dice coefficient formula:

\mathit{sim}(p_1, p_2) = \frac{2 \cdot \mathit{same}(p_1, p_2)}{\mathit{count}(p_1) + \mathit{count}(p_2)}

This is twice the amount of times they have the same value divided by the number of times they

appear individually. This is only computed for the set D of entities of class C for which both p1

and p2 are defined (i.e. have a value other than null string). If this coefficient is above a

threshold, the predicates are deemed to be equivalent. Experimentally the value of this threshold

is set to 0.9. As an example, foaf:name, rdfs:label and dbprop:name are clustered together in a

single pool for all classes used in the experiments.
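A sketch of this pooling step, assuming the per-entity values are available as nested dictionaries (the layout and names are mine):

from itertools import combinations

def dice(p1, p2, values):
    """values: {entity: {predicate: value}} for one class, nulls omitted."""
    both = [e for e in values if p1 in values[e] and p2 in values[e]]
    if not both:
        return 0.0
    same = sum(1 for e in both if values[e][p1] == values[e][p2])
    return 2 * same / (2 * len(both))   # counts restricted to entities with both

def build_pools(predicates, values, threshold=0.9):
    pools = {p: {p} for p in predicates}      # start with singleton pools
    for p1, p2 in combinations(predicates, 2):
        if dice(p1, p2, values) >= threshold:
            merged = pools[p1] | pools[p2]    # merge the two predicates' pools
            for p in merged:
                pools[p] = merged
    return {frozenset(pool) for pool in pools.values()}

data = {"Bach":   {"dbont:birthPlace": "Eisenach", "dbprop:birthPlace": "Eisenach"},
        "Handel": {"dbont:birthPlace": "Halle",    "dbprop:birthPlace": "Halle"}}
print(build_pools(["dbont:birthPlace", "dbprop:birthPlace"], data))
# -> {frozenset({'dbont:birthPlace', 'dbprop:birthPlace'})}: one merged pool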

The predicate pool is identified by the most frequent predicate of those in the pool. If they are

equally frequent, the first one in the list returned by the sorting function is used. This most

frequent predicate is then substituted in all the sentence templates in that model in place of all


the other predicates in the pool. This is mostly for aesthetic reasons and to simplify the

generation. It does not affect the behaviour of the system, but makes the visualisation of the

extracted templates more intuitive.

4.9.2 Purge and filter sentences

After the predicate pools have been built, each sentence in the model is checked for conflicts

between predicates. This is done to account for the fact that predicates with different and

possibly opposed meanings can have the same object.

For every sentence template t, for every slot s in t, if s contains more than one predicate after the

clustering carried out in the previous step, the predicates are checked for similarity. The Dice

coefficient of each pair of predicates is checked (having been computed previously), and if it falls

below a threshold (experimentally set to 0.1), the sentence template is discarded.

Ideally, sentences that express the same set of properties would be filtered here according to some

criteria (e.g. length in tokens, amount of symbol tokens present, an overall character length

preference, etc.), and the best or n-best ones would be kept. This is, however, not implemented

in the LOD-DEF system.

4.9.3 Compute n-gram probabilities and store model

The n-gram counts collected are adjusted to reflect probabilities using Maximum Likelihood

Estimation. A very simple smoothing technique is applied, equivalent to add-α smoothing with a

very small α. Trigrams are used throughout this implementation. The model is finally stored in a

file, for which Python’s built-in serialisation is used.
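A sketch of the trigram model with add-α smoothing; the α value and data layout are assumptions, and "" marks the start-of-document predicate, as in Figure 4.4.

from collections import Counter

ALPHA = 1e-4   # assumed value of the very small α

def train_trigrams(sequences):
    tri, bi = Counter(), Counter()
    for seq in sequences:
        padded = ["", ""] + seq            # pad the start of the document
        for i in range(len(seq)):
            tri[tuple(padded[i:i+3])] += 1
            bi[tuple(padded[i:i+2])] += 1
    return tri, bi

def prob(tri, bi, history, pred, vocab_size):
    """P(pred | history) with add-α smoothing over the predicate vocabulary."""
    return (tri[(*history, pred)] + ALPHA) / (bi[history] + ALPHA * vocab_size)

seqs = [["foaf:name", "dbont:birthDate", "dbont:deathDate"],
        ["foaf:name", "dbont:birthDate", "dbont:knownFor"]]
tri, bi = train_trigrams(seqs)
print(prob(tri, bi, ("", ""), "foaf:name", vocab_size=4))  # ~1.0: always first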

Class: yago:GermanComposers
Templates:
(1) [foaf:name] ([dbont:birthDate] -- [dbont:deathDate]) was a [purl:description].
(2) [foaf:name] (born [dbont:birthDate] [dbont:birthPlace]; died [dbont:deathDate] [dbont:deathPlace]) was a [rdf:type].
…
Pools:
(1) {foaf:name, rdfs:label, dbprop:name, dbprop:caption, dbprop:cname}
(2) {rdf:type, purl:description, dbprop:shortDescription}
(3) {dbprop:dateOfDeath, dbprop:deathDate, dbont:deathDate}
(4) {dbprop:dateOfBirth, dbprop:birthDate, dbont:birthDate}
(5) {dbont:knownFor}
…
n-grams:
(“”, “”, foaf:name)
(“”, foaf:name, dbont:birthDate)
(dbont:birthDate, dbont:deathDate, purl:description)
(dbont:deathDate, purl:description, dbont:knownFor)
…

Figure 4.4 Example trained model for yago:GermanComposers

The resulting output of the training pipeline is not just one class model, but a model collection,

as for every entity a number of class models may be created or updated.


Chapter 5

Implementation: Generation

The generation algorithm described here takes as input a collection of trained class models as

defined in the previous chapter, and the URI of an entity for which to generate a description

article.

5.1 Retrieve RDF triples

Values are retrieved from the SPARQL endpoint for each predicate pool saved in the model. A

pool may contain any number of different predicates, but as these are considered completely equal, only the values of one of them are retrieved and saved for the whole pool. A query is made

for each predicate until values are returned, which are then stored for the whole pool.

5.2 Choose best class for entity

The first step of this procedure is identical to that detailed in section 4.4 with the difference that

there is no article text to be considered, so none is added to the vector of words; step 2 (as defined in 4.4) is therefore omitted. This generates an n-best list (experimentally, n=5) of

classes. For each of these classes, if a model is found in the model collection, a score is computed

for this model. This score is the amount of pools in the model that would get instantiated

through the sentence templates available in the model. Only pools for which there are values in

the triples for the entity are considered. For example, from the n-best list for

:Johann_Sebastian_Bach from Table 4.1, if two models were available such that:

Model                                               yago:GermanComposers   yago:ComposersForCello
Number of templates                                 2                      3
Pools for which there are values in the data        7                      6
Pools that would be instantiated through templates  5                      6

The chosen model would be yago:ComposersForCello, even though it received a lower score in

the first step, because a higher number of values would (potentially) be expressed through

templates. This does not take into account the fact that due to the constraint on number of uses

of property values, not all these templates might be instantiated.

The motivation behind this choice is that an extracted sentence template is expected to generate

higher quality text, so a model instantiating more predicates through extracted templates is

preferred. This is especially important for the subjective human evaluation conducted as part of

this project (see Chapter 7 for details).


5.3 Chart generation

We use chart generation: all sentence templates in the model for which there are enough triples

in the data are put on a chart and combinations of them are generated. The following steps are

taken:

1. For each template t, where S(t)i is the ith slot in it, we discard sentences for which there

is no value for S(t)i in the set of retrieved property values V. That is, every template t must satisfy that for each of its slots a value is found in V.

2. For each pool in the model, a simple sentence template is generated in exactly the same

way as for the baseline and added to the chart. This is done in order to deal with the

situation where pools (spotted properties in the training text) would not be expressed for

a lack of a template expressing them.

5.4 Viterbi generation

We now need to select and order sentence templates from the chart to produce a combination.

Ideally, we would want to find the combination of sentences that expresses all the values of the

pools in the model, yet uses as many extracted templates and as few simple generated

ones as possible.

In order to deal with the combinatorial explosion, instead of an “overgenerate and rank”

approach, we apply the Viterbi criterion (Jurafsky & Martin, 2009). This means that we

compute scores for all the options at every step, select the one with the highest and discard all

the others, thus only ever keeping one possible combination. This is not guaranteed to be the

optimal solution to the requirements outlined above, but it is a satisfactory trade-off between

quality and speed and keeps the algorithm simple and the generation running in polynomial

time. The computational complexity of the algorithm presented in Figure 5.1 is O(n² log n).

used_ngram_list = null predicate (beginning of document)
combination = new list of sentence templates
(1) Do while len(combination) < len(chart):
        considered = new list of templates
        (2) Do for each template in chart:
                If template not in combination:
                    If template does not require more uses of pools than allowed19:
                        Compute n-gram score of template using only the first pool
                        Add to score a tenth of the number of pools used by template20
                        Add sentence with score to considered list
        If no templates were considered, exit loop (1)
        Take the template with the highest score, add it to combination
        Increase the counter of times used of each pool used by template
        Add all pools used to used_ngram_list

Figure 5.1 Pseudocode for the Viterbi generation algorithm

19 One pool does not have to satisfy this constraint; this is the one representing the name of the entity being described,

“$self”, clearly identified based on the fact that it contains rdfs:label.

20 This has the effect that, where more than one template has the same n-gram score, the one using more pools (i.e. the longest one) is selected.


To illustrate this algorithm, consider we are generating an article for the entity

:Woody_Woodpecker using the following example model (Figure 5.2), where two templates were

extracted from text. Consider also that in the input data there are only values available for pools

(1) to (4), so pool (5) has no value.

Class: yago:FictionalAnthropomorphicCharacters
Templates:
(1) [foaf:name] is a [rdf:type] created by [dbprop:creator].
(2) [foaf:name] first appeared in [dbprop:first].
Pools:
(1) {foaf:name, rdfs:label, dbprop:name} = “Woody Woodpecker”
(2) {rdf:type} = “fictional anthropomorphic characters”
(3) {dbprop:creator} = “Walter Lantz”, “Ben Hardaway”, “Alex Lovy”
(4) {dbprop:significantother} = “Winnie Woodpecker”
(5) {dbprop:first} = *empty*

Figure 5.2 Example model for generation

Having selected the model, the templates available for which the required values are available are

put on a chart (Figure 5.3). Here, template (2) requires values from pool (5), for which no values

were found in the RDF triples, so it is not added to the chart. Template (1) fulfils all

requirements and is added. Next, simple templates are generated for each of the four pools

except pool (1) as this one contains rdfs:label.

(1) [$self] is a [rdf:type] created by [dbprop:creator].
(2) [$self] is a [rdf:type].
(3) [$self-possessive] creator is [dbprop:creator].
(4) [$self-possessive] significant other is [dbprop:significantother].

Figure 5.3 Chart for generation

In the first iteration, none of the pools have been used, which makes all templates on the chart

selectable. Considering that the stored n-gram probabilities have rdf:type as the most likely

predicate to follow the null property (beginning of document), and given that only the first

property expressed is considered when computing the n-gram score, both (1) and (2) would have

the same score. However, given the formula adds to this score a value proportional to the number

of properties that would be instantiated by the template, template (1) is chosen and added to

the final combination. This marks two pools as used once: pools (2) and (3). This still leaves

one pool that does not refer to $self to be expressed: number (4).

In the second iteration, templates (1), (2) and (3) cannot be selected as candidates to follow in

the combination, as they require properties that have already been used once. Only template (4)

is available for selection, so independently of its score it will be added next. The final template

combination is then (1,4).
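A Python sketch of this greedy selection, following the logic of Figure 5.1 with my own names and a constant stand-in for the n-gram score; run on the chart above, it reproduces the combination (1, 4).

def viterbi_select(chart, ngram_score, max_uses=1, self_pool="foaf:name"):
    combination, uses = [], {}
    while len(combination) < len(chart):
        considered = []
        for template in chart:            # template: (list of pools, label)
            pools, _ = template
            if template in combination:
                continue
            if any(uses.get(p, 0) >= max_uses for p in pools if p != self_pool):
                continue                  # a pool is already expressed often enough
            score = ngram_score(combination, pools[0])   # first pool only
            score += 0.1 * len(pools)     # prefer templates expressing more pools
            considered.append((score, template))
        if not considered:
            break                         # nothing selectable: stop
        best = max(considered, key=lambda st: st[0])[1]
        combination.append(best)
        for p in best[0]:
            uses[p] = uses.get(p, 0) + 1
    return combination

chart = [(["foaf:name", "rdf:type", "dbprop:creator"], "(1)"),
         (["foaf:name", "rdf:type"], "(2)"),
         (["foaf:name", "dbprop:creator"], "(3)"),
         (["foaf:name", "dbprop:significantother"], "(4)")]
order = viterbi_select(chart, ngram_score=lambda comb, pool: 1.0)
print([t[1] for t in order])   # -> ['(1)', '(4)'], as in the worked example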

5.5 Filling the slots

For every template in the combination created in the step above, we must select values for its

slots. For slots that refer to $self, the title entity, LOD-DEF implements a very simple

Referring Expression Generation algorithm, similar to the baseline described in section 3.4.

The initial reference to an entity is its foaf:name or rdfs:label. For classes which have been

observed to be referred to using a singular pronoun with grammatical gender (“he” and “she”),

as it was done for the baseline, the system specifically retrieves the value for foaf:gender for the


entity whose description is being generated and chooses the right pronoun based on it. This

objective would ideally be attained by performing inference on the classes the entity belongs to

or by checking with the SPARQL endpoint whether the entity is of class “Person” (using any of

the available URIs identifying a person, e.g. foaf:Person). However, due to the experience dealing

with remote data as detailed in 6.1, in the implementation we trust the text rather than the

data. If no foaf:gender value is available for an entity for which “he” and/or “she” referring

pronouns were observed in training, the fallback gender is the most frequent one observed during

training.

For all other slots, if only one value is available for the pool, it is rendered depending on its type (e.g. dates are formatted from 1066-10-14 to "14 October 1066"). Finally, a number of

regular expressions help keep the output grammatically correct, by adjusting spaces between

punctuation tokens, changing the article “a” to “an” before a word starting with a vowel, etc.
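A sketch of a few such clean-up rules; the specific patterns shown are examples, not the system's full set.

import re

def polish(text):
    text = re.sub(r"\s+([,.;:)])", r"\1", text)            # no space before punctuation
    text = re.sub(r"\(\s+", "(", text)                     # no space after "("
    text = re.sub(r"\ba(?=\s+[aeiouAEIOU])", "an", text)   # "a" -> "an" before a vowel
    return re.sub(r"\s{2,}", " ", text)                    # collapse double spaces

print(polish("Bach ( 1685 -- 1750 ) was a organist ."))
# -> Bach (1685 -- 1750) was an organist.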

Continuing the previous example, the resulting output is:

Woody Woodpecker is a fictional anthropomorphic character created by Walter Lantz, Ben Hardaway and Alex Lovy. His significantother is Winnie Woodpecker.


Chapter 6

Experiments

6.1 Problems with the data

Months of testing support the conclusion that a great degree of caution must be exercised when

relying on DBpedia data. To begin with, as mentioned before, the schema is rather unreliable.

Redundancy is high, as very often several properties with the same meaning are provided (e.g.

dbprop:birthPlace, dbprop:placeOfBirth, dbont:birthPlace and dbont:placeOfBirth all have the same

meaning). These properties are meant to have owl:sameAs links to identify them as equal, yet

when these triples exist they always point back to the same URI (e.g. dbprop:dateOfBirth

owl:sameAs dbprop:dateOfBirth). As detailed in Chapter 3 and Chapter 4, this is addressed by

the LOD-DEF system by learning pools of equivalent predicates.

Further, it is remarkable that the rdf:type properties on DBpedia link to supposed class URIs

with incorrect spellings like “SpanishFootballCluBs”, “CarManuFACturers” and

“BaroquEComposers”. It remains unclear what the reasons behind these spellings are, but these

are clearly errors in the data, as there are no triples in the triplestore with these URIs as

subjects. Their correctly-spelled counterparts do have triples, e.g.:
yago:CarManufacturers rdfs:label "Car manufacturers"@en.

6.2 Performance of the system

Evaluation of the system’s performance is somewhat problematic due to the cumulative error

rate introduced by the amount of interdependent modules in the architecture pipeline. Each

stage depends on output from the previous stage, so for instance an error at the spotting stage is

sure to impact the extraction of a sentence template.

I manually evaluate here two main aspects of the system: the success of the template extraction

process and the class selection algorithm. For this, the training pipeline was run to train a single

model collection for the classes in Table 6.1, for a maximum of 30 entities of each class. These

classes are the same ones used for human evaluation (see Chapter 7), although the entities the

model was trained on need not be the same ones. Other aspects, such as spotting performance,

are evaluated through examples and critical discussion.

yago:EnglishComedians        yago:CarManuFACturers                     yago:AmericanPopSingers
yago:AfricanCountries        yago:FictionalAnthropomorphicCharacters   yago:SpanishFootballCluBs
yago:ArgentineFootballers    yago:SingingCompetitions                  yago:SpeedMetalMusicalGroups
yago:GermanComposers         yago:CapitalsInEurope                     dbont:TelevisionShow

Table 6.1 Classes used for testing


6.2.1 Spotting performance

As only a gazetteer is applied in this baseline, the performance of the spotting will exclusively

depend on the extent to which the value literals in the RDF triples mirror those found in text. In

the case of spelling differences or naming inconsistencies, the spotting will fail.

There are inconsistencies in the data, such as the spelling of names cross-language. For example,

for a single entity we find “George Frideric Handel” in the triple values and “George Fredrick

Handel (German: Georg Friedrich Händel)" in its associated article text. Note that neither of the two spellings found in the text can be matched to the one in the triple values.

Another example is the category name "Argentine Footballers". This string is seldom spotted in

the text, as the surface realisation is “football players”. The two terms are synonymous, and

ideally the system should be able to determine that the surface realisation of “footballer” is

“football player”. Clearly then this task requires a more sophisticated approach than literal string

matching. DBpedia provides a lexicalisation dataset which can be applied to this task and is

indeed used by DBpedia Spotlight (Mendes & Jakob, 2011). Another option for a robust NER

solution is OpenCalais (Butuc, 2009).

6.2.2 Parser performance

Although the parser introduces a significant error rate to the system due to inconsistencies in

nesting constituents, I do not directly evaluate its performance here, lacking a gold standard to

compare against for this specific domain. It should however be noted that a different PCFG

model could help improve performance. Also, it was noted before that other approaches to

sentence compression use dependency parsing. This was tested for the present project but was

deemed unsuitable due to the low accuracy of the output from the parsers tested. Perhaps a different dependency parser would be better fitted to the task.

6.2.3 Class selection performance

The class selection algorithm was manually evaluated, by comparing the n-best classes identified

by the algorithm with the first line in the full English Wikipedia article for that entity.

The criteria adopted were as follows. Consider two sets, A and B, where A is the set of classes

mentioned in the first description sentence in the text, and B is the set of n-best classes chosen

by the class selection algorithm. The criteria for establishing matches between these sets are

shown in Table 6.2.

Match type      Criterion
No match        No element of A is equivalent to an element in B
Partial match   At least one element in A is equivalent to an element in B
Correct match   A1 (the first class mentioned in the text) is also in B

Table 6.2 Match criteria

For this testing, n was set dynamically depending on the number of classes available for an

entity. For entities belonging to 9 or more classes, n is set to 5. For entities with fewer than 9

classes, n is set to half the number of classes, rounded up.

To give an example, for the entity :Woody_Woodpecker, the best class as chosen by the algorithm

is yago:FictionalAnthropomorphicCharacters, whereas the article text begins "Woody Woodpecker


is an animated cartoon character, an anthropomorphic acorn woodpecker”. This is judged to be a

correct match, although “animated” and “cartoon” do not appear in the class name.

Similarly, for an entity of class yago:CapitalsInEurope, where no class with “city” or “capital” in

the name is available, yago:PopulatedPlace is accepted as equivalent of “city”.

For testing, set C is also defined, where B is the set of n-best classes chosen directly from the

triples, and C is the set that was chosen by adding the first sentence from the article text to the bag

of words.

                  Set B (n-best)   Set C (n-best with text)
Entities tested        95               95
No match                1                1
Partial match           0                0
Correct match          94               94

Table 6.3 Results of class selection evaluation

The results of the evaluation suggest this is a robust algorithm, with almost a 100% correct

match rate as defined above. This evaluation is admittedly dependent on subjective

interpretation of the meaning and overlapping of class names, so further testing, refining of

criteria and ideally evaluation by other humans (like the one in Chapter 7) should be

undertaken.

The one entity for which no match was found was :Life_in_Hell. Revealingly, it is said to be of

class yago:FictionalAnthropomorphicCharacters, yet the Wikipedia states "Life in Hell was a

weekly comic strip […] The strip features anthropomorphic rabbits and a gay couple.” While the

entity clearly contains fictional characters, it was decided that it should be of class “comic strip”

or equivalent.

6.2.4 Template extraction

The templates extracted were manually judged on transferability (Y/N) and on their

grammaticality. Grammaticality was judged according to the criteria outlined in Table 6.4.

Score   Meaning
5       Perfectly grammatical
4       Minor punctuation defects (e.g. stranded commas)
3       Missing determiner or stranded conjunction
1-2     Lack of verb or no meaning

Table 6.4 Grammaticality scores and criteria

Intuitively there are also different degrees of non-transferability, but this was not taken into account here. The judging was binary: if a template was not perfectly transferable, it was judged not

transferable at all.


Item                                  Total
Processed articles                    268
Sentences considered for extraction   199
Discarded during pruning              98 (49%)
Discarded after pruning (filtered)    26 (13%)
Extracted templates                   74 (37%)
Non-transferable                      14 (19%)
Transferable templates                60 (81%)
Average grammaticality                4.15
5-star grammaticality                 34
4-star grammaticality                 9
Final accuracy                        43/74 (58%)

Table 6.5 Extracted templates statistics

For the purposes of evaluation, I adopt as the final performance metric (accuracy) of the template extraction process the percentage of extracted templates that are both transferable and have a grammaticality score of 4 or 5. As Table 6.5 shows, of a total of 74 extracted templates, 60 (81%) are transferable, of which 43 (58% of the total extracted templates) have a rating of 4 or 5 on grammaticality.
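As a quick check of the arithmetic in Table 6.5 (a worked example, not part of the system):

    extracted = 74
    transferable = 60
    transferable_and_grammatical = 43  # transferable templates rated 4 or 5

    print(f"transferable: {transferable / extracted:.0%}")              # 81%
    print(f"accuracy: {transferable_and_grammatical / extracted:.0%}")  # 58%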

Note that of the 14 non-transferable templates reported, 3 were purged in the post-processing stage because of conflicting predicates in their slots. However, this is not taken into account here, as post-processing is an independent step with a different purpose.

The final accuracy metric can clearly be improved on, and one way of doing so is to refine the pruning rules. The development of these rules did not follow a data-driven approach, but was based on first principles of context-free grammar and on a summary examination of the data. It only became apparent during evaluation that these rules were not sufficient to ensure the grammaticality and transferability of the extracted templates.

6.2.5 Examples of errors in output

• “Casablanca of Morocco is Rabat.” Here the spotting failed mainly due to

the low quality of the data. During training, the title entity’s dbprop:largestCity property

had the value “capital” as a string literal. This prompted the extraction of the previous

template [dbprop:largestCity] of [dbprop:commonName] is [dbont:capital]. This property

has no rdfs:range specified in the schema, which means it can take any value. This is

unfortunate, as it could be argued that its values should be of type City, and a string

literal like “capital” here is of little use and adds noise to the data.

• “Her active is 1981.” What this means is that the person who is the title entity

has been active since 1981, but the rdfs:label for this property does not say so.

• “Although Hyundai Motor Company started in Public.” Here, the

pruning rules were clearly not enough to make this sentence grammatical. Either the

“although” should have been removed or the whole template dropped.

• [foaf:surname] is married to [rdf:type] [dbprop:spouse]. This

template for EnglishComedians may well happen to be true once instantiated in text, if

and only if the spouse of the title entity is of the same type (i.e. EnglishComedians). This


means that this template is not transferable and should be identified as such and discarded.

• “Mercyful Fate is a Speed metal musical group from, Denmark and

Copenhagen.” This sentence is mildly ungrammatical due to the stray comma. While “Denmark and Copenhagen” is odd, it is due to the generation algorithm, not to the extraction.


Chapter 7

Evaluation

7.1 Approach

Given the exploratory nature of this project, the evaluation relies on multiple human evaluations of the system’s output, judged under equal conditions alongside output from two other systems: the baseline described in section 3.4 and expert human output. I adopt a two-panel (i.e. two separate

groups of subjects) approach to compare the three generation systems, very similar to the

evaluation undertaken by Sun & Mellish (2007). Humans in Panel A generate descriptions of the

same 12 entities and humans in Panel B rate the different outputs of System A (baseline),

System B (LOD-DEF) and System C (human generation) across a number of dimensions.

The hypothesis is that LOD-DEF will be rated higher on average in human evaluation than a

system generating exclusively from English words in RDF predicates. For comparison with an

upper bound, the system is also ranked against human-generated text for the same data.

Human-generated text need not always be an upper bound in subjective evaluation, but given

the simplicity of the two NLG systems, this is the hypothesis here.

Given the relatedness of the present approach to automatic summarisation, three of the criteria for evaluation used by the Document Understanding Conference 2007 were found to be

very appropriate for the task at hand. The texts are rated on grammaticality, non-redundancy,

and structure and coherence. No direct evaluation of content determination is carried out: here it

is evaluated implicitly through the dimension of “non-redundancy”, given that its main effect in

this implementation is filtering out redundant and unnecessary information.

7.2 Selection of data

Classes used for evaluation were not chosen at random. Given that one of the aims was to

evaluate the effect of the sentence templates as opposed to the baseline, I purposefully applied a

bias towards classes for which a higher number of templates were extracted and more properties

were spotted in text, which correlates with classes for which more factual information (strings and quantities) was available on DBpedia 21. This aimed to ensure that richer output was generated by the LOD-DEF system, to allow for a more meaningful rating from the human judges, and to make it possible to evaluate the performance of the system at document structuring.

Within these constraints an attempt was made to select classes as varied as possible. While in

the final test set there are four instances of subclasses of “Person”, these are markedly different

kinds of person, with several different RDF properties. Also, this is in approximate correlation

(30%) to the amount of entities of type “person” available in the consistent DBpedia ontology,

approximately 23% (Mendes et al., 2012).

21 This also correlates with subclasses of Person.


Subject 1                                    Subject 2
Jennifer Jane Saunders (English Comedians)   Fernando Gago (Argentine Footballers)
Hyundai Motor Company (Car Manufacturers)    American Idol (Singing Competitions)
Nicole Scherzinger (American Pop Singers)    Mercyful Fate (Speed Metal Musical Groups)
Morocco (African Countries)                  William Herschel (German Composers)
Woody Woodpecker (Fictional Characters)      Belgrade (Capitals in Europe)
Real Zaragoza (Football Clubs)               Winx Club (Television Shows)

Table 7.1 Entities chosen for evaluation and the subject generating each

For each class, the aim was to select a lesser-known instance, so as to prevent the subjects from adding extraneous information to the output. For example, for “Fictional Character”, instead of “Mickey Mouse”, “Woody Woodpecker” was chosen: still a widely known character, but arguably one less laden with associations.

During the development of the LOD-DEF system, the development set of entities against which I

adjusted the several subcomponents was formed mostly of instances of yago:GermanComposers.

The template extraction system was adjusted in order to extract more grammatical sentences

from this set, which was also included in the evaluation.

William Herschel is best known not as a German composer but as an astronomer. Although he was both, this is an instance where the algorithm failed because of the data available. As this was not known at the time of organising the survey, the survey reflects it as such. An important observation, however, is that he is best known for discovering the planet Uranus; since the generated article does not specify this, a reader could assume that “Uranus” is a piece of music.

7.3 Human generation

Panel A is given triples related to the chosen entities and instructions on how to proceed. Panel A consists of two native speakers of English, both of them linguistics postgraduate students.

Their task is to write summary descriptions of the entities the data is about by expressing as

much of this data as possible in text.

The triples are grouped by the entity they relate to, one entity per page. The information is printed in a human-friendly format: the rdfs:label is retrieved for every predicate, followed by an equals sign and a list of n values, which are all the values the property has. If a value is a URI, the rdfs:label for that URI is retrieved and printed instead; otherwise the literal value is presented. For example, “birth date = 1958-07-06”, “place of birth = Sleaford,

Lincolnshire, England”. The full instructions used for this experiment can be found in Appendix

A.
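A minimal sketch of this rendering step, assuming the triples are held in an rdflib graph that also contains the relevant rdfs:label statements (the function name render_for_panel is mine):

    from rdflib import Graph, URIRef
    from rdflib.namespace import RDFS

    def render_for_panel(graph, entity):
        # Group the entity's triples by predicate, then print one
        # 'label = value1, value2, ...' line per property.
        by_predicate = {}
        for _, p, o in graph.triples((entity, None, None)):
            by_predicate.setdefault(p, []).append(o)
        for p, values in by_predicate.items():
            label = graph.value(p, RDFS.label) or p
            rendered = []
            for o in values:
                if isinstance(o, URIRef):
                    # URI value: print its rdfs:label where one is available.
                    rendered.append(str(graph.value(o, RDFS.label) or o))
                else:
                    # Literal value: print it as-is.
                    rendered.append(str(o))
            print(f"{label} = {', '.join(rendered)}")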

Triples given to Panel A were selected from the same pools identified by the LOD-DEF system. Triples were then curated and filtered by hand to further remove redundancy, and the order in which they are presented was randomised. As I have already pointed out, much factual information is encoded in Wikipedia categories, and thus in the names of YAGO classes. For this reason, only one class is included in the triples: the one from the available rdf:type triples that I intuitively and subjectively considered most representative.


I avoided giving the subjects examples of what kind of output was expected, thus taking care not

to prime them. I did, however, include an example of generating from one triple, as a caution against including extraneous information.

7.4 LOD-DEF generation

For the training of the LOD-DEF system, a separate model collection was trained on a maximum of 30 entities of each of the manually chosen classes (see Table 7.1). These entities were taken in the order returned by the SPARQL endpoint, provided articles on the Simple English Wikipedia were available for them.
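A sketch of that selection step, assuming the public DBpedia endpoint and the SPARQLWrapper library (the check for a corresponding Simple English Wikipedia article is a separate step, not shown):

    from SPARQLWrapper import SPARQLWrapper, JSON

    def entities_of_class(class_uri, limit=30):
        # Fetch up to `limit` entities of a class, in whatever order the
        # endpoint returns them (no explicit ORDER BY).
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery(f"""
            PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
            SELECT ?entity WHERE {{ ?entity rdf:type <{class_uri}> . }}
            LIMIT {limit}
        """)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [b["entity"]["value"] for b in results["results"]["bindings"]]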

7.5 Human rating

Subjects were asked to complete an online survey. For this survey, the same 12 entities (Table

7.1) were described by the three systems, which produced 36 short texts, rated by 25 subjects.

The participants self-identified as having an upper-intermediate or higher level of English.

The texts were presented to the subjects in pseudo-random order, arranged so that texts about the same entity never occurred within a page of each other (four texts were presented on every page). This avoided direct side-by-side comparison. Each subject was asked to rate each text on a scale of 1 (lowest) to 5 (highest) on the following three criteria, adapted from the DUC 2007 criteria 22: grammaticality, non-redundancy and structure and coherence. For the full

description of these criteria and the instructions given to the subjects, see Appendix B.
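One way to implement this ordering constraint is simple rejection sampling: reshuffle until any two texts about the same entity are at least a page’s worth of positions apart. The following is a sketch under that interpretation; survey_order is my own name, not part of the system.

    import random

    def survey_order(items, page_size=4, seed=None):
        # items: (text_id, system) pairs. Reshuffle until any two items
        # with the same text_id are at least page_size positions apart,
        # so they can never land on the same page of page_size texts.
        rng = random.Random(seed)
        order = list(items)
        while True:
            rng.shuffle(order)
            last_seen = {}
            ok = True
            for pos, (text_id, _system) in enumerate(order):
                if text_id in last_seen and pos - last_seen[text_id] < page_size:
                    ok = False
                    break
                last_seen[text_id] = pos
            if ok:
                return order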

It was not disclosed to the subjects until the end of the experiment that humans generated the

texts of one of the systems being tested.

7.6 Results

An exploratory analysis of the data collected showed clear differences between the mean ratings of the three systems (Table 7.2). To establish the significance of these differences I

conducted a One-Way ANOVA (as opposed to Paired-rank T-tests, to adjust for the comparisons

made) for each of the three criteria the texts were rated on. All three ANOVAs were statistically

significant: for grammaticality (F(2,72)=119.001, p < 0.001), for non-redundancy

(F(2,72)=129.053, p < 0.001) and for structure and coherence (F(2,72)=129.053, p < 0.001). I

conducted Tukey’s Post-Hoc test to establish which comparisons were significant for each; Table

7.3, Table 7.4 and Table 7.5 show the differences in mean and the results of the Tukey tests.

For the ratings on structure and coherence, three outliers were found to affect the normality of the distribution for System C. With the outliers removed, the assumption of normality held (as reported by the Shapiro-Wilk test), and the ANOVA and Tukey tests were run both with and without the outliers. The same main effects were found in both models; I therefore report the main effects with the outliers included.
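The analysis for one criterion can be reproduced along these lines with SciPy and statsmodels (a sketch; the grouping of the subjects’ ratings into the three per-system samples is assumed, and analyse_criterion is my own name):

    import numpy as np
    from scipy.stats import f_oneway, shapiro
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    def analyse_criterion(ratings_a, ratings_b, ratings_c):
        # One-way ANOVA across the three systems for one criterion.
        f_stat, p_value = f_oneway(ratings_a, ratings_b, ratings_c)
        print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

        # Shapiro-Wilk normality check for each group.
        for name, group in (("A", ratings_a), ("B", ratings_b), ("C", ratings_c)):
            w, p = shapiro(group)
            print(f"Shapiro-Wilk {name}: W = {w:.3f}, p = {p:.4f}")

        # Tukey's post-hoc test for the pairwise comparisons.
        scores = np.concatenate([ratings_a, ratings_b, ratings_c])
        groups = (["A"] * len(ratings_a) + ["B"] * len(ratings_b)
                  + ["C"] * len(ratings_c))
        print(pairwise_tukeyhsd(scores, groups))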

System         Grammaticality   Non-redundancy   Structure and coherence
A (baseline)   2.29             1.89             1.95
B (LOD-DEF)    2.58             3.03             2.70
C (humans)     4.48             4.66             4.49

Table 7.2 Means

22 http://www-nlpir.nist.gov/projects/duc/duc2007/quality-questions.txt

Baseline vs. LOD-DEF   Grammaticality   Non-redundancy   Structure and coherence
Difference             0.29             1.14             0.75
Significance           p = 0.151        p < 0.001        p < 0.001
Significant            No               Yes              Yes

Table 7.3 Differences and significance

LOD-DEF vs. Humans   Grammaticality   Non-redundancy   Structure and coherence
Difference           1.14             1.63             1.79
Significance         p < 0.001        p < 0.001        p < 0.001
Significant          Yes              Yes              Yes

Table 7.4 Differences and significance

Humans vs. Baseline   Grammaticality   Non-redundancy   Structure and coherence
Difference            2.19             2.77             2.54
Significance          p < 0.001        p < 0.001        p < 0.001
Significant           Yes              Yes              Yes

Table 7.5 Differences and significance

7.7 Discussion

As expected, expert human generation is an upper bound in this evaluation, being

consistently superior to the other two systems tested. LOD-DEF does not improve on the

perception of grammaticality of the baseline, but it does significantly outperform the baseline on

non-redundancy and structure and coherence.

The difference between the average scores of the humans and the baseline is at its lowest for grammaticality, as the output of both the baseline and LOD-DEF was judged surprisingly high on grammaticality. LOD-DEF scored very slightly higher than the baseline (a difference of 0.29 in means), but this is not statistically significant (p = 0.151). The largest improvement of LOD-DEF over the baseline is on the non-redundancy metric, with a difference of 1.14.

The fact that, in spite of the simple approach taken and the many errors in output (as discussed

in the previous chapter), LOD-DEF still significantly outperforms the baseline on both non-

redundancy and structure and coherence is very encouraging. These results suggest that

automatic training of NLG systems is a promising approach that should be pursued further.


Chapter 8

Conclusion and future work

8.1 Conclusion

This project has focussed on describing, implementing and testing a trainable shallow Natural

Language Generation system for factual Linked Open Data based on the extraction of sentence

templates and document planning via content n-grams.

The main contributions of this work are:

• Describing a full architecture for this system, including both the training and generation

stages. To my knowledge, this system as a whole represents a new approach to trainable

NLG, never before tried in its entirety.

• Building a baseline implementation of this architecture, the LOD-DEF system, and both

evaluating its performance at template extraction and class selection, and conducting

human evaluation of this system against a baseline and human-generated output.

• Showing that even an exceedingly simple system such as LOD-DEF is rated significantly higher than the baseline in human evaluation. In essence, this project shows

that this approach is a promising one and that it should be pursued further.

As per the criteria outlined in Chapter 3, it is clear that much could be improved. Most

importantly, much more work on template extraction needs to be done. With little extra effort,

the system could easily improve on its current performance of 58% of extracted templates that

are both transferable and grammatical.

This project was met with a measure of success. However, were I to start again now, I would

approach it in different ways. First, I would perhaps focus on one of the many problems I have

tackled, e.g. the class selection algorithm, and investigate it more thoroughly. Second, while

building this whole system from the ground up was a highly instructive experience, I would

strive to use or adapt an existing architecture.

This three-month project started life as a PhD research proposal. It is immediately apparent that

this is but one twelfth of that initial project, and many interesting lines of research had to be

abandoned for lack of time and experience. Sourced both from the original proposal and from the

findings of this project, in the next section I offer some directions for future work.

8.2 Future work

First, the implementation described here is but a baseline. As I have already suggested, more

robust systems exist for every main module of this application: Named Entity Recognition,

parsing, coreference resolution, etc.

Within the same shallow approach, substituting these modules in the pipeline would surely help

improve the results, as the analyses of errors in Chapter 6 show. For the NER task, using an


established system like DBpedia Spotlight (Mendes & Jakob, 2011) or OpenCalais (Butuc, 2009)

would be a first step, which would also allow integrating inference in the selection of the data to

be spotted in text.

Also, the architecture implemented for this project duplicates readily available general-purpose architectures, of which the General Architecture for Text Engineering (GATE) (Cunningham et al., 1996) is a prototypical example.

But beyond these improvements on the shallow approach, the crucial steps involve moving

towards deeper natural language understanding and with it to deeper generation. The most

sophisticated approaches to document planning use another level of abstraction from text:

discourse relations. The original aim was to automatically extract these relations, to which there

exist a number of approaches (e.g. Soricut & Marcu, 2003).

Whether we think of the rhetorical relations in a text as a tree (as in Rhetorical Structure

Theory – Mann & Thompson, 1988) or as a graph (e.g. Segmented Discourse Representation

Theory – Asher & Lascarides, 2003), it is clear that the structure and coherence of a text are

more than just a succession of properties and their values.

This move would probably need to be accompanied by an application of relation extraction techniques, ideally informed by a deeper understanding of the argument structure of predicates in natural language text, that is, of which arguments verbs take and what their thematic roles are. The FrameNet and VerbNet projects, coupled with WordNet, are likely to play a role in this (Shi & Mihalcea, 2005).

These steps would allow us to move towards the automatic learning of rules for deeper

generation. A number of more general-purpose NLG architectures exist (e.g. NaturalOWL as

described before, but also others not specifically targeted to the Semantic Web like OpenCCG –

White, 2008).

With better identification of the relations between the spotted entities in text and an

understanding of the rhetorical relations between sentences, we could extract full document

planning and aggregation rules, which could be converted for use by one of those systems.

Finally, an interesting problem for Named Entity Recognition is that of vagueness (Klein &

Rovatsos, 2011), when dealing, for instance, with large numbers. For example, the population of

a country is a figure in the millions, which is often reported in text as “about 30 million people”,

but has exact numbers in the data (e.g. 27,543,216).
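A toy illustration of how such a figure might be verbalised, rounding to one significant digit and hedging with “about” (vague_quantity is my own name, not part of the system):

    import math

    def vague_quantity(n, significant_digits=1):
        # Round n to the given number of significant digits,
        # e.g. 27543216 -> 30000000 -> "about 30 million".
        magnitude = 10 ** (int(math.floor(math.log10(n))) - significant_digits + 1)
        rounded = int(round(n / magnitude)) * magnitude
        if rounded >= 10**6:
            return f"about {rounded // 10**6} million"
        return f"about {rounded:,}"

    print(vague_quantity(27543216))  # -> about 30 million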

In brief, the approach presented herein does no more than scratch the surface.


Appendix A: Human generation

The following text was given to each of the two subjects. It is followed by the first set of data for generation, as an example.

Description generation

In this experiment you are required to write short descriptions based on some given information.

You will find a block of information at the beginning of each page. This information consists of

facts about a real-world entity (a person, a country, etc.) You might have never heard of that

person or thing but this does not matter.

Your task is to write a short description of this entity based on the information at the beginning

of the page. You are writing for a general audience with no previous knowledge about the entity

you are describing (but with knowledge of other entities of the same type, e.g. countries) and you

want to get all this information across (think of an article on the English Wikipedia).

Use the blank area of each page to write your text. Feel free to copy and paste names and other

chunks of text.

Please do not use any other information resources for this (i.e. don't look it up on Google or

Wikipedia until you have finished the experiment). It is essential you write the text that seems

more natural to you from the given information only.

Please do not include any value judgements (e.g. “the best”, “one of the greatest”, “a very

famous”, “the most important”) unless these are present in the information provided.

The information is in random order. You should report it in the order that seems more logical to

you in a description.

You can use any format you prefer for dates, numbers and other amounts. You can use any

grammatical construction and vocabulary.

It is very important that you include in your text no other information and that you use all the

information that you can infer from what is given (that is relevant).

Example:

From this information:

name = John

date of death = 1666-02-01

You could write:

1. John died in 1666.

2. John died accompanied by his wife and 3 pigs, in a barge that was pushed blazing into

Dunsapie Loch.

3. John died on 1st Feb 1666.


(1) is bad because it omits date information.

(2) is bad because it adds extraneous information.

(3) is good.

[new page]

Information:

Category: English comedians

active = 1981,

spouse = Adrian Edmondson,

description = British comedienne,

place of birth = Sleaford, Lincolnshire, England,

birth name = Jennifer Jane Saunders,

spouse = Adrian Edmondson,

birth date = 1958-07-06,

notable work = Various in French & Saunders, Edina Monsoon in Absolutely Fabulous, Fairy

Godmother in Shrek 2,

name = Jennifer Saunders,

Text (write text here):


Appendix B: Human evaluation

Hello! This is an evaluation questionnaire that compares output from 3 Natural Language

Generation systems, that is, software that takes data and outputs text in English. Two of these I

have created myself.

There are 36 very short text snippets in this questionnaire, generated directly from information

on the Semantic Web. It should take you about 25 minutes.

All you need to do is rate each text from 1 (very poor) to 5 (very good) on these measures:

Grammaticality

Non-redundancy

Structure and Coherence.

Don't worry, these are all explained at the top of each page.

NOTE: I assume that your level of English is upper-intermediate or above.

Let's go!

These are the criteria for rating the texts, please take a moment to read them. They appear

again at the beginning of every page.

Grammaticality

The text should have no system-internal formatting, capitalization errors or obviously

ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to

read.

Non-redundancy

There should be no unnecessary repetition in the text. Unnecessary repetition might take the

form of whole sentences that are repeated, or repeated facts, or the repeated use of a noun or

noun phrase (e.g., "Bill Clinton") when a pronoun ("he") would suffice.

Structure and Coherence

The text should be well-structured and well-organized.

1. Very Poor

2. Poor

3. Barely Acceptable

4. Good

5. Very Good


Full text generated by systems B and C, and example of A

System A (baseline)

1 Jennifer Saunders is an English television actor. Her birth date is 6 July 1958. Her description is British comedienne.

Her spouse is Adrian Edmondson. Her genres are Comedy and Parody. Her caption is Saunders in November 2008.

Her birth name is Jennifer Jane Saunders. Her wordnet type is synset-actor-noun-1. Her nationality is British

people. Her medium is Television, film. Her short description is British comedienne. Her place of birth is Sleaford,

Lincolnshire, England. She is a primary topic of : Jennifer Saunders. Her page is Jennifer Saunders. Her notable

works are Edina Monsoon in Absolutely Fabulous, Various in French & Saunders and Fairy Godmother in Shrek 2.

Her name is Jennifer Saunders. She has a photo collection : Jennifer Saunders. Her label is Jennifer Saunders. Her

given name is Jennifer. Her surname is Saunders. Her birth places are Sleaford and Lincolnshire.

4 Morocco is an Arab LeaguE member state. Its cctld is .ma. Its geometry is POINT(-6.85 34.0333). Its area total

(km2)s are 446550.0 and 446739.2791875256. Its sovereignty types are Monarchy and Independence. Its demonym is

Moroccan. Its time zone is Western European Time. Its lats are 34.0333 and 32.0. Its established events are from

France, Alaouite dynasty, Mauretania and from Spain. Its percentage of area water is 250.0. Its leader names are

Abdelillah Benkirane, Abdelilah Benkirane and Mohammed VI of Morocco. Its points are 34.03333333333333 -6.85

and 32.0 -6.0. Its gdp ppp is 1.62617E11. Its image maps are 29.0 and Morocco on the globe .svg. Its official

languagess are Berber, Arabic and Arabic language. Its government types are Parliamentary system, Constitutional

Monarchy and Unitary state. Its leader titles are Prime Minister of Morocco, List of heads of government of

Morocco, List of rulers of Morocco and King of Morocco. Its currency is Moroccan dirham. Its conventional long

name is Kingdom of Morocco. Its legislature is Parliament of Morocco. Its national anthem is "Cherifian Anthem".

Its longs are -6.85 and -6.0. Its percent waters are 250.0. Its languages type is Native languages. Its time zone dst is

Western European Summer Time. It has a photo collection : Morocco. Its page is Morocco. Its north is

Mediterranean Sea. Its anthem is Cherifian Anthem. Its homepage is http://www.maroc.ma/PortailInst/An/. Its

longname is Kingdom of Morocco. Its established dates are 7 April 1956, 1666, 2 March 1956 and 110. Its founding

dates are 7 April 1956 and 2 March 1956. Its capital is Rabat. Its largest city is Casablanca. Its lower house is

Assembly of Representatives of Morocco. Its ethnic groups are North African Arabs, Berber people and Berber

Jews. Its gdp nominal is 9.9241E10. It is a primary topic of : Morocco. Its gdp ppp per capita is 5052.0. Its longew is

W. Its drives on is right. Its common name is Morocco. Its languages is Berber, Moroccan Arabic, Hassaniya.. Its

southwest is Atlantic Ocean. Its area footnote is or 710,850 km2. Its population density (/sqkm)s are 71.622 and

71.6. Its official languages are Arabic language and Berber languages. Its latns is N. Its gdp nominal per capita is

3083.0. Its calling codes are Telephone numbers in Morocco and %2B212. Its hdi category is medium. Its northeast

is Mediterranean Sea. Its label is Morocco. Its languages are Hassaniya language, Moroccan Arabic, Arabic language

and Berber languages. Its titles are Languages, Geographic locale and International membership. Its currency code

is MAD. Its national motto is "God, Homeland, King". Its mottoes are "God, Homeland, King", (Berber) and

(Arabic). Its northwest is Atlantic Ocean. Its name is Morocco. Its west is Atlantic Ocean. Its upper house is

Assembly of Councillors.

5 Woody Woodpecker is a Fictional anthropomorphic character. His last appearance is The New Woody Woodpecker

Show. His portrayers are Kent Rogers, Grace Stafford, Ben Hardaway, Daniel Webb, Mel Blanc, Cherry Davis,

Danny Webb and Billy West. His families are Splinter and Knothead and Scrooge Woodpecker. His creators are

Walter Lantz, Ben Hardaway and Alex Lovy. His significantother is Winnie Woodpecker. His caption is 1951.0. He

is a primary topic of : Woody Woodpecker. His species is Woodpecker. His last is I Know What You Did Last

Night. His first is Knock Knock. His gender is Male. He has a photo collection : Woody Woodpecker. His labels are

Woody Woodpecker. His page is Woody Woodpecker. His first appearance is Knock Knock (1940 cartoon). His

name is Woody Woodpecker. His homepage is www.woodywoodpecker.com.

System B (LOD-DEF)

1 Jennifer Saunders (6 July 1958, Sleaford and Lincolnshire) is a British comedienne. Her spouse is Adrian

Edmondson. Her active is 1981. Her place of birth is Sleaford, Lincolnshire, England. Her notable works are Edina

Monsoon in Absolutely Fabulous, Various in French & Saunders and Fairy Godmother in Shrek 2. Her nationality

is British people.

2 Hyundai Motor Company is a Car manuFACturer. Hyundai Motor Company started on 29 December 1967.

Although Hyundai Motor Company started in Public. Its parent company is Hyundai Motor Group. Its founded by

is Hyundai Motor Company. Its location country is South Korea. Its subsid is Hyundai Motor India Limited. Its


location cities are Seoul. Its key people is Chung Mong-koo. Its products is Automobiles, commercial vehicles,

engines. Its key person is Chung Mong-koo. Its products are Commercial vehicle and Internal combustion engine. Its

production is 2943529.

3 Nicole Scherzinger (born 29 June 1978) is an American female singer. Scherzinger worked in Hawaii and Honolulu.

Her labels are Polydor Records, Interscope Records and A&M Records. Her associated musical artists are Days of

the New, Pussycat Dolls and Eden's Crush. Her titles are "Jai Ho! ", "Poison" and Dancing with the Stars (US)

winner. Her alternative names is Kea, Nicole. Her befores are Donny Osmond and Kym Johnson. Her years is

Season 10.

4 Morocco (called as Kingdom of Morocco) is an African country. Casablanca of Morocco is Rabat. Morocco's leader

names are Abdelillah Benkirane, Abdelilah Benkirane and Mohammed VI of Morocco. Its west is Atlantic Ocean.

Its official languagess are Berber, Arabic and Arabic language. Its northeast is Mediterranean Sea. Its demonym is

Moroccan. Its founding dates are 7 April 1956 and 2 March 1956. Its established events are from France, Alaouite

dynasty, Mauretania and from Spain. It is an African country. Its demonym is Moroccan. Its established dates are 7

April 1956, 1666, 2 March 1956 and 110. Its leader titles are King and Prime Minister. Its largest city is Casablanca.

5 Woody Woodpecker is a Fictional anthropomorphic character created by Walter Lantz, Ben Hardaway and Alex

Lovy. His species is Woodpecker. His first is Knock Knock. His last is I Know What You Did Last Night. His first

appearance is Knock Knock (1940 cartoon). His significantother is Winnie Woodpecker.

6 Real Zaragoza's clubname is Real Zaragoza. Its nats are Italy, Portugal, Mexico, ESP, Serbia, ITA, Croatia, BRA,

Paraguay, Hungary, Argentina and Spain. Its league is La Liga. Its titles are Inter-Cities Fairs Cup, UEFA Cup

Winners' Cup and UEFA Cup Winners%27 Cup. Its founded is 1932. Its fullname is Real Zaragoza, S.A.D. It is a

Spanish football cluB.

7 Fernando Gago's teams are Real Madrid C.F., Boca Juniors and Argentina national football team. His clubss are

Real Madrid C.F. and Boca Juniors. His birth date is 10 April 1986. His playername is Fernando Gago. His

fullname is Fernando Rubén Gago. His currentclub is Real Madrid C.F. His dateofbirth is 10 April 1986. He is an

Argentina international footballer.

8 American Idol is a Creative Work run by the 19 Entertainment and FremantleMedia.Its presenters are Brian

Dunkleman and Ryan Seacrest. Its judges are Randy Jackson, Ryan Seacrest and Mariah Carey.

9 Mercyful Fate is an Speed metal musical group from, Denmark and Copenhagen. Their former band members are

Timi Hansen, Snowy Shaw and Michael Denner. Their associated musical artists are Fate (band), Arch Enemy,

Force of Evil (band), Spiritual Beggars, Memento Mori (band), King Diamond (band), Brats (band), Black Rose

(band) and Metallica. Their band members are Hank Shermann, Mike Wead, King Diamond and Sharlee D'Angelo.

Their record labels are Roadrunner Records, Combat Records, Rave On (record label) and Metal Blade Records.

Their labels are Roadrunner Records, Combat Records, Metal Blade Records and Rave On %28record label%29.

Their years active is 1981.

10 Friedrich Wilhelm Herschel (15 November 1738 in Holy Roman Empire, Hanover and Electorate of Brunswick-

Lüneburg – 25 August 1822 in England, Berkshire and Slough) was a German composer. His known fors are Uranus

and Infrared.

11 City of Belgrade is the populated place. Its is a part of : Belgrade%23 Municipalities. Its leader names are Party of

United Pensioners of Serbia, Dragan Đilas, Democratic Party (Serbia), Milan Krkobabić and Socialist Party of

Serbia. Its population demonym is Belgrader. Its official name is Belgrade. It is a populated place. Its native names

are Град Београд, Београд and Beograd.

12 Winx Club is a Nickelodeon, Rai Due, 4Kids TV, 4KidsTV and Rai 2 series made by Alfred R. Kahn, Norman J.

Grossfeld and Joanna Lee. Its director is Iginio Straffi. Its first aired is 28 January 2004. It is a Creative Work. Its

country is Italy.

System C (humans)

1 Jennifer Jane Saunders (Born 06/07/1958) is an English comedienne, originally from Sleaford, Lincolnshire. Jennifer

has been active as a comedienne since 1981 and a selection of her most notable roles include Edina Monsoon in

Absolutely Fabulous, the Fairy Godmother in Shrek 2, whilst also appearing in French and Saunders. Her spouse is


Adrian Edmondson.

2 Hyundai Motor Company is a South Korean company based in Seoul and is part of the Hyundai Motor Group. The

company was founded on the 29th of December, 1967 by Chung Ju-yung. They make various products ranging from

automobiles, commerical vehicles and internal combustion engines.

3 Nicole Prescovia Elikolani Valiente (also known as Nicole Scherzinger, Kea, Nicole) was born on the 29th of August

1978, in Honolulu, Hawaii, USA and is a singer from the noughties. She is associated with a variety of different acts

including Days of the New, Pussycat Dolls and Eden’s Crush. Record labels that she has been signed to include

A&M Records, Polydor Records and Interscope Records.

4 Morocco (or Kingdom of Morocco) is a country that is part of the continent of Africa. The capital city is Rabat and

the largest city is Casabalanca. The total area of Morocco is approximately 445739 Km2, with a hdi categorisation

of medium. The country has a population density of 186 people per square mile with the official population

demonym being Moroccan. It is geographically located with the Mediterranean Sea to the Northeast and the

Atlantic Ocean to the Southwest of the country. Arabic and Berber are the officially spoken languages, although

Hassaniva is also spoken. The present leaders in the country are King Mohammed VI and the Prime Minister is

Abdelilah Benkirane. The modern country was officially established in 1956 on the 7th of April, which marks the

independence from the Alouite Dynasty of France. Earlier reports of the country’s establishment relate to 1666, with

the event of Mauretania from Spain.

5 Woody Woodpecker is a fictional cartoon woodpecker. Created by Ben Hardaway, Walter Lantz and Alex Lovy, his

first appearance was in the 1940 cartoon ‘Knock Knock’. Since then various people have portrayed the character

including, Mel Blanc, Billy West, Kent Rogers, Ben Hardaway, Daniel Webb, Grace Stafford, Cherry Davis and

Danny Webb. The last appearance which Woody Woodpecker was featured in was ‘I Know What You Did Last

Night’. Other related characters include his significant other, Winnie Woodpecker.

6 Real Zaragoza are a Spanish football team, who play in the Spanish league La Liga. Their ground is La Romareda,

Aragon in Zaragoza. Founded in 1932, the club have won the Inter-Cities Fairs Cup and the UEFA cup Winners’

Cup. Players for the team come from a variety of different nations including Argentina, Italy, Hungary, Serbia,

Croatia, Paraguay, Mexico, Spain, Portugal and Brazil.

7 Fernando Rubén Gago, born on 10 April 1986, is an Argentine footballer. Gago currently plays for the club Real

Madrid C.F., as well as for the team Boca Juniors and the Argentine national football team; he has played in four

other clubs prior to joining Real Madrid. Gago has thus far scored no goals for his national team.

8 American Idol is a singing competition aired on television by the Fox Broadcasting Company, and produced by

FremantleMedia and 19 Entertainment. It was first aired on 11 June 2002. The programme is presented by Brian

Dunkleman and Ryan Seacrest, and the panel of judges is composed of Mariah Carey, Randy Jackson, Simon

Cowell, Steven Tyler, Ellen DeGeneres, Paula Abdul, Jennifer Lopez and Kara Dio Guardi. Its producers are Shane

Drake, Ken Warwick, Bruce Gowers, Nigel Lythgoe, Gregg Gelfand, John Pritchett and Andrew Scheer.

9 Mercyful Fate is a speed metal musical group from Copenhagen, Denmark. It has been associated with the bands

Metallica, Arch Enemy, King Diamond, Memento Mori, Brats, Black Rose, Force of Evil, Spiritual Beggars and

Fate. The band has been active since 1981. Its current members are King Diamond, Hank Shermann, Sharlee

D’Angelo, Mike Wead and Bjarne T. Holm; past members are Snowy Shaw, Michael Denner, Timi Hansen and Kim

Ruzz. It has released records on the labels Roadrunner Records, Combat Records, Metal Blade Records and Rave

On.

10 William Herschel (born Friedrich Wilhelm Herschel) was a German composer. He was born on 15 November 1738 in

Hanover, Electorate of Brunswick-Lüneburg, Holy Roman Empire. Herschel was known for the pieces Uranus and

Infrared. He died on 25 August 1822 in Slough, Berkshire, England.

11 Belgrade (officially the City of Belgrade, native name Beograd) is the capital of Serbia. It is a city with an area of

359.96 km2 and forms part of the Belgrade Municipalities. Its City Council is ruled by the Socialist Party of Serbia

and the Party of United Pensioners of Serbia; the current Mayor is Milan Krkobabić, and the Deputy Mayor is

Dragan Đilas. Belgrade was established prior to 279 BC. The population demonym of Belgrade is Belgrader.

12 Winx Club is an Italian animated television show aired in stereo on the networks 4Kids TV, Rai 2, and

Nickelodeon. It is directed by Iginio Straffi and released on 28 January 2004, and has so far run for 104 episodes


over four seasons. It is narrated by Joanna Lee, Alfred R. Kahn and Norman J. Grossfeld.


Order of the articles in the survey:

Page Text number Generated by system

1 1 C

1 3 B

1 6 B

1 4 A

2 7 C

2 5 A

2 2 C

2 8 B

3 10 A

3 4 B

3 9 C

3 11 C

4 12 A

4 2 B

4 3 A

4 5 B

5 10 C

5 1 A

5 8 C

5 12 B

6 3 C

6 9 B

6 2 A

6 7 B

7 11 A

7 6 C

7 8 A

7 1 B

8 12 C

8 11 B

8 9 A

8 4 C

9 10 B

9 5 C

9 7 A

9 6 A


References

Androutsopoulos, I., Kokkinaki, V., Dimitromanolaki, A., Calder, J., Oberlander, J., Not, E.

(2001). Generating Multilingual Personalized Descriptions of Museum Exhibits – The M-

PIRO Project. Retrieved from http://arxiv.org/ftp/cs/papers/0110/0110057.pdf

Asher, N. & Lascarides, A. (2003). Logics of Conversation. Studies in Natural Language

Processing. Cambridge University Press.

Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L.,

Patel-Schneider, P. & Stein, L.A. World Wide Web Consortium (W3C). (2004). OWL

Web Ontology Language Reference. Retrieved from http://www.w3.org/TR/owl-ref/

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American.

Retrieved from http://campus.fsu.edu/bbcswebdav/users/bstvilia/lis5916metadata/

readings/scientific-american_0.pdf

Berners-Lee, T. & Connolly, D. (W3C). (2011). Notation3 (N3): A readable RDF syntax.

Retrieved from http://www.w3.org/TeamSubmission/n3/

Bizer, C., Jentzsch, A. & Cyganiak, R.. (2011). State of the LOD Cloud. Retrieved from

http://www4.wiwiss.fu-berlin.de/lodcloud/state/

Bontcheva, K., & Davis, B. (2009). Natural Language Generation from Ontologies. In J. Davies,

M. Grobelnik, & D. Mladenic (Eds.), Semantic Knowledge Management: Integrating

Ontology Management, Knowledge Discovery and Human Language Technology (pp. 113–

127). Springer.

Brickley, D., & Guha, R.V. (W3C). (2004). RDF Vocabulary Description Language 1.0: RDF

Schema. Retrieved from http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

Brickley, D. & Miller, L. (2010). FOAF Vocabulary Specification 0.98. Retrieved from

http://xmlns.com/foaf/spec/20100809.html

Busemann, S., & Horacek, H. (1998). A Flexible Shallow Approach to Text Generation. Retrieved from http://arxiv.org/abs/cs.CL/9812018

Busemann, S. (2011). Shallow Text Generation. Retrieved from

http://www.coli.uni-saarland.de/courses/LT1/2011/slides/shallow-nlg-lecture_WS1112.pdf

Butuc, M.G. (2009). Semantically enriching content using OpenCalais. Retrieved from

www.eed.usv.ro/SistemeDistribuite/2009/Butuc1.pdf

Cohn, T., & Lapata, M. (2009). Sentence compression as tree transduction. Journal of Artificial Intelligence Research, 34, 637–674. Retrieved from http://eprints.pascal-network.org/archive/00005887/

Cunningham, H., Wilks, Y. & Gaizauskas, R.J. (1996). Gate: a general architecture for text

engineering. Proceedings of the 16th conference on Computational linguistics-Volume 2,

pp. 1057--1060


Cyganiak, R. & Jentzsch, A. (2011). The Linking Open Data cloud diagram. Retrieved from

http://lod-cloud.net/

Decker, S., Van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., Klein, M.,

Melnik, S. (2000). The Semantic Web-on the respective Roles of XML and RDF. Retrieved

from

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6109&rep=rep1&type=pdf

Duboue, P. A., & Mckeown, K. R. (2003). Statistical acquisition of content selection rules for

natural language generation. Proceedings of the 2003 conference on Empirical methods in

natural language processing, pp. 121-128. Retrieved from

http://dl.acm.org/citation.cfm?id=1119371

Feldman, R. & Sanger, J. (2007). The Text Mining Handbook - Advanced Approaches in

Analyzing Unstructured Data. Cambridge University Press.

Filippova, K., & Strube, M. (2008). Dependency tree based sentence compression. Proceedings of

the Fifth International Natural Language Generation Conference on - INLG ’08, 25.

doi:10.3115/1708322.1708329

Gagnon, M., & Sylva, L. D. (2006). Text Compression by Syntactic Pruning. Advances in

Artificial Intelligence 312–323. Springer.

Galanis, D., & Androutsopoulos, I. (2007). Generating multilingual descriptions from

linguistically annotated OWL ontologies: the NaturalOWL system. Proceedings of the

Eleventh European Workshop on Natural Language Generation, 143–146. Retrieved from

http://dl.acm.org/citation.cfm?id=1610188

Galley, M., Fosler-Lussier, E., & Potamianos, A. (2001). Hybrid natural language generation for

spoken dialogue systems. In Proceedings of the 7th European Conference on Speech

Communication and Technology (Interspeech-Eurospeech). September 3-7, 2001. Aalborg,

Denmark.

Grice, H. P. (1975). Logic and Conversation. In P. Cole & J. L. Morgan (Eds.), Syntax and Semantics, Vol. 3: Speech Acts (pp. 43–58). New York, NY: Academic Press.

Heath, T. & Bizer, C., (2011). Linked Data: Evolving the Web into a Global Data Space

(1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136.

Morgan & Claypool.

Hewlett, D., Kalyanpur, A., Kolovski, V., & Halaschek-Wiener, C. (2005). Effective NL paraphrasing

of ontologies on the Semantic Web. In Workshop on End-User Semantic Web Interaction,

4th Int. Semantic Web conference, Galway, Ireland. Retrieved from

http://www.mindswap.org/papers/nlpowl.pdf

Jurafsky, D. & Martin, J.H. (2009). Speech and language processing: An introduction to natural

language processing, computational linguistics, and speech recognition. Prentice Hall: New

Jersey.

Kasneci, G., Ramanath, M., Suchanek, F., & Weikum, G. (2008). The YAGO-NAGA Approach

to Knowledge Discovery. Retrieved from

http://dl.acm.org/citation.cfm?id=1519103.1519110


Klein, D., & Manning, C. (2003). Accurate Unlexicalized Parsing. Proceedings of the 41st

Meeting of the Association for Computational Linguistics, pp. 423-430.

Klein, E. and Rovatsos, M. (2011). Temporal vagueness, coordination and communication. In

Nouwen, R., Schmitz, H.-C., van Rooij, R., and Sauerland, U., editors, Vagueness in

Communication, LNCS. Springer.

Klyne, G., & Carroll, J. (W3C). (2002). Resource Description Framework (RDF): Concepts and

Abstract Data Model. Retrieved from http://www.w3.org/TR/2002/WD-rdf-concepts-

20020829/

Liang, S.F., Stevens, R., Scott, D. & Rector, A. (2012). OntoVerbal: a Protege plugin for

verbalising ontology classes. Proceedings of the Third International Conference on

Biomedical Ontology , (ICBO'2012), Graz, Austria.

Mann, W.C., & Thompson, S.A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3), 243–281.

Mendes, P., & Jakob, M. (2011). DBpedia Spotlight: shedding light on the web of documents. Proceedings of the 7th International Conference on Semantic Systems (I-Semantics ’11), 1–8. Retrieved from http://dl.acm.org/citation.cfm?id=2063519

Mendes, P., Jakob, M., & Bizer, C. (2012). DBpedia: A Multilingual Cross-Domain Knowledge

Base. Retrieved from http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/

research/publications/Mendes-Jakob-Bizer-DBpedia-LREC2012.pdf

Prud'hommeaux, E., & Seaborne, A. (W3C). (2008). SPARQL Query Language for RDF,

Retrieved from http://www.w3.org/TR/rdf-sparql-query/

Reiter, E., & Dale, R. (2000). Building Natural Language Generation systems. Cambridge

University Press.

Rosch, E.H. (1973). Natural categories. Cognitive Psychology 4 (3): 328–50. DOI:10.1016/0010-

0285(73)90017-0.

Sarawagi, S. (2008). Information Extraction. Foundations and Trends in Databases, 1(3), 261–377. doi:10.1561/1900000003

Shi, L. & Mihalcea, R. (2005). Putting Pieces Together: Combining FrameNet, VerbNet and

WordNet for Robust Semantic Parsing. Computational Linguistics and Intelligent Text

Processing. Lecture Notes in Computer Science, DOI: 10.1007/978-3-540-30586-6_9

Soricut, R., & Marcu, D. (2003). Sentence level discourse parsing using syntactic and lexical information. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003), 149–156. Retrieved from http://dl.acm.org/citation.cfm?id=1073475

Sripada, S. G., Reiter, E., Hunter, J., & Yu, J. (2003). Generating English summaries of time

series data using the Gricean maxims. Proceedings of the ninth ACM SIGKDD

international conference on Knowledge discovery and data mining - KDD ’03, 187.

doi:10.1145/956755.956774

Stevens, R., Malone, J., Williams, S., Power, R., & Third, A. (2011). Automating generation of textual class definitions from OWL to English. Journal of Biomedical Semantics, 2(Suppl. 2), S5. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3102894/

Sun, X. & Mellish, C. (2007). An Experiment on “Free Generation” from Single RDF triples.

Retrieved from www.aclweb.org/anthology/W07/W07-2316.pdf


White, M., (2008). OpenCCG Realizer Manual. Documentation of the OpenCCG Realizer.

Retrieved from https://svn.kwarc.info/repos/lamapun/lib/LaMaPUn/External/Math-

CCG/docs/realizer-manual.pdf