
Web Semantics: Science, Services and Agents on the World Wide Web 19 (2013) 1–21


Improving habitability of natural language interfaces for querying ontologies with feedback and clarification dialogues

Danica Damljanović a,∗, Milan Agatonović b, Hamish Cunningham a, Kalina Bontcheva a

a University of Sheffield, Department of Computer Science, Sheffield, United Kingdom
b Fizzback, London, United Kingdom

Article info

Article history:
Received 25 January 2011
Received in revised form 8 February 2013
Accepted 20 February 2013
Available online 5 March 2013

Keywords:
Natural language interfaces
Ontologies
Learning
Clarification dialogues
User interaction
Feedback

Abstract

Natural Language Interfaces (NLIs) are a viable, human-readable alternative to complex, formal query languages like SPARQL, which are typically used for accessing semantically structured data (e.g. RDF and OWL repositories). However, in order to cope with natural language ambiguities, NLIs typically support a more restricted language. A major challenge when designing such restricted languages is habitability – how easily, naturally and effectively users can use the language to express themselves within the constraints imposed by the system. In this paper, we investigate two methods for improving the habitability of a Natural Language Interface: feedback and clarification dialogues. We model feedback by showing the user how the system interprets the query, thus suggesting repair through query reformulation. Next, we investigate how clarification dialogues can be used to control the query interpretations generated by the system. To reduce the cognitive overhead, clarification dialogues are coupled with a learning mechanism. Both methods are shown to have a positive effect on the overall performance and habitability.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Recent years have seen a tremendous increase in structured data on the Web, with public sectors such as UK and USA governments opening their data to the public,1 and encouraging others to build useful applications on top. At the same time, the Linked Open Data (LOD) project2 continues to promote the authoring, publication and interlinking of new RDF graphs with those already in the LOD cloud [1]. In March 2009, around 4 billion RDF statements were available while in September 2010 this number increased to 25 billion, and continues to grow. This massive amount of data requires effective exploitation, which is now a great challenge largely due to the complexity and syntactic unfamiliarity of the underlying triple models and the query languages built on top of them. Natural Language Interfaces (NLIs) to rich, structured data, such as RDF and OWL repositories, are a viable, human-readable alternative.

∗ Corresponding author. Tel.: +44 1142221931. E-mail addresses: [email protected], [email protected] (D. Damljanović), [email protected] (M. Agatonović), [email protected] (H. Cunningham), [email protected] (K. Bontcheva).
1 http://data.gov.uk and http://www.data.gov.
2 http://linkeddata.org.

The main challenges related to building NLIs are centred around solving the Natural Language understanding problem, the data that is being queried, and the user, and the way in which the user's information need is verbalised into a question.

Solving the Natural Language understanding problem includes grammar analysis, and solving language ambiguity and expressiveness, e.g. [2]. Ambiguity can be avoided through the use of a Controlled Natural Language (CNL): a subset of Natural Language (NL) that includes a limited vocabulary and grammar rules that must be followed. Expressiveness can be improved by extending the system vocabulary with the use of external resources such as WordNet [3] or FrameNet [4].

The second group of challenges is related to the data that is being queried, and building portable systems—those that can be easily ported from one domain or ontology to another without significant effort. According to [5], a major challenge when building NLIs is to provide the information the system needs to bridge the gap between the way the user thinks about the domain of discourse and the way the domain knowledge is structured for computer processing. This implies that in the context of NLIs to ontologies, it is very important to consider the ontology structure and content. Two ontologies describing identical domains (e.g., music) can use different modelling conventions. For example, while one ontology can use a datatype property artistName of class Artist, the other one might use instances of a special class to model the artist's name.3

3 See for example how the class Alias is used in the Proton System Module ontology: http://proton.semanticweb.org/.


Ontologies can be constructed to include sufficient lexical information to support a domain-independent query analysis engine. However, due to different processes used to generate ontologies, the extracted domain lexicon might be of varying quality. In addition, some words might have different meanings in two different domains. For example, "How big" might refer to height, but also to length, area, or population—depending on the question context, but also on the ontology structure. This kind of adjustment – or mapping from words or phrases to ontology concepts/relations – is performed during customisation of NLIs.

The third group of challenges is centred around the users and how they translate their information need into questions. While NLIs are intuitive, having only one text query box can pose difficulties for users, who need to express their information need through a natural language query effectively [6]. In order to address this problem, several usability enhancement methods have been developed with the aim to either assist users with query formulation, or to communicate the system's interpretation of the query to the user. In other words, the role of these methods is to increase the habitability of the system. Habitability refers to how easily, naturally and effectively users can use language to express themselves within the constraints imposed by the system. If users can express everything they need for their tasks, using the constrained system language, then such a language is considered habitable.

Our focus is on building portable systems that do not require strict adherence to syntax—the supported language includes both grammatically correct and ill-formed questions, but also question fragments. We look at improving the habitability of such NLIs to ontologies through the application of feedback and clarification dialogues. We first discuss habitability and the four different domains that it covers in Section 2. We then describe how we model feedback relative to the specific habitability domains, and evaluate it in a user-centric, task-based evaluation (Section 3). Further on, in Section 4 we look at clarification dialogues and whether they can improve the specific habitability domains, by making the process of mapping an NL question into a formal query transparent to the user. We combine the dialogue with a light learning model in order to reduce the user's cognitive overhead and improve the system's performance over time. We then examine the approach we have taken, which combines clarification dialogues with learning, in a controlled evaluation using the Mooney GeoQuery dataset.

2. Habitability

According to Epstein [7], a language is habitable if:

• Users are able to construct expressions of the language which they have not previously encountered, without significant conscious effort.

• Users are able to easily avoid constructing expressions that are not part of the language.

Another way of viewing habitability is as the mismatch between user expectations and the capabilities of an NLI system [8]. Ogden and Bernick [9] describe habitability in the context of four domains:

• The conceptual domain of the language supported by the system describes the area of its coverage, and defines the complete set of objects and the actions which are covered. In other words, the conceptual domain determines what can be expressed by the system. Consequently, this domain is satisfied if the user does not ask about concepts which cannot be processed by the system. To cite the example from [9], the user could not ask "What is the salary of John Smith's manager?" if there is no information about managers in the system. The conceptual domain of the language can be expanded to inform the user that there is no information about managers in the system.

• The functional domain determines how a query to the system can be expressed. Natural language allows different ways of expressing the same fact, especially taking into account the knowledge of the listener and the context. The functional domain is determined by the number of built-in functions or knowledge the system has available. If, for example, the answer to a question requires combining several knowledge sources, the system itself might not be able to answer it and would require the user to ask two questions instead of one. A habitable system provides the functions that the user expects. Note that this is different from rephrasing the question due to unsupported grammar constructions, which is related to the syntactic domain.

• The syntactic domain of a language is determined by the number of paraphrases of a single command that the system understands. For example, to cite again the example from [9], the system might not be able to understand the question "What is the salary of John Smith's manager?" but could be able to process a rephrased one such as "What is the salary of the manager of John Smith?".

• The lexical domain is determined by the words available in the lexicon. For example, in order to improve the coverage, many systems extend their lexicon through the use of external sources for finding synonyms.

For an NLI to be considered habitable, it should cover all four domains. Habitability is an important aspect of a system to measure because it can affect the usability of NLIs. By identifying why systems fail to be habitable, we can identify the ways to improve them [10].

One way to increase habitability is to use usability enhancement methods such as feedback and clarification dialogues. We first look at how feedback can improve the user's experience with an NLI, thus having an effect on habitability (Section 3). Further on, we look at using clarification dialogues to improve the habitability domains and make the process of mapping an NL question onto the formal query transparent; this gives the users control as they can influence the full interpretation of the query (Section 4).

3. Feedback

Showing the user the system's interpretation of the query in a suitably understandable format is called feedback. Feedback increases the user's confidence and, in the case of failures, helps the user understand which habitability domain is affected. Several early studies [11,12] show that after receiving feedback, users become more familiar with the system's interpretations and then usually try to imitate the system's feedback language. In other words, returning feedback to the user helps them understand how the system interprets queries, therefore motivating them to use similar formulations and create queries that are understandable to the system.

Showing feedback can be useful for communicating the message between the user and the computer clearly. This is comparable to human–human communication, where participants usually try to establish that the message they are trying to communicate is properly understood. This process is called grounding—as the users try to ground what is being said [13]. As pointed out by Clark and Brennan [13], humans seek evidence of understanding, which can either be positive or negative. Negative evidence is the evidence that they have not been understood, or heard, and if they find any, they attempt to repair it. If they fail to find any negative evidence, the assumption is that the other human understood the message correctly. However, people often search for positive evidence as well, such as acknowledgements, or inclusion of the relevant next turn. Using negative evidence to repair in human–human conversation is studied in human–computer interaction such as in [14], where the authors conducted a study with novice users who deal with a database system called SAMi. The study reveals that 30% of the user's time is spent in repair. In comparison to human–human conversation, repair seems to be more significant in human–computer interaction, as it becomes the primary medium for learning the actions required by the system and also the method of correcting ineffective input [14].

In the following section we describe how we model feedback, which is used to communicate the message between the user and the system clearly, and suggest repair if necessary.

3.1. Baseline

To model and test feedback, we used a Natural Language Interface for querying ontologies, which we developed in our previous work [15], as the baseline. The baseline system automatically generates a set of ranked query interpretations from an NL query or its fragment, and then shows the answer of the best ranked option which returns a non-empty result to the user. Both failure to generate query interpretations, and generated interpretations that result in no answer, produce the same output—the system shows the message "No answer found". Hence, the user does not receive any feedback from the system in terms of how the query is interpreted—the message "No answer found" does not provide any additional information on whether this was due to system failure or due to non-existing knowledge. In other words, it does not become clear to the user whether repair in the form of query reformulation might or might not help in answering the question.
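To make the baseline behaviour concrete, the sketch below shows the answer-selection logic as we read it from the description above: the system walks down the ranked interpretations and returns the first non-empty result, otherwise the generic message. This is a minimal illustration in Python; the Interpretation class, the execute callback and the ranking are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Interpretation:
    """One candidate mapping of the NL query onto ontology concepts/relations."""
    description: str   # e.g. "capital - has capital - country - sub region of - Europe"
    score: float       # ranking score assigned by the system

def baseline_answer(interpretations: List[Interpretation],
                    execute: Callable[[Interpretation], List[str]]) -> str:
    """Return the answer of the best ranked interpretation that yields results.

    Both 'no interpretations generated' and 'all interpretations empty'
    collapse into the same opaque message, which is exactly the lack of
    feedback that Section 3.2 addresses.
    """
    for interp in sorted(interpretations, key=lambda i: i.score, reverse=True):
        results = execute(interp)   # e.g. run the corresponding SPARQL query
        if results:
            return ", ".join(results)
    return "No answer found"
```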

3.2. Extending baseline with feedback

We extend the baseline by implementing feedback—showing the user all possible query interpretations and the system's rankings, so that the user can then modify the answer by choosing the correct interpretation. We modelled feedback having in mind the four previously discussed habitability domains, and using repair to improve them where appropriate. Given an NL query as input, the system can produce the following output:

• Success—the query is successfully parsed and the query interpretation is correct. Showing feedback is positive evidence of understanding the query:
  – A non-empty answer: if the user's query is correctly interpreted, and the system returns a non-empty answer, feedback can increase the user's confidence that the answer is indeed correct and can also make the user familiarise himself with the queried knowledge structure.
  – An empty answer: the answer is not found although the system successfully parsed the question. As curating the knowledge is outside the scope of the topic discussed in this paper, we consider this case to be a negative answer, and therefore categorise it under success. The role of feedback is to communicate this message to the user effectively so that the user can conclude with confidence that the answer is negative.
• Failure—the query is not successfully parsed or the query interpretation is incorrect. Showing feedback is negative evidence of understanding the query and should suggest a repair. Ideally, the system should be able to detect which habitability domain is affected, and the feedback should be used to make the user aware of the reasons why the failures happened. We distinguish two kinds of failures based on whether or not repair in the form of query reformulation could be used to correctly answer the question (a sketch of this taxonomy is given after the list):
  – Encourage query reformulation: The question could be answered if reformulated, and the feedback should help the user to reformulate the query to conform to the lexical, functional or syntactic domain of the system.
  – Encourage change of topic: The answer is not found because the system could not find the information about the required concepts—the question could not be answered if reformulated. The user should be able to conclude based on feedback that the question is outside the conceptual domain of the system.
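As an illustration only, the taxonomy above can be written down as a small classification routine; the outcome names mirror the bullets, while the parsed/interpretation-correct/answer-found/concepts-in-kb flags are hypothetical inputs that an actual system would have to compute.

```python
from enum import Enum, auto

class FeedbackOutcome(Enum):
    SUCCESS_NON_EMPTY = auto()      # correct interpretation, answer found
    SUCCESS_NEGATIVE = auto()       # correct interpretation, no relation => negative answer
    FAILURE_REFORMULATE = auto()    # repair: encourage query reformulation
    FAILURE_CHANGE_TOPIC = auto()   # outside the conceptual domain

def classify(parsed: bool, interpretation_correct: bool,
             answer_found: bool, concepts_in_kb: bool) -> FeedbackOutcome:
    """Map a system response onto the feedback categories of Section 3.2."""
    if parsed and interpretation_correct:
        return (FeedbackOutcome.SUCCESS_NON_EMPTY if answer_found
                else FeedbackOutcome.SUCCESS_NEGATIVE)
    # Failure: decide whether reformulation can help at all.
    if concepts_in_kb:
        return FeedbackOutcome.FAILURE_REFORMULATE
    return FeedbackOutcome.FAILURE_CHANGE_TOPIC
```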

3.2.1. Hiding complexities

One challenge when modelling feedback is showing the system's interpretation bearing in mind that NLIs are intended to be used by users not necessarily familiar with ontologies. NLIs to ontologies usually translate a natural language query into some intermediate interpretation such as a set of triples or a formal query such as SPARQL. Hence, the most natural way from the point of view of the system's developer would be showing either triples or the SPARQL query. However, as our intention is to develop methods which are suitable for casual users as well as for semantic web experts, our initial design sought to simplify the system's interpretation, and hide complexities as much as possible. Therefore, the following decisions were taken (a sketch follows the list):

• Show labels instead of URIs.
• Show the linear list of elements (instead of triples) in the order in which they appear in the question.
• Show relations between the elements by rendering a tree-like structure.
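The following fragment illustrates these three decisions under stated assumptions: it is not the paper's implementation, just a sketch in which the label lookups, the element list and the parent/child links are hypothetical data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Element:
    uri: str                      # ontology resource recognised in the question
    label: str                    # human-readable label shown instead of the URI
    children: List["Element"] = field(default_factory=list)

def identified_context(elements: List[Element]) -> str:
    """Decisions 1 and 2: labels instead of URIs, linear list in question order."""
    return " - ".join(e.label for e in elements)

def render_tree(node: Element, depth: int = 0) -> str:
    """Decision 3: a tree-like view making the direction of relations explicit."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.append(render_tree(child, depth + 1))
    return "\n".join(lines)
```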

3.2.2. Identified context and tree-based view

Implementing these decisions resulted in the Web interface shown in Fig. 1. After the user posts a question, the system first generates the table with two columns: Identified context, which shows query interpretations as a linear list of elements (recognised concepts and relations between them as found in the ontology), and Our score, which shows the score by which the interpretations are ranked. The system automatically selects the first option, and the results are rendered using the tree-based view.

Fig. 1. Feedback: showing the user the system's interpretation of the query "capitals of countries in Europe".

The user has the option to select any of the Identified contexts by clicking on the radio button in the desired row. The results for the selected interpretation will be rendered upon clicking. Further on, the user can explore the tree-based view by selecting its nodes, for example Country in Fig. 1, and the instances will be shown in the right hand side pane.

In cases where the system recognises concepts in a query, but does not find any results, the query interpretation (e.g. a set of concepts) will be shown in the Identified context, and on selection the message reading "No relation found within this context" is displayed in the area for displaying the tree-based view.

3.2.3. Linearised list of concepts

Fig. 2. Feedback and results for "runtime parameters of RASP parser".

The query interpretation is shown as a set of recognised concepts, which follows the order in which they appear in the question. However, due to the presence of properties in each query interpretation (as properties are crucial to get the correct interpretation and consequently the answer), this can lead to the 'not so natural' effect; for an example, see Fig. 2. The Identified context is shown including the has runtime parameter relation. The interpretation as such is not understandable without an additional explanation to the user—users must be trained to understand the role of the property in between the recognised concepts. The other option which we could consider is to reverse the order and show the interpretation to read:

rasp parser (language analyzer) has runtime parameters resource parameter

However, for more complex queries, this approach would require modelling triples. For example, if we look back at the example in Fig. 1, the first interpretation reads:

capital has capital country sub region of Europe (continent)

To make this interpretation more natural, we would have to show:

country has capital capital; country sub region of Europe (continent)

However, this makes it harder to follow which question term refers to which ontology concept, and from where the relations were derived. Therefore, we used the linearised representation, but decided to model the tree-like view (see the lower left part of Fig. 2), so that it is indeed clear to the user that according to the knowledge structure "the RASP parser has runtime parameters . . . " rather than the other way around.
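To make the trade-off concrete, here is a small sketch (ours, not the paper's code) contrasting the two presentations for an interpretation stored as subject–property–object triples; the triple list for the "capitals of countries in Europe" example is reconstructed from the text above and is only illustrative.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]   # (subject label, property label, object label)

# Illustrative triples behind the first interpretation of
# "capitals of countries in Europe" (reconstructed from the text above).
triples: List[Triple] = [
    ("country", "has capital", "capital"),
    ("country", "sub region of", "Europe (continent)"),
]

def linearised(question_order: List[str]) -> str:
    """Linearised view: concepts and properties in the order they map onto the question."""
    return " ".join(question_order)

def triple_view(ts: List[Triple]) -> str:
    """Triple-per-line view: more explicit, but multiplies for complex queries."""
    return "\n".join(f"{s} -- {p} -- {o}" for s, p, o in ts)

print(linearised(["capital", "has capital", "country", "sub region of", "Europe (continent)"]))
print(triple_view(triples))
```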

When no results were found, the user was prompted with the message "No relations found within this context"; see Fig. 3.

Fig. 3. Feedback and results for "init parameters of RASP parser".

3.3. Evaluation

In order to test feedback we organised a task-based evaluation with the participants from the GATE Summer School in July 2009.

3.3.1. Training

Participants listened to the talk about Natural Language Interfaces to Ontologies for 20 min, where they were given a short overview of how the system works and the language supported by the system. They were given a five minute demo on how to use the Web-based interface.4

4 Slides are available from http://gate.ac.uk/sale/talks/gate-course-july09/slides-pdf/questio.pdf.

3.3.2. Evaluation measures

At the beginning of the experiment, we asked participants to complete a questionnaire about their background (age, gender, profession, knowledge of semantic technologies). They then completed four tasks, after each of which they were asked to answer several questions. We then measured the following based on the answers:

• Effectiveness: whether they could finish the tasks successfully.
• Feedback: whether the feedback was helpful or not for the particular task.
• Difficulty of the supported language: whether or not it was easy to formulate the queries for the task.

The subjects were offered a predefined set of answers, with an option to add additional comments in free-text. After finishing all the tasks, subjects were asked to complete the SUS questionnaire as a standard user satisfaction measure.

In addition, we measured efficiency—the time each user spent on each task, and also the number of queries they used.
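For readers unfamiliar with SUS, the standard scoring of the 10-item questionnaire (general knowledge, not specific to this paper) maps each response onto a 0–100 scale; the sketch below also shows the kind of 0–100 normalisation used later for the expertise score (footnote 7). The questionnaire answers in the example are invented for illustration.

```python
from typing import List

def sus_score(responses: List[int]) -> float:
    """Standard SUS scoring: 10 Likert items (1-5), result on a 0-100 scale."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items positive, even items negative
    return total * 2.5

def normalise_0_100(value: float, minimum: float, maximum: float) -> float:
    """Linear rescaling to 0-100, as used for the combined expertise score."""
    return 100.0 * (value - minimum) / (maximum - minimum)

# Hypothetical questionnaire answers, for illustration only.
print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))   # -> 80.0
```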

3.3.3. Dataset

We initialised the system with two domain ontologies. The first one covers GATE components,5 while the second one is the Mooney GeoQuery ontology that covers the geography of the United States.6

5 http://gate.ac.uk/ns/gate-kb.
6 The Mooney geography dataset is available from http://www.ifi.uzh.ch/ddis/research/talking-to-the-semantic-web/owl-test-data/.

Subjects were asked to perform four tasks. For each task, they had the opportunity to choose between the two ontologies. If they were not confident in their knowledge about GATE, we hoped they would choose the task relating to US geography. The task pairs covering the two domains were of the same complexity.

3.3.4. Tasks

Our intention was to see whether users could make the correct conclusions based on the system's feedback, and therefore conclude correctly whether the system's response resulted in a success or failure, and in case of failures whether it could successfully suggest repair. We designed four tasks, each one to assess a specific part of the feedback discussed at the beginning of this section:

• Success—a non-empty answer: an NL question is expected to be successfully parsed, the query interpretation correct, and the answer to the question found and returned to the user. Based on feedback, the user should conclude that the answer is correct and terminate the task (Task 1).

• Success—negative answer: an NL question is expected to be successfully parsed, but the answer to the question is negative. Based on feedback, the user should conclude that the query interpretation is correct, as the knowledge about concepts exists, but the lack of relations between the concepts indicates that the answer is negative. Based on feedback, they should conclude that they successfully finished the task and they should terminate it (Task 2).

• Failure—encourage repair through query reformulation: an NL question is parsed but the resulting output leads to an incorrect (often empty) answer. Based on feedback, the subjects need to decide that there is knowledge about what they are searching for in the system, but the query they are likely to type in first is too complex and needs reformulating, rather than concluding there is no answer. A successfully finished task results in Success—a non-empty answer (Task 3).

• Failure—encourage change of topic: an NL question cannot be parsed or the query interpretation is incorrect or partially correct (e.g., the output leads to a more generic answer as some concepts are skipped during query processing as they are not in the knowledge base). Based on feedback, the user should conclude that query reformulation cannot help as the knowledge is not available in the system (Task 4).

The task pairs were as follows:

Task 1:

• Task 1a: Find part of speech taggers which exist in GATE. Find out which parameters exist for the POS Tagger of your choice.

• Task 1b: Find mountains which exist in the United States. Find out in which state the mountain of your choice is located.

Task 2:

• Task 2a: Imagine that you are a GATE developer who needs to extend the RASP Parser. Your task is to find out the names of init parameters.

• Task 2b: Find out which states border Hawaii.

Task 3:

• Task 3a: What are the parameters of the PRs which are included in the same plugin as the Morpher?

• Task 3b: Which rivers flow through the state in which the Harvard mountain is located?

Task 4:

• Try exploring the knowledge available in the system. Either search for various components of GATE such as PRs, plugins, LRs, VRs, or explore the geography of the United States by enquiring about: cities, states, rivers, mountains, highways, etc. Then ask some questions in order to connect these concepts such as "Which states border Georgia?" or "Which rivers flow through states which border California?". Input as many queries as you like.

3.3.5. Participants

Participants were all external to Sheffield University, and were not known to us before they registered to attend the GATE Summer School. They were almost evenly distributed across researchers, software developers and students, as well as across gender. We measured their expertise in ontologies, ontology editors and SPARQL, using the Likert scale.7 Their knowledge of the semantic web technologies was neither basic nor advanced (see Fig. 4), although it did lean towards a more advanced level (M = 60.80, SD = 22.98).

7 Expertise was calculated as a linear combination of these three, and then normalised on the scale from 0 to 100, similar to how the SUS score is calculated. The scores were normally distributed.

Fig. 4. Expertise in using ontologies, ontology editors and SPARQL for 30 participants (M = 60.79, SD = 22.97).

3.3.6. Results

While the number of participants at the GATE Summer School was 50, participation in the evaluation was on a voluntary basis, and many did not complete all required tasks or all questionnaires. We therefore disregarded all incomplete records, leaving 30 participants who had completed the background questionnaire and at least the first three tasks. 11 out of these 30 participants finished Task 4, while 19 completed the SUS questionnaire. However, all of them had previously finished at least three tasks and therefore we can make conclusions about user satisfaction based on these records.

Effectiveness. Fig. 5 illustrates the task difficulty based on the mean success rate across the four tasks. Based on the average value of the task difficulty across all participants, Task 1 was the easiest, while Task 3 was the most difficult to finish. This is because the subjects usually followed the wording of the description of the task carefully, and hence, for Task 1, as the description is given in two sentences (e.g., "Find mountains which exist in the United States. Find out in which state is the mountain of your choice located."), they typed in a query per sentence. Also the second sentence asks about "the mountain/the POS tagger of your choice", which indicates that they first needed to find out the names of mountains/POS taggers using the first query, and then follow up with another query formulated using the result from the first. For Task 3, they struggled more as their first attempt was to finish the task with one complicated query (often following the exact wording of the task, which was given in one sentence), before they decided to reformulate the query. This was in line with our expectations as Task 3 was designed to test whether feedback can help users to resolve failures by suggesting repair through query reformulation.

Fig. 5. Task difficulty based on the success rate per task: finished with ease (0), finished with difficulty (1), not finished (2).

However, looking into the distribution of different difficulty levels per task, as shown in Fig. 6, only Tasks 2 and 3 had failures (23.33% and 20% respectively). Task 1 was completed successfully by all participants, with only four subjects reporting difficulty. The log shows that the queries of those who reported difficulty were very similar to the queries of the subjects who reported that they finished the task with ease. However, instead of terminating the task after determining the answer, the former group of subjects usually followed up with an additional set of queries, some of which could not be answered by the system. For example, after asking "What part-of-speech taggers are there in GATE?", a subject followed up with "What parameters are there for the Hepple tagger?". Both queries were correctly answered by the system; however, the subject followed up with "Can you give me any more detailed information?". The system returned no answer, and the subject terminated the task and reported difficulty. Another subject tried 14 queries, all of them being very similar, but just reworded versions enquiring about POS taggers in GATE. In some cases, this caused a system failure, such as in "What are the PRs for POS tagging?" where "POS tagging" was not recognised due to the failure of the morphological analyser, and after the user reformulated the question to "What are the PRs for POS taggers?" the results were returned.

Fig. 6. Frequency of different success rates per task.

Interestingly, if participants managed to finish Task 2 successfully, they did not experience any difficulties. This is because most subjects managed to formulate the query that was correctly parsed by the system immediately, and then the subjects either understood the system's message (that the answer is negative), or they did not and attempted to follow up with many query reformulations—however this did not help and they eventually terminated the task reporting that they could not finish it.

Task 3 was not finished in 20% of the cases. In comparison to Task 2, this is slightly better, however, a large portion of those who completed Task 3 (37.5%) reported difficulty in doing so. This is because the query reformulation that was part of this task was not easy for the majority of participants—it took them 5.5 queries on average to successfully complete the task, although the optimal number of queries that was necessary was 2.

Task 4 was finished by only 11 participants, the majority of whom reported that they finished it with ease. Based on the query logs, the reason seems to be that the majority of subjects used queries that were successfully answered in, and similar to those used in, the previous tasks. Hence, they did not experience any failures and we could not test the case when the queries fell outside of the conceptual domain of the supported language (Failure—encourage change of topic).

User satisfaction. With regard to the SUS score, the result (M = 66.97, SD = 15.38) can be interpreted to be in between OK and good: according to [16] the score of 50.9 is the lower limit for OK, and 71.4 is the lower limit for good (see Fig. 7).

Fig. 7. Distribution of SUS scores for 19 participants (M = 66.97, SD = 15.38). A large portion of the participants (42.1%) rated the system usability as good (in the range from 70 to 80).

Subjective measures of user satisfaction. Fig. 8 shows the distribution of the subjects' subjective judgment on the Identified context. The exception is Task 2, for which we did not ask subjects about the Identified context explicitly. Instead we asked them whether it was clear that the answer was negative—there were no states or no init parameters of the RASP parser.

Fig. 8. Clarity of feedback for all tasks.

Task 1: Success—a non-empty answer. A large percentage of subjects (43.33%) found the Identified context confusing or neutral when doing Task 1, although all of them successfully finished the task. Six subjects who found the Identified context confusing reported that "several of the generated examples were confusing or non-sensical e.g. state -- is mountain of -- rainer". The reason for this was that the system showed the recognised elements of the query in the order in which they appeared in the query. A more natural way of showing this to the user would be: rainer -- is mountain of -- state. However, this kind of interpretation is a step towards showing triples to the end-user, and for more complex queries, these would need to be multiplied. As we have previously discussed (see Section 3.2.3), our intention was to mark question terms as recognised without going deeply into the complexities of ontology structure; the tree-based view was meant to correctly show the relations and that in fact rainer -- is mountain of -- state, and not the other way around. Indeed, all subjects had positive comments on the tree-based view component of feedback.

Task 2: Success—a negative answer. With regard to the seven subjects who failed to complete Task 2, this happened due to the following reasons:

• 28.57% said the system provided confusing output so they could not determine what to do.
• 71.43% said the system provided no output so they could not determine what to do.

The last group can be classified as system failure, and therefore we conclude that the remaining 28.57% of failures happened due to the users struggling to understand the feedback.

Looking at the results of the 23 participants who claimed that they finished Task 2 with ease (see Fig. 9):

• For 34.78% it was not clear, based on feedback, that there were no bordering states/no init time parameters for the RASP parser for that specific task, but they could successfully finish the task by looking at the results for some other queries. For example, some of them said that "they determined that the system meant that there were no bordering states by querying another state with others bordering it".
• 13.04% experienced a system failure, which they recognised as repair and reformulated the query in order to finish the task successfully.
• For 52.17% of participants, the feedback shown by the system was clear enough to immediately draw the conclusion that there was no answer.

Fig. 9. Clarity of feedback for Task 2 considering only the participants who finished the task with ease.

Overall, one third of subjects struggled to understand the system's feedback; however, four fifths of those found an alternative way to solve the task, usually by trying similar queries which returned a non-empty result.

From Task 2, we conclude that the Identified context coupled with the message "No relation found within this context" was not useful, even though 76.67% of subjects found a way to complete the task successfully. Hence, for those queries for which the answer is negative, showing the user that the system knows about certain concepts but does not find any answer due to the missing relations resulted in a large number of subjects being confused. Some of them reported that they would rather see the message "There are no states" or "There are no parameters" instead of the list of recognised concepts and a generic message "No relations found".

Task 3: Failure—encourage repair through query reformulation. Among the 20% of subjects who failed to complete Task 3, one reported that the "system provided confusing output: could not manage to find out how to formulate the query; tried several ones by refinement". Upon further investigation of this user's query log, we found out that he tried 18 different queries, most of which gave some results; however, they were either too generic (e.g., PRs), or too specific and long, and also very similar to the wording of the actual task, for example "creole plugin PRs parameters that are the same as the parameters of GATE morphological analyser".

The majority of subjects tried to input the exact wording of Task 3 into the system and then, since the system showed the recognised concepts but no answer, a large number recognised the need to repair and reformulated the query. This resulted in 80% of subjects successfully finishing the task, while 20% gave up.

Difficulty of the supported query language. Fig. 10 illustrates the difficulty of the supported language as perceived by participants, per task. According to these results, subjects struggled most with formulating queries for Task 3. This is because the majority tried to solve the task using one complicated query, which they reformulated several times before deciding to split it in two, which successfully completed the task.

Fig. 10. The subject's perception about the difficulty of the supported language.

Table 1 illustrates the optimal number of queries that was required for successfully finishing the first three tasks, and the average number of queries used across all subjects. The lowest number of queries was for Task 2. However, it seems that the subjects who finished Tasks 2 and 3 successfully used fewer queries. The logs reveal that the majority of subjects who failed to complete the task could have finished it even after the first query, given that they could understand feedback. While the highest average number of queries per participant is for Task 1, this is not related to the task difficulty, as the majority finished this task with ease. In contrast, the query log shows that the subjects could finish Task 1 usually after the second query; however, they decided to try several other similar queries before marking that they had finished the task successfully.

Table 1. Number of queries per task across all subjects.

Task | Optimal #queries | Avg. #queries | Avg. #queries (tasks finished) | Avg. #queries (tasks not finished)
1 | 2 | 7.27 | 7.27 | n/a
2 | 1 | 4.10 | 3.17 | 7.14
3 | 2 | 6.03 | 5.5 | 8.17

One of the subjects stated that "[our system] is a nice tool but can easily be fake i.e. try 'state mountains in the States' or 'state apple, monkeys, bananas, mountains in the USA'". This is an interesting observation, and is indeed true. Our system would indicate that "state" at the beginning of the query is recognised as geo:State, and the user, knowing this is not true, would need to reformulate the query (i.e. use similar words such as "give me" or "show" or "list" instead of "state" at the beginning of the query).

Comparison with baseline. The baseline system described in Section 3.1 was tested in the baseline usability study through measuring effectiveness, efficiency, and user satisfaction with 12 participants and 4 tasks, covering the GATE domain. The goal and the scope of that evaluation go beyond the goals of this paper (see [17] for more details); however, when designing the feedback usability study presented here, we repeated some tasks deliberately (Tasks 1 and 4), in order to make a comparison where appropriate.

Task 1 was intended to test the difference between the effectiveness and the efficiency of the baseline with feedback in comparison to the baseline without feedback. The aim was to answer the question of whether feedback improves usability of the baseline system or not for those tasks for which the answer exists in the knowledge base.8

• Effectiveness. In the baseline usability study the tasks equivalent to Task 1 in the feedback usability study resulted in a task difficulty of 0.67, while in the feedback usability study the same task resulted in a task difficulty of 0.13. This result shows that the task seemed easier in the baseline with feedback in comparison to the baseline without feedback. We tested the significance of this difference using the Chi-Square test of independence. Our null hypothesis was that there is no relation between the system used (independent variable) and effectiveness measured through the task difficulty (dependent variable). Based on the results we can reject the null hypothesis, leading us to the conclusion that the difference in effectiveness in using the two systems is significant, χ2(2, N = 42) = 8.31, p = .016. This indicates that feedback had a positive impact on effectiveness as the task seemed significantly easier when performed using the baseline with feedback in comparison to the same task performed using the baseline only.

8 As in the feedback usability study the first task was equivalent to the two tasks (Tasks 1 and 2) in the baseline usability study, we first merged the results of these two into one. For effectiveness, in case the success score differed for the two tasks in the previous study, the higher one was picked as the representative. For example, if one of the tasks was marked as task completed with ease (0), and the other failed to complete (2), the overall assigned score was failed to complete (2). For efficiency, measured through the time spent on the task, we summarised the time for Tasks 1 and 2 into one value.

• Efficiency. With regard to efficiency, in the baseline usability study it took subjects 180.5 s on average to finish the task equivalent to Task 1 in the feedback usability study. In the feedback usability study the same task took 155.27 s on average, which indicates that the subjects were faster when using the baseline with feedback system. To test the significance of this difference we used a 2-tailed independent t-test, which revealed that this difference is not significant (t(40) = 0.19, p = .85 with equal variances assumed), and thus we retain the null hypothesis and conclude that there is no relation between the system used (independent variable) and efficiency measured through the time spent on task (dependent variable). This indicates that feedback did not have a significant influence on how quickly subjects could finish the task.

Further on, as we assessed the difficulty of the supported language in the baseline usability study, we can now compare the results of the two studies for the tasks that were repeated in the feedback usability study. In other words, we compared the perception of the difficulty of the supported language in the baseline usability study and in the feedback usability study for the equivalent tasks.

As we described in Section 3.1, the baseline without feedback and the baseline with feedback support the same query language. Hence, the comparison of the user's perception of the difficulty of the supported language will help us reveal whether there is any effect of some other factors (such as user-interaction features of the tree-based view, and the feedback delivered through the Identified context) that influence the perception of the difficulty of the query formulation. Our null hypothesis is that there is no relation between the system used and the difficulty of the supported language. We used the non-parametric Fisher Exact Test to assess this.9

9 The Fisher exact test is the exact version of Chi-square, which is usually used for testing 2-by-2 tables, particularly for small samples. As Chi-square is an approximation, it is not as trustworthy as the exact test on data with expected counts less than 5.

Table 2. Difficulty of the supported language as perceived by subjects in the two studies.

Query language | Easy (%) | Neutral (%) | Difficult (%)
Baseline (defined task) | 66.7 | 29.2 | 4.2
Baseline with feedback (defined task) | 80 | 6.7 | 13.3
Baseline (undefined task) | 41.7 | 50 | 8.3
Baseline with feedback (undefined task) | 63.6 | 0 | 36.4

Table 2 shows the distribution of the answers, indicating that there were more positive answers in the feedback usability study in comparison to the baseline usability study for both defined and undefined tasks. For defined tasks, Fisher's Exact test reveals that this difference is not significant (F = 5.25, p = .07), hence there is no evidence to reject the null hypothesis (that there is no difference in how the two groups of subjects perceived the query formulation for defined tasks). For the undefined task, this difference is significant (F = 8.02, p = .015), indicating that subjects had the impression that the query language in the feedback usability study was easier than the one required by the baseline system in the baseline usability study. This indicates that feedback had a positive effect on the user's perception of the difficulty of the query language and helped boost the user's experience.
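The statistical machinery used above (Chi-Square test of independence, independent t-test, Fisher's exact test) is standard; the sketch below shows how such tests are typically run in Python with scipy, on made-up contingency tables and timings rather than the study's raw data, which the paper does not publish, so the printed values will not reproduce the reported statistics.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table: system (rows) x task difficulty 0/1/2 (columns).
# These counts are invented for illustration only.
difficulty_counts = np.array([[8, 7, 6],     # baseline without feedback
                              [26, 4, 0]])   # baseline with feedback
chi2, p_chi2, dof, _ = stats.chi2_contingency(difficulty_counts)

# Independent two-tailed t-test on per-subject task times (seconds), equal variances assumed.
times_baseline = np.random.normal(180.5, 60, size=21)   # illustrative samples
times_feedback = np.random.normal(155.3, 60, size=21)
t_stat, p_t = stats.ttest_ind(times_baseline, times_feedback, equal_var=True)

# Fisher's exact test on a 2x2 table (e.g. easy vs not-easy perception per system).
easy_counts = [[16, 8],    # baseline: easy, not easy
               [12, 3]]    # baseline with feedback: easy, not easy
odds_ratio, p_fisher = stats.fisher_exact(easy_counts)

print(f"chi2={chi2:.2f} (dof={dof}), p={p_chi2:.3f}; t={t_stat:.2f}, p={p_t:.2f}; "
      f"Fisher p={p_fisher:.3f}")
```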

3.4. Summary and discussion

In this section we presented one possibility of designing feedback in Natural Language Interfaces for querying ontologies, and a task-based evaluation with 30 subjects. This was conducted in order to assess whether using feedback has any effect on the usability of such systems and hence could help in building habitable NLIs.

As a baseline we used a Natural Language Interface for querying ontologies that we developed and evaluated in our previous work. We first extended the baseline by implementing feedback using the following two elements:

• The Identified context table showing all query interpretations to the user, where each interpretation is a linear combination of the concepts and relations between them. The order of the recognised concepts follows the order in which they appear in the question.
• The tree-based view showing the concepts and their relations for any selected Identified context.

We designed feedback to test four different aspects:

• Success—a non-empty answer was tested through Task 1.
• Success—a negative answer was tested through Task 2.
• Failure—repair through query reformulation was tested through Task 3.
• Failure—encourage change of topic was intended to be tested through Task 4. However, the subjects did not experience this kind of failure at all and hence this aspect is not further discussed.

In the baseline usability study described elsewhere (see [17]), among other tests, we measured effectiveness and efficiency per task, and also the perception of difficulty of the supported query language per user. In the feedback usability study presented here, we repeated two tasks (Tasks 1 and 4) in order to make a comparison, where appropriate, of the two systems: the baseline with feedback and the baseline without feedback.

In the feedback usability study all subjects completed Task 1, although four of them did so with difficulty. This result is significantly better (p = .01) than the results for the same task in the baseline usability study, indicating that tasks for which the answer exists in the knowledge base are more easily finished successfully with the baseline with feedback in comparison to the same kind of tasks performed with the baseline system. However, although the subjects finished the task more quickly than in the previous study, this difference is not significant (p ≥ .78).

Identified context was not well received even for Task 1, which was the easiest. For Task 2, the Identified context was not key to success. Instead of understanding that there were no relations within the identified context, as stated by the system, the subjects reformulated the initial query many times, and tried similar ones in order to understand the answer. The average number of queries per task for the tasks completed successfully is much lower than for those that were not completed, indicating that the subjects who did not understand the system's messages believed that they needed to reformulate the query; they did this many times until eventually giving up. This is specifically the case for Tasks 2 and 3. For Task 3, the feedback which combines the Identified context and the tree-based view was quite successful in suggesting repair, and 80% of subjects managed to reformulate their initial queries and finish the task successfully.

Overall, our conclusion is that feedback can help to build habitable NLIs through showing the user how the system interpreted the query. By looking at the interpretations, the users can better understand if the query they formulated is too complex for the system, and if they need to reformulate it in order to receive the answer. More specifically:

• Feedback had a positive impact on the overall effectiveness ofthe system, but no significant effect on efficiency.

• Feedback had a positive impact on the subject’s perception ofthe difficulty of the supported query language.

• The Identified context showing the linearised list of conceptswas notwell accepted, especially for the caseswhen the answerto the question was negative. In other words, showing thatthe system knows about certain concepts, but cannot find anyrelations between them was not clear and the subjects dislikedthe generic message ‘‘No relation found’’.

• The tree-based view and especially its interactive feature waswell accepted. This indicates that showing context from whichthe answer was derived can increase user confidence.

• For complex queries, feedback was useful to suggest repair through query reformulation.

We attempted to render feedback in a user-friendly manner and our eventual goal is to make the vast amount of structured information available to casual users. However, based on the evaluation presented in this section, we can only make claims about the population represented by our sample, which largely included computational linguists, computer scientists, and software developers, who were familiar with semantic web technologies even if not on an advanced level.

While feedback can be useful to train the user towards formulating queries that are supported by the system, thus improving habitability, this method does not allow the user to be involved in, or in any way modify, the potential query interpretations. In other words, the user can either:

• Choose from an already existing list of query interpretations, or
• Recognise repair and reformulate the query, for which the system will generate a new set of interpretations.

In the next section, we look at clarification dialogues as a method that allows the user to be involved in modifying or generating query interpretations. In other words, the user can supervise the process of mapping an NL question into a formal query in order to produce the correct answer. Habitability in this case is expected to be improved by extending the existing habitability domains of the language through the dialogue, and repair is expected to happen during the process of mapping.

4. Clarification dialogues

Using clarification dialogues is a common way of solving the ambiguity problem in NLIs to ontologies (e.g., Querix [18], AquaLog [19]), and involves engaging the user in a dialogue whenever the system fails to solve the ambiguities automatically. This method is especially effective for large knowledge bases with a huge number of items with identical names, but also when the question is ambiguous. For example, if the user asks "How big is California?", the system might discover ambiguity when trying to map "big" into state population or state area. Hence, using clarification dialogues will allow the user to specify the meaning of "big". Another way to solve ambiguities is to show all possible answers to the user; however, this might not be feasible when there are too many alternatives. Clarification dialogues can help to define the information need precisely, thus resulting in a system with higher precision.

We extend the application of clarification dialogues to support not only solving ambiguities but rather to make the whole process of mapping an NL question into the form that would lead to the correct answer transparent to the user. We combined clarification dialogues with a light learning model in order to improve the habitability domains of our NLI. We consider several aspects of habitability domains previously discussed in Section 2, and look at how we can improve them.

The lexical domain of NLIs to ontologies is usually bound to the ontology lexicalisations, and then extended from various resources to include synonyms, hypernyms and hyponyms. The most common resources are WordNet [3], FrameNet [4], and OpenCyc.10 A few systems use resources accessible through the Semantic Web via the owl:sameAs relation (e.g., PowerAqua [20]), or even personalise the system by including user-centric vocabulary. We derive the domain lexicon from the ontology, enrich it with synonyms from WordNet, and then further enrich it from dialogues with the user.
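A minimal sketch of this kind of lexicon construction is shown below, using rdflib to read rdfs:label values and NLTK's WordNet interface for synonyms. The ontology file name is a placeholder and the exact enrichment rules used in our system may differ.

# Sketch: derive a domain lexicon from ontology labels and enrich it with
# WordNet synonyms. "geo.owl" is a hypothetical local copy of the queried
# ontology; the NLTK WordNet corpus must be installed.
from rdflib import Graph, RDFS
from nltk.corpus import wordnet as wn

def build_lexicon(ontology_file):
    g = Graph()
    g.parse(ontology_file)
    lexicon = {}                                   # surface form -> set of concept URIs
    for uri, label in g.subject_objects(RDFS.label):
        term = str(label).lower()
        lexicon.setdefault(term, set()).add(uri)
        # add WordNet synonyms of the label as additional surface forms
        for synset in wn.synsets(term.replace(" ", "_")):
            for lemma in synset.lemma_names():
                lexicon.setdefault(lemma.replace("_", " ").lower(), set()).add(uri)
    return lexicon

lexicon = build_lexicon("geo.owl")
print(len(lexicon), "surface forms")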

The most obvious way to improve the conceptual domain is to add new knowledge about the "non-existing" concept to the ontology. This goes beyond the scope of our research, as we are interested in querying the existing data without any intention to give feedback on its quality (e.g. by adding, updating or deleting statements from the ontology). Hence, instead of improving this domain towards extending the system's knowledge, we focus on communicating the failure to the user through a dialogue that shows what the coverage of the system is. The dialogue contains suggestions listing only the available concepts—those that the system knows about. For example, if the ontology is about geography, and the user asks "List actors from Hollywood", the system will not know about actors, but knows about Hollywood (a district in Los Angeles). In this case, it will prompt the users with a dialogue, asking them to map "actors" to properties such as population, area and classes such as state, which are clearly not related to "actors" at all. In other words, the dialogue will assist the user to receive the answer about all concepts/relations related to Hollywood; however, the lack of suggestions mentioning actors indicates that the query falls beyond the coverage of the supported conceptual domain.

The functional domain can be improved by extending the list of functions available in the system, given that those extensions are in line with user expectations. However, for flexible supported languages, such as ours, which do not have strictly defined rules to be followed, issues with unsupported functions can arise from the system incorrectly interpreting a question term. We solve this problem by allowing different configurations of the dialogue. For example, one possible configuration (a force dialogue mode) is to model a dialogue for any attempt to map a question term onto an Ontology Concept. In this case it is possible to extend the dialogue, allowing users a flexible mapping of an NL query to the formal query. For example, we added a None element to each dialogue to allow users to ignore attempts to perform mappings that would cause a system failure. In this way, instead of asking the user to reformulate the question, we are asking the system to ignore the specific question term. In addition, the suggestions shown to the user include some of the functions that are used to map an NL query into SPARQL. This is especially the case for datatype properties of type number, where, in addition to mapping a word/phrase onto a datatype property, it is possible to map it to a function such as minimum, maximum, or sum, which tells the system to apply that function to the result set before it shows the answer to the user.
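To make this concrete, the sketch below shows how a suggestion list for a numeric datatype property could be extended with minimum, maximum and sum entries plus a None element. The Property and Suggestion classes are illustrative containers, not the system's actual data structures.

# Illustrative sketch: extending dialogue suggestions with aggregate functions
# and a None element for numeric datatype properties.
from dataclasses import dataclass

@dataclass
class Property:
    name: str
    datatype: str               # e.g. "number" or "string"

@dataclass
class Suggestion:
    label: str
    target: str                 # ontology concept, or a function applied to it

def build_suggestions(properties):
    suggestions = []
    for p in properties:
        suggestions.append(Suggestion(p.name, p.name))
        if p.datatype == "number":
            # lets e.g. "largest" / "smallest" / "total" be mapped onto functions
            for fn in ("max", "min", "sum"):
                suggestions.append(Suggestion(f"{fn} ({p.name})", f"{fn}:{p.name}"))
    # the None element lets the user tell the system to ignore this question term
    suggestions.append(Suggestion("None", "ignore"))
    return suggestions

props = [Property("city population", "number"), Property("city name", "string")]
for s in build_suggestions(props):
    print(s.label)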

The syntactic domain of any supported language can be improved by allowing as many grammatical constructions as possible. When the system fails to answer a question that falls out of the scope of the syntactic domain, an ideal solution would be to encourage the user to reformulate the query. We improve this domain by making the process of mapping an NL query into SPARQL transparent—the user controls the mapping through the dialogue. In some situations the dialogue might reveal that certain syntactic structures are not supported. For example, if the system attempts to map "as" to an ontology concept, it shows the user that the system is not aware of the semantic meaning of the word, and the question should be reformulated.

10 http://www.opencyc.org/.

Fig. 11. FREyA workflow: from a Natural Language question to the answer.

The involvement of the user in the dialogue is empowered by a light learning model, in order to improve the habitability domains (and hence the system's performance) over time. We tested the combination of clarification dialogues and the learning model through the implementation of the FREyA system, to which we now turn.11

4.1. FREyA workflow

FREyA12 is an interactive Natural Language Interface for querying ontologies, which combines syntactic parsing with the ontology-based lookup in an attempt to precisely answer questions. If the system fails to automatically generate the answer (or when it is configured to work in the force dialogue mode, see Section 4.7), it models the clarification dialogue. The suggestions shown to the user are found through ontology reasoning and are initially ranked using a combination of string similarity and synonym detection. The system then learns from user selections, and improves its performance over time.

Fig. 11 shows the workflow starting with a Natural Language question (or its fragment), and ending when the answer is found.

The syntactic parsing and analysis generates a parse tree (using the Stanford Parser [21]) and then uses several heuristic rules in order to identify Potential Ontology Concepts (POCs). POCs refer to question terms/phrases which can, but do not necessarily have to, be linked to Ontology Concepts (OCs). POCs are chosen based on the analysis of the syntactic parse tree; however, this analysis does not require strict adherence to syntax and works on ill-formed questions and question fragments as well as on grammatically correct ones. For example, nouns, verbs, or WH-phrases such as Where, Who, When, How many are expected to be found by our POC identification algorithm. This algorithm is based on the identification of prepreterminals and preterminals in the parse tree, as well as on their part-of-speech tags (see [22]).
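The sketch below illustrates the flavour of such an algorithm over an NLTK-style constituency tree: it collects prepreterminal phrases and keeps the tokens whose part-of-speech tags are nouns, verbs or WH-words. The tag whitelist and the example tree are illustrative; this is not the exact rule set described in [22].

# Illustrative POC identification over a constituency parse tree (NLTK-style).
from nltk.tree import Tree

CANDIDATE_TAGS = {"NN", "NNS", "NNP", "NNPS",                   # nouns
                  "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",      # verbs
                  "WRB", "WP", "WDT"}                           # WH-words

def is_preterminal(node):
    # a preterminal is a POS tag dominating exactly one word
    return isinstance(node, Tree) and node.height() == 2

def find_pocs(parse):
    """Return candidate POC phrases (joined tokens whose tags look interesting)."""
    pocs = []
    for sub in parse.subtrees():
        kids = list(sub)
        # a prepreterminal: every child is a preterminal, e.g. (NP (NNP New) (NNP York))
        if kids and all(is_preterminal(k) for k in kids):
            words = [k.leaves()[0] for k in kids if k.label() in CANDIDATE_TAGS]
            if words:
                pocs.append(" ".join(words))
    return pocs

parse = Tree.fromstring(
    "(SBARQ (WHNP (WP What)) (SQ (VBZ is) (NP (NP (DT the) (NN population)) "
    "(PP (IN of) (NP (NNP New) (NNP York))))))")
print(find_pocs(parse))   # ['What', 'population', 'New York']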

The ontology-based lookup links question terms to logical forms in the ontology, which we call Ontology Concepts (OCs), without considering any context or grammar used in the question (apart from morphological analysis, see [15]). Ontology Concepts refer to instances/individuals, classes, properties, or datatype property values such as string literals. By default, the system assumes that the rdfs:label property is used to name the specific Ontology Concept. However, for ontologies which use different naming conventions (such as using dc:title inside the MusicBrainz dataset), it is possible to predefine which properties are used for names. This will enable the system to make the distinction between a datatype property value element and an instance element. This distinction is important for determining the semantic meaning of the question terms.

11 Note that this system is different from the systems described in Section 3.
12 More information about the system and the source code is available from: https://sites.google.com/site/naturallanguageinterfaces.

The consolidation algorithm aims at mapping existing POCs to OCs automatically. If it fails, the user is engaged in a dialogue. For instance, in the query "Give me all former members of the Berliner Philharmoniker", the POC identification algorithm will find that "the Berliner Philharmoniker" is a POC, while the ontology-based lookup will find that "Berliner Philharmoniker" is an OC, referring to an instance of mm:Artist. As the only difference in the POC and the OC text is the determiner ("the"), the consolidation algorithm will resolve this POC automatically by removing it, thus verifying that this noun phrase refers to the OC with dc:title "Berliner Philharmoniker".
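A minimal sketch of this determiner-stripping step is given below; it is a simplified stand-in for the consolidation algorithm, covering only the case described in the example.

# Simplified consolidation check: a POC resolves to an OC automatically when
# the only difference is a leading determiner.
DETERMINERS = {"the", "a", "an"}

def consolidate(poc_text: str, oc_label: str) -> bool:
    poc_tokens = poc_text.lower().split()
    oc_tokens = oc_label.lower().split()
    while poc_tokens and poc_tokens[0] in DETERMINERS:
        poc_tokens.pop(0)                 # drop "the" in "the Berliner Philharmoniker"
    return poc_tokens == oc_tokens

print(consolidate("the Berliner Philharmoniker", "Berliner Philharmoniker"))   # True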

When the system fails to automatically generate the answer, it will prompt the user with a dialogue. There are two kinds of clarification dialogues in FREyA:

• Disambiguation dialogues involve users in resolving all identified ambiguities.
• Mapping dialogues involve users in mapping a POC to one of the suggested OCs.

While the two types of dialogues look identical from the user's point of view, there are differences which we will highlight here. Firstly, we give a higher priority to disambiguation dialogues in comparison to mapping dialogues. This is due to our assumption that question terms which exist in the graph (OCs) should be interpreted before those which do not (POCs). Note that FREyA does not attempt to interpret the whole question at once, but rather one pair at a time. In other words, each resolved dialogue can be seen as a pair of two OCs: an OC to which a question term is mapped, and the neighbouring OC (context). Secondly, the way the suggestions are generated for the two types of dialogues differs. Disambiguation dialogues include only suggestions with Ontology Concepts that are the result of ontology-based lookup. Mapping dialogues, in contrast, show suggestions that are found through ontology reasoning. This ensures that any suggestion that is shown to the user will generate an answer.

Finally, the sequencing of the dialogues is controlled differently for the two kinds:

• Disambiguation dialogues are driven by the question focus or the answer type, whichever is available first: the closer the OC to be disambiguated is to the question focus/answer type, the higher the chance that it will be disambiguated before any other. The question focus is the term/phrase which identifies what the question is about, while the answer type identifies the type of the question (such as Person in the query "Who owns the biggest department store in England?"). The focus of this question would be "the biggest department store" (details of the algorithm for identifying the focus and the answer type are described in [23]). After all ambiguities are resolved, the workflow continues to resolve all POCs through mapping dialogues.
• Mapping dialogues are driven by the availability of OCs in the neighbourhood. We calculate the distance between each POC and the nearest OC inside the parse tree, and the one with the minimum distance is the one to be used for the dialogue, before any other.

4.2. Disambiguation dialogues

Table 3
Generating suggestions based on the type of the nearest OC.

Type of the closest OC              Suggestions
Class or instance                   All classes connected to the OC by exactly one property, and all properties defined for this OC
Datatype property of type number    Maximum, minimum and sum function of the OC
Object property                     All domain and range classes for the OC
Datatype property value             Suggestions for the instance to which this value belongs

For ambiguous OCs that are identified through ontology-based lookup, dialogues are modelled so that the user disambiguates the specific meaning. Disambiguation dialogues consist of an ambiguous term and a list of OCs. The user is then asked:

I struggle with [ambiguous term]. Is [ambiguous term] related to:

OC1
OC2
...
OCn

While it is possible to automatically disambiguate the meaning depending on the question context and using ontology reasoning (e.g. ontology relations), this option could be expensive, but also insufficient. Our approach suggests that any automatic disambiguation could be corrected by involving the user in a dialogue. For example, if someone is enquiring about "Mississippi", we might not be able to automatically derive whether the query refers to geo:River,13 or geo:State, because we do not have enough context for effective disambiguation. However, if the question is "Which rivers flow through Mississippi?", the context can help automatically derive that the question is about "Mississippi state", due to the existing relation in the ontology such as geo:River -- geo:flowsThrough -- geo:State.

4.3. Mapping dialogues

For all POCs that could not be automatically resolved to an OC, mapping dialogues were initiated. They consisted of an unknown/POC term and a list of suggestions. The user was then asked:

I struggle with [POC term]. Is [POC term] related to:

suggestion 1 (OC1)
suggestion 2 (OC2)
...
suggestion n (OCn)

Note that while the OCs in Disambiguation dialogues are found by ontology-based lookup, the OCs (suggestions) in Mapping dialogues are found by ontology reasoning—they are derived based on the closest OC to the POC term. The closest OC is found by walking through the syntax tree. Based on the type of the closest OC, the rules for generating suggestions vary (see Table 3). For the closest OC X, we identified its neighbouring concepts, which were shown to the user as suggestions. Neighbouring concepts include the defined properties for X, and also its neighbouring classes. Neighbouring classes of class X are those that are defined to be (a short sketch of this computation follows the list):

• The domain of the property P where range(P) = X, and
• The range of the property P where domain(P) = X.
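Under the assumption that the ontology declares rdfs:domain and rdfs:range for its properties, the neighbourhood of a class can be collected roughly as in the sketch below (using rdflib; the file name and the geo: URI are placeholders).

# Sketch: neighbouring concepts of a class X, derived from domain/range definitions.
from rdflib import Graph, RDFS, URIRef

def neighbouring_concepts(g: Graph, x: URIRef):
    suggestions = set()
    for p in g.subjects(RDFS.domain, x):                # properties P with domain(P) = X
        suggestions.add(p)
        suggestions.update(g.objects(p, RDFS.range))    # ...and their range classes
    for p in g.subjects(RDFS.range, x):                 # properties P with range(P) = X
        suggestions.add(p)
        suggestions.update(g.objects(p, RDFS.domain))   # ...and their domain classes
    return suggestions

g = Graph()
g.parse("geo.owl")                                      # hypothetical local copy of the ontology
for concept in sorted(neighbouring_concepts(g, URIRef("http://www.mooney.net/geo#City"))):
    print(concept)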

Option none (None Element) is always added to the list of suggestions (see Table 4), unless FREyA is configured differently (see Section 4.7 for different modes). This allows the user to ignore suggestions if they are irrelevant, thus improving FREyA's functional domain. That is, the system assumes that the POC in the dialogue should not be mapped to any suggested OCs, and therefore the system learns that this POC is either: (1) incorrectly identified, or (2) cannot be mapped to any OC as the ontology does not contain the relevant knowledge. While this option will not be of significant benefit to end-users, it is intended to identify flaws in the system and encourage improvements.

13 For clarity of presentation, we use prefix geo: instead of http://www.mooney.net/geo# in all examples.

Table 4
Sample queries and generated suggestions for the identified POCs.

Query: Population of cities in California
  POC: Population            Closest OC: geo:City
  Suggestions: 1. City population  2. State  3. Has city  4. Is city of  5. None

Query: Population of California
  POC: Population            Closest OC: geo:California
  Suggestions: 1. State population  2. State pop density  3. Has low ...  n. None

Query: Which city has the largest population in California
  POC: Largest population    Closest OC: geo:City
  Suggestions: 1. max (city population)  2. min (city population)  3. sum (city population)  4. None

The task of creating and ranking suggestions before showing them to the user is quite complex, and this complexity increases with the size of the queried knowledge source.

4.4. Ranking suggestions

The initial ranking of suggestions is based on string similarity between a POC term and the suggestions, and also on synonym detection (a simplified sketch follows this list):

• String similarity. We combined Monge Elkan14 metrics with the Soundex15 algorithm. When comparing two strings, the former gives a very high score to those which are exact parts of the other. For example, if we compare population with city population, the similarity would be maximised as the former is contained in the latter. The intuition behind this is that the ontology concepts are usually named using camelCased names, and are more explicit than how they are usually referred to using natural language, e.g., cityPopulation, stateArea, projectName, and the like. The Soundex algorithm compensates for any spelling mistakes that the user makes—this algorithm gives a very high similarity to two words which are spelt differently but pronounced similarly.

• Synonym detection. We used WordNet [3] in order to rank the synonyms of a POC higher. For example, if a question is "What is the highest peak in the US?", although there is no mention of US in the ontology, WordNet would list The States as a synonym for US. This would match with geo:State in the ontology and therefore this option would be ranked very high.
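A simplified sketch of such an initial scorer is given below: a containment-based similarity stands in for the Monge Elkan metric, and NLTK's WordNet interface provides the synonym boost; the Soundex component and the exact weighting are omitted.

# Simplified initial ranking: containment-based similarity plus a WordNet
# synonym boost (requires the NLTK WordNet corpus).
from nltk.corpus import wordnet as wn

def containment_similarity(a: str, b: str) -> float:
    a, b = a.lower(), b.lower()
    if a in b or b in a:                        # "population" vs "city population"
        return 1.0
    shared = set(a.split()) & set(b.split())
    return len(shared) / max(len(a.split()), len(b.split()))

def synonym_match(term: str, label: str) -> bool:
    synonyms = {l.replace("_", " ").lower()
                for s in wn.synsets(term.replace(" ", "_"))
                for l in s.lemma_names()}
    return label.lower() in synonyms

def score(poc_term: str, suggestion_label: str) -> float:
    s = containment_similarity(poc_term, suggestion_label)
    return 1.0 if synonym_match(poc_term, suggestion_label) else s

candidates = ["city population", "state", "has city", "is city of"]
print(sorted(candidates, key=lambda c: score("population", c), reverse=True))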

When ambiguous OCs and all POCs have been resolved, the query is interpreted as a set of OCs. At this point, there is enough information to identify the answer type. Unlike other approaches which start by identifying the question type, followed by the identification of the answer type, our approach interprets the majority of the question before it identifies the answer type. The reason for this is that our approach does not require strict adherence to syntax, and it heavily relies on ontology-based lookup and the definitions in the RDF structure. Hence, it can only identify the answer type after all relevant mappings and disambiguations are performed. Note, however, that there are cases when the answer type is identified before the whole question is interpreted, and in this case it is used to drive the remaining mappings, if any (as described above in Section 4.1).

14 http://sourceforge.net/projects/simmetrics/.
15 http://en.wikipedia.org/wiki/Soundex.

4.5. Combining Ontology Concepts into triples and generating SPARQL

The list of Ontology Concepts was prepared to conform to the structure that was suitable for generating triples. As the triples are in the form

SUBJECT - PREDICATE - OBJECT
CLASS/INSTANCE - PROPERTY - CLASS/INSTANCE/LITERAL

we first inserted any potential joker elements in between OCs, if necessary. Jokers are wildcards or variables used instead of classes, instances, literals or properties to generate query interpretations in a triple format. At the time of generating these interpretations it was not known what kind of elements could be expected, and hence jokers were used. The rules for inserting joker elements are as follows:

• If the first or the last element is a property, then we add a Joker element at the beginning or at the end of the list, respectively; a joker here is a variable representing a class, an instance, or a datatype property value (literal).

• If any two classes, instances, or datatype property values in the list of OCs are next to each other, we insert the Joker element representing a property between them.

• If any two properties in the list of OCs are next to each other, we insert a Joker element representing a class/datatype property value between them.

For example, if the first two OCs derived from a question refer to a property and a class respectively, one joker class would be added before them. For instance, the query "What is the highest point of the state bordering Mississippi?" would be translated into the following list of OCs:

isHighestPointOf   State   border     mississippi
PROPERTY           CLASS   PROPERTY   INSTANCE

These elements are transformed into the following:

?       isHighestPointOf   State    border      mississippi
JOKER   PROPERTY1          CLASS1   PROPERTY2   INSTANCE
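The rules above can be sketched as follows, over a flat list of (text, kind) pairs; the representation is illustrative and not the system's internal one.

# Sketch of the joker-insertion rules over a list of (text, kind) elements,
# where kind is "property", "class", "instance" or "literal".
THING_KINDS = {"class", "instance", "literal"}

def insert_jokers(ocs):
    elements = list(ocs)
    if not elements:
        return elements
    # rule 1: a property at either end gets a joker class/instance/literal next to it
    if elements[0][1] == "property":
        elements.insert(0, ("?", "joker-thing"))
    if elements[-1][1] == "property":
        elements.append(("?", "joker-thing"))
    # rules 2 and 3: insert a joker between adjacent elements of the same group
    result = [elements[0]]
    for prev, curr in zip(elements, elements[1:]):
        if prev[1] in THING_KINDS and curr[1] in THING_KINDS:
            result.append(("?", "joker-property"))
        elif prev[1] == "property" and curr[1] == "property":
            result.append(("?", "joker-thing"))
        result.append(curr)
    return result

ocs = [("isHighestPointOf", "property"), ("State", "class"),
       ("border", "property"), ("mississippi", "instance")]
print(insert_jokers(ocs))
# [('?', 'joker-thing'), ('isHighestPointOf', 'property'), ('State', 'class'),
#  ('border', 'property'), ('mississippi', 'instance')]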

The next step is generating a set of triples from OCs, taking into account the domain and the range of the properties. For example, from the previous list, two triples would be generated16:

? - geo:isHighestPointOf - geo:State;
geo:State - geo:borders - geo:mississippi (geo:State);

The last step is generating the SPARQL query. Sets of triples are combined and, based on the OC type, relevant parts are added to the SELECT and WHERE clauses. Following the previous example, the SPARQL query would look like the following:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix geo: <http://www.mooney.net/geo#>
select ?firstJoker ?p0 ?c1 ?p2 ?i3
where {
  { ?firstJoker ?p0 ?c1 .
    filter (?p0=geo:isHighestPointOf) . }
  ?c1 rdf:type geo:State .
  ?c1 ?p2 ?i3 .
  filter (?p2=geo:borders) .
  ?i3 rdf:type geo:State .
  filter (?i3=geo:mississippi) .
}

16 Note that if geo:isHighestPointOf had geo:State as a domain, the triple would look like: geo:State -- geo:isHighestPointOf -- ?;.

Fig. 12. Validation of potential ontology concepts through the user interaction.

An Example. Fig. 12 shows the syntax tree for the query "What is the population of New York?". As "New York" is identified as referring to both geo:State and geo:City, we first asked the user to disambiguate (see Fig. 12(a)). If they selected, for example, geo:City, we start iterating through the list of remaining POCs. The next one ("population") is used together with the closest OC geo:City to generate suggestions for the mapping dialogue. Among them there will be geo:cityPopulation and, after the user selects this from the list of available options, "population" is mapped to the datatype property geo:cityPopulation (see Fig. 12(b)). Note that if the user selected that "New York" refers to geo:State, the suggestions would be different, and following the user's selection, "population" would be mapped to refer to geo:statePopulation, because the closest OC would be geo:State.

An example of the generated suggestions for the same query is shown in Fig. 13. The suggestions are made based on geo:City (city), which is the closest OC. If the user selected geo:State (state), the list of suggestions would contain different options starting with geo:statePopulation (state population) (see Fig. 14). We can see the difference in the generated suggestions in the cases when the user selects that "New York" means the city, and the state, respectively.17 The following answer differs as well. The first dialogue in Figs. 13 and 14 is a disambiguation dialogue, whereas the second one is a mapping dialogue.

17 Note that the system can also work in the automatic mode where it would simulate user selection of the best ranked options, without the need to engage the user into a dialogue. This is discussed later in Section 4.7.

While clarification dialogues give full control to the user when mapping an NL question into the formal query language to formulate the answer, they can also be seen as a cognitive overhead. Therefore, we enhance them with a learning mechanism that is expected to reduce this overhead over time, which increases system performance as well as habitability for end-users.

4.6. Learning

Supervised learning requires a set of questions with the right answers in order to achieve satisfactory performance. Unfortunately, as noted by Belew [24], there are many situations where we do not know the correct answers. In supervised learning, every aspect of the learner's actions can be contrasted with corresponding features of the correct action. On the other hand, semi-supervised approaches such as Reinforcement Learning (RL) aggregate all these features into a single measure of performance. Therefore, reinforcement learning seems much better suited for users, as there is less cognitive overhead.

We decided to use a semi-supervised approach for several reasons. Firstly, supervised learning goes in line with the automatic classification of the question, where each question is usually identified as belonging to one predefined category. Our intention is to avoid this automatic classification and allow users freedom to enter queries of any form. Secondly, we want to minimise the manual work required when mapping some parts of the query to the underlying structure. For example, we want the system to suggest that "Where" should be mapped to a specific ontology concept such as Location, rather than the application developer browsing the ontology structure in order to place this mapping.

Our learning algorithm is inspired by a pure delayed reward reinforcement function [25], which is defined to be zero after the user selects the option, except when an action results in a win (satisfying answer) or a loss (wrong answer or no answer), in which case there would be a +1 reinforcement for a win, and a −1 reinforcement for a loss.

We initialise the value function based on string similarity and synonym detection, as described in Section 4.4. When the user changes the selection (selects an option other than the first one suggested by the system), the system will learn that the previous rankings were not correct, and will recalculate its value function. We assume that the action selected by the user is the one which is desired, and therefore we give a reinforcement of +1 to such an action, while we give −1 to all the others. Therefore, if the initial ranking was wrong, there is a strong chance that this is corrected after only one user choosing the right option, due to the fact that the initial ranking is in the range from 0 to 1. For example, if the question was "How many people live in Florida?" the closest OC to the POC "people" is geo:florida, which is a state. Our ranking mechanism would place the correct suggestion (geo:statePopulation) in the 14th place. This is because there is no significant similarity between "people" and "state population", at least according to our initial ranking algorithm.
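A minimal sketch of this update is shown below; the value table layout and the example scores are illustrative.

# Sketch of the delayed-reward update: +1 for the suggestion the user selected,
# -1 for the rest, on top of initial scores in the range [0, 1].
def update_values(values, selected):
    return {s: v + (1.0 if s == selected else -1.0) for s, v in values.items()}

# initial scores from string similarity / synonym detection (Section 4.4)
values = {"geo:stateArea": 0.41, "geo:capital": 0.33, "geo:statePopulation": 0.12}
values = update_values(values, "geo:statePopulation")
print(sorted(values, key=values.get, reverse=True))
# geo:statePopulation now ranks first after a single correction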

Fig. 15 shows the values of the initial states, the reinforcement received after the user selected geo:statePopulation, and finally the rankings after recalculation.18

4.6.1. Generalisation of the learning model

We use the ontology as a source for designing a generic learning model. When an OC is related to another concept with a subClassOf relation, that concept is used to learn the model. For example, if the features are extracted for the OC of type class—geo:Capital, the same features would be applicable for the OC geo:City, because geo:Capital rdfs:subClassOf geo:City.

18 For the sake of clarity, we only show a subset of the generated suggestions in Fig. 15.

Fig. 13. Generated suggestions and the result for "city population of the New York city".

Fig. 14. Generated suggestions and the result for "state population of New York state".

Fig. 15. Mapping "How many people" to geo:statePopulation in the ontology. (a) Initial ranking. (b) Reinforcement based on the user selecting geo:statePopulation. (c) Ranking after the user selects geo:statePopulation.

In addition, we do not update our learning model per question, but per combination of a POC and the closest OC. We also preserve a function over the selected suggestion such as minimum, maximum, or sum (applicable to datatype property values). In this way, we extract several learning rules from one single question, so that if the same combination of POC and OC appears in another question, we can reuse it. Table 5 shows several sample questions and their derived features, which are used to learn the model.

4.7. Combining clarification dialogues with learning through modes

The role of learning is to improve the ranking of the suggestions shown to the user so that, after sufficient training, the system can automatically generate the answer by selecting the best ranked options. In addition, while the intention behind clarification dialogues is to control the process of mapping an NL question, the approach is applicable to any ranking mechanism. Our assumption is that "no ranking will be perfect" and it should therefore be corrected by involving the user, thus improving the performance of the system.

Table 5
Features used for learning the model.

Question                                                          IF: POC       IF: Context            IF: Function   THEN: Correct rank

"What is the smallest city in the US?"                            Smallest      geo:City               min            geo:cityPopulation
"What is the population of Tempe, Arizona?"                       Population    geo:City               –              geo:cityPopulation
"What is the population of the capital of the smallest state?"    Population    geo:Capital            –              geo:cityPopulation
                                                                   Smallest      geo:State              min            geo:statePopulation
"What state has the smallest population?"                         Population    geo:State              –              geo:statePopulation
                                                                   Smallest      geo:statePopulation    min            geo:statePopulation

An interesting question is how to decide whether or not the existing ranking is good. When using a new dataset, the best option is to use clarification dialogues as much as possible in order to check the system interpretations and correct them if necessary. In that regard, there are several modes that can be used:

• Automatic mode: the system will generate the answer by simulating the selection of the best ranked option. This mode is used when confidence is high that the ranking is effective, or that the system has been trained enough.

• Force dialogue mode: the system will generate a clarification dialogue for each attempt to map a question term or phrase into an ontology concept. This mode operates on two levels:
  1. Ignoring the system's attempt to perform the mapping by adding a None element. As previously discussed, this element is used to ignore the system's attempt to map a question term to an OC.
  2. Extending the disambiguation dialogue: this option extends the disambiguation dialogue by adding more suggestions, in addition to the OCs identified through Ontology-based Lookup. This option is important when the knowledge base has a large number of names (e.g., MusicBrainz) so that any question would be a rich set of Ontology Concepts, while the underlying grammar would be somewhat ignored. For example, in the question "Which members of the Beatles are dead?", due to a huge number of string literals "dead" appearing in the ontology, this element would be annotated to refer to several OCs (such as instances of rdf:type mm:Album), while in this context it needs to be mapped to the property endDate.

4.8. Evaluation

While we tested feedback and its effect on habitability in the user-centric study (Section 3), we tested clarification dialogues in a controlled experiment on the well-known and widely used Mooney GeoQuery dataset.19 This is because it is important to use feedback in order to communicate the right message to the end-user. The supported language does not change with feedback, as it is only used to show the system's interpretations of the query to the user and lets them decide whether the answer is correct or a query reformulation is required. In contrast, clarification dialogues, as described in this section, are used to improve the habitability domains of the supported language. This improvement can be measured by comparing the performance of our system against other similar systems that have used the same ontology and the same set of questions in their evaluation. Another important aspect of clarification dialogues is that they are combined with a learning model, and hence in our evaluation we also look at how the learning mechanism improves the system's performance over time.

19 The ontology and the questions can be downloaded from: http://www.ifi.uzh.ch/ddis/research/talking-to-the-semantic-web/owl-test-data/.

Fig. 16. The distribution of the number of dialogues for 202 correctly answered questions.

We used 250 questions from the Mooney GeoQuery dataset. Although the ontology contains a relatively small portion of knowledge about the geography of the United States, the questions are quite complex and the system must have a good understanding of their semantic meaning in order to correctly answer them.

We evaluated the correctness of the overall approach, as well as the learning and ranking algorithms.

4.8.1. Correctness

We report the correctness of the overall approach in terms of precision and recall, which are measures adapted from Information Retrieval. Precision measures the number of questions correctly answered, divided by the number of questions for which some answer is returned [26,27]. Recall is defined as the number of correctly produced answers, divided by the total number of questions.
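Restated in formula form (this merely rewrites the two definitions above):

\[
\text{precision} = \frac{|\text{correctly answered questions}|}{|\text{questions with some answer returned}|},
\qquad
\text{recall} = \frac{|\text{correctly answered questions}|}{|\text{all questions}|}.
\]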

Recall and precision values are equal, reaching 94.4% (34 + 202 = 236 correctly answered questions out of 250). This is due to the fact that our approach always attempts to generate a dialogue and return an answer, although partial or incorrect. 34 questions were answered correctly without requiring any dialogue with the user, while the remaining 202 required at most four dialogues in order to correctly return the answer (see Fig. 16). The system failed to answer 14 questions (5.6%), five of which are not supported by the system as they are outside its syntactic and functional domain, e.g. negation or comparison: "Which states have points higher than the highest point in Colorado?". The remaining nine were interpreted incorrectly.

Although FREyA required quite significant input from the user, its performance compares favourably to other similar systems. PANTO [28] is a similar system which was evaluated on the Mooney geography dataset of 877 questions (they removed duplicates from the original set of 879). They reported precision and recall of 88.05% and 85.86% respectively. NLP-Reduce [29] was evaluated with the original dataset, reporting 70.7% precision and 76.4% recall. Kaufmann et al. [18] selected 215 questions which syntactically represented the original set of 879 queries. They reported the evaluation results over this subset for their system Querix with 86.08% precision and 87.11% recall. Our 250 questions were those released to the public from the original source and syntactically represent the original dataset.20

20 see http://www.cs.utexas.edu/users/ml/nldata/geoquery.html.


In order to test the statistical significance of our results, we calculated the 95% confidence interval for precision and recall. As we only have one test set, we used the bootstrapping sampling technique, which is used in the CoNLL-03 competition (see [30] and also [31]).

The 95% confidence interval with 1000 samplings ranges from 91.6% to 97.2%. As the lower bound is still higher than the best previously evaluated systems (88.05% precision of PANTO [28], and 87.11% recall of Querix [18]), we conclude that the precision and recall values obtained with our approach were significantly better (p = 0.05) than the precision and recall of other systems trialled with the same dataset. It should be noted, however, that this high performance was achieved by engaging the user in dialogues. Querix also relies on dialogues, while PANTO answers questions automatically.
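A minimal sketch of this kind of bootstrap estimate over per-question outcomes is shown below; the outcome vector is reconstructed from the reported counts (236 correct out of 250) and the percentile method used here is one common variant, not necessarily the exact procedure of [30,31].

# Bootstrap 95% confidence interval for accuracy over a single test set.
import random

def bootstrap_ci(outcomes, n_samples=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        resample = [rng.choice(outcomes) for _ in outcomes]
        scores.append(sum(resample) / len(resample))
    scores.sort()
    low = scores[int(n_samples * alpha / 2)]
    high = scores[int(n_samples * (1 - alpha / 2)) - 1]
    return low, high

outcomes = [1] * 236 + [0] * 14          # 236 of 250 questions answered correctly
print(bootstrap_ci(outcomes))            # roughly (0.92, 0.97), consistent with the text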

What makes our approach stand out is the possibility of putting the user in control through the dialogue, in order to improve the performance incrementally with each user's new question, by boosting the rankings through learning from the user's choices. In the next section, we describe the evaluation of our learning mechanism and its effect on performance.

4.8.2. Learning

We evaluate our learning algorithm using cross-validation on 202 questions, which are a subset of the above 250—those that were answered correctly and required a dialogue.

Cross-validation is a statistical method used to evaluate and compare learning algorithms by dividing the data into two segments: one used to train a model and the other used to test it. In typical cross-validation, the training and validation sets must cross over in successive rounds, such that each data point is eventually used for testing [32]. The basic form of cross-validation is k-fold cross-validation, where the data is first partitioned into k equally (or nearly equally) sized folds. Subsequently, k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for testing, while the remaining (k − 1) folds are used for training.
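For illustration, a 10-fold split of this kind can be produced with scikit-learn as in the sketch below; the question list is only a placeholder for the 202 dialogue-requiring questions.

# Sketch of the 10-fold split: ~181/182 questions for training, ~21/20 for testing.
from sklearn.model_selection import KFold

questions = [f"q{i}" for i in range(202)]            # placeholder items
splitter = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(splitter.split(questions)):
    train = [questions[i] for i in train_idx]
    test = [questions[i] for i in test_idx]
    # train the learning model on `train`, measure precision on `test`
    print(fold, len(train), len(test))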

We have performed a 10-fold evaluation using the subset of the Mooney GeoQuery questions which were answered correctly and required a dialogue in the earlier evaluation (Section 4.8.1); the remaining questions were excluded:

• 5 questions were not supported by the system, and they have been removed as there was no possibility of mapping them to the relevant ontology concepts and formulating the correct answer,
• 8 questions were misinterpreted by the system,
• 35 could be answered automatically, so they were removed.

This resulted in 202 questions requiring 343 dialogues in total. In 10 iterations, 181/182 questions were used for training the model, while the remaining 21/20 were used for testing it. Before executing the test, we generated a gold standard in two steps:

• We ran FREyA in the automatic mode where, for any required dialogue, the system chose the first available option, saved the learning items and carried forward to the next question. The results were saved and used as a baseline.
• We then manually examined the output and corrected invalid entries; if we had to change the entries we marked those as incorrect.

The goal of this evaluation was to test whether the learning algorithm can improve the performance of the system. In order to assess this, we compare the precision of the trained system with the performance of the baseline. The results are shown in Table 6 and also in Fig. 17.

The average precision for the system trained with 9/10 of the questions was 0.48, which is 0.23 higher than the baseline. While this is a good improvement over the baseline model, the performance is not outstanding. Looking into the questions which could not be answered using our trained system, the reasons are:

Table 6
Precision for the questions evaluated using 10-fold cross-validation.

Fold       0      1      2      3      4      5      6      7      8      9      Avg
Baseline   0.30   0.15   0.20   0.25   0.24   0.30   0.30   0.35   0.15   0.19   0.25
Learning   0.65   0.40   0.65   0.40   0.24   0.55   0.50   0.60   0.35   0.48   0.48

Fig. 17. Precision for the learned vs. baseline using 10-fold cross-validation.

• Ambiguity. 30 questions were incorrectly answered due to ambiguity. The advantage of our learning model is its simplicity: it is based on very few features, which ensures that questions with similar word pairs benefit from training with similar, and not necessarily the same, questions. However, this is, at the same time, a drawback, as it can introduce ambiguities. For example, if the system learns from "What is the highest point of Nebraska?" that "point" refers to geo:HiPoint whenever it appears in the context of geo:Country, then, for similar albeit drastically different questions, the system would use that knowledge, which might be wrong. For the question "What point is the lowest in California?" the system would find the previously learned mapping and associate "point" with geo:HiPoint. The correct mapping is, however, geo:LoPoint. This indicates that we should extend the context of our learning model to consider the whole phrase in which an unknown term appears, so that for the above example whenever "point" appears in the context of geo:Country
  – AND "highest", map it to geo:HiPoint.
  – AND "lowest", map it to geo:LoPoint.

• Sparsity. 65 questions contained a learning item which was seen only once across all questions.

While the performance of the baseline is quite low, we should note here that this figure does not take into consideration the cases when an unknown or ambiguous term can be mapped to more than one ontology concept. In addition, a question is marked as correct only if all of its dialogues had the correct suggestion ranked as number 1. However, in some cases it is very difficult to automatically judge which suggestion to place as number one. It is very likely that different users would select different suggestions for questions phrased in the same way. This emphasises the importance of the dialogues when modelling NLI systems. To assess this, we evaluate suggestion ranking in isolation.

4.8.3. Ranked suggestions

We use Mean Reciprocal Rank (MRR) to report the performance of our ranking algorithm. MRR is a statistic for evaluating any process that produces a list of possible responses (suggestions in our case) to a query, ordered by probability of correctness. The reciprocal rank of a suggestion is the multiplicative inverse of the rank at which the correct suggestion appears. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries (see Eq. (1)).

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}.    (1)
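Equivalently, as a short sketch over the ranks at which the correct suggestion appeared:

# Mean Reciprocal Rank over the rank of the correct suggestion in each dialogue.
def mean_reciprocal_rank(correct_ranks):
    return sum(1.0 / r for r in correct_ranks) / len(correct_ranks)

print(mean_reciprocal_rank([1, 2, 5]))   # (1 + 0.5 + 0.2) / 3 ≈ 0.567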

We have manually labelled the correct ranking for suggestions which have been generated when running FREyA on the above set of 202 questions. This was the gold standard against which our ranking mechanism achieved an MRR of 0.76. However, the median and mode were both 1, indicating that the majority of the rankings were correct. Indeed, in 69.7% of the cases the correct ranking was placed as number 1, while in 87.5% of the cases the correct ranking was among the top 5.

From the above set of 343 dialogues, we randomly selected 103, then ran our initial ranking algorithm and compared the results against the manually labelled gold standard. MRR was 0.72. We then grouped the 103 dialogues by OC, and randomly chose training and evaluation sets from each group. We repeated this twice. These two iterations are independent—they have both been performed starting with an untrained system. Overall MRR (for all 103 dialogues) increased from 0.72 to 0.77. After training the model with 47 items during iteration 2, overall MRR increased to 0.79. The average MRR after running these two experiments was 0.78, which shows an increase of 0.06 in comparison to the MRR of the initial rankings. Therefore, we conclude that for the selection of 103 dialogues from the Mooney GeoQuery dataset, our learning algorithm improved our initial ranking by 0.06 (further details about the benefits of our learning mechanism can be found in [22]).

4.9. Summary and discussion

In this section we discussed how a combination of clarification dialogues and learning can be used to improve certain aspects of the habitability of NLIs to ontologies. As discussed earlier, the NLIs that we are interested in are those that are portable and that support a non-controlled and flexible query language. The flexibility of the supported language has a trade-off related to the fact that it is not trivial for the user to translate the information need into a question. Hence, we examined how clarification dialogues can:

• Improve precision by asking the user to disambiguate. The system then improves the disambiguation for the next user/question.

• Improve recall through vocabulary extension: generating a dialogue for any question term that is considered important for understanding the semantic meaning of the question. This question term typically does not exist in the domain vocabulary (derived from the semantic resources and enriched from WordNet). The system will then learn the new term for the next user.

The vocabulary extension improves the lexical domain of habitability, as defined in Section 2. Moreover, the lexical domain is improved due to our ranking algorithm, which relies on Soundex, a phonetic algorithm that assigns a very high similarity to words which are spelt differently but pronounced similarly. Soundex is combined with the Monge Elkan string similarity algorithm, which assigns a high similarity to words where one is contained within the other (e.g. a question term population is very similar to the ontology lexicalisation state population according to Monge Elkan). Combining the two algorithms makes it possible to go beyond the existing lexicalisations attached to semantic resources, and to "understand" words which are either misspelled or expressed differently in comparison to how they are verbalised in the semantic repository. The ranking mechanism showed good performance as well (see Section 4.8.3).

The functional domain of habitability is defined by the algorithms which are used to identify Potential Ontology Concepts (candidate question terms that are identified as being important for understanding the semantic meaning of the question) and to generate suggestions for the dialogue. These consider adding maximum, minimum and sum functions to the datatype property values, so that adjectives which modify nouns (as in "the largest city") can be mapped to different functions applied to datatype property values and attached to classes. In the "largest city" example, this means that, once "city" is mapped to the class geo:City, the dialogue attempting to map "largest" will model suggestions by looking at the defined properties for City, and if any of them is a datatype property of type number, it will add the additional functions so that it is possible to map "largest" to the maximum value of geo:cityPopulation.

Our approach of combining clarification dialogues with learning is evaluated on the Mooney GeoQuery dataset, to enable us to compare our results against other NLI systems, which use different approaches. The overall precision and recall with this dataset reached 94.4%, which is significantly better than other similar systems evaluated on the same dataset.

We also evaluated the individual algorithms. MRR for the initial ranking, using 250 questions from the Mooney GeoQuery set, yielded 0.76.

The learning algorithm showed an improvement over the baseline model of 0.23.

The combination of clarification dialogues and learning is envisaged to be used in two steps which correspond to two different, albeit easily interchangeable, modes of the system:

• The force dialogue mode is used to train the system towards a reasonable performance. The system will generate a dialogue for any attempt to map a question term into an Ontology Concept, when its confidence to automatically resolve this mapping is below 100%.

• The automatic mode: the system will return the answer automatically by simulating selection of the best ranked options.

Note that for true ambiguities the automatic mode might not be the best choice even in a perfectly trained system. For instance, if somebody asks "How big is New York state?" we might be unable to decide automatically whether "How big" refers to state area or state population. In this situation, as the system learns from the users' selections, the automatic mode would work in favour of the majority of users. However, if the majority of users refer to state area when mentioning size, the minority still have a chance to get the correct answer by switching to the force dialogue mode and mapping "big" to state population.

Upon initial inspection, the two modes described above seem like a perfect match for the two types of users of an NLI: ideally, application developers can use an NLI in the force dialogue mode until they are satisfied with the system's interpretations of the questions. At that point, the end-users can take over the system and use it in the automatic mode to ask questions. However, the real scenario might be completely different. The mode can be changed easily, so if the user discovers unsatisfactory results in the automatic mode, they can immediately switch to the force dialogue mode in order to investigate the mappings. Their input will then improve the system for the next user, or for subsequent questions of the same user. The easy switching between modes makes our approach suitable for both end-users and application developers. In fact, the border between the customisation of the system performed by application developers, and the customised version of the system used by the end-users, is not strict. Hence, the role of the two types of users is, to some extent, blurred, which allows end-users to control the answers to their questions or, at least, to understand how Natural Language queries are mapped to the formal queries. This leaves us with the same question that we asked in the previous section about end-users. Who are they? For the current state of the methods and algorithms, the end-users probably do not need to know about semantic technologies if the system works within a narrow domain, such as the Mooney GeoQuery ontology. As soon as we move towards large scale data (such as parts of the Linked Open Data cloud, e.g. DBPedia), and datasets which are characterised by a large amount of redundant, duplicate, and often false data, our approach becomes suitable for semantic web experts, who can explore the available knowledge by asking questions and being engaged in dialogues (see [33] for details on how our approach is evaluated with DBPedia). The force dialogue mode applied to low quality data can be used not only to become familiar with the dataset, but also to discover existing inconsistencies. It is left for future work to further develop and test mechanisms that will use our system in this kind of scenario, and also in scenarios where these large knowledge bases are queried by end-users who are not familiar with semantic technologies at all.

5. Related work

While research has been active in testing the usability of various semantic search interfaces (see [29,34]), little work has been done in the area of testing the usability of NLIs to ontologies themselves. The evaluation campaigns of the SEALS project21 [35] partially address this problem; however, there is little emphasis on testing individual usability enhancement methods and their effect on habitability, as well as on the overall performance and usability of NLIs to ontologies. While these methods have been extensively researched in Information Retrieval, for example, the challenge with NLIs to ontologies is the underlying structure of the knowledge, which is considered complex for casual users. In this paper, we fill that gap by designing and testing feedback in NLIs to ontologies, in order to emphasise the importance of such methods in building habitable NLIs, which can be considered a user-friendly way of bringing structured information in the form of ontologies to casual users.

Building habitable NLIs for querying ontologies is a difficult task, and many different systems have been developed in recent years. While NLI systems with a good performance require customisation through specific software (as in the case of ORAKEL [36]), several systems have been developed for which customisation is not mandatory (e.g., PANTO [28], Querix [18], AquaLog [19], NLP-Reduce [29], QuestIO [15]). However, as reported in [19], the customisation usually improves recall and hence helps in improving the specific habitability domains. For example, as customisation usually requires mapping WH-phrases to Ontology Concepts, it can improve the syntactic domain of habitability, while in the case of mapping verbs to Ontology Concepts it can improve the lexical domain. However, performing customisation manually (as in AquaLog or PANTO) when querying large ontologies might be impractical if not impossible, as it implies that the application developers who customise the system also know the ontology structure very well. Our approach does not require any mandatory customisation; instead, the specific habitability domains are improved by making the process of mapping an NL question to Ontology Concepts transparent to the user. We do this by combining clarification dialogues with learning. The role of learning is to improve the initial ranking that exists in the dialogues, which removes the cognitive overhead for the users.

21 http://www.seals-project.eu/.

The majority of existing NLIs to ontologies are portable in the sense that all that is required to port the system to a new domain is the ontology URI—the system automatically generates the domain lexicon by reading and processing ontology lexicalisations. Indeed, most of the mentioned systems rely on the ontology lexicalisations and WordNet [3]. AquaLog [19] and PowerAqua [37] are capable of learning the user's jargon in order to improve the lexical domain of habitability and hence the user experience. Their learning mechanism is good in that it uses ontology reasoning to learn more generic patterns, which can then be reused for questions with a similar context. However, the clarification dialogues in AquaLog are used for resolving ambiguities only, and learning from the user jargon applies only to ontology relations. Our approach is more generic, as our definition of clarification dialogues is wider, and we also model context differently.

Querix [18] is another ontology-based question answering system which relies on clarification dialogues in case of ambiguities, but in comparison to AquaLog it does not implement the learning mechanism and hence its lexical domain is bound to the ontology lexicalisations enriched by synonyms from WordNet.

Our approach of combining clarification dialogues with the learning mechanism is different in that it shares the input from all users. This is influenced by the recent emergence of social networks, which have shown the advantages of collaborative intelligence. In addition, the role of clarification dialogues in our case is not only to resolve ambiguities but rather to control the whole process of mapping an NL question (including WH-phrases) to the formal query, hence allowing the user to define or change the specific lexical, syntactic or functional habitability domain.

Moreover, while other existing approaches start by generating linguistic triples from a question (even if in an iterative fashion) and then attempt to generate ontology triples in the form of Subject-Predicate-Object, our approach operates on a pair of Ontology Concepts at a time, which can be Subject-Predicate, Predicate-Object or Subject-Object. In that sense our approach is more flexible, as it operates on a unit smaller than a triple, where each unit can be validated or changed through the clarification dialogue.

6. Conclusion

The NLIs to ontologies that we discuss in this paper are those that are portable and those with a flexible supported language, so that not only grammatically correct questions but also question fragments and ill-formed questions are supported. In particular, we discussed the application of feedback and clarification dialogues and how they can affect the habitability of such Natural Language Interfaces to Ontologies.

First, we looked at the effect of modelling feedback by showing users the system's interpretations of the query. The method was tested with users and our results reveal that feedback can increase habitability and thus usability of an NLI system. More specifically, it can improve the effectiveness of the system, while it does not significantly improve efficiency. In addition, feedback has a positive effect on the user's perception of the difficulty of the supported language.

Next, we examined how the existing habitability domains of the language can be extended and improved through dialogues. Here we are not concerned with showing the user previously generated query interpretations (as in feedback), but rather with involving the user in the process of generating the correct query interpretation through clarification dialogues. To reduce the cognitive overhead, clarification dialogues are coupled with a learning mechanism, so that the user's input is used to improve the system through training. This method was tested in a controlled evaluation using the Mooney GeoQuery dataset, in order to make comparisons against other similar approaches. Our approach obtained very high precision and recall, outperforming other state-of-the-art systems. While the reason for such good performance lies partially in its subsequent modules, such as the learning and ranking algorithms, the most important aspect of our approach is bringing the users into the loop, allowing them to control the output and supervise the querying process through dialogue. The question of whether this level of involvement is acceptable from an end-user's point of view is a subject of our future work.

Acknowledgements

We would like to thank Abraham Bernstein and Esther Kaufmann from the University of Zurich for sharing with us the Mooney dataset in OWL format, and J. Mooney from the University of Texas for making this dataset publicly available. Grateful acknowledgement for proofreading and correcting the English edition goes to Amy Walkers from Kuato Studios.

This research has been partially supported by the EU-funded TAO (FP6-026460) and LarKC (FP7-215535) projects.

References

[1] C. Bizer, T. Heath, T. Berners-Lee, Linked data: the story so far, Int. J. Semant. Web Inf. Syst. (2009).

[2] K. Church, R. Patil, Coping with syntactic ambiguity or how to put the block in the box, Amer. J. Comput. Linguist. 8 (1982).

[3] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998.

[4] J. Ruppenhofer, M. Ellsworth, M.R.L. Petruck, C.R. Johnson, J. Scheffczyk, FrameNet II: extended theory and practice, Technical Report, ICSI, 2005.

[5] B.J. Grosz, D.E. Appelt, P.A. Martin, F.C.N. Pereira, TEAM: an experiment in the design of transportable natural-language interfaces, Artificial Intelligence 32 (1987) 173–243.

[6] N. Stojanovic, On the query refinement in the ontology-based searching for information, Inf. Syst. 30 (2005) 543–563.

[7] S.S. Epstein, Transportable natural language processing through simplicity: the PRE system, ACM Trans. Inf. Syst. 3 (1985) 107–120.

[8] A. Bernstein, E. Kaufmann, GINO: a guided input Natural Language Ontology editor, in: 5th International Semantic Web Conference, ISWC 2006, 2006.

[9] W. Ogden, P. Bernick, Using natural language interfaces, in: M. Helander (Ed.), Handbook of Human–Computer Interaction, Elsevier Science Publishers B.V., North-Holland, 1997.

[10] W. Ogden, J. Mcdonald, P. Bernick, R. Chadwick, Habitability in question-answering systems, Series: Text, Speech and Language Technology, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open Domain Question Answering, Vol. 32, Springer, Netherlands, 2006, pp. 457–473.

[11] E. Zolton-Ford, Reducing variability in natural-language interactions with computers, in: Proceedings of the Human Factors Society 28th Annual Meeting, The Human Factors Society, 1984, pp. 768–772.

[12] B.M. Slator, M.P. Anderson, W. Conley, Pygmalion at the interface, Commun. ACM 29 (1986) 599–604.

[13] H.H. Clark, S.A. Brennan, Grounding in communication, in: L.B. Resnick, J.M. Levine, S.D. Teasley (Eds.), Perspectives on Socially Shared Cognition, 1991.

[14] D. Frohlich, P. Drew, A. Monk, Management of repair in human–computer interaction, Hum.-Comput. Interact. 9 (1994) 385–425.

[15] D. Damljanovic, V. Tablan, K. Bontcheva, A text-based query interface to OWL ontologies, in: 6th Language Resources and Evaluation Conference, LREC, ELRA, Marrakech, Morocco, 2008.

[16] A. Bangor, P. Kortum, J. Miller, Determining what individual SUS scores mean: adding an adjective rating scale, J. Usability Stud. 4 (2009) 114–123.

[17] H.H. Wang, D. Damljanovic, T.R. Payne, N. Gibbins, K. Bontcheva, Transition of legacy systems to semantically enabled applications: TAO method and tools, Semantic Web 3 (2) (2012) 157–168.

[18] E. Kaufmann, A. Bernstein, R. Zumstein, Querix: a natural language interface to query ontologies based on clarification dialogs, in: 5th International Semantic Web Conference, ISWC 2006, Springer, 2006, pp. 980–981.

[19] V. Lopez, V. Uren, E. Motta, M. Pasin, AquaLog: an ontology-driven question answering system for organizational semantic intranets, Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105.

[20] V. Lopez, V.S. Uren, M. Sabou, E. Motta, Cross ontology query answering on the Semantic Web: an initial evaluation, in: Y. Gil, N.F. Noy (Eds.), K-CAP, ACM, 2009, pp. 17–24.

[21] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (NIPS 2002), MIT Press, 2002, pp. 3–10.

[22] D. Damljanovic, M. Agatonovic, H. Cunningham, Natural language interfaces to ontologies: combining syntactic analysis and ontology-based lookup through the user interaction, in: Proceedings of the 7th Extended Semantic Web Conference, ESWC 2010, Lecture Notes in Computer Science, Springer-Verlag, Heraklion, Greece, 2010.

[23] D. Damljanovic, M. Agatonovic, H. Cunningham, Identification of the question focus: combining syntactic analysis and ontology-based lookup through the user interaction, in: 7th Language Resources and Evaluation Conference, LREC, ELRA, La Valletta, Malta, 2010.

[24] R.K. Belew, Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW, Cambridge University Press, Cambridge, United Kingdom, 2000.

[25] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.

[26] L.R. Tang, R.J. Mooney, Using multiple clause constructors in inductive logic programming for semantic parsing, in: Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, 2001, pp. 466–477.

[27] P. Cimiano, P. Haase, J. Heizmann, Porting natural language interfaces between domains: an experimental user study with the ORAKEL system, in: IUI'07: Proceedings of the 12th International Conference on Intelligent User Interfaces, ACM, New York, NY, USA, 2007, pp. 180–189.

[28] C. Wang, M. Xiong, Q. Zhou, Y. Yu, PANTO: a portable natural language interface to ontologies, in: The Semantic Web: Research and Applications, Springer, 2007, pp. 473–487.

[29] E. Kaufmann, A. Bernstein, L. Fischer, NLP-Reduce: a naive but domain-independent natural language interface for querying ontologies, in: Proceedings of the European Semantic Web Conference, ESWC 2007, Springer, Innsbruck, Austria, 2007.

[30] E.F.T.K. Sang, F.D. Meulder, Introduction to the CoNLL-2003 shared task: language-independent named entity recognition, in: Proceedings of CoNLL-2003, Edmonton, Canada, 2003, pp. 142–147.

[31] Y. Li, K. Bontcheva, H. Cunningham, Adapting SVM for data sparseness and imbalance: a case study on information extraction, Nat. Lang. Eng. 15 (2009) 241–271.

[32] P. Refaeilzadeh, L. Tang, H. Liu, Cross-validation, in: L. Liu, M.T. Özsu (Eds.), Encyclopedia of Database Systems, Springer US, 2009, pp. 532–538.

[33] D. Damljanovic, M. Agatonovic, H. Cunningham, FREyA: an interactive way of querying linked data using natural language, in: Proceedings of the 1st Workshop on Question Answering over Linked Data, QALD-1, collocated with the 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Greece, 2011.

[34] V. Uren, Y. Lei, V. Lopez, H. Liu, E. Motta, M. Giordanino, The usability of semantic search tools: a review, Knowl. Eng. Rev. 22 (2007) 361–377.

[35] S.N. Wrigley, K. Elbedweihy, D. Reinhard, A. Bernstein, F. Ciravegna, Evaluating semantic search tools using the SEALS platform, in: International Workshop on Evaluation of Semantic Technologies, IWEST 2010, International Semantic Web Conference, ISWC 2010, China, 2010.

[36] P. Cimiano, P. Haase, J. Heizmann, Porting natural language interfaces between domains: an experimental user study with the ORAKEL system, in: IUI'07: Proceedings of the 12th International Conference on Intelligent User Interfaces, ACM, New York, NY, USA, 2007, pp. 180–189.

[37] V. Lopez, E. Motta, V.S. Uren, PowerAqua: fishing the semantic web, in: ESWC, pp. 393–410.