
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2014; 26:1157–1184
Published online 4 July 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.3089

SPECIAL ISSUE PAPER

Empirical analysis of domain ontology usage on the Web: eCommerce domain in focus

Jamshaid Ashraf 1,*,†, Omar Khadeer Hussain 1 and Farookh Khadeer Hussain 2

1 School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, 6107 WA, Australia
2 Decision Support and e-Service Intelligence Lab, Quantum Computation and Intelligent Systems, School of Software, Faculty of Engineering and Information Technology, University of Technology, Sydney, 2007 NSW, Australia

SUMMARY

In the recent past, there has been an exponential growth in Resource Description Framework (RDF) data on the web, known as the web of data. The emergence of the web of data is transforming the existing web from a document-sharing medium to a decentralized knowledge platform for publishing and sharing information between humans and computers. To enable common understanding between different users, domain ontologies are being developed and deployed to annotate information on the web. This semantically annotated information is then accessed by machines to extract and aggregate information, on the basis of the underlying ontologies used. To effectively and efficiently access data on the web, insight into the usage of ontology is pivotal, because this assists users in experiencing the benefits offered by the Semantic Web. However, such an approach has not been proposed in the literature. In this paper, we present a pragmatic approach to the analysis of domain ontology usage on the web. We propose metrics to measure the use of domain ontology constructs on the web from different aspects. To comprehensively understand the usage patterns of conceptual knowledge, instance data, and ontology co-usability, we considered the GoodRelations ontology as the domain ontology and built a dataset by collecting structured data from 211 web-based data sources that have published information using the domain ontology. The dataset is analyzed using the proposed metrics, and the observations are reported along with their usability and applicability to the different users of the Semantic Web. Copyright © 2013 John Wiley & Sons, Ltd.

Received 26 April 2013; Accepted 6 May 2013

KEY WORDS: ontology usage analysis; empirical analysis; web of data usage analysis

1. INTRODUCTION

Ontologies, being a fundamental component of the Semantic Web, promote the establishment of a shared understanding between data providers and consumers in a common format that allows the automated processing of information by software agents and people. They achieve this by semantically annotating structured data, which can be roughly categorized into schema-level data (metadata) and instance-level data [1], allowing consuming applications to know the structure and semantics of the information so that the required data can be accessed effectively and efficiently. Schema-level data are the terminological knowledge represented by an ontology, and instance-level data are concrete data that describe a resource. In addition to semantic annotation, which promotes information interoperability, ontologies enable the reuse of knowledge by making implicit domain assumptions explicit and assisting in the separation of domain knowledge from operational knowledge [2]. These salient features of ontologies have motivated the development of standardized ontologies by domain experts in several disciplines, which assist applications to semantically annotate the information on the web. Recently, we have seen the emergence of open ontologies on the web being used to semantically describe web data, such as Semantically-Interlinked Online Communities, Dublin Core, GoodRelations (GR), and Friend of a Friend (FOAF). This has led to an increase in Semantic Web data (Resource Description Framework (RDF) data) on the web, which is described using these ontologies. The surge in RDF data is evident from the famous Linked Open Data (LOD) cloud diagram [3], which in its updated version contains 31 billion triples and hundreds of ontologies. A recent work [4] has reported the presence of roughly 212 vocabularies in the LOD cloud.

*Correspondence to: Jamshaid Ashraf, School of Information Systems, Curtin Business School, Curtin University of Technology, Perth, 6107 WA, Australia.

†E-mail: [email protected]

Copyright © 2013 John Wiley & Sons, Ltd.

1.1. Problem statement

The adoption of ontologies in the recent past has been made possible by a decade (circa 1998–2008) of research effort on ontology languages, ontology repositories, reasoning algorithms, and management tools that focus on developing an ontology from its infancy to a user-ready or implementation-ready application. Generally, in the ontology life cycle model, shown in Figure 1, ontologies are developed using a development methodology and then evaluated using an evaluation methodology to measure the quality of the developed ontology. Ontologies are then published on the web to allow users to access the ontology and its components. The published ontologies are adopted by different types of users to semantically describe the information on the web and to be used for ontology population and/or ontology instantiation. Over time, a developed ontology is evolved to meet new requirements when changes become necessary.

Users of an ontology can be broadly categorized into three groups, namely ontology developers, application developers, and data publishers. Ontology developers are those who initiate the development of an ontology in response to a need. Application developers are those users who devise new applications using parts of an ontology or different ontologies. Data publishers are those users who disseminate their data on the web by using the defined ontologies. These users need to analyze the various phases of the ontology life cycle once the ontologies are used on the web, as follows:

1. During the adoption phase:
   (a) Users who are interested in consuming Semantic Web data need to know what type of data is available. This implies that they need to know which ontologies are being used and what is included in a given ontology to be able to source accurate information from the web, understand the prevailing ontological structure on the web, and decide what to use to meet their requirements.
   (b) Ontology developers need to know how effective their developed ontology is and how it is being used, which will assist in the reusability of their ontology by different users and is an important requirement of the Semantic Web [5, 6]. Promoting reusability requires sharing information about the use of different ontologies and their uptake by data publishers, in addition to infrastructure arrangements [7].
   (c) Application developers need to know the most common concepts being used in an ontology, which will assist them in understanding what terminological knowledge is available for application consumption, which common data and knowledge patterns are available, and how a given ontology is being used by different data publishers to semantically describe their information.
2. During the evolution phase, ontology developers need to know the usage level of various concepts, in addition to knowing which concepts in an ontology need to be evolved. This will assist them in knowing what needs to be changed, updated, or enhanced to increase instantiation in the adoption phase.

Figure 1. Ontology life cycle with feedback loop.

The preceding analyses are achieved by measuring ontology usage. In ontology usage analysis, deployed ontologies are analyzed to measure their usage and provide a feedback loop to the ontology life cycle model, which includes the development, evaluation, and evolution processes of ontology engineering. The present ontology life cycle model primarily focuses on the development and evaluation stages of the ontology and partially supports the population stage of the ontologies. However, there is no research that focuses on understanding how ontologies are adopted and used on the web, signifying the need for a framework to analyze and measure the use of ontologies on the web following adoption. Such a framework for measuring ontology usage on the web would help in the following:

1. to make effective and efficient use of formalized knowledge (ontology) on the web,
2. to provide a usage-based feedback loop to the ontology maintenance process for a pragmatic conceptual model update, and
3. on the basis of the prevalent knowledge patterns, to provide an erudite insight on the state of semantic structured data for the consuming applications to use the appropriate ontologies.

In this paper, we develop an Ontology Usage Analysis Framework (OUSAF) to analyze and measure the use of domain ontologies on the web. The remainder of the paper is organized as follows. In Section 2, the OUSAF and its phases are introduced. Section 3 describes the metrics developed to measure domain ontology usage along with an explanation of their application, using a sample RDF graph. In Section 4, we discuss the domain ontology whose usage we consider for analysis. The evaluation of empirical analysis and general discussion is presented in Section 5. In Section 6, we discuss how the obtained analysis can be utilized by different users. In Section 7, we discuss the related literature in which the use of RDF data and terminological knowledge is analyzed, and in Section 8, we conclude the paper along with a discussion of future work.

2. ONTOLOGY USAGE ANALYSIS FRAMEWORK

To provide the aforementioned insight into the use of ontologies, a semantic framework (OUSAF) for the measurement and analysis of ontologies is proposed. The framework comprises four phases, namely identification, investigation, representation, and utilization, as depicted in Figure 2. Each phase is briefly described in the following.

2.1. Identification phase

Identification refers to the selection of the ontology we need to consider for analysis. There are two common requirements of identification: (i) to determine the usage of a specific domain ontology already known for an application area, for example FOAF for social networking; and (ii) to analyze the interesting ontologies in the domain-specific dataset. In the latter case, a domain-specific dataset is required to identify the presence of different ontologies and the co-relationships among different ontologies.

2.2. Investigation phase

Investigation refers to the analysis of the use of ontology. The aim of this step is to analyze the identified ontology to measure its usage and population. Ontology usage is investigated at two levels: first empirically and then quantitatively, as described in the following.


Figure 2. Ontology Usage Analysis Framework and its four phases. RDF, Resource Description Framework.

Empirical analysis: In empirical analysis, the key aspects that contribute to the adoption of ontologies and that are useful for users in understanding how domain ontologies are used on the web are identified and analyzed.

Quantitative analysis: In quantitative analysis, on the basis of the insight obtained from empirical analysis, ontologies are analyzed from different perspectives to obtain a comprehensive insight into their usage. The quantitative analysis obtained is useful for ranking ontologies and their components on the basis of usage, as well as for identifying other criteria such as the structural and typological characteristics of ontologies.


2.3. Representation phase

Representation refers to the collation and meaningful presentation of the analysis results of ontology usage. The purpose of investigating ontology usage is to understand how an ontology is used by different users and to exploit this information to utilize Semantic Web data effectively and efficiently. The analysis results need to capture all the aspects of ontology usage to allow a large number of applications to use it for information processing.

2.4. Utilization phase

Utilization refers to the application of the ontology usage analysis by different users according to their need. To facilitate the utilization of the analysis in different application areas, the results are represented through an ontological model, allowing wider dissemination and exploitation of findings. Utilization includes the implementation of usage analysis in a case scenario to assess the benefits of the analysis.

The emphasis of this paper is on the investigation phase of the OUSAF. To empirically analyze the usage of domain ontologies, we present a methodological approach and a set of metrics to understand the usage of domain ontology and prevalent knowledge patterns. The empirical analysis is conducted on a dataset collected by crawling the web to obtain a fair representation of domain ontology usage and capture the invariant patterns prominent on the web. We chose the GR ontology (GRO) as our domain ontology for the analysis, and we report the usage patterns, knowledge patterns, and relationships (links) between different vocabularies, based on ontology instantiation. Our proposed approach is discussed in the next section.

3. EMPIRICAL ANALYSIS OF DOMAIN ONTOLOGY USAGE

To understand the use of domain ontologies on the web, the key aspects of their usage need to be defined, measured, and then analyzed, as mentioned in Section 2.2. Such insight helps different types of user to effectively make use of the schema-level information published on the web. The framework implements a set of metrics to measure the use of different ontology components from various aspects to gain better insight into the prevailing usage patterns on the web. The aspects considered are the use of pivot concepts, their semantic description, the use of textual description, and knowledge and data patterns. Before the proposed metrics are explained, preliminaries are presented that are important for understanding the working of our proposed approach.

3.1. Preliminaries

Before introducing our proposed framework to empirically analyze domain ontology usage, we briefly introduce the basic RDF and Semantic Web-related terms used in this paper. For a more detailed and formal discussion of these terms and notations, readers should refer to [8, 9].

Uniform Resource Identifier (URI) reference: On the Semantic Web, all information must be expressed as statements about resources. Resources are identified by the URI. URIs identify not only web documents but also real-world objects, such as people and cars, and even abstract ideas and nonexistent things such as mythical concepts. All these real-world objects or things are called resources in the Semantic Web, and the URI reference is a compact string of characters to identify an abstract or physical resource.

RDF term: Given the set of URI references $U$, the set of blank nodes $B$, and the set of literals $L$, the set of RDF terms is denoted by $\mathit{RDFTerm} := U \cup B \cup L$. The sets $U$, $B$, and $L$ are pairwise disjoint.

RDF triple (triple): A triple $t := (s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)$ is called an RDF triple, where $s$ is called the subject, $p$ the predicate, and $o$ the object.

Class: We refer to a class as an RDF term that appears in either
• $o$ of a triple $t$ where $p$ is rdf:type or
• $s$ of a triple $t$ where $p$ is rdf:type and $o$ is rdfs:Class or owl:Class.

Property: We refer to a property as an RDF term that appears in either
• $p$ of a triple $t$ or
• $s$ of a triple $t$ where $p$ is rdf:type and $o$ is rdf:Property.

Instance of a concept (C): A triple $t = (s, p, o)$ or a set of triples in the dataset is an instance of a triple pattern $t_c = (s_c, p_c, o_c)$ if there exist
• $s_c$ as the URI reference,
• $p_c$ as rdf:type, and
• $o_c$ as the class (concept) of the domain ontology.
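The definitions above can be illustrated with a minimal Python sketch. It is not taken from the framework's implementation: it assumes that triples are held as plain `(subject, predicate, object)` tuples of prefixed names, that literals are written with surrounding double quotes, and the helper names `classes`, `properties`, and `instances_of` as well as the sample triples are hypothetical.

```python
RDF_TYPE = "rdf:type"
CLASS_MARKERS = {"rdfs:Class", "owl:Class"}

# Hypothetical sample data: each triple is a (subject, predicate, object) tuple.
triples = [
    ("ex:cardealer", "rdf:type", "gr:BusinessEntity"),
    ("ex:cardealer", "gr:legalName", '"Example Cars"'),
    ("gr:Offering", "rdf:type", "owl:Class"),
    ("gr:includes", "rdf:type", "rdf:Property"),
]

def classes(triples):
    """RDF terms that count as a class under the definition above."""
    found = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            found.add(o)              # object of an rdf:type triple
            if o in CLASS_MARKERS:
                found.add(s)          # subject declared as rdfs:Class or owl:Class
    return found

def properties(triples):
    """RDF terms that count as a property under the definition above."""
    found = set()
    for s, p, o in triples:
        found.add(p)                  # any term used in predicate position
        if p == RDF_TYPE and o == "rdf:Property":
            found.add(s)              # subject declared as rdf:Property
    return found

def instances_of(concept, triples):
    """Subjects s for which the triple (s, rdf:type, concept) is present."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == concept}

print(sorted(classes(triples)))
print(sorted(properties(triples)))
print(sorted(instances_of("gr:BusinessEntity", triples)))
```

Note that, read literally, the class definition also admits terms such as owl:Class and rdf:Property themselves, because they occur as objects of rdf:type triples; an analysis restricted to a domain ontology would presumably filter the result by namespace.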

3.2. Analysis approach and metrics

To provide a solid empirical grounding, our approach performs an analysis on a dataset collected from the web comprising real-world instance data that are primarily described using open web (domain) ontologies. To achieve our aim, our approach comprises two phases: data collection and data analysis. Figure 3 provides a schematic view of the proposed framework, showing the high-level data collection approach (Figure 3(a)) and the different aspects considered for measuring the usage and uptake (Figure 3(b)). The former phase deals with the activities pertaining to data source identification, data crawling, and setting up the dataset, while the latter focuses on empirical analysis activities.

Our aim in the analysis phase is to establish how ontologies are being used on the web by analyzing factors such as concept instantiation, interlinking between different ontologies on the basis of assertional statements, the use of labels to provide textual descriptions, and knowledge patterns based on terminological statements. We consider four factors, namely schema link graph, concept usage template (CUT), labeling, and traversal path, in an attempt to provide a holistic means to understand and measure the use of domain ontologies by data publishers on the web.

The schema link graph unveils the relationship between vocabularies at the instance level; the CUT provides class-centric analysis to measure richness and semanticity; labeling measures the textual description of entities, which is important for user interfaces or human usage; and traversal path extracts the knowledge patterns in the published data to provide a summarized view of the knowledge graph. Moreover, we also analyze the use of different ontologies (namespace usage) in the dataset to understand which namespaces are being referred to in the collected RDF graphs.

In the following subsections, we explain each of the aforementioned factors, which we use to measure ontology usage. We will use a sample RDF graph to provide a walkthrough of the computation process of each metric.

Figure 3. Schematic diagram of the empirical analysis of domain ontology use: (a) data collection and (b) analysis phase.


3.2.1. Schema link graph

In the Semantic Web, the RDF data model and the use of ontologies allow decentralized entities to be linked across sources and domains. An understanding of how entities are linked at the schema level across ontologies helps with the extraction of schema patterns and the analysis of entity linkage that exists within the dataset [10]. This additionally assists in the understanding of the macro structure at the instance level, which is similar to the schema graph that exists in social networks. To undertake this analysis, we introduce the notion of the schema link graph. The schema link graph is an undirected graph consisting of a finite set of vertices $V$ and a set of edges $E$, each representing a link between two vertices. Formally, we define the schema link graph as follows:

Schema link graph: The schema link graph is a tuple $(V, E)$, where $n$ is a node ($n \in V$) such that $n$ is an ontology namespace used in the dataset. By 'used', we mean the presence of a triple where $n$ appears as an object (for instantiation with rdf:type) or in a predicate to describe the object. $E$ is the edge set, and $e \in E$ is an edge of the graph linking two nodes $n_1$ and $n_2$ such that either there is a triple that entails that $n_1$ is the namespace of the subject and $n_2$ is the entailed namespace of the object or there is a sequence of $m$ triples connected through blank nodes such that $n_1$ is the entailed namespace of the subject of the first triple and $n_2$ is the namespace of the object in the $m$-th triple, where $m > 1$.

For example, Figure 4(a) shows an RDF graph snippet extracted from the RDF graph representing semantic data published by http://www.tenera.ch. Here, we have a sequence of triples connected through blank nodes, and the corresponding schema link graph is depicted in Figure 4(b). It is important to note that we use the namespaces used in the dataset by considering the triples instantiating the instances of the concepts defined by the respective namespace. Figure 4(b) shows that there is a URI in the RDF graph of type gr directly or indirectly (through blank nodes) connected to the URI of type v. An inferred schema link graph such as this unveils the linkage between different vocabularies based on instance data that describe entities in the dataset.
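One possible reading of the schema link graph definition can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the paper: the entailed namespace of a resource is taken to be the prefix of its rdf:type class, blank nodes are written as `_:b1`, links between two resources of the same namespace are not recorded as edges, and the sample triples are hypothetical rather than the data shown in Figure 4.

```python
RDF_TYPE = "rdf:type"

def namespace(term):
    """Prefix of a prefixed name, e.g. 'gr:Offering' -> 'gr'."""
    return term.split(":", 1)[0]

def is_blank(term):
    return term.startswith("_:")

def schema_link_graph(triples):
    """Return (nodes, edges); each edge is a frozenset of two namespaces."""
    types = {}   # resource -> namespaces of the classes it instantiates
    links = {}   # resource -> objects reached through non-rdf:type predicates
    nodes = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(namespace(o))
            nodes.add(namespace(o))      # namespace used for instantiation
        else:
            links.setdefault(s, []).append(o)
            nodes.add(namespace(p))      # namespace used in a predicate

    edges = set()
    for start in types:
        stack, seen = [start], {start}
        while stack:
            current = stack.pop()
            for target in links.get(current, []):
                # Connect the namespaces at both ends of the (possibly indirect) link.
                for n1 in types[start]:
                    for n2 in types.get(target, ()):
                        if n1 != n2:
                            edges.add(frozenset((n1, n2)))
                # Keep following the chain only through blank nodes.
                if is_blank(target) and target not in seen:
                    seen.add(target)
                    stack.append(target)
    return nodes, edges

# Hypothetical data: an offering linked to an address through blank nodes.
triples = [
    ("ex:offer", "rdf:type", "gr:Offering"),
    ("ex:offer", "gr:availableAtOrFrom", "_:b1"),
    ("_:b1", "rdf:type", "gr:Location"),
    ("_:b1", "v:adr", "_:b2"),
    ("_:b2", "rdf:type", "v:Address"),
]

nodes, edges = schema_link_graph(triples)
print(sorted(nodes))
print([sorted(edge) for edge in edges])
```

For this sample, the gr and v namespaces end up connected by a single undirected edge, which corresponds to the kind of linkage described for Figure 4(b).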

3.2.2. Concept usage template

In the CUT, we capture the details of how the concept is used in the dataset and what properties (both domain ontology predicates and other predicates) are used to describe the entities instantiated by the concept. The consolidation of the semantic information provides a detailed view of the entity description made available by data publishers and allows data consumers to assess the utility of the data in the dataset. CUT attempts to capture the ubiquitous patterns and arranges them to facilitate the processing of information for specific purposes such as searching, browsing, querying, and reasoning.

Figure 4. (a) Sample Resource Description Framework graph from the dataset with blank nodes and (b) the corresponding schema link graph.

Concept instantiation (CI):
Concept instantiation computes the number of instances instantiated by the class representing the concept. This gives us the number of entities in the dataset and reflects the dominance of the entity in the dataset when compared with templates of other concepts. In most web ontologies, subsumption axioms are used to provide the taxonomical relationship between concepts. With inference provisioning, the concept instantiation may fluctuate depending on where the concept falls in the taxonomy hierarchy. Because most triple stores implement the RDF Schema (RDFS) entailment rules, it is safe to consider the rdfs9‡ rule while measuring the instantiation of a top-level concept in the taxonomic hierarchy. The CI of a concept $C$ is given as follows:

$$\mathit{CI}(C) = |\mathit{triples}| \quad \text{where} \quad \begin{cases} s = \text{RDFTerm} \\ p = \text{rdf:type} \\ o = \text{class defined by ontology} \end{cases} \tag{1}$$

In the case of subsumption axioms [11], $\mathit{CI}(C)$ can include the instances instantiated by the subconcepts (subclasses) of $o$ such that

$$o = \mathit{entail}_{rdf9}(C) \tag{2}$$

where $\mathit{entail}_{rdf9}(C)$ is a function that implements the RDFS9 rule:

IF (uuu rdfs:subClassOf xxx AND vvv rdf:type uuu) THEN (vvv rdf:type xxx)

$\mathit{CI}(C)$ returns the numeric value representing the number of entities defined by the concept and its subconcepts.
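Equations (1) and (2) can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the authors' implementation: triples are plain tuples of prefixed names, the rdfs9 rule is applied repeatedly by first collecting all direct and indirect subclasses of the concept, and distinct subjects are counted (which equals the number of distinct rdf:type triples for the concept once the entailed triples are treated as a set). The function names and the sample data are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def subclasses_of(concept, triples):
    """The concept together with all its direct and indirect subclasses."""
    closure, frontier = {concept}, [concept]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o == parent and s not in closure:
                closure.add(s)
                frontier.append(s)
    return closure

def concept_instantiation(concept, triples, entail=True):
    """CI(C): number of entities typed by C, optionally under rdfs9 entailment."""
    accepted = subclasses_of(concept, triples) if entail else {concept}
    instances = {s for s, p, o in triples if p == RDF_TYPE and o in accepted}
    return len(instances)

# Hypothetical schema-level and instance-level triples.
triples = [
    ("gr:ProductOrServiceModel", "rdfs:subClassOf", "gr:ProductOrService"),
    ("gr:SomeItems", "rdfs:subClassOf", "gr:ProductOrService"),
    ("ex:model_1", "rdf:type", "gr:ProductOrServiceModel"),
    ("ex:item_1", "rdf:type", "gr:SomeItems"),
    ("ex:item_2", "rdf:type", "gr:ProductOrService"),
]

print(concept_instantiation("gr:ProductOrService", triples))                # 3
print(concept_instantiation("gr:ProductOrService", triples, entail=False))  # 1
```

The difference between the two calls shows why the value of CI for a top-level concept depends on whether inference is taken into account.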

Vocabs:
Vocabs provides the list of ontologies (other than the domain ontology) used to describe an entity. Ontologies are represented here with their namespace prefixes and include both the predicate's ontology prefix and the concept's ontology prefix to which it is linked. Vocabs helps us understand the ontologies that are co-used to describe different aspects of an entity. This analysis helps us determine the use of different vocabularies, which can be useful for querying the data or preprocessing several inference closures for reasoning. Formally, vocabs is defined as follows.

Definition
Vocabs is a set of namespaces (possibly empty) of the vocabularies used in a triple such that $o$ is the domain ontology concept and $p$ is the URI reference of an ontology, other than the domain ontology, used to describe $s$.

$$\mathit{Vocabs} = \{vocab_1, vocab_2, \ldots, vocab_n\} \tag{3}$$

such that $vocab_i$ is the namespace of $p$'s URI reference.
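A minimal Python sketch of Equation (3) is given below. It assumes tuple-based triples with prefixed names and treats the domain namespace (`gr` by default) and the W3C language namespaces as exclusions; the parameter names `domain_ns` and `language_ns` and the sample triples are hypothetical.

```python
RDF_TYPE = "rdf:type"

def namespace(term):
    """Prefix of a prefixed name, e.g. 'dc:title' -> 'dc'."""
    return term.split(":", 1)[0]

def vocabs(concept, triples, domain_ns="gr", language_ns=("rdf", "rdfs", "owl")):
    """Namespaces of the predicates that describe instances of the concept."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == concept}
    excluded = {domain_ns, *language_ns}
    return {
        namespace(p)
        for s, p, o in triples
        if s in instances and namespace(p) not in excluded
    }

# Hypothetical description of a business entity and an unrelated offering.
triples = [
    ("ex:cardealer", "rdf:type", "gr:BusinessEntity"),
    ("ex:cardealer", "gr:legalName", '"Example Cars"'),
    ("ex:cardealer", "dc:title", '"Example Cars Ltd"'),
    ("ex:cardealer", "foaf:homepage", "ex:home"),
    ("ex:cardealer", "rdfs:seeAlso", "ex:about"),
    ("ex:offer_1", "rdf:type", "gr:Offering"),
    ("ex:offer_1", "vso:color", '"red"'),
]

print(sorted(vocabs("gr:BusinessEntity", triples)))
print(sorted(vocabs("gr:BusinessEntity", triples, domain_ns=None)))
```

Passing `domain_ns=None` keeps the domain ontology prefix in the result, which is useful when the domain namespace itself should be listed alongside the co-used vocabularies.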

Object property usage:
Object property usage is a list of typed relationships that describe an entity by relating it to other sets of entities and resources. This includes the properties defined by the domain ontology as well as the properties of the ontologies listed in vocabs. Object property usage allows an understanding of the information pertaining to the entity and its richness by exploring the entities linked to it through these properties. The availability of such information in advance of accessing the dataset helps in building prototypical (or generic) queries for distributed datasets where information is in continuous growth. Also, knowing which object properties are used in conjunction with vocabs helps in understanding which object properties an entity could have from other ontologies to describe it.

$$\mathit{ObjectPro}(C) = \{pre_1, pre_2, \ldots, pre_n\} \quad \text{such that } pre_i = \text{Property} \tag{4}$$

The $\mathit{ObjectPro}(C)$ set contains the URI references representing the object properties defined by the ontologies belonging to vocabs.

‡ IF (<v subClassOf w> and <u type v>) THEN <u type w>

Attribute usage:
Attribute usage provides textual information about an entity. This may include RDF label properties and data type properties of the domain ontology and of nondomain ontologies. A textual description linked to an entity instance is a useful set of information for data processing and user interfaces. Knowing an entity's attributes and data types helps in building data-driven or knowledge-driven interfaces and adapting the layout, depending on the assertional knowledge.

$$\mathit{Attri}(C) = \{att_1, att_2, \ldots, att_n\} \tag{5}$$

such that $att_i \in (L_p \cup L_t)$.

The $\mathit{Attri}(C)$ set contains the URI references representing the data type properties defined by the ontologies belonging to Vocabs.
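Equations (4) and (5) can be sketched together in Python. The sketch makes a simplifying assumption that is not taken from the paper: a predicate is counted as an attribute when its value is a literal (written here with surrounding double quotes) and as an object property otherwise. The helper names and the sample triples are hypothetical.

```python
RDF_TYPE = "rdf:type"

def is_literal(term):
    """Literals are written with a leading double quote in this sketch."""
    return term.startswith('"')

def property_usage(concept, triples):
    """Return (ObjectPro(C), Attri(C)) for the instances of a concept."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == concept}
    object_props, attributes = set(), set()
    for s, p, o in triples:
        if s in instances and p != RDF_TYPE:
            if is_literal(o):
                attributes.add(p)      # value is a literal
            else:
                object_props.add(p)    # value is another resource
    return object_props, attributes

# Hypothetical description of a car dealer and one of its offers.
triples = [
    ("ex:cardealer", "rdf:type", "gr:BusinessEntity"),
    ("ex:cardealer", "gr:offering", "ex:offer_1"),
    ("ex:cardealer", "dc:title", '"Example Cars"'),
    ("ex:cardealer", "dc:date", '"2013-01-01"'),
    ("ex:offer_1", "rdf:type", "gr:Offering"),
    ("ex:offer_1", "gr:includes", "ex:product_1"),
]

object_props, attributes = property_usage("gr:BusinessEntity", triples)
print(sorted(object_props))
print(sorted(attributes))
```

A production analysis would more likely decide between the two sets by consulting the ontology's owl:ObjectProperty and owl:DatatypeProperty declarations; classifying by the kind of value is only a convenient approximation when the schema is not loaded.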

Class usage:
Class usage records the list of other concepts of which the entity is a member. This allows us to learn more about the entity in question when different concepts are used to instantiate the same entity. We believe that class usage provides the conceptual overlap that exists between related but different concepts formalized by different ontologies and can be exploited to generate semantic mapping between related terms. This further helps in aligning similar concepts defined in different ontologies using different mapping predicates to specify weak and strong semantics depending on the overlap.

$\mathit{ClassUsage}(C)$ is a set of classes such that there exists a triple in the dataset where $p = \text{rdf:type}$, $o$ is a class, and $o \neq C$. (6)
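Equation (6) can be read as a two-step lookup, sketched below in Python under the assumption that the triples in question are those whose subject is an instance of $C$. The function name `class_usage` and the sample triples are hypothetical.

```python
RDF_TYPE = "rdf:type"

def class_usage(concept, triples):
    """Other classes that instances of the concept are also members of."""
    members = {s for s, p, o in triples if p == RDF_TYPE and o == concept}
    return {
        o
        for s, p, o in triples
        if p == RDF_TYPE and s in members and o != concept
    }

# Hypothetical product typed with classes from three vocabularies.
triples = [
    ("ex:product_1", "rdf:type", "gr:ProductOrServiceModel"),
    ("ex:product_1", "rdf:type", "vso:Automobile"),
    ("ex:product_1", "rdf:type", "coo:Derivative"),
    ("ex:product_2", "rdf:type", "vso:Automobile"),
]

print(sorted(class_usage("gr:ProductOrServiceModel", triples)))
```

Because `ex:product_2` is not typed as gr:ProductOrServiceModel, it contributes nothing to the result; only co-typing on the same entity counts as conceptual overlap.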

Interlinking:
Interlinking provides a list of properties used to create links across different datasets. Examples of such links are link base and equivalence link [12]. Here, we mainly focus on equivalence links, which help to specify when different URIs refer to the same entity or resource. Semantic web languages provide built-in support for creating equivalence links between different components of the ontology and data. The resources and entities are linked through the owl:sameAs relation, which tells applications that these two resources (subject URI and object URI) describe the same entity, and their data can be merged to obtain a detailed view of the entity.
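The interlinking measure can be sketched in Python as a filter over a fixed set of linking predicates. The choice of `owl:sameAs` and `rdfs:seeAlso` as the predicates of interest is an assumption made for this illustration, as are the function name `interlinks` and the sample triples.

```python
RDF_TYPE = "rdf:type"
# Assumed set of linking predicates: one equivalence link and one link-base style link.
LINK_PREDICATES = {"owl:sameAs", "rdfs:seeAlso"}

def interlinks(concept, triples, link_predicates=LINK_PREDICATES):
    """Linking predicates used on instances of the concept."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == concept}
    return {p for s, p, o in triples if s in instances and p in link_predicates}

# Hypothetical business entity linked to another description of itself.
triples = [
    ("ex:cardealer", "rdf:type", "gr:BusinessEntity"),
    ("ex:cardealer", "owl:sameAs", "ex2:cardealer"),
    ("ex:cardealer", "rdfs:seeAlso", "ex:about"),
    ("ex:cardealer", "dc:title", '"Example Cars"'),
]

print(sorted(interlinks("gr:BusinessEntity", triples)))
```

Restricting the filter to `{"owl:sameAs"}` through the `link_predicates` argument would isolate the equivalence links, which are the ones whose descriptions can be merged.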

Now, let us understand the aforementioned analysis methods and metrics by using a sample RDF graph. Figure 5 shows the sample RDF graph from a fictitious ‘Example.com’ data source. The RDF data describe a company that is in the business of car sales. In the triples, we have the business entity, its business location (address), the offer, and the products included in the offer. For the sake of brevity and readability, we have listed the relevant triples in Turtle syntax§ and will use it in this section for discussion and explanation. Lines 1–7 of the sample RDF code give the prefixes used in the triples to access the vocabulary (or terms) defined by their respective namespaces to describe the resources (entities). Lines 8–37 of the sample RDF code give different resources linked through typed relationships to semantically describe the entities.

To explain the aforementioned measures, we consider ex:cardealer as the entity instance of the type gr:BusinessEntity class. The value of the concept instantiation is $CI(C) = 1$, because we have only one instance of type gr:BusinessEntity. Vocabs is the set of prefixes used to describe the entity, and in this example, we have $\mathrm{Vocabs} = \{\texttt{gr}, \texttt{dc}, \texttt{foaf}\}$. Note that we are not including World Wide Web Consortium (W3C) standard language prefixes here such as RDF, RDFS, and Web Ontology Language (OWL). In the case of object property usage, we have only one property (gr:offering); therefore, $\mathrm{ObjectPro}(C) = \texttt{gr:offering}$, and attribute usage has $\mathrm{Attri}(C) = \{\texttt{dc:title}, \texttt{dc:date}, \texttt{foaf:homepage}\}$. The class usage of the product entity (ex:product_1, line 23) of type gr:ProductOrServiceModel returns the set of classes of which the entity is also a member, that is, $\mathrm{ClassUsage}(C) = \{\texttt{vso:Automobile}, \texttt{coo:Derivative}\}$. Additionally, link base and equivalence links are provided to allow users to access additional relevant information (rdfs:seeAlso; line 15) and to obtain detailed information about the entity by merging the descriptions published at two different locations (owl:sameAs; line 13), that is, $\mathrm{Interlink}(C) = \{\texttt{rdfs:seeAlso}, \texttt{owl:sameAs}\}$.

Figure 5. Sample Resource Description Framework code for discussion.

§ http://www.w3.org/TeamSubmission/turtle/
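The measures above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the study: the `Triple` type, the `concept_usage` function, the `SAMPLE` triples, and the choice to treat rdfs:seeAlso and owl:sameAs separately from other object properties are all assumptions made here, and the sample is a hypothetical fragment rather than a transcription of Figure 5.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    s: str
    p: str
    o: str
    o_is_literal: bool = False  # True when the object is a literal value

W3C = {"rdf", "rdfs", "owl", "xsd"}            # prefixes excluded from Vocabs
INTERLINK = {"rdfs:seeAlso", "owl:sameAs"}     # assumed interlinking properties

def prefix(term: str) -> str:
    """Return the namespace prefix of a prefixed name such as 'gr:offers'."""
    return term.split(":", 1)[0]

def concept_usage(triples, concept):
    """Collect CI, Vocabs, ObjectPro, Attri, ClassUsage and Interlink for a concept."""
    instances = {t.s for t in triples if t.p == "rdf:type" and t.o == concept}
    usage = {"CI": len(instances), "Vocabs": set(), "ObjectPro": set(),
             "Attri": set(), "ClassUsage": set(), "Interlink": set()}
    for t in triples:
        if t.s not in instances:
            continue
        term = t.p
        if t.p == "rdf:type":
            term = t.o                      # the class carries the vocabulary
            if t.o != concept:
                usage["ClassUsage"].add(t.o)
        elif t.p in INTERLINK:
            usage["Interlink"].add(t.p)
        elif t.o_is_literal:
            usage["Attri"].add(t.p)
        else:
            usage["ObjectPro"].add(t.p)
        if prefix(term) not in W3C:
            usage["Vocabs"].add(prefix(term))
    return usage

# Hypothetical sample data, loosely modeled on the car dealer example.
SAMPLE = [
    Triple("ex:cardealer", "rdf:type", "gr:BusinessEntity"),
    Triple("ex:cardealer", "gr:legalName", "Example Cars Ltd.", True),
    Triple("ex:cardealer", "dc:title", "Example car dealer", True),
    Triple("ex:cardealer", "gr:offers", "ex:offer_1"),
    Triple("ex:cardealer", "rdfs:seeAlso", "ex:about"),
    Triple("ex:cardealer", "owl:sameAs", "other:cardealer"),
    Triple("ex:product_1", "rdf:type", "gr:ProductOrServiceModel"),
    Triple("ex:product_1", "rdf:type", "vso:Automobile"),
]

business = concept_usage(SAMPLE, "gr:BusinessEntity")
product = concept_usage(SAMPLE, "gr:ProductOrServiceModel")
```

In this sketch, a vocabulary counts toward Vocabs when one of its properties or classes is used on an instance of the concept; other definitions are possible.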

3.3. Labeling

Labeling refers to the textual information provided with the entity description to allow a better understanding of an entity before that entity is processed by Semantic Web applications. In [4], the authors listed a number of benefits of labels, which include displaying human-readable information instead of displaying URIs, using labels for indexing (this use case is also highlighted in our previous work [13]), and support for keyword and question-based searches over the web of data. Here, we analyze how labeling properties are used with entity descriptions, which is helpful for information retrieval and presentation. While analyzing the entity, we look at the use of different label properties in the data and discuss their usefulness in scenarios such as finding hidden information from the label text, using language tags to facilitate the internationalization of semantic applications, and developing a user interface for information that is syntactically published for machine consumption.

Formal labels: The RDFS specification provides two properties, namely rdfs:label and rdfs:comment, to provide human-readable information about the resources. The former is normally used to provide a human-friendly version of the resource name, which is otherwise an opaque URI, and the latter is used to present a human-readable description of the resource. We will refer to these two label properties as formal labels (fl) while analyzing the presence of such properties in the dataset in general and in the entity description specifically. Such inline documentation on resources is very useful, and often, domain ontologies define more specific labeling properties. We propose and define the following metrics to measure the use of fl for each pivot entity. $\mathrm{Entity}_{fl}$ measures the ratio of entities with at least one formal label to all pivot entities in the dataset.

If $C$ is the concept of the domain ontology (class), then

$$fl = \{\texttt{rdfs:label}, \texttt{rdfs:comment}\}$$

$$\mathrm{Entity}_{fl}(C) = \frac{\text{number of instances}(C)\ \text{with}\ fl}{\text{total number of instances}(C)} \qquad (7)$$

Domain labels: There are two common practices for defining domain ontology label properties: first, describing label properties as the subproperty of rdfs:label using the subproperty axiom (subsumption) and, second, having a data-type property with rdfs:Literal as its range. In some cases, the label properties are defined by specifying a literal data type, and in such cases, the xsd:string data type is used. We refer to these domain ontology-defined label properties as domain labels (dl). In [4], the authors proposed label-related metrics to measure completeness, efficient accessibility of label properties, and unambiguity of the labels in the knowledge base. These metrics help quantify the presence of labels in a dataset; however, to understand their usefulness in a real setting for information retrieval and presentation purposes, it is necessary to analyze label properties for each pivot entity and discuss their usefulness.

Likewise, $\mathrm{Entity}_{dl}$ computes the ratio of entities with at least one domain ontology label to all pivot entities in a dataset. The sum of these two measures tells us how rich a particular concept (pivot entity) is in terms of labels. If $C$ is the concept of the domain ontology (class), then

$$dl = \{\, i \mid i \text{ is a label property defined in the domain ontology} \,\}$$

$$\mathrm{Entity}_{dl}(C) = \frac{\text{number of instances}(C)\ \text{with}\ dl}{\text{total number of instances}(C)} \qquad (8)$$

By using the sample code, let us understand what labels are available and how they are used in the knowledge base by using the metrics $\mathrm{Entity}_{fl}$ and $\mathrm{Entity}_{dl}$. Continuing the example code, as we are focusing on gr:BusinessEntity as the pivot concept, we simply calculate the label metrics for entities of type gr:BusinessEntity.

$$\mathrm{Entity}_{fl} = 0/1 = 0$$

$$\mathrm{Entity}_{dl} = 1/1 = 1$$

The label attributes used for the description of the ex:cardealer entity are listed from lines 9 to 16 of the sample code. For $\mathrm{Entity}_{fl}$, we consider only RDFS-based label properties (i.e., rdfs:label and rdfs:comment), neither of which is used in this particular example. We have only one instance of an entity (individual) of type gr:BusinessEntity; therefore, $\mathrm{Entity}_{fl}$ equals 0. Likewise, for $\mathrm{Entity}_{dl}$, we have the gr:legalName predicate usage, which is a domain ontology label property (the list of domain ontology labels relevant to this work is discussed in Section 5.4); therefore, the value of $\mathrm{Entity}_{dl}$ is 1.
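As a minimal sketch of Equations (7) and (8), the two ratios can be computed with one helper that is parameterized by the set of label properties. The function name `label_ratio`, the tuple representation of triples, and the sample data are assumptions introduced here for illustration; the sample is a hypothetical fragment, not the code of Figure 5.

```python
FORMAL_LABELS = {"rdfs:label", "rdfs:comment"}   # fl, as defined for Equation (7)
DOMAIN_LABELS = {"gr:legalName"}                 # assumed dl set for gr:BusinessEntity

def label_ratio(triples, concept, label_props):
    """Share of instances of `concept` that carry at least one property in `label_props`."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == concept}
    if not instances:
        return 0.0
    labelled = {s for s, p, o in triples if s in instances and p in label_props}
    return len(labelled) / len(instances)

# Hypothetical description of a single business entity.
SAMPLE = [
    ("ex:cardealer", "rdf:type", "gr:BusinessEntity"),
    ("ex:cardealer", "gr:legalName", "Example Cars Ltd."),
    ("ex:cardealer", "dc:title", "Example car dealer"),
]

entity_fl = label_ratio(SAMPLE, "gr:BusinessEntity", FORMAL_LABELS)
entity_dl = label_ratio(SAMPLE, "gr:BusinessEntity", DOMAIN_LABELS)
```

With this sample, no instance carries rdfs:label or rdfs:comment and the single instance carries gr:legalName, which mirrors the values 0 and 1 discussed above.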


3.3.1. Knowledge patterns (traversal path). A traversal path determines the sequence in which the properties are used to access the description of related entities within a given context. A traversal path starts with the instance of the entity class in focus and follows the sequence of instance–property–instance triples to record all the paths in the dataset. We report the following quantitative measures pertaining to traversal paths.

Unique paths: Unique paths compute the number of unique paths leading from the entity (out links). One entity can have zero or many paths of varying lengths, depending on the RDF graph in the dataset. A complete set of unique paths helps in understanding the data patterns, which can further assist in querying the dataset.

Average path length: The average path length helps in understanding the entity description depth in the dataset.

Max path length: The max path length helps in understanding the maximum possible description depth in the knowledge base.

Path steps: Path steps help to identify the triples found in traversal paths. For traversal paths, the unique paths in the RDF graph (or dataset) and the maximum and average traversal path lengths are computed. Our traversal path procedure constructs a list of all available paths in the dataset, and this list of paths is then used to compute the maximum and average path lengths. Additionally, the path steps of each path are generated, and their frequency in the path list is computed to reflect the occurrences of each path step in the path list.

In the example code, there are two unique paths in the RDF graph, one of length 3 and a second of length 2 (Figure 6). The length is computed by counting the number of predicates (relationships) in a path. The path steps and their strength values are shown in Figure 7. We can see that the first path step has a strength of 2 because it appears in both paths, and each of the remaining steps has a strength of 1 because it appears in only one path. Paths and path steps provide a snapshot of the knowledge in the form of triple patterns that indicate the invariance of instance data or entity descriptions across data sources that are contextually relevant (domain specific).
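The traversal path measures can be illustrated with a short depth-first enumeration. This is a sketch under stated assumptions, not the procedure used in the study: the functions `traversal_paths` and `path_stats`, the cycle-avoidance rule, and the definition of a path step as one (subject, predicate, object) triple are choices made here, and the `SAMPLE` graph is hypothetical, constructed so that it yields one path of length 3 and one of length 2 as in the discussion above (it is not a transcription of Figures 6 and 7).

```python
from collections import Counter, defaultdict

def traversal_paths(triples, start):
    """Enumerate maximal paths of instance-property-instance triples from `start`."""
    out = defaultdict(list)
    for s, p, o in triples:
        out[s].append((p, o))
    paths = []

    def walk(node, path, seen):
        extended = False
        for p, o in out.get(node, []):
            if o in seen:               # skip nodes already on this path (no cycles)
                continue
            extended = True
            walk(o, path + [(node, p, o)], seen | {o})
        if not extended and path:       # dead end: record the completed path
            paths.append(tuple(path))

    walk(start, [], {start})
    return paths

def path_stats(paths):
    """Unique path count, max and average length, and per-step frequency (strength)."""
    lengths = [len(p) for p in paths]
    return {
        "unique_paths": len(set(paths)),
        "max_length": max(lengths, default=0),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "step_strength": Counter(step for p in paths for step in p),
    }

# Hypothetical resource-to-resource triples (literal-valued triples are left out).
SAMPLE = [
    ("ex:cardealer", "gr:offers", "ex:offer_1"),
    ("ex:offer_1", "gr:includesObject", "ex:tqn_1"),
    ("ex:tqn_1", "gr:typeOfGood", "ex:product_1"),
    ("ex:offer_1", "gr:hasPriceSpecification", "ex:price_1"),
]

stats = path_stats(traversal_paths(SAMPLE, "ex:cardealer"))
```

Path length is the number of predicates on the path, and the strength of a step is the number of paths in which it occurs, so the shared first step receives a strength of 2 here.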

In the next section, we introduce the domain ontology and the dataset that we used for the analysis.

4. CASE STUDY: GOODRELATIONS AS A DOMAIN ONTOLOGY

The GRO [14], developed and introduced in 2008, is one of the first web ontologies to conceptualize the eCommerce domain on the web. It has recently seen an increase in popularity and adoption by the Semantic Web community, particularly after being recognized by major search engines such as Google (www.google.com), Yahoo (www.yahoo.com), and Bing (www.bing.com).

4.1. GoodRelations conceptual schema

Figure 6. Traversal paths.

Figure 7. Path steps and their strengths.

The GRO is a live ontology that is evolving with time to capture the changes and improve its conceptual representation of the domain model. The latest version of the GRO comprises 31 concepts (classes), 50 object properties, 44 data properties, and 48 named individuals. With backward compatibility kept intact, the ontology model is updated frequently to add new object and data properties, on the basis of the experience and feedback gained from real-world implementations. The GRO is available at http://purl.org/goodrelations/v1, and gr is the prefix used in this paper and elsewhere to refer to GRO. From a high-level view, the GR model is based on three main concepts, each focusing on a separate aspect of the eCommerce domain. These concepts are business entity, offering, and product or service, each of which is discussed in detail next.

Business entity: The gr:BusinessEntity concept represents a business organization (or any individual) that intends to offer or seek products on the web. The main purpose of this concept is to provide the necessary attributes needed to describe any business such as the name of the company, address, location, vertical industry in which it operates, and any other identifier that makes it uniquely identifiable on the web.

Offering: gr:Offering is the pivotal concept in the GRO. This concept allows the description of a particular offering a business entity is likely to make or seek on the web. In the latest version, there are 15 data-type properties (all optional) to describe offer details such as availability, validity, name, and description of the offering. Recently, gr:name and gr:description have also been added to make it easy to give any name and description to allow users to know more about the offer itself.

Product or service: The third main concept is gr:ProductOrService. As mentioned earlier, an offering can contain one or more products (or services) and is usually described using one of the three possible subclasses of this main (abstract) class. GR’s main focus is to cover the conceptual model of offering rather than to be a product ontology. However, gr:ProductOrService and its subconcepts can be used to specify the product and its qualitative and quantitative properties to describe a lightweight product ontology.

4.2. Axioms

Ontologies often comprise classes, properties, individuals, and axioms. Axioms allow information to be inferred from a knowledge base through the use of a reasoning engine known as a reasoner [15]. The expressivity of the GRO is based on an OWL description logic program fragment and contains subclass and subproperty axioms to express the subsumption behavior in the model. Axiomatic triples in the GRO are given in Table I to shed light on the possible inference on eCommerce data, which have been annotated using the GRO and applicable rule sets. RDFS and OWL elements such as rdfs:domain and rdfs:range, which are available in the ontology, are omitted from the table because they were not included in the reasoning experiment.


Table I. Axioms in GoodRelations ontology and applicable rule sets.

Axioms                      Count   Applicable rule sets
Class
  SubClassOf                12      RDFS
  DisjointClasses           129     OWL2RL
Object property
  SubPropertyOf             4       RDFS
  InverseOf                 6       pD*, OWL2RL
  TransitiveProperty        7       pD*, OWL2RL
  SymmetricProperty         2       pD*, OWL2RL
Data property
  SubPropertyOf             13      RDFS

Elements such as rdfs:subClassOf, rdfs:subPropertyOf, owl:inverseOf, owl:TransitiveProperty, and owl:SymmetricProperty are considered in this study because they are associated with new knowledge. They can be used both in forward chaining (to materialize implied statements, thereby making them explicit) and in backward chaining (performing query rewrites to expand query scope and include inferred knowledge). The DisjointClasses axiom differs because it is used primarily for data quality and checking for inconsistencies. The constructs in Table I are covered by almost all of the rule sets including RDFS, pD* [16], and OWL2RL.¶ In our study, we employ an RDFS-based reasoning engine with the RDFS rules because it is generally available in most semantic repositories.
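To make the forward-chaining idea concrete, the following sketch materializes the consequences of two RDFS entailment rules, rdfs7 (subproperty) and rdfs9 (subclass), by iterating to a fixed point. It is an illustration only and covers far less than a full RDFS reasoner; the function name `rdfs_closure` and the `ex:` schema statements are hypothetical and are not axioms taken from the GRO.

```python
def rdfs_closure(triples):
    """Forward-chain rdfs7 and rdfs9 until no new triple can be derived."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_prop = {(s, o) for s, p, o in inferred if p == "rdfs:subPropertyOf"}
        sub_class = {(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"}
        new = set()
        for s, p, o in inferred:
            # rdfs7: (s p o), (p subPropertyOf q)  =>  (s q o)
            for narrower, broader in sub_prop:
                if p == narrower:
                    new.add((s, broader, o))
            # rdfs9: (s type c), (c subClassOf d)  =>  (s type d)
            if p == "rdf:type":
                for narrower, broader in sub_class:
                    if o == narrower:
                        new.add((s, "rdf:type", broader))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

# Hypothetical schema and instance data.
DATA = {
    ("ex:CarDealer", "rdfs:subClassOf", "gr:BusinessEntity"),
    ("ex:legalName", "rdfs:subPropertyOf", "rdfs:label"),
    ("ex:dealer_1", "rdf:type", "ex:CarDealer"),
    ("ex:dealer_1", "ex:legalName", "Example Cars Ltd."),
}

materialized = rdfs_closure(DATA)
```

After materialization, a query for instances of gr:BusinessEntity or for rdfs:label values also returns the statements that were only implied in the original data, which is the effect forward chaining is meant to achieve.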

4.3. Dataset

To gain a clear understanding of the RDF data and the use of ontologies to provide shared inference and structure on the web, we built a dataset comprising ‘domain’-specific data extracted from the web to conduct an investigation on empirical grounding. We are particularly interested in data sources that used the ‘domain ontology’ by using the core concepts to provide schema-level metadata. In the following subsections, we first discuss the approach adopted to identify potential data sources and the minimum selection criteria used. Then, we discuss the dataset collection approach, including hybrid crawling and the selection of seed URIs, followed by the dataset characteristics.

4.3.1. Hybrid crawler. A potential source for the required data is the LOD cloud,|| which, as of early 2013, hosts 295 datasets containing approximately 32 billion triples. This appears to be a very fertile source of data for our study; however, as reported in [17] and [18], the datasets in the LOD cloud publish data but make only limited use of ontologies, neglecting to provide the schema-level meta information deemed necessary to apportion information over the web. The published LOD statistics state that 64.75% of datasets make use of vocabularies other than the W3C-based ones (RDF, RDF Schema, and OWL); we call these open ontologies or vocabularies. Of these open ontologies, 78.31% of datasets use one or more of the following in combination to provide schema-level information: Dublin Core (31.19%), FOAF (27.46%), and Simple Knowledge Organization System (19.66%). Noticeably, only four (1.36%) out of 295 are reported to have used GRO, even though PingTheSemanticWeb** ranks GR as the second most-used ontology after FOAF. These numeric facts highlight the paucity of use and availability of ontological knowledge in the LOD dataset. Therefore, we decided to build our own dataset to collect the RDF data currently published using domain ontology.

To collate domain-focused data, the minimum criterion for the selection of potential data sources is to identify the data publishers who have at least described the key concepts using the domain ontology. In our case, business entity and offering are the primary identification drivers. We built the list of seed URIs for crawling using Sindice API and the Watson semantic search engine (Figure 3(a)). For crawling, we initially attempted to use semantic crawlers such as LDSpider and explored ontology-based crawling [19], but because most of the eCommerce-related RDF data are embedded within HTML pages using RDFa and as a result of the lack of interlinking between different resources even within the same host name, the existing crawler could not be used effectively. Therefore, using the LDSpider API, we implemented a hybrid crawler that crawls in a similar way to traditional web crawlers by following hyperlinks and extracting only the RDF triples in web documents. By using REST-based web services, namely Any23†† and RDFa Distiller,‡‡ the extracted RDFa snippets from web documents were then transformed into an RDF/XML document to create one RDF graph for each web document.

¶ http://www.w3.org/TR/owl2-profiles/
|| http://www4.wiwiss.fu-berlin.de/lodcloud/state/ (last accessed on 27 Nov 2011)
** http://pingthesemanticweb.com (last accessed on 15 July 2012)

The RDF graphs were then loaded into the OpenLink Virtuoso triple store to create a dataset called GRDS2, which would be used for further analysis. From an RDF data management perspective, named graphs are used to group all the triples from one data source (hostname) under a unique named graph IRI, allowing the dataset to be queried vertically (one data source) and horizontally (across data sources).
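The grouping of triples by data source can be pictured with a small in-memory stand-in for a triple store with named graphs. This is only a sketch of the idea, not the Virtuoso setup used for GRDS2: the graph IRI scheme in `graph_iri`, the helper names, and the `DOCS` sample are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def graph_iri(document_url):
    """Derive one named graph IRI per hostname (hypothetical naming scheme)."""
    return f"http://{urlparse(document_url).hostname}/"

def load(documents):
    """Group the triples of every crawled document under its host's named graph."""
    store = defaultdict(set)
    for url, triples in documents:
        store[graph_iri(url)].update(triples)
    return store

def query_vertical(store, graph, predicate):
    """Match a predicate inside a single data source."""
    return {t for t in store.get(graph, ()) if t[1] == predicate}

def query_horizontal(store, predicate):
    """Match a predicate across all data sources, reporting hits per source."""
    hits = {}
    for graph in store:
        matched = query_vertical(store, graph, predicate)
        if matched:
            hits[graph] = len(matched)
    return hits

# Hypothetical crawl result: (document URL, triples extracted from that document).
DOCS = [
    ("http://shop-a.example/p/1", [("ex:o1", "rdf:type", "gr:Offering")]),
    ("http://shop-a.example/p/2", [("ex:o2", "rdf:type", "gr:Offering")]),
    ("http://shop-b.example/offers", [("ex:o3", "rdf:type", "gr:Offering"),
                                      ("ex:b", "gr:legalName", "Shop B")]),
]

store = load(DOCS)
```

In a SPARQL setting, the vertical query corresponds to restricting a pattern to one named graph, and the horizontal query corresponds to leaving the graph as a variable.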

4.3.2. Dataset characteristic. The empirical analysis is performed on the GRDS2 dataset described earlier. The GRDS2 dataset comprises 22.3 million triples loaded into the Virtuoso triple store (open-source version) collected from 211 different data sources (data sources with unique domain names). Occasionally, particularly in namespace usage analysis (Section 5.1), we will refer to our previous dataset GRDS [13], which in this paper we will refer to as GRDS1, to demonstrate the changes in ontology usage, concept usage, and data patterns over a time span of more than a year. The GRDS1 dataset comprises 9.5 million triples collected from 105 web sources.

4.3.3. Data providers. By observing the structured eCommerce data landscape while building the GRDS2, we are able to categorize data publishers into three groups, on the basis of their publishing approach, usage pattern, and data volume.

Large size retailers: This group includes large online eRetailers and retailers who are traditionally premises based and have only recently entered the eRetailing business. Such data sources provide richer, more detailed offerings and product descriptions, which are useful for entity consolidation and interlinking with other datasets. Retailers include Volkswagen.com.uk, BestBuy.com, Overstock.com, Oreilly.com, and Suitcase.com.

Web shops: A large number of semantic eCommerce adopters are small to medium web shops, offering their products and services mainly through web channels. Most of these web shops use web content management packages§§ such as Magento,¶¶ Oxid-eSales, WP 4 eCommerce, osCommerce, and Joomla Virtuemart to add RDFa data to offer-related web pages. This approach of embedding Semantic Web data in existing web pages works well for small and medium web shops because no special infrastructure arrangement is required in most cases; the semantic metadata (data describing products and offers) are embedded within existing web documents, offering several benefits to both producers and consumers.

Data service providers (data spaces): To leverage the benefits offered by semantic eCommerce data, businesses offer data services that are built on consolidated semantic repositories. Moreover, providers use APIs to access and transform proprietary data into RDF before making them available through their repositories. For example, Linked Open Commerce|||| contains Amazon.com data although Amazon.com has not yet published RDF/RDFa.

5. GOODRELATIONS ONTOLOGY USAGE ANALYSIS

Using the GRDS2 dataset and the metrics introduced in Section 3.2, we analyze the use of domain ontology from different aspects in this section.

†† http://any23.org (last accessed on 16 Jan 2013)
‡‡ http://www.w3.org/2007/08/pyRdfa/ (last accessed on 9 Jan 2013)
§§ A complete list of their references is available at http://www.ebusiness-unibw.org/wiki/GoodRelations#Shop_Software
¶¶ www.magentocommerce.com
|||| http://www.linkedopencommerce.com


5.1. Namespace usage

We measure the availability of different ontologies in the dataset, and their usage intensity is observed by querying the dataset and identifying the data sources using those vocabularies (vocabularies/ontologies are referred to in RDF by their namespaces). We looked at the inclusion of the authentic vocabulary namespace instead of measuring the use of terms of a given ontology, that is, its concepts and properties. Throughout this paper, we have adopted an approach in which we report the percentage of data sources (structured data publishers/consumers) in the dataset instead of counting and reporting the number of triples (matched against specified criteria), which from our point of view does not communicate any valuable information. We believe that this approach provides a less biased usage analysis, because it disregards the size of the implementer and looks at the number of data sources using a term. For example, if a large implementer such as BestBuy.com uses a term (e.g., gr:contains) to describe its 200,000 products and happens to be the only data source using this term in the dataset, this will count as only one instance of usage in the dataset.
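The difference between triple-level and data-source-level counting can be shown with a small sketch. The function `namespace_usage`, the (source, subject, predicate, object) tuple layout, the restriction to predicate prefixes, and the sample hosts are assumptions introduced here for illustration only.

```python
from collections import defaultdict

def namespace_usage(quads):
    """Percentage of data sources that use each namespace prefix at least once."""
    sources = set()
    users = defaultdict(set)            # prefix -> set of data sources using it
    for source, s, p, o in quads:
        sources.add(source)
        users[p.split(":", 1)[0]].add(source)
    return {ns: 100.0 * len(srcs) / len(sources) for ns, srcs in users.items()}

# Hypothetical dataset: one large publisher and three small web shops.
QUADS = [("big.example", f"ex:p{i}", "vso:color", "red") for i in range(1000)]
QUADS += [
    ("big.example", "ex:big", "gr:legalName", "Big Retailer"),
    ("shop-a.example", "ex:a", "gr:legalName", "Shop A"),
    ("shop-a.example", "ex:a", "foaf:page", "ex:a-home"),
    ("shop-b.example", "ex:b", "gr:legalName", "Shop B"),
    ("shop-b.example", "ex:b", "foaf:page", "ex:b-home"),
    ("shop-c.example", "ex:c", "gr:legalName", "Shop C"),
]

usage = namespace_usage(QUADS)
```

Here the vso prefix accounts for the vast majority of triples but is used by only one of four data sources (25%), whereas gr appears in few triples but in every data source (100%), which is the distinction the data-source-level measure is intended to capture.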

Table II lists the vocabularies in the captured dataset along with the percentage of data sources using them.

In total, there are 48 namespaces found in the dataset, 22 of which are listed in Table II. The remainder are excluded from the list. We found 12 in-house ontologies with no formal description, four with erroneous URIs, and seven namespaces representing W3C’s formal specifications such as RDF, RDFS, and OWL. The complete list of vocabularies found in the dataset is presented in Appendix A. The first four vocabularies next to gr, namely vCard, foaf, yahoo, and dc, are, on average, used by 53% of the data sources to describe the commonly used entities. The use of different ontologies (within a vertical domain) promotes semantic-level integration among distributed datasets and supplements the knowledge base by enabling interlinking with external datasets. Note that the inclusion of new vocabularies in GRDS2, which were not present in GRDS1, in our view, indicates the potential role of the domain ontology as a ‘gateway drug’ for the adoption of more specialized ontologies. For example, Volkswagen.com.uk, while implementing Semantic Web technologies and GRO, developed a new Volkswagen vehicles ontology (vvo) to provide a contextual search [20] for its users and to represent the precise concepts that are not covered by presently available ontologies, hence motivating the creation and adoption of new ontologies. We believe that this motivational aspect of the domain ontology, which results in the creation of more specialized conceptualization, promotes the annotation of semantic information at a fine-grained level. This results in improvements in information specificity and allows semantic applications to precisely locate information on the web.

Table II. List of vocabularies and their percentage in GRDS.

Prefix   Namespace                                                   Data sources (%)
gr       http://purl.org/goodrelations/v1#                           97.16
vCard    http://www.w3.org/2006/vcard/ns#                            79.15
foaf     http://xmlns.com/foaf/0.1/                                  54.98
yahoo    http://search.yahoo.com/searchmonkey/commerce/              41.71
dc       http://purl.org/dc/terms/                                   36.49
eCl@ss   http://www.ebusiness-unibw.org/ontologies/eclass/5.1.4/#    18.01
v        http://rdf.data-vocabulary.org                              16.59
og       http://opengraphprotocol.org/schema/                        9.00
rev      http://purl.org/stuff/rev#                                  7.11
pto      http://www.productontology.org/id/                          1.90
geo      http://www.w3.org/2003/01/geo/wgs84_pos#                    0.95
cc       http://creativecommons.org/ns#                              0.95
frbr     http://vocab.org/frbr/core#                                 0.47
void     http://rdfs.org/ns/void#                                    0.47
sioc     http://rdfs.org/sioc/ns#                                    0.47
vso      http://purl.org/vso/ns#                                     0.47
coo      http://purl.org/coo/ns#                                     0.47
scovo    http://purl.org/NET/scovo#                                  0.47
comm     http://purl.org/commerce#                                   0.47
media    http://purl.org/media#                                      0.47

Figure 8 provides a vocabulary usage comparison between the two datasets. In GRDS1, 10 vocabularies (the first 10 in Figure 8) were found in the dataset, and GRDS2 has 22 in total, including the 10 from the earlier dataset. Two interesting observations evident from this figure are the use of new vocabularies and a surge in the use of already adopted vocabularies. The former observation indicates that the use of domain ontologies facilitates the adoption of new, specialized ontologies to describe the concepts not covered by the respective domain ontologies. For example, we have seen the adoption of two ontologies (eCl@ss and pto) to semantically describe the ‘products’ information, which was not present in the first dataset, that is, GRDS1, and which we reported as an issue in [13]. The latter observation reflects the growing trend in the adoption and popularity of in-use vocabularies in the eCommerce domain.

5.2. Schema link graph

Using the schema link graph model, we obtained a graph to represent all the ontologies in the dataset in which the links reflect the co-usability of different ontologies. Figure 9 shows the links between entities defined across various ontologies. The node size represents the degree of an ontology, which refers to the number of linked ontologies describing the entities in the dataset. For example, the foaf node has a degree value of 7, which means that the foaf resources are further linked with dc, frbr, vso, vCard, pto, gr, and v resources. In the schema link graph, the average node degree is 4.12 with a standard deviation of 3.61, which shows that the degree distribution ostensibly follows the power law distribution [21]. However, the average degree distribution in the schema link graph is encouraging because it reflects the good co-usability factor that exists in the dataset.
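The degree statistics of a schema link graph can be computed as sketched below. The edge list contains only the seven foaf links named in the text, so the resulting mean and standard deviation describe this partial graph and are not the values reported for the full dataset. The function name `degrees` and the use of the population standard deviation are assumptions; the text does not state which variant was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def degrees(edges):
    """Node degree in an undirected vocabulary co-usage graph (self-links ignored)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {vocab: len(linked) for vocab, linked in neighbours.items()}

# Partial edge list: only the foaf links quoted in the text.
FOAF_LINKS = [("foaf", v) for v in ("dc", "frbr", "vso", "vCard", "pto", "gr", "v")]

degree = degrees(FOAF_LINKS)
values = list(degree.values())
avg_degree = mean(values)
std_degree = pstdev(values)   # population standard deviation (an assumption)
```

A large spread relative to the mean, as in this toy graph where one hub is linked to many single-link nodes, is the kind of pattern the text associates with a skewed degree distribution.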

After analyzing the use of different vocabularies and the linking of entities over different vocabularies, we look at domain ontology usage in the next section in a more detailed fashion to understand the data and knowledge patterns in the dataset.

5.3. Pivot concept usage analysis using the concept usage template

To conduct the empirical analysis of domain ontology, it is important to identify the pivot concepts that represent the core entity in the domain conceptualized by the domain ontology. While there are some advanced approaches [22] that can be employed to automatically find the key concepts of the domain ontology, we use the gr:BusinessEntity, gr:Offering, and gr:ProductOrService pivotal concepts introduced in Section 4.

Figure 8. Vocabulary usage in two longitudinal datasets.


Figure 9. Schema link graph.

5.4. Preprocessing

Simple measures that are computationally inexpensive, such as concept instantiations, the presence of certain triple patterns, and the use of different properties with a given pivot concept, are obtained by posing SPARQL queries to the dataset. However, for computationally complex operations such as traversal paths, querying the dataset using the triple store’s SPARQL endpoint does not offer a practical solution. Any query with more than three triple patterns in a chain with filter clauses fails to return the result set within a reasonable time. As a work-around, we export the dataset into the N-Triples format (a line-delimited syntax for RDF graphs) using Jena API [23], and the nxparser API*** is used to extract the paths fanning out from the pivot entity. The list of paths is then used to compute the maximum and average path lengths. Additionally, the path steps of each path are generated, and their frequency in the path list is updated to reflect the occurrences of each path step in the path list.

To understand the use of label properties by data publishers, we have two metrics, namely $\mathrm{Entity}_{fl}$ and $\mathrm{Entity}_{dl}$, to measure the use of formal label properties and domain-ontology-specific label properties, respectively. Aside from RDFS, several ontologies have defined their own labeling properties, which are often used together to provide the same contextual information but using different predicates. Publishers do this to provide support for a range of vocabularies and to make it easy for consumers, but sometimes, deciding which one to use while querying the data becomes an issue from the consumer’s point of view. The labeling properties that are formally defined as subproperties (using rdfs:subPropertyOf) of rdfs:label make it easy for an application to include all the available labels for an entity if lightweight reasoning is supported. To make our label analysis more empirically grounded, we relax the definition of $\mathrm{Entity}_{fl}$ to also include the labeling properties that are subproperties of rdfs:label: foaf:name, skos:prefLabel, and sioc:name. Another extension has been made to include dc:title, even though it is not defined as a subproperty of rdfs:label; because it is an extensively used [4] property in LOD, we include it under $\mathrm{Entity}_{fl}$. After relaxing the conditions, we have the following set of label properties as part of the formal labels:

$$\text{Formal labels} = \{\texttt{foaf:name}, \texttt{skos:prefLabel}, \texttt{sioc:name}, \texttt{dc:title}\}$$

***http://code.google.com/p/nxparser


Table III. Resource Description Framework usage of gr:BusinessEntity.

Entity             gr:BusinessEntity
Instantiation      54,542
Vocabs             vCard, gr, foaf, yahoo, v, and schema
Object properties  vCard:adr, vCard:email, vCard:url, yahoo:image, gr:offers,
                   gr:hasPOS, foaf:logo, foaf:homepage, foaf:maker, foaf:page,
                   gr:hasOpeningHourSpecification, and foaf:depiction
Attribute usage    vCard:fn, vCard:tel, vCard:email, vCard:organization-name,
                   vCard:fax, vCard:adr, vCard:Tel, gr:hasISICv4,
                   gr:legalName, v:name, v:pricerange, v:category, foaf:maker,
                   yahoo:seatingOptions, yahoo:cuisine, yahoo:features,
                   yahoo:smoking, yahoo:serviceOptions, yahoo:mealOptions,
                   yahoo:priceRange, yahoo:hoursOfOperation, schema:postalCode,
                   schema:addressLocality, schema:streetAddress, and schema:telephone
Class usage        vCard:VCard, cVard:org, yahoo:Business, yahoo:Restaurant,
                   gr:BusinessEntityType, comm:Business, and v:Organization
Interlinking       rdfs:seeAlso and owl:sameAs

To compute Entitydl for a given pivot concept, we need a set of label attributes defined by the domain ontology where the pivot concept is the rdfs:domain of the label property. For the three pivot concepts used in this analysis, we have the following sets of domain labels:

Domain labels (gr:BusinessEntity) = {gr:legalName}

Domain labels (gr:Offering) = {gr:condition, gr:category}

Domain labels (gr:ProductOrService) = {gr:category, gr:color, gr:condition, gr:datatypeProductOrServiceProperty}

From the preprocessing discussed earlier, we analyze the usage of each pivot concept using the CUT metrics.
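As an illustration of the two label metrics, the following minimal Python sketch counts, for one pivot concept, the entities that carry at least one formal label and those that carry at least one domain label. The triple representation (plain prefixed strings), the function name label_metrics, and the choice to count distinct entities are assumptions made for this sketch; they are not taken from the implementation used in the study.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as prefixed names

RDF_TYPE = "rdf:type"

# Label sets as listed in the text.
FORMAL_LABELS = {"foaf:name", "skos:prefLabel", "sioc:name", "dc:title"}
DOMAIN_LABELS = {
    "gr:BusinessEntity": {"gr:legalName"},
    "gr:Offering": {"gr:condition", "gr:category"},
    "gr:ProductOrService": {
        "gr:category",
        "gr:color",
        "gr:condition",
        "gr:datatypeProductOrServiceProperty",
    },
}


def label_metrics(triples: List[Triple], pivot: str) -> Tuple[int, int]:
    """Return (Entityfl, Entitydl) for the given pivot concept.

    Assumption: each metric counts distinct entities of the pivot type
    that use at least one property from the respective label set.
    """
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == pivot}
    with_formal = {
        s for s, p, _ in triples if s in instances and p in FORMAL_LABELS
    }
    with_domain = {
        s for s, p, _ in triples if s in instances and p in DOMAIN_LABELS[pivot]
    }
    return len(with_formal), len(with_domain)


# Hypothetical sample data, for illustration only.
sample = [
    ("ex:a", "rdf:type", "gr:BusinessEntity"),
    ("ex:a", "gr:legalName", "ACME"),
    ("ex:b", "rdf:type", "gr:BusinessEntity"),
    ("ex:b", "foaf:name", "B"),
    ("ex:b", "gr:legalName", "B Ltd"),
    ("ex:c", "rdf:type", "gr:Offering"),
    ("ex:c", "dc:title", "Special offer"),
]
entity_fl, entity_dl = label_metrics(sample, "gr:BusinessEntity")
print(entity_fl, entity_dl)
```

In this sketch, an entity that uses both kinds of label contributes to both counts; whether the reported figures treat such entities the same way is not stated in the text.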

5.4.1. gr:BusinessEntity analysis. In GRO, gr:BusinessEntity represents a business organization (or an individual) that intends to offer or seek products on the web. First, we look at the RDF usage and then discuss the available paths and labels provided with the entities of this concept. Table III provides the analysis results for the gr:BusinessEntity concept.

In our dataset, there are 789,440 entities in total, and of these, 54,542 are of type gr:BusinessEntity. This means that 6.9% of the entities in GRDS2 are of this type. From the Vocab set, we can see the co-usage of different vocabularies in the entity description. The list of object properties provides an approximation of the typed relationships of the entity and provides substantial evidence about the discoverable, related entities in the knowledge base. By looking at the object properties, it is easy to see that this pivot business entity is described with location address and contact-related details. In addition to relationships, attribute usage details all the attributes used to provide textual information about the entity. In RDF data, it is presumed that all the resources are identified with URIs that, when dereferenced, return human-readable information about the resource. Interestingly, in attribute usage (Attri(C)), we found the use of several attributes that are from schema.org††† and are not valid URIs. This also indicates the adoption and use of non-semantic schema in RDF data, which we believe is a good sign as far as the burgeoning of structured data on the web is concerned, although the semantic aspect is being ignored.‡‡‡

†††http://schema.org

‡‡‡On a side note, there has been community effort in mapping schema.org terms with the semantic version published at http://schema.rdfs.org/mappings.html.

The class usage (ClassUsage(C)), which lists the other classes of which the entity is a member, returns seven other classes. This tells us that one or more entities of the gr:BusinessEntity class in this dataset also have membership of seven other classes. This membership information reveals an intrinsic overlap between concepts that have several aspects in common but are not subject to the same interpretation across domains. To promote information interoperability on the web, the identification of related but different concepts in the knowledge base facilitates alignment between different concepts in the ontology mapping process. We believe that related concepts often maintain an elusive relationship, requiring more diverse mapping predicates to capture the natural linkages between disparate concepts instead of using a mapping predicate with strong semantics, that is, owl:equivalentClass [24].

In interlinking, we capture the information related to the linking of similar but disparate entities. This includes the link base and the equivalence links, indicating that different URIs refer to the same resource or entity. We find the use of two interlinking properties for the entities in the dataset, namely owl:sameAs and rdfs:seeAlso. The property rdfs:seeAlso provides very little information about the resource to which it links, but it is a standard Semantic Web method of linking hypertext to provide reference to additional resources or documents.
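The RDF usage components discussed for this pivot concept (instantiation, vocabularies, object properties, attribute usage, class usage, and interlinking) can be gathered in a single pass over the triples. The sketch below is one possible way to do this and is not the implementation used in the study; the four-element triple form (with a flag marking literal objects), the function name build_cut, and the decision to keep interlinking predicates out of the vocabulary set are all assumptions.

```python
from typing import Dict, List, Tuple

# (subject, predicate, object, object_is_literal); prefixed names as strings.
Quad = Tuple[str, str, str, bool]

RDF_TYPE = "rdf:type"
INTERLINKS = {"owl:sameAs", "rdfs:seeAlso"}


def prefix(term: str) -> str:
    """Return the vocabulary prefix of a prefixed name such as 'gr:offers'."""
    return term.split(":", 1)[0]


def build_cut(triples: List[Quad], pivot: str) -> Dict[str, object]:
    """Summarise how instances of the pivot concept are described."""
    instances = {s for s, p, o, _ in triples if p == RDF_TYPE and o == pivot}
    cut = {
        "instantiation": len(instances),
        "vocabs": set(),
        "object_properties": set(),
        "attribute_usage": set(),
        "class_usage": set(),
        "interlinking": set(),
    }
    for s, p, o, is_literal in triples:
        if s not in instances:
            continue
        if p == RDF_TYPE:
            cut["vocabs"].add(prefix(o))
            if o != pivot:
                cut["class_usage"].add(o)  # membership of other classes
        elif p in INTERLINKS:
            cut["interlinking"].add(p)
        else:
            cut["vocabs"].add(prefix(p))
            target = "attribute_usage" if is_literal else "object_properties"
            cut[target].add(p)
    return cut


# Hypothetical sample data, for illustration only.
sample = [
    ("ex:acme", "rdf:type", "gr:BusinessEntity", False),
    ("ex:acme", "rdf:type", "vCard:VCard", False),
    ("ex:acme", "gr:legalName", "ACME Ltd", True),
    ("ex:acme", "gr:offers", "ex:offer1", False),
    ("ex:acme", "owl:sameAs", "ex:acme2", False),
    ("ex:offer1", "rdf:type", "gr:Offering", False),
    ("ex:offer1", "gr:name", "Widget", True),
]
cut = build_cut(sample, "gr:BusinessEntity")
print(cut)
```

Separating literal-valued from resource-valued predicates by inspecting the object is a design choice of the sketch; an alternative would be to consult the ontology for owl:ObjectProperty and owl:DatatypeProperty declarations.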

The last component of CUT measures the use of label properties in the dataset. As mentioned earlier, we have the Entityfl and Entitydl metrics to measure the use of formal label predicates and the domain-ontology-specific label properties, respectively. Focusing on this pivot concept, 32% of entities have used label properties with the following values for these two metrics:

Entityfl = 1,703 (9% of entities have used formal labels)

Entitydl = 17,146 (91% of entities have used domain labels)

One of the most obvious and surprising findings is the dominance of the domain label predicates over the formal labels. Contrary to the previous findings in [4, 25] and the general presumption that formal labels are more frequently used, we have seen the dominance of domain-ontology-specific label properties in our experiment. This also signifies that data publishers prefer to provide specialized label properties to help consumers access less ambiguous contextual information that is useful for querying and for interface presentation.

5.4.2. gr:Offering analysis. gr:Offering is the concept that enables business entities to publish their offers on the web, for either selling or buying products.

Table IV. Resource Description Framework usage of gr:Offering.

Entity: gr:Offering
Instantiation: 61,330
Vocabs: gr, foaf, v, comm, media, rev, and yahoo
Object properties: gr:availableAtOrFrom, gr:hasBusinessFunction, gr:eligibleCustomerTypes, gr:acceptedPaymentMethods, gr:availableDeliveryMethods, gr:includesObject, gr:hasPriceSpecification, gr:hasWarrantyPromise, gr:includes, gr:hasManufacturer, gr:hasInventoryLevel, gr:hasBrand, foaf:page, foaf:depiction, foaf:thumbnail, yahoo:media/image, yahoo:product/specification, yahoo:product/manufacturer, v:url, v:photo, v:hasReview, media:depiction, media:sample, media:contains, and rev:hasReview
Attributes usage: gr:validFrom, gr:validThrough, gr:eligibleRegions, gr:hasStockKeepingUnit, gr:availabilityStarts, gr:hasEAN_UCC-13, gr:description, gr:name, gr:condition, gr:hasMPN, gr:BusinessEntity, gr:hasCurrency, rdfs:title, rdfs:comments, dc:description, dc:title, dc:contributor, dc:date, dc:type, dc:duration, dc:position, v:name, v:description, v:price, v:category, v:brand, ogp:image, ogp:type, ogp:site_name, ogp:title, and ogp:url
Class usage: v:Product, media:Album, and media:Recording (a)
Interlinking: rdfs:seeAlso and owl:sameAs

(a) We have found around 26 product types defined by http://www.productontology.org/.

Table V. Resource Description Framework usage of gr:ProductOrService.

Entity: gr:ProductOrService
Instantiation: 37,996
Vocabs: gr, foaf, yahoo, v, vso, eCl@ss, and pto
Object properties: gr:hasMakeAndModel, gr:hasInventoryLevel, gr:hasManufacturer, gr:description, gr:depth, gr:height, gr:weight, gr:width, vso:mileageFromOdometer, gr:hasBusinessFunction, gr:hasMakeOrModel, gr:hasBrand, gr:hasPriceSpecification, foaf:depiction, foaf:thumbnail, foaf:page, foaf:logo, rev:hasReview, v:hasReview, vso:bodyStyle, vso:engineDisplacement, vso:gearsTotal, vso:previousOwners, gr:name, vso:transmission, vso:fuelType, and vso:feature (a)
Attribute usage: gr:description, gr:hasStockKeepingUnit, gr:hasEAN_UCC-13, gr:name, gr:hasMPN, gr:condition, gr:category, vso:modelDate, vso:VIN, vso:color, vso:engineName, and vso:rentalUsage
Class usage: eCl@ss, v:Product, yahoo:Product, and vso:Automobile (b)
Interlinking: rdfs:seeAlso

(a) There are several in-house developed ontologies to describe product attributes.
(b) http://www.productontology.org has hundreds of classes that are used in the dataset for describing high-level product type/category.

Table IV presents the CUT for the gr:Offering pivot concept. In RDF usage, an interesting finding is the use of different but related vocabularies to semantically describe offering-related information. Three vocabularies that supplement offering information, namely media, rev, and comm, are included, but two vocabularies that are used with the gr:BusinessEntity concept, vCard and schema, are absent. In both object property usage (ObjectPro(C)) and attribute usage (Attri(C)), we see predicates from different vocabularies used to provide the offering description, similar to the previous concept. Another interesting finding is the use of product vocabularies to describe the products being offered: we can see the use of different concepts defined in product ontology as part of the class usage (ClassUsage(C)). Because the list is long, we have only provided the concepts used from the pro-vocabulary. The use of interlinking predicates is the same as in the previous pivot concept, and it can be assumed that these two predicates are consistent across all key concepts and entities. Next, we analyze the use of label properties by the entities of gr:Offering type. Of 61,330 entities, 11% used labeling properties with the following distribution:

Entityfl = 4,171 (62% of entities used formal labels)

Entitydl = 2,610 (38% of entities used domain labels)
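The percentages reported for gr:Offering can be reproduced with simple arithmetic. The sketch below assumes that the share of labelled entities is the sum of the two metric values divided by the number of instances, and that the formal and domain shares are taken relative to that sum; this is one reading that matches the reported 11%, 62%, and 38%, not a statement of how the figures were actually derived. The function name label_shares is hypothetical.

```python
from typing import Dict


def label_shares(total_entities: int, entity_fl: int, entity_dl: int) -> Dict[str, float]:
    """Return percentage shares under the assumption described above."""
    labelled = entity_fl + entity_dl
    return {
        "labelled_share": 100 * labelled / total_entities,
        "formal_share": 100 * entity_fl / labelled,
        "domain_share": 100 * entity_dl / labelled,
    }


# Values for gr:Offering as reported in the text.
shares = label_shares(total_entities=61_330, entity_fl=4_171, entity_dl=2_610)
rounded = {name: round(value) for name, value in shares.items()}
print(rounded)  # {'labelled_share': 11, 'formal_share': 62, 'domain_share': 38}
```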

5.4.3. gr:ProductOrService analysis. In GR, a lightweight description of the products being offered is given through gr:ProductOrService and three of its subclasses.

In Table V, we have the usage summary for the gr:ProductOrService concept. Roughly 38,000 entities in total are defined as ‘type of product’. Because product-related concepts are arranged in a taxonomical hierarchy in GRO to allow users to specify the exact nature of the product being offered, we have used the subsumption axiom to include all the instances belonging to the super concept. Vocabulary usage for product and offering is almost identical, and entities of both concepts use the same vocabularies to describe the instances. One important improvement in class usage, compared with our previous study [13], is that most new eCommerce data publishers now use product ontologies to describe their products. For example, in our dataset, more than 100 concepts of pto are used to specify the type of products being offered. In interlinking, we have seen the usage of the rdfs:seeAlso predicate; however, there is no usage instance of the owl:sameAs predicate. There are two possible reasons for the current absence of this predicate in product instances: first, product ontologies have only recently begun to emerge and do not yet offer rich product descriptions that cover the qualitative and quantitative properties of products; second, owl:sameAs interlinking is algorithmically complex and less effective, so it is preferably carried out through social engagement.§§§

§§§In a keynote speech at ISWC2011, Frank van Harmelen mentioned the role of social engagement being more effective than an algorithmic approach in interlinking entities.
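The subsumption step used for gr:ProductOrService, in which instances typed with a subclass are counted as instances of the super concept, can be sketched as a transitive closure over rdfs:subClassOf. The following Python fragment is a minimal illustration under that assumption; the function name and the sample class ex:Laptop are hypothetical, and the study may have relied on a reasoner instead.

```python
from collections import defaultdict
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def instances_with_subsumption(triples: List[Triple], concept: str) -> Set[str]:
    """Return all entities typed with the concept or any of its subclasses."""
    direct_subclasses = defaultdict(set)
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            direct_subclasses[o].add(s)

    # Depth-first walk down the class hierarchy, guarding against cycles.
    closure = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for sub in direct_subclasses[current]:
            if sub not in closure:
                closure.add(sub)
                stack.append(sub)

    return {s for s, p, o in triples if p == "rdf:type" and o in closure}


# Hypothetical sample data, for illustration only.
sample = [
    ("gr:SomeItems", "rdfs:subClassOf", "gr:ProductOrService"),
    ("ex:Laptop", "rdfs:subClassOf", "gr:SomeItems"),
    ("ex:p1", "rdf:type", "gr:ProductOrService"),
    ("ex:p2", "rdf:type", "gr:SomeItems"),
    ("ex:p3", "rdf:type", "ex:Laptop"),
    ("ex:o1", "rdf:type", "gr:Offering"),
]
products = instances_with_subsumption(sample, "gr:ProductOrService")
print(sorted(products))
```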


Pertaining to the use of label properties with product instances, the label metric values are as follows:

Entityfl = 30,379 (99.05% of entities are using formal labels)

Entitydl = 360 (0.95% of entities are using domain labels)

In the product pivot concept, 30,739 entities have labels attached to the instances, which means that 80% of entities offer textual descriptions to provide human-readable descriptions of products. Of these 80%, only 0.95% of the entities provide domain label properties and 99.05% formal labels, which is quite a different trend from the two aforementioned pivot concepts. As noted previously, GRO provides only high-level concepts to identify the product but recommends using product ontologies such as eCl@ss and pto to provide a semantic description of the products; therefore, we see negligible use of domain-ontology-specific labels.

5.5. Knowledge patterns (traversal path analysis)

In this section, we present the results of traversal paths and the path steps in the dataset. In Table VI, we provide the number of unique paths that exist for each pivot concept. To recap, in traversal paths, we calculated all the unique paths originating from the given pivot concept. This provides the data-level and schema-level patterns in the knowledge base. Because gr:BusinessEntity is considered a kind of root concept, although not in the literal sense, we see that it has the largest maximum traversal path length. Similarly, gr:ProductOrService, being the later concept in the ontological model, has the lowest maximum length. Interestingly, there is little deviation in the average path length: although gr:BusinessEntity has the largest maximum path length, the average path lengths of all the pivot concepts are similar. We believe that such insight into data and schema patterns, and the depths in triple-chaining patterns, helps in planning data management including storage, querying, and reasoning. To further understand the triple patterns in traversal paths, we list the dominant path steps with the frequency found in traversal paths in Table VII.
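The traversal path measures can be illustrated with a small sketch. It first lifts instance-level triples to schema-level path steps of the form (subject class, predicate, object class) and counts their frequency, then enumerates the paths that start at a pivot concept and summarises them. Several choices here are assumptions made for the sketch and are not taken from the study: path length is the number of steps, every prefix of a path counts as a path, and a class is not revisited within one path so that cycles terminate.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Step = Tuple[str, str, str]  # (subject class, predicate, object class)


def path_steps(triples: List[Triple]) -> Counter:
    """Count schema-level path steps derived from typed instance data."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[s].add(o)
    steps = Counter()
    for s, p, o in triples:
        if p == "rdf:type":
            continue
        for s_class in types.get(s, ()):
            for o_class in types.get(o, ()):
                steps[(s_class, p, o_class)] += 1
    return steps


def traversal_paths(steps: Counter, pivot: str) -> List[Tuple[Step, ...]]:
    """Enumerate all paths that originate from the pivot concept."""
    outgoing = defaultdict(list)
    for s_class, p, o_class in steps:
        outgoing[s_class].append((p, o_class))
    paths = []

    def walk(cls, path, seen):
        for p, o_class in outgoing.get(cls, ()):
            if o_class in seen:  # do not revisit a class within one path
                continue
            extended = path + [(cls, p, o_class)]
            paths.append(tuple(extended))
            walk(o_class, extended, seen | {o_class})

    walk(pivot, [], {pivot})
    return paths


def summarize(paths: List[Tuple[Step, ...]]) -> Dict[str, float]:
    lengths = [len(path) for path in paths]
    return {
        "unique_paths": len(paths),
        "max_length": max(lengths, default=0),
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Hypothetical sample data, for illustration only.
sample = [
    ("ex:be", "rdf:type", "gr:BusinessEntity"),
    ("ex:o1", "rdf:type", "gr:Offering"),
    ("ex:ps", "rdf:type", "gr:PriceSpecification"),
    ("ex:loc", "rdf:type", "gr:Location"),
    ("ex:be", "gr:offers", "ex:o1"),
    ("ex:o1", "gr:hasPriceSpecification", "ex:ps"),
    ("ex:be", "gr:hasPOS", "ex:loc"),
]
steps = path_steps(sample)
summary = summarize(traversal_paths(steps, "gr:BusinessEntity"))
print(summary)
```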

This provides a snapshot of the terminological knowledge and the schema-level triples available in the dataset, which, with the traversal path information, provides a summary of the knowledge base and helps the generation of the SPARQL query template for accessing domain-related knowledge from any dataset. However, note that while this provides a complete set of terminologies used in the dataset, not necessarily all entities use these terms; therefore, certain terms need to be optional in the automatic query generation process. To support more effective automatic query generation based on the preceding summary, the attachment of frequency to each term to give an estimation of distribution can be considered.
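The idea of making infrequently used terms optional in a generated query can be sketched as follows. The function builds a SPARQL query string for a pivot concept and wraps a triple pattern in OPTIONAL when the estimated share of instances using the predicate falls below a threshold. The function name, the threshold value of 0.5, and the use of path step frequency divided by instance count as a usage estimate are all assumptions of this sketch; the frequencies are taken from Table VII and the instance count from Table IV.

```python
from typing import Dict

GR_PREFIX = "PREFIX gr: <http://purl.org/goodrelations/v1#>"


def sparql_template(
    pivot: str,
    property_counts: Dict[str, int],
    instance_count: int,
    required_share: float = 0.5,  # hypothetical threshold
) -> str:
    """Build a query in which rarely used predicates become OPTIONAL."""
    lines = [GR_PREFIX, "SELECT * WHERE {", f"  ?s a {pivot} ."]
    ranked = sorted(property_counts.items(), key=lambda kv: kv[1], reverse=True)
    for index, (prop, count) in enumerate(ranked):
        pattern = f"?s {prop} ?v{index} ."
        if count / instance_count >= required_share:
            lines.append(f"  {pattern}")
        else:
            lines.append(f"  OPTIONAL {{ {pattern} }}")
    lines.append("}")
    return "\n".join(lines)


query = sparql_template(
    "gr:Offering",
    {
        "gr:hasBusinessFunction": 51_928,
        "gr:hasPriceSpecification": 34_659,
        "gr:hasWarrantyPromise": 4_090,
    },
    instance_count=61_330,
)
print(query)
```

A required pattern excludes every entity that lacks the predicate, so a lower threshold yields fewer but more complete results, whereas marking more patterns as OPTIONAL keeps sparsely described entities in the result set.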

Table VI. Traversal path of all three pivot concepts.

                          gr:BusinessEntity   gr:Offering   gr:ProductOrService
Number of unique paths    12,245              14,871        2,453
Maximum path length       6                   4             3
Average path length       3.12                2.78          2.13

Table VII. Path step frequency in traversal path.

Path step    Frequency
gr:Offering gr:hasBusinessFunction gr:BusinessFunction    51,928
gr:Offering gr:hasPriceSpecification gr:PriceSpecification    34,659
gr:Offering gr:includesObject gr:TypeAndQuantityNode    29,038
gr:Offering gr:availableAtOrFrom gr:Location    24,914
gr:Offering gr:hasManufacturer gr:BusinessEntity    19,430
gr:Offering gr:eligibleCustomerTypes gr:BusinessEntityType    15,906
gr:SomeItems gr:hasMakeAndModel gr:ProductOrServiceModel    7,168
gr:Offering gr:availableDeliveryMethods gr:DeliveryMethod    5,462
gr:Offering gr:hasWarrantyPromise gr:WarrantyPromise    4,090
gr:BusinessEntity gr:offers gr:Offering    2,398
gr:BusinessEntity vCard:adr vCard:Address    2,385
gr:OpeningHoursSpecification gr:hasOpeningHoursDayOfWeek gr:DayOfWeek    1,953
gr:Offering gr:includes gr:ProductOrService    1,814
gr:Location gr:hasOpeningHoursSpecification gr:OpeningHoursSpecification    1,025
gr:BusinessEntity gr:hasPOS gr:Location    598
gr:Offering media:contains v:Product    514
gr:BusinessFunction gr:hasBrand gr:Brand    265
gr:Offering media:contains media:Recording    218
gr:BusinessEntity vCard:url owl:Ontology    182
gr:WarrantyPromise gr:hasWarrantyScope gr:WarrantyScope    19
gr:DayOfWeek gr:hasNext gr:DayOfWeek    7
gr:DayOfWeek gr:hasPrevious gr:DayOfWeek    7
gr:Offering rev:hasReview rev:Review    4

6. GENERAL OBSERVATIONS AND UNDERSTANDING THE USAGE RESULTS

As depicted in Figure 1, an empirical analysis of the use of domain ontologies helps obtain feedback based on the data instantiation. Such feedback provides evidence-based insight into uptake and adoption at an early stage of the ontology evolution, which is essential for the inclusion of new knowledge and is one of the most challenging problems in current Semantic Web research [26]. The results obtained with the metrics reveal that only a small part of the ontology is widely used and that the majority of the concepts are rarely or never instantiated. Our finding aligns with the observation of Noy et al. [27] that the perspective from which an ontology is viewed and used needs to change. We also observe that the adoption of the domain ontology on the web (particularly for semantic annotation) is presently driven by the major search engines, which are exploiting the benefits of explicit semantics (e.g., Google Knowledge Graph [28]). This can be seen as a weakness, because ontology usage stands and falls with the search engines’ support. There are a few other observations, but these cannot be generalized because of their respective domain specificity; for example, in GRO, we observed that GR data (eCommerce data) on the web are presently limited to offering-related information and lack detailed product modeling information.

From the outset, we observed that as the Semantic Web expands, ontological data are being distributed over a large network of data sources on the web. To query information rather than search for it, the availability of empirical analysis of terminological knowledge usage, such as the use of concepts or the summarized view of property usage, will help to evaluate queries over distributed data sources. Also, real-world Semantic Web data are available for evaluating ontologies and for establishing semantic alignment among related but different ontologies.

The following discussion shows how the obtained results assist in addressing the common requirements of ontology users such as application developers, with reference to the case scenario presented in Section 1.1.

Case 1
The application developer needs to know what terminological knowledge is available for application consumption.

Terminological knowledge, which refers to the use of terms, or vocabularies, defined by ontologies, is important because it provides a representation and description of the entities involved in a given domain. Application developers using this information can prepare generic queries to access the data or prepare the interface on the basis of the conceptual elements. The CUT, which captures all the terminological knowledge attached to the concept, provides a unified source of information to the developer or other ontology users, enabling them to prepare the data access layer. For example, Table III shows how the gr:BusinessEntity concept is generally used and provides specific details on the number of instances of this concept (i.e., 54,542), what other entities it is connected to, and what relationships it uses. As shown in Table III, the relationships or object properties vCard:adr, vCard:email, vCard:url, yahoo:image, gr:offers, gr:hasPOS, foaf:logo, foaf:homepage, foaf:maker, foaf:page, gr:hasOpeningHourSpecification, and foaf:depiction are used to provide relevant details for the instances of the concept.

Case 2
The data publisher needs to know what common data and knowledge patterns are available.

From a data management and processing point of view, it is important to know the types of patterns being followed in the dataset or that are in general usage. Information regarding patterns not only helps in generating prototypical queries but also assists in strategizing the index for efficient information retrieval and storage. Traversal paths and their frequency identify the presence of different knowledge patterns and their frequency in the dataset. For example, in Table VII, it can be seen that the knowledge pattern that dominates the whole dataset (indicating that the majority of data publishers have published this piece of knowledge) is (gr:Offering–gr:hasBusinessFunction–gr:BusinessFunction), which has 51,928 occurrences in the dataset, whereas at the opposite end of the spectrum, the pattern (gr:Offering–rev:hasReview–rev:Review) has the lowest occurrence of four.
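As a small illustration of how pattern frequencies could inform an indexing decision, the sketch below ranks a subset of the path steps from Table VII and selects those whose frequency reaches a cut-off. The subset, the function names, and the cut-off of 10,000 occurrences are chosen for illustration only and are not part of the framework described in the paper.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str, str]

# Subset of Table VII, used here only as example input.
PATH_STEP_FREQUENCY: Dict[Step, int] = {
    ("gr:Offering", "gr:hasBusinessFunction", "gr:BusinessFunction"): 51_928,
    ("gr:Offering", "gr:hasPriceSpecification", "gr:PriceSpecification"): 34_659,
    ("gr:BusinessEntity", "gr:offers", "gr:Offering"): 2_398,
    ("gr:Offering", "rev:hasReview", "rev:Review"): 4,
}


def rank_patterns(frequency: Dict[Step, int]) -> List[Tuple[Step, int]]:
    """Order path steps from most to least frequent."""
    return sorted(frequency.items(), key=lambda item: item[1], reverse=True)


def index_candidates(frequency: Dict[Step, int], min_count: int) -> List[Step]:
    """Return the path steps frequent enough to justify a dedicated index."""
    return [step for step, count in rank_patterns(frequency) if count >= min_count]


ranked = rank_patterns(PATH_STEP_FREQUENCY)
dominant, rarest = ranked[0], ranked[-1]
candidates = index_candidates(PATH_STEP_FREQUENCY, min_count=10_000)
print(dominant, rarest, candidates)
```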

Case 3
The ontology developer needs to know how data publishers are describing a company (or business entity) and what attributes are being used.

It is very important for any business to provide a semantic description of itself to make its products or services discoverable by agents and clients. The best approach is to understand how such information is currently being published by others and to know what the prevailing dominant structure is on the web. The dominant structure provides a template that can then be used for publishing Semantic Web data. The framework provides CUT to capture this structure and assists the data publisher in the publishing process. Table III provides the prevalent semantic description of gr:BusinessEntity, which conceptualizes the concept of a business and can be used by data publishers to describe their company. The ontology developer will be interested to know what data publishers are using to describe the given concept (i.e., business entity). Attribute usage in Table III (row 5) provides a list of data-type properties being used by others, which assists data publishers in knowing what attributes and which terms are being used to describe a company. Specific to the case study considered in this paper, a few of the attributes used are gr:legalName, vCard:fax, vCard:adr, vCard:Tel, schema:postalCode, schema:addressLocality, and schema:streetAddress (see Table III for a complete list).

7. RELATED WORK

As discussed in Section 1 and in [17], the focus of the Semantic Web research community since its inception has shifted from knowledge to data. In the early years of Semantic Web research, knowledge was the primary research object, and most of the literature of that time centered on formalizing knowledge in the realm of the Semantic Web. In later years, from approximately 2006 onwards, following the appearance of the famous linked data principles [29], the research objective shifted from knowledge to data, and as a result, research began to study data from other angles such as quality, structure, semanticity, and accessibility. This polarized effort has addressed enough, if not all, of the issues that were impeding the Semantic Web from gaining traction. Recently, we have seen the use of ontologies on the web to semantically annotate real-world data, a result of Semantic Web technology adoption by commercial enterprises such as Google, Facebook, BBC, Yahoo, and BestBuy, to name a few. The explosion of structured data and the use of ontologies to semantically describe information are potential areas for conducting empirical analysis on semantic data on the web. To cover the breadth of the relevant literature, the discussion in this section is grouped into three focus categories: ontology, web of data, and RDF with ontology.


7.1. Ontology focused

Ontology evaluation, often described as a subarea of ontology engineering and ontology retrieval [30], covers research work that measures the quality of a developed ontology, with or without consideration of (instance) data. For example, in [31], the authors presented a framework and a set of measurements to evaluate the richness, connectivity, fullness, and cohesiveness of a given ontology. The metrics and measures used in their study are interesting, but their usefulness is not widely known. The proposed metrics are evaluated on a very small dataset, which by no means reflects the actual instantiation. Another framework is presented in [32], which implements three types of evaluations covering functionality, usability, and structural aspects of ontology evaluation. The functional evaluation, which determines how fit the ontology is in terms of its intended purpose, does not provide a pragmatic view of the ontology’s fitness in the absence of instantiation on the web. Similarly, a structural evaluation of only the ontological model does not provide rich information if conducted without considering the instance data. The usability aspect of ontology evaluation, which considers the metadata and the use of annotation vocabulary, in our view, is quite helpful, because it enables the autodiscovery of terminological knowledge by applications and tools.

In [33], the authors proposed the OntoClean methodology to evaluate and validate an ontology’s taxonomical relationships by employing formal notions from philosophy such as essence, identity, and unity. Four meta-properties (rigidity, identity, unity, and dependence) and operators (+, -, ~) that symbolically specify the characteristics of ontology components such as classes and relationships were used to validate the assumptions that influenced the ontological model. Similar to the aforementioned approaches, this study made use of examples often used in presentation and teaching materials, which, in our view, cannot sufficiently represent ontologies being used on the web.

Real instance data on the web are of a diverse nature and, being compounded with quality issues, cannot be effectively represented by a manually generated dataset; therefore, the effectiveness of the analysis conducted by these frameworks using test data is skewed and cannot be substantiated.

7.2. Web of data focused

In this category, we cover the literature focusing on evaluating the nature, quality, patterns, semanticity, and statistics of semantic (RDF) data published in response to the LOD project initiative [29]. In [34], the authors conducted a detailed study on the quality and state of the published RDF data on the Semantic Web. Linked data principles were used to measure the noise and inconsistency in a dataset, and reasoning was performed. In highlighting the issues and findings, the researchers provided guidelines for both data publishers and data consumers to assist in generating and consuming high-quality semantic data. Although the experiment was performed on instance data collected from the web and provided details on inconsistency and ontology hijacking in general, no particular ontology was considered in the data analysis.

The generic instance data evaluation process [35] evaluates the instance data in knowledge management systems. Wine ontology is used with test instance data to discuss the potential issues found in instance data. Findings are categorized into logical inconsistencies, syntax issues, and detailed discussion around hypothetical issues. The study is of a generic nature, and the instance data are evaluated using an ontology that is primarily developed for learning purposes and does not reflect actual usage or the state of instance data on the Semantic Web. The two studies discussed in this category look at the instance data from a quality perspective and do not offer any insight into domain-focused ontology usage and the availability of schema-level information on the web.

7.3. Resource Description Framework data with ontology-focused analysis

There is very little evidence in the literature of work that focuses on cases where real-world instance data (RDF in our case) are used to analyze the use of domain ontologies and understand the usage patterns of semantic data. One study in this area is reported in [36]. The authors analyzed social and structural relationships on the Semantic Web by examining the FOAF vocabulary. The study was performed on approximately 1.5 million FOAF documents to analyze the instance data on the web and their usefulness in understanding social structures and networks. The use of different namespaces, concepts, and properties was discussed to provide a perspective on various FOAF implementations. This research provides only limited analysis because the prime focus was on social network-related instance data.

8. CONCLUSION AND FUTURE WORK

In this paper, we presented domain ontology usage analysis in an RDF dataset published on the web, particularly in the eCommerce application area. We analyzed the use of different vocabularies and the co-usability of ontologies to describe entity description and presented the schema-level graph, which depicts vocabulary linking in the dataset. We presented the CUT, which entrenches the key informational component to provide a snapshot of the relevant knowledge about the entity in question. To understand the triple patterns chained together to represent the knowledge and instance data, the traversal path is measured to observe the dominant path steps in the dataset. This framework for measuring ontology usage offers a pragmatic step toward analyzing the current state of Semantic Web technology in terms of its adoption and usage in real-world settings by commercial entities on the web. This paper also constitutes one of the first attempts to conduct an empirical analysis to understand domain-focused ontology usage, the characteristics of semantically annotated structured data, and data patterns on the web.

In our future work, we intend to progress in two directions. First, we intend to perform a more detailed empirical analysis on an expanded dataset. Although the dataset used in this research, an expanded version of the previous dataset [13], comprises 22.3 million triples collected from 211 data sources, this amount of data is still orders of magnitude smaller than the web of data. Second, we intend to implement the framework to automate domain ontology usage analysis and conceptual analysis results for the following: (i) the efficient utilization of knowledge management; and (ii) the evaluation of usage analysis in diverse application scenarios (e.g., ontology-based matchmaking service [37], automatic categorization of web services [38], and ontology update in the semantic grid environment [39]) to measure its usefulness.

APPENDIX A

Table A1. List of vocabularies.

Prefix    Vocabulary Uniform Resource Identifier

Open web ontologies (namespaces)
Cc        http://creativecommons.org/ns, http://web.resource.org/cc/license
Og        http://ogp.me/ns, http://opengraphprotocol.org/schema/
Com       http://purl.org/commerce
Coo       http://purl.org/coo/ns
Dc        http://purl.org/dc/
Gr        http://purl.org/goodrelations/v1
Media     http://purl.org/media
Scovo     http://purl.org/NET/scovo
Rev       http://purl.org/stuff/rev
Vann      http://purl.org/vocab/vann/
Vso       http://purl.org/vso/ns
V         http://rdf.data-vocabulary.org/
Void      http://rdfs.org/ns/void
Sioc      http://rdfs.org/sioc/ns
Frbf      http://vocab.org/frbr/core
eCl@ss    http://www.ebusiness-unibw.org/ontologies/eclass/5.1.4/
Pto       http://www.productontology.org/id/
vCard     http://www.w3.org/2001/vcard-rdf/3.0, http://www.w3.org/2006/vcard/ns
Geo       http://www.w3.org/2003/01/geo/wgs84_pos
foaf      http://xmlns.com/foaf/0.1/
skos      http://www.w3.org/2004/02/skos/core
          http://www.w3.org/2003/06/sw-vocab-status/nsmoreinfo
          http://vocab.sindice.net/date
          http://www.facebook.com/2008/fbmladmins
          http://search.yahoo.com/searchmonkey/commerce/Business

W3C-based vocabularies (namespaces)
rdfs      http://www.w3.org/2000/01/rdf-schema
rdfa      http://www.w3.org/ns/rdfat
grddl     http://www.w3.org/2003/g/data-view
wdrs      http://www.w3.org/2007/05/powder-s
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns
xhtml     http://www.w3.org/1999/xhtml/vocab
owl       http://www.w3.org/2002/07/owl

In-house built ontologies
          http://herbaman.com.ar/Products.html
          http://lokool.com/extendedgoodrelations.owl
          http://www.kica-jugendstil.com/semanticweb.rdf
          http://www.logicpass.com/semanticweb.owl
          http://www.openlinksw.com/schemas/DAV
          http://www.acigroup.co.uk/semanticweb.rdf
          http://www.buntegeschenke.de/semanticweb.rdf
          http://www.svanvit.se/sv/kvinna/shopkvinna
          http://www.symbolontarot.nl/de-winkel-met-symbolon-artikelen.html
          http://data.openlinksw.com/oplweb
          http://olutools.com/shop.html
          http://www.wifo-ravensburg.de/rdf/semanticweb.rdf

REFERENCES

1. Hausenblas M, Halb W, Raimond Y, Heath T. What is the size of the Semantic Web? In Proceedings of the International Conference on Semantic Systems (ISemantics), Universal Computer Science, Graz, Austria, 2008; 6–16.

2. Noy NF, McGuinness DL. Ontology development 101: a guide to creating your first ontology. Technical Report, Stanford Knowledge Systems Laboratory and Stanford Medical Informatics, 2001.

3. Bizer C, Jentzsch A, Cyganiak R. State of the Linked Open Data (LOD) cloud. Technical Report 5 April 2011, March 2011. http://www4.wiwiss.fu-berlin.de/lodcloud/state/.

4. Ell B, Vrandecic D, Simperl EPB. Labels in the web of data. In International Semantic Web Conference (1), Lecture Notes in Computer Science, Vol. 7031, Aroyo L, Welty C, Alani H, Taylor J, Bernstein A, Kagal L, Noy NF, Blomqvist E (eds). Springer Berlin Heidelberg: Bonn, Germany, 2011; 162–176.

5. Hepp M. Possible ontologies: how reality constrains the development of relevant ontologies. IEEE Internet Computing 2007; 11(1):90–96.

6. Ding Y, Fensel D. Ontology library systems: the key to successful ontology re-use. Proceedings of the 1st International Semantic Web Working Symposium (SWWS), Stanford, California, USA, 2001; 93–112.

7. d’Aquin M, Lewen H. Cupboard – a place to expose your ontologies to applications and the community. In Proceedings of the 6th European Semantic Web Conference (ESWC) on The Semantic Web: Research and Applications, Vol. 5554. Springer Berlin/Heidelberg: Heraklion, Crete, Greece, 2009; 913–918.

8. Hogan A, Harth A, Polleres A. Scalable authoritative OWL reasoning for the web. International Journal on Semantic Web and Information Systems 2009; 5(2):49–90.

9. Hayes P. RDF Semantics. Technical Report 2, 2004. http://www.w3.org/TR/rdf-mt/.

10. Nikolov A, Uren V, Motta E. Data linking: capturing and utilising implicit schema-level relations. In Proceedings of the Linked Data on the Web (LDOW 2010) at 19th International World Wide Web Conference (WWW 2010), Vol. 628, CEUR Workshop Proceedings. CEUR-WS.org: Raleigh, USA, 2010; 1–11.

11. Gomez-Perez A, Corcho O. Ontology languages for the Semantic Web. IEEE Intelligent Systems 2002; 17(1):54–60. DOI: 10.1109/5254.988453.

12. Dodds L, Davis I. Linked data patterns, 2010. http://patterns.dataincubator.org/book/.

13. Ashraf J, Cyganiak R, O’Riain S, Hadzic M. Open eBusiness ontology usage: investigating community implementation of GoodRelations. In Proceedings of Linked Data on the Web Workshop (LDOW) at WWW2011, Vol. 813, CEUR Workshop Proceedings. CEUR-WS.org: Hyderabad, India, 2011; 1–11.

14. Hepp M. GoodRelations: an ontology for describing products and services offers on the web. In Proceedings of the 16th International Conference on Knowledge Engineering: Practice and Patterns (EKAW), Vol. 5268, Lecture Notes in Computer Science. Springer Berlin/Heidelberg: Sicily, Italy, 2008; 329–346.


15. Antoniou G, van Harmelen F. A Semantic Web Primer. MIT Press, 2004.

16. ter Horst HJ. Completeness, decidability and complexity of entailment for RDF schema and a semantic extension involving the OWL vocabulary. Web Semantics: Science, Services and Agents on the World Wide Web 2005; 3(2–3):79–115. Selected Papers from the International Semantic Web Conference, 2004 – ISWC, 2004.

17. Hitzler P, van Harmelen F. A Reasonable Semantic Web. Semantic Web Journal 2010; 1(1–2):39–44.

18. Jain P, Hitzler P, Yeh P, Verma K, Sheth A. Linked data is merely more data. Proceedings of the AAAI Spring Symposium, Linked AI: Linked Data Meets Artificial Intelligence, AAAI: Menlo Park, CA, USA, 2010; 82–86.

19. Dong H, Hussain FK. SOF: a semi-supervised ontology-learning-based focused crawler, 2012. http://dx.doi.org/10.1002/cpe.2980.

20. Tvarožek M. Exploratory search in the adaptive social Semantic Web. Information Sciences and Technologies Bulletin of the ACM Slovakia 2011; 3(1):42–51.

21. Clauset A, Shalizi CR, Newman MEJ. Power-law distributions in empirical data. SIAM Review 2009; 51(4):661–703.

22. Zhang X, Li H, Qu Y. Finding important vocabulary within ontology. In ASWC, Lecture Notes in Computer Science, Vol. 4185, Mizoguchi R, Shi Z, Giunchiglia F (eds). Springer Berlin Heidelberg: Beijing, China, 2006; 106–112.

23. McBride B. Jena: a Semantic Web toolkit. IEEE Internet Computing 2002; 6(6):55–59.

24. Bergman M. Making connections real (web page), 2011. http://www.mkbergman.com/941/making-connections-real/ [last access: 14/11/2012].

25. Manaf NAA, Bechhofer S, Stevens R. A survey of identifiers and labels in OWL ontologies. In OWLED CEUR Workshop Proceedings, Vol. 614, Sirin E, Clark K (eds). CEUR-WS.org: San Francisco, California, USA, 2010.

26. Flouris G, Plexousakis D, Antoniou G. Evolving ontology evolution. SOFSEM 2006: Theory and Practice of Computer Science 2006:14–29.

27. Noy NF, Klein M. Ontology evolution: not the same as schema evolution. Knowledge and Information Systems 2004; 6(4):428–440. http://www.cs.vu.nl/mcaklein/papers/NoyKlein.pdf.

28. Thomas Steiner SM. SEKI@home, or crowd sourcing an open knowledge graph API. In Proceedings of the 1st International Workshop on Knowledge Extraction & Consolidation from Social Media in conjunction with the 11th International Semantic Web Conference (ISWC 2012), Vol. 895, CEUR Workshop Proceedings. CEUR-WS.org: Boston, USA, 2012; 7–12.

29. Bizer C, Heath T, Berners-Lee T. Linked data – the story so far. International Journal on Semantic Web and Information Systems 2009; 5(3):1–22.

30. Ungrangsi R, Anutariya C, Wuwongse V. Sqore: an ontology retrieval framework for the next generation web. Concurrency and Computation: Practice and Experience 2009; 21(5):651–671.

31. Tartir S, Arpinar IB, Moore M, Sheth AP, Aleman-Meza B. OntoQA: metric-based ontology quality analysis. In Proceedings of IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources, Vol. 9. IEEE Computer Society Press: Houston, Texas, 2005; 45–53.

32. Gangemi A, Catenacci C, Ciaramita M, Lehmann J. A theoretical framework for ontology evaluation and validation. In Proceedings of 2nd Italian Semantic Web Workshop Semantic Web Application and Perspectives (SWAP), CEUR Workshop Proceedings: Trento, Italy, 2005; 1–16.

33. Guarino N, Welty C. An Overview of OntoClean, Handbook on Ontologies. Springer: Germany, 2004; 151–159.

34. Hogan A, Harth A, Passant A, Decker S, Polleres A. Weaving the pedantic web. In Proceedings of Linked Data on the Web Workshop (LDOW) at WWW2010, Vol. 628, CEUR Workshop Proceedings. CEUR-WS.org: Raleigh, USA, 2010; 1–10.

35. Tao J, Ding L, McGuinness DL. Instance data evaluation for Semantic Web-based knowledge management systems. In Proceedings of the 42nd Hawaii International Conference on Systems Science (HICSS). IEEE Computer Society: Big Island, HI, USA, 2009; 1–10.

36. Ding L, Zhou L, Finin T, Joshi A. How the semantic web is being used: an analysis of FOAF documents. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences, Vol. 4. IEEE Computer Society: Washington, DC, USA, 2005; 113–120.

37. Dong H, Hussain FK, Chang E. Semantic web service matchmakers: state of the art and challenges. Concurrency and Computation: Practice and Experience 2013; 25(7):899–1012. DOI: 10.1002/cpe.2886. http://dx.doi.org/10.1002/cpe.2886.

38. Flahive A, Taniar D, Rahayu W, Apduhan BO. A methodology for ontology update in the semantic grid environment. Concurrency and Computation: Practice and Experience 2012. DOI: 10.1002/cpe.2841. http://dx.doi.org/10.1002/cpe.2841.

39. Kehagias DD, Giannoutakis KM, Gravvanis GA, Tzovaras D. An ontology-based mechanism for automatic categorization of web services. Concurrency and Computation: Practice and Experience 2012; 24(3):214–236. DOI: 10.1002/cpe.1818. http://dx.doi.org/10.1002/cpe.1818.
