
J Intell Manuf
DOI 10.1007/s10845-013-0856-5

Facilitating knowledge sharing and reuse in building and construction domain: an ontology-based approach

Ruben Costa · Celson Lima · João Sarraipa · Ricardo Jardim-Gonçalves

Received: 23 July 2013 / Accepted: 13 December 2013
© Springer Science+Business Media New York 2013

Abstract This paper brings a contribution focused on collaborative engineering projects where knowledge plays a key role in the process. Collaboration is the arena, engineering projects are the target, and knowledge is the currency used to provide harmony in the arena, since it can potentially support innovation and, hence, a successful collaboration. The building and construction domain is challenged with significant problems for exchanging, sharing and integrating information between actors. For example, semantic gaps or lack of meaning definition at the conceptual and technical levels are problems fundamentally created through the employment of representations to map the 'world' into models in an endeavour to anticipate different actors' views, vocabulary, and objectives. One of the primary research challenges addressed in this work is the process of formalization and representation of document content, where most existing approaches are limited in their capability and only take into account the explicit, word-based information in the document. The research described in this paper explores how traditional knowledge representations can be enriched by the incorporation of implicit information derived from the complex relationships (the Semantic Associations) modelled by domain ontologies, combined with the information presented in documents, thereby providing a baseline for facilitating knowledge interpretation and sharing between humans and machines. The paper introduces a novel conceptual framework for the representation of knowledge sources, where each knowledge source is semantically represented (within its domain of use) by a Semantic Vector. This work contributes to the enrichment of Semantic Vectors, using the classical vector space model approach extended with ontological support, employing ontology concepts and their relations in the enrichment process. The test bed for the assessment of the approach is the Building and Construction industry, using an appropriate B&C domain Ontology. Preliminary results were collected using a clustering algorithm for document classification, which indicate that the proposed approach does improve the precision and recall of classifications. Future work and open issues are also discussed.

R. Costa (B) · J. Sarraipa · R. Jardim-Gonçalves
Centre of Technology and Systems, UNINOVA, Caparica, Portugal
e-mail: [email protected]

J. Sarraipa
e-mail: [email protected]

R. Jardim-Gonçalves
e-mail: [email protected]

C. Lima
Federal University of Western Pará UFOPA, Santarém, Brazil
e-mail: [email protected]

Keywords Knowledge sharing · Semantic interoperability · Ontology engineering · Unsupervised document classification · Vector space models

Introduction

Over the last two decades, the adoption of the Internet as the primary communication channel for business purposes brought new requirements, especially for collaboration centred on engineering projects. By their very nature, such projects normally demand a high level of innovation since they tackle complex challenges and issues. On the one hand, innovation often comes from the combination of knowledge (existing, recycled, and some new); on the other hand, it can depend on individuals (or groups) with the appropriate knowledge to make the required breakthrough.


Engineering companies are project oriented, and ensuring successful projects is their way to keep market share and to exploit new opportunities. Engineering projects strongly rely on innovative factors (processes and ideas) in order to be successful. From the organisation's point of view, knowledge goes through a spiral cycle, as presented by Nonaka and Takeuchi (1995). It is created and enhanced in a continuous cycle of conversion, sharing, combination, and dissemination, where all the aspects and contexts of a given organisation are considered, such as individuals, communities, and projects.

Knowledge is considered the key asset of modern organisations, and industry and academia have been working to provide the appropriate support to leverage this asset (Firestone and McElroy 2003). Some examples of this are: the extensive work on knowledge models and knowledge management (KM) tools, the rise of so-called knowledge engineering, the many projects around 'controlled vocabularies' (i.e., ontologies, taxonomies, etc.), and the development by academia of knowledge-centred courses (graduate, master, doctoral).

The quest for innovation to be used as a "wild card" for economic development, growth and competitiveness affects not only organisations, but also countries. This demand for innovative processes and ideas, and the pursuit of more knowledge, inevitably raise issues regarding the adoption and use of KM models and tools within organisations.

The KM theme and, more specifically, how knowledge can be represented gained new impetus with the advent of the computer age. In particular, with the creation of the World Wide Web, new forms of knowledge representation were needed in order to transmit data from donor to recipient in common data formats, and to help humans retrieve the appropriate answers to their questions in an easily understandable manner.

Artificial Intelligence (AI) based research, which abstracted knowledge into a clear set of parameters and used fairly static/rigid rules, had a rather limited "context" (the domain of applicability) and was poor at "human communication." Further, such systems lacked interoperability, because most AI tools focused on solving a specific problem and faced challenges in handling cross-context information flows, imputation, and interpretation, i.e., how to convert an actual situation into the parameters used by the AI tool (Dascal 1989, 1992).

With the evolution of the Semantic Web, knowledge representation techniques came into the spotlight, aiming at bringing human understanding of the meaning of data to the world of machines. Such techniques create representations of knowledge sources (KS), whether they are web pages or documents (Figueiras et al. 2012).

Like many information retrieval (IR) tasks, knowledge representation and classification techniques depend on using content-independent metadata (e.g. author, creation date) and/or content-dependent metadata (e.g. words in the document). However, such approaches tend to be inherently limited by the information that is explicit in the documents, which introduces a further problem. For instance, where words like 'architect' and 'design' do not co-occur frequently, statistical techniques will fail to make any correlation between them (Nagarajan et al. 2007).

Furthermore, existing IR techniques are based upon indexing keywords extracted from documents and then creating a term vector. Unfortunately, keywords or index terms alone often do not adequately capture the document contents, resulting in poor retrieval and indexing performance. Keyword indexing is still widely used in commercial systems because it is by far the most viable way to process large amounts of text, despite the high computational power and cost required to update and maintain the associated indexes.

Such challenges raise the following question: how can we intuitively alter and add content to a document's term vector using semantic background knowledge available in domain ontologies, and thereby provide classifiers with more information than is expressed directly in the document?

In recent decades, the use of ontologies in information systems has become more and more popular in various research fields, such as web technologies, database integration, multi-agent systems, and Natural Language Processing. This work focuses on how ontologies can be used to improve semantic interoperability between heterogeneous information systems. We understand interoperability as the ability of two or more systems or components to exchange information and to use the information that has been exchanged (IEEE 1990).

An ontology models information and knowledge in the form of concept hierarchies (taxonomies), interrelationships between concepts, and axioms (Noy and Hafner 1997; Noy and McGuinness 2002). Axioms, along with the hierarchical structure and relationships, define the semantics, i.e. the meaning of the concepts. Ontologies are thus the foundation of content-based information access and semantic interoperability over the web.
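To make the preceding description concrete, the structure an ontology adds on top of a flat vocabulary can be sketched as follows. This is a minimal illustration; the `Concept` class, the concept names (`Process`, `Approval`, `DesignProcess`) and the `requires` relation are our own illustrative assumptions, not identifiers from the domain ontology discussed later.

```python
# Minimal illustrative sketch: an ontology fragment as a concept taxonomy
# (is-a links) plus named cross-hierarchy (associative) relations.
class Concept:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent        # taxonomic (is-a) link
        self.relations = []         # ontological (associative) links

    def relate(self, relation, other):
        self.relations.append((relation, other))

# Hypothetical concepts, for illustration only
process = Concept("Process")
approval = Concept("Approval")
design = Concept("DesignProcess", parent=process)
design.relate("requires", approval)   # associative relation across hierarchies

def ancestors(c):
    """Walk the taxonomy upward from a concept."""
    while c.parent is not None:
        c = c.parent
        yield c

print([a.name for a in ancestors(design)])   # ['Process']
```

Axioms (e.g. constraints over such relations) would sit on top of this structure; they are omitted here for brevity.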

Fundamentally, ontologies are used to improve communication between people and/or computers (Uschold and Jasper 1999). By describing the intended meaning of "things" in a formal and unambiguous way, ontologies enhance the ability of both humans and computers to interoperate seamlessly and consequently facilitate the development of semantic (and more intelligent) software applications.

The motivation guiding this work is that a system should be interoperable and capable of "wrapping" existing data to allow for a seamless exchange of data among stakeholders, a necessary first condition for effective collaboration. In the work we report here, we propose to use background knowledge available in domain ontologies to support the process of representing KS from the building and construction domain, thus improving the classification of such KS. In the scope of this work, an ontology is a way to represent knowledge within a specific domain (Gruber 1993).

Our hypothesis is that semantic background knowledge available in domain ontologies can be used to enrich traditional statistical term vectors. Therefore, one of the main contributions of this work is to alter the document term vectors in a way that lets us use and measure the effect of semantic enrichment on existing classifiers, not to develop new or improved classification algorithms per se.

We believe that information contained in ontologies can be incorporated into many representation schemes and algorithms. In this paper, we focus on a particular representation scheme based on Vector Space Models (VSM) (Salton et al. 1975), which represents documents as a vector of their most important terms (knowledge representations). Important terms are those considered to be the best discriminators for each document space. The aim is to understand how useful external domain knowledge is to the process of knowledge representation, what the trade-offs may be, and when it makes sense to bring in such background knowledge. In order to do this, we alter basic tf-idf (term frequency–inverse document frequency) (Salton and Buckley 1988) weighted document term vectors (statistical term vectors) with the help of an already available domain ontology to generate new (enhanced) semantic term vectors for all documents to be represented.
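The tf-idf baseline that the enrichment starts from can be sketched as follows. This is a plain textbook formulation (raw term frequency normalised by document length, logarithmic inverse document frequency), not necessarily the exact weighting configuration used in the experiments; the toy corpus is hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Plain tf-idf term vectors (the 'statistical' baseline to be enriched);
    docs is a list of token lists, one per document."""
    n = len(docs)
    df = Counter()                     # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return vectors

docs = [["architect", "design", "plan"],
        ["design", "budget", "schedule"],
        ["architect", "contract"]]
vecs = tf_idf_vectors(docs)            # one sparse weight dict per document
```

Terms that occur in every document receive zero weight (idf = log 1 = 0), which is the discriminator property mentioned above.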

This work describes the representation of KS through the use of Semantic Vectors (SV) based on the combination of the VSM approach and a domain-specific Ontology (Costa et al. 2012). KS are therefore represented by SVs, which contain concepts and their equivalent terms; weights (statistical, taxonomical, and ontological); relations; and other elements that semantically enrich each SV.
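A minimal sketch of the enrichment idea, under our own simplifying assumptions: statistical term weights are propagated to ontologically related concepts. The relation table and weights below are illustrative placeholders, not the paper's calibrated statistical/taxonomical/ontological weights.

```python
# Hypothetical relation table: concept -> [(related concept, relation weight)]
RELATED = {
    "architect": [("design", 0.8), ("actor", 0.5)],
    "design":    [("process", 0.6)],
}

def enrich(statistical_vector, related=RELATED):
    """Spread part of each concept's statistical weight to its ontology
    neighbours, yielding an enriched semantic vector."""
    semantic = dict(statistical_vector)
    for concept, weight in statistical_vector.items():
        for neighbour, rel_w in related.get(concept, []):
            semantic[neighbour] = semantic.get(neighbour, 0.0) + weight * rel_w
    return semantic

sv = enrich({"architect": 1.0})
# 'design' now carries weight even though it never occurs in the document
```

Note that 'design' receives weight without co-occurring with 'architect' in the text, which is exactly the kind of correlation that purely statistical approaches miss.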

The performance of the proposed approach is evaluated using an unsupervised document classification algorithm. Document clustering has become one of the main techniques for organizing large volumes of documents into a small number of meaningful clusters (Chen et al. 2010). However, several challenges remain for document clustering, such as high dimensionality, scalability, accuracy, meaningful cluster labels, overlapping clusters, and extracting the semantics from the texts.

Also, performance is directly related to the quantity and quality of information within the Knowledge Base (KB) it runs upon. Until, if ever, ontologies and metadata (and the Semantic Web itself) become a global commodity, the lack, or incompleteness, of available ontologies and KBs is a limitation that has to be lived with in the mid-term (Castells 2007).

We used an unsupervised classification algorithm, K-Means clustering (MacQueen 1967), to evaluate the results of our approach. One of the reasons we chose unsupervised classification is that supervised classification is inherently limited by the information that can be inferred from the training data. The objective here is to use a centroid-based document classification algorithm to assess the effectiveness of the altered vectors, since no in-depth knowledge of the actual contents of the document corpus was provided (it was largely "blind").
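The centroid-based clustering step can be sketched with a plain K-Means implementation. This toy version over 2-D points only shows the assign/recompute cycle; the actual evaluation operates on high-dimensional semantic vectors, and the data here is invented.

```python
import math
import random

def k_means(points, k, iters=20, seed=0):
    """Minimal batch K-Means over dense vectors; a sketch of the kind of
    centroid-based clustering used in the evaluation, not the exact setup."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):  # recompute centroids as cluster means
            if members:
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, clusters

# Two well-separated toy "document" pairs in a 2-D feature space
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centroids, clusters = k_means(pts, k=2)
```

For document vectors, the Euclidean distance above is commonly replaced by cosine distance (spherical K-Means), a design choice the sketch leaves out.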

This paper is structured as follows. "Motivating scenario and related work" illustrates a motivating scenario and the related work. "Modelling the building and construction knowledge" describes the domain ontology used in this work. "Enriching knowledge representations process" describes the process of enrichment of KSs. "Assessment of the presented work" presents the empirical evidence of the work addressed so far. Finally, "Conclusions and future work" concludes the paper and points to the future work to be carried out.

Motivating scenario and related work

In order to understand the type of domain addressed within this work and the associated KS space, we present some of the relevant challenges in the B&C (Building and Construction) sector and explain why this topic is so important to this particular domain.

B&C projects are information-intensive. The availability of integrated project data and the recording of such data throughout the construction process are essential not only for project monitoring, but also to build a repository of historical project information that can be used to improve the performance of future projects. This would allow construction actors to better share and use corporate knowledge when searching for appropriate actions to solve on-site construction problems. The shared knowledge is also expected to help better predict the impacts of corrective actions in a project life cycle, and so improve project performance.

Projects are conducted through a series of meetings (possibly meetings of minds via email exchanges), and every meeting is considered a Decisional Gate (DG), a milestone point where decisions are made, problems are raised, solutions are agreed, and goals/tasks are assigned to project participants. Pre-existing knowledge serves as input to the DG, the project is judged against a set of criteria, and the outputs include a decision (go/kill/hold/recycle) and a path forward (schedule, tasks, to-do list, and deliverables for the next DG). The DG representation is depicted in Fig. 1.

Each DG is prepared for (through the creation of agenda points), and the events that occur during the meeting are recorded. Between two DGs there is continuous monitoring of the progress of all tasks being executed. There is a need for a decision recording mechanism to highlight the major decisions made during a meeting, i.e. written minutes.

Fig. 1 The decisional gate

DGs normally go through the following phases: (i) Individual work, (ii) Initialisation, (iii) Collaboration, and (iv) Closing/Clean-up. Individual work relates to asynchronous collaboration, where all individuals involved in the project are supposed to provide inputs to the ongoing tasks. Initialisation (pre-meeting) covers the preparation of the meeting agenda and the selection of the meeting participants. The Collaboration phase is the meeting itself, where participants try to reach a common understanding regarding the issues on the agenda, using the right resources. This phase also covers the annotation of the decisions made during the meeting. Finally, Closing/Clean-up basically targets the creation of the meeting minutes.

Knowledge needs to be shared in order to be properly capitalised on during decision-making processes. On the one hand, knowledge sharing is heavily dependent on technical capabilities; on the other hand, since the social dimension is very strong during collaboration, there is also an increased need to take into account how to support the culture and practice of knowledge sharing. For instance, issues of trust are critical in collaborative engineering projects, since the distribution of knowledge and expertise means that it becomes increasingly difficult to understand the context in which the knowledge was originally created, to identify who knows something about the issue at hand, and so forth.

B&C knowledge is seen as an evolving sequence of interoperable models. Each model should be built using a construction epistemology based on a bottom-up, field-based, human-oriented approach. These models, or subdomain ontologies, should be interlinked within a contemporary pragmatic approach. In other words, they should be integrated on the basis of utility to industry and usability, and with the acceptance of the dual/relative nature of such models (ontologies) (El-Diraby 2012).

A consensus strategy for interoperability embraces all standards where the main models of conceptualization are first created and subsequent data models are then developed. Actors or developers harmonize their models with the intention of integrating their data models with those of the other actors in the interoperability strategy. The strategy consists of finding common concepts in the universe of discourse of the domain. In the case of the construction industry, the definition of those concepts focuses not only on construction products but also on construction processes during a project life cycle (ISO12006-3 2006). The Industry Foundation Classes (IFC) capture specifications of actors, products, processes, and geometric representation, and provide support as a neutral model for the attachment of properties, classifications, and external library access (BuildingSmart 2012). An example of separate international organizations combining their individual efforts into a single object library is the International Framework for Dictionaries (IFD).

Human knowledge can be efficiently represented and shared through semantic systems using ontologies to encapsulate and manage the representation of relevant knowledge (Lima et al. 2005). Specifically, ontologies provide knowledge conceptualization using a hierarchical system of concepts (taxonomies), associative relations (linking different concepts across hierarchies), and axioms (El-Diraby et al. 2005). Thus, ontologies may enable reasoning about semantics between domain concepts and can play a crucial role in representing knowledge in the B&C industry (Lima et al. 2005; Rezgui 2006).

A variety of semantic resources, ranging from domain dictionaries to specialized taxonomies, have been developed in the building and construction industry. Among them are BS6100 (Glossary of Building and Civil Engineering terms, produced by the British Standards Institution); bcXML (an XML vocabulary developed by the eConstruct IST project for the construction industry); IFD; OCCS (OmniClass Classification System for Construction Information); BARBi (Norwegian Building and Construction Reference Data Library); and e-COGNOS (COnsistent KM across projects and between enterprises in the construction domain). Among these semantic resources, the e-COGNOS project was the first to deploy a domain Ontology for KM in the construction industry, and it has been tested in leading European construction organizations (Lima et al. 2005).

The initiatives described are efforts designed to establish common ground for enabling semantic interoperability within the B&C sector. However, many other web-based tools have used semantic systems to support some aspects of integrating unstructured data and/or ontologies. For example, the GIDS (Global Interlinked Data Store) technique distributes Linked Data across the network and then manages the network as a database (Braines, Jones et al. 2009). The SWEDER mechanism (Semantic Wrapping of Existing Data Sources with Embedded Rules) makes existing electronic data sources available in a usable and semantically rich format, along with rules to facilitate integration between datasets (Braines, Kalfoglou et al. 2008). The POAF technique (Portable Ontology Aligned Fragments) aims to support alignment between existing ontology resources. These techniques (and many others) can be used to create interoperability between interlinked unstructured data sets based on semantic analysis (Kalfoglou et al. 2008). The Funsiec initiative employed IFC and taxonomies as conceptual models and starting points to create single, harmonized products, queries, and a controlled vocabulary (Lima et al. 2006).

For the sake of clarity, it is worth distinguishing here the major difference between data and information. In this work, data is seen as a representation of the simplest facts about a system, with limited meaningfulness. In information systems, data is normally stored in databases. Information is the composition of various data to establish a meaningful representation of facts. In information systems, information is normally exchanged through communication between humans or via electronic means such as web sites or e-mail (Floridi 2004). Typically, IT-based tools (such as XML and other web systems) are used to support the interoperable exchange of information.

The work presented here is a continuation of that in Figueiras et al. (2012) and Costa et al. (2012). Regarding the main issue addressed in our work, Castells (2007) proposes an ontology-based scheme for the semi-automatic annotation of documents, and a retrieval system. The retrieval model is based on an adaptation of the classic VSM, including an annotation weighting algorithm. Similarly to our approach, Castells uses the tf-idf (term frequency–inverse document frequency) algorithm, matches documents' keywords with Ontology concepts, creates semantic vectors, and uses the cosine similarity to compare the created vectors. However, Castells does not take into consideration the nature and strength of relationships between concepts (either taxonomic or ontological) in a way that could influence annotation performance, as we do.
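The cosine similarity used to compare vectors is the standard measure; a sketch over sparse term-weight dictionaries (the example vectors are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = {"architect": 1.0, "design": 0.8}
b = {"design": 1.0, "budget": 0.5}
print(round(cosine(a, b), 3))   # 0.559
```

Because only shared terms contribute to the dot product, enriching both vectors with related ontology concepts directly increases the overlap the measure can detect.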

The work presented by Sheng (2009) tries to overcome this drawback by presenting a way of mathematically quantifying hierarchical or taxonomic relations between ontological concepts, based on the importance of relationships and the co-occurrence of hierarchically related concepts, which can be reflected in the quantification of document SVs. Sheng's work contributes by comparing the effectiveness of the traditional VSM with the semantic model. Sheng used semantic and ontology technology to solve several problems that the traditional model could not overcome, such as the shortcomings of computing weights based on statistical methods, the expression of semantic relations between different keywords, the description of document SVs, and the quantification of similarity. However, Sheng's work neglects other types of semantic relations, including ontological ones. According to Sheng's work, concept similarity decreases with the distance between concepts in a taxonomy, which, as demonstrated with our approach, is not always the case. Sheng used 100 abstracts from document sources to evaluate his method; it would be interesting to use the full document texts in order to quantify how his approach scales up, when compared to the full document texts used by our approach. It should be mentioned that Sheng's approach has been adapted for calculating the taxonomical relationship weights in our approach.
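The distance-based decay that such taxonomic weighting assumes can be sketched as follows. The toy taxonomy and the decay constant are our own illustrative choices, not values from Sheng's paper or from our calibration.

```python
# Illustrative B&C taxonomy fragment: child -> parent (is-a)
PARENT = {"Crane": "Equipment", "Equipment": "Resource", "Material": "Resource"}

def path_to_root(c):
    """Concept followed by its ancestors, nearest first."""
    path = [c]
    while c in PARENT:
        c = PARENT[c]
        path.append(c)
    return path

def taxonomic_similarity(a, b, decay=0.5):
    """Similarity decays with the number of taxonomy edges between concepts,
    measured through their nearest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = set(pa) & set(pb)
    if not common:
        return 0.0
    dist = min(pa.index(c) + pb.index(c) for c in common)
    return decay ** dist

taxonomic_similarity("Crane", "Material")   # 0.5**3 = 0.125
```

Our observation above is that this monotonic decay is not always appropriate, which is why the approach also weighs non-taxonomic (ontological) relations separately.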

Another relevant approach in the area of IR and document classification is proposed by Nagarajan et al. (2007). The authors explore the use of external semantic metadata available in ontologies, in addition to the metadata central to documents, for the task of supervised document classification. One of the key differences between Nagarajan's approach and ours is that Nagarajan does not quantify the difference between ontologically related concepts and taxonomically related concepts. Also, our work does not directly include terms from documents within SVs; the terms are first mapped to ontology concepts, which guarantees a reduction in the semantic vector dimensionality and avoids a very sparse vector. A further key difference is that Nagarajan uses a supervised document classification algorithm, which is inherently limited by the information inferred from the training data, as opposed to our approach of using an unsupervised clustering algorithm.

In other recent work, Xia and Du (2011) propose document classification mechanisms based on title vectors, which assume that the terms in titles represent the main topics of those documents, and that the weights of title terms should therefore be amplified. Xia and Du's (2011) work seems to show an improvement in text classification for webpages, where titles are carefully created by editors and usually reflect the main content of the page. However, the same does not apply to the technical documents considered in our work. As will be explained and demonstrated in later sections, document titles can sometimes be misleading about the real content of the document.
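The title-amplification idea, as we read it, amounts to boosting the weights of terms that also occur in the title; the boost factor below is an illustrative assumption, not a value from Xia and Du's paper.

```python
def amplify_title_terms(term_weights, title_terms, boost=2.0):
    """Multiply the weight of every term that also appears in the title;
    other term weights are left unchanged."""
    return {t: w * boost if t in title_terms else w
            for t, w in term_weights.items()}

weights = {"bridge": 0.4, "schedule": 0.2}
amplify_title_terms(weights, {"bridge"})
# 'bridge' is doubled, 'schedule' is unchanged
```

The weakness noted above is visible in the sketch: a misleading title boosts exactly the wrong terms, which ontology-based enrichment does not depend on.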

Modelling the building and construction knowledge

Models of B&C knowledge span three broad categories: classification systems and thesauri, product and process models, and ontologies. The first category is the most prominent and the oldest. Classification systems (such as the Swedish classification of construction terms, SfB, Uniclass and Masterformat) majored on product categorization with limited attention to ontological modelling. Product models such as IFC also have limited ontological features, as they were geared towards assuring the interoperable exchange of product data (in contrast to semantic knowledge).

The International Framework for Dictionaries (IFD) is closely related to IFC and BIM (Building Information Modelling) and can be seen as a thesaurus of B&C terms that aims to create multilingual dictionaries or ontologies. It is meant as a reference library intended to support improved interoperability in the building and construction industry (BuildingSmart 2012). The value of interoperability for BIM-based construction projects has been analysed by Grilo and Jardim-Goncalves (2010), and the authors support the conviction that interoperability in BIM can contribute to efficiency value levels by supporting communication and coordination interactions between participants in BIM-based projects. The ontology developed under the scope of this work was intended to be IFC compliant and to capitalize on previous taxonomies/classification systems. BS6100 and UniClass terms were used to enrich the ontology.

From a high-level point of view, the basic ontological model of the domain ontology was inspired by the e-COGNOS ontology (Lima et al. 2005) and can be described as follows: a group of Actors uses a set of Resources to produce a set of Products following certain Processes within a work environment (Related Domains) and according to certain conditions (Technical Topics).

As such, the proposed taxonomy includes seven major domains to classify the seven major concepts:

• Project
• Actor
• Resource
• Product
• Process
• Technical Topics (Conditions)
• Related Domains (work environment)

As can be seen, the first five domains coincide with major themes in the IFC model. The final two domains include related issues that are not covered by IFC. Figure 2 illustrates the major domains, and the following subsections describe their major elements. This ontology is process-centered: the other domains define all relevant process attributes. For example, the Technical Topics domain defines the concepts of productivity, quality standard and duration.

All entities (including Process) have three ontological dimensions: state, stage and situation. The State concept captures the status of entity development: dormant, executing, stopped, re-executing, completed. The Stage concept defines several development stages: conceptualization, planning, implementation and utilization. The Situation concept distinguishes planned entities from unplanned entities.

Fig. 2 Major domains in the domain ontology


A Project is a collection of processes. Building projects are of two main types: brown-field projects reuse an existing building site, and green-field projects build on virgin land. A Project has a project delivery system, a contract, a schedule, a budget, and resource requirements. It also has a set of related aspects that include start time, finish time, duration, quality standard, productivity level, life cycle stages and life cycle costs, all of which are defined in the Technical Topics domain.

A Process has input requirements that include: the completion of all preceding processes, the existence of required Approvals, the availability of required Knowledge items (documents, software, etc.), the availability of required Resources (materials, equipment, subcontractors), the availability of required Actors, and the availability of the required budget.

A Process has three major sub-concepts: Phase, Activity and Task. There are two major process types: engineering and administrative. A Process has outputs that include an update to a product time-line, an update to the project schedule, and an update to the project budget.

A Product (or Actor, Process or Resource) has attributes, parameters and elements, which are defined in Technical Topics.

The domain-specific Ontology used in this work was developed using the Protégé Ontology editor (Stanford Center for Biomedical Informatics Research 2013) and is written in the OWL-DL language (W3C 2012). The Ontology comprises two major pillars, namely concepts and their relationships. The former relates to specific aspects (classes) of building and construction, such as the type of project, project phase, geographical location and similar data. The latter specifies how the ontology concepts are related to each other.

Several levels of specificity are given for all concept families, as described for the ‘Actor’ concept. These specificity levels represent concept hierarchies and, ultimately, taxonomic relations such as ‘Architect’ <is_a> ‘Design Actor’ and ‘Design Actor’ <is_a> ‘Actor’. All classes, or concepts, have an instance (an individual), which corresponds to the class and comprises the keywords or expressions gathered and related to each concept through an ontological data-type property designated ‘has Keyword’.

Concepts are related through a set of terms named ‘equivalent terms’, which are terms or expressions relevant for capturing different semantic aspects of such concepts. For instance, the ‘Learning_Facility’ concept has a ‘Higher_Education_Facility’ individual, and this individual has several equivalent terms such as ‘university’, ‘science college’ and ‘professional college’. Thus each equivalent term belongs to some higher concept, as shown in Fig. 3. Moreover, concepts are connected by ontological object properties termed ‘ontological relations’. Ontological relations relate concepts among themselves and are described by a label (property) and the relevance (weight) of such relation in the context of the B&C domain Ontology.

The KS used for the purpose of evaluating the approach described here are also related to the B&C domain, meaning that there is a close relation between the KS and the domain ontology in terms of the entities shared by both.

The usage of such a domain ontology within this approach enables the ‘annotation’ of KS with concepts available in the domain ontology. To some extent, there is a risk of obtaining a very sparse annotation of the KS, depending on the modelling of the domain Ontology. However, we assume the existence of a relatively complete model, considering that the experiment was selected to fit within the richer areas covered by the ontology.

The assessment then clusters/classifies KS against concepts defined by the domain ontology. Looking at the classes in Fig. 3, as an example, some KS can be classified as being more relevant to the ‘Cultural_Facility’ concept, while others can be classified as more relevant to ‘Commercial_Facility’.

Enriching knowledge representations process

In this section, we describe the rationale behind our hypothesis that semantic background knowledge from ontologies can be used to augment traditional statistical term vectors. Our approach mainly focuses on the knowledge representation of KS, but there are several steps that need to be performed before and after the knowledge representation itself, as depicted in Fig. 4. The overall approach is described as follows:

• The initial step deals with the search for relevant KS, using the ICONDA digital library;

• The next step collects all relevant KS found and stores them in a KB repository;

• In the third step, knowledge experts within the B&C domain pre-label all relevant KS by inspection. This step is further detailed in the “Evaluation process” section;

• The fourth step is the core of our approach, which is detailed below in this section;

• The fifth step applies an unsupervised classification algorithm (K-Means clustering), which groups KS into various categories (clusters). This step is further detailed in “The K-means clustering algorithm” section;

• The final step evaluates the overall approach, using classical precision and recall metrics to measure performance. This step is also detailed in the “Evaluation process” section.


Fig. 3 Domain ontology elements and relations

Fig. 4 Step-wise approach

The core of our approach lies in altering document term vectors in three simple steps. Figure 5 gives an overview of the semantic vector creation process, which is carried out by three main modules, namely the Ontology Definition Module, the Document Analysis Module, and the Semantic Enrichment Module (explained in the ‘Ontology definition module’, ‘Document analysis module’, and ‘Semantic enrichment module’ sub-sections, respectively). In our approach, when receiving a set of textual documents, the document analysis module extracts terms, creates the key term set, and produces a term occurrence statistical vector. From that point on, the semantic enrichment module alters the statistical vector using information available in the B&C domain Ontology and produces an Ontology-concept occurrence vector (Semantic Vector, or SV, for short).

Ontology definition module

This module supports the definition of a domain reference ontology, which serves as background knowledge enabling the semantic enrichment of KS representations. The ontology defined under this work was not developed from scratch; rather, it reuses several KBs already available in the building and construction sector, namely: the OmniClass Standards for the Construction Environment (OCCS Development Committee Secretariat 2013), the BuildingSmart IFD Library and the Construction Information and Knowledge Portal ontology (Zhang 2010).

The MENTOR methodology proposed in Sarraipa et al. (2008) can be seen as a collaborative methodology developed with the idea of helping a group of people or enterprises to share their knowledge with others in the network; it provides several steps, such as semantic comparisons, basic lexicon establishment, ontology mappings and other operations, to build a domain’s reference ontology. The MENTOR methodology aims to combine the knowledge described by different formalisms in a semantically interoperable way (Sarraipa et al. 2010), facilitating further updates that could come from any participant KB alterations. This methodology comprises three phases (Fig. 6).

The Lexicon Settlement (Phase 1) represents the knowledge acquisition, assembling a collection of terms and related definitions from all participants. This phase is divided into three steps: Terminology Gathering, Glossary Building, and Thesaurus Building. The Reference Ontology Building (Phase 2) is the phase where the reference ontology is built and the semantic mappings between the participants’ ontologies and the reference ontology are established. This phase, just like the first, is divided into three steps: Ontologies Gathering, Ontologies Harmonization, and Ontologies Mapping. The Reference Ontology Learning (Phase 3) extends the MENTOR methodology’s capabilities to a dynamic level. This phase implements the feedback mechanisms for sustainable evolutionary learning of the dynamic ontological system. It represents the encircling of the Lexicon Settlement and Reference Ontology Building phases of the MENTOR methodology, providing a continuously learning ontology (Fig. 6).

Document analysis module

We start with a state-of-the-art indexing tool, RapidMiner (RapidMiner 2012), to generate document term vectors (statistical vectors) that order the terms in a document by the importance of their occurrence in that document and in the entire document corpus, using a normalized tf-idf score. There are two stages in this module, namely Term Extraction and Term Selection, which reduce the dimensionality of the source document corpus.

Fig. 5 The semantic vector creation process

Fig. 6 MENTOR methodology phases

Term extraction

The extraction process is as follows:

(a) First, each document is split into sentences. Then, the terms in each sentence are extracted as tokens (so-called tokenization).

(b) All tokens found in the document are transformed to lower case.

(c) Terms belonging to a predefined stop word list (the list used by the RapidMiner tool) are removed.

(d) The remaining terms are converted to their base forms by a process called stemming, using the “snowball” method. Terms with the same stem are then combined for frequency counting. In this paper, a term is regarded as the stem of a single word.

(e) Tokens whose length is less than 4 or greater than 50 characters are discarded.

(f) n-Gram generation creates strings of 1 to N words. For this case we consider the generation of unigrams (e.g. Energy), bigrams (e.g. Waste Management) and trigrams (e.g. Electric Power Product).
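The extraction steps (a)–(f) can be sketched as follows. This is an illustrative stand-in, not the RapidMiner pipeline itself: the stop-word list is a toy one, and a crude suffix-stripper replaces the Snowball stemmer:

```python
import re

STOP_WORDS = {"the", "and", "of", "in", "is", "a", "to"}  # toy stand-in list

def crude_stem(token):
    # Stand-in for Snowball stemming: strip a few common English suffixes.
    for suffix in ("ation", "ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_terms(document, n_max=3):
    # (a) sentence splitting and tokenization
    tokens = []
    for sentence in re.split(r"[.!?]+", document):
        tokens.extend(re.findall(r"[A-Za-z_]+", sentence))
    # (b) lower case, (c) stop-word removal, (d) stemming
    tokens = [crude_stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]
    # (e) discard tokens shorter than 4 or longer than 50 characters
    tokens = [t for t in tokens if 4 <= len(t) <= 50]
    # (f) n-gram generation: unigrams, bigrams and trigrams
    grams = []
    for n in range(1, n_max + 1):
        grams.extend("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams

terms = extract_terms("Waste management of electric power production in the plant.")
```

A real implementation would use RapidMiner’s operators (or an equivalent Snowball stemmer); the point here is only the order of the steps.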

Term selection

We consider that terms with low frequencies are most likely to be noise sources and of no interest, so we apply the tf-idf (term frequency–inverse document frequency) method to select the key terms for the document set. Equation 1 measures tf-idf_ij, the importance of a term t_i within a document d_j. The main drawback of the tf-idf method is that long documents tend to have higher weights than short ones: the method considers only the weighted frequency of the terms in a document but ignores the length of the document. To prevent this, in Eq. 2 the frequency tf_ij of t_i in d_j is normalized by the total number of term occurrences in d_j, preventing bias towards long documents.

tf-idf_ij = tf_ij × idf_i   (1)

tf_ij = (number of occurrences of t_i in d_j) / (total number of occurrences in d_j)   (2)

idf_i = log((number of documents in D) / (number of documents in D that contain t_i))   (3)

After calculating the weight of each term in each document, those which satisfy a pre-specified minimum tf-idf threshold γ are retained. For this work, we consider only terms whose tf-idf score is ≥ 0.001, in order to reduce the high dimensionality of the generated vectors and the computational power required to process them. After close human inspection, it was concluded that terms whose tf-idf score was below 0.001 were not relevant enough. The retained terms form the set of key terms for the document set D.
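The selection step can be sketched as follows, applying Eqs. 1–3 to a toy corpus of token lists (the threshold value 0.001 is the one stated above):

```python
import math
from collections import Counter

def tf_idf_vectors(corpus, threshold=0.001):
    # corpus: list of token lists (one list per document)
    n_docs = len(corpus)
    # document frequency of each term, the denominator of Eq. 3
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())            # normalisation term of Eq. 2
        vec = {}
        for term, count in counts.items():
            tf = count / total                  # Eq. 2
            idf = math.log(n_docs / df[term])   # Eq. 3
            score = tf * idf                    # Eq. 1
            if score >= threshold:              # keep key terms only
                vec[term] = score
        vectors.append(vec)
    return vectors

docs = [["toilet", "sanitari", "toilet"], ["personnel", "team"], ["toilet", "team"]]
vecs = tf_idf_vectors(docs)
```

Note that a term occurring in every document gets idf = 0 and is dropped by the threshold, matching the intuition that it carries no discriminative power.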

A document, d_i, is a logical unit of text, characterised by a set of key terms t_j together with their corresponding frequencies f_ij, and can be described in vector form by

d_i = {(t_1, f_i1), (t_2, f_i2), ..., (t_j, f_ij), ..., (t_m, f_im)},

the statistical vector. Thus, for each document in the document corpus D there is a resultant statistical vector. An example statistical vector is depicted in tabular form in Table 1.

Semantic enrichment module

In this module we construct a new term vector, the Semantic Vector (SV), for all the documents in corpus D. This vector comprises Ontology concepts that are in the domain Ontology and whose equivalent terms (cf. Fig. 3) semantically match terms present in the statistical vector (Table 2). This step ensures a ‘meaningful’ reduction in the term vector dimensionality and establishes a semantic grounding of the terms in the document that overlap with instances in the Ontology. However, there is a risk of obtaining a rather sparse vector if the domain ontology is itself sparse and poorly modelled. For now we assume the existence of a (relatively) complete ontology model.

Table 1 Statistical vector

Key term                Weight
Sanitari                0.004101
Water_suppli_drainag    0.003265
Toilet                  0.002482
Personnel               0.002332

Table 2 Ontological equivalent terms

Ontological concept                          Equivalent terms
Complete_Sanitary_Suite                      Complete sanitary suite, complete bathroom suite, bathroom, washroom, ...
Plumbing_Fixture_And_Sanitary_Washing_Unit   Bathtub, shower, service sink, lavatory, ...
Sanitary_Disposal_Unit                       Water closet, toilet, urinal, ...

Semantic vector creation is the basis of our approach. It represents the extraction of knowledge and meaning from KS and the agglomeration of this information in matrix form, better suited to mathematical handling than the raw text form of documents.

A semantic vector is represented by two columns: the first column contains the concepts that populate the knowledge representation of the KS, i.e., the most relevant concepts for contextualizing the information within the KS; the second column keeps the degree of relevance, or weight, that each concept has in the knowledge description of the KS.

Our approach takes into account three complementary procedures for creating the SV, where each procedure successively adds new semantic enrichment to the KS representation. The first step creates a keyword-based SV, the second a taxonomy-based vector, and the final step an Ontology-based vector. Each step is described in the following sections.

Keyword-based semantic vector

The keyword-based SV takes into consideration only the liaison between the terms present in the statistical vector and the concepts in the domain ontology. This step matches the statistical vector keywords with equivalent terms that are linked to the ontological concepts in the domain Ontology, as shown in Fig. 7.

Fig. 7 Vector terms mapping against the Ontology concepts

This process starts by identifying the statistical vector keywords associated with a particular document and then finding similarities between each keyword and the equivalent terms within the ontology. The similarities are calculated using the cosine similarity; we chose the cosine measure because it can be applied when comparing n-grams of different magnitudes.

The cosine similarity algorithm measures the similarity between two vectors; in this case, we have to compare two n-grams. If we consider each one as a vector, we can use the cosine of the angle θ between x and y, represented in Eq. 4:

cos(x, y) = (x · y) / (‖x‖ ‖y‖)   (4)

In our study, Eq. 4 is applied to our process in the following manner:

sim = (Shared Keyword Terms × Shared Equivalent Terms) / (Keyword Total Terms × Equivalent Terms Total Terms)   (5)
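As an illustration of the matching step, the cosine of Eq. 4 can be computed between two n-grams by treating each n-gram as a vector of word-token counts (the function name and the token-count representation are our illustrative choices, not the prototype’s implementation):

```python
import math
from collections import Counter

def ngram_cosine(ngram_a, ngram_b):
    # Eq. 4: cos(x, y) = (x . y) / (||x|| * ||y||), with x and y the
    # word-token count vectors of the two n-grams.
    x = Counter(ngram_a.lower().split("_"))
    y = Counter(ngram_b.lower().split("_"))
    dot = sum(x[w] * y[w] for w in x)
    norm = math.sqrt(sum(v * v for v in x.values())) \
         * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0

score = ngram_cosine("Waste_Management", "waste_management_plan")
```

For instance, ‘Waste_Management’ against ‘waste_management_plan’ shares two word tokens, giving a similarity of 2/√6 ≈ 0.82.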

Word sense disambiguation (WSD) for ontology learning is a research topic which makes the matching process a challenging task. Most WSD research employs resources such as WordNet, text corpora, or social media, and many authors have proposed approaches for dealing with the challenge of WSD (e.g. Wimmer and Zhou 2013; Dandala et al. 2013). The implementation of a mechanism for word sense disambiguation is very relevant to the current scope of the work, and the authors are considering it as part of future work. We came across several situations where word sense disambiguation is important; at the moment it is addressed through human inspection. Figure 8 illustrates some examples of ambiguity found when creating an SV, where an equivalent term was inappropriately matched to a term in the statistical vector.

Fig. 8 Word ambiguity mismatch

Next, the keyword-based SV is stored in the database in the form [Σ_{i=1..n} x_i ; Σ_{i=1..n} w_xi], where n is the number of concepts in the vector, x_i is the statistical representation of the concept and w_xi is the semantic weight corresponding to the concept.

Table 3 depicts the weight of every ontological concept associated with each key term within the statistical vector: the first column corresponds to the concepts matched to describe the most relevant terms extracted from the statistical vector, shown in column 2, and the third column shows the semantic weight of each matched concept.

Taxonomy-based semantic vector

Taxonomy-based vectors are the next step in the representation of KS, achieved by adjusting the weights of concepts according to the taxonomic relations among them, i.e., those concepts that are related by the ‘is_a’ type relation. If two or more taxonomically related concepts appear in a keyword-based vector, then the existing relation can boost the relevance of the expressions within the KS representation and therefore enhance weightings.

Table 3 Keyword-based semantic vector

Concept                                            Key term                    Weight
Sanitary_Disposal_Unit                             Toilet, urin, water_closet  0.149514
Sanitary_Laundry_and_Cleaning_Equipment_Product    Sanitari                    0.132629
Team                                               Person, personnel           0.104497
Committee                                          Subcommitte                 0.067880

Table 4 Taxonomy-based semantic vector

Concept                                            Weight
Sanitary_Disposal_Unit                             0.107615
Sanitary_Laundry_and_Cleaning_Equipment_Product    0.092500
Team                                               0.075767
Plumbing_Fixture_and_Sanitary_Washing_Unit         0.057912

The taxonomy-based SV creation process defines an SV based on kin relations between concepts within the ontological tree. Specifically, the kin relations can be expressed through the notion of homologous/non-homologous concepts, as follows (Sheng 2009).

Definition 1 In the hierarchical tree structure of the Ontology, concept A and concept B are homologous concepts if the node of concept A is an ancestor node of concept B. Hence, A is considered the nearest root concept of B, R(A, B). The taxonomical distance between A and B is given by:

d(A, B) = |depth(A) − depth(B)|   (6)

In Eq. 6, depth(X) is the depth of node X in the hierarchical tree structure, with the ontological root concept depth being zero (0).

Definition 2 In the hierarchical tree structure of the Ontology, concept A and concept B are non-homologous concepts if concept A is neither an ancestor node nor a descendant node of concept B, even though both concepts are related by kin. If R is the nearest common ancestor of both A and B, then R is considered the nearest ancestor concept, R(A, B). The taxonomical distance between A and B is expressed as:

d(A, B) = d(R, A) + d(R, B)   (7)

Figure 9 depicts the difference between homologous and non-homologous concepts.
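For illustration, the distances of Definitions 1 and 2 can be sketched over a parent-pointer tree (the concept names and helper functions below are illustrative, not part of the prototype):

```python
def depth(tree, node):
    # tree maps each node to its parent; the root maps to None
    d = 0
    while tree[node] is not None:
        node = tree[node]
        d += 1
    return d

def nearest_root(tree, a, b):
    # R(A, B): nearest common ancestor of a and b (a node is its own ancestor)
    ancestors = set()
    n = a
    while n is not None:
        ancestors.add(n)
        n = tree[n]
    n = b
    while n not in ancestors:
        n = tree[n]
    return n

def taxonomical_distance(tree, a, b):
    r = nearest_root(tree, a, b)
    if r in (a, b):
        # homologous concepts, Eq. 6
        return abs(depth(tree, a) - depth(tree, b))
    # non-homologous concepts, Eq. 7: d(R, A) + d(R, B)
    return (depth(tree, a) - depth(tree, r)) + (depth(tree, b) - depth(tree, r))

tree = {"Product": None,
        "Sanitary_Unit": "Product",
        "Sanitary_Disposal_Unit": "Sanitary_Unit",
        "Plumbing_Fixture": "Product"}
```

With this toy tree, ‘Sanitary_Disposal_Unit’ and ‘Product’ are homologous at distance 2, while ‘Sanitary_Disposal_Unit’ and ‘Plumbing_Fixture’ are non-homologous with R = ‘Product’ and distance 2 + 1 = 3.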

The taxonomy-based SV is calculated using the keyword-based vector as input, where taxonomical relations are used to boost the relevance of the concepts already present within the vector or to add new concepts. The weight of the concepts is boosted when two concepts found in the keyword-based vector are highly relevant, with the degree of relevance being defined by a given threshold. If the relevance of the taxonomical relation between two concepts is higher than the predefined threshold, then the semantic weight of such concepts is boosted in the taxonomy-based vector. If a concept already present in the keyword-based vector is taxonomically related to a concept that is not present in the vector, then the related concept is added to the taxonomy-based vector.

Fig. 9 Homologous and non-homologous concepts (Sheng 2009)

One of the major differences between the present work and that of Sheng (2009) is that, in our approach, new concepts are only added to the taxonomy-based vector if d(A, B) = 1 for homologous concepts and d(A, B) = 2 for non-homologous ones. The reason for this limitation is to avoid obtaining a sparse vector and to add only concepts that are highly related to already existing ones.

The intuition behind this work is to alter term vectors by strengthening the discriminative terms in a document in proportion to how related they are to other terms in the document (where relatedness includes all possible relationships modelled in an Ontology). A side effect of this process is the weeding out of less important terms. Since ontologies model domain knowledge independently of any particular corpus, there is also the possibility of introducing into the term vector terms that are highly related to the document but are not explicitly present in it. The approach used for enhancing term vectors is therefore based on a combination of statistical information and semantic domain knowledge. An example of a taxonomy-based SV is depicted in Table 4.


The taxonomical similarity is calculated differently for the homologous and non-homologous taxonomical relations defined previously:

Sim(A, B) = (1 − α/(depth(A) + 1))^d(A,B) × son(B)/son(A)   (8)

if d(A, B) ≠ 0 and A and B are homologous;

Sim(A, B) = (1 − α/(depth(R) + 1))^d(A,B) × (son(A) + son(B))/son(R)   (9)

if d(A, B) ≠ 0 and A and B are non-homologous;

Sim(A, B) = 1   (10)

if d(A, B) = 0.

The weight of the concept ‘Plumbing_Fixture_and_Sanitary_Washing_Unit’ is boosted within the taxonomy-based SV because it is highly related to the concepts ‘Sanitary_Disposal_Unit’ and ‘Sanitary_Laundry_and_Cleaning_Equipment_Product’.
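A sketch of the similarity computation, under assumptions we make explicit: the depth factor is read as raised to the power d(A, B), son(X) is taken to be the number of direct child nodes of X (following Sheng 2009), and α is a tuning parameter whose default of 0.5 is an arbitrary choice here:

```python
def taxonomical_similarity(dist, depth_a, depth_r, sons_a, sons_b, sons_r,
                           homologous, alpha=0.5):
    # dist = d(A, B); sons_x = number of direct children of X (assumed reading)
    if dist == 0:
        return 1.0                                              # Eq. 10
    if homologous:
        # Eq. 8: (1 - alpha/(depth(A)+1))^d(A,B) * son(B)/son(A)
        return (1 - alpha / (depth_a + 1)) ** dist * sons_b / sons_a
    # Eq. 9: (1 - alpha/(depth(R)+1))^d(A,B) * (son(A)+son(B))/son(R)
    return (1 - alpha / (depth_r + 1)) ** dist * (sons_a + sons_b) / sons_r

# homologous concepts one level apart: A at depth 1 with 2 children, B with 1
sim = taxonomical_similarity(1, 1, 0, 2, 1, 0, True)
```

Under these assumptions the example evaluates to (1 − 0.25) × 1/2 = 0.375.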

Ontology-based semantic vector

The third step in SV creation is the definition of the vector based on the ontological relations defined in the domain Ontology. We apply association rule theory to construct ontological concept relations and evaluate the importance of such relations for supporting the enrichment process of a domain ontology. The objective is to analyse the co-occurrences of concepts in unstructured sources of information in order to provide additional relevant relationships for enriching ontological structures. This is part of our ongoing work described in Paiva et al. (2013).

The ranking of such semantic associations is also complemented by human input (experts from the building and construction domain) to establish the final numerical weights for each ontological relationship. The idea behind having human intervention is to let the importance of relationships reflect a proper expert knowledge representation requirement at first hand.

The creation of the Ontology-based SV is a two-stage process using the taxonomy-based SV as input: the first stage boosts the weights of concepts already present in the taxonomy-based vector, depending on the Ontology relations among them; the second stage adds new concepts that are not present in the input vector, according to the ontological relations they might have with concepts belonging to the taxonomy-based vector (Costa et al. 2012).

Analogous to the creation of the taxonomy-based SV, a new concept is added to the vector only if the importance of the ontological relation exceeds a pre-defined threshold, for the same constraint reasons. The ontological relation’s importance, or relevance, is not automatically computed; rather, it is retrieved from an ontological relation vector comprising pairs of concepts and the weight associated with their relation, as shown in Table 5.

Equation 11 describes the process of boosting concepts or adding new ones. Here Ow_Cy is the new weight of the ontological concept, and Tw_Cy is the input taxonomy weight of the concept to be boosted (if the concept is being added, then Tw_Cy is zero). Tw_Cx is the taxonomical weight of a concept related to Cy, and TI_CxCy is the weight of the relation between Cy and Cx.

Ow_Cy = Tw_Cy + Σ (over all related Cx) [Tw_Cx × TI_CxCy]   (11)
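Using the ‘<is_operated_by>’ relation of Table 5, the boosting of Eq. 11 can be sketched as follows (dictionary-based, with illustrative function and variable names; normalisation of the resulting vector is omitted, and only the Cx → Cy direction of each relation is shown):

```python
def boost_with_relations(taxonomy_vector, relation_weights):
    # taxonomy_vector: {concept: Tw}; relation_weights: {(cx, cy): TI}
    # Eq. 11: Ow_Cy = Tw_Cy + sum over related Cx of Tw_Cx * TI_CxCy
    ontology_vector = dict(taxonomy_vector)
    for (cx, cy), ti in relation_weights.items():
        if cx in taxonomy_vector:
            ontology_vector[cy] = ontology_vector.get(cy, 0.0) \
                                  + taxonomy_vector[cx] * ti
    return ontology_vector

tw = {"Sanitary_Disposal_Unit": 0.107615,
      "Sanitary_Laundry_and_Cleaning_Equipment_Product": 0.092500}
ti = {("Sanitary_Disposal_Unit",
       "Sanitary_Laundry_and_Cleaning_Equipment_Product"): 0.07}
ow = boost_with_relations(tw, ti)
```

Here the weight of ‘Sanitary_Laundry_and_Cleaning_Equipment_Product’ becomes 0.0925 + 0.107615 × 0.07; a concept absent from the input vector would be added with weight Tw_Cx × TI (i.e. Tw_Cy = 0).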

An example of an Ontology-based SV is depicted in Table 6. In this example, the concepts ‘Sanitary_Disposal_Unit’ and ‘Sanitary_Laundry_and_Cleaning_Equipment_Product’ were boosted because they are already present in the taxonomy-based vector and are related by the ontological relation ‘<is_operated_by>’. On the other hand, the concepts ‘Team’ and ‘Plumbing_Fixture_and_Sanitary_Washing_Unit’ were not boosted, meaning that their respective weights decreased after vector normalization.

Table 6 Ontology-based semantic vector

Concept                                            Weight
Sanitary_Disposal_Unit                             0.111718
Sanitary_Laundry_and_Cleaning_Equipment_Product    0.099504
Team                                               0.074115
Plumbing_Fixture_and_Sanitary_Washing_Unit         0.056649

Table 5 Ontological relations

Property         Subject                   Object                                             Weight
Is_part_of       Complete_Sanitary_Suite   Sanitary_Laundry_and_Cleaning_Equipment_Product    0.07
Is_operated_by   Sanitary_Disposal_Unit    Sanitary_Laundry_and_Cleaning_Equipment_Product    0.07


Assessment of the presented work

This section describes the technical architecture of the prototype implemented to assess our approach, and evaluates the results achieved so far.

Technical architecture

The architecture adopts a 3-tier model comprising a knowledge repository layer, a service layer and a user interface layer. Figure 10 illustrates the architecture, depicting also the technical modules addressed by each layer as well as the technologies used to develop them.

Knowledge repository layer

The knowledge repository layer comprises: (1) a document repository, developed under the Liferay portal, which is responsible for storing all the KS to be processed; (2) a domain Ontology, developed in OWL format and maintained with the Protégé editor tool (a detailed description of the domain ontology was provided in the “Ontology definition module” section); and (3) a relational database (named SEKS, for Semantic Enrichment of KS), developed in MySQL, responsible for holding the statistical and semantic vectors for each knowledge source stored in the document repository. This means that, for each KS uploaded through the document repository portal, there is a corresponding set of vectors (statistical and semantic) stored in the SEKS database.

Service layer

The service layer includes a set of web services responsible for performing all the calculations needed to create the SV associated with each KS, and for calculating the level of similarity between a given user query and such vectors. This layer comprises two types of service.

The Basic Services layer consists of five service modules. (1) Serialization Services are used by the Advanced Services and are responsible for converting the messages exchanged between services to and from XML format. (2) Calculus Services are responsible for the mathematical computations required to create the SV, and for calculating the similarity measure between two vectors using the cosine similarity algorithm. (3) Database Services are responsible for managing the ODBC connections and access between the service layer and the knowledge layer. (4) Ontology Services include all methods necessary to access the elements of the domain Ontology, using the Jena API library to retrieve data from the OWL ontology. (5) The Knowledge Extraction Module, developed with the RapidMiner tool, accesses the document repository and applies the tf-idf score to the document corpus, thus creating the statistical vectors for each document.

Fig. 10 Technical architecture


The Advanced Services layer interacts with all the basic services. It is responsible for the system’s main functionalities and comprises three high-level service modules: (1) Document Indexation Services handle all functions associated with the iterative creation of the three SVs, as explained in the “Semantic enrichment module” section; they take as input the statistical vector created by the Knowledge Extraction Module and output the semantic vectors; (2) Query Treatment Services are responsible for transforming the user query into a semantic vector; and (3) Document Comparison Services contain all methods that support the comparison between the document corpus SVs and the user query, presenting a ranking of the comparison results as output.

User interface layer

The user interface layer was developed using JSP, AJAX and jQuery technologies. It provides the front end for the user to upload new documents into the document repository, navigate through the domain ontology and search for documents.

Evaluation process

One of the key requirements for evaluating this approach is the availability of a relatively complete domain Ontology. This assessment built upon preliminary results of prior work on semantic enrichment of KS (Figueiras et al. 2012; Costa et al. 2012). A metadata KB was developed focused on the building and construction sector, addressing the types of actors involved, related projects and products used.

Our dataset for evaluation in this paper is primarily focused on related products used in the building and construction domain. Figure 11 shows part of the taxonomy into which the documents were classified. Although the product taxonomy we had available contained 16 sub-categories, we chose a smaller subset (5 categories, as shown in Fig. 11) in order to analyse and explain the results more clearly.

We tested our approach with 20 scientific publications containing on average 3,500 words each. The reason for choosing scientific publications was the significant number of words in each document, which makes the dispersion of each document regarding key terms much higher than for simple webpages or news headlines, and makes precise classification a much greater challenge.

Fig. 12 Pre-labelling mismatch example

Documents used in the assessment were manually pre-labelled with the support of the ICONDA search engine (IRB 1986) and close human evaluation, which sometimes helped in resolving inconsistencies. For example, looking at Fig. 12, the ICONDA search engine considered the document to be related to some extent with the 'lighting' concept but, after close inspection, the document was pre-labelled as 'climate control'.

The core aspect of our evaluation is to measure the effectiveness of the altered term vectors. The question we are trying to answer is whether our intuition of adding terms and boosting the weights of terms in a term vector does, in practice, meaningfully amplify important terms and "weed out" less important ones. At the same time, is it possible to represent KSs more accurately with the support of domain ontologies? We believe that having more accurate representations of KSs can improve semantic interoperability among project teams and, consequently, facilitate knowledge sharing and reuse in the B&C domain.
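As a minimal sketch of this intuition, the following Python fragment boosts the weight of terms that the ontology relates to the document and injects related terms that are absent from it. The term weights, the boost factor, and the way the related terms are obtained are illustrative assumptions, not the exact semantic-vector computation described earlier.

```python
def boost_term_vector(stat_vec, related_terms, boost=1.5, add_weight=0.5):
    """Return an enriched copy of a statistical term vector.

    stat_vec: dict mapping term -> weight (the statistical vector)
    related_terms: terms the domain ontology links to concepts found in
                   the document (illustrative input, not the paper's
                   actual semantic-association computation)
    """
    enriched = dict(stat_vec)
    for term in related_terms:
        if term in enriched:
            enriched[term] *= boost        # amplify an important, ontology-backed term
        else:
            enriched[term] = add_weight    # introduce an implicit, ontology-derived term
    return enriched

# Hypothetical weights for illustration only:
sv = boost_term_vector({"lighting": 0.4, "hvac": 0.2}, {"hvac", "climate"})
```

The side effect discussed in the conclusions — less important terms losing relative weight — falls out naturally: unboosted terms keep their original weight while related terms grow.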

The comparison in this evaluation process is therefore performed between the four vectors: statistical, keyword-based, taxonomy-based, and Ontology-based.

As mentioned earlier, the focus of this work is not on improving or extending existing classification algorithms.

Fig. 11 Categories used for evaluation


Page 16: Facilitating knowledge sharing and reuse in building and construction domain: an ontology-based approach

J Intell Manuf

Table 7 A simple illustration of text data in the VSM

          t0   t1   t2   t3   t4
C0   x0    1    2    3    0    2
     x1    2    3    1    0    2
     x2    3    1    2    0    2
C1   x3    0    0    1    3    2
     x4    0    0    2    1    3
     x5    0    0    3    2    1

Our system uses the altered term vectors as input to various classification algorithms; specifically, we used an unsupervised classification algorithm (K-means clustering) for evaluation purposes.

The K-means clustering algorithm

Let a set of text documents be represented as a set of vectors X = {X1, X2, ..., Xn}. Each vector Xj is characterized by a set of m terms (t1, t2, ..., tm), where m is the total number of unique terms across all documents, which form the vocabulary of these documents. The terms are referred to as features. Let X be a set of documents that contains several categories. Each category of documents is characterized by a subset of terms in the vocabulary, which corresponds to a subset of features in the vector space.
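The construction of such term-frequency vectors over a shared vocabulary can be sketched as follows; whitespace tokenization and raw counts are simplifying assumptions, mirroring the frequencies shown in Table 7 rather than the full extraction pipeline.

```python
from collections import Counter

def build_vsm(documents):
    """Build the vocabulary (the m unique terms) and represent each
    document as a vector of raw term frequencies over that vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = [[Counter(doc)[t] for t in vocab] for doc in tokenized]
    return vocab, vectors

# Two toy documents; a zero cell means the term is absent from that document.
vocab, X = build_vsm(["waste management plan", "waste recycling"])
```

Each row of `X` corresponds to one document vector xj, each column to one term ti, exactly as laid out in Table 7.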

A simple illustration of text data in the VSM is given in Table 7. Here, xj represents the j-th document vector; ti represents the i-th term; and each cell in the table is the frequency with which term ti occurs in xj. A zero cell means that the term does not appear in the related document. Documents x0, x1, x2 belong to one category, C0 (assume "Climate Control"), while x3, x4, x5 belong to another category, C1 (assume "Waste Management").

Because these two categories are different, they are characterized by different subsets of terms. As shown in Table 7, category C0 is characterized by terms t0, t1, t2 and t4, while category C1 is characterized by terms t2, t3 and t4. Meanwhile, terms play different roles in identifying categories or clusters. For instance, t4 appears with the same frequency in every document of category C0; hence, t4 should be more important than the other terms in identifying category C0.

The K-means algorithm finds a partition such that the squared error between the empirical mean of a cluster and the points in the cluster is minimized. Let μk be the mean of cluster Ck. The squared error between μk and the points in cluster Ck is defined as

J(Ck) = ∑_{xi ∈ Ck} ‖xi − μk‖²    (12)

The overall objective is to find the partition that minimizes the squared error summed over all K clusters:

J(C) = ∑_{k=1}^{K} ∑_{xi ∈ Ck} ‖xi − μk‖²    (13)
Minimizing this objective function is known to be an NP-hard problem (even for K = 2) (Drineas et al. 2004). Thus K-means, which is a greedy algorithm, can only converge to a local minimum, even though a recent study has shown that, with large probability, K-means converges to the global optimum when the clusters are well separated (Meila 2006). K-means starts with an initial partition with K clusters and assigns patterns to clusters so as to reduce the squared error. Since the squared error always decreases with an increase in the number of clusters K (with J(C) = 0 when K = n), it can be minimized only for a fixed number of clusters.
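A compact, self-contained sketch of this greedy procedure, run on the six document vectors of Table 7, illustrates the alternation between assignment and mean update; the initialization and tie-breaking details below are our own choices, not the exact setup used in the evaluation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Empirical mean of a non-empty cluster of vectors."""
    return tuple(sum(vals) / len(cluster) for vals in zip(*cluster))

def kmeans(points, k, iters=100, seed=0):
    """Greedy K-means: alternate between assigning each point to its
    nearest mean and recomputing the means, which monotonically reduces
    the squared error J(C) until a local minimum is reached."""
    rnd = random.Random(seed)
    means = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, means[j]))
            clusters[nearest].append(p)
        new_means = [centroid(c) if c else means[j]
                     for j, c in enumerate(clusters)]
        if new_means == means:   # converged: assignments can no longer change
            break
        means = new_means
    return clusters

# The six document vectors of Table 7 (x0..x5 over terms t0..t4):
docs = [(1, 2, 3, 0, 2), (2, 3, 1, 0, 2), (3, 1, 2, 0, 2),
        (0, 0, 1, 3, 2), (0, 0, 2, 1, 3), (0, 0, 3, 2, 1)]
clusters = kmeans(docs, k=2)
```

Because the two categories of Table 7 are well separated, the algorithm recovers the partition {x0, x1, x2} / {x3, x4, x5}, consistent with Meila's observation quoted above.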

The reasons why unsupervised classification was chosen over supervised classification were that:

• Supervised classification is inherently limited by the information that can be inferred from the training data (Nagarajan et al. 2007). This means that the accuracy and representativeness of the training data, as well as the distinctiveness of the classes, must be taken into account. This tends to be a problem when dealing with large document corpora, when no previous in-depth knowledge about the documents is assumed.

• Some documents tend to overlap, even when belonging to different categories. Such situations are quite common when working with documents with an average of 3,500 words each. In general, text classification is a multi-class problem (more than 2 categories). Training supervised text classifiers requires large amounts of labelled data whose annotation can be expensive (Dumais et al. 1998). A common drawback of many supervised learning algorithms is that they assume binary classification tasks and thus require the use of sub-optimal (and often computationally expensive) approaches, such as one-vs-rest, to solve multi-class problems, let alone structured domains such as strings and trees (Subramanya and Bilmes 2008).

• Labelling such documents manually beforehand is not a trivial task and may adversely affect the training set of the classification algorithm. Our intention is to reduce human intervention in the classification task as far as possible and also to scale up our approach to sets of hundreds of scientific publications.


• The goal of the assessment is to evaluate whether the semantic enrichment process improves the similarity level among documents, even when such documents were not considered similar using purely statistical approaches but are, in fact, similar from a semantic perspective.

In the following sub-section, we present the results of our approach and give details of the kinds of classification patterns we have observed.

Results

Our evaluation metrics are the traditional notions of precision and recall, computed as follows:

Precision = (no. of documents correctly assigned to the category) / (no. of documents correctly assigned to the category + no. of documents incorrectly assigned to the category)    (14)

Recall = (no. of documents correctly assigned to the category) / (no. of documents correctly assigned to the category + no. of documents incorrectly rejected from the category)    (15)
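Expressed in code, the two metrics reduce to simple ratios over per-category assignment counts; the counts in the example are illustrative, not taken from the evaluation.

```python
def precision_recall(tp, fp, fn):
    """Per-category precision and recall, following Eqs. (14)-(15).

    tp: documents correctly assigned to the category
    fp: documents incorrectly assigned to the category
    fn: documents incorrectly rejected from the category
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 3 correct assignments, 1 wrongly assigned, 1 wrongly rejected:
p, r = precision_recall(3, 1, 1)  # -> (0.75, 0.75)
```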

Nevertheless, the correctness of the classification tends to be a subjective issue. What may be a satisfactory classification for an application setting that has weighted ontological semantic relationships a certain way might be unacceptable in other classification settings. The importance of relationships between ontological concepts is therefore an additional, independent, and tuneable component that affects the precision and recall metrics.

We first present some overall statistics and then discuss some success and failure patterns observed during correlation with the results of the classification. Figures 13 and 14 show average recall and precision values for the 5 product categories, comparing all four vectors.

By examining some categories more closely in order to better understand the above results, we discovered some interesting patterns where the use of this approach added value and others where it did not.

Considering the 'Sanitary, Laundry and Cleaning' category, we can conclude that using our approach there was a substantial improvement in terms of the recall metric, from 25 % using the statistical-based approach to 75 % using the Ontology-based approach. In this case, the usage of the ontological relations present in the domain Ontology (as shown in Table 5) improved the recall metric from 50 to 75 %.

Our evaluation also indicated that quite a few documents had minimal or no direct matching with Ontology-equivalent term instances, mostly because of an incomplete domain ontology model (further investment in extending the Ontology KB can address this issue to some extent) and the lack of a thorough method for removing word ambiguity during the matching process (as explained previously).

It is possible for a domain Ontology to have no influence on the classification. Therefore, the goal is to do no worse than the statistical-based approach, whether the Ontology is relevant or wholly irrelevant.

Our evaluation dataset (intentionally) considered several categories with minor characteristic differences. For example, the contents of the 'Climate Control' and 'Electric Power and Lighting' categories have many similar predictor variables, or terms, which makes classifying and allocating documents to these categories a challenge. Statistical term vectors that rely solely on document contents can rarely reliably classify a document as falling into one category or the other.

Conclusions and future work

The paper's contribution targets the representation of KSs in various application areas for IR, including, importantly, the semantic web. Moreover, it can also support collaborative project teams by helping them identify relevant knowledge amongst a panoply of KSs, allowing knowledge to be better exploited within organizations and projects. Our contribution is to highlight the challenges of reconciling knowledge and to bring attention to the need for further research on the relationship between actors as social subjects and the way knowledge can be formalized and represented to the community. We anticipate that including this relationship in research efforts will lead to more effective sharing, exchanging, integrating, and communication of KSs among actors through the employment of IT.

This work specifically addresses the sharing and reuse of knowledge representations within collaborative engineering projects from the building and construction industry, adopting a conceptual approach supported by semantic services. The knowledge representation enrichment process is supported by a semantic vector holding a classification based on ontological concepts. Illustrative examples showing the process are part of this paper.

The intuition behind our work was to alter term vectors by strengthening the discriminative terms in a document in proportion to how strongly related they are to other terms in the document (where relatedness includes all possible relationships modelled in an Ontology). A side effect of the process was the weeding out of less important terms. Since ontologies model domain knowledge independently of any document


Fig. 13 Overall recall values for 5 categories

Fig. 14 Overall precision values for 5 categories

corpus, there is also the possibility of introducing relevant new terms into the term vector that are highly related to the document but not explicit in it.

The results achieved so far and presented here do not reflect a final conclusion on the proposed approach and are part of on-going work that will evolve and mature over time. Nevertheless, preliminary results indicate that including the additional information available in domain ontologies in the process of representing KSs can enrich and improve knowledge representations. Additional evaluation needs to be undertaken to reach more formal conclusions, including devising additional metrics for assessing the performance of the proposed method. However, we can conclude that Ontologies do help improve the precision of a classification.

As described earlier, additional methods are required to reduce word ambiguity by taking account of context when matching terms within the statistical vector with the

equivalent terms present in the domain Ontology. At the moment, the comparison is performed using the cosine similarity algorithm, which may lead to inconsistencies, as mentioned earlier.
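For reference, the cosine comparison between a document semantic vector and a query vector can be sketched as follows, with sparse vectors represented as term-to-weight dictionaries; the weights in the example are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors
    (dicts mapping term -> weight). Returns a value in [0, 1] for
    non-negative weights; 1.0 means identical direction."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = {"climate": 0.8, "control": 0.5}   # a document semantic vector (hypothetical)
query = {"climate": 1.0}                 # a query vector (hypothetical)
sim = cosine_similarity(doc, query)
```

Because the measure considers only term overlap and weights, two vectors with no shared terms score 0 even when they are semantically related — the kind of inconsistency the context-aware matching mentioned above is meant to reduce.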

The domain Ontology is presently seen as something static that does not evolve over time with new knowledge. The approach being exploited is to extract new knowledge from KSs (new concepts and new semantic relations) and to reflect such new knowledge in the domain Ontology. The idea for accomplishing this is the adoption of algorithms for learning association rules, correlating the co-occurrence of terms within the document corpus. Such measures can be considered an estimation of the probability of terms being semantically related. The weights of such semantic relations should also be updated every time new KSs are introduced into the corpus KB. The intent is, therefore, that new ontological concepts and relations from new sources


should be inserted and managed dynamically to support an evolving domain Ontology through a learning process.

References

Braines, D., Kalfoglou, Y., Smart, P., Shadbolt, N., & Bao, J. (2008). A data-intensive lightweight semantic wrapper approach to aid information integration. 4th International Workshop on Contexts and Ontologies (C&O 2008). Patras.

Braines, D., Jones, G., Smart, P., Bao, J., & Huynh, T. D. (2009). GIDS: Global Interlinked Data Store. 3rd Annual Conference of the International Technology Alliance (ACITA'09). Hyattsville: International Technology Alliance.

BuildingSmart (2012). IFD Library for BuildingSmart. http://www.ifd-library.org/index.php?title=Home_Page. Accessed September 3, 2012.

Castells, P., Fernandez, M., & Vallet, D. (2007). An adaptation of the vector-space model for ontology-based information retrieval. IEEE Transactions on Knowledge and Data Engineering, 19(2), 261–272.

Chen, C.-L., Tseng, F., & Liang, T. (2010). An integration of WordNet and fuzzy association rule mining for multi-label document clustering. Data & Knowledge Engineering, 69, 1208–1226.

Costa, R., Figueiras, P., Paiva, L., Jardim-Gonçalves, R., & Lima, C. (2012). Capturing knowledge representations using semantic relationships. The Sixth International Conference on Advances in Semantic Processing. Barcelona, Spain: IARIA.

Dandala, B., Mihalcea, R., & Bunescu, R. (2013). Word sense disambiguation using Wikipedia. In I. Gurevych & J. Kim (Eds.), Theory and Applications of Natural Language Processing (pp. 241–262). Berlin: Springer.

Dascal, M. (1989). Artificial intelligence and philosophy: The knowledge of representation. Systems Research, 6, 39–52.

Dascal, M. (1992). Why does language matter to artificial intelligence? Minds and Machines, 2, 145–174.

Drineas, P., Frieze, A., Kannan, R., Vempala, S., & Vinay, V. (2004). Clustering large graphs via the singular value decomposition. Machine Learning, 56, 9–33.

Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. International Conference on Information and Knowledge Management (pp. 148–155). Washington: ACM.

El-Diraby, T., & Celson, L. (2005). Domain taxonomy for construction concepts: Toward a formal ontology for construction knowledge. Journal of Computing in Civil Engineering, 19(4), 394–406.

El-Diraby, T. (2012). Epistemology of construction informatics. Journal of Construction Engineering and Management, 138, 53–65.

Figueiras, P., Costa, R., Paiva, L., Jardim-Gonçalves, R., & Lima, C. (2012). Information retrieval in collaborative engineering projects: A vector space model approach. Knowledge Engineering and Ontology Development Conference (pp. 233–238). Barcelona, Spain: INSTICC.

Firestone, J., & McElroy, M. (2003). Key issues in the new knowledge management. Burlington: Butterworth-Heinemann.

Floridi, L. (2004). Open problems in the philosophy of information. Metaphilosophy, 35, 554–582.

Grilo, A., & Jardim-Goncalves, R. (2010). Value proposition on interoperability of BIM and collaborative working environments. Automation in Construction, 522–530.

Gruber, T. (1993). Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies, 907–928.

IEEE (1990). Standard computer dictionary: A compilation of IEEE standard computer glossaries. The Institute of Electrical and Electronics Engineers.

IRB (1986). Fraunhofer ICONDA Bibliographic.

ISO 12006-3 (2006). Building construction—Organization of information about construction works: Part 3: Framework for object-oriented information. Switzerland: International Organization for Standardization.

Kalfoglou, Y., Smart, P., Braines, D., & Shadbolt, N. (2008). POAF: Portable ontology aligned fragments. International Workshop on Ontologies: Reasoning and Modularity (WORM 2008). Tenerife.

Li, S. (2009). A semantic vector retrieval model for desktop documents. Journal of Software Engineering and Applications, 2(1), 55–59.

Lima, C., Silva, C., Duc, C., & Zarli, A. (2006). A framework to support interoperability among semantic resources. In D. Konstantas, J.-P. Bourrières, M. Léonard, & N. Boudjlida (Eds.), Interoperability of Enterprise Software and Applications (pp. 87–98). London: Springer.

Lima, C., & El-Diraby, T. (2005). Ontology-based optimisation of knowledge management in e-Construction. ITcon, 10, 305–327.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Berkeley: University of California Press.

Meila, M. (2006). The uniqueness of a good optimum for K-means. International Conference on Machine Learning (pp. 625–632). Pittsburgh: ACM.

Nagarajan, M., Sheth, A., Aguilera, M., Keeton, K., Merchant, A., & Uysal, M. (2007). Altering document term vectors for classification: Ontologies as expectations of co-occurrence. 16th International Conference on World Wide Web (pp. 1225–1226). Alberta: ACM.

Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. New York: Oxford University Press.

Noy, N., & McGuinness, D. (2002). Ontology Development 101: A guide to creating your first ontology. Technical Report, Stanford: Knowledge Systems Laboratory.

Noy, N. F., & Hafner, C. (1997). The state of the art in ontology design. AI Magazine, 53–74.

OCCS Development Committee Secretariat (2013). OmniClass: A strategy for classifying the built environment. http://www.omniclass.org/. Accessed September 3, 2012.

Paiva, L., Costa, R., Figueiras, P., & Lima, C. (2013). Discovering semantic relations from unstructured data for ontology enrichment: Association rules based approach. 8th Iberian Conference on Information Systems and Technologies. Lisbon: IEEE.

RapidMiner (2012). Rapid-I GmbH.

Rezgui, Y. (2006). Ontology-centered knowledge management using information retrieval techniques. Journal of Computing in Civil Engineering, 20(4), 261–270.

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24, 513–523.

Sarraipa, J., Jardim-Goncalves, R., & Monteiro, A. (2008). MENTOR: A methodology for enterprise reference ontology development. 4th International IEEE Conference on Intelligent Systems (IS '08).

Sarraipa, J., Jardim-Gonçalves, R., & Steiger-Garção, A. (2010). MENTOR: An enabler for interoperable intelligent systems. International Journal of General Systems, 39(5), 557–573.

Stanford Center for Biomedical Informatics Research (2013). Stanford's Protégé Home Page. http://protege.stanford.edu/. Accessed September 3, 2012.

Subramanya, A., & Bilmes, J. (2008). Soft-supervised learning for text classification. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1090–1099). Honolulu, Hawaii: Association for Computational Linguistics.


Uschold, M., & Jasper, R. (1999). A framework for understanding and classifying ontology applications. IJCAI-99 Workshop on Ontologies and Problem-Solving Methods. Stockholm: CEUR Publications.

W3C (2012). OWL Web Ontology Language Reference. http://www.w3.org/TR/owl2-overview/. Accessed September 3, 2012.

Wimmer, H., & Zhou, L. (2013). Word sense disambiguation for ontology learning. 19th Americas Conference on Information Systems. Chicago.

Xia, T., & Du, Y. (2011). Improve VSM text classification by title vector based document representation method. The 6th International Conference on Computer Science & Education. Singapore: IEEE.

Zhang, J. (2010). A social semantic web system for coordinating communication in the architecture, engineering and construction industry. Toronto: University of Toronto.
