
Lifecycle Models of Data-centric Systems and Domains
The Abstract Data Lifecycle Model

Knud Möller
DERI, National University of Ireland, Galway
E-mail: [email protected]

Abstract. The Semantic Web, especially in the light of the current focus on its nature as a Web of Data, is a data-centric system, and arguably the largest such system in existence. Data is being created, published, exported, imported, used, transformed and re-used, by different parties and for different purposes. Together, these aspects form a lifecycle of data on the Semantic Web. Understanding this lifecycle will help to better understand the nature of data on the SW, to explain paradigm shifts, to compare the functionality of different platforms, to aid the integration of previously disparate implementation efforts or to position various actors on the SW and relate them to each other. However, while conceptualisations on many aspects of the SW exist, no exhaustive data lifecycle has been proposed to our knowledge.

This paper proposes a data lifecycle model for the Semantic Web by first looking outward, and performing an extensive survey of lifecycle models in other data-centric domains, such as digital libraries, multimedia, eLearning, knowledge and Web content management or ontology development. For each domain, an extensive list of models is taken from the literature, and then described and analysed in terms of its different phases, actor roles and other characteristics. By contrasting and comparing the existing models, a meta vocabulary of lifecycle models for data-centric systems — the Abstract Data Lifecycle Model, or ADLM — is developed. In particular, a common set of lifecycle phases, lifecycle features and lifecycle roles is established, as well as additional actor features and generic features of data and metadata. This vocabulary now provides a tool to describe each individual model, relate them to each other, determine similarities and overlaps and eventually establish a new such model for the Semantic Web.

Keywords: data, data-centric, lifecycle, Semantic Web

1. Introduction

The Semantic Web — a web of data rather than documents, where automatic agents can make sense of information, aiding human users and preventing information overload — is still a young phenomenon, and at this point in time we cannot yet say what shape and form exactly it will take in the years to come. A crucial aspect during this development process is a common understanding of the anchor points of what this web of data sets out to be, regardless of specific technologies, languages or systems. Without this understanding, there is a danger that individual efforts will be incompatible, that there is a duplication of efforts or that the effort as a whole will derail. It is therefore necessary to establish a comprehensive conceptual model and architecture of the Semantic Web: "the architecture of any system is one of the primary aspects to consider during design and implementation thereof, and the [...] architecture of the Semantic Web is thus crucial to its eventual realisation" [Gerber et al., 2008]. This architecture and model will then provide the required anchor points and allow for a common understanding.

So far, a number of basic foundations have been laid and are quite well understood: (i) A layer cake of languages and technologies has been proposed — and since been changed and extended many times — to implement the vision of the Semantic Web. (ii) The Semantic Web is rooted in the World Wide Web, and so the technical infrastructure of how communication takes place on the latter also applies to the former. (iii) Suggestions have been made to extend the architecture specification for the WWW, in order to incorporate ideas such as self-describing data, that underlie the Semantic Web.

While all these components are part of what we can eventually consider a complete conceptual model for the Semantic Web, other aspects have not yet been taken into account. One of the core assumptions about the Semantic Web is that it is a web of data — therefore, one of the missing pieces must be to establish what constitutes data on the Semantic Web, what the role is that it plays, and what its features are. Data is a living thing that moves through various stages, such as creation, publishing, use or termination; data is at the centre of the Semantic Web. Therefore, this paper contributes to the overall task of establishing a conceptual model of the Semantic Web by proposing an exhaustive lifecycle model of data on the Semantic Web. We are not aware of any definition of lifecycles specific to data — however, a generic definition of the term is "life cycle, n. [...] In extended use: a course or evolution from a beginning, through development and productivity, to decay or ending." [oed, 2011].

The lifecycle model is established in three main steps: (i) In light of the fact that the Semantic Web is, at its heart, a data-centric system, we begin in Sect. 2 with an extensive survey of lifecycle models and similar relevant literature from other data-centric domains and systems. (ii) As a second step, in Sect. 3, we then move to a higher level of abstraction by distilling a meta-model and vocabulary of terms (phases, features, roles, etc.) from the survey. We call this model the Abstract Data Lifecycle Model (ADLM). In defining the ADLM, we also revisit the surveyed lifecycle models and classify them according to the abstract model. (iii) Finally, in Sect. 4 we turn our attention to the Semantic Web as a concrete example of a data-centric system and apply the ADLM to it, in effect proposing a descriptive lifecycle model of data for the Semantic Web.

2. Data Lifecycles in Data-centric Domains

The idea of a lifecycle for data has been discussed for various different domains in which the generation and use of data and metadata is of central importance, in other words: data-centric domains. Examples of such domains are digital libraries, multimedia, eLearning, knowledge and Web content management systems or ontology development. In this section we perform a survey of examples from each domain. It should be noted that the choice of subsections is mainly an aid to arrange the mass of lifecycle models into manageable chunks. Other than that, the classification into different domains is not our main interest. In fact, some of the lifecycle models could easily be applied to other domains as well, while others already transcend specific domains right from the outset.

2.1. Lifecycles for Multimedia

[Hardman, 2005] sets out to find a canonical description of the processes involved in (any) media production. For each of those processes, different tools are available to the user. However, since a unifying model of media production as a whole is not available, the inputs and outputs of the different processes are often not compatible, in particular regarding metadata, thus making it impossible to devise an integrated tool chain. The incentive for proposing a canonical model then is to facilitate the integration of processes and tools. To illustrate the kind of integration they have in mind, the authors make a comparison with UNIX pipes.

The authors define nine different processes, together with the inputs and outputs of each — in fact, the formal definition of inputs and outputs is what clearly sets Hardman's model apart from all other models featured in this paper. Inputs and outputs can be complex objects, and the model is formalised by typing each component and identifying it with an id. The processes are (i) premeditate — decisions, inspirations and thought processes that take place prior to the creation of actual media content (e.g., who, where, why or what), (ii) capture — the process of physically capturing a piece of media content, (iii) archive — storing and indexing a media asset, alongside its annotations, (iv) annotate — adding arbitrary information to a particular media asset, (v) query — retrieving a media asset from an archive, (vi) message construction — defining the intended meaning of the media object currently in production, (vii) organise — composing a set of media assets to a larger document, (viii) publish — converting a document into a format ready for external use, and (ix) distribute — making a media document available to external users. An interpretation of the arrangement and interaction of these processes is shown in Fig. 1.


Fig. 1. Processes in media production, after Hardman

The various kinds of xxxID in the figure denote named data objects which are part of the input or output of a process, while arrows indicate how the different processes can be linked through their inputs and outputs. Even though this is not conveyed in the figure, some of those data objects can themselves be complex objects. E.g., an annotation referenced by an annID will contain a reference to an ontology or vocabulary (ontID), as well as a term from that vocabulary (attID). Similarly, an archived media asset (compID) will contain a reference to the physical media asset (medID).
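To make the role of typed, identified inputs and outputs more tangible, the following sketch re-expresses a small slice of Hardman's process chain in Python. The class and function names (MediaAsset, Annotation, ArchivedComponent, capture, annotate, archive) are our own illustrative simplification, not Hardman's formal notation; only the idea that every object carries an identifier and that the output of one process becomes the input of another is taken from the model.

```python
# Illustrative sketch of Hardman-style typed process inputs and outputs.
# Names and structure are our own simplification, not the original formalisation.
from dataclasses import dataclass, field
from itertools import count
from typing import List

_counter = count(1)

def new_id(prefix: str) -> str:
    """Mint a simple identifier such as 'medID-1' or 'annID-2'."""
    return f"{prefix}-{next(_counter)}"

@dataclass
class Annotation:            # output of the annotate process
    annID: str
    ontID: str               # reference to the ontology/vocabulary used
    attID: str               # term taken from that vocabulary

@dataclass
class MediaAsset:            # output of the capture process
    medID: str
    source: str

@dataclass
class ArchivedComponent:     # output of the archive process
    compID: str
    media: MediaAsset        # reference to the physical media asset (medID)
    annotations: List[Annotation] = field(default_factory=list)

def capture(source: str) -> MediaAsset:
    return MediaAsset(medID=new_id("medID"), source=source)

def annotate(ontID: str, attID: str) -> Annotation:
    return Annotation(annID=new_id("annID"), ontID=ontID, attID=attID)

def archive(media: MediaAsset, annotations: List[Annotation]) -> ArchivedComponent:
    return ArchivedComponent(compID=new_id("compID"), media=media, annotations=annotations)

# The output of one process feeds the next, mirroring the arrows in Fig. 1:
asset = capture("interview.mp4")
ann = annotate(ontID="example-ontology", attID="example-term")
component = archive(asset, [ann])
```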

The production model described by Hardman is not necessarily a repetitive lifecycle, but rather a non-circular workflow. The publish and distribute processes are considered to be the end of the workflow: "Once a document structure is published it is no longer part of the process set." However, the authors also acknowledge that there can be circular sequences: "In a number of cases, the results of a process can feed back into a different process in a different role." An example of this is the idea that an annotation can itself be considered a media asset, which could be archived, annotated, etc. Similarly, a document as the output of the organise process can be considered a media asset and as such be archived into the system again. These kinds of transformations are indicated by the dotted arrows in the figure.

[Kosch et al., 2005] discusses the lifecycle of metadata in the context of multimedia systems (here focussing on the multimedia database management system CODAC¹). Metadata in this context relates to both structure and format of media resources (e.g., resolution, sample rate, color scheme, etc.), as well as their content (what is shown in a video, what can be heard in an audio recording). Kosch characterises those two kinds of metadata as "metadata for content adaption" and "metadata for search and retrieval", respectively (see Sect. 3.5).

¹ http://www-itec.uni-klu.ac.at/~harald/codac/, retrieved 31/10/2010

The lifecycle model presented by the authors is somewhat unusual, in that it introduces so-called lifecycle spaces as one of the central concepts, a term used to denote the different divisions of a multimedia system in which metadata has relevance. Three such spaces are identified: the content space, metadata space and user space. The content space is the main anchor point and is concerned with the multimedia data itself. It comprises a production/creation, postproduction, delivery (e.g., through streaming) and consumption stage. The metadata space on the other hand is divided into metadata production and metadata consumption. Metadata production takes place during content production/creation and post-production, and both phases can produce content-related as well as structural metadata. Metadata consumption takes place during content delivery and content consumption. Finally, within the user space the authors identify content providers/producers, processing users and end users. Both content providers and processing users are involved in metadata production, and both can generate content-related as well as structural metadata. On the other hand, end users are only involved in metadata consumption. This of course also means that the authors do not assume that end users contribute to the media's metadata through means such as tagging. Figure 2 illustrates our own interpretation of how the different spaces are related.

Even though the authors use the term "metadata lifecycle", their model does not contain an immediately apparent circular flow. However, the CODAC system contains router and proxy elements which potentially reproduce metadata used to adapt multimedia data to the preferences or context of a particular user. According to the authors, this in effect closes the metadata lifecycle.

MPEG-7 is adopted as the sole metadata schema for CODAC, apparently with the implicit assumption that it suffices for the purpose of multimedia data. For this reason, the question of ontology creation or validation of a chosen terminology is not considered.


Fig. 2. Metadata lifecycle spaces in multimedia, after Kosch et al.

2.2. Lifecycles in eLearning

[Catteau et al., 2006] investigate the lifecycle of data in the context of eLearning. Their approach is to first give an overview of a number of other lifecycle models for learning objects (LO), from which they then distil their own common terminology and model. The end result is an elaborate model describing a combined lifecycle of the actual LOs and their associated metadata.

The model consists of nine different phases, which are related to the lifecycle stages of the reviewed models. Those terms are: initiation, conception, realisation, classification, validation, diffusion, usage, feedback and termination. Strictly speaking, initiation takes place before any actual work on the LO begins, and is therefore outside of the cycle. Only in conception does the work begin in earnest. Various data elements are defined, as well as their relations to other LOs. Realisation marks the point when the actual content of the LO is produced, while classification positions it in the context of a classification system such as the Dewey Decimal system. Validation is defined as a means of quality control by experts, which may lead to the LO being pushed back to the classification, realisation or even conception step. In diffusion, the LO is being introduced into a concrete learning management system, where it becomes available to learners in the usage step.

Fig. 3. Metadata lifecycles for learning objects, after Catteau et al.

Fig. 4. Metadata lifecycles for learning objects, after Millard et al.

User reactions and comments are analysed during feedback, which may again lead to changes of the LO in terms of conception or realisation, or even to the complete termination, in case it is deemed unsuitable or unsuccessful.

In Fig. 3, the thick arrows indicate the main flow of the proposed lifecycle model, while thin lines indicate changes after "unfavourable evaluation" in the validation or feedback stage. Furthermore, the authors identify three stages which involve changes to the LOs themselves (indicated by a book icon), while there are seven different stages which involve changes in the LO metadata (indicated by an RDF icon).

Another, much simpler model is introduced in [Millard et al., 2006]. The goal of their "Knowledge Life Cycle" is to facilitate the lifting of learning objects and their metadata to a level of formal semantics. They propose four basic stages: Knowledge acquisition, which is the conceptualisation of a knowledge domain with the help of domain experts, knowledge modelling, which is the formalisation of the conceptualisation into an ontology language, knowledge annotation, which is the annotation of learning objects with the ontology and finally knowledge reuse, which is usage of the annotated LOs in applications such as search, automatic composition of courses, personalisation, etc. Millard's lifecycle model is circular in the sense that one can have several iterations, with an additional knowledge maintenance step in between reuse and acquisition (see Fig. 4). Apart from its simplicity, a crucial difference between Catteau's model and Millard's is that the former completely omits ontology creation, whereas the latter gives it a very prominent position (by taking up half of the cycle).

2.3. Lifecycles in Digital Libraries

Another domain for which data lifecycles are relevant is digital libraries. [Chen et al., 2003] describe a very detailed and elaborate 10-step methodology for setting up metadata systems for digital libraries (or more generically: collections of digital artefacts). The lifecycle model is divided into four general groups: Requirement Assessment and Content Analysis, System Requirement Specification, Metadata System, and Service and Evaluation. Each of those groups is again divided into a number of different steps, as shown in Fig. 5.

The Requirement Assessment and Content Analysis phase covers aspects such as defining the basic metadata needs of the task at hand (e.g., schedule, scope or function), as well as deep needs (ontologising the domain, defining search strategies, etc.) and a review of candidate metadata standards. Since the authors assume a situation where data from legacy systems will be migrated into the new system, another aspect of this phase is the analysis of the legacy system in terms of access options, suitability for transformation, etc. After this, the System Requirement Specification phase will produce a detailed specification document (the Metadata Requirement Specification, or MRS), covering all aspects which were dealt with in the assessment phase. Also covered by this phase is an evaluation of both existing metadata platforms (software) for use in the desired target system and the possibility to develop such a system from scratch. In the Metadata System phase, two things take place: a best practice guideline for applying the MRS to specific metadata elements is prepared, and the development or setting up of the system which was decided upon in the previous step is being done.

Fig. 5. Metadata lifecycle for digital libraries, from Chen et al.

Finally, in the Service and Evaluation phase, a metadata service model (including a service mechanism, different user roles and their relationships) is put into place to ensure the smooth running of the system. Also, an evaluation of the system takes place, which can feed back into a new round of the cycle.

If we compare Chen's model to the other models discussed in the paper, it becomes obvious that it isn't really a lifecycle model for data, but rather a lifecycle model for ontology (see Sect. 2.5) and systems development. All data and metadata in the system comes from translating resources from legacy systems into a format defined by the new ontology. The creation of new instance data does not feature anywhere in the model.

In [Novácek et al., 2007], a lifecycle model for ontology development which was previously defined in [Novácek et al., 2006] is applied to the domain of digital libraries. This model, too, focusses on the ontology development aspect of digital libraries. While much simpler than [Chen et al., 2003] (including creation, evaluation, negotiation and versioning phases), it also specifically distinguishes between automatic creation of ontologies (ontology learning) through machines and manual, collaborative creation of ontologies through humans (see Sect. 3.4).


Fig. 6. Knowledge process, from Staab et al.

2.4. Lifecycles for Knowledge and Content Management

Knowledge and content management represents the largest subset of data lifecycles discussed in this paper. The domain is defined very loosely, comprising both traditional knowledge management and lifecycles with a specific Web or even Semantic Web focus.

[Staab et al., 2001] propose a metadata lifecycle for knowledge management (KM) in organisations. In fact, they define two such cycles, which are intertwined: the knowledge process and the orthogonal knowledge metaprocess. The former denotes creation, retrieval and use of actual knowledge (not simply in the form of documents, but as knowledge items). On the other hand, the knowledge metaprocess denotes the process of planning and implementing an infrastructure and platform for knowledge management, including ontology creation and management. For both cycles, the authors assume that a soft- and hardware infrastructure is already in place.

Within the knowledge process (i.e., the lifecycle of data), the authors identify four different steps: Create, Capture, Retrieve/Access and Use, as depicted in Fig. 6. In addition, Import is identified as a fifth step, which allows new, external data (e.g. from documents) to be introduced. Both Creation and Import are characterised by the fact that new knowledge items come into the system.

Fig. 7. Knowledge metaprocess, from Staab et al.

Knowledge items have varying degrees of formality (from free text documents over templated documents to formal knowledge structures), and can thus be said to cover both data and metadata in the classical sense. In the Capture phase, the new knowledge items are then integrated into the system: this can mean indexing, additional annotation, establishing relations to other knowledge items, etc. Retrieve and Access usually involves querying and browsing the organisation's knowledge base. In other words, this phase defines a typical search task of a knowledge worker, at the end of which they have retrieved a number of knowledge items. The authors stress that, once knowledge items are found, the job is not done: instead, another important phase is Use, which can span aspects such as personalisation, pro-active access, integration with other applications or aggregation of separate knowledge items into a new whole. Aggregation and inference can lead to new knowledge and thus close the cycle.

The knowledge metaprocess is mostly concerned with ontology development for a particular organisation or use case. It consists of a Feasibility Study, Ontology kickoff, Refinement, Evaluation and Maintenance. The Feasibility Study takes place before the actual work on the ontology begins. Its purpose is to properly specify the use case at hand, define the organisational context and identify any problems which might appear along the way. The purpose of the kickoff phase is to produce a requirements document, which will serve as a guide throughout the project, elaborating on aspects such as goal, domain and scope, supported applications, available knowledge sources, users and external ontologies which could potentially be reused. In the Refinement phase, the ontology engineers will then start to implement the specification. The suggested approach is to start with a simple baseline taxonomy, move on to a so-called "seed ontology", and finally arrive at the target ontology, which is expressed in a formal representation language.


For the Evaluation phase, the designers refer back to the specification document to check their ontology, test it within the targeted applications and gather feedback from beta users. Finally, the Maintenance phase ensures that the ontology always stays up-to-date and reflects changes in the real world. Both in the Evaluation and Maintenance phases, user input and feedback may lead to further refinement, as indicated by the loops in Fig. 7.

[Greenberg, 2003] introduces a framework for the description of generic metadata generation. She focusses on the different processes, people (roles) and tools involved. While not strictly speaking a complete lifecycle model, her framework still provides important input for at least part of such a model. In terms of processes, Greenberg establishes the very broad dichotomy of human metadata generation vs. automatic metadata generation. While the first was historically the only form of metadata generation, the latter is becoming increasingly important, especially considering the huge amounts of data on the Web and elsewhere today. In human metadata generation, the different classes of persons which are proposed in the framework are professional metadata creators (e.g. professional librarians), technical metadata creators (e.g. library assistants), content creators (e.g. authors) and community or subject enthusiasts² (e.g. readers). While those classes can be arranged on a vector of decreasing professionality, they are not necessarily disjunct. In terms of tools, the framework distinguishes between three different kinds: human beings, standards & documentation and devices. The latter are the technical means to actually capture metadata, and are further divided into templates, editors and generators. Templates are defined as simple forms to input metadata, while editors will aid the user with links to documentation, syntactical help, etc. Generators, finally, will produce metadata automatically and can often be found as components in editors.

² Greenberg relates an interesting account of a project at the Fine Arts Museum of San Francisco, which involved voluntary community enthusiasts to help assigning keywords to a corpus of 20,000 images — an early, pre-Web 2.0 example of collaborative tagging.

[Lee and Hong, 2002] propose a simple lifecycle model for knowledge management in organisations. The model comprises four rather high-level steps, which are discussed by suggesting the different information technologies (IT) which facilitate them: (i) Knowledge capture is defined as "the process by which knowledge is obtained and stored". Facilitating ITs are database systems, data warehouses and document management systems. (ii) Knowledge development is the organisation and analysis of data "for strategic or tactical decision making". Examples of ITs used in this step are data mining tools, OLAP (online analytical processing) and competitive intelligence systems. (iii) Knowledge sharing means distribution of knowledge and is enabled by group support systems and communications technology such as EDI (electronic data interchange), e-mail, voice mail, video conferencing and electronic bulletin boards. (iv) Knowledge utilisation, finally, is the use of knowledge "without computer knowledge" by end users in the organisation. ITs used in this step are vaguely defined as GUI-driven end user applications. Multimedia is also mentioned as an enabler technology.

In a general paper on Web content management (WCM) systems in organisations, [McKeever, 2003] suggests a lifecycle model for WCM as a special case of KM. In her model, collection and delivery are two processes which are repeated iteratively for as long as the organisation is involved in WCM. Collection is the authoring or creation of content, while delivery means the publishing or deployment of data. In addition to those two processes, the workflow and control and administration processes have an ongoing supportive function. The former enables collaboration and manages steps such as approval, development, etc., while the latter includes the identification of user roles, groups, management of the system, security and similar aspects.

On the fringes of knowledge management, [Higgins, 2008] proposes a lifecycle model for digital curation, to be promoted by the UK Digital Curation Centre (DCC)³. It is one of the most complex models in this survey, distinguishing between three different types of lifecycle actions, from full lifecycle actions, which take place continually, over sequential actions, which conform more to the traditional lifecycle concept and form the bulk of the model, to occasional actions, which only take place potentially at certain moments in the lifecycle. Full lifecycle actions are mainly administrative and preparatory, while the latter two kinds of actions deal with the data directly.

³ http://www.dcc.ac.uk/, retrieved 31/10/2010


As full lifecycle actions, the DCC defines description and representation information, similar to ontology development phases of other models, preservation planning, which provides planning for the management and administration of the regular lifecycle actions, community watch and participation, which ensures involvement in related activities (e.g., standards development), and finally curate and preserve, which drives the actual management and administration of the lifecycle. Sequential actions are conceptualise, preceding the creation of data, create and receive, either creating new data or importing existing data, appraise and select, evaluating and selecting data for long-term curation, ingest, archiving data in a repository, preservation action, which includes data cleaning and validation or assigning preservation metadata, store, another form of archiving, access, use and reuse, making data accessible to potential users, and finally transform, meaning to create new versions of data already present in the lifecycle. The three occasional actions are dispose, which can take place after appraise and select in order to remove data from the lifecycle, reappraise, which is a loop back from preservation action to appraise and select, and finally migrate, which moves directly from preservation action to transform, skipping the intermediate actions.

In [Mödritscher, 2009], a very simple lifecycle called MAAME is proposed for semantic applications. The name is derived from the five phases which constitute the lifecycle: modelling, application, authoring, mining and evaluation. The modelling phase covers all elements of ontology modelling, whereas application denotes the process of defining how semantic information will be used in a concrete application, i.e., what is the role of semantics in that application. In authoring, instance data is created manually, while mining creates data automatically from other sources, through techniques such as natural language processing or machine learning. Finally, evaluation encompasses all aspects of feedback about the lifecycle, including the model, the application itself and its instance data. Mödritscher discusses the lifecycle itself only at a very high and abstract level, but then grounds it by mapping four different concrete applications to it.

[Hausenblas, 2011] loosely defines a lifecycle for linked data [Berners-Lee, 2006], with the express purpose of categorising different R&D projects in a research group. In this sense, the six phases of data awareness, modelling, publishing, discovery, integration and use cases are used. The first of these phases should probably be seen separately, in that it does not concern a particular unit of data or a particular dataset, but rather the awareness towards linked and open data in general.

2.5. Ontology Lifecycles

A special case of data lifecycles are ontology lifecycles, i.e., lifecycle models that describe aspects such as creation, integration or reuse of ontologies (as opposed to instance data). In fact, some of the metadata lifecycles presented in the previous sections, e.g. Chen et al., turn out to be ontology lifecycles, while others at least include the concept of ontology creation prominently (Staab et al.; Millard et al.; Mödritscher; Hausenblas). A significant number of what can be considered lifecycle models for ontologies have been published under the label of methodologies for ontology development and often do not use the term "lifecycle" at all.

[Fernández-López and Gómez-Pérez, 2002] give a comprehensive overview of different ontology development methodologies based on their compliance with the IEEE Standard for Developing Software Life Cycle Processes (1074-1995) [Schultz, 1996]. The methodologies are categorised as (i) building ontologies from scratch, (ii) re-engineering ontologies or (iii) collaborative methodologies. For (i), the authors include the Cyc methodology, Uschold and King, Grüninger and Fox, KACTUS, their own METHONTOLOGY [Fernández et al., 1997; Gómez-Pérez et al., 1996] and the SENSUS-based methodology. For (ii), they outline an approach of their own, while for (iii), the CO4 and KA2 methodologies are included. Each methodology is evaluated in terms of whether or not it implements the different processes defined by the IEEE standard. Depending on how many processes are considered by a given methodology, it is given a maturity rating (the more processes, the more mature). The authors' own METHONTOLOGY comes out as the "most mature approach". Additionally, each approach is characterised according to the kind of lifecycle model it proposes (sequential, incremental or evolving), as presented in [Fernández et al., 1997] (see also Sect. 3.1). [Fernández et al., 2000] discuss how the (development) lifecycles of different ontologies can touch due to integration and reuse. Considering the "confluences and forking of life cycles", the authors present a method to make the relations between ontologies explicit and define the steps necessary to integrate them.


Other, more recent examples of ontology development lifecycles include work done in the Knowledge Web project⁴ (e.g., [Novácek et al., 2006]) and the NeOn project⁵.

⁴ http://knowledgeweb.semanticweb.org/o2i/, retrieved 31/10/2010
⁵ http://www.neon-project.org/, retrieved 31/10/2010

2.6. Lifecycles in Databases

Databases are not an application domain such as the ones presented in the previous sections, but rather an enabling technology. For this reason, they are only briefly mentioned here. Nevertheless, databases are an area where data lifecycles are of central importance, and the CRUD operations [Kilov, 1990] in particular are relevant. The CRUD acronym denotes the four fundamental atomic operations common to persistent database systems, i.e., (i) create, (ii) read, (iii) update and (iv) delete. The point of view of CRUD is that these are all individual, low-level operations, rather than phases in a sequential lifecycle. However, there is still an implicit temporal element to them, in that each unit of data first needs to be created before it can be read, updated or deleted. Also, all four operations can be easily mapped to the lifecycle model proposed in the following Sect. 3. Indeed, it is easy to argue that CRUD in fact does represent a simple lifecycle for databases.
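As a minimal illustration of this reading of CRUD as a rudimentary lifecycle, the following Python sketch runs a single record through create, read, update and delete against an in-memory SQLite database; the items table and its columns are a hypothetical example of ours, not taken from [Kilov, 1990].

```python
import sqlite3

# In-memory database; the 'items' table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")

# Create: a new unit of data enters the system (and is made persistent).
conn.execute("INSERT INTO items (label) VALUES (?)", ("first draft",))

# Read: the data is accessed.
row = conn.execute("SELECT id, label FROM items").fetchone()
print(row)  # (1, 'first draft')

# Update: the data is refined, and the change is made persistent.
conn.execute("UPDATE items SET label = ? WHERE id = ?", ("revised draft", row[0]))

# Delete: the data reaches its end of life and leaves the system.
conn.execute("DELETE FROM items WHERE id = ?", (row[0],))
conn.close()
```

Read top to bottom, the four statements already exhibit the implicit temporal ordering noted above: the record must be created before it can be read, updated or deleted.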

While CRUD originates from databases, where it is often used at the basic record level, the term is now also widely used to describe the functionality of applications at a higher, even user interface level. In this interpretation, an application dealing with data entities of a particular type is only considered complete if its UI at least supports the four CRUD operations.

3. The Abstract Data Lifecycle Model

Looking at the survey of lifecycle models presented in Sect. 2, we can see that a significant number of such models have been proposed for different domains. Some models have been designed from scratch, while others have been based on surveys of previous models (in particular McKeever and Catteau et al.). However, none of the literature presented has looked beyond its respective domain and proposed a generic lifecycle model for (meta)data. On the one hand, this is understandable, considering that the individual research communities are often disjoint. On the other hand, even though it is not always possible to perform a direct one-to-one mapping between the different models, there are striking similarities among all of them. Each model has its own specific focus, and while each domain is concerned with a different kind of data, they all deal with the same thing: data. For this reason, it is feasible to devise a domain-agnostic (meta)data lifecycle model based on the previous section, which will help to compare and point out differences and similarities between different systems with respect to their data lifecycles. In this approach, the abstraction is derived from specific instances. The alternative approach, as e.g. adopted in [Leake and Wilson, 1998] for the domain of case base maintenance, would be to begin with the abstraction and then use it to classify a selection of instances in the pertinent domain⁶.

⁶ The bottom-up approach chosen in this paper is not in general better or worse than the top-down approach. However, it ensures a broader coverage and wider applicability, while the top-down approach may be more suitable for a restricted, specific use case.

The meta-model, which we call the Abstract Data Lifecycle Model (ADLM), consists of five parts: (i) a set of phases which will generalise the steps and processes defined in the individual models in the survey (Fig. 8), (ii) a set of features which can be used to define lifecycle models further (overview in Tab. 2), (iii) a set of roles describing the different actors in the model (Fig. 8), (iv) a set of features describing the actors in the model and (v) a set of features describing the data and metadata found in a lifecycle. Finally, each of the lifecycle models in the survey will be categorised in terms of the meta-model.
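Purely as a notational convenience, the phase and role inventories of the ADLM (as shown in Fig. 8 and defined in the following subsections) can be written down as two simple enumerations; the Python rendering below is our own shorthand, not part of the model itself.

```python
from enum import Enum

class Phase(Enum):
    """Lifecycle phases of the ADLM (defined in Sect. 3.1)."""
    ONTOLOGY_DEVELOPMENT = "Ontology Development"   # precedes the core cycle
    PLANNING = "Planning"
    CREATION = "Creation"
    ARCHIVING = "Archiving"
    REFINEMENT = "Refinement"
    PUBLICATION = "Publication"
    ACCESS = "Access"
    EXTERNAL_USE = "External Use"
    FEEDBACK = "Feedback"
    TERMINATION = "Termination"

class Role(Enum):
    """Actor roles of the ADLM (Fig. 8)."""
    DATA_CREATOR = "Data Creators"
    METADATA_CREATOR = "Metadata Creators"
    ONTOLOGY_DESIGNER = "Ontology Designers"
    ADMINISTRATOR = "Administrators"
    END_USER = "End Users"
```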

3.1. Lifecycle Phases

The choice of phases for the meta-model is based on the survey in Sect. 2. While some of the reviewed models go into a lot of detail and include domain-specific aspects, or aspects which pertain to specific implementation matters, other models are very high-level and generic. Both extremes are undesirable for the purpose of building a comprehensive meta-model, in the sense that they are either over- or under-specific. However, it is still possible to map all models covered in the survey to the meta-model.

Like the models covered in the survey, the ADLM has transitions between the different phases. In many cases, these transitions are possible in both directions, allowing the phases to be passed through in many different orders, including recursively.


[Figure: ADLM phases — Planning, Creation, Archiving, Refinement, Publication, Access, External Use, Feedback, Termination, and Ontology Development; roles — Data Creators, Metadata Creators, Ontology Designers, Administrators, End Users]
Fig. 8. Phases and roles of the Abstract Data Lifecycle Model (ADLM)

While there is no explicit versioning phase or aspect to the ADLM, one implication of the recursiveness is that versioning of data is possible within the ADLM.

Ontology Development In this phase the formal domain model for (meta)data is defined. As the previous sections show, the literature contains several lifecycle models which are dedicated solely to the domain of ontology development (notably Chen et al. and the methodologies in Sect. 2.5), whereas others do not consider ontology development at all. There are a few examples in which ontology development is seen as a component in a larger structure, such as Millard et al., in which knowledge acquisition and knowledge modelling are the two initial steps in the five-step cycle, or Staab et al., where the knowledge meta-process is one of the two cycles which make up the model as a whole. Nevertheless, the reason why ontology development is not featured at all in some of the literature is not that it is considered unimportant, but simply that it is considered to be something that has taken place a priori to the lifecycle of instance data. E.g., Kosch et al. simply assume that MPEG-7 will be used, while Catteau et al. suggest to use the Dewey Decimal (DD) or a similar system. However, both MPEG-7 and the DD have obviously been designed at some stage and are in some form of lifecycle themselves. While it may be less important in some domains, ontology development is conceptually still a pre-requisite to any data lifecycle in a data-centric system, which is why it is included in the ADLM. It is, however, placed outside the core part of the lifecycle. It can also be considered a lifecycle of its own, but one which must precede every other (meta)data lifecycle at some point.

If, in a particular lifecycle model, ontology development is included as a phase, this inevitably means that the lifecycle is not that of a particular piece of data or metadata, but instead of a complete system as a whole, which includes both ontology and instance data. This characteristic sets the ontology development phase apart from all other phases, which is reflected in Fig. 8 by using a dotted line for its outline.

Planning Planning is a phase which precedes the actual lifecycle of data and describes the moment when the intent to create data takes a concrete form, but before the data is created as part of the system. As Hardman puts it, planning is "outwith the system", but it can already result in a number of facts which later may become part of the data, e.g. ownership, intent, etc. It is obvious that some form of planning must always take place, but only a few models in the literature name it explicitly. In Catteau et al. this phase is called initiation, while Hardman has two different processes which fall under planning: one is called premeditate, and denotes the planning of creating a single unit of data, while message construction explicitly means the planning and intent of combining several pieces of data to a larger whole. From the point of view of the ADLM, both processes shall be considered the same (but possibly during different iterations).

Creation The Creation phase defines the moment when new data or metadata is created in terms of the system at hand. This data is either genuinely new and has not existed in any other (formal) format before, or it has been imported from another system — the main point is that it did not exist within the lifecycle's system before the creation phase. All models discussed above contain a creation step: the multimedia model in Kosch et al. calls it Production/Creation, while Hardman names it capture (here in the sense of "capturing a piece of media with a recording device"). Within eLearning, Millard et al. include a knowledge annotation step (which in fact spans the phases of data creation, archiving and data refinement, see below), whereas Catteau et al. split creation into the two separate conception and realisation steps. In knowledge management, Lee and Hong call this phase knowledge capture, McKeever collection and Staab et al. have create. The latter also distinguish explicitly between the creation of new data and the import of existing, but external data, i.e., the transformation of existing data from one format into the format and infrastructure of the lifecycle's system. Finally, this phase in combination with the following archiving phase is represented by the create operation in the CRUD set of operations, since it also entails making the newly created data persistent in the system.

Archiving Archiving denotes the process of anchoring a piece of data within the system by the means of indexing, cataloguing or a similar activity. Not all models covered in the survey feature anything that can be mapped to this phase. However, Hardman introduces an archive step, whereas Staab et al. suggests a capture step, which combines aspects of the archiving, refinement and publication phases (see below). In McKeever, archiving is covered by the supporting process control and administration.

Refinement The refinement phase covers all kinds of activities which make additions or changes to data that already exists within the system. In a very general sense, this can mean to annotate data, which incidentally is the name of one of the steps within the refinement phase in Hardman, the other one being organise, which means the combination of several data items to a larger whole. Of course, depending on whether data and metadata are distinguished, annotation as refinement could also be interpreted as another iteration of the creation phase: if all data in the lifecycle is considered equal, then adding new annotations to a piece of data is nothing else than creating new data. This is reflected in Hardman by the fact that the output of the annotate process can be new data that can enter the system. In Kosch et al., refinement is named post-production, a term typically used in the media domain, for which this model was made. Catteau et al. emphasise the act of classification as a specific kind of refinement. Both post-production and classification can be considered specialised, restricted kinds of annotation, i.e., refinement. On the other hand, Lee and Hong introduce the very generic knowledge development process, which denotes various kinds of data analysis activities. The knowledge annotation and capture steps in Millard et al. and Staab et al., respectively, both contain aspects of refinement, but are not limited to this phase. Like archiving, McKeever covers refinement as part of the control and administration process. In Greenberg, refinement is reflected as the general process of metadata generation and further distinguished into human metadata generation and automatic metadata generation⁷. In CRUD, the update operation entails a combination of refinement and archiving, since any changes made to data are immediately made persistent in the system.

Publication The publication phase is the moment when data is made accessible to the users either within or outside of the system, or both. Most of the lifecycle models in the survey have a dedicated slot for publication: For multimedia data in Kosch et al. this is called delivery, while Hardman splits this phase up into a publish (which really is just preparing a piece of data for publication) and a distribute (the actual act of publication) step. In eLearning, LOs are being made available to the system in the diffusion step. In Staab et al., publication is part of the very broad capture step, while Lee and Hong reserves a dedicated slot with knowledge sharing and McKeever with the delivery step.

⁷ This distinction is extended to a general feature of lifecycle actors in the model proposed here; see Sect. 3.4.


Access Access denotes the moment in the lifecycle when parties from either within or outside of the system gain access to the data in the system, e.g., by means of a query or through browsing. This phase is covered explicitly in Hardman by the query step, and in Staab et al. by the retrieve/access step. Kosch et al. and Catteau et al. also recognise access as a necessary step in the lifecycle, as consumption and usage, respectively, but do not distinguish it from actual use (see below). The read operation of CRUD is also an instance of access in terms of the ADLM.

External Use Whereas access merely means to retrieve data, external use implies that the user then performs some further actions with it, such as export into other systems or software or aggregation. It should be stressed that this phase explicitly means usage of the system's data outside of it. If data is used and changed within the system, this is a case of refinement. As mentioned above, use falls together with access in both Kosch et al. and Catteau et al., whereas Staab et al. include a dedicated use step. Also, the model in Millard et al. includes a dedicated knowledge reuse step, and Lee and Hong suggest the term knowledge utilisation.

Feedback The feedback phase allows users of a system to comment on the data or metadata they have previously accessed and used. In a strict view, this only makes sense in centralised lifecycles, where there is some sort of authority the users can give feedback to. Nevertheless, the multimedia model in Kosch et al. does not provide for any kind of explicit feedback. In Catteau et al., different kinds of users (experts and end users) provide feedback at different levels in the feedback and validation steps. Staab et al. also provide feedback as evaluation, but only for the ontology which was devised in the knowledge metaprocess. In the same sense, Chen et al. envisages a Service & Evaluation phase, which provides feedback on the ontology and overall performance of the system. Similarly, almost all dedicated ontology development methodologies provide the opportunity for feedback, usually in order to gain consensus within a community. The knowledge maintenance phase in Millard et al. has a similar purpose.

Termination Finally, the termination phase presents the moment when data is removed from the system. In other words, it is the "end of life" phase (a term taken from product lifecycles). Surprisingly, of all the models in the survey, only Catteau et al. provide a slot for the termination of data. Termination is, however, one of the four CRUD operations, i.e., the delete operation.
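To illustrate how this phase vocabulary can be used to describe and compare individual models (the information summarised in Table 1), the following sketch records the phase mappings of two of the surveyed models — CRUD and Staab et al. — as plain Python dictionaries. The mappings are taken from the discussion above; the dictionary representation itself is merely an illustrative choice of ours.

```python
# Phase mappings for two of the surveyed models, expressed against the
# ADLM phase names (see Table 1 for the full comparison).
ADLM_PHASES = [
    "Ontology Development", "Planning", "Creation", "Archiving", "Refinement",
    "Publication", "Access", "External Use", "Feedback", "Termination",
]

crud = {
    "Creation": ["create"],
    "Archiving": ["create", "update"],  # both operations make data persistent
    "Refinement": ["update"],
    "Access": ["read"],
    "Termination": ["delete"],
}

staab = {
    "Ontology Development": ["knowledge meta-process"],
    "Creation": ["create", "import"],
    "Archiving": ["capture"],
    "Refinement": ["capture"],
    "Publication": ["capture"],
    "Access": ["retrieve/access"],
    "External Use": ["use"],
    "Feedback": ["evaluation"],  # applies to the ontology only
}

# Phases both models cover, and phases CRUD does not cover at all:
shared = [p for p in ADLM_PHASES if p in crud and p in staab]
not_in_crud = [p for p in ADLM_PHASES if p not in crud]
```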

3.2. Lifecycle Features

There are a number of dichotomies and other characteristics that can aid in classifying different lifecycle models and which are useful in pointing out the differences and similarities between the various lifecycle models presented in Sect. 2.

Distinction Data vs. Metadata An important point in looking at data lifecycles is the question whether or not a distinction between data and metadata is made. Are both kinds of data first-class citizens? A number of models in the survey clearly make this difference: in multimedia, Kosch et al. distinguishes between multimedia resources and metadata about those. In eLearning, both Catteau et al. and Millard et al. consider learning objects and annotations on them as different elements in their lifecycle models. While McKeever doesn't discuss metadata in Web content management in detail, she does seem to distinguish the two when she states that part of the control and administration process is the "ability to specify metadata". On the other hand, Hardman initially distinguishes between data (media objects) and metadata (annotations on media objects), but leaves the possibility open that annotations themselves can become first-class citizens (see Sect. 2.1). Also in the model in Staab et al., the distinction is not so clear. Instead, all data in this model are knowledge items with a varying degree of formality, ranging from unstructured text over templated data to structured data objects. Finally, Lee and Hong only use the generic term "knowledge" when describing the data in their lifecycle. Since activities such as data mining can lead to new knowledge, we will assume that a clear distinction between data and metadata is not made in their model.

Prescriptive vs. Descriptive The dichotomy of prescriptive vs. descriptive is typically used in linguistics, where it denotes two different approaches to language in general, and grammar in particular. Prescriptive linguistics attempts to define a set of rules for the “correct” use of language, whereas descriptive linguistics tries to analyse a language in its actual use, in order to find rules and regularities within it (see e.g. [Crystal, 1997]). Applied to the domain at hand, we use the term prescriptive for lifecycle models which try to establish a set of steps which are then suggested for use by others8, while a descriptive model will look at a given system and find a lifecycle in it.

8 In a very combative position paper, [McCracken and Jackson, 1982] make the point that, applied to the area of software development, prescriptive lifecycle models may in fact be considered a bad idea in many circumstances.


Table 1
Comparison of Metadata Lifecycles — Phases
(rows are ADLM meta-model phases; columns are Kosch (Multimedia) | Hardman (Multimedia) | Catteau (eLearning) | Millard (eLearning) | Staab (KM) | Lee (KM) | McKeever (WCM) | CRUD; “—” marks phases without a counterpart)

Ontology Development: — | — | — | knowledge acquisition, knowledge modelling | (knowledge metaprocess) | — | — | —
Planning: — | premeditate | initiation | — | — | — | — | —
Creation: production/creation | capture | conception, realisation | knowledge annotation | create, import | knowledge capture | collection | create
Archiving: — | archive | — | knowledge annotation capture | — | — | control and administration | create and update
Refinement: post-production | annotate, organise | classification | knowledge annotation capture | — | knowledge development | control and administration | update
Publication: delivery | publish, distribute | diffusion | — | capture | knowledge sharing | delivery | —
Access: consumption | query | usage | — | retrieve/access | — | — | read
External Use: consumption | — | usage | knowledge reuse | use | knowledge utilisation | — | —
Feedback: — | — | feedback, validation | knowledge maintenance | (evaluation) | — | — | —
Termination: — | — | termination | — | — | — | — | delete


Using this terminology, many of the models presented in Sect. 2 can be called prescriptive, because they suggest to the reader a methodology of how the lifecycle of data or metadata should be handled. Hardman presents an interesting exception to this trend: “The processes should not be viewed as prepackaged, ready to be implemented by a programmer. Our goal is rather to analyse existing systems to identify functionality they provide”. Lee and Hong and McKeever both address an audience of decision makers or experts in a business position, rather than one of implementers and developers. For this reason, they simply report what they conceive to be lifecycle processes in knowledge and Web content management, rather than suggest a particular model to use, and should therefore be considered descriptive.

Homogeneous vs. Heterogeneous We will say that a lifecycle is homogeneous when the data in the system it describes is homogeneous, i.e., when its schema or ontology is known beforehand, and no data of unknown form or using unknown vocabulary terms will typically enter the cycle. In contrast, a lifecycle is heterogeneous when the data in the system it describes is heterogeneous, i.e., when its form or ontology is not known beforehand. Most lifecycle models presented in the survey are examples of homogeneous lifecycles, since they only deal with a specific kind of data or metadata. Kosch et al. is restricted to MPEG-7, while in Catteau et al. only a specific vocabulary for learning objects is used. Both Staab et al. and Chen et al. do not prescribe the vocabulary or ontology to be used for data and/or metadata in the system, but assume that a previous ontology creation phase clearly defines which ontology will be used for any metadata in the system. The same is true for Millard et al., who assume that a domain ontology will be modelled in the first phases of the lifecycle, which will then be used to annotate the data in the system (in this case learning objects). Neither Lee and Hong nor McKeever explicitly define what the form of either data or metadata in the systems they describe should be. This means that a clear classification in terms of homogeneous and heterogeneous cannot be made. An exception is Hardman: while it is true that the data in this lifecycle is restricted to some form of media asset, no restrictions are given on the form or terminology used for annotations over this data. Also, even the restriction to media as data is softened somewhat in the sense that annotations are allowed to become media assets in their own right.

Open vs. Closed This feature is related to the homo-/heterogeneous dichotomy. Open and closed describe whether a system allows arbitrary data from the outside to enter its lifecycle, which wasn’t initially meant for this particular system. In this sense, the model described in Staab et al. is an example of an open lifecycle, since it includes an explicit import stage for the inclusion of external data. All other models in the survey should be considered closed lifecycles, because they do not provide any import stage. At first sight, this statement does not seem to hold for Hardman, in the sense that the two planning steps premeditate and message construction take their inputs from outside the system. However, this kind of input actually only represents the intent or context of the author for becoming active in the system. To consider this an example of an open lifecycle would render the feature meaningless, since all lifecycles would then have to be characterised as open.

Centralised vs. Distributed This last dichotomy describes the physical nature of a system and the lifecycle of its data and metadata: if it resides in a single, centrally controlled infrastructure, we will call it centralised. If it is spread over a network with no single point of control, we will call it distributed. Most systems described in Sect. 2 are examples of centralised lifecycles. One might argue that, if users can access the system online, those systems are still distributed. However, since the circular flow of data resides in a centralised system, we will still consider them to be centralised. Again, Hardman is an exception: her lifecycle does not describe a concrete system at any one place, but rather the lifecycle of media data as it is processed by various tools. Where those tools reside, how they are connected and how the data flows between them is left open, therefore making the lifecycle a distributed one. Due to their vague nature, both Lee and Hong and McKeever could describe either a centralised or a distributed system.

Lifecycle Type [Fernández et al., 1997] and [Fernández-López and Gómez-Pérez, 2002] suggest three general types of lifecycle models, based on how and under what circumstances a new iteration of the lifecycle is reached and how data in the system is changed: (i) sequential, (ii) incremental and (iii) evolving. A sequential model [Royce, 1987] (also referred to as waterfall model) denotes a kind of lifecycle in which each phase or step can only be reached if the preceding one is completely finished. This also means that a new iteration of the cycle can only be started when all steps have been gone through. In contrast, the incremental model allows the start of a new iteration whenever the totality of data in the system is lifted to a new version. This may happen even before the cycle has been finished completely. Finally, the evolving model implies that data in the system can change at any time, meaning that new iterations can be started at any time. Consequently, it would also mean that several iterations of the lifecycle can be ongoing at the same time.

Granularity Since every lifecycle operates on some kind of system, we can also consider how much of that system is affected by a single iteration of the cycle. By going through the individual steps of the cycle, are we manipulating all data in the system or only parts of it? The answer to this question defines the granularity of the lifecycle: a model where all data is affected in each iteration has a coarse granularity, whereas a model where only portions of the data are affected has a fine granularity. Most lifecycles in the survey should be considered fine-grained models, in the sense that they allow a single iteration for each data object in the system, be it a media asset, a learning object, a document, etc. However, those models which contain a dedicated ontology development phase in the beginning appear to be rather coarse-grained: we first define the domain ontology, then apply it to a corpus of data, then get feedback on the ontology and its application, then revise the ontology based on the feedback, etc. In the survey, examples of this kind of coarse-grained lifecycle are Millard et al. and Chen et al.

There also seems to be a connection between the granularity of a lifecycle and its type. A lifecycle of the evolving type would also tend to be fine-grained (otherwise the whole system would be in a constantly changing state, with each state existing alongside the others), whereas a sequential lifecycle would tend to be more coarse-grained.

3.3. Lifecycle Roles

Based on the survey in Sect. 2 we propose five different roles (see [Mossé, 2002]) in this section, which can be played by actors in the ADLM. Together with the lifecycle phases and features, these roles will aid in describing and classifying various lifecycle models, and ultimately outline the lifecycle model for the Semantic Web. With respect to the relationship between actor and role, it should be noted that the same actor in any particular system for which the lifecycle is applied can play several roles at once. E.g., the same actor could, with their data creator hat on, first plan and create a data object, and then, with their administrator hat on, archive and publish the data.

In addition to the models in the survey, we also consider the roles established in a prescriptive proposal for the lifecycle of URIs on the Semantic Web [Booth, 2009], which are URI owner (authority to mint URIs), statement author (using URIs in statements) and consumer (reading and interpreting an RDF statement).

Each role will typically be involved in specific phases from the lifecycle meta-model, as shown in the overview in Fig. 8.

Ontology Designers An actor who is involved in the ontology development phase of the lifecycle plays the role of an ontology designer. As argued previously, the ontology development phase can often be seen as a preceding lifecycle of its own. For the same reasons, when the ontology is considered data, the ontology designer role could be argued to comprise all other roles within it.

Data Creators The role of data creator implies that the actor who performs it is creating the primary kind of data in the system at hand. Depending on the system, this could mean the creation of media assets, learning objects, documents, etc. In the literature, this kind of role has been given different names: Greenberg, in her discussion of processes, people and tools, calls this role content creator, whereas in Kosch et al., data creators are identified as content providers/producers in the tripartite user space. In Booth’s proposal, the most suitable mappings for the data creator role are URI owner and statement author. Within the lifecycle meta-model, data creators partake in the planning and creation phases.

Metadata Creators Metadata creators are those actors who annotate the primary data in a system with metadata. This role is called processing user in the user space in Kosch et al. and statement author in the proposal by Booth. Greenberg uses the same term as was chosen for the ADLM, but further distinguishes metadata creators according to their level of professionality. In the lifecycle model, we have chosen to express professionality as a role feature instead. Metadata creators also perform planning, after which they either create new metadata, or refine existing data or metadata.


Table 2
Comparison of Metadata Lifecycles — Features
(columns are Kosch (Multimedia) | Hardman (Multimedia) | Catteau (eLearning) | Millard (eLearning) | Chen (Dig. Library) | Staab (KM) | Lee (KM) | McKeever (WCM))

Distinction Data vs. Metadata: yes | no | yes | yes | yes | no | no | yes
Prescriptive vs. Descriptive: prescriptive | descriptive | prescriptive | prescriptive | prescriptive | prescriptive | descriptive | descriptive
Homogeneous vs. Heterogeneous: homogeneous | heterogeneous | homogeneous | homogeneous | homogeneous | homogeneous | — | —
Closed vs. Open: closed | closed | closed | closed | closed | open | closed | closed
Centralised vs. Distributed: centralised | distributed | centralised | centralised | centralised | centralised | — | —
Lifecycle Type: evolving | evolving | evolving | sequential | sequential | evolving | evolving | evolving
Granularity: fine | fine | fine | coarse | coarse | fine | fine | fine


Obviously, the distinction between data creators and metadata creators will become meaningless when a particular lifecycle doesn’t distinguish between data and metadata. In that case, the two roles should be considered identical.

Administrators Administrators, in contrast to creators, handle data and metadata in the system without actually changing its shape or explicit meaning. In that sense, administrators are responsible for the archiving, publication and termination of data. Other activities, which are not covered in the meta-model, are those outlined in the control and administration support process in McKeever (security, monitoring, etc.).

End Users As the last role, end users are those actors who do not actively handle the data in the system, but instead passively receive and consume it. According to Kosch et al., end users are involved in “browsing, searching, and consuming” metadata and content. The equivalent in Booth is consumer. Within the meta-model, this means that this role is played in the access and external use phases, but also in the more active feedback phase, which allows the closing of the cycle. This gives the end user actor a pivotal role in the lifecycle, which is also played out in the fact that they can change their role and become a data or metadata creator (by re-using the data). By doing this, they will effectively move the data they are consuming back into the refinement or archiving phase, thus leading to a new iteration of the lifecycle.

3.4. Actor Features

Actor Professionality Apart from the roles defined in the paragraphs above, each actor playing a role can also be categorised according to their level of professionality. Borrowing the suggestion made in Greenberg, we propose two levels of professionality, which can be seen as the extreme points on a vector of professionality: professional and community or subject enthusiasts. Greenberg only applies these attributes to metadata creators, whereas in this model they will be applied to actors performing any of the five roles. Professionality is a function of individual actors paired with a role. I.e., an actor might be a community or subject enthusiast as a Data Creator, but professional as an Administrator.

Actor Humanness Any of the roles defined for data lifecycles can be performed by actors which are either human or machine agents. This categorisation is also highlighted by Greenberg, but restricted to metadata generation. In the model proposed here, humanness can be applied to actors performing any role of the lifecycle. Situations which involve any kind of semi-automatic operation should be modelled with two actors — one human and one machine9.

3.5. Metadata Features

As a final area of classification, we will present some features that have been suggested for the description of metadata in the literature.

Authoritative vs. Non-authoritative Where data and metadata are distinguished, metadata can be either authoritative or non-authoritative. Authoritative metadata will be used in preference over all other, non-authoritative metadata. The exact definition of both terms depends on the usage context; however, generally speaking, metadata is considered authoritative if it comes directly from the author or source of a piece of data. E.g., authoritative metadata about a digital image would be the metadata that comes directly from the creator of the image. On the other hand, metadata is considered non-authoritative if it originates from any source other than the original author of the data. E.g., tags created through the means of social tagging (i.e., by the community) would be considered non-authoritative metadata. For a discussion of authoritative metadata as part of Web architecture see [Fielding and Jacobs, 2006]. As an example of the relevance of this feature, [Hogan et al., 2008] propose the use of T-Box reasoning over authoritative statements only to tackle scalability issues in Web-scale reasoning.

Content-independent vs. Content-dependent This dichotomy is introduced as part of a classification scheme for metadata in [Kashyap and Sheth, 1996] and classifies metadata according to its content dependency. Examples of content-independent metadata are the modification date or author of a document, a logo, or sensor information for an image. Content-dependent metadata on the other hand is directly related to or even derived from the content of the data it describes. Examples of this kind of metadata are document size or image resolution.

9 Unless, of course, the actor is a cyborg. In this case, a human-machine category should be added to the model.


Direct Content-based vs. Content-descriptive A further classification of content-dependent metadata from Kashyap and Sheth describes how closely metadata is based on its data. Direct content-based metadata is directly derived from its data, such as a document full-text index or document vectors. Content-descriptive metadata on the other hand is an indirect description of data, such as textual descriptions or markup.

Domain-independent vs. Domain-specific Drilling down further in their classification of metadata, Kashyap and Sheth suggest this dichotomy to describe the domain dependency of content-descriptive metadata. Metadata which can be used to describe other data regardless of its domain is called domain-independent. A typical example would be format-related markup such as HTML. In contrast, domain-specific metadata is applicable only depending on the subject domain of its data. A domain ontology or domain-specific markup are examples.

A similar distinction is discussed in [Möller et al., 2006]. The authors there divide the domain of metadata into structural metadata and content metadata, where the former describes the internal and external structure of its data (and is therefore content-independent), whereas the latter describes the content of its data (and is therefore content-specific). Also Kosch et al. make this distinction by using the two opposing terms “metadata for content adaptation” (structural) and “metadata for search and retrieval” (content). Occasionally, domain-independent metadata is further differentiated into structural and administrative metadata (e.g., [METS, 2010]).

4. Applying the ADLM to the Semantic Web

In the following, we will revisit the different phases and features proposed in previous sections and discuss whether and how they are realised on the Semantic Web.

4.1. Semantic Web Lifecycle Phases

Ontology Development Ontologies are a central component of a self-describing, semantic Web. Based on the evidence found in the survey, the definition of the ontology development phase in the previous section stated that this phase should either be considered to be outside the lifecycle model (since it concerns a meta layer and therefore doesn’t directly consider the flow of data itself), or to be its own lifecycle altogether. While this is still true for the Semantic Web, there is, however, a slight modification: Vocabularies and ontologies on the SW on the one hand and primary data (RDF statements using the terms defined in the ontologies) on the other — or TBox and ABox in description logic terms — cannot be distinguished with respect to their form. Any document or set of statements can contain any mixture of definitions of ontological terms and instances of them. Therefore, data in the ontology development phase has to be interpreted in two ways: (i) Ontological data in the sense of a conceptualisation of a domain. In this sense, ontology development is a separate lifecycle, with its own set of phases and features, e.g., those defined in [Fernández et al., 1997]. Indeed, we shall consider it to be a separate instance of the ADLM, preceding that of the primary (or instance) data. (ii) Ontological data in the sense of its nature as RDF statements. In this sense, ontological data is indistinguishable from other data on the Semantic Web and part of the same lifecycle. Which of the two interpretations applies depends on the specific requirements of the scenario for which the ADLM is used.

Planning With respect to planning, the lifecycle of data on the Semantic Web is no different from that of any other system. Planning must precede any creation or refinement of data. It does not touch the data itself, and is therefore outside of the system. Examples of planning include (i) the decision process that leads to the actual creation of data, (ii) defining a URI pattern for any new resources that shall be created, or (iii) setting up the necessary technical infrastructure that might be needed for it, such as servers or tools to transform source data from other formats into RDF.

Creation In terms of the ADLM, data creation means introducing data into the lifecycle’s system which did not exist within the system before. For the purpose of this paper, we will say that any data which is not a set of RDF statements does not qualify as data within the Semantic Web. In a looser interpretation of “Semantic Web”, one could of course argue that data in various other formats can be integrated into the Semantic Web, be it by extraction, transformation or reference. However, for the purpose of this discussion, such data will only be considered once it has been expressed in terms of the system, i.e., as RDF statements. Creation of data on the SW in our model therefore means the creation of new RDF statements.
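To make this concrete, the following minimal sketch (not part of the original model; it assumes the Python rdflib library, and all URIs are hypothetical example.org identifiers) shows creation simply as the assertion of new RDF statements:

    # Creation phase: asserting new RDF statements, sketched with rdflib.
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")          # hypothetical vocabulary
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()
    # Each add() introduces one statement that did not exist in the system before.
    g.add((EX.galway, RDF.type, EX.City))
    g.add((EX.galway, FOAF.name, Literal("Galway")))

    print(len(g))  # -> 2 newly created statements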


Refinement Refinement means to change or make additions to existing data in some way. An example would be updating an RDF description of someone’s bibliography with references to new papers or articles. Also the process of ontology mapping should be classified as refinement, in the sense that one makes additional statements about ontology terms. Since, within the model discussed in this paper, all data is RDF, and since no difference is made between data and metadata (see Sect. 4.2), refinement and creation on the Semantic Web are technically the same. Making additions to existing data could tentatively mean making additions to an RDF statement. This is not possible, however, since RDF statements are immutable10. Making additions therefore has to mean creating new statements about a resource that is part of a previously existing statement. If the focus is on an RDF graph (a set of statements), rather than an individual statement, making additions then means creating a new statement to add to the graph. This point of view also captures the notion of database updates, applied to RDF. In other words, updates on the SW are the addition of statements to existing RDF graphs. In both cases — statement level and graph level — refinement is therefore a special case of creation.

However, for some contexts it is still useful to keep a distinction between the two phases. When a user creates a description of a real world entity in RDF, this description will in most cases be a set of statements. Conceptually, the user might consider creating this set of statements as one act of creation, saying e.g. “I created an RDF description of Galway”. When, at a later stage, this RDF graph is extended by adding more statements about Galway, the user might consider this to be an act of refinement, clearly distinct from the initial creation, saying “I have refined the description of Galway”. In other words: at statement level, there is only creation, while at graph level, there are creation and refinement.
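Continuing the hypothetical rdflib sketch from the creation phase (again purely illustrative, with made-up example.org data), refinement at graph level is then nothing more than adding statements to a graph that already describes the resource:

    # Refinement phase: extending an existing description of ex:galway.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

    g = Graph()
    # A previously created (and, e.g., archived) description of Galway.
    g.parse(data="@prefix ex: <http://example.org/> . ex:galway a ex:City .",
            format="turtle")

    # Graph-level refinement: new statements about the existing resource.
    g.add((EX.galway, GEO.lat, Literal("53.27")))
    g.add((EX.galway, GEO.long, Literal("-9.05")))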

Inferencing and reasoning, as central concepts of the Semantic Web, would be considered examples of creation and/or refinement: the application of inference rules implies creating new statements based on other, pre-existing statements. Since the new statements will most likely be related to the resources involved in the existing ones, inference will more often than not be an example of refinement.

10 Changing an RDF statement would create a new one, which could not be explicitly related to the original one. There is no concept of versioning in RDF.

Lifecycle aspects such as provenance and trust are essentially the addition of metadata to the lifecycle’s primary data, and should as such also be interpreted as a type of refinement. Alternatively, such metadata could have its own lifecycle, which is linked to the primary data lifecycle. This is similar to the way ontology development can be its own lifecycle.

Archiving The archiving phase in the SW data lifecycle constitutes making data persistent. In concrete terms, this can e.g. mean serialising data in a particular RDF serialisation format, or storing and indexing it in one of the many RDF databases available. In more general terms, archiving means to physically prepare the data in a way that allows it to be published (see next phase).
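In the same hypothetical rdflib setting as above (file names and data invented for illustration), archiving could simply mean writing the graph out in a concrete syntax, from where it can later be loaded into a store or served on the Web:

    # Archiving phase: making a set of RDF statements persistent.
    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.galway, RDF.type, EX.City))

    # Serialise the graph as Turtle; loading the file into an RDF database
    # later would constitute a second act of archiving.
    g.serialize(destination="galway.ttl", format="turtle")

    g2 = Graph()
    g2.parse("galway.ttl", format="turtle")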

Technically, creating or refining data and archiving it are closely related and hard to distinguish in the model. On the one hand, it could be argued that data, i.e., RDF statements, are created in a transient, non-persistent way, e.g., by simply thinking them or as an in-memory representation. However, there will hardly be any application that will facilitate creation but not archiving of some form, or an actor on the lifecycle that is responsible for the former, but not the latter. It is perfectly feasible to say that a user creates RDF data in a file, which is then loaded into a database by another user. However, this only means that data has been created and archived, and then archived a second time. The conclusion is that creation/refinement cannot occur on its own in any realistic scenario, whereas archiving can. It is still beneficial to distinguish between the two stages, e.g., when the model is applied for the analysis of a single application, in order to define a separation of concerns within it (which part of the code or UI is concerned with creation, and which is concerned with archiving).

Furthermore, there are cases where data is created and published immediately, without ever being archived. Examples of this are data wrappers which generate RDF on the fly from a source format (e.g., a GRDDL transformation), or inference engines which compute additional statements of an RDF graph based on RDFS or OWL entailment rules on the fly, i.e., without ever materialising those statements physically.

Publication Publication of data on the Semantic Web means to make it available on the World-wide Web. In other words, it means making it available at an HTTP URI on a Web server. In the case of data which has previously been archived, this can either happen in the form of a file that is served, or an interface to an RDF database (such as a SPARQL endpoint). Alternatively, as described in the previous section, RDF data can be created and published in one seamless step.
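In the simplest (purely hypothetical, local) case, publication of a previously archived file needs nothing more than an HTTP server that announces the appropriate media type, as in the following sketch:

    # Publication phase (minimal local sketch): serve archived Turtle files
    # over HTTP with the text/turtle media type.
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    SimpleHTTPRequestHandler.extensions_map[".ttl"] = "text/turtle"

    # http://localhost:8000/galway.ttl then acts as a (local) publication point.
    HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()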

Access From a technical perspective, the access phase in our model on the Semantic Web can be defined as the dereferencing of an HTTP URI by an agent, given that SW data has been published to and is served at that URI. Dereferencing here means to send an HTTP request to a server and receive an HTTP response in return. Access is the flip side of publication.

Access can occur in two different ways, which determine how the lifecycle is continued. If an agent accesses a URI on the Semantic Web and retrieves RDF data, this means it stays within the system. In this case, the data can then be refined (changed), archived and published again, possibly at a different location than before. Alternatively, an agent could also access a SW URI and receive an answer in a different format, such as an HTML page. In this case, the data would be exposed outside the system and enable external use.
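The first case can be sketched with rdflib (the URI is again a hypothetical example): dereferencing pulls the published statements back into a local graph, from where they can be refined, archived and published again.

    # Access phase: dereference a (hypothetical) Semantic Web URI.
    from rdflib import Graph

    g = Graph()
    # rdflib issues an HTTP GET, negotiating an RDF serialisation,
    # and parses the response into the local graph.
    g.parse("http://example.org/galway")
    print(len(g))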

External Use External use means the utilisation of data outside the system of the lifecycle. For the Semantic Web, this means to use RDF outside it (use within the Semantic Web means the flow of data through many of the other phases of the lifecycle: data is used on the SW when it is refined, when it is archived, published, or accessed). Examples of external use on the Semantic Web would be the aggregation of data from various sites in order to answer a particular query for an external agent, or the conversion and import of data into an external application, such as a desktop address book application. The latter scenario is described in [Möller and Decker, 2005].

Strictly speaking, even the rendering in a browser of a Web page derived from RDF data already constitutes a case of external use. However, in the case of an HTML page with embedded RDFa, an interesting situation arises, in which the same document is a hybrid of published Semantic Web data and external use as non-Semantic Web data. Also, it should be noted that external use can take place without the user actually realising it. E.g., a user could direct their Web browser at the URI http://data.semanticweb.org/conference/iswc/2008. Through content negotiation, this particular server would determine that the browser requires HTML and redirect to http://data.semanticweb.org/conference/iswc/2008/html, which is an HTML document. For the user, it would not be obvious that they have just accessed and used SW data.
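The content negotiation described here can be made visible by requesting the same URI twice with different Accept headers, as in the following sketch (using the Python requests library, and assuming the data.semanticweb.org server still responds as described above):

    import requests

    uri = "http://data.semanticweb.org/conference/iswc/2008"

    # Asking for RDF keeps the data within the Semantic Web ...
    rdf = requests.get(uri, headers={"Accept": "application/rdf+xml"})

    # ... while asking for HTML leads to external use of the same data.
    html = requests.get(uri, headers={"Accept": "text/html"})

    print(rdf.headers.get("Content-Type"), html.headers.get("Content-Type"))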

Table 3
Features of the Semantic Web data lifecycle

Distinction Data vs. Metadata: no
Prescriptive vs. Descriptive: descriptive
Homogeneous vs. Heterogeneous: heterogeneous
Closed vs. Open: open
Centralised vs. Distributed: distributed
Lifecycle Type: evolving
Granularity: fine

Feedback In the feedback phase end users give comments on the data they have accessed and used. In the ADLM, it is suggested that feedback implies a system which has a central authority to which the feedback can be addressed. Following this notion, there would be no feedback on the Semantic Web as such, since there is no central authority. However, it is obvious that the Semantic Web as a system at large has many different sub-systems which are the providers of individual services or datasets. Therefore, while users of the Semantic Web could not give feedback to the overall system (just as Web users cannot give feedback to the Web as such), they could certainly give feedback to the central authority of each sub-system. This feedback can then trigger actions on the side of the service provider, such as the termination or refinement of data.

Termination Termination, finally, means to remove data from the system. In a distributed system like the Semantic Web (just like on the World-wide Web), where data is constantly crawled, indexed and duplicated by various players, it is not possible (or at least very difficult) to completely terminate any piece of data, i.e., remove it from the system without any trace. E.g., while the publishers of a particular dataset on the SW might decide to remove a number of RDF statements from their site, it is very likely that this data has previously been crawled by a different service, and will therefore still exist in some form after the termination.

4.2. Semantic Web Data Lifecycle Features

Wrapping up the discussion on applying the ADLM to the SW, this section will characterise the lifecycle of data on the Semantic Web in terms of the features defined in Sect. 3.2. As an overview, all features are summarised in Tab. 3.

Distinction Data vs. Metadata We have stated that, for the purpose of this paper, we only consider RDF statements as data. Furthermore, in RDF semantics [Hayes, 2004] there is no conceptual difference between data and metadata. Consequently, the lifecycle model is characterised as having no distinction between data and metadata.

Prescriptive vs. Descriptive The lifecycle model is descriptive, simply observing the properties of the Semantic Web as they are perceived by the authors of this paper, and classifying them in the terms of the ADLM. It is not prescriptive because it does not propose any rules for how the lifecycle of data on the Semantic Web should look.

Homogeneous vs. Heterogeneous The data in the lifecycle’s system is heterogeneous, because no assumption is made about the schema or ontology which defines the various resource types, such as classes, instances or properties. In fact, agents do not have to know any data’s schema. Data can even be completely schema-less and still be useful. One could assume the opposite standpoint and argue that the data is in fact homogeneous, because only RDF is considered. However, this only means that the data is homogeneous syntactically. It is still heterogeneous semantically, which is what is considered by this feature.

Closed vs. Open With the restriction that any data entering the system first has to be expressed as RDF, the Semantic Web is an open system. Any data can be integrated, as anything can be said about anything [Klyne and Carroll, 2004], which means that there are no restrictions on the scope of domains or topics which can be discussed.

Centralised vs. Distributed The underlying architecture of the Semantic Web is that of the World-wide Web, which has no place for any one central authority. Consequently, and since it does not add any such authority itself, the Semantic Web is a distributed system.

Lifecycle Type The lifecycle type of the model proposed here is evolving, since there is no requirement for data to pass through the lifecycle completely before a new iteration can be started. In fact, there is no such thing as a “complete” iteration in the model. RDF data can be created, published, accessed, refined, archived, published, etc. These phases can be passed in many different orders and, in principle, an infinite number of times.

Granularity Finally, the Semantic Web data lifecycle is fine-grained, meaning that the cardinality of the set of statements considered by any particular instance of the lifecycle can be as small as 1. The lifecycle is equally applicable to individual statements and large graphs, which can both pass through any of the phases proposed in the ADLM.

4.3. Using the ADLM

Concluding the application of the ADLM to the Semantic Web, two examples shall illustrate how the model can be used in practice, by grounding discussion in a precise terminology.

Comparing Applications A brief look at two cases of SW-enabled websites will illustrate how the ADLM can be applied to analyse differences in the approach to the publication of data on the Semantic Web. Version 7 of the Drupal content management system features built-in RDFa functionality, and so a means to publish data on the Semantic Web. The “Semantic Web Dog Food” (SWDF) site for conference data and metadata (details in [Möller, 2009]) also publishes RDF to the Semantic Web, and is based on Drupal. At first sight, the two implementations seem to be doing more or less the same thing. However, the SWDF approach to publication and the Drupal 7 RDFa functionality are fundamentally different: where SWDF uses Drupal to publish pre-existing RDF to both the Semantic Web (in its original RDF form) and the traditional eye-ball Web (as HTML), Drupal 7 transforms and publishes the non-RDF contents of its own database to the Semantic Web as RDFa. Using the ADLM, we can more precisely describe SWDF as covering the creation phase (conference data is created in file format), the archiving phase (data is archived in an RDF store) and the publishing phase (data is published through Drupal). These phases are visited in sequence, as shown in Fig. 9. In Drupal 7, on the other hand, data is created on the fly from the Drupal database, and then injected as RDFa in a Drupal page. Again employing the ADLM’s terminology, we can explain that on the fly more precisely means that the publication phase follows the creation phase directly, with no intermediate archiving in play, as also shown in Fig. 9.

In a less detailed way, [Hausenblas, 2011] (see Sect. 2.4) also provides an example of how a lifecycle model (here a linked data lifecycle) can be used to classify and group applications, with the purpose of giving an overview of a broader field or of the work done within a research group.

Fig. 9. Different approaches to publication on the SW

Explaining Paradigm Shifts [Hausenblas, 2009] observes a (possible) paradigm shift from the (i) traditional publishing and usage of data to (ii) publishing and usage of data according to the principles of linked data. In both cases, data publishers use transformation templates to publish data from their internal databases to a webpage. However, in (i) the transformation is to traditional HTML (meaning that software agents have to employ screen scraping to re-discover structured data on the webpage), while in (ii) the transformation is to HTML+RDFa (meaning that software agents can take structured data directly from the page).

This paradigm shift can be nicely described using the ADLM: In the traditional scenario (i) the SW’s data lifecycle is only entered through the software agent’s screen scraper, which performs the creation phase of the lifecycle. On the other hand, in the linked data approach (ii) the lifecycle is already entered on the data publisher’s side, where the transformation template creates SW data, which is then published on the fly (as in the previous example). From there, a software agent is ready to simply access the data. In both cases, the software agent can then proceed to the archiving (e.g., a crawler), refinement (e.g., an aggregated query) or external use (e.g., visualisation in an infographic) phase.

5. Conclusion

The aim of this paper was two-fold: (i) to provide a wide overview of different approaches to lifecycle modelling of data-centric domains, and (ii) to establish a terminological framework and conceptual model for data and metadata lifecycles in data-centric domains. The overview was presented in the form of a survey of a significant number of different lifecycle models from the literature, ranging across different domains. The survey itself functions as a first port of call for researchers and developers aiming to design a lifecycle for a particular domain, by providing an overview of, and inspiration from, typical modelling approaches and practices.

Based on the survey, as well as additional literature, five areas of classification for data lifecycles have been identified and discussed, which together form the terminological framework and model, called the Abstract Data Lifecycle Model (ADLM): lifecycle phases, lifecycle features, lifecycle roles, actor features and metadata features. None of the individual models covered in the survey suffices as a generic model over any data-centric domain; the ADLM can fulfil this task. In addition to its primary aspects (phases, features and roles), the ADLM also enables the modelling of secondary aspects of data lifecycles, such as versioning (through repetition of the cycle) or provenance and trust (as metadata creation through the refinement phase).

The function of the ADLM as a meta model is to provide a means to classify, compare and relate other lifecycle models, as well as to provide the basis to construct new lifecycle models for other data-centric domains, applications or use cases. In the third part of the paper, this kind of use is illustrated by applying the model to the Semantic Web. This SW instance of the ADLM serves both to support the general purpose of defining a conceptual framework for the SW, as well as to show how application developers or researchers can use the framework to illustrate design issues or ground discussion in a common terminology.

With respect to other proposed lifecycles for the Semantic Web or Web of Data, no comprehensive proposal is known to the authors. However, a number of SW-related lifecycles have been discussed as part of Sect. 2.4 on lifecycles for knowledge and content management, as well as in Sect. 4.3 on using the ADLM.

Acknowledgments

The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2) and (in part) by the European project NEPOMUK No. FP6-027705 and (in part) by the European project FAST No. FP7-216048.

References

OED Online. Oxford University Press, June 2011. http://www.oed.com/viewdictionaryentry/Entry/108098.


Tim Berners-Lee. Linked data, 2006. URL http://www.w3.org/DesignIssues/LinkedData.html.

David Booth. The URI lifecycle in Semantic Web architecture. In Twenty-first International Joint Conference on Artificial Intelligence (IJCAI-09), July 2009.

O. Catteau, P. Vidal, and J. Broisin. A generic representation allowing for expression of learning object and metadata lifecycle. In Sixth International Conference on Advanced Learning Technologies, pages 30–32, July 2006.

Ya-Ning Chen, Shu-Jiun Chen, and Simon C. Lin. A metadata lifecycle model for digital libraries: methodology and application for an evidence-based approach to library research. In Proceedings of the 69th IFLA General Conference and Council, August 2003.

David Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, second edition, 1997.

Mariano Fernández, Asunción Gómez-Pérez, and Natalia Juristo. METHONTOLOGY: From ontological art towards ontological engineering. In Workshop on Ontological Engineering at AAAI97, Stanford, USA, Spring Symposium Series, 1997.

Mariano Fernández, Asunción Gómez-Pérez, and María Dolores Rojas Amaya. Ontology’s crossed life cycles. In Lecture Notes in Artificial Intelligence, number 1937, pages 65–79. Springer-Verlag Heidelberg, 2000.

Mariano Fernández-López and Asunción Gómez-Pérez. Overview and analysis of methodologies for building ontologies. Knowl. Eng. Rev., 17(2):129–156, 2002.

Roy T. Fielding and Ian Jacobs. Authoritative metadata. TAG finding, W3C, April 2006. URL http://www.w3.org/2001/tag/doc/mime-respect-20040225.

Aurona Gerber, Alta van der Merwe, and Andries Barnard. A functional Semantic Web architecture. In Proceedings of the 5th European Semantic Web Conference (ESWC2008), Tenerife, Spain, pages 273–287. Springer, June 2008.

Asunción Gómez-Pérez, Mariano Fernández, and Antonio J. de Vicente. Towards a method to conceptualize domain ontologies. In Workshop on Ontological Engineering at ECAI’96, pages 41–51, 1996.

Jane Greenberg. Metadata generation: Processes, people and tools. Bulletin of the American Society for Information Science and Technology, 29(2), 2003.

Lynda Hardman. Canonical processes of media production. In MHC ’05: Proceedings of the ACM workshop on Multimedia for human communication, pages 1–6, New York, NY, USA, 2005. ACM.

Michael Hausenblas. Data lifecycle, September 2009. Blog post: http://webofdata.wordpress.com/2009/09/14/data-lifecycle/.

Michael Hausenblas. Linked data research centre — projects, research, applications. Internal presentation at DERI, May 2011.

Pat Hayes. RDF semantics. Recommendation, W3C, February 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/ (accessed 29/05/2009).

Sarah Higgins. The DCC curation lifecycle model. The International Journal of Digital Curation, 3(1):134–140, June 2008.

Aidan Hogan, Andreas Harth, and Axel Polleres. SAOR: Authoritative reasoning for the Web. In 3rd Asian Semantic Web Conference, ASWC 2008, Bangkok, Thailand, volume 5367 of LNCS, pages 76–90, December 2008.

Vipul Kashyap and Amit Sheth. Semantic heterogeneity in global information systems: The role of metadata, context and ontologies. In M. Papazoglou and G. Schlageter, editors, Cooperative Information Systems: Current Trends and Directions, pages 139–178. Academic Press, June 1996.

Haim Kilov. From semantic to object-oriented data modeling. In First International Conference on Systems Integration (ICSI1990), April 1990.

Graham Klyne and Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and abstract syntax. Recommendation, W3C, February 2004. http://www.w3.org/TR/rdf-concepts/ (accessed 28/05/2009).

Harald Kosch, László Böszörményi, Mario Döller, Mulugeta Libsie, Peter Schojer, and Andrea Kofler. The life cycle of multimedia metadata. IEEE MultiMedia, 12(1):80–86, 2005.

David B. Leake and David C. Wilson. Categorizing case-base maintenance: Dimensions and directions. In Advances in Case-based Reasoning, volume 1488 of Lecture Notes in Computer Science (LNCS), pages 196–207, 1998.

Sang M. Lee and Soongoo Hong. An enterprise-wide knowledge management system infrastructure. Industrial Management & Data Systems, 102(1):17–25, 2002.

Daniel D. McCracken and Michael A. Jackson. Life cycle concept considered harmful. SIGSOFT Softw. Eng. Notes, 7(2):29–32, 1982.


Susan McKeever. Understanding Web content management systems: evolution, lifecycle and market. Industrial Management & Data Systems, 103(9):686–692, 2003.

METS. <METS> Metadata Encoding and Transmission Standard: Primer and reference manual. Technical report, Library of Congress, April 2010.

David E. Millard, Feng Tao, Karl Doody, Arouna Woukeu, and Hugh C. Davis. The knowledge life cycle for e-learning. International Journal of Continuing Engineering Education and Lifelong Learning: Special Issue on Application of Semantic Web Technologies in E-learning, 16(1/2):110–121, April 2006.

Felix Mödritscher. Semantic lifecycles: modelling, application, authoring, mining, and evaluation of meaningful data. International Journal of Knowledge and Web Intelligence, 1(1/2):110–124, 2009.

Knud Möller. The Semantic Web conference metadata project. In Lifecycle Support for Data on the Semantic Web, pages 73–111, September 2009. PhD Thesis, NUI Galway, Ireland.

Knud Möller and Stefan Decker. Harvesting Desktop Data for Semantic Blogging. In 1st Workshop on the Semantic Desktop at ISWC2005, Galway, Ireland, Proceedings, pages 79–91, November 2005.

Knud Möller, Uldis Bojars, and John G. Breslin. Using Semantics to Enhance the Blogging Experience. In The third European Semantic Web Conference, pages 679–696, Budva, Montenegro, June 2006.

Francis G. Mossé. Modeling roles — a practical series of analysis patterns. Journal of Object Technology, 1(4):27–37, September-October 2002.

Vit Novácek, Siegfried Handschuh, Diana Maynard, Loredana Laera, Sebastian Ryszard Kruk, Max Völkel, Tudor Groza, and Valentina Tamma. Report and prototype of dynamics in the ontology lifecycle. Deliverable 2.3.8v1, Knowledge Web, January 2006.

Vit Novácek, Sebastian Ryszard Kruk, Maciej Dabrowski, and Siegfried Handschuh. Extending Community Ontology Using Automatically Generated Suggestions. In 20th International Florida Artificial Intelligence Research Society Conference 2007, Casa Marina Resort, Key West, Florida, USA, May 2007.

W. W. Royce. Managing the development of large software systems: concepts and techniques. In ICSE ’87: Proceedings of the 9th International Conference on Software Engineering, pages 328–338, Los Alamitos, CA, USA, 1987. IEEE Computer Society Press.

David J. Schultz. IEEE standard for developing software life cycle processes. Technical Report 1074-1995, IEEE, 1996. http://ieeexplore.ieee.org/servlet/opac?punumber=3513.

Steffen Staab, Rudi Studer, Hans-Peter Schnurr, and York Sure. Knowledge processes and ontologies. IEEE Intelligent Systems, Special Issue on Knowledge Management, 16(1):26–34, January 2001.