TRANSCRIPT
Turun kauppakorkeakoulu • Turku School of Economics
The Federative Approach to Data Governance and Management in an Open Information Systems Environment
A Case Study on Data Governance and Management of Clinical
Breast Cancer Treatment Data
Master’s Thesis in Information Systems Science
Prepared by
Daniel Meriläinen (BBA)
90648
Supervisor
Tomi Dahlberg (Ph.D.)
September 30, 2016
Turku, Finland
TABLE OF CONTENTS
1 INTRODUCTION ................................................................................................... 9
1.1 Breast Cancer ............................................................................................... 10
1.1.1 Incidence .......................................................................................... 10
1.1.2 Prediction ......................................................................................... 11
1.1.3 Personal Identity Code ..................................................................... 12
1.1.4 Diagnosis Code of Cancer ............................................................... 12
1.1.5 TNM Staging System ....................................................................... 13
1.1.6 Dates of Events ................................................................................ 13
1.1.7 Cancer Survival Analysis ................................................................. 14
1.2 Data Management Systems Related to Cancer Treatment ........................... 14
1.3 Realization .................................................................................................... 15
1.4 Limitations ................................................................................................... 15
1.5 Research Questions ...................................................................................... 15
1.6 Structure of the Master’s Thesis .................................................................. 16
2 THEORETICAL BACKGROUND ..................................................................... 17
2.1 Federative Approach .................................................................................... 17
2.1.1 Benefits ............................................................................................ 20
2.1.2 Limitations ....................................................................................... 21
2.2 Golden Record Approach ............................................................................. 22
2.3 Comparison of Ontological Approaches ...................................................... 26
2.3.1 Advantages ....................................................................................... 27
2.3.2 Disadvantages .................................................................................. 28
2.4 Ontological Approach in the Case Study ..................................................... 29
3 LITERATURE REVIEW ...................................................................................... 30
3.1 Ontology ....................................................................................................... 30
3.2 Semantics ..................................................................................................... 33
3.3 Data Models ................................................................................................. 34
3.3.1 Conceptual Model ............................................................................ 35
3.3.2 Contextual Model ............................................................................ 36
3.4 Data Classification ....................................................................................... 37
3.4.1 Master Data ...................................................................................... 38
3.4.2 Metadata ........................................................................................... 42
3.5 Data Quality ................................................................................................. 46
3.6 Data Consolidation ....................................................................................... 49
3.6.1 Sharing ............................................................................................. 50
3.6.2 Mapping ........................................................................................... 50
3.6.3 Matching .......................................................................................... 50
3.6.4 Data Federation ................................................................................ 52
3.6.5 Data Integration ............................................................................... 54
3.6.6 Data Warehouse, Storage and Repository ....................................... 57
3.7 Data Management Framework ..................................................................... 58
3.7.1 Data and Corporate Governance ...................................................... 59
3.7.2 Data Management ............................................................................ 60
3.8 Discovery of Data from Large Data Sets ..................................................... 61
3.8.1 Data Mining ..................................................................................... 62
3.8.2 Big Data ........................................................................................... 65
3.8.3 Business Intelligence Systems ......................................................... 66
3.9 Cancer Data .................................................................................................. 67
3.10 Healthcare..................................................................................................... 69
4 METHODOLOGY ................................................................................................ 71
4.1 Case Study .................................................................................................... 71
4.1.1 Data Ontology .................................................................................. 74
4.1.2 Epistemology ................................................................................... 74
4.1.3 Paradigm ........................................................................................ 74
4.1.4 Methods............................................................................................ 75
4.1.5 Rhetoric ............................................................................................ 75
4.1.6 Triangulation .................................................................................... 75
4.2 Research Participation .................................................................................. 76
4.3 Artifact ......................................................................................................... 76
5 RESULTS .............................................................................................................. 78
5.1 Matrices of the Artifact ................................................................................ 79
5.2 Pattern to Implement the Artifact in the Case Study .................................... 81
6 DISCUSSION ........................................................................................................ 83
6.1 Contribution ................................................................................................. 83
6.2 Limitations ................................................................................................... 83
6.3 Future Research Questions ........................................................................... 84
7 REFERENCES ...................................................................................................... 85
List of Figures
Figure 1 Semantic View of the Federative Approach in Practice .............................. 20
Figure 2 Semantic View of the Golden Record Approach in Practice (1) ................. 24
Figure 3 Semantic View of the Golden Record Approach in Practice (2) ................. 25
Figure 4 Semantic View of the Golden Record Approach in Practice (3) ................. 26
Figure 5 Scope of MDM [25, pp. 46] ........................................................................ 39
Figure 6 MDM Registry Federation [21, pp. 28] ....................................................... 41
Figure 7 Generating the Golden Record [25, pp. 99]................................................. 57
Figure 8 Basic Idea of MBR [45, pp. 333]................................................................. 64
List of Tables
Table 1 Design Artifact of the Case Study................................................................. 77
Table 2 Data Federation Artifact - Identification of Shared Attributes ..................... 79
Table 3 Definition of Contextual Metadata Characteristics ....................................... 80
List of Abbreviations
ANSI American National Standards Institute
API Application Programming Interface
B.C. Before Christ
BI&A Business Intelligence and Analytics
CDI Customer Data Integration
CMM Capability Maturity Model
CRF Case Report Form
CRM Customer Relationship Management
CS Computer Science
CT Clinical Trial
CUH Central University Hospital
DAMA Data Management Association
DBA Database Administrator
DC Diagnosis Code
DC Dublin Core
DFD Data Flow Diagram
DLS Digital Library System
DQ Data Quality
DREPT Design-Relevant Explanatory/Predictive Theory
DSRIS Design Science Research Information Systems
DW Data Warehouse
EAD Encoded Archival Description
EAI Enterprise Application Integration
ED&P Early Detection and Prevention
EMPI Enterprise Master Patient Index
ER Model Entity-Relationship Model
ER Estrogen Receptor
ERD Entity Relationship Diagramming
ERP Enterprise Resource Planning
ETL Extract, Transform, Load
FDBS Federated Database System
HETU Personal Identity Code
HER2 Human Epidermal Growth Factor Receptor 2
ICD-O-3 International Classification of Diseases for Oncology
ID Identifier
IoT Internet of Things
IS Information System
ISAD Information Systems Analysis and Design
ISD Information Systems Development
ISDT Information Systems Design Theory
IT Information Technology
JPL Jet Propulsion Laboratory
KDD Knowledge Discovery in Databases
MB Megabyte
MBR Memory-Based Reasoning
MDM Master Data Management
NASA National Aeronautics and Space Administration
NBER National Bureau of Economic Research
NIH National Institutes of Health
OAI-PMH Open Archives Initiative Protocol for Metadata Harvesting
ODBC Open Database Connectivity
OLAP Online Analytical Processing
PLM Product Lifecycle Management
POP Persistent Organic Pollutant
ROI Return on Investment
RSR Relative Survival Rate
SBCDS Southampton Breast Cancer Data System
SCM Supply Chain Management
SQL Structured Query Language
TAMUS Texas A&M University System
TB Terabyte
TDWI The Data Warehousing Institute
TEKES Finnish Funding Agency for Innovation
TIP TAMUS Information Portal
TNM Tumor Node Metastasis
UML Unified Modeling Language
VSSHP Varsinais-Suomen Sairaanhoitopiiri
XML Extensible Markup Language
1 INTRODUCTION
This Master’s thesis is based on a case study that encompasses data governance and
management and a solution for unstructured data in open distributed IS environments.
The goal of the study is to develop and test the framework of the federative approach to
data governance and management. The framework and approach have been tested previously
with other cases. A further goal is to discover the benefits and limitations of the
framework in the case environment. The contextual metadata framework is a prerequisite
for data federation that does not require changing the data or transferring them from
their original sources.
The ongoing research started in January 2016 and was executed in co-operation with
the data/information specialists and healthcare professionals of the Central University
Hospital (CUH) in Turku, Finland. The Master’s thesis is a part of the overall research.
The case study builds on and extends the results of a project called Management of IT
in Mergers and Acquisitions: Master Data Management Best Practices, funded by Tekes
(the Finnish Funding Agency for Innovation). The current research addresses the prediction
of malignant breast cancers and the improvement of the survival rate of diagnosed
malignant breast cancers by means of data analysis.
The case study is worth contemplating against the background of open systems
environments and the explosive growth of digitalized data. With Big Data and the
Internet of Things (IoT), the importance of data governance and management is increasing
rapidly.
The annual volume growth of digital data is estimated to be approximately 60 %. By
the end of 2011 the amount of digital data created during that year had grown to 1.8 ZB
(10^21 bytes). The proportion of digital data was 99 % of all data created [17]. The estimate
for 2015 of the amount of digital data produced is 12 ZB and the proportion of digital
data is 99.84 % of all data created. As a result, in 2015, mankind produced as much data
as from the year 10000 B.C. to the end of 2003 [17].
A great number of organizations neither know nor understand how to handle or how to
govern data. Currently data governance seems to take place unilaterally on the operative
and executive level. Governance should extend from the bottom to the top in the organi-
zation’s hierarchy.
The empirical case is based on a CUH research project, which aims to improve the
survival rate of widely spread breast cancer. This type of breast cancer accounts for the
majority of the deaths from this disease.
The ongoing research project has two data governance, management and analysis re-
lated goals. The primary goal is to improve and enhance the ability to detect and identify
widely spread cancers (by applying TNM classification) from the data available in the
early phase of a cancer. The secondary goal is to enhance the precision of predicting the
survival rate of patients with malignant breast cancer after various cures and treatments.
The aim is to identify the treatments that are, either in isolation or in combination, reliable
predictors of enhanced survival probability.
The dilemma of data governance and management at the hospital is based on the frag-
mented, unstructured and distributed data, which are stored in several distinct locations.
Another serious problem is that data governance seems to be nobody’s responsibility.
The Master’s thesis, for its part, is intended to solve the problems caused by the
inconsistent, semi-structured, unstructured and fragmented nature of connected data
storages within open distributed IS environments. The solution is based on the framework
of the federative approach to data governance and management. The approach can also be
called the interpreted interoperable attribute philosophy or attribute-driven MDM.
1.1 Breast Cancer
Breast cancer is a malignant tumor. It is the most common type of cancer in women
all over the world. Breast cancer is characterized by genetic and histopathological
heterogeneity. The reasons for the emergence of breast cancer are unknown [77].
The most common symptom is a lump in the breast. Forms of treatment, such as surgery,
radiotherapy and drug treatment, i.e., chemotherapy, hormone therapy and antibody
therapy, are widely used [23].
1.1.1 Incidence
Every year over 4100 new cases of breast cancer are detected in Finland. About one half
of the breast cancer cases are found in patients over 60 years old. Breastfeeding seems to
decrease a woman's risk of contracting breast cancer [23]. Breast cancer has been the
most common cancer among Finnish women since the 1960s. The incidence of the cancer
begins to increase after the age of 40.
Breast cancer screening is based on mammography tests, i.e., breast X-ray investigations,
in which the breast is X-rayed from one or more directions. If the mammography
finding is abnormal, the woman is called in for complementary tests. These complementary
tests, alongside additional mammography images, include ultrasound examinations and/or
needle biopsies. If the follow-up tests fail to rule out the possibility of cancer, the tests
are continued in hospital, and a biopsy is taken to clarify the nature of the tumor [23].
According to Rahmati et al., mammography allows for the detection of impalpable tumors and
increases the survival rate. Digital mammography uses X-rays to project structures in the
3D female breast onto a 2D image [58].
At the beginning of 1987, national screening for breast cancer was initiated in
Finland. Screenings are attended by nearly 90 % of the so-called baby-boom generation.
Fewer than 3 % of the screened women are called in for further examination, and breast
cancer is found in about 300 cases [23].
1.1.2 Prediction
Over 80 % of breast cancer patients are alive five years after the cancer was detected.
The younger the patient, the more likely it is that the cancer recurs [23].
Cancer treatments have made great progress in recent decades. Today, a large proportion
of breast cancer patients live a normal life after cancer detection and treatment, and
eventually die from causes other than cancer [23].
As an indirect measure of cure from cancer, the relative survival rate is used.
It indicates what proportion of cancer patients is alive a certain period of time after
the detection of the cancer, compared with the proportion of the population of the same
age alive over the same period. If the annual relative survival rate is less than 100 %,
the cancer exposes patients to excess mortality. The statistical follow-up limit is usually
five years, although in some cancers slightly increased mortality occurs even after that
limit and, in some cases, statistical cure is reached earlier [32].
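The relative survival rate described above has a standard epidemiological formulation (the formula and the numbers in the example below are a general illustration, not taken from source [32]):

```latex
\mathrm{RSR}(t) = \frac{S_{\mathrm{obs}}(t)}{S_{\mathrm{exp}}(t)}
```

where S_obs(t) is the observed survival proportion of the patient group t years after detection, and S_exp(t) is the expected survival proportion of a general-population group of the same age and sex over the same period. For example, if 80 % of the patients and 95 % of the comparison population are alive after five years, the five-year relative survival rate is 0.80 / 0.95 ≈ 84 %.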
The higher survival rate of cancer in women is largely due to the fact that the most
common type of cancer in women is breast cancer, whose prognosis is significantly better
than, for example, that of lung cancer. In 2007-2009, the five-year survival rate of breast
cancer patients was 89 % [32].
Survival of cancer patients varies considerably, depending on how far the cancer has
spread, and when it is detected. The patients’ prognosis is better when the cancer found is
still local, as it is often possible to remove the entire cancerous tumor by surgery [32].
Breast cancer is an example of a disease where a patient with widespread disease may live
a considerable period of time thanks to effective treatments. In the 2000s, a five-year
follow-up of patients with local breast cancer showed 98 % survival. If the disease had
spread only to the axillary lymph nodes, the figure was 88 %. Even in the event that the
disease had been confirmed to have spread further, the patients’ five-year survival
figure was 42 % [32].
In the following chapters I describe the data attributes that are used to make breast
cancer data interoperable. They can be used to establish links between various data storages
and to make the data attributes of those data storages visible to end users.
1.1.3 Personal Identity Code
The personal identity code (HETU) is applied as the key identification number in the
patient management system at the CUH in Turku, Finland. The code is mandatory in the
system, and it is updated in and transferred to the other ISs.
The personal identity code makes identification more specific and secure than a name
alone. Many people may have an identical name, but never the same personal identity
code. The code follows a person from birth to death.
The code is given by the Population Register Centre [56]. It identifies each individual
in Finland. The code is always used when having contact with authorities, verifying iden-
tity and relating to a person’s official documents.
The personal identity code is a key attribute in this case study.
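As a concrete illustration, the check character of a HETU can be validated as follows. This is a minimal sketch based on the publicly documented format of the era ('+', '-' and 'A' as century signs; later reforms added further signs); the example code is a commonly used test value, not a real person's code.

```python
# Sketch: validating the check character of a Finnish personal identity
# code (HETU). Format DDMMYYCZZZQ: birth date, century sign C
# ('+' = 1800s, '-' = 1900s, 'A' = 2000s), individual number ZZZ and
# check character Q.
CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"  # 31-character alphabet

def is_valid_hetu(hetu: str) -> bool:
    if len(hetu) != 11 or hetu[6] not in "+-A":
        return False
    digits = hetu[:6] + hetu[7:10]  # DDMMYY + ZZZ
    if not digits.isdigit():
        return False
    # The check character is the remainder of the nine digits mod 31,
    # mapped onto the 31-character alphabet above.
    return CHECK_CHARS[int(digits) % 31] == hetu[10]

print(is_valid_hetu("131052-308T"))  # True for this well-known test code
print(is_valid_hetu("131052-308X"))  # False: wrong check character
```

Because the check character is derived from the nine digits, a single-digit transcription error is almost always caught, which is one reason the code is a robust key attribute for linking storages.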
1.1.4 Diagnosis Code of Cancer
Cancer is a disease characterized by an abnormal reproduction of cells, which invade and
destroy adjacent tissues and are even able to spread to other parts of the body through a
process called metastasis [4]. Currently, breast radiography (mammography) is the most
frequently used tool for detecting this type of cancer at an early stage [4].
Mammography makes it possible to identify abnormalities at their initial development
stage, which is a determining factor for the success of treatment. Mammography
allows the detection of impalpable tumors and increases the survival rate [67]. Digital
mammography uses X-rays to project structures in the 3D female breast onto a 2D image.
Information management for digital mammography generates a number of files and
enormous amounts of imaging data. All these data must then be stored, transmitted, and
displayed. Governing and managing a huge number of files is a challenge for data
management. A study of U.S. imaging centers conducted by Fajardo indicates that a typical
breast study stored using 4:1 lossless compression requires 8 to 15 MB (10^6 bytes) of
storage for a system rendering 100-micron resolution; 16 to 38 MB for a 70-micron system;
and 45 to 60 MB for a 50-micron pixel detector imaging system [19].
The diagnosis code (DC) is one of the attributes used in this case study. We are only
interested here in persons who have been diagnosed with breast cancer (code).
1.1.5 TNM Staging System
According to Edge et al., the extent or stage of cancer at the time of diagnosis is a key
factor that defines prognosis [26]. It is a critical element in determining appropriate
treatment based on the experience and outcomes of groups of prior patients with similar
cancer stages. Among the several cancer staging systems that are used worldwide, the tumor
node metastasis (TNM) system is the most clinically useful [26].
The TNM system classifies cancers by the size and extent of the primary tumor (T),
the involvement of regional lymph node (N), and the presence or absence of distant me-
tastases (M) [26]. The system has been supplemented in recent years by adding carefully
selected non-anatomic prognostic factors, and it includes a TNM staging algorithm for
cancers of virtually every anatomic site and histology [26].
The cancer stage is determined from information on the tumor T, regional nodes N,
and metastases M and by grouping cases with a similar prognosis [26]. The criteria for
defining the anatomic extent of the disease are specific for tumors at different anatomic
sites and of different histologic types. Unlike other types of cancer, in breast cancer, the
size of the tumor is a key factor. Thus the criteria for T, N, and M are defined separately
for each tumor and histologic type [26].
Although T, N, and M are of some value in determining a patient’s future outcome,
there are multiple factors relating to both prognosis and prediction [26]. The use of fac-
tors such as estrogen and progesterone receptor content or HER2 (Human Epidermal
Growth Factor Receptor) status are rather predictive than prognostic [26].
TNM is one of the linking attributes in this case study. It has to be deduced from various
data storages. Certain values of TNM indicate malignant breast cancer, and access to
the attributes of the federated storages needs to be provided for those persons who have
these particular values of the TNM code, in order to detect breast cancer and to analyze
the outcomes of treatment.
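Deducing a TNM-based flag from values scattered across storages can be sketched as follows. The regular expression and the "widely spread" rule (distant metastasis M1, or heavy nodal involvement N3) are illustrative simplifications invented here; the actual stage grouping follows the detailed tables of the TNM staging manual.

```python
import re

# Sketch: extracting TNM components from a free-text field and flagging
# a coarse "widely spread" condition for further analysis.
TNM_PATTERN = re.compile(r"T([0-4]|is|X)\s*N([0-3]|X)\s*M([01]|X)",
                         re.IGNORECASE)

def parse_tnm(text: str):
    """Return the (T, N, M) components found in a text field, or None."""
    match = TNM_PATTERN.search(text)
    return match.groups() if match else None

def is_widely_spread(tnm) -> bool:
    """Illustrative rule: M1 (distant metastasis) or N3 counts as widely spread."""
    _, n, m = tnm
    return m == "1" or n == "3"

record = "Invasive ductal carcinoma, T2 N1 M0, ER+"
tnm = parse_tnm(record)
print(tnm)                    # ('2', '1', '0')
print(is_widely_spread(tnm))  # False
```

In the federated setting, a parser of this kind would run against each storage's own TNM attribute, so that the flag can be computed without first consolidating the data.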
1.1.6 Dates of Events
The date of event gives information about the cancer treatments and consultations that
have taken place. The dates of treatment events, such as cancer surgery, radiation therapy
or chemotherapy, hormone therapy and biological treatments (for example, antibody
therapy and interferon), are saved into the corresponding IS of the treatment at the CUH
of Turku.
The date of event is one of the attributes used in this case study. Due to the nature of
malignant breast cancer, the dates of events need to be close to one another.
1.1.7 Cancer Survival Analysis
Analysis of cancer survival data and related outcomes is needed when assessing cancer
treatment programs and monitoring the progress of regional and national cancer control
programs [26]. In order to use data retrieved from the databases of cancer registries for
outcomes analyses properly, it is necessary to understand the correct application of ap-
propriate quantitative tools. The limitations affecting the analyses, based on the source of
data, must be taken into account as well [26].
A survival rate is a statistical index that summarizes the probable frequency of specific
outcomes for a group of patients at a particular point in time. In contrast, a survival curve
is a summary display of the pattern of survival rates over time [26].
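The distinction between a survival rate (one point in time) and a survival curve (the pattern over time) can be illustrated with a minimal Kaplan-Meier estimator, a standard tool for this kind of analysis. The follow-up times (in years) and event flags below are invented illustration data, not patient data from the case study.

```python
# Sketch: a minimal Kaplan-Meier survival curve. Censored patients
# (alive at last follow-up) leave the risk set without contributing a
# death; survival drops only at death times.
def kaplan_meier(times, events):
    """Return (time, survival) points; events[i] is True for a death."""
    survival, points = 1.0, []
    at_risk = len(times)
    for t, died in sorted(zip(times, events)):
        if died:
            survival *= (at_risk - 1) / at_risk
            points.append((t, survival))
        at_risk -= 1  # deaths and censored patients both leave the risk set
    return points

times  = [1.0, 2.5, 3.0, 4.0, 5.0, 5.0]
events = [True, False, True, True, False, False]
curve = kaplan_meier(times, events)
print(curve)
```

The last point of the curve is the survival rate at the end of follow-up; the whole list of points is the survival curve described above.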
1.2 Data Management Systems Related to Cancer Treatment
Information systems include master data. This refers to data that are long-lived, slow to
change and shared by multiple systems, without being transactional data. For example,
customer/patient information is often stored for years and will change only occasionally.
Typical examples of other master data include product/diagnosis code information,
organization information, employees, and various other lists of codes.
Master data consist of the core data on patients, diagnoses and treatments, which remain
relatively constant. The integrity and unambiguity of master data at the CUH are extremely
important. In addition, master data are used in the hospital in several different
applications or information systems. Thus master data attributes are common to different
systems. Master data are connected to event data in separate registers and refer to
independent information. Currently, there are about 16 distinct information systems at
the CUH in Turku, Finland (2016).
The detection and treatment of breast cancer generate a huge number of different data
elements, such as magnetic resonance and X-ray images, laboratory test and pathology
analysis results, cytostatic, medication and/or surgical treatment reports, as well as
a lot of metrics, referrals, prescriptions, analyses, diagnoses, reports and other data [13].
Relevant data could be spread over several years and could even include genetic data about
close relatives and/or data about the patient’s lifestyle and social environment. Clerical
personnel, nurses and doctors participating in the various cancer detection and treatment
tasks create, use, modify and store data about treatment events in dozens of information
systems (ISs), IS modules and related data storages [13]. As a result, the IS technical,
information handling and socio-contextual characteristics of federated data differ. The
entity and attribute definitions, formats, hierarchies and granularities of data storages
are different. Data could be structured, unstructured or multi-structured. Data could be
represented in numeric, alphanumeric, audio or video format(s). Data creation, use, storing
and purging procedures, as well as data volumes and velocities, vary. The sources of data
range from ISs to sensors and from internal to external data storages (e.g. code registers).
Furthermore, data are used for different purposes in various use contexts at a time and over
time and may thus have several valid contextual meanings [13].
1.3 Realization
The framework is based on the idea of data attributes that are common to several systems
and whose significance in the source systems (ISs) is known. The framework was tentatively
tested, in collaboration with information management experts, pathologists and physicians
of the Hospital District of Southwest Finland (VSSHP) and the CUH, on clinical breast
cancer data derived from healthcare data. The long-term goal is to assess at an early stage
the survival probability of patients suffering from disseminated breast cancer and to
identify the factors affecting it.
The research was conducted as a qualitative case study. The research data were col-
lected by means of workshop activities with the corresponding professionals at the CUH
of Turku, Finland.
1.4 Limitations
The research is limited to the governance and management of breast cancer data. The
case study takes a stance neither on other types of cancer nor on treatment. The core
subject is the clinical data and how they can be brought into use for research purposes,
without time-consuming data transfers, ETL (Extract, Transform, Load) batch processing
and the golden record.
1.5 Research Questions
The research focuses on the functionality of the framework related to data governance
and management of breast cancer data.
The main research question is: How does the theoretical framework of data federation
work in practice, when compared with the golden record?
The first sub-question: What are the benefits of the federative approach?
The second sub-question: What are the limitations?
Keywords: attributes, master data, metadata, data federation, breast cancer, golden record,
data governance, data management.
1.6 Structure of the Master’s Thesis
The Master’s thesis is organized as follows. Firstly, data ontology, data governance and
data management are examined by comparing the contextual and canonical stances on data
ontology. Secondly, the literature related to the issue is reviewed. Thirdly, methodological
issues are explicated. Fourthly, the design artifact of federation is introduced, and the
procedures for using it in the federation of breast cancer data, together with the other
findings of the study, are discussed. Finally, the conclusions of the Master’s thesis are
presented.
2 THEORETICAL BACKGROUND
The theoretical framework is based mainly on seven articles by Dahlberg et al. The first
article is Master Data Management Best Practices Benchmarking Study [11]; the second
article is Framework and Research Agenda for Master Data Management in Distributed
Environments [12]; the third article is Data Federation by Using a Governance of Data
Framework Artifact as a Tool [13]; the fourth article is A Framework for the Corporate
Governance of Data – Theoretical Background and Empirical Evidence [14]; the fifth
article is Managing Datification – Data Federation in Open Systems Environments [15];
the sixth article is The MDM Golden Record is Dead, Rest in Peace – Welcome Interpret-
ed Interoperable Attributes [16]; and finally, the seventh article is Research on: Govern-
ance of Data in the Contexts of Corporate Governance and Governance of IT and Data
Federation in the Context of Master and Big Data [17].
The approach to the theory of the case study proceeds by comparing the federative
approach with the golden record concept, that is, by comparing the golden record (the
canonical approach) with contextual data ontology (the federative approach).
The canonical approach is prevalent in computer science (natural sciences) and entity
relationship diagramming (ERD). This approach advocates a single truth, meaning that
there is one true data value; an entity can include several attributes, each having one
true value.
The contextual approach is widely used in information systems science (social sciences).
In contrast, it underlines that in data ontology there is no single true value; the value
depends on the use context and is linked to time.
2.1 Federative Approach
Data federation refers to the activities that facilitate the simultaneous use of data from
storages, which give different IS technical, informational and socio-contextual data char-
acteristics [13]. Federation makes possible to solve some traditionally challenging data
integration problems, such as inconsistencies in an attribute format, compulsion of attrib-
utes etc. The core idea of the federative approach is to make data storages interoperable
through data storage cross-mappings. Federation is carried out on the basis of metadata.
When an information system is designed and built by an organization or by a software
vendor, the data model represents the canonical approach with a specific social use con-
text (the true world as interpreted in the data model of the IS). Since ISs are designed and
built for different use contexts, they have different canonical data models.
The federative approach claims that different canonical data models should not be replaced, e.g. with a data model of data models. Instead, they should be made interoperable by identifying the attributes that the data models share. The aim is to make the data attributes of the federated data storages visible to one another. Moreover, the federative approach argues that in an organizational environment with dozens of ISs or more, where all or most ISs are purchased from software vendors, it is not even possible to create a data model of data models, since the data models are owned by the software vendors, not by the user organization.
Data federation starts by identifying shared attributes and by then describing the IS
technical, informational and socio-contextual metadata through cross-mappings of those
attributes [13]. The contextual stance on data ontology means that data are regarded as
truthfully representing the social use context of the data. Thus apparently similar data may have several meanings, one for each use context, and some of these meanings may even be contradictory. The canonical stance on data ontology, by contrast, proposes that it is possible to agree on a single version of the truth for data values and then to use those values in all contexts [13].
The federative approach governs data at the level of activities and processes. It suggests that all the master data of a selected activity or process should be addressed rather than a single domain [16]. A domain can be defined as an area of control or knowledge; related to MDM, it refers to the type of data to be mastered. The federative approach does not require that there is only one interpretation of customer data. It
allows for varied data management and governance arrangements, since the term data is
understood as having several meanings. Thus all local interpretations or contexts are con-
sidered true, and global master data are the sum of all local metadata interpretations. The
federative approach is appropriate and intended for use in an open ISs environment [16].
According to Dahlberg, data interoperability and transferability are far from reality [16]. As to what prevents electronic data transfer and consolidation, Dahlberg argues that data creation and handling processes vary, which leads to dissimilarities in data coding and content. Additionally, data concepts, formats, and structures differ, which results in fragmented and duplicated data.
Dahlberg and Nokkala believe that there is a lack of accepted and widely used international, national and local data model or message standards [14, pp. 27]. The reason
for the present situation is that each organization develops or procures and implements
databases and ISs of its own, so that data interoperability, transferability and usability are not taken into account. Dahlberg and Nokkala assign responsibility to data management, since
business professionals should know what the content of data ought to be and what data
are necessary to perform specific tasks [14]. Therefore, if data governance is unclear, no-
body in an organization is responsible for the content quality or the availability of data for
specific tasks. The conclusion drawn by Dahlberg and Nokkala is that the governance of
the data framework should be generic and should have a corporate managerial focus [14].
Chen et al. define interoperability as the ability of two systems to understand each other and to use each other's functionality [6, pp. 648]. With respect to the definitions of integration and interoperability, interoperability connotes coexistence, autonomy and a federated environment, whereas integration refers to coordination, coherence and uniformization.
The cross-reference is used to create links between the registers and to keep track of
metadata and cross-references [17]. Thus the original data remain untouched in their orig-
inal location. The objective is to make data available by knowing what the data mean.
The solution is built up gradually by adding new registers and refining the cross-
referencing metadata when a new transaction is detected [17].
Dahlberg argues that attribute-driven MDM leads to a narrow ERD and rich metadata descriptions [16]. The ERD is narrow because only the attributes used to implement the federation need to be modeled. Rich metadata descriptions are necessary to capture the IS technical, informational, and contextual metadata.
The following procedures are defined to show how the federative approach works in
practice.
Data federation is implemented on the basis of contextual metadata, by interpreting the attributes that make data interoperability and federation possible. The primary procedure is identification and cross-referencing (Figure 1). Firstly, the registers to be federated are identified in order to execute transactions, steer processes and produce management reports. Secondly, all the attributes shared between the registers are identified; these attributes can be used to build links between the federated registers. Thirdly, the basis for the links is created by describing the metadata of the shared attributes. The technical, informational and semantic metadata are necessary for cross-referencing the attributes, so that they can be federated.
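The three steps of the primary procedure can be sketched in Python. The register names, attribute sets and metadata fields below are assumptions made for this illustration, not the actual contents of the case study's registers:

```python
# Illustrative register definitions: the register names, attribute sets and
# metadata descriptions are invented for this sketch.
from itertools import combinations

# Step 1: the registers to be federated, each with its own attributes.
registers = {
    "pathology": {"HETU", "TNM", "SampleId"},
    "oncology": {"HETU", "DiagnosisCode", "TreatmentDate"},
    "radiology": {"HETU", "ImageId", "ExamDate"},
}

# Step 2: attributes shared by two or more registers are federation candidates.
shared = set()
for a, b in combinations(registers.values(), 2):
    shared |= a & b

# Step 3: describe the technical, informational and semantic metadata of each
# shared attribute; these descriptions are the basis of the cross-references.
shared_metadata = {
    "HETU": {
        "technical": "string of 11 characters",
        "informational": "Finnish personal identity code of the patient",
        "semantic": "identifies the same person in every federated register",
    },
}
# shared == {"HETU"}
```

Here only the personal identity code is shared by all three toy registers, so it alone becomes the connecting attribute.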
The secondary procedure is to use the cross-reference metadata to create links between the registers and to keep track of the metadata produced by cross-referencing (Figure 1). The original data remain untouched in their original location. Firstly, the core idea is not to replace data, but to make them available by knowing what the data mean. Secondly, the solution is built up gradually by adding new registers and by refining the cross-reference metadata on demand, i.e. when a new type of transaction is discovered.
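A minimal Python sketch of the secondary procedure, assuming two invented toy registers: the records stay in place, a federated view is assembled on demand through the shared attribute (here HETU), and a new register can be added by refining the cross-reference:

```python
# Two toy registers with invented records; the data stay where they are, and a
# federated view is assembled on demand through the shared attribute.

pathology = [{"HETU": "120345-678X", "TNM": "T2N0M0"}]
oncology = [{"HETU": "120345-678X", "DiagnosisCode": "C50.9"}]

# The cross-reference records the linking attribute and the known registers.
cross_reference = {
    "link_attribute": "HETU",
    "registers": {"pathology": pathology, "oncology": oncology},
}

def federated_view(patient_id, xref):
    """Collect one patient's records from every federated register."""
    key = xref["link_attribute"]
    return {name: [r for r in reg if r[key] == patient_id]
            for name, reg in xref["registers"].items()}

view = federated_view("120345-678X", cross_reference)

# The solution grows gradually: a new register is added by refining the
# cross-reference, without touching the existing registers.
cross_reference["registers"]["radiology"] = [{"HETU": "120345-678X", "ImageId": 17}]
```

Note that no record is copied into a central store; the view is computed from the original registers each time it is needed.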
This answers the main research question of the Master’s thesis.
Figure 1 Semantic View of the Federative Approach in Practice
2.1.1 Benefits
The benefits of the federative approach are as follows [17]. Firstly, an organization can move gradually and start federating data without risking its legacy IT.
Thus it is able to protect IT investments and increase value creation. Potential investments are focused solely on data attributes and MDM tools in order to make the data interoperable. There is no need to modify legacy ISs for data federation, as long as shared attributes can be identified.
Secondly, an organization may create new perspectives on federated data by using MDM tools and without acquiring new ISs, since the mappings already exist. This is a cost-effective approach to assessing the strength of data federation.
Thirdly, an organization can modify master data in any of the federated ISs, and there-
after all the other federated ISs are capable of loading data by using MDM tools.
Fourthly, any attribute could become an interoperable connector for both master data
and transactional data by creating the data link.
Finally, the federative approach makes it possible to start with practical actions and projects, and allows a tight focus that leads to more understandable data governance and management models. Additionally, the federative approach results in narrow entity-relationship modeling and rich metadata descriptions.
This answers the first sub-question of the Master’s thesis.
2.1.2 Limitations
The limitations of the federative approach are as follows [16]. In the current IS environment, open systems environments increase the risk of ontological errors. Entirely new data sources, such as sensors and other Internet of Things (IoT) devices, are available. Besides, new formats and dimensions of data prevail, such as
audio and video formats or spatial and temporal data. Additionally, data can be structured,
unstructured or multi-structured.
There is as yet no empirical evidence that data federation can meet all these challeng-
es. In relation to the APIs of various ISs, data federation may face problems. For exam-
ple, data federations between asynchronous and synchronous or, more simply put, differ-
ently timed automated processes and applications can be very demanding. Another limita-
tion is that APIs do not contain metadata about the meaning of data in ISs or changes in
the meaning of the data within ISs [15].
This answers the second sub-question of the Master’s thesis.
2.2 Golden Record Approach
The golden record, or a single version of the truth, is identified and developed for each product (treatment), place (location), person (patient), or organization (department) for the use of MDM [18, pp. 173]. The same golden records and values are used in every transaction-processing database [18]. It is typical for many organizations to have transaction systems that contain inconsistent reference and master data. Data warehousing systems must be able to identify both the most veracious system of record and the most accurate, golden reference and master data values [18].
Dreibelbis et al. in turn define the golden record as a service provided by the trustwor-
thy source for downstream systems. It is dedicated to reporting and analytics, or it is a
system of reference for other operational applications [21, pp. 26].
The golden record approach requires a closed IS environment where data are internal
and structured. The approach requires that one knows data models and entities with an
assumption that it is possible to create a data model in order to achieve a single version of
the truth. However, IS environments are today increasingly open and transparent, and the data are increasingly multi-structured and external. The golden record implies a single standard record. It is generated using data from multiple source systems [25, pp. 101].
The golden record approach appeared as a solution to the dilemma of connecting inconsistent, unstructured and fragmented data storages [13, pp. 9]. The golden record
philosophy represents the current mainstream in data management discussion [2, 15, 18].
Although the golden record term is actively used among professionals, a clear consistent
definition is hard to find.
Dreibelbis et al. present the case of consolidation implementation, where master data
have been brought together from a number of ISs [21, pp. 26]. The data are processed (transformed, cleansed, matched, and integrated) in order to capture a single golden record for the master data domains.
The golden record approach belongs to the era of closed information system (IS) environments, whereas currently user organizations acquire ISs and storage solutions as packages or as cloud services. The approach is prevalent when each organization develops, or is responsible for developing, its own system and information architecture. This is particularly necessary in order for an organization to be able to execute its own business processes. The strength of the golden record approach is that an organization knows the scope of each IS in detail, with data entities, attributes and their dependencies on the IS. On the other hand, the weaknesses of the approach are its closed-systems dependency and the canonical data ontology assumption [16].
In a closed IS environment, the information architecture can be designed without unnecessary overlaps [16]. Thus data interfaces are defined in advance.
This means that data are interchanged and federated by applying the data model of each IS [16, pp. 2]. According to Dahlberg, data federation requires the presence of a connecting data element, such as a customer (patient) or a product (treatment) [16].
In open overlapping (redundant) systems, which contain the same data, such as customers, vendors, products and services, or business transactions, reports and documents, the question of trustworthy data inevitably arises [11]. Due to the explosion of external data faced by open systems environments, it is necessary to federate data from different sources. In closed systems environments, by contrast, with the structured data of internal ISs, data federation was not relevant.
The canonical data ontology assumes that data have the same meaning, based on the single-version-of-the-truth philosophy [14]. The assumption can be divided into two different forms. The first form assumes that data entities and attributes have a single meaning. This means that, with the overlapping data of ISs, a single version of the truth can be created for data entity and attribute values. Relating to the golden record, a record thus holds the true values of shared data attributes, e.g. for a customer. In conclusion, the true values should be used by all ISs, and the purpose of the matching and merging process is to delete the non-true values [16].
The following procedures show how the golden record works in practice. The primary procedure is based on matching and merging (Figure 2). Firstly, all the patient registers are examined and their attributes are listed in a matrix, where the rows describe attributes and the columns registers. Secondly, the attributes common to all or most registers are listed. These are descriptive attributes referring to the key entities of the patient data, and they are used to build the attributes of the golden record. As a result, the golden record does not necessarily resemble the record in any original data register. Thirdly, the one true value for each attribute of the golden record is defined; the definition is thus created for each patient. If there are several values for the same attribute in different registers, then the latest and most accurate value is chosen and the other values are deleted as duplicates or deviations.
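The match-and-merge rule above can be sketched as follows; the register contents, attribute names and dates are invented for illustration:

```python
# Invented register contents: the rule shown is the one described in the text,
# i.e. keep the latest value for each attribute and discard the rest as
# duplicates or deviations.
from datetime import date

# Candidate values for one patient: (register, attribute, value, recorded_on).
candidates = [
    ("oncology", "Address", "Main St 1", date(2014, 5, 1)),
    ("pathology", "Address", "Harbour Rd 9", date(2016, 2, 3)),
    ("oncology", "DiagnosisCode", "C50.9", date(2015, 8, 20)),
]

def build_golden_record(rows):
    """Keep, for each attribute, the most recently recorded value."""
    golden = {}
    for _register, attr, value, recorded in rows:
        if attr not in golden or recorded > golden[attr][1]:
            golden[attr] = (value, recorded)
    return {attr: value for attr, (value, _day) in golden.items()}

golden = build_golden_record(candidates)
# golden == {"Address": "Harbour Rd 9", "DiagnosisCode": "C50.9"}
```

The older address is discarded as a duplicate, which illustrates why the resulting golden record need not resemble any single original register record.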
The secondary procedure is based on cloning. The golden record can be either cloned to all data registers or transformed from a master system (Figure 3). Cloning means that the golden values are substituted for the data of the original registers, and all ISs should use these true values; the matching and merging process removes all values other than the true ones. If the golden record is enforced as a whole in the original registers, then maintenance of the ISs is required. This may be impossible due to the high costs incurred in open systems environments. Transforming means that a master system is used as the source of the golden data, i.e. maintenance is carried out in only one IS. The data are made available to the other ISs, typically via a data bus, i.e. Enterprise Application Integration (EAI) (Figure 4).
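A hedged sketch of the two distribution strategies, with invented register contents: `clone` writes the golden values back into every register, while `transform` serves them from a single master system:

```python
# Invented contents; 'clone' and 'transform' follow the two strategies in the
# text. Cloning substitutes the golden values into every original register;
# transforming keeps maintenance in one master system that serves the others.

golden_store = {"120345-678X": {"Address": "Harbour Rd 9"}}

registers = {
    "oncology": {"120345-678X": {"Address": "Main St 1"}},
    "pathology": {"120345-678X": {"Address": "Harbour Rd 9"}},
}

def clone(golden_records, targets):
    """Write the golden values into every original register."""
    for register in targets.values():
        for patient, values in golden_records.items():
            register.setdefault(patient, {}).update(values)

def transform(golden_records, patient):
    """Serve the golden data from the master system, e.g. over a data bus."""
    return golden_records[patient]

clone(golden_store, registers)
# All registers now hold the golden address for the patient.
```

The contrast mirrors the cost argument above: cloning touches every register and thus requires their maintenance, whereas transforming confines maintenance to the master system.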
Figure 2 Semantic View of the Golden Record Approach in Practice (1)
Figure 3 Semantic View of the Golden Record Approach in Practice (2)
Figure 4 Semantic View of the Golden Record Approach in Practice (3)
2.3 Comparison of Ontological Approaches
Both approaches, contextual and canonical, are fundamentally related to master data. The
canonical approach was originally created in closed ISs environments. A golden record is
a master data record, which aligns all relevant attributes from all available data sources. It
is administered in a central repository, where data cleansing and data matching guarantee
its quality.
The contextual approach works in open, distributed IS environments. It considers both context and time. Owing to the particular characteristics of open systems environments, data federation should be performed at the attribute level, rather than according to an entity or a data model [13]. Data federation needs at least one connecting, i.e. shared, data attribute when there are two or more data sets. The approach does not aim to create a single harmonized record as the canonical approach does; the goal is to identify shared attributes, which form the links between federated registers.
2.3.1 Advantages
The golden record approach is a simple and straightforward concept. Duplicate attributes are avoided, which raises the level of data quality [52]. An all-encompassing data record contains links to the master data records in the different original data sources. When an update is made to an attribute in a particular data source, the same update is also made to all other relevant sources. Consequently, all the available data remain consistent throughout, and as they do not have to be physically moved, they are not stored redundantly. All these actions save time and money. As a result, all the data silos in the company are synchronized without any fragmented data. The golden record approach can be used in a single homogeneous use context [52]. In the case study, the approach could be applied if breast cancers were always treated in the same way. The strength of the golden record approach is that an organization knows the scope of each IS in detail, with data entities, attributes and their dependencies on the IS [16].
From the perspective of IT, golden records reduce the costs for data availability, ex-
change, integration and data migration [52]. The actual quantity of data can be reduced.
From the business stance, the golden records make a holistic customer view possible.
According to Martin, this leads to short-term increased turnover, as well as long-term
customer satisfaction and loyalty [52].
Van der Lans introduces the advantages by deploying data federation as follows [44,
pp. 7 & pp. 149]:
– Increased speed of report development
– Easier maintenance of metadata specifications
– Consistent reporting
– Cost reduction due to simplification
– Increased flexibility of the business intelligence system
– Easier data store migration
– Seamless adoption of new technology
– Data model-driven development
– Transparent archiving of data.
The advantages are considered from the stance of business intelligence systems.
2.3.2 Disadvantages
It is not necessarily possible to implement the golden record approach with high quality
requirements, since ISs are becoming more and more distributed and open. The weak-
nesses of the approach are closed systems dependency and the canonical data ontology
assumption [16]. Organizations have replaced the self-developed ISs, which used to be typical of a closed systems environment, with software packages and services from independent IS service vendors. With the increasing numbers of ISs used in organizations and the growing volumes of digital data, the technological move to an open systems environment is accelerating.
The data model of many commercial software packages represents the model of a generic user organization. This creates the risk of incompatibilities between different instances of the same software package in a single organization. Moreover, a user organization may not even have access to the data model of a software package, since changes and updates to the generic data model are made externally by the software vendor.
Due to the ever-increasing deployment of IT with larger numbers of ISs in use, there
might be data about the same persons (patients), facilities and locations (healthcare facili-
ties), things (cytostatic materials), concepts (disease diagnoses) and other data elements in
dozens, hundreds or even in thousands of data storages with possibly unknown intercon-
nections. As a result, data definitions tend to be out of control. Additionally, the IS tech-
nical, informational and socio-contextual characteristics of data may differ between dif-
ferent data storages.
The golden record approach is unable to deliver organization-wide federation of cus-
tomer (patient), product (treatment), location (healthcare facilities), and other types of
data. Despite improvements, MDM solutions remain fragmented. The reason lies in the canonical ontological assumption.
In open systems environments, data federation requires more IS technical, informational, and social metadata. Metadata of this kind are needed to understand how shared attributes have been created and what the attributes mean in their use contexts.
The potential disadvantages of data federation are as follows [44, pp. 151]:
– Extra layer of software
– Repeated processing of transformations
– Proprietary development language
– Management of data servers
– Limited experience.
The disadvantages are viewed from the perspective of business intelligence systems.
2.4 Ontological Approach in the Case Study
The federative approach was used in this case study. Several attributes were detected and
two basic matrices were built, forming an artifact. The artifact is a keystone in data feder-
ation. The first matrix (Table 1) describes information systems and four shared and in-
teroperable attributes (HETU, TNM, Diagnosis Code, Dates of Events). The second ma-
trix (Table 3) is more important, because it analyzes each attribute one at a time using
metadata. It includes data about the accountability of those inserting the data, the processes and phases in use, life cycles, and the level of understanding, i.e. what is understood and meant at different phases of the data life cycle.
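The structure of the artifact's two matrices can be sketched as follows. The system names and all cell values below are placeholders; only the four shared attributes come from the text, since Tables 1 and 3 themselves are not reproduced here:

```python
# Placeholder contents: only the four shared attributes (HETU, TNM, Diagnosis
# Code, Dates of Events) are taken from the text; system names and cell values
# are invented.

shared_attributes = ["HETU", "TNM", "DiagnosisCode", "DatesOfEvents"]

# First matrix (cf. Table 1): which information system carries which attribute.
systems_matrix = {
    "IS_A": {"HETU": True, "TNM": True, "DiagnosisCode": True, "DatesOfEvents": True},
    "IS_B": {"HETU": True, "TNM": False, "DiagnosisCode": True, "DatesOfEvents": True},
}

# Second matrix (cf. Table 3): one row of metadata per shared attribute.
attribute_metadata = {
    "HETU": {
        "accountability": "who inserts the value",
        "processes_and_phases": "where in the treatment process it is used",
        "life_cycle": "when the value is created, updated and archived",
        "understanding": "what the value means at each phase of its life cycle",
    },
}
```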
3 LITERATURE REVIEW
The literature review is based on the key literature: the latest and most prominent articles and books relating to the topic. It covers the main content of the scientific articles concerning the context of the research and the case study.
The literature search was based on the following keywords: attributes, master data, metadata, data federation, breast cancer, golden record, data governance, data management. The search engines and databases used were Google Scholar, Volter, Melinda, Arto and the Nelli Portal of Turku University, Finland.
In each chapter a comparison is made between the federative approach and the canoni-
cal philosophy (the golden record approach).
3.1 Ontology
Ontology can be defined as an explicit formal specification of how to represent objects,
concepts and other entities that are assumed to exist in some area of interest, and the relationships holding among them [59, pp. 5]. Systems that share the same ontology are able to communicate about a domain of discourse without necessarily operating on a globally shared theory. A system commits to an ontology if its observable actions are consistent with the definitions in the ontology. An ontology is also defined as a model with concepts and
their relationships within a domain [18]. For example, the canonical data ontology stance
proposes that it is possible to agree on one single version of the truth of data values as the
golden record.
The ontology defined by DAMA describes individuals (instances), classes (concepts), attributes, and relations [18]. It relates classes and their definitions in a taxonomic hierarchy through the subsumption relation, for example decomposing intelligent behavior into many simpler behavior modules and layers [18]. An ontology defines the basic terms and relations comprising the vocabulary of a topic area, and comprises rules for combining terms and relations to define extensions to the vocabulary. In contrast, a taxonomy denotes a classification of information entities in the form of a hierarchy, according to the presumed relationships of the real-world entities that they represent. The two terms, ontology and taxonomy, are widely used to describe the results of modeling efforts [50, pp. 175].
In their article On the Ontological Expressiveness of Information Systems Analysis
and Design Grammars, Wand and Weber introduce the grammars that ISAD (Information
Systems Analysis and Design) methodologies provide to describe various features of the
real world [69, pp. 219]. They argue that an ISAD grammar can be used to describe all
ontological constructs completely and clearly. ISAD intends to represent the real world,
and by tracking changes in the existing or imagined real-world phenomena, it strives to
model and facilitate the design of a structured information system, which is well decom-
posed [69, pp. 218].
According to the article An Ontological Model of an Information System, by Wand
and Weber, theoretical developments in the CS and IS disciplines have been inhibited by
inadequate formalization of basic constructs [68]. The article proposes an ontological
model of an information system that provides precise definitions of fundamental concepts
such as system, subsystem, and coupling. The model is used to analyze some static and dynamic properties of an information system and to examine the question of what constitutes a good decomposition of an information system. Tsiknakis et al. underline the importance of the extensive use of ontologies and metadata in research [66].
The article A Semantic Grid Infrastructure Enabling Integrated Access and Analysis of
Multilevel Biomedical Data in Support of Postgenomic Clinical Trials on Cancer, by
Tsiknakis et al., presents the master ontology on cancer, developed by the project, and
their approach to developing the required metadata registries [66]. The master ontology formally defines the domain of cancer research and management, with the objective of enabling semantic data integration.
According to Tsiknakis et al., clinical researchers or molecular biologists often en-
counter difficulties in exploiting one another’s expertise, since the prevailing research
environment is not sufficiently cooperative [66]. Cooperation would enable the sharing of
data, resources, or tools for comparing results and experiments, and creating a uniform
platform supporting the seamless integration and analysis of disease-related data at all
levels [66].
Tsiknakis et al. [66] state that the results from the research analyses of breast cancer
reside in separate dedicated databases, e.g. clinical trial (CT) database, histopathological
database, institutional or modality-specific image management systems, microarray data-
bases, proteomic information systems, etc. In addition, since several clinical research or-
ganizations are participating in a given trial, they are also geographically distributed with-
in or across countries [66].
Once data sets are generated, a range of specialized tools is required for the integrated
access, processing, analysis, and visualization of the data sets [66]. In addition, the tools
must provide a dynamically evolving set of validated data exploration, analysis, simula-
tion, and modeling services. The integration of applications and services requires substan-
tial meta-information on algorithms and input/output formats if the tools are to interoper-
ate. Furthermore, the assembly of tools into complex discovery workflows will only be
possible if the data formats are compatible, and the semantic relationships between ob-
jects shared or transferred in workflows are clear. To integrate the highly fragmented and
32
isolated data sources, semantics is needed in order to answer higher-level questions.
Therefore, it becomes critically important to describe the context, in which the data were
captured. This contextualization of data is described as metadata [66].
In the case study by Tsiknakis et al. [66], the user performs queries against a single
virtual repository. The virtual repository represents the integration of several heterogene-
ous sources of information. The integration process relies on a common interoperability
infrastructure, based at a conceptual level on domain ontology [66].
The article A Paradigmatic Analysis on Information System Development, by Iivari et
al., analyses the fundamental philosophical assumptions of five contrasting information
systems development (ISD) approaches: the interactionist approach, the speech act-based
approach, the soft systems methodology approach, the trade unionist approach, and the
professional work practice approach [40]. According to the article, these five approaches
are selected for analysis because they illustrate alternative philosophical assumptions to
the dominant orthodoxy identified in the research literature.
The article also proposes a distinction between approach and methodology. The analy-
sis of the five approaches is organized around four basic questions. Firstly, what is the
assumed nature of an information system (ontology)? Secondly, what is human
knowledge and how can it be obtained (epistemology)? Thirdly, what are the preferred
research methods for continuing the improvement of each approach (research methodolo-
gy)? And fourthly, what are the implied values of information system research (ethics)?
Each of these questions is explored from the internal perspective of the particular ISD
approach. The questions are addressed through a conceptual structure, which is based on
a paradigmatic framework for analyzing ISD approaches [40].
From the ontological stance, the golden record philosophy assumes that it is possible to define and agree on one version of the truth, in which all data entities and data attributes have the same meaning in every data use context. However, this canonical approach contradicts the ontological principles of Wand and Weber [68, 69, 70]. The approach aims to establish canonical data models, where all the context- and time-specific values of IS-shared attributes are treated as purpose-specific exceptions and are replaced by the canonical true values.
The federative approach is based on the ontological assumption that data are and
should be contextually defined in order to maintain local completeness. Thus data are
considered to have several true meanings, depending on their use context and the time of
data usage. The federative approach does not remove alternative meanings as the golden
record approach advocates. It uses metadata about the meanings of data to execute data
federation on the basis of interoperable/shared attributes. The approach produces truer
and richer insights.
The first canonical assumption is that there is only one context in effect. If there are more contexts, they can be described with the aid of attributes, and as a result conceptual models are created. The second assumption is that data and the interpretation of data can be defined and controlled.
In the federative approach, every single operating condition is contextual and considered true. For example, X-ray images are evaluated from the stance of a pathologist in relation to the state of the breast cancer. Thus the pathological data are significant and true. If anomalous values are found, they are registered as additional properties.
When all the single operating conditions are bound together, no single version of the truth can be found. In this case there are contextual interpretations of the data that cannot be linked, since the ontological interpretation of these data may be lost. Another perspective is that there is no control over the interpretations of the data, as no accountability can be established.
3.2 Semantics
Semantics indicates the meaning of expressions written in a certain language, as opposed
to their syntax, which describes how the symbols may be combined independently of their
meaning [59, pp. 5]. Semantic modeling, in turn, is a type of knowledge modeling [18, pp. 249]: it comprises a network of concepts, i.e. ideas or topics of concern, and their relationships. A semantic model, such as an ontology model, contains the concepts and relationships together [18].
Based on the article A Framework for Theory Development in Design Science Research, by Kuechler and Vaishnavi, one point of convergence in the many recent discussions on design science research in Information Systems (DSRIS) has been the desirability of a directive design theory (ISDT) as one of the outputs of a DSRIS project [43].
Kuechler and Vaishnavi introduce the framework from a knowledge representation
perspective and then provide typological and epistemological perspectives [43]. The aim
is to motivate the desirability of both directive-prescriptive theory (ISDT) and explanato-
ry-predictive theory (DREPT) for IS design science research and practice. Kuechler and
Vaishnavi position both types of theory in Gregor’s (2006) taxonomy of IS theory within
a typological view of the framework. Gregor claims that an appropriate taxonomy for IS
depends on classifying theories with respect to the degree and manner in which they
address four central goals of theory: analysis, explanation, prediction and prescription
[36, pp. 614].
From the semantic perspective the federative approach needs semantic metadata to
federate data. In particular, semantic metadata are needed in order to cross-reference
attributes in registers. Semantic metadata describe what data mean during their lifecycle, or
what their use is. In addition, semantic metadata are used to describe business/data rules
for data federation. For example, data governance builds on understanding the meaning of
the various representations of data, i.e. the contextually defined semantic descriptive
metadata of those representations. In federated databases, semantics added to the local
schemas facilitates negotiation, specifies views, and supplements queries. In the federative
approach, semantics must
be known, as the data are made interoperable and the semantic meaning of the data is
changed as a function of time.
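The role of semantic metadata in cross-referencing attributes can be illustrated with a minimal sketch. The register and attribute names below are hypothetical; the point is that two locally named attributes become interoperable when their semantic metadata declare the same shared concept.

```python
# A minimal sketch (hypothetical registers and attributes) of semantic metadata
# used to cross-reference attributes between federated registers.

# Each entry maps a local attribute to the shared concept it represents,
# together with its contextual meaning and validity period.
SEMANTIC_METADATA = {
    ("pathology_register", "pat_id"): {
        "concept": "patient_identifier",
        "meaning": "Personal identity code of the examined patient",
        "valid_from": "1990-01-01",
    },
    ("treatment_register", "person_code"): {
        "concept": "patient_identifier",
        "meaning": "Personal identity code of the treated patient",
        "valid_from": "1995-01-01",
    },
    ("treatment_register", "tnm"): {
        "concept": "tnm_stage",
        "meaning": "TNM stage at the time of diagnosis",
        "valid_from": "1995-01-01",
    },
}

def interoperable_attributes(register_a, register_b):
    """Return attribute pairs of the two registers that share a concept."""
    pairs = []
    for (reg_a, attr_a), meta_a in SEMANTIC_METADATA.items():
        if reg_a != register_a:
            continue
        for (reg_b, attr_b), meta_b in SEMANTIC_METADATA.items():
            if reg_b == register_b and meta_a["concept"] == meta_b["concept"]:
                pairs.append((attr_a, attr_b, meta_a["concept"]))
    return pairs

print(interoperable_attributes("pathology_register", "treatment_register"))
# [('pat_id', 'person_code', 'patient_identifier')]
```

Because the meaning of an attribute may change over time, each entry also carries a validity period; a fuller model would record one metadata entry per life-cycle phase of the attribute.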
Semantics has no great role in the canonical approach, since the data model governs
semantics and has to remain unchanged. If the semantics change, the data model itself must
be brought into focus and supplemented accordingly.
3.3 Data Models
A data model is an abstract model that organizes elements of data and standardizes how
they relate to one another and to properties of the real world entities. It explicitly deter-
mines the structure of data. A data model can sometimes be referred to as a data structure,
especially in the context of programming languages. Data models are often complement-
ed by function models, especially in the context of enterprise models.
A conceptual model is a representation of a system, made of the composition of con-
cepts which are used to help people know, understand, or simulate a subject the model
represents. A contextual model defines how context data are structured and maintained. It
aims to produce a formal or semi-formal description of the context information that is
present in a context-aware system.
The canonical model is a design pattern used to communicate between different data
formats. It is any model that is canonical in nature, i.e. a model in the simplest form
possible, based on a standard enterprise application integration (EAI) solution.
The canonical approach builds on entities and data models when designing an infor-
mation system. The canonical approach is a transformation of the data model. If the same
data are available in the environment, they introduce additional attributes into the data
model, but the model itself is not changed and it remains the same model.
In the federative approach, it does not make sense to acquire one large conceptual
model, as such a model becomes complicated and difficult to govern. Every single operating
condition requires only a certain quantity of data, so a single large model is not necessarily
needed and hampers practical work. Essential operating conditions include their own data
models, which are canonical,
as they must always be understood equally. Every operating condition must have a con-
ceptual model. Data models are extremely valuable, because they help us to understand
metadata. In the federative approach entities in particular are principal elements for the
understanding of metadata.
3.3.1 Conceptual Model
The article Conceptual Model Enhancing Accessibility of Data from Cancer-Related En-
vironmental Risk Assessment Studies, by Dušek et al., proposes a conceptual model,
which can be used to facilitate the discovery, integration and analysis of environmental
data in cancer-related risk studies [24]. According to the article, persistent organic pollu-
tants were chosen as a model due to their persistence, bioaccumulation potential and gen-
otoxicity. The part dealing with cancer risk is primarily focused on population-based ob-
servations encompassing a wide range of epidemiologic studies, from local investigations
to national cancer registries. The proposed model adopted, as content-defining classes, a
multilayer hierarchy working with characteristics of given entities (POPs and cancer
diseases as nomenclature classes) and observation-measurement couples [24].
The proposal extends the formally used taxonomy by applying a multidimensional set of
descriptors, including measurement validity and precision. Dušek et al. argue that it has
the potential to aid multidisciplinary data discovery and knowledge mining. The same
structure of descriptors used for the environmental and cancer parts enables the users to
integrate different data sources [24].
According to the article Information Systems and Conceptual Modeling, by Wand and
Weber, within the information systems field, the task of conceptual modeling involves
building a representation of selected phenomena in some domain [70]. Wand and Weber
argue that high-quality conceptual-modeling work is important because it facilitates early
detection and correction of system development errors. It also plays an increasingly im-
portant role in activities such as business process reengineering and documentation of
best-practice data and process models in enterprise resource planning systems. Yet little
research has been undertaken on many aspects of conceptual modeling [70].
Wand and Weber propose a framework to motivate research that addresses the follow-
ing fundamental question: How can the world be modelled to better facilitate our devel-
oping, implementing, using, and maintaining more valuable information systems? The
framework comprises four elements: conceptual-modeling grammars, conceptual-
modeling methods, conceptual-modeling scripts, and conceptual-modeling contexts [70].
According to the article Leveraging Information Technology for Transforming Organ-
izations, by Henderson and Venkatraman, it is clear that even though IT has evolved from
its traditional orientation of administrative support toward a more strategic role within an
organization, there is still a glaring lack of fundamental frameworks, within which we
could understand the potential of IT for tomorrow’s organizations [38].
Henderson and Venkatraman developed a model for conceptualizing and directing the
emerging area of strategic management of IT [38]. This model, termed the Strategic
Alignment Model, is defined in terms of four fundamental domains of strategic choice:
business strategy, information technology strategy, organizational infrastructure and pro-
cesses, and information technology infrastructure and processes – each with its own un-
derlying dimensions. Henderson and Venkatraman also present the power of the model in
terms of two fundamental characteristics of strategic management: strategic fit (the inter-
relationships between external and internal components) and functional integration (inte-
gration between business and functional domains) [38].
3.3.2 Contextual Model
According to the article A Framework for the Corporate Governance of Data - Theoreti-
cal Background and Empirical Evidence, by Dahlberg and Nokkala, the contextual ap-
proach acknowledges that the universal approach suits situations, in which real-world
representations are closely related, for example similar tasks or chains of tasks [14, pp.
35]. The approach emphasizes the role of metadata and the use of agreed messages in
sharing data between data storages.
The contextual approach proposes that data represent granted interests and dynamic in-
terplay between socially-constructed concepts, especially the representations of human
behavior in context by the IS [14, pp. 34]. Dahlberg and Nokkala argue that digital data
are contextually defined and central to business management. Contextual metadata are
used to describe business and data rules, and they are necessary for federating data.
The framework of this Master’s thesis is based on the ontological assumption that data
are contextually defined [68, 69, 70]. Thus data could have several meanings and inter-
pretations, which depend on their social use context. Dahlberg et al. argue that the mean-
ings could even be contradictory [13]. Additionally data are defined in their various use
contexts over the life cycle of the data.
The federative approach builds on a contextual stance to the data ontology prevailing
in open systems environments [13]. It thus pays more attention to data governance.
3.4 Data Classification
DAMA defines data as a representation of facts as text, numbers, graphics, images,
sound, or video [18, pp. 2]. The difference between data and information is that infor-
mation is data in context. Knowledge in turn, is defined as understanding, awareness,
cognizance, and the recognition of a situation and familiarity with its complexity [18, pp.
3]. In other words, one gains in knowledge when the significance of the information is
understood.
Data can be classified as follows [21, pp. 33]:
• Metadata
• Reference data
• Master data
• Transaction data
• Historical data.
The term metadata refers to descriptive information. In turn, reference data defines and
distributes the collections of common values. Reference data enables the processing of
operational and analytical activities. It also provides the same defined values of information
for common abbreviations, codes, and validation [16, pp. 34].
Master data represents the common agreed and shared business objects within an en-
terprise [21, pp. 35]. Master data are managed within a dedicated MDM (Master Data
Management) system. An MDM System often uses both reference data and metadata in
processing.
Reference data ensures common and consistent values for the attributes of master data,
such as a patient’s name or gender. Transaction data refer to fine-grained information,
representing the details of any enterprise [21, pp. 35]. From the business point of view,
this kind of information includes sales transactions, inventory information, and invoices
and bills. In non-profit organizations, transaction data represents passport applications,
logistics, or court cases. Transaction data also describes what is happening in an organiza-
tion, while historical data represents the accumulation of transaction and master data over
time. These data are used for both analytical processing and for regulatory compliance
and auditing [21, pp. 35].
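The relationships between these five classes can be illustrated with a small, purely illustrative sketch. All names and values below are invented for the example: reference data constrain the permissible values of master data attributes, transactions record events, and historical data accumulate over time.

```python
# A minimal sketch (illustrative values only) of how the five data classes
# relate in practice: reference data validate master data attributes, and
# historical data accumulate from transactions over time.

REFERENCE_GENDER = {"F", "M", "U"}           # reference data: agreed common values
METADATA = {"gender": "Administrative sex of the patient, per local coding"}

master_patients = {}                          # master data: shared business objects
transaction_log = []                          # transaction data: fine-grained events
historical_data = []                          # historical data: accumulation over time

def register_patient(patient_id, name, gender):
    if gender not in REFERENCE_GENDER:        # reference data used for validation
        raise ValueError(f"unknown gender code: {gender}")
    master_patients[patient_id] = {"name": name, "gender": gender}
    transaction_log.append(("register", patient_id))
    historical_data.append(dict(master_patients[patient_id]))

register_patient("p1", "Jane Doe", "F")
print(master_patients["p1"]["gender"])        # F
```

The metadata dictionary here only names the attribute's meaning; in a real system it would follow a richer scheme, as discussed in the metadata section below.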
The federative approach takes advantage of interoperable/shared attributes based on
metadata, which can have different meanings depending on where and when the data are
used. The approach considers master data contextually, and it is applicable beyond internal
master and transaction data, particularly to complex small data and Big Data. The
federative approach accepts the canonical approach in the case of one single context, if it is
possible. If the contexts are different and diverge from one another, then the canonical
approach is not possible. Federation is performed with respect to contexts and entities. In
practice, federation is done by describing the dependencies of master data, i.e. the
locations which can be referred to. The federative approach is interested in the life cycle of
data, i.e. what happens to data and what data are needed in each phase of the life cycle,
when the metadata differ from technical metadata.
Master data and reference data are alike. Both data types are governed with the help of
transactions. Transactions are shared in an information system. Master data are important
in both ontological approaches. The canonical philosophy leads to golden records in order
to harmonize data. From the data storage, the attributes to be shared are chosen. Then
the exclusively permissible values are agreed upon for every attribute (e.g. customer,
patient). The canonical approach does not pay attention to how the data
are processed, but underlines the fact that there is only one version of the truth, which is
based on technical metadata.
3.4.1 Master Data
MDM is a collection of best data management practices applied to master data [47, pp. 8]. On
the one hand, it organizes key stakeholders, participants, and business clients by incorpo-
rating the business applications, information management methods, and data management
tools. MDM implements the policies, procedures, services, and infrastructure to support
the capture, integration, and subsequent shared use of accurate, timely, consistent, and
complete master data. On the other hand, MDM is a set of disciplines and methods to
ensure the currency, meaning, and quality of a company’s reference data within and
across various data subject areas [25, pp. 43].
According to Martin, MDM represents a complex task, which includes all strategic,
organizational, methodical and technological activities related to a company’s master
data. MDM ensures consistent, complete, up-to-date, correct, and high quality master data
for supporting the business processes (e.g. ERP, CRM, SCM, PLM) of a company [52,
pp. 1].
Loshin argues that MDM combines core identifying data attributes along with associ-
ated relevant data attributes into a master repository. It links an indexing capability that
acts as a registry for any additional and distributed data within an enterprise [47, pp. 180].
Federated information models of this kind are often serviced via Enterprise Application
Integration (EAI) or Enterprise Information Integration (EII) styles of data integration
tools. Loshin believes that this capability is important in MDM systems built on a registry
framework or using any framework that does not maintain all attributes in the repository
for materializing views on demand. Master data record materialization is based on the
existence of a unique identifier with a master registry. The registry carries both core
identifying information and an index to location across the enterprise, holding the best values
for designated master attributes [47].
According to the work, Customer Data Integration, by Dyché and Levy, MDM usually
refers to the management of reference data and the establishment of standard data values
for that reference data across the company [25, pp. 44]. Most MDM programs also include
transactional and relationship data as they are needed for specific business processes
(Figure 5).
Figure 5 Scope of MDM [25, pp. 46]
According to the article Uncovering Four Strategies to Approach Master Data Man-
agement, by Cleven and Wortmann, much recent Information Systems (IS) research
focuses on master data management (MDM), which promises to increase an organiza-
tion’s overall core data quality [10]. Beyond any doubt, however, MDM initiatives con-
front organizations with multi-faceted and complex challenges that call for a more strate-
gic approach to MDM.
According to the article Master Data Management Best Practices, by Dahlberg, opera-
tionally, critical and other master data are fragmented and inconsistent [11]. As a result,
the information content quality is poor. This has a negative impact on business activities,
such as business transparency, loss of revenue, the use of power-accident occurrence, and
incorrect business management reporting. According to Dahlberg, poor quality also re-
duces the chances of achieving the benefits of improvement of business processes and
other related development initiatives [11]. Older managers, who may not understand the
importance of critical business data, accept the status quo and consider it normal.
Based on the article Framework and Research Agenda for Master Data Management in
Distributed Environments, by Dahlberg et al., master data provide the foundation for re-
lating business transactions with business entities, such as customers, products, locations
etc. [12]. These entities are also referred to as domains in the master data literature. The
integrity, availability and timeliness of master data in single and growingly multi-domain
combinations are crucial in e-Business transactions over the Internet, or in the cloud for
multiple stakeholders. Dahlberg et al. argue that distributed environments set additional
challenges for the management of master data [12]. Master data, management processes,
responsibilities and other contemporary master data management practices are described
as aiming to ensure master data quality in different domains. Even though these practical
means are of help in improving master data quality and managing master data, they are
insufficient to capture the underlying root cause of master data problems [12].
Dahlberg takes his stance on master data management from the IS theoretical view-
point [11]. He suggests that holistic approaches, such as enterprise architecting, stake-
holder analysis, or business modeling, could serve as coherent frameworks in identifying
common and specific master data management research themes for global businesses with
networked IT environments [11].
Figure 6 shows a simple example, where the MDM System holds enough master data
to uniquely identify a customer (in this case, the Name, TaxID, and Primary address in-
formation), and then provides cross-references to additional customer information stored
in Information System 1 and Information System 2 [21]. When a service request for cus-
tomer information is received (getCustInfo()), the MDM System looks up the information
that it stores locally, as well as the cross-references, to return the additional information
from Systems 1 and 2. The MDM System brings together the information desired through
federation, as it is needed. Federation can be performed at the database layer or by dy-
namically invoking services to retrieve the needed data in each of the source systems [21,
pp. 29].
Since most information remains in the source systems and is available when needed,
the information returned is always up-to-date [21]. This is implemented by a MDM Sys-
tem, which is able to meet transactional inquiry needs in an operational environment. This
kind of registry-based implementation can also be applied in complex organizational en-
vironments where one group may not be able to provide all of its data to another. The
registry can be implemented relatively quickly, since responsibility for most of the data
remains within the source systems [21].
Figure 6 MDM Registry Federation [21, pp. 28]
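The registry-style federation of Figure 6 can be sketched as follows. The customer, system, and attribute names are hypothetical, and the source systems are simulated as in-memory dictionaries; in practice they would be separate databases or services reached through the cross-references.

```python
# A minimal sketch of registry-based MDM federation (Figure 6). The MDM
# registry holds only identifying master data plus cross-references; the
# rest is fetched from the source systems at request time.

system1 = {"c42": {"phone": "+358 40 0000000"}}          # Information System 1
system2 = {"X-9": {"credit_limit": 5000}}                # Information System 2

mdm_registry = {
    "cust-1": {
        # master data held locally: enough to identify the customer uniquely
        "name": "Jane Doe",
        "tax_id": "123-45-6789",
        "primary_address": "1 Main St",
        # cross-references into the source systems
        "xrefs": [("system1", "c42"), ("system2", "X-9")],
    }
}

SOURCES = {"system1": system1, "system2": system2}

def get_cust_info(customer_id):
    """Look up local master data, then federate the rest via cross-references."""
    record = dict(mdm_registry[customer_id])      # copy; registry stays intact
    xrefs = record.pop("xrefs")
    for source_name, local_key in xrefs:
        # data are read from the source at call time, so they are always current
        record.update(SOURCES[source_name][local_key])
    return record

info = get_cust_info("cust-1")
print(info["phone"], info["credit_limit"])
```

Because most attributes remain in the source systems and are only read on demand, the returned record is as up-to-date as the sources themselves, which is the property the registry implementation style relies on.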
Based on Doan et al., Master Data Management (MDM) uses a central warehouse as a
repository of knowledge about the enterprise’s critical business objects, rules, and pro-
cesses [20, pp. 273-274]. Essential to MDM is a clean, normalized version of the terms
used throughout the enterprise, whether addresses, names, or concepts, and information about
the related metadata. Ideally, whenever business objects are used in systems throughout
the enterprise, the data values used by these systems can be tied back to the master data.
In many ways, a master data repository is merely a data warehouse with a particular role
to play [20].
According to Doan et al., as with any warehouse, MDM gives the various data owners
and stakeholders a bird’s eye view of all of the data entities as well as a common inter-
mediate representation [20]. The master data repository is also intended to be a central
repository where relevant properties about data, especially constraints and assumptions,
can be captured, making it the home of all metadata as well as data. In many cases, the
master repository is made query-friendly to all of the data owners, such that they can di-
rectly incorporate it into their systems and processes. The repository is seen as a way of
improving risk management, decision making, and analysis [20].
Doan et al. believe that MDM provides a process to allow data to be overseen and
managed through data governance [20]. It refers to the process and organization put in
place to oversee the creation and modification of data entities in a systematic way.
3.4.2 Metadata
DAMA defines metadata as information about the physical data, technical and business
processes, data rules and constraints, and logical and physical structures of the data, as
used by an organization [18]. Metadata are to data what data are to reality.
Watson defines metadata (data about data) as a description of each data type, its for-
mat, coding standards, and the meaning of the field [72, pp. 23].
Metadata can be classified into four major types [18, pp. 262]. Business metadata in-
cludes the business names and definitions of subject and concept areas, entities, and at-
tributes; attribute data types and other attribute properties, range descriptions; calcula-
tions; algorithms and business rules; and valid domain values and their definitions. Tech-
nical and operational metadata provides developers and technical users with information
about their systems. Technical metadata includes physical database table and column
names, column properties, other database object properties, and data storage. Process
metadata are data that define and describe the characteristics of other system elements
(processes, business rules, programs, jobs, tools, etc.). Data stewardship metadata are
data about data stewards, stewardship processes, and responsibility assignments. Data
stewards ensure that data and metadata are accurate, with high quality across the enter-
prise [18].
In turn, Maier et al. define metadata as data about data [50, pp. 173-174]. The structure
of knowledge is based on knowledge elements and the relations between elements and
metadata. Relations expose further information about the content and associations of ele-
ments. A single knowledge element can itself be called metadata; it can simultaneously be
transformed into another knowledge element of data [50].
Metadata can be used to describe any kind of data from structured to unstructured [50].
The structure itself is already a form of metadata and usually provides information about
the name of the data element (e.g. an XML Schema for an XML document). Element
names are often not sufficient to carry all the relevant information. Additional metadata
that either describe the content (e.g. keywords, domain) or the context of the data, are
needed, especially for semi-structured data. The context can further be subdivided into
creation context (e.g. customer, intended use) [50].
Three types of metadata can be identified [50]. Content metadata relates to what the
object contains or is about, and is intrinsic to an information object. Context metadata
indicates the aspects associated with the object’s creation and /or application and is ex-
trinsic to an information object (e.g. who, what, why, where and how aspects). Structure
metadata relate to the formal set of associations within or among individual information
objects and can be intrinsic or extrinsic [50].
Dahlberg argues that rich metadata descriptions are necessary to achieve three types of
metadata related to data federation [17]. Firstly, IS technical metadata describes the tech-
nical properties of the attribute that is used to federate data in the various federated regis-
ters. Secondly, informational metadata describes the data life-cycle properties of the at-
tribute including the accountable that created the attribute. This kind of metadata can also
be used to establish a data governance model. Accountability for data is allocated to the
person(s) who know it best. Thirdly, contextual/semantic metadata describes the meaning
of data during their lifecycle or the purpose of their use. Contextual metadata are used to
describe business and data rules in data federation [17].
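Dahlberg's three metadata types can be sketched for a single federating attribute. The attribute and all field values below are hypothetical; the sketch only shows how the technical, informational, and contextual/semantic descriptions sit side by side on one attribute.

```python
# A minimal sketch (hypothetical content) of the three metadata types
# distinguished for a federating attribute: technical, informational,
# and contextual/semantic.

attribute_metadata = {
    "attribute": "personal_identity_code",
    "technical": {                 # IS technical metadata: technical properties
        "type": "CHAR(11)",
        "storage": "patient_register.pat_id",
    },
    "informational": {             # informational metadata: life cycle and accountability
        "created_by": "registrar on admission",
        "accountable": "hospital records office",
        "retention": "permanent",
    },
    "contextual": {                # contextual/semantic metadata: meaning and rules
        "meaning": "uniquely identifies the patient across registers",
        "federation_rule": "join registers on this attribute",
    },
}

for metadata_type in ("technical", "informational", "contextual"):
    print(metadata_type, "->", sorted(attribute_metadata[metadata_type]))
```

The informational part names an accountable party, which is exactly what makes this metadata usable as the basis of a data governance model: accountability for the attribute is allocated to those who know it best.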
Based on Maier et al., the structure is extrinsic in data base tables (data and structure
are separated) and intrinsic in XML documents (tags and content mixed). Metadata can be
informal (e.g. structured according to a user-invented structure) or formal (e.g. structured
and compliant with a standard) [50]. The Dublin Core Metadata Initiative defines a set of
elements that are mainly based on experience acquired in public libraries.
Metadata ontology provides a vocabulary that is used to describe contents based on the
Dublin Core metadata standard. For example, data integration implements standards that
define character sets, addressing, markup, scopes and schema definitions [50, pp. 77]. In
integration, documents are captured and additional metadata are assigned, i.e. to indexing
or attribution of documents. Metadata are usually stored in a data base together with a
link to the document, which is transferred to a separate storage medium or system [50, pp.
252].
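The capture step described above can be sketched in a few lines. The field names follow the Dublin Core element vocabulary, but the record values, store names, and link scheme are invented for the example: the metadata record is kept in one store together with a link to the document held in separate storage.

```python
# A minimal sketch of a Dublin Core style record for a captured document,
# stored together with a link to the document itself. Values and the
# "docstore://" link scheme are illustrative only.

dc_record = {
    "title": "Pathology report, case 2016-001",
    "creator": "Department of Pathology",
    "subject": "breast cancer; TNM staging",
    "date": "2016-09-30",
    "format": "application/pdf",
    "identifier": "doc-2016-001",
    "language": "fi",
}

metadata_store = {}   # metadata kept in a database ...
document_store = {}   # ... with a link to the document in separate storage

def capture(record, content):
    """Capture a document: store content, then index its metadata with a link."""
    doc_id = record["identifier"]
    document_store[doc_id] = content
    metadata_store[doc_id] = {**record, "link": f"docstore://{doc_id}"}
    return doc_id

capture(dc_record, b"%PDF-...")
print(metadata_store["doc-2016-001"]["link"])   # docstore://doc-2016-001
```

Queries then run against the metadata store alone, and the link is followed only when the document itself is needed.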
Dyché and Levy classify data into four main types. Transactional data are
records of individual customer interactions [25, pp. 44]. A transaction represents an activ-
ity at a point in time. Reference data identify a product, customer, or other business enti-
ty. Relationship data are data that further describe an entity in order to relate it to other
entities. Metadata are basically descriptive data about individual data elements [25].
In conclusion, the common denominator of all the numerous definitions introduced
above is that metadata are core and elementary data of ISs. Metadata are also a vital part of
master data and MDM. In order to understand the meaning of data federation, it is
indispensable to comprehend metadata, with their attributes and entities, and the structure
they form together.
According to the article Metadata Management by Sen, in the past, metadata has al-
ways been a second-class citizen in the world of databases and data warehouses [63]. Its
main purpose has been to define the data. However, the current emphasis on metadata in
the data warehouse and software repository communities has elevated it to a new promi-
nence. The organization now needs metadata for tool integration, data integration and
change management. Sen presents a chronological account of this evolution from both
conceptual and management perspectives. Repository concepts are currently being used
to manage metadata for tool integration and data integration. Alongside the evolution
process, Sen points out the need for a concept called Metadata Warehouse. He proposes
that the metadata warehouse needs to be designed to store the metadata and manage their
changes [63].
According to the article An Interoperable Data Architecture for Data Exchange in a
Biomedical Research Network, by Crichton et al., knowledge discovery and data correla-
tion require a unified approach to basic data management [8]. Nevertheless, achieving
such an approach is nearly impossible with hundreds of disparate data sources, legacy
systems, and data formats. Crichton et al. argue that the problem is pervasive in the bio-
medical research community, where data models, taxonomies, and data management sys-
tems are locally implemented. These local implementations create an environment where
interoperability and collaboration between researchers and research institutions are lim-
ited [8].
Crichton et al. demonstrate how technology developed by NASA’s Jet Propulsion La-
boratory (JPL) for space science can be used to build an interoperable data architecture
for bioinformatics [8]. JPL has taken a novel approach towards solving the problem by
exploiting web technologies usually dedicated to e-commerce, combined with a rich
metadata-based environment. The article discusses the approach to developing a proto-
type based on data architecture for the discovery and validation of disease biomarkers
within a biomedical research network. Biomarkers are measured parameters of normal
biologic processes, pathogenic processes (cancer research), or pharmacologic responses
to a therapeutic intervention. Biomarkers are of growing importance in biomedical re-
search for therapeutic discovery, disease prevention, and detection. A bioinformatics in-
frastructure is crucial for supporting the integration and analysis of large, complex biolog-
ical and epidemiologic data sets [8].
Based on Customer Data Integration by Dyché and Levy, metadata are defined as data
about data [25, pp. 44]. They are basically descriptive data about individual data ele-
ments. Metadata can include system-level metadata, used by applications to navigate and
distinguish certain data types. In addition, the term may also include a user-defined
metadata that involves persistent definitions of important data fields. Four main types of
data can be found in a business. These are [25]:
• Transactional data (records of individual patient interactions)
• Reference data (identification of a cancer type, patient, or other healthcare entity)
• Relationship data (description of an entity in order to relate it to other entities)
• Metadata (data about data, i.e. descriptive data about individual data elements).
According to the article A Methodology for Sharing Archival Descriptive Metadata in
a Distributed Environment, by Ferro & Silvello, the core question is how to exploit wide-
ly accepted solutions for interoperation, for example the pair Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH) and the Dublin Core (DC) metadata format
[31]. The goal is to deal with the peculiar features of archival description metadata and
allow their sharing on the subject. Ferro and Silvello present a methodology for mapping
Encoded Archival Description (EAD) metadata into DC metadata records without losing
information. The methodology exploits Digital Library System (DLS) technologies, en-
hancing archival metadata sharing possibilities and at the same time considering archival
needs. Furthermore, it makes it possible to open valuable information resources held by
archives to the wider context of cross-domain interoperation among different cultural
heritage institutions [31].
Metadata go beyond the data model that lets business users know what types of infor-
mation are stored in the database [45, pp. 623]. Metadata provides an invaluable service.
When not available, this type of information needs to be gleaned, usually from friendly
database administrators and analysts. This is an inefficient and time-consuming way of
gathering information. For a data warehouse, metadata provides discipline, since changes
to the warehouse must be reflected in the metadata to be communicated to users. Related
to the metadata repository, Linoff and Berry argue that metadata should also be consid-
ered a component of the data warehouse [45, pp. 630]. The lowest level of metadata is the
database schema, the physical layout of the data. Metadata answers questions posed by
end users about the availability of data, gives them tools for browsing through the con-
tents of the data warehouse, and gives everyone more confidence in the data [45].
A good metadata system should include the following elements [45, pp. 630]:
• The annotated logical data model. The annotations should explain the entities and
attributes, including valid values.
• Mapping from the logical data model to the source systems.
• The physical schema
• Mapping from the logical model to the physical schema
• Common views and formulas for accessing the data. What is useful to one user
may be useful to others.
• Information about loads and updates
• Security and access information
• Interfaces for end users and developers, so that they share the same description of
the database.
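As an illustrative sketch, one entry of such a metadata system could bundle these elements into a single structure; all field names and example values below are assumptions for illustration, not Linoff and Berry's notation.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    """Describes one logical attribute in the metadata system (illustrative)."""
    entity: str                 # entity from the annotated logical data model
    attribute: str              # attribute name
    description: str            # annotation explaining the attribute
    valid_values: list          # valid values for the attribute
    source_mapping: str         # mapping from the logical model to the source system
    physical_column: str        # mapping from the logical model to the physical schema
    last_load: str              # information about loads and updates
    access_roles: list = field(default_factory=list)  # security and access information

entry = MetadataEntry(
    entity="Diagnosis",
    attribute="diagnosis_code",
    description="ICD-O-3 code of the tumour",
    valid_values=["C50.0", "C50.1", "C50.9"],
    source_mapping="oncology_db.tumour.icd_code",
    physical_column="dw.fact_diagnosis.dx_code",
    last_load="nightly batch",
    access_roles=["clinician", "researcher"],
)
print(entry.attribute, "->", entry.physical_column)
```

A repository of such entries can then be queried like any other data set, which is what gives end users the browsing and availability answers mentioned above.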
3.5 Data Quality
According to the article Identifying, Investigating and Classifying Data Errors, by Duda,
high quality data are essential to the accuracy and validity of clinical study results [22].
Data quality assurance has been a particular emphasis in clinical trials, where extensive
personnel training and data monitoring programs are built into the study protocol in an
effort to prevent scientific misconduct and ensure compliance with the International Con-
ference on Harmonization’s Guidelines for Good Clinical Practice [22].
Duda claims that clinical trials can be elaborate and expensive, and the cohorts are not
always large or varied enough to answer broad research questions [22]. According to Du-
da, researchers and funding agencies seek to leverage existing clinical care data by pool-
ing data sets from multiple sites. The United States of America’s National Institutes of
Health (NIH) have indicated an interest in promoting and expanding such clinical re-
search networks by featuring them as a cornerstone of the NIH Roadmap for Medical
Research [22].
The U.S. Nationwide Health Information Network, a standards initiative for health in-
formation exchange over the Internet, supports complementary standards for both clinical
care and clinical research data in order to encourage and support the reuse of healthcare
data for observational studies and population monitoring. Duda argues that medical re-
search is experiencing a simultaneous upsurge in international research collaborations
[22].
Membership in multi-national research networks has grown exponentially, and publi-
cations by multi-national research teams receive more citations than similar work from
domestic collaborations [22]. These trends combine in the increased reuse of clinical care
data for international research collaborations. Data collected during routine patient care
are readily available and relatively inexpensive to acquire, so that even clinical sites in
resource-limited settings are contributing data to shared repositories or multi-site data
sets. Unfortunately, scientists seldom investigate the quality of such secondary-use data as
thoroughly as data generated in clinical trials or similarly regulated studies [22].
Duda believes that some research groups rely on data cleaning performed at the data
coordinating center in order to detect data discrepancies, or they request that their partici-
pating sites perform regular quality self-assessments [22]. Given time and funding re-
strictions and a dearth of data management personnel in academic centers, it is likely that
many groups simply accept secondary-use data as they are. According to Duda, significant
challenges to high-quality data exist within such international, multi-site research
networks, but these issues can be remedied through well-planned, cost-effective quality
control activities. Duda underlines the necessity of data quality assessments for observational
networks, as well as means of identifying and evaluating data errors and improving
the audit process [22].
Based on the article Quality and Value of the Data Resource in Large Enterprises by
Otto, enterprises are facing problems in managing the quality and value of their key data
objects [55]. The article presents the findings from a case study comprising six large en-
terprises. The study results point to the importance of the situational nature of master data
as a strategic resource, which must be considered when analyzing how the quality of data
affects their value for business [55].
The article Anchoring Data Quality Dimensions in Ontological Foundations, by Wand
and Wang, claims that poor data quality can have a severe impact on the overall effec-
tiveness of an organization [67]. A leading computer industry information service firm
indicated that it expects most business process reengineering initiatives to fail through
lack of attention to data quality. An industry executive report noted that more than 60% of
the surveyed firms (500 medium-size corporations with annual sales of more than $20
million) had problems with data quality [67].
According to the article Data Quality Assessment in Context, by Watts et al., in organ-
izations today, the risk of poor information quality is becoming increasingly high as larg-
er and more complex information resources are being collected and managed [72]. To
mitigate this risk, decision makers assess the quality of the information provided by their
information systems in order to make effective decisions based on it. They may rely on quality
metadata: objective quality measurements tagged by data managers into the information
used by decision makers. Watts et al. claim that decision makers may also gauge infor-
mation quality on their own, subjectively and contextually, assessing the usefulness of the
information for solving the specific task at hand. Although information quality has been
defined as fitness for use, models of information quality assessment have thus far tended
to ignore the impact of contextual quality on information use and decision outcomes.
Contextual assessments can be as important as objective quality indicators since they can
affect which information is used for decision-making tasks [72].
The research by Watts et al. offers a theoretical model for understanding users' contex-
tual information quality assessment processes [72]. The model is grounded in dual pro-
cess theories of human cognition, which enable simultaneous evaluation of both objective
and contextual information quality attributes. The findings of an exploratory laboratory
experiment suggest that the theoretical model provides an avenue for understanding con-
textual aspects of information quality assessment in concert with objective ones. The
model offers guidance for the design of information environments that can improve
performance by integrating both objective and subjective aspects of the users' quality
assessments [72].
Based on the article Master Data Management and Customer Data Integration for a
Global Enterprise, by Berson and Dubov, data quality is one of the key components of
any successful data strategy and data governance initiative, and is also one of the core
enabling requirements for MDM [2, pp. 135-136, pp. 305]. Conversely, MDM is a power-
ful technique that helps an enterprise to improve the quality of master data [2, pp. 117].
Berson and Dubov claim that a key challenge of data quality is incomplete or unclear
semantic definitions of what the data are supposed to represent, in what form, and with
what kind of timeliness requirements [2]. The metadata repository is the place where
these definitions are stored. The quality of metadata may be low, because there are many
data quality dimensions and contexts, each of which may require a different approach to
the measurement and improvement of the data quality. For example, in order to measure
and improve address information about customers, there are numerous techniques and
references data sources that can provide an accurate view of a potentially misspelled or
incomplete address. Similarly, in order to validate a social security number or a driver’s
license number, it is possible to use a variety of standard sources of this information to
validate and correct the data [2].
The canonical approach intends to establish canonical data models where all the con-
text and time specific values of intermediate IS-shared attributes are considered purpose-
specific exceptions and replaced with canonical true data values [18]. Purpose-specific
requirements lead organizations to create purpose-specific applications each with similar,
yet inconsistent data values in differing formats. These inconsistencies have a dramatically
negative impact on overall data quality [18]. In the canonical approach, data that do
not belong to a group are considered flawed. The approach does not take into account the
differences between information processes: an information process can create incorrect
data, which are then left unconsidered.
The canonical approach presupposes a single, unambiguous meaning; other meanings
are interpreted as faults to be deleted. If such meanings are needed, they are formed from
data attributes. Further, if the meaning of data changes as a function of the data, the
correct interpretation has to be found, and other attributes are substituted, or new
attributes are created from the obsolete data.
Since the canonical approach does not pay attention to how data are generated, data
inaccuracy may be due to the fact that the information processes used to create the same
data may differ. The approach does not recognize that there are two kinds of error sources
regarding data. The first error source is that the data have not been created, so a real error
occurs. The second is that the canonical approach does not recognize that some processes
produce systematic errors, and that different types of data can be produced by various
processes.
The federative approach advocates that there are different types of contents, and if they
represent different contexts, this is acceptable. If the contents describe the same context,
they are incorrect and must be corrected, but there may be deliberately different values
for a particular result. For example, consider temperature measurement in the human
body: a patient's body temperature measured before surgery shows no fever, while after
surgery the same patient has a slight fever. Both values can be interpreted as representing
normal body temperature, since it is known that the temperature after surgery is slightly
higher. As a result, it does not make sense to correct the latter value as an error.
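This context dependence can be sketched as validation rules keyed by context; the context names and temperature thresholds below are illustrative assumptions, not clinical reference values.

```python
# Context-dependent plausibility check, illustrating the federative idea that
# the same attribute (body temperature) has different acceptable ranges in
# different contexts. All thresholds are illustrative assumptions.
RULES = {
    "pre_surgery":  (35.5, 37.5),   # degrees Celsius
    "post_surgery": (35.5, 38.2),   # slightly higher temperature is still normal
}

def is_plausible(temperature_c: float, context: str) -> bool:
    low, high = RULES[context]
    return low <= temperature_c <= high

# 37.9 C would be flagged before surgery but accepted after it:
print(is_plausible(37.9, "pre_surgery"))   # False
print(is_plausible(37.9, "post_surgery"))  # True
```

The same value is thus corrected in one context and accepted as-is in another, which is precisely what a single canonical "true" value cannot express.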
In the federative approach the data always reside in their original location. The ap-
proach includes different rules for use when a context is flawed. If an information process
produces incorrect data during its lifecycle it must be corrected.
3.6 Data Consolidation
According to Loshin, data consolidation can be defined as locating data instances from
different sources and bringing them together [47, pp. 179]. The integration tools use the parsing,
standardization, harmonization, and matching capabilities of the data quality technologies
to consolidate data into unique records in the master data model. Loshin argues that data
consolidation depends on the number of data sources that feed the master repository and
the expectation of a single view of each master entity [47, pp. 174].
The federative approach consolidates data in its original place without data transfer or
extraction. The approach uses interoperable/shared attributes in order to build a linkage
between different registers. It is based on the processes of sharing and matching, but not
on integration. Integration is possible, and the reason for integrating is to ensure
interoperability, not to match and cleanse data. At least two data storages with shared
attributes are needed when integrating, and when cross-referencing is performed, the
interoperable attributes are the result.
The canonical philosophy consolidates data into the golden record by applying the
master data model to entities. Relating to the stance on data storages, the canonical ap-
proach advocates reporting databases in data storages, while in the federative approach,
the data reside in their original location. However, in the federative approach, repositories
with the meanings of attributes, registered cross-references, and descriptions of contents
are used in federation.
3.6.1 Sharing
Based on Loshin, the essence of MDM revolves around data sharing and interchange [47,
pp. 146]. Information is shared using data integration tools in three ways, as follows:
• Data transformation
• Data monitoring
• Data consolidation.
Data transformation means that data are transformed into a format that is acceptable to
the target architecture. Data monitoring provides a way of incorporating the types of data
rules both discovered and defined during the data profiling phase. Data consolidation
means that data are consolidated into unique records in the master data model [47].
Heimbigner and McLeod [37] argue that the federated architecture provides mecha-
nisms for sharing data, for sharing transactions (via message types), for combining infor-
mation from several components, and for coordinating activities among autonomous
components (via negotiation). A prototype implementation of the federated database
mechanism is currently operational on an experimental basis [37].
3.6.2 Mapping
According to Loshin, data mapping is the process of creating data element mappings be-
tween two distinct data models [47]. Data mapping is used as a first step in a wide variety
of data integration tasks. The first task is data transformation or data mediation between a
data source and a destination. The second task is identification of data relationships as
part of data lineage analysis. The third task is discovery of hidden sensitive data such as
the last four digits of a social security number hidden in another user ID as part of a data
masking or de-identification project. Finally, the fourth task is consolidation of multiple
databases into a single database and identification of redundant columns for consolidation
or elimination [47].
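The first of these tasks, transformation between a source and a destination model, can be sketched as a declarative element mapping; the field names, formats, and transformation rules below are illustrative assumptions.

```python
# A minimal data element mapping between two distinct data models.
# Each target field names its source field plus a transformation rule.
source_record = {"pat_name": "VIRTANEN, MAIJA", "dob": "01.02.1970"}

MAPPING = {
    "patient_name": ("pat_name", lambda v: v.title()),
    # day.month.year -> ISO 8601 year-month-day
    "birth_date":   ("dob", lambda v: "-".join(reversed(v.split(".")))),
}

def transform(record: dict) -> dict:
    """Apply the mapping to mediate one record from source to destination."""
    return {target: fn(record[src]) for target, (src, fn) in MAPPING.items()}

print(transform(source_record))
# {'patient_name': 'Virtanen, Maija', 'birth_date': '1970-02-01'}
```

Keeping the mapping declarative, as here, is what lets an integration system reason about it instead of hiding the rules in procedural code.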
3.6.3 Matching
Based on the article Data Matching Concepts and Techniques for Record Linkage, Entity
Resolution, and Duplicate Detection, by Christen, data matching is the task of identifying,
matching, and merging records that correspond to the same entities from several data-
bases [9]. The entities under consideration most commonly refer to people, for example
patients, customers, tax payers, or travelers.
Christen claims that a major challenge in data matching is the lack of common entity
identifiers in the databases to be matched [9]. The matching needs to be conducted using
attributes that contain partially identifying information, such as names, addresses, or dates
of birth. Such identifying information is often of low quality. Personal details in particular
suffer from frequently occurring typographical variations and errors, and such infor-
mation can change over time, or it is only partially available in the databases to be
matched [9].
DAMA believes that one of the greatest ongoing challenges in MDM is the process for
the matching, merging, and linking of data from multiple systems about the same person
(patient), group (gender), place (location), or thing (diagnosis) [18].
The key to building a data integration application is the source description, a kind of
glue that connects the mediated schema and the schemas of the sources [20, pp. 11]. The
descriptions specify the properties of the sources that the system needs to know in order
to use their data. The main component of source descriptions is semantic mapping, which
relates the schemata of the data sources to the mediated schema [20].
Semantic mappings specify how attributes in the sources correspond to attributes in the
mediated schema, and how the different groupings of attributes into tables are resolved [20].
The semantic mappings specify how to resolve differences in how data values are speci-
fied in different sources. Thus specification between every pair of data sources is unnec-
essary. The semantic mappings are specified declaratively, which enables the data inte-
gration system to reason about the contents of the data sources and their relevance to a
given query and to optimize the query execution [20].
Data matching is the problem of finding structured data items that describe the same
real-world entity [20, pp. 173]. In many integration situations, merging multiple databases
with identical schemas is not possible without a unique global ID and without deciding
which rows are duplicates.
Doan et al. name the matching techniques available as follows [20]:
• Rule-based matching: the aim is to match tuples from two tables with the same
schema, but generalizing to other contexts is straightforward. The rule computes the
similarity score between a pair of tuples x and y, for example as a linearly weighted
combination of per-attribute similarities.
• Learning-based matching: supervised learning is used to automatically create matching
rules from labeled examples.
• Matching by clustering.
• Probabilistic approaches to data matching.
• Collective matching.
• Scaling up data matching.
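The first technique, rule-based matching with a linearly weighted similarity score, might look as follows in outline; the attributes, weights, threshold, and the use of a generic string-similarity measure are illustrative assumptions rather than Doan et al.'s formulation.

```python
from difflib import SequenceMatcher

# Rule-based matching sketch: the similarity of two tuples x and y is a
# linearly weighted combination of per-attribute similarity scores.
# The attributes, weights, and threshold are illustrative assumptions.
WEIGHTS = {"name": 0.5, "address": 0.3, "birth_date": 0.2}

def sim(a: str, b: str) -> float:
    """Generic string similarity in [0, 1] (stand-in for an edit-distance rule)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(x: dict, y: dict) -> float:
    return sum(w * sim(x[attr], y[attr]) for attr, w in WEIGHTS.items())

x = {"name": "Maija Virtanen", "address": "Yliopistonkatu 1", "birth_date": "1970-02-01"}
y = {"name": "Virtanen Maija", "address": "Yliopistonkatu 1A", "birth_date": "1970-02-01"}

# Declare a match when the weighted score exceeds a chosen threshold.
print(match_score(x, y) > 0.6)  # True for this pair
```

The weights encode how much identifying power each attribute carries; in practice they are tuned per domain, which is exactly where learning-based matching takes over from hand-written rules.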
3.6.4 Data Federation
According to van der Lans, data federation refers to the combining of autonomously op-
erating objects [44]. Basically data federation means combining autonomous data stores
to form one large data store. Data federation is a form of data virtualization where the
data stored in a heterogeneous set of autonomous data stores is made accessible to data
consumers as one integrated data store by using on-demand data integration. The defini-
tion is based on data virtualization [44].
The difference between federated and canonical data integration is that data are not
collected into one (harmonized record, i.e. a golden record) IS, but instead the data are
left in the original location [44]. Technically data federation is conducted with the help of
a metadata repository, which maps federated data sets to each other by using interopera-
ble attributes. The metadata repository comprises data storage for federation rules, mean-
ings of attributes, descriptions of data formats, and definitions of mappings. Metadata
descriptions are created, modified and used only when data federation is needed. New
federation rules can be added whenever needed, e.g. for new reporting needs. The idea is
to avoid large data banks and to produce practically useful results from the very begin-
ning [44].
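A minimal sketch of such on-demand federation, assuming two autonomous registers linked through one interoperable attribute (here a personal identity code), could look as follows; the register contents and attribute names are illustrative.

```python
# On-demand federation sketch: two autonomous registers stay in their
# original location and are combined only at query time through one
# interoperable attribute. Contents are illustrative assumptions.
pathology_register = {"010170-123X": {"diagnosis": "C50.9"}}
treatment_register = {"010170-123X": {"operation_date": "2015-05-12"}}

def federate(pid: str) -> dict:
    """Combine views at query time; neither register is changed or copied."""
    combined = {"pid": pid}
    combined.update(pathology_register.get(pid, {}))
    combined.update(treatment_register.get(pid, {}))
    return combined

print(federate("010170-123X"))
# {'pid': '010170-123X', 'diagnosis': 'C50.9', 'operation_date': '2015-05-12'}
```

In a real system the federation rules, attribute meanings, and mappings would come from the metadata repository rather than being hard-coded as above.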
Data federation should make it possible to bring data together from data stores using
different storage structures, different access languages, and different application pro-
gramming interfaces (APIs). An application that uses data federation is able to access
different types of database servers and files with various formats. It makes it possible to
integrate data from all these data sources. Data federation also offers features for trans-
forming the data. It allows applications and tools to access the data through various APIs
and languages [44].
Data stores accessed by data federation are able to operate independently. This means
that they can be used outside the scope of data federation. Regardless of how and where
data are stored, they are presented as one integrated data set. This implies that data fed-
eration involves transformation, cleansing, and possibly even enrichment of data [44].
Based on Dahlberg et al., data federation, which uses a data governance framework
artifact as a tool, may differ from canonical data integration [13]. Data are made
interoperable without changing the original data in federated data storages. Canonical data
integration often means and leads to data transformation, cleansing, harmonization and/or
standardization. In contrast, data federation makes it possible to use data simultaneously
from data storages with different technical, informational and social characteristics. It
requires that users of the federated data understand the meaning of the outcomes. The
purpose of data federation is to make the linked data storages interoperable through data
mapping [13].
Current problems in data federation are related primarily to two issues [13]:
• Data ontology
• Insufficient attention paid to the governance of data.
The article Towards Information Systems as a Science of Meta-Artifacts by Iivari ar-
gues that more emphasis should be given to the nature of Information Systems as an ap-
plied, engineering-like discipline that develops various meta-artifacts to support the de-
velopment of IS artifacts [41]. The article refers to data federation with the use of arti-
facts. Iivari argues that building such meta-artifacts is a complementary approach to the
"theory with practical implications" type of research. The primacy assigned to theory and
research method has effectively excluded constructive research on building meta-artifacts
from the major IS journals. The article also claims that information systems as a category
of IT artifacts, and especially the focus on IS development, can help to distinguish the IS
discipline from its sister and reference disciplines [41].
According to the article Data Federation Methods and System, by Chen et al., a meth-
od is provided for processing tree-like data structures in a streaming manner [5]. An ini-
tial context of name/value bindings is set up, and a tree of objects is constructed. Each
element in the tree of objects is represented as a function object, which accepts a context
parameter and a target parameter that it can send a stream of start, content, and events to
represent tree output. The parse tree of objects is examined for element names that are
recognized as commands. The commands are converted into special function objects that
implement the command’s semantics. Other elements that are not recognized as com-
mands, are mapped to a default function object in data federation [5].
In managing datification and data federation in open systems environments, master
data are non-transactional data saved by several ISs, which, from the data federation
perspective, provide links to various data sets and sources [2]. The golden record approach
emerged as a solution to the problem of what to do with inconsistent and fragmented data,
saved in storages that had been brought together. The approach can be defined as a single,
well-defined version of all the data entities in an organizational ecosystem [2].
It is necessary to understand the use of the artifact based on nine questions (the princi-
ples of data federation) [15]. Contextual metadata are also needed in order to execute data
federation. Data federation starts from an understanding of the contextual metadata. The
ability to federate and combine firstly, and then to analyze data, makes any data potential-
ly valuable, worth gathering, maintaining and describing [15].
Based on the article The MDM Golden Record Is Dead, Rest in Peace – Welcome In-
terpreted Interoperable Attributes in Open Systems Environments, by Dahlberg, data fed-
erations are characterized by differences in the formats, structure, granularity and other
characteristics of data [16]. Dahlberg claims that data federation can happen only if there
is a connecting data element, such as a patient (a customer) or a type of care (a product)
available [16].
Based on Design of Enterprise Systems by Giachetti, a federated database is defined as
a collection of heterogeneous, component databases, over which a global view is created
[35, pp. 396]. This makes it possible for applications to treat the separate databases as a
single database.
The article Federated Database Systems for Managing Distributed, Heterogeneous and
Autonomous Databases, by Sheth and Larson, describes a federated database system
(FDBS) as a collection of cooperating database systems that are autonomous and possibly
heterogeneous [65]. The article defines the reference architecture for distributed database
management systems based on system and schema viewpoints and shows how various
FDBS architectures can be developed [65].
The article A Federated Architecture for Information Management, by Heimbigner and
McLeod, presents an approach to the coordinated sharing and interchange of computer-
ized information, which is described emphasizing partial, controlled sharing among au-
tonomous databases [37]. Office information systems provide a particularly appropriate
context for this type of information sharing and exchange. The federated database archi-
tecture consists of a collection of independent database systems, which are united into a
loosely coupled federation in order to share and exchange information [37].
According to Heimbigner and McLeod, a federation consists of components (of which
there may be any number) and a single federal dictionary [37]. The components represent
individual users, applications, workstations, or other components of an office information
system. The federal dictionary is a specialized component that maintains the topology of
the federation and oversees the entry of new components. Each component in the federa-
tion controls its interactions with other components by means of an export schema and an
import schema. The export schema specifies the information that a component will share
with other components, while the import schema specifies the non-local information that
a component wishes to manipulate [37].
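The export/import schema mechanism can be sketched as two filters; the component contents and schema entries below are illustrative assumptions, not Heimbigner and McLeod's notation.

```python
# Sketch of export/import schemas in a federated architecture [37]: each
# component exposes only the attributes named in its export schema, and a
# consumer pulls only what its import schema requests. Names are illustrative.
class Component:
    def __init__(self, data: dict, export_schema: set):
        self._data = data
        self.export_schema = export_schema   # what this component will share

    def export(self) -> dict:
        return {k: v for k, v in self._data.items() if k in self.export_schema}

def import_from(component: Component, import_schema: set) -> dict:
    # the import schema names the non-local information the consumer wants
    return {k: v for k, v in component.export().items() if k in import_schema}

lab = Component({"pid": "42", "tnm": "T1N0M0", "internal_note": "draft"},
                export_schema={"pid", "tnm"})
print(import_from(lab, {"pid", "tnm", "operation_date"}))
# {'pid': '42', 'tnm': 'T1N0M0'}  -- internal_note is never exported
```

The double filtering is what makes the sharing partial and controlled: the exporter decides what leaves, the importer decides what enters, and neither side sees the other's internals.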
3.6.5 Data Integration
Data integration is a form of data federation or data virtualization. According to van der
Lans, data integration is the process of combining data from a possibly heterogeneous set
of data stores to create one unified view of all the data [44, pp. 8]. Data integration is in-
volved, among other things, in joining data, in transforming data values, enriching data,
and cleansing data values. The definition itself does not take a stance on how integration
occurs.
Linthicum defines Enterprise Application Integration (EAI) as the unrestricted sharing
of data and business processes among any connected applications and data sources in the
enterprise [46]. Sharing information among different systems is particularly difficult as
many of them are not designed to access anything outside their own proprietary technolo-
gy [46, pp. 3-4].
EAI allows many of the stovepipe applications (e.g. patient management systems,
ERP, CRM etc.) to share both processes and data. EAI does not need to make changes to
the applications or data structures. According to Gable, EAI allows diverse systems to
connect with one another quickly in order to share data, communications, and processes,
alleviating the information silos that plague many businesses [33, pp. 48]. EAI implemen-
tation integrates the Information Systems (ISs) so that a data warehouse can aggregate
account data, providing a single view to the end user.
On-demand integration refers to the issue of when data from a heterogeneous set of da-
ta stores is integrated. Unlike data federation, integration takes place on the fly, and not in
a batch. When data consumers ask for data, only then are the data accessed and integrated.
Thus data are not stored in an integrated way, but remain in their original location
and format [44].
Maier et al. argue that data integration requires mutual understanding in user organiza-
tions or applications. It relates to exchanging data on how data resources are addressed
over a network and in the most generic sense over the Internet [50]. Organizations must
consider which character set to use. They are expected to know about the internal struc-
ture of documents, i.e. text markup and about the scope or domain, in which the specified
names in the markup are valid. Finally, it is necessary to know how to define a schema,
the structure of the elements in a semi-structured text, and how to translate a document
that is an instance of one schema so that it conforms to another schema [50].
According to Giachetti, data integration technologies either create a single, unified
model of the data by merging databases together, or provide the tools and technologies to
move data between systems [35, pp. 393-397]. To share data across the enterprise, the
options include the following:
• Point-to-point integration connects two databases together by defining data translators
between them.
• Data middleware (e.g. ODBC) creates interfaces between the database and all the
other applications [35].
• A single, centralized database serves the entire organization, and every application
writes to and reads from the shared database (e.g. ERP).
The term federated data refers to a collection of cooperating but autonomous databases.
It consists of a collection of heterogeneous component databases, over which a global
view is created, so that applications can treat the separate databases as a single database.
This is implemented by means of data mediation, which converts data from one format to
another. The original data sources are left untouched [35].
A data warehouse collects data from one or more operational databases, integrates the
data, and makes them available for querying, reporting, and analysis [35]. No transactions
are executed in connection with the warehouse but it is only used to obtain information.
There is a relationship between the data warehouse and the operational databases. This
relationship is realized by an ETL process, which takes data from the operational databases,
cleans the data, and then loads them into the data warehouse. The process is executed in
batches.
Giachetti describes data integration as a process, which is executed by taking the data
structure and data definitions from the legacy systems, redesigning the structure and data
definitions, and recreating them in the new system [35, pp. 409]. The conversion process
is called ETL. The data objects are extracted from the old data systems, cleansed, then
transformed and loaded into the new database. Cleansing of the data is required to ensure
that they are correct, complete, consistent, and that they adhere to business rules (dirti-
ness: missing data attributes, noise words, different languages, misspellings, multiple
terms for the same data).
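The ETL conversion described above can be sketched as three small steps; the legacy rows, the completeness rule, and the synonym table are illustrative assumptions.

```python
# Minimal ETL sketch: extract from a legacy structure, cleanse (drop rows
# with missing attributes, unify multiple terms for the same data),
# transform, and load into the new store. All names are illustrative.
legacy_rows = [
    {"name": "Virtanen", "dx": "breast ca"},
    {"name": "Korhonen", "dx": "mamma carcinoma"},
    {"name": "", "dx": "breast ca"},          # dirty: missing data attribute
]

SYNONYMS = {"breast ca": "breast carcinoma", "mamma carcinoma": "breast carcinoma"}

def extract():
    return list(legacy_rows)

def cleanse_and_transform(rows):
    clean = [r for r in rows if r["name"]]    # completeness business rule
    return [{"name": r["name"], "dx": SYNONYMS.get(r["dx"], r["dx"])} for r in clean]

warehouse = []
def load(rows):
    warehouse.extend(rows)                    # batch load into the new database

load(cleanse_and_transform(extract()))
print(len(warehouse))  # 2 rows survive cleansing, both with one unified term
```

Real ETL tools add logging, rejects handling, and incremental loads, but the extract-cleanse-transform-load shape is the same.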
According to Doan et al., most integration systems are based on warehousing or virtual
integration [20, pp. 9]. In warehousing, data from the individual data sources are loaded
and materialized into a physical database (warehouse), where queries about the data can
be answered. In virtual integration, the data remain in the sources and are accessed as
needed at query time [20].
Five mainstay challenges of data integration are as follows [25, pp. 72]:
• The need for a different development framework
• Difficulty with stakeholder enlistment
• Operational data are not available
• Nonexistent metadata
• Poor data quality.
Customer Data Integration by Dyché and Levy [25, pp. 98] argues that the goal
of matching is to identify all the data for a particular customer held by the enterprise. The
golden record implies a single, standard record that has usually been generated with data
from multiple source systems [25, pp. 101]. For example, the CDI hub recognizes, match-
es, and consolidates the data into a master record for that customer. The CDI hub will pull
the most accurate value from each source system so that the golden record may contain
the customer’s first and last name from one system, the phone number from another sys-
tem, and the home address from another system (Figure 7). CDI hubs are distinguished
from other technology solutions by their ability to identify the optimal values that com-
prise the golden record for a customer. The CDI hub has the ability to tie-break between
data sources and individual elements, and it decides on the best combination of elements
to comprise the golden record. Ultimately, the golden record becomes the enterprise’s
view of an individual customer’s information [25, pp. 102].
Figure 7 Generating the Golden Record [25, pp. 99]
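The tie-breaking described above can be sketched as attribute-level survivorship; the source precedence order and the records themselves are illustrative assumptions, not an actual CDI hub's rules.

```python
# Golden-record sketch: for each attribute, pick the value from the most
# trusted source that has one. Precedence and records are illustrative.
SOURCE_PRECEDENCE = ["crm", "billing", "web"]   # most trusted first

records = {
    "crm":     {"name": "Maija Virtanen", "phone": None, "address": None},
    "billing": {"name": "M. Virtanen", "phone": "+358 40 123", "address": None},
    "web":     {"name": None, "phone": None, "address": "Yliopistonkatu 1"},
}

def golden_record(attrs=("name", "phone", "address")) -> dict:
    golden = {}
    for attr in attrs:
        for source in SOURCE_PRECEDENCE:
            value = records[source].get(attr)
            if value:               # survivorship: first trusted non-empty value wins
                golden[attr] = value
                break
    return golden

print(golden_record())
# {'name': 'Maija Virtanen', 'phone': '+358 40 123', 'address': 'Yliopistonkatu 1'}
```

Each attribute of the resulting record can thus come from a different source system, exactly as in the name/phone/address example above.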
According to Watson, there is a lack of data integration in most organizations [72, pp.
22]. Due to the limitations of the available technology, early computer systems were not
integrated. Organizations created simple file systems to support a particular function. In-
tegration is a long-term goal. As new systems are developed and old ones rewritten, or-
ganizations can evolve integrated systems. It would be too costly and disruptive to try to
solve the data integration problem in one step.
3.6.6 Data Warehouse, Storage and Repository
DAMA defines a data warehouse as a combination of an integrated decision support da-
tabase and the related software programs [18, pp. 197]. The programs take care of collect-
ing, cleansing, transforming, and storing data from internal and external sources.
Based on Data Mapping Diagrams for Data Warehouse Design with UML in Data
Warehouse (DW) Scenarios, by Luján-Mora et al., ETL (Extract, Transform, Load) pro-
cesses are responsible for the extraction of data from heterogeneous operational data
sources, their transformation (conversion, cleaning, normalization, etc.) and their loading
into the DW [49].
The article by Luján-Mora et al. presents a framework for the design of the DW back-
stage and the respective ETL processes, based on the key observation that this task fun-
damentally involves dealing with the specificities of information at very low levels of
granularity, including transformation rules at the attribute level [49]. Specifically, the
article introduces a disciplined framework for the modeling of relationships between
sources and targets at different levels of granularity, which include coarse mappings at the
database and table levels to detailed inter-attribute mappings at the attribute level [49].
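The ETL process described by Luján-Mora et al. can be illustrated with a minimal sketch. The two operational sources, their field names, and the target schema below are invented for illustration; a real DW back-stage would specify such attribute-level transformation rules formally:

```python
extracted = [                                # Extract: rows from heterogeneous sources
    {"pat_name": " smith, JOHN ", "dob": "1950-03-02", "src": "hospital_a"},
    {"name": "DOE JANE",          "birth": "02.07.1948", "src": "clinic_b"},
]

def transform(row):
    """Transform: normalize names and dates into one target schema."""
    if row["src"] == "hospital_a":
        last, first = [p.strip() for p in row["pat_name"].split(",")]
        dob = row["dob"]                                  # already ISO
    else:
        last, first = row["name"].split()                 # "LAST FIRST"
        d, m, y = row["birth"].split(".")                 # dd.mm.yyyy -> ISO
        dob = f"{y}-{m}-{d}"
    return {"first_name": first.title(), "last_name": last.title(), "dob": dob}

warehouse = []
for row in extracted:                        # Load: append cleaned rows to the DW
    warehouse.append(transform(row))

print(warehouse)
```

Note how every mapping in `transform` operates at the attribute level, which is exactly the granularity the article's framework is designed to model.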
A data storage archives data in electromagnetic and digital form. According
to Dahlberg et al., data storages have different IS technical, informational, and socio-
contextual data characteristics [12]. In addition, the accountabilities of various data stor-
ages seem to be unclear in most organizations [11]. Various data storages provide fragmented, overlapping, and even contradictory data on the same issue.
A repository is a particular kind of setup within an overall IT structure, such as a group
of databases, where an organization keeps data of various kinds. Based on the article
Adex – A Meta Modeling Framework for Repository-Centric Systems Building, by Red-
dy et al., enterprises use repositories for storing, and subsequently leveraging, the descrip-
tions of diverse information systems and the various complex relationships present in
them [60, pp. 1]. An information system repository should also support modeling of pro-
cesses that coordinate various system-building activities.
Based on Watson, a data registry is equivalent to the reference repository, which in-
cludes metadata. It contains a description of each data type, format, programming stand-
ards (e.g. volume in liters) and the meaning of a field [71, pp. 437]. For a data warehouse, a data registry also includes specifications of the operating system with which the data were created, the data transformations, and the frequency of data retrievals. Analysts need the metadata in order to design their analyses and to learn the contents of data warehouses. If a data registry does not exist, one should be implemented and maintained in order to ensure the integrity of the data warehouse [71].
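Watson's data registry can be pictured as a small catalog of field-level metadata. The fields and their descriptions below are hypothetical examples, not entries from any actual warehouse:

```python
# A minimal sketch of a data registry (reference repository): each entry
# records the type, format/unit, meaning, originating system, and refresh
# frequency of a warehouse field, as Watson describes.
REGISTRY = {
    "tumor_volume": {"type": "float", "unit": "liters",
                     "meaning": "Estimated tumor volume",
                     "source_os": "z/OS batch extract", "refresh": "monthly"},
    "diagnosis_date": {"type": "date", "unit": "YYYY-MM-DD",
                       "meaning": "Date of primary diagnosis",
                       "source_os": "Linux ETL job", "refresh": "daily"},
}

def describe(field):
    """Return a human-readable description an analyst could consult."""
    meta = REGISTRY[field]
    return f"{field}: {meta['meaning']} ({meta['type']}, refreshed {meta['refresh']})"

print(describe("tumor_volume"))
```

An analyst designing a query would consult such entries to learn what a field means and how fresh its values are before building an analysis on it.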
3.7 Data Management Framework
DAMA defines data governance as a core function of the data management framework
[18, pp. 37]. Data governance interacts with and influences each of the surrounding data
management functions. Data management in turn is a high-level business process [18, pp.
17]. Its functions are the planning and execution of policies, practices, and projects that acquire, control, protect, deliver, and enhance the value of data and information assets
[18]. Shleifer and Vishny define corporate governance as referring to the way in which
suppliers of finance assure themselves a return on their investment [62, pp. 737].
Data governance can prevent data deficiency problems. Accountabilities can be agreed upon to prevent ontological or design failures and the operational problems of centralizing the management and interpretation of the data. In this way the governance of the organization is made to match its ontological stance [14].
The federative approach is one of the solutions for governing and managing data effi-
ciently in an open information systems environment. The approach governs all the data in
a certain context, and context-based governance models are built. Models define clear
accountabilities and understanding of sources (sensor data, social media data, IoT etc.), as
well as the temporal and spatial dimensions. An understanding of ontology is a keystone of governance. The canonical approach contains no such element and gives it no attention.
Relating to the canonical approach, one domain (e.g. product data, customer data, de-
vice data) is viewed at a time. The aim is to go through all the information systems of the
domain. As a result a hierarchical governance model is built, governing who owns what
in each phase. Since it is unilateral, no account is taken of the different dimensions of the
data and how they should be governed. The canonical approach pays no attention to the
temporal and spatial dimensions, accountabilities, structural data, distinct images, text
data or accurate data. All the data are interpreted indistinguishably without considering
the location where the data are created. The canonical philosophy considers that an organ-
ization takes responsibility for its own IS and thus knows the meaning of the data stored
in each database.
3.7.1 Data and Corporate Governance
Based on the article A Framework for the Corporate Governance of Data – Theoretical Background and Empirical Evidence, by Dahlberg and Nokkala, in a modern
organization, IT and digital data have transformed from being functional resources to
being integral elements of business strategy [14]. They apply the framework to the gov-
ernance of data relating to aging societies, that is, to answer the question of how best to
manage the provision of services with digital data enablement and support to citizens.
Dahlberg and Nokkala disclose the results of two recent surveys, with 212 and 68 re-
spondents respectively, on the business significance of data governance [14]. The survey
results show that good governance of data is considered critical to organizations. As a
result of the continuous increase in ISs and data storage systems, overlapping data on the same citizens, services, and professionals are also increasing. Dahlberg and Nok-
kala argue that this is a managerial issue, because only business professionals know what
the content of data should be and what data are needed to perform specific tasks [14, pp.
28].
The article A Survey of Corporate Governance, by Shleifer and Vishny, is focused on
research on corporate governance [62]. Shleifer and Vishny argue that most worldwide
corporate governance mechanisms with large share holdings, relationship banking, and
takeovers can be viewed as examples of large investors exercising their power [62, pp.
739]. Corporate governance deals with constraints that managers put on themselves, or
that investors put on managers, to reduce the ex post misallocation and thus to induce
investors to provide more funds in advance [62, pp. 743]. Successful corporate govern-
ance systems combine significant legal protection of at least some investors with an important role for large investors [62, pp. 774].
Based on the article The State of Corporate Governance Research, by Bebchuk and
Weisbach, the special issue on corporate governance, co-sponsored by the Review of Fi-
nancial Studies and the National Bureau of Economic Research (NBER), states that poor
governance can limit capital flows and the integration of capital markets in the global
economy [1, pp. 952]. Bebchuk and Weisbach note that corporate governance is in part a product of the legal systems set in place and the legal infrastructure accompanying them.
According to the article The Governance of Inter-organizational Coordination Hubs,
by Markus and Bui, business-to-business interactions are increasingly conducted through
inter-organizational coordination hubs, in which standardized information technology–
based platforms provide data and business process interoperability for interactions among
the organizations in particular industrial communities [51]. Because the governance of
inter-organizational arrangements is believed to affect their efficiency and effectiveness,
the article explores how and why inter-organizational coordination hubs are governed.
Analysis of the relevant prior theory and case examples shows that coordination hub
governance is designed to balance the sometimes conflicting needs for capital to invest in
new technology, for participation of industry members, and for the protection of data re-
sources. The findings by Markus and Bui suggest that the governance of inter-
organizational coordination hubs is not the starkly categorical choice between collective
(member-owned) and investor-owned forms, as suggested by the prior theory [51].
3.7.2 Data Management
DAMA defines data management as a business function [18]. It is responsible for plan-
ning, controlling and delivering data and information assets. Management consists of the
disciplines of development, execution, and supervision of plans, policies, programs, pro-
jects, processes, practices and procedures. The disciplines control, protect, deliver, and
enhance the value of data and information assets [18].
Data management includes database administration: database design, implementation, and production support [18]. It is a responsibility shared between the data management
professionals within IT organizations and the business data stewards, who represent the
collective interests of data producers and information consumers [18, pp. 5]. The profes-
sionals serve as the expert curators and technical custodians of the data, while stewards
serve as the appointed trustees for data assets.
The holistic data management function encompasses the following [18, pp. 6]:
Data governance
Data architecture management
Data development
Data operations management
Data security management
Data quality management
Reference and master data management
Data warehousing and business intelligence management
Document and content management
Metadata management.
3.8 Discovery of Data from Large Data Sets
Data mining refers to a set of methodologies aimed at finding relevant information from
large masses of data, i.e. Big Data [61]. Data mining can be applied very broadly. Typi-
cally, the data used in data mining include, for example, measurements of industrial pro-
cesses, and excerpts from the customer database or web server log files. Definitions of the
purposes of data mining do not limit the methods available. In most cases, the algorithms
used are, for example, various clustering, correlations, neural networks, self-organizing
maps, etc. Generally speaking, the successful utilization of data mining requires the most relevant data together with a holistic understanding of the variables. Also, a simple innovative approach, for example data visualization, can help to see the benefits of the data warehouse from an entirely new perspective [61].
Most definitions of Big Data focus on the size of data in storage. Size matters, but
there are other important attributes of Big Data, namely data variety and data velocity.
The three Vs of Big Data (volume, variety, and velocity) constitute a comprehensive def-
inition, and they bust the myth that Big Data is only about data volume. In addition, each
of the three Vs has its own ramifications for analytics [61]. Attributes can be divided into
two groups: stable and flexible. Stable attributes are attributes whose values cannot be changed (e.g. age or maiden name), while the values of flexible attributes can be changed
[59].
Big Data have driven the emergence of the federative approach. The approach regards Big Data as a whole that can be processed in its original location. The structure
of the data does not play a crucial role, as the interoperable, shared attributes with cross-
references take care of data harmonization. Based on various different data storages, Big
Data can be defined as complex data, and data federation is necessary in order to solve
data integration of this kind, where attributes are shared between different data storages
and databases.
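The idea of shared attributes with cross-references can be sketched as follows. The two data storages, their identifiers, and the cross-reference table are invented for illustration; the point is that the data remain in place and are combined only at query time:

```python
registry_db = {"FI-001": {"diagnosis": "C50.9"},          # storage 1 (keyed by national id)
               "FI-002": {"diagnosis": "C50.1"}}
lab_db      = {"LAB-77": {"marker": "HER2+"},             # storage 2 (keyed by lab id)
               "LAB-78": {"marker": "HER2-"}}
xref        = {"FI-001": "LAB-77", "FI-002": "LAB-78"}    # shared-attribute cross-reference

def federated_view(patient_id):
    """Answer a query by combining both sources in place, without copying them."""
    combined = dict(registry_db[patient_id])
    combined.update(lab_db[xref[patient_id]])
    return combined

print(federated_view("FI-001"))
```

Neither storage is restructured or consolidated into a single model; only the cross-reference harmonizes them, which is the essence of the federative approach.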
The canonical philosophy aims to build a canonical data model with separate entities
in order to gain the golden record. Due to the size of Big Data, the golden record approach faces critical problems: Big Data consist of heterogeneous data formats, e.g. social media, data streaming, IoT, etc., which make it difficult or impossible to consolidate the data into the golden record without sacrificing data quality.
The canonical approach builds a single data model from Big Data. If the information systems are not governed by the organization itself, however, no data model can be found, and it is not possible to create one, since the data are external and owned by software vendors. In this case attention is paid to public and well-known data storages, where the data are more readily available, but federation of the data is difficult.
3.8.1 Data Mining
The article Framework for Early Detection and Prevention of Oral Cancer Using Data
Mining, by Sharma and Om, proposes an ED&P framework, which is used to develop a
data mining model for early detection and prevention of malignancy of the oral cavity
[64]. The database of 1025 patients has been created and the required information stored
in the form of 36 attributes. According to Sharma and Om, data mining in clinical data
sets is one of the extensively researched areas in computer science and information tech-
nology, owing to the wide influence exhibited by this computational technique in diverse
fields including finance, clinical research, multimedia, education and the like. Adequate
surveys and literature have been devoted to clinical data mining, an active interdiscipli-
nary area of research that is considered to be the consequence of applying artificial intelli-
gence and data mining concepts to the field of medicine and healthcare [64].
The article Data Mining in Clinical Data Sets: A Review, by Jacob and Ramani, aims
to provide a review on the foundation principles of mining clinical data sets, and presents
the findings and results of past research on utilizing data mining techniques for mining
healthcare data and patient records [42]. The scope of the article is to present a brief re-
port on previous investigations made in the sphere of mining clinical data, the techniques
applied, and the conclusions reached [42].
According to Data Mining Techniques by Linoff and Berry, memory-based reasoning
(MBR) results are based on analogous situations in the past [45, pp. 321]. In medical
treatments the most effective treatment for a given patient is probably the treatment that
resulted in the best outcomes for similar patients. MBR can find the treatment that pro-
duces the best outcome. It does not care about the format of the records; MBR requires only the existence of the following two operations [45]:
A distance function capable of calculating a distance between any two records
A combination function capable of combining results from several neighbors to
arrive at an answer.
These functions can be defined for many kinds of records, including records with
complex or unusual data types, such as geographic locations, images, audio files, and free
text. These types of data are usually difficult to handle by other analysis techniques. One
case study presented by Linoff and Berry describes using MBR for medical diagnosis [45,
pp. 323]. They introduce an example that takes advantage of ideas from image processing
to determine whether a mammogram is normal or abnormal (Figure 8).
MBR can be applied to the identification of abnormal mammograms. A radiologist
learns how to read mammograms by studying thousands of them, before she/he ever sees
any patients [45, pp. 332]. The approach essentially takes many pre-classified mammo-
grams and, for a new mammogram, finds the ones that are closest. The idea is that two
identical mammograms require no additional information so that their mutual information
similarity is maximized (the resulting distance between them is zero). If there is no rela-
tionship at all between the pixels in the images, then the images are not similar [45].
Figure 8 Basic Idea of MBR [45, pp. 333]
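The two operations required by MBR can be illustrated with a minimal sketch. The feature vectors and class labels below are toy values, not mammogram data; a real application would derive its features by image processing, as Linoff and Berry describe:

```python
import math

# Pre-classified historical cases: (feature vector, outcome)
CASES = [
    ((1.0, 2.0), "normal"),
    ((1.2, 1.9), "normal"),
    ((6.0, 7.5), "abnormal"),
    ((5.8, 7.9), "abnormal"),
]

def distance(a, b):
    """Distance function: Euclidean distance between any two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combine(neighbors):
    """Combination function: majority vote over the neighbors' outcomes."""
    votes = {}
    for _, outcome in neighbors:
        votes[outcome] = votes.get(outcome, 0) + 1
    return max(votes, key=votes.get)

def mbr_classify(new_record, k=3):
    """Classify a new record by its k closest pre-classified cases."""
    ranked = sorted(CASES, key=lambda case: distance(case[0], new_record))
    return combine(ranked[:k])

print(mbr_classify((1.1, 2.1)))
```

Identical records have distance zero, matching the intuition in the text that two identical mammograms are maximally similar; any distance and combination functions obeying this idea would serve equally well.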
In any data warehousing environment, each of these pieces of metadata is available somewhere. They may exist in scripts written by the DBA, in e-mail messages, in documentation, in the system tables of the database, and so on. A metadata repository makes this information available to the users in a format they can readily understand. In this way data warehouses store and retrieve clean, consistent data effectively [45].
According to the article Data Mining Techniques in Health Informatics: A Case Study
from Breast Cancer Research by Lu et al., the healthcare domain covers a vast amount of
complex data generated and developed over the years through electronic patient records,
disease diagnoses, hospital resources and medical devices [48, pp. 56]. The article intro-
duces Knowledge Discovery in Databases (KDD) as a substitute for traditional methods
of data analysis. KDD automatically searches large volumes of data for interesting pat-
terns, useful information and knowledge. Data mining plays a key role in KDD. It brings
a set of techniques and methods that can be applied to the processed data to discover hid-
den patterns. Data mining can provide healthcare professionals with the ability to analyze
patient records and disease treatment over time, which in turn can help to improve the
quality of life for those facing terminal illnesses such as breast cancer [48].
Clinical data mining is an active interdisciplinary area of research that can be considered as arising from applying artificial intelligence concepts to medicine and healthcare. For example, in personalized medicine, new treatment plans can effectively be proposed based on a patient's profile, history, physical examination and diagnosis, and by utilizing previous treatment patterns [48].
Data mining requires high-quality data. Preparation for mining consists of selecting relevant attributes, generating the corresponding data set, cleaning the data, and replacing missing values [48, pp. 64].
In the case study by Lu et al., 16 319 breast cancer patient records are extracted from SBCDS (Southampton Breast Cancer Data System), and there are 22 380 records, i.e.
instances, which show the patient's cancer details [55]. The system does not record the order in which breast cancer treatments occur between presentations. Ultimately, the decisions about which information is sought will come from the clinical researchers [48].
3.8.2 Big Data
The article The World’s Technological Capacity to Store, Communicate and Compute
Information, by Hilbert and López, argues that in 2007 humankind was able to store 2.9 × 10^20 optimally compressed bytes, to communicate almost 2 × 10^21 bytes, and to carry out 6.4 × 10^18 instructions per second on general-purpose computers [39]. General pur-
pose computing capacity grew at an annual rate of 58%. The world’s capacity for bidirec-
tional telecommunication grew at 28% per year, closely followed by the increase in glob-
ally stored information (23%). Humankind’s capacity for unidirectional information dif-
fusion through broadcasting channels has experienced comparatively modest annual
growth (6%). Telecommunication has been dominated by digital technologies since 1990
(99.9% in digital format in 2007), and the majority of our technological memory has been
in digital format since the early 2000s (94% digital in 2007) [39].
The article Metcalfe's Law after 40 Years of Ethernet, by Metcalfe, notes that critics have declared Metcalfe's law a gross overestimation of the network effect, but that nobody has tested the law with real data [53]. The law states that the value of a network grows as the square of the number of its users. Using a generalization of the sigmoid function
called the netoid, the Ethernet’s inventor and the law’s originator models Facebook user
growth over the past decade and fits his law to the associated revenue [53].
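The netoid and the square law can be written down in a few lines. The parameter values below are arbitrary illustrations, not Metcalfe's fitted Facebook figures:

```python
import math

def netoid(t, p=1.0e9, v=0.8, h=2010):
    """Netoid: a generalized sigmoid modeling the number of users over time.
    p is the saturation user count, v the growth rate, h the midpoint year."""
    return p / (1.0 + math.exp(-v * (t - h)))

def metcalfe_value(n, a=1.0):
    """Metcalfe's law: network value grows as the square of the user count."""
    return a * n * n

# At the midpoint year the netoid reaches half of its saturation level,
# and doubling the number of users quadruples the network's value.
print(netoid(2010), metcalfe_value(2.0) / metcalfe_value(1.0))
```

Fitting such a model, as Metcalfe did, means estimating p, v, and h from observed user counts and then checking whether revenue tracks the squared user count.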
Data analytics is a method for analyzing mainly Big Data. Based on the article of
Business Intelligence and Analytics by Chen et al., business intelligence and analytics
(BI&A) has emerged as an important area of study for both practitioners and researchers,
reflecting the magnitude and impact of data-related problems to be solved in contempo-
rary business organizations [7].
It is obvious that data volume is the primary attribute of Big Data. With that in mind, most people define Big Data in terabytes (10^12 bytes), sometimes in petabytes (10^15 bytes). For example, a number of users interviewed by TDWI are managing 3 to 10 terabytes (TB) of data for analytics [61]. Yet, Big Data can also be quantified by counting
records, transactions, tables, or files. Some organizations find it more useful to quantify
Big Data in terms of time. For example, due to the seven-year statute of limitations in the
U.S., many firms prefer to keep seven years of data available for risk, compliance, and
legal analysis [61].
According to the article Big Data Analytics, by Russom, a new flood of user organiza-
tions is currently commencing or expanding solutions for analytics for Big Data [61]. To
supply the demand, vendors have recently released numerous new products and functions,
specifically for advanced forms of analytics (beyond OLAP and reporting) and analytic
databases that can manage Big Data. While it is good to have options, it is hard to track
them and determine in which situations they are ready for use. The purpose of the article
is to accelerate users’ understanding of the many new tools and techniques that have
emerged for Big Data analytics in recent years. It will also help readers map newly avail-
able options to real-world use cases [61].
The article A Knowledge-Based Platform for Big Data Analytics Based on Publish/Subscribe Services and Stream Processing is by Esposito et al. [29]. Big Data analytics
is considered an imperative aspect that needs to be further improved in order to increase
the operating margin of both public and private enterprises, and it represents the next
frontier for their innovation, competition, and productivity. Big Data are typically pro-
duced in different sectors of the above organizations, often geographically distributed
throughout the world, and are characterized by large size and variety.
Esposito et al. argue that there is a strong need for platforms handling larger and larger
amounts of data in contexts characterized by complex event processing systems and mul-
tiple heterogeneous sources, dealing with the various issues related to efficiently dissemi-
nating, collecting and analyzing the systems and sources in a fully distributed way. In
such a scenario, the article proposes a method for solving two fundamental issues: data
heterogeneity and advanced processing capabilities [29].
The article by Esposito et al. presents a knowledge-based solution for Big Data analyt-
ics, which consists in applying automatic schema mapping to deal with data heterogenei-
ty, as well as ontology extraction and semantic inference to support innovative processing
[29]. Such a solution, based on the publish/subscribe paradigm, has been evaluated within
the context of a simple experimental proof-of-concept, in order to determine its perfor-
mance and effectiveness.
3.8.3 Business Intelligence Systems
According to van der Lans, a business intelligence system can be defined as a solution for
supporting and improving the decision-making process of an organization [44, pp. 29].
From the user’s perspective the user interfaces of reporting and analytical tools are the
most practical elements in a business intelligence system. Reporting tools include e.g.
OLAP and data discovery/exploitation tools. Analytical tools consist of data mining and
statistical analysis tools [44].
Due to the increasing number of external data sources, organizations are interested in
combining their own internal data with these new data sources [44]. This enriches report-
ing and analytical capabilities in an organization. Most users of business intelligence sys-
tems are decision makers at strategical and tactical management levels [44].
Data federation may solve the dilemma that faces business intelligence systems based
on a chain of databases [44]. Data are transformed and copied from one database to an-
other until they reach an endpoint, i.e. a database being accessed by a reporting or analyt-
ical tool [44, pp. 3]. This process chain, built from ETL steps, is long, complex, and highly interconnected.
Based on an agile architecture, data federation involves fewer databases and transfor-
mations. As a result, data federation in a business intelligence system leads to a shorter
chain [44].
3.9 Cancer Data
According to the article Cancer in Finland 2015 by Pukkala et al., the risk of breast cancer
in women has increased continuously. Thus breast cancer is clearly the most common
cancer affecting women. The incidence of breast cancer rose by a tenth in 1987, when nationwide mammography screening for breast cancer was started, and the incidence of breast cancer in women seems to be increasing further [57]. In 2015, it was estimated that 42 % of all female cancer cases are breast cancer. More than a thousand breast cancer cases, most of which are symptom-free, are detected annually in screenings.
Pukkala et al. argue that menopausal hormone therapies are involved in the significant
increase in breast cancer [57]. The decline in long-term hormone therapies in the early 2000s in Norway and Sweden has reversed the rise in the incidence of breast cancer. However, the reduction has been smaller in Finland.
Breast cancer appears before retirement age in about two in 15 women, and during
their entire life cycle, in more than one in ten. Breast cancer has been the most common
malignant disease affecting Finnish women since the 1960s. The incidence of cancer
starts to increase after the age of 40 [57].
According to Pukkala et al., the five-year survival rate of patients with breast cancer in
2007-2009 was 89 % [57, pp. 58]. Breast cancer is an example of a disease in which a
patient with the advanced disease lives a considerable period of time due to effective
treatments [66, pp. 59]. In the case of local breast cancer in a sample of 200 patients, the
five-year survival rate is as high as 98 %. If the disease has spread to the armpit and lymph nodes, the figure is 88 %. Even if the disease is found to have spread further, the five-year survival rate of patients is 42 % [57].
According to the company Noona Healthcare, the number of breast cancer patients is
projected to grow 50 % by the year 2030 [54]. In 2016, nearly two million women will be
diagnosed with breast cancer worldwide. Most will survive cancer thanks to advanced
treatment methods. The growth in patient volumes poses new challenges for the
healthcare ecosystem. Some patients require treatment for troublesome side effects and
some relapse over the course of many years. In future, the systematic monitoring of large
numbers of patients and their symptoms and recovery will have a huge effect on clinical
resources [54].
Based on the article Long-Term Cancer Patient Survival Achieved by the End of the 20th Century: Most Up-to-Date Estimates from the Nationwide Finnish Cancer Registry,
by Brenner and Hakulinen, a new method of survival analysis, called period analysis, has
recently been developed. The method has been shown to provide more up-to-date esti-
mates of long-term survival rates than traditional methods of survival analysis [3].
Brenner and Hakulinen applied period analysis to data from the nationwide Finnish Cancer Registry to provide up-to-date estimates of 5-, 10-, 15- and 20-year relative survival rates (RSR). The estimates suggest that there has been ongoing major progress in survival rates in recent years, which has so far remained undisclosed by traditional methods of survival analysis. For example, period analysis reveals that 10-year RSRs have come close to (or even exceed) 75 % for breast cancer, and that 20-year RSRs have now come close to (or even exceed) 60 % [3].
Brenner and Hakulinen claim that RSR represents the survival rate in the hypothetical
situation where the cancer in question is the only possible cause of death [3]. It is defined
as the absolute survival rate among cancer patients divided by the expected survival rate
of a comparable group from the general population.
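This definition reduces to a single division. The survival figures below are hypothetical illustrations, not registry estimates:

```python
def relative_survival_rate(observed, expected):
    """RSR = absolute survival among cancer patients divided by the
    expected survival of a comparable general-population group."""
    return observed / expected

# Hypothetical 5-year figures: 78 % of patients survive, while 90 % of a
# comparable general-population group would be expected to survive.
rsr = relative_survival_rate(0.78, 0.90)
print(f"{rsr:.1%}")   # prints 86.7%
```

Dividing out the expected survival is what lets the RSR approximate the hypothetical situation in which the cancer is the only possible cause of death.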
The most common forms of cancer are breast and lung cancer, with an average of
more than 2000 incident cases per year, followed by cancer of the prostate, stomach and
colon. The final coding of cancer data is carried out by qualified secretaries and super-
vised by the Registry physician (pathologist). The Registry follows the ICD-O-3 nomen-
clature [3].
The nationwide Finnish Cancer Registry is a database covering about 5.5 million peo-
ple [3, pp. 367]. It contains the highest quality data of any population-based cancer regis-
try in the world. Notification of cancer cases to the registry is mandatory by law, and the
information comes from many different sources, including hospitals, physicians working
outside hospitals, dentists, and pathological and cytological laboratories. Copies are also
obtained of all death certificates where cancer is mentioned. Mortality follow-up is ex-
tremely efficient in Finland, due to the existence of personal identification numbers. Using these numbers as the key, the cancer registry files are matched annually with the list of registered deaths. Matching with the central population register (a register of all people
currently alive and living in Finland) is performed as an additional check on the vital sta-
tus of patients [3].
According to the article Estimates of the Cancer Incidence and Mortality in Europe
published in 2006 by Ferlay et al., breast cancer is by far the most common form of can-
cer diagnosed in European women [30, pp. 586]. Breast cancer is the leading cause of
death from cancer in Europe [30, pp. 590].
In Europe the most common form of cancer was breast cancer (429 900 cases, 13.5 % of all cancer cases) in 2006, and the most common cause of death was breast cancer
(131 900) [30]. Evidence-based public health measures exist to reduce mortality from
breast cancer.
The federative approach with the use of artifacts is an appropriate tool for processing
cancer data. The data are heterogeneous by nature and include a large number of images, text, films, recordings, etc.
3.10 Healthcare
CDI is a dynamic solution in a healthcare environment where patient recognition and
payer-provider collaboration are continuously changing the prevailing situation [25, pp.
228]. It can offer patients access to their private healthcare records. The patient must be
informed of all transactions based on storing and sharing patient data. Hospitals, provid-
ers, health maintenance organizations, and insurers are legislatively required to track in-
dividual patient records across the lifespan of care. The implementation of electronic
medical records might solve the tracking dilemma [25]. It requires a careful focus on data quality and accuracy in healthcare, even though the sector is already struggling with a heavy data load. It is estimated that nearly 99 % of a supplier's patient data had error rates that jeopardized the accurate identification of individual patients. It is thus essential that patient records be matched and identified with individual patients; this can be a matter of life and death [25].
Enterprise Master Patient Index (EMPI) capabilities are dedicated to helping
healthcare suppliers recognize patients as individuals [25]. An EMPI combines data across
multiple hospitals, doctors' offices, clinics, laboratories, pharmacies, and other patient
entry points, as well as across diverse systems [25]. The goal is a combined, reliable
view of every patient for every point of care and every system across the healthcare
delivery network [25, pp. 229]. The core issue is patient safety: there are hundreds of
thousands of instances where missing or incomplete patient records have had tragic
consequences for patients [25].
According to Dyché and Levin, the idea of interoperable healthcare is that with in-
teroperable electronic health records, updated medical information could be available
wherever and whenever (ubiquitous) the patient and the attending health professional
need it, right across the healthcare ecosystem [25, pp. 62]. For example, in the U.S., 785
million healthcare tests are conducted each year. The lack of interoperable systems to
effectively communicate the results among the various providers who need to review
them consumes 1 billion hours of administrative processing time simply to get the data in
the right place [25, pp. 63].
Noona Healthcare aims to create the world’s largest evolving database of cancer pa-
tients at various stages of the disease [54]. The database has been designed for long-term
analysis and insight gained from the Big Data on millions of cancer patients. The data-
base will provide doctors, researchers and treatment developers with unique opportunities
to find new ways to overcome cancer. The company has a mobile service to provide can-
cer centers with a real-time holistic view of their patients’ wellbeing. The service im-
proves the quality of cancer patient care and makes the patient-clinic relationship more
personal and meaningful. Clinical staff can rapidly respond to severe symptoms and pro-
vide better care to far greater numbers of patients. The system enables patients to follow
their own wellbeing and recovery and stay in close contact with their clinic [54].
Based on Lu et al., a hospital's health data sets come from various sources (e.g. clinical
data, administrative data, financial data), and health information systems are generally
optimized for high-speed continuous updating of individual patient data and for patient
queries in small transactions [48, pp. 57]. Data warehousing is used to integrate data from
multiple operational systems and to provide population-based views of health information.
Combined with a clinical evidence-based process, a clinical data warehouse can facilitate
strategic decision making for the generation of treatment rules. By using ETL technologies,
data originating from different medical sources are extracted, transformed and
loaded into the existing data warehouse structure. Clustering is used to identify groups
of individuals who present similar risk profiles and symptoms [48].
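The ETL pattern described by Lu et al. can be illustrated with a rough sketch: rows are extracted from hypothetical operational sources, transformed into a common warehouse schema and loaded into one structure. All source names, field names and values below are invented for illustration.

```python
# Hypothetical operational-source rows; field names and values are invented.
clinical_rows = [
    {"pid": "P1", "tumor_size_mm": "22"},
    {"pid": "P2", "tumor_size_mm": "8"},
]
admin_rows = [
    {"patient": "P1", "visits": 3},
]


def transform_clinical(row):
    # Harmonize field names and types into the common warehouse schema.
    return {"patient_id": row["pid"], "tumor_size_mm": int(row["tumor_size_mm"])}


def transform_admin(row):
    return {"patient_id": row["patient"], "visits": row["visits"]}


def etl(sources):
    """Extract rows from each source, transform them, and load into one structure."""
    warehouse = []
    for rows, transform in sources:
        for row in rows:                      # extract
            warehouse.append(transform(row))  # transform and load
    return warehouse


warehouse = etl([(clinical_rows, transform_clinical), (admin_rows, transform_admin)])
```

In a production setting the "load" step would write to warehouse tables and the clustering step would run on the harmonized rows; the sketch only shows the harmonization of heterogeneous source schemas.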
No single ontological approach is of decisive importance in relation to healthcare, but
data quality may have an impact on patient safety.
4 METHODOLOGY
Methodology refers to the range of appropriate means for the observation of reality and
the collection of information. Methodology applies to both qualitative and quantitative
research. Wand and Weber argue that a methodology must possess the features that ena-
ble users to construct a representation of their view of the real world [69, pp. 218]. A us-
er’s view may reflect existing real-world phenomena or imagined real-world phenomena.
4.1 Case Study
This Master’s thesis is based on a qualitative case study. The study methodology in turn
relies on an intensive case study. The intensive case study aims to form a holistic and
contextual description of data governance and management.
According to Ghauri and Grønhaug, a case study is associated with descriptive and
experimental research [34]. However, in Yin's view, a case study is not limited to
these two categories [73]. In business research, a case study is especially useful when
the phenomenon under consideration is difficult to study outside its natural setting and
when the relevant concepts and variables are difficult to quantify [34, pp. 109].
Investigations often face so many variables that other research methods become inappropriate.
Case study research draws on qualitative and empirical material and on the analysis of
cases [34, pp. 109]. It is based on a process model and a description of the management
of the situation. A case study is often associated with data collection from multiple
sources: verbal reports, personal interviews and observations as primary sources of
information. In addition, a case study may draw on secondary data sources, such as
financial reports, archives and budgets, as well as market and competition reports [34].
Case studies are not suitable for all types of research; the research problem and
objectives determine the suitability of the method [34]. A case study is useful for the
development of theory and for its testing. Its main feature is the intensive study of a
phenomenon, whether an individual, group, organization, culture, event or situation.
There must be sufficient information to characterize and explain the unique features
of the case, as well as to demonstrate properties that are common to many cases.
Case study research relies on integration: the ability to study the phenomenon in
many dimensions and then to make an integrative interpretation is needed [34].
Based on Ghauri and Grønhaug, a single case study may be either holistic or embedded
[34, pp. 178]. A holistic case study is the opposite of an embedded case study [73]: it
consists of a comprehensive analysis of a case without subunits, whereas an embedded
case study contains more than one subunit of analysis. Qualitative and quantitative
methods can also be combined within a single study [73].
A case study requires an accurate and in-depth understanding of the metadata and
attributes needed for the integration of the clinical medicine application environment.
A qualitative case study with a comprehensive approach involves exploring such items
[34, pp. 105], [28], [73].
The case study is most suitable for studies whose aim is to examine a particular
phenomenon in its real environment [34]. In this study, the phenomenon comprises cancer
research, metadata and attribute-driven federation, the application process and the
distribution of information. The study is based on individual case research (a single
case study). The subject is a cancer clinical department's information application
process. The case study approach makes it appropriate to limit the investigation at this
stage, in order to test and verify the framework for reliability and validity in a
precisely defined environment.
The article Building Theories from Case Study Research, by Eisenhardt, describes the
process of inducting theory using case studies from specifying the research questions to
reaching closure [27]. Some features of the process, such as problem definition and con-
struct validation, are similar to hypothesis-testing research. Others, such as within-case
analysis and replication logic, are unique to the inductive, case-oriented process. Overall,
the process described here is highly iterative and tightly linked to data. This research ap-
proach is especially appropriate in new topic areas. The resultant theory is often novel,
testable, and empirically valid. Finally, frame-breaking insights, the tests of good theory
(e.g. parsimony, logical coherence), and convincing grounding in the evidence are the key
criteria for evaluating this type of research [27].
The theoretical framework is based on Dahlberg’s and Nokkala’s article [14]. The
framework is presented in Chapter 2. It has not been validated, but this study is a part of
the wider research project, the purpose of which is to test and validate the empirical
framework.
With the aid of the artifact, the aim is to federate breast cancer data from the data
storages available to the CUH and to detect malignant breast cancer cases [13]. The CUH
has access to enormous amounts of relevant internal and external data due to its role in
the healthcare system of the country of the present research. The CUH provides special
healthcare services to the citizens of its healthcare districts. Numerous professionals
and software vendors have participated, and still participate, in the development and
operation of ISs and data storages kept by hospitals and by breast cancer specialists.
Yet the detection of malignant breast cancer cases is currently largely manual and based
on the expertise of the professionals, because most data characteristics in the relevant
data storages differ from each other [13].
The development of artifact and data collection has been organized through workshops
with the data/information specialists of the hospital since January 2016 [13]. Prior to a
workshop, the latest version of the artifact is prepared for its presentation at the work-
shop. The researchers and the data/information specialists implement the first version of
the artifact. Then in a workshop, researchers and data/information specialists interview
one specific group of specialists in breast cancer at a time, for example pathology special-
ists and IS-support personnel having responsibility for pathology ISs [13].
After the first workshop, the design artifact was modified. The ability of the artifact
to support data federation is evaluated after each workshop and will be evaluated more
thoroughly after the last workshop. If the data/information specialists and the medical
chief information officer (CIO) of the university hospital consider the artifact and the
federative approach useful, both will be made available to the CUH to be used as generic
tools in the federation of clinical data.
The governance of data framework with the federative approach discussed by Dahl-
berg et al. was applied in order to craft the artifact [13, 14, 15]. The steps in the imple-
mentation of the artifact are as follows:
• Step 1. Identify the most relevant ISs/modules and data storages for data federa-
tion. Identify groups of specific specialists to be interviewed about how data in those
ISs/modules are to be understood and used.
• Step 2. Identify shared attributes that are needed to make data interoperable be-
tween the identified ISs/modules and data storages.
• Step 3. Describe IS technical, informational and socio-technical metadata for each
shared attribute.
The steps are iterative. Thus it is possible to both add and remove ISs/modules and data
storages, shared attributes and their metadata characteristics. For example, the first
three shared attributes were identified first and a fourth one was added later. Similarly,
there were initially 30 to 40 candidates for the metadata characteristics of each shared
attribute, but the number of candidates was later reduced.
In the collection of empirical data about the case, the guidelines of Yin [73] and
Eisenhardt [27] for case studies and for the building of research constructs from case
studies were applied. Case studies can combine different data collection methods, such as
interviews, observation and archival material. All data collection methods except direct
observation were used. This case study has more features of single-case research than of
design science research. The federative approach and artifacts comparable to the one
developed in this case have been used earlier to federate master data and social media
site data in large commercial projects. Consequently, the artifact in this study is not
new, while the research context itself is brand-new.
4.1.1 Data Ontology
On the one hand, ontology refers to the nature of reality, of knowing and of existence.
On the other hand, an ontology can be defined as a type of model that represents a set of
concepts and their relationships within a domain [18, pp. 249]. Both declarative
statements and diagrams using data modeling techniques can describe these concepts and
relationships.
In this Master's thesis, the ontological, federative viewpoint is the attribute-based
federation of different information systems without data transfer.
4.1.2 Epistemology
Epistemology is the theory of knowledge; it examines the perception of reality,
knowledge, and the possibilities to observe reality and to obtain information. It answers
the question of whether study results can be generalized and what kind of information can
be obtained through research.
In this Master's thesis, the epistemological viewpoint is data federation and the
attainability of data harmonization.
4.1.3 Paradigm
Paradigms refer to the basic orientations and established assumptions that guide research.
Paradigms involve a discipline-specific belief system containing the philosophical,
intellectual, experiential and learned elements of reality and its examination.
In this Master's thesis, both conceptual and contextual paradigms are relevant.
4.1.4 Methods
Research methods refer to practices for observing reality and collecting data. They
are rules and procedures that serve as tools or ways of proceeding to solve problems.
Methods play several roles, such as [34, pp. 37]:
Logic or ways of reasoning to arrive at solutions;
Rules for communication, i.e. to explain how the findings have been achieved;
Rules of intersubjectivity, i.e. outsiders should be able to examine and evaluate
research findings.
In this Master’s thesis, interviews, surveys and observations of the forming of a central
data collection channel allow the artifact (research matrix) content to be assembled.
4.1.5 Rhetoric
Rhetoric defines how the study should be reported. It deals with the established terms and
concepts, as well as with the concept of hierarchy.
In this Master’s thesis, the reporting of the results includes the thesis as a whole.
4.1.6 Triangulation
According to Ghauri and Grønhaug, triangulation involves a combination of methodologies
in studying the same phenomenon [34, pp. 181]. When the validity of a study needs to be
improved, data should be collected or analyzed through triangulation. In cases where the
accuracy or resolution of the data is significant, it is logical to collect information
through a variety of methods and viewpoints [34]. Triangulation can thus improve the
accuracy of judgements, and thereby of results, by collecting data through different
methods or by collecting different kinds of data on the subject matter of the study
[34, pp. 212].
In this Master’s thesis, triangulation is applied in interviews related to the case study.
4.2 Research Participation
This study covers the clinical cancer operations of the CUH and its information
management (IT governance). Professionals in IT management and nursing staff participated
in the workshops. The medical expertise consists of professionals in oncology, pathology
and radiology. The ongoing research started in January 2016 and is being executed in
co-operation with the data/information specialists of the hospital.
The development of the artifact and data collection was organized through workshops.
Prior to a workshop, the latest version of the artifact was prepared for its presentation
at the workshop. The researchers and the data/information specialists crafted the first
version of the artifact. Then, in a workshop, researchers and data/information specialists
interview one specific group of (breast cancer) specialists at a time, for example
pathology specialists and IS-support personnel responsible for pathology ISs. The design
artifact was modified after each workshop based on feedback. The ability of the artifact
to support data federation was evaluated lightly and preliminarily after each workshop.
4.3 Artifact
A matrix artifact is used as a design tool in data federation, based on the article by
Dahlberg et al. [13]. The rows of the matrix are interoperable, shared attributes
(Table 1). The columns are ISs, chosen on the advice of the hospital professionals and
the supervisor. The first phase in data federation is to identify interoperable
attributes. Cross-references are taken advantage of when filling the matrix.
On the vertical level interoperable attributes are the following:
HETU
TNM Code (TNM)
Diagnosis Code (DC)
Date of Event (Date).
On the horizontal level the key factors are ISs, as follows:
Uranus, Miranda, Oberon (Patient Information System)
Weblab (Laboratory Information System)
StellarQ (Management Tracking System)
Aria (Control System of Radiology and Patient-Specific Radiotherapy)
QpatiWeb (Pathology Information System).
Table 1 Design Artifact of the Case Study
The artifact supports the federation of breast cancer data from data storages available
to a central university hospital and the detection of malignant breast cancer cases [13].
The development of the artifact and data collection is organized with the aid of work-
shops.
5 RESULTS
This Master’s thesis aims to show that the theoretical framework is understandable and
workable. The goal was to answer the following research questions.
The main research question (How does the theoretical framework of data federation
work in practice compared with the golden record?), the first sub-question (What are the
benefits of the federative approach?) and the second sub-question (What are the limita-
tions?) are answered in Chapter 2.
The empirical data were collected by interviewing the CUH staff in Turku, Finland.
The acquisition of empirical material is based on workshops, each of which took a couple
of hours. The design artifact was applied to acquire data. Data acquisition focused on
five separate information systems, and the criteria for choosing these particular ISs
were based on the advice and expertise of the CUH staff.
The MDM of the case includes unstructured, fragmented and non-governed data. The
hospital organizations showed considerable differences related to data quality (DQ) and
MDM. Data governance seems to belong to nobody in the organization, and there appears to
be a lack of understanding as to which data should be governed. The canonical approach
was not able to produce any organization-wide solutions. Additionally, it was unclear
whether it is legally permissible to use data and to modify them in a designed service.
It is also unclear whether data are protected, who owns the data and who needs to provide
permissions.
For the sake of this case study, an artifact was designed to conceptualize and opera-
tionalize the considerations of understanding the reason for data creation and the govern-
ance accountabilities for the data. The case study shows that the amount of digital data is
increasing exponentially in the data environment of the CUH. With the volume growth,
sources, structures and all kinds of data are multiplying. Simultaneously the user organi-
zations are losing control over the data models and their data.
Currently, data are increasingly often external and provided as a service with unknown
schemas through APIs or adapters. In spite of this trend, the ability to manage and
federate data is extremely important. Both the private and the public sector need to
benefit from data in a large number of contexts. Thus it is vital to know why the
federated data sets are created, what purposes they serve and why the data are stored.
A comparison between two practical approaches was made. The theoretical discussion
was conducted by reflecting the phenomena against the existing literature on ontologies
and data federation.
79
The artifact of data federation in the case study consists of two matrices, shown in
Table 2 and Table 3. In the field tests the artifact proved to be practical and useful,
and the IS professionals at the hospital received it positively.
The case study revealed that the primary intellectual dilemma is based on understand-
ing the ontological stance of the federative approach and the artifact. It is a generally
shared view that data are on the one hand contextually defined. On the other hand, it is
usual to apply the canonical data models of information systems. As a result, there is a
potential failure to take the necessary intellectual step of paying attention to the ontology
of the data. This step is mandatory in order to federate data in open systems environments
from incompatible data storages. If the ontological stance of the federative approach is
understood and recognized, then the artifact and its meaning make sense.
5.1 Matrices of the Artifact
The matrix shown in Table 2 was designed for and during the iterative second steps of the
artifact design. The second steps focused on the identification of shared attributes.
Table 2 is shown in a generic and concise format due to the confidential nature of the
data in the ISs of the CUH. The matrix was designed by placing ISs and modules and data
storages as the columns of the matrix, and the shared attributes as its rows. The matrix
of Table 2 shows the outcome of step 2 (Chapter 5.2) and could be used to check once more
that the shared attributes really exist in all the federated data storages.
Table 2 Data Federation Artifact - Identification of Shared Attributes
The case study shows that the identification of shared interoperable attributes proved
to be an easy task for the IS professionals at the CUH. It also made sense to the cancer
specialists. The matrix compiles the shared attributes of all ISs, modules and data
storages into one table. On the basis of the case study, it is evident that the best way
to implement the matrix is to add one IS/module and data storage at a time.
The matrix shown in Table 3 was implemented for and during the iterative third steps
of the artifact design. The third steps focused on the definition of contextual metadata
characteristics for each shared attribute. Table 3 is also shown in a generic and concise
format in order to prevent identification of the CUH's information systems and sensitive
data.
Table 3 Definition of Contextual Metadata Characteristics
The content in the cells of the matrix shown in Table 3 was produced by answering the
following questions:
What kind of IS technical properties does a shared attribute have (format,
length, hierarchy, granularity, mandatory, search key, etc.)?
What kind of informational properties does a shared attribute have?
What is the data type of the shared attribute: a transaction, a report, a
document, content, master data, reference data or metadata?
What is the source of the shared attribute: a business transaction system,
a sensor device, a control device, a spatial device, a temporal device, a
social media device, or other?
Is the shared attribute structured, unstructured or multi-structured?
Is the origin of the shared attribute an internal or an external data
source? If the source is external, how is the organization allowed to
process and use the shared attribute and the related data storage?
Who enters and modifies the shared attribute during its life cycle?
What kind of socio-contextual properties does a shared attribute have?
What does the shared attribute mean in each use context during the life
cycle of the attribute?
For what purpose is the shared attribute created and what does it mean at
the time of creation?
For what purpose is the shared attribute used and what does it mean when
being used?
What is the reason for storing the shared attribute and what does it mean
when being stored?
What other life-cycle stages does the shared attribute have and what is
the meaning of the attribute at each stage?
Who is responsible for a shared attribute?
Who is responsible for each of the IS technical, informational and
socio-contextual metadata characteristics of the shared attribute?
Who is responsible for the data quality of the shared attribute?
How are the availability and the access rights of data ensured for the
shared attribute?
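The questions above suggest that each shared attribute carries a record of technical, informational and socio-contextual metadata. A minimal sketch of such a record follows; the field names and the example values are chosen freely here for illustration and are not taken from the confidential matrices.

```python
from dataclasses import dataclass


@dataclass
class SharedAttributeMetadata:
    """One row of contextual metadata for a shared attribute (illustrative)."""
    name: str
    # IS technical properties
    data_format: str = ""
    length: int = 0
    mandatory: bool = False
    search_key: bool = False
    # Informational properties
    data_type: str = ""    # e.g. transaction, master data, reference data
    structured: str = ""   # structured / unstructured / multi-structured
    origin: str = ""       # internal or external data source
    # Socio-contextual properties
    purpose_of_creation: str = ""
    meaning_in_use: str = ""
    reason_for_storing: str = ""
    # Accountabilities
    owner: str = ""
    quality_responsible: str = ""
```

One such record would be filled in per shared attribute and per IS/module during the third iterative step, which is essentially what the cells of Table 3 contain.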
Based on the confidentiality agreement between the CUH and the Turku School of Economics,
no detailed contents of the matrices can be presented in this Master's thesis.
The case study with the chosen approach is distinguished from a great number of canonical
data integration endeavors in that no attempt is made to collect all the data into a
harmonized data storage, which is usually used for reporting. The core idea of the study
is that the original data remain untouched and reside in their original locations.
Technically, data federations are conducted with the aid of metadata repositories. A
repository holds cross-mappings of federated data storages by using the metadata of
shared attributes. The metadata repository is a data storage for federation rules, the
meanings of attributes and their metadata, descriptions of data formats, and definitions
of cross-mappings. Metadata descriptions are created, modified and used only when a data
federation need is recognized. New federation rules can be added whenever needed, e.g.
for a new reporting need. The idea is to avoid big bangs and to proceed at the pace of
learning.
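The cross-mapping idea can be sketched as follows: a repository maps each shared attribute to its local name in every federated storage, so a query is resolved in place without copying data into a harmonized store. The system names, field names and records below are invented for illustration only.

```python
# Federation rules: shared attribute -> local attribute name per data storage.
cross_mappings = {
    "HETU": {"patient_system": "person_id", "lab_system": "pid"},
    "Date": {"patient_system": "event_date", "lab_system": "sample_date"},
}

# Each "storage" stays in its original location; plain dicts stand in here
# for live systems reached through APIs or adapters.
storages = {
    "patient_system": [{"person_id": "P1", "event_date": "2016-02-01"}],
    "lab_system": [{"pid": "P1", "sample_date": "2016-02-03"}],
}


def federated_query(attribute, value):
    """Find records in every storage whose local field for `attribute` matches."""
    results = {}
    for storage_name, records in storages.items():
        local_name = cross_mappings[attribute][storage_name]
        results[storage_name] = [r for r in records if r[local_name] == value]
    return results
```

Adding a new federation rule is then a matter of extending `cross_mappings`, which mirrors the incremental, learning-paced approach described above.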
5.2 Pattern to Implement the Artifact in the Case Study
The prerequisite for federating data is the contextual metadata; this is a fundamental
result of this case study. Data federation is initiated by understanding the contextual
metadata. To avoid data deficiencies related to the human perception of the real world,
the representations of reality in data and ISs, with their combined implications, must be
taken into account.
The pattern consists of the following logical steps when implementing the artifact:
• Step 1. Identification of the most relevant IS/modules and data storages. Identifi-
cation of specific specialist groups who need to be interviewed about how data in those
ISs/modules are to be understood and used.
• Step 2. Identification of the shared attributes that are needed to make data interop-
erable between the identified ISs/modules and data storages.
• Step 3. Description of IS technical, informational and socio-technical metadata for
each shared attribute.
All the steps are iterative, and thus the process of each step must be repeated until
the appropriate result is achieved. During the cycles of iteration it is possible to add
and remove information system modules and data storages, shared attributes and their
metadata characteristics. In this case study the first three shared attributes (HETU,
TNM, Diagnosis Code) were identified first and the fourth one (Date of Event) was added
later. Initially, 30 to 40 candidates for the metadata characteristics of each shared
attribute were identified, but the number of candidates was reduced later.
6 DISCUSSION
In Chapter 2.1, I argue that user organizations have lost control over the data models
they use and partly also over their data. In Chapter 2.2 I state that the golden record
philosophy is obsolete, because data are increasingly external and provided as a service,
with unknown data models, APIs and/or adapters. Yet the ability to manage and federate
data is becoming ever more important for organizations so that they can benefit from
digital data. To manage and federate data effectively, it is necessary to know why the
federated data sets have been created, for what purposes they are used, and why the data
are stored.
Data federation starts from understanding the contextual metadata. To avoid data
deficiencies, the human perception of real-world states and the representations of the
real world in data and ISs, together with their combined effects, must be considered when
data are federated.
6.1 Contribution
The background of the Master’s thesis is based on five years of relevant research. In spite
of the relatively long research period, there is as yet a limited amount of empirical public-
ly available data to support the federative approach. This also applies to the artifact de-
signed in this case study.
The federative approach may increase value creation from overall IT investments. It is
recommendable to focus the necessary investments on data attributes and on MDM tools with
an interoperability stance. Data federation creates a new approach to integration with
the aid of MDM tools. Thus a new IS is not necessary when mappings are available. This is
an important finding, since IT expenses in organizations are apt to skyrocket when
overlapping systems and appliances are required.
The most important scientific contribution to the overall research is based on the ex-
tension of the research work by Wang et al. [74, 75, 76, 77]. In addition, data governance
and management based on the data framework by Dahlberg and Nokkala are introduced
at a practical level [14].
6.2 Limitations
This Master’s thesis has certain limitations related to its conceptual sources. Despite
a research cycle of five years, the amount of empirical, publicly available data
supporting the federative approach, including the designed artifact and the judgment on
the golden record approach, is limited. The federative framework introduced in this
Master’s thesis has not yet been empirically validated. Neither has the artifact as a
tool for federation been validated or reliability tested.
6.3 Future Research Questions
Data management researchers should focus on the ontological nature of digital data. As a
result of the data explosion, the importance of Big Data and data mining is growing
rapidly. Another relevant point is to gain an understanding of what kind of data is
necessary at the operational level in order to perform diverse tasks. Last but not least,
further tests are necessary to verify the applicability of data federation in practice,
in branches other than healthcare as well, in order to qualify and rely on the federative
framework.
7 REFERENCES
[1] Bebchuck, L.A., Weisbach, M.S. (2010). The State of Corporate Governance Re-
search. The Review of Financial Studies, Vol. 23, No. 3, pp. 939-961.
[2] Berson, A., Dubov, L. (2007). Master Data Management and Customer Data Integra-
tion for a Global Enterprise, McGraw-Hill, New York, pp. 1-434.
[3] Brenner, H., Hakulinen, T. (2001). Long-Term Cancer Patient Survival Achieved by
the End of the 20th
Century: Most Up-To-Date estimates from the Nationwide Finnish
Cancer Registry. British Journal of Cancer, Vol. 85, Issue 3, pp. 367-371.
[4] Cheikhouhou, I., Djemal, K., Maaref, H. (2010). Mass Description for Breast Cancer
Recognition. Emoataz et al. (Editors). ICISP 2010, LNCS 6134, pp. 576-584.
[5] Chen, B., Oliver, J., Schwartz, D., Lindsey, W., MacDonald, A. (2005). Data Federa-
tion Methods and System. United States Patent Application Publication, US
2005/0021502 A1, Jan. 27, 2005, pp. 1-9.
[6] Chen, D., Doumeingts, G., Vernadat, F. (2008). Architectures for Enterprise Integra-
tion and Interoperability: Past, Present and Future. Computers in Industry 59 (2008) pp.
647-659.
[7] Chen, H., Chiang, R.H.L., Storey, V.C. (2012). Business Intelligence and Analytics:
From Big Data to Big Impact. MIS Quarterly, Vol. 36, No. 4, pp. 1165-1188.
[8] Crichton, D., Kincaid, H., Downing, G.J., Srivastava, S., Hughes, J.S. (2001). An Interoperable Data Architecture for Data Exchange in a Biomedical Research Network. Proceedings of the 14th IEEE Symposium on Computer-Based Medical Systems (CBMS 2001), July 26-27, 2001, pp. 1-8.
[9] Christen, P. (2014). Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag Berlin and Heidelberg GmbH & Co. KG, pp. 3-265.
[10] Cleven, A., Wortmann, F. (2010). Uncovering Four Strategies to Approach Master Data Management. Proceedings of the 43rd Hawaii International Conference on System Sciences (HICSS), IEEE, pp. 1-10.
[11] Dahlberg, T. (2010). Master Data Management Best Practices Benchmarking Study 2010. Dataset, March 2010. https://www.researchgate.net/publication/267508624, pp. 1-46.
[12] Dahlberg, T., Heikkilä, J., Heikkilä, M. (2011). Framework and Research Agenda for
Master Data Management in Distributed Environments. Proceedings of IRIS 2011.
TUCS Lecture Notes No. 15, October 2011, pp. 82-90.
[13] Dahlberg, T., Heikkilä, J., Heikkilä, M., Nokkala, T. (2015). Data Federation by Us-
ing a Governance of Data Framework Artifact as a Tool - Case Clinical Breast Cancer
Treatment Data. Åbo Akademi, pp. 1-15.
[14] Dahlberg, T., Nokkala, T. (2015). A Framework for the Corporate Governance of
Data – Theoretical Background and Empirical Evidence, pp. 25-45.
[15] Dahlberg, T. (2015). Managing Datification - Data Federation in Open Systems En-
vironments. Elsevier Editorial System for the Journal of Strategic Information Systems
Manuscript Draft. Åbo Akademi, pp. 1-22.
[16] Dahlberg, T. (2015). The MDM Golden Record is Dead, Rest in Peace – Welcome
Interpreted Interoperable Attributes. Åbo Akademi, pp. 1-13.
[17] Dahlberg, T. (2016). Research on: Governance of Data in the Contexts of Corporate
Governance and Governance of IT. Data Federation in the Context of Master and Big
Data. Presentation Slides. Åbo Akademi, pp. 1-46.
[18] DAMA (2010). The DAMA Guide to the Data Management Body of Knowledge
DAMA-DMBOK Guide, Technics Publications, 1st Edition, LLC: Bradley Beach, NJ, pp.
1-406.
[19] Diagnostic Imaging. November 1, 2007. Digital Mammography Produces Large Da-
ta Loads. http://www.diagnosticimaging.com/articles/digital-mammography-produces-
large-data-loads, Retrieved on August 23, 2016.
[20] Doan, A., Halevy, A., Ives, Z. (2012). Principles of Data Integration. Elsevier, Inc., pp. 1-487.
[21] Dreibelbis, A., Hechler, E., Milman, I., Oberhofer, M., van Run, P., Wolfson, D. (2008). Enterprise Master Data Management: An SOA Approach to Managing Core Information. IBM Press/Pearson plc, Upper Saddle River, NJ, pp. 1-656.
[22] Duda, S. (2011). Identifying, Investigating and Classifying Data Errors: An Analysis of Clinical Research Data Quality from an Observational HIV Research Network in Latin America and the Caribbean. Dissertation. The Faculty of the Graduate School of Vanderbilt University, Nashville, Tennessee, USA, pp. 1-100.
[23] Duodecim (2002). Rintasyövän diagnostiikka ja seuranta [Diagnostics and Follow-Up of Breast Cancer]. June 14, 2002. Duodecim. https://www.duodecim.fi/. Retrieved on September 1, 2016.
[24] Dušek, L., Hřebíček, J., Kubásek, M., Jarkovský, J., Kalina, J., Baroš, R., Bednářová, Z., Klánová, J., Holoubek, I. (2011). Conceptual Model Enhancing Accessibility of Data from Cancer-Related Environmental Risk Assessment Studies. International Federation for Information Processing (IFIP), AICT 359, pp. 461-479.
[25] Dyché, J., Levy, E. (2006). Customer Data Integration. John Wiley and Sons, Inc.,
pp. 1-324.
[26] Edge, S.B., Byrd, D.R., Compton, C.C., Fritz, A.G., Greene, F.L., Trotti, A.III
(2011). Cancer Staging Handbook (AJCC). Springer-Verlag, New York, NY, pp. 1-718.
[27] Eisenhardt, K.M. (1989). Building Theories from Case Study Research. Academy of
Management Review, Vol. 14, Issue 4, pp. 532-550.
[28] Eriksson, P., Kovalainen, A. (2011). Qualitative Methods in Business Research.
Sage Publications Inc., pp. 115-136.
[29] Esposito, C., Ficco, M., Palmieri, F., Castiglione, A. (2015). A Knowledge-Based
Platform for Big Data Analytics Based on Publish/Subscribe Services and Stream Pro-
cessing. Knowledge-Based Systems, Vol. 79, May 2015, pp. 3-17.
[30] Ferlay, J., Autier, P., Boniol, M., Heanue, M., Colombet, M., Boyle, P. (2007). Estimates of the Cancer Incidence and Mortality in Europe in 2006. Annals of Oncology, Vol. 18, pp. 581-592.
[31] Ferro, N., Silvello, G. (2008). A Methodology for Sharing Archival Descriptive
Metadata in a Distributed Environment. ECDL 2008, LNCS 5173, pp. 268–279.
[32] Finnish Cancer Registry.
http://www.cancer.fi/syoparekisteri/en/?x56215626=112197488. Retrieved on January 1,
2016.
[33] Gable, J. (2002). Enterprise Application Integration. The Information Management
Journal, March/April 2002, pp. 48-52.
[34] Ghauri, P., Grønhaug, K. (2010). Research Methods in Business Studies: A Practical Guide. 4th Edition. Prentice Hall, Pearson Education Limited, pp. 3-265.
[35] Giachetti, R.E. (2010). Design of Enterprise Systems. CRC Press, pp. 1-423.
[36] Gregor, S. (2006). The Nature of Theory in Information Systems. MIS Quarterly, Vol. 30, Issue 3, pp. 611-642.
[37] Heimbigner, D., McLeod, D. (1985). A Federated Architecture for Information Management. ACM Transactions on Office Information Systems, Vol. 3, No. 3, July 1985, pp. 253-278.
[38] Henderson, J.C., Venkatraman, N. (1993, 1999). Strategic Alignment: Leveraging Information Technology for Transforming Organizations. IBM Systems Journal, Vol. 32, No. 1, pp. 472-484.
[39] Hilbert, M., Lopez, P. (2011). The World's Technological Capacity to Store, Communicate, and Compute Information. Science, Vol. 332, No. 6025, pp. 60-65.
[40] Iivari, J., Hirschheim, R., Klein, H.K. (1998). A Paradigmatic Analysis Contrasting Information Systems Development Approaches and Methodologies. Information Systems Research, Vol. 9, No. 2, pp. 164-193.
[41] Iivari, J. (2003). The IS Core – VII: Towards Information Systems as a Science of
Meta-Artifacts. Communications of the Association for Information Systems, Vol. 12,
No. 37, pp. 567-582.
[42] Jacob, S.G., Ramani, R. G. (2012). Data Mining in Clinical Data Sets: A Review.
International Journal of Applied Information Systems (IJAIS), Vol. 4, No. 6, pp. 15-26.
[43] Kuechler, W., Vaishnavi, V. (2012). A Framework for Theory Development in Design Science Research: Multiple Perspectives. Journal of the Association for Information Systems, Vol. 13, Issue 6, pp. 395-423.
[44] Lans, R.F. van der (2012). Data Virtualization for Business Intelligence Systems. Elsevier, Inc., pp. 1-275.
[45] Linoff, G.S., Berry, M.J.A. (2011). Data Mining Techniques. 3rd Edition. Wiley Publishing, Inc., USA, pp. 1-821.
[46] Linthicum, D.S. (2000). Enterprise Application Integration. Addison-Wesley, USA, pp. 1-379.
[47] Loshin, D. (2010). Master Data Management. Morgan Kaufmann, pp. 1-274.
[48] Lu, J., Hales, A., Rew, D., Keech, M., Fröhlingsdorf, C., Mills-Mullett, A., Wette, C.
(2015). Data Mining Techniques in Health Informatics: A Case Study from Breast Cancer
Research. Information Technology in Bio- and Medical Informatics, Vol. 9267, pp. 56-
70.
[49] Luján-Mora, S., Vassiliadis, P., Trujillo, J. (2004). Data Mapping Diagrams for Data
Warehouse Design with UML. Lecture Notes in Computer Science, Vol. 3288, pp. 191-
204.
[50] Maier, R., Hädrich, T., Peinl, R. (2005). Enterprise Knowledge Infrastructures. Springer-Verlag Berlin Heidelberg, Germany, pp. 1-379.
[51] Markus, M.L., Bui, Q.N. (2012). Going Concerns: The Governance of Interorganiza-
tional Coordination Hubs. Journal of Management Information Systems, Spring 2012,
Vol. 28, No. 4, pp. 163-167.
[52] Martin, W. (2014). The Advantages of a Golden Record in Customer Master Data
Management. Uniserv GmbH, pp. 1-6. http://www.wolfgang-martin-
team.net/paper/SpecialistReport_golden%20record_ENG.PDF. Retrieved on September
16, 2016.
[53] Metcalfe, B. (2013). Metcalfe's Law after 40 Years of Ethernet. The IEEE Computer Society, December 2013, pp. 26-31.
[54] Noona Healthcare, www.noona.com, Retrieved on July 8, 2016.
[55] Otto, B. (2015). Quality and Value of the Data Resource in Large Enterprises. Information Systems Management, Vol. 32, pp. 234-251.
[56] Population Registry Centre (2016). http://vrk.fi/en/personal-identity-code1. Retrieved on June 11, 2016.
[57] Pukkala, E., Dyba, T., Hakulinen, T., Sankila, R. (2015). Syöpä Suomessa 2015 [Cancer in Finland 2015]. Syöpäjärjestöjen julkaisuja 2015. Suomen Syöpäyhdistys, Helsinki, pp. 7-16.
[58] Rahmati, P., Hamarneh, Nussbaum, D., Adler, A. (2010). A New Preprocessing Filter for Digital Mammograms. Elmoataz et al. (Editors): ICISP 2010, LNCS 6134, pp. 585-592.
[59] Ras, Z.W., Tzacheva, A., Tsay, L-S. (2006). Encyclopedia of Data Warehousing and
Mining. Action Rules. Edited by Wang, J. Idea Group Reference, USA, pp. 1-5.
[60] Reddy, S.S., Mulani, J., Bahulkar, A. (2000). Adex – A Meta Modeling Framework
for Repository Centric Systems Building, in Advances in Data Management. Edited by
Ramamritham, K., and Vijayaraman, T.M., Tata McGraw-Hill Publishing Company Ltd.,
pp. 1-10.
[61] Russom, P. (2011). Big Data Analytics. TDWI (The Data Warehousing Institute) Best Practices Report, 4th Quarter 2011, pp. 3-34.
[62] Shleifer, A., Vishny, R.W. (1997). A Survey of Corporate Governance. The Journal of Finance, Vol. 52, Issue 2, pp. 737-783.
[63] Sen, A. (2004). Metadata Management: Past, Present, Future. Decision Support Systems, Vol. 37, pp. 151-173.
[64] Sharma, N., Om, H. (2012). Framework for Early Detection and Prevention of Oral Cancer Using Data Mining. International Journal of Advances in Engineering & Technology, September 2012, pp. 302-310.
[65] Sheth, A.P., Larson, J.A. (1990). Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, Vol. 22, No. 3, pp. 183-236.
[66] Tsiknakis, M., Brochhausen, M., Nabrzyski, J., Pucacki, J., Sfakianakis, S. G., Pota-
mias, G., Desmedt, C., Kafetzopoulos, D. (2008). A Semantic Grid Infrastructure Ena-
bling Integrated Access and Analysis of Multilevel Biomedical Data in Support of Post-
genomic Clinical Trials on Cancer. IEEE Transactions on Information Technology in
Biomedicine, Vol. 12, No. 2, pp. 205-279.
[67] Wand, Y., Wang, R.Y. (1996). Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, Vol. 39, Issue 11, pp. 86-95.
[68] Wand, Y., Weber, R. (1990). An Ontological Model of an Information System. IEEE Transactions on Software Engineering, Vol. 16, No. 11, pp. 1282-1292.
[69] Wand, Y., Weber, R. (1993). On the Ontological Expressiveness of Information Sys-
tems Analysis and Design Grammars, Information Systems Journal, Vol. 3, No. 4, pp.
217-237.
[70] Wand, Y., Weber, R. (2002). Research Commentary: Information Systems and Conceptual Modeling - A Research Agenda. Information Systems Research, Vol. 13, No. 4, pp. 363-376.
[71] Watson, R.T. (2006). Data Management: Databases and Organizations. 5th Edition. John Wiley & Sons, Inc., pp. 1-603.
[72] Watts, S., Shankaranarayanan, G., Even, A. (2009). Data Quality Assessment in Context: A Cognitive Perspective. Decision Support Systems, Vol. 48, pp. 202-211.
[73] Yin, R.K. (2014). Case Study Research: Design and Methods. 5th Edition. Sage Publications, USA, pp. 3-282.