
deposit_hagen Publikationsserver der Universitätsbibliothek

Towards a reference model for Big Data management

Mathematik und Informatik

Research Report

Michael Kaufmann


Lehrgebiet Multimedia und Internetanwendungen

Prof. Dr.-Ing. Matthias Hemmje

Towards a Reference Model for Big Data Management

Dr. Michael Kaufmann

External habilitation candidate

Lucerne University of Applied Sciences and Arts

School of Engineering, Horw, Switzerland

A research question in the area of big data is how to create value from it. This paper introduces a design for a management meta-model that can be used as a frame of reference for big data practitioners and researchers who aim at value creation from data. To this end, first, the state of the art in big data management is investigated. Then, because knowledge is central to value creation from data, an epistemological model is presented that is used as a blueprint to be instantiated in a business intelligence model. This model divides the data value creation process into five layers: Integration, Analytics, Interaction, Effectuation, and Intelligence. To close the link to information technology, each of these layers is subdivided into business and technological aspects, and specific examples are provided to concretize the reference model. The resulting Big Data Management Meta-Model, BDM cube, can be instantiated by concrete big data management models, either to classify and extend existing models or to derive new big data management models for future big data projects.

Keywords: Big Data, Data Management, Emergent Knowledge, Reference Model

1 Introduction

The global amount of data is exploding. For 2017, according to Hilbert and López (2011), the global information capacity can be estimated at two zettabytes (two billion terabytes), with a doubling every 3 years.[1] Data is everywhere; it is omnipresent and overflowing. In the words of Anderson (1972), more is different: what applies to water is analogically the case for data. With this new dimension in quantity, data gain a new quality. For this trend, this global phenomenon and its exploitation, the term big data has spread. It has been said that “big data is the oil of the 21st century”[2] because the ability to refine and use big data will be a key success factor for businesses as well as economies. The Beckman Report on Database Research (Abadi et al., 2016) calls big data “a defining challenge of our time”.

The term big data does not necessarily carry optimistic connotations; think of similar terms such as Big Money, Big Oil, Big Pharma, or Big Government. Big data can be viewed as a cultural issue, as more and more data means proportionally less and less conscious knowledge in society. The attention span and memory of humans are limited, and thus the ratio of what we do know to what we could know (i.e., knowledge versus data) is decreasing exponentially. The question is: can we make something good out of the data deluge? Can we create value from big data?

Accordingly, a basic assumption for the research presented in this paper is that not technological solutions, but value-related questions about meaning, use, quality, objectives, purpose, and applicability in the real world are the crucial point of big data research, because great volumes of all kinds of high-speed data will not be of any use unless the data can be turned into something useful. These questions should be at the heart of research in big data management. Still, most approaches to big data research aim primarily at data storage, computation, and analytics instead of the knowledge and value that could and should be the result of big data processing. Therefore, in this paper, I propose an epistemic and valuational approach to big data management: to focus on knowledge processes and value chains when designing technical information systems for coping with big data. In that sense, the question that drives the presented research is how big data management can be designed to help knowledge emerge and, consequently, to create value.

[1] According to Hilbert and López (2011), in 2007 the global information capacity was estimated at 2.9 x 10^20 bytes (ca. 0.25 zettabytes). The estimated growth rate of 23% per year implies a doubling roughly every 3 1/3 years, since 1.23^y = 2 yields y = ln(2)/ln(1.23) ≈ 3.35 years. Extrapolated to 2017, the global amount of data is therefore circa 2 zettabytes (ca. 0.25 ZB x 1.23^10 ≈ 2 ZB).

[2] http://www.forbes.com/sites/gartnergroup/2015/08/14/big-data-fades-to-the-algorithm-economy/ (accessed March 2016)
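The interpolation in footnote [1] can be reproduced with a few lines of Python; this is only a sketch of the arithmetic and not part of the original report:

```python
# Sketch of the interpolation in footnote [1]: 23% annual growth starting
# from ca. 0.25 zettabytes in 2007 implies a doubling time of about 3.35
# years and roughly 2 zettabytes by 2017.
import math

doubling_time_years = math.log(2) / math.log(1.23)   # ~3.35
volume_2017_zb = 0.25 * 1.23 ** 10                    # ~2.0
print(doubling_time_years, volume_2017_zb)
```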

Consequently, the proposed reference model for big data management creates a link between information technology and business value, and aims at closing a research gap for the subject of big data management in the border area between the fields of computing and management sciences. Big data analytics is not ars gratia artis: the intention of this research is to provide a frame of reference for the technical sciences to orient the engineering of information systems for big data processing toward the generation of knowledge, and thus, value.

2 Related Work

In the sense of “more is different,” big data is a class of data that requires new methods of data management in order to scale. Based on a report of the META Group, Gartner coined the most common definition of the term big data, which addresses volume (vast amounts of data), velocity (fast data streams), and variety (heterogeneous content): the three Vs that specify the big data challenge.[3]

Based on that, Schroeck, Shockley, Smart, Romero-Morales, and Tufano (2012) have defined big data with the qualities of high velocity, large volume, wide variety, and uncertain veracity, and have thus added a fourth V, concerned with uncertainties in data, to the definition. Demchenko, Grosso, de Laat, and Membrey (2013) have identified a fifth V, that of value, concretized as “the added-value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis”. Figure 1 reproduces the 5V model of big data by Demchenko et al. (2013) that, in contrast to the other two mentioned definitions, poses a value question for big data theory.

Figure 1. 5 Vs of Big Data. Reproduced from Demchenko et al. (2013)

In 2015, the definition of big data was standardized by the NIST Big Data Public Working Group: “Big Data consists of extensive datasets primarily in the characteristics of volume, variety, velocity, and/or variability that require a scalable architecture for efficient storage, manipulation, and analysis” (NIST, 2015a, p. 5). Although this has been set as the standardized definition, NIST (2015a, pp. 10–11) acknowledges that there are many other definitions and interpretations of the term big data in contexts other than volume, such as less sampling, new data types, analytics, data science, value, and cultural change. This must be taken into account when talking about big data with practitioners.

[3] http://www.gartner.com/newsroom/id/1731916 (accessed March 2016)

The NIST standard defines the big data paradigm as “the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets”. For storage, processing and analysis of big data, this new kind of scalable, horizontally distributed systems architecture has been commonly adopted. Singh and Reddy (2014) describe two reference architectures for big data analytics: the Hadoop Stack and the Berkeley Data Analysis Stack. Both are based on parallel computing using clusters of commodity hardware. For parallel programming, the former is based on Hadoop MapReduce, whereas the latter is based on Spark, which can compute faster because of in-memory processing (Gu & Li, 2013). Commonly, the Hadoop ecosystem of software tools, as described by Landset, Khoshgoftaar, Richter, and Hasanin (2015), is a de facto standard systems architecture for big data processing. Figure 2 reproduces the model of the Hadoop ecosystem by Landset et al. (2015) that shows its standard components.

Figure 2. The Hadoop ecosystem. Reproduced from Landset et al. (2015)
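To make the in-memory argument concrete, the following minimal PySpark sketch (an illustration only; the cluster configuration, HDFS path, and column name are assumed placeholders) caches a dataset once and reuses it across several passes, which is where Spark's in-memory processing tends to outperform disk-based MapReduce for iterative workloads:

```python
# Minimal PySpark sketch: cache a dataset in cluster memory once and reuse
# it across several analytic passes instead of re-reading it from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-analytics-sketch").getOrCreate()

events = spark.read.json("hdfs:///data/events")  # placeholder source
events.cache()  # keep the dataset in memory for repeated use

# Each pass reuses the cached data rather than re-reading the raw files.
for threshold in (10, 100, 1000):
    print(threshold, events.filter(events["value"] > threshold).count())

spark.stop()
```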

Big data technology has advanced and matured over the last decade. However, two important aspects of the 5V model, veracity and especially value, cannot be solved by technology directly. In the context of the research question stated in the introduction, process-oriented big data management models are needed to turn big data into value.

Pääkkönen and Pakkala (2015) have proposed a high-level reference architecture that divides the big data management process into six layers: (1) data extraction, (2) data loading and preprocessing, (3) data processing, (4) data analysis, (5) data loading and transformation, and (6) interfacing and visualization. However, this process model does not address value questions. The NIST Big Data Public Working Group has defined a data life cycle model. This process model consists of four stages: (a) collection, (b) preparation, (c) analysis, and (d) action. The last stage is where the magic happens: “Action: This stage involves processes that use the synthesized knowledge to generate value” (NIST, 2015a, p. 16). This value creation is concretized in the NIST Big Data Reference Architecture (NIST, 2015b, p. 12). As shown in Figure 3, according to the NIST Big Data Public Working Group, access to analytics and visualization has the highest value for the data consumer in the information value chain as well as in the IT value chain.


Figure 3. NIST Big Data Reference Architecture. Reproduced from NIST (2015b)

Unlike Pääkkönen and Pakkala (2015), the NIST data life cycle model and the NIST big data reference architecture address the issue of big data value and explain value generation from data in terms of actions by data consumers, based on access to and visualizations of knowledge that has been synthesized from data. Yet it remains unclear how exactly this value is generated by the data consumers. The OECD has published a similar model, termed the data value cycle (OECD, 2015, p. 33). As seen in Figure 4, this model consists of a cycle of five connected steps, each iteratively taking as input the output of the previous step: i.) datafication and data collection, ii.) big data, iii.) data analytics, iv.) knowledge base, and v.) decision making, which again can be input to step i.), and so on. Additionally, there is a step vi.) value added for growth and well-being, which, according to the model, is a result of enhanced decision making supported by the knowledge base that results from the analysis of big data. Two important and distinctive aspects of this model are noteworthy: first, there is an iterative, closed feedback loop in action, where results of big data analytics are again integrated into the database; and second, the model explains value creation from big data in terms of decision support.

This model of the OECD explains a lot about value creation from big data, but it basically reduces this question to questions about analytics, knowledge bases, and optimal data-driven decision making. However, there are three drawbacks to this model. First, it is clear that decision support is not the only way to create value from data. For example, think about data-driven innovation and data-based products: there, the knowledge base itself is of value for a business, not the decisions that it supports. Therefore, one possible extension of the theory is to generalize from decision support to a broader concept that includes other ways to effectuate the results from data analytics. This layer could be called data effectuation. Second, the knowledge base is clearly not only a result of data analytics, and knowledge does not only influence decision making. Many forms of learning and codification contribute to practically all of the steps in the data value cycle.

For example, the education of data scientists is crucial for data analytics. Therefore, the model could be extended to treat knowledge-related processes as a cross-sectional function in big data management. This layer could be called data intelligence. Third, it remains unclear how exactly the knowledge base leads to better decisions. The layer of action in the NIST data life cycle model, and specifically the visualization and access that are central in the NIST reference architecture, is missing in the OECD model. This layer could be called data interaction.

Figure 4. The data value cycle. Reproduced from OECD (2015)

On the issue of concretizing the idea of value creation from data, Davenport (2013) adds interesting thoughts to the literature with a concept termed Analytics 3.0. In the words of Davenport (2013), “data analysis used to add the most value by enabling managers to make better internal decisions. The new strategic focus on delivering value to customers has profound implications for where analytics functions sit in organizations and what they must do to succeed”. According to Davenport, decision support is only one part of the value creation process. A more recent approach is to feed analytics results forward into the products and services of a company directly, to create value for the customers. With this customer-centric approach, value creation from data is extended from decision support within the organization to supporting the operations of an organization on the market. In the words of Davenport and Dyché (2013), “the primary value from big data comes not from the data in its raw form, but from the processing and analysis of it and the insights, products, and services that emerge from analysis”. Therefore, data effectuation could be oriented toward a market and business perspective.

Based on this survey, my contribution will propose a model for big data management that closes the above-mentioned gaps. It will incorporate the big data definition of the 5V model, the valuational approach of the OECD model, and the focus on action, visualization, and access of the NIST model, and it will add the market focus of Davenport. Hence, the issues of data interaction and data effectuation, and especially the focus on knowledge management in the big data lifecycle (data intelligence), will provide an extension to these existing approaches. Furthermore, my model will be based on a theoretical epistemological as well as epistemic foundation.


This proposition is supported by statements in the Beckman Report on Database Research (Abadi et al., 2016), which mentions the following research challenges for big data research: the data-to-knowledge pipeline, including human feedback and interpretability by subject-matter experts; the use of knowledge bases for improving the data-to-knowledge pipeline; and optimizing the role of humans in the data life cycle.

3 Emergent Knowledge

As analyzed in the previous section, in the discussed models the creation of value from data is closely linked to the knowledge that data analytics generates. Knowledge is the fuel that drives value creation from data. Therefore, my big data management model will be derived with an epistemic focus and an epistemological foundation.

The knowledge that is created by big data analytics has an important distinctive feature: it is neither explicit nor tacit in the sense of Polanyi (1967), but emergent according to the definition of Patel and Ghoneim (2011). In fact, data analysis is a form of knowledge-intensive work that can be characterized as an emergent knowledge process in the sense of Markus, Majchrzak, and Gasser (2002). According to Kakihara and Sørensen (2002), knowledge is generally “a result of emergent processes of knowing through human subjective interpretations, or sense-making”. This especially applies to knowledge that is generated by data scientists working with big data. The knowledge that big data analytics generates emerges from the interaction of data scientists and end users with existing datasets and analysis results. Therefore, in this section, we examine the nature of emergent knowledge. This will provide an epistemological theory (an “Erkenntnistheorie” in German[4]) of emergent knowledge that will later be the basis for the development of our epistemic big data management model.

For the theory presented in this section, another basic assumption is that knowledge emerges through closed loops of observation and operation. This Weltanschauung can be summarized as knowing through making (Mäkelä, 2007). Starting from a very high-level perspective, humans both explore and shape their environment. This fact might seem trivial at first, but it is actually a distinctive feature of the human species, because no other species shapes its environment to such an extent; examples are human culture, urban landscapes, and the internet. Human knowledge has emerged iteratively, by simultaneously understanding nature and creating artifacts, and by in turn trying to understand these artifacts. In the course of history, the human mind (“Geist” in the sense of Hegel (1807)) has thus shaped the physical environment in a way that has led to completely different experiences when new generations explore the iteratively changing, partly natural, partly artificial environment. This concept can be called ontological designing: “We design our world, while our world acts back on us and designs us” (Willis, 2006, p. ##).

If we take this basic feedback loop into account, we can construct a theory of emergent knowledge as illustrated in Figure 5. For this, I will build upon and deepen the theory of Luhmann (1988), who presents a radically constructivist epistemology: a priori, all is one[5]; only observation constructs distinctions. To begin with, observation creates a contrast between subject and object, between the observer and the environment; Luhmann calls the observer “erkennendes System,” which roughly translates into English as cognitive system.

In addition to the basic distinction between subject and object, my model introduces two further ontological distinctions. First, the epistemic source of a system’s observation, its environment, can be divided into nature, that which is independent from and untouched by the system, and artifacts, changes in the environment that are the product of the system’s operation. In my opinion, this ontological distinction is justified if a cognitive system can recognize (“erkennen”) the signals it generates in the environment as its own product. I believe that this process is central for the emergence of knowledge from the viewpoint of radical constructivism. Second, the epistemic source of the system’s operation can be divided into two different aspects: by realizing distinguishable signals in the environment, the system constructs insights, or what Luhmann (and many other German theorists) call “Erkenntnis.”[6] Yet for the creation of artifacts, cognitive systems draw upon an ingenious intuitive force that can be called creativity. Advanced cognitive systems bring forth new modes that could not have been observed directly in nature. Hence, it is argued here that creativity lies outside of and in addition to the realm of observational insights and therefore deserves its own ontological class. In the artifacts, in the changes that the subject generates in the environment, the effect of creativity is observable, which closes the outer loop of emergent knowledge generation. In addition, very advanced cognitive systems, such as the human mind, are able to anticipate the effects of their own creativity, and thus an inner feedback loop is closed by the consideration of possible new creative modes and by anticipating their effects internally.

[4] It is important to note that “Erkenntnistheorie” and epistemology are similar, but distinct in a sense that is difficult to translate into English. An “Erkenntnistheorie” presents a model of how “Erkenntnis,” that is, insight about real objects, is possible, especially in spite (or, for Luhmann, because) of the fact that all we can ever realize are observations, and not the real objects themselves.

[5] The following quote is quite formidable for those readers who understand German: “Alles Beobachtbare ist Eigenleistung des Beobachters, eingeschlossen das Beobachten von Beobachtern. Also gibt es in der Umwelt nichts, was der Erkenntnis entspricht; denn alles, was der Erkenntnis entspricht, ist abhängig von Unterscheidungen, innerhalb derer sie etwas als dies und nicht das bezeichnet. In der Umwelt gibt es daher auch weder Dinge noch Ereignisse, wenn mit diesem Begriff bezeichnet sein soll, daß das, was bezeichnet ist, anders ist als anderes. Nicht einmal Umwelt gibt es in der Umwelt, da dieser Begriff ja nur in Unterscheidung von einem System etwas bezeichnet, also verlangt, daß man angibt, für welches System die Umwelt eine Umwelt ist. Und ebensowenig gibt es, wenn man von Erkenntnis absieht, Systeme” (Luhmann, 1988, p. 16).

Figure 5. “Erkenntnistheorie der Technik”: A constructivist operational theory of emergent knowledge

This proposed theory of emergent knowledge states that knowledge emerges by interaction of a cognitive system with its environment, by creating artifacts using its creativity and by observing the effects of its own operation to gain insights. Culturally, the discipline of rigorous observation and the empirical knowledge and insights it generates is usually termed science, which corresponds to episteme (ἐπιστήμη) in classic Greek philosophy; the application of empirical knowledge together with ingenuity to create artifacts is usually called arts and technology (Technik[7] in German), both of which can be subsumed under the ancient Greek term techne (τέχνη). These two domains can be distinguished by their main goals: insights are the intention of science, whereas artifacts are the focus of arts and technology (see Lipton, 2010). However, the proposed model theorizes that knowledge emerges not by passive observation, but by iterative closed loops of purposefully creating and observing changes in the environment: by a combination of art, technology, and science. Therefore, in contrast to the classic observational theory of empirical knowledge, this model presents an operational theory of emergent knowledge. In the next section, we will apply this model to big data management.

[6] Erkenntnis is not knowledge in an epistemological sense (justified true belief or some derivation thereof). The term Erkenntnis can more closely be translated as “insight” and “realization” (in the sense of “das Erkennen”). According to Luhmann (1988, p. 44), Erkenntnis is a construction based on distinction (“unterscheidungsbasierte Konstruktion”), which, in contrast to its complement, the real object (“Realgegenstand,” p. 8), is the result of observation and description of said real object (p. 14).

[7] In German, Technik can mean technology, engineering, technique as well as technics.


4 Big Data Management

The Oxford Dictionary defines the term management as “the process of dealing with or controlling things or people.”[8] Based on that, in combination with the 5V model of big data by Demchenko et al. (2013), I define Big Data Management (BDM) as the process of controlling flows of large-volume, high-velocity, heterogeneous and/or uncertain data to create value. Managing big data is not an end in itself: successful BDM creates value in the real world. The aim of the model presented in this section is to provide a frame of reference for creating value from big data. To this end, because value creation from data is closely linked to knowledge emergence from data analysis, we will apply the theory of emergent knowledge developed in the previous section, which itself is founded on the epistemology of Luhmann (1988).

[8] http://www.oxforddictionaries.com/definition/english/management (accessed April 2016)

As a sociologist, Niklas Luhmann certainly had social systems in mind when he wrote about cognitive systems. Hence, in the age of big data, the model can be taken as a blueprint to engineer cognitive systems that optimize the emergence of knowledge from big data. As such, in organizations that make use of big data, business intelligence (BI) (Chen, Chiang, & Storey, 2012) can be seen as a socio-technical cognitive system, and the model in Figure 5 can be applied accordingly.

Figure 6. Application of the operational theory of emergent knowledge to business intelligence as a cognitive system

This view of BI as a cognitive system is an instance of a biomimetic information system, as envisioned by Kaufmann and Portmann (2015). Figure 6 visualizes the resulting concretized epistemic model for the domain of application. For BI, the environment consists of the business and the market it acts on. The market is independent of BI, and thus its parameters, such as customers, competitors, politics, and so on, represent natural, that is, a priori aspects of the environment. However, BI can inform the business itself, and thus optimized artifacts, such as products, services, or processes of the business, are the result of BI operations. As a cognitive system, BI observes its environment by integration of (possibly big) data from various sources. Based on that, BI processes construct new insights from these observations by application of data analytics to the integrated database. The creative aspect of BI comes from the interaction of human users with the data analytics processes that are performed mostly by machines. This interaction can, for example, be performed by a data scientist who conducts explorative, interactive analyses, or, as another instance, by a manager who interacts with a forecasting dashboard. This iterative interaction is signified by the small loop in Figure 6. In any case, BI creates value for the business if the results of analytics and interactions have a direct positive effect on the business, that is, if they are effectuated by optimizing products, services, decisions, and processes. According to the theory of emergent knowledge presented in the last section, knowledge emerges as BI, as a cognitive system, iteratively interacts with and observes its environment, the market and the business. This is signified by the greater loop in Figure 6. As an example, closed-loop analytic customer relationship management (CRM) continuously optimizes analytic marketing campaigns based on CRM data in connection with outcomes from past campaigns in a closed epistemic loop. Thus, outcomes of past actions are integrated into future observations.
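The closed epistemic loop of analytic CRM can be sketched in a few lines of Python; the functions below are placeholders for real integration, analytics, and campaign systems, not an actual CRM API:

```python
# Sketch of the closed epistemic loop in analytic CRM: outcomes of past
# campaigns are fed back into the data that drives the next round of analytics.
def integrate(crm_data, past_outcomes):
    # observation: combine environment signals with outcomes of own operations
    return crm_data + past_outcomes

def analyze(integrated_data):
    # insight: analytics condenses the integrated data into a target segment
    return {"target_segment": "high churn risk"}

def run_campaign(insight):
    # operation: the campaign is the artifact that acts back on the market
    return [{"campaign": insight["target_segment"], "conversion": 0.12}]

crm_data, outcomes = [], []
for _ in range(3):  # the closed loop: act, observe, integrate, act again
    insight = analyze(integrate(crm_data, outcomes))
    outcomes = run_campaign(insight)
```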

The management of big data to create value is closely linked to the management of business intelligence as a cognitive system. Based on that, BDM consists of the optimization of five aspects: data integration, data analytics, data interaction, and data effectuation, as well as the successful management of the emergent knowledge in this process, which can be called data intelligence.

Figure 7. BDM cube: A Knowledge-based Big Data Management Meta-Model

This is summarized by Figure 7, which shifts from an epistemic view of the cognitive system to a management perspective in a layered reference model. This can be seen as a meta-model, where specific BDM models for a company or a research process represent specific instances implementing certain aspects of the five layers. The purpose of this meta-model is twofold: 1.) it can be used for classifying and extending existing specific BDM models, and 2.) it can be an inspiration to derive new BDM models for big data projects. Because of that, the model is named BDM cube, which stands for Big Data Management Meta-Model (hence M cube, the third power of M, in the name). This model can serve as a frame of reference for the installation, operation, and optimization of BDM. In the following points, each of these concepts is described in more detail.

Data Integration can be defined as the combination of data from different sources into a single platform with consistent access. In the age of big data, the collection of data is usually not the problem. Existing data sources overflow with potentially useable data and need to be integrated so that they become analyzable. This involves database systems as well as interfaces to data sources. Here, special care must be taken regarding scalability with respect to the big data characteristics of volume, velocity, and variety of the data sources.
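As an illustrative sketch of this layer (paths, schemas, and column names are assumed placeholders, not a prescribed implementation), two heterogeneous sources could be integrated into one Spark-based platform with consistent SQL access as follows:

```python
# Sketch of the data integration layer: load heterogeneous sources into one
# platform and expose them through a consistent (SQL) access layer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-integration-sketch").getOrCreate()

crm = spark.read.csv("hdfs:///sources/crm.csv", header=True, inferSchema=True)
weblogs = spark.read.json("hdfs:///sources/weblogs")

crm.createOrReplaceTempView("crm")
weblogs.createOrReplaceTempView("weblogs")

# Consistent access: both sources are queried through one interface.
integrated = spark.sql("""
    SELECT c.customer_id, c.segment, COUNT(w.url) AS page_views
    FROM crm c LEFT JOIN weblogs w ON c.customer_id = w.customer_id
    GROUP BY c.customer_id, c.segment
""")
integrated.show()
```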

Data Analytics is the transformation of raw data into useable information.[9] This involves analytic processes and tools. With respect to big data, analytic and machine learning platforms must be used that operate on a scalable, parallel computing architecture. In this step, data science is applied, which is defined by NIST (2015a, p. 7) as “the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing.”
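A minimal sketch of this layer, assuming the integrated customer data from the previous example and illustrative column names, could apply a machine learning step on the same scalable architecture, here with Spark MLlib:

```python
# Sketch of the data analytics layer: a distributed machine learning step
# (logistic regression) applied to the integrated data. Columns are placeholders.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-analytics-sketch").getOrCreate()
customers = spark.read.parquet("hdfs:///integrated/customers")  # placeholder

assembler = VectorAssembler(inputCols=["page_views", "spend"], outputCol="features")
training = assembler.transform(customers).select("features", "churned")

model = LogisticRegression(featuresCol="features", labelCol="churned").fit(training)
print(model.coefficients)  # inspect the fitted model
```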

Data Interaction consists of the mutual interplay of users and data analysis results that generates individual and organizational knowledge. It is important to note that data analysis results are in fact nothing but more data unless users interact with them. At this point, helpful user interfaces, user experience design, and data visualization can be designed to promote the interaction between data analytics and the organization.
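As a small, hedged illustration of this layer (the numbers are invented), analysis results could be surfaced to users through a minimal visualization:

```python
# Sketch of the data interaction layer: analysis results only become
# organizational knowledge when users can see and explore them.
import matplotlib.pyplot as plt

segments = ["low risk", "medium risk", "high risk"]
predicted_churn = [0.03, 0.11, 0.27]  # illustrative analytics output

plt.bar(segments, predicted_churn)
plt.ylabel("predicted churn rate")
plt.title("Churn by customer segment")
plt.show()
```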

[9] The OECD Glossary of Statistical Terms defines data analysis as “the process of transforming raw data into usable information, often presented in the form of a published analytical article, in order to add value to the statistical output”: https://stats.oecd.org/glossary/detail.asp?ID=2973


Data Effectuation means the utilization of the data analysis results to create value in the products, services, and operations of the organization.

Data Intelligence refers to the ability of the organization to acquire knowledge and skills[10] in the context of (big) data management. This can be understood as knowledge management and knowledge engineering applied to all steps of the data lifecycle. For example, this can focus on the ability to acquire knowledge that is generated from data (organizational communication of analytics results), on the ability to acquire knowledge and skills about data and data management (metadata management and data methodology), as well as on the ability to acquire knowledge and skills that enable the generation of insights and value from data (knowledge and education management for data scientists and data consumers). Data intelligence is a knowledge-driven cross-sectional function that ensures that knowledge assets are optimally deployed, distributed, and utilized over all four other layers of big data management.
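To illustrate how the meta-model can be instantiated (a sketch only; the class names and example entries are illustrative assumptions, not part of the reference model itself), the five layers with their business and IT aspects can be represented as a simple data structure that a concrete BDM model fills in:

```python
# Sketch: the BDM cube meta-model as a data structure; a concrete BDM model
# is an instance that fills in business and IT aspects for each layer.
from dataclasses import dataclass, field
from typing import Dict, List

LAYERS = ("Integration", "Analytics", "Interaction", "Effectuation", "Intelligence")

@dataclass
class LayerInstance:
    business_aspects: List[str] = field(default_factory=list)
    it_aspects: List[str] = field(default_factory=list)

@dataclass
class BDMModel:
    name: str
    layers: Dict[str, LayerInstance] = field(
        default_factory=lambda: {layer: LayerInstance() for layer in LAYERS}
    )

# Example instance: a closed-loop analytic CRM project classified by layer.
crm = BDMModel("Closed-loop analytic CRM")
crm.layers["Integration"].it_aspects.append("NoSQL store for customer events")
crm.layers["Analytics"].business_aspects.append("Campaign response modeling")
crm.layers["Effectuation"].business_aspects.append("Optimized marketing campaigns")
crm.layers["Intelligence"].business_aspects.append("Documenting campaign learnings")
```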

5 Business-IT Alignment

At this point, I have provided an overview of the state of the art in Section 2; I have provided an epistemological foundation for a process model for business intelligence as a cognitive system in Section 3; and, based on that, I have presented a design of a model for big data management in Section 4. Hence, on this basis, I can create a link to the initial motivation for the research: to provide a frame of reference for orienting big data processing information systems toward the generation of knowledge and value. To this end, Figure 8 describes, for each layer in the management model in Figure 7, business aspects in connection with technology aspects that can implement solutions for the corresponding model layer. In this model, the data intelligence layer acts as a bridge between the business and IT aspects of BDM cube.

Figure 8: BDM cube for Business-IT Alignment

The purpose of the framework in Figure 8 is to subdivide the formidable task of creating value from data into ten manageable fields of focus, five in the area of business and five in the area of IT, so that the whole task can be tackled in a “divide and conquer” manner. Also, these ten fields can serve as a mental map and a checklist for the inspiration and guidance of big data management toward a comprehensive approach. The concretized meta-model, again, can serve as a blueprint for new big data projects, or as a classification of and inspiration for extending existing big data models and systems.

[10] The Oxford dictionary defines intelligence as “the ability to acquire and apply knowledge and skills”: http://www.oxforddictionaries.com/definition/english/intelligence


In Table 1, examples are provided for each of these ten angles of vision. These examples are not meant to be exhaustive or complete. Table 1 concretizes the focus areas with specific instances so that the abstraction of a closed-loop epistemic management model can be grounded in conceivable and tangible concepts.

Table 1: Ten fields of action for business-IT alignment for big data with BDM cube, including examples

Data Integration: Combination of data from different sources into a single platform with consistent access
  Business (Source Data): Customer Data; Financial Data; Production Data; Open Data; Web Data; ...
  IT (Integrated Databases): Data Warehouses; Data Lakes; Hadoop Clusters; Apache Spark Clusters; Relational Databases; NoSQL Databases; Cloud-based Storage; ...

Data Analytics: Transformation of raw data into useable information
  Business (Analytics Processes): Business Intelligence; Data Mining (inductive); Data Preparation; Statistical Data Analysis; Data Warehouse Management; Data Science (abductive); ...
  IT (Analytic Software): Mahout; MLlib; Scikit; Deep Learning; Regression; Decision Trees; ...

Data Interaction: Users working with data analysis results to generate individual and organizational knowledge
  Business (Organizational Application): Strategic Management; Production Optimization; Knowledge Management; Distribution & Marketing; Customer Relationship Management; Industry 4.0; ...
  IT (Analytic User Interfaces): Management Dashboards; Mobile Applications; Office Documents; Front Office Applications; Intranet; Search Engines; Interactive Visualizations; ...

Data Effectuation: Using the data analysis results to create value in products, services, and operations of the organization
  Business (Value Creation): Efficiency Increase in Production and Distribution; Optimization of Machine Operations; Improvement of CRM; Data-based Innovation; Data-based Products; ...
  IT (Feedforward Control): Integration with Factory Production Systems to minimize disturbances; Integration with Operational Systems such as CRM, Helpdesk, or Webserver to predict opportunities; ...

Data Intelligence: The ability to acquire knowledge and skills in the context of big data management
  Business (Data Knowledge Management): Managing Insights emerging from Data Analytics; Managing Knowledge about Data Analytics; Managing Know-How for Data Analytics (Data Science Education); ...
  IT (Data Knowledge Engineering): Semantic Annotation and Archival of Analytic Results; Metadata Repositories; Information Extraction; Data Knowledge Bases; ...

Compared to traditional forms of BI that focus on relational and rather small data, the processing of big data[11] requires technologies capable of massive scalability along the dimensions of volume, velocity, variety, and veracity. In fact, for BDM cube, big data aspects are only relevant for the first two layers of data integration and data analytics; the other three layers concern application-oriented aspects that are universally applicable for small data as well as big data. The differences when working with big data in the first two layers include, among others, the following points:

- There are restrictions on the kinds of technologies and algorithms that can be used with regard to scalability.
- SQL databases are mostly only vertically scalable; thus, NoSQL database technologies are recommended for data integration.
- Using a horizontally scalable architecture, the analysis of the whole dataset is possible, in lieu of the subsampling that was commonplace in the era of data warehousing and data mining (i.e., 10 to 20 years ago).
- Working with the whole dataset has led to a data-driven approach to analytics (“data first”), where hypotheses are second to and derived from integrated data (abductive instead of inductive reasoning; Dhar, 2013).

[11] In fact, this data management model, in principle, applies to small data as well as to big data.


6 Discussion and Outlook

The reference model for big data management I have proposed in this paper is an instance of design-oriented research in information systems (DORIS; Österle et al., 2010). With this paper, I have initiated the four-phase research process of DORIS, consisting of (1) analysis, (2) design, (3) evaluation, and (4) diffusion. In this report, I present preliminary results of the first two phases, analysis and design. With the analysis, I have explained the motivation for directing the proposed research toward value-related management questions about big data in Section 1, and I have investigated the current state of the art in Section 2. The derivation of my proposed management model from epistemological and epistemic viewpoints in Sections 3, 4, and 5 represents the result of the design phase of my research process. Hence, it is clear that the research is still in progress, and an empirical and/or synthetic evaluation of this design and a diffusion of insights are still open.

At this point, no empirical insight can be provided on the proposed design. The outlook on further research is to provide a qualitative evaluation of the design, the proposed reference model for big data management, according to DORIS, using the methods of expert interviews, prototyping, hermeneutics, and logical discourse. Evaluation will be conducted along the dimensions of truth and value of the designed artifact. BDM cube is a meta-model; thus, specific instances can be subsumed by it. Research questions central to the proposed theory are whether BDM cube is valuable for (A) classifying and extending existing BDM models and (B) deriving new BDM models, and whether (C) this will help to improve BDM and thus (D) to optimize value creation from data. The proposed theory will be evaluated along the dimensions of truth and value in two ways: as an empirical evaluation, one or more case studies with existing enterprises will provide empirical exemplars of theory instantiation; and, in the sense of knowing through making, the synthetic research processes of four ongoing doctoral theses on big data management will be aligned with the theory, which will provide four synthetic exemplars. The expected outcome is that, as a reference model, BDM cube will act as a frame of reference for theorists and practitioners, and as a conceptual coordinate system that can support inter- and intrapersonal orientation in data management to facilitate value creation from big data.

7 Acknowledgements

Although I am the sole author, the research presented in this paper would not have been possible without the helpful feedback I received in interactions with different universities in Switzerland and Germany.

First of all, the reference model was developed at the initiative of and in collaboration with the Internet and Multimedia Applications group at the Department of Mathematics and Computer Science at the University of Hagen, together with Matthias Hemmje, Kevin Berwind, Marco Bornschlegel, Christian Nawroth, and Tobias Swoboda. Their feedback helped shape the layers in iterative loops. Special thanks go to Dominic Heutelbeck for challenging my theory and research agenda, which motivated me to elaborate the theory and the strategies for its evaluation more clearly.

Moreover, I presented initial ideas at a meeting with the information systems research group at the University of Hagen, together with Andreas Meier, Alexander Denzler, and Marcel Wehrle. Their helpful critique stopped me from stressing the aspect of data-driven innovation in the reference model.

Additionally, I heard the term “data intelligence” for the first time in collaboration with Tim Weingärtner, director of research at the Lucerne School of Information Technology in Rotkreuz, Switzerland. Based on his inspiration, I investigated possible meanings of the term and later applied it as a label for knowledge management issues in the proposed reference model.

8 References

Abadi, D., Agrawal, R., Ailamaki, A., Balazinska, M., Bernstein, P. A., Carey, M. J., et al. (2016). The Beckman Report on Database Research. Communications of the ACM, 59(2), 92–99.

Anderson, P. W. (1972). More is Different: Broken Symmetry and the Nature of the Hierarchical Structure of Science. Science, 177(4047), 393–396.


Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly, 36(4), 1165–1188.

Davenport, T. H. (2013). Analytics 3.0. Harvard Business Review, 91(12), 65–72.

Davenport, T. H., & Dyché, J. (2013). Big Data in Big Companies. Portland, Oregon: International Institute for Analytics.

Demchenko, Y., Grosso, P., Laat, C. de, & Membrey, P. (2013). Addressing big data issues in Scientific Data Infrastructure. 2013 International Conference on Collaboration Technologies and Systems (CTS) (pp. 48–55).

Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.

Gu, L., & Li, H. (2013). Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark. 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC) (pp. 721–727).

Hegel, G. W. F. (1807). Phänomenologie des Geistes. Bamberg und Würzburg: Joseph Anton Goebhardt.

Hilbert, M., & López, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute Information. Science, 332(6025), 60–65.

Kakihara, M., & Sørensen, C. (2002). Exploring Knowledge Emergence: From Chaos to Organizational Knowledge. Journal of Global Information Technology Management, 5(3), 48–66.

Kaufmann, M., & Portmann, E. (2015). Biomimetics in design-oriented information systems research. DESRIST 2015.

Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1), 1.

Lipton, P. (2010). Engineering and Truth. In K. Guy (Ed.), Philosophy of Engineering (Vol. 1, pp. 7–13). London: The Royal Academy of Engineering.

Luhmann, N. (1988). Erkenntnis als Konstruktion. Bern: Benteli.

Mäkelä, M. (2007). Knowing Through Making: The Role of the Artefact in Practice-led Research. Knowledge, Technology & Policy, 20(3), 157–163.

Markus, M. L., Majchrzak, A., & Gasser, L. (2002). A Design Theory for Systems That Support Emergent Knowledge Processes. MIS Quarterly, 26(3), 179–212.

NIST. (2015a). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST Special Publication, NIST Big Data Public Working Group.

NIST. (2015b). NIST Big Data Interoperability Framework: Volume 2, Taxonomies. NIST Special Publication, NIST Big Data Public Working Group.

OECD. (2015). Data-Driven Innovation: Big Data for Growth and Well-Being. Paris: OECD Publishing.

Österle, H., Becker, J., Frank, U., Hess, T., Karagiannis, D., Krcmar, H., et al. (2010). Memorandum on design-oriented information systems research. European Journal of Information Systems, 20(1), 7–10.

Pääkkönen, P., & Pakkala, D. (2015). Reference Architecture and Classification of Technologies, Products and Services for Big Data Systems. Big Data Research, 2(4), 166–186.

Patel, N. V., & Ghoneim, A. (2011). Managing emergent knowledge through deferred action design principles. Journal of Enterprise Information Management, 24(5), 424–439.

Polanyi, M. (1967). The Tacit Dimension. London: Routledge & Kegan Paul.


Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., & Tufano, P. (2012). Analytics: The real-world use of big data - How innovative enterprises extract value from uncertain data. Executive Report, New York: IBM Institute for Business Value.

Singh, D., & Reddy, C. K. (2014). A survey on platforms for big data analytics. Journal of Big Data, 2(1), 8.

Willis, A.-M. (2006). Ontological Designing. Design Philosophy Papers, 4(2), 69–92.