ON A MEANINGFUL EXPLOITATION OF MACHINE AND HUMAN

REASONING TO TACKLE DATA-INTENSIVE DECISION MAKING

NIKOS KARACAPILIDIS*

University of Patras & Computer Technology Institute and Press "Diophantus"

26504 Rio Patras, Greece

[email protected] * Corresponding author

MANOLIS TZAGARAKIS

University of Patras & Computer Technology Institute and Press "Diophantus"

26504 Rio Patras, Greece

[email protected]

SPYROS CHRISTODOULOU

University of Patras & Computer Technology Institute and Press "Diophantus"

26504 Rio Patras, Greece

[email protected]

Abstract: This paper deals with the “data deluge” issue in contemporary decision making settings; specifically,

it reports on the development of an innovative platform that incorporates and orchestrates a set of interoperable

Web services to facilitate and augment collaboration and decision making in data-intensive and cognitively-

complex settings. The proposed approach exploits and builds on the most prominent high-performance

computing paradigms and large data processing technologies to meaningfully search, analyze and aggregate

data existing in diverse, extremely large and rapidly evolving sources. At the same time, it remedies the

shortcomings of current “big data” analytics technology and provides the appropriate ways to nurture and

capture the human intelligence in order to extract the necessary insights. It can thereby reduce the

data-intensiveness and complexity overload at critical decision points to a manageable level, permitting

stakeholders to be more productive and concentrate on creative activities.

Keywords: information overload; human reasoning; machine reasoning; collaboration; decision making.

1. Introduction

Individuals and organizations are nowadays confronted with the rapidly growing problem

of information overload [Economist, 2010]. An enormous amount of content already

exists in the digital universe (i.e. information that is created, captured, or replicated in

digital form), which is characterized by high rates of new information that is being

distributed and demands attention. This enables us to have instant access to more

information than we can ever possibly consume [Kirsh, 2000]. Supporting collaboration

in such contexts is a critical aspect [Finholt, 2003] as tasks become more and more

collaborative in nature [Hara et al., 2003]; people need to efficiently and effectively

collaborate and make decisions by appropriately assembling and analyzing enormous

volumes of complex multi-faceted data residing in different sources. Representative

examples of such collaborative environments can be witnessed today in many domains,

such as:

• A community of clinical researchers and bio-scientists, trying to locate and assemble
  huge quantities of data in order to examine heterogeneous clinico-genomic data and
  information sources for the production of new insightful conclusions or the formation
  of reliable biomedical knowledge related to molecular pathways, DNA sequence data, etc.

• A community of clinicians, radiologists, radiographers, patients and pharma-researchers
  elaborating vast amounts of data that include blood tests, physical examinations, free
  text journals from patients on their experience from treatment, as well as X-Ray, Static
  and Dynamic MRI scans, to reach clinical decisions and assess drugs’ efficiency during
  clinical trials.

• A marketing and consultancy company needing to forage the Web (blogs, forums, wikis,
  etc.) for high-level knowledge, such as public opinions about its products and services,
  to quickly monitor public response to a new marketing launch and make decisions to
  inform a new strategy.

To facilitate decision making tasks in such contexts, users need to exploit a diverse

set of tools to collect, store and analyze the available data, as well as tools to share and

discuss the associated resources and outcomes. Despite the number of tools available to

support the various stages of their work, users still invest great efforts to transform the

data throughout the various stages, due to the standalone and independent nature of these

tools [Lopez and Oliveira, 2011]. Furthermore, collaboration and data sharing aspects in

such work settings are rarely considered as an integral part of the process. Approaches

that consider collaboration support as central to “big data” tasks have started to emerge

only recently [De Roure et al., 2009].

In this paper, we present an innovative approach to decision making in data-intensive

and cognitively-complex settings, which facilitates data analysis tasks and considers

collaboration support as an integral part of such processes. The work reported is

performed in the context of an FP7 EU project, namely Dicode (http://dicode-project.eu/).

To achieve its goal, Dicode follows a hybrid approach, in that it exploits and builds on

the synergy of human and machine reasoning to meaningfully analyze and aggregate data

existing in diverse, extremely large, and rapidly evolving sources.

The remainder of this paper is structured as follows: The next section lists a series of

critical issues that characterize contemporary collaboration and decision making settings.

Taking them into account, Section 3 reports on the proposed solution, while Section 4

describes in more detail the services offered. Section 5 sketches a specific usage scenario,

while Section 6 presents the tool that meaningfully integrates the Dicode services.

Finally, evaluation results are presented in Section 7, while concluding remarks are

given in Section 8.

2. Contemporary Collaboration and Decision Making Settings

Collaboration and decision making settings are often associated with huge, ever-

increasing amounts of multiple types of data, obtained from diverse and distributed

sources, which often have a low signal-to-noise ratio for addressing the problem at hand.

In many cases, the raw information is so overwhelming that stakeholders are often at a

loss to know even where to begin to make sense of it. In addition, these data may vary in

terms of subjectivity and importance, ranging from individual opinions and estimations to

broadly accepted practices and undeniable measurements and scientific results. They also

differ in how readily they can be understood by humans and interpreted by machines.

Besides, it is nowadays easier to get the data in than out. Big volumes of data can be

effortlessly added to a database; the problems start when we want to consider and exploit

the accumulated data, which may have been collected over a few weeks or months, and

meaningfully analyze them towards making a decision. Admittedly, when things get

complex, we need to identify, understand and exploit data patterns; we need to aggregate

big volumes of data from multiple sources, and then mine it for insights that would never

emerge from manual inspection or analysis of any single data source. In the settings

under consideration, the way that data will be structured for query and analysis, as well as

the way that tools will be designed to handle them efficiently are of great importance and

certainly pose a major research challenge.

Generally speaking, information management related tasks need to be streamlined

and automated. Recent findings clearly indicate that information management costs too

much when it is not well organized and meaningfully automated [Eppler and Mengis,

2004]. They also call for investments in innovative software that reduces or eliminates

time wasted, reduces management overheads, streamlines collaborative processes, and

automates the overall workflow. Return on such investments can be both tangible (e.g.

time or money saved) and intangible (e.g. more valuable information, easier extraction of

hidden information, increase of information workers’ satisfaction and creativity,

improved collaboration).

As follows from the above, issues related to the guidance of the information worker

through the space of available data and the indication of relevant information to facilitate

and augment collaboration and decision making activities are of major importance.

Towards this direction, we foresee a semi-automatic, adaptive approach that makes use of

both semantic metadata and pre-structured data patterns to provide plausible

recommendations, while also learning from the users’ feedback to better target their

information interests [Adomavicius and Tuzhilin, 2005; Anya et al., 2011]. This will be

enabled by innovative data mining techniques and services such as local pattern mining,

similarity learning, and graph mining, coupled with a flexible collaboration framework

where all these services are seamlessly integrated and orchestrated.

Recent research in data mining is geared towards the extraction of more semantic

information. Since data and information are today available in large volumes and diverse

types of representation, intelligent integration of these data sources to generate new

knowledge - towards serving collaboration and decision making requirements - remains a

key challenge. Data pre-processing requirements, associated with data cleansing as well

as handling of noise and uncertainty in various data sources, are inherent here. In the

settings under consideration, data sources are associated with various types of

information, each of them covering distinct aspects. A systematic way is needed to

generate different points of view on this kind of data. Contemporary approaches need to

help users utilize complex multi-source data in a reasonable way by supporting them in

finding relevant information and by providing personalized recommendations.

Another big category of requirements concerns the exploration, delivery and

visualization of the pertinent information. These should be based on: (i) an intelligent


semantic annotation, structuring and aggregation of voluminous and complex data, (ii)

the meaningful analysis and exploitation of data patterns and interrelations, (iii) the

capturing of stakeholders’ tacit knowledge, as far as information analysis and problem

solving are concerned, through a social web approach, and (iv) the exploitation of

particular user and group characteristics to properly direct or adapt data. Generally

speaking, semantics to be deployed should come out of a joint consideration of

stakeholders, their actions in the settings under consideration, and data considered each

time.

As far as collaboration and decision making support is concerned, stakeholders

require solutions that easily enable them to create and maintain private or public

workspaces, where the most pertinent information about the problem at hand can be

gathered, linked, synthesized and assessed. Through such workspaces, they need to carry

out synchronous or asynchronous collaboration to accommodate and elaborate the

outcomes of data mining, get recommendations, identify inconsistencies, spot and repair

information gaps, reason about actions etc. A goal-dependent integration of data coming

from heterogeneous databases is also required (to complement goal-driven data search

and acquisition). Data visualization issues also impose a series of important requirements

here.

The above highlights the need to develop innovative services that shift the focus

from the mere collection and representation of large-scale information to its meaningful

assessment, aggregation and utilization in contemporary collaboration and decision

making settings.

3. The Proposed Solution

In the context of the Dicode project, we exploit a cloud infrastructure to adapt and refine

computationally expensive algorithms for semantic data mining to new paradigms for

distributed computing, such as the MapReduce paradigm implemented in frameworks

like Hadoop, on which the Mahout library builds. The Apache Mahout project aims at building a

scalable machine learning library on top of Hadoop. In the context of this project, a

number of machine learning algorithms have already been parallelized. Based on the

outcomes of the Mahout project so far, our goal is to set up a basic analysis infrastructure

for the Dicode project. Aiming at fully supporting the data mining process (including pre-


processing, modeling, validation and deployment), we integrate, adapt and extend the

Mahout machine learning library, e.g. by developing advanced machine learning

algorithms for large scale data. For instance, Mahout may significantly help towards

grouping similar items (be they related raw data or users with similar expertise or

interests), identifying main or “hot” topics, assigning items to predefined categories,

recommending important data to diverse stakeholders, and discovering frequent and

meaningful patterns in a specific decision making setting.

The ultimate goal of the Dicode project is to support users in collaboratively solving

problems and making decisions on complex scenarios with very large, potentially

conflicting and incomplete amounts of information. To achieve this goal, Dicode exploits

and advances the state-of-the-art in three relevant directions, namely: (i) new techniques

for scalable high-performance data mining, (ii) data mining to make sense of real-world

multi-faceted data, and (iii) collaboration and decision making support.

3.1. New techniques for scalable high-performance data mining

A particular challenge of Dicode is to develop data mining approaches that scale well to

extremely large, rapidly evolving data sets in the terabyte range. To respond to this

challenge, the project develops a flexible large-scale data mining platform based on the

MapReduce framework and related technology. MapReduce has already been

successfully deployed within Google, where it runs a myriad of different programming tasks.

Many of the building blocks that are necessary to cover the requirements mentioned

above do not rely on simple filters or selection rules; instead, they can only be efficiently

implemented by employing appropriate machine learning algorithms (e.g. classification

of objects into predefined categories, clustering objects by similarity, identification of

trends in document streams, extracting topics from documents). There are some

approaches to scaling traditional machine learning algorithms to large-scale datasets

[Wegener et al., 2009]; however, most research still leaves the aspect of scalability and

performance optimization out of scope. The most comprehensive and active project

aiming at implementing machine learning algorithms for large scale problems today is

Apache Mahout [Ingersoll, 2009]. However, Mahout is unsuitable for out-of-the-box, ad-

hoc data analysis of arbitrary data: the project includes only very basic modules for data

pre-processing, data conversion and integration with existing systems.
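
To make the MapReduce paradigm referred to above more concrete, the following minimal Python sketch counts terms across documents in three steps (map, shuffle, reduce). It is an illustration only: it runs sequentially in a single process and uses neither Hadoop nor Mahout, and all function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map step: emit a (term, 1) pair for every token in one document."""
    return [(token.lower(), 1) for token in text.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle and reduce one after the other over a dict of documents."""
    mapped = chain.from_iterable(map_phase(d, t) for d, t in documents.items())
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample documents, used only for illustration.
docs = {
    "d1": "gene expression data",
    "d2": "gene data integration",
}
term_counts = map_reduce(docs)
```

In a distributed framework, the map and reduce calls would run on different machines and the shuffle step would move data between them; the sketch only shows how a task is decomposed so that this becomes possible.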


3.2. Data mining to make sense of real-world multi-faceted data

Dicode also aims to support users in solving problems and making decisions on complex

scenarios with large, potentially conflicting and incomplete amounts of information. A

main challenge in this setting is that in real applications data is not as nicely structured as

it seems to be in current solutions and closed systems. Instead, data in real-world

applications are complex and multi-faceted, where important information is spread

among multiple data sources and formats, and relations between these instances are not

always obvious. For instance, while it is typically not a problem to show many

possible links between two pieces of information using standard features, it is hard to

identify the really relevant links. A candidate technique to address this problem is

similarity learning, which strives to find an optimal, user-centric measure of similarity

on the basis of many candidate similarity measures. Similarity learning has proven

to be successful in many different applications with complex items, ranging from the

identification of similar people [Rüping et al., 2008] to the identification of similar

process workflows [Friesen and Rüping, 2010].
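
The idea of combining many candidate similarity measures into one user-centric measure can be sketched as follows. This is a deliberately simple stand-in, not the method of the cited works: it assumes that users label a few item pairs as similar (1) or dissimilar (0) and fits non-negative weights for a linear combination of two hypothetical candidate measures by gradient descent on the squared error. All item fields, measures and parameter values are assumptions made for illustration.

```python
def jaccard(a, b):
    """Candidate measure 1: overlap of two keyword sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def year_closeness(a, b, span=10.0):
    """Candidate measure 2: 1.0 for equal years, falling to 0.0 at `span` years apart."""
    return max(0.0, 1.0 - abs(a - b) / span)


def candidate_scores(x, y):
    """Evaluate every candidate measure on one pair of items."""
    return [jaccard(x["keywords"], y["keywords"]), year_closeness(x["year"], y["year"])]


def learn_weights(scored_pairs, labels, steps=2000, rate=0.1):
    """Fit non-negative weights by projected gradient descent on the mean squared error."""
    weights = [0.5] * len(scored_pairs[0])
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for scores, label in zip(scored_pairs, labels):
            error = sum(w * s for w, s in zip(weights, scores)) - label
            for i, s in enumerate(scores):
                grads[i] += 2 * error * s / len(labels)
        # Clip at zero so that every measure contributes non-negatively.
        weights = [max(0.0, w - rate * g) for w, g in zip(weights, grads)]
    return weights


def similarity(x, y, weights):
    """Learned similarity: weighted sum of the candidate measures."""
    return sum(w * s for w, s in zip(weights, candidate_scores(x, y)))


# Hypothetical items and user feedback (1.0 = similar, 0.0 = dissimilar).
a = {"keywords": {"brca1", "expression"}, "year": 2005}
b = {"keywords": {"brca1", "expression"}, "year": 2001}
c = {"keywords": {"forum", "tagging"}, "year": 2005}
e = {"keywords": {"brca1", "expression"}, "year": 2011}
feedback = [(a, b, 1.0), (a, c, 0.0), (b, c, 0.0), (b, e, 1.0)]

scored = [candidate_scores(x, y) for x, y, _ in feedback]
labels = [label for _, _, label in feedback]
weights = learn_weights(scored, labels)
```

With this feedback, the users' judgments follow keyword overlap and ignore publication year, so the fit places nearly all weight on the first measure. Different feedback would yield a different weighting, which is what makes the resulting measure user-centric.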

Besides, text data is the most central data type in many applications. To bridge the

gap between human-readable texts and structured data that is suited for machine

processing, text mining technologies are of much importance. In the context of Dicode,

approaches to extract semantic information from text make it possible to generate higher-level

knowledge, which can be fed to the users as additional input for their decision making

process without the need to read all texts (which can be millions of web pages in an

extreme case) by themselves.

3.3. Collaboration and Decision Making Support

The advent of Web 2.0 has introduced a plethora of collaboration tools which provide

engagement at a massive scale and feature novel paradigms. These tools cover a broad

spectrum of needs, ranging from exchanging, sharing and tagging, social networking, to

authoring, mind mapping and discussing. For instance, Delicious (http://delicious.com)

and CiteULike (http://www.citeulike.com) provide services for storing, sharing and

discovering of user generated Web bookmarks and academic publications, respectively.

A different set of applications focuses on building online communities of people who

share interests and activities (social networking applications). MySpace


(http://www.myspace.com) and LinkedIn (http://www.linkedin.com) are representative

examples in this category. Another set of Web 2.0 tools aims at collectively organizing,

visualizing and structuring concepts via maps to aid brainstorming, study and problem

solving. Tools such as Thinkature (http://www.thinkature.com) fall into this category.

Finally, systems such as online discussion forums, Debatepedia (http://wiki.idebate.org)

and Cohere (http://cohere.open.ac.uk/) support online discussions over the Web.

Although all the above tools enable the massive and unconstrained collaboration of

users, this very feature is the source of a problem that these tools introduce to their users:

the problem of information overload. Current Web 2.0 collaboration tools exhibit two

important shortcomings making them prone to the problem of information overload.

First, these tools are ‘information islands’, thus providing only limited support for

interoperation, integration and synergy with third party tools. Second, Web 2.0

collaboration tools are rather passive media; they lack reasoning services with which they

could meaningfully support the collaboration.

As far as existing decision making support technologies are concerned, data

warehouses and on-line analytical processing have been broadly recognized as

technologies playing a prominent role in the development of current and future DSS

[Shim et al., 2002]. However, there is still room for further developing the conceptual,

methodological and application-oriented aspects of the problem. One critical point that is

still missing is a holistic perspective on the issue of decision making. This originates out

of the growing need to develop applications by following a more human-centric (not

problem-centric) view, in order to appropriately address the requirements of the

contemporary, knowledge-intensive organization’s employees. Dicode advances decision

making support technologies by adopting a knowledge-based decision-making view,

enabled by the meaningful accommodation of the results of the data mining processes. In

such a way, the decision making process is able to produce new knowledge, such as

evidence justifying or challenging an alternative or practices to be followed or avoided

after the evaluation of a decision. Knowledge management activities such as knowledge

elicitation, representation and distribution influence the creation of the decision models to

be adopted, thus enhancing the decision making process [Bolloju et al., 2002; Watanabe

et al., 2010].


With respect to collaboration and decision making support, the Dicode project

provides a series of innovative features: First, Dicode introduces advanced decision

making services into collaborative environments in order to help control the impact of

voluminous and complex data. Second, Dicode does not treat collaboration services as

standalone applications that operate autonomously and in isolation from other services,

but rather as ones that coexist. Third, Dicode enables both human and machine

understandable argumentative discourses to support ease-of-use and expressiveness for

users, as well as advanced reasoning by the machine.

4. Services Offered

As noted above, the Dicode project aims at offering an innovative solution that reduces

the data-intensiveness and overall complexity of real-life collaboration and decision

making settings to a manageable level, thus permitting stakeholders to be more

productive and concentrate on creative activities. Towards this direction, the project

provides a suite of innovative, adaptive and interoperable services (both at a conceptual

and a technical level) that satisfies the full range of the requirements reported above.

These services are running on the Web. Much attention is being given to the adaptability

of the foreseen services with respect to changes in user requirements and operating

conditions. As shown in Figure 1, the Dicode suite of services comprises:

• Data acquisition services, which enable the purposeful capturing of tractable information
  that exists in diverse data sources and formats. Particular attention is being paid to web
  resources and the development of the associated spider components (web crawler).

• Data pre-processing services, which efficiently manipulate raw data before their storage
  in the foreseen solution. Transformation of different kinds of documents into a canonical
  form, structuring of documents from layout information, data cleansing (e.g. removing
  noise from web pages, discarding useless database records), as well as language detection
  and linguistic annotations, are some of the functionalities that fall in this category of
  services.

• Data mining services, which exploit and are built on top of a cloud infrastructure and
  other prominent large data processing technologies to offer functionalities such as high
  performance full text search, data indexing, classification and clustering, directed data
  filtering and fusion, and meaningful data aggregation. Advanced text mining techniques
  such as named entity recognition, relation extraction and opinion mining help to extract
  valuable semantic information from unstructured texts. Intelligent data mining techniques
  being used include local pattern mining, similarity learning, and graph mining.

• Collaboration support services, which facilitate the synchronous and asynchronous
  collaboration of stakeholders through adaptive workspaces, efficiently handle the
  representation and visualization of the outcomes of the data mining services (through
  alternative and dedicated data visualization schemas), and accommodate the orchestration
  of a series of actions for the appropriate handling of data in each case.

• Decision making support services, which augment both individual and group sense-making
  and decision-making by supporting stakeholders in locating, retrieving and arguing about
  relevant information and knowledge, as well as by providing them with appropriate
  notifications and recommendations (taking into account parameters such as preferences,
  competences, expertise etc.).

Figure 1. The Dicode architecture.
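
As an illustration of the data cleansing step mentioned for the pre-processing services (removing noise from web pages and transforming documents into a canonical form), the following Python sketch strips markup and typical noise elements from an HTML page. It is a minimal sketch under stated assumptions, not the Dicode implementation: the set of tags treated as noise, the shape of the canonical record and the example URL are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually carry page noise."""

    # Assumption: these tags are treated as noise in this sketch.
    NOISE_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a noise element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def to_canonical(source_url, html):
    """Return a hypothetical canonical record: source plus whitespace-normalized text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return {"source": source_url, "text": text}


# Hypothetical input page, used only for illustration.
page = """<html><head><style>p {color: red}</style></head>
<body><nav>Home | About</nav>
<p>BRCA1 expression   is altered in
many tumours.</p><script>track();</script></body></html>"""
record = to_canonical("http://example.org/post/1", page)
```

Further steps named above, such as language detection and linguistic annotation, would operate on the `text` field of such a record.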

To better illustrate the use and orchestration of the abovementioned services, we first

describe a detailed scenario taking place in a collaborative decision making setting (see

next section). Through this specific scenario, we then present the features and

functionalities of the proposed tool (namely, the Dicode Workbench - see Section 6).

5. A Specific Scenario: Investigating Genes Associated with Breast Cancer Disease

Consider two researchers, Jim and Alice, aiming to investigate which genes or groups of

genes are associated with breast cancer disease. Initially, they create a new collaboration

session (logbook), where they exchange ideas related to which data sources to use, based

on their own data analysis experience and literature knowledge. They search relevant

literature via PubMed (http://www.ncbi.nlm.nih.gov/pubmed) using the appropriate

search services.

Jim has conducted an initial analysis with some in-house gene-expression datasets;

however, his findings were not very encouraging, which was attributed to the small

sample size (i.e. number of patients) available. He informs Alice about it and suggests

potential solutions. The discussion proceeds and finally, in order to overcome the limited

sample size problem, they decide to augment their samples with publicly available gene-

expression data derived from the GEO (http://www.ncbi.nlm.nih.gov/geo/) and SMD

(http://smd.stanford.edu/) databases.

After deciding what data to use, they keep collaborating in order to discuss how the

data will be processed. Both suggest solutions, comment on them, and finally decide to

use the normalized data for each platform and the UniGene annotation database

(http://www.ncbi.nlm.nih.gov/unigene) to uniformly map all genes. Jim knows that there

are particular confounding effects in this kind of analysis and for that reason suggests a

specific strategy that would account for these effects. In particular, they decide to first

analyze the integrated dataset using the well-known Significance Analysis of Microarrays

(SAM) methodology [Tusher et al., 2001]. This analysis will serve as a baseline for any

further analysis they attempt. Jim also offers to provide all the necessary R scripts


(http://www.r-project.org/) for this initial statistical analysis. In addition, they decide to

employ model-based data integration methodologies that have been recently published

and claim to perform better than simple data integration techniques [Huttenhower et al.,

2006; Garrett-Mayer et al., 2007; Shabalin et al., 2008].
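
For readers unfamiliar with the SAM baseline mentioned above, its core quantity is a per-gene relative difference d = (mean_a - mean_b) / (s + s0), where s is a pooled standard error and s0 is a small positive constant that damps genes with very low variance. The Python sketch below computes only this statistic. It is not the researchers' R script: the fixed value of s0 and the expression values are assumptions made for illustration, and the full method additionally selects s0 from the data and uses permutations to estimate false discovery rates.

```python
from math import sqrt
from statistics import mean


def sam_statistic(group_a, group_b, s0=0.1):
    """Relative difference d = (mean_a - mean_b) / (s + s0) for one gene.

    s0 is fixed here as an assumption; SAM itself chooses it from the data.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    squared_dev = sum((x - mean_a) ** 2 for x in group_a) + sum(
        (x - mean_b) ** 2 for x in group_b
    )
    scale = (1 / n_a + 1 / n_b) / (n_a + n_b - 2)
    s = sqrt(scale * squared_dev)  # pooled standard error of the difference
    return (mean_a - mean_b) / (s + s0)


# Hypothetical expression values for two genes in tumour and normal samples.
d_gene1 = sam_statistic([5.0, 5.2, 4.8], [3.0, 3.1, 2.9])  # consistent shift
d_gene2 = sam_statistic([4.0, 6.0, 2.0], [4.1, 1.9, 6.0])  # no shift, high variance
```

Genes would then be ranked by the magnitude of d, so the first hypothetical gene would rank far above the second.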

Some of the models are readily available; however, others need to be coded. Jim

offers to write the relevant scripts. Alice, being an experienced programmer, offers to

implement them using parallel programming and various servers available at her

department. Parallel and cloud computing will ensure fast results, since they have both

agreed that they should apply the selected methodologies to numerous datasets. Their

goal is to identify novel or already reported groups of genes associated with breast cancer

disease. In addition, they are interested in comparing the findings of the chosen

methodologies to those of the simple analysis conducted by Jim. They decide to

quantify and check the statistical and biological significance of their results via the

DAVID tool (http://david.abcc.ncifcrf.gov/) and the KEGG database

(http://www.genome.jp/kegg/pathway.html).

Both researchers can execute the available services and retrieve the results of the

invoked tool (e.g. a scatter plot or heatmap plot). Once the results are available, they

engage in interpreting the results in terms of the initial research question.

6. The Dicode Workbench

Scenarios such as the one described above highlight the need to develop

innovative services that shift the focus from the mere collection and representation of

large-scale information to its meaningful assessment, aggregation and utilization in

contemporary collaboration and decision making settings. Such services are provided by

the Dicode Workbench.

Figure 2 presents the main user interface of the Dicode Workbench, which comprises

the necessary computational and collaborative services that biomedical researchers

require to address their research questions (see Section 5). A widget-based approach has

been adopted to offer these services. Each widget provides different functionalities. In

particular, the ‘Storage Service’ widget (top left side of Figure 2) lists all resources that

researchers require to elaborate for their research question. These resources may include,

for instance, XML files stored locally and containing biomedical data, references to data


retrieval services which fetch data sets from remote repositories, as well as research

papers that are considered useful for the question at hand. Selecting one of the items

allows team members to retrieve and view the associated data.

Figure 2: The Dicode workbench (showing the mind-map view of the collaboration workspace).

The ‘Services search’ widget (top right side of Figure 2) enables users to locate

computational tools that can be invoked to process data sets and produce results. These

services can be, for instance, R scripts (see the ‘R engine’ widget at the bottom left side

of Figure 2), references to Web services or entire applications which implement specific

data mining algorithms (e.g. the ‘PubMed Mobile’ widget that enables users to exploit the

popular PubMed library (http://www.ncbi.nlm.nih.gov/pubmed)). The section ‘Additional

information’ displays metadata about the workbench (when it was created, a short

description related to the aim of the research, as well as the team members who have

access to the workbench).

At the center of the Dicode Workbench (see Figure 2) is the ‘Dicode Collaborative

Workspace’, which enables the argumentative collaboration among researchers. The

approach adopted to support collaboration in Dicode builds on a conceptual framework

where formality and the level of knowledge structuring during collaboration are not

considered as predefined and rigid properties, but rather as adaptable aspects that can

be modified to meet the needs of the tasks at hand. By the term formality, we refer to the

rules enforced by the system, with which all user actions must comply. By allowing

formality to vary within the collaboration space, incremental formalization - i.e. a

stepwise and controlled evolution from a mere collection of individual ideas and

resources to the production of highly contextualized and interrelated knowledge artifacts

- can be achieved.

In our approach, views constitute the ‘vehicle’ that permits incremental formalization

of the collaboration. A view can be defined as a particular representation of the

collaboration space, in which a consistent set of abstractions able to solve a particular

organizational problem during collaboration is available. Our approach enables

switching from one view to another. Each view of a collaboration workspace provides the

necessary mechanisms to support a particular level of formality. The more informal a

view is, the easier it is to use. At the same time, the actions that users may

perform are intuitive and not time consuming; however, the overall context is human (and

not system) interpretable. On the other hand, the more formal a view is, the harder it is

to use; the permitted actions are fewer, less intuitive and more time consuming. The

overall context in this case is both human and system interpretable. The already

developed collaboration views, along with the functionalities they provide, are as

follows:

Discussion-forum view: In this view, a collaboration space is displayed as a

traditional web-based forum, where posts are displayed in ascending

chronological order. Users are able to post new messages to the collaboration

space, which will appear at the end of the list of messages.

Mind-map view: In this view, a collaboration space is displayed as a mind map,

where users can ‘interact’ with the items uploaded so far. The map deploys a

spatial metaphor permitting the easy movement and arrangement of items on the

collaboration space. In this view, information triage - i.e. the process of sorting

through numerous relevant materials and organizing them to

meet the task at hand [Marshall and Shipman, 1997] - is supported.


Formal view: This view enables the posting of predefined knowledge items. It

invokes a set of dedicated scoring and reasoning mechanisms aiming to help users

grasp the outcome of a particular collaborative session [Karacapilidis et al.,

2009].
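
The relationship between a workspace and its views can be pictured with a minimal sketch. It assumes that the items of a workspace are shared across views and that only their representation changes; the names View, Item, Workspace and forum_listing are hypothetical and do not reflect the actual Dicode implementation.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class View(IntEnum):
    """Hypothetical view names, ordered from least to most formal."""
    FORUM = 1
    MIND_MAP = 2
    FORMAL = 3


@dataclass
class Item:
    author: str
    text: str
    posted_at: int  # illustrative timestamp, e.g. seconds since workspace creation


@dataclass
class Workspace:
    items: list = field(default_factory=list)
    view: View = View.FORUM

    def post(self, item: Item) -> None:
        self.items.append(item)

    def switch_view(self, view: View) -> None:
        # Assumption: switching changes only the representation; items are kept.
        self.view = view

    def forum_listing(self) -> list:
        # Forum view: ascending chronological order, so the newest post is last.
        return sorted(self.items, key=lambda item: item.posted_at)
```

Ordering the views by formality makes it possible to compare them directly (for instance, View.FORMAL > View.MIND_MAP), which mirrors the stepwise nature of incremental formalization.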

Recalling our scenario, as the initial goal of Jim and Alice is to accumulate a critical

mass of relevant resources, they create a new collaboration workspace and may start

using it in the discussion-forum view. This view primarily aims to let users effortlessly collect and

share the available resources (without the need to interrelate them). During this

collaboration phase, Jim and Alice upload available resources and assess them

informally, by briefly commenting on them.

When many resources start appearing in the forum view, Jim and Alice may decide to

switch to a mind-map view, where they can better manage these resources. In this view,

Jim and Alice may organize the available items in more advanced ways and exploit

dedicated item types such as ideas, notes and comments (Figure 2). Ideas stand for items

that deserve further exploitation; they may correspond to an alternative solution to the

issue under consideration and they usually trigger the evolution of the collaboration.

Notes are generally considered as items expressing one’s knowledge about the overall

issue, an already asserted idea or note. Finally, comments are items that usually make

weaker statements and are uploaded to provide some explanatory text or point to some

potentially useful information. Multimedia resources can also be uploaded.

All the above items can be interrelated by trouble-free actions. When interrelating

items, Jim and Alice may select the color of the connecting arrow and provide a legend

describing the interrelationship they conceive. These legends are intentionally arbitrary.

In contrast, the visual cues of the arrows bear well-defined semantics: for instance, green arrows

declare support, whereas red ones declare opposition. Another visual cue that appears in

Figure 2 concerns the colored rectangles that have been created by Jim and Alice to

meaningfully group/cluster related items.

By using the mind-map view, Jim and Alice can transform the resources from a mere

collection of items into coherent knowledge structures that facilitate sense making on

the available resources. By using the search facilities of the workbench, they are also able

to search for relevant literature or data sets, which can also be uploaded to the

collaboration logbook. Moreover, resources that Jim and Alice agree are relevant to

their research can be easily added to the sources section of the workbench.

Jim and Alice may need to further elaborate the knowledge items considered so far,

and exploit additional functionalities to advance their collaboration towards reaching a

decision. Such functionalities can be provided by the formal view that enables the

semantic annotation of knowledge items, the formal exploitation of collaboration item

patterns, and the deployment of appropriate formal argumentation and reasoning

mechanisms (Figure 3).

Figure 3: The Dicode workbench (showing the formal view of the collaboration workspace).

While the mind-map view aids the exploitation of information by Jim and Alice, the formal

view aims mainly at the exploitation of information by the machine. This view provides a

fixed set of discourse element and relationship types, with predetermined, system

interpretable semantics. In particular, issues correspond to problems to be solved,

decisions to be made, or goals to be achieved. For each issue, both users may propose


alternatives (i.e. solutions to the problem under consideration) that correspond to

potential choices. Positions are asserted in order to support the selection of a specific

course of action (alternative), or avert the users’ interest from it by expressing some

objection. A position may also refer to another (previously asserted) position, thus

arguing in favor of or against it. When switching from the mind-map view to the formal view, existing

item types are transformed, filtered out, or kept “as-is” based on a specific set of rules.

These rules also take into consideration the item’s visual cues. For example, red colored

arrows in the mind-map view are transformed into positions that express objection in the

formal view (arguments against).
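
A rule of this kind can be sketched as follows. Only the mapping of red arrows to objections is stated above; mapping green arrows to supporting positions follows the arrow semantics of the mind-map view, and filtering out arrows of any other color is an assumption made for illustration. The names Arrow, Position, COLOR_RULES and to_position are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Arrow:
    source: str       # id of the item the arrow starts from
    target: str       # id of the item the arrow points to
    color: str        # visual cue chosen in the mind-map view
    legend: str = ""  # free-text legend; not interpreted by the rules


@dataclass
class Position:
    about: str        # id of the alternative or position being argued about
    stated_by: str    # id of the item that carries the argument
    in_favor: bool    # True = supporting argument, False = objection


# Assumed rule table: green -> argument in favor, red -> argument against.
COLOR_RULES = {"green": True, "red": False}


def to_position(arrow: Arrow) -> Optional[Position]:
    """Transform an arrow into a formal position, or filter it out (None)."""
    in_favor = COLOR_RULES.get(arrow.color.lower())
    if in_favor is None:
        # Assumption: arrows whose cue matches no rule are filtered out.
        return None
    return Position(about=arrow.target, stated_by=arrow.source, in_favor=in_favor)
```

Keeping the rules in a lookup table separates the interpretation of visual cues from the transformation logic, so that further cues could be supported by extending the table.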

The formal view also integrates a reasoning mechanism that determines the status of

each discourse entry, the ultimate aim being to keep Jim and Alice aware of the discourse

outcome. Jim and Alice can continue their collaboration in this formal view; each time an

element is added to the discussion, this triggers the underlying reasoning mechanism

which informs the team about the most prominent solution. Using the formal view, Jim

and Alice receive active support from the system to reach a decision concerning the most

appropriate resources for their research. The reasoning and scoring mechanisms of the

Dicode workbench’s formal view are built according to the algorithms described in

[Karacapilidis and Papadias, 2001].
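
To give a feel for how such a mechanism may operate, the following is a deliberately simplified sketch and not the algorithm of the cited work. It assumes that a position is defeated as soon as one active position argues against it, and that an alternative is scored by counting its active supporting positions minus its active objections. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Position:
    in_favor: bool  # supports (True) or objects to (False) what it refers to
    replies: list = field(default_factory=list)  # positions referring to this one

    def is_active(self) -> bool:
        # Simplifying assumption: one active counter-position defeats a position.
        return not any(r.is_active() and not r.in_favor for r in self.replies)


@dataclass
class Alternative:
    name: str
    positions: list = field(default_factory=list)

    def score(self) -> int:
        # Active supporting positions count +1, active objections count -1.
        return sum((1 if p.in_favor else -1) for p in self.positions if p.is_active())


def most_prominent(alternatives):
    """Return the alternative with the highest score (the first one wins ties)."""
    return max(alternatives, key=lambda a: a.score())
```

Under these assumptions, re-running most_prominent after each new entry corresponds to the triggering of the reasoning mechanism each time an element is added to the discussion.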

7. Evaluation results

To assess the effectiveness of the Dicode approach, an extensive evaluation process is

currently underway. A first evaluation phase has been completed aiming to assess a series

of key success indicators concerning the maturity of the technology used, as well as the

usability and acceptability of Dicode services in three real-life contexts (clinico-genomic

research, medical decision making, and opinion mining from Web 2.0 data). Evaluators

were asked to read a service-specific scenario, experiment with the Dicode services and

fill in a mixed-type questionnaire (responses expected were in both quantitative and

qualitative form). For the case reported in this paper, the sample consisted of 58

evaluators with varying background knowledge in bioscience fields. Answers to the

quantitative questions of the questionnaires were given on a 1-5 scale, where 1 stands for

‘I strongly disagree’ and 5 for ‘I strongly agree’ [Nielsen, 1991; Norman, 1998]. Figures

4 and 5 illustrate the evaluation results for the Dicode Workbench (where all Dicode


services are integrated) and the services related to collaboration and decision making

support, respectively.

Figure 4: Evaluation results of the Dicode Workbench.

Figure 5: Evaluation results of the Dicode Collaborative Decision Making Support Service.


Figure 4 summarizes the evaluators’ responses relative to the overall quality of the

Dicode Workbench. As shown, the evaluators agreed (median: 4, mode: 4) that its

objectives are met, that it is novel to their knowledge, that they are satisfied with its

performance and that they are overall satisfied with it. The evaluators were neutral

(median: 3, mode: 3) with respect to whether the Workbench addressed the data intensive

decision making issues. Related comments were: ‘Some kind of 'roadmap' would be

appreciated’; ‘Getting started is a bit confusing for a new user’. As far as its

acceptability is concerned (such results are not shown in Figure 4), the evaluators agreed

(median: 4, mode: 4) that the Workbench has the full set of functions they expected, that

its interface is pleasant and that they will recommend it to their peers/community.
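
The summary statistics used above (median and mode of responses on the 1-5 scale) can be computed as in the following sketch. The response values are hypothetical and serve only to illustrate the calculation; they are not the data collected in the evaluation.

```python
from statistics import median, mode


def summarize(responses):
    """Return (median, mode) for a list of 1-5 scale responses."""
    return median(responses), mode(responses)


# Hypothetical responses to a single questionnaire item
# (1 = 'I strongly disagree', 5 = 'I strongly agree').
example_responses = [4, 5, 4, 3, 4, 2, 4, 5, 3, 4]

med, mod = summarize(example_responses)
print(f"median: {med}, mode: {mod}")
```

Median and mode are preferred over the mean here because responses on such a scale are ordinal, so the distances between adjacent answers cannot be assumed equal.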

The evaluation also focused on assessing the collaboration aspects of the workbench,

which constitute a central part of the Dicode approach. Figure 5 summarizes the

evaluators’ responses relative to the overall quality of the Collaborative Decision Making

Support service. The evaluators agreed (median: 4, mode: 4) that the service addressed

the data intensive decision making issues, that the objectives of the service are met, that

the service is novel to their knowledge, that they are satisfied with the performance of the

service and that they are overall satisfied with this service. The analysis of qualitative

evaluation results showed that, overall, reviewers found the service ‘promising’.

However, a few technical issues were raised, such as: ‘It's very useful for a complex use

case. I'd also like to move with mouse holding left click’; ‘A bit slow loading time both for

workspaces list and mind-map view’; ‘The arrows’ graphics were not very pleasant for

me. It starts from the middle of the icon and not from the begging of the square. The

overall idea however, was quite good’.

Generally speaking, the feedback received from the first evaluation phase of the Dicode

project was positive, clearly pointing out that the overall Dicode approach is promising.

The vast majority of the evaluators appreciated the potential of exploiting the synergy of

machine and human intelligence through data mining and collaborative decision making

services. Evaluators indicated that the approach adopted reduces the data-intensiveness

and overall complexity of real-life collaboration and decision making settings to a

manageable level, thus permitting stakeholders to be more productive and concentrate on

creative activities. Towards this direction, the Dicode Workbench provides a flexible


integration framework that also exploits and augments the underlying collective

intelligence.

8. Conclusions

Dealing with data-intensive and cognitively complex settings is not a technical problem

alone. Building on current advancements, the approach described in this paper brings

together the reasoning capabilities of machines and humans. It can be viewed as an

innovative workbench incorporating and orchestrating a set of interoperable services that

reduce the data-intensiveness and complexity overload at critical decision points to a

manageable level, thus permitting stakeholders to be more productive and concentrate on

creative and innovative activities.

Aiming at facilitating and augmenting collaboration and decision making, the proposed

approach may enhance the quality of these processes and render time and cost savings.

The initial evaluation of the Dicode services supports this claim. Moreover, the proposed

approach addresses two major imperatives: (i) it exploits the information growth by ensuring

a flexible, adaptable and scalable information and computation infrastructure, and (ii) it

exploits the competences of all stakeholders and information workers to meaningfully

confront various information management issues, such as information characterization,

classification, presentation, retention, storage, and disposal.

Acknowledgments

This publication has been produced in the context of the EU Collaborative Project "DICODE - Mastering

Data-Intensive Collaboration and Decision" which is co-funded by the European Commission under the

contract FP7-ICT-257184. This publication reflects only the authors’ views and the Community is not

liable for any use that may be made of the information contained therein.

References

Adomavicius, G. and Tuzhilin, A. (2005). Toward the Next Generation of Recommender Systems:

A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge

and Data Engineering, 17(6): 734–749.

Anya O., Nagar A. and Tawfik H. (2011). Building adaptive systems for collaborative e-work: The

e-Workbench approach. Intelligent Decision Technologies 5 (1), pp. 83–100.


Bolloju N., Khalifa M. and Turban E. (2002). Integrating Knowledge Management into Enterprise

Environments for the Next Generation Decision Support. Decision Support Systems 33(2), pp.

163-176.

De Roure, D., Goble, C. and Stevens, R. (2009) The Design and Realisation of the myExperiment

Virtual Research Environment for Social Sharing of Workflows. Future Generation Computer

Systems 25, pp. 561-567.

Economist (2010). Data, data everywhere, http://www.economist.com/node/15557443.

Eppler, M.J. and Mengis, J. (2004). The Concept of Information Overload: A Review of Literature

from Organization Science, Accounting, Marketing, MIS, and Related Disciplines. The

Information Society, 20(5):325–344.

Finholt, T. A. (2003). Collaboratories as a new form of scientific organization. Econ. of Innovation

and New Technologies, 12(1):5-25.

Friesen, N. and Rüping, S. (2010). Workflow Analysis Using Graph Kernels. Proc. of the

ECML/PKDD Workshop on Third-Generation Data Mining: Towards Service-Oriented

Knowledge Discovery (SoKD’10), Barcelona, Spain.

Garrett-Mayer, E., Parmigiani, G., Zhong, X., Cope, L., and Gabrielson, E. (2007). Cross study

validation and combined analysis of gene expression microarray data. Biostatistics, 9: 333–354.

Hara, N., Solomon, P., Kim, S., & Sonnenwald, H. (2003). An Emerging View of Scientific

Collaboration: Scientists’ Perspectives on Collaboration and Factors that Impact Collaboration.

J Amer Soc Info Sci Techn., 54(10):952-965.

Huttenhower, C., Hibbs, M., Myers, C., and Troyanskaya, O.G. (2006) A scalable method for

integration and functional analysis of multiple microarray. Bioinformatics, 22(23):2890-2897.

Ingersoll, G. (2009). Introducing Apache Mahout, IBM developer works, Java Technical library.

Available at: http://www.ibm.com/developerworks/java/library/j-mahout/

Karacapilidis N. and Papadias D. (2001). Computer Supported Argumentation and Collaborative

Decision Making: The HERMES system. Information Systems, 26(4): 259-277.

Karacapilidis N., Tzagarakis M., Karousos N., Gkotsis G., Kallistros V., Christodoulou S.,

Mettouris C., Nousia D. (2009). Tackling cognitively-complex collaboration with CoPe_it!. Int.

J. of Web-Based Learning & Teaching Technologies, 4(3): 22-38.

Kirsh, D. (2000). A Few Thoughts on Cognitive Overload, Intellectica, 1(30): 19-51.

Lopes, P. and Oliveira, J. L. (2011). Towards knowledge federation in biomedical applications.

Proc. of the 7th International Conference on Semantic Systems, Graz, Austria.


Marshall, C. and Shipman, F. (1997). Spatial Hypertext and the Practice of Information Triage,

Proc. of the 8th ACM Conference on Hypertext, Southampton UK, pp. 124-133. ACM Press,

New York.

Nielsen, J. (1991). Designing Web Usability: The Practice of Simplicity. New Riders Publishing.

Norman, D.A. (1998). The Design of Everyday Things. The MIT Press.

Rüping, S., Punko, N., Günter, B. and Grosskreutz, H. (2008). Procurement Fraud Discovery using

Similarity Measure Learning. Transactions on Case-based Reasoning, 1(1), pp. 37-46.

Shabalin, A.A., Tjelmeland, H., Fan, C., Perou, C.M., & Nobel, A.B. (2008). Merging two gene-

expression studies via cross-platform normalization. Bioinformatics, 24(9): 1154-1160.

Shim J.P., Warkentin M., Courtney J.F., Power D.J., Sharda R. and Carlsson C. (2002). Past,

Present and Future of Decision Support Technology. Decision Support Systems 33(2), pp. 111-

126.

Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to

the ionizing radiation response. Proc. Natl. Acad. Sci. USA 2001, 98(9): 5116-5121.

Watanabe Y., Kojiri T. and Watanabe T. (2010). Effective solution knowledge organization from

discussion record. Intelligent Decision Technologies 4 (4), pp. 241–251.

Wegener, D., Mock, M., Adranale, D. and Wrobel, S. (2009). Toolkit-based high-performance Data

Mining of large Data on MapReduce Clusters. Proc. of the 1st IEEE ICDM Workshop on

Large-scale Data Mining: Theory and Applications (LDMTA 2009).
