ON A MEANINGFUL EXPLOITATION OF MACHINE AND HUMAN
REASONING TO TACKLE DATA-INTENSIVE DECISION MAKING
NIKOS KARACAPILIDIS*
University of Patras & Computer Technology Institute and Press "Diophantus"
26504 Rio Patras, Greece
* Corresponding author
MANOLIS TZAGARAKIS
University of Patras & Computer Technology Institute and Press "Diophantus"
26504 Rio Patras, Greece
SPYROS CHRISTODOULOU
University of Patras & Computer Technology Institute and Press "Diophantus"
26504 Rio Patras, Greece
Abstract: This paper deals with the “data deluge” issue in contemporary decision making settings; specifically,
it reports on the development of an innovative platform that incorporates and orchestrates a set of interoperable
Web services to facilitate and augment collaboration and decision making in data-intensive and cognitively-
complex settings. The proposed approach exploits and builds on the most prominent high-performance
computing paradigms and large data processing technologies to meaningfully search, analyze and aggregate
data existing in diverse, extremely large and rapidly evolving sources. At the same time, it remedies the
shortcomings of current “big data” analytics technology and provides the appropriate ways to nurture and
capture the human intelligence in order to extract the necessary insights. It can thus reduce the
data-intensiveness and complexity overload at critical decision points to a manageable level, permitting
stakeholders to be more productive and concentrate on creative activities.
Keywords: information overload; human reasoning; machine reasoning; collaboration; decision making.
1. Introduction
Individuals and organizations are nowadays confronted with the rapidly growing problem
of information overload [Economist, 2010]. An enormous amount of content already
exists in the digital universe (i.e. information that is created, captured, or replicated in
digital form), which is characterized by high rates of new information that is being
distributed and demands attention. This enables us to have instant access to more
information than we can ever possibly consume [Kirsh, 2000]. Supporting collaboration
in such contexts is a critical aspect [Finholt, 2003] as tasks become more and more
collaborative in nature [Hara et al., 2003]; people need to efficiently and effectively
collaborate and make decisions by appropriately assembling and analyzing enormous
volumes of complex multi-faceted data residing in different sources. Representative
examples of such collaborative environments can be witnessed today in many domains,
such as:
• A community of clinical researchers and bio-scientists, trying to locate and
assemble huge quantities of data in order to examine heterogeneous clinico-genomic
data and information sources for the production of new insightful
conclusions or the formation of reliable biomedical knowledge related to
molecular pathways, DNA sequence data, etc.
• A community of clinicians, radiologists, radiographers, patients and
pharma-researchers elaborating vast amounts of data that include blood tests, physical
examinations, free-text journals from patients on their experience of
treatment, as well as X-ray, static and dynamic MRI scans, to reach clinical
decisions and assess drugs’ efficiency during clinical trials.
• A marketing and consultancy company that needs to forage the Web (blogs,
forums, wikis, etc.) for high-level knowledge, such as public opinions about its
products and services, to quickly monitor public response to a new marketing
launch and make decisions to inform a new strategy.
To facilitate decision making tasks in such contexts, users need to exploit a diverse
set of tools to collect, store and analyze the available data, as well as tools to share and
discuss the associated resources and outcomes. Despite the number of tools available to
support the various stages of their work, users still invest great efforts to transform the
data throughout the various stages, due to the standalone and independent nature of these
tools [Lopez and Oliveira, 2011]. Furthermore, collaboration and data sharing aspects in
such work settings are rarely considered as an integral part of the process. Approaches
that consider collaboration support as central to “big data” tasks have started to emerge
only recently [De Roure et al., 2009].
In this paper, we present an innovative approach to decision making in data-intensive
and cognitively-complex settings, which facilitates data analysis tasks and considers
collaboration support as an integral part of such processes. The work reported is
performed in the context of an FP7 EU project, namely Dicode (http://dicode-project.eu/).
To achieve its goal, Dicode follows a hybrid approach, in that it exploits and builds on
the synergy of human and machine reasoning to meaningfully analyze and aggregate data
existing in diverse, extremely large, and rapidly evolving sources.
The remainder of this paper is structured as follows: The next section lists a series of
critical issues that characterize contemporary collaboration and decision making settings.
Taking them into account, Section 3 reports on the proposed solution, while Section 4
describes in more detail the services offered. Section 5 sketches a specific usage scenario,
while Section 6 presents the tool that meaningfully integrates the Dicode services.
Finally, evaluation results are presented in Section 7, while concluding remarks are
given in Section 8.
2. Contemporary Collaboration and Decision Making Settings
Collaboration and decision making settings are often associated with huge, ever-
increasing amounts of multiple types of data, obtained from diverse and distributed
sources, which often have a low signal-to-noise ratio for addressing the problem at hand.
In many cases, the raw information is so overwhelming that stakeholders are often at a
loss to know even where to begin to make sense of it. In addition, these data may vary in
terms of subjectivity and importance, ranging from individual opinions and estimations to
broadly accepted practices and undeniable measurements and scientific results. The data
types also differ in how readily they can be understood by humans and interpreted by
machines.
Besides, it is nowadays easier to get the data in than out. Big volumes of data can be
effortlessly added to a database; the problems start when we want to consider and exploit
the accumulated data, which may have been collected over a few weeks or months, and
meaningfully analyze them towards making a decision. Admittedly, when things get
complex, we need to identify, understand and exploit data patterns; we need to aggregate
large volumes of data from multiple sources, and then mine them for insights that would never
emerge from manual inspection or analysis of any single data source. In the settings
under consideration, the way that data will be structured for query and analysis, as well as
the way that tools will be designed to handle them efficiently, are of great importance and
certainly pose a major research challenge.
Generally speaking, information management related tasks need to be streamlined
and automated. Recent findings clearly indicate that information management costs too
much when it is not well organized and meaningfully automated [Eppler and Mengis,
2004]. They also call for investments in innovative software that reduces or eliminates
time wasted, reduces management overheads, streamlines collaborative processes, and
automates the overall workflow. Return on such investments can be both tangible (e.g.
time or money saved) and intangible (e.g. more valuable information, easier extraction of
hidden information, increase of information workers’ satisfaction and creativity,
improved collaboration).
It follows from the above that guiding the information worker
through the space of available data and indicating relevant information to facilitate
and augment collaboration and decision making activities are of major importance.
To this end, we foresee a semi-automatic, adaptive approach that makes use of
both semantic metadata and pre-structured data patterns to provide plausible
recommendations, while also learning from the users’ feedback to better target their
information interests [Adomavicius and Tuzhilin, 2005; Anya et al., 2011]. This will be
enabled by innovative data mining techniques and services such as local pattern mining,
similarity learning, and graph mining, coupled with a flexible collaboration framework
where all these services are seamlessly integrated and orchestrated.
Recent research in data mining is geared towards the extraction of more semantic
information. Since data and information are today available in large volumes and diverse
types of representation, intelligent integration of these data sources to generate new
knowledge - towards serving collaboration and decision making requirements - remains a
key challenge. Data pre-processing requirements, associated with data cleansing as well
as handling of noise and uncertainty in various data sources, are inherent here. In the
settings under consideration, data sources are associated with various types of
information, each of them covering distinct aspects. A systematic way is needed to
generate different points of view for this kind of data. Contemporary approaches need to
help users utilize complex multi-source data in a reasonable way by supporting them in
finding relevant information and by providing personalized recommendations.
Another big category of requirements concerns the exploration, delivery and
visualization of the pertinent information. These should be based on: (i) an intelligent
semantic annotation, structuring and aggregation of voluminous and complex data, (ii)
the meaningful analysis and exploitation of data patterns and interrelations, (iii) the
capturing of stakeholders’ tacit knowledge, as far as information analysis and problem
solving are concerned, through a social web approach, and (iv) the exploitation of
particular user and group characteristics to properly direct or adapt data. Generally
speaking, semantics to be deployed should come out of a joint consideration of
stakeholders, their actions in the settings under consideration, and data considered each
time.
As far as collaboration and decision making support is concerned, stakeholders
require solutions that easily enable them to create and maintain private or public
workspaces, where the most pertinent information about the problem at hand can be
gathered, linked, synthesized and assessed. Through such workspaces, they need to carry
out synchronous or asynchronous collaboration to accommodate and elaborate the
outcomes of data mining, get recommendations, identify inconsistencies, spot and repair
information gaps, reason about actions etc. A goal-dependent integration of data coming
from heterogeneous databases is also required (to complement goal-driven data search
and acquisition). Data visualization issues also impose a series of important requirements
here.
The above bring up the need to develop innovative services that shift the focus
from the mere collection and representation of large-scale information to its meaningful
assessment, aggregation and utilization in contemporary collaboration and decision
making settings.
3. The Proposed Solution
In the context of the Dicode project, we exploit a cloud infrastructure to adapt and refine
computationally expensive algorithms for semantic data mining to new paradigms for
distributed computing, such as the MapReduce paradigm implemented in frameworks
like Hadoop and its Mahout derivative. The Apache Mahout project aims at building a
scalable machine learning library on top of Hadoop. In the context of this project, a
number of machine learning algorithms have already been parallelized. Based on the
outcomes of the Mahout project so far, our goal is to set up a basic analysis infrastructure
for the Dicode project. Aiming at fully supporting the data mining process (including
pre-processing, modeling, validation and deployment), we integrate, adapt and extend the
Mahout machine learning library, e.g. by developing advanced machine learning
algorithms for large scale data. For instance, Mahout may significantly help towards
grouping similar items (be they related raw data or users with similar expertise or
interests), identifying main or “hot” topics, assigning items to predefined categories,
recommending important data to diverse stakeholders, and discovering frequent and
meaningful patterns in a specific decision making setting.
The ultimate goal of the Dicode project is to support users in collaboratively solving
problems and making decisions on complex scenarios with very large, potentially
conflicting and incomplete amounts of information. To achieve this goal, Dicode exploits
and advances the state-of-the-art in three relevant directions, namely: (i) new techniques
for scalable high-performance data mining, (ii) data mining to make sense of real-world
multi-faceted data, and (iii) collaboration and decision making support.
3.1. New techniques for scalable high-performance data mining
A particular challenge of Dicode is to develop data mining approaches that scale well to
extremely large, rapidly evolving data sets in the terabyte range. To respond to this
challenge, the project develops a flexible large-scale data mining platform based on the
MapReduce framework and related technology. MapReduce has already been
successfully deployed within Google, where it runs a myriad of different programming tasks.
Many of the building blocks that are necessary to cover the requirements mentioned
above do not rely on simple filters or selection rules; instead, they can only be efficiently
implemented by employing appropriate machine learning algorithms (e.g. classification
of objects into predefined categories, clustering objects by similarity, identification of
trends in document streams, extracting topics from documents). There are some
approaches to scaling traditional machine learning algorithms to large scale datasets
[Wegener et al., 2009]; however, most research still leaves the aspect of scalability and
performance optimization out of scope. The most comprehensive and active project
aiming at implementing machine learning algorithms for large scale problems today is
Apache Mahout [Ingersoll, 2009]. However, Mahout is unsuitable for out-of-the-box, ad-
hoc data analysis of arbitrary data: the project includes only very basic modules for data
pre-processing, data conversion and integration with existing systems.
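
To make the MapReduce paradigm referred to above more concrete, the following minimal sketch counts term occurrences over a small document collection in three steps (map, shuffle, reduce). It is an in-memory illustration in plain Python and not Dicode, Hadoop or Mahout code; the function names and the sample documents are assumptions introduced here for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Emit a (term, 1) pair for every whitespace-separated token of one document."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as a framework would do between the two phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term: str, counts: List[int]) -> Tuple[str, int]:
    """Sum the partial counts collected for one term."""
    return term, sum(counts)


def term_counts(documents: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over an in-memory collection."""
    emitted = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(term, counts) for term, counts in shuffle(emitted).items())


# Hypothetical two-document collection.
counts = term_counts({"d1": "gene expression data", "d2": "Gene data"})
```

Because each call to map_phase depends on a single document and each call to reduce_phase on a single term, both phases can in principle be distributed over many machines, which is the property that makes the paradigm attractive for very large data sets.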
3.2. Data mining to make sense of real-world multi-faceted data
Dicode also aims to support users in solving problems and making decisions on complex
scenarios with large, potentially conflicting and incomplete amounts of information. A
main challenge in this setting is that in real applications data is not as nicely structured as
it seems to be in current solutions and closed systems. Instead, data in real-world
applications are complex and multi-faceted, where important information is spread
among multiple data sources and formats, and relations between these instances are not
always obvious. For instance, while it is typically not a problem to surface many
possible links between two pieces of information using standard features, it is hard to
identify the really relevant links. A candidate technique to address this problem is
similarity learning, which strives to find an optimal, user-centric measure of similarity
on the basis of many candidate similarity measures. Similarity learning has been proven
to be successful in many different applications with complex items, ranging from the
identification of similar people [Rüping et al., 2008] to the identification of similar
process workflows [Friesen and Rüping, 2010].
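
The idea of deriving a user-centric similarity measure from several candidate measures can be illustrated with a minimal sketch. It assumes that the combined measure is a weighted sum of the candidates and that the weights are fitted by plain least squares (gradient descent) to a handful of user judgments; this is a simplification chosen for illustration and not the method of the works cited above. The two candidate measures, the item fields and the feedback pairs are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Item = Dict[str, Any]
Measure = Callable[[Item, Item], float]


def tag_jaccard(a: Item, b: Item) -> float:
    """Candidate measure 1: overlap of the two items' tag sets."""
    union = a["tags"] | b["tags"]
    return len(a["tags"] & b["tags"]) / len(union) if union else 0.0


def year_closeness(a: Item, b: Item) -> float:
    """Candidate measure 2: closeness of the two items' years."""
    return 1.0 / (1.0 + abs(a["year"] - b["year"]))


def learn_weights(feedback: Sequence[Tuple[Item, Item, float]],
                  measures: Sequence[Measure],
                  lr: float = 0.1, epochs: int = 500) -> List[float]:
    """Fit one weight per candidate measure by minimizing the mean squared error
    between the weighted sum of the measures and the user's judgment."""
    rows = [([m(a, b) for m in measures], label) for a, b, label in feedback]
    weights = [0.0] * len(measures)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for scores, label in rows:
            error = sum(w * s for w, s in zip(weights, scores)) - label
            for i, s in enumerate(scores):
                grads[i] += 2.0 * error * s / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


def combined_similarity(a: Item, b: Item, measures: Sequence[Measure],
                        weights: Sequence[float]) -> float:
    """Learned similarity: weighted sum of the candidate measures."""
    return sum(w * m(a, b) for w, m in zip(weights, measures))


# Hypothetical user feedback: 1.0 = "similar", 0.0 = "not similar".
MEASURES = [tag_jaccard, year_closeness]
FEEDBACK = [
    ({"tags": {"x", "y"}, "year": 2000}, {"tags": {"x", "y"}, "year": 2010}, 1.0),
    ({"tags": {"x"}, "year": 2005}, {"tags": {"z"}, "year": 2005}, 0.0),
    ({"tags": {"p", "q"}, "year": 2001}, {"tags": {"p", "q"}, "year": 2001}, 1.0),
    ({"tags": {"m"}, "year": 1990}, {"tags": {"n"}, "year": 2009}, 0.0),
]
weights = learn_weights(FEEDBACK, MEASURES)
```

In this toy feedback the user's judgments follow the tag overlap and ignore the year, so the fitted weights favor the first measure; with different judgments the same procedure would yield a different, user-specific combination.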
Besides, text data is the most central data type in many applications. To bridge the
gap between human-readable texts and structured data that is suited for machine
processing, text mining technologies are of much importance. In the context of Dicode,
approaches that extract semantic information from text make it possible to generate higher-level
knowledge, which can be fed to the users as additional input for their decision making
process, without them having to read all the texts (which can be millions of web pages in an
extreme case) by themselves.
3.3. Collaboration and Decision Making Support
The advent of Web 2.0 has introduced a plethora of collaboration tools which provide
engagement at a massive scale and feature novel paradigms. These tools cover a broad
spectrum of needs, ranging from exchanging, sharing and tagging, through social networking, to
authoring, mind mapping and discussing. For instance, Delicious (http://delicious.com)
and CiteULike (http://www.citeulike.com) provide services for storing, sharing and
discovering of user generated Web bookmarks and academic publications, respectively.
A different set of applications focuses on building online communities of people who
share interests and activities (social networking applications). MySpace
(http://www.myspace.com) and LinkedIn (http://www.linkedin.com) are representative
examples in this category. Another set of Web 2.0 tools aims at collectively organizing,
visualizing and structuring concepts via maps to aid brainstorming, study and problem
solving. Tools such as Thinkature (http://www.thinkature.com) fall into this category.
Finally, systems such as online discussion forums, Debatepedia (http://wiki.idebate.org)
and Cohere (http://cohere.open.ac.uk/) support online discussions over the Web.
Although all the above tools enable the massive and unconstrained collaboration of
users, this very feature is the source of a problem that these tools introduce to their users:
the problem of information overload. Current Web 2.0 collaboration tools exhibit two
important shortcomings making them prone to the problem of information overload.
First, these tools are ‘information islands’, thus providing only limited support for
interoperation, integration and synergy with third party tools. Second, Web 2.0
collaboration tools are rather passive media; they lack reasoning services with which they
could meaningfully support the collaboration.
As far as existing decision making support technologies are concerned, data
warehouses and on-line analytical processing have been broadly recognized as
technologies playing a prominent role in the development of current and future DSS
[Shim et al., 2002]. However, there is still room for further developing the conceptual,
methodological and application-oriented aspects of the problem. One critical point that is
still missing is a holistic perspective on the issue of decision making. This stems from
the growing need to develop applications that follow a more human-centric (rather than
problem-centric) view, in order to appropriately address the requirements of
employees in contemporary, knowledge-intensive organizations. Dicode advances decision
making support technologies by adopting a knowledge-based decision-making view,
enabled by the meaningful accommodation of the results of the data mining processes. In
such a way, the decision making process is able to produce new knowledge, such as
evidence justifying or challenging an alternative or practices to be followed or avoided
after the evaluation of a decision. Knowledge management activities such as knowledge
elicitation, representation and distribution influence the creation of the decision models to
be adopted, thus enhancing the decision making process [Bolloju et al., 2002; Watanabe
et al., 2010].
With respect to collaboration and decision making support, the Dicode project
provides a series of innovative features: First, Dicode introduces advanced decision
making services into collaborative environments in order to help control the impact of
voluminous and complex data. Second, Dicode does not treat collaboration services as
standalone applications that operate autonomously and in isolation from other services,
but rather as ones that coexist. Third, Dicode enables both human and machine
understandable argumentative discourses to support ease-of-use and expressiveness for
users, as well as advanced reasoning by the machine.
4. Services offered
As noted above, the Dicode project aims at offering an innovative solution that reduces
the data-intensiveness and overall complexity of real-life collaboration and decision
making settings to a manageable level, thus permitting stakeholders to be more
productive and concentrate on creative activities. To this end, the project
provides a suite of innovative, adaptive and interoperable services (both at a conceptual
and a technical level) that satisfies the full range of the requirements reported above.
These services run on the Web. Much attention is being given to the adaptability
of the foreseen services with respect to changes in user requirements and operating
conditions. As shown in Figure 1, the Dicode suite of services comprises:
• Data acquisition services, which enable the purposeful capturing of tractable
information that exists in diverse data sources and formats. Particular attention is
being paid to web resources and the development of the associated spider
components (web crawler).
• Data pre-processing services, which efficiently manipulate raw data before they
are stored in the foreseen solution. Transformation of different kinds of documents
into a canonical form, structuring of documents from layout information, data
cleansing (e.g. removing noise from web pages, discarding useless database
records), as well as language detection and linguistic annotations, are some of
the functionalities that fall in this category of services.
• Data mining services, which exploit and are built on top of a cloud infrastructure
and other prominent large-scale data processing technologies to offer
functionalities such as high performance full text search, data indexing,
classification and clustering, directed data filtering and fusion, and meaningful
data aggregation. Advanced text mining techniques such as named entity
recognition, relation extraction and opinion mining help to extract valuable
semantic information from unstructured texts. Intelligent data mining techniques
being used include local pattern mining, similarity learning, and graph mining.
Figure 1. The Dicode architecture.
• Collaboration support services, which facilitate the synchronous and
asynchronous collaboration of stakeholders through adaptive workspaces,
efficiently handle the representation and visualization of the outcomes of the
data mining services (through alternative and dedicated data visualization
schemas), and accommodate the orchestration of a series of actions for the
appropriate handling of data in each case.
• Decision making support services, which augment both individual and group
sense-making and decision-making by supporting stakeholders in locating,
retrieving and arguing about relevant information and knowledge, as well as by
providing them with appropriate notifications and recommendations (taking into
account parameters such as preferences, competences, expertise etc.).
To better illustrate the use and orchestration of the abovementioned services, we first
describe a detailed scenario taking place in a collaborative decision making setting (see
next section). Through this specific scenario, we then present the features and
functionalities of the proposed tool (namely, the Dicode Workbench - see Section 6).
5. A Specific Scenario: Investigating Genes Associated with Breast Cancer Disease
Consider two researchers, Jim and Alice, aiming to investigate which genes or groups of
genes are associated with breast cancer disease. Initially, they create a new collaboration
session (logbook), where they exchange ideas related to which data sources to use, based
on their own data analysis experience and literature knowledge. They search relevant
literature via PubMed (http://www.ncbi.nlm.nih.gov/pubmed) using the appropriate
search services.
Jim has conducted an initial analysis with some in-house gene-expression datasets;
however, his findings were not very encouraging, which was attributed to the small
sample size (i.e. number of patients) available. He informs Alice about it and suggests
potential solutions. The discussion proceeds and finally, in order to overcome the limited
sample size problem, they decide to augment their samples with publicly available gene-
expression data derived from the GEO (http://www.ncbi.nlm.nih.gov/geo/) and SMD
(http://smd.stanford.edu/) databases.
After deciding what data to use, they keep collaborating in order to discuss how the
data will be processed. Both suggest solutions, comment on them, and finally decide to
use the normalized data for each platform and the UniGene annotation database
(http://www.ncbi.nlm.nih.gov/unigene) to uniformly map all genes. Jim knows that there
are particular confounding effects in this kind of analysis and for that reason suggests a
specific strategy that would account for these effects. Particularly, they decide to first
analyze the integrated dataset using the well-known Significance Analysis of Microarrays
(SAM) methodology [Tusher et al., 2001]. This analysis will serve as a baseline for any
further analysis they attempt. Jim also offers to provide all the necessary R scripts
(http://www.r-project.org/) for this initial statistical analysis. In addition, they decide to
employ model-based data integration methodologies that have been recently published
and claim to perform better than simple data integration techniques [Huttenhower et al.,
2006; Garrett-Mayer et al., 2007; Shabalin et al., 2008].
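
As an illustration of the baseline analysis mentioned in the scenario, the following minimal sketch computes the per-gene relative difference statistic on which SAM is based, namely the difference of the two group means divided by a pooled standard error plus a small constant s0. It is only a sketch under stated assumptions: s0 is fixed by hand here, whereas the full methodology selects it from the data and assesses significance through permutations, both of which are omitted. Python is used for consistency with the other sketches in this paper, although the scenario assumes R scripts; the gene names and expression values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def relative_difference(group_a: Sequence[float], group_b: Sequence[float],
                        s0: float = 0.1) -> float:
    """Relative difference d = (mean_a - mean_b) / (s + s0) for a single gene.

    s is the pooled standard error of the two groups; s0 is a small positive
    constant (fixed here as an assumption) that keeps d stable when s is tiny.
    Requires at least three measurements in total.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    pooled_ss = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    scale = (1.0 / n_a + 1.0 / n_b) / (n_a + n_b - 2)
    s = math.sqrt(scale * pooled_ss)
    return (mean_a - mean_b) / (s + s0)


def rank_genes(expression: Dict[str, Tuple[Sequence[float], Sequence[float]]],
               s0: float = 0.1) -> List[str]:
    """Order genes by decreasing absolute relative difference."""
    scores = {gene: relative_difference(a, b, s0) for gene, (a, b) in expression.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Hypothetical expression values: (tumour samples, control samples) per gene.
expression = {
    "GENE_B": ([2.0, 2.1, 1.9], [2.0, 2.2, 1.8]),
    "GENE_A": ([5.0, 5.2, 4.8], [1.0, 1.1, 0.9]),
}
ranking = rank_genes(expression)
```

Genes at the top of such a ranking are candidates for differential expression; deciding which of them are actually significant requires the permutation step that this sketch leaves out.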
Some of the models are readily available; however, others need to be coded. Jim
offers to write the relevant scripts. Alice, being an experienced programmer, offers to
hard code them using parallel programming and various servers available at her
department. Parallel and cloud computing will ensure fast results, since they have both
agreed that they should apply the selected methodologies to numerous datasets. Their
goal is to identify novel or already reported groups of genes associated with breast cancer
disease. In addition, they are interested in comparing the findings of the chosen
methodologies to those of the simple analysis conducted by Jim. They decide to
quantify and check the statistical and biological significance of their results via the
DAVID tool (http://david.abcc.ncifcrf.gov/) and the KEGG database
(http://www.genome.jp/kegg/pathway.html).
Both researchers can execute the available services and retrieve the results of the
invoked tool (e.g. a scatter plot or heatmap plot). Once the results are available, they
engage in interpreting them in terms of the initial research question.
6. The Dicode Workbench
Scenarios such as the one described above bring up the need to develop
innovative services that shift the focus from the mere collection and representation of
large-scale information to its meaningful assessment, aggregation and utilization in
contemporary collaboration and decision making settings. Such services are provided by
the Dicode Workbench.
Figure 2 presents the main user interface of the Dicode Workbench, which comprises
the necessary computational and collaborative services that biomedical researchers
require to address their research questions (see Section 5). A widget-based approach has
been adopted to offer these services. Each widget provides different functionalities. In
particular, the ‘Storage Service’ widget (top left side of Figure 2) lists all resources that
researchers need to work with to address their research question. These resources may include,
for instance, XML files stored locally and containing biomedical data, references to data
retrieval services which fetch data sets from remote repositories, as well as research
papers that are considered useful for the question at hand. Selecting one of the items
allows team members to retrieve and view the associated data.
Figure 2. The Dicode Workbench (showing the mind-map view of the collaboration workspace).
The ‘Services search’ widget (top right side of Figure 2) enables users to locate
computational tools that can be invoked to process data sets and produce results. These
services can be, for instance, R scripts (see the ‘R engine’ widget at the bottom left side
of Figure 2), references to Web services or entire applications which implement specific
data mining algorithms (e.g. the ‘PubMed Mobile’ widget that enables users to exploit the
popular PubMed library (http://www.ncbi.nlm.nih.gov/pubmed)). The section ‘Additional
information’ displays metadata about the workbench (when it was created, a short
description related to the aim of the research, as well as the team members who have
access to the workbench).
At the center of the Dicode Workbench (see Figure 2) is the ‘Dicode Collaborative
Workspace’, which enables the argumentative collaboration among researchers. The
approach adopted to support collaboration in Dicode builds on a conceptual framework
where formality and the level of knowledge structuring during collaboration are not
considered as predefined and rigid properties, but rather as adaptable aspects that can
be modified to meet the needs of the tasks at hand. By the term formality, we refer to the
rules enforced by the system, with which all user actions must comply. By allowing
formality to vary within the collaboration space, incremental formalization - i.e. a
stepwise and controlled evolution from a mere collection of individual ideas and
resources to the production of highly contextualized and interrelated knowledge artifacts
- can be achieved.
In our approach, views constitute the ‘vehicle’ that permits incremental formalization
of the collaboration. A view can be defined as a particular representation of the
collaboration space, in which a consistent set of abstractions able to solve a particular
organizational problem during collaboration is available. Our approach enables
switching from one view to another. Each view of a collaboration workspace provides the
necessary mechanisms to support a particular level of formality. The more informal a
view is, the easier it is to use: the actions that users may perform are intuitive and not
time consuming; however, the overall context is human (and not system) interpretable.
On the other hand, the more formal a view is, the harder it is to use: the permitted
actions are fewer, less intuitive and more time consuming. The
overall context in this case is both human and system interpretable. The already
developed collaboration views, along with the functionalities they provide, are as
follows:
• Discussion-forum view: In this view, a collaboration space is displayed as a
traditional web-based forum, where posts are displayed in ascending
chronological order. Users are able to post new messages to the collaboration
space, which will appear at the end of the list of messages.
Mind-map view: In this view, a collaboration space is displayed as a mind map,
where users can ‘interact’ with the items uploaded so far. The map deploys a
spatial metaphor permitting the easy movement and arrangement of items on the
collaboration space. In this view, information triage - i.e. the process of sorting
through numerous relevant materials and organizing them to
meet the task at hand [Marshall and Shipman, 1997] - is supported.
Formal view: This view enables the posting of predefined knowledge items. It
invokes a set of dedicated scoring and reasoning mechanisms aiming to help users
conceive the outcome of a particular collaborative session [Karacapilidis et al.,
2009].
Recalling our scenario, as the initial goal of Jim and Alice is to accumulate a critical
mass of relevant resources, they create a new collaboration workspace and may start
using it in the ‘forum-view’. The forum view primarily aims to effortlessly collect and
share the available resources (without the need to interrelate them). During this
collaboration phase, Jim and Alice upload available resources and assess them
informally, by briefly commenting on them.
When many resources start appearing in the forum view, Jim and Alice may decide to
switch to a mind-map view, where they can better manage these resources. In this view,
Jim and Alice may organize the available items in more advanced ways and exploit
dedicated item types such as ideas, notes and comments (Figure 2). Ideas stand for items
that deserve further exploitation; they may correspond to an alternative solution to the
issue under consideration and they usually trigger the evolution of the collaboration.
Notes are generally considered as items expressing one’s knowledge about the overall
issue, an already asserted idea or note. Finally, comments are items that usually express
weaker statements and are uploaded to provide some explanatory text or point to some
potentially useful information. Multimedia resources can also be uploaded.
All the above items can be interrelated by trouble-free actions. When interrelating
items, Jim and Alice may select the color of the connecting arrow and provide a legend
describing the interrelationship they conceive. These legends are intentionally arbitrary.
In contrast, the visual cues of the arrows bear well-defined semantics: for instance, green arrows
declare support, whereas red ones declare opposition. Another visual cue that appears in
Figure 2 concerns the colored rectangles that have been created by Jim and Alice to
meaningfully group/cluster related items.
By using the mind-map view, Jim and Alice can transform the resources from a mere
collection of items into coherent knowledge structures that facilitate sense making on
the available resources. By using the search facilities of the workbench, they are also able
to search for relevant literature or data sets, which can also be uploaded to the
collaboration logbook. Moreover, resources that Jim and Alice agree are relevant for
their research can be easily added to the sources section of the workbench.
Jim and Alice may need to further elaborate the knowledge items considered so far,
and exploit additional functionalities to advance their collaboration towards reaching a
decision. Such functionalities can be provided by the formal view that enables the
semantic annotation of knowledge items, the formal exploitation of collaboration item
patterns, and the deployment of appropriate formal argumentation and reasoning
mechanisms (Figure 3).
Figure 3: The Dicode workbench (showing the formal view of the collaboration workspace).
While a mind map view aids the exploitation of information by Jim and Alice, a formal
view aims mainly at the exploitation of information by the machine. This view provides a
fixed set of discourse element and relationship types, with predetermined, system
interpretable semantics. In particular, issues correspond to problems to be solved,
decisions to be made, or goals to be achieved. For each issue, both users may propose
alternatives (i.e. solutions to the problem under consideration) that correspond to
potential choices. Positions are asserted in order to support the selection of a specific
course of action (alternative), or avert the users’ interest from it by expressing some
objection. A position may also refer to another (previously asserted) position, thus
arguing in favor of or against it. When switching from the mind-map view to the formal view, existing
item types are transformed, filtered out, or kept “as-is” based on a specific set of rules.
These rules also take into consideration the items’ visual cues. For example, red colored
arrows in the mind-map view are transformed into positions that express objection in the
formal view (arguments against).
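The following minimal sketch illustrates how such a rule-based transformation could look. It is not the Dicode implementation: the record types `Arrow` and `Position`, their fields, and the function names are hypothetical. The color mapping follows the semantics stated in the text (green declares support, red declares opposition); filtering out arrows of any other color is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Arrow:
    """Hypothetical mind-map arrow connecting two items."""
    source: str       # id of the item the arrow starts from
    target: str       # id of the item the arrow points to
    color: str        # visual cue chosen by the user
    legend: str = ""  # free-text label, arbitrary by design


@dataclass
class Position:
    """Hypothetical formal-view position derived from an arrow."""
    source: str
    refers_to: str
    stance: str       # "in_favor" or "against"


# Green -> support and red -> opposition follow the text; the dictionary
# itself and the stance labels are illustrative assumptions.
COLOR_TO_STANCE = {"green": "in_favor", "red": "against"}


def arrow_to_position(arrow: Arrow) -> Optional[Position]:
    """Transform one arrow; return None if no rule applies (filtered out)."""
    stance = COLOR_TO_STANCE.get(arrow.color)
    if stance is None:
        return None  # assumption: colors without defined semantics are dropped
    return Position(arrow.source, arrow.target, stance)


def to_formal_view(arrows: List[Arrow]) -> List[Position]:
    """Apply the rule to every arrow and keep only the transformed ones."""
    positions = (arrow_to_position(a) for a in arrows)
    return [p for p in positions if p is not None]


arrows = [
    Arrow("note1", "idea1", "red", "contradicts"),
    Arrow("note2", "idea1", "green"),
    Arrow("comment1", "idea1", "blue"),
]
result = to_formal_view(arrows)
```

Note that the free-text legend is ignored by the rule, in line with the observation that legends are arbitrary while arrow colors carry well-defined semantics.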
The formal view also integrates a reasoning mechanism that determines the status of
each discourse entry, the ultimate aim being to keep Jim and Alice aware of the discourse
outcome. Jim and Alice can continue their collaboration in this formal view; each time an
element is added to the discussion, this triggers the underlying reasoning mechanism
which informs the team about the most prominent solution. Using the formal view, Jim
and Alice receive active support from the system to reach a decision concerning the most
appropriate resources for their research. The reasoning and scoring mechanisms of the
Dicode workbench’s formal view are built according to the algorithms described in
[Karacapilidis and Papadias, 2001].
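To give a feeling of how the status of discourse entries can be computed, a deliberately simplified sketch follows. It is not the algorithm of [Karacapilidis and Papadias, 2001] and not the Dicode code. It assumes that a position is active unless an active position argues against it, and that an alternative is scored by counting active positions in favor minus active positions against. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Position:
    """Hypothetical position; children are positions that refer to it."""
    text: str
    in_favor: bool  # True: supports its parent; False: objects to it
    children: List["Position"] = field(default_factory=list)

    def is_active(self) -> bool:
        # Simplifying assumption: a position is defeated as soon as one
        # active child position argues against it.
        return not any(c.is_active() and not c.in_favor for c in self.children)


@dataclass
class Alternative:
    """Hypothetical alternative proposed for an issue."""
    name: str
    positions: List[Position] = field(default_factory=list)

    def score(self) -> int:
        # +1 for each active supporting position, -1 for each active objection.
        return sum(1 if p.in_favor else -1
                   for p in self.positions if p.is_active())


def most_prominent(alternatives: List[Alternative]) -> Alternative:
    """Return the alternative with the highest score (first one wins ties)."""
    return max(alternatives, key=lambda a: a.score())


dataset_a = Alternative("dataset A", [
    Position("well annotated", in_favor=True),
    Position("small sample", in_favor=False, children=[
        Position("sample size suffices for this analysis", in_favor=False),
    ]),
])
dataset_b = Alternative("dataset B", [
    Position("large sample", in_favor=True),
    Position("noisy measurements", in_favor=False),
])
best = most_prominent([dataset_a, dataset_b])
```

In this example the objection to dataset A is itself defeated by a counter-argument, so it does not count against that alternative. Calling `most_prominent` again after each newly added element mimics the triggering behavior described above.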
7. Evaluation results
To assess the effectiveness of the Dicode approach, an extensive evaluation process is
currently underway. A first evaluation phase has been completed aiming to assess a series
of key success indicators concerning the maturity of the technology used, as well as the
usability and acceptability of Dicode services in three real-life contexts (clinico-genomic
research, medical decision making, and opinion mining from Web 2.0 data). Evaluators
were asked to read a service-specific scenario, experiment with the Dicode services and
fill in a mixed-type questionnaire (responses expected were in both quantitative and
qualitative form). For the case reported in this paper, the sample consisted of 58
evaluators with varying background knowledge in bioscience fields. Answers to the
quantitative questions of the questionnaires were given on a 1-5 scale, where 1 stands for
‘I strongly disagree’ and 5 for ‘I strongly agree’ [Nielsen, 1991; Norman, 1998]. Figures
4 and 5 illustrate the evaluation results for the Dicode Workbench (where all Dicode
services are integrated) and the services related to collaboration and decision making
support, respectively.
Figure 4: Evaluation results of the Dicode Workbench.
Figure 5: Evaluation results of the Dicode Collaborative Decision Making Support Service.
Figure 4 summarizes the evaluators’ responses relative to the overall quality of the
Dicode Workbench. As shown, the evaluators agreed (median: 4, mode: 4) that its
objectives are met, that it is novel to their knowledge, that they are satisfied with its
performance and that they are overall satisfied with it. The evaluators were neutral
(median: 3, mode: 3) with respect to whether the Workbench addressed the data intensive
decision making issues. Related comments were: ‘Some kind of 'roadmap' would be
appreciated’; ‘Getting started is a bit confusing for a new user’. As far as its
acceptability is concerned (such results are not shown in Figure 4), the evaluators agreed
(median: 4, mode: 4) that the Workbench has the full set of functions they expected, that
its interface is pleasant and that they will recommend it to their peers/community.
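Since responses on an agreement scale are ordinal, the median and the mode are the natural summary statistics. The short sketch below shows how they can be computed; the list of responses is purely hypothetical and is not data from the evaluation reported here, and the function name `summarize` is an assumption.

```python
from statistics import median, mode

# Hypothetical answers to one questionnaire item on the 1-5 agreement scale
# (1 = 'I strongly disagree', 5 = 'I strongly agree'). NOT study data.
responses = [4, 5, 4, 3, 4, 2, 4, 5, 3, 4]


def summarize(scores):
    """Return the median and the most frequent value of ordinal scores."""
    return {"median": median(scores), "mode": mode(scores)}


summary = summarize(responses)
print(summary)
```

For this illustrative list both statistics equal 4, which would be read as 'agree' on the scale used in the questionnaires.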
The evaluation also focused on assessing the collaboration aspects of the workbench,
which constitute a central part of the Dicode approach. Figure 5 summarizes the
evaluators’ responses relative to the overall quality of the Collaborative Decision Making
Support service. The evaluators agreed (median: 4, mode: 4) that the service addressed
the data intensive decision making issues, that the objectives of the service are met, that
the service is novel to their knowledge, that they are satisfied with the performance of the
service and that they are overall satisfied with this service. The analysis of qualitative
evaluation results showed that, overall, evaluators found the service ‘promising’.
However, a few technical issues were raised, such as: ‘It's very useful for a complex use
case. I'd also like to move with mouse holding left click’; ‘A bit slow loading time both for
workspaces list and mind-map view’; ‘The arrows’ graphics were not very pleasant for
me. It starts from the middle of the icon and not from the begging of the square. The
overall idea however, was quite good’.
Generally speaking, the feedback received from the first evaluation phase of the Dicode
project was positive, clearly pointing out that the overall Dicode approach is promising.
The large majority of the evaluators appreciated the potential of exploiting the synergy of
machine and human intelligence through data mining and collaborative decision making
services. Evaluators indicated that the approach adopted reduces the data-intensiveness
and overall complexity of real-life collaboration and decision making settings to a
manageable level, thus permitting stakeholders to be more productive and concentrate on
creative activities. In this direction, the Dicode Workbench provides a flexible
integration framework that also exploits and augments the underlying collective
intelligence.
8. Conclusions
Dealing with data-intensive and cognitively complex settings is not a technical problem
alone. Building on current advancements, the approach described in this paper brings
together the reasoning capabilities of machines and humans. It can be viewed as an
innovative workbench incorporating and orchestrating a set of interoperable services that
reduce the data-intensiveness and complexity overload at critical decision points to a
manageable level, thus permitting stakeholders to be more productive and concentrate on
creative and innovative activities.
Aiming at facilitating and augmenting collaboration and decision making, the proposed
approach may enhance the quality of these processes and yield time and cost savings.
The initial evaluation of the Dicode services supports this claim. Moreover, the proposed
approach addresses two major imperatives: (i) it exploits the information growth by ensuring
a flexible, adaptable and scalable information and computation infrastructure, and (ii) it
exploits the competences of all stakeholders and information workers to meaningfully
confront various information management issues, such as information characterization,
classification, presentation, retention, storage, and disposal.
Acknowledgments
This publication has been produced in the context of the EU Collaborative Project "DICODE - Mastering
Data-Intensive Collaboration and Decision" which is co-funded by the European Commission under the
contract FP7-ICT-257184. This publication reflects only the authors’ views and the Community is not
liable for any use that may be made of the information contained therein.
References
Adomavicius, G. and Tuzhilin, A. (2005). Toward the Next Generation of Recommender Systems:
A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge
and Data Engineering, 17(6): 734–749.
Anya O., Nagar A. and Tawfik H. (2011). Building adaptive systems for collaborative e-work: The
e-Workbench approach. Intelligent Decision Technologies 5 (1), pp. 83–100.
Bolloju N., Khalifa M. and Turban E. (2002). Integrating Knowledge Management into Enterprise
Environments for the Next Generation Decision Support. Decision Support Systems 33(2), pp.
163-176.
De Roure, D., Goble, C. and Stevens, R. (2009). The Design and Realisation of the myExperiment
Virtual Research Environment for Social Sharing of Workflows. Future Generation Computer
Systems 25, pp. 561-567.
Economist (2010). Data, data everywhere, http://www.economist.com/node/15557443.
Eppler, M.J. and Mengis, J. (2004). The Concept of Information Overload: A Review of Literature
from Organization Science, Accounting, Marketing, MIS, and Related Disciplines. The
Information Society, 20(5):325–344.
Finholt, T. A. (2003). Collaboratories as a new form of scientific organization. Econ. of Innovation
and New Technologies, 12(1):5-25.
Friesen, N. and Rüping, S. (2010). Workflow Analysis Using Graph Kernels. Proc. of the
ECML/PKDD Workshop on Third-Generation Data Mining: Towards Service-Oriented
Knowledge Discovery (SoKD’10), Barcelona, Spain.
Garrett-Mayer, E., Parmigiani, G., Zhong, X., Cope, L., and Gabrielson, E. (2007). Cross study
validation and combined analysis of gene expression microarray data. Biostatistics, 9: 333–354.
Hara, N., Solomon, P., Kim, S., & Sonnenwald, H. (2003). An Emerging View of Scientific
Collaboration: Scientists’ Perspectives on Collaboration and Factors that Impact Collaboration.
J Amer Soc Info Sci Techn., 54(10):952-965.
Huttenhower, C., Hibbs, M., Myers, C., and Troyanskaya, O.G. (2006). A scalable method for
integration and functional analysis of multiple microarray. Bioinformatics, 22(23):2890-2897.
Ingersoll, G. (2009). Introducing Apache Mahout, IBM developer works, Java Technical library.
Available at: http://www.ibm.com/developerworks/java/library/j-mahout/
Karacapilidis N. and Papadias D. (2001). Computer Supported Argumentation and Collaborative
Decision Making: The HERMES system. Information Systems, 26(4): 259-277.
Karacapilidis N., Tzagarakis M., Karousos N., Gkotsis G., Kallistros V., Christodoulou S.,
Mettouris C., Nousia D. (2009). Tackling cognitively-complex collaboration with CoPe_it!. Int.
J. of Web-Based Learning & Teaching Technologies, 4(3): 22-38.
Kirsh, D. (2000). A Few Thoughts on Cognitive Overload, Intellectica, 1(30): 19-51.
Lopes, P and Oliveira, J. L. (2011). Towards knowledge federation in biomedical applications.
Proc. of the 7th International Conference on Semantic Systems, Graz, Austria.
Marshall, C. and Shipman, F. (1997). Spatial Hypertext and the Practice of Information Triage,
Proc. of the 8th ACM Conference on Hypertext, Southampton UK, pp. 124-133. ACM Press,
New York.
Nielsen, J. (1991). Designing Web Usability: The Practice of Simplicity. New Riders Publishing.
Norman, D.A. (1998). The Design of Everyday Things. The MIT Press.
Rüping, S., Punko, N., Günter, B. and Grosskreutz, H. (2008). Procurement Fraud Discovery using
Similarity Measure Learning. Transactions on Case-based Reasoning, 1(1), pp. 37-46.
Shabalin, A.A., Tjelmeland, H., Fan, C., Perou, C.M., & Nobel, A.B. (2008). Merging two gene-
expression studies via cross-platform normalization. Bioinformatics, 24(9): 1154-1160.
Shim J.P., Warkentin M., Courtney J.F., Power D.J., Sharda R. and Carlsson C. (2002). Past,
Present and Future of Decision Support Technology. Decision Support Systems 33(2), pp. 111-
126.
Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to
the ionizing radiation response. Proc. Natl. Acad. Sci. USA 2001, 98(9): 5116-5121.
Watanabe Y., Kojiri T. and Watanabe T. (2010). Effective solution knowledge organization from
discussion record. Intelligent Decision Technologies 4 (4), pp. 241–251.
Wegener, D., Mock, M., Adranale, D. and Wrobel, S. (2009). Toolkit-based high-performance Data
Mining of large Data on MapReduce Clusters. Proc. of the 1st IEEE ICDM Workshop on
Large-scale Data Mining: Theory and Applications (LDMTA 2009).