distributed data mining research using grids and web - iceage

Session 28: Distributed Data Mining Research using Grids and Web Services. Author/Presenter: Peter Brezany, University of Vienna, Austria. 11 July

Page 1: Distributed Data Mining Research using Grids and Web - Iceage

Session 28: Distributed Data Mining Research using Grids and Web Services

Author/Presenter: Peter Brezany

University of Vienna, Austria

11 July

Page 2: Distributed Data Mining Research using Grids and Web - Iceage

Motivation

Balatonfüred, Hungary - 6th-18th July 2008

Figure labels: Business; Medicine; Scientific experiments; Simulations; Earth observations; Data and data exploration cloud

Page 3: Distributed Data Mining Research using Grids and Web - Iceage

Outline

Motivation

Selected projects ←

Data mining model

Towards high productivity analytics

Parallel and distributed data mining and OLAP in GridMiner/ADMIRE projects

Future developments

Page 4: Distributed Data Mining Research using Grids and Web - Iceage

Selected Projects


Page 5: Distributed Data Mining Research using Grids and Web - Iceage

A Long-Term Biodiversity, Ecosystem and Awareness Research Network – ALTER-Net

Figure labels: Waste; Air; Soil; Water; Emission; Biodiversity; Forests; Distributed Data; Distributed Data Mining; Flow Analysis; Geo-Statistics; Reporting; Popular Presentation; Prediction Models; Distributed Applications; Statistics; Common Ontology

Author: Kathi Schleidt

Page 6: Distributed Data Mining Research using Grids and Web - Iceage


China-Austria Data Grid (CADGrid)

Main Idea: Medical Meridian Measurement Grid (M3G) for On-Line Diagnosis

The diabetes domain is the first domain to benefit substantially from the project results

Page 7: Distributed Data Mining Research using Grids and Web - Iceage

Motivation

Meridian-Theory is an important part of Traditional Chinese Medicine (TCM)

Clinical practices of TCM (esp. acupuncture) have been guided by meridian theory for thousands of years

More than 4000 years of experience

Knowledge that we should not only use but also support with modern high-tech measurement and IT technologies

3-Dec-07 CADGrid 7

Page 8: Distributed Data Mining Research using Grids and Web - Iceage


Meridian-Theory Basics (1)

According to TCM, the human body has 14 acupuncture meridians

Secret to our biological and medical knowledge

Each meridian has its main points called source points

Page 9: Distributed Data Mining Research using Grids and Web - Iceage


Meridian-Theory Basics (2)

Using data mining techniques, correlations between these points can be identified, e.g. start-end point correlation and symmetric point correlation

If there is pain at one place along a meridian, a good effect can be achieved by treating another place on the same meridian
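As a minimal sketch of the kind of correlation analysis mentioned on this slide, the following computes the Pearson correlation between signals recorded at two points. The slides do not say which correlation measure the project used, so Pearson is an assumption, and the readings are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readings (arbitrary units) at the start and end point of one meridian.
start_point = [1.0, 2.0, 3.0, 4.0, 5.0]
end_point = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson(start_point, end_point)
print(f"start-end correlation: {r:.3f}")
```

A value of r close to +1 or -1 would indicate a strong linear relationship between the two points; a value near 0 would indicate none.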

Page 10: Distributed Data Mining Research using Grids and Web - Iceage

Meridian-theory Basics (3)

Meridians can transport physical, medical, and biological material and information

The characteristics (weaker or stronger output, time delay, …) gained by the analysis of electro-signals sensed from meridians have a strong relationship with the human body organs (heart, lung, brain, …)

Page 11: Distributed Data Mining Research using Grids and Web - Iceage

Meridian Measurement Methods

1 Active

2 Passive


Page 12: Distributed Data Mining Research using Grids and Web - Iceage

Active Measurement

Figure labels: Data-file; Down-flow-key-point; Up-flow-key-point; Human-body-meridian; Send; Measure; Record

Up-flow point: lower electrical potential

Down-flow point: higher electrical potential

Fingers and toes: zero potential

Page 13: Distributed Data Mining Research using Grids and Web - Iceage

Passive Measurement

Figure labels: Data-file; Measure; Record

to the ground of the instrument

Page 14: Distributed Data Mining Research using Grids and Web - Iceage


Application 1

Non-invasive Glucose Measurement (NIGM)

Meridian Measurement Instrument

Page 15: Distributed Data Mining Research using Grids and Web - Iceage


The First Prototype

Page 16: Distributed Data Mining Research using Grids and Web - Iceage


NIGM Workflow

Page 17: Distributed Data Mining Research using Grids and Web - Iceage

M3G Services for Diabetics: NIGM-Service – Model Setup

Page 18: Distributed Data Mining Research using Grids and Web - Iceage

M3G Services for Diabetics: NIGM-Service – Use Model

Page 19: Distributed Data Mining Research using Grids and Web - Iceage

M3G Services for Diabetics: NIGM-Service – Maintain Model

Page 20: Distributed Data Mining Research using Grids and Web - Iceage


CADGrid Framework

Page 21: Distributed Data Mining Research using Grids and Web - Iceage

Intelligence Base


Page 22: Distributed Data Mining Research using Grids and Web - Iceage

Future Work

Extension to other domains

Brain Informatics domain

Page 23: Distributed Data Mining Research using Grids and Web - Iceage

CRISP-DM

Figure labels: Business Understanding; Data Understanding; Data Preparation; Modelling; Evaluation; Deployment; Data

Page 24: Distributed Data Mining Research using Grids and Web - Iceage

Towards High Productivity Analytics

A Project Sponsored by [sponsor logo not preserved]

Motivation:

Page 25: Distributed Data Mining Research using Grids and Web - Iceage

High Productivity Analytics

Our definition:

"A high productivity analytics system is one that delivers a high level of performance and guarantees a high level of accuracy of analytics models and other results extracted from analyzed data sets, while scoring equally well on other aspects, such as usability, robustness, system management, and ease of programming."

Page 26: Distributed Data Mining Research using Grids and Web - Iceage

High Productivity Analytics Research Agenda

High performance services developed by high productivity languages and tools

Efficient workflow management (building and execution)

Advanced GUI

Illustration on the GridMiner system


Page 27: Distributed Data Mining Research using Grids and Web - Iceage

GridMiner Data Mining Model

Figure labels: Data; Business Understanding; Data Understanding; Data Preparation; Modeling; Evaluation; Deployment; CRISP-DM, SPSS; Service Provider (three instances); Data Provider; GridMiner User; Virtual Organization

Page 28: Distributed Data Mining Research using Grids and Web - Iceage

GridMiner Conceptual Architecture

Figure labels: Data Warehouse; Knowledge; Cleaning and Integration; Selection and Transformation; Data Mining; Evaluation and Presentation; OLAP; Online Analytical Mining; OLAP Queries

Data and functional resources can be geographically distributed – focus on virtualization and large-scale data mining.

Page 29: Distributed Data Mining Research using Grids and Web - Iceage

Motivation for large-scale data mining

Figure: plot of accuracy against sampled data size (up to 100% of the available data size). Curve labels: (q0, m0); (qi, mi); "Assumed". Legend: qi - data quality; mi - modeling method.

Page 30: Distributed Data Mining Research using Grids and Web - Iceage

Service Parallelism Levels

Inter-Service & Intra-Service Parallelism

Page 31: Distributed Data Mining Research using Grids and Web - Iceage

Hybrid Programming Model

SPMD – Single Program Multiple Data (used for programming multiprocessor architectures)

+ SSMD – Single Service Multiple Data (introduced by us for programming service-oriented architectures)
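One possible reading of SSMD, by analogy with SPMD, is that a single service implementation is invoked on several data partitions and the partial results are merged. The sketch below illustrates that reading only. The function names are hypothetical, and the thread pool merely stands in for remote service instances; none of this is taken from GridMiner.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_service(partition):
    """The single 'service': count class labels in one data partition."""
    return Counter(partition)

def run_ssmd(service, partitions):
    """Invoke the same service once per partition and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(service, partitions))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

# Hypothetical data, already split into three partitions.
partitions = [["a", "b", "a"], ["b", "b"], ["a", "c"]]
totals = run_ssmd(count_service, partitions)
print(dict(totals))
```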


Page 32: Distributed Data Mining Research using Grids and Web - Iceage

1. Construction of Decision Trees - SPRINT – Scalable PaRallelizable INduction of decision Trees

Figure labels (attribute types): categorical; continuous; class

Splitting Attributes

The splitting attribute at a node is determined by the Gini index.

Out-of-Core Algorithm
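To make the Gini-based choice of a split concrete, here is a small in-memory sketch: it scans one continuous attribute in sorted order and picks the threshold with the lowest weighted Gini index. It deliberately ignores the out-of-core and parallel aspects of SPRINT, and the toy data and function names are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini index of a binary split into two label lists."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def best_continuous_split(values, labels):
    """Scan a continuous attribute in sorted order; return (threshold, gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = gini_split(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: an "age" attribute and a binary class label.
ages = [23, 17, 43, 68, 32, 20]
risk = ["high", "high", "high", "low", "low", "high"]
print(best_continuous_split(ages, risk))
```

A full tree builder would repeat this scan for every attribute at every node and split on the attribute with the lowest index.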

Page 33: Distributed Data Mining Research using Grids and Web - Iceage

Phase 1 - Preparation


Page 34: Distributed Data Mining Research using Grids and Web - Iceage

Phase 2 - Execution


Page 35: Distributed Data Mining Research using Grids and Web - Iceage

2. Construction of Neural Networks

Error Back-Propagation

Figure labels (network): Input layer; Hidden layer; Output layer; Desired output; +/-

Figure labels (node): Weighted inputs; Σ (sum); Limiter (sigmoid function); Outputs
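The node in the figure (weighted inputs, a sum, and a sigmoid limiter) and the error signal obtained by comparing the output with the desired output can be sketched for a single output node as follows. This is a minimal illustration, not the GridMiner implementation: the bias term, the squared-error loss, the learning rate, and all values are assumptions.

```python
import math

def sigmoid(z):
    """Sigmoid limiter applied to the weighted sum."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(weights, bias, inputs):
    """Weighted sum of the inputs followed by the sigmoid limiter."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

def train_step(weights, bias, inputs, desired, learning_rate=0.5):
    """One gradient-descent update of a single output node (squared error)."""
    out = node_output(weights, bias, inputs)
    # Error term: (desired - out) times the sigmoid derivative out * (1 - out).
    delta = (desired - out) * out * (1.0 - out)
    new_weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias

# Hypothetical starting weights and a single training pattern.
weights, bias = [0.1, -0.2], 0.0
inputs, desired = [1.0, 0.5], 1.0
before = node_output(weights, bias, inputs)
for _ in range(100):
    weights, bias = train_step(weights, bias, inputs, desired)
after = node_output(weights, bias, inputs)
print(f"output before: {before:.3f}, after 100 steps: {after:.3f}")
```

In a multi-layer network the same error term is propagated backwards through the hidden layer, which is what makes training on many patterns expensive.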

Page 36: Distributed Data Mining Research using Grids and Web - Iceage

Parallel Algorithm

Challenges

Training real neural networks (NN) is extremely computationally intensive.

Many practical NN applications (e.g., speech and face recognition) involve a large number of adjustable parameters and training patterns to achieve the needed accuracy.

Solution

Parallel training algorithms

Development of services running in high-performance hardware and software environments

Page 37: Distributed Data Mining Research using Grids and Web - Iceage

Programming Environment: Titanium

The goals: performance, safety, and expressiveness.

A language that gives its users access to modern program structuring through the use of object-oriented technology and enables them to write explicitly parallel code.

Based on a parallel SPMD model of computation with a global address space.

Titanium uses Java as its base but is not a strict extension of Java.

Compiler: Titanium to C + communication

Page 38: Distributed Data Mining Research using Grids and Web - Iceage

Overview of Distributed Solution

Figure labels: Master; Sub-master 0; Sub-master 1; Slave 0, Slave 1, Slave 2 (one group); Slave 0, Slave 1 (second group); Training Data for Sub-master 0; Training Data for Sub-master 1; Data Distribution Scheme 1; Data Distribution Scheme 2
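The master / sub-master / slave hierarchy in the figure can be sketched as a two-level scatter and reduce over the training data. The slides do not state what the nodes exchange, so this sketch assumes they return partial gradient sums for a one-parameter model; the grouping of three slaves under one sub-master and two under the other, and all function names, are likewise assumptions. Threads stand in for remote workers.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def slave(weight, chunk):
    """Slave: sum of squared-error gradients for y = weight * x on its chunk."""
    return sum(2.0 * (weight * x - y) * x for x, y in chunk)

def sub_master(weight, data, n_slaves):
    """Sub-master: scatter its data to its slaves and sum their results."""
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return sum(pool.map(lambda c: slave(weight, c), split(data, n_slaves)))

def master(weight, data, slaves_per_sub_master):
    """Master: one data block per sub-master, then average the gradient.

    Sub-masters are called one after another here to keep the sketch short.
    """
    blocks = split(data, len(slaves_per_sub_master))
    total = sum(sub_master(weight, b, n) for b, n in zip(blocks, slaves_per_sub_master))
    return total / len(data)

# Hypothetical training set for y = 3x; group sizes 3 and 2 as assumed above.
data = [(float(x), 3.0 * x) for x in range(1, 11)]
weight = 0.0
for _ in range(50):
    weight -= 0.01 * master(weight, data, [3, 2])
print(round(weight, 4))
```

Because the partial sums are simply added, the hierarchical result equals the gradient a single node would compute over the whole data set.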

Page 39: Distributed Data Mining Research using Grids and Web - Iceage

The Parallel Implementation


VGE Client

VGE Server

VGE – Vienna Grid Environment

Page 40: Distributed Data Mining Research using Grids and Web - Iceage

The Distributed & Parallel Implementation


VGE Client

VGE Server

Page 41: Distributed Data Mining Research using Grids and Web - Iceage

3. On-Line Analytical Processing (OLAP)


a three-dimensional data cube
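A three-dimensional data cube can be pictured as a mapping from three coordinates to an aggregated measure, with roll-up operations that aggregate dimensions away. The following is a minimal sketch; the dimension names (product, region, quarter) and the sales figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact records: (product, region, quarter, sales).
facts = [
    ("tv", "north", "Q1", 10),
    ("tv", "south", "Q1", 5),
    ("tv", "north", "Q2", 7),
    ("radio", "north", "Q1", 3),
    ("radio", "south", "Q2", 4),
]
DIMENSIONS = ("product", "region", "quarter")

def build_cube(records):
    """Base cube: total measure for each (product, region, quarter) cell."""
    cube = defaultdict(int)
    for *coords, measure in records:
        cube[tuple(coords)] += measure
    return dict(cube)

def roll_up(cube, keep):
    """Aggregate away every dimension not named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    result = defaultdict(int)
    for coords, measure in cube.items():
        result[tuple(coords[i] for i in idx)] += measure
    return dict(result)

cube = build_cube(facts)
print(roll_up(cube, ["product"]))
print(roll_up(cube, ["region", "quarter"]))
```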

Page 42: Distributed Data Mining Research using Grids and Web - Iceage

Distributed OLAP – Aggregation of Compute and Storage Resources


Tuple Stream

Page 43: Distributed Data Mining Research using Grids and Web - Iceage

OLAP Service

Figure labels: Virtual Cube; Master; Sub Cube (three instances); Slave 1; Slave 2; Slave 3; Data; Indexes; Index Service; query; answer (XML)
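The master and slave roles in the figure can be sketched as follows: each slave answers a query from its own sub-cube, and the master merges the partial answers so that the caller sees one virtual cube. The index service and the XML answer format are left out, and the class names, cell layout, and figures are assumptions made for illustration.

```python
from collections import defaultdict

class Slave:
    """Holds one sub-cube: a dict from (product, region) cells to a measure."""

    def __init__(self, sub_cube):
        self.sub_cube = sub_cube

    def query(self, product):
        """Partial answer: sales per region for one product in this sub-cube."""
        partial = defaultdict(int)
        for (prod, region), sales in self.sub_cube.items():
            if prod == product:
                partial[region] += sales
        return dict(partial)

class Master:
    """Presents the sub-cubes as one virtual cube and merges partial answers."""

    def __init__(self, slaves):
        self.slaves = slaves

    def query(self, product):
        merged = defaultdict(int)
        for slave in self.slaves:
            for region, sales in slave.query(product).items():
                merged[region] += sales
        return dict(merged)

# Hypothetical sub-cubes held by three slaves.
master = Master([
    Slave({("tv", "north"): 10, ("radio", "north"): 3}),
    Slave({("tv", "south"): 5}),
    Slave({("tv", "north"): 7, ("radio", "south"): 4}),
])
print(master.query("tv"))
```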

Page 44: Distributed Data Mining Research using Grids and Web - Iceage

Workflow Composition Approaches

Figure: workflow composition approaches

Approaches shown: Manual Composition; Automated Composition (Passive Approach and Active Approach)

Components shown: Workflow Editor; Workflow Composer; Workflow Description; Workflow Engine; Analytical Services; KB; Reasoner; Resources Monitoring

Page 45: Distributed Data Mining Research using Grids and Web - Iceage

GridMiner Workflow Composition Editor

Page 46: Distributed Data Mining Research using Grids and Web - Iceage

Towards Next-Generation Grids

Figure labels: Computational Grid; Data Grid; Data Mining Grid; Semantic Grid – 1st Generation; Current Grids; Next-Generation Grid; Evolution of the Web; Knowledge Technologies; Evolution of HPCN; Mobile Services