integritymr: integrity assurance framework for biggy data … · experiment setupexperiment setup...

29
IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management Applications Applications Yongzhi Wang, Jinpeng Wei Florida International University Mudhakar Srivatsa IBM T.J. Watson Research Center Yucong Duan, Wencai Du Hainan University

Upload: others

Post on 14-Mar-2021

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

IntegrityMR: Integrity Assurance Framework for Big Data Analytics g y

and Management ApplicationsApplications

Yongzhi Wang, Jinpeng Wei Florida International UniversityMudhakar Srivatsa IBM T.J. Watson Research Center

Yucong Duan, Wencai Du Hainan University

Page 2: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

AgendaAgendaProblem Statement

MapReduce Task Layer Solution

l lApplication Layer Solution

Conclusion & Future Work

Page 3: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

AgendaAgendaProblem Statement

MapReduce Task Layer Solution

l lApplication Layer Solution

Conclusion & Future Work

Page 4: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Big Data Analytics & CloudBig Data Analytics & Cloud

Privacy?Privacy? Integrity?Integrity?Security?

Page 5: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Security ProblemSecurity Problem

How do we construct big dataHow do we construct big data analytics infrastructure on cloud that can provide high integritycan provide high integrity assurance?

Page 6: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Big Data InfrastructureBig Data Infrastructure

Application Layer Integrity

MapReduce Task Layer Integrity

Storage Integrity:[5] [6]

Page 7: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

AgendaAgendaProblem Statement

MapReduce Task Layer Solution

l lApplication Layer Solution

Conclusion & Future Work

Page 8: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Related WorksRelated WorksWei Wei, Juan Du, Ting Yu, Xiaohui Gu, “SecureMR: A

fService Integrity Assurance Framework for MapReduce”, in Proceedings of the 2009 Annual Computer Applications Conference.(ACSAC2009)

Yongzhi Wang, Jinpeng Wei, “VIAF: Verification-basedIntegrity Assurance Framework for MapReduce”, in the 4thIEEE International Conference on Cloud Computing4 IEEE International Conference on Cloud Computing (CLOUD 2011).

Yongzhi Wang, Jinpeng Wei, Mudhakar Srivatsa,“ResultI i Ch k f M R d C i H b idIntegrity Check for MapReduce Computation on Hybrid Clouds” in the 6th IEEE International Conference on Cloud Computing (CLOUD 2013).

Page 9: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

ArchitectureArchitectureInternet

slave*workers*&DFSslave*workers*&DFS

mastermaster

verifiersverifiers

private*cloudpublic*cloud

C l d M R dCross-cloud MapReduce (IEEE CLOUD 2013)

Internet

slave*workers

master ……

public*cloud*1

verifiers

private*cloud

slave*workers

public*cloud*N

IntegrityMR

Page 10: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Architecture DesignArchitecture DesignTrusted private cloud + Untrusted public cloudsp p

Trusted private cloudMaster controls the computationMaster controls the computation.Verifier offers the trusted result verification.

Untrusted public cloudsUntrusted public cloudsOffers the computation capacity.Multiple clouds raise the bar for the attackerp

Page 11: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Control FlowControl FlowHistory‐

2 R1

History‐Cache‐of‐W1

R1 R2Arbitrate‐

N

1

R1‐=‐R2task

No

YesAdd‐to

Task‐Queue

W1

Reschedule‐untrusted‐

Cloud‐AVerify‐task

34

Yes Blacklist‐t2

Queue

t1t3...

R2

untrustedtasks

(with‐prob.‐v)

Pass‐Verify? No

W1.Credit++

Yes

W1.Credit>NRelease‐W1‐result‐to‐reducer

W2

Cloud‐B

Yes

Replication Verification Credit-based Management

Page 12: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Experiment setupExperiment setupEnvironment

Private cloud: a local Linux server (2.93GHz, 8-core Intel Xeon CPU, 16GB Ram)Ram)

Public clouds: 6 Microsoft Azure extra small instances (1core @1GHz, 768MB Ram)768MB Ram)6 Amazon EC2 small instances (1ECU, 1core, 1.7GB).

ApplicationApplicationWord count (100 map task) for accuracy testMahout 20 Newsgroup Classification for performance test

Page 13: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Metrics of Accuracy and O h dOverhead

Error rate: The percentage of incorrect map task p g presults accepted by the master in one job execution.

Worker overhead: The percentage of extra number p gof map tasks executed on the workers on public cloud in one job execution.

Verifier overhead: The percentage of map tasks executed by the verifiers on the private cloud in one job executionjob execution.

Page 14: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

AccuracyAccuracyError Rate vs Credit Threshold

1012

n=0.15,p=0.1n=0.5,p=0.1n=0.3,p=0.5

Error rate: The percentage of incorrect map task results

8

RAT

E

n 0.3,p 0.5n=0.3,p=1.0(%) accepted by the master

in one job execution.

46

ER

RO

R

02 n: malicious node ratio

p: cheat probabilityN: credit threshold

2 4 6 8

CREDIT THRESHOLD N

N: credit threshold

Page 15: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Overhead and Verifier Overhead

Verifier Overhead vs Credit ThresholdWorker Overhead vs Credit Threshold

2530

D

Verifier Overhead vs Credit Threshold

512

0

D

Worker Overhead vs Credit Threshold

n=0.15,p=0.1n=0.5,p=0.1n=0.3,p=0.5n=0.3,p=1.0

(%) (%)

1520

ER

OVE

RH

EA

D

110

11

ER

OV

ER

HE

AD

510

VE

RIF

IE

n=0.15,p=0.1n=0.5,p=0.1n=0.3,p=0.5n=0.3,p=1.0

105WO

RK

E

2 4 6 80

CREDIT THRESHOLD N2 4 6 8

100

CREDIT THRESHOLD N

li i d in: malicious node ratiop: cheat probabilityN: credit threshold

Page 16: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Execution time

140016001800

Trainning Time(s)

Mahout 20 news group Classification

60080010001200

Trainning Time(s)

Testing Time(s) N = 5v = 0.15Exec. Delay:

0200400600

ec. e ay:P-A compared to A-O: 145% and 177%P E A compared0

Private‐EC2‐Azure withIntegrityMR

Private‐Azurewith traditionalMapReduce

Azure‐only withtraditionalMapReduce

P-E-A compared to P-A: 18% and 82%

Name Environment Composition Cloud Map Reduce

Private-EC2-Azure

Linux server on Private Cloud, 6 small instances on EC2, 6 extra small instances on Azure

Cross Cloud IntegrityMR

Private-Azure

Linux server on Private Cloud, 6 extra small instances on Azure.

Cross Cloud Map Reduce

Azure-only 6 extra small instances on Azure Inside Cloud Map Reduce

Page 17: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

AgendaAgendaProblem Statement

MapReduce Task Layer Solution

l lApplication Layer Solution

Conclusion & Future Work

Page 18: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Big Data InfrastructureBig Data Infrastructure

Application Layer Integrity

MapReduce Task Layer Integrity

Storage Integrity:[5] [6]

Page 19: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Apache PigApache Pig

Pig Script

Logical Plan

Physical Plan

MapReduce Job Plan

Page 20: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

How Pig WorksHow Pig Works

Page 21: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

IntuitionIntuition

Transform the script so that to change the planSplit the map task into two/more different tasksSplit the map task into two/more different tasks.The output of different map tasks, although different, should obey the constructed invariant.h d k i f d h k h i iThe reduce task is transformed to check the invariant.

Page 22: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

T f i E lTransformation ExampleScript 1: GROUP data in houred txt by hour S lit th t k i t-- Script 1: GROUP data in houred.txt by hour

raw_data = LOAD './houred.txt' USING PigStorage('\t') AS (user, hour, query);

result = GROUP raw_data BY hour;d lt

Split the map task into two/more different tasks, The output of different

map tasks, although dump result; p , gdifferent, should obey

the constructed invariant.

-- Script 2: invariant check is enforced register ./tutorial.jar;raw_data = LOAD './houred.txt' USING PigStorage('\t')

AS ( h ) AS (user, hour, query);part1 = FILTER raw_data BY hour>=12;part2 = FILTER raw_data BY hour<=12;result = COGRUP part1 BY hour, part2 BY hour;group result=FOREACH result GENERATE

The reduce task is transformed to group_result FOREACH result GENERATE

group, org.apache.pig.tutorial.CheckInvariant($1,$2); check the invariant

Page 23: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Plan TransformationPlan Transformation

Split the map task.The output of different p

map tasks obey the constructed invariant.

checkIntegrity (key, tuple1, tuple2){

If(key != 12) return true;

else if(tuple 1 == tuple 2) return true;return true;

else return false;

}

Page 24: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Security Argumenty g

• Check is performed on d hi h ireduce, which is

executed by a trusted worker. The check logic cannot be leaked to thecannot be leaked to the mapper.

• The map/reduce task can be obfuscated to hid th i i thide the invariant.

Page 25: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Performance evaluationPerformance evaluation

3 virtual machines in local cluster:

• 1 as master and trusted worker.

• 2 as untrusted• 2 as untrusted workers.

0-35% of slow down

Page 26: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

AgendaAgendaProblem Statement

MapReduce Task Layer Solution

l lApplication Layer Solution

Conclusion & Future Work

Page 27: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

ConclusionConclusionIntegrityMR explores Big Data analytic integrity g y p g y g yfrom two alternative layers

Task layer: Trusted private cloud + untrusted multiple public cloudsTrusted private cloud + untrusted multiple public clouds architecture.Replication, verification, credit-based management.Experiment result: high integrity with non-negligibleExperiment result: high integrity with non negligible overhead

Application layer(Apache Pig):Transform original script to introduce invariant in theTransform original script to introduce invariant in the map tasksCheck the invariant in the reduce taskPractice the idea by manually transform the scriptPractice the idea by manually transform the script.

Page 28: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,

Future WorksFuture WorksMapReduce task layerp y

Improve system performance by reducing cross-cloud communication and alleviate the DFS bottle neckneck.

Application layerAutomating pig script transformationAutomating pig script transformation

Page 29: IntegrityMR: Integrity Assurance Framework for Biggy Data … · Experiment setupExperiment setup yEnvironment yPrivate cloud: ya local Linux server (2.93GHz, 8-core Intel Xeon CPU,