
CS 294-42: Project Suggestions

Ion Stoica (http://www.cs.berkeley.edu/~istoica/classes/cs294/11/)

September 14, 2011

Projects

This is a project-oriented class
Reading papers should be a means to a great project, not a goal in itself!
Strongly prefer groups of two
Perfectly fine to have the same project as in CS262
Today, I’ll present some suggestions
But you are free to come up with your own proposal
Main goal: just do a great project

Where I’m Coming From?

Key challenge: maximize the economic value of data, i.e., extract value from data while reducing costs (e.g., storage, computation)

Where I’m Coming From?

Tools to extract value from big data
  Scalability
  Response time
  Accuracy
Provide high cluster utilization for heterogeneous workloads
  Support diverse SLAs
  Predictable performance
  Isolation
  Consistency

Caveats

Cloud computing is HOT, but there is a lot of NOISE!
Not easy to
  differentiate between narrow engineering solutions and fundamental tradeoffs
  predict the importance of the problem you solve
Cloud computing is akin to a Gold Rush!

Background: Mesos

Rapid innovation in cloud computing
No single framework is optimal for all applications
Running each framework on its own dedicated cluster
  Expensive
  Hard to share data
[Figure: example frameworks, including Pregel, Cassandra, and Hypertable]
Need to run multiple frameworks on the same cluster

Background: Mesos – Where We Want to Go

Today: static partitioning (uniprogramming)
Mesos: dynamic sharing (multiprogramming)
[Figure: Hadoop, Pregel, and MPI running on a shared cluster]

Background: Mesos – Solution

Mesos is a common resource sharing layer over which diverse frameworks can run

[Figure: Hadoop, MPI, and other frameworks running on top of Mesos, which spans the cluster nodes]

Background: Workload in Datacenters

[Figure: datacenter workloads arranged by priority (high to low) and response time (interactive, low-latency at one end)]
Frontend (web servers, databases)
Decision-driven processes
Exploratory queries (e.g., Dremel)
Production jobs (e.g., compute summaries)
Analytics jobs

Datacenter OS: Resource Management, Scheduling

Hierarchical Scheduler (for Mesos)

Allow administrators to organize into groups
Provide resource guarantees per group
Share available resources (fairly) across groups
Research questions
  Abstraction (when using multiple resources)?
  How to implement using resource offers?
  What policies are compatible at different levels in the hierarchy?
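To make the "guarantee per group, then share the rest fairly" idea concrete, here is a minimal single-resource sketch for one level of the hierarchy. It is not how Mesos implements anything: the function name `allocate`, the `(guarantee, demand)` tuples, and the equal-split policy for leftover capacity are all assumptions chosen for illustration, and it assumes the guarantees sum to at most the capacity.

```python
def allocate(capacity, groups):
    """Split one resource among groups.

    groups maps a group name to (guarantee, demand). Assumes the
    guarantees sum to at most `capacity` (hypothetical policy).
    """
    # Step 1: each group receives its guarantee, capped at what it asks for.
    alloc = {g: float(min(guar, dem)) for g, (guar, dem) in groups.items()}
    left = capacity - sum(alloc.values())

    # Step 2: split what is left equally among groups that still want more,
    # repeating whenever a group needs less than its equal split.
    unmet = {g for g, (_, dem) in groups.items() if dem > alloc[g]}
    while left > 1e-9 and unmet:
        share = left / len(unmet)
        for g in sorted(unmet):
            give = min(share, groups[g][1] - alloc[g])
            alloc[g] += give
            left -= give
        unmet = {g for g in unmet if groups[g][1] - alloc[g] > 1e-9}
    return alloc


# Illustrative groups: name -> (guarantee, demand), in arbitrary units.
shares = allocate(100, {"ads": (40, 30), "search": (40, 70), "research": (0, 50)})
print(shares)
```

A multi-level hierarchy could apply the same function recursively, treating each group's allocation as the capacity of its sub-groups; whether that composes well with other policies is exactly the research question above.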

Cross Application Resource Management

An app uses many services (e.g., file systems, key-value stores, databases)

If an app has high priority and a service it uses doesn’t, the app’s SLA (Service Level Agreement) might be violated

Research questions
  Abstraction, e.g., resource delegation, priority propagation?
  Clean-slate mechanisms vs. incremental deployability
  This is also highly challenging in single-node OSes!

Resource Management using VMs

Most cluster resource managers use Linux containers (e.g., Mesos)
  Thus, schedulers assume no task migration
Research questions:
  Develop scheduler for VM environments (e.g., extend DRF)
  Tradeoffs between migration, delay, and preemption

Task Granularity Selection (Yanpei Chen)

Problem: the number of tasks per stage in today’s MapReduce apps is (highly) sub-optimal

Research question:
  Derive algorithms to pick the number of tasks to optimize various performance metrics, e.g., utilization, response time, network traffic
  subject to various constraints, e.g., capacity, network

Resource Revocation

Which task should we revoke/preempt?
Two questions
  Which slot has the least impact on the giving framework?
  Is the slot acceptable to the receiving framework?
Research questions
  Identify a feasible slot for the receiving framework with the least impact on the giving framework
  Lightweight protocol design
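The two questions above can be read as "filter by feasibility, then minimize impact". The sketch below does exactly that under strong assumptions: the `Slot` fields, the `lost_work_s` impact metric (seconds of work lost if the task is preempted), and the node allow-list are hypothetical stand-ins, not part of any real scheduler's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    node: str
    cpus: int
    mem_gb: int
    lost_work_s: float  # hypothetical impact metric for the giving framework


def pick_slot(slots, need_cpus, need_mem_gb, allowed_nodes=None):
    """Return the feasible slot with the least impact, or None if none fits."""
    # Question 2: is the slot acceptable to the receiving framework?
    feasible = [
        s for s in slots
        if s.cpus >= need_cpus
        and s.mem_gb >= need_mem_gb
        and (allowed_nodes is None or s.node in allowed_nodes)
    ]
    # Question 1: which feasible slot hurts the giving framework least?
    return min(feasible, key=lambda s: s.lost_work_s, default=None)


slots = [
    Slot("n1", cpus=2, mem_gb=4, lost_work_s=300.0),
    Slot("n2", cpus=4, mem_gb=8, lost_work_s=20.0),
    Slot("n3", cpus=1, mem_gb=2, lost_work_s=1.0),
]
victim = pick_slot(slots, need_cpus=2, need_mem_gb=4)
print(victim.node)
```

In a decentralized design neither side sees both halves of this information, which is why the protocol itself is part of the research question.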

Control Plane Consistency Model

What type of consistency is “good enough” for various control plane functions?
  File system metadata (Hadoop)
  Routing (Nicira)
  Scheduling
  Coordinated caching
  …
Research questions
  What are the trade-offs between performance and consistency?
  Develop a generic framework for the control plane

Decentralized vs. Centralized Scheduling

Decentralized schedulers
  E.g., Mesos, Hadoop 2.0
  Delegate decisions to apps (i.e., frameworks, jobs)
  Advantages: scale and separation of concerns (i.e., apps know best where and which tasks to run)
Centralized schedulers
  Know all app requirements
  Advantages: optimal
Research challenges:
  Evaluate centralized vs. decentralized schedulers
  Characterize the class of workloads for which a decentralized scheduler is good enough

Opportunistic Scheduling

Goal: schedule interactive jobs (e.g., <100ms latency)

Existing schedulers: high overhead (e.g., Mesos needs to decide on every offer)

Research challenges:
  Tradeoff between utilization and response time
  Evaluate hybrid approach

Background: Dominant Resource Fairness

Implement fair (proportional) allocation for multiple types of resources

Key properties
  Strategy-proof: users cannot get an advantage by lying about their demands
  Sharing incentive: users are incentivized to share a cluster rather than partition it
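For readers who have not seen DRF in action, the following is a minimal sketch of its allocation loop: repeatedly give one task to the user whose dominant share (largest fraction of any single resource) is smallest. The function name `drf`, the tie-breaking by user name, and the cluster sizes and per-task demands are illustrative assumptions; a real allocator also handles weights, task completion, and resource offers.

```python
def drf(capacity, demands):
    """Task counts per user under a simple DRF loop.

    capacity: total amount of each resource, e.g. [cpus, mem_gb]
    demands:  user -> per-task demand vector, same order as capacity
    """
    used = [0.0] * len(capacity)
    tasks = {u: 0 for u in demands}

    def dominant_share(u):
        # Largest fraction of any one resource currently held by user u.
        return max(tasks[u] * d / c for d, c in zip(demands[u], capacity))

    active = set(demands)
    while active:
        # Pick the user with the smallest dominant share (ties by name).
        u = min(sorted(active), key=dominant_share)
        d = demands[u]
        if all(used[i] + d[i] <= capacity[i] for i in range(len(capacity))):
            for i in range(len(capacity)):
                used[i] += d[i]
            tasks[u] += 1
        else:
            # Simplification: a user whose next task no longer fits drops out.
            active.remove(u)
    return tasks


# Illustrative cluster: 9 CPUs and 18 GB; A is memory-heavy, B is CPU-heavy.
result = drf([9, 18], {"A": (1, 4), "B": (3, 1)})
print(result)
```

With these numbers both users end up with a dominant share of 2/3: A holds 12 of 18 GB and B holds 6 of 9 CPUs.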

DRF for Non-linear Resources/Demands

DRF assumes resources & demands are additive
  E.g., task 1 needs (1 CPU, 1 GB) and task 2 needs (1 CPU, 3 GB), so both tasks together need (2 CPU, 4 GB)
Sometimes demands are non-linear
  E.g., shared memory
Sometimes resources are non-linear
  E.g., disk throughput, caches
Research challenge:
  DRF-like scheduler for non-linear resources & demands (could be two projects here!)
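A small sketch of why shared memory breaks additivity. The additive case reproduces the (2 CPU, 4 GB) example above; the shared-memory model is an assumption made up for illustration: each task's memory figure is taken to include one common region of `shared_gb` that only needs to be resident once when the tasks are co-located.

```python
def additive_demand(demands):
    """Combined demand when per-task demand vectors simply add up."""
    return tuple(sum(r) for r in zip(*demands))


def demand_with_shared_memory(demands, shared_gb):
    """Combined (cpus, mem_gb) when every task maps the same shared region.

    Hypothetical model: the shared region is counted once rather than
    once per task, so the combined memory is below the additive sum.
    """
    cpus, mem = additive_demand(demands)
    return (cpus, mem - shared_gb * (len(demands) - 1))


tasks = [(1, 1), (1, 3)]  # (CPUs, GB) for task 1 and task 2
print(additive_demand(tasks))
print(demand_with_shared_memory(tasks, shared_gb=1))
```

Under this model the dominant share of a user depends on which tasks are placed together, which is one reason plain DRF no longer applies directly.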

DRF for OSes

DRF was designed for clusters using the resource offer mechanism

Redesign DRF to support multi-core OSes

Research questions:
  Is the resource offer the best abstraction?
  How to best leverage preemption? (in Mesos, tasks are not preempted by default)
  How to support gang scheduling?

Storage & Data Processing

Resource Isolation for Storage Services

Share storage (e.g., key-value store) between
  Frontend, e.g., web services
  Backend, e.g., analytics on freshest data

Research challenge
  Isolation mechanism: protect front-end performance from back-end workload

“Quicksilver” DB

Goal: interactive queries with bounded error on “unbounded” data
  Trade between efficiency and accuracy
  Query response time target: < 100ms

Approach: random pre-sampling across different dimensions (columns)

Research question: given a query and an error bound, find
  Smallest sample to compute result
  Sample minimizing disk (or memory) access times
(Talk with Sameer, if interested)
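As a rough feel for "smallest sample for a given error bound", here is the textbook normal-approximation estimate for an AVG query over a uniform random sample: n ≥ (z·σ/ε)². This is a standard statistics formula swapped in for illustration, not the project's method; the function name `sample_size` and the assumption that the column's standard deviation is known in advance are mine.

```python
import math
from statistics import NormalDist


def sample_size(std_dev, error_bound, confidence=0.95):
    """Rows needed so the sample mean is within error_bound of the true mean.

    Assumes a uniform random sample, a known column standard deviation,
    and that the normal approximation holds (hypothetical setting).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * std_dev / error_bound) ** 2)


# Halving the error bound roughly quadruples the sample.
print(sample_size(std_dev=10.0, error_bound=1.0))
print(sample_size(std_dev=10.0, error_bound=0.5))
```

Note that the required sample size does not depend on the table size, which is what makes bounded-error answers on "unbounded" data plausible for this class of query.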

Split-Privacy DB (1/2)

Partition data & computation
  Private
  Public (stored on cloud)

Goal: use cloud without revealing the computation result

Example: operation f(x, y) = x + y, where
  x: private
  y: public

Pick a random number a, and
  compute x’ = x + a
  compute f(x’, y) = r’ = x’ + y
  recover result: r = r’ – a = (x’ – a) + y = x + y

[Figure: private DB (f_private) and public DB (f_public) combine to produce the result]
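The masking example above can be sketched in a few lines. The split into `mask`, `cloud_add`, and `unmask` and the use of arithmetic modulo 2^64 are assumptions added for illustration (working modulo a fixed value makes the masked input uniformly distributed, which plain integer addition would not); only the x’ = x + a, r = r’ – a structure comes from the slide.

```python
import secrets

MODULUS = 2**64  # hypothetical: all values are treated as integers mod 2^64


def mask(x):
    """Private side: hide x behind a fresh random offset a."""
    a = secrets.randbelow(MODULUS)
    return (x + a) % MODULUS, a


def cloud_add(x_masked, y):
    """Public side (cloud): computes f(x', y) = x' + y, never seeing x or r."""
    return (x_masked + y) % MODULUS


def unmask(r_masked, a):
    """Private side: recover r = r' - a = x + y."""
    return (r_masked - a) % MODULUS


x, y = 42, 100                    # x is private, y is public
x_masked, a = mask(x)
r_masked = cloud_add(x_masked, y)
r = unmask(r_masked, a)
print(r)
```

The cloud only ever sees `x_masked` and `r_masked`, both of which are independent of x as long as `a` stays private and is used once.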

Split-Privacy DB (2/2)

Partition data & computation
  Private
  Public (stored on cloud)
  Example: patient data (private), public clinical and genomics data sets
Goal: use cloud without revealing the computation result
Research questions:
  What types of computation can be implemented?
  Any more powerful than privacy-preserving computation / data mining?

[Figure: private DB (f_private) and public DB (f_public) combine to produce the result]

RDDs as an OS Abstraction

Resilient Distributed Datasets (RDDs)
  Fault-tolerant (in-memory) parallel data structures
  Allow Spark apps to efficiently reuse data

Design cross-application RDDs

Research questions
  RDD reconstruction (track software and platform changes)
  Enable users to share intermediate results of queries (identify when two apps compute the same RDD)
  RDD cluster-wide caching
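One naive way to "identify when two apps compute the same RDD" is to fingerprint the lineage (input plus the ordered list of transformations) and use it as a cache key. The sketch below is a toy under that assumption: `lineage_id`, `SharedRDDCache`, and the string descriptions of operations are hypothetical, this is not Spark's API, and textual equality will miss computations that are equivalent but written differently.

```python
import hashlib


def lineage_id(source, ops):
    """Fingerprint a dataset by its input and its ordered transformations."""
    h = hashlib.sha256(source.encode("utf-8"))
    for op in ops:
        # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
        h.update(b"\x00" + op.encode("utf-8"))
    return h.hexdigest()


class SharedRDDCache:
    """Toy cluster-wide cache keyed by lineage fingerprint."""

    def __init__(self):
        self._store = {}
        self.computations = 0

    def get_or_compute(self, source, ops, compute):
        key = lineage_id(source, ops)
        if key not in self._store:
            self._store[key] = compute()
            self.computations += 1
        return self._store[key]


cache = SharedRDDCache()
ops = ["filter(status == 500)", "map(url)"]
# Two different apps issue the same logical computation.
app1 = cache.get_or_compute("hdfs://logs/2011-09-14", ops, lambda: ["/a", "/b"])
app2 = cache.get_or_compute("hdfs://logs/2011-09-14", ops, lambda: ["/a", "/b"])
print(cache.computations)
```

Tracking software and platform changes would mean folding code and library versions into the fingerprint as well, which is where the reconstruction question comes in.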

Provenance-based Efficient Storage (Peter B and Patrick W)

Reduce storage by deleting data that can be recreated
  Generalization of previous project

Research challenges:
  Identify data that can be deterministically recreated and the code to do so
    Use hints?
  Tradeoff between re-creation and storage
    May take into account access pattern, frequency, performance
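The re-creation vs. storage tradeoff can be illustrated with a deliberately simple cost comparison: delete a dataset when the expected daily cost of recomputing it on access is lower than the daily cost of keeping it. The cost model, parameter names, and numbers below are all hypothetical; a real policy would also weigh access latency and the risk that inputs or code become unavailable.

```python
def should_delete(size_gb, storage_cost_per_gb_day, recompute_cost, accesses_per_day):
    """True if recreating on demand is cheaper than storing (toy cost model).

    All costs are in the same arbitrary unit; recompute_cost is the cost of
    one deterministic re-creation of the dataset.
    """
    keep_cost_per_day = size_gb * storage_cost_per_gb_day
    recreate_cost_per_day = accesses_per_day * recompute_cost
    return recreate_cost_per_day < keep_cost_per_day


# Rarely read data: cheaper to recreate when needed.
print(should_delete(size_gb=100, storage_cost_per_gb_day=0.01,
                    recompute_cost=5.0, accesses_per_day=0.1))
# Frequently read data: cheaper to keep.
print(should_delete(size_gb=100, storage_cost_per_gb_day=0.01,
                    recompute_cost=5.0, accesses_per_day=1.0))
```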

Very-low Latency Streaming

Challenge: stragglers, failures
Approaches to reduce latency:
  Redundant computations
  Speculative execution

Research questions
  Theoretical trade-off between response time and accuracy?
  Achieve target latency and accuracy, while minimizing the overhead
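A back-of-the-envelope sketch of how the two approaches differ, assuming a task's result is usable as soon as any copy finishes. The function names, the timeout parameter, and the millisecond figures are illustrative assumptions, not measurements.

```python
def redundant_latency(latencies_ms):
    """Redundant computation: launch every copy at once, keep the first result."""
    return min(latencies_ms)


def speculative_latency(primary_ms, backup_ms, timeout_ms):
    """Speculative execution: start a backup only if the primary is slow.

    The backup is launched at timeout_ms, so it finishes at
    timeout_ms + backup_ms; the task completes when either copy does.
    """
    if primary_ms <= timeout_ms:
        return primary_ms
    return min(primary_ms, timeout_ms + backup_ms)


print(redundant_latency([120, 40, 55]))                                # three copies, one straggler
print(speculative_latency(primary_ms=30, backup_ms=50, timeout_ms=60))   # no backup needed
print(speculative_latency(primary_ms=500, backup_ms=50, timeout_ms=60))  # backup rescues a straggler
```

Redundancy always pays the cost of every copy, while speculation pays extra only for stragglers but adds the timeout to their latency; that gap is the overhead the research question asks to minimize.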
