


CS 294-110: Project Suggestions

Ion Stoica and Ali Ghodsi (http://www.cs.berkeley.edu/~istoica/classes/cs294/15/)

September 14, 2015


Projects

•  This is a project-oriented class
•  Reading papers should be a means to a great project, not a goal in itself!
•  Strongly prefer groups of two students
•  Today, I’ll present some suggestions
    •  But, you are free to come up with your own proposal
•  Main goal: just do a great project


Projects

•  Many projects around Spark
    •  Local expertise
    •  Great platform to disseminate your work
•  Short review based on log mining example to provide context


Spark Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")                      # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()           # Action
    messages.filter(lambda s: "php" in s).count()

[Diagram: one Driver and three Workers; each Worker holds one HDFS block (Block 1, Block 2, Block 3) and one cache (Cache 1, Cache 2, Cache 3)]

•  First query ("mysql"):
    •  Driver sends tasks to the Workers
    •  Each Worker reads its HDFS block
    •  Each Worker processes & caches the data
    •  Workers return results to the Driver
•  Second query ("php"):
    •  Driver sends tasks to the Workers
    •  Each Worker processes from cache
    •  Workers return results to the Driver

Cache your data → Faster Results

Full-text search of Wikipedia
•  60GB on 20 EC2 machines
•  0.5 sec from mem vs. 20s for on-disk
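The log mining pipeline is lazy: nothing is read until the first action, and cache() keeps the computed messages around for later queries. A minimal single-process sketch of that behavior in plain Python follows. MiniRDD, its computations counter, and the sample log lines are made up for illustration; this is a toy stand-in, not the Spark API, and it assumes tab-separated log lines with the message in the third field.

```python
class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning an iterator
        self._use_cache = False
        self._cached = None        # filled on first evaluation after cache()
        self.computations = 0      # how many times this dataset was evaluated

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                self.computations += 1
                self._cached = list(self._compute())
            return iter(self._cached)
        self.computations += 1
        return self._compute()

    def filter(self, pred):
        # Transformation: only records how to compute the result.
        return MiniRDD(lambda: (x for x in self._iterate() if pred(x)))

    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._iterate()))

    def cache(self):
        self._use_cache = True
        return self

    def count(self):
        # Action: triggers evaluation of the whole chain.
        return sum(1 for _ in self._iterate())


# Made-up sample log: level, date, message separated by tabs.
log = [
    "ERROR\t2015-09-14\tmysql connection refused",
    "INFO\t2015-09-14\tjob started",
    "ERROR\t2015-09-14\tphp fatal error",
    "ERROR\t2015-09-14\tmysql timeout",
]

lines = MiniRDD(lambda: iter(log))
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

mysql_count = messages.filter(lambda s: "mysql" in s).count()
php_count = messages.filter(lambda s: "php" in s).count()
print(mysql_count, php_count, lines.computations)
```

The second query is answered from the cached messages, so the base dataset is evaluated only once across both queries.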


Spark

[Diagram: the Spark stack. Labels: Data Sources ({JSON}), Spark Core, DataFrames, ML Pipelines, Spark Streaming, Spark SQL, MLlib, GraphX, ?]


Pipeline Shuffle

•  Problem
    •  Right now shuffle senders write data on storage after which the data is shuffled to receivers
    •  Shuffle often most expensive communication pattern, sometimes dominates job comp. time
•  Project
    •  Start sending shuffle data as it is being produced
•  Challenge
    •  How do you do recovery and speculation?
    •  Could store data as being sent, but still not easy…
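To make the difference concrete, here is a rough single-process sketch in plain Python that contrasts a staged shuffle (all output is materialized before anything is delivered) with a pipelined one (each record is forwarded as soon as it is produced). The function names, the event log, and the CRC-based partitioner are assumptions made for illustration, not Spark internals, and the sketch ignores recovery and speculation entirely.

```python
import zlib


def partition_of(key, num_receivers):
    """Deterministic partitioner (illustrative choice: CRC32 of the key)."""
    return zlib.crc32(key.encode("utf-8")) % num_receivers


def produce(words, events):
    """Sender side: emit records one at a time, logging each production."""
    for w in words:
        events.append(("produce", w))
        yield w


def staged_shuffle(words, num_receivers, events):
    """Materialize all sender output first, then deliver it to receivers."""
    staged = list(produce(words, events))    # stands in for "write to storage"
    receivers = [[] for _ in range(num_receivers)]
    for w in staged:
        events.append(("receive", w))
        receivers[partition_of(w, num_receivers)].append(w)
    return receivers


def pipelined_shuffle(words, num_receivers, events):
    """Forward each record to its receiver as soon as it is produced."""
    receivers = [[] for _ in range(num_receivers)]
    for w in produce(words, events):
        events.append(("receive", w))
        receivers[partition_of(w, num_receivers)].append(w)
    return receivers


staged_events, piped_events = [], []
staged_out = staged_shuffle(["a", "b", "c"], 2, staged_events)
piped_out = pipelined_shuffle(["a", "b", "c"], 2, piped_events)
print([kind for kind, _ in staged_events])
print([kind for kind, _ in piped_events])
```

Both variants deliver the same records to the same receivers; only the order of produce and receive events differs. The hard part named in the challenge is what to do when a sender fails after part of its output has already been consumed.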


Fault Tolerance & Perf. Tradeoffs

•  Problem:
    •  Maintaining lineage in Spark provides fault recovery, but comes at performance cost
        •  E.g., hard to support super small tasks due to lineage overhead
•  Project:
    •  Evaluate how much you can speed up Spark by ignoring fault tolerance
    •  Can generalize to other cluster computing engines
•  Challenge
    •  What do you do for large jobs, how do you treat stragglers?
    •  Maybe a hybrid method, i.e., just don’t do lineage for small jobs? Need to figure out when a job is small…


(Eliminating) Scheduling Overhead

•  Problem: with Spark, driver schedules every task
    •  Latency 100s of ms or higher; cannot run ms queries
    •  Driver can become a bottleneck
•  Project:
    •  Have workers perform scheduling
•  Challenge:
    •  How do you handle faults?
    •  Maybe some hybrid solution across driver and workers?


Cost-based Optimization in SparkSQL

•  Problem:
    •  Spark employs a rule-based Query Planner (Catalyst)
    •  Limited optimization opportunities especially when operator performance varies widely based on input data
        •  E.g., join and selection on skewed data
•  Project: cost-based optimizer
    •  Estimate operators’ costs, and use these costs to compute the query plan
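As a rough illustration of "estimate costs, then pick the plan", the sketch below enumerates left-deep join orders and keeps the one with the smallest estimated intermediate results. The cost model (sum of intermediate row counts), the table names, the row counts, and the selectivities are all hypothetical; this is plain Python and has nothing to do with Catalyst's actual interfaces.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Estimated cost of a left-deep join order.

    Cost is the total number of rows in all intermediate results.
    A missing selectivity entry means no join predicate (cross product).
    """
    joined = [order[0]]
    size = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        sel = 1.0
        for other in joined:
            sel *= selectivity.get(frozenset((other, table)), 1.0)
        size = size * rows[table] * sel
        cost += size
        joined.append(table)
    return cost


def best_join_order(rows, selectivity):
    """Exhaustively pick the cheapest order (fine for a handful of tables)."""
    return min(permutations(rows), key=lambda o: plan_cost(o, rows, selectivity))


# Hypothetical statistics for three tables.
rows = {"users": 1000, "orders": 100000, "countries": 200}
selectivity = {
    frozenset(("users", "orders")): 1 / 1000,
    frozenset(("users", "countries")): 1 / 200,
}

best = best_join_order(rows, selectivity)
naive = ("users", "orders", "countries")
print(best, plan_cost(best, rows, selectivity), plan_cost(naive, rows, selectivity))
```

With these made-up numbers the cheapest orders join the two small tables first and bring in the large table last. On skewed data the selectivity estimates themselves become the difficult part, which is where the project gets interesting.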


Streaming Graph Processing

•  Problem:
    •  With GraphX, queries can be fast but updates are typically in batches (slow)
•  Project:
    •  Incrementally update graphs
    •  Support window based graph queries
•  Note:
    •  Discuss with Anand Iyer and Ankur Dave if interested
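One way to picture incremental updates combined with window-based queries is a structure that adjusts vertex degrees per edge arrival and expires edges that fall out of a sliding time window. The sketch below is a single-machine toy in plain Python; WindowedDegrees and its methods are hypothetical names, it is not GraphX, and it assumes edges arrive with non-decreasing timestamps.

```python
from collections import Counter, deque


class WindowedDegrees:
    """Vertex degrees over a sliding time window, updated one edge at a time."""

    def __init__(self, window):
        self.window = window
        self.edges = deque()     # (timestamp, src, dst), oldest first
        self.degree = Counter()

    def add_edge(self, ts, src, dst):
        """Insert one edge, then drop edges that left the window."""
        self.edges.append((ts, src, dst))
        self.degree[src] += 1
        self.degree[dst] += 1
        self._expire(ts)

    def _expire(self, now):
        # An edge is expired once its timestamp is <= now - window.
        while self.edges and self.edges[0][0] <= now - self.window:
            _, src, dst = self.edges.popleft()
            for v in (src, dst):
                self.degree[v] -= 1
                if self.degree[v] == 0:
                    del self.degree[v]


g = WindowedDegrees(window=10)
g.add_edge(0, "a", "b")
g.add_edge(5, "a", "c")
g.add_edge(12, "b", "c")    # the edge from time 0 falls out of the window
print(dict(g.degree))
```

Each update touches only the endpoints of the arriving and expiring edges, rather than recomputing degrees over a whole batch.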


Streaming ML

•  Problem:
    •  Today ML algorithms typically performed on static data
    •  Cannot update model in real-time
•  Project:
    •  Develop on-line ML algorithms that update the model continuously as new data is streamed
•  Notes:
    •  Also contact Joey Gonzalez if interested
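A minimal example of a model that is updated continuously is linear regression trained by stochastic gradient descent, one observation at a time. The sketch below is plain Python rather than MLlib; the class name, the learning rate, and the synthetic stream (points on the line y = 2x + 1) are assumptions chosen only to keep the example small.

```python
class OnlineLinearRegression:
    """Linear model updated with one SGD step per incoming observation."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = [0.0] * num_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        """Take one gradient step on the squared error of (x, y)."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err


def stream(n):
    """Synthetic stream: noiseless points on y = 2x + 1 (made up for the demo)."""
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]
    for i in range(n):
        x = xs[i % len(xs)]
        yield [x], 2.0 * x + 1.0


model = OnlineLinearRegression(num_features=1)
for x, y in stream(5000):
    model.update(x, y)    # the model is usable for predictions after every step

print(model.w, model.b)
```

The model never sees the data set as a whole; each update uses only the newest observation, so predictions can be served between updates.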


Beyond JVM: Using Non-Java Libraries

•  Problem:
    •  Spark tasks are executed within JVMs
    •  Limits performance and use of popular non-Java libraries
•  Project:
    •  General way to add support for non-Java libraries
    •  Example: use JNI to call arbitrary libraries
•  Challenges:
    •  Define interface, shared data formats, etc.
•  Notes
    •  Contact Guanhua and Shivaram, if needed


Beyond JVM: Dynamic Code Generation

•  Problem:
    •  Spark tasks are executed within JVMs
    •  Limits performance and use of popular non-Java libraries
•  Project:
    •  Generate non-Java code, e.g., C++, CUDA for GPUs
•  Challenges:
    •  API and shared data format
•  Notes
    •  Contact Guanhua and Shivaram, if needed


Beyond JVM: Resource Management and Scheduling

•  Problem
    •  Need to schedule processes hosting non-Java code
    •  GPU cannot be invoked by more than one process
•  Project:
    •  Develop scheduling, and resource management algorithms
•  Challenge:
    •  Preserve fault tolerance, straggler mitigation
•  Notes
    •  Contact Guanhua and Shivaram, if needed


Time Series for DataFrames

•  Inspired by Pandas and R DataFrames, Spark recently introduced DataFrames
•  Problem
    •  Spark DataFrames don’t support time series
•  Project:
    •  Develop and contribute distributed time series operations for DataFrames
•  Challenge:
    •  Spark doesn’t have indexes
    •  http://pandas.pydata.org/pandas-docs/stable/timeseries.html
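Some time series operations can be phrased without an index. As one hedged example, downsampling can be written as a group-by on a derived bucket key, which is the kind of key-based aggregation a distributed engine already handles. The sketch below is plain Python on (timestamp, value) pairs; resample_mean, bucket_seconds, and the sample rows are hypothetical, and it is not a DataFrame API.

```python
from collections import defaultdict


def resample_mean(rows, bucket_seconds):
    """Downsample (timestamp, value) rows to the mean value per time bucket.

    Timestamps are integer seconds. The bucket key is the start of the bucket,
    so the aggregation is an ordinary group-by and needs no index.
    """
    sums = defaultdict(lambda: [0.0, 0])    # bucket -> [sum, count]
    for ts, value in rows:
        bucket = ts - ts % bucket_seconds
        acc = sums[bucket]
        acc[0] += value
        acc[1] += 1
    return {bucket: total / n for bucket, (total, n) in sorted(sums.items())}


# Made-up sample: (seconds since start, reading)
rows = [(0, 1.0), (30, 3.0), (60, 10.0), (119, 20.0), (120, 5.0)]
per_minute = resample_mean(rows, bucket_seconds=60)
print(per_minute)
```

Operations that depend on row order across bucket boundaries, such as rolling windows or shifts, do not reduce to a plain group-by this easily, which is where the missing index starts to matter.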


ACID transactions to Spark SQL

•  Problem
    •  Spark SQL is used for Analytics and doesn’t support ACID
•  Project:
    •  Develop and add row-level ACID tx on top of Spark SQL
•  Challenge:
    •  Challenging to provide transactions and analytics in one system
    •  https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions


Typed Data Frames

•  Problem
    •  DataFrames in Spark, unlike Spark RDDs, do not provide type safety
•  Project:
    •  Develop a typed DataFrame framework for Spark
•  Challenge:
    •  SQL-like operations are inherently dynamic (e.g., filter("col")) and make it hard to have static typing unless fancy reflection mechanisms are used


General pipelines for Spark

•  Problem
    •  Spark.ml provides a pipeline abstraction for ML, generalize it to cover all of Spark
•  Project:
    •  Develop a pipeline abstraction (similar to ML pipelines) that spans all of Spark, allowing users to perform SQL operations, GraphX operations, etc.


Beyond BSP

•  Problem
    •  With BSP each worker executes the same code
•  Project
    •  Can we extend Spark (or other cluster computing framework) to support non-BSP computation?
    •  How much better than emulating everything with BSP?
•  Challenge
    •  Maintain simple APIs
    •  More complex scheduling, communication patterns


Project idea: cryptography & big data (Alessandro Chiesa)

As data and computations scale up to larger sizes…

… can cryptography follow?

One direction: zero knowledge proofs for big data


Classical setting: zero knowledge proofs on 1 machine

[Diagram: a server sends "result & ZK proof" to a client; add crypto magic: the server generates the ZK proof and the client verifies the ZK proof]

Server: Here is the result of your computation.
Client: I don’t believe you.
Server: I don’t want to give you my private data.
Client: Send me a ZK proof of correctness?


New setting for big data: zero knowledge proofs on clusters

[Diagram: a cluster sends "result & ZK proof" to a client; the cluster generates the ZK proof and the client verifies the ZK proof]

Problem: cannot generate ZK proof on 1 machine (as before)

Challenge: generate the ZK proof over a cluster (e.g., using Spark)

End goal: “scaling up” ZK proofs to computations on big data & explore security applications!


Succinct (quick overview)

•  Queries on compressed data
•  Basic operations:
    •  Search: given a substring “s” return offsets of all occurrences of “s” within the input
    •  Extract: given an offset “o” and a length “l” uncompress and return “l” bytes from original file starting at “o”
    •  Count: given a substring “s” return the number of occurrences of “s” within the input
•  Can implement key-value store on top of it
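To pin down what the three operations return, here is a naive reference version in plain Python that works on uncompressed bytes. It only illustrates the semantics; Succinct answers the same queries directly on a compressed representation, which this sketch does not attempt. The function names mirror the slide, and the choice to report overlapping occurrences is an assumption.

```python
def search(data: bytes, s: bytes) -> list:
    """Offsets of all occurrences of s in data (overlapping ones included)."""
    offsets = []
    i = data.find(s)
    while i != -1:
        offsets.append(i)
        i = data.find(s, i + 1)    # advance by one so overlaps are found
    return offsets


def extract(data: bytes, o: int, l: int) -> bytes:
    """Return l bytes of the original input starting at offset o."""
    return data[o:o + l]


def count(data: bytes, s: bytes) -> int:
    """Number of occurrences of s in data."""
    return len(search(data, s))


data = b"abracadabra"
print(search(data, b"abra"))
print(extract(data, 3, 4))
print(count(data, b"a"))
```

A key-value store can be layered on these operations, for example by searching for a delimited key and extracting the bytes that follow it.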


Succinct: Efficient Point Query Support

•  Problem:
    •  Spark implementation: expensive, as always queries all workers
•  Project:
    •  Implement Succinct on top of Tachyon (storage layer)
    •  Provide efficient key-value store lookups, i.e., lookup a single worker if key is there
•  Note:
    •  Contact Anurag and Rachit, if interested
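One common way to make a point lookup touch a single worker is to partition keys deterministically, so the client can compute which worker owns a key. The sketch below shows that idea in plain Python; PartitionedKV, the CRC-based partitioner, and the lookup counters are hypothetical, and the sketch says nothing about how Succinct or Tachyon actually place data.

```python
import zlib


class PartitionedKV:
    """Key-value store split across workers by a hash of the key."""

    def __init__(self, num_workers):
        self.workers = [dict() for _ in range(num_workers)]
        self.lookups = [0] * num_workers    # lookups served per worker

    def _owner(self, key):
        # Deterministic partitioner (illustrative choice: CRC32 of the key).
        return zlib.crc32(key.encode("utf-8")) % len(self.workers)

    def put(self, key, value):
        self.workers[self._owner(key)][key] = value

    def get(self, key):
        """Ask only the worker that owns the key."""
        w = self._owner(key)
        self.lookups[w] += 1
        return self.workers[w].get(key)


store = PartitionedKV(num_workers=4)
for i in range(100):
    store.put("key%d" % i, i)

value = store.get("key42")
missing = store.get("no-such-key")
print(value, missing, sum(store.lookups))
```

Each get contacts exactly one worker, whether or not the key exists, instead of broadcasting the query to all of them.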


Succinct: External Memory Support

•  Problem:
    •  Some data increases faster than main memory
    •  Need to execute queries on external storage (e.g., SSDs)
•  Project:
    •  Design & implement compressed data structures for efficient external memory execution
    •  A lot of work in theory community, that could be exploited
•  Note:
    •  Contact Anurag and Rachit, if interested


Succinct: Updates

•  Problem:
    •  Current systems use a multi-store architecture
    •  Expensive to update compressed representation
•  Project:
    •  Develop a low overhead update solution with minimal impact on memory overhead and query performance
    •  Start from multi-store architecture (see NSDI paper)
•  Note:
    •  Contact Anurag and Rachit, if interested


Succinct: SQL

•  Problem:
    •  Arbitrary sub-string search powerful but not as many workloads
•  Project:
    •  Support SQL on top of Succinct
    •  Start from SparkSQL and Succinct Spark package?
•  Note:
    •  Contact Anurag and Rachit, if interested


Succinct: Genomics

•  Problem:
    •  Genomics pipeline still expensive
•  Project:
    •  Genome processing on a single machine (using compressed data)
    •  Enable queries on compressed genomes
•  Challenges:
    •  Domain specific query optimizations
•  Note:
    •  Contact Anurag and Rachit, if interested
