Large Scale Machine Learning with the SimSQL System · 2015-11-14
TRANSCRIPT
LARGE SCALE MACHINE LEARNING WITH THE SIMSQL SYSTEM
Chris Jermaine, Rice University
Current/Recent SimSQL team members: Zhuhua Cai, Jacob Gao, Michael Gubanov, Shangyu Luo, Luis Perez
Also, Peter J. Haas at IBM Almaden
The Last Few Years...
• Have seen huge interest in large-scale data-processing systems
• OptiML, GraphLab, SystemML, MLBase, HBase, MongoDB, BigTable, Pig, Impala, ScalOps, Pregel, Giraph, Hadoop, TupleWare, Hama, Spark, Flink, Ricardo, Naiad, DryadLinq, and many others...
• Many have had significant industrial impact...
• Why now?
Until the Mid-2000s
• If you had a large-scale data-processing problem, you:
1. Rolled your own
2. Bought an SQL-based database system
3. Used a niche (often industry-specific) solution
Available to very few people
SQL Databases
• What were the complaints?
— Poor performance
This is misleading...
SQL Databases
• What were the complaints?
— Poor price/performance (Teradata costs a lot)
— No open-source solution with good scale out
— Frustration with cost to load data into relations
— SQL has never played nicely with other tools (software packages, other PLs)
A Lot Of Pain in the Analytics Space
Then, Suddenly We Had Hadoop
• Our salvation! Right??
Then, Suddenly We Had Hadoop
Here’s the main program for Word Count...

    import java.util.*;

    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.IntWritable;

    public class WordCount {

      public static int main(String[] args) throws Exception {

        // if we got the wrong number of args, then exit
        if (args.length != 4 || !args[0].equals ("-r")) {
          System.out.println("usage: WordCount -r <num reducers> <input> <output>");
          return -1;
        }

        // Get the default configuration object
        Configuration conf = new Configuration ();

        // now create the MapReduce job
        Job job = new Job (conf);
        job.setJobName ("WordCount");

        // we'll output text/int pairs (since we have words as keys and counts as values)
        job.setMapOutputKeyClass (Text.class);
        job.setMapOutputValueClass (IntWritable.class);

        // again we'll output text/int pairs (since we have words as keys and counts as values)
        job.setOutputKeyClass (Text.class);
        job.setOutputValueClass (IntWritable.class);

        // tell Hadoop the mapper and the reducer to use
        job.setMapperClass (WordCountMapper.class);
        job.setCombinerClass (WordCountReducer.class);
        job.setReducerClass (WordCountReducer.class);

        // we'll be reading in a text file, so we can use Hadoop's built-in TextInputFormat
        job.setInputFormatClass (TextInputFormat.class);

        // we can use Hadoop's built-in TextOutputFormat for writing out the output text file
        job.setOutputFormatClass (TextOutputFormat.class);

        // set the input and output paths
        TextInputFormat.setInputPaths (job, args[2]);
        TextOutputFormat.setOutputPath (job, new Path (args[3]));

        // set the number of reduce paths
        try {
          job.setNumReduceTasks (Integer.parseInt (args[1]));
        } catch (Exception e) {
          System.out.println("usage: WordCount -r <num reducers> <input> <output>");
          return -1;
        }

        // force the mappers to handle one megabyte of input data each
        TextInputFormat.setMinInputSplitSize (job, 1024 * 1024);
        TextInputFormat.setMaxInputSplitSize (job, 1024 * 1024);

        // this tells Hadoop to ship around the jar file containing "WordCount.class" to all of the different
        // nodes so that they can run the job
        job.setJarByClass(WordCount.class);

        // submit the job and wait for it to complete!
        int exitCode = job.waitForCompletion (true) ? 0 : 1;
        return exitCode;
      }
    }
Not pretty! Plus, often not that fast...
Then, Suddenly We Had Hadoop
[Figure: two charts, each placing SQL and MapReduce (Hadoop). Chart 1 axes: data load engineering $$ (horizontal) vs. data query engineering $$ (vertical). Chart 2 axes: complexity of computation (horizontal) vs. query response time (vertical).]
Not a clear win...
Next Generation “DataFlow” Platforms
So people worked at improving the situation...
[Figure: the same two charts (data load engineering $$ vs. data query engineering $$; complexity of computation vs. query response time), now placing SQL against DryadLinq, Spark, Flink, and others. Annotations: cleaner programming interface; better implementation techniques.]
Next Generation “DataFlow” Platforms
• “Dataflow” platforms (such as Spark) are effectively relational algebra (RA) engines
— Human being strings together (simple) bulk ops to form a computation
— But DB people have recognized for nearly 40 years that imperative data access is bad
One Step Forward, Two Steps Back
CODASYL Data Model (1969)
Dept (Name, Supervisor, Address)
Affiliation (PercentEffort)
Employee (EName, Age, Salary)
Works (Duration)
Project (PName, Director, Customer)
“Which employees supervised by Chris worked on the Pliny project?”
Starting from the project: good if few employees per project
Starting from the supervisor: good if few people supervised by Chris
Such Issues Led Directly to Widespread Adoption of the Relational Model
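The trade-off between the two navigation orders can be sketched in Python. This is only an illustration under loud assumptions: the dictionaries, the sample rows, and both function names are hypothetical stand-ins for linked CODASYL records, not CODASYL syntax. Each function answers the slide's question, but each hard-codes one access path, so the programmer has to know in advance which side is small.

```python
# Hypothetical in-memory stand-ins for the linked record types on the slide.
# Every name and row below is invented for illustration.
dept_supervisor = {"Systems": "Chris", "Theory": "Dana"}       # Dept -> Supervisor
dept_members = {"Systems": ["Ann", "Bob"], "Theory": ["Cat"]}  # Dept -> Employees
employee_dept = {"Ann": "Systems", "Bob": "Systems", "Cat": "Theory"}
employee_projects = {"Ann": ["Pliny"], "Bob": ["Other"], "Cat": ["Pliny"]}
project_members = {"Pliny": ["Ann", "Cat"], "Other": ["Bob"]}  # Project -> Employees


def start_from_project(supervisor, project):
    """Walk Project -> Employee -> Dept. Cheap when the project has few employees."""
    return [e for e in project_members[project]
            if dept_supervisor[employee_dept[e]] == supervisor]


def start_from_supervisor(supervisor, project):
    """Walk Dept -> Employee -> Project. Cheap when the supervisor has few people."""
    result = []
    for dept, boss in dept_supervisor.items():
        if boss == supervisor:
            for e in dept_members[dept]:
                if project in employee_projects[e]:
                    result.append(e)
    return result


print(start_from_project("Chris", "Pliny"))
print(start_from_supervisor("Chris", "Pliny"))
```

Both calls return the same employees; only the amount of data touched differs. A declarative query leaves that choice to an optimizer that can look at the actual sizes.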
Deja Vu, All Over Again
• I’m teaching a data science class this semester at Rice
— Assignment 3: Use Spark to analyze NYC taxi GPS data
— One task: find drop-off anomalies and geographic points-of-interest nearby
— “Prof. Chris: My last join hasn’t finished in two hours!!”
Deja Vu, All Over Again
Here’s a fast solution:

    # Map the above to CellDayHour, ratio numDrop/avgDrops, numDrops, then
    # take the top 20 ratios
    output3 = output2.map(lambda p: (p[0], (p[1] / p[2], p[1]))) \
        .takeOrdered(20, lambda kv: -kv[1][0])

    # Turn back into RDD
    output4 = sc.parallelize(output3)

    # Change formatting of the output
    output5 = output4.map(lambda p: (
        str(p[0].split(":")[0]),
        [p[0].split(":")[1].split(" ")[1], p[0].split(":")[1].split(" ")[0], p[1]]
    )).keyBy(lambda p: p[0])

    # Map locations to grid cells
    mappedData = mappings.map(lambda m: (
        str(changeValToCell(float(m.split("||")[1])))
        + str(changeValToCell(float(m.split("||")[0]))),
        m.split("||")[2]
    )).keyBy(lambda p: p[0]).reduceByKey(lambda t1, t2: t1)

    # Join the mappings to the data together
    final_output = output5.leftOuterJoin(mappedData).map(lambda p: (
        p[0], p[1][0][1][0], p[1][0][1][1], p[1][0][1][2][0], p[1][0][1][2][1]))
Deja Vu, All Over Again
And here’s a slow one:

    # Now that it's sorted, put it back to (grid_cell,
    # (hour, date, fraction, number))
    grid_cell_combined = grid_cell_combined.map(lambda t: (
        t[1][0].split(":")[0],
        (t[1][0].split(":")[1], t[1][1], t[0], t[1][2])))

    # now let's grab the POI data
    lines = sc.textFile(sys.argv[2], 1)
    poi_lines = lines.map(lambda x: x.split('||')) \
        .map(lambda l: (get_cell_id(l[0], l[1]), [l[2]])) \
        .reduceByKey(lambda list1, list2: list1 + list2)

    grid_with_poi = grid_cell_combined.leftOuterJoin(poi_lines) \
        .map(lambda t: (t[1][0][2], (t[0], t[1][0][0], t[1][0][1], t[1][0][3], t[1][1]))) \
        .sortByKey(ascending=False) \
        .map(lambda t: (t[1][0], t[1][1], t[1][2], t[0], t[1][3], t[1][4]))

    test_taxi_rows = sc.parallelize(grid_with_poi.take(20))
    test_taxi_rows.saveAsTextFile(sys.argv[3])
Is This Progress?
• Dataflow engine: not the right platform for complicated analytics
— Throwing away the optimizer radically increases programmer burden
— 100% host language code: significant portion of program opaque to system, which means the system cannot optimize
— Loss of control of data format/layout: big performance hit
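To make the "opaque to the system" point concrete, here is a toy Python sketch. It is not how Spark or SimSQL plan queries; the `Predicate` class, the `join_with_filter` function, and the data are all hypothetical. It only shows that a filter written as data can be moved ahead of a join, while a filter written as an arbitrary function has to run on every joined pair.

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    """Declarative filter on the left input: `column == value`.
    Hypothetical class; its fields are visible to the toy planner below."""
    column: str
    value: object


def join_with_filter(left, right, key, flt):
    """Equi-join `left` and `right` on `key`, keeping pairs that pass `flt`.

    Returns (rows, pairs_built) so the amount of join work is visible.
    """
    if isinstance(flt, Predicate):
        # The planner can read the predicate, so it filters before joining.
        left = [l for l in left if l[flt.column] == flt.value]
        keep = lambda l, r: True
    else:
        # An arbitrary function is a black box: run it on every joined pair.
        keep = flt
    pairs = [(l, r) for l in left for r in right if l[key] == r[key]]
    rows = [(l, r) for l, r in pairs if keep(l, r)]
    return rows, len(pairs)


# Invented data: 200 trips (two hours per grid cell) and 100 points of interest.
trips = [{"cell": c, "hour": h} for c in range(100) for h in (0, 1)]
pois = [{"cell": c, "name": f"poi{c}"} for c in range(100)]

opaque_rows, opaque_pairs = join_with_filter(
    trips, pois, "cell", lambda l, r: l["hour"] == 0)
declared_rows, declared_pairs = join_with_filter(
    trips, pois, "cell", Predicate("hour", 0))

print(opaque_pairs, declared_pairs)
```

Both calls return the same rows, but the opaque version builds 200 joined pairs while the declarative one builds 100. With a real optimizer and real data sizes, the same idea separates a join that finishes from one that does not.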
The Ideal Analytics Ecosystem
[Diagram components:]
• Data intake
• Unstructured data store (HDFS): long-term storage
• Dataflow platform: simple/one-off analytics
• Structured data store: long-term storage (hot data), short-term storage (cool data)
• SQL, DSL1, DSL2, DSL3: complex analytics
Our Research
• What should this system look like?
• Design principles:
— RDBMS/declarative = GOOD
— Incremental, not revolutionary
— No need to throw out 40 years of tech.
— Statistical processing: esp. important
[Diagram: structured data store (long-term storage for hot data, short-term storage for cool data), accessed through SQL, DSL1, DSL2, DSL3 for complex analytics]
How Must DBMS Change?
• More extensive support for recursion
• Fancier table functions (“VG functions”)
• Add native support for vectors/matrices (as attribute types)
• Support for executing huge “query” plans (1000s of operations)
• New logical/physical operators
• Additional DSLs (SQL not enough)
SimSQL’s Version of SQL
• Most fundamental SQL addition is “VG Function” abstraction
• Called via a special CREATE TABLE statement
• Example; assuming:
— SBP(MEAN, STD, GENDER)
— PATIENTS(NAME, GENDER)
• To create a derived table, we might have:

    CREATE TABLE SBP_DATA(NAME, GENDER, SBP) AS
    FOR EACH p in PATIENTS
      WITH Res AS Normal (
        SELECT s.MEAN, s.STD
        FROM SBP s
        WHERE s.GENDER = p.GENDER)
      SELECT p.NAME, p.GENDER, r.VALUE
      FROM Res r

We like randomized algs (MCMC), but are not limited to that
How Does This Work?
PATIENTS (NAME, GENDER): (Joe, Male), (Tom, Male), (Jen, Female), (Sue, Female), (Jim, Male)
SBP (MEAN, STD, GENDER): (150, 20, Male), (130, 25, Female)
Loop through PATIENTS; “p” marks the current row, starting at (Joe, Male)
For p = (Joe, Male): the inner query returns (150, 20), so the VG function call is Normal(150, 20)
Res(VALUE): (162)
SBP_DATA (NAME, GENDER, SBP): (Joe, Male, 162)
For p = (Tom, Male): the VG function call is again Normal(150, 20)
Res(VALUE): (135)
SBP_DATA (NAME, GENDER, SBP): (Joe, Male, 162), (Tom, Male, 135)
For p = (Jen, Female): the VG function call is Normal(130, 25)
Res(VALUE): (112)
SBP_DATA (NAME, GENDER, SBP): (Joe, Male, 162), (Tom, Male, 135), (Jen, Female, 112)
and so on...
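The walk-through above can be mimicked in a few lines of Python. This is a sketch of the semantics only, assuming the loop reading given on the slides; `normal_vg` and `build_sbp_data` are hypothetical names, and nothing here reflects how SimSQL actually executes the statement.

```python
import random

# Tables from the slides.
patients = [("Joe", "Male"), ("Tom", "Male"), ("Jen", "Female"),
            ("Sue", "Female"), ("Jim", "Male")]       # (NAME, GENDER)
sbp = [(150, 20, "Male"), (130, 25, "Female")]        # (MEAN, STD, GENDER)


def normal_vg(params, rng):
    """Hypothetical stand-in for the Normal VG function:
    one sampled VALUE per (MEAN, STD) parameter row."""
    return [rng.gauss(mean, std) for mean, std in params]


def build_sbp_data(patients, sbp, seed=0):
    """Emulate CREATE TABLE SBP_DATA ... FOR EACH p IN PATIENTS."""
    rng = random.Random(seed)
    out = []
    for name, gender in patients:                             # FOR EACH p IN PATIENTS
        params = [(m, s) for m, s, g in sbp if g == gender]   # inner SELECT
        for value in normal_vg(params, rng):                  # WITH Res AS Normal(...)
            out.append((name, gender, value))                 # outer SELECT
    return out


for row in build_sbp_data(patients, sbp):
    print(row)
```

Each run with the same seed produces the same table; a different seed gives a different sample of the random table, which is what makes SBP_DATA a table-valued random variable.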
More Complicated Computations
• The previous example allows (for example) table-valued random variables (RVs)
• But Markov chains are easy in SimSQL, so Bayesian ML is easy
• Here’s a silly Markov chain. We have:
— PERSON (pname)
— PATH (fromCity, toCity, prob)
— RESTAURANT (city, rname, prob)
Markov Chain Simulation
• To select an initial starting position for each person:

    CREATE TABLE POSITION[0] (pname, city) AS
    FOR EACH p IN PERSON
      WITH City AS DiscreteChoice (
        SELECT DISTINCT toCity
        FROM PATH)
      SELECT p.pname, City.value
      FROM City
Markov Chain Simulation
• And then randomly select a restaurant:

    CREATE TABLE VISITED[i] (pname, rname) AS
    FOR EACH p IN PERSON
      WITH Visit AS Categorical (
        SELECT r.rname, r.prob
        FROM RESTAURANT r, POSITION[i] l
        WHERE r.city = l.city AND l.pname = p.pname)
      SELECT p.pname, Visit.val
      FROM Visit
• And transition the person:

CREATE TABLE POSITION[i] (pname, city) AS
FOR EACH p IN PERSON
  WITH Next AS Categorical (
    SELECT PATH.tocity, PATH.prob
    FROM PATH, POSITION[i - 1] l
    WHERE PATH.fromcity = l.city
      AND l.pname = p.pname)
  SELECT p.pname, Next.val
  FROM Next
• Fully spec’ed a distributed Markov chain!
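The three table definitions can be read as one simulation loop. The sketch below mimics that loop on a single machine in plain Python; the city, person, and restaurant data are invented for illustration, and `categorical` and `simulate` are hypothetical helper names rather than SimSQL functions.

```python
import random

# Invented toy data in the shape of the three base tables.
PERSON = ["Ann", "Bob"]
PATH = [("Houston", "Austin", 0.7), ("Houston", "Dallas", 0.3),
        ("Austin", "Houston", 0.5), ("Austin", "Dallas", 0.5),
        ("Dallas", "Houston", 1.0)]            # (fromCity, toCity, prob)
RESTAURANT = [("Houston", "TacoHut", 0.6), ("Houston", "PhoPlace", 0.4),
              ("Austin", "BBQBarn", 1.0), ("Dallas", "SteakStop", 1.0)]


def categorical(pairs, rng):
    """Draw one value from (value, probability) pairs."""
    values = [value for value, _ in pairs]
    weights = [weight for _, weight in pairs]
    return rng.choices(values, weights=weights, k=1)[0]


def simulate(steps, seed=0):
    """Return the lists POSITION[0..steps] and VISITED[0..steps]."""
    rng = random.Random(seed)
    cities = sorted({to for _, to, _ in PATH})             # SELECT DISTINCT toCity
    position = [{p: rng.choice(cities) for p in PERSON}]   # POSITION[0]
    visited = []
    for i in range(steps + 1):
        if i > 0:
            # POSITION[i] depends only on POSITION[i - 1].
            step = {}
            for p in PERSON:
                here = position[i - 1][p]
                moves = [(to, pr) for frm, to, pr in PATH if frm == here]
                step[p] = categorical(moves, rng)
            position.append(step)
        # VISITED[i] depends only on POSITION[i].
        meal = {}
        for p in PERSON:
            here = position[i][p]
            options = [(r, pr) for c, r, pr in RESTAURANT if c == here]
            meal[p] = categorical(options, rng)
        visited.append(meal)
    return position, visited


position, visited = simulate(steps=5)
```

The SimSQL version states only the dependencies between table versions; the explicit loop here is an artifact of the single-machine sketch.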
Native Vector and Matrix Support
• Vectors and matrices fundamental to analytics
• Can be difficult to write code/expensive without it
Classical SQL:

data (pointID INTEGER, dimID INTEGER, value DOUBLE)
matrixA (rowID INTEGER, colID INTEGER, value DOUBLE)

CREATE VIEW xDiff (pointID, dimID, value) AS
  SELECT x2.pointID, x2.dimID, x1.value - x2.value
  FROM data AS x1, data AS x2
  WHERE x1.pointID = i AND x1.dimID = x2.dimID

SELECT x.pointID, SUM (firstPart.value * x.value)
FROM (SELECT x.pointID, a.colID, SUM (a.value * x.value) AS value
      FROM xDiff AS x, matrixA AS a
      WHERE x.dimID = a.rowID
      GROUP BY x.pointID, a.colID) AS firstPart,
     xDiff AS x
WHERE firstPart.colID = x.dimID
  AND firstPart.pointID = x.pointID
GROUP BY x.pointID
SimSQL:

data (pointID INTEGER, val VECTOR [])
matrixA (val MATRIX [][])

SELECT x2.pointID,
       inner_product (
         matrix_vector_multiply (x1.val - x2.val, a.val),
         x1.val - x2.val) AS value
FROM data AS x1, data AS x2, matrixA AS a
WHERE x1.pointID = i
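Both queries compute the quadratic form d^T A d for the difference d = x1 - x2. The sketch below contrasts the two styles in plain Python: one function works on (rowID, colID, value) tuples the way the classical SQL does, the other on whole vectors the way the SimSQL query does. The function names and toy numbers are illustrative assumptions.

```python
def quad_form_tuples(diff, matrix_tuples):
    """d^T A d with d given as {dimID: value} and A as (rowID, colID, value) tuples."""
    first_part = {}
    for row_id, col_id, a in matrix_tuples:
        # Join on x.dimID = a.rowID, then GROUP BY a.colID.
        first_part[col_id] = first_part.get(col_id, 0.0) + a * diff[row_id]
    # Join firstPart back to the difference on colID = dimID and sum.
    return sum(value * diff[col_id] for col_id, value in first_part.items())


def quad_form_vectors(d, A):
    """d^T A d with d as a list and A as a list of rows."""
    Ad = [sum(a * x for a, x in zip(row, d)) for row in A]  # matrix-vector product
    return sum(u * v for u, v in zip(Ad, d))                # inner product


x1 = [1.0, 2.0]
x2 = [0.0, 4.0]
d = [a - b for a, b in zip(x1, x2)]
A = [[2.0, 1.0], [0.0, 3.0]]
A_tuples = [(r, c, A[r][c]) for r in range(2) for c in range(2)]

tuple_result = quad_form_tuples(dict(enumerate(d)), A_tuples)
vector_result = quad_form_vectors(d, A)
```

The tuple version touches one scalar per step, which is the per-value overhead that native vector and matrix types avoid.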
• Another example: linear regression
• Goal is to compute $\hat{\beta} = \left(\sum_i x_i x_i^T\right)^{-1} \sum_i x_i y_i$

CREATE TABLE X (
  i INTEGER,
  x_i VECTOR []);

CREATE TABLE y (
  i INTEGER,
  y_i DOUBLE);

SELECT matrix_vector_multiply (
         matrix_inverse (SUM (outer_product (X.x_i, X.x_i))),
         SUM (X.x_i * y_i))
FROM X, y
WHERE X.i = y.i
Easy in SimSQL!
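For readers who want to check the query against numbers, here is a small pure-Python sketch of the same computation: accumulate the sum of outer products and the sum of x_i * y_i row by row, then solve the resulting system. It solves the linear system by Gauss-Jordan elimination instead of forming an explicit inverse, which is a substitution made for this sketch; the names `solve` and `least_squares` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def least_squares(X, y):
    """Return the solution of (sum_i x_i x_i^T) beta = sum_i x_i y_i."""
    p = len(X[0])
    gram = [[0.0] * p for _ in range(p)]   # SUM(outer_product(x_i, x_i))
    xty = [0.0] * p                        # SUM(x_i * y_i)
    for x_i, y_i in zip(X, y):
        for a in range(p):
            xty[a] += x_i[a] * y_i
            for b in range(p):
                gram[a][b] += x_i[a] * x_i[b]
    return solve(gram, xty)


# Toy data on the line y = 1 + 2 * t, with an intercept column of ones.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = least_squares(X, y)
```

Both sums are plain aggregates over rows, which is why the computation fits a single SELECT with SUM.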
Promotion and Demotion
• Start with pure, tuple-based encoding

mat (row INTEGER, col INTEGER, value DOUBLE)

• Move to a set of (vector, rowID) pairs

CREATE VIEW vecs (vec, row) AS
  SELECT VECTORIZE (label_scalar (value, col)) AS vec, row
  FROM mat
  GROUP BY row

• Then create a matrix...

SELECT ROWMATRIX (label_vector (vec, row))
FROM vecs

• ... or move back to tuples

SELECT v.row, c.cnt AS col, get_scalar (v.vec, c.cnt) AS value
FROM vecs AS v, counts AS c
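The promotion and demotion steps amount to regrouping the same numbers. A minimal Python sketch of the round trip follows, assuming a dense matrix with 0-based column labels; `vectorize`, `rowmatrix`, and `to_tuples` are hypothetical stand-ins named after the SimSQL operations, not their implementations.

```python
# Tuple-based encoding: (row, col, value).
mat = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]


def vectorize(tuples):
    """Promote tuples to {row: vector}, like VECTORIZE ... GROUP BY row."""
    n_cols = max(col for _, col, _ in tuples) + 1
    vecs = {}
    for row, col, value in tuples:
        vecs.setdefault(row, [0.0] * n_cols)[col] = value
    return vecs


def rowmatrix(vecs):
    """Promote labeled row vectors to one matrix, like ROWMATRIX."""
    return [vecs[row] for row in sorted(vecs)]


def to_tuples(vecs):
    """Demote row vectors back to (row, col, value) tuples."""
    return [(row, col, value)
            for row in sorted(vecs)
            for col, value in enumerate(vecs[row])]


vecs = vectorize(mat)
matrix = rowmatrix(vecs)
round_trip = to_tuples(vecs)
```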
Additional DSLs: BUDS
• SQL might not be most natural declarative interface for analytics
• Math for the Bayesian Lasso, lifted from original paper

1. $r \sim \mathrm{Normal}\left(A^{-1} X^T y,\; \sigma^2 A^{-1}\right)$
2. $\sigma^2 \sim \mathrm{InvGamma}\left(\frac{(n-1)+p}{2},\; \frac{(y - Xr)^T (y - Xr) + r^T D^{-1} r}{2}\right)$
3. $\tau_j^{-2} \sim \mathrm{InvGaussian}\left(\frac{\lambda\sigma}{r_j},\; \lambda^2\right)$

— where $A = X^T X + D^{-1}$, $D^{-1} = \mathrm{diag}(\tau_1^{-2}, \tau_2^{-2}, \ldots)$
SQL Is Not Trivial

CREATE TABLE response (
  respID INTEGER, respValue DOUBLE,
  PRIMARY KEY (respID));

CREATE TABLE regressor (
  respID INTEGER, regValue VECTOR[],
  PRIMARY KEY (respID));

CREATE TABLE prior (
  numResponses INTEGER, numRegressors INTEGER, lambdaValue DOUBLE,
  invGammaShape DOUBLE, invGammaScale DOUBLE,
  invGaussianMean VECTOR[], invGaussianShape DOUBLE);

CREATE VIEW centeredResponse (respID, respValue) AS
  SELECT r1.respID, (r1.respValue - m.meanRespValue)
  FROM response r1,
       (SELECT AVG(r2.respValue) AS meanRespValue FROM response r2) AS m;

CREATE VIEW regressorGram (val) AS
  SELECT SUM(OUTER_PRODUCT(r.regValue, r.regValue))
  FROM regressor r;

CREATE VIEW regressorSum (sumValue) AS
  SELECT SUM(reg.regValue * res.respValue)
  FROM regressor reg, centeredResponse res
  WHERE reg.respID = res.respID;

CREATE TABLE sigma[0] (sigmaValue) AS
  WITH g AS InvGamma (
    SELECT invGammaShape, invGammaScale FROM prior)
  SELECT g.outValue FROM g;

CREATE TABLE tau[0] (tauValue) AS
  WITH ig AS InvGaussian_VM (
    SELECT invGaussianMean, invGaussianShape FROM prior)
  SELECT ig.outValue FROM ig;

CREATE TABLE A[i] (AValue) AS
  SELECT MATRIX_INVERSE(xtx.val + DIAG_MATRIX(t.tauValue))
  FROM tau[i] t, regressorGram xtx;

CREATE TABLE beta[i] (betaValue) AS
  WITH mvn AS MultiNormal_VM (
    (SELECT MATRIX_VECTOR_MULTIPLY(a.AValue, rs.sumValue),
            (a.AValue * s.sigmaValue)
     FROM A[i] a, regressorSum rs, sigma[i] s))
  SELECT mvn.out_mean FROM mvn;

CREATE TABLE sigma[i] (sigmaValue) AS
  WITH g AS InvGamma (
    (SELECT (pr.numResponses - 1)/2.0 + (pr.numRegressors/2.0)
     FROM prior pr),
    (SELECT sb.sumBetas + st.sumTaus
     FROM (SELECT SUM(((res.respValue - xb.sumValue) *
                       (res.respValue - xb.sumValue)) / 2.0) AS sumBetas
           FROM centeredResponse res,
                (SELECT reg.respID AS respID,
                        INNER_PRODUCT(reg.regValue, b1.betaValue) AS sumValue
                 FROM regressor reg, beta[i-1] b1) AS xb
           WHERE res.respID = xb.respID) AS sb,
          (SELECT (INNER_PRODUCT(b2.betaValue * b2.betaValue, t.tauValue) / 2.0) AS sumTaus
           FROM tau[i-1] t, beta[i-1] b2) AS st))
  SELECT g.outValue FROM g;

CREATE TABLE tau[i] (tauValue) AS
  WITH ig AS InvGaussian_VM (
    (SELECT SQRT_VECTOR((pr1.lambdaValue * pr1.lambdaValue * s.sigmaValue) /
                        (b.betaValue * b.betaValue))
     FROM prior pr1, sigma[i] s, beta[i-1] b),
    (SELECT (pr2.lambdaValue * pr2.lambdaValue) FROM prior pr2))
  SELECT ig.outValue FROM ig;
BUDS Is Much Simpler!
• Write code that looks just like the math...

data {
  n: range (responses);
  p: range (regressors);
  X: array[n, p] of real;
  y: array[n] of real;
  lam: real
}

var {
  sig: real;
  r, t: array[p] of real;
  yy, Z: array[n] of real;
}

A <- inv(X '* X + diag(t));
yy <- (y[i] - mean(y) | i in 1:n);
Z <- yy - X * r;

init {
  sig ~ InvGamma (1, 1);
  t ~ (InvGauss (1, lam) | j in 1:p);
}

r ~ Normal (A *' X * yy, sig * A);
sig ~ InvGamma(((n-1) + p)/2, (Z '* Z + (r * diag(t) '* r)) / 2);
for (j in 1:p) {
  t[j] ~ InvGauss (sqrt((lam * sig) / r[j]), lam);
}

Truly declarative; programmer only lists dependencies
How Well Does All of This Work?
• SimSQL is great in theory...
— Many will buy the “data independence” argument
— Will appreciate being able to specify algs at a very high level
• But can it scale? Isn’t the DB approach gonna be slow?
• Yes, it’s slow, compared to C/Fortran + MPI
— But zero data independence with MPI
• But does it compete well with other “Big Data” ML platforms?
— After all, there are many that count ML as the primary (or a motivating) application
— OptiML, GraphLab, SystemML, MLBase, ScalOps, Pregel, Giraph, Hama, Spark, Ricardo, Naiad, DryadLINQ
— How might those compare?
• We’ve done a LOT of comparisons with other mature platforms
— Specifically, GraphLab, Giraph, Spark
— More than 70,000 hours of Amazon EC2 time ($100,000 @on-demand price)
— I’d wager that few groups have a better understanding of how well these platforms work in practice!
• Note: point is not to show SimSQL is the fastest (it is not)
— Only to argue that it can compete well
— If it competes, it’s a strong argument for the DB approach to ML
Example One: Bayesian GMM
Generative process:
(1) Pick a cluster
(2) Use it to generate point
Then, given the generated points, infer the clusters that produced them (slide figures not preserved)
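The two-step generative process can be sketched in a few lines of Python. To stay within the standard library this version is one-dimensional, whereas the experiments below use 10 and 100 dimensional data with full covariance matrices; the function name `generate_gmm` and the parameter values are assumptions for illustration.

```python
import random


def generate_gmm(n, weights, means, stds, seed=0):
    """Draw n (cluster, point) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    clusters = range(len(weights))
    points = []
    for _ in range(n):
        k = rng.choices(clusters, weights=weights, k=1)[0]  # (1) pick a cluster
        x = rng.gauss(means[k], stds[k])                    # (2) use it to generate point
        points.append((k, x))
    return points


points = generate_gmm(2000, weights=[0.5, 0.5], means=[-5.0, 5.0], stds=[1.0, 1.0])
```

Inference runs the other way: only the x values are observed, and the MCMC simulation recovers the cluster labels and parameters.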
• Implemented relevant MCMC simulation on all four platforms
— SimSQL, GraphLab, Spark, Giraph
• Philosophy: be true to the platform
— Ex: avoid “Hadoop abuse” [Smola & Narayanamurthy, VLDB 2010]
• Ran on 10 dimensional data, 10 clusters, 10M points per machine
— Full (non-diagonal) covariance matrix
— Also on 100 dimensional data, 1M points per machine
• Some notes:
— Times are HH:MM:SS per iteration (time in parens is startup/initialization)
— Amount of data is kept constant per machine in all tests
— “Fail” means that even with much effort and tuning, it crashed

(a) GMM: Initial Implementations

| | lines of code | 10 dimensions, 5 machines | 10 dimensions, 20 machines | 10 dimensions, 100 machines | 100 dimensions, 5 machines |
|---|---|---|---|---|---|
| SimSQL | 197 | 27:55 (13:55) | 28:55 (14:38) | 35:54 (18:58) | 1:51:12 (36:08) |
| GraphLab | 661 | Fail | Fail | Fail | Fail |
| Spark (Python) | 236 | 26:04 (4:10) | 37:34 (2:27) | 38:09 (2:00) | 47:40 (0:52) |
| Giraph | 2131 | 25:21 (0:18) | 30:26 (0:15) | Fail | Fail |
• Not much difference!
— But SimSQL was slower in 100 dims. Why?
  - Experiments didn’t use support for vectors/matrices
• Spark is surprisingly slow
— Is Spark slower due to Python vs. Java?
(b) GMM: Alternative Implementations

| | lines of code | 10 dimensions, 5 machines | 10 dimensions, 20 machines | 10 dimensions, 100 machines | 100 dimensions, 5 machines |
|---|---|---|---|---|---|
| Spark (Java) | 737 | 12:30 (2:01) | 12:25 (2:03) | 18:11 (2:26) | 6:25:04 (36:08) |
• What about GraphLab?
— GraphLab failed every time. Why?

GraphLab/Giraph graph model (figure): n data point vertices, k cluster vertices, and a mixing proportion vertex

1 billion data points by 10 clusters by 1KB = 10TB RAM (6TB RAM in 100-machine cluster)
To fix: group the data points into m “super vertices”

GraphLab/Giraph graph model (figure): m “super vertices”, k cluster vertices, and a mixing proportion vertex

10,000 super vertices by 10 clusters by 1KB = 100 MB RAM (insignificant!)
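The two memory estimates are simple products, and writing them out makes the factor of 100,000 between them explicit. The sketch below uses decimal units (1 KB = 1000 bytes) to match the round numbers quoted; the function name `pairwise_bytes` is a hypothetical label, since the slides do not say exactly what the 1KB per pair holds.

```python
KB = 1000          # decimal units, matching the round figures quoted above
MB = 1000 * KB
TB = 1000 * 1000 * MB


def pairwise_bytes(num_vertices, num_clusters, bytes_per_pair=1 * KB):
    """State needed if every (vertex, cluster) pair carries bytes_per_pair."""
    return num_vertices * num_clusters * bytes_per_pair


naive = pairwise_bytes(1_000_000_000, 10)   # one vertex per data point
grouped = pairwise_bytes(10_000, 10)        # one vertex per super vertex

naive_tb = naive / TB
grouped_mb = grouped / MB
reduction = naive // grouped
```

With 6TB of RAM in the 100-machine cluster, the per-point model cannot fit, while the grouped model needs a negligible amount.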
• Super vertex results
— GraphLab super vertex screams!
— But to be fair, others can benefit from super vertices as well....

(b) GMM: Alternative Implementations

| | lines of code | 10 dimensions, 5 machines | 10 dimensions, 20 machines | 10 dimensions, 100 machines | 100 dimensions, 5 machines |
|---|---|---|---|---|---|
| Spark (Java) | 737 | 12:30 (2:01) | 12:25 (2:03) | 18:11 (2:26) | 6:25:04 (36:08) |
| GraphLab (Super Vertex) | 681 | 6:13 (1:13) | 4:36 (2:47) | 6:09 (1:21)∗ | 33:32 (0:42) |

(c) GMM: Super Vertex Implementations

| | 10 dimensions, 5 machines, w/o super vertex | 10 dimensions, 5 machines, with super vertex | 100 dimensions, 5 machines, w/o super vertex | 100 dimensions, 5 machines, with super vertex |
|---|---|---|---|---|
| SimSQL | 27:55 (13:55) | 6:20 (12:33) | 1:51:12 (36:08) | 7:22 (14:07) |
| GraphLab | Fail | 6:13 (1:13) | Fail | 33:32 (0:42) |
| Spark (Python) | 26:04 (4:10) | 29:12 (4:01) | 47:40 (0:52) | 47:03 (2:17) |
| Giraph | 25:21 (0:18) | 13:48 (0:03) | Fail | 6:17:32 (0:03) |
Example Two: Bayesian Lasso
• Bayesian LR model
— Due to Park and Casella [JASA 2008]
— With Laplace prior on regression coefs (good regularization)
— Clever Markov chain derivation means all updates from standard dist families
• Gibbs sampler is:

1. $r \sim \mathrm{Normal}\left(A^{-1} X^T y,\; \sigma^2 A^{-1}\right)$
2. $\sigma^2 \sim \mathrm{InvGamma}\left(\frac{(n-1)+p}{2},\; \frac{(y - Xr)^T (y - Xr) + r^T D^{-1} r}{2}\right)$
3. $\tau_j^{-2} \sim \mathrm{InvGaussian}\left(\frac{\lambda\sigma}{r_j},\; \lambda^2\right)$

— where $A = X^T X + D^{-1}$, $D^{-1} = \mathrm{diag}(\tau_1^{-2}, \tau_2^{-2}, \ldots)$, and $\lambda$ is the regularization param
— Gram matrix computation makes it interesting for high-D data
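To make the three update steps concrete, here is a minimal single-regressor (p = 1) sketch in standard-library Python, so that A, D^{-1}, and r are all scalars. It is an illustration under stated assumptions, not the implementation benchmarked here: the response is centered as in the SQL and BUDS code, the inverse-Gaussian mean uses |r| so that it stays positive, the inverse-Gaussian draw uses the Michael-Schucany-Haas transformation method, and all function names and the synthetic data are hypothetical.

```python
import math
import random


def inv_gamma(shape, scale, rng):
    """Draw from InvGamma(shape, scale) as scale / Gamma(shape, 1)."""
    return scale / rng.gammavariate(shape, 1.0)


def inv_gauss(mu, lam, rng):
    """Draw from InvGaussian(mean mu, shape lam), Michael-Schucany-Haas method."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mu + (mu * mu * y) / (2 * lam)
         - (mu / (2 * lam)) * math.sqrt(4 * mu * lam * y + (mu * y) ** 2))
    return x if rng.random() <= mu / (mu + x) else mu * mu / x


def gibbs_lasso_1d(x, y, lam=1.0, iters=2000, seed=0):
    """Gibbs sampler for the Bayesian Lasso with one regressor; returns draws of r."""
    rng = random.Random(seed)
    n, p = len(x), 1
    y_bar = sum(y) / n
    yc = [v - y_bar for v in y]                  # centered response
    xtx = sum(v * v for v in x)                  # Gram "matrix" (a scalar here)
    xty = sum(a * b for a, b in zip(x, yc))
    sigma2, inv_tau2 = 1.0, 1.0
    draws = []
    for _ in range(iters):
        a = xtx + inv_tau2                                   # A = X^T X + D^{-1}
        r = rng.gauss(xty / a, math.sqrt(sigma2 / a))        # step 1
        rss = sum((b - v * r) ** 2 for v, b in zip(x, yc))
        sigma2 = inv_gamma(((n - 1) + p) / 2.0,
                           (rss + r * r * inv_tau2) / 2.0, rng)   # step 2
        inv_tau2 = inv_gauss(lam * math.sqrt(sigma2) / abs(r),
                             lam * lam, rng)                 # step 3
        draws.append(r)
    return draws


# Synthetic data with true slope 2.0 (invented for the example).
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [2.0 * v + data_rng.gauss(0.0, 1.0) for v in xs]
draws = gibbs_lasso_1d(xs, ys)
```

In the full problem the scalar `xtx` becomes the p-by-p Gram matrix, which for 1K dense regressors is the expensive part of startup.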
Example Two: Bayesian Lasso
• Experimental setup
— 1K regressors (dense)
— 100K points per machine
• Results

Bayesian Lasso

| | lines of code | 5 machines | 20 machines | 100 machines |
|---|---|---|---|---|
| SimSQL | 100 | 7:09 (2:40:06) | 8:04 (2:45:28) | 12:24 (2:54:45) |
| GraphLab (Super Vertex) | 572 | 0:36 (0:37) | 0:26 (0:35) | 0:31 (0:50) |
| Spark (Python) | 168 | 0:55 (1:26:59) | 0:59 (1:33:13) | 1:12 (2:06:30) |
| Giraph | 1871 | Fail | Fail | Fail |
| Giraph (Super Vertex) | 1953 | 0:58 (1:14) | 1:03 (1:14) | 2:08 (6:31) |

• Interesting points
— SimSQL slow (again, lack of support for vectors/matrices is brutal here)...
— But Spark is almost as slow for startup (computation of Gram matrix)
— Check out GraphLab: super fast!
Example Three: LDA
• Sort of a Bayesian variant on PCA (for dimensionality reduction)
• Experimental setup
— Run over a document database, dictionary size of 10K words
— 100 “topics” (components) were learned
— Constant 2.5M documents per machine
• Note: didn’t do collapsed simulation, since hard to parallelize
• First we considered a “word based” implementation
— Arguably the most natural
— One vertex for each word in corpus in graph-based
— Separate Multinomial call for each word in each doc in SimSQL/Spark
• And a “document based” implementation
— One vertex for each document in graph-based
— Update membership for all words at once in SimSQL/Spark (faster ‘cause you broadcast the model, do join with words in doc in user code)
92
Example Three: LDA
• Results
• Interesting findings
— Only SimSQL can handle the word-based implementation, but it is really slow
— Only Giraph gives reasonable performance!
— Spark unable to join words-in-doc with topic-probs, hence an NA
— Giraph unable to load up word-based graph, hence an NA
• How about super vertex? (handle thousands of docs in a batch)

(a) LDA: Word-based and document-based implementations (5 machines)

| System | Word-based: lines of code | Word-based: running time | Document-based: lines of code | Document-based: running time |
|---|---|---|---|---|
| SimSQL | 126 | 16:34:39 (11:23:22) | 129 | 4:52:06 (4:34:27) |
| Spark (Python) | NA | NA | 188 | ≈15:45:00 (≈2:30:00) |
| Giraph | NA | NA | 1358 | 22:22 (5:46) |
94
Example Three: LDA
• Super vertex results
• Interesting findings
— Only SimSQL can scale to 250M docs on 100 machines
— Even super vertex can’t help GraphLab here...

(b) LDA: Super Vertex Implementations

| System | Lines of code | 5 machines | 20 machines | 100 machines |
|---|---|---|---|---|
| Giraph | 1406 | 18:49 (2:35) | 20:02 (2:46) | Fail |
| GraphLab | 517 | 39:27 (32:14) | Fail | Fail |
| Spark (Python) | 220 | ≈3:56:00 (≈2:15:00) | ≈3:57:00 (≈2:15:00) | Fail |
| SimSQL | 117 | 1:00:17 (3:09) | 1:06:59 (3:34) | 1:13:58 (4:28) |

- 10K super vertices on 100 machines
- each broadcasts 100 different 10K vectors to each topic node
- 10K by 10K by 100 is 10 billion numbers...
- what if a machine gets two or three topic nodes?
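The message-volume estimate in the note above can be worked through explicitly. The sketch below reads the note as: each of the 10K super vertices sends one 10K-entry vector to each of the 100 topic nodes. The 8 bytes per number is an assumption (64-bit floats) that the slide does not state, so the byte figures are only indicative.

```python
# Counts taken from the slide; the byte size is an assumption.
super_vertices = 10_000     # 10K super vertices on 100 machines
topics = 100                # one topic node per topic
dictionary_size = 10_000    # each vector has one entry per dictionary word
bytes_per_number = 8        # assumption: 64-bit floats

# Every super vertex sends one dictionary-sized vector to every topic node.
numbers_sent = super_vertices * topics * dictionary_size
total_bytes = numbers_sent * bytes_per_number

# A single topic node receives one vector from each super vertex.
numbers_per_topic_node = super_vertices * dictionary_size
bytes_per_topic_node = numbers_per_topic_node * bytes_per_number

print(numbers_sent)                 # 10000000000, i.e. 10 billion
print(total_bytes / 1e9)            # 80.0 (GB per iteration, under the 8-byte assumption)
print(bytes_per_topic_node / 1e9)   # 0.8 (GB per topic node, same assumption)
```

Under these assumptions a machine that ends up hosting two or three topic nodes would have to absorb two or three times the per-node volume in each iteration, which is the concern the note raises.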
96
Example Three: LDA
• Super vertex results
• Interesting findings
— Only SimSQL can scale to 250M docs on 100 machines
— Even super vertex can’t help GraphLab here...
— Spark does quite poorly... might this be due to Python?

(b) LDA: Super Vertex Implementations

| System | Lines of code | 5 machines | 20 machines | 100 machines |
|---|---|---|---|---|
| Giraph | 1406 | 18:49 (2:35) | 20:02 (2:46) | Fail |
| GraphLab | 517 | 39:27 (32:14) | Fail | Fail |
| Spark (Python) | 220 | ≈3:56:00 (≈2:15:00) | ≈3:57:00 (≈2:15:00) | Fail |
| SimSQL | 117 | 1:00:17 (3:09) | 1:06:59 (3:34) | 1:13:58 (4:28) |

LDA Spark Java Implementation

| Lines of code | 5 machines | 20 machines | 100 machines |
|---|---|---|---|
| 377 | 9:47 (0:53) | 19:36 (1:15) | Fail |

Figure 6: Average time per iteration (and startup time).
100
Summary of Findings
• Giraph can be made very fast
— Mostly because of its distributed aggregation facilities
— But it is still brittle, perhaps due to reliance on main memory
• GraphLab codes are small and nice, especially considering C++
— And it can be very fast
— But lack of distributed aggregation is a killer... what does this even mean in an asynchronous environment?
• Spark codes (Python) are startlingly beautiful. Wow!
— But Spark was brittle, hard to tune, and often slow
• SimSQL codes are fully declarative, and often competitive in speed
— Only platform to run everything we threw at it
— But lack of matrices and vectors really hurts
103
Summary of Talk
• I’ve motivated a relational approach to large-scale ML
— All about data independence!
— Same code works for any data set, compute platform
— Just drop in a new physical optimizer and runtime, keep application stack
• I’ve briefly described SimSQL, our realization of the approach
• And I’ve given experimental evidence the approach is practical
— Our Hadoop-targeted optimizer and runtime competes well
— And it’s the only platform to handle everything we threw at it
104
That’s It. Questions?
• Download SimSQL today
— http://cmj4.web.rice.edu/SimSQL/SimSQL.html
• This presentation at
— http://cmj4.web.rice.edu/SimSQLNew.pdf