20160908 Hivemall Meetup
TRANSCRIPT
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall Meets XGBoost in DataFrame/Spark
2016/9/8 Takeshi Yamamuro (maropu) @ NTT
Who am I?
XGBoost is...
• Short for eXtreme Gradient Boosting
  • https://github.com/dmlc/xgboost
• It is...
  • a variant of the gradient boosting machine
  • a tree-based model
  • an open-sourced tool (Apache2 license)
  • written in C++
  • R/Python/Julia/Java/Scala interfaces provided
  • widely used in Kaggle competitions
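Since the slide only names the technique, here is a minimal, self-contained Scala sketch of the gradient-boosting idea behind XGBoost: each round fits a small tree (here a depth-1 stump over a single feature, a deliberate simplification) to the current residuals and adds it, scaled by a learning rate eta, to the ensemble. All names (`BoostingSketch`, `Stump`, `fitStump`) are illustrative and are not part of the XGBoost API.

```scala
object BoostingSketch {
  // One-dimensional training point
  final case class Point(x: Double, y: Double)

  // A depth-1 regression tree (decision stump): threshold + two leaf values
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x < threshold) left else right
  }

  private def mean(s: Seq[Double]): Double = s.sum / s.size

  // Fit a stump to (x, residual) pairs by trying each x as a split point
  // and picking the split with the smallest squared error
  def fitStump(data: Seq[(Double, Double)]): Stump = {
    val fits = data.map(_._1).distinct.map { t =>
      val (l, r) = data.partition(_._1 < t)
      val lv = if (l.isEmpty) 0.0 else mean(l.map(_._2))
      val rv = if (r.isEmpty) 0.0 else mean(r.map(_._2))
      val err = data.map { case (x, y) =>
        val p = if (x < t) lv else rv
        (y - p) * (y - p)
      }.sum
      (err, Stump(t, lv, rv))
    }
    fits.minBy(_._1)._2
  }

  // Gradient boosting for squared loss: each round fits a stump to the
  // residuals, then shrinks the update by the learning rate eta
  def train(points: Seq[Point], rounds: Int, eta: Double): Seq[Stump] = {
    var residuals = points.map(p => (p.x, p.y))
    (1 to rounds).map { _ =>
      val stump = fitStump(residuals)
      residuals = residuals.map { case (x, r) => (x, r - eta * stump.predict(x)) }
      stump
    }
  }

  // Ensemble prediction: sum of the shrunken stump outputs
  def predict(model: Seq[Stump], eta: Double, x: Double): Double =
    model.map(s => eta * s.predict(x)).sum
}
```

Real XGBoost adds deeper trees, regularization, and second-order gradient information on top of this basic residual-fitting loop.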
Hivemall in DataFrame/Spark
• Most of Hivemall functions supported in Spark v1.6 and v2.0
  • the v2.0 support is not released yet
• XGBoost integration under development
  • distributed/parallel predictions
  • native libraries bundled for major platforms
    • Mac/Linux on x86_64
  • how-to-use: https://gist.github.com/maropu/33794b293ee937e99b8fb0788843fa3f
Spark Quick Examples
• Fetch a binary Spark v2.0.0
  • http://spark.apache.org/downloads.html

$ <SPARK_HOME>/bin/spark-shell

scala> :paste
val textFile = sc.textFile("hoge.txt")
val counts = textFile.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Fetch training and test data
• E2006 tfidf regression dataset
  • http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
XGBoost in spark-shell
• Scala interface bundled in the Hivemall jar

$ bunzip2 E2006.train.bz2
$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

scala> import ml.dmlc.xgboost4j.scala._
scala> :paste
// Read training data
val trainData = new DMatrix("E2006.train")
// Define parameters
val paramMap = List(
  "eta" -> 0.1,
  "max_depth" -> 2,
  "objective" -> "reg:logistic"
).toMap
// Train the model
val model = XGBoost.train(trainData, paramMap, 2)
// Save the model to a file
model.saveModel("xgboost_models_dir/xgb_0001.model")
Load test data in parallel

$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-XXX-with-dependencies.jar

// Create a DataFrame for the test data
scala> val testDf = sqlContext.sparkSession.read.format("libsvm")
  .load("E2006.test.bz2")

scala> testDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
Load test data in parallel
[Figure: lines of the libsvm-format test file (label followed by index:value pairs) are distributed across Partition1, Partition2, …, PartitionN of testDf]
• Loaded in parallel because bzip2 is splittable
• #partitions depends on three parameters
  • spark.default.parallelism: #cores by default
  • spark.sql.files.maxPartitionBytes: 128MB by default
  • spark.sql.files.openCostInBytes: 4MB by default
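As a rough illustration of how these three settings interact, the sketch below mirrors the split-size heuristic used by Spark SQL's file scan planning, to the best of my understanding of Spark 2.0: the target split size is min(maxPartitionBytes, max(openCostInBytes, totalBytes / defaultParallelism)). The object and function names here are illustrative, not Spark's API.

```scala
object PartitionSizing {
  // Compute the target split size from the three parameters above.
  // Each file is padded by openCostInBytes to account for open overhead.
  def maxSplitBytes(fileSizes: Seq[Long],
                    defaultParallelism: Int,
                    maxPartitionBytes: Long = 128L * 1024 * 1024,
                    openCostInBytes: Long = 4L * 1024 * 1024): Long = {
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }

  // Rough number of partitions for a single splittable file
  def numPartitions(fileSize: Long, defaultParallelism: Int): Long = {
    val split = maxSplitBytes(Seq(fileSize), defaultParallelism)
    math.max(1L, (fileSize + split - 1) / split) // ceiling division
  }
}
```

For example, a 1GB splittable file on 8 cores comes out at about 8 partitions of 128MB each, while a tiny file is clamped by openCostInBytes and stays in a single partition.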
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions

scala> import org.apache.spark.hive.HivemallOps._
scala> :paste
// Load built models from persistent storage
val modelsDf = sqlContext.sparkSession.read.format("xgboost")
  .load("xgboost_models_dir")
// Do predictions in parallel via cross-joins
val predict = modelsDf.join(testDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Do predictions in parallel
• XGBoost in DataFrame
  • Load built models and do cross-joins for predictions
• Broadcast cross-joins expected
  • Size of `modelsDf` must be less than or equal to spark.sql.autoBroadcastJoinThreshold (10MB by default)

testDf:
rowid | label  | features
1     | 0.392  | 1:0.3 5:0.1…
2     | 0.929  | 3:0.2…
3     | 0.132  | 2:0.9…
4     | 0.3923 | 5:0.4…
…

modelsDf:
model_id       | pred_model
xgb_0001.model | <binary data>
xgb_0002.model | <binary data>

→ cross-joins in parallel
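To make the cross-join, groupBy("rowid"), avg() shape concrete, here is the same computation sketched on plain Scala collections: every (row, model) pair is scored, then the per-model predictions are averaged per rowid. The `score` function is a hypothetical stand-in for the real XGBoost predictor, and all names here are illustrative.

```scala
object EnsembleAverage {
  final case class Row(rowid: Int, features: Vector[Double])
  final case class Model(modelId: String, weights: Vector[Double])

  // Hypothetical stand-in for scoring one row with one model
  // (the real code invokes the loaded XGBoost model binary)
  def score(m: Model, r: Row): Double =
    m.weights.zip(r.features).map { case (w, x) => w * x }.sum

  // Cross-join every row with every model, then average per rowid —
  // the same shape as modelsDf.join(testDf)...groupBy("rowid").avg()
  def predict(rows: Seq[Row], models: Seq[Model]): Map[Int, Double] = {
    val crossed = for (r <- rows; m <- models) yield (r.rowid, score(m, r))
    crossed.groupBy(_._1).map { case (id, xs) =>
      id -> xs.map(_._2).sum / xs.size
    }
  }
}
```

In Spark the cross-join is executed in parallel across partitions of testDf, with the (small) modelsDf broadcast to every worker.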
Do predictions for streaming data
• Structured Streaming in Spark 2.0
  • Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
  • alpha component in v2.0

scala> :paste
// Initialize a streaming DataFrame
val testStreamingDf = spark.readStream
  .format("libsvm") // Not supported in v2.0
  …
// Do predictions for streaming data
val predict = modelsDf.join(testStreamingDf)
  .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
  .groupBy("rowid")
  .avg()
Build models in parallel
• One model per partition
  • WIP: Build models with different parameters

scala> :paste
// Set options for XGBoost
val xgbOptions = XGBoostOptions()
  .set("num_round", "10000")
  .set("max_depth", "32,48,64") // Randomly selected by workers
// Set # of models to output
val numModels = 4
// Build models and save them in persistent storage
trainDf.repartition(numModels)
  .train_xgboost_regr($"features", $"label", s"${xgbOptions}")
  .write
  .format("xgboost")
  .save("xgboost_models_dir")
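The repartition(numModels) trick above can be pictured on plain Scala collections: split the training data into numModels chunks and train one model per chunk independently. `trainOne` is a hypothetical stand-in for the per-partition XGBoost training (here a toy "model" that is just the mean label).

```scala
object PerPartitionTraining {
  // Hypothetical stand-in: "train" a model on one partition's data.
  // The real code runs XGBoost on each partition's rows.
  def trainOne(rows: Seq[Double]): Double = rows.sum / rows.size

  // Mimics trainDf.repartition(numModels) + one model per partition:
  // split the data into numModels chunks, each trained independently
  def buildModels(data: Seq[Double], numModels: Int): Seq[Double] = {
    require(numModels > 0 && data.nonEmpty)
    val chunkSize = (data.size + numModels - 1) / numModels
    data.grouped(chunkSize).toSeq.map(trainOne)
  }
}
```

Because each partition trains in isolation, the models come out slightly different (different data, and with the comma-separated option values above, different hyperparameters), which is what makes averaging their predictions worthwhile.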
Compile a binary on your platform
• If you get stuck in an UnsatisfiedLinkError, you need to compile a binary by yourself

$ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 -DskipTests
$ ls target
hivemall-core-0.4.2-rc.2-with-dependencies.jar
hivemall-core-0.4.2-rc.2.jar
hivemall-mixserv-0.4.2-rc.2-fat.jar
hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
hivemall-nlp-0.4.2-rc.2.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
hivemall-xgboost-0.4.2-rc.2.jar
hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
hivemall-xgboost_0.60-0.4.2-rc.2.jar
Future Work
• Rabit integration for parallel learning
  • http://dmlc.cs.washington.edu/rabit.html
• Python support
• spark.ml interface support
• Bundle more binaries for portability
  • Windows and x86 platforms
  • Others?