TRANSCRIPT
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
1) Makoto Yui @myui, Research Engineer, Treasure Data
2) Takeshi Yamamuro @maropu, Research Engineer, NTT
2016/10/26 Hadoop Summit '16, Tokyo
#hivemall
Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
Hivemall enters Apache Incubator on Sept 13, 2016 🎉
http://incubator.apache.org/projects/hivemall.html
@ApacheHivemall

Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  Ø Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  Ø Hivemall on Apache Pig
  Ø Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  Ø Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Champion
Nominated Mentors
Project mentors
• Reynold Xin <Databricks, ASF member>, Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>, Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>, Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>, Apache Bigtop/Incubator PMC member
What is Apache Hivemall
A scalable machine learning library built as a collection of Hive UDFs

github.com/myui/hivemall
hivemall.incubator.apache.org
Hivemall's Vision: ML on SQL
Classification with Hivemall
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop.
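As a rough sketch of what the query above computes (an illustration, not Hivemall's actual implementation): each map task trains a logistic regression model by SGD over its own split of `train`, and the reducers then average the weights per feature. A minimal single-process Python model of that flow:

```python
import math
from collections import defaultdict

def train_sgd(examples, eta=0.1, epochs=5):
    """One "map task": logistic regression trained by SGD over its split."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:  # features: {name: value}, label: 0/1
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return w

def average_models(models):
    """The reducer side: avg(weight) ... GROUP BY feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for w in models:
        for f, v in w.items():
            sums[f] += v
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}

# two "map tasks", each training on half of a toy dataset
data = [({"x": 1.0}, 1), ({"x": -1.0}, 0)] * 20
model = average_models([train_sgd(data[:20]), train_sgd(data[20:])])
```

The averaged model keeps a positive weight on `x`, since positive `x` correlates with the positive label in the toy data.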
Hivemall's Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, Mesos
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ Random Forest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ Random Forest Regression

SCW is a good first choice; try Random Forest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines are a good fit where features are sparse and categorical.
List of Algorithms for Recommendation

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
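The MinHash idea above can be sketched in a few lines of Python (a toy illustration, not Hivemall's minhash UDF): sets whose MinHash signatures agree in many positions are likely similar under Jaccard similarity.

```python
import random

def minhash_signature(items, num_hashes=100, seed=42):
    """One signature slot per hash function: the minimum hash value
    over all items of the set."""
    rnd = random.Random(seed)
    prime = (1 << 31) - 1
    params = [(rnd.randrange(1, prime), rnd.randrange(0, prime))
              for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % prime for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """The fraction of agreeing slots estimates Jaccard similarity."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

a = set(range(0, 100))
b = set(range(50, 150))  # true Jaccard(a, b) = 50/150 = 1/3
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

Comparing fixed-size signatures instead of the full sets is what makes MinHash attractive as an LSH scheme for large-scale k-NN.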
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
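As a minimal sketch of matrix completion via matrix factorization (a toy SGD implementation, not Hivemall's UDF; hyperparameters are illustrative):

```python
import random

def matrix_factorization(ratings, n_users, n_items, k=2, eta=0.05,
                         reg=0.02, epochs=300, seed=1):
    """SGD over observed (user, item, rating) triples; a missing cell
    (u, i) is then predicted as the dot product of P[u] and Q[i]."""
    rnd = random.Random(seed)
    P = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_users)]
    Q = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - reg * pu)
                Q[i][f] += eta * (err * pu - reg * qi)
    return P, Q

# toy user-item rating matrix with 6 observed cells
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 2.0)]
P, Q = matrix_factorization(ratings, n_users=3, n_items=3)

def predict(u, i):
    return sum(P[u][f] * Q[i][f] for f in range(len(P[u])))
```

`predict(u, i)` on unobserved cells gives the recommendation scores; the regularization term keeps the factors from overfitting the few observed ratings.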
The each_top_k function of Hivemall is useful for recommending top-k items.
Top-k query processing
List top-2 students for each class

Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Result:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
Top-k query processing
List top-2 students for each class
SELECT
  each_top_k(2, class, score, class, student)
    as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
Top-k query processing by RANK OVER()
• partition by class (shuffled across nodes)
• sort by class, score
• rank over()
• filter rank <= 2
Top-k query processing by EACH_TOP_K
• distribute by class (shuffled across nodes)
• sort by class
• each_top_k
• OUTPUT only K items
Comparison between RANK and EACH_TOP_K

RANK OVER():
• sort by class, score (SORTING IS HEAVY)
• rank over(), then filter rank <= 2 (NEEDS TO PROCESS ALL rows)

EACH_TOP_K:
• distribute by class, sort by class
• each_top_k OUTPUTs only K items

each_top_k is very efficient where the number of classes is large; a Bounded Priority Queue is utilized.
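The bounded-priority-queue idea can be sketched in Python (a simplified illustration of each_top_k's approach, not Hivemall's implementation): keep only a size-k min-heap per group instead of fully sorting each group.

```python
import heapq
from collections import defaultdict

def each_top_k(rows, k, key, score):
    """Keep a size-bounded min-heap per group; never sort a whole group."""
    heaps = defaultdict(list)
    for row in rows:
        h = heaps[key(row)]
        item = (score(row), row)
        if len(h) < k:
            heapq.heappush(h, item)
        elif item > h[0]:
            heapq.heapreplace(h, item)  # evict the smallest of the current top-k
    out = []
    for h in heaps.values():
        for rank, (s, row) in enumerate(sorted(h, reverse=True), start=1):
            out.append((rank, row))
    return out

# (student, class, score) rows from the slides above
rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
        (4, "b", 50), (5, "a", 70), (6, "b", 60)]
top2 = each_top_k(rows, 2, key=lambda r: r[1], score=lambda r: r[2])
```

Each group holds at most k items in memory, so the cost per row is O(log k) rather than the O(n log n) of a full sort, which is why the gap widens as the number of classes grows.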
Performance reported by a TD customer
• 1,000 students in each class
• 20 million classes

The RANK over() query does not finish in 24 hours 🙁 while EACH_TOP_K finishes in 2 hours 🙂

Refer for detail: https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
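The Feature Hashing entry above can be sketched as follows (a toy illustration: Hivemall's hashing UDFs use their own hash function, so `crc32` here is just a stand-in). Hashing maps an unbounded space of feature names into a fixed-size index space:

```python
import zlib

def hash_feature(name, num_features=2 ** 20):
    """Map an arbitrary feature name into a fixed-size index space."""
    return zlib.crc32(name.encode("utf-8")) % num_features

def hash_vectorize(features, num_features=2 ** 20):
    """features: {name: value} -> {hashed index: summed value};
    colliding names simply add their values together."""
    vec = {}
    for name, value in features.items():
        idx = hash_feature(name, num_features)
        vec[idx] = vec.get(idx, 0.0) + value
    return vec

v = hash_vectorize({"user:12345": 1.0, "ad:999": 1.0, "country:jp": 1.0})
```

The payoff is a bounded model size without a dictionary pass over the data; the cost is an occasional collision, which is usually tolerable with a large enough index space.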
Industry use cases of Hivemall

CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more

Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.

http://www.slideshare.net/eventdotsjp/hivemall
Industry use cases of Hivemall

Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo

Problem: Recommendation based on hot items is hard in the hand-crafted product market (minne.com) because each creator sells only a few one-of-a-kind items, which soon go out of stock.
Industry use cases of Hivemall

Value prediction of real estate
• Algorithm: Regression
• Livesense
Industry use cases of Hivemall

User score calculation (influencer marketing, klout.com)
• Algorithm: Regression
• Klout

bit.ly/klout-hivemall
Anomaly/Change-point Detection by ChangeFinder

An efficient algorithm for finding change points and outliers from time-series data.

J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
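ChangeFinder itself fits AR models in two smoothing stages; as a much-simplified illustration of the underlying idea (scoring each point against a model of its recent past), here is a sliding-window z-score detector. This is my simplification, not the Hivemall UDF:

```python
import math

def outlier_scores(series, window=10):
    """Score each point by its deviation from the sliding-window mean,
    in units of the window's standard deviation: a crude stand-in for
    ChangeFinder's -log p(x_t | model) style score."""
    scores = []
    for t, x in enumerate(series):
        hist = series[max(0, t - window):t]
        if len(hist) < 2:
            scores.append(0.0)
            continue
        mean = sum(hist) / len(hist)
        var = sum((h - mean) ** 2 for h in hist) / len(hist)
        std = math.sqrt(var) if var > 0 else 1.0
        scores.append(abs(x - mean) / std)
    return scores

series = [0.0] * 30 + [5.0] + [0.0] * 10  # a single spike at t = 30
scores = outlier_scores(series)
```

The spike at t = 30 gets by far the highest score; the real ChangeFinder additionally smooths the outlier scores and rescans them to separate transient outliers from sustained change points.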
Change-point detection by Singular Spectrum Transformation

• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations," Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.
Evaluation Metrics
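The visual content of this slide is not in the transcript; as a reminder of the standard classification metrics such a slide typically covers, here is a minimal sketch of precision, recall, and F1 (my addition, not taken from the slide):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary-classification metrics from 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# one true positive, one false negative, one false positive, one true negative
p, r, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```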
Feature Engineering – Feature Binning

Maps quantitative variables to a fixed number of bins based on quantiles of the distribution.

Example: map Ages into 3 bins.
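The "map Ages into 3 bins" example can be sketched as follows (a toy illustration of quantile binning, not Hivemall's UDF; the exact quantile convention is an assumption):

```python
def quantile_bins(values, num_bins=3):
    """Cut points at the 1/n, 2/n, ... quantiles of the observed values."""
    s = sorted(values)
    return [s[int(len(s) * i / num_bins)] for i in range(1, num_bins)]

def to_bin(x, cutpoints):
    """Bin index starting from 0: how many cut points are <= x."""
    return sum(c <= x for c in cutpoints)

ages = [18, 22, 25, 30, 33, 41, 45, 52, 60]
cuts = quantile_bins(ages, num_bins=3)  # two cut points -> three bins
```

Quantile-based cuts put roughly the same number of observations in each bin, unlike equal-width binning, which is skewed by outliers.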
Feature Selection – Signal Noise Ratio
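A common form of the signal-to-noise ratio for a feature under two classes is |mean_pos - mean_neg| / (std_pos + std_neg); features that score high separate the classes well. A minimal sketch under that definition (an illustration, not Hivemall's snr UDF):

```python
import math

def snr(pos_values, neg_values):
    """Signal-to-noise ratio: |mean_pos - mean_neg| / (std_pos + std_neg)."""
    def mean_std(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, math.sqrt(v)
    m1, s1 = mean_std(pos_values)
    m0, s0 = mean_std(neg_values)
    denom = s1 + s0
    return abs(m1 - m0) / denom if denom else float("inf")

# feature A separates the classes well; feature B does not
snr_a = snr([5.0, 6.0, 5.5], [1.0, 1.5, 1.2])
snr_b = snr([3.0, 4.0, 3.5], [3.1, 3.9, 3.4])
```

Ranking features by this score and keeping the top ones is the selection step.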
Feature Selection – Chi-Square
Feature Transformation – One-hot encoding

Maps a categorical variable to a unique number starting from 1.
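The indexing step described above can be sketched as follows (a toy illustration matching the slide's description, not Hivemall's UDF; the resulting 1-based indices can then serve as sparse one-hot feature positions):

```python
def encode_categories(values):
    """Assign each distinct category a unique index starting from 1,
    in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping, [mapping[v] for v in values]

mapping, encoded = encode_categories(["jp", "us", "jp", "de"])
```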
Other new features to come
✓ Spark 2.0 support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on Spark
Takeshi Yamamuro @ NTT
Hadoop Summit 2016, Tokyo, Japan
Who am I?

What's Spark?
1. Unified Engine
   • supports end-to-end apps, e.g., MLlib and Streaming
2. High-level APIs
   • easy to use, rich optimization
3. Integrates broadly
   • storages, libraries, ...
What's Hivemall on Spark?
• A Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • + some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark
Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
• There are high barriers to adding newer algorithms in MLlib

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Current Status
• Most Hivemall functions are supported in Spark v1.6 and v2.0

For Spark v2.0:

$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
Use -Pspark-1.6 for Spark v1.6.
Running an Example
1. Download a Spark binary
2. Fetch training and test data
3. Load these data in Spark
4. Build a model
5. Do predictions
Running an Example: 1. Download a Spark binary

http://spark.apache.org/downloads.html
Running an Example: 2. Fetch training and test data
• E2006 tfidf regression dataset
  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
Running an Example: 3. Load data in Spark

$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar

// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm").load("E2006.train.bz2")

scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
Running an Example: 3. Load data in Spark

A line of the libsvm-formatted data looks like:

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf is split into partitions (Partition 1 ... Partition N) and loaded in parallel because bzip2 is splittable.
Running an Example: 4. Build a model (DataFrame)

scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
Running an Example: 4. Build a model (SQL)

scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label)
  |     AS (feature, weight)
  |   FROM TrainTable
  | )
  | GROUP BY feature
""".stripMargin)
Running an Example: 5. Do predictions (DataFrame)

scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
Running an Example: 5. Do predictions (SQL)

scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT rowid, sigmoid(SUM(value * weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  |   ON t.feature = m.feature
  | GROUP BY rowid
""".stripMargin)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
ex.1 continued) The same sigmoid via the Hivemall-on-Spark wrapper:

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
• ex.2) Compute top-k for each key group

scala> :paste
df.withColumn("rank",
    rank().over(Window.partitionBy($"key").orderBy($"score".desc))
  )
  .where($"rank" <= topK)
• ex.2 continued) each_top_k fixes the overhead issue
  • See pr#353: "Implement EachTopK as a generator expression" in Spark

scala> df.each_top_k(topK, "key", "score", "value")
~4 times faster than rank!!
3rd-Party Library Integration
• XGBoost support is under development
  • a fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation
Conclusion and Takeaway

Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs. It can be used from HiveQL, SparkSQL/DataFrame API, and Pig Latin.

We welcome your contributions to Apache Hivemall 🙂

Any feature requests or questions?