TRANSCRIPT
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
1) Makoto Yui @myui, Research Engineer, Treasure Data
2) Takeshi Yamamuro @maropu, Research Engineer, NTT
2016/10/26 Hadoop Summit '16, Tokyo
#hivemall
Plan of the talk
1. Introduction to Hivemall
2. Hivemall on Spark
Hivemall enters Apache Incubator on Sept 13, 2016 🎉
http://incubator.apache.org/projects/hivemall.html
@ApacheHivemall

Initial committers
• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  Ø Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  Ø Hivemall on Apache Pig
  Ø Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  Ø Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Champion
Nominated Mentors
Project mentors
• Reynold Xin <Databricks, ASF member>, Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>, Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>, Apache Spark PMC member
• Roman Shaposhnik <Pivotal, ASF member>, Apache Bigtop/Incubator PMC member
What is Apache Hivemall
A scalable machine learning library built as a collection of Hive UDFs

github.com/myui/hivemall
hivemall.incubator.apache.org
Hivemall's Vision: ML on SQL
Classification with Hivemall
CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight -- reducers perform model averaging in parallel
FROM (
  SELECT logress(features, label, ..) as (feature, weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers
✓ Machine Learning made easy for SQL developers
✓ Interactive and Stable APIs w/ SQL abstraction

This SQL query automatically runs in parallel on Hadoop.
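As a rough sketch of what the query above computes (an illustration, not Hivemall's actual implementation): each map task trains a logistic regression model by SGD over its own split of `train`, and the reducers then average the weights per feature. A minimal single-process Python model of that flow:

```python
import math
from collections import defaultdict

def train_sgd(examples, eta=0.1, epochs=5):
    """One "map task": logistic regression trained by SGD over its split."""
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:  # features: {name: value}, label: 0/1
            z = sum(w[f] * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in features.items():
                w[f] += eta * (label - p) * v
    return w

def average_models(models):
    """The reducer side: avg(weight) ... GROUP BY feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for w in models:
        for f, v in w.items():
            sums[f] += v
            counts[f] += 1
    return {f: sums[f] / counts[f] for f in sums}

# two "map tasks", each training on half of a toy dataset
data = [({"x": 1.0}, 1), ({"x": -1.0}, 0)] * 20
model = average_models([train_sgd(data[:20]), train_sgd(data[20:])])
```

The averaged model keeps a positive weight on `x`, since positive `x` correlates with the positive label in the toy data.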
Hivemall's Technology Stack
• Machine Learning: Hivemall, MLlib
• Query Processing: Hive, Pig, SparkSQL
• Parallel Data Processing Framework: MapReduce (MRv1), Apache Tez (DAG processing), Apache Spark
• Resource Management: Apache YARN, Mesos
• Distributed File System / Cloud Storage: Hadoop HDFS, Amazon S3
List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad + RDA
✓ Factorization Machines
✓ Random Forest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ Random Forest Regression

SCW is a good first choice; try Random Forest if SCW does not work.
Logistic regression is good for getting the probability of the positive class.
Factorization Machines are a good fit where features are sparse and categorical.
List of Algorithms for Recommendation

K-Nearest Neighbor
✓ MinHash and b-Bit MinHash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)
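The MinHash idea above can be sketched in a few lines of Python (a toy illustration, not Hivemall's minhash UDF): sets whose MinHash signatures agree in many positions are likely similar under Jaccard similarity.

```python
import random

def minhash_signature(items, num_hashes=100, seed=42):
    """One signature slot per hash function: the minimum hash value
    over all items of the set."""
    rnd = random.Random(seed)
    prime = (1 << 31) - 1
    params = [(rnd.randrange(1, prime), rnd.randrange(0, prime))
              for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % prime for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """The fraction of agreeing slots estimates Jaccard similarity."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

a = set(range(0, 100))
b = set(range(50, 150))  # true Jaccard(a, b) = 50/150 = 1/3
est = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

Comparing fixed-size signatures instead of the full sets is what makes MinHash attractive as an LSH scheme for large-scale k-NN.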
Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)
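As a minimal sketch of matrix completion via matrix factorization (a toy SGD implementation, not Hivemall's UDF; hyperparameters are illustrative):

```python
import random

def matrix_factorization(ratings, n_users, n_items, k=2, eta=0.05,
                         reg=0.02, epochs=300, seed=1):
    """SGD over observed (user, item, rating) triples; a missing cell
    (u, i) is then predicted as the dot product of P[u] and Q[i]."""
    rnd = random.Random(seed)
    P = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_users)]
    Q = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - reg * pu)
                Q[i][f] += eta * (err * pu - reg * qi)
    return P, Q

# toy user-item rating matrix with 6 observed cells
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 2.0)]
P, Q = matrix_factorization(ratings, n_users=3, n_items=3)

def predict(u, i):
    return sum(P[u][f] * Q[i][f] for f in range(len(P[u])))
```

`predict(u, i)` on unobserved cells gives the recommendation scores; the regularization term keeps the factors from overfitting the few observed ratings.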
The each_top_k function of Hivemall is useful for recommending top-k items.
Top-k query processing
List top-2 students for each class

Input:
student  class  score
1        b      70
2        a      80
3        a      90
4        b      50
5        a      70
6        b      60

Result:
student  class  score
3        a      90
2        a      80
1        b      70
6        b      60
Top-k query processing
List top-2 students for each class
SELECT * FROM (
  SELECT *, rank() over (partition by class order by score desc) as rank
  FROM table
) t
WHERE rank <= 2
Top-k query processing
List top-2 students for each class
SELECT
  each_top_k(2, class, score, class, student)
    as (rank, score, class, student)
FROM (
  SELECT * FROM table
  DISTRIBUTE BY class SORT BY class
) t
Top-k query processing by RANK OVER()
• partition by class (shuffled across nodes)
• sort by class, score
• rank over()
• filter rank <= 2
Top-k query processing by EACH_TOP_K
• distribute by class (shuffled across nodes)
• sort by class
• each_top_k
• OUTPUT only K items
Comparison between RANK and EACH_TOP_K

RANK OVER():
• sort by class, score (SORTING IS HEAVY)
• rank over(), then filter rank <= 2 (NEEDS TO PROCESS ALL rows)

EACH_TOP_K:
• distribute by class, sort by class
• each_top_k OUTPUTs only K items

each_top_k is very efficient where the number of classes is large; a Bounded Priority Queue is utilized.
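The bounded-priority-queue idea can be sketched in Python (a simplified illustration of each_top_k's approach, not Hivemall's implementation): keep only a size-k min-heap per group instead of fully sorting each group.

```python
import heapq
from collections import defaultdict

def each_top_k(rows, k, key, score):
    """Keep a size-bounded min-heap per group; never sort a whole group."""
    heaps = defaultdict(list)
    for row in rows:
        h = heaps[key(row)]
        item = (score(row), row)
        if len(h) < k:
            heapq.heappush(h, item)
        elif item > h[0]:
            heapq.heapreplace(h, item)  # evict the smallest of the current top-k
    out = []
    for h in heaps.values():
        for rank, (s, row) in enumerate(sorted(h, reverse=True), start=1):
            out.append((rank, row))
    return out

# (student, class, score) rows from the slides above
rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
        (4, "b", 50), (5, "a", 70), (6, "b", 60)]
top2 = each_top_k(rows, 2, key=lambda r: r[1], score=lambda r: r[2])
```

Each group holds at most k items in memory, so the cost per row is O(log k) rather than the O(n log n) of a full sort, which is why the gap widens as the number of classes grows.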
Performance reported by a TD customer
• 1,000 students in each class
• 20 million classes

The RANK over() query does not finish in 24 hours 🙁 while EACH_TOP_K finishes in 2 hours 🙂

Refer for detail: https://speakerdeck.com/kaky0922/hivemall-meetup-20160908
Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LOF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)
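The Feature Hashing entry above can be sketched as follows (a toy illustration: Hivemall's hashing UDFs use their own hash function, so `crc32` here is just a stand-in). Hashing maps an unbounded space of feature names into a fixed-size index space:

```python
import zlib

def hash_feature(name, num_features=2 ** 20):
    """Map an arbitrary feature name into a fixed-size index space."""
    return zlib.crc32(name.encode("utf-8")) % num_features

def hash_vectorize(features, num_features=2 ** 20):
    """features: {name: value} -> {hashed index: summed value};
    colliding names simply add their values together."""
    vec = {}
    for name, value in features.items():
        idx = hash_feature(name, num_features)
        vec[idx] = vec.get(idx, 0.0) + value
    return vec

v = hash_vectorize({"user:12345": 1.0, "ad:999": 1.0, "country:jp": 1.0})
```

The payoff is a bounded model size without a dictionary pass over the data; the cost is an occasional collision, which is usually tolerable with a large enough index space.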
Industry use cases of Hivemall

CTR prediction of Ad click logs
• Algorithm: Logistic regression
• Freakout Inc., Smartnews, and more

Gender prediction of Ad click logs
• Algorithm: Classification
• Scaleout Inc.

http://www.slideshare.net/eventdotsjp/hivemall
Industry use cases of Hivemall

Item/User recommendation
• Algorithm: Recommendation
• Wish.com, GMO pepabo

Problem: Recommendation based on hot items is hard in the hand-crafted product market (minne.com) because each creator sells only a few one-of-a-kind items, which soon go out of stock.
Industry use cases of Hivemall

Value prediction of real estate
• Algorithm: Regression
• Livesense
Industry use cases of Hivemall

User score calculation (influencer marketing, klout.com)
• Algorithm: Regression
• Klout

bit.ly/klout-hivemall
Anomaly/Change-point Detection by ChangeFinder

An efficient algorithm for finding change points and outliers from time-series data.

J. Takeuchi and K. Yamanishi, "A Unifying Framework for Detecting Outliers and Change Points from Time Series," IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.
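ChangeFinder itself fits AR models in two smoothing stages; as a much-simplified illustration of the underlying idea (scoring each point against a model of its recent past), here is a sliding-window z-score detector. This is my simplification, not the Hivemall UDF:

```python
import math

def outlier_scores(series, window=10):
    """Score each point by its deviation from the sliding-window mean,
    in units of the window's standard deviation: a crude stand-in for
    ChangeFinder's -log p(x_t | model) style score."""
    scores = []
    for t, x in enumerate(series):
        hist = series[max(0, t - window):t]
        if len(hist) < 2:
            scores.append(0.0)
            continue
        mean = sum(hist) / len(hist)
        var = sum((h - mean) ** 2 for h in hist) / len(hist)
        std = math.sqrt(var) if var > 0 else 1.0
        scores.append(abs(x - mean) / std)
    return scores

series = [0.0] * 30 + [5.0] + [0.0] * 10  # a single spike at t = 30
scores = outlier_scores(series)
```

The spike at t = 30 gets by far the highest score; the real ChangeFinder additionally smooths the outlier scores and rescans them to separate transient outliers from sustained change points.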
Change-point detection by Singular Spectrum Transformation

• T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations," Proc. SDM, 2005.
• T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning," Proc. SDM, 2007.
Evaluation Metrics
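The visual content of this slide is not in the transcript; as a reminder of the standard classification metrics such a slide typically covers, here is a minimal sketch of precision, recall, and F1 (my addition, not taken from the slide):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary-classification metrics from 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# one true positive, one false negative, one false positive, one true negative
p, r, f1 = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```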
Feature Engineering – Feature Binning

Maps quantitative variables to a fixed number of bins based on quantiles of the distribution.

Example: map Ages into 3 bins.
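The "map Ages into 3 bins" example can be sketched as follows (a toy illustration of quantile binning, not Hivemall's UDF; the exact quantile convention is an assumption):

```python
def quantile_bins(values, num_bins=3):
    """Cut points at the 1/n, 2/n, ... quantiles of the observed values."""
    s = sorted(values)
    return [s[int(len(s) * i / num_bins)] for i in range(1, num_bins)]

def to_bin(x, cutpoints):
    """Bin index starting from 0: how many cut points are <= x."""
    return sum(c <= x for c in cutpoints)

ages = [18, 22, 25, 30, 33, 41, 45, 52, 60]
cuts = quantile_bins(ages, num_bins=3)  # two cut points -> three bins
```

Quantile-based cuts put roughly the same number of observations in each bin, unlike equal-width binning, which is skewed by outliers.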
Feature Selection – Signal Noise Ratio
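A common form of the signal-to-noise ratio for a feature under two classes is |mean_pos - mean_neg| / (std_pos + std_neg); features that score high separate the classes well. A minimal sketch under that definition (an illustration, not Hivemall's snr UDF):

```python
import math

def snr(pos_values, neg_values):
    """Signal-to-noise ratio: |mean_pos - mean_neg| / (std_pos + std_neg)."""
    def mean_std(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, math.sqrt(v)
    m1, s1 = mean_std(pos_values)
    m0, s0 = mean_std(neg_values)
    denom = s1 + s0
    return abs(m1 - m0) / denom if denom else float("inf")

# feature A separates the classes well; feature B does not
snr_a = snr([5.0, 6.0, 5.5], [1.0, 1.5, 1.2])
snr_b = snr([3.0, 4.0, 3.5], [3.1, 3.9, 3.4])
```

Ranking features by this score and keeping the top ones is the selection step.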
Feature Selection – Chi-Square
Feature Transformation – One-hot encoding

Maps a categorical variable to a unique number starting from 1.
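The indexing step described above can be sketched as follows (a toy illustration matching the slide's description, not Hivemall's UDF; the resulting 1-based indices can then serve as sparse one-hot feature positions):

```python
def encode_categories(values):
    """Assign each distinct category a unique index starting from 1,
    in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping, [mapping[v] for v in values]

mapping, encoded = encode_categories(["jp", "us", "jp", "de"])
```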
Other new features to come
✓ Spark 2.0 support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization
Copyright©2016 NTT corp. All Rights Reserved.
Hivemall on Spark
Takeshi Yamamuro @ NTT
Hadoop Summit 2016, Tokyo, Japan
Who am I?

What's Spark?
1. Unified Engine
   • supports end-to-end apps, e.g., MLlib and Streaming
2. High-level APIs
   • easy to use, rich optimization
3. Integrates broadly
   • storages, libraries, ...
What's Hivemall on Spark?
• A Hivemall wrapper for Spark
  • Wrapper implementations for DataFrame/SQL
  • + some utilities for ease of use in Spark
• The wrapper lets you...
  • run most Hivemall functions in Spark
  • try examples easily on your laptop
  • improve the performance of some functions in Spark
Why Hivemall on Spark?
• Hivemall already has many fascinating ML algorithms and useful utilities
• There are high barriers to adding newer algorithms in MLlib

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Current Status
• Most Hivemall functions are supported in Spark v1.6 and v2.0

For Spark v2.0:

$ git clone https://github.com/myui/hivemall
$ cd hivemall
$ mvn package -Pspark-2.0 -DskipTests
$ ls target/*spark*
target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar
...
Use -Pspark-1.6 for Spark v1.6.
Running an Example
1. Download a Spark binary
2. Fetch training and test data
3. Load these data in Spark
4. Build a model
5. Do predictions
Running an Example: 1. Download a Spark binary

http://spark.apache.org/downloads.html
Running an Example: 2. Fetch training and test data
• E2006 tfidf regression dataset
  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2
Running an Example: 3. Load data in Spark

$ <SPARK_HOME>/bin/spark-shell --conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar

// Create a DataFrame from the bzip'd libsvm-formatted file
scala> val trainDf = sqlContext.sparkSession.read.format("libsvm").load("E2006.train.bz2")

scala> trainDf.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
Running an Example: 3. Load data in Spark

A line of the libsvm-formatted data looks like:

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf is split into partitions (Partition 1 ... Partition N) and loaded in parallel because bzip2 is splittable.
Running an Example: 4. Build a model (DataFrame)

scala> :paste
val modelDf = trainDf.train_logregr($"features", $"label")
  .groupBy("feature")
  .agg("weight" -> "avg")
Running an Example: 4. Build a model (SQL)

scala> trainDf.createOrReplaceTempView("TrainTable")
scala> :paste
val modelDf = sql("""
  | SELECT feature, AVG(weight) AS weight
  | FROM (
  |   SELECT train_logregr(features, label)
  |     AS (feature, weight)
  |   FROM TrainTable
  | )
  | GROUP BY feature
""".stripMargin)
Running an Example: 5. Do predictions (DataFrame)

scala> :paste
val df = testDf.select(rowid(), $"features")
  .explode_vector($"features")
  .cache

// Do predictions
df.join(modelDf, df("feature") === modelDf("feature"), "LEFT_OUTER")
  .groupBy("rowid")
  .agg(sigmoid(sum($"weight" * $"value")))
Running an Example: 5. Do predictions (SQL)

scala> modelDf.createOrReplaceTempView("ModelTable")
scala> df.createOrReplaceTempView("TestTable")
scala> :paste
sql("""
  | SELECT rowid, sigmoid(SUM(value * weight)) AS predicted
  | FROM TestTable t
  | LEFT OUTER JOIN ModelTable m
  |   ON t.feature = m.feature
  | GROUP BY rowid
""".stripMargin)
Improve Some Functions in Spark
• Spark has overheads to call Hive UD*Fs
  • Hivemall heavily depends on them
• ex.1) Compute a sigmoid function

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))
scala> val sparkUdf = functions.udf(sigmoidFunc)
scala> df.select(sparkUdf($"value"))
ex.1 continued) The same sigmoid via the Hivemall-on-Spark wrapper:

scala> val hiveUdf = HivemallOps.sigmoid
scala> df.select(hiveUdf($"value"))
• ex.2) Compute top-k for each key group

scala> :paste
df.withColumn("rank",
    rank().over(Window.partitionBy($"key").orderBy($"score".desc))
  )
  .where($"rank" <= topK)
• ex.2 continued) each_top_k fixes the overhead issue
  • See pr#353: "Implement EachTopK as a generator expression" in Spark

scala> df.each_top_k(topK, "key", "score", "value")
~4 times faster than rank!!
3rd-Party Library Integration
• XGBoost support is under development
  • a fast implementation of gradient tree boosting
  • widely used in Kaggle competitions
• This integration will let you...
  • load built models and predict in parallel
  • build multiple models in parallel for cross-validation
Conclusion and Takeaway

Hivemall is a multi/cross-platform ML library providing a collection of machine learning algorithms as Hive UDFs/UDTFs. It can be used from HiveQL, SparkSQL/DataFrame API, and Pig Latin.

We welcome your contributions to Apache Hivemall 🙂

Any feature requests or questions?