hadoopsummit16 myui

57
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig 1). Research Engineer, Treasure Data Makoto YUI @myui 2016/10/26 Hadoop Summit '16, Tokyo 2). Research Engineer, NTT Takashi Yamamuro @maropu #hivemall

Upload: makoto-yui

Post on 16-Apr-2017

632 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Hadoopsummit16 myui

Hivemall:ScalablemachinelearninglibraryforApacheHive/Spark/Pig

1).ResearchEngineer,TreasureDataMakotoYUI@myui

2016/10/26HadoopSummit'16,Tokyo

1

2).ResearchEngineer,NTTTakashiYamamuro@maropu

#hivemall

Page 2: Hadoopsummit16 myui

Planofthetalk

1. IntroductiontoHivemall

2. HivemallonSpark

2016/10/26HadoopSummit'16,Tokyo 2

Page 3: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 3

HivemallentersApacheIncubatoronSept13,2016🎉

http://incubator.apache.org/projects/hivemall.html

@ApacheHivemall

Page 4: Hadoopsummit16 myui

• MakotoYui<TreasureData>• TakeshiYamamuro <NTT>

Ø HivemallonApacheSpark• DanielDai<Hortonworks>

Ø HivemallonApachePigØ ApachePigPMCmember

• TsuyoshiOzawa<NTT>ØApacheHadoopPMCmember

• KaiSasaki<TreasureData>2016/10/26HadoopSummit'16,Tokyo 4

Initialcommitters

Page 5: Hadoopsummit16 myui

Champion

NominatedMentors

2016/10/26HadoopSummit'16,Tokyo 5

Projectmentors

• ReynoldXin<Databricks,ASFmember>ApacheSparkPMCmember

• MarkusWeimer<Microsoft,ASFmember>ApacheREEFPMCmember

• Xiangrui Meng <Databricks,ASFmember>ApacheSparkPMCmember

• RomanShaposhnik <Pivotal,ASFmember>ApacheBigtop/IncubatorPMCmember

Page 6: Hadoopsummit16 myui

WhatisApacheHivemall

ScalablemachinelearninglibrarybuiltasacollectionofHiveUDFs

2016/10/26HadoopSummit'16,Tokyo

6

github.com/myui/hivemallhivemall.incubator.apache.org

Page 7: Hadoopsummit16 myui

Hivemall’s Vision:MLonSQL

2016/10/26HadoopSummit'16,Tokyo 7

ClassificationwithMahout

CREATETABLElr_model ASSELECTfeature,-- reducersperformmodelaveraginginparallelavg(weight)asweightFROM(SELECTlogress(features,label,..)as(feature,weight)FROMtrain)t-- map-onlytaskGROUPBYfeature;-- shuffledtoreducers

✓MachineLearningmadeeasyforSQLdevelopers✓ InteractiveandStableAPIsw/ SQLabstraction

ThisSQLqueryautomaticallyrunsinparallelonHadoop

Page 8: Hadoopsummit16 myui

HadoopHDFS

MapReduce(MRv1)

Hivemall

ApacheYARN

ApacheTezDAGprocessing

Machine Learning

Query Processing

Parallel Data Processing Framework

Resource Management

Distributed File SystemCloud Storage

SparkSQL

ApacheSpark

MESOS

Hive Pig

MLlib

Hivemall’s TechnologyStack

2016/10/26HadoopSummit'16,Tokyo 8

AmazonS3

Page 9: Hadoopsummit16 myui

ListofsupportedAlgorithmsClassification✓ Perceptron✓ PassiveAggressive(PA,PA1,PA2)✓ ConfidenceWeighted(CW)✓ AdaptiveRegularizationofWeightVectors(AROW)✓ SoftConfidenceWeighted(SCW)✓ AdaGrad+RDA✓ FactorizationMachines✓ RandomForestClassification

2016/10/26HadoopSummit'16,Tokyo 9

Regression✓LogisticRegression(SGD)✓AdaGrad (logisticloss)✓AdaDELTA (logisticloss)✓PARegression✓AROWRegression✓FactorizationMachines✓RandomForestRegression

SCW is a good first choiceTry RandomForest if SCW does not work

Logistic regression is good for getting a probability of a positive class

Factorization Machines is good where features are sparse and categorical ones

Page 10: Hadoopsummit16 myui

ListofAlgorithmsforRecommendation

2016/10/26HadoopSummit'16,Tokyo 10

K-NearestNeighbor✓ Minhash andb-BitMinhash

(LSHvariant)✓ SimilaritySearchonVectorSpace

(Euclid/Cosine/Jaccard/Angular)

MatrixCompletion✓MatrixFactorization✓ FactorizationMachines(regression)

each_top_k functionofHivemallisusefulforrecommendingtop-kitems

Page 11: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 11

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

student class score3 a 902 a 801 b 706 b 60

Listtop-2studentsforeachclass

Page 12: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 12

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

Listtop-2studentsforeachclass

SELECT*FROM(SELECT*,rank()over(partitionbyclassorderbyscoredesc)asrank

FROMtable)tWHERErank<=2

Page 13: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 13

student class score1 b 702 a 803 a 904 b 505 a 706 b 60

Top-kqueryprocessing

Listtop-2studentsforeachclass

SELECTeach_top_k(2,class,score,class,student

)as(rank,score,class,student)FROM (SELECT*FROMtableDISTRIBUTEBYclassSORTBYclass

)t

Page 14: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 14

Top-kqueryprocessingbyRANKOVER()

partitionbyclass

Node1

Sortbyclass,score

rankover()

rank>=2

Page 15: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 15

Top-kqueryprocessingbyEACH_TOP_K

distributedbyclass

Node1

Sortbyclass

each_top_k

OUTPUTonlyKitems

Page 16: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 16

ComparisonbetweenRANKandEACH_TOP_K

distributedbyclass

Sortbyclass

each_top_k

Sortbyclass,score

rankover()

rank>=2

SORTING ISHEAVY

NEEDTOPROCESSALL

OUTPUTonlyKitems

Each_top_k isveryefficientwherethenumberofclassislarge

BoundedPriorityQueueisutilized

Page 17: Hadoopsummit16 myui

PerformancereportedbyTDcustomer

2016/10/26HadoopSummit'16,Tokyo 17

•1,000studentsineachclass•20 millionclasses

RANKover()querydoesnotfinishesin24hoursLEACH_TOP_Kfinishesin2hoursJ

Referfordetailhttps://speakerdeck.com/kaky0922/hivemall-meetup-20160908

Page 18: Hadoopsummit16 myui

OtherSupportedAlgorithms

2016/10/26HadoopSummit'16,Tokyo 18

AnomalyDetection✓ LocalOutlierFactor(LoF)

FeatureEngineering✓FeatureHashing✓FeatureScaling

(normalization,z-score)✓ TF-IDFvectorizer✓ PolynomialExpansion

(FeaturePairing)✓ Amplifier

NLP✓BasicEnglist textTokenizer✓JapaneseTokenizer(Kuromoji)

Page 19: Hadoopsummit16 myui

IndustryusecasesofHivemall

CTRpredictionofAdclicklogs• Algorithm:Logisticregression• Freakout Inc.,Smartnews,andmore

GenderpredictionofAdclicklogs• Algorithm:Classification• Scaleout Inc.

2016/10/26HadoopSummit'16,Tokyo 19

http://www.slideshare.net/eventdotsjp/hivemall

Page 20: Hadoopsummit16 myui

IndustryusecasesofHivemall

Item/Userrecommendation• Algorithm:Recommendation• Wish.com,GMOpepabo

2016/10/26HadoopSummit'16,Tokyo 20

Problem:Recommendationusinghot-itemishardinhand-craftedproductmarketbecauseeachcreatorsellsfewsingleitems(willsoonbecomeout-of-stock)

minne.com

Page 21: Hadoopsummit16 myui

IndustryusecasesofHivemall

ValuepredictionofRealestates• Algorithm:Regression• Livesense

2016/10/26HadoopSummit'16,Tokyo 21

Page 22: Hadoopsummit16 myui

IndustryusecasesofHivemall

Userscorecalculation• Algrorithm:Regression• Klout

2016/10/26HadoopSummit'16,Tokyo 22

bit.ly/klout-hivemall

Influencermarketing

klout.com

Page 23: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 23

Page 24: Hadoopsummit16 myui

Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data

2016/10/26HadoopSummit'16,Tokyo 24

J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.

Anomaly/Change-pointDetectionbyChangeFinder

Page 25: Hadoopsummit16 myui

Efficientalgorithmforfindingchangepointandoutliersfromtimeseries data

2016/10/26HadoopSummit'16,Tokyo 25

Anomaly/Change-pointDetectionbyChangeFinder

J.TakeuchiandK.Yamanishi,“AUnifyingFrameworkforDetectingOutliersandChangePointsfromTimeSeries,” IEEEtransactionsonKnowledgeandDataEngineering,pp.482-492,2006.

Page 26: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 26

• T.IdeandK.Inoue,"KnowledgeDiscoveryfromHeterogeneousDynamicSystemsusingChange-PointCorrelations",Proc.SDM,2005T.

• T.IdeandK.Tsuda,"Change-pointdetectionusingKrylovsubspacelearning",Proc.SDM,2007.

Change-pointdetectionbySingularSpectrumTransformation

Page 27: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 27

EvaluationMetrics

Page 28: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 28

FeatureEngineering– FeatureBinning

Mapsquantitativevariablestofixednumberofbinsbasedonquantiles/distribution

MapAgesinto3bins

Page 29: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 29

FeatureSelection– SignalNoiseRatio

Page 30: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 30

FeatureSelection– Chi-Square

Page 31: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 31

FeatureTransformation– Onehot encoding

Mapsacategoricalvariabletoauniquenumberstartingfrom1

Page 32: Hadoopsummit16 myui

ü Spark2.0supportü XGBoost Integrationü Field-awareFactorizationMachinesü GeneralizedLinearModel

• OptimizerframeworkincludingADAM• L1/L2regularization

2016/10/26HadoopSummit'16,Tokyo 32

Othernewfeaturestocome

Page 33: Hadoopsummit16 myui

Copyright©2016 NTT corp. All Rights Reserved.

Hivemall onTakeshi Yamamuro @ NTT

Hadoop Summit 2016Tokyo, Japan

Page 34: Hadoopsummit16 myui

34Copyright©2016 NTT corp. All Rights Reserved.

Who am I?

Page 35: Hadoopsummit16 myui

35Copyright©2016 NTT corp. All Rights Reserved.

Whatʼs Spark?• 1. Unified Engine

• support end-to-end APs, e.g., MLlib and Streaming

• 2. High-level APIs• easy-to-use, rich optimization

• 3. Integrate broadly• storages, libraries, ...

Page 36: Hadoopsummit16 myui

36Copyright©2016 NTT corp. All Rights Reserved.

• Hivemall wrapper for Spark• Wrapper implementations for DataFrame/SQL• + some utilities for easy-to-use in Spark

• The wrapper makes you...• run most of Hivemall functions in Spark• try examples easily in your laptop• improve some function performance in Spark

Whatʼs Hivemall on Spark?

Page 37: Hadoopsummit16 myui

37Copyright©2016 NTT corp. All Rights Reserved.

• Hivemall already has many fascinating ML algorithms and useful utilities

• High barriers to add newer algorithms in MLlib

Whyʼs Hivemall on Spark?

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Page 38: Hadoopsummit16 myui

38Copyright©2016 NTT corp. All Rights Reserved.

• Most of Hivemall functions supported in Spark v1.6 and v2.0

Current Status

- For Spark v2.0

$ git clone https://github.com/myui/hivemall$ cd hivemall$ mvn package –Pspark-2.0 –DskipTests$ ls target/*spark*target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar...

Page 39: Hadoopsummit16 myui

39Copyright©2016 NTT corp. All Rights Reserved.

• Most of Hivemall functions supported in Spark v1.6 and v2.0

Current Status

- For Spark v2.0

$ git clone https://github.com/myui/hivemall$ cd hivemall$ mvn package –Pspark-2.0 –DskipTests$ ls target/*spark*target/hivemall-spark-2.0_2.11-XXX-with-dependencies.jar...

-Pspark-1.6 for Spark v1.6

Page 40: Hadoopsummit16 myui

40Copyright©2016 NTT corp. All Rights Reserved.

• 1. Download a Spark binary• 2. Fetch training and test data• 3. Load these data in Spark• 4. Build a model• 5. Do predictions

Running an Example

Page 41: Hadoopsummit16 myui

41Copyright©2016 NTT corp. All Rights Reserved.

1. Download a Spark binary Running an Example

http://spark.apache.org/downloads.html

Page 42: Hadoopsummit16 myui

42Copyright©2016 NTT corp. All Rights Reserved.

• E2006 tfidf regression dataset• http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#E2006-tfidf

2. Fetch training and test data Running an Example

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2

$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2

Page 43: Hadoopsummit16 myui

43Copyright©2016 NTT corp. All Rights Reserved.

3. Load data in Spark Running an Example

$ <SPARK_HOME>/bin/spark-shell-conf spark.jars=hivemall-spark-2.0_2.11-XXX-with-dependencies.jar

// Create DataFrame from the bzip’d libsvm-formatted filescala> val trainDf = sqlContext.sparkSession.read.format("libsvm”).load(“E2006.train.bz2")

scala> trainDf.printSchemaroot|-- label: double (nullable = true)|-- features: vector (nullable = true)

Page 44: Hadoopsummit16 myui

44Copyright©2016 NTT corp. All Rights Reserved.

3. Load data in Spark Running an Example

0.000357499151147113 6066:0.00079327062196048 6069:0.000311377727123504 6070:0.000306754934580457 6071:0.000276992485786437 6072:0.00039663531098024 6074:0.00039663531098024 6075:0.00032548335…

trainDf

Partition1 Partition2 Partition3 PartitionN…

Load in parallel becausebzip2 is splittable

Page 45: Hadoopsummit16 myui

45Copyright©2016 NTT corp. All Rights Reserved.

4. Build a model - DataFrame Running an Example

scala> paste:val modelDf = trainDf.train_logregr($"features", $"label")

.groupBy("feature”)

.agg("weight" -> "avg”)

Page 46: Hadoopsummit16 myui

46Copyright©2016 NTT corp. All Rights Reserved.

4. Build a model - SQL Running an Example

scala> trainDf.createOrReplaceTempView("TrainTable")scala> paste:val modelDf = sql("""

| SELECT feature, AVG(weight) AS weight| FROM (| SELECT train_logregr(features, label)| AS (feature, weight)| FROM TrainTable| )| GROUP BY feature

""".stripMargin)

Page 47: Hadoopsummit16 myui

47Copyright©2016 NTT corp. All Rights Reserved.

5. Do predictions - DataFrame Running an Example

scala> paste:val df = testDf.select(rowid(), $"features").explode_vector($"features").cache

# Do predictionsdf.join(modelDf, df("feature") === model("feature"), "LEFT_OUTER").groupBy("rowid").avg(sigmoid(sum($"weight" * $"value")))

Page 48: Hadoopsummit16 myui

48Copyright©2016 NTT corp. All Rights Reserved.

5. Do predictions - SQL Running an Example

scala> modelDf.createOrReplaceTempView(”ModelTable")scala> df.createOrReplaceTempView(”TestTable”)scala> paste:sql("""

| SELECT rowid, sigmoid(value * weight) AS predicted| FROM TrainTable t| LEFT OUTER JOIN ModelTable m| ON t.feature = m.feature| GROUP BY rowid

""".stripMargin)

Page 49: Hadoopsummit16 myui

49Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

scala> val sigmoidFunc = (d: Double) => 1.0 / (1.0 + Math.exp(-d))scala> val sparkUdf = functions.udf(sigmoidFunc)scala> df.select(sparkUdf($“value”))

Page 50: Hadoopsummit16 myui

50Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

scala> val hiveUdf = HivemallOps.sigmoidscala> df.select(hiveUdf($“value”))

Page 51: Hadoopsummit16 myui

51Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.1) Compute a sigmoid function

Improve Some Functions in Spark

Page 52: Hadoopsummit16 myui

52Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.2) Compute top-k for each key group

Improve Some Functions in Spark

scala> paste:df.withColumn(“rank”,rank().over(Window.partitionBy($"key").orderBy($"score".desc)

) .where($"rank" <= topK)

Page 53: Hadoopsummit16 myui

53Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.2) Compute top-k for each key group

• Fixed the overhead issue for each_top_k• See pr#353: “Implement EachTopK as a generator expression“ in Spark

Improve Some Functions in Spark

scala> df.each_top_k(topK, “key”, “score”, “value”)

Page 54: Hadoopsummit16 myui

54Copyright©2016 NTT corp. All Rights Reserved.

• Spark has overheads to call Hive UD*Fs• Hivemall heavily depends on them

• ex.2) Compute top-k for each key group

Improve Some Functions in Spark

~4 times faster than rank!!

Page 55: Hadoopsummit16 myui

55Copyright©2016 NTT corp. All Rights Reserved.

• supports under development• fast implementation of the gradient tree boosting• widely used in Kaggle competitions

• This integration will make you...• load built models and predict in parallel• build multiple models in parallel

for cross-validation

3rd–Party Library Integration

Page 56: Hadoopsummit16 myui

ConclusionandTakeawayHivemallisamulti/cross-platformMLlibraryprovidingacollectionofmachinelearningalgorithmsasHiveUDFs/UDTFs

2016/10/26HadoopSummit'16,Tokyo 56

WewelcomeyourcontributionstoApacheHivemallJ

HiveQL SparkSQL/Dataframe API PigLatin

Page 57: Hadoopsummit16 myui

2016/10/26HadoopSummit'16,Tokyo 57

Anyfeaturerequestorquestions?