Dots20161029 myui

Apache Hivemall: Machine Learning Library for Apache Hive/Spark/Pig
Research Engineer Makoto YUI @myui <[email protected]>
2016/10/29 @Dots

Transcript of Dots20161029 myui


A little about me…

➢ 2015.04~ Research Engineer at Treasure Data, Inc.
  • My mission is developing ML-as-a-Service in a Hadoop-as-a-Service company
➢ 2010.04-2015.03 Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
  • Developed Hivemall as a personal research project
➢ 2009.03 Ph.D. in Computer Science from NAIST
  • Majored in Parallel Data Processing, not ML then
➢ Visiting scholar at CWI, Amsterdam and Univ. Edinburgh

Treasure Data

• Hiro Yoshikawa, CEO: open source business veteran
• Kaz Ota, CTO: founder of the world’s largest Hadoop group
• Sada Furuhashi, Chief Architect: invented Fluentd and MessagePack

Timeline
• Today: 100+ employees, 30M+ funding
• 2015: New office in Seoul, Korea
• 2013: New office in Tokyo, Japan
• 2012: Founded in Mountain View, CA

Investors
• Jerry Yang, Yahoo! founder
• Bill Tai, angel investor
• Yukihiro Matsumoto, Ruby inventor
• Sierra Ventures (Tim Guleri), enterprise software
• Scale Ventures (Andy Vitus), B2B SaaS

Big Data Stats in Treasure Data

We Open-source! TD invented..

• Streaming log collector
• Bulk data import/export
• Efficient binary serialization
• Streaming query processor
• Machine learning on Hadoop
• Workflow engine (beta): digdag.io

Treasure Data’s Solution

Agenda

1. What is Hivemall (introduction)
2. How to use Hivemall
3. Roadmap and coming new features

Hivemall entered the Apache Incubator on Sept 13, 2016 🎉

hivemall.incubator.apache.org
@ApacheHivemall

Initial committers

• Makoto Yui <Treasure Data>
• Takeshi Yamamuro <NTT>
  ➢ Hivemall on Apache Spark
• Daniel Dai <Hortonworks>
  ➢ Hivemall on Apache Pig
  ➢ Apache Pig PMC member
• Tsuyoshi Ozawa <NTT>
  ➢ Apache Hadoop PMC member
• Kai Sasaki <Treasure Data>

Project mentors

Nominated Mentors
• Reynold Xin <Databricks, ASF member>: Apache Spark PMC member
• Markus Weimer <Microsoft, ASF member>: Apache REEF PMC member
• Xiangrui Meng <Databricks, ASF member>: Apache Spark PMC member

Champion
• Roman Shaposhnik <Pivotal, ASF member>: Apache Bigtop/Incubator PMC member

What is Apache Hivemall

Scalable machine learning library built as a collection of Hive UDFs

Multi/Cross platform, Versatile, Scalable, Ease-of-use

Hivemall is easy and scalable…

Classification with Mahout

    CREATE TABLE lr_model AS
    SELECT
      feature,                -- reducers perform model averaging in parallel
      avg(weight) as weight
    FROM (
      SELECT logress(features, label, ..) as (feature, weight)
      FROM train
    ) t                       -- map-only task
    GROUP BY feature;         -- shuffled to reducers

Ease-of-use: ML made easy for SQL developers
Scalable: born to be parallel and scalable

This SQL query automatically runs in parallel on a Hadoop cluster

Hivemall is Multi/Cross platform..

Hivemall is a multi/cross-platform ML library: HiveQL, SparkSQL/DataFrame API, Pig Latin

Prediction models built by Hive can be used from Spark, and conversely, prediction models built by Spark can be used from Hive

Hivemall’s Technology Stack

Hivemall on Apache Hive

Hivemall on Apache Spark DataFrame

Hivemall on SparkSQL

Hivemall on Apache Pig

Hivemall is a Versatile library..

✓ Hivemall is not only for machine learning
✓ Hivemall provides a bunch of generic utility functions

Each organization has its own set of UDFs for data preprocessing!
Don’t Repeat Yourself!

Hivemall generic functions

• Array and Map
• Bit and compress
• String and NLP

We welcome contributions of your generic UDFs to Hivemall!

List of supported Algorithms

Classification
✓ Perceptron
✓ Passive Aggressive (PA, PA1, PA2)
✓ Confidence Weighted (CW)
✓ Adaptive Regularization of Weight Vectors (AROW)
✓ Soft Confidence Weighted (SCW)
✓ AdaGrad+RDA
✓ Factorization Machines
✓ RandomForest Classification

Regression
✓ Logistic Regression (SGD)
✓ AdaGrad (logistic loss)
✓ AdaDELTA (logistic loss)
✓ PA Regression
✓ AROW Regression
✓ Factorization Machines
✓ RandomForest Regression

SCW is a good first choice. Try RandomForest if SCW does not work.

Logistic regression is good for getting the probability of a positive class.

Factorization Machines is good where features are sparse and categorical.

List of Algorithms for Recommendation

K-Nearest Neighbor
✓ Minhash and b-Bit Minhash (LSH variant)
✓ Similarity Search on Vector Space (Euclid/Cosine/Jaccard/Angular)

Matrix Completion
✓ Matrix Factorization
✓ Factorization Machines (regression)

The each_top_k function of Hivemall is useful for recommending top-k items

Top-k query processing

List top-2 students for each class

Input:

    student  class  score
    1        b      70
    2        a      80
    3        a      90
    4        b      50
    5        a      70
    6        b      60

Output:

    student  class  score
    3        a      90
    2        a      80
    1        b      70
    6        b      60

    SELECT * FROM (
      SELECT *, rank() over (partition by class order by score desc) as rank
      FROM table
    ) t
    WHERE rank <= 2

    SELECT
      each_top_k(2, class, score, class, student) as (rank, score, class, student)
    FROM (
      SELECT * FROM table
      DISTRIBUTE BY class SORT BY class
    ) t

Top-k query processing by RANK OVER()

[Diagram: partition by class → Node 1: sort by class, score → rank over() → filter rank <= 2]

Top-k query processing by EACH_TOP_K

[Diagram: distribute by class → Node 1: sort by class → each_top_k → output only K items]

Comparison between RANK and EACH_TOP_K

• RANK over(): sort by class, score → rank over() → filter rank <= 2
  ➢ SORTING IS HEAVY
  ➢ NEED TO PROCESS ALL rows
• EACH_TOP_K: distribute by class → sort by class → each_top_k
  ➢ OUTPUT only K items
  ➢ A bounded priority queue is utilized

each_top_k is very efficient where the number of classes is large
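The bounded priority queue idea can be sketched in plain Python. This is only an illustration of the technique, not Hivemall's implementation: the function name `each_top_k_sketch` is hypothetical, and the sketch keeps one small heap per class, whereas the query on the slide sorts by class first.

```python
import heapq
from collections import defaultdict

def each_top_k_sketch(rows, k):
    """Return the k highest-scoring (score, student) pairs for each class.

    A min-heap of at most k entries is kept per class, so memory per class
    is bounded by k and the class is never fully sorted.
    """
    heaps = defaultdict(list)
    for student, cls, score in rows:
        heap = heaps[cls]
        if len(heap) < k:
            heapq.heappush(heap, (score, student))
        elif (score, student) > heap[0]:
            # The new row beats the smallest of the current top k: swap it in.
            heapq.heapreplace(heap, (score, student))
    # Emit the survivors of each class from best to worst.
    return {cls: sorted(heap, reverse=True) for cls, heap in heaps.items()}

# The student table from the slides: (student, class, score)
rows = [(1, "b", 70), (2, "a", 80), (3, "a", 90),
        (4, "b", 50), (5, "a", 70), (6, "b", 60)]
top2 = each_top_k_sketch(rows, 2)
```

Each incoming row costs at most O(log k) work, which is why this approach avoids the full per-class sort that a window function needs.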

Performance reported by TD customer

• 1,000 students in each class
• 20 million classes

RANK over() query does not finish in 24 hours ☹
EACH_TOP_K finishes in 2 hours ☺

For details, see https://speakerdeck.com/kaky0922/hivemall-meetup-20160908

Other Supported Algorithms

Anomaly Detection
✓ Local Outlier Factor (LoF)

Feature Engineering
✓ Feature Hashing
✓ Feature Scaling (normalization, z-score)
✓ TF-IDF vectorizer
✓ Polynomial Expansion (Feature Pairing)
✓ Amplifier

NLP
✓ Basic English text Tokenizer
✓ Japanese Tokenizer (Kuromoji)

Industry use cases of Hivemall

• CTR prediction of Ad click logs
  • Algorithm: Logistic regression
  • Freakout Inc., Smartnews, and more
• Gender prediction of Ad click logs
  • Algorithm: Classification
  • Scaleout Inc.

http://www.slideshare.net/eventdotsjp/hivemall

• Item/User recommendation
  • Algorithm: Recommendation
  • Wish.com, GMO pepabo (minne.com)
• Value prediction of Real estates
  • Algorithm: Regression
  • Livesense
• User score calculation
  • Algorithm: Regression
  • Klout (klout.com, influencer marketing)

bit.ly/klout-hivemall

Churn Detection of Monthly Payment Service

OISIX, a leading food delivery service company in Japan, used Hivemall’s Logistic Regression to get churn probability

The churn rate dropped by almost half by giving gift points to customers predicted to leave ☺


How to use Hivemall

[Diagram: machine learning workflow. Training takes feature vectors and labels and produces a prediction model; prediction applies the model to a feature vector to produce a label. Current step: Data preparation]

How to use Hivemall - Data preparation

Define a Hive table for training/testing data

    CREATE EXTERNAL TABLE e2006tfidf_train (
      rowid int,
      label float,
      features ARRAY<STRING>
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      COLLECTION ITEMS TERMINATED BY ","
    STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';

[Diagram: the same machine learning workflow. Current step: Feature Engineering]

How to use Hivemall - Feature Engineering

Applying Min-Max Feature Normalization

Transforming a label value to a value between 0.0 and 1.0

    create view e2006tfidf_train_scaled as
    select
      rowid,
      rescale(label, ${min_label}, ${max_label}) as label,
      features
    from
      e2006tfidf_train;
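Min-max normalization itself is a one-line formula. The Python sketch below shows the arithmetic only; `rescale_sketch` is a hypothetical name, and the handling of a zero-width range is an arbitrary choice made for this example rather than documented Hivemall behavior.

```python
def rescale_sketch(value, min_value, max_value):
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range; returning 0.5 is an arbitrary choice for this sketch.
        return 0.5
    return (value - min_value) / (max_value - min_value)

labels = [2.0, 4.0, 10.0]
lo, hi = min(labels), max(labels)   # play the role of ${min_label}, ${max_label}
scaled = [rescale_sketch(v, lo, hi) for v in labels]
```

In the query on the slide, the minimum and maximum are passed in as variables, so they would be computed over the training set beforehand.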

[Diagram: the same machine learning workflow. Current step: Training]

How to use Hivemall - Training

Training by logistic regression

    CREATE TABLE lr_model AS
    SELECT
      feature,
      avg(weight) as weight
    FROM (
      SELECT logress(features, label, ..) as (feature, weight)
      FROM train
    ) t
    GROUP BY feature

• Map-only task to learn a prediction model
• Shuffle map outputs to reducers by feature
• Reducers perform model averaging in parallel
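The train-then-average pattern can be imitated in a few lines of Python. This is a minimal sketch, assuming plain SGD on log-loss with a fixed learning rate; `train_partition`, `average_models`, `eta`, and `epochs` are hypothetical names and values, and the code does not reproduce what logress does internally.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_partition(rows, eta=0.1, epochs=20):
    """'Map' side: fit a sparse logistic regression by SGD on one partition.

    rows: list of (features, label); features is {name: value}, label is 0 or 1.
    Returns {feature: weight}.
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            p = sigmoid(sum(w[f] * v for f, v in features.items()))
            for f, v in features.items():
                w[f] += eta * (label - p) * v  # gradient step on log-loss
    return dict(w)

def average_models(models):
    """'Reduce' side: group by feature and average the weights, like avg(weight)."""
    grouped = defaultdict(list)
    for model in models:
        for f, weight in model.items():
            grouped[f].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}

# Two partitions of a toy dataset, each trained independently.
part1 = [({"x": 1.0, "bias": 1.0}, 1), ({"x": -1.0, "bias": 1.0}, 0)]
part2 = [({"x": 2.0, "bias": 1.0}, 1), ({"x": -2.0, "bias": 1.0}, 0)]
model = average_models([train_partition(part1), train_partition(part2)])
```

Because each partition is trained without talking to the others, the only data exchanged is the (feature, weight) pairs that get averaged.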

How to use Hivemall - Training

Training of Confidence Weighted Classifier

    CREATE TABLE news20b_cw_model1 AS
    SELECT
      feature,
      voted_avg(weight) as weight
    FROM (
      SELECT train_cw(features, label) as (feature, weight)
      FROM news20b_train
    ) t
    GROUP BY feature

Vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
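One plausible reading of the voting step, sketched in Python: the sign held by the majority of weights wins, and only weights of that sign are averaged. The function name `voted_avg_sketch`, voting by count, and the tie-breaking rule are assumptions made for illustration; the slide does not spell them out.

```python
def voted_avg_sketch(weights):
    """Average only the weights whose sign wins a majority vote.

    Assumptions for this sketch: the vote is by count, and ties go to
    the negative side.
    """
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if len(positives) > len(negatives):
        return sum(positives) / len(positives)
    if negatives:
        return sum(negatives) / len(negatives)
    return 0.0

# The example weights from the slide: four positive votes, one negative.
result = voted_avg_sketch([+0.7, +0.3, +0.2, -0.1, +0.7])
```

With the slide's example, the positive side wins the vote, so the single negative weight is ignored and the result is the mean of the four positive weights.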

[Diagram: the same machine learning workflow. Current step: Prediction]

How to use Hivemall - Prediction

Prediction is done by LEFT OUTER JOIN between test data and prediction model

No need to load the entire model into memory

    CREATE TABLE lr_predict AS
    SELECT
      t.rowid,
      sigmoid(sum(m.weight)) as prob
    FROM
      testing_exploded t LEFT OUTER JOIN
      lr_model m ON (t.feature = m.feature)
    GROUP BY t.rowid
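The join-based prediction can be mimicked in Python with a dictionary lookup per feature. This is a rough sketch under stated assumptions: `predict` is a hypothetical helper, the model table is a dict, each weight is multiplied by the feature value (which equals the plain sum of weights when every value is 1), and unmatched features contribute 0, whereas in SQL a sum over only NULLs would be NULL.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, model):
    """Emulate LEFT OUTER JOIN on feature followed by GROUP BY rowid.

    testing_exploded: iterable of (rowid, feature, value), one row per feature.
    model: {feature: weight}, standing in for the model table.
    """
    totals = defaultdict(float)
    for rowid, feature, value in testing_exploded:
        weight = model.get(feature)  # None plays the role of SQL NULL
        # Simplification: an unmatched feature adds 0 instead of NULL.
        totals[rowid] += 0.0 if weight is None else weight * value
    return {rowid: sigmoid(total) for rowid, total in totals.items()}

model = {"103": -0.49, "105": 0.13}
rows = [(1, "103", 1.0), (1, "105", 1.0), (2, "999", 1.0)]
probs = predict(rows, model)
```

Only the weights of features that actually occur in a test row are looked up, which mirrors the point that the whole model never has to sit in memory at once.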

Real-time prediction

[Diagram: batch training on Hadoop produces a prediction model from feature vectors and labels; the prediction model is exported; online prediction on an RDBMS maps a feature vector to a label]

bit.ly/hivemall-rtp

Export Prediction Model to an RDBMS

Prediction model → TD export → any RDBMS

Periodical export is very easy in Treasure Data

Prediction model:

    103  -0.4896543622016907
    104  -0.0955817922949791
    105   0.12560302019119263
    106   0.09214721620082855

Real-time Prediction on MySQL

    SELECT
      sigmoid(sum(t.value * m.weight)) as prob
    FROM
      testing_exploded t LEFT OUTER JOIN
      prediction_model m ON (t.feature = m.feature)

Index lookups are very efficient in RDBMSs!

Online Prediction by Apache Streaming

RandomForest in Hivemall

Ensemble of Decision Trees

Training of RandomForest

Prediction of RandomForest


Roadmap

• IP clearance and project/repository site setup
• Create contribution guidelines
• Move repository from GitHub to ASF
• Add more tests and documentation
• Initial Apache release will be in Dec or Jan

Anomaly/Change-point Detection by ChangeFinder

Efficient algorithm for finding change points and outliers from time series data

J. Takeuchi and K. Yamanishi, “A Unifying Framework for Detecting Outliers and Change Points from Time Series,” IEEE Transactions on Knowledge and Data Engineering, pp. 482-492, 2006.

Change-point detection by Singular Spectrum Transformation

Fewer hyper-parameters than ChangeFinder ☺

T. Ide and K. Inoue, "Knowledge Discovery from Heterogeneous Dynamic Systems using Change-Point Correlations", Proc. SDM, 2005.
T. Ide and K. Tsuda, "Change-point detection using Krylov subspace learning", Proc. SDM, 2007.

Evaluation Metrics

Feature Engineering – Feature Binning

Maps quantitative variables to a fixed number of bins based on quantiles/distribution

Example: map ages into 3 bins
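Quantile-based binning can be illustrated with a short Python sketch. The helper names `quantile_cut_points` and `assign_bin`, the sample ages, and the simple index-based quantile rule are all assumptions made for this example; they are not taken from the slides.

```python
from bisect import bisect_right

def quantile_cut_points(values, num_bins):
    """Return num_bins - 1 cut points taken at evenly spaced quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]

def assign_bin(value, cut_points):
    """Map a value to a bin index in 0 .. len(cut_points)."""
    return bisect_right(cut_points, value)

# Hypothetical ages, mapped into 3 bins.
ages = [23, 45, 16, 37, 62, 29, 51, 8, 70]
cuts = quantile_cut_points(ages, 3)
binned = [assign_bin(a, cuts) for a in ages]
```

Taking cut points from quantiles rather than fixed-width ranges keeps the bins roughly equally populated even when the variable is skewed.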

Feature Selection – Signal Noise Ratio

Feature Selection – Chi-Square

Feature Transformation – Onehot encoding

Maps a categorical variable to a unique number starting from 1
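As a rough Python illustration of that mapping: each distinct category gets an integer id starting from 1, and the id then selects the position of the single 1 in a one-hot vector. The function names and the choice to number categories in order of first appearance are assumptions for this sketch.

```python
def index_categories(values):
    """Assign each distinct category a unique integer id, starting from 1,
    in order of first appearance."""
    ids = {}
    for v in values:
        if v not in ids:
            ids[v] = len(ids) + 1
    return ids

def one_hot(value, ids):
    """Build a dense 0/1 vector with a single 1 at position id - 1."""
    vec = [0] * len(ids)
    vec[ids[value] - 1] = 1
    return vec

colors = ["red", "green", "red", "blue"]
ids = index_categories(colors)
encoded = [ids[c] for c in colors]
```

In a sparse feature-vector setting, the integer id alone is usually enough, since it can serve directly as the feature index.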

Other new features to come

✓ Spark 2.0 DataFrame support
✓ XGBoost Integration
✓ Field-aware Factorization Machines
✓ Generalized Linear Model
  • Optimizer framework including ADAM
  • L1/L2 regularization

Conclusion and Takeaway

Hivemall is a machine learning library that is…

Multi/Cross platform, Versatile, Scalable, Ease-of-use

Hivemall’s Positioning

➢ For Data Engineers who need ML
➢ Deep Learning is out of scope
➢ Recommendation is high-priority for us

We welcome your contributions to Apache Hivemall ☺

hivemall.incubator.apache.org

Any questions or comments?