where search meets machine learning: presented by diana hu & joaquin delgado, verizon

Where Search Meets Machine Learning Diana Hu @sdianahu — Data Science Lead, Verizon Joaquin Delgado @joaquind — Director of Engineering, Verizon

Upload: lucidworks

Post on 16-Apr-2017


Category:

Technology



TRANSCRIPT

Page 1: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Where Search Meets Machine Learning Diana Hu @sdianahu — Data Science Lead, Verizon

Joaquin Delgado @joaquind — Director of Engineering, Verizon

Page 2: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Disclaimer


The content of this presentation reflects the authors’ personal statements and does not officially represent their employer’s views in any way. The included content is especially not intended to convey the views of OnCue or Verizon.


Page 3: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Index

1.  Introduction
2.  Search and Information Retrieval
3.  ML problems as Search-based Systems
4.  ML Meets Search!

Page 4: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Introduction

Page 5: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Scaling learning systems is hard!

•  Millions of users, items

•  Billions of features

•  Imbalanced Datasets

•  Complex Distributed Systems

•  Many algorithms have not been tested at “Internet Scale”

Page 6: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Typical approaches

•  Distributed systems – Fault tolerance, Throughput vs. latency

•  Parallelization Strategies – Hashing, trees

•  Processing – Map reduce variants, MPI, graph parallel

•  Databases – Key/Value Stores, NoSQL

Such a custom system requires TLC

Page 7: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Search and Information Retrieval

Page 8: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Search

Search is about finding specific things that are either known or assumed to exist. Discovery is about helping the user encounter what he/she didn’t even know existed.

•  Focused on Search: Search Engines, Database Systems
•  Focused on Discovery: Recommender Systems, Advertising

Predicate Logic and Declarative Languages Rock!

Page 9: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Search stack

[Diagram: search stack. Components: Documents, Representation Function (for documents and for queries), Doc Representation, Index, Input Query, Query Representation, Similarity Calculation, Matched Hits, Retrieved Documents, Relevance Feedback (*), Metadata Engineering (*). The stack is split into Online Processing and Offline Processing. (*) Optional]

Page 10: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Relevance: Vector Space Model
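The slide body for the vector space model did not survive in the transcript. As a minimal sketch of the idea, assuming raw term counts as weights (a real engine would typically use TF-IDF or a similar weighting), the query and each document become sparse vectors and relevance is the cosine of the angle between them. The class and method names here are hypothetical, not part of Lucene or Solr.

```java
import java.util.HashMap;
import java.util.Map;

final class VectorSpaceSketch {

    // Builds a raw term-frequency vector from whitespace-separated text.
    static Map<String, Double> termFrequencies(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1.0, Double::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two sparse vectors; 0 when either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> query = termFrequencies("machine learning search");
        Map<String, Double> doc1 = termFrequencies("search meets machine learning");
        Map<String, Double> doc2 = termFrequencies("distributed key value stores");
        System.out.println("doc1: " + cosine(query, doc1));
        System.out.println("doc2: " + cosine(query, doc2));
    }
}
```

Documents that share no terms with the query score zero, which is why the same representation can double as a cheap matching step before any heavier ranking.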

Page 11: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Search Engines: the big hammer

•  Search engines are largely used to solve non-IR search problems, because:
   •  Widely Available
   •  Fast and Scalable
   •  Integrates well with existing data stores

Page 12: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

But… Are we using the right tool?

•  Search Engines were originally designed for IR.
•  Complex non-IR search tasks sometimes require a two-phase approach:
   Phase 1) Filter
   Phase 2) Rank
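The two-phase approach can be sketched in plain Java as follows. This is only an in-memory illustration under assumed names (`Item`, `filterThenRank`, the `category` and `popularity` fields are all hypothetical); a search engine would run the filter against an index rather than a list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

final class FilterThenRankSketch {

    // Hypothetical item type used only for this illustration.
    static final class Item {
        final String id;
        final String category;
        final double popularity;

        Item(String id, String category, double popularity) {
            this.id = id;
            this.category = category;
            this.popularity = popularity;
        }
    }

    // Phase 1 keeps candidates that pass the boolean filter.
    // Phase 2 orders the survivors by descending score and keeps the top k.
    static <T> List<T> filterThenRank(List<T> candidates,
                                      Predicate<T> filter,
                                      ToDoubleFunction<T> score,
                                      int k) {
        List<T> matched = new ArrayList<>();
        for (T candidate : candidates) {
            if (filter.test(candidate)) {
                matched.add(candidate);
            }
        }
        matched.sort(Comparator.comparingDouble(score).reversed());
        int limit = Math.max(0, Math.min(k, matched.size()));
        return new ArrayList<>(matched.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Item> catalog = new ArrayList<>();
        catalog.add(new Item("m1", "movie", 0.4));
        catalog.add(new Item("s1", "show", 0.9));
        catalog.add(new Item("m2", "movie", 0.8));
        catalog.add(new Item("m3", "movie", 0.6));

        List<Item> top = filterThenRank(
                catalog,
                item -> item.category.equals("movie"),
                item -> item.popularity,
                2);
        for (Item item : top) {
            System.out.println(item.id + " " + item.popularity);
        }
    }
}
```

Keeping the filter and the scoring function as separate arguments mirrors the split the slide describes: the filter decides what is eligible, the score decides the order.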

Page 13: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Finding commonalities

[Diagram: Relevance (aka Ranking) shown as the commonality among RecSys, Discovery, IR Search and Advertising.]

Page 14: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

ML problems as Search-based Systems

Page 15: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Machine Learning

Machine learning, and supervised learning in particular, refers to techniques used to learn how to classify or score previously unseen objects based on a training dataset.

Inference and Generalization are the Key!

Page 16: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Supervised learning pipeline

Page 17: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Learning systems’ stack

[Diagram: learning systems’ stack. Layers: Visualization / UI; Retrieval; Ranking; Query Generation and Contextual Pre-filtering; Contextual Post Filtering; Model Building; Index Building; Data/Events Collections; Data Analytics; Experimentation. The stack is split into Online and Offline.]

Page 18: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Case study: Recommender Systems

•  Reduce information load by estimating relevance
•  Ranking (aka Relevance) Approaches:
   •  Collaborative filtering
   •  Content Based
   •  Knowledge Based
   •  Hybrid
•  Beyond rating prediction and ranking
   •  Business filtering logic
   •  Low latency and Scale

Page 19: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

RecSys: Content based models

•  Rec Task: Given a user profile, find the best matching items by their attributes
•  Similarity calculation: based on keyword overlap between user/items
   •  Neighborhood method (e.g. nearest neighbor)
   •  Query-based retrieval (e.g. Rocchio’s method)
   •  Probabilistic methods (classical text classification)
   •  Explicit decision models
•  Feature representation: based on content analysis
   •  Vector space model
   •  TF-IDF
   •  Topic Modeling
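As a rough illustration of similarity by keyword overlap, the sketch below scores each item by the Jaccard overlap between its attribute keywords and the user profile, then returns the nearest item. Jaccard is one possible overlap measure chosen here for brevity, not necessarily the one the authors used, and all names (`ContentBasedSketch`, `bestMatch`) are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class ContentBasedSketch {

    // Jaccard overlap: |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Returns the id of the item whose keywords overlap most with the profile,
    // or null when there are no items. Ties keep the first item encountered.
    static String bestMatch(Set<String> userProfile, Map<String, Set<String>> items) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : items.entrySet()) {
            double score = jaccard(userProfile, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> profile = new HashSet<>(Arrays.asList("scifi", "space", "drama"));
        Map<String, Set<String>> items = new LinkedHashMap<>();
        items.put("A", new HashSet<>(Arrays.asList("scifi", "space", "action")));
        items.put("B", new HashSet<>(Arrays.asList("romance", "drama")));
        System.out.println(bestMatch(profile, items));
    }
}
```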

Page 20: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

RecSys: Collaborative Filtering

[Diagram: collaborative filtering. Components: Rating Dataset, Matrix Factorization, User Factors, Item Factors, Input Query, Re-Ranking Model, Recommendations. The pipeline is split into Online Processing and Offline Processing.]
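One common way to use the user and item factors at query time, sketched here under the assumption that the factors were already learned offline, is to score each item by the dot product of its factor vector with the user's and return the top k. The factor values and names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FactorScoringSketch {

    // Predicted affinity is the dot product of user and item factor vectors.
    static double dot(double[] u, double[] v) {
        if (u.length != v.length) {
            throw new IllegalArgumentException("factor vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) {
            sum += u[i] * v[i];
        }
        return sum;
    }

    // Online step: rank all items for one user and keep the k best ids.
    static List<String> recommend(double[] userFactors,
                                  Map<String, double[]> itemFactors,
                                  int k) {
        List<Map.Entry<String, double[]>> entries = new ArrayList<>(itemFactors.entrySet());
        entries.sort((a, b) -> Double.compare(
                dot(userFactors, b.getValue()),
                dot(userFactors, a.getValue())));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.5};
        Map<String, double[]> items = new LinkedHashMap<>();
        items.put("a", new double[] {2.0, 0.0});
        items.put("b", new double[] {0.0, 2.0});
        items.put("c", new double[] {1.0, 4.0});
        System.out.println(recommend(user, items, 2));
    }
}
```

Scoring every item per request is the part that becomes expensive at scale, which is one motivation for pushing retrieval and scoring into a search engine.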

Page 21: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

RecSys: Collaborative Filtering

[Diagram: collaborative filtering. Components: Rating Dataset, Matrix Factorization, User Factors, Item Factors, Input Query, Re-Ranking Model, Recommendations. The pipeline is split into Online Processing and Offline Processing.]

Page 22: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

ML Meets Search!

[Diagram labels: ML, Search]

Page 23: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Remember the elephant?

[Diagram: learning systems’ stack. Layers: Visualization / UI; Retrieval; Ranking; Query Generation and Contextual Pre-filtering; Contextual Post Filtering; Model Building; Index Building; Data/Events Collections; Data Analytics; Experimentation. The stack is split into Online and Offline.]

Page 24: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Simplifying the stack!

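[Diagram: learning systems’ stack with Retrieval, Contextual Post Filtering and Ranking set apart from the remaining layers: Visualization / UI; Query Generation and Contextual Pre-filtering; Model Building; Index Building; Data/Events Collections; Data Analytics; Experimentation. The stack is split into Online and Offline.]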

Page 25: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Search stack

[Diagram: search stack. Components: Documents, Representation Function (for documents and for queries), Doc Representation, Index, Input Query, Query Representation, Similarity Calculation, Matched Hits, Retrieved Documents, Relevance Feedback (*), Metadata Engineering (*). The stack is split into Online Processing and Offline Processing. (*) Optional]

Page 26: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Simplifying the Search stack

[Diagram: search stack annotated with the learning-system layers Retrieval, Contextual Post Filtering and Ranking, plus two added components: ML-Scoring Plugin and Serialized ML Model. Search stack components: Documents, Representation Function, Doc Representation, Index, Input Query, Query Representation, Similarity Calculation, Matched Hits, Retrieved Documents, Relevance Feedback (*), Metadata Engineering (*). Split into Online Processing and Offline Processing. (*) Optional]

Page 27: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

ML-Scoring architecture

[Diagram: ML-scoring architecture. Components: Instances + Labels, Trainer + Indexer, Serialized ML Model, Instances Index, Lucene/Solr, ML Scoring Plugin. Split into Online Processing and Offline Processing.]

Page 28: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

ML-Scoring Options

•  Option A: Solr FunctionQuery
   •  Pro: Model is just a query!
   •  Cons: Limits expressiveness of models
•  Option B: Solr Custom Function Query
   •  Pro: Loading any type of model (also PMML)
   •  Cons: Memory limitations, also multiple model reloading
•  Option C: Lucene CustomScoreQuery
   •  Pro: Can use PMML and tune how PMML gets loaded
   •  Cons: No control on matches
•  Option D: Lucene Low level Custom Query
•  *Mahout vectors from Lucene text (only trains, so not an option)

Page 29: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Real-life Problem

•  Census database that contains documents with the following fields:
   1.  Age: continuous
   2.  Workclass: 8 values
   3.  Fnlwgt: continuous
   4.  Education: 16 values
   5.  Education-num: continuous
   6.  Marital-status: 7 values
   7.  Occupation: 14 values
   8.  Relationship: 6 values
   9.  Race: 5 values
   10. Sex: Male, Female
   11. Capital-gain: continuous
   12. Capital-loss: continuous
   13. Hours-per-week: continuous
   14. Native-country: 41 values
   15. >50K Income: Yes, No

•  Task is to predict whether a person makes more than 50k a year based on their attributes

Page 30: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

1) Learn from the (training) data

Train with your favorite ML Framework:

•  Naïve Bayes
•  SVM
•  Logistic Regression
•  Decision Trees

Page 31: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Option A: Just a Solr Function Query

q="sum(C,
       product(age, w1),
       product(Workclass, w2),
       product(Fnlwgt, w3),
       product(Education, w4),
       ….)"

Serialized ML Model as Query: Y_prediction = C + XB

[Diagram labels: Trainer + Indexer]
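The mapping from a linear model to a function query string can be sketched as below. This only builds the string with Solr's `sum` and `product` functions and checks it against a direct evaluation of C + XB. The field names, intercept and weights are hypothetical, and the sketch assumes numeric fields; categorical fields such as Workclass would need a numeric encoding first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LinearModelQuerySketch {

    // Serializes intercept + sum(weight_i * field_i) as a Solr function query string.
    static String toFunctionQuery(double intercept, Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder("sum(").append(intercept);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            sb.append(",product(")
              .append(e.getKey())
              .append(',')
              .append(e.getValue())
              .append(')');
        }
        return sb.append(')').toString();
    }

    // Reference evaluation of the same linear model for one document.
    static double predict(double intercept,
                          Map<String, Double> weights,
                          Map<String, Double> doc) {
        double y = intercept;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            y += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
        }
        return y;
    }

    public static void main(String[] args) {
        // Hypothetical coefficients; a real model would come from the trainer.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("age", 0.03);
        weights.put("hours_per_week", 0.02);

        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("age", 40.0);
        doc.put("hours_per_week", 50.0);

        System.out.println(toFunctionQuery(-1.5, weights));
        System.out.println(predict(-1.5, weights, doc));
    }
}
```

Because the whole model lives in the query string, retraining means regenerating a string rather than redeploying code, which is the appeal of Option A.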

Page 32: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

May result in a crazy Solr functionQuery

See more at https://wiki.apache.org/solr/FunctionQuery

q=dismax&bf="ord(education-num)^0.5 recip(rord(age),1,1000,1000)^0.3"

Page 33: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

What about models like this?

Page 34: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

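Option B: Custom Solr FunctionQuery

1.  Subclass org.apache.solr.search.ValueSourceParser.

public class MyValueSourceParser extends ValueSourceParser {
    // Called once with the arguments configured in solrconfig.xml
    public void init(NamedList namedList) {
        …
    }

    // Called per function invocation; newer Solr releases (4.0 and later)
    // declare SyntaxError here instead of ParseException.
    public ValueSource parse(FunctionQParser fqp) throws ParseException {
        return new MyValueSource();
    }
}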

2.  In solrconfig.xml, register your new ValueSourceParser directly under the <config> tag:

<valueSourceParser name="myfunc" class="com.custom.MyValueSourceParser" />

3.  Subclass org.apache.solr.search.ValueSource and instantiate it in ValueSourceParser.parse()

Page 35: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Option C: Lucene CustomScoreQuery

2C) Serialize model with PMML
•  Can use the JPMML library to read the serialized model in Lucene
•  In Lucene, you will need to implement an extension with JPMML-evaluator to take vectors as expected

3C) In Lucene:
•  Override CustomScoreQuery: load PMML
•  Create CustomScoreProvider: do model PMML data marshaling
•  Rescoring: PMML evaluation

Page 36: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Predictive Model Markup Language

•  Why use PMML
   •  Allows users to build a model in one system
   •  Export model and deploy it in a different environment for prediction
   •  Fast iteration: from research to deployment to production
•  Model is an XML document with:
   •  Header: description of model, and where it was generated
   •  DataDictionary: defines fields used by model
   •  Model: structure and parameters of model
•  http://dmg.org/pmml/v4-2-1/GeneralStructure.html
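To make the three parts concrete, the sketch below embeds a small hand-written PMML-style document (Header, DataDictionary, and a RegressionModel) and reads the declared fields with the JDK's DOM parser. The sample document is hypothetical and has not been validated against the PMML schema, and the class is not part of JPMML; it only shows where each part sits in the XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PmmlStructureSketch {

    // Hypothetical, hand-written sample; field names and coefficients are made up.
    static final String SAMPLE =
          "<PMML version=\"4.2\" xmlns=\"http://www.dmg.org/PMML-4_2\">"
        + "<Header description=\"hypothetical income model\"/>"
        + "<DataDictionary numberOfFields=\"3\">"
        + "<DataField name=\"age\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"hours_per_week\" optype=\"continuous\" dataType=\"double\"/>"
        + "<DataField name=\"income\" optype=\"continuous\" dataType=\"double\"/>"
        + "</DataDictionary>"
        + "<RegressionModel functionName=\"regression\">"
        + "<MiningSchema>"
        + "<MiningField name=\"age\"/>"
        + "<MiningField name=\"hours_per_week\"/>"
        + "<MiningField name=\"income\" usageType=\"target\"/>"
        + "</MiningSchema>"
        + "<RegressionTable intercept=\"-1.5\">"
        + "<NumericPredictor name=\"age\" coefficient=\"0.03\"/>"
        + "<NumericPredictor name=\"hours_per_week\" coefficient=\"0.02\"/>"
        + "</RegressionTable>"
        + "</RegressionModel>"
        + "</PMML>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse PMML sample", e);
        }
    }

    // Header: free-text description of the model.
    static String headerDescription(String xml) {
        Element header = (Element) parse(xml).getElementsByTagName("Header").item(0);
        return header.getAttribute("description");
    }

    // DataDictionary: names of all fields the model declares.
    static List<String> dataFieldNames(String xml) {
        NodeList fields = parse(xml).getElementsByTagName("DataField");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fields.getLength(); i++) {
            names.add(((Element) fields.item(i)).getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(headerDescription(SAMPLE));
        System.out.println(dataFieldNames(SAMPLE));
    }
}
```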

Page 37: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Example: Train in Spark to PMML

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/path/to/file")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

// Cluster the data into three classes using KMeans
val numIterations = 20
val numClusters = 3
val kmeansModel = KMeans.train(data, numClusters, numIterations)

// Export clustering model to PMML
kmeansModel.toPMML("/path/to/kmeans.xml")

Page 38: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

PMML XML File

Page 39: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Overriding scores with CustomScoreQuery

[Diagram: Lucene Query (Find next Match, Score) feeding CustomScoreQuery and CustomScoreProvider (Rescore Doc, New Score).]

*Credit to Doug Turnbull’s Hacking Lucene for Custom Search Results

Page 40: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Overriding scores with CustomScoreQuery

•  Matching remains
•  Scoring overridden

[Diagram: Lucene Query (Find next Match, Score) feeding CustomScoreQuery and CustomScoreProvider (Rescore Doc, New Score).]

*Credit to Doug Turnbull’s Hacking Lucene for Custom Search Results
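The "matching remains, scoring overridden" behaviour can be mimicked without Lucene as a loop that keeps the matched document ids untouched and replaces only their scores. The `ScoreProvider` interface below is a plain-Java stand-in written for this sketch; it is not Lucene's `CustomScoreProvider`, although its method is named after it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RescoreSketch {

    // Stand-in for the rescoring hook: receives the doc id and the original score.
    interface ScoreProvider {
        float customScore(int doc, float subQueryScore);
    }

    // The set of matches is decided by the wrapped query and is never changed here;
    // only the score attached to each matched doc is replaced.
    static Map<Integer, Float> rescore(Map<Integer, Float> matches, ScoreProvider provider) {
        Map<Integer, Float> rescored = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> match : matches.entrySet()) {
            int doc = match.getKey();
            rescored.put(doc, provider.customScore(doc, match.getValue()));
        }
        return rescored;
    }

    public static void main(String[] args) {
        Map<Integer, Float> matches = new LinkedHashMap<>();
        matches.put(1, 0.5f);
        matches.put(7, 2.0f);

        // Hypothetical model: ignore the text score and score by doc id.
        Map<Integer, Float> out = rescore(matches, (doc, score) -> doc * 10.0f);
        System.out.println(out);
    }
}
```

A model plugged in this way can reorder results but cannot add or drop documents, which is the limitation listed for Option C.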

Page 41: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Implementing CustomScoreQuery

1.  Given a normal Lucene Query, use a CustomScoreQuery to wrap it

TermQuery q = new TermQuery(term);
MyCustomScoreQuery mcsq = new MyCustomScoreQuery(q);
// Make sure the query has all fields needed by PMML!

Page 42: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Implementing CustomScoreQuery

2.  Initialize PMML

PMML pmml = ...;
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
ModelEvaluator<?> modelEvaluator = modelEvaluatorFactory.newModelManager(pmml);
Evaluator evaluator = (Evaluator) modelEvaluator;

Page 43: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Implementing CustomScoreQuery

3.  Rescore each doc with IndexReader and docID

public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
    // Lucene reader: fetch the term vector stored for this doc and field
    IndexReader r = context.reader();
    Terms tv = r.getTermVector(doc, _field);
    TermsEnum tenum = null;
    tenum = tv.iterator(tenum);

    // Convert the iterator order to the fields needed by the model
    TermsEnum tenumPMML = tenum2PMML(tenum, evaluator.getActiveFields());

Page 44: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Implementing CustomScoreQuery

3.  Rescore each doc with IndexReader and docID

    // Marshal data into PMML
    Map<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName, FieldValue>();
    List<FieldName> activeFields = evaluator.getActiveFields();

    for (FieldName activeField : activeFields) {
        // The raw values have been ordered to match the fields the model needs
        Object rawValue = tenumPMML.next();
        FieldValue activeValue = evaluator.prepare(activeField, rawValue);
        arguments.put(activeField, activeValue);
    }

Page 45: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Implementing CustomScoreQuery

3.  Rescore each doc with IndexReader and docID

    // Rescore and evaluate with PMML
    Map<FieldName, ?> results = evaluator.evaluate(arguments);
    FieldName targetName = evaluator.getTargetField();
    Object targetValue = results.get(targetName);

    // Assumes a numeric target; a direct (float) cast fails unless the value is a Float
    return ((Number) targetValue).floatValue();
}

Page 46: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Potential issues

•  Performance
   •  If search space is very large
   •  If model complexity explodes (e.g. kernel expansion)
•  Operations
   •  Code is running on key infrastructure
   •  Versioning
   •  Binary Compatibility

Page 47: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Option D: Low Level Lucene

•  CustomScoreQuery or Custom FunctionScore can’t control matches
•  If you want custom matches and scoring…
•  Implement:
   •  Custom Query Class
   •  Custom Weight Class
   •  Custom Scorer Class
•  http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Page 48: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

Conclusion

•  Importance of the full picture – Learning systems seen through the lens of the whole elephant

•  Reducing the time from science to production is complicated

•  Scalability is hard!
•  Why not have ML use Search at its core during online eval?

•  Solr and Lucene are a start to customize your learning system

Page 49: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

We are Hiring!

Contact me at [email protected]

@sdianahu Q&A

Page 50: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon

OCTOBER 13-16, 2016 • AUSTIN, TX

Page 51: Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado, Verizon