1 CONFIDENTIAL | Thinking Lucene Think Lucid Grant Ingersoll Chief Scientist Lucid Imagination Enhancing Discovery with Solr and Mahout

Upload: dwayne-arnold

Post on 29-Dec-2015


TRANSCRIPT

Page 1: Enhancing Discovery with Solr and Mahout

Grant Ingersoll
Chief Scientist
Lucid Imagination

Page 2: Evolution

Documents
• Models
• Feature Selection

User Interaction
• Clicks
• Ratings/Reviews
• Learning to Rank
• Social Graph

Queries
• Phrases
• NLP

Content Relationships
• Page Rank, etc.
• Organization

Page 3: Minding the Intersection

[Diagram: intersection of Search, Discovery, and Analytics]

Page 4: Topics

Background
– Apache Mahout
– Apache Solr and Lucene

Recommendations with Mahout
– Collaborative Filtering

Discovery with Solr and Mahout

Discussion

Page 5: Apache Lucene in a Nutshell

http://lucene.apache.org/java

Java-based Application Programming Interface (API) for adding search and indexing functionality to applications

Fast and efficient scoring and indexing algorithms

Lots of contributions to make common tasks easier:
– Highlighting, spatial, Query Parsers, Benchmarking tools, etc.

Most widely deployed search library on the planet

Page 6: Apache Solr in a Nutshell

http://lucene.apache.org/solr

Lucene-based Search Server + other features and functionality

Access Lucene over HTTP:
– Java, XML, Ruby, Python, .NET, JSON, PHP, etc.

Most programming tasks in Lucene are taken care of in Solr

Faceting (guided navigation, filters, etc.)

Replication and distributed search support

Lucene Best Practices

Page 7: Apache Mahout in a Nutshell

An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
– http://mahout.apache.org

The Three C’s:
– Collaborative Filtering (recommenders)
– Clustering
– Classification

Others:
– Frequent Item Mining
– Primitive collections
– Math stuff

http://dictionary.reference.com/browse/mahout

Page 8: Recommendations with Mahout

Page 9: Recommenders

Collaborative Filtering (CF)
– Provide recommendations based solely on preferences expressed between users and items
– “People who watched this also watched that”

Content-based Recommendations (CBR)
– Provide recommendations based on the attributes of the items and the user profile
– ‘Modern Family’ is a sitcom, Bob likes sitcoms
  • => Suggest Modern Family to Bob

Mahout is geared towards CF, but can be extended to do CBR
– Classification can also be used for CBR

Aside: search engines can also solve these problems
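The "people who watched this also watched that" idea can be sketched with plain co-occurrence counting. This is only an illustration in Python, not Mahout code: the viewing histories, the `also_watched` function name, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical viewing histories (user -> set of shows); not taken from the slides.
HISTORY = {
    "alice": {"Modern Family", "Parks and Recreation", "The Office"},
    "bob": {"Modern Family", "The Office"},
    "carol": {"Modern Family", "Parks and Recreation"},
    "dave": {"Dracula"},
}


def also_watched(history, item, top_n=3):
    """Return (show, count) pairs for shows watched by users who also watched `item`."""
    counts = Counter()
    for shows in history.values():
        if item in shows:
            counts.update(shows - {item})
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    print(also_watched(HISTORY, "Modern Family"))
```

Note that nothing here looks at what a show is about; only the user-item interactions are used, which is what separates CF from content-based recommendation.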

Page 10: To Rate or Not?

In many instances, users don’t provide actual ratings
– Clicks, views, etc.

Non-Boolean ratings can also often introduce unnecessary noise
– Even a low rating often has a positive correlation with highly rated items in the real world

Example: Should we recommend Frankenstein to Bob?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -

Page 11: Collaborative Filtering with Mahout

Extensive framework for collaborative filtering

Recommenders
– User based
– Item based
– Slope One

Online and Offline support
– Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
User n    0.8      0.7          0.1

=> Recommendations for User X

Page 12: User Similarity

[Diagram: Users 1 to 4 and Items 1 to 4]

What should we recommend for User 1?
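The slide's diagram did not survive in the transcript, so the following Python sketch uses made-up ratings to show how a user-based recommender might answer the question. Cosine similarity and the weighted-average prediction are one common choice, assumed here for illustration; the names `cosine` and `recommend_user_based` are hypothetical and are not Mahout's API.

```python
import math

# Hypothetical ratings (user -> {item: rating}); the slide's actual data is not available.
RATINGS = {
    "User 1": {"Item 1": 5.0, "Item 2": 4.0},
    "User 2": {"Item 1": 5.0, "Item 2": 4.0, "Item 3": 5.0},
    "User 3": {"Item 1": 1.0, "Item 2": 2.0, "Item 4": 2.0},
    "User 4": {"Item 3": 4.0, "Item 4": 1.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend_user_based(ratings, user, k=2):
    """Predict ratings for unseen items from the k most similar users."""
    target = ratings[user]
    neighbors = sorted(
        ((cosine(target, prefs), other)
         for other, prefs in ratings.items() if other != user),
        reverse=True,
    )[:k]
    totals, weights = {}, {}
    for sim, other in neighbors:
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item not in target:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predictions = {item: totals[item] / weights[item] for item in totals}
    return sorted(predictions.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(recommend_user_based(RATINGS, "User 1"))
```

With this data, User 2 is the closest neighbor of User 1, so Item 3 (which User 2 rated highly) ranks first.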

Page 13: Item Similarity

[Diagram: Users 1 to 4 and Items 1 to 4]

What should we recommend for User 1?
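For comparison, an item-based recommender compares item columns instead of user rows. The sketch below is again Python with hypothetical ratings (the slide's diagram data is not in the transcript); the cosine measure and the function names are assumptions for illustration, not Mahout's implementation.

```python
import math

# Hypothetical ratings (user -> {item: rating}); the slide's actual data is not available.
RATINGS = {
    "User 1": {"Item 1": 5.0, "Item 2": 4.0},
    "User 2": {"Item 1": 5.0, "Item 2": 4.0, "Item 3": 5.0},
    "User 3": {"Item 1": 1.0, "Item 2": 2.0, "Item 4": 2.0},
    "User 4": {"Item 3": 4.0, "Item 4": 1.0},
}


def item_vectors(ratings):
    """Transpose user -> item ratings into item -> user ratings."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, rating in prefs.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend_item_based(ratings, user):
    """Score each unseen item by its similarity to the items the user already rated."""
    vectors = item_vectors(ratings)
    rated = ratings[user]
    predictions = {}
    for candidate, vector in vectors.items():
        if candidate in rated:
            continue
        numerator = denominator = 0.0
        for item, rating in rated.items():
            sim = cosine(vector, vectors[item])
            numerator += sim * rating
            denominator += sim
        if denominator > 0:
            predictions[candidate] = numerator / denominator
    return sorted(predictions.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    print(recommend_item_based(RATINGS, "User 1"))
```

Item-item similarities tend to change more slowly than user-user similarities, which is one reason they are often precomputed offline.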

Page 14: Slope One

Intuition: There is a linear relationship between rated items
– Y = mX + b where m = 1

Solve for b up front based on existing ratings: b = (Y - X)
– Find the average difference in preference value for every pair of items

Online can be very fast, but requires up-front computation and memory

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5

Item 1 (User B) = 3 + 1.5 = 4.5
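The worked example above can be reproduced with a short Python sketch. It uses the slide's two users and two items; the function names are hypothetical, and weighting each difference by the number of co-rating users (the "weighted" Slope One variant) is an assumption, since the slide does not say which variant it has in mind.

```python
from collections import defaultdict

# Ratings from the slide: user A rated both items, user B rated only Item 2.
RATINGS = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_diffs(ratings):
    """Precompute b = average(rating_i - rating_j) for every co-rated item pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for i, rating_i in prefs.items():
            for j, rating_j in prefs.items():
                if i != j:
                    sums[(i, j)] += rating_i - rating_j
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts


def predict(ratings, user, target):
    """Predict the user's rating for `target` as (known rating + average diff)."""
    diffs, counts = average_diffs(ratings)
    numerator = 0.0
    denominator = 0
    for item, rating in ratings[user].items():
        pair = (target, item)
        if pair in diffs:
            numerator += (rating + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


if __name__ == "__main__":
    print(predict(RATINGS, "B", "Item 1"))  # 3 + 1.5 = 4.5, as on the slide
```

In a real deployment the `average_diffs` step is the up-front computation the slide mentions; here it is recomputed on every call only to keep the sketch short.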

Page 15: Online and Offline Recommendations

Online
– Predates Hadoop
– Designed to run on a single node
  • Matrix size of ~100M interactions
– API for integrating with your application

Offline
– Hadoop based
– Designed to run on a large cluster
– Several approaches:
  • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

Page 16: RecommenderJob

Essentially does matrix multiplication using distributed techniques

$MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101   102   103   104   105        User A        Recs
101     7     2     0     1     3           3.0           30
102     2     8     3     5     2           0             37
103     0     3     3     6     4      X    4.0      =    38
104     1     5     6     4     7           3.0           53
105     3     2     4     7     9           2.0           64
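The multiplication on this slide can be checked with a few lines of Python. This is a single-process illustration of the arithmetic only, not how the distributed job is implemented; the function names are hypothetical, and treating a preference of 0 as "not yet rated" is an assumption.

```python
# Item-item matrix and User A's preference vector, copied from the slide.
ITEMS = [101, 102, 103, 104, 105]
CO_OCCURRENCE = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]
USER_A = [3.0, 0.0, 4.0, 3.0, 2.0]


def score(matrix, prefs):
    """Multiply the item-item matrix by the user's preference vector."""
    return [sum(cell * pref for cell, pref in zip(row, prefs)) for row in matrix]


def recommend(items, matrix, prefs):
    """Keep only items the user has no preference for, best score first."""
    scores = score(matrix, prefs)
    unrated = [(item, s) for item, s, pref in zip(items, scores, prefs) if pref == 0]
    return sorted(unrated, key=lambda pair: -pair[1])


if __name__ == "__main__":
    print(score(CO_OCCURRENCE, USER_A))             # matches the Recs column
    print(recommend(ITEMS, CO_OCCURRENCE, USER_A))  # item 102 is the only candidate
```

Items the user already has a preference for are filtered out, which leaves item 102 with a score of 37 as the recommendation.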

Page 17: Discovery with Solr

Page 18: Discovery with Solr

Goals:
– Guide users to results without having to guess at keywords
– Encourage serendipity
– Never show empty results

Out of the Box:
– Faceting
– Spell Checking
– More Like This
– Clustering (Carrot2)

Extend
– Clustering (with Mahout)
– Frequent Item Mining (with Mahout)

Page 19: Clustering

Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content

Solr has search result clustering
– Pluggable
– Default implementation uses Carrot2

Mahout has Hadoop-based large-scale clustering
– K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

Page 20: Discovery In Action

Pre-reqs:
– Apache Ant 1.7.x, Subversion (SVN)

Command Line 1:
– svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
– cd solr-trunk/solr/
– ant example
– cd example
– java -Dsolr.clustering.enabled=true -jar start.jar

Command Line 2:
– cd exampledocs; java -jar post.jar *.xml

http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Page 21: Solr + Mahout

Page 22: Basics

Most Mahout tasks are offline

Solr provides many touch points for integration:
– ClusteringEngine
  • Clustering results
– SearchComponent
  • Suggestions: related searches, clusters, MLT, spellchecking
– UpdateProcessor
  • Classification of documents
– FunctionQuery

Page 23: Example: Frequent Itemset Mining

Discover frequently co-occurring items

Use Case: Related Searches from Solr Logs

Hadoop and sequential versions
– Parallel FP Growth

Input:
– <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
– Comma, pipe also allowed as delimiters
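To make "frequently co-occurring items" concrete, here is a small Python sketch that reads lines in the input format described above and counts itemsets by brute force. It is not FP-Growth and not Mahout code: it enumerates every combination up to a small size, which only works for toy data. The sample lines, function names, and the `min_support` and `max_size` parameters are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical input lines: optional id, TAB, then space-separated tokens.
LINES = [
    "1\tsolr faceting",
    "2\tsolr faceting tutorial",
    "3\tmahout clustering",
    "4\tsolr clustering",
]


def parse_transaction(line):
    """Drop the optional id before the TAB and split tokens on space, comma, or pipe."""
    _, _, tokens = line.strip().rpartition("\t")
    return {token for token in re.split(r"[\s,|]+", tokens) if token}


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every itemset up to max_size and keep those seen at least min_support times."""
    counts = Counter()
    for tokens in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(tokens), size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    transactions = [parse_transaction(line) for line in LINES]
    for itemset, support in sorted(frequent_itemsets(transactions).items()):
        print(itemset, support)
```

FP-Growth reaches the same kind of result without enumerating all combinations, which is what makes it practical on real query logs.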

Page 24: FIM on Solr Query Logs

Goal:
– Extract user queries from Solr logs
– Feed into FIM to generate Related Keyword Searches

Context:
– Solr Query logs
– bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
– bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
– bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
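The extraction step can be approximated in Python to see what the regular expression is doing. This is a sketch, not a reimplementation of Mahout's regexconverter: the log lines are hypothetical request-style strings, the pattern is a simplified variant of the one on the slide (a character class in the lookbehind, and it also stops at whitespace), and the function names are invented for the example.

```python
import re
from urllib.parse import unquote_plus

# Simplified variant of the slide's pattern: capture the value of the q parameter.
# Assumption: the value ends at the next '&' or at whitespace.
Q_PARAM = re.compile(r"(?<=[?&]q=)[^&\s]*")


def extract_query(log_line):
    """Return the URL-decoded q parameter from a log line, or None if absent."""
    match = Q_PARAM.search(log_line)
    if match is None:
        return None
    return unquote_plus(match.group(0))


def to_fpg_line(query):
    """Format a query as space-separated tokens, one transaction per line."""
    return " ".join(query.split())


if __name__ == "__main__":
    # Hypothetical log lines; real Solr log formats may differ.
    lines = [
        "GET /solr/select?q=chris+hostetter&rows=10 HTTP/1.1",
        "GET /solr/select?fq=type:pdf&q=faceted%20search HTTP/1.1",
        "GET /solr/admin/ping HTTP/1.1",
    ]
    for line in lines:
        query = extract_query(line)
        if query:
            print(to_fpg_line(query))
```

URL decoding here plays roughly the role that `--transformerClass url` appears to play in the command above, and the space-joined output corresponds to the token format the FIM input expects.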

Page 25: Output

Key: Chris
Value:
([Chris, Hostetter],870),
([Chris],870),
([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
([Search, Faceted, Chris, Hostetter],18),
([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
([Solr, new, Chris, Hostetter, webcast, along],12),
([Solr, new, Chris, Hostetter, webcast],12),
([Solr, new, Chris, Hostetter],12)

Page 26: Resources

http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
[email protected]
@gsingers

Page 27: Appendix

Page 28: Mahout Overview

[Architecture diagram; component labels:]
– Applications
– Examples
– Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic
– Utilities/Integration: Lucene/Vectorizer
– Math: Vectors/Matrices/SVD
– Collections (primitives)
– Apache Hadoop

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms