jake mannix, lead data engineer, lucidworks at mlconf sea - 5/20/16

A Practical Data Science Workbench: spark-solr

Jake Mannix

@pbrane

Lead Data Engineer, Lucidworks

$ whoami

Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D

Previously:
• Allen Institute for AI: semantic search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems

Prehistory:
• other software companies, algebraic topology, particle cosmology

Cold Start

Imagine you jumped into a new Data Lake…

• What is the “Minimum Viable Big Data Science Toolkit”?
• DB? Distributed FS? NoSQL store?
• ML libraries / frameworks (scripting? notebook? REPL?)
• text analysis or graph libraries?
• dataviz package?
• hosting layer (for models and/or POC apps)?

Overview

• Spark and Solr for Data Engineering
• Why Solr?
• Why Spark?
• Example rapid turnaround workflow: Searchhub
• data exploration
• clustering: unsupervised ML
• classification: supervised ML
• recommenders: collaborative filtering + content-based + “mixed-mode”

Practical Data Science with Spark and Solr

Why does Solr need Spark?

Why does Spark need Solr?

Why does Spark need Solr?

Typical Hadoop / Spark data-engineering task, start with some data on HDFS:

$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 63043884 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 79770856 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 72108179 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 12150481 Feb 4 18:22 part-00004.lzo

Now what? What’s in these files?

Solr gives you:

• random access data store

• full-text search

• fast aggregate statistics

• just starting out: no HDFS / S3 necessary!

• world-class multilingual text analytics:

• no more: tokens = str.toLowerCase().split("\\s+")

• relevancy / ranking

• realtime REST service layer / web console

Solr Key Features

• Apache Lucene
• Grouping and joins
• Streaming parallel SQL
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive scale / fault tolerance
• Full-text search (information retrieval)
• Facets / guided navigation galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication

Why Spark for Solr?

• spark-shell: a Big Data REPL with all your fave JVM libs!

• Build the index in parallel very, very quickly

• Aggregations

• Boosts, stats, iterative global computations

• Offline compute to update index with additional info (e.g. PageRank, popularity)

• Whole corpus analytics and ML: clustering, classification, CF, rankers

• General-purpose distributed computation

• Joins with other storage (Cassandra, HDFS, DB, HBase)

Why do data engineering with Solr and Spark?

Solr:
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in

Spark:
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Fast, large scale iterative algorithms
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Lots of integrations with other big data systems

and together: http://github.com/lucidworks/spark-solr
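As a rough illustration of what "together" can look like, the sketch below reads a Solr collection into a Spark DataFrame through the spark-solr data source. It is a PySpark-flavored sketch under assumptions: the ZooKeeper address, collection name, and field names are placeholders, and the spark-solr jar is assumed to be on the Spark classpath. The option names (`zkhost`, `collection`, `query`, `fields`) are the ones listed in the spark-solr README.

```python
def solr_read_options(zkhost, collection, query="*:*", fields=None):
    """Build the option map for spark-solr's "solr" data source."""
    opts = {"zkhost": zkhost, "collection": collection, "query": query}
    if fields:
        # spark-solr takes the field list as one comma-separated string
        opts["fields"] = ",".join(fields)
    return opts


def load_collection(spark, opts):
    """Read a Solr collection as a DataFrame.

    Needs a live SparkSession with the spark-solr jar on its classpath,
    so it is defined here but not called.
    """
    return spark.read.format("solr").options(**opts).load()


# Placeholder values: adjust to your own cluster.
opts = solr_read_options(
    zkhost="localhost:9983",
    collection="lucidfind",          # hypothetical collection name
    query="project:lucene-solr",     # hypothetical field
    fields=["id", "author", "subject"],
)
print(opts)
```

Inside a Spark shell started with the spark-solr jar, `load_collection(spark, opts)` would then return a DataFrame that can be joined, aggregated, or fed to ML code.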


Example workflow: Searchhub

• Free data! ASF mailing-list archives + GitHub + JIRA
• https://github.com/lucidworks/searchhub
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics
• Build a recommender, mix & match with search

Searchhub: Initial Exploration

• Initial exploration of ASF mailing-list archives
• index into Solr: just need to turn your records into JSON
• facet:
• fields with low cardinality or with sensible ranges
• document size histogram
• projects, authors, dates
• find: broken fields, automated content, expected data missing, errors
• now: load into a Spark RDD via SolrRDD:
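The faceting step can be sketched as a single Solr request that returns only counts. The example below builds such a URL with standard Solr facet parameters (`facet.field` for low-cardinality fields, `facet.range` for a histogram). The host, collection name, and field names (`project`, `author`, `length`) are assumptions for illustration, not the actual Searchhub schema.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def facet_query_url(base_url, collection, facet_fields, range_field=None,
                    range_start=None, range_end=None, range_gap=None):
    """Build a Solr /select URL that returns only facet counts (rows=0)."""
    params = [("q", "*:*"), ("rows", "0"), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per low-cardinality field
    params += [("facet.field", f) for f in facet_fields]
    if range_field is not None:
        # Range facet: buckets a numeric field, i.e. a histogram
        params += [
            ("facet.range", range_field),
            ("facet.range.start", str(range_start)),
            ("facet.range.end", str(range_end)),
            ("facet.range.gap", str(range_gap)),
        ]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = facet_query_url(
    "http://localhost:8983/solr", "lucidfind",   # hypothetical host/collection
    facet_fields=["project", "author"],          # hypothetical field names
    range_field="length", range_start=0, range_end=20000, range_gap=1000,
)
print(url)
```

Unexpected buckets in the response (an empty author, a spike of tiny documents) are the kind of "broken fields" and "automated content" signals the slide lists.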

Smarter Text Analysis in Spark

• try other text analyzers (no more str.split("\\s+")!)

ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
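To see why whitespace splitting falls short, here is a small comparison. This is not LuceneTextAnalyzer; `better_tokens` is a hypothetical plain-Python stand-in that only normalizes Unicode and strips punctuation, which is a small part of what a Lucene analyzer chain does.

```python
import re
import unicodedata


def naive_tokens(text):
    """The approach the slide warns against: lowercase, split on whitespace."""
    return text.lower().split()


def better_tokens(text):
    """Rough stand-in for an analyzer chain: Unicode-normalize, lowercase,
    then keep runs of word characters, dropping punctuation."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    return re.findall(r"\w+", normalized)


sample = "Re: SolrCloud re-indexing fails (NPE)?"
print(naive_tokens(sample))   # punctuation stays glued to tokens
print(better_tokens(sample))  # punctuation removed, hyphenated word split
```

With the naive version, "(npe)?" and "npe" are different features. A real analyzer would additionally handle stemming, stop words, and language-specific segmentation.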

Searchhub: Exploratory Data Science

• Unsupervised machine learning with MLlib or Mahout:
• clustering documents with KMeans
• extract topics with Latent Dirichlet Allocation
• learn word vectors with Word2Vec
• Write the results back to Solr:
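A minimal sketch of the cluster-then-write-back loop, in plain Python rather than MLlib so it runs anywhere. The k-means here is a bare Lloyd's iteration seeded with the first k points (MLlib's KMeans is more careful about initialization), the vectors are toy data, and the `cluster_i` field name is an assumption. The write-back uses Solr's atomic-update JSON form, `{"set": value}`.

```python
import json


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def atomic_updates(doc_ids, assignment, field="cluster_i"):
    """Solr atomic-update docs that set one field and leave the rest alone."""
    return [{"id": d, field: {"set": a}} for d, a in zip(doc_ids, assignment)]


# Toy 2-d "document vectors"; real ones would come from TF-IDF or Word2Vec.
ids = ["msg-1", "msg-2", "msg-3", "msg-4"]
vectors = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
clusters = kmeans(vectors, k=2)
payload = json.dumps(atomic_updates(ids, clusters))
print(payload)
```

The payload would be POSTed as JSON to the collection's `/update` handler. Once indexed, the cluster id becomes one more field to facet on.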

Searchhub Classification: “Many Newsgroups”

• can also do something more like real Data Science:
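One way to read "Many Newsgroups": treat the mailing list (project) each message came from as its label and train a text classifier. The sketch below is a tiny multinomial naive Bayes in plain Python, a stand-in for a Spark ML pipeline. The labels and training messages are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Train multinomial naive Bayes from (label, tokens) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set: the label is the mailing list a message came from.
train = [
    ("lucene-solr", "solr query facet index shard".split()),
    ("lucene-solr", "lucene analyzer tokenizer index".split()),
    ("spark", "spark rdd dataframe executor shuffle".split()),
    ("spark", "spark streaming rdd partition".split()),
]
model = train_nb(train)
print(predict_nb(model, "facet on a solr index".split()))  # lucene-solr
print(predict_nb(model, "rdd shuffle is slow".split()))    # spark
```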

Recommender Systems with Spark and Solr

Spark+Solr RecSys

• Recommender Systems
• content-based:
• mail-thread as “item”, head msgs grouped by replier as “user” profile
• search query of users against items to recommend
• collaborative-filtering:
• users replying to a head msg “rate” them positively
• train a Spark-ML ALS RecSys model
• both can generate item-item similarity models
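To make "item-item similarity" concrete, the sketch below derives it directly from reply data: two threads are similar when the same people replied to both. This is a simple neighborhood-style stand-in, not ALS. The reply log is invented for the example, and each reply is treated as one implicit positive rating.

```python
import math
from collections import defaultdict


def item_item_cosine(replies):
    """Cosine similarity between threads based on who replied to them.

    `replies` is an iterable of (user, thread_id) pairs; each reply counts
    as one implicit positive rating.
    """
    users_by_item = defaultdict(set)
    for user, item in replies:
        users_by_item[item].add(user)
    sims = {}
    items = sorted(users_by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            shared = len(users_by_item[a] & users_by_item[b])
            if shared:
                # Binary cosine: |A ∩ B| / sqrt(|A| * |B|)
                norm = math.sqrt(len(users_by_item[a]) * len(users_by_item[b]))
                sims[(a, b)] = shared / norm
    return sims


# Hypothetical reply log: (replier, head-message/thread id)
replies = [
    ("ann", 123), ("ann", 39), ("bob", 123), ("bob", 39),
    ("cat", 123), ("cat", 93), ("dan", 912),
]
sims = item_item_cosine(replies)
print(sims)
```

An ALS model reaches item-item similarities differently, by comparing learned item factor vectors, but the output has the same shape: for each item, a scored list of neighbors.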

Experimenting with mixed-mode Recommenders

• With top-K closest items by both CF and Content:
• store them back into a Solr collection!
• fetch your (or generic user’s) recent items
• query them:
• “q=(cf:123^1.1 cf:39^2.3 cf:93^0.7)^alpha (ct:912^2.9 ct:123^1.8 ct:99^2.2)^(1-alpha)”
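The query above can be generated from two scored neighbor lists and a blend weight. The sketch below reuses the `cf` and `ct` field names and the ids and boosts from the slide's query. The helper names and the choice of `alpha=0.7` are illustrative.

```python
def boosted_clause(field, scored_ids):
    """Render 'field:id^score' terms, e.g. cf:123^1.1."""
    return " ".join(f"{field}:{item}^{score}" for item, score in scored_ids)


def mixed_mode_query(cf_items, content_items, alpha):
    """Blend CF and content-based neighbours; alpha weights the CF side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    cf = boosted_clause("cf", cf_items)
    ct = boosted_clause("ct", content_items)
    # round() hides float noise such as 0.30000000000000004
    return f"({cf})^{alpha} ({ct})^{round(1 - alpha, 6)}"


q = mixed_mode_query(
    cf_items=[(123, 1.1), (39, 2.3), (93, 0.7)],
    content_items=[(912, 2.9), (123, 1.8), (99, 2.2)],
    alpha=0.7,
)
print(q)
# (cf:123^1.1 cf:39^2.3 cf:93^0.7)^0.7 (ct:912^2.9 ct:123^1.8 ct:99^2.2)^0.3
```

Sweeping `alpha` from 0 to 1 moves the ranking from purely content-based to purely collaborative, which makes the blend easy to experiment with at query time without retraining anything.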

Resources

• spark-solr: http://github.com/lucidworks/spark-solr

• searchhub: http://github.com/lucidworks/searchhub

• Company: http://www.lucidworks.com

• Our blog: http://www.lucidworks.com/blog

• Fusion: http://www.lucidworks.com/products/fusion

• Twitter: @pbrane
