Data Science with Solr and Spark


Page 1: Data Science with Solr and Spark
Page 2: Data Science with Solr and Spark


OCTOBER 13-16, 2016

BOSTON, MA

http://lucenerevolution.com

Page 3: Data Science with Solr and Spark

Data Science with Solr and Spark

Grant Ingersoll

@gsingers

CTO, Lucidworks

Page 4: Data Science with Solr and Spark

Get Started

• ASF mailing lists, lucidworks.com, docs.lucidworks.com

• Solr for Data Science: http://www.slideshare.net/gsingers/solr-for-data-science

• Fusion for Data Science: https://www.youtube.com/watch?v=oSs5DNTVxv4

Page 5: Data Science with Solr and Spark

Lucidworks Fusion Is Search-Driven Everything

Fusion is built on three core principles:

• Drive next-generation relevance via Content, Collaboration, and Context

• Harness best-in-class open source: Apache Solr + Spark

• Simplify application development and reduce ongoing maintenance

Page 6: Data Science with Solr and Spark

Fusion Architecture

[Architecture diagram] Components shown:

• REST API

• Apache Spark: cluster manager and worker nodes

• Apache Solr: shards, with optional HDFS storage

• Apache ZooKeeper (ZK 1 ... ZK N): shared config management, leader election, load balancing

• Connectors: database, web, file, logs, Hadoop, cloud

• Core services: alerting/messaging, NLP, pipelines, blob storage, scheduling, recommenders/signals

• Admin UI and Lucidworks View

• Security built in

Page 7: Data Science with Solr and Spark

Let's Do This

• Why Search for Data Engineering?

• Quick intro to Solr

• Quick intro to Spark

• Solr + Spark

• Demo!

Page 8: Data Science with Solr and Spark

The Importance of Importance

Page 9: Data Science with Solr and Spark

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Page 10: Data Science with Solr and Spark

Solr Key Features

• Full text search (info retrieval)

• SQL and JDBC! (v6)

• Facets/guided navigation galore

• Lots of data types

• Spelling, auto-complete, highlighting

• Cursors

• More Like This

• De-duplication

• Graph traversal

• Grouping and joins

• Stats, expressions, transformations and more

• Language detection

• Extensible

• Massive scale/fault tolerance

Page 11: Data Science with Solr and Spark

Lucene: the Ace in the Hole

• Vector Space or Probabilistic, it’s your choice!

• Killer FST

• Wicked fast

• Pluggable compression, queries, indexing and more

• Advanced Similarity Models

○ Language Modeling, Divergence from Randomness, more

• Easy to plug-in ranking

Page 12: Data Science with Solr and Spark

Spark Key Features

• General purpose, high powered cluster computing system

• Modern, faster alternative to MapReduce

• 3x faster w/ 10x less hardware for Terasort

• Great for iterative algorithms

• APIs for Java, Scala, Python and R

• Rich set of add-on libraries for machine learning, graph processing, integrations with SQL and other systems

• Deploys: Standalone, Hadoop YARN, Mesos

Page 13: Data Science with Solr and Spark

Why do data engineering with Solr and Spark?

Solr:

• Data exploration and visualization

• Easy ingestion and feature selection

• Powerful ranking features

• Quick and dirty classification and clustering

• Simple operation and scaling

• Stats and math built in

Spark:

• Advanced machine learning: MLlib, Mahout, Deeplearning4j

• Fast, large-scale iterative algorithms

• General-purpose batch/streaming compute engine

• Lots of integrations with other big data systems

Whole-collection analysis!

Page 14: Data Science with Solr and Spark

Why Spark for Solr?

• Spark-shell!

• Build the index in parallel very, very quickly!

• Aggregations

• Boosts, stats, iterative computations

• Offline compute to update index with additional info (e.g. PageRank, popularity)

• Whole corpus analytics and ML: clustering, classification, CF, Rankers

• Joins with other storage (Cassandra, HDFS, DB, HBase)

Page 15: Data Science with Solr and Spark

Why Solr for Spark?

• Massive simplification of operations!

• World-class multilingual text analytics:

• No more: tokens = str.toLowerCase().split("\\s+") (see the sketch after this list)

• Non “dumb” distributed, resilient storage

• Random access with smart queries

• Table scans

• Advanced filtering, feature selection

• Schemaless when you want, predefined when you don’t

• Spatial, columnar, sparse
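
To make the text-analytics point concrete, here is a minimal sketch (an illustration, not from the deck) of what a Lucene analysis chain buys you over naive whitespace splitting; the field name and sample text are made up:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
  public static void main(String[] args) throws Exception {
    // StandardAnalyzer lower-cases, strips punctuation, and drops English stop words;
    // Solr wires up chains like this per field via the schema.
    Analyzer analyzer = new StandardAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("text", "Solr & Spark: a love story!")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // solr, spark, love, story
      }
      ts.end();
    }
  }
}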

Page 16: Data Science with Solr and Spark

Spark + Solr in Anger

http://github.com/lucidworks/spark-solr

Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

DataFrame df = sqlContext.read().format("solr").options(options).load();
count = df.filter(df.col("type_s").equalTo("echo")).count();
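
Reading is only half the story: spark-solr can also write a DataFrame back to a collection through the same data source. A minimal sketch, assuming your spark-solr version supports writes via format("solr") and that a "tweets-out" collection already exists:

// Requires org.apache.spark.sql.SaveMode; continues the fragment above.
Map<String, String> writeOptions = new HashMap<String, String>();
writeOptions.put("zkhost", zkHost);
writeOptions.put("collection", "tweets-out");

// Each DataFrame row becomes a Solr document; columns map to fields by name.
df.write().format("solr").options(writeOptions).mode(SaveMode.Overwrite).save();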

Page 17: Data Science with Solr and Spark

Demo: Spark shell; run k-Means, LDA, Word2Vec, Random Forest; index the clusters
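
The live demo isn't captured in the transcript. As a rough sketch of the k-means leg only, keeping the Java style of the earlier slide (the actual demo used the Scala spark-shell), with the "tweets" collection from before and a hypothetical text_t field:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.DataFrame;

// Read documents from Solr (sqlContext as on the previous slide)
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", "localhost:9983");
options.put("collection", "tweets");
DataFrame docs = sqlContext.read().format("solr").options(options).load();

// Tokenize the text field, hash term frequencies into vectors, run k-means
Tokenizer tokenizer = new Tokenizer().setInputCol("text_t").setOutputCol("tokens");
HashingTF tf = new HashingTF().setInputCol("tokens").setOutputCol("features").setNumFeatures(10000);
KMeans kmeans = new KMeans().setK(20).setFeaturesCol("features").setPredictionCol("cluster");

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, tf, kmeans});
PipelineModel model = pipeline.fit(docs);

// Attach a cluster id to every document; this is what would get indexed back
DataFrame clustered = model.transform(docs);
clustered.select("id", "cluster").show();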

Page 18: Data Science with Solr and Spark

Resources

• This code: published in a few weeks

• Company: http://www.lucidworks.com

• Our blog: http://www.lucidworks.com/blog

• Book: http://www.manning.com/ingersoll

• Solr: http://lucene.apache.org/solr

• Fusion: http://www.lucidworks.com/products/fusion

• Twitter: @gsingers

Page 19: Data Science with Solr and Spark

Appendix A: Deeper Solr Details

Page 20: Data Science with Solr and Spark

Just When You Thought SQL Was Dead

Full, Parallelized SQL Support

• Parallel execution of SQL across SolrCloud

• Realtime Map-Reduce ("ish") functionality

• Parallel relational algebra

• Builds on the streaming capabilities in 5.x

• JDBC client in the works

Page 21: Data Science with Solr and Spark

SQL Guts

• Lots of functions:

○ Search, Merge, Group, Unique, Parallel, Select, Reduce, innerJoin, hashJoin, Top, Rollup, Facet, Stats, Update, JDBC, Intersect, Complement, Logit

• Composable streams

• Query optimization built in

Example: the SQL query

select str_s, count(*), sum(field_i), min(field_i), max(field_i), avg(field_i)
from collection1
where text='XXXX'
group by str_s

translates into the streaming expression:

rollup(
    search(collection1,
           q="(text:XXXX)",
           qt="/export",
           fl="str_s,field_i",
           partitionKeys=str_s,
           sort="str_s asc",
           zkHost="localhost:9989/solr"),
    over=str_s,
    count(*),
    sum(field_i),
    min(field_i),
    max(field_i),
    avg(field_i))

Page 22: Data Science with Solr and Spark

Make Connections, Get Better Results

• Graph traversal

○ Find all tweets mentioning "Solr" by me or people I follow

○ Find all draft blog posts about "Parallel SQL" written by a developer

○ Find 3-star hotels in NYC my friends stayed in last year

• BM25 default similarity

• Geo3D search

Page 23: Data Science with Solr and Spark

Streaming API & Expressions

• API

○ Java API that provides a programming framework

○ Returns tuples as a JSON stream

○ org.apache.solr.client.solrj.io

• Expressions

○ String query language

○ Serialization format

○ Allows non-Java programmers to access the Streaming API

DocValues must be enabled for any field to be returned.
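
As a small illustration of the Java side (not from the deck; the ZooKeeper address and collection name are assumptions), consuming tuples with CloudSolrStream looks roughly like this, mirroring the curl request on the next slide:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

Map<String, String> props = new HashMap<String, String>();
props.put("q", "*:*");
props.put("fl", "id,field_i");
props.put("sort", "field_i asc");
props.put("qt", "/export");

CloudSolrStream stream = new CloudSolrStream("localhost:9983", "sample", props);
try {
  stream.open();
  Tuple tuple = stream.read();
  while (!tuple.EOF) { // the stream is terminated by an EOF tuple
    System.out.println(tuple.getString("id") + " " + tuple.getLong("field_i"));
    tuple = stream.read();
  }
} finally {
  stream.close();
}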

Page 24: Data Science with Solr and Spark

Streaming Expression Request

curl --data-urlencode 'stream=search(sample,
    q="*:*",
    fl="id,field_i",
    sort="field_i asc")' http://localhost:8901/solr/sample/stream

Page 25: Data Science with Solr and Spark

Streaming Expression Response

{"responseHeader":  {"status":  0,  "QTime":  1},          "tuples":  {                  "numFound":  -­‐1,                  "start":  -­‐1,                  "docs":  [                          {"id":  "doc1",  "field_i":  1},                          {"id":  "doc2",  "field_i":  2},                          {"EOF":  true}]          }}

Page 26: Data Science with Solr and Spark

Architecture

• MapReduce-ish

○ Borrows the shuffling concept from M/R

• Logical tiers for performing the query

○ SQL tier: translates SQL into streaming expressions for a parallel query plan, selects worker nodes, merges results

○ Worker tier: executes the parallel query plan, streams tuples from data tables back

○ Data table tier: queries SolrCloud collections, performs the initial sort and partitioning of results for the worker nodes

Page 27: Data Science with Solr and Spark

JDBC Client

• Parallel SQL includes a "thin" JDBC client

• Expanded to include SQL clients such as DbVisualizer (SOLR-8502)

• The client only works with Parallel SQL features
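
A minimal sketch of using that driver (the URL shape follows the Solr 6 reference guide; the ZooKeeper address and collection name here are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SolrJdbcDemo {
  public static void main(String[] args) throws Exception {
    // jdbc:solr://<zk-connect-string>?collection=<name>&aggregationMode=map_reduce
    String url = "jdbc:solr://localhost:9983?collection=collection1&aggregationMode=map_reduce";
    try (Connection con = DriverManager.getConnection(url);
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "select str_s, count(*) from collection1 group by str_s")) {
      while (rs.next()) {
        // Columns come back in the order they were selected
        System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
      }
    }
  }
}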

Page 28: Data Science with Solr and Spark

Learning More

• Joel Bernstein's presentation at Lucene Revolution: https://www.youtube.com/watch?v=baWQfHWozXc

• Apache Solr Reference Guide:

○ https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

○ https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface

Page 29: Data Science with Solr and Spark

Spark Architecture

[Architecture diagram] The pieces:

• Spark Master (daemon)

○ Keeps track of live workers

○ Web UI on port 8080

○ Task scheduler; restarts failed tasks

○ Takes data locality into account when selecting which node to execute a task on

○ Losing the master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes

• Spark Worker Node (1...N of these)

○ Spark Slave (daemon)

○ Spark Executor (JVM process), holding tasks and a cache; the executor runs in a separate process from the slave daemon

○ Each task works on some partition of a data set to apply a transformation or action

• Driver: My Spark App / SparkContext, shipped as my-spark-job.jar (with shaded deps)

○ RDD graph, DAG scheduler, block tracker, shuffle tracker