Page 1: Leveraging the power of solr with spark

Leveraging the Power of SOLR with SPARK

Johannes Weigend, QAware GmbH, Germany

Apache Big Data Europe, September 2015

Page 2: Leveraging the power of solr with spark

Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany

Welcome

• Johannes Weigend
  - CTO QAware GmbH
  - Software architect / developer
  - 25 years of experience
  - Custom enterprise solutions (Java, JS, …)
  - Lecturer for UI development at the University of Applied Science in Rosenheim
  - Focus on performance and scalability
  - SOLR user since 2011

Page 3: Leveraging the power of solr with spark

Brute Force Data Analysis

[Diagram: dataflow over data that is not indexed. Three parallel pipelines each run Read, Filter, Map and feed a single Reduce.]

foreach() -> Minutes / Hours

Page 4: Leveraging the power of solr with spark

Search Based Data Analysis

[Diagram: dataflow over indexed data. The Filter is applied inside three parallel Search steps; each Search is followed by a Map, and all feed a single Reduce.]

Indexed Data (There’s no free lunch)

foreach() -> Seconds / Minutes
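The difference between the two dataflows can be sketched in a few lines of Python. This is a toy illustration, not how SOLR works internally: the records and field names are invented, and a plain dictionary stands in for the index.

```python
from collections import defaultdict

# Invented sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "level": "ERROR", "host": "node1"},
    {"id": 2, "level": "INFO", "host": "node2"},
    {"id": 3, "level": "ERROR", "host": "node2"},
    {"id": 4, "level": "WARN", "host": "node1"},
]

def brute_force(records, level):
    """Read every record and filter it: cost grows with the total data size."""
    return [r["id"] for r in records if r["level"] == level]

def build_index(records, field):
    """Pay once, up front: map each field value to the ids that carry it."""
    index = defaultdict(list)
    for r in records:
        index[r[field]].append(r["id"])
    return index

def search(index, value):
    """Look up matching ids directly: cost grows with the number of hits."""
    return index.get(value, [])

level_index = build_index(records, "level")
assert brute_force(records, "ERROR") == search(level_index, "ERROR")
```

Building and maintaining the index is the "no free lunch" part: the work moves from query time to import time.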

Page 5: Leveraging the power of solr with spark

Agenda

1. SOLR cloud (Demo)
2. SPARK cluster (Demo)
3. Importing data into SOLR with SPARK (Demo)
4. Analysis with SOLR and SPARK (Demo)

Page 6: Leveraging the power of solr with spark

• Horizontally scalable, distributed NoSQL (index) database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Single-valued and multi-valued fields (the latter similar to arrays)
• Schema and schemaless modes
• Powerful query language (Lucene)
• Distributed data in shards
• Replication
• Powerful full text search capabilities
• Aggregation functions (aka facets)
• Stable -> V 5.3
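To make "a document is a collection of fields" concrete, here is a minimal sketch of one document and one Lucene-syntax query. The field names, the values and the collection name `logs` are invented for illustration; only the query syntax and the `/select` handler parameters `q`, `rows` and `wt` are standard SOLR.

```python
import json
from urllib.parse import urlencode

# Hypothetical log document; all field names are assumptions.
doc = {
    "id": "log-0001",                      # string field
    "duration_ms": 125,                    # number field
    "timestamp": "2015-09-28T10:15:00Z",   # date field (ISO 8601, UTC)
    "tags": ["checkout", "payment"],       # multi-valued field, similar to an array
}

# Lucene query syntax: field:value terms, boolean operators and ranges.
query = "tags:payment AND duration_ms:[100 TO *]"

# Parameters for the standard /select request handler.
params = urlencode({"q": query, "rows": 10, "wt": "json"})

print(json.dumps(doc, indent=2))
print("/solr/logs/select?" + params)   # "logs" is a made-up collection name
```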

Page 7: Leveraging the power of solr with spark

SOLR@QAware

• AIR: Aftersales Information Research
• ZEBRA: Part explosion for complex products
• EKG: Software Electro Cardiogram
• QAsearch: Enterprise search across all repositories including history

Page 11: Leveraging the power of solr with spark

Apache SOLR for BigData Analysis?

• Text Search Engine?
• Aggregations?
• Slice and Dice?
• Pivots?

Page 12: Leveraging the power of solr with spark

Demo: SOLR Cloud

• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
  - Terms (count by term)
  - Ranges (count in range)
  - Functions (avg, sum, …)
  - Sub-Facets (pivot)
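The four facet kinds listed above could be requested together roughly as follows. This sketch assumes SOLR's JSON Facet API, which may not be what the demo used, and the field names `level`, `duration_ms` and `host` are invented.

```python
import json

# Field names ("level", "duration_ms", "host") are assumptions.
request = {
    "query": "*:*",
    "limit": 0,   # return facet results only, no documents
    "facet": {
        # Terms facet: count documents per distinct value.
        "by_level": {"type": "terms", "field": "level"},
        # Range facet: count documents per bucket of a numeric field.
        "by_duration": {
            "type": "range", "field": "duration_ms",
            "start": 0, "end": 1000, "gap": 100,
        },
        # Function facet: a statistic over all matching documents.
        "avg_duration": "avg(duration_ms)",
        # Sub-facet (pivot): a statistic nested inside each terms bucket.
        "by_host": {
            "type": "terms", "field": "host",
            "facet": {"avg_duration": "avg(duration_ms)"},
        },
    },
}

# The serialized body would be POSTed to a collection's query handler.
body = json.dumps(request)
```

Nesting a `facet` block inside a terms facet is what turns a flat count into a pivot: every bucket gets its own statistics.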

Page 13: Leveraging the power of solr with spark

Counting as Term Facet

Page 14: Leveraging the power of solr with spark

Statistics as Function Facet

Page 15: Leveraging the power of solr with spark

Pivots as Sub Facets

Page 16: Leveraging the power of solr with spark

careerbuilder.com

Page 17: Leveraging the power of solr with spark

Banana

Page 19: Leveraging the power of solr with spark

What’s Missing?

• Client-side processing of SOLR results does not scale
• No built-in M/R support
• Where to store really big data?
  - Images
  - Videos
  - Binaries / large text documents
• No interfaces to R / ML

Page 20: Leveraging the power of solr with spark

• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
  - Text files (accessible from all nodes)
  - Hadoop Distributed File System (HDFS)
  - Databases (JDBC)
  - SOLR!

Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
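The map/reduce model named above can be shown in a single process with the Python standard library. This is not the Spark API: it only illustrates the two steps that a framework like Spark would distribute across workers, and the log lines are invented.

```python
from functools import reduce

# Invented log lines; the "LEVEL message" layout is an assumption.
lines = ["ERROR disk full", "INFO started", "ERROR timeout", "INFO stopped"]

# Map: turn each line into a (key, value) pair.
pairs = map(lambda line: (line.split()[0], 1), lines)

def merge(counts, pair):
    """Reduce step: fold one (key, value) pair into the running totals."""
    key, value = pair
    counts[key] = counts.get(key, 0) + value
    return counts

counts = reduce(merge, pairs, {})
print(counts)
```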

Page 21: Leveraging the power of solr with spark

Combining Spark with SOLR

• Use Cases
  - Distributed ETL: importing data into SOLR Cloud
    - Our use case: importing N logfiles into SOLR
  - Distributed processing: data analysis
    - Statistics on binary data
    - Map/Reduce

Page 22: Leveraging the power of solr with spark

Four Ways to Import Data into SOLR

1. Using built-in functions
   - post script
   - Dataimport handler
   - Admin-UI
2. Writing custom parallel code using the SOLRJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark
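Option 2 in miniature, with one substitution: instead of the SOLRJ Java client, this sketch uses plain HTTP against SOLR's JSON update handler. The host, the collection name `logs` and the documents are hypothetical, and the request is only built, not sent.

```python
import json
from urllib.request import Request

def build_update_request(base_url, collection, docs):
    """Build (but do not send) a JSON update request for a batch of documents."""
    url = f"{base_url}/{collection}/update?commit=true"
    return Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical host, collection and documents.
req = build_update_request(
    "http://localhost:8983/solr", "logs",
    [{"id": "log-0001", "level": "ERROR"}, {"id": "log-0002", "level": "INFO"}],
)
# Sending it would be: urllib.request.urlopen(req)
```

Custom parallel import code would typically build one such batch per worker thread or process, which is the part that Spark (option 4) takes over.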

Page 23: Leveraging the power of solr with spark

Demo: Import Logfiles with Spark

• Writing a Spark job which imports a bunch of logfiles in one directory
• Using Lucidworks' spark-solr library
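The per-line work of such an import job might look like the sketch below. The log layout, the field names and both helper functions are assumptions. In a Spark job, `parse_line` would be applied in a map over the files and each batch handed to SOLR; the spark-solr API used in the demo is not shown here.

```python
import re

# Assumed line layout: "<ISO timestamp> <LEVEL> <message>". Real logs will differ.
LINE = re.compile(r"^(\S+) (\w+) (.*)$")

def parse_line(path, line_no, line):
    """Turn one log line into a SOLR-style document, or None if it does not match."""
    match = LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    timestamp, level, message = match.groups()
    return {
        "id": f"{path}:{line_no}",   # unique per file and line
        "timestamp": timestamp,
        "level": level,
        "message": message,
    }

def batches(docs, size):
    """Group documents into lists of at most `size` for bulk indexing."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

sample = [
    "2015-09-28T10:15:00Z ERROR disk full\n",
    "garbage\n",
    "2015-09-28T10:15:02Z INFO retrying\n",
]
parsed = (parse_line("app.log", i, line) for i, line in enumerate(sample, 1))
docs = [d for d in parsed if d is not None]
```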

Page 25: Leveraging the power of solr with spark

Demo: Distributed Analysis with Spark

• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
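A single-process sketch of the duration calculation follows. It assumes that an action's duration is the time between its first and last logged event, which the slides do not define, and the event data is invented.

```python
from datetime import datetime

# Invented events: (action id, timestamp). The layout is an assumption.
events = [
    ("order-1", "2015-09-28T10:15:00"),
    ("order-2", "2015-09-28T10:15:01"),
    ("order-1", "2015-09-28T10:15:04"),
    ("order-2", "2015-09-28T10:15:09"),
    ("order-1", "2015-09-28T10:15:02"),
]

def durations(events):
    """Return seconds between the first and last event of each action."""
    spans = {}
    for action, stamp in events:
        t = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        first, last = spans.get(action, (t, t))
        spans[action] = (min(first, t), max(last, t))
    return {a: (last - first).total_seconds() for a, (first, last) in spans.items()}

result = durations(events)
print(result)
```

Keeping only a (first, last) pair per action means partial results can be merged in any order, which is what makes the calculation suitable for a distributed reduce by key.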

Page 26: Leveraging the power of solr with spark

SolrRDD - The Spark Abstraction to Process SOLR Results

https://github.com/LucidWorks/spark-solr

Page 27: Leveraging the power of solr with spark

SPARK Supports Parallel SQL

Page 28: Leveraging the power of solr with spark

DataFrame API

Page 29: Leveraging the power of solr with spark

[Diagram: demo cluster of five boards, one SOLR shard per board]

• Board 0: SPARK Master, SPARK Worker, SOLR 5.3 (shard #0), ZOOKEEPER, NFS
• Boards 1 to 4: SPARK Worker, SOLR 5.3 (shards #1 to #4)
• Each board: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, 70$
• Cluster total: 40 cores, 10 GB RAM, 320 GB eMMC disk

Page 30: Leveraging the power of solr with spark

Summary

Page 31: Leveraging the power of solr with spark

Any Questions?