opencms days 2014 - using the solr collector

Sören Schneider, Alkacon Software

WORKSHOP TRACK

Using the SOLR Collector

27.11.2014

1. Brief Introduction to Solr

2. Common Mistakes Using OpenCms & Solr

3. Using the Solr Collector (DEMO)

4. Spellchecking in OpenCms Using Solr

Agenda

● Solr is a very versatile and powerful search engine that supports various features

● This functionality comes at the price of increased complexity in handling Solr

● Many customizations available

● All fields composing a single document are typed

Brief Solr Introduction

● Data structures of Solr's documents are defined in the file schema.xml

● Changes to this file require reindexing

● Dynamic fields cope with that limitation

● They can be used without being explicitly defined in the schema, by using wildcards

Defining Solr's Data Structure
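To make the wildcard idea concrete, here is a minimal Python sketch of how a field name could be resolved against dynamic field patterns. The patterns `*_s`, `*_dt` and `*_txt`, the type names and the helper `resolve_field_type` are assumptions chosen for illustration; they are not taken from the OpenCms schema.xml, and Solr applies its own precedence rules when several patterns match.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field declarations (wildcard pattern -> field type).
DYNAMIC_FIELDS = {
    "*_s": "string",
    "*_dt": "date",
    "*_txt": "text",
}

# Hypothetical explicitly declared fields.
STATIC_FIELDS = {"id": "string"}


def resolve_field_type(name):
    """Return the type for a field name, or None if no declaration matches."""
    if name in STATIC_FIELDS:
        return STATIC_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(name, pattern):
            return field_type
    return None


print(resolve_field_type("city_s"))
print(resolve_field_type("release_dt"))
print(resolve_field_type("unknown"))
```

A field such as `city_s` needs no entry of its own in this model: the suffix alone selects the type, which is why new fields of this kind do not require a schema change.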

Solr: Indexing Content

[Diagram: fields a: date, b: text, c: string → Solr processing (through analyzers, filters and tokenizers) → a: date, b: string, c: string]

● "Direct" usage of OpenCms & Solr requires a basic understanding of Solr

● Use data types that suit the individual use case and learn how filters work

● Know the query syntax (for the appropriate data types)

● Most common mistakes of OpenCms users result from insufficient knowledge of Solr basics

OpenCms & Solr

1. Using improper types

● "text" vs "string"

● Formulating correct queries

2. Issues regarding the OpenCms <-> Solr mapping

3. (Encoding Problems)

Common Mistakes Using Solr & OpenCms

● String

● Stores its content as exact string

● No tokenization / processing is being performed

● Useful when searching for exact value

● Text

● Tokenization and processing is performed

● Useful when a part of the content is searched for

"text" vs "string"
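The difference can be sketched in a few lines of Python. This is a deliberately simplified model, not Solr's analysis chain: the "text" side only lowercases and splits on word characters, and all function names are hypothetical.

```python
import re


def index_string(value):
    """'string' behaviour: keep the value as one exact, unprocessed term."""
    return {value}


def index_text(value):
    """'text' behaviour, heavily simplified: lowercase and split into tokens."""
    return set(re.findall(r"\w+", value.lower()))


def matches_string(indexed, query):
    """An exact-value lookup: the query must equal the stored value."""
    return query in indexed


def matches_text(indexed, query):
    """A single-term lookup against the token set."""
    return query.lower() in indexed


title = "OpenCms Days 2014"
as_string = index_string(title)
as_text = index_text(title)

print(matches_string(as_string, "OpenCms Days 2014"))  # whole value
print(matches_string(as_string, "Days"))               # part of the value
print(matches_text(as_text, "days"))                   # part of the value
```

In this model a partial query only matches the tokenized field, while the exact-value field only matches the complete, unchanged value.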

● OpenCms copies the entire XML content into a single(!) locale-aware Solr field of type "text" for each locale

● Particular information of a resource is made

searchable in OpenCms using two approaches

● Automatic mapping of properties to Solr fields

● Manual definition of mappings

Making Your Content Searchable

Indexing Content w/o Searchsettings

[Diagram: Solr processing (through analyzers, filters and tokenizers); fields shown: x: text; a: date, b: string, c: string]

Indexing Content with Searchsettings

[Diagram: fields a: date, b: text, c: string → Solr processing (through analyzers, filters and tokenizers) → a: date, b: string, c: string]

● Mapping happens in the schema of the appropriate resource type

● Excerpt

Solr – OpenCms Interaction: Mapping

<xsd:schema …>
  <xsd:annotation>
    <xsd:appinfo>
      <searchsettings>
        <searchsetting element="City" searchcontent="true">
          <solrfield targetfield="city" sourcefield="_s" />
        </searchsetting>
        …

Callouts: "Resource type" (the schema), "element name" (the element attribute)

Element Mapping Attributes

Attribute Name   Effect on the Solr Field
targetfield*     The resulting name
locale           Write content only for specific locale
sourcefield      Defines the resulting type
copyfields       Copies the value to a different field
default          Sets a default value
boost            Sets a boost for the field
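As a rough illustration of how such mapping attributes can be read, the following Python sketch parses a small searchsettings fragment and lists its field mappings. The fragment is a completed, hypothetical version of the slide excerpt, `list_mappings` is an invented helper, and treating `targetfield` as mandatory is an assumption based on the asterisk in the table; none of this is OpenCms code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, completed searchsettings fragment based on the slide excerpt.
SEARCHSETTINGS = """
<searchsettings>
  <searchsetting element="City" searchcontent="true">
    <solrfield targetfield="city" sourcefield="_s" />
  </searchsetting>
</searchsettings>
"""


def list_mappings(xml_text):
    """Return one dict per <solrfield>, including the owning element name."""
    root = ET.fromstring(xml_text)
    mappings = []
    for setting in root.iter("searchsetting"):
        for field in setting.iter("solrfield"):
            # Assumption: the asterisk in the table marks targetfield as required.
            if "targetfield" not in field.attrib:
                raise ValueError("solrfield without targetfield")
            mapping = dict(field.attrib)
            mapping["element"] = setting.get("element")
            mappings.append(mapping)
    return mappings


for mapping in list_mappings(SEARCHSETTINGS):
    print(mapping)
```

Optional attributes such as `locale`, `default` or `boost` would simply show up as additional keys in each dictionary.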

● Users complain about problems with certain characters, mostly German umlauts, in Solr results

● In nearly all cases the sole problem is that the integration of Solr into the servlet container does not use UTF-8

● Extra note for Tomcat users: Please check whether you added the required attributes to all appropriate "<Connector>"s ;-)

Using UTF-8 in Solr
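The typical symptom can be reproduced without any server. The sketch below percent-encodes a query as UTF-8 and then decodes it once correctly and once as ISO-8859-1, which is what effectively happens when the container does not decode request URIs as UTF-8. The sample query is made up. On Tomcat the relevant setting is commonly the Connector's `URIEncoding` attribute; the slide does not name the attributes, so treat that as an assumption.

```python
from urllib.parse import quote, unquote

query = "Müller"

# A client typically percent-encodes the UTF-8 bytes of the query.
encoded = quote(query)

# Decoding with the matching charset restores the original text.
decoded_utf8 = unquote(encoded, encoding="utf-8")

# Decoding the same bytes as ISO-8859-1 garbles the umlaut.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")

print(encoded)
print(decoded_utf8)
print(decoded_latin1)
```

If search results show sequences like "Ã¼" where "ü" is expected, a charset mismatch of this kind is a likely cause.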

● Live Demo


WYSIWYG Spellchecker

● The Spellchecker has been realized using Solr

● Solr already provides a flexible component named "SpellCheckComponent"

● This component supports inline spellchecking of

Solr queries

● Source for suggestions can be specified by Solr

fields or text files

WYSIWYG Spellchecker

● The "SpellCheckComponent" is widely used to implement the "Did you mean?" feature known from popular search engines

● The component is

● Reliable and mature

● Fast

● Plus, Solr is already available in OpenCms

Why Use Solr as a Spellchecker?

● If both use cases use the same component, how do the implementations actually differ?

● "Did you mean?" builds its source of suggested words from the entire data set the search runs on. Usually only a single hit is returned.

● The WYSIWYG spellchecker builds its source of suggestions from data that contains only the dictionary for a single language

Differences Between Use Cases With Regard to Implementation

● Spellchecking has been realized using another Solr

core that resides in WEB-INF/spellcheck

● As the only purpose of this core is to contain spellcheck

information, the schema.xml file is as simple as it gets

● Why use another Solr core instead of the default core that's used by OpenCms?

● Dictionaries are stored as one Solr index per

language

How to model this scenario using Solr?

● Sadly, the spellchecking interfaces of tinyMCE

and Solr are incompatible

Problems regarding tinyMCE and Solr

Comparison of Spellcheck Responses

tinyMCE:

{
  "id": "c0",
  "result": {"hsoue": ["house", "has"]}
}

Solr:

"spellcheck": {"suggestions": [
  "hsoue", {"numFound": 5,
    "startOffset": 0, "endOffset": 4,
    "origFreq": 0,
    "suggestion": [{"word": "house", "freq": 53}, {"word": "has", "freq": 271},
      …
    ]}, "correctlySpelled", false,
  "collation", "hsue"
]},

● A new component had to be realized in

OpenCms that basically

● Accepts spellcheck requests from tinyMCE

● Handles tinyMCE and Solr communication and

message conversion

● Checks and (re-)builds spellcheck indices

● The appropriate code is found in

org.opencms.search.solr.spellcheck

Glueing the Pieces together
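The message conversion mentioned above can be sketched as follows. This is a minimal Python illustration based only on the two response shapes shown on the comparison slide; it is not the code from org.opencms.search.solr.spellcheck, the function name `solr_to_tinymce` is hypothetical, and the sample response is shortened to the two suggestions visible on the slide.

```python
def solr_to_tinymce(solr_response, request_id):
    """Convert a Solr spellcheck response into a tinyMCE-style result."""
    suggestions = solr_response.get("spellcheck", {}).get("suggestions", [])
    result = {}
    # The suggestions list alternates between a key and its value.
    for key, value in zip(suggestions[0::2], suggestions[1::2]):
        # Entries such as "correctlySpelled" or "collation" carry
        # non-dict values and are skipped.
        if isinstance(value, dict) and "suggestion" in value:
            result[key] = [entry["word"] for entry in value["suggestion"]]
    return {"id": request_id, "result": result}


# Shortened sample in the shape shown on the slide.
solr_response = {
    "spellcheck": {
        "suggestions": [
            "hsoue", {
                "numFound": 2, "startOffset": 0, "endOffset": 4, "origFreq": 0,
                "suggestion": [{"word": "house", "freq": 53},
                               {"word": "has", "freq": 271}],
            },
            "correctlySpelled", False,
            "collation", "hsue",
        ]
    }
}

print(solr_to_tinymce(solr_response, "c0"))
```

The frequency information is dropped on purpose, since the tinyMCE-style result only carries the list of suggested words per misspelled word.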

● Dictionaries can be edited easily in OpenCms

● These indices are filled automatically from flat text files, one word per line

● Support for multiple languages

● To access the dicts, have a look at the directory

org.opencms.workplace.spellcheck/resources/

Spellchecker in OpenCms
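To show what "one word per line" means in practice, here is a small Python sketch that parses such a dictionary and looks up suggestions for a misspelled word. The dictionary content and both function names are hypothetical, and Python's difflib is used purely as a stand-in for the matching that Solr performs in the real setup.

```python
import difflib


def parse_dictionary(text):
    """Parse a flat dictionary: one word per line, blanks and duplicates dropped."""
    words = []
    seen = set()
    for line in text.splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def suggest(word, dictionary, limit=3):
    """Return up to `limit` close matches; difflib stands in for Solr here."""
    return difflib.get_close_matches(word, dictionary, n=limit)


# Hypothetical dictionary content, one word per line.
DICT_EN = "house\nhas\n\nhorse\nhouse\n"

words = parse_dictionary(DICT_EN)
print(words)
print(suggest("hsoue", words))
```

Adding a word to a custom dictionary then amounts to appending one more line to the corresponding file.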

● Adding a new language

1. Create new Solr field in schema.xml

2. Create new dictionary file inside VFS

3. Restart OpenCms

● Adding words to the custom dict

Extending the Spellchecker

● Any Questions?


Sören Schneider

Alkacon Software GmbH

http://www.alkacon.com

http://www.opencms.org

Thank you very much for your attention!
