OpenCms Days 2014 - Using the Solr Collector
Sören Schneider, Alkacon Software
WORKSHOP TRACK
Using the SOLR Collector
27.11.2014
1. Brief Introduction to Solr
2. Common Mistakes Using OpenCms & Solr
3. Using the Solr Collector (DEMO)
4. Spellchecking in OpenCms Using Solr
Agenda
● Solr is a very versatile and powerful search
engine that supports various features
● This functionality comes at the price of
increased complexity in handling Solr
● Many customizations available
● All fields composing a single document are typed
Brief Solr Introduction
● Data structures of Solr's documents are
defined in the file schema.xml
● Changes to this file require reindexing
● Dynamic fields cope with that limitation
● They can be used without being explicitly defined
in the schema, by means of wildcard patterns
Defining Solr's Data Structure
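As a rough illustration of the wildcard idea, the following Python sketch resolves field names against a few dynamic-field patterns. The patterns and type names used here (`*_s`, `*_dt`, `*_txt`) are assumptions chosen for the example and are not taken from the OpenCms schema.xml; Solr's real resolution rules are more involved.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic-field patterns mapped to field types (illustrative only).
DYNAMIC_FIELDS = {
    "*_s": "string",
    "*_dt": "date",
    "*_txt": "text",
}


def resolve_type(field_name, explicit_fields=None):
    """Return the type of a field: explicit definitions win, then wildcard patterns."""
    explicit_fields = explicit_fields or {}
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(field_name, pattern):
            return field_type
    return None  # no match: the field would have to be added to the schema


print(resolve_type("city_s"))       # matched by the *_s pattern
print(resolve_type("released_dt"))  # matched by the *_dt pattern
print(resolve_type("city"))         # matched by no pattern
```

The point of the sketch is that a new field such as `city_s` needs no schema change as long as its name fits an existing pattern.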
Solr: Indexing Content
[Diagram: a: date, b: text, c: string | Solr processing (through analyzers, filters and tokenizers) | a: date, b: string, c: string]
● "Direct" usage of OpenCms & Solr requires a
basic understanding of Solr
● Use data types that fit the individual
use case, and get to know the filters
● Know the query syntax (for the respective data types)
● Most common mistakes of OpenCms users
result from insufficient knowledge of Solr basics
OpenCms & Solr
1. Using improper types
● "text" vs "string"
● Formulating correct queries
2. Issues regarding the mapping OpenCms <-> Solr
3. (Encoding problems)
Common Mistakes Using Solr & OpenCms
● String
● Stores its content as an exact string
● No tokenization or processing is performed
● Useful when searching for an exact value
● Text
● Tokenization and processing are performed
● Useful when searching for a part of the content
"text" vs "string"
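The difference can be made concrete with a toy model in Python. This is only a sketch: the "analysis" here is an assumed lowercase-and-split step, whereas a real Solr text field runs through a configurable chain of analyzers, filters and tokenizers. The sample content is made up for the example.

```python
def index_string(value):
    """A string field keeps the value as one exact, unprocessed term."""
    return {value}


def index_text(value):
    """A text field is tokenized; in this toy model: lowercase, split on whitespace."""
    return set(value.lower().split())


def matches_string(indexed, query):
    """Exact comparison against the single stored term."""
    return query in indexed


def matches_text(indexed, query):
    """The query is processed like the content, then compared token by token."""
    return query.lower() in indexed


content = "OpenCms Days 2014"
as_string = index_string(content)
as_text = index_text(content)

print(matches_string(as_string, "Days"))     # only a part of the value
print(matches_string(as_string, content))    # the exact value
print(matches_text(as_text, "days"))         # a single token of the content
```

Searching a string field for a part of its value fails, while the same search succeeds on the text field; this is the typical symptom of choosing the wrong type.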
● OpenCms copies the entire XML content into
a single(!) locale-aware Solr field of type "text"
for each locale
● Specific information about a resource can be made
searchable in OpenCms in two ways
● Automatic mapping of properties to Solr fields
● Manual definition of mappings
Making Your Content Searchable
Indexing Content w/o Searchsettings
[Diagram: Solr processing (through analyzers, filters and tokenizers) | x: text | a: date, b: string, c: string]
Indexing Content with Searchsettings
[Diagram: a: date, b: text, c: string | Solr processing (through analyzers, filters and tokenizers) | a: date, b: string, c: string]
● The mapping is defined in the schema (XSD) of the
respective resource type
● Excerpt:
Solr – OpenCms Interaction: Mapping
<xsd:schema …>
  …
  <xsd:annotation>
    <xsd:appinfo>
      <searchsettings>
        <searchsetting element="City" searchcontent="true">
          <solrfield targetfield="city" sourcefield="_s" />
        </searchsetting> …
(element: the element name in the resource type)
Element Mapping Attributes
Attribute Name   Effect on the Solr Field
targetfield*     The resulting name
locale           Write content only for a specific locale
sourcefield      Defines the resulting type
copyfields       Copies the value to a different field
default          Sets a default value
boost            Sets a boost for the field
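To get an overview of which elements are mapped to which Solr fields, a searchsettings block can be read with a few lines of Python. This is a sketch under assumptions: the fragment below is a simplified, namespace-free stand-in for a real resource type schema, and the second mapping (including its `boost` value) is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modeled on the excerpt; the "Title" mapping is hypothetical.
FRAGMENT = """
<searchsettings>
  <searchsetting element="City" searchcontent="true">
    <solrfield targetfield="city" sourcefield="_s" />
  </searchsetting>
  <searchsetting element="Title" searchcontent="true">
    <solrfield targetfield="title" sourcefield="_s" boost="2.0" />
  </searchsetting>
</searchsettings>
"""


def list_mappings(xml_text):
    """Return one dict per <solrfield>, including the name of its source element."""
    root = ET.fromstring(xml_text)
    mappings = []
    for setting in root.findall("searchsetting"):
        for field in setting.findall("solrfield"):
            mapping = {"element": setting.get("element")}
            mapping.update(field.attrib)
            mappings.append(mapping)
    return mappings


for mapping in list_mappings(FRAGMENT):
    print(mapping)
```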
● Users complain about problems with
certain characters, mostly German umlauts,
in Solr results
● In nearly all cases the sole cause is that the
servlet container Solr is integrated into is
not configured for UTF-8
● Extra note for Tomcat users: Please check
whether you added the required attributes to
all relevant <Connector> elements ;-)
Using UTF-8 in Solr
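The effect of a charset mismatch is easy to reproduce outside of Solr. The following Python sketch percent-encodes a query as UTF-8 and then decodes it once correctly and once as ISO-8859-1, which is what a container not set up for UTF-8 may effectively do. The sample word is made up; on Tomcat, the attribute usually meant in this context is `URIEncoding="UTF-8"` on the `<Connector>` element, which is stated here as an assumption since the slide does not name it.

```python
from urllib.parse import quote, unquote

query = "Köln"

# A browser typically sends the query percent-encoded as UTF-8.
encoded = quote(query, encoding="utf-8")
print(encoded)

# Decoding with the same charset restores the original text.
print(unquote(encoded, encoding="utf-8"))

# Decoding as ISO-8859-1 splits the two UTF-8 bytes of the umlaut into two characters.
print(unquote(encoded, encoding="iso-8859-1"))
```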
● Live Demo
WYSIWYG Spellchecker
● The spellchecker is implemented using Solr
● Solr already provides a flexible component named
"SpellCheckComponent"
● This component supports inline spellchecking of
Solr queries
● The source for suggestions can be specified as Solr
fields or text files
WYSIWYG Spellchecker
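A spellcheck request to Solr is an ordinary query with a few extra parameters. The Python sketch below only builds such a request URL; it does not send it. The parameters `spellcheck`, `spellcheck.q` and `spellcheck.count` are standard SpellCheckComponent parameters, while the host, core name and handler path in `BASE_URL` are assumptions for the example and will differ per installation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, core name and handler path are assumptions.
BASE_URL = "http://localhost:8983/solr/spellcheck/spell"


def build_spellcheck_url(word, count=5):
    """Build a request URL that asks Solr for spelling suggestions."""
    params = {
        "q": word,
        "spellcheck": "true",       # enable the SpellCheckComponent
        "spellcheck.q": word,       # the text to be checked
        "spellcheck.count": count,  # maximum number of suggestions
        "wt": "json",               # response format
    }
    return BASE_URL + "?" + urlencode(params)


print(build_spellcheck_url("hsoue"))
```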
● The "SpellCheckComponent" is widely used to
implement the "Did you mean?" feature known
from popular search engines
● The component is
● Reliable and mature
● Fast
● Plus, Solr is already available in OpenCms
Why Use Solr as a Spellchecker?
● If both use cases use the same component,
how do the implementations actually differ?
● "Did you mean?" builds its source of suggested words
from the entire data set the search runs on.
Usually only a single hit is returned.
● The WYSIWYG spellchecker builds its source of
suggestions from data that contains only
the dictionary for a single language
Differences Between the Use Cases with Regard to Implementation
● Spellchecking is implemented using a separate Solr
core that resides in WEB-INF/spellcheck
● As the only purpose of this core is to hold spellcheck
information, its schema.xml file is as simple as it gets
● Why use another Solr core instead of the default core
that is used by OpenCms?
● Dictionaries are stored as one Solr index per
language
How to Model This Scenario Using Solr?
● Sadly, the spellchecking interfaces of tinyMCE
and Solr are incompatible
Problems Regarding tinyMCE and Solr
Comparison Spellcheck Responses
tinyMCE
{
"id":"c0",
"result":{"hsoue":["house", "has"]}
}
Solr
"spellcheck":{ "suggestions":[
"hsoue",{"numFound":5,
"startOffset":0, "endOffset":4,
"origFreq":0,
"suggestion":[{"word":"house","freq":53}, {"word":"has","freq":271},
…
]}, "correctlySpelled",false,
"collation","hsue"
]},
● A new component had to be implemented in
OpenCms that basically
● Accepts spellcheck requests from tinyMCE
● Handles the communication between tinyMCE and Solr,
including message conversion
● Checks and (re-)builds the spellcheck indices
● The corresponding code is found in
org.opencms.search.solr.spellcheck
Gluing the Pieces Together
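The message conversion step can be sketched in Python, based only on the two response formats compared earlier. This is not the code from org.opencms.search.solr.spellcheck; the function name and the handling of edge cases are assumptions. The sample data is shortened to the two suggestions visible on the slide.

```python
import json


def solr_to_tinymce(solr_spellcheck, request_id="c0"):
    """Convert Solr's flat suggestions list into a tinyMCE-style result object."""
    suggestions = solr_spellcheck["suggestions"]
    result = {}
    # Solr's list alternates between a name and the value belonging to it.
    for name, value in zip(suggestions[0::2], suggestions[1::2]):
        # Entries like "correctlySpelled" or "collation" carry no suggestion list.
        if isinstance(value, dict) and "suggestion" in value:
            # Depending on the request, entries are dicts with "word" or plain strings.
            result[name] = [
                s["word"] if isinstance(s, dict) else s
                for s in value["suggestion"]
            ]
    return {"id": request_id, "result": result}


# The "spellcheck" part of a Solr response, shortened to two suggestions.
SOLR_SAMPLE = {
    "suggestions": [
        "hsoue", {
            "numFound": 5, "startOffset": 0, "endOffset": 4, "origFreq": 0,
            "suggestion": [
                {"word": "house", "freq": 53},
                {"word": "has", "freq": 271},
            ],
        },
        "correctlySpelled", False,
        "collation", "hsue",
    ]
}

print(json.dumps(solr_to_tinymce(SOLR_SAMPLE)))
```

The frequencies and offsets that Solr reports are dropped, because the tinyMCE format shown only carries the misspelled word and its list of suggestions.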
● Dictionaries can be edited easily in OpenCms
● The indices are filled automatically from flat text
files, one word per line
● Support for multiple languages
● To access the dictionaries, have a look at the directory
org.opencms.workplace.spellcheck/resources/
Spellchecker in OpenCms
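The dictionary format itself is simple enough to read with a few lines of Python. The sketch below collects the words of a flat file with one word per line; skipping blank lines and duplicates is an assumption about what is sensible, not a description of what OpenCms does when it builds its indices.

```python
import io


def read_dictionary(lines):
    """Collect the words of a flat dictionary file, one word per line."""
    words = []
    seen = set()
    for line in lines:
        word = line.strip()
        if word and word not in seen:  # skip blank lines and duplicates
            seen.add(word)
            words.append(word)
    return words


# io.StringIO stands in for an opened dictionary file.
sample = io.StringIO("house\nhas\n\nhouse\nhorse\n")
print(read_dictionary(sample))
```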
● Adding a new language
1. Create a new Solr field in schema.xml
2. Create a new dictionary file inside the VFS
3. Restart OpenCms
● Adding words to the custom dictionary
Extending the Spellchecker
● Any Questions?
Sören Schneider
Alkacon Software GmbH
http://www.alkacon.com
http://www.opencms.org
Thank you very much for your attention!