advanced query parsing techniques
DESCRIPTION
This presentation, given at the November 2013 Basis Technologies' Open Source Search Conference, reviews the role that advanced query parsing can play in building search systems, including: relevancy customization; taking input from user interface variables, such as the position on a website or geographical indicators; selecting which sources are to be searched; and drawing on third-party data sources. Query parsing can also enhance data security. Best practices for building and maintaining complex query parsing rules are discussed and illustrated.
http://www.searchtechnologies.com/query-parsing-language.html
TRANSCRIPT
Advanced Query Parsing Techniques
Aruna Kumar Pamulapati (Arun)
Technical Consultant
2 The expert in the search space
Search Technologies Overview
Formed June 2005
Over 100 employees and growing
Over 500 customers worldwide
Presence in US, Latin America, UK & Germany
Deep enterprise search expertise
Consistent revenue growth and profitability
Search Engine Independent
Lucene Relevancy: Simple Operators
term(A)      TF(A) * IDF(A)
    Implemented with DefaultSimilarity / TermQuery
    TF(A) = sqrt(termInDocCount)
    IDF(A) = log(totalDocsInCollection / (docsWithTermCount + 1)) + 1.0
and(A, B)    A * B
    Implemented with BooleanQuery()
or(A, B)     A + B
    Implemented with BooleanQuery()
max(A, B)    max(A, B)
    Implemented with DisjunctionMaxQuery()
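As a minimal sketch, the TF and IDF formulas on this slide can be written out directly in Python. The function names are illustrative only, and this models just the simplified per-term score shown here; full Lucene scoring involves additional factors that are not modeled.

```python
import math

def tf(term_in_doc_count: int) -> float:
    # TF(A) = sqrt(termInDocCount), as given on the slide
    return math.sqrt(term_in_doc_count)

def idf(total_docs_in_collection: int, docs_with_term_count: int) -> float:
    # IDF(A) = log(totalDocsInCollection / (docsWithTermCount + 1)) + 1.0
    return math.log(total_docs_in_collection / (docs_with_term_count + 1)) + 1.0

def term_score(term_in_doc_count: int, total_docs: int, docs_with_term: int) -> float:
    # term(A) = TF(A) * IDF(A) in the slide's simplified model
    return tf(term_in_doc_count) * idf(total_docs, docs_with_term)

# A term that occurs 4 times in a document and appears in 9 of 10 documents
score = term_score(4, 10, 9)
```

Rarer terms get a larger IDF, so the same term frequency contributes more to the score when fewer documents contain the term.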
Simple Operators - Example
and( or(george, martha), max(washington, custis) )
george = 0.10    martha = 0.20    washington = 0.60    custis = 0.90
or:  0.1 + 0.2 = 0.30
max: max(0.6, 0.9) = 0.90
and: 0.3 * 0.9 = 0.27
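The arithmetic in this example can be checked with a short sketch. It assumes the simplified operator model from the previous slide (and = product, or = sum, max = maximum) and the four term scores shown above; the helper names are illustrative and are not Lucene or QPL API.

```python
from functools import reduce

def and_(*scores: float) -> float:
    # and(A, B) = A * B in the simplified model
    return reduce(lambda a, b: a * b, scores)

def or_(*scores: float) -> float:
    # or(A, B) = A + B in the simplified model
    return sum(scores)

def max_(*scores: float) -> float:
    # max(A, B) keeps only the best-scoring alternative
    return max(scores)

term_scores = {"george": 0.10, "martha": 0.20, "washington": 0.60, "custis": 0.90}

first_names = or_(term_scores["george"], term_scores["martha"])
last_names = max_(term_scores["washington"], term_scores["custis"])
total = and_(first_names, last_names)

print(f"{first_names:.2f} * {last_names:.2f} = {total:.2f}")
```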
Less Used Operators
boost(f, A)          (A * f)
    Implemented with Query.setBoost(f)
constant(f, A)       if(A) then f else 0.0
    Implemented with ConstantScoreQuery()
boostPlus(A, B)      if(A) then (A + B) else 0.0
    Implemented with BooleanQuery()
boostMul(f, A, B)    if(B) then (A * f) else A
    Implemented with BoostingQuery()
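The four definitions above can be sketched as plain functions over scores. This assumes a score of 0.0 stands for "did not match"; the names mirror the slide's operators but this is an illustration, not the Lucene or QPL implementation.

```python
def boost(f: float, a: float) -> float:
    # boost(f, A) = A * f
    return a * f

def constant(f: float, a: float) -> float:
    # constant(f, A): every match scores exactly f, whatever A's own score was
    return f if a > 0.0 else 0.0

def boost_plus(a: float, b: float) -> float:
    # boostPlus(A, B): A is required; B only adds to the score when A matches
    return a + b if a > 0.0 else 0.0

def boost_mul(f: float, a: float, b: float) -> float:
    # boostMul(f, A, B): when B also matches, A's score is multiplied by f
    return a * f if b > 0.0 else a
```

With f below 1.0, boost_mul demotes documents that also match B, while boost_plus only ever raises the score of documents that already match A.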
Problem: Need for More Flexibility
Difficult / impossible to use all operators
    Many not available in standard query parsers
Complex expressions = string manipulation
    This is messy
Query construction is in the application layer
    Your UI programmer is creating query expressions? Seriously?
Hard to create and use new operators
    Requires modifying query parsers - yuck
Query Processing Language
[Diagram: User Interface, QPL Engine (with QPL Script), and Search, inside Solr]
Introducing: QPL
Query Processing Language
    Domain Specific Language for Constructing Queries
    Built on Groovy
    https://wiki.searchtechnologies.com/index.php/QPL_Home_Page
Solr Plug-Ins
    Query Parser
    Search Component
“The 4GL for Text Search Query Expressions”
Server-side Solr Access
    Cores, Analyzers, Embedded Search, Results XML
Solr Plug-Ins
QPL Configuration – solrconfig.xml
Query Parser Configuration:
    <queryParser name="qpl" class="com.searchtechnologies.qpl.solr.QPLSolrQParserPlugin">
        <str name="scriptFile">parser.qpl</str>
        <str name="defaultField">text</str>
    </queryParser>
Search Component Configuration:
    <searchComponent name="qplSearchFirst" class="com.searchtechnologies.qpl.solr.QPLSearchComponent">
        <str name="scriptFile">search.qpl</str>
        <str name="defaultField">text</str>
        <str name="isProcessScript">false</str>
    </searchComponent>
QPL Example #1
Tokenize:
    myTerms = solr.tokenize(query);
Phrase Query:
    phraseQ = phrase(myTerms);
And Query:
    andQ = and(myTerms);
Or Query:
    orQ = (myTerms.size() <= 2) ? null : orMin( (myTerms.size()+1)/2, myTerms);
Put It All Together:
    return phraseQ^3.0 | andQ^2.0 | orQ;
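To illustrate the three tiers in this example, the sketch below checks which sub-queries a document's tokens would satisfy: the exact phrase, all terms, or at least about half of the terms. It is a plain-Python illustration, not QPL. The slide does not say how the orMin threshold is rounded, so floor division is an assumption here, as are the function names.

```python
from typing import Dict, List, Optional

def min_should_match(term_count: int) -> Optional[int]:
    # Mirrors (myTerms.size() <= 2) ? null : (myTerms.size()+1)/2
    # Assumption: the division is rounded down
    return None if term_count <= 2 else (term_count + 1) // 2

def matching_tiers(query_terms: List[str], doc_terms: List[str]) -> Dict[str, bool]:
    n = len(query_terms)
    # phrase: the query terms appear consecutively and in order
    phrase = any(doc_terms[i:i + n] == query_terms
                 for i in range(len(doc_terms) - n + 1))
    # and: every query term appears somewhere in the document
    all_terms = all(t in doc_terms for t in query_terms)
    # orMin: at least the threshold number of query terms appear
    k = min_should_match(n)
    or_min = k is not None and sum(t in doc_terms for t in query_terms) >= k
    return {"phrase": phrase, "and": all_terms, "or_min": or_min}
```

A document that satisfies the phrase tier also satisfies the other two, so with the boosts of 3.0 and 2.0 on the stricter tiers, closer matches accumulate more score.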
Thesaurus Example #2
Tokenize:
    myTerms = solr.tokenize(query);
Load Thesaurus: (cached)
    thes = Thesaurus.load("thesaurus.xml");
Thesaurus Expansion:
    thesQ = thes.expand(0.8f, solr.tokenizer("text"), myTerms);
Put It All Together:
    return and(thesQ);
Original Query: bathroom humor
    [or(bathroom, loo^0.8, wc^0.8), or(humor, jokes^0.8)]
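The expansion step can be sketched with an in-memory dictionary standing in for thesaurus.xml. The dictionary contents are taken from the expanded query shown above; the expand function and its string output format are illustrative assumptions, not the QPL Thesaurus API.

```python
from typing import Dict, List

# Hypothetical stand-in for the contents of thesaurus.xml
THESAURUS: Dict[str, List[str]] = {
    "bathroom": ["loo", "wc"],
    "humor": ["jokes"],
}

def expand(weight: float, terms: List[str], thesaurus: Dict[str, List[str]]) -> List[str]:
    """Replace each term with or(term, synonym^weight, ...) when synonyms exist."""
    expanded = []
    for term in terms:
        synonyms = thesaurus.get(term, [])
        if synonyms:
            parts = [term] + [f"{s}^{weight}" for s in synonyms]
            expanded.append("or(" + ", ".join(parts) + ")")
        else:
            # Terms without synonyms pass through unchanged
            expanded.append(term)
    return expanded

thes_q = expand(0.8, ["bathroom", "humor"], THESAURUS)
```

Weighting synonyms below 1.0 keeps documents that contain the user's own word ahead of documents that only contain a synonym.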
More Operators
Boolean Query Parser:
    pQ = parseQuery("(george or martha) near/5 washington")
Relevancy Ranking Operators:
    q1 = boostPlus(query, optionalQ)
    q2 = boostMul(0.5, query, optionalQ)
    q3 = constant(0.5, query)
Composite Queries:
    compQ = and(compositeMax(["title":1.5, "body":0.8], "george", "washington"))
News Feed Use Case
Order   Documents            Date
1       markets+terms        Today
2       markets              Today
3       terms                Today
4       companies            Today
5       markets+terms        Yesterday
6       markets              Yesterday
7       terms                Yesterday
8       companies            Yesterday
9       markets, companies   older
News Feed Use Case – Step 1
Segments:
    markets = split(solr.markets, "\\s*;\\s*")
    marketsQ = field("markets", or(markets));
Terms:
    terms = solr.tokenize(query);
    termsQ = field("body", or(thesaurus.expand(0.9f, terms)))
Companies:
    compIds = split(solr.compIds, "\\s*;\\s*")
    compIdsQ = field("companyIds", or(compIds))
News Feed Use Case – Step 2
sdf = new SimpleDateFormat("yyyy-MM-dd")
cal = Calendar.getInstance()
Today:
    todayDate = sdf.format(cal.getTime())
    todayQ = field("date_s", todayDate)
Yesterday:
    cal.add(Calendar.DAY_OF_MONTH, -1)
    yesterdayDate = sdf.format(cal.getTime())
    yesterdayQ = field("date_s", yesterdayDate)
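For readers less familiar with the Java calendar classes, the same two date strings can be computed as below. This is a Python sketch of the date arithmetic only; the function name is illustrative, and the date_s field lookup from the slide is not modeled.

```python
from datetime import date, timedelta
from typing import Tuple

def today_and_yesterday(today: date) -> Tuple[str, str]:
    # Same pattern as SimpleDateFormat("yyyy-MM-dd")
    fmt = "%Y-%m-%d"
    yesterday = today - timedelta(days=1)
    return today.strftime(fmt), yesterday.strftime(fmt)

# Subtracting a day rolls over month boundaries correctly
today_str, yesterday_str = today_and_yesterday(date(2013, 11, 1))
```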
News Feed Use Case – Step 3
Weighted Subject Queries:
    sq1 = constant(4.0, and(marketsQ, termsQ))
    sq2 = constant(3.0, marketsQ)
    sq3 = constant(2.0, termsQ)
    sq4 = constant(1.0, compIdsQ)
    subjectQ = max(sq1, sq2, sq3, sq4)
Weighted Time Queries:
    tq1 = constant(10.0, todayQ)
    tq2 = constant(1.0, yesterdayQ)
    timeQ = max(tq1, tq2)
    recentQ = and(subjectQ, timeQ)
Put it All Together:
    return max(recentQ, or(marketsQ, compIdsQ)^0.01)
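The sketch below works through the scores these weights produce, to show that they reproduce the nine-level ordering from the use case table. It assumes the simplified operator model from earlier in the deck (and = product, or = sum, max = maximum) and, for the low-weight fallback clause, that each raw field match scores 1.0; both are assumptions for illustration, not measured Lucene scores.

```python
def constant(f: float, matched: bool) -> float:
    # constant(f, A): f when A matches, otherwise 0.0
    return f if matched else 0.0

def news_score(markets: bool, terms: bool, companies: bool, day: str) -> float:
    subject = max(
        constant(4.0, markets and terms),
        constant(3.0, markets),
        constant(2.0, terms),
        constant(1.0, companies),
    )
    time = max(constant(10.0, day == "today"), constant(1.0, day == "yesterday"))
    recent = subject * time  # and(subjectQ, timeQ) as a product
    # Assumption: each raw match in the fallback or(...) clause scores 1.0
    fallback = 0.01 * ((1.0 if markets else 0.0) + (1.0 if companies else 0.0))
    return max(recent, fallback)

# Rows 1 to 9 of the ordering table: (markets, terms, companies, day)
rows = [
    (True, True, False, "today"),
    (True, False, False, "today"),
    (False, True, False, "today"),
    (False, False, True, "today"),
    (True, True, False, "yesterday"),
    (True, False, False, "yesterday"),
    (False, True, False, "yesterday"),
    (False, False, True, "yesterday"),
    (True, False, False, "older"),
]
scores = [news_score(*row) for row in rows]
```

Because the time weight for today (10.0) exceeds the largest subject weight (4.0), every document from today outranks every document from yesterday, and the 0.01 fallback keeps older market and company matches at the bottom.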
BT RLP Tokenizer Use Case – Step 1
Define field type:
    <tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
               rlpContext="<PATH>rlp-context-bl1.xml"
               postAltLemmas="false"
               lang="eng"
               postPartOfSpeech="false"/>
QPL Expansion:
    finalExpandedQuery = transform(queryTerms, [
        TERM: { ctx ->
            def btCustomTokens = solr.tokenize("subject_bt", ctx.op.term)
            if (btCustomTokens.size() > 1)
                return or(term(btCustomTokens[0])^1.5, or(btCustomTokens[1..-1]));
            else
                return ctx.op;
        }
    ]);
BT RLP Tokenizer Use Case – Step 2
Original User Query: following is "presentation on QPL"
QPL Parsed: and(and(term(following),term(is)), phrase(term(presentation),term(on),term(QPL)))
BT Expansion + QPL Transformation: and(and(or(term(following)^1.5,term(follow)),or(term(is)^1.5,term(be))),phrase(term(presentation),term(on),term(QPL)))
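The recursive rewrite can be sketched in Python over a small dictionary-based query tree. The LEMMAS table is a hypothetical stand-in for what the RLP tokenizer would return for each term, and term, node, transform, and render are illustrative helpers rather than QPL functions. The sketch places alternate tokens directly under a single or node, which matches the flattened form shown in the transformed query.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for solr.tokenize("subject_bt", term)
LEMMAS: Dict[str, List[str]] = {
    "following": ["following", "follow"],
    "is": ["is", "be"],
}

def term(text: str, boost: float = None) -> dict:
    return {"op": "term", "text": text, "boost": boost}

def node(op: str, *children: dict) -> dict:
    return {"op": op, "children": list(children)}

def transform(q: dict, on_term: Callable[[dict], dict]) -> dict:
    # Rebuild the tree bottom-up, applying on_term to every term leaf
    if q["op"] == "term":
        return on_term(q)
    return node(q["op"], *(transform(c, on_term) for c in q["children"]))

def expand_term(q: dict) -> dict:
    tokens = LEMMAS.get(q["text"], [q["text"]])
    if len(tokens) > 1:
        # Prefer the original token (boost 1.5) over the alternate forms
        return node("or", term(tokens[0], 1.5), *(term(t) for t in tokens[1:]))
    return q

def render(q: dict) -> str:
    if q["op"] == "term":
        suffix = f"^{q['boost']}" if q["boost"] is not None else ""
        return f"term({q['text']}){suffix}"
    return q["op"] + "(" + ",".join(render(c) for c in q["children"]) + ")"

parsed = node(
    "and",
    node("and", term("following"), term("is")),
    node("phrase", term("presentation"), term("on"), term("QPL")),
)
expanded = render(transform(parsed, expand_term))
```

Terms for which the tokenizer stand-in returns a single token, including those inside the phrase, come back unchanged.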
BT RLP Tokenizer Use Case – Step 3
[Diagram: query tree for and(and(or(following^1.5, follow), or(is^1.5, be)), phrase(Presentation on QPL))]
Embedded Search Example #1
Tokenize:
    qTerms = solr.tokenize(query);
Execute an Embedded Search:
    results = solr.search('subjectsCore', or(qTerms), 50)
Create a query from the results:
    subjectsQ = or(results*.subjectId)
Put it all together:
    return field("title", and(qTerms)) | subjectsQ^0.9;
Embedded Search Example #2
Tokenize:
    qTerms = solr.tokenize(query);
Execute an Embedded Search:
    results = solr.search('categories', and(qTerms), 10)
Create a Solr named list:
    myList = solr.newList();
    myList.add("relatedCategories", results*.title);
Add it to the XML response:
    solr.addResponse(myList)
Other Features
Embedded Grouping Queries
    Oh yes they did!
Proximity operators
    ADJ, NEAR/#, BEFORE/#
Reverse Lemmatizer
    Prefers exact matches over variants
Transformer
    Applies transformations recursively to query trees
Query Processing Language
[Diagram: User Interface (Application Dev Team) sends data as entered by the user to the QPL Engine; the QPL Script (Search Team) yields a Boolean query expression for Search, inside Solr]
Query Processing Language
[Diagram: User Interface, QPL Engine (with QPL Script), and Search, inside Solr; external sources: RDBMS, Other Indexes, Thesaurus]
More on QPL…
http://www.searchtechnologies.com/query-parsing-language.html