Search Features and Architecture in DNN 7.1
DESCRIPTION
7.1 Search and Lucene.Net
Lucene.Net was the obvious choice of technology for Search in 7.1. Lucene is a general-purpose search engine, and integrating it with the intricacies of DNN wasn't trivial. Ash was instrumental in the design and development of the new Search in 7.1. Join Ash to hear all about DNN Search and Lucene.Net and what the future looks like.
TRANSCRIPT
@DNNCon @ashishprasad
Don’t forget to include #DNNCon in your tweets!
7.1 Search and Lucene.Net
Ash Prasad
Agenda
• History and New Objectives
• Architecture
• Lucene / Lucene.Net
• Crawlers, Entities, Controllers
• Ranking, Synonyms, Ignore Words, Stemming
• Security Trimming
• Module Integration, New Crawler
History of Search
• Platform Edition
  • SQL Server
  • ISearchable
• Commercial Edition
  • Lucene 2.9.2
  • URL and Files
[Diagram labels: Scheduler, Module, ISearchable, SQL, Lucene]
Objectives of New Search
• Handle diverse Content
  • (CMS, Social, Localized, 3rd Party Modules)
• Consistent User Experience
• Simple for Module Developers
• Uniform Architecture
  • Feature-based differentiation
Architecture
What’s Lucene
• Java-based indexing and search technology
• Managed by Apache
• NoSQL database
• Near real-time, Spellchecking, Highlighting, Ranking, Synonyms
• Many companies use Lucene directly or customize it
• Facebook’s Graph Search uses a similar ‘Inverted Index’
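The 'inverted index' mentioned on this slide maps each term to the documents that contain it, so a query looks up terms instead of scanning every document. Below is a minimal Python sketch of that idea only; it is not how Lucene stores its index, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample content, not taken from the talk.
docs = {
    1: "DNN search uses Lucene",
    2: "Lucene is a search library",
    3: "DNN modules",
}
index = build_inverted_index(docs)
```

A call such as `search_all(index, "lucene search")` intersects the id sets for the two terms, which is the basic reason term lookups stay fast as the number of documents grows.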
What’s Lucene.Net
• Line-by-line port from Java to C#
• Maintains high-performance requirements
• A bit behind Java releases
• Who Uses Lucene.Net
  • Products - RavenDB, Orchard, Umbraco, SubText
  • Commercial Sites - BBC UK Top Gear, AutoDesk, Koders.Com
Lucene – A Document Store
• Flexible Schema
• Consists of Documents
  • Which are collections of Fields
• Documents can have different sets of Fields
  • Field("ID", "xxx-yyy-999"), Field("Title", "My best doc")
  • Field("Owner", "Ash"), Field("Locale", "en-US")
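To make the "flexible schema" point concrete, here is a small Python sketch in which each document is just a collection of named fields and two documents carry different field sets. It only illustrates the idea and is not the Lucene.Net `Document`/`Field` API; the second ID value and the helper name are assumptions.

```python
# Each document is a collection of named fields; no shared schema is enforced.
doc_a = {"ID": "xxx-yyy-999", "Title": "My best doc"}
doc_b = {"ID": "xxx-yyy-1000", "Owner": "Ash", "Locale": "en-US"}  # hypothetical ID

store = [doc_a, doc_b]

def find_by_field(store, name, value):
    """Return documents that have field `name` equal to `value`.

    Documents that do not carry the field at all are simply skipped.
    """
    return [doc for doc in store if name in doc and doc[name] == value]
```

Querying on `Owner` only ever matches documents that have that field, so adding a new field to some documents does not require touching the others.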
Lucene – A Document Store (Contd.)
• Denormalized (No Referential Integrity)
• Deletion - Done through a flag
  • Compact reclaims deleted space
• Update is Delete + Insert
• Boost = Ranking
• Unicode compliant
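The delete-by-flag, compact, and update-as-delete-plus-insert behaviour can be sketched as follows. This is a toy Python model written for illustration, with a made-up `TinyStore` class; Lucene's real segment files and merge process are considerably more involved.

```python
class TinyStore:
    """Toy append-only store: deletes set a flag, compact reclaims the space."""

    def __init__(self):
        self._slots = []  # each slot: {"doc": dict, "deleted": bool}

    def insert(self, doc):
        self._slots.append({"doc": doc, "deleted": False})

    def delete(self, doc_id):
        # Nothing is physically removed here; matching slots are only flagged.
        for slot in self._slots:
            if slot["doc"]["ID"] == doc_id:
                slot["deleted"] = True

    def update(self, doc):
        # An update is a delete of the old version followed by an insert.
        self.delete(doc["ID"])
        self.insert(doc)

    def compact(self):
        # Drop flagged slots so their space is actually reclaimed.
        self._slots = [s for s in self._slots if not s["deleted"]]

    def live_docs(self):
        return [s["doc"] for s in self._slots if not s["deleted"]]

    def slot_count(self):
        return len(self._slots)

store = TinyStore()
store.insert({"ID": "1", "Title": "Draft"})
store.update({"ID": "1", "Title": "Final"})
```

After the update, the store holds two slots but only one live document; calling `compact()` brings the slot count back in line with the live documents.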
Book consulted for Search
• Book on version 3.0
• ~ 500 pages
• Very useful
Search Phases
Content Acquisition
• Crawling
• ISearchable
• ModuleSearchBase
• URL
• Doc / PDF
Content Indexing
• Text Analysis
• Ranking
• Synonyms
• Ignore Words
• Stemming
Content Search
• Querying
• Sorting
• Security Trimming
• Boolean Search
• Highlighting
Crawlers
• Platform
  • Site Crawler
    • Module and Tab Metadata
    • Module Content (ModuleSearchBase/ISearchable)
• Commercial Edition
  • File Crawler
    • Uses IFilter for extraction of text from PDF/Office files
  • URL Crawler
    • Internal and External URLs
Search Entities
• SearchType
  • Distinguishes Crawlers
• SearchDocument
  • Properties for a Content
  • Stored in the Index
• SearchQuery
  • Parameters to execute a Query
• SearchResult
  • Derived from SearchDocument
Search Entities – Indexing vs. Querying
Controllers
• SearchController
  • For Querying
• InternalSearchController
  • For Adding / Updating / Deleting
• LuceneController
  • Interacts with Lucene
Ranking = Boosting
• Doc and/or Field can be boosted in Lucene
• DNN does Field boosts (Default - 10)
  • Title (50)
  • Tag (40)
  • Keyword (35)
  • Description (20)
  • Author (15)
• Configured manually by HostSettings
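A rough way to see how field boosts affect ranking: a match in a higher-boosted field contributes more to a document's score. The Python sketch below uses the boost values from this slide but a deliberately simplistic scoring rule (sum of the boosts of matching fields). That rule is an assumption for illustration and is not Lucene's scoring formula; the sample documents are invented.

```python
# Boost values as listed on the slide; any other field gets the default.
FIELD_BOOSTS = {"title": 50, "tag": 40, "keyword": 35, "description": 20, "author": 15}
DEFAULT_BOOST = 10

def score(doc, term):
    """Sum the boost of every field whose text contains the term (toy scoring)."""
    term = term.lower()
    total = 0
    for field, text in doc.items():
        if term in text.lower().split():
            total += FIELD_BOOSTS.get(field, DEFAULT_BOOST)
    return total

# Hypothetical documents: one matches in the title, the other only in the body.
docs = [
    {"title": "Getting started", "body": "Install lucene first"},
    {"title": "Lucene basics", "body": "An introduction"},
]
ranked = sorted(docs, key=lambda d: score(d, "lucene"), reverse=True)
```

Here the document with the term in its title outranks the one that only mentions it in the body, which is the effect the boost table is meant to produce.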
Synonyms and Ignore Words
• Synonyms are injected into Index
• Ignore Words are removed from Index
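Both steps happen at analysis time, before tokens reach the index. A minimal Python sketch of that pipeline follows, with a made-up ignore list and synonym group; DNN's actual word lists and analyzer chain are not shown in the slides.

```python
IGNORE_WORDS = {"the", "a", "of"}    # hypothetical ignore list
SYNONYMS = {"car": ["automobile"]}   # hypothetical synonym group

def analyze(text):
    """Return the tokens that would be indexed for `text`.

    Ignore words are dropped; synonyms are added right after the original word.
    """
    tokens = []
    for word in text.lower().split():
        if word in IGNORE_WORDS:
            continue
        tokens.append(word)
        tokens.extend(SYNONYMS.get(word, []))
    return tokens
```

Because "automobile" is written into the index alongside "car", a later query for either word can find the document without any query-time expansion.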
Stemming
• Converts words to their root
• PorterStemFilter is used
  • Country and Countries = countri
  • breathe, breathes, breathing, breathed = breath
  • fishing, fished, fisher = fish
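The idea of stemming can be illustrated with a few suffix rules. The Python sketch below is a toy suffix stripper, not the Porter algorithm that PorterStemFilter implements; the rule list and the minimum stem length are assumptions chosen so that a handful of the slide's examples come out the same way.

```python
# Ordered suffix rules: the first rule that matches wins.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
    ("y", "i"),
    ("e", ""),
]

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return word
```

With these rules, "Country" and "Countries" both reduce to "countri", so a search for one form matches documents containing the other. The real Porter algorithm applies several conditional steps and handles many more cases than this sketch.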
Security Trimming
• Done through Collectors (Callback)
• Each Doc found is sent to Collector
• Collector rejects/accepts per Permission
  • Site Crawler - Module / Tab Permission
  • File Crawler - Folder Permission
  • User Crawler - Profile Permission
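The collector pattern described here can be sketched as a callback that sees every matching document and keeps only the ones the current user may view. The Python below is an illustration of the pattern, not the Lucene.Net `Collector` API or DNN's permission code; the class name, the role-based check, and the sample index are all assumptions.

```python
class PermissionCollector:
    """Receives each matching document and keeps only those the user may view."""

    def __init__(self, user_roles, has_view_permission):
        self.user_roles = user_roles
        self.has_view_permission = has_view_permission  # permission callback
        self.accepted = []

    def collect(self, doc):
        if self.has_view_permission(doc, self.user_roles):
            self.accepted.append(doc)

def role_check(doc, user_roles):
    """Hypothetical permission rule: the user needs one of the doc's allowed roles."""
    return bool(set(doc["allowed_roles"]) & set(user_roles))

def search(index, term, collector):
    """Send every document whose text contains `term` to the collector."""
    for doc in index:
        if term in doc["text"].lower():
            collector.collect(doc)
    return collector.accepted

# Made-up index entries for the example.
index = [
    {"id": 1, "text": "Public Lucene notes", "allowed_roles": ["All Users"]},
    {"id": 2, "text": "Admin Lucene settings", "allowed_roles": ["Administrators"]},
]
```

Swapping in a different permission callback per crawler type (module/tab, folder, or profile) leaves the search loop unchanged, which is the appeal of doing the trimming in the collector.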
Module Integration
• ModuleSearchBase
  • New abstract class with just one method
  • Defined in BusinessControllerClass
  • GetModifiedSearchDocuments
    • Returns New, Changed and Deleted content
    • Delta based
    • Granular Permission, Localization, etc.
• ISearchable continues to work (no delta)
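"Delta based" means the module only hands back content that changed since the last indexing run, including items that were deleted. The Python sketch below mirrors that shape with an abstract base class and one method. It is an analogue for illustration, not DNN's C# API: the class names, the `begin_date` parameter, and the document fields are assumptions, since the slides do not show the real signature.

```python
from abc import ABC, abstractmethod
from datetime import datetime

class ModuleSearchBaseSketch(ABC):
    """Python stand-in for an abstract base class with a single method."""

    @abstractmethod
    def get_modified_search_documents(self, begin_date):
        """Return documents for content changed after `begin_date`."""

class ArticlesSearch(ModuleSearchBaseSketch):
    """Hypothetical module that exposes its articles to search."""

    def __init__(self, articles):
        self.articles = articles

    def get_modified_search_documents(self, begin_date):
        docs = []
        for article in self.articles:
            # Delta: skip anything not touched since the last indexing run.
            if article["modified"] > begin_date:
                docs.append({
                    "key": article["id"],
                    "title": article["title"],
                    "is_deleted": article["deleted"],  # lets the indexer remove it
                })
        return docs

# Made-up content: one old article, one edited, one deleted recently.
articles = [
    {"id": "a1", "title": "Old", "modified": datetime(2013, 1, 1), "deleted": False},
    {"id": "a2", "title": "Edited", "modified": datetime(2013, 6, 2), "deleted": False},
    {"id": "a3", "title": "Gone", "modified": datetime(2013, 6, 3), "deleted": True},
]
module = ArticlesSearch(articles)
changed = module.get_modified_search_documents(datetime(2013, 6, 1))
```

Only the edited and deleted articles come back, so the indexer touches two documents instead of re-reading the whole module.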
New Crawler (How to)
• Define a new SearchType
  • Optionally use IsPrivate to hide from site search
• Implement BaseResultController (2 methods)
  • HasViewPermission
  • GetDocUrl
• Create Scheduled Task
  • Call AddSearchDocuments to inject content
Demo
Recap
• New Search uses Lucene.Net
• Platform has Site Crawler
• Commercial has URL and File Crawlers
• Modules to implement ModuleSearchBase
• New Crawler implements BaseResultController
THANKS TO ALL OF OUR GENEROUS SPONSORS!