A New Content Processing Framework for Search Applications
Iain Fletcher
Agenda
• Briefly About Search Technologies
• Key Issues for Enterprise Search
• A New Content Processing Framework for Search Applications
• How do we use it?
• What does it look like?
• Use case example
Search Technologies overview
• The leading IT services company focused on search engines
  • Consulting
  • Implementation
  • Managed services
• Technology independent, working with most of the leading search engines
• 90 staff, 250+ customers
Search Technologies overview: office locations
• San Diego, CA
• San Jose, CR
• Herndon, VA
• Ascot, UK
• Boston, MA
• Cincinnati, OH
Executive team
Executive: enterprise search industry experience (number of years in the search engine industry)
• Kamran Khan, President & CEO: 18 years (International Sales, VP Sales, Executive)
• John Steinhauer, VP Technology: 16 years (Development Management, Project Management, Executive)
• Paul Nelson, Chief Architect: 22 years (Development, Innovation, Architecting, Dev. Management)
• Graham Charlesworth, VP Europe: 16 years (Business Development, VP Sales, Executive)
• Phil Lewis, Tech. Director, Europe: 19 years (Development, Innovation, Architecting, Project Management)
• Dennis Tran, VP & Founder: 21 years (International Sales, VP Sales)
• John Back, VP Sales: 15 years (Sales, Federal Sales Director)
• Iain Fletcher, VP Marketing: 16 years (International Sales, Product Management, VP Marketing)
Selected customers
A New Content Processing Framework for Search Applications
Enterprise Search - An Indifferent Reputation
• Major surveys show that no progress has been made during the last 10 years
• Searchers are successful in finding what they seek 50% of the time or less
  • 2001, IDC, “Quantifying Enterprise Search”
• More than half cannot find the information they need using their enterprise search system
  • 2011, MindMetre/SmartLogic, “Mind the Enterprise Search Gap”
Search Fundamentals
Metadata Supports Relevance Ranking
Supported by great metadata!
• Title
• Meta description
• URL
• Inbound links
• Alt tag text
• Etc.
• Provided for free by millions of SEO practitioners
Key Issues
• Almost all modern search functions are driven by data structure
Key Issues
• The majority of serious problems in serious search systems are caused by data quality issues
Also...
• “Big Data” and BI from unstructured data will face the same challenges
• Can you trust an analysis if you are unsure of data provenance?
Data quality examples
• The subscription portal caught out by template information
• The Intranet search skewed by a new piece of hardware
• The Intranet search where great quality was the problem!
Key Issues
• Data structure and quality issues are addressed in the indexing pipelines of search engines
  • Cleaning, enriching, normalizing, granularizing...
• It is about process as much as technology
  • And data constantly evolves
• Sometimes the built-in indexing pipeline is not good enough (issues with scale, flexibility or transparency)
  • Some search engines don’t really have one
• We’ve written our own
Document Processing Methodology for Search (DPMS)
• The Philosophy
• Understand the Document Model
• Understand the User Model
  • Includes business-level requirements
• Create the Search Engine Model
  • Search = the pivot point between User and Data
• Document everything
DPMS – The Methodology
[Diagram: three phases]
1. Assessment
  • Assessment (Search Technologies Architect and Business Analyst)
  • Assessment Report: expert assessment and recommendations
2. Detailed Analysis
  • DPMS Analysis (Knowledge Engineer, Business Analyst, etc.)
  • DMDs
  • Review (Architect, Domain Experts, Peers)
3. Execution
  • Implementation (Developer)
  • Validation: validate DMDs
  • Aspire, Search Engine
DPMS – The Implementation
Introducing “Aspire”
• Think of it as a stand-alone indexing pipeline with a framework + component architecture
• Framework built for scalability, performance and flexibility – designed to use cloud elasticity
• Components built to be autonomous and transparent
Technology Suite
• 100% Java
• OSGi™ (see www.osgi.org)
  • The Dynamic Module System for Java™
• Apache Felix
  • Open source implementation of OSGi
• Jetty
  • Embedded HTTP server
• Maven & Maven Repositories
  • For component deployment
Component Configuration
• Any number of document processing pipelines can be used in an application
  • Disparate data sources will need different treatment
  • Components can be shared where appropriate
  • Configurations are easy to change
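The idea of several pipelines sharing components can be sketched in a few lines of Java. This is only an illustration of the concept, not Aspire's configuration format or API: the names `PipelineSketch`, `Stage`, `TRIM`, `LOWERCASE_AUTHOR` and the pipeline names `intranet` and `fileshare` are all hypothetical, and a document is modeled as a simple field map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from Aspire.
final class PipelineSketch {

    // A processing stage takes a document (field map) and returns a new one.
    interface Stage extends UnaryOperator<Map<String, String>> {}

    // Shared stage: trim whitespace from every field value.
    static final Stage TRIM = doc -> {
        Map<String, String> out = new LinkedHashMap<>();
        doc.forEach((field, value) -> out.put(field, value.trim()));
        return out;
    };

    // Source-specific stage: lower-case the "author" field if it is present.
    static final Stage LOWERCASE_AUTHOR = doc -> {
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.computeIfPresent("author", (field, value) -> value.toLowerCase(Locale.ROOT));
        return out;
    };

    // Each pipeline is an ordered list of stages; TRIM is shared by both.
    static final Map<String, List<Stage>> PIPELINES = Map.of(
            "intranet", List.of(TRIM),
            "fileshare", List.of(TRIM, LOWERCASE_AUTHOR));

    // Run a document through the named pipeline, stage by stage.
    static Map<String, String> run(String pipeline, Map<String, String> doc) {
        Map<String, String> current = doc;
        for (Stage stage : PIPELINES.get(pipeline)) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> doc =
                Map.of("title", "  Annual Report ", "author", " J. SMITH ");
        System.out.println(run("intranet", doc));
        System.out.println(run("fileshare", doc));
    }
}
```

Changing how a data source is treated then means editing the stage list for that source, while the shared stage stays untouched.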
Component autonomy
• Components communicate via XML
  • Each component has a known and transparent input and output, and can be tested in isolation
  • This simplifies problem diagnosis, promotes transparency and controls cost-of-ownership
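As a rough illustration of a component with an XML input and an XML output that can be exercised on its own, the sketch below uses only the JDK's built-in XML APIs. The component, its name (`XmlComponentSketch.addWordCount`) and the `<doc>`/`<content>`/`<wordCount>` element names are assumptions made for this example and are not taken from Aspire.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical component: the element names are assumptions for illustration.
final class XmlComponentSketch {

    // Input contract:  <doc><content>...</content></doc>
    // Output contract: the same XML with a <wordCount> element appended to <doc>.
    static String addWordCount(String xmlIn) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlIn)));
            Element root = dom.getDocumentElement();
            Node content = root.getElementsByTagName("content").item(0);
            if (content == null) {
                throw new IllegalArgumentException("input has no <content> element");
            }
            // Count whitespace-separated tokens in the content text.
            String text = content.getTextContent().trim();
            int words = text.isEmpty() ? 0 : text.split("\\s+").length;

            Element count = dom.createElement("wordCount");
            count.setTextContent(Integer.toString(words));
            root.appendChild(count);

            // Serialize the modified DOM back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not process XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                addWordCount("<doc><content>search needs clean data</content></doc>"));
    }
}
```

Because the input and output are plain XML strings, the component can be fed a sample document and its output inspected without running any other part of a pipeline.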
Data Quality Monitoring
• Components have built-in quarantine systems to monitor data quality
  • Content is constantly evolving
  • This provides transparency and enables content issues to be diagnosed and resolved faster
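One way to picture a quarantine is a component that sets aside documents failing a quality rule, together with the reason, instead of silently dropping or indexing them. The sketch below is a minimal illustration under assumed rules (a non-blank title and a `yyyy-MM-dd` style date); the class and field names are hypothetical and do not describe how Aspire implements its quarantine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a quarantine; the validation rules are assumptions.
final class QuarantineSketch {

    // A quarantined document together with the reason it was set aside.
    static final class Rejected {
        final Map<String, String> doc;
        final String reason;

        Rejected(Map<String, String> doc, String reason) {
            this.doc = doc;
            this.reason = reason;
        }
    }

    final List<Map<String, String>> accepted = new ArrayList<>();
    final List<Rejected> quarantine = new ArrayList<>();

    // Accept the document, or quarantine it with a reason that can be reviewed later.
    void process(Map<String, String> doc) {
        String title = doc.getOrDefault("title", "");
        String date = doc.getOrDefault("date", "");
        if (title.trim().isEmpty()) {
            quarantine.add(new Rejected(doc, "missing title"));
        } else if (!date.matches("\\d{4}-\\d{2}-\\d{2}")) {
            quarantine.add(new Rejected(doc, "unexpected date format: " + date));
        } else {
            accepted.add(doc);
        }
    }

    public static void main(String[] args) {
        QuarantineSketch component = new QuarantineSketch();
        component.process(Map.of("title", "Q3 results", "date", "2012-05-01"));
        component.process(Map.of("title", " ", "date", "2012-05-01"));
        component.process(Map.of("title", "Memo", "date", "May 1st"));
        for (Rejected r : component.quarantine) {
            System.out.println(r.reason);
        }
    }
}
```

Keeping the reason alongside each quarantined document is what makes evolving content problems visible: a sudden rise in one reason points at a specific source or template change.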
The Component Library
• Search Technologies maintains a library of components
  • Currently there are more than 70
  • Components can be as simple as 3 lines of Groovy script, or complex 3rd-party technologies
• Many applications can be addressed using existing components + configuration
Component Upgrading
• Components can be upgraded in-situ from a cloud-based service, without stopping/restarting the system
• Helpful in the maintenance of complex or mission-critical systems
Component control
• Every component has its own control / status page
A very simple example
Security expansion example
Patent Assignee Name Normalization
Complexity example
• CPA Global Discover
  • The world’s leading patent research portal
  • 80 million patents from 95 patent offices
  • More than a dozen navigators built
  • Numerous graphical search results display options
  • Whole document comparison features
In Summary
• Many applications today don’t need this level of diligence
  • But as data and data dynamism grow, more will
• A stand-alone unstructured content processing system can serve multiple applications, and makes sense for some companies
• Method. Diligence. Transparency – it’s not rocket science...
• Applying this approach to enterprise search is a key part of moving user satisfaction forward during the next few years