applied enterprise semantic mining
TRANSCRIPT
Applied Enterprise Semantic MiningTEXT MINING WITH SQL SERVER 2012
PRESENTED AT ATLANTA MICROSOFT BUSINESS INTELLIGENCE GROUP
JANUARY 28, 2013
Mark Tabladillo Ph.D.Data Mining Scientist
MarkTab Inc.
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
About MarkTabhttp://marktab.com
http://marktab.net
@MarkTabNet
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
IntroductionSQL Server 2012 has new Programmability Enhancements
◦ Statistical Semantic Search
◦ File Tables
◦ Full-Text Search Improvements
These combined technologies make SQL Server 2012 a strong contender in text mining
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
ChallengesBuilding and Maintaining Applications with relational and non-relational data is hard
◦ Complex integration
◦ Duplicated functionality
◦ Compensation for unavailable services
80% of all data is not stored in databases! Most of it is “unstructured”
(2012, Michael Rys, Microsoft)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Microsoft and Google
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
HistoryJuly 2008
◦ Microsoft purchases Powerset for US$100 Million
◦ Google Dismisses Semantic Search
◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
HistoryMarch 2009
◦ Google announces “snippets” as relevant to search
◦ The media picks this story up as “semantic search”
◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
HistoryFebruary 2012
◦ Google announces Knowledge Graph, an explicit application of semantic search
◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
HistoryApril 2012
◦ Microsoft purchases 800+ patents from AOL for US$1 Billion
◦ Among the patents are semantic search and metadata querying – older than Google
◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
New in SQL Server 2012HTTP://MSDN.MICROSOFT.COM/EN -US/LIBRARY/CC645577.ASPX
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Goals of Semantic SearchReduce the cost of managing all data
Simplify the development of applications over all data
Provide management and programming services for all data
Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top
(2012, Michael Rys, Microsoft)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Statistical Semantic SearchIdentifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
FileTablesBuilt on existing SQL Server FILESTREAM technology
Files and documents ◦ Stored in special tables in SQL Server
◦ Accessed if they were stored in the file system
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Full-Text Search EnhancementsProperty search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
RowsetOutput
with Scores
Varchar
NVarchar
Office
From Documents to Output
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
“Beyond Relational” vs. “Adoption”Start with unstructured (meaning non-relational) data
Use Windows technology◦ Reading and Writing Files (Win32 API)
◦ iFilters for reading proprietary formats
Develop indexed structure from unstructured data
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
(iFilter Required)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
DocumentsFull-Text Keyword
Index“FTI”
iFilters
Semantic Document Similarity Index “DSI”
Semantic Database
Semantic Key Phrase
Index –Tag Index
“TI”
“iFilter”?IFilters are components that allow search services to index content of specific file types, letting you search for content in those files.
They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Microsoft Office 2010 Filters PackLegacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote filter
Visio Filter
Publisher Filter
Open Document Format Filter
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Adobe PDF iFilter 9 for 64-bit platformsAllows PDF search
Not currently supported for Windows 7 or 8◦ But I used it anyway
Add the Bin directory to your path◦ Computer (right click), Properties, Advanced System Settings, Environment Variables
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
“Semantic Language Statistics Database”?This database contains the statistical language models required by semantic search.
A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Languages Currently SupportedTraditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Phases of Semantic Indexing
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Full Text Keyword Index “FTI”
Semantic Key Phrase Index –Tag Index “TI”
Semantic Document Similarity Index “DSI”
http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
Performance
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Integrated Full Text Search (iFTS)Improved Performance and Scale:
◦ Scale-up to 350M documents for storage and search
◦ iFTS query performance 7-10 times faster than in SQL Server 2008
◦ Worst-case iFTS query response times less than 3 sec for corpus
◦ Similar or better than main database search competitors
(2012, Michael Rys, Microsoft)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Linear Scale of FTI/TI/DSIFirst known linearly scaling end-to-end Search and Semantic product in the industry
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Time in Seconds vs. Number of Documents(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)
ConclusionSQL Server 2012 adds new text processing capabilities
This technology scales linearly
Microsoft invites millions of documents for enterprise-level applications
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
NetworkMarkTab Consulting
◦ http://marktab.com
Blog◦ http://marktab.net
Twitter◦ @marktabnet
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Appendix
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
ReferencesVideo
◦ http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
◦ http://www.microsoftpdc.com/2009/SVR32
Semantic Search (Books Online) – explains the demo◦ http://msdn.microsoft.com/en-us/library/gg492075.aspx
Paper◦ http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Demo: My Semantic Search Samplehttp://mysemanticsearch.codeplex.com/
Requires:◦ iFilters
◦ Semantic Language Statistics Database
◦ IIS7, IIS6, with Windows Authentication
◦ .NET 4.0
◦ Silverlight 4.0
◦ FILESTREAM (complete)
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
Demo: T-SQL and DocumentsNaveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE
AbstractSQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search applied task). This text mining technology leverages the already established Full Text Index and builds semantic indexes in a two-phase process. This session's detailed description and demo give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demo is a web-based Silverlight application showing how to interactively use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft text and data mining functionality.
©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE