applied enterprise semantic mining

33
Applied Enterprise Semantic Mining TEXT MINING WITH SQL SERVER 2012 PRESENTED AT ATLANTA MICROSOFT BUSINESS INTELLIGENCE GROUP JANUARY 28, 2013 Mark Tabladillo Ph.D. Data Mining Scientist MarkTab Inc. ©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Upload: mark-tabladillo

Post on 27-May-2015

444 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applied enterprise semantic mining

Applied Enterprise Semantic MiningTEXT MINING WITH SQL SERVER 2012

PRESENTED AT ATLANTA MICROSOFT BUSINESS INTELLIGENCE GROUP

JANUARY 28, 2013

Mark Tabladillo Ph.D.Data Mining Scientist

MarkTab Inc.

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 2: Applied enterprise semantic mining

About MarkTabhttp://marktab.com

http://marktab.net

@MarkTabNet

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 3: Applied enterprise semantic mining

IntroductionSQL Server 2012 has new Programmability Enhancements

◦ Statistical Semantic Search

◦ File Tables

◦ Full-Text Search Improvements

These combined technologies make SQL Server 2012 a strong contender in text mining

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 4: Applied enterprise semantic mining

ChallengesBuilding and Maintaining Applications with relational and non-relational data is hard

◦ Complex integration

◦ Duplicated functionality

◦ Compensation for unavailable services

80% of all data is not stored in databases! Most of it is “unstructured”

(2012, Michael Rys, Microsoft)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 5: Applied enterprise semantic mining

Microsoft and Google

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 6: Applied enterprise semantic mining

HistoryJuly 2008

◦ Microsoft purchases Powerset for US$100 Million

◦ Google Dismisses Semantic Search

◦ http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/

◦ http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 7: Applied enterprise semantic mining

HistoryMarch 2009

◦ Google announces “snippets” as relevant to search

◦ The media picks this story up as “semantic search”

◦ http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 8: Applied enterprise semantic mining

HistoryFebruary 2012

◦ Google announces Knowledge Graph, an explicit application of semantic search

◦ http://mashable.com/2012/02/13/google-knowledge-graph-change-search/

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 9: Applied enterprise semantic mining

HistoryApril 2012

◦ Microsoft purchases 800+ patents from AOL for US$1 Billion

◦ Among the patents are semantic search and metadata querying – older than Google

◦ http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 10: Applied enterprise semantic mining

New in SQL Server 2012HTTP://MSDN.MICROSOFT.COM/EN -US/LIBRARY/CC645577.ASPX

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 11: Applied enterprise semantic mining

Goals of Semantic SearchReduce the cost of managing all data

Simplify the development of applications over all data

Provide management and programming services for all data

Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top

(2012, Michael Rys, Microsoft)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 12: Applied enterprise semantic mining

Statistical Semantic SearchIdentifies statistically relevant key phrases

Based on these phrases, can identify (by score) similar documents

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 13: Applied enterprise semantic mining

FileTablesBuilt on existing SQL Server FILESTREAM technology

Files and documents ◦ Stored in special tables in SQL Server

◦ Accessed if they were stored in the file system

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 14: Applied enterprise semantic mining

Full-Text Search EnhancementsProperty search: search on tagged properties (such as author or title)

Customizable NEAR: find words or phrases close to one another

New Word Breakers and Stemmers (for many languages)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 15: Applied enterprise semantic mining

RowsetOutput

with Scores

Varchar

NVarchar

Office

PDF

From Documents to Output

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 16: Applied enterprise semantic mining

“Beyond Relational” vs. “Adoption”Start with unstructured (meaning non-relational) data

Use Windows technology◦ Reading and Writing Files (Win32 API)

◦ iFilters for reading proprietary formats

Develop indexed structure from unstructured data

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 17: Applied enterprise semantic mining

(iFilter Required)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

DocumentsFull-Text Keyword

Index“FTI”

iFilters

Semantic Document Similarity Index “DSI”

Semantic Database

Semantic Key Phrase

Index –Tag Index

“TI”

Page 18: Applied enterprise semantic mining

“iFilter”?IFilters are components that allow search services to index content of specific file types, letting you search for content in those files.

They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 19: Applied enterprise semantic mining

Microsoft Office 2010 Filters PackLegacy Office Filter (97-2003; .doc, .ppt, .xls)

Metro Office Filter (2007; .docx, .pptx, .xlsx)

Zip Filter

OneNote filter

Visio Filter

Publisher Filter

Open Document Format Filter

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 20: Applied enterprise semantic mining

Adobe PDF iFilter 9 for 64-bit platformsAllows PDF search

Not currently supported for Windows 7 or 8◦ But I used it anyway

Add the Bin directory to your path◦ Computer (right click), Properties, Advanced System Settings, Environment Variables

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 21: Applied enterprise semantic mining

“Semantic Language Statistics Database”?This database contains the statistical language models required by semantic search.

A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 22: Applied enterprise semantic mining

Languages Currently SupportedTraditional Chinese

German

English

French

Italian

Brazilian

Russian

Swedish

Simplified Chinese

British English

Portuguese

Chinese (Hong Kong SAR, PRC)

Spanish

Chinese (Singapore)

Chinese (Macau SAR)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 23: Applied enterprise semantic mining

Phases of Semantic Indexing

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Full Text Keyword Index “FTI”

Semantic Key Phrase Index –Tag Index “TI”

Semantic Document Similarity Index “DSI”

http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing

Page 24: Applied enterprise semantic mining

Performance

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 25: Applied enterprise semantic mining

Integrated Full Text Search (iFTS)Improved Performance and Scale:

◦ Scale-up to 350M documents for storage and search

◦ iFTS query performance 7-10 times faster than in SQL Server 2008

◦ Worst-case iFTS query response times less than 3 sec for corpus

◦ Similar or better than main database search competitors

(2012, Michael Rys, Microsoft)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 26: Applied enterprise semantic mining

Linear Scale of FTI/TI/DSIFirst known linearly scaling end-to-end Search and Semantic product in the industry

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Time in Seconds vs. Number of Documents(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)

Page 27: Applied enterprise semantic mining

ConclusionSQL Server 2012 adds new text processing capabilities

This technology scales linearly

Microsoft invites millions of documents for enterprise-level applications

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 28: Applied enterprise semantic mining

NetworkMarkTab Consulting

◦ http://marktab.com

Blog◦ http://marktab.net

Twitter◦ @marktabnet

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 29: Applied enterprise semantic mining

Appendix

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 30: Applied enterprise semantic mining

ReferencesVideo

◦ http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search

◦ http://www.microsoftpdc.com/2009/SVR32

Semantic Search (Books Online) – explains the demo◦ http://msdn.microsoft.com/en-us/library/gg492075.aspx

Paper◦ http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 31: Applied enterprise semantic mining

Demo: My Semantic Search Samplehttp://mysemanticsearch.codeplex.com/

Requires:◦ iFilters

◦ Semantic Language Statistics Database

◦ IIS7, IIS6, with Windows Authentication

◦ .NET 4.0

◦ Silverlight 4.0

◦ FILESTREAM (complete)

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 32: Applied enterprise semantic mining

Demo: T-SQL and DocumentsNaveen Garg

Requires Adventure Works (from Codeplex)

http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE

Page 33: Applied enterprise semantic mining

AbstractSQL Server 2012 debuts a new Semantic Platform (commonly known as the Semantic Search applied task). This text mining technology leverages the already established Full Text Index and builds semantic indexes in a two-phase process. This session's detailed description and demo give you important information for the enterprise implementation of Tag Index and Document Similarity Index. The demo is a web-based Silverlight application showing how to interactively use semantic search. Currently, the indexes work for 15 languages. We'll also look at strategy tips for how to best leverage the new semantic technology with existing Microsoft text and data mining functionality.

©2013 MARK TABLADILLO, ALL RIGHTS RESERVED WORLDWIDE