classification & your intranet: from chaos to control susan stearns inmagic, inc. ...

21
Classificatio n & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc. www.inmagic.com [email protected] E-Libraries E204 May, 2003

Upload: darius-holdway

Post on 29-Mar-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Classification & Your Intranet: From Chaos to Control

Susan StearnsInmagic, [email protected] E204May, 2003

Page 2: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Look familiar??#$%&*

Why can’t I find anything on the Intranet?

How do we manage all the information we want to publish on the Intranet?

User Content Manager

Page 3: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

• The issue:▶We know that we have documents in-house that

contain hugely valuable information

• The problem:▶How do users find the right information at the right

time?

• The answer:▶Automated spider and meta- classification software

that allow an enterprise to automatically build and maintain a completely searchable database of critical content.

Page 4: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Automated Spidering Software

• A spider that “crawls” specified in-house servers and Web sites

• Extracts content from most popular file types and formats – HTML, Text, MSOffice, PDF – even e-mail

• Content can be loaded into a database

Page 5: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Key features to look for in a spider

• Document types: Microsoft Office, PDF, other formats (IFilter compatible)

• Zip files and email folders can be crawled• Remote administration• Can be scheduled to run multiple times a day• Web crawling can be set up to “n” levels deep• Easy to create an XML transform to your

database design• Integrates with automated classification software

Page 6: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

How a spider works

File system crawl

Database

Webcrawl

The Spider

Native document cache

Extracted text cache

XML load files

Content Manager can add value (e.g. add additional meta-tags, etc.)

Users can search and access “Gathered” content via a Web-browser

Meta-Classification

Page 7: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

We Love Search: We Hate Search

• Search is ubiquitous but insufficient▶Only one slice into content▶Missing relationships across

information▶Few are skilled at searching

Page 8: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

The search engine paradox: “Regardless of the product or a user's ability to use it, effective searches require the user to know the terms they need to use before they type them into the search engine.”

The Delphi Group

Page 9: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

The Solution: Meta-Classification

• Enrich the content with meta-data

• Leverage XML and integrate content from multiple sources

• Extract other useful concepts

• Give users browse-able directories in addition to a search box

Page 10: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

What is Meta-Classification?

• Automated meta-data extraction• Meta-data includes “subject

information” as well as names of people, company names, acronyms, key noun phrases

• Auto-classification of documents using a predefined taxonomy

• This meta-data can be mapped to a database along with the full-text of the document or a URL link

Page 11: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Why Meta-Classification?

• Creates structured information from unstructured data

• Allows local terminology to be reflected in searching

• Provides a browse-able directory

• Greatly enhances search through controlled vocabulary

Page 12: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

How does it work?

• Spider/crawl the documents to create a corpus

• Automated software▶Identifies key words and phrases▶Maps them to known topics in

taxonomy▶Scores the topics and derives a

central theme▶Repeat for the sub-themes

Page 13: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Step 1: Identify words and phrases in the text

Microsoft NASDAQ:MSFT, which won a round in its antitrust fight against the government today, launched its Microsoft.Net initiative that could someday replace computer hardware with software. Via XML (extensible markup language), Microsoft.net will enable use of much larger computers accessible on the Internet for storage of programs, word processing files and other data.

Page 14: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Step 2:Map them to topics within the taxonomy

Government: governmentComputer Science: Internet, XMLHardware: computersStorage: data, storageApplication Files: programs, languageWord Processing Files: word processing filesSoftware Companies: MicrosoftMicrosoft: Microsoft.Net

Page 15: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Step 3: Determine themes

• End results of classification of this story are:▶Central Theme: Microsoft▶Sub-theme 1: Word Processing Files▶Sub-theme 2: Software Companies

• Microsoft is a good match for central theme

• Microsoft.Net would have been the best classification▶Original taxonomy didn’t know this topic▶Will be added to taxonomy

Page 16: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003
Page 17: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Customize the Taxonomy

• 1 million node taxonomy often too large

• Develop a custom taxonomy▶A subset of the large taxonomy▶Selected nodes to match business

needs▶A set of rules to aggregate from the

low level topics in the large taxonomy to the custom taxonomy

Page 18: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

The Result

• Very large corpus of content can be classified in automated fashion

• Meta-data is used to create browse-able directories

• Meta-data is used for searching

• End user is given “clues” for finding the right information

Page 19: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Other features to consider

• Document summaries/abstracts

• Including external content▶Spidered from Web sites▶Integrated from licensed content

sources

• User submissions

• User ratings/reviews

Page 20: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

To Control

Web Content

Meta Data:Captured centrally

User Interaction

Word Doc

Document Properties, Classification

Search and Browse

Content Collection(spider)

Context

(entity extraction/auto-classification)

Corporate Intranet

From Chaos

Page 21: Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc.  sstearns@inmagic.com E-Libraries E204 May, 2003

Thank You.

[email protected]