www.nlsearch.com
Classification at Northern Light
Presentation to Access 98
October 4, 1998
“This year, the World Wide Web has arrived as a serious supplier
of ‘serious’ online information.”
Sue Feldman, “Web Search Services in 1998: Trends and Challenges,” Searcher
Magazine, June 1998
Search engines are being held to higher standards
All users want freshness and manageable results sets
Professional information seekers want
– high relevance and high quality content first
– good descriptive information for all results
– precision searching
– text and tables
Web search environment
constant growth in all dimensions (pages, countries, languages, file formats)
constantly increasing traffic
continuous onslaught of spam
Practical considerations for search engines
significant engineering time spent counteracting spam
constantly adding disk space: 3 terabytes at Northern Light
crawler efficiency: must balance new page discovery with known-page re-crawl
You step in the stream, but the water has moved on.
This page is not here.
Search engines: limitations
lack the higher quality sources not found on the Web
no concept of classification as found in library systems
like an index of every word on every page in every book in your library
– with no subject catalog
Northern Light’s fundamental goals
Combine Web data with quality information not on the Web in a single integrated search
Make results set manageable for user (already a problem; worse after non-Web data is added)
Research Engine: Content as of Oct 98
Web
– 96,000,000 pages
Special Collection
– 3,600,000+ full-text documents
– 4600 journals, magazines, books, trusted reference works, etc.
Mixes free (Web) and fee-based (Special Collection) content
Relevancy ranking still critical
Engines continue to improve their ranking algorithms
All seem to agree that relevancy ranking alone is not enough to manage results lists of the size commonly seen now
Techniques for taming results sets
abridge the database (Excite, Lycos, Infoseek)
re-sort by popularity (HotBot/Direct Hit)
suggest further refinement steps to user (AltaVista Refine)
sort based on number of inbound links (Infoseek…?)
sort by classification metadata (Northern Light)
Research Engine: Classification
classify the Web according to the same standards found in journal literature
sort results for user, based on this classification
work with the user to refine the question (reference interview approach)
Relevancy ranking has its limits
Library patron: “I need some baseball information.”
Librarian: “OK. Here are 41,536 books and sources about baseball, relevancy ranked.”
Good general sources may be ranked on top, but the user probably had something more specific in mind...
Reference librarian approach: work with the user to refine the question
“I need some baseball information.”
“OK. Tell me more. Do you want general info, teams and players, recent news...?”
“Um... team info”
“OK. Red Sox, Yankees, ...?”
“Red Sox.”
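The drill-down in this dialogue can be sketched in code. This is only an illustration, assuming each result carries a classification path from broad to narrow; the `RESULTS` data and the `folders` function are hypothetical, not Northern Light's implementation.

```python
from collections import defaultdict

# Hypothetical results: (title, classification path from broad to narrow).
RESULTS = [
    ("Red Sox schedule", ("Baseball", "Teams and players", "Red Sox")),
    ("Yankees roster", ("Baseball", "Teams and players", "Yankees")),
    ("Fenway Park history", ("Baseball", "Teams and players", "Red Sox")),
    ("Last night's scores", ("Baseball", "Recent news")),
    ("Rules of baseball", ("Baseball", "General info")),
]

def folders(results, chosen=()):
    """Group results under the `chosen` path by the next level of their classification."""
    chosen = tuple(chosen)
    depth = len(chosen)
    grouped = defaultdict(list)
    for title, path in results:
        # Keep only results inside the chosen folder that have a narrower level to show.
        if path[:depth] == chosen and len(path) > depth:
            grouped[path[depth]].append(title)
    return dict(grouped)

# "Tell me more": offer the aspects of baseball found in the results.
first_level = folders(RESULTS, ("Baseball",))
# "Team info": offer the teams found under that aspect.
second_level = folders(RESULTS, ("Baseball", "Teams and players"))
print(sorted(first_level))
print(second_level["Red Sox"])
```

Each call plays the part of one librarian question: it shows only the next level of choices that actually occur in the current result set.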
Classification helps organize results
shows aspects of a topic (‘baseball’, ‘diagnostic tests’)
disambiguates queries (‘what is balance’)
sometimes answers questions directly (‘12th President’)
Search Current News
Computer networks
Local area networks
Modems
Cable modems
all others...
Special Collection
Personal computers
Computer caches
Buses (computer)
Health care software
Software industry
Circuit design
Special Collection documents
Commercial sites
Sociology of the family
Employee assistance programs
Neurology
Online banking
Helicopters
Martial arts
Chinese philosophy
all others...
1. WHAT IS BALANCE?
84% - Articles & General info: WHAT IS BALANCE? Back to New Evangelicanism Reports. Back to the Way of Life Home Page Way of Life Literature Online Catalog You Can Own…
11/09/97
Personal Page: http://www.dsinclair.com/~dcloud/fbns/whatisbalance.htm
2. Emotional Stability is Balance
77% - Articles & General info: Emotional Stability is Balance Emotional Stability is Balance - 1 He is unbalanced - 2 She’s not on an even keel - 3 They’re upset…
03/24/95
Educational site: http://cogsci.berkeley.edu/metaphors/EmotionalStabilityIsBalance.html
3. What is balance?
73% - Biographical sources: “What is balance?” This is an ongoing, soul-searching, head-scratching question that my husband, Don, and I ponder on a regular bases….
07/01/96
Exceptional parent (magazine): Available at Northern Light
Subject classification of Web documents
exists for sites in Web directories (Yahoo, Looksmart, The Mining Co)
exists behind CGI interfaces
doesn’t exist at the document level
except where supplied by the page creator
Cost of document classification
Original cataloging of book: $37
Creating a journal article abstract: $1.50
Deriving subject headings from journal abstract: $.20
At $1.70 per document (abstract plus subject headings), 95,000,000 Web documents = $161.5 million
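The $161.5 million total matches the abstract and subject-heading costs added together for every document. A quick check of that arithmetic, assuming both steps are paid once per Web document (the constant and function names are illustrative):

```python
# Per-document costs quoted on the slide, held in cents so the arithmetic stays exact.
ABSTRACT_COST_CENTS = 150    # creating an abstract: $1.50
HEADINGS_COST_CENTS = 20     # deriving subject headings from the abstract: $0.20
WEB_DOCUMENTS = 95_000_000   # document count used on the slide

def manual_classification_cost_dollars(documents: int) -> float:
    """Total cost in dollars, assuming both steps are paid for every document."""
    per_document_cents = ABSTRACT_COST_CENTS + HEADINGS_COST_CENTS
    return documents * per_document_cents / 100

total = manual_classification_cost_dollars(WEB_DOCUMENTS)
print(f"${total / 1e6:.1f} million")  # prints "$161.5 million"
```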
Metadata manufacturing
Automatically determine document’s subject, type, source and language metadata
Controlled vocabularies interoperate with classifier system
System classifies pages
Fraction of cent per document
NL’s controlled vocabularies
Editorially developed
Hierarchical in form (graph)
Exist for subjects, types, and sources
NL’s subject vocabulary
Subject scope is unlimited (as in LC, Dewey, Yahoo)
Major points of reference were DDC, LC Subject headings, UMI subject headings, and subject-specialized classification schemes
Unique, selective conflation of these
Mapping NL vocabulary to content partners’ vocabularies keeps it fresh and complete
20,000 concepts; 200,000-300,000 concept equivalents
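One way to picture a hierarchical vocabulary of concepts with many equivalents per concept is the sketch below. It is a hypothetical data structure, not Northern Light's: the `Concept` and `Vocabulary` classes, the sample terms, and the parent links between them are all assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One node in a hypothetical controlled vocabulary."""
    name: str
    broader: list[str] = field(default_factory=list)    # parents; a graph, so several are allowed
    equivalents: set[str] = field(default_factory=set)  # synonyms and other equivalent terms

class Vocabulary:
    def __init__(self) -> None:
        self.concepts: dict[str, Concept] = {}
        self._lookup: dict[str, str] = {}  # lower-cased term -> preferred concept name

    def add(self, name, broader=(), equivalents=()):
        """Register a concept and index its name and equivalents for lookup."""
        self.concepts[name] = Concept(name, list(broader), set(equivalents))
        for term in {name, *equivalents}:
            self._lookup[term.lower()] = name

    def resolve(self, term):
        """Return the preferred concept for a term, or None if it is unknown."""
        return self._lookup.get(term.lower())

    def ancestors(self, name):
        """Collect every broader concept reachable from `name`."""
        seen, stack = set(), list(self.concepts[name].broader)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.concepts[current].broader)
        return seen

vocab = Vocabulary()
vocab.add("Computers")
vocab.add("Telecommunications")
vocab.add("Computer networks", broader=["Computers", "Telecommunications"])
vocab.add("Local area networks", broader=["Computer networks"], equivalents=["LAN", "LANs"])
print(vocab.resolve("lans"))
print(sorted(vocab.ancestors("Local area networks")))
```

Allowing more than one broader concept per node is what makes the hierarchy a graph rather than a strict tree.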
Subject classification process
Three main techniques:
– mapping
– automatic classification
– editorial classification of whole web sites
Mapping
Indexing vocabularies of content partners are normalized with NL vocabularies
Excellent source of new terms; helps maintain freshness and ensure complete coverage of a topic
All terms become synonyms, equivalents of NL terms and are used in automatic classification... creating a ‘network effect’ of subject knowledge
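A minimal sketch of this mapping step, assuming each partner supplies a table from its own indexing terms to NL concepts. The partner names, the table contents, and both functions are hypothetical:

```python
# Hypothetical mapping tables: partner indexing term -> NL concept.
PARTNER_MAPS = {
    "partner_a": {"LANs": "Local area networks", "Internet banking": "Online banking"},
    "partner_b": {"Local-area networks": "Local area networks", "Rotorcraft": "Helicopters"},
}

def normalize(partner, partner_terms):
    """Translate a partner's index terms into NL concepts; return (mapped, unmapped)."""
    table = PARTNER_MAPS.get(partner, {})
    mapped = sorted({table[t] for t in partner_terms if t in table})
    # Unmapped terms are candidates for editorial review as possible new vocabulary.
    unmapped = sorted(t for t in partner_terms if t not in table)
    return mapped, unmapped

def equivalents_by_concept():
    """Invert all partner maps: each mapped partner term becomes an equivalent of its NL concept."""
    result = {}
    for table in PARTNER_MAPS.values():
        for term, concept in table.items():
            result.setdefault(concept, set()).add(term)
    return result

mapped, unmapped = normalize("partner_a", ["LANs", "Token ring"])
print(mapped, unmapped)
print(sorted(equivalents_by_concept()["Local area networks"]))
```

Each new partner adds equivalents to existing concepts, so the pooled term set grows with every mapping. That is one reading of the 'network effect' the slide mentions.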
Partner vocabularies mapped to date
journal aggregators: UMI, IAC, Ethnic News Watch, Responsive Database Services
news databases: AP News, Comtex Newswires, Newsbytes
others: U.S. Pharmacopeia, American Banker, Engineering News Record
Automatic classification
based on words contained in document
uses Term Frequency/Inverse Document Frequency methods
document must have a strong degree of ‘aboutness’ to be classified
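The slide names TF/IDF but gives no formula, so the following is a generic textbook sketch rather than Northern Light's classifier. It assumes each concept is represented by a small set of single-word terms and treats 'aboutness' as a score threshold. The sample documents, `CONCEPT_TERMS`, and the threshold value are all illustrative.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def inverse_document_frequency(tokenized_docs):
    """idf(t) = log(N / df(t)); words found in fewer documents weigh more."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def classify(text, concept_terms, idf, threshold):
    """Return {concept: score} for concepts scoring at or above the threshold."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    tf = Counter(tokens)
    scores = {}
    for concept, terms in concept_terms.items():
        # Sum of (relative term frequency x idf) over the concept's terms.
        score = sum(tf[t] / len(tokens) * idf.get(t, 0.0) for t in terms)
        if score >= threshold:
            scores[concept] = score
    return scores

DOCS = [
    "the modem connects the computer to the network",
    "cable modem speeds and modem prices",
    "the red sox won the baseball game",
    "baseball scores and news",
]
CONCEPT_TERMS = {"Modems": {"modem", "cable"}, "Baseball": {"baseball", "sox"}}
IDF = inverse_document_frequency([tokenize(d) for d in DOCS])

strong = classify(DOCS[1], CONCEPT_TERMS, IDF, threshold=0.2)  # mostly about modems
weak = classify(DOCS[0], CONCEPT_TERMS, IDF, threshold=0.2)    # mentions a modem once
print(sorted(strong), weak)
```

With these sample inputs the second document is assigned to Modems, while the first mentions a modem only in passing and falls below the threshold, so it gets no class at all.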
NL’s type classification
This scheme too is hierarchical, e.g.
• Reviews
– Book reviews
– Movie reviews
– Product reviews
classification process based on words and structure of document
Librarians at Northern Light
Build and maintain controlled vocabulary
Map vocabularies of new partners
Continually tune classification performance
Help design and test user interface
Mine and classify whole web sites
Edit databases
Database editing
Classification used to slice NL database into “vertical search engines”
Since Feb 98, we’ve released
– 17 subject search engines on NL Power Search
– 26 industry databases (for NL; also on Netscape Netcenter)
– 5 personal finance databases (for Doubleclick)
– music industry database (with Billboard magazine)
– construction industry database (with Engineering News Record)
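Slicing a classified database into a vertical search engine can be pictured as filtering on a branch of the subject hierarchy. This sketch assumes each document carries a set of subject terms and each term has at most one broader term. The `BROADER` links, the `DOCUMENTS` data, and both functions are hypothetical.

```python
# Hypothetical broader-term links and per-document subject metadata.
BROADER = {
    "Online banking": "Personal finance",
    "Mutual funds": "Personal finance",
    "Helicopters": "Aviation",
}
DOCUMENTS = {
    "doc1": {"Online banking"},
    "doc2": {"Helicopters"},
    "doc3": {"Mutual funds", "Helicopters"},
}

def falls_under(subject, root):
    """Walk up the broader-term chain to see whether `subject` sits under `root`."""
    while subject is not None:
        if subject == root:
            return True
        subject = BROADER.get(subject)
    return False

def vertical_slice(root):
    """Return the ids of documents with at least one subject under `root`."""
    return sorted(
        doc_id
        for doc_id, subjects in DOCUMENTS.items()
        if any(falls_under(s, root) for s in subjects)
    )

print(vertical_slice("Personal finance"))
print(vertical_slice("Aviation"))
```

A document with subjects in two branches, like `doc3` here, appears in both slices.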
Automatic classification is still a fledgling technology, however...
it has proved practical for classifying close to 100 million web pages
it is remarkably accurate, given the breadth of concept space it covers
it is responsive to tuning
it is effective in managing results sets for users
Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA [email protected]