Surviving the Information Glut
bbtitle 9/23/94
Presentation by Bob Boeri
Factory Mutual Engineering & [email protected]
October 7, 1994
Roots of the Problem
< Storage: increasing
< Access: faster
< Document complexity: more
< Information quantity: increasing exponentially
"'A cat may look at a king,' said Alice. 'I've read that in some book, but I don't remember where.'" -- Alice in Wonderland
Document Complexity
< word processor types
< fonts
< rich layouts
< tables
< graphics/photos
< video, sound, hypertext
< SGML
< ... what isn't a document?
"and what is the use of a book," thought Alice, "without pictures or conversations?" -- Alice in Wonderland
How to Find What You Are Looking For
< how large is your collection of documents?
< how complex are they?
< how complex are the searches?
< who will search?
  • individuals working by themselves
  • members of a corporate organization
Searching a Very Small Collection
< a few dozen documents
< simple structure (e.g., memo or e-mail)
< written consistently (e.g., you, one author)
Find that note about an inexpensive, simple word processor that never needs upgrading and will let you add simple graphics to your writing. Runs under MS-Windows.
Trivial Search Techniques
< browse through each one
< use a word processor "list files"
< simple search system, simple boolean search
First Search Barrier
< somewhere between a few dozen documents and several hundred
< can't remember the exact words to search for; begin searching synonyms or using wild cards.
She went on, rather surprised at not being able to think of the word. 'I mean to get under the -- under the -- under THIS, you know!' putting her hand on the trunk of a tree. -- Through the Looking Glass
First-Level Search Techniques
< Range searches: "word processor" <sentence> "inexpensive"
< only 2 hits (probably missed something).
< forgot to ask about graphics support.
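The `<sentence>` range operator can be sketched roughly as follows. This is only an illustration, assuming a document is a plain string and that sentences end at ".", "!" or "?"; the names `split_sentences` and `same_sentence` are hypothetical and do not come from any search product.

```python
import re

def split_sentences(text):
    """Naively split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def same_sentence(text, *terms):
    """Return the sentences that contain every term (case-insensitive)."""
    wanted = [t.lower() for t in terms]
    return [s for s in split_sentences(text)
            if all(t in s.lower() for t in wanted)]

# Made-up note used only for this illustration.
note = ("This inexpensive word processor runs under MS-Windows. "
        "It handles simple graphics. Upgrades are never needed.")

hits = same_sentence(note, "word processor", "inexpensive")
print(hits)
```

Asking for "word processor" and "graphics" in the same sentence finds nothing in this note, because the two terms sit in different sentences. That is one way a narrow range search can miss a relevant document.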
First-Level Search Techniques
< Wild cards: word process* <sentence> basic
< 99 hits; unusable
< Maybe asking about graphics support too will reduce the number of hits
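A wild card such as `word process*` can be approximated with a regular expression. The sketch below assumes "*" stands for any run of word characters; `wildcard_to_regex` and `wildcard_hits` are hypothetical helper names.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern such as 'word process*' into a compiled regex.
    '*' matches any run of word characters; everything else is literal."""
    parts = [re.escape(p) for p in pattern.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)

def wildcard_hits(pattern, text):
    """Return every substring of text matched by the wildcard pattern."""
    return wildcard_to_regex(pattern).findall(text)

# Made-up text used only for this illustration.
text = ("A word processor is used for word processing. "
        "Food processors are a different matter.")
print(wildcard_hits("word process*", text))
```

The wild card picks up both "word processor" and "word processing", which is why hit counts grow so quickly once wild cards are used on a larger collection.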
Searching Gets Complex
< (basic <sentence> word process*) <paragraph> (support* <sentence> graphic*)
< complex expression
< system searches a long time
< finds nothing useful.
Sample Hits from 1st-level Complex Search
< "Although Visual Basic contains a rudimentary word processor... graphic support is really limited to OLE and DDE."
< "Basic word processing skills can sometimes be transferred to..... programs which allow you to create graphic effects."
< new ways to divide and conquer
< richer, easier search aids
< richer reporting of results
Need to Break the 1st-Level Search Barrier:
< reduce hits to most relevant
< get hits when simpler searches fail
< additional techniques beyond Boolean
Combine Structured and Full Text Queries
< Apply search to portion of library ("form queries")
< Requires knowledge of the library
< Requires "catalog card" for each document (e.g., date, subject)
< Smart system might construct catalog card
  • Requires highly regular documents
  • Risk of catalog errors
Combine Structured and Full Text Queries
< Could design as a form for users to fill out
< Example:
DATE: after 1/1/94
(inexpensive <sentence> "word processor" <sentence> "windows")
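A form query of this kind might be sketched as below. The `Document` layout, the field names, and the sample library are all hypothetical, and the `<sentence>` range test is simplified here to "every term appears somewhere in the text".

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    """A document plus its 'catalog card' fields (hypothetical layout)."""
    title: str
    written: date
    subject: str
    text: str

def form_query(library, after, terms):
    """Keep documents dated after `after` whose text contains every term."""
    results = []
    for doc in library:
        if doc.written <= after:
            continue                      # structured filter on the card
        body = doc.text.lower()
        if all(t.lower() in body for t in terms):
            results.append(doc)           # full-text filter on the body
    return results

# Made-up library used only for this illustration.
library = [
    Document("Old note", date(1993, 6, 1), "word processors",
             "An inexpensive word processor for Windows."),
    Document("New note", date(1994, 3, 15), "word processors",
             "An inexpensive word processor for Windows with graphics."),
    Document("Car ad", date(1994, 5, 2), "classifieds",
             "Low-mileage Saab, one owner."),
]

hits = form_query(library, date(1994, 1, 1),
                  ["inexpensive", "word processor", "windows"])
print([d.title for d in hits])
```

The date field removes the older note before any text is examined, so the structured part of the query narrows the library and the full-text part then works on a smaller set.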
Relevancy Ranking
< Puts most likely hits at the top of the list
< Requires understanding of what's most important
  • # of hits/document
  • weighting certain hits (e.g., exact matches) more than others
  • weighting other criteria (such as date or other structured fields)
  • let users say what's most important to them
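One possible reading of "# of hits/document" plus "weighting exact matches more" is sketched below. The weights (3.0 and 1.0), the function names, and the sample documents are assumptions chosen for illustration, not values from any real system.

```python
import re

def score(text, terms, exact_weight=3.0, partial_weight=1.0):
    """Score = weighted count of term occurrences.
    Whole-word matches count more than matches inside longer words."""
    total = 0.0
    lowered = text.lower()
    for term in terms:
        t = re.escape(term.lower())
        exact = len(re.findall(r"\b" + t + r"\b", lowered))
        anywhere = len(re.findall(t, lowered))
        total += exact_weight * exact + partial_weight * (anywhere - exact)
    return total

def rank(docs, terms):
    """Return (score, name) pairs, best first, dropping zero scores."""
    scored = [(score(text, terms), name) for name, text in docs.items()]
    return sorted((s for s in scored if s[0] > 0), reverse=True)

# Made-up documents used only for this illustration.
docs = {
    "review": "A basic word processor with graphics. This word processor is inexpensive.",
    "design": "Graphic design basics: one graphic per page.",
    "ad": "Low-mileage Saab for sale.",
}
for s, name in rank(docs, ["word processor", "graphic"]):
    print(f"{s:4.1f}  {name}")
```

Letting users change `exact_weight` and `partial_weight` is one simple way to "let users say what's most important to them".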
Thesauruses
< General
< Specific
  • medical
  • legal
  • scientific
< user-modifiable
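Thesaurus support can be pictured as expanding each search term with its synonyms before searching. In this sketch the two dictionaries stand in for a general thesaurus and a user-modifiable one; their contents and the names `expand` and `matches` are invented for illustration.

```python
# Hypothetical general thesaurus shipped with the system.
GENERAL = {"inexpensive": {"cheap", "low-cost"}}
# Hypothetical user-modifiable layer added on top of it.
USER = {"word processor": {"text editor"}}

def expand(term, *thesauruses):
    """Return the term plus every synonym listed in any thesaurus."""
    variants = {term.lower()}
    for th in thesauruses:
        variants |= th.get(term.lower(), set())
    return variants

def matches(text, term, *thesauruses):
    """True if the text contains the term or any of its synonyms."""
    lowered = text.lower()
    return any(v in lowered for v in expand(term, *thesauruses))

print(sorted(expand("inexpensive", GENERAL, USER)))
print(matches("A cheap text editor", "word processor", GENERAL, USER))
```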
"I don't know the meaning of half those long words, and what's more, I don't believe you do either!" -- Alice in Wonderland
Linguistic Helps
< Automatic search for parts of speech
  • "sprinkle" also searches for "sprinkled," "sprinkling," etc.
< Fuzzy search
  • "sprinkle" also searches for "sparkle"
  • helps overcome some OCR errors.
  • user-specifiable (how many letters to make "fuzzy")
  • gets words you would have missed
  • gets words that make no sense at all.
< Natural Language Queries: ("Find me cheap reliable easy Windows word processors")
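The "sprinkle" example can be sketched with a deliberately naive rule for regular English verbs. Real systems use proper stemming or dictionaries; `inflections` and `search_inflected` are hypothetical names, and the rule below covers only the simplest cases.

```python
def inflections(word):
    """Naively generate common English inflections of a regular verb."""
    w = word.lower()
    if w.endswith("e"):
        stem = w[:-1]
        return {w, w + "s", w + "d", stem + "ing"}
    return {w, w + "s", w + "ed", w + "ing"}

def search_inflected(word, text):
    """Return the words in text that are inflections of `word`."""
    forms = inflections(word)
    tokens = [t.strip('.,;:!?"\'').lower() for t in text.split()]
    return [t for t in tokens if t in forms]

# Made-up sentence used only for this illustration.
text = "She sprinkled sugar, then kept sprinkling. A sprinkle is enough."
print(search_inflected("sprinkle", text))
```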
"Language is worth a thousand pounds a word." -- Through the Looking Glass
Complex and Modular Queries
< Create, debug, save queries
< Use queries as models for new queries
< If modular ("Legos")
  • assemble large search queries by plugging together smaller ones.
  • fine tune searches (adjusting rankings of search criteria).
  • build libraries of modular searches
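The "Lego" idea of plugging small queries together might look like the sketch below, where every query is a function from text to True/False. The building blocks `term`, `all_of`, and `any_of` are hypothetical names, and the saved queries are invented for illustration.

```python
def term(word):
    """Smallest building block: does the text contain this word?"""
    return lambda text: word.lower() in text.lower()

def all_of(*queries):
    """Combine queries with AND."""
    return lambda text: all(q(text) for q in queries)

def any_of(*queries):
    """Combine queries with OR."""
    return lambda text: any(q(text) for q in queries)

# A small "library" of saved queries that can be reused as models.
cheap = any_of(term("inexpensive"), term("cheap"))
word_processor = any_of(term("word processor"), term("word processing"))
graphics = any_of(term("graphic"), term("picture"))

# A larger query assembled by plugging the smaller ones together.
ideal_product = all_of(cheap, word_processor, graphics, term("windows"))

note = "A cheap Windows word processor that lets you add simple graphics."
print(ideal_product(note))
```

Because each saved query is an ordinary value, it can be debugged on its own and then reused inside larger queries without being retyped.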
Fuzzy Searches
< use neural network technology
< like sophisticated wildcard searches
< help overcome OCR errors
< find good matches and irrelevant ones
< can distort relevancy rankings by hit count
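The slide refers to neural network technology. The sketch below does not attempt that; it swaps in plain Levenshtein edit distance, a much simpler technique, purely to show the behaviour described: OCR errors are recovered, and irrelevant words come along too. The names `edit_distance` and `fuzzy_hits` and the sample text are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def fuzzy_hits(word, text, max_distance=1):
    """Return words in text within max_distance edits of `word`."""
    tokens = [t.strip('.,;:!?"\'').lower() for t in text.split()]
    return [t for t in tokens if edit_distance(word.lower(), t) <= max_distance]

# Made-up scanned text containing one OCR error ("graphlcs").
scanned = "Add simple graphlcs to your writing. Sparkle and sprinkle."
print(fuzzy_hits("graphics", scanned))
print(fuzzy_hits("sprinkle", scanned, max_distance=3))
```

With a tolerance of one edit the OCR error is found. With a tolerance of three, a search for "sprinkle" also returns "sparkle", which shows how a loose setting inflates the hit count.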
SGML Usage
< "Zone" searches
  • Confine searches to paragraph headings, chapter titles, etc.
< Use SGML DTDs directly:
  • Full, Arbitrary (all DTDs)
    • exploits full capabilities of your tag set
    • performance and/or size penalties
  • Specific DTDs only
    • "Any color Ford you want as long as it's black."
    • May be tuned for better use
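A "zone" search can be illustrated on a tiny tagged fragment. This sketch is not an SGML parser: it assumes simple, non-nested elements with explicit end tags and ignores DTDs and tag minimization. The tag names (`chapter`, `title`, `para`) and the function `zone_search` are made up for the example.

```python
import re

def zone_search(sgml, zone, word):
    """Return the contents of every <zone>...</zone> element containing word.
    Assumes simple, non-nested elements with explicit end tags."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(zone)),
                         re.IGNORECASE | re.DOTALL)
    return [content.strip() for content in pattern.findall(sgml)
            if word.lower() in content.lower()]

# Made-up fragment used only for this illustration.
sample = """
<chapter>
<title>Choosing a Word Processor</title>
<para>Graphics support varies widely between products.</para>
<para>A word processor need not be expensive.</para>
</chapter>
"""
print(zone_search(sample, "title", "word processor"))
print(zone_search(sample, "title", "graphics"))
```

Confining the search to titles finds the chapter about word processors and ignores a passing mention of graphics in body text.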
SGML Usage
< Filter (convert) SGML tags to application-specific codes.
  • Not authentic SGML use
  • May be better performance than authentic SGML
< Best when documents are themselves highly structured.
< One-way (from SGML to proprietary); loses important SGML benefit.
< Few vendors support SGML well
< Those who do may skimp on other search facilities.
Interest Profiling
< Profile determined by any number of means
< "I like these documents. Find me more like this."
  • simple
  • unexpected results
  • electronic highlighter improves search
< The more search tools the better.
Looking in classifieds for a low-mileage Saab, prefer beige or red, one-owner, automatic, 1993 or newer, less than $10,000.
Looking in PC literature for Windows word processor, easy to use, never needs upgrades, can handle graphics, bug-free, uses 1MB disk, less than $29.95.
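"Find me more like this" can be sketched by comparing word counts. The code below uses cosine similarity over bag-of-words vectors, which is one common choice and not necessarily what any particular product does; the function names and sample texts are assumptions.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: lower-cased word -> count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def more_like(liked, candidates):
    """Rank candidate names by similarity to the liked document."""
    profile = vectorize(liked)
    scored = [(cosine(profile, vectorize(text)), name)
              for name, text in candidates.items()]
    return [name for s, name in sorted(scored, reverse=True)]

# Made-up texts used only for this illustration.
liked = "Easy Windows word processor that handles graphics"
candidates = {
    "wp_review": "Review of an easy word processor for Windows",
    "saab_ad": "Low-mileage Saab, beige or red, one owner, automatic",
}
print(more_like(liked, candidates))
```

Highlighting a passage instead of the whole document would simply mean building the profile from the highlighted words, which tends to give a sharper profile.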
Information Agents
< passive
  • computed once, updated periodically
  • use when you choose (whenever new CD-ROM title appears)
< active
  • information gobots
  • always on the lookout for anything relevant
  • inform you with results or email notification
  • on-line or jukeboxes
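The difference between passive and active agents might be sketched as follows. The `Agent` class, its method names, and the use of a list as a stand-in for e-mail notification are all hypothetical.

```python
class Agent:
    """Watches documents for anything matching its keyword profile."""

    def __init__(self, keywords, notify):
        self.keywords = [k.lower() for k in keywords]
        self.notify = notify          # callback, e.g. send e-mail

    def relevant(self, text):
        """True if the text contains every keyword in the profile."""
        lowered = text.lower()
        return all(k in lowered for k in self.keywords)

    def scan(self, documents):
        """Passive use: run over a collection only when the user asks."""
        return [name for name, text in documents.items()
                if self.relevant(text)]

    def on_new_document(self, name, text):
        """Active use: called for every new arrival; notifies on a match."""
        if self.relevant(text):
            self.notify(name)

inbox = []                            # stands in for e-mail notification
agent = Agent(["word processor", "windows"], inbox.append)

agent.on_new_document("ad-1", "Low-mileage Saab for sale.")
agent.on_new_document("note-7", "New Windows word processor announced.")
print(inbox)
```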
Collateral Issues: Authoring and Using
< Authoring
  • Populating the system
  • Subject areas and forms
  • Document size
  • Legacy Documents
Populating the system
< Security: everyone have identical access?
< Easy way to get documents into system?
< Form per document for form queries?
  • (date, subject area, sub-type)?
  • subject area (e.g., word processors)?
  • sub-types within areas (e.g., character-based, GUI)
< Easy way to retract documents? Re-file documents? "See also" subject areas?
< QA of forms and documents
  • Form field info correct?
  • Complex document objects (e.g., tables).
Document Size
< Whole documents or chunks?
< What's appropriate to users?
  • Effort to build collection
  • Precision of hits
  • Size of hit list
  • What's natural and expected
"What size do you want to be?" the Caterpillar asked.
"Oh, I'm not so particular as to size," Alice hastily replied. "Only one doesn't like changing so often, you know."
-- Alice in Wonderland
Legacy Documents
< Paper
  • size, number, quality
  • OCR
  • Ability to attach page images
  • At least name file for faxing
< Electronic
  • document type
  • quality of author practices
  • fonts. . . . . .
  • command launch when possible
  • what about form queries/document?
"These words were followed by a very long silence, broken only by an occasional exclamation of 'Hjckrrh!' from the Gryphon."
-- Alice in Wonderland
Collateral Issues: Using
< Pie fonts
< Non-English characters
< Equations
< Font fidelity, size on-screen
  • letter "o" and zero
  • digit one "1", lowercase el "l", and capital i "I".
"The White Queen whispered, 'I can read words of one letter!... However, don't be discouraged. You'll come to it in time.'"
-- Through the Looking Glass
Collateral Issues: Using
< Navigation within documents
< Viewers
< Launching when Viewers Inadequate
< CD-ROM Performance
< Exporting information for reuse.
< Printing
"... the books are something like our books, only the words go the wrong way."
-- Through the Looking Glass
Collateral Issues: Using
< Interactive searches
< Batch searches ("go do this later and tell me what you found")
< Autonomous information agents
  • Continuous monitoring
  • Urgent, routine notification
  • Empower agents to "Ring a bell"; "Push a button"
  • Active documents: "Go find me more like yourself"
Adobe Acrobat version 2.0
< Powerful searching
< CD-ROM performance
< Font problem disappears
< SGML promised
Even the best searching system can't find what isn't there. But the best ones will keep on trying.
And What of Our Original Search... Perfect Word Processor, Saab for a Song
Alice laughed. 'There's no use trying,' she said: 'one CAN'T believe impossible things.'
'I daresay you haven't had much practice,' said the Queen. -- Through the Looking Glass