enabling exploration through text analytics
DESCRIPTION
Enterprises are awash in textual documents that represent valuable information assets. The limited access of conventional search interfaces, however, prevents enterprises from unlocking this value.
* An expert guide to how richer interfaces enable exploration and discovery, and how these typically rely on content enrichment techniques that can be unreliable, labor-intensive, or both. It is essential to maximize the effectiveness of content enrichment, not only to achieve the desired value, but also to incent organizations to make the necessary investment.
* Useful insight about content enrichment approaches that have demonstrated success in supporting exploration and discovery.
* Gain insight into both the enrichment techniques and the ways they are used to enable exploratory search.
Daniel Tunkelang, Chief Scientist, Endeca
TRANSCRIPT
![Page 1: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/1.jpg)
© 2009 Endeca Technologies, Inc. All rights reserved.
Enabling Exploration through Text Analytics
Daniel Tunkelang
Chief Scientist, Endeca
![Page 2: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/2.jpg)
overview
information seeking tools need to support exploration
text analytics can help
you can do this here and now
![Page 3: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/3.jpg)
real-world information seeking examples
• looking for health information
• looking for work-related information
reminder: search and text analytics are a means, not an end
![Page 4: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/4.jpg)
example 1: looking for health information
six months into my wife’s pregnancy, we discovered that she had gestational diabetes
how to learn more?
![Page 5: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/5.jpg)
google: the default option for most
![Page 6: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/6.jpg)
in government we trust: fda.gov
![Page 7: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/7.jpg)
maybe the private sector knows best: webmd
powered by
![Page 8: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/8.jpg)
success – and a sticky site
powered by
![Page 9: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/9.jpg)
example 2: looking for work-related information
need to ramp up summer interns on text mining
how to find a good book?
![Page 10: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/10.jpg)
let’s try google again
![Page 11: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/11.jpg)
google: the gateway to wikipedia?
![Page 12: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/12.jpg)
the library of congress (loc.gov)
![Page 13: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/13.jpg)
triangle research libraries: next-gen catalog
powered by
![Page 14: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/14.jpg)
faceted search enables query refinement
powered by
![Page 15: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/15.jpg)
take-away #1
exploratory search support: a must-have for many information needs
![Page 16: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/16.jpg)
text analytics
• categorization
• named entity detection
• term extraction
• sentiment analysis
vague term, lots of see-alsos:
text mining
information extraction
content enrichment
![Page 17: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/17.jpg)
newssift: text analytics enabling exploration
powered by
categorization
named entity detection
term extraction
sentiment analysis
![Page 18: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/18.jpg)
exploring the news about facebook
powered by
![Page 19: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/19.jpg)
facebook: the good
powered by
Social Utility
Iphone Application
![Page 20: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/20.jpg)
facebook: the bad
powered by
Criminal Behavior
Litigation And Settlement
![Page 21: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/21.jpg)
take-away #2
text analytics enable exploratory search
![Page 22: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/22.jpg)
text analytics is here and now
![Page 23: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/23.jpg)
lots of off-the-shelf options
and more!
![Page 24: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/24.jpg)
caveats
• rule-based techniques are domain-specific
• statistical techniques rely on trained models
• plan for errors, inconsistency
• document vs. corpus analysis
![Page 25: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/25.jpg)
| Person | Location | Organization |
|---|---|---|
| ABDUL-KARIM KHALAF (1) | ALTOONA, PA (1) | ABC News Inc. (1) |
| ABDULRAHMAN ABDULLAH (1) | Afghanistan (7) | Air Force (1) |
| AL GORE (1) | Africa (5) | Amazon.com Inc. (1) |
| ALEX TREBEK (1) | Akihabara (1) | American Airlines Inc. (1) |
| ALI HASSAN AL (1) | Alaska (3) | Apple (1) |
| AMANDA MARCOTTE (1) | Allegheny (1) | Arctic National Wildlife Refuge (1) |
| AMY WINEHOUSE (1) | Americas (17) | Arianna Huffington (1) |
| ANDERS ERICSSON (1) | Appalachia (1) | Australian Liberal Party (1) |
| ANDREW LLOYD WEBBER (1) | Argentina (1) | Bad News Bears (1) |
| ANTHONY MWANGI (1) | Arizona (11) | Bear Stearns (2) |
| ANTONIN SCALIA (1) | Arkansas (7) | Big Apple Companies (1) |
| ARYE BARAK (1) | Arlington, Va. (2) | BioDiversity Research Institute (1) |
| Aaron Sorkin (1) | Arrest (1) | Bloomberg LP (3) |
| Abbie Hoffman (1) | Asia (1) | Bob Dole (1) |
| Abe Lincoln (1) | Atlanta (2) | Bocuse d’Or World Cuisine Contest (1) |
| Abe Weiss (1) | Austin (1) | Boston Globe (1) |
| Abraham Lincoln (1) | Austin, Texas (1) | Boston Tea Party (1) |
| Adlai Stephenson (1) | Australia (1) | Budweiser (1) |
problems with entity extraction
• moderate precision, but low recall
• not just noisy, but inconsistent
• corpus analysis can help!
Arrest (1)
Asia (1)
ALTOONA, PA (1)
Abe Lincoln (1)
Bob Dole (1)
Boston Tea Party (1)
Abraham Lincoln (1)
![Page 26: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/26.jpg)
look for ways to cheat!
[figure axes: recall, precision]
![Page 27: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/27.jpg)
division of labor
people supply vocabulary
machine annotates documents
http://www.precolumbianwomen.com/images/inca-labor.10.gif
![Page 28: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/28.jpg)
example: ACM digital library
• opportunity
  – repository of (sometimes) author-tagged documents
  – high-precision tags: very few false positives
• challenge
  – poor reuse of vocabulary: most tags unique
  – low-recall tags: 90% false negatives
as is, tags were not useful for exploration
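For reference, the precision and recall figures on this slide can be tied to the standard definitions. The symbols TP, FP, and FN (true positives, false positives, false negatives) are introduced here for illustration and do not appear in the talk:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$

Reading "90% false negatives" as meaning that roughly 90% of the tags that apply to a document are missing (an interpretation, not something the slide states), recall is about 0.1, while "very few false positives" keeps precision close to 1.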
![Page 29: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/29.jpg)
solution
• bootstrap on author-supplied tags
• prune 600K+ tags to 10K by
  – imposing frequency threshold
  – normalizing by case and singular/plural
  – eliminating infrequent subphrases
• mine documents using resulting vocabulary
• manually validate most frequently assigned tags
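The pruning and mining steps above can be sketched roughly as follows. This is a minimal illustration, not the implementation behind the talk: the function names, the `min_count` threshold, and the crude trailing-"s" singularization are all assumptions, and the "infrequent subphrases" step is left out because the slide does not say how it was done.

```python
import re
from collections import Counter


def normalize(phrase):
    """Lowercase a phrase and naively singularize its last word.

    The trailing-"s" rule is a deliberately crude stand-in for real
    singular/plural normalization (an assumption for illustration).
    """
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return ""
    last = words[-1]
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)


def build_vocabulary(author_tags, min_count=2):
    """Keep only normalized tags that occur at least min_count times."""
    counts = Counter(normalize(tag) for tag in author_tags)
    counts.pop("", None)  # ignore tags that normalize to nothing
    return {tag for tag, n in counts.items() if n >= min_count}


def annotate(text, vocabulary):
    """Return the vocabulary terms that occur in text as word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    longest = max((len(tag.split()) for tag in vocabulary), default=0)
    found = set()
    for size in range(1, longest + 1):
        for start in range(len(tokens) - size + 1):
            candidate = normalize(" ".join(tokens[start:start + size]))
            if candidate in vocabulary:
                found.add(candidate)
    return sorted(found)
```

Because tags and document text pass through the same `normalize` function, a document can be assigned a tag its author never supplied, which is the point of mining documents with the pruned vocabulary.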
![Page 30: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/30.jpg)
example: a search for boeing
powered by
![Page 31: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/31.jpg)
it’s a HITS!
![Page 32: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/32.jpg)
if you prefer sports to computer science
• no author-supplied tags
• use search logs instead
• supplement with authority files
  – team names
  – player names
• mine documents using resulting vocabulary
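A rough sketch of the approach listed above, building the vocabulary from search logs and authority files rather than author tags. The function names, the `min_count` cutoff, and the exact-phrase matching are assumptions made for illustration; the talk does not describe the actual pipeline.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase text and collapse it to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary_from_logs(queries, authority_terms, min_count=2):
    """Combine frequently issued queries with curated authority-file terms."""
    counts = Counter(clean(q) for q in queries)
    counts.pop("", None)  # drop queries that contain no usable words
    frequent = {q for q, n in counts.items() if n >= min_count}
    curated = {clean(t) for t in authority_terms} - {""}
    return frequent | curated


def tag_document(text, vocabulary):
    """Return vocabulary terms that appear in the document as whole phrases."""
    padded = " " + clean(text) + " "
    return sorted(term for term in vocabulary if " " + term + " " in padded)
```

Authority-file terms bypass the frequency cutoff in this sketch, on the assumption that a curated list of team and player names is already trustworthy, whereas raw queries need the threshold to filter out one-off noise.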
![Page 33: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/33.jpg)
roger clemens, then and now
powered by
![Page 34: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/34.jpg)
pivoting to a different view
powered by
![Page 35: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/35.jpg)
take-away #3
this is not vaporware: text analytics to enable exploration is available here and now
![Page 36: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/36.jpg)
looking forward
• better tags are the beginning, not the end
• improve with manual and automatic processing
• give users control over precision / recall trade-off
• help users and content creators help you
![Page 37: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/37.jpg)
in closing
exploratory search = must-have, not nice-to-have
text analytics are a key enabler
the technology is real, here, and now
![Page 38: Enabling Exploration Through Text Analytics](https://reader035.vdocuments.net/reader035/viewer/2022070314/554d905bb4c905525e8b457b/html5/thumbnails/38.jpg)
thank you…and come to SIGIR!
communication 1.0
email: [email protected]
communication 2.0
blog: http://thenoisychannel.com
twitter: http://twitter.com/dtunkelang
SIGIR: July 19-23 in Boston
Industry Track on July 22nd!