Making Sense of Unstructured Data by Turning Strings into Things
DESCRIPTION
We all know about the promise of Big Data Analytics to transform our understanding of the world. The analysis of structured data, such as inventory, transactions, close rates, and even clicks, likes and shares, is clearly valuable, but the curious fact about the immense volume of data being produced is that the vast majority of it is unstructured text. Content such as news articles, blog posts, product reviews, and yes, even the dreaded 140-character novella, contains tremendous value, if only it could be connected to things in the real world – people, places and things. In this talk, we’ll discuss the challenges and opportunities that result when you extract entities from Big Text.
Speaker: Gregor Stewart – Director of Product Management for Text Analytics at Basis Technology
As Director of Product Management, Mr. Stewart helps to ensure that Basis Technology’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Masters in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Masters from the London School of Economics.
Thanks to our amazing sponsors: Microsoft NERD (http://microsoftnewengland.com/) for Venue; Basis Technology (http://basistech.com) for Food and Kindle Raffle
TRANSCRIPT
Analyze
Extract
Match
Transform
Information
Revealed
Connect
Overview
• Very briefly introduce Basis
• Motivate the move from Strings to Things
• Review two enabling technologies:
– Entity Extraction: finding names in text (and classifying them)
– Entity Resolution: connecting names together and to things
• Give you three examples of things you can do:
– Entity-based search, illustrating:
• How entities and enriched typing can empower searchers
• How human corrections might be used to improve accuracy over time
– Get additional high quality enrichments from knowledge sources
– Recognize anomalies/outliers, by establishing rich norms
Introduction: Basis Technology
Introduction: Gregor Stewart
Facebingler cares, should you?
Entity Extraction: What is it?
Entity Extraction: How is it done?
Diagram: extraction pipeline, from Input Text to Tagged Text
• Probabilistic Extractor
– Supervised Model
– Unsupervised Model
• Deterministic Extractor
– Exact Match (Gazetteer)
– Pattern Match (Regex)
• Entity Redactor
– Joining
– Filtering
– Adjudication
• Other inputs shown: Domain Text, Annotated Text, User Defined Lists, User Defined Patterns
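The deterministic half of the pipeline on this slide (exact match against a gazetteer, pattern match with a regex) followed by a joining and adjudication step can be sketched roughly as follows. This is a minimal illustration, not how any Basis product is implemented: the gazetteer entries, the date pattern, the `Mention` type and the "longest span wins" adjudication rule are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int
    end: int
    text: str
    label: str
    source: str  # which extractor proposed the span


# Hypothetical user-defined list (gazetteer) and user-defined patterns.
GAZETTEER = {"Basis Technology": "ORGANIZATION", "Edinburgh": "LOCATION"}
PATTERNS = {"DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")}


def gazetteer_extract(text):
    """Exact match: find every literal occurrence of each gazetteer entry."""
    mentions = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            mentions.append(Mention(m.start(), m.end(), m.group(), label, "gazetteer"))
    return mentions


def pattern_extract(text):
    """Pattern match: find every span matched by a user-defined regex."""
    return [
        Mention(m.start(), m.end(), m.group(), label, "regex")
        for label, rx in PATTERNS.items()
        for m in rx.finditer(text)
    ]


def adjudicate(mentions):
    """Assumed rule: among overlapping candidates, keep the longest span."""
    chosen = []
    for m in sorted(mentions, key=lambda m: (-(m.end - m.start), m.start)):
        if all(m.end <= c.start or m.start >= c.end for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.start)


def extract(text):
    # Joining: pool the candidates from both extractors, then adjudicate.
    return adjudicate(gazetteer_extract(text) + pattern_extract(text))
```

Calling `extract("Basis Technology spoke in Edinburgh on 5/2/11.")` yields three tagged spans in document order. A probabilistic extractor would simply contribute further candidates to the pool before adjudication.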
Entity Resolution: What is it? (1)
Entity Resolution: What is it? (2)
Diagram: many mentions of “Alberto”
Name variants found in text:
• Alberto Amos Fernandez…
• Alberto M. Fernandez…
• Alberto Fernandez…
• Alberto Fernandiz…
• Albert Fernandez…
• Alburto Fernandez…
• Alberto Fernandez de la Puebla…
Contexts in which “Alberto Fernandez…” appears:
• … Chief of Cabinet… Argentina… Prof of Criminal Law…
• … born Sept 7, 1984… cycling… Madrid
• … born in Cuba… US Ambassador
Example questions:
• Ratio of Politicians to Sportsmen? 2:1
• Alberto Fernandez… Sportsmen? YES
• Nickname “El Galleta?” ?
But it’s not just text (1)
But it’s not just text (2)
Entity Resolution: How is it Done? (1)
Entity Resolution: How is it Done? (2)
Entity Resolution: How is it Done? (3)
Entity Resolution: How is it Done? (4)
Diagram: Resolution Engine
• Input: Entity Mention + Context
• Candidate Selection (Entity Index)
• Ranking
• Output: Link or Ghost
• Knowledge Base: Learned, Seeded
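The stages named in the Resolution Engine diagram (candidate selection, ranking, then a link-or-ghost decision) might look something like the sketch below. Everything in it is an assumption for illustration: the tiny seeded knowledge base, the use of `difflib` string similarity for candidate selection, word overlap for ranking, and the reading of "ghost" as a mention that cannot be linked to any known entity.

```python
from difflib import SequenceMatcher

# Hypothetical seeded knowledge base: id -> (canonical name, context words).
KB = {
    "politician": ("Alberto Fernandez", {"chief", "cabinet", "argentina", "law"}),
    "cyclist": ("Alberto Fernandez de la Puebla", {"cycling", "madrid", "born"}),
    "diplomat": ("Alberto Fernandez", {"cuba", "ambassador", "us"}),
}


def name_similarity(a, b):
    """Character-level similarity in [0, 1]; tolerant of small misspellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_candidates(mention, min_sim=0.6):
    """Candidate selection: KB entries whose name resembles the mention."""
    return [
        eid for eid, (name, _) in KB.items()
        if name_similarity(mention, name) >= min_sim
    ]


def rank(candidates, context):
    """Ranking: score each candidate by context words shared with the KB entry."""
    words = set(context.lower().split())
    scored = [(len(words & KB[eid][1]), eid) for eid in candidates]
    return sorted(scored, reverse=True)


def resolve(mention, context, min_overlap=1):
    """Link to the best candidate, or report a ghost if nothing fits."""
    ranked = rank(select_candidates(mention), context)
    if ranked and ranked[0][0] >= min_overlap:
        return ("link", ranked[0][1])
    return ("ghost", None)
```

With this toy data, the misspelled mention "Alburto Fernandez" in the context "the US ambassador born in Cuba" links to the diplomat entry, while a mention with no supporting context comes back as a ghost. A learned ranker would replace the overlap count with a trained scoring function.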
A (Convenient) Fiction…
• In a nearby place, not so long ago… the CIA was asked by the President to assess the likelihood that the Syrian opposition would use chemical weapons by mid-2014.
• As part of building that analysis, and because there are Al-Qaeda elements in the Syrian opposition, Alice the analyst was asked to: characterize Al-Qaeda’s attitude to using chemical weapons against Middle Eastern governments.
From: Ayman Al-Zawahiri (?)
To: “Hafiz Sultan”
“Dear Brother, We need guidance from you on the issue of using chlorine gas technology. It was reported that the brothers in Iraq have used it, but this was implicitly denied in a statement issued by the Islamic State of Iraq.
The brothers where Mahmud is have the potential to use chlorine gas on the forces of the apostates, Jalal Talabani and Mas'ud Barzani, and have already considered using it.
However, I informed them that matters as serious as this require centralized [coordination] and permission from the senior [al-Qa'ida] leadership, because the gas could be difficult to control and might harm some people, which could tarnish our image, alienate people from us, and so on.”
A document that Alice needs to read (socom-2012-0000011)…
Timeline axis labels (repeated on slides 21 to 42): 3/28/07, Today, 5/2/11, 5/3/11, 5/12/11, 5/14/11
Advanced enrichment: Topics?
• Some knowledge sources have rich connectivity between things, concepts, etc.
• Developers often ask for “Topic”
• Even advanced Topic approaches often yield “howlers”
• Better labels might be derived from node info or graph walking.
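One way to read "graph walking" for topic labels is to walk from each resolved entity up to its broader nodes in the knowledge source and let the entities vote. The sketch below assumes a hypothetical `BROADER` mapping and a simple majority vote; neither is taken from the talk.

```python
from collections import Counter, deque

# Hypothetical knowledge graph: node -> list of broader nodes.
BROADER = {
    "chlorine gas": ["chemical weapons"],
    "sarin": ["chemical weapons"],
    "chemical weapons": ["weapons"],
    "Jalal Talabani": ["Iraqi politicians"],
    "Iraqi politicians": ["politicians"],
}


def ancestors(node, max_hops=2):
    """Breadth-first walk up BROADER edges, at most max_hops away."""
    found, queue = [], deque([(node, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue
        for parent in BROADER.get(current, []):
            if parent not in found:
                found.append(parent)
                queue.append((parent, hops + 1))
    return found


def topic_label(entities, max_hops=2):
    """Each entity votes for the nodes reachable from it; most votes wins."""
    votes = Counter()
    for entity in entities:
        votes.update(ancestors(entity, max_hops))
    return votes.most_common(1)[0][0] if votes else None
```

For a document mentioning chlorine gas, sarin and Jalal Talabani, a one-hop walk labels it with the node two of the three entities share. Limiting the number of hops keeps labels specific; walking too far tends to produce labels too generic to be useful.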
Advanced enrichment: Norms?
• By walking the graph in very specific ways, we can build one or more efficient representations of what is normal or expected context for an entity.
• This could be focused on particular entities, types of entities, relationships, etc.
• We could use these representations to adjust result rankings or raise alerts.
• Note again that this is not specific to text: elementary parts of other unstructured sources, such as images and video, might be connected and used in the same way.
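As a rough illustration of a "norm", the sketch below builds a profile of the words previously seen around an entity and flags a new context that shares too little with it. It swaps the graph walk described above for plain co-occurrence counts, and the function names and the 0.5 threshold are assumptions chosen for the example.

```python
from collections import Counter


def build_norm(contexts):
    """Count in how many past contexts each word appeared with the entity."""
    norm = Counter()
    for ctx in contexts:
        norm.update(set(ctx.lower().split()))
    return norm


def familiarity(norm, context):
    """Fraction of the new context's words already seen with this entity."""
    words = set(context.lower().split())
    if not words:
        return 0.0
    return sum(1 for w in words if w in norm) / len(words)


def is_anomalous(norm, context, threshold=0.5):
    """Raise an alert when the context falls below the familiarity threshold."""
    return familiarity(norm, context) < threshold
```

For an entity normally seen around cycling and Madrid, a context about a chlorine gas shipment would score zero familiarity and be flagged, while another cycling context would not. The same score could feed into result ranking instead of an alert.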
Summary
• Extraction and resolution components like REX and RES can reliably connect Strings to Things in a range of texts.
• This allows existing knowledge to be usefully applied:
– We can add properties (like types), and other advanced enrichments
– We can discover where existing knowledge is lacking
• Thing-based search can allow each query to be more precise and productive
– Fewer queries, fewer adjustments, fewer results to read
• By using abundant human feedback, KB quality and resolution accuracy can be increased.
– More subtle distinctions between entities can be learned, example by example.
• But…
…these tools are like shoes…