semantics-empowered understanding, analysis and mining of nontraditional and unstructured data
DESCRIPTION
Amit Sheth, "Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data," WSU & AFRL Window-on-Science Seminar on Data Mining, August 05, 2009. http://wiki.knoesis.org/index.php/Seminar_on_Data_Mining#Semantics_empowered_Understanding.2C_Analysis_and_Mining_of_Nontraditional_and_Unstructured_Data
TRANSCRIPT
Semantics-Empowered Understanding, Analysis and Mining of Nontraditional and Unstructured Data
WSU & AFRL Window-on-Science Seminar on Data Mining
Amit P. Sheth, LexisNexis Ohio Eminent Scholar
Director, Kno.e.sis center, Wright State University
knoesis.org
Thanks: K. Gomadam, M. Nagarajan, C. Thomas, C. Henson, C. Ramakrishnan, P. Jain and Kno.e.sis Researchers
Data & Knowledge Ecosystem
Data Mining
Knowledge Discovery
Understanding & Perception
Integration
Search
Analysis (e.g., Patterns)
Browsing
Insight
Situational Awareness
Decision Support
Transactional Data
Observational Data
Multimedia Data
Experimental Data
Textual Data: Scientific Literature, Web Pages, News, Blogs, Reports, Wiki, Forums, Comments, Tweets
Structured, Semistructured, Unstructured Data
Some examples of R&D we have done
• Semantic Search & Ranking of Stories and Reports – connecting the dots applications (insider threat, financial risk analysis)
• Mining of biomedical (scientific) literature (extraction of entities and relationships) – discovering hidden public knowledge
• Semantic Integration, Analysis and Decision Support over Sensor Data
• Extracting taxonomy/domain model from Wikipedia
• Discovering Hidden Relationships (insights) in Community Created Content (Wikipedia)
• Understanding User Generated Content (on Social Networking Sites)*
– What are people talking about
– How people write
– Why people write
With application to
– Artist Popularity Ranking
– Advertisement on Social Media
– Identifying Social Signals – spatio-temporal-thematic analysis of Citizen Sensor Data
* Meena Nagarajan
Text
Multimedia Content and Web data
Metadata Extraction
Patterns / Inference / Reasoning
Domain Models
Meta data / Semantic Annotations
Relationship Web
Search
Integration
Analysis
Discovery
Question Answering
Situational Awareness
Sensor Data
RDB
Structured and Semi-structured data
Insider threat demo (semantic search/querying, ranking, …)
Knowledge Discovery from Scientific Literature
Cartic Ramakrishnan
What Knowledge Discovery is NOT
• Search
– Keyword-in-document-out
– Keywords are fully specified features of expected outcome
– Searching for prospective mining sites
• Mining
– Know where to look
– Underspecified characteristics of what is sought are available
– Patterns
What is knowledge discovery?
• “knowledge discovery is more like sifting through a warehouse filled with small gears, levers, etc., none of which is particularly valuable by itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts.” – James Caruther
• “discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise” – James Buchanan
• Opportunistic search over an ill-defined space leading to surprising but useful emergent knowledge
Element of surprise – Swanson’s discoveries
Magnesium
Migraine
PubMed
?
Stress
Spreading Cortical Depression
Calcium Channel Blockers
Swanson’s Discoveries
Associations discovered based on keyword searches followed by manual analysis of text to establish possible relevant relationships
11 possible associations found
Knowledge Discovery over text
Text
Extraction of Semantics from text
Semantic Metadata Guided
Knowledge Explorations
Assigning interpretation to text
Semantic Metadata Guided
Knowledge Discovery
Triple-based Semantic Search
Semantic browser
Subgraph discovery
Semantic metadata in the form of semi-structured data
Information Extraction via Ontology assisted text mining – Relationship extraction
Biologically active substance
Lipid
Disease or Syndrome
affects
causes
affects
causes
complicates
Fish Oils
Raynaud’s Disease
???????
instance_of instance_of
UMLS Semantic Network
MeSH
PubMed
9284 documents
4733 documents
5 documents
Background knowledge and Data used
• UMLS – A high level schema of the biomedical domain
– 136 classes and 49 relationships
– Synonyms of all relationships – using variant lookup (tools from NLM)
– 49 relationships + their synonyms = ~350 verbs
• MeSH
– 22,000+ topics organized as a forest of 16 trees
– Used to query PubMed
• PubMed
– Over 16 million abstracts
– Abstracts annotated with one or more MeSH terms
Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
• Entities (MeSH terms) in sentences occur in modified forms
• “adenomatous” modifies “hyperplasia”
• “An excessive endogenous or exogenous stimulation” modifies “estrogen”
• Entities can also occur as composites of 2 or more other entities
• “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
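To make the idea of rules over parse trees concrete, here is a minimal sketch that reads the bracketed parse shown on this slide and applies one simple rule: the NP under S is the subject, the verb under VP is the relationship, and the NP under VP is the object. The function names and the single rule are illustrative assumptions, not the rule set used in the original work.

```python
SENTENCE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) "
    "(PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )"
)


def parse_tree(text):
    """Parse a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]  # the constituent label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def extract_triple(tree):
    """Hypothetical rule: (S NP VP) -> (subject NP, verb, object NP)."""
    s = tree[1] if tree[0] == "TOP" else tree
    subject = next(c for c in s[1:] if c[0] == "NP")
    vp = next(c for c in s[1:] if c[0] == "VP")
    verb = next(c for c in vp[1:] if c[0].startswith("VB"))
    obj = next(c for c in vp[1:] if c[0] == "NP")
    return " ".join(leaves(subject)), " ".join(leaves(verb)), " ".join(leaves(obj))


triple = extract_triple(parse_tree(SENTENCE))
```

The subject and object come out as whole phrases, which is where the modified and composite entity handling described above would take over.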
Preliminary Results
• Swanson’s discoveries – Associations between Migraine and Magnesium [Hearst99]
• stress is associated with migraines
• stress can lead to loss of magnesium
• calcium channel blockers prevent some migraines
• magnesium is a natural calcium channel blocker
• spreading cortical depression (SCD) is implicated in some migraines
• high levels of magnesium inhibit SCD
• migraine patients have high platelet aggregability
• magnesium can suppress platelet aggregability
• Data sets generated using these entities (marked red above) as boolean keyword queries against PubMed
• Bidirectional breadth-first search used to find paths in resulting RDF
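As an illustration of the path search, the sketch below runs a bidirectional breadth-first search over a small undirected graph built from the associations listed above. It is a simplified stand-in: the real search ran over extracted RDF, whereas the node names, the plain adjacency dictionary, and the function names here are assumptions. It returns one connecting path and does not claim that the path is the shortest.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bidirectional_bfs(graph, source, target):
    """Search from both ends at once; return one connecting path or None."""
    if source == target:
        return [source]
    parents_fwd, parents_bwd = {source: None}, {target: None}
    frontier_fwd, frontier_bwd = deque([source]), deque([target])

    def expand(frontier, parents, other_parents):
        # Expand one full level; return the node where the searches meet.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbour in graph.get(node, ()):
                if neighbour not in parents:
                    parents[neighbour] = node
                    if neighbour in other_parents:
                        return neighbour
                    frontier.append(neighbour)
        return None

    while frontier_fwd and frontier_bwd:
        # Always grow the smaller frontier to keep the search balanced.
        if len(frontier_fwd) <= len(frontier_bwd):
            meet = expand(frontier_fwd, parents_fwd, parents_bwd)
        else:
            meet = expand(frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            path, node = [], meet
            while node is not None:  # walk back to the source
                path.append(node)
                node = parents_fwd[node]
            path.reverse()
            node = parents_bwd[meet]
            while node is not None:  # walk forward to the target
                path.append(node)
                node = parents_bwd[node]
            return path
    return None


ASSOCIATIONS = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("magnesium", "calcium channel blockers"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("platelet aggregability", "migraine"),
    ("magnesium", "platelet aggregability"),
]
graph = build_graph(ASSOCIATIONS)
path = bidirectional_bfs(graph, "migraine", "magnesium")
```

On this toy graph every connecting path passes through exactly one intermediate concept, which mirrors the structure of Swanson's migraine and magnesium associations.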
Paths between Migraine and Magnesium
Paths are considered interesting if they have one or more named relationships other than hasPart or hasModifiers in them
An example of such a path
platelet(D001792)
collagen(D003094)
migraine(D008881)
magnesium(D008274)
me_3142by_a_primary_abnormality_of_platelet_behavior
me_2286_13%_and_17%_adp_and_collagen_induced_platelet_aggregation
caused_by
hasPart
hasPart
stimulated
stimulated
hasPart
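The interestingness rule for paths (at least one named relationship other than hasPart or hasModifiers) can be checked mechanically against the edge labels of this example. A minimal sketch follows; the function and constant names are assumptions made for illustration.

```python
# Structural relationships that do not make a path interesting on their own.
STRUCTURAL_RELATIONSHIPS = {"hasPart", "hasModifiers"}


def is_interesting(edge_labels):
    """True if at least one edge carries a named, non-structural relationship."""
    return any(label not in STRUCTURAL_RELATIONSHIPS for label in edge_labels)


# Edge labels as listed for the example path between migraine and magnesium.
EXAMPLE_PATH_LABELS = [
    "caused_by", "hasPart", "hasPart", "stimulated", "stimulated", "hasPart",
]
example_is_interesting = is_interesting(EXAMPLE_PATH_LABELS)
```

The example path qualifies because of its caused_by and stimulated edges; a path made only of hasPart and hasModifiers edges would be filtered out.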
CONCLUSION
Rules over parse trees are able to extract structure from sentences
Our definitions of compound and modified entities are critical for identifying both implicit and explicit relationships
Swanson’s discovery can be automated – if recall can be improved – what hurts recall?
Unsupervised Joint Extraction of Compound Entities and Relationship
Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth, "Unsupervised Discovery of Compound Entities for Relationship Extraction," EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management: Knowledge Patterns
Joint Extraction approach
• Dependency parse – Stanford Parser
governor
dependent
amod = adjectival modifier
nsubjpass = nominal subject in passive voice
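A rough sketch of how typed dependencies could yield a compound entity and a relationship at the same time: amod edges attach modifiers to an entity head, and nsubjpass identifies the logical object of a passive verb. The input tuples are hand-written for a made-up sentence ("Adenomatous hyperplasia is induced by estrogen") in a (relation, governor, dependent) layout; they were not produced by running a parser, and the use of an `agent` relation for the by-phrase is an assumption about the collapsed dependency output.

```python
def triple_from_dependencies(dependencies):
    """Build (subject, predicate, object) from (relation, governor, dependent) tuples."""
    modifiers = {}
    subject_head = predicate = object_head = None
    for relation, governor, dependent in dependencies:
        if relation == "amod":
            # Adjectival modifiers become part of a compound entity.
            modifiers.setdefault(governor, []).append(dependent)
        elif relation == "nsubjpass":
            # In a passive clause the grammatical subject is the logical object.
            predicate, object_head = governor, dependent
        elif relation == "agent":
            # The by-phrase of a passive clause names the logical subject.
            subject_head = dependent

    def compound(head):
        if head is None:
            return None
        return " ".join(modifiers.get(head, []) + [head])

    return compound(subject_head), predicate, compound(object_head)


# Hand-written, hypothetical dependencies for the made-up sentence above.
EXAMPLE_DEPENDENCIES = [
    ("amod", "hyperplasia", "adenomatous"),
    ("nsubjpass", "induced", "hyperplasia"),
    ("auxpass", "induced", "is"),
    ("agent", "induced", "estrogen"),
]
example_triple = triple_from_dependencies(EXAMPLE_DEPENDENCIES)
```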
Algorithm
Relationship head
Subject head
Object head Object head
Preliminary results
Extracted Triples
Semantic Metadata Guided Knowledge Explorations and Discovery
Results
Hypothesis Driven retrieval of Scientific Literature
PubMed
Complex Query
Supporting Document sets retrieved
Migraine
Stress
Patient
affects
isa
Magnesium
Calcium Channel Blockers
inhibit
Keyword query: Migraine[MH] + Magnesium[MH]
Applications
• Triple-based semantic search
• Semantic Browser
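Triple-based search can be pictured as pattern matching over (subject, predicate, object) triples, where any position may be left open. The sketch below is an assumption-laden toy: the triples paraphrase the migraine and magnesium associations from the preliminary results, and the predicate names and the `match` function are invented for illustration rather than taken from the actual system.

```python
TRIPLES = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("magnesium", "suppresses", "platelet aggregability"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "What is related to migraine, and how?"
about_migraine = match(TRIPLES, obj="migraine")
# "What does magnesium do?"
magnesium_facts = match(TRIPLES, subject="magnesium")
```

Unlike a keyword query, the pattern constrains the role each term plays, so a search for magnesium as a subject does not return triples where it appears only as an object.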
Knowledge Discovery = Extraction + Heuristic Aggregation
Leonardo Da Vinci
The Da Vinci code
The Louvre
Victor Hugo
The Vitruvian man
Santa Maria delle Grazie
Et in Arcadia Ego
Holy Blood, Holy Grail
Harry Potter
The Last Supper
Nicolas Poussin
Priory of Sion
The Hunchback of Notre Dame
The Mona Lisa
Nicolas Flamel
painted_by
painted_by
painted_by
painted_by
member_of
member_of
member_of
written_by
mentioned_in
mentioned_in
displayed_at
displayed_at
cryptic_motto_of
displayed_at
mentioned_in
mentioned_in
Undiscovered Public Knowledge
Understanding, Analyzing, Mining
Social Media
Meena Nagarajan, Karthik Gomadam
mumbai, india
november 26, 2008
another chapter in the war against civilization
and
the world saw it
Through the eyes of the people
the world read it
Through the words of the people
PEOPLE told their stories to PEOPLE
A powerful new era in Information dissemination had taken firm ground
Making it possible for us to create a global network of citizens
Citizen Sensors – Citizens observing, processing, transmitting, reporting
Image Metadata
latitude: 18° 54′ 59.46″ N, longitude: 72° 49′ 39.65″ E
Geocoder (Reverse Geo-coding)
Address to location database
18 Hormusji Street, Colaba
Nariman House
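The image metadata gives the location in degrees, minutes and seconds, while the spatial queries later in the deck use decimal degrees. A small worked conversion is sketched below; the function name is an assumption and is not part of the original system.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return -value if hemisphere in ("S", "W") else value


latitude = dms_to_decimal(18, 54, 59.46, "N")
longitude = dms_to_decimal(72, 49, 39.65, "E")
```

For the latitude, 18 + 54/60 + 59.46/3600 ≈ 18.91652, which agrees with the decimal coordinates quoted in the spatial query to about five decimal places. A reverse geocoder would then map such a coordinate pair to a street address.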
Identify and extract information from tweets
Spatio-Temporal Analysis
Structured Meta Extraction
Income Tax Office
Vasant Vihar
Research Challenge #1
• Spatio Temporal and Thematic analysis
– What else happened “near” this event location?
– What events occurred “before” and “after” this event?
– Any message about “causes” for this event?
Spatial Analysis…
Which tweets originated from an address near 18.916517°N 72.827682°E?
Which tweets originated during Nov 27th 2008, from 11PM to 12 PM?
Giving us
Which tweets originated from an address near 18.916517°N, 72.827682°E during the time interval 27th Nov 2008 between 11PM and 12PM?
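The combined question can be sketched as a filter over geotagged, timestamped messages: keep those within some radius of the point and inside the time window. Everything concrete below is an assumption made for illustration: the `Tweet` record, the sample messages, the 1 km radius, and the reading of the time window as the hour before midnight on 27 Nov 2008. Distance uses the standard haversine formula.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


@dataclass
class Tweet:
    text: str
    lat: float
    lon: float
    time: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def near_and_during(tweets, lat, lon, radius_km, start, end):
    """Tweets within radius_km of (lat, lon) and with start <= time < end."""
    return [
        t for t in tweets
        if haversine_km(t.lat, t.lon, lat, lon) <= radius_km and start <= t.time < end
    ]


# Hypothetical sample data; only the query point comes from the slide.
SAMPLE = [
    Tweet("a", 18.9166, 72.8277, datetime(2008, 11, 27, 23, 30)),  # near, in window
    Tweet("b", 28.6139, 77.2090, datetime(2008, 11, 27, 23, 30)),  # far away
    Tweet("c", 18.9166, 72.8277, datetime(2008, 11, 27, 9, 0)),    # near, too early
]
hits = near_and_during(
    SAMPLE, 18.916517, 72.827682, radius_km=1.0,
    start=datetime(2008, 11, 27, 23, 0), end=datetime(2008, 11, 28, 0, 0),
)
```

Only the first message satisfies both the spatial and the temporal constraint; thematic analysis would then look at what the surviving messages say.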
Research Challenge #2: Understanding and Analyzing Casual Text
• Casual text
– Microblogs are often written in SMS style language
– Slang, abbreviations
Understanding Casual Text
• Not the same as news articles or scientific literature
– Grammatical errors
• Implications on NL parser results
– Inconsistent writing style
• Implications on learning algorithms that generalize from corpus
Nature of Microblogs
• Additional constraint of limited context
– Max. of x chars in a microblog
– Context often provided by the discourse
• Entity identification and disambiguation
• Pre-requisite to other sophisticated information analytics
NL understanding is hard to begin with..
• Not so hard
– “commando raid appears to be nigh at Oberoi now”
• Oberoi = Oberoi Hotel, Nigh = high
• Challenging
– new wing, live fire @ taj 2nd floor on iDesi TV stream
• Fire on the second floor of the Taj hotel, not on iDesi TV
Social Context surrounding content
• Social context in which a message appears is also an added valuable resource
• Post 1:
– “Hareemane House hostages said by eyewitnesses to be Jews. 7 Gunshots heard by reporters at Taj”
• Follow up post
– that is Nariman House, not (Hareemane)
Understanding content … informal text
• I say: “Your music is wicked”
• What I really mean: “Your music is good”
Structured text (biomedical literature)
Multimedia Content and Web data
Web Services
Semantic Metadata: Smile is a Track
Lil transliterates to Lilly Allen
Lilly Allen is an Artist
Informal Text (Social Network chatter)
Your smile rocks Lil
Urban Dictionary
MusicBrainz Taxonomy
Artist: Lilly Allen
Track: Smile
Sentiment expression: Rocks
Transliterates to: cool, good
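The annotation of "Your smile rocks Lil" can be mimicked with three small lookup tables standing in for the music taxonomy and the slang dictionary. This is a deliberately naive sketch: the tables hold only the entries shown on the slide, the function name is an assumption, and real disambiguation (for example, "smile" used as an ordinary word) is not attempted.

```python
# Tiny stand-ins for a music taxonomy and a slang dictionary (assumed contents).
ARTIST_ALIASES = {"lil": "Lilly Allen"}
TRACKS = {"smile": "Smile"}
SLANG_SENTIMENT = {"rocks": "good", "wicked": "good"}


def annotate(comment):
    """Return (type, value) annotations for known tokens, in reading order."""
    annotations = []
    for token in comment.lower().split():
        word = token.strip(".,!?\"'")
        if word in ARTIST_ALIASES:
            annotations.append(("Artist", ARTIST_ALIASES[word]))
        elif word in TRACKS:
            annotations.append(("Track", TRACKS[word]))
        elif word in SLANG_SENTIMENT:
            annotations.append(("Sentiment", SLANG_SENTIMENT[word]))
    return annotations


example_annotations = annotate("Your smile rocks Lil")
```

Counting such annotations per artist over many comments is one simple way to move from individual expressions to an aggregate popularity signal.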
Example: Pulse of a Community
• Imagine millions of such informal opinions
– Individual expressions to mass opinions
• “Popular artists” lists from MySpace comments
Lilly Allen
Lady Sovereign
Amy Winehouse
Gorillaz
Coldplay
Placebo
Sting
Kean
Joss Stone
What Drives the Spatio-Temporal-Thematic Analysis and Casual Text Understanding
Semantics with the help of
1. Domain Models
2. Domain Models
3. Domain Models
(ontologies, folksonomies)
Domain Knowledge: A key driver
• Places that are nearby ‘Nariman house’
– Spatial query
• Messages originated around this place
– Temporal analysis
• Messages about related events / places
– Thematic analysis
Research Challenge #3
But Where does the Domain Knowledge come from?
• Expert and committee based ontology creation … works in some domains (e.g., biomedicine, health care,…)
• Community driven knowledge extraction
– How to create models that are “socially scalable”?
– How to organically grow and maintain this model?
Building models…seed word to hierarchy creation using WIKIPEDIA
Seed Query
BWikipedia
Fulltext Concept Search
Wikigraph-Based expansion
Graph Search
Graph Search
Graph Search
Hierarchy Creation
Query: “cognition”
Identifying relationships: Hard, harder than many hard things
But NOT that Hard, When WE do it
Games with a purpose
• Get humans to give their solitaire time
– Solve real hard computational problems
– Image tagging, Identifying part of an image
– Tag a tune, Squigl, Verbosity, and Matchin
– Pioneered by Luis von Ahn
OntoLablr
• Relationship Identification Game
• leads to
• causes
Explosion Traffic congestion
• How do you get comprehensive situational awareness by merging “human sensing” and “machine sensing”?
Research Challenge #4: Semantic Sensor Web
Semantically Annotated O&M
<swe:component name="time">
  <swe:Time definition="urn:ogc:def:phenomenon:time" uom="urn:ogc:def:unit:date-time">
    <sa:swe rdfa:about="?time" rdfa:instanceof="time:Instant">
      <sa:sml rdfa:property="xs:date-time"/>
    </sa:swe>
  </swe:Time>
</swe:component>
<swe:component name="measured_air_temperature">
  <swe:Quantity definition="urn:ogc:def:phenomenon:temperature" uom="urn:ogc:def:unit:fahrenheit">
    <sa:swe rdfa:about="?measured_air_temperature" rdfa:instanceof="senso:TemperatureObservation">
      <sa:swe rdfa:property="weather:fahrenheit"/>
      <sa:swe rdfa:rel="senso:occurred_when" resource="?time"/>
      <sa:swe rdfa:rel="senso:observed_by" resource="senso:buckeye_sensor"/>
    </sa:swe>
  </swe:Quantity>
</swe:component>
<swe:value name="weather-data">2008-03-08T05:00:00,29.1</swe:value>
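The value element packs one observation as comma-separated fields in the order the components are declared: a timestamp, then the measured air temperature in degrees Fahrenheit. A minimal sketch of reading such a record follows; the function names are assumptions, and the Celsius conversion is added only as a convenience.

```python
from datetime import datetime


def parse_observation(value):
    """Split a 'timestamp,temperature' record into typed fields."""
    timestamp_text, temperature_text = value.strip().split(",")
    return datetime.fromisoformat(timestamp_text), float(temperature_text)


def fahrenheit_to_celsius(degrees_f):
    """Standard unit conversion, not part of the annotated record."""
    return (degrees_f - 32.0) * 5.0 / 9.0


observed_at, temperature_f = parse_observation("2008-03-08T05:00:00,29.1")
```

The semantic annotations are what tell a consumer that the first field is a time instant and that the second is a temperature observation made by a particular sensor; the raw value alone carries none of that.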
Semantic Sensor ML – Adding Ontological Metadata
Person
Company
Coordinates
Coordinate System
Time Units
Timezone
Spatial Ontology
Domain Ontology
Temporal Ontology
Mike Botts, "SensorML and Sensor Web Enablement," Earth System Science Center, UAB Huntsville
Semantic Query
• Semantic Temporal Query
• Model-references from SML to OWL-Time ontology concepts provide the ability to perform semantic temporal queries
• Supported semantic query operators include:
– contains: user-specified interval falls wholly within a sensor reading interval (also called inside)
– within: sensor reading interval falls wholly within the user-specified interval (inverse of contains or inside)
– overlaps: user-specified interval overlaps the sensor reading interval
• Example SPARQL query defining the temporal operator ‘within’
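The three operators can be stated precisely as comparisons on interval endpoints. The sketch below expresses them in Python rather than SPARQL, so it is a stand-in for the query on the slide, not a reconstruction of it. The `Interval` type and the treatment of intervals that merely touch at an endpoint (not counted as overlapping) are assumptions.

```python
from datetime import datetime
from typing import NamedTuple


class Interval(NamedTuple):
    start: datetime
    end: datetime


def contains(reading, query):
    """The user-specified interval falls wholly within the sensor reading interval."""
    return reading.start <= query.start and query.end <= reading.end


def within(reading, query):
    """The sensor reading interval falls wholly within the user-specified interval."""
    return query.start <= reading.start and reading.end <= query.end


def overlaps(reading, query):
    """The two intervals share some span of time."""
    return reading.start < query.end and query.start < reading.end


reading = Interval(datetime(2008, 3, 8, 5, 0), datetime(2008, 3, 8, 6, 0))
wide_query = Interval(datetime(2008, 3, 8, 4, 0), datetime(2008, 3, 8, 7, 0))
narrow_query = Interval(datetime(2008, 3, 8, 5, 15), datetime(2008, 3, 8, 5, 45))
later_query = Interval(datetime(2008, 3, 8, 7, 0), datetime(2008, 3, 8, 8, 0))
```

With these definitions, `within(reading, wide_query)` and `contains(reading, narrow_query)` both hold, which shows the two operators as inverses of each other.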
Kno.e.sis’ Semantic Sensor Web
Semantic Sensor Web demo (online)
Semantic Sensor Web demo (local)
Synthetic but realistic scenario
• an image taken from a raw satellite feed
• an image taken by a camera phone with an associated label, “explosion.”
Synthetic but realistic scenario
• Textual messages (such as tweets) using STT analysis
Synthetic but realistic scenario
• Correlating to get
Synthetic but realistic scenario
Create better views (smart mashups)
Extracting Social Signals
• what are the important topics of discussions and concerns in different parts of the world on a particular day
• how different cultures or countries are reacting to the same event or situation (e.g., Mumbai Attack)
• how a situation such as financial crisis is evolving over a period of time in terms of key topics of discussion and issues of concern (e.g., subprime mortgages and foreclosures, followed by troubled banks and credit freeze, followed by massive government intervention and borrowing, and so on).
Twitris Demo
A few more things
• Use of background knowledge
• Event extraction from text
– time and location extraction
• Such information may not be present
• Someone from Washington DC can tweet about Mumbai
• Scalable semantic analytics
– Subgraph and pattern discovery
• Meaningful subgraphs like relevant and interesting paths
• Ranking paths
The Sum of the Parts
Spatio-Temporal analysis
– Find out where and when
+ Thematic
– What and how
+ Semantic Extraction from text, multimedia and sensor data
– tags, time, location, concepts, events
+ Semantic models & background knowledge
– Making better sense of STT
– Integration
+ Semantic Sensor Web
– The platform
= Situational Awareness
KNO.E.SIS as a case study of world class research based higher education environment
http://knoesis.org
Kno.e.sis Center Labs (3rd Floor, Joshi)
Amit Sheth
• Semantic Science Lab
• Semantic Web Lab
• Service Research Lab
TK Prasad
• Metadata and Languages Lab
Shaojun Wang
• Statistical Machine Learning
Pascal Hitzler
• Formal Semantics & Reasoning lab
Michael Raymer
• Bioinformatics Lab
Guozhu Dong
• Data Mining Lab
Keke Chen
• Data Intensive Analysis and Computing Lab
KNO.E.SIS MEMBERS – A SUBSET
Exceptional students
• Six of the senior PhD students: 84 papers, 43 program committees, contributed to winning NIH and NSF grants.
• Successfully competed with two Stanford PhDs, 1000+ citations in 2 years of his graduation.
• “BTW, Meena is an absolute find. If all of your other students are as talented, you are very lucky. … I’d definitely like to work with more interns of her caliber, ... ”[Dr. Kevin Haas, Director of Search at Yahoo!]
• “It has been a few years since I visited Dayton (Wright AFB). However, it is clear that Wright State has transformed itself. Congratulations on your success with the Knoesis Center.” [Dr. Alpers Caglayan – looking to hire Kno.e.sis grads]
Funding, Collaboration, etc
• UGA, Stanford, CCHMC, SAIC, HP, IBM, Yahoo!
• NIH, NSF, AFRL-HE, AFRL-Sensor, HP, IBM, Microsoft, Google
• 70% Federal, 19% State, 11% Industry
• Students intern at the best industry labs & national labs
• Graduates very successful
Interested in more background?
• Semantics-Empowered Social Computing
• Semantic Sensor Web
• Traveling the Semantic Web through Space, Theme and Time
• Relationship Web: Blazing Semantic Trails between Web Resources
• Text Mining, Workflow Management, Semantic Web Services, Cloud Computing with application to healthcare, biomedicine, defense/intelligence, energy
Contact/more details: amit @ knoesis.org
Special thanks: Karthik Gomadam, Meena Nagarajan, Christopher Thomas
Partial Funding: NSF (Semantic Discovery: IIS: 071441, Spatio Temporal Thematic: IIS-0842129), AFRL and DAGSI (Semantic Sensor Web), Microsoft Research and IBM Research (Analysis of Social Media Content), and HP Research (Knowledge Extraction from Community-Generated Content).