language as social sensor - marko grobelnik - dubrovnik - hrtal2016 - 30 sep 2016
Post on 22-Jan-2018
Language as a Social Sensor to operate with Knowledge
Marko Grobelnik
Jozef Stefan Institute, Slovenia
Marko.Grobelnik@ijs.si
Dubrovnik, Sep 30th 2016
Reflection on what should be the goal of NLP
• The (mostly) forgotten long-term aim of NLP is to understand the text
• …and not so much ‘processing’ itself (as NLP suggests)
• The curse of shallow solutions working well enough for too many problems made people (and researchers) happy for too long
• …as much as information retrieval and text mining are useful, they delayed the development of “text understanding”
Language vs. World
• …if we agree with the above statement, then at this point in time, we have ‘language’, but the ‘world’ is more or less missing
• So what could a ‘world’ or ‘world model’ be?
Language is really a social sensor…
• Nature’s physical reality is very complex…
• …but manifests itself in a simple and structured way
• Humans need a mechanism to capture the complexity they need to survive, evolve and communicate
• …that’s why the language appeared as a necessity
• Consequently, human language is a reflection of the world in which we live and our perception of it:
• Some of the key properties: Uncertainty, dynamics, compressed information
[Diagram layers, in order: Nature; Human (×3); Perception (×3); Language; Common Understanding]
Nature is complex, but whenever Nature gets optimized it moves towards a simple and clear structure (crystallization is an obvious example of a process that produces structure)
Human perception is just a simplified reflection of how Nature shows itself
Language is a means of communicating perception, a kind of sensor for the structures beneath (since it is optimized, it has the form of a crystal)
The common understanding of Nature is what we call Knowledge; it still emits clear structures (clear Knowledge has a nice crystal structure)
Crystallization of the Nature, Perception, Language and Knowledge
Positioning language towards knowledge
• Language has the difficult task of encoding Nature’s complexity in a way that is efficient for humans…
• …to describe Nature
• …to express uncertainty when Nature’s complexity is not fully understood
• …to be efficient when communicating
• …to reflect the dynamics of a changing environment
• …to abstract physical reality into abstract forms, which we call Knowledge
Why do we need to represent knowledge in a formal way?
• The key element for operating with knowledge is “Reasoning”
• Since we cannot express all facts in a formalized way, we need a mechanism that combines knowledge fragments to derive new knowledge
• …this is called reasoning
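The idea of combining knowledge fragments to derive new knowledge can be sketched as a tiny forward-chaining loop. This is only an illustration and not how Cyc itself reasons: the three starting facts, the two hand-written rules and the function name `forward_chain` are assumptions made for the example.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object)


def forward_chain(facts: Set[Fact]) -> Set[Fact]:
    """Apply two hand-written rules until no new fact appears.

    Rule 1: (isa X C) and (genls C D)   => (isa X D)
    Rule 2: (genls C D) and (genls D E) => (genls C E)
    """
    known = set(facts)
    while True:
        derived = set()
        for pred1, a, b in known:
            for pred2, c, d in known:
                if pred2 == "genls" and b == c and pred1 in ("isa", "genls"):
                    derived.add((pred1, a, d))
        if derived <= known:
            return known
        known |= derived


# Illustrative facts (assumed for this sketch, not taken from a real KB).
facts = {
    ("isa", "Earth", "Planet"),
    ("genls", "Planet", "CelestialBody"),
    ("genls", "CelestialBody", "Thing"),
}
closure = forward_chain(facts)
new_facts = sorted(closure - facts)
print(new_facts)
```

None of the facts in `new_facts` was stated explicitly; each one follows from combining two stored fragments.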
Popular ways to encode and reason with knowledge
• In current science we have several ways to express knowledge, with the aim of encoding the complexity of the world:
  • …simple forms of knowledge expressed as a collection of points in high-dimensional spaces
    • Efficient, due to linear and other algebras and the corresponding tools
    • Most popular nowadays: machine learning, statistics, text mining and statistical NLP mostly use these forms
    • Reasoning is often straightforward
  • …probabilistic structures such as Bayesian networks
    • Expressive, but more expensive to encode; still manageable for reasoning
  • …various kinds of logic to formulate ontological knowledge
    • Very expressive, but not always easy to use for reasoning
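The first option, knowledge as points in a high-dimensional space, can be illustrated with a bag-of-words sketch in which "reasoning" reduces to comparing vectors. The example sentences and the helper names `bag_of_words` and `cosine` are assumptions chosen for illustration; they do not come from the talk.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a text as a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = bag_of_words("the cat sat on the mat")
doc_b = bag_of_words("the cat lay on the mat")
doc_c = bag_of_words("stock markets fell sharply")

sim_ab = cosine(doc_a, doc_b)  # texts sharing most words
sim_ac = cosine(doc_a, doc_c)  # texts sharing no words
print(round(sim_ab, 3), round(sim_ac, 3))
```

The comparison is cheap and easy to scale, which matches the "efficient" and "straightforward" points above, but the vectors say nothing about why two texts are related.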
CYC KNOWLEDGE BASE
[Figure: a fragment of the Cyc hierarchy with the nodes Thing, Universe, Celestial Body, Planet, Earth, Animal and Human, connected by isa, subclass and located in links, shown against a cloud of everyday concepts: Physics, Money, Mathematics, Chemistry, Time, Learning, Food, Vehicles, Event, Education, School, Language, Love, Emotions, Going for a walk, Death, Cat, Euro, Working, Words, Driving, Rain, Stabbing someone, Nature, Tree, Hatred, Fear]
Structure of a Common Sense Knowledge (CycKB at http://opencyc.org/)
Model of the world…
• …beyond surface knowledge
• …to interconnect contextualized fragments
Why?
• To make reasoning capable of connecting isolated fragments of knowledge
• To derive new knowledge beyond materialized factual knowledge
[Diagram labels: World model; Top-down KA; Bottom-up KA; Multimodal data]
Why do we need a World model?
What can be extracted from a document?
• Lexical level
  • Tokenization – extracting tokens from a document (words, separators, …)
  • Sentence splitting – set of sentences to be further processed
• Linguistic level
  • Part-of-Speech – assigning word types (nouns, verbs, adjectives, …)
  • Deep Parsing – constructing parse trees from sentences
  • Triple extraction – subject-predicate-object triple extraction
  • Named entity extraction – identifying names of people, places, organizations
• Semantic level
  • Co-reference resolution – replacing pronouns with corresponding names; merging different surface forms of names into a single entity
  • Semantic labeling – assigning semantic identifiers to names (e.g. LOD/DBpedia/Freebase) including disambiguation
  • Topic classification – assigning topic categories to a document (e.g. DMoz)
  • Summarization – assigning importance to parts of a document
  • Fact extraction – extracting relevant facts from a document
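The two lexical-level steps can be sketched with regular expressions. This is a deliberately naive illustration: the function names and the splitting rules are assumptions, and a production tokenizer would need to handle abbreviations, numbers and language-specific punctuation.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> List[str]:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Galileo was born in Pisa. He was a physicist and astronomer!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens[0])
```

The linguistic and semantic levels listed above build on exactly this kind of output: a list of sentences, each a list of tokens.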
Wikipedia as a World model (http://wikifier.org) [Demo]
Annotation and disambiguation of general texts into Wikipedia concepts, with a changing vocabulary, in 100 languages
Global Media as a playground to understand social dynamics through shallow knowledge extraction (http://eventregistry.org/) [Demo]
Imported articles: 150M
Identified events: 5M (2014-2016)
News sources: 154,969
Unique concepts: 2,698,213
Categories: 5,015
Linguistic processing on Semantically augmented texts
• The goal is to use traditional corpus-linguistics tools on top of semantically enriched texts
  • Example: “UN” string -> “United Nations” concept -> “Organization” higher-level concept -> …
  • The purpose is to reuse existing tools for many languages to accurately extract micro-context within the text
• Using SketchEngine (https://www.sketchengine.co.uk/) to preprocess the NewsFeed.ijs.si documents (100M+ docs)
  • Covering the following languages: Arabic, Catalan, Czech, German, English, Finnish, French, Croatian, Hungarian, Italian, Korean, Dutch, Polish, Russian, Spanish, Serbian and Swedish
• Login: https://ondra.sketchengine.co.uk/ / username: test / password: preview
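The “UN” -> “United Nations” -> “Organization” chain can be sketched as extra annotation columns attached to each token, printed one token per line in a tab-separated layout of the kind many corpus tools accept. The slides do not say which input layout was actually used, and the two lookup tables and the function name `augment` below are hypothetical stand-ins for a wikifier and an ontology.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup tables; a real system would use a wikifier and an ontology.
SURFACE_TO_CONCEPT: Dict[str, str] = {"UN": "United Nations"}
CONCEPT_TO_TYPE: Dict[str, str] = {"United Nations": "Organization"}


def augment(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Attach (token, concept, higher-level concept) to every token.

    Tokens without a known concept get empty strings in the extra columns.
    """
    rows = []
    for tok in tokens:
        concept = SURFACE_TO_CONCEPT.get(tok, "")
        higher = CONCEPT_TO_TYPE.get(concept, "")
        rows.append((tok, concept, higher))
    return rows


rows = augment(["The", "UN", "met", "today"])
for row in rows:
    print("\t".join(row))
```

With the concept columns in place, a corpus query can match on "Organization" instead of on the surface string, which is what lets the same tool work across languages.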
Infobox extraction for events (structured event representation)
• Structured event representation describes an event by its “Event Type” and corresponding information slots to be filled
• Event Types should be taken from “Event Taxonomy”
• …at this stage of development this level of representation still requires human intervention to achieve high-accuracy (Precision/Recall) extraction
• Example on the right – Wikipedia event infobox:
  • 2011 Tōhoku earthquake and tsunami
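A structured event representation of this kind can be sketched as a record with an event type and a fixed set of slots. The taxonomy fragment, the slot names and the class name `EventInfobox` are assumptions for illustration; they are not the slot set of the Wikipedia infobox or of any real event taxonomy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical taxonomy fragment: event type -> slot names to be filled.
EVENT_TAXONOMY: Dict[str, List[str]] = {
    "Earthquake": ["date", "magnitude", "location", "casualties"],
}


@dataclass
class EventInfobox:
    title: str
    event_type: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TAXONOMY:
            raise ValueError(f"unknown event type: {self.event_type}")
        # Start with every slot of the event type present but unfilled.
        for name in EVENT_TAXONOMY[self.event_type]:
            self.slots.setdefault(name, None)

    def missing_slots(self) -> List[str]:
        """Slots that extraction (or a human editor) still has to fill."""
        return [k for k, v in self.slots.items() if v is None]


event = EventInfobox(
    title="2011 Tōhoku earthquake and tsunami",
    event_type="Earthquake",
    slots={"date": "2011-03-11"},
)
print(event.missing_slots())
```

Listing the unfilled slots makes explicit where human intervention is still needed after automatic extraction.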
One of the challenges for the future: Micro-reading
• It is “easier” to understand millions of documents than a single document
• …reading and understanding a single document is micro-reading
• The following experiment is on how much knowledge we can extract from individual documents
• …extraction is in the form of first-order, inferentially productive Cyc logic
• …allowing us full reasoning to identify new facts
• …minimizing human involvement, optimizing precision and recall
Document Assertions Reasoning Dialogue
Disambiguation with a world model (CycKB)
World model used as a set of common-sense semantic constraints to disambiguate text
Cycorp © 2006
The Cyc Ontology
[Figure: the Cyc ontology, from the most general concepts (Thing, Intangible Thing, Individual, Temporal Thing, Spatial Thing, Partially Tangible Thing) through areas such as Paths, Sets, Relations, Logic, Math, Time, Space, Events, Scripts, Agents, Actors, Actions, Plans, Goals, Physical Objects, Materials, Parts, Living Things, Life Forms, Animals, Plants, Ecology, Weather, Natural Geography, Political Geography, Earth & Solar System, Human Beings, Human Anatomy & Physiology, Emotion, Perception, Belief, Human Behavior & Actions, Human Activities, Human Artifacts, Products, Devices, Vehicles, Buildings, Weapons, Software, Literature, Works of Art, Language, Social Relations, Culture, Organizations, Nations, Governments, Geo-Politics, Law, Business & Commerce, Politics, Warfare, Professions, Occupations, Travel, Communication, Transportation & Logistics, Sports, Recreation and Entertainment, to the layers “General Knowledge about Various Domains” and “Specific data, facts, and observations”]
[Figure: architecture diagram with the components Cyc Reasoning Modules; Cyc Ontology & Knowledge Base; Cyc API; Knowledge Entry Tools; User Interface (with Natural Language Dialog); Interface to External Data Sources; Data Bases; Web Pages; Text Sources; Other KBs]
Cyc High-level Architecture
[Figure: the Cyc ontology diagram again, with the domain layers labeled “General Knowledge about Terrorism” and “Specific data, facts, and observations about terrorist groups and activities”]
General Knowledge about Terrorism:
Terrorist groups are capable of directing assassinations:
(implies
  (isa ?GROUP TerroristGroup)
  (behaviorCapable ?GROUP AssassinatingSomeone directingAgent))
…
If a terrorist group considers an agent an enemy, that agent is vulnerable to an attack by that group:
(implies
  (and
    (isa ?GROUP TerroristGroup)
    (considersAsEnemy ?GROUP ?TARGET))
  (vulnerableTo ?GROUP ?TARGET TerroristAttack))
Cyc KB Extended w/Domain Knowledge
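How an `implies` rule with variables such as `?GROUP` and `?TARGET` produces a new assertion can be sketched with a small pattern matcher. This is a toy under stated assumptions, not Cyc's inference engine: the tuple encoding, the helper names `match` and `apply_rule`, and the placeholder constants `GroupX` and `AgentY` are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Atom = Tuple[str, ...]  # e.g. ("isa", "?GROUP", "TerroristGroup")
Bindings = Dict[str, str]


def match(pattern: Atom, fact: Atom, env: Bindings) -> Optional[Bindings]:
    """Match one pattern against one ground fact, extending the bindings."""
    if len(pattern) != len(fact):
        return None
    env = dict(env)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if env.setdefault(p, f) != f:
                return None  # variable already bound to something else
        elif p != f:
            return None
    return env


def apply_rule(premises: List[Atom], conclusion: Atom, facts: Set[Atom]) -> Set[Atom]:
    """Return every instance of the conclusion supported by the facts."""
    envs: List[Bindings] = [{}]
    for premise in premises:
        next_envs: List[Bindings] = []
        for env in envs:
            for fact in facts:
                extended = match(premise, fact, env)
                if extended is not None:
                    next_envs.append(extended)
        envs = next_envs
    return {tuple(env.get(term, term) for term in conclusion) for env in envs}


# Placeholder facts, invented for this sketch.
facts = {
    ("isa", "GroupX", "TerroristGroup"),
    ("considersAsEnemy", "GroupX", "AgentY"),
    ("isa", "AgentY", "Person"),
}
premises = [
    ("isa", "?GROUP", "TerroristGroup"),
    ("considersAsEnemy", "?GROUP", "?TARGET"),
]
conclusion = ("vulnerableTo", "?GROUP", "?TARGET", "TerroristAttack")
derived = apply_rule(premises, conclusion, facts)
print(derived)
```

The derived assertion is not stored anywhere in `facts`; it exists only because the general rule and the specific facts were combined.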
Specific Facts about Al Qaida:
(basedInRegion AlQaida Afghanistan) Al-Qaida is based in Afghanistan.
(hasBeliefSystems AlQaida IslamicFundamentalistBeliefs) Al-Qaida has Islamic fundamentalist beliefs.
(hasLeaders AlQaida OsamaBinLaden) Al-Qaida is led by Osama bin Laden.
…
(affiliatedWith AlQaida AlQudsMosqueOrganization) Al-Qaida is affiliated with the Al Quds Mosque.
(affiliatedWith AlQaida SudaneseIntelligenceService) Al-Qaida is affiliated with the Sudanese Intell Service
…
(sponsors AlQaida HarakatUlAnsar) Al-Qaida sponsors Harakat ul-Ansar.
(sponsors AlQaida LaskarJihad) Al-Qaida sponsors Laskar Jihad.
…
(performedBy EmbassyBombingInNairobi AlQaida) Al-Qaida bombed the Embassy in Nairobi.
(performedBy EmbassyBombingInTanzania AlQaida) Al-Qaida bombed the Embassy in Tanzania.
Cyc KB Extended w/Domain Knowledge
Example of automatic translating text into Cyc Logic and back to text
Source: “Galileo Galilei was an Italian physicist and astronomer.”
Learn Logic:
(#$and
  (#$isa #$GalileoGalilei #$ItalianPerson)
  (#$isa #$GalileoGalilei #$Physicist)
  (#$isa #$GalileoGalilei #$Astronomer))
Fact: Galileo was an Italian, a physicist, and an astronomer.
Source: “Galileo was born in Pisa on February 15, 1564.”
Learn Logic:
(#$and
  (#$birthDate #$GalileoGalilei (#$DayFn 15 (#$MonthFn #$February (#$YearFn 1564))))
  (#$birthPlace #$GalileoGalilei #$CityOfPisaItaly))
Fact: Galileo was born on February 15, 1564 and he was born in Pisa.
Source: “Albert Einstein was born in 1879 in Ulm, Germany.”
Learn Logic: (#$birthDate #$AlbertEinstein (#$YearFn 1879))
Fact: Albert Einstein was born in 1879.
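The “Learn Logic” lines are parenthesized expressions, so a program that wants to work with them first has to read them into a nested structure. The following is a minimal sketch of such a reader, assuming only the surface syntax visible in the examples (parentheses, whitespace-separated terms, the `#$` constant prefix); the function name `parse_cycl` is hypothetical and this is not Cyc's own parser.

```python
from typing import List, Union

SExpr = Union[str, List["SExpr"]]


def parse_cycl(text: str) -> SExpr:
    """Parse one parenthesized CycL-style expression into nested lists.

    The '#$' constant prefix is stripped; everything else is kept as a string.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> SExpr:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items: List[SExpr] = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return items
        if tok == ")":
            raise ValueError("unexpected closing parenthesis")
        return tok[2:] if tok.startswith("#$") else tok

    expr = read()
    if pos != len(tokens):
        raise ValueError("trailing input after expression")
    return expr


parsed = parse_cycl(
    "(#$birthDate #$GalileoGalilei "
    "(#$DayFn 15 (#$MonthFn #$February (#$YearFn 1564))))"
)
print(parsed)
```

Generating the “Fact” sentence from such a structure is the reverse direction and needs per-predicate templates, which this sketch does not attempt.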
Example of text and extracted Cyc assertions (1/2)
Automatically Extracted Assertions:
• (isa ?V1 ProsecutingEvent)
• (agent ?V1 RudyGiuliani)
• (genls Entity Agent)
• (isa RudyGiuliani Agent)
• (isa RudyGiuliani Entity)
• (isa ?V3 OrganizingEvent)
• (patient ?V3 (IntersectionFn OrganizedCrime WallStreet))
• (isa (IntersectionFn OrganizedCrime WallStreet) Patient)
• (genls Entity Patient)
• (isa OrganizedCrime Patient)
• (isa OrganizedCrime Entity)
• (isa WallStreet Patient)
• (isa WallStreet Entity)
Sentence: He prosecuted a number of high-profile cases, including ones against organized crime and Wall_Street financiers.
Example of text and extracted Cyc assertions (2/2)
Automatically Extracted Assertions:
• (isa ?V1 SubstitutingEvent)
• (temporal ?V1 Lincoln)
• (genls Entity Agent)
• (isa Lincoln Agent)
• (genls Person Entity)
• (isa Lincoln Entity)
• (isa Lincoln Person)
• (isa ?V3 SucceedingEvent)
• (temporal ?V3 Grant)
• (isa Grant Agent)
• (isa Grant Entity)
• (isa Grant Person)
Sentence: Each time a general failed, Lincoln substituted another until finally Grant succeeded in 1865.
Reasoning on extracted assertions (Cyc)
Query:
(and
(isa ?Per Person)
(birthDate ?Per ?BD)
(occursBefore ?BD WorldWarII)
(thereExistsAtLeast 2 ?Role
(lifeRole ?Per ?Role)
(roleInIndustry ?Role FilmIndustry)
)
)
Answers:
Sir Derek_George_Jacobi
Sir Alexander_Korda
Victor Lonzo_Fleming
John_Francis_Junkin
Cornel_Wilde
George_Stevens
Bertrand_Blier
NL Query: People born before World War II who had at least two roles in the film industry KB?
Text query
Query (semi-)automatically translated into First Order Logic
Answers to the query
Cyc’s front-end: “Cyc Analytic Environment” – querying (1/2)
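The meaning of the logical query, in particular the `thereExistsAtLeast 2` clause, can be sketched as a filter over a small in-memory fact table. Everything in the table is a placeholder invented for this example (the people, their roles and the choice of which roles count as film-industry roles); only 1939 as the start of World War II is a historical fact. A real Cyc query would be answered by inference over the knowledge base, not by a loop like this.

```python
from typing import Dict, List, Set

# Hypothetical mini knowledge base; names and values are placeholders.
BIRTH_YEAR: Dict[str, int] = {"PersonA": 1905, "PersonB": 1950, "PersonC": 1920}
LIFE_ROLES: Dict[str, Set[str]] = {
    "PersonA": {"Director", "Producer"},
    "PersonB": {"Actor", "Director"},
    "PersonC": {"Actor", "Physicist"},
}
FILM_INDUSTRY_ROLES: Set[str] = {"Actor", "Director", "Producer"}
WORLD_WAR_II_START = 1939


def answer_query(min_roles: int = 2) -> List[str]:
    """People born before World War II with at least `min_roles` film roles."""
    answers = []
    for person, year in BIRTH_YEAR.items():
        if year >= WORLD_WAR_II_START:
            continue  # fails (occursBefore ?BD WorldWarII)
        film_roles = LIFE_ROLES.get(person, set()) & FILM_INDUSTRY_ROLES
        if len(film_roles) >= min_roles:  # (thereExistsAtLeast 2 ?Role ...)
            answers.append(person)
    return sorted(answers)


answers = answer_query()
print(answers)
```

Lowering `min_roles` to 1 widens the answer set, which shows how the counting quantifier, rather than the birth-date test, does most of the filtering here.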
Who has a motive for the assassination of Rafik Hariri?
Query & Answer
Justification
Sources for Reasoning and Justification
Cyc’s front-end: “Cyc Analytic Environment” – justification (2/2)
Some of the challenges for the future
• Background knowledge in the form of a World Model
• …to have knowledge contextualized
• Representing knowledge and reasoning over it at scale with operational soft logic
• …to decrease brittleness of logic and increase scale
• Economically viable structured knowledge acquisition with high precision and recall
• …to increase the reach of what we can acquire
• Emphasizing understanding vs. applying black box models