language as social sensor - marko grobelnik - dubrovnik - hrtal2016 - 30 sep 2016


Language as a Social Sensor to operate with Knowledge

Marko Grobelnik

Jozef Stefan Institute, Slovenia

[email protected]

Dubrovnik, Sep 30th 2016

Reflection on what should be the goal of NLP

• The (mostly) forgotten long-term aim of NLP is to understand the text
  • …and not so much ‘processing’ itself (as NLP suggests)

• The curse of shallow solutions working well enough for too many problems made people (and researchers) happy for too long
  • …as much as information retrieval and text mining are useful, they delayed the development of “text understanding”

Language vs. World

• …if we agree with the above statement, then at this point in time we have ‘language’, but the ‘world’ is more or less missing

• So what could a ‘world’ or ‘world model’ be?

Language is really a social sensor…

• Nature’s physical reality is very complex…
  • …but it manifests itself in a simple and structured way

• Humans need a mechanism to capture the complexity they need to survive, evolve and communicate
  • …that is why language appeared as a necessity

• Consequently, human language is a reflection of the world in which we live and our perception of it
  • Some of the key properties: uncertainty, dynamics, compressed information

[Diagram: Nature; Human, Human, Human; Perception (one per human); Language; Common Understanding]

Nature is complex, but whenever Nature gets optimized it moves towards a simple and clear structure (crystallization is an obvious example of a process that produces structure)

Human perception is just a simplified reflection of how Nature shows itself

Language is a means of communicating perception, a kind of sensor for the structures beneath (since it is optimized, it has the form of a crystal)

We call the common understanding of Nature Knowledge; it still exhibits clear structures (clear Knowledge has a nice crystal structure)

Crystallization of the Nature, Perception, Language and Knowledge

Positioning language towards knowledge

• Language has the difficult task of encoding Nature’s complexity in a way that is efficient for humans…
  • …to describe Nature
  • …to express uncertainty, given that we do not fully understand Nature’s complexity
  • …to be efficient when communicating
  • …to reflect the dynamics of a changing environment
  • …to abstract physical reality into abstract forms, which we call Knowledge

Why do we need to represent knowledge in a formal way?

• The key element for operating with knowledge is “Reasoning”

• Since we cannot express all facts in a formalized way, we need a mechanism that combines knowledge fragments to derive new knowledge
  • …this is called reasoning
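As a rough illustration of combining knowledge fragments to derive new knowledge, the sketch below runs a tiny forward-chaining loop in Python. The three starting facts and the two rules are toy assumptions chosen for this example; they are not taken from any real knowledge base, and real reasoners are far more elaborate.

```python
from typing import Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object)


def forward_chain(facts: Set[Fact]) -> Set[Fact]:
    """Apply two toy rules repeatedly until no new facts appear."""
    known = set(facts)
    while True:
        new = set()
        for (p1, a, b) in known:
            for (p2, c, d) in known:
                if b != c:
                    continue
                # Rule 1 (assumed): "subclass" is transitive.
                if p1 == "subclass" and p2 == "subclass":
                    new.add(("subclass", a, d))
                # Rule 2 (assumed): an instance of a class is also
                # an instance of every superclass of that class.
                if p1 == "isa" and p2 == "subclass":
                    new.add(("isa", a, d))
        if new <= known:  # nothing new was derived: fixpoint reached
            return known
        known |= new


# Hypothetical explicit facts (only three are stated).
facts = {
    ("isa", "Earth", "Planet"),
    ("subclass", "Planet", "CelestialBody"),
    ("subclass", "CelestialBody", "Thing"),
}

derived = forward_chain(facts) - facts
for fact in sorted(derived):
    print(fact)
```

Only three facts are stored explicitly; the loop derives three more, such as ("isa", "Earth", "Thing"), which is the sense in which reasoning extends what was written down.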

Popular ways to encode and reason with knowledge?

• In current science we have several ways to express knowledge, with the aim of encoding the complexity of the world:
  • …simple forms of knowledge expressed as a collection of points in high-dimensional spaces
    • Efficient, due to linear and other algebras and corresponding tools
    • Most popular nowadays: machine learning, statistics, text mining and statistical NLP mostly use these forms
    • Reasoning is often straightforward
  • …probabilistic structures such as Bayesian networks
    • Expressive, but more expensive to encode, and still manageable to use for reasoning
  • …various kinds of logic to formulate ontological knowledge
    • Very expressive, not always easy to use for reasoning
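To make the first of these forms concrete, here is a minimal Python sketch that treats texts as points in a high-dimensional word-count space and "reasons" by measuring the angle between them. The function names and the example sentences are assumptions made for illustration only.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a text as a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example texts.
doc1 = bag_of_words("the earthquake hit the coast")
doc2 = bag_of_words("the earthquake damaged the coast")
doc3 = bag_of_words("parliament passed a budget")

print(cosine(doc1, doc2))  # texts sharing most words score close to 1
print(cosine(doc1, doc3))  # texts sharing no words score 0
```

The computation is cheap linear algebra, which is why this form scales well; the price is that word order and deeper meaning are ignored.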

CYC KNOWLEDGE BASE

[Figure: fragment of the Cyc knowledge base drawn as a graph. Nodes include Thing, Universe, Celestial Body, Planet, Earth, Animal and Human, connected by “isa”, “subclass” and “located in” links. Around the graph is a cloud of everyday concepts, repeated across several panels: Physics, Money, Mathematics, Chemistry, Time, Learning, Food, Vehicles, Event, Education, School, Language, Love, Emotions, Going for a walk, Death, Cat, Euro, Working, Words, Driving, Rain, Stabbing someone, Nature, Tree, Hatred, Fear.]

Structure of a Common Sense Knowledge (CycKB at http://opencyc.org/)

Model of the world…
• …beyond surface knowledge
• …to interconnect contextualized fragments

Why?
• To make reasoning capable of connecting isolated fragments of knowledge
• To derive new knowledge beyond materialized factual knowledge

[Diagram labels: World model; Top-down KA; Bottom-up KA; Multimodal data]

Why do we need a World model?

Simple forms of knowledge extraction and reasoning

What can be extracted from a document?

• Lexical level
  • Tokenization – extracting tokens from a document (words, separators, …)
  • Sentence splitting – set of sentences to be further processed

• Linguistic level
  • Part-of-Speech – assigning word types (nouns, verbs, adjectives, …)
  • Deep Parsing – constructing parse trees from sentences
  • Triple extraction – subject-predicate-object triple extraction
  • Named entity extraction – identifying names of people, places, organizations

• Semantic level
  • Co-reference resolution – replacing pronouns with corresponding names; merging different surface forms of names into a single entity
  • Semantic labeling – assigning semantic identifiers to names (e.g. LOD/DBpedia/Freebase), including disambiguation
  • Topic classification – assigning topic categories to a document (e.g. DMoz)
  • Summarization – assigning importance to parts of a document
  • Fact extraction – extracting relevant facts from a document
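The two lexical-level steps listed above can be sketched with nothing more than regular expressions. This is a deliberately naive illustration in Python; the function names are made up for this example, and production systems handle abbreviations, quotes and many other cases that this sketch ignores.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitting: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Galileo was born in Pisa. He was an Italian physicist and astronomer."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The linguistic and semantic levels build on this output: a part-of-speech tagger or parser consumes the token lists, and co-reference resolution would later link "He" in the second sentence back to "Galileo".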

Wikipedia as a World model (http://wikifier.org) [Demo]

Annotation and disambiguation of general texts into Wikipedia concepts, with a changing vocabulary, in 100 languages

Global Media as a playground to understand social dynamics through shallow knowledge extraction (http://eventregistry.org/) [Demo]

• Imported articles: 150M
• Identified events: 5M (2014-2016)
• News sources: 154,969
• Unique concepts: 2,698,213
• Categories: 5,015

Event description through entities and Semantic keywords

Collection of events described through Entity relatedness

Collection of events described through trending concepts

Collection of events described through three level categorization

Events identified across languages

Collection of events described through a story-line of related events

Linguistic processing on semantically augmented texts

• The goal is to use traditional corpus linguistic tools on top of semantically enriched texts
  • Example: “UN” string -> “United Nations” concept -> “Organization” higher-level concept -> …
  • The purpose is to reuse existing tools for many languages to accurately extract micro-context within the text

• Using SketchEngine (https://www.sketchengine.co.uk/) to preprocess the NewsFeed.ijs.si documents (100M+ docs)
  • Covering the following languages: Arabic, Catalan, Czech, German, English, Finnish, French, Croatian, Hungarian, Italian, Korean, Dutch, Polish, Russian, Spanish, Serbian and Swedish

• Login: https://ondra.sketchengine.co.uk/ / username: test / password: preview
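The enrichment chain in the example above (string -> concept -> higher-level concept) can be pictured as attaching extra annotation layers to each token, so that corpus tools can query any layer. The Python sketch below uses two hand-written lookup tables as stand-ins; they are assumptions for illustration and do not reflect how the actual annotation service works.

```python
from typing import List, Optional, Tuple

# Hypothetical lookup tables standing in for a real semantic annotation service.
SURFACE_TO_CONCEPT = {"UN": "United Nations"}
CONCEPT_TO_TYPE = {"United Nations": "Organization"}

Annotated = Tuple[str, Optional[str], Optional[str]]


def enrich(tokens: List[str]) -> List[Annotated]:
    """Attach a concept and a higher-level concept to each token, where known."""
    enriched = []
    for tok in tokens:
        concept = SURFACE_TO_CONCEPT.get(tok)
        concept_type = CONCEPT_TO_TYPE.get(concept) if concept else None
        enriched.append((tok, concept, concept_type))
    return enriched


for row in enrich(["The", "UN", "met", "today"]):
    print(row)
```

With the layers in place, a corpus query can ask for words that co-occur with any "Organization" rather than with one particular surface string.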

Infobox extraction for events (structured event representation)

• A structured event representation describes an event by its “Event Type” and the corresponding information slots to be filled

• Event Types should be taken from an “Event Taxonomy”

• …at this stage of development, this level of representation still requires human intervention to achieve high-accuracy (Precision/Recall) extraction

• Example on the right, a Wikipedia event infobox:
  • 2011 Tōhoku earthquake and tsunami
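One possible way to picture an event type with slots to be filled is a small record type checked against a taxonomy template. The Python sketch below is an assumption-laden illustration: the taxonomy entry, the slot names and the class are invented here, and only the event title comes from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical taxonomy: each event type lists the slots an infobox should fill.
EVENT_TAXONOMY: Dict[str, List[str]] = {
    "Earthquake": ["date", "country", "magnitude"],
}


@dataclass
class EventInfobox:
    event_type: str
    title: str
    slots: Dict[str, str] = field(default_factory=dict)

    def missing_slots(self) -> List[str]:
        """Slots required by the taxonomy that extraction has not filled yet."""
        return [s for s in EVENT_TAXONOMY[self.event_type] if s not in self.slots]


event = EventInfobox(
    event_type="Earthquake",
    title="2011 Tōhoku earthquake and tsunami",
    slots={"date": "2011-03-11", "country": "Japan"},
)
print(event.missing_slots())
```

Slots reported as missing are the natural place for the human intervention mentioned above: an editor fills or verifies them before the record is trusted.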

Deeper means to model and reason with knowledge

One of the challenges for the future: Micro-reading

• It is “easier” to understand millions of documents than a single document
  • …reading and understanding a single document is micro-reading

• The following experiment is on how much knowledge we can extract from individual documents
  • …extraction is in the form of first-order, inferentially productive Cyc logic
  • …allowing us full reasoning to identify new facts
  • …minimizing human involvement, optimizing precision and recall

Document Assertions Reasoning Dialogue

Disambiguation with a world model (CycKB)

World model used as a set of common-sense semantic constraints to disambiguate text

Cyc Knowledge Base and Reasoning

Cycorp © 2006

The Cyc Ontology

[Figure: the Cyc ontology drawn as a layered diagram. Topics shown: Thing; Intangible Thing; Individual; Temporal Thing; Spatial Thing; Partially Tangible Thing; Paths; Sets; Relations; Logic; Math; Human Artifacts; Social Relations, Culture; Human Anatomy & Physiology; Emotion, Perception, Belief; Human Behavior & Actions; Products, Devices; Conceptual Works; Vehicles, Buildings, Weapons; Mechanical & Electrical Devices; Software, Literature, Works of Art, Language; Agent Organizations; Organizational Actions; Organizational Plans; Types of Organizations; Human Organizations; Nations, Governments, Geo-Politics; Business, Military Organizations; Law; Business & Commerce; Politics, Warfare; Professions, Occupations; Purchasing, Shopping; Travel, Communication; Transportation & Logistics; Social Activities; Everyday Living; Sports, Recreation, Entertainment; Artifacts; Movement; State Change; Dynamics; Materials; Parts; Statics; Physical Agents; Borders; Geometry; Events, Scripts; Spatial Paths; Actors, Actions; Plans, Goals; Time; Agents; Space; Physical Objects; Human Beings; Organization; Human Activities; Living Things; Social Behavior; Life Forms; Animals; Plants; Ecology; Natural Geography; Earth & Solar System; Political Geography; Weather.

Layers below the ontology: General Knowledge about Various Domains; Specific data, facts, and observations.]


[Architecture diagram components: Cyc Reasoning Modules; Interface to External Data Sources (Data Bases, Web Pages, Text Sources, Other KBs); Cyc API; Knowledge Entry Tools; User Interface (with Natural Language Dialog); Cyc Ontology & Knowledge Base]

Cyc High-level Architecture


[Figure: the Cyc ontology diagram again, with the same topics as on the slide “The Cyc Ontology”. The layers below the ontology are now labeled: General Knowledge about Terrorism; Specific data, facts, and observations about terrorist groups and activities.]

General Knowledge about Terrorism:

Terrorist groups are capable of directing assassinations:
(implies
  (isa ?GROUP TerroristGroup)
  (behaviorCapable ?GROUP AssassinatingSomeone directingAgent))

…

If a terrorist group considers an agent an enemy, that agent is vulnerable to an attack by that group:
(implies
  (and
    (isa ?GROUP TerroristGroup)
    (considersAsEnemy ?GROUP ?TARGET))
  (vulnerableTo ?GROUP ?TARGET TerroristAttack))

Cyc KB Extended w/Domain Knowledge


[Figure: the same Cyc ontology diagram, with the lower layers labeled “General Knowledge about Terrorism” and “Specific data, facts, and observations about terrorist groups and activities”.]

Specific Facts about Al Qaida:

(basedInRegion AlQaida Afghanistan) Al-Qaida is based in Afghanistan.
(hasBeliefSystems AlQaida IslamicFundamentalistBeliefs) Al-Qaida has Islamic fundamentalist beliefs.
(hasLeaders AlQaida OsamaBinLaden) Al-Qaida is led by Osama bin Laden.
…
(affiliatedWith AlQaida AlQudsMosqueOrganization) Al-Qaida is affiliated with the Al Quds Mosque.
(affiliatedWith AlQaida SudaneseIntelligenceService) Al-Qaida is affiliated with the Sudanese Intelligence Service.
…
(sponsors AlQaida HarakatUlAnsar) Al-Qaida sponsors Harakat ul-Ansar.
(sponsors AlQaida LaskarJihad) Al-Qaida sponsors Laskar Jihad.
…
(performedBy EmbassyBombingInNairobi AlQaida) Al-Qaida bombed the Embassy in Nairobi.
(performedBy EmbassyBombingInTanzania AlQaida) Al-Qaida bombed the Embassy in Tanzania.

Cyc KB Extended w/Domain Knowledge

Example of automatic translating text into Cyc Logic and back to text

Source: “Galileo Galilei was an Italian physicist and astronomer.”

Learn Logic:
(#$and
  (#$isa #$GalileoGalilei #$ItalianPerson)
  (#$isa #$GalileoGalilei #$Physicist)
  (#$isa #$GalileoGalilei #$Astronomer))

Fact: Galileo was an Italian, a physicist, and an astronomer.

Source: “Galileo was born in Pisa on February 15, 1564.”

Learn Logic:
(#$and
  (#$birthDate #$GalileoGalilei (#$DayFn 15 (#$MonthFn #$February (#$YearFn 1564))))
  (#$birthPlace #$GalileoGalilei #$CityOfPisaItaly))

Fact: Galileo was born on February 15, 1564 and he was born in Pisa.

Source: “Albert Einstein was born in 1879 in Ulm, Germany.”

Learn Logic: (#$birthDate #$AlbertEinstein (#$YearFn 1879))

Fact: Albert Einstein was born in 1879.
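The logic expressions in these examples are parenthesized prefix expressions, so they are easy to load into a program as nested lists. The Python sketch below is a minimal reader written for this illustration; it is not Cyc's own parser or API, and it handles only the subset of syntax visible on the slide (constants with a `#$` prefix, integers and nesting).

```python
from typing import List, Union

Expr = Union[str, int, List["Expr"]]


def tokenize(text: str) -> List[str]:
    """Split an expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Expr:
    """Parse one expression from the front of the token list (consumes tokens)."""
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return expr
    if tok == ")":
        raise ValueError("unexpected ')'")
    if tok.startswith("#$"):
        tok = tok[2:]  # strip the constant prefix
    return int(tok) if tok.isdigit() else tok


assertion = parse(tokenize("(#$birthDate #$AlbertEinstein (#$YearFn 1879))"))
print(assertion)  # ['birthDate', 'AlbertEinstein', ['YearFn', 1879]]
```

Once an assertion is a nested list, the predicate is simply its first element and the arguments follow, which is a convenient starting point for storing or querying extracted facts.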

Example of text and extracted Cyc assertions (1/2)

Automatically Extracted Assertions:

• (isa ?V1 ProsecutingEvent)

• (agent ?V1 RudyGiuliani)

• (genls Entity Agent)

• (isa RudyGiuliani Agent)

• (isa RudyGiuliani Entity)

• (isa ?V3 OrganizingEvent)

• (patient ?V3 (IntersectionFn OrganizedCrime WallStreet))

• (isa (IntersectionFn OrganizedCrime WallStreet) Patient)

• (genls Entity Patient)

• (isa OrganizedCrime Patient)

• (isa OrganizedCrime Entity)

• (isa WallStreet Patient)

• (isa WallStreet Entity)

Sentence: He prosecuted a number of high-profile cases, including ones against organized crime and Wall_Street financiers.

Example of text and extracted Cyc assertions (2/2)

Automatically Extracted Assertions:

• (isa ?V1 SubstitutingEvent)

• (temporal ?V1 Lincoln)

• (genls Entity Agent)

• (isa Lincoln Agent)

• (genls Person Entity)

• (isa Lincoln Entity)

• (isa Lincoln Person)

• (isa ?V3 SucceedingEvent)

• (temporal ?V3 Grant)

• (isa Grant Agent)

• (isa Grant Entity)

• (isa Grant Person)

Sentence: Each time a general failed, Lincoln substituted another until finally Grant succeeded in 1865.

Reasoning on extracted assertions (Cyc)

Query:

(and

(isa ?Per Person)

(birthDate ?Per ?BD)

(occursBefore ?BD WorldWarII)

(thereExistsAtLeast 2 ?Role

(lifeRole ?Per ?Role)

(roleInIndustry ?Role FilmIndustry)

)

)

Answers:

Sir Derek_George_Jacobi

Sir Alexander_Korda

Victor Lonzo_Fleming

John_Francis_Junkin

Cornel_Wilde

George_Stevens

Bertrand_Blier
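To show what the query above asks for, the following Python sketch evaluates the same conditions over a tiny in-memory table. Everything in the table is hypothetical (PersonA, PersonB and PersonC are placeholders, not entries from the Cyc KB), and "before World War II" is approximated as a birth year before 1939; the real system evaluates the query with its own inference engine.

```python
from typing import Dict, List, Set

WORLD_WAR_II_START = 1939  # simplification: compare birth years against 1939

# Hypothetical toy data standing in for extracted assertions.
birth_year: Dict[str, int] = {"PersonA": 1920, "PersonB": 1950, "PersonC": 1910}
life_roles: Dict[str, Set[str]] = {
    "PersonA": {"Actor", "Director"},
    "PersonB": {"Actor", "Producer"},
    "PersonC": {"Actor", "Physicist"},
}
film_industry_roles: Set[str] = {"Actor", "Director", "Producer"}


def born_before_ww2_with_two_film_roles() -> List[str]:
    """People born before 1939 who hold at least two film-industry roles."""
    answers = []
    for person, year in birth_year.items():
        film_roles = life_roles.get(person, set()) & film_industry_roles
        if year < WORLD_WAR_II_START and len(film_roles) >= 2:
            answers.append(person)
    return sorted(answers)


print(born_before_ww2_with_two_film_roles())  # ['PersonA']
```

PersonB fails the date condition and PersonC has only one film-industry role, mirroring how each conjunct of the logical query filters the candidate bindings.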

NL Query: People born before World War II who had at least two roles in the film industry KB?

Text query

Query (semi-)automatically translated into First Order Logic

Answers to the query

Cyc’s front-end: “Cyc Analytic Environment” – querying (1/2)

Who has a motive for the assassination of Rafik Hariri?

Query & Answer

Justification

Sources for Reasoning and Justification

Cyc’s front-end: “Cyc Analytic Environment” – justification (2/2)

Some of the challenges for the future

• Background knowledge in the form of a World Model
  • …to have knowledge contextualized

• Representing knowledge and reasoning over it at scale with operational soft logic
  • …to decrease the brittleness of logic and increase scale

• Economically viable structured knowledge acquisition with high precision and recall
  • …to increase the reach of what we can acquire

• Emphasizing understanding vs. applying black-box models