knowledge representation. computational journalism week 8

Frontiers of Computational Journalism Columbia Journalism School Week 7: Knowledge Representation November 6, 2015

Upload: jonathan-stray

Post on 16-Feb-2016


DESCRIPTION

Jonathan Stray, Columbia University, Fall 2015. Syllabus at http://www.compjournalism.com/?p=133

TRANSCRIPT

Page 1: Knowledge Representation. Computational Journalism week 8

Frontiers of Computational Journalism

Columbia Journalism School

Week 7: Knowledge Representation

November 6, 2015

Page 2: Knowledge Representation. Computational Journalism week 8

Unstructured data

Page 3: Knowledge Representation. Computational Journalism week 8

Structured data

Page 4: Knowledge Representation. Computational Journalism week 8

Everyblock.com circa 2009

Page 5: Knowledge Representation. Computational Journalism week 8

Connected China. Reuters, 2013

Page 6: Knowledge Representation. Computational Journalism week 8

Article Metadata

headline
photo
photo caption
byline
photo credit
publication date
dateline
article body
related articles

Page 7: Knowledge Representation. Computational Journalism week 8

Schema.org news markup

Overall type of the object on this page, in HTML head

Headline, dateline, date as additions to div/span properties

Byline expressed as nested object (using itemscope) of type schema.org/Person
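
The markup pattern described on this slide can be sketched in a few lines. This is only an illustration, assuming HTML microdata and the schema.org NewsArticle and Person types; the function name news_article_microdata and its parameters are hypothetical, and a real page would carry many more properties.

```python
from html import escape


def news_article_microdata(headline, author, date_published, dateline):
    """Render a minimal schema.org NewsArticle fragment as HTML microdata.

    The byline is a nested Person object, introduced with its own itemscope.
    """
    return (
        '<div itemscope itemtype="http://schema.org/NewsArticle">\n'
        f'  <h1 itemprop="headline">{escape(headline)}</h1>\n'
        '  <span itemprop="author" itemscope itemtype="http://schema.org/Person">\n'
        f'    <span itemprop="name">{escape(author)}</span>\n'
        '  </span>\n'
        f'  <time itemprop="datePublished" datetime="{escape(date_published)}">'
        f'{escape(date_published)}</time>\n'
        f'  <span itemprop="dateline">{escape(dateline)}</span>\n'
        '</div>'
    )


# Hypothetical article values, used only to show the output shape.
fragment = news_article_microdata("A & B", "Jane Doe", "2015-11-06", "NEW YORK")
print(fragment)
```

The properties sit directly on the div/span elements that already hold the visible text, so the page looks the same to readers while crawlers can pick out the typed fields.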

Page 8: Knowledge Representation. Computational Journalism week 8

Driving application: “rich snippets”

Schema.org covers not just news but music, restaurants, people, organizations, reviews, offers... Snippets, and better search-ability generally, are motivation for Google, Yahoo, Bing to push schema.org

Page 9: Knowledge Representation. Computational Journalism week 8

Additional metadata from indexing team

In database, but doesn't necessarily make it to HTML.

Page 10: Knowledge Representation. Computational Journalism week 8

News application: content navigation

Articles about “Syria” on NYT topic page

More reliable than simple text search (because the relevance algorithm knows a story is "about" Syria.)

Page 11: Knowledge Representation. Computational Journalism week 8

Ontologies

What objects and relations are available?

Often represented as class hierarchy. Arrows = “is_a” relation

Page 12: Knowledge Representation. Computational Journalism week 8

(Part of) a real ontology, from Cyc

Page 13: Knowledge Representation. Computational Journalism week 8

Every big news org has their own big ontology :(

topics, people, organizations, places...

Page 14: Knowledge Representation. Computational Journalism week 8

Yaaay Linked Data!

Triples of (subject relation object), each a URL or literal

<urn:x-states:New%20York> <http://purl.org/dc/terms/alternative> "NY"

<http://dbpedia.org/resource/Columbia_University> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/CollegeOrUniversity>

Abbreviations possible with many formats...

<http://dbpedia.org/resource/Columbia_University> rdf:type ns6:CollegeOrUniversity
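
As a rough sketch of how the abbreviated form relates to the full one, the snippet below stores a triple as a Python tuple and expands prefixed names back into full URLs. The PREFIXES table is an assumption: "ns6" is taken to stand for http://schema.org/, as the two versions of the triple above suggest, and expand is a hypothetical helper rather than any RDF library's API.

```python
# Assumed prefix table; "ns6" is whatever prefix the serializer assigned to schema.org.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ns6": "http://schema.org/",
}


def expand(term):
    """Expand a prefixed name such as rdf:type into a full <URL>.

    Terms that are already full URLs (<...>) or literals ("...") pass through.
    """
    if term.startswith("<") or term.startswith('"'):
        return term
    prefix, _, local = term.partition(":")
    if prefix in PREFIXES:
        return "<" + PREFIXES[prefix] + local + ">"
    return term


short = (
    "<http://dbpedia.org/resource/Columbia_University>",
    "rdf:type",
    "ns6:CollegeOrUniversity",
)
full = tuple(expand(term) for term in short)
print(full)
```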

Page 15: Knowledge Representation. Computational Journalism week 8
Page 16: Knowledge Representation. Computational Journalism week 8
Page 17: Knowledge Representation. Computational Journalism week 8
Page 18: Knowledge Representation. Computational Journalism week 8

NYT ontology available as Linked Open Data (LOD)

owl:sameAs makes this interoperable

Page 19: Knowledge Representation. Computational Journalism week 8

NYT API can return linked data

{ "title": "Syria's Rebels Open Talks on Forging United Political Front",

"body": "BEIRUT, Lebanon — Syria’s fractious opposition groups began negotiations in Doha, Qatar, on Sunday to forge a more unified front to reshape the political landscape in a bloody conflict that claims more than 100 lives virtually every day. Given the scant prospects that any attempt to restructure the opposition will succeed — the",

"dbpedia_resource_url": [ "http://dbpedia.org/resource/Hillary_Rodham_Clinton", "http://dbpedia.org/resource/Bashar_al-Assad"],

"facet_terms": "CLINTON, HILLARY RODHAM ASSAD, BASHAR AL- SYRIA DOHA (QATAR) SYRIAN NATIONAL COUNCIL STATE DEPARTMENT WAR AND REVOLUTION DEFENSE AND MILITARY FORCES"}

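A response like the one on the previous slide can be read with a standard JSON parser. The sketch below uses a trimmed copy of that response (title and dbpedia_resource_url only) and a hypothetical helper, linked_entities, to turn the DBpedia URLs into readable names; it does not call the NYT API and assumes nothing about its other fields.

```python
import json
from urllib.parse import unquote

# Trimmed copy of the slide's example response; not fetched from any live API.
response_text = """
{
  "title": "Syria's Rebels Open Talks on Forging United Political Front",
  "dbpedia_resource_url": [
    "http://dbpedia.org/resource/Hillary_Rodham_Clinton",
    "http://dbpedia.org/resource/Bashar_al-Assad"
  ]
}
"""

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def linked_entities(article):
    """Return human-readable names for the DBpedia resources linked to an article."""
    names = []
    for url in article.get("dbpedia_resource_url", []):
        if url.startswith(DBPEDIA_PREFIX):
            local_name = unquote(url[len(DBPEDIA_PREFIX):])
            names.append(local_name.replace("_", " "))
    return names


article = json.loads(response_text)
entities = linked_entities(article)
print(entities)
```
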
Page 20: Knowledge Representation. Computational Journalism week 8

Objects and relations in text?

names, dates, places, verbs.

Page 21: Knowledge Representation. Computational Journalism week 8

Named Entity Recognition

Extract subjects, objects, from text. Also, resolve pronouns if possible.

"Gov. Andrew M. Cuomo on Wednesday gave a sea wall the nod. Because of the recent history of powerful storms hitting the area, he said, elected officials have a responsibility to consider new and innovative plans to prevent similar damage in the future."

Page 22: Knowledge Representation. Computational Journalism week 8

NER state of the art

•  Commercial: Google Knowledge Graph
•  Academic: Stanford NER library

Page 23: Knowledge Representation. Computational Journalism week 8

Next level of understanding: verbs

“The water that made rivers of Avenues C and D receded on Tuesday, and the East Village was a mixture of disaster and nonchalance. A group of young men in pajama pants and shorts threw a football on East 12th Street, while workers pumped the basement of CHP Hardware on Avenue C and Eighth Street.”

subject verb object

Page 24: Knowledge Representation. Computational Journalism week 8

Knowledge Representation in AI (a crazy brief introduction)

Classic "symbolic" paradigm represents knowledge as statements in mathematical logic. Many variations. Most are subsets or modifications of standard first order logic (FOL). Mathematical representation of human knowledge is a very old dream! (Greeks, Leibniz, GOFAI...)

Page 25: Knowledge Representation. Computational Journalism week 8

Leibniz, 1685

The only way to rectify our reasonings is to make them as tangible as those of the Mathematicians, so that we can find our error at a glance, and when there are disputes among persons, we can simply say: Let us calculate [calculemus], without further ado, to see who is right.

Page 26: Knowledge Representation. Computational Journalism week 8

Predicates and Relations

Predicate: asserts that object belongs to a class

vehicle(schoolbus)
bird(tweety)
straight_gangsta(emily_bell)

Relation: asserts relationship between objects

is_a(car, vehicle)
higher_rank(general, colonel)
capital(paris, france)
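
One simple way to hold such facts in a program, sketched here as an assumption rather than anything the slides prescribe, is a set of tuples: the first element names the predicate or relation and the rest are its arguments. The helper names holds and members_of are hypothetical.

```python
# Each fact is a tuple: (name, argument, ...). Unary facts are predicates,
# binary facts are relations. The facts are taken from the slide.
facts = {
    ("vehicle", "schoolbus"),
    ("bird", "tweety"),
    ("is_a", "car", "vehicle"),
    ("higher_rank", "general", "colonel"),
    ("capital", "paris", "france"),
}


def holds(name, *args):
    """True if the fact name(args...) has been asserted."""
    return (name, *args) in facts


def members_of(class_name):
    """All objects asserted to belong to a class via a unary predicate."""
    return sorted(f[1] for f in facts if len(f) == 2 and f[0] == class_name)


print(holds("bird", "tweety"))               # asserted fact
print(holds("capital", "france", "paris"))   # argument order matters
print(members_of("vehicle"))
```

Argument order carries meaning: capital(paris, france) is asserted, but capital(france, paris) is not.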

Page 27: Knowledge Representation. Computational Journalism week 8

Inference

General rules

a ∧ (a => b) => b
p ∨ !p

Domain specific inferences

is_a(car, vehicle)
can_move(vehicle) => can_move(car)
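
The domain-specific inference above, where a property of a class carries down to anything that is_a member of it, can be sketched as a small forward-chaining loop. The function name infer_properties is hypothetical, and the extra fact is_a(taxi, car) is an assumption added only to show that the rule chains over more than one step.

```python
def infer_properties(is_a, properties):
    """Propagate class properties down is_a links until nothing new is derived.

    is_a:       set of (child, parent) pairs, e.g. ("car", "vehicle")
    properties: set of (property, cls) pairs, e.g. ("can_move", "vehicle")
    Returns the set of all given and derived (property, cls) pairs.
    """
    known = set(properties)
    changed = True
    while changed:
        changed = False
        for child, parent in is_a:
            # Iterate over a snapshot so we can add to `known` inside the loop.
            for prop, cls in list(known):
                if cls == parent and (prop, child) not in known:
                    known.add((prop, child))
                    changed = True
    return known


is_a = {("car", "vehicle"), ("taxi", "car")}   # ("taxi", "car") is a made-up extra fact
properties = {("can_move", "vehicle")}
derived = infer_properties(is_a, properties)
print(sorted(derived))
```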

Page 28: Knowledge Representation. Computational Journalism week 8

News as relations between entities

“Alice attended the wedding”

attended(alice, wedding)

“IBM was founded in 1917.”

founded(IBM, 1917)

“Hurricane Sandy hit New York”

hit(hurricane_sandy, New_York)

Encode facts as relation(subject, object), also written (subject relation object)

Page 29: Knowledge Representation. Computational Journalism week 8

Things we could do with this

Question answering

“The granddaughter of which actor starred in E.T.?”
(?x acted-in “E.T.”)
(?y is-a actor)
(?x granddaughter-of ?y)

Inference

(bob brother-of alice)
(alice mother-of lucy)
=>
(bob uncle-of lucy)

Answer questions using inference

“How many executives of publicly-traded Canadian companies died in car crashes?”
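
Both ideas on this slide reduce to matching patterns with ?variables against stored triples. The sketch below is a minimal, unoptimized matcher, not a real triple-store query language; match, query and apply_uncle_rule are hypothetical names, and the small fact list is supplied by hand for illustration.

```python
def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # Bind the variable, or check it agrees with an earlier binding.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def query(patterns, triples, bindings=None):
    """Yield every set of variable bindings that satisfies all patterns."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = match(first, triple, bindings)
        if extended is not None:
            yield from query(rest, triples, extended)


def apply_uncle_rule(triples):
    """(?u brother-of ?m)(?m mother-of ?c) => (?u uncle-of ?c)"""
    rule = [("?u", "brother-of", "?m"), ("?m", "mother-of", "?c")]
    return [(b["?u"], "uncle-of", b["?c"]) for b in query(rule, triples)]


# Hand-entered facts for illustration.
movie_facts = [
    ("drew_barrymore", "acted-in", "E.T."),
    ("drew_barrymore", "granddaughter-of", "john_barrymore"),
    ("john_barrymore", "is-a", "actor"),
    ("henry_thomas", "acted-in", "E.T."),
]
question = [
    ("?x", "acted-in", "E.T."),
    ("?y", "is-a", "actor"),
    ("?x", "granddaughter-of", "?y"),
]
answers = [b["?y"] for b in query(question, movie_facts)]
family_facts = [("bob", "brother-of", "alice"), ("alice", "mother-of", "lucy")]
inferred = apply_uncle_rule(family_facts)
print(answers, inferred)
```

Answering a question "using inference" then means running rules like the uncle rule first, adding the derived triples to the store, and querying the enlarged set.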

Page 30: Knowledge Representation. Computational Journalism week 8

Problems

Not all subjects are simple.

“Over a hundred guests attended the wedding”
attended(num_guests, wedding)
greater_than(num_guests, 100)

Some relations have multiple parts.

“Hurricane Sandy hit New York on Monday”
hit(sandy, New_York, monday)
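
One common workaround for relations with more than two parts, which the slides do not cover, is to introduce a node for the event itself and hang each part off it as an ordinary binary triple. The sketch below assumes that approach; reify and the role names agent, target and time are hypothetical choices, not a standard vocabulary.

```python
def reify(event_id, relation, **roles):
    """Turn an n-ary relation into binary triples about an event node.

    hit(sandy, New_York, monday) becomes one triple naming the event's type
    plus one triple per role.
    """
    triples = [(event_id, "type", relation)]
    triples += [(event_id, role, value) for role, value in roles.items()]
    return triples


event_triples = reify("event1", "hit", agent="sandy", target="New_York", time="monday")
for triple in event_triples:
    print(triple)
```

The cost is that a single sentence now produces several triples, and queries have to go through the event node to connect subject, object and date.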

Page 31: Knowledge Representation. Computational Journalism week 8

Standard inference doesn’t allow defaults

“All birds fly”

bird(tweety)
bird(?x) => flies(?x)
=> flies(tweety)

But, “penguins and chickens don’t fly”

bird(?x) & !penguin(?x) & !chicken(?x) => flies(?x)

Now we can’t guess that tweety flies

bird(tweety) => flies(tweety)? We don’t know!
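
The problem can be made concrete with a three-valued check: True, False, or None for "unknown". In the sketch below, the revised rule fires only when every negated premise is known to be true. With just bird(tweety) asserted it cannot fire. Treating missing facts as false (negation as failure, a closed-world assumption) restores the default; that is one workaround, named here as an assumption and not something the slide proposes. The function names are hypothetical.

```python
def is_not(pred, name, facts, negated, closed_world):
    """Truth value of !pred(name): True, False, or None if undecidable."""
    if (pred, name) in facts:
        return False
    if (pred, name) in negated or closed_world:
        # Either explicitly asserted false, or assumed false because absent.
        return True
    return None


def flies(name, facts, negated=frozenset(), closed_world=False):
    """Apply bird(?x) & !penguin(?x) & !chicken(?x) => flies(?x).

    Returns True when the rule fires and None when nothing can be concluded.
    """
    if ("bird", name) not in facts:
        return None
    checks = [
        is_not(pred, name, facts, negated, closed_world)
        for pred in ("penguin", "chicken")
    ]
    if all(check is True for check in checks):
        return True
    return None  # a premise is false or unknown, so the rule does not fire


facts = {("bird", "tweety")}
print(flies("tweety", facts))                      # open world
print(flies("tweety", facts, closed_world=True))   # negation as failure
```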

Page 32: Knowledge Representation. Computational Journalism week 8

Standard mathematical logic doesn’t deal well with exceptions

Some people don’t have a last name.

Sometimes an election isn’t decided on election day.

Is a trash can used as a flower pot still a trash can?

Is a broken car still a vehicle if it can't move?

Page 33: Knowledge Representation. Computational Journalism week 8

Relations from sentence parsing

“The water that made rivers of Avenues C and D receded on Tuesday, and the East Village was a mixture of disaster and nonchalance. A group of young men in pajama pants and shorts threw a football on East 12th Street, while workers pumped the basement of CHP Hardware on Avenue C and Eighth Street.”

subject verb object

Page 34: Knowledge Representation. Computational Journalism week 8

Relation extraction systems

•  Commercial: IBM's DeepQA (Watson)
•  Academic: Open IE project

Page 35: Knowledge Representation. Computational Journalism week 8

Ontology explosions

(water made rivers of Avenues C and D)
(East Village was a mixture of disaster and nonchalance)
(group of young men in pajama pants and shorts threw football)
(workers pumped the basement of CHP Hardware)

Do we have all of these in the ontology?

Page 36: Knowledge Representation. Computational Journalism week 8

“General Question Answering”

Precision/recall tradeoff. State of the art is IBM’s DeepQA
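
For reference, precision and recall can be computed as below. This is the generic set-based definition, offered as a sketch; it is not DeepQA's evaluation code, and the answer labels in the example are made up.

```python
def precision_recall(returned, correct):
    """Precision: fraction of returned answers that are correct.
    Recall: fraction of correct answers that were returned."""
    returned, correct = set(returned), set(correct)
    hits = len(returned & correct)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Made-up example: the system returns four answers, two of them right,
# out of three right answers in total.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(precision, recall)
```

The tradeoff arises because a system that answers only when it is very sure tends to gain precision and lose recall, while one that answers freely does the opposite.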

Page 37: Knowledge Representation. Computational Journalism week 8

DeepQA use of structured data

“Watson can also use detected relations to query a triple store and directly generate candidate answers. Due to the breadth of relations in the Jeopardy domain and the variety of ways in which they are expressed, however, Watson’s current ability to effectively use curated databases to simply “look up” the answers is limited to fewer than 2 percent of the clues.”

- Ferrucci et al., “Building Watson”

Page 38: Knowledge Representation. Computational Journalism week 8

Wall Street is high on Molson Coors Brewing (TAP), expecting it to report earnings that are up 17.5% from a year ago when it reports its third quarter earnings on Wednesday, November 7, 2012. The consensus estimate is $1.34 per share, up from earnings of $1.14 per share a year ago. The consensus estimate has dipped over the past month, from $1.35, but it’s still up from the consensus estimate of $1.19 three months ago. For the fiscal year, analysts are expecting earnings of $3.89 per share. Revenue is projected to eclipse the year-earlier total of $954.4 million by 31%, finishing at $1.25 billion for the quarter. For the year, revenue is projected to roll in at $4.04 billion. The company’s net income has declined in the last two quarters. The company posted profit falling by 52.8% in the second quarter. This is after it reported a profit decline in the first quarter by 4.1%.

Automatic story generation, by Narrative Science
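
A story like the one above starts from structured numbers. The sketch below shows the simplest possible version of the idea, a fill-in-the-blanks template plus a percent-change calculation, using figures quoted in the story. It is an illustration only and makes no claim about how Narrative Science's system actually works; earnings_preview and its parameters are hypothetical.

```python
def pct_change(new, old):
    """Percent change from old to new."""
    return (new - old) / old * 100.0


def earnings_preview(company, ticker, eps_estimate, eps_year_ago,
                     revenue_estimate_m, revenue_year_ago_m):
    """Fill a fixed sentence template from structured earnings data.

    Revenue figures are in millions of dollars.
    """
    eps_delta = pct_change(eps_estimate, eps_year_ago)
    rev_delta = pct_change(revenue_estimate_m, revenue_year_ago_m)
    direction = "up" if eps_delta >= 0 else "down"
    return (
        f"Analysts expect {company} ({ticker}) to report earnings of "
        f"${eps_estimate:.2f} per share, {direction} {abs(eps_delta):.1f}% from "
        f"${eps_year_ago:.2f} a year ago. Revenue is projected at "
        f"${revenue_estimate_m / 1000:.2f} billion, a change of {rev_delta:+.0f}% "
        f"from ${revenue_year_ago_m:.1f} million."
    )


# Figures taken from the example story: $1.34 vs $1.14 per share,
# $1.25 billion vs $954.4 million in revenue.
text = earnings_preview("Molson Coors Brewing", "TAP", 1.34, 1.14, 1250.0, 954.4)
print(text)
```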