Exploring our world with Freebase
Generic Databases
Where does the data come from?
Copyright 2009 CC-BY by Richard Heaven
Robot Image Copyright 2007 CC-BY by Crispin Summers
Google Knowledge Graph
The Wikipedia Data Ecosystem
API RDF Dereferencing
Quad Dump
Simple Topic Dump
Type Tables
MQL
{
  "status": "200 OK",
  "code": "/api/status/ok",
  "result": {
    "type": "/music/artist",
    "name": "The Police",
    "album": [
      "Outlandos d'Amour",
      "Reggatta de Blanc",
      "Zenyatta Mondatta",
      "Ghost in the Machine",
      "Synchronicity"
    ]
  }
}
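An MQL response envelope like the one above can be unpacked with any JSON parser. The following is a minimal sketch in Python; the function and variable names are invented for illustration, and the payload is pasted in as a string rather than fetched from a live service.

```python
import json

# The MQL response envelope from the slide, pasted in as a string.
response_text = """
{ "status": "200 OK", "code": "/api/status/ok",
  "result": { "type": "/music/artist", "name": "The Police",
              "album": [ "Outlandos d'Amour", "Reggatta de Blanc",
                         "Zenyatta Mondatta", "Ghost in the Machine",
                         "Synchronicity" ] } }
"""

def albums_from_response(text):
    """Return (artist name, album list) from an MQL response envelope."""
    envelope = json.loads(text)
    # The envelope carries its own status code; stop if the query failed.
    if envelope.get("code") != "/api/status/ok":
        raise ValueError("query failed: " + str(envelope.get("status")))
    result = envelope["result"]
    return result["name"], result["album"]

name, albums = albums_from_response(response_text)
print(name, "has", len(albums), "albums")
```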
My path to the semantic web
Infovore 1
Quad Dump Simple Topic Dump
:BaseKB Pro :BaseKB Lite
Spring 2012
Fall 2012
Quad Dump
Official RDF Dump
Infovore 1.0 released as open source under Apache License
13+ million Invalid Facts
Image cc-by from arj03
Infovore 1.0
Quad Dump -> RDF
Infovore 1.1
General RDF Cleanup & Filtering
Millipede framework – Map/Reduce on a single computer
Infovore 2
What does Freebase cover?
Is it a bibliographic database?
Ahead of their time?
Reading Room, Library of Congress
MARC… in electronic form since 1969!
First standard data format with variable length fields & I18N.
Now everybody has a bibliographic database…
Or, do documents annotate the world?
Social Semantic Systems
Linked Data User-Generated Content
The dominant paradigm
Triple store
How to break your triple store
http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/
The RDF data warehouse
ETL
warehouse
operations
development
science
The RDF data warehouse II
warehouse
Operations tools
Science Tools
Latency: low is not low enough
operations
development
science
Tools Popular With :BaseKB Users
[Bar chart, horizontal axis 0 to 60. Tools listed: Freebase; DBpedia; any relational database; machine learning; Jena; Amazon Web Services; PHP; map/reduce frameworks (ex. Hadoop); MongoDB; Sesame; Virtuoso OpenLink; other NoSQL database; Solid State Drives (SSD); other cloud computing service; Neo4J; Ruby; Drupal; alternative JVM languages (ex. Scala or Clojure); other triple store; any key/value store (ex. JDBM or Berkeley DB); OWLIM; Allegrograph; 4store; Factual; dotNetRDF; Stardog; Kasabi/Talis Platform; Oracle Spatial RDF]
Map/Reduce
Inputs
Mappers
Shuffle
Sort
Reducers
Output
RDF: Reduction on Subject
:Goat :Bear :Alligator :Iguana :Dog :Elephant :Cat :Horse :Fox
:Alligator :Dog :Goat
:Bear :Elephant :Horse
:Cat :Fox :Iguana
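The idea of reducing on subject can be sketched on a single machine: every triple is routed to a reducer chosen from its subject, and each reducer sees its subjects in sorted order with all of their facts together. This is only an illustration, not Hadoop's or Infovore's code; the CRC32 partitioner, the function names, and the sample predicates and objects are assumptions made for the example, so the assignment of animals to reducers will not match the slide.

```python
import zlib
from collections import defaultdict

def reducer_for(subject, num_reducers):
    """Choose a reducer with a stable hash so a subject always lands in the same place."""
    return zlib.crc32(subject.encode("utf-8")) % num_reducers

def shuffle_and_sort(triples, num_reducers):
    """Return one list per reducer of (subject, [(predicate, object), ...]), sorted by subject."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for s, p, o in triples:
        partitions[reducer_for(s, num_reducers)][s].append((p, o))
    return [sorted(partition.items()) for partition in partitions]

def count_facts(subject, facts):
    """A toy reducer: it sees every fact about one subject at once."""
    return subject, len(facts)

# Illustrative input: the predicates and objects are invented for the example.
triples = [
    (":Goat", ":eats", ":Grass"),
    (":Bear", ":eats", ":Fish"),
    (":Alligator", ":livesIn", ":Swamp"),
    (":Goat", ":livesIn", ":Farm"),
    (":Bear", ":livesIn", ":Forest"),
]

reduced = [
    [count_facts(subject, facts) for subject, facts in partition]
    for partition in shuffle_and_sort(triples, num_reducers=3)
]
print(reduced)
```

Because the reducer is chosen from the subject alone, no subject is ever split across reducers, which is what makes per-subject processing safe to run in parallel.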
Jena Framework
SDB
Relational db-based Triple store
TDB
Native disk-based triple store
Model
In-memory triple store
“We use Jena Models like PHP programmers use hashtables”
-- Kendall Clark, Clark and Parsia
Hadoop Physical Architecture
Namenode
Jobtracker
Datanodes & Tasktrackers
HDFS
My development cluster – Namenode/JobTracker
Hadoop tolerates hardware failures
My other computer is
Amazon Elastic Map/Reduce
Amazon S3 (Permanent Storage)
“It’s harder to make up names for things than to invent them”
- Tom Swift, fictional American inventor
Infovore modules
bakemono
haruhi
centipede
chopper
Bakemono Super JAR
Contains applications like
freebaseRDFPrefilter pse3 ranSample sieve3
Named after Japanese word for “monsters”
“Haruhi”
(1) Japanese religious word for “Full of Spirit” ; (2) a very dominant person
Unpacking the Freebase RDF Dump
photograph Copyright 2010 Ian Munroe CC-BY SA
Eliminate Bulk Up Front
BIG DATA
DATA
Inputs
Mappers
freebaseRDFPrefilter removes…
Wasteful Facts
• 120M+ copies of the “a” predicate
• 60M+ access control predicates
Violent and Dangerous facts
ns:common.topic ns:type.type.instance ?o .
is repeated 30M times, and if you group on ?s and keep each group in memory…
… uneven bin distribution …
[Figure: bins 330, 331, 332, 333, 334, 335, …]
Prefiltering stops memory exhaustion before it happens!
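A prefilter of this kind can be a pure streaming step: it looks at one line at a time and never groups, so it cannot run out of memory itself. The sketch below is not freebaseRDFPrefilter's actual code; the function name is hypothetical, the only dropped predicate is the one named on the slide, and the sample lines are illustrative.

```python
def prefilter(lines, dropped_predicates):
    """Yield only the lines whose predicate is not in dropped_predicates."""
    for line in lines:
        # Split off subject and predicate; the rest of the statement is left alone.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] in dropped_predicates:
            continue
        yield line

# The slide names ns:type.type.instance as a dangerous predicate;
# the full list used by the real tool is not given there.
DROPPED = {"ns:type.type.instance"}

# Illustrative sample lines in the dump's prefixed style.
sample = [
    "ns:common.topic ns:type.type.instance ns:m.02qvftw .",
    "ns:m.02qvftw rdf:type ns:business.employer .",
]

kept = list(prefilter(sample, DROPPED))
print(kept)
```

Run in the mappers, a filter like this removes the skewed keys before the shuffle, so no single reducer has to hold tens of millions of values for one subject.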
Parallel Super Eyeball
“triples”
valid triples
junk
Currently, 250,000 or so triples in Freebase are rejected by PSE3
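The accept/reject split can be illustrated with a line-by-line check that sends each candidate triple to one of two outputs. PSE3's real validation rules are not shown in the slides; the regular expression below is a deliberately simple stand-in that accepts only a subset of N-Triples (IRIs in angle brackets, plain or tagged string literals), and all names in it are assumptions.

```python
import re

# A loose approximation of N-Triples terms; a real RDF parser is far stricter.
IRI = r'<[^<>"{}|^`\\\s]*>'
LITERAL = r'"(?:[^"\\\n]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'
TRIPLE = re.compile(
    r'^\s*' + IRI + r'\s+' + IRI + r'\s+(?:' + IRI + '|' + LITERAL + r')\s*\.\s*$'
)

def split_valid_and_junk(lines):
    """Return (valid, junk): lines that look like one well-formed triple, and the rest."""
    valid, junk = [], []
    for line in lines:
        (valid if TRIPLE.match(line) else junk).append(line)
    return valid, junk

NS = "http://rdf.basekb.com/ns/"
candidates = [
    "<" + NS + "m.02qvftw> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <" + NS + "business.employer> .",
    "<" + NS + "american_football.football_division> <http://www.w3.org/2000/01/rdf-schema#label> \"American football division\"@en .",
    # Unterminated literal: should be rejected.
    "<" + NS + "american_football.football_division> <http://www.w3.org/2000/01/rdf-schema#label> \"American football division .",
    # Whitespace inside an IRI: should be rejected.
    "<" + NS + "bad subject> <" + NS + "p> <" + NS + "o> .",
]

valid, junk = split_valid_and_junk(candidates)
print(len(valid), "valid,", len(junk), "junk")
```

Keeping the junk in its own output, rather than discarding it, matches the idea of collecting rejected facts for later inspection.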
Parallel Super Eyeball 3
Sieve3
literal facts (ex. ?s ?p 55. )
?s :a ?p .
?s ?p ns:some_topic .
?s rdfs:label ?o .
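Patterns like these can be read as an ordered list of rules, where each triple goes to the first rule it matches. The sketch below shows that routing idea only; it is not Sieve3's implementation, and the slice names "literals" and "other", the helper names, and the sample predicate ns:example.population are assumptions made for the example.

```python
from collections import defaultdict

def is_literal(term):
    """Quoted strings and bare numbers count as literals in this sketch."""
    if term.startswith('"'):
        return True
    try:
        float(term)
        return True
    except ValueError:
        return False

# Ordered rules: the first match wins, so labels are tested before generic literals.
RULES = [
    ("a", lambda s, p, o: p in ("a", "rdf:type")),
    ("label", lambda s, p, o: p == "rdfs:label"),
    ("literals", lambda s, p, o: is_literal(o)),
    ("links", lambda s, p, o: o.startswith("ns:")),
]

def sieve(triples):
    """Route each triple to the slice of the first matching rule, else to 'other'."""
    slices = defaultdict(list)
    for triple in triples:
        for name, matches in RULES:
            if matches(*triple):
                slices[name].append(triple)
                break
        else:
            slices["other"].append(triple)
    return slices

triples = [
    ("ns:m.02qvftw", "rdf:type", "ns:business.employer"),
    ("ns:american_football.football_division", "rdfs:label", '"American football division"@en'),
    ("ns:m.010bs8", "ns:example.population", "55"),  # hypothetical predicate
    ("ns:m.01v7dp", ":linksTo", "ns:m.01jgnh"),
    ("ns:m.01v7dp", "owl:sameAs", "dbpedia:Striated_Heron"),
]

slices = sieve(triples)
print({name: len(items) for name, items in slices.items()})
```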
Horizontal Decomposition of Freebase
slice         percentage of facts   percentage of uncompressed size   percentage of gz compressed size
a             16%                   15%                               5%
description   1%                    7%                                18%
key           9%                    8%                                11%
keyNs         11%                   9%                                13%
label         6%                    4%                                6%
name          6%                    4%                                6%
notability    2%                    2%                                1%
nfp           2%                    1%                                0%
text          1%                    3%                                8%
web           5%                    6%                                6%
links         32%                   30%                               20%
other         11%                   11%                               7%
rdf:type aka “a”
16% of facts, 15% of bytes, 5% of compressed bytes
ns:m.02qvftw rdf:type ns:business.employer .
RDFS Inference
:a :Actor ?
Jesse Plemons
Todd
:a :Actor .
Jesse Plemons
Todd
implies
Descriptions
1%
facts
18%
bytes
7%
compressed
Descriptions
ns:m.010bfy ns:common.topic.description "Riverside \u00E9 uma cidade localizada no estado norte-americano de Texas, no Condado de Walker."@pt .
ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .
This does not compute!
Labels and Names
ns:american_football.football_division rdfs:label "American football division"@en .
ns:american_football.football_conference rdfs:label "Grupper inom amerikansk fotboll"@sv .
ns:american_football.football_player ns:type.object.name "Football-Spieler"@de .
ns:american_football.football_team ns:type.object.name "American football-team"@nl .
Freebase Labels Are Not Unique
DBpedia Labels Are Unique
https://github.com/paulhoule/infovore/wiki
https://groups.google.com/forum/#!forum/infovore-basekb
Keys in the Freebase dump
• Most objects represented by mid identifiers
• Schema objects have friendly identifiers
Examples…
ns:m.010bs8 ns:common.topic.description "El Campo is a city in Wharton County, Texas, United States. The population was 10,945 at the 2000 census, making it the largest city in Wharton County."@en .
ns:american_football.football_division rdfs:label "American football division"@en .
Freebase always uses the same key in the ?s, ?p, and ?o fields, but...
It wasn’t always this way
… the old quad dump used mids in the subject field, but others in the destination field …
Turtle0
Turtle1
Turtle2
Turtle3
Extract namespace graph
Convert all identifiers to mids
Extract type information from schema
Convert to RDF types
:BaseKB 2012
Freebase Knows Many Keys
ns:g.11vk55hmr ns:type.object.key "/base/dspl/us_census/population/place" .
ns:m.010004m ns:type.object.key "/authority/musicbrainz/339a2897-9ba4-4820-a2a8-f234c22608a4" .
ns:Lm.01003_ ns:type.object.key "/wikipedia/de/Krum_$0028Texas$0029" .
ns:m.01010d ns:type.object.key "/wikipedia/en_id/135860" .
ns:m.0100_b ns:type.object.key "/authority/gnis/1352653" .
ns:m.0100l2 ns:type.object.key "/authority/hud/countyplace/4814101390" .
ns:m.01031l ns:type.object.key "/en/chandler_texas" .
ns:m.015g9m ns:type.object.key "/en/aliens_from_space" .
ns:m.015gdl ns:type.object.key "/en/self-publishing" .
ns:m.015gjr ns:type.object.key "/authority/nndb/231$002F000085973" .
… and type.object.key spells them out …
A directed acyclic graph
/m/01 root
/m/019s wikipedia
/m/047w32v authority
/m/0gt9 en
/m/05x_rjr Geoff_Simmons
/wikipedia/en/Geoff_Simmons = /authority/wikipedia/en/Geoff_Simmons
key: namespace encodes the graph
ns:m.010005 key:wikipedia.pt "Corinth_$0028Texas$0029" .
ns:m.010005h key:authority.musicbrainz "ab0b82ce-d1be-4641-b0d1-838896a25887" .
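The key values above carry $XXXX escapes, where XXXX is a four-digit hexadecimal code point: $0028 and $0029 are the parentheses and $002F is a slash. A small sketch of decoding them, and of turning a key: predicate back into the slash-separated path that type.object.key spells out, could look like this. The function names are invented, and the predicate-to-path mapping is an assumption read off the examples in the slides rather than a documented rule.

```python
import re

_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")

def decode_key(key):
    """Replace each $XXXX escape with the character at that hex code point."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)

def key_predicate_to_path(predicate, value):
    """Turn key:wikipedia.pt plus a value into a /wikipedia/pt/... style path."""
    prefix, _, namespace = predicate.partition(":")
    if prefix != "key" or not namespace:
        raise ValueError("not a key: predicate: " + predicate)
    return "/" + namespace.replace(".", "/") + "/" + value

print(decode_key("Krum_$0028Texas$0029"))
print(key_predicate_to_path("key:wikipedia.pt", "Corinth_$0028Texas$0029"))
```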
Useful external keys
Music
http://www.freebase.com/authority/musicbrainz/e217a1e9-9ec8-4e88-aebc-7d6b720384c1
Musical Composition
…
Recording
“Recording appears on Album as track #”
Functional Requirements For Bibliographic Records (FRBR)
Nick Hexum Rap Rock
311
Omaha, NE Los Angeles, CA
Unique data in DBpedia
Wikipedia Categories
Wikipedia Page Links
“Smushing”
dbpedia:Striated_Heron :linksTo dbpedia:Heron .
dbpedia:Striated_Heron owl:sameAs ns:m.01v7dp .
dbpedia:Heron owl:sameAs ns:m.01jgnh .
ns:m.01v7dp :linksTo ns:m.01jgnh .
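Smushing amounts to rewriting every identifier through the owl:sameAs mapping and dropping the mapping statements themselves. The sketch below assumes each DBpedia resource has at most one owl:sameAs target and that the mapping fits in memory; the function name is made up, and at dump scale this would be a join rather than a dictionary lookup.

```python
def smush(triples, same_as="owl:sameAs"):
    """Rewrite subjects and objects through the sameAs mapping and drop the mapping triples."""
    mapping = {s: o for s, p, o in triples if p == same_as}
    return [
        (mapping.get(s, s), p, mapping.get(o, o))
        for s, p, o in triples
        if p != same_as
    ]

triples = [
    ("dbpedia:Striated_Heron", ":linksTo", "dbpedia:Heron"),
    ("dbpedia:Striated_Heron", "owl:sameAs", "ns:m.01v7dp"),
    ("dbpedia:Heron", "owl:sameAs", "ns:m.01jgnh"),
]

smushed = smush(triples)
print(smushed)
```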
Duck Types
• ?a performed on music track ?b: ?a is a musician
• ?a employed ?b: ?a is an employer
• Book ?a was written about ?b: ?b is a book subject
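Rules of this shape say that using a property implies a type for the subject or for the object, much like rdfs:domain and rdfs:range. A minimal sketch of applying them follows; the predicate and type names (:performed_on_track, :Musician and so on) are hypothetical stand-ins, not Freebase identifiers.

```python
# Each rule: predicate -> (which end of the triple gets typed, the inferred type).
# All names here are invented for illustration.
DUCK_RULES = {
    ":performed_on_track": ("subject", ":Musician"),
    ":employed": ("subject", ":Employer"),
    ":written_about": ("object", ":BookSubject"),
}

def infer_types(triples):
    """Return the set of (node, 'a', type) facts implied by the duck-typing rules."""
    inferred = set()
    for s, p, o in triples:
        rule = DUCK_RULES.get(p)
        if rule is None:
            continue
        position, rdf_type = rule
        node = s if position == "subject" else o
        inferred.add((node, "a", rdf_type))
    return inferred

triples = [
    (":artist1", ":performed_on_track", ":track1"),
    (":company1", ":employed", ":person1"),
    (":book1", ":written_about", ":topic1"),
    (":book1", ":has_title", '"Some title"'),
]

inferred = infer_types(triples)
print(sorted(inferred))
```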
The Problem of Notability
ns:m.0100007 ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000_r ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000dh ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.01000pp ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.01000px ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.01000w ns:common.topic.notable_types ns:m.01m9 .
ns:m.01000yk ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.010012t ns:common.topic.notable_types ns:m.0kpv11 .
ns:m.010014_ ns:common.topic.notable_types ns:m.09jd9nh .
ns:m.010019c ns:common.topic.notable_types ns:m.09jd9nh .
Analysis with Chopper and Pig
Why APIs suck (including SPARQL endpoints)
• Provider can afford maximum $/query
• If you need a more complex query you’ve got no option!
:BaseKB Now
YOU
AWS S3
Cluster creation made easy
Pig Script – count common types
$ pig
grunt> run chopper/src/main/pig/lib/chopper.pig
grunt> a = LOAD '/freebase/20130915/a/' USING com.ontology2.chopper.io.PrimitiveTripleInput();
grunt> oNodes = FOREACH a GENERATE o;
grunt> groupNodes = GROUP oNodes BY o;
grunt> countedNodes = FOREACH groupNodes GENERATE group AS uri:chararray, COUNT(oNodes) AS cnt:long;
grunt> sortedNodes = ORDER countedNodes BY cnt DESC;
grunt> top100 = LIMIT sortedNodes 100;
grunt> DUMP top100;
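The Pig script is a group, count, and sort over the object of each rdf:type fact. For readers who do not use Pig, the same logic on a small in-memory sample can be sketched in Python; this is only an analogy for the data flow, the function name is invented, and the sample subjects are placeholders.

```python
from collections import Counter

def count_types(type_triples, limit=100):
    """Count how often each object appears and return the most frequent first."""
    counts = Counter(o for _s, _p, o in type_triples)
    return counts.most_common(limit)

# Placeholder subjects; only the type URIs come from the slides.
NS = "http://rdf.basekb.com/ns/"
sample = [
    (":s1", "a", "<" + NS + "common.topic>"),
    (":s1", "a", "<" + NS + "people.person>"),
    (":s2", "a", "<" + NS + "common.topic>"),
    (":s3", "a", "<" + NS + "common.topic>"),
    (":s3", "a", "<" + NS + "music.recording>"),
    (":s2", "a", "<" + NS + "people.person>"),
]

for uri, cnt in count_types(sample):
    print(uri, cnt)
```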
Most frequent types
(<http://rdf.basekb.com/ns/common.topic>,39030195)
(<http://rdf.basekb.com/ns/common.notable_for>,18747254)
(<http://rdf.basekb.com/ns/music.release_track>,13304261)
(<http://rdf.basekb.com/ns/music.recording>,8902041)
(<http://rdf.basekb.com/ns/music.single>,6297869)
(<http://rdf.basekb.com/ns/common.document>,5580077)
(<http://rdf.basekb.com/ns/media_common.cataloged_instance>,3030634)
(<http://rdf.basekb.com/ns/book.book_edition>,2771323)
(<http://rdf.basekb.com/ns/people.person>,2742157)
(<http://rdf.basekb.com/ns/type.namespace>,2689781)
(<http://rdf.basekb.com/ns/book.isbn>,2601099)
(<http://rdf.basekb.com/ns/type.content>,2499648)
(<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)
Compound Value Types and our 4D world
The 13th most prevalent type
(<http://rdf.basekb.com/ns/measurement_unit.dated_integer>,2466557)
:Las_Vegas
945
1910
:US_Census_Bureau
population
number
date
source
Population Year
25 1900
945 1910
2,304 1920
5,165 1930
8,422 1940
24,624 1950
64,405 1960
125,787 1970
164,674 1980
260,561 1990
284,931 1991
297,326 1992
312,634 1993
336,380 1994
354,559 1995
372,849 1996
391,074 1997
405,245 1998
418,658 1999
484,487 2000
498,638 2001
507,219 2002
516,723 2003
534,168 2004
544,806 2005
552,855 2006
559,892 2007
562,849 2008
567,641 2009
584,539 2010
589,317 2011
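A compound value type puts an intermediate node between the place and each measurement, so that the number, its date, and its source stay together. The sketch below builds such nodes for a few rows of the table, using the edge labels from the diagram (population, number, date, source). The blank node labels and function names are assumptions, and a source is attached only to the 1910 figure because that is the only one the diagram shows.

```python
def population_triples(place, observations):
    """Emit one intermediate node per observation, carrying number, date and optional source."""
    triples = []
    for index, (number, year, source) in enumerate(observations):
        node = "_:population%d" % index  # hypothetical blank node label for the compound value
        triples.append((place, ":population", node))
        triples.append((node, ":number", number))
        triples.append((node, ":date", year))
        if source is not None:
            triples.append((node, ":source", source))
    return triples

def population_in(triples, place, year):
    """Look up the number attached to the observation dated `year`, or None."""
    nodes = {o for s, p, o in triples if s == place and p == ":population"}
    dated = {s for s, p, o in triples if s in nodes and p == ":date" and o == year}
    for s, p, o in triples:
        if s in dated and p == ":number":
            return o
    return None

observations = [
    (25, 1900, None),
    (945, 1910, ":US_Census_Bureau"),
    (2304, 1920, None),
]

triples = population_triples(":Las_Vegas", observations)
print(population_in(triples, ":Las_Vegas", 1910))
```

Without the intermediate node, the number 945 and the date 1910 would both hang directly off :Las_Vegas and there would be no way to tell which date belongs to which figure.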
[Chart: Population of Las Vegas, NV, 1900 to 2011; vertical axis 0 to 700,000]
Vertical Divisions of Freebase
Wikipedia Topics
Movies and Television
Travel and Lodging
:BaseKB Lite
Separating Blank Nodes
:BaseKB Now
• Created Weekly by automated process
• Delivered to AMZN S3
• Accepted facts are 100% Valid RDF
• Rejected facts collected for inspection
• “Violent” predicates removed to fight skew
• Horizontally divided for fast processing
http://basekb.com/
Infovore Software
http://github.com/paulhoule/infovore/wiki