Big Data With Knowledge Graph


Upload: eossoftware

Post on 03-Jul-2015


Page 1: Big data with knowledge graph

Big Data With Knowledge Graph

A Real Experience

Page 2: Big data with knowledge graph

Agenda

● My Case Study
● Big Data Fact
● The Challenges
● The Solution
● Knowledge Graph
● RDF Triple
● Freebase
● Ranker
● Our Custom Knowledge Graph - How Did We Build It?
● Conclusion

Page 3: Big data with knowledge graph

My Case Study

This case study is about-

● Our graph processing engine, which uses one of the largest knowledge graphs available as a source and creates multiple knowledge graphs specific to the application.

● This graph processing engine traverses more than 700 million triples.

Page 4: Big data with knowledge graph

Big Data Fact

In software engineering and computer science, the term "big data" describes data sets that grow so large that they become awkward to work with using on-hand database management tools - Wiki

Read on for an exciting tour of big data, knowledge graph, the challenges we faced & how we came up with a solution.

Page 5: Big data with knowledge graph

The Challenges

● RDF is not a mature data structure compared to other data structures/sets, which have mature ecosystems built around them.

● Freebase has more than 760 million triples in its knowledge graph. What would be the data store for such a huge knowledge graph?

● Finding an optimum way to store this knowledge graph locally in a data store.

● Transforming this huge knowledge graph into the Ranker knowledge graph.

Page 6: Big data with knowledge graph

The Solution

Highlights

● Our platform has proven to scale to the biggest knowledge graph available.

● Our graph processing engine deals with 760 million triples from Freebase.

● We did it even before Google used it.

● Really, the next big thing in big data is large-scale processing of a knowledge graph from your application's perspective!

Page 7: Big data with knowledge graph

Knowledge Graph

● Freebase data is organised and stored as a graph instead of tables & keys, as in an RDBMS.

● The dataset is organised into nodes. Each node connects to several other nodes via predicates, representing related data in a simple and realistic way.

● The nodes are grouped together using topics & types. The data is interconnected, so it is very easy to traverse if we know the right predicates.
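The node-and-predicate idea above can be sketched with a few lines of code. This is a minimal illustration, not Freebase's actual storage: facts live as (subject, predicate, object) tuples, and traversal is just pattern matching on predicates. All names are made up for the example.

```python
# Facts stored as (subject, predicate, object) triples.
triples = {
    ("The Godfather", "directed_by", "Francis Ford Coppola"),
    ("The Godfather", "written_by", "Mario Puzo"),
    ("Al Pacino", "acted_in", "The Godfather"),
    ("Al Pacino", "nationality", "United States"),
}

def objects_of(subject, predicate):
    """Follow a predicate from a node to its neighbouring nodes."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def subjects_of(predicate, obj):
    """Traverse an edge in the reverse direction."""
    return {s for s, p, o in triples if p == predicate and o == obj}

# Who acted in The Godfather?
print(subjects_of("acted_in", "The Godfather"))
```

Knowing the right predicate ("acted_in") is all that is needed to hop from one node to its neighbours, in either direction.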

Page 8: Big data with knowledge graph

Knowledge graph & Conventional Data- How Different Are They?

In an RDBMS database-

● The data is organized into tables

● They are connected via foreign keys.

● Once a table is designed, the relationships are fixed. The number of tables needed depends on the predicates.

● We cannot define new predicates at runtime; we have to create the table definition first and then save the data.

Page 9: Big data with knowledge graph

RDF triple

An RDF triple consists of three parts-

● A subject
● A predicate
● An object

A subject is related to an object via a predicate. Each triple is a complete assertive statement that makes sense on its own.

Examples of RDF triple:

Francis Ford Coppola | Directed | The Godfather
Al Pacino | Acted in | The Godfather
The Godfather | Written by | Mario Puzo
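In serialized RDF (for example the N-Triples format mentioned later), each part of a triple becomes a URI. The sketch below renders plain tuples as N-Triples lines; the `http://example.org/` base URI is invented for illustration and is not a real Freebase namespace.

```python
BASE = "http://example.org/"

def to_ntriples(s, p, o):
    """Render one triple as an N-Triples statement: <s> <p> <o> ."""
    uri = lambda name: "<%s%s>" % (BASE, name.replace(" ", "_"))
    return "%s %s %s ." % (uri(s), uri(p), uri(o))

print(to_ntriples("Francis Ford Coppola", "directed", "The Godfather"))
```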

I recommend the video below to get a brief idea of knowledge graphs: Google's Knowledge Graph

Page 10: Big data with knowledge graph

Freebase Facts

● It is an online knowledge database.

● Its data comes mainly from community members and from Wikipedia, ChefMoz, NNDB, and MusicBrainz.

● It was made public in 2007 by Metaweb, which was acquired by Google in 2010.

"Freebase is an open shared database of the world's knowledge." - this is how Metaweb described Freebase.

Page 11: Big data with knowledge graph

Ranker Facts

● Ranker is a social web platform designed for collaborative and individual list making & voting.

● Ranker launched in August 2009 and has since grown to over 4 million monthly unique visitors and over 14 million monthly page views, per Quantcast. As of January 2012, Ranker's traffic was ranked 949th on Quantcast.

● One of Ranker's prominent data partners is Freebase, now owned by Google.


Page 12: Big data with knowledge graph

Our custom knowledge graph- How did we build it?

Freebase Data Exposure Option 1: MQL

The Metaweb Query API is a powerful API provided by Freebase for reading data. The data is communicated over HTTP using JSON. This method is very effective for browsing the data or downloading limited amounts of it.

For very large data consumption, I do not recommend MQL, for the following reasons:

● The Freebase API is intermittently down.

● Freebase throttles both the number of API calls and the size of the data sets returned per day. We have faced issues in the past where the API responded with "allowance exceeded" timeout errors. The maximum number of results returned for any query is 100.
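For context, an MQL read was itself expressed as JSON, with empty values marking the fields Freebase should fill in. The sketch below only shows the query shape; the service has since been retired, and the exact envelope format is an assumption from memory of the old API.

```python
import json

# An MQL-style query: ask for the director(s) of a named film.
# null / [] values mark the slots the server is expected to fill in.
query = {
    "type": "/film/film",
    "name": "The Godfather",
    "directed_by": [],   # to be filled in by the server
    "limit": 1,
}

# Queries were wrapped in an envelope and sent over HTTP as JSON.
envelope = json.dumps({"query": [query]})
print(envelope)
```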

Page 13: Big data with knowledge graph

Freebase Data Exposure Option 2: Data Dumps

● Freebase provides weekly quad dumps available for download via its download site.

● It is a complete dump of all the assertions in Freebase, in UTF-8 format. The dump is available as a compressed file, 4+ GB in size. It has to be downloaded & unzipped, after which it is approximately 30 GB.

● The quad dump has to be converted into RDF statements. For this we use the open-source freebase-quad-rdfize program, which is freely distributed. At the end of this process you will have a .nt file approximately 90-100 GB in size, so disk space is a vital requirement.
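The quad-to-RDF step can be sketched as follows. This is an illustrative simplification, not the real freebase-quad-rdfize tool: each quad dump line is tab-separated (source, property, destination, value), and we rewrite it as an N-Triples statement. The real converter handles many more cases (typed literals, language tags, namespace rules).

```python
NS = "http://rdf.freebase.com/ns"

def quad_to_triple(line):
    """Rewrite one tab-separated quad as a simplified N-Triples statement."""
    source, prop, dest, value = line.rstrip("\n").split("\t")
    # A non-empty destination is a node reference; otherwise the value is a literal.
    obj = dest if dest else '"%s"' % value
    if obj.startswith("/"):
        obj = "<%s%s>" % (NS, obj)
    return "<%s%s> <%s%s> %s ." % (NS, source, NS, prop, obj)

print(quad_to_triple("/m/0c0k1\t/people/person/nationality\t/m/09c7w0\t"))
```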

Page 14: Big data with knowledge graph

Datastore

● A triple store is a data store optimized for the storage and retrieval of RDF triples. Our knowledge graph data store is OpenLink Virtuoso. It can handle more than a billion triples, so it suited our requirement well.

● Since the .nt file is very large, ingesting the data into the triple store ran into various issues: after a million triples, the server froze. So we broke the .nt file into smaller chunks, after which the ingestion completed successfully.

● The system we use for ingestion is an Ubuntu 10.04 machine with 48 GB of RAM. It takes approximately 36 hours to ingest the complete quad dump into our triple store.
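The chunking workaround can be sketched like this. The split is the general idea only; the chunk size and file-naming scheme are illustrative, and the figure you'd actually use depends on the triple store and machine.

```python
def split_nt(path, lines_per_chunk=1_000_000):
    """Split a large .nt file into path.0, path.1, ... of at most
    lines_per_chunk lines each, so every ingestion run stays small."""
    chunks = []
    buf, idx = [], 0

    def flush():
        nonlocal buf, idx
        out = "%s.%d" % (path, idx)
        with open(out, "w") as dst:
            dst.writelines(buf)
        chunks.append(out)
        buf, idx = [], idx + 1

    with open(path) as src:
        for line in src:
            buf.append(line)
            if len(buf) == lines_per_chunk:
                flush()
        if buf:
            flush()
    return chunks
```

Each chunk is then loaded into the triple store one at a time, so a frozen server only costs one chunk's worth of work instead of the whole file.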

Page 15: Big data with knowledge graph

Data consumption for the App

Our platform is a highly scalable graph processing engine that operates on the largest knowledge graph (Freebase) and uses a graph data store, OpenLink Virtuoso. The platform itself is built using the standard protocol for graph navigation, processing and traversal: SPARQL.

● Every node on Freebase has a unique alphanumeric id made of two parts, a namespace and a key; together they are called the 'mid'.

● Every predicate in Freebase has a source id or source namespace. For example, the predicate "Nationality" has the source URL "http://rdf.freebase.com/ns/people/person/nationality".

In our app we have predefined entities and their properties, using predicate URLs as source ids. For example, a Person entity in our system has a Nationality property with a source URL and source "freebase". This way we can add more sources in the future and have one entity with properties from one or more sources.
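A sketch of such an entity/property registry is below. The Person/Nationality entry follows the example in the text; the data structure itself and the helper name are assumptions for illustration, not our actual schema.

```python
# Each property of an app entity records the predicate URL it maps to and
# which source it came from, so new sources can be added without changing
# the entity model.
ENTITIES = {
    "Person": {
        "nationality": {
            "source": "freebase",
            "source_url": "http://rdf.freebase.com/ns/people/person/nationality",
        },
    },
}

def predicate_for(entity, prop, source="freebase"):
    """Look up the predicate URL backing an entity property for a given source."""
    meta = ENTITIES[entity][prop]
    if meta["source"] != source:
        raise KeyError("no %s mapping for %s.%s" % (source, entity, prop))
    return meta["source_url"]

print(predicate_for("Person", "nationality"))
```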

Page 16: Big data with knowledge graph

SPARQL

● This is a query language for RDF data.

● The results of these queries come back as RDF data: variable bindings for SELECT queries, or triples for CONSTRUCT queries.

Hence we chose to build these queries dynamically, depending on what data we need. In our experience, avoiding joins in SPARQL queries improves performance.
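Dynamic query building can be sketched as simple string assembly: given a node id and a predicate URL, emit a single-pattern SELECT with no joins, in line with the performance note above. The prefix and variable name are illustrative, not our production query templates.

```python
def build_query(mid, predicate_url):
    """Build a single-pattern SPARQL SELECT for one node and one predicate."""
    return (
        "SELECT ?o WHERE { "
        "<http://rdf.freebase.com/ns%s> <%s> ?o "
        "}" % (mid, predicate_url)
    )

q = build_query("/m/0c0k1",
                "http://rdf.freebase.com/ns/people/person/nationality")
print(q)
```

Because each query touches exactly one triple pattern, the store never has to plan a join; the application composes results from several such queries instead.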

API

● We chose the Java-based Jena API for Virtuoso.

● It establishes a connection to the triple store over JDBC.

The API supports SPARQL, and the results are packaged as RDF objects, so we can easily read them and use adapters to transform them into app objects.
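The adapter step can be illustrated in a few lines. This is a language-neutral sketch (the real code sits on the Java Jena API): result rows arrive as (subject, predicate, object) and are folded into plain app objects via a predicate-to-field map. All field and predicate names here are invented for the example.

```python
def to_app_object(rows, field_map):
    """Collapse triples about one subject into a dict, using a
    predicate -> field-name map; unmapped predicates are ignored."""
    obj = {}
    for _, predicate, value in rows:
        field = field_map.get(predicate)
        if field:
            obj.setdefault(field, []).append(value)
    return obj

rows = [
    ("/m/0c0k1", "ns:people/person/nationality", "United States"),
    ("/m/0c0k1", "ns:type/object/name", "Al Pacino"),
]
person = to_app_object(rows, {
    "ns:people/person/nationality": "nationality",
    "ns:type/object/name": "name",
})
print(person)
```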

Page 17: Big data with knowledge graph

Data Aggregation

This is what makes our platform truly powerful. Not only do we store the knowledge graph locally, we also have the ability to create our own custom graph from this data. The Ranker system has approximately 20 million nodes & powers half a million lists & counting.

Not all entities in our system are simple; some are complex. By complex I mean their properties belong to more than one type on Freebase. For example, a 'Person' node in our system will not only have date of birth, place of birth, age, etc., but also properties like 'dated' and 'breakups'. We achieved this by predefining aggregation rules for each and every entity in our system, based on feedback from our SEO & business teams.
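The aggregation-rule idea can be sketched like this. The Freebase type names and property lists below are plausible but assumed; the actual rule set is defined per entity inside our system.

```python
# Each "complex" app entity aggregates properties from several Freebase types.
AGGREGATION_RULES = {
    "Person": ["/people/person", "/base/popstra/celebrity"],
}

# Properties each Freebase type contributes (illustrative subsets).
TYPE_PROPERTIES = {
    "/people/person": ["date_of_birth", "place_of_birth", "nationality"],
    "/base/popstra/celebrity": ["dated", "breakup"],
}

def properties_for(entity):
    """Union of the properties contributed by every aggregated type."""
    props = []
    for freebase_type in AGGREGATION_RULES[entity]:
        props.extend(TYPE_PROPERTIES[freebase_type])
    return props

print(properties_for("Person"))
```

A 'Person' built this way carries both the basic biographical facts and the relationship properties, exactly because the rule lists two source types rather than one.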

Page 18: Big data with knowledge graph

Conclusion