Big Data and the Semantic Web: Challenges and Opportunities

Big Data Tech Conclave, 26–27 April 2013, Bangalore, India

Big Data and the Semantic Web: Challenges and Opportunities

Srinath Srinivasa
Open Systems Laboratory, IIIT Bangalore
http://osl.iiitb.ac.in/
[email protected]

Upload: srinath-srinivasa

Post on 08-May-2015




http://www.bda2013.net/


OSL Releases

Topical Anchors: Given a list of noun phrases, identify a semantic topic for these terms.

Powered by a Wikipedia co-occurrence graph hosted by Agama

Web APIs enable use of Topical Anchors in third-party applications


OSL Releases

Topic Expansion: Given a term, expand it into semantically relevant topical clusters with different senses.

Uses co-occurrence datasets from Wikipedia 2006 or 2011.

Web APIs enable use by third-party applications


OSL Releases

Agama: A graph database for storing large undirected graphs for efficient traversal (not structure-based retrieval)

Agama currently powers a co-occurrence graph of all noun phrases from Wikipedia articles, hosted at OSL, managing tens of millions of nodes and hundreds of millions of edges


More data beats better algorithms..

meets

No data is an island..


Outline

● Big Data Characteristics
● Big Data Analytics
  – Pattern-driven and Model-driven Analytics
● Big Data and the Semantic Web
● Semantic Challenges
  – The myth of a global ontology
  – Convergent and divergent semantics
  – Semantic interoperability
● Technology Challenges
  – Storage, traversal and retrieval of large-scale semantic networks
  – Inference on Big Data
● On the road ahead


Big Data

Data that is:
● Too large to be processed by conventional databases and data management techniques (Volume)
● So diverse in structure that no single data model captures all elements of the data (Variety)
● Transient and/or impermanent, especially when pertaining to dynamic phenomena (Velocity)


Big Data

● Transaction records

● Network streams

● Experimental output

● Social media data 

● Demographic records

● Citation data 

● Clickstreams

● Log data

● Weather data 

● …


Some Big Data Stats

● YouTube users upload 48 hours of video every minute http://gigaom.com/2011/05/25/youtube-48-hours-of-video-per-minute/

● Facebook data grows by 500 TB daily http://www.slashgear.com/facebook-data-grows-by-over-500-tb-daily-23243691/

● WalMart handles more than 1 million customer transactions every hour http://www.economist.com/node/15557443

● Akamai analyzes 75 million events per day for targeted advertising http://wikibon.org/blog/taming-big-data/

● 90% of data in the world today was created in the last 2 years http://wikibon.org/blog/big-data-infographics/


Big Data Analytics

Examine Big Data for useful (often actionable) knowledge

The long spectrum of Big Data Analytics

[Figure: analytics techniques placed on a spectrum between "Pattern driven" and "Model driven": pattern identification, association rule mining, classification/clustering, record linkage, security analytics, complex event processing, opinion mining, predictive modeling]


Pattern Driven Analytics

● Discovery and visualization of recurring patterns in datasets
● Mostly quantitative
● Paradigms in pattern discovery:
  – Sampling and aggregation
  – Thresholding and filtering

Image Source: Wikipedia


Pattern Driven Analytics

Sampling and Aggregation
● Query-based pattern aggregation
● Based on an initial idea of what we are looking for

[Figure: block diagram with the elements Hypothesis, Data, Query, Patterns, Aggregation and Presentation]
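A minimal sketch of the sampling-and-aggregation paradigm, assuming a hypothesis that is turned into a query predicate and then aggregated per group. The records, field names and the `aggregate` helper are hypothetical and only serve as illustration.

```python
from collections import defaultdict

# Hypothetical transaction records (assumed for illustration only).
RECORDS = [
    {"store": "A", "hour": 9, "amount": 120.0},
    {"store": "A", "hour": 18, "amount": 340.0},
    {"store": "B", "hour": 18, "amount": 90.0},
    {"store": "B", "hour": 19, "amount": 410.0},
    {"store": "A", "hour": 19, "amount": 60.0},
]


def aggregate(records, predicate, key):
    """Select the records matching the query predicate, then sum amounts per key."""
    totals = defaultdict(float)
    for rec in records:
        if predicate(rec):
            totals[rec[key]] += rec["amount"]
    return dict(totals)


# Hypothesis: evening sales (18:00 onwards) differ across stores.
evening_totals = aggregate(RECORDS, lambda r: r["hour"] >= 18, "store")
print(evening_totals)
```

The query encodes what we already expect to find; records outside the predicate never influence the result, which is the main difference from the thresholding approach.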


Pattern Driven Analytics

Thresholding and Filtering
● Based on sifting through the entire dataset (or a view) to look for “interesting” patterns without the context of a query

[Figure: block diagram with the elements Data, Interestingness criteria, Patterns, Filtering and Segregation, and Presentation]
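As a rough illustration of thresholding and filtering, the sketch below scans a whole dataset without any query and keeps only the item pairs that meet an interestingness criterion. Using a minimum co-occurrence count as that criterion is an assumption made here; the baskets and the `frequent_pairs` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (assumed for illustration only).
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]


def frequent_pairs(baskets, min_count):
    """Count every co-occurring item pair, then keep pairs seen at least min_count times."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair a canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


interesting = frequent_pairs(BASKETS, min_count=3)
print(interesting)
```

Every pair in the data is a candidate pattern here; the threshold alone decides what is presented.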


Model Driven Analytics

Analytics as a model-discovery problem

[Figure: a set of images (observable data) that share the latent concept "Wedding". Images source: Wikipedia]


Model Driven Analytics

● Pattern discovery coupled with semantic modeling
● Non-trivial qualitative modeling challenges
● Model discovery:
  – Descriptive model discovery: fit a model to explain the observed data
  – Predictive model discovery: discover a model that can predict values of data elements into the future
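The two kinds of model discovery can be sketched with the simplest possible model, a straight line fitted by least squares. The choice of a linear model, the data and the helper names are assumptions for illustration; the talk does not prescribe any particular model family.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: site visits (in thousands) on days 1 to 5.
days = [1, 2, 3, 4, 5]
visits = [12.0, 14.0, 16.0, 18.0, 20.0]

# Descriptive use: the fitted parameters explain the observed data.
slope, intercept = fit_line(days, visits)


def predict(day):
    """Predictive use: extrapolate the fitted model to an unseen day."""
    return slope * day + intercept


forecast = predict(7)
print(slope, intercept, forecast)
```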


Linked Data

[Figure: the Linked Data Cloud as of September 2011. Image source: Wikipedia]


Linked Data

● Using Semantic Web technologies to connect data elements from disparate data sources
● From Web of Documents to Web of Data
● Elements of Linked Data:
  – URIs
  – HTTP
  – Resource Description Framework (RDF)
  – Serialization formats (RDFa, RDF/XML, N3, Turtle, and others)
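To make the elements above concrete, here is a small sketch that represents RDF statements as (subject, predicate, object) triples and writes them out in the line-based N-Triples syntax. All `http://example.org/` URIs, the `URI` marker class and the `to_ntriples` helper are hypothetical; the literal escaping is deliberately minimal and not a full implementation of the format.

```python
class URI(str):
    """Marks a string as a resource identifier rather than a plain literal."""


def to_ntriples(triples):
    """Serialize (subject, predicate, object) triples as N-Triples lines."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, URI):
            obj_text = f"<{obj}>"
        else:
            # Minimal literal escaping: backslashes and double quotes only.
            escaped = str(obj).replace("\\", "\\\\").replace('"', '\\"')
            obj_text = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {obj_text} .")
    return "\n".join(lines)


EX = "http://example.org/"  # reserved example domain, used as a placeholder
TRIPLES = [
    (URI(EX + "people/alice"), URI(EX + "vocab/worksAt"), URI(EX + "orgs/osl")),
    (URI(EX + "people/alice"), URI(EX + "vocab/name"), "Alice"),
]

print(to_ntriples(TRIPLES))
```

Because subjects and predicates are URIs that can be dereferenced over HTTP, triples published by different sources can point at each other, which is what turns separate datasets into linked data.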


Big Data and the Semantic Web

[Figure: Big Data and the Semantic Web feeding each other. Big Data → Semantic Web: Model Discovery. Semantic Web → Big Data: Catalyzation and Predictive Modeling]


Big Data → Semantic Web

● One of the main elements of the Linked Data Cloud, DBpedia, is built from a Big Data resource: Wikipedia

● Open Biomedical Ontology (OBO) (http://www.oboedit.org/) created from mining PubMed publications

● Enterprise-scale Big Data Analytics helping build organizational models, operational intelligence solutions, etc. Examples: Anzo software suite by Cambridge Semantics (www.cambridgesemantics.com), Loom data management suite by Revelytix (www.revelytix.com)


Semantic Web → Big Data

Schema.org
● Collection of schemata on various topics that are recognized by major search providers and used to semantically interpret web content

SourceMap
● Linked data augmented with web content and crowdsourced data, used to provide details about companies such as their carbon footprint, energy use, water use, etc. www.sourcemap.com

OpenStreetMap
● Linked data augmenting crowdsourced data on www.openstreetmap.org helped in detailed mapping of the disaster scenario during the Jan 2010 Haiti earthquake (http://www.scientificamerican.com/article.cfm?id=berners-lee-linked-data)


Big Data and the Semantic Web: Challenges

Semantic challenges
● The myth of a global ontology
● Convergent and divergent semantics

Technology and system challenges
● Characteristics of a semantic graph
● Managing graph structured data


The Myth of a Global Ontology

Several “core” semantic ontologies exist:
● WordNet
● YAGO
● OpenCyc
● SUMO

However, none of them (even automated ones) can capture all possible semantic associations and all possible perspectives on a given topic


The Myth of a Global Ontology

The open world problem

● We don't know what we don't know.. 

● Representation bias in big data sources

The neutral-but-useless perspective

● Localized, utilitarian descriptions are often more useful than neutral, global descriptions. Example: use of “zones” as a geographical element in Indian Railways

● Difficult for disparate perspectives to co-exist in a single ontology without violating design principles like Occam's razor


Convergent and Divergent Semantics

[Figure: convergent (encyclopedic) semantics. The Palestine POV, Israeli POV, historians' POV and UN's POV all feed into the Wikipedia article on the West Bank conflict]


Convergent and Divergent Semantics

[Figure: divergent semantics. The IPL event schedule is linked to traffic planning, advertisement planning around IPL, legal structuring around IPL, TV programme scheduling and security planning]


Semantic Interoperability

● Binary predicates like RDF may not capture the complete semantics of an association
  – But it is too difficult to work with higher-order predicates
● Semantic queries are characterized by contextual relevance and default assumptions
● Linked Data can be useful primarily within the context of a model
  – Model-building from predicates is as complex a problem as identifying predicates from data
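The limitation of binary predicates can be seen with a fact that has more than two participants. One common workaround, which is not taken from the talk, is to introduce an intermediate node for the event and attach each participant to it with its own binary predicate. The `reify` helper and all identifiers below are hypothetical.

```python
def reify(event_id, relation, roles):
    """Decompose an n-ary fact into binary triples hanging off one event node."""
    triples = [(event_id, "type", relation)]
    triples.extend((event_id, role, value) for role, value in roles.items())
    return triples


# "Alice bought book42 from Bob for 10" has four participants, so no single
# (subject, predicate, object) triple can state it completely.
sale = reify(
    "sale1",
    "Sale",
    {"buyer": "alice", "seller": "bob", "item": "book42", "price": 10},
)

for triple in sale:
    print(triple)
```

The decomposition keeps everything binary, but the meaning of the fact now lives in how the five triples fit together, which is one way of reading the point that Linked Data is useful primarily within the context of a model.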


Semantic Challenges: Summary

● Hard to distinguish data from noise without a model
  – Especially hard when we are using data to help build a model!
● There may not be a single global model explaining the data
● Model construction is as challenging as predicate mining, if not more so
● No clarity on the underlying processes that aid in knowledge aggregation
  – Knowledge aggregation happens differently depending on the kind of knowledge being aggregated (encyclopedic versus operational knowledge)


Tech Challenges

Storing Big Semantic Data
● Semantic data lacks the physical access coherence needed for efficient storage in relational tables
● Logical proximity of triples is more important than physical proximity
● Read/write storage models change logical proximity
● RDF graphs tend to be extremely dense and/or clustered
● Need efficient methods of graph storage and retrieval


Semantic store for Big Data

● Databases optimized to store and retrieve interrelated sets of triples of the form (subject, predicate, object) 

● Query models based on answering graph queries (usually in SPARQL) rather than SQL queries

● Main design criteria: storage and read-ahead policies of triples based on their logical proximity rather than physical proximity, in order to enable Bulk Synchronous Parallel (BSP) processing
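A toy in-memory triple store gives a feel for what such databases optimize. The sketch below keeps one index per triple position so that a pattern with wildcards can start from the smallest candidate set. It is an assumption-laden illustration: the `TripleStore` class is hypothetical, it does not implement SPARQL, and it says nothing about on-disk layout or BSP processing.

```python
from collections import defaultdict


class TripleStore:
    """Toy store with one index per position of (subject, predicate, object)."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)
        self._by_predicate = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self._triples.add(triple)
        self._by_subject[s].add(triple)
        self._by_predicate[p].add(triple)
        self._by_object[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        candidates = [self._triples]
        if s is not None:
            candidates.append(self._by_subject.get(s, set()))
        if p is not None:
            candidates.append(self._by_predicate.get(p, set()))
        if o is not None:
            candidates.append(self._by_object.get(o, set()))
        # Scan only the smallest candidate set, then check the other positions.
        smallest = min(candidates, key=len)
        return sorted(
            t for t in smallest
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "worksAt", "osl")

print(store.match(p="knows"))
print(store.match(s="alice", o="osl"))
```

Each index groups triples that are logically close (same subject, same predicate or same object), regardless of the order in which they were inserted.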


Semantic store for Big Data

AllegroGraph (http://www.franz.com/agraph/allegrograph/)

● NoSQL graph-based native storage for RDF triples
● ACID compliant
● Interfaces with Solr for free text indexing
● Triple and text level indexing
● MongoDB integration
● RDFS++ reasoning with dynamic materialization
● SPARQL queries on named graphs and Prolog-based inferencing engine


Semantic store for Big Data

Sesame http://www.openrdf.org/

● Open source Java framework for parsing, storing, querying and inferencing over RDF data
● Collections of RDF triples can be manipulated in memory using a graph data model
● Compliant with SPARQL 1.1 protocol recommendation
● Provides two levels of APIs: SAIL (Storage and Inference Layer) for low level RDF processing and Repository layer for programmatic interfacing with Sesame


Semantic store for Big Data

Mulgara http://www.mulgara.org/

● Native storage model for RDF
● Supports multiple models (databases) per server
● ACID transactions and concurrency support
● Copy-on-write cache semantics
● Full-text search and support for data types
● Primarily useful as a repository – no evidence of support for logical inferences over RDF


Semantic store for Big Data

Other examples:
● InfiniteGraph from Objectivity http://www.objectivity.com/
● Big-Data http://www.bigdata.com/bigdata/blog/
  – A high scale-out storage and computing engine
● Agama https://github.com/arrac/agama/wiki/Agama
  – Storage, search and traversal support (Ruby library) for very large graphs
● Neo4j http://www.neo4j.org/
  – Embedded, disk-based transactional graph database written in Java


Logical inference over Big Data

● Problem: Find factual answers to specific questions by reasoning over large-scale data
● Performing extremely large-scale deductions over large semantic datasets in interactive response time
● Need to contend with potentially inconsistent predicates, incomplete or missing values and default assumptions
● Varieties of inference over datasets:
  – Deduction
  – Induction
  – Abduction
  – Statistical inference


Logical inference over Big Data

Common approaches for scalable inferencing:
● Horn clause inferencing
● Variants of random walks on knowledge graphs
● Distributed MCMC (Markov Chain Monte Carlo) methods


Horn Clauses

Horn clauses are predicates of the form:

$p_1 \wedge p_2 \wedge \dots \wedge p_n \rightarrow u$

where each antecedent $p_i$ is an atomic sentence with no negation and $u$ is a single consequent

Horn clause knowledge bases can be resolved using “backward chaining”, starting from the consequent and building a tree of antecedents until they are grounded in facts

Horn clause resolution can be scaled over large datasets by parallelizing resolutions using MapReduce
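Backward chaining over Horn clauses can be sketched in a few lines for the propositional case. This is a single-machine illustration under simplifying assumptions (no variables, no MapReduce parallelism); the `backward_chain` function, the facts and the rules are all hypothetical.

```python
def backward_chain(goal, facts, rules, visiting=frozenset()):
    """Return True if `goal` follows from `facts` via propositional Horn `rules`.

    Each rule is a pair (antecedents, consequent), read as
    antecedents[0] AND antecedents[1] AND ... -> consequent.
    """
    if goal in facts:
        return True  # grounded in a known fact
    if goal in visiting:
        return False  # already trying to prove this goal: cut the cycle
    visiting = visiting | {goal}
    for antecedents, consequent in rules:
        if consequent == goal and all(
            backward_chain(a, facts, rules, visiting) for a in antecedents
        ):
            return True
    return False


FACTS = {"rain", "no_umbrella"}
RULES = [
    (("rain", "no_umbrella"), "wet"),
    (("wet",), "cold"),
    # A cyclic pair of rules, included to show that the search terminates.
    (("a",), "b"),
    (("b",), "a"),
]

print(backward_chain("cold", FACTS, RULES))   # proved via wet <- rain, no_umbrella
print(backward_chain("sunny", FACTS, RULES))  # no rule or fact supports it
```

The recursion builds exactly the tree of antecedents described above: each goal is replaced by the antecedents of a rule that concludes it, until every leaf is a fact.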


Random Walks on Big Data

Random walks on RDF graphs as a means of:
● Belief materialization
● Soft inference

[Figure: a graph over the nodes a, b, c, d, e and f with edges labeled R, assuming transitivity of R]
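One way to read soft inference by random walks: if R is assumed transitive, the fraction of random walks from a that reach b along R-edges can serve as a degree of belief in R(a, b). The sketch below illustrates this idea only; the edge list is hypothetical and does not reproduce the graph on the slide, and `soft_belief` is not an API of any system mentioned in the talk.

```python
import random

# Hypothetical R-edges as adjacency lists (not the slide's actual graph).
EDGES = {
    "a": ["c"],
    "c": ["e", "d"],
    "e": ["f"],
    "d": [],
    "f": ["b"],
    "b": [],
}


def walk_reaches(graph, start, target, max_steps, rng):
    """Follow random R-edges from start; report whether target is reached."""
    node = start
    for _ in range(max_steps):
        if node == target:
            return True
        successors = graph.get(node, [])
        if not successors:
            return False  # dead end
        node = rng.choice(successors)
    return node == target


def soft_belief(graph, start, target, walks=2000, max_steps=10, seed=0):
    """Estimate belief in R(start, target) as the fraction of successful walks."""
    rng = random.Random(seed)
    hits = sum(
        walk_reaches(graph, start, target, max_steps, rng) for _ in range(walks)
    )
    return hits / walks


belief_ab = soft_belief(EDGES, "a", "b")
belief_ba = soft_belief(EDGES, "b", "a")
print(belief_ab, belief_ba)
```

In this graph half of the walks from a branch into the dead end d, so the belief in R(a, b) comes out near 0.5 rather than as a hard true or false.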


Random Walks on Big Data

Large scale graph processing solutions for scaling random walks over Big Data:
● Apache Giraph http://giraph.apache.org/
● Pregel [Malewicz et al., 2010]
● Grappa http://www.cs.washington.edu/node/4217/


MCMC

A “generic” problem solving method based on local sampling, useful for soft inferences on semantic data

Time homogeneous Markov Chain:
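For reference, a standard textbook way of stating time homogeneity is given below. It is filled in from general Markov chain theory as an assumption; the notation ($X_n$ for the state at step $n$, $p_{ij}$ for the transition probabilities) may differ from the formula shown on the original slide.

```latex
\Pr(X_{n+1} = j \mid X_n = i) = p_{ij} \quad \text{for all } n \ge 0,
\qquad \sum_{j} p_{ij} = 1 .
```

In words: the probability of moving from state $i$ to state $j$ does not depend on the step at which the move happens.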


MCMC

A homogeneous Markov chain can be represented as a set of “states” and “transition probabilities” across states

Given an initial “prior” probability distribution across states           the “stationary distribution” or “equilibrium condition” is defined as: 
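A common textbook characterization, stated here as an assumption rather than as the slide's own formula, is that a stationary distribution $\pi$ satisfies $\pi = \pi P$ for the transition matrix $P$, and that for a well-behaved (ergodic) chain repeatedly applying $P$ to any prior converges to it. The sketch below checks this numerically on a hypothetical two-state chain; `step` and `stationary` are illustrative helper names.

```python
def step(dist, P):
    """One transition: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(prior, P, tol=1e-12, max_iter=10_000):
    """Apply the transition matrix repeatedly until the distribution stops changing."""
    dist = list(prior)
    for _ in range(max_iter):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return nxt
        dist = nxt
    return dist


# Hypothetical two-state chain; each row holds the outgoing probabilities of a state.
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]

pi = stationary([1.0, 0.0], P)
pi_other_prior = stationary([0.0, 1.0], P)
print(pi, pi_other_prior)
```

Solving $\pi = \pi P$ by hand for this matrix gives $\pi = (5/6, 1/6)$, and both priors converge to it.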


MCMC

Markov Chain Monte Carlo

Given a state space S and an “equilibrium” distribution       choose a sample s of the state space S so that a Markov chain on s results in      as the stationary distribution

MCMC for logical inference

For a logical inference problem, the equilibrium condition would be of the form $[0,1]^m$ defined over a set of $m$ predicates

Example sampling algorithms for MCMC:
● Gibbs sampling http://en.wikipedia.org/wiki/Gibbs_sampling
● Metropolis-Hastings algorithm http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
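As a small illustration of the sampling idea, the sketch below runs the Metropolis algorithm (the symmetric-proposal special case of Metropolis-Hastings) over five states arranged in a ring. The target weights, the ring-shaped proposal and the `metropolis` function are assumptions chosen for brevity; a logical inference application would use a state space over predicate assignments instead.

```python
import random
from collections import Counter


def metropolis(weights, steps, seed=0):
    """Sample states in proportion to unnormalized `weights`; return visit frequencies."""
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    counts = Counter()
    for _ in range(steps):
        # Symmetric proposal: move one position left or right on the ring.
        proposal = (state + rng.choice((-1, 1))) % n
        # Always accept moves to heavier states, sometimes accept lighter ones.
        accept_prob = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept_prob:
            state = proposal
        counts[state] += 1
    return [counts[s] / steps for s in range(n)]


WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]  # unnormalized target distribution
TARGET = [w / sum(WEIGHTS) for w in WEIGHTS]

freqs = metropolis(WEIGHTS, steps=100_000)
print([round(f, 3) for f in freqs])
print(TARGET)
```

Only ratios of weights are needed, so the target never has to be normalized; the empirical visit frequencies approach the normalized target as the number of steps grows.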


Scaling MCMC for Big Data

Distributed MCMC

Several models have been explored for distributing MCMC computations over large datasets, making them amenable to diffusing computations. Some examples include: [Murray 2010; Singh et al 2011]

Distributional models for MCMC are beyond the scope of this talk.


On the road ahead..

Some promising directions for Big Data and Semantics:
● Diffusion models for large scale inference
● Cognitive models for semantics over large scale data
● Model-based reasoning and reasoning across models
● Soft (probabilistic) inferences, confidence measures, relevance feedback
● Continuous learning over Big Data


Thank You!


References

● Neal Madras. Introduction to Markov Chain Monte Carlo. http://www.cs.cornell.edu/selman/cs475/lectures/intro-mcmc-lukas.pdf

● Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD '10). ACM, New York, NY, USA, 135-146. DOI=10.1145/1807167.1807184 http://doi.acm.org/10.1145/1807167.1807184

● Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 529-539.

● Lawrence Murray. 2010. Distributed Markov Chain Monte Carlo. In Proceedings of NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds. http://lccc.eecs.berkeley.edu/

● Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld. 2008. Scaling textual inference to the web. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 79-88.

● Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10). Association for Computational Linguistics, Stroudsburg, PA, USA, 1088-1098.

● Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11). Association for Computational Linguistics, Stroudsburg, PA, USA, 793-803.