w3 s2008 apache heart project proposal frederick haebin na

Heart Project Proposal: Distributed RDF Table & Processing Engine

Frederick Haebin Na [email protected] Project Group

2008.10.23.

Contents

1. Heart Proposal Overview
2. Goals & Objectives
3. Backgrounds
4. Benefits
5. Features

Heart (Highly Extensible & Accumulative RDF Table) aims to provide a planet-scale RDF store and a set of features to process the data in a distributed manner. Heart is based on Hadoop and HBase. Heart aims to be a batch processor, or analyzer, rather than a real-time database.

Heart Proposal Overview

Heart will be the heart of Web 3.0, where machines extend human-powered knowledge at a far greater rate than in Web 2.0. As semantic data grows at this increasing rate, Heart will be very useful in about a decade or so. Until then, Heart will play a crucial role in experimenting with niche service models.

Benefits
• Massive Storage & Processor: Highly Extensible & Accumulative Storage; Faster Loader/Query Processing/Materializer for Massive RDF Data
• RDF Data Mining Platform: Knowledge Discovery; Prediction/Classification/Association
• Semantic Search Platform: Bulk Pre & Post Processing for Semantic Search

Features
• Heart Data Loader: Bulk Triples to HBase
• Heart Storage Manager: Smart Triples Partitioning
• Heart Query Processor: Optimized Query for Massive Data
• Heart Data Miner: Extension to SparQL for Data Mining
• Heart Data Materializer: Indexing for Implicit Statements

Relevant Projects
• Core (Billion Triples) 1): Garlik JXT (9.8), YARS2 (7), BigOWLIM (6.7), Jena TDB (1.7), Virtuoso (1)
• Applications: PowerSet (Semantic Search Engine); "A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data", Newman, et al.

1) http://esw.w3.org/topic/LargeTripleStores


Goals & Objectives

The goal of Heart is to provide massive RDF data storage and a batch processor for various RDF data mining tasks.

Several key problems must be solved to reach the first objective, which has priority over the rest.

Goals
• To Provide Massive RDF Data Storage & Batch Processor for Various RDF Data Mining

Key Problems Need to be Solved
• Would a sequential-read centric HBase index be enough for random reads/writes for joins?
• If not, then how to exploit HBase indexes or generate new ones for speeding up the processing?
• What is the best suitable index for semantic search?
• How to partition the triples for efficient joins? (By subject, predicate, grouped by named graphs)
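The partitioning question can be made concrete with a small sketch. The proposal does not specify a partitioning scheme, so the `Quad` type, the `partition` helper, and the sample data below are hypothetical and only illustrate grouping the same statements by subject, by predicate, or by named graph.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Quad(NamedTuple):
    """One RDF statement plus the named graph it belongs to (hypothetical type)."""
    subject: str
    predicate: str
    obj: str
    graph: str


def partition(quads: Iterable[Quad], key: str) -> Dict[str, List[Quad]]:
    """Group statements by one field: 'subject', 'predicate', or 'graph'."""
    if key not in ("subject", "predicate", "graph"):
        raise ValueError(f"unsupported partition key: {key}")
    groups: Dict[str, List[Quad]] = defaultdict(list)
    for quad in quads:
        groups[getattr(quad, key)].append(quad)
    return dict(groups)


# Sample data, invented for illustration only.
quads = [
    Quad("ex:alice", "foaf:knows", "ex:bob", "ex:g1"),
    Quad("ex:alice", "foaf:name", '"Alice"', "ex:g1"),
    Quad("ex:bob", "foaf:name", '"Bob"', "ex:g2"),
]

by_subject = partition(quads, "subject")
by_predicate = partition(quads, "predicate")
by_graph = partition(quads, "graph")
```

Grouping by subject keeps every statement about one resource together, so joins that share a subject need no data from other partitions. Grouping by predicate resembles a column-oriented layout. Which choice gives the most efficient joins is exactly the open problem listed above.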

Objectives
• Faster Massive Data Processor
  • Loader 1): Better than Garlik JXT
  • Query Processor 1): Better than Garlik JXT
• Highly Extensible & Accumulative RDF Table
  • Supports more than 10 billion triples over more than 3,000 computers.
• Extensions for Data Mining
  • Full Support for the Standard SPARQL
  • Machine Learning Extensions 2)

1) http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/

2) http://www.eswc2008.org/final-pdfs-for-web-site/qpI-4.pdf


Backgrounds

More and more services are beginning to provide and refine their RDF/RDFS-related features. At the same time, people are beginning to ask for more specific and contextual search results. Service providers, in turn, now hold both the data and the schemas needed to process RDF data for their customers' needs.

Drivers surrounding Heart (from the slide diagram):
• More RDF Supporting Services
• Needs for Contextual & Specific Search Results
• Proliferation of Various RDF Schemas
• Increase in RDF Data
• Refinement in RDFS's
• Needs for Processing RDF Data


Benefits

Heart provides three benefits: a massive RDF storage and processor, an RDF data mining platform, and a semantic search platform.

1. Massive RDF Storage & Processor
• Highly extensible and accumulative storage, inherited from Hadoop and HBase.
• Faster processing over massive RDF data, made possible by the MapReduce model for distributed RDF data processing.
• HBase-based column-oriented partitioning improves performance because fewer joins are needed.

2. RDF Data Mining Platform
• Full support for standard SPARQL over massive RDF data: converts SPARQL queries to MapReduce implementations.
• Machine learning features as SPARQL extensions 1): prediction, classification, association.

3. Semantic Search Platform
• Provides fundamental features for semantic search: storage & processor, knowledge discovery by data mining.
• Massive RDF data can be mined to generate a semantic search index, with support for user-defined index models.

1) http://www.eswc2008.org/final-pdfs-for-web-site/qpI-4.pdf
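To illustrate what converting a SPARQL query to a MapReduce job could look like, here is a minimal single-process sketch of a reduce-side join. It is not Heart's implementation. The query, the sample triples, and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for illustration, and the in-memory `shuffle` stands in for the grouping a real framework such as Hadoop would perform between the map and reduce phases. The query modeled is `SELECT ?person ?name WHERE { ?person foaf:knows ex:bob . ?person foaf:name ?name }`.

```python
from collections import defaultdict
from itertools import product

# Sample triples, invented for illustration only.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:carol", "foaf:knows", "ex:bob"),
    ("ex:carol", "foaf:name", '"Carol"'),
    ("ex:bob", "foaf:name", '"Bob"'),
]


def map_phase(triple):
    """Emit (join key, tagged value) for each triple pattern the triple matches."""
    s, p, o = triple
    if p == "foaf:knows" and o == "ex:bob":  # pattern 1: ?person foaf:knows ex:bob
        yield s, ("knows", None)
    if p == "foaf:name":                     # pattern 2: ?person foaf:name ?name
        yield s, ("name", o)


def shuffle(pairs):
    """Group mapped values by join key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(person, values):
    """Join the two patterns: emit a row only if both matched this key."""
    knows = [v for tag, v in values if tag == "knows"]
    names = [v for tag, v in values if tag == "name"]
    for _, name in product(knows, names):
        yield person, name


mapped = (pair for t in triples for pair in map_phase(t))
results = sorted(
    row
    for person, values in shuffle(mapped).items()
    for row in reduce_phase(person, values)
)
```

Each triple pattern becomes a filter in the map phase and the shared variable `?person` becomes the shuffle key, so the join happens inside the reducer. Queries with more join variables would need additional rounds or a different key design, which this sketch does not cover.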


Features

Heart provides five core features: data loader, storage manager, query processor, data miner, and data materializer.

1. Data Loader
• Fast Bulk Storing & Reasoning
• Bulk Triples into HBase
• Supports Various File Formats

2. Storage Manager
• Smart Triples Partitioning
• C-Store with Sequential-Read Centric Processing
• Reduce or Eliminate Random Access

3. Query Processor
• Full Standard SPARQL
• Query Conversion to MapReduce Code

4. Data Miner
• Machine Learning Extensions: Prediction, Classification, Association

5. Data Materializer
• Indexes for Implicit Statements
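As an illustration of what materializing implicit statements means, the sketch below computes the statements implied by two RDFS entailment rules, rdfs11 (`rdfs:subClassOf` is transitive) and rdfs9 (an instance of a class is also an instance of its superclasses). The proposal does not say which inference rules or index layout Heart would use, so the rule selection, the `materialize` function, and the sample data are assumptions. A distributed version would run each round as a batch job rather than in memory.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples):
    """Return the input plus every statement implied by rdfs9 and rdfs11."""
    closure = set(triples)
    while True:
        sub = {(s, o) for s, p, o in closure if p == SUBCLASS}
        inferred = set()
        # rdfs11: subClassOf is transitive.
        for a, b in sub:
            for b2, c in sub:
                if b == b2:
                    inferred.add((a, SUBCLASS, c))
        # rdfs9: instances of a class are instances of its superclasses.
        for s, p, o in closure:
            if p == TYPE:
                for a, b in sub:
                    if o == a:
                        inferred.add((s, TYPE, b))
        new = inferred - closure
        if not new:  # fixed point reached: nothing further can be derived
            return closure
        closure |= new


# Sample data, invented for illustration only.
explicit = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:alice", TYPE, "ex:Student"),
}
implicit = materialize(explicit) - explicit
```

Storing the `implicit` set alongside the explicit triples trades storage for query time: a query for all instances of `ex:Agent` can then be answered by a plain lookup, without reasoning at query time.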

Thank you.
