big data without big change - semtech june 2012 v1.5

Big Data Without Big Change

SemTech West 2012

Michael Lang

Revelytix

Discussion Points

Review the RDBMS, ETL, and data warehouse data management paradigms

Compare those paradigms to data virtualization and Big Data

Propose “Bigger Data” in support of radically better analytic capability

In 1970, E.F. Codd, at the IBM Research Laboratory in San Jose, California, wrote a paper published in Communications of the ACM,

“A Relational Model of Data for Large Shared Data Banks”

Codd wrote, “The problems treated here are those of data independence – the independence of application programs from the growth in data types and changes in data representation...”

This paper set in motion the architecture for data management systems for the next forty years. These systems are known as

relational database management systems (RDBMS)

The Last Forty Years


Siloed Information Management Systems

– All data in a single shared databank

– Rigid schemas

– Data and metadata are different types of things

– Query processor only knows about its local data expressed in a fixed schema

– Excellent ACID / CRUD capability

The Age of Virtualization

DIMS: Distributed Information Management System

Virtualization

Hardware and operating system virtualization became available in 2004 and brought great value to IT infrastructure

– Cloud-based deployment

– Extreme flexibility

– Efficient use of hardware resources

– Independence from operating systems

Leading to an enormous ROI for large enterprises

EDM

Hardware virtualization did not help with the problems associated with Enterprise Data Management

– Data remains distributed over many silos, even in cloud-based environments

– Meaning of data in independent silos is still obscure

– Schemas are still disparate

Data Virtualization

The advent of RDF, OWL, and SPARQL has created the technical foundation for building a completely virtualized data infrastructure

– All information can be managed in the same data model

– Any domain can be described at the schema level

– SPARQL provides a distributed query and transformation language

– R2RML provides mappings from native schema to RDF schema

– Standards-based data virtualization is here to stay
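To make the mapping idea concrete, here is a minimal sketch of what a relational-to-RDF mapping does. Real R2RML mappings are written in Turtle; the dictionary `EMPLOYEE_MAP`, the function `row_to_triples`, and the `example.com` URIs are hypothetical names used only for illustration.

```python
# Hypothetical mapping for an EMPLOYEE table; real R2RML is written in Turtle.
EMPLOYEE_MAP = {
    "subject_template": "http://example.com/employee/{id}",
    "class": "http://example.com/ontology#Employee",
    "columns": {
        "name": "http://example.com/ontology#name",
        "dept": "http://example.com/ontology#department",
    },
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(row, mapping):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    subject = mapping["subject_template"].format(**row)
    triples = [(subject, RDF_TYPE, mapping["class"])]
    for column, predicate in mapping["columns"].items():
        value = row.get(column)
        if value is not None:  # NULL columns produce no triple
            triples.append((subject, predicate, value))
    return triples


triples = row_to_triples({"id": 7, "name": "Ada", "dept": None}, EMPLOYEE_MAP)
```

Each row becomes a subject URI plus one triple per non-NULL column, so the native schema never has to change.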

Data Virtualization

This paradigm assumes data is completely distributed, and that anyone/anything should be able to find it and use it

– RDF is the data model

– OWL is the schema model

– SPARQL is the query language

– URIs provide unique identifiers

– URLs provide the location
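As a rough illustration of this model, the sketch below stores data as (subject, predicate, object) triples identified by URIs and answers a single triple pattern, which is the basic building block of a SPARQL query. The `match` function and the `example.com` data are assumptions for illustration, not part of any SPARQL engine.

```python
def match(triples, pattern):
    """Yield variable bindings for one triple pattern; terms starting with '?' are variables."""
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


EX = "http://example.com/"  # hypothetical namespace
data = [
    (EX + "alice", EX + "worksFor", EX + "acme"),
    (EX + "bob", EX + "worksFor", EX + "acme"),
    (EX + "acme", EX + "locatedIn", EX + "reston"),
]

# Roughly: SELECT ?who WHERE { ?who ex:worksFor ex:acme }
employees = [b["?who"] for b in match(data, ("?who", EX + "worksFor", EX + "acme"))]
```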

Data Abstraction

An RDBMS is an abstraction layer above an OS-based file system

– Made it vastly simpler to work with local data

Data Virtualization is an abstraction layer above multiple RDBMS and/or other sources of data

– Makes it vastly simpler to work with distributed data

– Distributed Information Management System

Caveats

Data virtualization technologies do not perform as well as queries over locally managed data

Data virtualization depends on sophisticated transformation of complex and unstructured data

Bigger Data: Hadoop and Virtual Data

DIMS: Distributed Information Management System

NoSQL / Big Data

Another seminal paper (Copyright 2003 ACM):

“The Google File System,” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

• These data processing systems are highly distributed but, …

• Each NoSQL database is a “large shared databank”

• Data cannot be combined for analytics across NoSQL databases

• NoSQL is an evolutionary step in data storage; it is not a paradigm shift in information management

Big Data

Hadoop is an excellent technology to use for transforming data of varying structures to formats useful for analytics

Hadoop also excels at handling very large amounts of disparate data

Virtual data needs a place to be materialized

Data Virtualization technologies provide a common structure and access methodology for disparate sets of data

[Figure: data virtualization architecture. Labels: RDB; Mappings (R2RML); RDB Schema (Source Ontology); SPARQL (data input); Domain Ontology; Rules (RIF); Inferred Data; SPARQL (data output); Data Validation & Analysis]

Data Virtualization

Hadoop

The RDF-based technology implementing a virtual data infrastructure is useful for Hadoop data transformations using MapReduce

– All of the disparate data sets in a Hadoop cluster can be organized with a common set of semantics provided by an R2RML map and a Domain Ontology

– Data transformations are made using a series of MapReduce jobs

– ETL becomes ELT

ELT

Extract, Load, and Transform is a fundamentally new paradigm facilitating enterprise analytics

– Data can be loaded in its native formats and structures

– Transformation activities take place after the data is loaded into a Hadoop cluster

– Hadoop and MapReduce are excellent technologies for data transformations at scale
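The ELT idea can be sketched in a few lines: records are loaded untouched in their native formats, and a map phase followed by a reduce phase transforms them afterwards. This is a single-process stand-in for MapReduce, not Hadoop code; the record layouts and the names `map_record`, `reduce_values`, and `run_job` are assumptions for illustration.

```python
import json
from collections import defaultdict

# Raw records loaded as-is, in two different native formats (hypothetical data).
loaded = [
    ("csv", "acme,120"),
    ("json", '{"customer": "acme", "amount": 30}'),
    ("csv", "globex,50"),
]


def map_record(fmt, raw):
    """Map phase: parse one raw record into (key, value) pairs."""
    if fmt == "csv":
        customer, amount = raw.split(",")
        yield customer, int(amount)
    elif fmt == "json":
        doc = json.loads(raw)
        yield doc["customer"], int(doc["amount"])


def reduce_values(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)


def run_job(records):
    groups = defaultdict(list)  # stands in for the shuffle step
    for fmt, raw in records:
        for key, value in map_record(fmt, raw):
            groups[key].append(value)
    return dict(reduce_values(k, v) for k, v in groups.items())


totals = run_job(loaded)
```

Because parsing happens inside the map phase, adding a new source format only means adding a branch to the transformation, not reworking the load step.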

Need to transform structure

– Relational -> RDF

– HDFS/HBase -> Tuples

– Merge data from multiple sets (federate)

– Basic query processing: join, aggregation, etc.

– Execute arbitrary user-defined analytical functions (UDFs)

Revelytix query engines already do these

– Spinner – federation, query processing, Hadoop-to-tuples

– Spyder – relational-to-RDF, query processing
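The operations listed above (federation, join, aggregation) can be sketched over plain tuples as follows. This is an illustrative sketch only and says nothing about how Spinner or Spyder are implemented; the sources, column positions, and function names are all hypothetical.

```python
from collections import defaultdict

# Two hypothetical order sources, already converted to tuples: (order, customer, amount)
orders_from_rdb = [("o1", "c1", 100), ("o2", "c2", 40)]
orders_from_hdfs = [("o3", "c1", 25)]
customers = [("c1", "Acme"), ("c2", "Globex")]  # (customer, name)


def federate(*sources):
    """Merge tuples from several sources into one stream."""
    for source in sources:
        yield from source


def hash_join(left, right, left_key, right_key):
    """Join two tuple streams on the given column positions."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    for row in left:
        for other in index.get(row[left_key], []):
            yield row + other


def total_by(rows, key, value):
    """Sum the `value` column grouped by the `key` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


joined = hash_join(federate(orders_from_rdb, orders_from_hdfs), customers, 1, 0)
revenue = total_by(joined, key=4, value=2)  # column 4 is the customer name
```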

Query Engine = Transformation Engine

[Figure: extract, load, and transform flow. Labels: Source Data (Relational Database, Triples); Extract; Hadoop/Cloud Infrastructure; Data (HDFS Files, HBase); Transform; Load, Index; Relational Database; Triples]

The big win is to leave the data in situ and define networked pipelines of transformations that move data through various processing stages.

Transforming Data in Hadoop

[Figure: dataflow pipeline. Panels: Definition, Design, Execution, Query; local and cloud ‘endpoints’; data set nodes (S, D, F, X) linked by transformations (T); Processing Pipeline; Data Flow]

Configure execution environments for parts of pipeline

Mix of materialized and virtual data sets… inter-linked by a set of transformations

Distributed Pipelined Processing
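One way to picture a mix of virtual and materialized data sets is with lazy generators: a virtual data set is a chain of transformations that has not run yet, and materializing it means running the chain and storing the result. The sketch below is a single-process analogy under that assumption; `pipeline`, `parse`, and `keep_positive` are hypothetical names.

```python
def pipeline(source, *transforms):
    """Chain transformations lazily; nothing runs until the result is consumed."""
    stream = iter(source)
    for transform in transforms:
        stream = transform(stream)
    return stream


def parse(lines):
    """Transformation: turn 'name=value' strings into (name, int) pairs."""
    for line in lines:
        name, value = line.split("=")
        yield name, int(value)


def keep_positive(pairs):
    """Transformation: drop pairs whose value is not positive."""
    return (pair for pair in pairs if pair[1] > 0)


raw = ["a=1", "b=-2", "c=3"]
virtual = pipeline(raw, parse, keep_positive)  # virtual: a recipe, not yet computed
materialized = list(virtual)                   # materialized: computed and stored
```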

Query Processing in Hadoop

Hadoop and SPARQL

Once the data sets have been transformed to a common set of semantics, SPARQL queries can be executed as a set of distributed MapReduce jobs

We must know the relationships between data sets

The descriptions of the relations need to be available at query time

[Figure: query execution. Labels: Query Client; Query Processor (several instances); Data (HDFS Files, HBase)]

The query processor is shipped to all Hadoop nodes for parallel processing, using the Hadoop MapReduce framework.

Query Execution in the Cloud

Query Processing


[Figure: query processing components. Labels: Spinner; Spyder; Hadoop Adapter; Data (HDFS Files, HBase)]

• Query processing can be done locally, remotely (in the cloud), or as a mix of both

• Many types of transformations can be done

– Basic query processing (SPARQL or SQL)

– Relational to graph (R2RML) transformations

– Federation over multiple sources or data sets

– Hadoop HDFS-to-Tuple and HBase-to-Tuple transformations

• We can plan and optimize across all these for maximum performance

Hadoop and RIF

Once the data sets have been transformed to a common set of semantics, RIF rules can be executed as a set of distributed MapReduce jobs

– Inference

– Classification

– Validation

– Compliance
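As a small illustration of the inference case, the sketch below applies one rule repeatedly until no new facts appear: if x has type C and C is a subclass of D, then x has type D. It is plain Python rather than RIF, and it runs in one process rather than as MapReduce jobs; the `ex:` terms and the function `infer_types` are hypothetical.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer_types(triples):
    """Apply: (x type C) and (C subClassOf D) => (x type D), until nothing new appears."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        superclasses = [(s, o) for s, p, o in facts if p == SUBCLASS]
        for s, p, o in list(facts):  # iterate over a copy so we can add to facts
            if p != TYPE:
                continue
            for sub, sup in superclasses:
                if o == sub and (s, TYPE, sup) not in facts:
                    facts.add((s, TYPE, sup))
                    changed = True
    return facts


facts = infer_types([
    ("ex:rex", TYPE, "ex:Dog"),
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
])
```

The same loop shape supports classification and validation: a validation rule would collect violations instead of adding new facts.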

Enable access to large volumes of data

Warehouse-style access

Enable a ‘processing pipeline’ in the cloud

Push processing into MapReduce infrastructure

Parallelize query execution

– Extreme scalability

Architectural flexibility

Why Use Hadoop?

Future Directions


Hadoop and Solr

Integration between Hadoop, Data Virtualization, and Solr provides massively scalable faceted search

– The common set of semantics, applied over disparate unstructured data sets provides a powerful paradigm for searching with facets over massive amounts of data
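Faceted search boils down to counting the values of a field among the documents that match the current filters. The sketch below shows that idea in plain Python; it does not use the Solr API, and the documents, field names, and `facet_counts` function are assumptions for illustration.

```python
from collections import Counter

# Hypothetical documents that already share a common set of field names.
docs = [
    {"type": "report", "region": "east"},
    {"type": "email", "region": "east"},
    {"type": "report", "region": "west"},
]


def facet_counts(documents, field, **filters):
    """Count values of `field` among documents matching all filters."""
    matching = (d for d in documents if all(d.get(k) == v for k, v in filters.items()))
    return Counter(d[field] for d in matching if field in d)


by_type = facet_counts(docs, "type", region="east")
all_regions = facet_counts(docs, "region")
```

The counts are only meaningful because every document uses the same field names, which is what the common semantics provide.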

What Are We Offering?

Seamless integration of virtual data and Hadoop

Linkage (relationships) between data sets, yielding…

– Provenance/traceability/lineage

– Metadata management and data visibility/understanding

– Powerful analytics infrastructure

Common data model, enabling…

– Mixing of relational and graph-based data

– Mixing of SQL and SPARQL queries

– Access to all cloud-based data

Optimization across heterogeneous data systems

The Shift is On

Distributed Information Management System

DIMS is available now

Questions

Visit Revelytix.com for more information

Thank You
