Big Data Without Big Change - Semtech June 2012 v1.5

Big Data Without Big Change SemTech West 2012 Michael Lang Revelytix


Page 1: Big Data Without Big Change - Semtech June 2012 v1.5

Big Data Without Big Change

SemTech West 2012

Michael Lang

Revelytix

Page 2: Big Data Without Big Change - Semtech June 2012 v1.5

Discussion Points

Review the RDBMS, ETL, and data warehouse data management paradigms

Compare those paradigms to data virtualization and Big Data

Propose “Bigger Data” in support of radically better analytic capability

Page 3: Big Data Without Big Change - Semtech June 2012 v1.5

The Last Forty Years

In 1970, E.F. Codd, then at the IBM Research Laboratory in San Jose, California, published a paper in Communications of the ACM:

“A Relational Model of Data for Large Shared Data Banks”

Codd wrote, “The problems treated here are those of data independence – the independence of application programs from the growth in data types and changes in data representation...”

This paper set in motion the architecture of data management systems for the next forty years. These systems are known as relational database management systems (RDBMS).

Page 4: Big Data Without Big Change - Semtech June 2012 v1.5

The Last Forty Years

Siloed Information Management Systems

– All data in a single shared databank

– Rigid schemas

– Data and metadata are different types of things

– Query processor only knows about its local data expressed in a fixed schema

– Excellent ACID / CRUD capability

Page 5: Big Data Without Big Change - Semtech June 2012 v1.5

The Age of Virtualization

DIMS: Distributed Information Management System

Page 6: Big Data Without Big Change - Semtech June 2012 v1.5

Virtualization

Hardware and operating system virtualization became widely available to enterprise IT around 2004 and brought great value to IT infrastructure

– Cloud-based deployment

– Extreme flexibility

– Efficient use of hardware resources

– Independence from operating systems

Leading to an enormous ROI for large enterprises

Page 7: Big Data Without Big Change - Semtech June 2012 v1.5

EDM

Hardware virtualization did not help with the problems associated with Enterprise Data Management

– Data remains distributed over many silos, even in cloud-based environments

– Meaning of data in independent silos is still obscure

– Schemas are still disparate

Page 8: Big Data Without Big Change - Semtech June 2012 v1.5

Data Virtualization

The advent of RDF, OWL, and SPARQL has created the technical foundation for building a completely virtualized data infrastructure

– All information can be managed in the same data model

– Any domain can be described at the schema level

– SPARQL provides a distributed query and transformation language

– R2RML provides mappings from native relational schemas to RDF

– Standards-based data virtualization is here to stay
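The mapping idea in the R2RML bullet can be sketched in a few lines of Python. Real R2RML mappings are written as RDF (usually Turtle) and executed by a mapping engine; the sketch below only imitates the core idea of building a subject IRI from a row's key and one predicate per column. The base IRI, the table layout, and the function name `row_to_triples` are assumptions made for illustration.

```python
# Hypothetical base IRI, used only for this illustration.
BASE = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    # The subject IRI is built from the table name and the row's key value.
    subject = f"{BASE}{table}/{row[key_column]}"
    triples = [(subject, RDF_TYPE, f"{BASE}{table}")]
    for column, value in row.items():
        # The key is already encoded in the subject; NULL columns yield no triple.
        if column == key_column or value is None:
            continue
        triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


employee = {"id": 7, "name": "Ada", "dept": "R&D", "manager": None}
triples = row_to_triples("employee", "id", employee)
for triple in triples:
    print(triple)
```

Each row becomes a small graph around one subject, which is what lets rows from different databases be merged later without agreeing on a shared table layout first.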

Page 9: Big Data Without Big Change - Semtech June 2012 v1.5

Data Virtualization

This paradigm assumes data is completely distributed, and that anyone/anything should be able to find it and use it

– RDF is the data model

– OWL is the schema model

– SPARQL is the query language

– URIs provide unique identifiers

– URLs provide the location
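To make the data model concrete, here is a minimal sketch of RDF-style triples and a single SPARQL-like triple pattern, written in plain Python rather than with an RDF library. The `ex:` identifiers and the `match` helper are hypothetical; a real deployment would use a SPARQL engine.

```python
def match(triples, pattern):
    """Yield variable bindings for one triple pattern.

    Terms that start with '?' are variables; all other terms must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Every statement is a (subject, predicate, object) triple of identifiers.
triples = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:acme", "ex:locatedIn", "ex:Austin"),
]

# Roughly: SELECT ?person WHERE { ?person ex:worksFor ex:acme }
results = list(match(triples, ("?person", "ex:worksFor", "ex:acme")))
print(results)
```

Because every source is reduced to the same three-part shape, the same pattern works no matter which system the triples came from.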

Page 10: Big Data Without Big Change - Semtech June 2012 v1.5

Data Abstraction

An RDBMS is an abstraction layer above an OS-based file system

– Makes it vastly simpler to work with local data

Data Virtualization is an abstraction layer above multiple RDBMSs and/or other sources of data

– Makes it vastly simpler to work with distributed data

– Distributed Information Management System

Page 11: Big Data Without Big Change - Semtech June 2012 v1.5

Caveats

Queries over virtualized data generally do not perform as well as queries over locally managed data

Data virtualization depends on sophisticated transformation of complex and unstructured data

Page 12: Big Data Without Big Change - Semtech June 2012 v1.5

Bigger Data: Hadoop and Virtual Data

DIMS: Distributed Information Management System

Page 13: Big Data Without Big Change - Semtech June 2012 v1.5

NoSQL / Big Data

Another seminal paper (copyright 2003 ACM):

“The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

• These data processing systems are highly distributed, but…

• Each NoSQL database is a “large shared databank”

• Data cannot be combined for analytics across NoSQL databases

• NoSQL is an evolutionary step in data storage; it is not a paradigm shift in information management

Page 14: Big Data Without Big Change - Semtech June 2012 v1.5

Big Data

Hadoop is an excellent technology to use for transforming data of varying structures to formats useful for analytics

Hadoop also excels at handling very large amounts of disparate data

Virtual data needs a place to be materialized

Data Virtualization technologies provide a common structure and access methodology for disparate sets of data

Page 15: Big Data Without Big Change - Semtech June 2012 v1.5

Data Virtualization

[Diagram. Labels: RDB (two sources); Mappings (R2RML); RDB Schema (Source Ontology); SPARQL (data input); Domain Ontology; Rules (RIF); Inferred Data; SPARQL (data output); Data Validation & Analysis]

Page 16: Big Data Without Big Change - Semtech June 2012 v1.5

Hadoop

The RDF-based technology implementing a virtual data infrastructure is useful for Hadoop data transformations using MapReduce

– All of the disparate data sets in a Hadoop cluster can be organized with a common set of semantics provided by an R2RML map and a Domain Ontology

– Data transformations are made using a series of MapReduce jobs

– ETL becomes ELT
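The bullets above can be illustrated with an in-memory imitation of a MapReduce job. This is a sketch under stated assumptions, not Hadoop code: the two source layouts (`crm`, `billing`), the `FIELD_MAP` table standing in for the mapping and ontology, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mappings from two source layouts to one common vocabulary.
FIELD_MAP = {
    "crm": {"cust_id": "customer", "amt": "amount"},
    "billing": {"customer_no": "customer", "total": "amount"},
}


def map_phase(source, record):
    """Rename source-specific fields to the common vocabulary, then emit (key, value)."""
    mapping = FIELD_MAP[source]
    common = {mapping[k]: v for k, v in record.items() if k in mapping}
    yield common["customer"], float(common["amount"])


def reduce_phase(key, values):
    """Combine all values that share a key."""
    return key, sum(values)


def run_job(inputs):
    groups = defaultdict(list)  # stands in for the shuffle step
    for source, record in inputs:
        for key, value in map_phase(source, record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())


inputs = [
    ("crm", {"cust_id": "C1", "amt": "10.5"}),
    ("billing", {"customer_no": "C1", "total": "4.5"}),
    ("billing", {"customer_no": "C2", "total": "3"}),
]
totals = run_job(inputs)
print(totals)  # {'C1': 15.0, 'C2': 3.0}
```

The map step is where the common semantics are applied; once every record uses the same field names, the reduce step no longer needs to know which source a value came from.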

Page 17: Big Data Without Big Change - Semtech June 2012 v1.5

ELT

Extract, Load, and Transform is a fundamentally new paradigm facilitating enterprise analytics

– Data can be loaded in its native formats and structures

– Transformation activities take place after the data is loaded into a Hadoop cluster

– Hadoop and MapReduce are excellent technologies for data transformations at scale
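A minimal sketch of the load-first, transform-later order, using only the Python standard library. The list `raw_store` stands in for files landed in a Hadoop cluster, and the `load` and `transform` names are hypothetical.

```python
import csv
import io
import json

raw_store = []  # stands in for raw files landed in the cluster


def load(fmt, payload):
    """Load step: store the payload untouched, in its native format."""
    raw_store.append((fmt, payload))


def transform():
    """Transform step: parse each native format only when the data is read."""
    for fmt, payload in raw_store:
        if fmt == "csv":
            yield from csv.DictReader(io.StringIO(payload))
        elif fmt == "json":
            for line in payload.splitlines():
                if line.strip():
                    yield json.loads(line)


load("csv", "id,name\n1,Ada\n2,Grace\n")
load("json", '{"id": "3", "name": "Edsger"}\n')

records = list(transform())
names = [record["name"] for record in records]
print(names)  # ['Ada', 'Grace', 'Edsger']
```

Because nothing is parsed at load time, a new format can be landed immediately and given a transformation later, which is the practical difference from ETL.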

Page 18: Big Data Without Big Change - Semtech June 2012 v1.5

Need to transform structure

– Relational -> RDF

– HDFS/HBase -> Tuples

– Merge data from multiple sets (federate)

– Basic query processing: join, aggregation, etc.

– Execute arbitrary user-defined analytical functions (UDFs)

Revelytix query engines already do these

– Spinner: federation, query processing, Hadoop-to-tuples

– Spyder: relational-to-RDF, query processing

Query Engine = Transformation Engine

Page 19: Big Data Without Big Change - Semtech June 2012 v1.5

Transforming Data in Hadoop

The big win is to leave the data in situ and define networked pipelines of transformations that move data through various processing stages.

[Diagram. Labels: Source Data (Relational Database, Triples); Extract; Hadoop/Cloud Infrastructure (Data: HDFS Files, HBase); Transform; Load, Index (Relational Database, Triples)]

Page 20: Big Data Without Big Change - Semtech June 2012 v1.5

Distributed Pipelined Processing

Mix of materialized and virtual data sets, inter-linked by a set of transformations

Configure execution environments (local, cloud) for parts of the pipeline

[Diagram: Dataflow Pipeline. Phases: Definition, Design, Execution, Query. Node labels: S1a, S1b, S2–S8, D1–D3, F1, X1, X5, X6a, X6b, X8, with ‘endpoints’; transformations are marked T; arrows show the data flow through the processing pipeline]
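The difference between a virtual and a materialized data set in such a pipeline can be sketched with Python callables. The helper names `virtual` and `materialize` and the sample rows are assumptions for illustration; they do not describe any product API.

```python
def virtual(source, transform):
    """A virtual data set: recomputed from its source every time it is read."""
    return lambda: (transform(item) for item in source())


def materialize(dataset):
    """A materialized data set: computed once and stored."""
    stored = list(dataset())
    return lambda: iter(stored)


source_rows = [{"name": "ada", "score": "90"}, {"name": "grace", "score": "85"}]

s1 = lambda: iter(source_rows)                                   # source data, left in place
s2 = virtual(s1, lambda r: {**r, "score": int(r["score"])})      # transformation T1
s3 = materialize(virtual(s2, lambda r: {**r, "name": r["name"].title()}))  # T2, stored

# A later change to the source is visible through the virtual set only.
source_rows.append({"name": "edsger", "score": "70"})
virtual_count = len(list(s2()))
materialized_count = len(list(s3()))
print(virtual_count, materialized_count)  # 3 2
```

Virtual sets stay current at the cost of recomputation on every read; materialized sets are fast to read but must be refreshed, which is the trade-off a pipeline designer makes per stage.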

Page 21: Big Data Without Big Change - Semtech June 2012 v1.5

Query Processing in Hadoop

Page 22: Big Data Without Big Change - Semtech June 2012 v1.5

Hadoop and SPARQL

Once the data sets have been transformed to a common set of semantics, SPARQL queries can be executed as a set of distributed MapReduce jobs

We must know the relationships between data sets

The descriptions of the relations need to be available at query time
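One way to picture why the relationships must be known at query time is a reduce-side join, simulated here in memory. This is a sketch, not the query engine described in the talk: the `RELATION` table, the data set names, and the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical description of how two data sets relate: the field each one joins on.
RELATION = {"orders": "customer_id", "customers": "id"}


def map_phase(dataset, record):
    """Emit each record under its join key, tagged with the data set it came from."""
    yield record[RELATION[dataset]], (dataset, record)


def reduce_phase(key, tagged):
    """Pair every customer with every order that shares the join key."""
    customers = [r for d, r in tagged if d == "customers"]
    orders = [r for d, r in tagged if d == "orders"]
    for customer in customers:
        for order in orders:
            yield {"customer": customer["name"], "item": order["item"]}


def join(datasets):
    groups = defaultdict(list)  # stands in for the shuffle step
    for name, records in datasets.items():
        for record in records:
            for key, value in map_phase(name, record):
                groups[key].append(value)
    return [row for key in sorted(groups) for row in reduce_phase(key, groups[key])]


datasets = {
    "customers": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}],
    "orders": [
        {"customer_id": 1, "item": "disk"},
        {"customer_id": 1, "item": "tape"},
        {"customer_id": 3, "item": "card"},
    ],
}
rows = join(datasets)
print(rows)
```

Without the `RELATION` description the map step would not know which field to emit as the key, so no join could be planned.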

Page 23: Big Data Without Big Change - Semtech June 2012 v1.5

Query Execution in the Cloud

The query processor is shipped to all Hadoop nodes for parallel processing, using the Hadoop MapReduce framework.

[Diagram. Labels: Query Client; Query Processor (one copy per node); Hadoop/Cloud Infrastructure; Data: HDFS Files, HBase]

Page 24: Big Data Without Big Change - Semtech June 2012 v1.5

Query Processing

[Diagram. Labels: Spinner; Spyder; Hadoop Adapter; Hadoop/Cloud Infrastructure (Data: HDFS Files, HBase), shown twice]

• Query processing can be done locally, remotely (in the cloud), or as a mix of both

• Many types of transformations can be done

– Basic query processing (SPARQL or SQL)

– Relational to graph (R2RML) transformations

– Federation over multiple sources or data sets

– Hadoop HDFS-to-Tuple and HBase-to-Tuple transformations

• We can plan and optimize across all these for maximum performance

Page 25: Big Data Without Big Change - Semtech June 2012 v1.5

Hadoop and RIF

Once the data sets have been transformed to a common set of semantics, RIF rules can be executed as a set of distributed MapReduce jobs

– Inference

– Classification

– Validation

– Compliance
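As an illustration of the inference and classification items, here is a small forward-chaining sketch in plain Python. It is not RIF syntax and not a distributed implementation; it hard-codes one rule (an instance of a class is also an instance of its superclasses), and the `ex:` identifiers are hypothetical.

```python
def infer_types(triples):
    """Apply one rule until nothing new is derived:
    if ?x rdf:type ?c and ?c rdfs:subClassOf ?d, then ?x rdf:type ?d.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in facts if p == "rdfs:subClassOf"]
        for s, p, o in list(facts):
            if p != "rdf:type":
                continue
            for child, parent in subclass:
                new = (s, "rdf:type", parent)
                if o == child and new not in facts:
                    facts.add(new)
                    changed = True
    return facts


asserted = {
    ("ex:acct1", "rdf:type", "ex:SavingsAccount"),
    ("ex:SavingsAccount", "rdfs:subClassOf", "ex:Account"),
    ("ex:Account", "rdfs:subClassOf", "ex:FinancialProduct"),
}
inferred = infer_types(asserted) - asserted
print(sorted(inferred))
```

Validation and compliance checks follow the same shape: a rule derives a fact such as a violation marker, and the derived facts are then queried like any other data.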

Page 26: Big Data Without Big Change - Semtech June 2012 v1.5

Why Use Hadoop?

Enable access to large volumes of data

Warehouse-style access

Enable a ‘processing pipeline’ in the cloud

Push processing into MapReduce infrastructure

Parallelize query execution

– Extreme scalability

Architectural flexibility

Page 27: Big Data Without Big Change - Semtech June 2012 v1.5

Future Directions


Page 28: Big Data Without Big Change - Semtech June 2012 v1.5

Hadoop and Solr

Integration between Hadoop, Data Virtualization, and Solr provides massively scalable faceted search

– The common set of semantics, applied over disparate unstructured data sets, provides a powerful paradigm for searching with facets over massive amounts of data
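Faceted search over records that already share field names can be sketched as follows. This does not use the Solr API; it only shows the counting and drill-down that facets provide, and the field names and sample documents are hypothetical.

```python
from collections import Counter


def facet_counts(documents, facet_fields):
    """Count how many documents carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for doc in documents:
        for field in facet_fields:
            if field in doc:
                counts[field][doc[field]] += 1
    return counts


def drill_down(documents, **filters):
    """Keep only the documents that match every selected facet value."""
    return [d for d in documents if all(d.get(k) == v for k, v in filters.items())]


docs = [
    {"type": "report", "year": 2011},
    {"type": "report", "year": 2012},
    {"type": "email", "year": 2012},
]
facets = facet_counts(docs, ["type", "year"])
reports = drill_down(docs, type="report")
print(facets["type"], len(reports))
```

The facets are only meaningful because every document uses the same field names, which is what the common set of semantics supplies.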

Page 29: Big Data Without Big Change - Semtech June 2012 v1.5

What Are We Offering?

Seamless integration of virtual data and Hadoop

Linkage (relationships) between data sets, yielding…

– Provenance/traceability/lineage

– Metadata management and data visibility/understanding

– Powerful analytics infrastructure

Common data model, enabling…

– Mixing of relational and graph-based data

– Mixing of SQL and SPARQL queries

– Access to all cloud-based data

Optimization across heterogeneous data systems

Page 30: Big Data Without Big Change - Semtech June 2012 v1.5

The Shift is On

Distributed Information Management System

DIMS is available now

Page 31: Big Data Without Big Change - Semtech June 2012 v1.5

Questions

Visit Revelytix.com for more information

Thank You