Riga Dev Day 2016: Adding a Data Reservoir and Oracle BDD to Extend Your Oracle DW Platform

[email protected] www.rittmanmead.com @rittmanmead Adding a Data Reservoir and Oracle Big Data Discovery to extend your Oracle Data Warehouse Platform Mark Rittman, CTO, Rittman Mead Riga Dev Day 2016, Riga, March 2016

Upload: mark-rittman

Post on 12-Apr-2017

TRANSCRIPT

Page 1: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

[email protected] www.rittmanmead.com @rittmanmead

Adding a Data Reservoir and Oracle Big Data Discovery to extend your Oracle Data Warehouse Platform

Mark Rittman, CTO, Rittman Mead Riga Dev Day 2016, Riga, March 2016

Page 2: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Mark Rittman, Co-Founder of Rittman Mead
‣Oracle ACE Director, specialising in Oracle BI&DW
‣14 Years Experience with Oracle Technology
‣Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
‣Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog
•Email : [email protected]
•Twitter : @markrittman

About the Speaker

Page 3: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Started back in 1997 on a bank Oracle DW project
•Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
•Went on to use Oracle Developer/2000 and Designer/2000
•Our initial users queried the DW using SQL*Plus
•And later on, we rolled out Discoverer/2000 to everyone else
•And life was fun…

15+ Years in Oracle BI and Data Warehousing

Page 4: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Over time, this data warehouse architecture developed
•Added Oracle Warehouse Builder to automate and model the DW build
•Oracle 9i Application Server (yay!) to deliver reports and web portals
•Data Mining and OLAP in the database
•Oracle 9i for in-database ETL (and RAC)
•Data was typically loaded from Oracle RDBMS and EBS

• It was turtles Oracle all the way down…

The Oracle-Centric DW Architecture

Page 5: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Many customers and organisations are now running initiatives around “big data”
•Some are IT-led and are looking for cost savings around data warehouse storage + ETL
•Others are “skunkworks” projects in the marketing department that are now scaling up
•Projects are now emerging from pilot exercises
•And design patterns are starting to emerge

Many Organisations are Running Big Data Initiatives

Page 6: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Typical implementation of Hadoop and big data in an analytic context is the “data lake”
•Additional data storage platform with cheap storage, flexible schema support + compute
•Data lands in the data lake or reservoir in raw form, then is minimally processed
•Data is then accessed directly by “data scientists”, or processed further into the DW

Common Big Data Design Pattern : “Data Reservoir”

[Figure: data reservoir reference architecture. Labelled components: Data Transfer; Data Access; Data Factory; Data Reservoir; Business Intelligence Tools; Hadoop Platform; File Based Integration; Stream Based Integration; ETL Based Integration; Data streams; Operational Data (Transactions, Customer Master Data); Unstructured Data (Voice + Chat Transcripts); Raw Customer Data (data stored in the original format, usually files, such as SS7, ASN.1, JSON etc.); Mapped Customer Data (datasets produced by mapping and transforming raw data); Discovery & Development Labs (safe & secure discovery and development environment: datasets and samples, models and programs); Marketing/Sales Applications; Models; Machine Learning; Segments]

Page 7: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

So What is a Data Reservoir?

Page 8: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

What Does it Do?

Page 9: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

And Does it Replace My Data Warehouse?

Page 10: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

So What is a Data Reservoir?

Page 11: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Data from Real-Time, Social & Internet Sources is Strange

[Figure labels: Single Customer View; Enriched Customer Profile; Correlating; Modeling; Machine Learning; Scoring]

•Typically comes in non-tabular form
•JSON, log files, key/value pairs
•Users often want it speculatively
‣Haven’t thought through final purpose
•Schema can change over time
‣Or maybe there isn’t even one
•But the end-users want it now
‣Not when your ETL team are next free

Page 12: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Data Warehouse Loading Requires Formal ETL and Modeling

[Figure labels: Analytic DBMS Node ($1m); ETL; Data Model; ETL Developer; Data Modeller; Curated Data]

ETL development takes time, is fragile, but results in well-curated data
But what about data whose schema is not known?
Or whose final use has not yet been determined?

Dimensional data modelling gives structure to the data for business users
But also restricts how that data can be analysed
What if the end-user is better placed to apply that schema?

Page 13: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

… And Are Limited in What They Can Store (Economically)

[Figure labels: four “Analytic DBMS Node” boxes at $1m each, each with Compute; DB Instance / Single DB Instance; ETL; Data Model; ETL Developer; Data Modeller]

Page 14: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•A new approach to data processing and data storage
•Rather than a small number of large, powerful servers, it spreads processing over large numbers of small, cheap, redundant servers
•Spreads the data you’re processing over lots of distributed nodes
•Has a scheduling/workload process that sends parts of a job to each of the nodes
•And does the processing where the data sits
•Shared-nothing architecture
•Low-cost and highly horizontally scalable

Introducing Hadoop - Cheap, Flexible Storage + Compute

[Figure labels: Job Tracker; Task Trackers; Data Nodes]
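
The bullets above can be illustrated with a small sketch. This is plain Python standing in for the pattern, not Hadoop code: the records are split across a number of "nodes", the same task runs against each node's own slice, and a scheduler merges the partial results. The function names and the sample log records are made up for the illustration.

```python
from collections import Counter
from functools import reduce


def partition(records, num_nodes):
    """Spread records across nodes round-robin (a stand-in for block placement)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def local_count(node_records):
    """The task each node runs against its own slice of the data."""
    return Counter(rec["page"] for rec in node_records)


def run_job(records, num_nodes=4):
    """Scheduler: send the same task to every node, then merge the partial results."""
    partials = [local_count(chunk) for chunk in partition(records, num_nodes)]
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical website log records
logs = [{"page": "/blog"}, {"page": "/about"}, {"page": "/blog"},
        {"page": "/blog"}, {"page": "/training"}]
page_counts = run_job(logs, num_nodes=3)
```

The result does not depend on how many nodes the work is spread over, which is what makes adding cheap nodes a practical way to scale.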

Page 15: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Hadoop & NoSQL better suited to exploratory analysis of newly-arrived data
‣Flexible schema - applied by user rather than ETL
‣Cheap expandable storage for detail-level data
‣Better native support for machine-learning and data discovery tools and processes
‣Potentially a great fit for our new and emerging customer 360 datasets, and great platform for analysis

Introducing Hadoop - Cheap, Flexible Storage + Compute
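
"Flexible schema, applied by the user" is often called schema-on-read. A minimal sketch in Python, assuming the raw data is stored as one JSON document per line; the field names and sample records are invented for the example. Nothing is rejected or reshaped at load time, and each reader projects the raw records onto the fields it cares about.

```python
import json

# Raw records are kept exactly as they arrived; not every record has every field
raw_lines = [
    '{"user": "alice", "page": "/blog", "ts": "2016-03-01T10:00:00"}',
    '{"user": "bob", "page": "/about"}',
    '{"user": "carol", "page": "/blog", "referrer": "twitter.com"}',
]


def read_with_schema(lines, fields, default=None):
    """Apply a schema at read time: keep only the requested fields, fill gaps."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field, default) for field in fields}


# Two readers, two different schemas over the same raw data
page_views = list(read_with_schema(raw_lines, ["user", "page"]))
referrals = list(read_with_schema(raw_lines, ["page", "referrer"]))
```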

Page 16: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Hadoop Designed for Real-Time Storage of Raw Data Feeds

[Figure labels: Hadoop Node ($50k); Real-time Feeds (Voice + Chat Transcripts, Call Center Logs, Chat Logs, iBeacon Logs, Website Logs); Raw Data]

Page 17: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Supplement with Batch + API Loads of ERP + 3rd Party Data

[Figure labels: Hadoop Node ($50k); Real-time Feeds (Voice + Chat Transcripts, Call Center Logs, Chat Logs, iBeacon Logs, Website Logs); Batch Loads and APIs, Web Service Calls (CRM Data, Transactions, Social Feeds, Demographics); Raw Data]

Page 18: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Supplement with Batch + API Loads of ERP + 3rd Party Data

[Figure labels: Hadoop Node ($50k); Real-time Feeds, batch and API (Voice + Chat Transcripts, Call Center Logs, Chat Logs, iBeacon Logs, Website Logs, CRM Data, Transactions, Social Feeds, Demographics); Raw Data; Customer 360 Apps; Predictive Models; SQL-on-Hadoop; Business analytics]

Page 19: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Hadoop-Based Storage & Compute : A Better Logical Fit

[Figure labels: Real-time Feeds, batch and API (Voice + Chat Transcripts, Call Center Logs, Chat Logs, iBeacon Logs, Website Logs, CRM Data, Transactions, Social Feeds, Demographics); multiple Hadoop Nodes at $50k each; Hadoop Data Reservoir; Enriched Customer Profile; Modeling; Scoring]

Raw customer data stored at detail
Enriched and processed for insights

Page 20: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Combine with DW for Big Data Management Platform

Page 21: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Introducing the “Data Lab” for Raw/Unstructured Data

collects new and existing data as raw materials

provides broad access to a trusted community

enables agile experimentation, ability to fail fast

offers a ‘do anything’ sandbox environment

Page 22: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Specialist skills typically needed to ingest and understand data coming into Hadoop
•Data loaded into the reservoir needs preparation and curation before presenting to users
•But we’ve heard a similar story before, from the world of BI…

But … Getting Value Out of Unstructured Data is Hard

Page 23: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform
Page 24: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Part of the acquisition of Endeca back in 2012 by Oracle Corporation
•Based on search technology and concept of “faceted search”
•Data stored in flexible NoSQL-style in-memory database called “Endeca Server”
•Added aggregation, text analytics and text enrichment features for “data discovery”
‣Explore data in raw form, loose connections, navigate via search rather than hierarchies
‣Useful to find out what is relevant and valuable in a dataset before formal modeling

What Was Oracle Endeca Information Discovery?

Page 25: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Proprietary database engine focused on search and analytics
•Data organized as records, made up of attributes stored as key/value pairs
•No over-arching schema, no tables, self-describing attributes
•Endeca Server hallmarks:
‣Minimal upfront design
‣Support for “jagged” data
‣Administered via web service calls
‣“No data left behind”
‣“Load and Go”
•But … limited in scale (>1m records)
‣… what if it could be rebuilt on Hadoop?

Endeca Server Technology Combined Search + Analytics
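
To make "records made up of key/value attributes" and "jagged" data concrete, here is a small Python sketch. It is not Endeca Server code; the records and the `profile_attributes` helper are hypothetical. Each record carries only the attributes it actually has, and a profile of attribute coverage can be built without declaring any schema first.

```python
from collections import defaultdict

# "Jagged" records: each one is a bag of attributes, and the sets differ
records = [
    {"id": 1, "title": "OBIEE tips", "author": "Mark"},
    {"id": 2, "title": "Hadoop intro", "tags": ["hadoop", "hive"]},
    {"id": 3, "author": "Robin", "tags": ["obiee"], "rating": 5},
]


def profile_attributes(recs):
    """Count how many records carry each attribute; no schema is declared up front."""
    coverage = defaultdict(int)
    for rec in recs:
        for attr in rec:
            coverage[attr] += 1
    return dict(coverage)


coverage = profile_attributes(records)
```

A new attribute appearing in tomorrow's records simply shows up in the profile, which is the "load and go" property the slide refers to.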

Page 26: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•A visual front-end to the Hadoop data reservoir, providing end-user access to datasets
•Catalog, profile, analyse and combine schema-on-read datasets across the Hadoop cluster
•Visualize and search datasets to gain insights, potentially load in summary form into DW

Oracle Big Data Discovery

Page 27: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Provide a visual catalog and search function across data in the data reservoir
•Profile and understand data, relationships, data quality issues
•Apply simple changes, enrichment to incoming data
•Visualize datasets including combinations (joins)
‣Bring in additional data from JDBC + file sources
•Prepare more structured Hadoop datasets for use with OBIEE

Typical Big Data Discovery Use-Cases

Page 28: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Very good tool for analysing and cataloging the data in the production data reservoir
•Basic BI-type reporting against datasets, joined together, filtered, transformed etc
•Cross-functional and cross-subject area analysis and visualization
•Great complement to OBIEE
‣Able to easily access semi-structured data
‣Familiar BI-type graph components
‣Ideal way to give analysts access to semi-structured customer datasets

BDD for Reporting and Cataloging the Data Reservoir

Page 29: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Visual Analyzer also provides a form of “data discovery” for BI users
‣Similar to Tableau, Qlikview etc
‣Inspired by BI elements of OEID
•Uses OBIEE RPD as the primary datasource, so data needs to be curated + structured
•Probably a better option for users who aren’t concerned that it’s “big data”
•But can still connect to Hadoop via Hive, Impala and Oracle Big Data SQL

Comparing BDD to Oracle Visual Analyzer

Page 30: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Data in the data reservoir is typically raw, and hasn’t been organised into facts and dimensions yet
•In this initial phase, you don’t want it to be - too much up-front work with unknown data
•Later on though, users will benefit from structure and hierarchies being added to data
•But this takes work, and you need to understand the cost/benefit of doing it now vs. later

Managed vs. Free-Form Data Discovery

Page 31: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Adds additional nodes into the CDH5.3 cluster, for running DGraph and Studio
‣DGraph engine based on Endeca Server technology, can also be clustered
•Hive (HCatalog) used for reading table metadata, mapping back to underlying HDFS files
•Apache Spark then used to upload (ingest) data into DGraph, typically 1m row sample
•Data then held for online analysis in DGraph
•Option to write back transformations to underlying Hive/HDFS files using Apache Spark

Oracle Big Data Discovery Architecture
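
The slides do not say how the 1m row sample is drawn. As an illustration only, one standard way to take a fixed-size uniform random sample from a table too large to hold in memory is reservoir sampling (Algorithm R). The sketch below is generic Python, not BDD or Spark code, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows
            sample.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large Hive table: 10,000 row ids sampled down to 100
sampled = reservoir_sample(range(10_000), k=100, seed=1)
```

If the source has fewer rows than the sample size, every row is kept, which matches the idea that small datasets are ingested in full.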

Page 32: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Big Data Discovery Studio then provides web-based tools for organising, searching datasets
•Visualize Data, Transform and Export back to Hadoop

Search, Catalog, Join and Analyse Data Using BDD Studio

Page 33: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Rittman Mead want to understand drivers and audience for their website
‣What is our most popular content? Who are the most in-demand blog authors?
‣Who are the influencers? What do they read?
•Three data sources in scope:

Example Scenario : Social Media Analysis

RM Website Logs; Twitter Stream; Website Posts, Comments etc

Page 34: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Enrich and transform incoming schema-on-read social media data to add value
•Do some initial profiling, analysis to understand datasets
•Provide a user interface for the Big Data team
•Provide a means to identify candidate dimensions and facts for a more structured (managed) data discovery tool like Visual Analyzer

Objective of Exercises

Combine with site content, semantics, text enrichment
Catalog and explore using Oracle Big Data Discovery

Page 35: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Data Sources used for Data Discovery Exercise

[Figure: architecture used for the exercise. Cloudera CDH5.3 BDA Hadoop Cluster (nodes running Spark, Hive, HDFS and BDD Data Processing); BDD Node (Hive Client, HDFS Client, BDD DGraph Gateway, BDD Studio Web UI). Labelled steps: ingest semi-processed logs (1m rows); ingest processed Twitter activity; upload site page and comment contents; persist uploaded DGraph content in Hive / HDFS; write back transformations to full datasets; data discovery using Studio web-based app]

Page 36: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Datasets in Hive have to be ingested into DGraph engine before analysis, transformation
•Can either define an automatic Hive table detector process, or manually upload
•Typically ingests 1m row random sample
‣1m row sample provides > 99% confidence that answer is within 2% of value shown, no matter how big the full dataset (1m, 1b, 1q+)
‣Makes interactivity cheap - representative dataset

Ingesting & Sampling Datasets for the DGraph Engine

[Chart labels: Amount of data queried; Cost; Accuracy; “The 100% premium”]
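
The sampling claim above can be sanity-checked with the textbook margin-of-error formula for a proportion estimated from a simple random sample, z * sqrt(p(1 - p) / n). This is an assumption about what the slide's figures refer to, not Oracle's own derivation; z = 2.576 is the usual two-sided value for 99% confidence, and p = 0.5 is the worst case.

```python
import math


def margin_of_error(n, z=2.576, p=0.5):
    """Worst-case margin of error for a proportion from a simple random sample of n rows."""
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(1_000_000)
print(f"{moe:.4%}")  # prints 0.1288%
```

Under these assumptions a 1m row sample is accurate to roughly 0.13 percentage points at 99% confidence, comfortably inside the 2% quoted. The formula has no term for the size of the full dataset, which is why the bound holds however large the underlying table is.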

Page 37: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Tweets and Website Log Activity stored already in data reservoir as Hive tables
•Upload triggered by manual call to BDD Data Processing CLI
‣Runs Oozie job in the background to profile, enrich and then ingest data into DGraph

Ingesting Site Activity and Tweet Data into DGraph

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_per_post_cat_author
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t rm_linked_tweets

[Figure labels: Hive (pageviews, X rows); Apache Spark (pageviews >1m rows; Profiling pageviews >1m rows; Enrichment pageviews >1m rows); BDD (pageviews >1m rows)]

{
  "@class" : "com.oracle.endeca.pdi.client.config.workflow.ProvisionDataSetFromHiveConfig",
  "hiveTableName" : "rm_linked_tweets",
  "hiveDatabaseName" : "default",
  "newCollectionName" : "edp_cli_edp_a5dbdb38-b065…",
  "runEnrichment" : true,
  "maxRecordsForNewDataSet" : 1000000,
  "languageOverride" : "unknown"
}

Page 38: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Ingesting Project Data into BDD DGraph

[oracle@bigdatalite ~]$ cd /home/oracle/movie/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t access_per_post_cat_author
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t rm_linked_tweets

Page 39: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Ingested datasets are now visible in Big Data Discovery Studio
•Create new project from first dataset, then add second

View Ingested Datasets, Create New Project

Page 40: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Ingestion process has automatically geo-coded host IP addresses
•Other automatic enrichments run after initial discovery step, based on datatypes, content

Automatic Enrichment of Ingested Datasets

Page 41: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•For the ACCESS_PER_POST_CAT_AUTHORS dataset, 18 attributes now available
•Combination of original attributes, and derived attributes added by enrichment process

Initial Data Exploration On Uploaded Dataset Attributes

Page 42: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Click on individual attributes to view more details about them
•Add to scratchpad, automatically selects most relevant data visualisation

Explore Attribute Values, Distribution using Scratchpad

Page 43: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Click on the Filter button to display a refinement list

Filter (Refine) Visualizations in Scratchpad

Page 44: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Select refinement (filter) values from refinement pane
•Visualization in scratchpad now filtered by that attribute
‣Repeat to filter by multiple attribute values

Display Refined Data Visualization

Page 45: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•For visualisations you want to keep, you can add them to Discovery page
•Dashboard / faceted search part of BDD Studio - we’ll see more later

Save Scratchpad Visualization to Discovery Page

Page 46: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Select AUTHOR attribute, see initial ordered values, distribution
•Add attribute POST_DATE
‣choose between multiple instances of first attribute split by second
‣or one visualisation with multiple series

Select Multiple Attributes for Same Visualization

Page 47: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Data ingest process automatically applies some enrichments - geocoding etc
•Can apply others from Transformation page - simple transformations & Groovy expressions

Data Transformation & Enrichment

Page 48: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Group and bin attribute values; filter on attribute values, etc
•Use Transformation Editor for custom transformations (Groovy, incl. enrichment functions)

Standard Transformations - Simple & Using Editor
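
As an illustration of what "binning" an attribute means, here is a small sketch. BDD expresses custom transformations in Groovy through its Transformation Editor; the Python below is only a stand-in for the logic, and the bin edges, labels and pageview counts are invented.

```python
from collections import Counter


def bin_value(value, edges, labels):
    """Map a number to a label; labels has one more entry than edges."""
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]  # value is at or above the last edge


# Hypothetical pageview counts per post, binned into three bands
pageviews = [12, 450, 99, 100, 5000, 1000]
bands = [bin_value(v, edges=[100, 1000], labels=["low", "medium", "high"])
         for v in pageviews]
band_counts = Counter(bands)
```

Grouping on the binned attribute then gives a small number of categories that are easier to chart and filter than the raw numeric values.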

Page 49: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Datatypes can be converted into other datatypes, with data transformed if required
•Example : convert Apache Combined Format Log date/time to Java date/time

Datatype Conversion Example : String to Date / Time
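
The same string-to-date conversion can be sketched outside BDD. The slide's version produces a Java date/time from within Studio; the Python below only shows the parsing step, assuming the usual Apache Combined Log Format timestamp layout (day/month/year:hour:minute:second zone). The sample timestamp is made up, and month-name parsing assumes an English locale.

```python
from datetime import datetime

# Apache Combined Log Format timestamp, e.g. [10/Oct/2015:13:55:36 -0700]
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_timestamp(raw):
    """Convert the bracketed log timestamp string into a timezone-aware datetime."""
    return datetime.strptime(raw.strip("[]"), APACHE_TS_FORMAT)


ts = parse_apache_timestamp("[10/Oct/2015:13:55:36 -0700]")
print(ts.isoformat())  # prints 2015-10-10T13:55:36-07:00
```

Once the value is a real date/time rather than a string, it can be sorted, bucketed by day or hour, and used on a time axis.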

Page 50: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Uses Salience text engine under the covers
•Extract terms, sentiment, noun groups, positive / negative words etc

Transformations using Text Enrichment / Parsing

Page 51: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Choose option to Create New Attribute, to add derived attribute to dataset
•Preview changes, then save to transformation script

Create New Attribute using Derived (Transformed) Values

Page 52: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Transformation changes have to be committed to DGraph sample of dataset
‣Project transformations kept separate from other project copies of dataset
•Transformations can also be applied to full dataset, using Apache Spark
‣Creates new Hive table of complete dataset

Commit Transforms to DGraph, or Create New Hive Table

Page 53: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Users can upload their own datasets into BDD, from MS Excel or CSV file
•Uploaded data is first loaded into Hive table, then sampled/ingested as normal

Upload Additional Datasets

Page 54: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Used to create a dataset based on the intersection (typically) of two datasets
•Not required to just view two or more datasets together - think of this as a JOIN and SELECT

Join Datasets On Common Attributes

Page 55: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Tweets ingested into data reservoir can reference a page URL
•Site Content dataset contains title, content, keywords etc for RM website pages
•We would like to add these details to the tweets where an RM web page was mentioned
‣And also add page author details missing from the site contents upload

Join Example : Add Post + Author Details to Tweet URL

[Figure captions: (1) Main “driving” dataset: contains tweet user details, tweet text, hashtags, URL referenced, location of tweeter etc. (2) Contains full details of each site page, including URL, title, content, category; join on URL referenced in tweet. (3) Contains the post author details missing from the Site Content dataset; join on internal Page ID.]

Page 56: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Select From Multiple Visualisation Types

Page 57: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Site contents dataset needs to gain access to the page author attribute only found in Posts
•Create join in the Dataset Relationships panel, using Post ID as the common attribute
•Join from Site contents to Posts, to create left-outer join from first to second table

Multi-Dataset Join Step 1 : Join Site Contents to Posts

Previews rows from the join, based on post_id = a (post_id column)
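
For readers less familiar with join semantics, a left-outer join from Site contents to Posts keeps every site-contents row and adds the author where a matching post exists. The sketch below shows that behaviour in plain Python; it is not how BDD Studio implements the join, and the sample rows are invented (only the `post_id` join key comes from the slides).

```python
def left_outer_join(left, right, key):
    """Keep every left row; merge in matching right rows where the key matches."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row[key])
        if matches:
            for match in matches:
                joined.append({**match, **row})
        else:
            joined.append(dict(row))  # no match: left row survives on its own
    return joined


# Hypothetical rows for the two datasets
site_contents = [{"post_id": 1, "title": "A"}, {"post_id": 2, "title": "B"}]
posts = [{"post_id": 1, "author": "Mark"}]
joined_rows = left_outer_join(site_contents, posts, key="post_id")
```

The second page has no entry in Posts, so it is kept without an author rather than dropped, which is the difference from an inner join.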

Page 58: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•URLs in Twitter dataset have trailing ‘/’, whereas URLs in RM site data do not
•Use the Transformation feature in Studio to add trailing ‘/’ to RM site URLs
•Select option to replace the current URL values and overwrite within project dataset

Multi-Dataset Join Step 2 : Standardise URL Formats
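
The standardisation step amounts to a one-line rule. In BDD it would be written as a transformation in Studio; the Python below just shows the logic, with a hypothetical page URL, and assumes the URLs carry no query string or fragment.

```python
def with_trailing_slash(url):
    """Append '/' unless the URL already ends with one."""
    return url if url.endswith("/") else url + "/"


# Hypothetical site URL without the trailing slash, and its Twitter-style form
site_url = "http://www.rittmanmead.com/2015/01/example-post"
tweet_url = "http://www.rittmanmead.com/2015/01/example-post/"
normalised = with_trailing_slash(site_url)
```

Applying the rule to a URL that already ends in ‘/’ leaves it unchanged, so it is safe to run more than once.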

Page 59: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Join on the standardised-format URL attributes in the two datasets
•Data view will now contain the page content and author for each tweet mentioning RM

Multi-Dataset Join Step 3 : Join Tweets to Site Content

Page 60: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Select from palette of visualisation components
•Select measures, attributes for display

Create Discovery Pages for Dataset Analysis

Page 61: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•BDD Studio dashboards support faceted search across all attributes, refinements
•Auto-filter dashboard contents on selected attribute values - for data discovery
•Fast analysis and summarisation through Endeca Server technology

Key BDD Studio Differentiator : Faceted Search

[Screenshot callouts: (3) further refinement on “OBIEE” in post keywords; (4) results now filtered on two refinements]
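
Faceted search can be pictured as two operations repeated in a loop: narrow the record set by the selected attribute values, then recount the values of every other attribute over what is left. A minimal Python sketch, not the DGraph implementation; the helper names and sample posts are hypothetical.

```python
from collections import Counter


def as_list(value):
    """Treat missing, single and multi-valued attributes uniformly."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def refine(records, selections):
    """Keep records that match every selected attribute value."""
    return [r for r in records
            if all(v in as_list(r.get(attr)) for attr, v in selections.items())]


def facet_counts(records, attr):
    """Count the values of one attribute across the current result set."""
    return Counter(v for r in records for v in as_list(r.get(attr)))


posts = [
    {"author": "Mark", "keywords": ["OBIEE", "Hadoop"]},
    {"author": "Robin", "keywords": ["OBIEE"]},
    {"author": "Mark", "keywords": ["Hive"]},
]
by_author = refine(posts, {"author": "Mark"})
by_author_and_keyword = refine(posts, {"author": "Mark", "keywords": "OBIEE"})
remaining_keywords = facet_counts(by_author, "keywords")
```

Each refinement shrinks the result set and the facet counts update to match, which is what lets a user navigate by search rather than by a predefined hierarchy.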

Page 62: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Transformations within BDD Studio can then be used to create curated fact + dim Hive tables
•These can then be used as a more suitable dataset for OBIEE RPD + Visual Analyzer
•Or then exported into Exadata or Exalytics to combine with main DW datasets

Export Prepared Datasets Back to Hive, for OBIEE + VA
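
As a rough picture of what "curated fact + dim tables" means, the sketch below splits a flat dataset into one dimension table with surrogate keys and one fact table that references it. In BDD this would be done with Studio transformations writing Hive tables; the Python function, column names and sample rows here are assumptions for illustration.

```python
def build_star(rows, dim_attr, measure):
    """Split flat rows into a dimension table and a fact table keyed to it."""
    keys = {}  # natural value -> surrogate key, assigned in order of first appearance
    facts = []
    key_col = f"{dim_attr}_key"
    for row in rows:
        key = keys.setdefault(row[dim_attr], len(keys) + 1)
        facts.append({key_col: key, measure: row[measure]})
    dim_table = [{key_col: k, dim_attr: v} for v, k in keys.items()]
    return dim_table, facts


# Hypothetical flat dataset exported from the discovery project
flat = [
    {"author": "Mark", "pageviews": 10},
    {"author": "Robin", "pageviews": 5},
    {"author": "Mark", "pageviews": 7},
]
author_dim, pageview_facts = build_star(flat, dim_attr="author", measure="pageviews")
```

The fact table now holds only keys and measures, while descriptive attributes live once in the dimension, which is the shape a tool built on an OBIEE RPD expects.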

Page 63: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Users in Visual Analyzer then have a more structured dataset to use
•Data organised into dimensions, facts, hierarchies and attributes
•Can still access Hadoop directly through Impala or Big Data SQL
•Big Data Discovery though was key to initial understanding of data

Further Analyse in Visual Analyzer for Managed Dataset

Page 64: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

•Articles on the Rittman Mead Blog
‣http://www.rittmanmead.com/category/oracle-big-data-appliance/
‣http://www.rittmanmead.com/category/big-data/
‣http://www.rittmanmead.com/category/oracle-big-data-discovery/
•Rittman Mead offer consulting, training and managed services for Oracle Big Data
‣Oracle & Cloudera partners
‣http://www.rittmanmead.com/bigdata

Additional Resources

Page 65: Riga dev day 2016   adding a data reservoir and oracle bdd to extend your oracle dw platform

Adding a Data Reservoir and Oracle Big Data Discovery to extend your Oracle Data Warehouse Platform

Mark Rittman, CTO, Rittman Mead Riga Dev Day 2016, Riga, March 2016