Mark Rittman, Independent Analyst + Product Manager BIGQUERY, LOOKER AND BIG DATA ANALYTICS AT PETABYTE-SCALE BUDAPEST DATA FORUM May 2017


Page 1: BIGQUERY, LOOKER AND BIG DATA ANALYTICS AT PETABYTE-SCALEbiconsulting.hu/letoltes/2017budapestdata/mark... · Google Cloud Platform 11 Cloud Dataflow - A fully managed, auto-scalable

Mark Rittman, Independent Analyst + Product Manager

BIGQUERY, LOOKER AND BIG DATA ANALYTICS

AT PETABYTE-SCALE

BUDAPEST DATA FORUM

May 2017

About the Presenter

• Mark Rittman, Independent Analyst for Big Data Analytics
• Currently working with Qubit as Analytics Product Manager
• 20 years in the BI, DW, ETL and now Big Data industry
• Implementor, CTO, company founder and author
• On Twitter at @markrittman
• LinkedIn at https://uk.linkedin.com/in/markrittman
• [email protected] and http://www.mjr-analytics.com

Current Role - Analytics PM for Marketing Tech Startup

• Responsible for building + managing an analytics product on a personalization platform for marketers
• Operates in the same market as Adobe Marketing Cloud, Google Analytics 360, Optimizely, Monetate
• Real-time ingest of 10TB+/day of web activity data, used for personalization
• Built on the Looker BI tool and Google Cloud Platform

Big Data Analytics on Google Cloud Platform

Cloud Big Data Analytics 2.0

Also use as my personal dev platform

1TB of BigQuery query usage/month free

Before This … 20 Years in Traditional DW Consulting

• Started back in 1996 on a bank Oracle DW project
• Our tools were Oracle 7.3.4, SQL*Plus, shell scripts
• Data warehouses provided a unified view of the business
• Single place to store key data and metrics
• Joined-up view of the business
• Aggregates and conformed dimensions
• ETL routines to load, cleanse and conform data
• BI tools for simple, guided access to information

GOOGLE BIGQUERY

AND DISTRIBUTED, CLOUD COMPUTE + STORAGE

Elastically-Scalable Data Warehouse-as-a-Service

• New generation of big data platform services from Google, Amazon, Oracle
• Combines three key innovations from earlier technologies:
  • Organising of data into tables and columns (from RDBMS DWs)
  • Massively-scalable and distributed storage and query (from Big Data)
  • Elastically-scalable Platform-as-a-Service (from Cloud)

BigQuery : Big Data Meets Data Warehousing

• And things come full-circle … analytics typically requires tabular data
• Google BigQuery is based on the DremelX massively-parallel query engine
• But stores data in columnar form and provides a SQL interface
• Solves the problem of providing DW-like functionality at scale, as-a-service

Google Cloud Platform

Cloud Dataflow - A fully managed, auto-scalable service for pipeline data processing in batch or streaming mode

BigQuery - A fully managed, petabyte-scale, low-cost enterprise data warehouse for analytics

Bigtable - A fully managed, petabyte-scale, low-latency, high-throughput wide-column store for analytics

Cloud Pub/Sub - A fully managed, global and scalable publish and subscribe service with guaranteed at-least-once message delivery
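At-least-once delivery means a subscriber can receive the same message more than once, so downstream processing usually has to be idempotent. The following is a minimal sketch in plain Python of de-duplicating on a message ID. It does not use the Pub/Sub client library; the Message shape, the message_id field and the IdempotentConsumer class are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical message shape, not the real Pub/Sub message type.
    message_id: str
    payload: str


class IdempotentConsumer:
    """Processes each message_id once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen_ids = set()   # IDs already processed
        self.processed = []     # payloads in the order they were accepted

    def handle(self, message):
        """Return True if the message was processed, False if it was a duplicate."""
        if message.message_id in self.seen_ids:
            return False
        self.seen_ids.add(message.message_id)
        self.processed.append(message.payload)
        return True


# Simulated at-least-once delivery: message "m1" arrives twice.
deliveries = [
    Message("m1", "page_view"),
    Message("m2", "add_to_basket"),
    Message("m1", "page_view"),
]
consumer = IdempotentConsumer()
results = [consumer.handle(m) for m in deliveries]
```

In a real pipeline the set of seen IDs would need to be bounded and shared between workers; the in-memory set here only illustrates the idea.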

Google Cloud Platform Big Data Reference Architecture

All delivered as auto-scaling fully-managed services

Google BigQuery - Key Platform Technologies

Dremel - Massively-parallel real-time query engine with SQL + REST API

Colossus - Distributed storage layer for Dremel, successor to GFS (the model for HDFS)

Capacitor - Columnar, nested + compressed file format for Dremel; its predecessor ColumnIO was the inspiration for Parquet etc

Borg - Large-scale cluster management, runs Dremel jobs on 10,000+ servers

Jupiter - Google’s high-capacity, low-latency internal network

Data Modeling with BigQuery

• BigQuery is a distributed, column-store query engine
• Denormalize your tables where possible as joins are relatively expensive
• Optimal query is one that filters, aggregates and selects subsets of columns
• Use table partitioning so that full scans of whole columns are minimized
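The advice to select only the columns you need and to partition tables can be pictured with a toy model of a column store. This is a sketch in plain Python under stated assumptions, not how BigQuery is implemented: the table layout (partition date, then column name, then values), the sample data and the trips_by_zipcode helper are all hypothetical. It counts how many stored values a query has to read.

```python
from collections import defaultdict

# Hypothetical date-partitioned, column-oriented table:
# partition key -> column name -> list of values for that column.
table = {
    "2017-05-01": {"zipcode": ["94103", "94107"], "trips": [12, 7], "notes": ["a", "b"]},
    "2017-05-02": {"zipcode": ["94103", "94110"], "trips": [5, 9], "notes": ["c", "d"]},
    "2017-05-03": {"zipcode": ["94107", "94110"], "trips": [3, 4], "notes": ["e", "f"]},
}


def trips_by_zipcode(table, partitions, columns=("zipcode", "trips")):
    """Sum trips per zipcode, reading only the named partitions and columns.

    Returns (totals, values_scanned) so the size of the scan is visible.
    """
    totals = defaultdict(int)
    values_scanned = 0
    for day in partitions:
        part = table[day]
        # Only the requested columns are read from each partition.
        cols = {name: part[name] for name in columns}
        values_scanned += sum(len(values) for values in cols.values())
        for zipcode, trips in zip(cols["zipcode"], cols["trips"]):
            totals[zipcode] += trips
    return dict(totals), values_scanned


# Every partition and every column, like SELECT * with no partition filter.
all_totals, full_cost = trips_by_zipcode(
    table, partitions=table.keys(), columns=("zipcode", "trips", "notes")
)

# One partition and two columns, like a filtered query on a partitioned table.
one_day_totals, pruned_cost = trips_by_zipcode(table, partitions=["2017-05-02"])
```

With this sample data the full scan reads 18 values and the pruned query reads 4, which is the effect the partitioning and column-subset bullets describe.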

BigQuery Storage Formats + Data Modeling

• BigQuery is a distributed, column-store query engine
• Denormalize your tables where possible as joins are relatively expensive
• Use nested repeated fields for dimension lookups to align with Capacitor storage
• Optimal query is one that filters, aggregates and selects subsets of columns
• Use table partitioning so that full scans of whole columns are minimized

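-- Trip and call counts per zipcode for one incident type.
-- trips and incidents are nested repeated fields; unnesting both in one query
-- pairs every trip with every incident in a row, so the counts are per pairing.
SELECT
  zipcode,
  COUNT(zipcode_trips.trip_count) AS total_trips,
  zipcode_incidents.call_type,
  COUNT(zipcode_incidents.call_type) AS total_calls
FROM
  `aerial-vehicle-148023.personal_metrics.sf_nested`
LEFT JOIN UNNEST(trips) AS zipcode_trips
LEFT JOIN UNNEST(incidents) AS zipcode_incidents
WHERE
  zipcode_incidents.call_type = 'Traffic Collision'
GROUP BY 1, 3  -- zipcode, call_type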
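To see what the two LEFT JOIN UNNEST clauses do to a nested row, here is a sketch in plain Python. It is an illustration only: the rows, their values and the left_join_unnest and flatten helpers are hypothetical, and only the field names (zipcode, trips, incidents, trip_count, call_type) come from the query above.

```python
from itertools import product

# Hypothetical rows shaped like the nested table in the query:
# one row per zipcode, with repeated trips and incidents records.
rows = [
    {
        "zipcode": "94103",
        "trips": [{"trip_count": 10}, {"trip_count": 4}],
        "incidents": [
            {"call_type": "Traffic Collision"},
            {"call_type": "Medical Incident"},
            {"call_type": "Traffic Collision"},
        ],
    },
    {"zipcode": "94107", "trips": [{"trip_count": 6}], "incidents": []},
]


def left_join_unnest(row, field):
    """Mimic LEFT JOIN UNNEST: one item per array element, or a single None if the array is empty."""
    return row[field] or [None]


def flatten(rows):
    """Yield one (zipcode, trip, incident) tuple per trip/incident pairing in each row."""
    for row in rows:
        pairs = product(left_join_unnest(row, "trips"), left_join_unnest(row, "incidents"))
        for trip, incident in pairs:
            yield row["zipcode"], trip, incident


flat = list(flatten(rows))

# Equivalent of the WHERE clause: rows with no matching incident drop out.
collisions = [
    (zipcode, trip, incident)
    for zipcode, trip, incident in flat
    if incident is not None and incident["call_type"] == "Traffic Collision"
]
```

The first row expands to 2 trips x 3 incidents = 6 flattened rows and the second to a single row with no incident; after the filter, 4 rows remain, all for zipcode 94103.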

Query Results in Seconds with No Indexes or Summary Tables

BI + ANALYTICS TOOLING FOR GOOGLE BIGQUERY

AND DISTRIBUTED, CLOUD COMPUTE + STORAGE

BI and Analytics Tools for Google Cloud Platform

Looker for BI and Analytics - BI for Data Engineers, Semantic Models and LookML

LookML : BI Modeling for Data Engineers

• Query building using business semantic model
• Self-Service data analytics with agile dev model
• Dashboards, reports (“looks”), action links, scheduling
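Query building from a business semantic model can be pictured as a lookup from business field names to SQL expressions. The sketch below is plain Python, not LookML and not Looker's implementation: the MODEL dictionary, its table and field names, and the build_query helper are all hypothetical, chosen only to show the idea of generating SQL from a shared model.

```python
# Hypothetical semantic model: business names mapped to SQL expressions.
MODEL = {
    "table": "personal_metrics.trips",
    "dimensions": {"zipcode": "zipcode", "trip_date": "DATE(start_time)"},
    "measures": {"total_trips": "COUNT(*)", "avg_duration": "AVG(duration_sec)"},
}


def build_query(model, dimensions, measures):
    """Translate chosen business fields into a SQL string using the model.

    Raises KeyError if a requested field is not defined in the model.
    """
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {model['table']}"
    if dimensions and measures:
        # Group by the dimension columns, referenced by position.
        positions = ", ".join(str(i + 1) for i in range(len(dimensions)))
        sql += f" GROUP BY {positions}"
    return sql


sql = build_query(MODEL, ["zipcode"], ["total_trips"])
```

Because users pick names such as zipcode and total_trips rather than writing expressions, every generated query shares the same definitions, which is the point of a semantic layer.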

Further Reading at https://medium.com/mark-rittman
