Schema registries and Snowplow

Schema Registries Mike Robins | Co-founder linkedin.com/in/mikerobins


Post on 22-Jan-2018



Page 1: Schema registries and Snowplow

Schema Registries

Mike Robins | Co-founder

linkedin.com/in/mikerobins

Page 2: Schema registries and Snowplow

Snowplow Philosophy

- Open source (ALv2) or managed (paid)
- Batch or real time
- Collect everything (web, mobile, IoT, webhooks)
- Ownership of data matters
- Data modelling should be first class and flexible
- BYO toolset (Spark, Drill, Beam etc)

Page 3: Schema registries and Snowplow

Hazards of many languages

● Imagine all employees are required to speak only in their native language.
● Either everyone has to be multilingual, or expensive translators must be added for every pair of languages spoken.
  ○ Even if you have a sophisticated and efficient way of getting messages from place to place, you’re still stuck with the overhead of constant translation.

Page 4: Schema registries and Snowplow
Page 5: Schema registries and Snowplow

A schema

● A shared contract between a consumer and a producer
● Prior art
  ○ Avro, Thrift, Protobuf etc

Page 6: Schema registries and Snowplow

Key attributes of schema technologies

● Code generation – for bindings to your schemas in a given programming language
● Data encodings
● Validation rules – for calibration and sanity
● Types – a description of the type of data
● Schema evolution

Page 7: Schema registries and Snowplow
Page 8: Schema registries and Snowplow

Copyright Frank Drake, NASA (1977) License: CC BY-NC-ND 2.0

Page 9: Schema registries and Snowplow

Copyright Frank Drake, NASA (1977) License: CC BY-NC-ND 2.0

Page 10: Schema registries and Snowplow

iglu:com.<myco>/<event>/jsonschema/1-0-0

{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Schema for <…>",
  "self": {
    "vendor": "<com.myco>",
    "name": "<event>",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "action_context": {
      "type": "string"
    }
    ...
  },
  "required": ["subject", "event_id"],
  "additionalProperties": false
}

Parts of the schema URI:

● iglu: – the URI scheme is Iglu
● com.<myco> – the vendor of this schema
● <event> – the name of this schema
● jsonschema – the schema format
● 1-0-0 – the schema version
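The URI layout on this slide (vendor, name, format, version) can be taken apart mechanically. The following is a minimal sketch, not Iglu's own client code: the `SchemaKey` class, the `parse_iglu_uri` function, the regular expression (an approximation of the allowed characters) and the `com.myco/checkout` example are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    """The four parts of an Iglu schema URI (illustrative name)."""
    vendor: str
    name: str
    format: str
    version: str


# iglu:<vendor>/<name>/<format>/<version>
# The character classes are an approximation, not the official grammar.
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/(?P<version>\d+-\d+-\d+)$"
)


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split an Iglu URI into its parts; raise ValueError if it is malformed."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid Iglu URI: {uri!r}")
    return SchemaKey(**match.groupdict())


# Hypothetical vendor and event name, standing in for <com.myco> and <event>.
key = parse_iglu_uri("iglu:com.myco/checkout/jsonschema/1-0-0")
print(key.vendor, key.name, key.format, key.version)
```

Because every part is recoverable from the URI alone, a consumer can decide which schema to fetch without inspecting the record body.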

Page 11: Schema registries and Snowplow

Schema storage

● Option 1: Send the entire definition with the record

Record Record Record Record

Schema Schema Schema Schema

Page 12: Schema registries and Snowplow

Schema storage

● Option 2: Send a pointer to the definition

*Schema *Schema *Schema *Schema

Record Record Record Record
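The trade-off between the two storage options can be made concrete by serialising the same record both ways. This is a rough sketch under stated assumptions: the schema body, the `com.myco/checkout` URI and the event fields are hypothetical, and the `schema`/`data` envelope is modelled on the self-describing JSON shape used with Iglu.

```python
import json

# Hypothetical schema and record, loosely modelled on the earlier slide.
schema = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "event_id": {"type": "string"},
        "action_context": {"type": "string"},
    },
    "required": ["subject", "event_id"],
    "additionalProperties": False,
}
schema_uri = "iglu:com.myco/checkout/jsonschema/1-0-0"
event = {"subject": "order-42", "event_id": "e-1"}


def with_embedded_schema(data: dict) -> dict:
    """Option 1: ship the entire schema definition with every record."""
    return {"schema": schema, "data": data}


def with_schema_pointer(data: dict) -> dict:
    """Option 2: ship only a pointer (URI) to the schema definition."""
    return {"schema": schema_uri, "data": data}


embedded_size = len(json.dumps(with_embedded_schema(event)))
pointer_size = len(json.dumps(with_schema_pointer(event)))
print(f"embedded: {embedded_size} bytes, pointer: {pointer_size} bytes")
```

The pointer form keeps each record small, at the cost of requiring somewhere to resolve the pointer.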


Page 13: Schema registries and Snowplow

Schema registry

● A canonical, shared source of truth
● Within and between organisations
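As a toy illustration of a "canonical, shared source of truth", a registry can be thought of as a lookup from schema URI to schema definition. The sketch below is an in-memory stand-in, not how Iglu or any other registry is implemented; the `SchemaRegistry` class, its method names, and the rule that a published schema cannot be overwritten are assumptions.

```python
class SchemaRegistry:
    """In-memory stand-in for a schema registry (illustrative only)."""

    def __init__(self) -> None:
        self._schemas = {}

    def register(self, uri: str, schema: dict) -> None:
        """Publish a schema; refuse to silently change one already published."""
        if uri in self._schemas and self._schemas[uri] != schema:
            raise ValueError(f"{uri} is already registered with a different definition")
        self._schemas[uri] = schema

    def lookup(self, uri: str) -> dict:
        """Resolve a schema pointer to its definition."""
        try:
            return self._schemas[uri]
        except KeyError:
            raise LookupError(f"unknown schema: {uri}") from None


registry = SchemaRegistry()
# Hypothetical URI and schema body.
registry.register(
    "iglu:com.myco/checkout/jsonschema/1-0-0",
    {"type": "object", "required": ["subject", "event_id"]},
)
print(registry.lookup("iglu:com.myco/checkout/jsonschema/1-0-0")["required"])
```

Treating a published definition as immutable means producers and consumers that hold the same URI always agree on what it refers to; changes go out under a new version instead.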

Page 14: Schema registries and Snowplow

Why?

● Data governance
  ○ Safe schema evolution
  ○ Policy enforcement
● Data pipeline resilience
● Data discovery
● Efficiency
  ○ Cost
  ○ Storage
  ○ Computation
● Shares principles with software engineering CI/CD
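One way "safe schema evolution" and "policy enforcement" might look in practice is a check on version bumps before a new schema is published. The sketch assumes the three-part version (such as 1-0-0) follows Snowplow's SchemaVer convention of MODEL-REVISION-ADDITION; the function names and the policy of auto-approving only ADDITION bumps are hypothetical, not taken from the slides.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MODEL-REVISION-ADDITION' (e.g. '1-0-0') into three integers."""
    model, revision, addition = (int(part) for part in version.split("-"))
    return model, revision, addition


def classify_change(old: str, new: str) -> str:
    """Name the most significant version component that changed."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts <= old_parts:
        raise ValueError(f"{new} does not come after {old}")
    if new_parts[0] != old_parts[0]:
        return "model"      # breaking change
    if new_parts[1] != old_parts[1]:
        return "revision"   # may break some historical data
    return "addition"       # compatible with all historical data


def auto_approve(old: str, new: str) -> bool:
    """Hypothetical policy: only ADDITION bumps are published without review."""
    return classify_change(old, new) == "addition"


print(classify_change("1-0-0", "1-0-1"), auto_approve("1-0-0", "1-0-1"))
print(classify_change("1-0-1", "2-0-0"), auto_approve("1-0-1", "2-0-0"))
```

A check like this could run in the same CI/CD pipeline that publishes schemas, which is one reading of the parallel with software engineering drawn on the slide.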

Page 15: Schema registries and Snowplow

Key takeaways

Schemas are critical, and a shared repository of all schemas used by the organisation makes siloed knowledge shared and explicit.

By using schemas, the data definition for a particular kind of data exists in a single place.

Schemas serve as self-contained and automatically enforceable contracts between producers and consumers of data.
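To show what "automatically enforceable" could mean, here is a deliberately small validator covering only three JSON Schema keywords from the earlier example (`required`, `additionalProperties` and a per-property `type`). It is a sketch, not a full JSON Schema implementation, and the schema and records are hypothetical; a real pipeline would use a complete validator.

```python
from typing import List

# Hypothetical contract, modelled on the schema shown earlier in the deck.
schema = {
    "type": "object",
    "properties": {
        "subject": {"type": "string"},
        "event_id": {"type": "string"},
        "action_context": {"type": "string"},
    },
    "required": ["subject", "event_id"],
    "additionalProperties": False,
}

# Only the JSON types needed for this sketch.
TYPE_MAP = {"string": str, "object": dict, "array": list, "boolean": bool}


def validate(data: dict, schema: dict) -> List[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    if schema.get("additionalProperties") is False:
        for field in data:
            if field not in properties:
                errors.append(f"unexpected field: {field}")
    for field, rules in properties.items():
        expected = TYPE_MAP.get(rules.get("type"))
        if field in data and expected is not None and not isinstance(data[field], expected):
            errors.append(f"field {field} should be of type {rules['type']}")
    return errors


print(validate({"subject": "order-42", "event_id": "e-1"}, schema))
print(validate({"subject": "order-42", "colour": "red"}, schema))
```

The producer can run the same check before sending and the consumer before processing, so both sides enforce the same contract.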

Page 16: Schema registries and Snowplow

Demo


Snowplow (github.com/snowplow/snowplow)
Schemas (Iglu Central)
Kinesis (Amazon Web Services)
Pusher (pusher.com)