Building Custom Big Data Integrations Pat Patterson Community Champion @metadaddy [email protected]


Post on 11-Jan-2017


Page 1: Building Custom Big Data Integrations

Building Custom Big Data Integrations

Pat Patterson, Community Champion

@metadaddy [email protected]

Page 2: Building Custom Big Data Integrations

Agenda

Ingest, Data Drift and StreamSets

Short Demo

Building a custom integration

Real-world integration: Salesforce Wave Analytics

Page 3: Building Custom Big Data Integrations

Traditional and Big Data Founders

Company Background

Top tier Investors

Momentum to Date

Strategic Partners

● Launched 2014; exited stealth 9/15
● ~30 employees
● Double-digit enterprise customers
● 10,000 downloads

Page 4: Building Custom Big Data Integrations

Market Trends

Past: Data Sources -> ETL -> Data Stores -> ETL -> Data Consumers

Emerging: Data Sources -> Ingest -> Data Stores -> Analyze -> Data Consumers

Page 5: Building Custom Big Data Integrations

Problem: Data Drift

The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data

Structure Drift

Semantic Drift

Infrastructure Drift

Page 6: Building Custom Big Data Integrations

Delayed and False Insights

Solving Data Drift

Tools

Applications

Data Sources Data Stores Data Consumers

Poor Data Quality

Data Drift

Custom code

Fixed-schema

Page 7: Building Custom Big Data Integrations

Trusted Insights

Data KPIs

Solving Data Drift

Tools

Applications

Data Sources Data Stores Data Consumers

Data Drift

Intent-Driven

Drift-Handling

Page 8: Building Custom Big Data Integrations

Demo

Let’s build a simple pipeline to answer a real question:

What’s the biggest city lot in San Francisco?

Page 9: Building Custom Big Data Integrations

Customizing StreamSets

Currently 25 standard StreamSets destinations, covering a wide variety of target systems, from flat files to S3 to Kafka

But… there’s always some system not on the list

Solution: DIY!

Page 10: Building Custom Big Data Integrations

Create Your Own Destination

Five Step Process:
○ Create template from Maven archetype
○ Add logging
○ Create a record buffer
○ Add configuration parameters
○ Send data to external system

bit.ly/sdc-dest

Your System Here!

Page 11: Building Custom Big Data Integrations

Create Template from Archetype

mvn archetype:generate -DarchetypeGroupId=com.streamsets -DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial -DarchetypeVersion=1.3.0.0 -DinteractiveMode=true

Page 12: Building Custom Big Data Integrations

Add Logging

Not 100% necessary, but VERY helpful

StreamSets uses SLF4J

$ tail -f streamsets-datacollector-1.3.0.0/log/sdc.log

Page 13: Building Custom Big Data Integrations

Create a Record Buffer

Leverage existing code where possible!

StreamSets includes generators for CSV, JSON, Avro, Protocol Buffers, etc.
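To illustrate the idea of a record buffer, here is a minimal Python sketch. It is not the StreamSets API: `buffer_records` is a hypothetical helper, and the standard-library `csv` writer stands in for the data generators that StreamSets ships with.

```python
import csv
import io


def buffer_records(records, fieldnames):
    """Serialize a batch of dict records into one in-memory CSV payload."""
    buffer = io.StringIO()
    # extrasaction="ignore" skips unexpected fields; restval fills missing ones.
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical batch; the second record carries a field the target does not know.
batch = [
    {"city": "Tucson", "state": "AZ"},
    {"city": "Fairfield", "state": "VA", "phone": "401-555-1212"},
]
payload = buffer_records(batch, ["city", "state"])
print(payload)
```

Building the whole payload in memory first means the external system receives one request per batch rather than one per record.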

Page 14: Building Custom Big Data Integrations

Configuration

Separate configuration and code

DON’T PUT CREDENTIALS IN CODE!!!

DON’T PUT CREDENTIALS IN CODE!!!

Make your users’ and your lives easier!
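A minimal sketch of keeping credentials out of code, assuming settings arrive through environment variables. The variable names (`DEST_URL`, `DEST_USERNAME`, `DEST_PASSWORD`, `DEST_BATCH_WAIT_SECONDS`) and the helper are hypothetical; this is not how StreamSets itself declares stage configuration.

```python
import os

REQUIRED_KEYS = ("DEST_URL", "DEST_USERNAME", "DEST_PASSWORD")


def load_destination_config(env=os.environ):
    """Build destination settings from the environment, not from source code."""
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise KeyError("missing configuration: " + ", ".join(missing))
    return {
        "url": env["DEST_URL"],
        "username": env["DEST_USERNAME"],
        "password": env["DEST_PASSWORD"],
        # Optional setting with a default, so most users never have to set it.
        "batch_wait_seconds": int(env.get("DEST_BATCH_WAIT_SECONDS", "30")),
    }


# A plain dict stands in for the real environment in this example.
example_env = {
    "DEST_URL": "https://example.invalid/ingest",
    "DEST_USERNAME": "demo-user",
    "DEST_PASSWORD": "not-a-real-password",
}
config = load_destination_config(example_env)
print(config["url"], config["batch_wait_seconds"])
```

Failing fast on a missing key gives the user a clear message at startup instead of an authentication error halfway through a run.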

Page 15: Building Custom Big Data Integrations

Send Data to the External System

Don’t forget security policy!

streamsets-datacollector/etc/sdc-security.policy

grant codebase "file://${sdc.dist.dir}/user-libs/sampletest/-" {
  permission java.net.SocketPermission "requestb.in", "connect, resolve";
};

Page 16: Building Custom Big Data Integrations

A Real Custom Destination

Salesforce Wave Analytics

● Adapt to batch processing model
  ○ Configure wait time before ‘closing’ a batch

● External Data API
  ○ Create new dataset
  ○ Write to dataset
  ○ Close dataset on timeout
  ○ Trigger dataflow execution
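The "wait before closing" behaviour can be sketched as follows. This is a simplified illustration in Python, not the actual destination: `DatasetBatcher` is a hypothetical class, it keeps rows in memory, and it makes no calls to the Salesforce External Data API. The clock is injectable so the timeout logic can be exercised without sleeping.

```python
import time


class DatasetBatcher:
    """Collects rows and 'closes' the dataset after a quiet period."""

    def __init__(self, wait_seconds, clock=time.monotonic):
        self.wait_seconds = wait_seconds
        self.clock = clock
        self.rows = []
        self.last_write = None
        self.closed_datasets = []

    def write(self, batch):
        """Append a batch of rows to the currently open dataset."""
        self.rows.extend(batch)
        self.last_write = self.clock()

    def close_if_idle(self):
        """Close the dataset if no rows arrived within the wait time."""
        if self.last_write is None:
            return False  # nothing is open
        if self.clock() - self.last_write < self.wait_seconds:
            return False  # still inside the wait window
        # A real destination would close the dataset and trigger the dataflow here.
        self.closed_datasets.append(list(self.rows))
        self.rows = []
        self.last_write = None
        return True


# Drive the batcher with a fake clock instead of real time.
now = [0.0]
batcher = DatasetBatcher(wait_seconds=10, clock=lambda: now[0])
batcher.write([{"id": 1}])
now[0] = 5.0
batcher.write([{"id": 2}])
now[0] = 12.0
early = batcher.close_if_idle()   # only 7 seconds since the last write
now[0] = 15.0
closed = batcher.close_if_idle()  # 10 seconds since the last write
print(early, closed, batcher.closed_datasets)
```

Each write resets the timer, so a steady stream of batches lands in a single dataset and the close only happens once the stream goes quiet.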

Page 17: Building Custom Big Data Integrations

Conclusion

StreamSets Data Collector makes simple tasks easy, complex tasks possible

Use ‘off the shelf’ stages for simple tasks

Leverage script processors (Jython, JavaScript, Groovy) for more complex work

Build custom stages for ultimate performance, flexibility

Page 18: Building Custom Big Data Integrations

Thank You!

Page 19: Building Custom Big Data Integrations

Structure Drift

Data structures and formats evolve and change unexpectedly

Implication: Data Loss

Data Squandering

Delimited Data

107.3.137.195 fe80::21b:21ff:fe83:90fa

Attribute Format Changes
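A small Python sketch of coping with this kind of attribute format change, assuming the field may hold either an IPv4 or an IPv6 address. `describe_address` is a hypothetical helper; `ipaddress` is the standard-library module.

```python
import ipaddress


def describe_address(text):
    """Parse an address without assuming it is IPv4."""
    address = ipaddress.ip_address(text)
    return {"value": str(address), "version": address.version}


for raw in ("107.3.137.195", "fe80::21b:21ff:fe83:90fa"):
    print(describe_address(raw))
```

Code that instead splits the value on dots, or stores it in a 15-character column, works until the first IPv6 address arrives.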

{
  "first": "jon",
  "last": "smith",
  "email": "[email protected]",
  "add1": "123 Washington",
  "add2": "",
  "city": "Tucson",
  "state": "AZ",
  "zip": "85756"
}

{
  "first": "jane",
  "last": "smith",
  "email": "[email protected]",
  "add1": "456 Fillmore",
  "add2": "Apt 120",
  "city": "Fairfield",
  "state": "VA",
  "zip": "24435-1001",
  "phone": "401-555-1212"
}

Data Structure Evolution
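One way to avoid losing data when a record gains a field is to keep unknown fields alongside the known ones rather than discarding them. The Python sketch below assumes the field list from the records above; `split_record` and `EXPECTED_FIELDS` are hypothetical names.

```python
import json

EXPECTED_FIELDS = ("first", "last", "email", "add1", "add2", "city", "state", "zip")


def split_record(raw_json):
    """Separate known fields from newly appeared ones instead of dropping them."""
    record = json.loads(raw_json)
    # Missing known fields become None rather than raising.
    known = {name: record.get(name) for name in EXPECTED_FIELDS}
    extra = {
        name: value for name, value in record.items() if name not in EXPECTED_FIELDS
    }
    return known, extra


# Abbreviated version of the newer record: a ZIP+4 code and a new "phone" field.
newer = '{"first": "jane", "zip": "24435-1001", "phone": "401-555-1212"}'
known, extra = split_record(newer)
print(known["zip"], extra)
```

The `zip` value is kept as a string, so the move from five digits to ZIP+4 does not break parsing.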

Page 20: Building Custom Big Data Integrations

Semantic Drift

Data semantics change with evolving applications

Implication: Data Corrosion

Data Loss

24122-52172 00-24122-52172

Account Number Expansion

M134: user {jsmith} read access granted {ac:24122-52172}

M134: user {jsmith} read access granted {ca.ac:24122-52172}

Namespace Qualification
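Both changes can be absorbed by normalizing the value as it is read. The Python sketch below rests on two assumptions that the slide does not state: that a short account number maps to the expanded form by adding a `00-` prefix, and that the namespace is an optional word before `ac:`. `extract_account` is a hypothetical helper.

```python
import re

# Matches {ac:NNNNN-NNNNN} and {ns.ac:NNNNN-NNNNN}; {jsmith} does not match.
ACCOUNT_PATTERN = re.compile(r"\{(?:(?P<namespace>\w+)\.)?ac:(?P<account>[\d-]+)\}")


def extract_account(log_line, default_namespace=None, prefix="00"):
    """Return the namespace and the account number in its expanded form."""
    match = ACCOUNT_PATTERN.search(log_line)
    if match is None:
        return None
    account = match.group("account")
    # Assumption: the short form has one hyphen and gains a "00-" prefix.
    if account.count("-") == 1:
        account = prefix + "-" + account
    return {
        "namespace": match.group("namespace") or default_namespace,
        "account": account,
    }


old_line = "M134: user {jsmith} read access granted {ac:24122-52172}"
new_line = "M134: user {jsmith} read access granted {ca.ac:24122-52172}"
print(extract_account(old_line))
print(extract_account(new_line))
```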

…,3588310669797950,$91.41,jcb,K1088-W#9,…
…,6759006011936944,$155.04,switch,A6504-Y#9,…
…,6771111111151415,$37.78,laser,Q9936-T#9,…
…,3585905063294299,$164.48,jcb,S4643-H#9,…
…,5363527828638736,$117.52,mastercard,X3286-P#9,…
…,4903080150282806,$168.03,switch,I9133-W#3,…

Outlier / Anomaly Detection
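The slide does not say which record is the anomaly. As one assumed reading, the Python sketch below flags codes whose trailing `#n` suffix differs from the most common suffix in the batch; `flag_rare_suffixes` is a hypothetical helper, and a real pipeline would need a rule suited to its own data.

```python
from collections import Counter


def flag_rare_suffixes(codes):
    """Return the codes whose '#n' suffix differs from the most common one."""
    suffixes = [code.rsplit("#", 1)[-1] for code in codes]
    most_common, _ = Counter(suffixes).most_common(1)[0]
    return [code for code, suffix in zip(codes, suffixes) if suffix != most_common]


# Last column of the sample records on this slide.
codes = ["K1088-W#9", "A6504-Y#9", "Q9936-T#9", "S4643-H#9", "X3286-P#9", "I9133-W#3"]
outliers = flag_rare_suffixes(codes)
print(outliers)
```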

Page 21: Building Custom Big Data Integrations

Infrastructure Drift

Physical and Logical Infrastructure changes rapidly

Implication: Poor Agility

Operational Downtime

Data Center 1 Data Center 2 Data Center n

3rd Party Service Provider

App a App k

App q

Cloud Infrastructure