
Upload: doanthuan

Post on 19-Aug-2018


Page 1: Syncsort DMX-h - Meetupfiles.meetup.com/4533812/Slides_Syncsort_DMX-h.pdf · Teradata Teradata Syncsort Confidential and Proprietary - do not copy or distribute Load files Load to

Syncsort DMX-h

Delivering Smarter ETL through Hadoop… and More Value from Hadoop with Smarter ETL

Ruediger Schickhaus

Syncsort Confidential and Proprietary - do not copy or distribute

Page 2

Big Data – New Name, Old Problem

1956: Transporting 5 Megabytes

IBM 305 RAMAC

1000 kg

50 platters – 1.5 m²

TCO $35,000

Page 3

Big Data – New Name, Old Problem

2012: Transporting 1 Petabyte

IBM 305 RAMAC

1000 kg per Rack

TCO $500k to $3.5 million

Page 4

The Big Data Continuum

[Diagram: stages of the Big Data Continuum on a value axis from Min to Max: Awakening, Advancing, Plateauing, Dynamic, Evolved]

Diagram labels:
• Traditional BI
• Data
• Early Hadoop adoption: prototyping & experimentation
• Hand-coding: SQL, JCL. Basic ETL Tools
• Standardization & Heavy Platforms. Demand for MF data
• Big Data is the new standard
• Hitting arch limits + exponential costs. Growing Infrastructure

Integrating Big Data… Smarter

Products: MFX, DMX, DMX-h

Product captions: SQL Migration, Hadoop ETL, Hadoop Sort, & Connectivity, ETL & Rehosting, Optimization, High-performance ETL

Page 5

Do You Know Syncsort?

• Leading Big Data Integration company
• Speed leader in ETL
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• A history of innovation
• 25+ Issued & Pending Patents
• Large global customer base
• 15,000+ deployments in 68 countries
• First-to-market, fully integrated approach to Hadoop ETL

For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!

Our customers are achieving the impossible, every day!

Integrating Big Data… Smarter!

Key Partners

Page 6

Hadoop?

Hadoop will be in two-thirds of advanced analytics products by 2015 (source: Gartner)

Low TCO experience (source: Hadoop Summit)

At @Hortonworks, we believe that by the end of 2015, more than half the world’s data will be processed by Apache Hadoop

Page 7

How does #Hadoop integrate with an Enterprise Data Center?

Source: @Hortonworks

http://hortonworks.com/blog/smarter-etl-with-hadoop-and-syncsort

Page 8

Syncsort’s Open Source Contributions

JIRA | Description
2454 | Allow External Sorter Plugin for MR
4808 | Allow Reduce-side merge to be pluggable
4809 | Make classes required for 2454 public
4812 | Create reduce input merger plug-in
4842 | Shuffle race can hang reducer
4482 | Backport of 2454 to MapReduce 1

…and more!!

Page 9

Hadoop Integration… for Real (No Code Generation. No Compiling. No Tuning)

Smarter Architecture

• Runs natively within MapReduce
• No Coding, No Java
• Leverages sort plug-in
• Small footprint installs on every node
• Higher throughput per node to accomplish more on your cluster
• Smart ETL Optimizer = No Tuning Required
• File-based Metadata with no dependencies on 3rd party systems

[Diagram: Hadoop Cluster]

Unleash Hadoop’s Potential

Page 10

One Tool to Connect to All Sources & Targets (No Coding, No Scripting)

Smarter Connectivity

Connect to Any Source & Target
• RDBMS
• Mainframe
• Files
• Cloud
• Appliances
• XML

Pre-process & Compress
• Cleanse, validate, and partition for parallel loading
• Compress for storage savings

Extract & Load to/from Hadoop
• Extract data & load into the cluster natively from Hadoop or execute “off-cluster” on ETL server
• Load data warehouses directly from Hadoop. No need for temporary landing areas.

PLUS… Mainframe Connectivity
• Directly read mainframe data
• Parse & translate
• Load into HDFS

Page 11

The Hadoop Challenge

Most organizations use Hadoop to…

COLLECT, PROCESS, DISTRIBUTE

PROCESS: Sort, Join, Aggregate, Copy, Merge

Extract, Transform, Load (ETL)

Page 12

So… How Do You Do ETL in Hadoop Today?

COLLECT (HARD)
• FS Shell Put Command
• Flume
• Sqoop

PROCESS: Sort, Join, Aggregate, Copy, Merge (HARDER)
• Pig
• HiveQL
• Java

DISTRIBUTE (HARD)
• Sqoop
• FS Shell Get Command

Page 13

DMX-h & Mainframe

Page 14

The Economics of Data

Cost of managing 1TB of data

• Mainframe: $20,000 – $100,000
• EDW: $15,000 – $80,000
• Hadoop: $250 – $2,000

• Scalability
• Performance
• Reliability
• Agility
• Aging workforce

But there’s more…

Page 15

Because Mainframe Is Big Data Too!

Smarter Connectivity… Also for Mainframe

Connect
• Read files directly from the mainframe
• No software required on mainframe

Translate
• Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
• No coding required

Load & Process
• Load directly to HDFS
• Offload batch data processing
• Find more insights
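
The Translate step names packed decimal and EBCDIC/ASCII conversion. As an illustration only, and not a description of how DMX-h does it, the following Python sketch shows what those two conversions involve. The function names `decode_ebcdic` and `unpack_comp3` are hypothetical, and code page `cp037` is an assumed choice; real mainframe files may use a different EBCDIC code page.

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes, codec: str = "cp037") -> str:
    """Translate EBCDIC bytes to text. cp037 is an assumed code page."""
    return raw.decode(codec)


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COBOL COMP-3) field.

    Each byte holds two 4-bit digits, except the last byte, whose low
    nibble is the sign: 0xB or 0xD is negative, 0xA, 0xC, 0xE, 0xF positive.
    `scale` is the number of implied decimal places.
    """
    if not raw:
        raise ValueError("empty packed-decimal field")
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F
    if any(d > 9 for d in digits) or sign_nibble < 0x0A:
        raise ValueError("not a valid packed-decimal field")
    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)


if __name__ == "__main__":
    print(decode_ebcdic(bytes.fromhex("C8C5D3D3D6")))
    print(unpack_comp3(bytes.fromhex("12345C"), scale=2))
```

A multi-format file would additionally need a record layout (for example from a COBOL copybook) to know which byte ranges hold text and which hold packed fields; that part is omitted here.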

Page 16

Smarter Development

Same Familiar Tool. Five Core Transformations. All The Possibilities. (No Coding, No Scripting, No Kidding!)

• Develop MapReduce ETL processes without writing code
• Leverage existing ETL skills
• Develop and test locally in Windows. Deploy in Hadoop
• Five smart transformations: Sort, Join, Aggregate, Copy, Merge
– Patented algorithms
– No code generation, no compiling
– Execute within MapReduce
– …combine & reuse to create virtually any data flow

+ Development accelerators for CDC and other common data flows

Coding is optional but not required

Page 17

Smarter Productivity

Fast-track your Hadoop productivity with the Use Case Accelerators

Aggregations
• Web log aggregations
• Lookup + Aggregation
• Word Count

Change Data Capture
• CDC Single Output
• CDC Distributed Output

Connectivity & Mainframe Integration
• Direct Mainframe Extract & Load
• Mainframe Extract + CDC
• Smart HDFS Load & HDFS Extract

Joins & Lookups
• Join – Large Side | Small Side
• Join – Large Side | Large Side
• File Lookup

• Fully functional and re-usable templates to design your own data flows
• Quick-start guide, sample data, and even videos
• Take away the complexities AND the guess work

Page 18

Smarter Security

Zero-Pain Support for Common Authentication Protocols

• Seamless support for Kerberos & LDAP
• User-level security using authentication protocol
• Invoke Hadoop jobs using user-level credentials
• Keep Hadoop access separate for each user
• Support multiple ticket locations
• Secure data loads & extracts
• Secure Hadoop MapReduce job execution

[Diagram: LDAP, Hadoop Cluster]

Page 19

Syncsort’s Hadoop ETL Workshops

You will see how easy it is to deploy ETL in Hadoop with DMX-h, with hands-on exercises implementing common ETL tasks, including:

– HDFS Load
– Change Data Capture (CDC)
– Web Log Aggregation
– Joins

Download the VM and play with Hadoop in 15 minutes before the workshop takes place!

Download Syncsort’s ETL package and you’re all set!

– www.syncsort.com/try

Page 20

DMX-h Benchmarks

Page 21

File CDC

DMX-h

Pig, Java

149 Lines of Code

70 Lines of Code
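
For readers unfamiliar with the File CDC task being compared here, the following Python sketch shows the general idea of change data capture between two file snapshots using a sort-merge comparison. It is not the Pig, Java, or DMX-h implementation from the benchmark; the function name `file_cdc`, the `(key, payload)` record shape, and the assumption of unique keys are all illustrative.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, payload); an assumed record shape


def file_cdc(old: List[Record], new: List[Record]) -> Dict[str, List[Record]]:
    """Classify records as inserted, updated, or deleted between snapshots.

    Both snapshots are sorted by key and walked in step, the way a
    sort-merge join would. Keys are assumed unique within a snapshot.
    """
    old_sorted = sorted(old, key=lambda r: r[0])
    new_sorted = sorted(new, key=lambda r: r[0])
    changes: Dict[str, List[Record]] = {"insert": [], "update": [], "delete": []}
    i = j = 0
    while i < len(old_sorted) and j < len(new_sorted):
        old_key, new_key = old_sorted[i][0], new_sorted[j][0]
        if old_key == new_key:
            # Same key in both snapshots: changed payload means update.
            if old_sorted[i][1] != new_sorted[j][1]:
                changes["update"].append(new_sorted[j])
            i += 1
            j += 1
        elif old_key < new_key:
            # Key only in the old snapshot.
            changes["delete"].append(old_sorted[i])
            i += 1
        else:
            # Key only in the new snapshot.
            changes["insert"].append(new_sorted[j])
            j += 1
    changes["delete"].extend(old_sorted[i:])
    changes["insert"].extend(new_sorted[j:])
    return changes


if __name__ == "__main__":
    before = [("1", "a"), ("2", "b"), ("3", "c")]
    after = [("2", "b"), ("3", "x"), ("4", "d")]
    print(file_cdc(before, after))
```

The sort-then-merge shape is chosen because it maps onto the Sort, Join, and Merge transformations the slides list; an in-memory dictionary lookup would also work for small inputs.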

Page 22

Web Log Aggregation

DMX-h

Pig, Java

94 Lines of Code

48 Lines of Code
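
As a rough picture of what a web log aggregation computes, here is a small Python sketch that totals hits and bytes per requested path. The slides do not specify the log format or the grouping key, so the Common Log Format pattern, the grouping by path, and the name `aggregate_web_log` are assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Assumed input: Common Log Format lines, e.g.
# 10.0.0.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def aggregate_web_log(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Return hit count and total bytes per path; unparseable lines are skipped."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: {"hits": 0, "bytes": 0})
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        # A size of "-" means no body was sent.
        size = 0 if match["size"] == "-" else int(match["size"])
        entry = totals[match["path"]]
        entry["hits"] += 1
        entry["bytes"] += size
    return dict(totals)


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
        '10.0.0.2 - - [10/Oct/2012:13:56:01 +0000] "GET /a.png HTTP/1.1" 304 -',
    ]
    print(aggregate_web_log(sample))
```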

Page 23

Sort Acceleration - Terasort

| Use Case | ETL or Sort Acceleration | Alternative | Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TERASORT | Sort Acceleration | Native | 512 | 0:01:47 | 0:01:45 | 2% | 12,863 | 12,873 | 0% | 114,297 | 62,491 | 45% | 6.5 | 6.6 |
| TERASORT | Sort Acceleration | Native | 1,024 | 0:02:29 | 0:01:11 | 52% | 14,512 | 14,522 | 0% | 194,896 | 98,972 | 49% | 9.3 | 19.4 |
| TERASORT | Sort Acceleration | Native | 1,536 | 0:04:02 | 0:01:23 | 66% | 14,684 | 14,694 | 0% | 287,055 | 143,759 | 50% | 8.6 | 25.0 |
| TERASORT | Sort Acceleration | Native | 4,096 | 0:03:31 | 0:02:29 | 29% | 31,520 | 31,549 | 0% | 927,379 | 380,442 | 59% | 26.2 | 37.0 |
| TERASORT | Sort Acceleration | Native | 10,242 | 0:08:51 | 0:05:14 | 41% | 47,935 | 47,951 | 0% | 2,835,927 | 1,460,101 | 49% | 26.4 | 44.6 |
| TERASORT | Sort Acceleration | Native | 20,484 | 0:14:55 | 0:12:28 | 16% | 106,153 | 105,239 | 1% | 6,112,296 | 3,696,727 | 40% | 31.0 | 37.4 |
| TERASORT | Sort Acceleration | Native | 102,400 | 1:12:12 | 0:51:59 | 28% | 387,262 | 387,211 | 0% | 30,436,624 | 16,589,332 | 45% | 32.3 | 44.9 |
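
The improvement columns appear to be the relative reduction from the native figure, (native − DMX-h) / native, rounded to a whole percent. The slides do not state the formula, so this is an inference; the Python sketch below, with hypothetical helper names `to_seconds` and `improvement_pct`, checks it against a few of the Terasort rows.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string such as '0:02:29' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(baseline: float, candidate: float) -> int:
    """Relative reduction from baseline to candidate, as a whole percent."""
    return round(100 * (baseline - candidate) / baseline)


if __name__ == "__main__":
    # 1,024 GB row: elapsed time 0:02:29 native vs 0:01:11 DMX-h.
    print(improvement_pct(to_seconds("0:02:29"), to_seconds("0:01:11")))
    # 512 GB row: CPU time 114,297 native vs 62,491 DMX-h.
    print(improvement_pct(114_297, 62_491))
```

Under this reading, the 1,024 GB row works out to 78 / 149 ≈ 52% for elapsed time, which matches the table.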

Page 24

File CDC

| Use Case | ETL or Sort Acceleration | Alternative | Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FileCDC | ETL | Pig | 148 | 0:05:31 | 0:01:33 | 72% | 79,876 | 79,559 | 0% | 79,876 | 79,559 | 0% | 0.6 | 2.2 |
| FileCDC | ETL | Pig | 450 | 0:05:11 | 0:01:58 | 62% | 243,834 | 182,869 | 25% | 243,834 | 182,869 | 25% | 1.9 | 5.3 |
| FileCDC | ETL | Pig | 1,515 | 0:07:49 | 0:03:44 | 52% | 845,263 | 557,226 | 34% | 845,263 | 557,226 | 34% | 4.4 | 9.4 |

Page 25

Web Log Aggregation

| Use Case | Alternative | Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WebLogAggregation - Split Size & fixes | Pig | 2,067 | 0:01:12 | 0:00:58 | 19% | 13,499 | 7,813 | 42% | 145,972 | 56,496 | 61% | 40.1 | 49.8 |
| WebLogAggregation - Split Size & fixes | Pig | 4,135 | 0:01:42 | 0:01:23 | 19% | 18,003 | 15,579 | 13% | 300,627 | 152,390 | 49% | 56.1 | 69.6 |
| WebLogAggregation - Split Size & fixes | Pig | 10,240 | 0:05:16 | 0:02:04 | 61% | 40,773 | 39,091 | 4% | 807,473 | 335,537 | 58% | 45.3 | 115.4 |
| WebLogAggregation - Split Size & fixes | Pig | 20,480 | 0:07:54 | 0:06:58 | 12% | 78,654 | 78,128 | 1% | 1,339,453 | 568,107 | 58% | 60.4 | 68.4 |

Page 26

DMX-h Demonstration

Page 27

MapReduce – Before and After Syncsort Contribution

[Diagram: Native Sort (Mandatory), shown for both the before and after cases]

Page 28

Opening the MapReduce Framework

[Diagram: Mapper, Output Sorter, Shuffle, Input Sorter, Reducer]

Here and here to replace MapReduce native sort

Here to perform functional logic on our engine
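
In Apache Hadoop 2.x, the pluggable sort and shuffle points are selected through two job configuration properties, `mapreduce.job.map.output.collector.class` and `mapreduce.job.reduce.shuffle.consumer.plugin.class`. Whether these correspond exactly to the two sorter positions in the diagram is an assumption, and the plug-in class names below are placeholders, not real DMX-h classes. The Python sketch only generates the configuration fragment; it does not talk to a cluster.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Property names are from the Apache Hadoop pluggable shuffle/sort docs.
# The class names on the right are hypothetical placeholders.
PLUGIN_PROPERTIES: Dict[str, str] = {
    "mapreduce.job.map.output.collector.class":
        "com.example.sort.ExampleMapOutputCollector",
    "mapreduce.job.reduce.shuffle.consumer.plugin.class":
        "com.example.sort.ExampleShuffleConsumerPlugin",
}


def build_configuration(properties: Dict[str, str]) -> str:
    """Render properties as a Hadoop-style <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_configuration(PLUGIN_PROPERTIES))
```

The same two properties can also be set per job rather than cluster-wide, which is how an alternative sorter can be tried without affecting other workloads.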

Page 29

Hadoop Change Data Capture using DMX-h

[Diagram: Mainframe files + Teradata → Load files / Load to HDFS (Syncsort’s DMX-h) → Hadoop nodes with HDFS running DMX-h ETL MapReduce (CDC) → Load to Teradata → Teradata]

Page 30

Page 31

Bridging the Gap Between Big Data and Big Iron

A Smarter Approach to BIG Mainframe Data!

Syncsort DMX-h ETL Edition
• Zero-MIPS Connectivity
• Painless Integration & Translation
• Mainframe-like Performance
• Massively Affordable Scalability
• Enterprise-grade Reliability
• Iron-clad Security
• Decades of Proven Mainframe Expertise

Connect, Translate, Process

CLOUDERA: THE PLATFORM FOR BIG DATA
• Brings batch & real-time compute to storage
• Works with all types of data
• Changes the economics of data management

[Diagram: CDH, Cloudera Manager, Cloudera Navigator, Cloudera Support]