TRANSCRIPT
Leveraging Mainframe Data in Hadoop
Frank Koconis - Senior Solutions
[email protected]
Glenn McNairy - Account
[email protected]
Agenda
• Introductions
• The Mainframe: The Original Big Data Platform
• The Challenges of Ingesting and Using Mainframe Data on Hadoop
• Mainframe-Hadoop Data Integration Goals
• Mainframe-to-Hadoop Migration / Integration Options
• Syncsort and DMX-h
• Live DMX-h Demo
• Q & A
The Mainframe: The Original Big Data Platform
• Mainframes handle over 70% of all OLTP transactions
• They have a long, proven track record- over 60 years!
• They are reliable- operating continuously with zero downtime for years
• They are secure- access is tightly restricted and managed
Mainframes Still Process Vast Amounts of Vital Data
• Top 25 World Banks
• 9 of World’s Top Insurers
• 23 of Top 25 US Retailers
“But now our organization is implementing Hadoop…”
Hadoop is the new Big Data platform
The goal is for the Hadoop cluster to be the single central location for ALL data (the “Data Lake”)
– According to Wikipedia, this should be the “single store of all data in the enterprise ranging from raw data … to transformed data” *
So you need to bring in all of the organization’s data sources
And that includes the mainframe
– The mainframe has vital data that you cannot afford to ignore when building your data lake
*- https://en.wikipedia.org/wiki/Data_lake
Enterprise Data Lake Without Mainframe Data = Missed Opportunity
The Challenges of Using Mainframe Data in Hadoop
Mainframe knowledge and skills are difficult to find
– The mainframe workforce is aging rapidly
– Knowledge of existing designs and code may no longer be available
– Young developers almost never learn mainframe skills
Security and connectivity issues
– Mainframes have a highly controlled security environment
– Installation of data-extraction utilities or programs may be forbidden
– The mainframe is mission-critical, so no action can be taken that could cause downtime
Mainframe data looks VERY different from data on Windows, Linux or UNIX
– This is so important, it deserves its own slide… →
The Biggest Challenge: Mainframe Data Formats
Mainframe files are not like files in Windows, Linux or UNIX
– There is no such thing as a delimited text file on the mainframe
• File types include fixed-record, variable-record, VSAM and others
– The mainframe uses EBCDIC rather than ASCII, but it’s not that simple
• Text values are EBCDIC, but many numeric values are not
• Simple EBCDIC-to-ASCII conversion WILL NOT WORK
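To make this concrete, here is a minimal Python sketch (the sample values and field sizes are invented for illustration) showing why a blanket codepage conversion fails: EBCDIC text decodes cleanly with a codec such as cp037, but a packed-decimal (COMP-3) field stores two digits per byte plus a sign nibble and must be unpacked, not translated.

```python
import codecs

def unpack_comp3(raw, scale=0):
    """Decode an IBM packed-decimal (COMP-3) field: two digits per byte,
    with the low nibble of the final byte holding the sign."""
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[:-1])
    digits += str(raw[-1] >> 4)
    value = int(digits)
    if (raw[-1] & 0x0F) == 0x0D:     # 0xD = negative; 0xC/0xF = positive
        value = -value
    return value / 10 ** scale if scale else value

# An EBCDIC text field decodes cleanly with a codepage conversion:
text = b"\xD6\xD9\xC4\xF0\xF0\xF1"   # "ORD001" in EBCDIC (cp037)
assert codecs.decode(text, "cp037") == "ORD001"

# A PIC S9(9) COMP-3 field holds packed digits, NOT characters, so
# running it through the same codepage conversion would destroy it:
packed = b"\x12\x34\x56\x78\x9C"     # +123456789
assert unpack_comp3(packed) == 123456789
assert unpack_comp3(b"\x98\x76\x5D", scale=2) == -987.65
```

A real converter must therefore know the record layout (from the copybook) and apply the right decoding per field.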
Mainframe files can have VERY complex record structures
– Records may be very wide, containing hundreds or thousands of fields
– Records are usually not “flat”
• They often have sub-records and arrays (COBOL “OCCURS” groups)
• These may be nested many levels deep
– Often, a range of bytes in a record is used in several different ways (COBOL “REDEFINES”)
• This means that the data “looks different” between records in the same file(!)
– Record layouts are defined by COBOL copybooks; here are examples…
COBOL Copybook Example #1
Simple example of a COBOL copybook which defines a record layout:
** SALES ORDERS FILE
01 SLS-ORD-FILE.
05 CUSTOMER-ACCOUNT-NUMBER PIC S9(9) COMP-3.
05 ORDER-NUMBER PIC X(10).
05 ORDER-DETAILS.
10 ORDER-STATUS PIC X(1).
10 ORDER-DATE PIC X(10).
10 ORDER-PRIORITY PIC X(15).
10 CLERK PIC X(15).
10 SHIPMENT-PRIORITY PIC S9(4) COMP-3.
10 TOTAL-PRICE PIC 9(7)V99 COMP-3.
10 COMMENT-COUNT PIC 9(2).
10 COMMENT PIC X(80) OCCURS 0 TO 99 TIMES DEPENDING ON COMMENT-COUNT.
The COMP-3 fields are packed decimal (not EBCDIC!), so EBCDIC-to-ASCII conversion would corrupt them!
COMMENT is a variable-length array: the number of elements depends on the value of COMMENT-COUNT, so the size of this array will vary from record to record.
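The sizing arithmetic for such a record can be sketched in a few lines of Python. The byte counts below follow standard COBOL storage rules (a COMP-3 field occupies one byte per two digits, plus a sign nibble); the assumption that records sit back-to-back with no record descriptor words is a simplification for illustration only.

```python
# Fixed portion of the SLS-ORD-FILE record, per standard COBOL sizes:
# S9(9) COMP-3=5, X(10)=10, X(1)=1, X(10)=10, X(15)=15, X(15)=15,
# S9(4) COMP-3=3, 9(7)V99 COMP-3=5, 9(2) DISPLAY=2  ->  66 bytes
FIXED_PREFIX = 5 + 10 + 1 + 10 + 15 + 15 + 3 + 5 + 2
COMMENT_LEN = 80          # each COMMENT element is PIC X(80)

def record_length(comment_count):
    """Total record size in bytes, given the COMMENT-COUNT value."""
    return FIXED_PREFIX + comment_count * COMMENT_LEN

def split_records(buf):
    """Walk back-to-back variable-length records: read COMMENT-COUNT
    (the 2-byte zoned-decimal field just before the array) to learn
    where each record ends."""
    records, pos = [], 0
    while pos < len(buf):
        count_field = buf[pos + FIXED_PREFIX - 2 : pos + FIXED_PREFIX]
        count = int(count_field.decode("cp037"))   # EBCDIC digits
        end = pos + record_length(count)
        records.append(buf[pos:end])
        pos = end
    return records
```

A record with no comments is 66 bytes; one with 99 comments is 7,986 bytes. This is why a fixed-width reader cannot simply be pointed at such a file.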
COBOL Copybook Example #2 (more complex)
01 LN-HST-REC-LHS.
   05 HST-REC-KEY-LHS.
      10 BK-NUM-LHS PIC S9(5) COMP-3.
      10 APP-LHS PIC S9(3) COMP-3.
      10 LN-NUM-LHS PIC S9(18) COMP-3.
      10 LN-SRC-LHS PIC X.
      10 LN-SRC-TIE-BRK-LHS PIC S9(5) COMP-3.
      10 EFF-DAT-LHS PIC S9(9) COMP-3.
      10 PST-DAT-LHS PIC S9(9) COMP-3.
      10 PST-TIM-LHS PIC S9(7) COMP-3.
      10 TRN-COD-LHS PIC S9(5) COMP-3.
      10 SEQ-NUM-LHS PIC S9(5) COMP-3.
   05 LN-HST-REC-DTL-LHS.
      10 VLI-LHS PIC S9(4) COMP.
      10 HST-REC-DTA-LHS.
         15 INP-SRC-COD-LHS PIC S9(3) COMP-3.
         15 TRN-TYP-IND-LHS PIC X.
         15 BAT-NUM-LHS PIC S9(7) COMP-3.
         15 BAT-TIE-BRK-LHS PIC X(3).
         15 BAT-ITM-NUM-LHS PIC X(9).
         15 TML-NUM-LHS PIC X(9).
         15 OPR-ID-LHS PIC X(8).
         15 HST-ADL-IND-LHS PIC X(1).
         15 HST-REV-IND-LHS PIC X(1).
         15 TRN-AMT-LHS PIC S9(9)V99 COMP-3.
         15 HST-DES-LHS PIC X(25).
         15 CUR-PRC-DAT-LHS PIC S9(9) COMP-3.
         15 REF-NUM-LHS PIC X(3).
         15 INT-FEE-FLG-LHS PIC X(1).
         15 UDF-L01-LHS PIC X.
         15 PMT-HLD-DAY-LHS PIC S9(3) COMP-3.
         15 AUH-NUM-LHS PIC S9(5) COMP-3.
         15 CUR-LN-BAL-LHS PIC S9(9)V99 COMP-3.
      10 ITM-CNT-LHS PIC S9(2) COMP-3.
      10 PYF-COF-REA-COD-LHS PIC X(3).
      10 HST-TRN-ADL-DTA-LHS PIC X(240).
      10 HST-TRN-RDF-1-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-1-LHS OCCURS 20 TIMES.
            20 SPR-TRN-COD-LHS PIC S9(5) COMP-3.
            20 SPR-TRN-REF-LHS PIC X(3).
            20 SPR-TRN-AMT-LHS PIC S9(9)V99 COMP-3.
      10 HST-TRN-RDF-2-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-2-LHS.
            20 OLD-NMN-DTA-LHS PIC X(40).
            20 NEW-NMN-DTA-LHS PIC X(40).
            20 DAT-TO-DSB-LHS PIC S9(9) COMP-3.
            20 RPT-BK-NUM-LHS PIC S9(5) COMP-3.
            20 RPT-APP-LHS PIC S9(3) COMP-3.
            20 RPT-LN-NUM-LHS PIC S9(18) COMP-3.
            20 CMB-PMT-PTY-LHS PIC S9(3) COMP-3.
      10 HST-TRN-RDF-3-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-3-LHS.
            20 HST-OLD-RT-LHS PIC SV9(5) COMP-3.
            20 HST-NEW-RT-LHS PIC SV9(5) COMP-3.
      10 HST-TRN-RDF-4-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-4-LHS.
            20 VSI-PMT-AMT-LHS PIC S9(7)V99 COMP-3.
            20 VSI-INT-AMT-LHS PIC S9(7)V99 COMP-3.
            20 VSI-TRM-LHS PIC S9(3) COMP-3.
            20 INS-REF-NUM-LHS PIC X(3).
            20 STR-DAT-VSI-LHS PIC S9(9) COMP-3.
      10 HST-TRN-RDF-5-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-5-LHS.
            20 NUM-MO-EXT-LHS PIC S9(3) COMP-3.
            20 CLC-EXT-FEE-AMT-LHS PIC S9(5)V99 COMP-3.
            20 EXT-REA-LHS PIC X(1).
      10 HST-TRN-RDF-6-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-6-LHS OCCURS 11 TIMES.
            20 ASD-BK-NUM-LHS PIC S9(5) COMP-3.
            20 ASD-APP-LHS PIC S9(3) COMP-3.
            20 ASD-LN-NUM-LHS PIC S9(18) COMP-3.
            20 PMT-AMT-LHS PIC S9(9)V99 COMP-3.
There are several different ways that data may be stored in this 240-byte area.
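At runtime, REDEFINES means a converter must choose the layout per record, usually keyed on a code field elsewhere in the record. A hedged Python sketch follows: the transaction-code values (301, 205) and the code-to-layout mapping are invented for illustration, while the field sizes come from the copybook above.

```python
def unpack_comp3(raw, scale=0):
    """Minimal IBM packed-decimal decoder (low nibble of last byte = sign)."""
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[:-1]) + str(raw[-1] >> 4)
    value = -int(digits) if (raw[-1] & 0x0F) == 0x0D else int(digits)
    return value / 10 ** scale if scale else value

def parse_adl_area(trn_cod, area):
    """Interpret the same 240-byte HST-TRN-ADL-DTA-LHS area differently
    depending on the transaction code -- the runtime analogue of COBOL
    REDEFINES. The specific code values here are hypothetical."""
    if trn_cod == 301:   # say, a rate change -> HST-TRN-RDF-3-LHS layout
        return {"old_rate": unpack_comp3(area[0:3], scale=5),   # SV9(5) COMP-3
                "new_rate": unpack_comp3(area[3:6], scale=5)}
    if trn_cod == 205:   # say, a name change -> HST-TRN-RDF-2-LHS layout
        return {"old_name": area[0:40].decode("cp037").rstrip(),
                "new_name": area[40:80].decode("cp037").rstrip()}
    return {"raw": area}  # unknown code: leave the bytes untouched
```

The same 240 bytes thus decode to two interest rates for one record and two 40-character names for the next, which is exactly why the data “looks different” between records in the same file.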
Mainframe-Hadoop Data Integration Goals
1) Making mainframe data available and usable on the cluster
• Interpretation and conversion of mainframe data formats
• Data validation and cleansing
• Integration of mainframe data with non-mainframe sources
• Use of mainframe data for data warehousing and BI
2) Reducing mainframe costs for storage and/or CPU
• Low-cost archival or backup of mainframe data in its native format
• Processing mainframe data on the cluster (yes- it can be done!)
Mainframe-to-Hadoop Migration / Integration Options
So, what tools can be used for mainframe-Hadoop integration?
The “free” open source tools that come with Hadoop
Open-source conversion code generators: JRecord and LegStar
Mainframe-based migration tools
Legacy-ETL vendors
Syncsort DMX-h
Let’s look at the capabilities of each of these…
Integration Option: Open-source Hadoop Tools
Standard Hadoop tools are used to convert mainframe data to ASCII delimited-text format and process it
Often the “obvious” choice because these tools come with Hadoop
Steps to integrate ONE mainframe data file:
1) Copy the file from the mainframe to edge node (using FTPS or similar tool)
2) Execute custom program (usually Java) to de-compose complex record structures and convert mainframe data types to delimited text file(s) and write to HDFS
3) Delete copy of mainframe file on edge node
4) Execute custom data-validation/cleansing process using MapReduce or Spark on the cluster (normally Java or Hive)
5) Execute custom MapReduce or Spark process to integrate or load into final target (Data Lake, RDBMS, NoSQL database, etc.)
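Step 2 above is where most of the effort goes. In practice it is usually a Java program driven by a parsed copybook; the Python sketch below (the layout, field names and record size are all hypothetical) shows the shape of the work: walk fixed-length records, decode each field by type, and emit a delimited row.

```python
# Hypothetical fixed 30-byte record: (name, offset, length, type, scale)
LAYOUT = [
    ("acct_no",  0,  5, "comp3", 0),   # PIC S9(9) COMP-3
    ("name",     5, 20, "text",  0),   # PIC X(20), EBCDIC text
    ("balance", 25,  5, "comp3", 2),   # PIC S9(7)V99 COMP-3
]
RECLEN = 30

def comp3_value(raw, scale=0):
    """Decode a packed-decimal field (sign in the last nibble)."""
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[:-1]) + str(raw[-1] >> 4)
    value = -int(digits) if (raw[-1] & 0x0F) == 0x0D else int(digits)
    return value / 10 ** scale if scale else value

def record_to_row(rec, sep="|"):
    """Decode one fixed-length record into a delimited-text row."""
    out = []
    for _name, off, length, ftype, scale in LAYOUT:
        raw = rec[off:off + length]
        if ftype == "text":
            out.append(raw.decode("cp037").rstrip())
        else:
            out.append(str(comp3_value(raw, scale)))
    return sep.join(out)

def convert(stream):
    """Yield one delimited row per fixed-length record in the stream."""
    while rec := stream.read(RECLEN):
        yield record_to_row(rec)
```

A production converter must additionally handle OCCURS DEPENDING ON, REDEFINES, variable-record files and bad data, which is where the custom-coding cost of this option comes from.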
Integration Option: JRecord and LegStar
These are open-source code generators for file format conversion
JRecord – uses the CopybookLoader class to interpret COBOL record layouts
LegStar – the developer must use its COBOL Transformer Generator to create a COBOL-to-XML translator, then call that translator in his/her program
Steps to integrate ONE mainframe data file:
1) Copy the file from the mainframe to edge node (using FTPS or similar tool)
2) Execute custom Java program to convert mainframe file by calling methods of CopybookLoader class (JRecord) or calling file-specific COBOL-to-XML translator (LegStar) and write to HDFS
LegStar only: Convert XML output to delimited text file(s)
3) Continue with step #3 on previous slide…
Open-source Options: Pros and Cons
The one “advantage” of these options is that they are “free”
Ironically, the primary disadvantage of these “free” tools is cost
– Development effort is very high
• A very large amount of custom coding is required
• A custom program is needed for each source file and cannot be re-used
• Lack of support
– Difficult and expensive to find, hire and retain skilled developers
Complex mainframe record types are a challenge
– Standard Hadoop tools: No easy way to handle complex records
– JRecord: The Java method calls can get very tricky
– LegStar: The COBOL Transformer Generator has limits
Not “future-proof”
– A Java program is written for a specific execution framework such as MapReduce or Spark- what will you do when another one comes?
Integration Option: Mainframe-based Tools
Migration tools that run in z-Linux on the mainframe system
Able to ingest and convert mainframe file formats from z/OS
Results are written to HDFS or a database
Advantages:
– Does not “stage” data on edge node
Disadvantages:
– Data validation and data quality checks require custom code
– Integration with other data sources requires custom code
– Conversion process runs on mainframe, not commodity hardware
Integration option: Legacy-ETL Vendors
Many legacy ETL vendors now offer “Hadoop” versions
Able to read mainframe files and write to HDFS
Primary advantage is existing skill set of ETL developers
– “The devil you know”
Disadvantages
– Very high cost
– May have difficulty with very complex mainframe record structures
– Require a dedicated metadata repository
• Single point of failure
• Becomes a performance bottleneck
– Do not process natively on the cluster
• Some work only on the edge node
• Those that work on the cluster are code generators (Java or Hive)
• Performance and scalability are limited
The Best Option: Syncsort’s DMX-h
Create complete mainframe-Hadoop integration solutions, including data validation and integration with other sources
Easy-to-use development GUI; no coding
Very short learning curve
Supports very complex mainframe record structures
Native execution on cluster (NO code generation!)
Superior performance
Runs on all major Hadoop distributions
“Future-proof”: Run ETL jobs on MapReduce, Spark or a future framework with no changes
So let’s find out more about the company Syncsort and DMX-h…
Who is Syncsort?
Syncsort is a leading Big Data company that has been in the high volume data business for over 45 years.
Syncsort has successfully transformed its business model from the mainframe era to the age of Hadoop.
Syncsort developed DMX which benefits from the algorithms and coding efficiencies developed from its mainframe heritage.
Syncsort Products
Mainframe Solutions – MFX
Gold-standard sort technology for over four decades – saving customers millions each year over competitive sort solutions.
• High-performance Sort for System z
• zIIP Offload for Copy
• Hadoop Connectivity for Mainframe
Linux/UNIX & Windows – DMX
Full-featured data integration software that helps organizations extract, transform and load more data in less time, with fewer resources and less cost.
• High-performance ETL
• SQL Analysis & Migration
• ETL for Business Intelligence
• Mainframe Re-hosting
Hadoop Solutions – DMX-h
A smarter approach to Hadoop ETL: easier to develop, faster, lower-cost and future-proof.
• Hadoop ETL
• ETL for Business Intelligence
• Mainframe-Hadoop Integration
DMX-h Installation Architecture
• Windows Workstation: DMX-h Job Editor, DMX-h Task Editor, DMX-h Engine
• Edge Node (Linux): DMX-h Engine, DMX-h Agent
• Hadoop Cluster (Data Nodes): DMX-h Engine on every node
The development GUI is installed on Windows workstations.
The DMX Engine is installed on the Windows workstation AND the edge node AND all cluster nodes, allowing job execution anywhere.
To execute on the cluster, the GUI sends a request to the DMX-h agent on the edge node.
DMX-h Mainframe-Hadoop Integration Features
Mainframe file conversion and processing
– Fixed-record, variable-record and VSAM files
– Mainframe DB2 tables
– EBCDIC text and mainframe numeric types (COMP- types)
– Complex record structures, nested to any depth
– REDEFINES, OCCURS and OCCURS DEPENDING ON
Secure transfer from mainframe using FTPS and Connect:Direct
Support for mainframe file compression, saving storage and time
No need to “stage” data on the edge node
Ability to store and process mainframe data in HDFS in its native format, without conversion(!), when desired
Easy integration of mainframe data with other sources
How Easy is it to Interpret a Mainframe File?
I’ll demonstrate using my laptop…
The use case:
We have been given a mainframe file and the COBOL copybook containing the record layout. The only 2 things that we have been told are that it is a fixed-record file and that the record size is 400.
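Given only those two facts, a useful first sanity check before trying to interpret the file is that its length divides evenly by the stated record size. A small Python sketch (the file path is hypothetical):

```python
import os

RECLEN = 400   # the only metadata we were given

def count_records(path):
    """Verify the file is a whole number of fixed-length records and
    return the record count -- a cheap first check on an unknown
    fixed-record mainframe file."""
    size = os.path.getsize(path)
    if size % RECLEN:
        raise ValueError(f"{size} bytes is not a multiple of {RECLEN}")
    return size // RECLEN
```

If the check fails, the file is probably variable-record (or still carries record descriptor words), and the interpretation approach has to change.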
Use DMX-h to Easily Integrate Mainframe Data With…
Mainframe-Hadoop Integration Use Cases
Getting and interpreting the data (with no staging!)
– Reading from mainframe
– Conversion from mainframe formats (when desired)
– Data validation and cleansing
– Writing to cluster target
Processing and data integration
– Joins and lookups to cluster and non-cluster sources
– Normalization & Aggregation
Publishing and exporting
– Load external data warehouses (Oracle, Teradata, DB2, SQL Server, etc.)
– Efficiently generate data extracts for BI users
– Generate native files for Tableau and QlikView
Storing and processing data in mainframe-native format
– Only DMX-h can do this! More info later…
Use Case: Mainframe data ingestion
A DMX-h job running in the edge node* can connect to both HDFS and an external data source (such as the mainframe).
This uses no disk space on the edge node! No limit on file size!
This also works for any external source or database, even if it is remote. The source file can even be compressed.
Format conversion and data-validation can be done within the same job.
*- Can also be done using ANY node on the cluster, if network connectivity allows
Processing in the Cluster Using DMX-h
Once data is in the cluster, additional DMX-h jobs can transform it
The developer defines the operations to be performed
– Join, lookup, aggregate, filter, reformat, etc.
– There is no need to know the details of MapReduce or Spark
– DMX-h Intelligent Execution (IX) automatically runs the jobs on the cluster
DMX-h jobs run natively on all cluster nodes
– No code generation!
– The DMX engine is installed on all nodes
– More efficient than Hive and other ETL tools which generate Java code
– Cluster nodes work concurrently, making the process highly scalable
DMX-h Intelligent Execution on Hadoop
DMX-h has a feature called Intelligent Execution (IX) which automatically runs ETL jobs on the Hadoop cluster
The DMX engine is installed on all nodes in the cluster, so the transformations run natively, with no extra code generation step
IX works when the job runs, not at design time
– It currently supports MapReduce and Spark
– It could support other execution frameworks in the future
– This will require no changes to your DMX-h jobs in production
So this means that the SAME DMX-h job can run
– On your Windows laptop (useful during development for unit testing)
– On an edge node or any single cluster node
– On the cluster using MapReduce
– On the cluster using Spark
Processing Native-Mainframe Data on Hadoop (!)
Using DMX-h, it is actually possible to store and process mainframe data on Hadoop in its original native-mainframe format (!)
DMX-h can even write mainframe-format target files
– No other tool can do this!
Sometimes this is a great idea; for example, you can
– Use HDFS to archive mainframe datasets (MUCH cheaper than DASD)
• Because the data is 100% unchanged, it will pass any auditing requirement
– Quickly move mainframe datasets to Hadoop
• Sometimes you do not have time or resources for a conversion project
• The data can be moved, unchanged, and converted later
• You may not immediately know which data fields will need to be used
– Transform the native-mainframe data using MapReduce or Spark
• The results can even be moved back to the mainframe and used there!
• This allows you to “offload CPU” from the mainframe, reducing MIPS cost.
The bottom line is that DMX-h can convert your mainframe data or work with it in its native form, whichever makes sense for you
DMX-h Live Demo
So let’s see it actually work using some mainframe data…
DMX-h: Superior Performance and Easy Development
Study by Principled Technologies for Dell
– Development comparison using DMX-h and open-source Hadoop tools
• Three different ETL processes (see table below)
– Open-source jobs were built by an experienced Hadoop developer
– DMX-h jobs were built by an entry-level developer with a few days of DMX-h training, and beat the performance of the open-source jobs on the same cluster:
And DMX-h development was much quicker:
– Open source jobs developed by experienced developer: 8.4 days
– DMX-h jobs developed by entry-level developer: 3.8 days (54% less!)
Job execution times (minutes):
ETL Process                          | Open-source | DMX-h | DMX-h Advantage
Fact Dimension Load with Type-2 SCD  | 36:39       | 30:11 | 18%
Data Validation                      | 15:45       |  6:15 | 60%
Mainframe File Integration           |  5:51       |  4:48 | 18%
Resources
Syncsort – www.syncsort.com/liberate
Frank Koconis - Senior Solutions
[email protected]
Glenn McNairy - Account
[email protected]
The development comparison by Dell and Principled Technologies determined that DMX-h enables:
• Easier and Faster Development
• Lower Development Cost
• Better Performance
JRecord – http://jrecord.sourceforge.net/
LegStar – http://www.legsem.com/legstar/
www.syncsort.com/liberate