TRANSCRIPT
Leveraging Mainframe Data in Hadoop
Frank Koconis - Senior Solutions
[email protected]
Glenn McNairy - Account
[email protected]
Agenda
• Introductions
• The Mainframe: The Original Big Data Platform
• The Challenges of Ingesting and Using Mainframe Data on Hadoop
• Mainframe-Hadoop Data Integration Goals
• Mainframe-to-Hadoop Migration / Integration Options
• Syncsort and DMX-h
• Live DMX-h Demo
• Q & A
The Mainframe: The Original Big Data Platform
• Mainframes handle over 70% of all OLTP transactions
• They have a long, proven track record- over 60 years!
• They are reliable- operating continuously with zero downtime for years
• They are secure- access is tightly restricted and managed
Mainframes Still Process Vast Amounts of Vital Data
• Top 25 World Banks
• 9 of World’s Top Insurers
• 23 of Top 25 US Retailers
“But now our organization is implementing Hadoop…”
Hadoop is the new Big Data platform
The goal is for the Hadoop cluster to be the single central location for ALL data (the “Data Lake”)
– According to Wikipedia, this should be the “single store of all data in the enterprise ranging from raw data … to transformed data” *
So you need to bring in all of the organization’s data sources
And that includes the mainframe
– The mainframe has vital data that you cannot afford to ignore when building your data lake
*- https://en.wikipedia.org/wiki/Data_lake
Enterprise Data Lake Without Mainframe Data = Missed Opportunity
The Challenges of Using Mainframe Data in Hadoop
Mainframe knowledge and skills are difficult to find
– The mainframe workforce is aging rapidly
– Knowledge of existing designs and code may no longer be available
– Young developers almost never learn mainframe skills
Security and connectivity issues
– Mainframes have a highly controlled security environment
– Installation of data-extraction utilities or programs may be forbidden
– The mainframe is mission-critical, so no action can be taken that could cause downtime
Mainframe data looks VERY different from data on Windows, Linux or UNIX
– This is so important, it deserves its own slide… →
The Biggest Challenge: Mainframe Data Formats
Mainframe files are not like files in Windows, Linux or UNIX
– There is no such thing as a delimited text file on the mainframe
• File types include fixed-record, variable-record, VSAM and others
– The mainframe uses EBCDIC rather than ASCII, but it’s not that simple
• Text values are EBCDIC, but many numeric values are not
• Simple EBCDIC-to-ASCII conversion WILL NOT WORK
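To make this concrete, here is a minimal Python sketch (the sample values and field sizes are invented for illustration) showing why a blanket codepage conversion fails: EBCDIC text decodes cleanly with a codec such as cp037, but a packed-decimal (COMP-3) field stores two digits per byte plus a sign nibble and must be unpacked, not translated.

```python
import codecs

def unpack_comp3(raw, scale=0):
    """Decode an IBM packed-decimal (COMP-3) field: two digits per byte,
    with the low nibble of the final byte holding the sign."""
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[:-1])
    digits += str(raw[-1] >> 4)
    value = int(digits)
    if (raw[-1] & 0x0F) == 0x0D:     # 0xD = negative; 0xC/0xF = positive
        value = -value
    return value / 10 ** scale if scale else value

# An EBCDIC text field decodes cleanly with a codepage conversion:
text = b"\xD6\xD9\xC4\xF0\xF0\xF1"   # "ORD001" in EBCDIC (cp037)
assert codecs.decode(text, "cp037") == "ORD001"

# A PIC S9(9) COMP-3 field holds packed digits, NOT characters, so
# running it through the same codepage conversion would destroy it:
packed = b"\x12\x34\x56\x78\x9C"     # +123456789
assert unpack_comp3(packed) == 123456789
assert unpack_comp3(b"\x98\x76\x5D", scale=2) == -987.65
```

A real converter must therefore know the record layout (from the copybook) and apply the right decoding per field.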
Mainframe files can have VERY complex record structures
– Records may be very wide, containing hundreds or thousands of fields
– Records are usually not “flat”
• They often have sub-records and arrays (COBOL “OCCURS” groups)
• These may be nested many levels deep
– Often, a range of bytes in a record is used in several different ways (COBOL “REDEFINES”)
• This means that the data “looks different” between records in the same file(!)
– Record layouts are defined by COBOL copybooks; here are examples…
COBOL Copybook Example #1
Simple example of a COBOL copybook which defines a record layout:
** SALES ORDERS FILE
01 SLS-ORD-FILE.
05 CUSTOMER-ACCOUNT-NUMBER PIC S9(9) COMP-3.
05 ORDER-NUMBER PIC X(10).
05 ORDER-DETAILS.
10 ORDER-STATUS PIC X(1).
10 ORDER-DATE PIC X(10).
10 ORDER-PRIORITY PIC X(15).
10 CLERK PIC X(15).
10 SHIPMENT-PRIORITY PIC S9(4) COMP-3.
10 TOTAL-PRICE PIC 9(7)V99 COMP-3.
10 COMMENT-COUNT PIC 9(2).
10 COMMENT PIC X(80) OCCURS 0 TO 99 TIMES DEPENDING ON COMMENT-COUNT.
The COMP-3 fields are packed decimal (not EBCDIC!), so EBCDIC-to-ASCII conversion would corrupt them!
COMMENT is a variable-length array: the number of elements depends on the value of COMMENT-COUNT, so the size of this array will vary from record to record.
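The sizing arithmetic for such a record can be sketched in a few lines of Python. The byte counts below follow standard COBOL storage rules (a COMP-3 field occupies one byte per two digits, plus a sign nibble); the assumption that records sit back-to-back with no record descriptor words is a simplification for illustration only.

```python
# Fixed portion of the SLS-ORD-FILE record, per standard COBOL sizes:
# S9(9) COMP-3=5, X(10)=10, X(1)=1, X(10)=10, X(15)=15, X(15)=15,
# S9(4) COMP-3=3, 9(7)V99 COMP-3=5, 9(2) DISPLAY=2  ->  66 bytes
FIXED_PREFIX = 5 + 10 + 1 + 10 + 15 + 15 + 3 + 5 + 2
COMMENT_LEN = 80          # each COMMENT element is PIC X(80)

def record_length(comment_count):
    """Total record size in bytes, given the COMMENT-COUNT value."""
    return FIXED_PREFIX + comment_count * COMMENT_LEN

def split_records(buf):
    """Walk back-to-back variable-length records: read COMMENT-COUNT
    (the 2-byte zoned-decimal field just before the array) to learn
    where each record ends."""
    records, pos = [], 0
    while pos < len(buf):
        count_field = buf[pos + FIXED_PREFIX - 2 : pos + FIXED_PREFIX]
        count = int(count_field.decode("cp037"))   # EBCDIC digits
        end = pos + record_length(count)
        records.append(buf[pos:end])
        pos = end
    return records
```

A record with no comments is 66 bytes; one with 99 comments is 7,986 bytes. This is why a fixed-width reader cannot simply be pointed at such a file.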
COBOL Copybook Example #2 (more complex)
01 LN-HST-REC-LHS.
   05 HST-REC-KEY-LHS.
      10 BK-NUM-LHS PIC S9(5) COMP-3.
      10 APP-LHS PIC S9(3) COMP-3.
      10 LN-NUM-LHS PIC S9(18) COMP-3.
      10 LN-SRC-LHS PIC X.
      10 LN-SRC-TIE-BRK-LHS PIC S9(5) COMP-3.
      10 EFF-DAT-LHS PIC S9(9) COMP-3.
      10 PST-DAT-LHS PIC S9(9) COMP-3.
      10 PST-TIM-LHS PIC S9(7) COMP-3.
      10 TRN-COD-LHS PIC S9(5) COMP-3.
      10 SEQ-NUM-LHS PIC S9(5) COMP-3.
   05 LN-HST-REC-DTL-LHS.
      10 VLI-LHS PIC S9(4) COMP.
      10 HST-REC-DTA-LHS.
         15 INP-SRC-COD-LHS PIC S9(3) COMP-3.
         15 TRN-TYP-IND-LHS PIC X.
         15 BAT-NUM-LHS PIC S9(7) COMP-3.
         15 BAT-TIE-BRK-LHS PIC X(3).
         15 BAT-ITM-NUM-LHS PIC X(9).
         15 TML-NUM-LHS PIC X(9).
         15 OPR-ID-LHS PIC X(8).
         15 HST-ADL-IND-LHS PIC X(1).
         15 HST-REV-IND-LHS PIC X(1).
         15 TRN-AMT-LHS PIC S9(9)V99 COMP-3.
         15 HST-DES-LHS PIC X(25).
         15 CUR-PRC-DAT-LHS PIC S9(9) COMP-3.
         15 REF-NUM-LHS PIC X(3).
         15 INT-FEE-FLG-LHS PIC X(1).
         15 UDF-L01-LHS PIC X.
         15 PMT-HLD-DAY-LHS PIC S9(3) COMP-3.
         15 AUH-NUM-LHS PIC S9(5) COMP-3.
         15 CUR-LN-BAL-LHS PIC S9(9)V99 COMP-3.
      10 ITM-CNT-LHS PIC S9(2) COMP-3.
      10 PYF-COF-REA-COD-LHS PIC X(3).
      10 HST-TRN-ADL-DTA-LHS PIC X(240).
      10 HST-TRN-RDF-1-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-1-LHS OCCURS 20 TIMES.
            20 SPR-TRN-COD-LHS PIC S9(5) COMP-3.
            20 SPR-TRN-REF-LHS PIC X(3).
            20 SPR-TRN-AMT-LHS PIC S9(9)V99 COMP-3.
      10 HST-TRN-RDF-2-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-2-LHS.
            20 OLD-NMN-DTA-LHS PIC X(40).
            20 NEW-NMN-DTA-LHS PIC X(40).
            20 DAT-TO-DSB-LHS PIC S9(9) COMP-3.
            20 RPT-BK-NUM-LHS PIC S9(5) COMP-3.
            20 RPT-APP-LHS PIC S9(3) COMP-3.
            20 RPT-LN-NUM-LHS PIC S9(18) COMP-3.
            20 CMB-PMT-PTY-LHS PIC S9(3) COMP-3.
      10 HST-TRN-RDF-3-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-3-LHS.
            20 HST-OLD-RT-LHS PIC SV9(5) COMP-3.
            20 HST-NEW-RT-LHS PIC SV9(5) COMP-3.
      10 HST-TRN-RDF-4-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-4-LHS.
            20 VSI-PMT-AMT-LHS PIC S9(7)V99 COMP-3.
            20 VSI-INT-AMT-LHS PIC S9(7)V99 COMP-3.
            20 VSI-TRM-LHS PIC S9(3) COMP-3.
            20 INS-REF-NUM-LHS PIC X(3).
            20 STR-DAT-VSI-LHS PIC S9(9) COMP-3.
      10 HST-TRN-RDF-5-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-5-LHS.
            20 NUM-MO-EXT-LHS PIC S9(3) COMP-3.
            20 CLC-EXT-FEE-AMT-LHS PIC S9(5)V99 COMP-3.
            20 EXT-REA-LHS PIC X(1).
      10 HST-TRN-RDF-6-LHS REDEFINES HST-TRN-ADL-DTA-LHS.
         15 HST-TRN-DTA-6-LHS OCCURS 11 TIMES.
            20 ASD-BK-NUM-LHS PIC S9(5) COMP-3.
            20 ASD-APP-LHS PIC S9(3) COMP-3.
            20 ASD-LN-NUM-LHS PIC S9(18) COMP-3.
            20 PMT-AMT-LHS PIC S9(9)V99 COMP-3.
There are several different ways that data may be stored in this 240-byte area.
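At runtime, REDEFINES means a converter must choose the layout per record, usually keyed on a code field elsewhere in the record. A hedged Python sketch follows: the transaction-code values (301, 205) and the code-to-layout mapping are invented for illustration, while the field sizes come from the copybook above.

```python
def unpack_comp3(raw, scale=0):
    """Minimal IBM packed-decimal decoder (low nibble of last byte = sign)."""
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[:-1]) + str(raw[-1] >> 4)
    value = -int(digits) if (raw[-1] & 0x0F) == 0x0D else int(digits)
    return value / 10 ** scale if scale else value

def parse_adl_area(trn_cod, area):
    """Interpret the same 240-byte HST-TRN-ADL-DTA-LHS area differently
    depending on the transaction code -- the runtime analogue of COBOL
    REDEFINES. The specific code values here are hypothetical."""
    if trn_cod == 301:   # say, a rate change -> HST-TRN-RDF-3-LHS layout
        return {"old_rate": unpack_comp3(area[0:3], scale=5),   # SV9(5) COMP-3
                "new_rate": unpack_comp3(area[3:6], scale=5)}
    if trn_cod == 205:   # say, a name change -> HST-TRN-RDF-2-LHS layout
        return {"old_name": area[0:40].decode("cp037").rstrip(),
                "new_name": area[40:80].decode("cp037").rstrip()}
    return {"raw": area}  # unknown code: leave the bytes untouched
```

The same 240 bytes thus decode to two interest rates for one record and two 40-character names for the next, which is exactly why the data “looks different” between records in the same file.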
Mainframe-Hadoop Data Integration Goals
1) Making mainframe data available and usable on the cluster
• Interpretation and conversion of mainframe data formats
• Data validation and cleansing
• Integration of mainframe data with non-mainframe sources
• Use of mainframe data for data warehousing and BI
2) Reducing mainframe costs for storage and/or CPU
• Low-cost archival or backup of mainframe data in its native format
• Processing mainframe data on the cluster (yes- it can be done!)
Mainframe-to-Hadoop Migration / Integration Options
So, what tools can be used for mainframe-Hadoop integration?
The “free” open source tools that come with Hadoop
Open-source conversion code generators: JRecord and LegStar
Mainframe-based migration tools
Legacy-ETL vendors
Syncsort DMX-h
Let’s look at the capabilities of each of these…
Integration Option: Open-source Hadoop Tools
Standard Hadoop tools are used to convert mainframe data to ASCII delimited-text format and process it
Often the “obvious” choice because these tools come with Hadoop
Steps to integrate ONE mainframe data file:
1) Copy the file from the mainframe to edge node (using FTPS or similar tool)
2) Execute custom program (usually Java) to de-compose complex record structures and convert mainframe data types to delimited text file(s) and write to HDFS
3) Delete copy of mainframe file on edge node
4) Execute custom data-validation/cleansing process using MapReduce or Spark on the cluster (normally Java or Hive)
5) Execute custom MapReduce or Spark process to integrate or load into final target (Data Lake, RDBMS, NoSQL database, etc.)
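Step 2 above is where most of the effort goes. In practice it is usually a Java program driven by a parsed copybook; the Python sketch below (the layout, field names and record size are all hypothetical) shows the shape of the work: walk fixed-length records, decode each field by type, and emit a delimited row.

```python
# Hypothetical fixed 30-byte record: (name, offset, length, type, scale)
LAYOUT = [
    ("acct_no",  0,  5, "comp3", 0),   # PIC S9(9) COMP-3
    ("name",     5, 20, "text",  0),   # PIC X(20), EBCDIC text
    ("balance", 25,  5, "comp3", 2),   # PIC S9(7)V99 COMP-3
]
RECLEN = 30

def comp3_value(raw, scale=0):
    """Decode a packed-decimal field (sign in the last nibble)."""
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in raw[:-1]) + str(raw[-1] >> 4)
    value = -int(digits) if (raw[-1] & 0x0F) == 0x0D else int(digits)
    return value / 10 ** scale if scale else value

def record_to_row(rec, sep="|"):
    """Decode one fixed-length record into a delimited-text row."""
    out = []
    for _name, off, length, ftype, scale in LAYOUT:
        raw = rec[off:off + length]
        if ftype == "text":
            out.append(raw.decode("cp037").rstrip())
        else:
            out.append(str(comp3_value(raw, scale)))
    return sep.join(out)

def convert(stream):
    """Yield one delimited row per fixed-length record in the stream."""
    while rec := stream.read(RECLEN):
        yield record_to_row(rec)
```

A production converter must additionally handle OCCURS DEPENDING ON, REDEFINES, variable-record files and bad data, which is where the custom-coding cost of this option comes from.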
Integration Option: JRecord and LegStar
These are open-source code generators for file format conversion
JRecord – uses the CopybookLoader class to interpret COBOL record layouts
LegStar – the developer must use its COBOL Transformer Generator to create a COBOL-to-XML translator, then call that translator in his/her program
Steps to integrate ONE mainframe data file:
1) Copy the file from the mainframe to edge node (using FTPS or similar tool)
2) Execute custom Java program to convert mainframe file by calling methods of CopybookLoader class (JRecord) or calling file-specific COBOL-to-XML translator (LegStar) and write to HDFS
LegStar only: Convert XML output to delimited text file(s)
3) Continue with step #3 on previous slide…
Open-source Options: Pros and Cons
The one “advantage” of these options is that they are “free”
Ironically, the primary disadvantage of these “free” tools is cost
– Development effort is very high
• A very large amount of custom coding is required
• A custom program is needed for each source file and cannot be re-used
• Lack of support
– Difficult and expensive to find, hire and retain skilled developers
Complex mainframe record types are a challenge
– Standard Hadoop tools: No easy way to handle complex records
– JRecord: The Java method calls can get very tricky
– LegStar: The COBOL Transformer Generator has limits
Not “future-proof”
– A Java program is written for a specific execution framework such as MapReduce or Spark- what will you do when another one comes?
Integration Option: Mainframe-based Tools
Migration tools that run in z-Linux on the mainframe system
Able to ingest and convert mainframe file formats from z/OS
Results are written to HDFS or a database
Advantages:
– Does not “stage” data on edge node
Disadvantages:
– Data validation and data quality checks require custom code
– Integration with other data sources requires custom code
– Conversion process runs on mainframe, not commodity hardware
Integration option: Legacy-ETL Vendors
Many legacy ETL vendors now offer “Hadoop” versions
Able to read mainframe files and write to HDFS
Primary advantage is existing skill set of ETL developers
– “The devil you know”
Disadvantages
– Very high cost
– May have difficulty with very complex mainframe record structures
– Require a dedicated metadata repository
• Single point of failure
• Becomes a performance bottleneck
– Do not process natively on the cluster
• Some work only on the edge node
• Those that work on the cluster are code generators (Java or Hive)
• Performance and scalability are limited
The Best Option: Syncsort’s DMX-h
Create complete mainframe-Hadoop integration solutions, including data validation and integration with other sources
Easy-to-use development GUI; no coding
Very short learning curve
Supports very complex mainframe record structures
Native execution on cluster (NO code generation!)
Superior performance
Runs on all major Hadoop distributions
“Future-proof”: Run ETL jobs on MapReduce, Spark or a future framework with no changes
So let’s find out more about the company Syncsort and DMX-h…
Who is Syncsort?
Syncsort is a leading Big Data company that has been in the high volume data business for over 45 years.
Syncsort has successfully transformed its business model from the mainframe era to the age of Hadoop.
Syncsort developed DMX which benefits from the algorithms and coding efficiencies developed from its mainframe heritage.
Syncsort Products
Mainframe Solutions – MFX
Gold-standard sort technology for over four decades – saving customers millions each year over competitive sort solutions.
• High-performance Sort for System z
• zIIP Offload for Copy
• Hadoop Connectivity for Mainframe
Linux/UNIX & Windows – DMX
Full-featured data integration software that helps organizations extract, transform and load more data in less time, with fewer resources and less cost.
• High-performance ETL
• SQL Analysis & Migration
• ETL for Business Intelligence
• Mainframe Re-hosting
Hadoop Solutions – DMX-h
A smarter approach to Hadoop ETL: easier to develop, faster, lower-cost and future-proof.
• Hadoop ETL
• ETL for Business Intelligence
• Mainframe-Hadoop Integration
DMX-h Installation Architecture
• Windows Workstation: DMX-h Job Editor, DMX-h Task Editor, DMX-h Engine
• Edge Node (Linux): DMX-h Engine, DMX-h Agent
• Hadoop Cluster (Data Nodes): DMX-h Engine on every node
The development GUI is installed on Windows workstations.
The DMX Engine is installed on the Windows workstation AND the edge node AND all cluster nodes, allowing job execution anywhere.
To execute on the cluster, the GUI sends a request to the DMX-h agent on the edge node.
DMX-h Mainframe-Hadoop Integration Features
Mainframe file conversion and processing
– Fixed-record, variable-record and VSAM files
– Mainframe DB2 tables
– EBCDIC text and mainframe numeric types (COMP- types)
– Complex record structures, nested to any depth
– REDEFINES, OCCURS and OCCURS DEPENDING ON
Secure transfer from mainframe using FTPS and Connect:Direct
Support for mainframe file compression, saving storage and time
No need to “stage” data on the edge node
Ability to store and process mainframe data in HDFS in its native format, without conversion(!), when desired
Easy integration of mainframe data with other sources
How Easy is it to Interpret a Mainframe File?
I’ll demonstrate using my laptop…
The use case:
We have been given a mainframe file and the COBOL copybook containing the record layout. The only 2 things that we have been told are that it is a fixed-record file and that the record size is 400.
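Given only those two facts, a useful first sanity check before trying to interpret the file is that its length divides evenly by the stated record size. A small Python sketch (the file path is hypothetical):

```python
import os

RECLEN = 400   # the only metadata we were given

def count_records(path):
    """Verify the file is a whole number of fixed-length records and
    return the record count -- a cheap first check on an unknown
    fixed-record mainframe file."""
    size = os.path.getsize(path)
    if size % RECLEN:
        raise ValueError(f"{size} bytes is not a multiple of {RECLEN}")
    return size // RECLEN
```

If the check fails, the file is probably variable-record (or still carries record descriptor words), and the interpretation approach has to change.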
Use DMX-h to Easily Integrate Mainframe Data With…
Mainframe-Hadoop Integration Use Cases
Getting and interpreting the data (with no staging!)
– Reading from mainframe
– Conversion from mainframe formats (when desired)
– Data validation and cleansing
– Writing to cluster target
Processing and data integration
– Joins and lookups to cluster and non-cluster sources
– Normalization & Aggregation
Publishing and exporting
– Load external data warehouses (Oracle, Teradata, DB2, SQL Server, etc.)
– Efficiently generate data extracts for BI users
– Generate native files for Tableau and QlikView
Storing and processing data in mainframe-native format
– Only DMX-h can do this! More info later…
Use Case: Mainframe data ingestion
A DMX-h job running in the edge node* can connect to both HDFS and an external data source (such as the mainframe).
This uses no disk space on the edge node! No limit on file size!
This also works for any external source or database, even if it is remote. The source file can even be compressed.
Format conversion and data-validation can be done within the same job.
*- Can also be done using ANY node on the cluster, if network connectivity allows
Processing in the Cluster Using DMX-h
Once data is in the cluster, additional DMX-h jobs can transform it
The developer defines the operations to be performed
– Join, lookup, aggregate, filter, reformat, etc.
– There is no need to know the details of MapReduce or Spark
– DMX-h Intelligent Execution (IX) automatically runs the jobs on the cluster
DMX-h jobs run natively on all cluster nodes
– No code generation!
– The DMX engine is installed on all nodes
– More efficient than Hive and other ETL tools which generate Java code
– Cluster nodes work concurrently, making the process highly scalable
DMX-h Intelligent Execution on Hadoop
DMX-h has a feature called Intelligent Execution (IX) which automatically runs ETL jobs on the Hadoop cluster
The DMX engine is installed on all nodes in the cluster, so the transformations run natively, with no extra code generation step
IX works when the job runs, not at design time
– It currently supports MapReduce and Spark
– It could support other execution frameworks in the future
– This will require no changes to your DMX-h jobs in production
So this means that the SAME DMX-h job can run
– On your Windows laptop (useful during development for unit testing)
– On an edge node or any single cluster node
– On the cluster using MapReduce
– On the cluster using Spark
Processing Native-Mainframe Data on Hadoop (!)
Using DMX-h, it is actually possible to store and process mainframe data on Hadoop in its original native-mainframe format (!)
DMX-h can even write mainframe-format target files
– No other tool can do this!
Sometimes this is a great idea; for example, you can
– Use HDFS to archive mainframe datasets (MUCH cheaper than DASD)
• Because the data is 100% unchanged, it will pass any auditing requirement
– Quickly move mainframe datasets to Hadoop
• Sometimes you do not have time or resources for a conversion project
• The data can be moved, unchanged, and converted later
• You may not immediately know which data fields will need to be used
– Transform the native-mainframe data using MapReduce or Spark
• The results can even be moved back to the mainframe and used there!
• This allows you to “offload CPU” from the mainframe, reducing MIPS cost.
The bottom line is that DMX-h can convert your mainframe data or work with it in its native form, whichever makes sense for you
DMX-h Live Demo
So let’s see it actually work using some mainframe data…
DMX-h: Superior Performance and Easy Development
Study by Principled Technologies for Dell
– Development comparison using DMX-h and open-source Hadoop tools
• Three different ETL processes (see table below)
– Open-source jobs were built by an experienced Hadoop developer
– DMX-h jobs were built by an entry-level developer with a few days of DMX-h training, and beat the performance of the open-source jobs on the same cluster:
And DMX-h development was much quicker:
– Open source jobs developed by experienced developer: 8.4 days
– DMX-h jobs developed by entry-level developer: 3.8 days (54% less!)
Job execution times (minutes):
ETL Process                          | Open-source | DMX-h | DMX-h Advantage
Fact Dimension Load with Type-2 SCD  | 36:39       | 30:11 | 18%
Data Validation                      | 15:45       |  6:15 | 60%
Mainframe File Integration           |  5:51       |  4:48 | 18%
Resources
Syncsort – www.syncsort.com/liberate
Frank Koconis - Senior Solutions
[email protected]
Glenn McNairy - Account
[email protected]
The development comparison by Dell and Principled Technologies determined that DMX-h enables:
• Easier and Faster Development
• Lower Development Cost
• Better Performance
JRecord – http://jrecord.sourceforge.net/
LegStar – http://www.legsem.com/legstar/
www.syncsort.com/liberate