syncsort dmx-h - meetupfiles.meetup.com/4533812/slides_syncsort_dmx-h.pdf · teradata teradata...
TRANSCRIPT
Syncsort DMX-h
Delivering Smarter ETL through Hadoop… and More Value from Hadoop with Smarter ETL
Ruediger Schickhaus [email protected]
Syncsort Confidential and Proprietary - do not copy or distribute
Big Data – New Name, Old Problem
1956: Transporting 5 Megabytes
IBM 305 RAMAC
1000 kg
50 platters – 1.5 m²
TCO $ 35,000
Big Data – New Name, Old Problem
2012: Transporting 1 Petabyte
IBM 305 RAMAC
1000 kg per Rack
TCO $ 500k to 3.5Million
The Big Data Continuum
Evolved / Dynamic / Plateauing / Advancing
Traditional BI
Data Awakening
(Axis label: Big Data Continuum)
• Early Hadoop adoption, prototyping & experimentation
• Hand-coding: SQL, JCL
• Basic ETL tools
• Standardization & heavy platforms
• Demand for MF data
• Big Data is the new standard
• Hitting arch limits + exponential costs
• Growing infrastructure
Value: Max / Min
MFX | DMX | DMX-h
SQL Migration | Hadoop ETL | Hadoop Sort
& Connectivity | ETL & Rehosting
Optimization
High-performance ETL
Do You Know Syncsort?
• Leading Big Data Integration company
• Speed leader in ETL
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• A history of innovation
• 25+ Issued & Pending Patents
• Large global customer base
• 15,000+ deployments in 68 countries
• First-to-market, fully integrated approach to Hadoop ETL
For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!
Our customers are achieving the impossible, every day!
Integrating Big Data… Smarter!
Key Partners
Hadoop ?
Hadoop will be in two-thirds of advanced analytics products by 2015 (source: Gartner)
Low TCO experience (source: Hadoop Summit)
At @Hortonworks, we believe that by the end of 2015, more than half the world’s data will be processed by Apache Hadoop
How does #Hadoop integrate with an Enterprise Data Center?
Source: @Hortonworks, http://hortonworks.com/blog/smarter-etl-with-hadoop-and-syncsort
Syncsort’s Open Source Contributions
JIRA | Description
2454 | Allow External Sorter Plugin for MR
4808 | Allow Reduce-side merge to be pluggable
4809 | Make classes required for 2454 public
4812 | Create reduce input merger plug-in
4842 | Shuffle race can hang reducer
4482 | Backport of 2454 to MapReduce 1
…and more!!
Smarter Architecture
Hadoop Integration… for Real (No Code Generation. No Compiling. No Tuning)
• Runs natively within MapReduce
• No Coding, No Java
• Leverages sort plug-in
• Small footprint installs on every node
• Higher throughput per node to accomplish more on your cluster
• Smart ETL Optimizer = No Tuning Required
• File-based Metadata with no dependencies on 3rd party systems
Hadoop Cluster
Unleash Hadoop’s Potential
One Tool to Connect to All Sources & Targets. No Coding, No Scripting.
Connect to Any Source & Target
• RDBMS
• Mainframe
• Files
• Cloud
• Appliances
• XML
Smarter Connectivity
Pre-process & Compress
• Cleanse, validate, and partition for parallel loading
• Compress for storage savings
Extract & Load to/from Hadoop
• Extract data & load into the cluster natively from Hadoop or execute “off-cluster” on ETL server
• Load data warehouses directly from Hadoop. No need for temporary landing areas.
PLUS… Mainframe Connectivity
• Directly read mainframe data
• Parse & translate
• Load into HDFS
The Hadoop Challenge
Most organizations use Hadoop to…
COLLECT | PROCESS | DISTRIBUTE
PROCESS: Sort, Join, Aggregate, Copy, Merge
ETL: Extract, Transform, Load
So… How Do You Do ETL in Hadoop Today?
COLLECT (HARD)
• FS Shell Put Command
• Flume
• Sqoop
PROCESS: Sort, Join, Aggregate, Copy, Merge (HARDER)
• Pig
• HiveQL
• Java
DISTRIBUTE (HARD)
• Sqoop
• FS Shell Get Command
DMX-h & Mainframe
The Economics of Data
Cost of managing 1TB of data
Mainframe: $20,000 – $100,000
EDW: $15,000 – $80,000
Hadoop: $250 – $2,000
Scalability
Performance
Reliability
Agility
Aging workforce
But there’s more…
Because Mainframe Is Big Data Too!
Smarter Connectivity… Also for Mainframe
Connect
• Read files directly from the mainframe
• No software required on mainframe
Translate
• Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
• No coding required
Load & Process
• Load directly to HDFS
• Offload batch data processing
• Find more insights
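To make the "Translate" step more concrete, here is a minimal Python sketch of what parsing packed decimal and EBCDIC data involves. It is not DMX-h code: the record layout (a 10-byte text field followed by a 3-byte packed field with two decimals), the `cp037` code page, and the function names are assumptions made for illustration only.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an IBM packed-decimal (COMP-3) field.

    Every byte holds two BCD digits, except the last one, whose low
    nibble is the sign (0xD = negative, 0xC or 0xF = positive/unsigned).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    value = int("".join(str(d) for d in digits))
    if raw[-1] & 0x0F == 0x0D:
        value = -value
    # Shift the implied decimal point `scale` places to the left.
    return Decimal(value).scaleb(-scale)


def parse_record(record: bytes) -> dict:
    # Hypothetical layout: 10-byte EBCDIC name + 3-byte COMP-3 amount.
    name = record[:10].decode("cp037").rstrip()
    amount = unpack_comp3(record[10:13], scale=2)
    return {"name": name, "amount": amount}


# Build a sample record the way it might arrive from the mainframe.
record = "SMITH".ljust(10).encode("cp037") + bytes([0x12, 0x34, 0x5C])
print(parse_record(record))
```

A real copybook usually mixes many field types and record formats, which is the "multi-format" part of the bullet; the sketch covers only one fixed layout.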
Smarter Development
Same Familiar Tool. Five Core Transformations. All The Possibilities. (No Coding, No Scripting, No Kidding!)
• Develop MapReduce ETL processes without writing code
• Leverage existing ETL skills
• Develop and test locally in Windows. Deploy in Hadoop
• Five smart transformations: Sort, Join, Aggregate, Copy, Merge
• Patented algorithms
• No code generation, no compiling
• Execute within MapReduce
• …combine & reuse to create virtually any data flow
Development accelerators for CDC and other common data flows
+ Coding is optional but not required
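The idea of combining five transformations into a data flow can be sketched in plain Python. This is only an illustration of the concept, not how DMX-h implements it; the function names, the in-memory lists, and the reading of "Copy" as a pass-through with an optional filter are all assumptions.

```python
import heapq
from collections import defaultdict
from operator import itemgetter


def sort_rows(rows, key):
    return sorted(rows, key=itemgetter(key))


def merge_sorted(key, *sorted_inputs):
    # Merge inputs that are already sorted on `key` into one sorted list.
    return list(heapq.merge(*sorted_inputs, key=itemgetter(key)))


def join_rows(left, right, key):
    # Inner join: index the right side, then probe it with each left row.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate_sum(rows, group_key, value_key):
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def copy_rows(rows, predicate=lambda row: True):
    # Assumed meaning of "Copy": pass records through, optionally filtered.
    return [dict(row) for row in rows if predicate(row)]


# A small flow: sort two inputs, merge them, join to a lookup, aggregate.
jan = sort_rows([{"cust": 2, "amt": 5}, {"cust": 1, "amt": 10}], "cust")
feb = sort_rows([{"cust": 1, "amt": 7}], "cust")
names = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

orders = merge_sorted("cust", jan, feb)
enriched = join_rows(copy_rows(orders), names, "cust")
print(aggregate_sum(enriched, "name", "amt"))
```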
Smarter Productivity
Fast-track your Hadoop productivity with the Use Case Accelerators
• Fully functional and re-usable templates to design your own data flows
• Quick-start guide, sample data, and even videos
• Take away the complexities AND the guess work
Aggregations
• Web log aggregations
• Lookup + Aggregation
• Word Count
Change Data Capture
• CDC Single Output
• CDC Distributed Output
• Mainframe Extract + CDC
Connectivity & Mainframe Integration
• Direct Mainframe Extract & Load
• Mainframe Extract + CDC
• Smart HDFS Load & HDFS Extract
Joins & Lookups
• Join – Large Side | Small Side
• Join – Large Side | Large Side
• File Lookup
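The two join accelerators differ in how much data each side brings. The slides do not say how DMX-h executes them; the Python sketch below shows two textbook strategies that are commonly paired with these cases (an in-memory lookup for a small side, a sort-merge join for two large sides). Function names and sample data are hypothetical.

```python
from operator import itemgetter


def small_side_join(large, small, key):
    """Large | Small: hold the small side in memory, stream the large side."""
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    for row in large:
        for match in lookup.get(row[key], []):
            yield {**row, **match}


def sort_merge_join(left, right, key):
    """Large | Large: sort both sides on the key, then advance them in step."""
    get = itemgetter(key)
    left, right = sorted(left, key=get), sorted(right, key=get)
    i = j = 0
    while i < len(left) and j < len(right):
        if get(left[i]) < get(right[j]):
            i += 1
        elif get(left[i]) > get(right[j]):
            j += 1
        else:
            k = get(left[i])
            # Find the run of right-side rows that share this key.
            j_end = j
            while j_end < len(right) and get(right[j_end]) == k:
                j_end += 1
            # Pair every left-side row of this key with that run.
            while i < len(left) and get(left[i]) == k:
                for r in right[j:j_end]:
                    yield {**left[i], **r}
                i += 1
            j = j_end


visits = [{"user": 1, "url": "/a"}, {"user": 2, "url": "/b"}, {"user": 1, "url": "/c"}]
users = [{"user": 1, "name": "Ann"}, {"user": 3, "name": "Cy"}]
print(list(small_side_join(visits, users, "user")))
print(list(sort_merge_join(visits, users, "user")))
```

In this toy version the sort happens in memory; on a cluster the point of the sort-merge variant is that neither side has to fit in memory at once.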
Smarter Security
Zero-Pain Support for Common Authentication Protocols
• Seamless support for Kerberos & LDAP
• User-level security using authentication protocol
• Invoke Hadoop jobs using user-level credentials
• Keep Hadoop access separate for each user
• Support multiple ticket locations
• Secure data loads & extracts
• Secure Hadoop MapReduce job execution
LDAP
Hadoop Cluster
Syncsort’s Hadoop ETL Workshops
You will see how easy it is to deploy ETL in Hadoop with DMX-h, with hands-on exercises implementing common ETL tasks, including:
– HDFS Load
– Change Data Capture (CDC)
– Web Log Aggregation
– Joins
Download the VM and play with Hadoop in 15 minutes before the workshop takes place!
Download Syncsort’s ETL package and you’re all set!
– www.syncsort.com/try
DMX-h Benchmarks
File CDC
DMX-h | Pig | Java
149 Lines of Code
70 Lines of Code
Web Log Aggregation
DMX-h | Pig | Java
94 Lines of Code
48 Lines of Code
Sort Acceleration - Terasort
Use Case: TERASORT | ETL or Sort Acceleration: Sort Acceleration | Alternative: Native
Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node
512 | 0:01:47 | 0:01:45 | 2% | 12,863 | 12,873 | 0% | 114,297 | 62,491 | 45% | 6.5 | 6.6
1,024 | 0:02:29 | 0:01:11 | 52% | 14,512 | 14,522 | 0% | 194,896 | 98,972 | 49% | 9.3 | 19.4
1,536 | 0:04:02 | 0:01:23 | 66% | 14,684 | 14,694 | 0% | 287,055 | 143,759 | 50% | 8.6 | 25.0
4,096 | 0:03:31 | 0:02:29 | 29% | 31,520 | 31,549 | 0% | 927,379 | 380,442 | 59% | 26.2 | 37.0
10,242 | 0:08:51 | 0:05:14 | 41% | 47,935 | 47,951 | 0% | 2,835,927 | 1,460,101 | 49% | 26.4 | 44.6
20,484 | 0:14:55 | 0:12:28 | 16% | 106,153 | 105,239 | 1% | 6,112,296 | 3,696,727 | 40% | 31.0 | 37.4
102,400 | 1:12:12 | 0:51:59 | 28% | 387,262 | 387,211 | 0% | 30,436,624 | 16,589,332 | 45% | 32.3 | 44.9
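The slide does not state how the improvement columns are computed. The published figures are consistent with a relative reduction against the native run, (native − DMX-h) / native, rounded to a whole percent; that formula is an inference from the numbers, and the helper names below are made up for this sketch.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string, as used in the table, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(baseline: float, dmxh: float) -> int:
    """Relative reduction versus the baseline, as a whole percent."""
    return round(100 * (baseline - dmxh) / baseline)


# 1,024 GB Terasort row: elapsed time, then CPU time.
print(improvement_pct(to_seconds("0:02:29"), to_seconds("0:01:11")))  # 52
print(improvement_pct(194_896, 98_972))  # 49
```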
File CDC
Use Case: FileCDC | ETL or Sort Acceleration: ETL | Alternative: Pig
Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node
148 | 0:05:31 | 0:01:33 | 72% | 79,876 | 79,559 | 0% | 79,876 | 79,559 | 0% | 0.6 | 2.2
450 | 0:05:11 | 0:01:58 | 62% | 243,834 | 182,869 | 25% | 243,834 | 182,869 | 25% | 1.9 | 5.3
1,515 | 0:07:49 | 0:03:44 | 52% | 845,263 | 557,226 | 34% | 845,263 | 557,226 | 34% | 4.4 | 9.4
Web Log Aggregation
Use Case: WebLogAggregation - Split Size & fixes | Alternative: Pig
Data Size (GB) | Native/Alternative Elapsed Time | DMX-h Elapsed Time | Elapsed Time Improvement | Native/Alternative Memory (GB) | DMX-h Physical Memory (GB) | Memory Improvement | Native/Alternative CPU Time | DMX-h CPU Time | CPU Improvement | Native/Alternative MB/Sec/Node | DMX-h MB/Sec/Node
2,067 | 0:01:12 | 0:00:58 | 19% | 13,499 | 7,813 | 42% | 145,972 | 56,496 | 61% | 40.1 | 49.8
4,135 | 0:01:42 | 0:01:23 | 19% | 18,003 | 15,579 | 13% | 300,627 | 152,390 | 49% | 56.1 | 69.6
10,240 | 0:05:16 | 0:02:04 | 61% | 40,773 | 39,091 | 4% | 807,473 | 335,537 | 58% | 45.3 | 115.4
20,480 | 0:07:54 | 0:06:58 | 12% | 78,654 | 78,128 | 1% | 1,339,453 | 568,107 | 58% | 60.4 | 68.4
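The slides name the web log aggregation use case without spelling out the job itself. As a rough idea of what such an aggregation looks like, the Python sketch below counts hits and bytes per URL. The Common Log Format input, the chosen measures, and all names are assumptions for illustration, not the benchmarked workload.

```python
import re
from collections import defaultdict

# Assumed input: Common Log Format, e.g.
# 10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)$'
)


def aggregate(lines):
    """Return {url: {"hits": n, "bytes": total}}, skipping malformed lines."""
    totals = defaultdict(lambda: {"hits": 0, "bytes": 0})
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        _host, _method, url, _status, size = match.groups()
        totals[url]["hits"] += 1
        totals[url]["bytes"] += 0 if size == "-" else int(size)
    return dict(totals)


sample = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:55:40 -0700] "GET /index.html HTTP/1.0" 304 -',
    '10.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /about.html HTTP/1.0" 200 512',
    "not a log line",
]
print(aggregate(sample))
```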
DMX-h Demonstration
MapReduce – Before and After Syncsort Contribution
Native Sort: Mandatory
Native Sort: Mandatory
Opening the MapReduce Framework
Mapper Output → Shuffle Input → Reducer
Here and here to replace MapReduce native sort
Mapper Output Sorter → Shuffle Input Sorter → Reducer
Here to perform functional logic on our engine
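The two replacement points in the diagram can be illustrated with a toy, single-process MapReduce in Python in which the map-side sort and the reduce-side merge are passed in as plug-ins. This is a conceptual sketch only: it is neither Hadoop's API nor Syncsort's engine, and every name in it is hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def default_sorter(pairs):
    # Stand-in for the framework's native map-output sort.
    return sorted(pairs, key=itemgetter(0))


def default_merger(sorted_runs):
    # Stand-in for the framework's native reduce-side merge.
    return heapq.merge(*sorted_runs, key=itemgetter(0))


def run_job(splits, mapper, reducer, sorter=default_sorter, merger=default_merger):
    # Map side: map each split, then sort its output with the plug-in.
    runs = [sorter([pair for rec in split for pair in mapper(rec)]) for split in splits]
    # Reduce side: merge the sorted runs with the plug-in, then group by key.
    merged = merger(runs)
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    }


calls = []


def custom_sorter(pairs):
    calls.append(len(pairs))  # placeholder for handing off to another engine
    return sorted(pairs, key=itemgetter(0))


splits = [["b a"], ["a c a"]]
result = run_job(
    splits,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
    sorter=custom_sorter,
)
print(result, calls)
```

The job logic (mapper and reducer) is unchanged when the sorter is swapped, which is the property the pluggable-sort contributions aim for.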
Hadoop Change Data Capture using DMX-h
Sources: Mainframe files + Teradata
Syncsort’s DMX-h on Hadoop: Node, Node, HDFS
Target: Teradata
Flow: Mainframe files + Teradata → Load files → Load to HDFS → DMX-h ETL MapReduce (CDC) on the Hadoop nodes (Node, Node, HDFS) → Load to Teradata → Teradata
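The slides show where the CDC step runs but not its logic. A common file-based approach, assumed here, compares the previous and current snapshots on a key and classifies each record as an insert, update, or delete. The Python sketch below shows that comparison in memory; the function name and sample records are hypothetical.

```python
def capture_changes(previous, current, key):
    """Classify differences between two snapshots keyed on `key`."""
    old = {row[key]: row for row in previous}
    new = {row[key]: row for row in current}
    return {
        "insert": [new[k] for k in new if k not in old],
        "update": [new[k] for k in new if k in old and new[k] != old[k]],
        "delete": [old[k] for k in old if k not in new],
    }


yesterday = [
    {"id": 1, "balance": 100},
    {"id": 2, "balance": 250},
    {"id": 3, "balance": 75},
]
today = [
    {"id": 1, "balance": 100},
    {"id": 2, "balance": 300},
    {"id": 4, "balance": 10},
]
changes = capture_changes(yesterday, today, "id")
print(changes)
```

Only the changed records would then need to be loaded into the target, rather than the full snapshot.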
Bridging the Gap Between Big Data and Big Iron
A Smarter Approach to BIG Mainframe Data!
Syncsort DMX-h ETL Edition
• Zero-MIPS Connectivity
• Painless Integration & Translation
• Mainframe-like Performance
• Massively Affordable Scalability
• Enterprise-grade Reliability
• Iron-clad Security
• Decades of Proven Mainframe Expertise
CLOUDERA: THE PLATFORM FOR BIG DATA
• Brings batch & real-time compute to storage
• Works with all types of data
• Changes the economics of data management
CDH
Syncsort DMX-h ETL Edition: Connect, Translate, Process
Cloudera Manager
Cloudera Navigator
Cloudera Support