etl benchmark favours datastage and talend


ETL Benchmark Favours DataStage and Talend

Vincent McBurney Dec 9, 2008 | Comments (7)

A French consulting company called Manapps has released an ETL benchmark report

that compares Talend, Pentaho Kettle, DataStage and Informatica and two of those

four vendors will be pleased with the results.

*** March 2009 Update: I was contacted by Manapps, who told me the benchmark

report was a draft and not the final report. There is a final report that shows more

favourable results for Informatica and I will update this blog post when I have some

spare time. Here is the statement from Manapps as translated from French by Google

Translate:

We publish on our website version of the document "Benchmark ETL" whose original version was wrongly found work published on various sites or blogs outside our company. This version significantly modifies certain measures have been taken based on more advanced technical parameters regarding Power Center Solution Informatica, we did not have in the original version. The findings have been modified accordingly. Note that this document remains a working document. It has no goal of marketing. We want to provide publishers covered by the study may be required to take any action they deem useful, and if necessary publish the results accordingly.

*** End of update

I am going to write a post on what I think of the objectives of this benchmark but for

now here are the results and an analysis of each test. Manapps is part of

OmegaHighTech, a company with over 3,500 employees around the world. Amongst

other things Manapps do Business Intelligence consulting and data warehouse

implementations.

The Benchmark is released under creative commons license:

You are free:

to Share — to copy, distribute, display, and perform the work

to Remix — to make derivative works

I hear Coldplay have already taken the words to use in their next song.

I don’t know how they distributed the PDF – I found it in a blog post by Marc Russell:

ETL Benchmark by Manapps. I’ve copied the graphs into this blog post with some

comments and cropped off the top of the graphs for readability – and because I don’t

think the really high scores are reflective of the products but show poor ETL design.

Event Number 1 – Sequential Files

The first test was reading a sequential file and writing out to a sequential file. Anyone

who knows DataStage can guess the result – DataStage Server Edition will be

awesome and DataStage PX was not so great:


ETL Benchmark Favours DataStage and Talend http://it.toolbox.com/blogs/infosphere/etl-benchmark-favours-datastag...

1 of 12 21/01/2014 10:37

Benchmark 1 Results – ETL Sequential File Processing

DataStage Server Edition LOVES sequential files. It’s been optimised over 15 years

of releases, not just for reading and writing sequential data but for memory caching and row

buffering in the middle. Look at the 5 million row result – less than a third of the time

of the nearest competition. This is one of the reasons why DataStage Server Edition

customers who have a lot of low to mid range data sizes and share data in sequential

files are sticking with Server Edition.

DataStage PX tolerates sequential files – it imports the data and converts it to parallel

format and then exports it again back to sequential format. Not only that but because

they chose 2 nodes DataStage PX had to partition and then unpartition the damn data.

Let me show you how much this sucks. Let’s say you have a wheelbarrow with two

sacks of flour in it and you’ve got to deliver it to the king of Sparta before sunset or

he’ll kick you into a bottomless pit. With most of these ETL tools you pick up the

wheelbarrow and run like hell to the king. With DataStage PX you pick up the

wheelbarrow, you wheel it over to two wheelbarrows and put a sack of flour in each

and then clone yourself and the two of you push both those wheelbarrows to the

palace where you swap them back to one wheelbarrow and give it to the king. He

kicks you and your doppelganger down the bottomless pit and throws the

wheelbarrows after you. If you have 100 flour bags your parallel wheelbarrows are

great, but with two flour bags it’s a waste of time.

A single job parameter that tells this job to run in sequential mode could have made it

as much as 50% faster.
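
The wheelbarrow story can be sketched in a few lines of Python. This is not DataStage code: the function names `copy_sequential` and `copy_partitioned` are made up for illustration, and round-robin partitioning is only an assumption about how the rows might be spread over the two nodes.

```python
def copy_sequential(rows):
    """Single-stream copy: read each row and pass it straight through."""
    return list(rows)


def copy_partitioned(rows, nodes=2):
    """Round-robin rows across `nodes` partitions, then collect them back.

    The result matches copy_sequential, but every row is handled two
    extra times: once to partition it and once to collect it.
    """
    partitions = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        partitions[i % nodes].append(row)  # partition step

    # Collect step: interleave the partitions back into the original order.
    collected = []
    longest = max((len(p) for p in partitions), default=0)
    for i in range(longest):
        for p in partitions:
            if i < len(p):
                collected.append(p[i])
    return collected


rows = [f"row-{n}" for n in range(10)]
same_output = copy_partitioned(rows) == copy_sequential(rows)
```

Both functions return the same rows. The partitioned version just touches every row two more times, which only pays off when there is real work to do on each node.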

Informatica – youch! They are like Forrest Gump before the leg braces came off.

They finished the first race after DataStage Server finished the third race.

Informatica was the only ETL tool in this test that used three stages to do this job

instead of two. They did a file input and then a file delimiter definition – painfully slow

row-by-row delimiter definition. I’m no expert but someone tell me this was a dumb

job design. Eventually Informatica picked up speed and in the 20 million test it came

second.

Test 2 – MySQL

This test only compared Talend and Pentaho writing to MySQL. In this test they had

two versions of Talend – TOS 2.4.1 and TOS 2.4.1 extended insert. This tells me that

Manapps did a bit more research on Talend than Pentaho Kettle. By this stage I’m

thinking this test is rigged. Fun, but rigged.
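
For readers who have not met the term, an extended insert is assumed here to mean packing many rows into one multi-row INSERT statement instead of sending one statement per row. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the `target` table, the function names and the batch size of 100 are all hypothetical and say nothing about how Talend implements the option.

```python
import sqlite3


def insert_row_by_row(conn, rows):
    """One INSERT statement per row."""
    for row in rows:
        conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", row)


def insert_extended(conn, rows, batch_size=100):
    """One INSERT statement per batch, with many value tuples in each."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            f"INSERT INTO target (id, name) VALUES {placeholders}", params
        )


def load(insert_fn, rows):
    """Create a fresh in-memory table, load it, and return the row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")
    insert_fn(conn, rows)
    count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    conn.close()
    return count


rows = [(n, f"name-{n}") for n in range(250)]
```

Calling `load(insert_row_by_row, rows)` and `load(insert_extended, rows)` loads the same 250 rows either way; the second issues 3 statements instead of 250, which is where the saving on a networked database would come from.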

Test 3 – Read a Database

In test 3 Manapps has a test to read from an Oracle database table and write out to a

sequential file.


Test Result 3 – ETL Reading a Database

DataStage PX did surprisingly well – it had three stages, database – modify –

sequential file. It repartitioned twice (needlessly) so could be improved but because it

has the Oracle Enterprise stage it kicked butt on the database read. DataStage

Server did not do so well and I would have loved to tweak the array size and transaction

size properties to read the database data in one chunk. Might have got a huge

performance boost.

Informatica – well I’ve chopped off the top of the graph to make it readable but again

they had the extra stage and that added 40 seconds to each job run time. Surely

there is a way to avoid that 40 second lag.

Test 4 – Database Bulk Load

This test was notable as the only one that Pentaho Kettle won. At least in the

minuscule volume category. The editor must have been asleep at the wheel when he

let this one through. They did worse as the volumes went up.

Test Result 4 – Feeding Oracle Bulk Load

It was another test that Server Edition won comfortably because this is essentially the

same as test 1 – it’s all about the speed of creating a sequential file. After you’ve got

the file all five ETL tools call the exact same Oracle bulk loader. Test 1 and Test 4 –

identical.

Test 5 – Same as Test 1 with a Transform in it

Finally, finally! An ETL test that has Transform in it (you know, the T in ETL?) All the

tests up to now were bullshit, this is the first true ETL test and it took Manapps five

tests to get here:



Test Result 5 – ETL Transformer

This was the first time the parallel partitioning of DataStage PX may have helped

rather than hindered. With two nodes doing those transform functions the higher the

volumes the more it wins. We finally see Informatica do well, coming second in the 20

million row test. What I would give to see a 1,000,000,000 row test. Once again

Informatica had that initial 35 second handicap but you take that out and it performed

really well.

Test 6 – ELT

This was a silly test as Manapps didn’t have DataStage ELT and didn’t know how to

use Informatica ELT. They tried another way to force those products to use ELT but got

it wrong. The one thing I will say about this test is that Talend seems to have a good

GUI for simple ELT:

Benchmark Job – Talend ELT

This job lets you define an Oracle connection, aggregate the data and save it to

another Oracle table. It ran in under 2 seconds for up to 1 million rows so I assume it

pushes it all down to the database. This is a good ETL way to write an aggregation –

it runs just like a database group by command but it’s got an open and easy to read

data lineage. This is just the type of thing I would expect from DataStage and

Informatica ELT – if you can get it running.

What Manapps did in this test that is kind of sneaky is have the source and target

table in the same database – which kind of defeats the purpose of using an ETL tool.

A true ELT scenario involves different source and target databases. The other

four ETL tools could have done the exact same thing with user-defined SQL select

though the data lineage would not have been as good.
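
The difference between pushing the aggregation down and pulling the rows out can be shown with a small sketch. It uses Python's sqlite3 module rather than Oracle, and the `sales` and `sales_by_region` tables and both function names are invented for the example; it illustrates the idea and is not the job Manapps built.

```python
import sqlite3
from collections import defaultdict


def setup():
    """Source and target tables in the same database, as in the benchmark."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.execute("CREATE TABLE sales_by_region (region TEXT, total INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10), ("south", 5), ("north", 7), ("south", 1)],
    )
    return conn


def aggregate_elt(conn):
    """ELT: one statement, and the rows never leave the database."""
    conn.execute(
        "INSERT INTO sales_by_region "
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    )


def aggregate_etl(conn):
    """ETL: extract the rows, aggregate in the engine, load the result."""
    totals = defaultdict(int)
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    conn.executemany(
        "INSERT INTO sales_by_region VALUES (?, ?)", list(totals.items())
    )


def result(conn):
    return conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()
```

Both routes fill the target table with the same totals. With source and target in one database the ELT route is a single user-defined SQL statement, which is the point made above.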

The tester made the comment:

Only Talend Open Studio permits to use an ELT mod. Informatica got the Push

Down Optimization, but I didn’t find this feature on the tool.

You’ve got to buy the add on! It’s not free with the tool! They are not as charitable as

Talend.

Test 7 – more ELT

This is the most interesting test in the benchmark because it shows how ETL engines

process faster than ELT by throwing grunt, memory and hardware at the transform

part of the job. This test compares a pure ELT command from Talend versus traditional

ETL from DataStage and ETL wins:



Benchmark Job – Talend complex ELT

Benchmark Job – DataStage Join and Filter

The first diagram is the clever Talend ELT interface that leaves the data on the

database and performs some mapping on it. The second diagram is traditional ETL,

DataStage reads the data and then transforms, joins, transforms, filters and writes it

out. It looks like it’s doing a lot more work but don’t let those Modify stages deceive

you – they are almost zero overhead and since there are no sequential files this job is

in its element and comes out fastest:

Benchmark Result 7 – Join and Filter

Talend has just one processing engine – the database. DataStage has two – the

database and the ETL server. The higher the volume the faster DataStage PX will

go. Manapps only tested up to 1,000,000 in this test despite testing higher volumes in

other tests. I would have liked a 20,000,000 test for this one.

It would be so so very much faster with a tiny bit of tuning. You see this little

DataStage symbol on the links? That’s pain time. That’s data being sorted and

repartitioned – swapping flour bags between wheelbarrows. If you sort the data in the

source database stages and remove the sorts from the job this baby runs a lot faster.

Because DataStage PX jobs push everything through a parallel engine you become

adept at sorts and partitions and Manapps would have worked this out with some

scenario testing.

Test 8 - Sort

A very interesting test result that shows how friggen fast the DataStage PX sort is.

When Applied Parallel Technologies (who became Torrent Systems and then IBM

DataStage PX) wrote a parallel flow based processing engine in 1993 one of the first

functions they wrote was sort – it was an obvious candidate for running faster in

parallel mode. Fifteen years later and it flies:

Benchmark Job – DataStage PX Sort

It’s a simple DataStage job design, read from one file and write to another. The

properties of the second file insist on the data being sorted so you can see the little



yellow sort symbol that tells you the data on that link is being sorted. This test had

two sequential files, the Achilles heel of parallel processing, but it was kind of like a

sprint relay with Father Christmas handing the baton to Usain Bolt who handed it to

Roseanne Barr. The sort in the middle made up the time.

With the result that DataStage PX was miles ahead:

Benchmark Result 8 – Sort Speeds

My own benchmark tests showed DataStage PX was many times faster at sort and

aggregation than DataStage Server Edition even before you added any parallel

nodes. It’s got very well written processing components. A 7 minute sort in Server

Edition took 12 seconds on one node in DataStage PX. This is the reason why

Co-Sort and Syncsort (the sequential file sorting specialists) are welcome at

DataStage Server sites and not DataStage PX sites. DataStage PX does not need

any help with sorts.

Talend used GNU sort – external to the tool, and lost badly on the very high volume

sort. Maybe there is a better sort script out there. Looks like DataStage Server

Edition fell over on 20 million rows – not a huge surprise. If you are sorting big data

volumes you need to upgrade to PX! You’ll get a huge discount at the moment thanks

to the credit crisis, they are desperate for any extra licensing.

Test 9 – ETL Aggregation

Test 9 is similar to test 8 but it’s aggregation instead of sort. One of the few tests

Informatica won, edging out DataStage in the 20 million category despite that initial

35-40 second flow start:

Benchmark Result 9 – ETL Aggregation

Run Informatica Run! The trend line for Informatica is impressive – not much increase

in time from 100,000 to 5,000,000. If they could break free of those leg braces earlier

they would be winning all categories.

The test developer made a mistake with the DataStage PX job in this test and left it

with two sorts instead of one:



Benchmark Job – DataStage PX Aggregation

They used the job from Test 8 that had an enforced sort in it instead of creating a new

job or using the job from test 1. The aggregator will add a sort – it needs sorted data

in order to aggregate. The output sequential file is also asking for a sort (left over

from test 8), possibly in a different order to the aggregator, so this job is combining

test 8 and 9 into one and DataStage PX is still coming first or second in most results.

Could have been 10-20% faster without that second sort.
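
Why a sort-based aggregator wants sorted input, and why the second sort is wasted effort, can be seen in a minimal Python sketch. The function name `aggregate_sorted` and the sample rows are hypothetical; this shows the general technique, not the DataStage aggregator's internals.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Sum amounts per key, assuming the rows arrive sorted by key.

    Only one group is in flight at a time, so the input has to be
    sorted on the grouping key for the totals to come out right.
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


rows = [("b", 2), ("a", 1), ("b", 3), ("a", 9)]

by_key = sorted(rows, key=itemgetter(0))   # the one sort the job needs
totals = list(aggregate_sorted(by_key))

# A second sort on a different column reorders the output at extra cost
# but leaves the totals themselves unchanged.
resorted = sorted(totals, key=itemgetter(1))
```

Feeding the unsorted `rows` straight into `aggregate_sorted` yields four partial groups instead of two totals, which is why the aggregator inserts its own sort when the data is not already ordered.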

Test 10 – Lookups

Sigh, this is where the benchmark really gets loopy. Mork and Mindy loopy. This job

is what you expect someone to build if they have only been using DataStage for a

couple of hours:

Benchmark Job – DataStage PX Join

It’s a mess. These sorts and partitions are killers. Lots of flour bag sorting

and swapping between wheelbarrows. This job would be as much as 80% faster if you

replaced that join with a lookup. The lookup stage does not need any sorting to work

and 9 out of 10 times it will be faster than a join. By default I use a Lookup stage and

I need something to go seriously wrong with the job before I switch to a Join. This job

design doesn’t cut it for benchmarking.
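
The reason a lookup can skip the sorting is easier to see side by side. Below is a hedged Python sketch of the two general techniques, an in-memory hash lookup and a sort-merge join; `hash_lookup` and `sort_merge_join` are illustrative names, and neither is a claim about how the DataStage Lookup or Join stages are implemented.

```python
def hash_lookup(stream, reference):
    """Lookup: load the reference rows into a dict, then probe per input row."""
    table = dict(reference)  # key -> value, held in memory
    return [(k, v, table[k]) for k, v in stream if k in table]


def sort_merge_join(stream, reference):
    """Join: sort both inputs on the key, then walk them in step.

    Assumes reference keys are unique, as in a typical lookup table.
    """
    left = sorted(stream, key=lambda r: r[0])
    right = sorted(reference, key=lambda r: r[0])
    out, j = [], 0
    for k, v in left:
        # Advance the reference side until it catches up with the input key.
        while j < len(right) and right[j][0] < k:
            j += 1
        if j < len(right) and right[j][0] == k:
            out.append((k, v, right[j][1]))
    return out


stream = [(3, "c"), (1, "a"), (2, "b"), (1, "z"), (9, "q")]
reference = [(1, "one"), (2, "two"), (3, "three")]
```

Both functions find the same four matches. The lookup needs no sort at all but holds the whole reference set in memory, while the join pays for two sorts and keeps only the current rows in hand, which is the trade-off that starts to favour the join once the reference data gets very large.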

Talend does best with a small lookup volume, DataStage PX does okay and

Informatica is astoundingly bad.

Benchmark Result 10 – ETL lookup

I’m no Informatica expert but the job design looked kind of crazy:

Benchmark Job – Informatica Lookup

Can someone tell me what is wrong with it? Informatica lookups shouldn’t be this

slow.

The one time you do want to use a DataStage PX Join stage instead of a Lookup is

when you have massive amounts of lookup data, and in this benchmark there was a

set of tests with 5,000,000 rows of lookup data and we finally got to see a Join stage

that was worthwhile:



Benchmark Result 10 – high volume ETL Lookup

This test has high volume input rows AND high volume lookup rows. We have

reached a volume of data that justifies a Join stage where data is sorted before the

comparison of rows is performed – and you can see the scalability of DataStage PX

on 20 million input rows joined to 5 million lookup rows. This test gives you an idea of

what would happen with a job with many stages – join, lookup, transform and sort.

DataStage PX would be further in front as the volumes go up and if you added more

CPUs the difference would be even more obvious.

Test 11 – Lookups with Rejects

Test 11 is similar to test 10 except when you cannot join you produce a reject. This

has me so frustrated, I want to take Manapps out and beat them with a rake, this test

would have been so much better with a DataStage PX lookup stage. It can do the

lookup and reject in one step so much faster than the join stage that does it in two

steps with extra sorts.
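
Doing the lookup and the reject in one step looks roughly like this in Python. The function name `lookup_with_rejects` and the sample data are made up for the example, and it sketches the single-pass idea only, not the DataStage stage itself.

```python
def lookup_with_rejects(stream, reference):
    """Probe an in-memory table once per row and split matches from rejects."""
    table = dict(reference)
    matched, rejected = [], []
    for key, value in stream:
        if key in table:
            matched.append((key, value, table[key]))
        else:
            rejected.append((key, value))  # no match: send down the reject path
    return matched, rejected


stream = [(1, "a"), (7, "x"), (2, "b"), (9, "q")]
reference = [(1, "one"), (2, "two"), (3, "three")]
matched, rejected = lookup_with_rejects(stream, reference)
```

Every input row is read once and lands in exactly one of the two outputs, with no sorting on either side.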

By this stage of the benchmark the Informatica job is looking like the route home that

a drunk driver takes to avoid the police:

Benchmark Job – Informatica Drink Driving

What the hell? DataStage PX – in the hands of a drunk driver – still manages to crash

into second place on high volumes but I’m afraid the testers did not know enough

about lookups to do it justice. Informatica fared much worse in the hands of a novice

and I wait with bated breath to hear what was wrong with these job designs.

Conclusion

Thanks to Manapps for the benchmark but I would like to see the sequential file tests

run with DataStage on one node and the lookup tests with a lookup stage – hey isn’t

that a coincidence. Lookup test – lookup stage. Who would have thought a lookup

stage would work for a lookup test?

Talend does a lot of its work in memory (like DataStage PX) but this starts to come

apart at the seams when the volumes go up. DataStage PX handles this by caching

and buffering. It would be interesting to see a benchmark going into the 100s of

millions of rows or 50 columns or more to see what each tool does under real stress.

The type of processing that is common for telcos, banks and insurers.

These tests do show that when you are down in the smaller volumes the open source

ETL tools are an option and I would prefer them to manual coding, but in the higher

volumes give me a premium tool any day. Even a novice can get good results.



7 Comments

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Vincent McBurney is an IBM Information Champion for Information Integration.


Werner Daehn Dec 9, 2008

Would love to run that benchmark myself. In case you ever get the source files and

database tables let me know.

Personally I don't like the test either. How many GB of data is moved via flat files vs.

from source to target database? I guess the majority is database-to-database, hence

the file tests are nice and simple but do not help much as the parsing of the files can

be overly expensive, more expensive than the transformations - if there would be

any.

The other thing I am surprised by is that there is a difference between the vendors. I

would have thought that for these copy operations with a lookup in the middle and

such, the performance bottleneck would be the disk I/O. So I would immediately

have guessed that the flows are not correctly designed with each tool. Especially in

an ELT case, where the engine has almost nothing to do compared to the database,

the difference should be zero, shouldn't it?

But the most surprising statement was actually yours about the Oracle bulkload:

"it’s all about the speed of creating a sequential file. After you’ve got the file all five

ETL tools call the exact same Oracle bulk loader."

I know Informatica supports the Oracle API bulkloader, so no need to write any file.

Doesn't DataStage as well?

-Werner

Johannes Almiala Dec 10, 2008

I'm probably going to comment more later, but now a quick one for Test 11.

One thing I would have done differently with Informatica is that I would have used a

single Router transformation instead of four filters. A router does in one pass the

same as the four filters do in four passes, plus you catch the rows that don't match any

of the filter conditions. Also, there is no visibility on how the lookup has been

configured, it could easily be a bottleneck.

Generally, the default amount of memory allocated to transformation caches in

Informatica PowerCenter sessions is 5% (or 512 MB, whichever is smaller) of the

maximum available. If that hasn't been changed and the lookup source file is large,

this test will basically measure random disk reading speed on the server platform.

Vincent McBurney Dec 10, 2008

My Oracle bulk loader days go back to DataStage Server Edition and about Oracle 8!

That version wrote out a text dat file under the covers in the Oracle bulk loader data

format and passed the file to the Oracle bulk load program. DataStage PX has a

much newer Oracle Enterprise stage compatible with the newer versions of Oracle

but I don't know what it does under the covers. The bulk load test would be



interesting if the source was a database table so you could take sequential file

parsing out of the equation - and then bump the data volume up to 20 million rows.

Dec 14, 2008

It is interesting that the version of DataStage used in the benchmarking is two major

releases behind the current version 8.1. Along with little mention that DataStage is

the hands down winner in linear scaling of parallel jobs to available hardware by

simple changes to a configuration file. The fact that DataStage can scale seamlessly

beyond any other vendor in this test and that management of that scalability is least

costly in terms of hardware, installation, and IT resources is overlooked.

As mentioned it doesn't make a lot of sense to run any job in a parallel process when

the data volumes and transformative actions are minimized but once the volumes

increase or transformations expand beyond simple data mapping, the parallel engine

underlying the Information Server platform begins to easily outperform the other

vendors in the test.

In addition, no mention is made of the integrated platform Information Server brings

to the table, as most of the vendors in the test recognize that data integration is much

more than ETL. Granted my opinions are biased and all should evaluate these

results from their own perspective. The only point here is that taking a simple

scenario or two does not give the reader an accurate view of the products or

capabilities as each vendor can demonstrate where the benchmarked test deviates

from their best practices for each product.

USER_1963953 Apr 1, 2010

This benchmark has strange results. I just did a mapping with Informatica PowerCenter

8.6.1 that calculates 2 ranks on 34 million rows, joins them with over 64 million

rows, then aggregates them, and it takes 130 secs, consuming on average 5 POWER5+

CPUs at 1900 MHz.

I repeated the same with a larger volume of data, 80M+ rows for the 2 ranks and 1 billion for

the outer join and the aggregation, and it takes 450 secs to execute, with the same CPU

consumption.

Anyway, benchmarking ETL is a very difficult task because the results depend too much

on the skill of the developer and their knowledge of the product's architecture.

BTW: Informatica lookups are slower than joiners, at least in the 8.6.1 release. We will

see in Informatica 9.

Younes Siebel Oct 7, 2010

I think that Talend, when there is not a huge amount of information to deal with, can simply be

the best.

But the thing that makes it more interesting is that it costs $0.00, while DataStage

Server is more than $80,000.00!

naresh ketepalli Aug 2, 2011

Can anyone tell me the architecture and features of Talend?
