
    WHITEPAPER

    ETL 2.0

    Data Integration Comes of Age

    Robin Bloor, Ph.D.

    Rebecca Jozwiak


    Executive Summary

    In this white paper, we examine the evolution of ETL, concluding that a new generation of ETL products, ETL 2.0 as we have called it, is putting a much needed emphasis on the transformation aspect of ETL. The following points summarize the contents of the paper.

    Data movement has proliferated wildly since the advent of the data warehouse, necessitating the growth of a market for ETL products that help to automate such transfers.

    Few data centers have experienced a consistent use of ETL, with many such programs being hand coded or implemented using SQL utilities. As a consequence, the ETL environment is usually fragmented and poorly managed.

    Databases and data stores in combination with data transfer activities can be viewed as providing a data services layer to the organization. Ultimately, the goal of such data services is to provide any data needed by authorized IT and business users when they want it and in the form that they need it.

    The capabilities of the first generation of ETL products are now being stressed by:

    - The growth of new applications, particularly BI applications.

    - The growth of data volumes, the increasing variety of data, and the need for speed.

    - The increasing need to analyze very large pools of data, often including historical and social network data.

    - High-availability (24/7) requirements that have closed batch windows in which ETL programs could run.

    - Rapid changes in technology.

    With respect to technology changes, we note the emergence of a whole new generation of databases that are purpose-designed to exploit current computer hardware, both to achieve better performance and scale and to manage very large collections of data. Similarly, we believe the second generation of ETL products will be capable of better performance and scalability, and will be better able to process very large volumes of data.

    We characterize the second generation of ETL products as having the following qualities:

    - Improved connectivity

    - Versatility of extracts, transformations, and loads

    - Breadth of application

    - Usability and collaboration

    - Economy of resource usage

    - Self-optimization


    By leveraging an ETL tool that is versatile in both connectivity and scalability, businesses can negate the challenges of large data volumes to improve the overall performance of data flows. The versatility of second generation ETL tools additionally allows for a wide variety of applications that address business needs, however complex. These products will improve the time to value for many applications that depend on data flows and provide a framework that fosters collaboration among developers, analysts, and business users. By virtue of software efficiency, these tools will require fewer hardware resources than previous tools, and because transformations are processed in memory, they will eliminate the need for workarounds, scheduling, and constant tuning.

    In summary, it is our view that ETL tools with such capabilities become increasingly strategic because of their critical role in the provision of data services to applications and business users, and the inherently low development and maintenance costs can help businesses realize a significantly lower overall total cost of ownership (TCO).


    The Big Data Landscape

    The vision of a single database that could serve the needs of a whole organization was laid to rest long ago. That discarded ideal was superseded by a pragmatic acceptance that the data resources of an organization will involve many data stores threaded together by ad hoc data flows carrying data from one place to another. The corporate data resource is fragmented, and there is scant hope that this state of affairs will change any time soon.

    It is worth reflecting on why this is the case.

    As the use of database technology grew, it soon became clear that the typical workload of transactional systems was incompatible with the heavy query workloads that provided data to reporting applications. This gave rise to the idea of replicating the data from operational systems into a single large database (a data warehouse) that could serve all reporting requirements.

    Initially, data migration from operational systems to data warehouses was served by programmers writing individual programs to feed data to the data warehouse. This task was time-consuming, carried with it an ongoing maintenance cost, and could be better automated through the use of purpose-built tools. Such tools quickly emerged and were called extract, transform, and load (ETL) tools.

    Over time, the wide proliferation of business intelligence (BI) applications drove the increasing creation of data marts, each a subset of data focused on a single business area or function. This, in turn, meant more work for ETL tools.

    Eventually, the speed of this data transfer process came into question. The time required to move data from production systems to the data warehouse and then on to data marts was too long for some business needs. Therefore, organizations were forced to implement suboptimal workarounds to achieve the performance needed to support the business.

    Meanwhile, data continued to grow exponentially. While Moore's Law increases computer power by a factor of 10 roughly every six years or so, big databases seemed to grow by 1,000 times in size during that period. That's Moore's Law cubed. That's Big Data, and that's mainly what is prompting the latest revolution. There is no doubt that data integration has become increasingly complex and costly. Organizations can no longer rely solely on hardware and inefficient workarounds to overcome the Big Data challenges ahead. Clearly, a new approach is needed.

    The Stressing of ETL 1.0

    Aside from the growth in the number of business applications and the perennial growth in data volumes, there are four distinct factors that have placed increasing demands on ETL tools since they were originally introduced.

    Timing Constraints

    Fifteen years ago, interactive applications rarely ran for more than 12 hours. This left ample time for ETL tools to feed data to a data warehouse, for reporting tools to use the operational data directly, and for database backups. This convenient slack period was generally referred to as the batch window. ETL transfers tended to run on a weekly or even a monthly basis.


    But the batch windows gradually began to close or vanish. Data marts proliferated as a much wider variety of BI and analytics applications emerged. Eventually, the demand for data warehouse updates shifted from nightly to hourly to near-real time as the need for timely information grew. ETL tools had to try to accommodate this new reality.

    Technology Shifts

    Computer technology gets faster all the time, but what Moore's Law provides, data growth takes away. To complicate the situation, hardware does not improve in a uniform way. By 2003, after several years of increasing CPU clock speed as a means to accelerate processing power, Intel and AMD began to produce multicore CPUs, packaging more than one processor on each chip. Most databases, because they had been built for high scalability and performance, soon benefited from these additional resources. However, few ETL tools were designed to exploit multiple processors and parallelize workloads. They were behind the curve.

    The Advent of Big Data

    There have always been massive amounts of data that would be useful to analyze. What is relatively new is the increasing volume, complexity, and velocity of that data, referred to as Big Data.

    In general, the term Big Data involves collections of data measured in tens or hundreds of terabytes that require significant hardware and software resources in order to be analyzed. Large web businesses have such data. So do telecom companies, financial sector companies, and utilities companies of various kinds. And now that there are databases that cater to such data, many other businesses are discovering areas of opportunity where they can accumulate and analyze large volumes of data as well.

    In practice, for many organizations Big Data means digging into previously archived or historical data. Similarly, large and small businesses are also discovering the need to store and analyze data at a more granular level. Either way, the elusive heaps of data that were previously considered inaccessible are fast becoming viable data assets. For ETL, Big Data translates into new and larger workloads.

    Cloud Computing, Mobile Computing

    Cloud computing adds complexity to the equation by extending the data center through the Internet, providing not only additional data sources and feeds, but also cloud-based applications like Salesforce.com, which may pose additional integration challenges. Moreover, cloud environments will most likely suffer from relatively low connection speeds and, possibly, data traffic limitations. Similarly, mobile computing adds new and different applications, many of which demand a very specific data service. The dramatic adoption of smartphones and other mobile devices ultimately augments the creation and velocity of data, two of the key characteristics of Big Data.


    The Limitations of ETL 1.0

    ETL evolved from a point-to-point data integration capability to a fundamental component of the entire corporate data infrastructure. We illustrate this in a simple way in Figure 1, which notionally partitions IT into a data services layer that manages and provides data and an applications layer that uses the data.

    However, reality is far more complex than shown in the illustration. There is a variety of ways that applications can be connected to data. Applications vary in size and can reside on almost any device from a server to a mobile phone. Data flows can weave a complex web. Nevertheless, as the diagram suggests, data management software and ETL are complementary components that combine to deliver a data service. As such, they need to work hand in hand.

    The problem with most ETL products, those we think of as ETL 1.0, is that they were never designed for such a role. They were designed to make it easy for IT users and programmers to specify data flows and to carry out some simple data transformations in flight so that data arrived in the right format. They included a scheduling capability so that they would fire off at the right time, and they usually included a good set of connectors to provide access to a wide variety of databases and data stores. They were very effective for specifying and scheduling point-to-point data flows.

    What many of them lacked, however, was a sophisticated software architecture. They weren't designed to efficiently handle complex data transformations in flight. Indeed, the T in ETL was largely absent. They weren't designed to use resources economically. They weren't designed for scalability or high-speed data transfers. They weren't designed to handle ever-increasing data volumes. In summary, they were not designed to globally manage data flows in a data services layer.

    As data volumes increased, so did the challenge of accessing that data. In many situations, ETL tools simply were not fast enough or capable enough. Consequently, data transformation activity was often delegated to the database, with database administrators (DBAs) trying to manage performance through constant tuning. Developers resorted to hand coding or using ETL tools just for scheduling. This inevitably led to spaghetti architectures, longer development cycles, and higher total cost of ownership. Strategic business objectives were not being met.


    Figure 1. Applications and Data Services


    More and more companies find themselves in this situation. For instance, a leading telecommunications company spent over $15 million in additional database capacity just to get a 10% improvement in overall performance. More importantly, 80% of their database capacity was consumed by data transformations as opposed to analytics.

    The Nature of ETL 2.0

    Having described the failings and limitations of ETL 1.0, we can now describe what we believe to be the characteristics of ETL 2.0. Just as database technology is evolving to leverage Big Data, we should expect ETL products to be either re-engineered or superseded. ETL products are (or should be) complementary to databases and data stores, together delivering a data services layer that can provide a comprehensive data service to the business. Unlike ETL 1.0, this new approach would reduce the complexity, the cost, and the time to value of data integration.

    We list what we believe the qualities of an ETL 2.0 product are in Figure 2 and describe them in detail below in the order in which they are listed.

    Figure 2. Nature of ETL 2.0

    Versatility of Connectivity, Extract, and Load

    ETL has always been about connectivity to some degree, with ETL tools providing as many connections as possible to the wide variety of databases, data stores, and applications that pervade the data center. As new databases and data stores emerge, ETL products need to accommodate them, and this includes the ability to connect to sources of unstructured data in addition to databases. It also means connecting to cloud data sources as well as those in the data center. Where ETL tools fail to provide a connection, hand coding, with all its painful overhead, becomes necessary.

    Extracting data can be achieved in a variety of ways. The ETL product can simply use an SQL interface to a database to extract data, for example, but this is likely to be inefficient and it presents an extra workload to the database. Alternatively, it can make use of database log files or it can access the raw disk directly. ETL products need to provide such options.
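    To make the first of those options concrete, here is a minimal sketch of extraction through a plain SQL interface, shown both as a full pull and as an incremental pull driven by a high-water mark. The orders table, its updated_at column, and the use of SQLite as a stand-in source are assumptions made for this example, not features of any particular product.

        import sqlite3

        def full_extract(conn):
            # Pull every row; simple, but it puts the whole query load on the source database.
            return conn.execute("SELECT id, customer, amount, updated_at FROM orders").fetchall()

        def incremental_extract(conn, last_watermark):
            # Pull only rows modified since the last run, tracked by a high-water mark.
            rows = conn.execute(
                "SELECT id, customer, amount, updated_at FROM orders "
                "WHERE updated_at > ? ORDER BY updated_at",
                (last_watermark,),
            ).fetchall()
            new_watermark = rows[-1][3] if rows else last_watermark
            return rows, new_watermark

        if __name__ == "__main__":
            conn = sqlite3.connect(":memory:")  # stand-in for a real operational database
            conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, updated_at TEXT)")
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                             [(1, "acme", 10.0, "2013-01-01"), (2, "globex", 25.0, "2013-01-02")])
            print(len(full_extract(conn)), "rows in full extract")
            rows, wm = incremental_extract(conn, "2013-01-01")
            print(len(rows), "row(s) since watermark; new watermark:", wm)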

    The same goes for loading data. The ETL tool may load the data into staging tables within the database in a convenient form or may simply deposit the data as a file for the database to load at its leisure. Ideally, ETL tools would be able to present data in a form that allows for the fastest ingest of data by the target database without violating constraints defined within the database's schema.
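    As a sketch of the second of those loading options, the snippet below simply deposits rows as a flat file that the target database can bulk-load at its leisure, for example with a native bulk loader such as PostgreSQL's COPY. The file name and row layout are illustrative assumptions.

        import csv

        def deposit_for_bulk_load(rows, path):
            # Write the rows to a flat file in the column order the target table expects,
            # so the database can ingest it later with its own bulk-load facility.
            with open(path, "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerows(rows)

        deposit_for_bulk_load([(1, "acme", 10.0), (2, "globex", 25.0)], "orders_staging.csv")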

    Products that qualify as ETL 2.0 need to have as many extract and load options as possible to ensure the overall performance of any given data flow, while placing the least possible overhead on data sources and targets.

    Versatility of connectivity is also about leveraging and extending the capabilities of the existing data integration environment, a concept commonly known as data integration acceleration. This includes the ability to seamlessly accelerate existing data integration deployments without the need to rip and replace, as well as leveraging and accelerating emerging technologies like Hadoop.

    Versatility of Transformations and Scalability

    All ETL products provide some transformations, but few are versatile. Useful transformations may involve translating data formats and coded values between the data sources and the target (if they are, or need to be, different). They may involve deriving calculated values, sorting data, aggregating data, or joining data. They may involve transposing data (from columns to rows) or transposing single columns into multiple columns. They may involve performing look-ups and substituting actual values with looked-up values accordingly, applying validations (and rejecting records that fail), and more. If the ETL tool cannot perform such transformations, they will have to be hand coded elsewhere, in the database or in an application.

    It is extremely useful if transformations can draw data from multiple sources and data joins can be performed between such sources in flight, eliminating the need for costly and complex staging. Ideally, an ETL 2.0 product will be rich in transformation options, since its role is to eliminate the need for directly coding all such data transformations.
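    The toy pipeline below illustrates a few of those transformations performed in flight: an in-memory join against a second source used as a lookup, a derived value, and a validation rule that rejects failing records. All field names and rules are invented for this example.

        # Two sources held in memory: an order feed and a customer lookup.
        orders = [
            {"id": 1, "customer_id": 10, "qty": 3, "unit_price": 2.5},
            {"id": 2, "customer_id": 99, "qty": -1, "unit_price": 4.0},  # will fail validation
        ]
        customers = {10: "acme"}

        def transform(order):
            total = order["qty"] * order["unit_price"]              # derived value
            name = customers.get(order["customer_id"], "UNKNOWN")   # lookup substitution
            return {"order_id": order["id"], "customer": name, "total": total}

        loaded, rejected = [], []
        for o in orders:
            if o["qty"] > 0:      # validation rule: quantity must be positive
                loaded.append(transform(o))
            else:
                rejected.append(o)

        print("loaded:", loaded)
        print("rejected:", rejected)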

    Currently, ETL workloads beyond the multi-terabyte level are unlikely, although in the future they may be seen more frequently. Consequently, scalability needs to be inherent within the ETL 2.0 product architecture so that it can optimally and efficiently transfer and transform multiple terabytes of data when provided with sufficient hardware resources.

    Breadth of Application

    At its most basic, an ETL tool transfers data from a source to a target. It is more complex if there are multiple sources and multiple targets. For example, ETL may supplement or replace data replication carried out by a database, which can mean the ETL tool needs to deliver data to multiple locations. This type of complexity can jeopardize speed and performance. An ETL 2.0 product must be able to swiftly and deftly transfer data, regardless of the number of sources and targets.

    A store-and-forward mode of use is important. The availability of data from data sources may not exactly coincide with the availability of the target to ingest data, so the ETL tool needs to gather the data from data sources, carry out whatever transformations are necessary, and then store the data until the target database is ready to receive it.
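    A minimal sketch of that store-and-forward behavior: transformed records are spooled to local storage until the target signals that it is ready, then drained and handed to whatever loader the target exposes. The spool file and the load_fn callback are assumptions made for illustration.

        import json, os

        SPOOL = "spool.jsonl"

        def stage(records):
            # Append transformed records to a local spool because the target is not ready yet.
            with open(SPOOL, "a") as f:
                for r in records:
                    f.write(json.dumps(r) + "\n")

        def forward(load_fn):
            # When the target becomes available, drain the spool into it and clean up.
            if not os.path.exists(SPOOL):
                return 0
            with open(SPOOL) as f:
                records = [json.loads(line) for line in f]
            load_fn(records)
            os.remove(SPOOL)
            return len(records)

        stage([{"order_id": 1, "total": 7.5}])
        sent = forward(lambda recs: print("loading", len(recs), "record(s) into the target"))
        print(sent, "record(s) forwarded")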

    Change data capture, whereby the ETL tool transfers only the data that has changed in the source to the target, is a critically important option. It can reduce the ETL workload significantly and improve timing dramatically, while keeping source and target databases in sync.
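    Production change data capture usually reads the source database's transaction log; the simplified sketch below conveys the same idea by diffing two keyed snapshots and emitting only new or modified rows, so only the changes travel to the target.

        import hashlib, json

        def row_hash(row):
            # Hash the row content so changed rows can be detected cheaply.
            return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

        def capture_changes(previous, current):
            # previous/current map a primary key to a row; keep only new or modified rows.
            changes = []
            for key, row in current.items():
                if key not in previous or row_hash(previous[key]) != row_hash(row):
                    changes.append(row)
            return changes

        previous = {1: {"id": 1, "amount": 10.0}, 2: {"id": 2, "amount": 25.0}}
        current = {1: {"id": 1, "amount": 10.0}, 2: {"id": 2, "amount": 30.0}, 3: {"id": 3, "amount": 5.0}}
        print(capture_changes(previous, current))  # only the changed and new rows move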

    The ability to stream data so that the ingest process begins immediately when the data arrives at the target is another function that reduces the overall time of a data transfer and achieves near real-time data movement. Such data transfers are often small batches of data being transferred frequently. For similar small amounts of data, it is important that real-time interfaces, such as web services, MQ/JMS, and HTTP, are supported. This is also likely to be important for mobile data services.
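    As a rough sketch of such a real-time interface, the function below pushes a small micro-batch to a target's HTTP ingest endpoint so loading can start the moment the data arrives. The endpoint URL and payload shape are hypothetical.

        import json
        from urllib import request

        def send_micro_batch(records, url):
            # POST a small batch as JSON to the target's real-time interface.
            body = json.dumps(records).encode("utf-8")
            req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
            with request.urlopen(req, timeout=10) as resp:
                return resp.status

        # Hypothetical ingest endpoint; any small, frequently sent batch could travel this way.
        # send_micro_batch([{"order_id": 1, "total": 7.5}], "https://example.com/ingest")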

    The ability to work within, or connected to, the cloud is swiftly becoming a necessity. This requires not only supporting data transfer to and from common software-as-a-service (SaaS) providers, such as Salesforce.com or NetSuite, but also accommodating the technical, contractual, or cost constraints imposed by any cloud service.

    An ETL 2.0 tool should be able to deliver on all these possibilities.

    Usability and Collaboration

    An ETL 2.0 tool should be easy to use for both the IT developer and the business user. As a matter of course, the ETL tool should log and report on all its activity, including any exceptions that occur in any of its activities. Such information must be easily available to anyone who needs to analyze ETL activity for any purpose.

    Developers must be able to define complex data transfers involving many transformations and rules, specifying the usage mode and scheduling the data transfer in a codeless manner. Business users should be able to take advantage of the power of the ETL environment with a self-service interface based on their role and technical proficiency.

    Today's business user is a more savvy consumer of technology and, as such, has the potential to bring more to the table than a request for a report. An ETL tool should enable and foster collaboration between business users, analysts, and developers by providing a framework that automatically adapts to each user's role. When the business user has a clear understanding of the data life cycle, and developers and analysts have a clear understanding of the business goals and objectives, a greater level of connectivity can be achieved.

    In addition to bridging the proverbial gap between IT and the business, this type of ETL approach can result in faster time to production and, ultimately, increased business agility and lower costs. By eliminating the typical back-and-forth discussions, a collaborative effort during the planning stages can have a significant impact on the efficiency of the environment in which the ETL tool is leveraged.

    Economy of Resource Usage

    At a hardware level, the ETL 2.0 tool must identify available resources (CPU power, memory, disk, and network bandwidth) and take advantage of them in an economic fashion. Specifically, it should be capable of data compression to alleviate disk and network I/O (something particularly important for cloud environments) and of parallel operation, both for speed and resource efficiency.

    With any ETL operation, I/O is almost always one of the biggest bottlenecks. The ETL tool should be able to dynamically understand and adapt to the file system and I/O bandwidth to ensure optimized operation. The ETL tool also needs to clean up after itself, freeing up computing resources, including disk space (eliminating all temporary files) and memory, as soon as it no longer requires them.
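    A small sketch of two of those behaviors, compression to cut I/O volume and cleanup of temporary workspace, under the assumption that batches are staged as gzip-compressed JSON lines in a scratch directory.

        import gzip, json, os, tempfile

        def write_compressed(records, directory):
            # Compress the batch before it touches disk or the network to reduce I/O volume.
            path = os.path.join(directory, "batch.jsonl.gz")
            with gzip.open(path, "wt") as f:
                for r in records:
                    f.write(json.dumps(r) + "\n")
            return path

        with tempfile.TemporaryDirectory() as workdir:  # temporary space, reclaimed automatically
            path = write_compressed([{"id": i, "amount": i * 1.5} for i in range(1000)], workdir)
            print(os.path.getsize(path), "bytes on disk after compression")
        # Leaving the block removes the scratch directory and its files: the tool has cleaned up after itself.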


    Self-Optimization

    In our view, ETL 2.0 products should require very little tuning. The number of man-hours an organization can spend on the constant tuning of databases and data flows hinders business agility and eats up resources at an alarming rate. ETL tuning requires time and specific skills. Even when it is effective, the gains are usually marginal and may evaporate when data volumes increase or minor changes to requirements are implemented. Tuning is an expensive and perpetual activity that doesn't solve the problem; it just defers it.

    ETL 2.0 products will optimize data transfer speeds in line with performance goals, all but eliminating manual tuning. They will embody an optimization capability that is aware of the computer resources available and is able to optimize its own operations in real time without the need for human intervention beyond setting very basic parameters. The optimization capability will need to consider all the ETL activities (extracts, transforms, and loads) automatically, adjusting data processing algorithms to optimize the data transfer activity irrespective of how complex it is.
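    Real self-optimization is considerably more sophisticated, but the toy feedback rule below conveys the idea: the tool measures its own throughput and adjusts the batch size for the next run without human intervention. The thresholds and growth factors are arbitrary assumptions.

        def next_batch_size(current_size, rows_per_second, target_rows_per_second,
                            min_size=1_000, max_size=1_000_000):
            # Grow the batch while measured throughput is below target (bigger batches
            # amortize per-batch overhead); shrink it once the target is comfortably met.
            if rows_per_second < target_rows_per_second:
                current_size = int(current_size * 1.5)
            else:
                current_size = int(current_size * 0.8)
            return max(min_size, min(max_size, current_size))

        size = 10_000
        for measured in (4_000, 6_000, 12_000):  # simulated throughput readings
            size = next_batch_size(size, measured, target_rows_per_second=10_000)
            print("next batch size:", size)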

    The Benefits of ETL 2.0

    It is clear that the modern business and computing environment demands much more from ETL and data integration tools than they were designed to deliver. So it makes sense to discuss how much that matters.

    To most end users, ETL tools are little known and largely invisible until they perform badly or fail. As far as the business user is concerned, there is useful data, and they need access to it in a convenient form when and for whatever reason they need it. Their needs are simple to articulate but not so easy to satisfy.

    Existing ETL products face many challenges, as summarized in Figure 3. First and foremost, they need to deliver a first-class data service to business users by ensuring, where possible, that the whole data services layer delivers data to those users when and how they want it. The ETL products need to accommodate timing imperatives, perennial growth in applications and data volumes, technology changes, Big Data, cloud computing, and mobile computing. And, ultimately, they need to deliver business benefit.

    The business benefits of effective ETL have two aspects: those that affect the operations of the business directly and those that impact the efficient management and operation of IT resources.


    Figure 3. The ETL Challenges


    The Operations of the Business

    The growth in the business use of data is unlikely to stall any time soon. At the leading edge of this is the explosion of Big Data, cloud computing, and mobile BI, which are in their infancy, but they won't be for long. A product page on Facebook, for example, can record certain behaviors of the users who like that page. Drawing on such information and matching it, perhaps with specific information drawn from Twitter, the company can tailor its marketing message to specific categories of customers and potential customers. Such information is easy enough to gather, but not so easy to integrate with corporate data. The business needs to be able to exploit any opportunity it identifies in these and related areas as quickly as possible; continued competitiveness and revenue opportunities depend on it.

    Assuming data can be integrated, the information needs to be delivered to the entire business through an accurate and nimble data service, tailored to the various uses for that data. Delivering new data services quickly and effectively means providing user self-service where that is feasible and desirable, and enabling the fast development of new data services by IT where that is necessary.

    The ability to identify opportunities from the various sources of data and deliver the information with agility is essential to continued competitiveness and revenue growth.

    If this can be achieved, then ETL and the associated databases and data stores that provide information services are doing their job.

    The Operations of IT

    Even when an effective ETL service is provided, its delivery may be costly. The expense of ETL is best viewed from a total cost of ownership perspective. A primary problem that eventually emerges from the deployment of outdated ETL products is entropy, the gradual deterioration of the data services layer, which results in escalating costs.

    In reality, the software license fees for the ETL tools are likely to be a very small percentage of the cost of ownership. The major benefits of ETL 2.0 for the operations of IT will come from:

    - Low development costs: New data transfers can be built with very little effort.

    - Low maintenance effort: The manual effort of maintaining data transfers will be low when changes to requirements emerge.

    - Tunability/optimization: There will be little or no effort associated with ensuring adequate performance.

    - Economy of resource usage: They will require fewer hardware resources than previous ETL products for any given workload.

    - Fast development and user self-service: They will reduce the time to value for many applications that depend on data flows.

    - Scalability: There will be no significant limits to moving data around, since every variety of data transfer will be possible and data volume growth will not require exponential management overhead.

    - Manageability: Finally, all ETL activities will be visible and managed collectively rather than on a case-by-case basis. The major win comes from being able to plan for the future, avoiding unexpected costs and provisioning resources as the company needs them.

    Clearly, ETL 2.0 benefits will differ from business to business. A capable ETL product will help organizations remain competitive and relevant in the marketplace. Those organizations that are under pressure from data growth or a highly fragmented ETL environment will see results immediately by putting their house in order. For example, a Fortune 500 company has been able to reduce its annual costs by more than $1 million by deploying an ETL 2.0 environment with a well-planned, scalable data flow architecture. It replaced most of its existing data transfer programs, eliminating all hand coding and significantly reducing its tuning and maintenance activity. Similarly, businesses that are pioneering in mobile computing and Big Data may also see more gains than others. As a rough rule of thumb, the more data transfers that are done, the more immediate the benefits of ETL 2.0 will be.

    Conclusions

    The first generation of ETL products, ETL 1.0, is becoming increasingly expensive and difficult to deploy and maintain. The details are different for each IT environment, but the same characteristics emerge. The resource management costs of ETL escalate. The amount of IT effort to sustain ETL increases, and the manageability of the whole environment deteriorates. What began as a series of point-to-point deployments becomes an ad hoc spaghetti architecture. The environment becomes saturated with a disparate set of transformations: some of them using the ETL tool itself, some of them in the database, and some of them hand coded.

    What's left is a data services layer that is impossible to manage, reuse, or govern. The IT department is faced with failing to deliver an adequate service to the business or paying a high price in order to do so. Such a situation is not sustainable in the long run.

    As we've discovered, to meet today's business needs, a new approach to data integration is a necessity. We call this approach ETL 2.0, and it is key to helping organizations remain competitive in the marketplace.

    The characteristics of ETL 2.0 include:

    - Connectivity and versatility of extract and load

    - Versatility of transformations and scalability

    - Breadth of application

    - Usability and collaboration

    - Economy of resource usage

    - Self-optimization

    ETL products that provide the full range of capabilities described in this paper will almost certainly have a significant impact on both organizations and the data integration industry as a whole.

    The benefits of ETL 2.0 are threefold: the business receives the data service it needs to remain competitive and achieve strategic objectives; the ETL environment does not suffer from entropy and can quickly scale to accommodate new demands for information; and, most importantly, the total cost of owning, deploying, and maintaining the ETL environment is significantly lower than that of its predecessor. A capable ETL product will reduce TCO simply by removing the need for additional personnel and hardware, but one that delivers really well will further increase ROI by providing businesses with the data they need to make game-changing decisions precisely when it is needed, enabling organizations to maximize the opportunities of Big Data.


    About The Bloor Group

    The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.TheBloorGroup.com and www.TheVirtualCircle.com for more information. The Bloor Group is the sole copyright holder of this publication.

    PO Box 200638, Austin, TX 78720, Tel: 512-524-3689

    www.TheVirtualCircle.com

    www.BloorGroup.com
