Rethinking integration: Emerging patterns from cloud computing leaders

PwC Technology Forecast, 2014, Issue 1




Contents

Features

Data lake
The enterprise data lake: Better integration and deeper analytics

Microservices architecture (MSA)
Microservices: The resurgence of SOA principles and an alternative to the monolith

Linux containers and Docker
Containers are redefining application-infrastructure integration

Zero integration
Zero-integration technologies and their role in transformation


Related interviews

Mike Lang, CEO of Revelytix, on how companies are using data lakes

John Pritchard, Director of platform services at Adobe, on agile coding in the software industry

Ben Golub, CEO of Docker, on the outlook for Linux containers

Dale Sanders, SVP at Health Catalyst, on agile data warehousing in healthcare

Richard Rodger, CTO of nearForm, on the advantages of microservices architecture

Sam Ramji, VP of Strategy at Apigee, on integration trends and the bigger picture


Technology Forecast: Rethinking integration, Issue 1, 2014

The enterprise data lake: Better integration and deeper analytics

By Brian Stein and Alan Morrison

Data lakes that can scale at the pace of the cloud remove integration barriers and clear a path for more timely and informed business decisions.


Data lakes: An emerging approach to cloud-based big data

UC Irvine Medical Center maintains millions of records for more than a million patients, including radiology images and other semi-structured reports, unstructured physicians’ notes, plus volumes of spreadsheet data. To solve the challenges it faced with data storage, integration, and accessibility, the hospital created a data lake based on a Hadoop architecture, which enables distributed big data processing by using broadly accepted open software standards and massively parallel commodity hardware.

Hadoop allows the hospital’s disparate records to be stored in their native formats for later parsing, rather than forcing all-or-nothing integration up front as in a data warehousing scenario. Preserving the native format also helps maintain data provenance and fidelity, so different analyses can be performed using different contexts. The data lake has made possible several data analysis projects, including the ability to predict the likelihood of readmissions and take preventive measures to reduce the number of readmissions.1

Like the hospital, enterprises across industries are starting to extract and place data for analytics into a single Hadoop-based repository without first transforming the data the way they would need to for a relational data warehouse.2 The basic concepts behind Hadoop3 were devised by Google to meet its need for a flexible, cost-effective data processing model that could scale as data volumes grew faster than ever. Yahoo, Facebook, Netflix, and others whose business models also are based on managing enormous data volumes quickly adopted similar methods. Costs were certainly a factor: Hadoop can be 10 to 100 times less expensive to deploy than conventional data warehousing.

1 “UC Irvine Health does Hadoop,” Hortonworks, http://hortonworks.com/customer/uc-irvine-health/.

2 See Oliver Halter, “The end of data standardization,” March 20, 2014, http://usblogs.pwc.com/emerging-technology/the-end-of-data-standardization/, accessed April 17, 2014.

3 Apache Hadoop is a collection of open standard technologies that enable users to store and process petabyte-sized data volumes via commodity computer clusters in the cloud. For more information on Hadoop and related NoSQL technologies, see “Making sense of Big Data,” PwC Technology Forecast 2010, Issue 3 at http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml.


Figure: A basic Hadoop architecture for scalable data lake infrastructure
Source: Electronic Design, 2012, and Hortonworks, 2014

Hadoop stores and preserves data in any format in the Hadoop Distributed File System (HDFS) across a commodity server cluster. A job tracker splits up jobs and distributes, processes, and recombines them (map, partition, combine, sort, reduce) via a cluster that can scale to thousands of server nodes. With YARN, Hadoop now supports various programming models and near-real-time outputs in addition to batch.
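
As a rough illustration of the flow the figure describes, the sketch below simulates the map, shuffle, and reduce steps locally in Python. The record layout (JSON lines with a record_type field) is assumed for illustration only; an actual job would run equivalent map and reduce functions across a Hadoop cluster rather than in a single process.

```python
# A minimal, local simulation of the map -> shuffle/sort -> reduce flow the
# figure describes. The record layout (JSON lines with a "record_type" field)
# is assumed for illustration; a real job would run the same two functions
# across a Hadoop cluster, for example via Hadoop Streaming.
import json
from collections import defaultdict

def map_fn(line):
    """Emit (key, value) pairs from one raw input line."""
    try:
        record = json.loads(line)
        yield record.get("record_type", "unknown"), 1
    except ValueError:
        yield "unparsed", 1            # keep malformed rows; the lake stores everything

def reduce_fn(key, values):
    """Combine all values that the shuffle grouped under one key."""
    return key, sum(values)

if __name__ == "__main__":
    raw_lines = [
        '{"record_type": "radiology", "patient_id": 1}',
        '{"record_type": "notes", "patient_id": 1}',
        '{"record_type": "radiology", "patient_id": 2}',
        'not valid json',
    ]
    # Map phase, then a shuffle that groups intermediate pairs by key.
    grouped = defaultdict(list)
    for line in raw_lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    # Reduce phase.
    for key in sorted(grouped):
        print(reduce_fn(key, grouped[key]))   # e.g. ('radiology', 2)
```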


Another driver of adoption has been the opportunity to defer labor-intensive schema development and data cleanup until an organization has identified a clear business need. And data lakes are more suitable for the less-structured data these companies needed to process.

Today, companies in all industries find themselves at a similar point of necessity.

Enterprises that must use enormous volumes and myriad varieties of data to respond to regulatory and competitive pressures are adopting data lakes. Data lakes are an emerging and powerful approach to the challenges of data integration as enterprises increase their exposure to mobile and cloud-based applications, the sensor-driven Internet of Things, and other aspects of what PwC calls the New IT Platform.


Issue overview: Integration fabric

The microservices topic is the second of three topics as part of the integration fabric research covered in this issue of the PwC Technology Forecast. The integration fabric is a central component for PwC’s New IT Platform.*

Enterprises are starting to embrace more practical integration.** A range of these new approaches is now emerging, and during the next few months we’ll ponder what the new cloud-inspired enterprise integration fabric looks like. The main areas we plan to explore include these:

Integration fabric layers, the integration challenges at each layer, and the emerging technology solutions:

• Data. Challenges: data silos, data proliferation, rigid schemas, and high data warehousing cost; new and heterogeneous data types. Emerging solutions: Hadoop data lakes, late binding, and metadata provenance tools. Enterprises are beginning to place extracts of their data for analytics and business intelligence (BI) purposes into a single, massive repository and structuring only what’s necessary. Instead of imposing schemas beforehand, enterprises are allowing data science groups to derive their own views of the data and structure it only lightly, late in the process.

• Applications and services. Challenges: rigid, monolithic systems that are difficult to update in response to business needs. Emerging solutions: microservices. Fine-grained microservices, each associated with a single business function and accessible via an application programming interface (API), can be easily added to the mix or replaced. This method helps developer teams create highly responsive, flexible applications.

• Infrastructure. Challenges: multiple clouds and operating systems that lack standardization. Emerging solutions: software containers for resource isolation and abstraction. New software containers such as Docker extend and improve virtualization, making applications portable across clouds. Simplifying application deployment decreases time to value.

* See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.

**Integration as PwC defines it means making diverse components work together as a single entity. See “integrated system” at http://www.yourdictionary.com/integrated-system#computer, accessed June 17, 2014.


Why a data lake?

Data lakes can help resolve the nagging problem of accessibility and data integration. Using big data infrastructures, enterprises are starting to pull together increasing data volumes for analytics or simply to store for undetermined future use. (See the sidebar “Data lakes defined.”) Mike Lang, CEO of Revelytix, a provider of data management tools for Hadoop, notes that “Business owners at the C level are saying, ‘Hey guys, look. It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.’”

Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.

Recent innovation is helping companies to collaboratively create models—or views—of the data and then manage incremental improvements to the metadata. Data scientists and business analysts using the newest lineage tracking tools such as Revelytix Loom or Apache Falcon can follow each other’s purpose-built data schemas. The lineage tracking metadata also is placed in the Hadoop Distributed File System (HDFS)—which stores pieces of files across a distributed cluster of servers in the cloud—where the metadata is accessible and can be collaboratively refined. Analytics drawn from the lake become increasingly valuable as the metadata describing different views of the data accumulates.
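
As a rough illustration of the idea of keeping lineage metadata alongside derived data, the sketch below writes a simple JSON “sidecar” describing how an output data set was produced. The field layout is hypothetical, not the Loom or Falcon format, and it writes to the local filesystem rather than HDFS.

```python
# A minimal sketch of recording lineage metadata alongside a derived data set.
# The JSON layout is hypothetical -- it is not the Revelytix Loom or Apache
# Falcon format -- and it writes to the local filesystem rather than HDFS.
import json
import getpass
from datetime import datetime, timezone

def record_lineage(output_path, input_paths, transform_script, note=""):
    """Write a .lineage.json sidecar describing how output_path was produced."""
    sidecar = {
        "output": output_path,
        "inputs": input_paths,                      # upstream data sets
        "transform": transform_script,              # code that produced the output
        "author": getpass.getuser(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "note": note,
    }
    sidecar_path = output_path + ".lineage.json"
    with open(sidecar_path, "w") as f:
        json.dump(sidecar, f, indent=2)
    return sidecar_path

if __name__ == "__main__":
    record_lineage(
        "churn_features.parquet",
        ["raw/set_top_box.json", "raw/billing_extract.csv"],
        "transform/build_churn_features.py",
        note="attribute-level extract for churn analysis",
    )
```

Because the sidecar travels with the data, later users can see which extracts and which transformation code produced the set they are about to reuse.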

Every industry has a potential data lake use case. A data lake can be a way to gain more visibility or put an end to data silos. Many companies see data lakes as an opportunity to capture a 360-degree view of their customers or to analyze social media trends.

What is a data lake?

A data lake is a repository for large quantities and varieties of data, both structured and unstructured. Data lakes take advantage of commodity cluster computing techniques for massively scalable, low-cost storage of data files in any format. The data lake accepts input from various sources and can preserve both the original data fidelity and the lineage of data transformations; data models emerge with usage over time rather than being imposed up front.

The lake can serve as a staging area for the data warehouse, the location of more carefully “treated” data for reporting and analysis in batch mode. Data generalists and programmers can tap the stream data for real-time analytics, and data scientists use the lake for discovery and ideation.


In the financial services industry, where Dodd-Frank regulation is one impetus, an institution has begun centralizing multiple data warehouses into a repository comparable to a data lake, but one that standardizes on XML. The institution is moving reconciliation, settlement, and Dodd-Frank reporting to the new platform. In this case, the approach reduces integration overhead because data is communicated and stored in a standard yet flexible format suitable for less-structured data. The system also provides a consistent view of a customer across operational functions, business functions, and products.

Some companies have built big data sandboxes for analysis by data scientists. Such sandboxes are somewhat similar to data lakes, albeit narrower in scope and purpose. PwC, for example, built a social media data sandbox to help clients monitor their brand health by using its SocialMind application.4

Motivating factors behind the move to data lakes

Relational data warehouses and their big price tags have long dominated complex analytics, reporting, and operations. (The hospital described earlier, for example, first tried a relational data warehouse.) However, their slow-changing data models and rigid field-to-field integration mappings are too brittle to support big data volume and variety. The vast majority of these systems also leave business users dependent on IT for even the smallest enhancements, due mostly to inelastic design, unmanageable system complexity, and low system tolerance for human error. The data lake approach circumvents these problems.

Freedom from the shackles of one big data model

Job number one in a data lake project is to pull all data together into one repository while giving minimal attention to creating schemas that define integration points between disparate data sets. This approach facilitates access, but the work required to turn that data into actionable insights is a substantial challenge. While integrating the data takes place at the Hadoop layer, contextualizing the metadata takes place at schema creation time.

Integrating data involves fewer steps because data lakes don’t enforce a rigid metadata schema as do relational data warehouses. Instead, data lakes support a concept known as late binding, or schema on read, in which users build custom schema into their queries. Data is bound to a dynamic schema created upon query execution. The late-binding principle shifts the data modeling from centralized data warehousing teams and database administrators, who are often remote from data sources, to localized teams of business analysts and data scientists, who can help create flexible, domain-specific context. For those accustomed to SQL, this shift opens a whole new world.

4 For more information on SocialMind and other analytics applications PwC offers, see http://www.pwc.com/us/en/analytics/analytics-applications.jhtml.

Data lakes defined

Many people have heard of data lakes, but like the term big data, definitions vary. Four criteria are central to a good definition:

• Size and low cost: Data lakes are big. They can be an order of magnitude less expensive on a per-terabyte basis to set up and maintain than data warehouses. With Hadoop, petabyte-scale data volumes are neither expensive nor complicated to build and maintain. Some vendors that advocate the use of Hadoop claim that the cost per terabyte for data warehousing can be as much as $250,000, versus $2,500 per terabyte (or even less than $1,000 per terabyte) for a Hadoop cluster. Other vendors advocating traditional data warehousing and storage infrastructure dispute these claims and make a distinction between the cost of storing terabytes and the cost of writing or written terabytes.*

• Fidelity: Hadoop data lakes preserve data in its original form and capture changes to data and contextual semantics throughout the data lifecycle. This approach is especially useful for compliance and internal audit. If the data has undergone transformations, aggregations, and updates, most organizations typically struggle to piece data together when the need arises and have little hope of determining clear provenance.

• Ease of accessibility: Accessibility is easy in the data lake, which is one benefit of preserving the data in its original form. Whether structured, unstructured, or semi-structured, data is loaded and stored as is to be transformed later. Customer, supplier, and operations data are consolidated with little or no effort from data owners, which eliminates internal political or technical barriers to increased data sharing. Neither detailed business requirements nor painstaking data modeling are prerequisites.

• Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models.

*For more on data accessibility, data lake cost, and collective metadata refinement including lineage tracking technology, see the interview with Mike Lang, “Making Hadoop suitable for enterprise data science,” at http://www.pwc.com/us/en/technology-forecast/2014/issue1/interviews/interview-revelytix.jhtml. For more on cost estimate considerations, see Loraine Lawson, “What’s the Cost of a Terabyte?” ITBusinessEdge, May 17, 2013, at http://www.itbusinessedge.com/blogs/integration/whats-the-cost-of-a-terabyte.html.



In this approach, the more that is known about the metadata, the easier it is to query. Pre-tagged data, such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or Resource Description Framework (RDF), offers a starting point and is highly useful in implementations with limited data variety. In most cases, however, pre-tagged data is a small portion of incoming data formats.
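
As a rough sketch of late binding in practice, the example below keeps raw records exactly as landed and binds a caller-supplied schema only at read time; the field names and casts are hypothetical.

```python
# A minimal sketch of late binding (schema on read): the raw records are kept
# as landed, and each analysis binds its own schema only at query time. Field
# names and casts here are hypothetical.
import json

RAW_LINES = [                      # data "as landed" -- no up-front modeling
    '{"pat_id": "001", "adm_date": "2014-01-03", "los": "4"}',
    '{"pat_id": "002", "adm_date": "2014-02-11", "los": "9", "notes": "..."}',
]

def read_with_schema(lines, schema):
    """Bind a caller-supplied schema {column: (source_field, cast)} at read time."""
    for line in lines:
        raw = json.loads(line)
        yield {col: cast(raw[field]) for col, (field, cast) in schema.items()
               if field in raw}

# One analyst's view: readmission work needs an integer length of stay.
readmission_schema = {
    "patient_id": ("pat_id", str),
    "length_of_stay": ("los", int),
}
for row in read_with_schema(RAW_LINES, readmission_schema):
    print(row)          # {'patient_id': '001', 'length_of_stay': 4} ...
```

A second analyst could bind an entirely different schema to the same raw lines without touching the stored data.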

Early lessons and pitfalls to avoid

Some data lake initiatives have not succeeded, producing instead more silos or empty sandboxes. Given the risk, everyone is proceeding cautiously. “We see customers creating big data graveyards, dumping everything into HDFS and hoping to do something with it down the road. But then they just lose track of what’s there,” says Sean Martin, CTO of Cambridge Semantics, a data management tools provider.

Companies avoid creating big data graveyards by developing and executing a solid strategic plan that applies the right technology and methods to the problem. Few technologies in recent memory have as much change potential as Hadoop and the NoSQL (Not only SQL) category of databases, especially when they can enable a single enterprise-wide repository and provide access to data previously trapped in silos. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents. A means of creating, enriching, and managing semantic metadata incrementally is essential.

Figure: Data flow in the data lake

The data lake loads upstream data extracts (XML, spreadsheets, and so on), irrespective of format, into a big data store. A big data repository stores data as is, loading existing data and accepting new feeds regularly. Metadata is decoupled from its underlying data and stored independently, enabling flexibility for multiple end-user perspectives and incrementally maturing semantics; it grows and matures over time via user interaction, including metadata tagging, synonym identification, and linking. Data scientists and app developers prepare and analyze attribute-level data, machines help discover patterns and create data views, and business and data analysts select and report on domain-specific data across domains. Users collaborate to identify, organize, and make sense of the data in the lake, and new actions (such as customer campaigns) follow from the resulting insights. The data lake offers a unique opportunity for flexible, evolving, and maturing big data insights.
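
As a rough sketch of metadata that is decoupled from the data it describes, the example below keeps tags and synonym links in a small registry that can be refined incrementally; the attribute names and labels are hypothetical.

```python
# A minimal sketch of metadata that is decoupled from the underlying data and
# matures through user tagging and synonym linking, as the figure describes.
# The attribute names and tags are hypothetical.
from collections import defaultdict

class MetadataRegistry:
    def __init__(self):
        self.tags = defaultdict(set)       # attribute -> business tags
        self.synonyms = defaultdict(set)   # attribute -> equivalent attributes

    def tag(self, attribute, label):
        self.tags[attribute].add(label)

    def link_synonyms(self, a, b):
        """Record that two attributes from different extracts mean the same thing."""
        self.synonyms[a].add(b)
        self.synonyms[b].add(a)

    def describe(self, attribute):
        return {"tags": sorted(self.tags[attribute]),
                "synonyms": sorted(self.synonyms[attribute])}

registry = MetadataRegistry()
registry.tag("billing.cust_no", "customer identifier")
registry.link_synonyms("billing.cust_no", "crm.customer_id")   # cross-domain link
print(registry.describe("billing.cust_no"))
```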


How a data lake matures

Sourcing new data into the lake can occur gradually and will not impact existing models. The lake starts with raw data, and it matures as more data flows in, as users and machines build up metadata, and as user adoption broadens. Ambiguous and competing terms eventually converge into a shared understanding (that is, semantics) within and across business domains. Data maturity results as a natural outgrowth of the ongoing user interaction and feedback at the metadata management layer—interaction that continually refines the lake and enhances discovery. (See the sidebar “Maturity and governance.”)

With the data lake, users can take what is relevant and leave the rest. Individual business domains can mature independently and gradually. Perfect data classification is not required. Users throughout the enterprise can see across all disciplines, not limited by organizational silos or rigid schema.

Figure: Data lake maturity

The data lake foundation includes a big data repository, metadata management, and an application framework to capture and contextualize end-user feedback. The increasing value of analytics is then directly correlated to increases in usage across the enterprise as data maturity increases through five stages:

1. Consolidated and categorized raw data
2. Attribute-level metadata tagging and linking (i.e., joins)
3. Data set extraction and analysis
4. Business-specific tagging, synonym identification, and links
5. Convergence of meaning within context


Maturity and governance

Many who hear the term data lake might associate the concept with a big data sandbox, but the range of potential use cases for data lakes is much broader. Enterprises envision lake-style repositories as staging areas, as alternatives to data warehouses, or even as operational data hubs, assuming the appropriate technologies and use cases.

A key enabler is Hadoop and many of the big data analytics technologies associated with it. What began as a means of ad hoc batch analytics in Hadoop and MapReduce is evolving rapidly with the help of YARN and Storm to offer more general-purpose distributed analytics and real-time capabilities. At least one retailer has been running a Hadoop cluster of more than 2,000 nodes to support eight customer behavior analysis applications.*

Despite these advances, enterprises will remain concerned about the risks surrounding data lake deployments, especially at this still-early stage of development. How can enterprises effectively mitigate the risk and manage a Hadoop-based lake for broad-ranging exploration? Lakes can provide unique benefits over traditional data management methods at a substantially lower cost, but they require many practical considerations and a thoughtful approach to governance, particularly in more heavily regulated industries. Areas to consider include:

• Complexity of legacy data: Many legacy systems contain a hodgepodge of software patches, workarounds, and poor design. As a result, the raw data may provide limited value outside its legacy context. The data lake performs optimally when supplied with unadulterated data from source systems, and rich metadata built on top.

• Metadata management: Data lakes require advanced metadata management methods, including machine-assisted scans, characterizations of the data files, and lineage tracking for each transformation. Should schema on read be the rule and predefined schema the exception? It depends on the sources. The former is ideal for working with rapidly changing data structures, while the latter is best for sub-second query response on highly structured data.

• Lake maturity: Data scientists will take the lead in the use and maturation of the data lake. Organizations will need to place the needs of others who will benefit within the context of existing organizational processes, systems, and controls.

• Staging area or buffer zone: The lake can serve as a cost-effective place to land, stage, and conduct preliminary analysis of data that may have been prohibitively expensive to analyze in data warehouses or other systems.

To adopt a data lake approach, enterprises should take a full step toward multipurpose (rather than single purpose) commodity cluster computing for enterprise-wide analysis of less-structured data. To take that full step, they first must acknowledge that a data lake is a separate discipline of endeavor that requires separate treatment. Enterprises that set up data lakes must simultaneously make a long-term commitment to hone the techniques that provide this new analytic potential. Half measures won’t suffice.

* Timothy Prickett Morgan, “Cluster Sizes Reveal Hadoop Maturity Curve,” Enterprise Tech: Systems Edition, November 8, 2013, http://www.enterprisetech.com/2013/11/08/cluster-sizes-reveal-hadoop-maturity-curve/, accessed March 20, 2014.


Technology Forecast: Rethinking integration, Issue 1, 2014

Making Hadoop suitable for enterprise data science

Creating data lakes enables enterprises to expand discovery and predictive analytics.

Mike Lang is CEO of Revelytix. Interview conducted by Alan Morrison, Bo Parker, and Brian Stein.

PwC: You’re in touch with a number of customers who are in the process of setting up Hadoop data lakes. Why are they doing this?

ML: There has been resistance on the part of business owners to share data, and a big part of the justification for not sharing data has been the cost of making that data available. The data owners complain they must write in some special way to get the data extracted, the system doesn’t have time to process queries for building extracts, and so forth.

But a lot of the resistance has been political. Owning data has power associated with it. Hadoop is changing that, because C-level executives are saying, “It’s no longer inordinately expensive for us to store all of our data. I want all of you to make copies. OK, your systems are busy. Find the time, get an extract, and dump it in Hadoop.”

But they haven’t integrated anything. They’re just getting an extract. The benefit is that to add value to the integration process, business owners don’t have nearly the same hill to climb that they had in the past. C-level executives are not asking the business owner to add value. They’re just saying, “Dump it,” and I think that’s under way right now.

With a Hadoop-based data lake, the enterprise has provided a capability to store vast amounts of data, and the user doesn’t need to worry about restructuring the data to begin. The data owners just need to do the dump, and they can go on their merry way.



PwC: So one major obstacle was just the ability to share data cost-effectively?

ML: Yes, and that was a huge obstacle. Huge. It is difficult to overstate how big that obstacle has been to nimble analytics and data integration projects during my career. For the longest time, there was no such thing as nimble when talking about data integration projects.

Once that data is in Hadoop, nimble is the order of the day. All of a sudden, the ETL [extract, transform, load] process is totally turned on its head—from contemplating the integration of eight data sets, for example, to figuring out which of a company’s policyholders should receive which kinds of offers at what price in which geographic regions. Before Hadoop, that might have been a two-year project.

PwC: What are the main use cases for Hadoop data lakes?

ML: There are two main use cases for the data lake. One is as a staging area to support some specific application. A company might want to analyze three streams of data to reduce customer churn by 10 percent. They plan to build an app to do that using three known streams of data, and the data lake is just part of that workflow of receiving, processing, and then dumping data off to generate the churn analytics.

The last time we talked [in 2013], that was the main use case of the data lake. The second use case is supporting data science groups all around the enterprise. Now, that’s probably 70 percent of the companies we’ve worked with.

PwC: Why use Hadoop?

ML: Data lakes are driven by three factors. The first one is cost. Everybody we talk to really believes data lakes will cost much less than current alternatives. The cost of data processing and data storage could be 90 percent lower. If I want to add a terabyte node to my current analytics infrastructure, the cost could be $250,000. But if I want to add a terabyte node to my Hadoop data lake, the cost is more like $25,000.

The second factor is flexibility. The flexibility comes from the late-binding principle. When I have all this data in the lake and want to analyze it, I’ll basically build whatever schema I want on the fly and I’ll conduct my analysis the way data scientists do. Hadoop lends itself to late binding.

The third factor relates to scale. Hadoop data lakes will have a lot more scale than the data warehouse, because they’re designed to scale and process any type of data.

PwC: What’s the first step in creating such a data lake?

ML: We’re working with a number of big companies that are implementing some version of the data lake. The first step is to create a place that stores any data that the business units want to dump in it. Once that’s done, the business units make that place available to their stakeholders.

The first step is not as easy as it sounds. The companies we’ve been in touch with spend an awful lot of time building security apparatuses. They also spend a fair amount of time performing quality checks on the data as it comes in, so at least they can say something about the quality of the data that’s available in the cluster.


But after they have that framework in place, they just make the data available for data science. They don’t know what it’s going to be used for, but they do know it’s going to be used.

PwC: So then there’s the data preparation process, which is where the metadata reuse potential comes in. How does the dynamic ELT [extract, load, transform] approach to preparing the data in the data science use case compare with the static ETL [extract, transform, load] approach traditionally used by business analysts?

ML: In the data lake, the files land in Hadoop in whatever form they’re in. They’re extracted from some system and literally dumped into Hadoop, and that is one of the great attractions of the data lake—data professionals don’t need to do any expensive ETL work beforehand. They can just dump the data in there, and it’s available to be processed in a relatively inexpensive storage and processing framework.

The challenge, then, is when data scientists need to use the data. How do they get it into the shape that’s required for their R frame or their Python code for their advanced analytics? The answer is that the process is very iterative. This iterative process is the distinguishing difference between business analysts and data warehousing and data scientists and Hadoop.

Traditional ETL is not iterative at all. It takes a long time to transform the different data into one schema, and then the business analysts perform their analysis using that schema.

Data scientists don’t like the ETL paradigm used by business analysts. Data scientists have no idea at the beginning of their job what the schema should be, and so they go through this process of looking at the data that’s available to them.

Let’s say a telecom company has set-top box data and finance systems that contain customer information. Let’s say the data scientists for the company have four different types of data. They’ll start looking into each file and determine whether the data is unstructured or structured this way or that way. They need to extract some pieces of it. They don’t want the whole file. They want some pieces of each file, and they want to get those pieces into a shape so they can pull them into an R server.

So they look into Hadoop and find the file. Maybe they use Apache Hive to transform selected pieces of that file into some structured format. Then they pull that out into R and use some R code to start splitting columns and performing other kinds of operations. The process takes a long time, but that is the paradigm they use. These data scientists actually bind their schema at the very last step of running the analytics.

Let’s say that in one of these Hadoop files from the set-top box, there are 30 tables. They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.
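
As a rough illustration of the iterative shaping Lang describes, the sketch below picks a few fields from a hypothetical set-top-box extract, splits a packed column, and fixes types, with the working schema emerging only at the end. pandas stands in here for the Hive-plus-R tooling mentioned in the interview.

```python
# A minimal sketch of the iterative shaping described in the interview: pick
# one file out of several, keep only a few fields, split a column, and only
# then is the working schema "bound." The set-top-box fields are hypothetical,
# and pandas stands in for the Hive-plus-R tooling mentioned above.
import pandas as pd

set_top_box = pd.DataFrame([
    {"device": "stb-17", "event": "play|HBO|2014-03-02T20:15", "dur_s": "5400"},
    {"device": "stb-42", "event": "play|ESPN|2014-03-02T21:00", "dur_s": "3600"},
])

# Iteration 1: keep only the columns that look useful.
working = set_top_box[["device", "event", "dur_s"]].copy()

# Iteration 2: split the packed event column and fix types.
working[["action", "channel", "start"]] = working["event"].str.split("|", expand=True)
working["dur_s"] = working["dur_s"].astype(int)

# Iteration 3: the shape needed for modeling emerges last (schema bound at the end).
model_input = working[["device", "channel", "dur_s"]]
print(model_input)
```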

PwC: How can the schema become dynamic and enable greater reuse?

ML: That’s why you need lineage. As data scientists assemble their intermediate data sets, if they look at a lineage graph in our Loom product, they might see 20 or 30 different sets of data that have been created. Of course some of those sets will be useful to other data scientists. Dozens of hours of work have been invested there. The problem is how to find those intermediate data sets. In Hadoop, they are actually realized persisted data sets.

So, how do you find them and know what their structure is so you can use them? You need to know that this data set originally contained data from this stream or that stream, this application and that application. If you don’t know that, then the data set is useless.

At this point, we’re able to preserve the input sets—the person who did it, when they did it, and the actual transformation code that produced this output set. It is pretty straightforward for users to go backward or forward to find the data set, and then find something downstream or upstream that they might be able to use by combining it, for example, with two other files. Right now, we provide the bare-bones capability for them to do that kind of navigation. From my point of view, that capability is still in its infancy.
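
As a rough illustration of that kind of navigation, the sketch below represents lineage as edges from input data sets to derived data sets and walks upstream or downstream from any node. The structure is hypothetical and is not the Loom product’s API.

```python
# A minimal sketch of navigating a lineage graph of intermediate data sets:
# given which inputs produced which outputs, walk upstream or downstream from
# any data set. The structure is hypothetical, not the Loom product's API.
from collections import defaultdict

# Each edge: (input data set) -> (derived data set)
EDGES = [
    ("raw/set_top_box", "intermediate/viewing_sessions"),
    ("raw/billing", "intermediate/customer_spend"),
    ("intermediate/viewing_sessions", "analysis/churn_features"),
    ("intermediate/customer_spend", "analysis/churn_features"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(start, graph):
    """Return every data set reachable from start in the given direction."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(walk("analysis/churn_features", upstream))    # everything it was built from
print(walk("raw/set_top_box", downstream))          # everything built on top of it
```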

PwC: And there’s also more freedom and flexibility on the querying side?

ML: Predictive analytics and statistical analysis are easier with a large-scale data lake. That’s another sea change that’s happening with the advent of big data. Everyone we talk to says SQL worked great. They look at the past through SQL. They know their current financial state, but they really need to know the characteristics of the customer in a particular zip code that they should target with a particular product.

When you can run statistical models on enormous data sets, you get better predictive capability. The bigger the set, the better your predictions. Predictive modeling and analytics are not being done timidly in Hadoop. That’s one of the main uses of Hadoop.

This sort of analysis wasn’t performed 10 years ago, and it’s only just become mainstream practice. A colleague told me a story about a credit card company. He lives in Maryland, and he went to New York on a trip. He used his card one time in New York and then he went to buy gas, and the card was cut off. His card didn’t work at the gas station. He called the credit card company and asked, “Why did you cut off my card?”

And they said, “We thought it was a case of fraud. You never have made a charge in New York and all of a sudden you made two charges in New York.” They asked, “Are you at the gas station right now?” He said yes.

It’s remarkable what the credit card company did. It ticked him off that they could figure out that much about him, but the credit card company potentially saved itself tens of thousands of dollars in charges it would have had to eat.

This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.


Technology Forecast: Rethinking integration, Issue 1, 2014

A step toward the data lake in healthcare: Late-bound data warehouses

Dale Sanders of Health Catalyst describes how healthcare providers are addressing their need for better analytics.

Dale Sanders is senior vice president of Health Catalyst. Interview conducted by Alan Morrison, Bo Parker, and Brian Stein.

PwC: How are healthcare enterprises scaling and maturing their analytics efforts at this point?

DS: It’s chaotic right now. High-tech funding facilitated the adoption of EMRs [electronic medical records] and billing systems as data collection systems. And HIEs [health information exchanges] encouraged more data sharing. Now there’s a realization that analytics is critical. Other industries experienced the same pattern, but healthcare is going through it just now.

The bad news for healthcare is that the market is so overwhelmed from the adoption of EMRs and HIEs. And now the changes from ICD-9 [International Classification of Diseases, Ninth Revision] are coming, as well as the changes to the HIPAA [Health Insurance Portability and Accountability Act] regulation. Meaningful use is still a challenge. Accountable care is a challenge.

There’s so much turmoil in the market, and it’s hard to admit that you need to buy yet another IT system. But it’s hard to deny that, as well. Lots of vendors claim they can do analytics. Trying to find the way through that maze and that decision making is challenging.

PwC: How did you get started in this area to begin with, and what has your approach been?

DS: Well, to go way back in history, when I was in the Air Force, I conceived the idea for late binding in data warehouses after I’d seen some different failures of data warehouses using relational database systems.


If you look at the early history of data warehousing in the government and military—it was all on mainframes. And those mainframe data warehouses look a lot like Hadoop today. Hadoop is emerging with better tools, but conceptually the two types of systems are very similar.

When relational databases became popular, we all rushed to those as a solution for data warehousing. We went from the flat files associated with mainframes to Unix-based data warehouses that used relational database systems. And we thought it was a good idea. But one of the first big mistakes everyone made was to develop these enterprise data models using a relational form.

I watched several failures happen as a consequence of that type of early binding to those enterprise models. I made some adjustments to my strategy in the Air Force, and I made some further adjustments when I worked for companies in the private sector and further refined it.

I came into healthcare with that. I started at Intermountain Healthcare, which was an early adopter of informatics. The organization had a struggling data warehouse project because it was built around this tightly coupled, early-binding relational model. We put a team together, scrubbed that model, and applied late binding. And, knock on wood, it’s been doing very well. It’s now 15 years in its evolution, and Intermountain still loves it. The origins of Health Catalyst come from that history.

PwC: How mature are the analytics systems at a typical customer of yours these days?

DS: We generally get two types of customers. One is the customer with a fairly advanced analytics vision and aspirations. They understand the whole notion of population health management and capitated reimbursement and things like that. So they’re naturally attracted to us. The dialogue with those folks tends to move quickly.

Then there are folks who don’t have that depth of background, but they still understand that they need analytics.

We have an analytics adoption model that we use to frame the progression of analytics in an organization. We also use it to help drive a lot of our product development. It’s an eight-level maturity model. Intermountain operates pretty consistently at levels six and seven.

But most of the industry operates at level zero—trying to figure out how to get to levels one and two. When we polled participants in our webinars about where they think they reside in that model, about 70 percent of the respondents said level two and below.

So we’ve needed to adjust our message and not talk about levels five, six, and seven with some of these clients. Instead, we talk about how to get basic reporting, such as internal dashboards and KPIs [key performance indicators], or how to meet the external reporting requirements for joint commission and accountable care organizations [ACOs] and that kind of thing.



If they have a technical background, some organizations are attracted to this notion of late binding. And we can relate at that level. If they’re familiar with Intermountain, they’re immediately attracted to that track record and that heritage. There are a lot of different reactions.

PwC: With customers who are just getting started, you seem to focus on already well-structured data. You’re not opening up the repository to data that’s less structured as well.

DS: The vast majority of data in healthcare is still bound in some form of a relational structure, or we pull it into a relational form. Late binding puts us between the worlds of traditional relational data warehouses and Hadoop—between a very structured representation of data and a very unstructured representation of data.

But late binding lets us pull in unstructured content. We can pull in clinical notes and free text and that sort of thing. Health Catalyst is developing some products to take advantage of that.

But if you look at the analytic use cases and the analytic maturity of the industry right now, there’s not a lot of need to bother with unstructured data. That’s reserved for a few of the leading innovators. The vast majority of the market doesn’t need unstructured content at the moment. In fact, we really don’t even have that much unstructured content that’s very useful.

PwC: What’s the pain point that the late-binding approach addresses?

DS: This is where we borrow from Hadoop and also from the old mainframe days.

When we pull a data source into the late-binding data warehouse, we land that data in a form that looks and feels much like the original source system.

Then we make a few minor modifications to the data. If you’re familiar with data modeling, we flatten it a little bit. We denormalize it a little bit. But for the most part, that data looks like the data that was contained in the source system, which is a characteristic of a Hadoop data lake—very little transformation to data.

So we retain the binding and the fidelity of the data as it appeared in the source system. If you contrast that approach with the other vendors in healthcare, they remap that data from the source system into an enterprise data model first. But when you map that data from the source system into a new relational data model, you inherently make compromises about the way the data is modeled, represented, named, and related.

You lose a lot of fidelity when you do that. You lose familiarity with the data. And it’s a time-consuming process. It’s not unusual for that early binding, monolithic data model approach to take 18 to 24 months to deploy a basic data warehouse.

In contrast, we can deploy content and start exposing it to analytics within a matter of days and weeks. We can do it in days, depending on how aggressive we want to be. There’s no binding early on. There are six different places where you can bind data to vocabulary or relationships as it flows from the source system out to the analytic visualization layer.

Before we bind data to new vocabulary, a new business rule, or any analytic logic, we ask ourselves what use case we’re trying to satisfy. We ask on a use case basis, rather than assuming a use case, because that assumption could lead to problems. We can build just about whatever we want to, whenever we want to.

PwC: In essence, you’re moving toward an enterprise data model. But you’re doing it over time, a model that’s driven by use cases.

DS: Are we actually building an enterprise data model one object at a time? That’s the net effect. Let’s say we land half a dozen different source systems in the enterprise data warehouse. One of the first things we do is provide a foreign key across those sources of data that allows you to query across those sources as if they were an enterprise data model. And typically the first foreign key that we add to those sources—using a common name and a common data type—is patient identifier. That’s the most fundamental. Then you add vocabularies such as CPT [Current Procedural Terminology] and ICD-9 as that need arises.

When you land the data, you have what amounts to a virtual enterprise model already. You haven’t remodeled the data at all, but it looks and functions like an enterprise model. Then we’ll spin targeted analytics data marts off those source systems to support specific analytic use cases.

For example, perhaps you want to drill down on the variability, quality, and cost of care in a clinical program for women and newborns. We’ll spin off a registry of those patients and the physicians treating those patients into its own separate data mart. And then we will associate every little piece of data that we can find: costing data, materials management data, human resources data about the physicians and nurses, patient satisfaction data, outcomes data, and eventually social data. We’ll pull that data into the data mart that’s specific to that analytic use case to support women and newborns.
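
As a rough sketch of the pattern Sanders describes, the example below lands two source extracts largely as-is, relies on a shared patient identifier as the common key, and binds a use-case-specific data mart late, as a view. Table and column names are hypothetical, and sqlite3 stands in for the warehouse platform.

```python
# A minimal sketch of the late-binding pattern described above: land two source
# extracts largely as-is, rely on a shared patient identifier as the common
# key, and spin off a use-case-specific "data mart" as a view. Table and
# column names are hypothetical; sqlite3 stands in for the warehouse platform.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Source extracts, landed in close to their original shape.
    CREATE TABLE emr_encounters (patient_id TEXT, encounter_date TEXT, dx_code TEXT);
    CREATE TABLE billing_charges (patient_id TEXT, encounter_date TEXT, charge_usd REAL);

    INSERT INTO emr_encounters VALUES ('P001', '2014-01-03', 'O80'),
                                      ('P002', '2014-02-11', 'I21');
    INSERT INTO billing_charges VALUES ('P001', '2014-01-03', 8400.0),
                                       ('P002', '2014-02-11', 22100.0);

    -- A targeted data mart for one analytic use case, bound late as a view.
    CREATE VIEW newborn_care_mart AS
    SELECT e.patient_id, e.encounter_date, e.dx_code, b.charge_usd
    FROM emr_encounters e
    JOIN billing_charges b
      ON b.patient_id = e.patient_id AND b.encounter_date = e.encounter_date
    WHERE e.dx_code LIKE 'O%';          -- obstetric codes only, for illustration
""")

for row in con.execute("SELECT * FROM newborn_care_mart"):
    print(row)
```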

PwC: So you might need to perform some transform rationalization, because systems might not call the same thing by the same name. Is that part of the late-binding vocabulary rationalization?

DS: Yes, in each of those data marts.

PwC: Do you then use some sort of provenance record—a way of rationalizing the fact that we call these 14 things different things—that becomes reusable?

DS: Oh, yes, that’s the heart of it. We reuse all of that from organization to organization. There’s always some modification. And there’s always some difference of opinion about how to define a patient cohort or a disease state. But first we offer something off the shelf, so you don’t need to re-create them.

PwC: What if somebody wanted to perform analytics across the data marts or across different business domains? In this framework, would the best strategy be to somehow consolidate the data marts, or instead go straight to the underlying data warehouse?

DS: You can do either one. Let’s take a comorbidity situation, for example, where a patient has three or four different disease states. Let’s say you want to look at that patient’s continuum of care across all of those.



Over the top of those data marts is still this common late-binding vocabulary that allows you to query the patient as that patient appears in each of those different subject areas, whatever disease state it is. It ends up looking like a virtual enterprise model for that patient’s record. After we’ve formally defined a patient cohort and the key metrics that the organization wants to understand about that patient cohort, we want to lock that down and tightly bind it at that point.

First you get people to agree. You get physicians and administrators to agree how they want to identify a patient cohort. You get agreement on the metrics they want to understand about clinical effectiveness. After you get comprehensive agreement, then you look for it to stick for a while. When it sticks for a period of time, then you can tightly bind that data together and feel comfortable about doing so—so you don’t need to rip it apart and rebind it again.

PwC: When you speak about coming toward an agreement among the various constituencies, is it a process that takes place more informally outside the system, where everybody is just going to come up with the model? Or is there some way to investigate the data first? Or by using tagging or some collaborative online utility, is there an opportunity to arrive at consensus through an interface?

DS: We have ready-to-use definitions around all these metrics—patient registries and things like that. But we also recognize that the state of the industry being what it is, there’s still a lot of fingerprinting and opinions about those definitions. So even though an enterprise might reference the National Quality Forum, the Agency for Healthcare Research and Quality, and the British Medical Journal as the sources for the definitions, local organizations always want to put their own fingerprint on these rules for data binding.

We have a suite of tools to facilitate that exploration process. You can look at your own definitions, and you can ask, “How do we really want to define a diabetic patient? How do we define congestive heart failure and myocardial infarction patients?”

We’ll let folks play around with the data, visualize it, and explore it in definitions. When we see them coming toward a comprehensive and persistent agreement, then we’ll suggest, “If you agree to that definition, let’s bind it together behind that visualization layer.” That’s exactly what happens. And you must allow that to happen. You must let that exploration and fingerprinting happen.

A drawback of traditional ways of deploying data warehouses is that they presuppose all of those bindings and rules. They don’t allow that exploration and local fingerprinting.

PwC: So how do companies get started with this approach? Assuming they have existing data warehouses, are you using those warehouses in a new way? Are you starting up from scratch? Do you leave those data warehouses in place when you’re implementing the late-bound idea?

DS: Some organizations have an existing data warehouse. And a lot of organizations don’t. The greenfield organizations are the easiest to deal with.



The strategy is pretty complicated to decouple all of the analytic logic that’s been built around those existing data warehouses and then import that to the future. Like most transitions of this kind, it often happens through attrition. First you build the new enterprise data warehouse around those late-binding concepts. And then you start populating it with data.

The one thing you don’t want to do is build your new data warehouse under a dependency to those existing data warehouses. You want to go around those data warehouses and pull your data straight from source systems in the new architecture. It’s a really bad strategy to build a data warehouse on top of data warehouses.

PwC: Some of the people we’ve interviewed about Hadoop assert that using Hadoop versus a data warehouse can result in a cost benefit that’s at least an order of magnitude cheaper. They claim, for example, that storing data costs $250,000 per terabyte in a traditional warehouse versus $25,000 per terabyte for Hadoop. If you’re talking with the C-suite about an exploratory analytics strategy, what’s the advantage of staying with a warehousing approach?

DS: In healthcare, the compelling use case for Hadoop right now is the license fee. Contrast that case with what compels Silicon Valley web companies and everybody else to go to Hadoop. Their compelling reason wasn’t so much about money. It was about scalability.

If you consider the nature of the data that they’re pulling into Hadoop, there’s no such thing as a data model for the web. All the data that they’re streaming into Hadoop comes tagged with its own data model. They don’t need a relational database engine. There’s no value to them in that setting at all.

For CIOs, the fact that Hadoop is inexpensive open source is very attractive. The downside, however, is the lack of skills. The skills and the tools and the ways to really take advantage of Hadoop are still a few years off in healthcare. Given the nature of the data that we’re dealing with in healthcare right now, there’s nothing particularly compelling about Hadoop in healthcare right now. Probably in the next year, we will start using Hadoop as a preprocessor ETL [extract, transform, load] platform that we can stream data into.

During the next three to four years, as the skills and the tools evolve to take advantage of Hadoop, I think you’ll see companies like Health Catalyst being more aggressive about the adoption of Hadoop in a data lake scenario. But if you add just enough foreign keys and dimensions of analytics across that data lake, that approach greatly facilitates reliable landing and loading. It’s really, really hard to pull meaningful data out of those lakes without something to get the relationship started.


Technology Forecast: Rethinking integration, Issue 1, 2014

Microservices: The resurgence of SOA principles and an alternative to the monolith

By Galen Gruman and Alan Morrison

Big SOA was overkill. In its place, a more agile form of services is taking hold.


Moving away from the monolith

Companies such as Netflix, Gilt, PayPal, and Condé Nast are known for their ability to scale high-volume websites. Yet even they have recently performed major surgery on their systems. Their older, more monolithic architectures would not allow them to add new or change old functionality rapidly enough. So they’re now adopting a more modular and loosely coupled approach based on microservices architecture (MSA). Their goal is to eliminate dependencies and enable quick testing and deployment of code changes. Greater modularity, loose coupling, and reduced dependencies all hold promise in simplifying the integration task.

If MSA had a T-shirt, it would read: “Code small. Code local.”

Early signs indicate this approach to code management and deployment is helping companies become more responsive to shifting customer demands. Yet adopters might encounter a challenge when adjusting the traditional software development mindset to the MSA way—a less elegant, less comprehensive but more nimble approach. PwC believes MSA is worth considering as a complement to traditional methods when speed and flexibility are paramount—typically in web-facing and mobile apps.

Microservices also provide the services layer in what PwC views as an emerging cloud-inspired enterprise integration fabric, which companies are starting to adopt for greater business model agility.

Why microservices?

In the software development community, it is an article of faith that apps should be written with standard application programming interfaces (APIs), using common services when possible, and managed through one or more orchestration technologies. Often, there’s a superstructure of middleware, integration methods, and management tools. That’s great for software designed to handle complex tasks for long-term, core enterprise functions—it’s how transaction systems and other systems of record need to be designed.

But these methods hinder what Silicon Valley companies call web-scale development: software that must evolve quickly, whose functionality is subject to change or obsolescence in a couple of years—even months—and where the level of effort must fit a compressed and reactive schedule. It’s more like web page design than developing traditional enterprise software.


Dependencies from a developer’s perspective

• Pre-SOA (monolithic), 1990s and earlier: tight coupling. For a monolith to change, all must agree on each change. Each change has unanticipated effects requiring careful testing beforehand.
• Traditional SOA, 2000s: looser coupling. Elements in SOA are developed more autonomously but must be coordinated with others to fit into the overall design.
• Microservices, 2010s: decoupled. Developers can create and activate new microservices without prior coordination with others. Their adherence to MSA principles makes continuous delivery of new or modified services possible.


Some of the leading web properties use MSA because it comes from a mindset similar to other technologies and development approaches popular in web-scale companies: agile software development, DevOps, and the use of Node.js and Not only SQL (NoSQL). These approaches all strive for simplicity, tight scope, and the ability to take action without calling an all-hands meeting or working through a tedious change management process. Managing code in the MSA context is often ad hoc and something one developer or a small team can handle without complex superstructure and management. In practice, the actual code in any specific module is quite small—a few dozen lines, typically—is designed to address a narrow function, and can be conceived and managed by one person or a small group.

It is important to understand that MSA is still evolving and unproven over the long term. But like the now common agile methods, Node.js coding framework, and NoSQL data management approaches before it, MSA is an experiment many hope will prove to be a strong arrow in software development quivers.

MSA: A think-small approach for rapid development

Simply put, MSA breaks an application into very small components that perform discrete functions, and no more. The definition of “very small” is inexact, but think of functional calls or low-level library modules, not applets or complete services. For example, a microservice could be an address-based or geolocation-based zip-code lookup, not a full mapping module.

In MSA, you want simple parts with clean, messaging-style interfaces; the less elaborate the better. And you don’t want elaborate middleware, service buses, or other orchestration brokers, but rather simpler messaging systems such as Apache Kafka.

MSA proponents tend to code in web-oriented languages such as Node.js that favor small components with direct interfaces, and in functional languages like Scala or Clojure (a Lisp dialect) that favor “immutable” approaches to data and functions, says Richard Rodger, a Node.js expert and CEO of nearForm, a development consultancy.

This fine-grained approach lets you update, add, replace, or remove services—in short, to integrate code changes—from your application easily, with minimal effect on anything else. For example, you could change the zip-code lookup to a UK postal-code lookup by changing or adding a microservice. Or you could change the communication protocol from HTTP to AMQP, the emerging standard associated with RabbitMQ. Or you could pull data from a NoSQL database like MongoDB at one stage of an application’s lifecycle and from a relational product like MySQL at another. In each case, you would change or add a service.
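As a concrete shape for such a component, here is a minimal sketch of a zip-code lookup microservice, written in TypeScript for Node.js using only the built-in http module. The port, route, and in-memory table are invented for illustration; a production service might sit behind a message broker or API gateway instead.

import { createServer } from "http";

// Hypothetical lookup data; a real service might call a postal-data source instead.
const zipToCity: Record<string, string> = {
  "10001": "New York, NY",
  "94103": "San Francisco, CA",
};

// One narrow function behind one messaging-style interface: GET /zip/<code>
const server = createServer((req, res) => {
  const match = req.url?.match(/^\/zip\/(\d{5})$/);
  const city = match ? zipToCity[match[1]] : undefined;
  res.writeHead(city ? 200 : 404, { "Content-Type": "application/json" });
  res.end(JSON.stringify(city ? { zip: match![1], city } : { error: "not found" }));
});

server.listen(3000);

Swapping in a UK postal-code lookup, or moving from HTTP to a queue-based transport, would then mean deploying a sibling service rather than editing a shared code base.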

MSA lets you move from quick-and-dirty to quick-and-clean changes to applications or their components that are able to function by themselves. You would use other techniques—conventional service-oriented architecture (SOA), service brokers, and platform as a service (PaaS)—to handle federated application requirements. In other words, MSA is one technique among many that you might use in any application.

The fine-grained, stateless, self-contained nature of microservices creates decoupling between different parts of a code base and is what makes them easy to update, replace, remove, or augment. Rather than rewrite a module for a new capability or version and then coordinate the propagation of changes the rewrite causes across a monolithic code base, you add a microservice. Other services that want this new functionality can choose to direct their messages to this new service, but the old service remains for parts of the code you want to leave alone. That’s a significant difference from the way traditional enterprise software development works.

Evolution of services orientation

• Pre-SOA (monolithic), 1990s and earlier: tight coupling.
• Traditional SOA, 2000s: looser coupling.
• Microservices, 2010s: decoupled; services exist in a “dumb” messaging environment.


Thinking the MSA way: Minimalism is a must

The MSA approach is the opposite of the traditional “let’s scope out all the possibilities and design in the framework, APIs, and data structures to handle them all so the application is complete.”

Issue overview: Integration fabric

The microservices topic is the second of three topics as part of the integration fabric research covered in this issue of the PwC Technology Forecast. The integration fabric is a central component for PwC’s New IT Platform.*

Enterprises are starting to embrace more practical integration.** A range of these new approaches is now emerging, and during the next few months we’ll ponder what the new cloud-inspired enterprise integration fabric looks like. The main areas we plan to explore include these:

Integration fabric layers, integration challenges, and emerging technology solutions:

• Data. Integration challenges: data silos, data proliferation, rigid schemas, and high data warehousing cost; new and heterogeneous data types. Emerging technology solutions: Hadoop data lakes, late binding, and metadata provenance tools. Enterprises are beginning to place extracts of their data for analytics and business intelligence (BI) purposes into a single, massive repository and structuring only what’s necessary. Instead of imposing schemas beforehand, enterprises are allowing data science groups to derive their own views of the data and structure it only lightly, late in the process.

• Applications and services. Integration challenges: rigid, monolithic systems that are difficult to update in response to business needs. Emerging technology solutions: microservices. Fine-grained microservices, each associated with a single business function and accessible via an application programming interface (API), can be easily added to the mix or replaced. This method helps developer teams create highly responsive, flexible applications.

• Infrastructure. Integration challenges: multiple clouds and operating systems that lack standardization. Emerging technology solutions: software containers for resource isolation and abstraction. New software containers such as Docker extend and improve virtualization, making applications portable across clouds. Simplifying application deployment decreases time to value.

* See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.

**Integration as PwC defines it means making diverse components work together so they work as a single entity. See “integrated system” at http://www.yourdictionary.com/integrated-system#computer, accessed June 17, 2014.


Think of MSA as almost-plug-and-play in-app integration of discrete services both local and external. These services are expected to change, and some eventually will become disposable. When services have a small focus, they become simple to develop, understand, manage, and integrate. They do only what’s necessary, and they can be removed or ignored when no longer needed.

There’s an important benefit to this minimalist approach, says Gregg Caines, a freelance web developer and co-author of programming books: “When a package doesn’t do more than is absolutely necessary, it’s easy to understand and to integrate into other applications.” In many ways, MSA is a return to some of the original SOA principles of independence and composition—without the complexity and superstructure that become common when SOA is used to implement enterprise software.

The use of multiple, specific services with short lifetimes might sound sloppy, but remember that MSA is for applications, or their components, that are likely to change frequently. It makes no sense to design and develop software over an 18-month process to accommodate all possible use cases when those use cases can change unexpectedly and the life span of code modules might be less than 18 months.

The pace at which new code creation and changes happen in mobile applications and websites simply doesn’t support the traditional application development model. In such cases, the code is likely to change due to rapidly evolving social media services, or because it runs in iOS, Android, or some other environment where new capabilities are available annually, or because it needs to search a frequently updated product inventory.

For such mutable activities, you want to avoid—not build in—legacy management requirements. You live with what nearForm’s Rodger considers a form of technical debt, because it is an easier price to pay for functional flexibility than a full-blown architecture that tries to anticipate all needs. It’s the difference between a two-week update and a two-year project.

Traditional SOA versus microservices

• Messaging type. Traditional SOA: smart, but dependency-laden ESB. Microservices: dumb, fast messaging (as with Apache Kafka).
• Programming style. Traditional SOA: imperative model. Microservices: reactive actor programming model that echoes agent-based systems.
• Lines of code per service. Traditional SOA: hundreds or thousands of lines of code. Microservices: 100 or fewer lines of code.
• State. Traditional SOA: stateful. Microservices: stateless.
• Messaging mode. Traditional SOA: synchronous (wait to connect). Microservices: asynchronous (publish and subscribe).
• Databases. Traditional SOA: large relational databases. Microservices: NoSQL or micro-SQL databases blended with conventional databases.
• Code type. Traditional SOA: procedural. Microservices: functional.
• Means of evolution. Traditional SOA: each big service evolves. Microservices: each small service is immutable and can be abandoned or ignored.
• Means of systemic change. Traditional SOA: modify the monolith. Microservices: create a new service.
• Means of scaling. Traditional SOA: optimize the monolith. Microservices: add more powerful services and cluster by activity.
• System-level awareness. Traditional SOA: less aware and event driven. Microservices: more aware and event driven.


This mentality is different from that required in traditional enterprise software, which assumes complex, multivariate systems are being integrated, requiring many-to-many interactions that demand some sort of intelligent interpretation and complex framework. You invest a lot up front to create a platform, framework, and architecture that can handle a wide range of needs that might be extensive but change only at the edges.

MSA assumes you’re building for the short term; that the needs, opportunities, and context will change; and that you will handle them as they occur. That’s why a small team of developers familiar with their own microservices are the services’ primary users. And the clean, easily understood nature lets developers even more quickly add, remove, update, and replace their services and better ensure interoperation with other services.

In MSA, governance, data architecture, and the microservices are decentralized, which minimizes the dependencies. As a result of this independence, you can use the right language for the microservice in question, as well as the right database or other related service, rather than use a single language or back-end service to accomplish all your application’s needs, says David Morgantini, a developer at ThoughtWorks.

Where MSA makes sense

MSA is most appropriate for applications whose functions may need to change frequently; that may need to run on multiple, changing platforms whose local services and capabilities differ; or whose life spans are not long enough to warrant a heavily architected framework. MSA is great for disposable services.

Mobile apps and web apps are natural venues for MSA. But whatever platform the application runs on, some key attributes favor MSA:

• Fast is more important than elegant.

• Change in the application’s functionality and usage is frequent.

• Change occurs at different rates within the application, so functional isolation and simple integration are more important than module cohesiveness.

• Functionality is easily separated into simple, isolatable components.

For example, an app that draws data from social networks might use separate microservices for each network’s data extraction and data normalization. As social networks wax and wane in popularity, they can be added to the app without changing anything else. And as APIs evolve, the app can support several versions concurrently but independently.
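As a hypothetical sketch of that separation (not code from any of the companies named), each network-specific extractor can be its own small service that emits the same deliberately minimal, normalized record, so networks can be added or retired independently:

// Shared, deliberately small normalized shape used by every extractor service.
interface SocialPost {
  network: string;
  author: string;
  text: string;
  postedAt: string; // ISO 8601
}

// One extractor per network; the raw format here is invented for illustration.
async function extractFromNetworkA(rawJson: string): Promise<SocialPost[]> {
  const items = JSON.parse(rawJson) as Array<{ user: string; body: string; ts: number }>;
  return items.map((item) => ({
    network: "network-a",
    author: item.user,
    text: item.body,
    postedAt: new Date(item.ts * 1000).toISOString(),
  }));
}

// Adding "network-b" later means deploying a new extractor that returns SocialPost[];
// nothing downstream changes, and supporting two API versions means two extractors.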

Microservices can make media distribution platforms, for example, easier to update and faster than before, says Adrian Cockcroft, a technology fellow at Battery Ventures, a venture capital firm. The key is to separate concerns along these dimensions:

• Each single-function microservice has one action.

• A small set of data and UI elements is involved.

• One developer, or a small team, independently produces a microservice.

• Each microservice is its own build, to avoid trunk conflict.

• The business logic is stateless.

• The data access layer is statefully cached.

• New functions are added swiftly, but old ones are retired slowly.1

These dimensions create the independence needed for the microservices to achieve the goals of fast development and easy integration of discrete services limited in scope.
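Two of those dimensions, stateless business logic in front of a statefully cached data access layer, can be compressed into a sketch like the following. It assumes a runtime with a global fetch (Node.js 18 or later); the endpoint, cache policy, and field names are invented:

// Data access layer: stateful only in the sense that it caches reads.
const cache = new Map<string, { value: unknown; fetchedAt: number }>();
const TTL_MS = 60_000;

async function readProduct(id: string): Promise<unknown> {
  const hit = cache.get(id);
  if (hit && Date.now() - hit.fetchedAt < TTL_MS) return hit.value;
  const res = await fetch(`http://inventory.internal/products/${id}`); // hypothetical endpoint
  const value = await res.json();
  cache.set(id, { value, fetchedAt: Date.now() });
  return value;
}

// Business logic: a pure function of its inputs, so any instance can serve any request.
export async function priceWithTax(productId: string, taxRate: number): Promise<number> {
  const product = (await readProduct(productId)) as { basePrice: number };
  return product.basePrice * (1 + taxRate);
}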

1 Adrian Cockcroft, “Migrating to Microservices,” (presentation, QCon London, March 6, 2014), http://qconlondon.com/london-2014/presentation/Migrating%20to%20Microservices.html.


MSA is not entirely without structure. There is a discipline and framework for developing and managing code the MSA way, says nearForm’s Rodger. The more experienced a team is with other methods—such as agile development and DevOps—that rely on small, focused, individually responsible approaches, the easier it is to learn to use MSA. It does require a certain groupthink. The danger of approaching MSA without such a culture or operational framework is the chaos of individual developers acting without regard to each other.

In MSA, integration is the problem, not the solution

Many enterprise developers shake their heads and ask how microservices can possibly integrate with other microservices and with other applications, data sets, and services. MSA sounds like an integration nightmare, a morass of individual connections causing a rat’s nest that looks like spaghetti code.

Ironically, integration is almost a byproduct of MSA, because the functionality, data, and interface aspects are so constrained in number and role. (Rodger says Node.js developers will understand this implicit integration, which is a principle of the language.) In other words, your integration connections are local, so you’re building more of a chain than a web of connections.

When you have fine-grained components, you do have more integration points. Wouldn’t that make the development more difficult and changes within the application more likely to cause breakage? Not necessarily, but it is a risk, says Morgantini. The key is to create small teams focused on business-relevant tasks and to conceive of the microservices they create as neighbors living together in a small neighborhood, so the relationships are easily apparent and proximate. In this model, an application can be viewed as a city of neighborhoods assigned to specific business functions, with each neighborhood composed of microservices. You might have “planned community” neighborhoods made from coarser-grained services or even monolithic modules that interact with more organic MSA-style neighborhoods.2

It’s important to remember that by keeping services specific, there’s little to integrate. You typically deal with a handful of data, so rather than work through a complex API, you directly pull the specific data you want in a RESTful way. You keep your own state, again to reduce dependencies. You bind data and functions late for the same reasons.
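That local, RESTful pull can be as small as the sketch below, where a caller asks the hypothetical zip-lookup service from the earlier sketch for the one field it needs; the host name is invented, and a runtime with a global fetch (Node.js 18 or later) is assumed:

// The caller integrates by pulling exactly one value from a neighboring service.
async function cityForZip(zip: string): Promise<string | null> {
  const res = await fetch(`http://zip-lookup.internal:3000/zip/${zip}`); // hypothetical host
  if (!res.ok) return null;
  const body = (await res.json()) as { city?: string };
  return body.city ?? null; // only the field this caller cares about
}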

Integration is a problem MSA tries to avoid by reducing dependencies and keeping them local. If you need complex integration, you shouldn’t use MSA for that part of your software development. Instead, use MSA where broad integration is not a key need.

Conclusion

MSA is not a cure-all, nor is it meant to be the only or even dominant approach for developing applications. But it’s an emerging approach that bucks the trend of elaborate, elegant, complete frameworks where that doesn’t work well. Sometimes, doing just what you need to do is a better answer than figuring out all the things you might need and constructing an environment to handle it all. MSA serves the “do just what you need to do” scenario.

This approach has proven effective in contexts already familiar with agile development, DevOps, and loosely coupled, event-driven technologies such as Node.js. MSA applies the same mentality to the code itself, which may be why early adopters are those who are using the other techniques and technologies. They already have an innate culture that makes it easier to think and act in the MSA way.

Any enterprise looking to serve users and partners via the web, mobile, and other fast-evolving venues should explore MSA.

2 David Morgantini, “Micro-services—Why shouldn’t you use micro-services?” Dare to dream (blog), August 27, 2013, http://davidmorgantini.blogspot.com/2013/08/micro-services-why-shouldnt-you-use.html, accessed May 12, 2014.


John Pritchard

John Pritchard is director of platform services at Adobe.

PwC: What are some of the challenges when moving to an API-first business model?

JP: I see APIs as a large oncoming wave that will create a lot of benefit for a lot of companies, especially companies in our space that are trying to migrate to SaaS.1

At Adobe, we have moved from being a licensed desktop product company to a subscription-based SaaS company. We’re in the process of disintegrating our desktop products to services that can be reassembled and packaged in interesting ways by our own product teams or third-party developers.

With the API model, there’s a new economy of sorts and lots of talk about how to monetize the services. People discuss the models by which those services could be made available and how they could be sold.

There’s still immaturity in the very coarse way that APIs tend to be exposed now.

1 Abbreviations are as follows: API: application programming interface; SaaS: software as a service.

For more information on APIs, see “The business value of APIs,” PwC Technology Forecast 2012, Issue 2, http://www.pwc.com/us/en/technology-forecast/2012/issue2/index.jhtml.

Technology Forecast: Rethinking integration

Issue 1, 2014

Microservices in a software industry context
John Pritchard offers some thoughts on the rebirth of SOA and an API-first strategy from the vantage point of a software provider.

Interview conducted by Alan Morrison, Wunan Li, and Akshay Rao


I might want to lease the use of APIs to a third-party developer, for instance, with a usage-based pricing model. This model allows the developer to white label the experience to its customers without requiring a license.

Usage-based pricing triggers some thought around how to instrument APIs and the connection between API usage and commerce. It leads to some interesting conversations about identity and authentication, especially when third-party developers might be integrating multiple API sets from different companies into a customer-exposed application.
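Instrumenting an API for usage-based pricing can start with something as plain as counting calls per API key and endpoint, which a billing job then reads. The sketch below is a generic illustration in TypeScript, not Adobe’s implementation; all names are invented:

// Per-key, per-endpoint call counters that a billing or metering job reads periodically.
const usage = new Map<string, number>();

function recordCall(apiKey: string, endpoint: string): void {
  const counterKey = `${apiKey}:${endpoint}`;
  usage.set(counterKey, (usage.get(counterKey) ?? 0) + 1);
}

// Wrap any handler so every invocation is metered before the real work runs.
function metered<TIn, TOut>(
  endpoint: string,
  handler: (apiKey: string, input: TIn) => Promise<TOut>
): (apiKey: string, input: TIn) => Promise<TOut> {
  return async (apiKey, input) => {
    recordCall(apiKey, endpoint);
    return handler(apiKey, input);
  };
}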

PwC: Isn’t there substantial complexity associated with the API model once you get down to the very granular services suggested by a microservices architecture?

JP: At one level, the lack of standards and tooling for APIs has resulted in quite a bit of simplification. Absent standards, we are required to use what I’ll call the language of the Internet: HTTP, JSON, and OAuth. That’s it. This approach has led to beautiful, simple designs because you can only do things a few ways.

But at another level, techniques to wire together capabilities with some type of orchestration have been missing. This absence creates a big risk in my mind of trying to do things in the API space like the industry did with SOA and WS*.2


PwC: How are microservices related to what you’re doing on the API front?

JP: We don’t use the term microservices; I wouldn’t say you’d hear that term in conversations with our design teams. But I’m familiar with some of Martin Fowler’s writing on the topic.3 If you think about how the term is defined in industry and this idea of smaller statements that are transactions, that concept is very consistent with design principles and the API-first strategy we adhere to.

What I’ve observed on my own team and some of the other product teams we work with is that the design philosophy we use is less architecturally driven than it is team dynamic driven. When you move to an end-to-end team or a DevOps4 type of construct, you tend to want to define things that you can own completely and that you can release so you have some autonomy to serve a particular need.

We use APIs to integrate internally as well. We want these available to our product engineering community in the most consumable way. How do we describe these APIs so we clear the path for self-service as quickly as possible? Those sorts of questions and answers have led us to the design model we use.

2 Abbreviations are as follows: HTTP: hypertext transfer protocol; JSON: JavaScript Object Notation; SOA: service-oriented architecture; WS*: web services.

3 For example, see James Lewis and Martin Fowler, “Microservices,” March 25, 2014, http://martinfowler.com/articles/microservices.html, accessed June 2014.

4 DevOps is a working style designed to encourage closer collaboration between developers and operations people: Dev + Ops = DevOps. For more information on DevOps, continuous delivery, and antifragile system development, see “DevOps: Solving the engineering productivity challenge,” PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.


PwC: When you think about the problems that a microservices approach might help with, what is top of mind for you?

JP: I’ve definitely experienced the rebirth of SOA. In my mind, APIs are SOA realized. We remember the ESB and WS* days and the attempt to do real top-down governance. We remember how difficult that was not only in the enterprise, but also in the commercial market, where it didn’t really happen at all.5

Developer-friendly consumability has helped us bring APIs to market. Internally, that has led to greater efficiencies. And it encourages some healthy design practices by making things small. Some of the connectivity becomes less important than the consumability.

PwC: What’s the approach you’re taking to a more continuous form of delivery in general?

JP: For us, continuous delivery brings to mind end-to-end teams or the DevOps model. Culturally, we’re trying to treat everything like code. I treat infrastructure like code. I treat security like code. Everything is assigned to sprints. APIs must be instrumented for deployment, and then we test around the APIs being deployed.

We’ve borrowed many of the Netflix constructs around monkeys.6 We use monkeys not only for infrastructure components but also for scripted security attacks to validate our operational run times. We’ve seen an increased need for automation. With every deployment we look for opportunities for automation. But what’s been key for the success in my team is this idea of treating all these different aspects just like we treat code.

PwC: Would that include infrastructure as well?

JP: Yes. My experience is that the line is almost completely blurred about what’s software and what’s infrastructure now. It’s all software defined.

PwC: As systems become less monolithic, how will that change the marketplace for software?

JP: At the systems level, we’re definitely seeing a trend away from centralized core systems—like core ERP or core large platforms that provide lots of capabilities—to a model where a broad selection of SaaS vendors provide very niche capabilities. Those SaaS operators may change over time as new ones come into the market. The service provider model, abstracting SaaS provider capabilities with APIs, gives us the flexibility to evaluate newcomers that might be better providers for each API we’ve defined.

5 Abbreviations are as follows: ESB: enterprise service bus; ERP: enterprise resource planning.

6 Chaos Monkey is an example. See “The evolution from lean and agile to antifragile,” PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/features/new-cloud-development-styles.jhtml for more on Chaos Monkey.


Richard Rodger

Richard Rodger is the CTO of nearForm, a software development and training consultancy specializing in Node.js.

PwC: What’s the main advantage of a microservices approach versus object-oriented programming?

RR: Object-oriented programming failed miserably. With microservices, it’s much harder to shoot yourself in the foot. The traditional anti-patterns and problems that happen in object-oriented code—such as the big bowl of mud where a single task has a huge amount of responsibilities or goes all over the place—are less likely in the microservices world.

Consider the proliferation of patterns in the object-oriented world. Any programming paradigm that requires you to learn 50 different design patterns to get things right and makes it so easy to get things wrong is probably not the right way to be doing things.

That’s not to say that patterns aren’t good. Pattern designs are good and they are necessary. It’s just that in the microservices world, there are far fewer patterns.

PwC: What is happening in companies that are eager to try the microservices approach?

RR: It’s interesting to think about why change happens in the software industry. Sometimes the organizational politics is a much more important factor than the technology itself. Our experience is that politics often drives the adoption of microservices. We’re observing aggressive, ambitious vice presidents who have the authority to fund large software projects. In light of how long most of these projects usually take, the vice presidents see an opportunity for career advancement by executing much more rapidly.

Technology Forecast: Rethinking integration

Issue 1, 2014

The critical elements of microservices
Richard Rodger describes his view of the emerging microservices landscape and its impact on enterprise development.

Interview conducted by Alan Morrison and Bo Parker


A lot of our engagements are with forward-looking managers who essentially are sponsoring the adoption of a microservices approach. Once those initial projects have been deemed successful because they were delivered faster and more effectively, that proves the point and creates its own force for the broader adoption of microservices.

PwC: How does a typical microservices project begin?

RR: In large projects that can take six months or more, we develop the user story and then define and map capabilities to microservices. And then we map microservices onto messages. We do that very, very quickly. Part of what we do, and part of what microservices enable us to do, is show a working live demo of the system after week one.

If we kick off on a Monday, the following Monday we show a live version of the system. You might only be able to log in and perhaps get to the main screen. But there’s a running system that may be deployed on whatever infrastructure is chosen.

Every Monday there’s a new live demo. And that system stays running during the lifetime of the project. Anybody can look at the system, play with it, break it, or whatever at any point in time. Those capabilities are possible because we started to build services very quickly within the first week.

With a traditional approach, even approaches that are agile, you must make an awful lot of decisions up front. And if you make the wrong decisions, you back yourself into a corner. For example, if you decide to use a particular database technology or commit to a certain structure of object hierarchies, you must be very careful and spend a lot of time analyzing. The use of microservices reduces that cost significantly.

An analogy might help to explain how this type of decision making happens. When UC Irvine laid out its campus, the landscapers initially put in grass and watched where people walked. They later built paths where the grass was worn down.

Microservices are like that. If you have a particular data record and you build a microservice to look back at that data record, you don’t need to define all of the fields up front. A practical example might be if a system will capture transactions and ultimately use a relational database. We might use MongoDB for the first four weeks of development because it’s schema free.

After four weeks of development, the schema will be stabilized to a considerable extent. On week five, we throw away MongoDB and start using a relational product. We saved ourselves from huge hassles in database migrations by developing this way. The key is using a microservice as the interface to the database. That lets us throw away the initial database and use a new one—a big win.
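A minimal sketch of that pattern: callers talk to a record-store microservice only in terms of messages, so the backing store can move from a schema-free document database to a relational one without touching them. The interface and message shape here are illustrative, not nearForm’s code:

import { randomUUID } from "crypto";

// Callers send messages such as { cmd: "save", record: {...} } and never see the database.
interface RecordStore {
  save(record: Record<string, unknown>): Promise<string>; // returns an id
  load(id: string): Promise<Record<string, unknown> | null>;
}

// Early weeks: a schema-free, in-memory stand-in (MongoDB would play this role above).
class DocumentStore implements RecordStore {
  private docs = new Map<string, Record<string, unknown>>();
  async save(record: Record<string, unknown>): Promise<string> {
    const id = randomUUID();
    this.docs.set(id, record);
    return id;
  }
  async load(id: string): Promise<Record<string, unknown> | null> {
    return this.docs.get(id) ?? null;
  }
}

// Later: swap in a RelationalStore that implements the same interface; the message
// handler and every caller stay unchanged.
async function handleMessage(
  store: RecordStore,
  msg: { cmd: string; id?: string; record?: Record<string, unknown> }
) {
  if (msg.cmd === "save" && msg.record) return { id: await store.save(msg.record) };
  if (msg.cmd === "load" && msg.id) return { record: await store.load(msg.id) };
  return { error: "unknown command" };
}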

PwC: Do microservices have skeletal frameworks of code that you can just grab, plug in, and compose the first week’s working prototype?

RR: We open source a lot, and we have developed a whole bunch of precut microservices. That’s a benefit of being part of the Node [server-side JavaScript] community.


There’s this ethic in the Node community about sharing your Node services. It’s an emergent property of the ecosystem. You can’t really compile JavaScript, so a lot of it’s going to be open source anyway. You publish a module onto the npm public repository, which is open source by definition.

PwC: There are very subtle and nuanced aspects of the whole microservices scene, and if you look at it from just a traditional development perspective, you’d miss these critical elements. What’s the integration pattern most closely associated with microservices?

RR: It all comes back to thinking about your system in terms of messages. If you need a search engine for your system, for example, there are various options and cloud-based search services you can use now. Normally this is a big integration task with heavy semantics and coordination required to make it work.

If you define your search capability in terms of messages, the integration is to write a microservice that talks to whatever back end you are using. In a sense, the work is to define how to interact with the search service.

Let’s say the vendor is rolling out a new version. It’s your choice when you go with the upgrade. If you decide you want to move ahead with the upgrade, you write your microservices so both version 1 and version 2 can subscribe to the same messages. You can route a certain part of your message to version 1 and a certain part to version 2. To gracefully phase in version 2 before fully committing, you might start by directing 5 percent of traffic to the new version, monitor it for issues, and gradually increase the traffic to version 2. Because it doesn’t require a full redeployment of your entire system, it’s easy to do. You don’t need to wait three months for a lockdown. Monolithic systems often have these scenarios where the system is locked down on November 30 because there’s a Christmas sales period or something like that. With microservices, you don’t have such issues anymore.
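The routing decision Rodger describes needs very little code. The sketch below is a hypothetical illustration; the message shape and handlers are invented:

type SearchMsg = { query: string };
type Handler = (msg: SearchMsg) => Promise<unknown>;

// Stand-ins for the two service versions that subscribe to the same logical message.
const searchV1: Handler = async (msg) => ({ engine: "v1", query: msg.query });
const searchV2: Handler = async (msg) => ({ engine: "v2", query: msg.query });

// The router is the only place that knows about the split; raising the share to 1.0
// completes the cutover without redeploying anything else.
function makeRouter(v1: Handler, v2: Handler, v2Share: number): Handler {
  return (msg) => (Math.random() < v2Share ? v2(msg) : v1(msg));
}

const routeSearch = makeRouter(searchV1, searchV2, 0.05); // start with 5 percent on version 2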

PwC: So using this message pattern, you could easily fall into the trap of having a fat message bus, which seems to be the anti-pattern here for microservices. You’re forced to maintain this additional code that is filtering the messages, interpreting the messages, and transforming data. You’re back in the ESB world.

RR: Exactly. An enterprise spaghetti bowl, I think it’s called.

PwC: How do you get your message to the right places efficiently while still having what some are calling a dumb pipe to the message management?

RR: This principle of the dumb pipe is really, really important. You must push the intelligence of what to do with messages out to the edges. And that means some types of message brokers are better suited to this architecture than others. For example, traditional message brokers like RabbitMQ—ones that maintain internal knowledge of where individual consumers are, message queues, and that sort of thing—are much less suited to what we want to do here. Something like Apache Kafka is much better because it’s purposely dumb. It forces the message-queue consumers to remember their own place in the queue.



As a result, you don’t end up with scaling issues if the queue gets overloaded. You can deal with the scaling issue at the point of actually intercepting the message, so you’re getting the messages passed through as quickly as possible.

You don’t need to use a message queue for everything, either. If you end up with a very, very high throughput system, you move the intelligence into the producer so it knows you have 10 consumers. If one dies, it knows to trigger the surrounding system to create a new consumer, for example.

It’s the same idea as when we were using MongoDB to determine the schema ahead of time. After a while, you’ll notice that the bus is less suitable for certain types of messages because of the volumes or the latency or whatever.
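The “dumb pipe” idea can be illustrated with a toy log in TypeScript: the broker only appends and serves messages by position, and each consumer keeps its own offset. This is a sketch of the principle, not the Kafka API:

// The pipe knows nothing about its consumers; it only appends and reads by offset.
class DumbLog {
  private messages: string[] = [];
  append(msg: string): void {
    this.messages.push(msg);
  }
  read(offset: number, max: number): string[] {
    return this.messages.slice(offset, offset + max);
  }
}

// Each consumer remembers its own place, so a slow consumer never stalls the broker.
class Consumer {
  private offset = 0;
  constructor(private log: DumbLog, private name: string) {}
  poll(): void {
    const batch = this.log.read(this.offset, 10);
    batch.forEach((m) => console.log(`${this.name} handled: ${m}`));
    this.offset += batch.length; // the intelligence lives at the edge
  }
}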

PwC: Would Docker provide a parallel example for infrastructure?

RR: Yes. Let’s say you’re deploying 50 servers, 50 Amazon instances, and you set them up with a Docker recipe. And you deploy that. If something goes wrong, you could kill it. There’s no way for a sys admin to SSH [Secure Shell] into that machine and start tinkering with the configurations to fix it. When you deploy, the services either work or they don’t.

PwC: The cognitive load facing programmers of monoliths and the coordination load facing programmer teams seem to represent the new big mountain to climb.

RR: Yes. And that’s where the productivity comes from, really. It actually isn’t about best practices or a particular architecture or a particular version of Node.js. It’s just that if we have less intellectual work to do, that actually lets us do more.


By Alan Morrison and Pini Reznik

With containers like Docker, developers can deploy the same app on different infrastructure without rework.

Technology Forecast: Rethinking integration
Issue 1, 2014

Containers are redefining application-infrastructure integration


Spotify, the Swedish streaming music service, grew by leaps and bounds after its launch in 2006. As its popularity soared, the company managed its scaling challenge simply by adding physical servers to its infrastructure. Spotify tolerated low utilization in exchange for speed and convenience. In November 2013, Spotify was offering 20 million songs to 24 million users in 28 countries. By that point, with a computing infrastructure of 5,000 servers in 33 Cassandra clusters at four locations processing more than 50 terabytes, the scaling challenge demanded a new solution.

Spotify chose Docker, an open source application deployment container that evolved from the LinuX Containers (LXCs) used for the past decade. LXCs allow different applications to share operating system (OS) kernel, CPU, and RAM. Docker containers go further, adding layers of abstraction and deployment management features. Among the benefits of this new infrastructure technology, containers that have these capabilities reduce coding, deployment time, and OS licensing costs.

Not every company is a web-scale enterprise like Spotify, but increasingly many companies need scalable infrastructure with maximum flexibility to support the rapid changes in services and applications that today’s business environment demands. Early evaluations of Docker suggest it is a flexible, cost-effective, and more nimble way to deploy rapidly changing applications on infrastructure that also must evolve quickly.

PwC expects containers will become a standard fixture of the infrastructure layer in the evolving cloud-inspired integration fabric. This integration fabric includes microservices at the services layer and data lakes at the data layer, which other articles explore in this “Rethinking integration” issue of the PwC Technology Forecast.1 This article examines Docker containers and their implications for infrastructure integration.

A stretch goal solves a problem

Spotify’s infrastructure scale dwarfs those of many enterprises. But its size and complexity make Spotify an early proof case for the value and viability of Docker containers in the agile business environment that companies require.

By late 2013, Spotify could no longer continue to scale or manage its infrastructure one server at a time. The company used state-of-the-art configuration management tools such as Puppet, but keeping those 5,000 servers consistently configured was still difficult and time-consuming.

Spotify had avoided conventional virtualization technologies. “We didn’t want to deal with the overhead of virtual machines (VMs),” says Rohan Singh, a Spotify infrastructure engineer. The company required some kind of lightweight alternative to VMs, because it needed to deploy changes to 60 services and add new services across the infrastructure in a more manageable way. “We wanted to make our service deployments more repeatable and less painful for developers,” Singh says.

Singh was a member of a team that first looked at LXCs, which—unlike VMs—allow applications to share an OS kernel, CPU, and RAM. With containers, developers can isolate applications and their dependencies. Advocates of containers tout the efficiencies and deployment speed compared with VMs. Spotify wrote some deployment service scripts for LXCs, but decided it was needlessly duplicating what existed in Docker, which includes additional layers of abstraction and deployment management features.

Singh’s group tested Docker on a few internal services to good effect. Although the vendor had not yet released a production version of Docker and advised against production use, Spotify took a chance and did just that. “As a stretch goal, we ignored the warning labels and went ahead and deployed a container into production and started throwing production traffic at it,” Singh says, referring to a service that provided album metadata such as the album or track titles.2

Thanks to Spotify and others, adoption had risen steadily even before Docker 1.0 was available in June 2014.


1 For more information, see “Rethinking integration: Emerging patterns from cloud computing leaders,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml.

2 Rohan Singh, “Docker at Spotify,” Twitter University YouTube channel, December 11, 2013, https://www.youtube.com/watch?v=pts6F00GFuU, accessed May 13, 2014, and Jack Clark, “Docker blasts into 1.0, throwing dust onto traditional hypervisors,” The Register, June 9, 2014, http://www.theregister.co.uk/2014/06/09/docker_milestone_release/, accessed June 11, 2014.

Issue overview: Rethinking integration

This article focuses on one of three topics covered in the Rethinking Integration issue of the PwC Technology Forecast (http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml). The integration fabric is a central component for PwC’s New IT Platform. (See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.)


The Docker application container engine posted on GitHub, the code-sharing network, had received more than 14,800 stars (up-votes by users) by August 14, 2014.3 Container-oriented management tools and orchestration capabilities are just now emerging. When Docker, Inc. (formerly dotCloud, Inc.) released Docker 1.0, the vendor also released Docker Hub, a proprietary orchestration tool available for licensing. PwC anticipates that orchestration tools will also become available from other vendors. Spotify is currently using tools it developed.

Why containers?

LXCs have existed for many years, and some companies have used them extensively. Google, for example, now starts as many as 2 billion containers a week, according to Joe Beda, a senior staff software engineer at Google.4 LXCs abstract the OS more efficiently than VMs. The VM model blends an application, a full guest OS, and disk emulation. In contrast, the container model uses just the application’s dependencies and runs them directly on a host OS. Containers do not launch a separate OS for each application, but share the host kernel while maintaining the isolation of resources and processes where required.

The fact that a container does not run its own OS instance reduces dramatically the overhead associated with starting and running instances. Startup time can typically be reduced from 30 seconds (or more) to one-tenth of a second. The number of containers running on a typical server can reach dozens or even hundreds. The same server, in contrast, might support 10 to 15 VMs.

Developer teams such as those at Spotify, which write and deploy services for large software-as-a-service (SaaS) environments, need to deploy new functionality quickly, at scale, and to test and see the results immediately. Increasingly, they say containerization delivers those benefits. SaaS environments by their very test-driven nature require frequent infusions of new code to respond to shifting customer demands. Without containers, developers who write more and more distributed applications would spend much time on repetitive drudgery.

Docker: LXC simplification and an emerging multicloud abstraction

A Docker application container takes the basic notion of LXCs, adds simplified ways of interacting with the underlying kernel, and makes the whole portable (or interoperable) across environments that have different operating systems.



3 See “dotcloud/docker,” GitHub, https://github.com/dotcloud/docker, accessed August 14, 2014.

4 Joe Beda, “Containers At Scale,” Gluecon 2014 conference presentation slides, May 22, 2014, https://speakerdeck.com/jbeda/containers-at-scale, accessed June 11, 2014.

Figure 1: Virtual machines on a Type 2 hypervisor versus application containerization with a shared OS. In the VM model, each application (App A, App B) runs with its own guest OS and bins/libraries on top of a Type 2 hypervisor, the host OS, and the server. In the container model, a container engine runs on the host OS, and containers are isolated but share the OS and, where appropriate, bins/libraries.

Source: Docker, Inc., 2014


Portability is currently limited to Linux environments—Ubuntu, SUSE, or Red Hat Enterprise Linux, for example. But Ben Golub, CEO of Docker, Inc., sees no reason why a Dockerized container created on a laptop for Linux couldn’t eventually run on a Windows server unchanged. “With Docker, you no longer need to worry in advance about where the apps will run, because the same containerized application will run without being modified on any Linux server today. Going to Windows is a little trickier because the primitives aren’t as well defined, but there’s no rocket science involved.5 It’s just hard work that we won’t get to until the second half of 2015.”

That level of portability can therefore extend across clouds and operating environments, because containerized applications can run on a VM or a bare-metal server, or in clouds from different service providers.

The amount of application isolation that Docker containers provide—a primary reason for their portability—distinguishes them from basic LXCs. In Docker, applications and their dependencies, such as binaries and libraries, all become part of a base working image. That containerized image can run on different machines. “Docker defines an abstraction for these machine-specific settings, so the exact same Docker container can run—unchanged—on many different machines, with many different configurations,” says Solomon Hykes, CTO of Docker, Inc.6

Another advantage of Docker containerization is that updates, such as vulnerability patches, can be pushed out to the containers that need them without disruption. “You can push changes to 1,000 running containers without taking any of them down, without restarting an OS, without rebuilding a VM,” Golub says. Docker’s ability to extend the reach of security policy and apply it uniformly is substantial. “The security model becomes much better with containers. In the VM-based world, every application has its own guest OS, which is a slightly different version. These different versions are difficult to patch. In a container-based world, it’s easier to standardize the OS and deploy just one patch across all hosts,” he adds.

Containerized applications also present opportunities for more comprehensive governance. Docker tracks the provenance of each container by using a method that digitally signs each one. Golub sees the potential, over time, for a completely provenanced library of components, each with its own automated documentation and access control capability.7

When VMs were introduced, they formed a new abstraction layer, a way to decouple software from a hardware dependency. VMs led to the creation of clouds, which allowed the load to be distributed among multiple hardware clusters. Containerization using the open Docker standard extends this notion of abstraction in new ways, across homogeneous or heterogeneous clouds. Even more importantly, it lowers the time and cost associated with creating, maintaining, and using the abstraction. Docker management tools such as Docker Hub, CenturyLink Panamax, Apache Mesos, and Google Kubernetes are emerging to address container orchestration and related challenges.

Outlook: Containers, continuous deployment, and the rethinking of integration

Software engineering has generally trended away from monolithic applications and toward the division of software into orchestrated groups of smaller, semi-autonomous pieces that have a smaller footprint and shorter deployment cycle.8 Microservices principles are leading this change in application architecture, and containers will do the same when it comes to deploying those microservices on any cloud infrastructure. The smaller size, the faster creation, and the subsecond deployment of containers allow enterprises to reduce both the infrastructure and application deployment cycles from hours to minutes. When enterprises can reduce deployment time so it’s comparable to the execution time of the application itself, infrastructure development can become an integral part of the main development process. These changes should be accompanied by changes in organizational structures, such as transitioning from waterfall to agile and DevOps teams.9

5 A primitive is a low-level object, components of which can be used to compose functions. See http://www.webopedia.com/TERM/P/primitive.html, accessed July 28, 2014, for more information.

6 Solomon Hykes, “What does Docker add to just plain LXC?” answer to Stack Overflow Q&A site, August 13, 2013, http://stackoverflow.com/questions/1 0 /what-does-docker-add-to-just-plain-lxc, accessed June 11, 2014.

7 See the PwC interview with Ben Golub, “Docker’s role in simplifying and securing multicloud development,” http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/interviews/interview-ben-golub-docker.jhtml for more information.

8 For more detail and a services perspective on this evolution, see “Microservices: The resurgence of SOA principles and an alternative to the monolith,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/microservices.jhtml.


When VMs became popular, they were initially used to speed up and simplify the deployment of a single server. Once the application architecture internalized the change and monolithic apps started to be divided into smaller pieces, the widely accepted approach of that time—the golden image—could not keep up. VM proliferation and management became the new headaches in the typical organization.

This problem led to the creation of configuration management tools that help maintain the desired state of the system. CFEngine pioneered these tools, which Puppet, Chef, and Ansible later popularized. Another cycle of growth led to a new set of orchestration tools. These tools—such as MCollective, Capistrano, and Fabric—manage the complex system deployment on multihost environments in the correct order.

Containers might allow the deployment of a single application in less than a second, but now different parts of the application must run on different clouds. The network will become the next bottleneck. The network issues will require systems to have a combination of statelessness and segmentation. Organizations will need to deploy and run subsystems separately with only loose, software-defined network connections. That’s a difficult path. Some centralization may still be necessary.

9 For a complete analysis of the DevOps movement and its implications, see “DevOps: Solving the engineering productivity challenge,” PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.

Figure 2: From monolithic to multicloud architectures. Around 1995: thick client-server clients, a middleware/OS stack, and monolithic physical infrastructure. Around 2014: thin mobile clients, applications assembled from available services, and VMs running on clouds. Around 2017 (projected): thin mobile/web UIs, microservices, and containers running on multicloud, container-based infrastructure.

Source: Docker, Inc. and PwC, 2014


Conclusion: Beyond application and infrastructure integration

Microservices and containers are symbiotic. Together their growth has produced an alternative to integration entirely different from traditional enterprise application integration (EAI). Some of the differences include:

• Traditional EAI: translation (via an enterprise service bus [ESB], for example). Microservices + containers: encapsulation.
• Traditional EAI: articulation. Microservices + containers: abstraction.
• Traditional EAI: bridging between systems. Microservices + containers: portability across systems.
• Traditional EAI: monolithic, virtualized OS. Microservices + containers: fit for purpose, distributed OS.
• Traditional EAI: wired. Microservices + containers: loosely coupled.

The blend of containers, microservices, and associated management tools will redefine the nature of the components of a system. As a result, organizations that use the blend can avoid the software equivalent of wired, hard-to-create, and hard-to-maintain connections. Instead of constantly tinkering with a polyglot connection bus, system architects can encapsulate the application and its dependencies in a lingua franca container. Instead of virtualizing the old OS into the new context, developers can create distributed, slimmed-down operating systems. Instead of building bridges between systems, architects can use containers that allow applications to run anywhere. By changing the nature of integration, containers and microservices enable enterprises to move beyond it.


PwC: You mentioned that one of the reasons you decided to join Docker, Inc., as CEO was because the capabilities of the tool itself intrigued you. What intrigued you most?1

BG: The VM was created when applications were long-lived, monolithic, built on a well-defined stack, and deployed to a single server. More and more, applications today are built dynamically through rapid modification. They’re built from loosely coupled components in a variety of different stacks, and they’re not deployed to a single server. They’re deployed to a multitude of servers, and the application that’s working on a developer’s laptop also must work in the test stage, in production, when scaling, across clouds, in a customer environment on a VM, on an OpenStack cluster, and so forth.

The model for how you would do that is really very different from how you would deal with a VM, which is in essence trying to treat an application as if it were an application server.

1 Docker is an open source application deployment container tool released by Docker, Inc., that allows developers to package applications and their dependencies in a virtual container that can run on any Linux server. Docker Hub is Docker, Inc.'s related, proprietary set of image distribution, change management, collaboration, workflow, and integration tools. For more information on Docker Hub, see Ben Golub, "Announcing Docker Hub and Official Repositories," Docker, Inc. blog, June 2014, http://blog.docker.com/2014/06/announcing-docker-hub-and-official-repositories/, accessed July 2014.



What containers do is pretty radical if you consider their impact on how applications are built, deployed, and managed.2

PwC: What predisposed the market to say that now is the time to start looking at tools like Docker?

BG: When creating an application as if it were an application server, the VM model blends an application, a full guest operating system, and disk emulation. By contrast, the container model uses just the application’s dependencies and runs them directly on a host OS.
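To make that contrast concrete, here is a minimal, hypothetical container definition; the base image, package names, and file paths are illustrative and not taken from the interview. It declares only the application and its dependencies, which are layered on the host's kernel rather than on a full guest operating system.

    # A minimal, hypothetical Dockerfile: just the application and its dependencies,
    # with no guest OS kernel and no disk emulation.
    FROM ubuntu:14.04

    # Install only the runtime the application needs.
    RUN apt-get update && apt-get install -y python python-pip

    # Add the application code and its declared dependencies.
    COPY ./app /opt/app
    RUN pip install -r /opt/app/requirements.txt

    # The single process this container exists to run.
    EXPOSE 8080
    CMD ["python", "/opt/app/server.py"]

Because the image carries everything above the kernel, the same artifact can move from a developer's laptop to test and production unchanged.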

In the server world, the use of containers was limited to companies, such as Google, that had lots of specialized tools and training. Those tools weren’t transferable between environments; they didn’t make it possible for containers to interact with each other.

We often use the shipping container as an analogy. The analogous situation before Docker was one in which steel boxes had been invented but nobody had made them a standard size, put holes in all the same places, and figured out how to build cranes and ships and trains that could use them.

We aim to add to the core container technology, so containers are easy to use and interoperable between environments. We want to make them portable between clouds and different operating systems, between physical and virtual. Most importantly, we're working to build an ecosystem around it, so there will be people, tools, and standard libraries that will all work with Docker.3

"You can usually gain 10 times greater density when you get rid of that guest operating system."

PwC: What impact is Docker having on the evolution of PaaS?4

BG: The traditional VM links together the application management and the infrastructure management. We provide a very clean separation, so people can use Docker without deciding in advance whether the ideal infrastructure is a public or private cloud, an OpenStack cluster, or a set of servers all running RHEL or Ubuntu. The same container will run in all of those places without modification or delay.

Because containers are so much more efficient and lightweight, you can usually gain 10 times greater density when you get rid of that guest operating system. That density really changes the economics of providing XaaS as well as the economics and the ease of moving between different infrastructures.

In a matter of milliseconds, a container can be moved between provider A and provider B or between provider A and something private that you’re running. That speed really changes how people think about containers. Docker has become a standard container format for a lot of different platforms as a service, both private and public PaaS. At this point, a lot of people are questioning whether they really need a full PaaS to build a flexible app environment.5

2 Abbreviations are as follows: VM: virtual machine

3 Abbreviations are as follows: OS: operating system

4 Abbreviations are as follows: PaaS: platform as a service

5 Abbreviations are as follows: RHEL: Red Hat Enterprise Linux


PwC: Why are so many questioning whether or not they need a full PaaS?

BG: A PaaS is a set of preselected stacks intended to run the infrastructure for you. What people increasingly want is the ability to choose any stack and run it on any platform. That’s beyond the capability of any one organization to provide.

With Docker, you no longer need to worry in advance about where the apps will run, because the same containerized application will run without being modified on any Linux server today. You might build it in an environment that has a lot of VMs and decide you want to push it to a bare-metal cluster for greater performance. All of those options are possible, and you don’t really need to know or think about them in advance.

PwC: When you can move Docker containers so easily, are you shifting the challenge to orchestration?

BG: Certainly. Rather than having components tightly bound together and stitched up in advance, they’re orchestrated and moved around as needs dictate.

Docker provides the primitives that let you orchestrate between containers using a bridge. Ultimately, we’ll introduce more full-fledged orchestration that lets you orchestrate across different data centers.
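As a minimal sketch of those primitives (the container and image names here are hypothetical), two containers on one host can be wired together over Docker's default bridge with a link, which exposes the database container's address to the application through environment variables and host entries:

    # Start a database container on the host's default Docker bridge (illustrative names).
    docker run -d --name db postgres

    # Start the application container and link it to the database container.
    docker run -d --name web --link db:db -p 8080:8080 myorg/webapp

Cross-host and cross-data-center orchestration, as Golub notes, is the part that sits above these primitives.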

Docker Hub—our commercial services announced in June 2014—is a set of services you can use to orchestrate containers both within a data center and between data centers.

6 For more detail, see Ted Dziuba, "Docker at eBay," presentation slides, https://speakerdeck.com/teddziuba/docker-at-ebay, accessed July 2014.

PwC: What should an enterprise that’s starting to look at Docker think about before really committing?

BG: We’re encouraging people to start introducing Docker as part of the overall workflow, from development to test and then to production. For example, eBay has been using Docker for quite some time. The company previously took weeks to go from development to production. A team would start work on the developer’s laptop, move it to staging or test, and it would break and they weren’t sure why. And as they moved it from test or staging, it would break again and they wouldn’t know why.6

Then you get to production with Docker. The entire runtime environment is defined in the container. The developer pushes a button and commits code to the source repository, the container gets built and goes through test automatically, and 90 percent of the time the app goes into production. That whole process takes minutes rather than weeks.
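A pipeline of that shape might look like the following sketch; the registry, image name, and commit variable are hypothetical, and the test command assumes a test runner is baked into the image. The point is that the artifact that passes the tests is the exact artifact that ships.

    # Run by the CI system on every commit (illustrative names throughout).
    docker build -t registry.example.com/myorg/app:$GIT_COMMIT .

    # Run the test suite inside the image that was just built.
    docker run --rm registry.example.com/myorg/app:$GIT_COMMIT python -m pytest

    # If the tests pass, push the same image for deployment to production.
    docker push registry.example.com/myorg/app:$GIT_COMMIT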

During the 10 percent of the time when this approach doesn’t work, it’s really clear what went wrong, whether the problem was inside the container and the developer did something wrong or the problem was outside the container and ops did something wrong.

Some people want to really crank up efficiency and performance and use Docker on bare metal. Others use Docker inside of a VM, which works perfectly well.

“What people want is the ability to choose any stack and run it on any platform.”


For those just starting out, I’d recommend they do a proof of concept this year and move to production early next year. According to our road map, we’re estimating that by the second half of 2015, they can begin to use the control tools to really understand what’s running where, set policies about deployment, and set rules about who has the right to deploy.

PwC: The standard operating model with VMs today commingles app and infrastructure management. How do you continue to take advantage of those management tools, at least for now?

BG: If you use Docker with a VM for the host rather than bare metal, you can continue to use those tools. And in the modified VM scenario, rather than having 1,000 applications equal 1,000 VMs, you have 10 VMs, each of which would be running 100 containers.

PwC: What about management tools for Docker that could supplant the VM-based management tools?

BG: We have good tools now, but they’re certainly nowhere as mature as the VM toolset.

PwC: What plans are there to move Docker beyond Linux to Windows, Solaris, or other operating systems?

BG: This year we’re focused on Linux, but we’ve already given ourselves the ability to use different container formats within Linux, including LXC, libvirt, and libcontainer. People who are already in the community are working on having Docker manage Solaris zones and jails. We don’t see any huge technical reasons why Docker for Solaris can’t happen.

Going to Windows is a little bit trickier because the primitives aren’t as well defined, but there’s no rocket science involved. It’s just hard work that we likely won’t get to until the second half of 2015.7

PwC: Docker gets an ecstatic response from developers, but the response from operations people is more lukewarm. A number of those we’ve spoken with say it’s a very interesting technology, but they already have Puppet running in the VMs. Some don’t really see the benefit. What would you say to these folks?

BG: I would say there are lots of folks who disagree with them who actually use it in production.

Folks in ops are more conservative than developers for good reason. But people will get much greater density and a significant reduction in the amount they’re spending on server virtualization licenses and hardware.

One other factor even more compelling is that Docker enables developers to deliver what they create in a standardized form. While admin types might hope that developers embrace Chef and Puppet, developers rarely do. You can combine Docker with tools such as Chef and Puppet that the ops folks like and often get the best of both worlds.

PwC: What about security?

BG: People voice concerns about security just because they think containers are new. They’re actually not new. The base container technology has been used at massive scale by companies such as Google for several years.

7 Abbreviations are as follows: LXC: Linux Container

"Going to Windows is a little bit trickier because the primitives aren't as well defined, but there's no rocket science involved."


The security model becomes much better with containers. Most organizations face hundreds of thousands of vulnerabilities that they know about but have very little ability to address. In the VM-based world where every application has its own VM, every application has its own guest OS, which is a slightly different version. These different versions are difficult to patch.

In a container-based world, it’s easier to standardize the OS across all hosts. If there’s an OS-level vulnerability, there’s one patch that just needs to be redeployed across all hosts.

Containerized apps are also much easier to update. If there’s an application vulnerability, you can push changes to 1,000 running containers without taking any of them down, without restarting an OS, without rebuilding a VM.
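One way to picture that patch-once model is the rough sketch below; the image tags and host names are hypothetical. The application image is rebuilt against the patched base, and running containers are then replaced host by host.

    # Pull the patched base layer and rebuild the application image on top of it.
    docker pull ubuntu:14.04
    docker build -t myorg/app:1.0.1 .

    # Roll the rebuilt image out across hosts, replacing the old containers.
    for host in web-01 web-02 web-03; do
      ssh "$host" "docker pull myorg/app:1.0.1 && \
                   docker stop app && docker rm app && \
                   docker run -d --name app -p 8080:8080 myorg/app:1.0.1"
    done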

Once the ops folks begin to understand better what Docker really does, they can get a lot more excited.

PwC: Could you extend a governance model along these same lines?

BG: Absolutely. Generally, when developers build with containers, they start with base images. Having a trusted library to start with is a really good approach. Creating containers these days is directly from source, which in essence means you can put a set of instructions in a source code repository. As this very mature source code is used, the changes to source code that get committed essentially translate automatically into an updated container.

What we’re adding to that is what we call provenance. That’s the ability to digitally sign every container so you know where it came from, all the way back to the source. That’s a much more comprehensive security and governance model than trying to control what different black boxes are doing.

PwC: What’s the outlook for the distributed services model generally?

BG: I won’t claim that we can change the laws of physics. A terabyte of data doesn’t move easily across narrow pipes. But if applications and databases can be moved rapidly, and if they consistently define where they look for data, then the things that should be flexible can be. For example, if you want the data resident in two different data centers, that could be a lot of data. Either you could arrange it so the data eventually become consistent or you could set up continuous replication of data from one location to the other using something like CDP.8

I think either of those models work.

8 Abbreviations are as follows: CDP: continuous data protection

Technology Forecast: Rethinking integration
Issue 1, 2014

What do businesses need to know about emerging integration approaches?
Sam Ramji of Apigee views the technologies of the integration fabric through a strategy lens.

Interview conducted by the Technology Forecast team

Sam Ramji is vice president of strategy at Apigee.

PwC: We’ve been looking at three emerging technologies: data lakes, microservices, and Docker containers. Each has a different impact at a different layer of the integration fabric. What do you think they have in common?1

SR: What has happened here has been the rightsizing of all the components. IT providers previously built things assuming that compute, storage, and networking capacity were scarce. Now they’re abundant. But even when they became abundant, end users didn’t have tools that were the right size.

Containers have rightsized computing, and Hadoop has rightsized storage. With HDFS or Cassandra or a NoSQL database, companies can process enormous amounts of data very easily. And HTTP-based, bindable endpoints that can talk to any compute source have rightsized the network. So between Docker containers for compute, data lakes for storage, and APIs for networking, these pieces are finally small enough that they all fit very nicely, cleanly, and perfectly together at the same time.2

1 For more background on data lakes, microservices, and Docker containers, see “Rethinking integration: Emerging patterns from cloud computing leaders,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml.

2 Abbreviations are as follows: HDFS: Hadoop Distributed File System; NoSQL: Not only SQL; API: application programming interface



PwC: To take advantage of the three integration-related trends that are emerging at the same time, what do executives generally need to be cautious about?

SR: One risk is how you pitch this new integration approach politically. It may not be possible to convert the people who are currently working on integration. For those who are a level or two above the fray, one of the top messages is to make the hard decisions and say, “Yes, we’re moving our capital investments from solving all problems the old ways to getting ready for new problems.” It’s difficult.

PwC: How do you justify such a major change when presenting the pitch?

SR: The incremental cost in time and money for new apps is too high—companies must make a major shift. Building a new app can take nine months. Meanwhile, marketing departments want their companies to build three new apps per quarter. That will consume a particular amount of money and will require a certain amount of planning time. And then by the time the app ships nine months later, it needs a new feature because there’s a new service like Pinterest that didn’t exist earlier and now must be tied in.

“This new approach to integration could enable companies to develop and deliver apps in three months.”

PwC: Five years ago, nine months would have been impossibly fast. Now it’s impossibly slow.

SR: And that means the related business processes are too slow and complicated, and costs are way too high. Companies spend between $300,000 and $700,000 in roughly six to seven months to implement a new partner integration. That high cost is becoming prohibitive in today’s value-network-based world, where companies constantly try to add new nodes to their value network and maybe prune others.

Let’s say there’s a new digital pure play, and your company absolutely must be integrated with it. Or perhaps you must come up with some new joint offer to a customer segment you’re trying to target. You can’t possibly afford to be in business if you rely on those old approaches because now you need to get 10 new partnerships per order. And you certainly can’t do that at a cost of $500,000 per partner.

This new approach to integration could enable companies to develop and deliver apps in three months. Partner integration would be complete in two months for $50,000, not $500,000.


PwC: How about CIOs in particular? How should they pitch the need for change to their departments?

SR: Web scale is essential now. ESBs absolutely do not scale to meet mobile demands. They cannot support three fundamental components of mobile and Internet access.

In general, ESBs were built for different purposes. A fully loaded ESB that’s performing really well will typically cost an organization millions of dollars to run about 50 TPS. The average back end that’s processing mobile transactions must run closer to 1,000 TPS. Today’s transaction volumes require systems to run at web scale. They will crush an ESB.

The second issue is that ESBs are not built for identity. ESBs generally perform system-to-system identity. They’re handling a maximum of 10,000 different identities in the enterprise, and those identities are organizations or systems—not individual end users, which is crucial for the new IT. If companies don’t have a user’s identity, they’ll have a lot of other issues around user profiling or behavior. They’ll have user amnesia and problems with audits or analytics.

The third issue is the ability to handle the security handoff between external devices that are built in tools and languages such as JavaScript and to bridge those devices into ESB native security.

ESB is just not a good fit when organizations need scale, identity, and security.3

PwC: IT may still be focused on core systems where things haven’t really changed a lot.

SR: Yes, but that’s not where the growth is. We’re seeing a ton of new growth in these edge systems, specifically from mobile. There are app-centric uses that require new infrastructure, and they’re distinct from what I call plain old integration.

PwC: How about business unit managers and their pitches to the workforce? People in the business units may wonder why there’s such a preoccupation with going digital.

SR: When users say digital, what we really mean is digital data that’s ubiquitous and consumable and computable by pretty much any device now.

The industry previously built everything for a billion PCs. These PCs were available only when people chose to walk to their desks. Now people typically have three or more devices, many of which are mobile. They spend more time computing. It’s not situated computing where they get stuck at a desk. It’s wherever they happen to be. So the volume of interactions has gone up, and the number of participants has gone up.

About 3 billion people work at computing devices in some way, and the volume of interactions has gone up many times. The shift to digital interactions has been basically an order of magnitude greater than what we supported earlier, even for web-based computing through desktop computers.

“We’re seeing a ton of new growth in these edge systems, specifically from mobile. There are app-centric uses that require new infrastructure, and they’re distinct from what I call plain old integration.”

3 Abbreviations are as follows: ESB: enterprise service bus; TPS: transactions per second


PwC: The continuous delivery mentality of DevOps has had an impact, too. If the process is in software, the expectations are that you should be able to turn on a dime.4

SR: Consumer expectations about services are based on what they’ve seen from large-scale services such as Facebook and Google that operate in continuous delivery mode. They can scale up whenever they need to. Availability is as important as variability.

Catastrophic successes consistently occur in global corporations. The service gets launched, and all of a sudden they have 100,000 users. That’s fantastic. Then they have 200,000 users, which is still fantastic. Then they reach 300,000. Crunch. That’s when companies realize that moving around boxes to try to scale up doesn’t work anymore. They start learning from web companies how to scale.

PwC: The demand is for fluidity and availability, but also variability. The load is highly variable.

SR: Yes. In highly mobile computing, the demand patterns for digital interactions are extremely spiky and unpredictable.

None of these ideas is new. Eleven years ago when I was working for Adam Bosworth at BEA Systems, he wrote a paper about the autonomic model of computing in which he anticipated natural connectedness and smaller services. We thought web services would take us there. We were wrong about that as a technology, but we were right about the direction.

We lacked the ability to get people to understand how to do it. People were building services that were too big, and we didn’t realize why the web services stack was still too bulky to be consumed and easily adopted by a lot of people. It wasn’t the right size before, but now it’s shrunk down to the right size. I think that’s the big difference here.

4 DevOps refers to a closer collaboration between developers and operations people that becomes necessary for a more continuous flow of changes to an operational code base, also known as continuous delivery. Thus, Dev + Ops = DevOps. For more on continuous delivery and DevOps, see "DevOps: Solving the engineering productivity challenge," PwC Technology Forecast 2013, Issue 2, http://www.pwc.com/us/en/technology-forecast/2013/issue2/index.jhtml.


Technology Forecast: Rethinking integration
Issue 1, 2014

Zero-integration technologies and their role in transformation

By Bo Parker

The key to integration success is reducing the need for integration in the first place.


Social, mobile, analytics, cloud (SMAC for short) have set demanding new expectations for what a high-performing IT organization delivers to the enterprise. Yet these same technologies can be saviors if IT figures out how to embrace them. As PwC states in "Reinventing Information Technology in the Digital Enterprise":

Business volatility, innovation, globalization and fierce competition are forcing business leaders to review all aspects of their businesses. High on the agenda: Transforming the IT organization to meet the needs of businesses today. Successful IT organizations of the future will be those that evaluate new technologies with a discerning eye and cherry pick those that will help solve the organization’s most important business problems. This shift requires change far greater than technology alone. It requires a new mindset and a strong focus on collaboration, innovation and “outside-in” thinking with a customer-centric point of view.1

The shift starts with rethinking the purpose and function of IT while building on its core historical role of delivering and maintaining stable, rock-solid transaction engines. Rapidly changing business needs are pushing enterprises to adopt a digital operating model. This move reaches beyond the back-office and front-office technology. Every customer, distributor, supplier, investor, partner, employee, contractor, and especially any software agents substituting for those conventional roles now expects a digital relationship. Such a relationship entails more than converting paper to web screens. Digital relationships are highly personalized, analytics-driven interactions that are absolutely reliable, that deliver surprise and delight, and that evolve on the basis of previous learnings. Making digital relationships possible is a huge challenge, but falling short will have severe consequences for every enterprise unable to make a transition to a digital operating model.

How to proceed? Successfully adopting a digital operating model requires what PwC calls a New IT Platform. This innovative platform aligns IT’s capabilities to the dynamic needs of the business and empowers the entire organization with technology. Empowerment is an important focus. That’s because a digital operating model won’t be something IT builds from the center out. It won’t be something central IT builds much of at all. Instead, building out the digital operating model—whether that involves mobile apps, software as a service, or business units developing digital value propositions on third-party infrastructure as a service or on internal private clouds—will happen closest to the relevant part of the ecosystem.

What defines a New IT Platform? The illustration highlights the key ingredients.

1 “Reinventing Information Technology in the Digital Enterprise,” PwC, December 2013, http://www.pwc.com/us/en/increasing-it-effectiveness/publications/new-it-platform.jhtml.

PwC's New IT Platform

[Figure: The New IT Platform encompasses transformation across the organization. Broker of Services (the mandate) + Assemble-to-Order (the process) + Integration Fabric (the architecture) + Professional Services Structure (the organization) + Empowering Governance (the governance) = New IT Platform.]

Issue overview: Rethinking integration
This article summarizes three topics also covered individually in the Rethinking Integration issue of the PwC Technology Forecast (http://www.pwc.com/us/en/technology-forecast/2014/issue1/index.jhtml). The integration fabric is a central component of PwC's New IT Platform. (See http://www.pwc.com/us/en/increasing-it-effectiveness/new-it-platform.jhtml for more information.)


The New IT Platform emphasizes consulting, guiding, brokering, and using existing technology to assemble digital assets rather than build from scratch. A major technology challenge that remains—and one that central IT is uniquely suited to address—is to establish an architecture that facilitates the integration of an empowered, decentralized enterprise technology landscape. PwC calls it the new integration fabric. Like the threads that combine to create a multicolored woven blanket, a variety of new integration tools and methods will combine to meet a variety of challenges. And like a fabric, these emerging tools and methods rely on each other to weave in innovations, new business partners, and new operating models.

The common denominator of these new integration tools and methods is time: The time it takes to use new data and discover new insights from old data. The time it takes to modify a business process supported by software. The time it takes to promote new code into production. The time it takes to scale up infrastructure to support the overnight success of a new mobile app.

The bigger the denominator (time), the bigger the numerator (expected business value) must be before a business will take a chance on a new innovation, a new service, or an improved process. Every new integration approach tries to reduce integration time to as close to zero as possible.

Given the current state of systems integration, getting to zero might seem like a pipe dream. In fact, most of the key ideas behind zero-integration technologies aren’t coming from traditional systems integrators or legacy technologies. They are coming from web-scale companies facing critical problems for which new approaches had to be invented. The great news is that these inventions are often available as open source, and a number of service providers support them.

How to reach zero integration?What has driven web-scale companies to push toward zero-integration technologies? These companies operate in ecosystems that innovate in web time. Every web-scale company is conceivably one startup away from oblivion. As a result, today’s smart engineers provide a new project deliverable in addition to working code. They deliver IT that is change-forward friendly.

Above all, change-forward friendly means that doing something new and different is just as easy four years and 10 million users into a project as it was six months and 1,000 users into the project. It’s all about how doing something new integrates with the old.

More specifically, change-forward-friendly data integration is about data lakes.2 All data is in the lake, schema are created on read, metadata generation is collaborative, and data definitions are flexible rather than singular definitions fit for a business purpose. Change-forward-friendly data integration means no time is wasted getting agreement across the enterprise about what means what. Just do it.
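A minimal illustration of schema-on-read, written here in Python with hypothetical paths and field names: raw records land in the lake untouched, and each analysis projects only the fields it needs at the moment it reads them.

    import json
    from pathlib import Path

    # Raw events were dumped into the lake as newline-delimited JSON, unmodified.
    LAKE_DIR = Path("/data/lake/raw/web_events")  # illustrative location

    def read_events(fields):
        """Apply a schema at read time: keep only the fields this analysis needs."""
        for part in sorted(LAKE_DIR.glob("*.json")):
            for line in part.open():
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                # Missing fields become None instead of blocking ingestion up front.
                yield {name: record.get(name) for name in fields}

    # Two analyses can impose two different schemas on the same raw data.
    sessions = read_events(["user_id", "timestamp", "page"])
    campaigns = read_events(["user_id", "utm_source", "revenue"])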

Change-forward-friendly application and services integration is about microservices frameworks and principles.3 It uses small, single-purpose code modules, relaxed approaches to many versions of the same service, and event loop messaging. It relies on organizational designs that acknowledge Conway’s law, which says the code architecture reflects the IT organization architecture. In other words, when staffing large code efforts, IT should organize people into small teams of business-meaningful neighborhoods to minimize the cognitive load associated with working together. Just code it.
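As a sketch of how small such a service can be (the endpoint, port, and data here are hypothetical), the following Python module does exactly one thing: it answers an HTTP request for a price quote, so a small team can own, version, and replace it without touching anything else.

    # A single-purpose microservice: one endpoint, no shared state, trivially replaceable.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PRICES = {"basic": 10.0, "pro": 25.0}  # data owned by this service alone

    class QuoteHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # For example, GET /quote/pro returns {"plan": "pro", "price": 25.0}.
            plan = self.path.rsplit("/", 1)[-1]
            body = json.dumps({"plan": plan, "price": PRICES.get(plan)}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), QuoteHandler).serve_forever()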

2 See the article “The enterprise data lake: Better integration and deeper analytics,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml.

3 See the article “Microservices: The resurgence of SOA principles and an alternative to the monolith,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/microservices.jhtml.


Change-forward-friendly infrastructure integration is about container frameworks, especially Docker.4 The speed required by data science innovators using data lakes and by ecosystem innovators using microservices frameworks will demand infrastructure that is broadly consistent with zero-integration principles. That means rethinking the IT stack and the roles of the operating system, hypervisors, and automation tools such as Chef and Puppet. Such an infrastructure also means rethinking operations and managing by chaos principles, where failures are expected, their impacts are isolated, and restarts are instantaneous. Just run it.
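What "just run it" can look like for a service of the kind sketched earlier is equally small; the base image and file names here are hypothetical. The container definition pins the runtime and the code, and the resulting image runs unchanged on a laptop, a test cluster, or any cloud that offers a Docker engine.

    # An illustrative container definition for a small, single-purpose service.
    FROM python:3.4
    COPY quote_service.py /srv/quote_service.py
    EXPOSE 8080
    CMD ["python", "/srv/quote_service.py"]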

This is the story of the new integration fabric. Read more about data lakes, microservices, and containers in the articles in the PwC Technology Forecast 2014, Issue 1. But always recall what Ronald Reagan once said about government, rephrased here in the context of technology: “Integration is not the solution to our problem, integration is the problem.” Change-forward-friendly integration means doing whatever it takes to bring time to integration to zero.


4 See the article “Containers are redefining application-infrastructure integration,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/issue1/features/open-source-application-deployment-containers.jhtml.


Acknowledgments

Advisory
US Technology Consulting Leader: Gerard Verweij
Chief Technologist: Chris Curran
New IT Platform Leader: Michael Pearl
Strategic Marketing: Lock Nelson, Bruce Turner

US Thought Leadership
Partner: Rob Gittings

Center for Technology and Innovation
Managing Editor: Bo Parker
Editors: Vinod Baya, Alan Morrison
Contributors: Galen Gruman, Pini Resnik, Bill Roberts, Brian Stein
Editorial Advisor: Larry Marion
Copy Editor: Lea Anne Bantsari

US Creative Team
Infographics: Tatiana Pechenik, Chris Pak
Layout: Jyll Presley
Web Design: Jaime Dirr, Greg Smith

Reviewers
Rohit Antao, Phil Berman, Julien Furioli, Oliver Halter, Glen Hobbs, Henry Hwangbo, Rajesh Rajan, Hemant Ramachandra, Ritesh Ramesh, Zach Sachen

Special thanks
Eleni Manetas and Gabe Taylor, Mindshare PR; Wunan Li; Akshay Rao

Industry perspectives
During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives:

Darren Cunningham, Vice President of Marketing, SnapLogic
Michael Facemire, Principal Analyst, Forrester
Ben Golub, CEO, Docker, Inc.
Mike Lang, CEO, Revelytix
Ross Mason, Founder and Vice President of Product Strategy, MuleSoft
Sean Martin, CTO, Cambridge Semantics
John Pritchard, Director of Platform Services, Adobe Systems
Sam Ramji, Vice President of Strategy, Apigee
Richard Rodger, CTO, nearForm
Dale Sanders, Senior Vice President, Health Catalyst
Ted Schadler, Vice President and Principal Analyst, Forrester
Brett Shepherd, Director of Big Data Product Marketing, Splunk
Eric Simone, CEO, ClearBlade
Sravish Sridhar, Founder and CEO, Kinvey
Michael Topalovich, CTO, Delivered Innovation
Michael Voellinger, Managing Director, ClearBlade


Glossary

Data lake
A single, very large repository for less-structured data that doesn't require up-front modeling, a data lake can help resolve the nagging problem of accessibility and data integration.

Microservices architecture
Microservices architecture (MSA) breaks an application into very small components that perform discrete functions, and no more. The fine-grained, stateless, self-contained nature of microservices creates decoupling between different parts of a code base and is what makes them easy to update, replace, remove, or augment.

Linux containers and Docker

LinuX Containers (LXCs) allow different applications to share an operating system (OS) kernel, CPU, and RAM. Docker containers go further, adding layers of abstraction and deployment management features. Among the benefits of this new infrastructure technology, containers that have these capabilities reduce coding, deployment time, and OS licensing costs.

Zero integration

Every new integration approach tries to reduce integration time to as close to zero as possible. Zero integration means no time is wasted getting agreement across the enterprise about what means what.


To have a deeper conversation about this subject, please contact:

About PwC’s Technology Forecast

Published by PwC’s Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities.

Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today’s leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast.

About PwC

PwC US helps organizations and individuals create the value they’re looking for. We’re a member of the PwC network of firms in 157 countries with more than 195,000 people. We’re committed to delivering quality in assurance, tax and advisory services. Find out more and tell us what matters to you by visiting us at www.pwc.com.

Comments or requests? Please visit www.pwc.com/techforecast or send e-mail to [email protected].

© 2014 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.

MW-15-0186

Gerard Verweij
Principal and US Technology Consulting Leader
+1 (617) 530 
[email protected]

Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]

Michael Pearl
Principal
New IT Platform Leader
+1 (408) 817 3801
[email protected]

Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
[email protected]

Alan Morrison
Technology Forecast Issue Editor and Researcher
Center for Technology and Innovation
+1 (408) 817 
[email protected]