
BUILD A DEVOPS CULTURE OF COLLABORATION

Five Key Users of Modern Infrastructure Monitoring

www.signalfx.com @signalfx


What Is DevOps?

DevOps is a culture, movement, or practice that emphasizes the collaboration and communication of both software developers and other information technology professionals while automating the process of software delivery and infrastructure changes. It aims at establishing a culture and environment where building, testing, and releasing software can happen rapidly, frequently, and more reliably. (Source: Wikipedia.org)

[Figure: DevOps is a culture. Communication, collaboration, and integration between Dev and Ops; labels: relevance, quality, efficiency, speed, agility.]



What The Cloud Hath Wrought

The maturity of cloud infrastructure has transformed the enterprise landscape. Companies of all sizes are transitioning away from a cost structure largely based on expensive, on-premises hardware and proprietary software. The economics of ephemeral and elastic infrastructure managed by third parties like Amazon Web Services (AWS) have forced down the risk to tenants and offered cost benefits that carry over to the adoption of open-source technologies and an emphasis on speed to market and fast software iteration. As more applications move to the cloud, monitoring has shifted from the responsibility of a self-contained operations team with a narrow, production-only view to a capability delivered as a service throughout the product lifecycle. Modern teams focus on growth and drive development closer to production with an emphasis on agility and continuity. In turn, everyone takes part in monitoring to ensure that quality, scale, and resilience aren’t relegated to functional silos.



Monitoring Decentralized

To deliver against these requirements, developers and operations teams alike now need ongoing visibility into every layer of their architecture. They require the means to not just observe, but also experiment with and control an increasingly complex web of application and infrastructure dependencies. Too little or delayed insight into the component parts and aggregate performance of the systems in production could not only cost massive resources in an operations emergency, but could also lose customers and destroy brand equity.

As a result, monitoring has become an analytics problem and an opportunity for deeper, more relevant systems intelligence well beyond traditional IT’s static thresholds and noisy health checks.

One Step Ahead Of The Cloud

Few engineering orgs can spare the resources to design, implement, and tune a custom metric tracking system. Even fewer can maintain the pace of change in a dynamic, distributed environment. With no hardware to spin up, monitoring-as-a-service offers flexibility to any size company and no overhead or maintenance costs. Monitoring-as-a-service is:

• Like an EKG for the heartbeat of the entire system

• A dashboard to correlate cloud infrastructure, storage, network, applications, and user data

• A real-time view of the end-user experience

• An option on greater granularity, scale, and variety of metrics

• A specific call to action in the case of a problem

• The basis for collaboration, communication, and integration across the operations, infrastructure, and development organizations

[Figure: layers of the monitored stack: end user, network, application, infrastructure, storage]

Based on the insights SignalFx’s CTO gained from supporting massive growth at Facebook, there are at least five major users of monitoring-as-a-service. Each has a set of objectives and methods tied to core job responsibilities, but all relate to streamlining the product lifecycle for faster, more reliable delivery and putting in place the safety nets required to develop and push code into production more aggressively. Under a microservices or DevOps regime, monitoring becomes the major point of collaboration and the common language for shared objectives.


Five Key Users Of Modern Infrastructure Monitoring

• Site Reliability Engineer (SRE)

• Infrastructure Engineer

• Software Developer

• DevOps Engineer

• Product Manager

Site Reliability Engineer (SRE)

“[We look at aggregations alongside analytics] from a day ago, a week ago, or any time that makes sense…within seconds of the new raw data coming in. It lets us catch problems in real time.”

Weston Jossey, Head of Operations

To borrow a term from Google, site reliability engineers (SREs) are the experts responsible for ensuring the individual applications and infrastructure that underlie software all work at “planet scale.” They combine software and systems engineering skills to ensure the user’s experience is uninterrupted and at or above an expected level of performance. SREs focus on making everything within the environment fault-tolerant, fast, responsive, elastic, and scalable, a particularly meaningful challenge in a hosted-infrastructure cloud situation. An SRE is the most recognizable user of monitoring. Because they most often hold ultimate responsibility for emergency response and problem eradication, SREs’ ability to see and understand slowdowns when they happen and even prevent outages before they occur could spare colleagues weeks of frustration and save the company massive revenue losses.

For SREs, the most important features of monitoring are dimensionality and just-in-time notifications. Dimensions allow a user to better understand specific scenarios in the data for a more actionable and timely view of the system down to the host, device, or other unit of relevance. Each dimension a team member adds to metadata delivers greater definition and specificity about the architecture, enabling not just efficient search, aggregation, and comparison across sources, but also more high-value use cases like active pattern recognition, more fine-grained anomaly detection, and troubleshooting. Filtering a query against a set of defined dimensions lets the SRE monitor and respond to a potential problem in real time, rather than having to build a model reliant on query language. To be most effective, SREs need to be able to group and aggregate data by the properties they choose, and always as the data is streaming in real time, not after an issue has emerged and is already causing problems.
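To make the idea of filtering and grouping by dimensions concrete, here is a minimal Python sketch. It is an illustration only, not SignalFx's data model or API: the datapoint layout, the dimension names (`host`, `zone`), and the helpers `filter_points` and `group_mean` are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical latency datapoints (milliseconds), each tagged with dimensions.
DATAPOINTS = [
    {"value": 120.0, "dims": {"host": "web-1", "zone": "us-east"}},
    {"value": 80.0, "dims": {"host": "web-2", "zone": "us-east"}},
    {"value": 50.0, "dims": {"host": "web-3", "zone": "us-west"}},
    {"value": 100.0, "dims": {"host": "web-1", "zone": "us-east"}},
]


def filter_points(points, **wanted):
    """Keep datapoints whose dimensions match every key=value in `wanted`."""
    return [
        p for p in points
        if all(p["dims"].get(key) == val for key, val in wanted.items())
    ]


def group_mean(points, by):
    """Average the values of `points`, grouped by the dimension named `by`."""
    groups = defaultdict(list)
    for p in points:
        groups[p["dims"].get(by, "unknown")].append(p["value"])
    return {key: mean(vals) for key, vals in groups.items()}


if __name__ == "__main__":
    # Mean latency per zone, then per host within one zone.
    print(group_mean(DATAPOINTS, by="zone"))
    print(group_mean(filter_points(DATAPOINTS, zone="us-east"), by="host"))
```

Adding a dimension (for example a customer tier) would only mean adding another key to `dims`; the same two helpers would then slice the data along it without any schema change.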

Moreover, multidimensionality powers an analytic approach to the metrics flowing into the charts that make up a monitoring dashboard. SREs use dimensions to filter or group the data into the most meaningful view to understand what’s actually happening throughout the environment or in any component of the architecture. With the right level of control, the SRE can define sophisticated, dynamic alert thresholds based on duration or percentile and benchmark against historic patterns like growth rate or variance.

Historically, monitoring systems that set simple, static, arbitrary thresholds against the uptime of individual nodes would generate noisy alerts that were not particularly actionable. For highly distributed, cloud-based environments, SREs need to be able to define alert parameters against a system-specific service level, using streaming analytics to detect severity in real time. With a modern monitoring system, the SRE’s role becomes less focused on finding signal among the noise and more focused on troubleshooting and corrective action that actually affects SLAs.
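One way to picture a duration-based alert rule, as opposed to a static one-shot threshold, is the following Python sketch. It is a simplified illustration under stated assumptions: the function name `sustained_breach`, the `(timestamp, value)` sample format, and the numbers in the usage example are hypothetical and are not taken from any particular monitoring product.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return the timestamp at which `threshold` has been exceeded
    continuously for at least `min_duration_s` seconds, or None.

    `samples` is an iterable of (timestamp_seconds, value) pairs in
    time order (an assumed format for this sketch).
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= min_duration_s:
                return ts  # breach has lasted long enough to alert
        else:
            breach_start = None  # any recovery resets the clock
    return None


if __name__ == "__main__":
    # A brief spike (10s-20s) is ignored; the later sustained breach fires.
    latency = [(0, 50), (10, 120), (20, 130), (30, 40),
               (40, 110), (50, 115), (60, 120), (70, 125)]
    print(sustained_breach(latency, threshold=100, min_duration_s=30))
```

Because a momentary spike resets the clock as soon as the value recovers, a rule like this pages only on problems that persist, which is the difference between a noisy alert and an actionable one.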

SRE Priorities For Monitoring

• Highly relevant alerts based on unlimited dimensions

• Just-in-time notifications on trends and rates of change


For Example…

It’s not all that useful to get a notification that one of 20 nodes failed a health check in the middle of the night, since most architectures are provisioned to accommodate a spike in demand, even with five percent fewer resources. On the other hand, it’s much more useful for the SRE to get an alert when latency reaches a 98% threshold on a critical search workload. The notification that comes through PagerDuty lets the SRE know that the threshold was breached for a minimum duration of 90 consecutive seconds and is affecting Platinum SLA customers in the U.S. East availability zone.

Even better, a solution that provides conditions to understand and qualify the likelihood that a given change falls outside of normal operating parameters for a specific service or system is the most effective way to isolate the signal that a resource is on its way to being overtaxed. Unlike an alert on a node dropping, which doesn’t have a clear call to action in a cloud environment, or the latency alert, which is often a lagging indicator of an issue, the alert on an ongoing system rate-of-change gives the SRE an opportunity to address a trend before customers can experience the symptoms of a problem.
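A rate-of-change check of this kind might be sketched as follows in Python. This is an assumption-laden illustration rather than a description of any product's detector: the helper names `rates` and `is_abnormal_rate`, the choice of "more than k standard deviations from the historical mean" as the definition of abnormal, and the default `k=3.0` are all choices made for this example.

```python
from statistics import mean, stdev


def rates(samples):
    """Per-second rate of change between consecutive (timestamp, value)
    samples. Assumes timestamps are strictly increasing."""
    return [
        (v1 - v0) / (t1 - t0)
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


def is_abnormal_rate(history, latest_rate, k=3.0):
    """True if `latest_rate` lies more than `k` standard deviations from
    the mean of `history` (needs at least two historical rates)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest_rate != mu
    return abs(latest_rate - mu) > k * sigma


if __name__ == "__main__":
    # Hypothetical history of queue-depth growth rates (items per second).
    normal_rates = [0.10, 0.12, 0.09, 0.11, 0.10]
    print(is_abnormal_rate(normal_rates, 0.11))  # within the usual band
    print(is_abnormal_rate(normal_rates, 0.50))  # growing far faster than usual
```

The point of the sketch is that the alert condition is defined relative to the system's own recent behavior, so it can fire while absolute values such as latency are still within their limits.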


Software Developer

“If you’re running a service and you’re not able to see metrics from every part of your deployment lifecycle, you’re missing lots of opportunity.”

John Rousseau, Operations Team Lead

Agile methodology has changed not only the tools, goals, and values of the product team, but also the developer’s process and reliance on metrics. Having real-time visibility into the behavior and responsiveness of your application is completely indispensable when you are simultaneously pushing code and making decisions about the durability and impact of the project. The shift to continuous delivery relies on much closer integration between programming and monitoring to inform both product and production and ensure success. Today’s developers build applications that are typically made up of tens, hundreds, or even thousands of different components, deployed as microservices in lightweight containers or virtualized environments. The increased complexity and velocity of modern software engineering is much different from the monolithic applications historically deployed in a single machine and updated on a discrete schedule with a long testing period and slow feedback loop.

Making modern application development and deployment even more flexible but complicated is the widespread reliance on open-source technologies. Open-source software gives developers more flexibility and customizability in the components of their applications, higher degrees of interoperability when it comes to an architecture that will change drastically with new software versions and innovation, and openness to visibility, audit, and replacement. Understanding the behaviors of applications like MySQL, Redis, Kafka, Elasticsearch, Apache, and Cassandra requires more than simple health checks on the nodes running in production. As developers introduce changes across multiple services, and the demands on the system change as customers’ behaviors shift in response to features, the developer who wrote the code must be able to see how broad functionality is affected by each component. What might go wrong? Where? When? Why?

For software developers, the most important features of monitoring are real-time visualization and out-of-the-box analytic dashboards. In order to see the effects of any change, a developer needs to be able to easily select which metrics, dimensions, and time range are meaningful for the particular use case, have the necessary aggregations and computations automatically performed on the data as it is generated from the environment, and see the resulting time series as an interactive chart that can be graphed against historic data to clearly see the change. With an interactive view of the data, the developer can both drill down to specific release versions or microservices as well as correlate events against unusual data patterns.
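As a rough illustration of graphing a time series against historic data, the Python sketch below computes the percent change of each current value against the value at the same offset in an earlier window (for example, the same hour one week ago). The function name `percent_change_vs_history` and the sample numbers are hypothetical; a real monitoring service would perform this kind of computation on the stream itself.

```python
def percent_change_vs_history(current, historic):
    """Pair each current value with the historic value at the same offset
    and return the percent change, or None where the historic value is 0."""
    changes = []
    for now, then in zip(current, historic):
        if then == 0:
            changes.append(None)  # percent change is undefined against zero
        else:
            changes.append(100.0 * (now - then) / then)
    return changes


if __name__ == "__main__":
    # Hypothetical requests-per-second after a release vs. one week earlier.
    this_week = [110, 90, 100]
    last_week = [100, 100, 0]
    print(percent_change_vs_history(this_week, last_week))
```

Plotting the resulting series next to release events would show at a glance whether a change in behavior lines up with a particular deploy.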

The faster a developer can see what’s going on in the environment, find the right view for a specific set of objectives, and ask additional questions, the more likely it is that the application can be improved to address the problem. This is an essential set of capabilities to support a continuous development approach, and it relies on an analytic monitoring solution that is intuitive, responsive in real time, and free of burdensome query language. The applications that the development team cares about should be integrated with the monitoring service and, preferably, come with preconfigured, production-ready plugins. That way, the desired metrics can not only be streamed without additional instrumentation, but also explored in pre-built dashboards so that developers can start customizing an analytic pipeline to their needs without delay.

Developer Priorities For Monitoring

• Visualization of streaming and historic data side-by-side

• Instant insight into important data in pre-built dashboards


Overcoming Risk

A continuous integration or continuous delivery motion introduces much more risk into the overall state of the architecture and greater probability that workloads will break down. Self-service monitoring is the first step towards not only reversibility as a safety net for problematic code or containers, but also predictability as more operational intelligence is gathered from past development cycles and applied prior to future pushes.

Infrastructure Engineer

“We saved several engineer-years worth of development work. [Building a monitoring platform] is not the core of our business, and that’s not where we want to spend development time.”

Sam Eaton, Director of Engineering Operations

Monitoring has historically been a huge challenge for infrastructure engineers, many of whom were not only responsible for supporting growth of the systems, but also building the monitoring tool itself. However, few engineering organizations can spare the time and resources to design, vet, implement, and tune a custom metric tracking system. Even fewer can keep up with the pace of change usually seen in today’s dynamic and distributed environments. And practically none can afford to invest in the features that are inevitably requested by the consumers of these systems. A modern monitoring-as-a-service solution is deployed in the cloud and scales to accommodate the type, volume, and resolution of metrics for any operations use case. An effective monitoring solution makes it easy for each user to get a custom view without writing code and is completely governed and administrated. The infrastructure engineer has no hardware to spin up, reducing overhead and improving time to insight for everyone.



More importantly, by eliminating on-premises hardware dependency, the entire infrastructure and operations teams spend less time maintaining the metrics system and have more time to actually monitor and scale the application infrastructure itself.

Effective monitoring-as-a-service also prioritizes administrative tools to manage users and organizations, configure integrations, enable single sign-on through existing tools, get full transparency on data ingest rates and billing, and individually whitelist the service.

With analytics-driven monitoring, infrastructure engineers can own growth as the organization scales. Rather than relying on a gut feeling or external benchmarks, capacity requirements can be modeled on actual usage data for better control and efficiency of investment. For the first time, the infrastructure cost implications of software updates can not only be projected but also correlated to actual behaviors in real time. And, rather than running into slowdowns or failures as demand spikes, the infrastructure engineer can set proactive alerts to provision the necessary capacity. Effective monitoring is key to capturing the full value of cloud economics.
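Modeling capacity on actual usage data can be as simple as fitting a trend line to recent peaks and projecting when it crosses provisioned capacity. The Python sketch below uses an ordinary least-squares line for that purpose. It is a deliberately simple illustration: the helper names `fit_line` and `days_until_capacity`, the assumption of one peak value per day, and the assumption that growth is roughly linear are all made for this example.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_until_capacity(daily_peaks, capacity):
    """Days from the latest observation until the fitted trend reaches
    `capacity`, or None if usage is flat or declining."""
    slope, intercept = fit_line(daily_peaks)
    if slope <= 0:
        return None
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - (len(daily_peaks) - 1))


if __name__ == "__main__":
    # Hypothetical daily peak usage (e.g. GB of storage) over five days.
    peaks = [100, 110, 120, 130, 140]
    print(days_until_capacity(peaks, capacity=200))
```

A proactive alert could then be set on the projected number of days rather than on current usage, giving the infrastructure engineer lead time to provision before demand arrives.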

Infrastructure Engineer Priorities For Monitoring

• Maintenance-free service that scales with the use case

• Analytics to correlate investments with actual demand


Shortcomings Of Traditional Monitoring

Legacy approaches to infrastructure monitoring require time to send and load data, run queries, then visualize the outputs for a significantly delayed view of systems change. Even the simple task of observing trends in the data becomes less meaningful as latency goes up alongside volume.

Similarly, the complexity of applications running at scale makes alerting on the status of individual nodes a nightmare. Ephemeral cloud infrastructure is inherently unpredictable, and static alert thresholds easily triggered by nodes dropping, server hiccups, and momentary demand surges could cause the type of burnout that leads to attrition of both responsiveness and, ultimately, talent from your ops team. The last thing you want is for pagers to get turned off after one too many false alarms in the middle of the night. But can you afford for a real problem to go unaddressed among all that noise?

Beyond performance and scale concerns, traditional monitoring systems all require significant overhead and maintenance, particularly for on-premises data storage. When metrics aren’t showing up on time and data quality becomes an issue, do you allocate engineering resources to troubleshoot your monitoring system? If your engineers are working to fix monitoring, what happens to the application and infrastructure it was meant to be tracking?

Product Manager

“Whenever we ship new features, we make sure the performance and the operation of the site is smooth and fast. [Monitoring] helps us create a baseline and ensure we’re always improving.”

Andrew Denmark, Co-Founder and CTO

The product manager owns the roadmap for the development team, but has long relied on anecdotal evidence to understand the relationship between customer behavior and product changes. In some cases the product team would guess right about the impact of systems demand on the business (and vice versa), but historically operations data was kept in a silo, away from customer and market insights.

A monitoring solution that serves product managers goes beyond simply capturing and visualizing the metrics that describe the state of the application. It also enables ad hoc analytics to compare any characteristic of the system and its components against different dimensions and time horizons. Effective monitoring helps managers understand how changes to the architecture affect the customer experience and how operational issues affect SLAs. By correlating operations, customer, and business metrics, product priorities can be set against much more intelligent, relevant, and measurable objectives.



Product Manager Priorities For Monitoring

• Exploration to benchmark against performance metrics

• Correlation of data from system changes with user experience

DevOps Engineer

“Quite often it’s the individual engineer who wrote the code that knows best what metric we should watch. In the long run, it’s better if they instrument their own metrics.”

Florian Berckemeyer, Head of Operations

The DevOps engineer has become one of the most sought-after but hardest to define roles within the larger technical team. Where the skill set and background largely resemble those of an SRE (ability to code and script, experience with systems management, understanding of business objectives like scale and cost), DevOps also represents a new strategic perspective on integration across functional areas, openness of tools, and communication throughout the product lifecycle. A first-line objective for DevOps is to embrace open-source technologies. Whereas a traditional approach to operations would simply support or deal with non-proprietary software, DevOps makes a case for decisions that promote flexibility, efficiency, and scale. DevOps also promotes automation and agile methodologies so that quality and performance become priorities across the organization and can be tracked and iterated on in every process.

In order to be effective, a DevOps engineer requires monitoring that can be the basis for collaboration and that scales infinitely. A straightforward visual UI, a simple search interface for discovering useful feeds, high-contrast charts, slick dashboards, and integrations with all the business apps and technical tools already used throughout the organization help DevOps bring more users to the data. A full breadth of data sources determined by the users and not encumbered by a proprietary agent helps align adoption and growth objectives to a DevOps regime.

Similarly, a monitoring solution that does not limit data volume, resolution, and dimensionality is key to a DevOps engineer’s success. In order to drive scale, alerting rules that each user group sets to its custom data pipelines and thresholds have to survive changes to the infrastructure. Adaptability and growth are fundamental, so a custom view of the environment that not only persists but can take on more dimensions as systems scale makes monitoring the basis of DevOps engineering.

Ultimately, DevOps helps decentralize operations and distribute capabilities beyond their typical silos. This calls for a monitoring solution that is advanced enough for the power user and accessible enough to make operational analytics a core value across the organization.

DevOps Engineer Priorities For Monitoring

• Solution that serves the power user and the passive user equally

• Scalable cloud-based service that performs at every stage of growth



Making Operations A Value Center

DevOps is a culture shift targeted at removing the pain and fear from deployments and code pushes. While traditional methods of software development required longer quality and testing cycles than the programming itself, the alternative was usually service disruption, long hours of overtime, conflict between teams, unwelcome interference from executive management, negative values reinforcement, and, ultimately, burnout.

The DevOps emphasis on collaboration, communication, and integration between development and operations in the product team starts with transparency and cooperation. Process and people play core roles in this shift, but the transition begins with changes in tools and technology, of which effective infrastructure and application monitoring is the platform.

According to the Puppet Labs 2015 State of DevOps Report, an effective DevOps regime, including monitoring of systems trends and performance and prioritizing continuity of operational insight across the organization, typically results in:

• 30x more frequent deployments

• 200x shorter lead time

• 60x fewer failures

• 168x faster recovery


Start Building a DevOps Enablement Engine

What’s Your Monitoring Strategy?

Learn More About Modern Infrastructure Monitoring

Talk to us: https://calendly.com/sfx-customer-success/15min

Get Started with a Free 14-Day Trial

Free trial: https://signalfx.com/?signup=true

Check Out a Video Tutorial

Watch now: https://signalfx.com/get-started/signalfx-101/
