
Beamly Limited

Evolving operational maturity in a start-up environment
Adrian Spender, Head of Server Engineering

CTOs in London Meetup, Octopus Labs, 6th October 2016


About me
Software engineer with 18 years of JVM-based development experience
Now focused on engineering management
Four years at Beamly
Responsible for our operational support for the last two years

@aspender

https://linkedin.com/in/aspender


About Beamly

Oct 2011: zeebox launch – second screen mobile apps and website
Jan 2012: Sky investment
Sep 2012: US launch, with investment from Comcast/NBC, Viacom and HBO
Nov 2012: AU launch with Ten and Foxtel


We started out as a very tech-driven startup called zeebox in the second screen TV space, with an iOS, Android and web app. The company and all engineering have always been based in London. Investment after the UK launch came from Sky, quickly followed by a US launch with Comcast, NBC, Viacom and HBO and later an Australia launch with Network Ten and Foxtel.

People watch TV during a fairly limited prime-time of three hours in the evening, and we were soon in the position where we had these three hours in four main geographic regions (Sydney, UK, US East Coast, US West Coast) all being supported from a UK engineering team.


Our partners often promoted the app during their prime-time shows. This made our traffic patterns extremely peaky, with huge spikes on an otherwise pretty low baseline of traffic. Additionally, our monetisation mechanism was in-app advertising synchronised to TV. This led to the first couple of years including a lot of late nights hand-holding unstable technology and simply providing support in case of issues, as demanded by our investors.

The TV world is quite different from a tech startup environment. The aversion to risk was very high, which led to a lot of over-engineering: scaling and support for what might happen rather than what did happen. When things did go wrong, we were directly answerable to the TV companies.

One other aspect impacting our operational approach during this time was that, for a startup, we were extremely well funded. To an extent this led to a certain laziness, in that it was quicker to throw more AWS instances at a scaling problem than it was to engineer our way out of it.

About Beamly

Oct 2011: Launch – second screen mobile apps and website
Jan 2012: Sky investment
Sep 2012: US launch, with investment from Comcast/NBC, Viacom and HBO
Nov 2012: AU launch with Ten and Foxtel
PIVOT! TV-focussed social network
Apr 2014: Rebranded as Beamly
Aug 2014: Founders step down
PIVOT! Social content marketing and tooling
Feb 2015: Facebook spend tools
Apr 2015: 10m MAUs; produce original TV/celeb articles
Aug 2015: AU shut-down
Oct 2015: Acquisition by Coty Inc.
Nov 2015: EOL mobile apps
Aug 2016: Data Science
Now: PIVOT! In-house digital marketing and web agency – host Coty brand sites, run campaigns


We later changed strategy to become more of a 24/7 proposition around a TV-based social network, rebranding along the way as Beamly. As part of this we also hired an editorial team to write original article content around TV and celebrity news. The operational impact of this strategy was twofold: firstly, we greatly increased the number of services running in production as we aggressively built out social network, news aggregation and publishing functionality; secondly, it was a natural point at which we moved to a microservices-based approach.

Over the course of this time we also had an emerging strategy of gaining reach by promoting our article content via Facebook. Over time we built our own tooling to support that spend, and this had two effects. Firstly, we got very good at bringing users onto our platform, where the challenge then became retaining them. This led to our high watermark of around 10 million Monthly Active Users in April 2015. The second effect was that we started a pivot into social content marketing as our core competency and began taking on external clients.

Ultimately this led to our acquisition by Coty Inc. in September 2015. We are now their in-house digital and web agency. We host and build out brand web presence for over thirty leading Coty fragrance and beauty brands, as well as running social, display and video-based digital ad campaigns using a data-led approach.

This final pivot has introduced new operational considerations. We no longer run 95 microservices in production, but our estate is now much more heterogeneous and includes code that we have not written but that has been provided by third-party agencies. Additionally, outages are no longer primarily a reputational issue for us, but a revenue issue for our parent company (and other clients).

At the current time, the Beamly engineering team is made up of 20 engineers in London.

What do I mean by operational maturity?
How our ability to support our code running in production has changed over time, as a function of the following variables:

Product strategy
Customer base
Geography
Technical architecture and practices
Organisational structure and people

Let's focus on the last two


Product strategy has changed over time, but in general we adopt a dual-track agile approach of doing the minimum to discover the potential of an idea (user testing, low-cost tests, etc.) before we commit to deliver it. Delivery is done as an MVP that is then iterated and built upon based on data feedback. Whilst good from the perspective of finding what works, this approach can present some operational challenges in its tendency to leave services implemented to a "good enough" level when teams move on to the next thing. Good enough doesn't always cover non-functional considerations. Sometimes they are not quite good enough.

Our customer base has evolved dramatically: from the early days of mainly male, technology-oriented geeks and our investors, through a specific attempt to target female 18-25 year olds (including changing our name and branding), to our current position as an agency where our direct customers are our parent company and other brands, but indirectly we are again focused primarily on a female demographic. Operationally we've moved full circle: from outages causing us issues with our investors, through causing our own reputational damage, to now affecting revenue-generating activity for our clients.

Geography has always been a difficult issue in operational terms. We are London based but have had to support a global product in a 24/7 fashion for nearly all of our existence. This is still true now as part of a multi-national organisation. The main challenges here are in building an on-call and incident response process that gives us the coverage we require, but in a way that is fair to our team.

For this presentation, the effects on operations of how our technical architecture and practices, and our organisational structure and people, have evolved are the things I'll expand on.

Technical architecture and practices
We got some things right from very early on:

A DevOps culture of "you write it, you run it"
Testing
Continuous integration
A platform team whose focus is developer effectiveness, not operations
Service endpoints – https://github.com/beamly/se4
Runbooks
Monitoring and alerting


SE4
Common endpoints for every service, regardless of tech:
/service/status
/service/healthcheck/gtg
/service/healthcheck
/service/metrics
/service/config

Acts as single point of understanding about the runtime deployment of the service
Useful for problem determination
Useful for ELB/haproxy/any other healthcheck
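To make that concrete, here is a minimal sketch of the idea in Scala using only the JDK's built-in HTTP server. It is not an implementation of the actual SE4 spec (see the repository linked above for that); the port, service name and response fields are purely illustrative.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.lang.management.ManagementFactory
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets

object Se4StyleEndpoints {

  // Register a handler that returns the given status code and JSON body.
  private def register(server: HttpServer, path: String)(body: () => (Int, String)): Unit =
    server.createContext(path, new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        val (status, text) = body()
        val bytes = text.getBytes(StandardCharsets.UTF_8)
        exchange.getResponseHeaders.set("Content-Type", "application/json")
        exchange.sendResponseHeaders(status, bytes.length.toLong)
        exchange.getResponseBody.write(bytes)
        exchange.close()
      }
    })

  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)

    // /service/status: basic information about this deployment of the service.
    register(server, "/service/status") { () =>
      val uptimeMs = ManagementFactory.getRuntimeMXBean.getUptime
      (200, s"""{"artifact":"example-service","version":"1.0.0","uptime_ms":$uptimeMs}""")
    }

    // /service/healthcheck/gtg: "good to go" - the endpoint an ELB/haproxy healthcheck would poll.
    register(server, "/service/healthcheck/gtg") { () =>
      val dependenciesOk = true // replace with real checks (database, downstream services, ...)
      if (dependenciesOk) (200, """{"gtg":"OK"}""") else (503, """{"gtg":"FAIL"}""")
    }

    server.start()
    println("Serving /service/status and /service/healthcheck/gtg on port 8080")
  }
}
```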


SE4


Architectural evolution

Oct 2011 → Now: Monoliths → Microservices


We've followed a typical evolution from monolithic systems to microservices. An in-depth discussion of the pros and cons of various architectures is beyond the scope of this presentation, but we will discuss some of the operational impacts of such an approach.

Monoliths are actually pretty good in one regard of operational maturity. If you have a smaller number of codebases, by definition you have more people familiar with that code and more able to support it. You also have fewer moving parts.

Operational considerations of Microservices
"Foo is alerting" – what does that actually mean, what is the impact?
Architect for failure
Know and eliminate your SPoFs
Have good tooling to support problem determination:
Runbooks to describe service responsibilities and problem determination steps
Log aggregation
Metrics aggregation
Monitoring

Internet Scale Services Checklist – Adrian Colyer


The key operational aspect of a microservices architecture for us is to understand the actual impact to end users when a service is failing or unavailable. We historically monitor individual services but have found that there is more value in attempting to monitor functionality instead. It is more useful for a pager to alert that login is not working, than for one of the microservices tangentially involved in login to alert that it is failing.
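As a rough sketch of what monitoring the function rather than the services can look like, the following Scala snippet probes a hypothetical login endpoint end to end with synthetic credentials and fails loudly if the whole flow is broken, whichever microservice is at fault. The URL, payload and the way the result feeds into paging are assumptions for illustration, not our actual setup.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration

// Synthetic "login works" probe: exercise the user-facing flow end to end and
// alert on that result, rather than alerting on each microservice behind it.
object LoginProbe {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(5)).build()

    // Hypothetical endpoint and synthetic monitoring account.
    val request = HttpRequest.newBuilder(URI.create("https://api.example.com/login"))
      .timeout(Duration.ofSeconds(10))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString("""{"user":"synthetic-monitor","password":"not-a-real-secret"}"""))
      .build()

    val healthy =
      try client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode() == 200
      catch { case _: Exception => false }

    // In a real setup this result would be pushed to the alerting system;
    // here we simply exit non-zero so whatever schedules the probe can page on it.
    if (!healthy) { System.err.println("login probe FAILED"); sys.exit(1) }
    println("login probe OK")
  }
}
```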

This is particularly true if you do not spend the necessary effort to properly implement a microservices architecture. By that I mean you have made assumptions about the way that systems will behave in the face of failure. If you are doing things properly (and believe me, we did not in a lot of cases) then you will design for failure from the outset, aiming for graceful degradation rather than total collapse.
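A minimal sketch of that kind of graceful degradation in Scala, assuming a hypothetical recommendations call that we would rather time out and replace with a static fallback than allow to take the whole response down:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Success, Try}

object GracefulDegradation {
  // Hypothetical downstream call - in reality an HTTP request to another microservice.
  def fetchRecommendations(userId: String): Future[List[String]] =
    Future { List("show-1", "show-2") }

  // Static content we can always serve if the downstream service is slow or down.
  val fallback: List[String] = List("editorial-pick-1", "editorial-pick-2")

  def recommendationsOrFallback(userId: String): List[String] =
    Try(Await.result(fetchRecommendations(userId), 300.millis)) match {
      case Success(recs) if recs.nonEmpty => recs
      case _                              => fallback // timeout, failure or empty result
    }

  def main(args: Array[String]): Unit =
    println(recommendationsOrFallback("user-42"))
}
```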

It is also incredibly important to find and eliminate your Single Points of Failure. A common one in a microservices architecture is your mechanism for internal load balancing of requests. For instance, we use HAProxy and had to spend a lot of time understanding how to run it reliably in a fault-tolerant way, especially when it is common for instances to come and go and for configuration to be re-written.

You should not even consider microservices (and autoscaling) without the non-functional tooling in place to make them work. You need log aggregation, metrics aggregation, monitoring and the like. Adrian Colyer's Internet Scale Services Checklist is a great resource for understanding the things you should be thinking about. We use a slimmed-down version of this as a pre-live checklist for any new service.

Finally, microservices introduce a cognitive overhead in terms of there being many more codebases, potentially in various languages/frameworks. In our case we have a variety of Scala, Scala/Play and node.js based services. It is harder for every engineer to know what everything running in the estate does, how it works, and how to troubleshoot it. This is where runbooks have been very useful for us.

Architectural evolution

Oct 2011 → Now: Monoliths → Microservices → Event-sourced


We've now moved beyond microservices to a more event-sourced/driven approach that utilises Apache Kafka and Apache Spark to run code in response to events. This is not suitable for all cases, but it works really well when building publishing flows, for example. It introduces more challenges in understanding how to run ZooKeeper/Kafka/Spark in a fault-tolerant and scalable way (we run everything on AWS), but it has the operational advantage that it drastically reduces the complexity of fault handling inherent in a microservices architecture that relies on HTTP communication between services.
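For a flavour of the event-driven style, here is a minimal Scala consumer sketch using the standard kafka-clients API. The topic and group names are hypothetical, and offset management, error handling and the actual publishing work are omitted for brevity.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

// React to events on a topic instead of exposing a synchronous HTTP endpoint.
object ArticlePublishedConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "publishing-flow")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("article-published")) // hypothetical topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        // Downstream work goes here: render pages, invalidate caches, notify, ...
        println(s"event key=${record.key} value=${record.value}")
      }
    }
  }
}
```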

Finally, from a technical perspective, it is worth discussing briefly our approach to the proliferation of technologies we run in production. We have always had a fairly simple approach to this. We like to use the right tool for the job, but it is incumbent on any engineer (or team) looking to introduce a new technology to do the work to implement it in a scalable and fault-tolerant manner that also integrates into our logging/metrics/monitoring and alerting mechanisms. We avoid CV-driven development through this approach, and any new technology needs to go through the same pre-live checklist as any code we write.

We used to have a rule that any technology used by three or more teams would become owned by the Platform team (whose primary responsibility is to maximise developer effectiveness), but in reality this doesn't work, as that team are not direct consumers of the technology themselves and are therefore not close enough to the pain points to prioritise effort on it.

However, this approach means that our technology estate has become more heterogeneous over time. The above logos are an indication of the complexity of our environment in the early zeebox second screen days. Almost all backend services were written in Scala and the world was pretty simple.


In the social network era of Beamly and in a microservices world, things become more complex.

And today they are more complex still. Significantly, we are now writing or supporting code written in Scala (services, Spark), JavaScript (Node) and PHP (Drupal/WordPress), and a lot of this complexity is inherited from brand websites that we have taken over.

There are some examples in these slides of how we've evolved approaches over time in response to the challenges we've had with technology. A good example is configuration management and orchestration. We started off with a hand-written set of Python-based tooling known as Verrot. This was replaced by Puppet and Hieradata, which in turn has been replaced with Ansible/Consul. In each case the migration was costly in engineering effort, but it paid off in greater effectiveness and operational improvements.
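As an illustration of the Consul side of that stack, here is a small Scala sketch that reads a configuration value from Consul's KV store over its HTTP API. The key name and Consul address are assumptions for the example, not our actual layout.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Fetch a value from Consul's KV store. The "?raw" parameter asks Consul to
// return the value itself rather than the JSON/base64 wrapper, which keeps
// this example free of JSON dependencies.
object ConsulConfig {
  private val client = HttpClient.newHttpClient()

  def kv(key: String, consul: String = "http://localhost:8500"): Option[String] = {
    val request  = HttpRequest.newBuilder(URI.create(s"$consul/v1/kv/$key?raw")).GET().build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    if (response.statusCode() == 200) Some(response.body()) else None // 404 when the key is absent
  }

  def main(args: Array[String]): Unit =
    // Hypothetical key - whatever naming convention the templates/playbooks agree on.
    println(kv("services/example-service/database_url").getOrElse("<not set>"))
}
```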

Organisational structure

Oct 2011 → Now: Tech silos → Feature team hybrid → Product teams


Moving on to organisational structure, this is perhaps the single biggest factor affecting our operational approaches.

Again, we've followed a fairly typical evolution in our product/engineering structure. We started off with technology-aligned vertical silos (iOS team, Android team, Web team, Backend team, etc.). Operationally this works quite well, as service ownership is clearly delineated and knowledge is co-located. You can build an on-call rotation around those teams. The possible downside is that particular teams may get overly burdened by operational issues (there is little that a mobile app team needs to be on-call for, as true app issues cannot be resolved without an app store release).

Again, the product value aspects of team structures are beyond the scope of this presentation, but we started moving to more multi-disciplinary feature teams gradually, which then evolved naturally into a product team structure. Product teams have a product manager, tech lead and UX designer (or data scientist) as a core, supplemented by the right mix of engineers to achieve their goals. They are given business problems or metrics to affect and have the autonomy to do so in whichever way they think best. They make data-led decisions and look to prove approaches with minimal code before committing to delivery.

Conway's law in action
Oct 2011 → Now: No communication → Synchronous meetings


As an aside, it is interesting to see how Conway's Law appears to hold true in the context of Beamly. Conway's Law states that "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations".

Tech-aligned silos do not promote much cross-communication and lead to monoliths.

Multi-disciplinary product teams act with a fair degree of autonomy and possess the skills to produce their own end-to-end output. When they do need to communicate, it is often via the inefficient form of synchronous meetings or a scrum of scrums; this is mirrored in our evolution by the use of microservices reliant on unreliable HTTP-based communication.

Finally, like pretty much everybody else, we've been using Slack as the de facto communications mechanism for two years now. This is asynchronous in nature, discoverable and effective. It is mirrored by our move to a more event-driven architecture.

Organisational structure

Oct 2011 → Now: Spotify model


Where we are now is a subtle but significant evolution of the product team model, inspired by the Spotify approach but tailored to our context. The main outcome is that we moved engineer management from within the product teams (which inhibited movement and limited the motivation for teams to make time to expand engineers' knowledge beyond the scope of that team) to a matrixed approach with Heads of Engineering.

A Head of Engineering is not in any product team and is primarily a people manager. If they make a technical contribution it is on non-critical-path work, and slow progress will not hinder anybody. They have the head-space to look at how we are working as a whole and to identify how we can improve. This has been immensely valuable, not only in the ability to focus more on the growth and development of our engineers, but also in giving us people who are more incented to identify and start to solve our operational challenges.

The operational problem with product teams/squads
(Diagram: four product teams, Teams 1–4, each owning its own set of services)


Product teams are great for optimising the delivery of product value, but they introduce significant operational headaches. Naturally, those teams will start to build services, and to begin with all is good with the world.


The operational problem with product teams/squads
(Diagram: the same teams, plus a new service that does not clearly belong to any of them)


But then some services will not clearly fit into a single team and their ownership is indistinct.


The operational problem with product teams/squads
(Diagram: the same teams and services, plus shared infrastructure that multiple teams rely on)


And then there will be shared infrastructure which multiple teams rely on.


The operational problem with product teams/squads
(Diagram: Team 1 has disbanded; the services it built remain in production)


And then, by nature, product teams/squads can disband when their goals are achieved. Imagine the example of a team tasked with improving the login experience. They build services to support Facebook and Google+ login, things are better, then that team's mission is complete and they move off to other things, but the services they built remain critical.


The operational problem with product teams/squads
(Diagram: the remaining teams' focus has shifted away from services they built earlier)


And finally, even if teams persist, their focus shifts as they evolve. They leave services behind them which are still running in production but are not a priority to spend time on.

The operational problem with product teams/squads
Incident follow-up – let's create a Jira board
Cross-cutting / fall through the gaps – Engineering Excellence initiative


All of these are things we've seen happen with the product team/squad model. It causes problems when orphaned or fuzzily-owned services have incidents, or when we identify improvements we'd like to make. The challenge is that all engineers are in product teams, which are not primarily incented to pick up work on systems they do not own or work that does not deliver against their current goals.

At Spotify it is probable that these problems are solved by their scale (and the concept of Tribes), but we are not that big.

So, we've tried creating a specific incident post-mortem JIRA board. We also created an Engineering Excellence initiative whereby anybody can raise an initiative to improve something or tackle some tech debt. These are then up-voted, and the most popular can be considered and given a Directly Responsible Individual to champion them. We then try to carve out time and people to do the work by negotiation with product managers.

The problem is that both of these initiatives have been failures. In both cases we've completed only 27% of the identified tickets over the past 12 months. This leaves us with a problem.

People


The final aspect affecting our operational maturity is our engineering team itself. Whilst we are five years old as a company, 55% of our engineering team have been with the company for under two years.

People
Acquisition cleanup


Additionally, over the past year since the acquisition, the number of operational incidents we've had to handle has dropped dramatically. Some of this is down to improvements, but the biggest factor is that we shut down our Beamly product to focus on the requirements of our new parent company. We went from over 90 services in production to around 25. Less code = less complexity and fewer things to go wrong.

45% of engineers have never been involved in handling an operational incident


The impact of this is that a significant percentage of our engineers have now never had to handle an operational incident whilst they've been on call. Again, in one respect this is good ("hey boss, fewer incidents!"), but actually we need a certain level of incident handling to keep discipline, to keep troubleshooting skills sharp and to maintain knowledge of the process.


Another aspect relates back to the 27% of successfully completed incident and Engineering Excellence tickets. When broken down by job role, it is clear that we rely overly on the most senior (and most tenured) engineers to shoulder the burden of this work. This is down to a number of factors, including their experience, knowledge of our systems, the fact that they are more likely to have an emotional attachment to the services being worked on, individual effectiveness and ability to absorb additional work.

This presents a cyclic problem: our newest engineers are not engaged with the opportunity to expand their knowledge by working on these issues, and they can't effectively work on these issues due to the lack of knowledge.


Current operational challenges
Product team structure focused on moving forwards
Velocity vs stability tension
Cross-cutting tech and issues have no owner
Lack of operational issue handling practice and experience
Little investment in improving our availability and reliability through improved monitoring and automation
Over-reliance on small subset of the engineering team
Lack of opportunity for experience and growth for the rest
Tacit knowledge not being shared / encoded


All this builds up to a significant number of challenges to our operational effectiveness.

Site Reliability Engineer
We are hiring into this role to focus exclusively on our availability and reliability
Will not be part of a product team
Will spend at least 50% of their time writing code to automate away operational burden and improve monitoring
Will have the power to fix things in your production systems if you can't/don't
Will own the maintenance and evolution of common runtime infrastructure (e.g. haproxy, Tyk)
Will help teams plan for production, including capacity planning, performance and architecture
Will help us evolve operational processes and practices
Is not platform – not focused on developer effectiveness or IT

https://thebeamlyagency.bamboohr.co.uk/jobs/view.php?id=17

So what's next?

Site Reliability Engineering has become an increasing trend, driven by the success of this model at Google and other companies. SRE isn't ops, but an application of software engineering approaches to the problem of maximising availability and reliability. It tackles how operational burden can be eliminated through obsessive automation. It is also much more than that, and the O'Reilly book is an excellent read.

Of course, we are not Google. But we want to create an SRE-style role within the engineering team to start to address some of these issues. However, to begin with we certainly can't justify a full SRE team.

Being on call – current structure
Second line rota, one engineer per day:
Mon: Bob, Tue: Alice, Wed: Don, Thu: John, Fri: Zed, Sat: Joe, Sun: Joan
Third line teams
Engineering management

So we are also going to restructure our on-call arrangements. Currently, all engineers staff a second line rota on a 24-hour rotation (10am–10am). This worked well for us when there were fairly regular operational issues, but for the last 18 months people's expectations of being paged have been minimal, and as such it is not uncommon now for the schedule to get into a poor state (people on rota while on holiday, for instance). In short, we've lost some discipline.

Being on call – new structure
Weekdays (10am Monday to 10am Friday): Bob
Weekend (10am Friday to 10am Monday): Alice
Third line teams
Engineering management

So, the new approach will be to move to a 4/3 rota, whereby an engineer is on-call from 10am Monday to 10am Friday. During this time, however, they will also be extracted from their normal product team duties to work alongside the SRE. Effectively we create a near full-time SRE team of two people. Regardless of whether incidents occur, this person gets time and space to focus on wider issues alongside the full-time SRE.

The weekend on-call engineer just handles pages as usual.

Being on call during the week == Site Reliability Engineer
Handling incidents that occur
Writing up incident post-mortems
Responding to any non-incident issues, e.g. automated warnings in the Slack #live-monitoring channel
Picking up tickets outstanding from previous post-mortems
Picking up Engineering Excellence tickets, examples of which would include:
Resolving issues/pain points through automation
Improving documentation
Improving alerting
Improvements to common infrastructure/services
Performing routine maintenance on common systems (e.g. HiveMQ upgrade)
Expanding their knowledge of Beamly systems/architecture (e.g. performing chaos monkey tests)
Working on technical debt within their own product team that is not specifically prioritised in that team's own plans

Whilst on-call in the SRE role during the week, the SRE team can focus on a variety of tasks.

Much more room for improvement
Better measurement of availability/reliability
Error budgeting (see the sketch below)
More automation
Continuous delivery
Improved tooling
Never ending

The aim of this presentation is not to claim that we are operationally mature, nor that we have best practices, but just to share our experience. We are knowingly deficient and still learning all the time. As is common everywhere, the list of things we would like to do far outstrips our ability to spend time and resources on them.
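To make the error-budgeting item above concrete, here is a small sketch of the basic arithmetic: given an availability SLO and a measurement window, how much downtime is allowed and how much budget remains. The SLO value and figures are illustrative, not our actual targets.

```scala
// Basic error-budget arithmetic for an availability SLO over a rolling window.
object ErrorBudget {
  // Minutes of downtime permitted by an SLO over a window of `windowDays` days.
  def allowedDowntimeMinutes(slo: Double, windowDays: Int): Double =
    (1.0 - slo) * windowDays * 24 * 60

  def main(args: Array[String]): Unit = {
    val slo        = 0.999 // 99.9% availability target (illustrative)
    val windowDays = 30
    val consumed   = 12.0  // minutes of downtime recorded so far in this window (illustrative)

    val allowed   = allowedDowntimeMinutes(slo, windowDays) // 43.2 minutes for 99.9% over 30 days
    val remaining = allowed - consumed

    println(f"allowed: $allowed%.1f min, consumed: $consumed%.1f min, remaining: $remaining%.1f min")
  }
}
```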

Thank you. Questions?
@aspender
https://linkedin.com/in/aspender
