DevOps for Big Data - Data 360 2014 Conference

DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau

Max Martynov, VP of Technology, Grid Dynamics

Uploaded by grid-dynamics on 22-Nov-2014

DESCRIPTION

Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau.

TRANSCRIPT

Page 1: DevOps for Big Data - Data 360 2014 Conference

DevOps for Big Data: Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau

Max Martynov, VP of Technology, Grid Dynamics

Page 2

Introductions

• Grid Dynamics
─ Solutions company specializing in eCommerce

─ Experts in mission-critical applications (IMDGs, Big Data)

─ Implementing Continuous Integration and Continuous Delivery for 5+ years

• Qubell
─ Enterprise DevOps platform

─ Focused on self-service environments, service orchestration, and continuous upgrades

─ Targets web-scale and big data applications

Page 3

State of DevOps and Continuous Delivery

Continuous Delivery Value

• Agility

• Transparency

• Efficiency

• Consistency

• Quality

• Control

Findings from the 2014 State of DevOps Report

• Strong IT performance is a competitive advantage

• DevOps practices improve IT performance

• Organizational culture matters

• Job satisfaction is the No. 1 predictor of organizational performance

Page 4

Continuous Delivery Infrastructure

• Environments
─ Reliable and repeatable deployment automation

─ Database schema management

─ Data management

─ Application properties management

─ Dynamic environments

• Quality
─ Test automation

─ Test data management (again)

─ Code analysis and review

• Process
─ Source code management, branching strategy

─ Agile requirements and project management

─ CI/CD pipeline

* Big Data applications bring additional challenges in all of these areas due to the large volume of data, the complexity of the business logic, and the scale of the environments.
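As one concrete illustration of the list above, application properties management often boils down to a base configuration plus per-environment overrides. The keys, values, and environment names below are hypothetical, not from the talk:

```python
# Minimal sketch of per-environment application properties management:
# a shared base config with environment-specific overrides.
# All keys and values here are illustrative assumptions.
BASE = {"hadoop.replication": 3, "job.queue": "default"}

OVERRIDES = {
    "dev":  {"hadoop.replication": 1},   # single-node sandbox needs no replication
    "prod": {"job.queue": "production"}, # production jobs go to a dedicated queue
}

def properties_for(env: str) -> dict:
    """Merge the base properties with the overrides for the given environment."""
    return {**BASE, **OVERRIDES.get(env, {})}
```

Keeping the overrides small and explicit makes it easy to see what actually differs between dev, QA, and prod.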

Page 5

Implementing Continuous Delivery for Big Data: Initial State of the Project

• Medium size distributed development team

• Diverse technology stack – Hadoop + Vertica + Tableau

• Only one environment existed and it was production

• Delivery pipeline: Development Team → Production

• Procurement of hardware for a new environment was taking months

Page 6

Development in Production

It is fun until somebody misses the nail

Page 7

Hadoop Analytical Application

[Diagram: Hadoop cluster with a Master, a Manager, a Database, and Slaves 1-N]

10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers. How do you quickly reproduce this environment for dev-test purposes?

Page 8

1. Stop-Gap Measure

• Same hardware, different logical “zones” implemented on the file system

• Automated build and deployment

• Delivery pipeline: Development Team → production cluster with logical zones /test1-N, /stage, /prod
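The zone layout above can be expressed as a simple path convention. The resolver below is a hypothetical sketch (the slide does not specify N, so five test zones are assumed):

```python
# Logical "zones" on a shared cluster: dev/test, stage, and prod live under
# different filesystem prefixes. Zone names follow the slide (/test1-N,
# /stage, /prod); the resolver itself is an illustrative assumption.
ZONES = {f"test{i}": f"/test{i}" for i in range(1, 6)}  # assumes N = 5
ZONES.update({"stage": "/stage", "prod": "/prod"})

def zone_path(zone: str, dataset: str) -> str:
    """Map a logical environment and a dataset name to its path on the shared cluster."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{ZONES[zone]}/{dataset}"
```

One convention like this is enough for every job to read and write inside its own zone without any physical separation.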

Page 9

1. Stop-Gap Measure: Pros and Cons

Pros

• Better than before: code can be tested before it goes to production

• All logical environments have access to the same production data

• Zero additional environment costs

Cons

• Stability, security and compliance issues: dev, test and prod environments share the same hardware

• Performance issues: tests affect production performance

• Impossible to run “destructive” tests that affect shared production data

• Impossible to test upgrades of middleware (new versions of H* components)

Page 10

2. Hadoop Dynamic Environments

[Diagram: Dev/QA/Ops request an environment through a self-service portal; the platform orchestrates environment provisioning and application deployment. Environments (Dev, QA, Stage, Prod) are assembled from components (custom application, data), services, and environment policies.]

Page 11

2. Hadoop Dynamic Environments (continued)

• Dev/QA/Ops teams got a self-service portal to
─ provision environments

─ deploy applications

• A new environment can be created from scratch in 2-3 hours
─ single-node dev sandbox

─ multi-node QA

─ big clusters for scalability and performance

• An application can be deployed to an environment within 10 minutes
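The request flow described above can be sketched as a tiny provisioning API. The profile names, fields, and `provision` function are hypothetical illustrations of the self-service idea, not Qubell's actual API:

```python
# Toy sketch of a self-service environment request: a user picks a profile
# and a size, and an orchestrator turns the request into a running
# environment. All names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EnvironmentRequest:
    profile: str                       # e.g. "dev-sandbox", "qa", "perf-cluster"
    nodes: int                         # 1 for a dev sandbox, more for QA/perf
    components: list = field(default_factory=lambda: ["hadoop"])

def provision(request: EnvironmentRequest) -> dict:
    """Pretend orchestrator: validate the request and return an environment record."""
    if request.profile == "dev-sandbox" and request.nodes != 1:
        raise ValueError("a dev sandbox is single-node")
    return {
        "profile": request.profile,
        "nodes": request.nodes,
        "components": request.components,
        "status": "provisioned",
    }
```

The key point is the shape of the workflow: the requester only declares what they need, and policy checks and provisioning happen behind the portal.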

Page 12

3. Vertica and Tableau Dynamic Environments

[Diagram: the same self-service flow applied to Vertica and Tableau: Dev/QA/Ops request an environment, and the platform orchestrates provisioning and application deployment. Components include data, UDFs, VSQL, and configuration; environments (Dev, QA, Stage, Prod) draw on services, environment policies, and a shared service.]

Page 13

4. Tests & Test Data

• Dev and QA teams implemented automated tests

• Two options to handle data on dev-test environments:

1. Tests generate data for themselves

2. A reduced representative snapshot of obfuscated production data (10 TB → 10 GB)

Test pyramid:

─ Unit tests: Java code, auto-generated data; build-time validation

─ Component tests: automated tests at the "API" level, testing job output; test-generated data

─ Integration tests (integration with data): automated tests at the "API" level, validating job output; snapshot of production data

─ Exploratory tests: manual tests; snapshot of production data
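Option 2 above (a reduced, obfuscated snapshot) can be sketched in a few lines. The field names and sampling rate are illustrative assumptions; the slide's 10 TB → 10 GB reduction corresponds to a rate of roughly 0.001:

```python
# Sketch of building a small, obfuscated sample of production data.
# Field names and the sampling rate are illustrative assumptions.
import hashlib
import random

def obfuscate(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible 12-char token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def sample_and_obfuscate(rows, rate=0.001, sensitive=("email", "name")):
    """Yield a random sample of rows with sensitive fields obfuscated."""
    rng = random.Random(42)  # fixed seed, so the snapshot is reproducible
    for row in rows:
        if rng.random() < rate:
            yield {k: (obfuscate(v) if k in sensitive else v)
                   for k, v in row.items()}
```

Hashing keeps joins on obfuscated keys working (the same input always maps to the same token) while making the original values unrecoverable in dev-test environments.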

Page 14

5. CI/CD Pipeline

With all components ready, implementing the CI/CD pipeline is easy:

Development Team → Dev Sandbox → QA Environment

GitHub Flow:

1. Develop & experiment

2. Commit

3. Build & unit test

4. Deploy

5. Test

6. Release
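The six numbered steps can be read as a strictly ordered pipeline: each stage runs only if the previous one succeeded. The tiny runner below is an illustrative sketch, not the project's actual tooling:

```python
# The CI/CD stages from the slide, as an ordered list; a failed stage
# stops the pipeline so later stages never see a broken build.
PIPELINE = [
    "develop_and_experiment",  # 1. on a dev sandbox
    "commit",                  # 2. GitHub Flow: feature branch + pull request
    "build_and_unit_test",     # 3.
    "deploy",                  # 4. to the QA environment
    "test",                    # 5.
    "release",                 # 6.
]

def run(pipeline, execute):
    """Run stages in order; return (completed stages, failed stage or None)."""
    completed = []
    for stage in pipeline:
        if not execute(stage):
            return completed, stage
        completed.append(stage)
    return completed, None
```

Stopping at the first failure is what makes the green path trustworthy: anything that reaches "release" has passed every earlier gate.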

Page 15

6. Release Button

Release Candidate → Release → Production (promoted by Ops/RE)

Page 16

Assembly Line

Page 17

Results

• Reduced risk and higher quality
─ No more development in production

─ Developers have sandboxes, tests are run on separate environments

─ Features are deployed to production only after validation

• Increased efficiency
─ A new environment can be provisioned within 2 hours

─ Developers can freely experiment with new changes

─ No resource contention

• Reduced costs
─ No need to procure in-house hardware or manage an in-house datacenter

─ Dynamic environments save money because they run only when they are needed

Page 18

Enabling Technologies

Agile Software Factory: Software Engineering Assembly Line

griddynamics.com

Qubell: Enterprise DevOps Platform

qubell.com

Page 19

April 8, 2023

Thank You


Max Martynov, VP of Technology, Grid Dynamics, [email protected]

Victoria Livschitz, CEO and Founder, [email protected]