DevOps for Big Data - Data 360 2014 Conference
DESCRIPTION
Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau.
TRANSCRIPT
DevOps for Big Data
Enabling Continuous Delivery for data analytics applications based on Hadoop, Vertica, and Tableau
1
Max Martynov, VP of Technology, Grid Dynamics
2
Introductions
• Grid Dynamics
─ Solutions company specializing in eCommerce
─ Experts in mission-critical applications (IMDGs, Big Data)
─ Implementing Continuous Integration and Continuous Delivery for 5+ years
• Qubell
─ Enterprise DevOps platform
─ Focused on self-service environments, service orchestration, and continuous upgrades
─ Targets web-scale and big data applications
3
State of DevOps and Continuous Delivery
Continuous Delivery Value
• Agility
• Transparency
• Efficiency
• Consistency
• Quality
• Control
Findings from the 2014 State of DevOps Report
• Strong IT performance is a competitive advantage
• DevOps practices improve IT performance
• Organizational culture matters
• Job satisfaction is the No. 1 predictor of organizational performance
4
Continuous Delivery Infrastructure
• Environments
─ Reliable and repeatable deployment automation
─ Database schema management
─ Data management
─ Application properties management
─ Dynamic environments
• Quality
─ Test automation
─ Test data management (again)
─ Code analysis and review
• Process
─ Source code management, branching strategy
─ Agile requirements and project management
─ CICD pipeline
* Big Data applications bring additional challenges in all of these areas due to large volumes of data, complex business logic, and large-scale environments.
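One of the environment concerns above, application properties management, can be sketched as per-environment overrides merged over shared defaults. The environment names and property keys below are illustrative assumptions, not from the talk:

```python
# Sketch of per-environment application properties management.
# Keys and environment names are hypothetical examples.
DEFAULTS = {"hdfs.replication": 3, "job.parallelism": 8}

OVERRIDES = {
    "dev":  {"hdfs.replication": 1, "job.parallelism": 2},
    "test": {"hdfs.replication": 2},
    "prod": {},  # production runs on the shared defaults
}

def properties_for(env: str) -> dict:
    """Merge environment-specific overrides over the shared defaults."""
    return {**DEFAULTS, **OVERRIDES[env]}

print(properties_for("dev"))  # {'hdfs.replication': 1, 'job.parallelism': 2}
```

Keeping one set of defaults with small per-environment deltas makes property drift between environments visible in a single diff.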
5
Implementing Continuous Delivery for Big Data: Initial State of the Project
• Medium-sized distributed development team
• Diverse technology stack – Hadoop + Vertica + Tableau
• Only one environment existed and it was production
• Delivery pipeline:
• Procurement of hardware for a new environment was taking months
[Pipeline diagram: Development Team → Production]
6
Development in Production
It is fun until somebody misses the nail
7
Hadoop Analytical Application
[Cluster diagram: Manager, Master, Database, Slaves 1-N]
10+ TB of data; 10+ nodes in production; 10+ applications; manually pre-deployed on hardware servers.
How to quickly reproduce this environment for dev-test purposes?
8
1. Stop-Gap Measure
• Same hardware, different logical “zones” implemented on the file system
• Automated build and deployment
• Delivery pipeline:
[Pipeline diagram: Development Team → Production cluster, with file-system zones /test1-N, /stage, /prod]
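The zone scheme above amounts to prefixing every dataset path with its logical zone. A minimal sketch, assuming hypothetical zone names and a `zone_path` helper that is not from the talk:

```python
# Illustrative sketch of logical "zones" on a shared cluster filesystem.
# Zone names and the zone_path helper are hypothetical.
VALID_ZONES = {"prod", "stage"} | {f"test{i}" for i in range(1, 10)}

def zone_path(zone: str, path: str) -> str:
    """Map a logical dataset path into its zone's directory tree."""
    if zone not in VALID_ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"/{zone}/{path.lstrip('/')}"

print(zone_path("test1", "/datasets/clicks"))  # /test1/datasets/clicks
```

Because the zones live on one physical cluster, a helper like this is the only thing separating test writes from production data, which is exactly the fragility the cons on the next slide describe.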
9
1. Stop-Gap Measure: Pros and Cons
Pros
• Better than before: code can be tested before it goes to production
• All logical environments have access to the same production data
• Zero additional environment costs
Cons
• Stability, security, and compliance issues: dev, test, and prod environments share the same hardware
• Performance issues: tests affect production performance
• Impossible to run “destructive” tests that affect shared production data
• Impossible to test upgrades of middleware (new versions of H* components)
10
2. Hadoop Dynamic Environments
[Diagram: Dev/QA/Ops request an environment through a self-service portal; the platform orchestrates environment provisioning and application deployment (custom application, data, components, services, environment policies) across Dev, QA, Stage, and Prod environments.]
11
2. Hadoop Dynamic Environments (continued)
• Dev/QA/Ops teams got a self-service portal to
─ provision environments
─ deploy applications
• A new environment can be created from scratch in 2-3 hours:
─ single-node dev sandbox
─ multi-node QA
─ big clusters for scalability and performance
• An application can be deployed to an environment within 10 minutes
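The provisioning flow behind the portal can be sketched as a profile-driven orchestration of fixed steps. The profile names, node counts, and step list below are illustrative assumptions, not Qubell's actual API:

```python
# Hypothetical sketch of self-service environment provisioning.
# Profiles and orchestration steps are stand-ins for the real platform.
PROFILES = {
    "dev-sandbox": {"nodes": 1},
    "qa":          {"nodes": 4},
    "perf":        {"nodes": 12},
}

def provision(profile: str) -> list:
    """Return the ordered orchestration steps for one environment request."""
    nodes = PROFILES[profile]["nodes"]
    return [
        f"allocate {nodes} node(s)",
        "install Hadoop services",
        "apply environment policies",
        "deploy application",
    ]

print(provision("qa"))
```

The point of the profile layer is that a single-node sandbox and a 12-node performance cluster go through the same automated steps, differing only in parameters.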
12
3. Vertica and Tableau Dynamic Environments
[Diagram: analogous self-service flow for Vertica and Tableau; Dev/QA/Ops request an environment and the platform orchestrates provisioning and deployment (data, UDF, VSQL, config, components, services, environment policies) across Dev, QA, Stage, and Prod, with one component provided as a shared service.]
13
4. Tests & Test Data
• Dev and QA teams implemented automated tests
• Two options to handle data on dev-test environments:
1. Tests generate data for themselves
2. A reduced, representative snapshot of obfuscated production data (10 TB -> 10 GB)
Test pyramid:
─ Unit tests: Java code, auto-generated data; build-time validation
─ Component tests: automated tests on the "API" level, testing job output; test-generated data
─ Integration tests (integration with data): automated tests on the "API" level, validating job output; snapshot of production data
─ Exploratory tests: manual tests; snapshot of production data
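The second data option, a reduced obfuscated snapshot, can be sketched as sampling plus irreversible masking of sensitive fields. The field names and the 1-in-1000 sampling rate below are illustrative assumptions:

```python
import hashlib

# Sketch of building a reduced, obfuscated snapshot of production records.
# Record shape and sampling rate are hypothetical examples.
def obfuscate(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def snapshot(records, rate=1000):
    """Keep every `rate`-th record and mask its user identifier."""
    return [
        {**r, "user_id": obfuscate(r["user_id"])}
        for i, r in enumerate(records) if i % rate == 0
    ]

prod = [{"user_id": f"u{i}", "amount": i % 50} for i in range(10_000)]
small = snapshot(prod)
print(len(small))  # 10 records instead of 10,000
```

Hashing rather than randomizing keeps the masking stable across snapshot runs, so joins on the masked identifier still work in dev-test environments.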
14
5. CICD Pipeline
With all components ready, implementing the CICD pipeline is easy (GitHub Flow, development team working against a dev sandbox and a QA environment):
1. Develop & experiment
2. Commit
3. Build & unit test
4. Deploy
5. Test
6. Release
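The six-step flow can be sketched as a linear pipeline where a failed stage blocks everything after it. The stage names follow the slide; the runner itself is a hypothetical stand-in for the real build/deploy tooling:

```python
# Minimal sketch of the six-stage pipeline; stage functions are stand-ins
# for the real build, deploy, and test tooling.
STAGES = ["develop", "commit", "build+unit-test", "deploy", "test", "release"]

def run_pipeline(fail_at=None):
    """Run stages in order, stopping at the first failing stage."""
    completed = []
    for stage in STAGES:
        if stage == fail_at:
            break  # a failed stage blocks everything downstream
        completed.append(stage)
    return completed

print(run_pipeline())        # all six stages complete
print(run_pipeline("test"))  # stops after deploy, release never runs
```

The invariant worth noting is that "release" is reachable only through the full chain, which is what makes the release button on the next slide safe to press.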
15
6. Release Button
[Diagram: Ops/RE promote a release candidate to production via a single release action.]
16
Assembly Line
17
Results
• Reduced risk and higher quality
─ No more development in production
─ Developers have sandboxes, tests are run on separate environments
─ Features are deployed to production only after validation
• Increased efficiency
─ A new environment can be provisioned within 2 hours
─ Developers can freely experiment with new changes
─ No resource contention
• Reduced costs
─ No need to procure in-house hardware and manage an in-house datacenter
─ Dynamic environments save money because they run only when they are needed
18
Enabling Technologies
Agile Software Factory
Software Engineering Assembly Line
griddynamics.com
Qubell
Enterprise DevOps Platform
qubell.com
Thank You
19
Max Martynov, VP of Technology, Grid Dynamics, [email protected]
Victoria Livschitz, CEO and Founder, Qubell, [email protected]