Lessons Learned Running Hadoop and
Spark in Docker
Thomas Phelan
Chief Architect, BlueData
@tapbluedata
September 29, 2016
Outline
• Docker Containers and Big Data
• Hadoop and Spark on Docker: Challenges
• How We Did It: Lessons Learned
• Key Takeaways
• Q & A
A Better Way to Deploy Big Data
Traditional Approach
IT
Manufacturing | Sales | R&D | Services
• < 30% utilization
• Weeks to build each cluster
• Duplication of data
• Management complexity
• Painful, complex upgrades
Hadoop and Spark on Docker
A New Approach
Manufacturing | Sales | R&D | Services
BI/Analytics Tools
• > 90% utilization
• No duplication of data
• Simplified management
• Multi-tenant
• Self-service, on-demand clusters
• Simple, instant upgrades
Deploying Multiple Big Data Clusters
Data scientists want flexibility:
• Different versions of Hadoop, Spark, et al.
• Different sets of tools
IT wants control:
• Multi-tenancy
- Data security
- Network isolation
Containers = the Future of Big Data
Infrastructure
• Agility and elasticity
• Standardized environments (dev, test, prod)
• Portability (on-premises and cloud)
• Higher resource utilization
Applications
• Fool-proof packaging (configs, libraries, driver versions, etc.)
• Repeatable builds and orchestration
• Faster app dev cycles
The Journey to Big Data on Docker
Start with a clear goal in sight
Begin with your Docker toolbox of a single container and basic networking and storage
So you want to run Hadoop and Spark on Docker in a multi-tenant enterprise deployment?
Beware … there is trouble ahead
Traverse the tightrope of network configurations
• Docker Networking? Calico
• Kubernetes Networking? Flannel, Weave Net
Navigate the river of container managers
• Swarm? Kubernetes? AWS ECS? Mesos?
Big Data on Docker: Pitfalls
Cross the desert of storage configurations
• Overlay files? Flocker? Convoy?
Big Data on Docker: Challenges
Pass through the jungle of software compatibility
Tame the lion of performance
Finally you get to the top!
Trip down the staircase of deployment mistakes
But for deployment in the enterprise, you are not even close to being done …
Big Data on Docker: Next Steps?
You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches
You realize it’s time to get some help!
Big Data on Docker: Quagmire
How We Did It: Design Decisions
• Run Hadoop/Spark distros and applications unmodified
- Deploy all the services that typically run on a single bare-metal host in a single container
• Multi-tenancy support is key
- Network and storage security
How We Did It: Design Decisions
• Images built to “auto-configure” themselves at time of instantiation
- Not all instances of a single image run the same set of services when instantiated
• Master vs. worker cluster nodes
How We Did It: Design Decisions
• Maintain the promise of containers
- Keep them as stateless as possible
- Container storage is always ephemeral
- Persistent storage is external to the container
How We Did It: Implementation
Resource Utilization
• CPU cores vs. CPU shares
• Over-provisioning of CPU recommended
• No over-provisioning of memory
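The CPU/memory asymmetry above can be sketched with standard docker run flags: `--cpu-shares` is a relative weight that only matters under contention, so CPU over-provisioning degrades gracefully, while `--memory` is a hard limit whose breach gets the container OOM-killed, so RAM must not be over-committed. A minimal sketch, with assumed share values:

```shell
# CPU shares are relative weights: under contention, a container with 2048
# shares running beside one with 1024 shares gets 2/3 of the CPU; idle CPU
# remains available to everyone, which is why over-provisioning CPU is safe.
share=$(awk 'BEGIN { printf "%.2f", 2048 / (2048 + 1024) }')
echo "container A gets fraction $share of contended CPU"

# Memory is a hard limit (the container is OOM-killed past it), so RAM is
# never over-committed. The corresponding flags (shown, not executed):
echo 'docker run --cpu-shares=2048 --memory=8g <image>'
```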
Network
• Connect containers across hosts
• Persistence of IP address across container restart
• Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
• Noisy neighbors
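The VLAN/VxLAN tenant isolation above can be sketched with standard Linux and OVS tooling. All names here (eth0, VNI 100, tenant100-br) are illustrative assumptions, not BlueData's actual configuration, and the commands are printed rather than executed since they require root and real NICs:

```shell
# One VxLAN segment per tenant, attached to a per-tenant OVS bridge, keeps
# tenant traffic isolated across hosts (all names are illustrative).
vxlan_cmds='ip link add vxlan100 type vxlan id 100 dev eth0 dstport 4789
ip link set vxlan100 up
ovs-vsctl add-br tenant100-br
ovs-vsctl add-port tenant100-br vxlan100'
printf '%s\n' "$vxlan_cmds"
```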
How We Did It: Network Architecture
[Diagram: on each host, an OVS bridge over the NIC carries the tenant networks, linked across hosts by a VxLAN tunnel; a container orchestrator provides DHCP/DNS. The containers host the Resource Manager, Node Managers, Spark Master, Spark Worker, and Zeppelin.]
How We Did It: Implementation
Storage
• Tweak the default size of a container’s /root
- Resizing of storage inside an existing container is tricky
• Mount logical volume on /data
- No use of overlay file systems
• DataTap (version-independent, HDFS-compliant) connectivity to external storage
Image Management
• Utilize Docker’s image repository
TIP: Mounting block devices into a container does not support symbolic links; use stable /dev/dm-… names rather than /dev/sdb-style names, which can change across host reboots as PCI devices are re-enumerated.
TIP: Docker images can get large. Use “docker-squash” to save on size.
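A hedged sketch of the storage layout described above: a host logical volume mounted into the container at /data keeps the data path off any overlay filesystem. Volume-group, path, and image names are assumptions, and the commands are printed rather than executed since they require root and real disks:

```shell
# Carve a logical volume on the host and bind-mount it at the container's
# /data, bypassing overlay storage entirely (names are illustrative).
storage_cmds='lvcreate -L 200G -n data-lv vg0
mkfs.xfs /dev/vg0/data-lv
mount /dev/vg0/data-lv /bd/cluster1/data
docker run -v /bd/cluster1/data:/data <image>'
printf '%s\n' "$storage_cmds"
```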
How We Did It: Security Considerations
• Security is essential since containers and host share one kernel
- Non-privileged containers
• Achieved through layered set of capabilities
• Different capabilities provide different levels of isolation and protection
• Add “capabilities” to a container based on what operations are permitted
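The layered-capability approach can be sketched with standard docker run flags: drop every capability, then add back only those the container's permitted operations require. The specific capabilities listed are illustrative, not BlueData's actual set, and the command is printed rather than executed:

```shell
# Start from zero capabilities and grant back a minimal set; each added
# capability widens what the container may do (choices are illustrative).
cap_cmd='docker run --cap-drop=ALL --cap-add=CHOWN --cap-add=SETUID --cap-add=SETGID --cap-add=NET_BIND_SERVICE <image>'
printf '%s\n' "$cap_cmd"
```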
How We Did It: Sample Dockerfile
# Spark-1.5.2 docker image for RHEL/CentOS 6.x
FROM centos:centos6
# Download and extract spark
RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/
# Download and extract scala
RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar xz -C /usr/lib/scala/
# Install zeppelin
RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz|tar xz -C /usr/lib/zeppelin
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
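Note that the final RUN line must make the script executable with chmod +x before invoking it; chmod -x would strip the execute bit and the build would fail. A tiny local demonstration of the chmod-then-execute pattern:

```shell
# Write a throwaway script, grant the execute bit, and run it directly,
# the same pattern the Dockerfile's final RUN line relies on.
script=$(mktemp)
printf '#!/bin/sh\necho configured\n' > "$script"
chmod +x "$script"
out=$("$script")
rm -f "$script"
echo "$out"
```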
BlueData Application Image (.bin file)
[Diagram: the application .bin file packages a Docker image (built from a CentOS or RHEL Dockerfile) together with an appconfig bundle (conf, init.d start script), the <app> logo file, and an <app>.wb file produced by the bdwb command (clusterconfig, image, role, appconfig, catalog, service, ...). Sources (Dockerfile, logo .PNG, init.d) plus the runtime software bits are built into the bin, OR an existing .bin can be extracted and modified to create a new bin.]
Multi-Host
4 containers on 2 different hosts, using 1 VLAN and 4 persistent IPs
Different Services in Each Container
Master Services
Worker Services
Container Storage from Host
[Diagram: container storage mapped to host storage]
Performance Testing: Spark
• Spark 1.x on YARN
• HiBench - TeraSort
- Data sizes: 100 GB, 500 GB, 1 TB
• 10-node physical/virtual cluster
• 36 cores and 112 GB memory per node
• 2 TB HDFS storage per node (SSDs)
• 800 GB ephemeral storage
Spark on Docker: Performance
[Chart: TeraSort throughput in MB/s]
Performance Testing: Hadoop
Hadoop on Docker: Performance
[Chart: Containers (BlueData EPIC with 1 virtual node per host) vs. Bare-Metal]
4 Docker containers
Multiple Hadoop versions
Different Hadoop versions on the same set of physical hosts
“Dockerized” Hadoop Clusters
5 fully managed Docker containers with persistent IP addresses
“Dockerized” Spark Standalone
Spark with Zeppelin Notebook
Big Data on Docker: Key Takeaways
• All apps can be “Dockerized”, including Hadoop & Spark
- Traditional bare-metal approach to Big Data is rigid and inflexible
- Containers (e.g. Docker) provide a more flexible & agile model
- Faster app dev cycles for Big Data app developers, data scientists, & engineers
Big Data on Docker: Key Takeaways
• There are unique Big Data pitfalls & challenges with Docker
- For enterprise deployments, you will need to overcome these and more:
Docker base images that include Big Data libraries and jar files
Container orchestration, including networking and storage
Resource-aware runtime environment, including CPU and RAM
Big Data on Docker: Key Takeaways
• There are unique Big Data pitfalls & challenges with Docker
- More:
Access to containers secured with SSH keypair or PAM module (LDAP/AD)
Fast access to external storage
Management agents in Docker images
Runtime injection of resource and configuration information
Big Data on Docker: Key Takeaways
• “Do It Yourself” will be costly and time-consuming
- Be prepared to tackle the infrastructure & plumbing challenges
- In the enterprise, the business value is in the applications / data science
• There are other options …
- Public Cloud – AWS
- BlueData - turnkey solution
Thank You
Contact info:
@tapbluedata
tap@bluedata.com
www.bluedata.com
www.bluedata.com/free
FREE LITE VERSION