TRANSCRIPT
Big Data and Containers
Charles Smith / @charles_s_smith
Netflix / Lead the big data platform architecture team
Spend my time / Thinking about how to make it easy and efficient to work with big data
University of Florida / PhD in Computer Science
Who am I?
“It is important that we know where we come from, because
if you do not know where you come from, then you don't
know where you are, and if you don't know where you are,
you don't know where you're going. And if you don't know
where you're going, you're probably going wrong.”
Terry Pratchett
Database
Distributed Database
Distributed Storage
Distributed Processing
???
Why do we care about containers?
Containers ~= Virtual Machines
Virtual Machines ~= Servers
Lightweight
Fast to start
Low memory use
Secure
Process isolation
Data isolation
Portable
Composable
Reproducible
Everything old is new
Microservices and large architectures
Data storage (Cassandra, MySQL, MongoDB, etc.)
Operational (Mesos, Kubernetes, etc.)
Discovery/Routing
What’s different about big data?
Data at rest
Data in motion
Customer Facing
Minimize latency
Maximize reliability
Data Analytics
Minimize I/O
Maximize processing
Ship computation to data
The questions you can answer aren’t predefined
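The analytics constraints above — minimize I/O, ship computation to the data — can be sketched with a toy scheduler. Everything here (node names, the block map, the block size) is invented for illustration:

```python
# Toy model of "ship computation to data": run the task on the node
# that already holds the data block instead of copying the block over
# the network. All names and sizes here are illustrative.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, a common HDFS block size

# Which node stores each block of a hypothetical dataset.
block_locations = {
    "view_history/part-0": "node-a",
    "view_history/part-1": "node-b",
    "view_history/part-2": "node-a",
}

def schedule(block, free_nodes):
    """Prefer the node holding the block; fall back to any free node."""
    local = block_locations[block]
    if local in free_nodes:
        return local, 0  # no bytes moved over the network
    return next(iter(free_nodes)), BLOCK_SIZE  # must ship the data instead

node, moved = schedule("view_history/part-0", {"node-a", "node-b"})
print(node, moved)  # node-a 0 — the computation went to the data
```

The point of the sketch: when the local node is busy, the cost of the fallback is the size of the data, which is why big data schedulers treat locality as a first-class input.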
Hive/Pig/MR
Presto
Metacat
Hive Metastore
That doesn’t look very container-y (or microservice-y, for that matter)
Data storage - HDFS (or in our case, S3)
Operational - YARN
Containers - JVM
So what happens when you want to do something else?
But is that really the way we want to approach containers?
What’s different about big data?
Running many different short-lived processes
Efficient container construction, allocation, and movement
Groups of processes having meaning
How we observe processes needs to be holistic
Processes need to be scheduled by data locality (and not just data locality for data at rest)
A special case of affinity (although possibly over time)
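A minimal sketch of locality as affinity, possibly over time: score candidate nodes by how much of the task’s input they hold now, plus where related tasks ran recently. The scoring weights and names are assumptions made up for this sketch:

```python
# Locality as "a special case of affinity (possibly over time)":
# prefer nodes holding the task's input blocks, with a smaller bonus
# for nodes that ran related tasks recently (warm caches, data in
# motion). Weights and node names are invented for illustration.

from collections import Counter

def pick_node(task_blocks, block_locations, recent_runs, history_weight=0.5):
    score = Counter()
    for b in task_blocks:
        for node in block_locations.get(b, ()):
            score[node] += 1  # data at rest on this node
    for node in recent_runs:
        score[node] += history_weight  # affinity from earlier placements
    return score.most_common(1)[0][0]

node = pick_node(
    task_blocks=["part-0", "part-1"],
    block_locations={"part-0": ["node-a"], "part-1": ["node-a", "node-b"]},
    recent_runs=["node-b"],
)
# node-a holds two blocks (score 2.0) vs node-b's one block + 0.5 history.
```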
but...
We do need a data discovery service. (Kind of… maybe… a namenode?)
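One way to picture such a discovery service — a namenode-like registry mapping dataset partitions to the nodes serving them. The register/locate API and all names here are assumptions for illustration, not an existing system:

```python
# Minimal sketch of a "data discovery service": a registry mapping
# (dataset, partition) to the set of nodes holding a replica.
# The API and names are invented for illustration.

class DataDiscovery:
    def __init__(self):
        self._locations = {}  # (dataset, partition) -> set of node ids

    def register(self, dataset, partition, node):
        self._locations.setdefault((dataset, partition), set()).add(node)

    def locate(self, dataset, partition):
        return self._locations.get((dataset, partition), set())

dd = DataDiscovery()
dd.register("view_history", 0, "node-a")
dd.register("view_history", 0, "node-c")  # a replica on another node
print(sorted(dd.locate("view_history", 0)))  # ['node-a', 'node-c']
```

A scheduler would consult this the way HDFS clients consult the namenode: ask where the data lives, then place the process accordingly.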
SELECT
  t.title_id,
  t.title_desc,
  SUM(v.view_secs)
FROM view_history AS v
JOIN title_d AS t
  ON v.title_id = t.title_id
WHERE v.view_dateint > 20150101
GROUP BY 1, 2;
LOAD + LOAD → JOIN → GROUP
Data Discovery
Query Compiler
Query Planner
Metadata
DAG
Watcher
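The pipeline above compiles the query into a DAG of operators: the two LOADs feed the JOIN, which feeds the GROUP. A sketch of that plan as a plain dependency map — the node labels and data structure are illustrative, not any engine’s actual representation:

```python
# The SQL above as an operator DAG: each key is an operator, each value
# lists the operators it depends on. Labels are illustrative.

dag = {
    "LOAD view_history": [],
    "LOAD title_d": [],
    "JOIN on title_id": ["LOAD view_history", "LOAD title_d"],
    "GROUP BY title": ["JOIN on title_id"],
}

def topo_order(dag):
    """Order operators so every node runs after all of its inputs."""
    seen, order = set(), []
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for dep in dag[n]:
            visit(dep)
        order.append(n)
    for n in dag:
        visit(n)
    return order

print(topo_order(dag))
# Both LOADs come before the JOIN, and the JOIN before the GROUP.
```

A planner schedules operators in this order, and a watcher can track each one as a group of processes with shared meaning — the holistic observation the earlier slides call for.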
Bottom line
Containers provide process level security
The goal should be to minimize monoliths
This isn’t different from what we are doing already
Our languages are abstractions over composable, distributed processing
Different big data projects should share services
No matter what we do, joining is going to be a big problem
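Why joins stay a big problem: a distributed join must shuffle both inputs so that matching keys land on the same worker, and that network movement is the dominant cost. A toy hash-partitioned shuffle join, with invented data and partition count:

```python
# Toy hash-partitioned shuffle join. Both inputs are partitioned by the
# join key so matching rows meet on the same worker; in a real cluster
# this partitioning step moves both datasets over the network.

def partition(rows, key, n):
    """Hash-partition rows by join key across n workers."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def shuffle_join(left, right, key, n=2):
    out = []
    for lpart, rpart in zip(partition(left, key, n), partition(right, key, n)):
        index = {}  # per-worker hash join on its partition
        for r in rpart:
            index.setdefault(r[key], []).append(r)
        for l in lpart:
            for r in index.get(l[key], []):
                out.append({**l, **r})
    return out

views = [{"title_id": 1, "view_secs": 60}, {"title_id": 2, "view_secs": 30}]
titles = [{"title_id": 1, "title_desc": "A"}, {"title_id": 2, "title_desc": "B"}]
rows = shuffle_join(views, titles, "title_id")
# Every view row is paired with its title row — but only after both
# inputs were redistributed. That redistribution is the hard part.
```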
Questions?