TRANSCRIPT
-
DADI Block-Level Image Service for Agile and Elastic Application Deployment
Huiba Li, Yifan Yuan, Rui Du, Kai Ma, Lanzheng Liu and Windsor Hsu
Alibaba Group
-
The Problem
• Container deployment (cold startup) is slow
  • Long-tail latency reaches tens of minutes
  • The essential reasons are image downloading and unpacking
  • Only 6.4% [Slacker] of the image is actually used for startup
  • A regression to a decade ago, when VM images were also downloaded to hosts
• P2P downloading [Dragonfly, Kraken, Borg, Tupperware, FID] is not enough
  • Addresses only half of the problem (downloading, not unpacking)
  • Little effect for small clusters
• Slimming the images [DockerSlim, Cntr] is not universal
  • Hard to automatically find all dependencies for all applications
  • Hard to support ad-hoc operations
-
Remote Image
• Remote image is the trend
  • [CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS, Cider]
  • optionally with P2P transfer for large clusters
• The container image (tarball), however, is NOT viable as a remote image
  • designed for unpacking, not seekable
  • hard to support advanced features such as xattr, cross-layer reference, etc.
  • we had better design a new format
• Type of image
  • File-system-based image?
  • Block-device-based image?
-
Type of Image: Block!

Block-Device-Based
• Works together with a regular file system, such as ext4
• Viable for containers, secure containers, and virtual machines
  Existing systems: Cider (based on Ceph; no layering format)
  Complexity: low (stability ↑, optimization ↑, advanced features ↑)
  Universality: the app can choose a best-match file system, e.g. NTFS, and pack it into the image as a dependency
  Security: small attack surface
  Overall: need the courage to walk alone (almost); TODO: layering

File-System-Based
• Provides a file-system interface directly
• A "natural" extension of the container image
• Less mental friction (due to inertia and following the crowd)
  Existing systems: CRFS, Teleport, CernVM-FS, Slacker, Wharf, CFS
  Complexity: high (stability ↓, optimization ↓, advanced features ↓)
  Universality: fixed features; may not match all applications (e.g. a Windows container on a Linux host)
  Security: large attack surface
  Overall: technical advantage is insignificant
-
Background: Layered Image of Container
• Each layer is a change set relative to the previous state (files added, modified, deleted); read-only, shared
• The container layer is a change set relative to the image (files added, modified, deleted); read-write, private
• Usually the layers are stored in separate directories, and a merged view is created with the kernel module overlayfs (a toy sketch of such a merged view follows below)
(Diagram: layers are downloaded from the docker registry and untarred into directories.)
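As a toy illustration of how layered change sets compose into a merged view (a simplification, not overlayfs or DADI internals; the WHITEOUT marker and merged_view helper are invented for this example):

```python
# Toy model of layered change sets and a merged (overlay-style) view.
# Each layer maps a path to file content, or to WHITEOUT for a deletion.
WHITEOUT = object()   # invented marker for "file deleted in this layer"

def merged_view(layers):
    """Apply layers bottom-up; later layers override or delete earlier files."""
    view = {}
    for layer in layers:                  # layers[0] is the lowest (base) layer
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)      # a deletion hides the file
            else:
                view[path] = content      # an addition/modification wins
    return view

base      = {"/bin/sh": "busybox", "/etc/motd": "hello"}
app_layer = {"/app/server.py": "print('hi')", "/etc/motd": WHITEOUT}
print(sorted(merged_view([base, app_layer])))   # ['/app/server.py', '/bin/sh']
```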
-
Background: I/O Path
(Diagram: app processes in the container access the merged view that overlayfs (kernel space) builds over the layer directories; the directories are populated by download, ungzip & untar from the Docker Registry in user space.)
-
DADI Remote Image
• A layered image format
  • based on a virtual block device (Overlay Block Device)
  • works together with a regular file system, e.g. ext4
  • a general solution for the container ecosystem
• Compression
  • with seekable, online decompression (ZFile)
• Scalability
  • peer-to-peer (P2P) on-demand read in a tree-structured topology
-
DADI I/O Path
(Diagram: app processes in the container access a regular file system (ext4, etc.) on a virtual block device in kernel space; the user-space lsmd daemon implements OverlayBD, issuing P2P RPCs and reading ZFile layer blobs for downloaded layers, and using a local file system (ext4, etc.) for new layers.)
-
Overlay Block Device
• Each layer is a change set of overwritten blocks
  • no concept of file or file system
  • 512-byte block size (granularity)
• An index for fast reading
  • variable-length entries, combined to save memory
  • non-overlapping entries sorted by logical offset
  • range query by binary search (sketched below)
(Figure: a pread(offset, length) request is resolved by the index into segments of raw data to read, with holes in between.)
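A minimal sketch of such a range query, assuming a list of non-overlapping (offset, length) segments sorted by logical offset (the Segment type and find_segments helper are illustrative, not DADI's actual API):

```python
import bisect
from dataclasses import dataclass

@dataclass
class Segment:
    offset: int   # logical offset of the segment, in 512-byte blocks
    length: int   # length of the segment, in blocks
    # a real entry would also record which layer blob holds the data

def find_segments(index, offset, length):
    """Return the index entries overlapping [offset, offset + length).

    `index` is a list of non-overlapping Segments sorted by logical offset,
    so a binary search locates the first candidate in O(log n)."""
    end = offset + length
    offsets = [s.offset for s in index]          # kept precomputed in a real index
    i = bisect.bisect_right(offsets, offset)
    if i > 0 and index[i - 1].offset + index[i - 1].length > offset:
        i -= 1                                   # previous segment overlaps the start
    hits = []
    while i < len(index) and index[i].offset < end:
        hits.append(index[i])                    # gaps between hits are holes (zeros)
        i += 1
    return hits

idx = [Segment(0, 2), Segment(15, 87), Segment(150, 1)]   # toy index
print(find_segments(idx, 10, 50))    # -> [Segment(offset=15, length=87)]
```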
-
Index Merge
• The per-layer indexes are merged into a single index: segments from upper layers override overlapping ranges in lower layers (sketched below).
(Figure: an example of merging two layers' segment lists, and a chart of merged index size vs. layer depth for production images: the merged index tops out around 4.5K segments, i.e. 4.5K × 16 bytes = 72KB.)
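A rough sketch of merging two layers' indexes, where the upper layer's segments override overlapping ranges in the lower one (toy (offset, length) tuples only; a real entry also records which layer blob holds the data, and adjacent entries would be combined):

```python
def merge_index(lower, upper):
    """Merge two sorted, non-overlapping segment lists; ranges in `upper`
    override any overlapping ranges in `lower`."""
    covered = [(off, off + length) for off, length in upper]

    def visible_parts(off, length):
        """Yield the sub-ranges of [off, off + length) not covered by `upper`."""
        cur, end = off, off + length
        for c0, c1 in covered:
            if c1 <= cur or c0 >= end:
                continue                      # no overlap with this covered range
            if c0 > cur:
                yield (cur, c0 - cur)         # part before the covered range survives
            cur = max(cur, c1)                # skip the overridden part
        if cur < end:
            yield (cur, end - cur)

    merged = list(upper)
    for off, length in lower:
        merged.extend(visible_parts(off, length))
    return sorted(merged)

lower = [(0, 2), (15, 87), (150, 1)]
upper = [(5, 10), (100, 10)]
print(merge_index(lower, upper))
# [(0, 2), (5, 10), (15, 85), (100, 10), (150, 1)]
```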
-
Index Performance
(Figure, left: index queries per second vs. index size (1K-10K segments); > 6M QPS for production images.
Figure, right: IOPS (bs=8KB, non-cached) vs. I/O queue depth (1-256) for Thin LVM, DADI without compression, and DADI with ZFile.)
-
Writable Layer
• Log-structured design (sketched below)
  • appends index entries and raw data to separate logs
  • maintains an in-memory index (red-black tree)
• Commit writes out only the useful data blocks, in offset order
  • index entries are combined
(Diagram: the writable layer appends to a raw-data log and an index log; commit produces a read-only layer blob consisting of header, raw data, index, and trailer.)
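A minimal sketch of such a log-structured writable layer (file-backed logs, with a sorted list standing in for the red-black tree; WritableLayer and its methods are invented for illustration, not DADI's real code):

```python
import bisect, io

class WritableLayer:
    """Toy sketch of a log-structured writable layer."""

    def __init__(self, data_log, index_log):
        self.data_log = data_log       # append-only log of raw block data
        self.index_log = index_log     # append-only log of index entries
        self.mem_index = []            # sorted (offset, length, data_pos); stands in
                                       # for the in-memory red-black tree

    def write(self, offset, data):
        pos = self.data_log.seek(0, io.SEEK_END)
        self.data_log.write(data)                                        # append raw data
        self.index_log.write(f"{offset} {len(data)} {pos}\n".encode())   # append index entry
        bisect.insort(self.mem_index, (offset, len(data), pos))

    def commit(self, out):
        """Write only the useful data blocks, in offset order; a real commit
        would also combine index entries and resolve overlapping writes."""
        out.write(b"HEADER")
        for offset, length, pos in self.mem_index:
            self.data_log.seek(pos)
            out.write(self.data_log.read(length))
        out.write(b"INDEX TRAILER")    # placeholder for the on-disk index and trailer

layer = WritableLayer(io.BytesIO(), io.BytesIO())
layer.write(4096, b"new block data")
layer.commit(io.BytesIO())
```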
-
ZFile
• A seekable compression format
  • random reads with online decompression (sketched below)
• Compression in fixed-size chunks
  • only the needed chunks are decompressed
• Not tied to DADI
(Diagram: a ZFile consists of a header, compressed chunks, an optional dictionary, an index, and a trailer, wrapping an underlay file such as a DADI layer blob.)
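A minimal sketch of seekable, chunk-wise compression in the spirit of ZFile (zlib and a fixed 64 KiB chunk size are stand-ins; the real format also carries a header, trailer, and optional dictionary):

```python
import zlib

CHUNK = 64 * 1024   # assumed fixed chunk size for this sketch

def zfile_compress(raw):
    """Compress `raw` in fixed-size chunks and build an index of chunk boundaries."""
    chunks, index, pos = [], [0], 0
    for i in range(0, len(raw), CHUNK):
        c = zlib.compress(raw[i:i + CHUNK])
        chunks.append(c)
        pos += len(c)
        index.append(pos)               # compressed offset of each chunk end
    return b"".join(chunks), index

def zfile_pread(blob, index, offset, length):
    """Serve a random read by decompressing only the chunks it touches."""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    out = b""
    for i in range(first, last + 1):
        out += zlib.decompress(blob[index[i]:index[i + 1]])
    skip = offset - first * CHUNK       # trim to the requested byte range
    return out[skip:skip + length]

raw = bytes(range(256)) * 4096                    # 1 MiB of sample data
blob, idx = zfile_compress(raw)
assert zfile_pread(blob, idx, 200_000, 100) == raw[200_000:200_100]
```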
-
On-Demand P2P Transfer
• Tree-structured topology
  • Each P2P node caches recently used data blocks.
  • A request is likely to hit the parent's cache; otherwise the parent forwards the request upward, recursively (sketched below).
(Diagram: in each datacenter, a DADI-Root fetches layer blobs from the Registry over HTTP(S) and serves DADI requests to a tree of DADI-Agents.)
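A toy sketch of the cache-or-forward behavior of a node in the P2P tree (class and method names are invented; the real agents speak an RPC protocol and only the root fetches from the registry over HTTP(S)):

```python
from collections import OrderedDict

class PeerNode:
    def __init__(self, parent=None, capacity=1024):
        self.parent = parent                 # None means this node is the root
        self.cache = OrderedDict()           # LRU cache: (blob, chunk) -> data
        self.capacity = capacity

    def fetch_from_registry(self, blob, chunk):
        raise NotImplementedError            # only the root talks to the registry

    def read(self, blob, chunk):
        key = (blob, chunk)
        if key in self.cache:                # hit: serve from the local cache
            self.cache.move_to_end(key)
            return self.cache[key]
        if self.parent is not None:          # miss: ask the parent, recursively
            data = self.parent.read(blob, chunk)
        else:                                # the root falls back to the registry
            data = self.fetch_from_registry(blob, chunk)
        self.cache[key] = data               # cache for children and later requests
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used block
        return data
```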
-
Evaluations
-
Startup Latency with DADI
(Figure, left: cold-start latency (0-20 s), split into image pull and app launch, for .tgz + overlay2, CRFS, pseudo-Slacker, DADI from Registry, and DADI from P2P Root.
Figure, right: warm startup latency (0-2.4 s) for overlay2, Thin LVM (device mapper), and DADI, on NVMe SSD and Cloud Disk.)
-
Startup Latency with DADI
(Figure, left: startup latency (0-2.4 s) with warm vs. cold cache, comparing app launch with prefetch and plain app launch.
Figure, right: cold startup latency (0-3 s) vs. number of hosts (and containers), 0-40, for pseudo-Slacker and DADI.)
-
Scalability with DADI
(Figure, left: large-scale startup of Agility on 1,000 hosts: number of container instances started (0-10K) vs. time (0-4 s), for three cold startups and one warm startup.
Figure, right: projected hyper-scale startup of Agility, estimated by evaluating a single branch of the P2P tree: estimated startup latency (1.5-3.5 s) vs. number of containers (10K-100K), for 2-, 3-, 4-, and 5-ary trees.)
(Agility is a small application written in Python specifically to assist the test.)
-
I/O Performance
(Figure, left: image scanning with du: time to du all files (0-1.6 s) for overlay2, Thin LVM, and DADI, on NVMe SSD and Cloud Disk.
Figure, right: image scanning with tar: time to tar all files (0-12 s) for the same configurations.)
-
Thanks!