high-performance video streaming - acm sigcomm · 2017. 10. 27. · high-performance video...
Disk|Crypt|Net
High-performance video streaming
Ilias Marinos, Robert Watson (Cambridge), Mark Handley (UCL), Randall Stewart (Netflix)
Modern Video Streaming
• Just lots of HTTP requests for video chunks.
• Client picks chunks to adapt rate.
• Server is pretty dumb – just has to go fast.
• HTTP/1.1 persistent connections.
• TLS becoming important (95% of YouTube traffic).
• More than 50% of US Internet traffic.
• Important to make good use of expensive hardware.

How fast can you go?
New iPlayer setup, Dec 2015:
• nginx on Linux, 24 cores on two Intel Xeon E5-2680v3 processors, 512GB DDR4 RAM, 8.6TB RAID array of SSDs.
• 20Gb/s per server. <- Can we improve performance?
Case study: Netflix
• FreeBSD, but tweaked.
– Asynchronous sendfile()
  • Non-blocking zero copy from disk buffer cache to Net.
– VM scaling
  • Fake NUMA domains to avoid lock contention.
  • Proactive cleanup of disk buffer cache.
– RSS-assisted LRO.
  • Sort incoming packets to buckets based on 5-tuple hash to optimize LRO engine efficacy.
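The RSS-assisted LRO bullet can be illustrated with a small sketch. This is not Netflix's FreeBSD code: the `Packet` type, the function names, and the single-open-entry coalescing model are all assumptions made for illustration. The tie-break on the full 5-tuple is also an assumption, added so that flows whose hashes collide still end up adjacent.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    seq: int     # sequence number of the first payload byte
    length: int  # payload length in bytes


def flow_of(pkt):
    """Return the 5-tuple that identifies the packet's flow."""
    return pkt[:5]


def sort_by_flow_hash(packets, num_buckets=8):
    """Stable sort by (hash bucket, 5-tuple); arrival order within a flow is kept."""
    return sorted(packets, key=lambda p: (hash(flow_of(p)) % num_buckets, flow_of(p)))


def coalesce(packets):
    """Toy LRO with one open entry: merge only adjacent, contiguous, same-flow packets."""
    merged = []
    for pkt in packets:
        if merged:
            last = merged[-1]
            if flow_of(last) == flow_of(pkt) and last.seq + last.length == pkt.seq:
                merged[-1] = last._replace(length=last.length + pkt.length)
                continue
        merged.append(pkt)
    return merged
```

With two flows interleaved in one receive batch, `coalesce(batch)` merges nothing, while `coalesce(sort_by_flow_hash(batch))` yields one large segment per flow.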
Let's Do Some Experiments
• 8-core Haswell server, 2x 40GbE NICs, 128GB RAM, 4x Intel P3700 NVMe disks.
• Linux clients.
• Synthetic workload, middlebox for realistic RTT.
[Figure: testbed. Clients (C), 40GbE switch, middlebox, and streamer; delay labels in ms and μs.]
Unencrypted video streaming workload
[Plots: data NOT in disk buffer cache vs. data comes from disk buffer cache.]
• CPU utilization doubles (~2x) when fetching from disk (~350% -> ~700%).
Conclusions
• Netflix improvements good.
• CPU utilization is a problem.
Encryption Problem:
Sendfile:
• Zero copy from disk buffer cache.
TLS:
• Different encrypted stream per user.
• Kernel is unaware of TLS.
Sendfile and TLS are fundamentally incompatible!
• Conventional TLS stack gave Netflix 20 -> 8.5Gb/s.
• Netflix implemented in-kernel TLS support for sendfile!
sendfile() NOT zero copy anymore!
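A minimal sketch of why per-user encryption breaks the sendfile path, using Python's standard `socket.sendfile()` over a local socket pair. The XOR "cipher" is a toy stand-in for TLS, not real encryption, and the helper names (`send_plain`, `send_encrypted`, `recv_exact`, `demo`) are hypothetical. It shows the user-space variant only; it does not model Netflix's in-kernel TLS.

```python
import os
import socket
import tempfile


def send_plain(sock, path):
    """Plaintext: hand the file to the kernel, which uses sendfile where the OS supports it."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


def send_encrypted(sock, path, key, chunk_size=16384):
    """Per-user stream: every byte is read into user space, transformed, and written back."""
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            # Toy stand-in for a TLS record cipher. NOT real encryption.
            ciphertext = bytes(b ^ key for b in chunk)
            sock.sendall(ciphertext)
            sent += len(ciphertext)
    return sent


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed early")
        buf += part
    return bytes(buf)


def demo(payload=bytes(range(256)) * 16, key=0x5A):
    """Send one small file both ways and return what arrived on the wire."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    tx, rx = socket.socketpair()
    try:
        send_plain(tx, tmp.name)
        plain_wire = recv_exact(rx, len(payload))
        send_encrypted(tx, tmp.name, key)
        cipher_wire = recv_exact(rx, len(payload))
    finally:
        tx.close()
        rx.close()
        os.unlink(tmp.name)
    return plain_wire, cipher_wire
```

In `send_plain` the application never touches the payload. In `send_encrypted` each chunk is copied into the process, rewritten, and copied out again, which is the extra memory traffic the slide points at.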
Encrypted video streaming workload
• Performance loss (~30%) when content fetched from SSDs.
• CPU is saturated. Memory read throughput ~3x more than network throughput!
What’s happening?
[Figure: data path through NVMe, DRAM, LLC, CPU and NIC. Labels: buffer cache, copy, copied data, AES, encrypted data, TCP; steps 1-3.]
The stack is too asynchronous. Data keeps getting flushed from the LLC, and re-loaded. System is bottlenecked on memory.
Production Netflix Workload
• 192GB for buffer cache, but only 10% hit ratio.
• Streamers bottlenecked in memory bandwidth.
✓ Modern NVMe SSDs have low latency & high throughput.
✓ Modern Intel CPUs DMA directly to L3 cache.
Can we eliminate the disk buffer cache completely, and fetch everything from the SSDs on-demand?
Ideal Stack
[Figure: ideal data path. Labels: NVMe, DRAM, LLC, CPU, AES, TCP, NIC, re-use buffer.]
To achieve this, we must:
• Fetch on demand from the SSD when TCP needs data.
• As soon as the SSD returns data, process it to completion and DMA it to the NIC.
Solution Outline
1. A TCP ACK arrives, freeing up congestion window.
2. Trigger stack to request more data from SSDs to fill that congestion window.
3. SSDs return data, placing it in the LLC.
4. Read completion event causes application to encrypt the data in-place, add TCP headers, and trigger the transmission of the packets.
5. Network completion event frees the buffer, allowing it to be reused for a later disk read.
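The five steps above can be sketched as a single-threaded toy model. Nothing here is Atlas code: `ToyStreamer`, its method names, the fixed chunk size, the `b"HDR"` header, and the XOR "cipher" are all assumptions made for illustration, and the disk, NIC and TCP are simulated with in-memory queues.

```python
from collections import deque


class ToyStreamer:
    """Toy model of the ACK-clocked pipeline; no real disk, NIC or TCP."""

    def __init__(self, content, num_buffers=4, chunk=4, key=0x5A):
        self.content = content
        self.offset = 0            # next byte of content to fetch
        self.chunk = chunk
        self.key = key
        self.free = [bytearray(chunk) for _ in range(num_buffers)]
        self.disk_cq = deque()     # completed disk reads: (buffer, length)
        self.nic_cq = deque()      # buffers whose transmission has completed
        self.wire = []             # packets handed to the simulated NIC

    def on_ack(self, window_bytes):
        # Steps 1-2: an ACK opened the window; request only enough data to fill it.
        while window_bytes >= self.chunk and self.free and self.offset < len(self.content):
            buf = self.free.pop()
            data = self.content[self.offset:self.offset + self.chunk]
            buf[:len(data)] = data             # step 3: stand-in for the SSD's DMA write
            self.disk_cq.append((buf, len(data)))
            self.offset += len(data)
            window_bytes -= self.chunk

    def on_disk_completions(self):
        # Step 4: encrypt in place, prepend a header, transmit.
        while self.disk_cq:
            buf, n = self.disk_cq.popleft()
            for i in range(n):
                buf[i] ^= self.key             # toy cipher, NOT real encryption
            # The bytes() copy stands in for the NIC reading the buffer by DMA.
            self.wire.append(b"HDR" + bytes(buf[:n]))
            self.nic_cq.append(buf)

    def on_nic_completions(self):
        # Step 5: transmission finished; the buffer can back a later disk read.
        while self.nic_cq:
            self.free.append(self.nic_cq.popleft())
```

Because reads are issued only when an ACK opens the window and a free buffer exists, the amount of data in flight between disk and NIC is bounded by the buffer pool, which is the property the outline relies on.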
Conventional OS stack NOT suitable:
➢ Highly asynchronous; storage and network stack are loosely coupled -- relies on VFS & Buffer Cache.
➢ Introduces overheads related to abstraction layers (VFS, POSIX etc), redundant memory copies and domain transitions (user <-> kernel).
The Atlas Streaming Stack
Atlas: a complete user-space stack
➢ TCP/IP stack based on modified version of Sandstorm (SIGCOMM’14) and netmap (ATC’12).
➢ Storage handled using diskmap (no buffer cache, no sophisticated FS).
➢ Lockless, full zero-copy stack from disk <-> NIC.
➢ Tight pipeline to reduce asynchrony, and ideally save memory bandwidth (w/ DDIO).
Diskmap Architecture
Diskmap: a kernel-bypass I/O framework for NVMe disks
[Figure: diskmap architecture. User space: apps linked with libnvme, each with its own SQ/CQ pair (nvme0-1, nvme0-2) and memory-mapped buffers, on cores C0 and C1. Kernel: admin qpairs. Hardware: PCIe NVMe disk, DMA through the I/OMMU.]
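The SQ/CQ pairs in the diskmap figure follow the usual NVMe pattern of a submission queue written by the application and a completion queue written by the device. The sketch below is a toy model of one such queue pair, not the libnvme or diskmap API: `ToyQueuePair` and its methods are hypothetical names, and the device is simulated by a `bytes` object.

```python
from collections import deque


class ToyQueuePair:
    """Toy model of one I/O queue pair; all names here are hypothetical."""

    def __init__(self, disk, depth=4):
        self.disk = disk          # bytes object standing in for the device
        self.depth = depth        # maximum commands outstanding
        self.sq = deque()         # submission queue: commands written by the app
        self.cq = deque()         # completion queue: entries written by the device
        self.inflight = 0

    def submit_read(self, cmd_id, offset, buf):
        """App side: queue a read into a pre-mapped buffer; refuse if the queue is full."""
        if self.inflight >= self.depth:
            return False
        self.sq.append((cmd_id, offset, buf))
        self.inflight += 1
        return True

    def device_process(self):
        """Device side: execute queued commands and post completions."""
        while self.sq:
            cmd_id, offset, buf = self.sq.popleft()
            data = self.disk[offset:offset + len(buf)]
            buf[:len(data)] = data            # stand-in for DMA into the app's buffer
            self.cq.append((cmd_id, len(data)))

    def poll_completions(self):
        """App side: reap finished commands by polling, without blocking."""
        done = []
        while self.cq:
            done.append(self.cq.popleft())
            self.inflight -= 1
        return done
```

The application owns both the queues and the buffers, so submitting a read and noticing its completion need no system call in this model; that is the kernel-bypass idea the slide names.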
The Atlas Execution Pipeline
[Figure: Atlas execution pipeline. User space: web server, TCP/IP, libnmio, libnvme, and buffers. Below the kernel boundary: NIC RX/TX rings and NVMe disk SQ/CQ. Steps numbered 1-7.]
Atlas vs. Netflix, Unencrypted Content
[Plots: throughput (Gb/s) and LLC misses/s (x10^7).]
• Netflix needs 8 cores, Atlas only needs 4.
• 15% better throughput than Netflix when cache hit ratio is low.
• Almost no CPU stalls: data in LLC when we want it.
Atlas vs. Netflix, Encrypted Content
[Plots: throughput (Gb/s) and memory read/throughput.]
• When cache hit ratio is low, 50% more throughput using half the cores.
• Almost half the memory reads for each packet sent.
Atlas memory usage
[Figures: data path in two cases, when LLC/CPU is NOT saturated and when LLC/CPU is saturated. Labels in both: NVMe, DRAM, LLC, CPU, AES, TCP, TCP packets, NIC, re-use buffer.]
Netmap doesn’t provide a low-delay fine-grained way to communicate DMA completions. Can’t reuse buffers fast enough (no LIFO stack), and this contributes to some extra cache pressure.
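A small sketch of why a LIFO free list matters for cache pressure. It assumes, hypothetically, that completions arrive promptly so every buffer is returned before the next round of reads; the function name and parameters are made up for illustration and do not describe netmap or Atlas internals.

```python
from collections import deque


def buffers_touched(policy, num_buffers=64, in_flight=4, rounds=100):
    """Count distinct buffers used when `in_flight` buffers are taken and returned each round."""
    free = deque(range(num_buffers))
    touched = set()
    for _ in range(rounds):
        batch = []
        for _ in range(in_flight):
            # LIFO takes the most recently freed buffer; FIFO takes the oldest.
            buf = free.pop() if policy == "lifo" else free.popleft()
            batch.append(buf)
            touched.add(buf)
        # Assumption: all completions arrive before the next round starts.
        free.extend(batch)
    return len(touched)
```

Under these assumptions the LIFO pool keeps cycling through the same 4 buffers, which can stay resident in the LLC, while the FIFO pool walks through all 64. Delayed completions push a real system toward the FIFO-like behaviour.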
Summary
• Netflix addressed all the low-hanging fruit
  – Very fast, but now bottlenecked on memory
• Atlas is a specialized stack
  – Puts SSD directly in TCP control loop
  – Immediately processes disk reads to completion and transmits.
  – 50% throughput improvement with encrypted content, close to 50% reduction in memory reads
• Netflix inspired by Atlas
  – Now experimenting with how to directly trigger encryption off of disk DMA completions in their FreeBSD stack.