linux networking internals
Slides for a course about the Linux kernel network stack.

The Linux Network Subsystem
Covers Linux version 2.6.25
Version 1.1
Copyright 2004-2006, Michael Opdenacker. Copyright 2003-2006, Oron Peled. Copyright 2004-2006, Codefidence Ltd.
For full copyright information see last page. Creative Commons Attribution-ShareAlike 2.0 license.

Rights to copy
Attribution-ShareAlike 2.0
You are free:
    to copy, distribute, display, and perform the work
    to make derivative works
    to make commercial use of the work
Under the following conditions:
    Attribution. You must give the original author credit.
    Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode

This kit contains work by the following authors:
    Copyright 2004-2006 Michael Opdenacker, [email protected], http://www.free-electrons.com
    Copyright 2003-2006 Oron Peled, [email protected], http://www.actcom.co.il/~oron
    Copyright 2004-2008 Codefidence Ltd., [email protected], http://www.codefidence.com

What is Linux?
Linux is a kernel that implements the POSIX and Single Unix Specification standards and is developed as an Open Source project.
When one talks of "installing Linux", one is referring to a Linux distribution: a combination of Linux and other programs and libraries that form an operating system.
Linux runs on 24 main platforms and supports applications ranging from ccNUMA super clusters to cellular phones and microcontrollers.
Linux is 15 years old, but is based on the 40-year-old Unix design philosophy.

Layers in a Linux system
[Diagram: the layers of a Linux system, from the bottom up]
    Kernel
    Kernel modules
    C library
    System libraries
    Application libraries
    User programs

Kernel architecture
[Diagram: kernel architecture]
    User space: App1, App2, C library.
    Kernel space: system call interface; process management, memory management, filesystem support, device control, networking; CPU support code, CPU/MMU support code, filesystem types, storage drivers, character device drivers, network device drivers.
    Hardware: CPU, RAM, storage, ...

Kernel Mode vs. User Mode
All modern CPUs support a dual mode of operation:
    User mode, for regular tasks.
    Supervisor (or privileged) mode, for the kernel.
The mode the CPU is in determines which instructions the CPU is willing to execute:
    "Sensitive" instructions will not be executed when the CPU is in user mode.
The CPU mode is determined by one of the CPU registers, which stores the current "Ring Level":
    0 for supervisor mode, 3 for user mode, 1-2 unused by Linux.

The System Call Interface
When a user space task needs to use a kernel service, it makes a "System Call".
The C library places the parameters and the number of the system call in registers and then issues a special trap instruction.
The trap atomically changes the ring level to supervisor mode and sets the instruction pointer to the kernel.
The kernel finds the required system call via the system call table and executes it.
Returning from the system call does not require a special instruction, since in supervisor mode the ring level can be changed directly.

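The user space side of this path can be sketched with the C library's generic syscall() wrapper, which loads the system call number and arguments into registers and issues the trap. This is a minimal sketch assuming a Linux system with glibc; the function name raw_getpid is made up for illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: invoke the getpid system call by number,
 * bypassing the dedicated getpid() wrapper in the C library. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both raw_getpid() and the ordinary getpid() wrapper end up in the same kernel entry, so they return the same process ID.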
Linux System Call Path
[Diagram: a task makes a function call into Glibc; Glibc issues a trap into the kernel, where entry.S dispatches to sys_name(), which calls do_name().]

Linux networking Subsystem Overview
[Diagram: applications (App 1, App 2, App 3) sit on top of the socket layer; below it is the networking stack (TCP, UDP, ICMP, IP, bridge), then the driver interface, the drivers, and the hardware.]

Network Device Driver Hardware Interface
[Diagram: the driver and the device share Tx and Rx descriptor rings in memory. Tx descriptor states: Send, Sent OK, Send Err, Free. Rx descriptor states: Free, Rcv OK, Rcv Err, Recv CRC. The driver reaches the device through memory-mapped register access; the device reaches the packet buffers through DMA memory access and signals the driver with interrupts.]
Driver allocates ring buffers.
Driver resets descriptors to initial state.
Driver puts packets to be sent in Tx buffers.
Device puts received packets in Rx buffers.
Driver and device update descriptors to indicate state.
Device indicates Rx and end of Tx with an interrupt, unless interrupt mitigation techniques are applied.

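The descriptor hand-off described above can be modeled in plain user space C. This is only a sketch under stated assumptions: there is no real hardware or DMA, the "device" is simulated by a function call, and every name (tx_ring, ring_send, ring_complete, ring_demo) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define RING_SIZE 4
#define BUF_SIZE  64

/* Who may touch a descriptor right now. */
enum desc_owner { OWNER_DRIVER, OWNER_DEVICE };

struct tx_desc {
    enum desc_owner owner;
    size_t len;
    unsigned char buf[BUF_SIZE];
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned int head;   /* next descriptor the driver fills */
    unsigned int tail;   /* next descriptor the device completes */
};

/* Driver resets all descriptors to their initial (free) state. */
void ring_init(struct tx_ring *r)
{
    unsigned int i;

    for (i = 0; i < RING_SIZE; i++) {
        r->desc[i].owner = OWNER_DRIVER;
        r->desc[i].len = 0;
    }
    r->head = r->tail = 0;
}

/* Driver side: queue one packet.
 * Returns 0 on success, -1 if the packet is too large or the ring is full. */
int ring_send(struct tx_ring *r, const void *pkt, size_t len)
{
    struct tx_desc *d = &r->desc[r->head];

    if (len > BUF_SIZE || d->owner == OWNER_DEVICE)
        return -1;
    memcpy(d->buf, pkt, len);
    d->len = len;
    d->owner = OWNER_DEVICE;              /* hand the descriptor to the device */
    r->head = (r->head + 1) % RING_SIZE;
    return 0;
}

/* Simulated device side: complete the oldest pending descriptor.
 * Returns the number of bytes "sent", or 0 if nothing was pending. */
size_t ring_complete(struct tx_ring *r)
{
    struct tx_desc *d = &r->desc[r->tail];
    size_t len;

    if (d->owner != OWNER_DEVICE)
        return 0;
    len = d->len;
    d->owner = OWNER_DRIVER;              /* give the descriptor back */
    r->tail = (r->tail + 1) % RING_SIZE;
    return len;
}

/* Fill the ring, check that one extra packet is refused, then free a slot. */
int ring_demo(void)
{
    struct tx_ring r;
    int i, queued = 0;

    ring_init(&r);
    for (i = 0; i < RING_SIZE + 1; i++)
        if (ring_send(&r, "ping", 4) == 0)
            queued++;
    if (queued != RING_SIZE)
        return -1;
    if (ring_complete(&r) != 4)
        return -1;
    return ring_send(&r, "pong", 4);      /* succeeds now that a slot is free */
}
```

A real driver would stop its transmit queue when ring_send() reports a full ring and restart it from the Tx-complete interrupt.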
Network Device Registration
Each network device is represented by a struct net_device.
These are allocated using:
    struct net_device *alloc_netdev(size, mask, setup_func);
        size: size of our private data part
        mask: a naming pattern (e.g. "eth%d")
        setup_func: a function that sets up the rest of net_device.
And registered via a call to:
    int register_netdev(struct net_device *dev);

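The allocate-then-register flow can be illustrated with a user space model. This is not kernel code: the model_ names, the mtu field, and the explicit index argument are assumptions made for the sketch. It shows one allocation holding the generic device structure followed by the driver's private area, a setup callback, and the naming pattern being resolved at registration.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NAME_LEN 16

/* Hypothetical stand-in for struct net_device. */
struct model_netdev {
    char name[MODEL_NAME_LEN];
    int mtu;
    max_align_t priv[];          /* driver-private area follows the struct */
};

/* One allocation for the generic part plus priv_size bytes of private data. */
struct model_netdev *model_alloc_netdev(size_t priv_size, const char *mask,
                                        void (*setup)(struct model_netdev *))
{
    struct model_netdev *dev = calloc(1, sizeof(*dev) + priv_size);

    if (!dev)
        return NULL;
    snprintf(dev->name, sizeof(dev->name), "%s", mask);
    if (setup)
        setup(dev);
    return dev;
}

void *model_netdev_priv(struct model_netdev *dev)
{
    return dev->priv;
}

/* Resolve a "%d" in the name pattern. The model takes the index as an
 * argument instead of searching for a free one. */
int model_register_netdev(struct model_netdev *dev, int index)
{
    char resolved[MODEL_NAME_LEN];
    char *pct = strstr(dev->name, "%d");

    if (pct) {
        int n = snprintf(resolved, sizeof(resolved), "%.*s%d",
                         (int)(pct - dev->name), dev->name, index);
        if (n < 0 || (size_t)n >= sizeof(resolved))
            return -1;
        strcpy(dev->name, resolved);
    }
    return 0;
}

struct model_priv { int irq; };

static void model_setup(struct model_netdev *dev)
{
    dev->mtu = 1500;
}

int netdev_demo(void)
{
    struct model_netdev *dev;
    struct model_priv *priv;
    int ok;

    dev = model_alloc_netdev(sizeof(struct model_priv), "eth%d", model_setup);
    if (!dev)
        return -1;
    priv = model_netdev_priv(dev);
    priv->irq = 5;
    ok = model_register_netdev(dev, 0) == 0 &&
         strcmp(dev->name, "eth0") == 0 &&
         dev->mtu == 1500 && priv->irq == 5;
    free(dev);
    return ok ? 0 : -1;
}
```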
Network Device Initialization
The net_device structure is initialized with numerous methods and flags by the setup function:
    open: requests resources, registers interrupts, starts queues.
    stop: deallocates resources, unregisters the irq, stops the queue.
    get_stats: reports statistics.
    set_multicast_list: configures the device for multicast.
    hard_start_xmit: called by the stack to initiate Tx.
    IFF_MULTICAST: device supports multicast.
    IFF_NOARP: device does not support the ARP protocol.

Packet Representation
We need to manipulate packets through the stack.
This manipulation involves efficiently:
    Adding protocol headers/trailers down the stack.
    Removing protocol headers/trailers up the stack.
Packets can be chained together.
Each protocol should have convenient access to header fields.
To do all this the kernel uses the sk_buff structure.

Socket Buffers
The sk_buff structure represents a single packet.
This structure is passed through the protocol stack.
It holds pointers to buffers with the packet data.
It holds many other types of information:
    Data size.
    Incoming device.
    Priority.
    Security ...

struct sk_buff
    next:           Next buffer in list
    prev:           Previous buffer in list
    sk:             Socket we are owned by
    tstamp:         Time we arrived
    dev:            Device we arrived on/are leaving by
    input_dev:      Device we arrived on
    h:              Transport layer header
    nh:             Network layer header
    mac:            Link layer header
    dst:            Destination route cache entry
    sp:             Security path, used for xfrm
    cb:             Control buffer. Private data.
    len:            Length of actual data
    data_len:       Data length
    mac_len:        Length of link layer header
    csum:           Checksum
    local_df:       Allow local fragmentation flag
    cloned:         Head may be cloned (see refcnt)
    nohdr:          Payload reference only flag
    pkt_type:       Packet class
    fclone:         Clone status
    ip_summed:      Driver fed us an IP checksum
    priority:       Packet queuing priority
    users:          User count, see {datagram,tcp}.c
    protocol:       Packet protocol from driver
    truesize:       Buffer size
    head:           Head of buffer
    data:           Data head pointer
    tail:           Tail pointer
    end:            End pointer
    nfmark:         Netfilter hooks private data
    nfct:           Associated connection, if any
    nfctinfo:       Connection tracking info
    nf_bridge:      Saved data about a bridged frame
    tc_index:       Traffic control index
    tc_verd:        Traffic control verdict
    secmark:        Security marking for LSM
    destructor:     Destruct function
    ipvs_property:  skbuff is owned by ipvs
    nfct_reasm:     Netfilter conntrack reassembly pointer
    dma_cookie:     DMA operation cookie

Socket Buffer Diagram
[Diagram: struct sk_buff (len, ..., head, data, tail, end, ..., dev) points into a data buffer laid out as headroom, Ethernet header, IP header, TCP header, payload, padding. The buffer is followed by struct skb_shared_info, which references the fragments frag1, frag2, frag3.]
Note: the network chip must support Scatter/Gather to make use of frags. Otherwise the kernel must copy the buffers before send!

Socket Buffer Operations
skb_put: add data to the end of a buffer.
skb_push: add data to the start of a buffer.
skb_pull: remove data from the start of a buffer.
skb_headroom: returns free bytes at buffer head.
skb_tailroom: returns free bytes at buffer end.
skb_reserve: adjust headroom.
skb_trim: remove end from a buffer.

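How these operations move the head, data, tail and end pointers can be sketched with a small user space model. The mini_ names below are hypothetical and only mirror the pointer arithmetic; the model returns NULL where the kernel would panic, and it has no cloning, fragments or reference counting.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct sk_buff. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

struct mini_skb *mini_alloc(unsigned int size)
{
    struct mini_skb *skb = malloc(sizeof(*skb));

    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;   /* empty buffer, no headroom yet */
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

void mini_free(struct mini_skb *skb)
{
    free(skb->head);
    free(skb);
}

unsigned int mini_headroom(const struct mini_skb *skb)
{
    return (unsigned int)(skb->data - skb->head);
}

unsigned int mini_tailroom(const struct mini_skb *skb)
{
    return (unsigned int)(skb->end - skb->tail);
}

/* Adjust headroom; only meaningful on an empty buffer. */
void mini_reserve(struct mini_skb *skb, unsigned int len)
{
    skb->data += len;
    skb->tail += len;
}

/* Extend the data area at the end; returns the start of the new bytes. */
unsigned char *mini_put(struct mini_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    if (len > mini_tailroom(skb))
        return NULL;
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Extend the data area at the start, e.g. to prepend a header. */
unsigned char *mini_push(struct mini_skb *skb, unsigned int len)
{
    if (len > mini_headroom(skb))
        return NULL;
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Drop bytes from the start, e.g. to strip a header. */
unsigned char *mini_pull(struct mini_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->data += len;
    skb->len -= len;
    return skb->data;
}

/* Build payload + IP + Ethernet going down, then strip Ethernet going up. */
int skb_demo(void)
{
    struct mini_skb *skb = mini_alloc(128);
    int ok;

    if (!skb)
        return -1;
    mini_reserve(skb, 34);   /* room for Ethernet (14) and IP (20) headers */
    mini_put(skb, 50);       /* payload */
    mini_push(skb, 20);      /* IP header */
    mini_push(skb, 14);      /* Ethernet header */
    ok = skb->len == 84 && mini_headroom(skb) == 0 &&
         mini_tailroom(skb) == 44;
    mini_pull(skb, 14);      /* receiver strips the Ethernet header */
    ok = ok && skb->len == 70 && mini_headroom(skb) == 14;
    mini_free(skb);
    return ok ? 0 : -1;
}
```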
Operation Example: skb_put
    unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
Adds data to a buffer:
    skb: buffer to use
    len: amount of data to add
This function extends the used data area of the buffer.
If this would exceed the total buffer size the kernel will panic.
A pointer to the first byte of the extra data is returned.

Socket Buffer Alignment
CPUs often take a performance hit when accessing unaligned memory locations.
Since an Ethernet header is 14 bytes, network drivers often end up with the IP header at an unaligned offset.
The IP header can be aligned by shifting the start of the packet by 2 bytes. Drivers should do this with:
    skb_reserve(skb, NET_IP_ALIGN);
The downside is that the DMA is now unaligned. On some architectures the cost of an unaligned DMA outweighs the gains, so NET_IP_ALIGN is set on a per-arch basis.

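The arithmetic behind the 2-byte shift can be written out as a short sketch. It assumes the buffer itself starts on an aligned address; the helper names are made up for illustration.

```c
#define ETH_HEADER_LEN 14   /* size of an Ethernet header in bytes */

/* Offset of the IP header from the start of the buffer, given the number
 * of bytes reserved in front of the Ethernet header. */
unsigned int ip_header_offset(unsigned int reserve)
{
    return reserve + ETH_HEADER_LEN;
}

/* Non-zero if the IP header lands on a multiple of `alignment` bytes. */
int ip_header_is_aligned(unsigned int reserve, unsigned int alignment)
{
    return ip_header_offset(reserve) % alignment == 0;
}
```

With no reserve the IP header sits at offset 14, which is not a multiple of 4. Reserving 2 bytes moves it to offset 16.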
Socket Buffer Padding
The networking layer reserves some headroom in skb data. This is used to avoid having to reallocate skb data when the header has to grow.
In the default case, if the header has to grow 16 bytes or less we avoid the reallocation.
Unfortunately, this headroom changes the DMA alignment of the resulting network packet. As for NET_IP_ALIGN, this unaligned DMA is expensive on some architectures.
Therefore an architecture can override this value, as long as at least 16 bytes of free headroom are there.

Socket Buffer Allocations
dev_alloc_skb: allocate an skbuff for Rx.
netdev_alloc_skb: allocate an skbuff for Rx, on a specific device.
Both allocate a new sk_buff and assign it a usage count of one.
The buffer has unspecified headroom built in. Users should allocate the headroom they think they need without accounting for the built-in space. The built-in space is used for optimizations.
NULL is returned if there is no free memory.
Although these functions allocate memory, they can be called from an interrupt.

sk_buff Allocation Example
Immediately after allocation, we should reserve the needed headroom:
    struct sk_buff *skb;
    skb = dev_alloc_skb(1500);
    if (unlikely(!skb))
        break;
    /* Mark as being used by this device */
    skb->dev = dev;
    /* Align IP on 16 byte boundaries */
    skb_reserve(skb, NET_IP_ALIGN);

Softnet
The network stack is implemented as a pair of softirqs to parallelize packet handling on SMP machines:
    NET_TX_SOFTIRQ: feeds packets from the network stack to the driver.
    NET_RX_SOFTIRQ: feeds packets from the driver to the network stack.
Like any other softirq, these are called on return from interrupt or via the low-priority ksoftirqd kernel thread.
Transmit/receive queues are stored in the per-CPU softnet_data.

Linux Contexts
[Diagram: execution contexts. Interrupt context: interrupt handlers and SoftIRQs (hi-prio tasklets, net stack, timers, regular tasklets, ...). User context: processes, threads and kernel threads. The diagram also marks the kernel space / user space boundary and the place of the network interface device driver.]

Packet Reception
The driver allocates an skb and sets up a descriptor in the ring buffers for the hardware.
The driver Rx interrupt handler calls netif_rx(skb).
netif_rx deposits the sk_buff in the per-CPU input queue and marks the NET_RX_SOFTIRQ to run.
At SoftIRQ processing time, net_rx_action() is called by NET_RX_SOFTIRQ, which calls the driver poll() method to feed the packet up.
Normally poll() is set to process_backlog() by net_dev_init().

Packet Rx Overview
[Diagram: packet Rx overview; no text survives from the figure.]

Packet Transmission
Each network device defines a method:
    int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
This function is indirectly called from the NET_TX_SOFTIRQ.
Calls are serialized via the lock dev->xmit_lock_owner.
The driver manages the transmit queue during interface up and down, or to signal back pressure, using the following functions:
    void netif_start_queue(struct net_device *net);
    void netif_stop_queue(struct net_device *net);
    void netif_wake_queue(struct net_device *net);

Packet Tx Overview
[Diagram: packet Tx overview; no text survives from the figure.]

NAPI
Network New API.
Provides interrupt mitigation.
Requirements:
    A DMA ring buffer.
    Ability to turn off receive interrupts or events.
It is used by defining a new method:
    int (*poll)(struct net_device *dev, int *budget);
which is called by the network stack periodically when signaled by the driver to do so.

NAPI (cont.)
When a receive interrupt occurs, the driver:
    Turns off receive interrupts.
    Calls netif_rx_schedule(dev) to get the stack to start calling its poll method.
The poll method:
    Scans the receive ring buffers, feeding packets to the stack via netif_receive_skb(skb).
    If work finished within the budget parameter, re-enables interrupts and calls netif_rx_complete(dev).
    Else, the stack will call the poll method again.

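The interrupt-then-poll cycle can be modeled in user space. This is a sketch of the control flow only: the model_ names are hypothetical, there is no real device, and the comments mark which kernel call each step stands in for.

```c
/* Hypothetical model of a NIC using interrupt mitigation. */
struct model_nic {
    int rx_pending;    /* packets waiting in the Rx ring */
    int irq_enabled;   /* 1 while Rx interrupts are enabled */
    int scheduled;     /* 1 while the stack should keep polling */
    int delivered;     /* packets handed to the stack so far */
};

/* Rx interrupt handler: turn off Rx interrupts and ask to be polled. */
void model_rx_interrupt(struct model_nic *nic)
{
    nic->irq_enabled = 0;
    nic->scheduled = 1;              /* stands in for netif_rx_schedule() */
}

/* Poll method: process at most `budget` packets, return how many. */
int model_poll(struct model_nic *nic, int budget)
{
    int done = 0;

    while (done < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->delivered++;            /* stands in for netif_receive_skb() */
        done++;
    }
    if (nic->rx_pending == 0) {      /* ring drained within the budget */
        nic->scheduled = 0;          /* stands in for netif_rx_complete() */
        nic->irq_enabled = 1;
    }
    return done;
}

/* Stack side: keep polling while the device is scheduled.
 * Returns the number of poll rounds that were needed. */
int model_softirq(struct model_nic *nic, int budget)
{
    int rounds = 0;

    while (nic->scheduled) {
        model_poll(nic, budget);
        rounds++;
    }
    return rounds;
}

/* Ten packets with a budget of four take three rounds and one interrupt. */
int napi_demo(void)
{
    struct model_nic nic = { 10, 1, 0, 0 };
    int rounds;

    model_rx_interrupt(&nic);
    rounds = model_softirq(&nic, 4);
    return (rounds == 3 && nic.delivered == 10 && nic.irq_enabled == 1)
               ? 0 : -1;
}
```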
Routing
After the socket buffer is delivered to a protocol handler, the handler may decide to route the packet.
The default routing uses the normal destination-based routing with a single table and a FIB destination cache.
For each packet the routing destination is looked up in the FIB cache. If found, the packet is sent to that interface driver.
Otherwise a more costly routing decision based on rules occurs and the result is stored in the FIB.

What is Netfilter?
Netfilter is a framework for packet mangling.
Each protocol defines "hooks" (IPv4 defines 5) which are well-defined points in a packet's traversal of that protocol stack.
At each of these points, the protocol will call the netfilter framework with the packet and the hook number.
Parts of the kernel can register to listen to the different hooks for each protocol.
When a packet is passed to the netfilter framework, it will call all registered callbacks for that hook and protocol.

Netfilter Architecture
[Diagram: Ingress -> Pre Routing -> Route -> Forward -> Post Routing -> Egress. After the first routing decision, packets for this host go through Local In to the local sockets; packets from local sockets go through Local Out and a routing decision to Post Routing.]

Netfilter Hook
Kernel code can register a callback function to be called when a packet arrives at each hook, and is free to manipulate the packet.
The callback can then tell netfilter to do one of five things:
    NF_ACCEPT: continue traversal as normal.
    NF_DROP: drop the packet; don't continue traversal.
    NF_STOLEN: I've taken over the packet; stop traversal.
    NF_QUEUE: queue the packet (usually for user space handling).
    NF_REPEAT: call this hook again.

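The way a chain of callbacks and their verdicts steer a packet can be sketched as a user space model. The MODEL_ verdicts, the packet structure and the two example hooks are made up for illustration; this is not the kernel's registration API.

```c
#include <stddef.h>

/* Hypothetical verdicts mirroring the five listed above. */
enum model_verdict {
    MODEL_DROP, MODEL_ACCEPT, MODEL_STOLEN, MODEL_QUEUE, MODEL_REPEAT
};

struct model_packet {
    unsigned int dst_port;
    unsigned int mark;
};

typedef enum model_verdict (*model_hook_fn)(struct model_packet *pkt);

/* Call each registered callback in order until one ends the traversal. */
enum model_verdict model_run_hooks(const model_hook_fn *hooks, size_t n,
                                   struct model_packet *pkt)
{
    size_t i = 0;

    while (i < n) {
        enum model_verdict v = hooks[i](pkt);

        if (v == MODEL_REPEAT)
            continue;                /* call the same hook again */
        if (v != MODEL_ACCEPT)
            return v;                /* DROP, STOLEN and QUEUE stop here */
        i++;
    }
    return MODEL_ACCEPT;
}

/* Example hook: mark the packet, asking to be called once more. */
static enum model_verdict mark_hook(struct model_packet *pkt)
{
    if (pkt->mark == 0) {
        pkt->mark = 1;
        return MODEL_REPEAT;
    }
    return MODEL_ACCEPT;
}

/* Example hook: drop anything addressed to port 23. */
static enum model_verdict drop_telnet_hook(struct model_packet *pkt)
{
    return pkt->dst_port == 23 ? MODEL_DROP : MODEL_ACCEPT;
}

int netfilter_demo(void)
{
    const model_hook_fn hooks[] = { mark_hook, drop_telnet_hook };
    struct model_packet web = { 80, 0 };
    struct model_packet telnet = { 23, 0 };

    if (model_run_hooks(hooks, 2, &web) != MODEL_ACCEPT)
        return -1;
    if (model_run_hooks(hooks, 2, &telnet) != MODEL_DROP)
        return -1;
    return (web.mark == 1 && telnet.mark == 1) ? 0 : -1;
}
```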
IP Tables
A packet selection system called IP Tables has been built over the netfilter framework.
It is a direct descendant of ipchains (that came from ipfwadm, that came from BSD's ipfw IIRC), with extensibility.
Kernel modules can register a new table, and ask for a packet to traverse a given table.
This packet selection method is used for packet filtering (the `filter' table), Network Address Translation (the `nat' table) and general pre-route packet mangling (the `mangle' table).

IP Tables and Netfilter Hooks
[Diagram: the netfilter hooks between Ingress and Egress, with the tables attached to each hook]
    Pre Routing: Conntrack, Mangle, Destination NAT
    Forward: Mangle, Filter
    Post Routing: Conntrack, Mangle, Source NAT
    Local In (to local sockets): Filter, Conntrack, Mangle
    Local Out (from local sockets): Conntrack, Mangle, Destination NAT, Filter
A routing decision sits after Pre Routing and after Local Out.

BSD Sockets Interface
User space network interface:
    socket() / bind() / accept() / listen(): initialization, addressing and handshaking
    select() / poll() / epoll(): waiting for events
    send() / recv(): stream-oriented (e.g. TCP) Rx/Tx
    sendto() / recvfrom(): datagram-oriented (e.g. UDP) Rx/Tx

Simple Client/Server
Client:
    socket s;
    char buf[256];
    s = socket()
    connect(s, IP:port)
    while (ret != 0)
        ret = recv(s, buf)
Server:
    socket s1, s2 ... sn;
    char buf[256];
    s1 = socket()
    bind(s1, IP:port)
    listen(s1)
    while {
        select(s1, s2 ... sn)
        if (s1)
            sn = accept(s1)
        else
            while (ret != 0)
                ret = send(sn, buf)
    }

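The pseudocode above can be turned into a small runnable sketch. It assumes a Linux or other POSIX system with a working loopback interface, runs both ends in one process for brevity, and skips select() because there is only one connection. The function name loopback_demo is made up.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Server sends one message over TCP on the loopback interface and the
 * client receives it. Returns 0 if the bytes match, -1 otherwise. */
int loopback_demo(void)
{
    const char msg[] = "hello";
    char buf[256];
    struct sockaddr_in addr;
    socklen_t addr_len = sizeof(addr);
    int listener = -1, client = -1, conn = -1;
    int result = -1;

    /* Server: socket(), bind(), listen() */
    listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;               /* let the kernel pick a free port */

    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, (struct sockaddr *)&addr, &addr_len) < 0)
        goto out;

    /* Client: socket(), connect() to the port the kernel chose */
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0 ||
        connect(client, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Server: accept() the queued connection, then send() */
    conn = accept(listener, NULL, NULL);
    if (conn < 0)
        goto out;
    if (send(conn, msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
        goto out;

    /* Client: recv() until the whole message has arrived */
    if (recv(client, buf, sizeof(msg), MSG_WAITALL) != (ssize_t)sizeof(msg))
        goto out;

    result = memcmp(buf, msg, sizeof(msg)) == 0 ? 0 : -1;
out:
    if (conn >= 0)
        close(conn);
    if (client >= 0)
        close(client);
    if (listener >= 0)
        close(listener);
    return result;
}
```

Each send() and recv() here copies the buffer between user space and the kernel.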
Simple Client/Server Copies
[Diagram: on the server, send(s, buf) copies the buffer from the user space application into the kernel (Tx, "copy from user"). On the client, recv(s, buf) copies the received data from the kernel to the user space application (Rx, "copy to user").]

BSD Sockets Interface Properties
Advantages:
    Originally developed by UC Berkeley research at the dawn of time
    Used by 90% of network-oriented programs
    Standard interface across operating systems
    Simple, well understood by programmers
Drawbacks:
    Context switch for every Rx/Tx
    Buffer copied from/to user space to/from kernel

Zero Copy
An in-kernel buffer that the user has control over.
The buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.
splice() moves data to/from the buffer from/to an arbitrary file descriptor.
tee() copies data from one buffer to another without consuming it.
vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into a file.
Can be used anywhere a process needs to send something from one end to another but doesn't need to touch or even look at the data, just forward it.

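A file-to-socket transfer with splice() can be sketched as follows. This is a minimal sketch assuming Linux with glibc; the helper names splice_copy and splice_demo, the temporary file path, and the use of a Unix socket pair as the destination are choices made for the example. The in-kernel buffer is a pipe, so the data takes two splice() calls: file to pipe, then pipe to socket.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* splice(), SPLICE_F_MOVE */
#include <stdlib.h>       /* mkstemp() */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through an intermediate pipe.
 * Returns the number of bytes moved, or -1 on error. */
ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    if (pipe(pipefd) < 0)
        return -1;
    while (len > 0) {
        ssize_t in = splice(in_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
        ssize_t left;

        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break;                       /* end of input */
        for (left = in; left > 0; ) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, NULL,
                                 (size_t)left, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            left -= out;
        }
        total += in;
        len -= (size_t)in;
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}

/* Write a message to a temporary file, splice it into one end of a socket
 * pair, and read it back from the other end. Returns 0 on success. */
int splice_demo(void)
{
    const char msg[] = "zero-copy";
    char path[] = "/tmp/splice_demo_XXXXXX";
    char buf[sizeof(msg)];
    int sv[2] = { -1, -1 };
    int fd, result = -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                        /* file disappears on close */

    if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg) ||
        lseek(fd, 0, SEEK_SET) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out;
    if (splice_copy(fd, sv[0], sizeof(msg)) != (ssize_t)sizeof(msg))
        goto out;
    if (read(sv[1], buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        goto out;
    result = memcmp(buf, msg, sizeof(msg)) == 0 ? 0 : -1;
out:
    if (sv[0] >= 0)
        close(sv[0]);
    if (sv[1] >= 0)
        close(sv[1]);
    close(fd);
    return result;
}
```

The process never reads the file contents into its own buffers; only the final read() in the demo, which plays the role of the receiving peer, copies data to user space.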
Zero Copy Example 1: splice()
[Diagram: the HD controller copies the data into kernel memory using DMA. The file holds a pointer to the page cache page, and the socket buffer holds a pointer to the same page as part of its frag list. splice() copies only the pointer*; nothing is copied to user space. The network chip then fetches the data from kernel memory using DMA.]
* In reality you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.

Zero Copy Example 2: vmsplice()
[Diagram: the process writes the data to its own memory in user space. vmsplice() copies only the pointer*, taken from the process page tables, so the skb holds a pointer to the page as part of its frag list. The network chip then fetches the data from kernel memory using DMA.]
* In reality you have to do two calls: a vmsplice to an intermediate pipe and a splice from the pipe to the socket buffers.

Hardware Offloading
Large receive offload supported (in software).
TCP/Large Segment Offload supported (e.g. e1000 driver).
No TCP Offload Engine support. Concerns:
    Security updates
    Point-in-time solution
    Different network behavior
    Hardware-specific limits and resource-based denial of service attacks
    http://www.linuxfoundation.org/en/Net:TOE

More Information
Linux Foundation Net:Kernel Flow
    http://www.linuxfoundation.org/en/Net:Kernel_Flow
Zero Copy I: User-Mode Perspective
    http://www.linuxjournal.com/article/6345
Understanding Linux Network Internals, O'Reilly Media

Use the Source, Luke!
Many resources and tricks on the Internet find you will, but solutions to all technical issues only in the Source lie.
Thanks to LucasArts

Copyrights and Trademarks
Copyright 2004-2006, Michael Opdenacker
Copyright 2003-2006, Oron Peled
Copyright 2004-2008, Codefidence Ltd.
Tux Image Copyright: 1996 Larry Ewing
Linux is a registered trademark of Linus Torvalds.
All other trademarks are property of their respective owners.
Used and distributed under a Creative Commons Attribution-ShareAlike 2.0 license