Linux Networking Internals

Upload: tuxologynet

Post on 12-Nov-2014

DESCRIPTION

Slides for a course about the Linux kernel network stack.

TRANSCRIPT

The Linux Network Subsystem
Covers Linux version 2.6.25
Version 1.1

Copyright 2006-2004, Michael Opdenacker
Copyright 2003-2006, Oron Peled
Copyright 2004-2006 Codefidence Ltd.

For full copyright information see last page. Creative Commons Attribution-ShareAlike 2.0 license

Rights to copy

Attribution-ShareAlike 2.0

You are free:
  to copy, distribute, display, and perform the work
  to make derivative works
  to make commercial use of the work

Under the following conditions:
  Attribution. You must give the original author credit.
  Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.

For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holder.
Your fair use and other rights are in no way affected by the above.
License text: http://creativecommons.org/licenses/by-sa/2.0/legalcode

This kit contains work by the following authors:
  Copyright 2004-2006 Michael Opdenacker, [email protected], http://www.free-electrons.com
  Copyright 2003-2006 Oron Peled, [email protected], http://www.actcom.co.il/~oron
  Copyright 2004-2008 Codefidence Ltd., [email protected], http://www.codefidence.com

What is Linux?

Linux is a kernel that implements the POSIX and Single Unix Specification standards and is developed as an open source project.
When one talks of installing Linux, one is referring to a Linux distribution: a combination of Linux and other programs and libraries that form an operating system.
Linux runs on 24 main platforms and supports applications ranging from ccNUMA superclusters to cellular phones and microcontrollers.
Linux is 15 years old, but is based on the 40-year-old Unix design philosophy.

Layers in a Linux system

[Figure: layers in a Linux system: kernel, kernel modules, C library, system libraries, application libraries, user programs.]

Kernel architecture

[Figure: kernel architecture. User space: App1, App2, ... on top of the C library. Kernel space: the system call interface; process management, memory management, filesystem support, device control and networking; below them CPU support code, CPU/MMU support code, filesystem types, storage drivers, character device drivers and network device drivers. Hardware: CPU, RAM, storage.]

Kernel Mode vs. User Mode

All modern CPUs support a dual mode of operation:
  User mode, for regular tasks.
  Supervisor (or privileged) mode, for the kernel.
The mode the CPU is in determines which instructions the CPU is willing to execute:
  Sensitive instructions will not be executed when the CPU is in user mode.
The CPU mode is determined by one of the CPU registers, which stores the current "Ring Level":
  0 for supervisor mode, 3 for user mode, 1-2 unused by Linux.

The System Call Interface

When a user space task needs to use a kernel service, it makes a system call.
The C library places the parameters and the system call number in registers and then issues a special trap instruction.
The trap atomically changes the ring level to supervisor mode and sets the instruction pointer to the kernel.
The kernel finds the required system call via the system call table and executes it.
Returning from the system call does not require a special instruction, since in supervisor mode the ring level can be changed directly.
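The role of the C library described here can be seen from user space. The following is a minimal sketch, assuming Linux with glibc: syscall() is the C library's generic entry point that loads the system call number and arguments and issues the trap, while getpid() is the usual per-call wrapper. The helper name raw_getpid is made up for this illustration.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for our PID through the generic
 * syscall() entry point instead of the getpid() wrapper. SYS_getpid is
 * the system call number that the kernel looks up in its call table. */
static pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);    /* getpid is documented as always successful */
    return (pid_t)ret;
}
```

Both paths end in the same kernel function, so raw_getpid() and getpid() should report the same process ID.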

Linux System Call Path

[Figure: Linux system call path. A task makes a function call into Glibc, which traps into the kernel: entry.S, then sys_name(), then do_name().]

Linux networking Subsystem Overview

[Figure: Linux networking subsystem overview. Applications (App 1, App 2, App 3) sit on top of the socket layer; below it is the networking stack (TCP, UDP, ICMP, IP, bridge), then the driver interface, the drivers, and the hardware.]

Network Device Driver Hardware Interface

[Figure: Tx and Rx descriptor rings shared between driver and device. Tx descriptor states: Free, Send, Sent OK, Send Err. Rx descriptor states: Free, Rcv OK, Rcv Err, Recv CRC. The driver uses memory access for the rings and memory-mapped register access for the device; the device moves packets with DMA and signals the driver with interrupts.]

Driver allocates ring buffers.
Driver resets descriptors to initial state.
Driver puts packet to be sent in Tx buffers.
Device puts received packet in Rx buffers.
Driver/device update descriptors to indicate state.
Device indicates Rx and end of Tx with interrupt, unless interrupt mitigation techniques are applied.
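The descriptor life cycle listed above can be modelled in plain user space C. This is only a toy sketch: the type and function names (tx_ring, ring_queue, device_transmit, ring_reclaim) are invented for the illustration, the "device" is an ordinary function rather than hardware doing DMA, and error states are left out.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

/* Descriptor states, following the Tx states in the slide. */
enum desc_state { DESC_FREE, DESC_SEND, DESC_SENT_OK };

struct tx_desc {
    enum desc_state state;
    const void *buf;    /* packet data the device would fetch by DMA */
    size_t len;
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    unsigned int head;  /* next descriptor the driver fills */
    unsigned int tail;  /* next descriptor the driver reclaims */
};

/* Driver: reset all descriptors to their initial state. */
static void ring_reset(struct tx_ring *r)
{
    for (unsigned int i = 0; i < RING_SIZE; i++)
        r->desc[i].state = DESC_FREE;
    r->head = r->tail = 0;
}

/* Driver: put a packet in the ring. Returns 0, or -1 if the ring is full. */
static int ring_queue(struct tx_ring *r, const void *buf, size_t len)
{
    struct tx_desc *d = &r->desc[r->head];

    assert(buf != NULL && len > 0);
    if (d->state != DESC_FREE)
        return -1;
    d->buf = buf;
    d->len = len;
    d->state = DESC_SEND;   /* hand the descriptor to the device */
    r->head = (r->head + 1) % RING_SIZE;
    return 0;
}

/* Simulated device: "transmit" every descriptor marked Send. */
static unsigned int device_transmit(struct tx_ring *r)
{
    unsigned int sent = 0;

    for (unsigned int i = 0; i < RING_SIZE; i++) {
        if (r->desc[i].state == DESC_SEND) {
            r->desc[i].state = DESC_SENT_OK;
            sent++;
        }
    }
    return sent;
}

/* Driver, on the end-of-Tx interrupt: reclaim completed descriptors. */
static unsigned int ring_reclaim(struct tx_ring *r)
{
    unsigned int freed = 0;

    while (r->desc[r->tail].state == DESC_SENT_OK) {
        r->desc[r->tail].state = DESC_FREE;
        r->tail = (r->tail + 1) % RING_SIZE;
        freed++;
    }
    return freed;
}
```

A full ring makes ring_queue() fail, which is the point where a real driver would stop its transmit queue until the device reports completed descriptors.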

Network Device Registration

Each network device is represented by a struct net_device.
These are allocated using:
    struct net_device *alloc_netdev(size, mask, setup_func);
  size: size of our private data part
  mask: a naming pattern (e.g. eth%d)
  setup_func: a function that sets up the rest of net_device.
And registered via a call to:
    int register_netdev(struct net_device *dev);

Network Device Initialization

The net_device structure is initialized with numerous methods and flags by the setup function:
  open: request resources, register interrupts, start queues.
  stop: deallocate resources, unregister IRQ, stop queue.
  get_stats: report statistics.
  set_multicast_list: configure device for multicast.
  hard_start_xmit: called by the stack to initiate Tx.
  IFF_MULTICAST: device supports multicast.
  IFF_NOARP: device does not support the ARP protocol.

Packet Representation

We need to manipulate packets through the stack.
This manipulation must be efficient at:
  Adding protocol headers/trailers down the stack.
  Removing protocol headers/trailers up the stack.
Packets can be chained together.
Each protocol should have convenient access to header fields.
To do all this the kernel uses the sk_buff structure.

Socket Buffers

The sk_buff structure represents a single packet.
This structure is passed through the protocol stack.
It holds pointers to a buffer with the packet data.
It holds many other types of information:
  Data size.
  Incoming device.
  Priority.
  Security...

struct sk_buff

next: Next buffer in list
prev: Previous buffer in list
sk: Socket we are owned by
tstamp: Time we arrived
dev: Device we arrived on/are leaving by
input_dev: Device we arrived on
h: Transport layer header
nh: Network layer header
mac: Link layer header
dst: Destination route cache entry
sp: Security path, used for xfrm
cb: Control buffer. Private data.
len: Length of actual data
data_len: Data length
mac_len: Length of link layer header
csum: Checksum
local_df: Allow local fragmentation flag
cloned: Head may be cloned (see refcnt)
nohdr: Payload reference only flag
pkt_type: Packet class
fclone: Clone status
ip_summed: Driver fed us an IP checksum
priority: Packet queuing priority
users: User count, see {datagram,tcp}.c
protocol: Packet protocol from driver
truesize: Buffer size
head: Head of buffer
data: Data head pointer
tail: Tail pointer
end: End pointer
destructor: Destruct function
nfmark: Netfilter hooks private data
nfct: Associated connection, if any
ipvs_property: skbuff is owned by ipvs
nfctinfo: Connection tracking info.
nfct_reasm: Netfilter conntrack reassembly pointer
nf_bridge: Saved data about a bridged frame
tc_index: Traffic control index
tc_verd: Traffic control verdict
dma_cookie: DMA operation cookie
secmark: Security marking for LSM

Socket Buffer Diagram

[Figure: struct sk_buff fields (len, head, data, tail, end, dev, ...) pointing into the packet buffer, which is laid out as headroom, Ethernet header, IP header, TCP header, payload and padding, followed by struct skb_shared_info, which references the page fragments frag1, frag2 and frag3.]

Note: the network chip must support scatter/gather to make use of frags. Otherwise the kernel must copy the buffers before sending!

Socket Buffer Operations

skb_put: add data to the end of a buffer.
skb_push: add data to the start of a buffer.
skb_pull: remove data from the start of a buffer.
skb_headroom: returns free bytes at buffer head.
skb_tailroom: returns free bytes at buffer end.
skb_reserve: adjust headroom.
skb_trim: remove end from a buffer.

Operation Example: skb_put

    unsigned char *skb_put(struct sk_buff *skb, unsigned int len)

Adds data to a buffer:
  skb: buffer to use
  len: amount of data to add

This function extends the used data area of the buffer. If this would exceed the total buffer size the kernel will panic. A pointer to the first byte of the extra data is returned.
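The pointer arithmetic behind these operations can be tried out in user space. The sketch below is a toy model, not the kernel's implementation: struct toy_skb and the toy_* functions are invented names that mimic the head/data/tail/end pointers of sk_buff, and where the kernel would panic the toy uses assert().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff: four pointers into one flat buffer. */
struct toy_skb {
    unsigned char *head;    /* start of the buffer */
    unsigned char *data;    /* start of the packet data */
    unsigned char *tail;    /* end of the packet data */
    unsigned char *end;     /* end of the buffer */
    size_t len;             /* always tail - data */
};

static void toy_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

static size_t toy_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

static size_t toy_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Like skb_reserve: open up headroom in a still-empty buffer. */
static void toy_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && n <= toy_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Like skb_put: extend the data area at the tail, return the old tail. */
static unsigned char *toy_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= toy_tailroom(skb));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Like skb_push: extend the data area at the front (add a header). */
static unsigned char *toy_push(struct toy_skb *skb, size_t n)
{
    assert(n <= toy_headroom(skb));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Like skb_pull: drop n bytes from the front (strip a header). */
static unsigned char *toy_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A typical sequence is toy_reserve() first, toy_put() for the payload, then toy_push() once per header on the way down the stack and toy_pull() once per header on the way up; no packet data is moved at any step.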

Socket Buffer Alignment

CPUs often take a performance hit when accessing unaligned memory locations.
Since an Ethernet header is 14 bytes, network drivers often end up with the IP header at an unaligned offset.
The IP header can be aligned by shifting the start of the packet by 2 bytes. Drivers should do this with:
    skb_reserve(skb, NET_IP_ALIGN);
The downside is that the DMA is now unaligned. On some architectures the cost of an unaligned DMA outweighs the gains, so NET_IP_ALIGN is set on a per-arch basis.
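The arithmetic behind the 2-byte shift is easy to check. This is a small sketch with made-up helper names; it assumes the frame starts at an aligned buffer address and that the 14-byte Ethernet header from the slide is the only thing in front of the IP header.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14     /* Ethernet header length, as stated above */

/* Offset of the IP header from the (aligned) start of the buffer when
 * `reserved` bytes are skipped before the Ethernet header. */
static size_t ip_header_offset(size_t reserved)
{
    return reserved + TOY_ETH_HLEN;
}

static int is_aligned(size_t offset, size_t boundary)
{
    assert(boundary != 0);
    return offset % boundary == 0;
}
```

With nothing reserved the IP header sits at offset 14, which is not a multiple of 4; reserving 2 bytes moves it to offset 16.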

Socket Buffer Padding

The networking layer reserves some headroom in skb data. This is used to avoid having to reallocate skb data when the header has to grow.
In the default case, if the header has to grow 16 bytes or less we avoid the reallocation.
Unfortunately, this headroom changes the DMA alignment of the resulting network packet. As for NET_IP_ALIGN, this unaligned DMA is expensive on some architectures.
Therefore an architecture can override this value, as long as at least 16 bytes of free headroom are there.

Socket Buffer Allocations

dev_alloc_skb: allocate an skbuff for Rx.
netdev_alloc_skb: allocate an skbuff for Rx, on a specific device.

Allocate a new sk_buff and assign it a usage count of one.
The buffer has unspecified headroom built in. Users should allocate the headroom they think they need without accounting for the built-in space. The built-in space is used for optimizations.
NULL is returned if there is no free memory.
Although these functions allocate memory, they can be called from an interrupt.

sk_buff Allocation Example

Immediately after allocation, we should reserve the needed headroom:

    struct sk_buff *skb;

    skb = dev_alloc_skb(1500);
    if (unlikely(!skb))
        break;

    /* Mark as being used by this device */
    skb->dev = dev;
    /* Align IP on 16 byte boundaries */
    skb_reserve(skb, NET_IP_ALIGN);

Softnet

The network stack is implemented as a pair of softirqs to parallelize packet handling on SMP machines:
  NET_TX_SOFTIRQ: feeds packets from the network stack to the driver.
  NET_RX_SOFTIRQ: feeds packets from the driver to the network stack.
Like any other softirq, these are called on return from interrupt or via the low-priority ksoftirqd kernel thread.
Transmit/receive queues are stored in the per-CPU softnet_data.

Linux Contexts

[Figure: Linux execution contexts. Interrupt context (kernel space): interrupt handlers, hi-prio tasklets, SoftIRQs (net stack, timers, regular tasklets, ...), with the network interface device driver spanning them. User context: kernel threads in kernel space; processes and threads in user space.]

Packet Reception

The driver allocates an skb and sets up a descriptor in the ring buffers for the hardware.
The driver Rx interrupt handler calls netif_rx(skb).
netif_rx deposits the sk_buff in the per-CPU input queue and marks the NET_RX_SOFTIRQ to run.
At SoftIRQ processing time, net_rx_action() is called by NET_RX_SOFTIRQ, which calls the driver poll() method to feed the packet up.
Normally poll() is set to process_backlog() by net_dev_init().

Packet Rx Overview

[Figure: packet Rx overview. The diagram is not preserved in this transcript.]

Packet Transmission

Each network device defines a method:
    int (*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
This function is indirectly called from the NET_TX_SOFTIRQ.
Calls are serialized via the lock dev->xmit_lock_owner.
The driver manages the transmit queue during interface up and down, or to signal back pressure, using the following functions:
    void netif_start_queue(struct net_device *net);
    void netif_stop_queue(struct net_device *net);
    void netif_wake_queue(struct net_device *net);

Packet Tx Overview

[Figure: packet Tx overview. The diagram is not preserved in this transcript.]

NAPI

Network New API.
Provides interrupt mitigation.
Requirements:
  A DMA ring buffer.
  Ability to turn off receive interrupts or events.
It is used by defining a new method:
    int (*poll)(struct net_device *dev, int *budget);
which is called by the network stack periodically when signaled by the driver to do so.

NAPI (cont.)

When a receive interrupt occurs, the driver:
  Turns off receive interrupts.
  Calls netif_rx_schedule(dev) to get the stack to start calling its poll method.
The poll method:
  Scans the receive ring buffers, feeding packets to the stack via netif_receive_skb(skb).
  If the work finished within the budget parameter, re-enables interrupts and calls netif_rx_complete(dev).
  Else, the stack will call the poll method again.
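The interplay between the interrupt handler, the poll method and the budget can be simulated in user space. This is a toy sketch only: struct toy_nic and the toy_* functions are invented for the illustration, packets are reduced to a counter, and the comments mark where a driver of this kernel generation would call the functions named above.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_nic {
    int  rx_pending;        /* packets waiting in the Rx ring */
    bool irq_enabled;       /* is the receive interrupt enabled? */
    bool poll_scheduled;    /* has the stack been asked to poll us? */
    int  delivered;         /* packets handed to the stack so far */
};

/* Rx interrupt handler: mask Rx interrupts and ask to be polled. */
static void toy_rx_interrupt(struct toy_nic *nic)
{
    assert(nic->irq_enabled);
    nic->irq_enabled = false;
    nic->poll_scheduled = true;     /* stands in for netif_rx_schedule(dev) */
}

/* Poll method: handle at most `budget` packets.
 * Returns true if work remains and the stack should poll again. */
static bool toy_poll(struct toy_nic *nic, int budget)
{
    int done = 0;

    while (nic->rx_pending > 0 && done < budget) {
        nic->rx_pending--;
        nic->delivered++;           /* stands in for netif_receive_skb(skb) */
        done++;
    }
    if (nic->rx_pending == 0) {
        nic->poll_scheduled = false;    /* stands in for netif_rx_complete(dev) */
        nic->irq_enabled = true;
        return false;
    }
    return true;
}

/* Stack side: keep polling while the device is scheduled.
 * Returns how many times the poll method was called. */
static int toy_stack_run(struct toy_nic *nic, int budget)
{
    int calls = 0;

    assert(budget > 0);
    while (nic->poll_scheduled) {
        toy_poll(nic, budget);
        calls++;
    }
    return calls;
}
```

With 10 pending packets and a budget of 4, a single interrupt is followed by three poll calls, and receive interrupts stay off until the ring is empty.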

Routing

After the socket buffer is delivered to a protocol handler, the handler may decide to route the packet.
The default routing uses normal destination-based routing with a single table and a FIB destination cache.
For each packet the routing destination is looked up in the FIB cache. If found, the packet is sent to that interface driver.
Otherwise a more costly routing decision based on rules occurs and the result is stored in the FIB.
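The "cache first, costly decision on a miss" pattern can be sketched in user space. Everything here is an assumption made for the illustration: the toy_* names, the direct-mapped cache keyed by destination address, and the use of a longest-prefix match over a small table as the "costly" decision. It is not how the kernel's routing code is structured.

```c
#include <assert.h>
#include <stdint.h>

struct toy_route {
    uint32_t prefix;        /* network address, host byte order */
    int      prefix_len;    /* 0..32 */
    int      ifindex;       /* outgoing interface */
};

#define TOY_CACHE_SLOTS 16

struct toy_cache_entry {
    uint32_t dst;
    int      ifindex;
    int      valid;
};

struct toy_router {
    const struct toy_route *table;
    int n_routes;
    struct toy_cache_entry cache[TOY_CACHE_SLOTS];
    int slow_lookups;       /* how often the costly path ran */
};

static uint32_t toy_mask(int len)
{
    return len == 0 ? 0 : ~UINT32_C(0) << (32 - len);
}

/* Costly path: longest-prefix match over the whole table.
 * Returns the interface index, or -1 if no route matches. */
static int toy_slow_lookup(struct toy_router *r, uint32_t dst)
{
    int best = -1, best_len = -1;

    r->slow_lookups++;
    for (int i = 0; i < r->n_routes; i++) {
        const struct toy_route *rt = &r->table[i];

        assert(rt->prefix_len >= 0 && rt->prefix_len <= 32);
        if ((dst & toy_mask(rt->prefix_len)) == rt->prefix &&
            rt->prefix_len > best_len) {
            best = rt->ifindex;
            best_len = rt->prefix_len;
        }
    }
    return best;
}

/* Fast path: consult the cache, fall back to the costly path on a miss
 * and remember the result for the next packet to the same destination. */
static int toy_route_lookup(struct toy_router *r, uint32_t dst)
{
    struct toy_cache_entry *e = &r->cache[dst % TOY_CACHE_SLOTS];
    int ifindex;

    if (e->valid && e->dst == dst)
        return e->ifindex;

    ifindex = toy_slow_lookup(r, dst);
    if (ifindex >= 0) {
        e->dst = dst;
        e->ifindex = ifindex;
        e->valid = 1;
    }
    return ifindex;
}
```

The slow_lookups counter makes the benefit visible: a second packet to the same destination is answered from the cache without touching the table.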

What is Netfilter?

Netfilter is a framework for packet mangling.
Each protocol defines "hooks" (IPv4 defines 5), which are well-defined points in a packet's traversal of that protocol stack.
At each of these points, the protocol will call the netfilter framework with the packet and the hook number.
Parts of the kernel can register to listen to the different hooks for each protocol.
When a packet is passed to the netfilter framework, it will call all registered callbacks for that hook and protocol.

Netfilter Architecture

[Figure: netfilter hooks on the packet path. Ingress, Pre Routing, Route, then either Forward and Post Routing to Egress, or Local In up to the local sockets. Packets from local sockets pass Local Out and Route, then Post Routing to Egress.]

Netfilter Hook

Kernel code can register a callback function to be called when a packet arrives at each hook. Callbacks are free to manipulate the packet.
The callback can then tell netfilter to do one of five things:
  NF_ACCEPT: continue traversal as normal.
  NF_DROP: drop the packet; don't continue traversal.
  NF_STOLEN: I've taken over the packet; stop traversal.
  NF_QUEUE: queue the packet (usually for userspace handling).
  NF_REPEAT: call this hook again.
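The callback-and-verdict mechanism can be modelled in user space. The sketch below is a simplified toy, not the kernel API: all toy_* names are invented, callbacks run in registration order (this toy has no priorities), and only three of the five verdicts are modelled; queueing and repeating are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy verdicts, modelled on NF_ACCEPT, NF_DROP and NF_STOLEN. */
enum toy_verdict { TOY_ACCEPT, TOY_DROP, TOY_STOLEN };

struct toy_packet {
    unsigned int mark;
    unsigned int dst_port;
};

typedef enum toy_verdict (*toy_hook_fn)(struct toy_packet *pkt);

#define TOY_MAX_HOOKS 8

/* The callbacks registered at one hook point. */
struct toy_hook_chain {
    toy_hook_fn fn[TOY_MAX_HOOKS];
    size_t count;
};

/* Register a callback. Returns 0, or -1 if the chain is full. */
static int toy_register_hook(struct toy_hook_chain *c, toy_hook_fn fn)
{
    assert(fn != NULL);
    if (c->count == TOY_MAX_HOOKS)
        return -1;
    c->fn[c->count++] = fn;
    return 0;
}

/* Call every registered callback in turn. Traversal continues only
 * while callbacks answer TOY_ACCEPT. */
static enum toy_verdict toy_run_hooks(const struct toy_hook_chain *c,
                                      struct toy_packet *pkt)
{
    for (size_t i = 0; i < c->count; i++) {
        enum toy_verdict v = c->fn[i](pkt);

        if (v != TOY_ACCEPT)
            return v;
    }
    return TOY_ACCEPT;
}

/* Example callback: manipulate the packet, then let it continue. */
static enum toy_verdict toy_mark_packet(struct toy_packet *pkt)
{
    pkt->mark = 1;
    return TOY_ACCEPT;
}

/* Example callback: drop anything addressed to port 23. */
static enum toy_verdict toy_drop_telnet(struct toy_packet *pkt)
{
    return pkt->dst_port == 23 ? TOY_DROP : TOY_ACCEPT;
}
```

A caller that gets TOY_DROP would free the packet, while TOY_STOLEN means the callback now owns it and the caller must not touch it again.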

IPTables

A packet selection system called IPTables has been built over the netfilter framework.
It is a direct descendant of ipchains (that came from ipfwadm, that came from BSD's ipfw IIRC), with extensibility.
Kernel modules can register a new table, and ask for a packet to traverse a given table.
This packet selection method is used for packet filtering (the `filter' table), Network Address Translation (the `nat' table) and general pre-route packet mangling (the `mangle' table).

IPTables and Netfilter Hooks

[Figure: the netfilter hook diagram (ingress to egress) annotated with the subsystems attached to each hook.]

Pre Routing: Conntrack, Mangle, Destination NAT
Forward: Mangle, Filter
Post Routing: Conntrack, Mangle, Source NAT
Local In: Filter, Conntrack, Mangle
Local Out: Conntrack, Mangle, Destination NAT, Filter

BSD Sockets Interface

User space network interface:
  socket() / bind() / accept() / listen(): initialization, addressing and handshaking
  select() / poll() / epoll(): waiting for events
  send() / recv(): stream oriented (e.g. TCP) Rx/Tx
  sendto() / recvfrom(): datagram oriented (e.g. UDP) Rx/Tx

Simple Client/Server

Clients:

    socket s;
    char buf[256];

    s = socket()
    connect(s, IP:port)
    while (ret != 0)
        ret = recv(s, buf)

Server:

    socket s1, s2 ... sn;
    char buf[256];

    s1 = socket()
    bind(s1, IP:port)
    listen(s1)
    while {
        select(s1, s2 ... sn)
        if (s1)
            sn = accept(s1)
        else
            while (ret != 0)
                ret = send(sn, buf)
    }
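The pseudocode above maps onto the real sockets calls roughly as follows. This is a minimal sketch under stated assumptions: IPv4 TCP on the loopback address, blocking sockets, no select() loop, and helper names (server_listen, client_connect, send_all) that are invented for the illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Server side: socket(), bind(), listen() on 127.0.0.1.
 * Pass *port == 0 to let the kernel pick a free port; the port in use
 * is written back to *port. Returns the listening fd, or -1 on error. */
static int server_listen(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(port != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(*port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Client side: socket() and connect() to 127.0.0.1:port.
 * Returns the connected fd, or -1 on error. */
static int client_connect(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* send() may accept fewer bytes than asked for; loop until all are sent.
 * Returns 0 on success, -1 on error. */
static int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);

        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

On the server, accept() on the listening descriptor yields one connected socket per client; the client then calls recv() in a loop until it returns 0, which signals that the server closed the connection.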

Simple Client/Server Copies

[Figure: data copies in the simple client/server. Server (Tx): the application's send(s, buf) in user space causes a copy from user into the kernel. Client (Rx): the kernel copies to user when the application in user space calls recv(s, buf).]

BSD Sockets Interface Properties

Originally developed by UC Berkeley research at the dawn of time
Used by 90% of network-oriented programs
Standard interface across operating systems
Simple, well understood by programmers

Context switch for every Rx/Tx
Buffer copied from/to user space to/from kernel

Zero Copy

An in-kernel buffer that the user has control over.
The buffer is implemented as a set of reference-counted pointers which the kernel copies around without actually copying the data.
splice() moves data to/from the buffer from/to an arbitrary file descriptor.
tee() moves data from one buffer to another.
vmsplice() does the same as splice(), but instead of splicing from fd to fd as splice() does, it splices from a user address range into a file.
Can be used anywhere a process needs to send something from one end to another, but doesn't need to touch or even look at the data, just forward it.
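A minimal user space sketch of forwarding data with splice(), assuming Linux with glibc (the call is Linux-specific and needs _GNU_SOURCE). Since splice() requires one of its two descriptors to be a pipe, the pipe plays the role of the in-kernel buffer and two calls are needed per chunk. The helper name splice_fd_to_socket is invented for the illustration, and error handling is kept deliberately short.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Forward up to `len` bytes from in_fd (for example a regular file) to
 * sock_fd without passing the data through a user space buffer.
 * Returns the number of bytes forwarded, or -1 on error. */
static ssize_t splice_fd_to_socket(int in_fd, int sock_fd, size_t len)
{
    int pipefd[2];
    ssize_t total = 0;

    assert(in_fd >= 0 && sock_fd >= 0);
    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0 && total >= 0) {
        /* First splice: source descriptor -> pipe. */
        ssize_t queued = splice(in_fd, NULL, pipefd[1], NULL,
                                len, SPLICE_F_MOVE);
        if (queued < 0)
            total = -1;
        if (queued <= 0)
            break;                  /* error or end of input */
        len -= (size_t)queued;

        /* Second splice: pipe -> socket, until the pipe is drained. */
        while (queued > 0) {
            ssize_t sent = splice(pipefd[0], NULL, sock_fd, NULL,
                                  (size_t)queued, SPLICE_F_MOVE);
            if (sent <= 0) {
                total = -1;
                break;
            }
            queued -= sent;
            total += sent;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

The process never reads the payload itself: it only tells the kernel where the data comes from and where it should go, which is the forwarding case described above.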

Zero Copy Example 1: splice()

[Figure: splice() from a file to a socket. In kernel memory, the file holds a pointer to a page cache page and the socket buffer holds a pointer to that page as part of its frag list; only the pointer is copied (*). The data itself is copied using DMA between the hardware (HD controller, network chip) and kernel memory, and never enters user space.]

* In reality you have to do two splice calls: one from the file to an intermediate pipe and one from the pipe to the socket buffers.

Zero Copy Example 2: vmsplice()

[Figure: vmsplice() from process memory to a socket. The process writes to its memory in user space; through the process page tables, the skb in kernel memory holds a pointer to the page as part of its frag list; only the pointer is copied (*). The data is copied using DMA to the network chip.]

* In reality you have to do two calls: one vmsplice to an intermediate pipe and one splice from the pipe to the socket buffers.

Hardware Offloading

Large receive offload supported (in software).
TCP/Large Segment Offload supported (e.g. e1000 driver).
No TCP Offload Engine support:
  Security updates
  Point-in-time solution
  Different network behavior
  Hardware-specific limits and resource-based denial of service attacks
  http://www.linuxfoundation.org/en/Net:TOE

More Information

Linux Foundation Net:Kernel Flow: http://www.linuxfoundation.org/en/Net:Kernel_Flow
Zero Copy I: User-Mode Perspective: http://www.linuxjournal.com/article/6345
Understanding Linux Network Internals, O'Reilly Media

Use the Source, Luke!

Many resources and tricks on the Internet find you will, but solutions to all technical issues only in the Source lie.

Thanks to LucasArts

Copyrights and Trademarks

Copyright 2006-2004, Michael Opdenacker
Copyright 2003-2006, Oron Peled
Copyright 2004-2008 Codefidence Ltd.
Tux Image Copyright: 1996 Larry Ewing
Linux is a registered trademark of Linus Torvalds.
All other trademarks are property of their respective owners.
Used and distributed under a Creative Commons Attribution-ShareAlike 2.0 license