about message coalescing

Message Coalescing, 2014-06-05 (Thu)

Source: Mostowfi, Mehrgan, "Packet Coalescing and Server Substitution for Energy-Proportional Operation of Network Links and Data Servers" (2013). Graduate School Theses and Dissertations. http://scholarcommons.usf.edu/etd/4732

• Distinguishing PKT / MSG Coalescing

• Packet Coalescing in Energy Efficient Ethernet

• Energy consumption versus EEE coalescer buffer size

Distinguishing PKT / MSG Coalescing

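Message vs. packet

Packet: a unit of data transmission, cut to a size that is easy to send over a network.

Message: a unit of information written in a language or code suitable for delivery by a means of communication, or a unit of information that has been transmitted.

Sources: [Naver Knowledge Encyclopedia] Packet (Doosan Encyclopedia); [Naver Knowledge Encyclopedia] Message (IT Glossary, Telecommunications Technology Association of Korea)

In other words, when a 'message' is sent over a network, it is converted into units called 'packets' (this conversion is called fragmentation).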

[Diagram: packet fragmentation via the 'Transport & Network Layer' (TCP, UDP, IP). Packet capacity == 1500. Case 1: strlen(msg) > Capacity. Case 2: strlen(msg) < Capacity.]

Two cases of fragmentation (conversion into packets)

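Case 1. strlen(msg) > Capacity

[Diagram: a long message (msgmsgmsg) is split across packets. Packet capacity == 1500; payload per packet = 1500 - socketHeader.]

The message is cut into pieces of the required size and converted into packets.

The receiver works out the order of the fragmented packets and then reassembles the message.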
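The split-and-reassemble flow of Case 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fragment`, `reassemble`, and `HEADER_SIZE` are hypothetical names, the 40-byte header is an assumed stand-in for the slide's socketHeader, and a real TCP/IP stack does considerably more.

```python
CAPACITY = 1500      # packet capacity from the slide, in bytes
HEADER_SIZE = 40     # assumed header size; stands in for "socketHeader"


def fragment(msg: bytes, capacity: int = CAPACITY, header: int = HEADER_SIZE):
    """Split msg into (sequence number, payload) packets."""
    payload = capacity - header
    count = max(1, -(-len(msg) // payload))  # ceiling division, at least one packet
    return [(seq, msg[seq * payload:(seq + 1) * payload]) for seq in range(count)]


def reassemble(packets):
    """Restore the message from packets that may arrive out of order."""
    return b"".join(chunk for _, chunk in sorted(packets))
```

A message shorter than the payload size yields a single packet, which corresponds to Case 2.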

Case 2. strlen(msg) < Capacity

[Diagram: msg -> pkt. The message is smaller than the 1500-byte packet capacity.]

The delivered message is converted into a packet directly.

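Case 3. strlen(msg) < Capacity && Too Many Messages

[Diagram: n small messages (msgmsgmsgmsgmsg) become n packets (pktpktpktpktpkt); ex: n=10, msg size=1500byte.]

An unnecessarily large number of packets is generated: n conversions are needed.

An unnecessarily large number of transmissions is performed: n transmissions are needed.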

Case 3. Solution 1. PKT Coalescing

[Diagram: n messages become n packets (ex: n=10, msg size=1500byte); Packet Coalescing gathers pkt pkt ··· pkt pkt into one burst of n.]

Collect as many 'packets' as the network card can send, then transmit them all at once.

This is the method currently used by EEE. [NIC]

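Case 3. Solution 2. MSG Coalescing

[Diagram: msg num=100, size = 15B; Message Coalescing gathers msg msg ··· msg msg (n=100) into one packet; Packet capacity == 1500B.]

Collect 'messages' until they fill one packet, then convert them together.

This requires manipulation in software. [Network Kernel]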
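The idea behind MSG Coalescing can be sketched as a simple packing loop. The function name `coalesce_messages` is hypothetical, and the sketch assumes that headers and per-message length markers can be ignored; a real implementation would need some way for the receiver to find the message boundaries.

```python
def coalesce_messages(messages, capacity=1500):
    """Pack small messages, in order, into as few packets as possible."""
    packets, current, used = [], [], 0
    for msg in messages:
        if len(msg) > capacity:
            raise ValueError("message larger than a packet; fragment it instead")
        if used + len(msg) > capacity:       # current packet is full, start a new one
            packets.append(b"".join(current))
            current, used = [], 0
        current.append(msg)
        used += len(msg)
    if current:                              # flush the last, partly filled packet
        packets.append(b"".join(current))
    return packets
```

With the slide's numbers, 100 messages of 15 B fit exactly into one 1500 B packet instead of producing 100 separate packets.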

Packet Coalescing in Energy Efficient Ethernet

Energy-Efficient Ethernet

• Energy-Efficient Ethernet is a technology that enhances the twisted-pair and backplane Ethernet family of computer networking standards by lowering power consumption during periods of low data activity. The goal is to cut power consumption by 50% or more while remaining fully compatible with existing equipment. [1] The IEEE ratified the final standard in September 2010. [2] Until the standard was ratified, the name Green Ethernet was used.

[1] Sean Michael Kerner (2009-07-17). "Energy Efficient Ethernet hits standards milestone". InternetNews: The Blog.

[2] "IEEE ratifies new 802.3az standard to reduce network energy footprint" (2010-10-05).

Reference section: Chapter 3: Packet Coalescing for Energy Efficient Ethernet

3.1 An Analytical Energy-Delay Model for a Count-based Packet Coalescer
  3.1.1 Energy-Delay Model for Coalescer
  3.1.2 Delay Model for Downstream Queue
  3.1.3 Numerical Results

3.2 Reducing the Energy Consumption of EEE by Packet Coalescing
  3.2.1 Simulation Model of EEE with Packet Coalescing
  3.2.2 Experiments
  3.2.3 Results
  3.2.4 Comparison Between the Analytical Model of Coalescing and the Simulation Model of EEE with Packet Coalescing

3.3 Extending Savings of Packet Coalescing Beyond Links in Ethernet Switches
  3.3.1 Switch Energy Use and Transition Times
  3.3.2 The Synchronized Coalescing Method
    3.3.2.1 Simple Synchronized Coalescing
    3.3.2.2 Adaptive Coalescing
  3.3.3 Evaluation by Simulation
  3.3.4 Results and Discussion

3.4 Chapter Summary

EEE uses a Low Power Idle (LPI) mode to reduce power consumption between packet transmissions. EEE has transition times Ts (wake-to-sleep) and Tw (sleep-to-wake), which are significantly greater than a single packet transmission time for both 1 and 10 Gb/s EEE.

By coalescing arriving packets into bursts, the overhead of Ts and Tw can be reduced and nearly energy-proportional operation can be achieved. The trade-off in coalescing is increased packet delay at the sender and, potentially, also in downstream switches or routers.

* EEE : Energy Efficient Ethernet

In packet coalescing, a FIFO queue in the Ethernet interface (in the host NIC and switch or router line card) is used to collect, or coalesce, multiple packets before sending them on a link as a burst of back-to-back packets. This FIFO queue is called a coalescer.

Packet coalescing is already used in many high-speed Ethernet interfaces – mostly on the receive side – to reduce CPU overhead for packet processing [73]. Packet coalescing can be based on packet count and/or time from first packet arrival.

In packet coalescing based on packet count (count-based coalescing), the coalescer collects a certain number of packets before sending them on the link in a single burst.

In packet coalescing based on time from first packet arrival, the coalescer sets a timer, called the coalescing timer, to a certain predefined time upon the arrival of the first packet to an empty coalescer. The timer counts down to zero. When the timer reaches zero (or expires), the coalescer sends the packets which are collected in the coalescer on the link.
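The two release rules described above (packet count, and time since the first arrival) can be combined in one small sketch. The class and method names are hypothetical, time is passed in explicitly as a number of microseconds, and the sketch leaves out the link itself; it only decides when a burst leaves the FIFO.

```python
class Coalescer:
    """FIFO coalescer that releases a burst on packet count or timer expiry."""

    def __init__(self, max_count: int, timeout_us: float):
        self.max_count = max_count
        self.timeout_us = timeout_us
        self.queue = []
        self.deadline = None  # expiry time of the coalescing timer, None when idle

    def _flush(self):
        burst, self.queue, self.deadline = self.queue, [], None
        return burst

    def on_arrival(self, packet, now_us: float):
        """Queue a packet; return a burst if the count threshold is reached."""
        if not self.queue:
            # The first packet into an empty coalescer starts the timer.
            self.deadline = now_us + self.timeout_us
        self.queue.append(packet)
        if len(self.queue) >= self.max_count:
            return self._flush()
        return None

    def on_tick(self, now_us: float):
        """Return the queued burst if the coalescing timer has expired."""
        if self.deadline is not None and now_us >= self.deadline:
            return self._flush()
        return None
```

Calling `on_arrival` for each packet and `on_tick` periodically gives count-based behavior when traffic is heavy and timer-based behavior when it is light, so a lone packet is never held longer than `timeout_us`.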

1. Count-based coalescing

2. Timer-based coalescing

FSM of PKT Coalescing

[Figure: state machines for count-based and time-based (simple synchronized coalescing) packet coalescing.]

EEE with Packet Coalescing

CTimer: tracks the packet coalescing time.

WTimer: tracks the time spent in 'Wakeup'.

STimer: tracks the time spent in 'Sleep'.

Energy consumption versus EEE coalescer buffer size

Illustration of PKT Coalescing

* A trade-off arises depending on the buffer size.

Ps: power consumption in LPI mode
Pa: power consumption during Active mode
tLPI: time spent in the LPI mode
tws: sleep time (needed to enter the low-power mode)
tsw: wake-up time (required to exit the low-power mode)

Power consumption formula

* Source: "Performance Evaluation of Energy Efficient Ethernet", P. Reviriego, J. A. Hernández, D. Larrabeiti, and J. A. Maestro, IEEE Communications Letters, vol. 13, no. 9, September 2009

The factors in these experiments are :

• The power consumption in the LPI mode, Ps, is assumed to be 10% of the active-mode consumption, according to estimates made by different NIC manufacturers during the standardization process of EEE. *

• The power consumption during transitions is assumed to be 100% (Pa), also based on estimates made by different NIC manufacturers. *

• The power consumption in Active mode is obviously 100% of the link’s consumption. *



[5] J. Chou, "Low-power idle based EEE 100Base-TX," Mar. 2008, in IEEE 802.3az Task Force presentation.
[6] B. Kohl, "10GBase-T power budget summary," Mar. 2007, in IEEE 802.3az Task Force presentation.


• Tsw (wake-up time) and Tws (sleep time): set to their minimums, 4.48 and 2.88 μs respectively.

• Distribution of packet arrivals and packet size; set to Poisson distribution with fixed packet size of 1500 B.

• For the small coalescer, a 12 μs coalescing timer and a maximum of 10 packets are used (10 × 1500 B = 15 KB of buffer).

• For the large coalescer, a 120 μs coalescing timer and a maximum of 100 packets are used (100 × 1500 B = 150 KB of buffer).
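To see why larger bursts move the link closer to energy-proportional operation, the factors above can be plugged into a deliberately simple model. This is a sketch, not the formula from the cited letter and not the thesis simulation: it assumes perfectly regular bursts (no Poisson variation), a 10 Gb/s link so that a 1500 B packet takes 1.2 μs, one sleep and one wake-up transition per burst (4.48 + 2.88 μs in total), full power Pa during transmission and transitions, and Ps = 10% of Pa in LPI. The function name is hypothetical.

```python
def energy_fraction(burst_size, load=0.15, pkt_time_us=1.2,
                    transition_us=4.48 + 2.88, ps_ratio=0.10):
    """Energy used as a fraction of an always-active link, for regular bursts."""
    busy = burst_size * pkt_time_us            # time spent transmitting one burst
    cycle = busy / load                        # time between burst starts at this load
    active = min(cycle, busy + transition_us)  # full power: transmit + transitions
    lpi = cycle - active                       # remaining time is spent in LPI
    return (active + ps_ratio * lpi) / cycle


for n in (1, 10, 100):
    print(n, round(energy_fraction(n), 3))
```

Because the arrivals here are deterministic, the values are not expected to match the simulation results reported for Poisson traffic; the sketch only shows the trend that the fixed transition cost is amortized over more packets as the burst grows.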

[15% load]           No coalescer   Small (15 KB) coalescer   Large (150 KB) coalescer
Energy consumption   65%            45%                       27%
Avg. packet delay    5 μs           12 μs                     67 μs

* 3.2.3 Results

Reading the 15% load results:

• No coalescer: coalescing adds no delay, but energy is not saved.

• Large (150 KB) coalescer: delay rises steeply relative to the additional energy saved.

Message Coalescing for InfiniBand Clusters

2014-06-25 (Wed)

Source: "Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach", Matthew J. Koop (1)(2), Terry Jones (2), Dhabaleswar K. Panda (1)

(1) Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
(2) Lawrence Livermore National Laboratory, Livermore, CA 94550

* Published in: Cluster Computing and the Grid, 2007 (CCGRID 2007), Seventh IEEE International Symposium on. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4215416&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4215416

Contents

• InfiniBand Architecture Overview

• MSG Coalescing

InfiniBand Architecture Overview - Introduction - How it Works?

Introduction : InfiniBand Architecture Overview

* 2.1 InfiniBand Architecture Overview

Feature                              | InfiniBand               | PCI-X            | Fibre Channel          | 1 Gb & 10 Gb Ethernet  | Hypertransport        | Rapid I/O
Bus/link bandwidth                   | 2.5/10/30 Gbps           | 8.51 Gbps        | 1/2.1 Gbps             | 1 Gb, 10 Gb            | 12.8, 25.6, 51.2 Gbps | 16/3 Gbps
Bus/link bandwidth (fully duplexed)  | 5/20/60 Gbps             | n/a              | 2.1/4.2 Gb             | 2 Gb, 20 Gb            | 25.6, 51.2, 102 Gbps  | 32/64 Gbps
Pin count                            | 4/16/48 (see note 4)     | 90               | 4                      | 4, Fiber               | 55, 103, 197          | 40/76
Maximum signal length                | km                       | inches           | km                     | km                     | inches                | inches
Transport media                      | PCB, fiber, copper cable | PCB only         | Copper and fiber cable | Copper and fiber cable | PCB only              | PCB only
Maximum packet payload               | 4 KB                     | Not packet based | 2 KB                   | 1.5 KB (Jumbo: 9 KB)   | 64 B                  | 256 B

Feature-support rows (entries in source order; the column each mark belongs to is not recoverable from the extracted text):
Simultaneous peer-to-peer communication | 15 VLs + management lane | X | Three transaction flows
Native hardware transport support       | X
In-band management                      | X | Not native; can use IP
RDMA support                            | X
Native support for virtual interface    | X
End-to-end management                   | X | X | X | X
Memory partitioning                     | X | X
QoS                                     | X | X | Limited | X
Reliable                                | X | X | X | X
Scalable                                | X | X | X | X | X

Notes:
1. The raw bandwidth of an InfiniBand 1X link is 2.5 Gbps (per link). Data bandwidth (due to 8B/10B encoding) is 2.0 Gbps for 1X, 8 Gbps for 4X, and 24 Gbps for 12X; twice that for full duplex, or 4/16/48 Gbps.
2. The bandwidth of 2-Gb fibre channel is 2.1 Gbps, but the actual raw bandwidth (due to 8B/10B encoding) is 20% lower, or around 1.7 Gbps (twice that for full duplex).
3. Values are for 8B/16B data paths peak at 1-GHz operation. Speeds of 125, 250, and 500 MHz are supported.
4. The pin count for a 1X link is four pins, up to 48 pins for a 12X link.
5. Memory partitioning enables multiple hosts to access storage endpoints in a controlled manner based on a key. Access to a particular endpoint is controlled by this key, so different hosts can have access to different elements in the network.

* InfiniBand: Thinking Outside the Box Design http://www.eetimes.com/document.asp?doc_id=1204052
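The 8B/10B arithmetic in note 1 is easy to check: every 10 bits on the wire carry 8 bits of data, so the data rate is 80% of the raw rate. The helper name below is hypothetical and the snippet is only a worked calculation of the note's figures.

```python
def data_rate_gbps(raw_gbps: float) -> float:
    """Usable data rate of an 8B/10B-encoded link (8 data bits per 10 line bits)."""
    return raw_gbps * 8 / 10


# Raw InfiniBand link rates quoted in the table: 1X, 4X, and 12X.
for width, raw in (("1X", 2.5), ("4X", 10.0), ("12X", 30.0)):
    one_way = data_rate_gbps(raw)
    print(width, one_way, "Gbps one way,", 2 * one_way, "Gbps full duplex")
```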



Feature                              | InfiniBand               | 1 Gb & 10 Gb Ethernet  | Hypertransport
Bus/link bandwidth                   | 2.5/10/30 Gbps           | 1 Gb, 10 Gb            | 51.2 Gbps
Bus/link bandwidth (fully duplexed)  | 5/20/60 Gbps             | 2 Gb, 20 Gb            | 102 Gbps
Maximum signal length                | km                       | km                     | inches
Transport media                      | PCB, fiber, copper cable | Copper and fiber cable | PCB only

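In contrast to hierarchical switched networks such as the traditional Ethernet architecture, InfiniBand uses a switched fabric topology. Every transmission begins or ends at a channel adapter. Each processor has a host channel adapter (HCA), and each peripheral has a target channel adapter (TCA). These adapters can exchange information for security and QoS.

* INFINIBAND by Carlo Kopp http://www.csse.monash.edu.au/~carlo/SYSTEMS/Infiniband-Intro-0901.html
* http://ko.wikipedia.org/wiki/인피니밴드
* http://etherealmind.com/what-is-the-definition-of-switch-fabric/

In a switched fabric, the nodes are interwoven like threads in a piece of cloth. Links are point-to-point, so the end nodes themselves do not need to run a routing algorithm.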

a host channel adapter (HCA)

a target channel adapter (TCA)

Channel Adapters

The HCA provides an interface to a host CPU and memory subsystem, such as a web server, and supports all software verbs defined by the InfiniBand architecture.

A TCA, on the other hand, provides the connection to an I/O device from InfiniBand. This I/O card, which could be a network interface card (NIC), houses a subset of features necessary for each device's specific operations.


* High-Performance Buses and Interconnects http://www.pcmag.com/article2/0,2817,1154809,00.asp

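[Diagram: msg send/recv via a NIC (msg -> (kern) -> Pkt Pkt) compared with msg send/recv via InfiniBand; position and properties of the HCA.]

- Using InfiniBand instead of Ethernet:
• The packetization that used to take place in the Transport / Network Layer is simplified.
• As a result, CPU usage and latency decrease.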

* Enterprise Distributed Systems and Infiniband http://www.cisco.com/c/en/us/products/collateral/switches/sfs-7000-series-infiniband-server-switches/prod_white_paper0900aecd804f90f3.html

How it Works?

When using a connection-based model, a pair of hosts that wishes to communicate must each set up a dedicated Queue Pair (QP) for communication with that peer. Each QP is linked to a Completion Queue (CQ) for notification of completion. In this connection-based model, there is additional memory usage with each additional connection.

To send a message a descriptor is posted to the QP. This descriptor contains information about the message to be sent, including the data address, memory keys, and message length. To receive a message using channel semantics a receive descriptor must be posted containing the address and length of the buffer. Upon posting a descriptor, a send Work Queue Entry (WQE), pronounced “wookie,” is used to track the progress of the request.

Upon completion of a WQE, a Completion Queue Entry (CQE), "cookie," is placed in the CQ. This method is used in both channel and memory semantics. CQEs can be obtained by polling the CQ or through an event-based method.

When a QP is created, the number of send and receive WQEs must be defined. The number of WQEs allocated determines the number of outstanding send and receive operations allowed on a single QP. Using a Shared Receive Queue (SRQ) allows receive WQEs and buffers to be shared rather than per QP, which allows far better scalability. Benefits are demonstrated in [17] and we will assume SRQ is being used. Even using a SRQ, however, send WQEs must be posted per QP. Thus, the number of send WQEs allocated for a QP determines how many outstanding send operations are allowed for that connection.
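The relationship between a QP, its send WQEs, and the CQ can be illustrated with a toy model. This is not the InfiniBand verbs API; the class, method, and field names are hypothetical, and completion is triggered by hand instead of by hardware. The point is only that the number of send WQEs fixed at QP creation caps the outstanding sends on that connection.

```python
from collections import deque


class QueuePair:
    """Toy model of a QP with a fixed number of send WQEs and a linked CQ."""

    def __init__(self, max_send_wqes: int, completion_queue: deque):
        self.max_send_wqes = max_send_wqes
        self.cq = completion_queue
        self.outstanding = deque()  # posted send WQEs that have not completed yet

    def post_send(self, address: int, length: int, memory_key: int) -> bool:
        """Post a send descriptor; fail when every send WQE is in use."""
        if len(self.outstanding) >= self.max_send_wqes:
            return False
        self.outstanding.append({"addr": address, "len": length, "key": memory_key})
        return True

    def complete_one(self):
        """Simulate the adapter finishing the oldest send: its WQE becomes a CQE."""
        wqe = self.outstanding.popleft()
        self.cq.append({"status": "ok", "wqe": wqe})


def poll_cq(cq: deque):
    """Polling: return the next CQE, or None if the CQ is empty."""
    return cq.popleft() if cq else None
```

With `max_send_wqes=2`, a third `post_send` is refused until `complete_one` frees an entry, which is the per-connection limit that motivates coalescing several messages into one send.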

* 2.1 InfiniBand Architecture Overview

How it Works?

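* INFINIBAND by Carlo Kopp http://www.csse.monash.edu.au/~carlo/SYSTEMS/Infiniband-Intro-0901.html

FIFO queue: the HCA looks up the contents of the Work Queue, reads the corresponding message from main memory, and converts it into packets.

When the transfer finishes, transfer-completion information is recorded in the corresponding Completion Queue.

The transmitted packets are reassembled into a message at the destination node and stored in its Work Queue.

Hardware details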

MSG Coalescing - Motivation - Design - Evaluation

* MSG / PKT Coalescing

[Diagram: messages pass from the OS Kernel to the NIC. Inputs shown: 100 messages of 15 B, or 10 messages of 1500 B. EEE path: n = 10 packets of 1500 B are gathered in the PKT Coalescer Buffer (H/W), bufsiz = 15000B, and sent as pkt ··· pkt. InfiniBand path [HCA][TCA]: MSG Coalesce (S/W) gathers msg ··· msg, n = 100, total 1500B, into a single packet.]

MSG Coalescing Motivation

MSG Coalescing Design

1. alter the send flow operation.

2. use the InfiniBand scatter/gather capabilities instead of packing into the same buffer.

3. cache the MPI tag matching information for each message.
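Design points 2 and 3 can be sketched as follows: instead of copying small messages into one staging buffer, a single send descriptor carries a gather list that references each message buffer, and the MPI tag of each message is stored alongside it. All names here are hypothetical, `max_segments` is an assumed limit on gather entries per descriptor (the real limit depends on the adapter), and this is not the paper's implementation or the verbs API.

```python
from dataclasses import dataclass, field


@dataclass
class CoalescedSend:
    """One send descriptor covering several small MPI messages."""
    segments: list = field(default_factory=list)  # (buffer view, length) gather entries
    tags: list = field(default_factory=list)      # cached MPI tag for each message


def coalesce(pending, max_segments=4):
    """Group pending (tag, buffer) messages into gather-list descriptors."""
    descriptors = []
    current = CoalescedSend()
    for tag, buf in pending:
        if len(current.segments) == max_segments:  # descriptor full, start another
            descriptors.append(current)
            current = CoalescedSend()
        current.segments.append((memoryview(buf), len(buf)))  # reference, no copy
        current.tags.append(tag)
    if current.segments:
        descriptors.append(current)
    return descriptors
```

Ten pending messages with `max_segments=4` need three descriptors rather than ten, so fewer send WQEs are consumed for the same traffic.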

MSG Coalescing Evaluation

Our experimental testbed is a 575-node InfiniBand Linux cluster at Lawrence Livermore National Laboratory. Each compute node has four 2.5 GHz Opteron 8216 dual-core processors for a total of 8 cores. Total memory per node is 16GB. Each node has a Mellanox MT25208 DDR HCA. InfiniBand software support is provided through the OpenFabrics/Gen2 stack [15]. The Intel v9.0 compiler is used for compilation of the MVAPICH library and applications.
