


Grounding High Efficiency Cloud Computing

Architecture: HW-SW Co-Design and

Implementation of a Stand-alone Web Server on

FPGA

Jibo Yu¹, Yongxin Zhu¹, Liang Xia¹, Meikang Qiu², Yuzhuo Fu¹, Guoguang Rong¹

¹ School of Microelectronics, Shanghai Jiao Tong University, Shanghai, China

² Dept. of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract - With the advent of cloud computing, web servers, as the major channel in cloud computing, need to be redesigned to meet performance and power constraints. Considerable efforts have been invested in distributed web servers and web caching with different optimizing strategies, but few existing studies have focused directly on improving the web server itself, not to mention complete hardware-favored web services. In this paper, we propose a novel web server architecture and implement it on FPGA. After taking on challenges with significant difficulties in design and implementation, we complete an evaluation system which confirms that the hardware-favored architecture brings higher throughput, lower power consumption, as well as stand-alone web service functionality, due to direct pipelined execution of web service protocols in hardware without an operating system.

Key words: cloud computing; architecture; web server; FPGA

I. INTRODUCTION

Web appeared in 1989, shortly after the invention of the Internet in 1984. Web applications have been the motivator of internet applications since then, e.g. the Mosaic browser in 1993, e-commerce in 1995, the semantic web in 1999, utility computing in 2000, and the Wiki encyclopedia (Web 2.0) in 2001. Even in the era of cloud computing, since IBM and Google proposed the blue cloud in 2007, the Web remains the major channel of cloud computing. As the number of users explodes with varieties of applications, web servers bear an ever tougher workload as well as requirements on delays and bandwidth. This situation becomes worse as users request multimedia data more frequently, which incurs the need for much larger network bandwidth than text data [1].

With the growth in e-commerce as well as the increasing volume of information available on the Web, the number of users of the WWW is growing rapidly. Typically, an online shop employs catalogues and transaction-handling databases. Clients perform browsing as well as financial transactions while shopping online. This means that the web server has to handle a great deal of both static and dynamic web page requests. Other applications that impose a heavy demand on a web server include movie clips, extremely large audio and video files, and dynamic pages generated through CGI scripts.

Earlier efforts have put more emphasis on improving web performance by solving the problems caused by network traffic. Several modifications to the Hyper Text Transfer Protocol have been proposed. Equally important is the fact that as the communication bandwidth available to clients increases, the size of web documents will tend to increase, and each client will generate more and more web page requests to the server, thus pushing the performance bottleneck to the server system. This in turn causes an increase in the client's perceived latency for a web page request [2].

978-1-4244-9825-3/11/$26.00 ©2011 IEEE

Increasing the network bandwidth involves governmental policy on national infrastructure, which requires a long-term investment of resources. The gap between network traffic demand and network bandwidth capacity is widening [3]. Web servers are anticipated to be the bottleneck in hosting network-based services [4]. With the advent of cloud computing, this situation could become even worse. Therefore, it is urgent to improve the quality of web service.

There are three ways for a web site to handle high traffic, namely replication (mirroring), distributed caching, and improving server performance [5]. Replication simply distributes the same web information to multiple machines that form either a cluster [6] or are spread across different locations using various kinds of load balancing strategies [7]. Since any of the machines can serve requests independently, the load on each individual web server is reduced. Distributed caching includes client-side caching [8], proxy caching [9] and dedicated cache servers [1]. These methods transparently cache remote web pages on local storage or on a cache machine that is close to the clients, thereby reducing the traffic seen by the original server. Finally, improving server performance consists of enhancing hardware efficiency, adopting better web server software techniques, and utilizing high-bandwidth network connections.

Considerable efforts have been invested in studying replication and distributed caching, and many interesting and effective approaches have been proposed and implemented. On the other hand, less attention has been paid to improving web server performance [4]. The author of [10] presents a design of a new web server architecture called the asymmetric multi-process event-driven architecture. The author of [11] proposes the use of main memory compression techniques to increase the available memory and mitigate the disk bandwidth problem. The author of [12] realizes adaptive web server performance optimization through historical experience and a feedback mechanism. The studies mentioned above adopted software-based approaches; the authors of [13, 14, 15] provide hardware-based approaches in order to obtain smaller and faster embedded web servers.

Though these embedded web servers can meet the requirements of embedded applications, their performance is much lower than that of generic CPU-based web servers for cloud computing applications. Beyond these, there are few published studies concerning the web server itself. In this study, we present a novel architecture for a web server system and implement it on FPGA in order to improve throughput and power efficiency.

A. Motivation

Over the past several years, a number of architectures have been proposed to overcome limitations in the original models, as well as to improve performance and cope with the increasing popularity of Web-based services [16].

Nevertheless, these architectures are software-based solutions, i.e. they rely on CPUs to do everything, namely the operating system (OS), the network interface driver, the TCP/IP protocol stack, the web server, and the web services. These practices imply poor power efficiency, which is critical in the IT industry.

Some typical software-based web server architectures, each with its own software improvements, are as follows. The single-process event-driven (SPED) architecture uses a single event-driven server process to perform concurrent processing of multiple HTTP requests [10]. The Asymmetric Multi-Process Event-Driven (AMPED) architecture combines the event-driven approach of the SPED architecture with multiple helper processes that handle blocking disk I/O operations [10]. The Staged Event-Driven Architecture (SEDA) [17] works like a pipelined server that consists of multiple stages, each of which is associated with a pool of threads. The Symmetric Multi-Process Event-Driven (SYMPED) architecture extends the SPED model by employing multiple processes, each of which acts as a SPED server, to mitigate blocking file I/O operations and to utilize multiple processors [18].
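For reference, the SPED model can be sketched in a few lines (our illustrative Python, not code from the paper or from [10]): a single process services every connection from one readiness loop, so no per-connection thread or process is needed.

```python
# Minimal single-process event-driven (SPED) loop: one process multiplexes
# all connections with a readiness API instead of per-connection threads.
import selectors
import socket

def sped_loop(sel, rounds=10):
    """Run the event loop until nothing is ready; key.data holds the handler."""
    results = []
    for _ in range(rounds):
        events = sel.select(timeout=0.2)
        if not events:
            break  # a real server would keep waiting for new requests
        for key, _mask in events:
            results.append(key.data(key.fileobj))
    return results

def handler(name):
    def handle(conn):
        request_line = conn.recv(1024).decode().split("\r\n")[0]
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello")
        return (name, request_line)
    return handle

# Two in-process "clients" over socketpairs, so the demo needs no network.
sel = selectors.DefaultSelector()
c1_client, c1_server = socket.socketpair()
c2_client, c2_server = socket.socketpair()
sel.register(c1_server, selectors.EVENT_READ, data=handler("c1"))
sel.register(c2_server, selectors.EVENT_READ, data=handler("c2"))
c1_client.sendall(b"GET /index.html HTTP/1.0\r\n\r\n")
c2_client.sendall(b"GET /logo.png HTTP/1.0\r\n\r\n")
served = sorted(sped_loop(sel))
print(served)
```

AMPED and SYMPED extend exactly this loop, with helper processes or multiple copies of it, respectively.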

Cloud computing is quickly becoming one of the most popular and trendy phrases being tossed around in today's technology world [19]. To achieve a significant overhaul in efficiency under the huge workload of cloud computing, we propose to take all layers of web service protocols out of the CPU and implement them in the hardware partition of an FPGA-based web server. Due to the pipelined implementation of web service protocols in hardware, hardware-based web servers, compared with software-based ones, are able to accelerate web processing, shorten web processing time, and enhance throughput as well as power efficiency. Another overhead saved in hardware is the OS layer underneath the application software on the CPU.

B. Contributions

To the best of our knowledge, our work in this paper is the first contribution to the web service domain in the form of a novel architecture for a stand-alone hardware-based web server whose performance and efficiency are better than mainstream generic CPU-based solutions.

To be fair and accurate, we should clarify that there is a soft-core processor in the FPGA in our stand-alone hardware-based web server. We will present the HW/SW co-design details of the proposed web server architecture to let the audience understand how the soft-core processor initializes the system and the modules in the hardware partition, which carries out the actual tough work.

We will present an evaluation of the performance and efficiency of our implementation of the web server entirely on FPGA. The results indicate that the hardware-based approach shortens the network computing time [20], which is one of three key factors that influence the quality of web service, and enhances throughput.

The rest of the paper is organized as follows. The basic architecture of the web server system is described in Section II. Section III presents the HW/SW co-design of the system. The performance of the system is evaluated in Section IV. Conclusions and future research directions are given in Section V.

II. BASIC ARCHITECTURE

The overall architecture of the web server system is shown in Figure 1. A MicroBlaze soft-core processor running software and the web processing module (WPM), the hardware partition implementing a simple web server, are connected by a register bank. Both the software and hardware partitions can access DDR RAM through the multi-port memory controller (MPMC). The architecture of the WPM is shown in Figure 2. The WPM consists of five sub-modules, namely the TCP packet decomposer, URL parser, file splitter, TCP/IP processing, and timing service. Since HTTP GET is the typical HTTP request from web clients, it is the only request type we process in the prototype design.

The WPM is a hardware module which implements the MAC, TCP/IP and HTTP protocols and session management in a hardware pipeline, instead of the MAC, TCP/IP protocol stack and HTTP protocol software being executed by the CPU by sharing CPU time slices. The WPM dramatically shortens the processing time of the TCP/IP protocols. We use a zero-copy technique for efficient data transfer, which reduces data processing time and saves memory bandwidth. Besides, this design has no operating system, which simplifies the whole process. As a result, this design shortens the web processing time and enhances throughput.
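The pipeline can be modelled in software as a chain of stage functions (our illustrative sketch, not the Verilog source); the page contents and the MSS value below are assumptions made for the example:

```python
# Software model of the WPM pipeline for an HTTP GET: each stage hands a
# record to the next, mirroring decomposer -> parser -> splitter in hardware.

MSS = 1460  # assumed TCP maximum segment size for the sketch

PAGES = {"/index.html": b"<html>" + b"x" * 3000 + b"</html>"}  # in-DRAM pages

def tcp_decompose(frame):
    """Stage 1: strip the (simulated) TCP header, keep the payload."""
    return frame["payload"]

def url_parse(payload):
    """Stage 2: extract the request path from an HTTP GET request line."""
    method, path, _version = payload.split(b"\r\n")[0].split(b" ")
    assert method == b"GET", "prototype handles only HTTP GET"
    return path.decode()

def file_split(path):
    """Stage 3: fetch the page and split it into MSS-sized TCP segments."""
    data = b"HTTP/1.0 200 OK\r\n\r\n" + PAGES[path]
    return [data[i:i + MSS] for i in range(0, len(data), MSS)]

request = {"payload": b"GET /index.html HTTP/1.0\r\n\r\n"}
segments = file_split(url_parse(tcp_decompose(request)))
print(len(segments), sum(len(s) for s in segments))
```

In the hardware, the stages run concurrently on different requests, which is where the pipeline throughput gain comes from; the software model processes one record at a time.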

The whole system is implemented on an FPGA whose power consumption is so low that our hardware-based web server consumes little power as well.


[Figure 1: block diagram of the MicroBlaze soft-core, the register bank, the web processing module (TCP packet decomposer, URL parser, file splitter, TCP/IP processing, timing service) and the Multi-Port Memory Controller accessing DRAM, all within the FPGA.]

Figure 1. Overview of the architecture of our hardware-favored web server

[Figure 2: the TCP packet decomposer feeds request FIFOs for the URL parser, the connection manager (connection request, disconnection request, disconnection accept and data acknowledgement processing) and GET request processing; the file splitter reads web pages from memory through a file encapsulating FIFO and feeds the TCP packet encapsulator, filter and sender via a TCP packet sending FIFO; the timing service spans the pipeline.]

Figure 2. Architecture of the web processing module (WPM) in a pipelined fashion


III. HARDWARE/SOFTWARE CO-DESIGN

In this paper, the WPM is the hardware partition we designed to improve the quality of web service, while the other components, such as the MicroBlaze Debug Module (MDM), interrupt controller (INTC), timer, IIC, UART and MPMC, are IPs provided by the Xilinx Embedded Development Kit.

The platform for our hardware/software co-design is the BEE3 prototyping board from BEEcube Inc. There are four FPGAs on the board, each of which has two independent DDR2 channels. In this design we implement everything using FPGA A only. The whole design can be replicated on FPGAs B, C and D; in other words, four separate web servers can be implemented on the BEE3 prototyping board. The layout of the prototyping board is shown in Figure 3. The design tool we used is Xilinx ISE Design Suite 12.2.

Figure 3. BEE3 prototyping board

A. Interfaces between HW/SW partitions

We design a register bank as the interface between the hardware and software partitions. The hardware partition can access DRAM via the Native Port Interface (NPI) through the MPMC, and the software partition can access DRAM via the Processor Local Bus (PLB) interface. The MDM, interrupt controller, timer, IIC, UART and the register bank are all connected to MicroBlaze via the PLB. Since the performance of the register bank is sufficient for the hardware and software partitions, the bottleneck of this design is actually memory access, because all five sub-modules in the WPM tend to read/write DRAM. To mitigate the impact of this bottleneck, we propose to use the two independent DDR2 channels per FPGA on the board. To evaluate the performance improvement, we start with a single memory channel system.

1) Single memory channel system

The single memory channel system uses one of the two independent DDR2 channels to access DRAM. The sub-modules in the WPM are connected to the NPI via an arbiter. Although this control system is easy to design and implement, there is fierce contention for memory access. Figure 4 shows the diagram of the single memory channel system.
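The contention can be illustrated with a toy round-robin arbiter model (ours; the real arbiter, and the dedicated file-read channel of the dual-channel design, are more involved): with the same request load, doubling the number of serviced requests per cycle roughly halves the drain time.

```python
# Toy model of WPM sub-modules contending for DRAM through an arbiter.
from collections import deque

def arbitrate(requests, ports=1):
    """Round-robin service of per-module request queues.
    Each port serves one request per cycle; returns cycles to drain all queues."""
    queues = deque(deque(r) for r in requests)
    cycles = 0
    while any(queues):
        for _ in range(ports):
            for _ in range(len(queues)):
                queues.rotate(-1)        # advance the round-robin pointer
                if queues[-1]:
                    queues[-1].popleft() # grant this module's oldest request
                    break
        cycles += 1
    return cycles

# Five sub-modules, each issuing 4 DRAM requests:
workload = [[1] * 4 for _ in range(5)]
single = arbitrate([list(w) for w in workload], ports=1)
dual = arbitrate([list(w) for w in workload], ports=2)
print(single, dual)
```

The model serves only to show why a second independent channel helps; it ignores DDR2 timing, burst lengths and the PLB traffic from MicroBlaze.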


[Figure 4: MicroBlaze and the WPM sub-modules share one DDR2 channel to DRAM through the multi-port memory controller, over the PLB and NPI respectively.]

Figure 4. Diagram of the single memory channel system

2) Dual memory channel system


Each FPGA on the BEE3 board can enable two independent DDR2 channels, so we can use both of them to access DRAM simultaneously. We modified the WPM and used a dedicated NPI port to read web page file data from DRAM in order to reduce contention. The diagram of the dual memory channel system is shown in Figure 5.

[Figure 5: MicroBlaze and the WPM share one DDR2 channel through the multi-port memory controller, while a dedicated NPI port reads web page data over the second DDR2 channel.]

Figure 5. Diagram of the dual memory channel system

B. Register bank


We designed two types of registers in the register bank. One type is configuration (config) registers, which are written by the software and read-only to the hardware; the other type is statistic registers, which are written by the hardware and read-only to the software. The config registers allow the software to initialize the hardware, while the statistic registers allow the software to collect statistics from the hardware. The usage of the register bank is shown in Table I. The config registers for system initialization are used to initialize system resources. The major registers in the register bank are listed in Table II.
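A minimal software model (ours, not the RTL) makes the access directions of the two register types concrete; `server_port` and `link_num` are register names taken from Table II, and the access methods are illustrative:

```python
# Model of the register bank: config registers are software-writable and
# hardware-readable; statistic registers are hardware-writable and
# software-readable.

class RegisterBank:
    def __init__(self):
        self.config = {}     # SW write, HW read
        self.statistic = {}  # HW write, SW read

    def sw_write(self, name, value):   # software initializes the hardware
        self.config[name] = value

    def hw_read(self, name):           # hardware fetches its parameters
        return self.config[name]

    def hw_write(self, name, value):   # hardware posts statistics
        self.statistic[name] = value

    def sw_read(self, name):           # software collects statistics
        return self.statistic[name]

bank = RegisterBank()
bank.sw_write("server_port", 80)       # config register, set at init time
bank.hw_write("link_num", 3)           # statistic register, updated by WPM
print(bank.hw_read("server_port"), bank.sw_read("link_num"))
```

Splitting the bank this way avoids any read/write conflict between the partitions: each register has exactly one writer.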

Page 5: Grounding High Efficiency Cloud Computing Architecture: HW-SW

TABLE I. USAGE OF THE REGISTER BANK

                           config registers      statistic registers
                           allocated  used       allocated  used
    TCP packet decomposer     16        3           16       14
    URL parser                16       11           16       16
    file splitter             16        2           16        3
    TCP/IP processing         16        9           16        3
    timing service            16        3           16        3
    system initialization     16       12              none

TABLE II. MAJOR REGISTERS IN THE REGISTER BANK

                           config registers        statistic registers
    TCP packet decomposer  handshake_overtime      link_num
                           eot_overtime            arrival_tcp_packet_num
                           idle_overtime           arrival_get_packet_num
    URL parser             sys_time                url_msg_rx_num
    file splitter          ack_packet_overtime     retrans_tcp_packet_num
    TCP/IP processing      server_mac_addr         rcv_mac_num
                           server_ip               rcv_ip_num
                           server_port             rcv_seg_num
                           server_ip_mask          tcp_num
                           router_ip               ping_respon_lost_num
                           router_mac_addr         arp_respon_lost_num
    timing service         ack_packet_overtime     time_out
    system initialization  tcp_rcv_stack           none
                           free_url_msg_stack
                           send_file_msg_stack
                           http_head_stack
                           free_tcp_head_stack
                           url_file_index

C. Hardware partition

The components in the hardware partition of the system are as follows: the IIC to collect local time information from the EEPROM; the MDM to provide a debug interface for software; the interrupt controller to receive interrupt signals from the timer and UART and send an interrupt request to MicroBlaze; the timer to provide an interrupt signal every second; the UART to communicate between the FPGA and the host computer; the WPM to improve the quality of web service; and the MPMC to provide interfaces for accessing DRAM. The WPM is the kernel component of the partition. It is written in Verilog and implemented to achieve higher throughput for the web server. The architecture of the WPM is illustrated in Figure 2.

D. Software partition

In this system, the software partition is designed to initialize the hardware and application-specific data. The whole system starts to work as soon as initialization is done. Besides, the software reads statistics from the register bank and sends them to the host computer via UART every second. The local time is also read by the software from the EEPROM and is transferred to the hardware, which labels it in every TCP packet. The cooperation of hardware and software is shown in Figure 6.
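The control flow of the software partition can be sketched as follows (our reconstruction, not the shipped firmware; the register names follow Table II, the values and UART strings are illustrative):

```python
# Software partition sketch: write config registers, signal start, then on
# each one-second timer tick snapshot the statistic registers and push them
# over UART to the host.
config_regs = {}
stat_regs = {"link_num": 0, "arrival_tcp_packet_num": 0}
uart_log = []  # stand-in for bytes written to the UART

def uart_send(line):
    uart_log.append(line)

def initialize():
    # Software fills the config registers; hardware starts once this is done.
    config_regs.update(server_ip="10.0.0.2", server_port=80)
    uart_send("init done")

def timer_tick():
    # Invoked by the 1 Hz timer interrupt: snapshot and forward statistics.
    uart_send(f"stats {dict(stat_regs)}")

initialize()
for _second in range(2):                       # two simulated seconds
    stat_regs["arrival_tcp_packet_num"] += 5   # hardware updates concurrently
    timer_tick()
print(uart_log[0], len(uart_log))
```

After initialization the CPU's only recurring duty is this one-second statistics poll; all request processing stays in the WPM.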


[Figure 6: the software raises a start signal once initialization completes; the hardware then serves requests, and a timer interrupt signal triggers the statistics collection every second.]

Figure 6. Cooperation of hardware and software

IV. PERFORMANCE EVALUATION

The system implemented on FPGA A runs at 125 MHz and is evaluated with web test equipment, i.e. the Avalanche 2900 and Spirent TestCenter 3.51. Results are compared with Apache 2.2 and nginx 0.7.61, which run on a mainstream quad-core processor, the Intel Xeon 5520. The speed of the physical Ethernet port on the FPGA board is 1 Gbps. All the web pages under test are present in the DDR memory of the testing system as well as in the main memory of the reference Xeon 5520 platform. The throughputs of the different systems are shown in Figure 7.

[Figure 7: bar chart of throughput versus web page size (4K, 10K, 100K, 1M bytes) for the single channel, dual channel, apache and nginx systems; values range from 0 to about 1000.]

Figure 7. Throughputs of different systems

The throughput of the dual memory channel system is higher than that of the single memory channel system because the web page processing time in the dual channel system is less than that in the single channel system. For larger web page sizes, throughputs are constrained by the physical Ethernet port. Therefore, throughputs are about the same for the single channel and dual channel systems when the web page size is equal to or greater than 100KB. If we had a faster Ethernet port on the board, we could obtain better evaluation results for our systems.
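The Ethernet bound can be checked with a back-of-the-envelope calculation (ours, with an assumed ~5% protocol overhead; these are not measurements from the paper): at 1 Gbps the link itself caps the achievable request rate once pages grow large, regardless of how fast the memory system is.

```python
# Line-rate bound on request throughput for the 1 Gbps physical port.
LINE_RATE_BPS = 1_000_000_000  # 1 Gbps

def max_requests_per_sec(page_bytes, overhead=1.05):
    """Requests/s the link can carry, with an assumed ~5% protocol overhead."""
    bits_per_page = page_bytes * 8 * overhead
    return LINE_RATE_BPS / bits_per_page

for size in (4 * 1024, 10 * 1024, 100 * 1024, 1024 * 1024):
    print(f"{size // 1024:>5} KB: {max_requests_per_sec(size):10.0f} req/s")
```

Once the server can sustain this rate for a given page size, adding a second memory channel cannot raise the measured throughput further, which matches the convergence of the bars at 100KB and above.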

The power consumption of the single memory channel FPGA system, the dual memory channel FPGA system, Apache on the CPU system and nginx on the CPU system is nW, nW, 280W and 255W respectively. The power efficiency of each system is shown in Figure 8.
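The power-efficiency metric plotted in Figure 8 is simply throughput divided by power (e.g. Mbps per watt). The throughput value below is a hypothetical placeholder chosen only to show the arithmetic; the measured wattages of the FPGA systems are not reproduced here.

```python
# Power efficiency = throughput / power.
def power_efficiency(throughput_mbps, power_w):
    return throughput_mbps / power_w

apache_eff = power_efficiency(800, 280)  # assumed ~800 Mbps served at 280 W
print(round(apache_eff, 2))
```

By this metric a server that draws a fraction of the CPU system's power while sustaining comparable throughput scores proportionally higher, which is the effect Figure 8 reports.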

[Figure 8: line chart of power efficiency versus web page size (4K, 10K, 100K, 1M bytes) for the single channel, dual channel, apache and nginx systems; values range from 0 to about 15.]

Figure 8. Power efficiency of different systems

V. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

In the HW/SW co-design of this paper, we show that the hardware domain carries out the major tasks to achieve better performance and power efficiency, although the software domain is still required to manage the initialization of the hardware and application-specific data. Due to the hardware pipeline implementation of web service protocols and direct execution without an OS in hardware, our hardware-favored architecture brings higher throughput, lower power consumption, as well as stand-alone web service functionality. The power efficiency of our systems is about 4 times that of web service software, i.e. Apache and nginx over Linux on a CPU. Careful calibration of memory management can further improve the overall performance and power efficiency. The significant improvement in power efficiency indicates that the reconfigurable hardware approach to cloud computing is promising. With our experimental results, we believe that more researchers and developers will be convinced to convert more mature components of cloud computing into hardware-favored implementations to save precious energy.

ACKNOWLEDGMENT

This paper is partially sponsored by the National High-Technology Research and Development Program of China (863 Program) (No. 2009AA012201) and the Shanghai International Science and Technology Collaboration Program (09540701900), as well as NSFC 61071061 and the University of Kentucky Start-Up Fund.

REFERENCES


[1] D. Lee and K. J. Kim. A Study on Improving Web Cache Server Performance Using Delayed Caching. Information Science and Applications (ICISA), 2010 International Conference: 1-5, 2010.

[2] S. Nadimpally and S. Majumdar. Techniques for Achieving High Performance Web Servers. Parallel Processing: 233-241, 2000.

[3] Jeffrey K. MacKie-Mason and Hal R. Varian, "Some Economics of the Internet," in 10th Michigan Public Utility Conference at Western Michigan University, November 1992.

[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P. S. Yu. The State of the Art in Locally Distributed Web-Server Systems. ACM Computing Surveys (CSUR), 34(2):263-311, 2002.

[5] Y. Hu, A. Nanda, and Q. Yang. Measurement, Analysis and Performance Improvement of the Apache Web Server. Performance, Computing and Communications Conference: 261-267, 1999.

[6] T. Schroeder, S. Goddard, and B. Ramamurthy. Scalable Web Server Clustering Technologies. Network, IEEE, 14(3):38-45, 2000.

[7] M. Swain and Y. Kim. A Study of Data Source Selection in Distributed Web Server Systems. SOUTHEASTCON '09, IEEE: 311-316, 2009.

[8] A. Bestavros, R. L. Carter, M. E. Crovella, C. R. Cunha, A. Heddaya, and S. A. Mirdad, "Application-level document caching in the internet," in Proceedings of the Second Intl. Workshop on Services in Distributed and Networked Environments (SDNE '95), 1995.

[9] P. Cao and S. Irani, "Cost-aware WWW proxy caching algorithms," in USENIX Symposium on Internet Technologies and Systems (USITS), Dec. 1997.

[10] V. S. Pai, P. Druschel and W. Zwaenepoel. Flash: An efficient and portable Web server. ATEC '99 Proceedings of the Annual Conference on USENIX Annual Technical Conference, 1999.

[11] V. Beltran, J. Torres and E. Ayguade. Improving Web Server Performance Through Main Memory Compression. Proc. of 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS): 303-310, 2008.

[12] Z. Qu, W. Wang and Z. Li. Web Server Optimization Model Based on Performance Analysis. Proc. of 6th International Conference on Wireless Communications Networking and Mobile Computing (WiCOM): 1-4, 2010.

[13] J. Riihijarvi, P. Mahonen, M. J. Saaranen, J. Roivainen and J. Soininen. Providing Network Connectivity for Small Appliances: A Functionally Minimized Embedded Web Server. IEEE Communications Magazine, 39(10):74-79, 2001.

[14] N. N. Joshi, P. K. Dakhole, P. P. Zode. Embedded Web Server on Nios II Embedded FPGA Platform. Proc. of 2nd International Conference on Emerging Trends in Engineering and Technology (ICETET): 372-377, 2009.

[15] M. Choi, H. Ju, H. Cha, S. Kim and J. W. Hong. An Efficient Embedded Web Server for Web-based Network Element Management. Proc. of IEEE/IFIP Network Operations and Management Symposium (NOMS): 187-200, 2000.

[16] F. Azzedin and K. Al-Issa. A Self-Adapting Web Server Architecture: Towards Higher Performance and Better Utilization. High Performance Computing & Simulation: 96-105, 2009.

[17] M. Welsh, D. Culler and E. Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services," Proceedings of the 18th Symposium on Operating Systems Principles (SOSP 2001), Oct. 2001.

[18] D. Pariag, T. Brecht, A. Harji, P. Buhr and A. Shukla, "Comparing the Performance of Web Server Architectures," the 2007 EuroSys Conference, Mar. 2007.

[19] F. Hu, M. Qiu, J. Li, T. Grant, D. Tyloy, S. McCaleb, L. Butler, and R. Hamner, "A Review on Cloud Computing: Design Challenges in Architecture and Security," Journal of Computing and Information Technology (CIT), Vol. 19, No. 1, pp. 25-55, Mar. 2011.

[20] M. Wang and Z. Qi. Research and Practice of Web Server Optimization. Second International Symposium on Electronic Commerce and Security: 432-436, 2009.