Multicores in Cloud Computing: Research Challenges for Applications
Lizhe Wang, Jie Tao, Gregor von Laszewski, Holger Marten
Institute of Pervasive Technology, Indiana University at Bloomington, USA
Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Germany
Email: Lizhe.Wang@gmail.com, Jie.Tao@iwr.fzk.de, Laszewski@gmail.com
Abstract: With modern techniques that allow billions of transistors on a chip, microprocessor design is moving to a world of multicores. A cluster of multicores will commonly be used as an efficient computational platform for high performance computing in the near future. Correspondingly, resource providers, who share their computing elements and storage with customers, will also upgrade their machines to the multicore architecture.
At the same time, the dominant resource sharing platforms, in the form of compute Grids, are facing a grand challenge from Cloud computing. With an easy interface, high-level services, and on-demand provision of compute resources, compute Clouds are attracting more users to join their realm.
Upcoming Cloud environments will be primarily built on top of multicore technologies. However, to fully exploit the computational capacity and the other advantages of multicores, many novel techniques must be proposed and investigated.
This paper serves as an overview of multicore technologies in the context of Cloud computing. It first introduces multicore technologies and the Cloud computing concept. It then investigates various open research issues in multicore Cloud computing, for example, how to efficiently employ multicore techniques for Cloud computing, and how to achieve high multicore performance for high performance applications in Cloud environments. Several examples and discussions are presented.
Keywords: multicore architecture, high performance computing, Cloud computing
I. INTRODUCTION

The rapid development of multicore processors is a milestone for computing systems. It affects compiler construction as well as programming language and application models. Cloud computing has recently emerged as a new computing paradigm that can render the desired computing infrastructure for users on demand. This paper overviews the research field of multicore Cloud computing. It introduces the background of multicore systems and Cloud computing, identifies research issues, and proposes possible solutions. The rest of this paper is organized as follows. Section II discusses the multicore architecture and system. Section III studies the concept of the Cloud computing infrastructure. The multicore Cloud computing context, including research issues, implementations, and solutions, is investigated in Section IV. Finally, the paper concludes in Section V.
II. THE MULTICORE ARCHITECTURE
According to Moore's law, billions of transistors can be integrated on a single chip in the near future. However, due to the physical limitations of semiconductor-based electronics, the clock frequency of microprocessors will not increase exponentially as it did over the last thirty years. Additionally, heat dissipation is a fatal problem for modern microarchitectures, and power consumption has become another limitation on enhancing the CPU frequency.
In this situation, multicore arises as an adequate solution. Different from the single-processor architecture, which deploys all transistors on the chip to form a single core, this novel design partitions the transistors into groups, each of which builds a separate processor core. As transistors communicate within their group rather than globally across the complete chip, shorter latencies and switching times can be maintained. Each core can thus be run at a certain frequency, and the multicore as a whole provides high performance.
In fact, the multicore processor is not only a solution for CPU speed but also reduces power consumption, because several cores running at a lower frequency together generate less heat than one core running at their combined frequency. Therefore, the multicore processor is regarded as the future generation of microprocessor design.
A. Multicore Products
Over the last few years almost every processor manufacturer has put multicores on the market. Active vendors include IBM, Intel, SUN, and AMD.
958 JOURNAL OF COMPUTERS, VOL. 5, NO. 6, JUNE 2010
© 2010 ACADEMY PUBLISHER
doi:10.4304/jcp.5.6.958-964
Figure 1. Multicore architecture with private (left) and shared L2 cache (right)
Together with Sony and Toshiba, IBM developed the Cell multiprocessor for the PlayStation 3 console. It applies a heterogeneous architecture with a PowerPC as the main processor and eight additional synergistic processing elements for computation. The Cell processor can operate at both a low voltage and low power while maintaining high frequency and high performance. Besides Cell, IBM has also developed other multicore architectures such as the BladeCenter JS22 with two Power6 cores.
Intel dual-core processors have been successfully integrated in desktops, notebooks, workstations, and servers. An example is Woodcrest, a Xeon processor with two cores. Recently, Intel's quad-core technology has been delivering outstanding performance and energy efficiency. In addition, Intel has announced a 64-core architecture for the future.
AMD is another active processor developer, and most HPC platforms are built on top of Intel and AMD products. For multicore technology, AMD delivered its dual-core Opteron designs and currently the new quad-core Phenom II processor. The well-known AMD HyperTransport technology provides high communication bandwidth.
Following the UltraSPARC T1, SUN developed the UltraSPARC T2 Plus Niagara with 8 cores. In addition to the large number of cores on the chip, this microprocessor has hardware support for 8 threads per core, forming a maximum of 64 logical CPUs.
Overall, multicore has become the trend in microprocessor design and therefore the basic hardware of an HPC platform. According to statistics published in 2007, more than 60% of the Top500 supercomputers rely on multicore technology. Hence, multicore technology will be the main component of emerging Cloud infrastructures.
B. Multicore Architecture
A multicore processor is typically a single processor which contains several cores on a chip. The cores are fully functional with computation units and caches, and hence support multithreading. This is different from earlier hyper-threading techniques, which provide additional sets of registers to allow multiple threads, usually used to prefetch instructions or data. It is true that researchers have used idle cores for the same purpose; however, multicore is more adequate for parallel processing.
The dominant design of multicores uses a homogeneous configuration, with the IBM Cell as an exception. In this case, multicores can be seen as SMPs (symmetric multiprocessors) with their processors moved from different boards onto the same chip on a single main board. This means a shorter inter-processor communication time and a more efficient utilization of the on-chip memory, which are advantages for running multiple threads with data dependencies.
Each microprocessor manufacturer has its specific implementation of multicore technology. Nevertheless, products on the market differ mainly in the organization of the second level cache: one design keeps it private, the other shares it. Figure 1 shows both architectures with a four-core processor as an example.
As shown in the figure, a multicore architecture contains several CPU cores, each equipped with a private first level cache. This is common to all multicore designs. Together with the cores and the L1 caches, the chip also contains a single second level (L2) cache or several L2 caches, depending on the concrete implementation. The former can typically be found in Intel multicore processors, while the latter is adopted by AMD. Through a memory controller, the second level cache is connected to the main memory. Additionally, some products have an off-chip L3 cache in front of the main memory.
Theoretically, a shared L2 cache performs better than private L2 caches because, with a shared design, the working cores can use the cache space of idle cores. In addition, cache coherence issues are not problematic with a shared cache.
Figure 2. L2 misses of the FT (left) and CG (right) applications

Actually, we have verified this theory using our cache simulator for multicores. Figure 2 depicts sample results we acquired with the simulator, which models the cache hierarchy of a multicore architecture.
The figure shows the number of second level cache misses with both the private and the shared organization, with the left side for FT and the right side for CG, two applications from the NAS OpenMP benchmark suite.
For a better understanding of the cache behavior, the cache misses are shown in a breakdown of four miss categories: compulsory misses, capacity misses, conflict misses, and invalidation misses. As can be seen, for both applications the shared cache generally performs better than the private caches. The performance gain with FT is mainly contributed by the reduction of compulsory misses, while CG shows a significant decrease in the number of capacity misses with a shared L2 cache.
Different applications have their own reasons for the better performance; however, several common factors exist. For example, parallel threads executing a program partially share the working set. This shared data can be reused by other processors after being loaded into the cache, thus reducing the compulsory misses caused by the first reference to a data item. Shared data also saves space because each memory block is only cached once, which helps to reduce capacity misses. In addition, a shared cache has no consistency problem and hence incurs no invalidation misses.
C. Programming Models
Intel is planning to integrate 64 cores on a chip. Indeed, this is a trend in multicore technology: the number of CPU cores on a chip continues to increase. A question arises: how do we parallelize programs in order to fully utilize this computational capacity?
MPI and OpenMP are the dominant programming models for parallelizing sequential programs. The former is a communication-based paradigm that requires programmers to use explicit send/receive primitives to transfer data across processes. This model is typically applied on loosely coupled architectures like clusters of commodity PCs.
Alternatively, OpenMP is a programming model for parallelization on multiprocessor systems with a shared memory. It is gaining broader application in both research and industry: from high performance computing to consumer electronics, OpenMP has been established as an appropriate thread-level programming paradigm. OpenMP provides implicit thread communication over the shared memory. Another feature of OpenMP is that it maintains the semantic similarity between the sequential and parallel versions of a program. This enables easy development of parallel applications without requiring specific knowledge of parallel programming. Due to these features, OpenMP is regarded as a suitable model for multicores.
However, OpenMP was proposed at a time when the SMP was the dominant shared memory architecture. It uses traditional data and function parallelism to express the concurrency in a program, and is hence adequate for coarse-grained parallelization. Even though task parallelism has been introduced into the OpenMP standard, it is still hard to generate enough parallel tasks to keep all cores busy on future multicores.
A novel programming model has to be designed for multicores; however, this research is still in its beginning stages. For example, Ostadzadeh et al. provide an approach that considers parallelization in a global way rather than just parallelizing a code region as the traditional model does. Hence, this approach is capable of detecting more concurrency in a program.
D. Multicore Clusters
Clusters of commodity PCs/workstations are generally adopted to perform scientific computing. With the trend of microprocessor architecture towards multicore, future clusters will be composed of multiple nodes of multiprocessors with shared memory. This kind of cluster is similar to the current SMP cluster; however, the system scale will be much larger, with potentially several thousand or even ten thousand cores in a single cluster.
One question is how to program such clusters, with a shared memory on each node and distributed memories across the system. For SMP clusters, most deployed applications use the MPI model, i.e., message passing across all processors, including those within one node. The advantage of this approach is that applications need not be rewritten; however, the explicit communication between processors on the same node introduces overhead. This overhead may be small when the node contains only two or four processors, as is the case with an SMP node; nevertheless, for a multicore node it can significantly influence the overall performance.
Another approach to programming an SMP cluster is using the OpenMP model with a shared virtual memory for the cluster. A well-known commercial product is the Intel compiler, which runs an OpenMP program on cluster architectures by establishing a global memory space for shared variables. The advantage of this approach is that OpenMP applications can be executed on a distributed system without any adaptation, and the parallel threads running on the same node communicate via the local memory. However, the management of the virtual shared space again introduces overhead, which can be large on a cluster with thousands of processor nodes.
A hybrid approach seems to be the best solution: the application is parallelized with OpenMP on the node and with MPI across the system, resulting in a hybrid communication scheme of intra-node shared memory and inter-node message passing. However, the existing implementations require that users apply both MPI primitives and OpenMP directives in the program to guide the compiler and the runtime. This is clearly a burden on programmers. Hence, a better way is to apply only the OpenMP model, which is easier to learn than MPI, and to build a compiler capable of transforming a shared memory communication into MPI send/receive whenever the communication crosses a processor node.
Another problem with multicore clusters is scalability.Most applications...