Multicores in Cloud Computing: Research Challenges for Applications

Lizhe Wang, Jie Tao, Gregor von Laszewski, Holger Marten
Institute of Pervasive Technology, Indiana University at Bloomington, USA
Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Germany
Email: Laszewski@gmail.com

Abstract — With modern techniques that allow billions of transistors on a chip, microprocessor design is moving to a world of multicore. A cluster of multicores will commonly be used as an efficient computational platform for high performance computing in the near future. Correspondingly, resource providers, who share their computing elements and storage with customers, will also upgrade their machines to the multicore architecture.

At the same time, the dominating resource sharing platforms, in the form of compute Grids, are facing a grand challenge from Cloud computing. With an easy interface, high-level services, and on-demand provision of compute resources, compute Clouds are attracting more users to join their realm.

Upcoming Cloud environments will be primarily built on top of multicore technologies. However, to fully exploit the computational capacity and the other advantages of multicores, many novel techniques must be proposed and investigated.

This paper serves as an overview of multicore technologies in the context of Cloud computing. It first introduces multicore technologies and the Cloud computing concept. It then investigates various open research issues in the context of multicore Cloud computing, for example, how to efficiently employ multicore techniques for Cloud computing and how to achieve high multicore performance for high performance applications in Cloud environments. Several examples and discussions are presented.

Keywords: Multicore architecture, high performance computing, Cloud computing

I. INTRODUCTION

The rapid development of multicore processors is a milestone for computing systems. It affects compiler construction, programming languages, and application models. Cloud computing has recently emerged as a new computing paradigm that can render the desired computing infrastructure for users on demand. This paper overviews the research fields of multicore Cloud computing. It introduces the background of multicore systems and Cloud computing, identifies research issues, and proposes possible solutions. The rest of this paper is organized as follows. Section II discusses the multicore architecture and system. Section III studies the concept of the Cloud computing infrastructure. The multicore Cloud computing context, including research issues, implementations, and solutions, is investigated in Section IV. Finally, the paper concludes in Section V.

II. THE MULTICORE ARCHITECTURE

According to Moore's law, billions of transistors can be integrated on a single chip in the near future. However, due to the physical limitations of semiconductor-based electronics, the clock frequency of microprocessors will not increase exponentially as it did over the last thirty years. Additionally, heat dissipation is a fatal problem with modern microarchitectures, and power consumption has become another limitation on enhancing the CPU frequency.

In this situation, multicore arises as an adequate solution. Different from the single-processor architecture, which deploys all transistors on the chip to form a single core, this novel design partitions the transistors into groups, where each group builds a separate processor core. As the communication of transistors occurs within a group, rather than globally across the complete chip, shorter latency and switching time can be maintained.
In this case, each core can be granted a certain frequency, and the multicore as a whole provides high performance.

Actually, the multicore processor is not only a solution for CPU speed; it also reduces power consumption, because several cores at a lower frequency together generate less heat dissipation than one core running at their total frequency. Therefore, the multicore processor is regarded as the future generation of microprocessor design.

A. Multicore Products

Over the last few years almost every processor manufacturer has put multicores on the market. Active vendors include IBM, Intel, SUN, and AMD.

JOURNAL OF COMPUTERS, VOL. 5, NO. 6, JUNE 2010, pp. 958-964. (c) 2010 ACADEMY PUBLISHER. doi:10.4304/jcp.5.6.958-964

[Figure 1. Multicore architecture with private (left) and shared L2 cache (right)]

Together with Sony and Toshiba, IBM developed the Cell multiprocessor [15], [23] for the PlayStation 3 console. It applies a heterogeneous architecture with a PowerPC as the main processor and an additional eight synergistic processing elements for computation. The Cell processor can operate at both low voltage and low power while maintaining high frequency and high performance. Besides Cell, IBM has also developed other multicore architectures, like the BladeCenter JS22 with two cores of Power6.

Intel [13] dual-core processors have been successfully integrated in desktops, notebooks, workstations, and servers. An example is Woodcrest with two Xeon processors. More recently, Intel's quad-core technology has delivered outstanding performance and energy efficiency. In addition, Intel has announced a 64-core architecture for the future.

AMD is another active processor developer, and most HPC platforms are built on top of Intel and AMD products. For multicore technology, AMD [6] delivered the dual-core Opteron designs and currently the new quad-core Phenom II processor. The well-known AMD HyperTransport Technology provides a high communication bandwidth.

Following the UltraSPARC T1, SUN [22] developed the UltraSPARC T2 Plus (Niagara) with 8 cores. In addition to the large number of cores on the chip, this microprocessor has hardware support for 8 threads per core, forming a maximum of 64 logical CPUs.

Overall, multicore has become the trend of microprocessor design and therefore the basic hardware of an HPC platform. According to statistics published in 2007, more than 60% of the Top500 supercomputers rely on multicore technology. Hence, multicore technology will be the main component of emerging Cloud infrastructures.

B. Multicore Architecture

A multicore processor is typically a single processor which contains several cores on a chip. The cores are fully functional with computation units and caches, and hence support multithreading. This is different from the earlier hyper-threading techniques, which provide additional sets of registers to allow multiple threads that are usually used to prefetch instructions or data [5], [17]. It is true that researchers use idle cores for the same purpose [19]; however, multicore is more adequate for parallel processing.

The dominant design of multicores uses a homogeneous configuration, with the IBM Cell as an exception. In this case, multicores can be viewed as SMPs (symmetric multiprocessors) that move their processors from different boards onto the same chip on a single main board. This means shorter inter-processor communication time and more efficient utilization of the on-chip memory, which are advantages for running multiple threads with data dependency.

Each microprocessor manufacturer has its specific implementation of multicore technology. Nevertheless, products on the market mainly differ in the organization of the second-level cache: one is private and the other is shared.
Figure 1 shows both architectures with a four-core processor as an example.

As shown in the figure, a multicore architecture contains several CPU cores, each equipped with a private first-level cache. This is common to all multicore designs. Together with the cores and the L1 caches, the chip also contains a second-level (L2) cache or a few L2 caches, depending on the concrete implementation. The former can typically be found in Intel multicore processors, while the latter is adopted by AMD. Through a memory controller, the second-level cache is connected with the main memory. Additionally, some products have an off-chip L3 in front of the main memory.

Theoretically, a shared L2 cache performs better than private L2 caches, because with a shared design the working cores can use the cache capacity of idle cores. In addition, cache coherence issues are not problematic with a shared cache.

Actually, we have verified this theory using our cache simulator for multicore [10].

[Figure 2. L2 misses of the FT (left) and CG (right) applications]

Figure 2 depicts sample results we acquired with the simulator, which models the cache hierarchy of a multicore architecture. The figure shows the number of second-level cache misses with both the private and the shared organization, with the left side for FT and the right side for CG, two applications from the NAS [9], [14] OpenMP benchmark suite.

For a better understanding of the cache behavior, the misses are shown in a breakdown of four miss categories: compulsory misses, capacity misses, conflict misses, and invalidation misses. As can be seen, with both applications the shared cache performs generally better than the private caches. The performance gain with FT is mainly contributed by the reduction of compulsory misses, while CG shows a significant decrease in the number of capacity misses with a shared L2 cache.

Different applications have their own reasons for the better performance; however, several common factors exist. For example, parallel threads executing a program partially share the working set. This shared data can be reused by other processors after being loaded into the cache, thus reducing the compulsory misses caused by the first reference of a data item. Shared data also saves space, because a memory block is only cached once; this helps to reduce capacity misses. In addition, a shared cache has no consistency problem, and hence there are no invalidation misses with a shared cache.

C. Programming Models

Intel is planning to integrate 64 cores on a chip. Actually, this is a trend with multicore technology: the number of CPU cores on a chip continues to increase. A question arises: how do we parallelize programs in order to fully utilize the computational capacity?

MPI and OpenMP are the dominating programming models for parallelizing sequential programs. The former is a communication-based paradigm that requires programmers to use explicit send/receive primitives to transfer data across processes. This model is typically applied on loosely coupled architectures like clusters of commodity PCs.

Alternatively, OpenMP is a programming model for parallelization on multiprocessor systems with a shared memory. It is gaining broader application in both research and industry. From high performance computing to consumer electronics, OpenMP has been established as an appropriate thread-level programming paradigm. OpenMP provides implicit thread communication over the shared memory. Another feature of OpenMP is that it maintains the semantic similarity between the sequential and parallel versions of a program.
This enables easy development of parallel applications without requiring specific knowledge of parallel programming. Due to these features, OpenMP is regarded as a suitable model for multicores.

However, OpenMP was proposed at a time when SMP was the dominating shared memory architecture. It uses traditional data and function parallelism to express the concurrency in a program and is hence adequate for coarse-grained parallelization. Even though task parallelism has been introduced into the OpenMP standard, it is still hard to generate sufficient parallel tasks to keep all cores busy on future multicores.

A novel programming model has to be designed for multicore; however, this research work is in its beginning stages. For example, Ostadzadeh et al. [20] provide an approach that observes parallelization in a global way, rather than just parallelizing a code area as the traditional model does. Hence, this approach is capable of detecting more concurrency in a program.

D. Multicore Clusters

Clusters of commodity PCs/workstations are generally adopted to perform scientific computing. With the trend of microprocessor architecture towards multicore, future clusters will be composed of multiple nodes of multiprocessors with shared memory. This kind of cluster is similar to the current SMP cluster; however, the system scale will be much larger, with potentially several thousand or even tens of thousands of cores in a single cluster.

One question is how to program such clusters, with a shared memory on each node and distributed memories across the system. For SMP clusters, most deployed applications use the MPI model, i.e., message passing across all processors, including those within one node. The advantage of this approach is that applications need not be rewritten; however, the explicit communication between processors on the same node introduces overhead. This overhead could be small when the node contains only two or four processors, as is the case with an SMP node; nevertheless, for a multicore node it can significantly influence the overall performance.

Another approach [16] to programming an SMP cluster is using the OpenMP model with a shared virtual memory for the cluster. A well-known commercial product is the Intel compiler [12], which runs an OpenMP program on cluster architectures by establishing a global memory space for shared variables. The advantage of this approach is that OpenMP applications can be executed on a distributed system without any adaptation, and parallel threads running on the same node communicate via the local memory. However, the management of the virtual shared space again introduces overhead, which can be large on a cluster with thousands of processor nodes.

A hybrid approach seems to be the best solution: the application is parallelized with OpenMP on the node and with MPI across the system, yielding a hybrid communication scheme of intra-node shared memory and inter-node message passing. However, the existing implementations require that users apply both MPI primitives and OpenMP directives in the program to guide the compiler and the runtime. This is clearly a burden on programmers. Hence, a better way is to apply only the OpenMP model, which is easier to learn than MPI, and to build a compiler capable of transforming shared memory communication into MPI send/receive when the communication goes beyond a processor node.

Another problem with multicore clusters is scalability. Most applications do not initially run efficiently on large clusters and hence need performance tuning. The existing tools for performance analysis, visualization, and steering have helped programmers to improve their codes and can also be applied to multicore clusters. However, most of these tools must be extended to behave well on larger systems.

III. STUDY ON COMPUTING CLOUD INFRASTRUCTURE

Cloud computing is becoming one of the next IT industry buzzwords: users move data and applications out to remote Clouds and then access them in a simple and pervasive way. This is again a centralized processing use case. A similar scenario occurred around 50 years ago, when a time-sharing computing server served multiple users. Until about 20 years ago, when personal computers came to us, data and programs were mostly located in local resources.

Certainly the Cloud computing paradigm is not a recurrence of history. 50 years ago we had to adopt time-sharing servers due to limited computing resources. Nowadays, Cloud computing comes into fashion due to the burden of building complex IT infrastructures: users have to manage various software installations, configurations, and updates, and computing resources and other hardware are prone to becoming outdated very soon. Therefore, outsourcing computing platforms is a smart solution for users who must otherwise handle complex IT infrastructures.

At the current stage, Cloud computing is still evolving and there exists no widely accepted definition. Based on our experience, we propose an early definition of Cloud computing as follows [25]:

    A computing Cloud is a set of network enabled services, providing scalable, QoS guaranteed, normally personalized, inexpensive computing infrastructures on demand, which could be accessed in a simple and pervasive way.

In the initial step of the study, we focus on how to build infrastructures for Scientific Clouds. The Cumulus project [24] is an ongoing Cloud computing project which intends to provide virtualized computing infrastructures for scientific and engineering applications. The Cumulus project currently runs on distributed IBM blade servers with Oracle Linux and the Xen hypervisors. We designed the Cumulus system in a layered architecture (see also Figure 3):

- The Cumulus frontend service resides on the access point of a computing center and accepts users' requests for virtual machine operations.
The Cumulus frontend service is implemented by re-engineering the Globus virtual workspace service and the Nimbus frontend [2].

- OpenNEbula [1] works as the Local Virtualization Management System (LVMS) on the Cumulus backends. OpenNEbula communicates with the Cumulus frontend via SSH or XML-RPC. OpenNEbula itself contains one frontend and multiple backends; the OpenNEbula frontend communicates with its backends and with the Xen hypervisors on the backends via SSH for virtual machine manipulation.

[Figure 3. Overview of the Cumulus project]

- The OS Farm [3] is a Web server for generating and storing Xen virtual machine images and Virtual Appliances. The OS Farm is used in the Cumulus system as a tool for virtual machine template management.

IV. MULTICORE IN THE CLOUD COMPUTING CONTEXT

A. Multicore meets Clouds

Computing Clouds could definitely benefit from the rapid growth of multicore systems. For example, Google data centers can adopt uniform commodity multicore clusters as computing servers. Computing Grids are normally suitable only for coarse-grained parallel applications; in general, the latencies in Grid applications are measured in milliseconds [11]. Computing Clouds, in contrast, normally contain several homogeneous data centers or computer centers, where commodity high performance multicore clusters are employed. Therefore, fine-grained parallel applications could be mapped to computing Clouds, which could provide low latencies between clusters and the high data transfer rates of multicore systems.

B. Computing Infrastructure Provision from a Multicore Cloud

We could expect the following Cloud computing context:

- a computing Cloud is composed of several data centers or compute centers, and
- each center contains multiple identical commodity multicore clusters.

Computing Cloud users can then lease computing infrastructures and platforms from the computing Clouds, which normally are virtual machine based systems. The consolidation of a set of virtual machines requires efficient scheduling and management on multicore systems, for example:

- How to schedule a virtual machine across multiple cores, multiple processors, or even multiple clusters?
- How to migrate a virtual machine when slots of cores become available?

These open questions call for solutions and further work on efficient power management on multicore clusters and Clouds.

C. Programming Requirements of Multicore Cloud Computing

From the user side, proper programming models and paradigms should be developed to fulfill the requirements of multicore based computing Clouds:

- Task level atomicity. Normally, computing resources are provided to users in the form of virtual machines, and one task is allocated one or more virtual machines. The basic unit of resource provision and management is the virtual machine. To adapt to this, user programs should be atomic at the unit of a task, which is clearly defined with its purpose, operations, and control. Task level atomicity can bring several advantages, for example, fault tolerance in commodity computing clusters and Clouds, easy management (such as checkpointing and migration), and robust control at both the user level and the system level.

- Stateless tasks. In the Grid computing community, especially the Globus world, people like to promote stateful services, which keep information or context between calls on the services. However, we discourage this style in multicore Clouds: stateless tasks can easily be managed, e.g., checkpointed and migrated, at both the user level and the system level.
Stateless tasks can therefore bring scalability and fault tolerance in the homogeneous multicore cluster environments of a Cloud.

D. Where goes the parallelism?

It should be noted that multicore based systems cannot be automatically exploited if users do not apply explicit parallel computing models and languages in their applications [7], [18]:

- User or system level software should know that the underlying platform contains multicores and try to map its multiple tasks or threads onto the cores.
- The multicore system or Cloud middleware should provide development environments or enabling compilers to make applications multicore enabled.

As discussed above, multi-task programming and multi-thread computing are suitable for multicore Clouds. This can be justified by the most popular programming model, MapReduce [8]. MapReduce is a programming model and framework for processing large data sets across distributed data centers. The MapReduce framework takes a set of input key/value pairs and produces a set of output key/value pairs. There are two functions, Map and Reduce, in the programming model. The Map function accepts an input pair and generates a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function takes an intermediate key and a set of values for that key and merges these values together to form a smaller set of values.

Phoenix [21] is an implementation of MapReduce from Stanford University for shared memory systems, e.g., multicore processors. It uses threads instead of cluster nodes for parallelism and communicates through shared memory instead of network messages.
The current version of Phoenix provides C/C++ APIs and uses Pthreads. Phoenix orchestrates application execution across multiple threads: it initiates and terminates threads, assigns Map and Reduce tasks to threads, and handles buffer allocation and communication.

V. CONCLUSION

Cloud computing and multicore technologies are listed among the 10 disruptive technologies expected to reshape the IT landscape between 2008 and 2012 [4]. To prepare for the multicore era and the Cloud computing stage, various research issues should be identified and studied. This paper overviews the research context of multicore Cloud computing. It introduces the background of multicore systems and the Cloud computing concept, investigates research problems, and discusses possible solutions for the study.

REFERENCES

[1] OpenNebula project.
[2] Nimbus project, accessed June 2008.
[3] OS Farm project, accessed July 2008.
[4] Gartner identifies top ten disruptive technologies for 2008 to 2012, accessed Jan. 2009.
[5] T. M. Aamodt, P. Chow, P. Hammarlund, H. Wang, and J. P. Shen. Hardware Support for Prescient Instruction Prefetch. In Proceedings of the 10th International Symposium on High Performance Computer Architecture, pages 84-95, 2004.
[6] AMD. AMD Multi-Core Technology.
[7] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: a view from Berkeley. Technical report, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, 2006.
[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[9] D. Bailey et al. The NAS Parallel Benchmarks. Technical Report RNR-94-007, Department of Mathematics and Computer Science, Emory University, March 1994.
[10] J. Tao et al. Performance Advantage of Reconfigurable Cache Design on Multicore Processor Systems. International Journal of Parallel Programming, 36(3):347-360, June 2008.
[11] Geoffrey Fox and Marlon Pierce. Grids challenged by a Web 2.0 and multicore sandwich. Concurrency and Computation: Practice and Experience, 21(3):265-280, 2009.
[12] Intel Corporation. Intel C++ Compiler 11.0 Professional Edition for Linux.
[13] Intel Corporation. Intel Multi-Core Technology.
[14] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011, NASA Ames Research Center, October 1999.
[15] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell Multiprocessor. IBM Journal of Research & Development, 49(4/5):589-605, 2005.
[16] Y. Kee, J. Kim, and S. Ha. ParADE: An OpenMP Programming Environment for SMP Cluster Systems. In Proceedings of SC 2003, 2003.
[17] J. Lee, C. Jung, D. Lim, and Y. Solihin. Prefetching with Helper Threads for Loosely-Coupled Multiprocessor Systems. IEEE Transactions on Parallel and Distributed Systems, 99(1), 2008.
[18] C. E. Leiserson and I. B. Mirman. How to survive the multicore software revolution. Technical report, Cilk Arts, Inc., 2008.
[19] J. Lu, A. Das, W. Hsu, K. Nguyen, and S. G. Abraham. Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 93-104, 2005.
[20] S. A. Ostadzadeh, R. J. Meeuws, K. Sigdel, and K. L. M. Bertels. A Multipurpose Clustering Algorithm for Task Partitioning in Multicore Reconfigurable Systems. In Proceedings of the IEEE International Workshop on Multi-Core Computing Systems, to appear, March 2009.
[21] C. Ranger, R. Raghuraman, A. Penmetsa, G. R. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 13th International Conference on High Performance Computer Architecture, pages 13-24, 2007.
[22] SUN Microsystems. UltraSPARC T2 Processor.
[23] O. Takahashi, S. Cottier, S. H. Dhong, B. Flachs, and J. Silberman. Power-Conscious Design of the Cell Processor's Synergistic Processor Element. IEEE Micro, 25(5):10-18, 2005.
[24] Lizhe Wang, Jie Tao, Marcel Kunze, A. C. Castellanos, David Kramer, and Wolfgang Karl. Scientific cloud computing: Early definition and experience. In Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, pages 825-830, September 2008.
[25] Lizhe Wang, Gregor von Laszewski, Jie Tao, and Marcel Kunze. Cloud computing: a perspective study. Accepted by New Generation Computing, 38(5):63-69.

