
HAL Id: inria-00436029, https://hal.inria.fr/inria-00436029v1

Submitted on 25 Nov 2009 (v1), last revised 4 Jul 2014 (v2)


Collective Tuning Initiative: automating and accelerating development and optimization of computing systems

Grigori Fursin

To cite this version: Grigori Fursin. Collective Tuning Initiative: automating and accelerating development and optimization of computing systems. GCC Developers' Summit, Jun 2009, Montreal, Canada. inria-00436029v1


Collective Tuning Initiative: automating and accelerating development and optimization of computing systems

Grigori Fursin, INRIA Saclay, France

HiPEAC member

[email protected]

Abstract

Computing systems rarely deliver the best possible performance due to ever increasing hardware and software complexity and limitations of current optimization technology. Additional code and architecture optimizations are often required to improve execution time, size, power consumption, reliability and other important characteristics of computing systems. However, this optimization process is often tedious, repetitive, isolated and time consuming. In order to automate, simplify and systematize program optimization and architecture design, we are developing an open-source modular plugin-based Collective Tuning Infrastructure (http://ctuning.org) that can distribute the optimization process and leverage the optimization experience of multiple users.

The core of this infrastructure is a Collective Optimization Database that allows easy collection, sharing, characterization and reuse of a large number of optimization cases from the community. The infrastructure also includes collaborative R&D tools with a common API (Continuous Collective Compilation Framework, MILEPOST GCC with the Interactive Compilation Interface and static feature extractor, Collective Benchmark and the Universal Run-time Adaptation Framework) to automate optimization, produce adaptive applications and enable realistic benchmarking. We developed several tools and open web services to substitute the default compiler optimization heuristic and predict good optimizations for a given program, dataset and architecture based on static and dynamic program features and standard machine learning techniques.

The collective tuning infrastructure provides a novel, fully integrated, collaborative, "one button" approach to improve existing underperforming computing systems, ranging from embedded architectures to high-performance servers, based on systematic iterative compilation, statistical collective optimization and machine learning. Our experimental results show that it is possible to automatically reduce the execution time (and code size) of some programs from SPEC2006 and EEMBC, among others, by more than a factor of 2. It can also reduce development and testing time considerably. Together with the first production-quality machine-learning-enabled interactive research compiler (MILEPOST GCC), this infrastructure opens up many research opportunities to study and develop future realistic self-tuning and self-organizing adaptive intelligent computing systems based on systematic statistical performance evaluation and benchmarking. Finally, using a common optimization repository is intended to improve the quality and reproducibility of research on architecture and code optimization.


[Figure 1 diagram: continuous (transparent) monitoring of computing systems; optimization of programs, run-time systems, compilers and architectures to enable adaptive self-tuning computing systems; web services to collect static and dynamic optimization cases and suggest good optimizations (based on program and architecture features, run-time behavior and optimization scenarios).]

Figure 1: Collective tuning infrastructure to enable systematic collection, sharing and reuse of optimization knowledge from the community. It automates optimization of computing systems by leveraging the optimization experience from multiple users.

1 Introduction

Continuing innovation in science and technology requires increasing computing resources while imposing strict requirements on cost, performance, power consumption, size, response time, reliability, portability and design time of computing systems. Embedded and large-scale systems tend to evolve towards complex heterogeneous reconfigurable multiprocessing systems with dramatically increased design, test and optimization times.

The compiler is one of the key components of computing systems, responsible for delivering high-quality machine code across a wide range of architectures and programs with multiple inputs. However, for several decades compilers have failed to deliver portable performance, often due to restrictions on optimization time, simplified cost models for rapidly evolving complex architectures, the large number of combinations of available optimizations, limitations on run-time adaptation and the inability to leverage optimization experience from multiple users efficiently, systematically and automatically.

Tuning a compiler optimization heuristic for a given architecture is a repetitive and time-consuming process because of the large number of possible transformations, architecture configurations, programs and inputs available, as well as multiple optimization objectives such as improving performance, code size and reliability, among others. Therefore, when adding new optimizations or retargeting to a new architecture, compilers are often tuned only for a limited set of architecture configurations, benchmarks and transformations, thus making even relatively recent computing systems underperform. Hence, most of the time, users have to resort to additional optimizations to improve the utilization of the available resources of their systems.


Iterative compilation has been introduced to automate program optimization for a given architecture using empirical feedback-directed search for good program transformations [76, 38, 34, 64, 48, 42, 39, 1, 28, 27, 12, 26, 57, 73, 69, 67, 54, 55]. Recently, the search time has been considerably reduced using statistical techniques, machine learning and continuous optimization [65, 72, 71, 77, 74, 61, 58, 33, 47]. However, iterative feedback-directed compilation is often performed with the same program input and has to be repeated if the dataset changes. In order to overcome this problem, a framework has been developed to statically enable run-time optimizations based on static function multiversioning, iterative compilation and low-overhead run-time program behaviour monitoring routines [45, 62] (a similar framework has also been presented in [63] recently).
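As an illustration of the basic idea only (a minimal Python sketch, not part of the cTuning tools; the program name, dataset and flag pool are hypothetical), an empirical feedback-directed search compiles a program with different flag combinations, times the resulting binary on a fixed dataset and keeps the best-performing combination:

import random
import subprocess
import time

# Hypothetical pool of candidate flags; a real search would use the
# compiler's full flag list and multiple datasets.
CANDIDATE_FLAGS = ["-funroll-loops", "-ftree-vectorize",
                   "-fomit-frame-pointer", "-finline-functions"]

def build_and_time(flags):
    """Compile with the given flags and return the measured run time."""
    subprocess.check_call(["gcc", "-O2", *flags, "program.c", "-o", "a.out"])
    start = time.time()
    subprocess.check_call(["./a.out", "dataset1.in"])
    return time.time() - start

best_flags, best_time = [], build_and_time([])   # baseline run
for _ in range(50):                              # iterative search budget
    flags = [f for f in CANDIDATE_FLAGS if random.random() < 0.5]
    t = build_and_time(flags)
    if t < best_time:
        best_flags, best_time = flags, t
print("best:", best_flags, best_time)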

Though these techniques have demonstrated significant performance improvements, they have not yet been fully adopted in production environments due to the large number of training runs required to test many different combinations of optimizations. Therefore, in [50] we proposed to overcome this obstacle and speed up iterative compilation using statistical collective optimization, where the task of optimizing a program leverages the experience of many users, rather than being performed in isolation, and often redundantly, by each user. To some extent this is similar to biological adaptive systems, since all programs for all users can be randomly modified (keeping the same semantics) to explore some part of large optimization spaces in a distributed manner and favor the best-performing optimizations to improve computing systems continuously.

Collective optimization requires radical changes to current compiler and architecture design and optimization technology. In this paper we present a long-term community-driven collective tuning initiative to enable collective optimization of computing systems based on systematic, automatic and distributed exploration of program and architecture optimizations, statistical analysis and machine learning. It is based on a novel fully integrated collaborative infrastructure with a common optimization repository (Collective Optimization Database) and collaborative R&D tools with common APIs (including the first of its kind production-quality machine-learning-enabled research compiler, MILEPOST GCC [47]) to share profitable optimization cases and leverage optimization experience from multiple users automatically.

We decided to use a top-down systematic optimization approach, providing capabilities for global and coarse-grain optimization, parallelization and run-time adaptation first, and then combining it with finer-grain optimizations at loop or instruction level. We believe that this is the right approach to avoid the tendency to target very fine-grain optimizations first without solving the global optimization problem, which may have much higher potential benefits.

The collective tuning infrastructure can already improve a broad range of existing desktop, server and embedded computing systems using empirical iterative compilation and machine learning techniques. We managed to reduce the execution time (and code size) of multiple programs from SPEC95, SPEC2000, SPEC2006, EEMBC v1 and v2, and cBench by amounts ranging from several percent to more than a factor of 2 on several common x86 architectures. On average, we reduced the execution time of the cBench benchmark suite for the ARC725D embedded reconfigurable processor by 11% entirely automatically.

Collective tuning technology helps to minimize repetitive time-consuming tasks and human intervention and opens up many research opportunities. Such community-driven collective optimization technology is the first step towards our long-term objective to study and develop smart self-tuning adaptive heterogeneous multi-core computing systems. We also believe that our initiative can improve the quality and reproducibility of academic and industrial IT research on code and architecture design and optimization. Currently, it is often difficult to reproduce and verify the experimental results of research papers, which should no longer be acceptable. Using a common optimization repository and collaborative R&D tools provides the means for fair and open comparison of available optimization techniques, helps to avoid overstatements and mistakes, and should eventually boost innovation and research.

The paper is organized as follows. Section 2 introduces the collective tuning infrastructure. Section 3 presents the collective optimization repository to share and leverage optimization experience from the community. Section 4 presents the collaborative R&D tools and cBench to automate, systematize and distribute optimization exploration. Finally, Section 5 provides some practical usage scenarios, followed by future research and development directions.

2 Collective Tuning Infrastructure

In order to enable systematic and automatic collection, sharing and reuse of profiling and optimization information from the community, we are developing a fully integrated collective tuning infrastructure, shown in Figure 1. The core of the cTuning infrastructure is an extendable optimization repository (Collective Optimization Database) to characterize multiple heterogeneous optimization cases from the community, which improve execution time and code size, track program and compiler bugs, among many others, and to ensure their reproducibility. It is described in detail in Section 3. Optimization data can be searched and analyzed using web services with open APIs and external plugins. A user can submit optimization cases either manually using the online submission form at [9] or automatically using the collaborative R&D tools (cTools) [3] described in Section 4.

Current cTools include:

• Extensible plugin-enabled Continuous Collective Compilation Framework (CCC) to automate empirical iterative feedback-directed compilation and allow users to explore a part of large optimization spaces in a distributed manner using multiple search strategies.

• Extensible plugin-enabled GCC with a high-level Interactive Compilation Interface (ICI) to open up production compilers and control their compilation flow and decisions using external user-defined plugins. Currently, it allows selection and tuning of compiler optimizations (global compiler flags, passes at function level and fine-grain transformations) as well as program analysis and instrumentation.

• Open-source plugin-based machine-learning-enabled interactive research compiler based on GCC (MILEPOST GCC) [47] that includes ICI and a static program feature extractor to substitute the default optimization heuristic of a compiler and predict good optimizations based on static and dynamic program features (some general aspects of a program) and machine learning.

• Universal run-time adaptation framework (UNIDAPT) to enable transparent monitoring of dynamic program behavior as well as dynamic optimization and adaptation of statically compiled programs with multiple datasets for uni-core and heterogeneous reconfigurable multi-core architectures, based on code multiversioning (a minimal sketch of the multiversioning idea follows this list).
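To illustrate the multiversioning concept behind UNIDAPT (a hedged conceptual sketch only, not UNIDAPT's actual API; all names are hypothetical), a program can carry several differently optimized versions of a hot function plus a small run-time dispatcher that monitors timings and selects among them:

import time

# Two hypothetical versions of the same hot function, e.g. built with
# different optimizations (static code multiversioning).
def kernel_v1(data):
    return sum(x * x for x in data)

def kernel_v2(data):
    total = 0
    for x in data:
        total += x * x
    return total

VERSIONS = [kernel_v1, kernel_v2]
timings = [[] for _ in VERSIONS]
calls = 0

def adaptive_kernel(data):
    """Sample each version a few times, then stick to the fastest one."""
    global calls
    if calls < 10:                      # low-overhead sampling phase
        i = calls % len(VERSIONS)
    else:                               # exploitation phase
        i = min(range(len(VERSIONS)), key=lambda k: min(timings[k]))
    calls += 1
    start = time.perf_counter()
    result = VERSIONS[i](data)
    timings[i].append(time.perf_counter() - start)
    return result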


Automatic exploration of optimization spaces is performed using multiple publicly available realistic programs and their datasets from the community that compose cBench [4]. However, we also plan to use the collective optimization approach, when stable, to enable fully transparent collection of optimization cases from multiple users [50].

The cTuning infrastructure is available online at the community-driven wiki-based web portal [5] and has been extended within the MILEPOST project [13]. It now includes plugins to automate compiler and architecture design, substitute the default GCC optimization heuristic and predict good program optimizations for a wide range of architectures using machine learning and statistical collective optimization plugins [50, 47]. More importantly, it creates a common platform for innovation and opens up multiple research possibilities for the academic community and industry.

3 Collective Optimization Database

The Collective Optimization Database (COD) is the key component of the cTuning infrastructure, serving as a common extensible open online repository of a large number of optimization cases from the community. Such cases include program optimizations and architecture configurations that improve execution time, code size or power consumption, detect performance anomalies and bugs, etc. COD should be able to keep enough information to describe optimization cases and characterize program compilation and optimization flow, run-time behavior and architecture parameters to ensure reproducibility and portable performance for collective optimization.

Before the MILEPOST project, COD had a fully centralized design, shown in Figure 2a, with all the data coming directly from users. Such a design may cause large communication overheads and database overloading, thus requiring continuous resource-hungry pruning of the data on the database server. Therefore, the design of COD has been gradually altered to support local user-side filtering of optimization information using plugins of the CCC framework, as shown in Figure 2b. Currently, plugin-based filters detect optimization cases that improve execution time and code size based on Pareto-like distributions, or that exhibit performance anomalies, to allow further detailed analysis. We are gradually extending these filters to detect important program and architecture optimizations as well as useful static and dynamic program features automatically, based on Principal Component Analysis and similar techniques [37], to improve the correlation between program structure or behavior and optimizations.
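As a hedged illustration of such a filter (a minimal sketch, not the actual CCC plugin; the record layout is an assumption), the following Python fragment keeps only the optimization cases that lie on the Pareto frontier of execution time versus binary size:

# Each case is a hypothetical record: (flags, run_time, bin_size).
cases = [
    ("-O3",                        10.2, 52000),
    ("-O3 -fomit-frame-pointer",   10.5, 53000),   # dominated by plain -O3
    ("-O3 -funroll-loops",          9.1, 61000),
    ("-Os",                        11.5, 40000),
    ("-O3 -ftree-vectorize",        8.7, 64000),
]

def pareto_filter(cases):
    """Keep cases not dominated in both execution time and code size."""
    frontier = []
    for flags, t, size in cases:
        dominated = any(t2 <= t and s2 <= size and (t2 < t or s2 < size)
                        for _, t2, s2 in cases)
        if not dominated:
            frontier.append((flags, t, size))
    return frontier

for case in pareto_filter(cases):
    print(case)   # candidates worth sharing in COD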

We are also investigating the possibility of developing a fully decentralized collective tuning system to enable continuous exploration, analysis and filtering of program optimizations, static/dynamic features and architecture configurations, as well as transparent sharing and reuse of optimization knowledge between multiple users based on P2P networks.

The current design of COD, presented in Figure 3, has been influenced by the requirements of the MILEPOST project [13] to collect a large number of optimization cases for different programs, datasets and architectures during iterative feedback-directed compilation from several partners. These cases are used to train machine learning models and predict good optimizations for a given program on a given reconfigurable architecture based on static or dynamic program features [47, 37].

Before participating in collective tuning and sharing of optimization cases, users must register their computing systems or find similar ones among the existing records. This includes


Figure 2: (a) original centralized design of COD with large communication overheads; (b) decentralized design of COD with local filters to prune optimization information and minimize communication costs.

information about their computing platforms (architecture, GPUs, accelerators, memory and HDD parameters, etc.), software environment (OS and libraries), compiler and run-time environment (VM or architecture simulator, if used). Users can participate in distributed exploration of optimization spaces using cBench, which is already prepared to work directly with the cTuning infrastructure. Alternatively, users can register and prepare their own programs and datasets to support the cTuning infrastructure using the CCC framework. Currently, we are extending the cTuning infrastructure and GCC to enable transparent program optimization without Makefile or project modifications [18] based on the collective optimization concept [50] and the UNIDAPT framework [30].

All information about computing systems is recorded in COD and shared among all users. All records have unique UUID-based identifiers to enable full decentralization of the infrastructure and unique referencing of optimization cases by the community (in reports, documentation and research publications, for example).

Each optimization case is represented by a combination of program compilations (with different optimizations and architecture configurations) and executions (with the same or different datasets and run-time environments).


[Figure 3 diagram: a common optimization database shared among all users, and local or shared databases with optimization cases.]

Figure 3: Collective Optimization Database structure (tables) to describe optimization cases and minimize database size: a common informative part and a shared or private part with optimization cases.

The optimization information from a user is first collected in the Local Optimization Database to enable data filtering, minimize global communication costs and provide the possibility for internal non-shared optimizations within companies. Each record in the databases has a unique UUID-based identifier to simplify merging of the distributed filtered data in COD.
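A hedged sketch of this idea (not the actual CCC/COD code; the record layout and helper names are assumptions): each locally collected record gets a globally unique identifier at creation time, so filtered records from many users can later be merged into the shared repository without renumbering or collisions.

import uuid

def new_local_record(program_id, flags, run_time):
    """Create a locally stored optimization record with a globally unique id."""
    return {
        "RUN_ID": uuid.uuid4().hex,   # unique across all users, no coordination needed
        "PROGRAM_ID": program_id,
        "OPT_FLAGS": flags,
        "RUN_TIME": run_time,
    }

def merge_into_cod(cod, local_records):
    """Merge filtered local records into the shared repository keyed by RUN_ID."""
    for rec in local_records:
        cod.setdefault(rec["RUN_ID"], rec)   # ids never clash, so no renumbering
    return cod

cod = {}
local = [new_local_record("susan_corners", "-O3 -funroll-loops", 9.1)]
merge_into_cod(cod, local)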

Information about the compilation process is recorded in a special COD table described in Figure 4. In order to reduce the size of the database, information that can be shared among multiple users or among multiple optimization cases from a given user has been moved to special common tables. This includes global optimization flags, architecture configuration flags (such as -msse2, -mA7, -ffixed-r16, -march=athlon64, -mtune=itanium2) and features, sequences of program passes applied to functions when using a compiler with the Interactive Compilation Interface, and program static features extracted using MILEPOST GCC [47], among others.

Information about program execution is recorded in a special COD table described in Figure 5. Though absolute execution time can be important for benchmarking and other reasons, we are more interested in how optimization cases improve execution time, code size or other characteristics. In the case of "traditional" feedback-directed compilation, we need two or more runs with the same dataset to evaluate the impact of optimizations on execution time or other metrics: one with the reference optimization level such as -O3 (referenced by RUN_ID_ASSOCIATE) and another with a new set of optimizations and exactly the same dataset (referenced by RUN_ID). When a user explores a larger optimization space using the CCC framework for a given program with a given dataset, the obtained combination of multiple optimization cases includes the same associated reference id (RUN_ID_ASSOCIATE) to be able to calculate improvements over the original execution time or other metrics.


Field: Description

COMPILE_ID: Unique UUID-based identifier to enable global referencing of a given optimization case
PLATFORM_ID: Unique platform identifier
ENVIRONMENT_ID: Unique software environment identifier
COMPILER_ID: Unique compiler identifier
PROGRAM_ID: Unique program identifier
PLATFORM_FEATURE_ID: Reference to the table with platform features describing platform-specific features for architectural design space exploration (it can include architecture-specific flags such as -msse2, cache parameters, etc.)
OPT_ID: Reference to the table with global optimization flags
COMPILE_TIME: Overall compilation time
BIN_SIZE: Binary size
OBJ_MD5CRC: MD5-based CRC of the object file to detect whether optimizations changed the code or not
ICI_PASSES_USE: Set to 1 if a compiler with the Interactive Compilation Interface (ICI) has been used to select individual passes and their orders at function level
ICI_FEATURES_STATIC_EXTRACT: Set to 1 if static program features have been extracted using ICI and MILEPOST GCC [47]
OPT_FINE: XML description of fine-grain optimizations selected using ICI (ongoing work)
OPT_PAR_STATIC: XML description of static program parallelization (ongoing work)
NOTES: User notes about this optimization case (can describe a bug or unusual compiler behavior for further analysis, for example)

Figure 4: Summary of current fields of the COD table describing the compilation process.

We perform multiple runs with the same optimization and the same dataset to calculate speedup confidence and to deal with timer/hardware-counter noise. We use the MD5 CRC of the executable (OBJ_MD5CRC) to compare the transformed code with the original one and avoid executing code when optimizations did not affect the code.
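The following minimal sketch (a hypothetical helper, not CCC code) shows how an optimization case could be evaluated against its reference run in this scheme: repeat both runs, compare medians to suppress timer noise, and skip the comparison entirely when the binary checksum is unchanged.

import statistics

def evaluate_case(reference_times, optimized_times,
                  reference_md5, optimized_md5):
    """Return the speedup of an optimization case over its reference run."""
    if optimized_md5 == reference_md5:
        # Optimizations did not change the binary: no need to execute it again.
        return 1.0
    ref = statistics.median(reference_times)   # e.g. runs of the -O3 baseline
    opt = statistics.median(optimized_times)   # runs with the new flag combination
    return ref / opt

# Example: three repetitions of each run (CCC_RUNS=3 style).
speedup = evaluate_case([16.36, 16.41, 16.33], [14.02, 13.95, 14.10],
                        "b15359", "a90217")
print(round(speedup, 2))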

When sharing multiple optimization cases among users, there is a natural competition between different optimizations that can improve a computing system. Hence, we use a simple ranking system to favor optimization cases that are stable across the largest number of users. Currently, users rank optimization cases manually, but we plan to automate this process based on the statistical ranking of optimizations described in [50]. This will require extensions to the UNIDAPT framework [30], described later, to enable transparent evaluation of program optimizations with any dataset during a single execution, without the need for a reference run, based on static multiversioning and statistical run-time optimization evaluation [45, 50, 3].

At the moment, COD uses the MySQL engine and can be accessed either directly or through online web services. The full description of the COD structure and web services is available at the collective tuning website [9]. Since collective tuning is an ongoing long-term initiative, the COD structure may evolve over time. Hence, we provide the current COD version number in the INFORMATION table to ensure compatibility between all cTuning tools and plugins that access COD.
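As a hedged sketch of such a compatibility check (the connection parameters, expected version and the column name in the INFORMATION table are all assumptions; only the table itself is described in the text), a tool could verify the repository version before submitting data:

import mysql.connector  # assumes the MySQL Connector/Python package is installed

EXPECTED_COD_VERSION = "1.0"   # hypothetical version this tool was written against

def check_cod_version(host, user, password, database):
    """Read the COD version from the INFORMATION table and compare it."""
    conn = mysql.connector.connect(host=host, user=user,
                                   password=password, database=database)
    cur = conn.cursor()
    cur.execute("SELECT VERSION FROM INFORMATION")  # column name is an assumption
    (version,) = cur.fetchone()
    conn.close()
    if version != EXPECTED_COD_VERSION:
        raise RuntimeError("COD version %s is not supported" % version)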


Field: Description

RUN_ID: Unique UUID-based identifier to enable global referencing of a given optimization case
RUN_ID_ASSOCIATE: ID of the associated run with baseline optimization for further analysis
COMPILE_ID: Associated compilation identifier
COMPILER_ID: Duplicate compiler identifier from the compilation table to speed up SQL queries
PROGRAM_ID: Duplicate program identifier from the compilation table to speed up SQL queries
BIN_SIZE: Duplicate binary size from the compilation table to speed up SQL queries
PLATFORM_ID: Unique platform identifier
ENVIRONMENT_ID: Unique software environment identifier
RE_ID: Unique runtime environment identifier
DATASET_ID: Unique dataset identifier
OUTPUT_CORRECT: Set to 1 if the program output is the same as the reference one (supported when using the Collective Benchmark or a program specially prepared using the CCC framework). It is important to add formal validation methods in the future, particularly for transparent collective optimization [50]
RUN_TIME: Absolute execution time in seconds (or a relative number if the absolute time cannot be disclosed by some companies or when using some benchmarks)
RUN_TIME_USER: User execution time
RUN_TIME_SYS: System execution time
RUN_TIME_BACKGROUND: Information about background processes to be able to analyze the interference between multiple running applications and enable better adaptation and scheduling when sharing resources on uni-core or multi-core systems [56]
RUN_PG: Function-level profile information (using gprof or OProfile): <function name=time spent in this function, ...>
RUN_HC: Dynamic program features (using hardware counters): <hardware counter=value, ...>
RUN_POWER: Power consumption (ongoing work)
RUN_ENERGY: Energy during overall program execution (ongoing work)
PAR_DYNAMIC: Information about dynamic dependencies to enable dynamic parallelization (ongoing work)
PROCESSOR_NUM: Core number assigned to the executed process
RANK: Integer number describing the ranking (profitability) of the optimization. An optimization case can be ranked manually or automatically based on statistical collective optimization [50]
NOTES: User notes about this optimization case

Figure 5: Summary of current fields of the COD table describing program executions.


4 Collaborative R&D Tools

Many R&D tools have been developed in the past few decades to enable empirical iterative feedback-directed optimization and analysis, including [1, 28, 27, 76, 12, 26, 69, 57]. However, they are often slow, frequently domain-, compiler- and platform-specific, and are not capable of sharing and reusing optimization information about different programs, datasets, compilers and architectures. Moreover, they are often not compatible with each other, not fully supported, unstable, and sometimes do not provide open sources to enable further extensions. As a result, iterative feedback-directed compilation has not yet been widely adopted.

Previously, we have shown the possibility of realistic, fast and automatic program optimization and compiler/architecture co-tuning based on empirical optimization space exploration, statistical analysis, machine learning and run-time adaptation [48, 42, 33, 45, 41, 37, 47, 62, 50]. Since we obtained promising results and our techniques became more mature, we decided to initiate a rigorous systematic evaluation and validation of iterative feedback-directed optimization techniques across multiple programs, datasets, architectures and compilers, but faced a lack of generic, stable, extensible and portable open-source infrastructure to support this empirical study with multiple optimization search strategies. Hence, we decided to develop and connect all our tools, benchmarks and databases together within the Collective Tuning Infrastructure using open APIs and to move all our developments into the public domain [3, 8] to extend our techniques and enable further collaborative and systematic community-driven R&D to automate code and architecture optimization and enable future self-tuning adaptive computing systems.

Instead of developing our own source-to-source transformation and instrumentation tools, we decided to reuse and "open up" existing production-quality compilers using an event-driven plugin system called the Interactive Compilation Interface (ICI) [21, 45, 44, 47]. We decided to use GCC for our project since it is a unique open-source production-quality compiler that supports multiple architectures and languages and has a large user base. Using a plugin-enabled production compiler can improve the quality and reproducibility of research and help to move research prototypes back into a compiler much faster, for the benefit of the whole community.

Finally, we are developing a universal run-time adaptation framework (UNIDAPT) to enable transparent collective optimization, run-time adaptation and split compilation for statically compiled programs with multiple datasets across different uni-core and multi-core heterogeneous architectures and environments [45, 62, 50, 56]. We have also collected multiple datasets within the Collective Benchmark (formerly MiBench/MiDataSets) to enable realistic research on program optimization with multiple inputs.

We hope that using common tools will help to avoid costly duplicate development, will improve the quality and reproducibility of research, and will boost innovation in program optimization, compiler design and architecture tuning.

4.1 Continuous Collective Compilation Framework

In [48, 49, 42] we demonstrated the possibility of applying iterative feedback-directed compilation to large applications at loop level. It was a continuation of the MHAOTEU project (1999-2000) [32], where we had to develop a source-to-source Fortran 77 and C compiler to enable parametric transformations such as


loop unrolling, tiling, interchange, fusion/fission, array padding and some others, evaluate their effect on large and resource-hungry programs, improve their memory utilization and execution time on high-performance servers and supercomputers, and predict the performance upper bound to guide the iterative search. The MHAOTEU project was in turn a continuation of the OCEANS project (1996-1999), where general iterative feedback-directed compilation was introduced to optimize relatively small kernels and applications for embedded computing systems [31].

In order to initiate further rigorous systematic evaluation and validation of iterative code and architecture optimizations, we started developing our own open-source modular Continuous Collective Compilation framework (CCC). The CCC framework is intended to automate program optimization, compiler design and architecture tuning using empirical iterative feedback-directed compilation. It enables collaborative and distributed exploration of program and architecture optimization spaces and collects static and dynamic optimization and profile statistics in COD. CCC has been designed using a modular approach, as shown in Figure 6. It has several low-level platform-dependent tools and platform-independent tools to abstract compilation and execution of programs. It also includes routines to communicate with COD and high-level plugins for iterative compilation with multiple search strategies, statistical analysis and machine learning.

CCC is installed using the INSTALL.sh script from the root directory. During installation, several scripts are invoked to configure the system, provide information about architectures, environments, compilers and runtime systems, and set up an environment. Before compilation of the platform-dependent tools, a user must select how to store optimization and profile information, i.e. within a local or shared COD at each step, or in a minimal off-line mode when all statistics are recorded in a local directory of each optimized program. The off-line mode can be useful for GRID-like environments with network filters, such as GRID5000 [19], where all statistics from multiple experiments can be aggregated in a text file and recorded in COD later. CCC may require PHP, the PAPI library, PapiEx, uuidgen and OProfile for extended functionality, and a MySQL client to work with local and shared COD (Figure 3) configured using the following environment variables:

• CCC_C_URL, CCC_C_DB, CCC_C_USER, CCC_C_PASS, CCC_C_SSL for the common database (URL or IP of the server, database name, username, password and SSL attributes for secure access)

• CCC_URL, CCC_DB, CCC_USER, CCC_PASS, CCC_SSL for the local database with optimization cases

• CCC_CT_URL, CCC_CT_DB, CCC_CT_USER, CCC_CT_PASS, CCC_CT_SSL for the shared database with filtered optimization cases visible at [9]

We decided to utilize a systematic top-down approach to optimization: though originally we started our research on iterative compilation from fine-grain loop transformations [32, 48, 42], we think that current research on architecture and code optimizations focuses too much on solving only local issues while sometimes overlooking global and coarse-grain problems. Therefore, we first developed optimization space exploration plugins with various search algorithms for global compiler flags and started gradually adding support for finer-grain program optimizations using plugin-enabled GCC with ICI (described in Section 4.2).


Figure 6: The CCC framework enables systematic, automatic and distributed program optimization, compiler design and architecture tuning with multiple optimization strategies, as well as collection of optimization and profiling statistics from multiple users in COD for further analysis and reuse.

Originally, we started using the Open64/PathScale compilers for our systematic empirical studies [45, 36, 43] but gradually moved to GCC, since it is the only open-source production-quality compiler that supports multiple architectures (more than 30 families) and languages. It also features many aggressive optimizations, including the unique GRAPHITE coarse-grain polyhedral optimization framework. However, the CCC framework can easily be configured to support multiple compilers, including LLVM, GCC4NET, Open64, Intel ICC, IBM XL and others, to enable fair comparison of different compilers and their available optimization heuristics.

The first basic low-level platform-dependent tool (ccc-time) is intended to execute applications on a given architecture, collect various raw profile information such as function-level profiling, and monitor hardware performance counters using the popular PAPI library [24], PAPIEx [25] and OProfile [23] tools. This tool is very small and portable, and has been tested on a number of platforms ranging from supercomputers and high-performance servers based on Intel, AMD, Cell and NVidia processors and accelerators to embedded systems from ARC, STMicroelectronics and ARM.

The other two major platform-independent components of CCC are ccc-comp and ccc-run, which provide all the necessary functionality to compile an application with different optimizations, execute binary or byte code with multiple datasets, process raw profile statistics from ccc-time, validate program output correctness, etc. Currently, these tools work with specially prepared programs and datasets, such as the Collective Benchmark (cBench) described in Section 4.3, in order to automate the optimization process and the validation of code correctness.


However, the required changes are minimal and we have already converted all programs from EEMBC, SPEC CPU 95, 2000 and 2006 and cBench to work with the CCC framework. We are currently extending the CCC framework within the Google Summer of Code 2009 program [18] to enable transparent continuous collective optimization, fine-grain program transformations and run-time adaptation within GCC, without any Makefile modifications, based on the statistical collective optimization concept [50].

The command line format of ccc-comp and ccc-run is the following:

• ccc-comp <descriptive compiler name> <compiler optimization flags recorded in COD> <compiler auxiliary flags not recorded in COD>

• ccc-run <dataset number> <1 if baseline reference run (optional)>

Normally, we start iterative compilation and optimization space exploration with the baseline reference compilation and execution, such as ccc-comp milepostgcc44 -O3 and ccc-run 1 1, where milepostgcc44 is the sample descriptive name of the machine-learning-enabled GCC with ICI v2.0 and feature extractor v2.0 registered during CCC configuration of available compilers, and -O3 is the best default optimization level of GCC that we aim to improve upon. The first parameter of ccc-run is the dataset number (for specially prepared benchmarks with multiple datasets) and the second parameter indicates that this is the reference run whose program output will be recorded to help validate the correctness of optimizations during iterative compilation.

We continue the exploration of optimizations by invoking the ccc-comp and ccc-run tools multiple times with different combinations of optimizations, controlled either through command-line flags or through the environment variables shown in Figure 7. At each iterative step these tools compare the program output with the output of the baseline reference execution to validate code correctness (though this is clearly not sufficient and we would like to provide more formal validation plugins) and prepare several text files (information packets) with compilation and execution information that are recorded locally and can be sent to COD using the ccc-db-* tools, as shown in Figure 8.
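A minimal driver illustrating this loop might look as follows (a hedged sketch only: it assumes ccc-comp and ccc-run are on the PATH and behave as described above; the compiler name and flag pool are placeholders):

import random
import subprocess

COMPILER = "milepostgcc44"          # descriptive compiler name registered in CCC
FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer"]

# Baseline reference compilation and run (dataset 1, output recorded).
subprocess.check_call(["ccc-comp", COMPILER, "-O3"])
subprocess.check_call(["ccc-run", "1", "1"])

# Iterative exploration: random flag combinations on top of -O3.
for step in range(20):
    combo = "-O3 " + " ".join(f for f in FLAGS if random.random() < 0.5)
    subprocess.check_call(["ccc-comp", COMPILER, combo])
    subprocess.check_call(["ccc-run", "1"])   # compared against the reference output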

Iterative optimization space exploration is performed using high-level plugins that invoke ccc-comp and ccc-run with different optimization parameters. We have produced several plugins written in C and PHP that implement the following search algorithms and some machine learning techniques to predict good optimizations:

• ccc-run-glob-flags-rnd-uniform - generates uniform random combinations of global optimization flags (each flag has a 50% probability of being selected for a generated combination of optimizations; see the sketch after this list)

• ccc-run-glob-flags-rnd-fixed - generates a random combination of global optimizations of a fixed length

• ccc-run-glob-flags-one-by-one - evaluates all available global optimizations one by one

• ccc-run-glob-flags-one-off-rnd - selects all optimizations in the first step and then removes them one by one (similar to [67] and to one of the modes of the PathOpt tool from the PathScale compiler suite [26])

• milepost-gcc - a wrapper around MILEPOST GCC that automatically extracts program features and queries the cTuning web service to predict good optimizations to improve execution time and code size, substituting the default optimization levels (described further in Section 5).
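For example, the two random strategies above can be sketched in a few lines (a hedged illustration, not the actual plugins; the flag list is a placeholder):

import random

GLOBAL_FLAGS = ["-funroll-loops", "-ftree-vectorize", "-fgcse-after-reload",
                "-fomit-frame-pointer", "-finline-functions"]

def random_uniform_combination(flags=GLOBAL_FLAGS):
    """Each flag is included with 50% probability, as in ccc-run-glob-flags-rnd-uniform."""
    return [f for f in flags if random.random() < 0.5]

def random_fixed_combination(length, flags=GLOBAL_FLAGS):
    """Fixed-length random combination, as in ccc-run-glob-flags-rnd-fixed."""
    return random.sample(flags, length)

print(random_uniform_combination())
print(random_fixed_combination(3))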


#Record compiler passes (through ICI)
export CCC_ICI_PASSES_RECORD=1

#Substitute original GCC pass manager and allow optimization pass selection and reordering (through ICI)
export CCC_ICI_PASSES_USE=1

#Extract program static features when using MILEPOST GCC
export CCC_ICI_FEATURES_STATIC_EXTRACT=1
#Specify after which optimization pass to extract static features
export ICI_PROG_FEAT_PASS=fre

#Profile application using hardware counters and PAPI library
export CCC_HC_PAPI_USE=PAPI_TOT_INS,PAPI_FP_INS,PAPI_BR_INS,PAPI_L1_DCM,PAPI_L2_DCM,PAPI_TLB_DM,PAPI_L1_LDM

#Profile application using gprof
export CCC_GPROF=1

#Profile application using OProfile
export CCC_OPROF=1
export CCC_OPROF_PARAM="--event=CPU_CLK_UNHALTED:6000"

#Repeat program execution a number of times with the same dataset to detect and remove performance measurement noise and validate the stability of execution times statistically
export CCC_RUNS=3

#Architecture-specific optimization flags for design space exploration
export CCC_OPT_PLATFORM="-mA7 -ffixed-r12 -ffixed-r16 -ffixed-r17 -ffixed-r18 -ffixed-r19 -ffixed-r20 -ffixed-r21 -ffixed-r22 -ffixed-r23 -ffixed-r24 -ffixed-r25"
export CCC_OPT_PLATFORM="-mtune=itanium2"
export CCC_OPT_PLATFORM="-march=athlon64"

#In case of a multiprocessor or multicore system, select which processor/core to run the application on
export CCC_PROCESSOR_NUM=

#Select runtime environment (VM or simulator)
export CCC_RUN_RE=llvm25
export CCC_RUN_RE=ilrun
export CCC_RUN_RE=unisim
export CCC_RUN_RE=simplescalar

#Some notes to record in COD together with experimental data
export CCC_NOTES="test optimizations"

#The following variables are currently used in on-going projects and can change:
#Architecture parameters for design space exploration
export CCC_ARCH_CFG="l1_cache=203; l2_cache=35;"
export CCC_ARCH_SIZE=132
#Static parallelization and fine-grain optimizations
export CCC_OPT_FINE="loop_tiling=10;"
export CCC_OPT_PAR_STATIC="all_loops=parallelizable;"
#Information about power consumption, energy, dynamic dependencies that should be recorded automatically
export CCC_RUN_POWER=
export CCC_RUN_ENERGY=
export CCC_PAR_DYNAMIC="no deps"

Figure 7: Some environment variables to control the ccc-comp and ccc-run tools from the CCC framework.


Main compilation information packet (local filename: _comp):

COMPILE_ID=19293849477085514
PLATFORM_ID=2111574609159278179
ENVIRONMENT_ID=2781195477254972989
COMPILER_ID=129504539516446542
PROGRAM_ID=1487849553352134
DATE=2009-06-04
TIME=14:06:47
OPT_FLAGS=-O3
OPT_FLAGS_PLATFORM=-msse2
COMPILE_TIME=69.000000
BIN_SIZE=48870
OBJ_MD5CRC=b15359251b3c185dfa180e0e1ad16228
ICI_FEATURES_STATIC_EXTRACT=1
NOTES=baseline compilation

Information packet with ICI optimization passes (local filename: _comp_passes):

COMPILE_ID=19293849477085514
COMPILER_ID=129504539516446542
FUNCTION_NAME=corner_draw
PASSES=all_optimizations,strip_predict_hints,addressables,copyrename,cunrolli,ccp,forwprop,cdce,alias,retslot,phiprop,fre,copyprop,mergephi,...

COMPILE_ID=19293849477085514
COMPILER_ID=129504539516446542
FUNCTION_NAME=edge_draw
PASSES=all_optimizations,strip_predict_hints,addressables,copyrename,cunrolli,ccp,forwprop,cdce,alias,retslot,phiprop,fre,copyprop,mergephi,...
...

Information packet with program features (local filename: _prog_feat):

COMPILE_ID=19293849477085514
FUNCTION_NAME=corner_draw
PASS=fre
STATIC_FEATURE_VECTOR=ft1=9, ft2=4, ft3=2, ft4=0, ft5=5, ft6=2, ft7=0, ft8=3, ft9=1, ft10=1, ft11=1, ft12=0, ft13=5, ft14=2, ...

COMPILE_ID=19293849477085514
FUNCTION_NAME=edge_draw
PASS=fre
STATIC_FEATURE_VECTOR=ft1=14, ft2=6, ft3=5, ft4=0, ft5=7, ft6=5, ft7=0, ft8=3, ft9=3, ft10=3, ft11=2, ft12=0, ft13=11, ft14=1, ...
...

Execution information packet (local filename: _run):

RUN_ID=22712323769921139
RUN_ID_ASSOCIATE=22712323769921139
COMPILE_ID=8098633667852535
COMPILER_ID=331350613878705696
PLATFORM_ID=2111574609159278179
ENVIRONMENT_ID=2781195477254972989
PROGRAM_ID=1487849553352134
DATE=2009-06-04
TIME=14:35:26
RUN_COMMAND_LINE=1) ../../automotive_susan_data/1.pgm output_large.corners.pgm -c > ftmp_out
OUTPUT_CORRECT=1
RUN_TIME=16.355512
RUN_TIME1=0.000000
RUN_TIME_USER=13.822898
RUN_TIME_SYS=2.532614
RUN_PG={susan_corners=12.27,782,0.0156905371}
NOTES=baseline compilation

Figure 8: Information packets produced by the ccc-comp and ccc-run tools from the CCC framework that are recorded locally or sent to COD.


When distributing optimization space exploration among multiple users or on clusters and GRID-like architectures, each user may specify a different random seed to explore different parts of the optimization space on different machines. The best-performing optimization cases from all users are later filtered and joined in COD for further analysis and reuse by the whole community. For example, we can train machine learning models and predict good optimizations for a given program on a given architecture using collective optimization data from COD, as shown in Section 5.

During iterative compilation we are interested in filtering the large amount of obtained data to find only those optimization cases that improve execution time, code size, power consumption and other metrics, depending on the user optimization scenarios, or that exhibit performance anomalies and bugs for further analysis. Hence, we developed several platform-independent plugins (written in PHP) that analyze data in the local database, find such optimization cases (for example, get-all-best-flags-time finds optimization cases that improve execution time and get-all-best-flags-time-size-pareto finds cases that improve both execution time and code size using a Pareto-like distribution) and record them in COD. Continuously updated optimization cases can be viewed at [9].

When using random iterative search we may obtain complex combinations of optimizations without a clear indication of which particular code transformation improves the code. Therefore, we can also apply additional pruning to each optimization case and remove, one by one, those optimizations from a given combination that do not influence performance, code size or other characteristics, using the ccc-run-glob-flags-one-off-rnd tool. This helps to improve the correlation between program features and transformations, and eventually to improve optimization predictions based on machine learning techniques such as decision tree algorithms, for example.
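A hedged sketch of this pruning step (not the actual tool; evaluate() stands for a compile-and-run measurement such as the ccc-comp/ccc-run pair):

def prune_combination(flags, evaluate, tolerance=0.01):
    """Drop flags one by one, keeping only those whose removal hurts the metric.

    `evaluate(flags)` is assumed to return a metric to minimize (e.g. run time).
    """
    best = evaluate(flags)
    kept = list(flags)
    for flag in list(kept):
        trial = [f for f in kept if f != flag]
        score = evaluate(trial)
        if score <= best * (1 + tolerance):   # removal did not hurt: flag is not influential
            kept = trial
            best = min(best, score)
    return kept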

Since we also want to explore dynamic optimizations and architecture designs systematically, we are gradually extending CCC to support various VM systems, such as MONO and LLVM, to evaluate JIT split compilation (finding a balance between static and dynamic optimization using statistical techniques and machine learning), and to support multiple architecture simulators, such as UNISIM and SimpleScalar, to enable architecture design space exploration and to automate architecture and code co-optimization. More information about current CCC developments is available at our collaborative website [7]. Some practical CCC usage examples are presented in Section 5.

4.2 Interactive Compilation Interface

In 1999-2002, we started developing memory hierarchy analysis and optimization tools for real, large high-performance applications within the MHAOTEU project [32] to build the first realistic adaptive compiler (a follow-up of the OCEANS project [31]). Within the MHAOTEU project, we attempted to generalize iterative feedback-directed compilation and optimization space exploration to adapt any program to any architecture empirically and automatically, improving execution time over the best default optimization heuristic of state-of-the-art compilers. We decided to focus on a few well-known loop and data transformations such as array padding, reordering and prefetching, loop tiling (blocking), interchange, fusion/fission, unrolling and vectorization, as well as some polyhedral transformations. Unlike [38, 39], where only optimization orders were evaluated on some small kernels using an architecture simulator, we decided to use large SPEC95 and SPEC2000 floating-point benchmarks


together with a few real applications from MHAOTEU partners, as well as several architectures that were modern at the time, to evaluate our approach in practice.

Unfortunately, at that time we could not find any production compiler with fine-grain control of optimization selection, while the several available source-to-source transformation tools, including SUIF [28], were not stable enough to parse all SPEC codes or to enable systematic exploration of complex combinations of optimizations. Hence, we decided to develop our own source-to-source compiler and iterative compilation infrastructure [11] using the Octave C/C++/Fortran77 front-end and the MARS parallelizing compiler based on a polyhedral transformation framework produced at the Universities of Manchester and Edinburgh [66]. However, it was a very tedious and time-consuming task, allowing us to evaluate iterative compilation using only loop tiling, unrolling and array padding by the end of the project. Nevertheless, we obtained encouraging execution time improvements for SPEC95 and several real large codes from our industrial partners across several RISC and CISC architectures [48, 42]. We also managed to develop a prototype of a quick upper-bound performance evaluation tool as a stopping criterion for iterative compilation [42, 49].

The MHAOTEU project helped us to highlight multiple problems in building adaptive compilers and indicated further research directions. For example, state-of-the-art static compilers often fail to produce good-quality code (in terms of code size, execution time, etc.) due to: hardwired ad-hoc optimization heuristics (cost models) on rapidly evolving hardware; large irregular optimization spaces; the fixed order of, and complex interactions between, optimizations inside a compiler or between a compiler and source-to-source or binary transformation tools; time-consuming retuning of the default optimization heuristic for all available architectures when adding new transformations; the inability to retarget a compiler easily to new architectures, particularly during architecture design space exploration; the inability to produce mixed-ISA code easily; the inability to reuse optimization knowledge among different programs and architectures; the lack of run-time information; the inability to parallelize code effectively and automatically; and the lack of run-time adaptation mechanisms that allow statically compiled programs to react, with low overhead, to varying program and system behavior as well as to multiple datasets (program inputs). To overcome these problems, we decided to start a long-term project to change the outdated compilation and optimization technology radically and to build a novel realistic adaptive optimization infrastructure that allows rigorous systematic evaluation of empirical iterative compilation, run-time adaptation, collective optimization, architecture design space exploration and machine learning.

First, we had to decide which program transformation tool to use to enable systematic performance evaluation of our research optimization techniques, i.e. whether to continue developing our own source-to-source interactive compiler, which is too time consuming, or to find some other solution. At the same time, we noticed that some available open-source production compilers, such as Open64, had started featuring many of the aggressive loop and array transformations that we had planned to implement in our own transformation tool. Considering that Open64 was a stable compiler supporting two architectures and the C and Fortran languages, and could process most of the SPEC codes, we decided to use it for our experiments. We provided a simple interface to open it up and enable external selection of internal loop/array optimizations and their parameters through an event-driven plugin system that we called the Interactive Compilation Interface (ICI).


Interactive Compilation

Interface

CCC plugins to optimize programs

and tune compiler optimization heuristic

Detect optimization flags

GCC Controller (Pass Manager)

IC Event

Pass N

IC Event

Pass 1

GCC Data Layer AST, CFG, CF, etc

IC Data

IC Event

ICI

MILEPOST GCC with ICI

...

IC Plugins

High-level scripting (java, python, etc)

Selecting pass sequences

Extracting static program features

<Dynamically linked shared libraries>

CCC Framework

...

Local COD Filter

Figure 9: Interactive Compilation Interface: high-level event-driven plugin framework to open up compilers, extend them and control their internal decisions using dynamically loaded user plugins. This is the first step to enable future modular self-tuning adaptive compilers.

We combined it with the Framework for Continuous Optimizations (FCO) and developed several search plugins (random, exhaustive, leave-one-out, hill climbing) to enable continuous and systematic optimization space exploration using an ICI-enabled compiler [15]. Since we released ICI for the Open64/PathScale compiler in 2005, it proved to be a simple and efficient way to transform production compilers into iterative research tools and was used in several research projects to fine-tune programs for a given architecture and a dataset [45, 62]. However, when we tried to extend it to combine reordering of optimizations with fine-grain optimizations, we found it too time consuming to modify the rigid optimization manager in Open64.

At the same time, we noticed that there was a considerable community effort to modularize GCC, add a new optimization pass manager with some basic information about dependencies between passes and provide many aggressive optimizations including polyhedral transformations. Considering that this could open up many interesting research opportunities and taking into account that GCC is a unique stable open-source compiler that supports dozens of architectures and multiple languages, can compile the whole Linux distribution and has a very large community that is important for collective optimization [50], we decided to use this compiler for our further research.

ICI to "hijack" GCC and control its internaldecisions through an event-driven mechanismsand dynamically loaded plugins. The conceptof the new ICI and interactive compilers hasbeen described in [44] and extended duringthe MILEPOST project [47]. Since then, wemoved all the developments to the community-driven website [21] and continued extending itbased on user feedback and our research re-quirements.

The current ICI is an event-driven plugin framework with a high-level compiler-independent and a low-level compiler-dependent API to transform production compilers into collaborative open modular interactive toolsets, as shown in Figure 9. ICI acts as "middleware" between the compiler and the user plugins and enables external program analysis and instrumentation and fine-grain program optimizations without revealing all the internals of a compiler. This makes ICI-enabled compilers more researcher- and developer-friendly, allowing simple prototyping of new ideas without deep knowledge of the compiler itself, without the need to recompile the compiler and without building new compilation and optimization tools from scratch. Using ICI can also help to avoid time-consuming revolutionary approaches that create a new "clean", modular and fast compiler infrastructure from scratch, by gradually transforming current rigid compilers into a modular adaptive compiler infrastructure. Finally, we believe that using production compilers with ICI in research can help to move successful ideas back to the compiler much faster for the benefit of all users and boost innovation and research.

We used GCC with ICI in the MILEPOST project [13] to develop the first machine-learning-enabled research compiler, enable automatic and systematic optimization space exploration and predict good combinations of optimizations (global flags or passes on a function level) based on static program features and predictive modeling. Together with colleagues from IBM we could easily add a feature extractor pass to GCC and call it after any arbitrary optimization pass using a simple dynamic plugin. Such a machine-learning-enabled self-tuning compiler called MILEPOST GCC (GCC with ICI and a static feature extractor) has been released and used by IBM and ARC to improve and speed up the optimization process for their realistic applications. This usage scenario is described in more detail in Section 5 and in [47].

The current ICI is organized around four main concepts to abstract the compilation process:

• Plugins are dynamically loaded user modules that "hijack" a compiler to control compilation decisions and have access to some or all of its internal functions and data (a minimal plugin skeleton is sketched below). Currently, the plugin programming interface consists of three kinds of functions: initialization functions that are in charge of starting the plugin, checking compiler and plugin compatibility, and registering event handlers; termination functions that are in charge of cleaning up the plugin data structures, closing files, etc.; and event handler (callback) functions that control the compiler.

• Events are triggered whenever the compiler reaches some defined point during execution. In that case, ICI invokes a user-definable callback function (event handler) referenced simply by a string name.

• Features are abstractions of selected properties of the current compiler state and of the compiled program. A brief list of some available features is shown in Figure 10, ranging from an array of optimization passes to the simple string name of the function being compiled.

compiler_flags [array of strings (char **)]: Names of all known command-line options (flags) of a compiler. Individual option names are stored without the leading dash.

compiler_params [array of strings (char **)]: Names of all known compiler parameters.

function_name [string (char)]: Name of the function currently being compiled.

function_decl_filename [string (char)]: Name of the file in which the function currently being compiled was declared. Returns the filename corresponding to the most recent declaration.

function_decl_line [integer (int)]: Line number at which the function currently being compiled was declared. In conjunction with the feature "function_decl_filename" it gives the location of the most recent declaration of the current function.

function_filename [string (char)]: Name of the file in which the function currently being compiled was defined.

function_start_line [integer (int)]: Line number at which the definition of the current function effectively starts. Corresponds to the first line of the body of the current function.

function_end_line [integer (int)]: Line number at which the definition of the current function effectively ends. Corresponds to the last line of the body of the current function.

first_pass [string (char)]: Human-readable name of the first pass of a compiler. Accessing this feature has the side effect of setting that specific pass as the "current pass" of ICI.

next_pass [string (char)]: Human-readable name of the next pass to be executed after the "current pass" of ICI. Accessing this feature has the side effect of advancing the "current pass" of ICI to its immediate successor in the currently defined pass chain.

Figure 10: List of some popular features available in ICI version 2.x

• Parameters are abstractions of compiler variables to decouple plugins from the actual implementation of compiler internals. They are identified simply by a string name and are used to get and/or set values in the compiler, for example to force inlining of a function or to change loop blocking or unrolling factors.

Detailed documentation is available at the ICI collaborative website [22].
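The three kinds of plugin functions above can be illustrated with a minimal sketch. This is not the actual ICI API: the entry-point names, the helper functions and the event name "pass_execution" are assumptions made only for illustration (the real interface is documented at [22]); only the feature name "function_name" is taken from Figure 10. Trivial stubs simulate the compiler side so that the example is self-contained.

/* Illustrative sketch of an ICI-style plugin with stubbed compiler services. */

#include <stdio.h>
#include <string.h>

/* ---- stubbed "compiler side" of the assumed ICI services ---- */
static void (*registered_handler) (void);

static void ici_register_event (const char *event, void (*handler) (void))
{
  if (strcmp (event, "pass_execution") == 0)
    registered_handler = handler;
}

static const char *ici_get_feature_string (const char *feature)
{
  if (strcmp (feature, "function_name") == 0)
    return "main";                       /* pretend we are compiling main()   */
  return NULL;
}

/* ---- the plugin itself: initialization, event handler, termination ---- */
static void on_pass_execution (void)     /* event handler (callback)          */
{
  const char *fname = ici_get_feature_string ("function_name");
  printf ("plugin: about to run a pass on function %s\n", fname);
}

static int plugin_init (void)            /* initialization function           */
{
  ici_register_event ("pass_execution", on_pass_execution);
  return 0;
}

static void plugin_fini (void)           /* termination function              */
{
  registered_handler = NULL;             /* unregister / clean up             */
}

int main (void)                          /* stand-in for the compiler driver  */
{
  plugin_init ();
  if (registered_handler)
    registered_handler ();               /* compiler triggers the event       */
  plugin_fini ();
  return 0;
}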

Since 2007, we have been participating in multiple discussions with colleagues developing their own GCC plugin frameworks such as [70, 52, 68] and with the GCC community and steering committee to add a generic plugin framework to GCC. Finally, a plugin framework will be included in mainline GCC 4.5 [16]. This plugin framework is very similar to ICI but more low-level. For example, the set of plugin events is hardwired inside the compiler, plugin callbacks have a fixed, pass-by-value argument set and the pass management is very basic. However, it already provides multi-plugin support with command-line arguments and callback chains, i.e. lists of callbacks invoked upon a single occurrence of a plugin event, and is a good step towards interactive adaptive compilers. Hence, we are synchronizing ICI with the plugin branch [29] to provide a more high-level API including:

• dynamic registration and unregistration of plugin events

• dynamic registration/definition/unregistration of event callback arguments

• arbitrary number of pass-by-name event callback arguments

• ability to substitute complete pass managers (chains of passes)

• high-level access to compiler state (values of flags and parameters, name and selected properties of the current function, name of the current and next pass) with some modification possibilities (compiler parameters, next pass).

A comparison of ICI and some other available plugin frameworks for GCC is available at [6]. An ICI plugin repository with several test, pass manipulation and machine learning plugins is available at the collaborative development website [21]. During the Google Summer of Code'09 program [18] we are extending ICI and plugins to provide an XML representation of the compilation flow, selection and tuning of fine-grain optimizations and polyhedral GRAPHITE transformations and their parameters using machine learning, and to enable code instrumentation, generic function cloning, run-time adaptation capabilities and collective optimization technology [50]. We also ported ICI and the MILEPOST program feature extractor to GCC4NET [2] to evaluate split compilation, i.e. predicting the good balance between optimizations that should be performed at compile time and the ones that should be performed at run-time when executing code on multiple architectures and with multiple datasets, based on statistical analysis and machine learning.

We hope that an ICI-like plugin framework will become standard for compilers in the future, will help prototype research ideas quickly, will simplify, modularize and automate compiler design, will allow users to write their own optimization plugins, will enable automatic tuning of optimization heuristics and retargetability for different architectures, and will eventually enable smart self-tuning adaptive computing systems for the emerging heterogeneous (and reconfigurable) multi-core architectures. More information about ICI and current collaborative extension projects is available at [21].

4.3 Collective Benchmark

Automatic iterative feedback-directed compilation is now becoming a standard technique to optimize programs, evaluate architecture designs and tune compiler optimization heuristics. However, it is often performed with one or several datasets (test, train and ref in SPEC benchmarks, for example) with an implicit assumption that the best configuration found for a given program using one or several datasets will work well with other datasets for that program.

We already know well that different optimizations are needed for different datasets when optimizing small kernels and libraries [76, 64, 69, 31]. For example, different tiling and unrolling factors are required for matrix multiply to better utilize the memory hierarchy and ILP and improve execution time depending on the matrix size. However, when evaluating iterative feedback-directed fine-grain optimizations (loop tiling, unrolling and array padding) for large applications even with one dataset [48, 42], we confirmed that the effect of such optimizations on large code can be very different than on kernels and is normally much smaller, often due to inter-loop and inter-procedural memory locality, complex data dependencies and low memory bandwidth, among others.

In order to enable systematic exploration of iterative compilation techniques and realistic performance evaluation for programs with multiple datasets, we need benchmarks that have a large number of inputs together with tools that support global inter-procedural and coarse-grain optimizations (and parallelization) based on a combination of traditional fine-grain optimizations and polyhedral transformations.

Unfortunately, most of the available open-source and commercial benchmarks have only a few datasets available. Hence, in 2006, we decided to assemble a collection of datasets for the free, commercially representative MiBench [53] benchmark suite. Originally, we assembled 20 inputs per program for 26 MiBench programs (520 datasets in total) in the dataset suite that we called MiDataSets [43]. We started from a top-down approach, first evaluating global optimizations (using compiler flags) [43] and gradually adding support to evaluate individual transformations including polyhedral GRAPHITE optimizations in GCC using the Interactive Compilation Interface within the GSOC'09 program [18].

We released MiDataSets in 2007 and since then it has been used in multiple research projects. Therefore, we decided to extend it, add more programs and kernels, update all current MiBench programs to support ANSI C and improve portability across multiple architectures, and create a dataset repository. As a result, we developed a new research benchmark called Collective Benchmark (cBench) with an open repository [4] to keep various open-source programs with multiple inputs assembled by the community.

Naturally, the span of execution times across all programs and all datasets can be very large, which complicates systematic iterative compilation evaluation. For example, when the execution time is too low, it may be difficult to evaluate the impact of optimizations due to measurement noise. In such cases, we add a loop wrapper around the main function, moving most of the IO and initialization routines out of it, to be able to control the length of the program execution. A user can change the upper bound of the loop wrapper through an environment variable or a special text file. We provide a default setting that makes a program run for about 10 seconds on an AMD Athlon64 3700+. However, if the execution time is too high and slows down systematic iterative compilation, particularly when using architecture simulators, we try to detect those program variables that can control the length of the program execution and allow users to modify them externally. Of course, in such cases, the program behavior may change due to locality issues among others and may require different optimizations, which is a subject of further research.
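A minimal sketch of such a loop wrapper is shown below, using a self-contained toy kernel; the environment variable name CBENCH_LOOP_WRAPPER and the default bound are illustrative assumptions, not the actual cBench conventions.

/* Loop wrapper sketch: IO/initialization stays outside the repeated region,
   and the repeat count is controlled externally through an environment
   variable (variable name assumed for illustration). */

#include <stdio.h>
#include <stdlib.h>

static double data[1024];

static void benchmark_init (void)            /* initialization, done once     */
{
  for (int i = 0; i < 1024; i++)
    data[i] = (double) i;
}

static double benchmark_kernel (void)        /* work being measured           */
{
  double s = 0.0;
  for (int i = 0; i < 1024; i++)
    s += data[i] * data[i];
  return s;
}

int main (void)
{
  long repeat = 1000;                         /* default upper bound           */
  const char *env = getenv ("CBENCH_LOOP_WRAPPER");  /* assumed variable name  */
  if (env)
    repeat = atol (env);

  double s = 0.0;
  benchmark_init ();
  for (long i = 0; i < repeat; i++)           /* loop wrapper around the work  */
    s += benchmark_kernel ();
  printf ("checksum: %g\n", s);               /* output kept outside the loop  */
  return 0;
}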

Each program from cBench currently has several Makefiles for different compilers including GCC, GCC4CLI, LLVM, Intel, Open64 and PathScale (Makefile.gcc, Makefile.gcc4cli, Makefile.llvm, Makefile.intel, Makefile.open64 and Makefile.pathscale respectively). Two basic scripts are also provided to compile and run a program (an illustrative invocation follows the list):

• __compile <Makefile compiler extension> <Optimization flags>

• __run <dataset number> (<loop wrapper bound - using the default if omitted>)
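For illustration only (the flag string, dataset number and exact argument quoting are not taken from the cBench documentation), compiling a program with GCC at -O3 plus loop unrolling and running it on its first dataset with the default loop wrapper bound might look like:

__compile gcc -O3 -funroll-loops
__run 1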

Datasets are described in the file _ccc_info_datasets, which has the following format (an illustrative example follows the format description):

<Total number of available datasets>
====
<Dataset number>
<Command line when invoking the executable for this dataset>
<Loop wrapper bound>
====
...
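For illustration only, a hypothetical _ccc_info_datasets file describing two datasets might look as follows (the executable name, file names and loop wrapper bounds are made up):

2
====
1
./a.out dataset1.pgm output1.pgm
50000
====
2
./a.out dataset2.pgm output2.pgm
120000
====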

Since one of the main purposes of cBench is enabling rigorous systematic evaluation of empirical program optimizations, we included several scripts and files for each cBench program to work with the CCC framework and record all experimental data in COD entirely automatically. These scripts and files include:

• _ccc_program_id - file with a unique CCC framework ID to be able to share optimization cases with the community within COD [9].

• _ccc_prep - script that is invoked before program compilation to prepare the directory for execution, e.g. copying some large datasets or compiling libraries. It is used for SPEC2006, for example.

• _ccc_post - script that is invoked after program execution and can be useful when copying profile statistics from remote servers. For example, it is used when executing programs remotely on an ARC simulation board using SSH.

• _ccc_check_output.clean - script that removes all output files a program may produce.

• _ccc_check_output.copy - script that saves all output files after a reference run.

• _ccc_check_output.diff - script that compares all output files after execution with the saved outputs from the reference run, as a simple check that a combination of optimizations has been correct. Of course, this method does not prove correctness and we plan to add more formal methods, but it can quickly identify bugs and remove illegal combinations of optimizations.

We believe that this community-assembled benchmark with multiple datasets opens up many research opportunities for realistic code and architecture optimization, improves the quality and reproducibility of systematic empirical research and can enable more realistic benchmarking of computing systems. For example, we believe that using just one performance metric produced by current benchmarks with some ad-hoc combinations of optimizations and a few datasets may not be enough to characterize the overall behavior of the system, since using iterative optimization can result in much better code. Our approach is to enable continuous monitoring, optimization and characterization of computing systems (programs, datasets, architectures, compilers) to be able to provide a more realistic performance upper bound and comparison after iterative compilation.

We also plan to extend cBench by extracting the most time-consuming kernels (functions or loops) from programs with snapshots of their inputs during multiple program phases (such kernels with encapsulated inputs are called codelets). We will randomly modify and combine them together to produce large training sets from kernels with various features, in order to answer the research question of whether it is possible to predict good optimizations for large programs based on program decomposition and kernel optimizations. Finally, we plan to add parallel programs and kernels (OpenCL, OpenMP, CUDA, MPI, etc.) with multiple datasets to extend research on adaptive parallelization and scheduling for programs with multiple datasets on heterogeneous multi-core architectures [56].

It is possible to download cBench and participate in collaborative development at [4].

[Figure 11 diagram: an original hot function and multiple statically-compiled versions (Version1 ... VersionN) produced by iterative and collective compilation using interactive compilers, source-to-source transformers or manual optimizations; the versions cover optimizations for different datasets, for different architectures (heterogeneous or reconfigurable processors with different ISAs such as GPGPU or CELL, the same ISA with extensions such as 3dnow or SSE, or virtual environments) and for different program phases or run-time environment behavior; dataset features are extracted and run-time behavior or architectural changes are monitored using timers or performance counters, while machine learning techniques find the mapping between run-time contexts and representative versions.]

Figure 11: Universal run-time adaptation framework (UNIDAPT) based on static multiversioning and dynamic selection routines. A representative set of multiple function versions [62], optimized or compiled for different run-time optimization scenarios, is used in such adaptive binaries and libraries. An optimized decision tree or rule induction techniques should be able to map clones to different run-time contexts/scenarios during program or library execution.

4.4 Universal Run-time Adaptation Framework for Statically-Compiled Programs

As already mentioned in previous sections, iterative compilation can help optimize programs for any given architecture automatically, but it is often performed with only one or several datasets. Since it is known that the optimal selection of optimizations may depend on program inputs [76, 64, 69, 43], just-in-time or hybrid static/dynamic optimization approaches have been introduced to select appropriate optimizations at run-time depending on the context and optimization scenario. However, it is not always possible to use a complex recompilation framework, particularly in the case of resource-limited embedded systems. Moreover, most of the available systems are often limited to only a few optimizations and do not have mechanisms to select a representative set of optimization variants [35, 40, 75, 59].

In [45] we presented a framework (UNIDAPT) that enabled run-time optimization and adaptation for statically compiled binaries and libraries based on static function multiversioning, iterative compilation and low-overhead hardware counter monitoring routines. During compilation, several clones of hot functions are created and a set of optimizations is applied to the clones that may improve execution time across a number of inputs or architecture configurations. During execution, the UNIDAPT framework monitors the original program and system behavior and occasionally invokes clones to build a table that associates program behavior based on hardware counters (dynamic program features) with the best performing clones, i.e. optimizations. This table is later used to predict and select good clones as a reaction to a given program behavior. It is continuously updated during executions with multiple datasets and on multiple architectures, thus enabling simple and effective run-time adaptation even for statically-compiled applications. During subsequent compilations, the worst performing clones can be eliminated and new ones added to enable continuous and transparent optimization space exploration and adaptation. A similar framework has also been recently presented in [63].
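The multiversioning scheme can be sketched as follows. This is a simplified illustration, not the actual UNIDAPT implementation: the clones, the bucketing of run-time behavior by input size (standing in for hardware-counter-based dynamic features) and the selection table are all assumptions.

/* Static function multiversioning with a run-time dispatch table. */

#include <stdio.h>

#define N_BUCKETS 3

/* Two semantically equivalent clones of a hot function, e.g. built with
   different combinations of optimizations. */
static double clone_a (const double *x, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += x[i] * x[i];
  return s;
}

static double clone_b (const double *x, int n)
{
  double s0 = 0.0, s1 = 0.0;
  int i = 0;
  for (; i + 1 < n; i += 2) { s0 += x[i] * x[i]; s1 += x[i + 1] * x[i + 1]; }
  if (i < n) s0 += x[i] * x[i];
  return s0 + s1;
}

typedef double (*clone_fn) (const double *, int);
static clone_fn clones[] = { clone_a, clone_b };

/* Table associating bucketed behavior with the best-performing clone; in
   UNIDAPT such a table is built and updated by occasionally timing clones. */
static int best_clone[N_BUCKETS] = { 0, 1, 1 };

static int bucket_of (int n) { return n < 64 ? 0 : n < 4096 ? 1 : 2; }

static double hot_function (const double *x, int n)
{
  return clones[best_clone[bucket_of (n)]] (x, n);   /* dispatch to a clone */
}

int main (void)
{
  double x[100];
  for (int i = 0; i < 100; i++)
    x[i] = 0.5 * i;
  printf ("%g\n", hot_function (x, 100));
  return 0;
}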

Our approach is driven by simplicity and practicality. We show that with the UNIDAPT framework it is possible to select complex optimizations at run-time without resorting to sophisticated dynamic compilation frameworks. Since 2004, we have extended this approach in multiple research projects. We used it to speed up iterative compilation by several orders of magnitude [45] using low-overhead program phase detection at run-time; evaluate run-time optimizations for irregular codes [46]; build self-tuning multi-versioning libraries automatically using representative sets of optimizations found off-line and providing fast run-time selection routines based on dataset features and standard machine learning techniques such as decision tree classifiers [62]; and enable transparent statistical continuous collective optimization of computing systems [50]. We also started investigating predictive run-time code scheduling for heterogeneous multi-core architectures where function clones are targeted for different ISAs together with explicit data management [56].

Since 2006, we have been gradually implementing the UNIDAPT framework presented in Figure 11 in GCC. During the Google Summer of Code'09 program [18] we are extending GCC to generate multiple function clones on the fly using ICI, apply combinations of fine-grain optimizations to the generated clones, provide program instrumentation to call program behavior monitoring and adaptation routines (and explicit memory transfer routines in the case of code generation for heterogeneous GPGPU-like architectures [56]), provide transparent linking with external libraries (to enable monitoring of hardware counters or the use of machine learning libraries to associate program behavior with optimized clones, for example) and add decision tree statements to select appropriate clones at run-time according to dynamic program features.

The UNIDAPT framework combined with cTuning infrastructure opens up many research opportunities. We are extending it to improve dataset, program behavior and architecture characterization to better predict optimizations; provide run-time architecture reconfiguration to improve both execution time and power consumption using program phases based on [45]; enable split compilation, i.e. finding a balance between static and dynamic optimizations using predictive modeling; improve dynamic parallelization, data partitioning, caching and scheduling for heterogeneous multi-core architectures; enable migration of optimized code in virtual environments when the architecture may change at run-time; and provide fault-tolerance mechanisms by adding clones compiled with soft-error correction mechanisms, for example.

Finally, we started combining cTuning/MILEPOST technology, the UNIDAPT framework and a Hybrid Multi-core Parallel Programming Environment (HMPP) [20] to enable adaptive practical profile-driven optimization and parallelization for current and future hybrid heterogeneous CPU/GPU-like architectures based on dynamic collective optimization, dynamic data partitioning and predictive scheduling, empirical iterative compilation, statistical analysis, machine learning and decision trees together with program and dataset features [50, 62, 56, 51, 47, 60, 45].

More information about collaborative UNIDAPT R&D is available at [30].

5 Usage Scenarios

5.1 Manual sharing of optimization cases

Collective tuning infrastructure provides multiple ways to optimize computing systems and opens up many research opportunities. In this section we present several common usage scenarios.

The first and simplest scenario is manual sharing and reuse of optimization cases using an online web form at [9]. If a user finds some optimization configuration, such as a combination of compiler flags, an order of optimization passes, parameters of fine-grain transformations or an architecture configuration, that improves some characteristics of a given program, such as execution time, code size, power consumption or rate of soft errors, over the default compiler optimization level and default architecture configuration with a given dataset, such an optimization case can be submitted to the Collective Optimization Database to make the community aware of it.

If an optimized program and a dataset are well known from standard benchmarks such as SPEC, EEMBC, MiBench/cBench/MiDataSets or from open-source projects, they can simply be referenced by their name, benchmark suite and version. If a program is less known in the community, has open sources and can be distributed without any limitations, a user may be interested in sharing it with the community together with multiple datasets within our Collective Benchmark (cBench) at [4] to help other users reproduce and verify optimization cases. This can be particularly important when using COD to reference bugs in compilers, run-time systems, architecture simulators and other components of computing systems. However, if a program/dataset pair is not open source while a user or a company would still like to share optimization cases for it with the community or get an automatic optimization suggestion based on collective optimization knowledge, such a pair can be characterized using static or dynamic features and program reactions to transformations that capture code and architecture characteristics without revealing the sources, so that programs and datasets can be compared indirectly [33, 47, 37, 62, 50] (on-going work).

When optimizing computing systems, users can browse COD to find optimization cases for similar architectures, compilers, run-time environments, programs and datasets. Users can reproduce and improve optimization cases, provide notes and rank cases to favor the best performing ones. We plan to eventually automate the ranking of optimization cases based on the statistical collective optimization concept [50].

Collective tuning infrastructure also helps to improve the quality of academic and industrial research on code, compiler and architecture design and optimization by enabling open characterization, unique referencing and fair comparison of empirical optimization experiments. It is thus intended to address one of the drawbacks of academic research, where it is often difficult or impossible to reproduce prior experimental results. We envision that authors will provide experimental data in COD when submitting research papers or after publication to allow verification and comparison with available techniques. Some data can be marked as private and accessible only by reviewers until the paper is published.


Figure 12: Variation of speedups and size of the binary for susan_corners using GCC 4.2.2 on AMD Athlon64 3700+ during automatic iterative feedback-directed compilation performed by the CCC framework over the best available optimization level (-O3), and different profitable optimization cases detected by CCC filter plugins depending on optimization scenarios (similar to a Pareto distribution).

When it is not possible to describe optimization cases using the current COD structure, a user can record information in temporal extension fields using an XML format. If this information is considered important by the community, the COD structure is continuously extended to keep more information in permanent fields. Naturally, we use a top-down approach for COD, first providing capabilities to describe global and coarse-grain program optimizations and then gradually adding fields to describe more fine-grain optimizations.

5.2 Automatic and systematic optimization space exploration and benchmarking

One of the key features of cTuning infrastructure is the possibility to automate program and architecture optimization exploration using empirical feedback-directed compilation and to share profitable optimization cases with the community in COD [9]. This enables faster distributed collective optimization of computing systems and reduces release time for new programs, compilers and architectures considerably.

The CCC framework [7] is used to automate and distribute multi-objective code optimization to improve execution time and code size among other characteristics for a given architecture, using multiple search plugins including exhaustive, random, one-off, hill-climbing and other strategies.
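The uniform random strategy (used, for example, by the ccc-run-glob-flags-rnd-uniform plugin discussed below) can be sketched as follows; the flag list is a small illustrative subset taken from Table 1 (CCC explores more than 100 global GCC flags), and the base level and candidate count are arbitrary.

/* Sketch of uniform random flag selection: each flag is included with 50%
   probability in every candidate combination. */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static const char *flags[] = {
  "-funroll-loops", "-finline-functions", "-fpeephole2",
  "-fschedule-insns", "-fno-tree-lrs", "-ftracer",
};

int main (void)
{
  int n = sizeof (flags) / sizeof (flags[0]);
  srand ((unsigned) time (NULL));

  for (int iter = 0; iter < 5; iter++)       /* 5 random candidates           */
    {
      printf ("-O3");                        /* start from a base level       */
      for (int i = 0; i < n; i++)
        if (rand () % 2)                     /* 50% probability per flag      */
          printf (" %s", flags[i]);
      printf ("\n");
    }
  return 0;
}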

[Figure 13 bar charts: per-benchmark speedups for EEMBC v1.0 and v2.0 (top panel) and for cBench v1.0, SPEC95, 2000 and 2006 (bottom panel).]

Figure 13: Evaluation of mature GCC 4.2.2 using iterative compilation with uniform random distribution (500 iterations) on 2 distinct architectures.

Figure 12 shows the distribution of optimization points in the 2D space of speedups vs. code size for susan_corners from cBench/MiBench [4] on the AMD Athlon64 3700+ architecture with GCC 4.2.2 during automatic program optimization using the CCC ccc-run-glob-flags-rnd-uniform plugin after 500 uniform random combinations of more than 100 global compiler flags (each flag has a 50% probability of being selected for a given combination of optimizations). Naturally, it can be very time consuming and difficult to find good optimization cases manually in such a non-trivial space, particularly during multi-objective optimization. Moreover, the search often depends on the optimization scenario: it is critical to produce the fastest code for high-performance servers and supercomputers, while it can be more important to find a good balance between execution time and code size for embedded systems or adaptive libraries. Hence, we developed several CCC filtering plugins to select optimal optimization cases for a given program, dataset and architecture based on Pareto-like distributions [54, 55] (shown by square dots in Figure 12, for example) before sharing them with the community in COD.
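The kind of filtering performed by such plugins can be sketched as follows; the data layout, the strict-dominance rule and all numeric values are illustrative assumptions (only the flag strings resemble those in Table 1).

/* Pareto-like filtering of optimization cases in the (speedup, binary size)
   plane: a case is kept if no other case is at least as good in both
   objectives and strictly better in one. */

#include <stdio.h>

struct opt_case { const char *flags; double speedup; double size; };

static int dominated (struct opt_case a, struct opt_case b)
{
  /* b dominates a: at least as good in both objectives, better in one. */
  return b.speedup >= a.speedup && b.size <= a.size
         && (b.speedup > a.speedup || b.size < a.size);
}

int main (void)
{
  struct opt_case cases[] = {
    { "-O3",                 1.00, 52000 },
    { "-O3 -funroll-loops",  1.25, 61000 },
    { "-O2 -fno-tree-lrs",   1.10, 47000 },
    { "-O1 -fpeephole2",     0.95, 39000 },
  };
  int n = sizeof (cases) / sizeof (cases[0]);

  for (int i = 0; i < n; i++)
    {
      int keep = 1;
      for (int j = 0; j < n && keep; j++)
        if (j != i && dominated (cases[i], cases[j]))
          keep = 0;
      if (keep)
        printf ("keep: %-22s speedup=%.2f size=%.0f\n",
                cases[i].flags, cases[i].speedup, cases[i].size);
    }
  return 0;
}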


Figure 14: Number of iterations needed to obtain 95% of the available speedup across cBench programs using iterative compilation with uniform random distribution (500 iterations) on 4 distinct architectures.

The problem of finding good combinations of optimizations or tuning default compiler optimization levels becomes worse and more time consuming when adding more transformations, optimizing for multiple datasets, architectures and their configurations, adding more optimization objectives such as reducing power consumption and architecture size, improving fault-tolerance, enabling run-time parallelization and adaptation for heterogeneous and reconfigurable multi-core systems, and so on. Hence, in practice, it is not uncommon that a compiler performs reasonably well only on a limited number of benchmarks and on a small range of relatively recent architectures, but underperforms considerably on older or emerging architectures.

For example, Figure 13 shows the best speedups achieved on a range of popular and realistic benchmarks including EEMBC, SPEC and cBench over the best GCC optimization level (-O3) after 500 iterations using the relatively recent mature GCC 4.2.2 (1.5 years old) and 2 mature architectures: a nearly 4-year-old AMD Athlon64 3700+ and a 2-year-old quad-core Intel Xeon 2800MHz. It clearly demonstrates that it is possible to considerably outperform even mature GCC with the highest default optimization level using random iterative search. Moreover, it also shows that the achievable speedups are architecture dependent and vary considerably for each program, ranging from a few percent to nearly two-times improvements.

Figure 14 shows that it may take around 70 iterations on average before reaching 95% of the speedup available after 500 iterations for the cBench/MiBench benchmarks, and that this number is heavily dependent on programs and architectures. Such a large number of iterations is needed due to the continuously increasing number of aggressive optimizations available in the compiler that can considerably increase or degrade performance or change code size, making it more time consuming and non-trivial to find a profitable combination of optimizations in each given case.

-O1 -falign-loops=10 -fpeephole2 -fschedule-insns -fschedule-insns2 -fno-tree-ccp -fno-tree-dominator-opts -funroll-loops
-O1 -fpeephole2 -fno-rename-registers -ftracer -fno-tree-dominator-opts -fno-tree-loop-optimize -funroll-all-loops
-O2 -finline-functions -fno-tree-dce -fno-tree-loop-im -funroll-all-loops
-O2 -fno-guess-branch-probability -fprefetch-loop-arrays -finline-functions -fno-tree-ter
-O2 -fno-tree-lrs
-O2 -fpeephole -fno-peephole2 -fno-regmove -fno-unswitch-loops
-O3 -finline-limit=1481 -falign-functions=64 -fno-crossjumping -fno-ivopts -fno-tree-dominator-opts -funroll-loops
-O3 -finline-limit=64
-O3 -fno-tree-dominator-opts -funroll-loops
-O3 -frename-registers
-O3 -fsched-stalled-insns=19 -fschedule-insns -funroll-all-loops
-O3 -fschedule-insns -fno-tree-loop-optimize -fno-tree-lrs -fno-tree-ter -funroll-loops
-O3 -funroll-all-loops
-O3 -funroll-loops

Table 1: Some of the profitable combinations of GCC 4.2.2 flags for multiple programs and benchmarks including EEMBC, SPEC and cBench across distinct architectures that improve both execution time and code size.

For example, Table 1 shows such non-trivial combinations of optimizations that improve both execution time and code size, found after uniform random iterative compilation¹ across all benchmarks and architectures for GCC 4.2.2. One may notice that the found combinations of profitable compiler optimizations also often reduce compilation time, since some combinations require only the minimal optimization level -O1 together with several profitable flags. Some combinations can reduce compilation time by 70%, which can be critical when compiling large-scale applications and OS. All this empirical optimization information is now available in COD [9] for further analysis and improvement of compiler design.

¹After empirical iterative compilation with random uniform distribution we obtain profitable combinations of optimizations that may consist of 50 flags on average. However, in practice, only several of these flags influence performance or code size. Hence, we use the CCC ccc-run-glob-flags-one-off-rnd plugin to prune the found combinations and leave only influential optimizations to improve performance analysis and optimization predictions.
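The pruning idea from this footnote can be sketched as follows; the measure() function is only a stand-in for recompiling and rerunning the program, and the flag set and its simulated effects are made up for illustration.

/* One-off pruning sketch: drop each flag of a profitable combination in
   turn and keep it out if the measured result does not degrade. */

#include <stdio.h>

#define N_FLAGS 4

static const char *flags[N_FLAGS] =
  { "-funroll-loops", "-fno-tree-ccp", "-fpeephole2", "-fschedule-insns" };

/* Placeholder "measurement": pretend only -funroll-loops and -fpeephole2
   influence the speedup of this hypothetical program. */
static double measure (const int keep[N_FLAGS])
{
  double speedup = 1.0;
  if (keep[0]) speedup += 0.15;
  if (keep[2]) speedup += 0.05;
  return speedup;
}

int main (void)
{
  int keep[N_FLAGS] = { 1, 1, 1, 1 };
  double best = measure (keep);

  for (int i = 0; i < N_FLAGS; i++)
    {
      keep[i] = 0;                    /* try the combination without flag i   */
      double s = measure (keep);
      if (s < best)
        keep[i] = 1;                  /* influential flag: put it back        */
      else
        best = s;                     /* not influential: leave it out        */
    }

  printf ("influential flags:");
  for (int i = 0; i < N_FLAGS; i++)
    if (keep[i])
      printf (" %s", flags[i]);
  printf ("\n");
  return 0;
}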


We expect that optimization spaces will increase dramatically after we provide support for fine-grain optimization selection and tuning in GCC using ICI [18]. In such a situation, we hope that cTuning technology will considerably simplify and automate the exploration of large optimization spaces with a "one button" approach where a user just controls and balances several optimization criteria.

5.3 MILEPOST GCC and optimization prediction web services based on machine learning

Default optimization levels in compilers are normally aimed at delivering good average performance across several benchmarks and architectures relatively quickly.

Figure 15: Original MILEPOST framework connected with cTuning infrastructure to substitute the default compiler optimization heuristic with an optimization prediction plugin based on machine learning. It includes MILEPOST GCC with the Interactive Compilation Interface (ICI) and program features extractor, and the CCC Framework to train ML models and share optimization data in COD.

However, this may not be good enough or even acceptable for many applications, including performance-critical programs such as real-time video/audio processing systems, for example. That is clearly demonstrated in Figure 13, where we show improvements in execution time of nearly 3 times over the best default optimization level of GCC for multiple popular programs and several architectures using random feedback-directed compilation after 500 iterations.

In [47] we introduced our machine-learning-based research compiler (MILEPOST GCC) and integrated it with the cTuning infrastructure during the MILEPOST project [13] to address the above problem by substituting the default optimization levels of GCC with a predictive optimization heuristic plugin that suggests good optimizations for a given program and a given architecture.

Such a framework functions in two distinct phases, in accordance with typical machine learning practice: training and deployment, as shown in Figure 15.

During the training phase we gather information about the structure of programs (static program features) and record how they behave when compiled under different optimization settings (execution time or other dynamic program features such as hardware counters, for example) using the CCC framework. This information is recorded in COD and used to correlate program features with optimizations, building a machine learning model that predicts a good combination of optimizations.

Figure 16: Using cTuning optimization prediction web services to substitute default compiler optimization levels and predict good optimizations for a given program and a given architecture on the fly based on its structure or dynamic behavior (program static and dynamic features) and continuously retrained predictive models based on the collective optimization concept [50].

In order to train a useful model, a large number of compilations and executions are needed as training examples. These training examples are now continuously gathered from multiple users in COD together with program features extracted by MILEPOST GCC.

Once sufficient training data is gathered, different machine learning models are created (based on probabilistic and decision tree approaches, among others) [47, 33, 37]. We provide such models trained for a given architecture as multiple architecture-dependent optimization prediction plugins for MILEPOST GCC. When encountering a new program, MILEPOST GCC extracts program features and passes them to the ML plugin, which determines which optimizations to apply.
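As a simple illustration of the nearest-neighbour style of prediction offered by the web service described below (ML_MODEL=0 in Figure 17), the following sketch picks the training program whose static feature vector is closest in Euclidean distance to a new program and reuses its best-found flags; the feature dimension, feature values and "training" data are made up, and only the flag strings resemble those in Table 1.

/* Nearest-neighbour optimization prediction sketch over static features. */

#include <stdio.h>

#define N_FEAT  4
#define N_TRAIN 3

struct train_case { double feat[N_FEAT]; const char *best_flags; };

static double dist2 (const double *a, const double *b)
{
  double d = 0.0;
  for (int i = 0; i < N_FEAT; i++)
    d += (a[i] - b[i]) * (a[i] - b[i]);
  return d;                       /* squared Euclidean distance is enough */
}

int main (void)
{
  struct train_case train[N_TRAIN] = {
    { { 12.0, 3.0, 0.8, 40.0 }, "-O3 -funroll-loops" },
    { {  4.0, 9.0, 0.2, 15.0 }, "-O2 -fno-tree-lrs" },
    { { 30.0, 1.0, 0.9, 90.0 }, "-O3 -fno-tree-dominator-opts -funroll-loops" },
  };
  double new_prog[N_FEAT] = { 11.0, 4.0, 0.7, 35.0 };

  int best = 0;
  for (int i = 1; i < N_TRAIN; i++)
    if (dist2 (new_prog, train[i].feat) < dist2 (new_prog, train[best].feat))
      best = i;

  printf ("predicted flags: %s\n", train[best].best_flags);
  return 0;
}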

When more optimization data is available (through collective optimization [50], for example) or when some new transformations are added to a compiler, we need to retrain our models for all architectures and provide new predictive optimization plugins for downloading, thus simplifying and modularizing the compiler itself.

Web service URL: http://ctuning.org/wiki/index.php/Special:CDatabase?request=...

Query fields:
  PLATFORM_ID = CCC framework UUID of a user platform (architecture)
  ENVIRONMENT_ID = CCC framework UUID of a user environment
  COMPILER_ID = CCC framework UUID of a user compiler
  ST_PROG_FEAT = static features of a given program
    or
  DYN_PROG_FEAT = dynamic features of a given program (hardware counters)
  ML_MODEL = 0 - nearest neighbour classifier (selecting optimization cases from the most similar program based on Euclidean distance of features)
  PREDICTION_MODE = predict profitable optimization flags to improve code over the best default optimization level:
    1 - improve both execution time and code size
    2 - only execution time
    3 - only code size

Reply: the most profitable combination of optimizations (currently global flags).

Associated CCC framework tools: a plugin that sends a text query to the web service and returns a string of optimization flags, and a MILEPOST GCC wrapper that detects three flags (-ml, -ml-e, -ml-c), extracts static program features, queries the web service using that plugin and substitutes the default optimization levels of the compiler with the predicted optimizations.

Figure 17: cTuning optimization prediction web service: URL, message format and tools.


Naturally, since all filtered static and dynamic optimization data is now continuously gathered in COD, we can also continuously update optimization prediction models for different architectures at the cTuning website. Hence, we decided to create a continuously updated online optimization prediction web service, as shown in Figure 16. It is possible to submit a query to this web service as shown in Figure 17, providing information about the architecture, environment, compiler and static or dynamic program features, and selecting the required machine learning model and an optimization criterion, i.e. improving either execution time or code size or both over the best default optimization level of a compiler. At the moment, this web service returns the most profitable combination of global compiler optimizations (flags) to improve a given program. This service can be tested online at [10] or using some plugins from the CCC framework to automate prediction.
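For illustration only, a query assembled from the fields listed in Figure 17 might carry values like the following (all IDs and feature values are made-up placeholders; the exact wire format is defined by the CCC plugins):

PLATFORM_ID=1001
ENVIRONMENT_ID=2002
COMPILER_ID=3003
ST_PROG_FEAT=12,3,0.8,40,...
ML_MODEL=0
PREDICTION_MODE=2

The reply is then a single string of global compiler flags, for example "-O3 -funroll-loops", predicted to improve the execution time of the submitted program.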

Such an optimization prediction web service opens up many optimization and research possibilities.

Figure 18: Speedups achieved when using iterative compilation on ARC with a random search strategy (500 iterations; 50% probability to select each optimization) and when predicting the best optimizations using a probabilistic ML model based on program features described in [33].

We plan to test it to improve whole-OS optimization (Gentoo-like Linux, for example), to improve adaptation of downloadable applications for a given architecture (Android and Moblin mobile systems, cloud computing and virtual environments, for example), and for just-in-time optimizations for heterogeneous reconfigurable architectures based on program features and run-time behavior, among others. Of course, we need to minimize Internet traffic and queries. Hence, we will need to develop an adaptive local optimization proxy to keep associations between local program features and optimizations for a given architecture while occasionally updating them using the global cTuning web services. We leave this for future work.

As a practical example of the usage of our service, we trained our prediction model for an ARC725D reconfigurable processor using MILEPOST GCC and cBench with 500 iterations and a 50% probability of selecting each compiler flag for an individual combination of optimizations at each iteration. Figure 18 compares the speedups achieved after training (our execution time upper bound) and after one-shot optimization prediction (as described in detail in [33]). It demonstrates that, except for a few pathological cases, we can automatically improve the original production ARC GCC by around 11% on average using cTuning infrastructure.

6 Conclusions and Future Work

In this paper we presented our long-term collective tuning initiative to automate, distribute, simplify and systematize program optimization, compiler design and architecture tuning using empirical, statistical and machine learning techniques. It is based on sharing of empirical optimization experience from multiple users in the Collective Optimization Database, using common collaborative R&D tools and plugin-enabled production-quality compilers with open APIs, and providing web services to predict profitable optimizations based on program features.

We believe that cTuning technology opens up many research, development and optimization opportunities. It can already help to speed up existing underperforming computing systems ranging from small embedded architectures to high-performance servers automatically. It can be used for more realistic statistical performance analysis and benchmarking of computing systems. It can enable statically-compiled self-tuning adaptive binaries and libraries. Moreover, we believe that the cTuning initiative can improve the quality and reproducibility of academic and industrial IT research. Hence, we decided to move all our developments to the public domain [5, 8] to enable collaborative community-based developments and boost research. We hope that using common cTuning tools and the optimization repository can help to validate research ideas and move them back to the community much faster.

We promote a top-down optimization approach starting from global and coarse-grain optimizations and gradually supporting more fine-grain optimizations, to avoid solving local optimization problems without understanding the global optimization problem first. Within the Google Summer of Code'2009 program [18], we plan to enable automatic and transparent collective program optimization and run-time adaptation based on [50], providing support for fine-grain program optimizations and reordering, generic function cloning and program instrumentation in GCC using ICI. We will also need to provide formal validation of code correctness during transparent collective optimization. We plan to combine the CCC framework with architectural simulators to enable systematic software/hardware co-optimization. We are also extending the UNIDAPT framework to improve automatic profile-driven statistical parallelization and scheduling for heterogeneous multi-core architectures [56, 62, 51, 60] using run-time monitoring of data dependencies and automatic data partitioning and scheduling based on static and dynamic program/dataset features combined with machine learning and statistical techniques.

We are interested in validating our approach in realistic environments and helping to better utilize available computing systems by improving whole-OS optimizations, adapting mobile applications for Android and Moblin on the fly, optimizing programs for grid and cloud computing or virtual environments, etc. Finally, we plan to provide academic research plugins and online services for optimization data analysis and visualization.

To some extent, the cTuning concept is similar to biological self-tuning environments, since all available programs and architectures can be optimized slightly differently, continuously favoring the most profitable optimizations and designs over time. Hence, we would like to use cTuning knowledge to start investigating completely new programming paradigms and architectural designs to enable the development of future self-tuning and self-organizing computing systems.

7 Acknowledgments

cTuning development has been partially supported by the MILEPOST project [13], the HiPEAC network of excellence [14] and the Google Summer of Code'2009 program [18]. I would like to thank Olivier Temam for interesting discussions about the Collective Optimization concept and Michael O'Boyle for machine learning discussions; Zbigniew Chamski for his help with the development of the new ICI; Mircea Namolaru for the development of the static program feature extractor for GCC with ICI; Erven Rohou for his help with cBench developments; and Yang Chen, Liang Peng, Yuanjie Huang and Chengyong Wu for cTools discussions, feedback and help with UNIDAPT framework extensions. I would also like to thank Abdul Wahid Memon and Menjato Rakototsimba for thorough testing of the Continuous Collective Compilation Framework and the Collective Optimization Database. Finally, I would like to thank the cTuning and GCC communities [8, 17] for very useful feedback, discussions and help with software developments.

References

[1] ACOVEA: Using Natural Selectionto Investigate Software Complexities.http://www.coyotegulch.com/products/acovea.

[2] CLI Back-End and Front-End forGCC. http://gcc.gnu.org/projects/cli.html.

[3] Collaborative R&D tools (GCC with ICI,CCC, cBench and UNIDAPT frameworksto enable self-tuning computing systems.http://ctuning.org/ctools.

[4] Collective Benchmark: collectionof open-source programs and mul-tiple datasets from the community.http://ctuning.org/cbench.

[5] Collective Tuning Center: automating andsystematizing the design and optimiza-tion of computing systems. http://ctuning.org.

[6] Comparison of ICI and other pro-posed plugin frameworks for GCC.http://ctuning.org/wiki/index.php/CTools:ICI:GCC\_Info:API\_Comparison,http://gcc.gnu.org/wiki/GCC\_PluginComparison.

[7] Continuous Collective Compilation Framework to automate and distribute program optimization, compiler design and architecture tuning. http://ctuning.org/ccc.

[8] cTuning community mailing lists. http://ctuning.org/community.

[9] cTuning optimization repository (Collective Optimization Database). http://ctuning.org/cdatabase.

[10] Demo/testing of cTuning online optimization prediction web services. http://ctuning.org/cpredict.

[11] Edinburgh Optimizing Software (EOS) to enable fine-grain source-to-source program iterative optimizations and performance prediction. http://fursin.net/wiki/index.php5?title=Research:Developments:EOS.

[12] ESTO: Expert System for Tuning Optimizations. http://www.haifa.ibm.com/projects/systems/cot/esto/index.html.

[13] EU MILEPOST project (MachIne Learning for Embedded PrOgramS opTimization). http://www.milepost.eu.

[14] European Network of Excellence on High-Performance Embedded Architecture and Compilation (HiPEAC). http://www.hipeac.net.

[15] Framework for Continuous Optimizations (FCO) to enable systematic optimization space exploration using Open64 with Interactive Compilation Interface (ICI). http://fursin.net/wiki/index.php5?title=Research:Developments:FCO.

[16] GCC plugin framework. http://gcc.gnu.org/wiki/GCC_Plugins, http://gcc.gnu.org/onlinedocs/gccint/Plugins.html.

[17] GCC: the GNU Compiler Collection. http://gcc.gnu.org.

[18] Google Summer of Code’2009 projects to extend CCC, GCC with ICI and UNIDAPT frameworks. http://socghop.appspot.com/org/home/google/gsoc2009/gcc.

[19] Grid5000: Computing infrastructure distributed in 9 sites around France, for research in large-scale parallel and distributed systems. http://www.grid5000.fr.

[20] HMPP: A Hybrid Multi-core Parallel Programming Environment, CAPS Entreprise. http://www.caps-entreprise.com/hmpp.html.

[21] Interactive Compilation Interface: high-level event-driven plugin framework to open up production compilers and control their internal decisions using dynamically loaded user plugins. http://ctuning.org/ici.

[22] Interactive Compilation Interface wiki-based documentation. http://ctuning.org/wiki/index.php/CTools:ICI:Documentation.

[23] OProfile: A continuous system-wide profiler for Linux. http://oprofile.sourceforge.net/.

[24] PAPI: A Portable Interface to Hardware Performance Counters. http://icl.cs.utk.edu/papi.

[25] PapiEx: Performance analysis tool designed to transparently and passively measure the hardware performance counters of an application using PAPI. http://icl.cs.utk.edu/~mucci/papiex.

[26] PathScale EKOPath Compilers. http://www.pathscale.com.

[27] ROSE source-to-source compiler framework. http://www.rosecompiler.org.

[28] SUIF source-to-source compiler system. http://suif.stanford.edu/suif.

[29] Synchronization of high-level ICI with low-level GCC plugin framework. http://gcc.gnu.org/ml/gcc-patches/2009-02/msg01242.html.

[30] Universal adaptation framework to enable dynamic optimization and adaptation for statically-compiled programs. http://ctuning.org/unidapt.

[31] B. Aarts, M. Barreteau, F. Bodin, P. Brinkhaus, Z. Chamski, H.-P. Charles, C. Eisenbeis, J. Gurd, J. Hoogerbrugge, P. Hu, W. Jalby, P. Knijnenburg, M. O’Boyle, E. Rohou, R. Sakellariou, H. Schepers, A. Seznec, E. Stöhr, M. Verhoeven, and H. Wijshoff. OCEANS: Optimizing compilers for embedded applications. In Proc. Euro-Par 97, volume 1300 of Lecture Notes in Computer Science, pages 1351–1356, 1997.

[32] J. Abella, S. Touati, A. Anderson, C. Ciuraneta, J. C. M. Dai, C. Eisenbeis, G. Fursin, A. Gonzalez, J. Llosa, M. O’Boyle, A. Randrianatoavina, J. Sanchez, O. Temam, X. Vera, and G. Watts. The MHAOTEU toolset for memory hierarchy management. In 16th IMACS World Congress on Scientific Computation, Applied Mathematics and Simulation, 2000.

[33] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. O’Boyle, J. Thomson, M. Toussaint, and C. Williams. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2006.

[34] F. Bodin, T. Kisuki, P. Knijnenburg, M. O’Boyle, and E. Rohou. Iterative compilation in a non-linear optimisation space. In Proceedings of the Workshop on Profile and Feedback Directed Compilation, 1998.

[35] M. Byler, J. R. B. Davies, C. Huson, B. Leasure, and M. Wolfe. Multiple version loops. In Proceedings of the International Conference on Parallel Processing, pages 312–318, 1987.

[36] J. Cavazos, C. Dubach, F. Agakov, E. Bonilla, M. O’Boyle, G. Fursin, and O. Temam. Automatic performance model construction for the fast software exploration of new hardware designs. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), October 2006.

[37] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. O’Boyle, and O. Temam. Rapidly selecting good compiler optimizations using performance counters. In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO), March 2007.

[38] K. Cooper, P. Schielke, and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 1–9, 1999.

[39] K. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. Journal of Supercomputing, 23(1), 2002.

[40] P. C. Diniz and M. C. Rinard. Dynamic feedback: An effective technique for adaptive computing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 71–84, 1997.

[41] B. Franke, M. O’Boyle, J. Thomson, and G. Fursin. Probabilistic source-level optimisation of embedded programs. In Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005.

[42] G. Fursin. Iterative Compilation and Performance Prediction for Numerical Applications. PhD thesis, University of Edinburgh, United Kingdom, 2004.

[43] G. Fursin, J. Cavazos, M. O’Boyle, and O. Temam. MiDataSets: Creating the conditions for a more realistic evaluation of iterative optimization. In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC), January 2007.

[44] G. Fursin and A. Cohen. Building a practical iterative interactive compiler. In 1st Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART’07), colocated with HiPEAC 2007 conference, January 2007.

[45] G. Fursin, A. Cohen, M. O’Boyle, and O. Temam. A practical method for quickly evaluating program optimizations. In Proceedings of the 1st International Conference on High Performance Embedded Architectures & Compilers (HiPEAC), number 3793 in LNCS, pages 29–46. Springer Verlag, November 2005.

[46] G. Fursin, C. Miranda, S. Pop, A. Cohen, and O. Temam. Practical run-time adaptation with procedure cloning to enable continuous collective compilation. In GCC Developers’ Summit, July 2007.

[47] G. Fursin, C. Miranda, O. Temam, M. Namolaru, E. Yom-Tov, A. Zaks, B. Mendelson, P. Barnard, E. Ashton, E. Courtois, F. Bodin, E. Bonilla, J. Thomson, H. Leather, C. Williams, and M. O’Boyle. MILEPOST GCC: machine learning based research compiler. In Proceedings of the GCC Developers’ Summit, June 2008.

[48] G. Fursin, M. O’Boyle, and P. Knijnenburg. Evaluating iterative compilation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computers (LCPC), pages 305–315, 2002.

[49] G. Fursin, M. O’Boyle, O. Temam, and G. Watts. Fast and accurate method for determining a lower bound on execution time. Concurrency: Practice and Experience, 16(2-3):271–292, 2004.

[50] G. Fursin and O. Temam. Collective optimization. In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), January 2009.

[51] G. Tournavitis, Z. Wang, B. Franke, and M. O’Boyle. Towards a holistic approach to auto-parallelization: Integrating profile-driven parallelism detection and machine-learning based mapping. In Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation (PLDI’09), 2009.

[52] T. Glek and D. Mandelin. Using GCC instead of grep and sed. In Proceedings of the GCC Developers’ Summit, June 2008.

[53] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.

[54] K. Heydemann and F. Bodin. Iterative compilation for two antagonistic criteria: Application to code size and performance. In Proceedings of the 4th Workshop on Optimizations for DSP and Embedded Systems, colocated with CGO, 2006.

[55] K. Hoste and L. Eeckhout. COLE: Compiler optimization level exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2008.

[56] V. Jimenez, I. Gelado, L. Vilanova, M. Gil, G. Fursin, and N. Navarro. Predictive runtime code scheduling for heterogeneous architectures. In Proceedings of the International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2009), January 2009.

[57] P. Kulkarni, W. Zhao, H. Moon, K. Cho, D. Whalley, J. Davidson, M. Bailey, Y. Paek, and K. Gallivan. Finding effective optimization phase sequences. In Proc. Languages, Compilers, and Tools for Embedded Systems (LCTES), pages 12–23, 2003.

[58] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO), Palo Alto, California, March 2004.

[59] J. Lau, M. Arnold, M. Hind, and B. Calder. Online performance auditing: Using hot optimizations without getting burned. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2006.

[60] S. Long, G. Fursin, and B. Franke. A cost-aware parallel workload allocation approach based on machine learning techniques. In Proceedings of the IFIP International Conference on Network and Parallel Computing (NPC 2007), number 4672 in LNCS, pages 506–515. Springer Verlag, September 2007.

[61] J. Lu, H. Chen, P.-C. Yew, and W.-C. Hsu. Design and implementation of a lightweight dynamic optimization system. Journal of Instruction-Level Parallelism, volume 6, 2004.

[62] L. Luo, Y. Chen, C. Wu, S. Long, and G. Fursin. Finding representative sets of optimizations for adaptive multiversioning applications. In 3rd Workshop on Statistical and Machine Learning Approaches Applied to Architectures and Compilation (SMART’09), colocated with HiPEAC’09 conference, January 2009.

[63] J. Mars and R. Hundt. Scenario based optimization: A framework for statically enabling online optimizations. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), 2009.

[64] M. Frigo and S. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1381–1384, Seattle, WA, May 1998.

[65] A. Monsifrot, F. Bodin, and R. Quiniou. A machine learning approach to automatic production of compiler heuristics. In Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, Applications, LNCS 2443, pages 41–50, 2002.

[66] M. O’Boyle. MARS: a distributed memory approach to shared memory compilation. In Proceedings of the Workshop on Language, Compilers and Runtime Systems for Scalable Computing, 1998.

[67] Z. Pan and R. Eigenmann. Fast and effective orchestration of compiler optimizations for automatic performance tuning. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 319–332, 2006.

[68] S. Callanan, D. J. Dean, and E. Zadok. Extending GCC with modular GIMPLE optimizations. In GCC Developers’ Summit, July 2007.

[69] B. Singer and M. Veloso. Learning to predict performance from formula modeling and training data. In Proceedings of the Conference on Machine Learning, 2000.

[70] B. Starynkevitch. Multi-stage construction of a global static analyser. In GCC Developers’ Summit, July 2007.

[71] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). IEEE Computer Society, 2005.

[72] M. Stephenson, M. Martin, and U. O’Reilly. Meta optimization: Improving compiler heuristics with machine learning. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 77–90, 2003.

[73] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. August. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 204–215, 2003.

[74] M. Voss and R. Eigenmann. ADAPT: Automated de-coupled adaptive program transformation. In Proceedings of the International Conference on Parallel Processing (ICPP), 2000.

[75] M. J. Voss and R. Eigenmann. High-level adaptive program optimization with ADAPT. In Proceedings of the eighth ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming (PPoPP), pages 93–102, 2001.

[76] R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the Conference on High Performance Networking and Computing, 1998.

[77] M. Zhao, B. R. Childers, and M. L. Soffa. A model-based framework: an approach for profit-driven optimization. In Proceedings of the International Conference on Code Generation and Optimization (CGO), pages 317–327, 2005.
