

Automatic Calibration of Micro-Architecture Description Models

†Zhuoran Zhao, ‡Gary Morrison, †Andreas Gerstlauer

†Electrical and Computer Engineering Department, The University of Texas at Austin
‡Freescale Semiconductor Inc.

†{zhuoran,gerstl}@utexas.edu, ‡[email protected]

ABSTRACT

Cycle-accurate platform simulators are key components of virtual prototyping for system level design. High-level architecture description language (ADL) based instruction-set simulation (ISS) models are often used to verify functionality and estimate system performance during early design stages. However, development of such abstract models to faithfully cover every micro-architecture detail is a tedious and error-prone process, often resulting in performance mismatches between the cycle-accurate model and RTL. Thus, time-consuming manual calibrations of high-level architecture models against existing RTL code across a large set of benchmarks are typically needed.

To address this problem, we propose an automated framework for calibration of architecture models against RTL to automatically discover and generate accurate platform simulators for rapid early exploration and development. Our framework is based on open-sourced ADL environments from industry, making use of open-source evolutionary algorithm (EA) libraries to tune pre-defined parameters in a given template and automatically derive accurate architecture models for a given RTL processor. During the evaluation stage of the calibration, multi-tiered fitness assessment is used to reduce the calibration time.

We verify our flow by calibrating parameterized processor model templates against RTL reference traces. Results show that our framework can successfully generate architecture models in an average of 8.1 hours with more than 98% timing accuracy.

1. INTRODUCTION

Virtual prototyping is widely used in embedded system design methodologies for quick functional validation and performance estimation at early design stages. Typical virtual prototypes incorporate cycle-accurate processor simulators to get performance feedback and direct early exploration and development. Cycle-accurate instruction-set simulation (ISS) models automatically derived from high-level, easily retargetable architecture description languages (ADLs) are often used in this context. Although ADL-based flows already provide useful abstractions to quickly capture processor models, development of such models to faithfully cover every micro-architecture detail is still a tedious and error-prone process. Performance mismatches between the cycle-accurate ISS model and RTL are very common. To resolve such mismatches, manual calibrations of high-level architecture models against existing RTL code across a large set of benchmarks are typically needed, which is usually time-consuming and inflexible.

[Figure omitted: block diagram in which a Tier file and a uapl file feed the EagaCal tool, which generates a knobs.h header; uadl2model compiles the parameterized uadl file into an ISS model, whose tiered test runs (Tier 0, Tier 1, ...) are compared against a reference timing model.]

Figure 1: Calibration infrastructure overview.

To address this problem, we propose an automated framework for calibration of architecture models against RTL. Our framework automatically discovers and generates accurate platform simulators for rapid early exploration and development. In this report, we document our proposed framework for automated model calibration. Our framework is based on Freescale's uADL (micro-architecture description language) modeling infrastructure [1] and makes use of open-source evolutionary optimization libraries to tune pre-defined parameters in a given template and automatically derive accurate architecture models for a given RTL processor. We have applied our framework to automatically calibrate parameterized template models against PowerPC-based reference models. In all cases, our framework was able to discover the correct parameter set.

2. INFRASTRUCTURE OVERVIEW

Figure 1 shows an overview of our calibration infrastructure. The first step of the calibration is parameterizing the existing uADL models with possible micro-architectural parameters. Most parameterization is accomplished through C preprocessor text substitution performed by the uADL compiler.

In order to specify the list of valid parameters and the corresponding ranges, we define a micro-architecture parameterization language (uapl) file. To evaluate each genome, i.e. a uADL instantiation with a valid set of parameters, we introduce a multiple-tier benchmark hierarchy, which is specified in a Tier file. The syntax rules of our uapl and Tier files are fed into Lex and Yacc [2] to generate the corresponding parsers for our calibration tool. Then, the uapl and Tier parsers, along with the calibration source code, which mainly includes invocation commands for the calibration components, and the evolutionary library are linked and compiled to generate our executable calibration tool called EagaCal.

During the start-up of EagaCal, the calibration tool will read the uapl file to build up the encoding of parameters


into a genome and the initial population for the evolutionary algorithm. Furthermore, the Tier file is read to specify the benchmark sets for fitness assessment. Once the input files have been read, an iterative exploration is performed across the given set of benchmarks until a pre-defined stop metric is met, such as reaching a maximum number of generations or hitting an accuracy threshold.
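The start-up step described above can be sketched as follows. This is an illustrative Python sketch (the actual tool is C++-based and parses the uapl file at start-up), and the parameter names and value sets here are hypothetical stand-ins:

```python
import random

# Hypothetical knob value sets, standing in for the ranges parsed
# from the uapl file (the real tool reads these at start-up).
PARAM_RANGES = {
    "PREFETCH_BUFFER_SIZE": [5, 6, 7],
    "MEM_READ_LATENCY": [1, 2, 3, 4],
    "DIV_BY_ZERO_SPECIAL": [0, 1],
}

def random_genome(ranges, rng):
    """Encode one knob choice per parameter as an index into its value set."""
    return [rng.randrange(len(values)) for values in ranges.values()]

def initial_population(ranges, size, rng):
    """Randomly generated first generation for the evolutionary algorithm."""
    return [random_genome(ranges, rng) for _ in range(size)]

rng = random.Random(0)
population = initial_population(PARAM_RANGES, 20, rng)
assert len(population) == 20
assert all(len(g) == len(PARAM_RANGES) for g in population)
```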

During the exploration, each genome is translated into an automatically generated knobs file, which is in turn used to compile the corresponding uADL model into an ISS executable using the existing uadl2model command. Finally, the ISS's performance accuracy will be evaluated on a selected subset of a given set of benchmarks. These benchmarks must previously have been run on the reference model (usually RTL) to generate a set of reference timing traces, which is a one-time effort, and clustered into multiple tiers to be dynamically added into the evaluation benchmark set. During calibration, the timing result of each test case running on the generated ISS model will be compared against the corresponding timing reference to get the performance accuracy.

Finally, the average accuracy of each benchmark weighted by the instruction count will be assigned as the fitness value of the corresponding genome and further fed back into the evolutionary algorithm.

3. UADL-MODEL PARAMETERIZATION

To perform the calibration, uADL models need to be made explorable and tunable through parameterization. Most parameterization can be accomplished through C-preprocessor text substitution. Parameterized uADL models use #defined macros like any other program. The value assigned to each macro can be defined in a separate header file, which will be included by the uADL model and externally modified by the evolutionary algorithm.

(a) uapl file:

    gen_file = "Knobs.h"
    objective PIPELINE;
    objective DIVIDE;
    objective LOADS;
    objective BRANCHES;
    param PREFETCH_BUFFER_SIZE = {
        range = (5-7);
        mean = initial = 6;
        relevancy (PIPELINE) = 1;
    };
    ...

(b) uadl file:

    ...
    define (resources) {
        define (fetchunit Fetcher) {
            fetch_memory = Mem;
            entries = PREFETCH_BUFFER_SIZE;
            ...
        }
    ...

Figure 2: uADL parameterization.

To further specify the complete list of parameters and their allowable value ranges (knobs), micro-architecture parameterization language (uapl) files are introduced into our calibration infrastructure. Figure 2 shows abbreviated examples of a code snippet of a parameterized uADL model (Figure 2(b)) and its corresponding uapl file (Figure 2(a)). In the original uADL file, the number of entries in the fetch unit can be an integer number in the range of 5-7. In the parameterized uADL model in Figure 2(b), the integer number is replaced with the label PREFETCH_BUFFER_SIZE, and the corresponding range of this label is then specified in the uapl file in Figure 2(a). Its actual value is later defined

[Figure omitted: the evolutionary loop. An Evolutionary Algorithm block performs selection, crossover and mutation on a population of integer genomes; each genome is written out as a parameters (knobs) header file, compiled with the parameterized uADL file by the uADL/ADL compiler into a cycle-accurate simulator, and the resulting simulated timing traces are compared against reference timing traces from the reference RTL over tiered test cases (Tier 0, Tier 1, ...) to produce accuracy metrics for fitness assessment.]

Figure 3: Genetic Algorithm calibration framework.

in a separate header file generated during the calibration process according to the specifications from the uapl file.

Besides the example shown in Figure 2, more advanced parameterizations such as pipeline configuration are also employed in our infrastructure.
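As an illustration of the header-generation step, a genome of value-set indices might be translated into knobs-header text roughly as follows (Python sketch; the parameter names and value sets are hypothetical stand-ins, not the actual knob set):

```python
# Hypothetical knob value sets; the real set comes from the uapl file.
PARAM_RANGES = {
    "PREFETCH_BUFFER_SIZE": [5, 6, 7],
    "MEM_READ_LATENCY": [1, 2, 3, 4],
}

def genome_to_header(genome, ranges):
    """Turn a genome of value-set indices into knobs-header text."""
    lines = ["/* generated by the calibration tool */"]
    for gene, (name, values) in zip(genome, ranges.items()):
        lines.append(f"#define {name} {values[gene]}")
    return "\n".join(lines) + "\n"

header = genome_to_header([1, 3], PARAM_RANGES)
assert "#define PREFETCH_BUFFER_SIZE 6" in header
assert "#define MEM_READ_LATENCY 4" in header
```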

4. EVOLUTIONARY CALIBRATION

An evolutionary algorithm (EA) is a generic population-based metaheuristic optimization algorithm [3]. In our case, we use a Genetic Algorithm (GA), a subclass of EA, to construct our calibration infrastructure. GAs typically consist of selection, genetic operators and termination. In this section, the corresponding implementations and customizations in our calibration framework will be illustrated.

The overall setup of our evolutionary calibration framework is shown in Figure 3. As the value of each knob in the calibration is selected from a set with finite discrete elements, we encode the choices of knobs into integer genomes. Before the selection stage, each integer genome in the current population is translated into header files with micro-architecture parameter values. The header files are then compiled with the parameterized uADL file to generate the simulator instances, which are run through multiple tiers of benchmarks to obtain the performance differences in comparison to the reference traces. The average accuracy of each test case weighted by its total instruction count is assigned as the fitness value of each genome. After this, tournament selection is applied in the GA framework to generate the next evolutionary generation from the current one.
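Tournament selection can be sketched as follows (an illustrative Python sketch, not the EO implementation used in the framework):

```python
import random

def tournament_select(population, fitness, k, rng):
    """Return the fittest of k individuals drawn at random (with replacement)."""
    contenders = [rng.randrange(len(population)) for _ in range(k)]
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]

def next_generation(population, fitness, k, rng):
    """Fill the next generation by repeated tournaments."""
    return [tournament_select(population, fitness, k, rng)
            for _ in range(len(population))]

rng = random.Random(1)
pop = [[0], [1], [2], [3]]
fit = [-4.0, -3.0, -1.0, -2.0]   # fitness is a negative error; higher is better
new_pop = next_generation(pop, fit, k=2, rng=rng)
assert len(new_pop) == len(pop)
assert all(ind in pop for ind in new_pop)
```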

After the selection stage, a uniform crossover operator is used to randomly choose the crossover points among parent genome pairs and generate offspring genomes, followed by a mutation stage to randomly change each gene within a valid range (as shown in the Evolutionary Algorithm block in Figure 3).
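The two genetic operators can be sketched like this (Python sketch under the integer-genome encoding described above; the rates and ranges are placeholders):

```python
import random

def uniform_crossover(a, b, rng):
    """Swap each gene between the two parents with probability 0.5."""
    child1, child2 = a[:], b[:]
    for i in range(len(a)):
        if rng.random() < 0.5:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

def mutate(genome, n_choices, rate, rng):
    """Re-draw each gene from its valid index range with the given rate."""
    return [rng.randrange(n_choices[i]) if rng.random() < rate else g
            for i, g in enumerate(genome)]

rng = random.Random(2)
c1, c2 = uniform_crossover([0, 0, 0], [1, 1, 1], rng)
# Each position keeps exactly one 0 and one 1 across the two children.
assert all(c1[i] + c2[i] == 1 for i in range(3))
m = mutate([0, 1, 2], n_choices=[3, 3, 3], rate=0.4, rng=rng)
assert all(0 <= g < 3 for g in m)
```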

The calibration framework will evolve the population until it reaches a convergence point. In our case, we identify convergence by the improvement of the best individual. If the best fitness value shows no improvement for a threshold number of consecutive generations, the algorithm will be terminated and the results will be reported.
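The termination test can be sketched as follows (illustrative Python; `best_history` is assumed to hold the best fitness value of each generation so far):

```python
def converged(best_history, threshold):
    """True if the best fitness has not improved for `threshold`
    consecutive generations (the stop criterion described above)."""
    if len(best_history) <= threshold:
        return False
    recent = best_history[-(threshold + 1):]
    # No value in the last `threshold` generations beats the one before them.
    return max(recent[1:]) <= recent[0]

assert not converged([-5.0, -4.0, -4.0], threshold=4)
assert converged([-5.0, -4.0, -4.0, -4.0, -4.0, -4.0], threshold=4)
assert not converged([-5.0, -4.0, -4.0, -4.0, -3.0, -3.0], threshold=4)
```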

In the entire calibration framework, the fitness assessment stage is the most time-consuming process. To deal with this problem, we further utilize parallelized compilation and multi-tiered fitness assessment to reduce the calibration


[Figure omitted: the benchmark hierarchy. Tier 0-1: simple set of ALU instructions; Tier 2: mixed pairs of ALU and branch instructions; Tier 3: simple set of memory instructions; Tier 4: mixed pairs of ALU, branch and memory instructions; Tier 5: industry-strength benchmarks. For the lower tiers, the absolute difference in commit time is accumulated for each instruction between the evaluated and reference timing traces; for Tier 5, start-to-end execution time is recorded to characterize the cumulative timing effect.]

Figure 4: Multi-tiered benchmark hierarchy.

time, as will be described in more detail in the following sections.

4.1 Parallelized compilation and simulation

Although the evolutionary algorithm has inter-generation dependencies, intra-generation parallelism can be exploited to reduce the evaluation time. In our calibration framework, the compilations and simulations for different genomes are executed in parallel to reduce the evaluation time for one generation. The user can specify the parallelism window size and even migrate the compilation tasks among several machines with the interface provided by our framework.
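The windowed parallel evaluation might look roughly like this (Python sketch using a thread pool as a stand-in for the framework's job dispatch; the `evaluate` function here is a dummy placeholder for compiling and simulating one genome):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def evaluate(genome):
    """Stand-in for compiling a uADL instance and running its benchmarks."""
    time.sleep(0.01)               # pretend compile+simulate cost
    return -float(sum(genome))     # dummy fitness

def evaluate_generation(population, window_size):
    # The parallelism window bounds how many compile/simulate jobs
    # run concurrently, mirroring the user-set window size.
    with ThreadPoolExecutor(max_workers=window_size) as pool:
        return list(pool.map(evaluate, population))

pop = [[0, 1], [2, 3], [1, 1], [0, 0]]
fits = evaluate_generation(pop, window_size=4)
assert fits == [-1.0, -5.0, -2.0, 0.0]
```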

4.2 Multi-tiered fitness assessment

In the fitness assessment stage, we set up a multi-tier benchmark hierarchy, shown in Figure 4, to evaluate each considered uADL instantiation. The first five tiers consist of bottom-up, directed-random timing tests in which the evaluation accumulates absolute differences in issue and commit times for each instruction between uADL and RTL, or between uADL and a reference uADL. To be more specific, Tier 0 and Tier 1 contain basic tests of the most fundamental individual ALU instruction classes, and Tier 2 contains combinations of those instruction classes, followed by Tier 3 with simple memory instruction sets and Tier 4 with mixed pairs of ALU and memory instructions. The last tier is introduced for the purpose of top-down, industry-benchmark-based regression using compiled C code that generally represents realistic customer applications. Different from the first five tiers, the uADL instantiation is here tested in terms of start-to-end time to characterize the total, cumulative timing effect of the entire model as a whole. By contrast, the goal of bottom-up testing in the lower tiers is to study in detail the instruction-by-instruction timing for different instruction classes. In both cases, cycle count differences are divided by the total cycle counts to compute the final relative error metric (in percent) of each test case.
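A minimal sketch of the lower-tier error metric, under the assumption that each trace is a list of per-instruction commit cycles and that the accumulated difference is normalized by the reference total cycle count (an interpretation; the exact normalization in the tool may differ):

```python
def commit_time_error(ref_commits, sim_commits):
    """Accumulate per-instruction absolute commit-time differences and
    normalize by the reference total cycle count (percent error)."""
    diff = sum(abs(r - s) for r, s in zip(ref_commits, sim_commits))
    total_cycles = ref_commits[-1]
    return 100.0 * diff / total_cycles

# Hypothetical commit cycles for a 4-instruction test case.
ref = [3, 5, 8, 12]
sim = [3, 6, 8, 13]
assert commit_time_error(ref, sim) == 100.0 * 2 / 12
```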

The motivation for such a benchmark hierarchy is to calibrate the target model over different aspects while being able to quickly prune the design space using multi-tier evaluation. Moreover, dynamically adding tiers to the calibration framework also avoids running too many benchmarks during the early generations, which largely reduces the evaluation time during calibration.

This idea is shown in Figure 3 and Figure 4.

Table 1: Summary of uADL Parameters

Objectives         Parameters
Pipeline           Prefetch buffer size; whether a forward execution
                   path exists for ALU, compare, load, multiplication
                   and divide instructions;
Memory Operations  Memory address bandwidth; memory read latency;
                   memory write latency;
Multiplication     Latency of multiplication with 1-8, 9-16, 17-24
                   and 25-32 leading zeros;
Divide             Whether divide-by-0 is treated as a special case;

Table 2: Calibration Algorithm Parameters

Population Size              20
Crossover Rate               0.5
Mutation Rate                0.4
Parallelism Window Size      4
Population Steady Threshold  4
Maximum Generations          80

The algorithm will evaluate the population starting with Tier 0. When the best-fitness individual has not improved for a threshold number of consecutive generations, Tier 1 is added into the benchmark set and the whole population is re-evaluated with Tier 0 and Tier 1. The remaining tiers are added to the benchmark set in the same way, and the algorithm terminates when no improvement occurs once all the tiers have been added.
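The tier-adding control loop just described can be sketched as follows (illustrative Python; `evaluate` and `improve_step` are placeholders for the fitness assessment and the GA selection/crossover/mutation machinery):

```python
def calibrate(population, evaluate, improve_step, num_tiers, steady_threshold):
    """Evaluate on a growing set of tiers: add the next tier whenever the
    best fitness stalls for `steady_threshold` generations; stop when it
    stalls with all tiers included."""
    active_tiers, stall, best = 1, 0, float("-inf")
    while True:
        fits = evaluate(population, active_tiers)
        if max(fits) > best:
            best, stall = max(fits), 0
        else:
            stall += 1
        if stall >= steady_threshold:
            if active_tiers == num_tiers:
                return population, best
            active_tiers += 1          # re-evaluate with the new tier added
            stall, best = 0, float("-inf")
        population = improve_step(population, fits)

# Dummy run: fitness only depends on the number of active tiers, so the
# loop stalls immediately at each tier level and then terminates.
pop0 = [[0]] * 4
_, final_best = calibrate(pop0, lambda p, t: [-float(t)] * len(p),
                          lambda p, f: p, num_tiers=2, steady_threshold=2)
assert final_best == -2.0
```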

For fitness assessment, the corresponding simulator instance of a genome will go through all the test cases from the tiers included in the current benchmark set, and the negative average running error percentage of all test cases weighted by total instruction counts will be calculated and assigned as the genome's individual fitness.

After running a genome with n test cases, we compute the final fitness as the negative weighted average of error percentages (E), where each test case is weighted by its instruction count (W) specified in the tier file:

    fitness = -(W1*E1 + W2*E2 + ... + Wn*En) / (W1 + W2 + ... + Wn)
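Spelled out in code, the formula above is simply (Python sketch):

```python
def fitness(errors, weights):
    """Negative instruction-count-weighted average of per-test error
    percentages, matching the fitness formula above."""
    assert len(errors) == len(weights)
    return -sum(w * e for w, e in zip(weights, errors)) / sum(weights)

# Two hypothetical test cases: 2% error weighted 675, 4% error weighted 325.
assert fitness([2.0, 4.0], [675, 325]) == -(675 * 2.0 + 325 * 4.0) / 1000
```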

Currently, we are using a single-objective algorithm to verify the calibration performance against reference RTL traces. A multi-objective approach may prove useful if calibration of more advanced models introduces additional objectives, but this is a subject for future work.

5. EXPERIMENTS AND RESULTS

We use Evolving Objects (EO) [4], a template-based, ANSI-C++ evolutionary computation library, to implement our calibration framework. Calibrations were performed on a quad-core Intel i7 workstation running at 2.6 GHz, on which we applied a 4-way parallelization of the compilation process. We applied our framework to calibrate an artificial parameterized model against reference RTL traces. Five tiers of benchmarks were used without applying any industry-strength top-down benchmarks in Tier 5.

A summary of the tunable uADL parameters in our experiments is shown in Table 1. During the calibration, individuals in the initial generation are randomly generated from their corresponding value sets, and further evolve within the valid range.

The calibration algorithm parameters used in our experiments are shown in Table 2.

[Figure omitted: average weighted error (log scale, 0.10%-100.00%) of the best individual over generations 1-39 for the tiered and non-tiered runs, with "Add tier" markers.]

Figure 5: Fitness by generations.

[Figure omitted: accumulated runtime (0-1000 min) over generations 1-47 for the tiered and non-tiered runs.]

Figure 6: Accumulated execution time by generations.

Table 3: Calibration Runtime Statistics

                  Tiered    Non-tiered  Improvement
Avg. runtime      08:21:43  16:10:38    48.3%
Max. runtime      08:24:14  16:21:21    48.6%
Avg. generations  48        48          0%
Max. generations  48        48          0%

Note that the Population Steady Threshold is used to decide whether a new tier needs to be added to the benchmark set. In our case, if no improvement of the best fitness occurs for 4 consecutive generations, then a new tier will be added. We compared our multi-tiered fitness assessment with the case where all the test cases are organized into one single tier, and recorded statistics of the algorithm's behavior over the evolved generations. For the multi-tiered and non-tiered algorithms, we collected results for 3 independent runs each.

Figure 5 and Figure 6 show a representative calibration progress of applying our calibration framework to our example model both with and without multi-tiered fitness assessment, represented as "Tiered" and "Non-tiered", respectively. Figure 5 shows the absolute value of the fitness of the best individual recorded in each generation. Tiers are added right after generations 15, 21, 28 and 32. From the slope of the curve, we can see that new tiers are only added once the population has reached a steady state and cannot be improved anymore using the current benchmark set. The sudden decrease in fitness after adding a new tier is due to re-evaluation based on the new benchmark set.

Figure 6 shows the accumulated runtime of our framework. Since the multi-tiered fitness assessment starts with a single tier and only adds new tiers if necessary, the slope of the accumulated runtime curve starts with a small rate and only increases when a new tier is introduced, resulting in a much smaller average evaluation time across all tiers. Thus, although the two approaches will give the same results after a similar number of generations, the multi-tiered fitness assessment can significantly reduce the total calibration time.

Overall, results averaged over all 3 runs for each algorithm (Table 3) show that our multi-tiered approach can reduce the calibration time by an average of 48.3% and in all cases reduced the timing error compared with RTL traces to less than 2% after 43 generations.

6. SUMMARY AND CONCLUSIONS

In this paper, we presented a framework for automated calibration of high-level cycle-accurate simulation models against RTL and other reference models. Our framework is based on tunable, parameterized uADL models from Freescale. In our framework, the Evolving Objects (EO) library is used with slight modifications as our calibration engine. Different types of test cases are organized into a multi-tier benchmark hierarchy to improve the calibration performance. Results show that our framework can effectively calibrate parameterized uADL models against RTL timing traces and generate architecture models with more than 98% timing accuracy in an average of 8.1 hours.

7. REFERENCES

[1] The architectural description language project, ver. 2.0.0. [Online]. Available: http://opensource.freescale.com/fsl-oss-projects
[2] "Lex and yacc," in lex & yacc, 2nd Edition. O'Reilly Media, 1992, pp. 1-25.
[3] "An overview of evolutionary computation," in Evolutionary Computation for Modeling and Optimization. Springer New York, 2006, pp. 1-31.
[4] M. Keijzer, J. J. Merelo, G. Romero, and M. Schoenauer, "Evolving objects: A general purpose evolutionary computation library," Artificial Evolution, vol. 2310, pp. 829-888, 2002. [Online]. Available: http://www.lri.fr/~marc/EO/EO-EA01.ps.gz