ph.d. thesis presentation
Scratchpad-oriented address generation for low-power embedded
VLIW processors
Guillermo Talavera Velilla
Departament de Microelectrònica i Sistemes Electrònics
Universitat Autònoma de Barcelona
Thesis supervisor: Jordi Carrabina
Ph.D. Defense Presentation, October 15th, 2009
Embedded Processors Memories Optimization AGUs Optimization Conclusions ? 2/50
What are we talking about?
Scratchpad-oriented address generation
for low-power embedded VLIW processors
Scratchpad: type of memory
Address generation: accessing data
Low-power: energy optimization
Embedded: small, portable, battery-operated, and multimedia
VLIW processors: type of processors
What should I do if I am a VLIW processor working in the embedded
domain and I want to access data (located in memory) while consuming little
energy?
Let’s talk about…
… embedded.
Embedded
Embedded mobile systems
Greedy users
Users demand:
• More functionalities
• More speed
• More battery
• Cheap devices
“PC-like” functionalities
… and we give them VLIW-ASIPs
Performance vs Energy Efficiency
[Figure: trade-off between performance, energy efficiency, and flexibility; the target is to be flexible enough]
Goal of the thesis
• Main goal:
  – Optimize the energy consumption of VLIW-ASIP architectures, focusing on the address generation process.
• Side goals:
  – Analyze state-of-the-art optimizations
  – Analyze state-of-the-art address generation units
  – Test the template on different benchmarks and applications
Let’s talk about…
… processors.
Processors
Definitions
• VLIW = Very Long Instruction Word (processor)
  – Architecture design style that tries to maximize the available instruction-level parallelism.
• ASIP = Application-Specific Instruction-set Processor
  – Processor where the instruction set is tailored to benefit a specific application.
Target Architecture Style: VLIW-ASIP
[Figure: VLIW-ASIP template. A main cluster and further clusters, each with functional units (FU), a register file, and a loop buffer; memory hierarchy with Level 1 memory (on chip), Level 2 memory (on chip), and external memory]
Superscalar vs VLIW (reminder)
Superscalar: HW scheduling
VLIW: SW scheduling
VLIW-ASIPs
• Ongoing work at imec:
  – Novel architectures oriented to low power (x20)
HW + SW + compiler exploration:
• Data memory hierarchy
• Foreground memory (registers)
• Instruction/configuration memory
• Data-path
• Address-path
COFFEE: COmpiler Framework for Energy-aware Exploration
[Figure: COFFEE flow. Inputs: C code, XML processor model, compiler and process parameters. Blocks: XSLT converter, Trimaran MDES, enhanced Trimaran compiler, Trimaran simulator, area calculation, delay calculation, power calculation, annotated XML processor model, total power calculation. Intermediate data: assembly code, trace, results. Outputs: power/energy results, performance results, area results]
Let’s talk about…
… memories.
Memories
Embedded multimedia domain
• 50%-70% of the energy consumption is caused by memory accesses*
Crucial to optimize:
• Memory size, type, number of ports, …
• Accesses (and related address computations)
As a driver example we use a real application: an MPEG4 encoder
* References of the thesis [WCNM96 and MNCM97]
Background data memory
Scratchpad (compared to cache)
– average energy reduction: 40% *
– average area-time reduction: 46% *
* References of the thesis [BSL+02a and AC06]
[Figure: memory hierarchy from the core outwards: Level 1 (data/instruction memory), Level 2 (data/instruction memory), Level 3 (main memory); from fast, small, and expensive to slow, large, and cheap]
Let’s talk about…
… optimization methods.
Optimization
Data Transfer and Storage Exploration (DTSE)
• Goal:
  – Reduce storage requirements
  – Optimize locality of data
• Code rewriting
  – Complex addressing
  – Control flow
  – Modulo and divider operations
DTSE transformations
Code before DTSE:
for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    for (k=-1; k<=1; ++k) {
      A[x][y] += B[x+k][y] * C[abs(k)];
      A[x][y] /= tot;
    }
  }
}
Code after DTSE:
for (y=0; y<=M+2; ++y) {
  for (x=0; x<=N+2; ++x) {
    if (x>=0 && x<N && y>=1 && y<=M-2) {
      D[x%3] = B[(y*N+x)%8704 + (y*N+x)%8704*16384 + 7680];
    }
    if (x-1>=1 && x-1<=N-2 && y>=1 && y<=M-2) {
      for (k=-1; k<=1; ++k) {
        acc += D[(x-1+k)%3] * C[abs(k)];
      }
    }
    acc /= tot;
  }
}
Control flow and address calculation are the bottleneck after DTSE!!!
DTSE: Non-linear Operator Strength Reduction
Before:
for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    cse0 = x%3;
    cse1 = (x-1)%3;
    cse2 = (x-2)%3;
    …
  }
}
After:
for (y=0; y<M; ++y) {
  p_cse0 = 0; // x%3
  p_cse1 = 2; // (x-1)%3
  p_cse2 = 1; // (x-2)%3
  for (x=0; x<N; ++x) {
    …
    p_cse2 = p_cse1;
    p_cse1 = p_cse0;
    p_cse0++;
    if (p_cse0>=3) p_cse0 = 0;
  }
}
Modulo operations cannot always be transformed for complex indexes
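The rotating counters of the strength-reduced loop can be checked against a direct modulo computation. The following is a minimal sketch, not code from the thesis: the helper names mod_floor, count_mismatches, and self_check are hypothetical, and the check assumes the mathematical (non-negative) modulo, since the C operator % may return a negative result for x-1 and x-2 at the start of a row.

```c
#include <assert.h>

/* Mathematical (non-negative) modulo; C's % can return negative values. */
static int mod_floor(int a, int m) {
    int r = a % m;
    return r < 0 ? r + m : r;
}

/* Runs one row of n iterations and returns how many times the rotating
   counters differ from the directly computed modulo values. */
int count_mismatches(int n) {
    int p_cse0 = 0;  /* x%3     at x = 0 */
    int p_cse1 = 2;  /* (x-1)%3 at x = 0 */
    int p_cse2 = 1;  /* (x-2)%3 at x = 0 */
    int mismatches = 0;
    for (int x = 0; x < n; ++x) {
        if (p_cse0 != mod_floor(x, 3)) ++mismatches;
        if (p_cse1 != mod_floor(x - 1, 3)) ++mismatches;
        if (p_cse2 != mod_floor(x - 2, 3)) ++mismatches;
        /* Strength-reduced update: no modulo operation is executed. */
        p_cse2 = p_cse1;
        p_cse1 = p_cse0;
        p_cse0++;
        if (p_cse0 >= 3) p_cse0 = 0;
    }
    return mismatches;
}

/* Aborts if the two formulations ever disagree on a row of 1000 elements. */
void self_check(void) {
    assert(count_mismatches(1000) == 0);
}
```

The update costs two copies, one increment, and one compare per iteration, which is the kind of operation mix a simple address generator can execute without a divider.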
DTSE: Arithmetic Cost Minimization
Before:
for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    if (x>=4 && y>=6) {
      ce_img1[(x-2)%3] = …
      ce_img2[(x-2)%3] = …
    }
    if (x>=4 && y>=4) {
      ce_img1[(x-1)%3] = …
      ce_img2[(x-1)%3] = …
    }
  }
}
After:
for (y=0; y<M; ++y) {
  for (x=0; x<N; ++x) {
    if (x>=4 && y>=6) {
      cse0 = (x-2)%3;
      ce_img1[cse0] = …
      ce_img2[cse0] = …
    }
    if (x>=4 && y>=4) {
      cse1 = (x-1)%3;
      ce_img1[cse1] = …
      ce_img2[cse1] = …
    }
  }
}
Control flow optimization
Before:
for (i=0; i<50; i++) {
  for (j=0; j<i; j++) {
    if (i+j < 70)
      data = Aleft[i+j];
    else
      data = Aright[i+j-70];
    …
  }
}
After (loop nest splitting):
for (i=0; i<50; i++) {
  if (i <= 35) {
    /* i+j < 70 always holds here, so the inner loop needs no test */
    for (j=0; j<i; j++) {
      data = Aleft[i+j];
    }
  } else {
    for (j=0; j<i; j++) {
      if (i+j < 70) {
        data = Aleft[i+j];
      } else {
        data = Aright[i+j-70];
      }
    }
  }
}
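The effect of loop nest splitting on this example can be quantified by counting how often the inner test i+j < 70 is evaluated. The sketch below is illustrative only: the function names, the accumulation into a sum, and the array contents are assumptions made so that the two versions can be compared; the loop bounds and the split point i <= 35 come from the slide (for j < i, i+j is at most 2*i-1, which stays below 70 while i <= 35).

```c
#include <assert.h>

enum { N_I = 50, SPLIT = 70, LEFT_LEN = 70, RIGHT_LEN = 28 };

/* Original nest: the test runs in every inner iteration. */
long original_nest(const int *Aleft, const int *Aright, long *tests) {
    long sum = 0;
    *tests = 0;
    for (int i = 0; i < N_I; i++) {
        for (int j = 0; j < i; j++) {
            ++*tests;
            if (i + j < SPLIT) sum += Aleft[i + j];
            else sum += Aright[i + j - SPLIT];
        }
    }
    return sum;
}

/* Split nest: iterations with i <= 35 skip the inner test entirely.
   Only inner-loop tests are counted, not the per-i split test. */
long split_nest(const int *Aleft, const int *Aright, long *tests) {
    long sum = 0;
    *tests = 0;
    for (int i = 0; i < N_I; i++) {
        if (i <= 35) {
            for (int j = 0; j < i; j++) sum += Aleft[i + j];
        } else {
            for (int j = 0; j < i; j++) {
                ++*tests;
                if (i + j < SPLIT) sum += Aleft[i + j];
                else sum += Aright[i + j - SPLIT];
            }
        }
    }
    return sum;
}

/* Returns 1 if both versions read the same data and reports both counts. */
int splitting_is_equivalent(long *tests_before, long *tests_after) {
    int left[LEFT_LEN], right[RIGHT_LEN];
    for (int k = 0; k < LEFT_LEN; k++) left[k] = 3 * k + 1;
    for (int k = 0; k < RIGHT_LEN; k++) right[k] = 1000 - 7 * k;
    long a = original_nest(left, right, tests_before);
    long b = split_nest(left, right, tests_after);
    return a == b;
}
```

With these bounds the original nest evaluates the inner test 1225 times and the split nest 595 times, at the cost of one extra comparison per outer iteration.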
Data-path architecture exploration
Done at architecture level: #clusters, #FU per cluster
2 clusters with 4 FU each
MPEG4 encoder application:
- 90nm technology
- 500MHz
Let’s talk about…
… address generation.
AGUs
How do I access data?
[Figure: core connected to the Level 1 data cache]
Very often, addresses are calculated in the normal data-path
Address Generation Unit (AGU)
[Figure: AGU block diagram with an address register file, an address control unit, and an address data-path unit; labels: indexes or addresses, range, address sequence]
Address equation examples:
D[x%3] = B[(y*N+x)%8704 + (y*N+x)%8704*16384 + 7680];
AE1 = x%3
AE2 = (y*N+x)%8704 + (y*N+x)%8704*16384 + 7680
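To make the cost of such an address equation concrete, AE2 can be written so that the shared term (y*N+x)%8704 is computed once and the multiplication by 16384 (2^14) becomes a shift. This is a sketch under assumptions: the function names are hypothetical, the row width is passed as a parameter n, and the inputs are assumed to be non-negative.

```c
#include <assert.h>

/* AE1 = x%3 */
int ae1(int x) {
    return x % 3;
}

/* AE2 written exactly as in the address equation. */
long ae2_direct(long x, long y, long n) {
    return (y * n + x) % 8704 + (y * n + x) % 8704 * 16384 + 7680;
}

/* Same value: the modulo term is computed once and 16384 = 2^14
   is applied as a left shift. */
long ae2_shared(long x, long y, long n) {
    long t = (y * n + x) % 8704;
    return t + (t << 14) + 7680;
}

/* Returns 1 if both forms agree on a rows-by-n index space. */
int ae2_forms_agree(long n, long rows) {
    for (long y = 0; y < rows; y++)
        for (long x = 0; x < n; x++)
            if (ae2_direct(x, y, n) != ae2_shared(x, y, n)) return 0;
    return 1;
}
```

The shared form needs one multiply, one modulo, one shift, and three additions, which matches the kinds of operators (+, <<, *, %) that an address data-path has to provide.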
AGU
• Multimedia Domain Programmable AGU
AGU Exploration Framework*
[Figure: AGU exploration flow. Blocks: address calculation, constraints, AGU exploration framework, PE implementation patterns (+, -, <<, +/-, +/-/<<, *, %), architecture file, AGU mapping framework, report of cycles, area, and energy]
Constraints: max_pe=6, min_add=1, max_add=6, min_sub=1, max_sub=6, min_sft=1, max_sft=6, …
All architectures that satisfy the constraints are evaluated, which exposes the trade-off.
* From Osaka University
Experiments
Results: Area
15% < Hardware overhead < 200%
[Figure: original VLIW versus VLIW with AGU. In both, a main cluster and a second cluster each have FUs, a register file, and a loop buffer; in the version with AGU, AGUs appear alongside the FUs in each cluster]
Results: Speed and Energy consumption
[Chart: speed and energy improvements per benchmark; values shown: 24%, 27%, 16%, 17%, 12%, 13%, 12%, 32%, 35%, 15%]
Results (applied to the MPEG4 application)
[Chart: 51% improvement in energy consumption]
“Stand alone” AGU
for (k …) {
  for (j …) {
    for (i …) {
      …
    }
  }
}
“Stand alone” AGU (1)
Implements: i*cnst
“Stand alone” AGU (2)
Implements: i += “inc i”, i -= “dec i”
“Stand alone” AGU (3)
Implements: i+j, i << j, i*j, i >> j
“Stand alone” AGU (4-5)
Implements: (i+j)%val, (i << j)%val, (i*j)/val, (i >> j)/val
Modulo computed directly:
for (i=0; i<=20; i++)
  address = i%3;
Modulo replaced by a wrapping pointer:
ptr = -1;
for (i=0; i<=20; i++) {
  ptr++;
  if (ptr >= 3) ptr -= 3;
  address = ptr;
}
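The wrapping-pointer scheme generalizes to any stride that does not exceed the modulo value, because a single conditional subtraction is then enough to stay in range. The following software model is a sketch only: the ModuloAgu type and its functions are hypothetical names and do not describe the actual AGU hardware.

```c
#include <assert.h>

/* Hypothetical software model of a modulo address generator. */
typedef struct {
    int ptr;     /* last generated address (starts one stride before 0) */
    int stride;  /* increment per access, 0 < stride <= modulo */
    int modulo;  /* wrap-around value ("val") */
} ModuloAgu;

/* Mirrors the "ptr = -1" initialization for a stride of 1. */
ModuloAgu agu_init(int stride, int modulo) {
    ModuloAgu a = { -stride, stride, modulo };
    return a;
}

/* Produces the next address with one add and one conditional subtract. */
int agu_next(ModuloAgu *a) {
    a->ptr += a->stride;
    if (a->ptr >= a->modulo) a->ptr -= a->modulo;
    return a->ptr;
}

/* Returns 1 if the generated sequence equals (i*stride) % modulo
   for i = 0 .. count-1. */
int agu_matches_modulo(int stride, int modulo, int count) {
    ModuloAgu a = agu_init(stride, modulo);
    for (int i = 0; i < count; i++) {
        if (agu_next(&a) != (i * stride) % modulo) return 0;
    }
    return 1;
}
```

Calling agu_matches_modulo(1, 3, 21) reproduces the 21 iterations of the loop above (i from 0 to 20) and compares every generated address with i%3.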
“Stand alone” AGU (6)
Implements: i+j+k, (i << j)+k, i*j+k, (i >> j)+k
With this AGU
• Conditions:
  – No control flow
  – No dividers*
  – No modulo operations*
• In the cavity detector application:
  – 2% hardware overhead
  – 50% energy and cycles reduction
* That cannot be transformed with non-linear operator strength reduction
Let’s talk about…
… optimization (again).
Instruction loop buffering optimization
[Figure: datapath, L1, and distributed L0 loop buffers shown in three configurations. Labels: normal operation, filling L0 buffer operation; phases: initiation, execution, termination]
Summary of the optimizations (on the MPEG4 application)
[Figure: optimizations arranged by level (code, compiler, hardware) and by target (data memory, address generation, data path, instruction memory)]
Results: Cycles
More than 90% improvement with respect to the first straightforward implementation
Final energy distribution
MPEG4 encoder application:
- 90nm technology
- 500MHz
Let’s talk about…
… conclusions.
Conclusions
Thesis contributions (1)
• Address generation unit template for the embedded multimedia domain
  – Improvements between 12% and 35% on several benchmarks and applications (cycles and energy)
  – Improvement of 51% in energy consumption on a real application (MPEG4), with respect to the previous optimization step
  – Global improvements over 90% when applying a complete optimization methodology
Thesis contributions (2)
• Quantitative comparison of different platforms commonly used in the embedded domain
• Systematic classification of address generators
• Review of the literature on address generation optimization according to the classification
• Introduction of the reconfigurable AGU framework results into the COFFEE framework
• Application of a complete methodology to optimize energy consumption on a real data-flow application, including address generation steps
Open issues:
• Support for more loops and control
• Bit calculation
• Merging of index expressions
• Extension to other benchmarks and applications
• Heterogeneous distributed AGUs
• Distributed loop buffers with different speeds
• Complete DTSE optimization
End of presentation and open discussion
Publications
Journal papers:
• G. Talavera, M. Jayapala, J. Carrabina, and F. Catthoor, “Address generation optimization for embedded high-performance processors: A survey”, Journal of Signal Processing Systems for Signal Image and Video Technology (formerly the Journal of VLSI Signal Processing Systems for Signal Image and Video Technology), May 2008 (online), December 2008 (printed version).
• G. Talavera, A. Portero, P. Raghavan, M. Jayapala, J. Carrabina, and F. Catthoor, “Power exploration and address generation optimization of multimedia applications on VLIW processors”, Planned for re-submission to the IEEE Transactions on Image Processing.
• A. Portero, G. Talavera, J. Carrabina, and F. Catthoor, “Methodology for multimedia applications in multiplatform implementation for energy-flexibility space exploration”, Planned for re-submission to the IEEE Transactions on Computers .
• A. Portero, G. Talavera, J. Carrabina, and F. Catthoor, “Data-dominant application implementation in multi-platform for energy-flexibility space exploration”, Planned for re-submission to the IEEE Transactions on Image Processing.
Conference papers
• A. Lambrechts, T. Vander Aa, M. Jayapala, A. Leroy, G. Talavera, A. Shickova, F. Barat, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert, and J. C. Bordoll, “Design style case study for compute nodes of a heterogeneous NoC platform”, in 25th IEEE Real-Time Systems Symposium (RTSS), December 2004.
• G. Talavera, V. Nollet, J.-Y. Mignolet, D. Verkest, S. Vernalde, R. Lauwereins, and J. Carrabina, “Hardware-Software debugging techniques for reconfigurable Systems-on-Chip”, in IEEE International Conference on Industrial Technology (ICIT '04), vol. 3, Dec. 2004, pp. 1402-1407.
• G. Talavera, V. Nollet, J.-Y. Mignolet, D. Verkest, S. Vernalde, R. Lauwereins, and J. Carrabina, “Métodos de depuración HW-SW para sistemas on chip reconfigurables”, in Jornadas Sobre Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, Spain, September 2004, pp. 251-258.
• A. Lambrechts, P. Raghavan, A. Leroy, G. Talavera, T. Vander Aa, M. Jayapala, F. Catthoor, D. Verkest, G. Deconinck, H. Corporaal, F. Robert, and J. Carrabina, “Power breakdown analysis for a heterogeneous NoC platform running a video application”, in 16th IEEE International Conference on Application-Specific Systems, Architecture Processors (ASAP), July 2005, pp. 179-184.
• A. Portero, G. Talavera, M. Monton, B. Martinez, and J. Carrabina, “NoC system for MPEG-4 SP using heterogeneous tiles” , in Design of Circuits and Integrated Systems (DCIS), San Diego, California, USA. December 2006.
• A. Portero, G. Talavera, M. Monton, B. Martinez, M. Moreno, F. Catthoor, and J. Carrabina, “Energy-aware MPEG-4 single profile in HW-SW multiplatform implementation”, in IEEE International SOC Conference, Austin, Texas, USA, Sept. 2006, pp. 13-16.
• A. Portero, G. Talavera, M. Monton, B. Martinez, F. Catthoor, and J. Carrabina, “Dynamic voltage scaling for power efficient MPEG4-SP implementation”, in Proceedings of the IEEE 17th International Conference on Application-specific Systems, Architectures and Processors (ASAP). Washington, DC, USA: IEEE Computer Society, 2006, pp. 257-260.
• A. Portero, G. Talavera, F. Catthoor, and J. Carrabina, “A study of a MPEG-4 codec in a multiprocessor platform”, in IEEE International Symposium on Industrial Electronics (ISIE), 2006, vol. 1, July 2006, pp. 661-666.
Teaching publications
• G. Talavera, J. Saiz, and J. Carrabina, “Dispositivos y plataformas para docencia de informática y electrónica”, in Jornadas Sobre Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, Spain, September 2004, pp. 711-717.
• G. Talavera, B. Lorente, M. Monton, B. Martinez, J. Oliver, C. Ferrer, L. Ribas, J. Aguilo, and E. Valderrama, “Nuevas metodologías docentes y autoaprendizaje en la enseñanza técnica universitaria”, in Congreso Internacional de Docencia Universitaria e Innovación (CIDUI), Barcelona, Spain, 2006
• B. Lorente, G. Talavera, L. Ribas, and E. Valderrama, “Implantació d'una nova metodologia docent a les pràctiques de fonaments de computadors d'enginyeria informàtica”, in Congreso Internacional de Docencia Universitaria e Innovación (CIDUI), Barcelona, Spain, 2006.
• G. Talavera, X. Fitó, B. Lorente, A. Portero, M. Montón, B. Martínez, J. Oliver, C. Ferrer, L. Ribas, J. Aguiló, and E. Valderrama, “Adaptación metodológica a las nuevas directrices del EEES en la enseñanza técnica universitaria”, in Tecnologías Aplicadas a la Enseñanza de la Electrónica (TAEE), Madrid, Spain. 2006.
• A. Portero, J. Saiz, G. Talavera, R. Aragonés, M. Rullán, J. Aguiló, and E. Valderrama, “Aplicación del plan piloto en sistemas digitales en ingeniería informática siguiendo las directivas del EEES”, in Tecnologías Aplicadas a la Enseñanza de la Electrónica (TAEE), Madrid, Spain, 2006.
• G. Talavera, F. X. Fitó, B. Lorente, M. Montón, B. Martínez, C. Ferrer, and E. Valderrama, “Cas pràctic d'adaptació metodològica a les directrius EEES d'una assignatura d'enginyeria informàtica”, in III Jornada de Campus d'Innovació Docent, UAB, Barcelona, Spain, 20 September 2006.
• E. Valderrama, G. Talavera, M. Montón, B. Martínez, J. M. Fernández, and J. Muñoz, “Comparación de dos metodologías docentes utilizadas en los seminarios de fundamentos de computadores”, in XIV Jornadas de Enseñanza Universitaria de la Informática (JENUI), 2008.
Results: Energy
More than 90% improvement with respect to the first straightforward implementation
Reconfigurable AGU template
The loop buffer operation: An Illustration
Source loop:
for (..) {
  …
  if (..) {…} else {…}
  …
}
VLIW schedule (one row per instruction word):
    LBON <offset>
S:  OP11  OP21  OP31  NOP
    NOP   OP22  OP32  BNZ ‘x’
    OP12  NOP   NOP   BR ‘y’     (if block)
X:  OP13  NOP   OP33  NOP        (else block)
Y:  OP14  OP23  NOP   BNZ ‘s’
[Figure: loop buffer contents per slot. FU1: OP11, OP12, OP13, OP14. FU2: OP21, OP22, OP23. FU3: OP31, OP32, OP33. BR: BNZ ‘x’, BR ‘y’, BNZ ‘s’. Control fields: IROC, START_ADDR, END_ADDR, IR_USE, NEW_PC, PC]