Challenges in Automatic Optimization of Arithmetic Circuits

Ajay K. Verma, Philip Brisk and Paolo Ienne
Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)
Ecole Polytechnique Fédérale de Lausanne (EPFL)


Post on 06-Jan-2016


Page 1: Challenges in Automatic Optimization of Arithmetic Circuits

Ajay K. Verma, Philip Brisk and Paolo Ienne

Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA)

Ecole Polytechnique Fédérale de Lausanne (EPFL)

Challenges in Automatic Optimization of Arithmetic Circuits

Page 2: Challenges in Automatic Optimization of Arithmetic Circuits

Circuit Performance Depends Heavily on the Description

“Software” Multiplier

Multiplier with Optimized Compressor Tree

Multiplier with Compressor Tree

Page 3: Challenges in Automatic Optimization of Arithmetic Circuits

Pre-Synthesis Optimization of Arithmetic Circuits

Original Circuit Description

Physical Design

Logic Synthesis

Arithmetic optimizations

Known architectures

Automatic architecture exploration

Page 4: Challenges in Automatic Optimization of Arithmetic Circuits

Automation and Computer Arithmetic

Automation

• Heuristics to optimize general classes of circuits
    • Kernel and co-kernel extraction [Brayton82]
    • Decomposition-based approaches for general circuits [Bertacco97, Mishchenko01, Yang02]

• Algorithmic approaches for a particular class of circuits
    • Variable group size CLA adder [Lee91]
    • Irregular partial product compressors [Stelling98]

Page 5: Challenges in Automatic Optimization of Arithmetic Circuits

Logic Synthesis

Synthesis tools have become extremely good at optimizing circuits expressed in sum-of-products form

And when there are plenty of XOR gates?
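
One way to see why XOR-rich logic is awkward in sum-of-products form: the XOR (parity) of n variables is 1 on exactly half of the 2^n input assignments, and no two of those assignments differ in a single bit, so none of the minterms can be merged. The sketch below checks this by brute force. The helper names `parity_minterms` and `mergeable` are illustrative and do not come from the talk.

```python
from itertools import product

def parity_minterms(n):
    """Return the input assignments on which the n-input XOR evaluates to 1."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]

def mergeable(t1, t2):
    """Two minterms can be merged into one larger product term only if they differ in exactly one bit."""
    return sum(x != y for x, y in zip(t1, t2)) == 1

for n in range(2, 9):
    terms = parity_minterms(n)
    # No pair of parity minterms is adjacent, so the SOP cannot be simplified.
    assert not any(mergeable(s, t) for s in terms for t in terms)
    print(n, len(terms))  # the number of product terms is 2**(n - 1)
```

The term count doubles with every added input, which is the kind of growth a sum-of-products expansion of XOR-heavy arithmetic runs into.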

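[Equation: R is formed from P and Q, where P combines the product terms a0a1, a1a2, a2a3, a3a4 and Q combines b0b1, b1b2, b2b3, b3b4; the operator symbols are missing from the transcript.]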

Before expansion: 0.37 ns (138.2 μm²); after expansion: 0.26 ns (146.9 μm²)

Before expansion: 0.22 ns (58.8 μm²); after expansion: 0.27 ns (221.2 μm²)

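[Equation: R is formed from P and Q, both built from the partial products a0b3, a1b2, a2b1 and a3b0; the exact expressions are not recoverable from the transcript.]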

Page 6: Challenges in Automatic Optimization of Arithmetic Circuits

Outline

Verma & Ienne; ICCAD 2004

Verma, Brisk, & Ienne; TCAD 2008

Optimizing at Word-Level

Verma & Ienne; DAC 2006

Exploring Microscopic Structure

Creating Macroscopic Structure

Verma, Brisk, & Ienne; DAC 2007 (Best Paper Award nominee)

Verma, Brisk, & Ienne; IWLS 2008

Page 7: Challenges in Automatic Optimization of Arithmetic Circuits

Outline

Optimizing at Word-Level

Creating Macroscopic Structure

Exploring Microscopic Structure

Complexity: low to high

Granularity: high to low

Page 8: Challenges in Automatic Optimization of Arithmetic Circuits

Outline

Optimizing at Word-Level

Creating Macroscopic Structure

Exploring Microscopic Structure

Page 9: Challenges in Automatic Optimization of Arithmetic Circuits

Clustering: Maximization of the Use of Carry-Save Representation

Two addition nodes are separated by NOT

The two addition nodes are clustered

Goal: Swap the adders with other logic operations, while preserving the semantics, to cluster additions
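
As background for why clustering additions pays off, here is a minimal sketch of carry-save addition on non-negative Python integers. Each 3:2 compression step is purely bitwise, and only the final step propagates carries. The function names `carry_save_add` and `add_many` are illustrative and not taken from the talk.

```python
def carry_save_add(a, b, c):
    """3:2 compression: reduce three operands to a (sum, carry) pair without carry propagation."""
    partial_sum = a ^ b ^ c                      # bitwise sum of the three operands
    carry = ((a & b) | (b & c) | (a & c)) << 1   # bitwise majority, weighted by 2
    return partial_sum, carry

def add_many(operands):
    """Add several operands with a chain of carry-save steps and one final carry-propagate addition."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)           # invariant: s + c equals the running total
    return s + c                                 # the only carry-propagating addition

print(add_many([13, 7, 250, 99]) == 13 + 7 + 250 + 99)  # True
```

When another operation sits between two additions, the intermediate result has to be converted back to an ordinary binary number first, which costs an extra carry-propagate addition. That is the situation the clustering transformations try to remove.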

Page 10: Challenges in Automatic Optimization of Arithmetic Circuits

Examples of Transformations

Advancing shift left over add (distributivity of multiplication over addition), using (A << k) = A · 2^k

Advancing shift right over addition is more complex

Advancing SEL over add (existence of the identity element of addition): C ? (A + B) : D = (C ? A : D) + (C ? B : 0)
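
The transformations above can be checked numerically. The sketch below uses unbounded Python integers, so fixed-width overflow is ignored (an assumption). The helper names are illustrative.

```python
import random

def sel(c, x, y):
    """Word-level selector: x if c is true, otherwise y."""
    return x if c else y

def shl_over_add(a, b, k):
    """Shift left distributes over addition because (A << k) = A * 2**k."""
    return ((a + b) << k) == (a << k) + (b << k)

def sel_over_add(c, a, b, d):
    """SEL can be moved ahead of the adder by feeding 0, the identity of addition."""
    return sel(c, a + b, d) == sel(c, a, d) + sel(c, b, 0)

def shr_over_add(a, b, k):
    """Shift right does NOT distribute in general: low-order bits can produce a carry."""
    return ((a + b) >> k) == (a >> k) + (b >> k)

random.seed(0)
for _ in range(1000):
    a, b, d = (random.randrange(256) for _ in range(3))
    k = random.randrange(1, 5)
    assert shl_over_add(a, b, k)
    assert sel_over_add(0, a, b, d) and sel_over_add(1, a, b, d)

print(shr_over_add(3, 1, 1))  # False: (3 + 1) >> 1 is 2, but (3 >> 1) + (1 >> 1) is 1
```

The counterexample in the last line shows why advancing a right shift needs extra care: the bits that the shift discards can still generate a carry into the retained part.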

Page 11: Challenges in Automatic Optimization of Arithmetic Circuits

Some Transformations Have a Cost

Advancing PP over add (distributive property of multiplication over addition)

This transformation has a significant cost in terms of area!

Page 12: Challenges in Automatic Optimization of Arithmetic Circuits

Generation of All Pareto-Optimal Implementations

Theorem: The transformations form a persistent and confluent reduction system

Pareto-optimal: no other implementation is better in both area and critical-path delay
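
A minimal sketch of Pareto filtering over (delay, area) pairs follows. The three data points are the figures quoted for the multi-input comparator later in this talk (delay in ns, area in μm²). The function `pareto_front` is an illustrative helper, not the authors' algorithm.

```python
def pareto_front(points):
    """Keep the (delay, area) points that no other point dominates."""
    def dominates(p, q):
        # p dominates q if it is no worse in both metrics and is a different point.
        return p != q and p[0] <= q[0] and p[1] <= q[1]
    return [q for q in points if not any(dominates(p, q) for p in points)]

designs = [(0.46, 1755), (0.21, 3479), (0.22, 1331)]
print(pareto_front(designs))  # [(0.21, 3479), (0.22, 1331)]
```

The first design is dropped because the third is both faster and smaller. The other two remain because each wins on one metric.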

Page 13: Challenges in Automatic Optimization of Arithmetic Circuits

Example: adpcmdecode Kernel

Compressor tree

AND network

0.85 ns, 5678 μm²

0.51 ns, 4901 μm²

Page 14: Challenges in Automatic Optimization of Arithmetic Circuits

Outline

Optimizing at Word-Level

Creating Macroscopic Structure

Exploring Microscopic Structure

Limited scope for optimizations

Bit-level

Page 15: Challenges in Automatic Optimization of Arithmetic Circuits

Implementation of Subcircuits Corresponding to Contiguous Layers Can Be Improved

ADD

LZD

Arithmetic

Logic

Leading Zero Anticipator

A direct implementation of LZA in carry-select fashion [Gerwig99]

Page 16: Challenges in Automatic Optimization of Arithmetic Circuits

Input Condensation

Leader expressions:
• Sufficient to evaluate the whole of an expression
• Once you evaluate them, you can discard the input bits

Compute all leader expressions in parallel

Recursively compute leader expressions again

[Figure: IN → Some Large Circuit → OUT is rewritten as IN → leader expressions L, with |L| < |IN| → Smaller circuit → OUT. Example: an 8-input parallel counter and its leader expressions s and c.]

Page 17: Challenges in Automatic Optimization of Arithmetic Circuits

Progressive Decomposition: Algorithm Overview

Choose a subset of input bits
• How many bits?
• Many different combinations?

Find leader expressions
• Optimize via Boolean ring properties
• Find identities

Discard dependent expressions
• Example: among x, y, z, if z = f(x, y) then z is dependent

Rewrite circuit in terms of leader expressions

Recursively process the remaining circuit

Page 18: Challenges in Automatic Optimization of Arithmetic Circuits

Example: 3-Input Adder (s2 Output)

Ripple-Carry Adder:

X = [a1b1 + (a1 + b1)a0b0] ⊕ [(a1 ⊕ b1 ⊕ a0b0)c1 + c0(a0 ⊕ b0)(c1 + (a1 ⊕ b1 ⊕ a0b0))]

L(X, {a1, b1, c1}) = {a1 ⊕ b1 ⊕ c1, a1b1 ⊕ b1c1 ⊕ a1c1}

The two leader expressions are the sum and the carry of a carry-save adder

[Figure: X computed by ripple-carry adders on (a1 a0), (b1 b0) and (c1 c0), and the equivalent circuit using 3:2 compressors (CSA) on (a0, b0, c0) and (a1, b1, c1) followed by a ripple-carry adder.]
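
The example can be checked by brute force, reading the operator symbols that are missing from the transcript as XOR (an assumption). The sketch below confirms two things. First, the ripple-carry expression for X equals bit 2 of the three-operand sum. Second, the two leader expressions on {a1, b1, c1} are sufficient, meaning X never changes while the leader values and the remaining inputs stay fixed. All function names are illustrative.

```python
from itertools import product

def s2_formula(a1, a0, b1, b0, c1, c0):
    """X as written for the ripple-carry structure (| is OR, ^ is XOR)."""
    t1 = a1 ^ b1 ^ (a0 & b0)                        # bit 1 of a + b
    t2 = (a1 & b1) | ((a1 | b1) & a0 & b0)          # carry out of a + b
    m1 = (t1 & c1) | (c0 & (a0 ^ b0) & (c1 | t1))   # carry out of (a + b) mod 4 + c
    return t2 ^ m1

def s2_reference(a1, a0, b1, b0, c1, c0):
    """Bit 2 of the integer sum of the three 2-bit operands."""
    total = (2 * a1 + a0) + (2 * b1 + b0) + (2 * c1 + c0)
    return (total >> 2) & 1

def leaders(a1, b1, c1):
    """Leader expressions on {a1, b1, c1}: carry-save sum and carry."""
    return (a1 ^ b1 ^ c1, (a1 & b1) ^ (b1 & c1) ^ (a1 & c1))

def leaders_are_sufficient():
    """Check that X is a function of the leader values and the untouched inputs only."""
    seen = {}
    for a1, a0, b1, b0, c1, c0 in product((0, 1), repeat=6):
        key = (leaders(a1, b1, c1), a0, b0, c0)
        out = s2_reference(a1, a0, b1, b0, c1, c0)
        if seen.setdefault(key, out) != out:
            return False
    return True

assert all(s2_formula(*v) == s2_reference(*v) for v in product((0, 1), repeat=6))
assert leaders_are_sufficient()
```

Three input bits are replaced by two leader values, which is the condensation step |L| < |IN| applied to this adder.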

Page 19: Challenges in Automatic Optimization of Arithmetic Circuits

A Better Division Is Used for Leader Expression Computation

X = ab(c ⊕ d ⊕ e) + cd(a ⊕ b ⊕ e)

X = (ab + cd)(a ⊕ b ⊕ c ⊕ d ⊕ e)

Based on the identity: pq(p ⊕ q) = 0

Theorem: An expression of the form PQ + RS can be factored as (P + R)T if there exist U and V such that 1) PU = RV = 0 and 2) Q ⊕ S = U ⊕ V

The ideal membership problem can be used to determine the existence of such U and V
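
The factorization can be verified exhaustively. Some operator symbols are missing from the transcript, so the sketch below makes two assumptions: the operators inside the parentheses are XOR, and the top-level operator is the "+" (OR) that survives in the factored form. It also checks the variant in which the top-level operator is XOR on both sides, which holds as well. Function names are illustrative.

```python
from itertools import product

def t(a, b, c, d, e):
    """Common factor T = a ^ b ^ c ^ d ^ e."""
    return a ^ b ^ c ^ d ^ e

def unfactored_or(a, b, c, d, e):
    return (a & b & (c ^ d ^ e)) | (c & d & (a ^ b ^ e))

def factored_or(a, b, c, d, e):
    return ((a & b) | (c & d)) & t(a, b, c, d, e)

def unfactored_xor(a, b, c, d, e):
    return (a & b & (c ^ d ^ e)) ^ (c & d & (a ^ b ^ e))

def factored_xor(a, b, c, d, e):
    return ((a & b) ^ (c & d)) & t(a, b, c, d, e)

# The identity pq(p ^ q) = 0 that makes the division possible.
assert all((p & q & (p ^ q)) == 0 for p, q in product((0, 1), repeat=2))

# Here P = ab, R = cd, U = a ^ b, V = c ^ d, so PU = RV = 0.
inputs = list(product((0, 1), repeat=5))
assert all(unfactored_or(*v) == factored_or(*v) for v in inputs)
assert all(unfactored_xor(*v) == factored_xor(*v) for v in inputs)
```

In both readings the step is the same: since ab(a ⊕ b) = 0, the term ab(c ⊕ d ⊕ e) can be rewritten as ab(a ⊕ b ⊕ c ⊕ d ⊕ e), and likewise for cd, which exposes the shared factor.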

Page 20: Challenges in Automatic Optimization of Arithmetic Circuits

Progressive Decomposition: Qualitative Analysis

Completely agnostic of the type of circuit to optimize

Automatically infers successful circuit designs from the literature…
• Carry-lookahead adder (beyond minimal sizes)
• Structured LZD/LOD circuit
• Optimized LZA circuit (no sum computation)
• Carry-save addition
• Parallel counter

…and discovers some unknown to us!
• Multi-input comparisons (min/max)

Page 21: Challenges in Automatic Optimization of Arithmetic Circuits

Multi-Input Comparator (Min/Max of k n-bit Integers)

Binary tree of comparators
• Number of comparators: k − 1
• Critical path delay: O(log n log k)
• Hardware area: O(kn)
• 0.46 ns, 1755 μm²

Pairwise comparison of inputs
• Number of comparators: k(k − 1)/2
• Critical path delay: O(log n + log k)
• Hardware area: O(k²n)
• 0.21 ns, 3479 μm²

With our structuring algorithm: bitwidth reduction using dominators and LODs
• Number of LODs: k log* n
• Critical path delay: O(log n + log k log* n)
• Hardware area: O(kn)
• 0.22 ns, 1331 μm²

log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1, e.g., log*(2^65536) = 5
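
The iterated logarithm used in these bounds can be computed directly. The sketch below assumes base-2 logarithms and uses `math.log2`, which accepts arbitrarily large Python integers. The function name `log_star` is illustrative.

```python
import math

def log_star(x):
    """Number of times log2 must be applied to x before the result is <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

print(log_star(2 ** 65536))  # 5: 2**65536 -> 65536 -> 16 -> 4 -> 2 -> 1
```

Because log* grows so slowly, the extra log* n factor in the delay and LOD count is a small constant for any practical bitwidth.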

Page 22: Challenges in Automatic Optimization of Arithmetic Circuits

Outline

Optimizing at Word-Level

Creating Macroscopic Structure

Exploring Microscopic Structure

Reed-Muller form can be very inefficient

Efficient implementation of the leader expressions?

Exhaustive Exploration

Page 23: Challenges in Automatic Optimization of Arithmetic Circuits

Problem Statement

Given a set of Boolean expressions, generate all their Pareto-optimal implementations

no “reuse”

total “reuse”

selective “reuse”

Page 24: Challenges in Automatic Optimization of Arithmetic Circuits

Enumerating Common Sub-Expressions

Root: Original Reed-Muller form

Either xy or x ⊕ y is replaced by a new variable

The nodes of the DAG correspond to all partial implementations of the two expressions with some sharing between them
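
To make one enumeration step concrete, here is a rough sketch of how the children of a DAG node might be generated when a two-variable product is replaced by a fresh variable. It assumes a Reed-Muller form stored as a set of monomials, each a frozenset of variable names, and it handles only the product case. The data layout and the names `product_reductions` and `t_xy` are hypothetical, not the authors' implementation.

```python
from itertools import combinations

def product_reductions(expr):
    """Yield ((x, y), new_expr) for every two-variable product replaceable by a fresh variable."""
    pairs = sorted({pair for mono in expr for pair in combinations(sorted(mono), 2)})
    for x, y in pairs:
        fresh = f"t_{x}{y}"  # hypothetical name for the new intermediate signal
        new_expr = frozenset(
            (mono - {x, y}) | {fresh} if {x, y} <= mono else mono
            for mono in expr
        )
        yield (x, y), new_expr

# Carry-out of a full adder in Reed-Muller form: ab XOR bc XOR ac
carry = frozenset({frozenset("ab"), frozenset("bc"), frozenset("ac")})
for pair, child in product_reductions(carry):
    print(pair, sorted(sorted(m) for m in child))
```

Each child is itself a smaller expression over the old variables plus one new signal, so applying the step repeatedly spans the DAG of partial implementations.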

Page 25: Challenges in Automatic Optimization of Arithmetic Circuits

Pruning the Enumeration DAG

The size of the DAG can be as large as O((n + m) 2^m), where n is the number of variables and m is the size of the Boolean expressions

Enumerating the whole DAG is computationally infeasible

Pruning criteria
• Recognizing node equivalence (width reduction)
• Merging some reductions into a single one (height reduction)
• Delaying certain reductions (branch reduction)

Page 26: Challenges in Automatic Optimization of Arithmetic Circuits

There Is Scope for Further Pruning…

Area and delay for all 6-bit adders generated by our algorithm

Without any pruning, it would be impossible to handle expressions with more than five variables

Number of possible implementations: >10^60

Number of explored implementations: 2687

Number of actual Pareto-optimal implementations: 4

Page 27: Challenges in Automatic Optimization of Arithmetic Circuits

…but the Enumeration Algorithm Finds Interesting Non-Trivial Relations!

[Figure: partial-product array of a 4×4-bit multiplier with operands a3 a2 a1 a0 and b3 b2 b1 b0, one row of products per bit of b, with groupings of the products a1b0, a0b1, a2b0, a1b1, a0b2, a2b1 and a1b2 found by the enumeration algorithm.]

4×4-bit multiplier: better than our best manually-designed cell-based multiplier?!

The method has been generalized for higher-bitwidth multipliers

It reduced the delay of the best cell-based 8×8-bit multiplier by 10%

Verma & Ienne; ASP-DAC 2007 (Best Paper Award nominee)

Page 28: Challenges in Automatic Optimization of Arithmetic Circuits

Summary

Verma & Ienne; ICCAD 2004

Verma, Brisk, & Ienne; TCAD 2008

Optimizing at Word-Level

Verma & Ienne; DAC 2006

Exploring Microscopic Structure

Creating Macroscopic Structure

Verma, Brisk, & Ienne; DAC 2007 (Best Paper Award nominee)

Verma, Brisk, & Ienne; IWLS 2008

Page 29: Challenges in Automatic Optimization of Arithmetic Circuits

Computer Arithmetic and Automation

Computer arithmetic has long been the domain of extremely ingenious, manually developed architectures

Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit

Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues

It is perhaps high time to try to change all this…

Page 30: Challenges in Automatic Optimization of Arithmetic Circuits


Thanks!