ADVANCED BIT MANIPULATION
INSTRUCTIONS: ARCHITECTURE,IMPLEMENTATION AND APPLICATIONS
YEDIDYA HILEWITZ
A DISSERTATION
PRESENTED TO THE FACULTY
OF PRINCETON UNIVERSITY
IN CANDIDACY FOR THE DEGREE
OF DOCTOR OF PHILOSOPHY
RECOMMENDED FOR ACCEPTANCE
BY THE DEPARTMENT OF
ELECTRICAL ENGINEERING
ADVISER: RUBY B. LEE
SEPTEMBER 2008
© Copyright by Yedidya Hilewitz, 2008.
All Rights Reserved
Abstract
Advanced bit manipulation operations are not efficiently supported by commodity word-
oriented microprocessors. Programming tricks are typically devised to shorten the long
sequence of instructions needed to emulate these complicated bit operations. As these bit
manipulation operations are relevant to applications that are becoming increasingly impor-
tant, we propose direct support for them in microprocessors. In particular, we propose fast
bit gather (or parallel extract), bit scatter (or parallel deposit) and bit matrix multiply in-
structions, building on previous work which focused solely on instructions for accelerating
general bit permutations.
We show that the bit gather and bit scatter instructions can be implemented efficiently
using the fast butterfly and inverse butterfly network datapaths. We define static, dynamic
and loop invariant versions of the instructions, with static versions utilizing a much simpler
functional unit than dynamic or loop-invariant versions. We show how a hardware decoder
can be implemented for the dynamic and loop-invariant versions to generate, dynamically,
the control signals for the butterfly and inverse butterfly datapaths. We propose a new ad-
vanced bit manipulation functional unit to support bit gather, bit scatter and bit permutation
instructions and then show how this functional unit can be extended to subsume the func-
tionality of the standard shifter unit. This new unit represents an evolution in the design of
shifters.
We also consider the bit matrix multiply instruction. This instruction multiplies two
n × n bit matrices; it can be used to accelerate parity computation and is a powerful bit
manipulation primitive. Bit matrix multiply is currently only supported by supercomputers
and we investigate simpler bmm primitive instructions suitable for implementation in a
commodity processor. We consider smaller units that perform submatrix multiplication
and the use of the Parallel Table Lookup module to speed up bmm.
Additionally, we perform an analysis of a variety of different application kernels taken
from domains including bit compression, image manipulation, communications, random
number generation, bioinformatics, integer compression and cryptology. We show that us-
age of our proposed instructions yields significant speedups over a basic RISC architecture
– parallel extract and parallel deposit speed up applications 2.19× on average, while ap-
plications that benefit from bmm instructions are sped up 1.6×–5.0× on average for the
various bmm solutions.
Acknowledgements
First, I would like to thank my advisor, Professor Ruby B. Lee. She helped guide my
research throughout my graduate career and without her dedication and assistance, this
thesis would not have been produced.
I would like to thank Professor Niraj K. Jha and Professor Zhijie Jerry Shi for reading
my dissertation and providing valuable feedback.
I also thank Dr. Yiqun Lisa Yin for her valuable comments and suggestions and fruitful
collaborations, and Roger Golliver for giving me the opportunity to intern for him at Intel
Corporation and for his numerous suggestions regarding the applicability of my work.
I would also like to express gratitude to the National Science Foundation, the Fannie
and John Hertz Foundation and Sir Gordon Wu for their generous financial support which
made my life as a graduate student much easier.
I am grateful to my fellow members in the PALMS research group, including Jef-
frey Dwoskin, Cedric Lauradoux, Reouven Elbaz, Zhenghong Wang, David Champagne,
Nachiketh Potlapally and Michael Wang, for their friendship, collaboration and support. It
was a pleasure working with them.
Finally, I would like to thank my parents and my wife, Mindy. I am grateful for the
love and support they have given me throughout, and I am especially grateful to Mindy for
ensuring that this thesis was completed in a timely manner. Thus I dedicate this thesis to
her.
Contents
Abstract
Acknowledgements
1 Introduction
  1.1 Thesis Contributions
  1.2 Thesis Organization
2 Bit-Oriented Operations
  2.1 Word Level, Single Bitfield Operations
  2.2 Subword Operations
    2.2.1 Specialized Permutations
    2.2.2 General Permutations
  2.3 Bit Level Instructions
    2.3.1 Bit Permutations
  2.4 Advanced Bit-Oriented Operations and their Usage in Emerging Applications
    2.4.1 Bioinformatics
    2.4.2 Linear Cryptanalysis
    2.4.3 Software-Defined Radio
  2.5 Summary
3 Advanced Bit Manipulation Operations
  3.1 Instruction Definitions
    3.1.1 Bit Gather or Parallel Extract (pex)
    3.1.2 Bit Scatter or Parallel Deposit (pdep)
  3.2 Datapaths for Parallel Extract and Parallel Deposit
    3.2.1 Parallel Deposit on the Butterfly Network
    3.2.2 Parallel Extract on the Inverse Butterfly Network
    3.2.3 Need for Two Datapaths
    3.2.4 Decoding the Bitmask into Butterfly Control Bits
    3.2.5 Hardware Decoder
  3.3 Advanced Bit Manipulation Functional Unit
    3.3.1 Implementation of the Standalone Advanced Bit Manipulation Functional Unit
  3.4 Summary
  3.5 Appendix: Decoding the pex Bitmask into Inverse Butterfly Control Bits
4 Advanced Bit Manipulation Functional Unit as a Basis for Shifters
  4.1 Basic Shifter Operations
  4.2 Basic Shifter Operations on the Inverse Butterfly Datapath
    4.2.1 Determining the Control Bits for Rotations
    4.2.2 Determining the Control Bits for Other Shift Operations
  4.3 New Shifter Implementation
    4.3.1 Basic Shifter
    4.3.2 Full Shift-Permute Unit
  4.4 Evolution of Shifters
  4.5 Summary
5 Bit Matrix Multiply
  5.1 Definition of Bit Matrix Multiply Operation
  5.2 Bit Matrix Multiply Basic Operations
  5.3 Computing Bit Matrix Multiply using Existing ISA
    5.3.1 Conditional Row-wise Summation
    5.3.2 Table Lookup
    5.3.3 Implementation of bmm on Cray Vector Supercomputers
    5.3.4 Comparison of Existing ISA Methods
  5.4 New Architectural Techniques to Accelerate Bit Matrix Multiply
    5.4.1 Submatrix Multiplication
    5.4.2 Parallel Table Lookup
  5.5 Comparison of Techniques
    5.5.1 Performance
    5.5.2 Overhead
    5.5.3 Implementation Cost
  5.6 Related Work
  5.7 Summary
6 Applications and Performance Results
  6.1 Applications
    6.1.1 Binary Compression and Decompression
    6.1.2 Least Significant Bit Steganography
    6.1.3 Transfer Coding
    6.1.4 Integer Compression
    6.1.5 Binary Image Morphology
    6.1.6 Random Number Generation
    6.1.7 Bioinformatics
    6.1.8 Cryptology
    6.1.9 DARPA HPCS Discrete Math Benchmarks
    6.1.10 Communications
  6.2 Performance Results
  6.3 Summary
7 Conclusions and Future Research
  7.1 Advanced Bit Manipulation Instructions
  7.2 Advanced Bit Manipulation Functional Unit
  7.3 Advanced Bit Manipulation Functional Unit as Basis for the Shifter
  7.4 Implementation of Bit Matrix Multiply in Commodity Processors
  7.5 Application Analysis
  7.6 Future Research
Bibliography
Chapter 1
Introduction
With the rapid rise of computing power and with the ubiquity of computing devices, the
problems that computers are used to solve are continually evolving. Problems that were
once deemed computationally infeasible are now within the realm of solvability. Usage sce-
narios that were once implausible are now possible given advances in semiconductor manu-
facturing technology. Consequently, the capabilities of the microprocessors in computing
devices need to evolve as well.
Classically, microprocessors were designed so that basic operations, instructions, were
performed on basic data units termed processor words. Over time, the wordsize has grown
from 8 bits to 64 bits as the need and capability to operate on larger data sizes have expanded.
Instruction set architectures (ISAs), the set of instructions visible to the programmer, have
been adapted to add support for wider word sizes while still providing facilities to operate
on the smaller word sizes. For example, the IA-32 ISA [1] addition operation can add two
8-, 16-, 32- or 64-bit quantities. The Alpha ISA [2] addition operation operates only on 32-
or 64-bit quantities.
As workloads changed and multimedia grew in prominence, computers further evolved
to facilitate new applications. Multimedia processing consists of repeating similar opera-
tions on large arrays of small (8-bit or 16-bit) data. Thus, instructions were added to allow
computation not just on a single smaller element from a full processor word, but on a vector
of smaller elements packed into the processor word. A single addition instruction, for example, performs eight
adds of 8-bit quantities packed into 64-bit processor words. Lee introduced this concept
with PA-RISC MAX [3–5] and termed it subword parallelism. Most modern ISAs provide
support for subword parallelism [1, 3–15].
Workloads have again changed and processors now need to evolve once again to support
new and emerging workloads. In particular, the facility to operate on the bit level is becom-
ing of greater importance as exhibited by applications such as cryptography, cryptanalysis,
bioinformatics and software-defined radio. Since a microprocessor is optimized around
the processing of words or conveniently-sized subwords, it is not surprising that bit-level
operations are typically not as well supported by current word-oriented microprocessors.
While these bit-level operations can be built from the simpler logical and shift operations or performed using
programming tricks (cf. Hacker’s Delight [16]), the applications using these advanced bit
manipulation operations are significantly sped up if the processor can support more power-
ful bit manipulation instructions.
Supercomputers, such as the Cray vector machines [17], have had support for some of
these advanced bit manipulation operations, as applications such as cryptanalysis were once
relegated to supercomputers. As the classic supercomputer model has yielded to clusters of
commodity processors and as many emerging applications are performed in diverse settings
spanning workstations, desktop computers and low-powered handheld devices, we focus on
commodity processors.
Previous work on bit level instructions [18–27] considered only general bit permutation
instructions – instructions that facilitate achieving any of the n^n arbitrary permutations
with repetition or n! permutations without repetition of an n-bit processor word. However,
specialized permutations, such as bit gather and bit scatter operations, may be more useful
for the new applications. Additionally, the combination of bits in a processor word (as
opposed to the translocation of bits), such as the parity operation which combines bits with
Exclusive-OR (XOR), is important for many emerging applications, and prior work has
not addressed this at all. Thus new advanced bit-oriented instructions that directly perform
these operations can significantly speed up important emerging applications.
1.1 Thesis Contributions
This thesis focuses on two sets of advanced bit manipulation instructions: bit gather and bit
scatter, and bit matrix multiply. We propose the novel parallel extract instruction and par-
allel deposit instruction to perform bit gather and bit scatter operations, respectively. We
show how these instructions can be performed using the butterfly and inverse butterfly dat-
apaths (used for the previously proposed bfly and ibfly permutation instructions), give
an algorithm to dynamically generate the configuration for these datapaths, and propose a
new advanced bit manipulation functional unit that supports these four instructions.
This thesis next shows how the proposed advanced bit manipulation functional unit can
be used to support the functionality of the shifter unit, which performs basic single field bit-
oriented operations such as shift, rotate, extract and deposit (insert). We give an algorithm
for configuring the datapaths for rotations, which is the basic operation from which the
other single-field operations can be derived. This new shift-permute unit represents an
evolution in the design of shifters from the classic barrel and log shifter architectures. We
compare these functional units using both standard cell and FPGA implementations.
Also considered is the bit matrix multiply operation, which multiplies two n × n bit
matrices. This operation is useful for accelerating parity operations and can be used to
perform a superset of the advanced bit manipulation operations. Cray supercomputers
have a bit matrix multiply instruction, which requires significant hardware resources. We
investigate lower cost alternatives more suitable for commodity processors and compare
these to the best current software-based methods and to the Cray implementation.
This thesis also presents an analysis of the usage of the advanced bit manipulation in-
structions in various applications. These include cryptography, cryptanalysis, bioinformat-
ics, software-defined radio, steganography, random number generation, compression and
image manipulation. We quantify the speedups resulting from use of the new instructions.
1.2 Thesis Organization
This thesis is organized as follows. Chapter 2 discusses bit-oriented operations, includ-
ing instructions present in both commercial ISAs as well as academic proposals. Word
level, subword level and bit level operations are discussed. Chapter 2 also discusses emerging
applications and the operations needed for efficient execution of these applications, thus
providing the context and motivation for architectural enhancements to support advanced
bit manipulation operations.
Chapter 3 introduces the novel parallel extract and parallel deposit instructions, which
perform bit gather and bit scatter operations, respectively. Chapter 3 then investigates the
datapaths needed for implementing the new instructions and shows how a single functional
unit can be designed to support bit gather, bit scatter and bit permutation operations. This
chapter is based in part on work presented by the author in [28–31].
Chapter 4 shows how the advanced bit manipulation functional unit can be modified
to support the standard operations performed by the shifter in current microprocessors and
presents a detailed description of the complete functional unit. This chapter is based in part
on work presented by the author in [32, 33].
Chapter 5 discusses the bit matrix multiply operation. Bit matrix multiply is defined and
software-based methods for computing the operation are discussed. Then, new architectural
techniques for computing bit matrix multiply are proposed. This chapter is based in part on
work presented by the author in [34–36].
Chapter 6 delineates the applications that benefit from advanced bit manipulation op-
erations and presents performance results which quantify the speedup obtained using the
new instructions.
Chapter 7 concludes the thesis and discusses directions and opportunities for future
research.
Chapter 2
Bit-Oriented Operations
Bit patterns held in processor registers can be interpreted by instructions in two ways.
First, they can be viewed as having a meaning, such as representing text characters or
numbers, whether integer or floating point, single word or multiple subwords. Instructions
then perform string processing operations on text characters or arithmetic operations on
these numbers. Alternatively, processor registers can simply hold bit patterns that do not
necessarily have a higher level, abstract interpretation. Instructions then perform operations
on bits and bit fields. In this chapter, we detail the various subcategories of the latter type
of instructions, moving from instructions that operate on a word level on single bit fields, to
instructions that operate on multiple bit fields held in subwords to instructions that operate
on the individual bits.
A breakdown of the subcategories is shown in Figure 2.1 and the instructions are listed
in Table 2.1. In our analysis, we include instructions from the POWER, IA-32/AMD64
and Itanium ISAs, as well as proposed instructions from academic researchers. We note
that the instructions listed generally omit the part of the opcode that refers to the size of sub-
words. Also note that, in the text, POWER includes integer and vector (AltiVec) opera-
tions [15]; IA-32 includes x86, x86-64, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
and AVX [1, 37]; and AMD64 includes x86, x86-64, 3DNow!, MMX, SSE, SSE2, SSE3,
SSE4a and SSE5 [38, 39].
We show that there currently are no implementations of specialized advanced bit manip-
ulation instructions in commercial processors, and that recently proposed bit-level instructions
in the research community focus on efficient computation of general, arbitrary permuta-
tions [19–27]. Additionally, support for combining bits in processor words, such as the
parity operation that combines bits with XOR, is lacking – current ISAs only have lim-
ited parity computation features. Notwithstanding the lack of support for these operations,
there is a need for these specialized instructions for important emerging applications such
as bioinformatics, linear cryptanalysis and software-defined radio. Thus this thesis pro-
poses new instructions to perform these operations.
This chapter is organized as follows. Section 2.1 describes bit-oriented instructions that
operate on a single field of bits. Section 2.2 discusses instructions that manipulate bits in
subwords in parallel, including both specialized and general rearrangement instructions.
Section 2.3 deals with instructions that operate on an individual bit basis including instruc-
tions to perform general bit permutations. The reader can choose to refer to Figure 2.1
and Table 2.1 for a summary of these instructions and read the sections in detail only as
desired. Section 2.4 highlights the lack of support for certain specialized bit manipulation
operations and parity computation, and discusses the applications that make use of these
operations. Section 2.5 summarizes this chapter.
2.1 Word Level, Single Bitfield Operations
Instructions that operate on a word level, performing operations on a single bit field, are
present in all instruction set architectures. The most basic example is the standard shift
instruction which moves a single field of bits to the right or the left (Figure 2.2(a)). The
emptied bit positions are zero-filled on the right for left shifts and zero-filled or filled with
the sign bit on the left for right shifts. The shift amount is specified statically, in an imme-
diate field in the opcode, or dynamically, with a second input register.
Figure 2.1: Classification of Types of Bit-Oriented Instructions by Word, Subword or Bit Level
Table 2.1: Bit-oriented operations in various ISAs

Word level:
  Basic:
    POWER: shift (sl, sr, sra)
    IA-32/AMD64: shift (shl, shr, sal, sar), rotate (rol, ror)
    Itanium: shift (shl, shr, shr.u)
  Composite:
    POWER: rotate-and-clear (rlc), rotate-mask-and-insert (rlmi), double shift (vsldoi)
    IA-32/AMD64: subword extract (pextr^i, extract^i), extract (extrq^a), subword insert (pinsr^i, insert^i), insert (insertq^a), double shift (shld, shrd, palignr^i)
    Itanium: extract (extr, extr.u), deposit (dep, dep.z), shift pair (shrp)
  Analysis:
    POWER: count leading zeros (cntlz)
    IA-32/AMD64: population count (popcnt), count leading zeros (bsr, lzcnt^a), count trailing zeros (bsf)
    Itanium: population count (popcnt)
  Other:
    IA-32/AMD64: crc32^i, AES encrypt/decrypt^i (aesenc, aesdec)

Subword level:
  Parallel operations:
    POWER: parallel shift and rotate (vr, vsl, vsr, vsra), parallel population count (popcntb)
    IA-32/AMD64: parallel shift and rotate (pssl, psrl, psra, psha^a, pshl^a, prot^a)
    Itanium: parallel shift (pshl, pshr, pshr.u)
    Other: group-withdraw, group-deposit [18]
  Specialized rearrangement:
    POWER: merge (vmrg), splat (vsplt)
    IA-32/AMD64: unpack (unpck, punpck), select^i (blend, pblend), reverse (bswap, pswapd^a), duplicate (movdup), broadcast (vbroadcast^i)
    Itanium: unpack, mix, reverse (mux1), broadcast (mux1), misc. (mux1)
    Other: mixpair [40], select (check [41]), exchange [41], excheck [41], group-shuffle [18], group-swizzle [18]
  General rearrangement:
    POWER: vperm
    IA-32/AMD64: pshufb^i, pshufw, pshufd, shuf, perm^a, pperm^a, vperm^i
    Itanium: mux2
    Other: select-bytes [18]

Bit level:
  Bit slice:
    POWER: logical, select (vsel)
    IA-32/AMD64: logical, select (pcmov^a)
    Itanium: logical
  Subset parity:
    POWER: prty
    Other: This Thesis
  Specialized rearrangement:
    IA-32/AMD64: mask extract (pmovmskb, movmsk)
    Other: This Thesis
  General rearrangement:
    Other: grp [20, 21], cross [20, 22], omflip [20, 23], bfly, ibfly [24–26], bpi [27], xbox [19], pperm [20], swperm, sieve [42, 43], group-8-mux, group-transpose-8-mux [18]

(^i: Intel-only extensions; ^a: AMD-only extensions)
Another single-field instruction is the rotate instruction. The rotate instruction shifts
bits to the right or left, except the bits that are shifted out are instead wrapped around
(Figure 2.2(b)).
Figure 2.2: (a) shl r1 = r2, s (b) rotl r1 = r2, s
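
Since a processor without a rotate instruction must build it from shifts and masks, it is worth seeing the reduction concretely. The following is a minimal illustrative C sketch (not from the thesis); the function name is ours:

    #include <stdint.h>

    /* Rotate left by s: bits shifted out on the left re-enter on the
       right. The s == 0 case is handled separately because a shift
       by 64 is undefined behavior in C. */
    uint64_t rotl64(uint64_t x, unsigned s) {
        s &= 63;
        return s ? (x << s) | (x >> (64 - s)) : x;
    }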
The shift and rotate instructions are the most basic instructions that manipulate a single
bitfield. However, some ISAs have more advanced instructions that combine multiple in-
struction operations into a single instruction. The most common are the shift-and-mask and
shift-and-mask-and-merge paradigms. Among these are instructions that extract (or with-
draw) a byte or subword – the specified byte or subword is shifted to the least significant
position and then the other bits are zeroed out. There are also instructions that insert (or
deposit) a byte or subword. Instructions like these include the IA-32 pextr and pinsr
instructions that operate on integer subwords and extract and insert that operate on
floating point subwords.
Other ISAs such as Itanium and AMD64 include general field extract and deposit or
insert instructions (Itanium – extr/dep, AMD64 – extrq/insertq). The extract op-
eration selects a single field of bits of arbitrary length from any arbitrary position in the
source register and right justifies that field in the result (Figure 2.3(a)). Extract is equiva-
lent to a shift right and mask operation. In Itanium, extract has both unsigned and signed
variants. In the latter, the sign bit of the extracted field is propagated to the most-significant
bit of the destination register.
The deposit operation takes a single right justified field of arbitrary length from the
11
source register and deposits it at any arbitrary position in the destination register (Fig-
ure 2.3(b)). Deposit is equivalent to a left shift and mask operation. In Itanium, there are
two variants of deposit: the remaining bits can be zeroed out, or they are supplied from
a second register, in which case a masked merge is required. Extract and deposit were
originally introduced in the PA-RISC ISA [44, 45].
Figure 2.3: (a) extr r1 = r2, pos, len (b) dep r1 = r2, pos, len
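
As a concrete reading of the equivalences just stated, here is a minimal C sketch (function names are ours, not ISA mnemonics) of unsigned extract as shift-right-and-mask and deposit-with-zero as mask-and-shift-left:

    #include <stdint.h>

    /* Sketch of unsigned extract (cf. Itanium extr.u): select len
       bits starting at bit pos and right-justify them. */
    uint64_t extract_u(uint64_t x, unsigned pos, unsigned len) {
        uint64_t mask = (len < 64) ? (1ULL << len) - 1 : ~0ULL;
        return (x >> pos) & mask;
    }

    /* Sketch of deposit-with-zero (cf. Itanium dep.z): place a
       right-justified len-bit field at bit pos, zeroing other bits. */
    uint64_t deposit_z(uint64_t x, unsigned pos, unsigned len) {
        uint64_t mask = (len < 64) ? (1ULL << len) - 1 : ~0ULL;
        return (x & mask) << pos;
    }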
The Power ISA has even more general rotate-and-clear and rotate-and-mask-and-insert
instructions that rotate an operand and then perform a mask operation or a masked-merge
operation on a single arbitrary field of bits. These instructions can be used for shift, rotate, zero
extension, single field extract and insert among other operations.
Another advanced single field instruction is the double shift or shift pair instruction
(PA-RISC – shd, IA-32/AMD64 – shld/shrd, Itanium – shrp). This instruction is a variation
of the shift instruction and it concatenates two n-bit processor words, shifts the bits right (or
left) by a specified amount and then outputs the low (or high) n bits (Figure 2.4). Similar
instructions that concatenate two vector words and shift by an integer number of bytes
exist in IA-32 (palignr) and POWER (vsldoi – Vector Shift Left Double by Octet
Immediate). These instructions are useful for processing data streams when the data can
cross word boundaries. The double shift instruction can also be used to perform rotations
when the same word is used as both input operands.
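
A sketch of the shift-pair semantics in C (our naming; the real instructions take the shift amount from an immediate or a register):

    #include <stdint.h>

    /* Sketch of shrp: treat r2:r3 as a 128-bit value (r2 high),
       shift right by s and keep the low 64 bits. With r2 == r3
       this degenerates to a rotate right, as noted above. */
    uint64_t shrp64(uint64_t r2, uint64_t r3, unsigned s) {
        s &= 63;
        return s ? (r3 >> s) | (r2 << (64 - s)) : r3;
    }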
Other instructions that operate on a word level are designed to obtain information about
the bit patterns in the word. There are instructions that count the number of one bits in
a word (population count, popcnt in IA-32/AMD64 and Itanium), and instructions that
Figure 2.4: shrp r1 = r2, r3, s
count the number of leading or trailing zero bits in a word or, alternatively, give the index of
the first one bit, such as the bit scan reverse (bsr) and bit scan forward (bsf) instructions
in IA-32/AMD64. Other operations include parity. Parity is implicitly computed in IA-
32/AMD64 – the PF flag is set to the (odd) parity of the least significant byte of the result
of an instruction. Notably, none of the POWER, IA-32/AMD64 or Itanium ISAs has an
instruction to compute the parity of a whole word.
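
Absent such an instruction, whole-word parity is computed in software with a lg(n)-step XOR fold; a minimal sketch:

    #include <stdint.h>

    /* XOR-fold parity of all 64 bits: each step XORs the upper half
       into the lower half, so after six steps bit 0 holds the parity. */
    unsigned parity64(uint64_t x) {
        x ^= x >> 32;
        x ^= x >> 16;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return (unsigned)(x & 1);
    }

Six dependent instructions for one parity bit is exactly the kind of cost that motivates direct hardware support.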
Recently, even more specialized instructions have been introduced in IA-32 that per-
form AES encryption or compute the cyclic redundancy checksum of a processor word.
2.2 Subword Operations
Subword parallel ISA extensions include instructions for manipulating subwords. Often
these are also called multimedia extensions, as multimedia workloads motivated the intro-
troduction of new instructions. These ISA extensions were first introduced in the 1990s
with PA-RISC MAX-1 [3, 4] followed by Sun SPARC VIS [9], Intel IA-32 MMX [6] and
IBM PowerPC AltiVec [10]. These extensions evolved over time – PA-RISC added MAX-
2 [5] and IA-32 added SSE [8]. Modern subword parallel extensions include those built
into the Itanium ISA [11] (which can be viewed as the successor to PA-RISC), further SSE
extensions from Intel [1, 37] and AMD [38, 39], and VIS [14] and AltiVec [46] additions.
The most basic operations are the extensions of the single field manipulations to sub-
words. For example, POWER, IA-32/AMD64 and Itanium all include parallel shift in-
structions which perform a standard shift operation on each subword in parallel. In IA-
32 and Itanium all the subwords are shifted by the same amount, while in POWER and
AMD64 each subword is shifted by the count in the corresponding subword in a second
input operand. POWER and AMD64 also contain parallel rotate instructions. The Mi-
croUnity MediaProcessor architecture [18] has instructions to perform withdraw (extract)
and deposit operations on every subword. The POWER ISA also has an instruction to com-
pute the population count of each byte of a word and store the count in the corresponding
byte of the result.
The ISA extensions also contain instructions for the rearrangement or permutation of
subwords.
2.2.1 Specialized Permutations
The rearrangement instructions often are specialized to perform a particular type of permu-
tation, such as interleaving subwords from two words. For example, the mix instruction
is one such subword permutation operation, initially designed to accelerate multimedia ap-
plications operating on subwords of 8, 16 or 32 bits, packed into a word. mix.left
selects the left subwords from each pair of subwords alternating between the two source
registers r2 and r3 (Figure 2.5(a)). mix.right does the same on the right subwords of the
two source registers (Figure 2.5(b)). The mix instruction was first introduced in PA-RISC
for multimedia acceleration [5], and also appears in Itanium, where it is implemented in a
separate multimedia functional unit. No ISA currently supports mix for subwords smaller
than a byte, although this is very useful, e.g., for bit matrix transposition and fast parallel
sorting [47].
Another specialized permutation instruction that interleaves subwords is the unpack
instruction found in ISAs such as IA-32/AMD64 and Itanium or the merge instructions
found in POWER. unpack or merge interleaves subwords from the high or low half
Figure 2.5: (a) mix.left r1 = r2, r3; (b) mix.right r1 = r2, r3.
of two words (Figure 2.6). unpack can be used to expand small subwords into larger
ones, as can the mix instruction. (Note these ISAs also include instructions that increase
the precision of the subwords. We do not include these here as they are more similar to
arithmetic instructions than to bit field operations.)
Figure 2.6: (a) unpack.h r1 = r2, r3; (b) unpack.l r1 = r2, r3.
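
To make the interleaving concrete, here is a C sketch of mix.left on 16-bit subwords of a 64-bit word (Figure 2.5(a)); the mask value is specific to this subword size and the naming is our own illustration:

    #include <stdint.h>

    /* Sketch of mix.left for 16-bit subwords: keep the left subword
       of each pair from r2 in place, and drop the left subword of
       each pair from r3 into the neighboring position. */
    uint64_t mix_left_16(uint64_t r2, uint64_t r3) {
        const uint64_t left = 0xFFFF0000FFFF0000ULL; /* left subword of each pair */
        return (r2 & left) | ((r3 & left) >> 16);
    }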
There are also specialized instructions for selecting subwords from two words. The
check instruction, proposed by Lee in [41], combines two words by alternating subwords
between words (Figure 2.7) – selecting subwords in a checkerboard fashion. The IA-32
pblend instructions combine two words by selecting subwords from the words under
control of a mask.
Other specialized permutation instructions include:
• bswap (IA-32/AMD64), which reverses the byte order of a word, and pswapd
Figure 2.7: check r1 = r2, r3
(AMD64), which reverses the order of 32-bit subwords;
• mux1 (Itanium), which performs a small set of specialized byte permutations includ-
ing byte reverse, least significant byte broadcast, and mix, alt and shuffle operations
across the two halves of a word (Figure 2.8);
• splat (POWER), which broadcasts any subword or immediate to all other sub-
word positions, movdup (IA-32/AMD64) which duplicates two single or one double
precision floating point values and broadcast (IA-32) which broadcasts a single
floating point value in memory to all 32-bit subword positions;
• mixpair (proposed by Lee in [40] and [41]), which combines mix.left and
mix.right into one instruction (this instruction requires 2 result registers);
• exchange (proposed by Lee in [41]), which exchanges neighboring subwords in a
word;
• excheck (proposed by Lee in [41]), which exchanges neighboring subwords across
two words (a combination of exchange and check);
• group-shuffle (MediaProcessor), which performs a shuffle operation on sub-
words based on immediate parameters x, y and z which control the size in bits over
which symbols are shuffled, the size of symbols and the degree of shuffling, respec-
tively. Bit j of the 128-bit output r1 is sourced from bit i of the two 64-bit inputs r2||r3,
where j = i_{6..x} || i_{y+z−1..y} || i_{x−1..y+z} || i_{y..0}; and
• group-swizzle (MediaProcessor), which performs a swizzle operation on sub-
words based on immediate parameters icopy and iswap. Bit j of the 128-bit output
r1 is sourced from bit i of the 128-bit input r2 where j = (i & i_copy) ⊕ i_swap.
Figure 2.8: mux1 r1 = r2 (a) @rev (b) @mix (c) @shuf (d) @alt (e) @brcst.
2.2.2 General Permutations
The subword parallel ISAs also include general permute instructions. These instructions
can arbitrarily reorder the subwords of 1 or 2 words, including repetition. These instruc-
tions include:
• vperm (POWER), pshufb (IA-32), pperm (AMD64) and select-bytes (Me-
diaProcessor) for arbitrary byte permutations;
• pshufw (IA-32/AMD64) and mux2 (Itanium) for 16-bit subwords;
• pshufd (IA-32/AMD64) for 32-bit subwords; and
• perm (AMD64), shuf (IA-32/AMD64) and vperm (IA-32) for floating point val-
ues.
vperm (POWER), pperm and perm permute the bytes or floating point values of 2
words (i.e., the k output bytes or floating point values are selected from among 2k input
bytes or floating point values). vperm (POWER), pperm and perm use a third input
operand to contain a list of indices (POWER and AMD64 SSE5 are 3-input operand ar-
chitectures). Each byte or subword in the third operand contains the index of the byte or
subword to be written to that position. The pperm instructions can also, based on the high
3 bits of the index byte, invert, bit reverse, invert and bit reverse, replicate the sign bit or
replicate the inverse of any source byte before writing the destination byte or it can zero
the destination byte or set it to all ones. The perm instruction can also compute the ab-
solute value, the negative value or the negative of the absolute value of the input floating
point value before writing the destination value, or it can set the destination to +0.0, -1.0,
+1.0 or π. pshufb and select-bytes only permute the bytes of 1 word and use the
second input operand to hold a list of indices that describe the source of each output byte
(Figure 2.9). The pshufb instruction can also zero any byte by setting the high bit of the
corresponding index byte to “1”.
Figure 2.9: byte permute r1 = r2, r3
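
A reference sketch of the indexed byte permute of Figure 2.9 in C (this models the one-word form, ignoring zeroing features such as pshufb's high index bit; byte ordering conventions vary by ISA):

    #include <stdint.h>

    /* Each byte of the index word r3 names the source byte of r2
       that is copied to that position of the result. */
    uint64_t byte_permute(uint64_t r2, uint64_t r3) {
        uint64_t r1 = 0;
        for (int pos = 0; pos < 8; pos++) {
            unsigned idx = (unsigned)(r3 >> (8 * pos)) & 0x7; /* 3-bit source index */
            uint64_t b = (r2 >> (8 * idx)) & 0xFF;            /* fetch source byte */
            r1 |= b << (8 * pos);
        }
        return r1;
    }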
vperm (IA-32) permutes the floating point values of 1 or 2 words, but the values are
restricted to being chosen from within the corresponding half of the register(s) (for 256-bit
registers). vperm (IA-32) uses a third input operand or an immediate to hold a list of
indices (IA-32 AVX is also a 3-input operand architecture). The instruction can also set a
value to all zeros.
pshufw, mux2, pshufd permute the values in 1 word and use an 8-bit immediate in
the opcode that specifies how the subwords are permuted. pshufw only permutes half of
the 128-bit register at a time, as only four 16-bit source fields can be specified in the 8-bit
immediate.
The shuf instructions permute the floating point values in 2 words, but each half of the
result contains values from only one of the inputs. An immediate in the opcode specifies
how the subwords are permuted.
2.3 Bit Level Instructions
Other sets of instructions address individual bits. Some instructions operate on a bit-slice
level. In this view, an n-bit processor acts like n 1-bit processors in parallel. The classic
logic instructions (and, or, xor, etc.) operate in this mode – each bit of output is the
logical combination of the bits in the corresponding position of the inputs. The POWER
vsel and the AMD64 pcmov instructions select between two input words on a bit-slice
level – each bit of output is selected from the bits in the corresponding position of the inputs
depending on the value of the bit in the same position of a third input (Figure 2.10).
Figure 2.10: vsel r1 = r2, r3, r4
Instructions can also operate on collections of individual bits. There are instructions
that compute the parity of a fixed subset of the bits, as in the POWER instruction prty
which computes the parity of the least significant bit of the bytes of a word (Figure 2.11).
The pmovmskb IA-32/AMD64 instruction extracts the most significant bits of the bytes of
a word (Figure 2.12). movmsk (IA-32/AMD64) does the same for the sign bits of packed
floating point values. However, there are no instructions that can compute the parity of an
arbitrary subset of bits (or the entire word, as mentioned earlier), extract an arbitrary bit
pattern or insert bits in arbitrary positions.
Figure 2.11: prty r1 = r2
Figure 2.12: pmovmskb r1 = r2
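
A sketch of the pmovmskb semantics on one 64-bit word (the real instruction operates on MMX or SSE registers):

    #include <stdint.h>

    /* Gather the most significant bit of each of the 8 bytes into
       the low 8 bits of the result, as in Figure 2.12. */
    unsigned movmskb64(uint64_t x) {
        unsigned mask = 0;
        for (int i = 0; i < 8; i++)
            mask |= (unsigned)((x >> (8 * i + 7)) & 1) << i;
        return mask;
    }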
2.3.1 Bit Permutations
Another operation on individual bits is the arbitrary rearrangement of bits in a processor
word. Bit permutation is very difficult for word-oriented processors to compute. However,
these rearrangements are important for a wide variety of fields and for block ciphers, in
particular, where they are used to achieve diffusion. Diffusion spreads the redundancy of
the plaintext across large sections of the ciphertext. For example, the DES cipher [48]
has a permutation in each round. Due to the lack of hardware support for permutations,
the Advanced Encryption Standard finalists [49–53], for which good software performance
was a prerequisite, generally do not employ permutations in the round functions.
To perform a permutation in software, each bit has to be isolated, shifted to the per-
muted position and ored with the result. This requires approximately 4n instructions
(mask load/generation, and, shift and or for each bit). For a given, fixed permuta-
tion, a set of lookup tables can be constructed where each table contains the permutation
of a subset of bits. For example, to permute a 64-bit word, eight 256 entry tables are con-
structed that describe how to permute the bits in each of the eight bytes of the word. The
results of the lookups are ored together to produce the final permutation. Using lookup ta-
bles, 23 instructions are required for a permutation (8 byte extract, 8 lds and 7 ors).
However, this method may require many tens or hundreds of cycles if the tables are not in
cache. Additionally, a set of tables is needed for each distinct permutation.
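
The isolate-shift-OR method reads directly as a loop; a C sketch (the perm array, our own convention, gives one destination index per source bit):

    #include <stdint.h>

    /* Naive software bit permutation: isolate each source bit, move
       it to its destination position and OR it into the result,
       roughly the 4n-instruction pattern described above. */
    uint64_t permute_naive(uint64_t x, const unsigned perm[64]) {
        uint64_t r = 0;
        for (int i = 0; i < 64; i++)
            r |= ((x >> i) & 1ULL) << perm[i];
        return r;
    }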
Consequently, ISA designers have proposed new instructions to allow for arbitrarily
manipulating bits in a fashion similar to the arbitrary manipulation of subwords (Sec-
tion 2.2.2). There are two classes of instructions, to handle permutations with and without
repetitions.
Specifying any one of the n! arbitrary permutations without repetition of an n-bit word
requires lg(n!) bits, which is ≤ n × lg(n) bits. Using instructions that conform to the
standard microprocessor datapath, i.e., two n-bit input operands and a single n-bit result,
a sequence of lg(n) instructions is required. Each of these instructions inputs either the
initial data or a partially permuted intermediate result as one operand and the next n bits of
the permutation description as the other operand. After lg(n) iterations, the final permuted
result is obtained. For example, in a 64-bit microprocessor where n = 64, any one of
the 64! arbitrary permutations of the 64 bits in a register can be achieved with a sequence
of at most 6 instructions. Three potential instructions that fit this model were previously
proposed by the PALMS (Princeton Architecture Laboratory for Multimedia and Security)
group [54]: grp, cross and omflip.
grp – In [20, 21] the group (grp) bit permutation instruction was defined. grp is a
permutation primitive that gathers to the right the data bits from r2 selected by “1”s in the
mask r3, and to the left those selected by “0”s in the mask (see Figure 2.13). It was shown
that arbitrary bit permutations can be accomplished by a sequence of at most lg(n) of these
grp instructions.
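
A behavioral sketch of grp in C (illustrative only; the hardware of [20, 21] computes this in a single instruction, not a loop):

    #include <stdint.h>

    /* grp semantics (Figure 2.13): data bits of r2 selected by '1'
       mask bits in r3 gather to the right, bits selected by '0'
       gather to the left, each group keeping its relative order. */
    uint64_t grp64(uint64_t r2, uint64_t r3) {
        uint64_t r1 = 0;
        int pos = 0;
        for (int i = 0; i < 64; i++)          /* '1'-selected bits first */
            if ((r3 >> i) & 1ULL)
                r1 |= ((r2 >> i) & 1ULL) << pos++;
        for (int i = 0; i < 64; i++)          /* then '0'-selected bits */
            if (!((r3 >> i) & 1ULL))
                r1 |= ((r2 >> i) & 1ULL) << pos++;
        return r1;
    }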
The grp instruction has other useful properties. It has good cryptographic characteris-
tics and thus is a good candidate to use as a primitive operation in future block ciphers [55].
Figure 2.13: grp r1 = r2, r3
It can also be used for a fast subword radix sorting algorithm [47]. The grp instruction is
also scalable – it is efficient at performing permutations on 2n bits [20]. However, the grp
functional unit has a long latency [56].
cross – The cross instruction [20, 22] permutes bits using the Benes network [57],
a general permutation network. The Benes network consists of the concatenation of a
butterfly network and an inverse butterfly network. The structure of the networks is shown
in Figure 2.14, where n = 8. The n-bit networks consist of lg(n) stages and thus the full
Benes network has 2 × lg(n) − 1 stages (one stage is eliminated due to the equivalence
of the first and last stages of the two subnetworks). Each stage is composed of n/2 2-
input switches, each of which is constructed using two 2:1 multiplexers. In the ith stage (i
starting from 1), the input bits are n/2^i positions apart for the butterfly network and 2^(i−1)
positions apart for the inverse network. A switch either passes through or swaps its inputs
based on the value of a control bit. Thus n/2 control bits are required per stage.
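
As a behavioral model, one stage of a 64-bit butterfly network can be sketched in C as follows (the control-bit ordering is our assumption; implementations may number switches differently):

    #include <stdint.h>

    /* Apply stage `stage` (1-based) of a 64-bit butterfly network:
       bits dist = 64 >> stage positions apart form a switch pair;
       a '1' control bit swaps the pair, a '0' passes it through. */
    uint64_t butterfly_stage(uint64_t x, unsigned stage, uint32_t ctrl) {
        unsigned dist = 64u >> stage;   /* n/2^i for n = 64 */
        unsigned c = 0;                 /* one control bit per switch */
        for (unsigned base = 0; base < 64; base += 2 * dist)
            for (unsigned j = 0; j < dist; j++, c++)
                if ((ctrl >> c) & 1u) {
                    unsigned lo = base + j, hi = base + j + dist;
                    uint64_t a = (x >> lo) & 1ULL, b = (x >> hi) & 1ULL;
                    x &= ~((1ULL << lo) | (1ULL << hi));
                    x |= (b << lo) | (a << hi);
                }
        return x;
    }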
The cross instruction permutes bits using two stages of the Benes network. One of
the two input operands holds the data to be permuted while the other input operand is used
to hold the control bits for two stages (the other stages are configured to pass through the
bits). An additional sub-opcode is used to indicate which two stages are used. lg(n) cross
instructions are required for any arbitrary permutation of n bits.
omflip – The omflip instruction [20, 23] permutes bits using 2 stages of omega and
flip networks, which are isomorphic to butterfly and inverse butterfly networks, respec-
Figure 2.14: (a) 8-bit butterfly network and (b) 8-bit inverse butterfly network.
tively. The advantage of omega and flip networks is that the stages are all identical. Similar
to cross, the control bits for the 2 stages are supplied using an input operand and lg(n)
omflip instructions are needed for arbitrary n-bit permutations.
While the above instructions can be used to achieve any arbitrary permutation of n bits
in lg(n) instructions, for n being the word-size of the processor, there are other bit permutation
primitives that can do this in only 1 or 2 instructions. These are the butterfly (bfly)
and inverse butterfly (ibfly) pair of instructions [24–26] and the general bit permutation
instruction (bpi) [27]. These instructions both use the Benes network to permute bits. The
bpi instruction uses the full network while bfly and ibfly use the butterfly and inverse
butterfly subnetworks, respectively, separately. Consequently, only a single execution of
bpi or bfly followed by ibfly is required to compute any of the n! permutations of n
bits.
As the Benes network requires n/2 × (2 × lg(n) − 1) configuration bits (n/2 × 2 × lg(n)
when considering the butterfly and inverse butterfly networks separately), clearly these in-
structions do not follow the standard 2-input, 1-result model. A number of solutions to
this issue are possible [25, 26]. In particular, we focus on the solution in which extra reg-
isters are associated with the functional unit and hold the control bits. Some ISAs have
pre-defined registers that can be used as these extra registers such as the application reg-
isters in Itanium. In this thesis, we will adopt this nomenclature and call them application
registers, or ARs. For n = 64, three 64-bit application registers are needed to hold the
configuration bits for each of the butterfly and inverse butterfly networks, while 352 bits of
extra application register storage are needed for the entire Benes network used by the bpi
instruction. These registers are loaded n or 2n bits at a time from general purpose registers.
There are a number of tradeoffs for either using the entire Benes network in one instruc-
tion (bpi) or splitting it into its constituent butterfly and inverse butterfly networks (bfly
and ibfly). Each of the butterfly and inverse butterfly networks is faster than an ALU
of the same width, since an ALU also has lg(n) stages which are more complicated than
those of the butterfly and inverse butterfly circuits. Assuming a processor cycle to be long
enough to cover the ALU latency, each bfly and ibfly instruction will have single cy-
cle latency, since these circuits are simpler than an ALU’s circuit. The full Benes network,
on the other hand, may have greater latency than the ALU, and thus the bpi instruction
may have 2 cycle latency. Additionally, there are a great many simple permutations that
require only one of the butterfly or inverse butterfly subnetworks. Use of bpi for these
permutations still requires routing the data bits through both subnetworks. One disadvan-
tage of bfly and ibfly is that two instructions must be issued for general permutations.
However, given superscalar resources, this could have negligible effect on bit permutation
workloads.
New instructions for performing bit permutations with repetitions have also been pro-
posed. The xbox instruction [19] and pperm instruction [20] (not to be confused with the
AMD64 pperm instruction) were introduced to accelerate permutations in symmetric-key
ciphers. These instructions take 2 source operands – the data to be permuted and a list of 8
indices (for 64-bit registers) that describe how to produce a byte of the permutation, similar
to the subword permute instructions listed in Section 2.2.2. 8 of these instructions (and 7
or instructions) are required to permute a 64-bit word. The pperm3r instruction [21] is
a variation of pperm in which the byte of the permutation produced by the instruction is
ORed with the partially permuted result supplied with a third input operand, obviating the
need for an explicit or instruction.
The swperm and sieve instructions [42, 43] can also produce arbitrary bit permuta-
tions with repetitions. The swperm instruction is similar to the byte permute instructions
in commodity ISAs mentioned above except that it operates on 4-bit subwords rather than
bytes. The sieve instruction permutes bits within each 4-bit subword. Using swperm
and sieve, an arbitrary permutation with repetitions of 64 1-bit subwords can be per-
formed with 11 instructions and an arbitrary permutation of 32 2-bit subwords can be per-
formed with 5 instructions.
The MicroUnity MediaProcessor architecture [18] also has a bit permutation instruction
group-8-mux. This instruction operates on each byte of the input operand, performing
any permutation with repetitions of the bits within a byte. As an index for the source of each
bit is specified, 64×3 index bits are required. The MediaProcessor is a 128-bit architecture
and supports 3 input operands, so all these bits are supplied in a single instruction. The
instruction group-transpose-8-mux also transposes the input (interpreting the 128-bit
data input as two 8×8 bit matrices) before permuting the bits within each byte. A sequence
of three such permutation instructions is equivalent to first independently permuting each
row of the 8×8 matrix, then permuting each column of the matrix and then permuting each
row again. This sequence of operations can achieve any permutation of the 64 bits. The
same permutation is performed on both 64-bit subwords in the first register. Note that this
is fairly expensive in terms of register usage.
2.4 Advanced Bit-Oriented Operations and their Usage in
Emerging Applications
In the above classification, we noted three operations which are neither supported by com-
mercial ISAs nor performed by proposed academic architectures. These are parity compu-
tation of entire words or arbitrary subsets of bits in a word, the extraction of bits from arbitrary
positions and the insertion of bits to arbitrary positions (or bit gather and bit scatter opera-
tions, Figure 2.15). While the existing parity instructions and the proposed bit permutation
instructions can be used to perform these operations at some cost, direct support for these
operations would greatly benefit emerging applications. We now describe these emerging
applications and highlight how parallel generalized parity operations and bit gather and
scatter operations play integral roles in executing these applications. These applications
are reviewed in Chapter 6 when we detail all the applications that benefit from the new
instructions proposed in Chapters 3 and 5.
Figure 2.15: (a) Bit gather operation; (b) Bit scatter operation.
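
The semantics of the two operations can be specified precisely with reference loops; the following C sketch is a specification of behavior only, not of the hardware proposed in Chapter 3 (function names are ours):

    #include <stdint.h>

    /* Bit gather (parallel extract): bits of data selected by '1's
       in mask are packed, in order, into the low end of the result. */
    uint64_t bit_gather(uint64_t data, uint64_t mask) {
        uint64_t r = 0;
        int k = 0;
        for (int i = 0; i < 64; i++)
            if ((mask >> i) & 1ULL)
                r |= ((data >> i) & 1ULL) << k++;
        return r;
    }

    /* Bit scatter (parallel deposit): the low bits of data are
       spread, in order, to the positions of the '1's in mask. */
    uint64_t bit_scatter(uint64_t data, uint64_t mask) {
        uint64_t r = 0;
        int k = 0;
        for (int i = 0; i < 64; i++)
            if ((mask >> i) & 1ULL)
                r |= ((data >> k++) & 1ULL) << i;
        return r;
    }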
2.4.1 Bioinformatics
Bioinformatics refers to the usage of computational and statistical techniques to solve bi-
ological problems. Major segments of bioinformatics research concern sequence analysis
– DNA or protein sequences are analyzed and compared to find genes, to show functional
similarity or determine common ancestry, among other goals.
DNA is a strand of the nucleotide bases adenine, cytosine, guanine and thymine. A
DNA sequence is the representation of a strand using a string containing the 8-bit ASCII
characters A, C, G and T. However, a 2-bit encoding of the nucleotides is more efficient
and can significantly increase performance of matching and searching operations on large
genomic sequences (the human genome contains 3.2 billion nucleotides). The ASCII codes
for the characters A, C, G and T differ in bit positions 2 and 1 (A: 00, C: 01, G: 11, T: 10)
and these two bits can be used to encode each nucleotide [58]. Thus a fourfold compression
of a genomic sequence simply requires a single bit gather operation to select bits 2 and 1
of each byte of a word (Figure 2.16).
Figure 2.16: Compression of sequence ATTCGCAC.
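
Using the bit_gather reference sketch from earlier in this section, the fourfold compression is a single call with a fixed mask, since 0x06 selects bits 2 and 1 of a byte:

    #include <stdint.h>

    uint64_t bit_gather(uint64_t data, uint64_t mask); /* sketch above */

    /* Pack 8 ASCII nucleotides (one per byte) into 16 bits:
       bits 2:1 of each byte encode A=00, C=01, G=11, T=10. */
    uint16_t pack_nucleotides(uint64_t ascii8) {
        return (uint16_t)bit_gather(ascii8, 0x0606060606060606ULL);
    }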
When aligning two DNA sequences, certain algorithms such as the BLASTZ pro-
gram [59] use a “spaced seed” as the basis of the comparisons. This means that n out of m
nucleotides are used to start a comparison rather than a string of n consecutive nucleotides.
The remaining slots effectively function as wild cards, often causing the comparison to
yield better results. For example, BLASTZ uses 12 of 19 (or 14 of 22) nucleotides as the
seed for comparison. The program compresses the 12 bases and uses the seed as an in-
dex into a hash table. This compression is a bit gather operation selecting 24 of 38 bits
(Figure 2.17).
The DNA sequence is transcribed by the cell into a sequence of amino acids or a protein.
Often the analysis of the genetic data is more accurate when performed on a protein basis,
such as is done by the BLASTX program [60]. A set of three bases, or 6 bits of data,
corresponds to a protein codon. Translating the nucleotides to a codon requires a table
lookup operation using each set of 6 bits as an index. An efficient algorithm can use a bit
scatter operation to distribute eight 6-bit fields on byte boundaries, and then use the result
as a set of table indices for a parallel table lookup instruction [61–63] to translate the bytes
(Figure 2.18).
Thus we see that support for bit gather and bit scatter operations can speed up bioinfor-
matics processing.
Figure 2.17: Compression of 12 of 19 bases (24 of 38 bits) in BLASTZ.
Figure 2.18: Translation of sets of 3 nucleotides into protein codons.
2.4.2 Linear Cryptanalysis
Cryptanalysis is the discipline of deciphering a ciphertext without having access to the
secret key. The goal can be to recover the plaintext message or even the secret key itself. The
complexity of a cryptanalysis depends on the amount of time, memory and data required
to recover the key. The different methods of cryptanalysis can be distinguished by the
types of data they assume: ciphertext-only attacks, in which the attacker has
access to only the ciphertext; known plaintext attacks, in which the attacker knows both the
ciphertext and plaintext; and chosen plaintext attacks, in which the attacker can choose the
plaintexts and obtain the corresponding ciphertexts.
Linear cryptanalysis is a known plaintext attack against block ciphers. This attack was
first introduced against DES [64]. During a linear cryptanalysis, the attacker studies prob-
abilistic linear relations (called linear approximations) between parity bits of the plaintext
P , the ciphertext C, and the secret key K. Given an approximation with high probability,
the attacker obtains an estimate for the parity bit of the secret key by analyzing the parity
bits of the known plaintexts and ciphertexts. The ultimate goal of the attacker is to obtain
an approximation for the whole cipher of the type:
Γ_PT • PT ⊕ Γ_CT • CT = Γ_K • K        (2.1)
with probability p and bias b = |p − 1/2|, where Γ is a bit mask that specifies the bits of
interest and • indicates the bitwise dot product. Such an approximation is significant only if
its probability p ≠ 1/2, or its bias is non-trivial. The key recovery process is based on
a maximum likelihood method in which a pool of plaintexts is tested using (2.1). The
implementation of this step requires computing many parity operations on subsets of the
bits of the plaintext and ciphertext (the key is fixed for any particular attack and thus the
key bits fall out of the equation).
The key-recovery attack is generally impossible in practice because it is difficult to
obtain the requisite number of plaintext-ciphertext pairs. Furthermore, once a cipher is
found to be vulnerable to cryptanalysis, it falls out of usage. However, linear cryptanalysis,
along with differential cryptanalysis, is considered one of the main methods to test the
security of block ciphers during the design phase. It has been utilized, for example, in the
design of the AES [65].
While the evaluation of cipher candidates for susceptibility to linear cryptanalysis is
generally done by analyzing the structure of the cipher, specifically, the substitution boxes
(S-boxes), finding linear approximations across multiple rounds can be difficult. Addi-
tionally, there can be multiple approximations that use the same bits of the plaintext and
ciphertext but use different paths through the rounds, thus making them difficult to find.
The existence of these “linear hulls” makes attacks easier to perform as the bias of the
approximation is greater than would be assumed by following only one path through the
cipher.
Due to the difficulty of obtaining good approximations, we can define a search algo-
rithm (Figure 2.19) that attempts to find good approximations that hold with non-trivial
bias [66]. For a given random key, we generate a large number of plaintext-ciphertext pairs.
For each pair, we test a large number of potential relationships. We then observe whether
the probability of the relationship being true is significantly different than one half. If so, a
linear approximation has been found.
    k = rand();
    For i = 1 to num_trials
        PT = rand();
        CT = E_k(PT);
        For j = 1 to num_relationships
            P = PT • Γ_PT[j];
            C = CT • Γ_CT[j];
            R = P ⊕ C;
            count[j] += R;
    For j = 1 to num_relationships
        bias[j] = |count[j]/num_trials − 1/2|;
        exists_rel[j] = (bias[j] > ε);    // ε ∼ 1/√num_trials
Figure 2.19: Algorithm for exhaustive search for linear relationships
The major operation performed in this search is parity computation. Thus, any way to
accelerate parity can speed up linear cryptanalysis.
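
In C, one trial of the inner loop of Figure 2.19 amounts to two mask-and-parity steps and an XOR; parity64 is the XOR-fold sketch from Section 2.1 and the naming here is ours:

    #include <stdint.h>

    unsigned parity64(uint64_t x); /* XOR-fold sketch from Section 2.1 */

    /* Evaluate one linear relationship: the bitwise dot product
       Γ • X is the parity of the masked word X & Γ. */
    unsigned test_relation(uint64_t pt, uint64_t ct,
                           uint64_t gamma_pt, uint64_t gamma_ct) {
        return parity64(pt & gamma_pt) ^ parity64(ct & gamma_ct);
    }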
2.4.3 Software-Defined Radio
In communication systems, messages are translated into physical signals for transmission
over a medium. The processing performed at this “physical layer” can be summarized by
the classical diagram of communication (Figure 2.20). The steps include:
• Source coding – the message is encoded to reduce message size;
• Encryption – the message is optionally encrypted;
• Forward Error Correction (FEC) – redundancy is added to the message so that sym-
bols can be recovered in the presence of channel noise;
• Scrambling – the message is randomized, facilitating better signal lock-on (by re-
moving long strings of ones and zeros) or spreading the energy of the signal to match
its designed spectrum;
• Channel Coding (Modulation) – the message is mixed with a carrier signal for trans-
mission over a channel.
Figure 2.20: Communications Process
To facilitate efficient communication, the physical layer is typically implemented in
hardware using an ASIC. However, with Software-Defined Radio (SDR) systems, the physical layer is instead implemented by a combination of software, reconfigurable hardware and fixed hardware. The purpose of this modification is to increase the flexibility of radio com-
munications as the physical layer differs depending on the choice of the communication
protocols (802.11, WiMAX, CDMA, Bluetooth, ...). Using a flexible SDR system solves
the following issues:
• Multiplicity of devices – The increasing number of radio applications and protocols results in numerous devices, each capable of communicating through only a single, distinct protocol (or a set of closely related protocols).
• Inefficient use of the spectrum – Using a fixed hardware implementation of the phys-
ical layer prevents adapting communication to efficiently utilize the available spec-
trum (adaptive coding).
• Inability to upgrade or update the system – A fixed hardware implementation
cannot be updated if changes to existing protocols are made or new protocols are
released. This issue is particularly important considering the ongoing need to secure
radio communication. Replacing the broken cryptosystems (E0, A5/1, ...) used to provide physical layer security has resulted in the need for entirely new devices.
• Slow development – A long process is required to introduce and deploy a new standard in market products.
Thus, SDR systems have emerged to overcome these issues: they can eliminate the cost and
overhead of maintaining multiple devices; can adapt protocol usage to maximize spectrum
utilization; can update a protocol in existing devices; and can allow for rapid, low-cost
testing and deployment of new standards.
The simplest SDR system consists of a general purpose processor attached to an analog-
to-digital/digital-to-analog converter and an RF front end (antenna). In such a system,
all the signal processing related to the radio communication is performed in software.
However, a general purpose processor is currently too slow for the needs of these aggressive, high-bandwidth algorithms, and a DSP [67] or FPGA [68] is required. Alternatively, spe-
cial purpose extensions can be added to the general purpose CPU to speed up process-
ing [69]. However, to make these architectural extensions attractive to designers of com-
modity CPUs, the new instructions should be general enough to be reused across a number
of applications. Thus, we need to identify what operations are to be supported and design
appropriate, general instructions to accelerate these operations.
Linear Feedback Shift Registers – One basic building block utilized across many of
the communications steps is the Linear Feedback Shift Register (LFSR). A LFSR is a shift
register that has a number of taps to bit positions in the register. The parity of these taps is
then fed back to the input (Figure 2.21). LFSRs are used for:
• Encryption – in some stream ciphers, the output is the combination of multiple LF-
SRs (see paragraph on stream ciphers);
• Error Correction – cyclic redundancy check codes (CRC) are based on LFSRs;
• Scrambling – long streams of ones and zeros, which can be difficult for receivers to
track, can be broken up by XORing with the pseudorandom pattern output from an
LFSR;
• Modulation – in direct-sequence spread spectrum modulation, used by CDMA, the
message is XORed by a high frequency pseudorandom noise signal (thus spreading
the spectrum of the message) such as the output of an LFSR.
Thus we see that LFSRs play an important role in communication; accelerating parity, the key operation of an LFSR, can greatly speed up communication.
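A minimal sketch of one LFSR update, assuming a Fibonacci-style software model (CRC hardware more commonly uses the equivalent Galois form): the parity of the tapped bits is fed back into the register. Here taps = 0x07 (bits 0, 1, 2) realizes s[t+8] = s[t+2] ^ s[t+1] ^ s[t], i.e., the CRC-8 polynomial x^8 + x^2 + x + 1 of Figure 2.21.

    #include <stdint.h>

    /* One Fibonacci LFSR step: parity of tapped bits shifted in as feedback. */
    uint8_t lfsr_step(uint8_t state, uint8_t taps) {
        uint8_t fb = (uint8_t)__builtin_parity(state & taps); /* parity of taps */
        return (uint8_t)((state >> 1) | (fb << 7));           /* shift in feedback */
    }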
Stream Ciphers – Stream ciphers are the alternative to block ciphers for performing
symmetric key encryption.

Figure 2.21: LFSR for CRC-8 (polynomial = x^8 + x^2 + x + 1)

A stream cipher differs from a block cipher in that it has internal state and can work on data of any length without padding. It is also not necessary to
know the size of the plaintext before the encryption. Thus stream ciphers are well suited
to communication systems, where message sizes can vary greatly and are not known in
advance. A common way to transform a block cipher into a stream cipher is to use counter
mode (CTR).
The basis of modern stream ciphers is a weakened version of the one time pad [70].
The ciphertext is obtained by combining each plaintext bit with a bit of a pseudorandom
sequence (keystream) which depends on a secret key (and initialization vector IV). The
common operation for combining the plaintext and the keystream is addition modulo 2 or
XOR. Such a system is known as an additive synchronous stream cipher.
A stream cipher is based on four building blocks:
• an internal state si;
• an initialization function which produces s0 from the secret key K and the IV ;
• a filtering function f which extracts the keystream from the internal state;
• an update function Φ which modifies the internal state every cycle.
Many stream ciphers use LFSRs to hold and update the internal state. One possible
filtering function f is to obtain the keystream by removing in a pseudorandom manner
some of the bits output from the internal state. This is referred to as decimation. One stream
cipher which combines these two concepts is the shrinking generator [71]. In this cipher,
two LFSRs are utilized – one holds and updates the internal state and the second generates
the decimation pattern. A variation of this cipher is the self-shrinking generator [72]. One
LFSR is used – the odd bits output from the LFSR form the sequence to be decimated
and the even bits form the decimation pattern. Decimation is equivalent to a bit gather
operation. The shrinking generator thus requires one bit gather operation and the self-
shrinking generator requires three bit gather operations – two bit gathers to form the two
subsequences and a third to perform the actual decimation.
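A rough sketch of one 64-bit batch of the self-shrinking generator in C makes the three bit gathers explicit. The pex64 helper below is a hypothetical software stand-in for bit gather, not an ISA intrinsic (Chapter 3 defines the pex instruction that would replace it).

    #include <stdint.h>

    /* Software stand-in for bit gather: compress mask-selected bits to the right. */
    uint64_t pex64(uint64_t x, uint64_t mask) {
        uint64_t r = 0;
        for (int i = 0, j = 0; i < 64; i++)
            if (mask & (1ULL << i))
                r |= ((x >> i) & 1ULL) << j++;   /* compact selected bits */
        return r;
    }

    /* raw holds 64 fresh LFSR output bits; returns the keystream bits. */
    uint64_t self_shrink(uint64_t raw, int *nbits) {
        const uint64_t EVEN = 0x5555555555555555ULL;
        uint64_t pattern = pex64(raw, EVEN);      /* gather 1: even bits */
        uint64_t data    = pex64(raw, EVEN << 1); /* gather 2: odd bits */
        *nbits = __builtin_popcountll(pattern);   /* keystream bits produced */
        return pex64(data, pattern);              /* gather 3: decimation */
    }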
Error Correction – In error correction, redundancy is introduced into the message so
that the symbols can be recovered in the face of channel noise. One basic error correction
scheme is linear block codes, in which parity bits are computed for various subsets of the
bits of a message block. Multiple parity patterns can be generated by direct multiplication mod 2 of the message symbols with a generator matrix (Figure 2.22 [73]).
In convolutional coding, the input is processed as a continuous stream. One or more shift
registers store some number of previous input bits and the outputs are the parity of subsets
of the stored bits and the next input bits (Figure 2.23 [74]). These are also used as building
blocks in Low-Density Parity-Check Codes (LDPC), which are linear block codes with
sparse parity check matrices, and Turbo Codes (TC), which utilize two convolutional coders
as shown in Figure 2.24. Currently, LDPC codes and Turbo Codes are among the most efficient error correction coding schemes.
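As a sketch of linear block encoding: each codeword bit is the parity of the message ANDed with one column of the generator matrix. The (24,12) sizes below match the extended Golay code of Figure 2.22, but storing G by columns (G_cols) is an assumption for illustration.

    #include <stdint.h>

    /* Encode a 12-bit message: codeword bit j = parity(msg AND column j of G). */
    uint32_t encode_block(uint16_t msg, const uint16_t G_cols[24]) {
        uint32_t cw = 0;
        for (int j = 0; j < 24; j++)
            cw |= (uint32_t)__builtin_parity((unsigned)(msg & G_cols[j])) << j;
        return cw;
    }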
[12 × 24 binary generator matrix of the form [I | A]: identity on the left, parity columns on the right]
Figure 2.22: Generator matrix for extended binary Golay code
Figure 2.23: Convolutional encoder for 802.11g
Figure 2.24: Turbo code encoder
Puncturing – In puncturing, message bits are removed after error correction coding to reduce redundancy. Puncturing allows for flexibility as the same code can be used at differing
message rates, thus obviating the need to support multiple error correction schemes and
simplifying the system. Removing message bits is equivalent to a bit gather operation. Depuncturing is performed at the decoder to reinsert bits at the punctured positions. This is equivalent to a bit scatter operation.
Thus we see that digital communications makes heavy use of operations that have parity
computation as an important component. We also see that a number of steps in the commu-
nication process require bit gather and bit scatter operations. Thus, any way to accelerate
bit gather, bit scatter and parity can speed up software-defined radio applications.
2.5 Summary
Bit-oriented instructions exist in all instruction set architectures. These range from instruc-
tions that operate on single bit fields to multiple bit fields in subwords to arbitrary collec-
tions of individual bits. However, current ISAs generally lack instructions to perform bit
gather and bit scatter operations and parity operations. These operations are quite useful, as
we see from applications such as bioinformatics, linear cryptanalysis and software-defined
radio. Consequently, we propose new instructions to perform these operations – the paral-
lel extract and parallel deposit instructions perform bit gather and bit scatter operations and
the bit matrix multiply instruction multiplies two bit matrices mod 2, allowing for fast parallel
parity computation as well as for performing many of the other bit-oriented instructions
mentioned in this chapter.
In the rest of this thesis, Chapter 3 defines the specialized parallel extract and par-
allel deposit instructions, describes the datapaths needed to implement these instructions
and then presents a functional unit that supports general and specialized bit manipulation
instructions. Chapter 4 shows how to enhance this functional unit to subsume the func-
tionality of the basic shifter. Chapter 5 examines the bit matrix multiplication operation
which can be used to accelerate parity; to perform the various subword-parallel multimedia instructions provided by commercial ISAs; and to perform a superset of the bit manipulation
operations. Chapter 6 analyzes applications that benefit from the newly proposed instruc-
tions.
Chapter 3
Advanced Bit Manipulation Operations
In the previous chapter, we showed the limitations of existing instructions for applications
such as bioinformatics, linear cryptanalysis and software-defined radio. We highlighted
the need for specialized advanced bit manipulation instructions to perform bit gather and
scatter and parity operations. In this chapter, we introduce novel parallel extract and parallel
deposit instructions to perform bit gather and bit scatter operations, respectively. Each of
these operations can take tens to hundreds of cycles to emulate in current microprocessors.
However, by defining new instructions for them, each can be implemented in one, or a few,
cycles.
This chapter is organized as follows. Section 3.1 gives the definitions for parallel ex-
tract and parallel deposit instructions. Section 3.2 discusses the datapaths needed for imple-
menting these advanced bit manipulation operations. Section 3.3 presents an advanced bit
manipulation functional unit which performs both general and specialized bit manipulation
instructions. Section 3.4 summarizes this chapter.
3.1 Instruction Definitions
Table 3.1 lists all the advanced bit manipulation instructions we consider (for a 64-bit pro-
cessor) and gives their cycle counts. The bfly, ibfly and grp permutation instructions were previously defined in Section 2.3.1. The parallel extract (pex) and parallel deposit (pdep) instructions, which perform bit gather and bit scatter operations, respectively, will be defined in Sections 3.1.1 and 3.1.2. These instructions have three variants – static, vari-
able and loop-invariant. The need for these three variants is elucidated in Sections 3.2
and 3.3. In particular, the hardware decoder used by variable pex.v and pdep.v will
be explained in Section 3.2.5 and the loop-invariant versions of pex and pdep, i.e., the
setib and setb instructions, will be explained in Section 3.3.
3.1.1 Bit Gather or Parallel Extract (pex)
A bit gather instruction collects bits dispersed throughout a word (or register), and places
them contiguously in the result. Such selection of non-contiguous bits from data is often
necessary. For example, in pattern matching, many pairs of features may be compared.
Then, a subset of these comparison result bits is selected, compressed and used as an index to look up a table. This selection and compression of bits is a bit gather operation.
A bit gather instruction can also be thought of as a parallel extract, or pex, instruction.
This is so named because it is like a parallel version of the single field extract (extr)
instruction described in Section 2.1. Figure 3.1 compares extr and pex. Recall that the extr instruction extracts a single field of bits from any position in the source register and
right justifies it in the destination register. The pex instruction extracts multiple bit fields
from the source register, compresses and right justifies them in the destination register.
The selected bits (in r2) are specified by a bit mask (in r3) and placed contiguously and
right-aligned in the result register (r1).
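A bit-serial reference loop pins down the semantics; this is a sketch of the specification only, not of the single-cycle inverse butterfly hardware of Section 3.2, and the function name is illustrative.

    #include <stdint.h>

    /* Reference semantics of pex: each bit of r2 under a "1" in the mask
     * r3 is extracted, compressed and right-aligned in r1. */
    uint64_t pex_ref(uint64_t r2, uint64_t r3) {
        uint64_t r1 = 0;
        int j = 0;
        for (int i = 0; i < 64; i++)
            if (r3 & (1ULL << i))                 /* selected source position */
                r1 |= ((r2 >> i) & 1ULL) << j++;  /* pack toward the right */
        return r1;
    }
    /* e.g. pex_ref(0xABCD, 0x00F0) == 0xC : bits 7..4 gathered and right-aligned */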
The pex instruction can also be considered the right half of the grp operation [28, 56]. It conserves some of the most useful properties of grp, while being easier to implement: the bits that would have been gathered to the left are instead zeroed out.
Table 3.1: Advanced bit manipulation instructions

bfly r1 = r2, ar.b1, ar.b2, ar.b3
    Perform Butterfly permutation of data bits in r2. (1 cycle)

ibfly r1 = r2, ar.ib1, ar.ib2, ar.ib3
    Perform Inverse Butterfly permutation of data bits in r2. (1 cycle)

pex r1 = r2, r3, ar.ib1, ar.ib2, ar.ib3
    Parallel extract, static: Data bits in r2 selected by a pre-decoded mask r3 are extracted, compressed and right-aligned in the result r1. (1 cycle)

pdep r1 = r2, r3, ar.b1, ar.b2, ar.b3
    Parallel deposit, static: Right-aligned data bits in r2 are deposited, in order, in result r1 at bit positions marked with a "1" in the statically-decoded mask r3. (1 cycle)

mov ar.x = r2, r3
    Move values from GPRs to ARs to set controls (calculated by software) for pex, pdep, bfly or ibfly. (1 cycle)

pex.v r1 = r2, r3
    Parallel extract, variable: Data bits in r2 selected by a dynamically-decoded mask r3 are extracted, compressed and right-aligned in the result r1. (3 cycles)

pdep.v r1 = r2, r3
    Parallel deposit, variable: Right-aligned data bits in r2 are deposited, in order, in result r1 in bit positions marked with a "1" in the dynamically-decoded mask r3. (3 cycles)

setib ar.ib1, ar.ib2, ar.ib3 = r3
    Set inverse butterfly datapath controls in the associated ARs, using a hardware decoder to translate the mask r3 to inverse butterfly controls (for loop-invariant pex). (2 cycles)

setb ar.b1, ar.b2, ar.b3 = r3
    Set butterfly datapath controls in the associated ARs, using a hardware decoder to translate the mask r3 to butterfly controls (for loop-invariant pdep). (2 cycles)

grp r1 = r2, r3
    Perform Group permutation (variable): Data bits in r2 corresponding to "1"s in r3 are grouped to the right, while those corresponding to "0"s are grouped to the left. (3 cycles)
Figure 3.1: (a) extr r1 = r2, pos, len (b) pex r1 = r2, r3
3.1.2 Bit Scatter or Parallel Deposit (pdep)
Bit scatter takes the right-aligned, contiguous bits in a register and scatters them in the
result register according to a mask in a second input register. This is the reverse operation
to bit gather. We also call bit scatter a parallel deposit instruction, or pdep, because it
is like a parallel version of the deposit (dep) instruction. Figure 3.2 compares dep and
pdep. The deposit (dep) instruction takes a right justified field of bits from the source
register and deposits it at any single position in the destination register. The parallel deposit (pdep) instruction takes a right-justified field of bits from the source register and deposits the bits at the different non-contiguous positions indicated by a bit mask.
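The corresponding reference loop for pdep, again a sketch of the semantics rather than the butterfly hardware of Section 3.2, is the mirror image of pex_ref above.

    #include <stdint.h>

    /* Reference semantics of pdep: right-aligned bits of r2 are deposited,
     * in order, at the "1" positions of the mask r3. */
    uint64_t pdep_ref(uint64_t r2, uint64_t r3) {
        uint64_t r1 = 0;
        int j = 0;
        for (int i = 0; i < 64; i++)
            if (r3 & (1ULL << i))                 /* target bit position */
                r1 |= ((r2 >> j++) & 1ULL) << i;  /* scatter next data bit */
        return r1;
    }
    /* e.g. pdep_ref(0x3, 0x0101) == 0x0101 : two LSBs scattered to bits 0 and 8 */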
Figure 3.2: (a) dep r1 = r2, pos, len (b) pdep r1 = r2, r3
Figure 3.3: Labeling of butterfly network.
3.2 Datapaths for Parallel Extract and Parallel Deposit
It is not intuitively clear that pex and pdep are easy to implement, especially in a single
cycle. In this section, we show that parallel deposit can be mapped to the butterfly network
and that parallel extract can be mapped to the inverse butterfly network. This provides the
basis for a single functional unit that performs bit gather, bit scatter and bit permutation
operations using both butterfly and inverse butterfly datapaths.
3.2.1 Parallel Deposit on the Butterfly Network
We first show an example parallel deposit operation on the butterfly network. Then, we
show that any parallel deposit operation can be performed using a butterfly network.
Figure 3.3 shows our labeling of the left (L) and right (R) halves of successive stages
of a butterfly network. Since we often appeal to induction for successive stages, we usually
omit the subscripts for these left and right halves.
Figure 3.4(a) shows an example pdep operation with mask = 10101101. Figure 3.4(b)
Figure 3.4: (a) 8-bit pdep operation (b) mapped onto butterfly network with explicit right rotations of data bits between stages and (c) without explicit rotations of data bits by modifying the control bits.
shows this pdep operation broken down into steps on the butterfly network. In the first
stage, we transfer from the right (R) to the left half (L) the bits whose destination is in L,
namely bits d and e. Prior to stage 2, we right rotate e00d by 3, the number of bits that
stayed in R, to right justify the bits in their original order, 00de, in the L half. Note that
bits that stay in the right half, R, are already right-justified.
In each half of stage 2 we transfer from the local R to the local L the bits whose final
destination is in the local L. So in R, we transfer g to RL and in L we transfer d to LL. Prior
to stage 3, we right rotate the bits to right justify them in their original order in their new
subnetworks. So d0 is right rotated by 1, the number of bits that stayed in LR, to yield 0d
and gf is right rotated by 1, the number of bits that stayed in RR, to yield fg.
In each subnetwork of stage 3 we again transfer from the local R to the local L the bits
whose final destination is in the local L. So in LL we transfer d and in LR we transfer e.
After stage 3 we have transferred each bit to its correct final destination: d0e0fg0h. Note
that we use a control bit of “0” to indicate a pass through operation and a control bit of “1”
to indicate a swap.
Rather than explicitly right rotating the data bits in the L half after each stage, we can
compensate by modifying the control bits. This is shown in Figure 3.4(c). How the control
bits are derived will be explained later in Section 3.2.4.
We now show that any pdep operation can be mapped to the butterfly network.
Fact 3.1. Any single data bit can be moved to any result position by just moving it
to the correct half of the intermediate result at every stage of the butterfly network.
This can be proved by induction on the number of stages. At stage 1, the data bit
is moved within n/2 positions of its final position. At stage 2, it is moved within n/4
positions of its final result, and so on. At stage lg(n), it is moved within n/2^lg(n) = 1
position of its final result, which is its final result position. Referring back to Figure 3.4(b),
we utilized Fact 3.1 to decide which bits to keep in R and which to transfer from R to L at
each stage.
Fact 3.2. If the mask has k “1”s in it, the k rightmost data bits are selected and
moved, i.e., the selected data bits are contiguous. They never cross each other in
the final result.
This fact is by definition of the pdep instruction. See the example of Figure 3.4(a)
where there are 5 “1”s in the mask and the selected data bits are the 5 rightmost bits,
defgh; these bits are spread out to the left maintaining their original order, and thus never
crossing each other in the final result.
Fact 3.3. If a data bit in the right half (R) is swapped with its paired bit in the left
half (L), then all selected data bits to the left of it will also be swapped to L (if they
are in R) or stay in L (if they are in L).
Since the selected data bits never cross each other in the final result (Fact 3.2), once a
bit swaps to L, the selected bits to the left of it must also go to L. Hence, if there is one
“1” in the mask, the one selected data bit, d0, can go to R or L. If there are two “1”s in the
mask, the two selected data bits, d1d0, can go to RR or LR or LL. That is, if the data bit on
the right stays in R, then the next data bit can go to R or L, but if the data bit on the right
goes to L, the next data bit must also go to L. If there are three “1”s, the three selected data
bits, d2d1d0, can go to RRR, LRR, LLR or LLL. For example, in Figure 3.4(b) stage 1, the
five bits have the pattern LLRRR as e is transferred to L and d must then stay in L.
Fact 3.4. The selected data bits that have been swapped from R to L, or stayed in
L, are all contiguous mod n/2 in L.
From Fact 3.3, the destinations of the k selected data bits d_{k−1} ... d_0 must be of the form L...LR...R, a string of zero or more L's followed by zero or more R's (see Figure 3.5).
Define X as the bits staying in R, Y as the bits going to L that start in R and Z as the bits
going to L that start in L. It is possible that:
i. X alone exists – when there are no selected data bits that go to L,
ii. Y alone exists – when there are no selected data bits that start in L and all bits that start
in R go to L,
iii. X and Y exist – when there are no selected data bits that start in L and some bits that
start in R stay in R and some go to L,
iv. X and Z exist – when all the bits in R are going to R, and all bits going to L start in L,
or
v. X , Y and Z exist.
When X alone exists (i), there are no bits that go to L, so Fact 3.4 is irrelevant.
The structure of the butterfly network requires that when bits are moved in a stage, they
all move by the same amount. Fact 3.2 states that the selected data bits are contiguous.
Together these imply that when Y alone exists or X and Y exist (ii and iii), Y is moved as
a contiguous block from R to L and Fact 3.4 is trivially true.
When X and Z exist (iv), Z is a contiguous block of bits that does not move so again
Fact 3.4 is trivially true.
When X , Y and Z exist (v), Y comprises the leftmost bits of R, and Z the rightmost
bits in L since they are contiguous across the midpoint of the stage (Fact 3.2). When Y is
swapped to L, since the butterfly network moves the bits by an amount equal to the size of L
or R in a given stage, Y becomes the leftmost bits of L. Thus Y and Z are now contiguous
mod n/2, i.e., wrapped around, in L (Figure 3.5).
Thus Fact 3.4 is true in all cases.
For example, in Figure 3.4(b) at the input to stage 1, X is bits fgh, Y is bit e and Z
is bit d. Y is the leftmost bit in R and Z is the rightmost bit in L. After stage 1, Y is the
leftmost bit in L and is contiguous with Z mod 4, within L, i.e., de is contiguous mod 4 in
e00d.
Figure 3.5: At the output of a butterfly stage, Y and Z are contiguous mod n/2 in L and can be rotated to be the rightmost bits of L.
Fact 3.5. The selected data bits in L can be rotated so that they are the rightmost
bits of L, and in their original order.
From Fact 3.4, the selected data bits are contiguous mod n/2 in L. At the output of
stage 1 in Figure 3.5, these bits are offset to the left by the size of X (the number of bits
that stayed in R), denoted by |X|. Thus if we explicitly rotate right the bits by |X|, the
selected data bits in L are now the rightmost bits of L in their original order (Figure 3.5).
In Figure 3.4(b), Fact 3.5 was utilized prior to stages 2 and 3.
At the end of this step, we have two half-sized butterfly networks, L and R, with the
selected data bits right-aligned and in order in each of L and R (last row of Figure 3.5). The
above can now be repeated recursively for the half-sized butterfly networks, L and R, until
each L and R is a single bit. This is achieved after lg(n) stages of the butterfly network (see
Figure 3.4(b)).
The selected data bits emerge from stage 1 in Figure 3.5 rotated to the left by |X|. In
Fact 3.5, the selected data bits are explicitly rotated back to the right by |X|. Instead, we
can compensate for the rotation by modifying the control bits of the subsequent stages to
limit the rotation to be within each subnetwork. For example, if the n-bit input to stage 1 is
rotated by k positions, the two n/2-bit inputs to the L and R subnetworks are rotated by k
(mod n/2) within each subnetwork. At the output of stage lg(n), the subnetworks are 1-bit
wide so the rotations are absorbed.
Fact 3.6. If the data bits are rotated by k positions left (or right) at the input to a
stage of a butterfly network, then at the output of that stage we can obtain a rotation
left (or right) by k positions of each half of the output bits by rotating left (or right)
the control bits by k positions and complementing them upon wrap around.
Consider again the example of Figure 3.4(b). The selected data bits emerge from stage
1 left rotated by 3 bits, i.e., L is e00d, left rotated by 3 positions from 00de. In Fig-
ure 3.4(b), we explicitly rotated the data bits back to the right by 3. Instead, we can
compensate for this left rotation by left rotating and complementing upon wrap around
by 3 positions the control bits of the subsequent stages. This is shown in Figure 3.6.
For stage 2, the control bit pattern of L, after left rotate and complement by 3, becomes
10 → 00 → 01 → 11. The rotation by 3 is limited to a rotation by 3 mod 2 = 1 within
each half of the output of L of stage 2 as the output is transformed from d0,0e in Fig-
ure 3.4(b) to 0d,e0 in Figure 3.6. For stage 3, the rotation and complement by 3 of the two
single control bits in L become three successive complements: 1 → 0 → 1 → 0, and the
left rotation of L is absorbed as the overall output is still d0e0. Hence, Figure 3.6 shows
how the control bits in stages 2 and 3 compensate for the left rotate by 3 bits at the output
of stage 1 (cf. Figure 3.4(b).)
Figure 3.4(c) shows the control bits after also compensating for the left rotate by 1 bit
of RL and LL, prior to stage 3 in Figure 3.6. The explicit right rotation prior to stage 3 is
eliminated. Instead, the two control bits in RL and LL transform from 0 → 1 to absorb the
rotation. Hence, the overall result in Figure 3.4(c) remains the same as in Figure 3.4(b).
The explicit data rotations in Figure 3.4(b) are replaced with rotations of the control bits
instead, complementing them on wraparound.
Figure 3.6: The elimination of explicit right rotations after stage 1 in Figure 3.4(b).

We now explain why the control bits are complemented when they wrap around. The goal is to keep the data bits in the half they were originally routed to at each stage of the
butterfly network, in spite of the rotation of the input.
Figure 3.7(a) shows a pair of bits, a and b, that were originally passed through. So
we wish to route a to L and b to R in spite of any rotation. As the bits are rotated (Fig-
ure 3.7(b)), the control bit is rotated with them, keeping a in L and b in R, as desired.
When the bits wrap around, (Figure 3.7(c)), a wraps to R and b crosses the midpoint to
L. If the control bit is simply rotated and wrapped around with the paired bits, then a is
passed through to R and b is passed through to L, which is contrary to the originally desired
behavior. If instead the control bit is complemented when it wraps around (Figure 3.7(d)),
then a is swapped back to L and b is swapped back to R, as is desired.
Similarly, if a and b were originally swapped (Figure 3.8(a)), a should be routed to R
and b to L. As the bits rotate (Figure 3.8(b)), we simply rotate the control bit with them.
When the bits wrap around (Figure 3.8(c)), input a wraps to R and b crosses to L. When
they are swapped, a is routed to L and b to R, contrary to their original destinations. If
Figure 3.7: (a) a pair of data bits initially passed through; (b) rotation of the paired data bits and control bit; (c) wrapping of the data bits and control bit; (d) complementation of the control bit.
the control bit is complemented on wraparound, a is passed through to R and b is passed
through to L, conforming to the originally desired behavior.
Thus complementing the control bit when it wraps and changing the behavior of the
paired bits from pass through to swap or vice versa causes each of the pair of bits to stay in
the half to which it was originally routed despite the rotation of the input pushing each bit
to the other half. This limits the rotation of the input to be within each half of the output
and not across the entire output.
We now give a theorem to formalize the overall result:
Theorem 3.7. Any parallel deposit instruction on n bits can be implemented with
one pass through a butterfly network of lg(n) stages without path conflicts (with the
bits that are not selected zeroed out externally).
Proof. We give a proof by construction. Assume there are k "1"s in the right half of the bit mask. Then, based on Fact 3.1, the k rightmost data bits (block X in Figure 3.5) must be kept
in the right half (R) of the butterfly network and the remaining contiguous selected data bits
must be swapped (block Y ) or passed through (block Z) to the left half (L). This can be
Figure 3.8: (a) A pair of data bits initially swapped; (b) rotation of the paired data bits and control bit; (c) wrapping of the data bits and control bit; (d) complementation of the control bit.
accomplished in stage 1 of the butterfly network by setting the k rightmost configuration
bits to “0” (to pass through X) and the remaining configuration bits to “1” (to swap Y ).
At this point, the selected data bits in the right subnetwork (R) are right-aligned but
those in the left subnetwork (L) are contiguous mod n/2, but not right aligned (see Fact 3.4
and Figure 3.5) – they are rotated left by the size of block X or the number of bits kept in
R. We can compensate for the left rotation of the bits in L and determine the control bits
for subsequent stages as if the bits in L were right aligned. This is accomplished by left
rotating and complementing upon wraparound the control bits in the subsequent stages of
L by the number of bits kept in R (once these control bits are determined pretending that
the data bits in L are right aligned). Modifying the control bits in this manner will limit
the rotation to be within each half of the output until the rotation is absorbed after the final
stage (Fact 3.6).
Now the process above can be repeated on the left and right subnets, which are them-
selves butterfly networks: count the number of “1”s in the local right half of the mask and
then keep that many bits in the right half of the subnetwork and swap the remaining selected
data bits to the left half. Account for the rotation of the left half by modifying subsequent control bits.
This can be repeated for each subnetwork in each subsequent stage until the final stage is reached, where the final parallel deposit result will have been achieved (e.g., Figure 3.4(c)).

Figure 3.9: Even (or R, dotted) and odd (or L, solid) subnetworks of the inverse butterfly network.
3.2.2 Parallel Extract on the Inverse Butterfly Network
We will now show that pex can be mapped onto the inverse butterfly network. The inverse
butterfly network is decomposed into even and odd subnetworks, in contrast to the butterfly
network which is decomposed into right and left subnetworks. See Figure 3.9, where the
even subnetworks are shown with dotted lines and the odd with solid lines. However, for
simplicity of notation we refer to even as R and odd as L.
Fact 3.8. Any single data bit can be moved to any result position by just moving
it to the correct R or L subnetwork of the intermediate result at every stage of the
inverse butterfly network.
This can be proved by induction on the number of stages. At stage 1, the data bit is
moved to its final position mod 2 (i.e., to R or L). At stage 2, it is moved to its final position
mod 4, and so on. At stage lg(n), it is moved to its final position mod 2^lg(n) = n, which is
its final result position.
Fact 3.9. A permutation is routable on an inverse butterfly network if the destina-
tions of the bits constitute a complete set of residues mod m (i.e., the destinations
equal 0, 1, . . . ,m− 1 mod m) for each subnetwork of width m.
Based on Fact 3.8, bits are routed on the inverse butterfly network by moving them to the
correct position mod 2 after the first stage, mod 4 after the second stage, etc. Consequently,
if the two bits entering stage 1, (with 2-bit wide inverse butterfly networks as shown in
Figure 3.9), have destinations equal to 0 and 1 mod 2 (i.e., one is going to R and one to L),
Fact 3.8 can be satisfied for each bit and they are routable through stage 1 without conflict.
Subsequently, the four bits entering stage 2 (with the 4-bit wide inverse butterfly subnetworks) must have destinations equal to 0, 1, 2 and 3 mod 4 to satisfy Fact 3.8 for each bit and be routable
through stage 2 without conflict. A similar constraint exists for each stage.
Theorem 3.10. Any Parallel Extract (pex) instruction on n bits can be imple-
mented with one pass through an inverse butterfly network of lg(n) stages without
path conflicts (with the un-selected bits on the left zeroed out).
Proof. The pex operation compresses bits in their original order into adjacent bits in the
result. Consequently, two adjacent selected data bits that enter the same stage 1 subnetwork
must be adjacent in the output – one bit has a destination equal to 0 mod 2 and the other has
a destination equal to 1 mod 2. Thus the destinations constitute a complete set of residues
mod 2 and thus are routable through stage 1. The selected data bits that enter the same
stage 2 subnetwork must be adjacent in the output and thus form a set of residues mod 4
and are routable through stage 2. A similar situation exists for the subsequent stages, up to
the final n-bit wide stage. No matter what the bit mask of the overall pex operation is, the
selected data bits will be adjacent in the final result. Thus the destination of the selected
data bits will form a set of residues mod n and the bits will be routable through all lg(n)
stages of the inverse butterfly network.
3.2.3 Need for Two Datapaths
It would be convenient if both pex and pdep could be implemented using the same datapath
circuit. Unfortunately, this is not possible. There is one unique path between any input
position and output position in the butterfly or inverse butterfly network (see Fact 3.1 and
Fact 3.8 – once a bit has been moved to the correct right or left subnetwork, subsequent
stages cannot route bits back to the other subnetwork). Consequently, we need only show
one counter example where paths conflict to prove that pex on butterfly and pdep on
inverse butterfly are not possible in the general case.
We first consider trying to implement pex using a butterfly circuit. From Fact 3.1
we see that parallel extract cannot be mapped to the butterfly network. Parallel extract
compresses bits and thus it is easy to encounter scenarios where two bits entering the same
switch would both require the same output in order to be moved to the correct half (L or R)
corresponding to their final destinations. Consider the parallel extract operation shown in
Figure 3.10. In order to move both bits d and h to the correct half of their final positions,
both must be output in the right half after stage 1. This clearly is a conflict and thus parallel
extract cannot be mapped to butterfly.
We now consider implementing pdep using an inverse butterfly circuit. From Fact 3.8
we see that parallel deposit cannot be mapped to the inverse butterfly network. Parallel
deposit scatters right-aligned bits to the left, and thus it is easy to encounter scenarios
where two bits entering the same switch would both require moving to the same left, or
right, subnetwork. Consider the parallel deposit operation shown in Figure 3.11. In order
to move both bits g and h to their final positions mod 2, both must be output in the right
subnet (i.e., even network: 0 mod 2) after stage 1. This clearly is a conflict and thus parallel
Figure 3.10: Conflicting paths when trying to map pex to butterfly.
deposit cannot be mapped to the inverse butterfly datapath.
3.2.4 Decoding the Bitmask into Butterfly Control Bits
The steps in the proof of Theorem 3.7 give an outline for how to decode the n-bit bitmask
into controls for each stage of a butterfly datapath. For each right half of a stage of a
subnetwork we count the number of “1”s in that local right half of the mask, say k “1”s,
and then set the k rightmost control bits to “0”s and the remaining bits to “1”s. This serves
to keep block X in the local R half and export Y to the local L half (refer to Figures 3.3
and 3.5 for nomenclature). We then assume that we explicitly rotate Y and Z to be the
rightmost bits in order in the local L half and iterate through the stages to come up with an
initial set of control bits. After this, we eliminate the need for explicit rotations of Y and Z
by modifying the control bits. This is accomplished by a left rotate and complement upon
wrap around (LROTC) operation, rotating the control bits by the same amount obtained
when assuming explicit rotations.
Figure 3.11: Conflicting paths when trying to map pdep to inverse butterfly.

We will now simplify this process considerably. First, note that when we modify control bits to compensate for a rotation in a given stage, we do so by propagating the rotation
through all the subsequent stages. This means that when the control bits of a local L are
modified, they are rotated and complemented upon wrap around by the number of “1”s in
the local R, and by the number of “1”s in the local R of the preceding stage, and by the
number of “1”s in all the local R’s of all preceding stages up to the R in the first stage. In
other words, the control bits of the local L are rotated by the total number of “1”s to its
right in the bitmask.
Consider the example of Figure 3.4(b). The control bit in stage 3 in the LL subnetwork
is initially a “1” when we assumed explicit rotations. We first rotated and complemented
this bit by 3, the number of “1”s in R of the bitmask: 1 → 0 → 1 → 0 (Figure 3.6). We
then rotated and complemented this bit by another 1 position, the number of “1”s in LR of
the bitmask: 0 → 1. This yielded the final control bit in Figure 3.4(c). Overall we rotated
this bit by 4, the total number of “1”s to the right of LL or to the right of position 6. This is
a population count (POPCNT) of the bitstring from the rightmost bit to bit position 6.
Figure 3.12: LROTC("1111", 3) = "1000".

Second, we need to produce a string of k "0"s from a count (in binary) of k, to derive the initial control bits assuming explicit rotations. This can also be done with a LROTC
operation. We start with a string of all “1”s of the correct length and then, for every position
in the rotation, we wrap around a “1” from the left and complement it to get a “0” on the
right. The end result, after a LROTC by k, is a string of the correct length with k rightmost
bits set to “0” and the rest set to “1” (Figure 3.12, where k = 3).
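A short software model of LROTC makes this concrete; it is a sketch only (the hardware realization, a modified barrel rotator, is discussed in Section 3.2.5), and it reproduces the example of Figure 3.12: lrotc(0xF, 4, 3) == 0x8, i.e., LROTC("1111", 3) = "1000".

    #include <stdint.h>

    /* LROTC: left rotate a w-bit value, complementing each bit as it wraps. */
    uint64_t lrotc(uint64_t a, int w, int rot) {
        uint64_t mask = (w == 64) ? ~0ULL : (1ULL << w) - 1;
        rot %= 2 * w;                              /* period of w-bit LROTC is 2w */
        for (int i = 0; i < rot; i++) {
            uint64_t msb = (a >> (w - 1)) & 1ULL;
            a = ((a << 1) | (msb ^ 1ULL)) & mask;  /* wrap the complemented bit */
        }
        return a;
    }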
We can now combine these two facts: the initial control bits are obtained by a LROTC of a string of "1"s the length of the local R, rotated by the POPCNT of the bits in the mask in the local R and all bits to the right of it. We denote a string of k "1"s as 1^k. We specify a bitfield from bit h to bit v as {h:v}, where v is to the right of h. So,
from bit h to bit v as {h:v}, where v is to the right of h. So,
• for stage 1, we calculate the control bits as
  LROTC(1^(n/2), POPCNT(mask{n/2−1:0})),

• for stage 2, we calculate the control bits as
  LROTC(1^(n/4), POPCNT(mask{3n/4−1:0})) for L and
  LROTC(1^(n/4), POPCNT(mask{n/4−1:0})) for R,

• for stage 3, we calculate the control bits as
  LROTC(1^(n/8), POPCNT(mask{7n/8−1:0})) for LL,
  LROTC(1^(n/8), POPCNT(mask{5n/8−1:0})) for LR,
  LROTC(1^(n/8), POPCNT(mask{3n/8−1:0})) for RL, and
  LROTC(1^(n/8), POPCNT(mask{n/8−1:0})) for RR,
and so on for the later stages.
Let us verify that this is correct using the example of Figure 3.4(c):

• for stage 1, the control bits are
  LROTC(1^4, POPCNT("1101")) = LROTC(1111, 3) = 1000,

• for stage 2, the control bits are
  LROTC(1^2, POPCNT("101101")) = LROTC(11, 4) = 11 for L and
  LROTC(1^2, POPCNT("01")) = LROTC(11, 1) = 10 for R,

• for stage 3, the control bits are
  LROTC(1^1, POPCNT("0101101")) = LROTC(1, 4) = 1 for LL,
  LROTC(1^1, POPCNT("01101")) = LROTC(1, 3) = 0 for LR,
  LROTC(1^1, POPCNT("101")) = LROTC(1, 2) = 1 for RL, and
  LROTC(1^1, POPCNT("1")) = LROTC(1, 1) = 0 for RR.
This agrees with the result shown in Figure 3.4(c).
Overall, the population counts of the bits from position 0 to position k, for k = 0 to n − 2, are required. We call this the set of prefix population counts. One interesting observation is that for stage 1 we need one count of n/2^1 bits, for stage 2 we need the two counts of the odd multiples of n/2^2 bits, for stage 3 we need the four counts of the odd multiples of n/2^3 bits, and so on.
Using the two new functions we have defined, LROTC and POPCNT, we present an
algorithm (Algorithm 3.11, Figure 3.13) to decode the n mask bits into the n × lg(n)/2
control bits for pdep (and for pex). Algorithm 3.11 summarizes the discussion above for
obtaining the control bits for achieving pdep on a butterfly circuit.
We note that Algorithm 3.11 can also be used to obtain the control bits for the inverse
butterfly for a pex operation, with the one caveat that the controls for stage i of the butterfly
datapath are routed to stage lg(n)−i+1 in the inverse butterfly datapath. This can be shown
using an approach similar to that for pdep, except for working backwards from the final
Algorithm 3.11. To generate the n × lg(n)/2 butterfly or inverse butterfly control bits from the n-bit mask.

Input: mask; the bitmask
Output: bcb; the lg(n) × n/2 matrix containing the butterfly control bits
        ibcb; the lg(n) × n/2 matrix containing the inverse butterfly control bits

Let x||y indicate the concatenation of bit patterns x and y. POPCNT(a) is the population count of "1"s in bitfield a. mask{h:v} is the bitfield from bit h to bit v of the mask. 1^k indicates a bit-string of k ones. LROTC(a, rot) is a "left rotate and complement on wrap" operation, where a is the input and rot is the rotation amount.

1. Calculate the prefix population counts:

   For i = 0, 1, ..., n − 2
       pc[i] = POPCNT(mask{i:0})

2. Calculate the butterfly (and inverse butterfly) control bits for each stage by performing LROTC(1^k, pc[m]), where k is the size of the local R and m is in the set of the leftmost bit positions of the local R's:

   bcb = ibcb = {}
   For i = 1, ..., lg(n)                  // for each stage
       k = n/2^i                          // number of bits in local R
       For j = 1, 3, 5, ..., 2^i − 1      // for each local R
           m = j × k − 1                  // the leftmost bit position of the local R
           temp = LROTC(1^k, pc[m])
           bcb[i] = temp || bcb[i]
           ibcb[lg(n) − i + 1] = temp || ibcb[lg(n) − i + 1]

Figure 3.13: Algorithm for decoding a bitmask into controls for a butterfly (or inverse butterfly) datapath.
stage. The derivation is given in more detail in an appendix to this chapter (Section 3.5).
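Putting the pieces together, the C below sketches Algorithm 3.11 for n = 64, reusing the lrotc() model from the sketch after Figure 3.12 (assumed in the same file). Packing the n/2 control bits of each stage into a 32-bit word, rightmost local R in the low bits, is an assumption for illustration.

    #include <stdint.h>

    uint64_t lrotc(uint64_t a, int w, int rot);   /* model from Figure 3.12 sketch */

    #define N   64
    #define LGN 6

    /* bcb[i] holds the control bits of butterfly stage i+1; ibcb is the
     * same data in reversed stage order for the inverse butterfly. */
    void decode_mask(uint64_t mask, uint32_t bcb[LGN], uint32_t ibcb[LGN]) {
        int pc[N - 1];                            /* prefix population counts */
        pc[0] = (int)(mask & 1);
        for (int i = 1; i <= N - 2; i++)
            pc[i] = pc[i - 1] + (int)((mask >> i) & 1);

        for (int i = 1; i <= LGN; i++) {          /* for each stage */
            int k = N >> i;                       /* bits in each local R (<= 32) */
            uint64_t ctrl = 0;
            int pos = 0;
            for (int j = 1; j < (1 << i); j += 2) {   /* each local R, right to left */
                int m = j * k - 1;                /* leftmost bit of the local R */
                ctrl |= lrotc((1ULL << k) - 1, k, pc[m]) << pos;
                pos += k;
            }
            bcb[i - 1]    = (uint32_t)ctrl;       /* n/2 = 32 control bits/stage */
            ibcb[LGN - i] = (uint32_t)ctrl;       /* stage order reversed for pex */
        }
    }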
3.2.5 Hardware Decoder
The execution time of Algorithm 3.11 in software is approximately 675 cycles on an Intel
Pentium-D processor. This software routine is useful only for pex or pdep operations for
which the bit mask is known ahead of time. However, for dynamic masks, we require a
hardware decoder that implements Algorithm 3.11 in order to achieve high performance.
Fortunately, Algorithm 3.11 just contains two basic operations, population count and left-
rotate-and-complement, both of which have straightforward hardware implementations. A
block diagram of the decoder is shown in Figure 3.14.
Figure 3.14: Hardware decoder for dynamic pdep and pex operation (for pex, the order of the outputs is reversed).
Before describing the decoder in detail, we make note of a simplification based on a property of rotations – they are invariant when the rotation amount differs by the period of rotation, which for a k-bit LROTC is 2k. Thus, for the ith stage of the butterfly network, k = n/2^i (see Algorithm 3.11) and the POPCNTs are only computed mod n/2^(i−1). For example, for the 64-bit hardware decoder, for the 32 POPCNTs of butterfly stage 6, corresponding to the odd multiples of n/64, we need only compute the POPCNTs mod 2 – only the least significant bit; for the 16 POPCNTs of butterfly stage 5, we need only compute the POPCNTs mod 4 – the two least significant bits; and so on. Only the 32-bit POPCNT for stage 1 requires the full lg(n)-bit result. In total, 120 bits are computed instead of the 378 bits that would have been computed had the full lg(n)-bit POPCNTs been required for each position.
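A quick software tally, checking the arithmetic above rather than modeling the hardware: stage i needs 2^(i−1) counts, each only lg(n) − i + 1 bits wide because the count is taken mod n/2^(i−1).

    #include <stdio.h>

    int main(void) {
        int truncated = 0, full = 0, lgn = 6;
        for (int i = 1; i <= lgn; i++) {
            int counts = 1 << (i - 1);        /* POPCNTs needed by stage i */
            truncated += counts * (lgn - i + 1);
            full      += counts * lgn;        /* if every count were 6 bits */
        }
        printf("%d vs %d bits\n", truncated, full);  /* prints 120 vs 378 */
        return 0;
    }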
The first stage of the decoder is a parallel prefix population counter. This is a circuit
that computes in parallel all the population counts of step 1 of Algorithm 3.11. In designing
this circuit we considered population counter architectures and parallel prefix networks for
adders.

Figure 3.15: (a) Count of 8 bits; (b) Count of 24 bits; (c) Count of 32 bits; (d) Count of 56 bits
The carry shower counter [75] groups the input into sets of three lines which are input
into full adders. The sum and carry outputs of the full adders are each grouped into sets of
three lines which are input to another stage of full adders and so on.
In our circuit, in the first stage we group 8 bits into 3 sets due to the structure of the
butterfly and inverse butterfly networks, i.e., the subnetworks have power of 2 sizes. In
stage 2, we add three sums and three carries to produce the sum of the 8 bits in carry-save
form. Figure 3.15(a) shows the first two stages of the counter. The output of this stage is a
2-bit sum and 2-bit carry.
We then gather 3 sets of sums and carries in stage 3 and produce the sum of 24 bits
in carry-save form (this requires a second full adder delay within the stage as shown in
Figure 3.15(b)). In stage 4, we gather up to three intermediate results and reduce these
to carry-save form. Stage 4 for the 32-bit count, which requires the full lg(n) bits, is
shown in Figure 3.15(c) and for the 56-bit count, which is computed mod 16, is shown
in Figure 3.15(d). We then use a carry-propagate adder to complete the addition. The
complete networks for the 32-bit, 56-bit and 63-bit counts are shown in Figure 3.16.
The parallel prefix architecture resembles radix-3 Han-Carlson [76], which is a parallel
prefix look-ahead carry adder that has lg(n) + 1 stages with carries propagated to the odd
positions in the extra final stage to limit fanout at the earlier levels. In our circuit, we
defer the 1- and 2-bit counts to the end, similar to odd carries being deferred in the Han-
Carlson adder, as these are easy to compute from the other counts. This can be seen in
Figure 3.16(c). The radix-3 nature stems from the carry shower counter design, as we
group 3 lines to input into a full adder at each level. Thus, the counter has ⌈log3(64)⌉ + 2 = 6 stages (where each stage can have multiple full adder delays). We note that in order to efficiently compute the 3-bit and greater counts, the first two stages are overlapped every 4 positions. This can be seen in Figure 3.17(a), which shows the diagram of the entire parallel
prefix network. The numbers at the top are bit positions. The numbers at the bottom
indicate the number of bits in the sum. The trapezoids indicate adder blocks such as those
in Figure 3.15. The squares indicate the final carry-propagate adder, which can be as small
as 1-bit.
The parallel prefix population count circuit computes the population count of the low 32
bits, the low 16 bits and the low 8 bits (and the low 4 bits and the low 2 bits). With a minor
extension, the population count of the full 64 bit word can be computed (Figure 3.17(b)).
Similarly, the population count of each byte is computed in carry save form and with min-
imal logic can be transformed into the full 4-bit count of each byte. By modifying the
network, we can also compute the population count of each 16-bit or 32-bit subword. Thus
the parallel prefix population count circuit can be used as the basis of a population count
Figure 3.16: (a) 32-bit count network; (b) 56-bit count network; (c) 63-bit count network.
Figure 3.17: (a) Parallel prefix population counter network; (b) 64-bit parallel prefix population counter supporting population count of 64, 32, 16 and 8 bits.
functional unit and to support useful variations of the population count instruction that are unavailable in current ISAs.
The outputs from the population counter control the LROTC circuits, one for each local
R of each stage. Each LROTC circuit is realized as a barrel rotator modified to complement
the bits that wrap around (Figure 3.18(a)). However, while a standard 2^m-bit rotator has m stages and control bits, this rotator has m + 1 stages and control bits, as the period of the 2^m-bit LROTC operation is 2^(m+1). The final stage selects between its input and the complement, as the bits wrap 2^m positions.
Propagating the constants at the input greatly simplifies the LROTC circuit (see Algo-
rithm 3.11 – we always calculate LROTC(1^k, count)). However, instead of inputting the
one string, we simplify the circuit by inputting the zero string to the rotator (as it is simpler
to propagate zeros) and then adding an extra layer of inverters at the output (Figure 3.18(b)).
When the two inputs to a multiplexer are both zero, the multiplexer is replaced by
a constant logic “0” (Figure 3.18(c)). When a “0” and “1” are fed to the “0” and “1”
inputs, respectively, the multiplexer is replaced by a pass through of the select control
(Figure 3.18(d)). When a logic signal is fed to “0” input and a “1” to the “1” input, the
multiplexer reduces to an OR of the logic signal and the select control (Figure 3.18(e)).
When a “0” is fed to the “0” input and a logic signal is fed to the “1” input, the multiplexer
reduces to an AND of the logic signal and the select control (Figure 3.18(f)). When a
logic signal is fed to the ”0” input and the complement of the logic signal is fed to the
“1” input, the multiplexer reduces to an XOR of the logic signal and the select control
(Figure 3.18(g)). A pass through followed by an inverter is just an inverter, and an XOR followed by an inverter is just an XNOR (which has the same implementation cost as XOR).
Also note that for the counts of the last butterfly stage, the LROTC circuits are simply inverters (as LROTC("1", b) = ¬b for a single-bit count b).
Figure 3.18: (a) Barrel rotator implementation of 8-bit LROTC circuit; (b)-(f) Simplification of LROTC circuit; (g) Final simplified LROTC circuit.

An overall diagram of the decoder is shown in Figure 3.14. The outputs from the de-
coder are routed to the butterfly network to be used for the pdep instruction. Additionally,
the outputs are routed to the inverse butterfly network, reversing the order of the stages, to
be used for the pex instruction. Unfortunately, there is no overlap between the decoder and
the routing of the data through the butterfly network for the pdep instruction since the con-
trol bits for the first stage of the butterfly network depend on the widest population count
(see Algorithm 3.11 and Figure 3.14), which takes the longest to generate. For the pex
instruction, on the other hand, the control bits for the first stages of the inverse butterfly
network are generated very quickly – hence the time taken for the decoder (for calculating
the control bits for the later stages) can be overlapped with the routing of the data.
3.3 Advanced Bit Manipulation Functional Unit
Given the set of advanced bit manipulation operations (Table 3.1), the knowledge of which
datapaths can be used to perform the operations (Section 3.2), the algorithm for configur-
ing those datapaths (Algorithm 3.11), and the hardware implementation of that algorithm
(Figure 3.14), we are now in a position to construct an advanced bit manipulation functional
unit. We sketch the blocks of a functional unit that supports all four operations – parallel
extract, parallel deposit, butterfly permutations and inverse butterfly permutations.
The first building block is a butterfly network followed by a masking stage and extra
register storage to hold the configuration for the butterfly network (Figure 3.19(a), which
shows a 64-bit functional unit with 6 stages). This supports the pdep and bfly instruc-
tions. The masking stage zeroes the bits not selected in the pdep operation. For bfly, we
use a mask of all “1”s to select all the permuted bits. A set of application registers (ARs)
is used to hold the control bits for the butterfly network (as discussed in Section 2.3.1).
Figure 3.19(a) supports butterfly permutations as the application registers can be loaded
with the configuration for any given permutation. Parallel deposit is also supported as a
configuration for the butterfly network that performs the desired parallel deposit operation
Figure 3.19: (a) Functional unit supporting butterfly and parallel deposit operations. (b) Functional unit building block supporting inverse butterfly and parallel extract operations.
can be derived by a software routine using Algorithm 3.11. The ARs are loaded with this
configuration and the input is permuted through the butterfly network. The bits that are
not needed are then masked out using the bit mask that defines the desired parallel deposit
operation.
Figure 3.19(b) shows the analog of this building block for implementing inverse butter-
fly permutations and the parallel extract instruction. This contains a masking stage followed
by an inverse butterfly network and a set of application registers to hold the configuration.
The ARs are loaded with the configuration for some inverse butterfly permutation or a par-
allel extract operation. For parallel extract, we mask out the undesired bits with the bit
mask and then permute through the inverse butterfly network.
Figure 3.20 combines these two building blocks to form a single functional unit that
supports all four operations.
However, utilizing only a software implementation of Algorithm 3.11 is not always de-
sirable. The compiler (or programmer) can only derive the configuration when the pex or
pdep operation is static, i.e., if the bit mask is known when the program is being written or
compiled. However, the bit mask might be variable and only known at runtime. In this case,
the configuration for the butterfly or inverse butterfly datapath must be produced on the fly
Figure 3.20: Functional unit supporting butterfly, inverse butterfly, parallel extract and parallel deposit.
and the performance cost due to running the software routine might be undesirable. For
the highest performance, we use the hardware decoder circuit of Figure 3.14. Figure 3.21
shows the functional unit enhanced with the decoder. (The circular arrow in Figure 3.21
indicates reversing the stages as the control bits for stage i of the butterfly network are
equivalent to those for stage lg(n) + 1− i of the inverse butterfly network.)
Figure 3.21: Functional unit supporting butterfly, inverse butterfly and static and variable parallel extract and parallel deposit instructions.
At this point, we can define two variations of the parallel extract and parallel deposit
instructions – static pex and pdep that use a pre-decoded set of configuration bits stored
in the ARs and dynamic or variable pex.v and pdep.v that use the hardware decoder to
decode the mask at runtime.
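For concreteness, the following is a minimal bit-serial C sketch of the pex and pdep semantics themselves (the functional behavior only, not the lg(n)-stage datapath the hardware uses; the function names are ours):

    #include <stdint.h>

    // pex (bit gather): the data bits of x selected by mask are
    // compressed to the right, preserving their relative order.
    uint64_t pex64(uint64_t x, uint64_t mask) {
        uint64_t r = 0;
        for (uint64_t m = mask, bit = 1; m != 0; m &= m - 1, bit <<= 1)
            if (x & m & -m)          // m & -m isolates the lowest set mask bit
                r |= bit;
        return r;
    }

    // pdep (bit scatter): the rightmost data bits of x are scattered
    // to the bit positions set in mask; all other bits are zeroed.
    uint64_t pdep64(uint64_t x, uint64_t mask) {
        uint64_t r = 0;
        for (uint64_t m = mask, bit = 1; m != 0; m &= m - 1, bit <<= 1)
            if (x & bit)
                r |= m & -m;
        return r;
    }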
Another possible scenario is that the configuration is only known at runtime, but it is
unchanging across many iterations of a loop, i.e., it is loop invariant. In this case, we
use the hardware decoder circuit once to load a configuration for pex or pdep into the
application registers (see the new multiplexers in front of the ARs in Figure 3.22) and
then shut down the decoder (by adding latches to block the bitmask input) and use the
application registers directly. This has the advantage of removing the decoder from the
critical path thereby decreasing latency for the subsequent pex or pdep instructions, and
also possibly conserving power by shutting down the decoder circuitry. We define the
setib and setb instructions to decode a bit mask and store the result in the inverse
butterfly or butterfly ARs.
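As an illustration of the intended usage pattern, here is a C sketch with hypothetical intrinsic names of our own (the thesis defines the instructions, not these language bindings): the loop-invariant mask is decoded by hardware once, outside the loop, and every iteration then issues the faster static form of pex.

    #include <stdint.h>
    #include <stddef.h>

    extern void     __setib(uint64_t mask);    // hypothetical: decode mask into the ibfly ARs
    extern uint64_t __pex_static(uint64_t x);  // hypothetical: static pex using preloaded ARs

    void gather_fields(uint64_t *dst, const uint64_t *src,
                       size_t cnt, uint64_t mask) {
        __setib(mask);                      // hardware decode, once per loop
        for (size_t i = 0; i < cnt; i++)
            dst[i] = __pex_static(src[i]);  // no decoder on the critical path
    }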
Figure 3.22: Functional unit supporting butterfly, inverse butterfly and static, variable and loop invariant pex and pdep.
For completeness, we also consider a functional unit that supports the grp permutation
instruction. The grp permutation instruction is equivalent to a standard parallel extract
ORed with a “parallel extract to the left” of the bits selected with “0”s. Consequently, for
the functional unit to support grp, we need to add a second decoder and a second inverse
butterfly network to perform the “parallel extract to the left” (Figure 3.23).
Figure 3.23: Functional unit supporting butterfly, inverse butterfly and static, variable and loop invariant pex and pdep, and grp.
3.3.1 Implementation of the Standalone Advanced Bit Manipulation Functional Unit
We evaluated the functional units of Figures 3.19(a), 3.19(b), 3.20, 3.22 and 3.23 for
timing and area. These support static pdep only (Figure 3.19(a)), static pex only (Fig-
ure 3.19(b)), both static pex and pdep (Figure 3.20), loop invariant and dynamic pex.v
and pdep.v as well (Figure 3.22), and grp as well (Figure 3.23). The static pdep only
unit also supports the bfly permutation instruction, the static pex only unit also supports
the ibfly permutation instruction and the latter three also support both bfly and ibfly
permutation instructions (while the circuit that supports grp does not need these instruc-
tions, the hardware is required for static pex and pdep). The circuits in Figures 3.22
and 3.23 are implemented with a 3-stage pipeline. The hardware decoder occupies the first
2 pipeline stages due to its slow parallel prefix population counter. The butterfly (or inverse
butterfly) network is in the third stage. The static units complete in a single processor cycle – hence, no pipelining is needed.
The various functional units were synthesized using Synopsys Design Compiler map-
ping to a TSMC 90nm standard cell library [77]. The designs were compiled to optimize
timing. The decoder circuit was initially compiled as one stage and then Design Compiler
automatically pipelined the subcircuit. Timing and area figures are as reported by Design
Compiler. We also synthesized an ALU using the same technology library as a reference for latency and area comparisons.
Table 3.2 summarizes the timing and area for the circuits. (Note that all latency and area
numbers in this thesis include register overhead.) This shows that the 64-bit functional unit
in Figure 3.20 supporting static pex and pdep has a shorter latency than that of a 64-bit
ALU and about 87% of its area (in NAND-gate equivalents). Splitting the static unit into two units, one supporting pex only and one pdep only, does not increase the area. Hence, we may choose to split the unit in order to accommodate superscalar execution.
In contrast, to support variable and loop-invariant pex and pdep (as in Figure 3.22), which requires a hardware decoder, the advanced bit manipulation functional unit will have a cycle time 15% slower than that of an ALU and will be 2.17× larger. To also support a fast
implementation of the grp instruction, the proposed new unit will be 23% slower than an
ALU and about 3.2× larger in area.
Table 3.3 shows the number of different circuit types, to give a sense for why the func-
tional units supporting variable pex.v, pdep.v and grp are so much larger. It clearly
shows that supporting variable operations comes at a high price. The added complexity
is due to the complex decoder combinational logic and to the additional pipeline registers (this is an irregular number due to the pipeline cutting across the decoder) and multiplexer logic (for the ARs inputs, the control bit inputs and to select from multiple outputs). This
Table 3.2: Latency and area of functional units

Functional      Latency  Latency Relative  Latency Relative  Area          Area Relative  Area Relative
Unit            (ns)     to ALU            to pex, pdep      (NAND gates)  to ALU         to pex, pdep
ALU             0.5      1                 1.05              7.5K          1              1.14
pex             0.48     0.95              1                 3.3K          0.43           0.50
pdep            0.48     0.95              1                 3.3K          0.44           0.50
pex, pdep       0.48     0.95              1                 6.6K          0.87           1
pex.v, pdep.v   0.56     1.12              1.18              16.3K         2.17           2.48
grp             0.62     1.23              1.30              24.2K         3.22           3.68
Table 3.3: Number of registers and logical blocks in functional units

Functional Unit   64-bit Pipeline Registers   Butterfly and Inverse Butterfly Networks   Hardware Decoders   Muxes
pex, pdep         0                           2                                          0                   1
pex.v, pdep.v     ∼9.32                       2                                          1                   13
grp               ∼15.1                       3                                          2                   14
explains why in Table 3.2, the variable circuits (Figures 3.22 and 3.23) have approximately
18-30% longer cycle time latencies compared to the static case (Figure 3.20). They are also
2.5 to 3.7 times larger than the static unit.
3.4 Summary
This chapter introduces the parallel extract instruction to perform bit gather operations
and the parallel deposit instruction to perform bit scatter operations. Datapaths for these
instructions were also discussed. We showed that parallel extract maps to the inverse
butterfly network and parallel deposit maps to the butterfly network.
We then presented an algorithm for decoding the pex or pdep bitmask into the con-
trol bits for the datapaths. A hardware implementation of this algorithm was also pre-
sented. This hardware decoder has two basic components – parallel prefix population count
(POPCNT) and left-rotate-and-complement (LROTC) subcircuits.
With this foundation, we presented the design of a standalone advanced bit manipulation functional unit that supports bfly, ibfly, pex and pdep. The high cost of
the hardware decoder that translates the pex or pdep bitmask into the datapath controls
prompted the definition of two variants of the pex and pdep instructions – static versions,
in which the datapath control bits are precomputed using Algorithm 3.11 in software (by
the compiler or programmer), and loop-invariant versions, in which hardware decoding is
performed once and then the static versions of the instructions are used. This led to the
design of two alternative functional units – one that supports only static pex and pdep
operations and one that supports variable pex.v and pdep.v instructions as well as loop
invariant operations. The unit that supports only the static instructions is faster (0.95× the
latency) and smaller (0.9× the area) than an ALU.
The next chapter of the thesis presents a new design for the standard shifter using the
advanced bit manipulation functional unit as its basis.
3.5 Appendix: Decoding the pex Bitmask into Inverse Butterfly Control Bits
We will now show that the inverse butterfly control bits for a pex operation can also be
obtained using Algorithm 3.11. This can be shown using an approach similar to that used
for pdep on butterfly in Section 3.2.
In the final stage of the inverse butterfly network, we call X the selected data bits that
pass through in R, Y the selected data bits that are transferred from L to R and Z the
selected data bits that pass through in L. Consider the possible combinations of X , Y and
Z:
i. X alone exists – when there are no selected data bits in L,
ii. Y alone exists – when there are no selected data bits in R,
Figure 3.24: Final stage of pex operation.
iii. X and Y exist – when no selected data bits that start in L stay in L,
iv. X and Z exist – when all the data bits in R are selected and there are some selected
data bits in L, or
v. X , Y and Z exist.
When X exists (i, iii, iv and v), it must be the rightmost bits of R, as the output of pex
is the selected data bits compressed maintaining their relative ordering and right justified
in the result.
When Y exists (ii, iii and v), it must be rotated left from the midpoint by the size of X
at the input to the final stage so that Y and X will be contiguous at the output of the final
stage after Y is swapped from L to R.
When Z exists (iv and v), it must be the rightmost bits in L so that when it is passed
through in L in the final stage it is contiguous with the bits in R.
When Y and Z exist (v), Y must additionally be the leftmost bits of L so that when
it is swapped into R in the final stage it is contiguous to Z on its left. Consequently, Y
and Z must be contiguous mod n/2 in L. Additionally we can view Y and Z as being the
rightmost bits of L at the output to stage lg(n)− 1 which are then left rotated by |X| prior
to the input to stage lg(n). See Figure 3.24.
Thus, prior to the final stage we have two half-sized inverse butterfly networks, L and R,
with a pex operation performed within each half network (first row of Figure 3.24, where ZY is the right-justified and compressed string of selected data bits in L and X is the right-justified and compressed string of selected data bits in R). We then iterate backwards through the stages, performing a local pex operation within each half subnetwork. Between the stages we explicitly left rotate each local L by the number of selected data bits in the local R.
We use Fact 3.8 to route the bits in the local pex operation. This process is illustrated for
an example pex operation in Figure 3.25(a), broken down into steps in Figure 3.25(b).
Figure 3.25: (a) 8-bit pex operation (b) mapped onto inverse butterfly network with explicit left rotations between stages and (c) without explicit rotations of data bits by modifying the control bits.
In the first stage in Figure 3.25(b), we perform a local pex within each 2-bit subnet-
work:
• a belongs in position 6 for the local 2-bit pex, so it is swapped from L to R to yield
0a;
• c belongs in position 4, so it is swapped to yield 0c;
• e and f are in the correct positions (3 and 2, respectively), so they are passed through to
yield ef;
• h is in position 0, so it is passed through to yield 0h.
Then, prior to stage 2, we explicitly left rotate each local L of stage 2 by the number of “1”
bits in the corresponding local R of stage 2: 0a is left rotated by 1 to a0, as there is one
selected data bit (c) in the corresponding local R, and ef is also left rotated by 1 to fe, as
there also is one selected data bit (h) in the corresponding local R.
In stage 2, we perform a local pex within each 4-bit subnetwork:
• a belongs in position 5 and c in position 4 for the local 4-bit pex, so a is swapped
to yield 00ac.
• e belongs in position 2, f in position 1 and h in position 0, so f is swapped to yield
0efh.
Prior to stage 3, we explicitly left rotate L of stage 3, 00ac, by 3, the number of selected
bits in R of stage 3 (efh), to c00a.
In the final stage, c belongs in position 3 so it is swapped from L to R to yield the
overall output 000acefh.
Rather than explicitly left rotating L after each stage, we can incorporate the rotations
directly into the routing of the pex operation by modifying the control bits. This is based
on a property of inverse butterfly networks that given a permutation performed by the inverse butterfly network, a rotation of that permutation can also be achieved by rotating and complementing the control bits by the same amount. (This can be shown using an argument similar to the one used for Fact 3.6, the analogous fact for butterfly networks.)
Stage i of the inverse butterfly network can add 2^{i-1} positions to the rotation, i.e., at the output of the first stage the bits can be rotated by 0 or 1 positions from the initial permutation. In the second stage, the bits can be rotated an additional 0 or 2 positions, i.e., the bits can be rotated from the inputs by 0, 1, 2 or 3 positions, and so on. Thus, we can rotate the output of any stage of a subnetwork starting from any input and any initial permutation.
Consider the example of Figure 3.25(b), where we wish to eliminate the explicit data
rotations in the local L halves by rotating the control bits instead. To add a left rotation by 1 of 0a and ef through stage 1, we (left rotate and) complement the control bit for those two subnetworks by 1 and we get the desired result (a0 and fe) as shown in Figure 3.26. Then, to add a left rotation by 3 in L through stage 2, we left rotate and complement the control bits by 3 in the first 2 stages of L. The first stage control bits are complemented 3 times in succession: 0 → 1 → 0 → 1 and 1 → 0 → 1 → 0, and the second stage control bits are left rotated and complemented by 3 positions: 10 → 00 → 01 → 11. The result is
shown in Figure 3.25(c), where we see no explicit rotations, yet nevertheless the input of L
in stage 3 is still c00a and the overall output is still 000acefh.
Hence, the derivation of the control bits for the inverse butterfly network for a pex
operation is in fact identical to the derivation of the control bits for the pdep on the butterfly
network. At each stage, starting from stage 1, in each subnetwork, we pass through the k
rightmost bits by setting the k rightmost control bits to “0”, if there are k “1”s in the local
R of the mask. This is generated by LROTC(“1 . . . 1”, k). We then left rotate the output
by the number of “1”s in the next local R by left rotating and complementing the control
bits from stage 1 through the current stage by that amount. When we add up the effect of
rotating the outputs of each stage – each rotation requires rotating the control bits from stage 1 through that stage – the net result is a LROTC of the control bits by the total number of "1"s to the right of the subnetwork.

Figure 3.26: 8-bit pex operation with explicit left rotation after stage 1 eliminated.
So we arrive at the same result: the initial control bits are obtained by a LROTC of a
string of all “1”s the length of the local R by the number of “1”s in the local R of the mask
and are then corrected by a further LROTC by the total number of “1”s to the right in the
mask. These can be combined into a single LROTC of the all one string by the total number
of “1”s in the local R and to the right. Thus Algorithm 3.11 also yields the configuration
bits for mapping pex to inverse butterfly, with the one caveat that the controls for stage i
of the butterfly datapath is equivalent to stage lg(n)− i+ 1 in the inverse butterfly datapath
(as indicated in the final line of Algorithm 3.11).
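A C sketch of this derivation follows – our own rendering of the scheme just described, not a verbatim transcription of Algorithm 3.11 (the helper names and the control-bit packing, one bit per switch with subnetworks packed from the right, are assumptions):

    #include <stdint.h>

    // Left-rotate-and-complement of the low len bits of x by rot positions:
    // each bit shifted out on the left re-enters complemented on the right.
    static uint64_t lrotc(uint64_t x, unsigned rot, unsigned len) {
        uint64_t m = (len >= 64) ? ~0ULL : ((1ULL << len) - 1);
        x &= m;
        while (rot--) {
            uint64_t msb = (x >> (len - 1)) & 1;
            x = ((x << 1) | (msb ^ 1)) & m;
        }
        return x;
    }

    // Control bits for an n = 2^lgn bit inverse butterfly performing pex
    // with the given bit mask; cb[t] packs the n/2 control bits of inverse
    // butterfly stage t+1. Each subnetwork's bits are one LROTC of an
    // all-ones string by the count of mask "1"s in its local R and to its right.
    void pex_ibfly_controls(uint64_t mask, unsigned lgn, uint64_t cb[]) {
        for (unsigned i = 1; i <= lgn; i++) {      // i = butterfly stage number
            unsigned half = 1u << (i - 1);         // width of each local L and R
            uint64_t stage = 0;
            for (unsigned base = 0; base < (1u << lgn); base += 2 * half) {
                uint64_t low = mask & (((base + half) >= 64) ? ~0ULL
                                        : ((1ULL << (base + half)) - 1));
                unsigned cnt = (unsigned)__builtin_popcountll(low);
                uint64_t ones = (half >= 64) ? ~0ULL : ((1ULL << half) - 1);
                stage |= lrotc(ones, cnt, half) << (base / 2);
            }
            cb[lgn - i] = stage;   // bfly stage i = ibfly stage lg(n) - i + 1
        }
    }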
Chapter 4
Advanced Bit Manipulation Functional Unit as a Basis for Shifters
In the previous chapter, we described a standalone advanced bit manipulation functional unit that supports bit permutation, bit gather and bit scatter operations. In this chapter,
we propose using the advanced bit manipulation functional unit as a new basis for shifters,
rather than simply adding it to a processor core, or enhancing the current shifter to also
support the above advanced bit manipulation functions. We propose replacing two existing
functional units (the shifter and the multimedia-mix functional units in Itanium processors)
with the new permutation functional unit and performing all the operations previously done
on these existing shifters as well as the advanced bit permutation, bit gather and bit scatter
operations on the same datapath.
This chapter is organized as follows. Section 4.1 lists the basic shifter instructions.
Section 4.2 then shows that the advanced bit manipulation functional unit can perform
any of these basic operations. The hard part is determining how the controls of the lg(n)
stages of the circuit should be set, and we give definitive algorithms for this. Section 4.3
gives a detailed description of the complete new shift-permute functional unit. Section 4.4
discusses how this new functional unit can be considered an evolution of the design of
shifters and compares the new design to the classic log shifter. Section 4.5 summarizes this
chapter.
4.1 Basic Shifter Operations
The basic shifter instructions consist of two groups and are summarized in Table 4.1. The
first group consists of the shift and rotate instructions supported in essentially all micropro-
cessors. These instructions include right or left shifts (with zero or sign propagation), and
right or left rotates. While a few microprocessors support only shifts but not rotates, we will
consider rotate as a basic supported operation. The second group of instructions includes extract and deposit instructions and mix instructions, all introduced in Chapter 2. No ISA currently supports mix for subwords smaller than a byte. In our proposed new functional unit, mix for bits – and for all subword sizes that are powers of 2 – is supported. This includes 12 mix operations: mix.left and mix.right for each of the 6 subword sizes 2^0, 2^1, 2^2, 2^3, 2^4, 2^5, for a 64-bit datapath processor.
4.2 Basic Shifter Operations on the Inverse Butterfly Datapath
The key conceptual insight comes from recognizing that the set of basic shifter and mix operations in Table 4.1 is based on minor variations on a rotation operation, and that any rotation operation can be achieved on an inverse butterfly circuit (or on a butterfly circuit).
Theorem 4.1. An inverse butterfly circuit can achieve any rotation of its input.
This is a well-known property of inverse butterfly circuits. A proof of this can be found
in [78], where rotations are called cyclic shifts.
Table 4.1: Standard shifter operations

Instruction                   Description
rotr r1 = r2, shamt           Right rotate of r2. The rotate amount is either an immediate in the
rotr r1 = r2, r3              opcode (shamt), or specified using a second source register r3.
rotl r1 = r2, shamt           Left rotate of r2.
rotl r1 = r2, r3
shr r1 = r2, shamt            Right shift of r2 with vacated positions on the left filled with the
shr r1 = r2, r3               sign bit (most significant bit of r2) or zero-filled. The shift amount
shr.u r1 = r2, shamt          is either an immediate in the opcode or specified using a second
shr.u r1 = r2, r3             source register r3.
shl r1 = r2, shamt            Left shift of r2 with vacated positions on the right zero-filled.
shl r1 = r2, r3
extr r1 = r2, pos, len        Extraction and right justification of a single field from r2 of length
extr.u r1 = r2, pos, len      len from position pos. The high order bits are filled with the sign
                              bit of the extracted field or zero-filled.
dep.z r1 = r2, pos, len       Deposit at position pos of a single right justified field from r2 of
dep r1 = r2, r3, pos, len     length len. Remaining bits are zero-filled or merged from second
                              source register r3.
mix{r,l}{0,1,2,3,4,5}         Select right or left subword from a pair of subwords, alternating
  r1 = r2, r3                 between source registers r2 and r3. Subword sizes are 2^i bits, for
                              i = 0, 1, 2, ..., 5, for a 64-bit processor.
Corollary 4.2. An enhanced inverse butterfly circuit can perform on its input:
a. Right and left shifts
b. Extract operations
c. Deposit operations
d. Mix operations
Proof. This follows from Theorem 4.1, with these operations modeled as a rotate with
additional logic handling zeroing, or sign extension from an arbitrary position, or merging
bits from the second source operand. Mix is modeled as a rotate of one operand by the
subword size and then a merge of subwords alternating between the two operands.
As the inverse butterfly circuit only performs permutations without zeroing and without
replication, the circuit must be enhanced with an extra 2:1 multiplexer stage at the end that
either selects the rotated bits “as is” or other bits which are precomputed as either zero, or
the sign bit (replicated), or the bits of the second source operand, depending on the desired
operation.
Corollary 4.3. Theorem 4.1 and Corollary 4.2 are true for the butterfly network as
well.
Proof. The butterfly and inverse butterfly networks exhibit a reverse symmetry of their
stages from input to output. Thus a rotation on the inverse butterfly network is equivalent
to a rotation in the opposite direction on the butterfly network when the flow through the
network is reversed (see Figure 4.1). Hence, a butterfly circuit can also achieve any rotation
of its inputs. As in Corollary 4.2, a butterfly network enhanced with an extra multiplexer
stage at the end is needed to handle zeroing or sign extension or merging bits from the
second source operand.
Figure 4.1: Left rotate by three on inverse butterfly is equivalent to right rotate by three on butterfly.
We first show how control bits are obtained for rotations on an inverse butterfly circuit
in Section 4.2.1, then for the other operations in Section 4.2.2.
4.2.1 Determining the Control Bits for Rotations
To achieve a right (or left) rotation by s positions, for s = 0, 1, 2, ..., n − 1, using the n-bit wide inverse butterfly circuit with lg(n) stages, the input must be right (or left) rotated by s mod 2^j within each 2^j-bit wide inverse butterfly circuit at each stage j. This is because from stage j + 1 on, the inverse butterfly circuit can only move bits at granularities larger than 2^j positions (so the finer movements must have already been performed in the prior stages). We first give a conceptual explanation of this, then a formal constructive proof to
obtain the actual control bits for a rotation.
An n-bit inverse butterfly circuit can be viewed as two (lg(n) − 1)-stage circuits followed by a stage that swaps or passes through paired bits that are n/2 positions apart. To right rotate the input in_{n-1} ... in_0 by s positions, the two (lg(n) − 1)-stage circuits must have right rotated their half inputs by s' = s mod n/2 and the input to stage lg(n) must be of the form:

    in_{n/2+s'-1} ... in_{n/2} in_{n-1} ... in_{n/2+s'} || in_{s'-1} ... in_0 in_{n/2-1} ... in_{s'}    (4.1)
as the last stage can only move bits by n/2 positions. This is illustrated in Figure 4.2.
When the rotation amount s is less than n/2, the bits that wrapped around in the (lg(n) − 1)-stage circuits (cross-hatched) must be swapped in the final stage to yield the input right rotated by s (Figure 4.2(a)):

    in_{s'-1} ... in_0 in_{n-1} ... in_{n/2+s'} || in_{n/2+s'-1} ... in_{n/2} in_{n/2-1} ... in_{s'}    (4.2)
When the rotation amount is greater than or equal to n/2, the bits that do not wrap in the (lg(n) − 1)-stage circuits (solid) must be swapped in the final stage to yield the input right rotated by s (Figure 4.2(b)):

    in_{n/2+s'-1} ... in_{n/2} in_{n/2-1} ... in_{s'} || in_{s'-1} ... in_0 in_{n-1} ... in_{n/2+s'}    (4.3)
Figure 4.2: (a) rotation by s < n/2 and (b) by s ≥ n/2.
For example, consider the 8-bit inverse butterfly network with right rotation amount
s = 5, depicted in Figure 4.3. As s = 5 is greater than n/2 = 4, the bits that did not wrap
in stage 2 are swapped in stage 3 to yield the final result.
As the rotation amount through stage 2, s mod 2^2 = 5 mod 4 = 1, is less than n/4 = 2, the bits that did wrap in stage 1 are swapped in stage 2 to yield the input to stage 3.
As the rotation amount through stage 1, s mod 2^1 = 5 mod 2 = 1, is equal to n/8 = 1, the bits that did not wrap in the input, i.e., all the bits, are swapped in stage 1 to yield the input to stage 2.
We can mathematically derive recursive equations for the control bits, cb_j, j = 1, 2, ..., lg(n), for achieving rotations on an inverse butterfly datapath. These equations yield a
Figure 4.3: Right rotate by 5 on 8-bit, 3-stage inverse butterfly network
compact circuit (shown in Figure 4.4) for the (rotation) control bit generator.
From (4.1)-(4.3) and Figure 4.2, we observe that the pattern for the control bits for the final stage, which we call cb_{lg(n)}, for a rotate of s bits, is:

    cb_{lg(n)} = 1^s || 0^{n/2-s},              if s < n/2
               = 0^{s-n/2} || 1^{n/2-(s-n/2)},  if s ≥ n/2    (4.4)
where a^k is a string of k "a"s, "1" means "swap" and "0" means "pass through." Note that s = s mod n/2 when s < n/2 and s − n/2 = s mod n/2 when s ≥ n/2:

    cb_{lg(n)} = 1^{s mod n/2} || 0^{n/2-(s mod n/2)},  if s mod n < n/2
               = 0^{s mod n/2} || 1^{n/2-(s mod n/2)},  if s mod n ≥ n/2

    cb_{lg(n)} = 1^{s mod n/2} || 0^{n/2-(s mod n/2)},     if s mod n < n/2
               = ∼(1^{s mod n/2} || 0^{n/2-(s mod n/2)}),  if s mod n ≥ n/2    (4.5)
where ∼ indicates negation.
Furthermore, due to the recursive structure of the inverse butterfly circuit, we can generalize (4.5) by substituting j for lg(n), 2^j for n and 2^{j-1} for n/2:

    cb_j = 1^{s mod 2^{j-1}} || 0^{2^{j-1}-(s mod 2^{j-1})},     if s mod 2^j < 2^{j-1}
         = ∼(1^{s mod 2^{j-1}} || 0^{2^{j-1}-(s mod 2^{j-1})}),  if s mod 2^j ≥ 2^{j-1}    (4.6)
There are j bits in s mod 2^j, with the most significant bit denoted s_{j-1}. The condition s mod 2^j < 2^{j-1} is equivalent to s_{j-1} being equal to 0 and the condition s mod 2^j ≥ 2^{j-1} is equivalent to s_{j-1} being equal to 1:

    cb_j = 1^{s mod 2^{j-1}} || 0^{2^{j-1}-(s mod 2^{j-1})},     if s_{j-1} = 0
         = ∼(1^{s mod 2^{j-1}} || 0^{2^{j-1}-(s mod 2^{j-1})}),  if s_{j-1} = 1    (4.7)
Equation (4.7) can be rewritten as the pattern XORed with s_{j-1}:

    cb_j = (1^{s mod 2^{j-1}} || 0^{2^{j-1}-(s mod 2^{j-1})}) ⊕ s_{j-1}.    (4.8)
Since s mod k ≤ k − 1, k − (s mod k) ≥ 1 and hence the length of the string of zeros in (4.8) is always ≥ 1 (here k = 2^{j-1}). Consequently, the least significant bit of the pattern (prior to XOR with s_{j-1}) is always "0":

    cb_j = (1^{s mod 2^{j-1}} || 0^{2^{j-1}-1-(s mod 2^{j-1})} || 0) ⊕ s_{j-1}

    cb_j = ((1^{s mod 2^{j-1}} || 0^{2^{j-1}-1-(s mod 2^{j-1})}) ⊕ s_{j-1}) || s_{j-1}    (4.9)
We call the bit pattern inside the inner parentheses of (4.9) f(s, j), a string of 2^{j-1} − 1 bits with the s mod 2^{j-1} leftmost bits set to "1" and the remaining bits set to "0." This function definition is only for j ≥ 2, and the function is defined to return the empty string for j = 1:

    f(s, j) = 1^{s mod 2^{j-1}} || 0^{2^{j-1}-1-(s mod 2^{j-1})},  if j ≥ 2
            = {},                                                  if j = 1    (4.10)

    cb_j = (f(s, j) ⊕ s_{j-1}) || s_{j-1}    (4.11)
Note that we can derive f(s, j + 1) from f(s, j):

    f(s, j + 1) = 1^{s mod 2^j} || 0^{2^j-1-(s mod 2^j)}    (4.12)
If bit s_{j-1} = 0 then s mod 2^j = s mod 2^{j-1}:

    f(s, j + 1) = 1^{s mod 2^{j-1}} || 0^{2^j-1-(s mod 2^{j-1})}
                = 1^{s mod 2^{j-1}} || 0^{2^{j-1}-1-(s mod 2^{j-1})} || 0^{2^{j-1}}
    f(s, j + 1) = f(s, j) || 0^{2^{j-1}}    (4.13)
If bit s_{j-1} = 1 then s mod 2^j = 2^{j-1} + s mod 2^{j-1}:

    f(s, j + 1) = 1^{2^{j-1}+(s mod 2^{j-1})} || 0^{2^j-1-(2^{j-1}+(s mod 2^{j-1}))}
                = 1^{2^{j-1}} || 1^{s mod 2^{j-1}} || 0^{2^{j-1}-1-(s mod 2^{j-1})}
    f(s, j + 1) = 1^{2^{j-1}} || f(s, j)    (4.14)
Combining (4.13) and (4.14), we get:

    f(s, j + 1) = f(s, j) || 0^{2^{j-1}},  if s_{j-1} = 0
                = 1^{2^{j-1}} || f(s, j),  if s_{j-1} = 1

    f(s, j + 1) = f(s, j) || 0 || 0^{2^{j-1}-1},  if s_{j-1} = 0
                = 1^{2^{j-1}-1} || 1 || f(s, j),  if s_{j-1} = 1    (4.15)

Since f(s, j) is a string of 2^{j-1} − 1 bits, we can replace the string of ones and zeros in (4.15) by f(s, j) ORed (+) and ANDed (•) with 1 and 0, respectively:

    f(s, j + 1) = f(s, j) + 0 || 0 || f(s, j) • 0,  if s_{j-1} = 0
                = f(s, j) + 1 || 1 || f(s, j) • 1,  if s_{j-1} = 1

    f(s, j + 1) = (f(s, j) + s_{j-1} || s_{j-1} || f(s, j) • s_{j-1})    (4.16)

From (4.10) and (4.16) we obtain a simple recursive expression for f(s, j):

    f(s, j) = (f(s, j − 1) + s_{j-2} || s_{j-2} || f(s, j − 1) • s_{j-2}),  if j ≥ 2
            = {},                                                           if j = 1    (4.17)
Figure 4.4 depicts the hardware implementation of the control bit generator for rotations. Equation (4.17) is used to derive f(s, 2), f(s, 3), f(s, 4) and f(s, 5). Also, the control bits for rotations, cb_1, cb_2, cb_3, cb_4 and cb_5, are obtained using equation (4.11). This implementation is based on sharing of gates by reusing f(s, j) for both cb_j and f(s, j + 1).
Figure 4.4: Rotation control bit generator

We now illustrate the use of these equations with the example of Figure 4.3, the 8-bit
inverse butterfly network with right rotation amount s = 5 (s_2 s_1 s_0 = 101). The first stage control bit, cb_1, replicated for the four 2-bit circuits, is given by equations (4.11) and (4.17):

    cb_1 = (f(5, 1) ⊕ s_0) || s_0 = ({} ⊕ s_0) || s_0 = s_0 = 1
The second stage control bits, cb_2, replicated for the two 4-bit circuits, are given by:

    cb_2 = (f(5, 2) ⊕ s_1) || s_1
         = ((f(5, 1) + s_0 || s_0 || f(5, 1) • s_0) ⊕ s_1) || s_1
         = (({} + s_0 || s_0 || {} • s_0) ⊕ s_1) || s_1
         = (s_0 ⊕ s_1) || s_1
         = (1 ⊕ 0) || 0
         = 10.
Note that f(5, 2) = 1 in the above. The final stage control bits, cb_3, are given by:

    cb_3 = (f(5, 3) ⊕ s_2) || s_2
         = ((f(5, 2) + s_1 || s_1 || f(5, 2) • s_1) ⊕ s_2) || s_2
         = ((1 + s_1 || s_1 || 1 • s_1) ⊕ s_2) || s_2
         = ((1 + 0 || 0 || 1 • 0) ⊕ 1) || 1
         = ∼(100) || 1
         = 0111.
Figure 4.3 shows that this configuration of the inverse butterfly circuit does indeed right
rotate the input by 5 mod 8 (and that the outputs of stage 2 are rotated by 5 mod 4 = 1 and
that the outputs of stage 1 are rotated by 5 mod 2 = 1).
4.2.2 Determining the Control Bits for Other Shift Operations
The other operations (shifts, extract, deposit and mix) are modeled as a rotation part plus a
masked-merge part with zeroes, sign bits or second source operand bits. The rotation part
can use the same rotation control bit generator described above to configure the inverse
butterfly network datapath. We achieve the masked-merge part by using an enhanced in-
verse butterfly datapath with an extra multiplexer stage added as the final stage. The mask
control bits are “0” when selecting the rotated bits and “1” when selecting the merge bits.
We now describe how this is used to generate these other operations: shift, extract, deposit
and mix.
For a right shift by s, the s sign or zero bits on the left are merged in. This requires a control string 1^s || 0^{n-s} for the extra multiplexer stage. From the definition of f(s, j) in (4.10), we see that f(s, lg(n)+1) is the string 1^s || 0^{n-1-s}. Thus the desired control string is given by f(s, lg(n) + 1) || 0. (Recall that s < n, therefore the least significant bit is always "0", i.e., the least significant bit is always selected from the inverse butterfly datapath.) f(s, lg(n) + 1) can easily be produced by extending the rotation control bit generator by one extra stage. For left shift, which can be viewed as the left-to-right reversal of right shift, the control bits for the extra stage are obtained by reversing left-to-right the right shift control string to yield 0^{n-s} || 1^s.
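Behaviorally, this rotate-then-masked-merge model reproduces the shift instructions exactly; a C sketch for n = 64, 0 ≤ s < 64 (a model of the semantics, not of the circuit; function names are ours):

    #include <stdint.h>

    // Right rotate of a 64-bit word (the inverse butterfly's job).
    static uint64_t rotr64(uint64_t x, unsigned s) {
        return s ? (x >> s) | (x << (64 - s)) : x;
    }

    // shr.u modeled as rotate right by s, then merging zeros under the
    // mask 1^s || 0^(n-s) (mask bit "1" selects the merge bit).
    uint64_t shr_u_model(uint64_t x, unsigned s) {
        uint64_t rot  = rotr64(x, s);
        uint64_t mask = s ? (~0ULL << (64 - s)) : 0;   // 1^s || 0^(n-s)
        return (rot & ~mask) | (0ULL & mask);          // merge bits are zero
    }

    // shr (arithmetic) uses the replicated sign bit as the merge value.
    uint64_t shr_model(uint64_t x, unsigned s) {
        uint64_t rot   = rotr64(x, s);
        uint64_t mask  = s ? (~0ULL << (64 - s)) : 0;
        uint64_t merge = (x >> 63) ? ~0ULL : 0;        // sign bit replicated
        return (rot & ~mask) | (merge & mask);
    }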
For extract operations, which are like right shift operations with the left end replaced
by the sign bit of the extracted field or zeros, our enhanced inverse butterfly network selects
in its extra multiplexer stage the rotated bits or zeros or the sign bit of the extracted field
i.e., the bit in position pos+len-1 in the source register (see Figure 3.1(a)). The bit can be
selected using an n : 1 multiplexer.
The mask pattern for extract is n − len "1"s followed by len "0"s, (1^{n-len} || 0^{len}), to propagate the sign bit of the extracted field in the output (which is in position len − 1) to the high order bits (Figure 4.5(a)). Note that cb_{lg(n)+1}(len) = (f(len, lg(n) + 1) ⊕ len_{lg(n)}) || len_{lg(n)} is 1^{len} || 0^{n-len}. (len ranges from 0 to n and has lg(n)+1 bits: len_{lg(n)...0}.) So reversing left-to-right cb_{lg(n)+1}(len) yields 0^{n-len} || 1^{len} and then negating it produces 1^{n-len} || 0^{len}, the correct bit pattern for stage lg(n) + 1 (Figure 4.5(b)).
Figure 4.5: (a) Mask for extract operation. (b) Generation of mask for extract operation.
For deposit operations, which are like left shift operations with the right and left ends replaced by zeros or bits from the second operand, our enhanced inverse butterfly network selects in its extra multiplexer stage the rotated bits or zeros or bits from the second input operand. The correct pattern is a string of n − pos − len "1"s followed by len "0"s followed by pos "1"s (1^{n-pos-len} || 0^{len} || 1^{pos}) to merge in bits on the right and left around the deposited field (Figure 4.6(a)).

We can generate this pattern by observing that cb_{lg(n)+1}(pos + len) = (f(pos + len, lg(n) + 1) ⊕ (pos + len)_{lg(n)}) || (pos + len)_{lg(n)} is 1^{pos+len} || 0^{n-pos-len} (first line of Figure 4.6(b)). (pos + len ranges from 0 to n and has lg(n)+1 bits: (pos + len)_{lg(n)...0}.) Reversing left-to-right cb_{lg(n)+1}(pos + len) yields 0^{n-pos-len} || 1^{pos+len} (second line of Figure 4.6(b)) and then negating it produces 1^{n-pos-len} || 0^{pos+len} (third line of Figure 4.6(b)). Bitwise ORing this with the left shift control string, 0^{n-pos} || 1^{pos}, yields 1^{n-pos-len} || 0^{len} || 1^{pos} (last line of Figure 4.6(b)), the correct pattern for the masked-merge part of the deposit operation.
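The extract and deposit models follow the same shape; a C sketch using the mask patterns just derived (a semantic model with names of our own, assuming pos + len ≤ n):

    #include <stdint.h>

    static uint64_t rotr64(uint64_t x, unsigned s) {
        return s ? (x >> s) | (x << (64 - s)) : x;
    }
    static uint64_t rotl64(uint64_t x, unsigned s) {
        return s ? (x << s) | (x >> (64 - s)) : x;
    }

    // extr.u modeled as a right rotate by pos, then zeroing the high
    // bits under the mask 1^(n-len) || 0^len.
    uint64_t extr_u_model(uint64_t x, unsigned pos, unsigned len) {
        uint64_t rot  = rotr64(x, pos);
        uint64_t keep = (len >= 64) ? ~0ULL : ((1ULL << len) - 1);
        return rot & keep;
    }

    // dep.z modeled as a left rotate by pos, then zeroing outside the
    // deposited field under the mask 1^(n-pos-len) || 0^len || 1^pos.
    uint64_t dep_z_model(uint64_t x, unsigned pos, unsigned len) {
        uint64_t rot  = rotl64(x, pos);
        uint64_t keep = ((len >= 64) ? ~0ULL : ((1ULL << len) - 1)) << pos;
        return rot & keep;
    }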
Figure 4.6: (a) Mask for deposit operation. (b) Generation of mask for deposit operation.

For mix operations, the enhanced inverse butterfly network selects in its extra multiplexer stage the rotated bits or the bits from the second input operand. The control bit
pattern is simply a pattern of alternating strings of “0”s and “1”s, the precise pattern de-
pending on the subword size and whether mix left or mix right is executed. These patterns
can be hard coded in the circuit for the 12 mix operations (6 operand sizes × 2 directions).
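In closed form, the mix.right mask is (1^k || 0^k)^{n/2k}; a C sketch that generates it for a 64-bit datapath (the helper name is ours):

    #include <stdint.h>

    // Mask (1^k || 0^k)^(n/2k) for mix.right on a 64-bit word: a mask
    // bit of "1" selects the merge bit (from r3), "0" the rotated bit.
    uint64_t mix_right_mask(unsigned k) {      // k in {1, 2, 4, 8, 16, 32}
        uint64_t unit = ((k >= 64) ? ~0ULL : ((1ULL << k) - 1)) << k;
        uint64_t m = 0;
        for (unsigned p = 0; p + 2 * k <= 64; p += 2 * k)
            m |= unit << p;
        return m;
    }
    // The mix.left mask (0^k || 1^k)^(n/2k) is its complement: ~mix_right_mask(k).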
Table 4.2 summarizes the mask patterns and merge bits for all the operations supported
by the shifter.
Table 4.2: Mask patterns and merge bits for basic and advanced shifter operations

Instruction   Mask                                           Merge Bits
rotr, rotl    0^n                                            -
shr           1^s || 0^{n-s}                                 r2{n−1}^n (sign bit)
shr.u         1^s || 0^{n-s}                                 0^n
shl           0^{n-s} || 1^s                                 0^n
extr          1^{n-len} || 0^{len}                           r2{pos+len−1}^n (sign bit of extracted field)
extr.u        1^{n-len} || 0^{len}                           0^n
dep.z         1^{n-pos-len} || 0^{len} || 1^{pos}            0^n
dep           1^{n-pos-len} || 0^{len} || 1^{pos}            r3
mix.left      (0^k || 1^k)^{n/2k}, k = 1, 2, 4, 8, 16, 32    r3
mix.right     (1^k || 0^k)^{n/2k}, k = 1, 2, 4, 8, 16, 32    r3
bfly, ibfly   0^n                                            -
pex           0^n                                            -
pdep          r3                                             0^n
4.3 New Shifter Implementation
We now describe the (64-bit) shift-permute functional unit. We first consider the most basic
shifter, supporting only shift and rotate, based on the inverse butterfly network and then the
full shift-permute functional unit also supporting extract, deposit, mix, bit permutation, bit
gather and bit scatter instructions.
4.3.1 Basic Shifter
A block diagram of the basic shifter is shown in Figure 4.7. This unit consists of an inverse
butterfly circuit enhanced with an extra multiplexer stage for shift and rotate operations;
a control bit generator for generating the rotation control bits; and a masked merge block
for generating the mask control bit pattern and the merge bits for rotates and logical and
unsigned shifts.
The inverse butterfly circuit is as shown in Figure 2.14. The control bit generator is
shown in Figure 4.8, which includes the rotation control bit generator (Figure 4.4). Right
rotates and shifts are handled by using the patterns from the rotation control bit generator
directly and left rotates and shifts are handled by reversing the patterns right-to-left (as
shown by the circular arrow). The output of this block is also fed to the masked merge stage.

Figure 4.7: New shifter functional unit

Figure 4.8: Control bit generator for shifter functional unit
The masked-merge block generates two bit patterns: the merge bits and the mask bits which control whether the output bits are chosen from the inverse butterfly datapath or from the merge bits. These include the mask and merge patterns for shifts and rotates in Table 4.2.
4.3.2 Full Shift-Permute Unit
An overview of the full shift-permute functional unit is shown in Figure 4.9. The unit
consists of:
• a butterfly circuit enhanced with an extra multiplexer stage – for butterfly permuta-
tions and parallel deposit operations (and possibly basic shift operations),
• an inverse butterfly circuit with a pre-masking stage and an extra multiplexer stage
– for inverse butterfly permutations, parallel extract operations and basic shift opera-
tions,
• a control bit generator – for generating or supplying the control bits for all the oper-
ations, and
• a masked merge block – for generating or supplying the mask control bit pattern and
the merge bits.
The butterfly and inverse butterfly circuits are as shown in Figure 2.14. The extra mul-
tiplexer stage after the butterfly circuit is used for pdep masking or for the masked-merge
Figure 4.9: New shift-permute functional unit
of the basic shift operations. However, as seen from the rotation control bit generator of
Figure 4.4, the control bits produced first are those for the first stage of the inverse butterfly
network, which corresponds to the final stage of the butterfly network. Consequently, we
may choose to have the basic shift operations performed on the inverse butterfly network
alone. The pre-masking stage for the inverse butterfly network is used for pex masking,
which occurs before the bits are permuted. The extra multiplexer stage is used for the
masked-merge of the basic shift operations.
The control bit generator is shown in Figure 4.10. There are three sources for the control
bits:
• the rotation control bit generator – for basic shift operations,
• the application registers – for butterfly and inverse butterfly permutations, and static
parallel extract and parallel deposit operations, and
• the pex/pdep decoder – for variable parallel extract and parallel deposit operations
The rotation control bit generator is similar to the control bit generator of the basic
shifter. The application registers store the control bit patterns for bfly and ibfly permutations as well as the patterns for static and loop-invariant pex and pdep operations.

Figure 4.10: Control bit generator for shift-permute functional unit
The application registers are written either directly from the GPRs with the mov ar in-
struction or from the pex/pdep decoder with the setb and setib instructions.
The pex/pdep decoder is that of Figure 3.14 and is used to generate the control bits for
variable pex.v and pdep.v operations, in which case the bits are output directly to the
ibfly and bfly circuits, respectively, reversing the order of the stages for pex.v. The
decoder is also used for loop-invariant pex and pdep, in which case the bits are written
to the application registers. Note that this decoder has a long latency and large area (as
discussed in Section 3.3).
The masked-merge block generates the merge bits and the mask controls for all the
operations. The patterns are summarized in Table 4.2.
4.4 Evolution of Shifters
Our new design represents an evolution of the shifter from the classical designs to a new
shift-permute functional unit which can perform both basic shifter operations and very
Figure 4.11: 8-bit barrel shifter
sophisticated bit manipulations. One popular classic shifter architecture is the barrel shifter.
The barrel shifter essentially is an n-bit wide n : 1 multiplexer that selects the input shifted
by s positions, where s = 0, 1, 2, . . ., n−1 (Figure 4.11). The advantage of this design is that
there is only a single gate delay between the input and output. The disadvantages are that
n2 switch elements (pass transistors or transmission gates) are required, and long delays
due to decoding of the shift amount (which adds extra logic delay) and high capacitance as
each input fans out to n elements and each output fans in from n elements.
Due to the high capacitance of wires, a second popular shifter architecture emerged –
the log shifter (which is sometimes also referred to as the barrel shifter in literature [79]).
The log shifter shifts the input by decreasing powers of two or four and selects at each
stage the shifted version or the pass-through version from the previous stage (Figure 4.12).
The advantages are that only n × lg(n) or n × log4(n) elements are required and the shift
amount directly controls the multiplexer elements. The disadvantage is that there are lg(n)
or log4(n) gates between the input and output.
Figure 4.12: 8-bit log shifter
Left and right shifts can be performed by implementing two datapaths to perform left
and right shifts separately. Alternatively, only a single-direction shifter is used, e.g., only
right shifting; the left shift is performed by subtracting the left shift amount from the bit
width, n, or by conditionally reversing the input and output of the shifter [79]. Arithmetic
right shift is accomplished by conditionally propagating the sign bit rather than a zero bit.
Additionally, the shifters easily support rotations by wrapping around the bits.
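A small C sketch of the single-datapath trick just described – right-shift-only hardware, with left shifts obtained by conditionally reversing the input and the output (helper names are ours):

    #include <stdint.h>

    static uint64_t bitrev64(uint64_t x) {      // reverse the bit order
        uint64_t r = 0;
        for (int i = 0; i < 64; i++, x >>= 1)
            r = (r << 1) | (x & 1);
        return r;
    }

    // Left shift implemented with a right-shift-only datapath by
    // reversing the input, shifting right, and reversing the output.
    uint64_t shl_via_shr(uint64_t x, unsigned s) {
        return bitrev64(bitrev64(x) >> s);
    }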
Table 4.3 presents a high level comparison of the 3 shifter designs. The first two lines
contain the components that contribute to area. Both the log shifter and our new ibfly-based
shifter have n×lg(n) elements, while the barrel shifter has n2 elements. The log shifter also
has the fewest control lines (lg(n)) while the number of control lines for the new shifter
design is similar to the barrel shifter for basic shift operations – there are n − 1 control
lines as can be seen in Figure 4.4. When considering our new shifter performing advanced
operations, there are n/2× lg(n) control lines as each switch, or pair of elements, requires
an independent control bit. Thus we might expect that our new design has larger area than
the log shifter.
The next two lines pertain to latency. The datapath of the barrel shifter has a single
gate delay while the log shifter and our new design have lg(n) gate delay. However, both
the log shifter and our new design utilize narrow multiplexers limiting the capacitance at
output nodes while the barrel shifter uses wide n:1 multiplexers. All designs have long
Table 4.3: Comparison of shifter designs

                          Barrel Shifter   Log Shifter                 Our IBFLY Shifter
# Elements                n^2              n × lg(n) or n × log4(n)    n × lg(n)
# Control Lines           n                lg(n)                       n − 1 for shifts;
                                                                       n/2 × lg(n) for permutations
# Gate delay (Datapath)   1                lg(n) or log4(n)            lg(n)
Mux width (Capacitance)   n                2 or 4                      2
We first used the method of logical effort [80] to compare the delay along the critical
paths for the barrel shifter, the log shifter and the inverse butterfly shifter. This estimates
the critical path in terms of FO4 gate equivalents, which is the delay of an inverter driving
4 similar inverters. We compare the latency of only the basic shifter operations on these
datapaths. As the 64-bit barrel shifter is impractical due to the capacitance on the output
lines, we implemented a 64-bit shifter as an 8-byte barrel shifter followed by an 8-bit barrel
shifter, which limits the number of transmission gates tied together to 8. We consider the
delay only from the input to the decoder through the two shifter levels for the barrel shifter
and through the three shifter levels for the log shifter.
For our proposed ibfly-based shifter, we consider the delay from the input of the control bit generator (i.e., the rotation control bit generator of Figure 4.4) through the output of the inverse butterfly circuit. According to the logical effort calculations, the delay for the barrel shifter is 15.1 FO4 and the delay for the log shifter is 13.0 FO4, while the delay for the inverse butterfly shifter is 16.8 FO4. Thus the delay along the critical path of our new proposed shifter is 11% slower than that of the barrel shifter and 29% slower than that of the log shifter.
As the log shifter is the faster and more compact of the two current shifter designs, we implemented it and our new ibfly-based shifter design using a standard cell library. We
synthesized all designs to gate level, optimizing for shortest latency, using Design Compiler
mapping to a TSMC 90 nm standard cell library [77]. The results are summarized in
Table 4.4: Latency and area of units performing shift and rotate

Functional Unit            Latency (ns)  Relative Latency  Area (NAND gates)  Relative Area
Log Shifter (Reverse)      0.48          1                 5.8K               1
Log Shifter (2 Datapath)   0.49          1.02              6.1K               1.06
Our IBFLY shifter          0.51          1.06              4.5K               0.78
ALU                        0.50          1.03              7.5K               1.31
Tables 4.4-4.6.
Table 4.4 shows the result of a basic shifter that only implements shift and rotate in-
structions. For the log shifter, we evaluated two designs – one which only performs right
shift and rotate and reverses the input and output for left shift and rotate, and a design which
utilizes parallel datapaths for left and right shifts. The log shifter which reverses the input
and output is faster and smaller than the log shifter which uses two datapaths. Thus we use
it as a baseline for all comparisons and further analysis.
Our new shifter has 1.06× the latency of the log shifter, but is smaller than the log
shifter, at approximately 78% of the area. The discrepancy from the logical effort results
arises from the fact that the logical effort calculation considered the reduced datapath only.
(Note that accounting for the wires will increase the area for the ibfly-based shifter relative
to the log shifter, as mentioned.) For comparison, we also implemented an ALU synthe-
sized using the same standard cell library. Our new shifter is slightly slower (102% latency)
and substantially smaller (59% area) than this ALU. As the latency is greater than the ALU,
the use of a shifter based on an inverse butterfly network may impact cycle time. However,
as the difference in latency between all the designs is small, a full custom design of the
units should be done to fully compare the units.
Table 4.5 shows the result when both shifter architectures are enhanced to support ex-
tract, deposit and mix instructions. The critical path of the log shifter is now through the
extract sign bit propagation, so the latency is now comparable to that of the ibfly-based
Table 4.5: Latency and area of units performing shift, rotate, extract, deposit and mix

Functional Unit     Latency (ns)  Relative Latency  Area (NAND gates)  Relative Area
Log Shifter         0.57          1                 6.7K               1
Our IBFLY shifter   0.58          1.01              5.8K               0.87
ALU                 0.57          1                 6.1K               0.92
Table 4.6: Latency and area of units performing advanced bit operations

Functional Unit            Latency (ns)  Relative Latency  Area (NAND gates)      Relative Area
Log Shifter*               0.57          1                 6.7K                   1
Our IBFLY shifter*         0.58          1.01              5.8K                   0.87
ibfly, pex                 0.63          1.10              9.0K (1.2K for ARs)    1.34
ibfly, bfly, pex, pdep     0.64          1.11              12.3K (2.5K for ARs)   1.83
ibfly, bfly, pex, pdep,    1.22          2.13              16.2K (3.6K for ARs)   2.42
  pex.v, pdep.v
* Does not perform advanced bit manipulation operations
shifter. Our new ibfly-based design is still only 87% of the area of the log shifter. We
include the results for an ALU of similar latency, which has area less than the log shifter
but greater than the ibfly-based shifter.
Table 4.6 shows the results when we add support for advanced bit manipulation op-
erations to our new ibfly-based shifter, but not to the log shifter. The first two lines are
the log shifter and ibfly shifter from Table 4.5, which do not perform advanced operations,
included as the baseline. The third line is a unit that supports the ibfly and static pex in-
structions. We consider this unit as the functionality of the butterfly circuit can be emulated
using inverse butterfly, albeit with a multi-cycle penalty. The latency increases are due to
extra multiplexing for the control bits and output. The area increases due to the ARs, the
extra multiplexers and the pex masking. This unit has 1.10× the latency and 1.34× the
area of the log shifter.
The fourth line is a unit that also supports bfly and static pdep, i.e., it is the equivalent
of standalone unit of Figure 3.20 enhanced to support standard shifter operations. The
latency increases slightly due to extra multiplexing and the area increases due to the second
(butterfly) datapath and second set of three ARs. This unit has 1.11× the latency and 1.83×
the area of the log shifter.
The fifth line is a unit that also supports variable pex.v and pdep.v. This is a non-
pipelined version of the functional unit. The latency increases and the area increases due to
the pex/pdep decoder and extra multiplexing. This unit has 2.13× the latency and 2.42×
the area of the log shifter (the pex/pdep decoder itself has 1.4× the latency of the log
shifter). A pipelined version of this unit would require three stages, two for the decoder
and one for the datapath (see Section 3.3).
We also performed an FPGA based implementation of the new functional units. The
goal of this implementation was to produce an actual hardware circuit which would fully
take into account the wiring requirements of the butterfly and inverse butterfly network.
An additional goal was to showcase the flexibility of the FPGA-based, fully customizable
PAX cryptographic coprocessor [81] of the PALMS research group by replacing its shifter
with the new shift-permute functional unit. The results are shown in Table 4.7, where the
target device is the Xilinx Virtex-II Pro XC2VP20 and the implementation was performed
using Xilinx ISE. The top of the table shows results for the basic shifter that supports only
shift and rotate while the bottom half shows results for the shifters that also support extract,
deposit and mix and also support the advanced bit operations of Chapter 3.
For the shifters that support only shift and rotate, our new shifter design has 0.85× the
latency of the log shifter with 19% more area. For the shifters that additionally support
extract, deposit and mix, our shifter has 1.05× the latency, but only 0.87× the area of the
log shifter. These results may be due to the granularity of logic synthesis in the FPGA –
for the basic shifter, our design requires one less level of LUTs, while for the shifter that
supports extract, deposit and mix, our shifter requires one more level of LUTs.
The unit that supports bfly, ibfly, static pex and pdep has 1.16× the latency and
1.98× the area of the log shifter. These results are comparable to those obtained for the
Table 4.7: Results of FPGA implementation

Functional Unit           Latency  Latency Relative  Latency Relative  Area      Area Relative   Area Relative
                          (ns)     to Log Shifter    to ALU            (Slices)  to Log Shifter  to ALU
Log Shifter               6.4      1                 1.23              394       1               1.15
Our IBFLY shifter         5.5      0.85              1.05              470       1.19            1.37
ALU                       5.2      0.81              1                 342       0.87            1
Log Shifter               6.7      1                 1.29              748       1               2.19
Our IBFLY shifter         7.1      1.05              1.36              654       0.87            1.91
ibfly, bfly, pex, pdep    7.8      1.16              1.50              1481      1.98            3.76
ibfly, bfly, pex, pdep,   12.3     1.84              2.37              2457      3.28            6.24
  pex.v, pdep.v
ALU                       5.2      0.78              1                 342       0.46            1
standard cell implementation. Compared to the standard cell implementation of Table 4.6,
the FPGA unit that supports variable pex.v and pdep.v is faster and larger relative to
the log shifter, perhaps due to how the addition in the parallel prefix population counter
is synthesized. We note that the ALU is faster and smaller than all the shifter designs,
possibly due to specialized adder cells used in the ALU implementation.
From both standard cell and FPGA implementations we see that supporting variable pex.v and pdep.v operations has a high cost, similar to the case with the standalone advanced bit manipulation units in Section 3.3.
the greatest range of operations. Alternatively, one may simply omit support for variable
operations and use the functional unit with only bfly, ibfly, static pex and pdep,
thereby forcing the use of a software routine to generate the control bits whenever variable
pex.v and pdep.v are performed. As we will see in Chapter 6, this may be a sufficient
alternative.
We remark that full custom designs of the ALU, log shifter and our new shift-permute
unit should be done, since standard cell or FPGA implementations may not reflect a fair
or accurate comparison – especially between the shifters and the ALU, which is typically
highly optimized by custom circuit design. Such circuit design is more appropriately done
by microprocessor custom circuit designers according to implementation-specific needs
and the process technology used.
4.5 Summary
The advanced bit manipulation functional unit can be enhanced to support the operations
usually performed by the standard shifter. We showed that standard shifter operations are
all variations on rotations and then designed a rotation control bit generator circuit, which
produces the control bits for the inverse butterfly (or butterfly) datapath based on the shift
amount. This new design is an evolution of shifter architectures away from the classic
barrel and log shifters.
As a new basis for shifters performing only existing shift, rotate, extract, deposit and
mix instructions, our new ibfly-based design has 1.01× the latency and 0.87× the area of
the log-based shifter. Our new unit that also supports static pex and pdep operations, has
1.11× the latency and 1.83× the area of the log shifter (standard cell implementation), but
is much more capable.
In the following chapter, we consider the bit matrix multiply operation which is used
to accelerate parity computation and can also perform a superset of the bit manipulation
operations. In Chapter 6 we analyze the applications that benefit from advanced bit manip-
ulation instructions.
Chapter 5
Bit Matrix Multiply
Another powerful bit manipulation operation is bit matrix multiply (bmm). bmm, which
multiplies two bit matrices contained in processor registers, performs a superset of the
advanced bit manipulation operations discussed in Chapter 3. Furthermore, bmm has many
additional capabilities such as finite field multiplication and parity computation. In this
chapter we discuss how bmm, which has heretofore been supported only by supercomputers
such as Cray [82], can be implemented in a commodity microprocessor.
This chapter is organized as follows. Section 5.1 gives the definition of the bit matrix
multiply operation. Section 5.2 discusses the basic operations that bmm can be used to
perform. Section 5.3 describes methods of computing bmm using existing instruction set
architectures. Section 5.4 proposes new architectural enhancements to accelerate bmm.
Section 5.5 compares the new proposals to the existing techniques. Section 5.6 discusses
related work in bmm implementation. Section 5.7 summarizes this chapter.
5.1 Definition of Bit Matrix Multiply Operation
The bit matrix multiply operation, bmm.n, multiplies two (n × n) bit matrices to produce
a third (n × n) bit matrix. The operation is defined in the first line of Table 5.1. The rows
Table 5.1: Definitions of Bit Matrix Multiply Instructions

Operation                    Definition            Pseudo-Code
bmm.n C = A, B               A, B, C: n × n bit    for i from 1 to n
(Bit Matrix Multiply)        matrices;               for j from 1 to n
                             C = A × B mod 2            c_{i,j} = a_{i,1}b_{1,j} ⊕ a_{i,2}b_{2,j} ⊕ ... ⊕ a_{i,n}b_{n,j}
bmmt.n C = A, B              A, B, C: n × n bit    for i from 1 to n
(Bit Matrix Multiply         matrices;               for j from 1 to n
 with implicit transpose)    C = A × B^T mod 2          c_{i,j} = a_{i,1}b_{j,1} ⊕ a_{i,2}b_{j,2} ⊕ ... ⊕ a_{i,n}b_{j,n}
of a matrix are numbered ascending from top to bottom, starting at 1. Within a row we
follow the normal matrix numbering convention starting at 1 on the left. (When referring
to bit numbering within a register, position n − 1 is on the left and 0 is on the right.) An
alternative formulation which transposes the B matrix (bmmt.n) is given in the second
line of Table 5.1. Note that the transpose is implicit – the instruction input is B, but the
multiplication is by BT .
The first version of bmm most closely resembles standard matrix multiplication and
therefore is most useful when performing finite field arithmetic. The transpose version
is most useful when we are taking the dot products of the rows of A and the rows of B.
Given the general correspondence between vector dot product and matrix multiplication
(a • b = a × bT for row vectors a and b), this version naturally follows. The transpose
version can also be used when performing finite field arithmetic if the data happen to be in
the transpose form. The transpose version of bmm is the version implemented by Cray.
In particular, we mention two cases for n: n = 8 and n = 64. The bmm.8 instruction multiplies two 8 × 8 matrices contained in general purpose registers (see Figure 5.1). The
matrices are laid out such that row i corresponds to byte 8 − i in little-endian order. The
bmm.64 instruction multiplies two 64 × 64 matrices and allows for operations on whole
processor words at once. The Cray instruction is equivalent to bmmt.64.
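A bit-serial C sketch of bmm.8 under the register layout of Figure 5.1 (row i in byte 8 − i, column j at bit 8 − j within the byte); this models the semantics only, and the function name is ours:

    #include <stdint.h>

    // Multiply two 8x8 bit matrices held in 64-bit registers.
    uint64_t bmm8(uint64_t a, uint64_t b) {
        uint64_t c = 0;
        for (int i = 0; i < 8; i++) {                 // 0-based row index
            uint8_t arow = (uint8_t)(a >> (8 * (7 - i)));
            uint8_t crow = 0;
            for (int j = 0; j < 8; j++) {             // 0-based column index
                uint8_t col = 0;                      // gather column j+1 of B
                for (int k = 0; k < 8; k++)
                    col |= ((b >> (8 * (7 - k) + (7 - j))) & 1) << (7 - k);
                // c_{i+1,j+1} = parity of the AND of row and column
                crow |= (__builtin_popcount(arow & col) & 1) << (7 - j);
            }
            c |= (uint64_t)crow << (8 * (7 - i));
        }
        return c;
    }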
Figure 5.1: (a) Layout of 8 × 8 matrix in a register rB; (b) bmm.8 operation (1 bit of output)
5.2 Bit Matrix Multiply Basic Operations
We describe some basic operations that a bmm operation can perform.
Finite Field Multiplication – Bit matrix multiply can be used to perform multiplication over GF(2^k).
Subset Parity – Bit matrix multiply, bmm or bmmt, can be used to calculate the parity of an arbitrary subset of the bits of a row of A. The bit positions from which the parity is computed are specified by setting the corresponding bit positions to "1" in a column of B (or row of B for the transpose version). Usage of bmmt may be easier for parity as the masks can be directly specified in processor registers.
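Each output bit of bmmt is thus the mod-2 dot product of a data row with a mask row; as a one-line C sketch (the function name is ours):

    #include <stdint.h>

    // Parity of the subset of bits of row selected by mask -- one
    // output bit of bmmt with the mask supplied as a row of B.
    static inline unsigned subset_parity(uint64_t row, uint64_t mask) {
        return (unsigned)(__builtin_popcountll(row & mask) & 1);
    }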
Transpose – The transpose version, bmmt, can perform bit matrix transpose by setting A to I, the identity matrix. By the second row of Table 5.1, C then equals B^T.
Permutations – bmm can be used to perform arbitrary bit-level permutations with rep-
etitions and zeroing. A B matrix with a single “1” in each row and column produces a
permutation of the columns of the A matrix (“column exchange”). A B matrix with a sin-
gle “1” in each column, but more than one “1” in at least one row produces a permutation
with repetition. A B matrix with a zero column will zero the corresponding column of the
result. Thus bmm can be used to perform any of the operations described in Chapter 3
(see Figure 5.2 for an example of bmm being used to form pex). bmm can also perform a
standard bit-wise and (with “1”s only appearing on the main diagonal), shift (with “1”s
only appearing on any single non-main diagonal) or rotate (with “1”s appearing on any
single wrapped-around diagonal). If, instead of B, the A matrix is a permutation matrix,
then the rows (or subwords) of B are permuted (“row exchange”).
Row i of A, multiplied by the B matrix below, gathers bits a_{i,1}, a_{i,3}, a_{i,5}, a_{i,6}, a_{i,8}
into the five rightmost columns of C, zeroing the rest:

        0 0 0 1 0 0 0 0
        0 0 0 0 0 0 0 0
        0 0 0 0 1 0 0 0
    B = 0 0 0 0 0 0 0 0       row i of C = (0, 0, 0, a_{i,1}, a_{i,3}, a_{i,5}, a_{i,6}, a_{i,8})
        0 0 0 0 0 1 0 0
        0 0 0 0 0 0 1 0
        0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 1

Figure 5.2: pex emulated by bmm
If we consider the simple bmm.8 operation, we see that it can be used to emulate a number
of the subword manipulation instructions in existing SIMD instruction extensions that were
mentioned in Chapter 2. For example, the Vector Shift Right Algebraic Byte (vsrab)
POWER instruction [15], which performs an algebraic right shift on each byte in parallel,
can be performed by multiplication with a constant B matrix – for a right shift by 1,
B = “C0 20 10 08 04 02 01 00” (Figure 5.3). bmm.8 can also perform a byte permute
instruction, which permutes, with repetition, the bytes of a word, such as the permutation
of rows of B from (1,2,3,4,5,6,7,8) → (7,7,3,4,6,5,2,1) (Figure 5.4).
bmm.8 can also emulate POWER instructions such as Vector Rotate Left Integer Byte
(vrlb), Vector Shift Right by Octet (vsro) and Vector Splat Byte (vspltb), and IA-32
instructions such as Byte Reverse (bswap). In fact, by multiplying by a constant matrix on
either the right or the left, bmm.8 can perform any one of 2^64 different operations!

A × B = C, where row i of C = (a_{i,1}, a_{i,1}, a_{i,2}, a_{i,3}, a_{i,4}, a_{i,5}, a_{i,6}, a_{i,7}):

        1 1 0 0 0 0 0 0
        0 0 1 0 0 0 0 0
        0 0 0 1 0 0 0 0
    B = 0 0 0 0 1 0 0 0
        0 0 0 0 0 1 0 0
        0 0 0 0 0 0 1 0
        0 0 0 0 0 0 0 1
        0 0 0 0 0 0 0 0

Figure 5.3: Vector Shift Right Algebraic Byte (vsrab) emulated by bmm

P × B = C, where P is a permutation matrix (with repetition) selecting rows (7,7,3,4,6,5,2,1),
so row i of C is row p(i) of B, i.e. the rows of C are (b_{7,*}, b_{7,*}, b_{3,*}, b_{4,*}, b_{6,*}, b_{5,*}, b_{2,*}, b_{1,*}):

        0 0 0 0 0 0 1 0
        0 0 0 0 0 0 1 0
        0 0 1 0 0 0 0 0
    P = 0 0 0 1 0 0 0 0
        0 0 0 0 0 1 0 0
        0 0 0 0 1 0 0 0
        0 1 0 0 0 0 0 0
        1 0 0 0 0 0 0 0

Figure 5.4: byte permute emulated by bmm
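As a usage sketch of the vsrab example, with the bmm8 software model above; the constant 0xC020100804020100 packs the rows C0, 20, …, 00 with row 1 in the most significant byte:

    /* Byte-wise arithmetic shift right by 1 via bit matrix multiply:
     * row 1 of B (0xC0) replicates each byte's sign bit into columns 1-2;
     * rows 2-7 move the remaining bits one column to the right. */
    uint64_t vsrab_by1(uint64_t x) {
        return bmm8(x, 0xC020100804020100ULL);
    }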
Overall, we see that bmm is a powerful and versatile data manipulation primitive.
Combined Operations – bmm.n can also perform multiple combined operations at once.
If the B matrix is decomposed in the following way:

    B = | B1  0  |
        | 0   B2 |

where B1 and B2 are two (n/2) × (n/2) matrices, then bmm.n performs 2 bmm.n/2 operations,
one on each half of each row of A. We can have more than two matrices, and their sizes do
not need to be equal. bmm.n is also able to permute a subset of bits and XOR the result with
another subset of bits at the same time. We obtain a combination of a permutation and an
XOR with a matrix defined by:

    B = | 0  B1 |
        | 0  B2 |
If B1 corresponds to a circular rotation and B2 is the identity matrix, then the left subword
of each row of A is rotated and then XORed with the right subword of that row. Thus, we
obtain the rorx and rolx instructions proposed in [19], which rotate one operand and
then XOR it with the second.
5.3 Computing Bit Matrix Multiply using Existing ISA
Before considering whether new ISA support is needed, it is important to see how fast a bit
matrix multiplication can be done using existing instructions. Hence, we first investigate
the fastest ways to compute a (64 × 64) × (64 × 64) → (64 × 64) matrix multiply using
existing instructions. The two methods are listed in Table 5.2, where they are compared to
the Cray implementation.
5.3.1 Conditional Row-wise Summation
The first method of computing bmm is based on conditionally XORing together the rows
of B based on the value of the bits of a row of A:
    (a_1 a_2 … a_n)  ×  | b_{1,1} b_{1,2} … b_{1,n} |       a_1 × (b_{1,1} b_{1,2} … b_{1,n})
                        | b_{2,1} b_{2,2} … b_{2,n} |   ⇔  ⊕ a_2 × (b_{2,1} b_{2,2} … b_{2,n})
                        |             ⋮             |       ⊕ …
                        | b_{n,1} b_{n,2} … b_{n,n} |       ⊕ a_n × (b_{n,1} b_{n,2} … b_{n,n})
The right-hand side is equivalent to a sequence of

    if a_i { c = c XOR B_i },
where Bi is the ith row of B. This is repeated for each row of the A matrix. Table 5.2 shows
the pseudo code for this computation of bmm. In ISAs that support conditional moves or
predication, the branches can be eliminated.
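On machines without predication, the conditional XOR can still be made branch-free with a sign-extension trick; a sketch of one output row under the conventions of the earlier bmm8 model (a_1 in the most significant bit, B stored one row per 64-bit word):

    /* One row of C = A x B (64x64) by conditional row-wise summation,
     * branch-free: 0 - bit is all-ones when bit = 1, selecting B[i]. */
    uint64_t bmm_row(uint64_t arow, const uint64_t B[64]) {
        uint64_t c = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t bit = (arow >> (63 - i)) & 1;   /* a_{i+1} */
            c ^= B[i] & (0 - bit);
        }
        return c;
    }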
Note that this method uses the standard definition of matrix multiplication and not the
transpose definition. Consequently, if the transpose version is desired, the B matrix must be
transposed in software first. If B is known ahead of time then there is no runtime penalty.
Lee [5] shows how an n × n matrix is transposed with only n × lg(n) mix instructions,
similar to Knuth’s algorithm [83]. Unfortunately, no ISA has bit-level mix instructions,
though our new functional unit described in Chapter 4 does, so these must be emulated
using and, shift and or instructions.
5.3.2 Table Lookup
A second fast method to compute bmm is to use lookup tables. A set of eight 256-entry
tables is constructed, based on a given 64× 64 bit B matrix. The B matrix is divided into 8
groups of 8 rows, and each table contains the XOR sum of the 256 possible combinations
of 8 rows B8i to B8i+7, for 0 ≤ i ≤ 7. These tables can be produced quite efficiently by
stepping through the 256 possible combinations in Gray code order. As sequential values in
Gray code order differ by one bit, the table entries corresponding to those values differ
by a single XOR. Thus a sequence of 256 XOR and 256 store instructions is all that is
necessary to compute each table.
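A sketch of this Gray-code construction for one of the eight tables, in C (Brows points at the 8 rows the table covers; which index bit selects which row is an assumption that must match the lookup code; __builtin_ctz is a GCC/Clang builtin):

    /* Build one 256-entry table: successive Gray codes i^(i>>1) differ in
     * exactly bit ctz(i), so each entry is one XOR away from the last. */
    void build_table(uint64_t table[256], const uint64_t Brows[8]) {
        uint64_t acc = 0;
        table[0] = 0;
        for (unsigned i = 1; i < 256; i++) {
            acc ^= Brows[__builtin_ctz(i)];   /* row whose bit flips */
            table[i ^ (i >> 1)] = acc;        /* Gray code of i      */
        }
    }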
Once the tables are computed, the eight bytes of a row of A are used as indices into
these eight tables and the results of the eight table lookups are XORed together to produce
a row of C (see Table 5.2). This method, like the previous one, uses the standard definition
of matrix multiplication and not the transpose definition and thus, if the transpose version
is desired, the B matrix must be transposed in software first.
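Once the tables exist, one output row costs eight lookups and seven XORs; a sketch, assuming byte t of the row of A (counting from the least significant byte) indexes table t, consistent with the build_table mapping above:

    /* One row of C from eight table lookups XORed together. */
    uint64_t bmm_row_tlu(uint64_t arow, const uint64_t table[8][256]) {
        uint64_t c = 0;
        for (int t = 0; t < 8; t++)
            c ^= table[t][(arow >> (8 * t)) & 0xFF];
        return c;
    }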
5.3.3 Implementation of bmm on Cray Vector Supercomputers
The challenge in implementing bmm is that each row of the output matrix C depends on a
row of A and the entire B matrix. This can be seen from the definition of bmm (or bmmt) –
each bit of C depends on a full column (or row) of B. Consequently, Cray supercomputers
define a special 64×64 bmm register to hold the B matrix [17]. The contents of this register,
the entire B matrix, can be read in a single cycle. The Cray ISA defines an additional
instruction to transmit the contents of a vector register to the bmm register. Subsequently,
bmm instructions will stream the rows of A from a normal vector register, multiply each
row by the entire B matrix in a single cycle and then stream the rows of C back to a normal
vector register. This approach is also compatible with vector register lanes – the contents
of the bmm register are provided to each lane in every cycle. If there are k lanes, then
bmm takes 64/k cycles on Cray (plus vector register/functional unit overhead). With k = 2
lanes, a bmm on Cray takes only 32 cycles (not counting loading of the B matrix), which is
much faster than the software methods proposed above using existing ISA (see Table 5.2).
5.3.4 Comparison of Existing ISA Methods
The methods described above each have advantages and disadvantages. The Cray approach
achieves the highest performance at the cost of 64 64-bit words (4096 bits) of extra storage
in the microprocessor core. Additionally, if the B matrix changes, then the special bmm
register must be reloaded.
The table lookup method is the fastest (non-Cray) algorithm if there are no data cache
misses. However, the tables require 16KB of memory (8 tables × 256 entries/table ×
8 bytes/entry), and some microprocessors only have 16KB or less of L1 data cache. If
the tables are evicted from or do not fit in L1 (or higher cache levels), execution time
is negatively impacted due to cache miss penalties. Furthermore, if a new B matrix is
generated dynamically, the tables must be recomputed (and the B matrix must be transposed
if bmmt is required). While a fast algorithm for computing the tables was given, there is
still considerable overhead. Even if the B matrix is simply switched to another known
B matrix, a new set of precomputed tables must be loaded into cache from memory at
considerable cost.

Table 5.2: Instruction sequences and counts for 3 methods to compute a bit matrix multiply
using existing ISA

Row-wise Summation:
  Instruction sequence:     c = 0
                            mov PR = a
                            for i from 1 to n: (PR[n−i]) c = c XOR B_i
  Instructions used:        1 clear, 1 ld, 1 mov PR, 64 xor, 1 st
  Count for 1 row:          68
  Count for 64 rows:        64 × 68
  Additional ISA/resources: predicated execution in ISA

Table Lookup:
  Instruction sequence:     c = 0
                            for i from 0 to 7: c = c XOR Table[i][a_{8i+1…8(i+1)}]
  Instructions used:        1 ld, 8 extract byte (7 shift, 8 and), 8 indexed ld, 7 xor, 1 st
  Count for 1 row:          32
  Count for 64 rows:        64 × 32
  Additional ISA/resources: 16 Kbytes of table space

Cray:
  Instruction sequence:     Vc = bmm Va
  Instructions used:        1 ld, 1 bmmt, 1 st
  Count for 1 row:          –
  Count for 64 rows:        3
  Additional ISA/resources: vector ISA, large bmm functional unit
Row-wise XOR depends on an ISA which supports conditional moves or predicated
execution to be implemented efficiently (with code for predicated execution shown in Ta-
ble 5.2). Even with an efficient implementation, it is still slower than table lookup. How-
ever, this method is unaffected by a dynamic B matrix as the algorithm processes the rows
of B as is, unless bmmt is required, in which case there is the overhead of transposing the
B matrix.
5.4 New Architectural Techniques to Accelerate Bit Matrix Multiply
Since the existing methods of emulating bmm.64 are much slower than the Cray supercom-
puter, we investigate new instruction primitives suitable for commodity microprocessors
that can significantly accelerate bmm operations. Our analysis focuses on the case of the
(N × 64)× (64× 64)→ (N × 64) matrix multiply. That is, we consider multiplication of
N vectors by a fixed 64× 64 matrix B, where N is very large, performed as a sequence of
(64× 64)× (64× 64)→ (64× 64) multiplies. We propose two techniques, the first based
on submatrix multiplication and the second based on a parallel table lookup instruction.
5.4.1 Submatrix Multiplication
Direct multiplication of large matrices requires a stateful functional unit – the functional unit
contains extra register storage to hold one of the operands. This extra storage plus the 2
general purpose register operands hold the larger matrices. The Cray bmmt.64 functional
unit, as discussed in Section 5.3, has an embedded vector register (with 64 64-bit entries)
to hold the B matrix.
In the ideal case, our bmm functional unit contains enough embedded storage to hold
a 64 × 64 bit matrix, like Cray. This allows for the full set of bit manipulation and parity
operations to be performed on 64-bit processor words. However, practically, such a large
unit may not be desirable for a commodity microprocessor. Instead, we investigate the use
of smaller submatrices that are directly multiplied by the bmm primitive instruction, and
we employ matrix blocking [84], to perform the larger matrix multiplies. For example,
Figure 5.5 shows how to compute bmm.64 using bmm.8 (a C sketch of this blocking scheme
follows the list below).

    A, B, C are 8 × 8 matrices of 8-bit × 8-bit blocks
    for i from 1 to 8
      for j from 1 to 8
        C_{i,j} = 0
        for k from 1 to 8
          C_{i,j} = C_{i,j} XOR bmm.8(A_{i,k}, B_{k,j})

Figure 5.5: bmm.64 via multiplication of (8 × 8) × (8 × 8) submatrices

For a given amount of embedded storage, multiple submatrix sizes are possible. Consider
a functional unit with 256 bits of storage. This can hold a 64 × 4, 32 × 8, 16 × 16,
8 × 32 or 4 × 64 B matrix (see Table 5.4, column 2). The best matrix size depends on a
number of factors:
1. The number of terms computed in the matrix multiplication;
2. The constraint of only two input registers (128 bits) and one result register (64 bits)
per one instruction;
3. The overhead of accumulation;
4. The overhead incurred in packing and resizing matrices;
5. A preference for square matrices to best facilitate bit manipulation operations.
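For reference, here is the promised C sketch of the blocked computation of Figure 5.5, on top of the bmm8 model from Section 5.1; the 64 × 64 matrices are assumed stored one row per 64-bit word, leftmost column in the most significant bit:

    /* Blocked (64x64) x (64x64) multiply using bmm.8 as the primitive. */
    static uint64_t get_block(const uint64_t M[64], int r, int c) {
        uint64_t blk = 0;   /* 8x8 block (r,c): byte c of rows 8r..8r+7 */
        for (int i = 0; i < 8; i++)
            blk |= (uint64_t)(uint8_t)(M[8*r + i] >> (56 - 8*c)) << (56 - 8*i);
        return blk;
    }

    void bmm64_blocked(const uint64_t A[64], const uint64_t B[64], uint64_t C[64]) {
        for (int i = 0; i < 64; i++)
            C[i] = 0;
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++) {
                uint64_t cij = 0;
                for (int k = 0; k < 8; k++)     /* inner accumulate */
                    cij ^= bmm8(get_block(A, i, k), get_block(B, k, j));
                for (int r = 0; r < 8; r++)     /* scatter into C   */
                    C[8*i + r] |= (uint64_t)(uint8_t)(cij >> (56 - 8*r)) << (56 - 8*j);
            }
    }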
The number of terms computed in the multiplication of a (n × k) × (k × m) matrix
multiplication is nkm. Maximizing nkm likely yields the fastest matrix multiplication
when the bmm instruction is used as the primitive operation in block multiplication. The
sizes of n, k and m are constrained by the number of input and output registers and the
amount of embedded storage:
• Input general purpose registers constraint: nk ≤ 128 (ideally, nk = 128)
• Input embedded storage constraint: km = size of embedded storage
• Output registers constraint: nm ≤ 64 (ideally, nm = 64)
Table 5.3: Matrix sizes for no embedded storage

  nk ≤ 64   km = 64   nm ≤ 64   nkm   #Multiplications for (64 × 64) × (64 × 64)
  1 × 64    64 × 1    1 × 1     64    4096
  2 × 32    32 × 2    2 × 2     128   2048
  4 × 16    16 × 4    4 × 4     256   1024
  8 × 8     8 × 8     8 × 8     512   512
  4 × 4     4 × 16    4 × 16    256   1024
  2 × 2     2 × 32    2 × 32    128   2048
  1 × 1     1 × 64    1 × 64    64    4096
Given the constraints on the inputs, that nk and km are fixed, the number of terms
computed in the matrix multiplication is:

    nkm = (n·k²·m)/k = (nk × km)/k ∝ 1/k        (5.1)

On the other hand, if we consider the output constraint, that nm is fixed, then the
number of terms computed in the matrix multiplication is:

    nkm = nm × k ∝ k                             (5.2)
Thus, from (5.1) and (5.2), we see that there are competing trends: initially reducing
k increases the number of terms produced until nm, the number of output
terms, is maximized. Beyond this point, decreasing k further only serves to reduce the number
of terms, as no more result bits can be output. We show this explicitly in Tables 5.3 and 5.4, where
we delineate the possible values for n, k and m for the case where we use no embedded
storage and the case of 256 bits of storage. The last column of the tables shows the number
of block multiplications needed to compute a (64× 64)× (64× 64) multiplication, which
requires the computation of 643 terms.
In the case of no embedded storage, decreasing k from 64 to 8 increases the number of
terms produced by 8×. However, continuing the same trend would require multiplying a
(16 × 4) matrix by a (4 × 16) matrix to produce a (16 × 16) matrix, which would violate
the output constraint. Thus the 8 × 8 matrix size, corresponding to the bmm.8 operation,
produces the most terms. For 256 bits of embedded storage, a similar trend is seen (Table 5.4).
However, in this case, two sets of sizes, (4 × 32) × (32 × 8) and (4 × 16) × (16 × 16), both
produce 1024 terms.

Table 5.4: Matrix sizes for 256 bits of embedded storage

  nk ≤ 128   km = 256   nm ≤ 64   nkm    #Multiplications for (64 × 64) × (64 × 64)
  2 × 64     64 × 4     2 × 4     512    512
  4 × 32     32 × 8     4 × 8     1024   256
  4 × 16     16 × 16    4 × 16    1024   256
  2 × 8      8 × 32     2 × 32    512    512
  1 × 4      4 × 64     1 × 64    256    1024
However, these sizes can be differentiated based on the need to accumulate results.
Each block multiplication requires an accumulate with a prior result (see Figure 5.5).
Consequently, a bit matrix multiply-accumulate instruction, bmm.ac, (that performs ex-
actly the inner-loop operation of Figure 5.5), greatly reduces the total number of instruc-
tions required, as 64/k partial results must be XORed together for each block. Here, the
(4 × 16) × (16 × 16) case has a distinct advantage – only a single general purpose reg-
ister input is required. We then use the second operand to input a partial C matrix and
accumulate the result of the multiplication with this partial result.
Another factor to consider is the need to pack matrices. The 64× 64 matrices are held
in 64 registers. Thus, whenever the size of the output block, nm, is less than 64, multiple
blocks must be packed to fill all 64 bits. As 64/nm output blocks are packed into a 64-bit
register, we assume that an additional 64/nm− 1 instructions are required for packing for
each of the 64 registers that hold C.
Also an issue is the size of the submatrices held in the registers. Given an instruction
that multiplies matrices with certain fixed sizes, there is no guarantee that data will be laid
out as expected. If, for the (64× 64)× (64× 64) multiplication, the A matrix is generated
and laid out such that each 64-bit processor word contains a 1 × 64 block of A, then the
data must be resized to 4 × 16 in one register or 4 × 32 across two registers (or 8 × 8 for
bmm.8). Similarly, the output must eventually be resized back to 1× 64 blocks.
Resizing bit matrices is equivalent to transposing a 2-D matrix of subwords. (A 1× 64-
bit block is equivalent to 8 byte elements from the same row of an 8 × 8 byte matrix.
An 8 × 8-bit block is equivalent to 8 byte elements from the same column). The mix
instruction can be used to perform the matrix transpose using Lee’s method, as mentioned
in Section 5.3.1. Figure 5.6 shows a resizing from four 1× 64 blocks to four 4× 16 blocks,
at the cost of 8 mix instructions. To resize from 1× 64 blocks to 2× 32 blocks requires 1
mix instruction per row, to 4× 16 blocks, 2 mix instructions per row, and to 8× 8 blocks,
3 mix instructions per row.
Before resizing (each register holds a 1 × 64 block, i.e. submatrices A_{1,1} … A_{4,4} row by row):

  r1: a_{1,1…16}   a_{1,17…32}   a_{1,33…48}   a_{1,49…64}
  r2: a_{2,1…16}   a_{2,17…32}   a_{2,33…48}   a_{2,49…64}
  r3: a_{3,1…16}   a_{3,17…32}   a_{3,33…48}   a_{3,49…64}
  r4: a_{4,1…16}   a_{4,17…32}   a_{4,33…48}   a_{4,49…64}

after mix2.l r5 = r1, r2; mix2.r r6 = r1, r2; mix2.l r7 = r3, r4; mix2.r r8 = r3, r4:

  r5: a_{1,1…16}   a_{2,1…16}    a_{1,33…48}   a_{2,33…48}
  r6: a_{1,17…32}  a_{2,17…32}   a_{1,49…64}   a_{2,49…64}
  r7: a_{3,1…16}   a_{4,1…16}    a_{3,33…48}   a_{4,33…48}
  r8: a_{3,17…32}  a_{4,17…32}   a_{3,49…64}   a_{4,49…64}

after mix4.l r1 = r5, r7; mix4.l r2 = r6, r8; mix4.r r3 = r5, r7; mix4.r r4 = r6, r8:

  r1: a_{1,1…16}   a_{2,1…16}    a_{3,1…16}    a_{4,1…16}
  r2: a_{1,17…32}  a_{2,17…32}   a_{3,17…32}   a_{4,17…32}
  r3: a_{1,33…48}  a_{2,33…48}   a_{3,33…48}   a_{4,33…48}
  r4: a_{1,49…64}  a_{2,49…64}   a_{3,49…64}   a_{4,49…64}

(each register now holds a 4 × 16 block)

Figure 5.6: Matrix resizing using mix.
We now summarize the total number of instructions required for each matrix size for
the two cases of no embedded storage and 256 bits of embedded storage in Tables 5.5
and 5.6. We see that the square matrix sizes, 8×8 and 16×16, yield the best results. These
correspond to the operations bmm.8 and bmm.16.ac, respectively.
Table 5.5: Total number of operations for no embedded storage

  nk ≤ 64  km = 64  nm ≤ 64  nkm  #Muls  #Accumulate  #Pack  #Resize in  #Resize out  Total
  1 × 64   64 × 1   1 × 1    64   4096   0            4032   0           0            8128
  2 × 32   32 × 2   2 × 2    128  2048   1024         960    64          64           4160
  4 × 16   16 × 4   4 × 4    256  1024   768          192    128         128          2240
  8 × 8    8 × 8    8 × 8    512  512    448          0      192         192          1344
  4 × 4    4 × 16   4 × 16   256  1024   960          0      128         128          2240
  2 × 2    2 × 32   2 × 32   128  2048   1984         0      64          64           4160
  1 × 1    1 × 64   1 × 64   64   4096   4032         0      0           0            8128
Square matrices are also the most logical choice for supporting multimedia and bit ma-
nipulation operations, as described in Section 5.2. A k × k B matrix completely specifies
how to manipulate a k-bit row of A. A k × k A matrix completely specifies how to manip-
ulate a k-bit column of B. Thus, the bmm.8 instruction, in which both A and B are square,
allows for full bit manipulation of eight 8-bit rows or columns. The bmm.16.ac, in which
only B is square, allows for full bit manipulation on four 16-bit rows.
We note that the partitioning of the large matrix operations in multiprocessor settings
has been studied extensively [84]. In these settings, communication between processors
is costly and, therefore, reducing communication is key. Results show that often square
matrices tend to be better as these maximize the computation to communication ratio –
computation is done on all the elements of a partition (area) while communication involves
the boundary elements (perimeter), and squares maximize the ratio of area to perimeter.

Table 5.6: Total number of operations for 256 bits of embedded storage

  nk ≤ 128  km = 256  nm ≤ 64  nkm   #Muls  #Accumulate  #Pack  #Resize in  #Resize out  Total
  2 × 64    64 × 4    2 × 4    512   512    0            448    0           64           1024
  4 × 32    32 × 8    4 × 8    1024  256    128          64     64          128          640
  4 × 16    16 × 16   4 × 16   1024  256    0            0      128         128          512
  2 × 8     8 × 32    2 × 32   512   512    0            0      64          64           640
  1 × 4     4 × 64    1 × 64   256   1024   0            0      0           0            1024
When considering the implementation of a bit matrix multiply instruction within a single
processor, our constraints are very different; nevertheless, we have obtained a similar
result.
Tables 5.5 and 5.6 omit one important factor – loading the A, B and C matrices, and
moving B to the embedded storage, when it exists. As these costs are the same irrespective
of matrix size, we do not take them into account in choosing the best size.
5.4.2 Parallel Table Lookup
In [61–63, 85], a new functional unit and a set of new instructions were proposed to ac-
celerate table lookups. These instructions were shown to be especially useful for very fast
software implementations of the AES and other block ciphers, the next generation hash
function Whirlpool, data fingerprinting and erasure codes. This functional unit (called
PTLU for Parallel Table LookUp module) contains a set of k tables that are looked up in
parallel, followed by combining the k table entries into a single result. One variant of the
PTLU instruction, denoted ptrd.x1, combines the k entries by XOR. The PTLU instruc-
tion has 2 source registers. The bytes of the first source register are used as indices into the
set of tables. The results of the table lookups are XORed together and then finally XORed
with the second source operand.
This functional unit (with 64-bit table entries) can be used to perform the bmm op-
eration. In Section 5.3, we described how bmm is performed using tables stored in cache.
These tables are instead stored in the new PTLU functional unit. The ptrd.x1 instruction
then serves as an alias for a bit matrix multiply-accumulate (bmm.ac) instruction:
    Ci = ptrd.x1 Ai, 0    // ptrd.x1 is an alias for bmm.ac
Using this new functional unit reduces the cost of multiplication to a single cycle per row.
Additionally, as these tables are dedicated to the functional unit, and not in the data cache,
there is no possibility of penalties due to data cache misses.
If the B matrix is swapped with another known B matrix, the tables must be reloaded.
In [62], the parallel table write multiple (ptwn) instruction is used to write a single entry
across all 8 tables, loading the 64 bytes directly from memory. Reloading the entire con-
tents of all tables therefore takes 256 ptwn instructions. An alternative to reloading the
tables is to define multiple sets of tables [63]. The cost is an additional 16KB per set of tables.
Additionally, replacement of the table contents might still occur if there are more B ma-
trices than tables provided. If the B matrix is dynamically generated, the tables must be
recomputed (see Section 5.3) and then loaded into the parallel table lookup functional unit.
5.5 Comparison of Techniques
We now compare the techniques listed in Sections 5.3 and 5.4.
5.5.1 Performance
Instruction Counts – We first consider the total number of instructions for each method
(Table 5.7). For bmm.8 and bmm.16.ac, we assume that 8 and 16 rows, respectively, of
A are processed at a time to minimize shuffling of A and C from the GPRs. For each set of
rows, the entire B matrix must be loaded.
Table 5.7: Operation counts for bit matrix multiply techniques

  Method                #instructions for 64 rows
  Row-wise Summation    4352
  Table lookup          2048
  PTLU                  192
  bmm.8                 1984
  bmm.16                1664
  bmm.64 (Cray)         192 (3 vector ops)
The most basic new technique, bmm.8, requiring no extra storage, requires slightly
fewer operations than the table lookup method. The more advanced new methods are even
better. However, we defer fully discussing the performance of the new techniques until
Chapter 6, where we present quantitative results from simulation.
Amortization – Another factor to consider is that bmm.8 or bmm.16.ac require mul-
tiplication of 8 or 4 rows, respectively, in a single instruction and it is only by amortizing
the cost of multiplication, accumulation and matrix resizing across multiple rows that these
instructions achieve good overall results. When multiple rows are not available to be mul-
tiplied at once, these instructions become less attractive.
Bit Manipulation – As discussed in Section 5.2, bmm can be used for bit manipulation.
However, bmm.8 and bmm.16.ac are best suited for manipulating bits within subwords
or manipulation of subwords. These instructions are less suited for full n-bit operations.
For example, to permute 8 words requires, at most, 8 bfly and 8 ibfly instructions.
Using bmm.8 requires 48 mix instructions and some number of bmm.8 and xor instruc-
tions (for accumulation), depending on the structure of the permutation matrix (i.e., how
many 8× 8 submatrices are zero). Thus, the dedicated instructions appear to be better than
the smaller bmm variants.
On the other hand, bmm.64 and PTLU are well suited for full n-bit permutations and,
in fact, may be more efficient than the dedicated instructions as a single bmm.64 or ptrd
instruction accomplishes the same permutation as a bfly and an ibfly together. Fur-
thermore, when given a new permutation, deriving the permutation matrix is easier than
deriving the bfly and ibfly control bits. However, the dedicated instructions have less
of a setup cost as only a small number of application registers need to be loaded as opposed
to loading a much larger B matrix or a full 16KB of tables.
Table 5.8: Overhead for bit matrix multiply techniques

  Method              Overhead for changing to known B     Overhead for new B
  Row-wise Summation  negligible                           transposing B (for bmmt)
  Table lookup        bringing tables into cache           transposing B (for bmmt),
                                                           computing tables
  PTLU                bringing tables into dedicated       transposing B (for bmmt),
                      on-chip storage                      computing tables
  bmm.8               negligible                           resizing B
  bmm.16              negligible                           resizing B
  bmm.64 (Cray)       loading the B matrix and moving      loading the B matrix and moving
                      to embedded storage                  to embedded storage
5.5.2 Overhead
The operation counts listed in Table 5.7 are valid under the assumption that the B matrix
does not change. If the B matrix is generated dynamically or even if the B matrix simply
changes between known matrices, there is considerable overhead, which varies consider-
ably depending on the technique (Table 5.8). The worst overhead when switching to a
known B matrix is potentially experienced by the table lookup methods. A considerable
amount of data is brought into cache or moved to the tables upon a B matrix switch. For
bmm.64, the B matrix is loaded and moved to embedded storage. As bmm.64 generally
requires no reloading of B, changing B introduces non-trivial overhead. For row-wise sum-
mation, bmm.8 and bmm.16.ac, there is negligible overhead as the B submatrices are
constantly reloaded into GPRs anyway, so switching the B matrix has little cost.
When switching to a newly generated B matrix, the table lookup methods fare poorly as
the tables must be recomputed. However, the submatrix multiply instructions only reshape
the matrices, which can be done relatively efficiently.
5.5.3 Implementation Cost
We now consider the costs of the implementations of the various methods. Table 5.9 lists
how much extra storage each method needs. We also synthesized the various new func-
tional units mapping to a TSMC 90 nm standard cell library [77]. A reference ALU was
synthesized and the other designs were constrained to match the latency of the ALU. For
bmm.16.ac and bmm.64, we designed two units – one unit has embedded storage and
one set of AND-XOR trees, while the second unit has embedded storage and two sets of
AND-XOR trees. Thus the second unit supports executing two bmm instructions at once
using the same B matrix.
The parallel table lookup module has approximately twice the latency of the ALU and
thus will be a 2 or 3 cycle instruction. Both bmm.64 functional units have slightly greater
latency than the ALU. Thus they might take 1 or 2 cycles depending on timing constraints.
The other functional units meet or beat the timing constraint.
When comparing the area of the functional units, the PTLU module is much larger than
the ALU (12.8×) and is more than twice as large as the bmm.64 unit which performs only
a single multiply, which itself is around 5.5× the ALU area. However, the PTLU module
can be used for other operations besides bmm and it is more useful to think of the module as
a cache than a functional unit (and it is smaller than a 16KB cache as there are no tags). The
bmm.8 and the bmm.16.ac unit that supports one multiply are smaller than the ALU. The
bmm.8 unit is only 30% the size of the ALU and, therefore, one might want to consider
adding the bmm.8 functionality to the ALU rather than introducing a new functional unit.
Note that scaling the amount of extra storage does not equivalently scale the unit size as
the majority of the area consists of combinational logic. Also note that adding a second
set of AND-XOR trees adds 45-65% area overhead, but provides for doubling the rate of
multiplication.
Table 5.9: Latency and area of functional units performing proposed bmm instructions

  New functional  Extra            Latency  Latency         Cycles    Area      Area
  unit            storage          (ns)     (rel. to ALU)             (NAND     (rel. to
                                                                      gates)    ALU)
  ALU             –                0.5      1               1         7.5K      1
  Row-wise        –                –        –               –         –         –
  Summation
  Table lookup    16 KB cache      –        –               cache     –         –
                                                            latency
  PTLU            16 KB dedicated  1.05     2.1             2 or 3    95.9K     12.8
                  tables
  bmm.8           –                0.371    0.74            1         2.4K      0.31
  bmm.16.ac       256 bits         0.5      1               1         6.3K      0.83
  bmm.64 (Cray)   4096 bits        0.513    1.03            1 or 2    41.3K     5.47
  bmm.16.ac       2 × 256 bits     0.5      1               1         10.4K     1.38
  bmm.64          2 × 4096 bits    0.527    1.05            1 or 2    59.1K     7.84
5.6 Related Work
A few other architectures also support the bit matrix multiply instruction. The Cray vector
supercomputers, as mentioned, include a bmmt.64 unit. The Cray MTA [86], which is
not a vector supercomputer but rather a massively multithreaded supercomputer, has an
instruction like bmm.8 (with both OR and XOR accumulation) and an 8 × 8 bit matrix
transpose instruction. This computer, however, was not a commercial success.
A couple of proposed academic architectures also have a bmm unit. The Near-Memory
Processor [87] has a coprocessor for vector, streaming and bit manipulation operations
which is closely coupled to the onboard memory controller and has a high bandwidth to
main memory. This coprocessor has a full bmmt.64 functional unit.
In [88], a vector accelerator for conventional processors is proposed. These accelerators
have direct access to the memory controllers and communicate with the main processor via
the front side bus. The accelerators have a high number of decoupled vector lanes (> 16)
in order to maximize throughput. Each accelerator has four 64× 64 bmm registers (to hold
4 different B matrices). These registers can be used to perform bmmt.64, but, unlike
with the Cray, this is not a single cycle operation. Slices of the B register are fed to each
vector lane in a cycle and the standard ALU in each lane is used to perform the bitwise
and between the rows of A and B and then the xor accumulation (which also requires
transferring data between the lanes). A single vector-matrix multiplication requires 27
cycles (for 16 lanes).
In contrast to these other proposed academic architectures, we examine how to perform
bmm.64 without resorting to a full 4096 bits, or more, of extra storage, as appropriate for
a commodity processor.
5.7 Summary
In this chapter, we discussed the bit matrix multiply operation. This operation can be
used to speed up finite field multiplication, parity computation, bit matrix transpose, and
to perform a superset of the bit manipulation operations, including many of the SIMD ISA
extension subword manipulation instructions.
We presented the Cray implementation of the bmm operation, which requires 4096 bits
of extra storage. We also considered the best software method for bmm on commodity
processors, which is the table lookup method. Due to the high cost of the Cray implementa-
tion and the drawbacks of table lookup, new architectural techniques to accelerate bmm in
commodity processors were explored.
The first technique is submatrix multiply – we define a set of primitive bmm instructions
that multiply submatrices of A and B. We consider instructions that either use GPRs or
auxiliary storage to hold the B submatrix. We investigated the choice of sizes for multiplier
and multiplicand matrices and determined that bmm.8, which multiplies two 8×8 matrices
(in general purpose registers), and bmm.16.ac, which multiplies a 4 × 16 matrix in a
general purpose register by a 16×16 matrix in the extra storage and then accumulates with
a 4× 16 matrix in a general purpose register, were the best choices.
The second technique makes use of the Parallel Table LookUp module, which uses a
processor word as an index into a set of on-chip tables. The tables are read in parallel and
the results are XORed together. A PTLU operation is equivalent to a bmm.64 operation.
We also considered the overhead and costs of the techniques. The bmm.8 unit is 30%
the size of an ALU, and thus one may consider implementing bmm.8 functionality within
the ALU. The bmm.8 unit also requires no additional architectural register state and thus
does need operating system support. Thus bmm.8 is attractive. bmm.64 and PTLU have
high implementation cost and high overhead. However, they require the fewest number of
operations to perform a (64 × 64) × (64 × 64) → (64 × 64) multiplication. In the next
chapter, we further analyze the applications that benefit from bmm (and other advanced bit
manipulation instructions) and give quantitative results for the speedup yielded by using
the new instructions.
Chapter 6
Applications and Performance Results
In this chapter, we detail the applications that benefit from the advanced bit manipulation
instructions introduced earlier in this thesis. Section 6.1 describes the applications in detail
and explains how the new instructions speed up the applications. Section 6.2 presents simu-
lation results demonstrating quantitatively the speedup obtained using the new instructions.
Section 6.3 summarizes the chapter.
6.1 Applications
We now describe how bfly, ibfly, pex, pdep and bmm instructions can be used in ex-
isting applications, to give speedup estimates that are currently realizable. Use of these
novel instructions in new algorithms and applications will likely produce even greater
speedup.
Table 6.1 summarizes each of the applications described below in terms of the advanced
bit manipulation instructions it uses. × indicates definite usage, (×) indicates usage in alter-
native algorithms and question marks indicate potential usage. The bfly/ibfly column
indicates use of bit permutation. The pex and pdep columns indicate use of static bit
gather and bit scatter operations. The setib and setb columns indicate use of loop-
invariant bit gather and scatter. The pex.v and pdep.v columns indicate use of dynamic
bit gather and scatter.

For the bmm column, × indicates usage of bmm as a multiplication primitive while (×)
indicates usage of bmm as a bit manipulation primitive; we use (×) for bmm to indicate
those applications where another bit manipulation instruction (like parallel extract, parallel
deposit or bit permutation) can be used. The use of the mov ar instruction is assumed
(but not shown in Table 6.1) whenever the bfly, ibfly, static pex and pdep, or bmm
instructions are used.

Table 6.1: Summary of Bit Manipulation Instruction Usage in Various Applications
(× = definite usage; (×) = usage in an alternative algorithm; ? = potential usage; * = bmmt)

  Binary Compression                               pex ×; bmm (×)
  Binary Decompression                             pdep ×; bmm (×)
  Steganography          LSB Encoding              pdep ×; setb ×; bmm (×)
                         Decoding                  pex ×; setib ×; bmm (×)
  Transfer Coding        Encoding                  pdep ×; bmm (×)
                         Decoding                  pex ×; bmm (×)
  Integer Compression    Compression               pdep ×; bmm (×)
                         Decompression             pex ×; bmm (×)
  Binary Image Morphology                          pex ×; pex.v (×); bmm (×)
  Random Number Gen.     Von Neumann               pex.v ×
                         Toeplitz                  bmm ×
  Bioinformatics         Compression               pex ×; bmm (×)
                         BLASTZ Alignment          pex ×; bmm (×)
                         BLASTX Translation        pdep ×; bmm (×)
                         Reversal                  bfly/ibfly ×; bmm (×)
  Cryptography           Block Ciphers             bfly/ibfly ×; grp ×; bmm ×
                         Stream Ciphers            pex.v ×; bmm ×
                         Public Key                bmm ×
                         Future                    ? (all columns)
  Cryptanalysis          Linear                    bmm ×*
                         Algebraic                 bmm ×
                         Future/Proprietary        ? (all columns)
  DARPA HPCS Discrete    Matrix Transpose          bmm ×*
  Math Benchmarks        Equation Solving          bmm ×
  Linear Feedback Shift Registers                  bmm ×
  Error Correction       Block Codes               bmm ×
                         Convolutional Codes       bmm ×
                         Puncturing                pex ×; pdep ×; bmm (×)
6.1.1 Binary Compression and Decompression
The Itanium [13] and IA-32 [1] parallel compare instructions produce subword masks – the
subwords for which the relationship is false contain all zeros and for which the relationship
is true contain all ones. This representation is convenient for subsequent subword masking
or merging. The SPARC VIS [14] parallel compare instruction produces a bit mask of
the results of the comparisons. This representation is convenient if some decision must be
made based on the outcome of the multiple comparisons. Converting from the subword
representation to the bitmask representation requires only a single static pex instruction.
The SSE instruction pmovmskb [1], mentioned in Chapter 2, serves a similar purpose;
it creates an 8- or 16-bit mask from the most significant bit from each byte of a MMX or
SSE register and stores the result in a general purpose register. However, pex offers greater
flexibility than the fixed pmovmskb, allowing the mask, for example, to be derived from
larger subwords, or from subwords of different sizes packed in the same register.
Similarly, binary image compression performed by MATLAB’s bwpack function [89]
benefits from pex. Binary images in MATLAB are typically represented and processed as
byte arrays – a byte represents a pixel and has permissible values 0x00 and 0x01. However,
certain optimized algorithms are implemented for a bitmap representation, in which a single
bit represents a pixel.
To produce one 64-bit output word requires only 8 static pex instructions to extract 8
bits in parallel from 8 bytes and 7 dep instructions to pack these eight 8-bit chunks into one
output word (Figure 6.1). For decompression, as with the bwunpack function, 7 extr
instructions are required to pull out each byte and 8 pdep instructions to scatter the bits of
each byte to byte boundaries.
Figure 6.1: Compressing each input word requires 1 pex and 1 dep.
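On x86 processors with the BMI2 extension, the _pext_u64 intrinsic provides the static pex functionality used here; a sketch of the packing loop under that assumption (the output bit ordering is illustrative):

    #include <stdint.h>
    #include <immintrin.h>   /* _pext_u64 and _pdep_u64 require BMI2 */

    /* Pack 64 byte-pixels (values 0x00/0x01) into one bitmap word:
     * each pext gathers bit 0 of the 8 bytes of one input word. */
    uint64_t pack_pixels(const uint64_t in[8]) {
        uint64_t out = 0;
        for (int i = 0; i < 8; i++)
            out |= _pext_u64(in[i], 0x0101010101010101ULL) << (8 * i);
        return out;
    }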
6.1.2 Least Significant Bit Steganography
Steganography [90] refers to the hiding of a secret message by embedding it in a larger,
innocuous cover message. A simple type of steganography is least significant bit (LSB)
steganography in which the least significant bits of the color values of pixels in an image,
or the intensity values of samples in a sound file, are replaced by secret message bits.
LSB steganography encoding can use a pdep instruction to scatter the secret message bits,
placing them at the least significant bit positions of every subword. Decoding uses a pex
instruction to extract the least significant bits from each subword and reconstruct the secret
message.
LSB steganography is an example of an application that utilizes the loop-invariant ver-
sions of the pex and pdep instructions. The sample size and the number of bits replaced
are not known at compile time, but they are constant across a single message. Figure 6.2
depicts an example LSB steganography encoding operation in which the 4 least significant
bits from each 16-bit sample of PCM encoded audio are replaced with secret message bits.
Figure 6.2: LSB steganography encoding (4 bits per 16-bit PCM encoded audio sample).
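A sketch of the encoding and decoding steps with the BMI2 intrinsics standing in for pdep/pex (four 16-bit samples per 64-bit word; the sample layout is an assumption):

    /* Replace the low 4 bits of four 16-bit PCM samples with secret bits. */
    uint64_t lsb_encode(uint64_t samples, uint16_t secret) {
        const uint64_t mask = 0x000F000F000F000FULL;  /* low 4 bits/sample */
        return (samples & ~mask) | _pdep_u64(secret, mask);
    }

    /* Recover the 16 hidden bits from the same four samples. */
    uint16_t lsb_decode(uint64_t samples) {
        return (uint16_t)_pext_u64(samples, 0x000F000F000F000FULL);
    }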
6.1.3 Transfer Coding
Transfer coding is the term applied when arbitrary binary data is transformed to a text string
for safe transmission using a protocol that expects only text as its payload. Uuencoding [91]
is one such encoding originally used for email or usenet attachments. In uuencoding, each
set of 6 bits is aligned on a byte boundary and 32 is added to each value to ensure the result
is in the range of the ASCII printable characters. Without pdep, each field is individually
extracted and has the value 32 added to it. With pdep, 8 fields are aligned at once and
a parallel subword add instruction, available in multimedia extensions of all commercial
microprocessor ISAs [1, 13–15], adds 32 to each byte simultaneously (Figure 6.3, shown
for 4 bytes). Similarly, for decoding, a parallel subtract instruction deducts 32 from each byte
and then a single pex compresses eight 6-bit fields.
Figure 6.3: Uuencode of ’bit’ using pdep.
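A sketch of the encode step for one 24-bit group, again with _pdep_u64 standing in for pdep (which 6-bit field lands in which output byte is a convention, not fixed by the text):

    /* Uuencode 3 data bytes: align four 6-bit fields on byte boundaries,
     * then add 32 to every byte at once. The SWAR add is carry-safe here
     * because each field is at most 0x3F, so 0x3F + 0x20 < 0x100. */
    uint32_t uuencode3(uint32_t bits24) {
        uint64_t fields = _pdep_u64(bits24, 0x3F3F3F3FULL);
        return (uint32_t)(fields + 0x20202020ULL);
    }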
6.1.4 Integer Compression
Internet search engine databases, such as Google’s, consist of lists of integers describing
frequency and positions of query terms. These databases are compressed to minimize stor-
age space, memory bus traffic and cache usage. To support fast random access into these
databases, integer compression – using less than four bytes to represent an integer – is
utilized. One integer compression scheme is a variable byte encoding in which each byte
contains 7 data bits and a flag bit indicating whether the current byte is an intermediate part
of the current integer or ends the current integer [92]. pex can accelerate the decoding
of integers by compacting 7-bit fields from each byte (Figure 6.4), and pdep can speedup
encoding by placing consecutive 7-bit chunks of an integer on byte boundaries.
Figure 6.4: Decompression of integer compressed in 3 bytes.
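The field-compaction step of decoding, sketched with _pext_u64 (finding the terminating flag bit and masking the result to the integer's actual length is left to the caller):

    /* Strip the per-byte flag bits of a variable-byte-coded integer,
     * compacting up to eight 7-bit data fields into one value. */
    uint64_t varint_extract(uint64_t bytes) {
        return _pext_u64(bytes, 0x7F7F7F7F7F7F7F7FULL);
    }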
6.1.5 Binary Image Morphology
Binary image morphology is a collection of techniques for binary image processing such
as erosion, dilation, opening, closing, thickening, thinning, etc. The bwmorph function in
MATLAB [89] implements these techniques primarily through one or more table lookups
applied to the 3 × 3 neighborhood surrounding each pixel (i.e., the value of 9 pixels is
used as an index into a 512-entry table). In its current implementation, bwmorph processes
byte array binary images, not bitmap images, possibly due to the difficulty in extracting the
neighborhoods in the bitmap form.
If the images are processed in bitmap form, a single pex instruction extracts the entire
index at once (assuming a 64-bit word contains an 8 × 8 block of 1-bit pixels, as in Fig-
ure 6.5). As the algorithm steps through the pixels, the neighborhood moves. This means
that the bitmask for extracting the neighborhood is shifted and a dynamic pex.v instruc-
tion may be needed. Alternatively, the data might be shifted, rather than the mask, such
that the desired neighborhood always exists in a particular set of bit positions. In this case,
the mask is fixed, and only a static pex is needed. Table 6.1 indicates the latter – with pex
indicated by a × and pex.v by (×) for the alternative algorithm.
Figure 6.5: Using pex to extract a 3 × 3 neighborhood of pixels from a register containing an 8 × 8 block of pixels.
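A sketch of the static-mask neighborhood extraction for interior pixels, assuming (one of several possible layouts) that bitmap row j is byte j and column c is bit c of that byte:

    /* 9-bit morphology table index for the 3x3 neighborhood of pixel
     * (r,c), 1 <= r,c <= 6; edge pixels need extra masking. */
    unsigned neighborhood9(uint64_t img, int r, int c) {
        uint64_t mask = 0x070707ULL << (8 * (r - 1) + (c - 1));
        return (unsigned)_pext_u64(img, mask);
    }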
6.1.6 Random Number Generation
Random numbers are very important in cryptographic computations for generating nonces,
keys, random values, etc. Random number generators contain a source of randomness (such
as a thermal noise generator) and a randomness extractor that transforms the randomness
so that it has a uniform distribution. The Intel random number generator [93] uses a von
Neumann extractor. This extractor breaks the input random bits, X = x1x2x3 . . ., into a
sequence of pairs. If the bits in the pair differ, the first bit is output. If the bits are the same,
nothing is output. This operation is equivalent to using a pex.v instruction on each word
X from the randomness pool with the mask:

    Mask = (x1 ⊕ x2) || 0 || (x3 ⊕ x4) || 0 || …
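A sketch of the whole extraction step in C, with x1 taken as the most significant bit and _pext_u64 standing in for pex.v (__builtin_popcountll is a GCC/Clang builtin):

    /* Von Neumann extraction: keep the first bit of every pair whose two
     * bits differ. x ^ (x << 1) computes x_{2i-1} XOR x_{2i} at the
     * first-of-pair positions; the 0xAA... mask zeroes the others. */
    uint64_t von_neumann(uint64_t x, int *nbits) {
        uint64_t mask = (x ^ (x << 1)) & 0xAAAAAAAAAAAAAAAAULL;
        *nbits = __builtin_popcountll(mask);   /* bits actually produced */
        return _pext_u64(x, mask);
    }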
Another randomness extractor convolves the input random bits with a random string [94].
Convolution of two binary strings is equivalent to multiplication by a bit matrix with con-
stant diagonals, a binary Toeplitz matrix (Figure 6.6). This operation can be accelerated by
the bmm instruction.
    (r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8) ×  | s_8  s_9  s_10 s_11 |
                                         | s_7  s_8  s_9  s_10 |
                                         | s_6  s_7  s_8  s_9  |
                                         | s_5  s_6  s_7  s_8  |
                                         | s_4  s_5  s_6  s_7  |
                                         | s_3  s_4  s_5  s_6  |
                                         | s_2  s_3  s_4  s_5  |
                                         | s_1  s_2  s_3  s_4  |

Figure 6.6: Randomness extraction by Toeplitz matrix multiplication.
6.1.7 Bioinformatics
Bioinformatics, the field of analysis of genetic information, was mentioned in Section 2.4.
We saw that pex and pdep are useful for a number of subproblems within bioinformatics:
• Compression: DNA bases A, C, G and T have a 2-bit representation using bits 2 and
1 of their ASCII codes. A pex instruction is used to select and compress bits 2 and
1 of each byte of a word (see the sketch after this list).
• BLASTZ Alignment: The BLASTZ sequence alignment program uses a “spaced
seed” as the basis of the comparisons – 12 out of 19 nucleotides are used to start a
comparison rather than a string of consecutive nucleotides. The program compresses
the 12 bases and uses the seed as an index into a hash table. This compression uses a
pex instruction selecting 24 of 38 bits.
• BLASTX Translation: Multiple sets of three bases, or 6 bits of data, are translated
into the corresponding protein codon using a pdep instruction to distribute eight
6-bit fields on byte boundaries, and then using PTLU to translate the bytes.
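The promised sketch of the compression bullet above, with _pext_u64 standing in for pex; the single mask works because bits 2:1 of the four ASCII codes are distinct (A = 00, C = 01, T = 10, G = 11):

    /* Pack 8 ASCII DNA bases into 16 bits by extracting bits 2:1 of
     * every byte (mask 0x06 per byte). */
    uint16_t pack_bases(uint64_t eight_ascii_bases) {
        return (uint16_t)_pext_u64(eight_ascii_bases, 0x0606060606060606ULL);
    }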
Permutation instructions can help bioinformatics as well. A strand of DNA is a double
helix – there are really two strands with the complementary nucleotides, A↔T and C↔G,
aligned. When performing analysis on a DNA string, often the complementary string is
analyzed as well. To obtain the complementary string, the bases are complemented and the
entire string is reversed, as the complement string is read from the other end. The reversal
of the DNA string amounts to a reversal of the ordering of the pairs of bits in a word. This
is a straightforward bfly or ibfly permutation.
6.1.8 Cryptology
Cryptology includes cryptography and cryptanalysis. A number of popular block ciphers,
such as DES, have permutations as primitive operations. The inclusion of permutation
instructions such as bfly and ibfly (or grp) can greatly improve the performance of
the inner loop of these functions [20, 21, 26, 43]. Also, these instructions can be used
as powerful primitive operations in the design of the next generation of ciphers [55, 95]
and hash functions (especially for the Cryptographic Hash Algorithm Competition (SHA-
3) [96]).
Block ciphers can also benefit from bmm. AES, for example, uses finite field multipli-
cation for its column mix step [49]. In an implementation of AES that performs each step
of the cipher explicitly, bmm can be used to accelerate column mixing.
The proposed instructions also speed up stream ciphers, as discussed in Chapter 2. The
basis of many stream ciphers are Linear Feedback Shift Registers. These can be sped up
using bmm (see Section 6.1.10). Also, some stream ciphers make use of variable pex.v
operations in order to extract the keystream from the state.
The McEliece cryptosystem [97] is a public-key cipher that uses a block code as its
basis and thus can be sped up using bmm. The private key consists of G, a k × n generator
matrix for a block code that corrects t errors, S, a random k × k non-singular matrix, and
P, an n × n permutation matrix. The public key is (Ĝ = SGP, t). The sender of message
m of length k chooses a random string z of length n having at most t “1”s. The ciphertext
is then computed as c = mĜ ⊕ z. The receiver then computes ĉ = cP^{-1}, uses the error
correction decoding algorithm to obtain m̂ from ĉ, and then computes m = m̂S^{-1}. In
practice, recommended parameter sizes are n = 1024, t = 38, and k ≥ 644.
Cryptanalysis is the practice of code breaking. Linear cryptanalysis was discussed in
Chapter 2. In linear cryptanalysis, a linear approximation to a block cipher, which relates
the parity of a subset of the bits of plaintext, ciphertext and key, is determined. When at-
tempting to find a linear approximation by exhaustive search, the parity of many plaintext-
ciphertext pairs with bit positions indicated by many plaintext-ciphertext masks is com-
puted in an attempt to find a bias. The bmmt instruction can be used to obtain the parity of
multiple plaintexts or ciphertexts using many masks at once.
In algebraic cryptanalysis [98], the relationship between the plaintext, ciphertext and
key are expressed as a large system of linear equations, with the key bits as the unknowns.
Solving the system of equations yields bits of the key. bmm can be used to accelerate solving
the system of equations as matrix multiplication and Gaussian elimination are equivalent
problems. Also, as all of the other instructions are likely to be useful for other secret or
future cryptanalysis algorithms, we indicate them as ”?” in Table 6.1.
6.1.9 DARPA HPCS Discrete Math Benchmarks
The DARPA HPCS discrete math suite [99] contains 6 benchmarks for high performance
computing that focus on integer, bit-oriented and memory operations (as opposed to the
DARPA HPC challenge suite which focuses on floating point and memory operations).
The bmm instruction can be used to accelerate the data transposition benchmark and the
Boolean equation solving benchmark, which comprise 1/3 of the HPCS benchmarks.
In the data transposition benchmark, a stream of n-bit integers is interpreted as a series
of n × n bit matrices which must be transposed. n varies from 32 to 1024. The bmmt
operation can be used directly to speed up this benchmark.
In the Boolean equation solving benchmark, a pattern of ones, zeros and wildcards
must be located within a long bitstream. For each location that the pattern is found, the
next ∼490000 bits are interpreted as a system of 700 equations in 700 unknowns over
GF(2). The solution to the system must be computed, or it must be determined that no
solution exists. bmm can be used to accelerate solving the system of equations as matrix
multiplication and Gaussian elimination are equivalent problems.
6.1.10 Communications
Communications algorithms that are accelerated by advanced bit manipulation instructions
were discussed in Chapter 2. We briefly review these algorithms here.
Linear feedback shift registers (LFSRs) – Linear feedback shift registers are shift reg-
isters in which the feedback is the parity of some number of the current bits. LFSRs are
used for stream ciphers, error correction, scrambling and modulation. LFSR state
transformations are sped up by the bmm instruction. For example, one cycle of the LFSR of
Figure 2.21, with generator polynomial x^8 + x^2 + x + 1, is given by multiplication of the
state by the matrix A:

        0 0 0 0 0 0 0 1
        1 0 0 0 0 0 0 1
        0 1 0 0 0 0 0 0
    A = 0 0 1 0 0 0 0 0
        0 0 0 1 0 0 0 0
        0 0 0 0 1 0 0 0
        0 0 0 0 0 1 0 0
        0 0 0 0 0 0 1 1
In general, for an n-bit LFSR, the A matrix is given by:

        | 0  …  0   p_1 |
    A = |           p_2 |
        |  I_{n-1}   ⋮  |
        |           p_n |

where the p vector holds the coefficients of the generator polynomial and is used to compute
the feedback, and the (n − 1)-bit identity matrix shifts the rest of the state. Note that to
compute up to k cycles of the LFSR at once, we multiply the state by A^k.
Error Correction – Error correction coding adds redundancy to the message so that the
symbols can be recovered in the face of channel noise. One basic error correction scheme
is linear block codes, in which parity bits are computed for various subsets of the bits of
a message block. The generation of multiple parity patterns can be obtained by using a
bmm of the message symbols with a generator matrix. In convolutional coding, the input
is processed as a continuous stream. One or more shift registers store some number of
previous input bits and the outputs are the parity of subsets of the stored bits and the next
input bits. Convolutional codes can also be sped up by bmm, by expressing the parity generation
in matrix form and then replicating and shifting that matrix to achieve the effect of the shift
registers.
Puncturing – In puncturing, message bits are removed after error correction coding to remove
some redundancy. Removing message bits is accomplished by a pex instruction. Depunc-
turing is performed at the decoder to reinsert the bits and uses a pdep instruction.
6.2 Performance Results
We coded kernels for the above applications and simulated them using the SimpleScalar
Alpha simulator [100] enhanced to recognize our new instructions by replacing unused op-
codes with our new instruction definitions. The simulator parameters are given in Table 6.2.
Figure 6.7 shows speedups for pex and pdep, and Figures 6.8 and 6.9 show results for
bmm. In each case, the results are normalized with respect to the baseline Alpha ISA in-
struction counts.
For pex and pdep we tested bit compression of a long stream of bytes; LSB steganog-
raphy coding replacing the low 4 bits of 16-bit PCM encoded audio; uuencoding using a
long stream of bytes; integer compression using a long stream of integers; binary image
morphology using a 512× 512 bit image; BLASTX translate using a long stream of DNA
bases; and random number generation using a long stream of bits from the entropy pool.
The baseline implementation used the algorithms from Hacker’s Delight [16].
Overall, the speedups over baseline range from 1.07× to 9.9×, with an average of
2.19× (1.84× for static pex and 1.93× for static pdep). The benchmark that exhibited
the greatest speedup was random number generation as performing pex.v in software is
much more expensive than performing static pex in software. Of the other benchmarks,
integer compression was sped up the least. This is due to the extra bookkeeping required
to track when compressed integers cross word boundaries. The speedup is also lower for
binary image morphology due to extra work in extracting the 3 × 3 pixel neighborhoods
that cross register boundaries, and is also lower for steganography as there are only 4 fields
per word, limiting the effect of the new instructions.

Table 6.2: SimpleScalar Parameters

  instruction fetch queue    32
  decode width               4
  issue width                4
  RUU size                   128
  load/store queue           32
  memory ports               2
  ALUs                       4
  multipliers                2
  pex/pdep units             1
  bmm units                  2*
  ptlu units                 1
  static pex/pdep latency    1
  pex.v/pdep.v latency       3
  bmm latency                1
  ptrd latency               2
  L1 I-cache                 32KB
  L1 D-cache                 32KB
  L2 cache                   500KB

  * Only one set of extra registers, but two AND-XOR trees
We also tested the BLASTZ program using the simulator. The results were highly
dependent on the inputs chosen. For shorter sequences, in the range of 104 bases, speedups
up to 1-2% were observed. For longer sequences, 105 bases, only a minor speedup was
observed – around 0.1%. Clearly, BLASTZ can benefit from pex, but this benefit can be
very minor in some cases.
For bmm, we first ran a set of artificial benchmarks. We tested a generic (N × 64) ×
(64 × 64) → (N × 64) matrix multiply for N = 64, 1024 and 65536 and multiplication
of two 1024 × 1024 matrices (one case where the B matrix is known in advance and one
where it is not). The baseline algorithm is serial table lookup. The results are shown in
Figure 6.8.
Figure 6.7: Speedup of applications using pex and pdep
Figure 6.8: Speedup of artificial benchmarks using bmm
The results for the (N × 64)× (64× 64)→ (N × 64) basic test case are similar to the
results of Table 5.7 only for N = 1024. For N = 64, the overhead of loading the B matrix
for bmm.64 and loading the tables for PTLU adversely affects performance, as indicated
in Table 5.8, while the overhead is negligible for the other cases. For N = 65536, cache
misses become significant. bmm.64 and PTLU are the most affected because they require
the fewest computation operations, thus making them most sensitive to memory latency.
For matrix multiplication using precomputed matrices (indicated by a * in Figure 6.8),
again we see the effect of the overhead for bmm.64 and PTLU. PTLU is also negatively
impacted by cache misses due to the need to store all the tables (4 MB of tables). For matrix
multiplication using a new B matrix (indicated by a ** in Figure 6.8), the baseline table
lookup case becomes worse due to the need to compute the tables. This overhead is greater
than that of resizing the matrices. Consequently, bmm.8, bmm.16.ac and bmm.64 are
better relative to the baseline.
We then evaluated the use of bmm in the applications. We tested the linear cryptanalysis
algorithm of Figure 2.19 (inner loop only); random number generation using a 768 × 256
Toeplitz matrix; block coding of a long stream of bits using the Golay code; convolutional
coding using the encoder of Figure 2.23; and transpose of a 1024×1024 matrix. The results
are shown in Figure 6.9.
For linear cryptanalysis, the results are affected by the extra work required to maintain
the counts for each mask. This limits the maximum speedup attainable (à la Amdahl’s
Law). As bmmt.64 speeds up the parity computation portion the most, it most clearly
exhibits this limitation, even though N is large enough in this case to amortize the costs
of changing B. Additionally, the transpose version of bit matrix multiply is used, as we
assume the masks, plaintexts and ciphertexts are all in rows. This impacts PTLU, as the
transpose of the matrices is computed in software.
Overhead of changing matrices also impacts the results for the random number gen-
eration benchmark. As N = 768, bmm.64 has worse performance as compared to the
Figure 6.9: Speedup of applications using bmm
1024 × 64 multiplication or the 1024 × 1024 multiplication. For PTLU, performance is
worse than the 1024 × 64 case, due to the overhead, but better than the 1024 × 1024 case as
there are fewer tables and no cache misses.
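For intuition, each output bit of the Toeplitz whitening step is the parity of the AND of one matrix row with the entropy bits, as in this illustrative C sketch (the name and the packing of a 256-bit row into four words are our assumptions); this row-AND-parity pattern is exactly what bmm and PTLU accelerate, 64 rows at a time in the bmm.64 case.

#include <stdint.h>

/* One output bit of r = T * x over GF(2): AND a 256-bit matrix row
   with the 256 input bits, then take the parity of the result. A
   768 x 256 Toeplitz matrix yields 768 such parities. */
static int gf2_dot(const uint64_t row[4], const uint64_t x[4]) {
    uint64_t acc = 0;
    for (int w = 0; w < 4; w++)      /* 4 x 64 = 256 input bits */
        acc ^= row[w] & x[w];        /* AND, then fold with XOR */
    return __builtin_parityll(acc);  /* parity (GCC/Clang)      */
}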
For Golay encoding, bmm.64 and PTLU slow down, not due to the overhead of chang-
ing the B matrix or tables, but rather due to the fact that the input stream must be rearranged
to be multiplied by the fixed 24 × 48 generator matrix (the 12 × 24 generator matrix is tiled
twice). bmm.8 fares better in this case, as the generator matrix decomposes well into the
8 × 8 submatrices. bmm.16.ac is slower than bmm.8 due to the sparseness of the generator
matrix (expressed as a 48 × 96 matrix to better fit the 16 × 16 submatrices). There
are 10 submatrix multiplies for both bmm.8 and bmm.16.ac, and it is not possible to fully
amortize the cost of loading the B submatrix for each bmm.16.ac multiply.
Compared to the Golay encoder, the results for the convolutional encoder are better
for all instructions. This is due to the structure of the B matrix – there are fewer multiply
instructions required, but more table lookups, as the table lookup method requires an entire
set of 8 rows to be zero to avoid a lookup, while multiplication requires a small 8 × 8
or 16 × 16 block to be zero to skip a submatrix multiply. The structure of the matrix for
convolutional encoding decomposes better for bmm.16.ac, so it is faster than bmm.8.
Table 6.3: Incremental performance gain and area increase for submatrix multiply
                          Incremental Increase   Incremental Increase in
                          in Performance         Area (versus 2× bmm.8)
bmm.16.ac vs. bmm.8             1.4                     2.2
bmm.64 vs. bmm.8                3.2                    12.3
bmm.64 vs. bmm.16.ac            2.3                     5.7
Additionally, the Golay encoder is a rate-1/2 encoder while the convolutional coder is a
rate-2/3 encoder. Thus, as there are fewer outputs per input for the convolutional encoder, more
input rows can be processed at a time without encountering register pressure, allowing for
greater amortization of loading the B matrix.
For transposition, bmmt.64 suffers because the B matrix must be loaded every time,
while bmmt.8 and bmmt.16.ac require few multiplications because A (= I) is very sparse.
Overall, bmm.8 is 1.6× faster than the baseline case, bmm.16.ac is 2.2× faster,
bmm.64 is 5.0× faster and PTLU is 3.6× faster. However, we see that bmm.8 and
bmm.16.ac are mostly insensitive to extreme or changing parameters. bmm.64 and
PTLU are, on the other hand, highly sensitive, due to the need to amortize the cost of loading
the B matrix or tables. Furthermore, the cost of computing tables is significant. We also see
sensitivity to awkward matrix sizes and sparse matrices. bmm.8 handles the former best,
as long as the matrix can be decomposed into 8 × 8 submatrices, and also the latter, as
few submatrix multiplies are then required.
In Table 6.3, we consider the incremental performance gain and area increase going
from bmm.8 to bmm.16.ac to bmm.64 (for units that support two multiplies). Going
from bmm.8 to bmm.16.ac to bmm.64 does provide increasing performance, but at a
steeper and steeper cost. Clearly, bmm.64 has the best overall performance, but it also
has the greatest cost. The bmm.8 instruction provides reasonable performance gains, is
insensitive to parameter changes, can take advantage of sparsity, requires no extra state and
can, by itself, perform many of the existing SIMD ISA subword manipulation instructions.
6.3 Summary
In this chapter, we analyzed applications that benefit from the newly proposed instruc-
tions. These applications include bit compression and decompression, least significant bit
steganography, binary image morphology, transfer coding, bioinformatics, integer com-
pression, random number generation, error correcting coding, matrix transposition and
cryptology.
Our results show that the applications that benefit from pex and pdep were sped up
2.19× on average. However, only one of our benchmarks made use of loop-invariant pex
and pdep, one made use of the variable pex.v instruction and none used pdep.v.
Consequently, it may not be necessary to provide hardware support for the loop-invariant
and variable operations, especially as the hardware decoder has a high cost.
The benchmarks for bit matrix multiply show that bmm.8 is 1.6× faster than the base-
line case, bmm.16.ac is 2.2× faster, bmm.64 is 5.0× faster and PTLU is 3.6× faster,
on average. bmm.64 and PTLU have the best overall performance, but they also have the
greatest cost.
Chapter 7
Conclusions and Future Research
This thesis presented a number of contributions concerning the design, implementation and
benefits of advanced bit manipulation instructions for commodity processors. In this chap-
ter, we summarize the work done in the thesis and propose directions for future research.
7.1 Advanced Bit Manipulation Instructions
The first contribution of the thesis is the novel parallel extract and parallel deposit instruc-
tions. In our analysis of bit-oriented instructions, we saw that specialized bit permutation
instructions were lacking. We proposed the parallel extract instruction to perform bit gather
operations and the parallel deposit instruction to perform bit scatter operations. We showed
how these instructions utilize the butterfly and inverse butterfly datapaths and developed an
algorithm for decoding the bitmask that specifies the bit gather or scatter operations into
the controls for the datapaths. We designed a hardware circuit, composed of a parallel pre-
fix population count subcircuit and a set of left rotate and complement subcircuits, which
implements our decoding algorithm.
7.2 Advanced Bit Manipulation Functional Unit
We designed a standalone advanced bit manipulation functional unit that implements paral-
lel extract, parallel deposit and the butterfly and inverse butterfly permutation instructions.
Due to the high cost of the hardware decoder, we considered the design of two alternative units
– one that supports only the static pex and pdep operations, for which the datapath controls
are precomputed, and one that supports the variable pex.v and pdep.v instructions as well
as the loop-invariant operations, for which the hardware decoder is used once and the control
bits are then captured and reused. The unit that supports only the static instructions is faster
(0.95× the latency) and smaller (0.9× the area) than an ALU.
7.3 Advanced Bit Manipulation Functional Unit as Basis
for the Shifter
We proposed that the advanced bit manipulation functional unit, rather than being stan-
dalone, should replace the standard shifter. We showed how rotation operations are per-
formed on butterfly and inverse butterfly datapaths and how instructions such as shift, ex-
tract, deposit and mix are just minor variations on rotation. We designed a rotation control
bit generator circuit that produces the datapath control bits from the shift amount.
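A minimal C sketch of this reduction (our own illustration of the principle, not the thesis datapath or control generator) shows shift, extract and deposit each as a rotation followed by masking:

#include <stdint.h>

/* Rotate right by s, 0 <= s < 64 assumed throughout. */
static uint64_t rotr64(uint64_t x, unsigned s) {
    s &= 63;
    return s ? (x >> s) | (x << (64 - s)) : x;
}

uint64_t shr(uint64_t x, unsigned s) {            /* logical shift right */
    return rotr64(x, s) & (~0ULL >> s);           /* rotate, then mask   */
}

uint64_t extr(uint64_t x, unsigned pos, unsigned len) { /* bitfield extract */
    return rotr64(x, pos) & ((len < 64) ? (1ULL << len) - 1 : ~0ULL);
}

uint64_t dep(uint64_t x, unsigned pos, unsigned len) {  /* bitfield deposit */
    uint64_t m = ((len < 64) ? (1ULL << len) - 1 : ~0ULL) << pos;
    return rotr64(x, (64 - pos) & 63) & m;        /* rotate left by pos  */
}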
This new shifter architecture represents an evolution of shifter design beyond the classic
barrel and log shifters. The new shifter supporting static pex and pdep is 11% slower
and 1.8× larger than the log shifter, while being able to perform a much richer array of
operations. The shifter supporting variable pex.v and pdep.v is 2.1× slower and 2.4×
larger, but supports the widest range of operations.
7.4 Implementation of Bit Matrix Multiply in Commodity
Processors
Another contribution of the thesis consists of architectural proposals to accelerate the bit
matrix multiply operation. Bit matrix multiplication is used to speed up parity computation
and can perform a superset of the bit manipulation operations. This instruction is supported
by the Cray supercomputer and we considered alternatives suitable for a commodity pro-
cessor. One proposal is submatrix multiplication, in which the multiplicand matrix B is
stored in extra registers in the functional unit. We also considered using the Parallel Ta-
ble Lookup module, which allows for looking up multiple tables in parallel, for bit matrix
multiply, as the best current technique for bit matrix multiply is the table lookup method.
For submatrix multiplication, we investigated the choice of sizes for multiplier and
multiplicand matrices and determined that bmm.8, which multiplies two 8×8 matrices (in
general purpose registers), and bmm.16.ac, which multiplies a 4× 16 matrix in a general
purpose register by a 16×16 matrix in the extra storage and then accumulates with a 4×16
matrix in a general purpose register, were the best choices. The bmm.8 unit is 30% the
size of an ALU, and thus one may consider implementing bmm.8 functionality within the
ALU. The bmm.8 unit also requires no additional architectural register state and thus needs
no operating system support.
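As an illustration of the last point, the following C model of bmm.8 semantics (the row-per-byte packing and LSB-first column convention are our assumptions, not the thesis encoding) shows why a permutation-matrix A turns the instruction into a byte shuffle:

#include <stdint.h>

/* R = A x B over GF(2), each 8x8 bit matrix packed one row per byte of
   a 64-bit word. Result row i is the XOR of the rows of B selected by
   the 1 bits in row i of A. */
uint64_t bmm8(uint64_t A, uint64_t B) {
    uint64_t R = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (A >> (8 * i)) & 0xFF, r = 0;
        for (int j = 0; j < 8; j++)
            if (a & (1u << j))
                r ^= (B >> (8 * j)) & 0xFF;   /* XOR selected rows of B */
        R |= (uint64_t)r << (8 * i);
    }
    return R;
}

With A a permutation matrix, result row i is exactly the one row of B that A selects, reproducing a SIMD byte-permute; sparser A matrices similarly give byte selects and merges.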
7.5 Application Analysis
The final contribution of this thesis is the analysis of applications that benefit from bit
manipulation operations. We showed how diverse applications such as bit compression,
image manipulation, communication coding, random number generation, bioinformatics,
integer compression and cryptology are all sped up by our new instructions. Overall, the
applications that benefited from pex and pdep were sped up 2.19× on average. Of the
applications considered, only one uses pex.v. This, together with the high cost of the pex.v
functional unit, indicates that the preferred functional unit is one that supports only static
pex and pdep.
The applications that benefit from bit matrix multiply were sped up 1.6× using bmm.8,
2.2× using bmm.16.ac, 5.0× using bmm.64, and 3.6× using PTLU, on average. Over-
all, bmm.64 was 3.2× faster than bmm.8 and bmm.16.ac was 1.4× faster. However,
the bmm.16.ac unit is 2.2× the size of bmm.8 and the bmm.64 unit is 12.3× the size.
Clearly, going from bmm.8 to bmm.16.ac to bmm.64 does provide increasing perfor-
mance, but at a steeper and steeper cost. To achieve the best performance, a full Cray-like
bmm.64 unit can be implemented. The PTLU unit has good performance, but is very
costly. However, it can be used for many purposes aside from bit matrix multiply.
7.6 Future Research
There are a number of possible directions for future work. The first possibility is compiler
support for bit manipulation instructions. In coding our simulations, we used compiler
intrinsics, essentially inline assembly with better syntax, to access our instructions. This
approach helps the programmer who is diligent enough to discover that the compiler sup-
ports the intrinsics. Ideally, the compiler will recognize the code sequence that is equivalent
to the instructions and automatically generate code that uses the instructions. This is likely
a difficult task, as there are many ways that these operations can be coded in software.
Other future work includes refinement of the implementation of the advanced bit manip-
ulation instructions. There may be better designs for the parallel prefix population counter
or, in fact, entirely better ways of implementing pex and pdep. Additionally, the designs
were synthesized with a primary focus on minimizing latency. Designs that also consider
area and power may prove interesting to examine.
Application analysis is also likely a fruitful area for further work. We do not claim to
have performed an exhaustive listing of all the applications that can benefit from bit manip-
ulation instructions. There are likely other algorithms that can be sped up by our instruc-
tions. Furthermore, designing algorithms that specifically take into account the existence of
advanced bit manipulation instructions is an area that has seen little research. Conversely,
application analysis might yield other candidates for new specialized advanced permutation
instructions.
Bit matrix multiply in a multicore setting is another avenue for future research. This
area includes considering how best to decompose large matrix multiplies for parallel pro-
cessing. It is assumed that prior work in decomposing standard matrix multiplication has
relevance, but the unique overheads involved with setting up the B matrices need to be
considered. Another point to evaluate is the tradeoff between having many cores each con-
taining a small bit matrix multiply unit (such as bmm.8), versus having only some cores
contain a larger unit (such as bmm.64).
Overall, we have brought the acceleration of advanced bit manipulation operations out
of the realm of “programming tricks.” We have shown that these operations are needed by
many applications and that direct support for them can be implemented in a commodity
microprocessor at a reasonable cost.
Bibliography
[1] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual, vol. 1–2,
Apr. 2008.
[2] Compaq Computer Corporation, Alpha Architecture Handbook, Rev. 4.0, Oct. 1998.
[3] R. B. Lee, “Multimedia enhancements for PA-RISC processors,” in Hot Chips VI, pp. 183–
192, Aug. 1994.
[4] R. B. Lee, “Accelerating multimedia with enhanced microprocessors,” IEEE Micro, vol. 15,
no. 2, pp. 22–32, Apr. 1995.
[5] R. B. Lee, “Subword parallelism with MAX-2,” IEEE Micro, vol. 16, no. 4, pp. 51–59, Aug.
1996.
[6] A. Peleg and U. Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Mi-
cro, vol. 16, no. 4, pp. 10–20, Aug. 1996.
[7] S. Oberman, G. Favor, and F. Weber, “AMD 3DNow! Technology: Architecture and Imple-
mentations,” IEEE Micro, vol. 19, no. 2, pp. 37–48, Apr. 1999.
[8] S. Thakkar and T. Huff, “The Internet Streaming SIMD Extensions,” Intel Technology Jour-
nal, vol. 3, no. 2, May 1999.
[9] M. Tremblay, J. M. O’Connor, V. Narayanan, and L. He, “VIS Speeds New Media Process-
ing,” IEEE Micro, vol. 16, no. 4, pp. 35–42, Aug. 1996.
[10] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, “AltiVec Extension to PowerPC
Accelerates Media Processing,” IEEE Micro, vol. 20, no. 2, pp. 85–95, Apr. 2000.
[11] R. B. Lee, A. M. Fiskiran, and A. Bubshait, “Multimedia instructions in IA-64,” in Pro-
ceedings of the IEEE International Conference on Multimedia and Expo, pp. 281–284, Aug.
2001.
[12] G. Kane, PA-RISC 2.0 Architecture. Prentice Hall, 1996.
[13] Intel Corporation, Intel Itanium Architecture Software Developer’s Manual, vol. 1–3, Rev.
2.2, Jan. 2006.
[14] Sun Microsystems, The VIS Instruction Set, Rev. 2.0, Nov. 2003.
[15] IBM Corporation, Power ISA, Rev. 2.05, Oct. 2007.
[16] H. S. Warren Jr., Hacker’s Delight. Boston: Addison-Wesley Professional, 2002.
[17] Cray Inc., “Cray Assembly Language (CAL) for Cray X1 Systems Reference Manual, ver-
sion 1.2,” 2003, http://docs.cray.com/books/S-2314-51/S-2314-51-manual.pdf.
[18] C. Hanson, “MicroUnity’s MediaProcessor Architecture,” IEEE Micro, vol. 16, no. 4, pp.
34–41, Aug. 1996.
[19] J. Burke, J. McDonald, and T. M. Austin, “Architectural Support for Fast Symmetric-Key
Cryptography,” in Architectural Support for Programming Languages and Operating Sys-
tems, ASPLOS 2000, pp. 178–189, Nov. 2000.
[20] R. B. Lee, Z. Shi, and X. Yang, “Efficient Permutation Instructions for Fast Software Cryp-
tography,” IEEE Micro, vol. 21, no. 6, pp. 56–69, Dec. 2001.
[21] Z. Shi and R. B. Lee, “Bit Permutation Instructions for Accelerating Software Cryptogra-
phy,” in Proceedings of the IEEE International Conference on Application-Specific Systems,
Architectures, and Processors (ASAP 2000), pp. 138–148, July 2000.
[22] X. Yang, M. Vachharajani, and R. B. Lee, “Fast Subword Permutation Instructions Based on
Butterfly Networks,” in Proceedings of Media Processors IS&T/SPIE Symposium on Electric
Imaging: Science and Technology, pp. 80–86, Jan. 2000.
[23] X. Yang and R. B. Lee, “Fast Subword Permutation Instructions Using Omega and Flip Net-
work Stages,” in Proceedings of the International Conference on Computer Design (ICCD
2000), pp. 15–22, Sept. 2000.
[24] R. B. Lee, Z. Shi, and X. Yang, “How a Processor can Permute n bits in O(1) cycles,” in
Proceedings of Hot Chips 14, Aug. 2002.
[25] Z. Shi, X. Yang, and R. B. Lee, “Arbitrary Bit Permutations in One or Two Cycles,” in
Proceedings of the IEEE International Conference on Application-Specific Systems, Archi-
tectures and Processors (ASAP), pp. 237–247, June 2003.
[26] R. B. Lee, X. Yang, and Z. J. Shi, “Single-Cycle Bit Permutations with MOMR Execution,”
Journal of Computer Science and Technology, vol. 20, no. 5, pp. 577–585, Sept. 2005.
[27] A. A. Moldovyan, N. A. Moldovyan, and P. A. Moldovyanu, “Architecture Types of the Bit
Permutation Instruction For General Purpose Processors,” Springer LNG&G, vol. 14, pp.
147–159, 2007.
[28] Y. Hilewitz, Z. J. Shi, and R. B. Lee, “Comparing Fast Implementations of Bit Permutation
Instructions,” in Proceedings of the 38th Annual Asilomar Conference on Signals, Systems,
and Computers, pp. 1856–1863, Nov. 2004.
[29] R. B. Lee and Y. Hilewitz, “Fast Pattern Matching with Parallel Extract Instructions,” Prince-
ton University Department of Electrical Engineering Technical Report CE-L2005-002, Feb.
2005.
[30] Y. Hilewitz and R. B. Lee, “Fast Bit Compression and Expansion with Parallel Extract and
Parallel Deposit Instructions,” in Proceedings of the IEEE 17th International Conference on
Application-Specific Systems, Architecture and Processors (ASAP 2006), pp. 65–72, Sept.
2006.
[31] Y. Hilewitz and R. B. Lee, “Fast Bit Gather, Bit Scatter and Bit Permutation Instructions for
Commodity Microprocessors,” Journal of Signal Processing Systems, 2008.
[32] Y. Hilewitz and R. B. Lee, “Performing Advanced Bit Manipulations Efficiently in General-
Purpose Processors,” in Proceedings of the 18th IEEE Symposium on Computer Arithmetic
(ARITH 18), pp. 251–260, June 2007.
[33] Y. Hilewitz and R. B. Lee, “A New Basis for Shifters in General-Purpose Processors for
Existing and Advanced Bit Manipulations,” IEEE Transactions on Computers, 2008.
[34] Y. Hilewitz and R. B. Lee, “Achieving Very Fast Bit Matrix Multiplication in Commodity Mi-
croprocessors,” Princeton University Department of Electrical Engineering Technical Report
CE-L2007-006, Aug. 2007.
[35] Y. Hilewitz, C. Lauradoux, and R. B. Lee, “Fast Bit Matrix Multiplication in Commodity Mi-
croprocessors,” Princeton University Department of Electrical Engineering Technical Report
CE-L2007-011, Nov. 2007.
[36] Y. Hilewitz, C. Lauradoux, and R. B. Lee, “Bit Matrix Multiplication in Commodity Micro-
processors,” in Proceedings of the IEEE International Conference on Application-Specific
Systems, Architectures and Processors, July 2008.
[37] Intel Corporation, Intel Advanced Vector Extensions Programming Reference, Mar. 2008.
[38] Advanced Micro Devices, “AMD64 Technology: AMD64 Architecture Programmers Man-
ual,” Sept. 2007.
[39] Advanced Micro Devices, “AMD64 Technology: 128-Bit SSE5 Instruction Set,” Aug. 2007.
[40] R. B. Lee, “Efficiency of MicroSIMD Architectures and Index-Mapped Data for Media Pro-
cessors,” in Proceedings of Media Processors 1999 IS&T/SPIE Symposium on Electric Imag-
ing: Science and Technology, pp. 34–46, Jan. 1999.
[41] R. B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Process-
ing in MicroSIMD Architectures,” in Proceedings of the IEEE International Conference
on Application-Specific Systems, Architectures and Processors (ASAP 2000), pp. 3–14, July
2000.
[42] J. P. McGregor and R. B. Lee, “Architectural Enhancements for Fast Subword Permutations
with Repetitions in Cryptographic Applications,” in Proceedings of the International Con-
ference on Computer Design (ICCD 2001), pp. 453–461, Sept. 2001.
[43] J. P. McGregor and R. B. Lee, “Architectural Techniques for Accelerating Subword Permuta-
tions with Repetitions,” IEEE Transactions on Very Large Scale Integration Systems, vol. 11,
no. 3, pp. 325–335, June 2003.
[44] R. B. Lee, “Precision Architecture,” IEEE Computer, vol. 22, no. 1, pp. 78–91, Jan. 1989.
[45] R. B. Lee, M. Mahon, and D. Morris, “Pathlength Reduction Features in the PA-RISC Ar-
chitecture,” in Proceedings of IEEE Compcon, pp. 129–135, Feb. 1992.
[46] IBM Corporation, PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension
Technology Programming Environments Manual, Rev. 2.06, Aug. 2005.
[47] Z. Shi and R. B. Lee, “Subword Sorting with Versatile Permutation Instructions,” in Proceed-
ings of the International Conference on Computer Design (ICCD 2002), pp. 234–241, Sept.
2002.
[48] National Bureau of Standards, “Data Encryption Standard (DES),” Federal Information Pro-
cessing Standards Publication 46-2, Dec. 1993.
[49] National Institute of Standards and Technology, “Advanced Encryption Standard (AES),”
Federal Information Processing Standards Publication 197, Nov. 2000.
[50] C. Burwick et al., “MARS: a Candidate Cipher for AES,” NIST AES proposal, June 1998.
[51] R. L. Rivest, M. J. B. Robshaw, R. Sidney, and Y. L. Yin, “The RC6 Block Cipher,”
NIST AES proposal, Aug. 1998.
[52] R. Anderson, E. Biham, and L. Knudsen, “Serpent: a Proposal for the Advanced Encryp-
tion Standard,” NIST AES proposal, June 1998.
[53] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, and N. Ferguson,
“Twofish: a 128-bit Block Cipher,” NIST AES proposal, June 1998.
[54] Princeton Architecture Laboratory for Multimedia and Security (PALMS),
http://palms.ee.princeton.edu.
[55] R. B. Lee, R. L. Rivest, M. J. B. Robshaw, Z. J. Shi, and Y. L. Yin, “On Permutation Op-
erations in Cipher Design,” in Proceedings of the International Conference on Information
Technology (ITCC), vol. 2, pp. 569–577, Apr 2004.
[56] Z. J. Shi and R. B. Lee, “Implementation Complexity of Bit Permutation Instructions,” in
Proceedings of the 37th Annual Asilomar Conference on Signals, Systems, and Computers,
pp. 879–886, Nov. 2003.
[57] V. E. Benes, “Optimal rearrangeable multistage connecting networks,” Bell System Technical
Journal, vol. 43, no. 4, pp. 1641–1656, July 1964.
[58] Cray Inc., “Man Page Collection: Bioinformatics Library Procedures,” 2004,
http://www.cray.com/craydoc/manuals/S-2397-21/S-2397-21.pdf.
[59] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and
W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, vol. 13, no. 1,
pp. 103–107, Jan. 2003.
[60] National Center for Biotechnology Information, “Translating Basic Local Alignment Search
Tool (BLASTX),” http://www.ncbi.nlm.nih.gov/blast/.
[61] A. M. Fiskiran and R. B. Lee, “Fast Parallel Table Lookups to Accelerate Symmetric-Key
Cryptography,” in Proceedings of the International Conference on Information Technology
Coding and Computing (ITCC), Embedded Cryptographic Systems Track, pp. 526–531, Apr
2005.
[62] A. M. Fiskiran and R. B. Lee, “On-Chip Lookup Tables for Fast Symmetric-Key Encryp-
tion,” in Proceedings of the IEEE International Conference on Application-Specific Systems,
Architectures and Processors (ASAP), pp. 356–363, July 2005.
[63] W. Josephson, R. Lee, and K. Li, “ISA Support for Fingerprinting and Erasure Codes,” in
Proceedings of the IEEE International Conference on Application-Specific Systems, Archi-
tectures and Processors (ASAP), pp. 415–422, July 2007.
[64] M. Matsui, “Linear Cryptanalysis Method for DES Cipher,” in Advances in Cryptology -
EUROCRYPT ’93, Lecture Notes in Computer Science 765, pp. 386–397. Springer Verlag,
1994.
[65] J. Daemen and V. Rijmen, “The Wide Trail Design Strategy,” in Proceedings of the 8th IMA
International Conference on Cryptography and Coding, Lecture Notes in Computer Science
2260, pp. 222–238, 2001.
[66] Y. L. Yin, Private Correspondence.
[67] Y. Lin, H. Lee, M. Woh, Y. Harel, S. A. Mahlke, T. N. Mudge, C. Chakrabarti, and K. Flaut-
ner, “SODA: A High-Performance DSP Architecture for Software-Defined Radio,” IEEE
Micro, vol. 27, no. 1, pp. 114–123, 2007.
[68] E. Blossom, “The GNU Software Defined Radio,”
http://www.gnu.org/software/gnuradio/gnuradio.html.
[69] S. Mamidi, E. R. Blem, M. J. Schulte, C. J. Glossner, D. Iancu, A. Iancu, M. Moudgill, and
S. Jinturkar, “Instruction Set Extensions for Software Defined Radio on a Multithreaded Pro-
cessor,” in International Conference on Compilers, Architecture, and Synthesis for Embedded
Systems - CASES 2005, pp. 266–273, 2005.
[70] B. Schneier, Applied Cryptography. John Wiley & Sons, 1996.
[71] D. Coppersmith, H. Krawczyk, and Y. Mansour, “The Shrinking Generator,” in Advances in
Cryptology - CRYPTO ’93, Lecture Notes in Computer Science 773, pp. 22–39, 1993.
[72] W. Meier and O. Staffelbach, “The self-shrinking generator,” in Advances in Cryptology -
EUROCRYPT ’94, Lecture Notes in Computer Science 950, pp. 205–214, 1994.
[73] M. Kanemasu, “Golay Codes,” MIT Undergraduate Journal of Mathematics, no. 1, June
1999.
[74] “Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specification,”
IEEE Std. 802.11g-2003.
[75] E. E. Swartzlander Jr., “A Review of Large Parallel Counter Designs,” in IEEE Symposium
on VLSI, pp. 89–98, Feb. 2004.
[76] T. Han and D. Carlson, “Fast area-efficient VLSI adders,” in Proceedings of the 8th Sympo-
sium on Computer Arithmetic, pp. 49–55, May 1987.
[77] Taiwan Semiconductor Manufacturing Corporation, “TCBN90GTHP: TSMC 90nm Core Li-
brary Databook, version 1.1,” 2006.
[78] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hyper-
cubes. San Francisco: Morgan Kaufmann Publishers, 1992.
[79] M. R. Pillmeier, M. J. Schulte, and E. G. Walters III, “Design Alternatives for Barrel
Shifters,” in Advanced Signal Processing Algorithms, Architectures, and Implementations
XII, Proceedings of the SPIE, vol. 4791, pp. 436–447, Dec. 2002.
[80] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San
Francisco: Morgan Kaufmann Publishers, 1999.
[81] R. B. Lee, M. Fiskiran, M. Wang, Y. Hilewitz, and Y.-Y. Chen, “PAX: A Cryptographic
Processor with Parallel Table Lookup and Wordsize Scalability,” Princeton University De-
partment of Electrical Engineering Technical Report CE-L2007-010, Nov. 2007.
[82] W. Lee, G. J. Geissler, S. J. Johnson, and A. J. Schiffleger, “Vector Bit-Matrix Multiply
Function Unit,” US Patent 5,170,370, 1996.
[83] D. E. Knuth, The Art of Computer Programming, 2nd ed. Addison-Wesley, 1978.
[84] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware-Software
Approach. Morgan Kaufmann Publishers, 1998.
[85] Y. Hilewitz, Y. L. Yin, and R. B. Lee, “Accelerating the Whirlpool Hash Function using
Parallel Table Lookup and Fast Cyclical Permutation,” in Proceedings of 15th Fast Software
Encryption Workshop (FSE), pp. 173–188, Feb. 2008.
[86] Cray Inc., “CRAY/MTA Principles of Operation,” 2005.
[87] M. Wei, M. Snir, J. Torrellas, and R. B. Tremaine, “A Near-Memory Processor for Vector,
Streaming and Bit Manipulation Workloads,” in Watson Conference on Interaction between
Architecture, Circuits, and Compilers (P=AC2), Sept. 2005.
[88] C. Lemuet, J. Sampson, J.-F. Collard, and N. P. Jouppi, “The Potential Energy Efficiency of
Vector Acceleration,” in Proceedings of Supercomputing 2006 (SC06), Nov. 2006.
[89] The Mathworks, Inc., “Image processing toolbox user’s guide,”
http://www.mathworks.com/access/helpdesk/help/toolbox/images/images.html.
[90] E. Franz, A. Jerichow, S. Moller, A. Pfitzmann, and I. Stierand, “Computer Based Steganog-
raphy,” Information Hiding, Springer Lecture Notes in Computer Science, vol. 1174, pp.
7–21, 1996.
[91] “Uuencode,” Wikipedia: The Free Encyclopedia, http://en.wikipedia.org/wiki/Uuencode.
[92] F. Scholer, H. Williams, J. Yiannis, and J. Zobel, “Compression of Inverted Indexes For Fast
Query Evaluation,” in Proceedings of the 25th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 222–229, 2002.
[93] B. Jun and P. Kocher, “The Intel Random Number Generator,” Cryptography Research Inc.,
1999.
[94] B. Barak, R. Shaltiel, and E. Tromer, “True Random Number Generators Secure in a Chang-
ing Environment,” in Cryptographic Hardware and Embedded Systems (CHES), Lecture
Notes in Computer Science 2779, pp. 166–180. Springer Verlag, Sept. 2003.
[95] N. A. Moldovyan, P. A. Moldovyanu, and D. H. Summerville, “On Software Implementation
of Fast DDP-based Ciphers,” International Journal of Network Security, vol. 4, no. 1, pp.
81–89, Jan. 2007.
[96] NIST, “Hash Function Main Page,” http://www.nist.gov/hash-competition.
[97] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography.
CRC Press, 1996.
[98] N. Courtois and W. Meier, “Algebraic Attacks on Stream Ciphers with Linear Feedback,” in
Advances in Cryptology - EUROCRYPT 2003, Lecture Notes in Computer Science 2656, pp.
345–359. Springer Verlag, 2003.
[99] Defense Advanced Research Projects Agency, “High productivity computing systems dis-
crete mathematics benchmarks,” 2003.
[100] T. M. Austin, E. Larson, and D. Ernst, “SimpleScalar: An Infrastructure for Computer
System Modeling,” IEEE Computer, vol. 35, no. 2, pp. 59–67, 2002.