

Nonconventional Computer Arithmetic Circuits, Systems and Applications

Leonel Sousa, Senior Member, IEEE

Abstract—Arithmetic plays a major role in a computer’s performance and efficiency. Building new computing platforms supported by traditional binary arithmetic and silicon-based technologies to meet the requirements of today’s applications is becoming increasingly challenging, regardless of whether we consider embedded devices or high-performance computers. As a result, a significant amount of research effort has been devoted to the study of nonconventional number systems to investigate more efficient arithmetic circuits and improved computer technologies to facilitate the development of computational units that can meet the requirements of applications in emergent domains. This paper presents an overview of the state of the art in nonconventional computer arithmetic. Several different alternative computing models and emerging technologies are analyzed, such as nanotechnologies, superconductor devices, and biological- and quantum-based computing, and their applications to multiple domains are discussed. A comprehensive approach is followed in a survey of the logarithmic and residue number systems, the hyperdimensional and stochastic computation models, the arithmetic for quantum- and DNA-based computing systems, and techniques for approximate computing. Technologies, processors and systems addressing these nonconventional computer arithmetic systems are also reviewed, taking into consideration some of the most prominent applications, such as deep learning or postquantum cryptography. In the end, some conclusions are drawn, and directions for future research on nonconventional computer arithmetic are discussed.

I. INTRODUCTION

The demise of Moore’s Law, which stated that the number of transistors on silicon chips would double every two years, and the waning of Dennard scaling, under which that increase in transistor density came at constant power consumption [1], mark the end of an era in which computational capacity growth was mainly based on the downscaling of silicon-based technology. At the same time, the demand for data processing is higher than ever before as a result of the more than 2.5 quintillion bytes that are created on a daily basis [2]. This number continues to grow exponentially, and therefore, ingenious solutions must be developed to address the limitations of traditional computational systems. These innovative solutions must include developments not only at the technological level but also at the arithmetic, architectural and algorithmic levels of computing.

Although binary arithmetic has been successfully used to design silicon-based computing systems, the positional nature of this representation imposes the processing of carry chains, which precludes the exploitation of parallelism at the most basic levels and leads to high power consumption. Hence, the research on unconventional number systems is of the utmost interest to explore parallelism and take advantage of the characteristics of emerging technologies to improve both the performance and the energy efficiency of computational systems. Moreover, by avoiding the dependencies of binary systems, nonconventional number systems can also support the design of reliable computing systems using the newest available technologies, such as nanotechnologies.

Leonel Sousa is with the Department of Electrical and Computer Engineering, INESC-ID, Instituto Superior Técnico, Universidade de Lisboa; Address: Rua Alves Redol, 9, 1000-029 Lisboa, PORTUGAL; e-mail: [email protected].

A. Motivation

The Complementary Metal-Oxide Semiconductor (CMOS) transistor was invented over fifty years ago and has played a key role in the development of modern electronic devices and all that they have enabled. The CMOS transistor has evolved into nanodevices, with characteristic dimensions of less than 100 nm. The downscaling of the gate length has become one of the biggest challenges hindering progression in each new generation of CMOS transistors and integrated circuits. New device architectures and materials have been proposed to address this challenge, namely, the Fin Field-Effect Transistor (FinFET) multigate devices [5]. This type of nonplanar transistor became the dominant gate design from the 14 nm/10 nm generations onward, used in a broad range of applications, from consumer devices to embedded systems and high-performance computing [6].

Fig. 1: Emerging technologies and CMOS speed, size, cost, and switching energy (scales are logarithmic) [7]

Fig. 1 plots the cost, speed, size, and energy per operation relationship for the CMOS and other emerging nanotechnologies; all the scales are logarithmic, covering many orders of magnitude. Examples of these technologies include superconducting electronics, molecular electronics, resonant tunneling devices, quantum cellular automata, and optical switches.


Superconducting digital logic circuits use Single-Flux Quantum (SFQ), also known as magnetic flux quanta, to encode and process data. A Rapid Single Flux Quantum (RSFQ) device is a superconducting ring with a Josephson junction that opens and closes to admit or expel a flux quantum. The Adiabatic Quantum-Flux-Parametron (AQFP), another SFQ-based device not represented in the figure, with characteristics that are analyzed later in this survey, drastically reduces the energy dissipation by using Alternating Current (AC) bias/excitation currents for both the clock signal and the power supply. This is a high-speed nanotechnology that offers low power dissipation and scalability. However, its main drawbacks include the need for expensive cooling technologies and for improved techniques for manufacturing the elements. Molecular electronics uses a single molecule as the switch. The configuration of the molecule determines its state, thereby “switching” on or off the current flow through the molecule. While conceptually appealing, the molecular process exhibits poor self-assembly techniques and low reproducibility. In Quantum Cellular Automata (QCA), cells are switched on and off by moving the position of the electrons from one cell to another. Classical physics does not apply to these technologies, which support, for example, quantum computing. While it is challenging to build general purpose computing engines based on these technologies, they are much faster on certain specific tasks. A last example is integrated optical technology, which is based on small optical cavities that change their properties to either store or emit photons – a zero or a one. It is a challenging approach, both from the fan-out and the cost points of view. As will be seen, computing-in-the-network circuits can be designed with integrated photonics switches.

Fig. 2: Emerging research information processing devices, highlighting the nonconventional data representation (adapted from [9])

These emergent technologies will become more and more relevant as the end of the CMOS era approaches [8]. Nevertheless, for the time being, there is no technology better than CMOS, as it provides a good balance between energy consumption, cost, speed and scalability, as shown in Fig. 1. Yet, better solutions can be identified in Fig. 1 along every individual axis. However, to design and implement processing systems based on these new technologies and devices, different solutions have to be found at various levels: from the circuits to the data representation and architectures, as depicted in Fig. 2. For the conventional scaled CMOS technology, the layers have been quite well established, and the characteristics were successfully explored from the material layer, silicon, up to the topmost layer, multicore architectures. Moreover, the conventional binary system has been dominant at the data representation layer. However, when considering emergent technologies or new architectures for designing these systems, data representation must also be reconsidered to properly adapt each system to the features of both the bottom and top layers. Data representation and the related computer arithmetic are the motivation behind this survey, which looks beyond the CMOS technology and conventional systems.

B. Structure of the survey

This survey analyzes the confluence of emergent technologies, nonconventional computer arithmetic, innovative computing paradigms and new applications. Similar to a jigsaw puzzle, although these pieces exist by themselves, it is only by connecting them in the right way that the whole picture can be derived and, in this particular case, the way to design efficient systems can be determined. Fig. 3 depicts the tight interconnections and dependencies of all these aspects in a matrix representation. In addition, this figure shows how these very important topics are addressed in this paper, highlighting the fact that nonconventional arithmetic exposes the characteristics of the underlying technology to assist in the development of efficient and reliable computing systems useful for different applications.

Fig. 3: Structure of the survey: sections and topics – Section II: Computer Arithmetic and Architectures [10–77] (RNS, LNS, stochastic and hyperdimensional representations; European Logarithmic Microprocessor, residual arithmetic nanophotonic computer, StoRM, HDC processor); Section III: Nanotechnologies [78–119] (memristors/hybrid memories, CMOS invertible circuits, superconductors (AQFP), DNA/molecular biology, integrated photonics, PIM, approximate computing, quantum computing, DNA computing); Section IV: Applications [120–158] (machine learning, quantum-resistant cryptography); cross-cutting concerns: high performance, reliability, low power, noise robustness

Section II introduces the nonconventional number systems, their main characteristics and the associated computer arithmetic. The Logarithmic Number System (LNS) and the Residue Number System (RNS) shorten the carry chain underlying binary arithmetic. LNS compresses the number representation significantly, mitigating the problem of carry chains, and lowers the complexity of the multiplication, division, and square-root arithmetic operations [11]. RNS [4] is a type of nonpositional number system in which integers are represented as a set of smaller residues that can be processed independently and in parallel. Stochastic Computing (SC) and Hyper-Dimensional Computing (HDC) introduce randomness in the number representation in exchange for robustness. SC represents the probability of a bit taking the value ‘1’ or ‘0’ in a bitstream, regardless of its position [56], processing data bitwise with simple circuits. HDC operates on very large binary vectors, assuming that information is represented in a highly structured way using hyperdimensional vectors, as in the human brain, where details are not very important for data processing [70].

Section III is devoted to the analysis of new technologies and computing paradigms. Nanotechnologies, such as spin transistors, superconducting electronics, molecular electronics, and resonant tunneling devices, have emerged as alternative technologies to CMOS [1], while Deoxyribonucleic Acid (DNA)-based computing [163] and Quantum Computing (QC) [155] are new computing paradigms. It is shown that nonconventional arithmetic is fundamental not only in the design of efficient computer systems based on these new paradigms and technologies but also in the mitigation of some of their intrinsic disadvantages. For example, it is shown that the RNS is useful in designing energy-efficient integrated photonics-based computational units and in overcoming the negative effects caused by the instability of the biochemical reactions and the hybridization errors in DNA computing [113], as highlighted in Fig. 3. Furthermore, it is discussed how SC can be applied to mitigate the deep pipelining nature of the AQFP logic devices, while HDC may support computing nanosystems through the heterogeneous integration of multiple emerging nanotechnologies [106].

In Section IV, two classes of applications are considered as examples of use cases that highly benefit from nonconventional arithmetic: machine learning and postquantum cryptography. While quantum-resistant cryptography makes use of RNS and SC, as shown in Fig. 3, machine learning applications may take advantage of all the classes of nonconventional arithmetic considered in this paper. For example, the Deep Learning (DL) framework that adopts AQFP to achieve high energy efficiency with superconductivity is an example of applying SC to machine learning on emerging technologies [79].

The goal of this paper is to provide a survey on nonconventional computer arithmetic circuits and systems, describing their main characteristics, mathematical tools and algorithms, and to discuss architectures and technologies for their implementation. We follow a tutorial approach on how to use nonconventional arithmetic in emergent applications and systems, and in the end, we draw some conclusions and ideas for future research work. We believe that this work can be useful for engineers and researchers in computer arithmetic, in particular as a source of inspiration for doctoral students. We also provide more than one hundred and seventy references to the most important works related to nonconventional computer arithmetic, distributed according to the main topics and sections presented in Fig. 3.

II. NONCONVENTIONAL NUMBER SYSTEMS: ARITHMETIC AND ARCHITECTURES

This section introduces the LNS, RNS, SC and HDC nonconventional representations and the associated arithmetic properties. These properties are then leveraged in the design of arithmetic units found in many processors.

A. Logarithmic number systems

The LNS has been proposed for both the fixed-point [10] and floating-point [11] formats. In the standardized IEEE 754 single-precision Floating Point (FP) format, a number P may assume the normalized absolute values in Table I. For a 24-bit mantissa, if the hidden bit is counted, the effective resolution is between one part in 2^23 and one part in 2^24, approximately 10^−7. Note that the boundary values for the exponent (0 and 255) are reserved for special cases.

TABLE I: IEEE 754 single-precision FP and LNS, 32-bit formats

FP:  S_FP (sign bit) | E_FP (exponent): 8 bits | F_FP (mantissa): 23 bits
LNS: S_LNS (sign bit) | IT_LNS (integer): 8 bits | F_LNS (fractional): 23 bits

P = (−1)^{S_FP} × 1.F_FP × 2^{E_FP − 127}
p = (−1)^{S_LNS} × 2^{IT_LNS.F_LNS}

Absolute values:  minimum / maximum
P (FP):  2^{−126} ≈ 1.2 × 10^{−38} / 2^{+128} ≈ 3.4 × 10^{+38}
p (LNS): 2^{−128} ≈ 2.9 × 10^{−39} / 2^{+128} ≈ 3.4 × 10^{+38}

FP is a “semi-logarithmic” representation with a fixed-point component – the significand or mantissa – that is scaled by a factor – the exponent – represented by its logarithmic value in base 2. In comparison, the same number P is represented in LNS by the 3-tuple (b, s, p), where b is the adopted logarithm base (without loss of generality, we consider the typical case b = 2), the s bit denotes the sign, and p = log₂|P|, a fixed-point value with integer and fractional parts. This fixed-point format provides a range of numbers similar to that of FP: the range is defined by the IT bits, while the width of F controls the precision. The logarithmic representation may be beneficial to applications with data and/or parameters that follow nonuniform distributions, such as Convolutional Neural Networks (CNNs), as described in Section IV-A.

With LNS, operations of higher complexity, such as multiplication, division, squaring and square root, are easily performed by applying fixed-point addition, subtraction, and left/right bit-shifts, respectively. Considering unsigned operands P and Q, represented in LNS as p = log₂P and q = log₂Q, and BitShift(x, z), regarded as the value of x shifted left by z bit positions (right for negative z) in fixed-point arithmetic, these operations are formulated in (1).

log₂(P × Q) = p + q ;
log₂(P/Q) = p − q ;
log₂(P²) = 2 × p = BitShift(p, 1) ;
log₂(√P) = (1/2) × p = BitShift(p, −1) .   (1)
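As a minimal illustration of (1), the following Python sketch (ours, not code from the paper) emulates LNS arithmetic in floating point; a hardware AU would hold p in fixed point and realize the scalings by two as bit shifts:

import math

def to_lns(P: float) -> float:
    """Encode an unsigned value P > 0 as p = log2(P); the sign is handled separately."""
    return math.log2(P)

def from_lns(p: float) -> float:
    """Decode an LNS value back to the linear domain."""
    return 2.0 ** p

P, Q = 12.5, 3.0
p, q = to_lns(P), to_lns(Q)

print(from_lns(p + q))   # P * Q   -> 37.5    (addition replaces multiplication)
print(from_lns(p - q))   # P / Q   -> ~4.1667 (subtraction replaces division)
print(from_lns(2 * p))   # P^2     -> 156.25  (BitShift(p, 1) in fixed point)
print(from_lns(p / 2))   # sqrt(P) -> ~3.5355 (BitShift(p, -1) in fixed point)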

This simple logarithmic arithmetic comes at the cost of more complex additions and subtractions, which map to nonlinear operations. For two unsigned numbers P and Q (the sign can be handled in parallel), addition and subtraction can be calculated using Lemma 1.

Lemma 1. For (P, Q) ∈ ℕ, and with Max as the function that returns the maximum value of a tuple of numbers, the following formulation can be adopted to compute addition and subtraction in the LNS domain (log₂(P ± Q)):

log₂(|P ± Q|) = Max(p, q) + log₂(1 ± 2^{−|p−q|})   (2)


Proof. For P ≥ Q, addition and subtraction can be calculated using (3):

log₂(P ± Q) = log₂(P(1 ± Q/P)) = p + log₂(1 ± 2^{q−p})   (3)

while for P < Q:

log₂(Q ± P) = log₂(Q(1 ± P/Q)) = q + log₂(1 ± 2^{p−q})   (4)

By combining (3) and (4):

log₂(|P ± Q|) = Max(p, q) + log₂(1 ± 2^{−|p−q|})   (5)

and Lemma 1 is proved. ∎

Note that the addition and subtraction operations employ the Gaussian logarithms represented in (6) and in Fig. 4.

G = log₂(1 ± 2^λ) ,  λ = −|q − p|   (6)
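The sketch below (ours, with illustrative names) evaluates (2)/(6) directly; a real AU would obtain the Gaussian logarithm G from lookup tables or interpolators rather than from a call to log2:

import math

def lns_add_sub(p: float, q: float, subtract: bool = False) -> float:
    """Return log2(|P ± Q|) given p = log2(P) and q = log2(Q)."""
    lam = -abs(p - q)                     # lambda = -|q - p| <= 0
    if subtract:
        g = math.log2(1.0 - 2.0 ** lam)   # singular as lam -> 0 (critical region)
    else:
        g = math.log2(1.0 + 2.0 ** lam)
    return max(p, q) + g

p, q = math.log2(6.0), math.log2(2.0)
print(2 ** lns_add_sub(p, q))             # ~8.0 (6 + 2)
print(2 ** lns_add_sub(p, q, True))       # ~4.0 (6 - 2)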

For a modest number of bits, typically up to 20, the functions in Fig. 4 can be implemented through lookup tables [13]. However, given that the table size increases exponentially with the word length, table-lookup-and-addition methods can be adopted instead to reduce the burden of storing large tables in memory, such as bipartite tables and multipartite methods [14].

For a large number of bits, for example, 32 bits (Table I) or even 64 bits, a piecewise polynomial approximation [15] or the digit-serial methods [17], [18] can be used to implement LNS addition/subtraction. Digit-serial methods, also known as iterative methods, calculate the result digit by digit. Similarly to the COordinate Rotation DIgital Computer (CORDIC) method [16], an iterative algorithm can approximate the exponential and logarithmic computation with sequences of digit-serial computations [17]. Signed-digit arithmetic was proposed to simplify the exponential computation and reduce the number of pipeline stages required for LNS addition/subtraction [18]. Although the digit-serial methods require a low circuit area, they exhibit high latency.

The alternative piecewise polynomial approximation methods provide a trade-off between performance and cost. Linear Taylor interpolation has been used to implement 20-bit and 32-bit LNS Arithmetic Units (AUs) by applying a table-based error correction scheme to achieve an accuracy similar to that of their FP counterparts [19]. Although a first-order interpolation is used, the error correction step requires a multiplier, making the complexity similar to that of a second-order interpolator (as shown in Fig. 5 for an LNS AU, to be analyzed later in this section). Both Lagrange interpolation [164] and Chebyshev polynomial approximation [20] (second-order approximations) have been used to compute the Gaussian logarithms. Second-order minimax polynomial approximation has also been applied to compute the addition/subtraction in LNS [21]. Higher accuracy than what can be found in standard FP arithmetic is achievable [20].

Fig. 4: Plot of the functions log₂(1 + 2^λ) and log₂(1 − 2^λ) (the latter has a singularity when λ → 0)

The approaches referred to above for subtraction in the LNS domain work well, except in the critical region (λ ∈ [−1, 0]), due to the singularity of log₂(1 − 2^λ) when λ → 0 (see Fig. 4 between λ = −1 and λ = 0). Mathematical transformations, usually known as cotransformations, are used in this region. Using alternative functions defined in adjustable subdomains is one of these transformations. An example of a cotransformation is the replacement of the computation of log₂(1 − 2^λ) with two successive subtractions [22], as expressed in (7). In this equation, 2^{k1} is individually chosen for each value, such that the index λ[1] for the inner subtraction falls on the nearest modulo-Δ[1] boundary beneath λ; Δ[1] is fixed at a large value, with Δ representing the interval at which table entries are stored. The value of the inner subtraction for λ[1] can therefore be obtained directly from a lookup table that covers the range −1 < λ < −Δ[1] at modulo-Δ[1] intervals. Since −Δ[1] ≤ k1 < 0, 2^{k1} ≈ 1, and the magnitude of the index for the outer subtraction satisfies k2 < −1, the computation is shifted away from the critical region [22].

2^p − 2^q = (2^p − 2^{q+k1}) − 2^{q+k2} ,
2^{k1} + 2^{k2} = 1 ⇒ k2 = log₂(1 − 2^{k1})   (7)
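The identity in (7) can be checked numerically; in the sketch below (ours), k1 is an arbitrary test value, whereas hardware selects it per operand from the stored interval boundaries:

import math

p, q = 4.02, 4.00            # p - q close to 0: log2(1 - 2**lam) is ill-conditioned
k1 = -0.125                  # chosen per value in hardware; arbitrary here
k2 = math.log2(1.0 - 2.0 ** k1)          # ~ -3.59, so |k2| > 1

direct = 2.0 ** p - 2.0 ** q
cotran = (2.0 ** p - 2.0 ** (q + k1)) - 2.0 ** (q + k2)
print(direct, cotran)        # both ~0.223; the identity holds, and both
                             # subtractions now fall outside the critical region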

A different type of cotransformation decomposes the computation of G = log₂(1 ± 2^λ) according to (8). The implementation in [23] approximates the cotransformation log₂((1 − 2^λ)/(1 + 2^λ)) in a limited range [1, 2).

log₂(1 − 2^λ) = log₂((1 − 2^λ)/(1 + 2^λ)) + log₂(1 + 2^λ)   (8)

LNS AUs have been used in the design of configurable and general purpose processors [21], [23]–[26].

The European Logarithmic Microprocessor (ELM) is a 32-bit scalar microprocessor with a fixed-point LNS-based AU (b = 2), with numbers represented by a sign bit and a fractional component of 23 bits [26]. This processor implements addition and subtraction by splitting the range of λ in (6) into segments for every increasing power of 2. In turn, these segments are split into smaller intervals, and the function F = log₂(1 ± 2^λ) and its derivative (D) are stored in memory for each of those intervals, as depicted in Fig. 5, with the values of intermediate points estimated with a Taylor interpolation. To calculate the error for each interval, the maximum error E is tabulated alongside F and D; this error corresponds to the case in which a point falls near the next stored value tuple. A separate table of proportions (P) stores the normalized shape of the common error curve. Each proportion value will be 0, when the required value is at the beginning of a segment, or up to 1, when it is estimated by the interpolation. The error is calculated by adding the value E × P to the result of the interpolation. In Fig. 5, a range shifter is inserted immediately after the selection of p for the calculation of λ. If λ falls close to zero in the subtraction operation, p and λ will be transformed into new values, and λ will be moved to the linear region by applying the cotransformation (7) [15].

Fig. 5: LNS addition and subtraction unit

The ELM adopts a register-memory addressing mode Instruction Set Architecture (ISA), with a single-level cache integrated into the pipeline. The architecture includes 16 general-purpose registers, an 8 kbyte L1 data cache, two adders/subtractors operating in three clock cycles, and four combined multiplier/divider/integer units operating in one clock cycle. Vector operations are implemented in the ELM by using four identical functional units in parallel, requiring a data cache organization in which four consecutive words are read out simultaneously. The ELM is designed and fabricated in 0.18 µm technology [26]. Its performance and accuracy, running at 125 MHz, are evaluated against the contemporary TMS320C6711 Digital Signal Processor (DSP) [27], available in the same technology and running at 150 MHz. The TMS320C6711 is a Very Long Instruction Word (VLIW) processor that is able to issue eight independent instructions every cycle, with two pipelined FP adders and two pipelined FP multipliers with a latency of four cycles, and two integer units with a latency of one cycle. It has been shown that the ELM is marginally more accurate than the FP processor for additive operators, while multiplicative operators return exact results. For a typical kernel, making equal use of each one of the operators, LNS will incur roughly half the total error of FP arithmetic. Moreover, notwithstanding the ELM running at 5/6 of the clock speed of the FP processor, the additive times were marginally better, the multiplications were 3.4 times faster, and the divisions and square roots were many times faster in the former than in the latter processor. Although LNS enables the design of faster and, typically, more accurate processors than those of FP arithmetic (the LNS implementation tends to have smaller worst-case relative errors than those of FP [28]), general purpose processors currently still adopt the standardized FP, which also easily interfaces with integer binary arithmetic.

Another LNS-based AU that adopts the hardware-efficient cotransformation (8) for subtraction in the critical region, computing the second term of this equation with a second-order polynomial, was proposed in [23]. The architecture of the LNS-based AU consists of four main blocks: i) the pre- and postprocessing blocks, including the preparation steps of the operations and the combination and/or selection of the results produced by the two main interpolation blocks; ii) the main interpolator block that implements the approximations for computing log₂(1 + 2^λ) and exponentiation on the complete range (−25, 0], and log₂(1 − 2^λ) outside the critical region (−25, −0.25], using second-order piecewise polynomial approximations; and iii) the critical region block used to calculate the logarithm of the cotransformation function (8) for the critical region (−0.25, 0] by approximating the cotransformation function (1 − 2^λ)/(1 + 2^λ) using a first-order polynomial and computing the required log₂ function based on a range reduction technique that normalizes the function argument to the range [1, 2). Since only one instruction is evaluated at a given time, interpolator data paths within the two parallel blocks ii) and iii) can be shared to obtain a more compact design.

The LNS-based AU proposed in [23] was integrated into the datapath of a custom 32-bit OpenRISC core [29] using a CMOS 65 nm process. For comparison purposes, an identical system was designed featuring standard IEEE 754 single-precision FP AUs with hardware support for additions, subtractions and multiplications. The error analysis shows maximum and average values similar to the ones achieved for addition and subtraction with the ELM. Also executing at a clock frequency of 125 MHz, the adder and subtractor have a latency of four clock cycles, while the multiplier, divisor and square root calculus exhibit a latency of one clock cycle. The LNS-based adder/subtractor of this processor significantly reduces the circuit area required by the individual LNS units, mainly due to circuit sharing. Comparing the energy efficiency of the LNS-based (per LOGarithmic Operation (LOGP)) and the FP-based AUs (per FLoating-Point Operation (FLOP)) for a CMOS 65 nm process, addition and subtraction in FP are approximately 3 times more efficient than in LNS (40 pJ/FLOP vs 130 pJ/LOGP), while multiplication in LNS is approximately 1.5 times more efficient (48 pJ/FLOP vs 30 pJ/LOGP), division is 17 times more efficient (525 pJ/FLOP vs 30 pJ/LOGP), and square root is 38 times more efficient than FP (609 pJ/FLOP vs 16 pJ/LOGP).

A combination of the LNS and FP representations has been explored as an approach to designing hybrid AUs [30]. It enables, for example, the multiply and divide operations to be computed using the LNS format, while addition and subtraction are efficiently performed in FP. The hardware for performing the FP-to-LNS and LNS-to-FP conversions is embedded within the hybrid AUs.

B. Residue number systems

The Chinese Remainder Theorem (CRT) was originally proposed in the 3rd century AD [31]. In the 1950s, it became the support for the proposal of computer arithmetic based on the RNS [32]. The CRT states that given a moduli set {m₁, m₂, ..., mₙ} of pairwise co-prime numbers, i.e., the Greatest Common Divisor GCD(mᵢ, mⱼ) = 1 for i ≠ j (⟨X⟩_m represents the residue of X for the modulus m):

{⟨X⟩_{mᵢ} | 1 ≤ i ≤ n} is unique for any 0 ≤ X < M ,  M = ∏_{i=1}^{n} mᵢ   (9)

The CRT asserts that there is a bijection, formally described in Lemma 2.

Lemma 2. Let m₁, ..., mₙ be n pairwise co-prime integers and M = ∏_{i=1}^{n} mᵢ. The CRT asserts that there is a bijection between Z_M and Z_{m₁} × ··· × Z_{mₙ}:

Z_M ↔ Z_{m₁} × ··· × Z_{mₙ} .   (10)

Proof. Consider n = 2 and two co-prime numbers m₁ and m₂. The modular inverses m₁⁻¹ = ⟨m₁⁻¹⟩_{m₂} and m₂⁻¹ = ⟨m₂⁻¹⟩_{m₁} exist since m₁ and m₂ are co-prime numbers [165]. Thus, the value Y satisfies the equation for X:

Y = ⟨x₁ m₂ m₂⁻¹ + x₂ m₁ m₁⁻¹⟩_{M = m₁ × m₂}

since ⟨Y⟩_{m₁} = x₁ and ⟨Y⟩_{m₂} = x₂.

It remains to be shown that no other solutions exist modulo M, which means X ≡ Y (mod M). By contradiction, it is assumed that (X ≠ Y) < M and ⟨X⟩_{mᵢ} = ⟨Y⟩_{mᵢ}, which means ⟨X − Y⟩_{mᵢ} = 0, i = 1, 2. Hence, (X − Y) is a common multiple of the mᵢ; since the mᵢ are pairwise co-prime, their Least Common Multiple (LCM) is M, so (X − Y) would have to be a multiple of M, which contradicts (X ≠ Y) < M. We can therefore represent an element of Z_M by one element of Z_{m₁} and one element of Z_{m₂}, and vice versa.

The CRT can now be proven for an arbitrary number n of co-prime integers mᵢ, 1 ≤ i ≤ n, by induction. Defining Mᵢ = M/mᵢ and Mᵢ⁻¹ = ⟨Mᵢ⁻¹⟩_{mᵢ}, the previous equation for Y becomes:

X = ⟨∑_{i=1}^{n} xᵢ × Mᵢ × Mᵢ⁻¹⟩_M

and X is the unique solution in the ring Z_M. ∎

Moreover, for m₁, ..., mₙ pairwise co-prime integers and M = m₁ × ··· × mₙ, the CRT also asserts the following ring isomorphism [33]:

Z_M → Z_{m₁} × ··· × Z_{mₙ}
X → f(X) = (x₁, ··· , xₙ) = (⟨X⟩_{m₁}, ··· , ⟨X⟩_{mₙ}) ,   (11)

which allows carrying out the ring operations on the residues for each modulus independently.

Thus, RNS is a nonpositional number system based on smaller residues, the quantity and range of which are controlled by the chosen pairwise co-prime moduli set [34]. Each residue corresponds to a “digit”, and “digits” can be processed in parallel. The direct application of RNS properties to addition and multiplication expressed in (12) is widely disseminated in the design of efficient arithmetic, in particular for linear processing [35], [36].

X ◦ Y , ◦ ∈ {+, −, ∗} ↦ {⟨x₁ ◦ y₁⟩_{m₁}, ··· , ⟨xₙ ◦ yₙ⟩_{mₙ}}   (12)
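As a minimal illustration (our own sketch, not code from the paper), the following Python fragment performs the forward conversion, the channel-wise operations of (12), and the CRT-based reverse conversion of (13) for an illustrative moduli set {7, 8, 9}; all names are ours:

from math import prod

MODULI = (7, 8, 9)                     # pairwise co-prime, M = 504

def to_rns(x: int) -> tuple:
    """Forward conversion: the residues of x for each modulus."""
    return tuple(x % m for m in MODULI)

def rns_op(xs, ys, op):
    """Channel-wise add/sub/mul per (12); no carries cross channels."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, MODULI))

def to_binary(xs) -> int:
    """CRT reverse conversion (13): X = <sum_i Mi <Mi^-1>_mi xi>_M."""
    M = prod(MODULI)
    total = 0
    for x, m in zip(xs, MODULI):
        Mi = M // m
        total += Mi * pow(Mi, -1, m) * x   # pow(a, -1, m): modular inverse
    return total % M

a, b = to_rns(123), to_rns(45)
print(to_binary(rns_op(a, b, lambda x, y: x * y)))   # 123*45 mod 504 = 495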

RNS has been proposed mainly for fixed-point arithmetic, accommodating also numbers with signs in the range [−M/2 : M/2). The main limitation of RNS-based arithmetic is related to the operations, such as division, comparison, sign and overflow detection, which are difficult to implement with non-weighted number representations. These require a combination of the residues to reach some type of positional representation, i.e., the partial or complete application of the RNS-to-binary (reverse) conversion, supported by the CRT or the Mixed Radix Conversion (MRC) [37], [38]. Although both approaches use modular operations, the CRT carries out the computation in a single step over a large modular sum X (Mᵢ = M/mᵢ, ⟨Mᵢ⁻¹⟩_{mᵢ} is the multiplicative inverse of Mᵢ, and α is an integer) (13).

X = ⟨∑_{i=1}^{n} Mᵢ ⟨Mᵢ⁻¹⟩_{mᵢ} xᵢ⟩_M = ∑_{i=1}^{n} Mᵢ ⟨Mᵢ⁻¹⟩_{mᵢ} xᵢ − αM   (13)

On the other hand, the MRC uses shorter modulo arithmetic with an iterative procedure:

X = y₁ + m₁ × (y₂ + m₂ × (y₃ + m₃ × (···)))   (14)
y₁ = x₁
y₂ = ⟨(x₂ − y₁) × ⟨m₁⁻¹⟩_{m₂}⟩_{m₂}
y₃ = ⟨((x₃ − y₁) × ⟨m₁⁻¹⟩_{m₃} − y₂) × ⟨m₂⁻¹⟩_{m₃}⟩_{m₃}
···
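For comparison, the sketch below (ours) implements the MRC (14) for the same illustrative moduli set; unlike the CRT, only arithmetic modulo the small moduli is required, applied iteratively:

MODULI = (7, 8, 9)

def mrc_to_binary(xs) -> int:
    """Mixed Radix Conversion (14) for a 3-moduli set."""
    m1, m2, m3 = MODULI
    x1, x2, x3 = xs
    y1 = x1
    y2 = ((x2 - y1) * pow(m1, -1, m2)) % m2
    y3 = (((x3 - y1) * pow(m1, -1, m3) - y2) * pow(m2, -1, m3)) % m3
    return y1 + m1 * (y2 + m2 * y3)

print(mrc_to_binary(tuple(495 % m for m in MODULI)))   # recovers 495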

Algorithms for RNS division [39], comparison [40], [41], sign detection [42], scaling [43], and reduction [44] have been proposed for specific moduli sets. Basis Extension (BE) is often used in those algorithms to extend the representation of an RNS moduli set (often called a basis) S₁ to another S₂. Consider the two bases S₁ = {m_{1,1}, ..., m_{1,n₁}} and S₂ = {m_{2,1}, ..., m_{2,n₂}} with dynamic ranges M₁ = ∏_{i=1}^{n₁} m_{1,i} and M₂ = ∏_{i=1}^{n₂} m_{2,i}, respectively. If an error of αM₁ is tolerated, the residues associated with the second basis can be computed as in [44]:

x_{2,i} = ⟨X′⟩_{m_{2,i}} = ⟨∑_{j=1}^{n₁} Mⱼ ⟨Mⱼ⁻¹⟩_{m_{1,j}} xⱼ⟩_{m_{2,i}}   (15)

For applications that do not tolerate this error, a method to perform the exact extension requires an extra modulus (m_e), with the corresponding residue x_e = ⟨X⟩_{m_e}, to enable the computation of the parameter α, for X < δM₂, δ ∈ ℕ, and m_e ≥ 2(δ + n₂ + 1) [166]:

α = ⟨(X′ − x_e) ⟨M₁⁻¹⟩_{m_e}⟩_{m_e}   (16)

Once α is known, the basis extension is finally computed with (17).

x_{2,i} = ⟨∑_{j=1}^{n₁} Mⱼ ⟨Mⱼ⁻¹⟩_{m_{1,j}} xⱼ − αM₁⟩_{m_{2,i}}   (17)

Performing efficient RNS arithmetic requires the selection of the most suitable moduli set for the problem at hand. The size of the set and the width of the moduli are mainly defined by the dynamic range and the desired parallelism. Features usually considered to select the moduli are i) the efficiency of the modular arithmetic for each channel (the term “channel” is usually adopted to designate a residue and the modular arithmetic units associated with it), which typically leads to values near a power of two; ii) the balancing of the moduli set, since the slowest channel defines the overall performance; and iii) the cost of the reverse converters, with some moduli sets allowing efficient arithmetic channels at the cost of more complex reverse converters and others tailored to result in simple reverse conversion circuits due to the mathematical relations between the moduli [45]. Altogether, the most used moduli sets are composed of moduli of the form 2^v ± k, with v ∈ {n, 2n} and k ∈ {−1, 0, +1} [35].

Alternative binary codings can be explored for modular arithmetic in each RNS channel. Thermometer Coding (TC) and One-Hot Coding (OHC) are among the most used ones [46]. The value of a number in TC is expressed by the number of 1s in the string of bits [47], which means that it is a nonpositional redundant number representation. Typically, for simplicity, a run of 1s is placed at one end of the string. For the OHC, the only valid combinations in a string of bits are those with a single bit at ‘1’ and all the others at ‘0’; the position of the ‘1’ bit directly indicates the value of the number. k and k + 1 bits are required to represent integers between 0 and k in TC and OHC, respectively. For example, for k = 7, the number 4 can be represented in TC by the string “0001111”, while in the OHC, it is represented by the combination “00010000”. The TC is often used in Digital-to-Analog Converters (DACs), as it leads to less glitching than binary-weighted codings. The OHC is used in Finite State Machines (FSMs), as it avoids the use of decoders: a single ‘1’ bit directly identifies the state. The TC and OHC codings are most advantageous for RNS moduli that result in small and medium dynamic ranges [46]–[48].
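Both encodings can be sketched in a few lines of Python (helper names are ours), reproducing the k = 7 example above:

def thermometer(value: int, k: int) -> str:
    """TC: 'value' ones packed at the low end of a k-bit string."""
    return format((1 << value) - 1, f"0{k}b")

def one_hot(value: int, k: int) -> str:
    """OHC: a single '1' whose position marks the value, in k+1 bits."""
    return format(1 << value, f"0{k + 1}b")

print(thermometer(4, 7))   # 0001111
print(one_hot(4, 7))       # 00010000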

Several RNS-based processors have been developed, for example, a fully programmable RISC Digital Signal Processor (RDSP) [167] and hardware/software RNS-based units [49]. The RDSP is a 5-stage pipeline processor with the RNS-based AU partially located in the third and fourth pipeline stages, as depicted in Fig. 6. The most popular 3-moduli set {2^n − 1, 2^n, 2^n + 1} is balanced by extending the binary channel to 2^{2n}; the reverse converter is distributed across the two aforementioned pipeline stages, while, for the multiply-accumulate (MAC) unit, the multiplier is located in the third stage, and the accumulator operates in parallel to the load/store units in the fourth pipeline stage. The experimental results obtained for the RDSP in the CMOS 250 nm technology show that not only are the circuit area and the power consumption significantly lower but also a speedup is obtained in comparison to a similar DSP using a data path based on binary arithmetic.

An AU based on a moduli set that comprises a larger modulus 2^k and all the other moduli of type 2^n − 1 assumes that the selected moduli fit the typical word size of a Central Processing Unit (CPU), for example, 64 bits [49]. Adopting a hardware-software approach, RNS adders, subtractors and multipliers are implemented in hardware, while the reverse conversion and nonlinear operations are implemented in software. A minimum set of instructions includes not only forward conversion (residue), modular addition (add_m) and modular multiplication (mult_m) but also instructions that operate on positional number representations, such as add, sub, AND and SHR (logical shift right). It supports changing the dynamic range at runtime in a simple manner, which in turn can result in a reduction in both power consumption and execution time.

Although the RNS was proposed mainly for fixed-point arithmetic, it has also been used for floating-point units [50], [51]. To operate on FP numbers, the mantissa and the exponent are individually converted and processed in the RNS domain. For example, to multiply two FP numbers, both mantissas are multiplied and exponents are added in the RNS domain.

Fig. 6: RDSP: RNS-based Reduced Instruction Set Computer (RISC) [167] – Address Generation Unit (AGU); Logic Unit (LU)

A variant of RNS, the Redundant Residue Number System (RRNS), addresses different purposes [52] but is mainly used for error detection and correction. Since the residues are independent of each other, by introducing redundant moduli into the set of original (information) moduli, which define the legitimate range, the representation can be extended to the illegitimate range to detect residue errors and, in some cases, correct them. If a result falls into the illegitimate range, it can be concluded that i) one or more residue digit errors exist, as long as the number of residue digit errors is not more than twice the number of redundant moduli, and ii) errors can be corrected by subtracting the error digits from the residue digits, assuming that the number of residue errors does not exceed the number of redundant moduli [35]. For example, applying the RRNS 3-moduli set {2^n − 1, 2^n + 1, 2^{n+1}} and considering the legitimate range [0, 2^{2n} − 2], defined by the moduli subset {2^n − 1, 2^n + 1}, a one-digit error can be detected in the illegitimate range [2^{2n} − 1, 2^{3n+1} − 2^{n+1} − 1]. It is worth noting that the independence of residue digits in RRNS prevents a residue error in one channel from propagating to another.

A computationally redundant processing core with high energy efficiency has been proposed [53]. The Computationally Redundant Energy-Efficient Processing for Y’all (CREEPY) core comprises microarchitecture-, ISA- and RRNS-centered algorithms and techniques to efficiently perform computational error correction. This work has shown significant improvements over a non-error-correcting binary processing core. However, in the research on the design of cores based on the RRNS, the microarchitectures abstract away the memory hierarchy and do not consider the power-performance impact of RNS-based memory addressing. Compared with a non-error-correcting core that addresses memory in binary, direct RNS-based memory addressing schemes slow down in-order/out-of-order cores. Novel schemes for RNS-based memory access, extending low-power and energy-efficient RRNS-based architectures, are proposed in [54].

C. Stochastic number representation

A stochastic signal is the result of a continuous-time stochastic process with two possible values, represented by the symbols 0 and 1 [55]. Numbers are represented by probabilities associated with bitstreams: pᵢ is the value of a stochastic bitstream Iᵢ, defined as the ratio of 1 bits to the total number of bits. Considering Unipolar Representation (UR), the stochastic bitstream values are bound to [0 : 1], while for Bipolar Representation (BR), the values are bound to [−1 : 1], a result of applying the scale factor 2 after a negative bias term (18). For example, a stream I containing 60% 1s and 40% 0s, as depicted in Fig. 7, represents the numbers 0.6 and 0.2 in the UR and BR, respectively.

UR = p(i) ;  BR = 2 × (p(i) − 0.5) .   (18)

Neither the length nor the structure of I needs to be fixed. The number represented by a particular stream is not defined by the individual value of a bit in a specific position; thus, SC relies on a nonpositional number representation. The randomization of the representation, associated with the mapping from space to time, leads to SC arithmetic that can be implemented with simple logic circuits [56], [57].

Fig. 7: Stochastic bitstream representing 0.6 in UR and 0.2 in BR

Fig. 8: Multiplication and addition on SC – (a) multiplication on UR; (b) scaled adder on UR; (c) multiplication on BR

For example, a single two-input logic gate can be used to compute the product of the probabilities p × q (Fig. 8). It is easy to show that for the UR, it is an AND gate, while for the BR, it is an eXclusive-NOR (XNOR) gate. A 2:1-bit multiplexer can be used to add two numbers in UR, as depicted in Fig. 8b. The addition is scaled by 1/2, as shown in (19).

p(O) = p(I₁)p(I₃) + (1 − p(I₃))p(I₂) , with p(I₃) = 1/2 ⇒
p(O) = (1/2) × (p(I₁) + p(I₂))   (19)
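The following Monte-Carlo sketch (ours; stream length and probabilities are illustrative) reproduces the UR multiplication of Fig. 8a and the scaled addition of (19) with independently generated bitstreams:

import random

N = 100_000                                         # bitstream length

def bitstream(p: float) -> list:
    return [random.random() < p for _ in range(N)]  # independent bits, P(1) = p

def value(stream) -> float:
    return sum(stream) / len(stream)                # ratio of 1s (UR decoding)

a, b, sel = bitstream(0.6), bitstream(0.5), bitstream(0.5)

product = [x and y for x, y in zip(a, b)]                   # AND gate: p*q on UR
scaled_sum = [x if s else y for x, s, y in zip(a, sel, b)]  # 2:1 multiplexer

print(value(product))     # ~0.30 = 0.6 * 0.5
print(value(scaled_sum))  # ~0.55 = (0.6 + 0.5) / 2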


Fig. 9: Binary-to-SC and SC-to-binary converters – (a) binary-to-stochastic; (b) stochastic-to-binary

The correctness of the final result is impacted by the correlation, or lack of randomness, between stochastic bitstreams. For example, in the extreme case where the same bitstream feeds the two inputs of the AND gate in Fig. 8a, regardless of the bit values in the bitstream, for UR, p(O) = p(I₁) = p(I₂) instead of p(O) = p(I₁)² = p(I₂)². For the same bitstream feeding the two entries of the XNOR gate in Fig. 8c, the value of the product is p(O) = 1 for the BR.

Hence, it is important to use “good” random generators. However, even in systems where stochastic bitstreams are generated with high-quality random generators, after a few operations, the bitstreams tend to become correlated due to clock synchronicity. One way to overcome this limitation is to regenerate the bitstream [59]; however, the resource penalty impacts the benefit introduced by the compact arithmetic operators. A less costly alternative is to replace the global clock with individual ring oscillators for each synchronous component, which can be sensitive to the process, temperature and voltage variation [58].

To use SC in practical applications on typical computing systems, binary-to-stochastic and stochastic-to-binary converters are required. These converters are presented in Fig. 9. The binary-to-stochastic conversion in Fig. 9a comprises the generation of an m-bit random binary number in each clock cycle and its comparison with the m-bit input binary number. For UR, assuming that the random numbers are uniformly distributed over the interval [0, 1], the probability of a 1 being generated by the comparator in a given clock cycle is equal to the fractional number at the input of the converter. For the stochastic-to-binary conversion in Fig. 9b, the value p is carried by the number of 1s in its bitstream; thus, the conversion is performed by counting the number of 1s to obtain p and normalizing the result by the number of bits in the stream.

Arithmetic functions, even nonpolynomial functions using a Maclaurin expansion [60], can be implemented with SC with simple combinatorial elements. However, highly nonlinear functions such as the tanh and max functions, widely used in artificial neural networks as discussed in Section IV-A, require FSM-based SC elements [61].

In the state machine in Fig. 10, states are organized in sequence, as in a saturating counter [62]. The number of states is given by N = 2^n; thus, the set of states can be implemented by n flip-flops. X is the input of the state machine, and Y is the output (not shown in this figure), with the latter depending only on the state. Assuming that the input X is a stochastic bitstream, typically a Bernoulli sequence, the probability p(X) of a bit in the sequence being ‘1’ follows a binomial distribution. With p(Y), the probability of a bit in the output stream Y being ‘1’, and pᵢ, the probability of the state Sᵢ given the input probability p(X), (20)–(22) can be derived. (20) states that the probability of transitioning from Sᵢ₋₁ to Sᵢ must equal the probability of transitioning from Sᵢ to Sᵢ₋₁, and (21) states that the sum of pᵢ over all N states should be unitary. In (22), sᵢ takes the value 1 if the output Y = 1 in the current state Sᵢ and 0 otherwise.

Fig. 10: SC linear FSM with N = 2^n distinct states

pᵢ (1 − p(X)) = pᵢ₋₁ p(X)   (20)
∑_{i=0}^{N−1} pᵢ = 1   (21)
p(Y) = ∑_{i=0}^{N−1} sᵢ pᵢ   (22)

Lemma 3 enunciates the value sᵢ such that tanh can be computed with SC.

Lemma 3. By inserting sᵢ as in (23) into (22):

sᵢ = 0, for 0 ≤ i < N/2 ;  sᵢ = 1, for N/2 ≤ i < N   (23)

tanh is computed based on SC with (24):

z = tanh((N/2) x)   (24)

Proof. Starting from (20), and taking p₀ = 1 before normalization, it is easy to derive (25):

pᵢ = (p(X) / (1 − p(X)))^i   (25)

and by applying (21) to normalize (25), we obtain (26):

pᵢ = (p(X)/(1 − p(X)))^i / ∑_{j=0}^{N−1} (p(X)/(1 − p(X)))^j   (26)

Based on the configuration in (23), (22) can be rewritten as (27):

p(Z) = ∑_{i=N/2}^{N−1} sᵢ pᵢ   (27)

and by substituting (26) in (27), we obtain (28):

p(Z) = ∑_{i=N/2}^{N−1} (p(X)/(1 − p(X)))^i / ∑_{j=0}^{N−1} (p(X)/(1 − p(X)))^j
     = [(p(X)/(1 − p(X)))^{N/2} − (p(X)/(1 − p(X)))^N] / [1 − (p(X)/(1 − p(X)))^N]
     = (p(X)/(1 − p(X)))^{N/2} / [1 + (p(X)/(1 − p(X)))^{N/2}]   (28)

Adopting the bipolar coding format, which means that the ranges of the real numbers x and z are extended to −1 ≤ α ≡ x, z ≤ 1, and assuming that the probability of a bit in the stream being one is P(α = 1) = (α + 1)/2, (28) leads to (29):

z = [((1 + x)/(1 − x))^{N/2} − 1] / [((1 + x)/(1 − x))^{N/2} + 1]   (29)

By using the first-order Taylor approximations shown in (30):

1 + x ≈ e^x ;  1 − x ≈ e^{−x}   (30)

(29) is rewritten to compute tanh based on SC (31):

z = (e^{(N/2)x} − e^{−(N/2)x}) / (e^{(N/2)x} + e^{−(N/2)x}) = tanh((N/2) x)   (31)

and Lemma 3 is proved. ∎
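The behavior established by Lemma 3 can be reproduced with a small simulation (our sketch; N, the bitstream length and the test points are illustrative):

import random

N = 8                                  # number of states (n = 3 flip-flops)
L = 200_000                            # input bitstream length

def stoch_tanh(x: float) -> float:
    p_x = (x + 1) / 2                  # bipolar encoding: P(bit = 1) = (x + 1)/2
    state, ones = 0, 0
    for _ in range(L):
        if random.random() < p_x:      # input bit '1': move up (saturating)
            state = min(state + 1, N - 1)
        else:                          # input bit '0': move down (saturating)
            state = max(state - 1, 0)
        ones += state >= N // 2        # output s_i = 1 in the upper half, per (23)
    return 2 * (ones / L) - 1          # decode the bipolar output stream

for x in (-0.5, 0.0, 0.5):
    print(x, stoch_tanh(x))            # ~ tanh(4x): about -0.96, 0.0, 0.96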

The stochastic implementation of other functions, such as the absolute value, the exponentiation, and the max/min functions, can be carried out using linear FSMs [63], [61]. These functions support, for example, the development of deep learning applications, as described in Section IV-A. The characteristics of SC make it suitable, in particular, for designing embedded systems. It offers a lower circuit area and power consumption, as well as high error resilience, in comparison to binary system counterparts [64].

Since operations at the logic level are performed on randomized values, an attractive feature of SC is its degree of tolerance to errors [65]. A stochastic bitstream of length N represents numbers with a precision of 1/N. The error follows a normal distribution with zero mean and a variance that is inversely proportional to the bitstream length N (σ ≈ 1/√N) [56]. For example, insufficient bitstream length or randomness can be leveraged in approximate computing [66], [67].

Several processors, more or less dedicated, have been developed to apply SC to different applications, such as Artificial Neural Networks (ANNs), control systems, and the decoding of modern error-correcting codes [67]. The Stochastic Recognition and Mining processor (StoRM) uses SC to efficiently materialize computational kernels to perform classification, search, and inference on real data [68]. The StoRM architecture is a 2D array of stochastic Processing Elements (PEs) (typically 15 × 15) with a streaming memory hierarchy, mitigating the overhead of binary-to-stochastic conversion by sharing the corresponding arithmetic units across rows or columns of stochastic PEs. Experimental results from the implementation of StoRM in CMOS TSMC 65 nm technology show that SC allows a highly compact implementation, resulting in an order of magnitude less circuit area and power consumption.

However, computation with SC requires multiple cycles, equal to the number of bits in the stochastic bitstream. Since the length of the bitstream grows with the precision of the data being represented, despite its lower power consumption, the PE ends up consuming more energy than a conventional binary processing element. To mitigate this intrinsic limitation (increase in cycle count) of SC, vector processing and segmented stochastic processing are adopted [68]. Segmented stochastic processing is a hybrid representation that divides the binary data into equally sized segments that are represented by individual stochastic bitstreams so as to achieve a trade-off between precision and bitstream length.

The following sections show how SC can be applied to design invertible logic on the current CMOS technology, based on Boltzmann machines [78], and to homomorphic encryption.

D. Hyperdimensional representation and arithmetic

The circuits of the brain are composed of billions of neurons, with each neuron typically connected to up to 10,000 other neurons. This situation leads to more than 100 trillion modifiable synapses [69]. High-dimensional modeling of neural circuits not only supports artificial neural networks but also inspires parallel distributed processing. It explores the properties of high-dimensional spaces, with applications in, for example, classification, pattern identification and discrimination.

HDC, inspired by the brain-like computing paradigm, is supported by random high-dimensional vectors [70]. It looks at computing with very wide words, represented by high-dimensional binary vectors, which are typically called hyperdimensional vectors, with a dimensionality on the order of thousands of bits. It is an alternative to Support Vector Machines (SVMs) and CNN-based approaches for supervised classification. With Associative Memory (AM), a pattern X can be stored using another pattern A as the address, and X can be retrieved later from the memory address A. However, X can also be retrieved by addressing the memory with a pattern A′ similar to A.

The high number of bits in HDC is not used to improve the precision of the represented information, as it would be difficult to give an exact meaning to a 10,000-bit vector in the same way a traditional computing model does. Instead, the bits ensure robustness and randomness by i) being tolerant to errors and component failures, which comes from redundancy in the representation (many patterns mean the same thing and thus are considered equivalent), and ii) carrying highly structured information, as in the brain, to deal with the arbitrariness of the neural code.

HDC starts with vectors drawn randomly from the hyperspace and uses AM to store and access them. A space of n = 10,000-dimensional vectors with independent and identically distributed (i.i.d.) components drawn from a normal distribution with mean µ = 0 can be considered as a typical example. Points in the space can be viewed as corners of a 10,000-dimensional unit hypercube, for which the distance d(X, Y) between two vectors X, Y is expressed using the Hamming metric. It corresponds to the length of the shortest path between the corner points along the edges (the distance is often normalized to the number of dimensions so that the maximum distance, when the values of all bits differ, is 1). The distance k from any point of the space to a randomly drawn point follows a binomial distribution, with p(0) = 0.5 and p(1) = 0.5:

f(k) = C(10000, k) × 0.5^k × 0.5^{10000−k} = (10000! / (k! (10000 − k)!)) × 0.5^{10000}   (32)

with mean µ = 10,000 × 0.5 = 5,000 and variance VAR = 10,000 × 0.5 × 0.5 = 2,500 (the standard deviation is σ = 50). This means that if vectors are randomly selected, they differ by approximately 5,000 bits, a normalized distance of 0.5. These vectors are therefore considered “unrelated” [70]. Although it is intuitive that half the space lies at a normalized distance of less than 0.5 from a given point, the distribution is extremely concentrated: only about a thousand-millionth of the space is closer than 0.47, and another thousand-millionth is farther than 0.53. This distribution provides robustness to the hyperdimensional space, given that “nearly” all of the space is concentrated around the mean distance of 5,000 bits (0.5). Hence, a 10,000-bit vector representing an entity may see a large number of bits, e.g., a third, change their values due to errors or noise, and the resulting vector still identifies the correct entity because it is far from any “unrelated” vector.
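The concentration described by (32) is easy to verify empirically. The sketch below (ours) samples random 10,000-bit vectors and measures normalized Hamming distances:

import random

D = 10_000

def rand_hv() -> list:
    return [random.getrandbits(1) for _ in range(D)]

def ndist(x, y) -> float:
    return sum(a != b for a, b in zip(x, y)) / D

X = rand_hv()
samples = [ndist(X, rand_hv()) for _ in range(100)]
print(min(samples), max(samples))   # all samples fall very close to 0.5

noisy = [b ^ (random.random() < 1 / 3) for b in X]   # flip about a third of the bits
print(ndist(X, noisy))              # ~0.33: still far from any unrelated vector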

The vector representation of patterns enables the use of the body of knowledge on vector and matrix algebra to implement hyperdimensional arithmetic. For example, the componentwise addition of a set of vectors results in a vector with the same dimensionality that may represent that set. The sum-vector values can be normalized to yield a mean vector. Moreover, a binary vector can be obtained by applying a threshold. If there are no duplicated elements in a set, the sum-vector is a possible representative of the set that makes up the sum [70].

Another important arithmetic operation in HDC is vector multiplication, which, as for SC with BR, can be implemented with the bitwise logic XNOR operator (see Fig. 8c). Reverting the order of the BR from (1, −1) to (0, 1), the ordinary multiplication of binary vectors X and Y can be implemented by the bitwise eXclusive-OR (XOR). It is easy to show that XOR is commutative and that each vector is its own inverse (X ∗ X = 0), where 0 is the unit vector (X ∗ 0 = X). Multiplication can be considered as a mapping of points in the space: multiplying vector X by M maps the vector into a new vector X_M as far from X as the number of 1s in M (represented by ‖M‖) (33). If M is a random vector, with half of the bits taking, on average, a value of 1, X_M is "unrelated" to X, and it can be said that multiplication randomizes it.

d(X_M, X) = ‖X_M ∗ X‖ = ‖M ∗ X ∗ X‖ = ‖M‖   (33)

Lemma 4 shows that multiplication is distributive over addition and that it implements a mapping that preserves distance. These are useful properties of hyperdimensional arithmetic. By keeping the relation between the vectors, multiplication preserves the relation between entities.

Lemma 4. Multiplication is distributive over addition and implements a mapping that preserves distance, i.e., mapping two vectors (X, Y) with the same vector M in the hyperdimensional space yields:

d(X_M, Y_M) = d(X, Y)   (34)

Proof. If the bipolar (−1, 1) representation is adopted, the XOR operator is replaced by regular multiplication. Since it distributes over ordinary addition, it also does so in this bipolar case. The symbol | | in (35) denotes normalization.

M ∗ |X + Y| = |M ∗ X + M ∗ Y|   (35)

By mapping two vectors (X, Y) with the same vector M, the relation between the vectors is maintained, as shown in (36):

X_M ∗ Y_M = (M ∗ X) ∗ (M ∗ Y) = X ∗ Y ⇒ ‖X_M ∗ Y_M‖ = ‖X ∗ Y‖ ⇒ d(X_M, Y_M) = d(X, Y)   (36)

and Lemma 4 is proved. ∎
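The lemma is easy to check numerically in the (0, 1) representation, where multiplication is the XOR; a short sketch:

import numpy as np

rng = np.random.default_rng(2)
n = 10_000
X, Y, M = (rng.integers(0, 2, n, dtype=np.uint8) for _ in range(3))

d = lambda u, v: np.count_nonzero(u ^ v) / n

XM, YM = X ^ M, Y ^ M           # multiply (bind) both vectors with M
print(d(X, Y), d(XM, YM))       # identical: the mapping preserves distance
print(d(X, XM))                 # ~0.5: binding with a random M "randomizes" X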

Permutation changes the order of vector components. A permutation can be described as a list of the integers (1, 2, 3, ..., 10000) in the permuted order, or represented as a multiplication by a special kind of matrix. This permutation matrix has all '0's except exactly one '1' in each row and each column. Similar to vector multiplication, when a permutation is applied to different vectors, the distance between them is maintained. As with sets, the elements of a sequence can be represented by a single hypervector in a process called flattening. However, if sequences are flattened by the sum, the order of the elements is lost. Thus, the elements must be "labeled" according to their position in the sequence so that the value of X appears under different variants depending on its position, which can be done with permutations [70], as sketched below. Sequences in time are important to introduce memory into systems; they can also be represented with pointer chains or linked lists [70].
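A sketch of sequence flattening with permutations; the cyclic rotation below stands in for a fixed random permutation, and the trigram example is hypothetical:

import numpy as np

rng = np.random.default_rng(3)
n = 10_000
item = {s: rng.integers(0, 2, n, dtype=np.uint8) for s in "abc"}

# Label each element by its position: permute (here, rotate) i times,
# then bind the labeled elements together with XOR.
def flatten(seq):
    out = np.zeros(n, dtype=np.uint8)
    for i, s in enumerate(seq):
        out ^= np.roll(item[s], i)
    return out

# Order matters: "abc" and "cba" flatten to "unrelated" vectors.
d = lambda u, v: np.count_nonzero(u ^ v) / n
print(d(flatten("abc"), flatten("abc")), d(flatten("abc"), flatten("cba")))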

HDC has several known applications. For example, sequence prediction based on sparse hyperdimensional coding is used for online learning and prediction [71], which can be useful for predicting mobile phone use patterns, including the prediction of the next launched application and the next GPS location of a user. Another example of an HDC application is the learning and classification of biosignals, such as electromyography (EMG), electroencephalography (EEG), and electrocorticography (ECoG) [72]. A full set of Hyper-Dimensional (HD) network templates comprehensively encodes body potentials and brain neural activity recorded from different electrodes within a single HD vector, processed as a single entity for learning and robust classification purposes.

The diagram of a programmable processor based on HDC for supervised classification [73] is presented in Fig. 11. This HDC processor encompasses three main components, each corresponding to a processing step: the Item Memory (IM), the Encoder and the AM. The IM stores a sufficiently large collection of random hypervectors called items. This memory maps symbols to items in the inference phase, as learned in the training phase. The Data Processing Units (DPUs) of the Encoder combine the input hypervector sequence according to the application-specific algorithm to compose a single hypervector for each class. Finally, the AM stores the trained class hypervectors and delivers the best prediction according to the Hamming distance (d_h). The inputs of the processor comprise the assignment of symbols to an item in the IM, and the outputs comprise the assignment of class labels to the AM address, all of which is enabled by the HD mapper.

Fig. 11: The generic HDC-based processor: major components and dataflow [73]; DPU: Data Processing Unit

An Application Specific Integrated Circuit (ASIC) for the HDC-based programmable processor, with 2048-bit vectors, was synthesized and implemented using a 28 nm process with a high-k dielectric coupled with metal gate electrodes (HK/MG) [73]. The chip showcases high energy efficiency, less than 1.5 J/prediction for a comprehensive benchmark at a clock cycle of approximately 2.5 ns. It rivals other state-of-the-art processors targeting supervised learning in terms of both accuracy and energy efficiency.

E. Discussion and approximate arithmetic techniques

The LNS swaps the cost of addition/subtraction with the cost of multiplication/division operations, as well as powers and roots, but it still leads to a large cost difference between these two categories of operations. Although the LNS provides a range and precision similar to those of FP, beyond the simplification of multiplication and division to fixed-point addition and subtraction, the FP number systems have become a standard, while the LNS is mainly applied in niches. This occurs, for example, in signal processing and, most recently, in machine learning, as shown in the next sections.

RNS belongs to the class of nonpositional representations exposing parallelism at a very fundamental computing level to achieve high performance, and at the same time it supports the design of reliable systems (fault tolerance). Although the usage of RNS allows the design of more energy-efficient systems, RNS does not guarantee a decrease in cost or power consumption, since the reduction operation has to be applied and the sum of the residue bit-widths is typically the bit-width of the equivalent fixed-point binary number. It can be applied not only to designing hardware devices but also to speeding up the execution of applications, such as in cryptography, on traditional binary-based programmable systems (e.g., the massively parallel Graphics Processing Units (GPUs)). However, RNS is not suitable for designing comparison units, dividers and other nonlinear operations.

HDC is inspired by the attributes of neuronal circuits, including hyperdimensionality and (pseudo)randomness. When employed on tasks such as learning and classification, HDC involves the manipulation and comparison of large patterns (typically 10,000-bit patterns) within memory. It allows systems to retain memory, instead of computing everything they sense, pointing more to a kind of Processing In-Memory (PIM). The representation of the patterns, and the associated arithmetic, does not assign any weight to the "digits" according to their position in the pattern; thus, it is also a nonpositional representation. A key attribute of hyperdimensional computing is its robustness to the imperfections associated with the technology on which it is implemented. HDC improves with the advances in memory technologies, replacing computation with the manipulation of patterns in memory.

SC has shown promising results for low-power, area-efficient hardware implementations, even though existing stochastic algorithms require long streams that negatively impact the latency. Being another nonpositional representation, SC improves the robustness of circuits subject to random faults or component variability. Since with SC multiplications and additions can be calculated using AND gates and multiplexers, significant reductions in power and the hardware footprint can be achieved in comparison to conventional binary arithmetic implementations. The significant savings in power consumption and hardware resources allow the design of autonomous systems, such as embedded devices, or computation on the edge.

SC is a good example of a representation that easily allows trading off the computation quality with the performance and energy consumption by adjusting the bitstream length. The idea of trading off accuracy with performance and energy efficiency led to the approximate computing paradigm for designing circuits and systems [74]. This paradigm has been explored at several levels, from basic circuits to components, processing units, and programs. In particular, approximate arithmetic circuits have been proposed, including adders, multipliers and dividers [75]. Both error characteristics, such as the "error rate" and the "error distance", and circuit measurements, such as the delay, cost and power dissipation, should be considered for approximate circuits [75]. Approximate computing has been considered for several applications with error resilience, such as image processing and machine learning [76], [77]. Although most of the research on approximate computing up to now has been focused on binary representation, the paradigm can be applied to the different representations discussed in this section.
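As a concrete illustration of these error characteristics, the following sketch evaluates a simple lower-part-OR style approximate adder (a simplified variant of published designs; the parameters are arbitrary) and measures its error rate and mean error distance exhaustively:

# Approximate 8-bit adder: exact addition on the upper bits,
# bitwise OR (no carries) on the k lower bits -- a common simplification.
def approx_add(a, b, k=4):
    mask = (1 << k) - 1
    upper = ((a >> k) + (b >> k)) << k
    return upper | ((a | b) & mask)

errors, dist, total = 0, 0, 0
for a in range(256):
    for b in range(256):
        e, x = a + b, approx_add(a, b)
        errors += (e != x)
        dist += abs(e - x)
        total += 1

print(f"error rate = {errors/total:.3f}, mean error distance = {dist/total:.3f}")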

III. ARITHMETIC: NEW TECHNOLOGIES AND ALTERNATIVE COMPUTING PARADIGMS

As referred to in the previous sections, although arithmetic units and processors have been designed in CMOS-based ASICs and Field Programmable Gate Arrays (FPGAs), nonconventional number representations are also fundamental for developing computing systems based on new technologies and circuits, such as superconductor devices [79], integrated nanophotonics [94], nanoscale memristor devices and hybrid memories [98], [105], and invertible logic devices [78]. Moreover, new paradigms for designing computers with emergent technologies also cross nonconventional arithmetics to cover a wide range of applications. This section introduces not only those emergent technologies but also two of the most representative new computing paradigms, QC and DNA-based computing [80], [81]. This survey considers both the technologies and the new computing paradigms from the perspective of computer arithmetic.

Fig. 12: Examples of reversible gates: (a) FG (CNOT), (b) TG, (c) HNG

The lowest limit for the energy required to process one bit of information is E_b = k_B × T × ln 2, where k_B is the Boltzmann constant and T the temperature in Kelvin (approximately 3 × 10^−21 Joule for a temperature of 25°C) [82]. However, it has been shown that this energy can be reduced, achieving even near-zero energy dissipation, by using reversible gates [82]. Reversible logic requires not only the reverse mode of operation but also a one-to-one mapping between inputs and outputs.
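This limit can be quickly evaluated at room temperature (pure arithmetic, nothing assumed beyond the formula):

import math

kB = 1.380649e-23           # Boltzmann constant, J/K
T = 298.15                  # 25 degrees Celsius in Kelvin
Eb = kB * T * math.log(2)   # Landauer limit per bit
print(f"{Eb:.2e} J")        # ~2.9e-21 J, i.e., approximately 3e-21 Joule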

Reversible gates have been implemented in nanotechnologies. Some of the most representative reversible gates are presented in Fig. 12 [81]. The Feynman gate (FG), also known as the Controlled-NOT (CNOT) gate, is a 2×2 reversible gate described by the equations P = A and Q = A ⊕ B, where A and B operate as control and data bits, respectively. The Toffoli gate (TG) is a 3×3 reversible logic gate, with P = A, Q = B and R = C ⊕ (A ∧ B). A 1-bit Full Adder (FA) can be built with a single Haghparast and Navi gate (HNG) by considering inputs A, B, Ci and D = 0. The carry out (Co) bit that results from the addition of A, B and Ci is easily computed by (37), since the XOR of three terms is equal to their logic OR whenever the number of terms that assume a value of one is different from two, a situation that never occurs in (37).

Co = (A ⊕ B) ∧ C ⊕ A ∧ B = A ∧ C ⊕ B ∧ C ⊕ A ∧ B = A ∧ C + B ∧ C + A ∧ B   (37)
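The HNG-based construction can be verified exhaustively in software; the HNG equations below follow the definition commonly given in the literature (P = A, Q = B, R = A ⊕ B ⊕ C, S = (A ⊕ B) ∧ C ⊕ A ∧ B ⊕ D) and should be read as an assumption here:

def FG(a, b):                 # Feynman (CNOT) gate: 2x2 reversible
    return a, a ^ b

def TG(a, b, c):              # Toffoli gate: 3x3 reversible
    return a, b, c ^ (a & b)

def HNG(a, b, c, d):          # Haghparast-Navi gate (assumed standard definition)
    return a, b, a ^ b ^ c, ((a ^ b) & c) ^ (a & b) ^ d

# A 1-bit FA from a single HNG with D = 0: R is the sum, S the carry out.
for A in (0, 1):
    for B in (0, 1):
        for Ci in (0, 1):
            _, _, s, co = HNG(A, B, Ci, 0)
            assert s == (A ^ B ^ Ci) and co == int(A + B + Ci >= 2)
print("HNG full adder verified for all 8 input combinations")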

Efficient reversible circuits have been implemented on QCA-based technologies [80]. Binary information is represented by means of Quantum Dots (QDs), semiconductor particles a few nanometers in size with optical and electronic properties that are central to nanotechnology [84]. When illuminated by ultraviolet light, an electron in the QD can be excited to a state of higher energy; for a semiconducting quantum dot, an electron can transit from the valence band to the conduction band. Reversible QCA-based gates have been applied to design arithmetic units [85], which can be implemented, for example, on DNA molecular structures [86], [87] or on quantum computing devices [88]. Moreover, reversible gates and QCA have been used to implement arithmetic units based on RNS [81], [89].

Fig. 13: Circuit of an AQFP gate with Josephson junctions (J1 and J2)

A. Superconductor logic devices

The AQFP superconductor logic family was proposed as a breakthrough technology towards building energy-efficient computing systems [90]. The dynamic energy dissipation of AQFP logic can be drastically reduced due to the adiabatic switching operations, by using AC bias/excitation currents for both the clock signal and the power supply.

The circuit in Fig. 13, from [91], represents a typical simple AQFP gate that acts as a buffer. The Quantum-Flux-Parametron (QFP) is a two-junction Superconducting Quantum Interference Device (SQUID). It is based on superconducting loops containing two Josephson junctions (J1 and J2) and two inductors (L1 and L2), which are shunted by an inductor Lq in the center node. An excitation current Ix is applied, which changes the potential energy of the QFP from a single well to a double well. Consequently, the final state of the QFP, depending only on the direction of the input current Iin, is one of two states: the "0" state with an SFQ in the left loop or the "1" state with an SFQ in the right loop (see Fig. 13). A large output current, in the direction defined by the final state of the QFP, is generated in Lq.

One major advantage of AQFP is its energy efficiency potential. For example, experimental results show that the energy dissipation per junction of an 8-bit AQFP-based Carry Look-ahead Adder (CLA) is only 24 k_BT [92] (approximately 100 × 10^−21 Joule for a temperature of 25°C), with correct operation demonstrated at a 1-GHz clock frequency. The robustness of AQFP technology against circuit parameter variations has been demonstrated, as well as the potential for implementing very large-scale integrated circuits using AQFP devices. A more systematic and automatic synthesis framework, as well as synthesis results on a larger number of benchmark circuits, is presented in [90]. The automatic synthesis of 18 benchmark circuits, including 11 circuits from the ISCAS-85 benchmark suite and a 32-bit RISC-V Arithmetic Logic Unit (ALU), is performed using a standard cell library of AQFP technology with 4-phase clock signals.

Fig. 14: Diagram of the states of a 2×2 HPP switch: (a) bar state, (b) cross state

The results compare an AQFP 10 kA/cm² standard cell library with a TSMC 12 nm FinFET one. The results show the consistent energy benefit of the AQFP technology, which is able to achieve up to a 10-fold improvement in energy consumption per clock cycle in comparison to the 12 nm TSMC technology [90].

Recent research has highlighted the interest in nonconventional arithmetic also for AQFP technology. In addition to the ultra-high energy efficiency, it is pointed out that this technology has two characteristics that make it especially suitable for implementing SC [79]. One is its deep pipelining nature, since each AQFP logic gate is connected to an AC clock signal, thus requiring a clock phase, which makes it difficult to avoid Read after Write (RAW) hazards in conventional binary computing. The other is the opportunity to obtain true Random Number Generation (RNG) using single AQFP buffers.

B. Integrated nanophotonics

Integrated nanophotonics is another emergent technology that takes advantage of nonconventional arithmetic, in particular RNS [93], [94]. The inherent parallelism of RNS and the energy efficiency of integrated photonics can be explored to build high-speed RNS-based arithmetic units supported on integrated photonic switches. RNS arithmetic is performed through the spatial routing of optical signals in 2×2 Hybrid Photonic-Plasmonic (HPP) switches, following a computing-in-the-network paradigm. The block diagram of these 2×2 HPP switches is represented in Fig. 14. The switch is fabricated by using indium tin oxide (ITO) as the active index modulation material [95]. In the bar state, the light propagates straight into the bar arm (Fig. 14a), while in the cross state, the light is routed to the opposite (cross) arm output (Fig. 14b). A voltage signal is applied to the switch to control the guidance of the input light to the desired output port. Theoretically, an HPP switch could operate at 400 GHz. However, the speed of the whole system is limited by the speed of the modulators and photodetectors, as well as by the electronic circuit used to control the routing [94].

As discussed before, RNS arithmetic is applied digitwise; thus, independently for each digit, a number of these switches are interconnected to perform the computation. This is necessary to make systems scalable, since the number of switches grows with the square of the number of bits. For example, for modular addition, the shifting RNS adder explores the shifting property: a modulo m adder requires a set of m − 1 diagonally aligned HPP switches in each of the m − 1 levels, with (m − 1)² switches in total. For example, Fig. 15a illustrates an adder modulo m = 5, requiring 4² = 16 switches organized in 4 levels. When computing x + y, x can be represented by the input light source, while y electrically specifies the number of levels in which the switches are in the cross state (in the example, levels 1 to 3 for adding 3). For this shift-level selection, TC is used to enable level switching in the RNS channels, as presented in Section II-B.

Fig. 15: Calculating 〈4 + 3〉₅ = 2 with modulo 5 adders: (a) RNS shifting adder, (b) RNS All-to-all Sparse Directional (ASD) adder

The ASD RNS adder uses a routing network as the computing device [93]. To compute x + y, x is represented by the input light source, and y by the switch states defined by binary electrical signals. By storing in a Look-Up Table (LUT) the precalculated states that satisfy the all-to-all nonblocking routing network, the input light is routed to the resulting output port. Fig. 15b depicts a design for m = 5 and the light path used to calculate, as in the previous example, 〈4 + 3〉₅ = 2, now through an ASD network. A modulo m ASD RNS adder requires (m − 2)²/2 + 2 switches (for odd m). While the number of switches required by the ASD RNS adder is smaller than that of the shifting RNS adder, more complex electronic control circuitry, typically including LUTs, is required.
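The routing behavior of the shifting adder is easy to emulate in software (a sketch of the routing only, not of the optics):

def shifting_rns_add(x, y, m):
    """Emulate the modulo-m shifting adder: the light entering at port x
    is shifted one position in each of the y levels (of m-1 in total)
    whose switches are set to the cross state."""
    port = x
    for level in range(m - 1):
        if level < y:                 # level set to the cross state
            port = (port + 1) % m     # light routed one port over
    return port

assert shifting_rns_add(4, 3, 5) == 2     # <4 + 3>_5 = 2, as in Fig. 15
print(shifting_rns_add(4, 3, 5))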

Integrated nanophotonic RNS-based arithmetic processors exhibit a very short computational execution time, achieved by optical propagation through an integrated nanophotonic router, on the order of tens of picoseconds. In comparison to All-Optical-Switch (AOS) technology, also used for RNS-based arithmetic [96], integrated nanophotonics requires a relatively small area due to the compact transistor size. The energy per operation, on the order of fJ/bit, has been shown to be orders of magnitude lower [93].

C. Nanoscale memristors and hybrid memories

Several alternatives to existing semiconductor memories, supported by advances in technologies and devices, have recently been proposed [97]. This has led, for example, to the re-emergence of PIM architectures based on Non-Volatile Memory (NVM) and 3D monolithic technologies to meet the demands of extensive data processing for current and next-generation computing. This subsection discusses how nonconventional arithmetic plays an important role in memory technologies, namely, to perform PIM.

Hybrid memories combine CMOS with non-CMOS devices in a single chip as one alternative to existing semiconductor memories [98]. These memories use nanowires, carbon nanotubes, molecules and other technologies to build the memory cell arrays, while CMOS devices are used to form the peripheral circuitry. One of the first articles to investigate the application of RRNS to improve the reliability of hybrid memories was published in 2011 [98]. A new RRNS code is proposed for a 6-moduli set, where two moduli are nonredundant and four are redundant, and maximum likelihood decoding is applied to resolve the ambiguity that occurs for less than 10% of the decoded data.

The CREEPY core, referred to in Section II-B, also adopts RRNS to achieve computational error correction for new devices with signal energies approaching the k_BT noise floor [53], such as the Ferroelectric Transistors (FEFETs) [99]. The 4T nonvolatile differential memory based on FEFETs is a good energy-efficient candidate for PIM [100].

Memristors, a contraction of the terms 'memory' and 'resistor', are devices that behave similarly to a nonlinear resistor with memory [101]. Memristors can be used to combine data storage and computation in memory, thus enabling non-von Neumann PIM architectures [102]. CNNs have been fully implemented in hardware using memristor crossbars, which are cross-point arrays with a memristor device at each intersection [103]. The processing-in-memory-and-storage (PIMS) project also adopts the RRNS for memristor-based technologies [104]. Moreover, it has been shown that SC is adequate to perform PIM on nanoscale memristor crossbars, allowing the extension of the flow-based crossbar computing approach to approximate stochastic computing [105]. Applications of in-memory stochastic computing to a gradient descent solver and a k-means clustering processor can be found in [57].

An end-to-end brain-inspired HDC nanosystem using the heterogeneous integration of multiple emerging nanotechnologies was proposed in [106]. It uses monolithic 3D integration of Carbon Nanotube Field-Effect Transistors (CNFETs) and Resistive Random-Access Memory (RRAM). Due to their low fabrication temperature, (1,952) CNFETs and (224) RRAM cells are integrated with fine-grained and dense vertical connections between the computation and storage layers. Integrating RRAM and CNFETs allows creating area- and energy-efficient circuits for HDC.

Multi-Level Cell (MLC) technology is used to store multiple bits in a single RRAM cell to reduce the hardware overhead. However, the accuracy of ANNs deteriorates due to resistance variations, an effect that is particularly pronounced with binary coding. The error-tolerance feature of SC can be explored to mitigate this reliability problem of RRAM MLC technology [107].

D. Invertible logic circuits

Invertible logic provides a reverse mode in which the output is fixed and the inputs take on values consistent with the output [78]. It is less restrictive than reversible logic. For example, the two-input OR logic gate is invertible if, when the output is kept at c = 1, the inputs (a, b) alternate equally, with a probability of 33%, between the values (1, 1), (0, 1), and (1, 0); but it is not reversible.

Boltzmann machine structures [83] have been proposed as the underlying structures for designing invertible logic circuits [78]. These machines can be represented as networks or graphs of simple interconnected processing elements (e.g., Fig. 16a for a two-input gate Y = A ∧ B) with signals taking the binary values +1 or −1. Every node in the undirected graph is fully connected to all others through bidirectional links, and an individual bias value is assigned to each node. The binary state and output of the ith node (m_i) are computed by (38): after calculating the weighted sum of all input connections (matrix J) and adding the bias terms (h), the nonlinear activation function (tanh) is applied. A noise source rnd(−1,+1) (a uniformly distributed random real number between −1 and +1) is introduced to avoid getting trapped in local minima, and finally, the sign function (sgn) provides the binary outputs +1 or −1. I₀ is a scaling factor (an inverse pseudo-temperature). States are categorized as valid and invalid. If the nodes are unconstrained, the system fluctuates among all possible valid states. A logic value of '0' is assigned to m_i(t) = −1, and '1' to m_i(t) = 1.

m_i(t + Δt) = sgn( rnd(−1,+1) + tanh(I_i(t + Δt)) )

I_i(t + Δt) = I₀ ( h_i + Σ_j J_ij m_j(t) )   (38)

The values of h and J for the AND gate are shown in Fig. 16b. The h vector elements correspond to the bias weights close to the nodes in the graph (in the order h₀ for A, h₁ for B, and h₂ for Y) (Fig. 16a). Matrix J, which, as expected, is symmetric, represents the weights of the bidirectional connections, also following the order A, B, Y. The gate is invertible; thus, for example, when Y is clamped to 1, the only valid state is (A = 1, B = 1).

h = [+1  +1  −2]

J = |  0  −1  +2 |
    | −1   0  +2 |
    | +2  +2   0 |

Fig. 16: Boltzmann machine structure for the AND gate [78]: (a) graph with three nodes, where A and B denote the binary inputs and Y the output; (b) the h and J of (38)
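The invertible behavior of the gate can be reproduced by iterating (38) directly; a small simulation with the h and J of Fig. 16b, clamping Y to logic '1' (the value of I₀ and the number of iterations are arbitrary choices here):

import numpy as np

rng = np.random.default_rng(0)
h = np.array([+1.0, +1.0, -2.0])                 # biases for A, B, Y
J = np.array([[0, -1, +2],
              [-1, 0, +2],
              [+2, +2, 0]], dtype=float)         # symmetric connection weights
I0 = 2.0                                          # scaling factor (inverse pseudo-temperature)

m = rng.choice([-1.0, 1.0], size=3)
m[2] = 1.0                                        # clamp the output Y to logic '1'
for _ in range(1000):
    for i in (0, 1):                              # update only the unclamped nodes A, B
        I = I0 * (h[i] + J[i] @ m)                # weighted sum plus bias, as in (38)
        m[i] = np.sign(rng.uniform(-1, 1) + np.tanh(I))

print(m[:2])    # converges to (A, B) = (+1, +1), the only valid state for Y = 1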

The organization of the nodes, which operate in parallel, and their type, simple processing elements with binary inputs and outputs, make Boltzmann machines well suited for the SC nonconventional arithmetic discussed in the previous section [78]. In Fig. 17, an FSM state, as shown in Fig. 10, is stored in the accumulator: for the highest state of the FSM (S_{N−1}), a positive input will not cause any increase in the value stored in the accumulator; the inverse also holds for negative input signals and the lowest state. The scaling values in (38) are included in the weight w_i terms in Fig. 10.

SC can be used to implement arithmetic functions based on Boltzmann machines. A graph of the Boltzmann machine for an invertible 1-bit FA is represented in Fig. 18.


Fig. 17: Simplified processing element using SC [78]: w_i represents the factor associated with the I_i inputs; w_rnd represents the weight associated with the noise source

h = [0  0  0  0  0]

J = |  0  −1  −1  +1  +2 |
    | −1   0  −1  +1  +2 |
    | −1  −1   0  +1  +2 |
    | +1  +1  +1   0  −2 |
    | +2  +2  +2  −2   0 |

Fig. 18: Boltzmann machine structure for the 1-bit FA [78]: five nodes, where A, B and Ci denote the binary inputs, and S and Co the outputs; (b) the h and J of (38)

A, B and Ci are the inputs, and Co and S are the outputs of the 1-bit FA [78]. In this case, the h vector elements in Fig. 18b are all zero. The symmetric matrix J follows the order A, B, Ci, S, Co. Invertible multipliers can also be implemented, which can be used for factorization. A 5-by-5-bit optimized invertible multiplier/divider/factorizer occupies 53,818 µm², which is 13 times the area of the conventional binary logic counterpart [78]. Regarding latency, the number of convergence cycles differs depending on the outputs but is orders of magnitude smaller than the alternatives for invertible operations. This observation also demonstrates the ability to design circuits capable of invertible operations with more complex behaviors.

E. DNA-based computing

The DNA found in living cells is composed of four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Each base is attached to its neighbor base in the sequence via phosphate bonding. Two DNA sequences bond with each other between complementary base pairs (A-T and C-G), forming the DNA double helix. During the formation of a DNA double strand, two complementary single strands bond with each other in an antiparallel fashion.

The first proposals for representing binary numbers with DNA and performing arithmetic operations emerged in the nineties [108]. However, the first addition algorithms provided output in a different form than the input, thus preventing the iterative application of operations to implement arithmetic systems. Moreover, earlier DNA computing models did not support addresses and operations on single strands, thus not allowing arithmetic operations to be performed through parallel processing [109].

Fig. 19: Memory strands, stickers and DNA-based number representation

More recently, models such as the sticker-based DNA model have been proposed [110]. The sticker-based DNA model is used to represent fixed-point and floating-point numbers, overcoming the limitations of the previous models and allowing iterative and parallel processing to be carried out [111]. With DNA stickers, a binary number is represented through two groups of single-stranded DNA molecules: i) the memory strand, a long DNA molecule that is subdivided into nonoverlapping segments, and ii) a set of stickers, i.e., short DNA molecules, each with the length of a segment of the memory strand. Each one of the nonoverlapping segments represents a bit, and a sticker is complementary to one and only one of the nonoverlapping segments. As depicted in Fig. 19, a segment on the memory strand annealed to the matching sticker represents a "1"; otherwise, the value of the bit is "0".

Test tubes contain sets of strands representing binary numbers, either the operands or the results of arithmetic operations. The functional formulation of the operations on the sticker-based DNA model allows the setup of algorithms for performing DNA-based computing [111]. Three main operations are essentially considered to implement logic and arithmetic operations: Combine, Separate and Set.
• Combine(Td, Ts1, ..., Tsn) - pour the contents of tubes Tsi into Td and then empty the Tsi; formally, using set theory, Td ≡ ⋃_i Tsi; Tsi ≡ ∅.

• Separate(Ts; i, B[1], B[0]) - separate the contents of Ts based on the value of the ith bit: the DNA strands with the ith bit equal to "1" are inserted into tube B[1], while the other DNA strands go into tube B[0].

• Set(Tsd; i) - set the ith bit of all DNA strands in tube Tsd, which serves as both the source and the destination in this operation.

Several logic and arithmetic operations on the sticker-based DNA model have been proposed [111]; the bitwise AND operation over two n-bit vectors is presented in Algorithm 1, and the n-bit integer ADD in Algorithm 2. In Algorithm 1, the output of the AND operation over the operands represented by the strands in source tubes Ts1 and Ts2 is made available in the destination tube Td. After pouring the contents of tubes Ts1 and Ts2 into the auxiliary tube Ta (line 1), the strands in this tube are separated according to the value of the bits in each of the n positions into tubes B[1] and B[0] (line 3); whenever the value of the ith bit in the two strands is "1", the ith bit of the DNA strand in tube Td is set (line 5). Strands are recombined in tube Ta in line 7.


Algorithm 1 AND(Ts1, Ts2, n:in; Td:out)
Require: Pour a blank strand of n bits (0...0) into Td
Ensure: bit stream in Td = bit stream in Ts1 ∧ bit stream in Ts2
1: Combine(Ta, Ts1, Ts2) {Ta: auxiliary tube}
2: for all bits 0 ≤ i < n do
3:   Separate(Ta, i, B[1], B[0])
4:   if B[0] is empty then
5:     Set(Td, i)
6:   end if
7:   Combine(Ta, B[1], B[0])
8: end for
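A software emulation of Algorithm 1 makes the tube semantics concrete; here, tubes are modeled as lists of bit-list strands (an illustrative abstraction, not part of [111]):

# Tubes are lists of strands; a strand is a list of n bits.
def combine(dst, *srcs):
    for s in srcs:
        dst.extend(s); s.clear()

def separate(src, i):
    b1 = [s for s in src if s[i] == 1]
    b0 = [s for s in src if s[i] == 0]
    src.clear()
    return b1, b0

def set_bit(tube, i):
    for s in tube:
        s[i] = 1

def dna_and(x, y, n):
    Ts1, Ts2, Td, Ta = [x[:]], [y[:]], [[0] * n], []
    combine(Ta, Ts1, Ts2)                  # line 1
    for i in range(n):                     # line 2
        B1, B0 = separate(Ta, i)           # line 3
        if not B0:                         # line 4: both bits are 1
            set_bit(Td, i)                 # line 5
        combine(Ta, B1, B0)                # line 7
    return Td[0]

print(dna_and([1, 0, 1, 1], [1, 1, 0, 1], 4))    # -> [1, 0, 0, 1]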

Algorithm 2 applies the ADD operation over the n-bit operands represented by strands in source tubes Ts1 and Ts2, and the sum is deposited into the destination tube Td. Although the addition algorithm is supported by binary arithmetic, we present it to illustrate how arithmetic is implemented in biological substrates. The auxiliary tubes Tnc and Tc are used to register the absence or existence of the carry bit, respectively. By pouring the contents of tubes Ts1 and Ts2 into the auxiliary tube Tnc, it is considered that the initial carry-in is equal to "0" (line 1). For the Least Significant Bit (LSB), the strands in the Tnc tube (line 3) are separated according to the value of this bit into tubes B[1] and B[0] (line 4). If neither B[1] nor B[0] is empty, the LSBs have different values (line 5). Since no carry-in is present, the LSB of the DNA strand in tube Td is set (line 6). No carry-out is generated; thus, both strands are placed back in the Tnc tube (line 7). On the other hand, if both LSBs take the value 1 (line 9), the sum bit is "0", and we only have to put both strands in Tc (line 10) to signal that a carry-in exists for the next bit. For the last condition, in line 11, both LSBs are "0", and we only have to return both strands to Tnc (line 12). The second half of Algorithm 2, from line 15, is similar to the first but covers the case where a carry-in exists, which means that tube Tnc is empty and tube Tc contains the input strands. All this processing is iteratively repeated from the LSB to the Most Significant Bit (MSB) (line 2).

Current DNA computational systems are constrained by their poor integration capability, complex device structures or limited computational functions. However, the ability to build efficient DNA logic gates and cascade circuits based on polymerase-mediated strand displacement synthesis has been shown [112]. The pillars are a single-rail AND gate built with three strands and a single-rail OR gate built with two strands. Dual-rail DNA logic gates were built by parallelizing a single-rail AND gate and a single-rail OR gate to construct a logical expression. The NOT gate, fundamental to the construction of general logic systems, is difficult to build but can be achieved by reversing the definition of the strands and constructing a single-rail AND gate and a single-rail OR gate [112].

A DNA ALU, with four 1-bit functions (FA, AND, OR and NAND) and the decoding and controlling logic, was constructed [112]. It was built with 16 equivalent logic gates and consists of 27 DNA species and 74 DNA strands. After purifying the strands with PolyAcrylamide Gel Electrophoresis (PAGE), the logic gates and circuits are tested, and the results are visualized with the real-time Polymerase Chain Reaction (PCR); PAGE can also be followed by gel imaging.

Algorithm 2 ADD(Ts1, Ts2, n:in; Td:out)
Require: Pour a blank strand of n bits (0...0) into Td
Ensure: bit stream in Td = bit stream in Ts1 + bit stream in Ts2
1: Combine(Tnc, Ts1, Ts2) {Tnc: no-carry auxiliary tube}
2: for bits i = 0 to n − 1 do
3:   if Tnc is not empty then
4:     Separate(Tnc, i, B[1], B[0])
5:     if neither B[0] nor B[1] is empty then
6:       Set(Td, i) {bits in position i have different values}
7:       Combine(Tnc, Ts1, Ts2)
8:     else
9:       if B[0] is empty then
10:        Combine(Tc, Ts1, Ts2) {Tc: (with-)carry auxiliary tube}
11:      else
12:        Combine(Tnc, Ts1, Ts2) {bits in position i are both 0}
13:      end if
14:    end if
15:  else
16:    Separate(Tc, i, B[1], B[0])
17:    if neither B[0] nor B[1] is empty then
18:      Combine(Tc, Ts1, Ts2) {bits in position i have different values; the carry propagates}
19:    else
20:      if B[0] is empty then
21:        Set(Td, i) {bits in position i are both 1}
22:        Combine(Tc, Ts1, Ts2)
23:      else
24:        Set(Td, i) {bits in position i are both 0}
25:        Combine(Tnc, Ts1, Ts2)
26:      end if
27:    end if
28:  end if
29: end for
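The same abstraction extends to Algorithm 2; a compact, self-contained emulation that tracks the carry through the Tnc/Tc tubes (again only an illustration of the algorithm's control flow):

def dna_add(x, y, n):
    # Tubes hold strands (bit-lists); Tnc/Tc encode carry = 0/1.
    Td, Tnc, Tc = [0] * n, [x[:], y[:]], []
    for i in range(n):
        src = Tnc if Tnc else Tc           # lines 3/15: tube holding the strands
        carry = src is Tc
        bits = sorted(s[i] for s in src)   # emulate Separate on bit i
        if bits == [0, 1]:                 # bits differ
            if not carry:
                Td[i] = 1                  # line 6
            dst = Tc if carry else Tnc     # carry propagates (line 18) or not (line 7)
        elif bits == [1, 1]:               # both 1
            if carry:
                Td[i] = 1                  # line 21
            dst = Tc                       # a carry out is generated
        else:                              # both 0
            if carry:
                Td[i] = 1                  # line 24
            dst = Tnc
        strands = src[:]; src.clear(); dst.extend(strands)
    return Td

# LSB-first bit-lists: 3 + 6 = 9 -> [1, 0, 0, 1]
print(dna_add([1, 1, 0, 0], [0, 1, 1, 0], 4))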

The results in [112] show that it is possible to integrate components to implement DNA computer systems. Leakage is the main challenge, especially when the size scales up. The limited purity of commercial chemosynthetic strands and DNA components is the primary source of leakage. Moreover, although any kind of DNA strand can be synthesized using biological methods, the preparative step requires a long time in practice, as does the preparation of the inputs [109].

Although the DNA manipulation required by these models has already been realized in the laboratory and the procedures have been implemented in practice, some defects exist in the procedures, thus hindering practical use. The RRNS has been applied to overcome the negative effects caused by the defects and instability of the biochemical reactions and by errors in hybridizations for DNA computing [113], an example of the application of nonconventional arithmetic to DNA-based computing. Applying the RRNS 3-moduli set {2^n − 1, 2^n + 1, 2^{n+1}} to the DNA model in [114], as discussed in Section II-B, leads to one-digit error detection. The parallel RRNS-based DNA arithmetic improves the reliability of DNA computing while at the same time simplifying the DNA encoding scheme [113].


Fig. 20: Single-qubit gates: symbols and unitary matrices:
(a) Hadamard gate: H = (1/√2) [1 1; 1 −1]
(b) Phase gate: S = [1 0; 0 j]
(c) π/8 gate: T = [1 0; 0 e^{jπ/4}]
(d) Pauli-X gate: X = [0 1; 1 0]
(e) Pauli-Y gate: Y = [0 −j; j 0]
(f) Pauli-Z gate: Z = [1 0; 0 −1]

F. Quantum computing

QC cannot stand alongside DNA computing or any other type of classical computation referred to so far in this paper. Classical computing operates over bits, and even DNA-based computing refers only to the substrate on which computation over these bits is performed. The number of logical states for an n-bit representation is 2^n, and Boolean operations on the individual bits are sufficient to realize any deterministic transformation. A quantum bit (qubit), by contrast, typically a microscopic unit such as an atom or a nuclear spin, is a superposition of basis states, orthogonal and typically represented by |0⟩ and |1⟩. In Dirac notation, also known as bra-ket notation, a ket such as |x⟩ refers to a vector representing a state of a quantum system. A qubit, represented by the vector |x⟩, corresponds to a linear combination, the superposition of the basis vectors with coefficients α and β defined in a unit complex vector space called the Hilbert space (ℂ²) (39).

|x⟩ = α|0⟩ + β|1⟩ ;  |α|² + |β|² = 1   (39)

Regarding measurement, the superposition α|0⟩ + β|1⟩ yields |0⟩ with probability |α|² and |1⟩ with probability |β|².

Common single-qubit gates are represented in Fig. 20. Since operations on a qubit preserve the norm of the vectors, the gates are represented by 2×2 unitary matrices. Some algebraic properties, such as H = (X + Z)/√2 and S = T², are useful to correlate some of these quantum gates.
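These identities are straightforward to confirm numerically:

import numpy as np

s2 = np.sqrt(2)
H = np.array([[1, 1], [1, -1]]) / s2
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
S = np.array([[1, 0], [0, 1j]])
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])

assert np.allclose(H, (X + Z) / s2)          # H = (X + Z)/sqrt(2)
assert np.allclose(S, T @ T)                 # S = T^2
assert np.allclose(H @ H, np.eye(2))         # gates are unitary; H is its own inverse
print("gate identities verified")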

Fig. 21: CNOT gate: (a) circuit representation, with the control qubit on top and the target qubit on the bottom; (b) matrix representation:

CNOT = | 1 0 0 0 |
       | 0 1 0 0 |
       | 0 0 0 1 |
       | 0 0 1 0 |

A two-qubit system can be represented by a vector in the Hilbert space ℂ² ⊗ ℂ², with ⊗ denoting the tensor product, which is isomorphic to ℂ⁴. Thus, the basis of ℂ² ⊗ ℂ² can be written as:

|0〉 ⊗ |0〉 , |0〉 ⊗ |1〉 , |1〉 ⊗ |0〉 , |1〉 ⊗ |1〉

and |a⟩ ⊗ |b⟩ is often expressed as |a⟩|b⟩ or |ab⟩. Generally, the state of an n-qubit system is expressed by (40) in ℂ^{2^n}:

Υ = Σ_{b∈{0,1}^n} c_b |b⟩ ;  Σ_b |c_b|² = 1   (40)

with the state of an n-particle system being represented in a 2^n-dimensional space. The exponentially large dimensionality of this space makes quantum computers much more powerful than classical analog computers, whose state is described by a number of parameters proportional only to the size of the system. By contrast, 2^n complex numbers are required to keep track of the state of an n-qubit system.

If the qubits are allowed to interact, then the closed system includes both qubits together, meaning that the qubits are entangled. When two or more qubits are entangled, there exists a special connection between them: the outcome of the measurement of one qubit is correlated with the measurements of the other qubits. The quantum version of the CNOT gate in Fig. 12a is an example of the interaction between 2 qubits that operate on the same basis:

|00⟩ → |00⟩ , |01⟩ → |01⟩ , |10⟩ → |11⟩ , |11⟩ → |10⟩

The CNOT gate flips the state of the target (t) qubit according to the state of the control (c) qubit. If the control qubit is set to |1⟩, then the target qubit is flipped; otherwise, nothing is done to the target qubit. This action can be represented as |c⟩|t⟩ → |c⟩|c ⊕ t⟩. Fig. 21 provides the CNOT gate matrix and circuit representations.

A quantum circuit that implements a 1-bit FA, acting on the superposition of states, is presented in Fig. 22. Applying the equations of classic reversible logic (Fig. 12) to the circuit in Fig. 22 yields the output equations for the full adder: P = A, Q = B, S = A ⊕ B ⊕ Ci, and Co = AB + ACi + BCi. Thus, if, for example, the four qubits of the input are set to the state |ABCiX⟩ = |1010⟩, then the system goes, with probability equal to 1, to a state that provides the output |PQSCo⟩ = |1001⟩.
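Because the circuit uses only CNOT and TG gates, its action on basis states can be simulated classically. The sketch below uses one standard CNOT/TG full-adder ordering, which is assumed here and is not necessarily the exact gate order of Fig. 22; it reproduces the |1010⟩ → |1001⟩ example:

def cnot(state, c, t):
    state[t] ^= state[c]

def toffoli(state, c1, c2, t):
    state[t] ^= state[c1] & state[c2]

def quantum_fa(a, b, ci):
    q = [a, b, ci, 0]               # qubits |A B Ci X>, ancilla X = |0>
    toffoli(q, 0, 1, 3)             # X = AB
    cnot(q, 0, 1)                   # B = A xor B
    toffoli(q, 1, 2, 3)             # X = AB xor (A xor B)Ci = Co
    cnot(q, 1, 2)                   # Ci = A xor B xor Ci = S
    cnot(q, 0, 1)                   # restore B
    return q                        # |P Q S Co>

print(quantum_fa(1, 0, 1))          # |1010> -> [1, 0, 0, 1], i.e., |1001>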

Quantum gates and quantum computers have already been developed; for example, TGs based on quantum mechanics were successfully realized more than ten years ago at the University of Innsbruck [115].


Fig. 22: Quantum 1-bit FA: diagram of a circuit using CNOT and TG reversible gates; behavior simulated on the Quantum Inspire (QI) platform [168]

Fig. 23: The transmon qubit and the quantum chip [169]: a type of superconducting charge qubit; the superconductors are capacitively shunted in order to decrease the sensitivity to charge noise

The IBM Q systems are physically supported by the transmon qubit represented in Fig. 23. The transmon qubit circuit corresponds to Josephson junctions, kept at a very low temperature, shunted by an additional large capacitance and matched by another comparably large gate capacitance (Fig. 1 of [116]). Although the transmon qubit is closely related to the Cooper Pair Box (CPB) qubit [117], it is operated at a significantly different ratio of the Josephson energy to the charging energy. It overcomes the main weakness of the CPB by featuring an exponential gain in the insensitivity to charge noise [116]. IBM Q processors are composed of transmon qubits that are coupled and addressed through microwave resonators, as depicted in Fig. 23. Several quantum computers have been constructed based on this technology, the most powerful and recent one being a 53-qubit system [118].

In the next section, applications of nonconventional arithmetic, new technologies and systems to emerging areas are discussed. While QC algorithms have been derived to solve linear systems of equations with low complexity [119], we focus there on the capacity of QC to attack the current security measures of cryptographic protocols.

IV. APPLICATIONS IN EMERGING AREAS

Postquantum cryptography and machine learning, in particular DL, have been selected as case studies of applications that use nonconventional arithmetic supported by the technologies and systems presented in the previous sections. The selection criterion was the importance of these applications in emerging areas, particularly of those that can benefit the most from nonconventional arithmetic.

A. Deep learning

A type of DL dominant in a broad spectrum of applications is the CNN [120], a class of the general Deep Neural Network (DNN). The CNN encompasses two main phases of operation: training and inference. Training is an offline process, usually performed on powerful servers, for learning features from (typically massive) databases. Inference applies the trained model to predict and classify according to the features "learned" by the neural network. A CNN architecture consists of layers of convolution, activation, pooling, and fully connected neural networks [120]. The convolution kernels, represented by the weights w in the general equation (41), are applied to identify features, also adding a scalar bias b in (41). For the activation, nonlinear functions, such as the sigmoid or the Rectified Linear Unit (ReLU) in (42), are applied to obtain the values of the feature maps. The pooling layer maps the features into a lower-resolution space, for example, with the functions in (43). Fully connected neural networks compute over the reduced spaces, applying weights to the inputs of artificial neurons that "fire" based on nonlinear functions. The large amount of computation and the high memory requirements are challenges faced in the development of efficient CNNs, especially in devices with autonomy and resource constraints. It will be shown how RNS and LNS, associated with fixed-point and FP representations, and SC can be used to face these challenges.

conv → Σ_i w_i y_i + b   (41)

act → h_t = tanh(y) ;  h_s = 1/(1 + e^{−y}) ;  h_ReLU = max(0, y)   (42)

pool → p_max = max(y) ;  p_L2 = √(Σ_i y_i²)   (43)
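A minimal numerical illustration of (41)-(43) on a small vector (real CNNs apply these operations over entire feature maps; the values are arbitrary):

import numpy as np

y = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
b = 0.1

conv = np.dot(w, y) + b                    # (41): weighted sum plus bias
h_relu = np.maximum(0.0, conv)             # (42): ReLU activation
h_sig = 1.0 / (1.0 + np.exp(-conv))        # (42): sigmoid activation
p_max = np.max(y)                          # (43): max pooling
p_l2 = np.sqrt(np.sum(y ** 2))             # (43): L2 pooling
print(conv, h_relu, h_sig, p_max, p_l2)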

Although several proposals have been made to apply RNS to compute a CNN, only a few can be fully computed in the RNS domain [121]–[123]. These complete solutions mitigate the overhead imposed by the nonlinear functions of the activation and pooling layers by carefully selecting the RNS moduli sets. In [121], the 4-tuple {2^n ± 1, 2^{n+1} ± 1} takes advantage of the fact that M in (9) is odd. Thus, [41] can be applied to perform the comparison required to compute the max function in (42) and (43). The fully RNS-based CNN accelerator in [123], i.e., Res-DNN, is supported by the Eyeriss architecture [124] and the RNS arithmetic for the 3-moduli set {2^n − 1, 2^n, 2^{n+1} − 1}, preventing overflow by using the BE algorithm referenced in II-B (extending 2^n to the modulo 2^{n+e}).


Fig. 24: Computation of the dot product and ReLU in the LNS domain

A dynamic range partitioning approach is adopted to implement the comparison operation in the RNS domain [125], while for sign detection in the ReLU unit, the MRC method is adopted [126].

Apart from the large amount of computing, CNNs also require high memory bandwidth due to the large volume of data that must be moved between the memory and the processing units. CNNs fully computed in the RNS domain have also been proposed for PIM (RNSnet) [122]. RNSnet simplifies the fundamental CNN operations and maps them, with the support of RNS, to in-memory addition and data access. The nonlinear activation functions are approximated using a few of the first terms of the Taylor expansion, and the comparison for pooling is implemented using only addition by taking advantage of the characteristics of the traditional 3-moduli set {2^n − 1, 2^n, 2^n + 1}.

Following a different approach, the fact that the weights and activations in a CNN naturally have nonuniform distributions can be used to reduce the bit-width of the data representation and simplify the arithmetic [12], [127]. Inspired by the Kulisch approach [128], which removes the sources of rounding errors in a dot product through an exact multiplier and a wide fixed-point accumulator, products and accumulations can be performed in both the logarithmic and the linear domains [127]. This approach works for the FP and fixed-point representations, reducing the data bit-width through nonuniform quantization.

Assuming that translating the number m.f from the logarithmic to the linear domain is easier than computing log₂(1 ± 2^x) (6), the Exact Log-linear Multiply-Add (ELMA) [127] performs multiplications in the logarithmic domain and then approximates the results in the linear-domain floating-point representation for accumulation. Translating m imposes a multiplication by 2^m in the linear domain, i.e., a floating-point exponent addition or a fixed-point bit shift. If the number of bits of the fractional part f is small, then LUTs can be used in practice to map from the logarithmic to the linear domain and from the linear back to the logarithmic domain.

The tapered floating point, for which the exponent and significand field sizes vary according to a third field, as in the posit number system [129], has also been adopted to design CNNs. The posit number system is characterized by (N, s): the word length in bits (N) and the exponent scale (s). The results show that the accuracy of the multiply and add operations with the proposed ELMA (N = 8, s = 1, α = 5, β = 5, γ = 7) is approximately the same as that of the posit (8, 1) and leads to very efficient hardware implementations.

Fig. 24 shows the samples and the weights for convolution represented in the log-domain [12] as x̃_i = Quantize(log₂(x_i)) (3-bit width) and w̃_i = Quantize(log₂(w_i)) (4-bit width), respectively.

conv = Σ_i 2^{x̃_i + w̃_i} = Σ_i BitShift(1, x̃_i + w̃_i)   (44)

In (44), the accumulation is still performed in the linear domain, with BitShift(1, z) denoting the function that bit-shifts 1 by an integer z in fixed-point arithmetic. To reach the method in Fig. 24, the accumulation is also performed in the log-domain by using the approximation log₂(1 + x) ≈ x for 0 ≤ x < 1. Considering the accumulation of only two terms, i.e., p₁ = x̃₁ + w̃₁ and p₂ = x̃₂ + w̃₂, (44) can be rewritten as (45). (45) can easily be generalized for any number of terms i and p_i, leading to the complete computation in the logarithmic domain of the convolution layers and the remaining layers of the CNN [12]. Moreover, an end-to-end training procedure based on the logarithmic representation is proposed in [12].

log₂(conv) = log₂(2^{p₁} + 2^{p₂}) = log₂[2^{p₁}(1 + 2^{p₂}/2^{p₁})] = p₁ + log₂(1 + 2^{p₂−p₁}) ≈ p₁ + BitShift(1, p₂ − p₁)   (45)
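The quality of the approximation in (45) can be probed numerically; a sketch with integer p_i, as in the quantized log-domain:

import math

def bitshift(one, z):
    # BitShift(1, z) for integer z: 2**z in fixed-point arithmetic
    return one * (2.0 ** z)

def log2_accum(p1, p2):
    # Approximate log2(2**p1 + 2**p2) as in (45), assuming p1 >= p2
    if p1 < p2:
        p1, p2 = p2, p1
    return p1 + bitshift(1, p2 - p1)

for p1, p2 in [(5, 3), (4, 4), (7, 0)]:
    exact = math.log2(2 ** p1 + 2 ** p2)
    print(p1, p2, exact, log2_accum(p1, p2))   # the error stems from log2(1+x) ~ x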

SC is another example of nonconventional arithmetic applied to design CNNs [130], [131], [133]–[137]. In [130], the pooling layers are implemented with an improved version of the SC max unit based on the FSMs presented in Section II-C [132]. The SC-based ReLU, SReLU, is also implemented through an FSM that mimics the three key operations of the artificial neuron: integration, output generation, and decrement. To improve the signal-to-noise ratio, weight normalization and upscaling are performed, while the overhead of binary-to-stochastic conversion is reduced by sharing number generators. In [131], the linear rectifier used for the activation function is realized through an FSM, and pooling is achieved by simply aggregating streams to compute either the average or the max function. Compared with the existing deterministic CNN architectures, the stochastic-based architecture can achieve, at the cost of a slight degradation in computing accuracy, significantly higher energy efficiency at low cost. The SC-DCNN is another design and optimization framework for SC-based CNNs, which adopts a bottom-up approach [133]. By connecting different basic function blocks with joint optimization and applying efficient weight storage methods, the area and power (energy) consumption are reduced while maintaining high network accuracy. Integral SC representations, supported on integer stochastic streams obtained by the summation of two or more binary stochastic streams, have also been proposed for designing efficient stochastic-based DNN architectures [134]. Hybrid stochastic-binary neural networks, particularly for near-sensor computing, have also been proposed [135]. Combining signal acquisition and SC in the first ANN layer, binary arithmetic is applied to the remaining layers to avoid compounding accuracy losses. It is also shown that retraining these remaining ANN layers can compensate for the precision loss introduced by SC. Improved SC-based multipliers and their vector extension have also been proposed for application to DNNs [136]. Moreover, the SC-based convolution can be improved in CNNs by producing the result of the first multiplication and performing the subsequent multiplications with the differences of the successive weights [137].



An SC-based DL framework was implemented with AQFP technology [79], a superconducting logic family that achieves high energy efficiency, as described in Section III-A. Through an ultraefficient stochastic number generator and a high-accuracy subsampling (pooling) block, both in AQFP, the implemented CNN achieves four orders of magnitude higher energy efficiency than the CMOS-based implementation while maintaining 96% accuracy on the MNIST dataset.

B. Cryptography

Difficult mathematical problems are used to ensure the security of public-key cryptographic systems. These systems are designed so that the derivation of the private key from the public key is as difficult as computing the solution to problems such as the factorization of large integers (e.g., RSA [138]) or the discrete logarithm (e.g., Elliptic Curve (EC) cryptography [170]). The number of steps required to solve these problems is a function of the length of the input; thus, the dimension of the keys is enlarged to make the systems secure (e.g., private keys larger than 512 and 1024 bits).

Modular reduction (modulo N) is a fundamental arithmetic operation in public-key cryptography for mapping results back to the set of representatives {0, ..., N − 1}. The Montgomery algorithm for modular multiplication replaces the costly reduction by a general value N with a reduction by L, typically a power of 2 [139]. However, since in RNS the reduction modulo a power of 2 is inefficient, an RNS variation of the Montgomery modular multiplication algorithm relies on the BE method presented in II-B by choosing L = M₁ and GCD(M₁, M₂) = 1 [140], [141]. Similarly, the reduction modulo L corresponds to the reduction modulo the dynamic range M₁, which automatically relies on the modulo operation in each of the RNS channels of S₁. Moreover, ⟨N⟩_{M₁} and ⟨M₁⟩_{M₂} exist, as required for the precomputed constants used in the modified Montgomery modular multiplication algorithm [142].
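For reference, the binary (non-RNS) Montgomery reduction that the RNS variant parallelizes can be sketched in a few lines (a textbook REDC; the toy modulus is arbitrary):

def montgomery_redc(t, N, L, N_prime):
    # REDC: returns t * L^{-1} mod N, for 0 <= t < L*N,
    # where L is a power of two, gcd(L, N) = 1 and
    # N_prime = -N^{-1} mod L is precomputed.
    m = (t * N_prime) % L
    u = (t + m * N) // L          # exact division: the low bits cancel
    return u - N if u >= N else u

N, L = 97, 1 << 8                 # toy odd modulus and L = 2^8
N_prime = (-pow(N, -1, L)) % L    # precomputed constant (Python 3.8+)

a, b = 55, 71
aR, bR = (a * L) % N, (b * L) % N             # map operands to the Montgomery domain
abR = montgomery_redc(aR * bR, N, L, N_prime) # product stays in the domain
assert montgomery_redc(abR, N, L, N_prime) == (a * b) % N
print((a * b) % N)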

A survey on the usage of RNS to support these classic public-key algorithms can be found in [142]. Multimoduli RNS architectures have also been proposed to obfuscate secure information. The random selection of moduli is proposed for each key-bit operation, which makes it difficult to obtain the secret key, preventing power analysis while still providing all the benefits of RNS [143].

However, if quantum computers become available, this type of public-key cryptography, based on the factorization of large integers or the discrete logarithm, will no longer be secure [144]. Therefore, there is a need to develop arithmetic techniques for alternative systems, usually designated postquantum cryptography [145]. Quite recently, another survey was published on the application of RNS and SC to this type of emergent cryptography [147]. We present two representative examples of the application of RNS and SC to Goldreich-Goldwasser-Halevi (GGH) postquantum cryptography and Fully Homomorphic Encryption (FHE), respectively.

Fig. 25: Two bases, (~r₀, ~r₁) and (~b₀, ~b₁), of the same lattice, along with the corresponding parallelepipeds in red and grey; ~c is reduced modulo the two parallelepipeds [147]

A lattice can be seen as the set generated by all linear combinations with integer coefficients of a set R = {~r₀, ..., ~r_{n−1}}, with ~r_i ∈ R^m, of linearly independent vectors, as defined in (46); the rank of the lattice is n, and its dimension is m. Two vectors in span(R) are congruent if their difference is in L(R).

L(R) = { Σ_{i=0}^{n−1} z_i ~r_i : z_i ∈ ℤ }   (46)

Each basis is associated with the parallelepiped:

P(R) = { Σ_{i=0}^{n−1} w_i ~r_i : w_i ∈ (−1/2, 1/2] }   (47)

For any point ~y = ~w + ~x ∈ span(R), where ~w ∈ L(R) and ~x ∈ P(R), the reduction of ~y modulo P(R) is defined as ~x = ~y mod P(R). The modular reduction has a different meaning for each basis, since each basis is associated with a different parallelepiped. An example is featured in Fig. 25, where the point ~c is reduced modulo both P(R) and P(B), producing two different points, represented as triangles, while L(R) = L(B).
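Reduction modulo a parallelepiped follows directly from the definition: express ~y in the basis, round the coordinates to the nearest integers (Babai's round-off), and keep the fractional remainder. A 2D sketch with arbitrary basis values:

import numpy as np

def mod_parallelepiped(y, R):
    # Reduce y modulo P(R): write y = w + x with w in L(R), x in P(R).
    coords = np.linalg.solve(R.T, y)   # coordinates of y in basis R (rows = ~r_i)
    w = np.rint(coords) @ R            # Babai round-off: a nearby lattice point
    return y - w                       # remainder: fractional coordinates near (-1/2, 1/2]

R = np.array([[3.0, 1.0],
              [1.0, 4.0]])             # rows are the basis vectors of a 2D lattice
U = np.array([[1, 1], [0, 1]])         # unimodular, so B spans the same lattice
B = U @ R
c = np.array([7.3, 9.1])

print(mod_parallelepiped(c, R))        # the same point c yields different
print(mod_parallelepiped(c, B))        # representatives modulo P(R) and P(B)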

Lattice-Based Cryptography (LBC) is supported by the Closest Vector Problem (CVP): given a basis R ∈ R^{n×m} and ~y ∈ R^m, find ~x ∈ L(R) such that ||~y − ~x|| = min_{~z∈L(R)} ||~y − ~z||. The private basis is produced as a rotated, nearly orthogonal basis, such that Babai's round-off [146] provides accurate solutions to the CVP. Rose's cryptosystem uses bases in an Optimal Hermite Normal Form (OHNF) as the public key, a subclass of the Hermite Normal Forms (HNFs) where all but the first column are trivial. The decryption algorithm is modified for its implementation with the RNS. The operation ⌊~cR^{−1}⌉ (⌊·⌉ denotes rounding to the nearest integer) can be replaced by the approximation ~µ of ⌊γ~cR^{−1}⌋ through an RNS Montgomery reduction [139], where ~c = (c, 0, ..., 0) and the scaling by γ enables the detection and correction of the errors resulting from the approximate RNS Montgomery reduction.


⌊γ~cR^{−1}⌋ is rewritten using integer arithmetic as in (48), where R̂ = R^{−1}d is an integer matrix and d = det(R).

⌊γ~cR^{−1}⌋ = ( γ~cR̂ − ⟨γ~cR̂⟩_d ) / d   (48)

It has been shown that the usage of RNS enables parallelizing the decryption in Rose's cryptosystem to significantly speed up its computation on both CPUs and GPUs [147].

Homomorphic encryption allows performing computations directly on ciphertexts, generating encrypted results as if the operations had been performed on the plaintext and the result then encrypted. FHE provides malleable ciphertexts, such that, given two ciphertexts representing the operands, it is possible to produce a ciphertext encrypting their product or sum. Structured lattices underpin classes of cryptosystems supporting FHE [148]. There is noise associated with the ciphertexts that grows as homomorphic operations are applied; thus, bootstrapping has been proposed, a technique in which ciphertexts are homomorphically decrypted [148]. Modern FHE systems rely on Ring Learning with Errors (RLWE), for which techniques have been proposed that limit the need for bootstrapping [147].

Batching [149] improves the performance of FHE based on the CRT by allowing multiple bits to be encrypted in a single ciphertext so that one can carry out AND and XOR sequences of bits using a single homomorphic multiplication or addition. For example, in an RLWE cryptosystem, binary polynomials are homomorphically processed in a cyclotomic ring. By noticing that certain cyclotomic polynomials factor modulo two into a set of polynomials with the same degree, one may take advantage of the CRT to associate a plaintext bit with each one of these polynomials. Homomorphic additions and multiplications then add and multiply the bits modulo their respective polynomials, achieving coefficientwise XOR and AND operations. Rotations of these bits may be accomplished with [150].

The operations that arise from FHE are evaluated, and efficient algorithm-hiding systems are designed for applications that take advantage of those operations in [151]. First, numbers can be represented as sequences of bits through stochastic representations. Since most FHE schemes support batching as an acceleration technique for processing multiple bits in parallel, one can map stochastic representations to batching slots to efficiently implement multiplications and scaled additions (19). Unlike traditional Boolean circuits, the number of bits required for stochastic computing is flexible, allowing for a flexible correspondence between the unencrypted and encrypted domains.

The homomorphic systems based on SC in [151] target the approximation of continuous functions. These functions are first approximated with Bernstein polynomials in a black-box manner. Hence, function development can take place with traditional programming paradigms, providing an automated way to produce FHE circuits. A stochastic sequence of bits is encrypted in a single cryptogram. It was concluded that stochastic representations are more widely applicable, while a fixed-point representation provides better performance [151].
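The black-box property comes from the fact that a degree-n Bernstein approximation evaluates the target function only at the points k/n; a minimal sketch (the smoothstep function below is an arbitrary example target):

```python
from math import comb

def bernstein(f, n):
    # Degree-n Bernstein approximation of f on [0, 1]:
    # B_n(f)(x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)
    return lambda x: sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
                         for k in range(n + 1))

f = lambda x: x * x * (3 - 2 * x)        # smoothstep, an arbitrary target
B = bernstein(f, 16)
err = max(abs(B(i / 100) - f(i / 100)) for i in range(101))
print(err)                               # roughly 0.02; shrinks as n grows
```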

A quantum computer can efficiently factor an arbitrary integer [153], which makes the widely used RSA public-key cryptosystem insecure. Factoring is achieved by reducing this computation to a special case of a mathematical problem known as the Hidden Subgroup Problem (HSP) [154], [155] (the HSP is a group-theoretic generalization of the problem of periodicity determination). The Abelian quantum hidden subgroup algorithms in [153], [156] are based on the Quantum Fourier Transform (QFT). The QFT, the counterpart of the Discrete Fourier Transform (DFT) in the classical computing model, is the basis of many quantum algorithms [152], [157], [158].
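As a concrete illustration, the QFT on n qubits is the unitary N×N DFT matrix with N = 2^n; the numpy sketch below uses the dense matrix form, whereas a quantum circuit realizes the same transform with O(n²) Hadamard and controlled-phase gates:

```python
import numpy as np

def qft_matrix(n):
    # QFT on n qubits (N = 2**n): QFT|j> = N^(-1/2) * sum_k omega^(j*k) |k>,
    # with omega = exp(2*pi*i/N); i.e., the unitary-scaled DFT matrix.
    N = 2**n
    j, k = np.meshgrid(np.arange(N), np.arange(N))
    return np.exp(2j * np.pi * j * k / N) / np.sqrt(N)

F = qft_matrix(3)
print(np.allclose(F.conj().T @ F, np.eye(8)))    # unitary: True

# A superposition with period 4 over N = 8 maps to peaks at multiples of
# N/4 = 2, the periodicity-reading step at the heart of Shor's algorithm.
state = np.zeros(8, dtype=complex)
state[[0, 4]] = 1 / np.sqrt(2)
print(np.round(np.abs(F @ state)**2, 3))         # mass on |0>, |2>, |4>, |6>
```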

V. CONCLUSIONS AND FUTURE RESEARCH

This survey covers the main types of nonconventional arithmetic from a holistic perspective, highlighting their practical interest. Several different classes of nonconventional arithmetic are reviewed, such as the LNS, RNS, SC and HDC, and their usage in various emergent applications, with different features and based on new technologies, is discussed. We show the importance of nonconventional number representation and arithmetic not only to implement fast, reliable, and efficient systems but also to shape the usage of the technology. For example, SC is useful for FHE, enabling the processing of multiple bits in parallel for batching, but it is also important for performing approximate computing on nanoscale memristor crossbars and for overcoming the deep pipelining nature of AQFP logic devices. Similarly, the RNS is suitable for machine learning and cryptographic applications, but it is also quite useful for mitigating the instability of biochemical reactions in DNA-based computing and for reducing the cost of arithmetic systems in integrated nanophotonics technology. Hyperdimensional representation and arithmetic, inspired by brain-like computing, adopt high-dimensional binary vectors to introduce randomness and achieve reliability. This type of nonconventional number representation offers a strong basis for in-memory computation, being a promising candidate for taking advantage of the physical attributes of nanoscale memristive devices to perform online training, classification and prediction.

One of the main conclusions that can be drawn from this survey is that the investigation of nonconventional arithmetic must take into account all the dimensions of the systems, which include not only computer arithmetic theory but also technology advances and the demands of emergent applications. For example, the results of further investigations of the randomness characteristics of SC and HDC may also impact applications that might benefit from approximate computations [66]. At the technological level, these classes of arithmetic may benefit from the ease with which nanotechnologies generate the required random numbers [79] while improving robustness to noise [171]. The efficiency offered by SC and HDC results in important savings in both integration area and hardware cost, which make these number representations suitable for the development of autonomous cyberphysical [161] and Internet of Things (IoT) processors [73].


Further research on the LNS and RNS should target difficult operations in the respective domains, namely, addition and subtraction in the logarithmic domain and division and comparison in the RNS domain. The accuracy and the implementation cost under different technologies are key aspects for the practical usage of this nonconventional arithmetic. In that respect, the development of tools such as the Computing with the Residue Number System (CRNS) framework, which automates the design of systems based on the RNS for exploring the increasing parallelism available in computing devices, is also important [162]. The redundancy of the RRNS and the randomness of SC and HDC make them suitable for improving reliability [113] and supporting the heterogeneous integration of multiple emerging nanotechnologies [106]. Moreover, it will be very useful to investigate efficient converters between nonconventional systems that allow them to be used together in different components of heterogeneous systems and for different applications.

New paradigms of computation will also become the focus of research on nonconventional computer arithmetic. The less disruptive, more straightforward path of research is the combination of nonconventional arithmetic with approximate computing. While for SC this combination is natural, for the other nonpositional number representations, a research effort focused on this purpose is required. For the other types of new paradigms of computation, such as QC, the publicly available quantum computer platforms, i.e., IBM Q [169] and QI from QuTech [168], provide both simulators and real quantum computers at the back end. They are quite useful for performing investigations of QC and for introducing this topic with a practical component in university curricula. They are valuable instruments for simulating new quantum algorithms and circuits for several applications, extending their usage far beyond the domain of quantum mechanics.

ACKNOWLEDGMENTS

This work was partially supported by the European Union Horizon 2020 research and innovation program under grant agreement No. 779391 (FutureTPM) and by national funds through the Fundação para a Ciência e a Tecnologia (FCT) under project number UIDB/50021/2020. I would also like to acknowledge the contributions of all the MSc and PhD students whom I have supervised over the years. They have been a source of inspiration and challenges in the adventure that research on computer arithmetic has been for me.

REFERENCES

[1] D. Bishop, "Nanotechnology and the end of Moore's law?" Bell Labs Technical Journal, vol. 10, no. 3, pp. 23–28, 2005.

[2] SocialMediaToday, "How much data is generated every minute?" 2018. [Online]. Available: https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/

[3] P. Martins, J. Eynard, J. Bajard, and L. Sousa, "Arithmetical improvement of the round-off for cryptosystems in high-dimensional lattices," IEEE Transactions on Computers, vol. 66, no. 12, pp. 2005–2018, 2017.

[4] T. Tay and C. Chang, "A non-iterative multiple residue digit error detection and correction algorithm in RRNS," IEEE Transactions on Computers, vol. 65, no. 2, pp. 396–408, 2016.

[5] S. H. Park, Y. Liu, N. Kharche, M. Salmani Jelodar, G. Klimeck, M. S. Lundstrom, and M. Luisier, "Performance comparisons of III-V and strained-Si in planar FETs and nonplanar FinFETs at ultrashort gate length (12 nm)," IEEE Transactions on Electron Devices, vol. 59, no. 8, pp. 2107–2114, 2012.

[6] TSMC, "7nm technology," https://www.tsmc.com/english/dedicatedFoundry/technology/7nm.htm, 2019.

[7] SEMATECH, "International technology roadmap for semiconductors: Emerging research devices," Tech. Rep., 2003.

[8] M. LaPedus, "Big trouble at 3nm," https://semiengineering.com/big-trouble-at-3nm/, 2018.

[9] Semiconductor Industry Association, "International technology roadmap for semiconductors 2.0: Beyond CMOS," Tech. Rep., 2015.

[10] N. G. Kingsbury and P. J. Rayner, "Digital filtering using logarithmic arithmetic," Electronics Letters, vol. 7, pp. 56–58, 1971.

[11] E. Swartzlander and A. Alexopoulos, "The sign/logarithm number system," IEEE Transactions on Computers, vol. C-24, no. 12, pp. 1238–1242, 1975.

[12] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," CoRR, vol. abs/1603.01025, 2016.

[13] F. J. Taylor, R. Gill, J. Joseph, and J. Radke, "A 20 bit logarithmic number system processor," IEEE Transactions on Computers, vol. 37, no. 2, pp. 190–200, Feb. 1988.

[14] F. de Dinechin and A. Tisserand, "Multipartite table methods," IEEE Transactions on Computers, vol. 54, no. 3, pp. 319–330, March 2005.

[15] J. Coleman and E. Chester, "A 32-bit logarithmic arithmetic unit and its performance compared to floating-point," in IEEE Symposium on Computer Arithmetic, 1999, pp. 142–151.

[16] D. Lewis, "Complex logarithmic number system arithmetic using high-radix redundant CORDIC algorithms," in 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 194–203.

[17] C. Chen, R.-L. Chen, and C.-H. Yang, "Pipelined computation of very large word-length LNS addition/subtraction with polynomial hardware cost," IEEE Transactions on Computers, vol. 49, no. 7, pp. 716–726, July 2000.

[18] C. Chen and R.-L. Chen, "Performance-improved computation of very large word-length LNS addition/subtraction using signed-digit arithmetic," in IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP), June 2003, pp. 337–347.

[19] J. Coleman, E. Chester, C. Softley, and J. Kadlec, "Arithmetic on the European logarithmic microprocessor," IEEE Transactions on Computers, vol. 49, no. 7, pp. 702–715, July 2000.

[20] B. Lee and N. Burgess, "A parallel look-up logarithmic number system addition/subtraction scheme for FPGA," in IEEE International Conference on Field-Programmable Technology (FPT), Dec. 2003, pp. 76–83.

[21] H. Fu, O. Mencer, and W. Luk, "FPGA designs with optimized logarithmic arithmetic," IEEE Transactions on Computers, vol. 59, no. 7, pp. 1000–1006, July 2010.

[22] J. N. Coleman and R. Che Ismail, "LNS with co-transformation competes with floating-point," IEEE Transactions on Computers, vol. 65, no. 1, pp. 136–146, Jan. 2016.

[23] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini, "High-efficiency logarithmic number unit design based on an improved cotransformation scheme," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2016, pp. 1387–1392.

[24] D. Yu and D. M. Lewis, "A 30-b integrated logarithmic number system processor," IEEE Journal of Solid-State Circuits, vol. 26, pp. 1433–1440, 1991.

[25] M. Arnold, "A VLIW architecture for logarithmic arithmetic," in Euromicro Symposium on Digital System Design, 2003, pp. 294–302.

[26] J. Coleman, C. Softley, J. Kadlec, R. Matousek, M. Tichy, Z. Pohl, A. Hermanek, and N. Benschop, "The European logarithmic microprocessor," IEEE Transactions on Computers, vol. 57, no. 4, pp. 532–546, 2008.

[27] R. Chassaing, DSP Applications Using C and the TMS320C6x DSK. John Wiley & Sons, Inc., 2002.

[28] M. G. Arnold and C. Walter, "Unrestricted faithful rounding is good enough for some LNS applications," in 15th IEEE Symposium on Computer Arithmetic (ARITH), 2001, pp. 237–246.

[29] M. Gautschi, A. Traber, A. Pullini, L. Benini, M. Scandale, A. Di Federico, M. Beretta, and G. Agosta, "Tailoring instruction-set extensions for an ultra-low power tightly-coupled cluster of OpenRISC cores," in IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), 2015, pp. 25–30.


[30] R. C. Ismail, M. K. Zakaria, and S. A. Z. Murad, "Hybrid logarithmic number system arithmetic unit: A review," in IEEE International Conference on Circuits and Systems (ICCAS), Sep. 2013, pp. 55–58.

[31] A. Omondi and B. Premkumar, Residue Number Systems: Theory and Implementations. Imperial College Press, 2007.

[32] H. L. Garner, "The residue number system," IRE Transactions on Electronic Computers, vol. EC-8, no. 2, pp. 140–147, 1959.

[33] S. Lang, Algebra, 3rd ed., ser. Graduate Texts in Mathematics. Springer, 2002, vol. 211.

[34] A. Molahosseini and L. Sousa, Embedded Systems Design with Special Arithmetic and Number Systems. Springer, 2017, ch. Introduction to Residue Number System: Structure and Teaching Methodology, pp. 3–17.

[35] C. Chang, A. Molahosseini, A. Zarandi, and T. Tay, "Residue number systems: A new paradigm to datapath optimization for low-power and high-performance digital signal processing applications," IEEE Circuits and Systems Magazine, vol. 15, no. 4, pp. 26–44, 2015.

[36] A. Molahosseini, L. Sousa, and C.-H. Chang, Eds., Embedded Systems Design with Special Arithmetic and Number Systems. Springer, 2017.

[37] F. J. Taylor, "Residue arithmetic: a tutorial with examples," Computer, vol. 17, no. 5, pp. 50–62, 1984.

[38] P. V. Mohan, Arithmetic Circuits for DSP Applications. John Wiley & Sons, Ltd, 2017.

[39] A. Hiasat and H. Abdel-Aty-Zohdy, "Design and implementation of an RNS division algorithm," in Proceedings of the 13th IEEE Symposium on Computer Arithmetic, 1997, pp. 240–249.

[40] S. Bi and W. Gross, "The mixed-radix Chinese remainder theorem and its applications to residue comparison," IEEE Transactions on Computers, vol. 57, no. 12, 2008.

[41] L. Sousa, "Efficient method for magnitude comparison in RNS based on two pairs of conjugate moduli," in IEEE Symposium on Computer Arithmetic, 2007, pp. 240–250.

[42] S. Kumar and C. Chang, "A new fast and area-efficient adder-based sign detector for RNS {2^n − 1, 2^n, 2^n + 1}," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 7, pp. 2608–2612, 2016.

[43] L. Sousa, "2^n RNS scalers for extended 4-moduli sets," IEEE Transactions on Computers, vol. 64, no. 12, pp. 3322–3334, 2015.

[44] J. Bajard and L. Imbert, "A full RNS implementation of RSA," IEEE Transactions on Computers, vol. 53, no. 6, pp. 769–774, June 2004.

[45] A. Zarandi, A. Molahosseini, L. Sousa, and M. Hosseinzadeh, "An efficient component for designing signed reverse converters for a class of RNS moduli sets of composite form {2^k, 2^p − 1}," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 48–59, 2017.

[46] F. Jafarzadehpour, A. Molahosseini, A. Zarandi, and L. Sousa, "Efficient modular adder designs based on thermometer and one-hot coding," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 9, pp. 2142–2155, 2019.

[47] C. H. Vun and B. Premkumar, "Thermometer code based modular arithmetic," in Spring Congress on Engineering and Technology, 2012.

[48] C. Vun, A. Premkumar, and W. Zhang, "Sum of products: Computation using modular thermometer codes," in International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), 2014, pp. 141–146.

[49] P. Patronik and S. J. Piestrak, "Hardware/software approach to designing low-power RNS-enhanced arithmetic units," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 5, pp. 1031–1039, 2017.

[50] A. Ghosh, S. Singha, and A. Sinha, "Floating point RNS: A new concept for designing the MAC unit of digital signal processor," SIGARCH Computer Architecture News, vol. 40, no. 2, pp. 39–43, May 2012.

[51] R. Dhanabal, V. Barathi, S. K. Sahoo, N. R. Samhitha, N. A. Cherian, and P. M. Jacob, "Implementation of floating point MAC using residue number system," in International Conference on Reliability Optimization and Information Technology (ICROIT), 2014, pp. 461–465.

[52] B. Parhami, "RNS representations with redundant residues," in Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 2, Nov. 2001, pp. 1651–1655.

[53] B. Deng, S. Srikanth, E. Hein, T. Conte, E. Debenedictis, J. Cook, and M. Frank, "Extending Moore's law via computationally error-tolerant computing," ACM Transactions on Architecture and Code Optimization, vol. 15, no. 1, Mar. 2018.

[54] S. Srikanth, P. G. Rabbat, E. Hein, B. Deng, T. Conte, E. DeBenedictis, J. Cook, and M. Frank, "Memory system design for ultra low power, computationally error resilient processor microarchitectures," in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018, pp. 696–709.

[55] B. R. Gaines, "Stochastic computing," in AFIPS Spring Joint Computer Conference, vol. 30, 1967, pp. 149–156.

[56] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Transactions on Embedded Computing Systems (TECS), vol. 12, no. 2s, pp. 1–19, May 2013.

[57] W. Gross and V. Gaudet, Eds., Stochastic Computing: Techniques and Applications. Springer, 2019.

[58] R. Duarte, M. Vestias, and H. Neto, "Enhancing stochastic computations via process variation," in 25th International Conference on Field Programmable Logic and Applications (FPL), 2015.

[59] T. Chen and J. P. Hayes, "Analyzing and controlling accuracy in stochastic circuits," in IEEE 32nd International Conference on Computer Design (ICCD), 2014, pp. 367–373.

[60] W. Qian and M. D. Riedel, "The synthesis of robust polynomial arithmetic with stochastic logic," in 45th ACM/IEEE Design Automation Conference, 2008, pp. 648–653.

[61] M. Lunglmayr, D. Wiesinger, and W. Haselmayr, "Design and analysis of efficient maximum/minimum circuits for stochastic computing," IEEE Transactions on Computers, vol. 69, no. 3, pp. 402–409, 2020.

[62] B. Brown and H. Card, "Stochastic neural computation I: Computational elements," IEEE Transactions on Computers, vol. 50, no. 9, pp. 891–905, 2001.

[63] P. Li, D. J. Lilja, W. Qian, M. D. Riedel, and K. Bazargan, "Logical computation on stochastic bit streams with linear finite-state machines," IEEE Transactions on Computers, vol. 63, no. 6, pp. 1474–1486, 2014.

[64] V. T. Lee, A. Alaghi, R. Pamula, V. S. Sathe, L. Ceze, and M. Oskin, "Architecture considerations for stochastic computing accelerators," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2277–2289, Nov. 2018.

[65] W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, "An architecture for fault-tolerant computation with stochastic logic," IEEE Transactions on Computers, vol. 60, no. 1, pp. 93–105, 2011.

[66] S. Gupta and K. Gopalakrishnan, "Revisiting stochastic computation: Approximate estimation of machine learning kernels," position paper, Workshop on Approximate Computing Across the System Stack (WACAS), 2014.

[67] A. Alaghi, W. Qian, and J. P. Hayes, "The promise and challenge of stochastic computing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 8, pp. 1515–1531, 2018.

[68] V. K. Chippa, S. Venkataramani, K. Roy, and A. Raghunathan, "StoRM: A stochastic recognition and mining processor," in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2014, pp. 39–44.

[69] The Human Memory, "The human memory: Brain neurons & synapses," https://human-memory.net/brain-neurons-synapses/, 2020.

[70] P. Kanerva, "Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors," Cognitive Computation, vol. 1, pp. 139–159, 2009.

[71] O. J. Rasanen and J. P. Saarinen, "Sequence prediction with sparse distributed hyperdimensional coding applied to the analysis of mobile phone use patterns," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 9, pp. 1878–1889, Sep. 2016.

[72] A. Rahimi, P. Kanerva, L. Benini, and J. M. Rabaey, "Efficient biosignal processing using hyperdimensional computing: Network templates for combined learning and classification of ExG signals," Proceedings of the IEEE, vol. 107, no. 1, pp. 123–143, 2019.

[73] S. Datta, R. A. Antonio, A. R. Ison, and J. M. Rabaey, "A programmable hyper-dimensional processor architecture for human-centric IoT," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 3, pp. 439–452, Sep. 2019.

[74] S. Mittal, "A survey of techniques for approximate computing," ACM Computing Surveys, vol. 48, no. 4, Mar. 2016.

[75] H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han, "Approximate arithmetic circuits: A survey, characterization, and recent applications," Proceedings of the IEEE, 2020.

[76] F. S. Snigdha, D. Sengupta, J. Hu, and S. S. Sapatnekar, "Optimal design of JPEG hardware under the approximate computing paradigm," in 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016.

[77] Y. Wang, H. Li, and X. Li, "Real-time meets approximate computing: An elastic CNN inference accelerator with adaptive trade-off between QoS and QoR," in ACM/EDAC/IEEE Design Automation Conference (DAC), 2017.


[78] S. Smithson, N. Onizawa, B. Meyer, W. Gross, and T. Hanyu, "Efficient CMOS invertible logic using stochastic computing," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 6, pp. 2263–2274, 2019.

[79] R. Cai, A. Ren, O. Chen, N. Liu, C. Ding, X. Qian, J. Han, W. Luo, N. Yoshikawa, and Y. Wang, "A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology," in ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 567–578.

[80] A. Chabi, A. Roohi, H. Khademolhosseini, S. Sheikhfaal, S. Angizi, K. Navi, and R. DeMara, "Towards ultra-efficient QCA reversible circuits," Microprocessors and Microsystems, vol. 49, pp. 127–138, 2017.

[81] A. Molahosseini, A. Asadpoor, A. Zarandi, and L. Sousa, "Towards efficient modular adders based on reversible circuits," in IEEE International Symposium on Circuits and Systems (ISCAS), 2018.

[82] C. H. Bennett, "Logical reversibility of computation," IBM Journal of Research and Development, vol. 17, no. 6, pp. 525–532, Nov. 1973.

[83] G. E. Hinton, T. J. Sejnowski, and D. H. Ackley, "Boltzmann machines: Constraint satisfaction networks that learn," Dept. Computer Science, Carnegie-Mellon University, Pittsburgh, PA, USA, Tech. Rep. CMU-CS-84-119, 1984.

[84] S. Sheikhfaal, S. Angizi, S. Sarmadi, M. Moaiyeri, and S. Sayedsalehi, "Designing efficient QCA logical circuits with power dissipation analysis," Microelectronics Journal, vol. 46, no. 6, pp. 462–471, 2015.

[85] S. Oskouei and A. Ghaffari, "Designing a new reversible ALU by QCA for reducing occupation area," The Journal of Supercomputing, vol. 75, no. 8, pp. 5118–5144, 2019.

[86] M. Sarkar, P. Ghosal, and S. Mohanty, "Minimal reversible circuit synthesis on a DNA computer," Natural Computing, vol. 16, pp. 463–472, 2017.

[87] B. Tsukerblat, A. Palii, J. Clemente-Juan, N. Suaud, and E. Coronado, "Quantum cellular automata: a short overview of molecular problem," Acta Physica Polonica A, 2018.

[88] C. Bennett and D. DiVincenzo, "Quantum information and computation," Nature, vol. 404, pp. 247–255, March 2000.

[89] N. Kazemifard and K. Navi, "Implementing RNS arithmetic unit through single electron quantum-dot cellular automata," International Journal of Computer Applications, vol. 163, pp. 20–27, 2017.

[90] O. Chen, R. Cai, Y. Wang, F. Ke, T. Yamae, R. Saito, N. Takeuchi, and N. Yoshikawa, "Adiabatic quantum-flux-parametron: Towards building extremely energy-efficient circuits and systems," Scientific Reports, vol. 9, no. 10514, 2019.

[91] N. Takeuchi, D. Ozawa, Y. Yamanashi, and N. Yoshikawa, "An adiabatic quantum flux parametron as an ultra-low-power logic device," Superconductor Science and Technology, vol. 26, no. 3, p. 035010, Jan. 2013.

[92] N. Takeuchi, T. Yamae, C. Ayala, H. Suzuki, and N. Yoshikawa, "An adiabatic superconductor 8-bit adder with 24kBT energy dissipation per junction," Applied Physics Letters, vol. 114, no. 4, p. 042602, 2019.

[93] J. Peng, S. Sun, V. K. Narayana, V. J. Sorger, and T. El-Ghazawi, "Residue number system arithmetic based on integrated nanophotonics," Optics Letters, vol. 43, no. 9, pp. 2026–2029, May 2018.

[94] J. Peng, S. Sun, V. Narayana, V. Sorger, and T. El-Ghazawi, "Integrated photonics architectures for residue number system computations," in IEEE International Conference on Rebooting Computing (ICRC), 2019.

[95] S. Sun, V. K. Narayana, I. Sarpkaya, J. Crandall, R. A. Soref, H. Dalir, T. El-Ghazawi, and V. J. Sorger, "Hybrid photonic-plasmonic non-blocking broadband 5×5 router for optical networks," IEEE Photonics Journal, vol. 10, no. 2, pp. 1–12, 2018.

[96] L. Bakhtiar, E. Yaghoubi, S. Hamidi, and M. Hosseinzadeh, "Optical RNS adder and multiplier," International Journal of Computer Applications in Technology, vol. 52, no. 1, pp. 71–76, 2015.

[97] Y. Xie and J. Zhao, "Emerging memory technologies," IEEE Micro, vol. 39, no. 1, pp. 6–7, 2019.

[98] N. Haron and S. Hamdioui, "Redundant residue number system code for fault-tolerant hybrid memories," ACM Journal on Emerging Technologies in Computing Systems, vol. 7, no. 1, Jan. 2011.

[99] S. Thirumala and S. Gupta, "Reconfigurable ferroelectric transistor - Part I: Device design and operation," IEEE Transactions on Electron Devices, vol. 66, no. 6, pp. 2771–2779, 2019.

[100] ICDL, Purdue University, "Ferroelectric transistor technologies," https://engineering.purdue.edu/ICDL/research/ferroelectrics, 2019.

[101] L. Chua, "Memristor - the missing circuit element," IEEE Transactions on Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.

[102] S. Kvatinsky, "Real processing-in-memory with memristive memory processing unit (mMPU)," in IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2019, pp. 142–148.

[103] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, "Fully hardware-implemented memristor convolutional neural network," Nature, vol. 577, pp. 641–646, 2020.

[104] J. Cook, "PIMS: Memristor-based processing-in-memory-and-storage," Sandia National Laboratories, Albuquerque, New Mexico, Tech. Rep. SAND2018-2095, Feb. 2018.

[105] S. Raj, D. Chakraborty, and S. K. Jha, "In-memory flow-based stochastic computing on memristor crossbars using bit-vector stochastic streams," in IEEE 17th International Conference on Nanotechnology (NANO), 2017, pp. 855–860.

[106] T. Wu, H. Li, P. Huang, A. Rahimi, J. Rabaey, H. Wong, M. Shulaker, and S. Mitra, "Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: Hyperdimensional computing case study," in IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 492–494.

[107] C. Ma, Y. Sun, W. Qian, Z. Meng, R. Yang, and L. Jiang, "Go unary: A novel synapse coding and mapping scheme for reliable ReRAM-based neuromorphic computing," in Conference on Design, Automation & Test in Europe (DATE), 2020, pp. 1432–1437.

[108] F. Guarnieri, M. Fliss, and C. Bancroft, "Making DNA add," Science, vol. 273, no. 5272, pp. 220–223, 1996.

[109] A. Fujiwara, K. Matsumoto, and W. Chen, "Procedures for logic and arithmetic operations with DNA molecules," International Journal of Foundations of Computer Science, vol. 15, no. 03, pp. 461–474, 2004.

[110] Z. Ignatova, I. Martínez-Pérez, and K.-H. Zimmermann, DNA Computing. Springer, 2008.

[111] M. Sarkar, P. Ghosal, and S. P. Mohanty, "Exploring the feasibility of a DNA computer: Design of an ALU using sticker-based DNA model," IEEE Transactions on NanoBioscience, vol. 16, no. 6, pp. 383–399, 2017.

[112] H. Su, J. Xu, Q. Wang, F. Wang, and X. Zhou, "High-efficiency and integrable DNA arithmetic and logic system based on strand displacement synthesis," Nature Communications, vol. 10, no. 5390, 2019.

[113] X. Zheng, B. Wang, C. Zhou, X. Wei, and Q. Zhang, "Parallel DNA arithmetic operation with one error detection based on 3-moduli set," IEEE Transactions on NanoBioscience, vol. 15, no. 5, pp. 499–507, 2016.

[114] W. Chang, "Fast parallel DNA-based algorithms for molecular computation: Quadratic congruence and factoring integers," IEEE Transactions on NanoBioscience, vol. 11, no. 1, pp. 62–69, 2012.

[115] T. Monz, K. Kim, W. Hänsel, M. Riebe, A. Villar, P. Schindler, M. Chwalla, M. Hennrich, and R. Blatt, "Realization of the quantum Toffoli gate with trapped ions," Physical Review Letters, vol. 102, p. 040501, Jan. 2009.

[116] J. Koch, T. Yu, J. Gambetta, A. Houck, D. Schuster, J. Majer, A. Blais, M. Devoret, S. Girvin, and R. Schoelkopf, "Charge-insensitive qubit design derived from the Cooper pair box," Physical Review A, vol. 76, no. 4, 2007.

[117] V. Bouchiat, D. Vion, P. Joyez, D. Esteve, and M. H. Devoret, "Quantum coherence with a single Cooper pair," Physica Scripta, vol. T76, no. 1, p. 165, 1998.

[118] IBM, "IBM will soon launch a 53-qubit quantum computer," 2019. [Online]. Available: https://techcrunch.com/2019/09/18/ibm-will-soon-launch-a-53-qubit-quantum-computer/

[119] A. Harrow, A. Hassidim, and S. Lloyd, "Quantum algorithm for linear systems of equations," Physical Review Letters, vol. 103, no. 15, Oct. 2009.

[120] R. C. Gonzalez, "Deep convolutional neural networks [lecture notes]," IEEE Signal Processing Magazine, vol. 35, no. 6, pp. 79–87, Nov. 2018.

[121] M. Abdelhamid and S. Koppula, "Applying the residue number system to network inference," arXiv, eprint 1712.04614, 2017.

[122] S. Salamat, M. Imani, S. Gupta, and T. Rosing, "RNSnet: In-memory neural network acceleration using residue number system," in IEEE International Conference on Rebooting Computing (ICRC), Nov. 2018, pp. 1–12.

[123] N. Samimi, M. Kamal, A. Afzalli-Kusha, and M. Pedram, "Res-DNN: A residue number system-based DNN accelerator unit," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 2, pp. 658–671, 2020.

[124] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.


[125] Z. Torabi and G. Jaberipur, "Low-power/cost RNS comparison via partitioning the dynamic range," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 5, pp. 1849–1857, May 2016.

[126] M. Xu, Z. Bian, and R. Yao, "Fast sign detection algorithm for the RNS moduli set {2^(n+1) − 1, 2^n − 1, 2^n}," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 2, pp. 379–383, Feb. 2015.

[127] J. Johnson, "Rethinking floating point for deep learning," CoRR, vol. abs/1811.01721, 2018. [Online]. Available: http://arxiv.org/abs/1811.01721

[128] U. Kulisch, Computer Arithmetic and Validity. De Gruyter, 2012.

[129] J. Gustafson and I. Yonemoto, "Beating floating point at its own game: Posit arithmetic," Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, Jun. 2017.

[130] J. Yu, K. Kim, J. Lee, and K. Choi, "Accurate and efficient stochastic computing hardware for convolutional neural networks," in IEEE International Conference on Computer Design (ICCD), 2017, pp. 105–112.

[131] M. Alawad and M. Lin, "Stochastic-based deep convolutional networks with reconfigurable logic fabric," IEEE Transactions on Multi-Scale Computing Systems, vol. 2, no. 4, pp. 242–256, 2016.

[132] P. Li and D. J. Lilja, "Using stochastic computing to implement digital image processing algorithms," in IEEE 29th International Conference on Computer Design (ICCD), 2011, pp. 154–161.

[133] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan, "SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing," in Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017, pp. 405–418.

[134] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, "VLSI implementation of deep neural network using integral stochastic computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2688–2699, 2017.

[135] V. Lee, A. Alaghi, J. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," in Conference on Design, Automation & Test in Europe (DATE), 2017, pp. 13–18.

[136] H. Sim and J. Lee, "A new stochastic computing multiplier with application to deep convolutional neural networks," in 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.

[137] R. Hojabr, K. Givaki, S. R. Tayaranian, P. Esfahanian, A. Khonsari, D. Rahmati, and M. H. Najafi, "SkippyNN: An embedded stochastic-computing accelerator for convolutional neural networks," in 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6.

[138] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.

[139] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.

[140] J. Bajard, L. Didier, and P. Kornerup, "Modular multiplication and base extensions in residue number systems," in 15th IEEE Symposium on Computer Arithmetic (ARITH), June 2001, pp. 59–65.

[141] S. Antao, J. Bajard, and L. Sousa, "Elliptic curve point multiplication on GPUs," in 21st IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2010, pp. 192–199.

[142] L. Sousa, S. Antao, and P. Martins, "Combining residue arithmetic to design efficient cryptographic circuits and systems," IEEE Circuits and Systems Magazine, vol. 16, no. 4, pp. 6–32, 2016.

[143] J. Ambrose, H. Pettenghi, and L. Sousa, "DARNS: A randomized multi-modulo RNS architecture for double-and-add in ECC to prevent power analysis side channel attacks," in Asia and South Pacific Design Automation Conference (ASP-DAC), 2013, pp. 620–625.

[144] P. W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, Nov. 1994, pp. 124–134.

[145] NIST, p. 25, 2016. [Online]. Available: https://csrc.nist.gov/CSRC/media/Projects/Post-Quantum-Cryptography/documents/call-for-proposals-final-dec-2016.pdf

[146] L. Babai, "On Lovász' lattice reduction and the nearest lattice point problem (shortened version)," in Proceedings of the 2nd Symposium on Theoretical Aspects of Computer Science, 1985, pp. 13–20.

[147] P. Martins and L. Sousa, "The role of non-positional arithmetic on efficient emerging cryptographic algorithms," IEEE Access, vol. 8, pp. 59533–59549, 2020.

[148] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, 2009, pp. 169–178.

[149] N. Smart and F. Vercauteren, "Fully homomorphic SIMD operations," Cryptology ePrint Archive, Report 2011/133, 2011.

[150] C. Gentry, S. Halevi, and N. Smart, Fully Homomorphic Encryption with Polylog Overhead. Springer Berlin Heidelberg, 2012, pp. 465–482.

[151] P. Martins and L. Sousa, "A methodical FHE-based cloud computing model," Future Generation Computer Systems, vol. 95, pp. 639–648, 2019.

[152] S. Jordan, "Quantum algorithm zoo," 2011. [Online]. Available: https://quantumalgorithmzoo.org/

[153] P. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in 35th Annual Symposium on Foundations of Computer Science, 1994, pp. 124–134.

[154] R. Jozsa, "Quantum factoring, discrete logarithms, and the hidden subgroup problem," Computing in Science & Engineering, vol. 3, pp. 34–43, 1996.

[155] P. Kaye, R. Laflamme, and M. Mosca, An Introduction to Quantum Computing. Oxford University Press, Inc., 2007.

[156] D. R. Simon, "On the power of quantum computation," in Proceedings of the 35th Annual Symposium on Foundations of Computer Science, 1994, pp. 116–123.

[157] R. Jozsa, "Quantum algorithms and the Fourier transform," Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1969, pp. 323–337, Jan. 1998.

[158] M. Nielsen and I. Chuang, Quantum Computation and Quantum Information: 10th Anniversary Edition. Cambridge University Press, 2011.

[159] R. Cleve, A. Ekert, C. Macchiavello, and M. Mosca, "Quantum algorithms revisited," Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1969, pp. 339–354, Jan. 1998. [Online]. Available: http://dx.doi.org/10.1098/rspa.1998.0164

[160] A. Asfaw, L. Bello, Y. Ben-Haim, S. Bravyi, L. Capelluto, A. Vazquez, J. Ceroni, R. Chen, A. Frisch, J. Gambetta, S. Garion, L. Gil, S. Gonzalez, F. Harkins, T. Imamichi, D. McKay, A. Mezzacapo, Z. Minev, R. Movassagh, G. Nannicini, P. Nation, A. Phan, M. Pistoia, A. Rattew, J. Schaefer, J. Shabani, J. Smolin, K. Temme, M. Tod, S. Wood, and J. Wootton, "Learn quantum computation using Qiskit," 2020. [Online]. Available: http://community.qiskit.org/textbook

[161] R. Duarte, H. Neto, and M. Vestias, "Xtokaxtikox: A stochastic computing-based autonomous cyber-physical system," in IEEE International Conference on Rebooting Computing (ICRC), 2016, pp. 1–7.

[162] S. Antao and L. Sousa, "The CRNS framework and its application to programmable and reconfigurable cryptography," ACM Transactions on Architecture and Code Optimization, vol. 9, no. 4, Jan. 2013.

[163] D. Boneh, C. Dunworth, R. Lipton, and J. Sgall, "On the computational power of DNA," Discrete Applied Mathematics, vol. 71, no. 1, pp. 79–94, 1996.

[164] D. Lewis, "An accurate LNS arithmetic unit using interleaved memory function interpolator," in IEEE Symposium on Computer Arithmetic (ARITH), 1993.

[165] Wolfram MathWorld, "Modular inverse," https://mathworld.wolfram.com/ModularInverse.html.

[166] J. Bajard, J. Eynard, A. Hasan, and V. Zucca, "A full RNS variant of FV like somewhat homomorphic encryption schemes," Cryptology ePrint Archive, Report 2016/510, 2016, https://eprint.iacr.org/2016/510.

[167] R. Chaves and L. Sousa, "RDSP: A RISC DSP based on residue number system," in Euromicro Symposium on Digital System Design, 2003, pp. 128–135.

[168] QuTech, "Quantum Inspire (QI)," https://www.quantum-inspire.com/.

[169] IBM, "Scalable quantum systems," https://www.ibm.com/quantum-computing/learn/what-is-ibm-q/.

[170] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203–209, 1987.

[171] R. Akhtar and F. A. Khanday, "Stochastic computing: Systems, applications, challenges and solutions," in 3rd International Conference on Communication and Electronics Systems (ICCES), 2018, pp. 722–727.


Leonel Sousa received a Ph.D. degree in Electrical and Computer Engineering from the Instituto Superior Técnico (IST), Universidade de Lisboa (UL), Lisbon, Portugal, in 1996, where he is currently a Full Professor. He is also a Senior Researcher with the R&D Instituto de Engenharia de Sistemas e Computadores (INESC-ID). His research interests include computer arithmetic, VLSI architectures, parallel computing and signal processing. He has contributed to more than 200 papers in journals and international conferences, for which he has received several awards, including the DASIP13 Best Paper Award, the SAMOS11 Stamatis Vassiliadis Best Paper Award, the DASIP10 Best Poster Award, and the Honorable Mention Award UTL/Santander Totta for the quality of his publications in 2009 and 2016. He has contributed to the organization of several international conferences, typically as program chair and as a general and topic chair, and has given keynotes in some of them. He has edited four special issues of international journals; he is currently an Associate Editor of the IEEE Transactions on Computers, IEEE Access, and Springer JRTIP, and Senior Editor of the Journal on Emerging and Selected Topics in Circuits and Systems. He has co-edited the books Circuits and Systems for Security and Privacy (CRC, 2018) and Embedded Systems Design with Special Arithmetic and Number Systems (Springer, 2017). He is a Fellow of the IET, a Distinguished Scientist of the ACM and a Senior Member of the IEEE.