Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Ramnad Erode | Tirunelveli|
Sivakasi |Dindugul|
http://www.elysiumtechnologies.com, [email protected]
This brief proposes a two-step optimization technique for designing a reconfigurable VLSI architecture of an
interpolation filter for multistandard digital up converter (DUC) to reduce the power and area consumption.
The proposed technique initially reduces the number of multiplications per input sample and additions per
input sample by 83% in comparison with individual implementation of each standard’s filter while designing a
root-raised-cosine finite-impulse response filter for multistandard DUC for three different standards. In the
next step, a 2-bit binary common subexpression (BCS)-based BCS elimination algorithm has been proposed to
design an efficient constant multiplier, which is the basic element of any filter. This technique has succeeded
in reducing the area and power usage by 41% and 38%, respectively, along with 36% improvement in
operating frequency over a 3-bit BCS-based technique reported earlier, and can be considered more
appropriate for designing the multistandard DUC.
ETPL
VLSI - 001
An Efficient VLSI Architecture of a Reconfigurable Pulse-Shaping FIR
Interpolation
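The 2-bit BCS idea in the abstract above can be illustrated with a minimal sketch. It assumes an unsigned coefficient split into 2-bit groups, each of which selects one of the shared partial products 0, x, 2x, or 3x. The function name and the `width` parameter are hypothetical, and this is not the architecture from the brief.

```python
def bcs2_multiply(x: int, coeff: int, width: int = 8) -> int:
    """Multiply x by an unsigned constant using 2-bit coefficient groups.

    Illustrative only: the shared partial products stand in for the
    common subexpressions that a 2-bit BCS multiplier would reuse.
    """
    if coeff < 0 or coeff >= (1 << width):
        raise ValueError("coeff must fit in `width` unsigned bits")
    # Partial products for the 2-bit patterns 00, 01, 10, 11.
    # Only 3x needs an adder; x and 2x are wires and a shift.
    shared = (0, x, x << 1, (x << 1) + x)
    acc = 0
    for shift in range(0, width, 2):
        group = (coeff >> shift) & 0b11      # select one 2-bit group
        acc += shared[group] << shift        # shift into place and accumulate
    return acc
```

With an 8-bit coefficient this needs four selections and three accumulating additions, whatever the coefficient value, which is why such a structure suits reconfigurable filters.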
This paper introduces a novel low-complexity multiple-input multiple-output (MIMO) detector tailored for
single-carrier frequency division-multiple access (SC-FDMA) systems, suitable for efficient hardware
implementations. The proposed detector starts with an initial estimate of the transmitted signal based on a
minimum mean square error (MMSE) detector. Subsequently, it recognizes less reliable symbols for which
more candidates in the constellation are browsed to improve the initial estimate. An efficient high-throughput
VLSI architecture is also introduced achieving a superior performance compared to the conventional MMSE
detectors with less than 28% added complexity. The performance of the proposed design is close to the
existing maximum likelihood post-detection processing (ML-PDP) scheme, while resulting in a significantly
lower complexity, i.e., 4.5×10^2 and 7×10^4 times fewer Euclidean distance (ED) calculations in the 16-QAM
and 64-QAM schemes, respectively. The proposed design for the 16-QAM scheme is fabricated in a 0.13 μm
CMOS technology and fully tested, achieving a 1.332 Gbps throughput, reporting the first fabricated design
for SC-FDMA MIMO detectors to-date. A soft version of the proposed architecture is also introduced, which
is customized for coded systems.
ETPL
VLSI - 002
A High-Throughput VLSI Architecture for Hard and Soft SC-FDMA MIMO
Detectors
Increasing demand for high-speed portable modules for multimedia applications has motivated the
development of hardware-based solutions for image processing applications. Most of the nonrigid image
registration algorithms are found to be unsuitable for hardware implementation because of their nonlinearity
and computationally intensive nature. In this paper, an algorithm for nonrigid image registration based on
Demons approximation is proposed. The algorithm has been simulated in MATLAB, and results show a 15%
improvement in peak signal-to-noise ratio with a 17% reduction in registration time for a 256 × 256 image over
the original Demons algorithm. The proposed algorithm is synthesized for the Virtex6-xc6vlx760-2-ff1760, and the
maximum synthesized frequency is found to be 174 MHz. The proposed architecture provides a low-cost,
high-speed solution for the registration process, which is also helpful for making a portable system.
ETPL
VLSI - 003
VLSI-Assisted Nonrigid Registration Using Modified Demons Algorithm
Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant
multiplications (MCM) technique that results in significant saving of computation. However, transpose form
configuration does not directly support block processing, unlike the direct-form configuration. In this paper, we
explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient
realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed
computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for
transpose form block FIR filter with optimized register complexity. A generalized block formulation is
presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the
proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM
scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves
significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block
implementation of direct-form structure for medium or large filter lengths, while for the short-length filters,
the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure.
Application-specific integrated circuit synthesis result shows that the proposed structure for block size 4 and
filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed
for reconfigurable applications. For the same filter length and the same block size, the proposed structure
involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.
ETPL
VLSI - 004
A High-Performance FIR Filter Architecture for Fixed and Reconfigurable
Applications
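The difference between the two FIR configurations discussed above can be sketched as follows. This is a sample-by-sample illustration, not the block architecture from the paper, and the function names are hypothetical. In the transpose form, every coefficient multiplies the same input sample, which is what makes the MCM technique applicable.

```python
def fir_direct(h, x):
    """Direct form: delay the input samples, then multiply and sum."""
    delay = [0] * len(h)
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        out.append(sum(c * d for c, d in zip(h, delay)))
    return out


def fir_transpose(h, x):
    """Transpose form: multiply first, then delay the partial sums."""
    n = len(h)
    regs = [0] * (n - 1)                 # registers hold partial sums
    out = []
    for sample in x:
        prods = [c * sample for c in h]  # one input times all coefficients
        out.append(prods[0] + (regs[0] if regs else 0))
        # regs[k] <- prods[k+1] + old regs[k+1], with the last stage fed by 0
        regs = [prods[k + 1] + (regs[k + 1] if k + 1 < n - 1 else 0)
                for k in range(n - 1)]
    return out
```

Both functions produce the same output sequence. They differ only in where the registers sit, which is the property the abstract refers to when it calls the transpose form inherently pipelined.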
Due to its channel-capacity-achieving property, the polar code has become one of the most favorable error-correcting
codes. Because the polar code achieves this property only asymptotically, however, it must be long enough to have
good error-correcting performance. Although the previous fully parallel encoder is intuitive and easy to
implement, it is not suitable for long polar codes because of the huge hardware complexity required. In this
brief, we analyze the encoding process from the viewpoint of very-large-scale integration implementation and
propose a new efficient encoder architecture that is adequate for long polar codes and effective in alleviating
the hardware complexity. As the proposed encoder allows high-throughput encoding with small hardware
complexity, it can be systematically applied to the design of any polar code and to any level of parallelism.
ETPL
VLSI - 005
Partially Parallel Encoder Architecture for Long Polar Codes
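For reference, the encoding operation that such an encoder implements can be sketched in software. The sketch below computes the standard polar transform x = u·F^{⊗n} over GF(2) with F = [[1,0],[1,1]], omitting bit-reversal, as log2(N) stages of XOR butterflies. It models the function only, not the partially parallel hardware, and the function name is hypothetical.

```python
def polar_encode(u):
    """Return x = u * F^(kron n) over GF(2), for len(u) a power of two."""
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    x = list(u)
    step = 1
    while step < n:                      # one pass per encoder stage
        for base in range(0, n, 2 * step):
            for j in range(base, base + step):
                x[j] ^= x[j + step]      # XOR butterfly
        step *= 2
    return x
```

A fully parallel encoder instantiates all (N/2)·log2(N) XOR gates at once. A partially parallel design would instead reuse a smaller set of XOR units over several clock cycles.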
We present algorithms for routing in advanced technology nodes, used by BonnRoute (BR) to obtain efficient
and almost design-rule-clean wire packings and pin access solutions. We consider designs with dense standard cell
libraries in the presence of complex industrial design rules, with a special focus on multiple patterning
lithography. The key components of this approach are a multilabel interval-based shortest path algorithm for
long on-track connections, and a dynamic program for computing packings of pin access paths and short
connections between closely spaced pins. The multilabel path search implementation is very general and is
driven with different labeling rules, allowing runtime to be traded off against accuracy in terms of obeyed design
rules. We combine BR with an industrial router for cleaning up the remaining design rule violations, and
demonstrate superior results over that industrial router in our experiments in terms of wire length, number of
vias, design rule violations, and runtime.
ETPL
VLSI - 006
Detailed Routing Algorithms for Advanced Technology Nodes
In this paper, the impact of variations on single-edge triggered flip-flops (FFs) is evaluated for a wide range of
topologies. In particular, this Part II explicitly considers sources of variations such as voltage, temperature and
variations induced by the clock network. The effect of variations on the energy and its tradeoff with
performance is also investigated. This paper complements the previous Part I, which is focused on process
variations and flip-flop timing. From a design perspective, the presented results provide well-defined
guidelines for variation-aware selection of the flip-flop topologies, and for early budgeting of variations before
detailed circuit design. Results are put into the technology scaling perspective through comparison of results at
65 and 28 nm. The results show that the technology scaling does not affect either the main findings of this
analysis or the ranking of the considered topologies.
ETPL
VLSI - 007
Variations in Nanometer CMOS Flip-Flops: Part II—Energy Variability and
Impact of Other Sources of Variations
Many packaging and structural materials are made of conductive materials such as metal or carbon-fiber
composites, which limits the use of embedded radio frequency-based telemetry systems for sensing. In this
paper, we present the design of a complete passive ultrasonic energy harvesting and back-telemetry system
that exploits near-field acoustic coupling to wirelessly transfer energy and data across conductive barriers. The
use of near-field operation makes the telemetry robust to multipath reflections that occur at barrier
discontinuities and robust to crosstalk when multiple sensors are simultaneously interrogated. Underlying the
proposed architecture is a system-on-chip (SoC) that integrates different ultrasonic energy harvesting and
telemetry modules. The operation of the system has been verified using SoC prototypes fabricated in a 0.5-μm
CMOS process which have been integrated with a piezoelectric transducer attached to an aerospace-grade
aluminum substrate. Measured results show that the proposed near-field ultrasonic telemetry system can
effectively operate across a 2-mm-thick metallic barrier at a frequency of 13.56 MHz with the SoC consuming
22.3 μW of power.
ETPL
VLSI - 008
Design of a CMOS System-on-Chip for Passive, Near-Field Ultrasonic Energy
Harvesting and Back-Telemetry
Manycore processor systems are becoming an attractive platform for applications seeking both high performance
and high energy efficiency. However, huge communication demands among cores, large power density, and
low process yield will be three significant limitations for the scalability of future manycore processors.
Breaking a large chip into multiple smaller ones can alleviate the problems of power density and yield, but
would worsen the problem of communication efficiency due to the limited off-chip bandwidth. In response,
we propose an inter/intra-chip optical network, which will not only fulfill the intra-chip communication
requirements but also address the inter-chip communication, by exploiting the advantages of optical links with
high bandwidth and energy efficiency. The network is composed of an inter-chip subnetwork and multiple
intra-chip subnetworks, and the subnetworks closely coordinate with each other to balance the traffic. The
proposed network effectively explores the distinctive properties of optical signals and photonic devices, and
dynamically partitions each data channel into multiple sections. Each section can be utilized independently to
boost performance as well as reduce energy consumption. Simulation results show that our network can
achieve higher throughput with lower power consumption than alternative designs under most synthetic
traffic patterns and real applications.
ETPL
VLSI - 009
An Inter/Intra-Chip Optical Network for Manycore Processors
With silicon optical technology moving toward maturity, the use of photonic networks-on-chip (NoCs) for
global chip communication is emerging as a promising solution to the communication requirements of future
manycore processors. It is expected that photonic NoCs will play an important role in alleviating current
power, latency, and bandwidth constraints. However, photonic NoCs are sensitive to ambient temperature
variations because their basic constituents, ring resonators, are themselves sensitive to those variations. Since
ring resonators are basic building blocks for photonic modulators, switches, multiplexers, and demultiplexers,
variations of on-chip temperature pose serious challenges to the proper operation of photonic NoCs. Proposed
methods that mitigate the effects of temperature at the device level are either difficult to use in CMOS
processes or not suitable for large scale implementation. In this paper, we propose Aurora, a thermally resilient
photonic NoC architecture design that supports reliable and low bit error rate (BER) on-chip communications
in the presence of large temperature variations. Our proposed architecture leverages cross-layer solutions at
the device, architecture, and operating system (OS) layers that individually provide considerable
improvements and synergistically provide even more significant improvements. To compensate for small
temperature variations, our design varies the bias current through ring resonators. For larger temperature
variations, we propose architecture-level techniques to reroute messages away from hot regions, and through
cooler regions, to their destinations. We also propose a thermal/congestion-aware coscheduling algorithm at
ETPL
VLSI - 010
Aurora: A Cross-Layer Solution for Thermally Resilient Photonic Network-on-
Chip
In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy
consumption compared with the conventional one. The speed enhancement is achieved by applying
concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA)
structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-
Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized
with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and
energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which
lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes
a modified parallel structure for increasing the slack time, and hence, enabling further voltage reduction. The
proposed structures are assessed by comparing their speed, power, and energy parameters with those of other
adders using a 45-nm static CMOS technology for a wide range of supply voltages. The results that are
obtained using HSPICE simulations reveal, on average, 44% and 38% improvements in the delay and energy,
respectively, compared with those of the Conv-CSKA. In addition, the power-delay product was the lowest
among the structures considered in this paper, while its energy-delay product was almost the same as that of
the Kogge-Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on
the proposed hybrid variable latency CSKA reveal reduction in the power consumption compared with the
latest works in this field while having a reasonably high speed.
ETPL
VLSI - 011
High-Speed and Energy-Efficient Carry Skip Adder Operating Under a Wide
Range of Supply Voltage Levels
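As background for the abstract above, a conventional carry skip adder can be modelled at bit level as follows. This sketch assumes fixed-size stages and a plain skip decision. It does not model the concatenation and incrementation schemes or the AOI/OAI skip logic of the proposed structure, and the function name and parameters are hypothetical.

```python
def carry_skip_add(a, b, width=16, block=4, carry_in=0):
    """Add two unsigned integers with a block-wise carry skip structure.

    Returns (sum modulo 2**width, carry_out).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total = 0
    carry = carry_in
    for low in range(0, width, block):
        high = min(low + block, width)
        block_carry_in = carry
        ripple = block_carry_in
        all_propagate = True
        for i in range(low, high):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi                        # propagate signal of bit i
            total |= (p ^ ripple) << i         # sum bit
            ripple = (ai & bi) | (p & ripple)  # ripple carry inside the block
            all_propagate = all_propagate and p == 1
        # Skip logic: if every bit propagates, the block's carry-out equals
        # its carry-in, so hardware can bypass the ripple chain.
        carry = block_carry_in if all_propagate else ripple
    return total, carry
```

In software the skip path and the ripple path always agree. The benefit exists only in hardware, where the bypass shortens the worst-case carry path.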
Quantum-dot cellular automata (QCA) are an attractive emerging technology suitable for the development of
ultra-dense low-power high-performance digital circuits. Efficient solutions have recently been proposed for
several arithmetic circuits, such as adders, multipliers, and comparators. Nevertheless, since the design of
digital circuits in QCA still poses several challenges, novel implementation strategies and methodologies are
highly desirable. This paper proposes a new design approach oriented to the implementation of binary
comparators in QCA. New formulations of basic logic equations required to perform the comparison function
are proposed. The new strategy has been exploited in the design of two different comparator architectures and
for several operand word lengths. With respect to existing counterparts, the comparators proposed here
exhibit significantly higher speed and reduced overall area.
ETPL
VLSI - 012
Design of Efficient Binary Comparators in Quantum-Dot Cellular Automata
A transistor level implementation of an improved matrix multiplier for high-speed digital signal processing
applications based on matrix element transformation and multiplication is reported in this study. The
improvement in speed was achieved by rearranging the matrix element into a two-dimensional array of
processing elements interconnected as a mesh. The edges of each row and column were interconnected in
torus structure, facilitating simultaneous implementation of several multiplications. The functionality of the
circuitry was verified, and performance parameters such as propagation delay and dynamic switching
power consumption were calculated using Spectre (SPICE) simulation in 90 nm CMOS technology. The proposed
methodology ensures substantial reduction in propagation delay compared with the conventional algorithm,
systolic array and pseudo number theoretic transformation (PNTT)-based implementation, which are the most
commonly used techniques, for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix
multiplier was only ~2 μs, and its power consumption was only ~3.12 mW. The improvement in speed over
earlier reported matrix multipliers based on the conventional algorithm, systolic array, and PNTT was found
to be ~67%, ~56%, and ~65%, respectively.
ETPL
VLSI - 013
Improved matrix multiplier design for high-speed digital signal processing
applications
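The abstract does not name the data movement scheme used on the torus-connected mesh. One classical scheme for such a topology is Cannon's algorithm, sketched below under that assumption. Each entry of `a_pe` and `b_pe` stands for the operand held by one processing element, and the function name is hypothetical.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices with Cannon's algorithm on an n x n torus."""
    n = len(a)
    # Initial skew: row i of A rotates left by i, column j of B rotates up by j.
    a_pe = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_pe = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every processing element performs one multiply-accumulate.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_pe[i][j] * b_pe[i][j]
        # Wraparound (torus) shifts: A moves one step left, B one step up.
        a_pe = [[a_pe[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_pe = [[b_pe[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

All n² multiplications in a step are independent, so a hardware mesh can perform them simultaneously and finish in n steps.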
We experimentally demonstrated high-speed logic operations of adiabatic quantum-flux-parametron (AQFP)
gates through the use of quantum-flux-latches (QFLs). In QFL-based high-speed test circuits (QHTCs), the
output data of the circuits under test (CUTs), which are driven by high-speed excitation currents, are stored in
QFLs and are slowly read out using low-speed excitation currents. We designed and fabricated three types of
QHTCs using QFLs with different circuit parameters, where the CUTs were buffer gates and AND gates. We
confirmed the correct operation of buffer gates and AND gates at 1 GHz. The obtained bias margins of the 1
GHz excitation currents were more than ±30% for each QHTC, which is wide enough for high-speed logic
operations of AQFP gates.
ETPL
VLSI - 014
High-Speed Experimental Demonstration of Adiabatic Quantum-Flux-
Parametron Gates Using Quantum-Flux-Latches
A novel nonvolatile flip-flop based on spin-orbit torque magnetic tunnel junctions (SOT-MTJs) is proposed
for fast and ultralow energy applications. A case study of this nonvolatile flip-flop is considered. In addition to
the independence between writing and reading paths, which offers high reliability, the low-resistance writing
path enables a high-speed and energy-efficient WRITE operation. We compare the SOT-MTJ performance
metrics with those of the spin transfer torque (STT)-MTJ. Based on accurate compact models, simulation results show
an improvement reaching 20× in terms of WRITE energy per bit cell. At the same writing current and
supply voltage, the SOT-MTJ achieves a writing frequency 4× higher than the STT-MTJ.
ETPL
VLSI - 015
Spin Orbit Torque Non-Volatile Flip-Flop for High Speed and Low Energy
Applications
This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for
deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form
LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much
faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no
pipelining is required for implementing a direct-form LMS adaptive filter for most practical cases, and can be
realized with a very small adaptation delay in cases where a very high sampling rate is required. Based on
these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 having no
adaptation delays, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays.
Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of existing direct-
form structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2 and 3
involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum usable frequency (MUF) at a cost of 55.0% and
60.6% more area, respectively.
ETPL
VLSI - 016
Critical-Path Analysis and Low-Complexity Implementation of the LMS
Adaptive Algorithm
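The direct-form LMS update with no adaptation delay (the Design 1 case) can be sketched in floating point as below. The function name, step size, and test signal are illustrative assumptions, and the sketch says nothing about the paper's fixed-point hardware structure.

```python
import random


def lms_direct_form(x, d, n_taps, mu):
    """Direct-form LMS with zero adaptation delay.

    Each error is used to update the weights before the next sample arrives.
    Returns (final weights, error sequence).
    """
    w = [0.0] * n_taps
    buf = [0.0] * n_taps                       # tapped delay line of inputs
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))         # filter output
        e = dn - y                                         # error signal
        w = [wi + mu * e * xi for wi, xi in zip(w, buf)]   # weight update
        errors.append(e)
    return w, errors


# Hypothetical usage: identify an unknown 2-tap system from noiseless data.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
unknown = [0.5, -0.25]
d = [unknown[0] * x[n] + (unknown[1] * x[n - 1] if n else 0.0)
     for n in range(len(x))]
weights, errs = lms_direct_form(x, d, n_taps=2, mu=0.1)
```

Designs with one or two adaptation delays would apply the update using an error that is one or two samples old, which shortens the critical path at some cost in convergence behaviour.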
Algebraic side-channel attack (ASCA) is a typical technique that relies on a general solver to solve the
equations of a cipher and its side-channel leaks. It falls under analytical side-channel attack and can recover
the entire key at once. Many ASCAs have been proposed against the AES, and they utilize a Gröbner basis-based,
SAT-based, or optimizer-based solver. The advantage of the general solver approach is its generic feature,
which can be easily applied to different cryptographic algorithms. The disadvantage is that it is difficult to take
into account the specialized properties of the targeted cryptographic algorithms. The results vary depending on
what type of solver is used, and the time complexity is quite high when considering the error-tolerant attack
scenarios. Thus, we were motivated to find a new approach that would lessen the influence of the general
solver and reduce the time complexity of ASCA. This paper proposes a new analytical side-channel attack on
AES by exploiting the incomplete diffusion feature in one AES round. We named our technique incomplete
diffusion analytical side-channel analysis (IDASCA). Different from previous ASCAs, IDASCA adopts a
specialized approach to recover the secret key of AES instead of the general solver. Extensive attacks are
performed against the software implementation of AES on an 8-bit microcontroller. Experimental results show
that: 1) IDASCA can exploit the side-channel leaks in all AES rounds using a single power trace; 2) it has less
time complexity and more robustness than previous ASCAs, especially when considering the error-tolerant
attack scenarios; and 3) it can calculate the reduced key search space of AES for the given amount of side-
channel leaks. IDASCA can also interpret the mechanism behind previous ASCAs on AES from a quantitative
perspective, such as why ASCA can work under unknown plaintext/ciphertext scenarios and what are the
extreme cases in ASCAs.
ETPL
VLSI - 017
Exploiting the Incomplete Diffusion Feature: A Specialized Analytical Side-Channel
Attack Against the AES and Its Application to Microcontroller Implementations
This paper presents a high-throughput and ultralow-power asynchronous domino logic pipeline design
method, targeting latch-free and extremely fine-grain (gate-level) design. The data paths are composed of
a mixture of dual-rail and single-rail domino gates. Dual-rail domino gates are used only to construct a stable
critical data path. Based on this critical data path, the handshake circuits are greatly simplified, which offers
the pipeline high throughput as well as low power consumption. Moreover, the stable critical data path enables
the adoption of single-rail domino gates in the noncritical data paths. This further saves a lot of power by
reducing the overhead of logic circuits. An 8×8 array style multiplier is used for evaluating the proposed
pipeline method. Compared with a bundled-data asynchronous domino logic pipeline, the proposed pipeline,
respectively, saves up to 60.2% and 24.5% of energy in the best case and the worst case when processing
different data patterns.
ETPL
VLSI - 019
Design and simulation of power efficient traffic light controller (PTLC)
Modern superscalar processors implement register renaming using either random access memory (RAM) or
content-addressable memories (CAM) tables. The design of these structures should address both access time
and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are
more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents
them from scaling with the number of physical registers and pipeline width, negatively impacting
performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–
CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM
provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM
enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar
processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme,
while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less
energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid
RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.
ETPL
VLSI - 018
Efficient Register Renaming and Recovery for High-Performance Processors
In this paper, we propose a reliable low-power multiplier design by adopting algorithmic noise tolerant (ANT)
architecture with the fixed-width multiplier to build the reduced precision replica redundancy block (RPR).
The proposed ANT architecture can meet the demand of high precision, low power consumption, and area
efficiency. We design the fixed-width RPR with an error compensation circuit through analysis of probability and
statistics. Using the partial product terms of the input correction vector and minor input correction vector to lower
the truncation errors, the hardware complexity of the error compensation circuit can be simplified. In a 12×12 bit
ANT multiplier, the circuit area of our fixed-width RPR is lowered by 44.55% and the power consumption of our
ANT design is reduced by 23% compared with the state-of-the-art ANT design.
ETPL
VLSI - 020
Reliable Low-Power Multiplier Design Using Fixed-Width Replica Redundancy
Block
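The ANT decision rule behind this design can be sketched as follows: the main multiplier output is accepted unless it differs from the reduced-precision replica by more than a threshold, in which case the replica output is used. The replica below simply truncates the operands and has no error compensation circuit, so it is only a stand-in for the fixed-width RPR in the paper. All names and the threshold value are assumptions.

```python
def rpr_multiply(a, b, width=12, keep=6):
    """Reduced-precision replica: multiply only the top `keep` bits."""
    drop = width - keep
    return ((a >> drop) * (b >> drop)) << (2 * drop)


def ant_select(main_out, rpr_out, threshold):
    """Keep the main output unless it strays too far from the replica."""
    if abs(main_out - rpr_out) <= threshold:
        return main_out
    return rpr_out


# Hypothetical usage with 12-bit operands.
a, b = 3000, 2500
exact = a * b
replica = rpr_multiply(a, b)
threshold = 1 << 19                # chosen above the worst truncation error
good = ant_select(exact, replica, threshold)
# Model a soft error in the main block by flipping a high-order bit.
recovered = ant_select(exact ^ (1 << 22), replica, threshold)
```

The threshold must exceed the largest error the replica itself can introduce, otherwise correct main outputs would be rejected.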
A unified very large-scale integration (VLSI) architecture with butterflies that can perform photo core
transform (PCT) in JPEG XR image compression is presented. The proposed unified design supports the
three elemental operations of PCT and offers lower hardware cost, a shorter critical path, lower power
consumption, more efficient hardware utilisation, and a regular structure for VLSI implementation.
Finally, the implementation on Altera
field programmable gate array (FPGA) devices validates the effectiveness of the design.
ETPL
VLSI - 021
Unified VLSI architecture for photo core transform used in JPEG XR
We use mixed device-circuit simulations to predict the performance of 6T static RAM (SRAM) cells
implemented with tunnel-FETs (TFETs). Idealized template devices are used to assess the impact of device
unidirectionality, which is inherent to TFETs, and identify the most promising configuration for the access
transistors. The same template devices are used to investigate the V_DD range where TFETs may
be advantageous compared to conventional CMOS. The impact of device ambipolarity on SRAM operation is
also analyzed. Realistic device templates extracted from experimental data of fabricated state-of-the-art silicon
pTFET are then used to estimate the performance gap between the simulation of idealized TFETs and the best
experimental implementations.
ETPL
VLSI - 023
Impact of TFET Unidirectionality and Ambipolarity on the Performance of 6T SRAM
Cells
This paper proposes an efficient constant multiplier architecture based on vertical-horizontal binary common
sub-expression elimination (VHBCSE) algorithm for designing a reconfigurable finite impulse response (FIR)
filter whose coefficients can dynamically change in real time. To design an efficient reconfigurable FIR filter,
according to the proposed VHBCSE algorithm, 2-bit binary common sub-expression elimination (BCSE)
algorithm has been applied vertically across adjacent coefficients on the 2-D space of the coefficient matrix
initially, followed by applying variable-bit BCSE algorithm horizontally within each coefficient. This
technique is capable of reducing the average probability of use or the switching activity of the multiplier block
adders by 6.2% and 19.6% as compared to that of two existing 2-bit and 3-bit BCSE algorithms respectively.
ASIC implementation results of FIR filters using this multiplier show that the proposed VHBCSE algorithm is
also successful in reducing the average power consumption by 32% and 52% along with an improvement in
the area power product (APP) by 25% and 66% compared to those of the 2-bit and 3-bit BCSE algorithms
respectively. As regards the implementation of FIR filter, improvements of 13% and 28% in area delay
product (ADP) and 76.1% and 77.8% in power delay product (PDP) for the proposed VHBCSE algorithm
have been achieved over those of the earlier multiple constant multiplication (MCM) algorithms, viz. faithfully
rounded truncated multiple constant multiplication/accumulation (MCMAT) and multi-root binary partition
graph (MBPG) respectively. Efficiency shown by the results of comparing the FPGA and ASIC
implementations of the reconfigurable FIR filter designed using VHBCSE algorithm based constant multiplier
establishes the suitability of the proposed algorithm for efficient fixed point reconfigurable FIR filter
synthesis.
ETPL
VLSI - 022
An Efficient Constant Multiplier Architecture Based on Vertical-Horizontal Binary
Common Sub-expression Elimination Algorithm for Reconfigurable FIR Filter
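The 2-bit BCSE idea mentioned in the abstract above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's VHBCSE architecture: the function name `two_bit_bcse_multiply`, the unsigned coefficient format, and the default `coeff_bits` width are all hypothetical. It shows how the four possible 2-bit patterns of a coefficient share one set of precomputed partial products, so only the pattern 11 needs an adder.

```python
def two_bit_bcse_multiply(x: int, coeff: int, coeff_bits: int = 16) -> int:
    """Multiply x by an unsigned constant using shared 2-bit partial products.

    Illustrative model only: a hardware design would use multiplexers and
    hard-wired shifts instead of a Python loop.
    """
    if coeff < 0 or coeff >= (1 << coeff_bits):
        raise ValueError("coeff must fit in coeff_bits as an unsigned value")

    # Shared partial products for the 2-bit patterns 00, 01, 10, 11.
    # Patterns 01 and 10 are plain wiring/shift; only 11 (x + 2x) costs an adder.
    partials = (0, x, x << 1, x + (x << 1))

    result = 0
    for shift in range(0, coeff_bits, 2):
        group = (coeff >> shift) & 0b11      # pick one 2-bit slice of the coefficient
        result += partials[group] << shift   # selected partial product, shifted into place
    return result
```

Because the partial products depend only on the input sample and not on the coefficient, the same four values can serve every coefficient of a reconfigurable filter, which is the property the common sub-expression approach relies on.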
This paper presents architecture of block-level-parallel layered decoder for irregular LDPC code. It can be
reconfigured to support various block lengths and code rates of IEEE 802.11n (WiFi) wireless-communication
standard. We have proposed efficient comparison techniques for both column and row layered schedule and
rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row
layered processing of hardware-friendly min-sum decoding algorithm. The results show good speed with
lower area as compared to state-of-the-art circuits. Additionally, this work proposes dynamic multi-frame
processing schedule which efficiently utilizes the layered-LDPC decoding with minimum pipeline stages. The
suggested LDPC-decoder architecture has been synthesized and post-layout simulated in 90 nm-CMOS
process. This decoder occupies 5.19 mm² of area and supports multiple code rates (1/2, 2/3, 3/4, and 5/6)
as well as block lengths of 648, 1296, and 1944. At a clock frequency of 336 MHz, the proposed LDPC
decoder achieves a throughput of 5.13 Gbps and an energy efficiency of 0.01 nJ/bit/iteration, both better than
those of similar state-of-the-art works.
ETPL
VLSI - 024
High-Throughput LDPC-Decoder Architecture Using Efficient Comparison
Techniques & Dynamic Multi-Frame Processing Schedule
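The two-minimum computation that the abstract above mentions for row-layered min-sum decoding can be sketched in software as follows. This is a behavioural model under assumptions, not the paper's comparison circuit: the function names `two_minimums` and `check_node_magnitudes` are hypothetical, and sign handling is omitted so that only the magnitude path is shown.

```python
from typing import List, Sequence, Tuple


def two_minimums(magnitudes: Sequence[int]) -> Tuple[int, int, int]:
    """Return (smallest, second smallest, index of smallest) in one pass."""
    if len(magnitudes) < 2:
        raise ValueError("need at least two inputs")

    min1, min2, idx1 = magnitudes[0], magnitudes[1], 0
    if min2 < min1:
        min1, min2, idx1 = min2, min1, 1

    for i in range(2, len(magnitudes)):
        value = magnitudes[i]
        if value < min1:
            # New smallest: the old smallest becomes the second smallest.
            min1, min2, idx1 = value, min1, i
        elif value < min2:
            min2 = value
    return min1, min2, idx1


def check_node_magnitudes(magnitudes: Sequence[int]) -> List[int]:
    """Min-sum check-node update (magnitudes only).

    Each output is the minimum over all *other* inputs, so the position that
    holds the smallest value receives the second smallest, and every other
    position receives the smallest.
    """
    min1, min2, idx1 = two_minimums(magnitudes)
    return [min2 if i == idx1 else min1 for i in range(len(magnitudes))]
```

Only the two minimum values and one index need to be stored per row, which is why the two-minimum finder is the critical block in this kind of decoder.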
Digital multipliers are among the most critical arithmetic functional units. The overall performance of these
systems depends on the throughput of the multiplier. Meanwhile, the negative bias temperature instability
effect occurs when a pMOS transistor is under negative bias (Vgs = -Vdd), increasing the threshold voltage of
the pMOS transistor, and reducing multiplier speed. A similar phenomenon, positive bias temperature
instability, occurs when an nMOS transistor is under positive bias. Both effects degrade transistor speed, and
in the long term, the system may fail due to timing violations. Therefore, it is important to design reliable
high-performance multipliers. In this paper, we propose an aging-aware multiplier design with a novel
adaptive hold logic (AHL) circuit. The multiplier is able to provide higher throughput through the variable
latency and can adjust the AHL circuit to mitigate performance degradation that is due to the aging effect.
Moreover, the proposed architecture can be applied to a column- or row-bypassing multiplier. The experimental
results show that our proposed architecture with 16 × 16 and 32 × 32 column-bypassing multipliers can attain
up to 62.88% and 76.28% performance improvement, respectively, compared with 16×16 and 32×32 fixed-
latency column-bypassing multipliers. Furthermore, our proposed architecture with 16 × 16 and 32 × 32 row-
bypassing multipliers can achieve up to 80.17% and 69.40% performance improvement as compared with
16×16 and 32 × 32 fixed-latency row-bypassing multipliers.
ETPL
VLSI - 025
Aging-Aware Reliable Multiplier Design With Adaptive Hold Logic
Bloom filters (BFs) provide a fast and efficient way to check whether a given element belongs to a set. The
BFs are used in numerous applications, for example, in communications and networking. There is also
ongoing research to extend and enhance BFs and to use them in new scenarios. Reliability is becoming a
challenge for advanced electronic circuits as the number of errors due to manufacturing variations, radiation,
and reduced noise margins increase as technology scales. In this brief, it is shown that BFs can be used to
detect and correct errors in their associated data set. This allows a synergetic reuse of existing BFs to also
detect and correct errors. This is illustrated through an example of a counting BF used for IP traffic
classification. The results show that the proposed scheme can effectively correct single errors in the associated
set. The proposed scheme can be of interest in practical designs to effectively mitigate errors with a reduced
overhead in terms of circuit area and power.
ETPL
VLSI - 026
A Synergetic Use of Bloom Filters for Error Detection and Correction
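One plausible reading of the scheme described above is sketched below: a stored element that no longer tests positive in its Bloom filter is flagged as corrupted, and a single-bit error is corrected by testing each one-bit-flipped candidate. Everything here is an assumption made for illustration (the class `CountingBloomFilter`, the function `detect_and_correct`, the salted SHA-256 hashing, and the filter size); the brief's actual procedure may differ.

```python
import hashlib
from typing import List, Optional


class CountingBloomFilter:
    """Small counting Bloom filter; sizes and hashing are illustrative choices."""

    def __init__(self, num_counters: int = 4096, num_hashes: int = 4) -> None:
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _positions(self, element: int) -> List[int]:
        # Derive num_hashes counter indices from salted SHA-256 digests.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{element}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % len(self.counters))
        return positions

    def add(self, element: int) -> None:
        for pos in self._positions(element):
            self.counters[pos] += 1

    def contains(self, element: int) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(element))


def detect_and_correct(stored: int, cbf: CountingBloomFilter, width: int) -> Optional[int]:
    """Return the stored word, a corrected word, or None if correction is ambiguous."""
    if cbf.contains(stored):
        return stored  # no error detected (or an undetectable false positive)

    # Error detected: try every single-bit flip and keep those the filter accepts.
    candidates = [stored ^ (1 << bit) for bit in range(width)]
    matches = [c for c in candidates if cbf.contains(c)]
    return matches[0] if len(matches) == 1 else None
```

Because a Bloom filter can report false positives, more than one candidate may pass the test; the sketch returns `None` in that case rather than guessing.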
In this paper, we propose a level-converting retention flip-flop (RFF) for ZigBee systems-on-chips (SoCs).
The proposed RFF allows the voltage regulator that generates the core supply voltage (VDD,core) to be turned
off in the standby mode, and it thus reduces the standby power of the ZigBee SoCs. The logic states are
retained in a slave latch composed of thick-oxide transistors using an I/O supply voltage (VDD,IO) that is
always turned on. Level-up conversion from VDD,core to VDD,IO is achieved by an embedded nMOS pass-
transistor level-conversion scheme that uses a low-only signal-transmitting technique. By embedding a
retention latch and level-up converter into the data-to-output path of the proposed RFF, the RFF resolves the
problems of the static RAM-based RFF, such as large dc current and low readability caused by threshold drop.
The proposed RFF also requires no additional control signals for power mode transitioning. Using 0.13-μm
process technology, we implemented an RFF with VDD,core and VDD,IO of 1.2 and 2.5 V, respectively.
The maximum operating frequency is 300 MHz. The active energy of the RFF is 191.70 fJ, and its standby
power is 350.25 pW.
ETPL
VLSI - 027
Level-Converting Retention Flip-Flop for Reducing Standby Power in ZigBee
SoCs
We prove analytically that the yield of static random access memory (SRAM) is intrinsically a function of its
architecture owing to the correlation among cell failures. In addition, architecture-aware analytical yield
models are proposed for read access. The yield results using the proposed models show that the most dominant
factor determining yield is the variation in the voltage difference between bitlines due to the cell leakage
current variation according to the SRAM architecture. The models also show the possibility that the most
dominant factor determining the yield can change with the relative ratios among the amounts of changes in the
correlation, recovery sample space, distributions of the sense amplifier enable time, voltage difference
between bitlines, as well as sense amplifier offset voltage, memory capacity, and redundancy scheme. The
proposed yield models show that combined row and column redundancy ensures the highest yield, whereas
column redundancy is the most efficient.
ETPL
VLSI - 029
Architecture-Aware Analytical Yield Model for Read Access in Static Random
Access Memory
The confluence of 3-D integration and network-on-chip (NoC) provides an effective solution to the scalability
problem of on-chip interconnects. In 3-D integration, through-silicon via (TSV) is considered to be the most
promising bonding technology. However, TSVs are also precious link resources because they consume
significant chip area and possibly lead to routing congestion in the physical design stage. In addition, TSVs
suffer from serious yield losses that shrink the effective TSV density. Thus, it is necessary to implement a
TSV-economical 3-D NoC architecture in cost-effective design. For symmetric 3-D mesh NoCs, we observe
that TSV bandwidth utilization is low and that TSVs rarely become contention spots in the network, unlike planar
links. Based on this observation, we propose the TSV sharing (TS) scheme to save TSVs in 3-D NoC by
enabling neighboring routers to share the vertical channels in a time division multiplexing way. We also
investigate different TS implementation alternatives and show how TS improves TSV-effectiveness (TE) in
multicore processors through a design space exploration. In experiments, we comprehensively evaluate the
influence of TS on all layers of the system. It is shown that the proposed method significantly promotes TE with
negligible performance overhead.
ETPL
VLSI - 028
Economizing TSV Resources in 3-D Network-on-Chip Design
Content addressable memories (CAMs) enable high-speed parallel search operations in table lookup-based
applications, such as Internet routers and processor caches. Traditional CAM design has always suffered from
the high dynamic power consumption associated with its large and active parallel hardware. However, deeply
scaled technology nodes, with multigate devices replacing planar MOSFETs, are expected to bring new
tradeoffs to CAM design. FinFET, a vertical-channel gate-wrap-around double-gate device, has emerged as
the best alternative to planar MOSFET. In this brief, for the first time, we explore the design space of
symmetric and asymmetric gate-workfunction FinFET CAMs. We propose several design alternatives and
evaluate them in terms of their dc and transient metrics for different mismatch probabilities using technology
computer-aided design simulations with 22-nm FinFET devices. We also propose two orthogonal layout styles
for CAM design and show that one of them (vertical-search line) outperforms the other (vertical-match line) in
terms of total power (22.3%) and search delay (5.8%).
ETPL
VLSI - 030
Design of Efficient Content Addressable Memories in High-Performance
FinFET Technology
We propose a low-power content-addressable memory (CAM) employing a new algorithm for associativity
between the input tag and the corresponding address of the output data. The proposed architecture is based on
a recently developed sparse clustered network using binary connections that on-average eliminates most of the
parallel comparisons performed during a search. Therefore, the dynamic energy consumption of the proposed
design is significantly lower compared with that of a conventional low-power CAM design. Given an input
tag, the proposed architecture computes a few possibilities for the location of the matched tag and performs the
comparisons on them to locate a single valid match. TSMC 65-nm CMOS technology was used for simulation
purposes. Following a selection of design parameters, such as the number of CAM entries, the energy
consumption and the search delay of the proposed design are 8% and 26% of those of the conventional NAND
architecture, respectively, with a 10% area overhead. A design methodology based on the silicon area and
power budgets, and performance requirements is discussed.
ETPL
VLSI - 031
Algorithm and Architecture for a Low-Power Content-Addressable Memory
Based on Sparse Clustered Networks
Network-on-chip (NoC) has emerged as a vital factor that determines the performance and power
consumption of many-core systems. This paper proposes a hybrid scheme for NoCs, which aims at
obtaining low latency and low power consumption. In the presented hybrid scheme, a novel
switching mechanism, called virtual circuit switching, is proposed to intermingle with circuit
switching and packet switching. Flits traveling in virtual circuit switching can traverse the router with
only one stage. In addition, multiple virtual circuit-switched (VCS) connections are allowed to share
a common physical channel. Moreover, a path allocation algorithm is proposed in this paper to
determine VCS connections and circuit-switched connections on a mesh-connected NoC, such that
both communication latency and power are optimized. A set of synthetic and real traffic workloads
are exploited to evaluate the effectiveness of the proposed hybrid scheme. The experimental results
show that our proposed hybrid scheme can efficiently reduce the communication latency and power.
For instance, for real traffic workloads, an average of 20.3% latency reduction and 33.2% power
saving can be obtained when compared with the baseline NoC. Moreover, when compared with the
NoC with virtual point-to-point connections (VIP), the proposed hybrid scheme can reduce the
latency by 6.8% and the power by 11.3% on average.
ETPL
VLSI - 032
A Low-Latency and Low-Power Hybrid Scheme for On-Chip Networks
This paper presents the design and the VLSI implementation of an asynchronous cellular logic array for fast
binary image processing. The proposed processor array employs trigger-wave propagation and collision
detection mechanisms for binary image skeletonization and Voronoi tessellation. Low power, low area, and
high processing speed are achieved using full custom dynamic logic design. The prototype array consisting of
64 × 96 cells is fabricated in a standard 90 nm CMOS technology. The experimental results confirm the fast
operation of the array, capable of extracting up to skeletons per second, consuming less than 1 nJ/skeleton.
The asynchronous operation enables circular wave contours, which improves the quality of the extracted
skeletons. The proposed asynchronous processing module consists of 24 MOS transistors and occupies area.
Such an array can be used as a co-processing unit aiding global binary image processing in standard pixel-parallel
SIMD architectures in vision chips.
ETPL
VLSI - 033
Trigger-Wave Asynchronous Cellular Logic Array for Fast Binary Image
Processing
This paper presents the design and testing of an electrode driving application specific integrated circuit (ASIC)
intended for epidural spinal cord electrical stimulation in rats. The ASIC can deliver up to 1 mA fully
programmable monophasic or biphasic stimulus current pulses to 13 electrodes selected in any possible
configuration. It also supports interleaved stimulation. Communication is achieved via only 3 wires. The current
source and the control of the stimulation timing were kept off-chip to reduce the heat dissipation close to the
spinal cord. The ASIC was designed in a 0.18-μm high voltage CMOS process. Its output voltage compliance can be
up to 25 V. It features a small core area ( mm) and consumes a maximum of 114 μW during a full stimulation
cycle. The layout of the ASIC was
developed to be suitable for integration on the epidural electrode array, and two different versions were
fabricated and electrically tested. Results from both versions were almost indistinguishable. The performance
of the system was verified for different loads and stimulation parameters. Its suitability to drive a passive
epidural 12-electrode array in saline has also been demonstrated.
ETPL
VLSI - 034
An Implantable Versatile Electrode-Driving ASIC for Chronic Epidural
Stimulation in Rats
The DCT and the DWT are used in a number of emerging DSP applications, such as, HD video compression,
biomedical imaging, and smart antenna beamformers for wireless communications and radar. Of late, there has
been much interest in fast algorithms for the computation of the above transforms using multiplier-free
approximations because they result in low power and low complexity systems. Approximate methods rely on
the trade-off of accuracy for lower power and/or circuit complexity/chip-area. This paper provides a detailed
review of VLSI architectures and CAS implementations for both DCT/DWTs, which can be designed either
for higher-accuracy or for low-power consumption. This article covers both recent theoretical advancements
on discrete transforms in addition to an overview of existing VLSI architectures. The paper also discusses
error-free VLSI architectures that provide high-accuracy systems and approximate architectures that offer
high computational gain making them highly attractive for real-world applications that are subject to
constraints in both chip-area as well as power. The methods discussed in the paper can be used in the design of
emerging low-power digital systems having lowest complexity at the cost of a loss in accuracy, that is, the optimal
trade-off of computational accuracy for lowest possible complexity and power. A complete synopsis of
available techniques, algorithms, and FPGA/VLSI realizations is presented in the paper.
ETPL
VLSI - 035
Low-Power VLSI Architectures for DCT/DWT: Precision vs Approximation for
HD Video, Biomedical, and Smart Antenna Applications
This paper presents a novel algorithm and architecture design for 18-band quasi-class-2 ANSI S1.11 1/3
octave filterbank. The proposed design has several advantages such as lower group delay, lower computational
complexity, and lower matching error. The technique we developed in this paper can be summarized as
follows: 1) a simple low-pass filter (LPF) and discrete cosine transform (DCT) modulation are utilized to
generate a uniform 9-band filterbank first, and then all elements of z^{-1} are replaced by all-pass filters
to obtain a non-uniform filterbank; 2) a fast recursive structure and variable-length algorithm is further
developed to efficiently accomplish DCT modulation. Thus, the spectrum of LPF can be easily spanned and
flexibly extended to the location of the desired central frequency; 3) after employing the multi-rate algorithm,
an 18-band non-uniform filterbank is generated from two 9-band sub filterbanks by following the proposed
design steps and parameter determinations. Compared with Liu's latest quasi-class-2 ANSI S1.11 design,
the proposed method-I (Proposed-I) achieves a 72.8% reduction in multiplications per sample, an 11.25-ms
group delay, and 59 fewer additions per sample. Moreover, the maximum matching error of the proposed
method-II (Proposed-II) is 1.79 dB on average, much smaller than that of Wei's latest design. For
the proposed variable-length DCT modulation, only 2 adders, 2 multipliers, 2 multiplexers, and 5 registers are
required for hardware implementation after applying VLSI retiming scheme. Overall, the proposed filterbank
design would be a new solution for future applications in the area of hearing aids.
ETPL
VLSI - 036
11.25-ms-Group-Delay and Low-Complexity Algorithm Design of 18-Band Quasi-ANSI
S1.11 1/3 Octave Digital Filterbank for Hearing Aids
A fully-integrated low-dropout regulator (LDO) with fast transient response and full spectrum power supply
rejection (PSR) is proposed to provide a clean supply for noise-sensitive building blocks in wideband
communication systems. With the proposed point-of-load LDO, chip-level high-frequency glitches are well
attenuated, consequently the system performance is improved. A tri-loop LDO architecture is proposed and
verified in a 65 nm CMOS process. In comparison to other fully-integrated designs, the output pole is set to be
the dominant pole, and the internal poles are pushed to higher frequencies with only 50 μA of total quiescent
current. For a 1.2 V input voltage and 1 V output voltage, the measured undershoot and overshoot is only 43
mV and 82 mV, respectively, for load transient of 0 μA to 10 mA within edge times of 200 ps. It achieves a
transient response time of 1.15 ns and the figure-of-merit (FOM) of 5.74 ps. PSR is measured to be better than
-12 dB over the whole spectrum (DC to 20 GHz tested). The prototype chip measures 260 × 90 μm², including
140 pF of stacked on-chip capacitors.
ETPL
VLSI - 037
A Fully-Integrated Low-Dropout Regulator With Full-Spectrum Power Supply
Rejection
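The reported figure-of-merit can be sanity-checked from the other numbers in the abstract above. The sketch assumes the commonly used LDO transient FOM, response time multiplied by the ratio of quiescent current to maximum load current; the abstract does not state which definition the authors use, so the formula and the function name `ldo_figure_of_merit` are assumptions.

```python
def ldo_figure_of_merit(response_time_s: float,
                        quiescent_current_a: float,
                        max_load_current_a: float) -> float:
    """Assumed definition: FOM = T_R * I_Q / I_MAX, in seconds (lower is better)."""
    return response_time_s * quiescent_current_a / max_load_current_a


# Values quoted in the abstract: T_R = 1.15 ns, I_Q = 50 uA, I_MAX = 10 mA.
fom = ldo_figure_of_merit(1.15e-9, 50e-6, 10e-3)
print(f"FOM = {fom * 1e12:.2f} ps")
```

Under this assumed definition the result is about 5.75 ps, which agrees with the reported 5.74 ps to within the rounding of the quoted response time.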
Spin-torque transfer RAM (STT-RAM), a promising alternative to static RAM (SRAM) for reducing leakage
power consumption, has been widely studied to mitigate the impact of its asymmetrically long write latency.
However, physical effects of technology scaling down to 45 nm and below, in particular, process variation,
introduce the previously unreported and alarming trends in read performance and reliability due to reduced
sensing margins and increasing error rates. In this brief, we study the scaling trends of STT-RAM from 65
down to 22 nm as they pertain to read performance, including a 50% increase in sensing versus peripheral
circuit delay ratio and a more than 80% increase in uncorrectable read error rates. Through differential
sensing, we show how 22 nm can return to sense delay ratio levels at 65 nm and uncorrectable read errors can
be reduced by an order of magnitude. Through a case study of a multilevel STT-RAM cache, we show how a
reconfigurable cache cell can create an extreme access mode (X-mode) based on differential sensing
to outperform the state-of-the-art STT-RAM caching techniques in both raw performance and performance per
watt by more than 10% while still reducing energy consumption over SRAM caches by more than 1/3.
ETPL
VLSI - 038
Read Performance: The Newest Barrier in Scaled STT-RAM
This paper presents a digital-subranging (sub-R) analog-to-digital conversion (ADC) architecture to improve
the operation speed of sub-R ADCs. Long latency between coarse and fine conversions will slow down the
conventional sub-R ADCs. The proposed digital-sub-R uses digital circuits to implement the sub-R function
and shorten this latency, and thus benefits from CMOS scaling. Furthermore, dynamic comparators are used to
reduce ADC power consumption. Their accuracy is improved by the proposed pseudodifferential offset
calibration loop. The digital-sub-R also helps to reduce the dynamic offset of the fine comparators caused by
the input common-mode variation. Fabricated using a 55-nm CMOS technology, the reported 8-bit 1-GS/s
ADC consumes only 16 mW from a 1.2 V supply. Measured signal-to-noise ratio (SNR) and spurious free
dynamic range (SFDR) are 46 and 55 dB, respectively. Measured effective number of bits (ENOB) is seven
bits at 10-MHz input frequency. At Nyquist input, the ENOB performance of 6.3 bits is still maintained. Its
figure-of-merit is 197-fJ/conversion-step.
ETPL
VLSI - 039
A 16-mW 8-Bit 1-GS/s Digital-Subranging ADC in 55-nm CMOS
A frequency-tuning negative-conductance (-Gm) boosted structure and its applications for a voltage-controlled
oscillator (VCO) are presented in this paper. Analog tuning varactors connected to a -Gm boosted structure are
proposed to significantly alleviate the limitation of the tuning range for the -Gm boosted structure,
resulting in a low-voltage low-power wide-tuning-range VCO. Based on
the proposed architecture, the fabricated 0.18-μm CMOS VCO exhibits a measured 49.8% tuning range.
Operating at 0.65 V low supply voltage, the VCO core consumes 2.37-mW dc power. At this bias condition,
the measured average value of phase noise for all frequency ranges is −115.1 dBc/Hz at 1-MHz offset from the
carriers. Relative to recently published wide-tuning-range CMOS VCOs, the proposed VCO simultaneously
achieves low supply voltage, low dc power dissipation, low phase noise, and a wide tuning range, leading to a
good figure-of-merit (FOM) and FOM including the tuning range. Furthermore, formulas of analysis for the
proposed frequency-tuning -Gm boosted structure and wide VCO tuning range are presented,
and the mechanisms are validated by experiments.
ETPL
VLSI - 040
Frequency-Tuning Negative-Conductance Boosted Structure and Applications
for Low-Voltage Low-Power Wide-Tuning-Range VCO
Ternary content addressable memories (TCAMs) perform high-speed lookup operation but when compared
with static random access memories (SRAMs), TCAMs have certain limitations such as low storage density,
relatively slow access time, low scalability, complex circuitry, and high cost. Thus, can we use the
benefits of SRAM by configuring it (with additional logic) to enable it to behave like TCAM? This brief
proposes a novel memory architecture, named Z-TCAM, which emulates the TCAM functionality with
SRAM. Z-TCAM logically partitions the classical TCAM table along columns and rows into hybrid TCAM
subtables, which are then processed to map on their corresponding memory blocks. Two example designs for
Z-TCAM of sizes 512 × 36 and 64 × 32 have been implemented on Xilinx Virtex-7 field-programmable gate
array. The design of 64 × 32 Z-TCAM has also been implemented using OSUcells library for 0.18 μm
technology, which confirms the physical and technical feasibility of Z-TCAM. Search latency for each design
is three clock cycles. The detailed implementation results and power measurements for each design have been
reported thoroughly.
ETPL
VLSI - 041
Z-TCAM: An SRAM-based Architecture for TCAM
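The general idea of emulating a TCAM with SRAM, as described in the abstract above, can be sketched as follows. This is a generic column-partitioned scheme written for illustration, not the Z-TCAM mapping itself: the function names `build_sram_tables` and `search`, the rule strings using `x` for "don't care", and the lowest-index-wins priority are all assumptions. Each sub-word of the search key addresses one SRAM table whose entry is a bit-vector of the rules matching that sub-word; ANDing the vectors gives the overall match.

```python
from typing import List, Optional


def build_sram_tables(rules: List[str], chunk_bits: int) -> List[List[int]]:
    """Precompute one lookup table per sub-word of the ternary rules.

    rules: equal-length strings over '0', '1', 'x' ('x' = don't care).
    Returns tables[chunk][address] = bitmask of rules matching that address.
    """
    width = len(rules[0])
    if width % chunk_bits != 0:
        raise ValueError("rule width must be a multiple of chunk_bits")

    tables = []
    for start in range(0, width, chunk_bits):
        table = [0] * (1 << chunk_bits)
        for rule_idx, rule in enumerate(rules):
            pattern = rule[start:start + chunk_bits]
            for address in range(1 << chunk_bits):
                bits = format(address, f"0{chunk_bits}b")
                if all(p in ("x", b) for p, b in zip(pattern, bits)):
                    table[address] |= 1 << rule_idx
        tables.append(table)
    return tables


def search(tables: List[List[int]], key: str, chunk_bits: int) -> Optional[int]:
    """Return the index of the first matching rule, or None if nothing matches."""
    match_vector = -1  # all ones: every rule is still a candidate
    for i, table in enumerate(tables):
        address = int(key[i * chunk_bits:(i + 1) * chunk_bits], 2)
        match_vector &= table[address]  # one SRAM read per sub-word
    if match_vector == 0:
        return None
    # Priority encoder: position of the lowest set bit.
    return (match_vector & -match_vector).bit_length() - 1
```

For example, with the rules `["10xx", "1x01", "xxxx"]` and 2-bit sub-words, the key `"1101"` matches the second and third rules and the search returns index 1. Smaller sub-words keep each table small at the cost of more tables to read and combine.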
The common objective of very large-scale integration (VLSI) placement problem is to minimize the total
wirelength, which is calculated by the total half-perimeter wirelength (HPWL). Since the HPWL is not
differentiable, various differentiable wirelength approximation functions have been proposed in analytical
placement methods. In this paper, we reformulate the HPWL as an ℓ1-norm model of the wirelength
function, which is exact but nonsmooth. Based on the ℓ1-norm wirelength model and exact calculation of
overlapping areas between cells and bins, a nonsmooth optimization model is proposed for the VLSI global
placement problem, and a subgradient method is proposed for solving the nonsmooth optimization problem.
Moreover, local convergence of the subgradient method is proved under some suitable conditions. In addition,
two enhanced techniques, i.e., an adaptive parameter to control the step size and a cautious strategy for
increasing the penalty parameter, are also used in the nonsmooth optimization method. In order to make the
placement method scalable, a multilevel framework is adopted. In the clustering stage, the best choice
clustering algorithm is modified according to the ℓ1-norm wirelength model to cluster the cells, and the
nonsmooth optimization method is recursively used in the declustering stage. Comparisons of experimental
results on the International Symposium on Physical Design (ISPD) 2005 and 2006 benchmarks show that the
global placement method is promising.
ETPL
VLSI - 042
Nonsmooth Optimization Method for VLSI Global Placement
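The HPWL objective and one valid subgradient of it can be written down in a few lines. The sketch below is illustrative only: the function names `hpwl` and `hpwl_subgradient_x` and the data layout (nets as lists of cell indices, coordinates as plain lists) are assumptions, and it omits the overlap penalty, step-size control, and multilevel framework described in the abstract above. For a two-pin net the HPWL reduces to |x_i - x_j| + |y_i - y_j|, which is the ℓ1 distance between the two cells.

```python
from typing import List, Sequence


def hpwl(nets: Sequence[Sequence[int]], x: Sequence[float], y: Sequence[float]) -> float:
    """Total half-perimeter wirelength: bounding-box width plus height per net."""
    total = 0.0
    for net in nets:
        xs = [x[cell] for cell in net]
        ys = [y[cell] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def hpwl_subgradient_x(nets: Sequence[Sequence[int]], x: Sequence[float]) -> List[float]:
    """One subgradient of the x-part of HPWL with respect to the cell x-coordinates.

    For each net, +1 goes to a rightmost cell and -1 to a leftmost cell.
    When all cells of a net coincide, that net contributes zero, which is
    also a valid subgradient at that point.
    """
    grad = [0.0] * len(x)
    for net in nets:
        rightmost = max(net, key=lambda cell: x[cell])
        leftmost = min(net, key=lambda cell: x[cell])
        if rightmost != leftmost:
            grad[rightmost] += 1.0
            grad[leftmost] -= 1.0
    return grad
```

A subgradient step would move each cell against `grad` by some step size; the y-direction is handled the same way with the y-coordinates.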
Thompson's model of very large scale integration computation relates the energy of a computation to the
product of the circuit area and the number of clock cycles needed to carry out the computation. It is shown that
for any sequence of increasing block-length decoder circuits implemented according to this model, if the
probability of block error is asymptotically less than 1/2 then the energy of the computation scales at least as
Ω(n (log n)^{1/2}), and so the energy of decoding per bit must scale at least as Ω((log n)^{1/2}). This implies that the
average energy per decoded bit must approach infinity for any sequence of decoders that approaches capacity.
The analysis techniques used are then extended to show that for any sequence of increasing block-length serial
decoders, if the asymptotic block error probability is less than 1/2 then the energy scales at least as fast as Ω(n
log n). In a very general case that allows for the number of output pins to vary with block length, it is shown
that the energy must scale as Ω(n (log n)^{1/5}). A simple example is provided of a class of circuits performing
low-density parity-check decoding whose energy complexity scales as O(n^2 log log n).
ETPL
VLSI - 043
Energy Consumption of VLSI Decoders
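The step from the total-energy bound to the per-bit bound in the abstract above can be made explicit. This is a sketch that assumes a fixed code rate R, so that k = Rn information bits are decoded per block of length n; the symbols R, k, and E_total are introduced here for illustration and do not appear in the abstract.

```latex
E_{\text{total}}(n) = \Omega\!\left(n \sqrt{\log n}\right)
\quad\Longrightarrow\quad
\frac{E_{\text{total}}(n)}{k}
  = \frac{E_{\text{total}}(n)}{R\,n}
  = \Omega\!\left(\sqrt{\log n}\right)
  \xrightarrow[\;n \to \infty\;]{} \infty .
```

Since approaching capacity requires the block length n to grow without bound, the energy per decoded bit cannot stay bounded under this model.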
A low-power test generation procedure that was developed earlier merges broadside test cubes that are derived
from functional broadside tests in order to generate a low-power broadside test set. This has several
advantages, most importantly, that test cubes, which are derived from functional broadside tests, create
functional operation conditions in subcircuits around the sites of detected faults. These conditions are
preserved when a test cube is merged with other test cubes. This brief applies a similar approach to the
generation of a low-power skewed-load test set. The main challenge that this paper addresses is the derivation
of skewed-load test cubes from functional broadside tests. The paper also considers the percentages of values
that should be unspecified in the skewed-load test cubes in order to balance the need to create functional
operation conditions with the need for test compaction.
ETPL
VLSI - 044
Skewed-Load Test Cubes Based on Functional Broadside Tests for a Low-
Power Test Set
Minimizing energy consumption is of utmost importance in an energy starved system with relaxed
performance requirements. This brief presents a digital energy sensing method that requires neither a constant
voltage reference nor a time reference. An energy minimizing loop uses this to find the minimum energy point
and sets the supply voltage between 0.2 and 0.5 V. Energy savings up to 1275% over existing minimum
energy tracking techniques in the literature are achieved.
ETPL
VLSI - 045
All Digital Energy Sensing for Minimum Energy Tracking
With the significant increase in the number of processing elements in NoC-based MPSoCs,
communication becomes, increasingly, a critical resource for performance gains and quality-of-
service (QoS) guarantees. The main gap observed in the NoC-based MPSoCs literature is the runtime
adaptive techniques to meet QoS. In the absence of such techniques, the system user must statically
define, for example, the scheduling policy, communication priorities, and the communication
switching mode of applications. The goal of this paper is to investigate the runtime adaptation of the
NoC resources, according to the QoS requirements of each application running in the MPSoC. This
paper adopts an NoC architecture with duplicated physical channels, adaptive routing, support to
flow priorities and simultaneous packet and circuit switching. The monitoring and adaptation
management is performed at the operating system level, ensuring QoS to the monitored applications.
The QoS acts in the flow priority and the switching mode. Monitoring and QoS adaptation were
implemented in software, resulting in flexibility to apply the techniques to other platforms or include
other adaptive techniques, such as task migration or DVFS. Applications with latency and throughput
deadlines run concurrently with best-effort applications. Results with synthetic and real applications
show an average 60% reduction in latency violations, ensuring smaller jitter and throughput. The execution
time of applications is not penalized by applying the proposed QoS adaptation methods.
ETPL
VLSI - 046
Fat-Tree-Based Optical Interconnection Networks Under Crosstalk Noise
Constraint
We prove analytically that the yield of static random access memory (SRAM) is intrinsically a function of its
architecture owing to the correlation among cell failures. In addition, architecture-aware analytical yield
models are proposed for read access. The yield results using the proposed models show that the most dominant
factor determining yield is the variation in the voltage difference between bitlines due to the cell leakage
current variation according to the SRAM architecture. The models also show the possibility that the most
dominant factor determining the yield can change with the relative ratios among the amounts of changes in the
correlation, recovery sample space, distributions of the sense amplifier enable time, voltage difference
between bitlines, as well as sense amplifier offset voltage, memory capacity, and redundancy scheme. The
proposed yield models show that combined row and column redundancy ensures the highest yield, whereas
column redundancy is the most efficient.
ETPL
VLSI - 047
Architecture-Aware Analytical Yield Model for Read Access in Static Random
Access Memory
This brief presents a two-port disturb-free 9T subthreshold static random access memory (SRAM) cell with
independent single-ended read bitline and write bitline (WBL) and cross-point data-aware write structure to
facilitate robust subthreshold operation and bit-interleaving architecture for enhanced soft error immunity. The
design employs a variation-tolerant line-up write-assist scheme where the timing of area-efficient boosted write
wordline and negative WBL are aligned and triggered/initiated by the same low-going global WBL to
maximize the write-ability enhancement. A 72-kb test chip is implemented in United Microelectronics Corp.
40-nm low-power (40LP) CMOS. Full functionality is achieved for VDD ranging from 1.5 to 0.32 V without
redundancy. The measured maximum operation frequency is 260 MHz (450 kHz) at 1.1 V (0.32 V) and 25 °C.
At 0.325 V and 25 °C, the chip operates at 600 kHz with 5.78 μW total power and 4.69 μW leakage power,
offering 2× frequency improvement compared with 300 kHz of our previous 72-kb 9T subthreshold SRAM
design in the same 40LP technology. The energy efficiency (power/frequency/IO) at 0.325 V and 25 °C is
0.267 pJ/bit, a 23.7% improvement over the 0.350 pJ/bit of our previous design.
ETPL
VLSI - 048
A 0.325 V, 600-kHz, 40-nm 72-kb 9T Subthreshold SRAM with Aligned Boosted
Write Wordline and Negative Write Bitline Write-Assist
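The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the abstract for 0.325 V and 25 °C.
total_power_w = 5.78e-6          # total power
leakage_power_w = 4.69e-6        # leakage power
freq_hz = 600e3                  # operating frequency of this design
prev_freq_hz = 300e3             # operating frequency of the previous design
energy_pj_per_bit = 0.267        # energy efficiency of this design
prev_energy_pj_per_bit = 0.350   # energy efficiency of the previous design

# Energy drawn per clock cycle, before dividing by the IO width.
energy_per_cycle_pj = total_power_w / freq_hz * 1e12

# Fraction of the total power that is leakage.
leakage_share = leakage_power_w / total_power_w

# Improvements relative to the previous design.
freq_gain = freq_hz / prev_freq_hz
energy_improvement = (
    prev_energy_pj_per_bit - energy_pj_per_bit
) / prev_energy_pj_per_bit

print(f"energy per cycle: {energy_per_cycle_pj:.2f} pJ")
print(f"leakage share: {leakage_share:.1%}")
print(f"frequency gain: {freq_gain:.1f}x")
print(f"energy improvement: {energy_improvement:.1%}")
```

The results agree with the 2× frequency and 23.7% energy claims, and show that leakage accounts for roughly 81% of the total power at this operating point.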
A CMOS pulsewidth modulation (PWM) transceiver circuit that exploits the self-referenced edge
detection technique is presented. By comparing the rising edge that is self-delayed by about 0.5 T and
the modulated falling edge in one carrier clock cycle, area-efficient and high-robustness (against
timing fluctuations) edge detection enabling PWM communication is achieved without requiring
elaborate phase-locked loops. Since the proposed self-referenced edge detection circuit has the
capability of timing error measurement while changing the length of self-delay element, adaptive
data-rate optimization and delay-line calibration are realized. The measured results with a 65-nm
CMOS prototype demonstrate a 2-bit PWM communication, high data rate (3.2 Gb/s), and high
reliability (BER < 10⁻¹²) with small area occupation (540 μm²). For reliability improvement, error
check and correction associated with intercycle edge detection is introduced and its effectiveness is
verified by 1-bit PWM measurement.
ETPL
VLSI - 049
A CMOS PWM Transceiver Using Self-Referenced Edge Detection
Phase change memory (PCM) is a promising DRAM replacement in embedded systems due to its
attractive characteristics, such as low cost, shock resistance, nonvolatility, high density, and low
leakage power. However, relatively low endurance has limited its practical applications. In this paper,
in addition to existing hardware level optimizations, we propose software enabled wear-leveling
techniques to further extend PCM's lifetime when it is adopted in embedded systems. Most existing
software optimization techniques focus on reducing the total number of writes to PCM, but none of
them consider wear leveling, in which the writes are distributed more evenly over the PCM. An
integer linear programming formulation and a polynomial-time algorithm, the software wear-leveling
algorithm, are proposed in this paper to achieve wear leveling without hardware overhead. According
to the experimental results, the proposed techniques can reduce the number of writes on the most-
written addresses by more than 80% when compared with a greedy algorithm, and by more than 60%
when compared with the existing optimal data allocation algorithm with under 6% memory access
overhead.
ETPL
VLSI - 050
Low Overhead Software Wear Leveling for Hybrid PCM + DRAM Main
Memory on Embedded Systems
This brief presents a parallel single-rail self-timed adder. It is based on a recursive formulation for performing
multibit binary addition. The operation is parallel for those bits that do not need any carry chain propagation.
Thus, the design attains logarithmic performance over random operand conditions without any special speedup
circuitry or look-ahead schema. A practical implementation is provided along with a completion detection
unit. The implementation is regular and does not have any practical limitations of high fanouts. A high fan-in
gate is required, but this is unavoidable for asynchronous logic and is managed by connecting the
transistors in parallel. Simulations performed using an industry-standard toolkit verify the
practicality and superiority of the proposed approach over existing asynchronous adders.
ETPL
VLSI - 051
Recursive Approach to the Design of a Parallel Self-Timed Adder
In this brief, three nonvolatile flip-flop (FF)/SRAM cells that utilize a single magnetic tunneling
junction (MTJ) as nonvolatile resistive element are proposed. These cells have the same core (i.e.,
6T) but they employ different numbers of MOSFETs to implement the so-called instantly ON,
normally OFF mode of operation. The additional transistors are utilized for the restore operation to
ensure that the data stored in the nonvolatile circuitry can be written back into the FF core once the
power is made available. These three cells (7T, 9T, and 11T) are extensively analyzed in terms of
their operations in 32 nm technology, such as operational delays (for the write, read, and restore
operations), the static noise margin (SNM), critical charge and process variations (in both the
MOSFETs and the resistive element). Simulation results show that an increase in the number of
MOSFETs in the cells causes improvements in critical charge and tolerance to process variations at
the expense of an increase in power dissipation. The SNM and the delay of the restore operation,
however, do not necessarily increase with the number of MOSFETs in the cell, but rather depend on the
control of access to the storage nodes from the single MTJ.
ETPL
VLSI - 052
On the Nonvolatile Performance of Flip-Flop/SRAM Cells With a Single MTJ
Content addressable memories (CAMs) enable high-speed parallel search operations in table lookup-
based applications, such as Internet routers and processor caches. Traditional CAM design has
always suffered from the high dynamic power consumption associated with its large and active
parallel hardware. However, deeply scaled technology nodes, with multigate devices replacing planar
MOSFETs, are expected to bring new tradeoffs to CAM design. FinFET, a vertical-channel gate-
wrap-around double-gate device, has emerged as the best alternative to planar MOSFET. In this brief,
for the first time, we explore the design space of symmetric and asymmetric gate-workfunction
FinFET CAMs. We propose several design alternatives and evaluate them in terms of their dc and
transient metrics for different mismatch probabilities using technology computer-aided design
simulations with 22-nm FinFET devices. We also propose two orthogonal layout styles for CAM
design and show that one of them (vertical-search line) outperforms the other (vertical-match line) in
terms of total power (22.3%) and search delay (5.8%).
ETPL
VLSI - 053
Design of Efficient Content Addressable Memories in High-Performance
FinFET Technology
A clock skew-compensation and duty-cycle correction circuit (CSADC) is used as the second-level clock
distributing circuit to align a system global clock while maintaining a 50% duty cycle. A power-efficient,
range-unlimited, and accuracy-enhanced CSADC, designed mainly with a new delay-interleaving and
-recycling technique that mitigates operating frequency limitations while keeping overhead costs low, is
proposed in this paper. Our preliminary research results prove the feasibility of the proposed technique and
show that the operating frequency ranges from 110 MHz to 1.75 GHz, with the corrected duty cycle varying
from 51.2% to 48.9% based on 0.18-μm CMOS technology. Meanwhile, the lock-in time, static phase error,
and power consumption are, respectively, 26 clock cycles, 4.2 ps, and 5.58 mW at 1.75 GHz.
ETPL
VLSI - 054
Range Unlimited Delay-Interleaving and -Recycling Clock Skew Compensation
and Duty-Cycle Correction Circuit
A fast transient response flying-capacitor buck-boost converter is proposed to improve the efficiency
of conventional switched-capacitor converters. The voltage boost ratio of the proposed converter is
2D, where D is the duty cycle of the switching signal waveform. Furthermore, the proposed structure
utilizes pseudocurrent dynamic acceleration techniques to achieve fast transient response when the load
changes between heavy load and light load. The switching frequency of the proposed converter is 1
MHz for a 3.3-V input and a 1.0-4.5-V output range. Experimental results show that the
proposed scheme improves the transient response to within 2 μs and the total power conversion
efficiency can be as high as 89.66%. The proposed converter has been realized as a 2P4M CMOS
chip in a 0.35-μm fabrication process with a total chip size of about 1.5 mm × 1.5 mm, PADs included.
ETPL
VLSI - 055
A Fast Transient Response Flying-Capacitor Buck-Boost Converter Utilizing
Pseudocurrent Dynamic Acceleration Techniques
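The stated boost ratio of 2D implies a direct mapping between the requested output voltage and the duty cycle. The sketch below works this out for the quoted 3.3-V input and 1.0-4.5-V output range, assuming an ideal lossless converter; the function name duty_cycle_for is hypothetical.

```python
def duty_cycle_for(v_out: float, v_in: float = 3.3) -> float:
    """Ideal duty cycle D for a converter whose ratio V_out / V_in equals 2D.

    Losses are ignored, so a real converter would need a slightly
    different duty cycle than the value returned here.
    """
    d = v_out / (2.0 * v_in)
    if not 0.0 <= d <= 1.0:
        raise ValueError("requested output is outside the ideal 0..2*V_in range")
    return d


for v_out in (1.0, 3.3, 4.5):
    print(f"V_out = {v_out:.1f} V -> D = {duty_cycle_for(v_out):.3f}")
```

Under this ideal model the converter steps down for D below 0.5 and steps up above it, and the quoted output range corresponds to duty cycles of roughly 0.15 to 0.68.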
In this brief, a low-cost low-power all-digital spread-spectrum clock generator (ADSSCG) is presented. The
proposed ADSSCG can provide an accurate programmable spreading ratio under process, voltage, and
temperature variations. To maintain frequency stability while performing triangular modulation, a
fast-relocked mechanism is proposed. The proposed fast-relocked ADSSCG is implemented in a
standard-performance 90-nm CMOS process, and the active area is 200 μm × 200 μm. The experimental results show
that the electromagnetic interference reduction is 14.61 dB with a 0.5% spreading ratio and 19.69 dB with a
2% spreading ratio at 270 MHz. The power consumption is 443 μW at 270 MHz with a 1.0 V power supply.
ETPL
VLSI - 056
A Low-Cost Low-Power All-Digital Spread-Spectrum Clock Generator