Elysium Technologies Private Limited · elysiumtechnologies.com/wp-content/uploads/2015/08/vlsi.pdf ·...


Sivakasi |Dindugul|


This brief proposes a two-step optimization technique for designing a reconfigurable VLSI architecture of an

interpolation filter for multistandard digital up converter (DUC) to reduce the power and area consumption.

The proposed technique initially reduces the number of multiplications per input sample and additions per

input sample by 83% in comparison with individual implementation of each standard’s filter while designing a

root-raised-cosine finite-impulse response filter for multistandard DUC for three different standards. In the

next step, a 2-bit binary common subexpression (BCS)-based BCS elimination algorithm has been proposed to

design an efficient constant multiplier, which is the basic element of any filter. This technique has succeeded

in reducing the area and power usage by 41% and 38%, respectively, along with 36% improvement in

operating frequency over a 3-bit BCS-based technique reported earlier, and can be considered more

appropriate for designing the multistandard DUC.

ETPL

VLSI - 001

An Efficient VLSI Architecture of a Reconfigurable Pulse-Shaping FIR

Interpolation
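
The second step of the abstract above (VLSI - 001) builds a constant multiplier from 2-bit binary common subexpressions. The sketch below is only an illustration of that general idea, not the authors' architecture: it assumes an unsigned coefficient, splits it into 2-bit groups, and shares the one non-trivial pattern `11` (x + 2x) across all groups. The function name `bcs2_multiply` and the `width` parameter are hypothetical.

```python
def bcs2_multiply(x, coeff, width=16):
    """Multiply x by an unsigned constant using 2-bit groups of the coefficient.

    Illustrative model only: in hardware, `partials` would be a small shared
    shift-and-add block and each group would drive a 4:1 multiplexer.
    """
    if coeff < 0 or coeff >= (1 << width):
        raise ValueError("coeff must fit in `width` unsigned bits")

    # Partial products for the patterns 00, 01, 10, 11.
    # Only '11' needs an adder (x + 2x); it is computed once and reused.
    partials = [0, x, x << 1, x + (x << 1)]

    acc = 0
    for shift in range(0, width, 2):
        group = (coeff >> shift) & 0b11      # multiplexer select lines
        acc += partials[group] << shift      # hard-wired shift, then accumulate
    return acc
```

With a 16-bit coefficient this model uses one adder for the shared pattern plus the accumulation adders for the eight groups, instead of one adder per set coefficient bit.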

This paper introduces a novel low-complexity multiple-input multiple-output (MIMO) detector tailored for

single-carrier frequency division-multiple access (SC-FDMA) systems, suitable for efficient hardware

implementations. The proposed detector starts with an initial estimate of the transmitted signal based on a

minimum mean square error (MMSE) detector. Subsequently, it recognizes less reliable symbols for which

more candidates in the constellation are browsed to improve the initial estimate. An efficient high-throughput

VLSI architecture is also introduced achieving a superior performance compared to the conventional MMSE

detectors with less than 28% added complexity. The performance of the proposed design is close to the

existing maximum likelihood post-detection processing (ML-PDP) scheme, while resulting in a significantly

lower complexity, i.e., 4.5×10² and 7×10⁴ times fewer Euclidean distance (ED) calculations in the 16-QAM

and 64-QAM schemes, respectively. The proposed design for the 16-QAM scheme is fabricated in a 0.13 μm

CMOS technology and fully tested, achieving a 1.332 Gbps throughput, reporting the first fabricated design

for SC-FDMA MIMO detectors to-date. A soft version of the proposed architecture is also introduced, which

is customized for coded systems.

ETPL

VLSI - 002

A High-Throughput VLSI Architecture for Hard and Soft SC-FDMA MIMO

Detectors




Increasing demand of high-speed portable modules for multimedia applications has motivated the

development of hardware-based solutions for image processing applications. Most of the nonrigid image

registration algorithms are found to be unsuitable for hardware implementation because of their nonlinearity

and computationally intensive nature. In this paper, an algorithm for nonrigid image registration based on

Demons approximation is proposed. The algorithm has been simulated in MATLAB and results show a 15%

improvement in peak-signal-to-noise-ratio with a 17% reduction in registration time for 256 x 256 image over

the original Demons algorithm. The proposed algorithm is synthesized in Virtex6-xc6vlx760-2-ff1760 and

maximum synthesized frequency is found to be 174 MHz. The proposed architecture provides a low-cost,

high-speed solution for the registration process, which is also helpful for making a portable system.

ETPL

VLSI - 003

VLSI-Assisted Nonrigid Registration Using Modified Demons Algorithm

Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant

multiplications (MCM) technique that results in significant saving of computation. However, transpose form

configuration does not directly support the block processing unlike direct-form configuration. In this paper, we

explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient

realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed

computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for

transpose form block FIR filter with optimized register complexity. A generalized block formulation is

presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the

proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM

scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves

significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block

implementation of direct-form structure for medium or large filter lengths, while for the short-length filters,

the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure.

Application-specific integrated circuit synthesis result shows that the proposed structure for block size 4 and

filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed

for reconfigurable applications. For the same filter length and the same block size, the proposed structure

involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.

ETPL

VLSI - 004

A High-Performance FIR Filter Architecture for Fixed and Reconfigurable

Applications
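
The abstract above (VLSI - 004) contrasts direct-form and transpose-form FIR filters. As a minimal sketch of that contrast (sample-by-sample, without the block processing the paper proposes), the two functions below model both forms and produce identical outputs. The names `fir_direct` and `fir_transpose` are hypothetical.

```python
def fir_direct(h, x):
    """Direct form: a delay line of past inputs feeds the multipliers."""
    taps = [0] * len(h)
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]                 # shift the input delay line
        out.append(sum(c * t for c, t in zip(h, taps)))
    return out


def fir_transpose(h, x):
    """Transpose form: the current input multiplies every coefficient at once
    (the multiple-constant-multiplication case) and registers hold partial sums."""
    regs = [0] * (len(h) - 1)
    out = []
    for sample in x:
        products = [c * sample for c in h]          # one input, all coefficients
        out.append(products[0] + (regs[0] if regs else 0))
        # Update registers front to back so each update reads the old neighbour.
        for k in range(len(regs)):
            upstream = regs[k + 1] if k + 1 < len(regs) else 0
            regs[k] = products[k + 1] + upstream
    return out
```

In the transpose form every adder output is registered, which is why the abstract calls the structure inherently pipelined.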




Due to the channel achieving property, the polar code has become one of the most favorable error-correcting

codes. As the polar code achieves the property asymptotically, however, it should be long enough to have a

good error-correcting performance. Although the previous fully parallel encoder is intuitive and easy to

implement, it is not suitable for long polar codes because of the huge hardware complexity required. In this

brief, we analyze the encoding process from the viewpoint of very-large-scale integration implementation and

propose a new efficient encoder architecture that is adequate for long polar codes and effective in alleviating

the hardware complexity. As the proposed encoder allows high-throughput encoding with small hardware

complexity, it can be systematically applied to the design of any polar code and to any level of parallelism.

ETPL

VLSI - 005

Partially Parallel Encoder Architecture for Long Polar Codes
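
For the polar encoder entry above (VLSI - 005), the sketch below shows the encoding computation itself, assuming the common definition x = u · F^(⊗n) over GF(2) with F = [[1, 0], [1, 1]] and no bit-reversal permutation. It models the full XOR butterfly network in software; a partially parallel encoder would time-multiplex a slice of these stages, which this sketch does not attempt to reproduce. The name `polar_encode` is hypothetical.

```python
def polar_encode(u):
    """Apply the polar transform to a list of bits whose length is a power of two."""
    n = len(u)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    x = list(u)
    step = 1
    while step < n:                       # log2(n) butterfly stages
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]       # one XOR per butterfly
        step *= 2
    return x
```

Each stage uses n/2 XOR gates, so a fully parallel encoder needs (n/2)·log2(n) of them, which is the hardware growth the abstract refers to for long codes.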

We present algorithms for routing in advanced technology nodes, used by BonnRoute (BR) to obtain efficient

and almost design rule clean wire packings and pin access solutions. Designs with dense standard cell libraries

in presence of complex industrial design rules, with a special focus on multiple patterning lithography are

considered. The key components of this approach are a multilabel interval-based shortest path algorithm for

long on-track connections, and a dynamic program for computing packings of pin access paths and short

connections between closely spaced pins. The multilabel path search implementation is very general and is

driven with different labeling rules, allowing to trade-off runtime against accuracy in terms of obeyed design

rules. We combine BR with an industrial router for cleaning up the remaining design rule violations, and

demonstrate superior results over that industrial router in our experiments in terms of wire length, number of

vias, design rule violations, and runtime.

ETPL

VLSI - 006

Detailed Routing Algorithms for Advanced Technology Nodes




In this paper, the impact of variations on single-edge triggered flip-flops (FFs) is evaluated for a wide range of

topologies. In particular, this Part II explicitly considers sources of variations such as voltage, temperature and

variations induced by the clock network. The effect of variations on the energy and its tradeoff with

performance is also investigated. This paper complements the previous Part I, which is focused on process

variations and flip-flop timing. From a design perspective, the presented results provide well-defined

guidelines for variation-aware selection of the flip-flop topologies, and for early budgeting of variations before

detailed circuit design. Results are put into the technology scaling perspective through comparison of results at

65 and 28 nm. The results show that the technology scaling does not affect either the main findings of this

analysis or the ranking of the considered topologies.

ETPL

VLSI - 007

Variations in Nanometer CMOS Flip-Flops: Part II—Energy Variability and

Impact of Other Sources of Variations

Many packaging and structural materials are made of conductive materials such as metal or carbon-fiber

composites, which limits the use of embedded radio frequency-based telemetry systems for sensing. In this

paper, we present the design of a complete passive ultrasonic energy harvesting and back-telemetry system

that exploits near-field acoustic coupling to wirelessly transfer energy and data across conductive barriers. The

use of near-field operation makes the telemetry robust to multipath reflections that occur at barrier

discontinuities and robust to crosstalk when multiple sensors are simultaneously interrogated. Underlying the

proposed architecture is a system-on-chip (SoC) that integrates different ultrasonic energy harvesting and

telemetry modules. The operation of the system has been verified using SoC prototypes fabricated in a 0.5-μm

CMOS process which have been integrated with a piezoelectric transducer attached to an aerospace-grade

aluminum substrate. Measured results show that the proposed near-field ultrasonic telemetry system can

effectively operate across a 2-mm-thick metallic barrier at a frequency of 13.56 MHz with the SoC consuming

22.3 μW of power.

ETPL

VLSI - 008

Design of a CMOS System-on-Chip for Passive, Near-Field Ultrasonic Energy

Harvesting and Back-Telemetry




The manycore processor system is becoming an attractive platform for applications seeking both high performance

and high energy efficiency. However, huge communication demands among cores, large power density, and

low process yield will be three significant limitations for the scalability of future manycore processors.

Breaking a large chip into multiple smaller ones can alleviate the problems of power density and yield, but

would worsen the problem of communication efficiency due to the limited off-chip bandwidth. In response,

we propose an inter/intra-chip optical network, which will not only fulfill the intra-chip communication

requirements but also address the inter-chip communication, by exploiting the advantages of optical links with

high bandwidth and energy efficiency. The network is composed of an inter-chip subnetwork and multiple

intra-chip subnetworks, and the subnetworks closely coordinate with each other to balance the traffic. The

proposed network effectively explores the distinctive properties of optical signals and photonic devices, and

dynamically partitions each data channel into multiple sections. Each section can be utilized independently to

boost performance as well as reduce energy consumption. Simulation results show that our network can

achieve higher throughput with lower power consumption than alternative designs under most synthetic

traffic patterns and real applications.

ETPL

VLSI - 009

An Inter/Intra-Chip Optical Network for Manycore Processors

With silicon optical technology moving toward maturity, the use of photonic networks-on-chip (NoCs) for

global chip communication is emerging as a promising solution to the communication requirements of future

manycore processors. It is expected that photonic NoCs will play an important role in alleviating current

power, latency, and bandwidth constraints. However, photonic NoCs are sensitive to ambient temperature

variations because their basic constituents, ring resonators, are themselves sensitive to those variations. Since

ring resonators are basic building blocks for photonic modulators, switches, multiplexers, and demultiplexers,

variations of on-chip temperature pose serious challenges to the proper operation of photonic NoCs. Proposed

methods that mitigate the effects of temperature at the device level are either difficult to use in CMOS

processes or not suitable for large scale implementation. In this paper, we propose Aurora, a thermally resilient

photonic NoC architecture design that supports reliable and low bit error rate (BER) on-chip communications

in the presence of large temperature variations. Our proposed architecture leverages cross-layer solutions at

the device, architecture, and operating system (OS) layers that individually provide considerable

improvements and synergistically provide even more significant improvements. To compensate for small

temperature variations, our design varies the bias current through ring resonators. For larger temperature

variations, we propose architecture-level techniques to reroute messages away from hot regions, and through

cooler regions, to their destinations. We also propose a thermal/congestion-aware coscheduling algorithm at

ETPL

VLSI - 010

Aurora: A Cross-Layer Solution for Thermally Resilient Photonic Network-on-

Chip




In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy

consumption compared with the conventional one. The speed enhancement is achieved by applying

concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA)

structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-

Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized

with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and

energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which

lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes

a modified parallel structure for increasing the slack time, and hence, enabling further voltage reduction. The

proposed structures are assessed by comparing their speed, power, and energy parameters with those of other

adders using a 45-nm static CMOS technology for a wide range of supply voltages. The results that are

obtained using HSPICE simulations reveal, on average, 44% and 38% improvements in the delay and energy,

respectively, compared with those of the Conv-CSKA. In addition, the power-delay product was the lowest

among the structures considered in this paper, while its energy-delay product was almost the same as that of

the Kogge-Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on

the proposed hybrid variable latency CSKA reveal reduction in the power consumption compared with the

latest works in this field while having a reasonably high speed.

ETPL

VLSI - 011

High-Speed and Energy-Efficient Carry Skip Adder Operating Under a Wide

Range of Supply Voltage Levels
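
For the carry skip adder entry above (VLSI - 011), the following bit-level model sketches the conventional scheme (Conv-CSKA) that the paper improves on, not the proposed concatenation/incrementation structure. Each block ripples internally; when every bit position in a block propagates, the block's carry-out is taken directly from its carry-in through the skip path. The function name and the default `width` and `block` values are assumptions.

```python
def carry_skip_add(a, b, width=16, block=4):
    """Return (sum mod 2**width, carry_out) using fixed-size carry-skip blocks."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    total = 0
    carry = 0
    for lo in range(0, width, block):
        hi = min(lo + block, width)
        carry_in = carry
        all_propagate = True
        for i in range(lo, hi):                    # ripple inside the block
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi                            # propagate signal
            total |= (p ^ carry) << i              # sum bit
            carry = (ai & bi) | (p & carry)        # generate or propagate
            all_propagate = all_propagate and p == 1
        if all_propagate:
            # Skip path: same logical value as the rippled carry, but in
            # hardware it bypasses the block's ripple chain.
            carry = carry_in
    return total, carry
```

The skip path never changes the result; its benefit is purely in delay, which a software model like this cannot show.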

Quantum-dot cellular automata (QCA) are an attractive emerging technology suitable for the development of

ultra-dense low-power high-performance digital circuits. Efficient solutions have recently been proposed for

several arithmetic circuits, such as adders, multipliers, and comparators. Nevertheless, since the design of

digital circuits in QCA still poses several challenges, novel implementation strategies and methodologies are

highly desirable. This paper proposes a new design approach oriented to the implementation of binary

comparators in QCA. New formulations of basic logic equations required to perform the comparison function

are proposed. The new strategy has been exploited in the design of two different comparator architectures and

for several operand word lengths. With respect to existing counterparts, the comparators proposed here

exhibit significantly higher speed and reduced overall area.

ETPL

VLSI - 012

Design of Efficient Binary Comparators in Quantum-Dot Cellular Automata




A transistor level implementation of an improved matrix multiplier for high-speed digital signal processing

applications based on matrix element transformation and multiplication is reported in this study. The

improvement in speed was achieved by rearranging the matrix element into a two-dimensional array of

processing elements interconnected as a mesh. The edges of each row and column were interconnected in

torus structure, facilitating simultaneous implementation of several multiplications. The functionality of the

circuitry was verified, and the performance parameters, for example, propagation delay and dynamic switching

power consumption, were calculated using SPICE Spectre simulations in 90 nm CMOS technology. The proposed

methodology ensures substantial reduction in propagation delay compared with the conventional algorithm,

systolic array and pseudo number theoretic transformation (PNTT)-based implementation, which are the most

commonly used techniques, for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix

multiplier was only ~2 μs, whereas the power consumption of the implemented 4 × 4 matrix multiplier was

~3.12 mW only. Improvement in speed compared with earlier reported matrix multipliers, for example,

conventional algorithm, systolic array and PNTT-based implementation was found to be ~67, ~56 and ~65%,

respectively.

ETPL

VLSI - 013

Improved matrix multiplier design for high-speed digital signal processing

applications
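
The matrix multiplier abstract above (VLSI - 013) describes processing elements arranged as a mesh with torus wrap-around links, but it does not name the data-movement schedule. As one well-known illustration of matrix multiplication on such a torus, and not necessarily the paper's method, the sketch below simulates Cannon's algorithm for square matrices. The name `cannon_matmul` is hypothetical.

```python
def cannon_matmul(a, b):
    """Multiply two n x n matrices by simulating an n x n torus of processing elements."""
    n = len(a)

    # Initial skew: row i of A is rotated left by i, column j of B is rotated up by j,
    # so that PE (i, j) starts with a[i][k] and b[k][j] for the same k = (i + j) mod n.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]

    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every PE performs one multiply-accumulate in the same step.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Rotate A one position left and B one position up; the wrap-around
        # corresponds to the torus links at the mesh edges.
        a_loc = [row[1:] + row[:1] for row in a_loc]
        b_loc = b_loc[1:] + b_loc[:1]
    return c
```

For a 4 × 4 problem this takes four multiply-accumulate steps with all sixteen elements active in each step.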

We experimentally demonstrated high-speed logic operations of adiabatic quantum-flux-parametron (AQFP)

gates through the use of quantum-flux-latches (QFLs). In QFL-based high-speed test circuits (QHTCs), the

output data of the circuits under test (CUTs), which are driven by high-speed excitation currents, are stored in

QFLs and are slowly read out using low-speed excitation currents. We designed and fabricated three types of

QHTCs using QFLs with different circuit parameters, where the CUTs were buffer gates and AND gates. We

confirmed the correct operation of buffer gates and AND gates at 1 GHz. The obtained bias margins of the 1

GHz excitation currents were more than ±30% for each QHTC, which is wide enough for high-speed logic

operations of AQFP gates.

ETPL

VLSI - 014

High-Speed Experimental Demonstration of Adiabatic Quantum-Flux-

Parametron Gates Using Quantum-Flux-Latches




A novel nonvolatile flip-flop based on spin-orbit torque magnetic tunnel junctions (SOT-MTJs) is proposed

for fast and ultralow energy applications. A case study of this nonvolatile flip-flop is considered. In addition to

the independence between writing and reading paths, which offers a high reliability, the low resistive writing

path performs high-speed, and energy-efficient WRITE operation. We compare the SOT-MTJ performances

metrics with the spin transfer torque (STT)-MTJ. Based on accurate compact models, simulation results show

an improvement, which attains 20× in terms of WRITE energy per bit cell. At the same writing current and

supply voltage, the SOT-MTJ achieves a writing frequency 4× higher than the STT-MTJ.

ETPL

VLSI - 015

Spin Orbit Torque Non-Volatile Flip-Flop for High Speed and Low Energy

Applications

This paper presents a precise analysis of the critical path of the least-mean-square (LMS) adaptive filter for

deriving its architectures for high-speed and low-complexity implementation. It is shown that the direct-form

LMS adaptive filter has nearly the same critical path as its transpose-form counterpart, but provides much

faster convergence and lower register complexity. From the critical-path evaluation, it is further shown that no

pipelining is required for implementing a direct-form LMS adaptive filter for most practical cases, and can be

realized with a very small adaptation delay in cases where a very high sampling rate is required. Based on

these findings, this paper proposes three structures of the LMS adaptive filter: (i) Design 1 having no

adaptation delays, (ii) Design 2 with only one adaptation delay, and (iii) Design 3 with two adaptation delays.

Design 1 involves the minimum area and the minimum energy per sample (EPS). The best of existing direct-

form structures requires 80.4% more area and 41.9% more EPS compared to Design 1. Designs 2 and 3

involve slightly more EPS than Design 1 but offer nearly twice and thrice the maximum usable frequency (MUF) at a cost of 55.0% and

60.6% more area, respectively.

ETPL

VLSI - 016

Critical-Path Analysis and Low-Complexity Implementation of the LMS

Adaptive Algorithm
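
As a reference point for the LMS entry above (VLSI - 016), the sketch below is a plain software model of a direct-form LMS adaptive filter with no adaptation delay, which is the behaviour the abstract attributes to Design 1. It says nothing about the hardware structure itself. The function name `lms_identify` and the step size `mu` are assumptions for illustration.

```python
def lms_identify(x, d, num_taps, mu):
    """Run a direct-form LMS filter with zero adaptation delay.

    x: input samples, d: desired samples, mu: step size.
    Returns the final weights and the error sequence.
    """
    weights = [0.0] * num_taps
    taps = [0.0] * num_taps
    errors = []
    for xn, dn in zip(x, d):
        taps = [xn] + taps[:-1]                              # input delay line
        y = sum(w * t for w, t in zip(weights, taps))        # filter output
        e = dn - y                                           # instantaneous error
        # The weights are updated with the error of the same sample,
        # i.e. no adaptation delay between error and update.
        weights = [w + mu * e * t for w, t in zip(weights, taps)]
        errors.append(e)
    return weights, errors
```

Designs with one or two adaptation delays would apply `e` from one or two samples earlier in the update line, trading convergence speed for a shorter critical path.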




Algebraic side-channel attack (ASCA) is a typical technique that relies on a general solver to solve the

equations of a cipher and its side-channel leaks. It falls under analytical side-channel attack and can recover

the entire key at once. Many ASCAs are proposed against the AES, and they utilize the Gröbner basis-based,

SAT-based, or optimizer-based solver. The advantage of the general solver approach is its generic feature,

which can be easily applied to different cryptographic algorithms. The disadvantage is that it is difficult to take

into account the specialized properties of the targeted cryptographic algorithms. The results vary depending on

what type of solver is used, and the time complexity is quite high when considering the error-tolerant attack

scenarios. Thus, we were motivated to find a new approach that would lessen the influence of the general

solver and reduce the time complexity of ASCA. This paper proposes a new analytical side-channel attack on

AES by exploiting the incomplete diffusion feature in one AES round. We named our technique incomplete

diffusion analytical side-channel analysis (IDASCA). Different from previous ASCAs, IDASCA adopts a

specialized approach to recover the secret key of AES instead of the general solver. Extensive attacks are

performed against the software implementation of AES on an 8-bit microcontroller. Experimental results show

that: 1) IDASCA can exploit the side-channel leaks in all AES rounds using a single power trace; 2) it has less

time complexity and more robustness than previous ASCAs, especially when considering the error-tolerant

attack scenarios; and 3) it can calculate the reduced key search space of AES for the given amount of side-

channel leaks. IDASCA can also interpret the mechanism behind previous ASCAs on AES from a quantitative

perspective, such as why ASCA can work under unknown plaintext/ciphertext scenarios and what are the

extreme cases in ASCAs.

ETPL

VLSI - 017

Exploiting the Incomplete Diffusion Feature: A Specialized Analytical Side-Channel

Attack Against the AES and Its Application to Microcontroller Implementations




This paper presents a high-throughput and ultralow-power asynchronous domino logic pipeline design

method, targeting latch-free and extremely fine-grain (gate-level) design. The data paths are composed of

a mixture of dual-rail and single-rail domino gates. Dual-rail domino gates are limited to construct a stable

critical data path. Based on this critical data path, the handshake circuits are greatly simplified, which offers

the pipeline high throughput as well as low power consumption. Moreover, the stable critical data path enables

the adoption of single-rail domino gates in the noncritical data paths. This further saves a lot of power by

reducing the overhead of logic circuits. An 8×8 array style multiplier is used for evaluating the proposed

pipeline method. Compared with a bundled-data asynchronous domino logic pipeline, the proposed pipeline,

respectively, saves up to 60.2% and 24.5% of energy in the best case and the worst case when processing

different data patterns.

ETPL

VLSI - 019

Design and simulation of power efficient traffic light controller (PTLC)

Modern superscalar processors implement register renaming using either random access memory (RAM) or

content-addressable memories (CAM) tables. The design of these structures should address both access time

and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are

more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents

them from scaling with the number of physical registers and pipeline width, negatively impacting

performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM–

CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM

provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM

enables immediate recovery. Experimental results show that in a four-way state-of-the-art superscalar

processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme,

while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less

energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid

RAM–CAM scheme does not exceed the area required by conventional renaming mechanisms.

ETPL

VLSI - 018

Efficient Register Renaming and Recovery for High-Performance Processors




In this paper, we propose a reliable low-power multiplier design by adopting algorithmic noise tolerant (ANT)

architecture with the fixed-width multiplier to build the reduced precision replica redundancy block (RPR).

The proposed ANT architecture can meet the demand of high precision, low power consumption, and area

efficiency. We design the fixed-width RPR with an error compensation circuit by analyzing probability and

statistics. Using the partial product terms of input correction vector and minor input correction vector to lower

the truncation errors, the hardware complexity of error compensation circuit can be simplified. In a 12×12 bit

ANT multiplier, circuit area in our fixed-width RPR can be lowered by 44.55% and power consumption in our

ANT design can be reduced by 23% compared with the state-of-the-art ANT design.

ETPL

VLSI - 020

Reliable Low-Power Multiplier Design Using Fixed-Width Replica Redundancy

Block
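
The ANT entry above (VLSI - 020) relies on a reduced precision replica (RPR) to catch large errors in the main multiplier. The sketch below shows the usual ANT decision rule under simplifying assumptions: unsigned operands, and a replica that simply multiplies the upper bits of each operand. That replica is a crude stand-in for the paper's fixed-width multiplier with error compensation. The names `rpr_multiply` and `ant_output` and the parameters `kept` and `threshold` are hypothetical.

```python
def rpr_multiply(a, b, width=12, kept=6):
    """Reduced-precision replica: multiply only the upper `kept` bits of each operand."""
    drop = width - kept
    return ((a >> drop) * (b >> drop)) << (2 * drop)


def ant_output(main_result, replica_result, threshold):
    """ANT decision: trust the main block unless it disagrees with the replica
    by more than the replica's own worst-case approximation error."""
    if abs(main_result - replica_result) > threshold:
        return replica_result      # main block assumed to have a soft/timing error
    return main_result
```

For 12-bit operands with `kept=6`, the replica's worst-case error is 4095·4095 − 63·63·4096 = 512001, so a threshold of `1 << 19` separates normal approximation error from a corrupted high-order bit in the main result.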

A unified very large-scale integration (VLSI) architecture with butterflies that can perform photo core

transform (PCT) in JPEG XR image compression is presented. The proposed architecture can achieve the

unified architecture design, which supports the three elemental operations of PCT, and it has the

characteristics of lower hardware cost, shorter critical path, lower power consumption, more efficient

hardware utilisation and regular structure for VLSI implementation. Finally, the implementation on Altera

field programmable gate array (FPGA) devices validates the effectiveness of the design.

ETPL

VLSI - 021

Unified VLSI architecture for photo core transform used in JPEG XR




We use mixed device-circuit simulations to predict the performance of 6T static RAM (SRAM) cells

implemented with tunnel-FETs (TFETs). Idealized template devices are used to assess the impact of device

unidirectionality, which is inherent to TFETs and identify the most promising configuration for the access

transistors. The same template devices are used to investigate the V_DD range, where TFETs may

be advantageous compared to conventional CMOS. The impact of device ambipolarity on SRAM operation is

also analyzed. Realistic device templates extracted from experimental data of fabricated state-of-the-art silicon

pTFET are then used to estimate the performance gap between the simulation of idealized TFETs and the best

experimental implementations.

ETPL

VLSI - 023

Impact of TFET Unidirectionality and Ambipolarity on the Performance of 6T SRAM

Cells

This paper proposes an efficient constant multiplier architecture based on vertical-horizontal binary common

sub-expression elimination (VHBCSE) algorithm for designing a reconfigurable finite impulse response (FIR)

filter whose coefficients can dynamically change in real time. To design an efficient reconfigurable FIR filter,

according to the proposed VHBCSE algorithm, 2-bit binary common sub-expression elimination (BCSE)

algorithm has been applied vertically across adjacent coefficients on the 2-D space of the coefficient matrix

initially, followed by applying variable-bit BCSE algorithm horizontally within each coefficient. This

technique is capable of reducing the average probability of use or the switching activity of the multiplier block

adders by 6.2% and 19.6% as compared to that of two existing 2-bit and 3-bit BCSE algorithms respectively.

ASIC implementation results of FIR filters using this multiplier show that the proposed VHBCSE algorithm is

also successful in reducing the average power consumption by 32% and 52% along with an improvement in

the area power product (APP) by 25% and 66% compared to those of the 2-bit and 3-bit BCSE algorithms

respectively. As regards the implementation of FIR filter, improvements of 13% and 28% in area delay

product (ADP) and 76.1% and 77.8% in power delay product (PDP) for the proposed VHBCSE algorithm

have been achieved over those of the earlier multiple constant multiplication (MCM) algorithms, viz. faithfully

rounded truncated multiple constant multiplication/accumulation (MCMAT) and multi-root binary partition

graph (MBPG) respectively. Efficiency shown by the results of comparing the FPGA and ASIC

implementations of the reconfigurable FIR filter designed using VHBCSE algorithm based constant multiplier

establishes the suitability of the proposed algorithm for efficient fixed point reconfigurable FIR filter

synthesis.

ETPL

VLSI - 022

An Efficient Constant Multiplier Architecture Based on Vertical-Horizontal Binary

Common Sub-expression Elimination Algorithm for Reconfigurable FIR Filter




This paper presents the architecture of a block-level-parallel layered decoder for irregular LDPC codes. It can be

reconfigured to support various block lengths and code rates of IEEE 802.11n (WiFi) wireless-communication

standard. We have proposed efficient comparison techniques for both column and row layered schedule and

rejection-based high-speed circuits to compute the two minimum values from multiple inputs required for row

layered processing of hardware-friendly min-sum decoding algorithm. The results show good speed with

lower area as compared to state-of-the-art circuits. Additionally, this work proposes dynamic multi-frame

processing schedule which efficiently utilizes the layered-LDPC decoding with minimum pipeline stages. The

suggested LDPC-decoder architecture has been synthesized and post-layout simulated in 90 nm-CMOS

process. This decoder occupies 5.19 mm² area and supports multiple code rates like 1/2, 2/3, 3/4

& 5/6 as well as block-lengths of 648, 1296 & 1944. At a clock frequency of 336 MHz, the proposed LDPC-

decoder has achieved better throughput of 5.13 Gbps and energy efficiency of 0.01 nJ/bit/iteration, as

compared to the similar state-of-the-art works.

ETPL

VLSI - 024

High-Throughput LDPC-Decoder Architecture Using Efficient Comparison

Techniques & Dynamic Multi-Frame Processing Schedule
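
The LDPC decoder entry above (VLSI - 024) depends on finding the two smallest values among the inputs of a check node, which is the core of min-sum decoding. The sketch below is a sequential software model of that computation on magnitudes only (sign handling is omitted); it does not reproduce the paper's rejection-based parallel circuits. The names `two_minimums` and `check_node_magnitudes` are hypothetical.

```python
def two_minimums(values):
    """Return (smallest, second smallest, index of smallest) in one pass."""
    if len(values) < 2:
        raise ValueError("need at least two inputs")
    min1 = min2 = float("inf")
    idx1 = -1
    for i, v in enumerate(values):
        if v < min1:
            min2 = min1            # old minimum becomes the runner-up
            min1, idx1 = v, i
        elif v < min2:
            min2 = v
    return min1, min2, idx1


def check_node_magnitudes(magnitudes):
    """Min-sum check-node update: each edge receives the minimum of all OTHER edges."""
    min1, min2, idx1 = two_minimums(magnitudes)
    return [min2 if i == idx1 else min1 for i in range(len(magnitudes))]
```

Only two values and one index need to be stored per check node, which is why the two-minimum finder dominates the row-processing hardware.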

Digital multipliers are among the most critical arithmetic functional units. The overall performance of these

systems depends on the throughput of the multiplier. Meanwhile, the negative bias temperature instability

effect occurs when a pMOS transistor is under negative bias (Vgs = -Vdd), increasing the threshold voltage of

the pMOS transistor, and reducing multiplier speed. A similar phenomenon, positive bias temperature

instability, occurs when an nMOS transistor is under positive bias. Both effects degrade transistor speed, and

in the long term, the system may fail due to timing violations. Therefore, it is important to design reliable

high-performance multipliers. In this paper, we propose an aging-aware multiplier design with a novel

adaptive hold logic (AHL) circuit. The multiplier is able to provide higher throughput through the variable

latency and can adjust the AHL circuit to mitigate performance degradation that is due to the aging effect.

Moreover, the proposed architecture can be applied to a column- or row-bypassing multiplier. The experimental

results show that our proposed architecture with 16 × 16 and 32 × 32 column-bypassing multipliers can attain

up to 62.88% and 76.28% performance improvement, respectively, compared with 16×16 and 32×32 fixed-

latency column-bypassing multipliers. Furthermore, our proposed architecture with 16 × 16 and 32 × 32 row-

bypassing multipliers can achieve up to 80.17% and 69.40% performance improvement as compared with

16×16 and 32 × 32 fixed-latency row-bypassing multipliers.

ETPL

VLSI - 025

Aging-Aware Reliable Multiplier Design With Adaptive Hold Logic


Bloom filters (BFs) provide a fast and efficient way to check whether a given element belongs to a set. The

BFs are used in numerous applications, for example, in communications and networking. There is also

ongoing research to extend and enhance BFs and to use them in new scenarios. Reliability is becoming a

challenge for advanced electronic circuits as the number of errors due to manufacturing variations, radiation,

and reduced noise margins increases as technology scales. In this brief, it is shown that BFs can be used to

detect and correct errors in their associated data set. This allows a synergetic reuse of existing BFs to also

detect and correct errors. This is illustrated through an example of a counting BF used for IP traffic

classification. The results show that the proposed scheme can effectively correct single errors in the associated

set. The proposed scheme can be of interest in practical designs to effectively mitigate errors with a reduced

overhead in terms of circuit area and power.
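
One way to picture the reuse described above: if a stored word suffers a single bit flip, the corrupted word will very likely fail the Bloom filter's membership test, and flipping each bit back in turn reveals which candidates the filter accepts. The sketch below is an illustration under that assumption, not the scheme from the brief; the class `CountingBloomFilter`, the helper `correct_single_bit_error`, and the salted SHA-256 hashing are all hypothetical choices.

```python
import hashlib


class CountingBloomFilter:
    """Small counting Bloom filter over integers (illustrative only)."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, value):
        # Derive k counter positions from salted SHA-256 digests.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.counters[p] += 1

    def __contains__(self, value):
        return all(self.counters[p] > 0 for p in self._positions(value))


def correct_single_bit_error(word, bloom, width=16):
    """Return every word at Hamming distance 1 from `word` that the filter accepts."""
    candidates = []
    for bit in range(width):
        flipped = word ^ (1 << bit)
        if flipped in bloom:
            candidates.append(flipped)
    return candidates


if __name__ == "__main__":
    bf = CountingBloomFilter()
    for stored in (0x1A2B, 0x0F0F, 0x7777):
        bf.add(stored)
    corrupted = 0x1A2B ^ (1 << 5)  # inject a single-bit error
    print([hex(c) for c in correct_single_bit_error(corrupted, bf)])
```

Because Bloom filters admit false positives, the candidate list can occasionally contain more than one word; a real design would need a policy for that case.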

ETPL

VLSI - 026

A Synergetic Use of Bloom Filters for Error Detection and Correction

In this paper, we propose a level-converting retention flip-flop (RFF) for ZigBee systems-on-chips (SoCs).

The proposed RFF allows the voltage regulator that generates the core supply voltage (VDD,core) to be turned

off in the standby mode, and it thus reduces the standby power of the ZigBee SoCs. The logic states are

retained in a slave latch composed of thick-oxide transistors using an I/O supply voltage (VDD,IO) that is

always turned on. Level-up conversion from VDD,core to VDD,IO is achieved by an embedded nMOS pass-

transistor level-conversion scheme that uses a low-only signal-transmitting technique. By embedding a

retention latch and level-up converter into the data-to-output path of the proposed RFF, the RFF resolves the

problems of the static RAM-based RFF, such as large dc current and low readability caused by threshold drop.

The proposed RFF also does not require additional control signals for power mode transitioning. Using 0.13-

μm process technology, we implemented an RFF with VDD,core and VDD,IO of 1.2 and 2.5 V, respectively.

The maximum operating frequency is 300 MHz. The active energy of the RFF is 191.70 fJ, and its standby

power is 350.25 pW.

ETPL

VLSI - 027

Level-Converting Retention Flip-Flop for Reducing Standby Power in ZigBee

SoCs


We prove analytically that the yield of static random access memory (SRAM) is intrinsically a function of its

architecture owing to the correlation among cell failures. In addition, architecture-aware analytical yield

models are proposed for read access. The yield results using the proposed models show that the most dominant

factor determining yield is the variation in the voltage difference between bitlines due to the cell leakage

current variation according to the SRAM architecture. The models also show the possibility that the most

dominant factor determining the yield can change with the relative ratios among the amounts of changes in the

correlation, recovery sample space, distributions of the sense amplifier enable time, voltage difference

between bitlines, as well as sense amplifier offset voltage, memory capacity, and redundancy scheme. The

proposed yield models show that combined row and column redundancy ensures the highest yield, whereas

column redundancy is the most efficient.

ETPL

VLSI - 029

Architecture-Aware Analytical Yield Model for Read Access in Static Random

Access Memory

The confluence of 3-D integration and network-on-chip (NoC) provides an effective solution to the scalability

problem of on-chip interconnects. In 3-D integration, through-silicon via (TSV) is considered to be the most

promising bonding technology. However, TSVs are also precious link resources because they consume

significant chip area and possibly lead to routing congestion in the physical design stage. In addition, TSVs

suffer from serious yield losses that shrink the effective TSV density. Thus, it is necessary to implement a

TSV-economical 3-D NoC architecture in cost-effective design. For symmetric 3-D mesh NoCs, we observe

that the TSV bandwidth utilization is low and that TSVs rarely become contention spots in the network, unlike planar

links. Based on this observation, we propose the TSV sharing (TS) scheme to save TSVs in 3-D NoC by

enabling neighboring routers to share the vertical channels in a time division multiplexing way. We also

investigate different TS implementation alternatives and show how TS improves TSV-effectiveness (TE) in

multicore processors through a design space exploration. In experiments, we comprehensively evaluate TS's

influence on all layers of system. It is shown that the proposed method significantly promotes TE with

negligible performance overhead.

ETPL

VLSI - 028

Economizing TSV Resources in 3-D Network-on-Chip Design


Content addressable memories (CAMs) enable high-speed parallel search operations in table lookup-based

applications, such as Internet routers and processor caches. Traditional CAM design has always suffered from

the high dynamic power consumption associated with its large and active parallel hardware. However, deeply

scaled technology nodes, with multigate devices replacing planar MOSFETs, are expected to bring new

tradeoffs to CAM design. FinFET, a vertical-channel gate-wrap-around double-gate device, has emerged as

the best alternative to planar MOSFET. In this brief, for the first time, we explore the design space of

symmetric and asymmetric gate-workfunction FinFET CAMs. We propose several design alternatives and

evaluate them in terms of their dc and transient metrics for different mismatch probabilities using technology

computer-aided design simulations with 22-nm FinFET devices. We also propose two orthogonal layout styles

for CAM design and show that one of them (vertical-search line) outperforms the other (vertical-match line) in

terms of total power (22.3%) and search delay (5.8%).

ETPL

VLSI - 030

Design of Efficient Content Addressable Memories in High-Performance

FinFET Technology

We propose a low-power content-addressable memory (CAM) employing a new algorithm for associativity

between the input tag and the corresponding address of the output data. The proposed architecture is based on

a recently developed sparse clustered network using binary connections that on-average eliminates most of the

parallel comparisons performed during a search. Therefore, the dynamic energy consumption of the proposed

design is significantly lower compared with that of a conventional low-power CAM design. Given an input

tag, the proposed architecture computes a few possibilities for the location of the matched tag and performs the

comparisons on them to locate a single valid match. TSMC 65-nm CMOS technology was used for simulation

purposes. Following a selection of design parameters, such as the number of CAM entries, the energy

consumption and the search delay of the proposed design are 8% and 26% of those of the conventional NAND

architecture, respectively, with a 10% area overhead. A design methodology based on the silicon area and

power budgets, and performance requirements is discussed.

ETPL

VLSI - 031

Algorithm and Architecture for a Low-Power Content-Addressable Memory

Based on Sparse Clustered Networks


Network-on-chip (NoC) has emerged as a vital factor that determines the performance and power

consumption of many-core systems. This paper proposes a hybrid scheme for NoCs, which aims at

obtaining low latency and low power consumption. In the presented hybrid scheme, a novel

switching mechanism, called virtual circuit switching, is proposed to intermingle with circuit

switching and packet switching. Flits traveling in virtual circuit switching can traverse the router with

only one stage. In addition, multiple virtual circuit-switched (VCS) connections are allowed to share

a common physical channel. Moreover, a path allocation algorithm is proposed in this paper to

determine VCS connections and circuit-switched connections on a mesh-connected NoC, such that

both communication latency and power are optimized. A set of synthetic and real traffic workloads

are exploited to evaluate the effectiveness of the proposed hybrid scheme. The experimental results

show that our proposed hybrid scheme can efficiently reduce the communication latency and power.

For instance, for real traffic workloads, an average of 20.3% latency reduction and 33.2% power

saving can be obtained when compared with the baseline NoC. Moreover, when compared with the

NoC with virtual point-to-point connections (VIP), the proposed hybrid scheme can reduce the

latency by 6.8% and the power by 11.3% on average.

ETPL

VLSI - 032

A Low-Latency and Low-Power Hybrid Scheme for On-Chip Networks

This paper presents the design and the VLSI implementation of an asynchronous cellular logic array for fast

binary image processing. The proposed processor array employs trigger-wave propagation and collision

detection mechanisms for binary image skeletonization and Voronoi tessellation. Low power, low area, and

high processing speed are achieved using full custom dynamic logic design. The prototype array consisting of

64 × 96 cells is fabricated in a standard 90 nm CMOS technology. The experimental results confirm the fast

operation of the array, capable of extracting up to skeletons per second, consuming less than 1 nJ/skeleton.

The asynchronous operation enables circular wave contours, which improves the quality of the extracted

skeletons. The proposed asynchronous processing module consists of 24 MOS transistors and occupies area.

Such an array can be used as a co-processing unit aiding global binary image processing in standard pixel-parallel

SIMD architectures in vision chips.

ETPL

VLSI - 033

Trigger-Wave Asynchronous Cellular Logic Array for Fast Binary Image

Processing


This paper presents the design and testing of an electrode driving application specific integrated circuit (ASIC)

intended for epidural spinal cord electrical stimulation in rats. The ASIC can deliver up to 1 mA fully

programmable monophasic or biphasic stimulus current pulses to 13 electrodes selected in any possible

configuration. It also supports interleaved stimulation. Communication is achieved via only 3 wires. The

current source and the control of the stimulation timing were kept off-chip to reduce the heat dissipation close

to the spinal cord. The ASIC was designed in a 0.18-μm high voltage CMOS process. Its output voltage

compliance can be up to 25 V. It features a small core area ( mm) and consumes a maximum of 114 μW during

a full stimulation cycle. The layout of the ASIC was

developed to be suitable for integration on the epidural electrode array, and two different versions were

fabricated and electrically tested. Results from both versions were almost indistinguishable. The performance

of the system was verified for different loads and stimulation parameters. Its suitability to drive a passive

epidural 12-electrode array in saline has also been demonstrated.

ETPL

VLSI - 034

An Implantable Versatile Electrode-Driving ASIC for Chronic Epidural

Stimulation in Rats

The DCT and the DWT are used in a number of emerging DSP applications, such as, HD video compression,

biomedical imaging, and smart antenna beamformers for wireless communications and radar. Of late, there has

been much interest in fast algorithms for the computation of the above transforms using multiplier-free

approximations because they result in low power and low complexity systems. Approximate methods rely on

the trade-off of accuracy for lower power and/or circuit complexity/chip-area. This paper provides a detailed

review of VLSI architectures and CAS implementations for both DCT/DWTs, which can be designed either

for higher-accuracy or for low-power consumption. This article covers both recent theoretical advancements

on discrete transforms in addition to an overview of existing VLSI architectures. The paper also discusses

error-free VLSI architectures that provide high-accuracy systems and approximate architectures that offer

high computational gain making them highly attractive for real-world applications that are subject to

constraints in both chip-area as well as power. The methods discussed in the paper can be used in the design of

emerging low-power digital systems having lowest complexity at the cost of a loss in accuracy: the optimal

trade-off of computational accuracy for lowest possible complexity and power. A complete synopsis of

available techniques, algorithms, and FPGA/VLSI realizations is given in the paper.

ETPL

VLSI - 035

Low-Power VLSI Architectures for DCT/DWT: Precision vs Approximation for

HD Video, Biomedical, and Smart Antenna Applications


This paper presents a novel algorithm and architecture design for 18-band quasi-class-2 ANSI S1.11 1/3

octave filterbank. The proposed design has several advantages such as lower group delay, lower computational

complexity, and lower matching error. The technique we developed in this paper can be summarized as

follows: 1) a simple low-pass filter (LPF) and discrete cosine transform (DCT) modulation are utilized to

generate a uniform 9-band filterbank first, and then all elements of z^{-1} are replaced by all-pass filters

to obtain a non-uniform filterbank; 2) a fast recursive structure and variable-length algorithm is further

developed to efficiently accomplish DCT modulation. Thus, the spectrum of LPF can be easily spanned and

flexibly extended to the location of the desired central frequency; 3) after employing the multi-rate algorithm,

an 18-band non-uniform filterbank is generated from two 9-band sub filterbanks by following the proposed

design steps and parameter determinations. Compared with Liu's latest quasi-class-2 ANSI S1.11 design,

the proposed method-I (Proposed-I) has a 72.8% reduction in multiplications per sample, an 11.25-ms

group delay, and 59 fewer additions per sample. Moreover, the maximum matching error of the proposed

method-II (Proposed-II) is 1.79 dB on average, much smaller than that of Wei's latest design. For

the proposed variable-length DCT modulation, only 2 adders, 2 multipliers, 2 multiplexers, and 5 registers are

required for hardware implementation after applying VLSI retiming scheme. Overall, the proposed filterbank

design would be a new solution for future applications in the area of hearing aids.

ETPL

VLSI - 036

11.25-ms-Group-Delay and Low-Complexity Algorithm Design of 18-Band Quasi-ANSI

S1.11 1/3 Octave Digital Filterbank for Hearing Aids

A fully-integrated low-dropout regulator (LDO) with fast transient response and full spectrum power supply

rejection (PSR) is proposed to provide a clean supply for noise-sensitive building blocks in wideband

communication systems. With the proposed point-of-load LDO, chip-level high-frequency glitches are well

attenuated, consequently the system performance is improved. A tri-loop LDO architecture is proposed and

verified in a 65 nm CMOS process. In comparison to other fully-integrated designs, the output pole is set to be

the dominant pole, and the internal poles are pushed to higher frequencies with only 50 μA of total quiescent

current. For a 1.2 V input voltage and 1 V output voltage, the measured undershoot and overshoot are only 43

mV and 82 mV, respectively, for load transient of 0 μA to 10 mA within edge times of 200 ps. It achieves a

transient response time of 1.15 ns and the figure-of-merit (FOM) of 5.74 ps. PSR is measured to be better than

-12 dB over the whole spectrum (DC to 20 GHz tested). The prototype chip measures 260 × 90 μm², including

140 pF of stacked on-chip capacitors.

ETPL

VLSI - 037

A Fully-Integrated Low-Dropout Regulator With Full-Spectrum Power Supply

Rejection


Spin-torque transfer RAM (STT-RAM), a promising alternative to static RAM (SRAM) for reducing leakage

power consumption, has been widely studied to mitigate the impact of its asymmetrically long write latency.

However, physical effects of technology scaling down to 45 nm and below, in particular, process variation,

introduce the previously unreported and alarming trends in read performance and reliability due to reduced

sensing margins and increasing error rates. In this brief, we study the scaling trends of STT-RAM from 65

down to 22 nm as they pertain to read performance, including a 50% increase in sensing versus peripheral

circuit delay ratio and a more than 80% increase in uncorrectable read error rates. Through differential

sensing, we show how 22 nm can return to sense delay ratio levels at 65 nm and uncorrectable read errors can

be reduced by an order of magnitude. Through a case study of a multilevel STT-RAM cache, we show how a

reconfigurable cache cell can create an extreme access mode (X-mode) based on differential sensing

to outperform the state-of-the-art STT-RAM caching techniques in both raw performance and performance per

watt by more than 10% while still reducing energy consumption over SRAM caches by more than 1/3.

ETPL

VLSI - 038

Read Performance: The Newest Barrier in Scaled STT-RAM

This paper presents a digital-subranging (sub-R) analog-to-digital conversion (ADC) architecture to improve

the operation speed of sub-R ADCs. Long latency between coarse and fine conversions will slow down the

conventional sub-R ADCs. The proposed digital-sub-R uses digital circuits to implement the sub-R function

and shorten this latency, thus benefits the CMOS scaling. Furthermore, the dynamic comparators are used to

save more ADC power consumption. Their accuracy is improved by the proposed pseudodifferential offset

calibration loop. The digital-sub-R also helps to reduce the dynamic offset of the fine comparators caused by

the input common-mode variation. Fabricated using a 55-nm CMOS technology, the reported 8-bit 1-GS/s

ADC consumes only 16 mW from a 1.2 V supply. Measured signal-to-noise ratio (SNR) and spurious free

dynamic range (SFDR) are 46 and 55 dB, respectively. Measured effective number of bits (ENOB) is seven

bits at 10-MHz input frequency. At Nyquist input, the ENOB performance of 6.3 bits is still maintained. Its

figure-of-merit is 197-fJ/conversion-step.

ETPL

VLSI - 039

A 16-mW 8-Bit 1-GS/s Digital-Subranging ADC in 55-nm CMOS


A frequency-tuning negative-conductance (−G_m) boosted structure and its applications for a voltage-

controlled oscillator (VCO) are presented in this paper. Analog tuning varactors connected to a −G_m

boosted structure are proposed to significantly alleviate the limitation of the tuning range for the

−G_m boosted structure, resulting in a low-voltage low-power wide-tuning-range VCO. Based on

the proposed architecture, the fabricated 0.18-μm CMOS VCO exhibits a measured 49.8% tuning range.

Operating at 0.65 V low supply voltage, the VCO core consumes 2.37-mW dc power. At this bias condition,

the measured average value of phase noise for all frequency ranges is −115.1 dBc/Hz at 1-MHz offset from the

carriers. Relative to recently published wide-tuning-range CMOS VCOs, the proposed VCO simultaneously

achieves low supply voltage, low dc power dissipation, low phase noise, and a wide tuning range, leading to a

good figure-of-merit (FOM) and FOM including the tuning range. Furthermore, formulas of analysis for the

proposed frequency-tuning −G_m boosted structure and wide VCO tuning range are presented,

and the mechanisms are validated by experiments.

ETPL

VLSI - 040

Frequency-Tuning Negative-Conductance Boosted Structure and Applications

for Low-Voltage Low-Power Wide-Tuning-Range VCO

Ternary content addressable memories (TCAMs) perform high-speed lookup operations, but when compared

with static random access memories (SRAMs), TCAMs have certain limitations such as low storage density,

relatively slow access time, low scalability, complex circuitry, and high cost. Thus, can we use the

benefits of SRAM by configuring it (with additional logic) to enable it to behave like TCAM? This brief

proposes a novel memory architecture, named Z-TCAM, which emulates the TCAM functionality with

SRAM. Z-TCAM logically partitions the classical TCAM table along columns and rows into hybrid TCAM

subtables, which are then processed to map on their corresponding memory blocks. Two example designs for

Z-TCAM of sizes 512 × 36 and 64 × 32 have been implemented on Xilinx Virtex-7 field-programmable gate

array. The design of 64 × 32 Z-TCAM has also been implemented using OSUcells library for 0.18 μm

technology, which confirms the physical and technical feasibility of Z-TCAM. Search latency for each design

is three clock cycles. The detailed implementation results and power measurements for each design have been

reported thoroughly.
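
The general idea of emulating a TCAM with SRAM can be sketched in software: split the search key into chunks, let each chunk address a small table whose entries are bit vectors of the rules matching that chunk, and AND the vectors together. This is a generic illustration of SRAM-based TCAM emulation, not the Z-TCAM partitioning itself; the names `build_tables` and `search` and the 2-bit chunk width are assumptions.

```python
def build_tables(rules, chunk_bits):
    """Build one lookup table per key chunk.

    rules: ternary strings over '0', '1', 'x' (don't care), all of equal
    width, which must be a multiple of chunk_bits. Table entry `addr`
    holds a bit vector with bit r set when rule r matches that chunk value.
    """
    width = len(rules[0])
    tables = []
    for start in range(0, width, chunk_bits):
        table = [0] * (1 << chunk_bits)
        for row, rule in enumerate(rules):
            pattern = rule[start:start + chunk_bits]
            for addr in range(1 << chunk_bits):
                bits = format(addr, f"0{chunk_bits}b")
                if all(p in ("x", b) for p, b in zip(pattern, bits)):
                    table[addr] |= 1 << row
        tables.append(table)
    return tables


def search(tables, key, chunk_bits):
    """Return the index of the first matching rule, or None if no rule matches."""
    match = -1  # all bits set; narrowed by each table read
    for i, table in enumerate(tables):
        addr = int(key[i * chunk_bits:(i + 1) * chunk_bits], 2)
        match &= table[addr]
    if match == 0:
        return None
    # Priority encoding: isolate the lowest set bit.
    return (match & -match).bit_length() - 1


if __name__ == "__main__":
    rules = ["10xx", "1x01", "xxxx"]
    tables = build_tables(rules, chunk_bits=2)
    for key in ("1001", "1101", "0000"):
        print(key, search(tables, key, chunk_bits=2))
```

Each table read corresponds to one SRAM access, so memory grows with the number of chunks times 2^chunk_bits rather than with 2^key_width, which is the trade-off that makes partitioning attractive.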

ETPL

VLSI - 041

Z-TCAM: An SRAM-based Architecture for TCAM


The common objective of very large-scale integration (VLSI) placement problem is to minimize the total

wirelength, which is calculated by the total half-perimeter wirelength (HPWL). Since the HPWL is not

differentiable, various differentiable wirelength approximation functions have been proposed in analytical

placement methods. In this paper, we reformulate the HPWL as an ℓ1-norm model of the wirelength

function, which is exact but nonsmooth. Based on the ℓ1-norm wirelength model and exact calculation of

overlapping areas between cells and bins, a nonsmooth optimization model is proposed for the VLSI global

placement problem, and a subgradient method is proposed for solving the nonsmooth optimization problem.

Moreover, local convergence of the subgradient method is proved under some suitable conditions. In addition,

two enhanced techniques, i.e., an adaptive parameter to control the step size and a cautious strategy for

increasing the penalty parameter, are also used in the nonsmooth optimization method. In order to make the

placement method scalable, a multilevel framework is adopted. In the clustering stage, the best choice

clustering algorithm is modified according to the ℓ1-norm wirelength model to cluster the cells, and the

nonsmooth optimization method is recursively used in the declustering stage. Comparisons of experimental

results on the International Symposium on Physical Design (ISPD) 2005 and 2006 benchmarks show that the

global placement method is promising.
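
For reference, the HPWL of a net is the half-perimeter of the bounding box of its pins, which for a two-pin net equals the Manhattan distance between the pins. The sketch below computes HPWL and one valid subgradient with respect to the pin coordinates; it illustrates why the function is exact but nonsmooth and is not the paper's optimization model. The function names are hypothetical.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net given as a list of (x, y) pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of HPWL over all nets."""
    return sum(hpwl(pins) for pins in nets)


def hpwl_subgradient(pins):
    """One subgradient of HPWL with respect to each pin coordinate.

    Per axis, the pin attaining the maximum gets +1 and the pin attaining
    the minimum gets -1; all other pins get 0. When several pins tie, the
    choice is not unique, which is where the nonsmoothness appears.
    """
    grads = [[0.0, 0.0] for _ in pins]
    for axis in (0, 1):
        coords = [p[axis] for p in pins]
        hi = coords.index(max(coords))
        lo = coords.index(min(coords))
        grads[hi][axis] += 1.0
        grads[lo][axis] -= 1.0
    return grads


if __name__ == "__main__":
    net = [(0, 0), (3, 4), (1, 2)]
    print(hpwl(net), hpwl_subgradient(net))
```

Only the pins on the bounding box receive a nonzero entry, so a subgradient step moves boundary pins inward and leaves interior pins untouched.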

ETPL

VLSI - 042

Nonsmooth Optimization Method for VLSI Global Placement

Thompson's model of very large scale integration computation relates the energy of a computation to the

product of the circuit area and the number of clock cycles needed to carry out the computation. It is shown that

for any sequence of increasing block-length decoder circuits implemented according to this model, if the

probability of block error is asymptotically less than 1/2 then the energy of the computation scales at least as

Ω(n(log n)^{1/2}), and so the energy of decoding per bit must scale at least as Ω((log n)^{1/2}). This implies that the

average energy per decoded bit must approach infinity for any sequence of decoders that approaches capacity.

The analysis techniques used are then extended to show that for any sequence of increasing block-length serial

decoders, if the asymptotic block error probability is less than 1/2 then the energy scales at least as fast as Ω(n

log n). In a very general case that allows for the number of output pins to vary with block length, it is shown

that the energy must scale as Ω(n(log n)^{1/5}). A simple example is provided of a class of circuits performing

low-density parity-check decoding whose energy complexity scales as O(n^2 log log n).
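
The step from total energy to energy per decoded bit in the first bound is a division by the block length n. A short restatement, using only the bound quoted in the abstract and the standard assumption that approaching capacity requires n to grow without bound:

```latex
E(n) = \Omega\!\left(n\,(\log n)^{1/2}\right)
\quad\Longrightarrow\quad
\frac{E(n)}{n} = \Omega\!\left((\log n)^{1/2}\right),
\qquad
(\log n)^{1/2} \to \infty \ \text{as } n \to \infty .
```

The lower bound on the per-bit energy is therefore unbounded in n, which is the sense in which the average energy per decoded bit must approach infinity.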

ETPL

VLSI - 043

Energy Consumption of VLSI Decoders


A low-power test generation procedure that was developed earlier merges broadside test cubes that are derived

from functional broadside tests in order to generate a low-power broadside test set. This has several

advantages, most importantly, that test cubes, which are derived from functional broadside tests, create

functional operation conditions in subcircuits around the sites of detected faults. These conditions are

preserved when a test cube is merged with other test cubes. This brief applies a similar approach to the

generation of a low-power skewed-load test set. The main challenge that this paper addresses is the derivation

of skewed-load test cubes from functional broadside tests. The paper also considers the percentages of values

that should be unspecified in the skewed-load test cubes in order to balance the need to create functional

operation conditions with the need for test compaction.

ETPL

VLSI - 044

Skewed-Load Test Cubes Based on Functional Broadside Tests for a Low-

Power Test Set

Minimizing energy consumption is of utmost importance in an energy starved system with relaxed

performance requirements. This brief presents a digital energy sensing method that requires neither a constant

voltage reference nor a time reference. An energy minimizing loop uses this to find the minimum energy point

and sets the supply voltage between 0.2 and 0.5 V. Energy savings up to 1275% over existing minimum

energy tracking techniques in the literature is achieved.

ETPL

VLSI - 045

All Digital Energy Sensing for Minimum Energy Tracking


With the significant increase in the number of processing elements in NoC-based MPSoCs,

communication becomes, increasingly, a critical resource for performance gains and quality-of-

service (QoS) guarantees. The main gap observed in the NoC-based MPSoCs literature is the runtime

adaptive techniques to meet QoS. In the absence of such techniques, the system user must statically

define, for example, the scheduling policy, communication priorities, and the communication

switching mode of applications. The goal of this paper is to investigate the runtime adaptation of the

NoC resources, according to the QoS requirements of each application running in the MPSoC. This

paper adopts an NoC architecture with duplicated physical channels, adaptive routing, support to

flow priorities and simultaneous packet and circuit switching. The monitoring and adaptation

management is performed at the operating system level, ensuring QoS to the monitored applications.

The QoS acts in the flow priority and the switching mode. Monitoring and QoS adaptation were

implemented in software, resulting in flexibility to apply the techniques to other platforms or include

other adaptive techniques, such as task migration or DVFS. Applications with latency and throughput

deadlines run concurrently with best-effort applications. Results with synthetic and real applications

show an average 60% reduction in latency violations, ensuring smaller jitter and throughput. The execution

time of applications is not penalized by applying the proposed QoS adaptation methods.

ETPL

VLSI - 046

Fat-Tree-Based Optical Interconnection Networks Under Crosstalk Noise

Constraint

We prove analytically that the yield of static random access memory (SRAM) is intrinsically a function of its

architecture owing to the correlation among cell failures. In addition, architecture-aware analytical yield

models are proposed for read access. The yield results using the proposed models show that the most dominant

factor determining yield is the variation in the voltage difference between bitlines due to the cell leakage

current variation according to the SRAM architecture. The models also show the possibility that the most

dominant factor determining the yield can change with the relative ratios among the amounts of changes in the

correlation, recovery sample space, distributions of the sense amplifier enable time, voltage difference

between bitlines, as well as sense amplifier offset voltage, memory capacity, and redundancy scheme. The

proposed yield models show that combined row and column redundancy ensures the highest yield, whereas

column redundancy is the most efficient.

ETPL

VLSI - 047

Architecture-Aware Analytical Yield Model for Read Access in Static Random

Access Memory


This brief presents a two-port disturb-free 9T subthreshold static random access memory (SRAM) cell with

independent single-ended read bitline and write bitline (WBL) and cross-point data-aware write structure to

facilitate robust subthreshold operation and bit-interleaving architecture for enhanced soft error immunity. The

design employs a variation-tolerant line-up write-assist scheme where the timing of area-efficient boosted write

wordline and negative WBL are aligned and triggered/initiated by the same low-going global WBL to

maximize the write-ability enhancement. A 72-kb test chip is implemented in United Microelectronics Corp.

40-nm low-power (40LP) CMOS. Full functionality is achieved for VDD ranging from 1.5 to 0.32 V without

redundancy. The measured maximum operation frequency is 260 MHz (450 kHz) at 1.1 V (0.32 V) and 25 °C.

At 0.325 V and 25 °C, the chip operates at 600 kHz with 5.78 μW total power and 4.69 μW leakage power,

offering 2× frequency improvement compared with 300 kHz of our previous 72-kb 9T subthreshold SRAM

design in the same 40LP technology. The energy efficiency (power/frequency/IO) at 0.325 V and 25 °C is

0.267 pJ/bit, a 23.7% improvement over the 0.350 pJ/bit of our previous design.
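
The quoted energy-efficiency metric (power/frequency/IO) can be checked with a few lines of arithmetic. The abstract does not state the IO width; the value of 36 bits used below is an assumption chosen because it reproduces the reported figure to within rounding, and the function names are hypothetical.

```python
def energy_per_bit_pj(power_uw, freq_khz, io_bits):
    """Energy per accessed bit in pJ, computed as power / frequency / IO width."""
    # uW divided by kHz gives nJ per cycle; multiply by 1000 for pJ.
    energy_per_cycle_pj = power_uw / freq_khz * 1000.0
    return energy_per_cycle_pj / io_bits


def improvement_percent(previous, current):
    """Relative reduction of `current` with respect to `previous`, in percent."""
    return (previous - current) / previous * 100.0


if __name__ == "__main__":
    ASSUMED_IO_BITS = 36  # assumption: not given in the abstract
    print(energy_per_bit_pj(5.78, 600.0, ASSUMED_IO_BITS))
    print(improvement_percent(0.350, 0.267))
```

The improvement figure depends only on the two reported pJ/bit values, so it does not rely on the assumed IO width.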

ETPL

VLSI - 048

A 0.325 V, 600-kHz, 40-nm 72-kb 9T Subthreshold SRAM with Aligned Boosted

Write Wordline and Negative Write Bitline Write-Assist

A CMOS pulsewidth modulation (PWM) transceiver circuit that exploits the self-referenced edge

detection technique is presented. By comparing the rising edge that is self-delayed by about 0.5 T and

the modulated falling edge in one carrier clock cycle, area-efficient and high-robustness (against

timing fluctuations) edge detection enabling PWM communication is achieved without requiring

elaborate phase-locked loops. Since the proposed self-referenced edge detection circuit has the

capability of timing error measurement while changing the length of self-delay element, adaptive

data-rate optimization and delay-line calibration are realized. The measured results with a 65-nm

CMOS prototype demonstrate a 2-bit PWM communication, high data rate (3.2 Gb/s), and high

reliability (BER < 10⁻¹²) with small area occupation (540 μm²). For reliability improvement, error

check and correction associated with intercycle edge detection is introduced and its effectiveness is

verified by 1-bit PWM measurement.
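The edge-comparison idea can be illustrated with a small behavioral model of the 1-bit case. This is only a sketch: the 25% and 75% duty cycles, the normalized period, and the function names `encode` and `decode` are assumptions for illustration, not details taken from the paper.

```python
T = 1.0  # carrier clock period, normalized

# ASSUMPTION: illustrative 1-bit PWM duty cycles; the abstract does not give them.
DUTY = {0: 0.25, 1: 0.75}


def encode(bits):
    """Return the falling-edge time of each cycle, measured from its rising edge."""
    return [DUTY[b] * T for b in bits]


def decode(falling_edges, self_delay=0.5 * T):
    """Compare each falling edge with the rising edge delayed by about 0.5 T.

    A falling edge that arrives after the delayed rising edge decodes as 1,
    otherwise as 0. No external reference clock or PLL is modeled.
    """
    return [1 if t_fall > self_delay else 0 for t_fall in falling_edges]


bits = [1, 0, 0, 1, 1, 0]
jitter = [0.10, -0.08, 0.12, -0.15, 0.05, 0.20]  # timing fluctuation, in units of T
received = [t + j for t, j in zip(encode(bits), jitter)]
recovered = decode(received)
print(recovered == bits)
```

In this model the decision tolerates any edge displacement smaller than the gap between the nominal falling edge and the 0.5 T reference, which is the sense in which the comparison is robust against timing fluctuations.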

ETPL

VLSI - 049

A CMOS PWM Transceiver Using Self-Referenced Edge Detection




Phase change memory (PCM) is a promising DRAM replacement in embedded systems due to its

attractive characteristics, such as low-cost, shock-resistivity, nonvolatility, high density, and low

leakage power. However, relatively low endurance has limited its practical applications. In this paper,

in addition to existing hardware level optimizations, we propose software enabled wear-leveling

techniques to further extend PCM's lifetime when it is adopted in embedded systems. Most existing

software optimization techniques focus on reducing the total number of writes to PCM, but none of

them consider wear leveling, in which the writes are distributed more evenly over the PCM. An

integer linear programming formulation and a polynomial-time algorithm, the software wear-leveling

algorithm, are proposed in this paper to achieve wear leveling without hardware overhead. According

to the experimental results, the proposed techniques can reduce the number of writes on the most-written

addresses by more than 80% when compared with a greedy algorithm, and by more than 60%

when compared with the existing optimal data allocation algorithm with under 6% memory access

overhead.
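The difference between reducing total writes and leveling them can be seen in a toy allocation problem. The sketch below is not the paper's integer linear program or its software wear-leveling algorithm; it substitutes a generic largest-first balancing heuristic, and the variable names, write counts, and two-address memory are made up for illustration.

```python
import heapq


def round_robin_allocation(write_counts, num_addresses):
    """Naive baseline: place variables on addresses in declaration order."""
    return {var: i % num_addresses for i, var in enumerate(write_counts)}


def balanced_allocation(write_counts, num_addresses):
    """Largest-first heuristic: give the next-heaviest variable to the
    address that has accumulated the fewest writes so far."""
    heap = [(0, addr) for addr in range(num_addresses)]
    heapq.heapify(heap)
    allocation = {}
    for var, writes in sorted(write_counts.items(), key=lambda kv: -kv[1]):
        total, addr = heapq.heappop(heap)
        allocation[var] = addr
        heapq.heappush(heap, (total + writes, addr))
    return allocation


def max_writes(write_counts, allocation, num_addresses):
    """Writes landing on the most-written address."""
    totals = [0] * num_addresses
    for var, addr in allocation.items():
        totals[addr] += write_counts[var]
    return max(totals)


# Hypothetical per-variable write counts.
writes = {"a": 90, "b": 50, "c": 40, "d": 10, "e": 10}
naive = max_writes(writes, round_robin_allocation(writes, 2), 2)
leveled = max_writes(writes, balanced_allocation(writes, 2), 2)
print(naive, leveled)
```

Both allocations issue the same total number of writes; only the peak per address changes, and that peak is the quantity that determines when the first PCM cell wears out.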

ETPL

VLSI - 050

Low Overhead Software Wear Leveling for Hybrid PCM + DRAM Main

Memory on Embedded Systems

This brief presents a parallel single-rail self-timed adder. It is based on a recursive formulation for performing

multibit binary addition. The operation is parallel for those bits that do not need any carry chain propagation.

Thus, the design attains logarithmic performance over random operand conditions without any special speedup

circuitry or look-ahead schema. A practical implementation is provided along with a completion detection

unit. The implementation is regular and does not have any practical limitations of high fanouts. A high fan-in

gate is required, though; this is unavoidable for asynchronous logic and is managed by connecting the

transistors in parallel. Simulations performed using an industry-standard toolkit verify the

practicality and superiority of the proposed approach over existing asynchronous adders.
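A software analogue of a recursive addition of this kind is sketched below. It assumes the common half-adder recursion (sum = XOR, carry = AND shifted left by one, repeated until no carries remain); the abstract does not spell out its formulation, so the function `recursive_add` and its `width` parameter are illustrative only.

```python
def recursive_add(a, b, width=32):
    """Add two unsigned integers by iterating half-adder steps on all bits at once.

    Returns (sum modulo 2**width, number of iterations after the first step).
    The loop ending when the carry word is zero plays the role of completion
    detection in this model.
    """
    mask = (1 << width) - 1
    s = (a ^ b) & mask          # bitwise sum without carries
    c = ((a & b) << 1) & mask   # carries, moved to the next bit position
    iterations = 0
    while c:
        # Every bit position updates in parallel within one iteration.
        s, c = (s ^ c) & mask, ((s & c) << 1) & mask
        iterations += 1
    return s, iterations


total, steps = recursive_add(5, 3, width=8)
print(total, steps)
```

Bits that generate no carry settle in the first step, and the iteration count grows only with the longest carry-propagation chain, which is short on average for random operands.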

ETPL

VLSI - 051

Recursive Approach to the Design of a Parallel Self-Timed Adder




In this brief, three nonvolatile flip-flop (FF)/SRAM cells that utilize a single magnetic tunneling

junction (MTJ) as nonvolatile resistive element are proposed. These cells have the same core (i.e.,

6T) but they employ different numbers of MOSFETs to implement the so-called instantly ON,

normally OFF mode of operation. The additional transistors are utilized for the restore operation to

ensure that the data stored in the nonvolatile circuitry can be written back into the FF core once the

power is made available. These three cells (7T, 9T, and 11T) are extensively analyzed in terms of

their operations in 32 nm technology, such as operational delays (for the write, read, and restore

operations), the static noise margin (SNM), critical charge and process variations (in both the

MOSFETs and the resistive element). Simulation results show that an increase in the number of

MOSFETs in the cells causes improvements in critical charge and tolerance to process variations at

the expense of an increase in power dissipation. The SNM and the delay of the restore operation,

however, do not necessarily increase with the number of MOSFETs in the cell, but rather depend on the

control of access to the storage nodes from the single MTJ.

ETPL

VLSI - 052

On the Nonvolatile Performance of Flip-Flop/SRAM Cells With a Single MTJ

Content addressable memories (CAMs) enable high-speed parallel search operations in table lookup-based

applications, such as Internet routers and processor caches. Traditional CAM design has

always suffered from the high dynamic power consumption associated with its large and active

parallel hardware. However, deeply scaled technology nodes, with multigate devices replacing planar

MOSFETs, are expected to bring new tradeoffs to CAM design. FinFET, a vertical-channel

gate-wrap-around double-gate device, has emerged as the best alternative to planar MOSFET. In this brief,

for the first time, we explore the design space of symmetric and asymmetric gate-workfunction

FinFET CAMs. We propose several design alternatives and evaluate them in terms of their dc and

transient metrics for different mismatch probabilities using technology computer-aided design

simulations with 22-nm FinFET devices. We also propose two orthogonal layout styles for CAM

design and show that one of them (vertical-search line) outperforms the other (vertical-match line) in

terms of total power (22.3%) and search delay (5.8%).

ETPL

VLSI - 053

Design of Efficient Content Addressable Memories in High-Performance

FinFET Technology




A clock skew-compensation and duty-cycle correction circuit (CSADC) is used as the second-level clock

distributing circuit to align a system global clock while maintaining a 50% duty cycle. A power-efficient,

range-unlimited, and accuracy-enhanced CSADC, designed mainly with a new delay-interleaving and

-recycling technique that mitigates operating frequency limitations while keeping overhead costs low, is

proposed in this paper. Our preliminary research results prove the feasibility of the proposed technique and

show that the operating frequency ranges from 110 MHz to 1.75 GHz, with the corrected duty cycle varying

from 51.2% to 48.9% based on 0.18-μm CMOS technology. Meanwhile, the lock-in time, static phase error,

and power consumption are, respectively, 26 clock cycles, 4.2 ps, and 5.58 mW at 1.75 GHz.

ETPL

VLSI - 054

Range Unlimited Delay-Interleaving and -Recycling Clock Skew Compensation

and Duty-Cycle Correction Circuit

A fast transient response flying-capacitor buck-boost converter is proposed to improve the efficiency

of conventional switched-capacitor converters. The voltage boost ratio of the proposed converter is

2D, where D is the duty cycle of the switching signal waveform. Furthermore, the proposed structure

utilizes pseudocurrent dynamic acceleration techniques to achieve fast transient response when load

changes between heavy load and light load. The switching frequency of the proposed converter is 1

MHz for a 3.3-V input and a 1.0-V to 4.5-V output range. Experiment results show that the

proposed scheme improves the transient response to within 2 μs and the total power conversion

efficiency can be as high as 89.66%. The proposed converter has been realized as a 2P4M CMOS

chip in a 0.35-μm fabrication process with a total chip size of about 1.5 mm × 1.5 mm, PADs included.
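The stated boost ratio of 2D gives a quick way to see which duty cycles cover the quoted output range. The sketch below assumes an ideal, lossless converter, so that V_out = 2 · D · V_in; the function name `duty_cycle` is hypothetical, and real duty cycles would differ once losses are included.

```python
def duty_cycle(v_out, v_in):
    """Ideal duty cycle for a converter with voltage ratio V_out / V_in = 2 * D.

    ASSUMPTION: lossless operation; the abstract only states the ratio 2D.
    """
    d = v_out / (2.0 * v_in)
    if not 0.0 <= d <= 1.0:
        raise ValueError("requested output is outside the ideal 0 to 2*V_in range")
    return d


V_IN = 3.3  # input voltage quoted in the abstract, in volts

for v_out in (1.0, 3.3, 4.5):
    mode = "buck" if v_out < V_IN else "boost" if v_out > V_IN else "unity"
    print(f"V_out = {v_out:.1f} V -> D = {duty_cycle(v_out, V_IN):.3f} ({mode})")
```

Under this ideal model the converter steps down for D below 0.5 and steps up above it, and the 1.0 V and 4.5 V ends of the range both fall at duty cycles well inside the 0 to 1 interval.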

ETPL

VLSI - 055

A Fast Transient Response Flying-Capacitor Buck-Boost Converter Utilizing

Pseudocurrent Dynamic Acceleration Techniques




In this brief, a low-cost low-power all-digital spread-spectrum clock generator (ADSSCG) is presented. The

proposed ADSSCG can provide an accurate programmable spreading ratio with process, voltage, and

temperature variations. To maintain the frequency stability while performing triangular modulation, a

fast-relocked mechanism is proposed. The proposed fast-relocked ADSSCG is implemented in a standard

performance 90-nm CMOS process, and the active area is 200 μm × 200 μm. The experimental results show

that the electromagnetic interference reduction is 14.61 dB with a 0.5% spreading ratio and 19.69 dB with a

2% spreading ratio at 270 MHz. The power consumption is 443 μW at 270 MHz with a 1.0 V power supply.
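To make the spreading ratio concrete, the sketch below generates one period of a triangular frequency profile. It assumes down-spreading (the clock only dips below its nominal frequency) and an arbitrary number of steps per modulation period; the abstract specifies neither, and the function name `triangular_profile` is hypothetical.

```python
def triangular_profile(f_nominal_hz, spreading_ratio, steps_per_period):
    """One period of a triangular, down-spread frequency profile.

    ASSUMPTION: down-spreading, so the frequency moves between
    f_nominal * (1 - spreading_ratio) and f_nominal.
    """
    deviation = f_nominal_hz * spreading_ratio
    half = steps_per_period // 2
    falling = [f_nominal_hz - deviation * i / half for i in range(half)]
    rising = [f_nominal_hz - deviation * (half - i) / half for i in range(half)]
    return falling + rising


F_NOMINAL = 270e6  # 270 MHz, as quoted in the abstract

for ratio in (0.005, 0.02):  # the 0.5% and 2% spreading ratios from the abstract
    profile = triangular_profile(F_NOMINAL, ratio, steps_per_period=8)
    span_mhz = (max(profile) - min(profile)) / 1e6
    print(f"ratio {ratio:.1%}: frequency span {span_mhz:.2f} MHz")
```

A wider span spreads the clock energy over more spectrum, which is consistent with the larger interference reduction reported at the 2% ratio than at 0.5%.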

ETPL

VLSI - 056

A Low-Cost Low-Power All-Digital Spread-Spectrum Clock Generator

Thank You !
