A novel power-efficient IC test scheme

Ding Deng, Xiaowen Chen a), and Yang Guo

College of Computer, National University of Defense Technology,

Changsha, Hunan, 410073, China

a) [email protected]

Abstract: A novel power-efficient IC test scheme is proposed, comprising a
parallel test application (PTA) architecture and its test procedure. PTA
parallelizes the stimuli assignment, and each vector can be observed immediately
after it is applied. Shift safety is thus verified as the test proceeds, so only
the logic test is required. The procedure contains two phases for each pattern.
In the shift phase, the clock chains are activated in turn and the vectors are
assigned in parallel. In the capture phase, all chains capture simultaneously.
Experimental results demonstrate that, compared with the traditional serial scan
scheme, the proposal reduces average power by up to 88.48% and peak power by up
to 53.36%.

Keywords: low power, scan test, parallel

Classification: Integrated circuits

References

[1] N. Parimi and X. Sun: “Design of a low-power D flip-flop for test-per-scan circuits,” CCECE (2004) 777 (DOI: 10.1109/CCECE.2004.1345229).

[2] S. Gerstendorfer and H.-J. Wunderlich: “Minimized power consumption for scan-based BIST,” IEEE ITC (1999) 77 (DOI: 10.1109/TEST.1999.805616).

[3] X. Zhang and K. Roy: “Power reduction in test-per-scan BIST,” On-Line Testing Workshop (2000) 133 (DOI: 10.1109/OLT.2000.856625).

[4] A. Chandra, et al.: “Low power Illinois scan architecture for simultaneous power and test data volume reduction,” DATE (2008) 462 (DOI: 10.1109/DATE.2008.4484724).

[5] S. M. Saeed and O. Sinanoglu: “Expedited response compaction for scan power reduction,” IEEE VTS (2011) 40 (DOI: 10.1109/VTS.2011.5783752).

[6] D. Xiang, et al.: “Reconfigured scan forest for test application cost, test data volume and test power reduction,” IEEE Trans. Comput. 56 (2007) 557 (DOI: 10.1109/TC.2007.1002).

[7] Y. Yamato, et al.: “A novel scan segmentation design method for avoiding shift timing failure in scan testing,” IEEE ITC (2011) 1 (DOI: 10.1109/TEST.2011.6139162).

[8] A. Hisashige: “Testing VLSI with random access scan,” Dig Compcon (1980) 50.

[9] A. S. Mudlapur, et al.: “A random access scans architecture to reduce hardware overhead,” IEEE ITC (2005) 9 (DOI: 10.1109/TEST.2005.1583993).

[10] A. S. Mudlapur, et al.: “A novel random access scan flip-flop design,” VDAT (2005) 226.

[11] Y. Hu, et al.: “Localized random access scan: Towards low area and routing overhead,” ASP-DAC (2008) 565 (DOI: 10.1109/ASPDAC.2008.4484016).

© IEICE 2017. DOI: 10.1587/elex.14.20170462. Received May 2, 2017. Accepted May 16, 2017. Publicized June 14, 2017. Copyedited July 10, 2017.

LETTER IEICE Electronics Express, Vol.14, No.13, 1–10


[12] D. Xiang, et al.: “Scan flip-flop grouping to compress test data and compact test responses for broadside delay testing,” ACM TODAES 17 (2012) 18 (DOI: 10.1145/2159542.2159550).

[13] K. Enokimoto, et al.: “On guaranteeing capture safety in at-speed scan testing with broadcast-scan-based test compression,” International Conference on VLSI Design (2013) 279 (DOI: 10.1109/VLSID.2013.201).

1 Introduction

Serial Scan (SS) design is one of the most widely used Design-For-Testability
(DFT) techniques in industry today. However, as the complexity of digital
circuits increases, it faces serious challenges from high power dissipation.
Power dissipation is usually measured in two aspects: average power and peak
power. Average power, which is closely related to heat dissipation, places
strict requirements on the cooling mechanisms. Dissipating excessive heat for
too long during test may degrade circuit performance. Even worse, it can lead to
yield loss or undesirable malfunctions, which in turn increase the chip cost.
Peak power, on the other hand, determines the thermal and electrical limits of
circuits [1]. Once the Circuit Under Test (CUT) exceeds these limits, its
reliability cannot be guaranteed.

Several scan cell modifications that mask undesirable toggles were proposed in
[2, 3]. Different structures were proposed in [4, 5], aiming to save scan-in
power and scan-out power, respectively. A reconfigured scan forest architecture
was proposed to reduce test power in [6]. In [7], each chain was divided into
several segments that are activated separately to save power. However, that
approach makes routing challenging because scan enable and scan clock originate
from the same signal. The scan clock must arrive slightly later than the scan
enable signal; otherwise the stimuli cannot be applied to the scan cells.

Another method called Random Access Scan (RAS) was proposed in [8]. A toggle RAS
in [9, 10] and a Localized Random Access Scan (LRAS) method in [11] were
proposed to reduce the routing complexity. Grouping for launch-on-capture delay
testing can also reduce the power consumption [12]. An algorithm was proposed to
guarantee the capture safety in [13].

The main contributions of this paper are summarized as follows:

1. A Parallel Test Application (PTA) architecture is proposed that avoids
undesirable toggles across the whole chip while stimuli are applied, thus
reducing both average power and peak power to a great extent.

2. The vector assignment and response collection method is non-destructive,
which reduces the effort for fault diagnosis. In addition, chain test is not
required, which saves test time.

2 Power-efficient IC test scheme

2.1 PTA architecture design

Fig. 1 illustrates the PTA architecture. The Top module consists of two parts: the



controller and the modified CUT. The rectangles standing side by side in the CUT

represent the modified scan cells. In the remainder of this paper, we refer to them as

the PTA-DFFs. Assuming that there are m × n DFFs in the original CUT, we can

divide all the DFFs into ‘m’ rows and ‘n’ columns. Here ‘m’ denotes the number of

separate scan clock signals (or scan enable signals) while ‘n’ denotes the number of

DFFs in each row. If the total number of DFFs does not equal m × n exactly, dummy

cells can be added to the CUT. In the rest of this paper, we refer to the PTA-DFFs

with the same scan clock (or scan enable) signals in a row as a CK_chain while the

PTA-DFFs in the same column as a CL_chain.
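The row/column arrangement described above can be sketched in a few lines of Python. This is only an illustration: the function name `partition_into_grid`, the row-major assignment of DFFs to rows, and the use of `None` for dummy cells are assumptions, since the paper does not state how individual DFFs are mapped to chains.

```python
import math

def partition_into_grid(num_dffs, n):
    """Arrange DFF indices into m rows (CK_chains) of n cells each.

    Row-major order is an assumption made for illustration; None marks a
    dummy cell added when num_dffs is not an exact multiple of n.
    """
    m = math.ceil(num_dffs / n)
    cells = list(range(num_dffs)) + [None] * (m * n - num_dffs)
    ck_chains = [cells[r * n:(r + 1) * n] for r in range(m)]        # rows
    cl_chains = [[row[c] for row in ck_chains] for c in range(n)]   # columns
    return ck_chains, cl_chains

# 10 DFFs with n = 4 cells per row need m = 3 rows and two dummy cells.
ck, cl = partition_into_grid(10, 4)
print(ck)
print(cl)
```

Each row shares one Sc_clk/Sc_en pair, and each column shares one Sin line and one Sout line.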

Each CL_chain is connected to a Scan in (Sin) bus and a Scan out (Sout) bus.

Through the Sin bus, stimuli can be applied to every PTA-DFF as long as the

localized Sc_en and Sc_clk are in active state. Similarly, the captured response can

transmit out to Automatic Test Equipment (ATE) for comparison via the Sout bus.

The controller receives the test clock (Test_clk), Reset and test enable (TE)

signals from ATE and generates ‘m’ separate scan clock (Sc_clk) and scan enable

(Sc_en) signals, one pair for each CK_chain.

2.2 PTA scan cell and controller

Fig. 2 illustrates the structure of the PTA-DFF in our design. Stimuli come
directly from the Sin bus instead of from the Q pin of the preceding scan cell,
as in the traditional SS method. A tri-state buffer is inserted after the Q pin
of each scan cell. Assuming that the output pin Z of the tri-state buffer is
enabled at high voltage, the output enable (OE) pin of the tri-state buffer can
be connected to the scan enable (SE) pin. All the Z pins of the tri-state
buffers in the same CL_chain connect to one line of the Sout bus, and all the SI
pins of the traditional scan DFFs in the same CL_chain connect to one line of
the Sin bus. The PTA-DFF therefore has three functions, as presented in Table I.

Whenever the SE is set to ‘0’, the PTA-DFF samples data from D at the rising

edge of Clk. Then the subsequent logic (combinational or sequential) can get the

stable value from Q. Meanwhile, the tri-state buffer is disabled so that the Sout bus

Fig. 1. Parallel test application architecture




maintains high-impedance (denoted by ‘Z’) state. At this time, the PTA-DFF acts
as a normal DFF in normal-working mode. In test mode, SE switches to ‘1’, so the
value on the Q pin propagates through the tri-state buffer to the Sout bus for
observation. During the inactive period of Clk, the value on the Z pin reflects
the response of the last capture; this is the response-output mode. Once Clk
becomes active, the PTA-DFF samples stimuli from the SI pin and then outputs the
value to the Z pin, operating in the vector-applied (and observed) mode.
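The three functions of the PTA-DFF can be illustrated with a small behavioural model. This is a minimal sketch, not the authors' gate-level design: the class name `PtaDff` and its method names are hypothetical, and the tri-state buffer is modelled simply by returning the string "Z" when SE is low.

```python
class PtaDff:
    """Behavioural model of one PTA-DFF: a scan DFF plus a tri-state buffer."""

    def __init__(self):
        self.q = 0

    def rising_edge(self, se, d, si):
        # On a rising Clk edge the cell samples D in normal-working mode
        # (SE = 0) and SI in vector-applied mode (SE = 1).
        self.q = si if se else d

    def z(self, se):
        # The tri-state buffer drives Q onto the Sout line only while SE = 1;
        # otherwise the line is left in high impedance.
        return self.q if se else "Z"

cell = PtaDff()
cell.rising_edge(se=0, d=1, si=0)   # normal-working: capture D
print(cell.z(se=0))                 # Sout line not driven
print(cell.z(se=1))                 # response-output: captured value visible
cell.rising_edge(se=1, d=1, si=0)   # vector-applied: load SI
print(cell.z(se=1))                 # applied vector observed on Sout
```

Reading Z while SE is high does not change Q, which is the non-destructive behaviour the scheme relies on.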

The controller can be implemented mainly with a modulo-m counter and some clock
gating cells. Here ‘m’ is the number of CK_chains in the CUT. The count value i
activates the corresponding Sc_en[i] and Sc_clk[i] signals for the specific
CK_chain[i]. When the count value reaches ‘m’, all CK_chains have been assigned.
All CK_chains are then activated in the next cycle to capture the response of
the last pattern. To simplify the test process, we focus only on combinational
patterns, where a single capture cycle is applied between scan load and scan
unload. Sequential test patterns with more than one capture cycle are not
considered in this paper.
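The counter-based sequencing can be sketched as a generator that emits, per test-clock cycle, which Sc_en and Sc_clk lines are active for one pattern. The function name `controller_schedule` is hypothetical, and the half-cycle phase offset between Sc_en and Sc_clk is deliberately ignored here; only the cycle-level ordering is modelled.

```python
def controller_schedule(m):
    """Yield (phase, sc_en, sc_clk) for each test-clock cycle of one pattern.

    Cycle-level abstraction only: the half-cycle lead of Sc_en over Sc_clk
    is not modelled.
    """
    for i in range(m):
        one_hot = [int(k == i) for k in range(m)]
        # Shift phase: exactly one CK_chain is enabled and clocked.
        yield ("shift", list(one_hot), list(one_hot))
    # Capture phase: all Sc_en are off and every CK_chain is clocked once.
    yield ("capture", [0] * m, [1] * m)

for phase, sc_en, sc_clk in controller_schedule(3):
    print(phase, sc_en, sc_clk)
```

Each pattern therefore occupies m + 1 cycles: m shift cycles followed by one capture cycle.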

2.3 PTA test procedure

The entire process of our test scheme for a CUT containing four CK_chains is
shown in Fig. 3. Assume that there are ‘n’ PTA-DFFs in each CK_chain and that
two test patterns are applied.

The Sc_clk and Sc_en of each CK_chain are activated one by one in a certain

order, meaning that at most one Sc_en and corresponding Sc_clk are in active state

during the test mode. Each pulse of Sc_en lasts a period of test clock. And it always

starts half a cycle before the rising edge of corresponding Sc_clk. Every time the

Sc_en[i] is activated (switching to high), the test vector for the whole ‘n’ PTA-DFFs

of CK_chain[i] is prepared on the Sin[0:n-1] bus. And as the rising edge of Sc_clk[i]

comes, the vector is concurrently applied to the ‘n’ PTA-DFFs of CK_chain[i]. In

the next cycle, another CK_chain is activated and receives its stimuli

while the assigned CK_chains hold their previous values because of their inactive Sc_clk

Table I. Operations of PTA-DFF

Function          SE   Clk
Normal-working    0    Active
Response-output   1    Inactive
Vector-applied    1    Active

Fig. 2. PTA-DFF structure


and Sc_en signals. After all CK_chains have been assigned, all Sc_en signals are

switched off to convert the CUT into normal-working mode. Meanwhile, the

Primary Inputs (PIs) of the CUT are set to corresponding values in the specific

pattern. Next, all Sc_clk signals are activated for a cycle to capture the response of

the last pattern.

We refer to the procedure where the CK_chains are activated in turn as a round.

It should be noted that except the first and the last round of Sc_en, the response-

output operation of the last pattern and the vector-applied operation of the current

pattern for a certain CK_chain are conducted within the same Sc_en pulse.

Specifically, in the first half of the Sc_en pulse, the response of the last pattern

transmits from the Q pin of DFF to the Z pin of the tri-state buffer because the tri-

state buffer behind the DFF is output enabled at this moment. Thus we can get an

n-bit response of the ‘n’ PTA-DFFs on a certain CK_chain within one cycle for

comparison with expectation. In the second half of the Sc_en pulse, the ‘n’ PTA-

DFFs of a certain CK_chain sample the vector for them at the rising edge of the

corresponding Sc_clk. After the rising edge of the Sc_clk, the value in the Q pin of

DFFs reflects the vector just applied to the DFFs, which can also be observed on the

Sout bus. Therefore, we can verify whether the stimuli have been assigned to the

PTA-DFFs correctly during logic test.
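The rounds described above can be traced with a toy cycle-level simulation. This is a sketch under stated assumptions: `run_pta` and `invert` are hypothetical names, the combinational logic is replaced by a stand-in function that simply inverts every bit, Primary Inputs are omitted, and the final round is assumed to only unload the last response.

```python
def run_pta(patterns, m, n, logic):
    """Simulate PTA rounds; return the unloaded responses and the cycle count.

    patterns: list of m x n bit grids (one row per CK_chain).
    logic:    stand-in for the combinational logic, mapping grid -> grid.
    """
    state = [[0] * n for _ in range(m)]
    responses, cycles = [], 0
    for idx, pattern in enumerate(patterns):
        unloaded = []
        for i in range(m):                    # shift phase, one CK_chain per cycle
            unloaded.append(list(state[i]))   # first half of Sc_en: response on Sout
            state[i] = list(pattern[i])       # rising Sc_clk: n bits loaded in parallel
            cycles += 1
        if idx > 0:                           # the first round has no response to unload
            responses.append(unloaded)
        state = logic(state)                  # capture phase: all chains clocked once
        cycles += 1
    responses.append([list(row) for row in state])  # last round: unload only
    cycles += m
    return responses, cycles

def invert(grid):
    """Toy combinational logic used only for this illustration."""
    return [[1 - bit for bit in row] for row in grid]

p0 = [[0, 1], [1, 1], [0, 0], [1, 0]]
p1 = [[1, 1], [0, 0], [1, 0], [0, 1]]
responses, cycles = run_pta([p0, p1], m=4, n=2, logic=invert)
print(cycles)         # 14
print(responses[0])   # response of the first pattern
```

With four CK_chains and two patterns, as in the example of Fig. 3, the simulation takes 14 cycles: two rounds of four shift cycles plus one capture cycle each, followed by a four-cycle unload round.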

Note that the response-output and vector-applied operations are both
non-destructive in PTA. This non-destructive property allows a snapshot of the
circuit state to be taken at any CK_chain, which greatly benefits fault
diagnosis. In conventional serial scan, this is impossible without adding shadow
latches, because the circuit states are serially shifted out (and in). Another
difference between PTA and the segmentation design in [7] is the phase
difference between the scan enable and scan clock signals. In PTA, the scan
enable and scan clock signals already have a phase difference of half a period,
since both are generated by the controller. This largely avoids the routing
problems of [7] explained in the introduction.

Fig. 3. PTA scheme timing example




3 Theoretical analysis

3.1 Test time cost

In the SS approach, stuck-at fault test is conducted in two steps: chain test
and logic (scan) test. As explained in Section 2.2, sequential patterns are not
considered in this paper. Therefore, the SS test takes T_SS cycles:

$$T_{SS} = p(n + 1) + 2n + 4 \tag{1}$$

Here ‘n’ denotes the number of DFFs in each traditional scan chain and ‘p’ is
the number of test patterns.

On the other hand, because shift safety is verified during the logic test in the
PTA approach, as explained in Section 2.3, only logic test is required. Thus the
total number of cycles the PTA approach needs is:

$$T_{PTA} = p(m + 1) + m \tag{2}$$

where ‘m’ is the number of CK_chains in the CUT. To avoid extra I/O port
overhead, we set m = n. The test time reduction ΔT then follows from equations
(1) and (2):

$$\Delta T = T_{SS} - T_{PTA} = p(n - m) + (2n - m) + 4 = n + 4 \tag{3}$$

That is to say, because chain test is not required, our method saves at least
(n + 4) cycles in comparison to the traditional SS method.
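Equations (1) to (3) can be checked numerically. The helper names `t_ss` and `t_pta` and the values p = 100 and n = m = 50 are illustrative assumptions, not figures from the experiments.

```python
def t_ss(p, n):
    """Cycles for serial scan: chain test plus logic test, Eq. (1)."""
    return p * (n + 1) + 2 * n + 4

def t_pta(p, m):
    """Cycles for PTA: logic test only, Eq. (2)."""
    return p * (m + 1) + m

# Illustrative values: 100 patterns, chains of length 50, and m = n.
p, n = 100, 50
m = n
saving = t_ss(p, n) - t_pta(p, m)
print(t_ss(p, n), t_pta(p, m), saving)   # 5204 5150 54
```

With m = n the saving equals n + 4 regardless of the pattern count p, because the p(n − m) term in Eq. (3) vanishes.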

3.2 Area and routing overhead

The area overhead of the CUT can be calculated by the following formula:

$$\text{Area overhead} = \frac{N \times A_{tri}}{\text{original area}} \times 100\% \tag{4}$$

Here ‘N’ is the total number of DFFs in the CUT and ‘A_tri’ is the area of one
tri-state buffer. The additional controller consists of a modulo-m counter and
some clock gating cells, so its area is determined by the parameter ‘m’, which
represents the number of CK_chains.

The routing overhead mainly comes from 1) the separate Sc_clk and Sc_en signals
of each CK_chain, and 2) the Sin and Sout buses, which are connected to every
CL_chain. The routing overhead from the former is proportional to the number of
CK_chains, denoted by ‘m’, while the wires added by the latter are proportional
to the number of CL_chains, denoted by ‘n’. However, ‘m’ decreases as ‘n’
increases because N = m × n, where ‘N’ is the total number of DFFs. Besides, the
wire length depends heavily on the placement and routing strategies. All these
factors make it difficult to evaluate the routing overhead with a concrete
formula. One conclusion can nevertheless be drawn: an extremely large (or small)
‘m’ or ‘n’ is not a sensible choice for PTA in terms of routing overhead.

Although it is hard to determine the optimal value of ‘m’ (or ‘n’), tentative
experimental results on industrial module M in Section 4.3 show that the routing
overhead stays within an acceptable range of 11.91%∼17.83% when ‘m’ varies over
a wide range of 7∼60. With the rapid advances in semiconductor manufacturing
processes, designers can accommodate more cells and wires in the same area,
which means that DFT designers can set the value of ‘m’ (or ‘n’) to what they
use in the traditional DFT method.

4 Experimental results

The power estimation model named Weighted Switching Activity (WSA) [2] may no
longer be accurate as technology scales below 60 nm, because of leakage current
and internal power consumption. Hence, we calculated the actual power with
PrimeTime PX using VCD files generated after fault simulation at a 250 MHz test
clock, following the power estimation flow in [4]. All circuits are synthesized
with full scan insertion in a 40 nm technology. ATPG was performed with
TestKompress from Mentor Graphics. The power reported below also includes clock
tree power, controller power and leakage power.

4.1 PTA vs. NOR-Mask

Experiments are conducted on six ISCAS89 circuits to compare the power reduction
of PTA with that of the method in [2], evaluated in our power estimation flow,
which we call the NOR-Mask approach. The power saving ratio (SR) is calculated
by formula (5):

$$SR = 1 - \frac{Power_{new}}{Power_{ss}} \times 100\% \tag{5}$$

As shown in Fig. 4(a), with the same number of scan I/O ports as SS and
NOR-Mask, PTA reduces average power by 83.87% over SS and by 40.08% over
NOR-Mask on average. For peak power consumption in Fig. 4(b), NOR-Mask may even
increase the peak power, as shown for s9024 and s15850, while PTA always reduces
it to some degree.

4.2 Power reduction with various CL_chains

Fig. 5 shows the relationship between the number of CL_chains (the scan I/O

ports) in PTA and the power reduction. A clear decrease in average power reduction

can be observed for all the three cases in Fig. 5(a). Even so, the deterioration of

average power reduction in our PTA is actually very slight compared with [4] and

[5]. Our method only suffers 7.4% decrease as the number of CL_chains increases

Fig. 4. (a) Average power saving ratio. (b) Peak power saving ratio.

by 49. For peak power reduction, no obvious trend can be seen in Fig. 5(b) as
the number of CL_chains varies, because the peak power reduction mainly benefits
from the gap between the highest shift power and the highest capture power
consumption in our PTA.

4.3 Industrial cases

Two industrial circuits are adapted for PTA structure to evaluate the practicability

and effectiveness of our test scheme. Table II provides results. B18 is one of the

largest circuits in ITC’99 while M is a multiplying unit in the DSP processor

implemented by our institute. The layout design is conducted by Innovus v15.20

from Cadence. In the header to the table, SRAP, SRPP, CPR, AUO, AWLO

represent the saving ratio of average power, saving ratio of peak power, controller

power ratio, area utilization overhead, average wire length overhead, respectively.

(1) From the SRAP and SRPP results, we can see that the proposed method reduces
average power by up to 88.48% and peak power by up to 53.36%. Module M gets only
a 2.26% peak power reduction because its proportion of combinational logic is
much higher than its proportion of sequential logic. In our method, the
combinational logic driven by scan cells still toggles when stimuli are applied,
which weakens the ability to reduce peak power.

(2) The proposed method costs slightly more area and routing than the
traditional method. Comparison with the traditional method shows that the area
overhead mainly results from the tri-state buffers, while the routing overhead
comes mostly from the scan I/O that is directly connected to the DFFs in the
same CL_chains.

Fig. 5. (a) Average power reduction with various scan I/O ports. (b) Peak power reduction with various scan I/O ports.

Table II. Performance of industrial circuits with PTA architecture

Circuits  Scan I/O  Cells  DFFs  SRAP(%)  SRPP(%)  CPR(%)  AUO(%)  AWLO(%)
b18       8         31518  3308  88.48    53.36    5.00    3.95    7.87
M         12        56637  2616  82.98    2.26     1.80    2.92    11.91


However, our method still achieves a smaller Average Power - Routing Cost
product (PCP). For instance, for module M, the PCP in PTA is reduced to 19.05%
(= (1 + 11.91%) × (1 − 82.98%)) of the original PCP of the SS scheme,
demonstrating that our proposal is more effective.
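The PCP arithmetic can be reproduced from the SRAP and AWLO columns of Table II. The function name `pcp_ratio` is hypothetical; the value for b18 is not stated in the text and is simply derived here by applying the same formula to its table row.

```python
def pcp_ratio(srap_percent, awlo_percent):
    """PCP of PTA relative to SS: (1 + wire length overhead) x (1 - power saving)."""
    return (1 + awlo_percent / 100) * (1 - srap_percent / 100)

# SRAP and AWLO values taken from Table II.
print(f"M:   {pcp_ratio(82.98, 11.91):.2%}")   # 19.05%, as quoted in the text
print(f"b18: {pcp_ratio(88.48, 7.87):.2%}")    # 12.43%, derived from the table
```

A ratio below 100% means the average-power saving outweighs the added wire length under this product metric.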

5 Conclusion

Based on the previous analysis and experiments, we summarize and clarify

advantages, limitations and application scope, as well as future improvement of

our PTA test scheme.

5.1 Advantages

In this paper, we propose a new test scheme towards power reduction. It can

effectively reduce average power, peak power and test time. Besides, shift validity

can be assured during logic test without extra chain test. No obvious deterioration is

induced in the critical slack as no additional gates are inserted in functional paths.

Since the modifications of the PTA architecture are conducted after ATPG, the fault

coverage stays at the same level. A controller is employed to conduct the test in

two phases. In the first phase, separate scan clock signals are activated in turn and

the stimuli for all DFFs of a certain CK_chain are concurrently applied. In the

second phase, all CK_chains capture the response of the last assigned pattern

simultaneously. For the CUT, a tri-state buffer is added after each scan cell.

Experimental results show that, with our method, average power can be reduced by
up to 88.48% and peak power by up to 53.36%. Furthermore, the power saving
ability of the PTA architecture is highly robust to variation in the number of
CL_chains (scan I/O ports).

5.2 Limitation and application scope

Additional area and routing resources are needed for the controller and tri-state

buffers. Although the power saving ratio is not easily influenced by the number of

CL_chains, the change of routing overhead might not be negligible as explained in

Section 3.2. Difficulties in the determination of the optimal number of CK_chains

(or CL_chains) from the perspective of routing overhead will increase the iterations

of the design cycle. If there is an intense restriction on routing resources, the

proposal might only be suitable for designs whose sequential cells make up a small

fraction of the total cells. Moreover, the arrangement of the controller
involves a trade-off between controllability/observability and area.

For at-speed test, because only a limited number of scan cells are updated at
the same time, it might be hard for the proposed test scheme to generate
effective launch-on-shift patterns, which generally show better coverage than
launch-on-capture patterns. It is therefore better to apply the PTA architecture
to designs whose at-speed test adopts the launch-on-capture method.

5.3 Future improvement

In the future, we will optimize the implementation of the controller using an
LFSR technique to minimize the area and power consumption ratio of the
controller. Besides, the modifications of every DFF in this paper are all
conducted at gate level and therefore cannot be fully optimized; in the future,
we will modify the scan cells at transistor level. In addition, feasible and
effective placement as well as layout-aware routing strategies need to be
developed to reduce the wire length overhead. Meanwhile, a compression algorithm
should be proposed to exploit the parallelism and minimize the scan I/O overhead
for PTA.

Acknowledgments

The research is partially supported by the National Natural Science Foundation of

China (No. 61502508), and the Hunan Natural Science Foundation of China

(No. 2015JJ3017).

