a novel power-efficient ic test scheme - j-stage
TRANSCRIPT
A novel power-efficient IC testscheme
Ding Deng, Xiaowen Chena), and Yang GuoCollege of Computer, National University of Defense Technology,
Changsha, Hunan, 410073, China
Abstract: A novel power-efficient IC test scheme is proposed, containing
parallel test application (PTA) architecture and its procedure. PTA parallel-
izes the stimuli assignments and the vectors can be observed immediately
once applied, which assures the shift safety timely and hence only logic test
is required. The procedure contains two phases for each pattern. In shift
phase, each clock chain is activated in turn and the vectors are assigned in
parallel. In capture phase, all chains are captured simultaneously. Experi-
mental results demonstrate that, compared with the traditional serial scan
scheme, the proposal reduces average power by 88.48% and peak power by
53.36%.
Keywords: low power, scan test, parallel
Classification: Integrated circuits
References
[1] N. Parimi and X. Sun: “Design of a low-power D flip-flop for test-per-scancircuits,” CCECE (2004) 777 (DOI: 10.1109/CCECE.2004.1345229).
[2] S. Gerstendorfer and H.-J. Wunderlich: “Minimized power consumption forscan-based BIST,” IEEE ITC (1999) 77 (DOI: 10.1109/TEST.1999.805616).
[3] X. Zhang and K. Roy: “Power reduction in test-per-scan BIST,” On-LineTesting Workshop (2000) 133 (DOI: 10.1109/OLT.2000.856625).
[4] A. Chandra, et al.: “Low power Illinois scan architecture for simultaneouspower and test data volume reduction,” DATE (2008) 462 (DOI: 10.1109/DATE.2008.4484724).
[5] S. M. Saeed and O. Sinanoglu: “Expedited response compaction for scanpower reduction,” IEEE VTS (2011) 40 (DOI: 10.1109/VTS.2011.5783752).
[6] D. Xiang, et al.: “Reconfigured scan forest for test application cost, test datavolume and test power reduction,” IEEE Trans. Comput. 56 (2007) 557 (DOI:10.1109/TC.2007.1002).
[7] Y. Yamato, et al.: “A novel scan segmentation design method for avoiding shifttiming failure in scan testing,” IEEE ITC (2011) 1 (DOI: 10.1109/TEST.2011.6139162).
[8] A. Hisashige: “Testing VLSI with random access scan,” Dig Compcon (1980)50.
[9] A. S. Mudlapur, et al.: “A random access scans architecture to reduce hardwareoverhead,” IEEE ITC (2005) 9 (DOI: 10.1109/TEST.2005.1583993).
[10] A. S. Mudlapur, et al.: “A novel random access scan flip-flop design,” VDAT(2005) 226.
[11] Y. Hu, et al.: “Localized random access scan: Towards low area and routingoverhead,” ASP-DAC (2008) 565 (DOI: 10.1109/ASPDAC.2008.4484016).
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
1
LETTER IEICE Electronics Express, Vol.14, No.13, 1–10
[12] D. Xiang, et al.: “Scan flip-flop grouping to compress test data and compacttest responses for broadside delay testing,” ACM TODAES 17 (2012) 18 (DOI:10.1145/2159542.2159550).
[13] K. Enokimoto, et al.: “On guaranteeing capture safety in at-speed scan testingwith broadcast-scan-based test compression,” International Conference onVLSI Design (2013) 279 (DOI: 10.1109/VLSID.2013.201).
1 Introduction
Serial Scan (SS) design is one of the most widely used Design-For-Testability
(DFT) techniques in nowadays industry. However, as the complexity of digital
circuit increases, it also suffers from enormous challenges in high power dissipa-
tion. Power dissipation is usually measured in two aspects: average power and peak
power. Average power, which has a close relationship with heat dissipation, brings
a strict requirement of the cooling mechanisms. Dissipating excessive heat too long
during test may degrade the circuit performance. Even worse, it can lead to a yield
loss or undesirable malfunctions which in turn increases the chip cost. On the other
hand, peak power determines the thermal and electrical limits of circuits [1]. Once
the Circuit Under Test (CUT) surpasses the limitation, the reliability cannot be
guaranteed.
Several modifications have been tried in scan cells to mask undesirable toggles
in [2, 3]. Different structures are proposed in [4, 5], aiming to save the scan-in
power and scan-out power respectively. A reconfigured scan forest architecture was
proposed to reduce test power in [6]. In [7], each chain was divided into several
segments and different segments are activated separately so as to save power.
However, the approach would bring challenges for routing because scan enable and
scan clock originates from the same signal. Actually the scan clock must arrive a
little later than the scan enable signal, otherwise the stimuli can not be applied into
the scan cells.
Another method called Random Access Scan (RAS) was proposed in [8]. A
toggle RAS in [9, 10] and a Localized Random Access Scan (LRAS) method in
[11] were proposed to reduce the routing complexity. Grouping for launch-on-
capture delay testing can also reduce the power consumption [12]. An algorithm
was proposed to guarantee the capture safety in [13].
The important contributions of this paper are summarized as follows:
1. A Parallel Test Application (PTA) architecture is proposed to avoid undesir-
able toggles when stimuli are applied in the whole chip, thus concurrently reducing
average power and peak power to a great extent.
2. Vector assignment and response collection method is non-destructive, which
can reduce effort for fault diagnosis. And chain test is not required, thus saving test
time.
2 Power-efficient IC test scheme
2.1 PTA architecture design
Fig. 1 illustrates the PTA architecture. The Top module consists of two parts: the
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
2
IEICE Electronics Express, Vol.14, No.13, 1–10
controller and the modified CUT. The rectangles standing side by side in the CUT
represent the modified scan cells. In the remainder of this paper, we refer to them as
the PTA-DFFs. Assuming that there are m � n DFFs in the original CUT, we can
divide all the DFFs into ‘m’ rows and ‘n’ columns. Here ‘m’ denotes the number of
separate scan clock signals (or scan enable signals) while ‘n’ denotes the number of
DFFs in each row. If the total number of DFFs fails to equal m � n exactly, dummy
cells can be added to the CUT. In the rest of this paper, we refer to the PTA-DFFs
with the same scan clock (or scan enable) signals in a row as a CK_chain while the
PTA-DFFs in the same column as a CL_chain.
Each CL_chain is connected to a Scan in (Sin) bus and a Scan out (Sout) bus.
Through the Sin bus, stimuli can be applied to every PTA-DFF as long as the
localized Sc_en and Sc_clk are in active state. Similarly, the captured response can
transmit out to Automatic Test Equipment (ATE) for comparison via the Sout bus.
The controller receives the test clock (Test_clk), Reset and test enable (TE)
signals from ATE and generates ‘m’ separate scan clock (Sc_clk) and scan enable
(Sc_en) signals for all CL_chains respectively.
2.2 PTA scan cell and controller
Fig. 2 illustrates the structure of PTA-DFF in our design. Stimuli come directly
from Sin bus instead of the Q pin of the former scan cell in traditional SS method.
A tri-state buffer is inserted after the Q pin of each scan cell. Assuming that the
output pin Z of tri-state buffer is high voltage enabled, then the output enable (OE)
pin of the tri-state buffer can be connected to the scan enable (SE) pin. All the Z
pins of tri-state buffers in the same CL_chain connect to one of the Sout bus. And
all the SI pins of traditional scan DFFs in the same CL_chains connect to one of the
Sin bus. So the PTA-DFF can have three functions as presented in Table I.
Whenever the SE is set to ‘0’, the PTA-DFF samples data from D at the rising
edge of Clk. Then the subsequent logic (combinational or sequential) can get the
stable value from Q. Meanwhile, the tri-state buffer is disabled so that the Sout bus
Fig. 1. Parallel test application architecture
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
3
IEICE Electronics Express, Vol.14, No.13, 1–10
maintains high-impedance (denoted by ‘Z’) state. At this time, PTA-DFF acts as a
normal DFF for normal-working mode. In the test mode, the SE switches to ‘1’,
thus the value in Q pin is able to propagate through tri-state buffer to the Sout bus
for observation. During the inactive period of Clk, the value in the Z pin reflects the
response of the last capture. It is the response-output mode. Once the Clk turns to
active, the PTA-DFF samples stimuli from SI pin and then outputs the value to the
Z pin, operating in the vector-applied (and observed) mode.
For the controller, it can be mainly implemented by a modulo-m counter and
some clock gating cells. Here ‘m’ is the number of the CK_chains in the CUT. The
count value i actives corresponding Sc_en[i] and Sc_clk[i] signals for the specific
CK_chain[i]. When the count value increases up to ‘m’, all CK_chains have already
been assigned. All CK_chains are activated in the next cycle to capture the response
of the last pattern. To simplify the test process, we only focus on the combinational
patterns where only one capture cycle is applied between scan load and scan
unload. The sequential test patterns with more than one capture cycles are not
considered in this paper.
2.3 PTA test procedure
The entire process of our test scheme for a CUT containing four CK_chains is
shown in Fig. 3. Assumes that there are ‘n’ PTA-DFFs in each CK_chain and two
test patterns are applied.
The Sc_clk and Sc_en of each CK_chain are activated one by one in a certain
order, meaning that at most one Sc_en and corresponding Sc_clk are in active state
during the test mode. Each pulse of Sc_en lasts a period of test clock. And it always
starts half a cycle before the rising edge of corresponding Sc_clk. Every time the
Sc_en[i] is activated (switching to high), the test vector for the whole ‘n’ PTA-DFFs
of CK_chain[i] is prepared on the Sin[0:n-1] bus. And as the rising edge of Sc_clk[i]
comes, the vector is concurrently applied to the ‘n’ PTA-DFFs of CK_chain[i] at the
same time. In the next cycle, another CK_chain is activated and applied to stimuli
while the assigned CK_chains hold previous values because of their inactive Sc_clk
Table I. Operations of PTA-DFF
Function SE Clk
Normal-working 0 Active
Response-output 1 Inactive
Vector-applied 1 Active
Fig. 2. PTA-DFF structure
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
4
IEICE Electronics Express, Vol.14, No.13, 1–10
and Sc_en signals. After all CK_chains have been assigned, all Sc_en signals are
switched off to convert the CUT into normal-working mode. Meanwhile, the
Primary Inputs (PIs) of the CUT are set to corresponding values in the specific
pattern. Next, all Sc_clk signals are activated for a cycle to capture the response of
the last pattern.
We refer to the procedure where the CK_chains are activated in turn as a round.
It should be noted that except the first and the last round of Sc_en, the response-
output operation of the last pattern and the vector-applied operation of the current
pattern for a certain CK_chain are conducted within the same Sc_en pulse.
Specifically, in the first half of the Sc_en pulse, the response of the last pattern
transmits from the Q pin of DFF to the Z pin of the tri-state buffer because the tri-
state buffer behind the DFF is output enabled at this moment. Thus we can get an
n-bit response of the ‘n’ PTA-DFFs on a certain CK_chain within one cycle for
comparison with expectation. In the second half of the Sc_en pulse, the ‘n’ PTA-
DFFs of a certain CK_chain sample the vector for them at the rising edge of the
corresponding Sc_clk. After the rising edge of the Sc_clk, the value in the Q pin of
DFFs reflects the vector just applied to the DFFs, which can also be observed on the
Sout bus. Therefore, we can verify whether the stimuli have been assigned to the
PTA-DFFs correctly during logic test.
Attention should be paid that the response-output and vector-applied operations
are both non-destructive in PTA. This non-destructive property allows snap-shot
of circuit states at any CK_chain, which benefits the fault diagnosis a lot. In
conventional serial scan, it is impossible to achieve this goal unless adding shadow
latches, because the states of circuit are serially shifted out (and in). Another
difference between PTA and the segmentation design in [7] is a phase difference
between scan enable and scan clock signals. In PTA, the scan enable and scan clock
signals have already got a phase difference of half a period since they are generated
from the controller. It almost avoids the routing problems in [7] as explained in
introduction.
Fig. 3. PTA scheme timing example
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
5
IEICE Electronics Express, Vol.14, No.13, 1–10
3 Theoretical analysis
3.1 Test time cost
In the SS approach, stuck-at fault test is conducted in two steps: chain test and logic
(scan) test. As explained in Section 2.3, sequential patterns are not considered in
this paper. Therefore, the SS test takes TSS cycles.
TSS ¼ pðn þ 1Þ þ 2n þ 4 ð1ÞHere ‘n’ denotes the number of DFFs in each traditional scan chain and ‘p’ is the
number of test patterns.
On the other hand, because we can assure the shift safety timely in PTA
approach as explained in Section 2.3, only logic test is required. Thus the total
number of cycles PTA approach needs is:
TPTA ¼ pðm þ 1Þ þ m ð2ÞWhere ‘m’ is the number of CK_chains in the CUT. To avoid extra I/O ports
overhead, we make m ¼ n. Thus, the test time reduction �T can be achieved from
the equation (1) and equation (2).
�T ¼ TSS � TPTA ¼ pðn � mÞ þ ð2n � mÞ þ 4 ¼ n þ 4 ð3ÞThat is to say, because chain test is not required, our method saves at least
(n þ 4) cycles in comparison to the traditional SS method.
3.2 Area and routing overhead
The area overhead of the CUT can be calculated by the following formula:
Area overhead ¼ ðN � AtriÞ=original area � 100% ð4Þ‘N’ is the total number of DFFs in CUT and ‘Atri’ is the area of one tri-state buffer.
As for the additional controller, as it consists of a modulo-m counter and some
clock gating cells, its area is determinated by parameter ‘m’ which represents the
number of CK_chians.
The routing overhead mainly comes from 1) the separate sc_clk and sc_en
signals within each CK_chain, as well as 2) the Sin and Sout bus which are
connected to every CL_chain. The routing overhead resulting from the former cause
is in proportion to the number of CK_chains denoted by ‘m’, while the wires
generated by the latter reason is proportion to the number of CL_chains denoted by
‘n’. However, ‘m’ decreases as ‘n’ increases because N ¼ m � n, where ‘N’ is the
total number of DFFs. Besides, the wire length highly depends on the placement
and routing strategies. All above mentioned factors make it difficult to evaluate
routing overhead via a certain concrete computational formula. But one conclusion
can be definitely drawn that extreme big (or small) ‘m’ or ‘n’ is not a sensible
choice for PTA in terms of routing overhead.
Although it is hard to determine the optimal value of ‘m’ (or ‘n’), tentative
experimental results on industrial module M in Section 4.3 show that the routing
overhead changes within an acceptable range of 11.91%∼17.83% when ‘m’ varies
in a wide range of 7∼60. With the rapid advances in the semiconductor manu-
facturing process, it is possible for designers to accommodate more cells and wires© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
6
IEICE Electronics Express, Vol.14, No.13, 1–10
in the same area, which means that DFT designers can set the value of ‘m’ (or ‘n’)
as what they use in traditional DFT method.
4 Experimental results
The power estimation model named Weighted Switching Activity (WSA) [2] may
no longer be accurate as the technology scales down below 60 nm due to the
leakage current and internal power consumption. Hence, we calculated the real
power by Prime Time PX with VCD files generated after fault simulation at a
250MHz test clock, just as the power estimation flow in [4]. All circuits are
synthesized and inserted as full scan in 40 nm technology. ATPG was realized by
TestKompress from Mentor Graphics. The power reported below also includes
clock tree power, controller power and leakage power.
4.1 PTA vs. NOR-Mask
Experiments are conducted on six ISCAS89 circuits to compare the power
reduction of PTA with the method [2] evaluated in our power estimation flow,
which we called NOR-Mask approach. The power saving ration (SR) is calculated
by formula (5):
SR ¼ 1 � Powernew=Powerss � 100% ð5ÞAs shown in Fig. 4(a), with the same number of scan I/O ports as the SS and NOR-
Mask, PTA reduce average power by 83.87% over SS and by 40.08% over NOR-
Mask respectively on average. For peak power consumption in Fig. 4(b), NOR-
Mask may even increase the peak power just as shown at s9024 and s15850, while
PTA always reduce it more or less.
4.2 Power reduction with various CL_chains
Fig. 5 shows the relationship between the number of CL_chains (the scan I/O
ports) in PTA and the power reduction. A clear decrease in average power reduction
can be observed for all the three cases in Fig. 5(a). Even so, the deterioration of
average power reduction in our PTA is actually very slight compared with [4] and
[5]. Our method only suffers 7.4% decrease as the number of CL_chains increases
(a) (b)
Fig. 4. (a) Average power saving ratio(b) Peak power saving ratio
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
7
IEICE Electronics Express, Vol.14, No.13, 1–10
by 49. For peak power reduction, there is no obvious tendency can be seen in
Fig. 5(b) as the number of CL_chains varies because peak power reduction mainly
benefits from the gap between the highest shift power and the highest capture
power consumption in our PTA.
4.3 Industrial cases
Two industrial circuits are adapted for PTA structure to evaluate the practicability
and effectiveness of our test scheme. Table II provides results. B18 is one of the
largest circuits in ITC’99 while M is a multiplying unit in the DSP processor
implemented by our institute. The layout design is conducted by Innovus v15.20
from Cadence. In the header to the table, SRAP, SRPP, CPR, AUO, AWLO
represent the saving ratio of average power, saving ratio of peak power, controller
power ratio, area utilization overhead, average wire length overhead, respectively.
(1) From SRAP and SRPP results, we can see that the proposed method can
reduce 88.48% average power and 53.36% peak power at most. Module M only
gets 2.26% peak power reduction, because its combinational logic proportion is
much higher than its sequential proportion. In our method, the combinational logic
driven by scan cells still toggles when stimuli are applied, which weakens the
ability to reduce peak power.
(2) The proposed method takes a little more area and routing cost than the
traditional method. Through careful observation and comparison with the tradi-
tional method, we find the area overhead mainly results from the tri-buffers while
the routing overhead almost comes from the scan I/O that is directly connected to
the DFFs in the same CL_chains.
(a) (b)
Fig. 5. (a) Average power reduction with various scan I/O ports(b) Peak power reduction with various scan I/O ports
Table II. Performance of industrial circuits with PTA architecture
CircuitsScanI/O
Cells DFFsSRAP(%)
SRPP(%)
CPR(%)
AUO(%)
AWLO(%)
b18 8 31518 3308 88.48 53.36 5.00 3.95 7.87
M 12 56637 2616 82.98 2.26 1.80 2.92 11.91
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
8
IEICE Electronics Express, Vol.14, No.13, 1–10
However, our method can still achieve smaller Average Power - Routing Cost
product (PCP). For instance, for module M, its PCP in PTA can be reduced to
19.05% (¼ ð1 þ 11:91%Þ � ð1 � 82:98%Þ) of the original PCP of SS scheme,
demonstrating that our proposal is more effective.
5 Conclusion
Based on the previous analysis and experiments, we summarize and clarify
advantages, limitations and application scope, as well as future improvement of
our PTA test scheme.
5.1 Advantages
In this paper, we propose a new test scheme towards power reduction. It can
effectively reduce average power, peak power and test time. Besides, shift validity
can be assured during logic test without extra chain test. No obvious deterioration is
induced in the critical slack as no additional gates are inserted in functional paths.
Since the modifications of PTA architecture is conducted after ATPG, the fault
coverage maintains the same level. A controller is employed to conduct the test in
two phases. In the first phase, separate scan clock signals are activated in turn and
the stimuli for all DFFs of a certain CK_chain are concurrently applied. In the
second phase, all CK_chains capture the response of the last assigned pattern
simultaneously. For the CUT, a tri-state buffer is added after each scan cell.
Experimental results show that, with our method, average power can be reduced
by 88.48% and peak power can be reduced by 53.36%. Furthermore, the power
saving ability of PTA architecture is of strong immunity to the variation of the
number of CL_chains (scan I/O ports).
5.2 Limitation and application scope
Additional area and routing resources are needed for the controller and tri-state
buffers. Although the power saving ratio is not easily influenced by the number of
CL_chains, the change of routing overhead might not be negligible as explained in
Section 3.2. Difficulties in the determination of the optimal number of CK_chains
(or CL_chains) from the perspective of routing overhead will increase the iterations
of the design cycle. If there is an intense restriction on routing resources, the
proposal might only be suitable for designs whose sequential cells make up a small
segment of the total cells. What’s more, arrangements of controller further has the
problems on trade-offs between controllability/observability and area.
For at-speed test, due to limited number of scan cells which are updated at the
same time, it might be hard for the proposal test scheme to generate effective
launch-on-shift patterns which generally show better coverage than launch-on-
capture patterns. So it would be better to apply the PTA architecture to the design
whose at-speed test adopts launch-on-capture method.
5.3 Future improvement
In the future, we will optimize the implementation of the controller using a LFSR
technique to minimize the area and power consumption ratio of the controller.
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
9
IEICE Electronics Express, Vol.14, No.13, 1–10
Besides, because modifications of every DFF in this paper are all conducted at gate
level, which cannot be fully optimized. In the future, we would modify the scan
cells at transistor level. In addition, feasible and effective placement as well as
layout-aware routing strategies need to be developed to reduce the wire length
overhead. Meanwhile, a compression algorithm should be proposed to exploit the
parallelism and minimize the scan I/O overhead for PTA.
Acknowledgments
The research is partially supported by the National Natural Science Foundation of
China (No. 61502508), and the Hunan Natural Science Foundation of China
(No. 2015JJ3017).
© IEICE 2017DOI: 10.1587/elex.14.20170462Received May 2, 2017Accepted May 16, 2017Publicized June 14, 2017Copyedited July 10, 2017
10
IEICE Electronics Express, Vol.14, No.13, 1–10