
Siemens Digital Industries Software

Executive summary

Originally presented at the 2020 International Test Conference by Siemens and Intel authors, this paper describes the Tessent Streaming Scan Network and demonstrates how this packetized data network optimizes test time and implementation productivity for today’s complex SoCs.

The IEEE paper is reprinted here in full with permission.

Jean-François Côté, Mark Kassab, Wojciech Janiszewski, Ricardo Rodrigues, Reinhard Meier, Bartosz Kaczmarek, Peter Orlando, Geir Eide, Janusz Rajski Siemens Digital Industries Software

Glenn Colon-Bonet, Naveen Mysore, Ya Yin, Pankaj Pant Intel Corporation

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

siemens.com/eda

Streaming Scan Network: An Efficient Packetized Data Network for Testing of Complex SoCs


Streaming Scan Network (SSN):

An Efficient Packetized Data Network for Testing of Complex SoCs

Jean-François Côté, Mark Kassab, Wojciech Janiszewski, Ricardo Rodrigues, Reinhard Meier, Bartosz Kaczmarek, Peter Orlando, Geir Eide, Janusz Rajski, Glenn Colon-Bonet*, Naveen Mysore*, Ya Yin*, Pankaj Pant**

Mentor, A Siemens Business, 8005 SW Boeckman Road, Wilsonville, OR 97070

*Intel Corporation, 4701 Technology Parkway, Fort Collins, CO 80528

**Intel Corporation, 75 Reed Road, Hudson, MA 01749

Abstract—System-on-Chip (SoC) designs are increasingly difficult to test using traditional scan access methods without incurring inefficient test time, high planning effort, and physical design/timing closure challenges. The number of cores keeps growing while chip pin counts available for scan remain constant or decline, limiting the ability to drive cores concurrently. With increasingly commonplace tiling and abutment, the scan distribution hardware must be placed inside the cores, making balanced pipelining when broadcasting to identical cores difficult. Optimizing test time requires analyzing all the cores and subsequently changing the test hardware in the cores. Internal shift speed constraints may limit the ability to shift data in and out of the chip at high rates. Differences in pattern counts or scan chain lengths between cores tested in parallel can result in padding and increased test time. SSN is a bus-based scan data distribution architecture designed to address all these challenges. It enables simultaneous testing of any number of cores even with few chip I/Os. It facilitates short test time by enabling high-speed data distribution, by efficiently handling imbalances between cores, and by supporting testing of any number of identical cores with a constant cost. It provides a plug-and-play interface in each core that is well suited for abutted tiles, and simplifies scan timing closure. This paper also compares the test cost and implementation productivity of SSN with those of Intel’s Structural Test Fabric.

Keywords—Design For Test, DFT, SoC Test, Hierarchical Test, Multiple Identical Cores, Known-Good-Die Testing, Test Time Reduction, Low Pin Count Test, Scan Distribution Architecture, Scan Fabric

I. INTRODUCTION

With some Integrated Circuits (ICs) growing to billions of transistors, it is virtually impossible to design, implement, and test them flat. A System-on-a-Chip (SoC) is an IC that is comprised of multiple components, referred to as cores. Each core is typically designed, implemented, and validated independently before being integrated with others. As design complexity has grown, so have the levels of core hierarchy. It is not uncommon to have lower-level cores integrated into subsystems, which are integrated into chiplets that are then assembled into a chip.

As design is done hierarchically to manage complexity, so is DFT. In hierarchical test methodologies [1][2][3], scan chains and compression logic [4][5][6] are inserted into every core. The cores are wrapped with scan and interface control logic. Test patterns targeting most faults in a core are generated and validated at the core level. Subsequently, the patterns from multiple wrapped cores are retargeted or mapped to the top level. They are often merged with patterns retargeted from other cores that are tested at the same time, if scan access and design constraints permit. In addition to retargeting patterns generated for testing the wrapped logic within each core, test pattern generation is also run at the next level up to test peripheral logic outside wrapper chains as well as logic at that higher level of hierarchy. If this parent level is not the chip level, then those patterns will also have to be retargeted to the chip level. The same test pattern generation and retargeting methodology is applied recursively regardless of the levels of hierarchy, but the planning and implementation of DFT get more complex with additional levels of hierarchy, especially when using conventional scan access methods.

The following subsections explain key SoC test challenges inherent in pin-mux scan access, which is commonly used in the industry and explained in the referenced papers.

A. SoC Test Challenges: Planning and Layout

Traditionally, for a group of cores to be tested concurrently, one of the requirements is that their channel inputs and outputs must be directly connected to chip-level pins. As the number of cores in SoCs grows and the number of chip-level pins available for scan test remains the same or is reduced, additional groups of cores and scan access configurations must be created. This has negative implications on DFT implementation effort, silicon area, pattern retargeting complexity, and test time.

Part of hierarchical test planning is to identify early in the design flow the number of scan channels used in every core, and the groups of cores which will be tested concurrently in every scan access configuration. This can lead to sub-optimal results since it creates fixed core groupings and forces premature decisions on channel counts per core before the cores are completed and before their compression configurations can be optimized and their pattern counts estimated. Chip-level design decisions depend on the cores. The cores are finalized too late in the design cycle, and their compression configurations are influenced by the chip-level core groupings and pin availability. This mutual dependency makes it virtually impossible to optimize compression for the SoC. As the number of levels of core hierarchy increases, the planning complexity and test inefficiency also grow.

Connecting chip pins to the cores can have physical design implications. Connecting each pin to different cores in different test configurations can lead to routing congestion. The pads may be embedded inside cores in some packaging technologies such that the connections for one core impact the design of other cores to which the signals have to be routed, or through which the scan connections flow. Those connections are also often pipelined, so timing between those pipeline stages and compression logic must be carefully designed to achieve high shift speeds and avoid timing violations.

Tile-based layout is a relatively recent trend in SoC design that is adding further complexity and constraints to DFT architectures. In pure tiling layouts, virtually all logic and routing is within the cores and not at the top level. The cores are designed to abut one another when integrated into the chip such that connections flow from one core to the next. Any connectivity between cores has to flow through cores that are between them. Logic that is at the top level has to be pushed into the cores and designed as part of the cores.

B. SoC Test Challenges: Limited Chip-Level Pins

When retargeting core-level patterns, limited chip-level pin counts can be dealt with by increasing the number of core groups and test sessions, as long as there are enough chip pins to drive at least each core individually. However, there are cases where simultaneous access to multiple or all cores is necessary, and grouping cores into smaller groups is not an option. One example is Iddq test, where scan data is loaded across the entire chip before a relatively lengthy current measurement is taken. When using scan compression such as Embedded Deterministic Test (EDT) [4], this means there must be enough pins available to drive all the EDT channels of the cores concurrently.

C. SoC Test Challenges: Identical Core Instances

Pattern retargeting in the presence of identical core instances can benefit from generating patterns once, and from the ability to broadcast the scan inputs from the same top-level pins, reducing both ATPG runtime and pin requirements. There are, however, still multiple challenges to be resolved.

Although broadcast of scan inputs keeps the number of input pins constant for any number of identical cores, the outputs are often observed independently to guarantee the same test coverage achieved at the core level and to ensure enough observability for diagnosing failing cores. Since at least 1 output channel is needed per core instance, this can limit the number of identical core instances that can be tested concurrently, just as similar limitations apply to heterogeneous core instances.

The second issue is that after scan loading, the capture clocking is usually applied concurrently to all core instances. Combined with the broadcast of input scan data, this means the number of pipeline stages between a scan input pin and each of the identical core instances it drives must be equal. This can be difficult to achieve in the presence of tiling, where no routing or logic may exist outside the cores. Signals, including scan inputs, may propagate across multiple instances of the same core, accumulating pipelining delay. Routing of individual output channels from each core instance through the other core instances can also be complicated by the fact that all cores are copies of each other. A solution exists where every core instance is programmed with a different number of pipeline stages and different routing for scan output paths, but this introduces complexity and limits the reuse of cores. Designing a new chip with more core instances requires redesigning the cores to account for differences in pipelining and routing channels.

II. PRIOR WORK

To address some of the challenges explained, a few companies have developed and published scan access technologies beyond the traditional pin-mux topologies. They vary in the scope of the challenges they address and the trade-offs they make.

A packetized bus-based architecture specifically tailored to provide a scalable solution for testing multiple identical core instances was introduced in [7]. It is not a general scan access mechanism that can simultaneously test heterogeneous cores. It supports shifting in the expected data, in addition to input stimuli, such that on-chip comparison can be done and pass/fail data accumulated and observed. It also allows some trade-offs between efficiency and diagnostic information. Getting full failure data for diagnosis may require the application of a different pattern set; one that uses a different configuration than the full-rate mode used for high-volume manufacturing. This architecture also has data overhead because every parallel word includes a command opcode in addition to the scan data payload. The fact that each parallel word has to include both payload and a command imposes limits on how narrow the bus may be, and imposes additional constraints on the bus width and its relation to the core scan channel counts.

The authors subsequently introduced a new architecture [8] that has a different focus: while it maintains a solution for testing of multiple identical cores, its primary new design objective is to enable better bin packing for retargeted core-level patterns. It does so by providing flexibility in mapping chip-level pins to core-level scan pins such that there is flexibility in controlling which cores are tested concurrently. Instead of a bus architecture as in [7], it uses a flexible mux-based switching network. The architecture succeeds in enabling effective dynamic bandwidth management [9] and late-binding core grouping to minimize padding caused by test length differences across cores. However, this architecture incurs some costs. Given the network provides flexibility in connecting any top-level pin to any core-level scan channel (although there are restrictions on combinations of connections), the network can result in significant routing cost especially in the presence of a large number of cores. Using a mux-based star network is also less amenable to connection-by-abutment in tile-based designs compared to bus-based architectures.

The Structural Test Fabric (STF) solution [10][11], published by co-authors of this paper, provides a general packet-based core access mechanism that works for heterogeneous cores, and has a scalable solution for multiple identical cores. It is flexible in that every parallel word is self-contained, but it incurs overhead per parallel bus word. A detailed comparison of this architecture to SSN is presented in Section VIII.

To allow simultaneously driving more internal scan channels than the number of chip-level scan pins, some architectures such as [12] employ serializers/deserializers. This additionally allows running chip-level scan pins at higher frequencies than internal scan chains support, improving overall bandwidth. A subsequent version of this technology [13] added flexibility to allow varying the number of scan pins per core. The number of external scan pins per core and the related serialization/deserialization ratio are programmable. The purpose is to enable reuse of the test data for a given core across SoCs with different scan pin configurations. It also enables varying shift frequencies in different cores within the SoC. Those methods facilitate IP reuse and access to cores in the presence of limited chip-level scan pins. However, they do not address routing challenges in tile-based designs nor provide an efficient and scalable solution for multiple identical cores.

Some scan compression methods have extensions to facilitate test across an SoC. For example, the architecture in [14] can distribute test data to compression logic in cores, and uses serializers/deserializers to manage pin count limitations. However, as with the preceding method, it is not an abutment-friendly architecture nor does it efficiently test many identical cores as SSN will be shown to do.

In the next sections, we describe how SSN aims to solve the challenges presented in Section I, while improving on efficiency, flexibility, and capabilities of previously published access mechanisms.

III. SSN TECHNOLOGY FUNDAMENTALS

A. Architecture Overview

Fig. 1 shows a simplified example of a 6-core design that uses SSN. Each core typically contains one Streaming Scan Host (SSH) node (yellow box). The SSH drives local scan resources to load/unload scan chains/channels with data delivered on the SSN bus. In the figure, an EDT scan compression controller is shown for simplicity as a representative of the scan logic within the core. In reality, the SSH node can interface with EDT controller(s), uncompressed/legacy scan chains, or a combination of the two.

Each SSH has two external interfaces: an IEEE 1687 [15] IJTAG interface predominantly used for setup, and a parallel data bus that subsequently transports the payload scan data and connects one SSH node to the next. The IJTAG network, shown as a 1-bit bus, is used to configure all nodes in the SSN network prior to the application of a test pattern set. Each node is loaded with information related to the protocol such as the active bus width, its location in the series of nodes driven, the number of shift cycles per scan pattern, scan_enable transition timing information, etc. Following this setup, the entire test pattern set is applied as packetized scan data that is streamed on the parallel bus shown as an N-bit bus. Because the protocol of alternating shift/capture operations is very regular and repeatable, each SSH is pre-loaded with the information needed for its counters and finite state machine to track the streaming operation. There is no need to send opcode or address information with each packet. Only the scan payload is streamed, as shown in the next section. As data streams through the SSH nodes, each node can identify when it needs to read scan_in data from the bus, when it needs to place scan_out data on the bus, and when it needs to pass along data that is destined for other nodes. Each SSH controls the local scan operations for the core, including transitions between load/unload and capture stages, as well as performing individual shift operations. All scan signals and EDT controls are generated by the SSH local to the core, and the only test signals that cross core boundaries are the SSN parallel bus (N-bit data bus + clock) and the IJTAG signals. This allows scan timing closure to be completed at the core level.
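Conceptually, the per-node setup state resembles a small record loaded over IJTAG before streaming begins. The sketch below is a hypothetical Python illustration; the field names are ours and do not reflect the actual SSH register map.

```python
from dataclasses import dataclass

@dataclass
class SshSetup:
    """Hypothetical per-SSH state programmed over IJTAG before streaming."""
    active_bus_width: int   # width of the SSN bus in use for this session
    packet_bits: int        # packet size shared by all active nodes
    slot_offset: int        # where this node's bits start within a packet
    slot_width: int         # bits per packet belonging to this node
    shifts_per_load: int    # shift cycles per scan load (when to capture)
    scan_en_timing: int     # scan_enable transition timing information

# Example: a node taking 5 bits of a 9-bit packet on an 8-bit bus
node_a = SshSetup(active_bus_width=8, packet_bits=9, slot_offset=0,
                  slot_width=5, shifts_per_load=200, scan_en_timing=4)
```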

SSN supports the abutment of cores in tile-based designs with no routing outside the cores. The outputs of one core connect to the inputs of the next adjacent core. A chip with SSN usually has a single datapath (parallel bus) that goes through all cores. Depending on the floorplan and pad locations, it may be preferable for physical design to implement multiple, physically independent datapaths (for example, one datapath per chiplet [16][17]). Each datapath is also configurable and can include muxes that can be programmed to include or exclude segments of the network similar to the Segment Insertion Bit (SIB) in IJTAG networks.

As will be demonstrated in the upcoming sections, the SSN bus width is selected based on chip-level pin availability and is independent of the number and logic size of the scanned cores, and the number of channels needed by the EDT controller(s) in each core. This enables each core to have the same plug-and-play interface and bus width for scan test, allowing SSN to scale efficiently as the design floorplan, number of cores, or the content of the cores change.

The ability to route the bus carrying the data from one core to the next while dynamically controlling which cores are active/inactive/bypassed means one has flexibility in accessing any combination of cores without changing the hardware. Unlike pin-mux architectures, this flexibility does not come at the expense of routing congestion. Additionally, there is no need to try to predict at design time how to group cores that are to be tested concurrently. Whether performing ATPG on groups of cores or retargeting patterns from different cores, the same SSN network can provide access to one core at a time, all cores simultaneously, or anything in-between.

Fig. 1: SSN Architecture


B. Packets

In SSN terminology, a “packet” usually consists of all the scan data needed for all the active SSH nodes to perform a single internal scan shift operation. A packet should not be confused with the actual SSN physical bus width, which could be narrower or wider than a packet. The SSN payload delivered from the tester may be viewed as a continuous stream of packets that may wrap across SSN bus boundaries. To illustrate this concept, consider the example shown in Fig. 2, where two blocks are being tested concurrently. Block A loads/unloads 5 bits per shift cycle of the block (it has 5 EDT channels). Block B has 4 channels. For both blocks to perform one shift cycle, 9 bits have to be loaded/unloaded. In conventional scan access methods, this would have required 9 chip-level scan input pins and 9 scan output pins. With SSN, the packet size in this example gets set to 9 bits, independent of the 8-bit SSN bus width. 9 bits have to be delivered for each of the 2 blocks to shift once. The first 5 bits of every 9-bit packet are programmed to belong to block A, and the next 4 bits of every packet are programmed to belong to block B. This is all determined and programmed at pattern generation time – it is not hard-coded in the SSN logic. After programming all the SSN nodes using IJTAG, SSN delivers a continuous, repeating stream of 9-bit packets. The allocation of packet bit positions to SSH nodes is the same for all packets and is programmed at setup. As soon as block A extracts 5 bits from the bus, it performs one internal shift operation. Likewise for block B, every time it accumulates 4 bits. The SSH is programmed with the shift count per scan load, so it can identify when to perform shift and when to perform capture. Capture involves events generated by the SSH such as de-asserting scan_enable, applying capture clocks through an On-chip Clock Controller (OCC) [18], and re-asserting scan_enable in preparation for the next scan operation.

In this example, we have decided to use 9-bit packets although the bus width is 8 bits. The stream of 9-bit packets is simply folded into the 8-bit bus with no bits wasted. The first 9-bit packet occupies the first 8-bit parallel word of the bus, and the first bit of the second word (second tester cycle). The second packet starts immediately after that, occupying the remaining 7 bits of the second parallel word, and the first 2 bits of the following parallel word. While the allocation of bits within a packet to an SSH is invariant, there is no static mapping between a bit of the bus and an EDT channel input/output. The locations of the 9-bit packets within each 8-bit bus word rotate with each packet. Each SSH node keeps track of the location of its data in each packet, including accounting for rotation of the data. The size of each packet must be equal to or greater than the bus width. In exceptional cases where the packet size is less than the physical bus width, the bus is re-programmed to reduce its active width such that it does not exceed the number of bits in a packet.
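To make the folding concrete, the following minimal Python sketch (illustrative only, not Tessent code) streams 9-bit packets onto an 8-bit bus and shows how the packet boundary rotates across bus words, matching the Fig. 2 example.

```python
BUS_WIDTH = 8    # physical SSN bus width
PACKET_BITS = 9  # block A uses packet bits 0-4, block B uses bits 5-8

def fold_into_bus_words(num_packets):
    """Fold a continuous stream of packets into fixed-width bus words."""
    # Label every streamed bit as (packet_index, bit_position_within_packet)
    bits = [(p, b) for p in range(num_packets) for b in range(PACKET_BITS)]
    # Chop the continuous bit stream into bus words, one word per bus cycle
    return [bits[i:i + BUS_WIDTH] for i in range(0, len(bits), BUS_WIDTH)]

for cycle, word in enumerate(fold_into_bus_words(3)):
    print(f"bus cycle {cycle}: {word}")
# Packet 0 fills word 0 plus bit 0 of word 1; packet 1 starts at bit 1 of
# word 1. The packet boundary rotates and no bus bit is ever wasted.
```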

Typically, the same time slots of the packet that carry scan-in data to an SSH node also carry scan-out data from that node. (Multiple identical cores may be handled differently as explained later.) As block A reads the first 5 bits of every packet, it replaces them with 5 bits scanned out (with slight latency).

Fig. 2: Streaming scan packets

Any number of internal cores and their channels can be controlled with an SSN bus that is as narrow as one bit. This is because the packets can be as wide as they need to be, and can occupy as many bus words as needed. The internal channel requirements (9 bits in this example) are decoupled from the available scan pins at the chip level (8 × 2 pins for scan in this case). If the packet is wider than the bus and occupies multiple bus words, the cores shift less often than once every bus shift cycle, but it is still possible to drive all the cores needed. In this example with 9-bit packets and an 8-bit bus, the blocks shift approximately every bus/tester clock cycle. Occasionally, a block may omit shifting in a given cycle because it has to wait to acquire all the bits it needs for one shift cycle. If the bus is 1 bit wide instead of 8 bits wide, it takes 9 tester cycles to scan in each packet. The internal shift rate is then 1/9th of the external shift rate, but it is still possible to drive all 9 internal channels from the 1-bit bus. In fact, the bus width can be scaled down dynamically at pattern generation time. When multiple cores are driven concurrently such that the packet spans multiple bus words, the internal shift frequency is slower than the external frequency; this presents an opportunity to deliver the data more quickly without exceeding the constraints on the internal core shift frequencies. It is common in SSN implementations to cap the core-internal shift frequency at 100 MHz yet run a faster, narrower bus at 400 MHz.
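The achievable internal shift rate follows directly from the packet-to-bus ratio. A back-of-the-envelope helper under the assumptions above (the function and its cap parameter are ours, for illustration):

```python
def internal_shift_mhz(bus_mhz, bus_width, packet_bits, core_cap_mhz=100):
    # Each internal shift consumes one full packet, which needs
    # packet_bits / bus_width bus cycles to deliver.
    return min(bus_mhz * bus_width / packet_bits, core_cap_mhz)

print(internal_shift_mhz(400, 8, 9))  # demand ~355 MHz -> capped at 100 MHz
print(internal_shift_mhz(400, 1, 9))  # 1-bit bus: ~44.4 MHz internal shift
```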

C. The Streaming Scan Host (SSH) Node

Fig. 3 shows a high-level view of the SSH. In addition to its aforementioned functionality, other characteristics to highlight are:

1. If a core with an SSH is not under test in a given mode, the SSH may have to continue passing data through, being part of the network, but does not have to deliver scan data to its EDT. In this case, the SSH is said to be disabled. The data passes from the bus input register directly to the bus output register, such that the SSH acts as two pipeline stages within the network.

2. If a core is to be powered off when not under test such that the data cannot flow through the SSN segment within it, the datapath can be designed such that the segment going through the powered-off region is muxed out.

3. Because the packet data may rotate within the bus and span multiple parallel words, the SSH has shifters and registers to re-align and collect the data.

4. To test the SSH and the rest of the SSN network before they are used for scan test, the SSH can be placed into loopback mode. In this mode, the scan data normally going to EDT is directly fed back to the scan data normally unloaded from EDT, as shown in the figure.

5. The node is small in size. It is usually smaller than an EDT controller.

IV. MANAGING CLOCK SKEW & BUS WIDTH

To maximize SSN’s throughput, it is desired to run the bus at higher frequencies than the shift frequencies of the cores. It is possible to implement a 400 MHz SSN bus. It is, however, often unrealistic to balance the SSN clock throughout a large SoC. The SSN clock may be balanced within each core or groups of cores, but there may be clock skew between those regions that must not be allowed to degrade the shift frequency. This is addressed using a Bus Frequency Divider (BFD)/Bus Frequency Multiplier (BFM) pair, as shown in Fig. 4.

The pair acts as a deskew FIFO. By temporarily converting a fast narrow bus into a slow wide bus when crossing Clock Tree Synthesis (CTS) regions, a larger amount of clock skew can be tolerated without impacting the shift speed or throughput. The FIFO logically acts like pipeline stages in the SSN datapath. Splitting the FIFO into 2 discrete components allows the BFD to be placed in the transmitting region and the BFM in the receiving region, with each component driven by the local SSN clock in its region.

The BFD and BFM nodes may additionally be used to reduce the bus width distributed around the chip and reduce the SSN area. Although an SSN bus that operates at 400 MHz can be easily implemented, it is often not possible to shift data through the chip-level pins at more than 200 MHz. Assume that the SoC has enough pins to implement 64 scan inputs and 64 scan outputs. One option would be to implement a 64-bit bus throughout the chip and operate it at 200 MHz. Alternatively, the data can be scanned into the chip through 64 pins at 200 MHz and a BFM added between the scan inputs and the first SSH to convert this input stream to a 32-bit, 400 MHz bus. This 32-bit bus is then used across the chip, connecting SSH nodes with 32-bit buses. Then before exiting the chip, a BFD node is added to convert the SSN output bus back to a 200 MHz 64-bit bus driving the output pins.
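The conversion works because the BFD/BFM pair conserves bandwidth: bus width × frequency is the same on both sides. A small sketch under that assumption (the function name is ours):

```python
def converted_width(width_bits, freq_mhz, new_freq_mhz):
    """Bus width after a bandwidth-preserving BFM/BFD frequency change."""
    bandwidth = width_bits * freq_mhz        # Mb/s through the node
    assert (bandwidth % new_freq_mhz) == 0, "width must come out integral"
    return bandwidth // new_freq_mhz

# BFM at the inputs: 64 pins at 200 MHz -> 32-bit on-chip bus at 400 MHz
print(converted_width(64, 200, new_freq_mhz=400))  # 32
# BFD before the output pins: back to 64 bits at 200 MHz
print(converted_width(32, 400, new_freq_mhz=200))  # 64
```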

Fig. 3: Streaming Scan Host (SSH) node

Fig. 4: Managing clock skew across CTS regions using BFD/BFM


V. OPTIMIZING TEST TIME AND DATA VOLUME

It is important to differentiate when the capture cycles of all cores must be aligned and performed concurrently versus when each core (or group of cores) operates independently and can capture regardless of whether other cores are shifting or capturing. The latter enables more efficiency when it can be used.

When running ATPG on a group of interacting cores, such as during external test, it is always necessary to align capture events because of the interactions between the cores during capture. In this case, the SSH in each of those cores can shift independently, but all those SSH nodes are programmed to capture concurrently once they complete a scan load/unload.

However, consider when pattern generation is performed on wrapped cores (or groups of cores) that are isolated from one another and have their own OCCs. At the top level, those pattern sets are independent and can be merged and applied concurrently. In most other retargeting solutions, the capture events are aligned as shown in Fig. 5. While this is necessary for ATPG, it can be unnecessary and inefficient for the case of retargeted patterns. Imbalances in shift lengths per scan load may result in unnecessary padding. A core with short scan chains should not need to wait for other cores to complete shifting before it can capture. Furthermore, there are often significant imbalances in the pattern counts of different cores. Traditional retargeting methods pad the cores with fewer patterns such that there is a waste of data and test time.

Fig. 5: Retargeting with aligned vs. independent capture

SSN has two features to reduce test time and test data volume in such cases. First, it supports independent shift/capture for different retargeted cores. This is possible because signals such as scan_enable and the shift clock are generated locally by each SSH. Second, it reduces the shift length/pattern count imbalances between cores by programmatically varying the bandwidth used for each core. If a core requires many fewer overall shift cycles across a pattern set than other cores, it can be sent fewer bits per packet. For example, a core with 4 channels does not need to be allocated 4 bits per packet. It can be throttled down and sent only 1 bit per packet such that it shifts internally every four packets instead of every packet. The result is that the total number of packets remains the same, but the size of the packets is reduced, speeding up the overall test time (a sketch of this throttling appears at the end of this section). The next section introduces further test optimization possible in the presence of multiple identical core instances.

Note that an additional benefit of independent capture is power. It can mitigate IR drop since cores under test do not all shift and capture at the same time. In addition to scan access, this may further facilitate testing a large number of cores concurrently.
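As a sketch of the bandwidth throttling described above (the core parameters are hypothetical and the allocation rule is a simplification of what pattern retargeting actually computes):

```python
import math

# Hypothetical cores: EDT channel count and total internal shift cycles needed
cores = {"cpu": (4, 400_000), "gpu": (4, 100_000)}

# The pattern set needs at least as many packets as the most demanding core
total_packets = max(shifts for _, shifts in cores.values())

# Allocate bits per packet proportional to each core's total demand, so all
# cores finish near the same packet; the packet shrinks from 8 bits to 5.
alloc = {name: max(1, math.ceil(chans * shifts / total_packets))
         for name, (chans, shifts) in cores.items()}
print(alloc, "-> packet size:", sum(alloc.values()))
# {'cpu': 4, 'gpu': 1} -> packet size: 5 (the gpu shifts once per 4 packets)
```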

VI. TESTING OF MULTIPLE IDENTICAL CORES

Many SoCs that achieve high throughput by parallelizing processing contain a number of cores that are replicated multiple times. CPU chips often include multiple processor cores. AI and GPU chips in particular can have some cores replicated well over 100 times. As previously explained, in pin-mux scan access architectures, the scan inputs may be broadcast to identical core instances, but the scan outputs are usually observed independently to ensure lossless mapping and observability for diagnosis. This results in a non-scalable solution where increasing the number of core instances requires additional chip pins for concurrent test.

A. On-Chip Compare

SSN provides a scalable method for testing any number of identical core instances in near constant test time, independent of the number of available chip-level pins, even in the presence of the tile-based design constraints explained earlier. Instead of shifting in the stimuli only and unloading the expected response for comparison on the tester, the stimuli, expected responses, and compare/nocompare mask data are scanned in within each packet so that each core can perform its own on-chip comparison. Note that the data arrives at each core instance at a slightly different time since the SSN bus data streams through the nodes. With each internal shift cycle, the channel data transferred from EDT to the SSH is compared, and a pass/fail status bit per channel per shift cycle is computed. What is ultimately observed on the tester is the following:

1. Per-shift status bits: This is the aforementioned pass/fail bit for a given channel in a given internal shift cycle. This status bit is allocated a timeslot in the packet for unloading. To provide a scalable solution for any number of identical core instances, the same status bit in the packet usually accumulates the pass/fail status from a given channel/shift cycle across all identical core instances (or a subset of them). If this bit indicates a fail, one can identify which core-level bit had a failure but not necessarily which core instance(s) this failure originated from. It is still possible to identify failing cores and per-core fail information for diagnosis as explained later.

2. Sticky status bits: One sticky bit per SSH indicates if there was a failure in scan observed by this SSH in any cycle/channel of the pattern set. This bit per SSH is unloaded through IJTAG at the end of a pattern set to quickly identify failing cores (for designs with redundant cores), and to aid in diagnosis. Note that where finer granularity than 1 fail bit per SSH is needed, it is possible to generate a sticky bit per channel output connected to the SSH.

Fig. 6 shows an example of data encoding into packets when using on-chip compare. Six identical core instances are used in this example, each driving an EDT controller that has 7 input channels and 2 output channels. Each packet has enough scan data for the cores to perform one internal shift operation. First, 7 bits per packet corresponding to the 7 input channels (shown in blue) are allocated. Those stimuli are broadcast (in time) to all identical core instances. The expected responses (2 output channels = 2 bits) and mask information (2 output channels = 2 bits) are also shifted in and broadcast (red). Last are the status bits that accumulate the pass/fail information per channel per shift cycle (green). Typically, we would allocate 2 bits corresponding to the 2 output channels. A failure in one of those bits would indicate that the corresponding channel of one of the 6 core instances failed, but we would not know which one. When we accumulate the status information of all 6 cores together, they are considered to be placed into 1 status group. In this example, we chose to partition the 6 cores into group “a” and group “b”. We only accumulate the fail information within each group. That is why we have 4 green bits: 2 output channels × 2 groups. The number of groups is programmable at pattern retargeting time. Increasing the number of groups beyond 1 sacrifices test efficiency for improved observability, as will be explained in the diagnosis section.

Fig. 6: Packets when using on-chip compare to test multiple identical cores

When using on-chip compare, the response data cannot replace the stimuli in the packet because the stimuli have to travel to all other core instances. Separate time slots have to be allocated for the stimuli, the expected responses and the masks shifted in, as well as the status bits unloaded. In the common case of 1 status group, the number of bits per packet is usually #input_channels + 3 × #output_channels. Because each output channel requires at least 3 bits of data in the packet (expected value, mask, and pass/fail status), using an asymmetric EDT with fewer output channels than input channels improves test time and test data volume in conjunction with on-chip compare.
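The packet accounting and status accumulation can be checked with a short sketch (values follow the Fig. 6 example; the function names and group encoding are ours):

```python
def compare_packet_bits(in_ch, out_ch, status_groups=1):
    # stimuli + expected + mask are broadcast to all instances; one
    # pass/fail slot per output channel per status group is unloaded.
    return in_ch + 2 * out_ch + status_groups * out_ch

print(compare_packet_bits(7, 2, status_groups=1))  # 13 = 7 + 3 x 2
print(compare_packet_bits(7, 2, status_groups=2))  # 15, as in Fig. 6

def accumulate_status(fails_by_core, groups):
    """OR together the per-channel fail bits of all cores in each group."""
    return {g: [any(fails_by_core[c][ch] for c in members)
                for ch in range(2)]                 # 2 output channels here
            for g, members in groups.items()}

fails = {c: [False, False] for c in ["A1", "A2", "A3", "A4", "A5", "A6"]}
fails["A4"] = [True, False]  # channel 0 of instance A4 miscompares
print(accumulate_status(fails, {"a": ["A1", "A2", "A3"],
                                "b": ["A4", "A5", "A6"]}))
# {'a': [False, False], 'b': [True, False]} -> failure isolated to group "b"
```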

B. Diagnosis Flow

Failure data is needed even during high-volume manufacturing for on-tester identification of failing cores to support partial good die strategies (redundant logic cores), and for diagnosis-driven yield analysis. When not using on-chip compare, every channel output bit in a core maps to a single bit on the top-level SSN bus outputs that are unloaded and compared on the tester. Logic diagnosis is straightforward in that case: perform reverse mapping of chip-level failures through the SSN network to the EDT channel outputs, then perform conventional compressed pattern diagnosis (at the core level in case of retargeted patterns).

Diagnosis in the presence of on-chip compare is more involved and may require re-application of the pattern set to collect all the data needed. Consider the case where all identical core instances are placed in a single status group such that their per-cycle pass/fail information is aggregated into the same packet timeslots. If any of those bits indicate failures, we have the cumulative per-pin per-cycle fail data but may not know which core(s) the failures came from. The sticky status bits unloaded at the end of the test set via IJTAG indicate which core(s) failed at least once. If only one core in this group fails, then we know the per-cycle pass/fail data came from this core alone and therefore we have all the information needed for diagnosis. However, if multiple cores fail, we have to separately test and observe each of those failing cores to get their individual fail data. If two cores fail, for example, then the same test set is re-applied twice, with minor patching applied. In each case, static bits in the setup of the cores are patched to control which cores are allowed to contribute to the cumulative pass/fail results. Note there is no need to store separate patterns for diagnosis on the tester.

If identical core instances are split into multiple groups, this slightly increases the test time, but decreases the probability of resorting to multiple test applications for collecting diagnosis data. In the example shown in Fig. 6, the six cores are split into two groups. If cores A1 and A4 are found to have failed, there is no need for test re-application because cores A1-A3 accumulate their status bits separately from cores A4-A6. However, if cores A1 and A3 fail, test re-application with patching is needed to acquire the individual fail data. In the extreme case, you may choose to assign each core instance to its own group so that each core is observed individually. This mode of operation may be better suited for silicon debug than high-volume manufacturing.

VII. ALTERNATE INTERFACES

A. Streaming Tests through JTAG/IJTAG Interfaces

It is possible not to use the SSH parallel bus at all, and instead use the JTAG(chip)/IJTAG(core) interface for both setup and subsequent streaming of the test data. There are two cases where this may be desirable:

1. As a survivability option. If during silicon bring-up, the bus is inaccessible due to a silicon defect, this provides an alternate method of accessing any SSH or group of SSHs.


2. If a low pin count device only has a JTAG interface and no other digital pins, it is possible to implement SSN without the parallel bus and rely on the JTAG/IJTAG interfaces for streaming the test data.

B. Compatibility with Test Using SerDes (IEEE 1149.10)

IEEE 1149.10 [19] provides for re-using high-speed I/O (HSIO) SerDes lanes to enable very high bandwidth transfer of test data to/from a chip. The Packet Encoder/Decoder and Distribution Architecture (PEDDA) IP described in the standard results in deserialized data presented on a parallel bus. SSN’s synchronous parallel bus is ideally suited to interface with the PEDDA. SSN can handle on-chip distribution of test data and internal generation of test signals. As the SSN network can operate internally at high frequencies (at least 400 MHz), it is capable of testing many cores concurrently and quickly when coupled with this high-bandwidth chip-level interface.

VIII. PRACTICAL EXPERIENCE USING SSN

In collaboration with Mentor, Intel has been evaluating the use of SSN. SSN is capable of scaling to large SoCs and server-class designs that require support for large partition counts and identical core testing. Previous generations of Intel SoCs have utilized an internally developed high-bandwidth packetized fabric, STF [10][11], to address these needs. STF was developed to allow this scalability at much lower overhead than traditional pin-muxed scan solutions. In evaluating SSN, the goals were to assess whether moving to SSN could further improve test time and bandwidth utilization over STF, as well as reduce design effort through the use of a vendor-supported platform.

A. Comparison of Packet Encoding Overhead

Both STF and SSN can scale to support any number of partitions; however, the approach to accomplishing this differs between the two systems. The STF network relies on explicit addressing information stored within each packet. This is accomplished by having a short address ID tag contained within each packet, typically 4 bits in size. In addition, STF requires an opcode field, 4 bits in size, as well as input and output valid bits. This results in an overhead of 10 bits being added to each data packet. In contrast, under SSN the destinations and interleave settings are statically programmed during the test setup, allowing the entire bus bandwidth to be used for data. For a typical bus size of 32 bits, STF has a 31% higher overhead than SSN. This is depicted in Fig. 7.
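The 31% figure is straightforward arithmetic; a quick check using the field widths described above:

```python
DATA_BITS = 32               # STF fixed data field
HEADER_BITS = 4 + 4 + 2      # address ID + opcode + input/output valid bits
print((DATA_BITS + HEADER_BITS) / DATA_BITS)  # 1.3125 -> ~31% extra vs. SSN
```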

B. Comparison of Data Field Utilization

STF utilizes a fixed data field size of 32 bits. To accommodate EDTs with a smaller number of channels, the STF data word is divided up into fields, and the data for multiple shift cycles is packed into the 32-bit word to achieve better utilization. However, when the EDT channel size does not divide evenly into the 32-bit word, this reduces efficiency as illustrated in Fig. 8. In this example, with 9-bit EDTs, we can pack 3 shift cycles of data into the data word with 5 bits of unused data, resulting in an overhead of nearly 16%. In the worst case of a 17-bit EDT, 47% of the data bandwidth is wasted. Thus, STF data field utilization can range from 53% to 100% depending on how the EDT data packs into the 32-bit word. Because SSN utilizes data rotation, any leftover bits within the bus become part of the next packet, always achieving 100% utilization of the bus data word.
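A small helper reproduces these utilization figures (the function is ours, for illustration):

```python
def stf_utilization(edt_bits, word_bits=32):
    """Fraction of an STF data word carrying payload for a given EDT width."""
    whole_fields = word_bits // edt_bits  # whole shift cycles packed per word
    return whole_fields * edt_bits / word_bits

for width in (8, 9, 16, 17, 32):
    print(f"{width:2d}-bit EDT: {stf_utilization(width):.0%} utilization")
# 9-bit EDT: 84% (5 of 32 bits wasted); 17-bit worst case: 53%. SSN is always
# 100%, since leftover bits roll into the next packet via data rotation.
```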

C. Interleaving, Vector Count and Chain Length Mismatch Handling

Both STF and SSN scale to any number of partitions; however, their approaches differ in how they handle the interleaving of partitions. In the example shown in Fig. 9, a set of partition patterns that have differing numbers of vectors are to be merged. Typically, STF will have a specified interleave factor, in this case 4, and the patterns are repacked optimally into these 4 groups. These groups are then round-robin interleaved to create the final pattern set, as shown in the figure.

Fig. 7: STF packet overhead vs. SSN

Fig. 8: STF packing of narrow EDT data

Fig. 9: STF pattern interleaving

Fig. 10: Chain length mismatch padding in STF

SSN’s handling of interleaving achieves similar efficiency for vector count mismatch as STF, but SSN can also partially mitigate chain length mismatch between partitions, which STF cannot. STF requires all partitions in the pattern set to be padded to the same shift length, resulting in overhead. This is depicted in Fig. 10. In our current designs, we allow up to 20% chain length mismatch between partitions, so it is theoretically possible for SSN to have up to 20% better packing efficiency in the final pattern.

D. Fabric Test Setup

Since the STF fabric is configured in-band using packets, the pattern overhead for network setup is very small, approximately 10 cycles per active endpoint. SSN utilizes IJTAG to program the network with approximately 160 bits of state per active endpoint, plus IJTAG network overhead. Though this could result in substantially higher setup overhead for SSN, the cost of the setup is amortized across the entire scan vector set. For large pattern sets, network setup should not add significant overhead, on the order of 1%, for SSN.
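A sanity check of this amortization, using the pilot-study numbers reported later in Table II:

```python
setup_cycles = 34_703    # SSN IJTAG network setup (Table II)
scan_cycles = 2_711_740  # scan payload cycles at 10,000 patterns
print(setup_cycles / (setup_cycles + scan_cycles))  # ~0.0126, i.e. ~1.3%
# Consistent with the ~1.2% amortized setup impact seen in the pilot study.
```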

E. On-Die Compare

STF and SSN provide comparable functionality for identical core testing using on-die compare. Both systems require the input data stream to include the input data, mask data, and expected response, causing a 3X growth of the data volume, but allow testing of any number of cores in constant time. SSN has a possible advantage in the handling of an asymmetric number of input and output channels. In this case, SSN can more tightly pack the expect and mask fields to match the smaller output channel count, possibly realizing less than 3X data growth. STF, however, allocates bandwidth assuming symmetric usage and always incurs the full 3X data volume. For the purpose of this analysis, we assumed that on-die compare would be neutral between the two systems.

F. Total Estimated Overhead Comparison

In summary, STF pays a high overhead in packet encoding, data field utilization, and handling of chain length mismatch. Network setup overhead is higher in SSN, but amortized across the number of scan vectors, resulting in a negligible difference. Overall, this can lead to an over 2X reduction in data volume under SSN vs. STF, as summarized in Table I.

G. SSN Pilot Study

SSN offers a compelling theoretical advantage over the current STF fabric in use. However, we wanted to measure results on actual partition data to verify this. Further, the study looked at other aspects, such as design effort and run times. To perform the study, a simple test design was created consisting of a single interface partition, partition1, and four identical copies of a partition, partitions 2-5, as shown in Fig. 11.

An SSN bus data width of 32 bits was chosen to match STF to allow direct comparison. ATPG patterns were created targeting partitions 2-5, each having 9 EDT channels for a total of 36 bits of channel data. By having a total channel data set size of >32 bits, SSN will perform data rotation and create a more meaningful comparison. The 9-bit EDT channel size represents a typical data field packing inefficiency for STF. Multiple ATPG runs were conducted to analyze the overhead at 10, 500, and 10,000 vectors. The results from these runs are summarized in Table II, comparing STF, SSN, and a legacy pin mux solution.

For this testcase, SSN shows a clear advantage over STF, with STF having 19% higher test time and 57% more data volume than SSN.

Fig. 11: Pilot network topology

Table II. Pilot test time and data volume results

SSN (32b bus size):
  Patterns   Setup cycles (IJTAG)   Scan test cycles   Total test cycles   Total data (Mb)
  10         34,703                 3,280              37,983              0.28
  500        34,703                 136,120            170,823             4.53
  10,000     34,703                 2,711,740          2,746,443           86.95

STF (42b packet size / 32b data size):
  Patterns   Setup cycles (STF)     Scan test cycles   Total test cycles   Total data (Mb)
  10         64                     3,800              3,864               0.16
  500        64                     161,820            161,884             6.80
  10,000     64                     3,257,200          3,257,264           136.81

Pin-muxed GPIO (estimated):
  Patterns   Setup cycles (IJTAG)   Scan test cycles   Total test cycles   Total data (Mb)
  10         20,446                 5,740              26,186              0.29
  500        20,446                 241,900            262,346             7.74
  10,000     20,446                 4,820,780          4,841,226           154.26

Ratios at 10,000 patterns: STF vs. SSN = 1.19× total test cycles, 1.57× total data volume; pin-mux vs. SSN = 1.76× total test cycles, 1.77× total data volume.

Table I. Theoretical Comparison of STF vs. SSN Data Volume

                  Packet encoding   Data field utilization   Vector/chain length   Fabric setup
                  overhead          overhead                 mismatch overhead     overhead
  SSN (baseline)  1.0               1.0                      1.0                   1.0
  STF             1.31              1.0 - 1.47               1.0 - 1.2             0.99

STF data volume vs. SSN (theoretical): 1.30 - 2.29


SSN test setup is higher overhead than STF; however, when amortized across the 10,000 vectors in the run set, this impact is in the expected range of 1.2%. This testcase used identical partitions and hence did not exercise vector count mismatch between partitions nor chain length mismatch, which would further favor SSN. For comparison purposes, a legacy pin-muxed solution is included, showing a large overhead relative to SSN. Since the pin-muxed solution cannot transport 36 bits of channel data in a single run, it must be split into 2 runs, nearly doubling test time and data volume.

In addition to data volume and test time metrics, we also collected information on design efficiency between the internal STF toolset and the Mentor Tessent™ tool flows for SSN. This comparison is summarized in Table III.

As the table shows, SSN and the Tessent flows provide significant productivity improvement over our previous flow built from multiple tools, enabling rapid integration into the design and fast turnaround ATPG runs. The SSN flows do not require ATPG cut points and custom setups to generate and retarget patterns, resulting in significant savings in pattern retargeting. Though not in the scope of this analysis, further benefits are expected in gate level simulation debug productivity.

H. SSN Pilot Study Summary

Analysis of a small test network verified that the theoretical advantages of SSN over our previous internal STF fabric are achievable, delivering a significant improvement in both test time (16% reduction) and test data volume (36% reduction). The data shows that the approach of static network configuration during test setup is more efficient for large scan data sets than allocating addressing and opcode information within each packet. In addition, further benefits were seen in design efficiency for insertion, ATPG setup, and pattern retargeting relative to our previous flows.

IX. CONCLUSION

The SSN technology introduced in this paper solves many of the scan distribution challenges in complex SoCs. It enables simultaneous testing of any number of cores with few chip-level pins, and it has multiple features to reduce test time and test data volume. It can test any number of identical core instances in near constant time, minimizes padding in the presence of cores with mismatched pattern counts and/or scan chain lengths, and enables fast streaming of data to/from and throughout the chip. It simplifies design planning and implementation, and is especially well suited for tile-based designs. Intel evaluated SSN and compared it to STF as well as to conventional pin-muxed access. SSN was found to reduce the test data volume by 36% and 43%, respectively. It reduced test cycles by 16% and 43%, respectively. Steps in the design and retargeting flow were between 10x and 20x faster with SSN compared to STF.

ACKNOWLEDGMENT

The authors wish to thank other contributors to the development of the SSN technology: Yahya Zaidan, Pawel Galas, Szymon Walkowiak, Paul Reuter, and Tony Fryars. We would also like to thank the contributors to the SSN pilot study: Sirish Chittoor, Yonsang Cho, Luis Briceño Guerrero, Kavita Bansal, Kelsey Byers, and Ian Nuber. Finally, many thanks to all our other partners who also provided invaluable feedback during the development, validation, and deployment of SSN.

REFERENCES

[1] Standard Testability Method for Embedded Core-based Integrated Circuits, IEEE Standard 1500, 2005.

[2] J. Remmers et al., “Hierarchical DFT methodology - a case study,” IEEE International Test Conference, 2004.

[3] D. Trock et al., “Recursive Hierarchical DFT Methodology with Multi-level Clock Control and Scan Pattern Retargeting,” IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016.

[4] J. Rajski et al., “Embedded Deterministic Test,” IEEE Trans. on CAD, vol. 23, May 2004, pp. 776-792.

[5] P. Wohl, J.A. Waicukauski, J.E. Colburn, and M. Sonawane, “Achieving Extreme Scan Compression for SoC Designs,” IEEE International Test Conference, 2014.

[6] C. Barnhart et al., “OPMISR: The foundation for compressed ATPG vectors,” IEEE International Test Conference, 2001.

[7] G. Giles et al., “Test Access Mechanism for Multiple Identical Cores,” IEEE International Test Conference, 2008.

[8] Y. Dong et al., “Maximizing Scan Pin and Bandwidth Utilization with a Scan Routing Fabric,” IEEE International Test Conference, 2017.

[9] J. Janicki et al., “EDT bandwidth management - Practical scenarios for large SoC designs,” IEEE International Test Conference, 2013.

[10] G. Colon-Bonet, “High Bandwidth DFT Fabric Requirements for Server and Microserver SoCs,” IEEE International Test Conference, 2015.

[11] G. Colon-Bonet, “High Bandwidth Packetized DFT Fabric for Server SoCs,” IEEE International System-on-Chip Conference, 2016.

[12] A. Sanghani et al., “Design and Implementation of a Time-Division Multiplexing Scan Architecture Using Serializer and Deserializer in GPU Chips,” IEEE VLSI Test Symposium, 2011.

[13] M. Sonawane et al., “Flexible Scan Interface Architecture for Complex SoCs,” IEEE VLSI Test Symposium, 2016.

[14] P. Wohl et al., “Achieving Extreme Scan Compression for SoC Designs,” IEEE International Test Conference, 2014.

[15] Standard for Access and Control of Instrumentation Embedded within a Semiconductor Device, IEEE Standard 1687, 2014.

[16] J. Durupt et al., “IJTAG supported 3D DFT using chiplet-footprints for testing multi-chips active interposer system,” IEEE European Test Symposium, 2016.

[17] M. Lin et al., “A 7nm 4GHz Arm®-core-based CoWoS® Chiplet Design for High Performance Computing,” Symposium on VLSI Circuits Digest of Technical Papers, 2019.

[18] T. Waayers et al., “Clock control architecture and ATPG for reducing pattern count in SoC designs with multiple clock domains,” IEEE International Test Conference, 2010.

[19] Standard for High-Speed Test Access Port and On-Chip Distribution Architecture, IEEE Standard 1149.10, 2017.

Table III. Design efficiency comparison between STF and SSN

  Metric                                           STF Flow    Tessent SSN Flow
  Tools count                                      7           3
  RTL completion to ATPG start                     ~10 hours   ~1 hour
  ATPG completion to gate-level simulation start   ~1 day      ~2 hours
  ATPG pattern retargeting of a partition          ~4 hours    ~12 minutes