Post-Silicon Validation Methodology in SoC (Part 1 of 2)

By V.P. Sampath


(a) Timing convergence bugs

  • Circuit operates too slowly (speed path),
  • Circuit operates too fast (minimum delay), or
  • Circuit fails due to the timing of multiple converging signals (race)

(b) Analogue bugs

  • Primarily in I/O buffers, PLLs and thermal sensors
  • Silicon does not operate in accordance with predicted (simulated) circuit behaviour

3. Fundamentals for circuit bug hunting are as follows:

(a) Needs a sufficiently large population of devices

(b) Needs to vary environmental conditions

(c) Needs to stimulate stressful system behaviour

(d) Stimulus is generally functional; failures look just like functional failures

Circuit debug issues

The following need to be checked for debugging:

1. On-die signal integrity

  • Cross-coupling induced noise
  • Droop-event induced noise

2. Power delivery integrity

  • High dynamic current events
  • Clock gating

3. Clock domain crossing

4. Process, voltage and temperature

  • Power state transitions
  • Silicon process variation

Ideal operating range

Ideally, silicon operates within a well-defined volume. Minimum and maximum corners are defined as per the manufacturer's specifications and are uniform over voltage, frequency, temperature, process and time. In practice, behaviour is somewhat different.

Other factors include temperature, component age and silicon variability.

Fig. 7: Ideal operating range
Fig. 8: 2D view of ideal case
Fig. 9: 2D view of speed paths
Fig. 10: 2D view of minimum delay
Fig. 11: 2D view of Shmoo holes/cracks

Speed paths

The circuit slows down as VCC decreases, and the failure disappears as VCC increases or F decreases. Historically, speed paths account for the highest percentage of CPU circuit issues.

Minimum delays

A failure occurs when the circuit is too fast. The failure disappears as VCC decreases or F increases. This type of failure is hard to fix.

Shmoo holes/cracks

1. Voids within the window

2. Intermittent working

3. Multiple clock domains

  • Skew within same domain
  • Skew/jitter across domains
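The two failure signatures described above can be pictured with a small toy model. The sketch below is only an illustration: the delay formula, the 300 mV offset and the half-period race window are assumptions chosen to reproduce the trends stated in the text (speed paths recover at higher VCC or lower F, minimum-delay failures recover at lower VCC or higher F), not measured silicon behaviour. The names `passes` and `shmoo` are hypothetical, and VCC values must stay above 300 mV for the formula to be meaningful.

```python
def passes(vcc_mv, freq_mhz, kind):
    """Toy pass/fail model; every constant here is illustrative, not measured."""
    period_ns = 1000.0 / freq_mhz
    delay_ns = 1000.0 / (vcc_mv - 300)  # the path gets faster as VCC rises
    if kind == "speedpath":
        # Too slow: data must arrive within one clock period.
        return delay_ns <= period_ns
    if kind == "mindelay":
        # Too fast: data must not arrive before half a period has elapsed
        # (an assumed stand-in for a phase-based race condition).
        return delay_ns >= 0.5 * period_ns
    raise ValueError(f"unknown failure kind: {kind}")


def shmoo(kind, vccs_mv, freqs_mhz):
    """Return text rows, highest VCC first; '.' is a pass and 'X' is a fail."""
    rows = []
    for vcc in sorted(vccs_mv, reverse=True):
        cells = "".join("." if passes(vcc, f, kind) else "X" for f in freqs_mhz)
        rows.append(f"{vcc:5d} mV  {cells}")
    return rows


if __name__ == "__main__":
    vccs = range(800, 1301, 100)
    freqs = range(400, 1001, 100)
    for kind in ("speedpath", "mindelay"):
        print(kind)
        print("\n".join(shmoo(kind, vccs, freqs)))
```

In this model the speed-path failures cluster in the low-VCC, high-F corner of the plot, while the minimum-delay failures cluster in the high-VCC, low-F corner. A Shmoo hole would show up as an isolated 'X' inside an otherwise passing region, which this simple model does not generate.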

Finding circuit margins

1. Exercise in platform-based silicon characterisation

2. Method is stress-to-fail (increase FMAX to failure)

3. Stimulus is directed/random

  • Victim/attacker patterns
  • Software load-driven power variation
  • Injected power state transitions
  • Randomised instructions, memory configurations and architectural events

4. Characterised before and after burn-in (to simulate ageing)

5. Characterised over large populations to understand silicon variability

Fig. 12: Stimulus
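The stress-to-fail method and the population step listed above can be sketched as a simple search loop. This is a minimal sketch under stated assumptions: `run_test` is a hypothetical callback that returns True when the device passes the stimulus at a given frequency, `make_device` fabricates stand-in devices with made-up limits, and the frequency values are invented for illustration. A real platform would drive actual clock hardware and a real test content suite.

```python
from statistics import median


def find_fmax(run_test, start_mhz, step_mhz, limit_mhz):
    """Raise the frequency until the first failure.

    Returns the last passing frequency, or None if the device
    fails at the starting frequency.
    """
    last_pass = None
    freq = start_mhz
    while freq <= limit_mhz and run_test(freq):
        last_pass = freq
        freq += step_mhz
    return last_pass


def characterise(population, start_mhz=2000, step_mhz=100, limit_mhz=4000):
    """Find FMAX for every device and summarise the spread."""
    fmaxes = [find_fmax(dev, start_mhz, step_mhz, limit_mhz) for dev in population]
    passing = [f for f in fmaxes if f is not None]
    return {
        "fmax_per_device": fmaxes,
        "worst": min(passing) if passing else None,
        "median": median(passing) if passing else None,
    }


def make_device(true_limit_mhz):
    """Hypothetical device: passes at or below an invented frequency limit."""
    return lambda freq_mhz: freq_mhz <= true_limit_mhz


population = [make_device(limit) for limit in (2450, 2610, 2390)]
summary = characterise(population)
print(summary)
```

Running the same search before and after burn-in, and over a larger population, gives the kind of distribution that items 4 and 5 refer to. The linear step search is chosen for clarity; it also avoids assuming that pass/fail is monotonic in frequency beyond the first failure.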

I/O margin characteristics

1. Stimulus includes victim/attacker patterns, resonance stimulus and other noise generators like dynamic CPU core loads

2. VCC and timing margined to fail (find extents of eye diagram)

3. Incorporates systematic 3D variation (shmooing) of voltage, temperature and frequency

4. Incorporates skewed silicon (varied process parameters) and skewed circuit boards (varied trace impedance)
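The systematic 3D variation of voltage, temperature and frequency mentioned above amounts to visiting every point on a grid and recording pass or fail. The sketch below assumes a hypothetical `run_test(voltage_mv, temp_c, freq_mhz)` callback and uses an invented pass condition purely so that the example runs; the function names and all numeric values are assumptions.

```python
from itertools import product


def shmoo_3d(run_test, voltages_mv, temps_c, freqs_mhz):
    """Run the test at every (voltage, temperature, frequency) grid point."""
    return {
        point: bool(run_test(*point))
        for point in product(voltages_mv, temps_c, freqs_mhz)
    }


def passing_extent(results, axis):
    """Return (min, max) of the passing values along one axis, or None."""
    values = [point[axis] for point, ok in results.items() if ok]
    return (min(values), max(values)) if values else None


def toy_link_test(voltage_mv, temp_c, freq_mhz):
    """Invented pass condition: margin shrinks with heat and grows with voltage."""
    return freq_mhz <= voltage_mv - 2 * temp_c


results = shmoo_3d(
    toy_link_test,
    voltages_mv=[900, 1000, 1100],
    temps_c=[0, 50, 100],
    freqs_mhz=[800, 900, 1000],
)
print(sum(results.values()), "of", len(results), "points pass")
print("passing frequency extent:", passing_extent(results, axis=2))
```

Repeating the sweep on skewed silicon and skewed boards, as item 4 describes, would simply add further dimensions to the same grid.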

Post-silicon debug challenges

1. Basic observability is package pins

  • Signal observability (higher integration SoC)
  • Probing scope
  • Probing signal integrity

2. The trend is towards lower observability as integration increases towards SoC

3. Functional and circuit issues require different solutions

Fig. 13: Observability/control/survivability architecture

Observability/control/survivability architecture

Fictional example code:
// Read a counter in a certain design block on a certain boundary scan chain
// First select a desired design block
irscan(MY_CHAIN, SELECT_BLOCK_OPCODE)
// scan in block number stored in 4 bits.
// now select the desired counter out of 100 counters
// 7 bits used to select counter
// This operation also resets counter to 0
// Wait until something triggers…
// Read Counter Value
drscan(MY_CHAIN, 32 , 0x0, Counter_Value)
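To make the sequence in the fictional listing concrete, the following sketch replays it against a software mock. Everything here is an assumption for illustration: `MockScanChain`, its opcode values, the `hardware_event` helper and the `read_counter` function are hypothetical and do not correspond to any real JTAG driver or tool API. Only the field widths (4 bits for the block, 7 bits for the counter, 32 bits for the value) and the reset-on-select behaviour are taken from the listing's comments.

```python
class MockScanChain:
    """Software stand-in for a debug scan chain; not a real JTAG driver."""

    # Hypothetical opcodes, invented for this sketch.
    SELECT_BLOCK, SELECT_COUNTER, READ_COUNTER = 0x1, 0x2, 0x3

    def __init__(self, blocks=16, counters_per_block=100):
        self.counters = [[0] * counters_per_block for _ in range(blocks)]
        self.ir = None
        self.block = 0
        self.counter = 0

    def irscan(self, opcode):
        """Load an instruction that decides what the next data scan does."""
        self.ir = opcode

    def drscan(self, nbits, value_in):
        """Shift nbits of data in; return the value shifted out."""
        value_in &= (1 << nbits) - 1
        if self.ir == self.SELECT_BLOCK:
            if value_in >= len(self.counters):
                raise ValueError("no such block")
            self.block = value_in
            return 0
        if self.ir == self.SELECT_COUNTER:
            if value_in >= len(self.counters[self.block]):
                raise ValueError("no such counter")
            self.counter = value_in
            self.counters[self.block][self.counter] = 0  # selection resets it
            return 0
        if self.ir == self.READ_COUNTER:
            return self.counters[self.block][self.counter] & ((1 << nbits) - 1)
        raise RuntimeError("no instruction loaded")

    def hardware_event(self, count=1):
        """Stand-in for on-die activity incrementing the selected counter."""
        self.counters[self.block][self.counter] += count


def read_counter(chain, block, counter, wait_for_trigger):
    """Select a block and counter, wait, then read the 32-bit count."""
    chain.irscan(chain.SELECT_BLOCK)
    chain.drscan(4, block)        # block number stored in 4 bits
    chain.irscan(chain.SELECT_COUNTER)
    chain.drscan(7, counter)      # 7 bits select one of 100 counters
    wait_for_trigger()            # wait until something triggers
    chain.irscan(chain.READ_COUNTER)
    return chain.drscan(32, 0x0)


chain = MockScanChain()
value = read_counter(chain, 3, 42, lambda: chain.hardware_event(5))
print(value)
```

The mock keeps the instruction-scan then data-scan rhythm of the listing, which is the part that carries over to real scan-based debug access.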

Example of a boundary scan

Depending on the deployment target, use-cases and necessary power/performance trade-offs, any design functionality in an IP may be moved to a hardware or software (firmware) implementation. Systems are vertically integrated, with system use-cases realised only through a combination of hardware, software, applications and peripheral communication.

Fig. 14: Boundary scan

Validation, and post-silicon validation in particular, is a complex co-validation problem across hardware, software and peripheral functionality, with no clear decomposition into individual components. As significant design functionality is integrated into one system, it is becoming increasingly difficult to control and observe any individual design component as validation requires. And with reduced time-to-market, the number of silicon spins available for validation has decreased dramatically.

When an error is found in silicon, other errors are detected in the same silicon spin. In fact, tools, flows and design instrumentation have been accumulated incrementally over time in response to specific challenges or requirements. Over 20 per cent of the design real estate and a significant component of the CAD flow effort are devoted to silicon validation today.

Fig. 15: Stages of SoC validation: pre-silicon validation, post-silicon validation and on-field survivability

As we move from pre-silicon to post-silicon and finally to on-field execution, increasingly complex usage scenarios are exercised, potentially stimulating errors that could not be seen in earlier validation phases. At the same time, observing and controlling the design during these executions becomes progressively harder, which makes it more difficult to root-cause failures. The cost of a bug also increases, and the time available for debug decreases, the further we go in the system lifecycle.

To be continued…

V.P. Sampath is a senior member of IEEE and a member of the Institution of Engineers India, working in an FPGA design house. He has published international papers on VLSI and networks.

