(a) Timing convergence bugs:
- Circuit operates too slowly (speedpath),
- Circuit operates too fast (minimum delay), or
- Circuit fails due to the relative timing of multiple converging signals (race)
(b) Analogue bugs:
- Primarily in I/O buffers, PLLs and thermal sensors
- Silicon does not operate in accordance with predicted (simulated) circuit behaviour
3. Fundamentals for circuit bug hunting are as follows:
(a) Needs a sufficiently large population of devices
(b) Needs to vary environmental conditions
(c) Needs to stimulate stressful system behaviour
(d) Stimulus is generally functional; failures look just like functional failures
Circuit debug issues
The following need to be checked for debugging:
1. On-die signal integrity
- Cross-coupling induced noise
- Droop-event induced noise
2. Power delivery integrity
- High dynamic current events
- Clock gating
3. Clock domain crossing
4. Process, voltage and temperature
- Power state transitions
- Silicon process variation
Ideal operating range
Ideally, silicon operates within a well-defined volume. Minimum and maximum corners are defined as per manufacturers’ specifications and are uniform over voltage, frequency, temperature, process and time. In practice, the picture is somewhat different.
Other factors include temperature, component age and silicon variability.
The circuit slows down as VCC decreases, and the failure disappears as VCC increases or frequency (F) decreases. Historically, the highest percentage of CPU circuit issues fall into this category.
Failure happens when the circuit is too fast. The failure disappears as VCC decreases or F increases. This is hard to fix.
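The two signatures described above can be turned into a simple triage rule. The following is a minimal sketch in Python, not a production flow: the `Observation` record and the `classify` function are hypothetical names introduced only for illustration, and the rule encodes nothing more than the VCC and F sensitivities stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical record: did the failure disappear under each change?"""
    vcc_up_fixes: bool
    vcc_down_fixes: bool
    freq_up_fixes: bool
    freq_down_fixes: bool


def classify(obs):
    """Classify a failure from its sensitivity to VCC and frequency."""
    slow_signature = obs.vcc_up_fixes or obs.freq_down_fixes
    fast_signature = obs.vcc_down_fixes or obs.freq_up_fixes
    if slow_signature and not fast_signature:
        return "speedpath"        # circuit too slow
    if fast_signature and not slow_signature:
        return "minimum delay"    # circuit too fast
    return "unclassified"         # mixed or no sensitivity: needs more data


# Example: the failure goes away when VCC is raised or F is lowered
example = Observation(vcc_up_fixes=True, vcc_down_fixes=False,
                      freq_up_fixes=False, freq_down_fixes=True)
print(classify(example))
```

A failure that responds to both kinds of change, or to neither, is deliberately left unclassified, since races and analogue bugs need not follow either signature.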
1. Voids within the window
2. Intermittent working
3. Multiple clock domains
- Skew within same domain
- Skew/jitter across domains
Finding circuit margins
1. Exercise in platform-based silicon characterisation
2. Method is stress-to-fail (increase FMAX to failure)
3. Stimulus is directed/random
- Victim/attacker patterns
- Software load-driven power variation
- Injected power state transitions
- Randomised instructions, memory configurations and architectural events
4. Characterised before/after burn-in (simulate aging)
5. Characterised over large populations to understand silicon variability
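The stress-to-fail method in the list above can be sketched as a frequency sweep repeated over a population. This is an illustrative Python sketch under stated assumptions: `find_fmax` and `make_device` are hypothetical helpers, and each device is modelled as a plain pass/fail function with a hidden maximum frequency rather than real silicon driven by real stimulus.

```python
def find_fmax(passes, f_start_mhz, f_step_mhz, f_limit_mhz):
    """Raise frequency step by step until the first failure.

    Returns the last passing frequency, or None if even the
    starting frequency fails.
    """
    last_pass = None
    f = f_start_mhz
    while f <= f_limit_mhz:
        if not passes(f):
            break
        last_pass = f
        f += f_step_mhz
    return last_pass


def make_device(true_fmax_mhz):
    """Toy device model (assumption): passes up to a hidden FMAX."""
    return lambda f_mhz: f_mhz <= true_fmax_mhz


# A small population with different hidden FMAX values (silicon variability)
population = [make_device(fmax) for fmax in (3150, 3275, 3400)]
fmax_values = [find_fmax(dev, 3000, 50, 4000) for dev in population]
print(fmax_values)
```

The measured value is quantised to the step size, so the second device reports 3250 MHz rather than its hidden 3275 MHz; a finer step or a binary search between the last pass and the first fail would tighten the estimate.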
I/O margin characteristics
1. Stimulus includes victim/attacker patterns, resonance stimulus and other noise generators like dynamic CPU core loads
2. VCC and timing margined to fail (find extents of eye diagram)
3. Incorporates systematic 3D variation (shmooing) of voltage, temperature and frequency
4. Incorporates skewed silicon (varied process parameters) and skewed circuit boards (varied trace impedance)
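The systematic 3D variation (shmooing) mentioned above can be illustrated with a small sweep over temperature, voltage and frequency. The sketch below is a toy in Python: `shmoo` and `toy_passes` are hypothetical names, and the linear pass/fail model (FMAX rising with VCC and falling with temperature) is an assumption chosen only to produce a readable plot, not measured behaviour.

```python
def shmoo(passes, temps_c, voltages_mv, freqs_mhz):
    """Sweep all three axes; return one text grid per temperature.

    Each grid row is one voltage, each column one frequency:
    'P' marks a pass and '.' marks a fail.
    """
    result = {}
    for t in temps_c:
        rows = []
        for v in voltages_mv:
            rows.append("".join("P" if passes(t, v, f) else "."
                                for f in freqs_mhz))
        result[t] = rows
    return result


def toy_passes(temp_c, vcc_mv, freq_mhz):
    """Toy model (assumption): FMAX rises with VCC, falls with temperature."""
    fmax_mhz = 2 * vcc_mv + 400 - 2 * (temp_c - 25)
    return freq_mhz <= fmax_mhz


result = shmoo(toy_passes,
               temps_c=[25, 125],
               voltages_mv=[900, 1000, 1100],
               freqs_mhz=[2000, 2200, 2400, 2600, 2800])

for temp, rows in result.items():
    print(f"T = {temp} C")
    for vcc, row in zip([900, 1000, 1100], rows):
        print(f"  {vcc} mV  {row}")
```

In a real characterisation run the pass/fail function would be replaced by actual test execution, and holes inside the passing region would be of particular interest.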
1. Basic observability is package pins
- Signal observability (higher integration SoC)
- Probing scope
- Probing signal integrity
2. The trend is towards lower observability as integration increases towards SoC
3. Functional and circuit issues require different solutions
Fictional example code:
// Read a counter in a certain design block on a certain boundary scan chain
// First select a desired design block
irscan(MY_CHAIN, SELECT_BLOCK_OPCODE)
// scan in block number stored in 4 bits.
drscan(MY_CHAIN, 4, BLOCK_NUMBER)
// now select the desired counter out of 100 counters
// 7 bits used to select counter
// This operation also resets counter to 0
drscan(MY_CHAIN, 7, COUNTER_ID)
// Wait until something triggers…
// Read Counter Value
drscan(MY_CHAIN, 32, 0x0, Counter_Value)
Example of a boundary scan
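To make the sequence in the fictional example concrete, the following Python sketch models the same access pattern against a software mock. Everything here is an assumption made for illustration: `MockScanChain`, its opcode value, its internal state machine and the `trigger` method are hypothetical, and the mock is not a real JTAG driver or tool API. Only the order of operations (select block, select and reset counter, wait, read) is taken from the example.

```python
class MockScanChain:
    """Toy stand-in for one scan chain (hypothetical, not a real driver)."""

    SELECT_BLOCK_OPCODE = 0x5  # hypothetical opcode value

    def __init__(self, num_blocks=16, num_counters=100):
        self.counters = {(b, c): 0
                         for b in range(num_blocks)
                         for c in range(num_counters)}
        self.state = "idle"
        self.block = None
        self.counter = None

    def irscan(self, opcode):
        """Load an instruction; only block selection is modelled."""
        if opcode != self.SELECT_BLOCK_OPCODE:
            raise ValueError("unsupported opcode")
        self.state = "expect_block"

    def drscan(self, nbits, value):
        """Shift nbits of data in; return the data shifted out."""
        if value >= (1 << nbits):
            raise ValueError("value does not fit in nbits")
        if self.state == "expect_block":
            self.block = value
            self.state = "expect_counter"
            return 0
        if self.state == "expect_counter":
            self.counter = value
            self.counters[(self.block, self.counter)] = 0  # select resets
            self.state = "read"
            return 0
        if self.state == "read":
            mask = (1 << nbits) - 1
            return self.counters[(self.block, self.counter)] & mask
        raise RuntimeError("no instruction loaded")

    def trigger(self, times=1):
        """Simulate the monitored event firing on the selected counter."""
        self.counters[(self.block, self.counter)] += times


chain = MockScanChain()
chain.irscan(MockScanChain.SELECT_BLOCK_OPCODE)  # select block access
chain.drscan(4, 3)                               # block number in 4 bits
chain.drscan(7, 42)                              # counter ID in 7 bits, resets it
chain.trigger(times=5)                           # "wait until something triggers"
counter_value = chain.drscan(32, 0x0)            # read the 32-bit counter
print(counter_value)
```

A mock of this kind can be useful for exercising debug scripts before silicon is available, although real scan access depends entirely on the instrumentation built into the design.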
Depending on the deployment target, use-cases and necessary power/performance trade-offs, any design functionality in an IP may be moved to a hardware or software (firmware) implementation. Systems are vertically integrated, with system use-cases realised only through the combination of hardware, software, applications and peripheral communication.
Validation, and post-silicon validation in particular, is a complex co-validation problem across hardware, software and peripheral functionality, with no clear decomposition into individual components. With the integration of significant design functionality into one system, it is becoming increasingly difficult to control and observe any individual design component as necessary for validation. And with reduced time-to-market, the number of silicon spins available for validation has decreased dramatically.
When an error is found in silicon, other errors are detected in the same silicon spin. In fact, tools, flows and design instrumentation have accumulated incrementally over time in response to specific challenges or requirements. Today, over 20 per cent of the design real estate and a significant part of the CAD flow effort are devoted to silicon validation.
As we move from pre-silicon to post-silicon and finally to on-field execution, more and more complex usage scenarios are exercised, potentially stimulating errors that could not be seen in previous validation phases. At the same time, observability and controllability of the design during these executions become progressively more limited, making it harder to root-cause failures. Meanwhile, the cost of a bug increases and the time available for debug decreases as we go further in the system lifecycle.
To be continued…
V.P. Sampath is a senior member of IEEE and a member of the Institution of Engineers India, working in an FPGA design house. He has published international papers on VLSI and networks.