A simulation methodology for verification of transient fault tolerance of ASICs designed for high-energy physics experiments

22 Sept 2022, 09:00
20m
Terminus Hall

Oral ASIC

Speaker

Matteo Lupi (CERN)

Description

Transient fault tolerance verification is a crucial step in the design of radiation-tolerant ASICs for high-energy physics experiments. In this paper, we discuss a methodical approach to verifying the transient fault tolerance of ASICs using industry-standard methodologies and tools. The fault verification framework includes tools for fault enumeration, fault injection, and running fault campaigns. It supports fault verification at several levels of design abstraction, from high-level register-transfer models to gate-level netlists. The methodology and framework described in this paper were successfully used to identify SEE vulnerabilities in some of the ASICs designed at CERN.

Summary (500 words)

ASICs that must operate in a high-radiation environment, typical of high-energy physics experiments, are designed to tolerate transient faults induced by radiation. Transient faults, i.e. Single Event Effects (SEEs), manifest either as Single Event Upsets (SEUs) of memory elements or as Single Event Transients (SETs) on the nets in the design. Micro-architectural techniques such as triple modular redundancy (TMR), triple time redundancy (TTR), and error correction codes (ECC) are often used to mask the undesired effects of SEEs. Exhaustively protecting the entire design against SEEs is very expensive in terms of area and power. Therefore, the chip designer applies fault tolerance techniques selectively, to the parts of the design deemed critical for the operation of the chip. This process of selective hardening and implementation is error-prone. There have been several instances of improper fault protection in ASICs designed in the high-energy physics community, which resulted in chip re-spins and de-featuring. Additionally, finding the root cause of a failure in SEE protection circuits takes considerably more effort during post-silicon testing, because SEE protection features have very low controllability and observability. It is therefore imperative that transient fault tolerance is thoroughly verified during pre-silicon verification of the ASIC.
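As an illustration of how TMR masks an SEU, and of why incomplete protection can still fail, the following is a minimal behavioural sketch in Python. It is not taken from the framework described here; the function names and the 8-bit register value are hypothetical and chosen only for the example.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a register."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model an SEU by inverting a single stored bit."""
    return value ^ (1 << bit)


# Hypothetical 8-bit register value, triplicated as in TMR.
golden = 0b1011_0010
copies = [golden, golden, golden]

# One upset in one copy: the voter still returns the golden value.
copies[1] = flip_bit(copies[1], 3)
voted_single = majority(*copies)

# A second upset in the same bit of another copy defeats the voter.
# This is the kind of accumulation that can occur when the corrupted
# copy is never refreshed from the voted value.
copies[2] = flip_bit(copies[2], 3)
voted_double = majority(*copies)
```

In this sketch `voted_single` equals `golden`, while `voted_double` carries the flipped bit, which is one reason fault injection has to exercise the protection logic itself and not only the unprotected parts of the design.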

SEE verification must be addressed as an integral part of functional verification and must be tightly integrated into the metric-driven verification methodology based on UVM, which is the gold standard for functional verification of complex ASICs. To this end, we developed a SEE verification component at CERN that provides a framework for thorough verification of transient fault tolerance. The verification component includes a flow to enumerate and filter the nodes in the design that are candidates for fault injection, a SEE UVM verification component that can inject constrained-random SEUs and SETs during a simulation and collect coverage, and a flow to manage massively parallel fault campaigns. The framework supports fault enumeration, injection, and campaigns on designs at both register-transfer-level and netlist-level abstractions. In this contribution, we present our SEE verification methodology and the SEE verification component that implements it, and share good practices for SEE verification. We also share our experience of how this methodology and framework were applied to various radiation-tolerant ASICs designed at CERN (e.g. lpGBTv1, EXP28, Altiroc3).
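The three stages named above (node enumeration and filtering, constrained-random fault generation with coverage, and splitting a campaign into parallel jobs) can be sketched conceptually as follows. This is only an assumption-laden Python illustration, not the CERN component, which is UVM-based: the node list, the hierarchical paths, the filter patterns, the pulse-width range, and all function names are hypothetical.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    node: str      # hierarchical path of the injection target
    kind: str      # "SEU" for memory elements, "SET" for nets
    time_ns: int   # injection time within the simulation window
    width_ps: int  # transient pulse width; 0 for SEUs


def enumerate_nodes(nodes, include, exclude):
    """Keep (path, type) pairs matching `include` and not matching `exclude`."""
    inc, exc = re.compile(include), re.compile(exclude)
    return [(n, t) for n, t in nodes if inc.search(n) and not exc.search(n)]


def draw_faults(candidates, count, window_ns, seed):
    """Draw `count` faults; the seed makes the campaign reproducible."""
    rng = random.Random(seed)
    faults = []
    for _ in range(count):
        node, node_type = rng.choice(candidates)
        kind = "SEU" if node_type == "reg" else "SET"
        # Assumed constraint: SET pulses between 100 ps and 1000 ps.
        width = 0 if kind == "SEU" else rng.randint(100, 1000)
        faults.append(Fault(node, kind, rng.randint(0, window_ns), width))
    return faults


def node_coverage(candidates, faults):
    """Fraction of candidate nodes hit by at least one fault."""
    return len({f.node for f in faults}) / len(candidates)


def split_jobs(faults, n_jobs):
    """Distribute faults round-robin over independent simulation jobs."""
    return [faults[i::n_jobs] for i in range(n_jobs)]


# Hypothetical design hierarchy: registers and nets, plus a testbench node.
nodes = [
    ("top.core.cfg_reg[0]", "reg"),
    ("top.core.cfg_reg[1]", "reg"),
    ("top.core.fsm.state_n", "net"),
    ("top.core.voter.out", "net"),
    ("top.tb.monitor.q", "reg"),
]
candidates = enumerate_nodes(nodes, r"^top\.", r"\.tb\.")
faults = draw_faults(candidates, count=20, window_ns=10_000, seed=1)
coverage = node_coverage(candidates, faults)
jobs = split_jobs(faults, n_jobs=4)
```

Here the testbench node is filtered out before sampling, and each entry in `jobs` could be handed to a separate simulator run. In a real flow the candidate list would come from the elaborated design or netlist rather than a hand-written list.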
