# The new ATLAS pixel chip: FE-I4

# PH-ESE Electronics Seminar

March 8, 2011

Maurice Garcia-Sciveres, Lawrence Berkeley National Lab

#### Introduction

- I was asked to talk about how the design was organized
  - Distributed collaboration / management of a large design
- I do not have a secret recipe guaranteed to work for organizing a successful design effort with a distributed collaboration.
- We were successful for FE-I4, and the best I can do is try to describe what we did, also pointing out the things we did not do so well.
- FE-I4 was not a top-down project, starting from well-defined specifications and rigorously managed.
- It was an R&D effort with evolving specifications, starting from initial concepts. I will have to give some history to explain many choices.
- A <u>draft</u> reference manual is attached to the agenda page. I will gloss over or leave out many technical details that can be found there.

## FE-I4 Designer Team

#### **Bonn**

Michael Karagounis, Tomasz Hemperek, Andre Kruth

#### **CPPM**

Mohsine Menouni, Denis Fougeron, Fabrice Gensolen

#### **INFN Ge**

Roberto Beccherle

#### **LBNL**

Abder Makkaoui (lead designer), Dario Gnani, Julien Fleury (LAL visitor)

#### **NIKHEF**

Ruud Kluit, Jan-David Schipper, Vladimir Gromov, Vladimir Zivkovic

Many others had varying roles in making FE-I4 a reality: physicists, <u>students</u>, other designers lending a hand, CERN IC group, management, external companies. Have not compiled author list for attached reference document yet... (to do)

Many more still are involved in testing chips and modules, producing many more results than I'm able to show today.

## What must a pixel readout chip do?
- Remember the time and the charge of all hit pixels for a little while
  - Massive short term memory
- A trigger signal arriving during this short while can select a particular 25ns time slice for persistent storage and transmission of all hits in that window
  - Filtered long term memory

# What is in use in ATLAS today

16 FE-I3 chips on 1 sensor to cover a ~10cm² area

Digital module control chip

The FE-I3 chip works great!

#### **Known limitation of FE-I3**

Column drain readout architecture. It does not scale.

# When and why FE-I4 work started

- In 2004 a test chip was fabricated with the FE-I3 front end circuitry "scaled" to a 130nm layout.
- This was a "quick and dirty" technology exploration effort.
- It was already known then that LHC luminosity upgrades would eventually need more advanced pixel chips than 0.25um, but the technology and design of such future chips were not clear.
- Not much happened for the next 2 years, partly because all the designers who worked on FE-I3 left.
- In the meantime, the CERN characterization of 130nm radiation hardness was completed.
- A new 130nm test chip was fabricated in 2006, with a ground-up front end design. Minimal digital circuits.
- FE-I4 name and design collaboration originated in 2007.

## 2007 upgrade workshop

# Collaboration formed quickly after that

#### 130nm Pixel Chip Draft Work Plan

Draft 6, 19 May 2007

#### Milestones and scope

- First full size chip submission date December 2008
- Final Design review September 2008
- Initial Design review January 2008
  - Heavy coverage of the test chip results
  - Specifications
  - Clear idea for all subcircuits.
- Front end design

#### **Known requirements:**

| Parameter | Value | Unit |
|---|---|---|
| Pixel size | 50 x 250 | μm² |
| Bump pad diameter | 12 | μm |
| Input | DC-coupled -ve polarity | |
| Normal pixel input capacitance range<sup>\*</sup> | 300-500 | fF |
| Long pixel input capacitance range<sup>\*</sup> | 450-700 | fF |
| In-time threshold with 20ns gate | 4000 | e |
| Two-hit time resolution | 400 | ns |
| DC leakage current tolerance | 100 | nA |
| Single channel ENC sigma (400fF) | 300 | e |
| Tuned threshold dispersion | 100 | e |
| Analog supply current/pixel @400fF | 10 | μA |
| Radiation tolerance | 200 | MRad |
| Average hit rate | 200 | MHz/ |
| Acquisition mode | Data driven with time stamp | |
| Time stamp precision | 8 | bits |
| Readout initiation | Trigger command | |
| Max. number of continuous triggers | 16 | |
| Trigger latency | 3.2 | μs |
| Single chip data output rate | 160 | Mb/s |

<sup>\*</sup> Low value given by planar sensors and high value by 3D.

## But there were a few problems with this plan

- Did not have fully defined requirements.
- Did not have a fully defined readout architecture
  - Had concepts, but they needed refinement and an implementation plan.
- Had not fleshed out what the chip periphery should look like
- Had not defined a design methodology

NOTE: this was bottom-up R&D, not a project with a need-by date and not part of a larger R&D effort. Nobody had asked for this chip yet. The ATLAS IBL upgrade, today's main customer, had not yet been conceived.

# What actually happened

#### 130nm Pixel Chip Draft Work Plan

Draft 6, 19 May 2007

#### Milestones and scope

- First full size chip submission date: December 2008 (planned), July 1, 2010 (actual)
- Final Design review: September 2008 (planned), November 2009 (actual)
- Initial Design review January 2008
  - Heavy coverage of the chip results
  - Specifications
  - Clear idea for all subcircuits.
- Front end design

# Getting there

- All the features of today's chip were really defined at the FE-I4 design collaboration workshop in July 2009 (which turned out to be T minus 1 year)
- Integration methodology was not completely finalized until the end of 2009.
- Note that available features in the 130nm process evolved along the way.
- The critical feature of T3 isolation was not available until the end of 2009.

## Collaboration platform

#### SOS design repository from cliosoft.com

Repository hosted at LBNL and mirrored at all other sites

| Tool | Seamless integration |
|---|---|
| Cadence IC Platform | Manage Cadence IC libraries directly from the Cadence IC Platform. Manage cell views without worrying about the physical files that make up these design units. |
| Synopsys Custom Designer | Manage Open Access libraries directly from Custom Designer. |
| Mentor HDL Designer | Manage Mentor's HDL Designer Series libraries directly from Mentor's Design Browser. Manage logical design units without worrying about the physical files. |
| SpringSoft Laker | Design Browser allows easy navigation of libraries and provides convenient access to DM features from Laker. |
| C API | A complete C programming interface to integrate any in-house tools with the SOS data collaboration platform. Readily available multi-site DM support in all tools. |
## SOS repository features we relied on

- Low cost educational licenses offered to us
- World-wide access to design data in real time
- Revision management (backup, snapshots, versions)
- Graphic diff tool to view changes in schematics and layouts
- Simple and flexible administration, mostly via GUI
- Very robust (we never lost data)
- Very well supported (help-desk responds within the hour)
- Fast (using caching, data access is about the same as accessing local data)
- Flexible: all data types can be shared in the repository (design databases for both digital and analog parts, documents, etc.)
- Only our own design work is shared. Third-party IP, such as design kits and standard cell libraries, was obtained directly by each site. The repository can link to local libraries/kits seamlessly.

# Now the chip: What would be better than today's pixel detectors?

- Much cheaper module manufacture (=> chip size as big as possible)
- Greater fraction of the footprint devoted to pixel array (=> move the memory inside the array)
- Lower power (=> don't move the hits around unless triggered)
- Able to take higher hit rate (=> store the hits locally and distribute the trigger)
- Still able to resolve the hits at higher rate (=> smaller pixels and faster recovery time)
- No need for extra control chip (=> significant digital logic blocks on array periphery)

Region architecture (figure: FE-I3 vs. FE-I4)

#### FE-I4 status

- 16 wafer engineering run ordered (FE-I4A) (60 chips / wafer)
- First wafer received 27 Sept. 2010 (an initial lot was scrapped at the foundry due to processing mistakes)
- Testing, wafer probing and irradiation made very fast progress
- Focus today is on testing of bump-bonded modules
- About to launch order for another 23 wafers
- Starting FE-I4B design effort aiming to submit in June 2011: production version for IBL installation in 2013.
- Flexible test platforms developed alongside the chip design were ready to test chips as soon as we had them
  - No time to cover in this talk
- Simplified version of FE-I4 was implemented in FPGA and used to debug test setups before we had the chip
  - No time to cover that either

## Large format

- Full chip size finally agreed late 2009 after detailed analysis
  - 80 columns x 336 rows. 20mm x 19mm outline. (250μm x 50μm pixel)
- Prime driver was to lower the cost of future pixel detectors.
  - For present detector modules we paid EUR80/FE-I3 chip
  - Today we're paying EUR100/FE-I4 chip for IBL prototyping.
  - That's a cost reduction of a factor of 4.6 per unit active area, not counting inflation.
- Full reticle. Needed pre-approval from foundry to exceed the maximum standard size.
- We did worry about yield
  - A key observation: yield is NOT dominated by number of pixels, since 0.1% bad pixels is perfectly acceptable for "physics grade" chips.

# Estimated and actual yield

- 2009 estimate based on Medipix wafer probe results for digital circuits and power shorts.
- Quote:
  - Expect 39% digitally <u>perfect</u> FE-I4 chips
  - Yield of fully functional chips may be as high as 76%, because the pixel array design is single point defect tolerant
- Above: example wafer map
- Avg. yield (3 wafers): 70%
- Functional tests only. Have not looked at scan chains yet.

## **Design Foundations**

- 130nm process (MVP?)
  - Radiation hardness out of the box
    - Essential to use standard cell synthesized logic
  - Outstanding power distribution
    - Essential to make long columns
  - Excellent substrate isolation (T3)
    - Essential to use standard cell synthesized logic
- Commercial digital design tools
  - Extremely powerful. Fully exploited academic access.
  - Allows complex functionality & detailed verification
- Design innovations
  - Region architecture
  - Modular approach and distributed design
  - Low current operation, fault tolerance, digital test bench, etc.
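The cost and bad-pixel figures quoted under "Large format" can be cross-checked with a few lines of arithmetic. The sketch below is only a plausibility check: the FE-I4 geometry and chip prices come from the slides, while the FE-I3 geometry (18 columns x 160 rows of 400μm x 50μm pixels) is an assumption that is not stated in this talk.

```python
# FE-I4 geometry and prices as quoted in the talk.
FEI4_COLS, FEI4_ROWS = 80, 336
FEI4_PIXEL_MM2 = 0.250 * 0.050      # 250 um x 50 um pixel
FEI4_PRICE_EUR = 100.0
FEI3_PRICE_EUR = 80.0

# ASSUMPTION (not from the talk): FE-I3 array of 18 x 160 pixels, 400 um x 50 um each.
FEI3_COLS, FEI3_ROWS = 18, 160
FEI3_PIXEL_MM2 = 0.400 * 0.050


def active_area_mm2(cols, rows, pixel_mm2):
    """Active (pixel array) area of one chip in mm^2."""
    return cols * rows * pixel_mm2


fei4_pixels = FEI4_COLS * FEI4_ROWS
fei4_area = active_area_mm2(FEI4_COLS, FEI4_ROWS, FEI4_PIXEL_MM2)
fei3_area = active_area_mm2(FEI3_COLS, FEI3_ROWS, FEI3_PIXEL_MM2)

# Ratio of cost per unit active area, FE-I3 relative to FE-I4.
cost_ratio = (FEI3_PRICE_EUR / fei3_area) / (FEI4_PRICE_EUR / fei4_area)

# Number of bad pixels tolerated at the 0.1% "physics grade" level.
bad_pixel_budget = fei4_pixels * 0.001

print(f"FE-I4 pixels: {fei4_pixels}, active area: {fei4_area:.0f} mm^2")
print(f"cost per area ratio FE-I3/FE-I4: {cost_ratio:.2f}")
print(f"bad pixel budget at 0.1%: {bad_pixel_budget:.1f} pixels")
```

Under these assumptions the ratio comes out near 4.7, in line with the quoted factor of 4.6; the small difference could come from rounding or from a different definition of active area. The 0.1% allowance corresponds to roughly 27 pixels per chip.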
## Conductor stack and usage

## **Footprint**

## Full chip diagram

## Modular concept

# Example of block: global configuration memory

Custom SEU-hard RAM. Blocks that need static configuration bits simply take a word of this RAM.

#### Modular view of the FE-I4 core blocks

#### Design responsibilities

# Design method

From Design Review (Nov. 2009)

#### Readout architecture

- We call it region architecture
- Local storage of digital hits: no high bandwidth to be "drained"
- Took a long time of concepts and analysis to arrive at the final form

## FE-I4 Region

- 4 analog pixels, each ending with a discriminator output (ADC function).
- If 1 pixel is hit, 1 counter starts. If 2, 3 or 4 pixels are hit, still only 1 counter starts. Hit processing decides how to associate hits in time depending on the ToT value (digital correction for analog timewalk)

# Array detail views

#### Digital Column

- 4-pixel regions are synthesized std cells
- Each region sits in a T3 deep N-well, framed in a substrate contact.
- Entire column with 168 regions is also synthesized. Includes clock and trigger tree, data links.
- No driving long lines for fast signals.
- Region address transmitted and encoded by the same circuit

Figure: digital column pair layout, 300um slice (30K transistors shown)

## Digital column simulation vs measurement

Simulation @1.2V: average power for a 4-pixel region at IBL occupancy (MC hits)

| Simulation type | Power (avg) [uW] |
|---|---|
| ETS<sup>1</sup> | 42.28 |
| Spectre | 25.19 |
| Ultrasim(s)<sup>2</sup> | 24.69 |
| Ultrasim(a)<sup>2</sup> | 24.73 |
| Ultrasim(ms)<sup>2</sup> | 35.12 |
| HSIM<sup>1</sup> | 27.64 |
| HSIM<sup>2</sup> | 30.98 |

Parasitic extraction done with <sup>1</sup>PEX

#### Measurement @1.2V

Occupancy faked with periodic charge injection

(Figure annotation: approx. IBL range)

# Digital synthesis parameters (aside)

- We defined common parameters for synthesis late in the process, after we hired an outside firm to run formal timing analysis.
- Verification applied to all synthesized blocks, but works best for synchronous logic. Asynchronous logic in the region must be verified by simulation.

#### Re-synthesis guidelines -Final- April 2010

All pins must be kept exactly in the same place AND the area must not grow more than 20% in the allowed direction. If this can't be achieved with the preferred choice, then use acceptable. If it still can't be achieved, the case must be reviewed individually: the choices are to move pins, accept bigger area, or reduce margin.

#### 1. Clock margin:

OCV de-rating of 10% will be applied to clock paths only, both for setup and hold paths

#### 1.1. **Preferred:**

DOB: 320MHz clock + clock uncertainty 75ps

All others: 50MHz clock + clock\_uncertainty -setup 500ps -hold 75ps

#### 1.2. Acceptable:

DOB: 300MHz clock + clock uncertainty 75ps

All others: 45MHz clock + clock\_uncertainty -setup 500ps -hold 75ps

# Having this earlier would have been better!

#### 2. Clock source:

Source for any generated clock must be the master clock source, not any intermediate points

## (back to region architecture) FE-I4 Rate Capability

### **Analog Front End**

## Pulse shape and ToT

| "True" ToT (clocks) | HitDiscCnfg = 00 | HitDiscCnfg = 01 | HitDiscCnfg = 10 | HitDiscCnfg = 11 |
|---|---|---|---|---|
| Below threshold | F | F | F | x |
| 1 | 0 | E | E | x |
| 2 | 1 | 0 | E | x |
| 3 | 2 | 1 | 0 | x |
| 4 | 3 | 2 | 1 | x |
| 5 | 4 | 3 | 2 | x |
| 6 | 5 | 4 | 3 | x |
| 7 | 6 | 5 | 4 | x |
| 8 | 7 | 6 | 5 | x |
| 9 | 8 | 7 | 6 | x |
| 10 | 9 | 8 | 7 | x |
| 11 | A | 9 | 8 | x |
| 12 | B | A | 9 | x |
| 13 | C | B | A | x |
| 14 | D | C | B | x |
| 15 | D | D | C | x |
| ≥16 | D | D | D | x |

### Hit association

### Threshold dispersion

### Noise change with radiation

- Ratio of noise after / noise before from full chip threshold scan.
- Absolute value of noise is about 100e- (there is no sensor load on these channels).
- See J. Grosse-Knetter's presentation tomorrow for charge scale calibration and noise with sensors.
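The ToT code table in "Pulse shape and ToT" follows a simple rule that is easier to see in code than in the table. The function below is a sketch that reproduces the tabulated values only; the names `tot_code`, `NO_HIT`, `SMALL_HIT` and `MAX_CODE` are illustrative, and reading code E as a "small hit" marker is an interpretation rather than something the table states.

```python
NO_HIT = 0xF      # table value for "below threshold"
SMALL_HIT = 0xE   # table value for hits shorter than the configured offset (interpretation)
MAX_CODE = 0xD    # largest regular ToT code; longer pulses saturate here


def tot_code(true_tot, hit_disc_cnfg):
    """Return the 4-bit ToT code for a pulse of `true_tot` clocks.

    `true_tot` of 0 means the pulse stayed below threshold.
    `hit_disc_cnfg` is 0, 1 or 2; setting 3 (binary 11) is marked unused in the table.
    """
    if hit_disc_cnfg not in (0, 1, 2):
        raise ValueError("HitDiscCnfg = 11 is not defined in the table")
    if true_tot <= 0:
        return NO_HIT
    code = true_tot - 1 - hit_disc_cnfg   # each HitDiscCnfg step shifts the scale by one clock
    if code < 0:
        return SMALL_HIT
    return min(code, MAX_CODE)


# Print the table columns again as a cross-check.
for tot in range(0, 17):
    row = [format(tot_code(tot, cfg), "X") for cfg in (0, 1, 2)]
    print(tot, *row)
```

Raising `HitDiscCnfg` trades one or two codes at the bottom of the scale for the same amount of extra range at the top, which is visible in where each column reaches D.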
### Data path from array out

## Complex global digital operations

#### Command decoder:

- Parses the serial command input stream, detects commands and translates them to internal levels.
- Entire state machine triplicated, with majority logic applied at outputs and idle state.

#### End of Chip Logic:

- Counts triggers and bunch crossings, keeps track of trigger ID, formats data flow (prev. page), counts and reports error messages
- All counters triplicated, all data Hamming coded, all logic triple redundant.

#### Column data formatter:

Unpacks data from the array and repacks it into "phi pairs". "Region X top left pixel hit followed by region X+1 bottom left pixel hit" will become "pair starting at row 2X+1 hit"

#### About fault tolerance

- The Verilog code for digital blocks is fault tolerant by triple redundancy and Hamming coding.
- All data are moved around Hamming coded.
- The fault tolerance can be exploited circuit-by-circuit either for yield or for SEU tolerance in operation
  - For some circuits yield makes more sense, and SEU for others
  - For example, the trigger ID counter should be SEU tolerant, which means scan chain verification will be mandatory
  - On the other hand, data transmission down the columns does not need SEU protection, but because the array is big, yield protection can be significant.
  - Ultimately it's a user choice how to exploit the fault tolerance
- A yield-only fault tolerance method, where space is tight or for analog blocks, is the inclusion of configurable spare circuits.
  - In FE-I4 the configuration shift register in each analog column has a spare that can be enabled by an e-fuse PROM bit as needed.

#### Verification testbench

- Functional verification of a complex (and expensive) chip is critical.
- We relied on a digital test bench using OVM: www.ovmworld.org
  - A virtual test bench was coded to control the chip inputs and parse the output
  - A SystemVerilog model of the full chip was run with this test bench
  - Performed runs a few seconds long with charge hits from physics simulation, calibration scans, full exercise of all configuration registers and modes, etc.
- We did find and correct problems, some of them quite subtle.
- Of all the circuits touched by the scrutiny, we missed only one problem that we know of today: the reporting of the skipped trigger counter value when skipped triggers occur (we did not check it!)
- But a weakness in our approach was that not 100% of the chip made it into the testbench. No behavioral model for some custom circuits!
- There were 2 dumb errors (wrong polarity reset signals) in the "human verified" part of the chip. We were lucky that we caught the critical one and missed the not so important one.

## Verilog model extraction: the key to success

- Verilog netlist generated directly from the top-level schematic in the Cadence OA database (using the verilogXL netlister on the command line and custom scripts to fix various syntax bugs).
- Verilog primitives are defined only for digital std cells, the smallest possible custom blocks (using mostly only behavioral, timing-less description), and a few technology devices (e.g. pull up resistors, CMOS switches).
- Post-layout timing is extracted via Assura QRC/Calibre PEX as SDF back-annotation files. They can be selectively added at simulation time. Timing files can be used only for digital cells and structural modules (all analog models are timing-less).
  - ! Top-level timing extraction failed both in Calibre and Assura.
- This simulated netlist is guaranteed LVS equivalent to the other design views (layout, schematic). Modeling minimizes functional assumptions by modeling at the lowest possible level in the hierarchy.

#### That's it.
I did not talk about:

- Novel "Shunt-LDO" regulators for power conditioning and/or serial power implementation (see talk by <u>Laura Gonella this afternoon</u>)
- First on-chip DC-DC converter in HEP, a V/2 charge pump (see talk by <u>Yunpeng Lu this afternoon</u>)
- Clock multiplier for up to 320Mb/s output from 40MHz input clock.
- LVDS compatible I/O with 8b/10b encoded output
- Current reference
- SEU tolerant latch designs
- Low power comparator designs
- Blocks connected or to be connected to the modular "backplane"
  - PROM using the e-fuse process option
  - Analog multiplexer for internal signal monitoring
  - ADC for remote monitoring of internal levels (to go in for FE-I4B)
  - Rad hard temperature sensor connected to ADC (FE-I4B)
- Hot off the press results of FE-I4 chip modules with sources and charged particles (see talk by Joern Grosse-Knetter on Wednesday, ACES)

### FE-I4 Conclusions

- A new generation of pixel chip containing real innovations
  - Largest format
  - First to use synthesized logic. A digital chip with analog elements.
  - Digital correction of analog timewalk exploited to reduce analog power (poor analog followed by DSP is the way of the future).
  - Testbench digital model of full chip "delivered" alongside real chip.
  - New readout architecture. Low power. Scales naturally to higher and higher rate.
  - Widespread use of fault tolerant designs
- Success of distributed design collaboration
- Two more slides...
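The two fault-tolerance mechanisms behind the "widespread use of fault tolerant designs" point, triplication with majority logic and Hamming coding, can be illustrated in software. This is a minimal sketch and not the chip's Verilog: the talk does not state which Hamming code parameters are used on chip, so the (7,4) code and all function names here are assumptions chosen for brevity.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 vote over three copies of a register value."""
    return (a & b) | (a & c) | (b & c)


def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_decode(word):
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position   # XOR of set positions is 0 for a valid codeword
    if syndrome:
        bits[syndrome - 1] ^= 1    # the syndrome is the position of the flipped bit
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)


# One upset copy of a triplicated counter is outvoted by the other two.
print(bin(majority(0b1010, 0b1010, 0b0110)))

# A single bit flip in transit is corrected by the decoder.
codeword = hamming74_encode(0b1011)
print(hamming74_decode(codeword ^ (1 << 4)) == 0b1011)
```

The same redundancy serves either purpose mentioned in the talk: a permanently stuck bit (yield) and a transient flip (SEU) look identical to the voter and to the decoder.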
#### What's next for FE-I4

- IBL production version FE-I4B for this summer
- IBL is a simple system (see talk by Roberto Beccherle tomorrow, ACES)
  - Module = chip
  - No power conversion
  - Chip has direct data link to DAQ crate
- FE-I4 size and other features aimed at covering large areas with pixels in future upgrades (see my talk tomorrow, ACES)
  - Module = 4 FE-I4 chips, but no module controller
  - Power conversion (either serial or DC-DC)
  - Module data link must go to high speed serializer such as GBT
- Design FE-I4C: 4-chip module version, after ~1yr of system development

### Beyond FE-I4

- Still higher rate capability
  - Need smaller, faster pixel
  - Yet need more memory per pixel to buffer higher rate
- Two directions being explored: 3D and 65nm (see talk by <u>Sasha Rozanov on Thursday (ACES)</u>)

#### References

- "Design and Measurements of SEU tolerant latches", M. Menouni et al, Proceedings of TWEPP 2008.
- "New ATLAS Pixel Front-End IC for Upgraded LHC Luminosity", M. Barbero et al, NIM A 604 (2009).
- "Digital Architecture and Interface of the New ATLAS Pixel Front-End IC for Upgraded LHC Luminosity", D. Arutinov et al, IEEE Trans. Nucl. Sci. 56, 388 (2009).
- "An Integrated Shunt-LDO Regulator for Serial Powered Systems", M. Karagounis et al, Proceedings of the 35th European Solid-State Circuits Conference, 2009.
- "Charge Pump Clock Generation PLL for the Data Output Block of the Upgraded ATLAS Pixel Front-End in 130 nm CMOS", A. Kruth et al, Proceedings of TWEPP 2009.
- "Low Power Discriminator for ATLAS Pixel Chip", M. Menouni et al, Proceedings of TWEPP 2009.
- "Submission of the First Full Scale Prototype Chip for Upgraded ATLAS Pixel Detector at LHC, FE-I4A", M. Barbero, Proceedings of the Pixel2010 conference, ATL-COM-UPGRADE-2010-022.
- "The FE-I4A Integrated Circuit Guide", FE-I4 Collaboration.
- Final design review: http://indico.cern.ch/conferenceDisplay.py?confld=72160

## **BACKUP**

## Use for ATLAS IBL, then sLHC outer layers

### FE-I4 Pixel Layout

### More info on wafer map

## Subthreshold leakage

Calibration injection switches in entire chip

# Hit processing (HC3 mode): schematic

- Receives comparator output
- BC resolution
- Generates Leading Edge (LE)
- Generates Small hit Leading Edge (sLE)
- Generates Trailing Edge (TE)
- Generates ToT counter reset and enable (rst cnt, en cnt)

# Fixed format clustered data compression factor

(all at 3×LHC) 3.7cm (vs. 21cm), $\eta = 0$

- indiv pixels: 4.09 (0.25) × (7+9+4+2) = 1.00 (1.00) A.U.
- static 1×2: 3.45 (0.18) × (7+8+2×4+2) = 0.96 (0.83) A.U.
- dynamic 1×2: 3.02 (0.15) × (7+9+2×4+2) = 0.87 (0.74) A.U.
- static 1×4: 2.86 (0.17) × (6+8+4×4+4) = 1.08 (1.08) A.U.
- dyn. in-DC 1×4: 2.43 (0.15) × (6+9+4×4+4) = 0.95 (0.95) A.U.
- dynamic 1×4: 2.13 (0.14) × (7+9+4×4+4) = 0.85 (0.94) A.U.

Choice: **Dynamic phi-pairing** (dynamic 1×2), which merges neighbours and small hits in the process. Compression is OK, it is simple to do, and the format is good: 24 bits (nice for FIFO and 8b10b). Note that Hamming decoding is needed before the formatter.
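The compression comparison above can be recomputed from the numbers on the slide. The sketch below reads each line as (first factor) × (sum of bit fields), normalized to the individual-pixel format; that reading is an assumption, since the slide does not label the factors. Only the unparenthesized values are recomputed here, and the dictionary keys are just labels.

```python
# (first factor, record size in bits) for each format, copied from the slide.
formats = {
    "indiv pixels":   (4.09, 7 + 9 + 4 + 2),
    "static 1x2":     (3.45, 7 + 8 + 2 * 4 + 2),
    "dynamic 1x2":    (3.02, 7 + 9 + 2 * 4 + 2),
    "static 1x4":     (2.86, 6 + 8 + 4 * 4 + 4),
    "dyn. in-DC 1x4": (2.43, 6 + 9 + 4 * 4 + 4),
    "dynamic 1x4":    (2.13, 7 + 9 + 4 * 4 + 4),
}

# Normalize every product to the individual-pixel format (= 1.00 A.U.).
reference = formats["indiv pixels"][0] * formats["indiv pixels"][1]
relative = {
    name: factor * bits / reference
    for name, (factor, bits) in formats.items()
}

for name, value in relative.items():
    print(f"{name:15s} {value:.2f} A.U.")
```

With this reading the recomputed values agree with the slide to two decimals. Dynamic 1×4 gives the smallest volume, and dynamic 1×2 is within a few percent of it, consistent with the stated choice of dynamic phi-pairing for its simplicity and record format.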