

# Clock distribution system using IOSERDES based clock-duty-cycle-modulation

Outline

- Introduction
- Implementation
- Results
- Summary

KEK IPNS Ryotaro HONDA



# Motivation

Precise clock distribution (a few tens of ps in  $\sigma$ ) is a key issue for many particle and nuclear experiments.

#### **Typical requirements**

- Transferring not only the clock but also synchronous data with predictable latency
- As few transmission lines as possible

#### **Example solution**



It actually works well, but

- Some mainstream/low-cost FPGAs do not have a high-speed serial transceiver
- A high-performance FPGA with many transceiver lanes is necessary for the clock distribution

The author is motivated to develop a clock/data distribution system that is independent of high-speed transceivers and can be adopted in many FPGA devices.

The developed system will be introduced as a standard clock/timing distribution system in J-PARC hadron experiments together with the trigger-less DAQ system.



#### Speaker

TAKAHASHI, Tomonori (RIKEN)





Adopting clock-duty-cycle-modulation (CDCM) as a core technology

- CDCM is a clock-centric modulation.
- Data bits are embedded in the trailing edges of the clock signal.



Denis Calvet, IEEE TNS (Volume: 67, No. 8, Aug. 2020)

- The modulated clock can be fed directly into PLLs. Every PLL can act as a clock recovery circuit.
- When a PLL with a zero-delay mode is used for clock recovery, the output clock skew with respect to the input modulated clock is adjusted automatically.
  - No phase uncertainty. (It exists in a CDR circuit due to clock frequency division.)



Electronics System Group

# CDCM by IOSERDES of Xilinx FPGA

Open source consortium of Instrumentation





# Implementation





# Clock distribution system MIKUMARI



- Data transmission using an arbitrary-length frame structure
  - Scrambler/Descrambler (PRBS16)
- Pulse transmission with fixed latency

The MIKUMARI link is independent of the user-defined protocol. It is just a link layer protocol.

- CDCM modulation
- 8-bit data to 10-bit character extension
  - D, K, and T type characters







#### Features

- Data transmission using an arbitrary-length frame structure
  - Synchronous scrambler by PRBS16
- Pulse transmission with a fixed latency

#### Latency for pulse

- TX + RX: 9 (+ CBT latency)

#### Latency for data

- Varies within the CBT character transfer cycle.

#### **Resource usage**

- ~50 slices



#### 23rd IEEE Real Time Conference


# MIKUMARI frame structure

The normal frame structure is similar to that of the Xilinx Aurora protocol.

A frame has an arbitrary-length data body and an 8-bit checksum, which are sandwiched between FSK and FEK.

Normal frame transmission



- FSK: Frame start K-char
- FEK: Frame end K-char
- Check sum: 8-bit check sum
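The frame layout described above can be sketched in Python. This is only an illustration: the K-character codes `FSK` and `FEK`, the tuple representation of a character, sending the checksum as a D-type character, and the checksum rule (sum of the body bytes modulo 256) are all assumptions. The slides only state that the checksum is 8 bits wide.

```python
# Hypothetical sketch of a MIKUMARI-like normal frame: FSK, data body, checksum, FEK.
# The FSK/FEK codes and the modulo-256 checksum are assumptions, not from the source.
from typing import List, Tuple

Char = Tuple[str, int]  # (character type "K" or "D", 8-bit body)

FSK: Char = ("K", 0x01)  # hypothetical frame-start K-character code
FEK: Char = ("K", 0x02)  # hypothetical frame-end K-character code


def checksum8(data: bytes) -> int:
    """Assumed 8-bit checksum: sum of all data bytes, truncated to 8 bits."""
    return sum(data) & 0xFF


def build_frame(data: bytes) -> List[Char]:
    """Wrap an arbitrary-length data body and its checksum between FSK and FEK."""
    body = [("D", b) for b in data]
    return [FSK] + body + [("D", checksum8(data)), FEK]


def parse_frame(frame: List[Char]) -> bytes:
    """Return the data body, raising ValueError on framing or checksum errors."""
    if len(frame) < 3 or frame[0] != FSK or frame[-1] != FEK:
        raise ValueError("missing FSK/FEK")
    *body, check = [value for _, value in frame[1:-1]]
    data = bytes(body)
    if checksum8(data) != check:
        raise ValueError("checksum mismatch")
    return data
```

A receiver built this way can reject a corrupted frame without knowing its length in advance, because the frame end is marked by FEK rather than by a length field.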

Pulse K-type characters contain

- 3-bit pulse type
- 4-bit pulse timing
  - A CBT character transfer needs 5 (10) clock cycles. The 4-bit pulse timing provides the fine timing with respect to the CBT character send cycle.

Pulse K-type character transmission during normal frame operation

- Sending a pulse K-type character has the highest priority.









- Oscilloscopes used
  - Keysight DSOS054A (analog BW: 2.1 GHz, 20 GSPS)
  - Tektronix DPO 7254 (analog BW: 2.5 GHz, 40 GSPS)

\*Two oscilloscopes were used as a consistency check.





# Results




### Demonstration





The CDCE62002 locked correctly even to the clock modulated by CDCM-10-2.5.

Pulse transfer by MIKUMARI link






8-bit incremental data are sent continuously by the MIKUMARI system, and the recovered clock jitter is measured.


- S1: CDCM-10-1.5 with scrambler
- C1: CDCM-10-1.5 clear text
- S2: CDCM-10-2.5 with scrambler
- C2: CDCM-10-2.5 clear text

Standard deviation of recovered clock jitter (TIE) in ps when using the CDCE62002 jitter cleaner

| Frequency | Master clock | IDLE | S1  | C1  | S2  | C2   |
|-----------|--------------|------|-----|-----|-----|------|
| 125 MHz   | 4.9          | 5.5  | 5.8 | 6.4 | 6.8 | 10.0 |
| 100 MHz   | 5.1          | 5.6  | 6.0 | 7.6 | 6.1 | 10.8 |
| 75 MHz    | 4.9          | 5.0  | 5.5 | 7.7 | 6.1 | 11.4 |
| 50 MHz    | 4.8          | 5.6  | 6.0 | 8.2 | 7.0 | 13.3 |

\* Systematic errors for all the measured values are  $\pm 0.1$  ps

- The scrambler improves the jitter performance, and it is more effective for CDCM-10-2.5.
- With the scrambler, the jitter performance shows no clock frequency dependence; in the clear-text case, it does.



Horizontal axis position is adjusted. Taken by DPO 7254.




Standard deviation of measured recovered clock jitter (TIE) in ps when using MMCM

| Frequency | Master clock | IDLE | S1   | C1   | S2   | C2   |
|-----------|--------------|------|------|------|------|------|
| 125 MHz   | 7.7          | 8.7  | 9.5  | 10.6 | 11.9 | 13.9 |
| 100 MHz   | 8.3          | 9.5  | 10.6 | 11.4 | 10.9 | 14.6 |
| 75 MHz    | 7.4          | 9.6  | 9.7  | 10.7 | 11.9 | 14.9 |
| 50 MHz    | 6.8          | 10.7 | 10.5 | 11.0 | 12.7 | 16.9 |

\* Systematic errors for all the measured values are  $\pm 0.3$  ps

- The jitter clean-up performance of the MMCM is worse than that of the CDCE62002.
- The tendency is the same as with the CDCE62002.









Horizontal axis position is adjusted. Taken by DPO 7254.





How does the recovered clock phase change on average during data transmission?

- Measure the time delay (D) between the master clock and the recovered clock.
- Time (phase) difference:  $dT = D_{idle} - D_{data}$

Time (phase) difference (*dT*) in ps

| Clock recovery | Frequency | S1   | C1  | S2   | C2   |
|----------------|-----------|------|-----|------|------|
| CDCE62002      | 125 MHz   | -0.8 | 3.2 | -6.5 | -4.0 |
| CDCE62002      | 100 MHz   | 6.9  | 8.1 | 4.7  | 6.0  |
| CDCE62002      | 75 MHz    | 2.2  | 0.0 | 4.7  | 2.5  |
| CDCE62002      | 50 MHz    | 2.0  | 0.5 | 2.0  | 2.9  |
| MMCM           | 125 MHz   | 6.4  | 6.9 | 1.7  | 1.4  |
| MMCM           | 100 MHz   | -1.1 | 1.6 | -1.0 | -0.5 |
| MMCM           | 75 MHz    | 1.1  | 4.1 | -0.6 | 1.3  |
| MMCM           | 50 MHz    | -2.4 | 2.4 | -2.6 | -1.4 |

\* Systematic errors for all the measured values are  $\pm 1.0 \text{ ps}$ 



- The recovered clock phase stays within 10 ps.
  - If the CDCM patterns are well mixed, the averaged clock phase does not shift drastically.
- How the phase changes does not appear to be predictable.



# Jitter performance when cascaded








- Introduce a distributor and synchronize the two slave modules.
- Measure the standard deviation of the time delay between 1-2, 2-3, 1-3, and 3-4.
- Extract how much jitter is added per clock recovery.




#### Standard deviation of time delay measurement (clock frequency: 125 MHz, unit: ps)

| Path | CDCE62002 IDLE | CDCE62002 S1 | CDCE62002 S2 | MMCM IDLE | MMCM S1 | MMCM S2 |
|------|----------------|--------------|--------------|-----------|---------|---------|
| 1-2  | 7.3            | 7.9          | 8.2          | 10.4      | 11.0    | 14.1    |
| 2-3  | 7.5            | 8.1          | 8.3          | 10.3      | 11.0    | 14.1    |
| 1-3  | 8.1            | 9.0          | 9.7          | 12.3      | 13.5    | 17.4    |
| 3-4  | 8.3            | 8.1          | 8.5          | 10.3      | 11.1    | 14.8    |

The results of 1-2 and 2-3 are consistent. It indicates that a constant jitter is added in every clock recovery.

Jitter added per clock recovery (ps)

| CDCE62002 IDLE | CDCE62002 S1 | CDCE62002 S2 | MMCM IDLE | MMCM S1 | MMCM S2 |
|----------------|--------------|--------------|-----------|---------|---------|
| 3.7            | 4.3          | 5.2          | 6.6       | 7.9     | 10.2    |

The synchronization accuracy between two electronics modules will be around 15 ps ( $\sigma$ ) and 18 ps ( $\sigma$ ) for CDCM-10-1.5 and CDCM-10-2.5, respectively, even after 10 cascaded clock recoveries. This should satisfy many experimental requirements.
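The slides do not state how the per-recovery jitter and the 10-stage estimate were derived. One reading that comes close to the quoted numbers is quadrature arithmetic, assuming each clock recovery adds independent jitter, that path 1-2 spans one recovery, and that path 1-3 spans two. The sketch below works under those assumptions. The function names are illustrative, and the input values are the CDCE62002 columns of the time delay table.

```python
# Sketch: estimate the jitter added per clock recovery by quadrature subtraction,
# then extrapolate to a cascade. Assumes independent (uncorrelated) jitter sources;
# this is an interpretation, not a procedure stated in the source.
import math

# Standard deviations in ps at 125 MHz with CDCE62002, from the time delay table.
SIGMA_1_2 = {"S1": 7.9, "S2": 8.2}  # path assumed to span one clock recovery
SIGMA_1_3 = {"S1": 9.0, "S2": 9.7}  # path assumed to span two clock recoveries


def added_jitter(sigma_one: float, sigma_two: float) -> float:
    """Jitter added by the second recovery, assuming quadrature addition."""
    return math.sqrt(sigma_two**2 - sigma_one**2)


def cascaded_jitter(sigma_one: float, added: float, stages: int) -> float:
    """Extrapolated sigma after `stages` recoveries (the first gives sigma_one)."""
    return math.sqrt(sigma_one**2 + (stages - 1) * added**2)


for mode in ("S1", "S2"):
    add = added_jitter(SIGMA_1_2[mode], SIGMA_1_3[mode])
    total = cascaded_jitter(SIGMA_1_2[mode], add, stages=10)
    print(f"{mode}: added {add:.1f} ps per recovery, {total:.1f} ps after 10 stages")
```

Under these assumptions the S1 and S2 results land close to the quoted 4.3 ps and 5.2 ps per recovery, and close to 15 ps and 18 ps after 10 stages.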



# Summary

- An FPGA clock distribution system independent of high-speed serial transceivers, MIKUMARI, was developed using clock-duty-cycle-modulation (CDCM).
  - Physical layer: CDCM-10-2.5 and CDCM-10-1.5 based transceiver using the IOSERDES primitive.
  - Protocol layer: MIKUMARI link
- MIKUMARI was implemented on the general-purpose logic module, AMANEQ, and tested.
- Data scrambling improves the jitter performance, and it is more effective for CDCM-10-2.5.
- CDCM-10-1.5 with the scrambler shows the best jitter performance in this test.
- The jitter clean-up performance of the CDCE62002 is better than that of the MMCM.
- The recovered clock phase change during data transmission is small, less than 10 ps.
- When MIKUMARI links are cascaded, jitter of 4.3 ps is added per clock recovery for CDCM-10-1.5.

#### **Future prospect**

- Implement CDCM-8-1.5. As the frequency ratio between the fast and slow clocks is smaller than that of CDCM-10-XX, it should suit lower-performance FPGAs.
- The test setup in this work was simple. We will construct a larger system and study the jitter performance in detail.












Small- and middle-scale experiments (especially in J-PARC in Japan)



Small- and middle-scale J-PARC experiments tend to prefer Ethernet for the data transfer since we can rely on the commercial communication technology.











The normal frame structure is similar to that of the Xilinx Aurora protocol. A frame has an arbitrary-length data body and an 8-bit checksum, which are sandwiched between FSK and FEK.

Normal frame transmission

| FSK | Data 0 | Data 1 | Data 2 | ... | Data N | Check sum | FEK |
|-----|--------|--------|--------|-----|--------|-----------|-----|

Sending dogfood during normal frame transmission.

T-type character is transmitted at this timing. Tx ACK is not returned from CBT.




CDCM-10-2.5 encode table

| Binary | Encoded |
|--------|---------|
| 00     | 0000    |
| 01     | 0001    |
| IDLE   | 0011    |
| 10     | 0111    |
| 11     | 1111    |

- If disparity of the binary values is 0, disparity of the encoded pattern is not necessarily 0.
- To ensure the DC balance of the waveform pattern, data scrambling is necessary.
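The disparity argument can be checked directly against the encode table. The short sketch below computes the disparity (ones minus zeros) of each 2-bit input and of its encoded pattern. The helper name `disparity` and the dictionary layout are illustrative choices, not part of the source.

```python
# Sketch: disparity (ones minus zeros) of 2-bit inputs and of their CDCM-10-2.5
# encoded patterns, using the values from the encode table.
ENCODE = {"00": "0000", "01": "0001", "10": "0111", "11": "1111"}
IDLE = "0011"


def disparity(bits: str) -> int:
    """Number of ones minus number of zeros in a bit string."""
    return bits.count("1") - bits.count("0")


for binary, encoded in ENCODE.items():
    print(f"{binary}: disparity {disparity(binary):+d} -> "
          f"{encoded}: disparity {disparity(encoded):+d}")
```

The inputs `01` and `10` both have zero disparity, but their encoded patterns do not, so a long run of one of them drifts away from DC balance unless the data are scrambled.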



#### **CBT transfer cycle**

- 2 bits (1 bit) are transferred per CDCM-10-2.5 (CDCM-10-1.5) cycle.
- 5 (10) clock cycles are necessary to send a 10-bit CBT character.
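The cycle counts follow from dividing the 10-bit character length by the bits carried per clock cycle. The sketch below reproduces that arithmetic and derives a rough upper bound on the 8-bit body throughput. The throughput figure is a derived illustration that ignores K-type and T-type characters and framing; it is not a number quoted in the slides.

```python
# Sketch: clock cycles per 10-bit CBT character for each CDCM variant, plus a
# derived upper bound on body throughput (an illustration, not a measured value).
CHARACTER_BITS = 10
BODY_BITS = 8
BITS_PER_CYCLE = {"CDCM-10-2.5": 2, "CDCM-10-1.5": 1}


def cycles_per_character(variant: str) -> int:
    """Clock cycles needed to transfer one 10-bit CBT character."""
    return CHARACTER_BITS // BITS_PER_CYCLE[variant]


def body_throughput_mbps(variant: str, clock_mhz: float) -> float:
    """Upper bound on 8-bit body throughput in Mbps at the given clock frequency."""
    return BODY_BITS * clock_mhz / cycles_per_character(variant)


for variant in BITS_PER_CYCLE:
    print(variant, cycles_per_character(variant), body_throughput_mbps(variant, 125.0))
```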

#### **CBT character type (2-bit header + 8-bit body)**

- T-type character: Special character for CBT control
  - Invisible from outside of CBT.
- K-type character: Special character for link layer protocol
  - K-type characters are not defined at the CBT level.
- D-type character: User data character of link layer protocol
- Priority: K > T > D

#### **CBT lane initialization**

- The IDELAY tap number and ISERDES bit slip are tuned using the IDLE pattern.
- The decoder bit order is adjusted using T-type characters.
  - When all the initialization processes are finished, CBT lane up is asserted.

#### Hot plug and automatic lane up

- If the CBT finds a clock-like signal on the modulated clock line, the initialization process starts.
  - CBT slave side: the presence of the modulated clock is indicated by the LOCK signal of the PLL used for clock recovery.
  - CBT master side: the presence of the modulated clock is indicated by the clock monitor in the CBT.
- The Rx quality monitor checks the CDCM waveform pattern. If a broken pattern is found, a pattern error is asserted.
- If the watchdog does not receive dogfood (a T-type character) within a set time, the CBT lane goes down.
  - The CBT Tx sends dogfood periodically.
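The watchdog behaviour can be modelled in a few lines. The sketch below is a behavioural illustration only: the class name, the tick-based timeout, and the choice to keep the lane down until it is re-initialized are assumptions, and the actual firmware logic is not given in the slides.

```python
# Hypothetical model of the CBT watchdog: the lane goes down when no dogfood
# (T-type character) arrives within `timeout` clock ticks. The timeout value
# and all names here are illustrative assumptions.
class Watchdog:
    def __init__(self, timeout: int) -> None:
        self.timeout = timeout
        self.ticks_since_food = 0
        self.lane_up = True

    def tick(self, got_t_char: bool) -> bool:
        """Advance one clock tick and return the lane-up state after it."""
        if got_t_char:
            self.ticks_since_food = 0
        else:
            self.ticks_since_food += 1
            if self.ticks_since_food >= self.timeout:
                # Assumed: the lane stays down until it is re-initialized.
                self.lane_up = False
        return self.lane_up
```

As long as the Tx side sends dogfood more often than the timeout, `tick` keeps returning `True`. A disconnected cable stops the dogfood and brings the lane down after `timeout` ticks.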



| CBT header | Type    |
|------------|---------|
| 00         | T-type  |
| 01         | D-type+ |
| 10         | D-type- |
| 11         | K-type  |

D-type+ and D-type-: to ensure the DC balance of the encoded waveform for CDCM-10-2.5.
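A 10-bit CBT character with a 2-bit header and an 8-bit body can be packed and unpacked as sketched below. The header codes come from the table. Placing the header in the two most significant bits is an assumption, since the slides do not give the bit order, and the function names are illustrative.

```python
# Sketch: pack/unpack a 10-bit CBT character as 2-bit header + 8-bit body.
# Putting the header in the two most significant bits is an assumption.
from typing import Tuple

HEADER_TO_TYPE = {0b00: "T", 0b01: "D+", 0b10: "D-", 0b11: "K"}
TYPE_TO_HEADER = {name: code for code, name in HEADER_TO_TYPE.items()}


def pack(char_type: str, body: int) -> int:
    """Combine a character type and an 8-bit body into a 10-bit character."""
    if not 0 <= body <= 0xFF:
        raise ValueError("body must fit in 8 bits")
    return (TYPE_TO_HEADER[char_type] << 8) | body


def unpack(character: int) -> Tuple[str, int]:
    """Split a 10-bit character into (character type, 8-bit body)."""
    return HEADER_TO_TYPE[(character >> 8) & 0b11], character & 0xFF
```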



How does the recovered clock phase change as a function of the duty cycle?



Send five patterns continuously, and measure the relative time difference between master clock and the recovered clock.

- 0b11100'00000
- 0b11110'00000
- 0b11111'00000 (IDLE)
- 0b11111'10000
- 0b11111'11000
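The nominal duty cycle of each of the five patterns is the fraction of ones in the 10-bit word, which the short sketch below computes. These are nominal values only and may differ slightly from the duty cycle measured on the oscilloscope. The function name is illustrative.

```python
# Sketch: nominal duty cycle (fraction of ones) of the five 10-bit test patterns.
PATTERNS = [
    "1110000000",
    "1111000000",
    "1111100000",  # IDLE
    "1111110000",
    "1111111000",
]


def duty_cycle(pattern: str) -> float:
    """Fraction of the clock period during which the waveform is high."""
    return pattern.count("1") / len(pattern)


for pattern in PATTERNS:
    print(pattern, f"{duty_cycle(pattern):.0%}")
```

The patterns therefore step the nominal duty cycle from 30% to 70% in 10% increments, with IDLE at 50%.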

Time interval (delay) measurement





Horizontal axis is the measured duty cycle

- A non-negligible phase change was observed when a particular modulated pattern is sent continuously.
- This indicates that data scrambling will provide better jitter performance.





Total jitter (TIE Tj 1E-12 BER) in ps when using the CDCE62002 jitter cleaner

| Frequency | Master clock | IDLE | S1  | C1  | S2  | C2  |
|-----------|--------------|------|-----|-----|-----|-----|
| 125 MHz   | 105          | 113  | 116 | 111 | 119 | 130 |
| 100 MHz   | 100          | 114  | 120 | 124 | 132 | 133 |
| 75 MHz    | 108          | 107  | 116 | 118 | 113 | 127 |
| 50 MHz    | 84           | 113  | 126 | 121 | 130 | 135 |

\* Systematic errors for all the measured values are  $\pm 10$  ps

Total jitter (TIE Tj 1E-12 BER) in ps when using MMCM

| Frequency | Master clock | IDLE | S1  | C1  | S2  | C2  |
|-----------|--------------|------|-----|-----|-----|-----|
| 125 MHz   | 150          | 198  | 227 | 192 | 235 | 216 |
| 100 MHz   | 156          | 228  | 244 | 244 | 204 | 213 |
| 75 MHz    | 161          | 263  | 237 | 195 | 277 | 264 |
| 50 MHz    | 156          | 310  | 335 | 341 | 321 | 343 |



\* Systematic errors for all the measured values are  $\pm 20$  ps



Random jitter (TIE Rj 1E-12 BER) in ps when using the CDCE62002 jitter cleaner

| Frequency | Master clock | IDLE | S1  | C1  | S2  | C2  |
|-----------|--------------|------|-----|-----|-----|-----|
| 125 MHz   | 4.1          | 4.9  | 5.2 | 5.0 | 5.5 | 4.9 |
| 100 MHz   | 4.2          | 4.7  | 5.4 | 5.0 | 5.5 | 5.0 |
| 75 MHz    | 4.0          | 4.4  | 4.9 | 4.8 | 5.6 | 4.6 |
| 50 MHz    | 4.0          | 4.7  | 5.2 | 4.9 | 6.5 | 4.9 |

\* Systematic errors for all the measured values are  $\pm 0.1$  ps

Random jitter (TIE Rj 1E-12 BER) in ps when using MMCM

| Frequency | Master clock | IDLE | S1   | C1   | S2   | C2   |
|-----------|--------------|------|------|------|------|------|
| 125 MHz   | 6.2          | 7.6  | 8.7  | 8.0  | 10.4 | 8.0  |
| 100 MHz   | 6.6          | 8.3  | 9.3  | 8.6  | 9.9  | 8.3  |
| 75 MHz    | 6.4          | 8.4  | 9.3  | 8.6  | 10.6 | 9.0  |
| 50 MHz    | 6.4          | 9.8  | 10.7 | 11.2 | 11.8 | 10.4 |



\* Systematic errors for all the measured values are  $\pm 0.2$  ps

# Link stability

- 8-bit incremental data were sent using CDCM-10-2.5 at 125 MHz with the scrambler over half a day
- Clock recovery by MMCM
  - Total number of transmitted bits:  $\sim 10^{13}$  bits
- No error (broken CDCM pattern or checksum mismatch) was observed.
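The order of magnitude of the bit count can be cross-checked with simple arithmetic. The sketch below assumes that "half a day" means 12 hours and that the count refers to the 2 bits carried per CDCM-10-2.5 clock cycle. Both are assumptions, as the slides do not spell out how the total was obtained.

```python
# Sketch: cross-check of the ~10^13 transmitted bits. The 12-hour duration and the
# use of raw bits per clock cycle are assumptions, not values stated in the source.
CLOCK_HZ = 125e6           # clock frequency used in the stability test
BITS_PER_CYCLE = 2         # CDCM-10-2.5 carries 2 bits per clock cycle
DURATION_S = 12 * 3600     # "half a day", assumed to be 12 hours


def transmitted_bits(clock_hz: float, bits_per_cycle: int, duration_s: float) -> float:
    """Total number of bits sent over the link during the test."""
    return clock_hz * bits_per_cycle * duration_s


total = transmitted_bits(CLOCK_HZ, BITS_PER_CYCLE, DURATION_S)
print(f"{total:.2e} bits")
```

Under these assumptions the total is about 1.1 × 10^13 bits, consistent with the quoted order of magnitude.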









Introduction, J-PARC E50 experiment

#### Motivation

- Reveal the effective degrees of freedom of the baryon internal structure, the di-quark correlation, by introducing a heavy (*c*) quark.

#### Strategy

- Missing-mass spectroscopy via the  $\pi^- p \rightarrow D^{*-} Y_c^*$  reaction.
- Measure the production cross section and decay branching ratio simultaneously.

# The experimental setup at J-PARC high-p beamline





#### Secondary $\pi^-$ beam

- 20 GeV/c
- 30 MHz (60 M/spill)
  - (2s duration)

#### Target

- Liquid H<sub>2</sub>, 4 g/cm<sup>2</sup>

#### Reaction

- Charmed-baryon production: ~1 nb/sr
- Background reaction: 2.4 mb/sr
- Total reaction rate: 1.5 MHz
- Charged-particle multiplicity: 4

Trigger-less data-streaming-type DAQ system

Schema of the DAQ system

FairMQ + Redis

- Process monitoring and control via an in-memory DB.
- Automatic topology generation



Total data rate: ~12 GB/s (25 GB/spill) (E50 case)

Clock/command/timing distribution



# Implementation into an FPGA board





A main electronics for network oriented trigger-less data acquisition system (AMANEQ)

- A general purpose logic module for the J-PARC hadron experiments.
- By changing mezzanine cards, it serves not only as front-end electronics (TDC) but also as the clock distributor.
  - No special module for the clock distribution.

#### Features related to MIKUMARI

- FPGA: Kintex-7 (speed grade -2)
  - Maximum BUFG frequency: 720 MHz
  - Maximum IO speed in the HR bank: 1.25 Gbps
- On board jitter cleaner: CDCE62002 (TI)

# AMANEQ




- VME 6U size, but without a VME bus
  - A VME crate without power is used as a housing
- Kintex-7 with speed grade -2
  - Transceiver bandwidth up to 10 Gbps
  - Can implement SiTCP-XG
- Main input ports compatible with HUL
- Two mezzanine slots
  - Compatible with HUL
  - Can mount the HUL mezzanine HR-TDC
  - Can mount the DCR mezzanine for DC readout
- Belle II trigger port (master clock)
- Jitter cleaner (CDCE62002)
- **DDR3-SDRAM** as a de-randomizer
  - DDR3-1333 with 16-bit bus width
  - 2 Gb
  - Allows the spill-off time to be used for data transfer
- Powered by an external DC 30-35 V power supply