# General-purpose data streaming FPGA TDC synchronized by SerDes-based clock synchronization technique

Ryotaro Honda, Masahiro Ikeno, Che-Sheng Lin, and Masayoshi Shoji

*Abstract*—This article presents a clock synchronization protocol that uses the functionalities of the IDELAYE2 and IOSERDESE2 primitives of AMD Xilinx field programmable gate arrays (FPGAs), together with a general-purpose data-streaming time-to-digital converter (TDC) for particle and nuclear physics experiments. The clock synchronization protocol, called the local area common clock protocol (LACCP), is developed as the upper-layer protocol of the MIKUMARI link technology. Clock synchronization is realized by a round-trip time measurement in units of the system clock period and a fine offset estimation, which corresponds to the clock signal phase difference between the primary and secondary FPGAs. The fine offset measurement is based on information from the IDELAYE2 and ISERDESE2 primitives used as the physical layer of the MIKUMARI link, and no extra component is needed for it. A key feature of LACCP is that it can be implemented whenever FPGAs are connected by a pair of RX and TX lines on general IO pins. The streaming high-resolution TDC, called Str-HRTDC, is also developed; it is a tapped-delay-line-based TDC consisting of CARRY4 primitives in an AMD Xilinx Kintex-7 FPGA. It continuously measures timing with an intrinsic resolution of 19.5 ps in  $\sigma$  and provides unique timestamp information over 2.4 hours by introducing a time frame structure, which is defined and synchronized by LACCP. The clock synchronization accuracy and the timing resolution are evaluated by connecting four modules with optical fibers of up to 100 m. No cable length dependence is observed. The obtained synchronization accuracy is around 300 ps, and LACCP shows the potential to achieve a clock synchronization accuracy better than 100 ps. The timing resolution between two synchronized modules is 23.1 ps in  $\sigma$ .

*Index Terms*—clock synchronization, field programmable gate array (FPGA), Serializer–Deserializer (SerDes), time-to-digital converter (TDC), streaming readout

## I. INTRODUCTION

In particle and nuclear physics experiments, the basic approach to data collection is to select events and store them, and today this is done by a combination of hardware triggers and software high-level triggers. While hardware triggers are powerful in significantly reducing the amount of data transmitted from front-end electronics (FEE), it is difficult to develop complex, low-latency hardware triggers, and the FEE also needs enough memory to hold data while waiting for the hardware trigger. To relieve these difficulties, trigger-less data-streaming data acquisition (DAQ) systems are being intensively investigated worldwide.

Manuscript received May 24, 2024. This work was supported by JSPS KAKENHI Grant Numbers 23H00126 and 22H04940. We thank the members of the Open Source Consortium of Instrumentation (Open-It) for their technical support.

Ryotaro Honda, Che-Sheng Lin, and Masayoshi Shoji are with Institute of Particle and Nuclear Studies/J-PARC Center, High Energy Accelerator Research Organization, Tsukuba 305-0801, Japan (e-mail: rhonda@post.kek.jp; cslin@post.kek.jp; mshoji@post.kek.jp).

Masahiro Ikeno is with the Research Center for Nuclear Physics, Osaka University, Ibaraki, Osaka 567-0047, Japan (e-mail: ikeno@rcnp.osaka-u.ac.jp).

We consider trigger-less DAQ to be suitable for small- and middle-scale experiments performed at nuclear and hadron facilities in Japan, e.g., J-PARC, RCNP, RIKEN, RARIS, and so on. Since these experiments share the same beamline, the detector configuration and the trigger logic change from experiment to experiment. Therefore, the difficulty of developing and maintaining hardware triggers also arises here.

To overcome these problems, we are developing a general-purpose trigger-less DAQ system under the signal processing and data acquisition infrastructure (SPADI) alliance [1]. We started with a time-to-digital converter (TDC) based DAQ system because event reconstruction using timing information is the minimal yet most essential function of a trigger-less DAQ system. Thus, a TDC module and a clock synchronization system are the first targets. The most important requirement for the FEE is generality. For example, the beam delivery method at the J-PARC hadron experimental facility [2] is slow extraction: a proton beam is slowly extracted from the J-PARC main ring over 2 s within a 4.2 s cycle. If the FEE functionalities were dedicated to slow extraction, they would be inconvenient for experiments at cyclotron facilities. Therefore, the FEE should not rely on timing signals from accelerators. Scalability and simplicity are also required: the system should work from a single FEE, i.e., in a stand-alone mode, up to a few thousand FEEs. In particular, in the stand-alone mode, the DAQ system should consist only of the FEE and a personal computer (PC).

Against this background, we have developed data-streaming high- and low-resolution TDCs, called Str-HRTDC and Str-LRTDC, respectively, on the AMANEQ module [3]. The target timing resolution for the HR-TDC is 30 ps in  $\sigma$  according to the requirement from the J-PARC hadron experiment [4]. We expect the Str-LRTDC to read out wire chambers, and thus we set its target LSB precision to 1 ns. In addition, we have also developed a clock synchronization system using the AMANEQ module. This system requires clock signal transmission with lower jitter than the intrinsic resolution of the HR-TDC. Sub-nanosecond synchronization accuracy is also necessary for software-based timing coincidence in event reconstruction. In this article, we describe the clock synchronization protocol and the Str-HRTDC.

## II. DESIGN OF CLOCK SYNCHRONIZATION PROTOCOL

The clock distribution network forms a tree structure starting from a clock-root module, i.e., the module at the top of the tree. All FPGAs on the other modules adjust their internal clocks to that of the root. This includes not only the main FPGAs directly connected by optical fiber or metal cable, but also FPGAs interconnected by general IO pins on the same board or on daughter boards, and we aim to synchronize all of them using the same technique. In previous work [3], we developed a clock signal distribution method called the MIKUMARI link, which provides clock frequency synchronization and a communication protocol between two FPGAs. The MIKUMARI link technology is based on clock-duty-cycle modulation (CDCM) [5], and it achieved sufficiently low-jitter clock signal transmission, around 7 ps in  $\sigma$ , even when using the mixed-mode clock manager (MMCM) [10] in the FPGA for clock recovery. In addition, because the MIKUMARI link is implemented using the AMD Xilinx IDELAYE2 and IOSERDESE2 primitives [6], it is suitable for synchronizing FPGAs interconnected with general IOs. Thus, the MIKUMARI link is adopted as the link layer protocol, and a clock synchronization protocol, the local area common clock protocol (LACCP), is developed on top of it.

In this work, CDCM-8-1.5 modulation is used to make the frequency ratio of the system clock signal to the sampling clock signal a multiple of 2. For the naming convention of CDCM, see Ref. [5].

### *A. Timestamp*

First, we define the time frame structure of this protocol. We adopt and modify the heartbeat method [7], which determines the time frame boundaries by a periodic signal, called the heartbeat, given by the carry bit of a 16-bit local counter. In the previous work [7], the time frame definition depended on the J-PARC slow-extraction cycle; we modified it to be independent of the external environment. The system (reference) clock frequency driving this counter is 125 MHz because a 500 MHz clock signal is necessary to implement the TDCs, as described later. Then, the length of a time frame is  $2^{16} \times 8$  ns, i.e., around 524 µs. This time frame is called the heartbeat frame. To provide a unique timestamp during a typical DAQ run, an additional 24-bit frame number is assigned. Thus, a timestamp with 8 ns precision is defined over around 2.4 hours, and the TDCs interpolate finer timing information. The unit that defines the timestamp is called the heartbeat unit.
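The frame arithmetic above can be checked with a short Python sketch. It uses only the numbers stated in the text (125 MHz clock, 16-bit counter, 24-bit frame number); packing the frame number above the counter into one coarse count is an illustrative assumption rather than the actual data format, and the function names are hypothetical.

```python
CLOCK_PERIOD_NS = 8          # 125 MHz system clock
COUNTER_BITS = 16            # heartbeat counter width
FRAME_NUMBER_BITS = 24       # frame number width


def frame_length_ns() -> int:
    """One heartbeat frame spans 2**16 system clock periods."""
    return (1 << COUNTER_BITS) * CLOCK_PERIOD_NS


def timestamp_span_hours() -> float:
    """Time until the (frame number, counter) pair wraps around."""
    return (1 << FRAME_NUMBER_BITS) * frame_length_ns() * 1e-9 / 3600


def coarse_time_ns(frame_number: int, counter: int) -> int:
    """Coarse timestamp with 8 ns precision (hypothetical packing)."""
    if not 0 <= frame_number < (1 << FRAME_NUMBER_BITS):
        raise ValueError("frame number out of range")
    if not 0 <= counter < (1 << COUNTER_BITS):
        raise ValueError("counter out of range")
    return ((frame_number << COUNTER_BITS) | counter) * CLOCK_PERIOD_NS


print(frame_length_ns())                  # 524288 ns, about 524 us
print(f"{timestamp_span_hours():.2f} h")  # 2.44 h
```

The 2.44 h span is the "around 2.4 hours" quoted in the text.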

### *B. Synchronization Scheme*

Consider the synchronization between two FPGAs. In general, clock synchronization is achieved by obtaining the offset of the client's clock from four timestamps; however, since the purpose of this protocol is to synchronize the heartbeat timing and frame numbers, the offset estimation can be simplified. The heartbeat signal transferred from the primary indicates that its 16-bit counter has become 0. For signal transmission with fixed latency, the pulse transfer function of the MIKUMARI link is used. The secondary FPGA adjusts its counter when it receives the heartbeat signal, and it needs to know the offset value to cancel



Fig. 1. Block diagram of the physical layer of the MIKUMARI link. D and dt denote the constant and relative delays of each component, respectively.

the transmission delay. This is done by measuring the round-trip time  $(T_{\rm rt})$  by sending a pulse from the secondary side;  $T_{\rm rt}/2$  is then the offset value. Thus, the heartbeat timing is roughly adjusted with 8 ns precision. To achieve synchronization with sub-nanosecond accuracy, it is further necessary to estimate a fine offset corresponding to the phase difference between the reference and recovered clock signals. The measured round-trip time, in units of the system clock period, is expected to be even but can be odd depending on the phase relationship between the two clock signals. If the round-trip time is odd, the fractional part of  $T_{\rm rt}/2$ , i.e., half a system clock period, is added to the fine offset.
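The coarse part of this procedure can be sketched as follows, assuming  $T_{\rm rt}$  is available as an integer number of 8 ns system clock cycles. The function name is hypothetical; the sketch only splits  $T_{\rm rt}/2$  into a whole-cycle counter offset and the half-period remainder that is passed on to the fine offset.

```python
from typing import Tuple

CLOCK_PERIOD_PS = 8000  # 125 MHz system clock period


def split_round_trip(t_rt_cycles: int) -> Tuple[int, int]:
    """Split T_rt/2 into (coarse offset in cycles, remainder in ps).

    The remainder is half a clock period when T_rt is odd and zero
    otherwise; it is added to the fine offset.
    """
    if t_rt_cycles < 0:
        raise ValueError("round-trip time must be non-negative")
    coarse_cycles, odd = divmod(t_rt_cycles, 2)
    return coarse_cycles, odd * CLOCK_PERIOD_PS // 2


print(split_round_trip(10))  # (5, 0)
print(split_round_trip(11))  # (5, 4000)
```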

Frame number synchronization is straightforward because the heartbeat frame is much longer than typical transmission delays. Global frame numbers leaving the root FPGA at the heartbeat timing are propagated to all secondary FPGAs before the next heartbeat timing.

### *C. Fine Offset Estimation*

The phase can be measured using the IDELAYE2 and ISERDESE2 primitives if a precision of around 100 ps is sufficient. The physical layer of the MIKUMARI link is shown in Fig. 1. The transmission delay from OSERDES to ISERDES is expressed as

$$
\delta = D_{\text{osd}} + D_{\text{cable}} + D_{\text{idelay}} + D_{\text{isd}},\tag{1}
$$

where  $D_{\text{osd}}$ ,  $D_{\text{cable}}$ ,  $D_{\text{idelay}}$ , and  $D_{\text{isd}}$  represent the delays of OSERDESE2 [6], the cable, IDELAYE2, and ISERDESE2, respectively. However,  $\delta$  is usually not a multiple of the system clock period, i.e., there is a phase difference between the incoming modulated clock signal and the system clock signal. Therefore, an additional delay,  $dt$ , is necessary; it is the sum of  $d_{\text{idelay}}$  and  $d_{\text{iserdes}}$ , which come from the IDELAY delay taps and the bitslip function of ISERDES, respectively. Note that  $d_{\text{iserdes}}$  can be negative as well as positive because it is defined relative to the initial (pre-bitslip) state. The relationship between the number of performed bitslips and  $d_{\text{iserdes}}$  can be examined using the OSERDES on the TX side of the link. By connecting OSERDES to ISERDES within the same IOB through the OFB path, the time taken for a bit pattern input to OSERDES to appear at the ISERDES output can be checked while performing bitslip. Thus, the relationship between the number of performed bitslips and  $d_{\text{iserdes}}$  is obtained and stored during the link-up process. IDELAYE2 provides delays in 78 ps steps by default, and the  $d_{\text{iserdes}}$  step is 1 ns in this case because ISERDES is operated by 500 MHz and 125 MHz clock signals in double-data-rate mode. Thus, the round-trip time  $(T_{\rm rt})$  is expressed as

$$
T_{\rm rt} = 2\delta + dt + dt',\tag{2}
$$

where  $dt$  and  $dt'$  are the delays in the primary and secondary FPGAs, respectively. Therefore, the phase difference between the reference and recovered clock signals is given by  $(dt'-dt)/2$ . If  $T_{\rm rt}$  (in units of the system clock period) is odd, half the system clock period is added. In this way, the fine offset is obtained without any additional components.
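Eqs. (1) and (2) and the parity rule can be combined into a small Python sketch. It assumes a 200 MHz IDELAYCTRL reference clock, for which the 7-series IDELAYE2 tap is  $1/(32 \times 2 \times 200~\text{MHz}) = 78.125$  ps, and the 1 ns bitslip step stated above. The bitslip table stands in for the relationship measured with the OFB loopback; its values in the usage example, like the function names, are hypothetical.

```python
from typing import Dict

IDELAY_REF_HZ = 200e6                            # assumed IDELAYCTRL reference
IDELAY_TAP_PS = 1e12 / (32 * 2 * IDELAY_REF_HZ)  # 78.125 ps per tap
SERDES_BIT_PS = 1e12 / (2 * 500e6)               # DDR at 500 MHz: 1000 ps
CLOCK_PERIOD_PS = 8000.0                         # 125 MHz system clock


def local_delay_ps(idelay_taps: int, bitslips: int,
                   bitslip_table: Dict[int, int]) -> float:
    """dt = d_idelay + d_iserdes for one side of the link.

    bitslip_table maps the number of performed bitslips to d_iserdes in
    bit periods; entries may be negative (relative to the initial state).
    """
    d_idelay = idelay_taps * IDELAY_TAP_PS
    d_iserdes = bitslip_table[bitslips] * SERDES_BIT_PS
    return d_idelay + d_iserdes


def fine_offset_ps(dt_ps: float, dt_prime_ps: float, t_rt_cycles: int) -> float:
    """Phase of the recovered clock: (dt' - dt)/2, plus T/2 if T_rt is odd."""
    phase = (dt_prime_ps - dt_ps) / 2.0
    if t_rt_cycles % 2 == 1:
        phase += CLOCK_PERIOD_PS / 2.0
    return phase


table = {0: 0, 1: -1, 2: 2}              # hypothetical loopback calibration
dt = local_delay_ps(10, 1, table)        # primary side
dt_prime = local_delay_ps(30, 2, table)  # secondary side
print(dt, dt_prime, fine_offset_ps(dt, dt_prime, 11))
```

The sketch leaves the result unwrapped; bringing it back into one clock period is handled when offsets are accumulated along the chain.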

A key feature of this method is that it requires neither oscilloscope measurements nor software support. The MIKUMARI link automatically adjusts IDELAY and performs bitslip by checking the output data pattern during the link-up process, which corresponds to the measurement of  $dt$ .
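The text does not detail the link-up search, so the following Python sketch only illustrates the general idea under simplifying assumptions: the IDELAY tap is set to the center of the longest run of taps that capture data without error, and each bitslip is modelled as a one-bit rotation of an 8-bit deserialized word. The real ISERDESE2 bitslip in DDR mode does not behave as a simple one-bit rotation (see [6]). The training word and function names are hypothetical.

```python
from typing import List, Optional

WORD_BITS = 8  # 1:8 deserialization assumed


def rotate_left(word: int, n: int, width: int = WORD_BITS) -> int:
    """Model of n bitslips as an n-bit left rotation (simplification)."""
    n %= width
    mask = (1 << width) - 1
    return ((word << n) | (word >> (width - n))) & mask


def find_bitslip(captured: int, expected: int) -> Optional[int]:
    """Number of modelled bitslips that aligns the captured word."""
    for n in range(WORD_BITS):
        if rotate_left(captured, n) == expected:
            return n
    return None


def center_of_eye(stable: List[bool]) -> Optional[int]:
    """Center tap of the longest run of error-free IDELAY taps."""
    best_start, best_len, start = None, 0, None
    for tap, ok in enumerate(stable + [False]):  # sentinel closes last run
        if ok and start is None:
            start = tap
        elif not ok and start is not None:
            if tap - start > best_len:
                best_start, best_len = start, tap - start
            start = None
    return None if best_start is None else best_start + best_len // 2


TRAINING_WORD = 0xF0  # hypothetical training pattern
print(center_of_eye([False, True, True, True, False, True]))       # 2
print(find_bitslip(rotate_left(TRAINING_WORD, 3), TRAINING_WORD))  # 5
```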

### *D. Fine Offset Accumulation*

Clock synchronization is performed between the endpoints of each link. A secondary LACCP adjusts its clock with respect to the one upstream LACCP, since the MIKUMARI link defines a point-to-point communication rule. For example, consider three modules, M1, M2, and M3, connected in series with M1 at the root. First, the M2 clock is synchronized with M1. After M1-M2 clock synchronization is established, M3 is synchronized with M2. Thus, there are two local fine offsets, one for M1-M2 and one for M2-M3. The fine offset of M3 with respect to M1 is obtained by accumulating the local fine offsets. If the accumulated fine offset exceeds the system clock period, the offset value applied to the 16-bit local counter is corrected by  $\pm 1$ . In principle, there is no limit to the number of connection stages. Accumulation and offset correction are performed automatically in LACCP.
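A minimal Python sketch of the accumulation rule follows. It assumes that a carry is taken whenever the magnitude of the accumulated fine offset reaches one 8 ns period and that the counter correction takes the sign of the carry; the actual sign convention in LACCP is not specified here, and the function name is hypothetical. The usage example takes the  $OFS_{\text{average}}$  values of Table I as local offsets.

```python
from typing import List, Tuple

CLOCK_PERIOD_PS = 8000.0  # 125 MHz system clock


def accumulate(local_offsets_ps: List[float]) -> List[Tuple[int, float]]:
    """Accumulate per-link fine offsets along a chain starting at the root.

    Returns (counter_correction, fine_offset_ps) for each downstream stage.
    """
    results = []
    acc, correction = 0.0, 0
    for local in local_offsets_ps:
        acc += local
        while acc >= CLOCK_PERIOD_PS:   # carry into the 16-bit counter
            acc -= CLOCK_PERIOD_PS
            correction += 1
        while acc <= -CLOCK_PERIOD_PS:  # borrow from the 16-bit counter
            acc += CLOCK_PERIOD_PS
            correction -= 1
        results.append((correction, acc))
    return results


print(accumulate([-594, 2737, 3058]))  # M1-M2, M2-M3, M3-M4
print(accumulate([5000, 4000]))        # second stage carries into the counter
```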

## III. DESIGN OF TDC

The streaming TDC (Str-TDC) consists of the online data processing (ODP) block and the data merging (MGR) block, as shown in Fig. 2. The basic structure was designed in previous work [7] and has been modified in this work. The ODP block has two timing units, which measure the arrival times of the leading and trailing edges of the incoming signal, respectively. After the data paths are merged, the TDC fine timing data pass through a 2 µs delay buffer, which is used to wait for a trigger input. The Str-TDC works in trigger-less mode by default but also supports trigger input, because we expect that some experiments will use FEE that requires a hardware trigger or will not have sufficient computing power. Support for trigger input allows us to propose a staged approach to DAQ upgrades for such experiments. The 16-bit counter value from the heartbeat unit is combined with the fine timing before the delimiter inserter. The delimiter inserter puts special data, called the heartbeat delimiter, at the heartbeat timing as the boundary of the heartbeat frame. The frame number is embedded into the heartbeat delimiter. Thus,



Fig. 2. Block diagram of Str-TDC. Timestamp comes from the heartbeat unit and is combined with the TDC fine timing before the delimiter inserter. Delimiter is generated by the heartbeat signal.

the TDC value is expressed as the 16-bit counter value plus the TDC fine timing. By checking the delimiter data, users can obtain more macroscopic time information. Up to this point, the processing latency is fixed.
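How a consumer of the stream could turn (frame number, counter, fine timing) into an absolute time is sketched below in Python. The record types, field names, and the assumption that a delimiter closes the frame whose number it carries are all hypothetical; they are not the actual Str-TDC data format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple, Union

CLOCK_PERIOD_PS = 8000
COUNTER_BITS = 16


@dataclass
class Hit:
    channel: int
    counter: int      # 16-bit heartbeat counter value
    fine_ps: float    # TDC fine timing inside one system clock cycle


@dataclass
class Delimiter:
    frame_number: int  # heartbeat frame number carried by the delimiter


def absolute_times(stream: Iterable[Union[Hit, Delimiter]]) -> List[Tuple[int, float]]:
    """Return (channel, time_ps) for every hit in a closed frame."""
    pending: List[Hit] = []
    out = []
    for item in stream:
        if isinstance(item, Hit):
            pending.append(item)  # frame number unknown until the delimiter
            continue
        base = item.frame_number << COUNTER_BITS
        for hit in pending:
            out.append((hit.channel,
                        (base + hit.counter) * CLOCK_PERIOD_PS + hit.fine_ps))
        pending.clear()
    return out


print(absolute_times([Hit(0, 3, 100.0), Delimiter(1)]))
```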

Leading and trailing timing data are combined in the pairing unit, where the time-over-threshold (TOT) is calculated and embedded into the leading-edge TDC data. Trailing-edge data are discarded here to reduce the data rate. Data are sent to the MGR block after passing through the TOT filter unit.
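The pairing and TOT filtering steps can be sketched in Python for a single channel. The pairing rule (each leading edge takes the first trailing edge that follows it and precedes the next leading edge) and the filter window are assumptions for illustration; the actual pairing-unit and TOT-filter logic may differ.

```python
import math
from typing import List, Tuple


def pair_edges(leading: List[float], trailing: List[float],
               tot_min: float = 0.0, tot_max: float = math.inf
               ) -> List[Tuple[float, float]]:
    """Return (leading_time, tot) pairs; trailing edges are dropped.

    Both inputs are sorted edge times of one channel, in ns.
    """
    out = []
    j = 0
    for i, t_lead in enumerate(leading):
        next_lead = leading[i + 1] if i + 1 < len(leading) else math.inf
        while j < len(trailing) and trailing[j] <= t_lead:
            j += 1  # skip trailing edges without a preceding leading edge
        if j < len(trailing) and trailing[j] < next_lead:
            tot = trailing[j] - t_lead
            j += 1
            if tot_min <= tot <= tot_max:  # TOT filter
                out.append((t_lead, tot))
    return out


print(pair_edges([10.0, 50.0], [30.0, 80.0]))                # [(10.0, 20.0), (50.0, 30.0)]
print(pair_edges([10.0, 50.0], [30.0, 80.0], tot_min=25.0))  # [(50.0, 30.0)]
```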

The role of the MGR block is to collect data from the input channels and to regenerate the heartbeat frame containing them. It has two stages, called the front-merger unit and the back-merger unit. In front of the front-merger, there is a first-in first-out (FIFO) memory for each channel. The two-stage structure allows the front and back merger units to be implemented in two separate FPGAs. In addition, it buffers sudden increases in the input rate better than a one-stage structure. In the merger unit, incoming data from each channel are output in order of arrival; however, when the heartbeat delimiter is found in a channel, the unit stops reading from that channel. When delimiter data are found on all channels, it generates the delimiter data and resumes reading. Since data are expected to arrive randomly from the ODP block, long waiting times at frame boundaries should be rare. The merger unit is designed to sustain a throughput of around 8 Gbps (64 bits  $\times$  125 MHz) for both randomly and simultaneously incoming data.
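The merging rule described above can be modelled in Python as follows. Each per-channel FIFO is a deque of (arrival_time, payload) entries, and the marker string `HB` stands for the heartbeat delimiter; both are illustrative assumptions. In hardware the unit simply waits when a channel has not yet delivered its delimiter, whereas this offline sketch stops there.

```python
from collections import deque
from typing import Deque, List, Tuple

DELIM = "HB"  # stands for the heartbeat delimiter


def merge(fifos: List[Deque[Tuple[int, str]]]) -> List[str]:
    out: List[str] = []
    while any(fifos):
        # Heads of channels that have not reached the delimiter yet.
        ready = [(f[0][0], i) for i, f in enumerate(fifos)
                 if f and f[0][1] != DELIM]
        if ready:
            _, i = min(ready)                  # earliest arrival first
            out.append(fifos[i].popleft()[1])
        elif all(f and f[0][1] == DELIM for f in fifos):
            for f in fifos:                    # delimiter seen on all channels
                f.popleft()
            out.append(DELIM)                  # regenerate one delimiter
        else:
            break  # some channel is empty: hardware would wait here
    return out


ch0 = deque([(1, "a0"), (5, DELIM)])
ch1 = deque([(2, "b0"), (3, "b1"), (6, DELIM)])
print(merge([ch0, ch1]))  # ['a0', 'b0', 'b1', 'HB']
```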

Finally, data are transferred to a PC by the SiTCP [8] or SiTCP-XG [9] cores, which are hardware implementations of the transmission control protocol (TCP) for gigabit and 10-gigabit Ethernet, respectively. For user convenience, a general-purpose network protocol is used for data transfer. Since the throughput of the merger unit, 8 Gbps, is the bottleneck of this TDC, one can obtain the best performance by selecting



Fig. 3. Picture of AMANEQ and mezzanine cards. White and blue (dotted) lines represent the paths for TDC data and clock synchronization, respectively. The heart mark denotes that a heartbeat unit exists inside the FPGA.

SiTCP-XG; however, if the expected input data rate is sufficiently low, SiTCP can be used for convenience.

## IV. IMPLEMENTATION TO AMANEQ

In this section, we mainly describe the details of the Str-HRTDC implementation using the AMANEQ module, which is a general-purpose logic module. For more information about AMANEQ, see Ref. [3]. Before describing Str-HRTDC in detail, we briefly describe the clock-root and clock-hub modules used in Sec. V. They are implemented on the AMANEQ module with the clock-data-distributor (CDD) mezzanine card [3] mounted. The CDD mezzanine card is used for the primary side of the MIKUMARI link. The AMANEQ module has a mini-mezzanine port for the secondary side, labeled as the clock sync. port in Fig. 3.

On these mezzanine cards, buffer ICs are placed for translation between current mode logic (CML) and low-voltage differential signaling (LVDS). For CML-to-LVDS translation, Micrel SY58603UMG is used, whereas for LVDS-to-CML translation, SY58605UMG and PERICOM PI6C5922504 are used on the CDD and CRV cards, respectively, because the two cards were developed at different times. As the propagation delays of these two ICs differ by 210 ps, there is a systematic error of 105 ps in clock synchronization originating from the asymmetry in transmission delay. The part-to-part skews of the buffer ICs SY58603UMG, SY58605UMG, and PI6C5922504 are 100, 135, and 200 ps, respectively, which can also cause non-negligible asymmetry. In



Fig. 4. Block diagram of first part of tapped-delay-line. O and CO outputs are captured by FFs driven. The OR logic operation is made from outputs from three FFs. The calibration clock signal of 26.2144 MHz is connected to DI0 of the first CARRY4.



Fig. 5. Four phase regions of system clock signal.

this work, since we could not measure the propagation delay of each buffer IC part by part, we treat asymmetry in transmission delay as unknown. In Sec. V, the fine offset values obtained by LACCP are used in the discussion without software correction.
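The 105 ps figure follows directly from the round-trip method. Denoting the one-way delays from the primary to the secondary and back as  $\delta_{\rm fwd}$  and  $\delta_{\rm bwd}$  (notation introduced here only for this illustration, with  $dt$  and  $dt'$  omitted), LACCP takes the one-way delay to be  $T_{\rm rt}/2$ , so the residual error is half the asymmetry. Since both receive paths use the same SY58603UMG type, the asymmetry is the 210 ps difference between the two LVDS-to-CML translators:

$$
|\Delta| = \left|\delta_{\rm fwd} - \frac{\delta_{\rm fwd} + \delta_{\rm bwd}}{2}\right| = \frac{|\delta_{\rm fwd} - \delta_{\rm bwd}|}{2} = \frac{210~\text{ps}}{2} = 105~\text{ps}.
$$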

In addition, Str-LRTDC is also realized using AMANEQ. The MIKUMARI link, the secondary LACCP, and the entire streaming TDC block are implemented in the FPGA, an AMD Xilinx XC7K-160T-2, on AMANEQ. SiTCP is selected as the data link. The 1 ns TDC is realized with four 250 MHz clock signals with phases of 0, 90, 180, and 270 degrees.
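The 1 ns binning obtained from four phase-shifted 250 MHz clocks can be illustrated with a small Python model. It assumes that the input is sampled on the rising edge of each of the four clocks and that the hit time is taken as the first sampling edge at or after the signal edge; the actual Str-LRTDC decoding logic is not described here, and the function name is hypothetical.

```python
import math

PERIOD_NS = 4.0                 # 250 MHz
PHASES_DEG = (0, 90, 180, 270)  # four phase-shifted sampling clocks


def first_sampling_edge_ns(t_hit_ns: float) -> float:
    """Earliest rising edge among the four clocks at or after the hit."""
    cycle = math.floor(t_hit_ns / PERIOD_NS)
    candidates = []
    for c in (cycle, cycle + 1):
        for phase in PHASES_DEG:
            edge = c * PERIOD_NS + phase / 360.0 * PERIOD_NS
            if edge >= t_hit_ns:
                candidates.append(edge)
    return min(candidates)


for t in (4.0, 5.3, 7.9):
    print(t, first_sampling_edge_ns(t))
```

Because the four clocks together provide a sampling edge every 1 ns, the quantization step equals the 1 ns LSB target.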

### *A. Str-HRTDC*

AMANEQ has two mezzanine slots for functional extension. For Str-HRTDC, a mezzanine card with an AMD Xilinx Kintex-7 FPGA (XC7K-160T-1) is used to offload the tapped-delay-line-based TDC from the main FPGA on AMANEQ. A picture of AMANEQ and the mezzanine cards is shown in Fig. 3. The FPGA on the mezzanine card is connected to the main FPGA with 32 LVDS signal lines. The supply voltages for the

FPGA are generated by a series regulator, Analog Devices ADP1741ACPZ-R7, to reduce power supply noise. The input signals are first buffered by onsemi FIN1108MTDX. The number of input channels is 32.

Since the FPGAs are interconnected with several LVDS lines, the system clock signal is sent from the main FPGA, and clock recovery from the modulated clock signal is not performed in the FPGA on the mezzanine card. The 125 MHz and 500 MHz clock signals are generated by an MMCM in the FPGA because this mezzanine card does not have a clock jitter cleaner.

The tapped-delay-line (TDL) consisting of CARRY4 primitives [11] is formed in the timing unit. A chain of 192 CARRY4 primitives is used, which almost corresponds to the size of a clock region. The O and CO outputs of CARRY4 are connected to flip-flops (FFs) alternately, in the order O, CO, following Ref. [12], to reduce zero-width bins. The first part of the TDL is shown in Fig. 4. Since the 26.2144 MHz calibration clock signal is connected to the DI0 input of the first CARRY4 primitive, CO is used instead of O for the first output only. In TDL-based TDCs in FPGAs, a phenomenon called "bubbles," caused by non-uniform propagation along the TDL, is commonly observed, e.g., 000101111 for rising edge propagation. To avoid this, an OR operation is applied to the outputs of three FFs. This operation reduces the number of effective taps to 64. The average effective tap delay becomes longer, around 30 ps, which results in worse timing resolution than in the best case; however, since our goal is not ultra-high resolution, we adopted this method for its simplicity. At this point, the trailing-edge measurement is branched off by performing bit inversion.
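The bubble-suppression step can be illustrated in Python with the example code word given above. The sketch assumes non-overlapping groups of three FF outputs and a ones-counting encoder; the actual grouping and binary encoder in Str-HRTDC may differ, and the function names are hypothetical.

```python
from typing import List


def or_groups(bits: List[int], group: int = 3) -> List[int]:
    """OR non-overlapping groups of `group` adjacent FF outputs."""
    if len(bits) % group:
        raise ValueError("number of taps must be a multiple of the group size")
    return [int(any(bits[i:i + group])) for i in range(0, len(bits), group)]


def encode(bits: List[int], trailing: bool = False) -> int:
    """Edge position as the number of asserted effective taps.

    For the trailing edge the code word is bit-inverted first.
    """
    if trailing:
        bits = [1 - b for b in bits]
    return sum(or_groups(bits))


raw = [0, 0, 0, 1, 0, 1, 1, 1, 1]  # bubble at the fifth tap
print(or_groups(raw))              # [0, 1, 1]: bubble removed
print(encode(raw))                 # 2
print(encode([1 - b for b in raw], trailing=True))  # 2
```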

After binary encoding, the data cross the clock domain from 500 MHz to 125 MHz, and thus two additional bits are appended. To compensate for non-uniform tap propagation delays, a calibration look-up table (LUT) is commonly required for TDL-based FPGA TDCs. Here, we found that the calibration results differ among the four phase regions illustrated in Fig. 5, i.e., even for the same tap, the amount of delay varies. This effect probably originates from ground or power supply voltage noise generated by the system clock signal, as discussed in [13]. Therefore, a LUT covering all 4  $\times$  64 patterns is prepared to compensate for this effect.
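A common way to build such a LUT is the code-density test: hits from a source uncorrelated with the TDC clock, such as the 26.2144 MHz calibration clock, are histogrammed, and each bin width is taken as proportional to its count. The Python sketch below applies this separately to each of the four phase regions. The use of code density, the interpretation of the two extra bits as the 2 ns phase-region index, and the function names are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

N_TAPS = 64          # effective taps after the OR operation
N_REGIONS = 4        # phase regions within one 8 ns system clock cycle
REGION_PS = 2000.0   # one 500 MHz period


def build_lut(hits: Iterable[Tuple[int, int]]) -> List[List[float]]:
    """hits: (region, tap) pairs; returns lut[region][tap] = bin center in ps."""
    counts = [[0] * N_TAPS for _ in range(N_REGIONS)]
    for region, tap in hits:
        counts[region][tap] += 1
    lut = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("no calibration hits in a phase region")
        edge, centers = 0.0, []
        for count in row:
            width = REGION_PS * count / total  # code-density bin width
            centers.append(edge + width / 2)
            edge += width
        lut.append(centers)
    return lut


def fine_time_ps(lut: List[List[float]], region: int, tap: int) -> float:
    """Fine time within the 8 ns cycle (direction convention assumed)."""
    return region * REGION_PS + lut[region][tap]


uniform = [(r, t) for r in range(N_REGIONS) for t in range(N_TAPS)]
lut = build_lut(uniform)
print(lut[0][0], fine_time_ps(lut, 1, 63))  # 15.625 3984.375
```

Keeping one histogram per phase region is what lets the same tap map to different delays, as observed in Fig. 5.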

In the FPGA on the mezzanine card, the Str-TDC components up to the front-merger are implemented. The heartbeat unit, synchronized by LACCP, also exists in this FPGA. The MIKUMARI link runs between the FPGA on the mezzanine card and the main FPGA on AMANEQ. Thus, the heartbeat frame is defined in the mezzanine card.

The transfer speed for TDC data from the mezzanine is 8 Gbps, which is equal to the internal data bandwidth in FPGA. The back-merger unit implemented in the main FPGA collects data from both mezzanine cards. Finally, data are sent to a PC by SiTCP-XG via 10-gigabit Ethernet.

## V. RESULTS AND DISCUSSION

### *A. Synchronization Test*

One clock-root and three clock-hub modules were connected with multi-mode optical fibers in series to measure


Fig. 6. Heartbeat signals from four modules. The logic is the NIM standard. Oscilloscope channels 1, 2, 3, and 4 correspond to M1, M2, M3, and M4, respectively.  $OFS_{\text{osc}}$  is the phase difference with respect to the upstream module.



Fig. 7. Distribution of local fine offset between M1 and M2.

the synchronization accuracy as seen from the last clock-hub module. These four modules are labeled M1, M2, M3, and M4. The fiber lengths between M1-M2, M2-M3, and M3-M4 are 100, 10, and 3 m, respectively. Fig. 6 shows the heartbeat signal from each module. Note that the logic is the nuclear instrument module (NIM) standard, in which the falling edge is the leading edge. Since the heartbeat signal is driven by the system clock signal, the difference in leading edge positions comes from the clock signal phase difference. The phase differences measured by the oscilloscope ( $OFS_{\text{osc}}$ ) are summarized in Table I. At this time, the fine offset value measured by LACCP between M1 and M4 was 5311 ps. Although this is close to the value of 5280 ps measured by the oscilloscope (the sum of the  $OFS_{\text{osc}}$  values in Table I), the agreement is a coincidence. Since the precision of the  $dt$  measurement is 78 ps, the synchronization accuracy of a single link-up is 100 to 200 ps, and a different fine offset will be obtained in each link-up.

TABLE I SUMMARY OF OFFSET MEASUREMENTS.

| Path    | $OFS_{\text{osc}}$ (ps) | $OFS_{\text{average}}$ (ps) | $OFS_{\text{r.m.s.}}$ (ps) |
|---------|-------------------------|-----------------------------|----------------------------|
| M1-M2   | $-601$                  | $-594$                      | 24.72                      |
| M2-M3   | 2827                    | 2737                        | 10.6                       |
| M3-M4   | 3054                    | 3058                        | 19.0                       |

Fig. 8. Block diagram of the test bench. Pulses from the same pulse generator are input to the HR-TDC mezzanine cards simultaneously.

Fig. 9. Typical TDC distribution measured by two Str-HRTDCs.

By repeating the link-up process of the MIKUMARI link in M2 1000 times, the distribution of the local fine offset with respect to M1 is obtained, as shown in Fig. 7. The offset values are distributed over several bins, which indicates that clock synchronization is not deterministic across power cycles. The same procedure is performed for M3 and M4, and the obtained averages ( $OFS_{\text{average}}$ ) and root-mean-square values ( $OFS_{\text{r.m.s.}}$ ) are summarized in Table I. The averaged local fine offsets agree well with the oscilloscope measurements except in the M2-M3 case, which differs by 90 ps from the oscilloscope value, a larger discrepancy than those of M1-M2 and M3-M4. We interpret this as coming from the part-to-part skew of the buffer ICs. In addition, another buffer IC, Texas Instruments SN65CML100D, is placed between the FPGA and the oscilloscope to output the NIM-level logic signal. We selected AMANEQ as the first target of our implementation, but a dedicated module without buffer ICs for clock distribution is needed to study the synchronization accuracy further. Nevertheless, the obtained results suggest that LACCP has the potential to achieve a synchronization accuracy better than 100 ps. If LACCP had a function to repeat the link-up process at module startup and obtain an average offset value, the offset could be estimated more accurately. As the obtained standard deviations are sufficiently small, clock synchronization will be deterministic if the average offset values are used in LACCP.



Fig. 10. Mean position (a) and timing resolution (b) of TDC distribution as a function of cable length.

### *B. Str-HRTDC Evaluation*

To evaluate the cable length dependence of the synchronization accuracy and the timing resolution, the test bench is configured as shown in Fig. 8. The four modules are again labeled M1, M2, M3, and M4. As HR-TDC mezzanine cards are mounted on AMANEQ, there are six FPGAs in total in this system. The pulse from the pulse generator is split and measured by two Str-HRTDCs while the fiber length between M1 and M4 is changed. To eliminate differences between channels, pulses are input to the same channel on the two cards. The timing offset coming from the difference in input cable lengths is measured and corrected in the analysis. When the cable of M4 is changed, the conditions of M2 and M3 are kept unchanged. Fig. 9 shows a typical timing distribution measured by the Str-HRTDCs. The mean positions of the distributions as a function of cable length are plotted in Fig. 10 (a). As described in Sec. V-A, the synchronization accuracy is around 100 to 200 ps and is not deterministic; the obtained mean values are indeed distributed over a range of around 300 ps. In addition, the data points are distributed not around 0 but around 130 ps on average. This tendency does not change when we repeat the measurement several times, and we interpret it as the systematic error due to the part-to-part skew. Despite the variation, the absence of cable length dependence indicates that the  $T_{\text{rt}}$  measurement and the fine offset estimation are working correctly. Thus, sub-nanosecond clock synchronization is achieved.

The timing resolutions as a function of cable length are plotted in Fig. 10 (b). The timing resolution also shows no cable length dependence, which indicates that the jitter degradation with cable length is sufficiently small. The average timing resolution is 23.1 ps, which satisfies our requirement. We cross-check this result as follows. By measuring two pulses with the same TDC, the TDC intrinsic resolution is extracted because the clock signal jitter cancels out. The obtained intrinsic resolution is 19.5 ps in  $\sigma$ . In the previous work [3], the clock jitter of the MMCM was found to be 7.7 ps for 125 MHz by time interval error measurement. In addition, a random jitter of 3.7 ps is added per clock recovery by the CDCE62002. In the test bench setup, clock recovery is performed three times on AMANEQs, and the clock signals are generated by MMCMs on the two mezzanine cards. Thus, the expected timing resolution in this test is roughly represented as

$$
\sigma_{\rm res} = \sqrt{19.5^2 + 3 \times 3.7^2 + 2 \times 7.7^2} \approx 23.2~\text{ps}. \tag{3}
$$

Here, we assumed that the jitter of the 500 MHz clock signal generated by the MMCM is the same as that of the 125 MHz clock signal. This estimate is consistent with the measured value. The timing resolution is mainly determined by the performance of the tapped-delay-line, and clock recovery by the CDCE62002 is the smallest contribution.
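Eq. (3) can be evaluated, and its contributions compared, with a few lines of Python. The inputs are the values quoted above (19.5 ps intrinsic resolution, 3.7 ps per CDCE62002 clock recovery applied three times, 7.7 ps per MMCM applied twice), combined in quadrature under the assumption of independent Gaussian contributions; the function name is hypothetical.

```python
import math


def quadrature_sum(intrinsic_ps: float, cdce_ps: float, n_cdce: int,
                   mmcm_ps: float, n_mmcm: int) -> float:
    """Expected resolution assuming independent Gaussian contributions."""
    return math.sqrt(intrinsic_ps ** 2 + n_cdce * cdce_ps ** 2
                     + n_mmcm * mmcm_ps ** 2)


sigma = quadrature_sum(19.5, 3.7, 3, 7.7, 2)
print(f"expected resolution: {sigma:.1f} ps")  # 23.2 ps
print(f"TDL: {19.5 ** 2:.1f}, CDCE62002: {3 * 3.7 ** 2:.1f}, "
      f"MMCM: {2 * 7.7 ** 2:.1f} (ps^2)")
```

The squared terms show the TDL dominating and the CDCE62002 term being the smallest, in line with the discussion above.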

## VI. SUMMARY

We aim to develop a general-purpose trigger-less DAQ system for particle and nuclear physics experiments in Japan under the SPADI alliance. The clock synchronization protocol called LACCP is developed as the upper-layer protocol of the MIKUMARI link technology. LACCP clock synchronization is based on measurements of the round-trip time and the clock signal phase difference using the IDELAYE2 and ISERDESE2 primitives. The obtained synchronization accuracy is around 300 ps; however, there is room to improve it to better than 100 ps. Using LACCP, we developed data-streaming high-resolution and low-resolution TDCs. Str-HRTDC consists of the AMANEQ module and its mezzanine card. The TDL-based TDC using CARRY4 primitives is implemented in the AMD Xilinx Kintex-7 FPGA on the mezzanine card. The obtained timing resolution is 23.1 ps in  $\sigma$  and has no cable length dependence. The obtained clock synchronization accuracy and timing resolution satisfy our requirements.

As future prospects, we plan to improve the MIKUMARI link and LACCP functionalities. As mentioned in Sec. V-A, there is room for improvement in clock synchronization accuracy: a function to repeat the MIKUMARI link-up process after module start-up to obtain the average fine offset should be added. In this work, a static phase compensation method is developed. A function to compensate for long-term variations caused by temperature changes and other factors is also necessary. Since the fine offset estimation is based on IDELAY adjustment in the link-up process, variations after link-up cannot be measured by the same method. For dynamic phase compensation, the leading edge timing of the incoming modulated clock signal must be measured continuously by another method. Relative phase compensation methods using the digital dual mixer time difference (DDMTD) technique and TDL-based TDCs have been reported [14], [15], and these will be tested.

We are considering implementing MIKUMARI and LACCP on AMD Xilinx UltraScale+ FPGAs, where IOSERDES and IDELAY have been upgraded to the E3 primitives [16] with significantly different functions. We plan to modify the MIKUMARI link functionality to accommodate these changes.

## REFERENCES

- [1] *SPADI alliance* [Online]. Available: https://www.rcnp.osakau.ac.jp/ spadi/ Accessed on: May 19, 2024.
- [2] K. H. Tanaka, Hadron Beam Channel Group, and Hadron Facility Construction Team, "Construction and Status of the Hadron Experimental Hall," *Nucl. Phys. A*, vol. 835, pp. 81–87, Apr. 2010, 10.1016/j.nuclphysa.2010.01.178.
- [3] R. Honda, "New Clock Distribution System Based On Clock-Duty-Cycle-Modulation For Distributed Data-Acquisition System," *IEEE Trans. Nucl. Sci.*, vol. 70, no. 6, pp. 1102–1109, Apr. 2023, 10.1109/TNS.2023.3265698.
- [4] H. Noumi, Y. Morino, T. Nakano, K. Shirotori, Y. Sugaya, T. Yamaga *et al.*, (2006). "Charmed Baryon Spectroscopy via the (π, D∗−) reaction", *J-PARC E50 Proposal.* [Online]. Available: http://jparc.jp/researcher/Hadron/en/pac\_1301/pdf/P50\_2012-19.pdf. Accessed on: May 19, 2024.
- [5] D. Calvet, "Clock-Centric Serial Links for the Synchronization of Distributed Readout Systems," *IEEE Trans. Nucl. Sci.*, vol. 67, no. 8, pp. 1912–1919, Aug. 2020, 10.1109/TNS.2020.3006698.
- [6] "7 Series FPGAs SelectIO Resources," *AMD Xilinx User Guide*. [Online]. Available: https://docs.amd.com/v/u/en-US/ug471_7Series_SelectIO Accessed on: May 19, 2024.
- [7] R. Honda, T. Aramaki, H. Asano, T. Akaishi, W. C. Chang, Y. Igarashi *et al.*, "Continuous timing measurement using a data-streaming DAQ system," *PTEP*, vol. 2021, no. 12, p. 123H01, Oct. 2021, 10.1093/ptep/ptab128.
- [8] T. Uchida, "Hardware-based TCP processor for gigabit Ethernet," *IEEE Trans. Nucl. Sci.*, vol. 55, no. 3, pp. 1631–1637, Jun. 2008, 10.1109/TNS.2008.920264.
- [9] *Bee Beans Technologies SiTCP-XG*. [Online]. Available: https://github.com/BeeBeansTechnologies/SiTCPXG\_Netlist\_for\_Kintex7 Accessed on: May 19, 2024.
- [10] "7 Series FPGAs Clocking Resources," *AMD Xilinx User Guide*. [Online]. Available: https://docs.amd.com/v/u/en-US/ug472_7Series_Clocking Accessed on: May 20, 2024.
- [11] "7 Series FPGAs Configurable Logic Block," *AMD Xilinx User Guide*. [Online]. Available: https://docs.amd.com/v/u/en-US/ug474_7Series_CLB Accessed on: May 20, 2024.
- [12] J. Y. Won and J. S. Lee, "Time-to-Digital Converter Using a Tuned-Delay Line Evaluated in 28-, 40-, and 45-nm FPGAs," *IEEE Trans. Instrum. Meas.*, vol. 65, no. 7, pp. 1678–1689, Jul. 2016, 10.1109/TIM.2016.2534670.
- [13] C. Liu and Y. Wang, "A 128-Channel, 710 M Samples/Second, and Less Than 10 ps RMS Resolution Time-to-Digital Converter Implemented in a Kintex-7 FPGA," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 3, pp. 773–783, Jun. 2015, 10.1109/TNS.2015.2421319.
- [14] E. Mendes, S. Baron, J. Hegeman, J. Troska, and N. Loukas, "TCLink: A Fully Integrated Open Core for Timing Compensation in FPGA-Based High-Speed Links," *IEEE Trans. Nucl. Sci.*, vol. 70, no. 2, pp. 156–163, Feb. 2023, 10.1109/TNS.2023.3240539.
- [15] H. B. Xie, Y. Li, Q. Shen, S. K. Liao, and C. Z. Peng, "A High-Precision 2.5ps RMS Time Synchronization for Multiple High-Speed Transceivers in FPGA," *IEEE Trans. Nucl. Sci.*, vol. 66, no. 7, pp. 1070–1075, Jul. 2019, 10.1109/TNS.2019.2904703.
- [16] "UltraScale Architecture SelectIO Resources," *AMD Xilinx User Guide*. [Online]. Available: https://docs.amd.com/r/en-US/ug571-ultrascale-selectio Accessed on: May 23, 2024.