Incidents:

We had two incidents with dCache. On July 8th, the PostgreSQL partition on the head node was filled up by the billing database, and it took us more than 24 hours over the weekend to recover. We are planning to rebuild one of the R6525 worker nodes, fitted with larger NVMe cards, as the new head node so that it can host a bigger PostgreSQL partition (6 TB vs. 1 TB).

The second incident occurred on July 19th: all pools on two dCache nodes went offline, which caused some transfer failures. Restarting the pools fixed the issue.

System update:

We updated HTCondor from 9.0.17 to 10.0.5 and took the opportunity to apply firmware and kernel updates, which required system reboots. We ran into a token issue because in HTCondor 10 the default value of TRUST_DOMAIN changed to $(UID_DOMAIN), and the tokens used for daemon authentication must have been issued with the same TRUST_DOMAIN that the daemons are configured with. Our fix was to set TRUST_DOMAIN back to its old value.
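
To illustrate the token mismatch described above, here is a minimal sketch of how one might check which trust domain an existing token was issued under. It assumes the tokens are standard JWTs whose "iss" claim holds the trust domain; the function names and the example domain values are hypothetical and not taken from our setup. The signature is not verified, so this is only a diagnostic aid.

```python
import base64
import json


def token_issuer(token: str) -> str:
    """Return the 'iss' claim of a JWT without verifying its signature."""
    # A JWT has three dot-separated segments: header.payload.signature
    payload_b64 = token.strip().split(".")[1]
    # Segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["iss"]


def issuer_matches(token: str, trust_domain: str) -> bool:
    """True if the token's issuer equals the configured trust domain."""
    return token_issuer(token) == trust_domain
```

If issuer_matches() returns False for the value the daemons currently use as TRUST_DOMAIN, the token predates the change, and either the token has to be reissued or TRUST_DOMAIN has to be set back to the issuer value, which is the route we took.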