On 12/19/2023, we updated dCache from 9.2.4 to 9.2.7, and also took the opportunity to update the BIOS firmware and system kernel and reboot all the storage nodes. The whole process went very smoothly and caused only about 30 minutes of downtime.

On 12/20/2023, in the morning, we noticed that the number of running job slots started to ramp down (from 17.5K to 10.5K; about 40% of the job slots were missing from the ATLAS monitoring plot), even though the HTCondor cluster was fully utilized (99% of the job slots were claimed). We put together a script and ran it as a cron job to find all the zombie jobs (jobs whose status is failed but whose pilot is still running), as sketched below. It cleaned up most of the zombie jobs, but we still see about a 5% discrepancy in job slots.
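For illustration only, the sketch below shows the general shape of such a cron-driven cleanup against HTCondor. It is not our actual script: it approximates a "zombie" as a running pilot whose wall time exceeds a configurable cutoff (a hypothetical criterion; the real check compares the failed payload status against the still-running pilot), and the cutoff value is an assumption.

```python
#!/usr/bin/env python3
"""Cron-driven cleanup of zombie pilot jobs (minimal sketch, not the production script).

Hypothetical criterion: a "zombie" is approximated here as a pilot job still in
the Running state well past a wall-time cutoff. The real criterion -- payload
reported as failed while the pilot keeps holding the slot -- would need a check
against the workload management system instead.
"""
import json
import subprocess
import time

WALLTIME_CUTOFF = 3 * 24 * 3600  # hypothetical cutoff: 3 days


def running_jobs():
    """Return the ClassAds of all currently running jobs as a list of dicts."""
    out = subprocess.run(
        ["condor_q", "-allusers", "-constraint", "JobStatus == 2",
         "-json", "-attributes", "ClusterId,ProcId,JobStartDate"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out) if out.strip() else []


def main():
    now = time.time()
    for ad in running_jobs():
        started = ad.get("JobStartDate")
        if started is None:
            continue
        if now - started > WALLTIME_CUTOFF:
            job_id = f"{ad['ClusterId']}.{ad['ProcId']}"
            # Remove the stale pilot so the slot is released back to HTCondor.
            subprocess.run(["condor_rm", job_id], check=False)


if __name__ == "__main__":
    main()
```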

On 12/27/2023, we enabled and verified token-based storage access on the dCache system.
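As a rough illustration of the kind of verification involved, token-based access against a dCache WebDAV door can be checked by issuing a request that carries a bearer token. The sketch below is not our actual test: the door URL, namespace path, and token source are placeholders.

```python
#!/usr/bin/env python3
"""Quick check of token-based access to a dCache WebDAV door (illustrative sketch).

The door URL, path, and token source are placeholders, not the actual AGLT2 setup.
"""
import os
import urllib.request

DOOR = "https://dcache.example.org:2880"    # hypothetical WebDAV door
PATH = "/pnfs/example.org/data/atlas/test"  # hypothetical namespace path
TOKEN = os.environ["BEARER_TOKEN"]          # token obtained separately from an issuer

req = urllib.request.Request(
    f"{DOOR}{PATH}",
    method="PROPFIND",  # WebDAV listing of the path
    headers={"Authorization": f"Bearer {TOKEN}", "Depth": "1"},
)
with urllib.request.urlopen(req) as resp:
    # A 207 Multi-Status response means the door accepted the token and listed the path.
    print(resp.status)
```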

On 12/29/2023, we received a GGUS ticket about transfer failures with AGLT2 as the source. It turned out that one pool node (umfs24) had filesystem errors and a full /var area. We fixed the issue in the morning, and the transfer efficiency went back to normal.