SWT2_CPB:
Actively communicating with OSG experts and the PanDA team concerning a bug in Condor: jobs are getting stuck in the Condor queue after completion. Recent discussions suggest this may be caused by Harvester repeatedly losing contact with certain jobs (we are testing different settings for condor-ce-routing on the test cluster). Separately, the primary reason our site was not filling up properly with jobs was a CRIC setting, which Fred is helping us with.
Implemented a second CE as a backup and to allow downtime maintenance of our main CE whenever needed. This second CE is operating with a maximum job limit of 100.
Performed tests of our EL9 storage on the test module, using an environment similar to the production cluster.
Coordinating with DDM Ops to add a second RSE and other components, so we can use our test cluster separately to simulate production when staging changes. They added the RSE on Monday (5/26/2025), so we are close to having this completed.
Our internal monitoring for Slurm and both CEs is now in place (thanks to Judith for the links and help).
For EL9, we developed other appliances but are waiting to test them in the test cluster before implementing them.
Concerning the GGUS ticket to enable network monitoring, I sent a follow-up message to multiple members of campus networking last week. They said they would discuss it that same week and move forward on our request. Waiting for a follow-up.
OU: