SWT2_CPB:
Operations
We experienced a significant drain on 4/3/2025. Investigations are still ongoing, but through local investigation and help from others we have been gathering information, and our site has filled back up. We noticed a significant drop in multicore jobs during this period.
We increased our Slurm max job limit from 10000 to 12000.
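For reference, a minimal sketch of how we could verify the controller picked up the new limit, assuming the limit in question is MaxJobCount in slurm.conf (the actual parameter changed on our system may differ, e.g. a QOS MaxJobs limit):

    #!/usr/bin/env python3
    # Hedged sketch: confirm the Slurm controller reports the raised job limit.
    # Assumes the limit referred to is MaxJobCount; adjust if another
    # parameter was what was actually changed.
    import subprocess

    EXPECTED = 12000

    out = subprocess.run(["scontrol", "show", "config"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("MaxJobCount"):
            value = int(line.split("=")[1].strip())
            print(f"MaxJobCount = {value} (expected {EXPECTED})")
            break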
We have created a second CE but have not yet put it into production.
Timo and Rod have helped us configure both SWT2_CPB_TEST and SWT2_CPB to use gk10 and to enable submission of 16-core jobs.
We have reached out to experts and shared request logs from our CE for review.
Timo has helped investigate this and found errors in the apfmon logs showing "The job's remote status is unknown… known again". It is still unclear whether this is a central issue or a bug in our version of HTCondor-CE, but it appears to be some kind of handshake/status problem.
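As a quick local check, a minimal sketch that counts how often those two messages appear in a GridManager-style log; the log path below is a guess and the real file on our CE may live elsewhere:

    #!/usr/bin/env python3
    # Hedged sketch: count the "remote status is unknown" / "known again"
    # messages reported by Timo. LOG is a hypothetical path.
    from collections import Counter

    LOG = "/var/log/condor/GridmanagerLog"   # hypothetical location

    counts = Counter()
    with open(LOG, errors="replace") as fh:
        for line in fh:
            if "remote status is unknown" in line:
                counts["unknown"] += 1
            elif "remote status is known again" in line:
                counts["known again"] += 1

    print(counts)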
We have been focused on determining why this issue occurred and how to prevent it in the future.
We are now draining SWT2_CPB_TEST so that jobs run only on the SWT2_CPB queue.
We recently experienced a spike in errors due to jobs hitting the 2-day walltime limit on our CE. We are discussing changes to these limits.
Last update from the ADC Ops meeting:
Request that all sites move to at least 96 h maxwalltime.
The ATLAS VO card includes a 5760-minute walltime limit (= 96 hours).
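For the record, the unit arithmetic behind the numbers above as a small sketch (48 h is our current 2-day CE limit; 5760 min is the VO card value):

    #!/usr/bin/env python3
    # Sketch of the walltime comparison: current 2-day CE limit vs. the
    # 96 h minimum requested at the ADC Ops meeting (5760 min on the VO card).
    current_limit_h = 48            # 2-day limit on our CE
    requested_min_h = 96            # ADC request: at least 96 h maxwalltime
    vo_card_min = 5760              # ATLAS VO card walltime limit, in minutes

    assert vo_card_min / 60 == requested_min_h   # 5760 min == 96 h
    print(f"Current limit: {current_limit_h} h; requested minimum: {requested_min_h} h")
    print("Meets request" if current_limit_h >= requested_min_h else
          f"Need to raise limit by {requested_min_h - current_limit_h} h")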
Monitoring
We are working on improved monitoring of our site, to include additional information from our Slurm and CE servers.
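A minimal sketch of the kind of collector we have in mind, assuming we pull job-state counts from Slurm with squeue; what we gather from the CE and where the output ultimately goes are still to be decided:

    #!/usr/bin/env python3
    # Hedged sketch of a periodic collector for extra Slurm/CE metrics.
    # Assumes squeue is available on the monitoring host; the output sink
    # (print to stdout here) is a placeholder.
    import json
    import subprocess
    import time
    from collections import Counter

    def slurm_job_states():
        """Return a count of jobs per Slurm state (RUNNING, PENDING, ...)."""
        out = subprocess.run(["squeue", "-h", "-o", "%T"],
                             capture_output=True, text=True, check=True).stdout
        return Counter(out.split())

    if __name__ == "__main__":
        record = {"time": int(time.time()), "slurm": dict(slurm_job_states())}
        print(json.dumps(record))   # placeholder sink; real sink TBD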
EL9 Migration Updates
Built test storage nodes in the test cluster; there are still more tests we want to perform.
Improving the storage module in Puppet/Foreman.
GGUS Ticket - Enable Network Monitoring
Followed up with campus networking. It appears there were internal changes that caused them to lose track of our request.
They have added the manager of their Operations Center and are discussing this.
OU: