SWT2_CPB:
For EL9 and storage: fixed the CRIC setting for the test cluster, cleared old jobs from the test CE, and are now waiting for new jobs to complete on the worker nodes so we can test the new storage. We are testing the EL9 storage module in the test cluster before deploying new hardware, and will then test the other EL9 modules.
Finished testing partition layouts. We settled on two: one that overwrites all partitions (for empty storage systems only) and one that preserves the data partitions, allowing us to rebuild the OS of a storage system without touching its data.
Changed plans for Zabbix: instead of using it for both alerting and displaying monitoring information, it will be used strictly for alerting.
GGUS tickets - Network Monitoring
Continued to follow up with campus networking. They have partially granted us access to a Grafana plot of CPB throughput from their end. After months of communication and follow-ups, campus networking contacted campus network security about SNMP read-only access to the appropriate switch port, and it has been approved. They need to inspect our web server to assess its security; I have requested they do this as soon as they are able. We are asking follow-up questions as needed to make sure we do not infringe on their security policies.
Moving the configuration off the EL7 monitoring server to a new EL9 server that will take over this role.
Waiting for campus networking to implement the SNMP change (a minimal read-only polling sketch follows this block).
I am communicating with campus network security about other aspects of their security policy so we can implement this without infringing on it.
This needs to be completed before we work toward BGP tagging.
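As a reference for the SNMP item above, here is a minimal read-only polling sketch. It assumes net-snmp's snmpget is installed on the monitoring host; the switch hostname, community string, and ifIndex are placeholders, not the real values.

```python
#!/usr/bin/env python3
"""Sketch: estimate uplink throughput from read-only SNMP counters."""
import subprocess
import time

SWITCH = "campus-switch.example.edu"   # placeholder hostname
COMMUNITY = "REPLACE_ME_RO"            # placeholder read-only community string
IF_INDEX = 1                           # placeholder ifIndex of the uplink port

def get_counter(oid: str) -> int:
    """Read a single 64-bit counter via snmpget (SNMP v2c, read-only)."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", SWITCH, oid],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

if __name__ == "__main__":
    in_oid = f"IF-MIB::ifHCInOctets.{IF_INDEX}"
    out_oid = f"IF-MIB::ifHCOutOctets.{IF_INDEX}"
    first = (get_counter(in_oid), get_counter(out_oid))
    time.sleep(60)
    second = (get_counter(in_oid), get_counter(out_oid))
    # Convert the 60 s octet delta to Gb/s for a rough throughput estimate.
    rx_gbps = (second[0] - first[0]) * 8 / 60 / 1e9
    tx_gbps = (second[1] - first[1]) * 8 / 60 / 1e9
    print(f"rx ~ {rx_gbps:.2f} Gb/s, tx ~ {tx_gbps:.2f} Gb/s")
```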
Communicating with Dell sales representatives to extend the warranty on thirteen of our storage systems: eleven R740s and two ME4084s. Negotiations have finished and we are following through on their most recent quote.
Communicated with a Dell sales representative about new hardware for head nodes. We are finalizing a purchase of R450s configured to replace our XRootD proxies and master node; they will also have higher network capability for when we upgrade our network infrastructure.
Using a new server in the test cluster to test Varnish before deploying it; communicating with Ilija. A sketch of the kind of cache-hit check we could run is below.
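This is only a rough sketch of such a check against the test Varnish instance; the URL and port are placeholders, and the hit test relies on Varnish's standard behavior of putting two transaction IDs in the X-Varnish header (plus a non-zero Age) when a response is served from cache.

```python
#!/usr/bin/env python3
"""Sketch: confirm repeat requests are served from the Varnish cache."""
import requests

URL = "http://varnish-test.example.edu:6081/some/cached/object"  # placeholder

def fetch(url: str):
    r = requests.get(url, timeout=10)
    xv = r.headers.get("X-Varnish", "")
    age = r.headers.get("Age", "0")
    # Two IDs in X-Varnish indicate the response came from cache.
    hit = len(xv.split()) == 2
    return hit, age

if __name__ == "__main__":
    first_hit, _ = fetch(URL)      # first request usually populates the cache
    second_hit, age = fetch(URL)   # second request should be a cache hit
    print(f"first hit={first_hit}, second hit={second_hit}, Age={age}")
```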
Continuing tests of parameter changes on both of our CEs. We have been communicating with OSG experts and the Harvester team for deeper understanding and when we discover new bugs. Examples:
The condor_ce_reconfig command does not work as expected for changing the max job limit (a verification sketch follows this list).
Reducing the max job parameter to 0 on one CE seemed to cause the whole site to drain and stop receiving new jobs. We discussed better alternatives, such as draining the CE through CRIC instead.
We saw a gradual increase in jobs being cancelled from Harvester's perspective. This seems to be caused by two issues: we tried to add IPv6 to our CE, and we had ten problematic worker nodes stuck in a strange state involving the puppet agent.
We reverted the IPv6 changes and restarted the condor-ce services for the revert to take effect, after which the job-cancellation rate dropped significantly. We will test IPv6 in the test cluster before trying again.
We restarted the puppet agent on the problematic worker nodes, which cleared the strange state in which the agent timed out while fetching its catalog from the puppet master. We are still investigating the cause; a sketch for spotting stuck agents is below.
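For the reconfig issue above, a minimal sketch of how we might verify whether condor_ce_reconfig actually picked up an edited limit instead of trusting it silently. The parameter name is a placeholder for whichever max-job knob we are changing, and the script assumes it runs on the CE host with permission to query and reconfigure the CE.

```python
#!/usr/bin/env python3
"""Sketch: check that condor_ce_reconfig picked up a changed limit."""
import subprocess

PARAM = "MAX_JOBS_RUNNING"  # placeholder; substitute the knob being changed

def ce_config_val(param: str) -> str:
    """Return the value HTCondor-CE currently resolves for `param`."""
    out = subprocess.run(
        ["condor_ce_config_val", param],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    before = ce_config_val(PARAM)
    subprocess.run(["condor_ce_reconfig"], check=True)
    after = ce_config_val(PARAM)
    print(f"{PARAM}: before reconfig = {before}, after reconfig = {after}")
    if before == after:
        print("Value unchanged: reconfig may not have applied the edit; "
              "a full condor-ce restart may be needed.")
```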
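For the puppet agent issue, a minimal sketch for flagging worker nodes whose agent has not completed a catalog run recently, i.e. the stuck state we saw. It assumes the default Puppet >= 4 path for last_run_summary.yaml, runs locally on a node (or wrapped in ssh/pdsh), and the two-hour threshold is an arbitrary choice.

```python
#!/usr/bin/env python3
"""Sketch: flag a node whose puppet agent has not run a catalog recently."""
import time
import yaml  # PyYAML

SUMMARY = "/opt/puppetlabs/puppet/cache/state/last_run_summary.yaml"
MAX_AGE_SECONDS = 2 * 3600  # arbitrary staleness threshold

def last_run_age() -> float:
    """Seconds since the agent last finished applying a catalog."""
    with open(SUMMARY) as fh:
        summary = yaml.safe_load(fh)
    return time.time() - summary["time"]["last_run"]

if __name__ == "__main__":
    age = last_run_age()
    if age > MAX_AGE_SECONDS:
        print(f"puppet agent looks stuck: last run {age/3600:.1f} h ago")
    else:
        print(f"puppet agent OK: last run {age/60:.0f} min ago")
```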
OU: