Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
Updates on US Tier-2 centers
06/24/2022
7 nodes became blackhole nodes because of a cvmfs issue; this was later traced to one of the squid servers.
06/29/2022
One of the SLATE squid servers, sl-um-es5, stopped working because of an iptables issue and a full /var partition. This caused intermittent cvmfs issues, and we received 2 GGUS tickets for it.
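For future monitoring, a minimal health-check sketch (in Python) for a squid node: the 90% threshold is illustrative rather than our actual alert setting, and the frontier-squid service name should be adjusted if it differs on a given host.

    #!/usr/bin/env python3
    # Minimal squid-node health check sketch. Assumptions: run locally on
    # the squid host; the 90% threshold is illustrative.
    import shutil
    import subprocess

    def var_partition_usage(path="/var"):
        """Return the used fraction of the filesystem holding `path`."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    if var_partition_usage() > 0.90:
        print("WARNING: /var above 90% -- squid may stop accepting requests")

    # The OSG frontier-squid package ships a "frontier-squid" systemd unit;
    # adjust the unit name if a given node uses something different.
    state = subprocess.run(
        ["systemctl", "is-active", "frontier-squid"],
        capture_output=True, text=True,
    ).stdout.strip()
    print(f"squid service state: {state}")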
06/30/2022
Starting on 06/28, the SAM test jobs stopped running. This began after the SAM test team made changes to the leave_in_queue conditions on ETF. We could not find an obvious cause after a couple of days of debugging, so we eventually restarted the condor-ce services on both ATLAS gatekeepers. That got the SAM test jobs running again, but it also removed all the jobs running on the gatekeepers (about 4000 jobs).
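For future debugging of this kind, a minimal sketch using the HTCondor Python bindings to list completed jobs lingering in the CE queue along with their LeaveJobInQueue expressions; the identity the ETF/SAM probe jobs run under is site-specific, so any filtering by Owner would be an assumption to fill in locally.

    #!/usr/bin/env python3
    # Diagnostic sketch using the HTCondor Python bindings; run on the CE.
    import htcondor

    schedd = htcondor.Schedd()  # local CE schedd

    # JobStatus 4 = Completed; completed jobs lingering in the queue are
    # typically being kept there by their LeaveJobInQueue expression.
    ads = schedd.query(
        constraint="JobStatus == 4",
        projection=["ClusterId", "ProcId", "Owner", "LeaveJobInQueue"],
    )
    for ad in ads:
        print(ad.get("ClusterId"), ad.get("ProcId"),
              ad.get("Owner"), ad.get("LeaveJobInQueue"))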
07/06/2022
Upgraded dCache from 7.2.16 to 7.2.19 (with a reboot to the new kernel).
Got all WNs updated and ready for reboot into the new kernel.
Starting a rolling drain and reboot in batches.
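For reference, a minimal sketch of the batched-drain logic in Python. The workers.txt inventory file and batch size of 20 are hypothetical, and the reboot/verification steps between batches are omitted.

    #!/usr/bin/env python3
    # Rolling-drain sketch. Assumptions: hostnames come from a hypothetical
    # workers.txt (one per line); batch size is illustrative.
    import subprocess

    BATCH_SIZE = 20  # tune to how much capacity can be offline at once

    with open("workers.txt") as f:
        workers = [line.strip() for line in f if line.strip()]

    for i in range(0, len(workers), BATCH_SIZE):
        batch = workers[i:i + BATCH_SIZE]
        for host in batch:
            # condor_drain tells the startd to stop accepting new jobs
            # and let the running ones finish (graceful by default).
            subprocess.run(["condor_drain", host], check=False)
        print(f"drain requested for batch: {batch}")
        # ...wait for the batch to drain, reboot, and verify before
        # moving on to the next batch (omitted here).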
All R6525 (AMD Milan 7413) nodes from the January 2022 order have shipped.
A fraction have already been received.
Upgrading Elasticsearch to 8.3; the cluster was upgraded to 7.17 last week.
Still waiting on UChicago IT Services to configure our new Juniper networking gear from our most recent purchase.
Updating condor to 9.0.13-1.1.osg36.el7 on the workers. IU is done, UC is halfway done, and UIUC still needs to be upgraded.
A switch and servers rebooted at IU last weekend. Back online by Monday.
Replacing the motherboard on the problematic dCache pool node appears to have fixed the lockup issues. Another dCache pool node had a bad NIC; this has also been replaced and the pool node is back online.
Removed ALRB testing variables from the workers and gatekeepers.
Applied user.max_net_namespaces=0 as a kernel-vulnerability mitigation; a verification sketch follows below.
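A minimal check that the mitigation is in effect on a node. The persistent setting itself would normally live in sysctl configuration (e.g. a file under /etc/sysctl.d/); the path below is the live kernel interface.

    #!/usr/bin/env python3
    # Verify the user.max_net_namespaces=0 mitigation on a running node.
    SYSCTL_PATH = "/proc/sys/user/max_net_namespaces"

    with open(SYSCTL_PATH) as f:
        value = int(f.read().strip())

    if value != 0:
        print(f"user.max_net_namespaces is {value}, expected 0 -- not applied")
    else:
        print("mitigation in place: new network namespaces are disabled")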
UTA:
We are creating two additional platforms.
One will serve educational purposes and, unlike the CoDaS workshop, will have tools usable by all of HEP, not only ATLAS (ServiceX for CERN open data, Jupyter with a ROOT kernel, etc.).
The other will be dedicated to ATLAS Analytics, with tools that support the Analytics efforts.
Yesterday Patrick found that, while using the same compute-node setup from the Tier-2 cluster for the K8s cluster, one of the node parameters was preventing K8s from running containers (jobs were stuck in the ContainerCreating state). Once he rolled that setting back, jobs started to run (see the sketch after this list).
Pinged Fernando today; waiting for ATLAS test jobs.
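A minimal sketch, using the official kubernetes Python client, of how one might list pods stuck in ContainerCreating across the cluster; it assumes a kubeconfig is available to the caller.

    #!/usr/bin/env python3
    # Spot pods stuck in ContainerCreating. Assumes a kubeconfig is
    # available to the caller (e.g. ~/.kube/config).
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    for pod in v1.list_pod_for_all_namespaces().items:
        for cs in (pod.status.container_statuses or []):
            waiting = cs.state.waiting
            if waiting and waiting.reason == "ContainerCreating":
                print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                      f"stuck in ContainerCreating")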