US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00 → 13:10
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
The USATLAS technical meeting and pre-scrubbing in Bloomington is fast approaching: https://indico.cern.ch/event/1273590/
- First draft of L3 presentations due by COB this Friday
- Final versions of L3 presentations due by Friday, June 2
There will be a day devoted to LHC computing at the upcoming Throughput Computing 2023 (HTCondor Week + OSG All-Hands Meeting) on Wednesday, July 12th: a morning session for ATLAS and a joint afternoon session with CMS. Info on the event at https://path-cc.io/htc23
-
13:10 → 13:20
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- Remember to register for HTC '23 (https://path-cc.io/htc23). Schedule here https://agenda.hep.wisc.edu/event/2014/abstracts/
- New OSG Software Team member Matt Westphall started this week
- Expect a new HTCondor 9 release containing tools to help detect potential issues in a 9 -> 10 upgrade, followed by an HTCondor 10 upgrade about a week later.
- InCommon IGTF CA v2 is steadily making its way through the IGTF process, and you can expect a release in the coming months. For the time being, please avoid issuing v2 certs.
- Working on the workaround for the SHA1-signed certs issue with tight OS crypto defaults (https://opensciencegrid.atlassian.net/browse/SOFTWARE-5365). Anyone willing to give it a spin when we have a package ready? (A quick crypto-policy check is sketched below.)
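Not the OSG-packaged workaround itself, just a quick way to check whether a RHEL-family host's system-wide crypto policy is the one that rejects SHA-1 signatures (assumes the standard crypto-policies-scripts tooling is installed):

# Check the active system-wide crypto policy on a RHEL/Alma/Rocky host.
# On strict policies (e.g. plain DEFAULT on EL9), SHA1-signed certificates
# fail verification; LEGACY or a SHA1 subpolicy re-enables them.
import subprocess

policy = subprocess.run(
    ["update-crypto-policies", "--show"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("active crypto policy:", policy)

if policy != "LEGACY" and "SHA1" not in policy:
    print("SHA-1 signatures are likely rejected on this host")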
-
13:20 → 13:40
Convener: Ofer Rind (Brookhaven National Laboratory)
- Some job drainage overnight Monday (cause?)
- Issues with replication of CVMFS nightlies at BNL affecting ART jobs; under investigation (JIRA, GGUS). A replication-lag check is sketched after this list.
- Intervention for ATLAS IAM DB migration planned Thursday overnight; estimated 7-minute downtime, should be fine according to P. Vokac (OTG)
- Squid and XRootD issues at OU, plus network problems over the weekend.
- Squid still does not appear on the monitoring page
- Deploying ATLAS token support for CERN EOS, planned for 5/31 (restart of Management and Metadata Server)
- Preparation for DC24
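A minimal sketch of the kind of replication-lag check one could script for the nightlies repository; the hostnames below are placeholders (the actual stratum-0/stratum-1 endpoints are not named in these minutes), and the repository name atlas-nightlies.cern.ch is assumed:

# Compare CVMFS repository revisions between a source server and a replica
# by reading the 'S' (revision) field of each .cvmfspublished manifest.
import urllib.request

REPO = "atlas-nightlies.cern.ch"
SERVERS = {
    "source":  f"http://cvmfs-source.example.org/cvmfs/{REPO}",   # placeholder
    "replica": f"http://cvmfs-replica.example.org/cvmfs/{REPO}",  # placeholder
}

def revision(base_url: str) -> int:
    """Return the repository revision published at base_url."""
    with urllib.request.urlopen(f"{base_url}/.cvmfspublished", timeout=10) as resp:
        for line in resp.read().decode(errors="replace").splitlines():
            if line.startswith("--"):      # signature block starts here; stop
                break
            if line.startswith("S"):
                return int(line[1:])
    raise RuntimeError(f"no revision field in manifest at {base_url}")

revs = {name: revision(url) for name, url in SERVERS.items()}
print(revs, "| replica behind by", revs["source"] - revs["replica"], "revision(s)")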
- 13:20
-
13:25
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
13:30
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- Both clusters (SWT2_CPB_K8S & SWT2_CPB_K8S_TEST) have been running fine.
- Have been working on Prometheus monitoring. Installed it on SWT2_CPB_K8S_TEST using Helm charts; in addition, had to set up Persistent Volumes, and after some configuration adjustments Prometheus was working fine (a quick health check is sketched after this list).
- On SWT2_CPB_K8S we were starting to hit disk pressure on the master node when trying to set up new components, so going forward we needed to increase the size of the main partition for the new cluster, and the only way to do that was to reinstall the node.
- So the SWT2_CPB_K8S_TEST cluster was rebuilt and K8s was installed again. All went smoothly, and the cluster is running fine.
- Started draining the SWT2_CPB_K8S cluster in preparation for switching to the new cluster and scaling it up (to about 1000 cores). Patrick is preparing a diverse set of machines for that; most probably this will all be completed this week.
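A minimal health check one could run against the new Prometheus instance; the endpoint is a placeholder for whatever NodePort/ingress the Helm chart exposes on SWT2_CPB_K8S_TEST:

# Poll Prometheus' liveness endpoint and count scrape targets reporting up == 1.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.local:9090"   # placeholder endpoint

with urllib.request.urlopen(f"{PROM}/-/healthy", timeout=5) as resp:
    print("healthy:", resp.status == 200)

query = urllib.parse.urlencode({"query": "sum(up)"})
with urllib.request.urlopen(f"{PROM}/api/v1/query?{query}", timeout=5) as resp:
    result = json.load(resp)["data"]["result"]
print("targets up:", result[0]["value"][1] if result else 0)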
- 13:40 → 13:45
-
13:45 → 14:05
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the past 30 days...
- ADC has had trouble at times supplying sufficient work.
- AGLT2 and MWT2 have received much or all of their FY23 orders.
- NET2 is close to being in operation but I will let them explain that.
- I will be preparing the Tier 2 pre-scrubbing slides over the next couple of days.
- NET2 will present their status at the pre-scrubbing.
- All other sites should let me know if they have any input for the pre-scrubbing.
-
13:45
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
2023 equipment purchase:
Added 11x R6525 with AMD 7443 (96 HT/node)
measured HEPscore = 17.48, HS06 = 17.75 (per HT)
total added: 1056 cores, 18.5k HEPscore / 18.7k HS06 (see the arithmetic sketch after the table below)
note: new PERC H355 not supported by OMSA on CentOS7
Retired 32x 24HT R410s (768 cores / 6.6k HS06) 9.84
Retired 24x 32HT R620s (736 cores / 8.1k HS06) 11.03
Retired 5x 40HT R620s (200 cores / 2.1k HS06) 10.68
(retirements above may not be exact, will correct after meeting)
Total change expected to be about neutral (~18.7k HS06 added vs ~16.8k HS06 retired)
Also added 7x of 12x R740xd2 at UM
(each with 24x 20TB disks= 370 TB usable in dcache)
4x R740xd2 at MSU waiting on network config (promised for tonight)
The other 5x R740xd2 at UM to compare RAID6 to JBOD/raidz3
Retired 4x MD3xxx shelves of 8T disks (~1.4 PB)
Waiting on MSU to be deployed before sitewide space re-balance
Did NOT YET increase dcache advertised space
Ultimately the grand total change will be about +4.5 PB
Events:
10-May: ZFS problem on NVMe mirror holding dcache database (Ticket 161890).
After file system recovery one postgres file remained flagged as possibly corrupted.
Recovered from backup/mirroring node.
18-May: MSU Data Center heats up during regular/yearly Fire Alarm testing.
All newer/hotter Worker Nodes (C6420s and R6525s) shut themselves down.
Unexpected. Could be operator error but no official report yet.
For fun/curiosity: coarse comparison HS06 vs HEPscore (as of May 2023)
| CPU      | 6132    | 6240R   | 7302    | 7413    | 7443    |
| Clock    | 2.60GHz | 2.40GHz | 3.00GHz | 2.65GHz | 2.85GHz |
| Tot HT   | 56      | 96      | 64      | 96      | 96      |
|----------+---------+---------+---------+---------+---------|
| HS06/HT  | 13.64   | 10.94   | 16.42   | 17.28   | 17.75   |
| HEPS/HT  | 13.16   | 12.22   | 16.17   | 16.97   | 17.48   |
|----------+---------+---------+---------+---------+---------|
note: HEPscore measured as average of 2 runs on only 1 node each
(except 7443 with 2 runs on 5 nodes)
HS06 taken from US facility spreadsheet for AGLT2.
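For reference, a trivial sketch reproducing the quoted totals for the new R6525 purchase from the per-HT scores above (all numbers are taken directly from this report):

# Reproduce the quoted AGLT2 totals for the 2023 R6525 purchase.
nodes, ht_per_node = 11, 96              # 11x R6525 with AMD 7443 (96 HT/node)
hepscore_per_ht, hs06_per_ht = 17.48, 17.75

ht_total = nodes * ht_per_node           # 1056 hyperthreads ("cores")
print(ht_total,
      round(ht_total * hepscore_per_ht / 1000, 1),   # ~18.5k HEPscore
      round(ht_total * hs06_per_ht / 1000, 1))       # ~18.7k HS06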
-
13:50
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
-
13:55
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Configuration of IPv6 on NESE network is fixed. Doing final checks before configuring storage element.
NESE team is automating DNS for storage servers.
The installation of OpenShift is ongoing. Many layers of many technologies make the debugging process slow.
The perfSONAR server is in place in the MGHPCC.
Final fiber distribution is being installed.
Finalizing the purchase of racks with RDHX (rear-door heat exchangers) for the system.
-
14:00
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA
- A pair of incidents with the campus chilled water supply caused disruptions. We were able to maintain storage access, but in the first incident we had to drop all of the computational load. In the second incident we lost about 1/4 of the computational load.
- Scaling up the internal K8s cluster. The previous K8s cluster will be merged into SWT2_CPB.
- Power balancing operations have started.
OU
- Last Wednesday was OSCER maintenance; they upgraded network switches. That apparently didn't go too well: on Friday afternoon the core network collapsed, with core switches showing high CPU usage and what looked like broadcast storms. It was fixed Saturday morning.
- Also saw some fraction of stage-in transfer failures with a strange IPv4 network error. Not clear whether that started around the same time or had been there at a low level before. A restart of XRootD (both the proxy on se1 and the backend storage) seems to have fixed it.
- Old OCHEP squid server stopped reporting to CERN monitoring. Not sure yet what's going on there, investigating.
- Reasonable running over the past 30 days...
-
14:05 → 14:10
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
Perlmutter
- Fairly good running over the last week or so. Lustre issues killed a lot of jobs in the last 24h. Stable configuration at ~50K cores.
- Following up with Michal Svatos regarding a Squid / Frontier access issue he flagged
- The queuing time for the GPU nodes appears to be shorter now; would consider switching back to the full-node configuration
Frontera
- Difficult to get jobs through the queue. PanDA is cancelling jobs that have waited for days.
- Following up with Asoka on the cvmfsexec Squid configuration issue.
-
14:10 → 14:25
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
- 14:10
- 14:15
-
14:20
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
Images deployable on AF have been upgraded:
- base is now an NVIDIA image with CUDA 11.8 and a newer cuDNN.
- packages in both the ml-platform and conda images have been updated (a quick environment check is sketched below).
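A quick way to confirm what an image actually ships; this assumes PyTorch is installed in the ml-platform/conda images, which these minutes do not state:

# Report the CUDA / cuDNN versions bundled with the installed PyTorch build.
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)        # expected to report 11.8 after the upgrade
print("cuDNN:", torch.backends.cudnn.version())   # integer, e.g. 8xxx for cuDNN 8.x
print("GPU visible:", torch.cuda.is_available())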
FAB ServiceX deployment
- FAB and LHCONE peering is up
- Sorting through XRootD authentication from an IPv6-only network
HTCondor issue - scheduler down due to stuck I/O. Restarted Ceph daemons as a temporary fix; updating kernels as a potential fix.
- 14:25 → 14:35