US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 993 2967 7148
Meeting password: 452400
Invite link: https://umich.zoom.us/j/99329677148?pwd=c29ObEdCak9wbFBWY2F2Rlo4cFJ6UT09
-
-
1:00 PM
→
1:05 PM
WBS 2.3 Facility Management News 5m
Speakers: Alexei Klimentov (Brookhaven National Laboratory (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
1:05 PM
→
1:10 PM
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
1:10 PM
→
1:30 PM
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
1:10 PM
Tier-1 Infrastructure 5m
Speaker: Jason Smith
-
1:15 PM
Compute Farm 5m
Speaker: Thomas Smith
-
1:20 PM
Storage 5m
Speaker: Carlos Fernando Gamboa (Brookhaven National Laboratory (US))
-
1:25 PM
Tier1 Operations and Monitoring 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
WBS 2.3.1.2 Tier-1 Infrastructure - Jason
-
Unplanned power interruption (morning of 3/7)
-
Most things recovered within a few hours; a few VMs took several hours to recover
-
Casualties: a NIC on one OpenShift worker (a few days to replace) and a few corrupted VM disk images (one needs to be rebuilt; another was copied from RHEV again, since it was recently migrated and the old image was still present)
-
OpenShift: more than half of the migrations from RHEV are complete; close to supporting containers (ready for testing very soon)
WBS 2.3.1.3 Tier-1 Compute - Tom
-
BNL_ARM was not getting new jobs due to missing SW tags in CRIC. Solved.
-
Unplanned power outage on Friday, 7 March.
-
This led to a large number of job failures as worker nodes lost power
-
HTCondor recovered by ~15:45 (Eastern time)
-
Job ramp-up was gradual but successful
-
Some worker nodes came up in a bad state and were rebuilt. Full capacity restored.
-
There was an effort to recover additional previously downed worker nodes; capacity is slightly higher after the power outage as a result (34.2k cores -> 35.4k cores)
WBS 2.3.1.4 Tier-1 Storage - Carlos
-
Power glitch outage on 03/07/25.
-
The ATLAS production storage service was degraded
-
The Chimera server was down for 7 minutes but restarted without any issues or corruption.
-
Other dCache core services failed over to redundant components.
-
A mix of pool hosts restarted automatically, while a few others required manual hardware intervention. No data loss was observed.
-
A subset of doors was also affected and recovered without issue
-
The impact was limited to some READ operations and READ/WRITE transfers that were in progress during the power glitch.
-
The system was fully functional by 11 AM (EST).
-
Test/Integration instance affected due to the OpenShift issue
-
Work on DMZ pools: the underlying filesystem block size of the DMZ pools has been aligned with the block size of the NVMe devices, resulting in an improvement in READ IOPS.
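-
For context, a minimal sketch of how such a block-size check and alignment might be done, assuming hypothetical device names and an XFS pool filesystem (the actual BNL pool layout is not specified here):
# Inspect the logical/physical block sizes of a (hypothetical) NVMe pool device; requires nvme-cli
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"
blockdev --getss --getpbsz /dev/nvme0n1
# Recreate the pool filesystem with a matching block/sector size (4096 is only illustrative)
mkfs.xfs -f -b size=4096 -s size=4096 /dev/nvme0n1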
WBS 2.3.1.5 Tier-1 Operations & Monitoring - Ivan
-
All operations-related news was already reported above.
-
1:30 PM
→
1:40 PM
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Conveners: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
- Reasonable running over the last couple of weeks.
- OU had a scheduled downtime.
- Found a problem with the reliability reporting not playing well with sites putting only some services offline.
- I need to reply to an email from Borja from the CERN MONIT team.
- I am (slowly) working on templates for the procurement and operation plans.
- I have modified the v71 tab of the capacity sheet to calculate the meanRSS for each site.
- I will shortly add a power consumption calculation so that we can answer a question from the operations review.
- The BNL data on the capacity sheet seems out of date.
-
1:40 PM
→
1:50 PM
WBS 2.3.3 Heterogeneous Integration and Operations
HIOPS
Convener: Rui Wang (Argonne National Laboratory (US))
-
1:40 PM
-
1:45 PM
Integration of Complex Workflows on Heterogeneous Resources 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
1:50 PM
→
2:10 PM
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
1:50 PM
Analysis Facilities - BNL 5m
Speaker: Qiulan Huang (Brookhaven National Laboratory (US))
-
1:55 PM
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
2:00 PM
Analysis Facilities - Chicago 5m
Speaker: Fengping Hu (University of Chicago (US))
-
2:10 PM
→
2:25 PM
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
-
2:10 PM
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (Brookhaven National Laboratory (US))
-
2:15 PM
Services DevOps 5m
Speaker: Ilija Vukotic (University of Chicago (US))
- Analytics
- we used to have all the data visible to ATLAS_USER and the anonymous user. That is no longer the case, and we now have to explicitly allow data into dashboards for these users. That "broke" a lot of dashboards and visualizations embedded in or shared to many places and people. I have been fixing them for the last few days. Please complain if you see a dashboard not showing correctly.
- XCaches
- required an update to the x509 proxy renewal container (see the sketch after this list)
- updated all UC AF xcaches
- had to fix dashboards
- building new image for gStream monitoring fix
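- A minimal sketch of the kind of renewal loop such a container might run, assuming a hypothetical robot certificate path and renewal interval (not the actual container contents):
# Periodically renew an ATLAS VOMS proxy from a (hypothetical) robot certificate
while true; do
  voms-proxy-init --voms atlas \
    --cert /etc/grid-security/robotcert.pem \
    --key /etc/grid-security/robotkey.pem \
    --hours 96 --out /tmp/x509up_shared
  sleep 6h
done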
- Varnishes
- All working fine
- ATLAS made a decision to move to Varnish for conditions.
- Ilija and Nurcan are preparing a grand plan document.
- Asked John to try installing one at BNL.
- VP
- working fine
- ServiceX and ServiceY
- x509 proxy renewal container update
- also for CMS
-
2:20 PM
Facility R&D 5mSpeaker: Lincoln Bryant (University of Chicago (US))
- Have access to NET2 K8S, doing some tests at a small scale. Coordinating with Eduardo on figuring out minimal privileges for e.g. WireGuard in OpenShift
- Aidan will try Armada for Kubernetes-level federation against this cluster as well
- stretched k8s upgraded to Kubernetes 1.31
- Have a working unprivileged WireGuard container with manual configuration. Capabilities added _in the namespace_ only; see the podman example below and the Kubernetes sketch after it.
-
[12:03]:~/wg-test/config $ podman run --cap-add=NET_RAW --cap-add=NET_ADMIN --cap-add=SYS_MODULE --sysctl="net.ipv4.conf.all.src_valid_mark=1" -p 51820:51820/udp -v /lib/modules:/lib/modules -v /home/lincolnb/wg-test/config/:/etc/wireguard wgtest3 /bin/bash -c "wg-quick up wg0; ping 10.20.10.1"
PING 10.20.10.1 (10.20.10.1) 56(84) bytes of data.
64 bytes from 10.20.10.1: icmp_seq=1 ttl=64 time=3.74 ms
64 bytes from 10.20.10.1: icmp_seq=2 ttl=64 time=1.85 ms
^C
--- 10.20.10.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.851/2.794/3.737/0.943 ms
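- For comparison, a rough sketch of how the same capability set could be requested from Kubernetes instead of podman; the namespace, image name, and the OpenShift SCC that would have to permit these capabilities are assumptions, not the actual NET2/UC configuration:
# Apply a pod spec that adds only the network capabilities to an otherwise unprivileged container
kubectl apply -n wg-test -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: wgtest
spec:
  containers:
  - name: wireguard
    image: wgtest3            # hypothetical image reference
    securityContext:
      privileged: false
      capabilities:
        add: ["NET_ADMIN", "NET_RAW"]   # SYS_MODULE omitted; module loading handled on the host
    ports:
    - containerPort: 51820
      protocol: UDP
EOF
# On OpenShift, the pod's service account also needs an SCC that allows these added capabilities.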
-
2:25 PM
→
2:35 PM
AOB 10m