US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00 → 13:05
WBS 2.3 Facility Management News 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan (US))
Very good meeting last week at BNL, discussing new Tier-1 organization and storage items with the dCache team https://indico.bnl.gov/event/22078/
- We plan to start a new site-networking-focused meeting to bring in site/campus network people. There is an existing weekly meeting on Thursdays at 10 AM, and we can shift its focus to campus/site network information exchange once per month.
Reminder that CHEP abstracts are due May 10 https://indico.cern.ch/event/1338689/page/31560-call-for-abstracts
HEPiX is in a few weeks; consider attending and submitting an abstract: https://indico.cern.ch/event/1377701/
Quarterly reports are due Friday, April 19, 2024: https://atlasreporting.bnl.gov/
- We need to review and update milestones as well.
- Please suggest any new milestones, or let Rob and me know if there are milestones to retire/remove
Updates are needed for the upcoming IAM changes. Tickets were issued to non-US sites, presumably on the assumption that OSG would coordinate this for the US sites. We should discuss their plans with OSG.
- VOMS configuration changes for LHC Experiments https://ggus.eu/index.php?mode=ticket_info&ticket_id=165668
- Token configuration ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=165816
- Timeline document at https://docs.google.com/document/d/1onp_qMOvE5s9byaDF9L2Fx1LIVd2smUtNHwKa7ejnJA/edit#heading=h.7vqi4tau13n6
We need to continue to look at the results and data from DC24, trying to identify issues that can be resolved by configuration, architecture, software and/or hardware changes.
-
13:05 → 13:10
OSG-LHC 5m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:10 → 13:25
WBS 2.3.1: Tier1 Center
Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
-
13:10
Infrastructure and Compute Farm 5m
Speaker: Thomas Smith
-
13:15
Storage 5m
Speaker: Jason Smith
-
13:20
Tier1 Services 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
- Farm
  - Alma9 + Condor23 transition: testing the full job submission chain (see the sketch after this list)
  - IPv6 transition: testing a script for automatic node conversion to IPv6
  - CVMFS: no more errors at BNL; waiting for the new pilot release to get better monitoring
  - Lower job efficiency at BNL due to more than half of the cluster being filled with user analysis jobs
  - A HammerCloud blacklisting event due to a switch problem at CERN (OTG0149318) did not affect BNL
- Storage
  - Filled tape pools detected today. Solved.
- Misc
  - Confirmed pledged resource delivery for 2024
  - GGUS:
    - GGUS:165929: Transfer failures. Solved.
    - GGUS:165532: Post-DC24 test ticket
      - A sawtooth pattern observed in the throughput is not yet understood; the cause is not at BNL
    - GGUS:164216: CMS request for running test jobs on BNL T1 slots
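For illustration, a minimal end-to-end probe of the kind the "testing the full job submission chain" item refers to might look like the sketch below. This is an assumed shape for such a test, not the site's actual harness; it uses the htcondor Python bindings (HTCondor 10.x API) on an Alma9 submit host.

#!/usr/bin/env python3
"""Minimal end-to-end HTCondor submission probe (a sketch, not the
site's actual test harness). Assumes the htcondor Python bindings
(HTCondor 10.x API) on an Alma9 submit host."""
import time
import htcondor

# A trivial job that exercises submit -> match -> run -> output.
sub = htcondor.Submit({
    "executable": "/bin/hostname",
    "output": "probe.$(ClusterId).out",
    "error": "probe.$(ClusterId).err",
    "log": "probe.$(ClusterId).log",
    "request_cpus": "1",
    "request_memory": "128MB",
})

schedd = htcondor.Schedd()
result = schedd.submit(sub)   # returns a SubmitResult
cluster = result.cluster()
print(f"submitted probe job in cluster {cluster}")

# Poll until the job leaves the queue (completes or is removed).
while schedd.query(constraint=f"ClusterId == {cluster}",
                   projection=["JobStatus"]):
    time.sleep(10)
print("probe left the queue; inspect the output and log files")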
-
13:25 → 13:35
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running
- MWT2 IU site had a painful network refresh that caused some loss of production
- NET2 has been struggling with network errors.
- CPB is being asked by ADC to retire LSM, but this is non-trivial. The issue is affecting production at the CPB Kubernetes site.
- End of quarter reporting: please update the following (if needed):
- Site Capacity sheet: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
- Site Evolution sheet: https://docs.google.com/spreadsheets/d/1YjDe4YdApHoB5_HbDnNwrG-ceJP3amNWMb_VzQEaxGI
- Site Services sheet: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
- At Rob's request, I have created a sheet to track progress on the open issues (most but not all are milestones):
- https://docs.google.com/spreadsheets/d/1CHpVHqnLJz0dNfXh-v4GYSOq0ez9n6SPmF3hJ9iMflY
- Please check your section. I will tend to the Tier-2 items; Ofer will track the items for the Tier-1.
- If there are items that are delayed, we need to know. In particular, high-level milestones visible to the funding agencies need to be handled carefully. If you are delayed by something out of your control (e.g. you can't order equipment before the funding agency delivers the funding), those delays will not count against your site.
-
13:35 → 13:40
WBS 2.3.3 HPC Operations 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
TACC
- Harvester is up and running; the queue is set for testing
- Jobs failed due to input file validation (see the sketch after this list)
  - The checksum matches between the local copy and the one in Rucio. No issue was seen when reading the linked file locally or via the debug queue.
  - Added the binding area of the local datadisk in CRIC
- Requesting a testing task
  - Updated the pilot version to 3.7.2.4
  - Very long queuing time (~4 days) before the queue was set for testing
  - Need a TACC-specific test request
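A sketch of the kind of cross-check described above (local checksum versus the Rucio catalog), assuming a working Rucio client environment; the DID and local path are hypothetical placeholders:

#!/usr/bin/env python3
"""Compare a local file's adler32 checksum against the value Rucio
has on record. A sketch of the validation debugging described above;
scope, name, and path are hypothetical placeholders."""
import zlib
from rucio.client import Client

def local_adler32(path, chunk=1024 * 1024):
    """Compute adler32 the way Rucio does, as a zero-padded hex string."""
    value = 1  # adler32 seed
    with open(path, "rb") as f:
        while data := f.read(chunk):
            value = zlib.adler32(data, value)
    return f"{value & 0xFFFFFFFF:08x}"

client = Client()
scope, name = "mc23_13p6TeV", "EVNT.12345._000001.pool.root.1"  # hypothetical DID
meta = client.get_metadata(scope=scope, name=name)

local = local_adler32("/scratch/inputs/" + name)  # hypothetical local path
print(f"rucio adler32: {meta['adler32']}, local adler32: {local}")
print("match" if meta["adler32"] == local else "MISMATCH")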
NERSC
- Running Harvester with the older pilot (3.7.2.4) - above the uniform usage line
- Testing the latest pilot (3.7.3.9) in a test queue; currently all of its production jobs fail
- Need to decide if we want to make the GPUs available to ATLAS
-
13:40 → 13:55
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:45
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
13:50
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
AB-stable and AB-dev images have been updated. AB-dev has the latest versions of uproot, awkward, and dask-awkward. Image building has been updated so that images always get the correct dask workers.
HTCondor worker autoscaling has been configured for both the long- and short-queue workers; either queue can now scale out when needed. This should help with worker nodes in one queue idling while jobs in the other are pending on resources. It does increase scaling activity, so we hope to make scale-out as lightweight as possible: currently each new worker runs a user-provisioning step that takes a few minutes, and in the future user accounts will be backed by LDAP to avoid that cost.
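A minimal sketch of the general approach, not the AF's actual controller: poll the schedd for idle jobs and resize a worker Deployment with the kubernetes Python client. The deployment name "condor-workers", namespace "af", and cap are hypothetical placeholders.

#!/usr/bin/env python3
"""Illustrative queue-driven scale-out sketch (not the AF's actual
controller): count idle HTCondor jobs and grow a hypothetical worker
Deployment to match."""
import htcondor
from kubernetes import client, config

MAX_WORKERS = 50      # hypothetical cap
JOBS_PER_WORKER = 1   # one slot per worker pod, for simplicity

def idle_jobs(schedd):
    """Number of idle (JobStatus == 1) jobs in the queue."""
    return len(schedd.query(constraint="JobStatus == 1",
                            projection=["ClusterId"]))

config.load_kube_config()   # or load_incluster_config() inside the cluster
apps = client.AppsV1Api()
schedd = htcondor.Schedd()

want = min(MAX_WORKERS, idle_jobs(schedd) // JOBS_PER_WORKER)
scale = apps.read_namespaced_deployment_scale("condor-workers", "af")
if scale.spec.replicas < want:
    apps.patch_namespaced_deployment_scale(
        "condor-workers", "af", {"spec": {"replicas": want}})
    print(f"scaled condor-workers: {scale.spec.replicas} -> {want}")

Scaling down is the harder half in practice, since busy workers have to be drained before their pods can be removed.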
-
13:55 → 14:10
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Fred and I are beginning to track OS and other updates as in previous years (spreadsheet)
- CVMFS mount errors now identifiable using new wrapper message on Harvester dashboard; worker node information to be added in next pilot update
- Hiro and Mark have updated and deployed the site-wide networking script (corrects the traffic direction, which was previously flipped in/out)
- XRootd 5.6.9 deployment for ATLAS production - held up by SWT2_CPB_K8S
- SWT2_CPB, OU site network monitoring? (GGUS,GGUS)
- ATLAS considering site exclusions based on unavailability of a certain fraction of data
-
13:55
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
-
14:00
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- XCache
  - Still issues with several nodes; restarts help temporarily
  - Will be testing the new version at MWT2 and AGLT2
- VP - all working fine
- Varnish caches - all working fine
- Analytics and monitoring
  - Working on getting back the FTS stream
  - Some improvements to the Alarm and Alert Service
- ServiceX
  - Improvements in reliability, performance, logging, and the user interface
  - Testing the new client
- ServiceXLite
  - Now running full time at FAB, River, and NRP
-
14:05
Facility R&D 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Kubernetes Tutorial/Hackathon
- Please sign up by Friday, April 5, especially if you plan to attend in person. Will send an email out to this effect.
Multi-site stretched cluster assembled with Kubespray, using WireGuard as the fundamental network layer.
WireGuard is a VPN technology. We can assemble a VPN mesh that encrypts all internal cluster traffic and requires a site to expose only one UDP port to the public internet for the most essential connectivity. WireGuard is built into the Linux kernel (v5.6 and above?) and creates a private interface on each node, so to Kubernetes everything appears to be on one private network. However, we still need to understand what it looks like to expose public services: public-facing services where we can, and tunneled private traffic where we have to?
WireGuard config example:
[Peer]
PublicKey = xxjmp6WyT7IU/9hffUjyV0uj8sfYzR6G3C/I3yt+Qxk= # Elliptic curve public key
AllowedIPs = 192.168.0.6/32 # INTERNAL IP assigned to the 'wg0' interface
Endpoint = 192.41.231.216:51820 # EXTERNAL IP and UDP port assigned for negotiating Wireguard traffic
PersistentKeepalive = 30 # Periodic ping between nodes to keep the connection alive
[Peer]
PublicKey = oVVQuMR2hHCW+a5y0w4BS9ySOQK2pp8Tkba4RP5TByM=
AllowedIPs = 192.168.0.7/32
Endpoint = 192.41.237.213:51820
PersistentKeepalive = 30
[Peer]
PublicKey = BFh6AaxOf8rmDE68BtRcdcEIrQRrx6TklfZozLm3d28=
AllowedIPs = 192.168.0.8/32
Endpoint = 206.12.98.227:51820
Kubespray config sample - each site has a label corresponding to its site in CRIC as well as the institution where it sits:
# ...
uchicago005.hl-lhc.io:
  ansible_host: 192.168.0.5
  ip: 192.168.0.5
  access_ip: 192.168.0.5
  node_labels:
    site: mwt2
    institution: uchicago
umich001.hl-lhc.io:
  ansible_host: 192.168.0.6
  ip: 192.168.0.6
  access_ip: 192.168.0.6
  node_labels:
    site: aglt2
    institution: umich
msu001.hl-lhc.io:
  ansible_host: 192.168.0.7
  ip: 192.168.0.7
  access_ip: 192.168.0.7
  node_labels:
    site: aglt2
    institution: msu
uvic001.hl-lhc.io:
  ansible_host: 192.168.0.8
  ip: 192.168.0.8
  access_ip: 192.168.0.8
  node_labels:
    site: uvic
    institution: uvic
# ...
Kubectl:
[root@uchicago002 ~]# kubectl get nodes
NAME                    STATUS   ROLES           AGE     VERSION
msu001.hl-lhc.io        Ready    <none>          6d21h   v1.28.6
uchicago002.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
uchicago003.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
uchicago004.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
uchicago005.hl-lhc.io   Ready    <none>          6d20h   v1.28.6
umich001.hl-lhc.io      Ready    <none>          6d21h   v1.28.6
uvic001.hl-lhc.io       Ready    <none>          6d21h   v1.28.6
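Since every node carries site and institution labels, workloads can be pinned to a particular site with a nodeSelector. A minimal sketch using the kubernetes Python client; the pod name, namespace, and image are hypothetical placeholders:

#!/usr/bin/env python3
"""Sketch: use the per-site node labels above to pin a pod to AGLT2.
Pod name, namespace, and image are hypothetical placeholders."""
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="site-probe"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={"site": "aglt2"},   # label set via Kubespray above
        containers=[client.V1Container(
            name="probe",
            image="busybox",
            command=["uname", "-n"],       # print which node we landed on
        )],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)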
-
14:10 → 14:20
AOB 10m