US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:05
      WBS 2.3 Facility Management News 5m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Very good meeting last week at BNL discussing the new Tier-1 organization and storage items with the dCache team: https://indico.bnl.gov/event/22078/

        - We will plan to start a new site-networking-focused meeting to bring in site/campus network people. There is an existing weekly meeting on Thursdays at 10 AM, and we can shift its focus to campus/site network information exchange once per month.

      Reminder that CHEP abstracts are due May 10 https://indico.cern.ch/event/1338689/page/31560-call-for-abstracts

      HEPiX is in a few weeks; consider attending and submitting an abstract:  https://indico.cern.ch/event/1377701/

      Quarterly reports are due Friday, April 19, 2024:  https://atlasreporting.bnl.gov/

        -  We need to review and update milestones as well.   

        - Please suggest any new milestones, or let Rob and me know if there are milestones to retire/remove

      Updates are needed for the upcoming IAM changes. Tickets were issued to non-US sites, presumably on the assumption that OSG would coordinate this for US sites. We should discuss plans with OSG.

        - VOMS configuration changes for LHC Experiments https://ggus.eu/index.php?mode=ticket_info&ticket_id=165668

        - Token configuration ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=165816

        - Timeline document at https://docs.google.com/document/d/1onp_qMOvE5s9byaDF9L2Fx1LIVd2smUtNHwKa7ejnJA/edit#heading=h.7vqi4tau13n6 

      We need to continue to look at the results and data from DC24, trying to identify issues that can be resolved by configuration, architecture, software and/or hardware changes.

       

    • 13:05 13:10
      OSG-LHC 5m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
    • 13:10 13:25
      WBS 2.3.1: Tier1 Center
      Convener: Alexei Klimentov (Brookhaven National Laboratory (US))
      • 13:10
        Infrastructure and Compute Farm 5m
        Speaker: Thomas Smith
      • 13:15
        Storage 5m
        Speaker: Jason Smith
      • 13:20
        Tier1 Services 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • Farm
          • Alma9 + Condor23 transition: testing the full job submission chain
          • IPv6 transition: Testing a script for automatic node conversion to IPv6 (a minimal readiness-check sketch follows this report)
          • CVMFS: No more errors at BNL. Waiting for the new pilot release to get better monitoring
          • Lower job efficiency at BNL due to more than half of the cluster being filled with user analysis jobs.
          • HammerCloud blacklisting event due to switch problem at CERN (OTG0149318) did not affect BNL
        • Storage
          • Filled tape pools detected today. Solved.
        • Misc
          • Confirmed pledged resource delivery for 2024
        • GGUS:
          • GGUS:165929: Transfer failures. Solved.
          • GGUS:165532: Post-DC24 test ticket
            • A saw-tooth pattern observed in the throughput is yet to be understood; the cause is not at BNL.
          • GGUS:164216: The CMS request for running test jobs on BNL T1 slots
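        As a minimal sketch of the kind of pre-conversion check such an IPv6 script might perform (the target host, port, and helper names below are illustrative assumptions, not the actual BNL script):

        # Sketch: check a node's IPv6 readiness before converting it.
        import socket

        def has_aaaa_record(hostname: str) -> bool:
            """True if the hostname resolves to at least one IPv6 address."""
            try:
                return bool(socket.getaddrinfo(hostname, None, socket.AF_INET6))
            except socket.gaierror:
                return False

        def can_connect_ipv6(host: str = "www.cern.ch", port: int = 443, timeout: float = 5.0) -> bool:
            """True if an outbound IPv6 TCP connection to host:port succeeds."""
            try:
                family, socktype, proto, _, sockaddr = socket.getaddrinfo(
                    host, port, socket.AF_INET6, socket.SOCK_STREAM)[0]
                with socket.socket(family, socktype, proto) as s:
                    s.settimeout(timeout)
                    s.connect(sockaddr)
                return True
            except OSError:
                return False

        if __name__ == "__main__":
            node = socket.getfqdn()
            print(f"{node}: AAAA={has_aaaa_record(node)} outbound_ipv6={can_connect_ipv6()}")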
    • 13:25 13:35
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running
        • MWT2 IU site had a painful network refresh that caused some loss of production
        • NET2 has been struggling with network errors.
        • CPB is being asked by ADC to retire LSM, but this is non-trivial. The issue is affecting production at the CPB Kubernetes site.
      • End of quarter reporting: please update the following (if needed):
        • Site Capacity sheet: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
        • Site Evolution sheet: https://docs.google.com/spreadsheets/d/1YjDe4YdApHoB5_HbDnNwrG-ceJP3amNWMb_VzQEaxGI
        • Site Services sheet: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
      • At Rob's request, I have created a sheet to track progress on the open issues (most but not all are milestones):
        • https://docs.google.com/spreadsheets/d/1CHpVHqnLJz0dNfXh-v4GYSOq0ez9n6SPmF3hJ9iMflY
        • Please check your section. I will tend to the tier 2 items. Ofer will track the items for the tier 1.
        • If there are items that are delayed, we need to know. In particular, high-level milestones visible to the funding agencies need to be handled carefully. If you are delayed by something out of your control (e.g., you cannot order equipment before the funding agency delivers the funding), those delays will not count against your site.

       

    • 13:35 13:40
      WBS 2.3.3 HPC Operations 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))

      TACC

      • Harvester is up and running; the queue is set for testing
        • Jobs failed due to input file validation
          • Checksum matches between the local copy and the one in Rucio (a verification sketch follows this list). No issue was seen when reading the linked file locally or via the debug queue
          • Added the binding area of the local datadisk in CRIC
          • Requesting testing task
        • Updated the pilot version to 3.7.2.4
        • Very long queuing time (~4 days) before being set for testing
        • Need a TACC-specific test request.
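      A minimal sketch of the checksum comparison mentioned above, assuming Rucio's usual zero-padded lowercase adler32 hex format; the file path and checksum value are hypothetical:

      # Sketch: compare a local file's adler32 with the value recorded in Rucio.
      import zlib

      def local_adler32(path: str, chunk_size: int = 1024 * 1024) -> str:
          """Stream the file through zlib.adler32; return a zero-padded lowercase hex digest."""
          value = 1  # adler32 seed
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(chunk_size), b""):
                  value = zlib.adler32(chunk, value)
          return f"{value & 0xFFFFFFFF:08x}"

      def matches_rucio(path: str, rucio_adler32: str) -> bool:
          """Case-insensitive comparison against the checksum reported by Rucio."""
          return local_adler32(path) == rucio_adler32.strip().lower()

      if __name__ == "__main__":
          # Hypothetical values; replace with the real input file and its Rucio metadata.
          print(matches_rucio("/scratch/datadisk/EVNT.12345._000001.pool.root.1", "3c8e21af"))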

       

      NERSC

      • Running Harvester with older pilot (3.7.2.4) - above the uniform usage line
      • Testing the latest pilot (version 3.7.3.9) - currently all jobs fail in production. Using a test queue.
      • Need to decide if we want to make the GPUs available to ATLAS
    • 13:40 13:55
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:40
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • New disk storage for the Tier-3 GPFS update has arrived and is being installed; no downtime is expected and it should be transparent to users, except that they might notice fluctuations in available space
      • 13:45
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))

         

        The AB-stable and AB-dev images have been updated. AB-dev has the latest versions of uproot, awkward, and dask-awkward. Image building has been updated so that it always gets the correct dask workers.
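        For reference, a quick check that can be run inside the updated AB-dev image to confirm the stack versions (nothing image-specific is assumed beyond the packages named above):

        # Print the installed versions of the analysis-stack packages in the image.
        import awkward
        import dask_awkward
        import uproot

        for mod in (uproot, awkward, dask_awkward):
            print(f"{mod.__name__:15s} {mod.__version__}")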

        HTCondor worker auto-scaling has been configured for both the long- and short-queue workers; either queue can now scale out when needed. This should help avoid worker nodes in one queue sitting idle while jobs in the other are pending on resources. It does increase scaling activity, which we are trying to keep as lightweight as possible: worker startup currently includes a user-provisioning step that takes a few minutes, and in the future we hope to back user accounts with LDAP to avoid that cost.
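        As an illustration of the auto-scaling idea only, not the AF's actual implementation (the deployment and namespace names are hypothetical), a periodic job could size the worker Deployment from the idle-job count in the schedd:

        # Sketch: scale an HTCondor worker Deployment based on idle jobs in the pool.
        # Assumes the htcondor and kubernetes Python bindings and a local kubeconfig;
        # the names below are placeholders.
        import htcondor
        from kubernetes import client, config

        def idle_jobs() -> int:
            """Count idle (JobStatus == 1) jobs in the local schedd."""
            schedd = htcondor.Schedd()
            return len(schedd.query(constraint="JobStatus == 1", projection=["ClusterId"]))

        def scale_workers(replicas: int,
                          deployment: str = "htcondor-worker-short",  # hypothetical name
                          namespace: str = "af") -> None:             # hypothetical namespace
            """Resize the worker Deployment to the requested replica count."""
            config.load_kube_config()
            client.AppsV1Api().patch_namespaced_deployment_scale(
                name=deployment, namespace=namespace,
                body={"spec": {"replicas": replicas}},
            )

        if __name__ == "__main__":
            pending = idle_jobs()
            # Roughly one extra worker per few idle jobs, capped to keep scaling lightweight.
            scale_workers(min(max(pending // 4, 1), 20))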

    • 13:55 14:10
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Fred and I are beginning to track OS and other updates as in previous years (spreadsheet)
      • CVMFS mount errors now identifiable using new wrapper message on Harvester dashboard; worker node information to be added in next pilot update
      • Hiro and Mark have updated and deployed the site-wide networking script (it corrects the traffic direction, which was previously flipped in/out)
      • XRootd 5.6.9 deployment for ATLAS production - held up by SWT2_CPB_K8S
      • SWT2_CPB, OU site network monitoring?  (GGUS,GGUS)
      • ATLAS considering site exclusions based on unavailability of a certain fraction of data
      • 13:55
        ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Ivan Glushkov (University of Texas at Arlington (US))
        • Tape downtimes - should be tested.
        • Proposing a policy for sites: all queues should be set down if a certain fraction of files is inaccessible (for example > 20%) (to be discussed at the WLCG coordination meeting)
        • CVMFS check
        • Post DC24 T0-T1 tests conducted
      • 14:00
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
        • XCache 
          • still issues with several nodes.
          • restarts help temporarily 
          • will be testing the new version at MWT2 and AGLT2
        • VP - all working fine
        • Varnish caches - all working fine
        • analytics and monitoring
          • working on getting back FTS stream
          • some improvements to the Alarm And Alert Service
        • ServiceX
          • improvements in reliability, performance, logging, user interface
          • testing new client
        • ServiceXLite
          • now running full time at FAB, River, NRP.
      • 14:05
        Facility R&D 5m
        Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        Kubernetes Tutorial/Hackathon - please sign up by Friday, April 5, especially if you plan to attend in person. We will send an email out to this effect.

         

        A multi-site stretched cluster has been assembled with Kubespray, using WireGuard as the fundamental network layer.

         

        WireGuard is a VPN technology. We can assemble a VPN mesh that encrypts all internal cluster traffic and requires a site to expose only one UDP port to the public internet for the most essential connectivity. WireGuard is built into the Linux kernel (v5.6 and above?) and creates a private interface on each node, so to Kubernetes everything appears to be on one private network. However, we need to understand what it looks like to expose public services. Public-facing services where we can, and tunneled private traffic where we have to?

         

        Wireguard config example:

        [Peer]
        PublicKey = xxjmp6WyT7IU/9hffUjyV0uj8sfYzR6G3C/I3yt+Qxk= # Elliptic curve public key
        AllowedIPs = 192.168.0.6/32 # INTERNAL IP assigned to the 'wg0' interface 
        Endpoint = 192.41.231.216:51820 # EXTERNAL IP and UDP port assigned for negotiating Wireguard traffic
        PersistentKeepalive = 30 # Periodic ping between nodes to keep the connection alive

        [Peer]
        PublicKey = oVVQuMR2hHCW+a5y0w4BS9ySOQK2pp8Tkba4RP5TByM=
        AllowedIPs = 192.168.0.7/32
        Endpoint = 192.41.237.213:51820
        PersistentKeepalive = 30

        [Peer]
        PublicKey = BFh6AaxOf8rmDE68BtRcdcEIrQRrx6TklfZozLm3d28=
        AllowedIPs = 192.168.0.8/32
        Endpoint = 206.12.98.227:51820

         

        Kubespray config sample - each node is labeled with its CRIC site as well as the institution where it sits (a label-selector usage sketch follows the kubectl output below):


        # ...
           uchicago005.hl-lhc.io:
             ansible_host: 192.168.0.5
             ip: 192.168.0.5
             access_ip: 192.168.0.5
             node_labels:
               site: mwt2
               institution: uchicago
           umich001.hl-lhc.io:
             ansible_host: 192.168.0.6
             ip: 192.168.0.6
             access_ip: 192.168.0.6
             node_labels:
               site: aglt2
               institution: umich
           msu001.hl-lhc.io:
             ansible_host: 192.168.0.7
             ip: 192.168.0.7
             access_ip: 192.168.0.7
             node_labels:
               site: aglt2
               institution: msu
           uvic001.hl-lhc.io:
             ansible_host: 192.168.0.8
             ip: 192.168.0.8
             access_ip: 192.168.0.8
             node_labels:
               site: uvic
               institution: uvic
        # ...

         

        Kubectl:

        [root@uchicago002 ~]# kubectl get nodes
        NAME                    STATUS   ROLES           AGE     VERSION
        msu001.hl-lhc.io        Ready    <none>          6d21h   v1.28.6
        uchicago002.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
        uchicago003.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
        uchicago004.hl-lhc.io   Ready    control-plane   6d21h   v1.28.6
        uchicago005.hl-lhc.io   Ready    <none>          6d20h   v1.28.6
        umich001.hl-lhc.io      Ready    <none>          6d21h   v1.28.6
        uvic001.hl-lhc.io       Ready    <none>          6d21h   v1.28.6
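        Illustrative use of the site/institution labels above (assumes a kubeconfig pointing at this cluster): tooling or workloads can select the nodes belonging to a given site.

        # List the nodes of one site by label, using the Kubernetes Python client.
        from kubernetes import client, config

        config.load_kube_config()
        v1 = client.CoreV1Api()

        for node in v1.list_node(label_selector="site=aglt2").items:
            labels = node.metadata.labels or {}
            print(node.metadata.name, labels.get("site"), labels.get("institution"))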

         

         

    • 14:10 14:20
      AOB 10m