US ATLAS Computing Facility

US/Eastern
    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      WLCG workshop on analysis facilities the weekend before CHEP 2023: https://indico.cern.ch/event/1230126/


      From Zach:


      Thank you all for attending S&C Week #74! Recordings have been added to the agenda.


      Feel free to sign up for atlas-comp-deriv-phys-physlite, a new list for the coordination, discussion, and advertising of PHYS/LITE derivation production campaigns.

      The WLCG DOMA meeting was today:  https://indico.cern.ch/event/1232370/     Notes on preparing for the next WLCG Data Challenge are at https://docs.google.com/document/d/1qorG3JYNW5XZx51pTyJM6s_t7-blYr9r--FkYUyhQeQ/edit# 

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
      • EL9
        • Initial EL9 tests kicked off yesterday: found some missing packages
        • We have some concerns about issues with XRootD, EL9, and OpenSSL 3
      • HTCondor
        • 10.2.0 targeted for an upcoming release (this week?)
        • HTCondor-CE 6 targeted for some time in the next few weeks: the major version was bumped to align the authentication/authorization configuration with the rest of HTCondor, but it was done in such a way that it should be transparent to sites
        • See the F2F slides for a stab at the Venn diagram of GSI + EL9 + HTCondor versions: https://indico.cern.ch/event/1201515/contributions/5141108/attachments/2559076/4411772/2022-12-02.osg-lhc-atlas-f2f.pdf
        • N.B. it is important to differentiate between GSI and X.509 support: we are dropping GSI support at the authentication level; X.509 proxy delegation down to the worker node will continue to work
      • Throughput Computing 2023 in July
        • Joint HTCondor Week and OSG All Hands in Madison, WI
        • For the ATLAS (and CMS?) session, do you need AV equipment?
    • 13:20 13:25
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      smooth running

      Farm team: starting tests of HTCondor 10, including a new HTCondor-CE submitting to HTCondor 9 resources, so that a rolling HTCondor 10 upgrade can be done.

      Farm team successfully recovered more compute nodes (replaced failed hard drives)

    • 13:25 13:45
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonably good running since the previous meeting on January 3rd.
        • NET2.0 (BU) shut down for good on January 11th, except that the BU_NESE storage remains available until the NET2.1 (UMass) storage is online and the data can be transferred off of it to the new storage. At that point the old storage will be added to the new NET2.1 storage pool.
      • Progress continues on NET2.1 which Eduardo will describe in their report.
      • We should be able to procure in a couple of weeks once the paperwork is fully signed.
      • Operations plans are due for each Tier 2 on February 28.
        • The Google directory holding the operations plans is here. Please use the AGLT2 plan as a template.
      • 13:25
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

        A file missing from a Rucio dataset caused transfer errors; we declared the file lost to Rucio.
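
        For reference, lost replicas are usually flagged through the Rucio client API (or the equivalent rucio-admin CLI). A minimal sketch, assuming a configured Rucio client environment; the PFN and reason below are placeholders, not the actual AGLT2 values:

        # Minimal sketch: declare a lost replica to Rucio so it is re-transferred
        # from another copy (or marked lost if no other copy exists).
        # Assumes a configured Rucio client (account + X.509/token auth); the PFN
        # below is a placeholder, not the real AGLT2 file.
        from rucio.client import Client

        client = Client()
        bad_pfns = ["davs://dcache-door.example.edu:2880/pnfs/path/to/lost/file.root"]
        client.declare_bad_file_replicas(bad_pfns, reason="file lost on disk at AGLT2")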

        One of the dCache doors at the UM site had its /var partition flooded by iptables logging, which caused a 40% job failure rate.
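
        Not discussed in the meeting, but as a preventive illustration: a minimal disk-usage check (the partition path and 85% threshold are assumptions) that could alert before runaway logging fills /var on a door again:

        # Minimal sketch: warn when a partition crosses a usage threshold, so that
        # runaway logging (e.g. iptables messages) is caught before it fills /var
        # and breaks a dCache door. Path and threshold are illustrative assumptions.
        import shutil

        def check_partition(path="/var", threshold=0.85):
            usage = shutil.disk_usage(path)
            used_fraction = usage.used / usage.total
            if used_fraction > threshold:
                print(f"WARNING: {path} is {used_fraction:.0%} full "
                      f"({usage.used / 1e9:.1f} of {usage.total / 1e9:.1f} GB)")
            return used_fraction

        if __name__ == "__main__":
            check_partition()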

        Fixed a few things (mostly adding compatible library files and more restrictive sudo authentication) to enable BOINC backfill jobs to continue running on CentOS Stream 8.

        We completed three milestones due at the end of December 2022 and January 2023:

        1. Enabled UEFI boot in Cobbler and used it to build the CentOS Stream 8 OS.
        2. Updated VMware from 6.7 to 7.0.
        3. Installed TrueNAS at MSU and upgraded both UM and MSU from Angelfish to Bluefin (the part of the TrueNAS milestone that was missing was having the storage for the MSU VMware).



      • 13:30
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Edward James Dambik (Indiana University (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
        • UIUC had a PM on January 18th. Everything came back up smoothly.
        • Working on upgrading to Condor 10.
        • Discussing procurement plan and deciding how to divide equipment between sites.
        • We switched from Squid to Varnish. Currently, we are testing it out and fixing any issues that come up.
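
        A simple way to sanity-check the new Varnish instance during this testing is to request the same URL twice and look at the caching headers Varnish adds; a minimal sketch, with the endpoint URL as a placeholder rather than the actual MWT2 service:

        # Minimal sketch: verify a Varnish cache is returning hits by fetching the
        # same URL twice and inspecting the Age / X-Varnish response headers.
        # The URL is a placeholder, not the real MWT2 endpoint.
        import requests

        url = "http://varnish.example.org:6081/frontier/or/cvmfs/path"

        for label in ("first", "second"):
            resp = requests.get(url)
            print(label, resp.status_code,
                  "Age:", resp.headers.get("Age"),
                  "X-Varnish:", resp.headers.get("X-Varnish"))

        # A cache hit typically shows a non-zero Age and two transaction IDs
        # in the X-Varnish header.
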
      • 13:35
        NET2 5m
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
        • Transfer of machines ongoing: one rack installed in the current UMass POD for demonstration; the remaining machines are crated until the new racks are prepared.
        • Racks: ongoing work with MGHPCC and UMass to start infrastructure work.
        • Network: ESnet connection will be done in Boston instead of NYC in preparation for high-speed connection later in the year. Being worked out this week.
        • Disks: dCache system being tested. Tape DTN being transferred to UMass.
      • 13:40
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA

        • Recurring problems with transfers
          • Disabled the test queue
          • Update door versions
            • Would like to test offline, but may have to install on production first
            • Potentially go to a slightly older version, as on the other door
        • Accounting discrepancy between the GRACC and ATLAS numbers
          • Probably related to the test queue, but more investigation is needed
        • Network Upgrade
          • Problem with the long cable runs between the ToR and core switches
          • Forward Error Correction (FEC) not negotiated correctly; it can be set manually
          • Dell's recommendation is to update the OS/firmware on the switches

        OU

        • Running well, no issues
        • Working on installing an AlmaLinux 9 HTCondor-CE test gatekeeper (GK)
    • 13:45 13:50
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      TACC 

       - Running jobs at the scale of ~10 nodes (1 node per Slurm job) for about a week now

       - Tried scaling to multi-node jobs but encountered SIGBUS errors; the cause has not been tracked down yet.


      NERSC

       - Some large job failures at Cori due to not being able to find Sim_tf. We believe this was because the queue was set to online instead of brokeroff after the downtime.

       - HammerCloud tests have been running on Perlmutter; the machine is currently down for emergency network maintenance.


    • 13:50 14:05
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 13:50
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • A user has been set up to run Athena GPU studies on the federated hub
          • Exposed an environment issue that Doug identified and quickly fixed
          • Need updated documentation
        • Doug has made progress launching notebooks on GPU nodes using HTCondor (+Dask)
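
        For context, one common way to launch Dask workers on GPU batch nodes from a notebook is dask-jobqueue's HTCondorCluster; a minimal sketch, with resource sizes and the GPU request directive as assumptions rather than the actual BNL configuration:

        # Minimal sketch: start Dask workers as HTCondor jobs on GPU nodes from a
        # notebook. Resource sizes and the request_gpus directive are illustrative
        # assumptions, not the BNL production settings.
        from dask_jobqueue import HTCondorCluster
        from dask.distributed import Client

        cluster = HTCondorCluster(
            cores=4,
            memory="16 GB",
            disk="10 GB",
            # Ask HTCondor for one GPU per worker job (recent dask-jobqueue releases
            # use job_extra_directives; older ones call the same option job_extra).
            job_extra_directives={"request_gpus": "1"},
        )
        cluster.scale(jobs=2)     # submit two worker jobs
        client = Client(cluster)  # notebook code can now submit work to the workers
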
      • 13:55
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:00
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • Getting container image caching service deployed on AF at UChicago
          • Two components: a Harbor instance and a mutation controller (Kyverno with a cluster policy)
          • Transparent to users
          • Caches publicly accessible images; private images will not be cached.
    • 14:05 14:25
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Global Rucio meltdown on Monday (caused by URL-signing requests for Google cloud storage overloading the Rucio servers?)
      • WLCG DOMA General Meeting earlier today focused on discussion of DC24 planning document.
        • WLCG coordination and oversight, but no specific person-power for this.  Bottom-up approach with self-forming "topical splinter groups."
        • Target date is "early 2024," before the LHC run commences, lasting "multiple weeks." A possible "pre-challenge" concurrent with SC23 in November.
        • Target rates likely around 20-25% of HL-LHC rates
        • DUNE and Belle II added as participants; their targets are TBD. More direct involvement of NRENs as well.
        • Token auth only for disk endpoints.
        • Demonstrate Tape REST-API for recall operations on selected sites.
        • Involvement of Analysis Facilities?
      • Status of ARC-CE at SLAC? (GGUS)
      • BNL VP running well, up to 1.5K jobs.  Can we ramp up further?
        • Checksum errors completely dominated by accesses to SWT2 (link)
      • Possible attempt to update IAM again next week (GGUS)
      • RAC meeting tomorrow re: LOCALGROUPDISK requirements
      • 14:05
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 14:10
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        XCaches

        • working fine

        VP

        • a lot of errors from Taiwan and SWT2 endpoints
        • looking at PanDA brokering decisions to understand what is limiting the number of jobs in the different queues

        Varnish

        • found and fixed a misconfiguration (became apparent when Fermilab lost IPv6 networking). 
        • running smoothly at UC and AGLT2.

        ServiceX

        • testing 1.1.3. If all is fine, will deploy it on UC AF.


      • 14:15
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))
        • The cluster is running fine. At times there was a spike in stage-out/stage-in errors, which must be related to connectivity issues (ticketed in ggus:160124).
        • Occasionally there are OOM terminations of jobs running on the old nodes with 2 GB/core (never on the nodes with more memory per core). The resource requests on those nodes are optimized to run at maximum job occupancy, so when a task uses much more memory than it requested, the node may run out of memory and jobs get terminated. Looking into a solution that may work for all node types. While investigating this, I noticed that some PanDA pages (lookup by WorkerID) do not show any job selection; pinged pandamon support.
        • Trying to optimize the job CPU-request coefficient sent from Harvester (the default scale-down value is 0.9). The idea is to not overcommit the node CPU while still leaving CPU-request space for other system/auxiliary pods (see the worked sketch after this list).
        • Follow-up on the issue with the parameter name "resource_type_limits.SCORE_HIMEM" in CRIC, which had an extra-space typo in it. Fixed it for SWT2_CPB_K8S and also pinged Ryan (Victoria), who fixed it for his site too, but for a while there was no feedback from the CRIC experts, so Fernando deleted that parameter from CRIC. With some delay we got a response from Alexey: he checked the database and found no more "bad" names, and they now also have a mechanism in place to avoid introducing parameters with such typos.
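
        Regarding the CPU-request coefficient above, a worked sketch of the idea; the core counts, job size, and resulting headroom are illustrative assumptions, not the actual SWT2 K8s settings:

        # Worked sketch: scale down per-job CPU requests so that a full node's worth
        # of payload pods still leaves room for system/auxiliary pods.
        # Numbers (cores, coefficient, job size) are illustrative assumptions.
        cores_per_node = 48            # physical cores available to Kubernetes
        cpu_scale_coefficient = 0.9    # Harvester default scale-down factor
        cores_per_job = 8              # e.g. an 8-core ATLAS production job

        # CPU request actually placed on each pod, in Kubernetes millicores
        request_millicores = int(cores_per_job * cpu_scale_coefficient * 1000)
        print(f"pod CPU request: {request_millicores}m")   # 7200m instead of 8000m

        # Six such pods fit on the node while leaving headroom for other pods
        jobs_per_node = cores_per_node // cores_per_job
        leftover = cores_per_node * 1000 - jobs_per_node * request_millicores
        print(f"{jobs_per_node} jobs per node, {leftover}m left for system/auxiliary pods")
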
    • 14:25 14:35
      AOB 10m