
US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      Planning has started for the annual US ATLAS Technical and Pre-Scrubbing meeting.

      This will be held June 5-7, 2023, at Indiana University Bloomington.  

      Block agenda, TBC:

      June 5 (all day) - Technical S&C talks I

      June 6 (half day) - Technical S&C talks II

      June 6 (pm) - Pre-Scrubbing (closed)

      June 7 (all day) - Scrubbing (closed)

      • Please let Shawn and me know if you'd like to present
      • Next week's facility coordination meeting will get started on pre-scrubbing

      ATLAS S&C Plenary tomorrow: https://indico.cern.ch/event/1268248/

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release

      1. Initial EL9 release expected tomorrow!
        1. Contains HTCondor-CE 6 and the HTCondor feature series (10.3.0)
        2. We don't have osg-ca-certs with any SHA1 CA workarounds yet, so sites should at least temporarily downgrade the default crypto policy (except for compute services like CEs/local condor); see the sketch after this list
      2. XRootD 5.5.4 with xrdcl-http is available in osg-testing
      3. Gratia Probe 2.8.4 (maybe this week) fixing issues with HTCondor APs introduced in 2.8.1
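      For EL9 sites that need the temporary crypto-policy downgrade mentioned above, a minimal sketch follows; the choice of the DEFAULT:SHA1 subpolicy (versus the broader LEGACY policy) and the Python wrapper are illustrative assumptions, not an OSG recommendation, and CE/local condor hosts should be left alone.

```python
#!/usr/bin/env python3
"""Sketch: temporarily relax the EL9 crypto policy so SHA1-signed CAs keep working.

Assumptions: run as root on an EL9 host that is NOT a CE/local condor node, and
that the DEFAULT:SHA1 subpolicy is acceptable at the site (LEGACY is the broader
alternative). Revert later with `update-crypto-policies --set DEFAULT`.
"""
import subprocess


def current_policy() -> str:
    # `update-crypto-policies --show` prints the active policy, e.g. "DEFAULT".
    return subprocess.run(
        ["update-crypto-policies", "--show"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def relax_for_sha1(policy: str = "DEFAULT:SHA1") -> None:
    # Applies the relaxed policy system-wide; affected services may need a restart.
    subprocess.run(["update-crypto-policies", "--set", policy], check=True)


if __name__ == "__main__":
    print("Active crypto policy:", current_policy())
    # relax_for_sha1()  # uncomment once the site has confirmed its policy choice
```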
    • 13:20 13:40
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • Updates on DC24 plans in DOMA General Meeting today
        • Planning document
        • Some question of T2-T2 component
      • S&C Plenary Demonstrators discussion tomorrow
      • Milestone #240 to be delayed by 1 month due to SWT2_CPB cluster network hardware upgrade and reconfig
      • Condor queue quota reconfiguration at BNL today (see T1 report)
      • BNL XRootd doors update in progress (GGUS)
      • GGUS down today 
      • 13:20
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speaker: Mark Sosebee (University of Texas at Arlington (US))
      • 13:25
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
      • 13:30
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))
        • Some failures with stage-in errors, discussed in a DPA email thread. The 300 s timeout on rucio get is a known problem, fixed in the Rucio 1.30.8 release, which is under testing right now and will probably be deployed this week; hopefully that will resolve the failures.
        • The next big step is to recreate the SWT2_CPB_K8S cluster within the SWT2_CPB main cluster network. We have two compute nodes (one master + one worker) to start a new K8S cluster; once it is up, we will migrate the current K8S cluster and add more worker nodes.
        • Optimizing the job CPU-request coefficient sent from Harvester (the default scale-down value is 0.9) so that the node CPU is not overcommitted. The value in CRIC was further changed from 0.94 to 0.98, which, as verified, addresses the issue; things are running fine so far (see the sketch below).
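        To make the effect of the coefficient concrete, a back-of-the-envelope sketch is given below; the node and job sizes in it are hypothetical, not SWT2_CPB values. The pod CPU request is roughly coefficient × corecount, so a smaller coefficient lets Kubernetes pack more pods onto a node than the payloads' real CPU use allows.

```python
"""Sketch: how the Harvester/CRIC CPU-request coefficient affects node packing.

The numbers below (allocatable cores, job core count) are hypothetical; the point
is only that request = coefficient * corecount, and that a smaller coefficient can
let the scheduler overcommit the node's real CPU.
"""


def packing(allocatable_cores: float, job_cores: int, coefficient: float):
    """Return (pods scheduled, per-pod CPU request, total payload CPU demand)."""
    request = coefficient * job_cores          # CPU request Harvester attaches to the pod
    pods = int(allocatable_cores // request)   # how many pods Kubernetes will place
    demand = pods * job_cores                  # cores the payloads actually try to use
    return pods, request, demand


if __name__ == "__main__":
    ALLOCATABLE = 94.0   # hypothetical allocatable cores on a worker node
    JOB_CORES = 8        # typical multicore PanDA job
    for coeff in (0.90, 0.94, 0.98):
        pods, req, demand = packing(ALLOCATABLE, JOB_CORES, coeff)
        status = "fits" if demand <= ALLOCATABLE else f"overcommits by {demand - ALLOCATABLE:g} cores"
        print(f"coeff={coeff:.2f}: request={req:.2f} cores/pod, {pods} pods, demand={demand} cores -> {status}")
```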
    • 13:40 13:45
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
      • Stable running
      • Still waiting for the CPU delivery (due by the end of next week) to meet the April 2023 pledge.
      • Today, Chris Hollowell updated the ATLAS Tier 1 quota so all production jobs have the same quota (prod.all), including updating all the gatekeepers.
        • Already tested on one gatekeeper since Friday. 
    • 13:45 14:05
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Reasonable running with some issues in the last 30 days.
        • AGLT2 did some rolling updates (e.g., HTCondor 10) that resulted in partial draining.
        • MWT2 had some trouble receiving enough work from the production system.
        • OU was down for several days to install a new gatekeeper.
      • NET2.1 is progressing
      • I need to write the global tier 2 operations plan today.
      • Please report on your procurement activities.
      • 13:45
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen

         

        Updated CVMFS to use a Varnish-style cache server from SLATE.
        Each site (MSU/UM) now has its own Varnish instance running on the site's own SLATE cluster.
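        For reference, a minimal sketch of the kind of check that can be run from a worker node to confirm CVMFS traffic really flows through the local cache; the Varnish endpoint and the Stratum-1 URL are placeholders, not the actual AGLT2 SLATE services.

```python
"""Sketch: verify a CVMFS HTTP cache (squid/varnish-style) answers for a repository.

The proxy endpoint and Stratum-1 URL below are placeholders; substitute the site's
own SLATE Varnish service and a Stratum-1 the site actually uses. This mimics what
the CVMFS client does: fetch .cvmfspublished through the configured proxy.
"""
import urllib.request

PROXY = "http://varnish.example.aglt2.org:6081"   # hypothetical cache endpoint
URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY})
)

with opener.open(URL, timeout=10) as resp:
    body = resp.read(512)
    print("HTTP status:", resp.status)
    print("X-Cache header:", resp.headers.get("X-Cache", "<none>"))  # hit/miss, if the cache exposes it
    print("First bytes of .cvmfspublished:", body[:40])
```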

         

        3/10/2023
        MSU upgraded the Juniper OS in all rack data switches to fix a bug that prevented access to a read-only account on our rack switches.
        Some MSU worker nodes had a burst of failed jobs from a couple of sources:
        some expected, because some of the redundant/bonded cabling was still missing (solved; all cabled now);
        some unexpected, related to the force-on setting needed for provisioning (avoidable in the future).

         

        3/16

        A new security kernel update became available, so we applied the new kernel to all of our worker nodes and interactive login nodes and rebooted them into it. We also took the opportunity to update the worker-node firmware. This process required draining the HTCondor cluster, and for most of the time BOINC backfill jobs filled the draining job slots. Only for two days, while we were draining a small batch of worker nodes, did the BOINC queue have no available jobs, so some draining job slots remained unfilled.

         

        2023 equipment orders placed at MSU and UM:
        18x R740xd2 with 20 TB drives for an estimated 6.8 PB in dCache (minus retirements)
        12x R6525 with AMD 7443 for 1152 cores or an estimated ~2k HS06 (minus retirements)
        a second NVMe storage node for the MSU VMware cluster
        a second NVMe storage node for MSU SLATE
        another NVMe storage node for UM
        also a storage node and a GPU node for the UM T3

      • 13:50
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        Upgrading condor to 10.x and adding two additional gatekeepers

        Investigated squid failovers that were being misattributed to MWT2. 

        Investigating CVMFS failures and stuck jobs

        Site struggling to stay full during the lack of simulation work

        Updated CRIC settings for MWT2_VHIMEM_UCORE

        Preparing RFQ for storage at UC; UIUC working on worker node purchase; IU preparing purchase now that sub-award is fully executed.

      • 13:55
        NET2 5m
        Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)

        Working with ATLAS experts towards final configuration for dCache.

        Setting up the DNS configuration: coordinating between UMass and Harvard to have both machines under net2.mghpcc.org. Configuration is ongoing.

        With the DNS work ongoing, the IGTF certificate requests for both the Harvard and UMass machines are also in progress.

      • 14:00
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        OU

        • New AlmaLinux9 osg-36 HTCondor-CE GK working well and stably
        • Last new compute nodes should be brought online this week
        • Brought up new 10 TB /xrd_test/ xrootd-cephfs test instance on se1 on port 64000
        • Initial tests successful, thanks to Wei and Hiro and Andy, but more testing needed to try to improve throughput
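        For the further throughput tests, a sketch along these lines could time xrdcp transfers against the new instance; the se1 hostname and the test file are placeholders, and only the port and mount point come from the item above.

```python
"""Sketch: time xrdcp transfers to the new xrootd-cephfs test instance.

Assumes xrdcp (from the xrootd-clients package) is installed and that the endpoint
and path below match the real se1 host; both are placeholders here.
"""
import subprocess
import time

ENDPOINT = "root://se1.example.ou.edu:64000"   # hypothetical FQDN; port from the test setup
REMOTE_DIR = "/xrd_test"


def timed_copy(local_file: str, size_gb: float) -> float:
    """Copy local_file to the test instance and return the achieved rate in Gb/s."""
    dest = f"{ENDPOINT}/{REMOTE_DIR}/throughput_test.dat"
    start = time.monotonic()
    subprocess.run(["xrdcp", "--force", local_file, dest], check=True)
    elapsed = time.monotonic() - start
    return size_gb * 8 / elapsed


if __name__ == "__main__":
    # e.g. a 4 GB test file created with: dd if=/dev/urandom of=test.dat bs=1M count=4096
    print(f"throughput: {timed_copy('test.dat', 4.0):.2f} Gb/s")
```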

        UTA

        • Network update underway
          • Will attempt to do two or three racks per day until the old network components are all replaced
          • Also performing updates to nodes as we go
        • Provisioning for the updated K8s cluster
          • New master node and compute node on CPB's network are in place
          • Need to get rucio mover working with stage-out to internal xrootd door before scaling up 
        • Scheduling a meeting with UTA Networking to discuss several items including getting the network monitoring in place

         

    • 14:05 14:10
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      NERSC

      • Moved Harvester into the 'workflow queue' - it now runs as a very long-running (30-day) job.
        • There may be bugs - still evaluating
      • NERSC_Perlmutter_GPU configuration merged into the GitHub repo for the Perlmutter Harvester config
        • Rui has tested it with a simple TensorFlow test end to end (https://bigpanda.cern.ch/job?pandaid=5799977799)
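      For context, the end-to-end check is of the kind sketched below; this is an illustrative stand-in, not the actual payload of the BigPanDA job linked above.

```python
"""Sketch: minimal TensorFlow GPU sanity check of the kind used to validate
NERSC_Perlmutter_GPU end to end. Illustrative only; not the actual test payload.
"""
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.__version__}, visible GPUs: {gpus}")

# Run a small matrix multiplication on the first GPU (falls back to CPU if none).
device = "/GPU:0" if gpus else "/CPU:0"
with tf.device(device):
    a = tf.random.normal((2048, 2048))
    b = tf.random.normal((2048, 2048))
    c = tf.linalg.matmul(a, b)
print(f"matmul on {device}: result norm = {tf.norm(c).numpy():.3e}")
```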

      TACC

      • cvmfsexec is working well. Working on scaling up but stuck behind very long queue times.
      • Modified the Harvester code to not special-case log files. It may have helped; still checking files.
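      As a reminder of the pattern in use at TACC, a minimal cvmfsexec sketch follows; the repository list and payload command are examples, not the actual Harvester wrapper.

```python
"""Sketch: run a command under cvmfsexec (github.com/cvmfs/cvmfsexec) so the job
sees /cvmfs without a system-wide mount. The repositories and payload below are
examples, not the actual TACC/Harvester wrapper.
"""
import subprocess

REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch", "sft.cern.ch"]
PAYLOAD = ["ls", "/cvmfs/atlas.cern.ch"]

# cvmfsexec takes the repositories to mount, then `--`, then the command to run.
subprocess.run(["./cvmfsexec", *REPOS, "--", *PAYLOAD], check=True)
```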

       

    • 14:10 14:25
      WBS 2.3.4 Analysis Facilities
      Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:10
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • Congratulations to Doug on being named AMG Analysis Workflow Containerization Activity Contact along with Matt Feickert
          • The goal is to put together infrastructure and examples of containers for analysis workflows for wide use in ATLAS. This effort connects AMG and other software groups with ADC. The focus is on analysis steps that could be carried out at a range of resources including lxplus/lxbatch or local analysis resources at various institutes and facilities (cloud, etc).
        • Useful talk on lxplus/lxbatch at last week's AF Forum
        • Pre-CHEP workshop timetable pretty much finalized
        • I will need help with the CHEP presentation; ATLAS seems to want it for review 30 days in advance
      • 14:15
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - Chicago 5m
        Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
        • ServiceX deployment at CERN FABRIC site
          • Working with FABRIC team on getting external IPV6 networking service to work on the FABRIC slice
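          A quick way to test this from inside the slice is sketched below; the target host is just a well-known example, not part of the ServiceX deployment.

```python
"""Sketch: check that a node (e.g. on a FABRIC slice) has working outbound IPv6.
The target host is only an example of a dual-stacked service."""
import socket


def has_external_ipv6(host: str = "www.cern.ch", port: int = 443, timeout: float = 5.0) -> bool:
    """Try to open an IPv6 TCP connection to a well-known host."""
    try:
        addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)[0][4]
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(addr)
        return True
    except OSError:
        return False


print("external IPv6 reachable:", has_external_ipv6())
```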
    • 14:25 14:35
      AOB 10m