US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00 → 13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
The Throughput Computing 2023 Week is fast approaching, https://agenda.hep.wisc.edu/event/2014/. We are planning a joint session with CMS for most of the day on topics of mutual interest. We will start with ATLAS-only, then go to the joint session for the day.
Scratch agenda, https://docs.google.com/spreadsheets/d/169Ey-AqykAIHU11VaGHyQ0Mq4IRJha2qiJtAOljRzJE/edit?usp=sharing.
Some takeaways from S&C:
- migration / support for tokens
- DC24 prep
-
13:10 → 13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:20 → 13:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Discussion about storage token scaling for DC24 from IAM, Rucio and FTS viewpoints at today's DOMA-BDT meeting. Very useful slides presented (link).
- Smooth running, save for some minor issues at OU and MWT2
- Second of two planned network interventions at BNL tomorrow
-
13:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
13:30
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- The SWT2_CPB_K8S_TEST cluster was rebuilt, and new nodes were added to the cluster. We are now above 1K running job slots (we were slightly below 1K on the pre-scrubbing plots).
- Meanwhile, the SWT2_CPB_K8S cluster was drained.
- We switched the configurations in CRIC and Harvester so that the new cluster continues to run under the SWT2_CPB_K8S queue, and the SWT2_CPB_K8S_TEST queue was disabled.
- The new SWT2_CPB_K8S cluster is running fine.
- While monitoring it, I noticed an artificial peak in the running job slots. The bump was present at all grid sites as well, so I opened a SNOW ticket: https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2335613 . It appears that a restart of the collecting agent resulted in duplicated records ... (see the sketch after this list)
- Got a node from Patrick that was not used in production, to reinstall Prometheus on as a dedicated node.
- Also looking into job accounting reporting and the available options ...
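As a rough illustration of the duplicated-record issue above, a minimal sketch of filtering duplicates out of monitoring data before plotting. The record fields (timestamp, site, slots) are hypothetical stand-ins for whatever the real collecting agent emits.

```python
# Sketch: drop duplicated monitoring records before plotting running-slot counts.
# The record layout below is hypothetical; the real collector schema may differ.
from collections import OrderedDict

def deduplicate(records):
    """Keep the first record seen for each (timestamp, site) pair."""
    seen = OrderedDict()
    for rec in records:
        key = (rec["timestamp"], rec["site"])
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

if __name__ == "__main__":
    records = [
        {"timestamp": "2023-06-20T12:00", "site": "SWT2_CPB_K8S", "slots": 1024},
        # Duplicate emitted after a collector restart:
        {"timestamp": "2023-06-20T12:00", "site": "SWT2_CPB_K8S", "slots": 1024},
        {"timestamp": "2023-06-20T12:05", "site": "SWT2_CPB_K8S", "slots": 1030},
    ]
    for rec in deduplicate(records):
        print(rec)
```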
-
13:40 → 13:45
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:45 → 14:05
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Will miss today's meeting because of a conflict. Sorry for the late notice.
- There was good running over the past month; the main outage was caused by an EOS issue on May 30-31.
- NET2.1 continues to progress and they will report today.
- They have made good progress on all fronts: compute, storage, and network.
- If anybody has input for the Tier-2 operations report at the scrubbing on July 7, send it to me ASAP.
- I expect Brian will have already said that OSG has now released a version of 3.6 that includes Condor 10 / Condor-CE 6 and supports AlmaLinux 9.
- So let the testing begin...
-
13:45
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
- Smooth operation.
- MSU deployed the 4x R740xd2 with 20T drives (previously waiting on network config). UM reserved 5 R740xd2 nodes to compare ZFS and RAID IO performance (see the sketch below); the other 7 nodes are in production.
- Updated dCache from 8.2.13 to 8.2.24, updated the firmware, and rebooted all dCache nodes.
- Current problem: 2x R740xd2 with 18T drives from Feb-2023 lost their 25G NIC after a Dell FW update.
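For the ZFS vs. hardware-RAID comparison, a minimal sketch of a sequential-write timing test, assuming placeholder mount points for the two configurations; a real comparison would use a dedicated tool such as fio with direct I/O and several access patterns.

```python
# Sketch: compare sequential write throughput on two mount points.
# The paths are placeholders for the ZFS and hardware-RAID test nodes.
import os
import time

BLOCK = 4 * 1024 * 1024          # 4 MiB per write
TOTAL = 1024 * 1024 * 1024       # 1 GiB per test file

def write_test(path):
    """Return MB/s for a simple sequential write of TOTAL bytes to path."""
    data = os.urandom(BLOCK)
    start = time.monotonic()
    with open(path, "wb") as f:
        written = 0
        while written < TOTAL:
            f.write(data)
            written += BLOCK
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.monotonic() - start
    os.remove(path)
    return TOTAL / elapsed / 1e6

if __name__ == "__main__":
    for label, test_file in [("zfs", "/zfs-pool/iotest.bin"),
                             ("raid", "/raid-array/iotest.bin")]:
        print(f"{label}: {write_test(test_file):.1f} MB/s")
```
-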
13:50
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- New storage nodes racked and cabled at UChicago. Currently, we are working on building them.
- One of the storage nodes had a bad DIMM which was causing transfer problems during the first week of June. The issue was resolved after replacing the DIMM.
- Another storage node was unresponsive on 06/19/2023 and had to be rebooted.
- UChicago experienced a network outage on 06/13/2023 due to a distribution switch failure. The switch was replaced today.
-
13:55
NET2 5m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
* Network problems with NESE were solved after the network gear upgrade.
* 8 machines integrated into the NESE storage.
* Load tests being performed with Hiro.
* Contact Fabio to start the publishing process.
* With help from the Red Hat team, we were able to debug the OpenShift installation, and our Kubernetes cluster is now operational.
* Next steps are to install software to receive jobs and establish a queue.
-
14:00
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
- Installing machines and balancing heat/power draws is continuing
- Seeing issues with the overall heat load in the machine room
- May need to accelerate the retirement of R410 nodes (we have 200+)
- May add a rack to ease future power additions
- Production running mostly well, except for incidents where tripped breakers killed many running jobs
OU:
- Needed to expand the size of the home disk for the condor-ce spool area, so we had to dump all jobs yesterday to switch over.
- The reason was partly that some jobs keep putting large input files (AOD, RDO, ...) into the condor-ce spool area rather than directly into /lscratch/ (WN_TMP). Why? See the sketch after this list.
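To quantify how much of the spool growth comes from staged input files, a minimal sketch that walks the Condor-CE spool area and reports large files; the spool path and the 1 GiB threshold are assumptions to adjust to the local layout.

```python
# Sketch: report large staged files (e.g. AOD/RDO inputs) in the Condor-CE spool.
# The spool path and the size threshold are assumptions.
import os

SPOOL_DIR = "/var/lib/condor-ce/spool"   # assumed spool location
THRESHOLD = 1024**3                      # 1 GiB

def large_files(root, threshold):
    """Yield (path, size) for files under root at or above threshold bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue
            if size >= threshold:
                yield path, size

if __name__ == "__main__":
    total = 0
    for path, size in large_files(SPOOL_DIR, THRESHOLD):
        total += size
        print(f"{size / 1024**3:6.1f} GiB  {path}")
    print(f"Total in large files: {total / 1024**3:.1f} GiB")
```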
-
14:05 → 14:10
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
TACC
- atlas-cvmfsexec wrapper updated to add some file locking to solve a race when many jobs start at once (see the sketch below)
- Some slots have freed up; ~5K jobs finished over the weekend with 93% efficiency
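The file-locking fix above can be illustrated with a minimal Python sketch using fcntl.flock to serialize a one-time, per-node setup step; this shows the technique only, it is not the actual atlas-cvmfsexec wrapper code, and the lock path is an assumption.

```python
# Sketch: serialize a shared setup step (e.g. mounting CVMFS) when many jobs
# land on the same node at once. Illustrative only; not the actual wrapper.
import fcntl
import os

LOCK_PATH = "/tmp/cvmfs-setup.lock"   # assumed per-node lock file

def setup_once():
    """Placeholder for the one-time, per-node setup step."""
    print(f"setup running in pid {os.getpid()}")

def run_with_lock():
    with open(LOCK_PATH, "w") as lock:
        # Block until we hold an exclusive lock: only one job performs the
        # setup at a time, the others wait and then proceed.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            setup_once()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

if __name__ == "__main__":
    run_with_lock()
```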
NERSC
- Fairly smooth running in the last week
- Needed to update Harvester to support new Raythena features, but the new version is causing issues: no jobs are starting. Investigating.
- Getting users' jobs onto the GPU queue; feedback on the job time limit and the Python environment.
-
14:10 → 14:25
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- Working with BNL web admins (C. Lepore, L. Pelosi) to deploy an updated landing page for the BNL JupyterHub web frontend. This will include links for sign-up, documentation, and Discourse support. The documentation also needs updating/reorganization.
- Discussion of this and a containerization update at the 2.3/5 meeting tomorrow
-
14:15
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- MLflow is installed and available at the UC AF - mlflow (see the usage sketch below).
- AF compute node random reboot issue - we noticed random node reboots after switching to the mainline kernel about a month ago. Reverted to the RH kernel for now and haven't noticed a reboot since the switch.
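For reference, a minimal usage sketch of logging a run to an MLflow tracking server; the tracking URI and experiment name below are placeholders for the UC AF instance.

```python
# Sketch: log a run to an MLflow tracking server.
# The tracking URI and experiment name are placeholders.
import mlflow

mlflow.set_tracking_uri("https://mlflow.example.org")
mlflow.set_experiment("af-demo")

with mlflow.start_run(run_name="smoke-test"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_metric("loss", 0.42, step=1)
    mlflow.log_metric("loss", 0.17, step=2)
```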
-
14:25 → 14:35
AOB 10m
-