US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00 → 13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
The Throughput Computing 2023 Week is fast approaching, https://agenda.hep.wisc.edu/event/2014/. We are planning a joint session with CMS for most of the day on topics of mutual interest. We will start with ATLAS-only, then go to the joint session for the day.
Scratch agenda, https://docs.google.com/spreadsheets/d/169Ey-AqykAIHU11VaGHyQ0Mq4IRJha2qiJtAOljRzJE/edit?usp=sharing.
Some takeaways from S&C:
- migration / support for tokens
- DC24 prep
-
13:10 → 13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:20 → 13:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Discussion about storage token scaling for DC24 from IAM, Rucio and FTS viewpoints at today's DOMA-BDT meeting. Very useful slides presented (link).
- Smooth running, save for some minor issues at OU and MWT2
- Second of two planned network interventions at BNL tomorrow
-
13:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
13:30
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- The SWT2_CPB_K8S_TEST cluster was rebuilt, and new nodes were added to the cluster. We are now above 1K running job slots (we were slightly below 1K on the pre-scrubbing plots).
- Meanwhile, the SWT2_CPB_K8S cluster was drained.
- We switched the configurations in CRIC and Harvester so that the new cluster continues to run under the SWT2_CPB_K8S queue, and the SWT2_CPB_K8S_TEST queue was disabled.
- The new SWT2_CPB_K8S cluster is running fine.
- While monitoring it, I noticed an artificial peak in the running job slots. The bump was present at all grid sites as well, so I opened a SNOW ticket: https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF2335613 . It appears that a restart of the collecting agent resulted in duplicated records ... (see the sketch after this list)
- Got a node from Patrick that was not used in production, to reinstall Prometheus on as a dedicated node.
- Also looking into job accounting reporting and the available options ...
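As a rough illustration of the duplicated-record issue above, a minimal sketch of filtering duplicates out of monitoring data before plotting. The record fields (timestamp, site, slots) are hypothetical stand-ins for whatever the real collecting agent emits.

```python
# Sketch: drop duplicated monitoring records before plotting running-slot counts.
# The record layout below is hypothetical; the real collector schema may differ.
from collections import OrderedDict

def deduplicate(records):
    """Keep the first record seen for each (timestamp, site) pair."""
    seen = OrderedDict()
    for rec in records:
        key = (rec["timestamp"], rec["site"])
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

if __name__ == "__main__":
    records = [
        {"timestamp": "2023-06-20T12:00", "site": "SWT2_CPB_K8S", "slots": 1024},
        # Duplicate emitted after a collector restart:
        {"timestamp": "2023-06-20T12:00", "site": "SWT2_CPB_K8S", "slots": 1024},
        {"timestamp": "2023-06-20T12:05", "site": "SWT2_CPB_K8S", "slots": 1030},
    ]
    for rec in deduplicate(records):
        print(rec)
```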
-
13:40 → 13:45
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:45 → 14:05
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Will miss today's meeting because of a conflict. Sorry for the late notice.
- There was good running over the past month; the main outage was caused by an EOS issue on May 30-31.
- NET2.1 continues to progress and they will report today.
- They have made good progress on all fronts: compute, storage, and network.
- If anybody has input for the Tier-2 operations report at the scrubbing on July 7, send it to me ASAP.
- I expect Brian will have already said that OSG has now released a version of 3.6 that includes Condor 10 / Condor-CE 6 and supports AlmaLinux 9.
- So let the testing begin...
-
13:45
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
- Smooth operation.
- MSU deployed the 4x R740xd2 with 20T drives (previously waiting on network config). UM reserved 5 R740xd2 nodes to compare ZFS and RAID IO performance (see the sketch below); the other 7 nodes are in production.
- Updated dCache from 8.2.13 to 8.2.24, updated the firmware, and rebooted all dCache nodes.
- Current problem: 2x R740xd2 with 18T drives from Feb-2023 lost their 25G NIC after a Dell FW update.
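For the ZFS vs. hardware-RAID comparison, a minimal sketch of a sequential-write timing test, assuming placeholder mount points for the two configurations; a real comparison would use a dedicated tool such as fio with direct I/O and several access patterns.

```python
# Sketch: compare sequential write throughput on two mount points.
# The paths are placeholders for the ZFS and hardware-RAID test nodes.
import os
import time

BLOCK = 4 * 1024 * 1024          # 4 MiB per write
TOTAL = 1024 * 1024 * 1024       # 1 GiB per test file

def write_test(path):
    """Return MB/s for a simple sequential write of TOTAL bytes to path."""
    data = os.urandom(BLOCK)
    start = time.monotonic()
    with open(path, "wb") as f:
        written = 0
        while written < TOTAL:
            f.write(data)
            written += BLOCK
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.monotonic() - start
    os.remove(path)
    return TOTAL / elapsed / 1e6

if __name__ == "__main__":
    for label, test_file in [("zfs", "/zfs-pool/iotest.bin"),
                             ("raid", "/raid-array/iotest.bin")]:
        print(f"{label}: {write_test(test_file):.1f} MB/s")
```
-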
13:50
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- New storage nodes racked and cabled at UChicago. Currently, we are working on building them.
- One of the storage nodes had a bad DIMM which was causing transfer problems during the first week of June. The issue was resolved after replacing the DIMM.
- Another storage node was unresponsive on 06/19/2023 and had to be rebooted.
- UChicago experienced a network outage on 06/13/2023 due to a distribution switch failure. The switch was replaced today.
-
13:55
NET2 5m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
* Network problems with NESE were solved after the network gear upgrade.
* 8 machines integrated into the NESE storage.
* Load tests being performed with Hiro.
* Contact Fabio to start the publishing process.
* With help from the Red Hat team, we were able to debug the OpenShift installation, and our Kubernetes cluster is now operational.
* Next steps are to install software to receive jobs and establish a queue.
-
14:00
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
- Installing machines and balancing heat/power draws is continuing
- Seeing issues with the overall heat load in the machine room
- May need to accelerate the retirement of R410 nodes (we have 200+)
- May add a rack to ease future power additions
- Production running mostly well, except for incidents where tripped breakers killed many running jobs
OU:
- Needed to expand the size of the home disk for the condor-ce spool area, so we had to dump all jobs yesterday to switch over.
- The reason was partly that some jobs keep putting large input files (AOD, RDO, ...) into the condor-ce spool area rather than directly into /lscratch/ (WN_TMP). Why? See the sketch after this list.
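To quantify how much of the spool growth comes from staged input files, a minimal sketch that walks the Condor-CE spool area and reports large files; the spool path and the 1 GiB threshold are assumptions to adjust to the local layout.

```python
# Sketch: report large staged files (e.g. AOD/RDO inputs) in the Condor-CE spool.
# The spool path and the size threshold are assumptions.
import os

SPOOL_DIR = "/var/lib/condor-ce/spool"   # assumed spool location
THRESHOLD = 1024**3                      # 1 GiB

def large_files(root, threshold):
    """Yield (path, size) for files under root at or above threshold bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue
            if size >= threshold:
                yield path, size

if __name__ == "__main__":
    total = 0
    for path, size in large_files(SPOOL_DIR, THRESHOLD):
        total += size
        print(f"{size / 1024**3:6.1f} GiB  {path}")
    print(f"Total in large files: {total / 1024**3:.1f} GiB")
```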
-
14:05 → 14:10
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
TACC
- atlas-cvmfsexec wrapper updated to add some file locking to solve a race when many jobs start at once (see the sketch below)
- Some slots have freed up; ~5K jobs finished over the weekend with 93% efficiency
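The file-locking fix above can be illustrated with a minimal Python sketch using fcntl.flock to serialize a one-time, per-node setup step; this shows the technique only, it is not the actual atlas-cvmfsexec wrapper code, and the lock path is an assumption.

```python
# Sketch: serialize a shared setup step (e.g. mounting CVMFS) when many jobs
# land on the same node at once. Illustrative only; not the actual wrapper.
import fcntl
import os

LOCK_PATH = "/tmp/cvmfs-setup.lock"   # assumed per-node lock file

def setup_once():
    """Placeholder for the one-time, per-node setup step."""
    print(f"setup running in pid {os.getpid()}")

def run_with_lock():
    with open(LOCK_PATH, "w") as lock:
        # Block until we hold an exclusive lock: only one job performs the
        # setup at a time, the others wait and then proceed.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            setup_once()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)

if __name__ == "__main__":
    run_with_lock()
```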
NERSC
- Fairly smooth running in the last week
- Needed to update Harvester to support new Raythena features, but the new version is causing issues: no jobs are starting. Investigating.
- Getting users' jobs onto the GPU queue; feedback on the job time limit and the Python environment.
-
14:10 → 14:25
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- Working with BNL web admins (C. Lepore, L. Pelosi) to deploy an updated landing page for the BNL JupyterHub web frontend. This will include links for sign-up, documentation, and Discourse support. The documentation also needs updating/reorganization.
- Discussion of this and a containerization update at the 2.3/5 meeting tomorrow
-
14:15
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- MLflow is installed and available at the UC AF - mlflow (see the usage sketch below).
- AF compute node random reboot issue - we noticed random node reboots after switching to the mainline kernel about a month ago. Reverted to the RH kernel for now and haven't noticed a reboot since the switch.
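For reference, a minimal usage sketch of logging a run to an MLflow tracking server; the tracking URI and experiment name below are placeholders for the UC AF instance.

```python
# Sketch: log a run to an MLflow tracking server.
# The tracking URI and experiment name are placeholders.
import mlflow

mlflow.set_tracking_uri("https://mlflow.example.org")
mlflow.set_experiment("af-demo")

with mlflow.start_run(run_name="smoke-test"):
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_metric("loss", 0.42, step=1)
    mlflow.log_metric("loss", 0.17, step=2)
```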
-
14:25 → 14:35
AOB 10m
-