US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
EL9
- Only HTCondor 10.x is supported in EL9. What's more confusing?
- No HTCondor in the release repository and HTCondor 10.x in upcoming (this would match EL7 + EL8)
- HTCondor 10.x in release + upcoming (i.e., HTCondor 10 in EL7/EL8 release soon, HTCondor 10.x in EL7/EL8 upcoming and EL9 release/upcoming)
- Investigating Slurm test setup failures, i.e. HTCondor-CE + Slurm currently untested in nightlies
- EL9 default crypto policy does not accept SHA-1 IGTF CAs. Investigating packaging workarounds
OSG 3.7
- Starting planning, aiming for a release before June 2023
- This will trigger the OSG 3.6 end-of-life process: end of life one year after the OSG 3.7 release (June 2024), with security and critical bug fix only mode for the 6 months prior
-
13:20
→
13:40
WBS 2.3.5 Continuous Operations Convener: Ofer Rind (Brookhaven National Laboratory)
- Harvester HTCondor upgrade will start next week but Petr and Fa-Hui are beginning to update ports for ARC-CEs
- ADC Weekly presentation on adding Site Energy Consumption parameters to CRIC
-
13:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m Speaker: Mark Sosebee (University of Texas at Arlington (US))
-
13:25
Service Development & Deployment 5m Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
13:30
Kubernetes R&D at UTA 5m Speaker: Armen Vartapetian (University of Texas at Arlington (US))
- The cluster is running fine. There was an incident overnight into this Monday, when the cluster drained. It turned out that the K8s certificates had expired. After renewal, things returned to normal and jobs are running fine.
- Looking into possible configuration tuning of the kube-scheduler. One candidate was inter-pod affinity for SCORE and MCORE jobs, which may help in some scenarios. A concern is the IO performance when packing nodes with only SCORE jobs. We then noticed a warning in the K8s documentation that inter-pod affinity requires a substantial amount of processing, which can slow down scheduling significantly in large clusters; it is not recommended in clusters larger than several hundred nodes.
- Trying to optimize the job CPU request coefficient sent from Harvester (the default scale-down value is 0.9). The idea is to not overcommit the node CPU. For now the value in CRIC has been changed to 0.94, and things look fine so far.
- Next big step is to merge SWT2_CPB_K8S cluster with SWT2_CPB (see Patrick's report).
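The effect of the CPU request coefficient can be sketched with a little arithmetic. This is only an illustration, assuming the request is simply the job's core count multiplied by the coefficient; the node with 63000m of allocatable CPU is a hypothetical example, not the actual SWT2_CPB_K8S node configuration, and the function names are made up for this sketch.

```python
# Sketch: how a scale-down coefficient on CPU requests changes node packing.
# Assumption: request = job cores * coefficient, expressed in millicores.
# The node size below is hypothetical, chosen only for illustration.

def request_millicores(cores: int, coefficient: float) -> int:
    """CPU request in Kubernetes millicores for a job with `cores` cores."""
    return round(cores * coefficient * 1000)


def jobs_per_node(allocatable_millicores: int, cores: int, coefficient: float) -> int:
    """How many identical jobs fit on a node, counting CPU requests only."""
    return allocatable_millicores // request_millicores(cores, coefficient)


allocatable = 63000  # hypothetical: 64 cores minus 1 core reserved for the system
for coeff in (1.0, 0.94, 0.9):
    req = request_millicores(8, coeff)
    fit = jobs_per_node(allocatable, 8, coeff)
    print(f"coefficient {coeff}: 8-core job requests {req}m, {fit} jobs fit")
```

With these hypothetical numbers, unscaled requests leave room for only 7 eight-core jobs, while both 0.9 and 0.94 let 8 fit. The higher coefficient keeps the summed requests closer to the CPU the jobs can actually use.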
-
13:40
→
13:45
WBS 2.3.1 Tier1 Center 5m Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
Mostly quiet running
Deployed April 23 MOU disk pledge this week. http://adc-ddm-mon.cern.ch/ddmusr01/plots/plots.php?endpoint=BNL-OSG2_DATADISK Milestone #179 Completed
dCache Test instance part of the WLCG token tests - https://ci.cloud.cnaf.infn.it/job/wlcg-jwt-compliance-tests/job/master/747/artifact/reports/reports/latest/joint-report.html# Milestone #182 Completed
Will deploy the April 23 MOU tape pledge this week. Installing 1000 LTO8 tapes (12 TB each) on Thursday. Milestone #180 Completed
dCache upgrade to v8.2.x planned for 8-March 09:00-15:00 EST
Working w/ Lincoln to understand source of occasional failures of transfer of HPC job logs.
-
13:45
→
14:05
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Fairly good running over the last month.
- Some issues at MWT2 & SWT2
- NET2.0 (BU_ATLAS*) was set to disabled and no longer appears in displays
- NET2.1 (UMass) is making progress, but I will leave the details to the UMass team
- OU is working hard to get a new OSG GK online under AlmaLinux9.
- They must do so by ~March 6 to avoid being unable to receive jobs due to lack of support for tokens.
- Procurement is underway
- Please upload your Dell quotes to the Google directory:
https://drive.google.com/drive/folders/1rxwnJtNOxrfvBj7lCuiL3LZRItnJGaIB
- People have created site operations plans at AGLT2, MWT2, and NET2. Need to check on SWT2.
-
13:45
AGLT2 5m Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
Milestones:
VMware Transition 6.7 to 7 and implementing iSCSI/TrueNAS storage (#217): completed
Progress on Linux 9 choice: UEFI boot from cobbler completed (#218). Rest ongoing.
WLCG monitoring (#117)
UM side ready.
MSU side close (can't get info from our switches but getting access to upstream data center switches).
Operation plan 2023:
https://docs.google.com/document/d/1n2Rv80TP_87Y4Xfr79HOK6RgCh9pohrbTniYE-3jxEs/edit#
Software:
Upgraded dCache 7.2.20 to 8.2.13 (newest golden release series)
Current operation:
No global problem recently.
VP queue high failure rate and xCache problem being debugged.
Adding MSU SLATE node and varnish usage improved.
Purchase:
Acceptable prices for compute (close to expectation.)
Still pushing on storage prices (better than expectation but higher than IU quote.)
We will buy more storage than originally planned.
-
13:50
MWT2 5m Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- There was a problem with the UChicago border firewall on February 21st which caused network/DNS issues. It was resolved the same day. We are looking into a solution to alleviate potential future problems when we lose access to the DNS servers.
- IU had a firewall incident on February 22nd. A misconfiguration was reverted, but it put IU offline overnight.
- Reviewed and submitted our operations plan.
- Still working on our purchasing plan. We have quotes in hand for storage and switches at UC.
-
13:55
NET2 5m Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
A problem with the power supply to the first rack was detected by UMass staff at the MGHPCC, and the rack is temporarily without power until the problem is investigated by the MGHPCC team.
The process of setting up a DNS zone for net2.mghpcc.org was initiated by the UMass networking team.
A basic dCache setup using a pair of disk pools was completed using RedHat 8 and Let's Encrypt certificates for testing. The NESE team is re-provisioning some machines with ALMA9 so that we can test the setup on this new OS and possibly publish the service using it, so we don't have to stop to do this upgrade in a few months.
In parallel, we started to investigate how to increase the availability of dCache, focusing on three fronts: a ZooKeeper cluster with 3 machines, an HA PostgreSQL setup, and redundancy of dCache services.
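As a rough guide to why three machines is a natural size for the ZooKeeper cluster: ZooKeeper stays available only while a strict majority of its servers are up. The sketch below just does that arithmetic; the function names are made up for illustration and are not part of any dCache or ZooKeeper tooling.

```python
def quorum_size(servers: int) -> int:
    """Smallest strict majority of an ensemble with `servers` members."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """Servers that can be lost while a majority is still reachable."""
    return servers - quorum_size(servers)


for n in (1, 2, 3, 5):
    print(f"{n} servers: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Three servers tolerate one failure, while two servers tolerate none, so a pair adds no resilience over a single server.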
-
14:00
SWT2 5m Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA
- Network Upgrade
- The TOR switches have been updated with the latest OS. We bricked one in the process but it has been replaced and updated as well
- The core switches will be updated today or tomorrow and we can start deploying the core
- Working with Dell on getting the new core and old core to talk to each other
- First priority is to get K8s cluster merged into the rest of SWT2_CPB
- IPv6 Reverse DNS issue
- UTA is undergoing a rebuild of the campus network that involves, among other things, a redeployment of DNS services. We lost reverse DNS service in the migration.
- The original ticket was misfiled with OIT, but we did get it straightened out as part of other work
- A complicating issue was that the campus rebuild affected our group mail services
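A quick way to check for this kind of reverse DNS loss is to compare the expected PTR name with what the resolver returns. The sketch below uses only the Python standard library; the address comes from the IPv6 documentation range (2001:db8::/32) and stands in for a real host address, and the function names are illustrative.

```python
import ipaddress
import socket
from typing import Optional


def ptr_name(address: str) -> str:
    """Return the reverse-DNS (PTR) name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).reverse_pointer


def reverse_lookup(address: str) -> Optional[str]:
    """Return the hostname from reverse DNS, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None


# Placeholder address, not a real SWT2 host.
print(ptr_name("2001:db8::1"))
```

Calling `reverse_lookup()` on a worker node's IPv6 address would then return `None` while the PTR records are missing, and the hostname once the reverse zone is restored.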
OU
- Some intermittent xrootd storage overloads, not entirely sure why
- In the process of installing osg-ce-slurm on AlmaLinux9 test host, getting close to having it working
-
14:05
→
14:10
WBS 2.3.3 HPC Operations 5m Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
NERSC
- Perlmutter online again. Hammercloud jobs succeeding but coming infrequently in the local "preempt" queue. Filesystem issues allegedly resolved for now?
- Cori jobs taking a lot longer since the filesystem outage. Not sure if it is the task or the system is slower now. Increased max walltime because we are failing jobs after hitting the limit of 10 hrs.
TACC
- Working fine at 10-node scale with CVMFSExec.
- Log files are occasionally getting put into the wrong Rucio path. Bug with the Globus stager? Have been chatting with Doug
GCS5
- Rui working on transfers to the GCS v5 door at BNL with Doug.
-
14:10
→
14:25
WBS 2.3.4 Analysis Facilities Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
Analysis Facilities - BNL 5m Speaker: Ofer Rind (Brookhaven National Laboratory)
- AF Metrics Dashboard configured (Ilija, Mike Hance, Tom Smith)
- IRIS-HEP AGC Demo Day 2 last Friday, not all slides available, but recording is posted
- 14:15
-
14:20
Analysis Facilities - Chicago 5m Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
Triton inference service now available at UC AF (https://github.com/triton-inference-server/server/tree/main/deploy/k8s-onprem)
- configured with load balancing and autoscaling (maximum currently set at 3, but we have more than 70 GPUs)
- model registry is an S3 bucket (currently on SSL cluster)
- available with Kubernetes ClusterIPs
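From inside the cluster, a client could reach the service through its ClusterIP and check readiness before sending requests. This is a minimal sketch using only the standard library. The service name `triton.af.svc.cluster.local` is a hypothetical placeholder, and port 8000 is Triton's default HTTP port, which the UC AF deployment may have changed. The `/v2/health/ready` path is the server readiness endpoint of the inference protocol that Triton implements.

```python
import urllib.error
import urllib.request


def ready_url(host: str, port: int = 8000) -> str:
    """Build the server-readiness URL for a Triton HTTP endpoint."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host: str, port: int = 8000, timeout: float = 5.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


# Hypothetical in-cluster service name; substitute the real ClusterIP or DNS name.
print(ready_url("triton.af.svc.cluster.local"))
```

A notebook or batch job would call `is_ready(...)` once at start-up and only then submit inference requests.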
-
14:25
→
14:35
AOB 10m