US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
13:00 → 13:10
13:10 → 13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release yesterday: https://opensciencegrid.org/docs/release/osg-36/#april-26-2022-cvmfs-292-upcoming-htcondor-981
- CVMFS bugfix release
- VOMS clients now generate 2048 bit proxies by default
- osg-ce minor update that will help us track OSG 3.6 updates
OSG 3.5 EOL on May 1!
- 3.5 containers will stop being updated; move to 3.6-release tags https://opensciencegrid.org/technology/policy/container-release/
- We will start removing 3.5 documentation
HTCondor Week registration is closing soon! See invitation:
Greetings CHTC Users!
We want to invite you to HTCondor Week 2022, our annual HTCondor user conference, May 23-26, 2022. This year, HTCondor Week will be a hybrid event: we are hosting an in-person meeting at the Fluno Center on the University of Wisconsin-Madison campus. This provides HTCondor Week attendees with a compelling environment in which to attend tutorials and talks from HTCondor developers, meet other users like you and attend social events. For those who cannot attend in person, we'll also be broadcasting the event online via a Zoom meeting.
Registration for HTCondor Week 2022 is open now. The registration deadline for in-person attendees is May 2, 2022, and the cost is $90 per day to partake in conference food. For virtual-only attendance, registration is a flat $25 fee for the whole week.
UW-Madison affiliates who attend conference talks in person only need to register (and pay) for in-person participation if they plan to partake in conference food. We also recommend the virtual registration (still with a fee) for UW-Madison affiliates who plan to participate virtually.
You can register at http://htcondor.org/HTCondorWeek2022. There will be specific programming highlighting the UW-Madison campus community on Thursday, May 26, where you can meet other campus users of CHTC and HTCondor, as well as CHTC staff. We will separately contact some CHTC users to present their work that day!
On other days, we will have a variety of in-depth tutorials and talks where you can learn more about HTCondor and how other people are using and deploying HTCondor. Best of all, you can establish contacts and learn best practices from people in industry, government, and academia who are using HTCondor to solve hard problems, many of which may be similar to those you are facing.
Hotel details and agenda overview are on the HTCondor Week 2022 site:
http://htcondor.org/HTCondorWeek2022
We hope to see you there,
The Center for High Throughput Computing
13:20 → 13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
13:20
Consistency checks on storage at the T1 30m
Speaker: Shigeki Misawa (Brookhaven National Laboratory (US))
13:50 → 13:55
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
13:55 → 14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Pretty good running over the past two weeks.
- Lots of compute slots coming online.
- The main issue (the "EPoll:" issue) is a pilot issue where the string "EPoll:" gets prepended to variable values, causing the pilot to kill jobs that had been running without problems. This hurt job efficiency a lot and caused HammerCloud to kick sites offline when there was no site problem. It seems to affect sites running HTCondor and dCache, even though the working hypothesis is an XRootD issue.
- We are down to the wire on getting OSG 3.6 into use.
- How are NET2 and SWT2 doing on enabling IPv6?
- Please keep these sheets up to date:
- Service versions: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
- Run 3 readiness: https://docs.google.com/spreadsheets/d/1KniOlqb4dbJ6dKUHBYYt9OfriKjhVpUqXguPvryIMY8
- Site capacity: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
- NB: I will not add the tabs for the current (April-June) quarter until I am sure that the data for the previous quarter actually reflects the situation on March 31.
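The "EPoll:" symptom described above amounts to a spurious prefix appearing on variable values. As a purely hypothetical illustration of a defensive workaround (none of these names come from the minutes; the real fix belongs in the pilot/XRootD layer), a sanitizer might look like:

```python
def strip_epoll_prefix(value: str) -> str:
    """Remove any spurious 'EPoll:' prefixes erroneously prepended to a value.

    Hypothetical sketch of the symptom described in the minutes; this is
    not actual pilot code.
    """
    prefix = "EPoll:"
    # Loop in case the prefix was prepended more than once.
    while value.startswith(prefix):
        value = value[len(prefix):]
    return value


# A corrupted value is restored; clean values pass through unchanged.
print(strip_epoll_prefix("EPoll:root://xrootd.example.org//atlas/file"))
print(strip_epoll_prefix("/scratch/job/workdir"))
```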
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
On OSG 3.6, for gatekeepers and worker nodes.
We broke frontier squids while trying to fix gratia probe problems.
Our first fix attempt inadvertently re-enabled a local setup script overriding squid location variables.
Gratia issues solved: directory ownership was root instead of condor.
2 tickets:
156868 15-Apr-2022 AGLT2: Failing jobs in panda with "Unable to identify specific exception"
156873 17-Apr-2022 US AGLT2: High transfer failures as source
The job problems were traced to timeouts during stage-out.
There was no clear problem, but the likely suspect was dCache and Java running out of memory.
We increased the memory for webdav on the doors and dCacheDomain on the head nodes.
We also added CPUs and memory to the VM doors. That all helped.
We also upgraded dCache from 6.2.35 to 7.2.15 (since we had to restart to load new CA certificates anyway).
The issues from both tickets disappeared after that.
Maintenance:
Mostly consisted of updating all worker nodes with a new kernel, Dell firmware updates, and OSG updates (CVMFS).
Network upgrades completed and tested:
All new multi-path and multi-100G connections to ESnet and between MSU and UM are now fully deployed, and were tested for proper failover in case of a backhoe-vs-fiber incident.
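The webdav-door and dCacheDomain memory increases mentioned above would typically be expressed through dCache's standard per-domain JVM memory properties in a layout file; a minimal sketch, assuming those standard property names (the domain name and values here are illustrative, not the actual AGLT2 settings):

```ini
# Illustrative layout-file fragment on a webdav door host
# (not the actual AGLT2 configuration)
[webdavDomain]
dcache.java.memory.heap = 4096m
dcache.java.memory.direct = 1024m
```

Setting these per domain, rather than globally in dcache.conf, lets the doors and the central dCacheDomain be sized independently, which matches the change described in the report.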
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
14:05
NET2 5m
Speaker: Prof. Saul Youssef
Smooth operations. New workers are in production.
NESE Team preparing for ~5 rack expansion of NESE Ceph including NET2 storage. Slowed down by Cisco switch delivery. This will allow retirement of NET2 GPFS and make more space for workers.
Working on IPv6, then OSG 3.6; also upgrading ToR networking and NET2-NESE networking.
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
- Compute nodes from UTA_SWT2 have been integrated into SWT2_CPB
- UTA_SWT2 is now disabled in CRIC
- Working on updating some old compute nodes with additional memory
- Still need to update Capacity Spreadsheet/OIM/CRIC to reflect changes
- Received partial shipment of R6525 nodes (8 nodes of 48)
- The machines are racked
- Need to update Rocks install kernel to support RAID card before installation
- Work is progressing on configuring the new OSG 3.6 CE.
OU:
- Drained some HEP nodes to move them, should be back up later today.
- Should get the rest of the newly arrived HEP nodes up and running soon as well.
14:15 → 14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
14:20 → 14:35
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
14:20
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- First meeting of the new AF Forum series held last week, with an update on k8s batch and the upcoming KubeCon
- Met on Friday to work out details of the BNL/NERSC XRootD SE setup
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
14:30
Analysis Facilities - Chicago 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
14:35 → 14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
* XCache - working fine
* VP - working fine - will summarize performance and BHAM's experience with switching to VP at the next DDM meeting.
* ServiceX - works fine at 1.0.30. Next week will be dedicated to performance-improvement development.
* Analytics - adding new functionality to ATLAS Alarm & Alert Frontend.
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
All the existing Kubernetes worker nodes were updated with additional memory. Part of the hardware from the retired UTA_SWT2 cluster was also racked at CPB and added to the cluster: Kubernetes was installed on those nodes and they were joined to the existing cluster. The cluster is healthy.
Now trying to find out why grid jobs reach the workers but get stuck there in a waiting state.
14:55 → 15:05
AOB 10m