Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
Release yesterday: https://opensciencegrid.org/docs/release/osg-36/#april-26-2022-cvmfs-292-upcoming-htcondor-981
OSG 3.5 EOL on May 1!
HTCondor Week registration is closing soon! See invitation:
Greetings CHTC Users!
We want to invite you to HTCondor Week 2022, our annual HTCondor user conference, May 23-26, 2022. This year, HTCondor Week will be a hybrid event: we are hosting an in-person meeting at the Fluno Center on the University of Wisconsin-Madison campus. This provides HTCondor Week attendees with a compelling environment in which to attend tutorials and talks from HTCondor developers, meet other users like you and attend social events. For those who cannot attend in person, we'll also be broadcasting the event online via a Zoom meeting.
Registration for HTCondor Week 2022 is open now. The registration deadline for in-person attendees is May 2, 2022, and the cost is $90 per day to partake in conference food. For virtual-only attendance, registration is a flat $25 fee for the whole week.
UW-Madison affiliates who attend conference talks in person only need to register (and pay) for in-person participation if they plan to partake in conference food. For UW-Madison affiliates who plan to participate virtually, we recommend the virtual registration (still with a fee).
You can register at http://htcondor.org/HTCondorWeek2022.
There will be specific programming highlighting the UW-Madison campus community on Thursday, May 26, where you can meet other campus users of CHTC and HTCondor, as well as CHTC staff. We will separately contact some CHTC users to present their work that day!
On other days, we will have a variety of in-depth tutorials and talks where you can learn more about HTCondor and how other people are using and deploying HTCondor. Best of all, you can establish contacts and learn best practices from people in industry, government, and academia who are using HTCondor to solve hard problems, many of which may be similar to those you are facing.
Hotel details and agenda overview are on the HTCondor Week 2022 site:
http://htcondor.org/HTCondorWeek2022
We hope to see you there,
The Center for High Throughput Computing
Updates on US Tier-2 centers
Running OSG 3.6 on gatekeepers and worker nodes.
We broke the Frontier squids while trying to fix Gratia probe problems.
Our first fix attempt inadvertently re-enabled a local setup script that overrides the squid location variables.
Gratia issues solved: directory ownership was root instead of condor (see the sketch below).
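As a rough illustration of that kind of ownership check, the sketch below walks a probe data directory and reports anything not owned by the expected account. The /var/lib/gratia/data path and the condor owner are assumptions for illustration only; the actual probe configuration determines both.

```python
#!/usr/bin/env python3
"""Report entries under a Gratia probe data directory with the wrong owner.

The path and owner below are assumed for illustration; check the actual
probe configuration for the real values.
"""
import os
import pwd
import sys

GRATIA_DATA_DIR = "/var/lib/gratia/data"  # assumed location
EXPECTED_OWNER = "condor"                  # account the probe should write as


def wrong_owner_entries(top: str, owner: str):
    """Yield paths under `top` (including `top` itself) not owned by `owner`."""
    want_uid = pwd.getpwnam(owner).pw_uid
    for dirpath, _dirnames, filenames in os.walk(top):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            if os.stat(path).st_uid != want_uid:
                yield path


if __name__ == "__main__":
    bad = list(wrong_owner_entries(GRATIA_DATA_DIR, EXPECTED_OWNER))
    if bad:
        print(f"{len(bad)} entries not owned by {EXPECTED_OWNER}, e.g. {bad[0]}")
        sys.exit(1)
    print("Ownership looks correct.")
```

If anything is reported, the corresponding repair is just a recursive chown of the tree back to the probe's account.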
2 tickets:
156868 15-Apr-2022 AGLT2: Failing jobs in panda with "Unable to identify specific exception"
156873 17-Apr-2022 US AGLT2: High Transfer failures as source
The job problems were traced to timeouts during stage-out.
There was no single clear cause, but the likely suspect was dCache/Java running out of memory.
We increased the memory for webdav on the doors and for the dCacheDomain on the head nodes.
Also added CPUs and memory to the VM doors. That all helped.
We also upgraded dCache from 6.2.35 to 7.2.15 (since we had to restart to load new CA certs anyway).
The issues from both tickets disappeared after that.
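Since the symptom here was transfers timing out against the doors, a quick latency probe such as the sketch below can help confirm that a webdav door keeps answering promptly after the memory change. The hostname is a placeholder, and real doors normally require X.509 or token authentication, which this sketch skips.

```python
#!/usr/bin/env python3
"""Time a lightweight HTTPS request against a WebDAV door to spot slow responses.

Host and port are placeholders; production doors normally require X.509/VOMS
or token authentication, which is omitted here to keep the sketch short.
"""
import ssl
import time
import http.client

DOOR_HOST = "webdav-door.example.edu"  # placeholder hostname
DOOR_PORT = 443
TIMEOUT_S = 30


def time_head_request(host: str, port: int, path: str = "/") -> float:
    """Return the elapsed seconds for a single HEAD request to the door."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # skip cert checks in the sketch only
    start = time.monotonic()
    conn = http.client.HTTPSConnection(host, port, timeout=TIMEOUT_S, context=ctx)
    try:
        conn.request("HEAD", path)
        conn.getresponse()
    finally:
        conn.close()
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = time_head_request(DOOR_HOST, DOOR_PORT)
    print(f"HEAD {DOOR_HOST}:{DOOR_PORT} responded in {elapsed:.2f}s")
```

Run at intervals, rising response times would point back at the door JVMs before transfers start failing again.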
Maintenance:
Mostly consisted of updating all worker nodes: new kernel, Dell firmware updates, and OSG updates (CVMFS).
Network upgrades completed and tested:
All new multi-path and multi-100G connections to ESnet and between MSU and UM are now fully deployed
and were tested for proper failover in case of a backhoe-versus-fiber incident (i.e., a fiber cut).
Smooth operations. New workers are in production.
NESE team is preparing a ~5-rack expansion of NESE Ceph, including NET2 storage. This is slowed down by Cisco switch delivery delays. The expansion will allow retirement of the NET2 GPFS and make more space for workers.
Working on IPv6; then OSG 3.6; also upgrading ToR networking and NET2-NESE networking.
UTA:
OU:
- Drained some HEP nodes to move them, should be back up later today.
- Should get the rest of the newly arrived HEP nodes up and running soon as well.
* XCache - working fine
* VP - working fine - will summarize performance and the BHAM experience with switching to VP at the next DDM meeting.
* ServiceX - works fine at 1.0.30. Next week will be dedicated to performance improvement work.
* Analytics - adding new functionality to ATLAS Alarm & Alert Frontend.
All the existing Kubernetes worker nodes were updated with additional memory. Also, part of the hardware from the retired UTA_SWT2 cluster was racked in CPB and added to the cluster: Kubernetes was installed on those nodes and they were joined to the existing cluster. The cluster is reporting healthy.
Now trying to find out why grid jobs reach the workers but then sit there in a waiting state.
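As a first diagnostic pass for both points above, a sketch along the following lines lists nodes that are not Ready and pods stuck in a non-running phase, together with their waiting reasons. It assumes kubectl access to the cluster and that the grid payloads show up as pods; the namespace is a placeholder.

```python
#!/usr/bin/env python3
"""First-pass look at not-Ready nodes and stuck pods on a Kubernetes cluster.

Assumes kubectl is already configured for the cluster and that the grid jobs
in question appear as pods; the namespace below is a placeholder.
"""
import json
import subprocess

NAMESPACE = "osg"  # placeholder, adjust to wherever the grid pods land


def kubectl_json(*args: str) -> dict:
    """Run kubectl with -o json and return the parsed result."""
    out = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def not_ready_nodes() -> list:
    """Names of nodes whose Ready condition is not True."""
    bad = []
    for node in kubectl_json("get", "nodes")["items"]:
        ready = [c for c in node["status"]["conditions"] if c["type"] == "Ready"]
        if not ready or ready[0]["status"] != "True":
            bad.append(node["metadata"]["name"])
    return bad


def stuck_pods(namespace: str) -> list:
    """(name, phase, waiting reasons) for pods not Running or Succeeded."""
    stuck = []
    for pod in kubectl_json("get", "pods", "-n", namespace)["items"]:
        phase = pod["status"].get("phase", "Unknown")
        if phase in ("Running", "Succeeded"):
            continue
        reasons = [
            cs["state"]["waiting"].get("reason", "?")
            for cs in pod["status"].get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        stuck.append((pod["metadata"]["name"], phase, reasons))
    return stuck


if __name__ == "__main__":
    print("Not-Ready nodes:", not_ready_nodes() or "none")
    for name, phase, reasons in stuck_pods(NAMESPACE):
        print(f"{name}: {phase} {reasons}")
```

For any pod it reports, kubectl describe pod <name> then shows the scheduler and kubelet events behind the wait (unschedulable resources, image pulls, volume mounts, and so on).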