US ATLAS Computing Facility
→
US/Eastern
Description
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:05
→
13:15
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release
Next week:
- XRootD 5.6.3 available in testing
- osg-system-profiler (dumps system crypto policy, XRootD configurations)
- vo-client (InCommon IGTF CA 3 updates, https://opensciencegrid.atlassian.net/browse/SOFTWARE-5743)
Aiming for week of Nov 27:
- vo-client (more InCommon IGTF CA 3 updates, https://opensciencegrid.atlassian.net/browse/SOFTWARE-5741)
- osg-ca-certs/osg-ca-certs-java with/without SHA1 workaround (https://opensciencegrid.atlassian.net/browse/SOFTWARE-5745). We plan on sending a warning to OSG sites
Miscellaneous
- Have Squid containers been updated to 23-release?
- Any complaints about/issues with OSG 23?
- Any word on CERN account renewal?
-
13:15
→
13:35
WBS 2.3.1: Tier1 Center
Convener: Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:15
Infrastructure and Compute Farm 5m
Speaker: Thomas Smith
- We've been investigating odd behavior of idle jobs in the queue that caused jobs sent to the site to throttle and the T1 to drain; a temporary fix to the group quotas has been implemented that allowed the flow of jobs to return to normal
- We've engaged the developers and are working on pinpointing the cause and a permanent solution
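A quick way to keep an eye on this kind of quota-related starvation while the permanent fix is worked out is to tally idle versus running jobs per accounting group straight from the schedd. This is only a minimal sketch using the HTCondor Python bindings, assuming the local schedd serves the affected queue and that jobs carry an AcctGroup attribute; it is not the actual BNL monitoring.

```python
# Sketch: count idle vs. running jobs per accounting group to spot
# quota-related starvation. Assumes the HTCondor Python bindings are
# installed and the local schedd serves the queue in question.
from collections import Counter

import htcondor

schedd = htcondor.Schedd()  # local schedd; pass a location ad for a remote one
idle = Counter()
running = Counter()

# JobStatus: 1 = Idle, 2 = Running
for ad in schedd.query(
    constraint="JobStatus == 1 || JobStatus == 2",
    projection=["JobStatus", "AcctGroup"],
):
    group = ad.get("AcctGroup", "<none>")
    if ad["JobStatus"] == 1:
        idle[group] += 1
    else:
        running[group] += 1

for group in sorted(set(idle) | set(running)):
    print(f"{group:30s} idle={idle[group]:6d} running={running[group]:6d}")
```

A persistently large idle count for a group that has free resources elsewhere in the pool is the signature of the quota/negotiator problem described above.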
-
13:20
Storage 5m
Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
- On October 31, 2023, a successful major intervention was conducted on the storage. This involved refreshing the hardware for core dCache services, which included the addition of 10 new servers and the removal of 8 servers. Additionally, database migration (e.g., dCache namespace) was carried out to new servers, along with an update of PostgreSQL from version 12 to version 15, and the reconfiguration of core services.
- On October 19, 2023, a successful transition from the older SRM door server (dcsrm03.usatlas.bnl.gov) to two new door servers (dcfrontend01|2.usatlas.bnl.gov) was completed. These new servers also serve the TAPE REST API, alongside the introduction of a new DNS alias (dcfrontend.usatlas.bnl.gov); a reachability check sketch follows after this list.
- Reporting an issue to the WLCG monitoring team to rectify a glitch related to the SRM update in the availability and reliability accounting report for October 19th.
- Further details: https://ggus.eu/index.php?mode=ticket_info&ticket_id=164024
- Enhancements were made to the ATLAS pre-production instance, a critical service for validating changes before their deployment in the production environment. These enhancements involved decommissioning outdated servers, deploying new ones, and implementing the same deployment model and configuration management modules utilized in the production environment
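For door transitions like the one above, a simple client-side check that the new alias resolves and that both doors accept connections can catch DNS or firewall issues early. A minimal sketch only; the HTTPS port (8443) is an assumption and would need to match the actual door configuration.

```python
# Sketch: verify that the dcfrontend alias resolves and that each door
# answers a TCP connection on the assumed HTTPS port. Port 8443 is an
# assumption; substitute whatever the doors actually listen on.
import socket

ALIAS = "dcfrontend.usatlas.bnl.gov"
DOORS = ["dcfrontend01.usatlas.bnl.gov", "dcfrontend02.usatlas.bnl.gov"]
PORT = 8443  # assumed WebDAV / TAPE REST API port

alias_ips = {info[4][0] for info in socket.getaddrinfo(ALIAS, PORT)}
print(f"{ALIAS} -> {sorted(alias_ips)}")

for door in DOORS:
    try:
        with socket.create_connection((door, PORT), timeout=5):
            print(f"{door}:{PORT} reachable")
    except OSError as err:
        print(f"{door}:{PORT} NOT reachable: {err}")
```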
-
13:25
Tier1 Services 5m
-
13:35
→
13:55
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Stable running with no major issues.
- NET2 is ramping up...
- Eduardo is at the hospital with his newborn baby boy (a different but nice ramp up!)
- Working with Rob, Shawn, and Ofer on defining what information will be requested for the operations and procurement plans.
- We are tentatively asking people to submit their first draft by November 30 to allow time to discuss the contents before the milestone date of 12/31. This also gets the plans done before the holidays.
- I should have the draft of the listing information for the new IU admin/DevOps position to Rob & Shawn later today.
-
13:35
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wendy Wu (University of Michigan)
-
13:40
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- Testing storage at IU. Still troubleshooting network issues before we can attempt to put it online
- Had a draining incident in mid-October caused by xrootd door issues; the cause is unclear. Restarting the door cleared it up
- Multiple squid tickets
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=163709
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=164021
- First ticket: squid seemed to stop working; restarting the service brought things back into production
- Second ticket: a NIC was having issues on the node; a reboot cleared it up. The ticket was reopened due to a different issue where the squid was working but the monitoring wasn't; restarting the squid resumed the monitoring
- Want to identify why and how the monitoring failed so we can fix that without bouncing the squid service (a probe sketch follows after this list)
- Testing a condor configuration to limit job memory usage so that workers stop rebooting after running out of memory
- Investigating CVMFS issues on AlmaLinux 9 workers at IU. Updated CVMFS on the nodes to see if that fixes things; otherwise, working with Dave Dykstra to debug
- https://github.com/cvmfs/cvmfs/issues/3434
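To separate "squid is down" from "only the monitoring is down", an end-to-end probe that fetches a known object through the proxy can run independently of the SNMP-based monitoring. A minimal sketch using requests; the proxy address and test URL are placeholders, not the actual MWT2 configuration.

```python
# Sketch: end-to-end probe of a squid proxy, independent of its SNMP
# monitoring. The proxy address and the test URL are placeholders.
import sys

import requests

PROXY = "http://uct2-squid.example.org:3128"   # hypothetical squid endpoint
TEST_URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"

try:
    resp = requests.get(TEST_URL, proxies={"http": PROXY}, timeout=10)
    resp.raise_for_status()
    # The X-Cache header (if the squid exposes it) says whether the reply came from cache
    print("OK", resp.status_code, resp.headers.get("X-Cache", "no X-Cache header"))
except requests.RequestException as err:
    print("squid probe FAILED:", err)
    sys.exit(1)
```

Run from cron or the existing monitoring host, a non-zero exit here points at the squid itself rather than the monitoring chain.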
-
13:45
NET2 5m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Operations:
- Short service interruption (a few minutes) due to a network interruption between NET2 and NESE. We are adding dedicated SNMP monitoring so we can react quickly to future problems (a polling sketch follows at the end of these notes)
Installation:
- "Rack 2" deployed. (~5000 slots total -- rack 1 + rack 2 -- available right now).
- Preparing "Rack 3"
Procurement:
- We finished FY23 procurement. 8 new NESE servers will be made available (an estimated additional 3.1 PB of usable disk storage).
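For the dedicated SNMP monitoring mentioned under Operations, the simplest useful check is polling ifOperStatus on the NET2-NESE link and alerting on any change. A minimal sketch with pysnmp; the switch address, community string, and interface index are placeholders, not the real NET2 devices.

```python
# Sketch: poll ifOperStatus (1 = up, 2 = down) for one interface via SNMP.
# Switch address, community string, and ifIndex are placeholders.
from pysnmp.hlapi import (CommunityData, ContextData, ObjectIdentity,
                          ObjectType, SnmpEngine, UdpTransportTarget, getCmd)

SWITCH = "net2-border-switch.example.org"   # hypothetical device
COMMUNITY = "public"                        # placeholder community string
IF_INDEX = 1                                # placeholder interface index

oid = ObjectIdentity(f"1.3.6.1.2.1.2.2.1.8.{IF_INDEX}")  # IF-MIB::ifOperStatus
error_indication, error_status, _, var_binds = next(
    getCmd(SnmpEngine(),
           CommunityData(COMMUNITY),
           UdpTransportTarget((SWITCH, 161), timeout=2, retries=1),
           ContextData(),
           ObjectType(oid))
)

if error_indication or error_status:
    print("SNMP query failed:", error_indication or error_status.prettyPrint())
else:
    for name, value in var_binds:
        status = int(value)
        print(f"{name.prettyPrint()} = {status} ({'up' if status == 1 else 'down'})")
```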
-
13:50
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (UTA)
OU:
- Running well, no major problems; have gotten really good opportunistic throughput in the last few weeks
- Working on getting slate-squid installed, getting close!
UTA:
- Met with campus networking folks on Tuesday (11/7) to discuss various topics, in particular our needs (i.e., access to networking devices, etc.) in terms of deploying the WLCG site network monitoring. Making progress - hope to have this deployed within the next week or so.
- We're still waiting on Schneider / APC to finalize the date for performing the upgrade work on the UPS in the data center. Possibly by mid-December (before the holidays), otherwise early January. We're also planning to replace the cluster admin node at the same time.
- Working with Dell to resolve hardware issues on a few worker nodes (WNs).
- Patrick still coming to campus ~one day per week to help with training Zach - much appreciated!
- Generally smooth operations for ATLAS over the past four weeks.
-
13:55
→
14:05
WBS 2.3.3 HPC Operations 10m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- Perlmutter: a large fraction of jobs have been reassigned by JEDI starting in September
- Most of those jobs have input files stuck in PENDING status
- After some investigation, this might be related to an issue with the storage at BNL that Doug has asked the xrootd developers to help debug
- There was a ticket reporting that CVMFS was missing on some of the login nodes. NERSC experts have checked all the nodes and confirmed that all of them now have CVMFS mounted (a probe sketch follows after this list)
- Cleaned up the queues (VP and ES)
- (Doug&Lincoln) following up with the xrootd service and RSE related setups
- TACC: No successful job starting from Sept.
- Globus key file was missing on CERN side. Contacted Tadashi to help on restore it.
- Issues with running CVMFSexec. Investigating using standalone test
- (Rui) install harvester instance from the git repo Lincoln made
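For the CVMFS-on-login-nodes issue, a small probe that each node can run (or that can be fanned out over ssh) makes regressions easy to catch; `cvmfs_config probe` is the standard client check. A minimal sketch; the repository list is an assumption.

```python
# Sketch: verify that the expected CVMFS repositories are mounted and
# answering on this node. Uses the standard `cvmfs_config probe` client
# command; the repository list is an assumption.
import os
import subprocess

REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch", "sft.cern.ch"]

for repo in REPOS:
    mountpoint = f"/cvmfs/{repo}"
    if not os.path.isdir(mountpoint):
        print(f"{mountpoint}: MISSING")
        continue
    result = subprocess.run(
        ["cvmfs_config", "probe", repo],
        capture_output=True, text=True, timeout=60,
    )
    status = "OK" if result.returncode == 0 else "FAILED"
    print(f"{mountpoint}: {status} ({result.stdout.strip() or result.stderr.strip()})")
```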
-
14:05
→
14:30
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- New GPU monitoring dashboard is up at the UC AF (gpu monitoring)
- GPU HTCondor worker autoscaling in the works
- allows GPUs to be shared between the batch system and Jupyter notebooks
- updated the metrics manager to provide an occupancy metric for the HTCondor partitions (see the sketch after this list)
- working on understanding HPA behavior
- Monitoring and alerting change
- we now route all default Prometheus alerts to a null receiver and only send the alerts we care about to the Slack channel.
- "/data" monitoring should be improved now
- Working with Matthew F on new images for the AF. These will be based on AnalysisBase and add dask, awkward-array, etc. We will create new ones on request, name them according to the AnalysisBase version, and place them in both DockerHub and Harbor. This kind of image will be needed for the Z->ee demo on PHYSLITE data.
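For the occupancy metric fed to the autoscaler, one way to compute it is to query the startd ads for GPU slots and report the fraction of assigned GPUs, which an HPA driven by a custom/external metric can then scale on. A minimal sketch with the HTCondor Python bindings; the collector address and the attribute names (TotalGpus/Gpus, partitionable slots assumed) are assumptions, not the actual AF metrics manager.

```python
# Sketch: compute a GPU occupancy fraction for the pool by comparing free
# vs. total GPUs on partitionable startd ads. The collector address and the
# attribute names are assumptions and may need adjusting to the pool setup.
import htcondor

collector = htcondor.Collector("head01.af.example.org")  # hypothetical collector
ads = collector.query(
    htcondor.AdTypes.Startd,
    constraint='SlotType == "Partitionable" && TotalGpus > 0',
    projection=["Machine", "TotalGpus", "Gpus"],
)

total = sum(int(ad.get("TotalGpus", 0)) for ad in ads)   # GPUs on the machine
free = sum(int(ad.get("Gpus", 0)) for ad in ads)          # GPUs still unallocated
occupancy = (total - free) / total if total else 0.0
print(f"GPUs: total={total} free={free} occupancy={occupancy:.2f}")
# An autoscaler can scale the GPU worker deployment on `occupancy`.
```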
-
14:30
→
14:45
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Job draining at BNL, similar to what was seen prior to the dCache upgrade. At the time it was thought to be related to the SCORE_HIMEM job limit, but removing that limit clearly did not fix the problem. A lengthy investigation traced it to a potential HTCondor negotiator problem related to the job quotas. The job quotas were removed and the site refilled. In contact with the HTCondor developers to follow up.
- DC24 workshop at CERN this week
- ANALY_BNL_VP down due to HC TEST setting - trying to understand this status with Ilija
- Squid problem at SLAC (GGUS)
-
14:30
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
-
14:35
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- ServiceX
- AF instance still on 1.2.2 - still waiting on tests for 2.0.0
- Installed ServiceX-Lite on the River cluster, slaved to the AF ServiceX instance.
- FAB instance just now being modified in order to help NDN people do their tests.
- XCache
- All upgraded.
- Instabilities at MWT2 and AGLT2. Some of these were node issues, but in other cases the server is simply putting clients in a waiting loop. We are not sure whether the issue is a bug or a consequence of low performance (a timing probe sketch follows after this list).
- NRP
- An MSU node has been added to NRP.
- It now runs varnish for CVMFS. Once I stress test it, it can be used to cache requests from MSU nodes.
- varnish for CVMFS running on the NRP Starlight node is used by UC and working fine.
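For the XCache "clients stuck waiting" symptom, a periodic probe that times a ping and a small stat against each cache can distinguish a hung server from one that is merely slow. A minimal sketch with the XRootD Python bindings; the cache endpoints and test path are placeholders, not the real MWT2/AGLT2 caches.

```python
# Sketch: time a ping and a stat against an XCache endpoint to tell a hung
# server (client left waiting) from a merely slow one. Endpoints and the
# test path are placeholders.
import time

from XRootD import client

CACHES = ["root://xcache01.example.org:1094", "root://xcache02.example.org:1094"]
TEST_PATH = "/atlas/rucio/some/known/file"  # placeholder path

for cache in CACHES:
    fs = client.FileSystem(cache)
    start = time.time()
    status, _ = fs.ping(timeout=10)
    ping_dt = time.time() - start
    if not status.ok:
        print(f"{cache}: ping FAILED after {ping_dt:.1f}s ({status.message})")
        continue
    start = time.time()
    status, statinfo = fs.stat(TEST_PATH, timeout=30)
    stat_dt = time.time() - start
    print(f"{cache}: ping {ping_dt:.2f}s, stat {'ok' if status.ok else 'failed'} in {stat_dt:.2f}s")
```

A ping that succeeds quickly while the stat sits at the timeout is consistent with the server parking clients in a waiting loop rather than being generally overloaded.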
-
14:40
Facility R&D 5mSpeakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Worked with Horst to set up Kubernetes v1.28 at OU
- Investigating Keycloak integration with Kubernetes via kubelogin
- Would provide the authentication via CERN / ATLAS IAM and this part seems to essentially work
- Authorization TBD
- Seems straightforward to put users into predefined groups
- Not clear how to automatically create namespaces 'on the fly' when a user signs in for the first time
- May need a small script to implement an enrollment workflow that creates namespaces, rolebindings, etc. (a minimal sketch follows below)
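A minimal sketch of what such an enrollment step could look like, using the Kubernetes Python client: given a username from the Keycloak / ATLAS IAM login, create a personal namespace and bind a predefined role in it. The namespace prefix, the "af-user" ClusterRole, and the username mapping are all assumptions.

```python
# Sketch of an enrollment step: create a per-user namespace and bind an
# assumed, pre-existing ClusterRole ("af-user") to the user in it.
# Namespace prefix, role name, and username format are assumptions.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def enroll(username: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    rbac = client.RbacAuthorizationV1Api()

    namespace = f"user-{username}"

    try:
        core.create_namespace(body={"metadata": {"name": namespace}})
    except ApiException as err:
        if err.status != 409:  # 409 = already exists; keep enrollment idempotent
            raise

    role_binding = {
        "metadata": {"name": f"{username}-af-user", "namespace": namespace},
        "subjects": [{"kind": "User",
                      "name": username,
                      "apiGroup": "rbac.authorization.k8s.io"}],
        "roleRef": {"kind": "ClusterRole",
                    "name": "af-user",  # assumed predefined role
                    "apiGroup": "rbac.authorization.k8s.io"},
    }
    try:
        rbac.create_namespaced_role_binding(namespace=namespace, body=role_binding)
    except ApiException as err:
        if err.status != 409:
            raise


if __name__ == "__main__":
    enroll("jdoe")  # hypothetical user name taken from the Keycloak / IAM token
```

Hooking this into the first-login flow (e.g., triggered by a Keycloak event or run as a periodic reconciliation over the group membership) would cover the "namespaces on the fly" gap noted above.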
-
14:50
→
15:00
AOB 10m