US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00 → 13:05
WBS 2.3 Facility Management News 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Upcoming meetings
- Facility Coordination next week (Jan 10): review of milestones; quarterly reporting time
- Facility R&D next Thursday, Jan 11
- Facility Topical in two weeks (Jan 17), speaker TBD
- ATLAS S&C week, Feb 5-9, https://indico.cern.ch/event/1340782/timetable/
- ADC @ S&C week (Sites Jamboree), Feb 6-8, https://indico.cern.ch/event/1355529/
Holiday Updates
- Significant ops issues: none apparent (seemed pretty quiet)
- DC24 testing
-
13:05 → 13:15
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:15 → 13:35
WBS 2.3.1: Tier1 Center
Convener: Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:15
Infrastructure and Compute Farm 5m
Speaker: Thomas Smith
-
13:20
Storage 5m
Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
USATLAS T1 storage update:
- Decommissioning of old dCache core servers and migration of NFS doors (12/16/23)
- GGUS request from ATLAS for token implementation in production (12/17/23); a token smoke-test sketch follows this list
  - Changes from the previously recommended implementation
  - Successfully validated in the preproduction environment (9.2.X) before deployment on USATLAS (01-03-2024)
  - Ongoing discussions with the ATLAS and dCache teams regarding intermittent authentication failures observed; the issue is under investigation
  - USATLAS dCache core servers have been upgraded to dCache 8.2.40 to enhance verbosity for token authentication (01-03-2024)
- Commissioning of new dCache pools (dc260-dc270)
- Puppet code porting and refactoring (RHEL8 and Puppet 6)
- USATLAS dCache upgrade preparation from 8.2.X to 9.2.X:
  - A few issues submitted to the dCache team.
  - Preproduction has been upgraded to 9.2.8.
  - Tentative upgrade and downtime date: January 22.
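A minimal smoke test for the token setup described above (a sketch only: the door URL, file path, and token location are placeholders, not the real BNL endpoints):

```python
#!/usr/bin/env python3
"""Sketch: smoke-test token-based access to a dCache WebDAV door."""
import requests

# WLCG convention places the bearer token in $XDG_RUNTIME_DIR/bt_u<uid>.
with open('/tmp/bt_u1000') as f:                    # placeholder token path
    token = f.read().strip()

url = ('https://webdav-door.example.bnl.gov:2880'   # placeholder door
       '/pnfs/usatlas.bnl.gov/atlasscratchdisk/token-smoke-test.txt')
headers = {'Authorization': f'Bearer {token}'}

# Write, read back, and clean up with the same credential.
put = requests.put(url, data=b'token smoke test\n', headers=headers)
get = requests.get(url, headers=headers)
dele = requests.delete(url, headers=headers)
print(put.status_code, get.status_code, dele.status_code)  # expect 201, 200, 204
```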
-
13:25
Tier1 Services 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
Network upgrade
- Details in this ticket (Jira: NETOPS-595)
- BNL has been connected at 2x400 Gbps to ESnet since 12/20/23 (another 2x400 Gbps to be added by the end of the month)
- Some virtual routing features still to be clarified with ESnet (off for the moment)
- We managed to fill the farm with jobs within 4 hours after the end of the upgrade (Monitoring link)
Throughput tests
- Started by Hiro on 12/21
- Some monitoring problems with WLCG Site Network monitoring; fixed
- Ongoing, to be resumed later
Farm
- ADC mass blacklisting event on 12/30 due to a faulty central submission node. Reduced farm utilization by 20% for 2 hours (Monitoring link)
-
13:35 → 13:55
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the last month
- Bad job sets around Christmas caused some excitement
- AGLT2 and MWT2 observed continuing zombie pilots. Now we see ones where the proxy is updated but no payload is running.
- Lots of transfer failures, but mostly caused by a site in Romania.
- Local issues also occurred at all 4 sites
- Dell is setting up to let me benchmark EPYC Genoa (9354, 9374F) and Bergamo (9534, 9754) CPUs.
- CPUs chosen to match 12 memory channels and yield ~3 GB/thread.
- Quarterly reporting due...
- Monitoring: https://monit-grafana.cern.ch/goto/8wP4zKFIR?orgId=17
-
13:35
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wendy Wu (University of Michigan)
12/19/2023: we updated dCache from 9.2.4 to 9.2.7, took the opportunity to update the BIOS firmware and system kernel, and rebooted all the storage nodes. The whole process went very smoothly and caused only 30 minutes of downtime.
12/20/2023: in the morning we noticed the running job slots starting to ramp down (from 17.5K to 10.5K; about 40% of job slots were not shown in the ATLAS monitoring plot), while the HTCondor cluster was fully utilized (99% of job slots claimed). We put together a script, run as a cron job, to find all the zombie jobs (job status failed but pilot still running) and cleaned up most of them; we still see about a 5% job-slot discrepancy. A sketch of this kind of sweep follows these notes.
12/27/2023: we enabled and verified token-based storage access on the dCache system.
12/29/2023: we received a GGUS ticket about transfer failures with AGLT2 as the source; it turned out that one pool node (umfs24) had filesystem errors and a full /var area. We fixed the issue in the morning, and transfer efficiency returned to normal.
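A sketch of the zombie sweep mentioned in the 12/20 notes. The 'PandaID' ClassAd attribute and the BigPanDA JSON response shape are assumptions; the actual AGLT2 cron script may differ:

```python
#!/usr/bin/env python3
"""Sketch: remove pilots still running under HTCondor whose PanDA job
has already been closed out."""
import htcondor
import requests

schedd = htcondor.Schedd()
# JobStatus == 2 means "Running" in HTCondor.
for ad in schedd.query('JobStatus == 2',
                       projection=['ClusterId', 'ProcId', 'PandaID']):
    panda_id = ad.get('PandaID')            # assumed ClassAd attribute
    if panda_id is None:
        continue
    # Ask the PanDA monitor for the job's state (hypothetical response shape).
    r = requests.get('https://bigpanda.cern.ch/jobs/',
                     params={'pandaid': panda_id, 'json': 1}, timeout=30)
    jobs = r.json().get('jobs', [])
    state = jobs[0].get('jobstatus') if jobs else None
    if state in ('failed', 'finished', 'cancelled'):
        # PanDA closed the job but the pilot is still running: a zombie.
        job = f"{ad['ClusterId']}.{ad['ProcId']}"
        print(f'removing zombie pilot {job} (PanDA {panda_id}: {state})')
        schedd.act(htcondor.JobAction.Remove, job)
```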
-
13:40
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
Had a hypervisor with a critical service go down over the holidays. Brought back up within a day; otherwise ran fine. Will investigate making this service more resilient.
Plan to upgrade to dCache 9.2 and enable token support at the same time, completing GGUS ticket #164675, by the end of January.
Fred is in talks with Dell about benchmarking CPUs.
Got networking to work properly on the IU storage test node. Ran into other transfer issues, however, so we set it offline before the holidays to investigate.
Procurement and operations plans for 2024 are complete.
IU will be getting a new upstream switch to replace older hardware. Expecting a downtime for the replacement sometime in Q1 2024.
-
13:45
NET2 5m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
During the break:
- Certificate expiration problem [SOLVED]
- NESE connection drops when large files are transferred [WORKING]
- NESE BGP rules not present at some sites [TO BE INVESTIGATED]
In January:
- Rack 3 delayed: mid-December -> mid-January.
- Upgrade of dCache to 9.2.x with token support ongoing.
- Procurement ongoing.
-
13:50
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (UTA)
SWT2_CPB:
- Generally smooth operations over the holiday break.
- The host certificate for the CE / gatekeeper expired on 1/1/24, causing the cluster to drain. It was updated on 1/2/24. GGUS 164828
Upcoming work / projects:
- UPS upgrade will occur on 1/24/24 (power modules, electronics, etc.).
- As part of the downtime for this work we're planning to replace the cluster admin node.
- Planning for FY 24 procurements
- Finalize WLCG network monitoring setup with campus personnel. GGUS 162991
OU:
- Generally smooth operations, just a brief OSCER IPA (LDAP) outage
- OU SLATE node should be ready; working on initial testing
- Still working on getting ESnet monitoring for OU set up
- Then we can work on WLCG network monitoring
- Reasonable running over the last month
-
13:55 → 14:05
WBS 2.3.3 HPC Operations 10m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
Perlmutter
- (Doug & Lincoln) Following up on the xrootd service setup and testing
- (Doug) Job failures due to issues related to a wrong Python version, found before the break
TACC
- (Rui) Installing the harvester instance from the git repo Lincoln made
- Switching to Globus v5 before Jan 8th
- Issues with running cvmfsexec; standalone test with the pilot on the development node
-
14:05 → 14:30
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- GPU HTCondor worker auto-scaling is deployed at the Analysis Facility. HTCondor workers can now better coexist with Jupyter notebook servers in using the GPU resources (a sketch of the idea follows these bullets).
- Hardware maintenance: we replaced fans and motherboards on half of the AF cluster to address faulty-fan alerts. This was done on a rolling basis, so there was no service outage.
- The A100 GPU node loaned to SSL for an HSF training event has been returned to the Analysis Facility.
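A minimal sketch of the auto-scaling idea from the first bullet, pairing the HTCondor Python bindings with the Kubernetes client; the Deployment and namespace names are hypothetical, and the production mechanism at the AF may differ:

```python
#!/usr/bin/env python3
"""Sketch: size a Kubernetes GPU-worker Deployment to the idle GPU demand
in the HTCondor queue, capped so notebooks keep some GPUs free."""
import htcondor
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Count idle jobs requesting at least one GPU (JobStatus == 1 is "Idle").
schedd = htcondor.Schedd()
idle_gpu_jobs = len(schedd.query('JobStatus == 1 && RequestGPUs >= 1',
                                 projection=['ClusterId']))

MAX_WORKERS = 8                       # cap is an illustrative choice
replicas = min(idle_gpu_jobs, MAX_WORKERS)
apps.patch_namespaced_deployment_scale(
    name='htcondor-gpu-workers',      # hypothetical Deployment name
    namespace='af-condor',            # hypothetical namespace
    body={'spec': {'replicas': replicas}})
print(f'{idle_gpu_jobs} idle GPU jobs -> {replicas} workers')
```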
AF image change:
We have an image based on AnalysisBase 24.2.6 that we are using for the demo PHYSLITE analysis.
It comes with Dask integration: users can create personal K8s Dask clusters that autoscale up to 100 cores (a sketch of the setup follows this paragraph). It took a long time to get the Dask dashboards to work correctly.
Since the Dask workers need the same libraries to run the analysis, we had to give them a lot of memory... Later we will have to create dedicated, much smaller images.
The simplest possible analysis works on MC data, but with real data there are issues with the PHYSLITESchema, e.g. caloClusterLinks.
Will try to get this analysis working at scale before S&C.
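A minimal sketch of the per-user Dask setup described above, using the dask-kubernetes operator API; the cluster name and image tag are placeholders, not the actual AF image:

```python
# Sketch: personal autoscaling Dask cluster on Kubernetes.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(
    name='my-physlite-analysis',                       # per-user name (example)
    image='registry.example.org/analysisbase:24.2.6',  # placeholder image tag
)
cluster.adapt(minimum=0, maximum=100)  # autoscale up to ~100 cores, per the notes
client = Client(cluster)               # the dashboard is reachable via the client
```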
-
14:30 → 14:45
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
Token-enabled storage tracking:
- BNL-ATLAS: GGUS 164654, in progress (2023-12-22): Enable token support for storage
- MWT2: GGUS 164675, assigned (2023-12-18): Enable token support for storage
- SWT2_CPB: GGUS 164771, assigned (2023-12-20): Enable token support for storage
Please find more information at the ATLAS Site Status Board.
- Now that network monitoring has been fixed (thanks JD), Hiro will re-run network throughput tests reading from BNL
- Would like to run back-to-back with Mario's Rucio tool to compare
- The US ATLAS website will be updated to Drupal 10 tomorrow at 3pm
- It will now require COmanage login, as announced several times before
- US topics for Site Jamboree?
-
14:30
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
Reported by ADC:
- AGLT2: Transfer timeouts due to FS errors on a storage pool node. Fixed. (GGUS:164819)
- NET2: Connectivity to some sites is missing (GGUS:164795, since 12/22/23)
- OSCER: Transfer errors due to an authentication issue. Fixed (GGUS:164812)
- SWT2: Slow outbound connection (GGUS:164790, since 12/21/23)
Misc:
- ESnet monitor for OSCER is missing.
-
14:35
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Caching services
- XCache
  - I was informed that ESnet got the XCache node ready; we should meet to get it set up.
  - One of the BHAM nodes is out. Contacted them.
  - BNL is completely out. Ofer?
    - Yes, the version was updated but I still need to test
    - It has been in HC "test" status for some time (I was asking you about this in early November)
- Varnish (a probe sketch follows this list)
  - All SLATE Varnishes work fine
  - All NRP Varnishes work fine
  - Added an NRP Varnish Frontier instance for SLAC. The instance works fine but is not getting any requests. Is the queue there working?
  - Will add an instance for NET2
  - There was a very interesting special meeting of the Varnish and CVMFS people.
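A minimal sketch of probing a Varnish instance such as the SLAC Frontier one: fetch the same URL twice and inspect the caching headers. The endpoint is a placeholder; 'Age' is standard on cached responses, while 'X-Cache' appears only if the VCL adds it:

```python
# Sketch: check that a Varnish instance answers and serves from cache.
import requests

url = 'http://varnish-frontier.example.org:6081/'  # placeholder endpoint
for attempt in range(2):
    r = requests.get(url, timeout=10)
    # On the second fetch a cache hit typically shows a nonzero Age header.
    print(attempt, r.status_code, r.headers.get('Age'), r.headers.get('X-Cache'))
```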
ServiceX
- testing-4 and FAB instances updated to the new, more performant version. Works fine.
- Production instance works fine and will probably be updated tomorrow after the ServiceX meeting.
- ServiceXLite works fine and will need testing after the main instance update.
Analytics
- ES is working fine but needs a day or two of cleanup (clashing templates, changed pipelines, updating lifetime-management rules, etc.)
- Logstash collectors working fine
- A number of changes in the Alarm & Alert system
  - XCache
-
14:40
Facility R&D 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Worked with Horst to get the Kubernetes/SLATE server running at OU. Should be up now and awaiting applications (Squid).
Gave a presentation on Identity and Access Management at the last Facility R&D weekly: Identity & Access Explorations
- including how to get the ATLAS IAM working with Keycloak directly.
-
14:50 → 15:00
AOB 10m