US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00 → 13:05
WBS 2.3 Facility Management News 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Upcoming meetings
- Facility Coordination next week (Jan 10): review of milestones; quarterly reporting time
- Facility R&D next Thursday, Jan 11
- Facility Topical in two weeks (Jan 17), speaker TBD
- ATLAS S&C week, Feb 5-9, https://indico.cern.ch/event/1340782/timetable/
- ADC @ S&C week (Sites Jamboree), Feb 6-8, https://indico.cern.ch/event/1355529/
Holiday Updates
- Significant ops issues: none apparent (seemed pretty quiet)
- DC24 testing
-
13:05 → 13:15
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
-
13:15 → 13:35
WBS 2.3.1: Tier1 Center
Convener: Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:15
Infrastructure and Compute Farm 5m
Speaker: Thomas Smith
-
13:20
Storage 5m
Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
USATLAS T1 storage update:
- Decommissioning of old dCache core servers and migration of NFS doors (12/16/23)
- GGUS request from ATLAS for token implementation in production (12/17/23); a token smoke-test sketch follows this list
  - Changes from the previously recommended implementation
  - Successfully validated in the preproduction environment (9.2.X) before deployment on USATLAS (01-03-2024)
  - Ongoing discussions with the ATLAS and dCache teams regarding intermittent authentication failures observed; the issue is under investigation
  - USATLAS dCache core servers have been upgraded to dCache 8.2.40 to enhance verbosity for token authentication (01-03-2024)
- Commissioning of new dCache pools (dc260-dc270)
- Puppet code porting and refactoring (RHEL8 and Puppet 6)
- USATLAS dCache upgrade preparation from 8.2.X to 9.2.X:
  - A few issues submitted to the dCache team.
  - Preproduction has been upgraded to 9.2.8.
  - Tentative upgrade and downtime date: January 22.
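A minimal smoke test for the token setup described above (a sketch only: the door URL, file path, and token location are placeholders, not the real BNL endpoints):

```python
#!/usr/bin/env python3
"""Sketch: smoke-test token-based access to a dCache WebDAV door."""
import requests

# WLCG convention places the bearer token in $XDG_RUNTIME_DIR/bt_u<uid>.
with open('/tmp/bt_u1000') as f:                    # placeholder token path
    token = f.read().strip()

url = ('https://webdav-door.example.bnl.gov:2880'   # placeholder door
       '/pnfs/usatlas.bnl.gov/atlasscratchdisk/token-smoke-test.txt')
headers = {'Authorization': f'Bearer {token}'}

# Write, read back, and clean up with the same credential.
put = requests.put(url, data=b'token smoke test\n', headers=headers)
get = requests.get(url, headers=headers)
dele = requests.delete(url, headers=headers)
print(put.status_code, get.status_code, dele.status_code)  # expect 201, 200, 204
```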
-
13:25
Tier1 Services 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
Network upgrade
- Details in this ticket (Jira: NETOPS-595)
- BNL has been connected at 2x400 Gbps to ESnet since 12/20/23 (another 2x400 Gbps to be added by the end of the month)
- Some virtual routing features still to be clarified with ESnet (off for the moment)
- We managed to fill the farm with jobs within 4 hours after the end of the upgrade (Monitoring link)
Throughput tests
- Started by Hiro on 12/21
- Some monitoring problems with WLCG Site Network monitoring; fixed
- Ongoing, to be resumed later
Farm
- ADC mass blacklisting event on 12/30 due to a faulty central submission node. Reduced farm utilization by 20% for 2 hours (Monitoring link)
-
13:35 → 13:55
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the last month
- Bad job sets around Christmas caused some excitement
- AGLT2 and MWT2 observed continuing zombie pilots. Now we see ones where the proxy is updated but no payload is running.
- Lots of transfer failures, but mostly caused by a site in Romania.
- Local issues also occurred at all 4 sites
- Dell is setting up to let me benchmark EPYC Genoa (9354, 9374F) and Bergamo (9534, 9754) CPUs.
- CPUs chosen to match 12 memory channels and yield ~3 GB/thread.
- Quarterly reporting due...
- Monitoring: https://monit-grafana.cern.ch/goto/8wP4zKFIR?orgId=17
-
13:35
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wendy Wu (University of Michigan)
12/19/2023: we updated dCache from 9.2.4 to 9.2.7, took the opportunity to update the BIOS firmware and system kernel, and rebooted all the storage nodes. The whole process went very smoothly and caused only 30 minutes of downtime.
12/20/2023: in the morning we noticed the running job slots starting to ramp down (from 17.5K to 10.5K; about 40% of job slots were not shown in the ATLAS monitoring plot), while the HTCondor cluster was fully utilized (99% of job slots claimed). We put together a script, run as a cron job, to find all the zombie jobs (job status failed but pilot still running) and cleaned up most of them; we still see about a 5% job-slot discrepancy. A sketch of this kind of sweep follows these notes.
12/27/2023: we enabled and verified token-based storage access on the dCache system.
12/29/2023: we received a GGUS ticket about transfer failures with AGLT2 as the source; it turned out that one pool node (umfs24) had filesystem errors and a full /var area. We fixed the issue in the morning, and transfer efficiency returned to normal.
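A sketch of the zombie sweep mentioned in the 12/20 notes. The 'PandaID' ClassAd attribute and the BigPanDA JSON response shape are assumptions; the actual AGLT2 cron script may differ:

```python
#!/usr/bin/env python3
"""Sketch: remove pilots still running under HTCondor whose PanDA job
has already been closed out."""
import htcondor
import requests

schedd = htcondor.Schedd()
# JobStatus == 2 means "Running" in HTCondor.
for ad in schedd.query('JobStatus == 2',
                       projection=['ClusterId', 'ProcId', 'PandaID']):
    panda_id = ad.get('PandaID')            # assumed ClassAd attribute
    if panda_id is None:
        continue
    # Ask the PanDA monitor for the job's state (hypothetical response shape).
    r = requests.get('https://bigpanda.cern.ch/jobs/',
                     params={'pandaid': panda_id, 'json': 1}, timeout=30)
    jobs = r.json().get('jobs', [])
    state = jobs[0].get('jobstatus') if jobs else None
    if state in ('failed', 'finished', 'cancelled'):
        # PanDA closed the job but the pilot is still running: a zombie.
        job = f"{ad['ClusterId']}.{ad['ProcId']}"
        print(f'removing zombie pilot {job} (PanDA {panda_id}: {state})')
        schedd.act(htcondor.JobAction.Remove, job)
```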
-
13:40
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
Had a hypervisor with a critical service go down over the holidays. Brought back up within a day; otherwise ran fine. Will investigate making this service more resilient.
Plan to upgrade to dCache 9.2 and enable token support at the same time, completing GGUS ticket #164675, by the end of January.
Fred is in talks with Dell about benchmarking CPUs.
Got networking to work properly on the IU storage test node. Ran into other transfer issues, however, so we set it offline before the holidays to investigate.
Procurement and operations plans for 2024 are complete.
IU will be getting a new upstream switch to replace older hardware. Expecting a downtime for the replacement sometime in Q1 2024.
-
13:45
NET2 5m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
During the break:
- Certificate expiration problem [SOLVED]
- NESE connection drops when large files are transferred [WORKING]
- NESE BGP rules not present at some sites [TO BE INVESTIGATED]
In January:
- Rack 3 delayed: mid-December -> mid-January.
- Upgrade of dCache to 9.2.x with token support ongoing.
- Procurement ongoing.
-
13:50
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (UTA)
SWT2_CPB:
- Generally smooth operations over the holiday break.
- The host certificate for the CE / gatekeeper expired on 1/1/24, causing the cluster to drain. It was updated on 1/2/24. GGUS 164828
Upcoming work / projects:
- UPS upgrade will occur on 1/24/24 (power modules, electronics, etc.).
- As part of the downtime for this work we're planning to replace the cluster admin node.
- Planning for FY 24 procurements
- Finalize WLCG network monitoring setup with campus personnel. GGUS 162991
OU:
- Generally smooth operations, just a brief OSCER IPA (LDAP) outage
- OU SLATE node should be ready; working on initial testing
- Still working on getting ESnet monitoring for OU set up
- Then we can work on WLCG network monitoring
- Reasonable running over the last month
-
13:55 → 14:05
WBS 2.3.3 HPC Operations 10m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
Perlmutter
- (Doug & Lincoln) Following up on the xrootd service setup and testing
- (Doug) Job failures due to issues related to a wrong Python version, found before the break
TACC
- (Rui) Installing the harvester instance from the git repo Lincoln made
- Switching to Globus v5 before Jan 8th
- Issues with running cvmfsexec; standalone test with the pilot on the development node
-
14:05 → 14:30
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- GPU HTCondor worker auto-scaling is deployed at the Analysis Facility. HTCondor workers can now better coexist with Jupyter notebook servers in using the GPU resources (a sketch of the idea follows these bullets).
- Hardware maintenance: we replaced fans and motherboards on half of the AF cluster to address faulty-fan alerts. This was done on a rolling basis, so there was no service outage.
- The A100 GPU node loaned to SSL for an HSF training event has been returned to the Analysis Facility.
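A minimal sketch of the auto-scaling idea from the first bullet, pairing the HTCondor Python bindings with the Kubernetes client; the Deployment and namespace names are hypothetical, and the production mechanism at the AF may differ:

```python
#!/usr/bin/env python3
"""Sketch: size a Kubernetes GPU-worker Deployment to the idle GPU demand
in the HTCondor queue, capped so notebooks keep some GPUs free."""
import htcondor
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Count idle jobs requesting at least one GPU (JobStatus == 1 is "Idle").
schedd = htcondor.Schedd()
idle_gpu_jobs = len(schedd.query('JobStatus == 1 && RequestGPUs >= 1',
                                 projection=['ClusterId']))

MAX_WORKERS = 8                       # cap is an illustrative choice
replicas = min(idle_gpu_jobs, MAX_WORKERS)
apps.patch_namespaced_deployment_scale(
    name='htcondor-gpu-workers',      # hypothetical Deployment name
    namespace='af-condor',            # hypothetical namespace
    body={'spec': {'replicas': replicas}})
print(f'{idle_gpu_jobs} idle GPU jobs -> {replicas} workers')
```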
AF image change:
We have an image based on AnalysisBase 24.2.6 that we are using for the demo PHYSLITE analysis.
It comes with Dask integration: users can create personal K8s Dask clusters that autoscale up to 100 cores (a sketch of the setup follows this paragraph). It took a long time to get the Dask dashboards to work correctly.
Since the Dask workers need the same libraries to run the analysis, we had to give them a lot of memory... Later we will have to create dedicated, much smaller images.
The simplest possible analysis works on MC data, but with real data there are issues with the PHYSLITESchema, e.g. caloClusterLinks.
Will try to get this analysis working at scale before S&C.
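A minimal sketch of the per-user Dask setup described above, using the dask-kubernetes operator API; the cluster name and image tag are placeholders, not the actual AF image:

```python
# Sketch: personal autoscaling Dask cluster on Kubernetes.
from dask_kubernetes.operator import KubeCluster
from dask.distributed import Client

cluster = KubeCluster(
    name='my-physlite-analysis',                       # per-user name (example)
    image='registry.example.org/analysisbase:24.2.6',  # placeholder image tag
)
cluster.adapt(minimum=0, maximum=100)  # autoscale up to ~100 cores, per the notes
client = Client(cluster)               # the dashboard is reachable via the client
```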
-
14:30 → 14:45
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
Token-enabled storage tracking:
- BNL-ATLAS: GGUS 164654, in progress (2023-12-22): Enable token support for storage
- MWT2: GGUS 164675, assigned (2023-12-18): Enable token support for storage
- SWT2_CPB: GGUS 164771, assigned (2023-12-20): Enable token support for storage
Please find more information at the ATLAS Site Status Board.
- Now that network monitoring has been fixed (thanks JD), Hiro will re-run network throughput tests reading from BNL
- Would like to run back-to-back with Mario's Rucio tool to compare
- The US ATLAS website will be updated to Drupal 10 tomorrow at 3pm
- It will now require COmanage login, as announced several times before
- US topics for Site Jamboree?
-
14:30
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
Speaker: Ivan Glushkov (University of Texas at Arlington (US))
Reported by ADC:
- AGLT2: Transfer timeouts due to FS errors on a storage pool node. Fixed. (GGUS:164819)
- NET2: Connectivity to some sites is missing (GGUS:164795, since 12/22/23)
- OSCER: Transfer errors due to an authentication issue. Fixed (GGUS:164812)
- SWT2: Slow outbound connection (GGUS:164790, since 12/21/23)
Misc:
- ESnet monitor for OSCER is missing.
-
14:35
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Caching services
- XCache
  - I was informed that ESnet got the XCache node ready; we should meet to get it set up.
  - One of the BHAM nodes is out. Contacted them.
  - BNL is completely out. Ofer?
    - Yes, the version was updated but I still need to test
    - It has been in HC "test" status for some time (I was asking you about this in early November)
- Varnish (a probe sketch follows this list)
  - All SLATE Varnishes work fine
  - All NRP Varnishes work fine
  - Added an NRP Varnish Frontier instance for SLAC. The instance works fine but is not getting any requests. Is the queue there working?
  - Will add an instance for NET2
  - There was a very interesting special meeting of the Varnish and CVMFS people.
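A minimal sketch of probing a Varnish instance such as the SLAC Frontier one: fetch the same URL twice and inspect the caching headers. The endpoint is a placeholder; 'Age' is standard on cached responses, while 'X-Cache' appears only if the VCL adds it:

```python
# Sketch: check that a Varnish instance answers and serves from cache.
import requests

url = 'http://varnish-frontier.example.org:6081/'  # placeholder endpoint
for attempt in range(2):
    r = requests.get(url, timeout=10)
    # On the second fetch a cache hit typically shows a nonzero Age header.
    print(attempt, r.status_code, r.headers.get('Age'), r.headers.get('X-Cache'))
```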
ServiceX
- testing-4 and FAB instances updated to the new, more performant version. Works fine.
- Production instance works fine and will probably be updated tomorrow after the ServiceX meeting.
- ServiceXLite works fine and will need testing after the main instance update.
Analytics
- ES is working fine but needs a day or two of cleanup (clashing templates, changed pipelines, updating lifetime-management rules, etc.)
- Logstash collectors working fine
- A number of changes in the Alarm & Alert system
  - XCache
-
14:40
Facility R&D 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
Worked with Horst to get the Kubernetes/SLATE server running at OU. Should be up now and awaiting applications (Squid).
Gave a presentation on Identity and Access Management at the last Facility R&D weekly: Identity & Access Explorations
- including how to get the ATLAS IAM working with Keycloak directly.
-
14:50 → 15:00
AOB 10m