US ATLAS Computing Facility
→
US/Eastern
Description
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:05
WBS 2.3 Facility Management News 5m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:05
→
13:15
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release
Next week:
- XRootD 5.6.3 available in testing
- osg-system-profiler (dumps system crypto policy, XRootD configurations)
- vo-client (InCommon IGTF CA 3 updates, https://opensciencegrid.atlassian.net/browse/SOFTWARE-5743)
Aiming for week of Nov 27:
- vo-client (more InCommon IGTF CA 3 updates, https://opensciencegrid.atlassian.net/browse/SOFTWARE-5741)
- osg-ca-certs/osg-ca-certs-java with/without SHA1 workaround (https://opensciencegrid.atlassian.net/browse/SOFTWARE-5745). We plan on sending a warning to OSG sites
Miscellaneous
- Have Squid containers been updated to 23-release?
- Any complaints about/issues with OSG 23?
- Any word on CERN account renewal?
-
13:15
→
13:35
WBS 2.3.1: Tier1 Center
Convener: Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:15
Infrastructure and Compute Farm 5m
Speaker: Thomas Smith
- We've been investigating odd behavior of idle jobs in the queue that caused jobs sent to the site to throttle and the T1 to drain; a temporary fix to the group quotas has been implemented that allowed the flow of jobs to return to normal
- We've engaged the developers and are working on pinpointing the cause and a permanent solution
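A quick way to keep an eye on this kind of quota-related starvation while the permanent fix is worked out is to tally idle versus running jobs per accounting group straight from the schedd. This is only a minimal sketch using the HTCondor Python bindings, assuming the local schedd serves the affected queue and that jobs carry an AcctGroup attribute; it is not the actual BNL monitoring.

```python
# Sketch: count idle vs. running jobs per accounting group to spot
# quota-related starvation. Assumes the HTCondor Python bindings are
# installed and the local schedd serves the queue in question.
from collections import Counter

import htcondor

schedd = htcondor.Schedd()  # local schedd; pass a location ad for a remote one
idle = Counter()
running = Counter()

# JobStatus: 1 = Idle, 2 = Running
for ad in schedd.query(
    constraint="JobStatus == 1 || JobStatus == 2",
    projection=["JobStatus", "AcctGroup"],
):
    group = ad.get("AcctGroup", "<none>")
    if ad["JobStatus"] == 1:
        idle[group] += 1
    else:
        running[group] += 1

for group in sorted(set(idle) | set(running)):
    print(f"{group:30s} idle={idle[group]:6d} running={running[group]:6d}")
```

A persistently large idle count for a group that has free resources elsewhere in the pool is the signature of the quota/negotiator problem described above.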
-
13:20
Storage 5m
Speaker: Vincent Garonne (Brookhaven National Laboratory (US))
- On October 31, 2023, a successful major intervention was conducted on the storage. This involved refreshing the hardware for core dCache services, which included the addition of 10 new servers and the removal of 8 servers. Additionally, database migration (e.g., dCache namespace) was carried out to new servers, along with an update of PostgreSQL from version 12 to version 15, and the reconfiguration of core services.
- On October 19, 2023, a successful transition from the older SRM door server (dcsrm03.usatlas.bnl.gov) to two new door servers (dcfrontend01|2.usatlas.bnl.gov) was completed. These new servers also serve the TAPE REST API, alongside the introduction of a new DNS alias (dcfrontend.usatlas.bnl.gov); a reachability check sketch follows after this list.
- Reporting an issue to the WLCG monitoring team to rectify a glitch related to the SRM update in the availability and reliability accounting report for October 19th.
- Further details: https://ggus.eu/index.php?mode=ticket_info&ticket_id=164024
- Enhancements were made to the ATLAS pre-production instance, a critical service for validating changes before their deployment in the production environment. These enhancements involved decommissioning outdated servers, deploying new ones, and implementing the same deployment model and configuration management modules utilized in the production environment
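For door transitions like the one above, a simple client-side check that the new alias resolves and that both doors accept connections can catch DNS or firewall issues early. A minimal sketch only; the HTTPS port (8443) is an assumption and would need to match the actual door configuration.

```python
# Sketch: verify that the dcfrontend alias resolves and that each door
# answers a TCP connection on the assumed HTTPS port. Port 8443 is an
# assumption; substitute whatever the doors actually listen on.
import socket

ALIAS = "dcfrontend.usatlas.bnl.gov"
DOORS = ["dcfrontend01.usatlas.bnl.gov", "dcfrontend02.usatlas.bnl.gov"]
PORT = 8443  # assumed WebDAV / TAPE REST API port

alias_ips = {info[4][0] for info in socket.getaddrinfo(ALIAS, PORT)}
print(f"{ALIAS} -> {sorted(alias_ips)}")

for door in DOORS:
    try:
        with socket.create_connection((door, PORT), timeout=5):
            print(f"{door}:{PORT} reachable")
    except OSError as err:
        print(f"{door}:{PORT} NOT reachable: {err}")
```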
-
13:25
Tier1 Services 5m
-
13:35
→
13:55
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Stable running with no major issues.
- NET2 is ramping up...
- Eduardo is at the hospital with his newborn baby boy (a different but nice ramp up!)
- Working with Rob, Shawn, and Ofer on defining what information will be requested for the operations and procurement plans.
- We are tentatively asking people to submit their first draft by November 30 to allow time to discuss the contents before the milestone date of 12/31. This also gets the plans done before the holidays.
- I should have the draft of the listing information for the new IU admin/DevOps position to Rob & Shawn later today.
-
13:35
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wendy Wu (University of Michigan)
-
13:40
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
- Testing storage at IU. Still troubleshooting network issues before we can attempt to put it online
- Had a draining incident in mid-October caused by xrootd door issues; the cause is unclear. Restarting the door cleared it up
- Multiple squid tickets
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=163709
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=164021
- First ticket: squid seemed to stop working; restarting the service brought things back into production
- Second ticket: a NIC was having issues on the node; a reboot cleared it up. The ticket was reopened due to a different issue where the squid was working but the monitoring wasn't; restarting the squid resumed the monitoring
- Want to identify why and how the monitoring failed so we can fix that without bouncing the squid service (a probe sketch follows after this list)
- Testing a condor configuration to limit job memory usage so that workers stop rebooting after running out of memory
- Investigating CVMFS issues on AlmaLinux 9 workers at IU. Updated CVMFS on the nodes to see if that fixes things; otherwise, working with Dave Dykstra to debug
- https://github.com/cvmfs/cvmfs/issues/3434
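To separate "squid is down" from "only the monitoring is down", an end-to-end probe that fetches a known object through the proxy can run independently of the SNMP-based monitoring. A minimal sketch using requests; the proxy address and test URL are placeholders, not the actual MWT2 configuration.

```python
# Sketch: end-to-end probe of a squid proxy, independent of its SNMP
# monitoring. The proxy address and the test URL are placeholders.
import sys

import requests

PROXY = "http://uct2-squid.example.org:3128"   # hypothetical squid endpoint
TEST_URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"

try:
    resp = requests.get(TEST_URL, proxies={"http": PROXY}, timeout=10)
    resp.raise_for_status()
    # The X-Cache header (if the squid exposes it) says whether the reply came from cache
    print("OK", resp.status_code, resp.headers.get("X-Cache", "no X-Cache header"))
except requests.RequestException as err:
    print("squid probe FAILED:", err)
    sys.exit(1)
```

Run from cron or the existing monitoring host, a non-zero exit here points at the squid itself rather than the monitoring chain.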
-
13:45
NET2 5m
Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Operations:
- Short service interruption (a few minutes) due to a network interruption between NET2 and NESE. We are adding dedicated SNMP monitoring so we can react quickly to future problems (a polling sketch follows at the end of these notes)
Installation:
- "Rack 2" deployed. (~5000 slots total -- rack 1 + rack 2 -- available right now).
- Preparing "Rack 3"
Procurement:
- We finished FY23 procurement. 8 new NESE servers will be made available (an estimated additional 3.1 PB of usable disk storage).
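For the dedicated SNMP monitoring mentioned under Operations, the simplest useful check is polling ifOperStatus on the NET2-NESE link and alerting on any change. A minimal sketch with pysnmp; the switch address, community string, and interface index are placeholders, not the real NET2 devices.

```python
# Sketch: poll ifOperStatus (1 = up, 2 = down) for one interface via SNMP.
# Switch address, community string, and ifIndex are placeholders.
from pysnmp.hlapi import (CommunityData, ContextData, ObjectIdentity,
                          ObjectType, SnmpEngine, UdpTransportTarget, getCmd)

SWITCH = "net2-border-switch.example.org"   # hypothetical device
COMMUNITY = "public"                        # placeholder community string
IF_INDEX = 1                                # placeholder interface index

oid = ObjectIdentity(f"1.3.6.1.2.1.2.2.1.8.{IF_INDEX}")  # IF-MIB::ifOperStatus
error_indication, error_status, _, var_binds = next(
    getCmd(SnmpEngine(),
           CommunityData(COMMUNITY),
           UdpTransportTarget((SWITCH, 161), timeout=2, retries=1),
           ContextData(),
           ObjectType(oid))
)

if error_indication or error_status:
    print("SNMP query failed:", error_indication or error_status.prettyPrint())
else:
    for name, value in var_binds:
        status = int(value)
        print(f"{name.prettyPrint()} = {status} ({'up' if status == 1 else 'down'})")
```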
-
13:50
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Booth (UTA)
OU:
- Running well, no major problems; have gotten really good opportunistic throughput in the last few weeks
- Working on getting slate-squid installed, getting close!
UTA:
- Met with campus networking folks on Tuesday (11/7) to discuss various topics, in particular our needs (i.e., access to networking devices, etc.) in terms of deploying the WLCG site network monitoring. Making progress - hope to have this deployed within the next week or so.
- We're still waiting on Schneider / APC to finalize the date for performing the upgrade work on the UPS in the data center. Possibly by mid-December (before the holidays), otherwise early January. We're also planning to replace the cluster admin node at the same time.
- Working with Dell to resolve hardware issues on a few worker nodes (WNs).
- Patrick still coming to campus ~one day per week to help with training Zach - much appreciated!
- Generally smooth operations for ATLAS over the past four weeks.
-
13:55
→
14:05
WBS 2.3.3 HPC Operations 10m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Rui Wang (Argonne National Laboratory (US))
- Perlmutter: a large fraction of jobs have been reassigned by JEDI starting in September
- Most of those jobs have input files stuck in PENDING status
- After some investigation, this might be related to an issue with the storage at BNL that Doug has asked the xrootd developers to help debug
- There was a ticket reporting that CVMFS was missing on some of the login nodes. NERSC experts have checked all the nodes and confirmed that all of them now have CVMFS mounted (a probe sketch follows after this list)
- Cleaned up the queues (VP and ES)
- (Doug&Lincoln) following up with the xrootd service and RSE related setups
- TACC: No successful job starting from Sept.
- Globus key file was missing on CERN side. Contacted Tadashi to help on restore it.
- Issues with running CVMFSexec. Investigating using standalone test
- (Rui) install harvester instance from the git repo Lincoln made
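For the CVMFS-on-login-nodes issue, a small probe that each node can run (or that can be fanned out over ssh) makes regressions easy to catch; `cvmfs_config probe` is the standard client check. A minimal sketch; the repository list is an assumption.

```python
# Sketch: verify that the expected CVMFS repositories are mounted and
# answering on this node. Uses the standard `cvmfs_config probe` client
# command; the repository list is an assumption.
import os
import subprocess

REPOS = ["atlas.cern.ch", "atlas-condb.cern.ch", "sft.cern.ch"]

for repo in REPOS:
    mountpoint = f"/cvmfs/{repo}"
    if not os.path.isdir(mountpoint):
        print(f"{mountpoint}: MISSING")
        continue
    result = subprocess.run(
        ["cvmfs_config", "probe", repo],
        capture_output=True, text=True, timeout=60,
    )
    status = "OK" if result.returncode == 0 else "FAILED"
    print(f"{mountpoint}: {status} ({result.stdout.strip() or result.stderr.strip()})")
```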
-
14:05
→
14:30
WBS 2.3.4 Analysis Facilities
Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: Fengping Hu (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
- New GPU monitoring dashboard is up at the UC AF (gpu monitoring)
- GPU HTCondor worker autoscaling in the works
- allows GPUs to be shared between the batch system and Jupyter notebooks
- updated the metrics manager to provide an occupancy metric for the HTCondor partitions (see the sketch after this list)
- working on understanding HPA behavior
- Monitoring and alerting change
- we now route all default Prometheus alerts to a null receiver and only send the alerts we care about to the Slack channel.
- "/data" monitoring should be improved now
- Working with Matthew F on new images for the AF. These will be based on AnalysisBase and add dask, awkward-array, etc. We will create new ones on request, name them according to the AnalysisBase version, and place them in both DockerHub and Harbor. This kind of image will be needed for the Z->ee demo on PHYSLITE data.
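For the occupancy metric fed to the autoscaler, one way to compute it is to query the startd ads for GPU slots and report the fraction of assigned GPUs, which an HPA driven by a custom/external metric can then scale on. A minimal sketch with the HTCondor Python bindings; the collector address and the attribute names (TotalGpus/Gpus, partitionable slots assumed) are assumptions, not the actual AF metrics manager.

```python
# Sketch: compute a GPU occupancy fraction for the pool by comparing free
# vs. total GPUs on partitionable startd ads. The collector address and the
# attribute names are assumptions and may need adjusting to the pool setup.
import htcondor

collector = htcondor.Collector("head01.af.example.org")  # hypothetical collector
ads = collector.query(
    htcondor.AdTypes.Startd,
    constraint='SlotType == "Partitionable" && TotalGpus > 0',
    projection=["Machine", "TotalGpus", "Gpus"],
)

total = sum(int(ad.get("TotalGpus", 0)) for ad in ads)   # GPUs on the machine
free = sum(int(ad.get("Gpus", 0)) for ad in ads)          # GPUs still unallocated
occupancy = (total - free) / total if total else 0.0
print(f"GPUs: total={total} free={free} occupancy={occupancy:.2f}")
# An autoscaler can scale the GPU worker deployment on `occupancy`.
```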
-
14:30
→
14:45
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Job draining at BNL, similar to what was seen prior to the dCache upgrade. At the time it was thought to be related to the SCORE_HIMEM job limit, but removing that limit clearly did not fix the problem. A lengthy investigation traced it to a potential HTCondor negotiator problem related to the job quotas. The job quotas were removed and the site refilled. In contact with the HTCondor developers to follow up.
- DC24 workshop at CERN this week
- ANALY_BNL_VP down due to HC TEST setting - trying to understand this status with Ilija
- Squid problem at SLAC (GGUS)
-
14:30
ADC Operations, US Cloud Operations: Site Issues, Tickets & ADC Ops News 5m
-
14:35
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- ServiceX
- AF instance still on 1.2.2 - still waiting on tests for 2.0.0
- Installed ServiceX-Lite on the River cluster, slaved to the AF ServiceX instance.
- FAB instance just now being modified in order to help NDN people do their tests.
- XCache
- All upgraded.
- Instabilities at MWT2 and AGLT2. Some of these were node issues, but in other cases the server is simply putting clients in a waiting loop. We are not sure whether the issue is a bug or a consequence of low performance (a timing probe sketch follows after this list).
- NRP
- An MSU node has been added to NRP.
- It now runs varnish for CVMFS. Once I stress test it, it can be used to cache requests from MSU nodes.
- varnish for CVMFS running on the NRP Starlight node is used by UC and working fine.
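For the XCache "clients stuck waiting" symptom, a periodic probe that times a ping and a small stat against each cache can distinguish a hung server from one that is merely slow. A minimal sketch with the XRootD Python bindings; the cache endpoints and test path are placeholders, not the real MWT2/AGLT2 caches.

```python
# Sketch: time a ping and a stat against an XCache endpoint to tell a hung
# server (client left waiting) from a merely slow one. Endpoints and the
# test path are placeholders.
import time

from XRootD import client

CACHES = ["root://xcache01.example.org:1094", "root://xcache02.example.org:1094"]
TEST_PATH = "/atlas/rucio/some/known/file"  # placeholder path

for cache in CACHES:
    fs = client.FileSystem(cache)
    start = time.time()
    status, _ = fs.ping(timeout=10)
    ping_dt = time.time() - start
    if not status.ok:
        print(f"{cache}: ping FAILED after {ping_dt:.1f}s ({status.message})")
        continue
    start = time.time()
    status, statinfo = fs.stat(TEST_PATH, timeout=30)
    stat_dt = time.time() - start
    print(f"{cache}: ping {ping_dt:.2f}s, stat {'ok' if status.ok else 'failed'} in {stat_dt:.2f}s")
```

A ping that succeeds quickly while the stat sits at the timeout is consistent with the server parking clients in a waiting loop rather than being generally overloaded.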
-
14:40
Facility R&D 5mSpeakers: Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Worked with Horst to set up Kubernetes v1.28 at OU
- Investigating Keycloak integration with Kubernetes via kubelogin
- Would provide the authentication via CERN / ATLAS IAM and this part seems to essentially work
- Authorization TBD
- Seems straightforward to put users into predefined groups
- Not clear how to automatically create namespaces 'on the fly' when a user signs in for the first time
- May need a small script to implement an enrollment workflow that creates namespaces, rolebindings, etc. (a minimal sketch follows below)
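A minimal sketch of what such an enrollment step could look like, using the Kubernetes Python client: given a username from the Keycloak / ATLAS IAM login, create a personal namespace and bind a predefined role in it. The namespace prefix, the "af-user" ClusterRole, and the username mapping are all assumptions.

```python
# Sketch of an enrollment step: create a per-user namespace and bind an
# assumed, pre-existing ClusterRole ("af-user") to the user in it.
# Namespace prefix, role name, and username format are assumptions.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def enroll(username: str) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    rbac = client.RbacAuthorizationV1Api()

    namespace = f"user-{username}"

    try:
        core.create_namespace(body={"metadata": {"name": namespace}})
    except ApiException as err:
        if err.status != 409:  # 409 = already exists; keep enrollment idempotent
            raise

    role_binding = {
        "metadata": {"name": f"{username}-af-user", "namespace": namespace},
        "subjects": [{"kind": "User",
                      "name": username,
                      "apiGroup": "rbac.authorization.k8s.io"}],
        "roleRef": {"kind": "ClusterRole",
                    "name": "af-user",  # assumed predefined role
                    "apiGroup": "rbac.authorization.k8s.io"},
    }
    try:
        rbac.create_namespaced_role_binding(namespace=namespace, body=role_binding)
    except ApiException as err:
        if err.status != 409:
            raise


if __name__ == "__main__":
    enroll("jdoe")  # hypothetical user name taken from the Keycloak / IAM token
```

Hooking this into the first-login flow (e.g., triggered by a Keycloak event or run as a periodic reconciliation over the group membership) would cover the "namespaces on the fly" gap noted above.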
-
14:50
→
15:00
AOB 10m