US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
There is a WLCG Kubernetes (k8s) meeting being organized for June 7: https://indico.cern.ch/event/1096043/
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
3.6 release planned for tomorrow, contains:
- osg-configure 4.1.1, with a fix for a gratia probe check on sites with an HTCondor batch system (reported by Xin)
Pending testing, we may also release osg-scitokens-mapfile 8 (for 3.5 and 3.6).
Other software ready for testing:
- OSG 3.6:
- CVMFS 2.9.2 (bugfix release, see https://cvmfs.readthedocs.io/en/2.9/cpt-releasenotes.html)
- voms clients 2.0.15-1.5 (el7) and 2.1.0-0.14.rc2.5 (el8): increase default proxy key size to 2048 bits (smaller proxies are rejected by el8 servers)
- OSG 3.5-upcoming and 3.6:
- xcache (including atlas-xcache) 2.2.0 (OSG 3.5-upcoming) / 3.0.1 (OSG 3.6):
- Fixed xcache-reporter and xcache-consistency-check library issues (causing them to fail to run)
Working on improving documentation for updating to OSG 3.6, especially for xrootd / xcache services.
-
13:20
→
13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 30m
-
13:50
→
13:55
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:55
→
14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Reasonable running over the past two weeks with some issues.
- AGLT2 had two incidents where the site drained
- MWT2 had two incidents: DATADISK full & AC problems
- NET2 had lots of little incidents and is nearly fully drained today
- SWT2 had a couple of minor incidents.
- Spent some time looking at how quickly sites are put online by HammerCloud, with an extended discussion with Rod. In the end it was not clear why it took 3 hours to put MWT2 online after one of the draining issues, but Rod is pushing the HammerCloud team to provide better displays so a site can see what the hold-up is.
- Also spent some time looking at why a site can be delayed in refilling after being set online. Rod said we need to contact the team while it is happening so they can debug it. Here is a plot of the number of job slots filled vs. time (UTC), showing the site draining and refilling during the MWT2 AC incident (plot covers 24 hours):

- We have about 2 weeks to get OSG 3.6 in place. Let's do it!
- Let's get IPV6 going at NET2 and SWT2.
- UC finished the physical moving of gear to the new machine room but various mopping up continues to get things fully back online.
- UTA_SWT2 has finally shuffled off this mortal coil never to return.
- Could Mark/Patrick please get it disabled/removed from OSG/CRIC.
- GET YOUR QUARTERLY REPORTING IN TODAY!
-
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
All gatekeepers are updated to OSG 3.6 (HTCondorCEVersion: 5.1.3).
Working on a potential issue with the gratia-probe from the new-style built-in HTCondor-CE probe.
All worker nodes are updated to HTCondor 9.0.11, but gate01/04/aglbatch are still on 9.0.10.
UM is still waiting on worker nodes from the Fall 2021 order (R6525 AMD Rome).
UM and MSU are waiting on the January 2022 order (R6525 AMD Milan), with some/all now delayed to June 9 (all storage has been received and installed).
6-Apr: noticed a stage-out problem; traced to one dCache door (restarted all doors).
-
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
Tested OSG 3.6 a while back and are currently running it on an opportunistic queue. Will update the site in the coming weeks.
Rolling out kernel updates to fix security vulnerabilities.
UC:
- New Purchases
- New compute racked and cabled. Working on building.
- New storage is in the same situation, though we are sorting out some networking before bringing it online.
- Temporary networking solution in place for compute until purchased production equipment arrives.
- A cooling issue brought down a couple of storage nodes last week. They've been brought back up and have been running fine since.
- Working on a hardware gatekeeper, as we're currently only running off the IU gatekeeper.
- Discussed upgrading dCache to 7.2.x. No concrete date yet.
- Final move phase finished. No more major moves for ATLAS equipment.
IU:
- New compute is online.
- Looking to purchase a few more with remaining funds.
UIUC:
- New compute is online.
- PDU issue in the data center was causing benchmarking to trip power, but has been fixed and benchmarks finish without trouble.
- New Purchases
-
14:05
NET2 5m
Speaker: Prof. Saul Youssef
Operational issues:
An MGHPCC site-level cooling issue caused NESE Ceph to be offline for about 1 hour.
GPFS disk errors in the system pool required evacuation; reduced SGE while this was in process.
88 worker nodes installed and tested; all but 1 rack in production.
Rack of contributed nodes arrived.
Hard to get 100Gb optics from Dell.
New NESE Ceph storage equipment has arrived except for the Cisco switches, expected in August following long delays.
NESE Tape commissioning: the 50PB pool is being reformatted today due to an IBM firmware issue.
Working with the NESE team on NESE Tape expansion.
Quarterly reports are in. Hardware spreadsheet updated.
Run 3 software prep: working on perfSONAR & IPv6. OSG 3.6 to follow.
-
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
UTA_SWT2 is officially shut down and the equipment has been moved back to campus; work is ongoing to move compute nodes into the SWT2_CPB and K8s clusters.
Existing CE (gk01.atlas-swt2.org) will move to new hardware and OSG 3.6. The last existing job should drain today.
Investigating GRACC reporting issue related to two gatekeepers.
OU:
New GK and SLATE squid are about to be installed to be ready for testing.
Found a workaround for xrootd5 transfer failures seen with the new emi/rucio ALRB tests. xrootd5 client operations from compute nodes against the newly upgraded xrootd5 backend storage insist on using TLS; since the OSG CE propagates X509_USER_CERT and X509_USER_KEY into the wn-client environment, and the hostcert/key files don't exist there, the TLS handshake fails. Unsetting X509_USER_CERT and X509_USER_KEY in atlas_app/local/setup.sh.local prevents these failures.
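The fix amounts to a two-line addition; a minimal sketch, assuming the atlas_app/local/setup.sh.local hook quoted above is sourced in the worker-node job environment:

```shell
# atlas_app/local/setup.sh.local (sketch of the workaround above):
# the CE-propagated variables point at host cert/key files that do not
# exist on the worker nodes, which breaks the xrootd5 TLS handshake,
# so drop them and let the client fall back to the normal proxy.
unset X509_USER_CERT
unset X509_USER_KEY
```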
-
14:15
→
14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
-
14:20
→
14:35
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- GPU hosts are online; each host will be partitioned into 2x3 (20GB) MIG GPU instances. Xin is working on enabling HTCondor scheduling (the GPUs need to be published for use with partitionable slots)
- Image attached with the MIG options Doug found
- Snowmass CompF4 presentations on AFs last Friday
- News about review and UC onboarding event tomorrow
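A minimal sketch of the HTCondor side of the GPU publishing mentioned above, assuming standard HTCondor 9.x knobs (this is illustrative, not the actual BNL configuration):

```
# condor_config.local sketch (hypothetical): enable GPU discovery so
# the startd advertises the (MIG) GPU instances, and make the slot
# partitionable so jobs can claim them via request_GPUs.
use feature : GPUs
SLOT_TYPE_1               = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1          = 1
```

With MIG enabled, each compute instance is discovered as a separate GPU device, so a partitionable slot on a 2-GPU host would advertise six assignable GPUs in the 2x3 layout described above.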
-
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:35
→
14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- SBU and SMU storage endpoints decommissioned by DDM
- Thanks to Doug's Spring Cleanup, ten Tier-3 sites deleted (or disabled) in CRIC (ANLASC, BELLARMINE, Brandeis, Hampton, OLCF, Penn, SBU, SMU_HPC, Tufts, UPitt)
- UTA_SWT2 deactivated in OSG Topology (squid removed from monitoring)
- Follow up actions needed in CRIC?
- OSG 3.6 HTCondor-CE deployed and being tested at BNL on gridgk05
- Bug found and fixed in osg-configure (see https://opensciencegrid.atlassian.net/browse/SOFTWARE-5115)
- Harvester jobs are running successfully; will proceed with updates on the remaining gatekeepers once the updated osg-configure rpm is released in production
- Petr noted a bug in the XRootd 5 client with new gfal2, related to missing host cert files that are not needed; affects our XRootd/Slurm sites; there is a workaround and Horst filed a ticket with OSG
- Large HC exclusion yesterday related to a massive spike in held jobs; HC was disabled, but is back now
- Shifter monitor and/or alarm?
- Need QR info today please
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14:45
Kubernetes R&D at UTA 5m
Speaker: Armen Vartapetian (University of Texas at Arlington (US))
The startup K8s cluster works fine. I created a queue for it in CRIC, SWT2_CPB_K8S, which was later used for tests.
After that I created a Harvester service account in the K8s cluster and generated a kubeconfig file, used on the Harvester side to communicate with the cluster. Patrick did the firewall reconfiguration, but there was still a communication issue at first. This was tracked down to the initial setup of the cluster: the kubeadm init step by default picked up only the private IP address of the control plane. I regenerated the API server certificate to include the public IP address as well, and did all the related reconfiguration, after which communication with the cluster was established.
After that Fernando managed to submit several grid test jobs; they reached the workers but got stuck there in a waiting state, so we are looking into that right now.
On the hardware side, our admins will start adding nodes to the K8s cluster from the UTA_SWT2 equipment, which arrived last week, and we'll probably also update some of the existing worker nodes.
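The certificate regeneration described above can be sketched with kubeadm's cert phase; this is a sketch under assumptions (kubeadm phase syntax as in recent 1.x releases, with PUBLIC_IP standing in for the control plane's public address, which is not given in the minutes):

```shell
# Sketch: re-issue the kube-apiserver serving certificate with an extra
# SAN for the public IP (PUBLIC_IP is a placeholder, not from the minutes).
sudo mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.key /root/pki-backup/
sudo kubeadm init phase certs apiserver --apiserver-cert-extra-sans "$PUBLIC_IP"
# After restarting the kube-apiserver static pod, verify the SAN list:
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 'Subject Alternative Name'
```

The kubeconfig handed to Harvester must then point at the public address so the client validates against the new SAN.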
-
14:55
→
15:05
AOB 10m
-
13:00
→
13:10