US ATLAS Computing Facility

Name: US ATLAS Computing Facility
Start: 2022-07-06T13:00:00-04:00
End: 2022-07-06T15:10:00-04:00
Location: No location set

Wednesday 6 Jul 2022, 13:00 → 15:10 US/Eastern

Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID: 996 1094 4232

Meeting password: 125

Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

- 13:00 → 13:10
  WBS 2.3 Facility Management News 10m
  
  Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
  Two items to note:
  
  Prescrubbing at BNL: we now need to finalize the WBS 2.3.x presentations for the scrubbing. Target drafts by next week's facility coordination meeting?
  
  Upcoming quarterly reports: we would like to see WBS 2.3.x reports by July 20th to give us time to finish the WBS 2.3 report.
- 13:10 → 13:20
  
  OSG-LHC 10m
  
  Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
- 13:20 → 13:50
  
  Topical Reports
  
  Convener: Robert William Gardner Jr (University of Chicago (US))
- 13:50 → 13:55
  
  WBS 2.3.1 Tier1 Center 5m
  
  Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
  
  - investigating issues with xrootd access
  
  - received a small first batch of Run 3 RAW data on Tuesday evening
  
  - Petr V. requested that OSG sites check the site topology to ensure SRM references for disk-only sites are removed
- 13:55 → 14:15
  WBS 2.3.2 Tier2 Centers
  
  Updates on US Tier-2 centers
  
  Convener: Fred Luehring (Indiana University (US))
  
  N_Jobs-20220706.png
  
  Success-20220706.png
  
  Transfer-20220706.png
  Reasonable running over the last two weeks
  
  Communication issue with aipanda machines caused almost all running jobs to fail on 6/22/2022
  
  MWT2: some disruption between 6/25/2022 and 6/27/2022
  
  NET2 disruption between 6/25/2022 and 7/1/2022
  
  Please state how you are dealing with the current Linux kernel security issue.
  
  Please describe any updates you are doing to OSG, dCache, XRootD, etc.
  
  Please describe your procurement plans today.
  
  I really want to get our orders out earlier this year, rather than in late September as happened last year.
  
  I will check with Dell to find out what CPUs, server types, storage types, etc. might actually be available, to help guide what you order.
  
  We can follow up as needed at next week's Facility Management Meeting
  
  Please enter your quarterly reporting.
  - 13:55
    
    AGLT2 5m
    
    Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Dronen
    
    06/24/2022
    
    7 nodes became blackhole nodes because of a CVMFS issue, which was later traced to one of the squid servers.
    
    06/29/2022
    
    One of the SLATE squid servers, sl-um-es5, stopped working because of both an iptables issue and a full /var partition. This caused intermittent CVMFS issues. We got 2 GGUS tickets for this.
    
    06/30
    
    The SAM test jobs stopped running on 06/28. This started after the SAM test team made some changes (changing the leave_in_queue conditions on ETF). We could not find any obvious cause after a couple of days of debugging. Eventually we decided to restart the condor-ce services on both ATLAS gatekeepers. That got the SAM test jobs running again, but it also removed all running jobs on the gatekeepers, about 4000 jobs in total.
    
    07/06
    
    Upgraded dCache from 7.2.16 to 7.2.19 (with a reboot to the new kernel).
    All WNs are updated and ready for a reboot to the new kernel.
    Starting a rolling drain and reboot in batches.
    
    All R6525 servers (AMD Milan 7413) from the January 2022 order have shipped.
    A fraction has already been received.
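
The rolling drain and reboot in batches mentioned above can be sketched as follows. This is only an illustration of the batching step, not AGLT2's actual procedure: the node names, the batch size, and the function name `reboot_batches` are all hypothetical, and the drain, reboot, and verification steps are left as comments.

```python
from typing import Iterator, List, Sequence


def reboot_batches(nodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive groups of at most batch_size nodes, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


if __name__ == "__main__":
    # Hypothetical worker-node names, for illustration only.
    workers = [f"wn{i:03d}" for i in range(1, 8)]
    for number, batch in enumerate(reboot_batches(workers, batch_size=3), start=1):
        # A real procedure would drain each batch in the batch system,
        # wait for running jobs to finish, reboot, and verify the new kernel.
        print(f"batch {number}: {', '.join(batch)}")
```

Keeping the batch size small relative to the cluster limits how much capacity is offline at any one time, at the cost of a longer overall rollout.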
  - 14:00
    
    MWT2 5m
    
    Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
    
    Upgrading elasticsearch to 8.3. Cluster upgraded to 7.17 last week.
    
    Still waiting on UChicago IT Services to configure our new Juniper networking gear from our most recent purchase.
    
    Updating condor to 9.0.13-1.1.osg36.el7 on the workers. IU is done. UC is halfway done. UIUC still needs to be upgraded.
    
    A switch and servers rebooted at IU last weekend. Back online by Monday.
    
    Replacing the motherboard on the problematic dCache pool node appears to have fixed the lockup issues. Another dCache pool node had a bad NIC; this has also been replaced and the pool node is back online.
    
    Removed ALRB testing variables from the workers and gatekeepers.
    
    Applied user.max_net_namespaces=0 for kernel mitigation.
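
A quick way to confirm that the `user.max_net_namespaces=0` setting is active on a node is to read the value back from procfs. The sketch below assumes a Linux host that exposes the sysctl at `/proc/sys/user/max_net_namespaces`; the helper name `net_namespaces_disabled` is hypothetical and this is not necessarily how MWT2 verifies the mitigation.

```python
from pathlib import Path

# Standard procfs location for the sysctl user.max_net_namespaces.
SYSCTL_PATH = Path("/proc/sys/user/max_net_namespaces")


def net_namespaces_disabled(path: Path = SYSCTL_PATH) -> bool:
    """Return True if the sysctl file at `path` holds the value 0."""
    return int(path.read_text().strip()) == 0


if __name__ == "__main__":
    try:
        state = "applied" if net_namespaces_disabled() else "NOT applied"
    except (OSError, ValueError) as exc:
        # The file may be missing on older kernels or non-Linux hosts.
        state = f"unknown ({exc})"
    print(f"user.max_net_namespaces=0 mitigation: {state}")
```

The path is a parameter so the same check can be pointed at a test file or at a mounted image rather than the live system.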
  - 14:05
    
    NET2 5m
    
    Speaker: Prof. Saul Youssef (Boston University (US))
  - 14:10
    SWT2 5m
    
    Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
    
    UTA:
    
    Received all of the R6525's that were outstanding (45 machines). Starting to rack them.
    
    Fixed a configuration problem in the compute nodes of the Kubernetes cluster.
    
    Testing IPv6 and OSG 3.6 XRootD Standalone
    
    Acts as a proxy to the backend storage (replicates existing services)
    
    Drops GridFTP as an available protocol
    
    Reconfigured AC unit to avoid some problems associated with additional load
- 14:15 → 14:20
  
  WBS 2.3.3 HPC Operations 5m
  
  Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))
  
  Cori running well
  
  Ongoing debugging of Perlmutter failed jobs: the Pilot cannot update the PanDA status a significant fraction of the time, so jobs get marked as failed despite completing successfully.
  
  Attempting to convert Perlmutter to use the latest Pilot3 code - may help
- 14:20 → 14:35
  WBS 2.3.4 Analysis Facilities
  
  Conveners: Ofer Rind (Brookhaven National Laboratory), Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:20
    Analysis Facilities - BNL 5m
    
    Speaker: Ofer Rind (Brookhaven National Laboratory)
    
    ATLAS Analysis Facility Task Force Mandate document (review and comment)
    
    Discussion of AE2 outcomes document at AF Forum last week
    
    Presentations at BNL/JLAB/HSF S&C Round Table next Tuesday
  - 14:25
    
    Analysis Facilities - SLAC 5m
    
    Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
  - 14:30
    
    Analysis Facilities - Chicago 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    We are creating two additional platforms.
    
    One will serve educational purposes and, unlike the Codas workshop, will have tools usable by all of HEP, not only ATLAS (ServiceX for CERN Open Data, Jupyter with a ROOT kernel, etc.).
    
    The other one will be dedicated to ATLAS Analytics with tools that support Analytics efforts.
- 14:35 → 14:55
  WBS 2.3.5 Continuous Operations
  
  Convener: Ofer Rind (Brookhaven National Laboratory)
  Updated facility services spreadsheet
  
  Big progress at NET2, now supporting token auth
  
  XRootD access broken at BNL (GGUS). This is causing problems with the VP queue (especially at BNL) and elsewhere
  
  Numerous SLATE squid issues in the past week (iptables, partition size at AGLT2, OOM at IU, ...)
  
  DOMA BDT discussion today about using X.509 and tokens for the Data Challenge
  
  2.3.5 folks please get QR in ASAP...
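
Since a full partition was one of the causes of the squid issues listed above, a simple disk-usage check is one way to catch that condition early. The sketch below is an assumption-laden illustration, not a description of SLATE or site monitoring: the function names and the 90% threshold are hypothetical, and only the standard-library call `shutil.disk_usage` is used.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """Return True when the used fraction is at or above `threshold`."""
    return usage_fraction(path) >= threshold


if __name__ == "__main__":
    # "/var" is the partition named in the notes; the 90% threshold is an assumption.
    for mount in ("/var",):
        print(f"{mount}: {usage_fraction(mount):.0%} used, "
              f"nearly full: {is_nearly_full(mount)}")
```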
  - 14:35
    
    US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
    
    Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
    
    US-cloud-summary-6_29_22.pdf
    
    US-cloud-summary-7_6_22.pdf
  - 14:40
    Service Development & Deployment 5m
    
    Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
    
    ES
    
    upgraded to 7.17 last Thursday
    
    today trying to upgrade to 8.3
    
    XCaches
    
    all stable
    
    VP
    
    all running fine except BNL
    
    Testing SLATE-deployed, Varnish-based CVMFS caching at AGLT2.
  - 14:45
    
    Kubernetes R&D at UTA 5m
    
    Speaker: Armen Vartapetian (University of Texas at Arlington (US))
    
    Yesterday Patrick found that, although the K8s cluster uses the same compute node setup as the Tier-2 cluster, one of the node parameters was preventing K8s from running containers (jobs were stuck in the ContainerCreating state). Once he rolled that setting back, jobs started to run.
    Pinged Fernando today, waiting for ATLAS test jobs.
- 14:55 → 15:05
  
  AOB 10m
