US ATLAS Computing Facility

US/Eastern
Description

Facilities Team Google Drive Folder

Zoom information

Meeting ID:  996 1094 4232

Meeting password: 125

Invite link:  https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09

 

 

    • 13:00 13:10
      WBS 2.3 Facility Management News 10m
      Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))

      HEPiX is this week https://indico.cern.ch/event/1123214/timetable/#20220427.detailed

       

    • 13:10 13:20
      OSG-LHC 10m
      Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci

      Release yesterday: https://opensciencegrid.org/docs/release/osg-36/#april-26-2022-cvmfs-292-upcoming-htcondor-981

      • CVMFS bugfix release
      • VOMS clients now generate 2048-bit proxies by default (see the sketch after this list)
      • osg-ce minor update that will help us track OSG 3.6 updates
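
      A quick way to confirm the new 2048-bit default is to inspect the key size of a freshly generated proxy. Below is a minimal Python sketch, assuming the proxy sits at the conventional /tmp/x509up_u<uid> path and the 'cryptography' package is available (neither detail comes from these notes):

      # Hedged sketch: check the key size of a VOMS proxy to confirm the new
      # 2048-bit default. Path and library are assumptions, not from the minutes.
      import os
      from cryptography import x509

      proxy_path = f"/tmp/x509up_u{os.getuid()}"
      with open(proxy_path, "rb") as f:
          pem = f.read()

      # The proxy certificate is the first PEM certificate block in the file.
      marker = b"-----END CERTIFICATE-----"
      first_cert = pem.split(marker)[0] + marker
      cert = x509.load_pem_x509_certificate(first_cert)
      print(f"proxy key size: {cert.public_key().key_size} bits")  # expect 2048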

      OSG 3.5 EOL on May 1!

      HTCondor Week registration is closing soon! See invitation:

      Greetings CHTC Users!

      We want to invite you to HTCondor Week 2022, our annual HTCondor user conference, May 23-26, 2022. This year, HTCondor Week will be a hybrid event: we are hosting an in-person meeting at the Fluno Center on the University of Wisconsin-Madison campus. This provides HTCondor Week attendees with a compelling environment in which to attend tutorials and talks from HTCondor developers, meet other users like you and attend social events. For those who cannot attend in person, we'll also be broadcasting the event online via a Zoom meeting.

      Registration for HTCondor Week 2022 is open now. The registration deadline for in-person attendees is May 2, 2022, and the cost is $90 per day to partake in conference food. For virtual-only attendance, registration is a flat $25 fee for the whole week.

      UW-Madison affiliates who attend conference talks in person only need to register for in-person participation (and pay) if they plan to partake in conference food. We also recommend the virtual registration (which still carries a fee) for UW-Madison affiliates who plan to participate virtually.
      You can register at http://htcondor.org/HTCondorWeek2022.

      There will be specific programming highlighting the UW-Madison campus community on Thursday, May 26, where you can meet other campus users of CHTC and HTCondor, as well as CHTC staff. We will separately contact some CHTC users to present their work that day!!

      On other days, we will have a variety of in-depth tutorials and talks where you can learn more about HTCondor and how other people are using and deploying HTCondor. Best of all, you can establish contacts and learn best practices from people in industry, government, and academia who are using HTCondor to solve hard problems, many of which may be similar to those you are facing.

      Hotel details and agenda overview are on the HTCondor Week 2022 site:

      http://htcondor.org/HTCondorWeek2022

      We hope to see you there,

      The Center for High Throughput Computing

       

    • 13:20 13:50
      Topical Reports
      Convener: Robert William Gardner Jr (University of Chicago (US))
    • 13:50 13:55
      WBS 2.3.1 Tier1 Center 5m
      Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))

      Migrated all gatekeepers to OSG 3.6

      - HC (HammerCloud) turned us off 5 times in the past week; investigation ongoing, as the issues look like non-site-related errors.


       

    • 13:55 14:15
      WBS 2.3.2 Tier2 Centers

      Updates on US Tier-2 centers

      Convener: Fred Luehring (Indiana University (US))
      • Pretty good running over the past two weeks.
        • Lots of compute slots coming online.
        • The main issue ("EPoll:") is a pilot problem where the string "EPoll:" gets prepended to variable values, causing the pilot to kill the job after it has been running without problems (see the sketch after this list). This hurt job efficiency a lot and caused HammerCloud to kick sites offline when there was no site problem. It seems to affect sites running HTCondor and dCache, even though the working hypothesis is an XRootD issue.
      • We are down to the wire on getting OSG 3.6 into use.
      • How are NET2 and SWT2 doing on enabling IPv6?
      • Please keep these sheets up to date:
        • Service versions: https://docs.google.com/spreadsheets/d/1_fKB6GckfODTzEvOgRJu9sazxICM_RN95y039DZHF7U
        • Run 3 readiness: https://docs.google.com/spreadsheets/d/1KniOlqb4dbJ6dKUHBYYt9OfriKjhVpUqXguPvryIMY8
        • Site capacity: https://docs.google.com/spreadsheets/d/1nZnL1kE_XCzQ2-PFpVk_8DheUqX2ZjETaUD9ynqlKs4
          • NB: I will not add the tabs for the current (April-June) quarter until I am sure that the data for the previous quarter actually reflects the situation on March 31.
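
      To illustrate the "EPoll:" failure mode above: a value the pilot reads back arrives with an unexpected "EPoll:" string prepended, so a strict parse fails even though the job itself is healthy. A purely hypothetical defensive parse (not the actual pilot code) might look like:

      # Illustrative only: strip a spurious "EPoll:" prefix before parsing,
      # so an otherwise numeric value does not trigger a job kill.
      def parse_reported_value(raw: str) -> int:
          cleaned = raw.strip()
          if cleaned.startswith("EPoll:"):      # spurious prefix described above
              cleaned = cleaned[len("EPoll:"):].strip()
          return int(cleaned)

      assert parse_reported_value("4096") == 4096
      assert parse_reported_value("EPoll: 4096") == 4096
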
      • 13:55
        AGLT2 5m
        Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)

        Now on OSG 3.6 for both gatekeepers and worker nodes.

        We broke the Frontier squids while trying to fix Gratia probe problems.
        Our first fix attempt inadvertently re-enabled a local setup script that overrode the squid location variables.
        The Gratia issues are solved: the directory ownership was root instead of condor.
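
        A minimal Python sketch of the ownership check described above; the directory path is an assumption, not taken from these notes:

        # Verify the Gratia probe data directory is owned by 'condor' rather than root.
        import pwd
        from pathlib import Path

        data_dir = Path("/var/lib/gratia/data")   # hypothetical probe data directory
        owner = pwd.getpwuid(data_dir.stat().st_uid).pw_name
        if owner != "condor":
            print(f"{data_dir} is owned by {owner}; the probe expects 'condor'")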

         

        2 tickets:

        156868  15-Apr-2022   AGLT2: Failing jobs in panda with "Unable to identify specific exception"
        156873  17-Apr-2022   US AGLT2: High Transfer failures as source

        The job problems were traced to timeouts during stage-out.
        There was no single clear cause, but the likely suspect was dCache/Java running out of memory.
        We increased the memory for WebDAV on the doors and for dCacheDomain on the headnodes.
        We also added CPUs and memory to the VM doors.  That all helped.
        We also upgraded dCache from 6.2.35 to 7.2.15 (since we had to restart to load new CA certs anyway).
        The issues from both tickets disappeared after that.

         

        Maintenance:

        Mostly through with updating all worker nodes: new kernel, Dell firmware updates, and OSG updates (CVMFS).

         

        Network upgrades completed and tested:

        All new multi-path and multi-100G connections to ESnet and between MSU and UM are now fully deployed
        and were tested for proper failover in the event of a backhoe-versus-fiber incident.

      • 14:00
        MWT2 5m
        Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))

        All new compute and storage online at the three sites.

        Both gatekeepers upgraded to OSG 3.6.

        Restarted dCache after upgrading osg-ca-certs. Planning an upgrade to dCache 7.2 for the week of 9 May.

         

      • 14:05
        NET2 5m
        Speaker: Prof. Saul Youssef

         

        Smooth operations.  New workers are in production. 

        NESE team preparing for a ~5-rack expansion of NESE Ceph, including NET2 storage; slowed down by Cisco switch delivery. This will allow retirement of NET2 GPFS and make more space for workers.

        Working on IPv6, then OSG 3.6; also upgrading ToR networking and NET2-NESE networking.

      • 14:10
        SWT2 5m
        Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))

        UTA:

        • Compute nodes from UTA_SWT2 have been integrated into SWT2_CPB
          • UTA_SWT2 is now disabled in CRIC
          • Working on updating some old compute nodes with additional memory
          • Still need to update Capacity Spreadsheet/OIM/CRIC to reflect changes
        • Received partial shipment of R6525 nodes (8 nodes of 48)
          • The machines are racked
          • Need to update Rocks install kernel to support RAID card before installation
        • Work is progressing on configuring the new OSG 3.6 CE.

        OU:

        - Drained some HEP nodes to move them; they should be back up later today.

        - Should get the rest of the newly arrived HEP nodes up and running soon as well.

         

    • 14:15 14:20
      WBS 2.3.3 HPC Operations 5m
      Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Argonne National Laboratory (US))

      TACC

      • Running as normal; a bit over 10K SUs left.

      NERSC

      • The old task was taking 10+ hours to run; asked John Anders to send a new task. The new task seems to be running OK.
      • Working with Doug and Wei to get XRootD going at NERSC for DATADISK.
    • 14:20 14:35
      WBS 2.3.4 Analysis Facilities
      Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:20
        Analysis Facilities - BNL 5m
        Speaker: Ofer Rind (Brookhaven National Laboratory)
        • First of the new AF Forum series was held last week, with an update on k8s batch and the upcoming KubeCon
        • Met on Friday to work out details of the BNL/NERSC XRootD SE setup
      • 14:25
        Analysis Facilities - SLAC 5m
        Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
      • 14:30
        Analysis Facilities - Chicago 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        AGC was a complete success. Everything worked, we had some new users, and there were very interesting discussions; the discussion at the ATLAS parallel session at the end was especially interesting.

         

    • 14:35 14:55
      WBS 2.3.5 Continuous Operations
      Convener: Ofer Rind (Brookhaven National Laboratory)
      • All BNL gatekeepers now updated and running OSG 3.6 HTCondor-CEs
      • On Monday, ATLAS jobs began triggering a machine check exception (MCE) on some older servers at BNL (Dell R640 Skylake).  These hosts are currently closed to jobs to apply a firmware update.
      • 14:35
        US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
        Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
      • 14:40
        Service Development & Deployment 5m
        Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))

        * XCache - working fine

        * VP - working fine - will summarize performance and the BHAM experience with switching to VP at the next DDM meeting.

        * ServiceX - works fine at 1.0.30. Next week will be dedicated to performance-improvement development.

        * Analytics - adding new functionality to ATLAS Alarm & Alert Frontend.

      • 14:45
        Kubernetes R&D at UTA 5m
        Speaker: Armen Vartapetian (University of Texas at Arlington (US))

        All the existing Kubernetes worker nodes were updated with additional memory. Part of the hardware from the retired UTA_SWT2 cluster was also racked at CPB and added to the cluster: Kubernetes was installed on those nodes and they were joined to the existing cluster. The cluster is reporting healthy.
        Now trying to find out why grid jobs reach the workers but get stuck there in a waiting state (see the sketch below).
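
        A minimal diagnostic sketch for the waiting-state investigation, assuming the stuck grid jobs surface as Kubernetes pods whose containers remain in a 'waiting' state (an assumption; the hang could also be at the HTCondor level). It uses the official kubernetes Python client and a valid kubeconfig for the cluster:

        # List containers stuck in 'waiting' and print the recent events that explain why.
        from kubernetes import client, config

        config.load_kube_config()        # or load_incluster_config() on a cluster node
        v1 = client.CoreV1Api()

        for pod in v1.list_pod_for_all_namespaces().items:
            for cs in (pod.status.container_statuses or []):
                if cs.state.waiting:     # container has not started running yet
                    print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                          f"{cs.name} waiting ({cs.state.waiting.reason})")
                    events = v1.list_namespaced_event(
                        pod.metadata.namespace,
                        field_selector=f"involvedObject.name={pod.metadata.name}",
                    )
                    for ev in events.items:
                        print("   event:", ev.reason, "-", ev.message)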

    • 14:55 15:05
      AOB 10m