US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release
- blahp (3.5 upcoming, 3.6) fixes for LSF, PBS, and Slurm.
- osg-token-renewer: package for GlideinWMS frontends + Harvester to manage token renewal from OIDC providers like IAM
Other
Register for and attend the token transition workshop! https://indico.fnal.gov/event/50597/overview
- Hear details about the OSG plans and timeline
- Learn where other VOs are in their transition
- Start receiving token-based pilots at your CE during the Technical Working Sessions (if you've updated to HTCondor-CE 5 + HTCondor 9)
- Learn about bearer token technology and how to request your own bearer tokens
- Participate in the Open Discussion and Policy Working Session to discuss timelines and concerns
-
13:20
→
13:50
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
Lancium Computing 30m
Andrew will tell us about Lancium Computing and their new, inexpensive approach to delivering computing in a "green" way.
We plan to allocate half the time to presentation and half to discussion.
Speaker: Andrew Grimshaw (Lancium)
-
13:50
→
13:55
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:55
→
14:15
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Good Running
- MWT2 downtime with IU servers still down.
- XRootD still has issues; could the affected sites report on the status?
- Still having significant trouble getting a clear story on delivery from Dell.
- IPv6: could sites not yet supporting IPv6 report on their progress?
- SRR using dCache is now working at BNL; we need to get it going at AGLT2 and NET2.
-
13:55
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
Data Challenge:
No problem
Ongoing IPv6 issues with new UM networking:
17 of the previous 31 worker nodes have now recovered from IPv6 issues and are back in condor
(after migrating the gateway from shinano to the new Cisco pair).
The other 14 nodes still fail to reach some internal IPv6 addresses (i.e. they still run only BOINC).
The UM network team has been helping and has now opened a support ticket with Cisco.
MSU migration to campus data center:
Signing a formal agreement, which includes that we will not be charged for space/power.
Purchase Status:
R740xd2 Storage nodes arriving this week at UM and next week at MSU.
Two months earlier than the early-Dec estimate at PO time
and even one month earlier than the sales rep's mid-Nov estimate at pre-order time.
The rest is still shown as mid-January on the Dell order support website.
-
14:00
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
-
14:05
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
-
14:10
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA_SWT2:
Waiting on DDM-ops to remove the last remaining data from DATADISK area (~60TB).
SWT2_CPB:
- Setting up a meeting with OIT this week to go over the latest IPv6 updates. We expect to at least set up the perfSONAR machines with IPv6 next week.
- The WebDAV door has had problems; working with Wei and Andy to debug the issues.
- I/O load is an issue: GridFTP is not used much, but it was being used along with WebDAV during last week's network data challenge.
OU:
- xrootd (5.3.1) on the proxy gateway (se1.oscer.ou.edu) kept hanging. Increased the stack size (xrd.sched stksz 4m) and the limits in
/etc/security/limits.d/20-nfile.conf
/etc/security/limits.d/20-nproc.conf
and switched to jemalloc. That seems to have stabilized it for the moment. Still waiting for 5.3.2.
- Kept seeing ES job stage-in failures, but don't see anything wrong here. If there were problems locally, non-ES stage-ins would presumably fail, too.
Issue seems to have gone away again now, so was probably on the other end.
- write_lan is still not working in the rucio copy tool, causing failover to write_wan and unnecessary load on the proxy. Waiting for a rucio fix.
- Good Running
-
14:15
→
14:20
WBS 2.3.3 HPC Operations 5m
Speakers: Lincoln Bryant (University of Chicago (US)), Rui Wang (Northern Illinois University (US))
- TACC
- Spoke to TACC consultants about the ongoing queue time problems. 2 factors:
- Special "Gordon Bell" tasks prioritized in the queue for the last few weeks. These jobs are nearing completion.
- We are projected to use up 123% of our allocation, so the scheduler has correspondingly down-prioritized us.
- NERSC
- Power outage over the weekend brought down the workers + Harvester. Restarted now.
- Initial Perlmutter setup work ongoing
-
14:20
→
14:35
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:20
Analysis Facilities - BNL 5m
Speaker: Ofer Rind (Brookhaven National Laboratory)
- Smooth operations
- Working with ServiceX developers (Ben Galewsky) to test at SDCC
- Status of RBT?
- Interesting Accelerator Forum today
-
14:25
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:30
Analysis Facilities - Chicago 5m
Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
-
14:35
→
14:55
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- Quarterly reports!
- Network data challenge successful - latest summary of results from today's GDB
- Tape challenge is ongoing, in phase 2 A-DT mode
- Xin has deployed and tested HTCondor-CE with token auth at BNL
- Some important testing notes re: fallback to X509
- OSG has draft of documentation for XRootD HTTP-TPC endpoint deployment (thanks Brian!)
- Working on BNL dCache SRR reporting issues and reconfiguration of LAKE endpoints in CRIC
- Tape downtime for next week will test new Topology/CRIC configuration
- F/S DevOps: GitOps alerts now going to FederatedOps email list - working on improving the content
-
14:35
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:40
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
14:55
→
15:05
AOB 10m