US ATLAS Computing Facility
Facilities Team Google Drive Folder
Zoom information
Meeting ID: 996 1094 4232
Meeting password: 125
Invite link: https://uchicago.zoom.us/j/99610944232?pwd=ZG1BMG1FcUtvR2c2UnRRU3l3bkRhQT09
-
-
13:00
→
13:10
WBS 2.3 Facility Management News 10m
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn Mc Kee (University of Michigan (US))
Data challenge is next week. One issue is our perfSONAR deployment, which is in need of some attention! Please review your toolkit deployment and make sure your new hardware is operational ASAP.
New monitoring dashboard in beta from ESnet. Not yet "announced" but have a look at: https://public.stardust.es.net/d/IkFCB5Hnk/lhc-data-challenge-overview?orgId=1 Verify your site is visible. Send feedback (for now) to Shawn. Announcement later this week?
Have a look at the DOMA projects list https://docs.google.com/document/d/1i5YLxgDaVFt_-0R4DHABCyAeCvzaZdynTaVLNiM9anA/edit#heading=h.qucfjz4ani2c This is supposed to briefly document all the ongoing DOMA related activities.
Packet marking / flow labeling RPM is available to install on storage nodes. Currently deployed at AGLT2 and BNL. The service watches netstat and sends firefly packets to the ESnet collector (a minimal sketch follows at the end of these notes). Let Shawn know if your site is interested in participating.
LHCOPN/LHCONE meeting in two weeks: https://indico.cern.ch/event/1022426/
Other relevant topics covered in WBS sub-areas.
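For context on the flow-labeling item above: a minimal sketch of the "firefly" idea in Python, assuming a hypothetical collector host/port and illustrative JSON fields. The real packet format and destination are defined by the SciTags/ESnet specification and the deployed RPM's configuration, not by this sketch.

    import json
    import socket
    import time

    # Hypothetical collector endpoint -- the real host/port come from the
    # deployed RPM's configuration, not from this sketch.
    COLLECTOR = ("collector.example.net", 10514)

    def send_firefly(src_ip, src_port, dst_ip, dst_port, experiment_id, activity_id):
        """Send one illustrative 'firefly' UDP datagram describing a network flow."""
        payload = {
            "version": 1,
            "flow-lifecycle": {
                "state": "start",
                "start-time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            },
            "flow-id": {
                "afi": "ipv4",
                "src-ip": src_ip, "src-port": src_port,
                "dst-ip": dst_ip, "dst-port": dst_port,
                "protocol": "tcp",
            },
            # The experiment/activity labels are what let ESnet attribute the flow.
            "context": {"experiment-id": experiment_id, "activity-id": activity_id},
        }
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(json.dumps(payload).encode(), COLLECTOR)

    # Example: label an outgoing transfer seen on a storage node.
    send_firefly("192.0.2.10", 1094, "198.51.100.20", 443, experiment_id=2, activity_id=1)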
-
13:10
→
13:20
OSG-LHC 10m
Speakers: Brian Hua Lin (University of Wisconsin), Matyas Selmeci
Release (expected this week)
- OSG 3.5 + 3.6
  - CVMFS 2.8.2
  - osg-wn-client
- OSG 3.5 upcoming
  - xrootd-multiuser 2.0.2
- OSG 3.5 upcoming and OSG 3.6
  - HTCondor 9.0.6
  - blahp 2.1.2
- OSG 3.6 upcoming
  - HTCondor 9.2.0 (see the updated versioning scheme, slides 8-9: https://indico.cern.ch/event/1059494/contributions/4532565/attachments/2312014/3934741/WhatsNew_European_Workshop_Sept_2021.pdf)
Token Transition
- HTCondor-CE 5.1.2 + HTCondor 9.0.6 available in OSG 3.6 and OSG 3.5-upcoming
- Oct 12 Pre-GDB Token/WebDAV transition https://indico.cern.ch/event/876809/
- Oct 14-15 Token transition workshop https://indico.fnal.gov/event/50597/overview
- OSG 3.5 + 3.6
-
13:20
→
13:35
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
13:20
TBD 10m
-
13:20
-
13:35
→
13:40
WBS 2.3.1 Tier1 Center 5m
Speakers: Doug Benjamin (Brookhaven National Laboratory (US)), Eric Christian Lancon (Brookhaven National Laboratory (US))
-
13:40
→
14:00
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Smooth running over the last two weeks. The main problem was a transfer backlog at most sites; Hiro solved it by increasing the number of concurrent FTS transfers allowed, which cleared the backlogs at all sites.

- Investigating actual delivery date for Dell equipment.
- dCache version that supports SRR released.
- HTTP-TPC at the sites using XRootD doors is still not quite there for next week's data challenge.
- Will come back to IPV6 after the data challenge.
- Will try to update to OSG 3.5 upcoming soon.
-
13:40
AGLT2 5m
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn Mc Kee (University of Michigan (US)), Prof. Wenjing Wu (University of Michigan)
09/16/2021
MSU Dell PO issued. Missing the information needed to find it on the Dell website; asked Dell reps, but no success yet. Only found the WN part, with an estimated delivery of 18-Jan-2022.

09/17/2021
Allowed all smaller worker nodes to run BOINC (using a larger swap file) after a long test period with measurement and verification of low impact on ATLAS jobs and node stability.

09/20/2021
One of the dCache pools (umfs11_6) went disabled again (twice in 2 weeks). We repaired the file system first, then started the pool. The disabled pool caused 110 failed jobs at stage-out. We finally decided to retire this pool and another pool on the same host because they each had unresponsive and pending-failure disks which we are no longer planning to replace. (This whole storage node was already targeted for retirement as soon as we get our new storage nodes, now estimated for Jan 2022.) With some struggle (the pool would disable itself during draining), we finally drained and retired the pool umfs11_6. Eventually we found that over 11K files were lost during the xfs_repair; we declared the lost files in JIRA ticket ATLDDMOPS-5575 on 09/29/2021.

09/22/2021
Updated dCache from 6.2.25 to 6.2.29 (for the new SRR support). We also applied system firmware and software (including kernel) updates and rebooted all dCache servers. Two dCache storage nodes (umfs11 and umfs19) had corrupted grub configuration files; we had to mount an ISO file to recover the grub files.

09/22/2021
Also applied the new firmware and kernel updates on the worker nodes, draining and rebooting the nodes in batches.
-
13:45
MWT2 5m
Speakers: David Jordan (University of Chicago (US)), Jess Haney (Univ. Illinois at Urbana Champaign (US)), Judith Lorraine Stephen (University of Chicago (US))
-
13:50
NET2 5m
Speaker: Prof. Saul Youssef (Boston University (US))
1. Did a live transition from GridFTP to WebDAV with XRootD 5.3, using private containers (a WebDAV check sketch follows this list).
2. GPFS disk failures => migrations needed; lost about 70 TB in BU GPFS.
3. Two 100G links from BU to NESE Ceph failed, causing many PanDA jobs to fail with stage-in/out timeouts, etc.
4. Top-of-rack switch failing, causing about 50% job failures. Drained those workers and investigating.
5. Lots of progress on NESE tape.
Priority: Getting NESE storage under WebDAV for the Data Challenge.
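A minimal sanity check of a WebDAV door ahead of the Data Challenge; the endpoint URL and proxy path are hypothetical, and this assumes X.509 client-certificate auth (token auth would use an Authorization header instead).

    import requests  # third-party: pip install requests

    # Hypothetical door and credentials -- substitute the real endpoint and proxy.
    ENDPOINT = "https://webdav.example.edu:1094/atlasdatadisk/"
    CERT = ("/tmp/x509up_u1000", "/tmp/x509up_u1000")  # (cert, key); a VOMS proxy works for both

    # PROPFIND with Depth: 1 lists the collection -- a quick end-to-end check
    # that the door answers WebDAV methods and that the auth chain works.
    resp = requests.request(
        "PROPFIND",
        ENDPOINT,
        headers={"Depth": "1"},
        cert=CERT,
        verify="/etc/grid-security/certificates",  # hashed CA directory
    )
    print(resp.status_code)   # expect 207 Multi-Status on success
    print(resp.text[:500])    # start of the XML listing

-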
13:55
SWT2 5m
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
OU:
- All running well
- Had two xrootd-related issues in the last week:
  - xrootd was in a funny state on cstore13, causing that node to fail transfers; not sure why or how. An xrootd restart on cstore13 fixed it. The same thing happened on that node about a year ago.
  - xrootd hung up on se1, our proxy gateway, which completely halted all WAN xrootd transfers; again, an xrootd restart on se1 fixed it.
- It's possible that the first issue caused a few corrupted log files (adler32 mismatch). They were declared lost in Rucio (a checksum-verification sketch follows the OU items).
- Spent remaining hardware funds on more compute nodes and slate node; ETA a few months, of course ...
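For reference, a small sketch of how an adler32 mismatch like the one above can be confirmed locally before declaring a replica bad; the file path and catalog value are just examples, and the 8-character zero-padded hex matches the format Rucio records.

    import zlib

    def adler32_of(path, chunk_size=1024 * 1024):
        """Compute a file's adler32, returned as 8-char zero-padded hex."""
        value = 1  # adler32 seed
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return f"{value & 0xFFFFFFFF:08x}"

    # Compare the on-disk checksum with the value recorded in the catalog.
    local = adler32_of("/xrd/cstore13/atlas/somefile.root")  # hypothetical path
    catalog = "01a2b3c4"                                     # hypothetical catalog value
    print("MATCH" if local == catalog else "MISMATCH", local, catalog)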
UTA:
1) The UTA_SWT2_UCORE PanDA queue is now retired in favor of UTA_SWT2_UCORE_RET, whose I/O operations use SWT2_CPB_DATADISK.
2) We are in the process of retiring UTA_SWT2_DATADISK; additional space was made at SWT2_CPB_DATADISK to accept the migrated data, although half of UTA_SWT2's datasets are already in place there.
3) Our WebDAV door is in production at SWT2_CPB, and ATLAS prefers to use it for I/O even though the ATLAS CRIC settings give gridftp higher priority. This caused some problems at startup. We are also noticing a possible problem with the long-term stability of xrootd and are investigating the use of xrd.report to generate statistics about metrics within the service (see the sketch after this list).
4) Still working on the initial setup of the Kubernetes cluster.
5) Electrical work at SWT2_CPB *SHOULD* not affect operations overnight, but...
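A rough sketch of one way to look at xrd.report output: the directive sends periodic XML summary reports over UDP, so a small listener is enough to see which metrics are available. The port and the exact directive line are assumptions; check the XRootD documentation for the deployed version.

    import socket
    import xml.etree.ElementTree as ET

    # Assumes the xrootd config contains something like
    #   xrd.report <this-host>:9931 every 60s all
    # (exact directive syntax per the XRootD docs); 9931 is just an example port.
    LISTEN = ("0.0.0.0", 9931)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN)
    print("listening for xrd.report summaries on %d/udp" % LISTEN[1])

    while True:
        data, addr = sock.recvfrom(65535)
        try:
            root = ET.fromstring(data)  # summaries arrive as an XML <statistics> document
        except ET.ParseError:
            continue
        # Print which <stats id="..."> blocks each report carries.
        ids = [s.get("id") for s in root.findall("stats")]
        print(addr[0], root.get("tod"), "stats blocks:", ids)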
-
14:00
→
14:05
WBS 2.3.3 HPC Operations 5m
Speaker: Lincoln Bryant (University of Chicago (US))
Jobs are sitting in the queue at TACC for quite a while before getting cancelled, e.g. from PanDA (a queue-wait check sketch follows below):
pilot, 1236: Killed by Harvester due to worker queuing too long. 3504589 myjob normal phy20021 100 CANCELLED+ 0:0
NERSC queue revived after 20M additional hours were added. Globus transfers are going. All proxies were renewed on Tuesday.
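A small sketch for quantifying how long jobs like the one above sat in the TACC queue before being cancelled, assuming standard Slurm accounting (sacct) is available on the login node; the job ID is the one quoted above.

    import subprocess
    from datetime import datetime

    def queue_wait(jobid):
        """Return (state, wait time) for a Slurm job from sacct's Submit/Start fields."""
        out = subprocess.run(
            ["sacct", "-X", "-j", str(jobid), "--parsable2", "--noheader",
             "--format=JobID,State,Submit,Start"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        _jobid, state, submit, start = out.splitlines()[0].split("|")
        fmt = "%Y-%m-%dT%H:%M:%S"
        if start in ("Unknown", "None"):  # cancelled while still pending: never started
            return state, None
        return state, datetime.strptime(start, fmt) - datetime.strptime(submit, fmt)

    print(queue_wait(3504589))  # e.g. ('CANCELLED by ...', None)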
-
14:05
→
14:20
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:05
Analysis Facilities - BNL 5m
Speaker: William Strecker-Kellogg (Brookhaven National Lab)
-
14:10
Analysis Facilities - SLAC 5m
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
14:15
Analysis Facilities - Chicago 5m
Speakers: David Jordan (University of Chicago (US)), Ilija Vukotic (University of Chicago (US))
David - Lincoln is troubleshooting the Broadcom NICs we are using.
The BCM57414 works OK with the stock kernel, but newer mainline kernels introduce a high rate of TCP retransmits (a retransmit-rate sketch follows these notes).
Ilija
- ML works fine
- AF now has two instances of ServiceX deployed (xAOD and UpROOT)
- ServiceX dedicated xCache deployed.
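To help with the NIC comparison above, a minimal sketch that samples the kernel's TCP counters from /proc/net/snmp, so retransmit rates under different kernels can be compared side by side (standard Linux counters; no assumptions beyond a Linux host).

    import time

    def tcp_counters():
        """Parse the Tcp: header/value line pair from /proc/net/snmp into a dict."""
        with open("/proc/net/snmp") as f:
            lines = [line.split() for line in f if line.startswith("Tcp:")]
        header, values = lines[0][1:], lines[1][1:]
        return dict(zip(header, (int(v) for v in values)))

    # Sample twice and report the retransmit rate over the interval.
    before = tcp_counters()
    time.sleep(10)
    after = tcp_counters()

    retrans = after["RetransSegs"] - before["RetransSegs"]
    sent = after["OutSegs"] - before["OutSegs"]
    print("%d retransmits / %d segments sent over 10s (%.3f%%)"
          % (retrans, sent, 100.0 * retrans / max(sent, 1)))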
-
14:05
-
14:20
→
14:40
WBS 2.3.5 Continuous Operations
Convener: Ofer Rind (Brookhaven National Laboratory)
- dCache SRR REST API released yesterday; integrated into the frontend service, fixing the storageshares reporting (a verification sketch follows this list)
- Deployed at BNL
- Ready to test HTCondor-CE 5.1.2 deployment at BNL
- Production XRootd endpoint deployed at BU
- Data Challenge next week
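A quick way to eyeball the storageshares numbers the new SRR provides. The URL path below is an assumption (verify where your dCache frontend actually publishes the SRR JSON), and the field names follow the WLCG SRR layout.

    import json
    import urllib.request

    # Assumed SRR location on the dCache frontend -- confirm the real path/port
    # for your deployment before relying on this.
    SRR_URL = "https://dcache-frontend.example.org:3880/api/v1/srr"

    with urllib.request.urlopen(SRR_URL) as resp:
        srr = json.load(resp)

    # WLCG SRR layout: storageservice -> storageshares[], each with name/totalsize/usedsize.
    for share in srr["storageservice"]["storageshares"]:
        total = share.get("totalsize", 0)
        used = share.get("usedsize", 0)
        print("%-30s %8.1f / %8.1f TB used" % (share.get("name", "?"), used / 1e12, total / 1e12))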
-
14:20
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News 5m
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14:25
Service Development & Deployment 5m
Speakers: Ilija Vukotic (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
- Oracle upgrade went smoothly!
- ES
- David added new nodes in the new computing center. All working fine. Data now getting redistributed.
- Next step is draining half of nodes in the old cluster and physically moving them to the new center.
- XCaches
- All working fine
- VP
- All working fine
- ServiceX
- Now using Flux deployed instances (6 on SSL cluster, 2 on AF)
- New, faster DID Finder in production: twice as fast, with much less load on Rucio
- A lot of developments and testing.
-
14:40
→
14:45
AOB 5m
-