US ATLAS Computing Facility
US/Eastern
-
1
WBS 2.3 Facility Management News
Speakers: Robert William Gardner Jr (University of Chicago (US)), Dr Shawn McKee (University of Michigan ATLAS Group)
-
2
OSG-LHC
Speakers: Brian Lin (University of Wisconsin), Matyas Selmeci
OSG All Hands 2020 postponed!
3.5.10-2 and 3.4.44-2
VO client update for FNAL VOs, GLOW, and OSG due to InCommon subject DN format changes
3.5.11 and 3.4.45
OSG 3.4 has entered critical bug/security fix only support; EOL scheduled for November 2020. Last release series that supports EL6! https://opensciencegrid.org/technology/policy/release-series/
Most package updates from here on out will only be available in OSG 3.5!
- XRootD 4.11.3
- XCache 1.3.0 with data integrity tool
- Singularity 3.5.3 (OSG 3.4 only, otherwise available in EPEL)
- CVMFS 2.7.1
-
Topical Reports
Convener: Robert William Gardner Jr (University of Chicago (US))
-
3
Xrootd vs Http protocols in TPC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
4
WBS 2.3.1 Tier1 Center
Speakers: Eric Christian Lancon (CEA/IRFU,Centre d'etude de Saclay Gif-sur-Yvette (FR)), Xin Zhao (Brookhaven National Laboratory (US))
- dCache downtime scheduled for 03/24~03/26 (48h), for upgrade to version 5.2, for support of SRR and TPC, plus many other bug fixes and improvements
- CPU utilization fluctuated recently, stable now though
- not enough pilots
- job router changes on CEs
- draining and rebooting partial WNs in the farm, in a rolling fashion
- upgrade cvmfs to 2.7.0
- add cvmfs-x509-helper package for LIGO jobs
- R&D
- Data Carousel exercise/RPVLL reprocessing
- going well. BNL staging throughput 3GB/s+, best among T1s
- need more requests to stress the system
- MAS
- "moving" instead of "copying" unused datasets from DATADISK to BNL_LAKE
- running jobs on the BNL_LAKE_UCORE PQ
-
WBS 2.3.2 Tier2 Centers
Updates on US Tier-2 centers
Convener: Fred Luehring (Indiana University (US))
- Will need input from the Tier 2 sites if we do hold an ATLAS meeting
- Please close tickets and respond to items sent to the US cloud mailing list.
- The source of the low cpu efficiency at SWT2 is believed to be understood.
- The issue was at SWT2_CPB and involved RUCIO Mover vs LSM
- I will leave it to SWT2 to explain the details in their report
- Two issues in the past week appear to have been caused by settings in AGIS
- We have to be damn careful with AGIS.
- ===>> CHECK THAT AGIS IS SET UP FOR YOUR SITE AS YOU EXPECT!!!
- Please make sure that requested upgrades, such as dCache and IPv6, are getting attention.
-
5
AGLT2
Speakers: Philippe Laurens (Michigan State University (US)), Dr Shawn McKee (University of Michigan ATLAS Group), Prof. Wenjing Wu (Computer Center, IHEP, CAS)
- Had 2 intermittent downtimes to fix the hardware of the storage enclosure of a dCache pool node
- 2 GGUS tickets:
- 145774: Rucio dataset replication stuck between AGLT2 and DESY_HH. After some investigation we found that the problem is not at the AGLT2 site, and the ticket has since been reassigned to a few other sites.
- 145772: upgrade dCache to the latest release and enable SRR at AGLT2. We enabled SRR and will update dCache to the latest release of the 5 series soon.
- On 25 Feb, because of the CERN production issue, our site did not get any jobs for 8 hours. Because there was no notice from ADC, we thought there was a problem with our gatekeeper and ended up spending a lot of time debugging and restarting services/nodes. Could we request notices/updates on such incidents in the future?
- BOINC accounting to OSG: I got the Gratia API document from Derek and am still in the process of reading through the condor accounting example.
-
6
MWT2
Speakers: David Jordan (University of Chicago (US)), Judith Lorraine Stephen (University of Chicago (US))
UC
- dCache upgraded to 5.2
- Everything is at 5.2.15 except for the xrootd door, which is at 5.2.7 due to "Protocol Xrootd-4.0:72.36.96.247:60394 is not supported" errors in 5.2.8 and above
- Network interface errors after rebooting our SL6 nodes during the downtime, fixed by cable reseating
- In the process of upgrading our remaining SL6 storage nodes and doors to SL7
- Added SRR, updated CRIC
- Stuck replication rule from MWT2_DATADISK to DESY-HH_LOCALGROUPDISK
- It looks like the FTS transfers stalled while we were debugging network issues post-upgrade
- Is there a regular procedure or contact to fix this? DDM?
UIUC
- 24 new workers online (1960 cores)
- PDU issues after onlining the new workers, fixed for now
- dCache upgraded to 5.2
-
7
NET2
Speaker: Prof. Saul Youssef (Boston University (US))
-
8
SWT2
Speakers: Dr Horst Severini (University of Oklahoma (US)), Mark Sosebee (University of Texas at Arlington (US)), Patrick Mcguigan (University of Texas at Arlington (US))
UTA:
- Preparing to deploy new storage (2PB Raw), around 1PB will be used to cover retirements
- Working on condor-ce/NFS issue that is preventing pilots from being accepted. Looks like NFS server issue. Temporary workaround now in place
- We believe we have identified the low efficiency problems at SWT2_CPB
- Rucio mover was placed as primary mover by ADC although it would not work
- LSM would be used after rucio mover failed
- Rucio mover took significant time to fail, lowering CPU efficiency
- Now mostly solved with adoption of rucio mover on reads, some work still needed for writes.
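The effect described above can be illustrated with a minimal sketch, assuming a single-core job whose wall time is simply CPU time plus data-movement time. The function names and all numbers are hypothetical and are not measurements from SWT2_CPB; the point is only that time spent waiting for a primary mover to fail adds wall time without adding CPU time.

```python
def cpu_efficiency(cpu_seconds, wall_seconds):
    """CPU efficiency as the ratio of CPU time to wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


def job_efficiency(cpu_seconds, stage_seconds, failed_attempt_seconds=0.0):
    """Efficiency of a job whose wall time is CPU plus data-movement time.

    failed_attempt_seconds models the time lost waiting for a primary mover
    to fail before the fallback mover runs (hypothetical parameter).
    """
    wall = cpu_seconds + stage_seconds + failed_attempt_seconds
    return cpu_efficiency(cpu_seconds, wall)


# Hypothetical numbers for illustration only.
baseline = job_efficiency(cpu_seconds=3600, stage_seconds=400)
with_fallback = job_efficiency(
    cpu_seconds=3600, stage_seconds=400, failed_attempt_seconds=1000
)

print(f"working mover used directly: {baseline:.2f}")
print(f"failed primary mover first:  {with_fallback:.2f}")
```

With these made-up inputs the same job drops from 0.90 to 0.72 efficiency purely because of the time spent in the failing mover.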
OU:
- Nothing to report, running fine.
- Had a temporary OSCER authentication (LDAP/IPA) hiccup Monday night, which caused some stage-out failures.
-
9
WBS 2.3.3 HPC Operations
Speaker: Doug Benjamin (Duke University (US))
-
WBS 2.3.4 Analysis Facilities
Convener: Wei Yang (SLAC National Accelerator Laboratory (US))
- 10
-
11
Analysis Facilities - SLAC
Speaker: Wei Yang (SLAC National Accelerator Laboratory (US))
-
12
ATLAS ML Platform & User Support
Speaker: Ilija Vukotic (University of Chicago (US))
-
WBS 2.3.5 Continuous Operations
Conveners: Ofer Rind, Robert William Gardner Jr (University of Chicago (US))
-
13
US Cloud Operations Summary: Site Issues, Tickets & ADC Ops News
Speakers: Mark Sosebee (University of Texas at Arlington (US)), Xin Zhao (Brookhaven National Laboratory (US))
-
14
Analytics Infrastructure & User Support
Speaker: Ilija Vukotic (University of Chicago (US))
-
15
Intelligent Data Delivery R&D (co-w/ WBS 2.4.x)
Speakers: Andrew Hanushevsky (STANFORD LINEAR ACCELERATOR CENTER), Ilija Vukotic (University of Chicago (US)), Wei Yang (SLAC National Accelerator Laboratory (US))
-
16
AOB