US ATLAS Tier 2 Technical
Meeting to discuss technical issues at the US ATLAS Tier 2 sites. The primary audience is the US Tier 2 site administrators, but anyone interested is welcome to attend.
-
-
10:00
→
10:10
Introduction 10m Speakers: Fred Luehring (Indiana University (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US))
-
10:10
→
10:20
TW-FTT 10m Speakers: Eric Yen, Felix.hung-te Lee (Academia Sinica (TW))
-
10:20
→
10:30
AGLT2 10m Speakers: Daniel Hayden (Michigan State University (US)), Philippe Laurens (Michigan State University (US)), Shawn Mc Kee (University of Michigan (US)), Dr Wendy Wu (University of Michigan)
- 13-May updated dCache 9.2.27 -> 10.2.12
Declared at-risk (intermittent) outage
Smooth update
It seems only up to 7 jobs may have failed on direct access (pilot error 1361)
Part of motivation was to enable firefly monitoring.
Success though content not as expected
(expected latest changes didn't make it in this release?).
- switching WNs to OSG24 (at UM)
noticed some were not running boinc jobs; fixed scripts.
- EL9 at MSU: continuing to investigate node record corruption.
Will likely ask for satellite software update (in next couple days).
- 10-May Saturday generator test at MSU data center = qualified success
smoother than expected (i.e. no temperature fluctuation)
when dropping A-side (on UPS) or B-side power feed
The control system maintained (UPS) or restored full power in the racks in a fraction of a second
when dropping both, first from UPS, then 2.5 MW generator within a minute
with very minimal temperature fluctuation for all steps
–> all the normally unattended steps were a full success
last step: going off the generator was somewhat difficult to engage
then all AC units failed to restart automatically
and took time and effort to coax back on
all C6420s and most R6525s shut off on temperature alarm
-
10:30
→
10:40
MWT2 10m Speakers: Aidan Rosberg (Indiana University (US)), David Jordan (University of Chicago (US)), Farnaz Golnaraghi (University of Chicago (US)), Fengping Hu (University of Chicago (US)), Fred Luehring (Indiana University (US)), Judith Lorraine Stephen (University of Chicago (US)), Lincoln Bryant (University of Chicago (US)), Robert William Gardner Jr (University of Chicago (US))
-
10:40
→
10:50
NET2 10m Speakers: Eduardo Bach (University of Massachusetts (US)), Rafael Coelho Lopes De Sa (University of Massachusetts (US)), William Axel Leight (University of Massachusetts Amherst)
Some discussion about the current downtime
We were upgrading our routing to evolve our router infrastructure so that it can operate with SENSE. Part of this work is to segregate the several different routing services (LHCONE, non-LHCONE, FABRIC, SENSE, ...) into different virtual routers and, in the process, ensure that only LHCONE communication goes through the L3VPN. We thought these changes would be transparent, but that turned out not to be the case. Non-LHCONE routing to NESE has stopped, and this is preventing transfers from going through (which raises the question of why we have to deal with requests outside of LHCONE). Jessa is working to fix it, but it will take another O(day) [estimate].
The improvements being pursued will also help us reduce the asymmetric ASN paths that were flagged last week. On the other hand, this also means that several NET2_MCTAPE transfer requests will time out, but that's expected.
We are working on preparing for the OKD upgrade. The first attempt was planned for this week, but we must postpone it now.
We are making progress in understanding the tape system, based on our experience with Fabio's tests.
-
10:50
→
11:00
SWT2 10m Speakers: Andrey Zarochentsev (University of Texas at Arlington (US)), Horst Severini (University of Oklahoma (US)), Kaushik De (University of Texas at Arlington (US)), Mark Sosebee (University of Texas at Arlington (US)), Zachary Thomas Booth (University of Texas at Arlington (US))
SWT2_CPB:
Operations
-
We recently migrated certificate management from the EL7 admin to the new EL9 admin for our XRootD proxy servers.
-
After the certificate management change, we ran into an issue with updating CRLs on 5/5, which caused us to drain. We adjusted the cron job that runs on the EL9 admin node, which fixed the issue.
-
We drained on 5/11 because many short jobs were sent to sites; all sites drained because there were not enough jobs to stay full.
-
We noticed a lot of bad jobs on our CE on 5/13 that eventually cleared up. During this period we completed over 25K jobs within twelve hours, despite being only half full. We may have been receiving a lot of short jobs, which kept us from staying full. We are still investigating this. There is a high number of exiting pilots, and the issue may be with our CE. The status change seems to be taking some time.
-
We replaced the certificate on an XRootD proxy that had expired at the beginning of this month. We monitored the proxy afterwards and saw no issues.
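The CRL update problem above is the kind of failure a simple freshness check can catch before the site drains. The following is a minimal sketch, not our actual cron job. It assumes that CRLs are stored as `*.r0` files in one directory and that a file not modified within `max_age_hours` counts as stale; the function name, the threshold, and the directory layout are all assumptions for illustration.

```python
import time
from pathlib import Path


def stale_crls(cert_dir, max_age_hours=24.0, now=None):
    """Return names of CRL files in cert_dir not modified within max_age_hours.

    Assumption: CRLs are stored as '*.r0' files in a single directory.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for crl in sorted(Path(cert_dir).glob("*.r0")):
        # A file whose modification time is before the cutoff was not refreshed recently.
        if crl.stat().st_mtime < cutoff:
            stale.append(crl.name)
    return stale
```

A monitoring script could call `stale_crls(...)` on the local certificate directory and raise an alert whenever the returned list is not empty.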
Slurm and CE Monitoring
-
We provisioned a new EL9 monitoring server, set up Grafana, and set up a monitoring dashboard of our Slurm server. We received help from MWT2 concerning monitoring tools for our CE. We will be working on this very soon.
-
We are researching Zabbix and plan to implement it as our new alert system. We have a student assisting us in researching this tool. We plan on configuring and testing this very soon.
EL9 Migration
-
Our new test cluster CE has been added to CRIC for the SWT2_CPB_TEST Panda Queue and is receiving test jobs. The jobs are failing, likely because our test cluster XRootD proxy has not been added to CRIC. The jobs appear to be trying to use the production cluster’s XRootD proxy addresses (which is ).
-
We are actively communicating with DDM Ops for assistance with adding this into CRIC. We are also asking questions to be careful in how we approach this.
-
We created the monitor server module and are creating the XRootD redirector module.
-
We have a list of minor improvements to make to the EL9 test cluster. We are going to make these changes soon.
-
Once new storage has been implemented (or is in the early stages of being implemented), we will convert our XRootD proxies in the production cluster from EL7 to EL9.
New Storage
-
We have tested transferring and pulling files from our new storage running EL9 locally in the test cluster. We ran other commands to test, and it appears to be working.
-
We also tested the new EL9 storage in the test cluster externally using an lxplus server. We ran different commands to test transferring files and other actions using different protocols. It is working.
-
We plan to test this more thoroughly for one week (hopefully with test jobs putting disposable data on the new storage), then rebuild and retest, and then start building the new storage as EL9 in the production cluster next week.
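One way to confirm that a file written to the new storage comes back intact is to compare checksums of the original and the retrieved copy. The sketch below uses Adler-32 from the Python standard library. It is only an illustration of a round-trip check, not the set of commands we ran, and the function names and paths are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-character lowercase hex string."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed each chunk into the running checksum so large files are not loaded at once.
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def round_trip_ok(original_path, retrieved_path):
    """True if the retrieved copy has the same Adler-32 checksum as the original."""
    return adler32_of_file(original_path) == adler32_of_file(retrieved_path)
```

After copying a test file to the storage and pulling it back, `round_trip_ok("test.dat", "test_copy.dat")` (hypothetical file names) would report whether the two copies match.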
GGUS Ticket - GoeGrid Transfer Failures
-
Issue has been resolved through a change on our side.
-
GoeGrid is not using LHCONE. Because of this, we are not able to route traffic through ESNet to reach it.
-
We communicated with UTA campus networking and ESNet experts to coordinate changes to how the network routes this traffic. Campus networking is now routing this traffic over a commercial provider instead of ESNet. This resolved the issue.
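The routing decision described above can be pictured as a prefix lookup: destinations inside an LHCONE prefix take the LHCONE path, and everything else goes over the commercial provider. The sketch below is a toy model only, not the campus router configuration. The prefixes are placeholders from the documentation address ranges, and the path labels are hypothetical.

```python
import ipaddress

# Placeholder prefixes from the documentation ranges (RFC 5737).
# These are NOT real LHCONE announcements.
LHCONE_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def choose_path(destination, lhcone_prefixes=LHCONE_PREFIXES):
    """Return 'lhcone' if the destination is inside a listed prefix, else 'commercial'."""
    address = ipaddress.ip_address(destination)
    if any(address in prefix for prefix in lhcone_prefixes):
        return "lhcone"
    return "commercial"
```

In this model, a site that does not announce its storage through LHCONE, as is the case for GoeGrid, always falls through to the `commercial` branch.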
GGUS Ticket - Enable Network Monitoring
-
We have not heard back from campus networking on the last follow-up. We will follow up.
OU:
- Scheduled maintenance today (file system and network upgrades).
- This should fix the job failures we've been seeing the last few days.
