Meeting ID: 996 1094 4232
Meeting password: 125
Release (next week?)
BNL staff are required to use their excess vacation days before July 20.
BNL HPSS downtime is scheduled for 2-Aug-21 12:00 UTC to 6-Aug-21 00:00 UTC. SRM services are affected. OIM used to schedule the downtime.
Working through the FTS-dCache-HPSS local site monitoring plots and the ATLAS Data Carousel monitoring to ensure we can spot errors or inefficiencies. The latter is very simple and has no time series available.
In the middle of splitting BNLLAKE into a DATADISK part and a LOCALGROUP disk part. Determining the proper paths for the HPSS system to ensure accurate accounting of tape usage and to make tape recycling easier. Working with DDM Ops to rationalize space reporting, since these are Disk/Tape endpoints.
Updates on US Tier-2 centers
1) All WNs online (still at the old location).
2) All switches online at Data Center. Next move will include all recent T2 WNs, now set for July 21st.
Still in the recovery process.
1) Having network issues (IPv6) on a few worker nodes; for now, using a fixed IPv6 configuration has resolved the problem.
2) None of the C6420 worker nodes with Intel NICs light up.
3) About 2/3 of the UM worker nodes are back online; the other 1/3 need network configuration/debugging.
Upgraded dCache from 6.2.21 to 6.2.23; the update went smoothly.
- Upgraded xrootd proxy (se1) to 5.3.0.rc2, which fixed xrootdfs and http-tpc bugs. Will add site to http-tpc testing when I'm back from vacation on 7/19.
- Discovered a problem with Condor spool directory access on four compute nodes. (Nodes were batch system "black holes.") Fixed, and implemented monitoring to prevent a recurrence.
- Campus facilities fixed a problem with the A/C in the machine room.
- Awaiting delivery of the equipment from our most recent purchase (compute nodes, storage, LAN refresh).
Ilija is on vacation.
Work is needed to describe procedures and responsibilities for the federated operations team and to formalize the management of frontier-squids. One goal is to implement a GitOps-like procedure to help manage updates and record history. Another is to create an alarm and alert service.
One very important (and now overdue) item:
OTP numbers for ATLAS for January - June 2021
Needs to be finished TODAY if at all possible (Scrubbing work should provide the numbers?)
Cloud Operation and Management: