● Outstanding tickets
- GGUS #146413 Lancaster, Matt: it's now online and doing OK. Will close ticket.
Peter asked about longer-term plans, such as replacing DPM. In the medium term, Matt will be able to get rid of the oldest disks, but that will mean having to start shrinking quotas, which ATLAS doesn't like. Sam suggested that once Glasgow had got Ceph working, it would be easier for other T2s. Lustre is also distributed storage, so it's a better option than DPM.
Overall, ATLAS is more comfortable with declaring files lost, but UK sites are worried about flak from the PMB. Need to moderate the rapid deletion requests that can overload the DPM headnodes when files are declared lost.
- GGUS #146374 Sheffield ARC-CE problem. Follow up on TB-SUPPORT, or maybe contact NorduGrid mailing list. There followed a robust discussion about technology choices.
- GGUS #146280 Lancaster, Matt: progressing; draining the dodgy disk, three-quarters of the way through. Once drained, the ticket can be closed.
- GGUS #146159 Glasgow, Sam: progressing. (Gareth and Sam should be on holiday today)
- GGUS #145688 Manchester: on hold
- GGUS #145510 RAL: stage-out problems also occur at other sites; the Rucio team needs to fix it. RAL now has a few WNs with SSDs, so old and new can be compared for stage-ins.
- GGUS #144759 Glasgow Squids, on hold: need to talk with Networking team, but they are understandably busy.
● CPU
- On Monday, the PanDA server was misconfigured due to AGIS changes (in preparation for CRIC).
- Also on Monday, a Rucio client update interfered with StoRM sites; fixed by ATLAS.
- The RAL increase is hopefully due to the new WNs.
- Gareth: ScotGrid was overpledged in 2019-20 (had pledged 100% of Durham, GridPP only has 15%). This was fixed in 2020-21, hence the reduction in the pledge line.
● Other new issues
- Oxford would like to go storageless. RAL will be the endpoint.
- Upgrading services from SL6: Matt has a few disk servers on SL6. The plan was to upgrade in June, but this may have to be delayed until he has physical access. Need to identify important data to move to storage that won't be touched by the upgrade.
● CentOS7 - Sussex
Peter updated AGIS, but now jobs fail without any error message. It would help to track a single job and get some clues. Peter will email Patrick, CC James.
● Glasgow Ceph storage
Sam configuring the firewall today to give access to xrootd from outside (even though he's supposed to be on holiday).
● Grand Unified queues
All GU PanDA queues are now online. All old queues are closed, apart from Sheffield, which still has problems.
● News round-table
- Dan: the new Lustre system is now in a happy state. Syncing data from the old system to the new one might take a while; it can be done remotely.
- Gareth: NETR
- James: NETR
- Matt: NETR
- Peter: NETR
- Sam: NETR
- Tim: NETR
- Vip: had to leave early.
● AOB
Continuing discussion about storage in the Chat, quoted here:
Dan:
My R510s are very stable (touch wood) at the moment. One thing I have done is to keep the firmware up to date.
Lustre is better able to balance and rebalance data across the servers. All servers contribute to all space tokens.
Why don't we see the same issue from Manchester (similar size and DPM)?
Is it just the hardware?
Dell vs XMA?
Matt:
Actually, you might have a point there; my newer Dells don't seem to have much of a problem.
It might be that they're running a tighter ship somehow.
Vip:
We have a mixture of R510s and R720s running DPM. The firmware is relatively up to date. We have a few with cache battery failures; overall, it has been stable apart from a few disk failures. I drained a pool node for spare disks. All of them are out of warranty.