ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

  • GGUS #146413 Lancaster, Matt: it's now online and doing OK. Will close ticket.
    Peter asked about longer-term plans, such as replacing DPM. In the medium term Matt will be able to retire the oldest disks, but that will mean starting to shrink quotas, which ATLAS doesn't like. Sam suggested that once Glasgow has Ceph working, it will be easier for other T2s to follow. Lustre is also distributed storage, so better than DPM.
    Overall, ATLAS is more comfortable with declaring files lost, but UK sites worry about flak from the PMB. Rapid deletion requests need to be moderated, since they can overload the DPM headnodes when files are declared lost (see the sketch after this list).
  • GGUS #146374 Sheffield ARC-CE problem. Follow up on TB-SUPPORT, or maybe contact the NorduGrid mailing list. There followed a robust discussion about technology choices.
  • GGUS #146280 Lancaster, Matt: draining of the dodgy disk is progressing, about 3/4 of the way through. Once drained, the ticket can be closed.
  • GGUS #146159 Glasgow, Sam: progressing. (Gareth and Sam should be on holiday today.)
  • GGUS #145688 Manchester: on hold.
  • GGUS #145510 RAL: stage-out problems also occur at other sites; the Rucio team needs to fix it. RAL now has a few WNs with SSDs, so old and new can be compared for stage-ins.
  • GGUS #144759 Glasgow Squids, on hold: need to talk with the Networking team, but they are understandably busy.
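
A minimal sketch of how such declarations could be paced, using the Rucio Python client (the batch size, pause, PFN list, and reason string are illustrative assumptions, not recommended values):

    import time
    from rucio.client.replicaclient import ReplicaClient

    BATCH_SIZE = 100     # assumed batch size; tune per site
    PAUSE_SECONDS = 10   # breather between batches to spare the DPM headnode

    def declare_lost(pfns, reason="lost on retired disk server"):
        """Declare replicas bad in small, paced batches rather than in one
        burst, so the resulting deletion/recovery load stays manageable."""
        client = ReplicaClient()
        for i in range(0, len(pfns), BATCH_SIZE):
            batch = pfns[i:i + BATCH_SIZE]
            unknown = client.declare_bad_file_replicas(batch, reason)
            if unknown:
                print("Not known to Rucio:", unknown)
            time.sleep(PAUSE_SECONDS)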

● CPU

  • On Monday, the PanDA server was misconfigured due to AGIS changes (in preparation for CRIC).
  • Also on Monday, a Rucio client update interfered with StoRM sites; fixed by ATLAS.
  • The CPU increase at RAL is hopefully due to the new WNs.
  • Gareth: ScotGrid was overpledged in 2019-20 (it had pledged 100% of Durham, whereas GridPP only has 15%). This was fixed in 2020-21, hence the reduction in the pledge line.

● Other new issues

  • Oxford would like to go storageless. RAL will be the endpoint.
  • Migrations to CentOS 7 are underway at Lancaster and RHUL.
  • Upgrading services from SL6: Matt has a few disk servers still on SL6. The plan was to upgrade in June, but this may have to be delayed until he has physical access. Important data needs to be identified and moved to storage that won't be touched.

● CentOS7 - Sussex

Peter updated AGIS, but now jobs fail without any error message. It would help to track a single job to get some clues. Peter will email Patrick, CC James.
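
To track a single job, its record can be pulled from the BigPanDA monitor. A rough sketch, assuming the monitor returns JSON when asked via the Accept header; the PanDA job ID and the field names below are placeholders/assumptions:

    import requests

    # Placeholder PanDA job ID: replace with one of the failing Sussex jobs.
    PANDAID = 1234567890

    # BigPanDA is assumed to serve JSON for this Accept header.
    resp = requests.get(
        "https://bigpanda.cern.ch/job/",
        params={"pandaid": PANDAID},
        headers={"Accept": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()

    # Field names are assumptions; inspect data.keys() for what is returned.
    job = data.get("job", {})
    print(job.get("jobstatus"), job.get("piloterrordiag"))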


● Glasgow Ceph storage

Sam is configuring the firewall today to give access to xrootd from outside (even though he's supposed to be on holiday).
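
A quick off-site check that the firewall change worked is a plain TCP probe of the xrootd port (1094 is the xrootd default; the hostname below is a placeholder):

    import socket

    # Placeholder hostname for the Glasgow Ceph/xrootd gateway.
    HOST = "xrootd.example.ac.uk"
    PORT = 1094  # default xrootd port

    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            print(f"{HOST}:{PORT} is reachable from outside")
    except OSError as exc:
        print(f"{HOST}:{PORT} unreachable: {exc}")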


● Grand Unified queues

All GU PanDA queues are now online. All old queues are closed, apart from Sheffield, which still has problems.


● News round-table

  • Dan: the new Lustre system is now in a happy state. Syncing data from the old to the new system might take a while, but can be done remotely (a sketch follows this round-table).
  • Gareth: NETR
  • James: NETR
  • Matt: NETR
  • Peter: NETR
  • Sam: NETR
  • Tim: NETR
  • Vip: had to leave early.
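
For Dan's old-to-new Lustre sync, a resumable rsync over ssh is one way it could be run and restarted remotely; a sketch with placeholder host and paths:

    import subprocess

    # Placeholder host/paths for the old and new Lustre filesystems.
    SRC = "oldlustre.example.ac.uk:/lustre/atlas/"
    DST = "/mnt/newlustre/atlas/"

    # -a preserves metadata, --partial lets an interrupted run resume,
    # and the log file makes an unattended remote session easy to check.
    subprocess.run(
        ["rsync", "-a", "--partial", "--info=progress2",
         "--log-file=/var/log/lustre-migration.log", SRC, DST],
        check=True,
    )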

● AOB

The discussion about storage continued in the chat, quoted here:

Dan:

my r510 are very stable (touch wood) at the moment. one thing i have done is to keep the firmware up to date.
Lustre better able to balance and rebalance data across the servers. all servers contribute to all space tokens
why don't we see the same issue from Manchester (similar size and dpm)?
is it just the hardware?
dell vs xma?


Matt:

Actually you might have a point there, my newer dells don't seem to have much of a problem
It might be that they're running a tighter ship somehow.


Vip:

we have mixture of 510 and 720 running DPM. The firmware is relatively up to date. We have few with cache battery failures, overall, it has been stable apart from few disk failures. I drained a pool node for spare disks. All of them are out of warranty.