● Outstanding tickets
- GGUS #146413 Lancaster, Matt: it's now online and doing OK. Will close ticket.
Peter asked about longer-term plans, such as replacing DPM. In the medium term, Matt will be able to get rid of the oldest disks, but that will mean having to start shrinking quotas, which ATLAS doesn't like. Sam suggested that once Glasgow had got Ceph working, it would be easier for other T2s. Lustre is also distributed storage, so it's a better option than DPM.
Overall, ATLAS is more comfortable with declaring files lost, but UK sites are worried about flak from the PMB. Need to moderate the rapid deletion requests that can overload the DPM headnodes when files are declared lost.
- GGUS #146374 Sheffield ARC-CE problem. Follow up on TB-SUPPORT, or maybe contact NorduGrid mailing list. There followed a robust discussion about technology choices.
- GGUS #146280 Lancaster, Matt: progressing; draining the dodgy disk, three-quarters of the way through. Once drained, the ticket can be closed.
- GGUS #146159 Glasgow, Sam: progressing. (Gareth and Sam should be on holiday today)
- GGUS #145688 Manchester: on hold
- GGUS #145510 RAL: stage-out problems also occur at other sites; the Rucio team needs to fix it. RAL now has a few WNs with SSDs, so old and new can be compared for stage-ins.
- GGUS #144759 Glasgow Squids, on hold: need to talk with Networking team, but they are understandably busy.
● CPU
- On Monday, the PanDA server was misconfigured due to AGIS changes (in preparation for CRIC).
- Also on Monday, a Rucio client update interfered with StoRM sites; fixed by ATLAS.
- The RAL increase is hopefully due to the new WNs.
- Gareth: ScotGrid was overpledged in 2019-20 (had pledged 100% of Durham, GridPP only has 15%). This was fixed in 2020-21, hence the reduction in the pledge line.
● Other new issues
- Oxford would like to go storageless. RAL will be the endpoint.
- Upgrading services from SL6: Matt has a few disk servers on SL6. The plan was to upgrade in June, but this may have to be delayed until he has physical access. Need to identify important data to move to storage that won't be touched by the upgrade.
● CentOS7 - Sussex
Peter updated AGIS, but now jobs fail without any error message. It would help to track a single job and get some clues. Peter will email Patrick, CC James.
● Glasgow Ceph storage
Sam configuring the firewall today to give access to xrootd from outside (even though he's supposed to be on holiday).
● Grand Unified queues
All GU PanDA queues are now online. All old queues are closed, apart from Sheffield, which still has problems.
● News round-table
- Dan: the new Lustre system is now in a happy state. Syncing data from the old system to the new one might take a while; it can be done remotely.
- Gareth: NETR
- James: NETR
- Matt: NETR
- Peter: NETR
- Sam: NETR
- Tim: NETR
- Vip: had to leave early.
● AOB
Continuing discussion about storage in the Chat, quoted here:
Dan:
My R510s are very stable (touch wood) at the moment. One thing I have done is to keep the firmware up to date.
Lustre is better able to balance and rebalance data across the servers. All servers contribute to all space tokens.
Why don't we see the same issue from Manchester (similar size and DPM)?
Is it just the hardware?
Dell vs XMA?
Matt:
Actually, you might have a point there; my newer Dells don't seem to have much of a problem.
It might be that they're running a tighter ship somehow.
Vip:
We have a mixture of R510s and R720s running DPM. The firmware is relatively up to date. We have a few with cache battery failures; overall, it has been stable apart from a few disk failures. I drained a pool node for spare disks. All of them are out of warranty.