ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-03-05T10:00:00+00:00
End: 2020-03-05T11:00:00+00:00
Location: Vidyo

Thursday 5 Mar 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

GGUS #145614, #145931: the Manchester headnode memory issue may have resurfaced.
GGUS #145688: Alessandra is discussing with a RAL expert (Jose).
GGUS #145804: Matt is investigating. Multi-core job submission to CREAM is failing.
GGUS #145610: Glasgow Ceph test disk is working again, so Sam will close the ticket.
GGUS #145510: James is working on timeouts accessing RAL Echo from WN jobs. For stage-in, he is looking at transfer times. For stage-out, the new Rucio version 1.21.9 was expected to fix it, but it did not. The stage-out issue is not specific to RAL: Rod sees similar error rates from other sites.

● CPU

Lancaster had a drop in submissions last Friday, which is now fixed. It may have coincided with the apfmon failure. Peter reported that apfmon is being updated.
RHUL is running mostly single-core jobs and is looking to install the HTCondor defrag package.
Durham has been down for various interventions. Sam expects it to ramp up now.
James reported that Stewart has looked at the CPU pledges. Since the pledge period runs Apr-Mar, current ATLAS reporting needs to be compared against the 2019 pledge. The pledge lines match REBUS 2019.

● CentOS7 - Sussex

Dan discussed this with Patrick a couple of days ago. The WN kernels should now be up to date. He should be ready to accept ATLAS jobs again, but is not yet in HammerCloud. He should email atlas-support-cloud-uk@cern.ch to be enabled again.

● Glasgow Ceph storage

Sam will upgrade to the Ceph Nautilus release. He can then check stage-in and stage-out errors. Sam commented that the stage-out errors may not be the same as those experienced at other sites (see GGUS #145510 above).

● Grand Unified queues

No news.

● News round-table

Dan sees a lot of job failures from DaviX. That should be the secondary protocol.
Elena is investigating problems with Pilots.
James: NETR
Matt: NETR; jobs flowing.
Peter: will sort out the Lancaster problem ASAP.
Sam: NETR
Tim: NETR
Vip: NETR

● AOB

Peter requested that future reminders for this meeting be sent earlier. James agreed to remind on Tuesday.

James asked about site plans concerning quarantine for Coronavirus.
Matt said that working from home is OK for many sites.

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
  - CPU 5m
    
    UK Cloud jobs over last week
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Glasgow Ceph storage 5m
    
    ADCINFR-152: Glasgow Ceph
  - Grand Unified queues 5m
    
    ADCDPA-235
- 10:40 → 10:50
  News round-table 10m
  
- 10:50 → 11:00
  
  AOB 10m
  
