ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

  • GGUS #145614, #145931: the Manchester headnode memory issue may have resurfaced.
  • GGUS #145688: Alessandra is discussing this with the RAL expert (Jose).
  • GGUS #145804: Matt is investigating. Multi-core job submission to CREAM is failing.
  • GGUS #145610: Glasgow Ceph test disk is working again, so Sam will close the ticket.
  • GGUS #145510: James is working on timeouts accessing RAL Echo from WN jobs. For stage-in, he is looking at transfer times. For stage-out, the new Rucio version 1.21.9 was expected to fix the problem, but it did not. The stage-out issue is not specific to RAL: Rod sees similar error rates from other sites.

● CPU

  • Lancaster saw a drop in submissions last Friday, which has since been fixed. It may have happened when apfmon failed. Peter reported that apfmon is being updated.
  • RHUL is running mostly single-core jobs and is looking to install the HTCondor defrag package.
  • Durham has been down for various interventions. Sam expects it to ramp up now.
  • James reported that Stewart has looked at the CPU pledges. Since the pledge period runs Apr-Mar, current ATLAS reporting needs to be compared against the 2019 pledge. The pledge lines match REBUS 2019.
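
The pledge-period rule in the last point can be sketched in a few lines of Python. This is only an illustration, assuming the period runs from 1 April to 31 March as noted above; the function name `pledge_year` and the example date are hypothetical and not part of any ATLAS or REBUS tool.

```python
from datetime import date


def pledge_year(day: date) -> int:
    """Return the pledge year covering `day`.

    Assumption: a pledge period starts on 1 April and ends on 31 March
    of the following calendar year, and is labelled by its starting year.
    """
    # January to March still belong to the previous year's pledge period.
    return day.year if day.month >= 4 else day.year - 1


# Illustrative date only: a day in March 2020 still falls under the 2019 pledge.
print(pledge_year(date(2020, 3, 1)))
```

Under that assumption, reporting for any date before 1 April 2020 is compared with the 2019 pledge, and the 2020 pledge applies from 1 April onwards.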

● CentOS7 - Sussex

Dan discussed this with Patrick a couple of days ago. The WN kernels should now be up to date. Sussex should be ready to accept ATLAS jobs again, but is not yet in HammerCloud. He should email atlas-support-cloud-uk@cern.ch to be enabled again.


● Glasgow Ceph storage

Sam will upgrade to the Ceph Nautilus release. He can then check the stage-in and stage-out errors. Sam commented that the stage-out errors may not be the same as those experienced at other sites (see GGUS #145510 above).


● Grand Unified queues

No news.


● News round-table

  • Dan sees a lot of job failures from DaviX. That should be the secondary protocol.
  • Elena is investigating problems with Pilots.
  • James: NETR
  • Matt: NETR; jobs flowing.
  • Peter: will sort out Lancaster problem ASAP
  • Sam: NETR
  • Tim: NETR
  • Vip: NETR

● AOB

Peter requested that future reminders for this meeting be sent earlier. James agreed to remind on Tuesday.

James asked about site plans concerning quarantine for Coronavirus.
Matt said that working from home is OK for many sites.

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m
    • 10:20 10:40
      Ongoing issues 20m
      • CentOS7 - Sussex 5m
      • Glasgow Ceph storage 5m
      • Grand Unified queues 5m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m