ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

GGUS #145057: Gareth said this was very likely load-related, as the disk servers are very busy. This can happen if frequently accessed files are located on the same server. Oxford sees a similar problem at the moment. We discussed options: reduce the number of jobs, and encourage/force analysis users not to use direct I/O (directio). This should be much better for Glasgow once the Ceph disk is moved into production.
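
As an aside, a minimal sketch of how one might confirm such hot-spotting, assuming a hypothetical access log with one "timestamp server path" entry per line (the format and script are illustrative, not an existing tool):

    # Illustrative only: tally accesses per disk server and per file from a
    # hypothetical access log ("timestamp server path" per line) to check
    # whether frequently accessed files cluster on a few servers.
    from collections import Counter
    import sys

    def tally(log_path):
        per_server = Counter()
        per_file = Counter()
        with open(log_path) as log:
            for line in log:
                try:
                    _, server, path = line.split()[:3]
                except ValueError:
                    continue  # skip malformed lines
                per_server[server] += 1
                per_file[path] += 1
        return per_server, per_file

    if __name__ == "__main__":
        servers, files = tally(sys.argv[1])
        print("Busiest disk servers:", servers.most_common(5))
        print("Hottest files:", files.most_common(5))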

GGUS #144953 and #144884: two RAL tickets for the same problem. This seems to have been fixed by an aCT update at CERN.

GGUS #144759: On hold, but Gareth will update the ticket.

CPU

A CERN Ceph issue stopped monitoring and jobs last Thursday morning.


● CentOS7 - Sussex

Local software may be confusing the pilot, which would otherwise pick up software (e.g. gfal) from CVMFS.
ATLAS only needs CVMFS with user namespaces enabled (so that Singularity can be run from CVMFS), or alternatively CVMFS plus a locally installed Singularity.
It would be useful to have this better documented, and perhaps also a talk at the ATLAS S&C Week.
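
As a rough illustration (not an official ATLAS validation script), the two prerequisites for the CVMFS-only option could be checked on a CentOS7 worker node like this:

    # Illustrative check of the two worker-node prerequisites on CentOS7:
    # CVMFS mounted, and unprivileged user namespaces enabled so that
    # Singularity can be run from CVMFS.
    import os

    def cvmfs_ok(repo="/cvmfs/atlas.cern.ch"):
        # The repository directory is only visible if CVMFS is mounted.
        return os.path.isdir(repo)

    def user_namespaces_ok():
        # On CentOS7, unprivileged user namespaces are available when
        # user.max_user_namespaces is set to a non-zero value.
        try:
            with open("/proc/sys/user/max_user_namespaces") as f:
                return int(f.read().strip()) > 0
        except (FileNotFoundError, ValueError):
            return False

    if __name__ == "__main__":
        print("CVMFS mounted:      ", cvmfs_ok())
        print("User namespaces on: ", user_namespaces_ok())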


● Storageless sites

There are still 8 TB of data left at Sheffield. Elena asked on JIRA for this to be removed, since the disk itself needs to be removed.


● Glasgow Ceph storage

The disk servers for the full Ceph pool have arrived.
Transfers to the test DataDisk are working fine. Now trying to get access from jobs. Read works, but write doesn't yet.
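
For context, the kind of read/write smoke test involved might look like the following sketch, using the gfal2 Python bindings; the endpoint URL and paths are placeholders, not the real Glasgow ones:

    # Illustrative read/write smoke test against a test DataDisk endpoint.
    # The URL and paths below are placeholders.
    import gfal2

    ENDPOINT = "root://ceph-test.example.ac.uk:1094//atlas/datadisk"  # hypothetical

    ctx = gfal2.creat_context()

    # Read check: stat a file that is known to exist (placeholder path).
    try:
        info = ctx.stat(ENDPOINT + "/test/existing-file")
        print("read OK, size =", info.st_size)
    except gfal2.GError as err:
        print("read FAILED:", err)

    # Write check: copy a small local file to the endpoint.
    params = ctx.transfer_parameters()
    params.overwrite = True
    try:
        ctx.filecopy(params, "file:///tmp/smoke-test.txt",
                     ENDPOINT + "/test/smoke-test.txt")
        print("write OK")
    except gfal2.GError as err:
        print("write FAILED:", err)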


● AOB

Alessandra: Manchester is going to unify its batch system into a Grand Unified queue, with two CEs serving the same queue.
Dan: QMUL has already switched to a GU queue.
Elena: NETR (nothing else to report)
Glasgow: NETR from Emanuele, Gareth, and Sam.
Peter: getting a new NIC, upgrading from 20 to 80 Gb/s.
Stewart: provided site availability script: https://github.com/StewMH/QuarterlyReport/blob/master/report.py
Tim: Data Carousel reprocessing campaign had a rocky start, with the CERN FTS going too slowly (now fixed). This affected most sites, but RAL was OK and has staged most of the first tranche, data18. This is now being processed at UK sites.
There were also problems with jobs that requested 1 GB of RAM being killed by the RAL LRMS when they reached 3 GB.
Vip: NETR
