ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

GGUS #143106 QMUL: The Rucio mover now supports StoRM, so Rod moved QMUL to use it [ADC Weekly CRC report]. This may have provoked the memory overload on the StoRM frontend service. Dan restarted it, which freed up RAM, and also took the problematic nodes offline.

GGUS #143094 Glasgow:
Gareth reported that migrating machines to CentOS7 exposed scaling problems with the network. A fix broke the default routes. Fixed, then broke again, fixed again, now hopefully OK. Leave ticket open until confirmed.

GGUS #143059 Edinburgh: Teng discussed with Alessandra on atlas-support-cloud-uk@cern.ch. Alessandra switched LAN access from Rucio to xrootd, which seems to have fixed it. TPC still uses SRM until DOME is enabled.

Other issues:

There was a database outage at CERN on Monday that paused all ATLAS job submissions.

Oxford was taken offline last night due to failing HammerCloud tests. Some mcore jobs started to run this morning.

Cambridge had no jobs on Friday. Andrew McNab fixed a problem with VAC, but there still seem to be problems. Pilots are running, but not doing anything.

Durham reporting lack of jobs since 20 August. Elena will investigate.

Sheffield had a problem with ARC-CE, now fixed. Queues are now switched to Singularity.

Several sites have had no ATLAS jobs over the last week or so. Elena suggested these may be separate issues, so we will deal with them site by site first.


● Birmingham/Cambridge XCache

Cambridge XCache running fine, currently 60% full. Logging seems to be working. Can close this issue.


● Diskless sites

Cambridge and Sussex are now diskless and not experiencing trouble from that. Can close this issue.


● CentOS 7 migration

Alessandra has announced that remaining SL6 queues at RHUL, Glasgow, and Sussex will be put in BROKEROFF on Sunday.
Glasgow has no problem with this. 40% done migrating nodes. Has set up an additional CE. Will email Elena to add to AGIS.
RHUL has set up CentOS7 CE.
Sussex: Patrick is working on setting up ARC-CE6 and upgrading CentOS7 workers. Won't be ready for the deadline, so Sussex will go quiet.


● Singularity

Alessandra mailed Durham and Edinburgh to prod them to install Singularity. Sussex will do it later.


● News round-table

Elena: Problem staging files in Oxford. Sent detailed errors in email to Vip, who can check whether GridFTP is installed. Elena will be on holiday for the next 2 weeks.
Emanuele is site admin at Glasgow.
Gareth: Migration to new data centre ongoing. Networking finally installed. On track for moving in October.
John: NTR
Patrick: NTR
Sam: Continuing with Ceph storage setup with help from RAL. When GridFTP access is available, Tim will set up the storage endpoint in AGIS.
Vip: restarted services on headnode.
Tim: xrootd TPC smoke tests are failing for RAL Echo due to a certificate problem.

    • 10:00 10:10
      Outstanding tickets 10m

    • 10:10 10:30
      Ongoing issues 20m
      • Birmingham/Cambridge XCache 5m

      • Diskless sites 5m

      • CentOS 7 migration 5m

      • Singularity 5m

    • 10:30 10:50
      News round-table 20m

    • 10:50 11:00
      AOB 10m