ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

CPU

CERN Monit not working since 6am, so everything is/looks stopped. There is a problem with CERN Ceph storage.

Tickets

GGUS #144884 Seems to be some user analysis jobs that used too much RAM. We should try to get a better error message than "Job submission to LRMS failed".

GGUS #144759 There may be a misconfiguration of the Glasgow squids. Gareth will investigate when he has a chance (new hardware arriving).

GGUS #144688 Gareth said this is a common issue where a burst of transfers to the same disk causes transfer errors like this. Once it cools down things seem OK again. Can close this ticket, but it could recur.
[later] Had requested (ADCINFR-162) to reduce old DPM storage to allow decommissioning old servers, but CRC shifter said (on GGUS) they still needed the space. Gareth: uncomfortable keeping it like this. Disks are old and may fail sometime.


● CentOS7 - Sussex

Patrick/Dan: everything looks good at Sussex, but can't tell with Monit down.
Alessandra [later]: Sussex is running payloads. They fail because they can't access QMUL storage. QMUL is a StoRM site, and the StoRM mover uses POSIX access, which isn't supported yet. Alessandra will discuss with DDM.


● Storageless sites

Elena still has 8.2 TB left on Sheffield disks. Elena will push for last bit to be removed. Will start by posting on JIRA.
For access to RAL disk, Elena has finished switching to use the Rucio copytool.


● Glasgow Ceph storage

Sam: Current setup not final production configuration. Will need more servers. Plan to switch to production cluster once new servers arrive from Dell, probably available mid-February. Dan also noted issues with delivery from Dell. This will mean current disk will probably lose its data. It is a little concerning that the data now on the disk is marked "primary".

Tim: Added Ceph DataDisk in AGIS last Thursday. This apparently is not the correct procedure: DDM need to do some magic *before* the disk is enabled. Dimitrios fixed this on Monday and switched the disk to type "TEST" (instead of DATADISK).
There were still problems transferring to the disk, which Sam fixed in the voms-mapfile.

Tim then set up a new test queue, and HammerCloud jobs started today. (Elena suggested contacting atlas-adc-expert@cern.ch if HC doesn't run.) Jobs fail: they need to be configured to upload the output through the correct gateway. Sam will give details on JIRA.


● News round-table

Alessandra: NETR [nothing else to report; some comments noted above.]
Dan: Last WN moved SL6->C7. Waiting for Dell storage, hope for delivery in February.
Elena: NETR
Emanuele: NTR
Matt: One Lancaster server rebuilding 3 disks, but seems OK. Purchasing gpnode.
Patrick: NTR
Sam: NTR
Stewart: LocalGroupDisk is filling up. Identifying people across UK who have left.
Tim: Switched RAL to use Rucio copytool. All seems good. Data Carousel reprocessing started on Tuesday without RAL, which had a Castor intervention scheduled for Wednesday. That's done, so can start today.
Vip: NTR

    • 10:00 10:10
      Outstanding tickets 10m


    • 10:10 10:20
      Other new issues 10m
    • 10:20 10:40
      Ongoing issues 20m
      • CentOS7 - Sussex 5m


      • Storageless sites 5m


      • Glasgow Ceph storage 5m


    • 10:40 10:50
      News round-table 10m


    • 10:50 11:00
      AOB 10m