ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-01-23T10:00:00+00:00
End: 2020-01-23T11:00:00+00:00
Location: Vidyo

Thursday 23 Jan 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

CPU

CERN Monit not working since 6am, so everything is/looks stopped. There is a problem with CERN Ceph storage.

Tickets

GGUS #144884 Seems to be some user analysis jobs that used too much RAM. We should try to get a better error message than "Job submission to LRMS failed".

GGUS #144759 There may be a misconfiguration of the Glasgow squids. Gareth will investigate when he has a chance (new hardware arriving)

GGUS #144688 Gareth said this is a common issue where a burst of transfers to the same disk causes transfer errors like this. Once it cools down things seem OK again. Can close this ticket, but it could recur.
[later] Had requested (ADCINFR-162) to reduce old DPM storage to allow decommissioning old servers, but CRC shifter said (on GGUS) they still needed the space. Gareth: uncomfortable keeping it like this. Disks are old and may fail sometime.

● CentOS7 - Sussex

Patrick/Dan: everything looks good at Sussex, but can't tell with Monit down.
Alessandra [later]: Sussex is running payloads. They failed because they can't access QMUL storage. QMUL is a StoRM site, and the StoRM mover uses POSIX access, which isn't supported yet. Alessandra will discuss with DDM.

● Storageless sites

Elena still has 8.2 TB left on Sheffield disks. Elena will push for last bit to be removed. Will start by posting on JIRA.
For access to RAL disk, Elena has finished switching to the Rucio copytool.

● Glasgow Ceph storage

Sam: Current setup not final production configuration. Will need more servers. Plan to switch to production cluster once new servers arrive from Dell, probably available mid-February. Dan also noted issues with delivery from Dell. This will mean current disk will probably lose its data. It is a little concerning that the data now on the disk is marked "primary".

Tim: Added Ceph DataDisk in AGIS last Thursday. This apparently is not the correct procedure: DDM need to do some magic *before* the disk is enabled. Dimitrios fixed this on Monday and switched the disk to type "TEST" (instead of DATADISK).
There were still problems transferring to the disk, which Sam fixed in the voms-mapfile.

Tim then set up a new test queue, and HammerCloud jobs started today. (Elena suggested contacting atlas-adc-expert@cern.ch if HC doesn't run.) Jobs fail: they need to be configured to upload the output through the correct gateway. Sam will give details on JIRA.

● News round-table

Alessandra: NETR [nothing else to report; some comments noted above.]
Dan: Last WN moved SL6->C7. Waiting for Dell storage, hope for delivery in February.
Elena: NETR
Emanuele: NTR
Matt: One Lancaster server rebuilding 3 disks, but seems OK. Purchasing gpnode.
Patrick: NTR
Sam: NTR
Stewart: LocalGroupDisk is filling up. Identifying people across UK who have left.
Tim: Switched RAL to use Rucio copytool. All seems good. Data Carousel reprocessing started on Tuesday without RAL, which had a Castor intervention scheduled for Wednesday. That's done, so can start today.
Vip: NTR

- 10:00 → 10:10
  
  Outstanding tickets 10m
  
  Open ATLAS UK GGUS tickets
  
  UK Cloud jobs over last week
  
- 10:10 → 10:20
  
  Other new issues 10m
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
    
  - Storageless sites 5m
    
    ADCINFR-147: Sheffield
    
  - Glasgow Ceph storage 5m
    
    ADCINFR-152: Glasgow Ceph
    
- 10:40 → 10:50
  
  News round-table 10m
  
- 10:50 → 11:00
  
  AOB 10m
