ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

GGUS #145057: Gareth said this was very likely load-related, as the disk servers are very busy. This can happen if frequently accessed files are located on the same server. Oxford sees a similar problem at the moment. We discussed options: reduce the number of jobs, and encourage/force analysis users not to use direct I/O (directio). This should be much better for Glasgow once the Ceph disk is moved into production.
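
As an aside, a minimal sketch of how one might confirm such hot-spotting, assuming a hypothetical access log with one "timestamp server path" entry per line (the format and script are illustrative, not an existing tool):

    # Illustrative only: tally accesses per disk server and per file from a
    # hypothetical access log ("timestamp server path" per line) to check
    # whether frequently accessed files cluster on a few servers.
    from collections import Counter
    import sys

    def tally(log_path):
        per_server = Counter()
        per_file = Counter()
        with open(log_path) as log:
            for line in log:
                try:
                    _, server, path = line.split()[:3]
                except ValueError:
                    continue  # skip malformed lines
                per_server[server] += 1
                per_file[path] += 1
        return per_server, per_file

    if __name__ == "__main__":
        servers, files = tally(sys.argv[1])
        print("Busiest disk servers:", servers.most_common(5))
        print("Hottest files:", files.most_common(5))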

GGUS #144953 and #144884: two RAL tickets for the same problem. This seems to have been fixed by an aCT update at CERN.

GGUS #144759: On hold, but Gareth will update the ticket.

CPU

A CERN Ceph issue stopped monitoring and jobs last Thursday morning.


● CentOS7 - Sussex

Local software may be confusing the pilot, which would otherwise pick up software (e.g. gfal) from CVMFS.
ATLAS only needs CVMFS with user namespaces enabled (so that Singularity can be run from CVMFS), or alternatively CVMFS plus a locally installed Singularity.
It would be useful to have this better documented, and perhaps also a talk at the ATLAS S&C Week.
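
As a rough illustration (not an official ATLAS validation script), the two prerequisites for the CVMFS-only option could be checked on a CentOS7 worker node like this:

    # Illustrative check of the two worker-node prerequisites on CentOS7:
    # CVMFS mounted, and unprivileged user namespaces enabled so that
    # Singularity can be run from CVMFS.
    import os

    def cvmfs_ok(repo="/cvmfs/atlas.cern.ch"):
        # The repository directory is only visible if CVMFS is mounted.
        return os.path.isdir(repo)

    def user_namespaces_ok():
        # On CentOS7, unprivileged user namespaces are available when
        # user.max_user_namespaces is set to a non-zero value.
        try:
            with open("/proc/sys/user/max_user_namespaces") as f:
                return int(f.read().strip()) > 0
        except (FileNotFoundError, ValueError):
            return False

    if __name__ == "__main__":
        print("CVMFS mounted:      ", cvmfs_ok())
        print("User namespaces on: ", user_namespaces_ok())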


● Storageless sites

There are still 8 TB of data left at Sheffield. Elena asked on JIRA for this to be removed, since the disk itself needs to be removed.


● Glasgow Ceph storage

The disk servers for the full Ceph pool have arrived.
Transfers to the test DataDisk are working fine. Now trying to get access from jobs. Read works, but write doesn't yet.
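
For context, the kind of read/write smoke test involved might look like the following sketch, using the gfal2 Python bindings; the endpoint URL and paths are placeholders, not the real Glasgow ones:

    # Illustrative read/write smoke test against a test DataDisk endpoint.
    # The URL and paths below are placeholders.
    import gfal2

    ENDPOINT = "root://ceph-test.example.ac.uk:1094//atlas/datadisk"  # hypothetical

    ctx = gfal2.creat_context()

    # Read check: stat a file that is known to exist (placeholder path).
    try:
        info = ctx.stat(ENDPOINT + "/test/existing-file")
        print("read OK, size =", info.st_size)
    except gfal2.GError as err:
        print("read FAILED:", err)

    # Write check: copy a small local file to the endpoint.
    params = ctx.transfer_parameters()
    params.overwrite = True
    try:
        ctx.filecopy(params, "file:///tmp/smoke-test.txt",
                     ENDPOINT + "/test/smoke-test.txt")
        print("write OK")
    except gfal2.GError as err:
        print("write FAILED:", err)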


● AOB

Alessandra: Manchester is going to unify its batch system into a Grand Unified queue, with two CEs serving the same queue.
Dan: QMUL has already switched to a GU queue.
Elena: NETR (nothing else to report)
Glasgow: NETR from Emanuele, Gareth, and Sam.
Peter: getting a new NIC, upgrading from 20 to 80 Gb/s.
Stewart: provided site availability script: https://github.com/StewMH/QuarterlyReport/blob/master/report.py
Tim: Data Carousel reprocessing campaign had a rocky start, with the CERN FTS going too slowly (now fixed). This affected most sites, but RAL was OK and has staged most of the first tranche, data18. This is now being processed at UK sites.
There were also problems with jobs that requested 1 GB of RAM being killed by the RAL LRMS when they reached 3 GB.
Vip: NETR
