ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-07-02T10:00:00+01:00
End: 2020-07-02T11:00:00+01:00
Location: Vidyo

Thursday 2 Jul 2020, 10:00 → 11:00 Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

147698 UKI-SCOTGRID-DURHAM less urgent assigned 2020-07-01 15:32:00 UKI-SCOTGRID-DURHAM squid down
- Assigned; VM / to reboot
146771 UKI-SCOTGRID-ECDF less urgent reopened 2020-07-01 22:18:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Reopened; it was hoped that the update to CentOS7 would have resolved most issues
146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL
- Timescale and planning underway with Grid service
146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- On Hold
145688 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-06-30 06:45:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP
- Almost complete; test squid online; will try the new version for a few days on one production squid, then roll out.
145510 RAL-LCG2 urgent in progress 2020-06-29 07:33:00 RAL-LCG2: timeouts on stage-in/outs
- Pilot update seems to have improved the situation; however, there was a spike in timeout activity. Will try to close.
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- On Hold; access may become increasingly restricted
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- On Hold

CPU

RAL
- A problem appeared in aCT. Fixed by 2200, but it is taking time to reclaim the lost slots
Northgrid
- Durham; Aircon on but below full efficiency; may take time to get jobs back in the queue from the backlog of other jobs.
London
- QMUL: To investigate memory issues from jobs
SouthGrid
Scotgrid

Other new issues

Ongoing issues

CentOS7 - Sussex
- On Hold
Grand Unified queues
- On Hold

News round-table

Vip
- Downtime for ARC; next Wednesday, 1 day
Dan
- JW: To look at the memory failures at QMUL
- https://bigpanda.cern.ch/wns/UKI-LT2-QMUL/?hours=12
Matt
(via email)
- Our new CE is coming slowly; but managing with current version. Details for our new ARC CE as soon as I get it accepting jobs.
Peter
- new CE in progress
Alessandra
- NTR
Sam
- Can run the current number of jobs on CEPH OK
- To bring all CPU online, would want to add more redirectors
Gareth
- CEPH site; still mentioned as test in some ATLAS pages
  - AF to ask what situation with CEPH in ATLAS and how to properly put into production
  - SAM to update the JIRA.
- DPM can drain once ATLAS is happy with CEPH.
- Increased restrictions on access to the server room with recirculated air; under discussion
- http://adc-ddm-mon.cern.ch/ddmusr01/plots/plots.php?endpoint=UKI-SCOTGRID-GLASGOW-CEPH_DATADISK
- https://fts3-pilot.cern.ch:8449/fts3/ftsmon/#/?vo=&source_se=gsiftp:%2F%2Fcephc04.gla.scotgrid.ac.uk&dest_se=&time_window=1
- http://adc-ddm-mon.cern.ch/ddmusr01/plots/plots.php?endpoint=UKI-SCOTGRID-GLASGOW_DATADISK
JW
- TPC; xrootd back for RAL-LCG2 and RAL-CEPH; working on http with minor successes

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  AOB 10m