ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-12-03T10:00:00+00:00
End: 2020-12-03T11:00:00+00:00
Location: Zoom

Thursday 3 Dec 2020, 10:00 → 11:00 Europe/London


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as OPs Mtg)


Outstanding tickets

149752 UKI-NORTHGRID-LANCS-HEP less urgent assigned 2020-12-02 16:07:00 Failovers from University of Lancaster to CERN backup proxy
- Number of stale cvmfs observed (also at Glasgow)
- geoip issues; might be related to Stratum 1 updates?
- refresh cache may be best option
149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-02 10:16:00 UKI-SOUTHGRID-RALPP: unable to connect to host
- Problems in FTS transfers for ATLAS (not other VOs). CLI TPC transfers appear ok.
149738 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-12-02 15:55:00 UKI-NORTHGRID-LANCS-HEP: deletion errors
- Poor RAID card struggles with many simultaneous operations (deletions), causing crashes.
- Down to the last 25% of data from draining of the server.
- Draining stopped for today, but some file losses should be expected.
149705 UKI-SCOTGRID-ECDF less urgent in progress 2020-11-30 11:52:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER [70] TRANSFER an end-of-file was reached …
- Load on headnode from httpd processes
  - From Matt: a method to mitigate high memory usage from http has been implemented at Lancaster. The issues might be related.
149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-11-19 10:11:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
- heplnx207 still in downtime (ended post-meeting)
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-27 10:00:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Disk 40 is being drained as part of decommissioning. The RAID set reports OK, the filesystem does not.
- AC / cooling issues in DPM server room
146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
- on hold, working on underlying issues
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- Arc-ce issues; not reporting back to the monitoring sites
  - e.g. http://apfmon.lancs.ac.uk/CERN_central_B:20455538.0
- Communication issue? GridFTP looks to be working.
- Can the BDII / LDAP be queried (from offsite?)
  - Status information usually through the BDII.
- To contact the arc-devs?
- To try an LDAP search against BDII
- Patrick to report back to TB support.
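For the LDAP search against the BDII suggested above, a minimal sketch of how the query could be assembled. The host name `bdii.example.org` is hypothetical, and port 2170 and base `o=glue` are assumed common defaults, not values confirmed for Sussex. The helper only builds the `ldapsearch` command line; it does not run it.

```python
import shlex


def bdii_ldapsearch_command(host, port=2170, base="o=glue",
                            ldap_filter="(objectClass=*)"):
    """Build an anonymous ldapsearch command for an information system.

    host is a placeholder; port and base are assumed defaults and
    should be checked against the actual site configuration.
    """
    return [
        "ldapsearch",
        "-x",                              # simple (anonymous) bind
        "-LLL",                            # plain LDIF, no comments
        "-H", f"ldap://{host}:{port}",     # server URI
        "-b", base,                        # search base
        ldap_filter,
    ]


# Hypothetical host, for illustration only.
cmd = bdii_ldapsearch_command("bdii.example.org")
print(shlex.join(cmd))
```

Running the printed command from offsite would show whether the information system is reachable at all, separately from any ARC-level reporting problem.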

CPU

RAL
Northgrid
London
SouthGrid
Scotgrid
Downtime for DPM; problems with chillers and AC. Effectively shut down for the moment.
- Some replacements needed.
Prod is in the DC, which is fine.

Other new issues

Ongoing issues

CentOS7 - Sussex
TPC http
- RAL TPC-http FTS tests working by converting // to / in path.
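The // to / conversion can be illustrated as below. The minutes do not say where the conversion was applied (FTS, the storage endpoint, or the test configuration), so this is only a sketch of the transformation itself; the example URL is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url):
    """Collapse repeated '/' in the path part of a URL.

    The '//' after the scheme and the query string are left untouched.
    """
    parts = urlsplit(url)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))


# Hypothetical endpoint and path, for illustration only.
fixed = collapse_path_slashes(
    "https://storage.example.org:443//atlas/datadisk//file.root"
)
print(fixed)
```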
Oxford Storageless tests
- 10GB link working
- Arc config needed; Sam to send to Vip
ECDF unreliable storage
- Rob to update ticket
Glasgow LOCALGROUPDISK
- Sam to aim to create Ceph pool.

News round-table

Vip
- Production squid server failover yesterday;
- CPU efficiency looks a bit lower?
- prmon to be added: https://github.com/HSF/prmon in monitoring for storageless tests.
Dan
- Possible downtime 1wk on the 14th.
  - StoRM moving ahead to CentOS7
- Next year disruption expected in DC, dates to be determined.
Matt
- NTR; prepare for lost files.
Peter
- Considering options for CRC shifter
  - Soliciting for CRC shifts.
Sam
- NTR
Gareth
- Continue to work on cooling issues
JW
- NTR
Patrick
- NTR

AOB


- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues / tasks 5m
- 10:20 → 10:40
  Ongoing Items 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - TPC with http 5m
  - Storageless Site tests (Oxford) 5m
    
    ADCINFR-185
  - ECDF volatile storage 5m
    
    ADCINFR-184
  - Glasgow DPM Decommissioning 20m
    
    LOCALGROUPDISK and DATADISK decommissioning
    
    ADCINFR-152
- 10:40 → 10:50
  News round-table 10m
- 10:50 → 11:00
  
  AOB 10m
