ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-08-13T10:00:00+01:00
End: 2020-08-13T11:00:00+01:00
Location: Vidyo

Thursday 13 Aug 2020, 10:00 → 11:00 Europe/London

Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

148234 RAL-LCG2 less urgent in progress 2020-08-12 10:38:00 RAL-LCG2 deletion errors
- Deletions into Echo failing at a rate of 10%; just a load issue? Failed deletions do complete
148228 UKI-SOUTHGRID-OX-HEP less urgent waiting for reply 2020-08-12 10:17:00 UKI-SOUTHGRID-OX-HEP transfer failures as destination
- To Close
148169 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-05 10:25:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy
- Follow-up
147979 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-08-04 09:28:00 UKI-NORTHGRID-MAN-HEP timeout transfer errors and also deletion errors
- Follow-up
146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-10 10:23:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
- Mitigation still working; still exploring the main solution
146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
- RAL last big site to provide this; impacting on containerised workflow jobs
146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- Some test jobs through, but still issues
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- Update to ticket; Restrictions on access; dealing with admin to get relevant systems into place
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- Access to the data centre now feasible. Need to consolidate pieces of kit delivered to various places and start preparing new node.

CPU

Bad Rucio version 1.23.0 rolled out on Tuesday
- Affected Tokyo and UK, likely due to mixed protocols (read/write) and write/stat race conditions?
- Wednesday pm: misconfiguration in AGIS for site RSE caused mass HC blacklisting; quickly resolved.
RAL
- Small dip due to Rucio, but slow recovery from ATLAS issues
Northgrid
- Manchester recovering slowly?
London
- QMUL still low on jobs; largely AC related
SouthGrid
Scotgrid
- Durham low jobs
- ECDF - (CLOUD in test)
  - Believed to be interference between the CLOUD scheduler and OpenStack
- DPM: PanDA stopped sending jobs to Kelvin for a short time; infrequent but previously seen issue

Other new issues

Ongoing issues

CentOS7 - Sussex
- as discussed above
Grand Unified queues
- Awaiting Sheffield

News round-table

Vip
- 896 threads added to the pool
- Noted lower efficiency; GR pointed out this may just be due to the increase in reco jobs
Dan
- AC issues, but more nodes should now be available
Peter
- NTR
Sam
- XRootD: is ATLAS seeing similar streaming issues to LHCb?
  - JW: do see some error rate in user jobs (using direct-IO)
  - Recent case of a production job now running in direct-IO, with a similar issue
Gareth
- Noted wrt job efficiency:
  - Special evgen: some jobs may try to take two threads
  - Reco jobs can hit efficiency (JW: increased running due to reprocessing campaigns)
- Performance improvements planned for Ceph / infrastructure / bonding networking; ‘timescale’
- 1400 cores; starting to hit the GridFTP limits
JW
- NTR
Patrick
- NTR

- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
  - Other new issues 5m
- 10:20 → 10:40
  Ongoing issues 20m
  
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
- 10:40 → 10:50
  News round-table 10m
  
- 10:50 → 11:00
  
  AOB 10m
