ATLAS UK Cloud Support

Name: ATLAS UK Cloud Support
Start: 2020-09-24T10:00:00+01:00
End: 2020-09-24T11:00:00+01:00
Location: Vidyo

Thursday 24 Sept 2020, 10:00 → 11:00 Europe/London

Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))


Outstanding tickets

148719 UKI-LT2-IC-HEP less urgent in progress 2020-09-22 19:43:00 Failovers from UKI-LT2-IC-HEP to CERN CVMFS backup proxy
- Active discussion on ticket
148401 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-09-22 12:47:00 UKI-NORTHGRID-LANCS-HEP: globus_ftp_client failures
- Files declared lost (again, with typo fixed); a few residual files remain to be investigated once Matt is back.
148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-09-15 12:59:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
- Internal deletions complete; Sam to update ticket
146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
- On hold
146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
- On hold
144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
- On hold
142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
- No update

CPU

A number of ATLAS / CERN issues are affecting sites.
- New Pilot version misreporting corecount:
  - Affected scaling values used for accounting (e.g. wallclock and slots used)
  - Jobs killed incorrectly because of miscalculated mem-limit values
  - Fix deployed yesterday and now rolling out
- Update to VOMS server yesterday introduced issues:
  - The upgraded VOMS server issued a VOMS extension that could not be validated by existing (and supported) VOMS C/C++ libraries.
  - The problem was observed on XRootD since XRootD links against VOMS libraries, but any C/C++ software linking against the VOMS library would be affected (e.g., the StoRM frontend server).
  - The change has been rolled back, but ATLAS may still see some lingering effects?
- Harvester_Central_B stopped submitting jobs this morning - under investigation
- All StoRM sites blacklisted since the VOMS incident + pilot update (may be related to the VOMS issue?):
  - “pilot, 1324: Service not available at the moment”

RAL
- Small drop in jobs due to pilot problems; now slowly claiming back jobs from other VOs
- Not seemingly affected by other issues.
Northgrid
- All jobs dropped off.
London
- All jobs dropped off.
- QMUL briefly back up to 20kHS06 before new issues arose
SouthGrid
- Most sites gone; BHAM not affected
Scotgrid
- Most sites gone; ECDF not affected

Other new issues

GLASGOW:
- CEPH_DATADISK no longer in TEST (set to DATADISK in AGIS)
- DPM DATADISK now set as test
- PQ set offline for DPM queues
QMUL:
- Space reporting now ok
- Additional space for ATLAS (with some further space coming)

Ongoing issues

CentOS7 - Sussex
- No update
TPC with http
- No update

News round-table

Dan
- 1/2 PB further to add for ATLAS
  - ATLAS to propose spacetoken split
Peter
- Learning arc-ce
Sam
- Reported on discussion in Storage mtg. on future planning,
  - e.g. moving to storageless sites (even if storage is not initially decommissioned)
- To hold off final commissioning until VOMS / related issues are resolved.
Gareth
- Noted general problems due to the VOMS issues
JW
- NTR

AOB

Move to Zoom?
- No strong preference in either direction;
  - Noted that the additional (organisation) overhead on the Host may be the deciding factor.


- 10:00 → 10:20
  Status 20m
  - Outstanding tickets 10m
    
    Open ATLAS UK GGUS tickets
    
  - CPU 5m
    
    Monit: Site-oriented dashboard
    
    UK Cloud jobs over last week
    
  - Other new issues 5m
    
    New Pilot version misreporting corecount:
    - Affected scaling values used for accounting (e.g. wallclock and slots used)
    - Jobs killed incorrectly because of miscalculated mem-limit values.
    - Fix deployed yesterday
    
    Update to VOMS server yesterday introduced issues (UK significantly affected). (Issue with v2).
    - ATLAS uses both v2 and v3 in various places.
    - https://cern.service-now.com/service-portal?id=outage&n=OTG0059138
    
    Harvester_Central_B stopped submitting jobs this morning - under investigation
    
    All StoRM sites blacklisted since VOMS incident + pilot update:
    - "pilot, 1324: Service not available at the moment"
    
    GLASGOW:
    - CEPH_DATADISK no longer in TEST
    - PQ set offline for DPM queues
    
    QMUL:
    - Lustre migration: space reporting now ok?
    - https://monit-grafana.cern.ch/d/mHqFLAbik/wlcg-storage-space-accounting?from=now-7d&orgId=20&to=now&var-area=ATLASDATADISK&var-binning=1h&var-country=All&var-federation=All&var-groupby=vo&var-medium=Disk&var-service=All&var-site=UKI-LT2-QMUL&var-tier=All&var-vo=ALICE&var-vo=ATLAS&var-vo=LHCb
    
- 10:20 → 10:40
  Ongoing issues 20m
  - CentOS7 - Sussex 5m
    
    Centos 7 deployment Twiki
  - Grand Unified queues 5m
    
    ADCDPA-235
    
    Migration plans
  - TPC with http 20m
- 10:40 → 10:50
  News round-table 10m
  Sam: final commissioning of new cephcXX to be held off until VOMS / related issues are resolved.
- 10:50 → 11:00
  AOB 10m
  
  Zoom ?
