ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)) , James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same password as the new Ops meeting)

Videoconference room: ATLAS UK Cloud Support (Zoom Meeting ID 98434450232; host James William Walder)

● Outstanding tickets

  • 150308 UKI-SCOTGRID-DURHAM less urgent in progress 2021-01-22 22:21:00 Jobs at UKI-SCOTGRID-DURHAM_SL7_UCORE fail with “Server error: no such file or directory”
  • 150304 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2021-01-25 17:24:00 MWT2: low transfer efficiency as dest. from UKI-NORTHGRID-LANCS-HEP
    • Script finished; 2/3 files on server are corrupted …
    • Prepare a list of the 400k files; aim to simply declare them lost.
    • Can ZFS report bad files, e.g. via scrubbing? First pass is metadata consistency; second pass does checksum-level checking.
    • A simple script to probe each file might be the best approach.
    • Recovery of data highly unlikely.
    • Matt to investigate other servers once this list is prepared.
  • 149842 UKI-SCOTGRID-ECDF very urgent waiting for reply 2021-01-26 14:29:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Similar issue with lost files. Needs DDM support to declare files as lost; JW to follow up.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-28 09:45:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Need to check the Ansible scripts to see whether something is affecting them.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-23 06:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Largely a placeholder ticket to track the decommissioning; Sam / JW to update the associated Jira ticket to probe its status.
  • 146651 RAL-LCG2 urgent on hold 2021-01-19 10:05:00 singularity and user NS setup at RAL
    • Increasing number of requests from other VOs on the status of the upgrade.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2021-01-20 20:29:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • On hold pending changes to the immediate situation.
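
A minimal sketch of the "simple script to probe each file" idea noted under ticket 150304, assuming the storage is visible as an ordinary POSIX directory tree and that Adler-32 is the checksum to compare against. The function names and the optional `expected` mapping (path to checksum, e.g. taken from a catalogue dump) are hypothetical and only for illustration; this is not the script that was actually run at the site.

```python
import os
import zlib


def adler32_of(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 hex digits, reading in chunks."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def probe_tree(root, expected=None):
    """Walk root and classify every file as ok, unreadable or mismatch.

    expected is an optional, hypothetical {path: adler32 hex} mapping.
    Files without an expected checksum count as ok if they can be read.
    """
    expected = expected or {}
    report = {"ok": [], "unreadable": [], "mismatch": []}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                checksum = adler32_of(path)
            except OSError:
                # Read errors (e.g. I/O errors from a damaged pool) end up here.
                report["unreadable"].append(path)
                continue
            if path in expected and expected[path] != checksum:
                report["mismatch"].append(path)
            else:
                report["ok"].append(path)
    return report
```

The `unreadable` and `mismatch` lists together would form the candidate list of files to declare as lost; reading every byte of every file is slow, but it does not rely on the filesystem reporting the damage itself.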

● CPU

  • RAL
    • Recovery from CE problem at start of week. Some loss of jobs during the week - not well understood.
  • Northgrid
    • Lancs: File system related issues.
  • London
    • QMUL: Main college link down; on backup.
      • (Update: Set to offline on Friday until QMUL IT decide the network is stable)
  • SouthGrid
    • RALPP: back online.
    • BHAM: issues (external reverse lookup on the grid subnet switched off by IT).
    • CAM: also offline (but it seems clear this is unrelated to the BHAM issue).
  • Scotgrid
    • Possible brokerage issues seen; to follow up by identifying how jobs are (not) assigned to Glasgow.
    • ECDF and Durham: affected by file-related issues (followed up in GGUS).

● Ongoing Items

  • CentOS7 - Sussex

    • NTR
  • TPC with http

    • NTR
  • Storageless Site test / storage decommissioning (Oxford)

    • Gateway to point at: JW
    • JW: test gateway
  • ECDF volatile storage

    • JW to prod DDM experts on how they wish to commission the RSEs.
  • Glasgow DPM Decommissioning

    • Sam to prod ticket.
  • ATLAS: Site Availability/Reliability reports: Glasgow

    • JW to try and move / access a timeline

● News round-table

  • Vip
    • NTR
  • Dan
    • Network problems anticipated.
  • Matt
    • NTR
  • Peter
    • NTR
  • Sam
    • NTR
  • Gareth
    • NTR
  • JW
    • NTR
  • Rob
    • NTR
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard

      • Other new issues / tasks 5m
    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m