ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

Videoconference
ATLAS UK Cloud Support
Zoom Meeting ID
98434450232
Host
James William Walder

● Outstanding tickets

  • 150308 UKI-SCOTGRID-DURHAM less urgent in progress 2021-02-25 09:32:00 Jobs at UKI-SCOTGRID-DURHAM_SL7_UCORE fail with “Server error: no such file or directory”
    • Additional files missing and declared lost. Ticket now closed
  • 149842 UKI-SCOTGRID-ECDF very urgent waiting for reply 2021-02-22 11:55:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Macaroons issue: a patch to the Apache config had failed to apply.
    • With this resolved, the situation looks much improved. Monitor for a while; the site can then hopefully close the ticket.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-02-18 20:00:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Remains stalled. Some discussion in the round-table below; needs detailed site diagnosis to understand more.
  • 146651 RAL-LCG2 urgent on hold 2021-02-16 17:37:00 singularity and user NS setup at RAL
    • Remains on hold while pre-steps are completed.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2021-01-20 20:29:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • 2 network switches faulty; to be replaced shortly. Aim to get remaining nodes updated after that.

● CPU

  • RAL
    • High numbers of jobs at RAL, as LHCb is running low.
  • Northgrid
    • Sheffield offline for Downtime, but slowly coming back.
    • Lancs: best two weeks for some time.
  • London
    • QMUL: residual data transfer issues for a few sites.
  • SouthGrid
    • BHAM and CAM remain offline - VAC problems?
  • Scotgrid
    • Durham: Argus problems resulted in an outage.

● Ongoing Items

  • CentOS7 - Sussex
    • see ticket
  • TPC with http
    • Alessandra aims to move the latest set of UK T2s to http today.
  • Storageless Site test / storage decommissioning (Oxford)
    • Aim to complete / test Xcache today. If successful, move towards ATLAS configuration and testing.
  • ECDF volatile storage
    • Jira updated; requires the site to reconfigure the new DPM to have an atlasvolatiledisk, rather than the atlasdatadisk currently envisaged.
  • Glasgow DPM Decommissioning
    • Awaiting feedback from DDM ops
  • ATLAS: Site Availability/Reliability reports: Glasgow
    • Alessandra hopes to update ticket if it can be progressed.

● News round-table

  • Vip

    • Noted that GocDB search brings up the pre-prod instance (which is out-of-date, and has no warning that it is pre-production).
  • Dan

    • NTR
  • Matt

    • Last 2 weeks very good.
    • Some storage updates, e.g. setting old servers to read-only.
    • DPM has settled down, with no specific overloaded servers.
    • Running largely full simulation at the moment.
    • Worries about new servers, e.g. overloading. How to preload?
  • Peter

    • Discussion on how to proceed with the RALPP issue (above):
    • Possible RTE / Puppet interactions;
    • Gareth suggests that the submission must have made it to the batch farm.
  • Alessandra

    • To update the next tranche of UK sites to http.
  • Sam

    • To present Xcache activity to PMB at an upcoming meeting.
  • Gareth

    • NTR
  • JW

    • NTR
  • Duncan

    • NTR
  • Patrick

    • NTR
  • Rob

    • Transparent Xcache working at ECDF and reducing the number of connections.

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard

      • Other new issues / tasks 5m
    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m