ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

Videoconference: ATLAS UK Cloud Support
Zoom Meeting ID: 98434450232
Host: James William Walder

● Outstanding tickets

  • 150912 | UKI-NORTHGRID-MAN-HEP | less urgent | assigned | 2021-03-10 19:28:00 | UKI-NORTHGRID-MAN-HEP: TRANSFER Transfer canceled because the gsiftp performance marker timeout
    • No site update; JW to follow up with the site.
  • 150896 | UKI-LT2-QMUL | very urgent | assigned | 2021-03-10 09:28:00 | UKI-LT2-QMUL: Sudden appearance of dark data
    • Dark data appeared at the site after deletions.
    • Reported space values became nonsensical and did not change as deletions proceeded; a proper fix is needed, but a working fix is in place.
    • Dark data to be removed, hopefully with a consistency check.
    • The latest StoRM version should provide an automated ‘du’.
  • 150820 | UKI-LT2-RHUL | less urgent | waiting for reply | 2021-03-11 09:33:00 | UKI-LT2-RHUL: 0% Transfer and deletion efficiencies
    • File list declared lost; some further files still to be declared lost.
    • Permission denied errors in transfers.
  • 149362 | UKI-SOUTHGRID-RALPP | urgent | in progress | 2021-02-18 20:00:00 | ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Still no progress.
    • JW to attempt to move over to aCT.
  • 146651 | RAL-LCG2 | urgent | on hold | 2021-02-16 17:37:00 | singularity and user NS setup at RAL
    • On hold.
  • 142329 | UKI-SOUTHGRID-SUSX | top priority | on hold | 2021-01-20 20:29:00 | CentOS7 migration UKI-SOUTHGRID-SUSX
    • Running pilot jobs as of 8pm yesterday.
    • Possible update to provide memory limits:
      • Lancaster needed a runtime script to allow for minimum memory requirements.
      • Matt to provide instructions.
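For ticket 150896, the consistency check mentioned above can be thought of as a comparison between a storage dump and a catalogue dump. The following is only a minimal sketch of that idea, not the procedure used at the site: the function name and the example paths are hypothetical, and it ignores files created or deleted between the two dumps, which a real check has to allow for.

```python
def compare_dumps(storage_paths, catalogue_paths):
    """Return (dark, lost) as sorted lists of paths.

    dark: present on storage but unknown to the catalogue.
    lost: listed in the catalogue but missing from storage.
    """
    on_storage = set(storage_paths)
    in_catalogue = set(catalogue_paths)
    dark = sorted(on_storage - in_catalogue)
    lost = sorted(in_catalogue - on_storage)
    return dark, lost


# Hypothetical example paths, for illustration only.
storage = ["/atlas/a.root", "/atlas/b.root", "/atlas/orphan.root"]
catalogue = ["/atlas/a.root", "/atlas/b.root", "/atlas/missing.root"]

dark, lost = compare_dumps(storage, catalogue)
print("dark:", dark)
print("lost:", lost)
```

Files in the dark list are candidates for removal; files in the lost list are candidates for being declared lost, as in ticket 150820.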

● CPU

  • RAL

    • Still attempting to get more single-core user analysis jobs
    • Fairshare cap release on 100% for ATLAS and CMS
    • Two tranches drained for kernel updates; now coming back to accept jobs
  • Northgrid

    • LANCS: local user retries
      • Expecting several new servers after Easter
  • London

  • SouthGrid

  • Scotgrid


● Other new issues / tasks

  • VAC: BHAM is not running ATLAS jobs, although the site issues should have been resolved.
    • JW to prod the Harvester support list if there is no active response.
  • Will ATLAS still want to support VAC?
  • Understand why it is broken now, then decide what to do.
    • This has implications for how BHAM may wish to run the site.

 

 


● Ongoing Items

  • CentOS7 - Sussex

    • Should be near production readiness
  • TPC with http

    • No update; xrootd 5.1.1 is available
    • Sam to update Glasgow TPC gateway.
  • Storageless Site test / storage decommissioning (Oxford)

    • RAL side complete; OX to finalise configuration, then ATLAS side.
  • ECDF volatile storage

    • Process on ATLAS side; however, it appears that the endpoint name has changed?
  • Glasgow DPM Decommissioning

    • Sam to update Jira with downtime notification
  • ATLAS: Site Availability/Reliability reports: Glasgow

    • Moving forward - not yet resolved

 


 


● News round-table

  • Dan
    • Expect downtime for SE switch, before Easter
  • Matt
    • NTR
  • Peter
    • Mentioned the datacentre fire in France (inside self-contained containers).
    • A warning that data loss can happen in the cloud …
  • Sam
    • NTR
  • JW
    • TPC work sidelined for VectorRead support
  • Patrick
    • NTR
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard


      • Other new issues / tasks 5m

        UKI-SOUTHGRID-BHAM-HEP_MCOREVAC
        From apfmon (http://apfmon.lancs.ac.uk/q/UKI-SOUTHGRID-BHAM-HEP_MCOREVAC):

        000 (642776.084.000) 03/09 10:34:06 Job submitted from host: <...&noUDP&sock=1160960_92ef_3>
        ...
        027 (642776.084.000) 03/09 10:34:32 Job submitted to grid resource
        GridResource: condor aipanda025.cern.ch aipanda025.cern.ch:20615?sock=collector
        GridJobId: condor aipanda025.cern.ch aipanda025.cern.ch:20615?sock=collector 481075.0

        From the VAC side, the information reported is:

        01/28/21 07:43:47 Initial update sent to collector(s)
        01/28/21 07:43:47 Sending DC_SET_READY message to master <...?addrs=...>
        01/28/21 07:43:47 SECMAN: FAILED: Received "DENIED" from server for user atlpan@cern.ch using method GSI.
        01/28/21 07:43:47 ERROR: SECMAN:2010:Received "DENIED" from server for user atlpan@cern.ch using method GSI.
        01/28/21 07:43:47 Failed to start non-blocking update to <....>.
        01/28/21 07:44:09 State change: benchmarks completed
        01/28/21 07:44:12 SECMAN: FAILED: Received "DENIED" from server for user atlpan@cern.ch using method GSI.
        01/28/21 07:44:12 ERROR: SECMAN:2010:Received "DENIED" from server for user atlpan@cern.ch using method GSI.
        01/28/21 07:44:12 Failed to start non-blocking update to <...>.
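To gauge how often the VAC side is being refused, the SECMAN lines above can be counted per user and authentication method. This is a rough sketch that assumes the log lines keep exactly the format shown in the excerpt; the function name is hypothetical, and the sample lines are copied from the excerpt rather than read from a real log file.

```python
import re
from collections import Counter

# Matches lines of the form shown in the excerpt, e.g.
# 01/28/21 07:43:47 SECMAN: FAILED: Received "DENIED" from server for user X using method GSI.
DENIED_RE = re.compile(
    r'^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) '
    r'SECMAN: FAILED: Received "DENIED" from server '
    r'for user (?P<user>\S+) using method (?P<method>\w+)\.'
)


def count_denied(lines):
    """Count 'SECMAN: FAILED ... DENIED' lines per (user, method)."""
    counts = Counter()
    for line in lines:
        match = DENIED_RE.match(line.strip())
        if match:
            counts[(match.group("user"), match.group("method"))] += 1
    return counts


sample = [
    '01/28/21 07:43:47 Initial update sent to collector(s)',
    '01/28/21 07:43:47 SECMAN: FAILED: Received "DENIED" from server for user atlpan@cern.ch using method GSI.',
    '01/28/21 07:43:47 ERROR: SECMAN:2010:Received "DENIED" from server for user atlpan@cern.ch using method GSI.',
    '01/28/21 07:44:09 State change: benchmarks completed',
    '01/28/21 07:44:12 SECMAN: FAILED: Received "DENIED" from server for user atlpan@cern.ch using method GSI.',
]

counts = count_denied(sample)
for (user, method), n in counts.items():
    print(user, method, n)
```

Only the `SECMAN: FAILED` lines are counted; the paired `ERROR: SECMAN:2010` lines report the same refusal and are skipped to avoid double counting.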


    • 10:20 10:40
      Ongoing Items 20m

    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m