ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-09 10:48:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy

    • Routing issue between sides of the DC; attempting some static routing, but will need physical access to finally resolve
  • 147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-08 07:43:00 Deletion errors at UKI-SCOTGRID-GLASGOW

    • Check: if the lost files are not in our DPM DB, do they need to be removed on the ATLAS side?
    • Tricky to delete multiple replicas; risk of deleting the whole object, not just the copy on disk039.
    • JW to ask the DDM Ops people.
  • 146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-09 10:03:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy

    • Still with other pressing priorities
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-07 16:11:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Problematic xroot with DPM; plan is still to upgrade to CentOS 7.
  • 146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL

    • In todo list
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-06 23:57:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • Queue set to TEST, progress being made
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP

    • On hold
  • 145510 RAL-LCG2 urgent on hold 2020-05-13 13:07:00 RAL-LCG2: timeouts on stage-in/outs

    • To close -> DirectIO comparisons
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • As previously; the HW needs to be changed.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Updated; physical access required to manage migration

CPU

Pledge lines - only visible in 30-day mode now

Other new issues

Ongoing issues

  • CentOS7 DPM Lancs

    • NTR
  • CentOS7 - Sussex

    • As mentioned above
  • Glasgow Ceph storage

    • Not BW problems - unlikely in the Ceph cluster
    • Problem seems to be in the disk cache. If there is a problem getting a file, does the cache store the truncated file? Is it hence poisoned by the corrupt copies?
    • Using xrootd 4.12, compiled.
      • Perhaps try 4.11? (Can use the exact version that RAL uses.)
    • Ceph itself appears more stable after configuration changes
  • Grand Unified queues

    • Awaiting SHEF
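
A minimal sketch of how the truncated-copy question under "Glasgow Ceph storage" could be checked by hand, assuming the expected size and Adler-32 checksum of a file are known (for example from the ATLAS catalogue). The function names and the idea of comparing against catalogue values are illustrative assumptions, not a description of how the xrootd cache behaves.

```python
import zlib
from pathlib import Path

CHUNK = 1024 * 1024  # read in 1 MiB pieces so large files are not loaded whole


def adler32_hex(path):
    """Return the Adler-32 of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starts from 1
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def cached_copy_ok(path, expected_size, expected_adler32):
    """Hypothetical helper: True only if the cached copy matches the
    expected size and checksum (both assumed to come from the catalogue)."""
    p = Path(path)
    if not p.is_file():
        return False
    # A size mismatch catches a truncated copy without reading the data.
    if p.stat().st_size != expected_size:
        return False
    return adler32_hex(p) == expected_adler32.lower()
```

The size comparison comes first because it is cheap and already identifies a truncated copy; the checksum is only computed when the size matches.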

News round-table

  • Vip

    • NTR
  • Dan

    • Migration to CentOS7 for several services in progress
  • Matt

    • NTR
  • Peter

    • School closures continue to interrupt normal work
  • Alessandra

    • DPM 1.14 in testing; needed for TPC tests in production; contains puppet and memory libraries (to avoid full mem)
      • Petr, RAL off RAL-FTS (on to CERN), to have the TPC capabilities
  • Sam

    • NTR
  • Gareth

    • NTR
  • Tim

    • TPC: test transfers (xrootd) have checksum issues; too slow for the stress test. Can it be improved by checksumming close to the storage?
      • Can also reduce the number of simultaneous connections?
    • Petr pushing to look at HTTP (may be the eventual preferred protocol)
    • Current issues are with the xrootd server, not the protocol
  • JW

    • NTR
    • 10:00 10:20
      Status 20m
    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m