ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-09 10:48:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy

    • Routing issue between the two sides of the DC; will attempt some static routing, but physical access will be needed to finally resolve it
  • 147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-08 07:43:00 Deletion errors at UKI-SCOTGRID-GLASGOW

    • Check: if the lost files are not in our DPM DB, do they need to be removed on the ATLAS side?
    • Tricky to delete multiple replicas; risk of deleting the whole object, not just the copy on disk039.
    • JW to ask the DDM Ops people.
  • 146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-09 10:03:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy

    • Still with other pressing priorities
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-07 16:11:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Problematic xroot with DPM; plan is still to upgrade to CentOS 7.
  • 146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL

    • In todo list
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-06 23:57:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • Queue set to TEST, progress being made
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP

    • On hold
  • 145510 RAL-LCG2 urgent on hold 2020-05-13 13:07:00 RAL-LCG2: timeouts on stage-in/outs

    • To close -> DirectIO comparisons
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • As previously; the HW needs to be changed.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Updated; physical access required to manage migration

CPU

Pledge lines - only visible in 30-day mode now

Other new issues

Ongoing issues

  • CentOS7 DPM Lancs

    • NTR
  • CentOS7 - Sussex

    • As mentioned above
  • Glasgow Ceph storage

    • Not BW problems; unlikely to be in the Ceph cluster
    • Problem seems to be in the disk cache. If there is a problem getting a file, will it store the truncated file? Hence the cache is poisoned by the corrupt copies?
    • Using xrootd 4.12, compiled.
      • Perhaps try 4.11? (Can use the exact version that RAL uses)
    • Ceph itself appears more stable after configuration changes
  • Grand Unified queues

    • Awaiting SHEF

News round-table

  • Vip

    • NTR
  • Dan

    • Migration to CentOS7 for several services in progress
  • Matt

    • NTR
  • Peter

    • School closures continue to interrupt normal work
  • Alessandra

    • DPM 1.14 in testing; needed for TPC tests in production; contains Puppet and memory libraries (to avoid filling memory)
      • Petr, RAL off RAL-FTS (on to CERN), to have the TPC capabilities
  • Sam

    • NTR
  • Gareth

    • NTR
  • Tim

    • TPC: transfers (xrootd) under test have checksum issues: too slow for the stress test. Can it be improved by checksumming close to the storage?
      • Can also reduce the number of simultaneous connections?
    • Petr pushing to look at HTTP (may be the eventual preferred protocol)
    • Current issues are with the xrootd server, not the protocol
  • JW

    • NTR