ATLAS UK Cloud Support

Europe/London
Zoom


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as the Ops meeting)

Outstanding tickets

  • 148968 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-10-30 09:32:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
    • Transfer efficiency still at 70%.
    • There may be some disk server issues.
      • Aggressive usage of space not helping on older servers
      • Discussion follows on long-term storage, model changes and future plans
  • 148342 UKI-SCOTGRID-GLASGOW less urgent reopened 2020-11-04 15:25:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Silent failures of xrdcp (reporting success) for certain files left on cache.
      • Files manually added back
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • Will be done as sequence of updates at site
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
    • NAT to be moved soon; should then be done.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-10-29 17:42:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Hardware is racked; networking still needed.
    • Peter spotted pilot issues; will follow up on the ticket.
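
For the silent xrdcp failures noted under 148342, one way to catch a copy that reports success but leaves a bad file is to recompute the checksum locally and compare it with the catalogue value. The sketch below is a minimal illustration only, assuming Adler-32 checksums; the function names and the `expected_hex` input are hypothetical and not taken from the ticket.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def copy_is_intact(path, expected_hex):
    """Compare a local copy with an expected checksum (hypothetical input)."""
    return adler32_hex(path) == expected_hex.strip().lower().zfill(8)
```

A file whose recomputed checksum does not match would then be a candidate for re-transfer rather than being trusted on the exit code alone.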

CPU

  • RAL

  • Northgrid

    • SHEF now 10-15 kHS06
  • London

  • SouthGrid

  • Scotgrid

    • GLA wants to bring CE 1 and 2 back online with an increasing number of worker nodes.
      • Sharing across the CEs currently becomes unbalanced.
    • ECDF is working on Volatile storage; will reduce the watermark level on primary storage shortly.

Other new issues

  • Pilot failures at RALPP

  • ECDF - High VMEM usage jobs

    • java inside singularity fixes
      • voms-proxy seems to take significant vmem
      • LANCS sets limit, which seems to work.
      • Action: To send details to Peter
  • Durham lost files

    • Some discrepancies between declaring files as lost and DPM cleaning. To follow up with a consistency check.
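
In outline, a consistency check of the kind mentioned for Durham compares what the catalogue believes is at the site with what the storage actually holds. The sketch below assumes two plain lists of file paths are already available (for example from a catalogue dump and a storage namespace dump); the function name and the "lost"/"dark" labels are illustrative assumptions, not the procedure agreed in the meeting.

```python
def compare_listings(catalogue_paths, storage_paths):
    """Return files missing from storage and files unknown to the catalogue."""
    catalogue = set(catalogue_paths)
    storage = set(storage_paths)
    return {
        # Recorded in the catalogue but absent from the storage dump.
        "lost": sorted(catalogue - storage),
        # Present in the storage dump but not recorded in the catalogue.
        "dark": sorted(storage - catalogue),
    }
```

Both dumps would need to be taken close together in time, since files written or deleted between the two snapshots show up as false discrepancies.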

Ongoing issues

  • CentOS7 - Sussex

    • As mentioned in GGUS
  • Grand Unified queues

    • With Sheffield running, will remove task.
  • ECDF

    • Already handles file loss
    • Rucio-aware cache.
    • Will reduce the watermark for the Volatile QOS storage
  • TPC:

    • DOMA-TPC meeting yesterday, discussing xrootd (including xrootd-ceph)
    • 1.14.2 won’t work with CMS currently
      • Needs xrootd to move to version 5.

News round-table

  • Vip
    • left early; NTR
  • Dan
    • Lots of single-core jobs from other VOs
  • Matt
    • DPM to upgrade
  • Peter
    • NTR
  • Alessandra
    • NTR
  • Sam
    • NTR
  • Gareth
    • SHEF may need some further tweaking.
  • Tim
    • NTR
  • JW
    • NTR

AOB

  • Will continue with Zoom.
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

      • Other new issues / tasks 5m

        Pilot failures at RALPP

        ECDF - High VMEM usage jobs
        - A small fraction of the single-core jobs at ECDF use a peak of >40 GB of vmem while running.

        Durham lost files

    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m