ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as OPs Mtg)

Outstanding tickets

  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-11-13 08:45:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Issue adding the new CE in AGIS/CRIC (see below)
    • Site to take the CE into downtime on Monday for a general cleanup
  • 148968 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-11-19 06:53:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
    • gridFTP restarted; looking better, but will keep an eye on it
    • Other, non-Lancs issues with Italian sites add a bit of confusion
      • Napoli: https is available only on LHCONE (via a certain IP version?), whereas gridFTP is available on non-LHCONE
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-12 17:24:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Sam to take a look at problem files
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • no update
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • no update

CPU

  • RAL

  • Northgrid

  • London

  • Server with occasional bad memory issue

    • To discuss with the manufacturer to attempt a proper fix (not just a BIOS update)
  • SouthGrid

  • Scotgrid

    • Durham: a priority user is taking all the priority, causing loss of ATLAS jobs.
      • Some job loss also from HC test failures caused by missing files
    • GLA: CVMFS update for CMS had some unintended consequences that caused problems
    • GLA: Bringing additional capacity online slowly; aiming for full capacity before Christmas.
      • Probable identification of high IOPS in the Ceph cluster from offsite xroot direct-access reads
        • User from CERN, accessing scratchdisk
        • Switch off to see if this solves the issue.

Other new issues

  • Recent Switcher problem with AGIS/CRIC sending many emails

    • Concerns raised on appropriate use of mailing lists.
    • ATLAS UK has cloud-support, uk comp operations, and uk comp users.
      • The comp users list has been unused for 5 years, and it was agreed to remove it.
      • Cloud support remains the most active discussion list and will be unchanged.
      • The comp operations list carries the daily summary and Switcher notifications. Non-automated traffic is on the order of 1 email per year, and those emails may have been intended for cloud support.
        • It was decided to keep the Switcher and daily summary in this list. A simple filter can remove any unwanted emails.
  • Queues:

    • Long-term queues that are not disabled, but not running production:
      • UK ANALY_MANC_TEST_SL7: Still needed
      • UK ANALY_QMUL_GPU_TEST: -> could be renamed to non test
      • RAL-LCG2_TEST: -> not actively used (see comment from Peter)
      • RAL-LCG2_UCORE: Can be disabled
      • UKI-NORTHGRID-LANCS-HEP_TEST (see comment from Peter)
      • UKI-NORTHGRID-MAN-HEP_TEST; testbed -> keep
      • UKI-SCOTGRID-GLASGOW_CEPH_TEST: keep
      • UKI-SOUTHGRID-OX-HEP_TEST: (see comment from Peter)
      • UKI-SOUTHGRID-SUSX_UCORE: not test, should become production, might want renaming
      • Peter uses TEST queues for dev test work monitoring
      • QM test queue might be useful

Ongoing issues

  • CentOS7 - Sussex

    • no update
  • Datadisk; watermark reduced.

  • LOCALGROUP disk

    • New pool to be created shortly
  • TPC:

    • Naples moved to DPM 1.14.2; networking blocks port 443 on IPv6, while IPv4 is open on the general network

    • Affects whole of UK (e.g. Lancs, MAN)

    • Retry transfer failures

    • Vulnerability in DPM and dCache

      • Believe all UK DPM sites are up to date (or not affected)
      • dCache issue announced in appropriate channels

News round-table

  • Vip
    • Had to leave before end; NTR
  • Dan
    • NTR
  • Matt
    • NTR; away for next week’s meeting
  • Peter
  • Alessandra
    • NTR
  • Gareth
    • The two CEs recently added to Glasgow will stay in downtime for the time being.
    • JW to check they are included correctly in CRIC / AGIS.
  • JW
    • NTR
  • Sam
    • Final talk available for Workshop.
    • Positive comments on updates to talk draft
      • Tables are now much better

AOB

NTR

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues / tasks 5m
        • Environment variable DQ2_LOCAL_SITE_ID unused for more than a year now.
          Now finally removed from client and pilot.
          If you have documentation/code-snippets/etc still using DQ2_LOCAL_SITE_ID, please rename it to RUCIO_LOCAL_SITE_ID
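
As an illustration of the rename requested above, the following is a minimal Python sketch, not part of any ATLAS or Rucio tooling. The helper names rename_site_id_variable and find_old_references are hypothetical; only the two variable names come from the item above. It finds remaining uses of DQ2_LOCAL_SITE_ID under a directory and rewrites text to use RUCIO_LOCAL_SITE_ID.

```python
import os
import re

# Old and new variable names, taken from the agenda item above.
OLD_NAME = "DQ2_LOCAL_SITE_ID"
NEW_NAME = "RUCIO_LOCAL_SITE_ID"

# Match the old name only as a whole identifier, so that a longer
# (hypothetical) name such as MY_DQ2_LOCAL_SITE_ID_BACKUP is left alone.
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])" + re.escape(OLD_NAME) + r"(?![A-Za-z0-9_])"
)


def rename_site_id_variable(text):
    """Return (new_text, count) with OLD_NAME replaced by NEW_NAME."""
    return _PATTERN.subn(NEW_NAME, text)


def find_old_references(root):
    """Yield (path, line_number, line) for lines under root that use OLD_NAME."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if _PATTERN.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
```

A cautious way to use this is to run find_old_references on a documentation or scripts directory first, review the reported lines, and only then apply rename_site_id_variable to the files that need it.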

        • Usage and membership of the various ATLAS mailing lists

          • cloud support
          • atlas uk comp operations: ~1 non-automated email / year
          • atlas uk comp users: last email 2015
      • Long-term offline sites 20m

        UK | ANALY_MANC_TEST_SL7 | manual | TEST | OnlyTest | 2020 01 28 NM keep TEST for pilot dev only | 293 | 2020-01-28T18:44:44.791487 | 2121-01-15T00:00:00
        UK | ANALY_QMUL_GPU_TEST | manual | TEST | False | Not working | 74 | 2020-09-04T13:06:28.452468 | 2021-09-04T12:06:27
        UK | RAL-LCG2_TEST | manual | TEST | OnlyTest | arc6 | 160 | 2020-06-10T11:25:40.228843 | 2021-04-06T09:25:40.208693
        UK | RAL-LCG2_UCORE | manual | OFFLINE | AutoExclusion | 2020 05 04 NM set OFFLINE for GU migration | 196 | 2020-05-04T17:21:43.052362 | 2121-01-15T00:00:00
        UK | UKI-NORTHGRID-LANCS-HEP_TEST | manual | TEST | AutoExclusion | Site.Test.Queue | 306 | 2020-01-15T15:34:01.319565 | 2099-06-07T12:00:00
        UK | UKI-NORTHGRID-MAN-HEP_TEST | manual | TEST | OnlyTest | Site.Test.Queue | 237 | 2020-03-24T16:25:19.972552 | 2120-01-01T00:00:00
        UK | UKI-SCOTGRID-GLASGOW_CEPH_TEST | manual | TEST | AutoExclusion | LetTestRun | 173 | 2020-05-27T16:14:32.949448 | 2021-03-23T14:14:32.936017
        UK | UKI-SOUTHGRID-OX-HEP_TEST | manual | TEST | OnlyTest | TEST | 8 | 2020-11-09T11:50:43.892209 | 2030-02-02T12:00:00
        UK | UKI-SOUTHGRID-SUSX_UCORE | manual | TEST | AutoExclusion | Site.Test.Queue | 246 | 2020-03-16T11:51:54.094506 | 2099-06-07T12:00:00


      • Enable CEs in PanDA queues 20m

        Adding CEs to the RALPP and Glasgow PanDA queues during the CRIC migration

    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m

    • 10:50 11:00
      AOB 10m
