
ATLAS UK Cloud Support



Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Meeting to be held via Zoom (
Password protected (same as OPs Mtg)

Outstanding tickets

  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-11-13 08:45:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Issue in adding new CE with AGIS/CRIC (see below)
    • Site to take the CE into downtime on Monday for general cleanup
  • 148968 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-11-19 06:53:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
    • gridFTP restarted; looking better, but will keep an eye on it
    • Other, non-Lancs issues with Italian sites add some confusion
      • Napoli issue: https is available only on LHCONE (via certain IPvX?), whereas
      • gridFTP is available on non-LHCONE
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-12 17:24:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Sam to take a look at problem files
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • no update
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • no update

CPU


  • RAL

  • Northgrid

  • London

  • Server with occasional bad memory issue

    • To discuss with manufacturer to attempt a proper fix (not just a BIOS update)
  • SouthGrid

  • Scotgrid

    • Durham: a priority user takes all the priority, causing loss of ATLAS jobs.
      • Some job loss also from HC test failures caused by missing files
    • GLA: CVMFS update for CMS; some unintended consequences caused problems
    • GLA: Bringing additional capacity online slowly; aim is full capacity before Christmas.
      • High IOPS in the Ceph cluster probably identified as coming from offsite xroot direct-access reads
        • User from CERN, accessing scratchdisk
        • Switch off to see if this solves the issue.

Other new issues

  • Recent Switcher problem with AGIS/CRIC sending many emails

    • Concerns raised on appropriate use of mailing lists.
    • ATLAS UK has cloud-support, UK comp operations, and UK comp users.
      • The comp users list has been unused for 5 years, and it was agreed to remove it.
      • Cloud support remains the most active discussion list and will be unchanged.
      • The comp operations list carries the daily summary and Switcher notifications. Non-automated traffic is on the order of 1 email per year, and even that may have been intended for cloud support.
        • It was decided to keep the Switcher and daily summary in this list. A simple filter can remove any unwanted emails.
  • Queues:

    • Long-term queues that are not disabled, but not running production:
      • UK ANALY_MANC_TEST_SL7: Still needed
      • UK ANALY_QMUL_GPU_TEST: -> could be renamed to non test
      • RAL-LCG2_TEST: -> not actively used (see comment from Peter)
      • RAL-LCG2_UCORE: Can be disabled
      • UKI-NORTHGRID-LANCS-HEP_TEST (see comment from Peter)
      • UKI-NORTHGRID-MAN-HEP_TEST; testbed -> keep
      • UKI-SOUTHGRID-OX-HEP_TEST: (see comment from Peter)
      • UKI-SOUTHGRID-SUSX_UCORE: not test, should become production, might want remaining
      • Peter uses TEST queues for dev test work monitoring
      • QM test queue might be useful
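
The "simple filter" suggested above for the comp operations list could be sketched roughly as below. This is only an illustration: the subject markers `switcher` and `daily summary` are assumptions about how the automated mails are labelled (the minutes do not record the actual subject lines), and `is_automated` and `keep_for_reading` are hypothetical helper names.

```python
from email import message_from_string
from email.message import Message

# Assumed subject markers for the automated traffic (not confirmed by the minutes).
AUTOMATED_MARKERS = ("switcher", "daily summary")


def is_automated(msg: Message) -> bool:
    """Return True if the subject looks like an automated notification."""
    subject = (msg.get("Subject") or "").lower()
    return any(marker in subject for marker in AUTOMATED_MARKERS)


def keep_for_reading(raw_messages):
    """Parse raw messages and keep only those not flagged as automated."""
    parsed = (message_from_string(raw) for raw in raw_messages)
    return [msg for msg in parsed if not is_automated(msg)]
```

In practice the same rule would more likely be expressed in the mail client's own filter settings; the sketch only shows that a subject match is enough to separate the two kinds of traffic.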

Ongoing issues

  • CentOS7 - Sussex

    • no update
  • Datadisk; watermark reduced.

  • LOCALGROUP disk

    • New pool to be created shortly
  • TPC:

    • Naples -> moved to DPM 1.14.2; networking blocked 443 on IPv6, IPv4 open on general network

    • Affects whole of UK (e.g. Lancs, MAN)

    • Retry transfer failures

    • Vulnerability in DPM and dCache

      • Believe all UK DPM sites are up to date (or not affected)
      • dCache issue announced in appropriate channels
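
The Naples TPC item above describes port 443 behaving differently over IPv6 and IPv4. A minimal way to check that kind of split from a UK site is sketched below, using only the Python standard library. The function names `can_connect` and `probe_both` are hypothetical, and a plain TCP connect says nothing about LHCONE routing or about whether the https/TPC endpoint itself is healthy.

```python
import socket


def can_connect(host: str, port: int, family: socket.AddressFamily,
                timeout: float = 5.0) -> bool:
    """Try a TCP connection to host:port using only the given address family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # the host has no address of this family
    for fam, socktype, proto, _canon, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue  # blocked, refused or timed out; try the next address
    return False


def probe_both(host: str, port: int = 443) -> dict:
    """Report reachability of host:port separately over IPv4 and IPv6."""
    return {
        "ipv4": can_connect(host, port, socket.AF_INET),
        "ipv6": can_connect(host, port, socket.AF_INET6),
    }
```

Calling `probe_both` with a storage endpoint's host name and getting IPv4 reachable but IPv6 not would be consistent with the pattern noted for Naples.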

News round-table

  • Vip
    • Had to leave before end; NTR
  • Dan
    • NTR
  • Matt
    • NTR; away for next week’s meeting
  • Peter
  • Alessandra
    • NTR
  • Gareth
    • The two CEs recently added to Glasgow will stay in downtime for the time being.
    • JW to check they are included correctly in CRIC / AGIS.
  • JW
    • NTR
  • Sam
    • Final talk available for Workshop.
    • Positive comments on updates to talk draft
      • Tables are now much better



    • 1
      • a) Outstanding tickets


      • b) CPU
      • c) Other new issues / tasks
        • Environment variable DQ2_LOCAL_SITE_ID unused for more than a year now.
          Now finally removed from client and pilot.
          If you have documentation/code-snippets/etc still using DQ2_LOCAL_SITE_ID, please rename it to RUCIO_LOCAL_SITE_ID

        • Usage and membership of the various atlas mailing lists

        • cloud support,
        • atlas uk comp operations, ~ 1 non-automated email / year
        • atlas uk comp users : last email 2015
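
For the DQ2_LOCAL_SITE_ID item above, a small helper along the following lines could update leftover documentation or code snippets. It is a sketch, not an official migration tool: the function names are hypothetical, and it assumes a whole-word textual replacement is all that is needed.

```python
import os
import re

LEGACY_NAME = "DQ2_LOCAL_SITE_ID"
CURRENT_NAME = "RUCIO_LOCAL_SITE_ID"

# Match the legacy name only as a whole identifier, not as part of a longer one.
_LEGACY_RE = re.compile(r"\b" + re.escape(LEGACY_NAME) + r"\b")


def rename_legacy_variable(text: str) -> str:
    """Replace every whole-word DQ2_LOCAL_SITE_ID with RUCIO_LOCAL_SITE_ID."""
    return _LEGACY_RE.sub(CURRENT_NAME, text)


def local_site_id(environ=os.environ):
    """Read the site ID from the current variable name only."""
    return environ.get(CURRENT_NAME)
```

For example, `rename_legacy_variable("export DQ2_LOCAL_SITE_ID=SOME-SITE")` gives `"export RUCIO_LOCAL_SITE_ID=SOME-SITE"`, where `SOME-SITE` is a placeholder value.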
      • d) Long-term offline sites

        UK | ANALY_MANC_TEST_SL7            | manual | TEST    | OnlyTest      | 2020 01 28 NM keep TEST for pilot dev only     | 293 | 2020-01-28T18:44:44.791487 | 2121-01-15T00:00:00
        UK | ANALY_QMUL_GPU_TEST            | manual | TEST    | False         | Not working                                    | 74  | 2020-09-04T13:06:28.452468 | 2021-09-04T12:06:27
        UK | RAL-LCG2_TEST                  | manual | TEST    | OnlyTest      | arc6                                           | 160 | 2020-06-10T11:25:40.228843 | 2021-04-06T09:25:40.208693
        UK | RAL-LCG2_UCORE                 | manual | OFFLINE | AutoExclusion | 2020 05 04 NM set OFFLINE for GU migration     | 196 | 2020-05-04T17:21:43.052362 | 2121-01-15T00:00:00
        UK | UKI-NORTHGRID-LANCS-HEP_TEST   | manual | TEST    | AutoExclusion | Site.Test.Queue                                | 306 | 2020-01-15T15:34:01.319565 | 2099-06-07T12:00:00
        UK | UKI-NORTHGRID-MAN-HEP_TEST     | manual | TEST    | OnlyTest      | Site.Test.Queue                                | 237 | 2020-03-24T16:25:19.972552 | 2120-01-01T00:00:00
        UK | UKI-SCOTGRID-GLASGOW_CEPH_TEST | manual | TEST    | AutoExclusion | LetTestRun                                     | 173 | 2020-05-27T16:14:32.949448 | 2021-03-23T14:14:32.936017
        UK | UKI-SOUTHGRID-OX-HEP_TEST      | manual | TEST    | OnlyTest      | TEST                                           | 8   | 2020-11-09T11:50:43.892209 | 2030-02-02T12:00:00
        UK | UKI-SOUTHGRID-SUSX_UCORE       | manual | TEST    | AutoExclusion | Site.Test.Queue                                | 246 | 2020-03-16T11:51:54.094506 | 2099-06-07T12:00:00




      • e) Enabling CEs in Panda queues

        Adding CEs to the RALPP and Glasgow Panda queues during CRIC migration

    • 2
      Ongoing Items
      • LOCALGROUP disk

        • New pool to be created shortly
    • 3
      News round-table

    • 4
