
ATLAS UK Cloud Support



Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Password protected (same as (new) OPs Mtg)


Outstanding tickets

  • 150048 UKI-LT2-QMUL less urgent in progress 2021-01-12 17:28:00 Transfers and deletion at UKI-LT2-QMUL fails with “Connection reset by peer”
    • TRIUMF very far away; no perfSONAR to see exactly what's happening.
      • Different IP address space between SEs might be contributing?
      • Maybe related to a full link?
    • Additional comments from Duncan in Round Table.
  • 149842 UKI-SCOTGRID-ECDF very urgent in progress 2021-01-12 17:59:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Low deletion efficiency (many initial deletion requests).
    • JW - To test a few files to check there is no data inconsistency.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-05 12:35:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Important to find out exactly where in the chain it is failing.
    • Job shows executing status in the logs; it is evicted 2s later.
    • Condor history -> check for X's (removed), not C's (completed).
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-07 09:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • JW - Add Ceph localgroup disk as a proper resource in CRIC, connected to the site.
    • Local users to consider Ceph as the primary storage.
  • 146651 RAL-LCG2 urgent on hold 2020-12-18 10:48:00 singularity and user NS setup at RAL
    • Remains stuck behind updates further down the stack.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-12-18 09:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • 3 WNs in. Network switch remains a problem; aiming for solution by end of the month.
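
For the RALPP ticket above, the "check for X's not C's" step could be sketched as below. This is a minimal sketch, not the site's actual procedure: it assumes the input is whitespace-separated "ClusterId ProcId JobStatus" lines, such as might be produced by something like `condor_history -af ClusterId ProcId JobStatus`, and the sample job IDs are hypothetical. The helper name `tally_statuses` is also illustrative.

```python
from collections import Counter

# HTCondor JobStatus codes and the letters shown in the ST column.
# Only the common codes are mapped; anything else is reported as "?".
STATUS_LETTER = {1: "I", 2: "R", 3: "X", 4: "C", 5: "H"}


def tally_statuses(history_text):
    """Count jobs per status letter from 'ClusterId ProcId JobStatus' lines."""
    counts = Counter()
    for line in history_text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[2].isdigit():
            continue  # skip blank or malformed lines
        counts[STATUS_LETTER.get(int(fields[2]), "?")] += 1
    return counts


# Hypothetical sample data (assumption): three removed jobs, two completed.
SAMPLE = """\
1001 0 4
1002 0 3
1003 0 3
1004 0 4
1005 0 3
"""

counts = tally_statuses(SAMPLE)
print(f"removed (X): {counts['X']}, completed (C): {counts['C']}")
```

A high share of X relative to C would point at jobs being removed or evicted rather than finishing normally.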


CPU

  • RAL

    • Ok, recent additional slots from CMS (which is now recovering)
  • Northgrid

    • LANCS; gridFTP stuck yesterday, needed restarting; jobs recovered.
  • London

    • QMUL; largely recovered; work ongoing.
  • SouthGrid

    • OX; jobs drained overnight; no conclusive reason found; possibilities are: assigned jobs requiring input from tape; potential bug in FTS with transfers.
  • Scotgrid

    • Glasgow; inefficient transfers, i.e. very high rate of input of very small files. Maxed out gridFTP connections.

Other new issues

Ongoing issues

  • CentOS7 - Sussex

    • 3 nodes currently in; continue to have issues with network switch infrastructure.
  • TPC with http

    • To follow up with Sam regarding http for FT in ATLAS on Glasgow test endpoint.
    • Some issues in the past with rate limiting, and protections from ingenious users.
    • From Gareth; a discussion with ATLAS on write-backs and data access patterns wrt. Ceph would be useful.
    • Alessandra keen to move internal LAN transfers away from gridFTP.
    • Test queue at Glasgow, currently pointing to old XrdCeph; Sam to update c02 to latest version.
  • Storageless Site test / storage decommissioning (Oxford)

    • Oxford Jira for decommissioning now set up.
    • Will need to wait for Glasgow decommissioning to complete.
  • ECDF volatile storage

    • JW to start actions from the Jira.
  • Glasgow DPM Decommissioning

    • Ongoing; final part most difficult due to the problems of last year
  • ATLAS: Site Availability/Reliability reports: Glasgow

    • Push for VOFeed to CRIC; expected timescale being sought.

News round-table

  • Vip
    • Needed to leave
  • Dan
    • Needed to go to the next meeting. Plans: install 2nd gridFTP node; update WNs to latest OS/Lustre/Slurm on drained nodes; maintain stability otherwise.
  • Matt
    • Disk servers continuing to need attention (e.g. weighting issues).
  • Peter
    • Had to leave
  • Sam
    • NTR
  • Gareth
    • Q/R needed
  • JW
    • NTR
  • Duncan
    • QMUL -> TRIUMF; transfers were OK from 16:00 to 02:00.
    • Routes: until 16:00 yesterday UK routes went via London, then via GEANT/Amsterdam.
    • Traceroute data: failing via London; working via Amsterdam.
    • Can it be IPv6 / routing / QMUL config related?
    • perfSONAR would certainly help diagnose such cases.
  • Patrick
    • NTR
  • Rob
    • NTR
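
Duncan's comparison of the failing (London) and working (Amsterdam) routes can be illustrated with a small sketch that finds the first hop at which two traceroute paths differ. The hop labels below are hypothetical placeholders, not data from the meeting, and `first_divergence` is an illustrative helper name.

```python
def first_divergence(path_a, path_b):
    """Return the index of the first hop where two paths differ, or None if identical."""
    for i, (hop_a, hop_b) in enumerate(zip(path_a, path_b)):
        if hop_a != hop_b:
            return i
    if len(path_a) != len(path_b):
        # One path is a strict prefix of the other.
        return min(len(path_a), len(path_b))
    return None


# Hypothetical hop labels (assumption): real traceroute output would give IPs or hostnames.
via_london = ["qmul-gw", "janet-core", "london-exchange", "transatlantic-a", "triumf-border"]
via_amsterdam = ["qmul-gw", "janet-core", "geant-amsterdam", "transatlantic-b", "triumf-border"]

idx = first_divergence(via_london, via_amsterdam)
print(f"paths diverge at hop {idx}: {via_london[idx]} vs {via_amsterdam[idx]}")
```

Knowing the first diverging hop narrows down where the failing route's problem could lie, although it cannot replace perfSONAR measurements along the path.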


    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m



      • CPU 5m

        New link for the site-oriented dashboard




      • Other new issues / tasks 5m
    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m