ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as (new) OPs Mtg)

Videoconference: ATLAS UK Cloud Support
Zoom Meeting ID: 98434450232
Host: James William Walder

Outstanding tickets

  • 150048 UKI-LT2-QMUL less urgent in progress 2021-01-12 17:28:00 Transfers and deletion at UKI-LT2-QMUL fails with “Connection reset by peer”
    • TRIUMF is very far away; no perfSONAR to see exactly what's happening.
      • Different IP address space between SEs might be contributing?
      • Maybe related to full link connections?
    • Additional comments from Duncan in Round Table.
  • 149842 UKI-SCOTGRID-ECDF very urgent in progress 2021-01-12 17:59:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • low deletion efficiency (many initial deletions requests)
    • JW - To test a few files to check that there is no data inconsistency.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-05 12:35:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Important to find out exactly where in the chain it is failing.
    • Job shows executing status in the logs, then is evicted 2 s later.
    • Condor history: check for jobs in state X (removed) rather than C (completed).
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-07 09:49:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • JW - Add Ceph localgroup disk as a proper resource in CRIC connected to site
    • Local users to consider Ceph as the primary storage.
  • 146651 RAL-LCG2 urgent on hold 2020-12-18 10:48:00 singularity and user NS setup at RAL
    • Remains stuck behind updates further down the stack.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-12-18 09:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • 3 WNs in. Network switch remains a problem; aiming for solution by end of the month.
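
For the RALPP ticket (149362) above, the Condor history check for removed (X) versus completed (C) jobs could be sketched as follows. This is a minimal illustration, not the site's actual procedure: it assumes the input text was produced by something like `condor_history -af ClusterId ProcId JobStatus`, and the function name `count_final_states` and the sample rows are hypothetical.

```python
from collections import Counter

# HTCondor JobStatus codes; 3 is shown as "X" and 4 as "C" in condor listings.
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}


def count_final_states(history_text):
    """Count jobs per final state from lines of 'ClusterId ProcId JobStatus'."""
    counts = Counter()
    for line in history_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(fields) != 3 or not fields[2].isdigit():
            continue
        counts[JOB_STATUS.get(int(fields[2]), "Other")] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = """\
1001 0 4
1002 0 3
1003 0 3
"""

counts = count_final_states(sample)
print("Removed (X):", counts["Removed"], "Completed (C):", counts["Completed"])
```

A high count of removed jobs relative to completed ones would be consistent with jobs being evicted shortly after starting, which is the symptom noted in the ticket.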

CPU

  • RAL

    • Ok, recent additional slots from CMS (which is now recovering)
  • Northgrid

    • LANCS; gridFTP got stuck yesterday and needed restarting; jobs recovered.
  • London

    • QMUL; largely recovered; work ongoing.
  • SouthGrid

    • OX; jobs drained overnight; no conclusive reason found; possibilities are: assigned jobs requiring input from tape; potential bug in FTS with transfers.
  • Scotgrid

    • Glasgow; inefficient transfers, i.e. a very high rate of input of very small files. Maxed out gridFTP connections.

Other new issues

Ongoing issues

  • CentOS7 - Sussex

    • 3 nodes currently in; continue to have issues with network switch infrastructure.
  • TPC with http

    • To follow up with Sam regarding http for FT in ATLAS on Glasgow test endpoint.
    • Some issues in the past with rate limiting, and protections against ingenious users.
    • From Gareth; a discussion with ATLAS on write-backs and data access patterns with respect to Ceph would be useful.
    • Alessandra keen to move internal LAN transfers away from gridFTP.
    • Test queue at Glasgow, currently pointing to old XrdCeph; Sam to update c02 to latest version.
  • Storageless Site test / storage decommissioning (Oxford)

    • Oxford Jira for decommissioning now set up.
    • Will need to wait for Glasgow decommissioning to complete.
  • ECDF volatile storage

    • JW to start actions from the Jira.
  • Glasgow DPM Decommissioning

    • Ongoing; final part most difficult due to the problems of last year
  • ATLAS: Site Availability/Reliability reports: Glasgow

    • Push for VOFeed to CRIC; expected timescale being sought.

News round-table

  • Vip
    • Needed to leave
  • Dan
    • Needed to go to the next meeting. Plans: install 2nd gridFTP node; update WNs to latest OS/Lustre/Slurm on drained nodes; maintain stability otherwise.
  • Matt
    • Disk servers continuing to need attention (e.g. weighting issues).
  • Peter
    • Had to leave
  • Sam
    • NTR
  • Gareth
    • Q/R needed
  • JW
    • NTR
  • Duncan
    • QMUL -> TRIUMF; transfers were OK from 16:00 to 02:00.
    • Routes: until 4pm yesterday UK routes went via London, then via GEANT/Amsterdam.
    • Traceroute data: failing via London; working via Amsterdam.
    • Could it be IPv6, routing, or QMUL config related?
    • perfSONAR would certainly help to identify the problem in these cases.
  • Patrick
    • NTR
  • Rob
    • NTR

AOB

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard

      • Other new issues / tasks 5m
    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m