ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as OPs Mtg, but repeated)


Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent in progress 2021-01-03 13:08:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Manual blacklisting of WAN transfers was set over the New Year period;
      • A test of whitelisting shows no improvement since the new year.
      • Needs input from the site to understand the situation.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2021-01-05 12:35:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Ongoing; Peter to look at it from the apfmon side.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2021-01-04 12:01:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • DPM transfer failures; Sam to check whether they come from old files that should be cleared from the namespace.
    • Ceph still running well, but the current job mix is stressing the caching:
      • On Ceph, the internal xrootd cache is filling up due to intensive sets of jobs.
      • Purging old files from the xrootd cache to understand this better; it may be that all files in the cache are in active use.
    • Still need to move the final compute capacity; this requires on-site work.
  • 146651 RAL-LCG2 urgent on hold 2020-12-18 10:48:00 singularity and user NS setup at RAL
    • Still awaiting updates to the underlying software stack; no date given by Grid Services.
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-12-18 09:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Matt: some discussions on certificates held with the site; will try to get a recent update.

CPU

  • RAL
  • Failure of one CE over Christmas resulted in drained slots. Other VOs were running fine, so it took ~2 days to fully reclaim the slots.
    • Gareth mentioned that one CE might be primary; if it fails, all jobs drop.
    • It is not obvious in AGIS/CRIC whether one is set as primary.
  • Northgrid
  • LANCS: fairshare issues; other VOs are taking a number of slots.
    • Might need to reduce the time window SGE uses to allocate slots; there is also a drop-off parameter to change the weighting of older jobs.
  • London
  • QMUL: reinstalled the batch system; it is crashing every 10 minutes; possibly related to the (new) ARC accounting?
    • The legacy accounting had issues with processing into the SQLite database.
    • A major upgrade to the DC is planned for early in the year, but the timing is still to be finalised.
  • SouthGrid
  • Scotgrid
    • A 10 Gb/s-to-1 Gb/s auto-negotiation issue in the networking caused a drop in jobs over the new year.

Other new issues

Ongoing issues

  • CentOS7 - Sussex

    • No update
  • TPC with http

    • No update
  • Storageless Site tests (Oxford)

    • Progress reported in the Storage meeting; the new ARC is almost ready.
    • ATLAS has stopped writing new data to Oxford.
    • Hope to keep LOCALGROUPDISK at Oxford; needs confirmation. An update to SL7 is needed.
    • Sheffield (input from Duncan):
      • perfSONAR looks OK, but IPv6 only.
      • Use of the NAT might be the bottleneck and may not scale up to full production/analysis loads.
      • Steve’s tests are useful; needs some work from Sheffield to update a few tests.
      • Could perhaps make nodes dual stack?
  • ECDF volatile storage

    • No update; JW to work on the actions in the Jira ticket; other issues at the site have higher priority.
  • Glasgow DPM Decommissioning

    • To check final deletions.
    • LOCALGROUPDISK naming is now settled; will give the green light once the space is increased.

News round-table

  • Dan
    • NTR (left before end)
  • Matt
    • NTR (left before end)
  • Peter
    • NTR
  • Sam
    • NTR
  • Gareth
    • Q/R’s needed by month-end
  • JW
    • NTR
  • Duncan
    • Confirmed that Sheffield has no storage set up.
    • Discussion of IO demands, e.g. 5k cores (Glasgow) × 0.5 MB/s/core, for future UK and Glasgow requirements.

AOB

Timetable

  • 10:00-10:20 Status (20m)
    • Outstanding tickets (10m)
    • CPU (5m; new link for the site-oriented dashboard)
    • Other new issues / tasks (5m)
  • 10:20-10:40 Ongoing Items (20m)
  • 10:40-10:50 News round-table (10m)
  • 10:50-11:00 AOB (10m)