ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as OPs Mtg)

Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent assigned 2020-12-09 11:15:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Davs/https transfers at ECDF; headnodes possibly overloaded compared to other protocols (interpretation from Sam)
    • Rob looking into this
  • 149811 UKI-LT2-QMUL less urgent in progress 2020-12-09 16:16:00 Transfer and deletion errors from UKI-LT2-QMUL as dst site
    • Storage back online; several systems for the compute nodes still need rebuilding
    • Proxmox cluster taken down. HP SSDs running the journals have an uptime bug that bricks them after a set number of hours; 2 out of 3 SSDs taken out.
      • Positive comments made regarding Proxmox; it runs on Debian/Ubuntu
    • Downtime next week for power work
  • 149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-09 11:50:00 UKI-SOUTHGRID-RALPP: unable to connect to host
    • IPv4 connectivity problems to the site for FTS transfers via Rucio.
    • Site will attempt a router reboot to fix this.
    • Also exposed a bug in Rucio for the default IPvX version used if not specified in the RSE.
      • The RSE default appears to have been updated, which is allowing transfers to succeed by using IPv6 (see the sketch after this ticket list).
  • 149738 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-12-09 14:16:00 UKI-NORTHGRID-LANCS-HEP: deletion errors
    • Two sets of files declared lost.
    • Recovery of the remaining unique set is being attempted; this will stop by Monday.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-12-04 10:14:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • No progress; however, this may be related to the IPv4/IPv6 differences; to be followed up.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-27 10:00:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Received file list from disk 40. Some files might be recoverable, but this is unlikely.
    • To be declared lost once cleaned from the namespace.
    • JW to create a Jira ticket and get the list of unique files.
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • On hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • ARC now working correctly. LDAP issue not yet started. Adding more nodes, but network failures in the DC need to be fixed.
    • Final nodes need provisioning; aim to finish early next year.
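
A minimal sketch related to the RALPP IPvX item above, showing how the RSE attributes could be inspected with the Rucio python client; the RSE name used here and the "ip_version" attribute key are assumptions for illustration only, not taken from the ticket.

    # Minimal sketch: list the attributes set on an RSE and flag anything that
    # looks IP-related. The RSE name below is hypothetical.
    from rucio.client import Client

    client = Client()
    rse = "UKI-SOUTHGRID-RALPP_DATADISK"  # hypothetical RSE name for illustration

    attrs = client.list_rse_attributes(rse)
    for key, value in sorted(attrs.items()):
        marker = "  <-- possibly IP-related" if "ip" in key.lower() else ""
        print(f"{key} = {value}{marker}")

    # If a default IP version had to be pinned explicitly, an attribute could be
    # set as below; the key name "ip_version" is an assumption, not a confirmed
    # Rucio/FTS setting.
    # client.add_rse_attribute(rse, "ip_version", "ipv6")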

CPU

  • RAL

    • HC test failures (due to an updated ROOT version in one of the tests) caused sites to go into test mode. Recovery of the lost slots is taking time.
  • Northgrid

    • Lancs: misconfiguration of the submission directory on the NFS mounts; should now be fixed
  • London

    • QMUL issues (as reported above)
  • SouthGrid

    • Oxford observed a similar HC dip to RAL's
  • Scotgrid

    • Durham: problematic disk server over the weekend.
    • Glasgow: some additional cores added; now running with 40 kHS06.

Other new issues

  • Glasgow site availability/reliability
    • The ETF information appears to be correct, but the interpretation from the ATLAS topology enrichment via the VOFeed needs to be understood and updated (see the sketch below).
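
A schema-agnostic sketch of how the VOFeed entries for the site could be checked; the feed URL is a placeholder (not the real endpoint) and no VOFeed element names are assumed.

    # Minimal sketch: fetch the ATLAS VOFeed XML and print every element whose
    # attributes or text mention the site, to see what topology information the
    # availability/reliability calculation is being fed. Placeholder URL.
    import urllib.request
    import xml.etree.ElementTree as ET

    VOFEED_URL = "https://example.cern.ch/atlas/vofeed.xml"  # placeholder URL
    SITE = "UKI-SCOTGRID-GLASGOW"

    with urllib.request.urlopen(VOFEED_URL) as resp:
        tree = ET.parse(resp)

    for elem in tree.getroot().iter():
        values = list(elem.attrib.values()) + ([elem.text] if elem.text else [])
        if any(SITE in v for v in values):
            print(elem.tag, elem.attrib)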

Ongoing issues

  • CentOS7 - Sussex
    • (described above)
  • TPC with http
    • No update
  • Storageless Site tests (Oxford)
    • No progress; discussions ongoing on how to configure the ARC-CE queues
  • ECDF volatile storage
    • Ticket updated; a number of configuration changes are needed from the ATLAS side; JW to follow up.
  • Glasgow DPM Decommissioning
    • Still need LOCALGROUPDISK set up on Ceph. Discussion on the pool name versus the endpoint naming.

News round-table

  • Vip
    • NTR
  • Dan
    • NTR
  • Matt
    • NTR
  • Peter
    • NTR
  • Sam
    • NTR
  • Gareth
    • NTR
  • JW
    • NTR
  • Patrick
    • NTR

AOB

  • Future meetings to use the new CERN-hosted Zoom room, integrated into Indico.
  • Next week (17th) is the last Cloud Support meeting of the year; expect to then restart on the 7th.


 
