
ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as OPs Mtg)

Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent assigned 2020-12-09 11:15:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • DAVS/HTTPS transfers at ECDF; head nodes possibly overloaded, compared with other protocols (interpretation from Sam)
    • Rob looking into this
  • 149811 UKI-LT2-QMUL less urgent in progress 2020-12-09 16:16:00 Transfer and deletion errors from UKI-LT2-QMUL as dst site
    • Storage is back online; several systems need rebuilding for the compute nodes
    • Proxmox cluster taken down. The journals ran on HP SSDs with an uptime bug that bricks the drive after x hours. 2 out of 3 SSDs taken out.
      • Positive comments made about Proxmox; runs on Debian/Ubuntu
    • Downtime next week for power work
  • 149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-09 11:50:00 UKI-SOUTHGRID-RALPP: unable to connect to host
    • IPv4 problems to site with FTS transfers via Rucio.
    • Site will attempt router reboot to fix
    • Also exposed a bug in Rucio in the default IP version used when none is specified in the RSE.
      • The RSE default looks to have been updated, so transfers are now succeeding over IPv6.
  • 149738 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-12-09 14:16:00 UKI-NORTHGRID-LANCS-HEP: deletion errors
    • Two sets of files declared lost.
    • Recovery of a remaining unique set is still being attempted. Will stop by Monday.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-12-04 10:14:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • No progress; may, however, be related to IPv4/IPv6 differences; to be followed up.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-27 10:00:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Received file list from disk 40. Some files might be recoverable, but this is unlikely.
    • To be declared lost once cleaned from the namespace.
    • JW: to create a Jira ticket and get the unique files
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • On hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • ARC now working correctly. LDAP issue: not started. Adding more nodes, but network failures in the DC need to be fixed.
    • Final nodes need provisioning; aim to finish early next year.

CPU

  • RAL
    • HC test failures (due to an updated ROOT version in one of the tests) caused sites to go into test. Recovery of lost slots is taking time.
  • Northgrid
    • Lancs: misconfiguration of the submission directory on the NFS mounts; should now be fixed
  • London
    • QMUL issues (as reported above)
  • SouthGrid
    • OX observed a similar HC dip to RAL
  • Scotgrid
    • Durham: problematic disk server over the weekend.
    • Glasgow: some additional cores added; running with 40 kHS06.

Other new issues

  • Glasgow Site Avail/Rel
    • ETF information appears to be correct, but its interpretation by the ATLAS Topology enrichment via the VOFeed needs to be understood and updated.

Ongoing issues

  • CentOS7 - Sussex
    • (described above)
  • TPC with http
    • No update
  • Storageless Site tests (Oxford)
    • No progress; discussions ongoing on how to configure the ARC-CE queues
  • ECDF volatile storage
    • Ticket updated; a number of config changes are needed on the ATLAS side; JW to follow up.
  • Glasgow DPM Decommissioning
    • Still need LOCALGROUPDISK set up on Ceph. Discussion on pool name versus endpoint naming.

News round-table

  • Vip
    • NTR
  • Dan
    • NTR
  • Matt
    • NTR
  • Peter
    • NTR
  • Sam
    • NTR
  • Gareth
    • NTR
  • JW
    • NTR
  • Patrick
    • NTR

AOB

  • Set up new CERN Zoom room for next week.
  • Future meetings to use the new CERN-hosted Zoom room, integrated into Indico.
  • Next week (17th) is the last Cloud Support meeting of the year. Expect to restart on the 7th.