ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as OPs Mtg, but repeated)

Videoconference: ATLAS UK Cloud Support
Zoom Meeting ID: 98434450232
Host: James William Walder

Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent in progress 2020-12-15 02:10:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Rob investigating with DPM. Interface to DPM is not facilitating progress.
  • 149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-11 13:51:00 UKI-SOUTHGRID-RALPP: unable to connect to host
    • Initial problem resolved by switching transfers to IPv6; underlying firewall/network issues still to be resolved.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-12-04 10:14:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Possibly related to IPv issues; needs following up in Jira
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-12-10 14:42:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Files declared lost; OK to close if transfers now look OK.
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • On hold; awaiting underlying changes
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Dev harvester instance issues - fixed for condor submitter.

CPU

  • RAL

    • OK; JW to ensure corepower is updated in the new year
  • Northgrid

    • 60k HS06 running; an issue started recently, possibly with GridFTP
  • London

    • QMUL:
      • PDU replacements; cabling from the PDUs to the racks has issues and is to be replaced, leading to increased downtime.
      • Many improvements, with some disruption, next year, leading to improved cooling and capacity.
      • Work on CEs next week

  • SouthGrid

    • OX squid server down; moved to old server.

    • QMUL will use its old perfSONAR server for its new squids.

  • Scotgrid

    • Durham saw a drop in capacity; starting to come back now

Other new issues

  • Site availability and reliability for Glasgow follow-up; see associated tickets.
    • Mixture of CRIC and AGIS information being used.
    • Glasgow - to ensure all relevant info is included in GOCDB
    • ATLAS - to see how much can be exposed before final push to CRIC

Ongoing issues

  • CentOS7 - Sussex

    • Sussex - Peter reports dev server was fixed, so all pilots now working.
      • In a good state for provisioning of nodes in the new year
  • TPC with HTTP

    • Expecting a deadline of May 2021 for deployment at most sites
  • Storageless Site tests (Oxford)

    • No particular progress
  • ECDF volatile storage

    • Awaiting JW to make SE changes from Jira
  • Glasgow DPM Decommissioning

    • Sam preparing Ceph localgroupdisk
    • Hope for transfer across before / during Christmas
    • Gareth will put DPM in “AT RISK” for the period.

News round-table

  • General

    • Most sites will offer limited response from next week
  • Vip

    • Asked about mu3e VO port number

  • Dan

    • NTR
  • Matt

    • NTR
  • Peter

    • NTR
  • Sam

    • NTR
  • Gareth

    • Will set DPM to “AT RISK” from the weekend
  • JW

    • NTR

AOB

  • Next UK Cloud meeting 7th January 2021
    • Happy Holidays!

 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New link for the site-oriented dashboard

      • Other new issues / tasks 5m
      • Glasgow: Site Availability/Reliability Config 5m

        From our side (i.e. the ATLAS SAM team), the migration has been done, so the ETF tests use CRIC as the source.
        The MONIT side will eventually use the filtered vofeed provided by us.
        As a temporary solution, however, MONIT consumes the vofeed (from either AGIS or CRIC), with this handled internally in MONIT.

    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m

      Next Cloud Meeting: Jan 7th 2021
      - Happy Holidays!