ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

https://cern.zoom.us/j/98434450232

Password protected (same as OPs Mtg, but repeated)

Outstanding tickets

  • 149842 UKI-SCOTGRID-ECDF less urgent in progress 2020-12-15 02:10:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER ERROR: Copy failed with mode 3rd pull, wi…
    • Rob investigating with DPM. Interface to DPM is not facilitating progress.
  • 149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-11 13:51:00 UKI-SOUTHGRID-RALPP: unable to connect to host
    • Resolved the initial problem by switching transfers to IPv6; underlying firewall/network issues still to be resolved.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-12-04 10:14:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Possibly related to the IPv6 issues; needs following up in Jira.
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-12-10 14:42:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Files declared as lost; should be OK to close if transfers now look OK.
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • On hold; awaiting underlying changes
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Dev Harvester instance issues fixed for the Condor submitter.

CPU

New link for the site-oriented dashboard

  • RAL

    • OK; JW to ensure corepower is updated in the new year.
  • Northgrid

    • 60k HS06 running; some issue started recently, perhaps with gridFTP
  • London

    • QMUL
      • PDU replacements; issues with cabling from the PDUs to the racks, which also needs replacing, leading to increased downtime.
      • Many improvements (and some disruption) next year, leading to improved cooling and capacity.
      • Work on CEs next week.

  • SouthGrid

    • Oxford squid server down; moved to the old server.

    • QMUL will use its old perfSONAR for its new squids.

  • Scotgrid

    • Durham had a drop in capacity; starting to come back now.

Other new issues

  • Site availability and reliability for Glasgow follow-up; see associated tickets.
    • Mixture of CRIC and AGIS information being used. From the ATLAS SAM team's side the migration has been done, so the ETF tests use CRIC as the source; the MONIT side will eventually use the filtered vofeed provided by ATLAS, but as a temporary solution MONIT consumes the vofeed (from either AGIS or CRIC) internally.
    • Glasgow - to ensure all relevant info is included in GOCDB.
    • ATLAS - to see how much can be exposed before the final push to CRIC.

Ongoing issues

  • CentOS7 - Sussex

    • Sussex - Peter reports dev server was fixed, so all pilots now working.
      • In a good state for provisioning of nodes in new year
  • TPC with http

    • Expecting a deadline of May 2021 for deployment at most sites (see the sketch after this list).
  • Storageless Site tests (Oxford)

    • No particular progress
  • ECDF volatile storage

    • Awaiting JW to make SE changes from Jira
  • Glasgow DPM Decommissioning

    • Sam preparing Ceph localgroupdisk
    • Hope to transfer across before/during Christmas.
    • Gareth will put the DPM in “AT RISK” for the period.
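
For the TPC-with-http item above, a minimal sketch of what an HTTP third-party copy looks like from the client side, using the gfal2 Python bindings. The davs:// endpoints and paths are hypothetical placeholders rather than real UK site URLs, and this is an illustration only, not the agreed validation procedure.

    # Sketch: trigger a copy between two WebDAV/HTTP endpoints with gfal2.
    # The endpoints and paths below are hypothetical placeholders.
    import gfal2

    src = "davs://source-se.example.ac.uk:443/atlas/datadisk/testfile"
    dst = "davs://dest-se.example.ac.uk:443/atlas/datadisk/testfile"

    ctx = gfal2.creat_context()      # note: the bindings really spell it 'creat_context'
    params = ctx.transfer_parameters()
    params.overwrite = True          # replace any existing destination copy
    params.timeout = 300             # seconds allowed for the whole transfer
    ctx.filecopy(params, src, dst)   # gfal2 attempts a third-party copy where the endpoints support it
    print("copy completed")

In production these transfers are of course driven by Rucio/FTS rather than by hand; the point is only to show the protocol and endpoints being exercised.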

News round-table

  • General

    • Most sites will have limited response from next week.
  • Vip

    • Asked about the mu3e VO port number.

  • Dan

    • NTR
  • Matt

    • NTR
  • Peter

    • NTR
  • Sam

    • NTR
  • Gareth

    • Will set the DPM to AT RISK from the weekend.
  • JW

    • NTR

AOB

  • Next UK Cloud meeting 7th January 2021
    • Happy Holidays!

 
