ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as OPs Mtg)

Outstanding tickets

  • 149752 UKI-NORTHGRID-LANCS-HEP less urgent assigned 2020-12-02 16:07:00 Failovers from University of Lancaster to CERN backup proxy
    • A number of stale CVMFS instances observed (also at Glasgow).
    • GeoIP issues; might be related to Stratum 1 updates?
    • Refreshing the cache may be the best option.
  • 149750 UKI-SOUTHGRID-RALPP less urgent in progress 2020-12-02 10:16:00 UKI-SOUTHGRID-RALPP: unable to connect to host
    • Problems in FTS transfers for ATLAS (not other VOs). CLI TPC transfers appear ok.
  • 149738 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-12-02 15:55:00 UKI-NORTHGRID-LANCS-HEP: deletion errors
    • Poor RAID card has issues with many simultaneous operations (deletions), causing crashes.
    • Down to the last 25% of data from draining of the server.
    • Draining stopped for today, but some file losses should be expected.
  • 149705 UKI-SCOTGRID-ECDF less urgent in progress 2020-11-30 11:52:00 UKI-SCOTGRID-ECDF: Low transfer efficiency due to TRANSFER [70] TRANSFER an end-of-file was reached …
    • Load on headnode from httpd processes
      • From Matt: a method to mitigate high memory usage from http has been implemented at Lancaster. The issues might be related.
  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-11-19 10:11:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • heplnx207 still in downtime (ended post-meeting)
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-27 10:00:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Disk 40: being drained as part of decommissioning. RAID set reports OK, the filesystem does not.
    • AC / cooling issues in DPM server room
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • on hold, working on underlying issues
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Arc-ce issues; not reporting back to the monitoring sites
    • Communication issue? GridFTP looks to be working.
    • Can the BDII / LDAP be queried (from offsite?)
      • Status information usually through the BDII.
    • To contact the arc-devs?
    • To try an LDAP search against BDII
    • Patrick to report back to TB support.
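
For the Sussex action "to try an LDAP search against BDII", a minimal sketch of how such a query could be put together is given below. It only builds the ldapsearch command line and does not run it. The host name site-bdii.example.org and the helper name bdii_ldapsearch_cmd are hypothetical, and port 2170, the base o=glue and the GLUE2ComputingService filter are the usual BDII defaults assumed here, not values taken from the ticket.

```python
import shlex


def bdii_ldapsearch_cmd(host, port=2170, base="o=glue",
                        ldap_filter="(objectClass=GLUE2ComputingService)"):
    """Build an anonymous ldapsearch command for a BDII endpoint.

    host is a placeholder; port, base and filter are assumed defaults.
    """
    return [
        "ldapsearch",
        "-x",            # simple (anonymous) bind
        "-LLL",          # plain LDIF output without comments
        "-H", f"ldap://{host}:{port}",
        "-b", base,      # search base
        ldap_filter,
    ]


# Hypothetical host name, used purely for illustration.
cmd = bdii_ldapsearch_cmd("site-bdii.example.org")
print(shlex.join(cmd))
```

Running the printed command from an offsite machine would show whether the endpoint answers at all, which is the question raised in the ticket discussion.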

CPU

  • RAL

  • Northgrid

  • London

  • SouthGrid

  • Scotgrid

    • Downtime for DPM; problems with chillers and AC. Effectively shut down for the moment.
      • Some replacements needed.
    • Prod is in DC, which is fine.

Other new issues

Ongoing issues

  • CentOS7 - Sussex

  • TPC http

    • RAL TPC-http FTS tests working by converting // to / in path.
  • Oxford Storageless tests

    • 10GB link working
    • Arc config needed; Sam to send to Vip

  • ECDF unreliable storage

    • Rob to update ticket
  • Glasgow LOCALGROUPDISK

    • Sam to aim to create Ceph pool.
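
The TPC http item above notes that the RAL tests work once // is converted to / in the path. The minutes do not say where that conversion is applied, so the following is only an illustrative sketch of the idea: collapse repeated slashes in the path component of a URL while leaving the scheme://host part alone. The function name collapse_path_slashes and the example URL are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url):
    """Collapse runs of '/' in the path only; scheme, host and query are kept."""
    parts = urlsplit(url)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=clean_path))


# Hypothetical storage URL, used purely for illustration.
example = "https://storage.example.org:1094//atlas/datadisk//rucio/file.root"
print(collapse_path_slashes(example))
```

Splitting the URL first means the double slash after the scheme is never touched, which a plain string replace would break.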

News round-table

  • Vip

    • Production squid server failover yesterday.
    • CPU efficiency looks a bit lower?
    • prmon (https://github.com/HSF/prmon) to be added to monitoring for storageless tests.
  • Dan

    • Possible downtime 1wk on the 14th.
      • StoRM moving ahead to CentOS7
    • Next year disruption expected in DC, dates to be determined.
  • Matt

    • NTR; prepare for lost files.
  • Peter

    • Considering options for CRC shifter
      • Soliciting for CRC shifts.
  • Sam

    • NTR

  • Gareth

    • Continue to work on cooling issues
  • JW

    • NTR
  • Patrick

    • NTR

AOB

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

      • Other new issues / tasks 5m
    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m