ATLAS UK Cloud Support

Europe/London
Zoom

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Description

Meeting to be held via Zoom (https://ukri.zoom.us/j/97404730356)
Password protected (same as OPs Mtg)

Outstanding tickets

  • 149362 UKI-SOUTHGRID-RALPP urgent in progress 2020-11-11 16:15:00 ATLAS CE failures on UKI-SOUTHGRID-RALPP-heplnx207
    • Change to ACT has not helped; JW to look into switching back
    • Various replies on the ticket from the ARC devs.
    • Problem persists
  • 149349 UKI-SOUTHGRID-OX-HEP less urgent waiting for reply 2020-11-09 10:15:00 UKI-SOUTHGRID-OX-HEP Frontier Squid Status
    • JW to close
  • 148968 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-11-11 14:48:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
    • Few failures on usual server. New disks now in.
    • Jira to be updated with namespace files
    • JW to declare the latest files as lost
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-11-10 09:59:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • JW to verify that the file can be transferred and close
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • No update
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
    • Ticket closed; nat5 moved
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-11-05 10:52:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Nodes in; awaiting networking.
      • Pilots still failing
    • ARC can fill up logs quickly
      • Input from Gareth on checking the gridFTP logs
      • Matt to have a look at the ARC conf file

CPU

New site dashboard panel of job efficiency available.

  • RAL

  • Northgrid

    • LANCS; other VOs are getting lots of jobs; should have 4:1 weighting
      • DPM DB falling over; killed for excess memory consumption; may not be directly DPM, but in httpd processes
      • Increase of TPC-http activity could relate to this?
  • London

  • SouthGrid

  • Scotgrid

    • GLASGOW; 2 new CEs being added. Half of available capacity currently running.

Other new issues

  • Migration from AGIS to CRIC started with Switcher migration.
    • Some initial problems in sites (Glasgow, Brunel) trying to get out of downtime.
      • Should be resolved now
    • Not all AGIS information may stay up-to-date, as CRIC becomes primary source
    • Peter to check apfmon w.r.t. the AGIS migration
  • Glasgow LOCALGROUPDISK:
    • Set up new pool (JW to recheck that the ATLAS config will be OK)
    • Should be fine for internal Glasgow users.

Ongoing issues

  • CentOS7 - Sussex
    • See ticket description above
  • TPC
    • No update
  • Oxford storageless tests
    • Discussed in Jira and Storage mtg.
    • Running HC test queue, using RAL as endpoint
    • Next: set up a new (ARC) queue at OX
  • ECDF
    • No update here

News round-table

  • Vip
    • Needed to leave before the end.
      • Asked about testing Squid
      • JW to provide some examples from ATLAS
  • Dan
    • Needed to leave before end; NTR
  • Matt
    • NTR
  • Peter
    • Asked about AGIS migration; will follow-up for CRIC
  • Sam
    • NTR
  • Gareth
    • NTR
  • JW
    • NTR
  • Patrick
    • NTR

AOB

  • NTR


 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m

        New plot available from the site dashboard:
        Job efficiency (defined as #successful / #jobs)
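
As a minimal sketch of the definition above (successful jobs divided by all jobs), the snippet below computes the ratio from a pair of counts. The function name, the site labels and the counts are hypothetical illustrations, not code or values taken from the dashboard. How the dashboard treats a site with no jobs is not stated; returning NaN is an assumption made here.

```python
import math


def job_efficiency(n_successful: int, n_jobs: int) -> float:
    """Return #successful / #jobs, or NaN when there are no jobs (assumption)."""
    if n_successful < 0 or n_jobs < 0 or n_successful > n_jobs:
        raise ValueError("counts must satisfy 0 <= n_successful <= n_jobs")
    if n_jobs == 0:
        # No jobs in the sample: the ratio is undefined.
        return math.nan
    return n_successful / n_jobs


# Hypothetical (successful, total) job counts per site, for illustration only.
example_counts = {"SITE_A": (950, 1000), "SITE_B": (0, 0)}

for site, (ok, total) in example_counts.items():
    print(site, job_efficiency(ok, total))
```

A site with many short failing jobs and a few long successful ones would score low on this count-based ratio even if most wall-clock time was spent on successful work, so it is worth reading alongside other dashboard panels.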

         

      • Other new issues / tasks 5m

    • 10:20 10:40
      Ongoing Items 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m