ATLAS UK Cloud Support

Europe/London
Vidyo


Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 149095 UKI-SOUTHGRID-OX-HEP less urgent in progress 2020-10-17 07:37:00 UKI-SOUTHGRID-OX-HEP: unstable transfer
    • Should be fixed; JW - to check and close
  • 148968 UKI-NORTHGRID-LANCS-HEP less urgent waiting for reply 2020-10-21 13:49:00 UKI-NORTHGRID-LANCS-HEP: deletion and transfer failures
    • Should now be fine; JW - to check and close
  • 148342 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-10-19 13:51:00 UKI-SCOTGRID-GLASGOW with transfer efficiency degraded and many failures
    • Power fluctuations (cable strikes on and off campus)
      • Odd networking state for some racks; rebooting appears to have cleared this
    • Looking ok? JW to check and close if so.
    • Other cvmfs issues also seem to be resolved
  • 146651 RAL-LCG2 urgent on hold 2020-10-16 11:56:00 singularity and user NS setup at RAL
    • on-hold
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-09-11 13:35:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
    • on-hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
    • on-hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • on-hold

CPU

  • RAL

    • Firewall upgrade at RAL (am of 22nd Oct); killed most jobs; now recovering
  • Northgrid

    • LANCS is now recovering job slots. It is unclear why jobs still ran against missing files after those files were declared temporarily unavailable.
  • London

    • Small issue at QMUL, resolved quickly
  • SouthGrid

  • Scotgrid

    • Durham: some unexpected downtime. One disk server identified with lost ATLAS data; a list of the affected files is in preparation.

Other new issues

Ongoing issues

  • CentOS7 - Sussex
    • on-hold
  • Grand Unified queues
    • (awaiting Sheffield)
  • TPC via http
    • Ceph: a fix for alignment issues in the EC pool in xrootd is available for testing
      • Appears to be working at RAL; although still some failure modes observed

News round-table

  • Vip
    • Approx 1/3 of Cs to be drained from Saturday, for work on Tuesday.
  • Dan
    • Weekend memory problem; StoRM / Argus; requires a restart (open ticket with StoRM devs).
    • Aiming to improve the automation of restarts
  • Matt
    • rebuilding of servers ongoing.
  • Peter
    • NTR
  • Sam
    • AOD -> DAOD jobs still show up as source of failures. (JW to also follow this).
  • JW
    • NTR (see tpc-http info above).
  • Rob
    • Will have a discussion with ATLAS experts regarding QoS developments with ECDF storage

AOB

  • NTR

Timetable

  • 10:00 - 10:20 Status (20m)
    • Outstanding tickets (10m)
    • CPU (5m)
    • Other new issues (5m)
  • 10:20 - 10:40 Ongoing issues (20m)
  • 10:40 - 10:50 News round-table (10m)
  • 10:50 - 11:00 AOB (10m)