ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147194 UKI-LT2-RHUL less urgent in progress 2020-05-28 UKI-LT2-RHUL: deletion errors
    • A couple of problems with disabled storage; the site is currently upgrading its storage to CentOS7. These problems should now be fixed.
  • 147189 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-05-28 UKI-NORTHGRID-MAN-HEP deletion errors …
    • Services restarted, looking better; with Puppet off, needs manual intervention; was a cron issue (needs DPM version 1.14)
  • 147082 UKI-NORTHGRID-MAN-HEP urgent waiting for reply 2020-05-21 File not accessible at …
    • Race condition; waiting for 1.14 to fix issue
    • Problem resolved; no response from DAST; to close
  • 146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-05-19 Failovers from jobs running at …
    • no new update
  • 146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-05-21 UKI-SCOTGRID-ECDF deletion failures …
    • Site needs to disable IPv6; if needed, JW to respond.
  • 146651 RAL-LCG2 urgent involved in progress 2020-05-27 singularity and user NS setup at RAL
    • Roll back the change to max_user_namespaces, as it has negatively impacted LHCb; will pursue enabling unprivileged Singularity instead.
  • 146525 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 UKI-NORTHGRID-SHEF-HEP: evicted jobs
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 ATLAS pilot jobs idle on …
    • GR - reply in TB support
  • 146159 UKI-SCOTGRID-GLASGOW very urgent in progress 2020-05-19 Unaccessible files at …
    • Priority to go into production; read_lan needs to be fixed in ATLAS
    • xrootd shows instability in external connections (latest version, 4.12.1, running as plugin)
    • Why is RAL (4.11) not affected, if the problem is with the ceph-xrootd plugin?
    • DPM: eventually stop the queues, and physically move useful nodes, as and when restrictions allow, to be added to Ceph.
    • See if this ticket can be closed, and follow-up in decommissioning Jira
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 Very old version of squids at …
    • On hold
  • 145510 RAL-LCG2 urgent on hold 2020-05-13 RAL-LCG2: timeouts on stage-in/outs
    • On hold, moving to close
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-02-17 High traffic from UKI-SCOTGRID-GLASGOW …
    • Complex arrangement between data centre and older systems.
    • Final solution only likely with move to DC, new hardware
    • Initial problem seems to be solved, but keep ticket open to follow progress
       
  • 142329 UKI-SOUTHGRID-SUSX top priority reopened 2020-05-24 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Test jobs running; ask to enable all nodes now

CPU

Major DB outage at CERN, with a knock-on effect on the Frontier launchpads; CERN Frontier was still an issue in the morning and is in recovery.
HC tests set many sites offline.

  • RAL
    Config issue over the weekend, fixed on the Monday

  • Northgrid

  • London

  • SouthGrid

  • Scotgrid

Other new issues

ECDF to move from 4-core to 8-core MCore jobs.

Ongoing issues

  • LANCS DPM CentOS7 upgrade

    • Circa June 24th; no extraordinary actions need be taken prior to the move
  • CentOS7 - Sussex

  • Glasgow Ceph storage

    • Non-DC cores; reduction in capacity
  • Grand Unified queues

News round-table

  • Vip
    • OX is set offline; to follow up with HC.
  • Dan
    • NTR
  • Matt
    • NTR
  • Peter
    • NTR
  • Alessandra
    • NTR
  • Sam
    • NTR
  • Gareth
    • NTR
  • Tim
    • NTR
  • JW
    • NTR

AOB

TA -> AF: TPC on smoke test; RAL has gone into the stress test as destination; missing source, which indicates a problem.
GR noted some issues with Firefox on protected (certificate) ATLAS pages, but these were not seen by the other members present.
