ATLAS UK Cloud Support

Europe/London

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))
Weekly ATLAS UK Cloud Support Meeting

Outstanding tickets

  • 147553 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-06-20 09:08:00 UK UKI-NORTHGRID-LANCS-HEP_DATADISK deletion failures

    • Closed
  • 147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-25 07:19:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy

    • Static route now rolled out to all nodes.
  • 147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-18 08:22:00 Deletion errors at UKI-SCOTGRID-GLASGOW

    • Specific files done. Will close the ticket once remaining files in namespace have been processed.
  • 146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-06-16 15:41:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • DPM CentOS 7 migration done, but this has not completely removed the issue. Some difference between ECDF and other DPM configs.
    • Under investigation; will talk to dpm-devs
  • 146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL

    • If moved to unprivileged, we use our own; otherwise RAL needs to support singularity
    • Docker makes it look like the user namespace is enabled. Singularity must be able to mount /proc
    • JW to follow up with JA
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • Work on CE in progress
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent waiting for reply 2020-06-24 16:43:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP

    • Upgrade underway; need to make Frontier squid work with the puppet modules
  • 145510 RAL-LCG2 urgent in progress 2020-06-18 05:50:00 RAL-LCG2: timeouts on stage-in/outs

    • Problems at RAL are preventing investigation and closure of the ticket
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • Needs Access
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Needs Access

CPU

Pledge line back; move to CRIC DB in ATLAS monit

  • RAL

    • Power cut; broken software (singularity) update
  • Northgrid

    • LANCS: migration done; some residual problems
      • Using old CE, but upgrade needed
      • DIRAC working; ATLAS needs some work.
  • London

    • RHUL: In test; HC ‘stuck’; action being followed-up
  • SouthGrid

  • Scotgrid

    • Durham: Cooling failed; off until Monday

Other new issues

  • RAL-FTS

    • ATLAS moved sites from RAL to CERN’s FTS instance
  • CERN DB downtime

    • Major DB intervention 27 June; affects many services
    • CERN Frontier switched off from afternoon 26th
    • Job submission to be halted later in the day
  • Downtimes:

    • Durham: 24-28 Aircon failure, (24) Storage maintenance
    • LANCS: 23 Upgrade SEs
    • MAN: 22 Arc-ce6
    • RAL: 22/23 Power cut

Ongoing issues

  • CentOS7 DPM Lancs

    • LANCS: migration done; some residual problems
      • Using old CE, but upgrade needed
      • DIRAC working; ATLAS needs some work.
  • CentOS7 - Sussex

    • Needs Access
  • Glasgow Ceph storage

    • Various improvements planned; stable running
    • Will remove from ‘ongoing’ issues
  • Grand Unified queues

    • Awaiting SHEF

News round-table

  • Vip

    • NTR
  • Dan

    • PanDA failing; out-of-memory error
    • JW to investigate
  • Matt

    • NTR
  • Peter

    • NTR
  • Alessandra

    • NTR
  • Sam

    • NTR
  • Tim

    • Needs Echo access from James for HTTP, to make progress on that
  • JW

    • NTR

AOB

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m

      • CPU 5m

      • Other new issues 5m

    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m