ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147553 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-06-20 09:08:00 UK UKI-NORTHGRID-LANCS-HEP_DATADISK deletion failures

    • Closed
  • 147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-25 07:19:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy

    • Static route now rolled out to all nodes.
  • 147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-18 08:22:00 Deletion errors at UKI-SCOTGRID-GLASGOW

    • Specific files done. Will close the ticket once the remaining files in the namespace have been processed.
  • 146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-06-16 15:41:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • DPM CentOS 7 migration done, but it has not completely removed the issue. There is some difference between the ECDF and other DPM configs.
    • Under investigation; will talk to dpm-devs
  • 146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL

    • If moved to unprivileged, we use our own; otherwise RAL needs to support Singularity
    • Docker makes it look as if user namespaces are enabled. Singularity must be able to mount /proc
    • JW to follow up with JA
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • Work on CE in progress
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent waiting for reply 2020-06-24 16:43:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP

    • Upgrade underway; need to make Frontier squid work with the puppet modules
  • 145510 RAL-LCG2 urgent in progress 2020-06-18 05:50:00 RAL-LCG2: timeouts on stage-in/outs

    • Problems at RAL are preventing investigation and closure of the ticket
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • Needs Access
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Needs Access

CPU

Pledge line back; move to CRIC DB in ATLAS MONIT

  • RAL

    • Power cut; broken software (Singularity) update
  • Northgrid

    • LANCS: migration done; some residual problems
      • Using old CE, but upgrade needed
      • DIRAC working; ATLAS needs some work.
  • London

    • RHUL: In test; HC ‘stuck’; action being followed-up
  • SouthGrid

  • Scotgrid

    • Durham: Cooling failed; off until Monday

Other new issues

  • RAL-FTS

    • ATLAS moved sites from RAL to CERN’s FTS instance
  • CERN DB downtime

    • Major DB intervention 27 June; affects many services
    • CERN Frontier switched off from afternoon of the 26th
    • Job submission to be halted later in the day
  • Downtimes:

    • Durham: 24-28 Aircon failure, (24) Storage maintenance
    • LANCS: 23 Upgrade SEs
    • MAN: 22 Arc-ce6
    • RAL: 22/23 Power cut

Ongoing issues

  • CentOS7 DPM Lancs

    • LANCS: migration done; some residual problems
    • Using old CE, but upgrade needed
    • DIRAC working; ATLAS needs some work.
  • CentOS7 - Sussex

    • Needs Access
  • Glasgow Ceph storage

    • Various improvements planned; stable running
    • Will remove from ‘ongoing’ issues
  • Grand Unified queues

    • Awaiting SHEF

News round-table

  • Vip

    • NTR
  • Dan

    • PanDA failing; out-of-memory error
    • JW to investigate
  • Matt

    • NTR
  • Peter

    • NTR
  • Alessandra

    • NTR
  • Sam

    • NTR
  • Tim

    • Needs Echo access from James to make progress on HTTP
  • JW

    • NTR

AOB
