ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147299 UKI-NORTHGRID-LANCS-HEP less urgent in progress 2020-06-03 23:12:00 UKI-NORTHGRID-LANCS-HEP: deletion errors

    • Heading on-site to understand the problem; possibly the disk has died, with ~10TB of data lost
  • 146918 UKI-SCOTGRID-ECDF less urgent in progress 2020-06-02 10:46:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy

    • Pushed back due to other Edinburgh priorities
  • 146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-06-02 10:30:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Pushed back due to other Edinburgh priorities
  • 146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL

    • Work ongoing to use unprivileged mode.
  • 146525 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 16:12:00 UKI-NORTHGRID-SHEF-HEP: evicted jobs

    • Active interactions with the NorduGrid mailing lists; discussion of the deprecation of LCMAPS and its possible replacements
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-05-15 16:11:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • As above
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP

    • On hold
  • 145510 RAL-LCG2 urgent on hold 2020-05-13 13:07:00 RAL-LCG2: timeouts on stage-in/outs

    • Will aim to close this week
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-02-17 09:51:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • On hold
  • 142329 UKI-SOUTHGRID-SUSX top priority reopened 2020-06-01 08:27:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • To be put on hold; awaiting rollout of all nodes; might require physical access

CPU

  • RAL

  • Northgrid

    • MAN - walltime units changed from seconds to minutes (within ATLAS); change reverted.
    • Lancaster’s drop was due to an IPv6 problem over the weekend
  • London

  • SouthGrid

  • Scotgrid
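
The Manchester walltime item above is a units mix-up: the same number means something 60 times larger when read as minutes instead of seconds. A minimal sketch of the effect, assuming a hypothetical limit of 172800 (the real value is not recorded in these minutes) and a hypothetical helper name:

```python
# Hypothetical illustration: the same walltime number read as seconds vs. minutes.
SECONDS_PER_MINUTE = 60
SECONDS_PER_HOUR = 3600


def walltime_in_hours(value: int, unit: str) -> float:
    """Convert a walltime value to hours, given its unit ('seconds' or 'minutes')."""
    if unit == "seconds":
        return value / SECONDS_PER_HOUR
    if unit == "minutes":
        return value * SECONDS_PER_MINUTE / SECONDS_PER_HOUR
    raise ValueError(f"unknown unit: {unit}")


limit = 172800  # assumed example value, not taken from the minutes
as_seconds = walltime_in_hours(limit, "seconds")  # intended reading
as_minutes = walltime_in_hours(limit, "minutes")  # reading after the unit change
print(f"read as seconds: {as_seconds} h, read as minutes: {as_minutes} h")
```

Any limit or accounting figure that crosses the change without conversion is off by the same factor of 60, which is consistent with the change being reverted.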

Other new issues

  • Request sent from ATLAS to restart squids due to residual issues with DB / frontier problems from previous weeks

Ongoing issues

  • CentOS7 DPM Lancs

    • No change to plans
  • CentOS7 - Sussex

    • As in GGUS discussion
  • Glasgow Ceph storage

    • xroot messages and troubleshooting are tricky.
    • External - should be ok (gridFTP, maybe also xrootd external).
    • Internal - bandwidth: 30GB/s, 3x 10GB links.
  • Grand Unified queues

    • Awaiting Sheffield
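
The Glasgow internal bandwidth figure above is the sum of the three links. A quick check of the arithmetic, assuming each link is 10 gigabit per second (the minutes write "GB"; if bits are meant, the byte rate is eight times lower than the quoted number suggests):

```python
# Aggregate bandwidth of the internal links; the per-link rate is an assumption.
LINKS = 3
LINK_RATE_GBIT = 10  # assumption: each link carries 10 gigabit/s

aggregate_gbit = LINKS * LINK_RATE_GBIT  # total in gigabit/s
aggregate_gbyte = aggregate_gbit / 8     # same total in gigabyte/s
print(f"{aggregate_gbit} Gbit/s = {aggregate_gbyte} GByte/s")
```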

News round-table

  • Vip
  • Dan
    • LCMAPS will become deprecated, what will be the solution?
    • Updated mount points - perhaps higher rates of failures
  • Matt
    • NTR
  • Peter
    • Re-opening questions; sites; lots of online teaching; re-opening will be cautious
  • Alessandra
    • NTR
  • Sam
    • NTR
  • Gareth
    • NTR
  • Tim
    • TPC: initially running on the wrong server; now on test (more allowed connections)
    • RAL as source is fine, RAL as dest. fails; two transfers trying to access the same file
    • When RAL is not the dest. it is not the active party; TPC uses pulling, so the dest. gets the file from the source
  • JW
    • NTR
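
Tim's TPC notes describe pull mode, where the destination is the active party and fetches from the source. A toy sketch of why problems would show up only when a site is the destination, assuming the failure is two transfers targeting the same file (the minutes do not give the actual mechanism; all class and method names are hypothetical):

```python
# Toy model of pull-mode third-party copy (TPC); all names are hypothetical.

class Endpoint:
    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})  # path -> content
        self.open_for_write = set()     # paths with a transfer in flight

    def read(self, path):
        """Passive role: the source only serves reads."""
        return self.files[path]

    def begin_pull(self, source, path):
        """Active role: the destination opens the target and reads from the source."""
        if path in self.open_for_write:
            raise RuntimeError(f"{self.name}: {path} is already being written")
        self.open_for_write.add(path)
        return source.read(path)

    def finish_pull(self, path, data):
        """Store the fetched data and release the target path."""
        self.files[path] = data
        self.open_for_write.discard(path)


source = Endpoint("source", {"/atlas/f1": b"payload"})
dest = Endpoint("dest")

data = dest.begin_pull(source, "/atlas/f1")  # first transfer is in flight
try:
    dest.begin_pull(source, "/atlas/f1")     # second transfer to the same file
    second_failed = False
except RuntimeError:
    second_failed = True
dest.finish_pull("/atlas/f1", data)
print("second transfer failed:", second_failed)
```

In this model the source never tracks in-flight transfers, so a site acting only as source sees no conflict, while the destination holds the connection and the write state.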
    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m

        As written last Friday (29 May): due to the problem with squid version 4.11-2.1, could you please restart your local site squids, if you have not done so already, to mitigate the job failures we are seeing with the latest squid version?

        We still have many site squids to restart as seen in the plot (thanks Michal), the object counts drop upon restart:

        http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas2/weeklyObjects.html

        • Request sent from ATLAS to restart squids due to residual issues with DB / frontier problems from previous weeks

         

    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m