ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147436 UKI-SOUTHGRID-RALPP less urgent in progress 2020-06-15 14:58:00 UK UKI-SOUTHGRID-RALPP failing deletions

    • Argus server down from power problems; also disk server out of action; needs physical access (this week).
  • 147390 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-09 10:48:00 Failovers from jobs running at UKI-SCOTGRID-GLASGOW_CEPH to CERN backup proxy

    • Progress made; test machine working; need to plan how to roll back the change once a permanent solution is found.
  • 147361 UKI-SCOTGRID-GLASGOW less urgent in progress 2020-06-15 15:13:00 Deletion errors at UKI-SCOTGRID-GLASGOW

    • Site needs to delete files from the namespace; general strategy:
      • Have the site clean the namespace from any leftovers.
      • Have the site produce storage dumps.
      • Run a consistency check.
      • Declare any missing files as lost.
  • 146771 UKI-SCOTGRID-ECDF less urgent on hold 2020-06-16 15:41:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Power outage; pushed forward migration of DPM to CentOS7; will monitor the situation.
  • 146651 RAL-LCG2 urgent in progress 2020-05-27 10:43:00 singularity and user NS setup at RAL

    • no update
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-06 23:57:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • on hold for ce6; Elena working hard to make progress
  • 145688 UKI-NORTHGRID-MAN-HEP less urgent on hold 2020-04-02 09:20:00 Very old version of squids at UKI-NORTHGRID-MAN-HEP

    • on hold
  • 145510 RAL-LCG2 urgent in progress 2020-06-18 05:50:00 RAL-LCG2: timeouts on stage-in/outs

    • Ticket updated. Currently no spike in timeouts; with the switch to direct-io for user jobs, the error rate should be quantified.
    • Set to in progress; aim to close the ticket once the direct-io studies are done.
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • Keep open until move is complete
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • on hold
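
As an illustration of the consistency-check step in the strategy noted for ticket 147361, a minimal sketch follows. It assumes the site's storage dump and the catalogue listing are both available as plain lists of file paths; the function name `compare_dumps` and the example paths are hypothetical and not taken from any ATLAS or site tool. A real check would also need to allow for files written or deleted between the two dumps.

```python
def compare_dumps(catalogue_paths, storage_paths):
    """Compare a catalogue listing with a storage dump.

    Returns two sorted lists:
      missing: paths registered in the catalogue but absent from storage
               (candidates to be declared lost)
      dark:    paths present on storage but unknown to the catalogue
               (leftovers for the site to clean from the namespace)
    """
    # Normalise whitespace and ignore empty lines, as dumps are often text files.
    catalogue = {p.strip() for p in catalogue_paths if p.strip()}
    storage = {p.strip() for p in storage_paths if p.strip()}

    missing = sorted(catalogue - storage)
    dark = sorted(storage - catalogue)
    return missing, dark


if __name__ == "__main__":
    # Hypothetical example paths, for illustration only.
    catalogue = ["/atlas/data/file_a", "/atlas/data/file_b", "/atlas/data/file_c"]
    storage = ["/atlas/data/file_b", "/atlas/data/file_c", "/atlas/data/leftover_x"]

    missing, dark = compare_dumps(catalogue, storage)
    print("Missing (declare lost):", missing)
    print("Dark (clean up):", dark)
```

Set differences keep the comparison simple and fast for dumps that fit in memory; very large dumps might instead be compared as sorted files.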

CPU

  • RAL

  • Northgrid

  • London

  • SouthGrid

  • Scotgrid

    • ECDF recovered from downtime; still deletion errors
    • Glasgow AC recovered - up to 25k jobs

Other new issues

Ongoing issues

  • CentOS7 DPM Lancs

    • It does not look like our SL6 CE will be able to talk to the new filesystem.
    • Next week is downtime for the cluster, during which our worker nodes will all be reinstalled and a new back-end shared filesystem will be put in place.
    • Plan is to set up an ARC CE by lunchtime tomorrow, as the old SL6 CREAM CE will likely stop being able to talk to our cluster after next week’s downtime.
    • Please can ATLAS UK be ready to add the new endpoints to AGIS for us.
  • CentOS7 - Sussex

    • Awaiting updating
  • Glasgow Ceph storage

    • RAL xrootd was tried
      • Identified that some performance-tuning options, under high load, caused too many concurrent threads and truncation of the cached files.
    • Moved back to on-disk cache (on SSDs).
    • Getting data to the jobs looks much better now.
    • Staging back from jobs and redirection are next, to work across all three caches (one currently running).
    • Ceph tuning for timeouts; stability improving; awaiting an updated Nautilus release for fixes to some current workarounds.
    • Bandwidth looks good and is maxing out the gridFTP box.
    • Different versions of xrootd on the various services: gateway 4.12.2, cache 4.11.3; aim to upgrade when possible.
  • Grand Unified queues

    • Awaiting SHEF

News round-table

  • Vip

    • NTR; passed on information that the ATLAS timeline for site decommissioning to storageless is typically 3-6 months.
  • Matt

    • Microphone problems; comments above passed by chat
  • Sam

    • NTR
  • Gareth

    • NTR
  • Tim

    • RAL should try to get TPC running; it is an ATLAS priority.
  • JW

    • Will concentrate on direct-io tests to close the RAL ticket.

 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m
    • 10:20 10:40
      Ongoing issues 20m

    • 10:40 10:50
      News round-table 10m

    • 10:50 11:00
      AOB 10m