ATLAS UK Cloud Support

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147841 UKI-SCOTGRID-GLASGOW less urgent waiting for reply 2020-07-14 14:27:00 UKI-SCOTGRID-GLASGOW: deletion problems

    • Problem with these deletions resolved. A final set of work remains to solve the underlying problem with the files in the namespace
  • 147792 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-07-13 08:41:00 UKI-NORTHGRID-MAN-HEP deletion errors with message: DavPosix::unlink Authentication error

    • Files likely to be declared lost; to follow up on the ticket.
  • 147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-13 09:11:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures

    • Still seeing periods of overloaded disk servers; ZFS tuning is planned later in the week to help ease the disk server problems
    • O(3 PB) of new servers with empty space; the system preferentially writes to the empty machines and hence overloads them; similar issues seen at GLA and SHEF
    • Discussion on storage solutions and the interaction between experiments and sites followed; some notes:
      • Infrastructure and maintenance are difficult to fund
      • Experiments change their minds based on current needs … planning for the wrong eventualities
    • Sam: a number of hurdles for new technologies, and assumptions that established code will just work
      • Experience with setting up a DPM xrootd cache: works well
      • A cache can be used to ‘change type’, e.g. from direct I/O to locally staged access
      • GLA: a proxy cache in front of DPM for local users, set up with existing hardware (a rough access sketch follows after the ticket list)
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-07-14 10:00:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Restarting services in DPM; in progress.
  • 146651 RAL-LCG2 urgent in progress 2020-07-14 09:27:00 singularity and user NS setup at RAL

    • Ticket was updated; largely on hold until other priority upgrades are rolled out.
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • on hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Access to the site to re-rack the machines is being planned; this requires agreement from the University for non-emergency access.
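
A rough illustration of the proxy-cache access pattern discussed under ticket 147770 (an assumption about how such a setup is used, not the actual GLA configuration): local users read through a cache host rather than the DPM head node directly, so repeated reads are served from the cache's local disk. The hostnames and file path below are placeholders, and the sketch assumes the XRootD Python bindings are installed.

    # Illustrative sketch only: read a file via a local xrootd proxy cache
    # instead of contacting the DPM head node directly.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    # Placeholder endpoint: swap in the real cache hostname and file path.
    CACHE_URL = "root://xcache.example.ac.uk:1094//dpm/example.ac.uk/home/atlas/some/file"

    def read_first_block(url, nbytes=1024 * 1024):
        """Open the file read-only and return its first nbytes."""
        f = client.File()
        status, _ = f.open(url, OpenFlags.READ)
        if not status.ok:
            raise RuntimeError(status.message)
        status, data = f.read(offset=0, size=nbytes)
        f.close()
        return data

    # Local users point at the cache; the cache fetches from DPM on a miss.
    print(len(read_first_block(CACHE_URL)), "bytes read via the proxy cache")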

CPU

  • RAL

    • Largely recovered from the quota change and some network issues
  • Northgrid

    • Lancs: no jobs for the last couple of days
  • London

  • SouthGrid

  • Scotgrid

    • ECDF: few jobs
    • GLA down to 50% of nominal capacity:
      • O(600) cores offline due to air-conditioning (AC) problems and the associated downtime
      • Ceph queue issues for xrootd transfers:
        • Rebuild the xrootd plugins using RAL's updates
        • Sam to update the Jira ticket
        • If not resolved by Tuesday, fall back to GridFTP

Other new issues

  • New site monitoring MONIT page
  • Lancaster
    • Jobs failing; they appear to hit memory limits, especially (and unusually?) during voms-proxy-init
    • An example multicore job that finished successfully: https://bigpanda.cern.ch/job?pandaid=4788609755 - a clue, since multicore jobs carry an effective 8x multiplier on their memory request, and it supports James' theory that the jobs cannot obtain enough memory
    • Nothing obvious in the queue's AGIS entry (http://atlas-agis.cern.ch/agis/pandaqueue/detail/UKI-NORTHGRID-LANCS-HEP/full/), but the pilot JDLs request memory=1500 (e.g. https://aipanda157.cern.ch/condor_logs_2/20-07-14_11/grid.1305756.1.jdl), whereas a Glasgow JDL requests 2000 (https://aipanda157.cern.ch/condor_logs_2/20-07-16_07/grid.1325252.2.jdl); it is not clear where this is set, but Lancs would like at least 3000 memory units (a comparison sketch follows after this list)
    • Changed the configuration to allow for softer memory limits; needs to be picked up by Harvester
    • Noted that ARC runtime environment scripts for modifying the job environment are useful
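
As a quick way to reproduce the JDL comparison above, the sketch below pulls the two pilot JDLs quoted in the minutes and prints their memory settings. This is an illustration only: it assumes the aipanda log URLs are still reachable and readable without authentication, which may no longer be the case.

    # Hedged sketch: compare the 'memory' setting in the pilot JDLs quoted above.
    import re
    import urllib.request

    JDLS = {
        "UKI-NORTHGRID-LANCS-HEP": "https://aipanda157.cern.ch/condor_logs_2/20-07-14_11/grid.1305756.1.jdl",
        "UKI-SCOTGRID-GLASGOW": "https://aipanda157.cern.ch/condor_logs_2/20-07-16_07/grid.1325252.2.jdl",
    }

    for site, url in JDLS.items():
        with urllib.request.urlopen(url) as resp:
            jdl = resp.read().decode("utf-8", errors="replace")
        match = re.search(r"memory\s*=\s*(\d+)", jdl, re.IGNORECASE)
        print(site, match.group(1) if match else "no memory setting found")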

Ongoing issues

  • CentOS7 - Sussex

    • Awaiting access
  • Grand Unified queues

    • Awaiting Sheffield

News round-table

  • Vip

    • OX working OK; problems with the Condor upgrade to 8.8.X (not working for ATLAS); went back to 8.6
    • Following up on the question of Sheffield
  • Matt

    • NTR
  • Peter

    • Bad disk server; how to identify the device name?
      • Matt and Sam to dig out the recipes (sent via the mailing list); a rough sketch of one approach follows after this round-table section
  • Sam

    • NTR
  • Gareth

    • Cores offline from AC issues; access restrictions make things challenging
  • JW

    • Work on TPC ongoing
  • Patrick

    • Access to DC still needed
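
Pending the recipes Matt and Sam will circulate, below is a minimal sketch of one common approach (an assumption, not their recipe): match the serial number reported by a SMART or RAID-controller alert against lsblk's JSON output on the disk server to recover the Linux device name.

    # Minimal sketch (not the circulated recipe): map a drive serial number,
    # e.g. taken from a SMART alert, to its device name via lsblk (util-linux).
    import json
    import subprocess
    import sys

    def device_for_serial(serial):
        out = subprocess.run(
            ["lsblk", "--json", "--nodeps", "-o", "NAME,SERIAL,MODEL,SIZE"],
            check=True, capture_output=True, text=True,
        ).stdout
        for dev in json.loads(out)["blockdevices"]:
            if (dev.get("serial") or "").strip() == serial:
                return "/dev/" + dev["name"]
        return None

    if __name__ == "__main__":
        print(device_for_serial(sys.argv[1]) or "serial not found")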

 

 
