ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 147841 UKI-SCOTGRID-GLASGOW less urgent waiting for reply 2020-07-14 14:27:00 UKI-SCOTGRID-GLASGOW: deletion problems

    • Problem with these deletions resolved. A final set of work remains to solve the underlying problem with the files in the namespace.
  • 147792 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-07-13 08:41:00 UKI-NORTHGRID-MAN-HEP deletion errors with message: DavPosix::unlink Authentication error

    • Files likely declared lost; to follow up on the ticket.
  • 147770 UKI-NORTHGRID-LANCS-HEP very urgent in progress 2020-07-13 09:11:00 UKI-NORTHGRID-LANCS-HEP: stage-out failures

    • still seeing periods of overloaded disk servers; plan to help ease our disk server problems later in the week with zfs tuning
    • O(3PB) of new servers with empty space; writes get directed to the empty machines, which overloads them; similar issues seen at GLA and SHEF
    • Discussion on storage solutions and interaction between experiments and sites followed; some notes:
      • Infrastructure and maintenance: difficult to fund.
      • Experiments change their mind, based on current needs … planning for the wrong eventualities
    • Sam: a number of hurdles for new technologies, and assumptions that established code works
      • Experience of setting up a DPM xroot cache: works well
      • Use of cache to ‘change type’, e.g. direct-io to locally staged.
      • GLA, proxy cache to talk to DPM, for local users (set up with existing hardware)
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-07-14 10:00:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”

    • Restarting services in DPM;
    • in progress.
  • 146651 RAL-LCG2 urgent in progress 2020-07-14 09:27:00 singularity and user NS setup at RAL

    • Ticket was updated, largely on hold until other priority upgrades rolled out.
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent on hold 2020-06-24 16:18:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE

    • on hold
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-06-09 07:59:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1

    • on hold
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX

    • Access to the site to re-rack the machines is being planned; requires agreement from the University for non-emergency access.

CPU

  • RAL

    • Largely recovered from quota change, and some network issues
  • Northgrid

    • Lancs no jobs for last couple of days;
  • London

  • SouthGrid

  • Scotgrid

    • ECDF few jobs
    • GLA down to 50% of nominal capacity:
      • O(600) cores off due to problems with AC downtime
      • Ceph queue issues for xrootd transfers:
        • rebuild xrootd plugins using RAL updates
        • Sam to update the Jira
        • If not resolved by Tuesday, fall back to gridFTP

Other new issues

  • New Site monitoring MONIT Page:
  • Lancaster
    • Jobs failing; they appear to hit memory limits, especially (and unusually?) with voms-proxy-init
    • Changed the configuration to allow for softer memory limits; needs to be picked up by Harvester
    • Noted that ARC runtime environment scripts for modifying the environment are useful

Ongoing issues

  • CentOS7 - Sussex

    • Awaiting access
  • Grand Unified queues

    • Awaiting Sheffield.

News round-table

  • Vip

    • OX working OK; problems with Condor upgrade to 8.8.X, not working for ATLAS; went back to 8.6
    • Following up on question of Sheffield
  • Matt

    • NTR
  • Peter

    • Bad disk server; how to identify the device name?
      • Matt, Sam to dig out the recipes (sent via mailing list)
  • Sam

    • NTR
  • Gareth

    • Cores offline from AC issues; access restrictions make things challenging
  • JW

    • Work on TPC ongoing
  • Patrick

    • Access to DC still needed

 

 
