ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

Outstanding tickets

  • 148234 RAL-LCG2 less urgent in progress 2020-08-12 10:38:00 RAL-LCG2 deletion errors
    • Deletions into Echo have a 10% failure rate; just a load issue? Failed deletions do complete
  • 148228 UKI-SOUTHGRID-OX-HEP less urgent waiting for reply 2020-08-12 10:17:00 UKI-SOUTHGRID-OX-HEP transfer failures as destination
    • To Close
  • 148169 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-05 10:25:00 Failovers from jobs running at UKI-SCOTGRID-ECDF_CLOUD to CERN backup proxy
    • Follow-up
  • 147979 UKI-NORTHGRID-MAN-HEP less urgent in progress 2020-08-04 09:28:00 UKI-NORTHGRID-MAN-HEP timeout transfer errors and also deletion errors
    • Follow-up
  • 146771 UKI-SCOTGRID-ECDF less urgent in progress 2020-08-10 10:23:00 UKI-SCOTGRID-ECDF deletion failures with “The requested service is not available at the moment.”
    • Mitigation still working; still exploring the main solution
  • 146651 RAL-LCG2 urgent on hold 2020-08-10 10:59:00 singularity and user NS setup at RAL
    • RAL is the last big site to provide this; impacting containerised workflow jobs
  • 146374 UKI-NORTHGRID-SHEF-HEP urgent in progress 2020-07-22 14:53:00 ATLAS pilot jobs idle on UKI-NORTHGRID-SHEF-HEP CE
    • Some test jobs through, but still issues
  • 144759 UKI-SCOTGRID-GLASGOW less urgent on hold 2020-08-10 09:54:00 High traffic from UKI-SCOTGRID-GLASGOW on RAL CVMFS Stratum1
    • Update to ticket; Restrictions on access; dealing with admin to get relevant systems into place
  • 142329 UKI-SOUTHGRID-SUSX top priority on hold 2020-06-04 14:05:00 CentOS7 migration UKI-SOUTHGRID-SUSX
    • Access to the data centre is now feasible. Need to consolidate pieces of kit delivered to various places and start preparing the new node.

CPU

  • Bad Rucio version 1.23.0 rolled out on Tuesday

    • Affected Tokyo and UK, likely due to mixed protocol (read/write) and write/stat race conditions?
    • Wednesday pm: misconfiguration in AGIS for site RSE caused mass HC blacklisting; quickly resolved.
  • RAL

    • Small dip due to Rucio, but slow recovery from ATLAS issues
  • Northgrid

    • Manchester recovering slowly?
  • London

    • QMUL low jobs still; largely AC related
  • SouthGrid

  • Scotgrid

    • Durham low jobs
    • ECDF - (CLOUD in test)
      • Believed to be interference between the CLOUD scheduler and OpenStack
    • DPM; PanDA stopped sending jobs to Kelvin for a short time; infrequent but previously seen issue

Other new issues

Ongoing issues

  • CentOS7 - Sussex
    • as discussed above
  • Grand Unified queues
    • Awaiting Sheffield

News round-table

  • Vip

    • 896 threads added to the pool
    • Noted lower efficiency; GR pointed out this may just be from the increase in reco jobs
  • Dan

    • AC issues, but more nodes should now be available
  • Peter

    • NTR
  • Sam

    • XRootD: is ATLAS seeing similar issues to LHCb with streaming?
      • JW: do see some error rate in user jobs (using direct-IO)
      • Recent case of a production job now running in direct-IO, with a similar issue
  • Gareth

    • Noted wrt job efficiency:
      • special evgen? Some jobs may try to take two threads
      • Reco jobs can hit efficiency (JW: increased running due to reprocessing campaigns)
    • Performance improvements planned for Ceph / infrastructure / bonding networking; ‘timescale’
    • 1400 cores; starting to hit the GridFTP limits
  • JW

    • NTR
  • Patrick

    • NTR

 

    • 10:00 10:20
      Status 20m
      • Outstanding tickets 10m
      • CPU 5m
      • Other new issues 5m
    • 10:20 10:40
      Ongoing issues 20m
    • 10:40 10:50
      News round-table 10m
    • 10:50 11:00
      AOB 10m