ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB)), Stewart Martin-Haugh (Science and Technology Facilities Council STFC (GB)), James William Walder (Science and Technology Facilities Council STFC (GB))

● Outstanding tickets

  • GGUS #146110 (RALPP): FTS transfer failures; dCache processes were restarted and transfers now look OK.
  • GGUS #146057 (ECDF): Site currently in downtime; to be discussed at the next Ops meeting.
  • GGUS #145614 (Manchester): -> follow up with Alessandra on whether the ticket can be closed.
  • GGUS #145688 (Manchester): Squid issues; last input on the 17th; needs a reply.
  • GGUS #145971 (Manchester): appears to be an SKA ticket; "Concerned VO: atlas" may have been added by accident.
  • GGUS #145510 (RAL): No progress; stage-out issues similar to those at other sites. Stage-in: AOD merge jobs are suffering from the same effect.
  • GGUS #146029 (Glasgow): Deletion errors; a high deletion rate appears to eject disks. Site set offline for deletions.
    • -> JIRA ticket to be raised on the ATLAS side
  • GGUS #144759 (Glasgow): on hold; higher priorities.

● CPU

  • RAL consistently met its CPU pledge this week.
  • NorthGrid:
    • Sheffield has problems (not ATLAS-specific). -> email TB-support
  • ScotGrid:
    • Glasgow: an additional 2k cores will be available once Ceph is in production.

● Other Issues

  • RHUL: DPM issues; progress made via the dpm-users forum.
  • IC: Replacing CREAM CEs with HTCondor CEs. The new CEs appear in GOCDB, but AGIS could not be updated. -> contact ATLAS experts

● CentOS7 - Sussex

Hardware upgrades are complete; now debugging configuration problems (e.g. a memory limit of 2 GB).
As a storageless site, Sussex will use the QMUL data endpoints.
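
A quick way to debug a cap like the 2 GB memory limit mentioned above is to check the limits a batch job would inherit on a worker node. This is a generic sketch, not the actual Sussex procedure:

```shell
# Inspect the process limits a job would inherit on this node.
ulimit -v          # per-process virtual-memory limit in kB ("unlimited" if none)
ulimit -m          # resident-set-size limit, where the kernel honours it
# If the node uses cgroups v1, the memory controller's cap is visible here:
if [ -r /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
fi
```

Comparing these values inside and outside the batch system usually shows whether the cap comes from the batch configuration or the node itself.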


● Glasgow Ceph storage

  • New resilient pool created; the hope is that moving the data across to the new pool can be transparent to ATLAS, with the new pool then renamed to the original name; O(60 TB) of ATLAS data is stored.
    • If this proves too complex, ATLAS will be asked to move the data.
  • gridFTP and XRootD services to be set up on machines separate from the Ceph operations machines.
    • Running them on the same machines caused problems for Ceph.
  • XRootD to be used internally for stage-in; gridFTP for stage-out and external transfers.
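
The service split above could be expressed in an XRootD configuration along these lines. This is only an illustrative sketch: the port, export path, and plugin name are assumptions, not the actual Glasgow configuration.

```
# Illustrative xrootd config for the internal stage-in endpoint,
# running on a host separate from the Ceph operations machines.
xrd.port 1094
all.export /atlas              # namespace exported for ATLAS stage-in
# Back the storage layer with Ceph via an OSS plugin (plugin name assumed):
ofs.osslib libXrdCeph.so
```

Keeping the gateway daemons on dedicated hosts means a burst of transfer load cannot starve the Ceph operations machines, which is the problem noted above.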

● Grand Unified queues

  • UK migration to Grand Unified (GU) queues is beginning (QMUL already migrated).
  • A GU spreadsheet tracks migration progress.
    • Once operational, it should still be possible to distinguish production and pilot jobs.
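
Even after queues are unified, job records can still be split back into production and analysis streams by their metadata. A minimal sketch, assuming PanDA-style records with a `prodsourcelabel` field (the field name and its values are assumptions based on common PanDA conventions, not taken from these minutes):

```python
# Sketch: classifying jobs from a Grand Unified queue as production
# vs user/analysis by their source label (field name assumed).

def classify(job: dict) -> str:
    """Return 'production' or 'analysis' for a unified-queue job record."""
    label = job.get("prodsourcelabel", "")
    return "production" if label == "managed" else "analysis"

jobs = [
    {"pandaid": 1, "prodsourcelabel": "managed"},
    {"pandaid": 2, "prodsourcelabel": "user"},
]
counts: dict[str, int] = {}
for j in jobs:
    kind = classify(j)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)  # {'production': 1, 'analysis': 1}
```

Accounting per job type then works exactly as before the merge, just downstream of a single queue.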

● News round-table

  • Dan: things looking stable. Planning to move to new Lustre system. Working on local monitoring.
  • Elena: NTR
  • Gareth: NTR
  • James: NTR
  • Matt: NTR
  • Peter: NTR
  • Stewart: NTR
  • Tim: NTR
  • Vip: Oxford has been asked to work from home, as have others.