ATLAS UK Cloud Support

Europe/London
Vidyo

Tim Adye (Science and Technology Facilities Council STFC (GB))
ATLAS UK Cloud Support meeting minutes, 21 February 2019

Present: Alessandra Forti, Brian Davies, Dan Traynor, Elena Korolkova, Gareth Roy, Matt Doidge, Peter Love, Rob Currie, Sam Skipsey, Tim Adye

Diskless Sites:

We had a discussion about what we (ATLAS UK) want to do about diskless sites, starting from Stephane Jézequel's ADC Weekly slides (also attached to this Indico page). Here are some of the main points from a lively and lengthy discussion.

* "Lightweight" sites with small disk can still provide significant CPU. For example, Durham is listed as having 192 TB (currently 302 TB), but provides 2.7% of ATLAS UK CPU. See chart.
* We probably have three classes of Tier-2s: large (>520 TB), medium (<520 TB, but significant CPU), and small. The small sites can probably directly access data at another site (but check network connectivity). The medium sites will probably need a disk buffer.
* Gareth suggested referring to a "buffer" rather than a "cache". The purpose of the buffer is to prestage files for directio access. It's most efficient if this can be coordinated with job starts, as ARC-CE does (e.g. currently at Durham). XCache may still help, but it is not coordinated with job starts.
* For small sites with directio over the WAN, we need to find another site close by that can handle the load on the storage. This could be a particular problem for Southgrid, which has several small sites.
* Sam is developing a network map of who is close to whom, though network topologies can change with time.
* We need better monitoring. Maybe we can get something out of PanDA or Monit, but it needs someone to do the research (not a trivial job).
* Any disk freed up by these moves can be allocated as LOCALGROUPDISK.
* Sheffield will have to reduce to 150 TB, but has 800 job slots. This is a good candidate for going diskless (no buffer to start with), accessing data from Lancaster and/or Manchester.
* Birmingham should have 600 TB (700-800 TB next year). We can use this! It is large enough for a DATADISK.
* Agreed to discuss with ATLAS at the Sites Jamboree (5-8 March). We can then bring it back to GridPP.
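
The three Tier-2 classes discussed above can be sketched as a simple rule. This is only an illustration, not an agreed policy: the 520 TB boundary comes from the discussion, but the minutes do not define "significant CPU", so the `significant_cpu_pct` threshold (2.0% here) and all function and variable names are hypothetical. A site at exactly 520 TB is not covered by the minutes and is treated as not large.

```python
from typing import Optional

# Disk boundary between "large" and the other classes, as quoted in the discussion.
LARGE_DISK_TB = 520.0

# Storage model suggested for each class in the discussion.
STORAGE_PLAN = {
    "large": "keep DATADISK",
    "medium": "disk buffer, prestaged for directio",
    "small": "diskless, directio from a nearby site",
}


def classify_tier2(
    disk_tb: float,
    cpu_share_pct: Optional[float] = None,
    significant_cpu_pct: float = 2.0,  # hypothetical threshold, not from the minutes
) -> str:
    """Return 'large', 'medium' or 'small' for a Tier-2 site.

    disk_tb: site disk in TB.
    cpu_share_pct: share of ATLAS UK CPU in percent, or None if unknown.
    """
    if disk_tb > LARGE_DISK_TB:
        return "large"
    if cpu_share_pct is not None and cpu_share_pct >= significant_cpu_pct:
        return "medium"
    return "small"


if __name__ == "__main__":
    # Disk and CPU figures are those quoted above; CPU share is given only for Durham.
    for name, disk, cpu in [
        ("Birmingham", 600.0, None),
        ("Durham", 302.0, 2.7),
        ("Sheffield", 150.0, None),
    ]:
        site_class = classify_tier2(disk, cpu)
        print(f"{name}: {site_class} ({STORAGE_PLAN[site_class]})")
```

With an unknown CPU share a site below the disk boundary falls through to "small", so the result for Sheffield reflects missing input rather than a judgement about its 800 job slots.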

Other urgent matters:

* Gareth asked about setting up a new space token for better LOCALGROUPDISK access. Alessandra said this was tricky.
* Dan sent round a list of new queues for CentOS7 at QMUL. Elena will add them to PanDA. Alessandra asked to keep the GPU test queue.
* Matt: Just a note that Lancaster will have a downtime in the next week or so for electrical work - no more melting Commando sockets. Just finalising the exact dates for the work.

Agenda:

* 10:00-10:20 Diskless sites discussion (20m)
* 10:20-10:30 Outstanding tickets (10m)
* 10:30-10:40 Ongoing issues (10m)
* 10:40-10:50 News round-table (10m)
* 10:50-11:00 AOB (10m)