ATLAS HPC Sites: Toward Common Solutions
Description
This meeting will focus on working out the details of how to continue pushing toward common solutions at large HPC facilities, including Harvester (plus PanDA), the Event Service, and containers for software distribution.
Harvester related:
- What features need to be added to or tested in Harvester before it is passed on to OLCF, NERSC, and BNL?
- Pooling Globus Online Stage-in/out
- Andy: SLAC runs an xrootd server at NERSC
- Doug: do we still have a memory usage problem? Taylor: we were running profiling when you saw the 16 GB; I don't think this is an issue.
- Problems with files not being in the RSE even though Harvester and the mini-pilot think everything is OK. Doug will send around some PandaIDs for us to look at. Tadashi is going to check PanDA monitoring to see whether there is a mismatch between the number of jobs Harvester reports as successful and the number PanDA lists as successful (a cross-check sketch follows this list).
- Doug: what do we do about AGIS? If we update AGIS as we did with the DDM setup, how do we propagate that through Harvester and the running jobs without the problems we had before?
- Upload the mini-pilot to git with some documentation, clean it up, and vet custom changes such as the Geant4 exception removal.
- How do we handle planned outages without moving jobs around? We should support short outages (<= 1 day), where jobs should be diverted, and long outages (> 1 day), where jobs can remain in place (see the policy sketch after this list).
- Andrej suggests we may need more entries in AGIS to keep track of this sort of thing.
- Doug: DDM also needs to see these things.
- Are there pieces still missing in PanDA that would help support HPC-type queues?
- Danila: AGIS has a label for HPCs, "pandaResourceType" (see the query sketch after this list).
- Doug: at what point do we no longer need dedicated tasks? Is it when we move to the Event Service? Tadashi: Harvester will remove the case where activated jobs are re-directed.
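As a starting point for Tadashi's cross-check, here is a minimal sketch. It assumes Harvester's list of finished PandaIDs is available in a text file (one ID per line) and that PanDA monitoring (bigpanda.cern.ch) returns JSON for the job page when asked via the Accept header; the file name, endpoint, and returned field layout are assumptions, not confirmed interfaces.

```python
# Sketch: compare jobs Harvester reports as finished against their status
# in PanDA monitoring. Inputs and endpoint details are assumptions.
import json
import urllib.request

MONITOR_URL = "https://bigpanda.cern.ch/job?pandaid={pandaid}"  # assumed endpoint

def panda_status(pandaid):
    """Ask PanDA monitoring for one job's status (assumes a JSON response)."""
    req = urllib.request.Request(
        MONITOR_URL.format(pandaid=pandaid),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Field layout is an assumption: the job record wrapped in a "job" key.
    return data.get("job", {}).get("jobstatus", "unknown")

def find_mismatches(harvester_ids_file):
    """Report PandaIDs that Harvester called successful but PanDA does not."""
    with open(harvester_ids_file) as f:
        finished = [line.strip() for line in f if line.strip()]
    mismatched = [pid for pid in finished if panda_status(pid) != "finished"]
    print(f"Harvester finished: {len(finished)}, PanDA disagrees on: {len(mismatched)}")
    return mismatched

if __name__ == "__main__":
    find_mismatches("harvester_finished_pandaids.txt")  # hypothetical file name
```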
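The outage policy discussed above could be expressed as a simple rule. This is an illustrative sketch only, encoding the thresholds exactly as stated in the minutes (divert on short outages, hold in place on long ones); the function and return labels are hypothetical.

```python
# Sketch: decide what to do with queued jobs for a planned outage,
# following the thresholds from the discussion above.
from datetime import timedelta

SHORT_OUTAGE_LIMIT = timedelta(days=1)

def outage_action(outage_length):
    """Return the action for a planned outage of the given length.

    Per the minutes: short outages (<= 1 day) divert jobs elsewhere;
    long outages (> 1 day) leave jobs in place to run after recovery.
    """
    if outage_length <= SHORT_OUTAGE_LIMIT:
        return "divert"
    return "hold_in_place"

print(outage_action(timedelta(hours=6)))  # divert
print(outage_action(timedelta(days=3)))   # hold_in_place
```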
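To illustrate Danila's point, here is a sketch of pulling HPC queues out of AGIS. The AGIS panda-queue JSON endpoint shown, the shape of the response (a dict keyed by queue name), and the exact value used for pandaResourceType on HPC queues are assumptions to be checked against the live AGIS instance.

```python
# Sketch: list PanDA queues that AGIS labels as HPC resources.
# Endpoint, response shape, and the exact HPC label value are assumptions.
import json
import urllib.request

AGIS_URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"

def hpc_queues():
    """Return queue names whose pandaResourceType marks them as HPC."""
    with urllib.request.urlopen(AGIS_URL) as resp:
        queues = json.load(resp)  # assumed: dict keyed by queue name
    return [
        name
        for name, info in queues.items()
        if (info.get("pandaResourceType") or "").lower() == "hpc"  # assumed value
    ]

if __name__ == "__main__":
    for name in sorted(hpc_queues()):
        print(name)
```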
Event Service:
- How do we arrange merge jobs on site?
- Tadashi: there is a catchall flag in PanDA for setting this (see the sketch after this list).
- Vakho: for HPC we need to make sure we are using Jumbo Jobs.
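Tadashi's catchall is a free-form field in the queue configuration, conventionally a comma-separated list of tokens; a minimal parsing sketch follows. The flag name "localEsMerge" is hypothetical, used only to show the convention; the real token must come from the PanDA documentation.

```python
# Sketch: check a queue's free-form "catchall" field for a flag that
# enables running Event Service merge jobs on site. The flag name
# "localEsMerge" is hypothetical.
def parse_catchall(catchall):
    """Split a catchall string like 'flagA,key=value,flagB' into a dict."""
    settings = {}
    for token in catchall.split(","):
        token = token.strip()
        if not token:
            continue
        key, _, value = token.partition("=")
        settings[key] = value if value else True
    return settings

def merge_on_site(catchall):
    """True if the (hypothetical) localEsMerge flag is present."""
    return bool(parse_catchall(catchall).get("localEsMerge"))

print(merge_on_site("localEsMerge,jobseed=es"))  # True
print(merge_on_site("jobseed=es"))               # False
```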
Software Distribution related:
- How will the production of containers work?
- See Wei's slides.
- What releases will be included?
- Wei: all releases, since container size does not seem to matter for our HPC sites.
- How do we distribute them?
- Using Globus Online or another form of copy; they can be pushed by the system described in Wei's slides (see the transfer sketch after this list).
- Containers are not yet on the worker nodes of Titan or Theta, but both operations groups have test installs.
- Is there a reason to move to one X.Y.Z release per container instead of fat containers?
- Rob: why not? They are smaller, and you would lower the threshold for becoming an opportunistic site.
- Rob: This is related to DDC, yes?
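For the distribution step, here is a sketch of pushing a container image with the Globus CLI ("globus transfer SRC_ENDPOINT:PATH DST_ENDPOINT:PATH"). The endpoint UUIDs, paths, and image file name are placeholders, and whether the sites accept pushes this way is an assumption.

```python
# Sketch: push one container image to an HPC site via the Globus CLI.
# Endpoint UUIDs and paths below are placeholders, not real site values.
import subprocess

SRC_ENDPOINT = "aaaaaaaa-0000-0000-0000-000000000000"  # placeholder: source site
DST_ENDPOINT = "bbbbbbbb-0000-0000-0000-000000000000"  # placeholder: HPC site

def push_container(image, dest_dir):
    """Submit an asynchronous Globus transfer for one container image."""
    cmd = [
        "globus", "transfer",
        f"{SRC_ENDPOINT}:/containers/{image}",
        f"{DST_ENDPOINT}:{dest_dir}/{image}",
        "--label", f"container push {image}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical image name and destination directory.
    push_container("atlas-release-21.0.15.img", "/global/common/atlas/containers")
```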