ATLAS HPC Sites: Toward Common Solutions
Description
This meeting will focus on working out the details of how to continue pushing toward common solutions at large HPC facilities, including Harvester (plus PanDA), the Event Service, and containers for software distribution.
Harvester related:
- What features need to be added to or tested in Harvester before it is passed on to OLCF, NERSC, and BNL?
- Pooling Globus Online Stage-in/out
- Andy: SLAC runs an xrootd server at NERSC
- Doug: do we still have a memory usage problem? Taylor: we were running profiling when you saw the 16 GB; I don't think this is an issue.
- Problems with files not being in the RSE even though Harvester and the mini-pilot think everything is OK. Doug will send around some PandaIDs for us to look at. Tadashi is going to check PanDA monitoring to see whether there is a mismatch between the number of jobs Harvester reports as successful and the number PanDA lists as successful (a cross-check sketch follows this list).
- Doug: what do we do about AGIS? If we update AGIS as we did with the DDM setup, how do we propagate that through Harvester and the running jobs without the problems we had before?
- Upload the mini-pilot to git with some documentation, clean it up, and vet custom changes such as the Geant4 exception removal.
- How do we handle planned outages without moving jobs around? We should support short outages (<= 1 day), where jobs should be diverted, and long outages (> 1 day), where jobs can remain in place (see the policy sketch after this list).
- Andrej suggests we may need more entries in AGIS to keep track of this sort of thing.
- Doug: DDM also needs to see these things.
- Are there pieces still missing in PanDA that would help support HPC-type queues?
- Danila: AGIS has a label for HPCs, "pandaResourceType" (see the query sketch after this list).
- Doug: at what point do we no longer need dedicated tasks? Is it when we move to the Event Service? Tadashi: Harvester will remove the case where activated jobs are re-directed.
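As a starting point for Tadashi's cross-check, here is a minimal sketch. It assumes Harvester's list of finished PandaIDs is available in a text file (one ID per line) and that PanDA monitoring (bigpanda.cern.ch) returns JSON for the job page when asked via the Accept header; the file name, endpoint, and returned field layout are assumptions, not confirmed interfaces.

```python
# Sketch: compare jobs Harvester reports as finished against their status
# in PanDA monitoring. Inputs and endpoint details are assumptions.
import json
import urllib.request

MONITOR_URL = "https://bigpanda.cern.ch/job?pandaid={pandaid}"  # assumed endpoint

def panda_status(pandaid):
    """Ask PanDA monitoring for one job's status (assumes a JSON response)."""
    req = urllib.request.Request(
        MONITOR_URL.format(pandaid=pandaid),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # Field layout is an assumption: the job record wrapped in a "job" key.
    return data.get("job", {}).get("jobstatus", "unknown")

def find_mismatches(harvester_ids_file):
    """Report PandaIDs that Harvester called successful but PanDA does not."""
    with open(harvester_ids_file) as f:
        finished = [line.strip() for line in f if line.strip()]
    mismatched = [pid for pid in finished if panda_status(pid) != "finished"]
    print(f"Harvester finished: {len(finished)}, PanDA disagrees on: {len(mismatched)}")
    return mismatched

if __name__ == "__main__":
    find_mismatches("harvester_finished_pandaids.txt")  # hypothetical file name
```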
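The outage policy discussed above could be expressed as a simple rule. This is an illustrative sketch only, encoding the thresholds exactly as stated in the minutes (divert on short outages, hold in place on long ones); the function and return labels are hypothetical.

```python
# Sketch: decide what to do with queued jobs for a planned outage,
# following the thresholds from the discussion above.
from datetime import timedelta

SHORT_OUTAGE_LIMIT = timedelta(days=1)

def outage_action(outage_length):
    """Return the action for a planned outage of the given length.

    Per the minutes: short outages (<= 1 day) divert jobs elsewhere;
    long outages (> 1 day) leave jobs in place to run after recovery.
    """
    if outage_length <= SHORT_OUTAGE_LIMIT:
        return "divert"
    return "hold_in_place"

print(outage_action(timedelta(hours=6)))  # divert
print(outage_action(timedelta(days=3)))   # hold_in_place
```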
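To illustrate Danila's point, here is a sketch of pulling HPC queues out of AGIS. The AGIS panda-queue JSON endpoint shown, the shape of the response (a dict keyed by queue name), and the exact value used for pandaResourceType on HPC queues are assumptions to be checked against the live AGIS instance.

```python
# Sketch: list PanDA queues that AGIS labels as HPC resources.
# Endpoint, response shape, and the exact HPC label value are assumptions.
import json
import urllib.request

AGIS_URL = "http://atlas-agis-api.cern.ch/request/pandaqueue/query/list/?json"

def hpc_queues():
    """Return queue names whose pandaResourceType marks them as HPC."""
    with urllib.request.urlopen(AGIS_URL) as resp:
        queues = json.load(resp)  # assumed: dict keyed by queue name
    return [
        name
        for name, info in queues.items()
        if (info.get("pandaResourceType") or "").lower() == "hpc"  # assumed value
    ]

if __name__ == "__main__":
    for name in sorted(hpc_queues()):
        print(name)
```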
Event Service:
- How do we arrange merge jobs on site?
- Tadashi: there is a catchall flag in PanDA for setting this (see the sketch after this list).
- Vakho: for HPC we need to make sure we are using Jumbo Jobs.
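Tadashi's catchall is a free-form field in the queue configuration, conventionally a comma-separated list of tokens; a minimal parsing sketch follows. The flag name "localEsMerge" is hypothetical, used only to show the convention; the real token must come from the PanDA documentation.

```python
# Sketch: check a queue's free-form "catchall" field for a flag that
# enables running Event Service merge jobs on site. The flag name
# "localEsMerge" is hypothetical.
def parse_catchall(catchall):
    """Split a catchall string like 'flagA,key=value,flagB' into a dict."""
    settings = {}
    for token in catchall.split(","):
        token = token.strip()
        if not token:
            continue
        key, _, value = token.partition("=")
        settings[key] = value if value else True
    return settings

def merge_on_site(catchall):
    """True if the (hypothetical) localEsMerge flag is present."""
    return bool(parse_catchall(catchall).get("localEsMerge"))

print(merge_on_site("localEsMerge,jobseed=es"))  # True
print(merge_on_site("jobseed=es"))               # False
```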
Software Distribution related:
- How will the production of containers work?
- See Wei's slides.
- What releases will be included?
- Wei: all releases, since container size does not seem to matter for our HPC sites.
- How do we distribute them?
- Using Globus Online or another form of copy; they can be pushed by the system described in Wei's slides (see the transfer sketch after this list).
- Containers are not yet on the worker nodes of Titan or Theta, but both operations groups have test installs.
- Is there a reason to move to one X.Y.Z release per container instead of fat containers?
- Rob: why not? They are smaller, and you would lower the threshold for becoming an opportunistic site.
- Rob: This is related to DDC, yes?
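For the distribution step, here is a sketch of pushing a container image with the Globus CLI ("globus transfer SRC_ENDPOINT:PATH DST_ENDPOINT:PATH"). The endpoint UUIDs, paths, and image file name are placeholders, and whether the sites accept pushes this way is an assumption.

```python
# Sketch: push one container image to an HPC site via the Globus CLI.
# Endpoint UUIDs and paths below are placeholders, not real site values.
import subprocess

SRC_ENDPOINT = "aaaaaaaa-0000-0000-0000-000000000000"  # placeholder: source site
DST_ENDPOINT = "bbbbbbbb-0000-0000-0000-000000000000"  # placeholder: HPC site

def push_container(image, dest_dir):
    """Submit an asynchronous Globus transfer for one container image."""
    cmd = [
        "globus", "transfer",
        f"{SRC_ENDPOINT}:/containers/{image}",
        f"{DST_ENDPOINT}:{dest_dir}/{image}",
        "--label", f"container push {image}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical image name and destination directory.
    push_container("atlas-release-21.0.15.img", "/global/common/atlas/containers")
```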