Speaker
Description
The DUNE experiment is a large international particle physics
project which is currently under construction at Fermilab in
Illinois and SURF in South Dakota, with prototypes at CERN. The
experiment relies on Fermilab’s investment in HTCondor and
GlideInWMS, and on the LArSoft ecosystem of applications
software. Initially, data management was done with Fermilab’s SAM
system, but this is gradually being replaced by other components.
MetaCat and Rucio are now in use as DUNE’s file metadata and
replica catalogues, and DUNE has developed a just-in-time
workflow management system, justIN, to replace the SAM workflow
functionality and provide higher-level management of processing
requests, which are carried out in GlideInWMS/HTCondor jobs. The
new system’s philosophy of matching tasks to resources as they
become available will be described. justIN provides a workflow
submission interface and then submits suitable jobs to the DUNE
HTCondor pool. Jobs call back to justIN when they eventually
start at sites, and a decision is made at that point about what
workflows to carry out on that machine and which files to
process. These decisions are based on the available memory,
processors, maximum local job duration, and the availability of
nearby files which are still to be processed as part of the
current workflows. This just-in-time approach can immediately take
into account unplanned downtimes at sites and storage systems,
as well as higher-level changes such as
fluctuations in demand from other user communities. This
system was validated during the DUNE Data Challenge 4 in late
2022 and has been used in the simulation campaigns of 2023.
justIN uses token information obtained from CILogon, with users
authenticating via the Fermilab Identity Provider service. This
in turn allows users to authenticate to the justIN web dashboard
or to use the justIN command line tool to launch and manage
workflows. To enforce DUNE policies on the use of Rucio-managed
storage, justIN jobs carry out data write operations on behalf
of user-supplied scripts and code, which are isolated from
higher-level credentials by justIN’s use of
Singularity/Apptainer containers. Further work to increase the
integration of justIN and the new dedicated DUNE HTCondor pool
will be described.
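
The just-in-time matching described above can be illustrated with a minimal Python sketch. This is not justIN's actual code or API: the `Slot` and `Workflow` classes, the `allocate` function, and all field names are hypothetical, and the "nearby" test is simplified to a replica being held at the job's own site. The sketch only shows the idea that the choice of workflow and file is deferred until a job has started and reported its memory, processors, and maximum duration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Slot:
    """Resources reported by a job when it starts at a site (hypothetical)."""
    site: str
    memory_mb: int
    processors: int
    max_wall_seconds: int


@dataclass
class Workflow:
    """A processing request and its remaining files (hypothetical)."""
    name: str
    memory_mb: int
    processors: int
    wall_seconds: int
    # Unprocessed file name -> sites whose storage holds a replica.
    unprocessed: dict[str, set[str]] = field(default_factory=dict)


def fits(slot: Slot, wf: Workflow) -> bool:
    """True if the slot satisfies the workflow's resource requirements."""
    return (
        wf.memory_mb <= slot.memory_mb
        and wf.processors <= slot.processors
        and wf.wall_seconds <= slot.max_wall_seconds
    )


def allocate(
    slot: Slot,
    workflows: list[Workflow],
    down_storage_sites: frozenset[str] | set[str] = frozenset(),
) -> Optional[tuple[str, str]]:
    """Pick a (workflow name, file name) for a job that has just started.

    Files with a replica at the job's own site are preferred; otherwise the
    first file with any replica on storage that is currently up is used.
    Returns None if nothing suitable is available.
    """
    fallback = None
    for wf in workflows:
        if not fits(slot, wf):
            continue
        for fname, sites in wf.unprocessed.items():
            usable = sites - down_storage_sites  # skip storage that is down
            if not usable:
                continue
            if slot.site in usable:
                del wf.unprocessed[fname]  # safe: we return immediately
                return wf.name, fname
            if fallback is None:
                fallback = (wf, fname)
    if fallback is None:
        return None
    wf, fname = fallback
    del wf.unprocessed[fname]
    return wf.name, fname
```

Because the decision is taken only when the job calls back, a storage outage recorded a moment earlier (here, an entry in `down_storage_sites`) changes the outcome for the very next job without any re-planning of previously submitted work.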
| | |
|---|---|
| Desired slot length | 20 |
| Speaker release | Yes |