System performance modelling WG meeting
Participants
- Local: Andrea Sciabà, Markus Schulz, Dirk Duellmann
- Remote: Michele Michelotto, Catherine Biscarat, Gareth Roy, Renaud Vernet, Pepe Flix, David Lange, Johannes Elmsheuser, Graeme Stewart, Concezio Bozzi, Domenico Giordano
Task list status report
Example workflows
The ATLAS jobs described in Johannes' doc have several large input minbias files. These could be put in EOS at CERN for convenience (CVMFS is not considered a good choice due to their size). The jobs are very recent, taken from the ongoing 2017 production. The jobs used by Domenico for benchmarking use an older version of the software and it might be good to update them (though for benchmarking a fixed version should be used over a longer period). Pepe adds that the CMS job used for benchmarking is also not very representative of a real workload.
Concezio will similarly provide instructions to run LHCb example jobs, and the same request will be made to ALICE (Costin).
David showed a document describing how to run CMS jobs; by default, input files are read via xrootd rather than from a local filesystem. Somebody without a CMS certificate will verify that no CMS credentials are needed (very likely the case).
Coming back to using CVMFS to distribute data, David thinks this would not be a problem for CMS. Domenico argues that having many minbias events to choose from in the DIGI job is not really important for benchmarking applications, so CVMFS should be the way to go in that case. For our cost model studies, however, realistic jobs are to be preferred.
David asks what should be the scale of the testing. For the next few months, we will be running on single machines. Later on, we could think of using HammerCloud to study the behaviour on the Grid infrastructure.
Finally, Domenico points out that for benchmarking it's important that access to conditions data is local, so if it's under CVMFS it should have been prestaged to the local cache.
Workload properties
Graeme goes through the reorganised tables of metrics (for CPU, memory, disk, I/O and application). During the discussion he clarifies that ATLAS is moving towards reading files over the network rather than prestaging them (which still happens in some cases). Concerning the number of threads/processes as a metric, he clarifies that this refers to the main event loop, not to short-lived threads/processes.
Memory bottlenecks are not relevant in HEP software now, but they might well become so in the future. MemoryMonitor can be used to study the time structure of memory utilisation. David S. and Servesh M. developed a graphical tool for memory accesses that can be very useful. Peaks in memory utilisation are important to measure, because the time spent at those peaks can have a significant effect on performance when the execution node is heavily loaded.
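MemoryMonitor is the ATLAS tool mentioned above; purely as an illustration of the idea of studying the time structure of memory use, here is a minimal sketch (assuming Linux, reading VmRSS from /proc) that samples a process's resident memory at a fixed interval and reports the peak:

```python
import os
import time

def sample_rss_kb(pid):
    """Read the resident set size (in kB) of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value reported in kB
    return 0

def monitor(pid, interval=0.1, duration=1.0):
    """Sample RSS at a fixed interval; returns a list of (elapsed_s, rss_kb)."""
    samples = []
    t0 = time.time()
    while time.time() - t0 < duration:
        samples.append((time.time() - t0, sample_rss_kb(pid)))
        time.sleep(interval)
    return samples

if __name__ == "__main__":
    # Monitor this process itself as a trivial demo; in practice one would
    # pass the PID of the job under study.
    series = monitor(os.getpid(), interval=0.05, duration=0.3)
    peak = max(kb for _, kb in series)
    print(f"samples={len(series)} peak_rss_kb={peak}")
```

The time series (rather than just the peak) is what reveals how long the job actually sits near its maximum footprint.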
It is stressed that, for each metric, it is important to find out how to measure it and to do so with our example jobs. After the meeting we will decide where to start.
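As a hypothetical starting point for measuring the most basic metrics on an example job, one could wrap the job in a small harness using Python's standard resource module (the command below is a placeholder, not one of the actual example jobs):

```python
import resource
import subprocess
import sys

def run_and_measure(cmd):
    """Run a command and report CPU times and peak RSS of its child processes.

    Uses getrusage(RUSAGE_CHILDREN), which aggregates over all finished
    children, so run one job per fresh harness process for clean numbers.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "user_s": after.ru_utime - before.ru_utime,   # user CPU time
        "sys_s": after.ru_stime - before.ru_stime,    # system CPU time
        "peak_rss_kb": after.ru_maxrss,               # kB on Linux
    }

if __name__ == "__main__":
    # Placeholder workload: a trivial Python child process.
    metrics = run_and_measure([sys.executable, "-c", "print('example job')"])
    print(metrics)
```

More detailed metrics (I/O rates, per-thread behaviour) would need tools such as perf or the experiment-specific monitors discussed above.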
Cost estimation for sites
Catherine and Renaud will provide a document explaining how costs were calculated a couple of years ago for the Lyon Tier-1 and Tier-2 sites. Pepe will do the same for PIC.
WLCG/HSF workshop
Markus mentions that we will have one short slot in the morning, intended for a more general audience, and a longer session in the afternoon where we can have more discussion.
In the coming days he will update the draft agenda.