System performance modelling WG kick-off meeting (part 2)

Europe/Zurich
31/S-023 (CERN)

Participants

  • Local: Andrea Sciabà, Markus Schulz, Jan Iven
  • Remote: Helge Meinhard, Gareth Roy, Renaud Vernet, Catherine Biscarat, Davide Costanzo, Andrea Sartirana, Andrew Sansum, Johannes Elmsheuser, David Lange, Yves Kemp, Alessandra Forti, Eric Fede, Concezio Bozzi, Pepe Flix
  • Apologies: Graeme Stewart, Frank Würthwein, Michel Jouvin

Work organisation

Long discussion on how to define and assign work. These suggestions were made:

  • Markus proposed to have small teams of people brainstorm on specific topics and write some text for the others to comment on (similar to the CWP approach)
  • Produce a list of things to do (the "known unknowns"), put them in a Google document and ask people to volunteer for them according to their competences (Andrea Sciabà offered to start writing such a list)
  • The need for a glossary was reiterated by Jan, who suggested starting from the description of what we need to do, extracting the critical concepts and defining them

Alessandra pointed out that we will also need some infrastructure on which to conduct tests and measurements. Markus mentioned that the UP team at CERN has a performance evaluation testbed that can be used. Alessandra also stressed the need for the tasks to be as specific and concrete as possible.

Markus proposed that one of the first tasks should be to identify the most important workloads (to be done by people from the experiments).

According to Markus, another initial task could be to look at both open-source and commercial performance analysis tools, which can produce an enormous number of metrics, and see which ones are most relevant for us.
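
As an illustration of the kind of raw counters such tools expose, the sketch below (an assumption about tooling, not an agreed deliverable) runs Linux perf stat on an arbitrary command and extracts a few hardware events; the event names and the CSV layout can vary with the kernel and perf version.

```python
# Minimal sketch: collect a few hardware counters with Linux "perf stat".
# Assumes a Linux host with perf installed and permission to read counters.
import subprocess

EVENTS = ["instructions", "cycles", "cache-misses"]

def collect_counters(cmd):
    """Run `cmd` under perf stat and return {event: count}."""
    perf_cmd = ["perf", "stat", "-x", ",", "-e", ",".join(EVENTS), "--"] + cmd
    # perf stat writes its counter report to stderr
    result = subprocess.run(perf_cmd, capture_output=True, text=True)
    counters = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2] in EVENTS:
            try:
                counters[fields[2]] = int(fields[0])
            except ValueError:
                pass  # counter not counted or not supported on this host
    return counters

if __name__ == "__main__":
    print(collect_counters(["sleep", "1"]))
```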

Johannes asked if metrics should be measured only in a controlled environment or also in the production environment. Andrea Sciabà answered that one should do both and compare the results, as any significant discrepancy should be understood. For example, as Johannes said, data access can have a strong effect and can be very different between a lab setup and the WLCG infrastructure; this can be very tricky to model.
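
One possible way to make such a comparison concrete (purely illustrative, with invented numbers and an arbitrary 20% tolerance) is to flag every metric whose controlled-environment and production values differ by more than a chosen relative threshold:

```python
# Illustrative only: compare metrics measured in a controlled setup with the
# same metrics from production jobs, and flag large relative discrepancies.
lab = {"events_per_second": 12.0, "cpu_efficiency": 0.95, "io_wait_fraction": 0.02}
production = {"events_per_second": 9.5, "cpu_efficiency": 0.78, "io_wait_fraction": 0.15}

TOLERANCE = 0.20  # arbitrary 20% relative difference

for name, lab_value in lab.items():
    prod_value = production[name]
    rel_diff = abs(prod_value - lab_value) / lab_value
    status = "INVESTIGATE" if rel_diff > TOLERANCE else "ok"
    print(f"{name:20s} lab={lab_value:6.2f} prod={prod_value:6.2f} "
          f"rel.diff={rel_diff:5.1%}  {status}")
```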

It was agreed that a controlled environment is in any case essential to really understand the application's behaviour.

Discussion on metrics

Markus presents his and Andrea's ideas on performance, efficiency and cost. He points out that in the last few years a lot of progress has been made in studying the behaviour of jobs and metrics as a function of time. This should make it easier to build models of workloads (and full workflows) that take their time structure into account.
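
As a purely illustrative toy example of what taking the time structure into account could mean (all phase names and numbers below are invented), a job can be described as a sequence of phases, each with its own duration and resource profile, rather than by a single set of job-level averages:

```python
# Toy time-structured model of a job: a sequence of phases with different
# resource profiles, instead of one job-level average. Numbers are invented.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    duration_s: float       # wall-clock time of the phase
    cpu_utilisation: float  # fraction of one core actually used
    read_mb_s: float        # average input rate during the phase

job = [
    Phase("initialisation", 120, 0.30, 5.0),
    Phase("event processing", 3600, 0.95, 1.5),
    Phase("output merging", 300, 0.40, 0.2),
]

wall = sum(p.duration_s for p in job)
cpu_time = sum(p.duration_s * p.cpu_utilisation for p in job)
data_read_mb = sum(p.duration_s * p.read_mb_s for p in job)

print(f"wall time      : {wall:.0f} s")
print(f"CPU efficiency : {cpu_time / wall:.2f}")
print(f"data read      : {data_read_mb / 1024:.1f} GB")
```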

It is very important to be able to identify the performance-limiting factors (which include bottlenecks and resource starvation) and to measure how much of the hardware capabilities we are exploiting (e.g. how far we are from the theoretical limit on the number of instructions per core). This is not always trivial: for example, a CPU that looks fully loaded could be stalled by memory access.
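
A simple worked example of the last point (all counter values and the 4-instructions-per-cycle peak are assumptions, not measurements): a core can appear fully busy while retiring far fewer instructions per cycle than it theoretically could, because most cycles are spent waiting for memory.

```python
# Worked example: instructions per cycle (IPC) versus a theoretical peak.
# All numbers are invented; the peak of 4 instructions/cycle is an assumed
# figure for a typical x86 core, not a measured property of any specific CPU.
instructions = 2.1e11    # retired instructions (e.g. from hardware counters)
cycles = 3.0e11          # CPU cycles over the same interval
stalled_cycles = 1.8e11  # cycles in which no work could be issued

ipc = instructions / cycles
theoretical_peak_ipc = 4.0

print(f"IPC                 : {ipc:.2f}")
print(f"fraction of peak    : {ipc / theoretical_peak_ipc:.1%}")
print(f"stalled cycle share : {stalled_cycles / cycles:.1%}")
# A CPU can look 100% busy to the scheduler while most cycles are stalls,
# e.g. waiting for data from memory.
```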

Jan proposes to work on two types of models: one to describe the hardware in abstract terms and one to relate the workload to the hardware model. These models should start simple, using a bottom-up approach applied first to a simple application. Similarly, workloads should be split into smaller units.
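
A minimal sketch of what these two layers could look like, with invented parameters and units: an abstract hardware model, an abstract workload model, and a first-order relation between them that takes the slower of CPU and I/O as the limiting resource. A real model would need calibration against benchmarks and measurements.

```python
# Minimal sketch of the two model layers: an abstract hardware model, an
# abstract workload model, and a first-order relation between them.
# All figures are invented for illustration.
from dataclasses import dataclass

@dataclass
class HardwareModel:
    cores: int
    hs06_per_core: float        # CPU capacity per core in HS06-like units
    io_bandwidth_mb_s: float    # shared node-level input bandwidth

@dataclass
class WorkloadModel:
    n_events: int
    hs06_seconds_per_event: float  # CPU work per event
    mb_read_per_event: float       # input data per event

def estimate_wall_time(hw: HardwareModel, wl: WorkloadModel, jobs_per_node: int) -> float:
    """Rough per-job wall time: the slower of CPU and shared I/O is the bottleneck."""
    cpu_time = wl.n_events * wl.hs06_seconds_per_event / hw.hs06_per_core
    io_time = wl.n_events * wl.mb_read_per_event * jobs_per_node / hw.io_bandwidth_mb_s
    return max(cpu_time, io_time)

node = HardwareModel(cores=16, hs06_per_core=10.0, io_bandwidth_mb_s=200.0)
reco = WorkloadModel(n_events=1000, hs06_seconds_per_event=50.0, mb_read_per_event=2.0)
print(f"estimated wall time per job: {estimate_wall_time(node, reco, jobs_per_node=16):.0f} s")
```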

David asks which concrete actions could be defined for the experiments. Markus proposes to identify one particular workflow and study it in detail as the main programme of work for the first year. Andrea recalls that, in parallel, it would be very important to classify the most important workflows.

AOB

The time and frequency of future meetings are briefly discussed. Andrea proposes to hold fortnightly meetings on Wednesdays at 17:00 CERN time; no objections are raised, but the proposal will be circulated by email so that people who were absent can express their opinion.
