Systems Performance and Cost Modeling Working Group

Link to Google document for comments and shared editing:

https://docs.google.com/document/d/1qRAKRdjCi3m4W4c9VorPrjn0V8-OR3p_fHr1SbeYttY/edit?usp=sharing

Mandate and goals

Motivation (reviewed by MB Nov 14 2017)

Our community needs metrics that characterise the resource usage of HEP workloads in sufficient detail that the impact of changes in the infrastructure or in the workload implementations can be quantified with enough precision to guide design decisions towards improved efficiency. This model has to express the resource utilisation of the workloads in terms of the fundamental capabilities that computing systems provide, such as storage, memory, network, computational operations, latency and bandwidth. To allow sites and user communities to use this model to also improve their cost efficiency, an approach for mapping these capabilities to local costs is highly desirable. This cannot be done at a global level, since conditions differ too much between sites, but the model should be constructed so that the mapping can be done fairly easily at the local level, following given examples. Decisions on the evolution of workloads, workflows and infrastructures also affect the quantity and quality of the human resources required to build and operate the system. It is important that a cost and performance model at the system level takes these adequately into account, so that the global infrastructure cost can be optimised within a constrained budget.

WLCG, HSF, HEPiX and the Systems Modeling Working Group (reviewed by MB Nov 14 2017)

This working group depends on active participation from workload, workflow and framework developers, from people who plan, engineer and operate IT systems, and from people who federate all of this into a global infrastructure. To ensure that this activity reflects the understanding of these vital groups of experts, reporting links with all three entities will be established, beyond the informal cross links created by members of the working group. Given WLCG's history of tracking a large number of working groups, its focus on computing for the (HL-)LHC and its established community-driven governance, this working group is best established as a "WLCG Working Group". The scope is not limited to WLCG and is of potential interest to other experiments relying on a largely distributed computing infrastructure, such as (but not limited to) Belle II or SKA.

Mandate (reviewed by MB Nov 14 2017)

  • Bring together workload and infrastructure experts from sites and experiments to agree on a common set of suitable metrics describing the interrelationship between workload resource needs and infrastructure characteristics.
  • Identify and agree on a set of reference workloads and meter them in different environments as input data for a model.
  • Build models and verify them by predicting resource usage of the reference workloads with respect to changes in the execution environment.
  • Develop a strategy/methodology for mapping the model predictions to local costs and verify this approach by applying it to a small number of different sites.

These steps are not necessarily sequential; an iterative approach is needed to arrive at usable metrics and models that can, over time, follow the changes in our environment.

Goals (reviewed by MB Nov 14 2017)

Short

Set up a structure for the working group and define subgroups. By the time of the joint HSF/WLCG workshop in spring 2018, a first, simplified set of metrics and a first model should be developed, covering all aspects of the mandate for one reference workload. It is clear that this model will not be usable in practice, but building it will help to better understand which white areas remain on the map and how they can be filled.

Medium

By the end of 2018, produce a first set of metrics that is useful to some extent and easily accessible (here some collaboration with the HEPiX Benchmarking Working Group would be helpful). The derived model should cover, for two experiments, the workload consuming the largest amount of resources today. The model should also come with a first process for mapping to local costs. Testbeds can be used to verify the quality of the model in predicting the performance of different fabrics. The process to measure the resource utilisation metrics, and the model itself, should be well documented.

Long

Starting in 2020, at the time of the HL-LHC Computing TDRs, the model should be refined and should cover all major workloads from most experiments. Mapping to local costs should have been carried out by several sites. Use of the model should be simplified by providing tools.

By predicting the impact of planned changes and frequently comparing predictions with the observed impact, the model will be continuously refined.

Members

  • Catherine Biscarat, Tommaso Boccali, Daniele Bonacorsi, Concezio Bozzi, Raul Cardoso Lopes, Davide Costanzo, Alessandro Di Girolamo, Johannes Elmsheuser, Eric Fede, Pepe Flix, Alessandra Forti, Martin Gasthuber, Domenico Giordano, Chris Hollowell, Jan Iven, Michel Jouvin, Yves Kemp, Andrey Kiryanov, David Lange, Helge Meinhard, Michele Michelotto, Gareth Roy, Markus Schulz, Andrew Sansum, Andrea Sartirana, Andrea Sciabà, Oxana Smirnova, Graeme Stewart, Renaud Vernet, Mattias Wadenstein, Torre Wenaus, Frank Wuerthwein

Subgroups (Updated DRAFT)

At the kickoff meeting we discussed forming subgroups. It became clear that in the initial phase splitting Metrics, Workload characterisation and Model building carries some risk, because these areas are highly interdependent. We will start without this separation and revisit the approach later. For the moment, it is better to talk about subtasks.

Metrics

Select and define the most appropriate metrics to characterise the resource usage of workloads and their efficiency. This task should start immediately and aim at finding metrics that:
  • provide a reasonably complete characterisation of the workload (see below)
  • can be measured relatively easily
  • are meaningful both for software experts and infrastructure experts, thus providing a common language to discuss requirements and performance
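
To make these criteria concrete, below is a minimal sketch in Python of what a per-workload metrics record might look like; all field names and the choice of quantities are illustrative assumptions, not agreed metrics.

# A minimal sketch of a per-workload metrics record meeting the three
# criteria above; the fields chosen here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WorkloadMetrics:
    cpu_seconds: float     # total CPU time consumed
    wall_seconds: float    # total wall-clock time
    max_rss_mb: float      # peak resident memory
    bytes_read: int        # total input I/O
    bytes_written: int     # total output I/O
    network_in_mb: float   # data transferred in
    network_out_mb: float  # data transferred out

    @property
    def cpu_efficiency(self) -> float:
        """CPU time over wall time: a simple figure that both software
        and infrastructure experts read the same way."""
        return self.cpu_seconds / self.wall_seconds if self.wall_seconds else 0.0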

Workload characterisation

This task should also start very soon and consists in defining a schematic view of a workload, using a simple (but not too simple) model with parameters related to the metrics defined in the previous task. It should describe the time structure of the application (e.g. initialisation, event processing and finalisation phases), how different applications combine to form a full workflow, etc.

This model should be applied to representative "templates" for workloads to adequately capture those computing activities consuming the majority of WLCG resources.

It can be used, for example, to estimate the performance impact of changes in the experiment software or in the infrastructure, as in the sketch below.
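
As an illustration, the following Python sketch encodes the time structure described above as a simple linear model, wall time = t_init + n_events * t_event + t_final, and uses it to estimate the effect of a hypothetical speed-up of the event loop; the parameter names and the linear scaling are assumptions made for illustration only.

# A minimal sketch, assuming a linear time structure:
#   wall time = t_init + n_events * t_event + t_final
# All parameters and the speed-up factor are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WorkloadModel:
    t_init: float   # initialisation phase (s)
    t_event: float  # average time per event (s)
    t_final: float  # finalisation phase (s)

    def wall_time(self, n_events: int) -> float:
        """Predicted wall-clock time for processing n_events."""
        return self.t_init + n_events * self.t_event + self.t_final

    def with_speedup(self, factor: float) -> "WorkloadModel":
        """Model a change (software or hardware) that speeds up event
        processing by the given factor, leaving the fixed phases alone."""
        return WorkloadModel(self.t_init, self.t_event / factor, self.t_final)

# Example: a 20% faster event loop on a 10000-event job
baseline = WorkloadModel(t_init=300.0, t_event=30.0, t_final=60.0)
improved = baseline.with_speedup(1.2)
print(baseline.wall_time(10_000), improved.wall_time(10_000))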

Resource estimation

Define common procedures to estimate the resources needed by the LHC experiments for their computing, that is, for capacity planning. Share non-confidential information about how this is done by the four collaborations.
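
A minimal sketch of the arithmetic behind such a capacity-planning procedure is shown below; the utilisation factor and all numbers are hypothetical and do not reflect any experiment's actual procedure.

# A minimal capacity-planning sketch; the linear model, the utilisation
# factor and the example numbers are illustrative assumptions.
def cores_needed(n_events: float, cpu_s_per_event: float,
                 period_s: float, utilisation: float = 0.8) -> float:
    """Cores required to process n_events within period_s, assuming an
    average scheduling/CPU utilisation factor."""
    return n_events * cpu_s_per_event / (period_s * utilisation)

# Example: 1e10 events at 30 CPU-seconds each, processed within one year
SECONDS_PER_YEAR = 365 * 24 * 3600
print(f"{cores_needed(1e10, 30.0, SECONDS_PER_YEAR):.0f} cores")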

Cost mapping

Define a methodology to translate resource needs into local costs and to map a local fabric decision into an expected performance.
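
As a rough illustration of this mapping step, the sketch below multiplies modelled resource needs by per-unit local prices; the resource categories and prices are invented for the example and would differ from site to site.

# A minimal cost-mapping sketch: modelled resource needs times hypothetical
# per-unit local prices. Categories and prices are illustrative assumptions.
resource_needs = {            # output of the performance model
    "cpu_core_years": 500,
    "disk_tb": 2000,
    "tape_tb": 5000,
}
local_unit_costs = {          # hypothetical site prices, local currency
    "cpu_core_years": 150.0,
    "disk_tb": 20.0,
    "tape_tb": 5.0,
}

total_cost = sum(resource_needs[r] * local_unit_costs[r] for r in resource_needs)
print(f"Estimated local cost: {total_cost:.0f}")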

Manpower (to start later)

Work on understanding manpower costs and needs. Estimate the impact on manpower and operations costs of different choices in the computing model.

Glossary

TBA.

Performance Measurement and Analysis Tools

List of tools used by the community with information on how to select tools and how to use them.

-- AndreaSciaba - 2017-10-30
