WLCG monitoring consolidation Minutes Friday 16 August 2013 - Eddie

Infrastructure monitoring proposal


Participants

Local : Pablo, Julia, Ivan, Alex, Stefan, Luca, Lionel, Dave, Eddie, Pedro, Markus, Nicolo

Remote : Alberto

Slide 12

Pedro: It is important to make sure that the metadata part is as small as possible. If the metadata
starts to get overused it might become a problem - the smaller the better. Topology is a good example.
Pablo agreed and confirmed that the metadata will be kept as small as possible; anything else will go to the
current metrics.

Slide 13

Pedro: From the current metrics, depending on the technology that you use, you might find built-in
aggregation that gives you a lot of possibilities for simple math aggregation - the latest version of ES
has an aggregation framework.

Slide 17

Pedro asked whether the collect information layer will publish info to the transport layer in a
single common format.

Pablo: It might not be a single common format but the transport layer has to accept more than one format.

Dave: Does the transformation to the common metric format happen before the message is published to the 
Transport layer or after it is consumed from the Transport layer?

Pablo: It should happen before.
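
As an illustration of transforming before publishing, here is a minimal sketch. It assumes a common metric format with five fields; the field names, the Nagios-style input keys, the example values and the list standing in for the Transport layer are all hypothetical and were not defined in the meeting.

```python
import json

# Hypothetical common metric format: these field names are assumptions.
COMMON_FIELDS = ("source", "entity", "metric", "value", "timestamp")

def from_nagios(raw):
    """Map a Nagios-style check result (assumed keys) to the common format."""
    return {
        "source": "nagios",
        "entity": raw["host_name"],
        "metric": raw["service_description"],
        "value": raw["state"],
        "timestamp": raw["last_check"],
    }

def from_experiment(raw):
    """Map an experiment-specific record (assumed keys) to the common format."""
    return {
        "source": raw["experiment"],
        "entity": raw["site"],
        "metric": raw["name"],
        "value": raw["value"],
        "timestamp": raw["time"],
    }

TRANSFORMERS = {"nagios": from_nagios, "experiment": from_experiment}

def publish(transport, kind, raw):
    """Transform first, then hand the serialised message to the transport."""
    message = TRANSFORMERS[kind](raw)
    missing = [field for field in COMMON_FIELDS if field not in message]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    transport.append(json.dumps(message, sort_keys=True))
    return message

queue = []  # stand-in for the Transport layer (for example a message broker)
first = publish(queue, "nagios", {
    "host_name": "ce01.example.org",
    "service_description": "job-submit",
    "state": "OK",
    "last_check": 1376640000,
})
second = publish(queue, "experiment", {
    "experiment": "lhcb",
    "site": "SITE_A",
    "name": "pilot-efficiency",
    "value": 0.93,
    "time": 1376640060,
})
```

With this arrangement every consumer of the Transport layer sees one shape of message, whichever publisher produced it.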

Slide 18

Pedro asked if it is just Nagios data. Pablo replied that it can be Nagios data or anything else
that publishes data to us.

Slide 21

Pedro: If you already know which data to aggregate, this makes sense. If it is more dynamic, once you
start aggregating data, recomputation might become complicated, and you may find that the number of
use cases and the complexity explode.

Pablo: We are dividing the aggregation part into small, simple steps. We might not have to store the
results of the intermediate steps, but we prefer to do it in small and simple steps.
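
A minimal sketch of aggregation split into small steps, with intermediate results stored only on request. The step functions, the sample data and the 0.5 threshold are hypothetical and only illustrate the idea; they are not the aggregation discussed on the slides.

```python
from collections import defaultdict

def group_by_site(samples):
    """Step 1: collect the boolean test results per site."""
    grouped = defaultdict(list)
    for site, ok in samples:
        grouped[site].append(ok)
    return dict(grouped)

def fraction_ok(grouped):
    """Step 2: fraction of successful results per site."""
    return {site: sum(results) / len(results) for site, results in grouped.items()}

def flag_sites(fractions, threshold=0.5):
    """Step 3: reduce each fraction to a simple status (threshold is an assumption)."""
    return {site: ("ok" if f >= threshold else "bad") for site, f in fractions.items()}

def run_pipeline(samples, steps, keep_intermediate=False):
    """Apply the steps in order; optionally keep the output of every step."""
    data = samples
    intermediate = []
    for step in steps:
        data = step(data)
        if keep_intermediate:
            intermediate.append(data)
    return data, intermediate

samples = [("SITE_A", True), ("SITE_A", False), ("SITE_A", True), ("SITE_B", False)]
steps = [group_by_site, fraction_ok, flag_sites]
result, kept = run_pipeline(samples, steps, keep_intermediate=True)
```

Keeping the intermediate outputs makes it possible to rerun only the later steps when a recomputation is requested, at the cost of extra storage.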

Generic Comments

Stefan: With this new architecture, the experiments are encouraged to publish external data into 
the framework. Will it show up in different profiles? Will they be able to go to the critical profile?

Pablo: They will appear as any other metric; it’s up to the experiment to define in which profile they will go.

Markus: Definitely, we should be more flexible compared to how we were before. Of course policy
decisions have to be discussed beforehand.

Julia: From a technical point of view, this architecture is very flexible.

Stefan: We have a summer student working on publishing LHCb internal data to MSG and he found
that the syntax contains possibly duplicated entries; it might be possible to clean this up.

Julia: We are also looking at simplifying the job submission part through Nagios - if you publish
through ActiveMQ it will not go to Nagios, it will go to SSB directly.

Markus: You are now storing 300k messages; at the moment it is not a real storage problem.

Pablo: The current (not enforced) policy is that RAW data will be kept for 3 months and the
aggregated data for 1 year. Three of the experiments agreed on this. Only ATLAS disagreed.

Pedro: If the hardware is sufficient you can keep the data for longer.

Julia: We might need to keep the data for longer periods of time, for example, in job monitoring we 
get requests to recalculate data or produce graphs from very old time periods.
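
The retention policy quoted above could be expressed as a small check like the one below. This is only a sketch: it approximates 3 months as 90 days and 1 year as 365 days, and the names `RETENTION` and `is_expired` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed approximations: 3 months ~ 90 days, 1 year ~ 365 days.
RETENTION = {
    "raw": timedelta(days=90),
    "aggregated": timedelta(days=365),
}

def is_expired(kind, recorded_at, now):
    """Return True if a record of the given kind is older than its retention period."""
    return now - recorded_at > RETENTION[kind]

now = datetime(2013, 8, 16)
recorded = datetime(2013, 4, 1)  # 137 days before `now`
raw_expired = is_expired("raw", recorded, now)
aggregated_expired = is_expired("aggregated", recorded, now)
```

Keeping the periods in one table means that a longer retention, as requested for job monitoring, is a one-line change per data kind.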

Dave: Will this standard format be used for transfers and job monitoring? Pablo agreed and said 
that’s what we are hoping for.

Markus: Is the format that Pablo suggests the same as the one that the AI monitoring is using?

Pedro: Concept is the same, format is slightly different.

Markus: Wouldn’t it make sense to share the format if you want to exchange data at a later stage? In the
long term, the ideas between AI mon and us should be the same - it should also be possible to import
the format that either team uses into SSB. We should have a different format only if we have a very
good reason.

Pedro and Pablo agreed and will discuss this offline.

Markus: How many messages do you expect to consume two years from now?

Pablo: ~2 million messages per day in SSB at the moment. We could go up to ~3 million without any problem.

Lionel: Not worried about the messaging part as you can have more brokers or use a different cluster.

Pablo: We could also decouple the storage for each application.
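
To put the daily volumes quoted above in perspective, they can be converted to average rates. The sketch below assumes messages are spread evenly over the day, which real traffic is unlikely to be, so the figures are averages and not peak rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_rate(messages_per_day):
    """Average messages per second, assuming an even spread over the day."""
    return messages_per_day / SECONDS_PER_DAY

current_rate = average_rate(2_000_000)  # roughly 23 messages per second
ceiling_rate = average_rate(3_000_000)  # roughly 35 messages per second
```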

Pablo: Next meeting will be in two weeks and we will cover the Storage part in more detail. Pablo
and Pedro will meet offline about things that can be done in parallel.

Julia: Suggestion: since the prototype is already available, maybe experiments can start playing with it.