WLCG monitoring consolidation
Minutes
Friday 16 August 2013 - Eddie

Infrastructure monitoring proposal

Participants
Local: Pablo, Julia, Ivan, Alex, Stefan, Luca, Lionel, Dave, Eddie, Pedro, Markus, Nicolo
Remote: Alberto

Slide 12
Pedro: It is important to make sure that the metadata part is as small as possible. If the metadata starts to get overused it might become a problem - the smaller the better. Topology is a good example.
Pablo agreed and confirmed that the metadata will be kept as small as possible; otherwise it will go to the current metrics.

Slide 13
Pedro: For the current metrics, depending on the technology that you use, you might find aggregation built in, which gives you a lot of possibilities for simple math aggregation - the latest version of ES has an aggregation framework.

Slide 17
Pedro asked whether the collect information layer will publish info to the transport layer in a single common format.
Pablo: It might not be a single common format; the transport layer has to accept more than one format.
Dave: Does the transformation to the common metric format happen before the message is published to the Transport layer or after it is consumed from the Transport layer?
Pablo: It should happen before.

Slide 18
Pedro asked if it is just Nagios data.
Pablo replied that it can be Nagios data or anything else that publishes data to us.

Slide 21
Pedro: If you already know what data to aggregate, this makes sense. If it is more dynamic, once you start aggregating data it might get complicated if you want to do recomputation, and you may find that the number of use cases and the complexity explode.
Pablo: We are dividing the aggregation part into small, simple steps. We might not have to store the intermediate steps, but we prefer to do it in small and simple steps.

Generic Comments
Stefan: With this new architecture, the experiments are encouraged to publish external data into the framework. Will it show up in different profiles?
Will they be able to go to the critical profile?
Pablo: They will appear as any other metric; it’s up to the experiment to define which profile they go into.
Markus: Definitely, we should be more flexible than we were before. Of course policy decisions have to be discussed beforehand.
Julia: From a technical point of view, this architecture is very flexible.
Stefan: We have a summer student working on publishing LHCb internal data to MSG, and he found that the syntax contains possibly duplicated entries; it might be possible to clean this up.
Julia: We are also looking at simplifying the job submission part through Nagios - if you publish through ActiveMQ it will not go to Nagios, it will go to SSB directly.
Markus: You are now storing 300k messages; at the moment it is not a real storage problem.
Pablo: The current (not enforced) policy is that RAW data will be kept for 3 months and the aggregated data for 1 year. Three of the experiments agreed on this; only ATLAS disagreed.
Pedro: If the h/w is sufficient you can keep the data for longer.
Julia: We might need to keep the data for longer periods of time; for example, in job monitoring we get requests to recalculate data or produce graphs for very old time periods.
Dave: Will this standard format be used for transfers and job monitoring?
Pablo agreed and said that’s what we are hoping for.
Markus: Is the format that Pablo suggests the same as the one that the AI monitoring is using?
Pedro: The concept is the same; the format is slightly different.
Markus: Wouldn’t it make sense to share the format if you want to exchange data at a later stage? In the long term, the ideas of AI mon and ours should be the same - it should also be possible to import the format that either team uses into SSB. Only if we have a very, very good reason should we have a different format.
Pedro and Pablo agreed and will discuss this offline.
Markus: How many messages do you expect to consume two years from now?
Pablo: ~2 million messages per day in SSB at the moment. We could go up to ~3 million without any problem.
Lionel: Not worried about the messaging part, as you can have more brokers or use a different cluster.
Pablo: We could also decouple the storage for each application.
Pablo: The next meeting will be in two weeks and we will cover the Storage part in more detail. Pablo and Pedro will meet offline about things that can be done in parallel.
Julia: A suggestion: maybe the experiments can already play with the prototype, as it is available.
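
As a rough back-of-the-envelope check of the figures quoted above, the sketch below estimates how many RAW messages the 3-month retention policy would imply at ~2 million and ~3 million messages per day. The 30-day month, the constant daily rate and the function name are assumptions made for illustration; they were not stated in the meeting.

```python
# Assumptions (not from the minutes): a month is taken as 30 days and the
# daily message rate is treated as constant over the retention window.
DAYS_PER_MONTH = 30


def retained_raw_messages(messages_per_day: int, retention_months: int) -> int:
    """Approximate number of raw messages held once retention reaches steady state."""
    return messages_per_day * retention_months * DAYS_PER_MONTH


# Daily rates mentioned in the discussion, with the 3-month RAW retention policy.
current = retained_raw_messages(2_000_000, 3)
ceiling = retained_raw_messages(3_000_000, 3)

print(f"RAW messages retained at ~2M/day: {current:,}")
print(f"RAW messages retained at ~3M/day: {ceiling:,}")
```

Under these assumptions the RAW store would hold on the order of 180 to 270 million messages; aggregated data, kept for 1 year, is not estimated here because its volume was not quoted.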