WLCG monitoring consolidation Minutes Friday 29 August 2014
Local: Luca, Eddie (minutes), Pablo, Lionel, Marian, Andrea, Maarten, Julia (last to arrive)
Remote: David, Pepe, Salvatore (disconnected early from the meeting)
Apologies for absence from Stefan and Costin
Pablo: Some tickets are taking longer than expected, and some others we will close when we put the new system in production.
For the HammerCloud DB investigation - Julia: We should finish the report; we were waiting for people to return from holidays.
For documentation - Marian: SAM documentation was exported from Confluence to a static HTML page
Pablo: We need to make sure that what we now put in place for SAM3 is of production quality.
SAM3 data validation:
Twiki page listing the differences between SAM2 and SAM3. Notice to experiment representatives: Please check them and tell us if you agree.
Pablo: Will add a new column to the table assigning the cases to categories. ‘Agreed by OPS’ means agreed by the sites' representative. If you find any other discrepancy between the two systems, please add it to the twiki.
Slide 3, Pablo: When a site doesn’t have a service, SAM2 does not take it into account in the availability but SAM3 does. This was affecting the sites with no SRM endpoint. We created a new way of combining metrics, ‘AND IF DATA’, which is an extra algorithm that mimics the behaviour of SAM2.
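Editor's illustration: the difference between a plain AND and ‘AND IF DATA’ can be sketched as follows. This is only a minimal sketch under assumptions: the minutes do not give the exact semantics, and the status names, the use of None for "no data" and the function names are hypothetical, not the SAM3 implementation.

```python
from typing import Dict, Optional

# Hypothetical status values, chosen for illustration only.
OK, CRITICAL = "OK", "CRITICAL"


def and_combine(statuses: Dict[str, Optional[str]]) -> str:
    """Plain AND: every listed service must be OK; a service with no data fails."""
    return OK if all(s == OK for s in statuses.values()) else CRITICAL


def and_if_data_combine(statuses: Dict[str, Optional[str]]) -> Optional[str]:
    """AND over the services that reported data; services with no data are skipped."""
    with_data = [s for s in statuses.values() if s is not None]
    if not with_data:
        return None  # nothing was measured, so there is no status to report
    return OK if all(s == OK for s in with_data) else CRITICAL


# A site with a working CE and no SRM endpoint (no SRM data at all).
site_without_srm = {"CE": OK, "SRM": None}
plain_result = and_combine(site_without_srm)
if_data_result = and_if_data_combine(site_without_srm)
```

Under these assumptions, the plain AND penalises the site for the absent SRM, while ‘AND IF DATA’ ignores it, which is the SAM2-like behaviour described above.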
Slide 6, Lionel: The symbols are not user-friendly; maybe you could use plain English keywords instead of symbols, i.e. ‘&SRM’ should be ‘ALL SRM’.
Slide 9, Andrea: What is the difference between ‘AND IF’ and the new metric with the ‘@’ sign? For example in your slide @SiteWithoutStorage?
Pablo: @ represents that a metric is defined at the site level. It does not need any aggregation (for example, combining several instances of CE into the status of CE).
Slide 11, Pepe: Could we enforce that the experiments use the same profile naming conventions?
Maarten: This is already the case.
Marian: Many profiles were created to do the nagios configuration. They might not be needed anymore.
Pablo: If they are not needed, we can delete them.
Slide 13, Maarten: We finally discovered these differences between SAM2 and SAM3 and agreed that SAM3 behaves better. We could go to the Management Board and present the understood differences.
Andrea: Do you still have the ‘MISSING’ status in SAM3?
Pablo: Yes, the equivalent of ‘MISSING’ is ‘no data’. We also have ‘UNKNOWN’, which is slightly different and should be escalated.
Julia: if the status of the site is ‘UNKNOWN’ what happens to the availability?
Pablo: The availability is calculated over the known period of service, so the ‘UNKNOWN’ time is ignored, but at least we can see it now.
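Editor's illustration: calculating availability over the known period only can be sketched as below. The function name, the use of hours as the unit and the two-state OK/CRITICAL split are assumptions made for the example; the minutes do not describe the actual SAM3 formula in more detail.

```python
from typing import Optional


def availability(ok_hours: float, critical_hours: float, unknown_hours: float) -> Optional[float]:
    """Fraction of the known period during which the site was OK.

    unknown_hours is accepted only to show that it does not enter the result.
    """
    known_hours = ok_hours + critical_hours
    if known_hours == 0:
        return None  # no known period, so availability is undefined
    return ok_hours / known_hours


# Example day: 18 h OK, 2 h CRITICAL, 4 h UNKNOWN.
example = availability(ok_hours=18, critical_hours=2, unknown_hours=4)
```

In this sketch the 4 hours of ‘UNKNOWN’ neither raise nor lower the result: the availability is 18 out of 20 known hours.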
Pablo: SAM2 does ANY SRM, do we want to have the hard AND SRM? The availability of some sites will get reduced. This is going to be another difference between SAM2 and SAM3.
Maarten and Andrea agreed.
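Editor's illustration: the effect of moving from ANY SRM to a hard AND over all SRM endpoints can be sketched as below. The function names and the boolean representation of an endpoint's status are assumptions for the example only.

```python
from typing import List


def any_srm(endpoints_ok: List[bool]) -> bool:
    """SAM2-style: the SRM service is fine if at least one endpoint is OK."""
    return any(endpoints_ok)


def all_srm(endpoints_ok: List[bool]) -> bool:
    """Hard AND: the SRM service is fine only if every endpoint is OK."""
    return all(endpoints_ok)


# A site with two SRM endpoints, one of them failing.
two_endpoints = [True, False]
sam2_like = any_srm(two_endpoints)
hard_and = all_srm(two_endpoints)
```

A site like this one counts as available under ANY SRM but not under the hard AND, which is why the availability of some sites would be reduced.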
Maarten: Can it happen that in SAM3 a service might not be tested like in SAM2?
Pablo: Let’s go back to slide 7, imagine that we have another metric that will show if the service appears in VO Feed and this will get propagated all the way until the end. Andrea and Maarten agreed on this.
Slide 14, Pablo: for Preproduction should we copy the data?
Answer from Maarten and Andrea is ‘no’.
Andrea: Please add back the preprod links to the dashboard homepage.
Maarten: Will the weekly ops reports be replaced with the reports generated from SAM3?
Pablo: Yes and this will be proposed and presented at the GDB.
Andrea: When do you think you can put SAM3 in production?
Pablo: So far the system has been tested internally by us. Now it is time for people to start using SAM3 as the entry point. For the August report we will compare the reports of both SAM2 and SAM3, and I hope that for the September report we will use the one generated from SAM3.
Andrea: Please make sure that the API of SAM3 is backwards compatible with the API of SUM (SAM2), as we have people in CMS who use this API in scripts to query data, for example for site readiness.
Julia: What was the outcome of the discussion on OPS T0?
Pablo: What will happen is that they will puppetise their box. Marian: They have our puppet manifests.
Pablo: For visualisation they use SLS.
Pablo: We meet again in three weeks, on the 19th, and we would like to have feedback from the VO representatives on the new system.