WLCG Consolidation meeting 30/08/2013 minutes
_________________________________________________________________________________
Participants: Eddie, Dave, Julia, Luca, Maarten, Andrea, Alberto, Lionel, Pedro, Marian, Markus, Alessandra (on the phone)
Minutes taken by Luca
_________________________________________________________________________________

Pablo: Are the experiment representatives happy with the current proposal?
Maarten: Nothing wrong with it. ALICE is ok.
Alessandra: Need to confirm with Simone and Alessandro.
Andrea: CMS (from Nicolo's feedback) is ok.
(Stefan, through email to the mailing list): LHCb is ok.

Dave's presentation about Elasticsearch (ES) for the dashboard use case

Slide 3:
Pablo: The ES cluster installation done through Puppet was smooth and easy, thanks to the templates of the AI Monitoring team.

Slide 8:
Pedro: ES version 1.0, which comes with the new aggregation module, is (unofficially) scheduled for the end of the year.
Maarten: If not, can we implement the missing functionality ourselves?
Dave: Plug-in development is doable but may be complex. The next slides present several grouping/aggregation solutions.

Slide 9:
Maarten: Is any of the currently implemented solutions expected to be in use one year from now?
Dave: No. If we go for ES, production solutions should be based on ES 1.0 aggregation.

Slide 12:
Have you graphically compared the results coming from ES with the Oracle ones?
Dave: Yes, the plots are identical; the produced JSON is the same.

Slide 14:
Maarten: The advantages of the Oracle cache are limited; it is not common for users to run the same query with the same parameters within a short time window.
Alessandra: Are we considering replacing Oracle within one year?
Pablo: This is an investigation into whether we can move to another storage solution.
Markus: Anything we can do to reduce the Oracle dependency is welcome.
Markus: When comparing performance, it has to be taken into account that the tests were made on VMs; since ES is quite I/O demanding, it can benefit from real machines.
Pedro: Confirmed by experience.
Markus: Moreover, it is quite easy to get out-of-warranty but still good machines, which could be used to demonstrate the performance gain on real hardware.
Maarten: Will increasing the number of VMs improve the results?
Pedro: Not clear. No tests in that direction have been done on the AI cluster.
Markus/Pedro/Pablo: To repeat a test with a large number of VMs, it is agreed to use the Agile cluster for a short test.
Markus: The impact of moving to real hardware also remains to be seen. The question is not whether we get something as fast as Oracle, but something fast enough that scales well.
Andrea: For the historical view, this can also be interesting for job monitoring.
Julia: That use case is too complicated; for the moment Eddie is playing with HBase.
Pablo: When evaluating the solution, another factor is how generic it can be, so that it can be shared across multiple use cases.

Pedro's presentation on common data format

Common message format for AI monitoring and WLCG monitoring.
The benefits of a common format for storing metric data are evident, from sharing tools to combining/correlating data, i.e. a common structure for ES data/index.
2 concrete use cases: site status and Lemon data - format definition of the data structure + ES metadata.
Conceptually the items are the same, but the names are only 10% aligned.
There is a TWiki page from AI dedicated to the message format. The site status specification can be added to the TWiki page as an extension of the core metrics.
The idea is to define a core format and make an effort to have the different metrics comply with it.

Julia: How can this message format be translated to a database from the current SSB perspective?
Pablo: The core format also defines the format in Elasticsearch.
Alberto: Is there any visualization tool to display this common format?
Pedro: This is a custom JSON format. Kibana on top of ES can browse any format.
Pedro: The next step is to maintain this page and assign concrete version numbers.
Lionel: Why does this format include Lemon but not syslog?
Pedro: Syslog integration is doable but requires some effort to map the right information to the right keys.
Lionel: For the message broker monitoring, Graphite is successfully used for visualization/aggregation of time series of numerical data. Are we sure that a generic text-based search engine (Lucene) is the best tool for generic time series visualization?
Pedro: No.
Julia: In WLCG monitoring we have a mixture of numerical data with statuses, errors and other string values to be visualized. A unified solution with acceptable performance will simplify the architecture.
Pedro: AI did a Graphite evaluation with Lemon data and syslog, and the aggregation is wonderful. Still, it is a tool only for numerical data.
Maarten: Is Graphite going to be evaluated as a tool for AI monitoring?
Pedro: No, we look at it only for comparison. We are considering only ES and Kibana.
Andrea: From the SSB perspective, does this change have any impact on how users feed the data?
Pablo: No, it will be transparent.
Alberto: Do we want to have tests on real machines?
Pablo: It is ok to start with tests using the AI monitoring cluster.
Alberto: Is the visualization of ES data via Kibana comparable to what the WLCG monitoring tools provide nowadays?
Markus: We have to face a drastic reduction in the number of people working on monitoring. Users have to be aware that changes may happen to the tools and interfaces they are used to.
Julia: Since the user interface has been shaped by years of experience, we should focus on changing the backend infrastructure and affect the UI as little as possible.
Pedro: Maintenance of existing tools also has to be considered in terms of updating metadata for every new instance, VO, site, etc.
Alberto/Markus: Even a project declared stable requires people who know how to keep it up to date with evolving requirements, such as new browser versions and security issues.
Lionel: Support for mobile devices is also a strong requirement to be taken into account.
Next meeting: Thursday 12th September, 14:00
On the agenda: discussion about probes, what the experiments use and what they would like to change