WLCG Consolidation meeting 30/08/2013 minutes
_________________________________________________________________________________

Participants: Eddie, Dave, Julia, Luca, Maarten, Andrea, Alberto, Lionel, Pedro, Marian, Markus
Alessandra on the phone
Minutes taken by Luca
_________________________________________________________________________________


Pablo: Are the experiment representatives happy with the current proposal?
Maarten: Nothing wrong with it. ALICE is OK.
Alessandra: Need to confirm with Simone and Alessandro.
Andrea: CMS (from Nicolo's feedback) is OK.
(Stefan, through email to the mailing list): LHCb ok


Dave's presentation about Elasticsearch (ES) for the dashboard use case

Slide 3: Pablo: The ES cluster installation done through Puppet was smooth and easy thanks to the templates of the AI Monitoring team.

Slide 8: Pedro: ES version 1.0, which comes with the new aggregation module, is (unofficially) scheduled for the end of the year.
Maarten: If not, can we implement the missing functionality ourselves?
Dave: Plug-in development is doable but may be complex. The next slides present several grouping/aggregation solutions.

Slide 9: Maarten: Is any of the currently implemented solutions foreseen to be in use one year from now?
Dave: No. If we go for ES, production solutions should be based on ES 1.0 aggregation.
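
For illustration only, a grouping request of the kind discussed for slides 8 and 9 might look roughly as follows once expressed with the ES 1.0 aggregation module. This is a minimal sketch and not what was shown in the presentation: the field names (`site`, `timestamp`, `value`), the bucket names and the helper function are assumptions made up for this example.

```python
import json


def build_grouping_request(site_field="site", time_field="timestamp",
                           value_field="value", interval="1h"):
    """Build an ES 1.0-style aggregation request body as a plain dict.

    All field names are hypothetical placeholders, not the real
    dashboard schema.
    """
    return {
        # Only the aggregated buckets are wanted, not the matching documents.
        "size": 0,
        "aggs": {
            # First level: one bucket per site.
            "by_site": {
                "terms": {"field": site_field},
                "aggs": {
                    # Second level: time bins inside each site bucket.
                    "over_time": {
                        "date_histogram": {
                            "field": time_field,
                            "interval": interval,
                        },
                        # Metric computed inside each time bin.
                        "aggs": {
                            "avg_value": {"avg": {"field": value_field}}
                        },
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    # The resulting JSON would be sent as the body of a search request.
    print(json.dumps(build_grouping_request(), indent=2))
```

The nesting (terms, then date histogram, then average) is the kind of grouping that would otherwise need a plug-in or client-side processing.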

Slide 12: Have you graphically compared the results coming from ES with the Oracle ones?
Dave: Yes, the plots are identical and the produced JSON is the same.

Slide 14: Maarten: The advantages of the Oracle cache are limited; it is not common for users to run the same query with the same parameters within a short time window.

Alessandra: Are we considering replacing Oracle in one year?
Pablo: This is an investigation into whether we can move to another storage solution.
Markus: Anything we can do to reduce the Oracle dependency is welcome.


Markus: When comparing performance, it has to be taken into account that the tests were made on VMs; since ES is quite I/O demanding, it can benefit from real machines.
Pedro: Confirmed by experience.
Markus: Moreover, it is quite easy to get out-of-warranty but still good machines that can be used to prove the performance gain on real hardware.
Maarten: Will increasing the number of VMs improve the results?
Pedro: Not clear. No tests in that direction have been done on the AI cluster.
Markus/Pedro/Pablo: To repeat the test with a large number of VMs, it is agreed to use the Agile cluster for a short test.
Markus: The impact of going to real hardware also remains to be seen. The question is not to have something as fast as Oracle, but something that is fast enough and scales well.

Andrea: For the historical view, this can also be interesting for job monitoring.
Julia: That use case is too complicated; for the moment Eddie is experimenting with HBase.

Pablo: When evaluating the solution, another factor is how generic it can be, so that it can be shared by multiple use cases.


Pedro's presentation on common data format

Common message format for AI monitoring and WLCG monitoring
The benefits of a common format for storing metric data are evident, from tool sharing to combining/correlating data.
I.e. a common structure for ES data/index.
Two concrete use cases: site status and lemon data
- format definition of data structure + ES metadata
Conceptually the items are the same, but the names are only 10% aligned.

There is a twiki page from AI dedicated to the message format.
The site status specification can be added to the twiki page as an extension of the core metrics.

The idea is to define a core format and make an effort to have the different metrics comply with it.
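
As a rough illustration of the "core format plus extensions" idea, the sketch below wraps two different metrics (a site status and a lemon sample) in the same envelope and checks that both carry the core keys. None of this comes from the AI twiki page: the key names (`timestamp`, `producer`, `type`, `entity`, `data`), the sample values and both helper functions are hypothetical and only show the principle.

```python
import json

# Hypothetical core keys; the real list is the one defined on the twiki page.
CORE_KEYS = ("timestamp", "producer", "type", "entity")


def make_message(timestamp, producer, msg_type, entity, **extension):
    """Wrap metric-specific fields in a common core envelope."""
    return {
        "timestamp": timestamp,
        "producer": producer,
        "type": msg_type,
        "entity": entity,
        # Everything specific to one metric lives under a single key,
        # so the core part stays identical across use cases.
        "data": extension,
    }


def missing_core_keys(message):
    """Return the core keys that a message does not provide."""
    return [key for key in CORE_KEYS if key not in message]


if __name__ == "__main__":
    # Made-up sample values for the two use cases named in the minutes.
    site_status = make_message("2013-08-30T10:00:00Z", "ssb", "site_status",
                               "EXAMPLE-SITE", status="ok")
    lemon_sample = make_message("2013-08-30T10:00:05Z", "lemon", "metric",
                                "example-host", metric_name="load", value=0.42)
    for message in (site_status, lemon_sample):
        print(json.dumps(message, sort_keys=True), missing_core_keys(message))
```

With an envelope of this kind, tools that only need the core keys can treat site status and lemon data alike, while the extension part stays free to differ per metric.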

Julia: How can this message format be translated to a database from the current SSB perspective?
Pablo: The core format also defines the format in Elasticsearch.

Alberto: Is there any visualization tool to display this common format?
Pedro: This is a custom JSON format. Kibana on top of ES can browse any format.

Pedro: The next step is to try to maintain this page and give concrete version numbers.


Lionel: Why does this format include lemon but not syslog?
Pedro: syslog integration is doable but requires some effort to map the right information to the right keys.

Lionel: For the message broker monitoring, Graphite is successfully used for visualization/aggregation of time series of numerical data. Are we sure that a generic text-based search engine (Lucene) is the best tool for generic time series visualization?
Pedro: No.
Julia: In WLCG monitoring, we have a mixture of numerical data with status, errors and other string values to be visualized. A unified solution with acceptable performance will simplify the architecture.
Pedro: AI did a Graphite evaluation with lemon data and syslog, and aggregation is wonderful. Still, it is a tool only for numerical data.
Maarten: Is Graphite going to be evaluated as a tool for AI monitoring?
Pedro: No, we look at it only for comparison. We are considering only ES and Kibana.

Andrea: From the SSB perspective, does this change have any impact on how users feed the data?
Pablo: No, it will be transparent.

Alberto: Do we want to have tests with real machines?
Pablo: It's OK to start with tests using the AI monitoring cluster.
Alberto: Is the visualization of ES data via Kibana comparable to what the WLCG monitoring tools provide nowadays?
Markus: We have to face a drastic reduction in the number of people working on monitoring. Users have to be aware that changes may happen to the tools and to the interfaces they are used to.
Julia: Since the user interface has been defined by years of experience, we should focus on changing the backend infrastructure and affect the UI as little as possible.

Pedro: Maintenance of existing tools has to be considered also from the perspective of updating metadata for every new instance/VO/site, etc.
Alberto/Markus: Even projects declared stable require people who know how to keep them up to date with evolving requirements, such as new browser versions and security issues.
Lionel: Mobile device support is also a strong requirement to be taken into account.


Next meeting: Thursday 12th September 14:00
On the agenda: discussion about probes, what the experiments use and what they would like to change