Notes from the Monitoring Consolidation meeting of Oct 31, 2014

Local: Andrea, Eddie, Ivan, Julia, Luca, Maarten, Maite (last to arrive), Marian, Nicolo, Pablo, Stefan

Vidyo: Alberto, David, Salvatore

Apologies: Lionel


Minutes taken by Ivan and Maarten

JIRA actions

- The new brokers in puppet are ready, and the old ones are being drained

- work on REBUS reports has started

- SUM to SAM-3 migration: the old prod hosts will still be available in Nov,
  while the preprod hosts will be stopped already

- the old queues should also work with the new system; known issues have been fixed

- the URLs are different: experiments need to adjust their clients

- Please be aware of the algorithm. ATLAS_CRITICAL and CMS_CRITICAL, for instance, include 'all SRM' (instead of SRM), which means that sites with multiple SEs need to have all of them up.

- Nagios site plugin: necessary changes have been communicated to Pepe,
  to be implemented in Nov
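
As an illustration of the 'all SRM' rule noted in the actions above, here is a minimal sketch. The function name, status strings and host names are hypothetical and are not taken from the SAM-3 code; the sketch only shows why a site with several SEs counts as down as soon as one of them is down.

```python
# Hypothetical illustration of "all SRM" versus "any SRM" aggregation.
# Statuses and host names are made up; this is not the SAM-3 implementation.

def site_srm_ok(se_status, require_all=True):
    """Return True if the site's SRM service group counts as up.

    se_status maps each storage element host to "OK" or "CRITICAL".
    With require_all=True every SE must be OK ("all SRM");
    with require_all=False a single working SE is enough.
    """
    results = [status == "OK" for status in se_status.values()]
    if not results:
        # Assumption: a site with no tested SE is not counted as up.
        return False
    return all(results) if require_all else any(results)


site = {"se01.example.org": "OK", "se02.example.org": "CRITICAL"}
print(site_srm_ok(site, require_all=True))
print(site_srm_ok(site, require_all=False))
```

With one of the two SEs failing, the site is down under the "all" rule but would have been up under an "any" rule.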

Nagios configuration (Marian)

1) Evolution of the experiment probe submission framework (SAM/Nagios)
2) Motivation:
        the aim of the experiment probe-submission network
3) SAM/Nagios:
        Components of SAM/NAGIOS, how it's done. The functionality is based on Nagios for the most part.
4) Technology Evolution:
        Outside work improving and expanding Nagios. Popularity of Nagios also led to a very stable core.
5) Requirements (short-list):
                Reduce unsupported components and dependencies
                Decouple config, plugins, submission.
                        Pablo: Out of 14 components, how many do you think will remain long-term?
                        Marian: Three!
                Maintenance, upgrades (catch up).
                        Some of this is more urgent, will get to priorities later.
        Plugins: new ones, fixes, replacements
        New use cases
        Scalability and availability.
                Reaching limits of single Nagios
                        ATLAS was simplified, so now CMS is the "leader" with 3 FQANs
                Single point of failure
                        Need at least hot standby

6) Probe submission network
7) Simplified view:
        Configuration and metric store/processing/visualization and authorization
        ATP for topology
        POEM serves metrics
                test remote API
                on top of nagios plugin library (gridmon)
        Complex bridges between Nagios and publishing that are no longer needed
8) Proposed changes:
        Evolution, not revolution
        Replace the Nagios configuration component (NCG), because it was developed for and is supported for EGI.
Stefan: Will you have to rewrite this?
Marian: Yes, there is no other possibility. The current NCG dates from 5-6 years ago and was developed to configure site Nagios. It is a site-based component that needs sites to work, but Nagios itself is service-based and doesn't know about sites. This needs to be done.
Julia: It will be much simpler.
Marian: Of course. When we drop the site level feature, it's going to simplify it.

        Also separate input sources and Nagios configuration concepts.
        Refactor Nagios messaging to catch up to new libraries. High priority, old ActiveMQ had to be backported.
                Pablo: do you plan the same queue structure?
                Marian: up for discussion, the plan is to have a configurable Nagios to message bridge. Configurable for different serialization, brokers, etc. Then we can decide if we keep the queue structure.
                Julia: as far as I understand the authorization and authentication are important features that are missing?
                Marian: yes. There is the new authorization service voms-to-httpassword that fetches DNs from VOMS and puts them into http so sysadmins can go to the Nagios box. It is needed if we want people to reschedule tests.
                Stefan: how often do we expect this to happen?
                Maarten: isn't this component stable and working?
                Marian: this component is not owned by us and it's not really working for OSG, for example, because not all sysadmins for OSG are members of CMS or ATLAS, so VOMS is not enough.
                Maarten: Will we have to continue supporting this feature? It has caused problems.
                Marian: it's useful to have remote access to Nagios and it should be authorized. Allowing them to reschedule is another topic, but sites see it as useful.
                Julia: Is this for the central Nagios box? Should remote people have access to central CMS or ATLAS Nagios?
                Marian: Yes, they already have.
                Pablo: how often do they reschedule?
                Marian: We only know when they complain, so no statistics available.
                Maarten: The machinery of Nagios is there for a reason; should admins even be allowed to interfere with that? Maybe they should just wait to see results?
                Andrea: I think it is OK to let sites reschedule. The problem is, 1 month ago a site broke the configuration by turning a passive test into an active test.
                Julia: Should it be done through some API and not by accessing configuration?
                Marian: It only happened once and we give zero instruction on how to reschedule.
                Maarten: If we document it, will it become more popular and lead to more incidents?
                Maite: Every site is already using this functionality because it's analogous to site-level nagios.
                From Vidyo: It is very useful to reschedule tests, but having an API that _only_ allows rescheduling tests is enough.
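
To make the idea of a configurable Nagios-to-message bridge from the discussion above more concrete, here is a minimal sketch in which the serialization and the destination are chosen by configuration. All class, function and configuration names are hypothetical, the publisher is an in-memory stand-in rather than a real broker client or directory queue, and nothing here is taken from the SAM/Nagios code base.

```python
import json

# Hypothetical sketch of a configurable Nagios-to-message bridge.
# All names are illustrative; none come from the SAM/Nagios code base.

def serialize_json(result):
    """Serialize a check result as a JSON object."""
    return json.dumps(result, sort_keys=True)


def serialize_keyvalue(result):
    """Serialize a check result as 'key: value' lines."""
    return "\n".join(f"{key}: {result[key]}" for key in sorted(result))


SERIALIZERS = {"json": serialize_json, "keyvalue": serialize_keyvalue}


class ListPublisher:
    """Stand-in for a broker client or directory queue; keeps messages in memory."""

    def __init__(self):
        self.sent = []

    def publish(self, destination, body):
        self.sent.append((destination, body))


class Bridge:
    """Forwards check results using the serializer and destination from config."""

    def __init__(self, config, publisher):
        self.serialize = SERIALIZERS[config["serialization"]]
        self.destination = config["destination"]
        self.publisher = publisher

    def forward(self, result):
        self.publisher.publish(self.destination, self.serialize(result))


publisher = ListPublisher()
config = {"serialization": "json", "destination": "/queue/example.metrics"}
bridge = Bridge(config, publisher)
bridge.forward({"host": "ce01.example.org", "metric": "example.metric", "status": "OK"})
print(publisher.sent[0])
```

Keeping the queue structure or changing it then becomes a matter of the `destination` setting rather than of the bridge code.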

        Moving packages to EPEL
        Maybe migrate to the Open Monitoring Distribution
                Maarten: In what ways is this more open than Nagios?
                Marian: It gives several monitoring tools known to work together. Just investigating how they work with what we have will be interesting.
                Maarten: This sounds like a huge change.
                Marian: But now I am doing it so compatibility should not be a problem. Again, it's not a requirement, just something we consider.

9) New picture:
        Far fewer components
        VOMS http authentication
        New configuration component (NCGX) with plugins
        Generate intermediate config files: service configuration and metric configuration
        Migrate messaging, preferably with the same message client on check submission and on the worker node
        Simplify the Nagios bridge messaging: just one directory queue
10) Service configuration:
        Intermediate serialization
        Generated from plugins or written by hand
        Hosts have tags
        Metrics assigned to hosts and tags
        Eventually work with POEM
11) Metric configuration:
        Metric ID will include FQAN
        Join command arguments
        Include worker node configuration
        Introduce metric dependencies
Pablo: Do you plan to keep the naming convention for metrics?
Marian: The current metric convention is namespace, submission technology, and then what the metric does. It is important to decouple the metric name from the probe, which is not the case right now, but that is more on the plugin side than on the metric configuration side. After that we are free to introduce any metric names. I see the current convention as reasonable.
12) Probes/plugins                                         

13) Current situation.
        The status report of 1 year ago is in the back-up slides.  A lot was consolidated and simplified.
        ALICE is testing ARC but it's not usable for LHCb because the payload is not there

Stefan: do they say if they want to change that?
Marian: it's unlikely they'll support your complex use case. EGI is planning to dump worker-node probes, also dump the WMS and cream-ce plugins, and move to plugins developed and maintained by product teams. So the cream-ce probe will have no payload either. So if we have to support 3 back-ends, it's not difficult to support ARC too.

Stefan: ARC will just replace WMS.
Marian: It's probably easier to write ARC backend on the same shared framework of WMS, Cream-ce and condor.                

Storage, LFC plug-ins. Is there a plan to move away from that?

Stefan: yes, LHCb plans to move away from LFC
Some experiment-specific additions, but common framework.
Worker-node plugins: once we move to auto-generated config this should not be a problem.


14) Proposed changes.
        Refactor Nagios plugin library (python-gridmon)
                Basis for both storage and job-submission plugins
                Need to be simplified
Andrea: what are the advantages of gridmon?
Marian: It allows publishing passive and active results, compiling detailed and summary output, and parsing command-line arguments. Basically it implements the Nagios plugin standard. It redirects stderr to stdout, things like that. It's very useful.
Julia: One of the first things we looked at in the monitoring consolidation project was gridmon and it was found to be very useful.
Nicolo: Metric chain is implemented in gridmon.
Marian: We don't need everything gridmon provides, some things can be cut off, but many basic things and backward compatibility are necessary.
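
For readers unfamiliar with the Nagios plugin standard mentioned in the discussion above, here is a minimal sketch of its core conventions: a status reported through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), a one-line summary, and optional detailed output on the following lines. It deliberately does not use python-gridmon; the helper names and the example check are hypothetical.

```python
# Minimal sketch of the Nagios plugin conventions; this is not gridmon code.
# Exit codes as defined by the Nagios plugin guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_output(status, summary, details=()):
    """First line is the summary; any further lines are detailed output."""
    lines = [f"{LABELS[status]}: {summary}"]
    lines.extend(details)
    return "\n".join(lines)


def run_probe(check):
    """Run a check function and map unexpected errors to UNKNOWN."""
    try:
        status, summary, details = check()
    except Exception as exc:
        status, summary, details = UNKNOWN, f"probe failed: {exc}", []
    return status, format_output(status, summary, details)


def example_check():
    """Hypothetical check; a real probe would contact a service here."""
    return OK, "file copied", ["step 1: put succeeded", "step 2: get succeeded"]


status, output = run_probe(example_check)
print(output)
# A real plugin would finish with sys.exit(status) so Nagios sees the exit code.
```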
        Job submission plugins refactoring.
        Upgrade to SLC6/7 and UMD3
Maarten: I hope not CC7?

Marian: If we do SLC6, we can start testing 7. Eventually it will come.
Maarten: It's too new, don't do it now.
Maite: Maybe if the move is in 3 months it's good to plan for 7. We've been on 5 for ages.
Maarten: But the middleware is not supported on EL7. We should wait until the middleware is available.
Marian: Yes, we should wait for the middleware and the plugins to be available on 7.
        For UMD3 it's also important to move to a different configuration model.
        Should consider the possibilities of one giant grid environment or a per-plugin environment.
Maarten: I would argue UI needs little configuration. You need to define BDII and the proxy server, is that all?
Marian: I'll need to look into it.

15) Worker node test framework
16) worker node test framework
        Maarten: Compliant, not complaint.
        Marian: right.
        Bootstrapping script (nagrun)
                some tricky logic, had to be changed with move to Condor, sometimes hard to find transfer files, etc.
17) Proposed changes:
        Maarten: Does Nagios need to be on the worker node? Naively the worker node only needs to run a couple of tests, send the results and exit.
        Marian: Yes, but in practice it's more complicated. If something goes wrong it's nice to have Nagios so you know tests are independent, clean-up is done properly, and set-up and scheduling are stable. There are other possibilities, but we need the same Nagios plugins on WNs and the best tool to run Nagios plugins is Nagios.
18) Summary
19) Areas of work
        Messaging, environment, upgrade OS, UMD3 => NCGX => Plugins
                Andrea: Did you mention compatibility with CondorC? It's urgent for US CMS. What's the timescale on that?
                Marian: Timescale was this week, but will be done next week and be in preprod.
                Pablo: And the timescale on the whole thing?
                Marian: No timescale, but I estimate 4-6 months depending on effort. Also, we have one student to help for 6 months.


- many sites use Nagios and they can import test results from experiments;
  a variety of other tools are popular as well (see presentation)

- Ganglia is for monitoring system usage (CPU/memory/disk/...), not services


Next meeting in 2 weeks.
The next meeting will be on SAM3 experience and recomputation.