ELFms/LEMON meeting, 05/02/2004

 

Present: Miroslaw, Karim, German, Sylvain, Dennis, Hugo, Fabio, Helge

Action Items:

      Closed actions:

·        [A1] 22/1/04 All: Provide list of 'TO DO' items to German for 2004 planning
input received from everybody.

·        [A8] Jan: send out a list of legacy (RH6) metrics to be stopped on February 1st.
done. The action now becomes removing these metrics.

       Ongoing actions:

·         [A2] 22/1/04 German: Check with ADC support level for PSF framework
no progress, but interest somewhat reduced.

·        [A5] 22/1/04 German: Propose a new CVS layout (check with Bill Tomlin)
stalled

·        [A6] Fabio: evaluate AlarmGUI (set up with Karim and test it)
Fabio tried several times, but could not establish a connection to the server. Karim: perhaps related to OraMonServer crashes (and subsequent AlarmBroker crashes). Karim to send instructions to the list on how to restart AlarmBroker.

·        [A7] Dennis: Test using ForSure.pl correlations once CMsensor functionality is stable
stalled

·         [A9] German: schedule an ELFms meeting where Jan will present the WP4 proposal.
postponed

·         [A12] German: Provide a requirements document enumerating FIO and PS requirements for the actuator framework
stalled

New actions:

·        [A13] Dennis: Check which of the existing hardware metrics can be obtained via dmidecode; check for a dmidecode RPM; send around DMI specification links to service managers

 

Low-level (BIOS) monitoring

 

·        Dennis: Metric 6310, reflecting a BIOS checksum, is currently deployed. There are, however, no reference values yet; machines would need to be taken out of production in order to understand all the differences. Another issue: dmidecode; a script exists to parse the information, but nothing is currently being stored.

 

·        General feeling that establishing reference values, and comparing the checksum against them, would be useful, but would require a very significant effort to collect them. Different opinions on whether or not this would be useful. Message to be given to FS and DS sections that deployment could be useful, but the investment cost is high.

 

·        Dmidecode considered more useful, delivering more precise information about hardware than the existing hardware sensors. Deployment would be more straightforward than for the BIOS checksum.

 

·        Jan to send a list of metrics collected by existing hardware sensors to Dennis, Dennis to cross-check which metrics could be obtained via dmidecode.

 

·        Dmidecode (as well as SMART monitoring) is contained in kernel-utils; however, the SMART monitoring in that package was found to be too old, hence kernel-utils cannot be installed. Dennis to check for a dmidecode RPM; otherwise we will package it ourselves.

 

·        Dennis to send around to service managers links to the DMI specifications, and to typical dmidecode output.
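
Note on parsing dmidecode output: the existing parsing script is not reproduced in these minutes. As an illustration only, a minimal Python sketch of how typical dmidecode text output could be turned into records suitable for storing as metrics is given below. The function name parse_dmidecode and the sample text are hypothetical; the sample follows the usual "Handle ..., DMI type ..., ... bytes" layout but does not come from a real machine.

```python
import re

# Hypothetical sample in the usual dmidecode layout (not from a real machine).
SAMPLE = """\
Handle 0x0000, DMI type 0, 20 bytes
BIOS Information
\tVendor: Example BIOS Vendor
\tVersion: 1.0
\tRelease Date: 01/01/2004

Handle 0x0001, DMI type 1, 25 bytes
System Information
\tManufacturer: Example Corp
\tProduct Name: Example Server
"""

# Each record starts with a header line such as "Handle 0x0000, DMI type 0, 20 bytes".
HANDLE_RE = re.compile(r"^Handle (0x[0-9A-Fa-f]+), DMI type (\d+), (\d+) bytes")


def parse_dmidecode(text):
    """Split dmidecode text output into a list of records (assumed layout)."""
    records = []
    current = None
    expect_title = False
    for line in text.splitlines():
        match = HANDLE_RE.match(line)
        if match:
            current = {
                "handle": match.group(1),
                "type": int(match.group(2)),
                "title": None,
                "fields": {},
            }
            records.append(current)
            expect_title = True
            continue
        # Skip blank lines and anything before the first handle line.
        if current is None or not line.strip():
            continue
        # The first unindented line after a handle is the record title.
        if expect_title and not line[0].isspace():
            current["title"] = line.strip()
            expect_title = False
            continue
        # Single-tab "Key: Value" lines are the record's fields;
        # deeper-indented list items are ignored in this sketch.
        if line.startswith("\t") and not line.startswith("\t\t") and ":" in line:
            key, _, value = line.strip().partition(":")
            current["fields"][key.strip()] = value.strip()
    return records


if __name__ == "__main__":
    for record in parse_dmidecode(SAMPLE):
        print(record["type"], record["title"], record["fields"])
```

Keying each record by DMI type and title would make it straightforward to cross-check the fields against the metrics delivered by the existing hardware sensors.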

 

Status reports:

CCS prototype 2:

·        Fabio: PVSS stopped, all scripts etc. uploaded into it-ccs-sw CVS repository. Client machines not sending data to ccs002d any more.

MSA and sensors:

·        Jan: Metrics removed as agreed; Marek's metrics added to configuration, should now be in Oracle. Another metric added on Vlado's request for load balancing monitoring (just another daemon to be checked).

·        Hugo: Changes to network monitoring sensor applied, being tested. Hugo to give all deployment information to the mailing list.

Lemon on Solaris:

·        Manuel: Work ongoing. Artur: no news

Lemon on Windows:

·        German will meet Alberto in order to explain our strategy.

MR API:

·        Dennis: Has found that subscriptions cause segfaults on standard RH 7.3 clients, but not on other client machines using RHEL 3/ia64 or other distributions. Work ongoing in order to isolate the problem.

Oracle monitoring servers:

·        David: [see his e-mail]

·        Jan: RPM not as important as other work requiring Sylvain's advice (metadata for example). Server should be written such that unknown metrics do not crash the server.

·        Fabio: Operator procedures for OraMonServer restart have been updated.

New Monitoring Server:

·        Sylvain: progressing. Subscription thread added, efficiency tested. Server can fully process more than 10'000 samples per second. API established (subset of MR API) that needs to be implemented for plugging a database backend (e.g. Oracle). This means that David will have to implement it with Oracle.

Alarm display:

·        Karim: problem of configuration file still around. Using Insure++ had unpleasant side effects.

·        German: Would like to see priority given to Webstart investigations as well as making the display available to Fabio.

CMsensor:

·        Dennis: Development ongoing, including making the framework independent of the MSA

Actuator Framework:

·        Nothing to report

 

AOB:

·        Restructuring is starting so that new sensors and metrics, including their configuration, are implemented in a standard way.