Report on the experience (or the proposed activity). It would be very important to mention key services which are essential for the success of your activity on the EGEE infrastructure.
L&B records are oriented towards operational semantics: each service logs its
own view of the job information and status. The result is a very large amount
of highly redundant data, in many cases with no a priori syntax or semantics
(blobs in the long_fields table). We first developed a software suite that
segments the data in order to discover the basic attributes and cautiously
filters out redundant information. The suite also converts the categorical
data into a boolean description, convenient for many off-the-shelf mining and
learning tools. The next step was analysis. Elementary methods (scoring each
attribute independently) provided little information. The ROGER algorithm [2],
developed in our lab, provides a good predictor, which can be interpreted
through sensitivity analysis. On-going work addresses intelligent clustering
to reduce dimensionality (frequent itemsets) and the learning of non-linear
models, which can detect compound failure conditions.
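To make the boolean re-encoding and the elementary scoring step concrete, here is a minimal sketch; the record layout, attribute names and AUC scoring criterion are illustrative assumptions, not the actual suite or the L&B schema.

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    # Toy extract of pre-segmented records; attribute names and values
    # are illustrative assumptions, not the actual L&B schema.
    records = pd.DataFrame({
        "dest_ce":   ["ce01.example.org", "ce02.example.org",
                      "ce01.example.org", "ce02.example.org"],
        "vo":        ["atlas", "biomed", "atlas", "biomed"],
        "resubmits": [0, 3, 1, 2],
        "failed":    [0, 1, 0, 1],   # target: final job status
    })

    # Boolean (one-hot) re-encoding of the categorical attributes, as
    # used to feed off-the-shelf mining and learning tools.
    X = pd.get_dummies(records.drop(columns="failed"),
                       columns=["dest_ce", "vo"]).astype(float)
    y = records["failed"]

    # "Elementary" analysis: score each attribute independently, here
    # by its individual AUC against the failure label.
    scores = {col: roc_auc_score(y, X[col]) for col in X.columns}
    for col, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{auc:.2f}  {col}")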
With a forward look to future evolution, discuss the issues you have encountered (or that you expect) in using the EGEE infrastructure. Wherever possible, point out the experience limitations (both in terms of existing services or missing functionality).
This paper is an attempt to explore the performance and limits of this purely
passive approach to detection and diagnosis: a minimally invasive failure
analyser would be based solely on the analysis of the L&B records of
production jobs. A more intrusive approach would use SAM (or equivalent
software) and active learning methods to (approximately) design an optimal
probe set and infer the system state [3]. Support for gathering and
interpreting data in this area would be required.
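For concreteness, the following sketch illustrates the kind of probe-set design alluded to here, in the spirit of [3]: greedily selecting the probes whose outcomes most reduce uncertainty about component states. The probes, components and success model are hypothetical, not actual SAM tests.

    import math
    from itertools import product

    # Hypothetical dependency model: each probe exercises a set of grid
    # components and succeeds iff all of them are up. Probe and component
    # names are illustrative, not actual SAM tests.
    PROBES = {
        "job_submit": {"WMS", "CE"},
        "data_read":  {"SE"},
        "full_chain": {"WMS", "CE", "SE"},
        "info_query": {"BDII"},
    }
    COMPONENTS = ["WMS", "CE", "SE", "BDII"]

    # Candidate system states: every component independently up or down.
    STATES = [dict(zip(COMPONENTS, bits))
              for bits in product([True, False], repeat=len(COMPONENTS))]

    def signature(state, probes):
        """Outcome vector of a probe set in a given system state."""
        return tuple(all(state[c] for c in PROBES[p]) for p in probes)

    def residual_entropy(probes):
        """Expected entropy (bits) of the state once the outcomes of
        `probes` are observed, under a uniform prior over STATES."""
        cells = {}
        for s in STATES:
            cells.setdefault(signature(s, probes), []).append(s)
        n = len(STATES)
        return sum(len(c) / n * math.log2(len(c)) for c in cells.values())

    def greedy_probe_set(budget):
        """Greedily add the probe that most reduces residual entropy."""
        chosen = []
        for _ in range(budget):
            best = min((p for p in PROBES if p not in chosen),
                       key=lambda p: residual_entropy(chosen + [p]))
            chosen.append(best)
        return chosen

    print(greedy_probe_set(2))  # the two most informative probes

Under a uniform prior, the greedy criterion first picks the probe whose success/failure splits the candidate states most evenly, which is the standard approximation when exact probe-set optimization is intractable.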
Describe the scientific/technical community and the scientific/technical activity using (planning to use) the EGEE infrastructure. A high-level description is needed (neither a detailed specialist report nor a list of references).
Autonomic Computing (AC) is defined as “computing systems that manage
themselves in accordance with high-level objectives from humans” [1]. AC is
now a well-established scientific domain, and a priority for industry.
Automated detection, diagnosis, and ultimately management of
software/hardware problems define autonomic dependability. The paper reports
on applying state-of-the-art autonomic dependability methods to the Logging
and Bookkeeping (L&B) data, with promising results on detection.
Describe the added value of the Grid for the scientific/technical activity you (plan to) do on the Grid. This should include the scale of the activity and of the potential user community and the relevance for other scientific or business applications
The production status and integration level reached by the EGEE middleware
and monitoring services provide immense datasets. These are challenging
targets for the Machine Learning (ML) community, whose techniques are at the
core of AC. The fundamental motivation for this interest is the complexity of
the hardware/software components, and the intricacy of their interactions,
which defeat attempts to build models from a priori knowledge alone.
Furthermore, EGEE is not a steady-state system, not only because it is still
ramping up, but more profoundly because of the externally-driven collective
behaviour of its users.
EGEE monitoring data exemplify, in extreme form, two classical issues in ML:
1) the curse of dimensionality (a state space exponential in the number of
variables); and 2) data sparsity, most of the state-action space remaining in
practice unexplored. EGEE data offer an extra complexity, not addressed in
this paper: the integration of heterogeneous sources of information (ontology
building).
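To give a rough sense of both issues, here is a back-of-the-envelope illustration; the figures are assumptions, not measured EGEE statistics.

    # Back-of-the-envelope illustration (assumed figures, not measured
    # EGEE statistics): with d boolean attributes, the state space holds
    # 2**d configurations, while a log of n jobs visits at most n of them.
    d, n = 100, 10_000_000          # e.g. 100 attributes, 10^7 jobs
    coverage = n / 2**d
    print(f"at most {coverage:.1e} of the state space observed")
    # -> at most 7.9e-24 of the state space observed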