9–11 May 2007
Manchester, United Kingdom
Europe/Zurich timezone

Grid reliability

9 May 2007, 17:30
2h 30m
Manchester, United Kingdom

Board: P-003
Poster | Poster session | Poster and Demo Session

Speaker

Pablo Saiz (CERN)

Report on the experience (or the proposed activity). It would be very important to mention key services which are essential for the success of your activity on the EGEE infrastructure.

The whole system is based on studying the actions that users perform.
The first and most important dependency is therefore on monitoring
systems. We interface with the DASHBOARD, which hides the differences
between the heterogeneous sources of data (such as RGMA, ICXML or
MonALISA).
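
The idea of hiding source differences behind one interface can be
sketched as a small set of adapters. This is only an illustration, not
the DASHBOARD's actual code: the JobEvent record, the adapter functions
and every raw field name below are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JobEvent:
    """Common record shape (hypothetical) shared by all sources."""
    job_id: str
    site: str
    status: str


# Each adapter maps one source's raw record to a JobEvent.
# The raw field names are invented for illustration only.
def from_rgma(raw: dict) -> JobEvent:
    return JobEvent(raw["jobId"], raw["ce"], raw["state"].upper())


def from_monalisa(raw: dict) -> JobEvent:
    return JobEvent(raw["id"], raw["farm"], raw["status"].upper())


ADAPTERS: Dict[str, Callable[[dict], JobEvent]] = {
    "rgma": from_rgma,
    "monalisa": from_monalisa,
}


def normalise(source: str, raw_records: List[dict]) -> List[JobEvent]:
    """Convert raw records from a named source into the common shape."""
    adapter = ADAPTERS[source]
    return [adapter(record) for record in raw_records]
```

With this shape, code that consumes JobEvent records does not need to
know which monitoring system produced them; supporting another source
means adding one adapter.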

Another service that is very important for the effectiveness of the
Grid reliability work is GGUS, the ticket submission and tracking
system. This has already been tested with a manual procedure. Since
the results were very encouraging, we are working on ways of
automating this interaction.

Describe the scientific/technical community and the scientific/technical activity using (planning to use) the EGEE infrastructure. A high-level description is needed (neither a detailed specialist report nor a list of references).

We are offering a system to track the efficiency of different
components of the GRID. We can study the performance of both the WMS
and the data transfers.

At the moment, we have set up different parts of the system for
ALICE, ATLAS, CMS and LHCb. None of the components that we have
developed is VO-specific, so it would be very easy to deploy them for
any other VO.

With a forward look to future evolution, discuss the issues you have encountered (or that you expect) in using the EGEE infrastructure. Wherever possible, point out the experience limitations (both in terms of existing services or missing functionality)

The main problem that we have found so far is the lack of
communication between the new gLite RB and RGMA. Jobs that go through
these resource brokers do not publish their status, which makes our
task impossible.

Another problem that we might encounter is the confidentiality of the
data. To address this, we are anonymising the jobs and transfers,
since we are only interested in the different statuses that a job or
transfer goes through.
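
A minimal sketch of this kind of anonymisation, assuming a job record
is a dictionary with job_id, user, site and statuses fields (all
hypothetical names): the user is dropped, the identifier is replaced by
a keyed hash, and only the status history is kept. It is not the
authors' implementation.

```python
import hashlib
import hmac
from typing import Dict


def anonymise(record: Dict[str, object], secret: bytes) -> Dict[str, object]:
    """Keep site and status history; replace the job id with a keyed hash."""
    digest = hmac.new(
        secret, str(record["job_id"]).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return {
        "job": digest[:16],                    # stable pseudonym for the job
        "site": record["site"],
        "statuses": list(record["statuses"]),  # the only part we analyse
    }
```

Using a keyed hash rather than a plain one means the same job always
maps to the same pseudonym, so its status changes can still be
followed, while the original identifier cannot be recovered without
the secret.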

Describe the added value of the Grid for the scientific/technical activity you (plan to) do on the Grid. This should include the scale of the activity and of the potential user community and the relevance for other scientific or business applications

Our main goal is to improve the reliability of the GRID. The main idea
is to discover problems as soon as possible after they happen, and to
inform those responsible. Since we study the jobs and transfers issued
by real users, we see the same problems that users see. In fact, we
see even more problems than the end user does, since we are also
interested in following up the errors that GRID components can
overcome by themselves (for instance, when a job fails and is
resubmitted to a different site).

This kind of information is very useful to site and VO
administrators. They can find
out the efficiency of their sites, and, in case of failures, the
problems that they
have to solve.
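
As an illustration of the two measurements described above, the sketch
below computes a per-site efficiency from individual job attempts and
lists the jobs whose failures were hidden from the user by a successful
resubmission. The (job_id, site, succeeded) tuple layout and the
function names are assumptions for the example, not the format used by
the real system.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Attempt = Tuple[str, str, bool]  # (job_id, site, succeeded)


def site_efficiency(attempts: List[Attempt]) -> Dict[str, float]:
    """Fraction of successful attempts per site."""
    ok: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for _job, site, succeeded in attempts:
        total[site] += 1
        if succeeded:
            ok[site] += 1
    return {site: ok[site] / total[site] for site in total}


def hidden_failures(attempts: List[Attempt]) -> Set[str]:
    """Jobs that finally succeeded but had at least one failed attempt."""
    failed: Set[str] = set()
    succeeded: Set[str] = set()
    for job, _site, success in attempts:
        (succeeded if success else failed).add(job)
    return failed & succeeded
```

Counting attempts rather than final job outcomes is what makes the
failures recovered by resubmission visible in the per-site figures.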

The reports that we provide are also of interest to the COD, since
the errors might not be VO-specific.

Authors

Benjamin Gaidioz (CERN), Julia Andreeva (CERN), Pablo Saiz (CERN), Ricardo Rocha (CERN)
