Speaker
Flavia Donno
(CERN)
Description
User and virtual organisation support in EGEE
Providing adequate user support in a grid environment is a very challenging task
due to the distributed nature of the grid. The variety of users and the variety of
Virtual Organizations (VO) with a wide range of applications in use add further to
the challenge.
The people asking for support are of various kinds. They can be generic grid
beginners, users belonging to a given Virtual Organization and dealing with a
specific set of applications, site administrators operating grid services and local
computing infrastructures, grid monitoring operators who check the status of the
grid and need to contact the specific site to report problems; to this list can be
added network specialists and others.
Wherever a user is located and whatever problem is experienced, the user expects a
given set of services from a support infrastructure. A non-exhaustive list is
the following:
a) a single access point for support;
b) a portal with well-structured sources of information and up-to-date
documentation concerning the VO or the set of services involved;
c) experts who know the particular application in use, who can talk with the user
to better understand what he/she is trying to achieve (hotline), and who can help
integrate user applications with the grid middleware;
d) correct, complete and responsive support;
e) tools to help resolve problems (search engines, monitoring applications,
resources status, etc.);
f) examples, templates, specific distributions for software of interest;
g) integrated interface with other Grid infrastructure support systems;
h) connection with the grid developers and the deployment and operation teams;
i) assistance during production use of the grid infrastructure.
With the Global Grid User Support (GGUS) infrastructure, EGEE attempts to meet all
of these expectations. The current use of the system and the user satisfaction
ratings indicate that this goal has so far been met with some success.
To date, GGUS has been shown to handle up to 200 requests per day and provides all
of the services listed above. In what follows we discuss the organization of
the GGUS system, how it meets the users’ needs, and the current open issues.
The model of the existing EGEE Global Grid User Support (GGUS) is as follows. The
support model in EGEE can be summarised as "regional support with central
coordination". Users can submit a support request to the central GGUS service, to
their Regional Operations' Centre (ROC), or to their Virtual Organisation (VO)
helpdesk.
Within GGUS there is an internal support structure for all support requests. The
ROCs, the VOs and the other project-wide groups such as middleware groups (JRA),
network groups (NA), service groups (SA) and other grid infrastructures (OSG,
NorduGrid, etc.) are connected via a central integration platform provided by GGUS.
The GGUS central helpdesk also acts as a portal for all users who do not know where to
send their requests. They can enter them directly into the GGUS system via a web
form or e-mail.
This central helpdesk keeps track of all service requests and assigns them to the
appropriate support groups. In this way, formal communication between all support
groups is possible. To enable this, each group has built an interface (e-mail and
web front-end, or interface between ticketing systems) between its internal support
structure and the central GGUS application.
In the central GGUS system, first line support experts from the ROCs and the
Virtual Organizations do the initial problem analysis; this support effort is
itself widely distributed. These experts are called Ticket Processing Managers
(TPM): generic TPMs for generic first line support and VO TPMs for VO specific
first line support. They can either provide the solution to the reported problem
or escalate it to more specialized support units that provide network, middleware
and grid service support. They may also refer it to specific ROCs or VO experts.
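The first line handling described above can be sketched in Python as follows. This is only an illustrative model and not the GGUS implementation: the `Ticket` record, the `first_line_triage` function, the dictionary of known solutions and the support unit names are all hypothetical, and sending a ticket to the VO TPM whenever the submitter names a VO is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    """Hypothetical, minimal ticket record (illustration only)."""
    description: str
    vo: Optional[str] = None       # VO named by the submitter, if any
    assigned_to: str = "GGUS"      # support unit currently holding the ticket
    history: list = field(default_factory=list)


def first_line_triage(ticket, known_solutions, specialised_unit=None):
    """Model one TPM step: solve the ticket or escalate it.

    known_solutions: mapping of problem description -> solution text (assumed).
    specialised_unit: second level unit chosen by the TPM if no solution is known.
    Returns the solution text, or None if the ticket was escalated.
    """
    # Assumption: tickets that name a VO go to that VO's TPM,
    # all other tickets go to the generic TPM.
    tpm = f"VO TPM ({ticket.vo})" if ticket.vo else "generic TPM"
    ticket.assigned_to = tpm
    ticket.history.append(tpm)

    solution = known_solutions.get(ticket.description)
    if solution is not None:
        return solution            # solved at first line

    # No first line solution: escalate to a specialised support unit.
    ticket.assigned_to = specialised_unit or "unassigned second level unit"
    ticket.history.append(ticket.assigned_to)
    return None
```

For example, `first_line_triage(Ticket("job submission fails", vo="examplevo"), {}, "middleware")` returns `None` and leaves the ticket assigned to the hypothetical `"middleware"` unit, with both handling steps recorded in `history`.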
Behind the specialized VO TPM support units, people belonging to EGEE/NA4 groups
such as the Experiment Integration Support group (EIS) help VO users with on-line
support and the integration of the VO specific applications with the grid
middleware. Such people can also recognize if a problem is application specific
and forward the problem to more VO specific support units connected to GGUS.
Generic TPMs and VO TPMs also have the duty of following tickets, making sure that
users receive an adequate answer, coordinating the effort to understand the real
nature of the problem, and involving more than one second level support unit if
needed. The following figure depicts the ticket flow.
To provide appropriate user support, the distributed structure of EGEE and the VOs
has to be taken into account. The community of supporters is therefore
distributed. Their effort is coordinated centrally by GGUS and locally by the
local ROC support infrastructures.
The ROC provides adequate support to classify the problems and to resolve them if
possible. Each ROC has named user support contacts who manage the support inside
the ROC and who coordinate with the other ROCs’ support contacts. The
classification at this level distinguishes between operational problems,
configuration problems, violations of service agreements, problems that originate
from the resource centres and problems that originate from global services or from
internal problems in the software. Problems that are positively linked to a
resource centre (RC) are then transferred to the responsibility of the ROC with
which that RC is associated.
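The classification and hand-over rule in this paragraph can be illustrated with a short Python sketch. The problem classes are taken from the text; the `RC_TO_ROC` table, the ROC and RC names, and the `responsible_roc` function are hypothetical, and leaving every other problem with the ROC that classified it is an assumption of the sketch rather than a documented GGUS rule.

```python
from enum import Enum


class ProblemClass(Enum):
    """Problem classes distinguished at ROC level, as listed in the text."""
    OPERATIONAL = "operational problem"
    CONFIGURATION = "configuration problem"
    SERVICE_AGREEMENT = "violation of a service agreement"
    RESOURCE_CENTRE = "originates from a resource centre"
    GLOBAL_OR_SOFTWARE = "originates from a global service or the software"


# Hypothetical association of resource centres with ROCs (illustration only).
RC_TO_ROC = {"rc-example-1": "ROC-A", "rc-example-2": "ROC-B"}


def responsible_roc(problem_class, resource_centre=None, current_roc="ROC-A"):
    """Return the ROC responsible for a classified problem.

    A problem positively linked to a known resource centre moves to the ROC
    with which that resource centre is associated. Any other problem stays
    with the ROC that classified it (an assumption of this sketch).
    """
    if (problem_class is ProblemClass.RESOURCE_CENTRE
            and resource_centre in RC_TO_ROC):
        return RC_TO_ROC[resource_centre]
    return current_roc
```

With these assumed tables, a problem classified as `ProblemClass.RESOURCE_CENTRE` and linked to `"rc-example-2"` is handed from `"ROC-A"` to `"ROC-B"`, while a configuration problem stays where it was classified.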
MEETING USER NEEDS
As explained above, GGUS therefore provides a single entry point for reporting
problems and dealing with the grid. In collaboration with the EGEE EIS team, the
EGEE User Information Group, NA3, and the entire EGEE infrastructure, GGUS offers a
portal where users can find up-to-date documentation, and powerful search engines
to find answers to resolved problems and examples. Common solutions are stored in
the GGUS knowledge database and Wiki pages are compiled for frequent or
undocumented problems/features.
GGUS offers hot lines for users and supporters and a VRVS chat room to make the
entire support infrastructure available on-line to users.
Special tools and grid middleware distributions are made available by the NA4/EIS
team for GGUS users.
GGUS is interfaced with other grids’ support infrastructures, such as those of
OSG and NorduGrid. GGUS is also used in daily operations to monitor the grid and
keep it healthy. Therefore, specific user problems can be directly communicated to
the Grid Operation Centers and broadcast to the entire grid community.
GGUS is also used to follow and track down problems during stress testing
activities such as the HEP experiments production data challenges and the service
challenges.
OPEN ISSUES
Even though GGUS has proven to provide useful services, there are still many things
that need improvement. Concerning users and VOs, in particular, we have identified
the following:
Small VOs do not have the resources to implement their part of the model
The large VOs such as the LHC experiments have people who provide support for the
applications which the VO has to run as part of its work. These people are
contacted by GGUS when tickets are assigned to the VO or when the problem needs
immediate or on-line attention. It has proven difficult for some of the small VOs
to provide such a service. In this case, GGUS still provides support for the VO,
but if the problem is application related and cannot be resolved, then it has to be
put into the state ‘unsolvable’.
Supporters have other jobs to do
In EGEE, almost everyone providing support does so as part of their job. It is not
usually a major part of their job, so it is sometimes difficult to ensure
responsiveness. There is a small team which maintains and develops the GGUS system.
Supporters are concentrated in a few locations
The resources of the grid are widely distributed over 180 locations, and there are
people in all of these locations looking after the basic operation of the
computers. However, this is not the case for higher level support such as support
for a VO application. This tends to exist in only a small number of locations,
with a small number of supporters.
Scalability is constrained by the availability of supporters
The number of people who can provide support for basic operations is large, but the
number of people who can provide support for higher level services is small. As
the VOs become larger this will become a constraint to growth unless more
supporters are found.
Limited experience in handling a large number of tickets
During its development, the GGUS system has been exercised by generating tickets.
Because it is built from industry standard software components (Remedy and
Oracle), it has been found to be reliable. We believe, however, that the
submission of large numbers of tickets will expose limitations in the system.
Limited engagement of existing VOs in the implementation of GGUS
There is an organisation within EGEE called the Executive Support Committee (ESC). The
ESC has representatives from all of the ROCs of EGEE. This organisation meets once
per month by telephone to discuss the operations and development of the support
system and to decide on actions and priorities for the work. The present VOs have
found it difficult to provide people for involvement with this work.
CONCLUSION
The GGUS system is now ready for duty. During 2006, it is expected that there will
be a large number of tickets passing through the system as the LHC VOs move from
preparing for service to being in production. It is also expected that the number
of Virtual Organisations will grow as the work of EGEE-II proceeds. There will
also be an increase in the number of support units involved with GGUS, and an
increase in the number of ROCs and RCs.
Acronyms
EGEE Enabling Grids for E-sciencE
EIS Experiment Integration Support
GGUS Global Grid User Support
HEP High Energy Physics
JRA Joint Research Activity of EGEE
LHC Large Hadron Collider
NA Network Activity
OSG Open Science Grid
RC Resource Centre
ROC Regional Operations' Centre
SA Service Activity
TPM Ticket Process Management
VO Virtual Organisation
VRVS Virtual Rooms Videoconferencing System
Wiki Web technology for collaborative working