Describe the added value of the Grid for the scientific/technical activity you (plan to) do on the Grid. This should include the scale of the activity and of the potential user community and the relevance for other scientific or business applications
Access to the Grid goes through LHCb's own distributed production
and analysis system, DIRAC (Distributed Infrastructure with Remote
Agent Control). DIRAC implements the “pull” job scheduling paradigm,
in which all jobs are stored in central task queues and then pulled
by generic Grid jobs called Pilot Agents.
The whole LHCb community (about 600 people) is divided into groups
of physicists, developers, and production and software managers,
each with different requirements for their jobs on the Grid. While
Monte Carlo simulation jobs need several days of intensive CPU
time, analysis jobs need to start immediately. In the current
state of affairs, all users access the Grid through a single entry
point, and nothing prevents a sub-community from running most of
the jobs and thereby monopolizing the Grid resources. The way to
avoid this is to implement a system that enforces job priorities
and a fair share of the resources among all users in the community.
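
The "pull" paradigm described above can be illustrated with a minimal sketch. It is not DIRAC code: the class and field names (Job, CentralTaskQueue, cpu_hours_needed) are hypothetical, and the only matching criterion assumed here is the CPU time a pilot can offer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: int
    owner_group: str         # hypothetical label, e.g. "analysis"
    cpu_hours_needed: float  # assumed matching criterion


class CentralTaskQueue:
    """Holds waiting jobs until a pilot asks for one (hypothetical class)."""

    def __init__(self) -> None:
        self._waiting: List[Job] = []

    def submit(self, job: Job) -> None:
        # Jobs are only stored here; nothing is pushed to a site.
        self._waiting.append(job)

    def pull(self, cpu_hours_available: float) -> Optional[Job]:
        """Called by a pilot: return the oldest job that fits its resource."""
        for index, job in enumerate(self._waiting):
            if job.cpu_hours_needed <= cpu_hours_available:
                return self._waiting.pop(index)
        return None  # nothing suitable; the pilot simply exits
```

Without any notion of priority, such a queue serves jobs strictly in submission order, which is why a group that submits many jobs can dominate the resources.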
Describe the scientific/technical community and the scientific/technical activity using (planning to use) the EGEE infrastructure. A high-level description is needed (neither a detailed specialist report nor a list of references).
LHCb is one of the four high-energy physics experiments that will
run in the near future at the Large Hadron Collider (LHC) at CERN.
LHCb will try to answer some fundamental questions about the
asymmetry between matter and antimatter. The experiment is expected
to produce about 2 PB of data per year. These data will be
distributed to several laboratories all over Europe and then
analyzed by the physics community. To achieve this, LHCb relies
fully on the Grid to reprocess, replicate and analyze data.
Report on the experience (or the proposed activity). It would be very important to mention key services which are essential for the success of your activity on the EGEE infrastructure.
There are two possible approaches to job prioritization and fair
sharing: a site-wise approach, where the VO just fills its queues
and leaves it to the site-specific software to redistribute the
jobs according to previously negotiated agreements; and a VO-wise
approach, better suited to the LHCb computing model, where the site
just allocates the quota of resources due to the VO and the VO
decides how to share it among its user sub-communities.
A preliminary priority algorithm based on the VO-wise approach has
already been implemented. The introduction of a "Priority" flag in
the job specification, together with some changes to the
resource-job matching mechanism, has already been shown to give
short analysis jobs and Reconstruction jobs the right precedence
over lengthy Monte Carlo jobs. Our priority algorithm should still
be considered a work in progress. Accounting information based on
the user, the job length and the community's past CPU consumption
will also be taken into account.
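
As an illustration of priority-aware matching, the sketch below picks, among the waiting jobs a resource can accept, the one with the highest "Priority" value, with ties broken by submission order. The numeric scale, the field names and the match_job function are assumptions made for this example and do not reflect the actual DIRAC matcher.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class QueuedJob:
    job_id: int
    job_type: str      # e.g. "analysis", "reconstruction", "montecarlo"
    priority: int      # assumed scale: larger means more urgent
    submit_order: int  # position in the submission sequence


def match_job(waiting: List[QueuedJob],
              accepted_types: Set[str]) -> Optional[QueuedJob]:
    """Remove and return the best waiting job for a requesting resource."""
    candidates = [j for j in waiting if j.job_type in accepted_types]
    if not candidates:
        return None
    # Highest priority first; among equal priorities, the earliest submitted.
    best = max(candidates, key=lambda j: (j.priority, -j.submit_order))
    waiting.remove(best)
    return best
```

With this rule, an analysis job submitted after a Monte Carlo job is still matched first, provided it carries a higher priority value.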
With a forward look to future evolution, discuss the issues you have encountered (or that you expect) in using the EGEE infrastructure. Wherever possible, point out the experience limitations (both in terms of existing services or missing functionality)
The job priority mechanism needs to be extensively tested. An
aging system will also be introduced to prevent jobs from staying
too long in the central queues before being picked up by the first
suitable resource available. The mechanism relies on the assumption
that DIRAC is the only access to the Grid, but it does not prevent
users from bypassing it and accessing the Grid by other means. A
tool to enforce VO policy at site level is therefore highly
desirable.
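
One simple way to realize the aging idea mentioned above is to let a job's effective priority grow with the time it has spent waiting. The linear formula and the aging_rate parameter below are assumptions chosen for illustration; the abstract does not specify the actual scheme.

```python
def effective_priority(base_priority: float,
                       waited_hours: float,
                       aging_rate: float = 0.5) -> float:
    """Return the priority used for matching after aging is applied.

    Assumed linear model: every hour in the queue adds `aging_rate`
    priority points, so no job can be overtaken indefinitely.
    """
    if waited_hours < 0:
        raise ValueError("waited_hours must be non-negative")
    return base_priority + aging_rate * waited_hours
```

Under these assumed values, a Monte Carlo job with base priority 1 that has waited 20 hours reaches an effective priority of 11 and is matched ahead of a freshly submitted analysis job with base priority 10.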