TCG Meeting 16 Nov 2005
-----------------------

Present:
Claudio Grandi (JRA1)
Andrei Tsaregorodtsev (LHCb)
Laura Perini (ATLAS)
Cal Loomis (NA4)
Johan Montagnat (BIOMED)
Latchezar Betev (ALICE)
John White (JRA1 security)
Stefano Belforte (CMS)
Markus Schulz (Deputy Chair/SA3)
David Smith (secretary)

Ian Bird (SA1) cannot attend; exceptionally, Markus will stand in for SA1. Markus will chair today's meeting, as Erwin is away.

Chair/SA3/SA1: OK, following the agenda that Erwin sent: does anyone have comments on the minutes of the previous meeting? [No]

Erwin attached a summary of the discussion on convergence between the gLite and LCG trees. The plan is to stay separate for gLite 1.5, which will be the last pure gLite release of the EGEE-1 project. An LCG 2.7 release will be made before Christmas; it will mostly contain things that are now in production for the service challenges. The process to define the contents of the merged release should start now. The first step is to find the gap. Federico suggested starting with the requirements from the experiments.

ALICE: ALICE has produced a document (called 'baseline services') that describes what we require in order to run at the moment. What Federico suggested was to send the link to the document that has been discussed in the ALICE-LCG task force. Any release should happen as soon as possible, for us to check against our production scheme. We have sent a mail asking how this should happen. We suggest continuing as now: make a bug-fix release as soon as possible rather than waiting for the next release.

Chair: For the merged set (the joint release) we have to go through the release and decide what to include: which parts of the current LCG should stay and which should not.

ALICE: We have documented the outline of the requirements, and they will be sent to the list so that anybody who is interested can read them.

Chair: ACTION: ALICE TO SEND REQUIREMENTS TO THE LIST. [Does something similar exist for other VOs?]

ATLAS: In the last [ATLAS task force] meeting we discussed a list of points to be addressed together with the middleware. The list may still be changed by ATLAS, but it should be a good starting point. It can be found on the agenda of the last task force meeting. I can send the pointer to the [TCG] list, if that is useful.

Chair: Yes, sure, if this is the shopping list for ATLAS. If you send it now we can all look at it. CMS?

CMS: We don't have such a list. We would like to understand this merger. What is the scope of gLite 3.0? Will you remove components, or is it only a question of what gets added from gLite 1.5? Or will gLite 3.0 have something not in gLite 1.5? Then it is a question of going through gLite 1.5 and indicating what is needed in production.

Chair: By gLite 3.0 we mean the first release where we have only one release. gLite 3.0 will contain the components that are used for production. It is not excluded that services will be removed (if they are unused, for example), e.g. the question of when to phase out the old RLS; at the moment the RLS is a legacy service. From gLite 1.5 we will pick the components that are most urgently needed and add them to what is currently the production service, if they are in a state that allows it. Exactly which components, we still have to discuss. If there are other components, not gLite but nevertheless available, we can add them to the first merged production stack. We have to keep in mind what we said about the timing of this: gLite 3.0 is supposed to be released sometime in January.
We can't start a large-scale development effort to put new things in there, so things that are not ready for LCG 2.7 will have to wait for gLite, or for rolling service upgrades after that. We also have to look at the wish lists. ALICE and ATLAS have wish lists, and CMS can probably derive one. That leaves LHCb.

LHCb: We can discuss it in LHCb. We don't have a formal document that collects all of this (we can make one, of course).

Chair: ALICE phrased it quite well: it is the baseline.

LHCb: I'm a bit confused. For the LHC experiments the grid is LCG, and in principle there is a baseline services group for LCG which is collecting the same thing. That group will provide you with a list and enable you to get the services needed. But this group (the TCG) is an EGEE-2 group; isn't this bypassing the LCG working group? A coherent view is needed (which, as an LCG deployment person, you understand).

ALICE: I agree, and that is why we have prepared this document and why we bring it to this forum, so that this forum doesn't start from scratch. Starting from this list and seeing what will be in gLite (the RB, SRM etc.) ensures continuity.

LHCb: My view is that the immediate needs are collected by the LCG baseline services working group, while this TCG is an EGEE-2 activity. Here we can try to extrapolate in time: we can see what we can foresee on the timescale of EGEE-2. We shouldn't duplicate activities.

Chair: So you think the TCG operates on a longer timescale than the baseline services group?

LHCb: In the baseline services group we are discussing immediate issues. [I understood the baseline group is ending this year?]

Chair: The TCG is like the baseline services group, plus the means to make these things happen. That is the reason why the role of the baseline services working group is moving to the TCG. It [the baseline group] is coming to an end by the end of the year.

NA4: We don't have representation in the baseline services group.

Chair: A wish list for the near future should be made by NA4 and BIOMED.

NA4: We made one this morning. Shall we go through it now and send it to the list?

Chair: Does BIOMED also have comments on the components they are interested in?

BIOMED: Yes, they are included in NA4's list.

Chair: I want to finish the LHCb remark. For LHCb, is it just the input that has already been given to the baseline services working group?

LHCb: It can be derived from the baseline services group's input. We can also make a formally separate one.

Chair: Yes, send it to the list and we will link it to the agenda page for next time. So we can either start now on the material that is already there, or make this a short meeting and go through the lists next time. Next time we can make a list without duplication, then go through it and try to map the items to services. But we could take the NA4 list now.

Security: Looking at the ATLAS list, I see a few things which are interesting for security. Going through the list is fine.

CMS: Maybe we should each go through, before the next meeting, what is to go into LCG 2.7 and gLite 1.5?

Chair: The LCG 2.7 release is meant to be a checkpoint release of what we have in production, plus minor add-ons in terms of user tools in the core: what we have now, plus the latest extensions for components like DPM, SRM copy etc. We want to do the release in the first week of December, or even earlier. So it is just a checkpoint release; we have some upgrades (like for the FTS) that are happening continuously. It will be the last LCG-named release. So not much new, in some sense.
For gLite 1.5 there is nothing to discuss; its contents have already been fixed. So the process is then to try to understand what will come in January.

[The meeting starts to go through the ATLAS 'wish list'.]

ATLAS: [On the FTS] perhaps it's not worthwhile going into the FTS details here, given that they are being discussed in the FTS workshop. But it is clear that one would like the possibility of using the FTS between different centres, between Tier-2 centres etc. Another important point is the staging service; we also ask, as a requirement on sites, that their mass storage have a disk-based SE as well.

CMS: Is this in the document?

Chair: But that is not really new middleware.

ATLAS: OK, it is more an SA1 requirement.

Chair: We have to make sure the information system content is OK.

ATLAS: The first critical point, at the top of the table, is a tool to allow bulk deletion on an MSS. We did a Rome production where a lot of files were produced. Some are no longer useful and we would like to be able to eliminate them. Also, we use grid tools to validate our releases: we make validation productions (sometimes quite frequently), and sometimes they are bad, so we need to delete all the files.

[Do you really mean deletion on the MSS, not just for disk storage?]

ATLAS: Bulk deletion on any kind of storage (through the SRM interface etc.).

Chair: So the problem is that doing it file by file takes too long?

[SRM advisory delete is just too slow.]

Chair: Can you quantify that?

[It takes several seconds per call. For several thousands of files that is too much overhead.]

[A back-of-envelope estimate of this overhead is appended at the end of these minutes.]

CMS: So bulk file deletion is not in the SRM specs?

Chair: I want to understand why it is critical. If you label a test as unsuccessful and have to wait 5 seconds for each file, that is not good. But you have the list of files and you don't have to watch over it. So why is it critical? Is it a matter of namespace, so that you have to do the cleanup before continuing?

ATLAS: Apart from the poor performance, there are different tools for deleting which produce different results depending on the backend, which is confusing. We can arrange to send that between today and tomorrow.

ACTION: ATLAS will send a clarification of the use case for bulk deletion on MSS and the required service for deleting replicas reliably, and explain why it is critical that it be a bulk operation.

CMS: But it is not required to be part of the SRM.

Security: With the gLite tools you can create or delete files very easily in the catalogue.

Chair: This is my point: if you create files at a high rate, it is critical that you add them quickly in order to access them quickly. For deletion it is not so clear to me, apart from perhaps the namespace issue. Depending on what the critical part is, there could be a different solution.

Security: This could be passed to Gavin for ways to delete files.

Chair: Let's send out the requirements and then we will see. For the FTS improvements, you mentioned that you wanted some routing functionality. That probably won't come overnight, as the FTS was built as a point-to-point tool.

ATLAS: Let's not go into FTS details, since the FTS workshop is also going on. Another point which is marked as critical is the LCG-POOL libraries. That is something which keeps coming up and is giving us problems, e.g. POOL and the LFC.

Chair: This is something to be addressed by the current SA1 (and the future SA3). The applications area (of which POOL is a part) has collected dependencies and made sure the applications-area software is compatible.
The LCG software hasn't been part of that. We are discussing with the applications area how we can synchronise dependencies; then the problem should go away, as the applications area is already in sync with the experiments. It will take some time, since the way POOL expresses dependencies is different from LCG's. We will incorporate the dependencies when we integrate with gLite (e.g. in the gLite 3.0 release). In that time frame we will move the current LCG stack to the same build system that gLite uses; work is under way for that. There is nothing we can easily do right now: if it's not done at the build-system level, you have to go through and compare dependency by dependency. What has worked quite well was the synchronisation between the experiments and the applications area, and we don't want to redo that.

ATLAS: OK, that looks reasonable.

Chair: Now we come to the items marked as 'high' priority on the ATLAS list. What is meant by secure hosting of services and a security standard?

ATLAS: It is a problem related to the VO box. We are trying to understand the problems that many sites have with the VO box as it was foreseen at the beginning. At the moment we are planning to deploy it everywhere, but we are trying to see whether we could give up deployment of the VO box at Tier-2s and find a different mechanism there. Here too we plan to keep working over the next couple of weeks to understand this better within ATLAS; we need this time for internal discussion. We will come back to this again, but I'd prefer to do so after we have clarified it further.

Chair: There will also be a workshop on the VO box.

ALICE: On 19 December (for two experiments), and then again in the second week of January for the other two experiments. It is pending confirmation; it was raised at the WLCG meeting yesterday.

Chair: Activity has already started, e.g. the VO box should only be accessible from a defined list of domains. The problem there is to build the list of domains; the current idea is to use the GOC DB for that and to add extra fields to the GOC DB for each site.

ALICE: But this will mean the list will explode. Where will the client go to get authenticated? Has anyone thought about that?

Chair: David Groep. If you don't know which domains will access the box, then it would be hard.

Secretary: I think the idea was to list ranges of IP addresses, not exactly domains.

Chair: This is something that has come from the sites. It should not really restrict the experiments; I think the practical implications will be relatively mild. The worst case is that one may need to go through an intermediate node to reach, for example, the VO box. This is one of the things the sites are quite serious about, because long-lived services with external connectivity live on the VO box.

Chair: Maintenance is a nightmare, because you have to make sure the correct information is in the GOC DB. But the sites insist on it. If this information is put in, then in fact it might be good to add the access restrictions to all service nodes.

ALICE: But if I have proper credentials, shouldn't I be able to use the grid without further restriction?

Chair: You can always bring up your own RB.

CMS: But if you have CEs with firewall rules, you can't bring up your own RB.

ALICE: Is this already in written form?

Chair: Yes it is, but maybe not everyone has noticed it.

ALICE: Maybe small sites won't be able to set up the firewall rules.

Chair: OK, maybe we shouldn't discuss this further now. [A sketch of the kind of access restriction under discussion follows below.]

ATLAS: I have to leave in 10 minutes; maybe we should finish the list.
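[Editorial sketch of the access restriction discussed above: a VO box that accepts connections only from IP ranges registered per site, as the GOC-DB-driven scheme would. This is a minimal illustration under stated assumptions, not the actual mechanism; the ranges, the function name and the idea of reading a per-site range list from the GOC DB are assumptions based on the discussion.]

    # Minimal sketch (hypothetical): accept a client only if its IP falls
    # inside one of the ranges registered for the sites. In the scheme
    # discussed, the ranges would be read from the GOC DB; the ones below
    # are invented for illustration.
    import ipaddress

    ALLOWED_RANGES = [
        ipaddress.ip_network("137.138.0.0/16"),   # example: one site's range
        ipaddress.ip_network("192.108.45.0/24"),  # example: another site
    ]

    def connection_allowed(client_ip):
        """Return True if client_ip falls inside any registered range."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in ALLOWED_RANGES)

    print(connection_allowed("137.138.10.20"))  # True: inside the first range
    print(connection_allowed("10.0.0.1"))       # False: not registered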
Chair: Yes, we got sidetracked on the VO box.

ATLAS: The workload management issues are high/critical priority, but they aren't marked as such in the list. E.g. we need to be able to handle 10^5 jobs per day (short jobs, but not shorter than 30 minutes) with not more than 10 RBs. We hope bulk submission will help with this, but it is not clear that it will (it has not yet been demonstrated). That is one important point. Another is support for roles in VOMS.

[The arithmetic behind this submission rate is appended at the end of these minutes.]

[So you don't want a unique queue for the jobs, but do not necessarily require that the VOMS data isn't in a central DB?]

ATLAS: We have experimented with G-PBox; it looks promising, but we're still not sure whether it is the right way. We should know fairly soon. It needs to be in place for a real production in June 2006, and we need to be able to start testing a few months before that. The last point (which is slightly less urgent) is the accounting system. We could run the analysis phase without it.

Chair: OK, maybe we should come to an end now (i.e. we won't go through the NA4 list today). I think it is useful to go through the lists to see in what detail they should be described. It would be good if NA4 and the HEP experiments could send their lists, and if there is a list on the security side it would be good to have that too. Then we can prepare what to talk about next week.

Chair: AOB? [No] See you next week.
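[Appended note on the bulk-deletion discussion above: a back-of-envelope sketch of why per-file SRM advisory delete is too slow. The per-call latency ("several seconds") and the file count ("several thousands") are taken from the discussion; the batch size and the bulk operation itself are invented for illustration and are not part of the SRM specification being discussed.]

    # Back-of-envelope: per-file deletion cost vs. a hypothetical bulk call.
    PER_CALL_SECONDS = 5      # "several seconds per call" (from the minutes)
    FILES_TO_DELETE = 10000   # "several thousands of files", order of magnitude

    serial_hours = PER_CALL_SECONDS * FILES_TO_DELETE / 3600.0
    print("one-by-one: about %.0f hours" % serial_hours)   # roughly 14 hours

    # A bulk operation accepting, say, 1000 SURLs per request (an invented
    # figure) would turn the same cleanup into a handful of requests.
    BATCH_SIZE = 1000
    batches = (FILES_TO_DELETE + BATCH_SIZE - 1) // BATCH_SIZE
    print("bulk: %d requests of up to %d files each" % (batches, BATCH_SIZE))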
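[Appended note on the workload-management target above: the arithmetic behind 10^5 jobs per day through at most 10 RBs, using only the figures quoted in the discussion.]

    # What 1e5 jobs/day through at most 10 RBs implies per broker.
    JOBS_PER_DAY = 100000
    MAX_RBS = 10

    per_rb_per_day = JOBS_PER_DAY // MAX_RBS    # 10000 jobs per RB per day
    per_rb_per_sec = per_rb_per_day / 86400.0   # ~0.12 sustained jobs/s
    print("%d jobs/RB/day, ~%.2f jobs/RB/s sustained"
          % (per_rb_per_day, per_rb_per_sec))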