TCG Meeting 16 Nov 2005
-----------------------

Present:
Claudio Grandi (JRA1)
Andrei Tsaregorodtsev (LHCb)
Laura Perini (ATLAS)
Cal Loomis (NA4)
Johan Montagnat (BIOMED)
Latchezar Betev (ALICE)
John White (JRA1 security)
Stefano Belforte (CMS)
Markus Schulz (Deputy Chair/SA3)
David Smith (secretary)

Ian Bird (SA1) cannot attend; exceptionally, Markus will stand in for SA1. Markus will chair today's meeting, as Erwin is away.

Chair/SA3/SA1: OK, following the agenda that Erwin sent: does anyone have comments on the minutes of the previous meeting? [No]

Erwin attached a summary of the discussion on convergence between the gLite and LCG trees. The plan is to stay separate for gLite 1.5, which will be the last pure gLite release of the EGEE-1 project. An LCG 2.7 release will be made before Christmas; it will mostly contain things that are now in production for the service challenges. The process to define the contents of the merged release should start now. The first step is to find the gap. Federico suggested starting with the requirements from the experiments.

ALICE: ALICE has produced a document (called 'baseline services') that describes what we require in order to run at the moment. What Federico suggested was to send the link to the document that has been discussed in the ALICE-LCG task force. Any release should happen as soon as possible, for us to check against our production scheme. We have sent a mail asking how this should happen. We suggest continuing as now: make a bug-fix release as soon as possible rather than waiting for the next release.

Chair: For the merged set (the joint release) we have to go through the release and decide what to include: which parts of the current LCG should stay and which should not.

ALICE: We have documented the outline of the requirements, and they will be sent to the list so that anybody who is interested can read them.

Chair: ACTION: ALICE TO SEND REQUIREMENTS TO THE LIST. [Does something similar exist for other VOs?]

ATLAS: In the last [ATLAS task force] meeting we discussed a list of points to be addressed together with the middleware. The list may still be changed by ATLAS, but it should be a good starting point. It can be found on the agenda of the last task force meeting. I can send the pointer to the [TCG] list, if that is useful.

Chair: Yes, sure, if this is the shopping list for ATLAS. If you send it now we can all look at it. CMS?

CMS: We don't have such a list. We would like to understand this merger. What is the scope of gLite 3.0? Will you remove components, or is it only a question of what gets added from gLite 1.5? Or will gLite 3.0 have something not in gLite 1.5? Then it is a question of going through gLite 1.5 and indicating what is needed in production.

Chair: By gLite 3.0 we mean the first release where we have only one release. gLite 3.0 will contain the components that are used for production. It is not excluded that services will be removed (if they are unused, for example), e.g. the question of when to phase out the old RLS; at the moment the RLS is a legacy service. From gLite 1.5 we will pick the components that are most urgently needed and add them to what is currently the production service, if they are in a state that allows it. Exactly which components, we still have to discuss. If there are other components, not gLite but nevertheless available, we can add them to the first merged production stack. We have to keep in mind what we said about the timing of this: gLite 3.0 is supposed to be released sometime in January.
We can't start a large-scale development effort to put new things in there, so things that are not ready for LCG 2.7 will have to wait for gLite, or for rolling service upgrades after that. We also have to look at the wish lists. ALICE and ATLAS have wish lists, and CMS can probably derive one. That leaves LHCb.

LHCb: We can discuss it in LHCb. We don't have a formal document that collects all of this (we can make one, of course).

Chair: ALICE phrased it quite well: it is the baseline.

LHCb: I'm a bit confused. For the LHC experiments the grid is LCG, and in principle there is a baseline services group for LCG which is collecting the same thing. That group will provide you with a list and enable you to get the services needed. But this group (the TCG) is an EGEE-2 group; isn't this bypassing the LCG working group? A coherent view is needed (which, as an LCG deployment person, you understand).

ALICE: I agree, and that is why we have prepared this document and why we bring it to this forum, so that this forum doesn't start from scratch. Starting from this list and seeing what will be in gLite (the RB, SRM etc.) ensures continuity.

LHCb: My view is that the immediate needs are collected by the LCG baseline services working group, while this TCG is an EGEE-2 activity. Here we can try to extrapolate in time: we can see what we can foresee on the timescale of EGEE-2. We shouldn't duplicate activities.

Chair: So you think the TCG operates on a longer timescale than the baseline services group?

LHCb: In the baseline services group we are discussing immediate issues. [I understood the baseline group is ending this year?]

Chair: The TCG is like the baseline services group, plus the means to make these things happen. That is the reason why the role of the baseline services working group is moving to the TCG. It [the baseline group] is coming to an end by the end of the year.

NA4: We don't have representation in the baseline services group.

Chair: A wish list for the near future should be made by NA4 and BIOMED.

NA4: We made one this morning. Shall we go through it now and send it to the list?

Chair: Does BIOMED also have comments on the components they are interested in?

BIOMED: Yes, they are included in NA4's list.

Chair: I want to finish the LHCb remark. For LHCb, is it just the input that has already been given to the baseline services working group?

LHCb: It can be derived from the baseline services group's input. We can also make a formally separate one.

Chair: Yes, send it to the list and we will link it to the agenda page for next time. So we can either start now on the material that is already there, or make this a short meeting and go through the lists next time. Next time we can make a list without duplication, then go through it and try to map the items to services. But we could take the NA4 list now.

Security: Looking at the ATLAS list, I see a few things which are interesting for security. Going through the list is fine.

CMS: Maybe we should each go through, before the next meeting, what is to go into LCG 2.7 and gLite 1.5?

Chair: The LCG 2.7 release is meant to be a checkpoint release of what we have in production, plus minor add-ons in terms of user tools in the core: what we have now, plus the latest extensions for components like DPM, SRM copy etc. We want to do the release in the first week of December, or even earlier. So it is just a checkpoint release; we have some upgrades (like for the FTS) that are happening continuously. It will be the last LCG-named release. So not much new, in some sense.
For gLite 1.5 there is nothing to discuss; its contents have already been fixed. So the process is then to try to understand what will come in January.

[The meeting starts to go through the ATLAS 'wish list'.]

ATLAS: [On the FTS] perhaps it's not worthwhile going into the FTS details here, given that they are being discussed in the FTS workshop. But it is clear that one would like the possibility of using the FTS between different centres, between Tier-2 centres etc. Another important point is the staging service; we also ask, as a requirement on sites, that their mass storage have a disk-based SE as well.

CMS: Is this in the document?

Chair: But that is not really new middleware.

ATLAS: OK, it is more an SA1 requirement.

Chair: We have to make sure the information system content is OK.

ATLAS: The first critical point, at the top of the table, is a tool to allow bulk deletion on an MSS. We did a Rome production where a lot of files were produced. Some are no longer useful and we would like to be able to eliminate them. Also, we use grid tools to validate our releases: we make validation productions (sometimes quite frequently), and sometimes they are bad, so we need to delete all the files.

[Do you really mean deletion on the MSS, not just for disk storage?]

ATLAS: Bulk deletion on any kind of storage (through the SRM interface etc.).

Chair: So the problem is that doing it file by file takes too long?

[SRM advisory delete is just too slow.]

Chair: Can you quantify that?

[It takes several seconds per call. For several thousands of files that is too much overhead.]

[A back-of-envelope estimate of this overhead is appended at the end of these minutes.]

CMS: So bulk file deletion is not in the SRM specs?

Chair: I want to understand why it is critical. If you label a test as unsuccessful and have to wait 5 seconds for each file, that is not good. But you have the list of files and you don't have to watch over it. So why is it critical? Is it a matter of namespace, so that you have to do the cleanup before continuing?

ATLAS: Apart from the poor performance, there are different tools for deleting which produce different results depending on the backend, which is confusing. We can arrange to send that between today and tomorrow.

ACTION: ATLAS will send a clarification of the use case for bulk deletion on MSS and the required service for deleting replicas reliably, and explain why it is critical that it be a bulk operation.

CMS: But it is not required to be part of the SRM.

Security: With the gLite tools you can create or delete files very easily in the catalogue.

Chair: This is my point: if you create files at a high rate, it is critical that you add them quickly in order to access them quickly. For deletion it is not so clear to me, apart from perhaps the namespace issue. Depending on what the critical part is, there could be a different solution.

Security: This could be passed to Gavin for ways to delete files.

Chair: Let's send out the requirements and then we will see. For the FTS improvements, you mentioned that you wanted some routing functionality. That probably won't come overnight, as the FTS was built as a point-to-point tool.

ATLAS: Let's not go into FTS details, since the FTS workshop is also going on. Another point which is marked as critical is the LCG-POOL libraries. That is something which keeps coming up and is giving us problems, e.g. POOL and the LFC.

Chair: This is something to be addressed by the current SA1 (and the future SA3). The applications area (of which POOL is a part) has collected dependencies and made sure the applications-area software is compatible.
The LCG software hasn't been part of that. We are discussing with the applications area how we can synchronise dependencies; then the problem should go away, as the applications area is already in sync with the experiments. It will take some time, since the way POOL expresses dependencies is different from LCG's. We will incorporate the dependencies when we integrate with gLite (e.g. in the gLite 3.0 release). In that time frame we will move the current LCG stack to the same build system that gLite uses; work is under way for that. There is nothing we can easily do right now: if it's not done at the build-system level, you have to go through and compare dependency by dependency. What has worked quite well was the synchronisation between the experiments and the applications area, and we don't want to redo that.

ATLAS: OK, that looks reasonable.

Chair: Now we come to the items marked as 'high' priority on the ATLAS list. What is meant by secure hosting of services and a security standard?

ATLAS: It is a problem related to the VO box. We are trying to understand the problems that many sites have with the VO box as it was foreseen at the beginning. At the moment we are planning to deploy it everywhere, but we are trying to see whether we could give up deployment of the VO box at Tier-2s and find a different mechanism there. Here too we plan to keep working over the next couple of weeks to understand this better within ATLAS; we need this time for internal discussion. We will come back to this again, but I'd prefer to do so after we have clarified it further.

Chair: There will also be a workshop on the VO box.

ALICE: On 19 December (for two experiments), and then again in the second week of January for the other two experiments. It is pending confirmation; it was raised at the WLCG meeting yesterday.

Chair: Activity has already started, e.g. the VO box should only be accessible from a defined list of domains. The problem there is to build the list of domains; the current idea is to use the GOC DB for that and to add extra fields to the GOC DB for each site.

ALICE: But this will mean the list will explode. Where will the client go to get authenticated? Has anyone thought about that?

Chair: David Groep. If you don't know which domains will access the box, then it would be hard.

Secretary: I think the idea was to list ranges of IP addresses, not exactly domains.

Chair: This is something that has come from the sites. It should not really restrict the experiments; I think the practical implications will be relatively mild. The worst case is that one may need to go through an intermediate node to reach, for example, the VO box. This is one of the things the sites are quite serious about, because long-lived services with external connectivity live on the VO box.

Chair: Maintenance is a nightmare, because you have to make sure the correct information is in the GOC DB. But the sites insist on it. If this information is put in, then in fact it might be good to add the access restrictions to all service nodes.

ALICE: But if I have proper credentials, shouldn't I be able to use the grid without further restriction?

Chair: You can always bring up your own RB.

CMS: But if you have CEs with firewall rules, you can't bring up your own RB.

ALICE: Is this already in written form?

Chair: Yes it is, but maybe not everyone has noticed it.

ALICE: Maybe small sites won't be able to set up the firewall rules.

Chair: OK, maybe we shouldn't discuss this further now. [A sketch of the kind of access restriction under discussion follows below.]

ATLAS: I have to leave in 10 minutes; maybe we should finish the list.
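[Editorial sketch of the access restriction discussed above: a VO box that accepts connections only from IP ranges registered per site, as the GOC-DB-driven scheme would. This is a minimal illustration under stated assumptions, not the actual mechanism; the ranges, the function name and the idea of reading a per-site range list from the GOC DB are assumptions based on the discussion.]

    # Minimal sketch (hypothetical): accept a client only if its IP falls
    # inside one of the ranges registered for the sites. In the scheme
    # discussed, the ranges would be read from the GOC DB; the ones below
    # are invented for illustration.
    import ipaddress

    ALLOWED_RANGES = [
        ipaddress.ip_network("137.138.0.0/16"),   # example: one site's range
        ipaddress.ip_network("192.108.45.0/24"),  # example: another site
    ]

    def connection_allowed(client_ip):
        """Return True if client_ip falls inside any registered range."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in ALLOWED_RANGES)

    print(connection_allowed("137.138.10.20"))  # True: inside the first range
    print(connection_allowed("10.0.0.1"))       # False: not registered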
Chair: Yes, we got sidetracked on the VO box.

ATLAS: The workload management issues are high/critical priority, but they aren't marked as such in the list. E.g. we need to be able to handle 10^5 jobs per day (short jobs, but not shorter than 30 minutes) with not more than 10 RBs. We hope bulk submission will help with this, but it is not clear that it will (it has not yet been demonstrated). That is one important point. Another is support for roles in VOMS.

[The arithmetic behind this submission rate is appended at the end of these minutes.]

[So you don't want a unique queue for the jobs, but do not necessarily require that the VOMS data isn't in a central DB?]

ATLAS: We have experimented with G-PBox; it looks promising, but we're still not sure whether it is the right way. We should know fairly soon. It needs to be in place for a real production in June 2006, and we need to be able to start testing a few months before that. The last point (which is slightly less urgent) is the accounting system. We could run the analysis phase without it.

Chair: OK, maybe we should come to an end now (i.e. we won't go through the NA4 list today). I think it is useful to go through the lists to see in what detail they should be described. It would be good if NA4 and the HEP experiments could send their lists, and if there is a list on the security side it would be good to have that too. Then we can prepare what to talk about next week.

Chair: AOB? [No] See you next week.
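[Appended note on the bulk-deletion discussion above: a back-of-envelope sketch of why per-file SRM advisory delete is too slow. The per-call latency ("several seconds") and the file count ("several thousands") are taken from the discussion; the batch size and the bulk operation itself are invented for illustration and are not part of the SRM specification being discussed.]

    # Back-of-envelope: per-file deletion cost vs. a hypothetical bulk call.
    PER_CALL_SECONDS = 5      # "several seconds per call" (from the minutes)
    FILES_TO_DELETE = 10000   # "several thousands of files", order of magnitude

    serial_hours = PER_CALL_SECONDS * FILES_TO_DELETE / 3600.0
    print("one-by-one: about %.0f hours" % serial_hours)   # roughly 14 hours

    # A bulk operation accepting, say, 1000 SURLs per request (an invented
    # figure) would turn the same cleanup into a handful of requests.
    BATCH_SIZE = 1000
    batches = (FILES_TO_DELETE + BATCH_SIZE - 1) // BATCH_SIZE
    print("bulk: %d requests of up to %d files each" % (batches, BATCH_SIZE))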
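[Appended note on the workload-management target above: the arithmetic behind 10^5 jobs per day through at most 10 RBs, using only the figures quoted in the discussion.]

    # What 1e5 jobs/day through at most 10 RBs implies per broker.
    JOBS_PER_DAY = 100000
    MAX_RBS = 10

    per_rb_per_day = JOBS_PER_DAY // MAX_RBS    # 10000 jobs per RB per day
    per_rb_per_sec = per_rb_per_day / 86400.0   # ~0.12 sustained jobs/s
    print("%d jobs/RB/day, ~%.2f jobs/RB/s sustained"
          % (per_rb_per_day, per_rb_per_sec))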