GDB

WLCG Grid Deployment Board monthly meeting (note different venue)
CERN, room 160/1-009 (Europe/Zurich)

December 2010 GDB Minutes
 
 
Introduction (John Gordon)
Only one workshop of note since the last meeting: the Distributed Database Operations Workshop on 16th/17th November.
Change to the April 2011 date: the March meeting will be in Lyon, and the April meeting moves from the 13th to the 6th.
Forthcoming events – VOMS/VOMRS workshop, ISGC, EGI UF.
OPN – at the last meeting John Shade proposed that sites be made responsible for follow-up. At the MB yesterday it was agreed that one of the two endpoint sites is given overall responsibility; between a T0 and a T1 the T0 takes it, etc.
Three security incidents have occurred recently: one a weak ssh password, one an instant root exploit, and finally the compromise of the savannah.gnu.org site via SQL injection.
Xrootd-based storage instances are currently unable to publish installed capacity and usage. This mainly affects ALICE.
 
HR: This requires a gLite 3.1 CE to work!
JT: Not just a problem with xrootd, there is a problem with all the information providers.
MJ: It is a side effect of YAIM usage, does not appear with Quattor.
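For illustration, a minimal sketch (not the production information provider) of what is being asked of the information system here: querying a top-level BDII for the GLUE 1.3 storage-area sizes that the xrootd-based instances are not publishing. The BDII host name and the python-ldap approach are assumptions, not part of the discussion above.

    # Minimal sketch: ask a top-level BDII for GLUE 1.3 storage-area sizes.
    # Instances with the problem noted above simply lack these attributes.
    import ldap  # python-ldap

    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")  # assumed host
    entries = conn.search_s(
        "o=grid", ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueSA)(GlueSAAccessControlBaseRule=VO:alice))",
        ["GlueSATotalOnlineSize", "GlueSAUsedOnlineSize"])
    for dn, attrs in entries:
        # Sizes are published in GB in GLUE 1.3.
        total = attrs.get("GlueSATotalOnlineSize", ["missing"])[0]
        used = attrs.get("GlueSAUsedOnlineSize", ["missing"])[0]
        print("%s total=%s used=%s" % (dn, total, used))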
Generally good feedback continues for the CREAM CE, so the question arises as to whether we can withdraw support for the LCG-CE. Should we perhaps switch before the start of data taking in 2011?
 
IPv6 (David Kelsey)
Brief history – the US Federal directive and IPv4 status.
FY runs 1st October to 30th September.
The original plan was to run dual stack before the IPv4 address space ran out; that is not possible now. One big problem remains: applications are untested and not updated, and tools remain immature.
Q: When will we/WLCG have to support IPv6-only systems? When will sites come online with new clusters that can only use v6, and therefore be unable to communicate with others without v6 or some form of translation?
 
IB: You still did not scare me! What is the thing that will force us in the end?
??: What fraction of our collaborators will need to go to IPv6?
IB: If that is possible then there is no problem.
DK: There will be sites who can not get any IPv4 space and can not go dual stack. We may conclude there is no urgency and just need to wait for rest of world to move.
JG: Some groups may not get any v4.
RW: We do not know how likely the move is, and the software is not yet working.
SL: In the Asia-Pacific region they will not allocate more v4 addresses. In China they have a separate network that runs only on v6. Applications are more tricky.
IB: I thought the gLite stack was v6 compatible.
DK: 99% compliant. But was it actually tested!?
MJ: We at least need to know what needs to be done. The network backbone is ready.
JG: What did you mean by applications – AFS for example?
DK: Anything above the network layer.
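As context for the "was it actually tested?" exchange, a minimal sketch of the kind of dual-stack smoke test a site could run against its own services; the host and port below are placeholders, and this checks connectivity only, nothing above the network layer.

    # Minimal dual-stack smoke test: can we resolve and connect to a service
    # over both IPv4 and IPv6? Host and port are placeholders.
    import socket

    def check(host, port):
        for family, label in ((socket.AF_INET, "IPv4"),
                              (socket.AF_INET6, "IPv6")):
            try:
                infos = socket.getaddrinfo(host, port, family,
                                           socket.SOCK_STREAM)
                sock = socket.socket(family, socket.SOCK_STREAM)
                sock.settimeout(5)
                sock.connect(infos[0][4])  # sockaddr tuple for this family
                sock.close()
                print("%s OK via %s" % (label, infos[0][4][0]))
            except socket.error as err:
                print("%s FAILED: %s" % (label, err))

    check("se.example.org", 8443)  # hypothetical service endpoint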
 
JG: We will no doubt ask for updates once the working group is up and running.
 
WLCG Security Update (David Kelsey)
First update since the start of EGI.
IB: On the question of managing policy: it used to be JSPG and us. Now we need a repository of the policies that WLCG understands to apply to it – a sub- or superset of the JSPG policies. There is a potential that what we need diverges from EGI/OSG.
DK: We have a WLCG repository but we do need to separate out. As for agreeing policy we will try to work closely with EGI.
IB: In the OB there was a clear statement that we should not adhere to EGI policies regardless of what they are.
RW:
IB: From what we hear, EGI is asking members to accept ALL policies.
DK: Many sites sit in both areas. My approach is to try as much as we can to keep things consistent and processes are in place to deliver that.
MJ: Acceptance of certain CAs…
IB: Blanket statement that…
DK: If there are areas where WLCG does not agree with EGI policies then this is important input that should go not just via me.
MJ: This is why I raised the CA issue. There was no divergence at the policy level but at the operational level.
RW: The policies are designed in such a way that exceptions can be accommodated.
DK: If there are conflicts then we have the processes to deal with them. To what extent the NGIs will agree with the policies is less well known.
 
 
Middleware (Andrew Elwell)
??: What is the proportion of sites still running gLite 3.1 components?
JG: Still a handful of sites using 32-bit nodes for WNs. In WLCG I believe there is less dependence on sites running 3.1.
Holger: The WMS is available on 3.1 but I would much rather have seen it available on 3.2.
ML: Short story is that there were problems with dependencies.
 
JG: EGI asked for one more month for the MON box update.
217 sites running gLite 3.2
88 sites running gLite 3.1
 
Roughly 10% of resources are still RHEL4-compatible (this is all sites, not just WLCG).
JG: The information published in the BDII as to whether sites are T2 etc. is not validated yet.
MJ: What does it mean to publish as 3.1 or 3.2? For the WMS etc. you cannot publish the version.
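A hedged sketch of how a census like the 217/88 split above might be taken from the top-level BDII, assuming the middleware release is advertised among the subcluster runtime-environment tags. The GLITE-3_* tag convention and the BDII host are assumptions, and as MJ notes, services such as the WMS cannot publish a version this way.

    # Count subclusters by advertised middleware release tag. Tag prefixes
    # and BDII host are assumptions; this only sees what sites publish.
    import ldap

    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
    entries = conn.search_s(
        "o=grid", ldap.SCOPE_SUBTREE, "(objectClass=GlueSubCluster)",
        ["GlueHostApplicationSoftwareRunTimeEnvironment"])
    counts = {"GLITE-3_1": 0, "GLITE-3_2": 0}
    for dn, attrs in entries:
        tags = [t.decode() if isinstance(t, bytes) else t for t in
                attrs.get("GlueHostApplicationSoftwareRunTimeEnvironment", [])]
        for prefix in counts:
            if any(t.startswith(prefix) for t in tags):
                counts[prefix] += 1
    print(counts)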
 
Components not funded by EMI (Markus Schulz)
InfoProviders – product teams are responsible for these except in special cases. CREAM-CE is a special case since it requires different code for different batch systems.
Lcg-info-dynamic-software will move to the CREAM team in time.
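For illustration, a sketch of the shape of the job such a dynamic software provider performs: it turns the flat per-VO tag files maintained on the CE into GLUE runtime-environment attributes. The directory layout and DN below are assumptions, not the actual lcg-info-dynamic-software code.

    # Sketch of a dynamic software info provider: read per-VO tag files and
    # emit GLUE attributes as LDIF. Paths and the DN are assumptions.
    import glob

    INFO_DIR = "/opt/glite/var/info"  # hypothetical tag-file directory
    DN = "GlueSubClusterUniqueID=ce.example.org,mds-vo-name=resource,o=grid"

    print("dn: %s" % DN)
    for tagfile in glob.glob("%s/*/*.list" % INFO_DIR):
        for line in open(tagfile):
            tag = line.strip()
            if tag and not tag.startswith("#"):
                print("GlueHostApplicationSoftwareRunTimeEnvironment: %s" % tag)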
Batch system support – only Torque and LSF are supported.
MS: CERN will support LSF generally?
Holger: Yes. It is cheaper to support it ourselves and make it available than to rely on other input.
JG: Batch systems also need to gather information for accounting from BLAH.
 
Lcg-tags and lcg-ManageVOTags.
MJ: The second is broken for non-LCG-CE installs.
glite-CLUSTER is not part of EMI (it is used for sites with multiple CEs). It was maintained by Stephen Burke and now by David Smith, but reassignment is needed.
JG: This is rolled out?
MS: On the brink of rollout.
ML: I came to the discussion late. The release of the new node type would have implications for the CE: the configuration would need significant changes, which could lead to instability at production sites, and many sites would not benefit from the new node type. It runs with the LCG CE. I asked for this to be made opt-in. We will not manage this in 2010, but the node type is very much desired. For the long term it needs to be picked up by EMI. CREAM and the LCG-CE are equivalent in this matter… it is about publishing the batch system properly. A note is being prepared for EMI management to explain why it is required.
SB: Part of the confusion is that, being presented as a new node type, people think it is new. Actually it just separates out functionality that is already in the CE, using YAIM. This is basically about configuration, not functionality.
MJ: I just want to support this. It is a pure YAIM problem: without it, it is not possible to properly publish more than one CE sharing the same cluster.
MS: From the EMI perspective only one CE is needed and that is a CREAM CE. Dual deployments with LCG-CEs are only needed for WLCG.
ML: We could open a bug against CREAM saying that it does not publish multiple headnodes correctly… EMI would then need to develop a new node type to do the publishing… I do not fear a big problem here.
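For reference, a sketch of the GLUE 1.3 structure this exchange is about: a single cluster entry that several CE headnodes point at via foreign keys, so a shared batch farm is published once rather than per CE. All DNs and IDs are illustrative only.

    # Illustrative GLUE 1.3 LDIF for one cluster shared by two CE headnodes.
    # Without something like glite-CLUSTER, YAIM publishes one cluster per
    # CE and the shared capacity is double-counted.
    cluster = "cluster.example.org"
    ces = ["cream01.example.org:8443/cream-pbs-atlas",
           "ce01.example.org:2119/jobmanager-lcgpbs-atlas"]

    print("dn: GlueClusterUniqueID=%s,mds-vo-name=resource,o=grid" % cluster)
    print("GlueClusterUniqueID: %s" % cluster)
    for ce in ces:
        print("GlueForeignKey: GlueCEUniqueID=%s" % ce)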
WN/UI – EMI has recognised that a UI will be needed but is currently discussing requirements. As for the WN, after EMI-1 there is a belief that a WN is not needed.
JG: The discussion so far seems to suggest a superset of components.
IB: There is no funding for this within EMI.
MS: Filling the gaps seems to be a matter of the technical board asking who can pick things up.
IB: I think we should continue as now but not invest effort in things we do not need for EMI.
JG: Effort could be reassigned to this work. There could be a single distribution that can be configurable.
MS: To have a common UI for ARC and gLite would be useful for WLCG, but adding UNICORE would not. We have to provide the gLite versions for now anyway. The issue comes in the move from EMI-1 to EMI-2.
Data management components related to XROOTD.
MS: It is not about supporting or not supporting. It was not defined as something EMI would do. The storage providers are all linked with WLCG.
The list is of those things that might not be covered but that we need to ensure are covered.
 
WLCG sites not running CREAM (John Gordon)
Not all CREAM CEs observed on the grid are at WLCG sites. Many sites provide CREAM access to only a subset of their resources.
 
 
HEPiX Virtualisation WG (Tony Cass)
CG: A while ago – what about customisation that is site specific, such as local storage access?
TC: The contextualisation step is there to deal with that area. What you can not do is install a lot of software. I’m not sure that you need additional clients on top of what comes from the experiments.
JG: And what is used is governed by environment variables.
 
JT: It was a balanced summary, except for the CHEP discussion about what we mean by trust. High-bandwidth access to storage did not mean they want to be able to mount local filesystems or run local batch system commands.
TC: Sites should not place more restrictions on the VM images than they would place on ordinary batch WNs. So those providing the images should not expect more than what they would get from an ordinary WN.
JT: My statement is much stronger. VMs may currently get shared software mounts etc. in the current environment. Please put this sort of thing in the slides.
??: Glad to see the mention of CERNVMFS, but other uses of embedded pilot frameworks should be mentioned too… we successfully managed to run ATLAS PanDA jobs in this way, and it has also been demonstrated for ALICE and CMS, so the work is well underway.
TC: You also looked at the contextualisation method, and that fits with the image too. Other concerns about the package have also been addressed.
 
 
CERNVMFS for Software Distribution (Ian Collier, Elisa Lanciotti)
GS: Comment – we also discovered that it was useful for flat-file conditions data distribution as well as for Athena. We look forward to it being a supported service. The way ATLAS use CERNVMFS needs access to…
JT: /opt is used for many, many things and is not for this type of thing. The newest version of the client solved the problems we had with hanging links in /opt, but if there is a problem we would turn it off and ATLAS would then be stuck. We did not tell LHCb about the area it was installed in and their install just ran, so it is not insurmountable.
GS: We would put it in some /opt GUID path etc. Are we going to run into problems if we put this in the same place at every site?
JG: Sounds like building up a problem.
MJ: From experience, it will be a problem having the same path at every site.
JB: If we change this then we need to reprocess all repositories with the new path.
This is already version 2; it has 2 years of development behind it, but it is not ready to be called production-ready, as you can see from the requests coming in now. It would perhaps be stable from next year.
JG: It is needed for software installation because the current service is overloaded. Would the site infrastructure become overloaded with this CVMFS route?
Long-term support for this is needed. There may be a case for a WLCG first-level support unit. Is CernVM going to provide support for deployment at sites?
IB: We need to understand what is involved in it. It is a WLCG discussion but I do not know what is involved. It should be part of the same discussion as we have for other things.
JT: We are running LHCb jobs here in production, and the step up in job throughput has been great. The nice thing is that if there are any problems we just put the env variable back to the shared area and the jobs stay running.
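A sketch of the fallback JT describes, assuming the usual VO_<NAME>_SW_DIR convention (the variable name and paths here are illustrative): because jobs locate the software area through an environment variable, a site can point it at CVMFS and flip it back to the shared area if problems appear, without breaking jobs already running.

    # Illustrative site-level switch between CVMFS and the classic shared
    # software area. Variable name and paths are assumptions.
    import os

    USE_CVMFS = True  # flip to False to fall back to the shared area

    os.environ["VO_LHCB_SW_DIR"] = (
        "/cvmfs/lhcb.cern.ch" if USE_CVMFS else "/nfs/shared/lhcb")
    print("software area: %s" % os.environ["VO_LHCB_SW_DIR"])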
 
 
LUNCH
 
CMS (Ian Fisk)
JG: Figures on slide 10 show more resources deployed than pledged.
JG: The commissioning work is before March?
IF: As much as possible before March.
KB: For the beam lifetime, what are you going to use next year?
IB: This was discussed at the OB on Friday. Sergio said we should plan for 30% overall (30%*200 days to get the figure).
KB: Heavy ions was about 45%.
IF: That sounds like a dependable number.
KB: T2-T2. Did you do any tuning or just let it go?
IF: We tuned a lot, but we did not optimise the networking. There are 25,000 connections in the mesh; to be efficient we only went back to the connections that had problems. Only at a small set of sites was the issue to do with networking as opposed to FTS and site configuration issues. There was an interesting CHEP presentation on this.
??: You said 20MB/s is the target for a commissioned link?
IF: Yes, for 12 hours sustained. You want to know you can transfer a reasonable amount of data, not just a single file. If you aim for 100MB/s then you need to start going into the site networking. For T1-T2 we still have a programme to get 100MB/s running.
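Back-of-envelope arithmetic for the targets quoted above: 20MB/s sustained for 12 hours is a little under 1TB moved, while the 100MB/s T1-T2 goal corresponds to over 4TB in the same window.

    # What the commissioning targets amount to over a 12-hour window.
    for rate_MBps in (20, 100):
        volume_GB = rate_MBps * 12 * 3600 / 1000.0
        print("%3d MB/s for 12 h = %.0f GB (%.2f TB)"
              % (rate_MBps, volume_GB, volume_GB / 1000.0))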
 
ATLAS (I Ueda)
JG: Do you have a target for when you might run out of disk?
KB: We have run out already.
GS: Depends on beam time. The situation is very tight.
 
JG: Slide 6. What does secondary data mean?
IU: Data that can be deleted.
JG: Space token information should be publishable via the BDII.
MJ: Per space token, what is used and what is there should be published, and not by the SRM.
JG: gstat does not collect it yet.
SC: There are fields in the information publishing for these things.
MJ: The information is there.
JG: This is an issue for Flavia.
SC: At the moment the information from the SRM is more correct than from the BDII. (SRM available and used per space token). We would use the BDII if it was at least as good.
??: The information providers for dCache look fine – please report any problems that you see.
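A sketch of the BDII side of the comparison SC describes, assuming per-space-token sizes appear in GlueSA entries; the BDII host and the ATLASDATADISK token name are illustrative. The SRM-reported numbers would be fetched separately (see the lcg-stmd note in the chat log) and compared.

    # Read per-space-token sizes from the BDII for comparison against what
    # the SRM reports. Host and token name are illustrative.
    import ldap

    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
    entries = conn.search_s(
        "o=grid", ldap.SCOPE_SUBTREE, "(objectClass=GlueSA)",
        ["GlueSAName", "GlueSATotalOnlineSize", "GlueSAUsedOnlineSize"])
    for dn, attrs in entries:
        name = attrs.get("GlueSAName", [b""])[0]
        name = name.decode() if isinstance(name, bytes) else name
        if "ATLASDATADISK" in name:
            print("%s: total=%s used=%s"
                  % (dn, attrs.get("GlueSATotalOnlineSize"),
                     attrs.get("GlueSAUsedOnlineSize")))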
 
ALICE (Latchezar Betev)
No questions.
 
LHCb (Roberto Santinelli)
JG: Two distinct problems. The rates of I/O are about 10x what we were told in the planning; we can deploy more resources than in the planning (slide 12). The second problem was with hot files: if you run 1000 jobs that hit one file, CASTOR does not deal with this very well.
 
PG: Focus on the latest hardware we purchased. 20% of the cluster has this hardware; 30% of the problems are only on this hardware. 6%
On slide 21: the storage consisted of 4 nodes and has now been reduced to 2. We have to be careful if the experiments want to reduce the volume and then increase the interaction rate.
JG: Les used to project that not all tracks on disk get used. SCSI is better at handling higher access rates.
 
Chat Log
[10:52:33] Stephen Burke You could just turn off support for that VO on the LCG CE if they don't want it
[10:54:13] Jeff Templon you are right of course 
[11:36:18] Martin Bly If the software is 'relocatable' then a soft link to it from /opt/wherever is all that is needed. The first part is the hard part.
[11:42:53] CERN 160-1-009 starting again at 1400 CET.
[14:01:31] Stephen Burke lcg-stmd returns space token metadata
 
 
 
 