GDB minutes – 10th November 2010
These minutes/notes cover mainly the discussions and do not generally summarise the talks.
Introduction (John Gordon)
Since the October meeting: Quattor Workshop; CHEP 10; OGF30 and HEPiX fall meeting.
Next GDB meetings – 8th March (Lyon) – probably move to 6th August and 14th.
No pre-GDBs planned but can be arranged if there are proposals for these meetings.
Upcoming: EMI All Hands – Prague, 22nd November; ISGC – 21st March; EGI User Forum – 11th.
Site Security Challenge 4 (Sven Gabriel)
JG: This is 15 minutes after you alerted the site?
SG: Times in site responses are relative to alarm email.
PG: Did the pilot jobs use glexec to change identity?
SG: No. Not used in this framework.
FD: What does ATLAS firstname.lastname@example.org responding mean?
GS: It is the wrong address!
JG: Does this scale? Are you saying that we do this against all sites to get them up to speed?
SG: Well the NGIs should take it over for their sites – use this as a template.
JG: Is there an NGI for NDGF?
PG: Are these tests like those in SSC3, where the site security teams are passive? So there is no collaboration between sites?
SG: One-way coordination between myself and the site. What was done this time was more communication with the VO (ATLAS).
OS: NDGF – no, not a single NGI, we have many. Each NGI should take care of its own sites. NDGF has its own security officer, but that did not help Slovenia.
JG: You are mixing up ARC and NDGF?
OS: That is what the sites are doing.
FD: You plan to include storage operations. But you can get to storage not only through jobs but also by installing credentials. Are you going to try this?
SG: Done previously on a small scale.
RW: We have tried this before. The site could not solve the challenge because there was not enough information in the logs. Sites need to be able to solve the challenge!
JG: You say sites reluctant to ban users – why?
SG: Sites are set up with certain procedures to ban users. When they did not do it, it was often because they could not find the pilot job user's DN.
JG: If they are reluctant they may adopt that in a real incident too.
RW: It is difficult to block a user on all services.
JG: That is a requirement for getting that part of ARGUS going.
ATLAS (Graeme Stewart)
PG: When a site contacts you, it is good to see that they get helped.
JG: Where it was a general Panda job, would the site just ban the Panda user anyway?
Hard to say in advance – liaise as closely as appropriate with the CSIRT team.
JG: You had specific factory – by banning the user they banned the factory?
SG: This is also why we banned/unbanned users
?: What about site services?
GS: We can ban a user in Panda (prevents only Panda submission); we can ban them from ATLAS, but they may have a 96-hour VOMS proxy… Or their certificate can be revoked and then everything stops, with a few hours for CRL propagation.
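The timing trade-off GS describes – a banned user can keep working until either their VOMS proxy expires or the revocation reaches sites via the CRL – can be sketched as a small calculation. This is illustrative only; the function name and the 4-hour propagation figure are assumptions, not part of any ATLAS tooling.

```python
from datetime import timedelta

def worst_case_access(proxy_lifetime, crl_propagation):
    """Worst-case window in which a banned user can still run jobs,
    assuming the certificate is also revoked: a VO-level ban leaves an
    already-issued VOMS proxy valid until it expires, while revocation
    only takes effect once sites refresh their CRLs.  The effective
    cut-off is whichever happens first."""
    return min(proxy_lifetime, crl_propagation)

# Figures quoted in the discussion: a 96-hour VOMS proxy vs. "a few
# hours" (say 4, an assumed value) for CRL propagation.
ban_only = timedelta(hours=96)   # VO ban alone: proxy stays valid
with_revocation = worst_case_access(timedelta(hours=96), timedelta(hours=4))
print(ban_only, with_revocation)  # revocation closes the window much sooner
```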
RW: Remind sites – do not contact VO directly. Use national response procedures.
JG: Sites had to contact their local CSIRTs… what about the NGI?
SG: Contact both.
RW: Sites must decide between a malicious user and a compromised user. In real incidents sites help each other.
Revoking a user's certificate is not part of the standard response; the owner needs to contact the CA.
RW: Or anyone who holds the private key for the certificate can revoke it.
P Gerard: Is communication between sites good?
GS: When sites got in contact with ATLAS the communication with us was good.
JG: I guess there is a point here – if they see things coming from a known site vs. an unknown one, would they react differently?
JG: See ATLAS have a CSIRT address. Do the others have this too?
RW: They all do – but whether there is someone behind it is another thing.
WLCG Information Officer (Flavia Donno)
GS: At the moment, if a site breaks then its BDII disappears. If any component breaks then there is no information. So we cannot rely on it to know where the SRM endpoint is located.
MJ: You must rely on it at some point. If you just populate things by hand then problems will arise. You would probably want a BDII you can rely upon.
PG: This does mean that ATLAS sometimes misses changes – for example CE changes.
GS: We would like a way to know when things come and go, but it does not have to be a BDII. We cannot have sites completely disappear.
MS: Not a design flaw. It was designed with freshness of information in mind, for the RB and later the resource management, for matching and job distribution. Since WLCG no longer uses it like this – only for pilot jobs – the requirements for the information system have changed, and this should be followed up.
FD: I went through the GLUE schema to see what is dynamic and found about 80% is semi-static.
…: On the static/dynamic situation – this is one reason LHCb does not rely on the information system.
FD: But the DIRAC config is taken from the BDII; therefore if one is made more reliable…
JG: If static version cached for longer then you might be able to do without your own version.
MJ: ALICE, they don’t use anything!?
IB: We should certainly fix the deployment of the top-level BDIIs. Having communication channels is also important. We should be careful not to do too much. We need to understand what WLCG needs from a future information system; if we can simplify, we will benefit. Step back, look at what will be needed in the future, and then discuss with EMI etc.
MJ: The feedback is not completely fair, as they do need to get feedback, and sites cannot contact each VO since they support many. Feeding an experiment-specific information system is a lot of work. The workflow is valid. Then, you mention many problems but not who to talk to… The top-level BDII is a deployment issue, so EGI not EMI. For the cleaning of the system (GLUE) there are other groups with whom we should collaborate, as EMI is only responsible for part of the information providers. WLCG makes use of the information system, but it is not just for WLCG. We should resist the urge to simplify when others also rely on the components. Doing WLCG-specific things is a concern.
JT: The discussion is about the dynamic aspects of the information system, but there is a third class – admin information such as installed capacity, which has no impact on operations. For example, multiple CEs help reliability, but for the admin information the number of CPUs has to be zeroed on some CEs. So, when looking at the information system, identify the information serving the admin function and split it off.
GS: There are other information systems we use, e.g. the GOCDB, which propagates downtime information. That system does not have the same problem as the BDII. Perhaps use it for static information.
MS: The move away from GOCDB was because in the past it was unreliable. The idea was that information providers would pull information from the services themselves. On the other usages of the information system: it is suffering from new use cases. Originally it was just to feed the WMS; now it is used for monitoring, as an admin tool, and for discovering services over a long period of time. It is reasonable therefore to take a step back. Likewise the GLUE schema – we should not be shy to look at it. Other users often follow what we are doing, so we should not stay with something that is not adequate.
FD: Also, the GLUE schema is designed to consider all customers, not just some.
OS: What is the interaction with EGI? Within two days messages came to me from Flavia and Tiziana.
FD: I established communication.
JG: Should feed requirements in through EGI to EMI. All user communities should go via that route.
OS: There is also a discussion in EGI about information freshness.
Network Incident Handling (Jamie Shiers et al.)
JG: Transparent meaning you see through it or everyone knows?
JS: You see through it.
JG: If you are saying the site discussion needs to be invisible…
JS: Site admins know who at their site they need to speak to. But the only people who join us in WLCG are those who already attend the daily operations call.
JG: Asking the site guy to talk regularly to their network support and report back risks pushing the problem down a bit. The site guy says that DANTE is dealing with it…
JS: The model was proposed by Bill Johnston in the OPN. They accept the responsibility of ensuring that the link is clean between A and B.
JG: Do they accept they also need to give regular updates?
JS: The site rep pulls the updates if they are not provided.
KB: One thing you did not say. One of the two sides should take responsibility. Often neither A nor B sees it as their problem.
JS: The FTS source site is responsible.
John S: It is a good idea to have one responsible party, and that it is the source site – it avoids GGUS having to implement two owners.
Luca: I am not sure the site has the means to regularly update the ticket with all the information. We found it difficult to get feedback on the BNL–INFN link problem – it involves many providers. The coordination would much more easily be done by the OPN network people at CERN. I can talk to my OPN contact but not GÉANT etc. The people working on the OPN and these network issues (GÉANT) are in the same group.
John S: CERN is not a catchall for all networks.
JS: The other model is not acceptable to the network people.
Luca: Without central coordination there is a problem.
JS: Network people have said they will take the required responsibility – if they don’t we go back to them with specific examples. At the moment we have many problems.
JG: What do you do about escalation?
As a first approach, a problem between A and B should be dealt with by them.
JS: I like the idea of separating the problems. Some incidents take longer than 24 hours to resolve. For persistent problems we escalate using the same mechanism. At the moment escalation is informal; it would be better if the escalations were defined – after, say, 4 days and 2 weeks, things get escalated up to the project leader?
JG: Network community should accept responsibility and that there has to be a route for escalations.
A: Will write a one-page summary and bring it to the MB.
There are 127 working CREAM CEs – not much fewer than the total of T1s+T2s. One would perhaps not expect as many CREAM CEs as LCG-CEs, since CREAM handles load better.
ML: Well, some sites have multiple instances, so the number of sites is a bit lower.
JG: One issue is the availability calculation. Gridview does an OR of CEs, an OR of …, and then ANDs the results. But CREAM is just another CE, so it should be in the OR over CE types, not treated as a new service. The current impact is that CREAM does not improve site availability, and sites running only CREAM are a special case.
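JG's point about the Gridview algorithm can be sketched as boolean logic. The grouping below is a minimal illustration, not the actual Gridview implementation:

```python
def site_available(service_groups):
    """Gridview-style availability: OR within each service group,
    AND across groups (the site is up only if, for every service
    type, at least one instance works)."""
    return all(any(group) for group in service_groups)

# CREAM treated as a separate service type (current behaviour): an
# LCG-CE being up does not help if the lone CREAM CE is down.
current = site_available([[True], [False]])    # [LCG-CEs], [CREAM CEs]

# CREAM treated as just another CE in the same group (what JG argues for):
proposed = site_available([[True, False]])     # [LCG-CE, CREAM CE]

print(current, proposed)  # False True
```

This makes the impact concrete: under the current scheme a working CREAM CE can only hurt availability, never improve it.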
ATLAS (Graeme Stewart)
ML: Could it be the backend factories?
GS: No – nothing special.
FD: And they are real machines not virtual?
Massimo: We saw a bug this morning – a Condor request for cleaning of the sandbox.
GS: Yes, and somehow this did not work.
Conclusion: Some significant bugs remain, so it is still too early to remove LCG-CEs, but we encourage sites to deploy now so that tests can scale up. We are using it in production.
ML: If the Condor team are letting this matter slip, as happened with a few early issues – e.g. getting CREAM to work with Condor-G – then inform the Management Board. They have many clients, so don't let it drift off their radar.
Massimo: 7.5.3 or 7.5.4?
GS: 7.5.3, because we did not see any major change in 7.5.4. Rod was receiving patched binaries from Jaime. We do not see so many crashes now (an issue in 7.5.3).
We are using CREAM routinely. This talk is about using CREAM for direct submission; the WMS is only used for the pilot mechanism.
Conclusion: System is working well. Looking forward to more CREAM CEs coming online.
JG: There is a risk…
PC: We look at all CEs that will accept LHCb production jobs, and we know whether they are CREAM or not. Sites that also have an LCG-CE appear with both lcg and cream in the name, so as two different sites for us.
MJ: Pilot by WMS only goes to the LCG-CE?
PC: Yes. No CE will receive both direct and WMS submission – that would be confusing.
ML: Essentially, ALICE is using CREAM at every site except CERN – for various reasons. Operations have been stable at all sites except CERN, where there are issues to do with the size of the batch system and LSF; these are being worked on. In the near future we will be able to switch off WMS submission at CERN. In the past Patricia has presented many issues, but now these are mostly all fixed by CREAM updates. So for ALICE things are looking reasonable. The way ALICE decides whether more jobs should be submitted is via the resource BDII.
??: It is true that CERN is running many LCG-CEs and only a few CREAM CEs. We have had to offer it as a production service. The stability of the CREAM CEs has not been satisfactory – problems mainly due to the LSF batch system – and it has been manpower-intensive to keep the service running. The developers at INFN have been very responsive, but we are still reluctant to call this a production service. We are worried that the LCG-CE is only on SLC4, with support dropping at the end of the year.
Conclusion: ALICE very happy with the quality and stability of the CREAM CE.
FD: You use the resource BDII?
??: Yes, not the top-level BDII. It is used for the site queues and the VOView for the number of jobs.
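As a rough illustration of the kind of lookup involved: a resource BDII (conventionally LDAP on port 2170) returns GLUE attributes such as GlueCEStateWaitingJobs in LDIF form, which a submission framework can check before sending more pilots. The sample LDIF, hostname, helper function, and threshold below are all invented for the example – this is not actual ALICE code.

```python
# Hedged sketch: pull GlueCEStateWaitingJobs out of an LDIF response of
# the kind `ldapsearch -x -h <ce-host> -p 2170 -b o=grid` would return.
SAMPLE_LDIF = """\
dn: GlueVOViewLocalID=alice,GlueCEUniqueID=ce.example.org:8443/cream-lsf-alice,mds-vo-name=resource,o=grid
GlueVOViewLocalID: alice
GlueCEStateWaitingJobs: 12
GlueCEStateRunningJobs: 340
"""

def ldif_attr(ldif, attr):
    """Return the value of the first occurrence of `attr` in an LDIF dump."""
    for line in ldif.splitlines():
        if line.startswith(attr + ":"):
            return line.split(":", 1)[1].strip()
    raise KeyError(attr)

waiting = int(ldif_attr(SAMPLE_LDIF, "GlueCEStateWaitingJobs"))
submit_more = waiting < 50   # illustrative threshold, not ALICE's policy
print(waiting, submit_more)
```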
CREAM is in production – 1.6.1 since July. Today 1.6.3 is out for gLite 3.2, so perhaps in one week it will go to production.
CMS (John on behalf of CMS)
- No issues using CREAM via WMS
- Possible site config issues when using glide-ins
- Something at the authorisation level on EU sites needs to be resolved.
Report from developers (Massimo Sgaravatto)
MJ: Why are you still doing 3.1 releases? Are there so many using this release?
MS: This was discussed a while back in the EMT. It depended on component updates.
ML: It was not decided overnight. It was thought best for the last gLite 3.1 release not to be a bad one; it should not be an expensive last release. Some other fixes found their way into the update. The plan now is not to continue this branch.
JT: I could not find versions 1.5 or 1.6. In the release notes on the CREAM website there is a mapping… if you have a suggestion then please email it.
JT: Also you have EMI listed on your slides. What is the commitment from EMI to support the CREAM CE? In particular what protection is there to stop developers going off to next great thing!
MS: The main task is to supply support to the main VOs, and WLCG in particular. If the user community continues to push CREAM I see no problem. EMI will do one release per year, with updates in between.
JG: Will you release at least one per year? You won’t stop after year 2!?
ML: CREAM is still the next greatest thing.
Markus: But you cannot speak for EMI.
JG: They may be looking at removal of redundancy, i.e. consolidation, but we have yet to see any evidence of changes.
Tier-2 installed capacity reports (John Gordon)
Using the gstat tool: http://gstat-wlcg.cern.ch/apps/capacities/comparision/
Have 64 Tier-2 federations.
Many federations are not publishing at all.
A smaller number are not meeting their CPU pledge, and a larger number are not meeting the disk pledge.
On the VO bit for storage – the report is total disk, not on a per-VO basis.
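The comparison gstat makes across the points above can be sketched as follows; the federation names, pledge figures, and function are invented for illustration and are not gstat's actual data or logic.

```python
# Hedged sketch of a gstat-style pledge check: flag federations that are
# under pledge or not publishing at all.  All names/numbers are made up.
pledges = {          # federation -> (CPU pledge, disk pledge in TB)
    "FED-A": (10000, 800),
    "FED-B": (5000, 400),
    "FED-C": (7000, 600),
}
published = {        # as gathered from the BDII; FED-C publishes nothing
    "FED-A": (12000, 750),
    "FED-B": (5200, 410),
}

def shortfalls(pledges, published):
    """Return {federation: [failing metrics]} for under-pledge or silent feds."""
    problems = {}
    for fed, (cpu_pledge, disk_pledge) in pledges.items():
        if fed not in published:
            problems[fed] = ["not publishing"]
            continue
        cpu, disk = published[fed]
        failing = [name for name, got, want in
                   [("cpu", cpu, cpu_pledge), ("disk", disk, disk_pledge)]
                   if got < want]
        if failing:
            problems[fed] = failing
    return problems

print(shortfalls(pledges, published))
# FED-A meets its CPU pledge but misses disk; FED-C is not publishing.
```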
Luca reported a problem with non-integer fairshares.
JG: Have you reported it as a gstat problem?
LdA: No. We only changed the values to be integer for the site – we noticed it only yesterday.
PG: There was a similar problem with HEPSPEC.
Conclusion: Gstat seems to be close – not orders of magnitude out – but some results look wrong, so some sites may be publishing wrongly. Most sites are now publishing, though, and most pledges are being met.
JT: On the last slides, I missed the WLCG Information Officer response.
JG: I don’t expect individual sites to go to Flavia with problems
JT: Sure but the information publishing may be wrong.
FD: I will help identify problems and make sure the tools to help are in place.
PG: Previous issue – HEPSPEC06 being non-integer issue was fixed by July.
MUPJ – gLexec update (Maarten Litmaath)
JG: Should we be pushing more sites to implement something?
ML: Some sites, after initial troubles, have it right forever. Some make small mistakes with upgrades.
JG: Could you raise tickets against them?
ML: Yes – small enough number – most problems are probably easy to fix.
Middleware update (Maria Alandes Pradillo)
ML: There has been some reaction to the easy linker option for SUSE. I have serious doubts about whether we should put a lot of effort into saving these two unfortunate sites.
MS: We should not invest a lot of effort in this – it will be addressed by EMI 1.0, which will be released with source tarballs. Sites can then build it themselves.
OS: There are not many such sites, but on the client side there are many users. And it is far from trivial.
Massimo: On the open issues slide – it says it was not decided, but actually it was not discussed.