GDB

Europe/Zurich
IT Auditorium (CERN)

John Gordon (STFC-RAL)
Description
WLCG Grid Deployment Board monthly meeting
Raw notes (To be updated)

8th June 2011 GDB Meeting (at CERN)
 
Introduction (John Gordon)
Reminder that the meeting is open to T0, T1 and T2 people.
Recent meetings: EGI virtualisation and clouds workshop; EMI All Hands meeting; Database workshop.
July 13th meeting is cancelled and unless anyone objects also the August meeting.
Today is world IPv6 day. http://test-ipv6.com/
 
Availability reminder: ACE, the Availability Computation Engine. We compared the different algorithms and the MB decided to go with the OR rule: "if either CREAM or LCG-CE is OK then the CE is OK". The May report will use the OR. The old SAM database will close at the end of August, so make sure any tools still using it are migrated! Availability result changes will be accepted up to 7 days later via an automated check.
https://gvdev.cern.ch/ACEVAL/ace_index.php supplies information until MyWLCG is available.
In May people asked whether there would be impacts on the May report (no) and whether any of the experiments would have a problem if the LCG-CEs are removed (ALICE, CMS and LHCb will have no problems).
Jim Shank: ATLAS have no problem either.
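The OR rule chosen by the MB can be sketched as follows (a minimal illustration with a hypothetical function name; the actual ACE implementation is not shown here):

```python
def ce_availability_or(cream_ok: bool, lcg_ce_ok: bool) -> bool:
    """The "OR" rule: the CE part of a site's availability is
    considered OK if either its CREAM CE or its LCG-CE passed the tests."""
    return cream_ok or lcg_ce_ok

# Under this rule a site whose LCG-CE fails its tests but whose
# CREAM CE passes still counts as available.
```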
There was a report of an instability in CREAM running with GridEngine. CREAM CE 1.6.6 fixes the known issues with SGE – coming for gLite 3.2. CESGA have implemented the fixes for certification and now have no issues. The mid-term roadmap is that LCG-CEs receive security fixes from October 2011, and these only until March 2012, for the LHC VOs.
Chris Walker: Do you know of problems for any other VOs?
JG: No, but EGI feels there may be some VOs.
MJ: D0 may have problems.
Date for switch of availability to just CREAM will be decided by the MB in August. Likely to be October but dependent on the SGE issues being solved.
There is an Identity Management workshop at CERN later this week. Romain Wartel will present the WLCG view (the slides he presented to the MB are online).
RW: Just talk to me if you have any questions.
IB: The scope of this is not just LCG. It covers many organisations.
EGI virtualisation and clouds workshop. 6 key scenarios. EGI wants to bring in virtualised resources alongside current grid ones. A testbed is being put together using existing expertise and resources. Some resource providers are to set aside resources for this and subsequent investigations (for example to look at accounting).
JT: The interesting thing is that most people there wanted something orthogonal to us.
IB: The workshop was strange in that there were no users other than us present. Half the audience was site managers and the other half developers. The other needs raised were requirements for persistent services.
WLCG workshop at DESY 11th-13th July. Please register if attending. Still some slots for talks.
 
Progress towards a GGUS fail-safe system (O Dulov)
JS: There have been a number of times when GGUS has had failures. Have you analysed those incidents to understand if these changes would help.
O: Yes and no. Many incidents have been infrastructure issues – for example network outages – or configuration management, for example not installing updated CRLs. I have focussed in the last 5 months since starting on … sometimes it is not only errors but things within the code that need to be changed, and changes need to be controlled. I hope we will manage this from June through a test environment that has better defined procedures.
JG: You had VMware and then two instances in it. I thought VMware had automatic failover so that another instance is started.
O: The hypervisor can do it if there is a problem with the VM, but not with software issues within the VM instance. There are questions about how long it takes to start a new instance – it is about 2 minutes for RH and all services to restart. It is quicker to have another VM instance for failover.
JG: You mentioned Remedy – are you looking at integration with it?
O: Not really. The main issue is that Remedy has a database structure already.
 
MUPJ – gLexec update
A quick overview shows that for ops the Tier-1s and CERN are passing OK. The links on slide 2 need updating – the new Nagios only runs tests on CEs where glexec is tagged in GOCDB… which is zero, so the page is currently empty. The developers were asked to switch back to testing all CEs unconditionally.
Tests for LHCb: all fine, but RAL now only offer CREAM for LHCb, which is untested. There is DIRAC code to report glexec failures, but it is not yet in production.
For ATLAS most sites now look okay. At CERN SCAS is still being used and it is not completely understood why a mapping issue is seen – possibly because there is still an LDAP infrastructure, which may cause group mapping to fail due to an implicit assumption that the files are fully populated. ATLAS have glexec included in the production version of the pilot, but this is not used at most of the sites yet.
JT: We have an LDAP infrastructure too, so you should follow up with the developers.
CW: You believe it will not be a problem for ARGUS?
ML: Tested that the mappings make sense and the code is very different so don’t expect the same problem there.
CMS site tests also show everything okay for T1s. CMS's own tests are more realistic, and the framework is neutral so it can be adopted by others. CMS is now polling its sites to find out implementation plans.
Linger mode seems an issue – glexec hangs around until the payload exits. It is a difference found between OSG and EGI deployments.
ALICE currently discussing an LCMAPS plugin option (to work without a full proxy being present to determine which DN to switch to) with glexec developers.
T2 campaign – no traffic on the dedicated mailing list so far. 41 sites now claim to support glexec (26 EGI and 15 OSG) but many fail the Nagios tests.
The twiki page has been improved: https://twiki.cern.ch/twiki/bin/view/GlexecDevelopment. It includes a new script to help configure ARGUS. 2 new bugs were observed.
 
Jeremy C: You have had some exchanges with the developers about a relocatable install. Please could you summarise what you found out.
A relocatable install is difficult: paths are hardcoded, and a setuid binary can not rely on the load library path, so all paths need to be in the configuration under conf.d. It was suggested to check whether sites could live with the configuration file being in /etc together with the paths to the .so files. Sites said they would be satisfied with this – everything else could be taken from a relocatable area. The configuration file in /etc will start in the EMI version; the move will happen in the near future.
So a fully relocatable install is not feasible until EMI 1.
Conceptually the aim was to figure out whether this is sufficient. If not, sites would need to build from source. For a true relocatable version it would need to be made available as a new node type (in the same way as UI + WN). If we had stayed with gLite a relocatable route would have been added. Discussions for WN+UI have started.
Do any other countries/T2s want to give any feedback?
JT: Side comment on SCAS issue. Sounds like a bug. Sometimes if the plugins are in the wrong order this can happen.
Database Futures Workshop (Tony Cass)
The intention to look at future database application requirements. Tracks on requirements, implementations and technologies.
 
IB: What is challenging in the BE area?
TC: Data volume and rates. Time to restore over 10Gb Ethernet is days. Can address it but if they double the data rate into the database the power is a concern. It is a deployment issue – yes Oracle could handle it.
TC: If you compare the accelerator and experiment people – accelerator were looking at long term reliability and experiments were looking at the ease of application development.
IB: Is it not too early to reach conclusions. People are testing for a variety of environments. Some will have a lifetime. When you eventually understand what is needed for a service then we can implement it.
TC: Yes. The explicit statement was that there is no request for nosql.cern.ch. The emphasis at the moment is "we can do this in Cassandra but not in Oracle".
??: One comment on mySQL. It sounded too optimistic that if there were a service there would be takers. It is needed for some grid software. The need for proper mySQL support will increase, so it is not quite as opportunistic as you suggested.
TC: Central support did not mean HA. Drupal can not take advantage of the HA Oracle environment. The mySQL running for Drupal is on the cluster but does not benefit from the two nodes. The application needs to be coded to take advantage of the environment.
JG: I have seen a number of implementations start well and then slow down on mySQL until expertise increases and people discover the index.
TC: We can code against the interface and those who do can see the benefits.
 
UMD Release Schedule – Current status & future plans (Michel Drescher)
UMD 1.0 – 4th July 2011. UMD 1.1 – 5th September 2011.
Chris Walker: One of the questions from the last GDB was to do with source. Will source be provided for all packages.
MD: No we pick up binaries.
JG: Tests are run against the binary rpms, but there is another test of building things from source. The released versions were all built from source rpms, so if you want the sources you can get them from the EMI repository.
JT: Interested in the list of issues. The transition of EGI/EMI manpower suggested most should go to the EMI package teams. Is it working well?
MD: Reasonably well. We will never have 100% coverage, so our job is more risk mitigation. If you find something and there is an efficient workaround, put that into the notes and file a GGUS bug.
JG: One was that the WMS did not work with the EGI VOMS. So why did you decide to reject it if it works with the existing VOMS? Was that a user-driven decision?
MD: Thought VOMS2 was a critical update.
JG: Two packages from the same release not working together is a fairly large issue.
MT: You have to be reasonably sure you will get a self-consistent release, but for most of the components they release we will need some sort of compatibility matrix with existing components, and also a process to decide what should be kept. I have my doubts…
MD: You are indicating concrete operational requirements but I have not had much feedback so far.
Stephen Newhouse: OMB – requirements of maintaining compatibility.
JG: When this comes out in July presumably there is an OMB discussion on this and the end-of-life of other components.
SN: Yes.
JG: WLCG may have a different view on this.
Tiziana: You have mentioned a problem between CREAM and VOMS. When we talked to the project director we asked about new products introducing any backward incompatibility, so we assumed nothing would break. Do you have different information?
ML: No, it was a worry because we have been bitten by these things a couple of times in the last couple of years. Mistakes are made. An example from last year is the VOMS not being compatible with the WMS in gLite. I would not be surprised to find such things again, so I would recommend that we arrive at some sort of compatibility matrix that gets tested in the verification process to mitigate some of the risks.
MD: The staged rollout has more coverage here especially as more sites participate.
Stephen Burke: Does the new WMS work with the deployed VOMS?
SN: The issue that you bring up is relevant – while there is a willingness for sites to be slow upgrading there will be more components out there. So the management need to ensure there is pressure to upgrade. On the deployment side things need to be tested as much as possible and here we rely on sites taking part in the staged rollout.
IB: We are in production. We need fixes to problems in the existing system, not new things. For EMI 1 we are faced with a reinstallation issue.
JG: It depends on the component.
SN: EMI and EGI are comfortable with a long overlap period.
IB: You have not tested the interoperation.
SN: But we have and found an issue here which is why the WMS was rejected.
MJ: Does the new WMS work with the VOMS in production?
There is a non-trivial workload if many fixes to the documentation are needed in order to upgrade to a new component. Getting bug fixes out quickly is important.
 
Ian Fisk: Why is an interoperability problem only found at staged rollout? This would normally be found at an earlier stage.
MJ: But it is not too late.  Staged Rollout has the power to reject components too.  We could never emulate all situations.
JG: WLCG pushed for the staged rollout because we knew that testbeds would not find all problems.
SN: The idea is to put things into the wild to reject before mass rollout but hopefully these issues would be caught earlier.
IF: My concern is the assumption that the staged rollout will find problems. Concerned about the statement that we do not know it is compatible.
SN: EMI have asserted that it is compatible.
JT: They asserted and they (EMI) hoped. Concerned about the lack of faith in compatibility.
Oxana Smirnova: There is some misconception about how much the EMI testbed can test. They cannot do the scale tests, and many tricky things are found later. Somebody saved by not having a large EMI testbed, but somebody has to pay.
Oliver Keeble: Speaking as part of a product team. In the current phase there is an expectation that we do interoperability tests in the product teams. That is what is expected and we can argue if it is effective.
IB: In the past the certification effort was all in one place.
SN: But that is no longer the world in which we live.
 
LUNCH
 
Future discussions (Ian Bird)
In calling for input it has become clear that we need a more dedicated group for discussions – perhaps on the Tuesday before GDBs. (see slides for proposal). Summaries to be presented the following day at the GDB. The discussions need to be focussed and not too wild.  The 2hr slots can be good to expose the issues but not resolve them.
Federico: Your last statement – we should not count on people to help us.
IB: We can not rely on EGI and EMI to deliver things we want.
FC: But we need to ensure that whatever we do fits back in with ExI.
IB: There are issues in areas like supportability, maintainability. It is not just about middleware but anything (infrastructure wise) that we put in to production. We should not assume ExI will do everything.
SN: What involvement would you like from EMI/EGI.
IB: The discussions will be within WLCG about what we need to do.
DK: You mention a document but by when? The initial statement will be documented when?
IB: It will be ongoing. Clearly each discussion will need some summary notes.
RW: A question on the membership – will it be static?
IB: No it will adapt based on the topic.
JG: And each topic will have a working document?
IB: Yes.
 
Security topics – “glexec”
Statements in slides are deliberately provocative.
IB: The issue comes down to trust. We have to have a rational philosophy underlying things. How do we have a reasonable implementation and also being able to have traceability.
JT: I heard the statement deployment is difficult many times. I claim that everything we introduce to deployment is difficult and the difference with glexec is that the advantage to the experiments is not seen.
IB: How many sites have actually installed it in the last 2 years without recent pushing from Maarten? I will delete the first bullet off the list.
JT: Traceability is increased for sites if you have glexec. The payloads are run under different IDs on the site, as opposed to everyone using a single identity.
IB: I am asking is there a simpler way? Trusting the experiment frameworks.
RW: There is good traceability but not with the bridging between grid and unix. Who should have the final traceability?
IB: We have had various suggestions but not a proper community discussion.
Ian Fisk: We discuss it but we often jump to the implementation. Traceability means you can look back to see who ran what. There are other ways of doing this or trustably getting the information. We need to define the requirement before looking at the implementation. Glexec was hampered in terms of coding and was mixed up with SCAS/ARGUS. OSG sites have deployed this…
JG: Using gums.
IF: The working solution was dismissed.
FC: glexec is part of the solution not the full solution. The question of the proxies and how we handle  them etc.
IB: Glexec is in inverted commas.
Glexec tried to solve the unix traceability. The only way without it is full system-call tracing on the WN, and I know of no sites that do it, as it is very expensive – a nightmare. The trouble is, for MUPJ mapped to a single account, with multiple jobs on the same WN you will not be able to figure out who created what file.
IF: You have gone down two levels already. Before us there were people who round-robined around the batch system. My point is that the assumptions need to be checked – you assume you have people using the same ID. Traceability for every process is not there.
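The traceability gap being argued about here can be illustrated with a toy model (all account and DN names below are made up; glexec's real mapping goes through LCMAPS/SCAS/ARGUS, which is not modelled):

```python
def who_ran(owner: str, mapping: dict) -> list:
    """Return the payloads that could have created a file owned by `owner`."""
    return [payload for payload, acct in mapping.items() if acct == owner]

# Without glexec: every payload on the WN runs under the shared pilot
# account, so file ownership identifies no particular payload.
shared_owner = {"payload-A": "pilotvo001", "payload-B": "pilotvo001"}

# With glexec: each payload is switched to its own pool account,
# so ownership points back to a single payload.
per_user_owner = {"payload-A": "poolvo017", "payload-B": "poolvo042"}
```

With the shared mapping, `who_ran("pilotvo001", shared_owner)` returns both payloads – the ambiguity described above – while the per-user mapping yields exactly one.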
RW: Traceability is one thing the other is user access – being able to block them.
IF: What does traceability mean?
DPK: It is defined in the policy.
JG: The MB documentation put out around this stated that glexec is only one option that is available now.
Mingchao: In the challenge that we just ran the sites have no way to … The site does not know who ran the payload.
IF: glexec is … perhaps there was a mixup in the submission, that glexec was called… You have already made 3 trusting decisions.
MM: Trust the experiments and frameworks, but what about compromised proxies? In an incident we need to make sure by checking the logs, and how can you do that if you can not trace it?
RW: I am tracing the VO information. The problem is that they do not work together.
IB: So we should address that problem.
JT: I agree with Ian that we need to make a good analysis of the requirements and check if what we have now satisfies them. On glexec, deployment is more difficult because this is the first time the experiments have had to change their frameworks.
ML: That is not sufficient at the node level. You can not at the moment say who created that action on my WN. You can not distinguish because they are all owned by the same user.
JG: A single virtual machine for every payload.
ML: Would work.
Simone Campana: At the time of the last discussion VMs were not part of the discussion. Perhaps they now resolve this.
Once the VM goes you lose all information?
Many: No. The VMs are logged.
TC: This is exactly the discussion happening in the virtualisation group now.
L: Whole node scheduling is also another way of containing things.
RW: Different VMs do keep a separation between the users.
MS: Is this true of all I/O?
PC: Traceability – whose payload ran at a given time. Now people say we need to know who created a specific file. The payload runs as the ID of the user.
JG: Only if using grid methods.
PC: The worst for me is a user doing rm -f * and until you protect against that, why should other things be done?
JG: File creation issues were there from day 1.
IF: Are we really worried about file creation? The jobs all run as non-privileged users. The idea that we have to track against a file being created in /tmp is not part of the requirement, is it? The SE file creation happens with the user credential. The working areas get deleted when the jobs finish.
Denial of service – requiring us to have WNs open makes this problem worse.
RW: “Quotes from the policy”……
IB: We have to apply two things: the philosophy, and the term "as reasonably possible". Just because it is a policy does not mean we can not revisit it.
JG: It is the external part that concerns sites. The site security officer needed this sort of policy in order to trust/validate what we are doing.
IB: What we should do is check how we implement the policy.
DPK: The policy is aware of balancing risk.
JG: There is an EVO question on blocking users. The banning of users quickly.
M: We would like to ban users while they are running. Glexec does not help with this.
RW: I think the security challenges prove this is not the case. It is a standard unix problem.
MM: The current SSC – the reason the site can identify the malicious job is because there is a defined malicious IP address. Without that, with many jobs under the same UID, there is no way to do it while the job is running.
RW: The blocking issue. A site who is the victim of an incident… it is acceptable that the information is available, but perhaps not in one place.
MM: You need to have a reliable way to identify a malicious job.
You mentioned risk and impact. User proxy compromised vs pilot proxy. The first can do some damage…
JG: Ian is saying that if we can not address a specific risk then we should not address any.
IB: I am proposing that we are protecting the things that are most at risk.
IF: The impact of deleted files… we hand credentials on to machines and trust the sites.
RW: We know that in general we have a big problem with storage.
IF: CMS has 80 sites. If a site bans the VO then we would very quickly ban the user… we would accept the risk of having the VO banned entirely for a short time.
IB: The current approach is complex and we need to look at it again.
IF: We also need to know what other trust assumptions you are making.
RW: We need to look at it as a whole.
MJ: As a site I find the discussion a bit confusing. Should we aim for June deadline or relax the pressure?
IB: glexec is the solution now. But we need to have the discussion at some point.
IF: This is the best we have. If we change course then it would take 6-12 months.
IB: You would still need the drills… change policies.
SC: Life today is without glexec so if we have to go to another model in 6-12 months should we really continue on the deployment path? In 2003 ALICE and LHCb were already using pilots.
IB: The distinction was that there were not multi-user pilots.
MJ: We try to meet the deadline. Until now have not been able to use it. Suggest we wait 6 months using it before restarting the discussion.
IB: We have only discussed part of the problem so far.
JT: Since we are almost there, and it is not that much more work, if we take the step then we bring additional experience into the discussion. It would be a shame not to use that.
SC: Side effects – ATLAS needs to force users to store a proxy in MyProxy. We get some learning… There is implementation work to be done.
RW: We need to solve the responsibility question first. Until then the technical solutions will not satisfy people.
IB: We need to have a more focussed and clear discussion. Everybody has a different opinion. I think we have to continue with the deployment of glexec – I understand the ATLAS concern about the work to implement the framework. But we still urgently need to have further discussion. At the moment we can not conclude one way or the other.
Security topics –
Access control
SC: ATLAS does not demand control at the file level but at the space-token or directory level. The requirements are: not all users being able to recall from tape; protection from all users being able to delete raw data; ATLAS user data should be readable only by ATLAS. 99% of this is in place – I am just stating the requirements. It is getting more of a concern now since more is done in ROOT.
Job management: For how much longer do we need the WMS. Can we simplify the requirements on the site?
SC: ATLAS moved away from direct submission for analysis and production; Condor is used. Direct submission means not using pilots. For installation of ATLAS software we want to use Panda. For CVMFS – there will be sites. You need an installation system to validate the install and tag. This will be done within Panda but not yet, so we still need the WMS for now. By Jan 2012 there will be no more need for the WMS. For current usage we need a couple of WMSes. SAM tests need to be taken care of and can be done within Panda – this is already planned.
CE attributes passing to WNs is no longer a requirement.
Pilot factories currently working in a central way. Work in progress to have pilot factory run at the sites. If this became middleware that would be useful. Something to be discussed here and could be useful to ATLAS.
Federico: Talking through the answers in the slides. ALICE is testing signing the JDL rather than signing the proxy. This helps us out of the proxy nightmare.
File permissions – a single file catalogue with unix-like permissions. The storage has one user owning all files. Interaction requires a central service to sign a ticket validating the request, giving the client permission to carry out a command.
Use of multi-core is in development. One job per core or per machine.
Glexec – can we do it simpler? We are trying to follow the strategy.
All file protection is available now in the catalogue we use.
X509 is not an issue for ALICE but we will follow the workshop.
Pilot jobs are almost ubiquitous now. Do not need WMS.
We do not need to pass parameters to the nodes and could live without the CE but currently have a need for simple CEs.
ML: Comment on last bullet. Each VO box you have is like a pilot factory.
FC: It is not really a factory.
IB: If there were a generic factory that was at every site could you use it?
FC: Our VO box – we look at our queues and submit when there is a need, we do not submit jobs blindly.
IB: How you use it is another question. It may be simpler if there is one software solution for everyone.
JG: Is there a need for factories at each site.
MS: The factories we are talking about are those submitting jobs to sites in a way that you can control.
FC: Yes if our scripts can run.
SN: Is this not the same as starting a VM? A pre-pilot job factory?
IB: This is a discussion before the discussion.
IF: This used to be called the job bomb model. It is not a new idea. You submit a request and that request then starts a lot of jobs locally.
RW: Could the tools being used for cloud provisioning systems be looked at?
IB: That area needs to be discussed but it is not the same discussion.
RW: Do the tools need to be different?
JT: No, the VM technology – they don’t have scheduling and you get a VM until you give it up. PC would then not give back the VMs!
IB : Amazon do not have this provisioning problem.
??: Then ask for more money?
There is never enough money!
 
LHCb: File protection. We gave requirements with the SRM 2.x extension. We would like to protect our data and the user data. We are discussing with the CASTOR people. Not sure about the dCache sites – we are not yet talking with dCache, as in our framework we check in the LFC whether the proper permissions (ACLs) exist.
IB: The issue here is on systems where you are doing analysis.
We need proper ownership by owners and groups.
IF: You are talking about ACLs and you are talking about unix permissions.
dCache has ACLs and we can enable them if there is demand.
The average user can not now delete all the data, only the user data! We should revisit this. The big deal is ownership and having reliability. AFS-like ACLs are not really needed. Fermilab map the user credentials on the SE to an ID which is persistent. No complaints about being able to enable wider readability.
IB: Requirement for LFC and SEs to enable ACLs. Want to know what level of protection is really needed. 
IF: Step 0 was for me being able to stop a random user deleting data outside of their domain.
Even if SEs have ACLs, if the mapping is to … a question of persistent user mapping, which is not
MS: If the pilot job uses the coarse approach 
Pilot credentials are not used for file access in DIRAC. Files are uploaded using the user's credentials. There is not a single credential used to access the files.
The X509 discussion is not about whether we get rid of it but whether it is an issue at the user level.
PC: VOMS is more of an issue than X509. People forget they need to change proxy – having the role defined when the proxy is created is not ideal. It would be better if it happened when it is to be used. X509 is not an issue in itself.
Job management: Here we are only just moving to direct CREAM submission – a manpower issue. The WMS is just used to submit the pilot job. All we need is a factory – just the ability to submit a pilot to CREAM. We need an abstraction of the batch system – a simple qsub should be possible. We do not need hundreds of queues at sites.
IB: So you would participate in a technical discussion on this?
PC: Yes. Very likely.
CMS:
IF: ID management: the things presented by RW in the MB are interesting – we wish him luck. Most T1 workflows are submitted by pilots. More hours are spent in pilots than non-pilots. 2/3 of analysis jobs go through the WMS. We have deployed two factories in the US and will deploy one at CERN. We can't shut off the WMS now; we would need to work to a longer timescale – 12 months. We hope to get to a total queue approach that would allow us to do some global optimisations. Current work is to deploy jobs via both routes.
Passing requirements is currently handled by using different queues. The systems and pilots can do this themselves.
IB: So we can stop any work on passing parameters to the batch system?
We are interested in seeing the locally-run pilot approach developed. We think it is scalable. Given what the pilots are, it does not make sense to authenticate all of them. It makes sense to do this in common – our glide-in and Panda systems are 90% similar. This ties directly into our submission systems, so we should rely on the expertise in the experiments to make sure this fits with current frameworks.
IB: It is important that the experiments make sure the right people attend the discussions to represent them.
 
Summary/conclusion: these are by and large discussion items to be pursued further. Suggest we go with the Markus and Jeff group. Suggest we start this soon as many discussions can not wait.
JT: One comment on batch system requirements – I thought this was more a site requirement rather than an experiment one. For example where sites need to plan scheduling for MPI and whole node work.
IB: The whole node scheduling group have not yet reported back any progress.
JT: I thought ATLAS wanted this as they have a class of heavy ion jobs that they need to execute.
JG: The issue is that the information can not get to the CE.
MS: CREAM makes things more difficult to pass.
SC: Sure, job passing from batch system to CE would be useful. At the moment these can be handled through special site negotiations.
JG: There was a previous discussion about parameters that we wanted passed.
IB: We passed the parameters – memory and wall time.
MS: The project response was you can pass using the glue schema…
IB: So we go ahead and setup the group.  When is the first meeting.
MS: Before the WLCG workshop.
 
Information System – Use Cases and Future Steps (Lorenzo Dini)
JT: One of the issues with the info system is because it is getting big. The tags etc. are making it very big. This is one of the reasons I pushed CVMFS because it means publishing just one tag.
ML: There are a huge number of tags due to the number of versions that are still active.
JG: Installed is the amount of disk there. Total is that currently available.
MS: This is not related to dark data. It concerns how much space is available on a site.
SC: Used space is installed minus total.
 
On the issues slide.
PC: We are not willing to flood sites so take into account number of free slots etc. So estimated response time is important for LHCb.
MJ: At GRIF – maui does not scale to a large number of job slots, so it times out and does not give the information back. The larger the site, the more sensitive you are to this issue.
PC: Why does nobody assume that if there is no response the value is the same as before?
JT: Somebody should submit the bug to the information system developers. As the developer – the 4444 is the message "I can not talk to my back end service". The other problems are things like not linking FQANs to unix groups. The tough one is the per-user issue – the information system has to publish for all cases and that is worse than the ATLAS tags issue!
Latchezar: This is probably one of the values that you can not cache because it is very dynamic/volatile.
JT: For the configuration problems you could make YAIM better. VOMS FQAN information can not yet be done by YAIM. Fix that and there would be less mistakes.
MJ: We start to have special queues to handle reservations.
JT: You don’t use torque to find the number of running and waiting jobs.
How are the 0 and 4444 errors handled?
ALICE: 0 is bad because it means we submit anyway. The 4444 is transient.
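PC's suggestion above – if there is no response, assume the value is the same as before – can be sketched as a small consumer-side cache. This is a hypothetical illustration, not the actual LHCb or ALICE logic; the sentinel value 4444 ("cannot talk to the batch back end") follows the discussion above, and the class and method names are assumptions.

```python
SENTINEL_UNKNOWN = 4444  # info provider could not talk to its batch back end


class ERTCache:
    """Keep the last good estimated response time (ERT) per CE.

    When a query returns the transient sentinel, fall back to the
    previously published value instead of trusting 4444 (or skipping
    the site entirely).
    """

    def __init__(self):
        self._last_good = {}  # ce_id -> last non-sentinel ERT

    def update(self, ce_id, ert):
        if ert == SENTINEL_UNKNOWN:
            # Transient back-end failure: reuse the previous value if any.
            return self._last_good.get(ce_id)
        self._last_good[ce_id] = ert
        return ert
```

A value of 0 would still pass through unchanged here; whether to treat 0 as equally suspect (since it triggers blind submission) is a policy choice for the consumer.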
 
IF: Slide 12. The other thing about these issues is that they are all static. They exist because there was a complicated system in place before that was not reliably providing information.
IB: Who is involved with the EMI Registry?
LD: It is handled by a UNICORE product team. Laurence is part of the specification team.
Is a workshop required?
ML: Could this not be one of the discussion groups?
LD: This is similar to the pilot discussion and merging factories.
IB: I think there is a need for the discussion within this group to decide the direction.
JG: The idea was to follow the caching method, which may reduce the need for experiment caching.
LD: Already implementing the “improvements”.
IB: And now you need to know what next after the “stop gap” improvements.
LD: Yes – otherwise just fixing the symptoms.
LF: This was also the reason for the requirements gathering. We now know the SE figures are important, so we will write probes to help resolve the problem.
 
OSG top-Level WLCG BDII timeline (Rob Quick)
Looked at the possibility of OSG operations hosting the top-level BDII for USATLAS and USCMS. Concern that they have yet to run a reliable v5 BDII (probably an OpenLDAP issue; network/firewall problems most significant).
The question I have is: is the deployment plan introduced by Flavia at the end of last year still something that we should be aiming at?
LD: The answer is yes – carry on with that plan (see slides on improvements). It gives a minimum of 26x improvement with no query performance degradation.
Next step is to have the experiments test it.
 
RQ: You say it is ready for prime time – you’ll have a version etc. released?
LF: It is in the current release but not turned on; you just turn it on. So if OSG would like to try it out, it is enough to change the parameter and the cache starts automatically.
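The caching improvement LF describes – query results served from a cache inside the BDII, enabled by a configuration parameter – can be illustrated with a generic read-through cache with a time-to-live. This is a sketch of the general technique only; the parameter name, TTL, and internals of the actual BDII implementation are not shown here.

```python
import time


class TTLCache:
    """Read-through cache: serve a stored result until it expires,
    then re-run the (expensive) back-end query and store it again.
    Illustrative only; not the BDII's actual code."""

    def __init__(self, ttl_seconds, query_fn, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.query_fn = query_fn  # called only on a miss or expiry
        self.clock = clock
        self._store = {}  # key -> (expiry_time, value)

    def get(self, key):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no back-end query needed
        value = self.query_fn(key)
        self._store[key] = (now + self.ttl, value)
        return value
```

Repeated identical queries within the TTL then cost one dictionary lookup instead of a full LDAP search, which is where a large speed-up with no per-query degradation can come from.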
No July meeting and August likely cancelled unless the discussion group has need for the slot.
MEETING ENDED at 16:50.
 
EVO chat:
[09:06:16] Lorne Levinson no sound yet
[09:07:30] John Cassar Turn on the mic.
[09:09:22] John Cassar vc support here. any problems?
[09:09:50] John Cassar you're sending both cam and pc.
[09:11:17] Tiziana Ferrari joined
[09:11:20] Stephen Burke joined
[09:11:21] Anders Wäänänen joined
[09:11:27] CERN 31-3-004 joined
[09:15:02] Stephen Burke It's better for us to see the room too ...
[09:17:38] Yannick Patois (sound?)
[09:17:52] John Cassar sound is fine
[09:19:42] Ron Trompert joined
[09:21:01] Yannick Patois left
[09:22:24] luca dell'agnello joined
[09:22:40] John Cassar left
[09:22:48] Pierre Girard joined
[09:39:01] Tiziana Ferrari left
[09:39:35] Tiziana Ferrari joined
[10:04:48] Duncan Rand joined
[10:10:30] Christoph Wissing joined
[10:11:32] Duncan Rand left
[10:25:20] Alvaro Fernandez joined
[10:25:38] Christoph Wissing left
[10:31:28] Andrei Tsaregorodtsev joined
[10:35:12] Josep Flix joined
[10:57:30] Duncan Rand joined
[13:19:04] Claudio Grandi joined
[13:19:08] Tiziana Ferrari joined
[13:19:10] Lorne Levinson joined
[13:19:12] Ron Trompert joined
[13:19:15] Andrei Tsaregorodtsev joined
[13:19:16] Pierre Girard joined
[13:19:17] Christoph Wissing joined
[13:19:18] Renato Santana joined
[11:22:08] Stephen Burke Does the deployed WMS work with the new VOMS?
[11:44:02] CERN 31-3-004 Starting again at 1400
[13:19:21] Anders Wäänänen joined
[13:19:25] Pablo Fernandez joined
[13:19:27] Josep Flix joined
[13:23:17] Phone Bridge joined
[13:24:15] Phone Bridge joined
[13:29:47] Pablo Fernandez There is a microphone there
[13:29:57] Pablo Fernandez please close it
[13:41:36] Pablo Fernandez Hi, I am not sure if anyone mentioned... we, as a site, would like to be able to have a quick reaction to user misbehavior. If we see one user is doing something bad, we want to kill all his jobs! And according to the Batch system, all jobs are the same user, even if inside the job is using some other uid
[13:41:37] Stephen Burke Is whole node scheduling proposed for user analysis jobs, or just for production?
[13:42:07] Pablo Fernandez this is not being addressed by glexec
[13:46:12] Alvaro Fernandez joined
[13:47:05] luca dell'agnello joined
[13:52:24] Stephen Burke glexec tells you the uid, and you can kill all processes running under that uid
[13:53:09] Pablo Fernandez True, maybe not canceling the job, but rather killing the processes
[13:53:21] Pablo Fernandez that would be good enough, thanks
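The reaction sketched in the exchange above – glexec reports the uid the payload was mapped to, and the site signals every process owned by that uid – could look like the following. This is a hypothetical, Linux-only illustration (it scans /proc directly); the function names are assumptions and this is not a glexec-provided tool.

```python
import os
import signal


def pids_owned_by(uid, proc_root="/proc"):
    """Return the pids under proc_root whose real uid matches `uid`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    # "Uid:" line lists real, effective, saved, fs uids.
                    if line.startswith("Uid:"):
                        if int(line.split()[1]) == uid:
                            pids.append(int(entry))
                        break
        except OSError:
            continue  # process exited while we were scanning
    return pids


def kill_uid(uid):
    """Kill every process running under the given (pool account) uid."""
    for pid in pids_owned_by(uid):
        os.kill(pid, signal.SIGKILL)
```

Whether this counts as "cancelling the job" or just killing its processes is exactly the distinction Pablo makes above: the batch system slot survives, but the misbehaving payload is gone.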
[13:54:57] Duncan Rand joined
[13:55:31] Christoph Wissing left
[14:00:36] Stephen Burke glexec has taken ~ 5 years to get to this point - most likely any alternative will take the same!
[14:03:24] Andrei Tsaregorodtsev If the alternative is to trust experiment framework for traceability information, then it will take no time at all
[14:04:04] Stephen Burke we already discussed that many times and decided it wasn't enough
[14:05:58] Andrei Tsaregorodtsev Not enough from the site point of view in terms of technical limitations ? or legal limitations ?
[14:06:56] Stephen Burke read the email archive from the last round of discussions - if we do it again I suspect people will just say the same things!
[14:09:33] Andrei Tsaregorodtsev These discussions always go in circles because there are no technical limitations; everything is blocked by the sites' legal limitations or a lack of trust in the experiment frameworks. This is exactly the point that Ian is raising
[14:10:53] Pablo Fernandez it's not just traceability, looking back to see what happened yesterday, but rather having a tool that allows you to react to a problem you are having right now
[14:11:03] Pablo Fernandez (to answer Andrei)
[14:11:08] Stephen Burke They go round in circles because different people have different opinions, that doesn't change
[14:11:34] Stephen Burke And even after a decision is made the people who don't agree keep trying to change it
[14:12:06] Andrei Tsaregorodtsev True, the time to change opinions arrived !
[14:15:54] Stephen Burke In the time we've been discussing it we could have re-engineered the linux kernel to use proxies natively 
[14:20:24] Andrei Tsaregorodtsev This is easy, re-engineering people's minds is not 
[14:21:00] Andrei Tsaregorodtsev left
[14:21:10] Andrei Tsaregorodtsev joined
[14:22:47] Phone Bridge left
[14:26:02] Andrei Tsaregorodtsev left
[14:26:07] Andrei Tsaregorodtsev joined
[14:26:51] Skype Bridge joined
[14:27:07] Stephen Burke Why have even a batch system? Just run the pilot as a daemon on the WN ...
[14:28:00] Andrei Tsaregorodtsev This is what we do in the Amazon EC2, for example
[14:30:06] Andrei Tsaregorodtsev left
[14:30:16] Andrei Tsaregorodtsev joined
[14:36:35] Andrei Tsaregorodtsev left
[14:36:41] Andrei Tsaregorodtsev joined
[14:48:40] Daniele Bonacorsi joined
[14:51:26] Anders Wäänänen left
[14:52:38] Pablo Fernandez And also memory usage
[14:52:51] Pablo Fernandez (not only job length)
[14:56:28] Renato Santana left
[14:56:48] Renato Santana joined
[14:57:26] Rob Quick joined
[14:59:01] luca dell'agnello left
[14:59:24] Andrei Tsaregorodtsev left
[14:59:46] peter solagna joined
[15:09:46] Skype Bridge left
[15:09:59] Helge Meinhard joined
[15:10:56] Rob Quick left
[15:11:07] Rob Quick joined
 
There are minutes attached to this event.
    • 10:10–12:30
      Morning
      • 10:10
        Introduction 30m
        Speaker: Dr John Gordon (STFC)
        Slides
      • 10:40
        Progress towards a GGUS fail-safe system 30m
        Work started with WLCG requirement Savannah:113831.
        Speaker: Oleg Dulov
        Slides
      • 11:10
        glexec Deployment 30m
        Progress in T2 deployment and experiment testing.
        Slides
      • 11:40
        Database Futures 20m
        Report from the workshop earlier this week.
        Speaker: Tony Cass (CERN)
        Slides
      • 12:00
        UMD Release progress 30m
        News from EGI on validation and release of EMI-1.
        Speaker: Michel Drescher
        document
        Slides
    • 12:30–14:00
      Lunch 1h 30m
    • 14:00–16:00
      Technical evolution
      ALICE Input
      ATLAS input
      Introduction
      • 14:00
        Security futures 1h
        Slides
      • 15:00
        Job Management 1h
    • 16:00–17:00
      Information Services