GDB

Europe/Zurich
IT Auditorium (CERN)

John Gordon (STFC-RAL)
Description
WLCG Grid Deployment Board monthly meeting
GDB (11 May 2011) Chaired by: Dr. Gordon, John

[These notes have not yet been edited]

Introduction (John Gordon)
Membership of the GDB. Attendance is open to T0, 1 and 2s but there is also an official national membership that WLCG may use to contact countries for official representation. Please check http://lcg.web.cern.ch/LCG/gdb.htm for details.

IB: On that John, have you looked at the draft GDB mandate?
JG: Not at the moment, will cover at the next meeting.
 
Meetings update: EGI User Forum in April was the main event.
July’s GDB meeting has been cancelled due to the WLCG workshop at DESY.
Main upcoming and directly relevant events: EGI Virtualisation and Clouds Workshop takes place 12-13th May and we will discuss WLCG input today. WLCG workshop 11th-13th July at DESY, HEPiX fall meeting 21st-26th October in Victoria and CHEP 2012 21st-25th May NYC. See slides for other meetings.
On availability: since the last meeting another comparison of ACE with the previous A&R figures has been made. There is a good match, with deviations as expected. The different algorithms were also compared. The May report will use an OR of CREAM and LCG-CE. SGE will be working with CREAM availability in June/July so all sites must run CREAM by then to avoid 0 in the availability figures at that point. The old SAM database will close at the end of August.
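As an illustration of the OR rule for the May report, here is a minimal Python sketch. It assumes each CE flavour yields one boolean test result per time bin; the function name and data layout are hypothetical and do not represent the actual ACE or SAM algorithm.

```python
from typing import Sequence


def site_availability(cream_ok: Sequence[bool], lcgce_ok: Sequence[bool]) -> float:
    """Fraction of time bins in which at least one CE flavour passed its tests."""
    if len(cream_ok) != len(lcgce_ok):
        raise ValueError("both series must cover the same time bins")
    if not cream_ok:
        return 0.0
    # OR rule: a bin counts as available if either CREAM or LCG-CE was OK.
    available = sum(1 for c, l in zip(cream_ok, lcgce_ok) if c or l)
    return available / len(cream_ok)


# A site that has already switched off its LCG-CE still scores on CREAM alone.
print(site_availability([True, True, False, True], [False, False, False, False]))  # 0.75
```

Under a CREAM-only rule the second argument would simply be ignored, which is why a site without CREAM would then score 0.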
June GDB in the morning will look at EGI and the UMD release, the information service update, glexec deployment and testing, GGUS hardening and a summary from the database workshop. The afternoon discussion will be around security and future visions for job management.
IB: This was a follow-up to the discussion started last time about more forward-looking middleware discussions. Take job management, for example: if everyone is using pilots and factories then we should consider what is needed to support this and how it fits with virtualisation. I’d like input from the experiments on this area. I also want to know what sites see as issues in this area. On the security area, there is a security identity management workshop the day after the GDB. Maybe some are unhappy with X.509 certificates as interfaces for users, and I will ask people to present around this area. In the past we had a discussion on access control to data, but we need to check what the requirements are here and the ways forward. I do not want to discuss glexec again but perhaps in the future better approaches can be discussed.
 
Stickers for WLCG are now available from Cath Noble.
 
Information System
IB: Perhaps not everybody knows that Flavia Donno who has been working on this is leaving. I wanted to say thank you to her for everything she has done over the last 8 years including starting the EIS team. It is unlikely that she will be able to join this meeting again before she leaves… but thank you Flavia. Now I would like to introduce Lorenzo Dini who will take over the WLCG information officer role.
Update on WLCG information system (Lorenzo Dini)
To contact Lorenzo: Lorenzo.Dini@cern.ch.
There is a use case document https://twiki.cern.ch/twiki/pub/LCG/WLCGISArea/WLCG_IS_UseCases.pdf. It needs to be reviewed – your input is welcome.
The quality of some attributes is monitored here: https://twiki.cern.ch/twiki/pub/LCG/WLCGISArea/WLCG_IS_UseCases.pdf
A feature to delay the removal of objects from the BDII, to improve the cache time, has been implemented. It needs careful validation with users to ensure it does not cause issues with services. So far it has been tested on a GT test-bed with a 4-day cache. The results so far indicate that the cache smooths the instabilities. On a reliable top-level BDII: many sites have volunteered to host the service at specified levels.
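The delayed-removal idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the dictionary layout and the refresh interface are hypothetical and are not taken from the BDII code; only the 4-day retention comes from the test described above.

```python
from datetime import datetime, timedelta


class DelayedRemovalCache:
    """Keep entries for a retention period after they stop being published."""

    def __init__(self, retention: timedelta = timedelta(days=4)):
        self.retention = retention
        self._entries = {}  # dn -> (attributes, time last seen at the source)

    def refresh(self, fresh: dict, now: datetime) -> None:
        # Entries present in the new snapshot are updated and re-stamped.
        for dn, attrs in fresh.items():
            self._entries[dn] = (attrs, now)
        # Entries missing from the snapshot survive until the retention expires.
        expired = [dn for dn, (_, seen) in self._entries.items()
                   if now - seen > self.retention]
        for dn in expired:
            del self._entries[dn]

    def snapshot(self) -> dict:
        return {dn: attrs for dn, (attrs, _) in self._entries.items()}


cache = DelayedRemovalCache()
t0 = datetime(2011, 5, 1)
cache.refresh({"ce1": {"state": "ok"}, "ce2": {"state": "ok"}}, t0)
cache.refresh({"ce1": {"state": "ok"}}, t0 + timedelta(days=2))
print(sorted(cache.snapshot()))  # ce2 is still served from the cache
```

The trade-off raised in the discussion is visible here: a longer retention hides short outages, but also keeps stale entries visible for longer.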
Dario: Do you understand why some objects are less reliable than others? I would expect sites to vary.
LD: If an object drops out, tracing it means contacting the site and following up individually, so it is expensive to investigate.
Dario: Slide 17. Why do you have
LD: The temporal variation of objects. If a site adds a new dCache, say, then 200 objects change.
??: You mentioned that downtimes can be correlated with cache but if a site is in downtime you don’t want it to disappear immediately from the cache.
ML: We went back into the GOCDB and looked at the IN2P3 case and expected a correlation but we did not look in detail. The objects were kept for 4 days. We still need to look in detail as to the actual issues – at the moment we do not know if a whole bunch of records was added or lost. Anyway, so far this looks good but it needs running for longer to discover the real trends.
Rob: On slide 15, do you have any idea why the 10000 items for AGLT2 appear as a problem?
LD: No, I can send you the data to investigate.
Pierre: We have a Nagios probe that checks for disappearing DNs. Also we are not a volunteer site to host the top-BDII service for WLCG as a whole – will do it for France.
Possibility of adding timestamp to entries in the BDII? This would be useful to experiments.
Lawrence Field: In LDAP all entries have timestamps by default so you can find out the entry time.
LD: They probably want to know when the status changed to unknown.
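For reference, standard LDAP servers expose the operational attributes createTimestamp and modifyTimestamp in GeneralizedTime form. The Python sketch below parses such a value and computes the age of an entry; it assumes the common whole-second UTC form (for example 20110511093416Z), and the helper names are hypothetical. As noted in the discussion, this gives the time the entry was written, not the time a particular status changed.

```python
from datetime import datetime, timezone


def parse_ldap_timestamp(value: str) -> datetime:
    """Parse an LDAP GeneralizedTime value of the form YYYYMMDDHHMMSSZ."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def entry_age_hours(modify_timestamp: str, now: datetime) -> float:
    """Hours elapsed between the entry's last modification and `now`."""
    delta = now - parse_ldap_timestamp(modify_timestamp)
    return delta.total_seconds() / 3600.0


now = datetime(2011, 5, 11, 12, 34, 16, tzinfo=timezone.utc)
print(entry_age_hours("20110511093416Z", now))  # 3.0
```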
Scepticism from sites about automatically submitted GGUS tickets. Giving sites more help with a better view of what is wrong and what should be published would be useful.
LD: Sure. Tickets should be a last option.
Jeff (via EVO chat): The idea is to completely get rid of that attribute; it is not needed if sites "install" software via CVMFS. We still need to restart the BDII every six days or so, even with OpenLDAP 2.4.
LF: With latest configuration these variations should not be seen as the issue was fixed. Other sites at 2.4 do not see this issue. Perhaps a configuration issue.
 

 
Stefan: How did you decide on 4 days? In LHCb there is some calculation of availability based on information in the BDII, so if the information is wrong then that impacts the LHCb stats.
LD: We should trial different periods. We figured out that incidents including weekends would have a safe recovery time of about 4 days.
 
JG: Now a chance for the experiments to give views on what they need from the information system.
 
ATLAS (Alessandro Di Girolamo)
ATLAS is developing the ATLAS Grid Information System (AGIS). It does not replace BDII information but adds additional information such as:
-       Cloud and Tier level -- T2D
-       DDM endpoints (e.g. ACLs and quotas for groups)
-       PanDA queues
-       Frontier/Squid service info and ATLAS-specific configuration
 
IB: The only direct use of the info. system is service discovery?
Alessandro (ADG): At the moment AGIS is not in full production. Right now we rely on the WMS.
IB: In a year from now, what do we need the information service to provide? If it gives a lot of information you do not need then we should know that.
ADG: We describe in detail the situation in the document to Lorenzo – it describes the situation today.
IB: You should also document what you need in the future.
ADG: We have not discussed what are the needs for the future.
Simone C: At the moment the BDII is used for service discovery and a few other things. The requirements for now are short-term. ATLAS needs to discuss the long-term.
IB: My goal in having the discussion today was to get things started.
Michel J: On the last slide, on the disk space issue: what would you expect? Is it sites mis-publishing or a problem with the information system?
ADG: We need to debug each entry site-by-site.
Pierre: You don’t check the status of the CE. So with the new solution of cached BDII you will have many CEs that have disappeared.
ADG: We did not yet test behaviour with the cached system. We could add a new check and eventually stop this.
JT: When will ATLAS use CVMFS – lose ID tags is big issue.
SC: I will present some slides later.
 
 
CMS (Ian Fisk)
IB: Summary sounds very much like the ATLAS update.
JG: Sounds like you do not need much dynamic information.
IF: If you want to close a CE then you don’t want it marked down, you want it removed. Rarely up-to-date enough to be useful.
IB: On Dynamic Information slide. “Find giving no hints is not worse than the dynamic information”
IF: There was some number used for the WMS that did not have a well-defined normalisation, such that turning it off in the US led to better scheduling. Often the information was out of date anyway.
JG: How do you avoid too many jobs at a given site? We had 10,000s of jobs from ALICE when some figure was set to 0 recently.
 
IF: Queues tend to cope. We found that WMS information was misleading because the factor could have been incorrectly calculated. If you take the installed capacity and balance on that it would probably be more accurate/useful.
MJ: I understand. For BDII information to be accurate the batch systems need to work correctly. The most used batch system (Torque/Maui) has lots of timeouts and this leads to many of the problems. It is a dream to expect the CE information to remain accurate. We rely on a product over which we have no direct control.
IF: To complete that… our experience is that the WMS does not make terrible decisions but we can live without it.
IB: WLCG as a community – we need you to be explicit about what you are going to do in the future. If you are not going to use the WMS then we can focus on making sure that what you do need is reliable.
IF: You could improve the WMS performance by just giving it access to more static information.
 
No ALICE input.
LHCb: Philippe replied to say they did not have requirements.
If the BDII were more reliable would they use it?
JG: Yes but they can work without it.
IB: Have given Lorenzo more work to summarise what we want in the future from the information system. Should write it down more explicitly.
Stefan: For usage in LHCb, there are several agents in DIRAC trying to see new CEs and quality checks… one problem a few months ago with CERN top-level BDII and now we round robin on top-BDIIs.
This is also the system that I mentioned in the previous talk. The site quality is being checked. The problem was that the BDII was not giving proper information so sites were flooded by pilots.
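The round-robin approach over several top-level BDIIs can be illustrated with a short Python sketch. It is not the DIRAC implementation: the class, the injected query function and the example host names are all hypothetical, and the actual LDAP query is left to the caller.

```python
from itertools import cycle
from typing import Callable, Iterable


class RoundRobinBDII:
    """Rotate queries over several top-level BDIIs, skipping ones that fail."""

    def __init__(self, hosts: Iterable[str]):
        self._hosts = list(hosts)
        if not self._hosts:
            raise ValueError("at least one top-level BDII is required")
        self._ring = cycle(self._hosts)

    def query(self, run_query: Callable[[str], list]) -> list:
        last_error = None
        # Try each host at most once per call, starting where the ring left off.
        for _ in range(len(self._hosts)):
            host = next(self._ring)
            try:
                return run_query(host)
            except ConnectionError as exc:
                last_error = exc
        raise ConnectionError("no top-level BDII answered") from last_error


def fake_query(host: str) -> list:
    # Stand-in for a real LDAP search; one host is pretended to be down.
    if host == "bdii-a.example.org":
        raise ConnectionError("host down")
    return ["answer from " + host]


bdii = RoundRobinBDII(["bdii-a.example.org", "bdii-b.example.org"])
print(bdii.query(fake_query))
```

Rotation spreads the load and limits the damage when a single top-level BDII misbehaves, although it does not protect against an instance that answers with wrong information.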
MJ: I am a bit surprised about this.
ML: LHCb does use the complicated mechanism in WMS but they are moving towards direct submission. They could end up using their internal book keeping to make these decisions.
 
 
HEPiX Spring 2011 Highlights (Michel Jouvin)
Meeting was held at GSI and had 85 participants.
IPv6. There is now a shortage of addresses in Asia. A WG was created in Cornell and has just started working – forward names to Dave Kelsey if interested in participating in discussions on possible issues and creating a testbed.
Virtualization and Clouds. Working group updates (group started 18 months ago). Good contact with and recommendations responses from StratusLab. Will focus on reuse of StratusLab Marketplace. Will participate in EGI Virtualization workshop over coming days.
Oracle/Sun: Many sites expressed concerns about former Sun products since the buyout by Oracle. SGE situation confusing. Fork between Oracle and open-source.
Some discussions on new projects underway at data centres: IN2P3; CERN; GSI.
Next look at benchmarking activities with 64-bit and virtualisation.
JG: Would add that instructions mention running benchmark as 32-bit. LHCb already indicated that their next major release will be 64-bit.
Fall meeting TRIUMF October 24th-28th. Spring 2012 in Prague. Fall 2012 Korea or China.
You can register to the HEPiX mailing list at: hepix@hepix.org
 
EOS Update (Dirk Duellman)
Couple of points to clarify the situation on EOS goals.
EOS development complements CASTOR at CERN in the disk pool area. It is decoupled from the archive (no automatic tape connectivity). CASTOR will stay fully supported.
Testing production version now and awaiting hardware to become available later this month. Migrations will happen for CMS and ATLAS in June/July. This is not additional disk volume – CASTOR+EOS disk defines the total.
Can EOS be used outside of CERN? Short answer: no, not yet. The software is in a public repository but there is no support manpower. After the first deployment we should review interest. EOS is tackling scalability and performance issues for the T0. Too early to speculate about whether this solution will ever be of interest outside of CERN. Still need to evaluate performance and will report at future GDBs.
C: Where does it apply to the T0 model and the analysis capability?
DD: Will not touch the T0 workflow at all. For T0 and export do not see a role for EOS. CASTOR is doing it well. What we discuss with experiments is the CAF for CMS and analysis pool for ATLAS.
JG: On the use of disk, you used to mirror all disk in CASTOR. Do you have the same redundancy at CERN? Saving on raid controllers.
DD: Nobody would run a disk pool with less than 2 replicas. At the moment we are doing only file replication. Plans with experiments are 2 (plus more for hot areas). We have a prototype of doing RAID5 through the network. But for the first iteration at CERN it is pure replication. The best configurations for the future still need to be discussed.
JG: Reasons for not distributing it are all fair… but you are not learning about proper documentation even if just using it within CERN. Just seems to be a culture issue.
DD: There are quick iterations between developments and deployments meaning that the documentation is quickly out of date. The small group would not benefit from this at the moment where there is minimal documentation.
 
MUPJ – gLexec update (Maarten Litmaath)
Update of talk from yesterday’s MB.
Testing status for ATLAS (Jose Caballero) /atlas/Role=pilot job + glexec test, is in progress at T1s.
Testing for CMS (Claudio Grandi) basically looks okay. Issue with wrappers (perl Zlib problem breaks the standard glexec wrapper scripts).
Workarounds for ATLAS and CMS in progress. For ATLAS myproxy.cern.ch does not currently support the use of VOMS attributes (/atlas/Role=pilot) in proxy retrieval policy.
ATLAS, CMS and LHCb are making progress. LHCb preparing DIRAC code to report glexec failures (ready in weeks). CMS use glexec in production in US T1s and T2s. EGI T2s now getting into the game. Running analysis jobs on T2 sites using CRAB and adding sites one-by-one to glideinWMS. Looking at extending CRAB approach for T1s. Nagios glexec probe for CMS being worked on. ATLAS continue debugging T1s.
https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment gives more information on deploying glexec on WN.
Need a config file. Recipe to rebuild glexec from sources. Configuration file needs to be in root.
JC: In the UK 6 of 19 sites have indicated a need/interest in having the relocatable distribution.
Quattor sites will not try to implement individually but working on something together for the end of the month.
Issue for SGE and lost processes, what is the status?
ML: Some issues sorted and in EMI-1. None of the issues were completely vital. SGE tries to keep track of process tree of job by putting an extra group ID in process. This had some problem with glexec because by default it zaps everything. That is essentially a bug. If this becomes urgent then we can I am sure do something.
JG: Most of the other batch systems have clean-up scripts etc. but SGE was late coming to this. Summary to MB yesterday – most issues have workarounds. May be looking at fast tracking some packages.
JT: Extra IDs issue was also found in Condor so the team were looking at it already. Fix is in EMI-1.
Pierre: Will introduce first CREAM CE with SGE tomorrow.
Dario: we have been working happily for the last 5 years without glexec without any problems. Now we are talking about a heavy
JG: The decision was to implement glexec so we will do it.
RW: It is a bit like drink driving. You can get away with it sometimes but it is generally not a good idea.
 
CREAM update (John Gordon)
After the summer availability will suffer but progressing more quickly now.
MS: Over half resources now appear under CREAM.
GM: When can sites switch off LCG-CEs?
JG: For May the calculation will use an OR so you can do it now. There are sites that have already done it.
Mario D: If we turn off the CEs do we have to warn the experiments?
Production ATLAS uses CREAM by preference.
ML: Not sure the MB decision was exactly that… in principle do it now, but the conclusion yesterday to Luca was not to do it now. Needs to be carefully orchestrated with experiments. Concerned this will create work for the SAM team who have had to hack the calculation….
SB: Perhaps a difference between allowing sites to do it and asking them to do it.
JG:  I will mail the experiments and ask them to confirm that they do not need any intervention at sites that decide to remove LCG-CEs… and to contact the SAM team in case this adds work for them.
We are in a period where many sites want to move to CREAM. If the SAM team has an issue then we need to get the “manual” component automated.
JG: I don’t think this is an issue since the availability calculation for May will anyway use an OR.
 
Other Middleware updates (John Gordon)
-       DPM 1.8.0-2 has been released to gLite 3.1 / SL4
-       FTS 2.2.5 has been certified at least for gLite 3.2 / SL5 (for the other platforms the patches are not marked as such)
-       New patches for the gLite 3.2 UI and WN are in preparation; the WN patch will in particular address the known issues of the previous update
Links:
https://twiki.cern.ch/twiki/bin/view/EGEE/LCGprioritiesgLite#gLite_ status_presented_in_the_Ti
http://bit.ly/22we3i
 
LUNCH
WLCG Middleware Support (Markus Schulz)
The main questions here for WLCG-EGI-EMI:
How do we get the lost requirements back?
How do we establish a fast feedback loop?
How do we manage WLCG middleware work?
Luca: Feedback to developers  - start quite soon with sites to adopt the new software. CNAF will deal with STORM very soon and also CREAM.
JG: You mention resources need all the time – just look at the CPU used by the experiments, it is basically flat.
JT: Markus said it but not explicitly. One thing in the new situation is that the number of layers has increased. WLCG is a user. EGI is a user of EMI. EMI creates the stacks. That chain structure means that EMI does not talk to the users. The other thing: EGI have not yet got a sense of what a successful process is. It is not that nobody is complaining but that all the requirements are captured and acted on.
Tiziana: The site manager and users have the power to influence what the NGI deploys.
IB: I picked up a few things from the talk. We need an additional place to discuss requirements and priorities. Something to replace the EGEE TCG but for WLCG. Yesterday it became clear that a whole batch of requests is missing. Maarten has agreed to do a first review of the requirements from the previous project. The question of who we talk to is still not clear – is it EGI or EMI? Probably both.
JG: When you mention the group for discussion….
IB: You need a single voice to prioritise … there should be a single statement from WLCG to say what the responsibilities are.
JG: Then any user should be able to comment on it.
JT: Agree with IB except that I would question directly talking to EMI. Useful to have the requirements to EGI as well as EMI in case EMI does not deliver.
 
EMI 1 and WLCG (Cristina Aiftimiei) 
JG: Under your requirements you said 30? Last time did you not say there were 80 packages?
CA: There are 63 products.
Alberto: Clarification on the matter of requirements. We have a formal process for major requirements. We accepted major requirements up to a point but minor requirements can be discussed at any time. We are aware of past requirements – e.g. CREAM high availability… this was noted by an EMI member. It took 3 days to assess the situation and add it to bug tracker.
JG: Your volunteer
CA: Two weeks in which sites offer to install/test/configure some of the services that are of interest to them and also the user community can volunteer to test on these.
JG: EGI are also doing something similar?
TF: Contribution to a certification test done by EMI.
Alberto: It is also a collaboration with EGI and other communities. It is a preview exercise to get feedback and is separate from the EGI staged rollout.
JT: Issue about task requirements. What seemed to be implied is that they have not forgotten the requirements. At EGI UF there was a heated discussion where the comment made was that EMI could not have a memory of the last 10 years.  The situation can not be both of the above.
JG: There has been some email discussion on this. You can’t rely on the requirements being there but the project should make an effort to gather them all before asking for input. They are bound to miss some but not starting from an empty list.
Alberto: I hope you do not expect EMI to know everything that has gone on through every channel. I don’t want to be excessively formal but we do need some mechanism for reviewing requirements and we are willing to collaborate on this.
JT: Two sets of requirements WLCG needs to bring: from the sites and the users. Can we expect to get those through one group or do we need two?
JG: What is the immediate response to the EMI-1 release. Do we wait for an EGI UMD release or take some things directly.
MS: That depends on the component and how urgent updates are. For example DPM bug fixes are also in gLite 3.2. So for them no big urgency. Whereas CREAM has a link to ARGUS that is only in EMI-1 so sites wanting that should take that from EMI.
IB: From the WLCG point of view we should recommend which version should be run and where it should be obtained from. We should be careful not to confuse people. We should do this service by service. Assume gLite x.x.9 does not change in the EGI process.
Alberto: EGI should not be touching the code.
MJ: In response to IB, some sites may wish to decide what is within their best interests to deploy. Perhaps they have NGI requirements that are more urgent than the WLCG ones. 
JG: NGI may not push the sites to deploy EMI-1 but could support them.
LdA: If one site does not start to use a component in production early then problems will not be found.
JG: WLCG minimum requirements becomes important during this transition. Who owns this at the moment? Need to continue at Tier-1 coordination meeting discussion of releases and also at this meeting. EMI/EGI will no doubt want some WLCG sites to volunteer for their certification activities.
Mario D: UMD is not repackaging or adding anything beyond EMI. It has gone through a verification and staged rollout process, that’s all. Many of the staged rollout sites are WLCG sites.
LdA: When does support for gLite completely finish?
JG: Security patches from October was put forward but that was negotiable. Did anyone negotiate?
MS: For software installed on disk boxes there is little incentive to change.
Alberto: Question about SL6. Our plan is to have EMI ported to SL6 completely in one year’s time. But as the services become available in SL6 we can release them. Is it worth doing that?
MS: The WNs can only seriously be discussed once the experiments have decided what compiler they will use.
JG: We’ll bring this discussion back to a future meeting or the workshop.
 
Virtualisation and Clouds
HEPiX Virtualisation Working Group (Tony Cass)
The working group has made good progress with VM image exchange policies.
Not such good progress (though better in recent months) in delivering distributed catalogue of endorsed images.
-       Some video conference days (intensive discussions) planned to address this situation.
CVMFS is probably the neatest solution to the problem of VO software distribution.
VM exchange remains interesting… hopefully feasible by fall HEPiX. If VM images can contact pilot job frameworks directly that would simplify the scheduling problems at sites.
JG: Where do the images sit?
TC: This is not defined. The market place is just metadata. They could sit in some repository that the image creator defines. Instantiation of images is done from local repository once downloaded.
??: You seem to be limiting contextualisation to sites. It can be much wider. For example to dynamically inject information.
TC: You’re right, I do insist contextualisation is a site thing. The user part of the image can be changed through CVMFS. If the experiment wants to make changes that is the way… change the things loaded. Loading user credentials into the image is something sites do not want. Links back to the discussion I had with Simone, Ulrich etc…. you could put the ID in at that stage so that it goes off and contacts the ATLAS job queue. It all depends how these things are instantiated at sites.
??: Point missing from list – the API.
TC: We are not trying to address the interface for sites.
JG: There are two models. What we have been looking at is how sites instantiate an image… there is also a cloud instantiation where the user drives it.
 
Cloud Computing and Virtualisation
 ALICE ()
Cloud computing is not currently a priority. Main concern for ALICE is how storage is handled.
Virtualisation – no immediate plans. Present concern is software portability. There are issues of I/O intensive applications that need to be evaluated.
 
ATLAS position (Simone Campana)
ATLAS is interested in both. There is R&D in ATLAS evaluating cloud technologies (including academic clouds and commercial clouds). Dedicated on CVMFS and multicores.
R&D use cases – running MC with stage-in and stage-out. Data reprocessing. Distributed analysis. Resource capacity bursting.
Kickstart workshop 19th May at CERN: http://indico.cern.ch/conferenceDisplay.py?confid=136751
 
SC: Switch between AFS and CVMFS is still happening.
IB: What is the implication on the nightlies? It is something you already do?
SC: The current process is not fully automated… whereas this would give a big step forward.
LdA: Can others join the testing?
SC: Tests in virtualisation & multicores have been waiting on getting the environment ready. See the contacts on the slide.
DB: In order to use virtual machines at CNAF, ATLAS have to downgrade the storage approach, memory consumption etc. This is not a generic solution.
JT: We were ready to join a couple of months ago, at the time of the /opt/atlas requirement. Is that requirement now lifted?
SC: I remember the discussion but not the conclusion.
??: No you do not need to. As of a couple days ago ATLAS finished migrating to new name space on second repository. The transformation should finish today/tomorrow and after that it is not required to use /opt/atlas.
JT: At the time the path was needed by the sgm jobs for configuration.
 
CMS (Claudio Grandi)
Also added update on whole node usage which is what CMS most wants. Have a Whole Node Task Force who will provide details in due course.
On virtualisation CMS is not interested per-se. But CMS has nothing against sites using virtualization provided CMS requirements are met.
Clouds have been tested by CMS but are currently too expensive (see CHEP talk on Amazon EC). A WLCG cloud interface possibly of interest but prefers efficient access to site storage and usage of whole nodes.
 
LHCb (Philippe Charpentier)
Already using CERNVM widely – software on CVMFS for several years.
Raises questions about using pilots with VMs… interested in a generic cloud interface (makes things simpler). Ideally the VM could run as long as necessary… and would not have to be shut down after every job.
CVMFS now in use at several sites – very positive feedback.
JG: You raise some interesting issues.
TC: Two comments. On shutting down VMs, I don’t wish to argue either way. At the MB we have discussed how this could be signalled to machines using an HS06 figure and a shutdown command. The other comment: I have no problem that sites try to instantiate images with EC2, but a more radical option is that sites try to pull from DIRAC and, if nothing is available, then instantiate another VO’s image. Dynamic instantiation might be a better way for sites to fulfil cycle delivery to people. If it is a push model then the EC2 route is the way to go.
Ulrich mentions the complexity. Perhaps there are several images – for SL5 or 6 or for MC etc…
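The pull model with fallback described by TC can be sketched as follows. This is a hypothetical illustration only: the function, the way queued work is detected and the image names are assumptions, not a description of how DIRAC or any site actually behaves.

```python
from typing import Callable, Mapping, Optional, Sequence


def choose_image(vo_order: Sequence[str],
                 has_queued_work: Callable[[str], bool],
                 images: Mapping[str, str]) -> Optional[str]:
    """Return the image of the first VO, in site priority order, with queued work."""
    for vo in vo_order:
        # Only instantiate a VM for a VO whose task queue has something to pull.
        if vo in images and has_queued_work(vo):
            return images[vo]
    return None  # nothing to run for any VO: leave the slot idle


# Hypothetical image names; a VO may in practice need several images
# (e.g. per OS release or workload type), as Ulrich points out.
images = {"lhcb": "lhcb-sl5-image", "atlas": "atlas-sl5-image"}
print(choose_image(["lhcb", "atlas"], lambda vo: vo == "atlas", images))
```

In this arrangement the site keeps control of which VO gets the next slot, which is the main contrast with a push model where the user drives instantiation.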
ML: Have had nothing but good news about CVMFS. Fermilab mentioned something at the daily meeting this week.
??: For whatever reason Fermilab are not using the supported route/version. They set up their own server with their own nameserver and put in a bridge.
JG: Not sure on the HEPiX working group position on user instantiation. Issues around vetoing of images by sites and what the authentication model is….
TC: They have no view on how the image becomes instantiated. In StratusLab model the user can ask for a … user submits jobs to experiment framework… is the EC2 model or the dynamic instantiation the better way. The working group is agnostic on what method is used.
RW: There is no DMZ at a site – it will just run at the site. There will be different network areas.
IB:JG: It is up to the site.
CG: You may get different authentication if inside the site.
-much of discussion missed here-
TC: To some extent we are in danger of violently agreeing. Is it part of contextualisation or what? Getting it into the batch system is not part of the contextualisation. If you cut out the batch job then contextualisation is the place to do that… you have to get some certificate in at some point. The HEPiX WG is agnostic on this and the sites need to decide.
PC: I still do not understand, if we have this why we need a batch system.
IB: We won’t. There is clearly a longer term discussion here about whether we need batch systems in this context.
CG: The temptation to get rid of batch system and resource allocation control from sites is dangerous. If you allocate statically then there is less flexibility.
IB: Within the VM allocation there is already a scheduler. There has to be dynamic provisioning of resources to VOs. You can not live with a static allocation.
CG: Maybe this is my ignorance. Is there anything about resource allocation at a higher level?
JG: Not sure if there is a clear summary. IB is going to this cloud and virtualisation workshop in Amsterdam but there is no one voice here to represent. Perhaps need to sit down with Tony just to recap.
 
DPM Nagios & Puppet (Ricardo Rocha)
First part of talk is about monitoring and second on configuration management using puppet.
JT: Clarification. This exploration with puppet will not affect support for YAIM and Quattor – is that correct?
RR: Yes. We still use YAIM.
 
MS: I almost feared that everyone would focus on the puppet part. There is the infrastructure side which should be made aware of the useful good monitoring.
JG: You mentioned some other alternatives. CERN use Quattor.
RR: MJ provides Quattor templates but the concern is that it is not lightweight enough for many Tier-2s.
?? Comment. There are plenty of inactive Nagios probes that you may plug into… for example LFC.  They are not part of the profiles. My question, how is the Nagios part to be implemented?
RR: Developed on our testbed. Distributed at the moment with an rpm. The way we put this to the sites needs to be discussed so that we include it with the current probe deployment methods.
JG:  The Nagios running for the country doesn’t allow other things to be loaded on top.
 
Next meeting 8th June. Remember to register for the WLCG workshop in Hamburg.
 
Meeting closed at 17:00
 
EVO chat:
 
[08:57:23] Gonzalo Merino yes, we hear fine
[09:02:59] CERN 31-3-004 joined
[09:03:04] Oxana Smirnova joined
[09:03:05] Michel Jouvin joined
[09:03:06] Gonzalo Merino joined
[09:06:59] Andrew Washbrook joined
[09:07:26] Jeff Templon joined
[09:07:52] Richard Gokieli joined
[09:12:45] Phone Bridge joined
[09:16:36] Phone Bridge joined
[09:20:45] Jeff Templon can somebody bring some to amsterdam tomorrow??
[09:20:48] Jeff Templon stickers
[09:21:11] Wahid Bhimji joined
[09:21:57] Jeff Templon It's not just the last eight years .... she also hepled put together the first EDG release in 2001!!
[09:23:36] Yannick Patois joined
[09:24:31] Stephen Burke joined
[09:25:19] Catalin Condurache joined
[09:25:53] Andrew Elwell joined
[09:26:24] Mario David joined
[09:26:30] Jeremy Coles Hi Michel. John wonders if we can hear you typing? If so please could you mute? Many thanks. If not you then could everyone else please check their session is muted. Thanks.
[09:26:51] Phone Bridge left
[09:30:55] Stephen Burke got to go, clashing meeting ...
[09:31:00] Stephen Burke left
[09:34:16] Jeff Templon the idea is to completely get rid of that attribute; not needed if sites "install" software via cvmfs
[09:37:07] Jeff Templon we still need to restart the BDII every six days or so, even with openldap 2.4
[09:39:24] Tiziana Ferrari joined
[09:45:25] Jeff Templon let me know when i can speak 
[09:45:41] Oliver Keeble joined
[09:46:19] Jeff Templon give him a mike
[09:47:01] Jeff Templon can't really hear the speaker
[09:50:46] Mario David Jeff, do you happen to see steady increase of the swap mem along time, with topbdii and ldap2.4?
[09:51:15] Jeff Templon no mario, see the plot immediately above which is the memory. it's not swapping
[09:52:01] Jeff Templon i just showed a week plot ... it is barely swapping just before it goes catatonic
[09:52:02] Mario David the mem stays leveled, but we saw swap increase (vm 4GB and 2cores)
[09:53:00] Mario David ok, I saw your plots but in small scale, so couldn't figure out
[09:53:29] Jeff Templon here is a day plot, showing what happened immediately around the time of the event.
[09:54:02] Mario David i see!!!
[09:58:33] Jeff Templon the mike is breaking up
[10:00:33] Mario David Ale: presently some storm sites are correctly publishing the TOTAL in the info system
[10:00:47] Mario David not the installed
[10:01:42] Oliver Keeble left
[10:01:43] Mario David also presently the info in srm and info system comes from the same place
[10:01:50] Jeff Templon i don't think we (NL-T1) published total=installed
[10:02:42] Jeff Templon for example do you use info on number of jobs running, free slots, ERT, etc????
[10:03:31] Mario David and also the ce's endpoints (the job manager endpoint)??
[10:13:34] Wahid Bhimji left
[10:13:51] Mario David completely agree with you !
[10:14:35] Massimo Sgaravatto joined
[10:16:40] Jeff Templon it's true ... if you misconfigure the dynamic info provider it can give you horribly wrong answer. not sure what this normalization he's referring to.
[10:19:00] Jeff Templon the most typical error is that the dyn scheduler conf file does not have the correct mapping of groups to FQANs
[10:19:44] Jeff Templon so it says for group X there are no waiting jobs ... so WMS submits more .... even though there are thousands waiting
[10:24:40] Jeff Templon the rank expression is (historically at least) rather strange, because they tend to rank "free CPUs" rather heavily whereas what they should look at is headroom.
[10:28:01] Phone Bridge left
[10:40:58] Stephen Burke joined
[10:44:50] Jeremy Coles John is trying to find resolbe a display issue in the room.
[10:44:59] Jeremy Coles resolbe = resolve
[10:45:06] Stephen Burke where are we on the agenda?
[10:45:11] Jeremy Coles EOS
[10:45:55] Stephen Burke Were there any major questions on the info system?
[10:46:25] Jeff Templon can't really hear Dirk
[10:51:07] Jeff Templon flaky connection on your microphone
[10:51:32] Jeremy Coles Hi Stephen. Not really. Main point was that WLCG needs to know what the experiments need from the information system in the future. Talks today looked mainly at the situation now.
[10:55:02] Wahid Bhimji joined
[11:06:10] Jorge Gomes joined
[11:07:03] Joao pina joined
[11:14:25] Christopher Walker joined
[11:18:17] Brian Davies joined
[11:33:07] Stephen Burke Some sites already have ...
[11:38:56] Brian Davies left
[11:39:29] Oscar Koeroo joined
[11:39:45] Stephen Burke As long as jobs are happily running through CREAM I don't see that turning off the LCG CE will disrupt that
[11:40:54] Gonzalo Merino left
[11:40:58] Massimo Sgaravatto left
[11:41:00] Andrew Washbrook left
[11:41:04] Jeff Templon left
[11:41:04] Oscar Koeroo left
[11:41:15] Yannick Patois left
 
[11:41:59] CERN 31-3-004 stopped for lunch. starting again at 1400 CERN time.
[11:42:11] Oxana Smirnova left
[11:42:47] Jorge Gomes left
[12:48:12] Andrew Elwell left
[12:48:42] Paolo Veronesi joined
[12:49:00] Jorge Gomes joined
[12:55:00] Gonzalo Merino joined
[13:01:03] CERN 31-3-004 Will start soon. First speaker not yet arrived.
[13:01:24] Jeff Templon joined
[13:01:39] peter solagna joined
[13:01:54] Massimo Sgaravatto joined
[13:01:56] Joao pina left
[13:02:57] Massimo Sgaravatto left
[13:03:02] IN2P3-LAL4 left
[13:03:02] IN2P3-LAL4 joined
[13:03:02] IN2P3-LAL4 left
[13:03:02] IN2P3-LAL4 joined
[13:03:09] IN2P3-LAL4 joined
[13:04:31] Massimo Sgaravatto joined
[13:06:58] Joao pina joined
[13:07:47] Tiziana Ferrari EGI also coordinates staged rollout of EMI 1.0
[13:08:40] Yannick Patois joined
[13:11:30] Francesco Giacomini joined
[13:13:57] Phone Bridge joined
[13:15:09] Oscar Koeroo joined
[13:16:33] Phone Bridge left
[13:17:25] Alberto Aimar joined
[13:27:17] Jim Shank joined
[13:30:34] Stephen Burke Definitely not on Friday 13th 
[13:31:44] bob jones joined
[13:31:58] Goncalo Borges joined
[13:38:17] Andrea Ceccanti joined
[13:40:48] Andrew Elwell joined
[13:50:32] peter solagna left
[13:54:07] Michel Jouvin May I make a comment
[14:01:01] Alvaro Fernandez joined
[14:06:02] Oscar Koeroo Lots of useful security features in RHEL 6 compatible kernels too
[14:06:22] Mario David left
[14:06:38] Andrea Ceccanti left
[14:07:19] Francesco Giacomini left
[14:08:08] Massimo Sgaravatto left
[14:11:14] Joao pina left
[14:18:55] Davide Salomoni joined
[14:28:08] Jeff Templon there may be even more views 
[14:29:14] Michel Jouvin May I make a comment?
[14:35:27] Phone Bridge joined
[14:38:20] Phone Bridge joined
[14:38:38] Phone Bridge joined
[14:38:38] Phone Bridge left
[14:39:19] Gonzalo Merino left
[14:39:29] Gonzalo Merino joined
[14:39:55] Phone Bridge left
[14:41:39] Alberto Aimar left
[14:41:40] Catalin Condurache left
[14:42:42] Stephen Burke is anybody out there? 
[14:42:55] Tiziana Ferrari I am but without video and audio from cern
[14:44:38] Alberto Aimar joined
[14:44:49] Jeremy Coles Others remain connected. I can still see audio working (I'm on EVOEU_CH). Try rejoining?
[14:45:18] Jeff Templon it all works fine here at Nikhef ...
[14:45:29] Jeff Templon i can hear Simone fine
[14:45:47] Stephen Burke it came back a couple of minutes ago
[14:45:51] Jeff Templon question for Simone : is the requirement for "/opt/atlas" already removed?
[14:46:22] Stephen Burke we were away for about 5 minutes - evo partition?
[14:48:20] Jeremy Coles Jeff there is no EVO display in the room. But if we do not hear you I'll ask.
[14:51:46] Jeff Templon jeremy it's hard to know when to jump in
[14:51:59] Jeff Templon so you can ask
[14:52:02] Stephen Burke has that ever stopped you? 
[14:52:13] Jeff Templon tell them the CVMFS for atlas is already set up here
[14:52:25] Jeff Templon months ago
[14:52:36] Jeff Templon just we did not want to use /opt/atlas for the location
[14:52:51] Jeff Templon they needed to make it relocatable
[14:52:57] Alberto Aimar left
[14:53:06] Jim Shank left
[14:53:17] Jim Shank joined
[14:55:09] Jim Shank left
[14:56:48] Alvaro Fernandez left
[14:56:48] Alvaro Fernandez joined
[14:57:50] Jim Shank joined
[15:03:32] Davide Salomoni left
[15:07:59] Jorge Gomes left
[15:08:51] Jorge Gomes joined
[15:10:25] Gonzalo Merino left
[15:10:32] Gonzalo Merino joined
[15:10:33] Jeff Templon cern disappeared??
[15:10:34] Jeremy Coles John is trying to reconnect
[15:11:00] Gonzalo Merino ok
[15:11:15] CERN 31-3-004 left
[15:11:21] Andrew Elwell not all of cern - just the IT auditorium
[15:11:21] Gonzalo Merino back now
[15:11:22] CERN 31-3-004 joined
[15:23:05] CERN VCROC joined
[15:23:37] CERN VCROC left
[15:24:31] Tiziana Ferrari left
[15:26:07] CERN VCROC Hello Jeremy, vc support here. The audio seems ok now. Do you agree? John.
[15:27:16] Jeremy Coles Hi John. We are no longer using the microphone that has problems. The portable/audience mics are now being used by the speakers. That works but somebody may still want to look at the clip-mic.
[15:29:08] Michel Jouvin May I make a comment?
[15:33:14] Michel Jouvin Is it possible to make a comment ...
[15:33:24] Jeremy Coles Your audio is not very loud Michel.
[15:38:09] Jeff Templon agreed!!
[15:42:16] Wahid Bhimji left
[15:43:30] Wahid Bhimji joined
[15:58:42] Michel Jouvin I think dpm email list is the right place where to announce it
[15:58:53] Wahid Bhimji if you package it somehow with DPM that would help people use
[15:59:02] Wahid Bhimji (for nagios)
[15:59:10] Wahid Bhimji many uk sites use cfengine
[15:59:11] Jeff Templon it's true
[15:59:17] Jeff Templon quattor is not for the faint of heart
[16:03:22] Gonzalo Merino left
[16:03:35] Wahid Bhimji left
[16:03:44] Paolo Veronesi left
[16:04:02] Alvaro Fernandez left
[16:04:12] Jim Shank left
[16:04:17] bob jones left
[16:04:18] Michel Jouvin left
 

    • 10:00 12:30
      Morning
      • 10:00
        Introduction 20m
        Speaker: Dr John Gordon (STFC-RAL)
        Slides
      • 10:20
        The WLCG Information System 1h
        • Update on the WLCG Information System 20m
          Speaker: Lorenzo Dini (CERN)
          Slides
        • ATLAS 10m
          Speaker: Alessandro Di Girolamo (CERN)
          Slides
        • ALICE 10m
          Speaker: Latchezar Betev (CERN)
        • CMS 10m
          Speaker: Ian Fisk (Fermi National Accelerator Laboratory (FNAL))
          Slides
      • 11:20
        HEPiX 15m
        Report from the recent HEPiX Spring 2011 meeting
        Slides
      • 11:35
        EOS 20m
        Speaker: Dirk Duellmann (CERN)
        Slides
      • 11:55
        Current Middleware Issues 30m
        CREAM, gLExec, interim gLite releases and retirements.
        Speakers: Dr John Gordon (STFC-RAL) , Maarten Litmaath (CERN)
    • 12:30 14:00
      Lunch 1h 30m
    • 14:00 15:00
      Software
      • 14:00
        WLCG Middleware Support 10m
        Speaker: Dr Markus Schulz (CERN)
        Slides
      • 14:10
        The EMI-1 Release 30m
        Speaker: Doina Cristina Aiftimiei
        Slides
      • 14:40
        Discussion 20m
        Speaker: Dr Markus Schulz (CERN)
    • 15:00 16:10
      Virtualisation and Clouds
      • 15:00
        HEPiX Virtualisation Working Group 20m
        Speaker: Tony Cass (CERN)
        Slides
      • 15:20
        Experiment Positions 30m
        What are the current positions of the experiments on virtualisation and the use of clouds, public and/or private?
        • ALICE 10m
          Speaker: Latchezar Betev (CERN)
          Slides
        • ATLAS 10m
          Speaker: Simone Campana (CERN/IT/GS)
          Slides
        • CMS 10m
          Speaker: Claudio Grandi (INFN Bologna)
          Slides
        • LHCb 10m
          Speaker: Philippe Charpentier (CERN)
          Slides
      • 15:50
        WLCG Position Summary 20m
        Agree on input to the EGI workshop
    • 16:10 16:40
      DPM Monitoring and Fabric Management
      Convener: Ricardo Brito Da Rocha (CERN)
      Slides