PPS Pilot Follow-up Meeting Minutes Wed 18 Feb 2009
- Date: Wed 18 Feb 2009
- Agenda: 52236
- Description: Pilot of Cream CE: check-point
- Chair: Antonio Retico
Attendance
- PIC: Raquel Munoz, Christian Neissner
- FZK: Angela Poschlad
- CNAF: Daniele Cesini
- PADOVA: Massimo Sgaravatto
- CMS: Absent
- Alice: Patricia Mendez
- JRA1/Cream/WMS: Massimo Sgaravatto
- SA1: Nick Thackray
Notes: Christian and Massimo complained about the bad audio quality. The Lucent conference system will be used next time.
Review of action items (tasks)
Status of the subtasks of TASK:7981 (see them in the PPS tracker).
Not covered
Status and results of the pilot service (by VOs and sites)
Antonio gave a quick update on the deployment of CREAM CEs on the production grid. There are now 13 CEs available in production. RAL publishes 4 different CEs. They are mostly running the production version of CREAM.
The first results of the SAM tests over the PPS are now available on a testing system at
http://tinyurl.com/ctwfaz
PIC (Christian)
The two CREAM CEs we set up according to the requirements of the developers (2 GB of RAM) are published in the PPS BDII.
It was clarified that the desired configuration is a special queue published in production with GlueCEStateStatus="TestbedB".
FZK (Angela)
Nothing to report.
In particular, the only users accessing CREAM were ops and Patricia for Alice.
Antonio observed that there may be a misunderstanding with CMS. They were supposed to wait for a green light from the pilot (specifically PIC and CNAF). As CNAF had confirmed that their queue was ready, he had assumed that CMS had started. As that is not the case and, considering what is happening in production now (installation of several CREAM CEs), it is probably worth re-thinking the layout of the pilot so that the two deployment activities converge on a common goal. This is discussed after the reports from the developers.
Alice (Patricia)
Currently Alice has 8 supporting sites with CREAM installed (between production and PPS) and only 5 of them work correctly (they are in contact with RAL and CNAF to fix the problems). They would like more T1s with CREAM, and having it at CERN is urgent.
Antonio confirmed that the installation of CREAM at CERN-PROD is in progress, although a timeline is not available yet.
Nick pointed out that things could be slowed down a bit by the fact that the previous administrator (Ulrich) is phasing out and the replacement may need time to get up to speed.
Antonio mentioned the concurrent activity of installation of a special WMS for Alice and asked whether it would be possible eventually to identify a priority between the two set-ups
Patricia replied that both activities are important for Alice and they should be done in parallel (50-50). Alice is ready to use whatever solution comes first.
Status and results of the development (by developers)
Massimo reported an issue with an external plugin that was changed in a non-backward-compatible way. This caused a lot of problems, which were solved by the new tag released last week. The new tag released to the pilot concerns mostly BLAH and in particular a memory leak in the BL Parser.
Concerning the scalability problems observed when several thousands of jobs are active in the system, we found another problem related to how the proxy renewal daemon manages the proxy. We are addressing this issue but the fix is quite big and needs testing
So from the last test on CREAM we can say that we are close to being ready, while for ICE we are still experiencing problems that need to be addressed. We think that CMS can start testing, and I have provided Danilo and Daniele with instructions on how to update the WMS at CNAF. For the scalability tests we need to use the CEs at CNAF (lsf) and PADOVA. So CMS should use the ones at PIC and CNAF (pbs) as previously agreed.
Open Issues (by VOs, sites, deployment teams)
Antonio makes an alternative proposal: currently we are suggesting to CMS to use only the queues at PIC and the pbs queues at CNAF. At the same time there are production sites that are installing a version of CREAM which we know not to interact well with ICE. I would propose to try to make better use of this SA1 drive to install CREAM and to make available in production, as soon as possible, a version of CREAM that makes the newly installed production resources suitable to be targeted by our ICE WMS.
The certification of these patches should be done quickly and released in production asap.
Next week PATCH:1841 is going to production with the first version of ICE, which we know to be underperforming, so having a version of ICE and CREAM that makes the ICE-CREAM workflow work better (providing at the same time the functionality that Alice needs) seems to be a good idea and for sure cannot make things worse than they are now. The sites as well would have a different perception of the work they are doing, as we would be asking them to deploy something that actually makes sense to use.
Ideally the release of ICE (PATCH:2459) could be delayed. There is added value also
So the developers should wrap up PATCH:2748, specifically with all the fixes created before the new issue with proxy renewal came up, and deliver it to certification. They would be deployed in production very quickly. This would make the sites working in the pilot happier, because there would be fewer bureaucratic difficulties in getting the software published in production, and it would actually allow production sites that so far did not want to become part of the pilot at all to be useful nonetheless as targets for ICE.
As a first reaction, Nick and Patricia agreed with the idea.
Massimo: we wanted to finalise the testing of PATCH:2748 and PATCH:2459 first. You are proposing to release them now.
Antonio: it's my advice, and probably also the ITR team's advice, that if we (the pilot) can confirm that what we have now is better than what is deployed in production, then, since CREAM is a new service for production and in two weeks there will be a poorly functional version of ICE, there is no risk for the production system in deploying it. I think it would be a good message for the sites that have installed CREAM now, which would not feel that their work was badly spent on installing a version we already know to have flaws.
Massimo: what we see now is that a patch takes at least two months to go to production. How long would the certification/PPS phase take in this case?
Antonio: the patches would get the stamp of a quick certification, pass through the PPS (which I consider to be done) and be deployed. I talked with the certifiers and they generally agree with this plan. The patches wouldn't go through the stress testing that you have already done.
Massimo: for CREAM I don't see a problem. ICE is better than the one in PATCH:1841 but people must be aware that there are still scalability problems that we are fixing. If it is fine for everyone.
Antonio: I think that it is nonetheless a good idea to release a better version of ICE, because there is at least the SAM user which could benefit from it, although it is not relevant for Alice.
Massimo: I need to discuss with the developers, but as far as I am concerned we can proceed this way.
Nick: Would the scalability issues we are talking about affect the submission to lcg-CEs as well?
Massimo: no, they are only related to ICE-CREAM
Recommendations for release and deployment
Draft timeline:
Release of PATCH:2748 and PATCH:2459 to certification: beginning of the week 23-27/2
release from certification to PPS (verified off-line) : order of 2 weeks
release to production : order of 1 week
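The draft timeline above can be turned into approximate calendar dates. The following is only a rough sketch: it assumes that certification starts on Monday 23 Feb 2009 and that "order of N weeks" means exactly N weeks, neither of which was stated in the meeting. The function and variable names are purely illustrative.

```python
from datetime import date, timedelta

# Assumption: certification starts on the Monday of the week 23-27/2 (2009).
CERT_START = date(2009, 2, 23)
# Assumption: "order of N weeks" is read as exactly N weeks.
CERT_WEEKS = 2   # certification, then release to PPS
PPS_WEEKS = 1    # PPS, then release to production


def draft_timeline(start, cert_weeks, pps_weeks):
    """Return the approximate date of each release step."""
    to_pps = start + timedelta(weeks=cert_weeks)
    to_production = to_pps + timedelta(weeks=pps_weeks)
    return {
        "to_certification": start,
        "to_pps": to_pps,
        "to_production": to_production,
    }


timeline = draft_timeline(CERT_START, CERT_WEEKS, PPS_WEEKS)
for step, day in timeline.items():
    print(f"{step}: {day.isoformat()}")
```

Under these assumptions the release to production would fall around 16 March, which is within four weeks of this meeting (18 Feb).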
Decision about termination/extension of the pilot
Within four weeks we could be in a position to release some of the sites involved in this pilot.
PADOVA and perhaps CNAF will likely continue to be used for pre-certification.
Antonio will talk individually to the sites to set up a convenient agenda.
A wrap-up meeting is called for the 13th of March.
AOB