PPS Pilot Follow-up Meeting Minutes Wed 18 Feb 2009

  • Date: Wed 18 Feb 2009
  • Agenda: 52236
  • Description: Pilot of CREAM CE: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico

  • PIC: Raquel Munoz, Christian Neissner
  • FZK: Angela Poschlad
  • CNAF: Daniele Cesini
  • PADOVA: Massimo Sgaravatto

  • CMS: Absent
  • Alice: Patricia Mendez

  • JRA1/Cream/WMS: Massimo Sgaravatto
  • SA1: Nick Thackray

Notes: Christian and Massimo complained about the poor audio quality. The Lucent conference system will be used next time.

Review of action items (tasks)

Status of the subtasks of TASK:7981 (see them in the PPS tracker).

Not covered

Status and results of the pilot service (by VOs and sites)

Antonio gave a quick update on the deployment of CREAM CEs on the production grid. There are now 13 CEs available in production. RAL publishes 4 different CEs. They are mostly running the production version of CREAM.

The first results of the SAM tests over the PPS are now available on a testing system at
http://tinyurl.com/ctwfaz

PIC (Christian): The two CREAM CEs we set up according to the developers' requirements (2 GB of RAM) are published in the PPS BDII.

It was clarified that the desired configuration is a special queue published in production with GlueCEStateStatus="TestbedB".

FZK (Angela): Nothing to report. In particular, the only users accessing CREAM were ops and Patricia for Alice.

Antonio observed that there may be a misunderstanding with CMS. They were supposed to wait for a green light from the pilot (specifically PIC and CNAF). As CNAF confirmed that their queue was ready, he assumed that CMS would have started. As that is not the case, and in consideration of what is happening in production now (installation of several CREAM CEs), it is probably worth re-thinking the layout of the pilot so that the two deployment activities converge on a common goal. This is discussed after the reports from the developers.

Alice (Patricia): Currently Alice has 8 supporting sites with CREAM installed (between production and PPS), and only 5 of them work correctly (Alice is in contact with RAL and CNAF to fix the problems). They would like more T1s with CREAM, and having it at CERN is urgent.

Antonio confirmed that the installation of CREAM at CERN-PROD is in progress, although a timeline is not available yet.

Nick pointed out that things could be slowed down a bit by the fact that the previous administrator (Ulrich) is phasing out and the replacement may need time to get up to speed.

Antonio mentioned the concurrent activity of installing a special WMS for Alice and asked whether it would be possible to set a priority between the two set-ups.

Patricia replied that both activities are important for Alice and they should be done in parallel (50-50). Alice is ready to use whatever solution comes first.

Status and results of the development (by developers)

Massimo reported an issue with an external plugin that was changed in a non-backward-compatible way. That led to a lot of problems, which were solved by the new tag released last week. The new tag released to the pilot concerns mostly BLAH, and in particular a memory leak in the BL Parser.

Concerning the scalability problems observed when several thousand jobs are active in the system, we found another problem related to how the proxy renewal daemon manages the proxy. We are addressing this issue, but the fix is quite big and needs testing.

So from the last test on CREAM we can say that we are close to being ready, while for ICE we are still experiencing problems that need to be addressed. We think that CMS can start testing, and I have provided Danilo and Daniele with instructions on how to update the WMS at CNAF. For the scalability tests we need to use the CEs at CNAF (lsf) and PADOVA, so CMS should use the ones at PIC and CNAF (pbs) as previously agreed.

Open Issues (by VOs, sites, deployment teams)

Antonio made an alternative proposal: currently we are suggesting that CMS use only the queues at PIC and the pbs queues at CNAF. At the same time there are production sites that are installing a version of CREAM which we know not to interact well with ICE. I would propose to make better use of this SA1 drive to install CREAM, and to make available in production as soon as possible a version of CREAM that makes the newly installed production resources suitable to be targeted by our ICE WMS.

The certification of these patches should be done quickly, and they should be released in production asap.

Next week PATCH:1841 is going to production with the first version of ICE, which we know to be underperforming. Having a version of ICE and CREAM that makes the ICE-CREAM workflow work better (providing at the same time the functionality that Alice needs) therefore seems to be a good idea, and for sure cannot make things worse than they are now. The sites as well would have a different perception of the work they are doing, as we would be asking them to deploy something that actually makes sense to use.

Ideally the release of ICE (PATCH:2459) could be delayed. There is added value also

So the developers should wrap up PATCH:2748, specifically with all the fixes created before the new issue with proxy renewal came up, and deliver it to certification. The fixes would be deployed in production very quickly. This would make the sites working in the pilot happier, because there would be fewer bureaucratic difficulties in having the software published in production, and it would allow production sites that so far did not want to become part of the pilot to be nevertheless useful as targets for ICE.

As a first reaction, Nick and Patricia agreed with the idea.

Massimo: we wanted to finalise the tests of PATCH:2748 and PATCH:2459 first. You are now proposing to release them right away.

Antonio: it's my advice, and probably also the ITR team's advice, that there is no risk for the production system in deploying it, provided that we (the pilot) can confirm that what we have now is better than what is deployed in production: CREAM is a new service for production, and in two weeks there will be a poorly functional version of ICE there. I think it would be a good message for the sites that have now installed CREAM, who would not feel that their work was badly spent installing a version which we already know to have flaws.

Massimo: what we see is that a patch now takes at least two months to go to production. How long would the certification/PPS phase take in this case?

Antonio: the patches would get the stamp of a quick certification, pass through the PPS (which I consider to be done) and be deployed. I talked with the certifiers and they generally agree with this plan. The patches wouldn't go through the stress testing that you have already done.

Massimo: for CREAM I don't see a problem. ICE is better than the one in PATCH:1841, but people must be aware that there are still scalability problems that we are fixing, if that is fine for everyone.

Antonio: I think that it is in any case a good idea to release a better version of ICE, because there is at least the SAM user which could profit from it, although it is not relevant for Alice.

Massimo: I need to discuss with the developers, but as far as I am concerned we can proceed this way.

Nick: would the scalability issues we are talking about affect the submission to lcg-CEs as well?

Massimo: no, they are only related to ICE-CREAM.

Recommendations for release and deployment

Draft timeline:

  • Release of PATCH:2748 and PATCH:2459 to certification: beginning of the week 23-27/2
  • Release from certification to PPS (verified off-line): order of 2 weeks
  • Release to production: order of 1 week
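As a rough illustration of the draft timeline above, the arithmetic can be sketched as follows. This is only a sketch under stated assumptions: the start date is taken to be Monday 23 Feb 2009 (the first day of the quoted week), and the "order of" durations are treated as exact. The function name and dictionary keys are hypothetical and not part of any PPS tooling.

```python
from datetime import date, timedelta


def draft_timeline(cert_start, weeks_to_pps=2, weeks_to_production=1):
    """Return the approximate milestone dates of the draft timeline.

    cert_start: assumed date on which the patches enter certification.
    weeks_to_pps: "order of 2 weeks" from certification to PPS (treated as exact).
    weeks_to_production: "order of 1 week" from PPS to production (treated as exact).
    """
    to_pps = cert_start + timedelta(weeks=weeks_to_pps)
    to_production = to_pps + timedelta(weeks=weeks_to_production)
    return {
        "certification": cert_start,
        "pps": to_pps,
        "production": to_production,
    }


# Assumption: certification starts on Monday of the week 23-27/2.
timeline = draft_timeline(date(2009, 2, 23))
for milestone, day in timeline.items():
    print(f"{milestone}: {day.isoformat()}")
```

Under these assumptions the release to production would fall about three weeks after the start of certification, which is in line with the four-week horizon mentioned in the next section.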

Decision about termination/extension of the pilot

Within four weeks we could be in a position to release some of the sites involved in this pilot. PADOVA and perhaps CNAF will likely continue to be used for pre-certification.

Antonio will talk individually to the sites to set up a convenient agenda.

A wrap-up meeting is called for the 13th of March.

AOB


Topic revision: r1 - 2009-02-18 - AntonioRetico