PPS Pilot Follow-up Meeting Minutes Tue 13 Jan 2009

  • Date: Tue 13 Jan 2009
  • Agenda: 47118
  • Description: pilot of Cream CE: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico
  • CMS: Andrea Sciaba'
  • Alice: Apologies
  • PADOVA: Massimo Sgaravatto
  • FZK: Angela Poschlad
  • JRA1/Cream/WMS: Massimo Sgaravatto
  • SA3: Alessio Gianelle

Review of action items (tasks)

Status of the subtasks of TASK:7981 (see them in the PPS tracker).

Notes:

Status and results of the pilot service (by VOs and sites)

Antonio: In reply to the point Massimo raised at the last check-point meeting (scarce participation of sites in the pilot), SA1 has asked all ROCs to install at least one CREAM CE in their region. This will have the combined benefit of a) extending the pilot, b) gaining experience and c) responding to Alice's demand for more sites in production. To gain on all these fronts, the ROCs were recommended to install the pilot version of the software. The version currently released to production, as well as the one currently in PPS, suffers from well-known problems with ICE submission, many of which have been addressed in the pilot version.

Antonio stressed that, as several production sites are going to use the software from the pilot repository, it is really important that the versions delivered there are stable and that the documentation is complete and in order.

PADOVA

Massimo: set-up already upgraded to the latest version

FZK

Angela: one CREAM in production with the production version. A node running the "pilot" version has to be upgraded this week.

Development, Test

Massimo: a new version of CREAM was released to the pilot today, fixing BUG:45437 and BUG:45736, which the latest tests found to be major problems.

Massimo reported on the results of the stress tests done by SA3 personnel in PADOVA. A submission rate of 40-45 jobs/min was applied to the ICE WMS. The failure rate is still higher than desirable. (The detailed info below was sent off-line.)


Test started at Wed Jan 7 16:01:32 CET 2009 (WMS: devel18). Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Used the CEs of testbedB (PD+CNAF) plus cream-12.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
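As a rough cross-check of the test description, the nominal submission rate and the planned length of the run can be derived from the figures listed above. This is only an illustrative sketch: the variable and function names are hypothetical and are not part of any SA3 test tooling.

```python
from datetime import timedelta

# Figures copied from the test description above.
COLLECTIONS_PLANNED = 7200
JOBS_PER_COLLECTION = 40
SUBMIT_INTERVAL = timedelta(seconds=60)
PROXY_LIFETIME = timedelta(hours=5)
PROXY_RENEWAL_PERIOD = timedelta(hours=4)


def nominal_rate_per_minute(jobs_per_collection, interval):
    """Jobs submitted per minute when one collection is sent every `interval`."""
    return jobs_per_collection * 60 / interval.total_seconds()


rate = nominal_rate_per_minute(JOBS_PER_COLLECTION, SUBMIT_INTERVAL)
planned_duration = COLLECTIONS_PLANNED * SUBMIT_INTERVAL
planned_jobs = COLLECTIONS_PLANNED * JOBS_PER_COLLECTION
# Time left on the proxy at the moment it is renewed.
renewal_margin = PROXY_LIFETIME - PROXY_RENEWAL_PERIOD

print(f"nominal rate: {rate:.0f} jobs/min")
print(f"planned duration: {planned_duration}")
print(f"planned jobs: {planned_jobs}")
print(f"proxy margin at renewal: {renewal_margin}")
```

With these inputs the nominal rate is 40 jobs/min, the lower end of the 40-45 jobs/min quoted in the minutes, and the full plan of 7200 collections corresponds to 5 days of submission.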

Results taken at Mon Jan 12 12:52:56 CET 2009

  • Collections correctly submitted: 3733 (149320 jobs)
    • DONE OK: 144004 (96.44%)
    • ABORTED: 446 (0.3%)
    • Not finished: 4870 (3.26%)
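The percentages above can be re-derived from the raw counts. The following is a minimal sketch using only the numbers reported in these minutes; the helper name `shares` is hypothetical.

```python
# Counts copied from the results above.
JOBS_PER_COLLECTION = 40
collections_submitted = 3733
outcomes = {"DONE OK": 144004, "ABORTED": 446, "Not finished": 4870}

total_jobs = collections_submitted * JOBS_PER_COLLECTION

# The three outcome classes should account for every submitted job.
assert sum(outcomes.values()) == total_jobs


def shares(counts, total):
    """Return each count as a percentage of `total`, rounded to 2 decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


percentages = shares(outcomes, total_jobs)
for name, pct in percentages.items():
    print(f"{name}: {outcomes[name]} ({pct:.2f}%)")
```

The three classes add up to 149320 jobs, matching 3733 collections of 40 jobs each, and the computed shares agree with the reported 96.44%, 0.3% and 3.26%.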

The numbers above were obtained with resubmission on (retrycount=2, shallowretrycount=3). They may be slightly polluted by the fact that 3 of the CEs had a configuration problem with LSF.

After this test two issues were found in CREAM and reported as bugs:

  • #45437 https://savannah.cern.ch/bugs/?45437 ("too many open files" exception raised by the job purger), affecting CREAM
  • #45437 https://savannah.cern.ch/bugs/?45437 ("problems in case of resubmission to the same CE"), affecting the ICE+CREAM chain


CMS

Andrea asked for an update on the status of the pilot activity, specifically on the proxy renewal issue and the submission to CREAM through CondorG.

Massimo replies: The WMS used for submission in the pilot has not yet been delivered to certification. It will be released as an add-on to the WMS with PATCH:2459. The version of WMS currently in PPS (PATCH:1841) supports submission to CREAM but there are known performance issues.

The workaround for the proxy renewal issue on WNs was delivered to certification with PATCH:2669 and PATCH:2667. These patches are still in certification (they have been for a month now). The mechanism was tested on the pilot, however, and hasn't shown any issues. (Antonio will contact the certification team and the EMT to understand whether the certification of these two patches can be accelerated.)

Submission via CondorG was tried about one month ago by CMS users in Wisconsin, who were able to submit to CEs in Padova. No further news has been received.

Antonio asked what CMS' plans are for the activity on the pilot in the near future.

Andrea: The person in charge of the activity on the pilot has left CERN. I have to verify whether he will continue working on it or not. We would however prefer to re-start sending production jobs only after whatever is causing the high failure rate is fixed.

Massimo agrees and will let Andrea know as soon as the system looks more stable.

Antonio: it is important to avoid deadlocks. Is the testing currently done on the infrastructure enough to understand and fix the problem?

Massimo: We can manage the testing of ICE ourselves in this phase. On the other hand, we would like to see some new testing of the CLI interface done by Alice. Especially if this version is going to be used by the ROCs for production, it is important to verify that direct submission is still fully functional.

Antonio reported this point to Patricia Mendez (Alice) off-line. Alice is available to re-start testing from next week, once the pilot version at FZK is functional. For the time being this is the only site in the infrastructure which supports Alice and offers the correct layout for Alice testing.

Status and results of the development (by developers)

Open Issues (by VOs, sites, deployment teams)

List of Open bugs and relevant decisions

Recommendations for release and deployment

Antonio: Production sites are now being requested to start CREAM. In the medium term we should make sure that a working and stable version of ICE-CREAM gets to production, because I don't think that all of the sites will agree to run the pilot version.

Decision about termination/extension of the pilot

In consideration of the issues currently under analysis, the decision is made to extend the pilot in the current configuration until mid-March.

AOB


Topic revision: r2 - 2009-01-14 - AntonioRetico
 