PPS Pilot Follow-up Meeting Minutes Tue 13 Jan 2009
- Date: Tue 13 Jan 2009
- Agenda: 47118
- Description: pilot of CREAM CE: check-point
- Chair: Antonio Retico
Attendance
- PPS: Antonio Retico
- CMS: Andrea Sciaba'
- Alice: Apologies
- PADOVA: Massimo Sgaravatto
- FZK: Angela Poschlad
- JRA1/Cream/WMS: Massimo Sgaravatto
- SA3: Alessio Gianelle
Review of action items (tasks)
Status of the subtasks of TASK:7981 (see them in the PPS tracker).
Notes:
Status and results of the pilot service (by VOs and sites)
Antonio: In reply to the point Massimo raised during the last check-point meeting (scarce participation of sites in the pilot), SA1 has sent a request to all ROCs to install at least one CREAM CE in their region. This will have the combined benefit of a) extending the pilot, b) gaining experience and c) responding to Alice's demand for more sites in production. To gain on all these fronts, the ROCs were recommended to install the pilot version of the software. In fact the version currently released to production, as well as the one currently in PPS, suffers from well-known problems with ICE submission, many of which have been addressed in the pilot version.
Antonio stressed that, as several production sites are going to use the software from the pilot repository, it is really important that the versions delivered there are stable and that the documentation is complete and in order.
PADOVA
Massimo: set-up already upgraded to the latest version
FZK
Angela: one CREAM in production with the production version. A node running the "pilot" version has to be upgraded within this week.
Development, Test
Massimo: a new version of CREAM was released today to the pilot, fixing BUG:45437 and BUG:45736, which were found to be major problems in the latest tests.
Massimo reported on the results of the stress tests done by SA3 personnel in PADOVA. A submission rate of 40-45 jobs/min was applied to the ICE WMS. The failure rate is still higher than desirable. (The detailed info below was sent off-line.)
Test starts at Wed Jan 7 16:01:32 CET 2009 (WMS: devel18)
Description:
- 7200 collections each of 40 jobs
- One collection every 60 seconds
- Used the CEs of testbedB (PD+CNAF) plus cream-12.pd.infn.it
- Used automatic-delegation and proxy renewal service
- Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
- The job is a "sleep 313"
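As a quick sanity check on the parameters above, the nominal load can be derived from the collection size and the submission interval. This is only an illustrative sketch: the variable names are made up here and are not taken from the actual test scripts.

```python
# Nominal load implied by the test description (values copied from the list above).
collections_planned = 7200   # collections in the test plan
jobs_per_collection = 40     # jobs in each collection
interval_s = 60              # one collection submitted every 60 seconds

# Jobs submitted per minute if the schedule is kept exactly.
jobs_per_minute = jobs_per_collection * 60 / interval_s

# Total jobs and wall-clock duration if the full plan runs to completion.
total_jobs_planned = collections_planned * jobs_per_collection
planned_duration_h = collections_planned * interval_s / 3600

print(f"nominal rate: {jobs_per_minute:.0f} jobs/min")
print(f"planned jobs: {total_jobs_planned}")
print(f"planned duration: {planned_duration_h:.0f} h")
```

The nominal 40 jobs/min sits at the lower end of the 40-45 jobs/min reported for the test.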
Results taken at Mon Jan 12 12:52:56 CET 2009
- Collections correctly submitted: 3733 (149320 jobs)
- DONE OK: 144004 (96.44%)
- ABORTED: 446 (0.3%)
- Not finished: 4870 (3.26%)
The numbers above were obtained with resubmission on (retrycount=2, shallowretrycount=3). They may be slightly polluted by the fact that 3 of the CEs had a configuration problem with LSF.
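The percentages in the results above can be cross-checked from the raw counts. The snippet below is a minimal sketch for that purpose; the names are illustrative and the only inputs are the figures quoted in these minutes.

```python
# Cross-check of the reported result percentages (counts copied from the results above).
jobs_per_collection = 40
collections_submitted = 3733
counts = {"DONE OK": 144004, "ABORTED": 446, "Not finished": 4870}

# Total jobs actually submitted; the three states must account for all of them.
total_jobs = collections_submitted * jobs_per_collection
assert sum(counts.values()) == total_jobs

# Share of each final state as a percentage of all submitted jobs.
shares = {state: 100 * n / total_jobs for state, n in counts.items()}
for state, pct in shares.items():
    print(f"{state}: {counts[state]} ({pct:.2f}%)")
```

The three states add up to the 149320 submitted jobs, and the computed shares agree with the reported 96.44%, 0.3% and 3.26%.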
After this test two issues were found in CREAM and reported as bugs:
- #45437 https://savannah.cern.ch/bugs/?45437 ("too many open files" exception raised by the job purger), affecting CREAM
- #45437 https://savannah.cern.ch/bugs/?45437 ("problems in case of resubmission to the same CE"), affecting the ICE+CREAM chain
CMS
Andrea asked for an update on the status of the pilot activity, specifically on the proxy renewal issue and the submission to CREAM through CondorG.
Massimo replied: The WMS used for submission in the pilot has still not been delivered to certification. It will be released as an add-on to the WMS with PATCH:2459. The version of WMS currently in PPS (PATCH:1841) supports submission to CREAM, but there are known performance issues.
The workaround for the proxy renewal issue on WNs was delivered to certification with PATCH:2669 and PATCH:2667. These patches are still in certification (they have been for a month now). The mechanism was tested on the pilot, however, and hasn't shown any issues. (Antonio will contact the certification team and the EMT to understand whether the certification of these two patches can be accelerated.)
The submission via CondorG was tried about one month ago by CMS users in Wisconsin, who were able to submit to CEs in Padova. No further news has been received.
Antonio asked what CMS' plans are for the activity on the pilot in the near future.
Andrea: The person in charge of the activity on the pilot has left CERN. I have to verify whether he will continue working on it or not. We would however prefer to re-start sending production jobs only after whatever is causing the high failure rate is fixed.
Massimo agrees and will let Andrea know as soon as the system looks more stable.
Antonio: it is important to avoid deadlocks. Is the testing currently done on the infrastructure enough to understand and fix the problem?
Massimo: We can manage the testing of ICE ourselves in this phase. On the other hand, we would like to see some new testing of the CLI done by Alice. Especially if this version is going to be used by the ROCs for production, it is important to verify that direct submission is still fully functional.
Antonio reported this point to Patricia Mendez (Alice) off-line. Alice is available to re-start testing as early as next week, once the pilot version at FZK is functional. In fact, for the time being this is the only site in the infrastructure that supports Alice and offers the correct layout for Alice testing.
Status and results of the development (by developers)
Open Issues (by VOs, sites, deployment teams)
List of Open bugs and relevant decisions
Recommendations for release and deployment
Antonio: Production sites are now being requested to start CREAM. In the medium term we should make sure that a working and stable version of ICE-CREAM gets to production, because I don't think that all of the sites will accept running the pilot version.
Decision about termination/extension of the pilot
In consideration of the issues currently under analysis, the decision was made to extend the pilot in the current configuration until mid-March.
AOB