PPS all-sites meeting

Place: Europe Congress Center, Budapest, Hungary
Room: Strasbourg
Date: Wednesday 03 October 2007, 11:00

General goals of the meeting:
- Highlight PPS "hot" issues and discuss them
- Work on solutions to "live" issues
- Build consensus on PPS scope and goals
- Collect ideas for the future
- Give input to LCG management

Participants:

Chairmen: Nicholas Thackray, Antonio Retico

-- PPS sites
CERN_PPS: Antonio Retico, Steve Traylen
PPS-LIP: Mario David
INFN-CNAF: Paolo Veronesi
PreGR-02-UPATRAS: George Goulas
Taiwan-PPS: ShuTing Liao
PPS-DESY: Christoph Wissing
CESGA-PPS: Javier Lopez
SCAI-PPS: Daniel Rubin

-- PPS users
Pedro Andrade (CERN-DILIGENT) (1st session only)
Patricia Mendez Lorenzo (Alice) (2nd session only)

-- External
Markus Schulz (SA3) (2nd session only)
Owen Synge (dCache) (2nd session only)
Stephen Burke (RAL)
Jeff Templon (NIKHEF) (2nd session only)
Kostas Koumantaros (GRNET)
Cyril L'Orphelin (COD) (1st session only)

Agenda

Session 1: Operations (11.15-12.30)
- CODs and PPS monitoring
  . Short history and issues of COD monitoring in PPS (Antonio)
  . New monitoring tools: Nagios (Ian), GridMap (Max)
  . Operation alternatives: the SWE case (Mario)
- Brainstorming: PPS without the CODs
- Sum-up (decisions + input for SA1)

Session 2
- Interoperations/Interoperability (14.00-14.20)
  . Problem description (Nick)
  . Assessment: CERN_PPS <-> OSG_ITB (Andreas)
- Users (14.20-15.20)
  . Why is it so difficult to be a user of the PPS grid? (Nick)
  . Alice's use case (Patricia)
- Open discussion (input for SA1)
- Sum-up of input for the SA1 session (15.20-15.30)

Minutes:

The two sessions were mostly interactive, with frequent and welcome interruptions during the presentations.

Session 1

Introduction and agenda
Antonio gave a quick overview of the current status of the service, introduced the general goals of the meeting and presented the agenda:
http://indico.cern.ch/contributionDisplay.py?contribId=260&sessionId=33&confId=18714

CODs and PPS monitoring
Antonio gave a summary of the recent history of COD monitoring of the PPS, highlighting the question of the follow-up of PPS tickets by the CODs:
http://indico.cern.ch/materialDisplay.py?subContId=1&contribId=261&sessionId=33&materialId=slides&confId=18714

Max Boehm (EDS) gave a demonstration of GridMap (http://lxb2003.cern.ch/gm/gridmap.html), a visualization tool useful to get a high-level view of the status of the services in PPS.

Ian Neilson (CERN, Head of the Monitoring Working Group) gave a preview of a particular configuration of Nagios developed at CERN, useful to monitor site services locally. The set of configuration files is currently being packaged at CERN and a distribution with YAIM will be proposed.

Mario David (LIP) presented the model of proactive operations used in production within the SWE ROC:
http://indico.cern.ch/subContributionDisplay.py?subContId=2&contribId=261&sessionId=33&confId=18714
o At the regional level, CESGA hosts an instance of SAM and Gstat for the SWE.
o They have set up a mail notification system based on the SAM and Gstat monitoring.
o In this way, when a site in the region fails a test, the site admin receives a mail describing the failure.
o This allows the site admin to be alerted right away and to act, since the COD will only open a ticket after 3 to 5 consecutive failures.
  - It also gives the site admin time to declare a downtime if needed.
o If site admins react promptly to these alarms, the number of COD tickets stays rather small.
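Purely as an illustration of the alert flow just described (and not the actual SWE/CESGA implementation), the short Python sketch below polls a SAM results source and mails the site admin directly, ahead of the COD escalation threshold. The results URL, its response format and the site contact list are hypothetical placeholders.

#!/usr/bin/env python3
# Sketch of an SWE-style proactive alert: poll SAM results, mail the admin.
# SAM_RESULTS_URL, its line-based response format and SITE_CONTACTS are
# hypothetical placeholders, not the real SWE/CESGA interfaces.
import smtplib
import urllib.request
from email.mime.text import MIMEText

SAM_RESULTS_URL = "http://sam.example.org/results?site={site}"   # placeholder
SITE_CONTACTS = {"SOME-PPS-SITE": "grid-admin@example.org"}       # placeholder

def failing_tests(site):
    """Return names of failing tests, assuming one 'testname status' per line."""
    with urllib.request.urlopen(SAM_RESULTS_URL.format(site=site)) as resp:
        body = resp.read().decode()
    return [line.split()[0] for line in body.splitlines()
            if line.strip() and line.split()[-1].lower() != "ok"]

def notify(site, contact, failures):
    """Mail the site admin directly, before repeated failures trigger a COD ticket."""
    msg = MIMEText("SAM reports failures at {}:\n{}".format(site, "\n".join(failures)))
    msg["Subject"] = "[PPS monitoring] {}: {} failing test(s)".format(site, len(failures))
    msg["From"] = "pps-monitor@example.org"
    msg["To"] = contact
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

if __name__ == "__main__":
    for site, contact in SITE_CONTACTS.items():
        failures = failing_tests(site)
        if failures:
            notify(site, contact, failures)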
A survey was started to collect input from the PPS site administrators on the question: "Do we prefer the COD to monitor the PPS and open GGUS tickets or not?" 10 site admins participated in the survey, with the following results:
1. I prefer CODs to keep monitoring my site and opening GGUS tickets when needed - 1 vote
2. I think that PPS can build its own central monitoring - 4 votes
3. I think that central monitoring is not needed at all for PPS - 1 vote
4. I would like to be notified of problems by CODs but not to have to deal with GGUS tickets - 9 votes
5. I have an idea of how the internal support workflow currently in place (COD, TPM, ROC, etc.) works - 6 votes
6. I don't have a clear view of how the internal support workflow currently in place works - 3 votes
7. I work(ed) as COD - 3 votes
8. I work(ed) as ROC manager/deputy - 2 votes

At the end of the session, Cyril L'Orphelin (IN2P3, COD), leading the "Best Practices" working group within the COD-14 meeting (run in parallel), was invited to join the session. It was agreed that the CODs will register the alarms raised for PPS sites and send them on a weekly basis to the site administrators, the ROCs and the PPS Coordination, who will then follow them up independently.

Session 2

Interoperations/Interoperability (14.00-14.20)
Nick gave a quick introduction to the future SA1 activity in which the PPS will be involved, aimed at making sure that the current interoperability between the EGEE and OSG grids keeps working across releases:
http://indico.cern.ch/materialDisplay.py?contribId=330&sessionId=33&materialId=slides&confId=18714
A basic testbed should be put in place by the 31st of October. The basic features are already there:
- Dedicated WMS in place
- Same mechanism as in production set up for publishing OSG ITB sites in the PPS top-level BDII
The interoperability testbed should be extended from the current PPS <-> ITB to PPS/EGEE Prod <-> ITB/OSG Prod. A basic set of tests needs to be put in place and kept running. Volunteers are welcome.

Andreas Unterkircher described a set of activities carried out within the interoperability tests between CERN_PPS and sites in the OSG ITB:
http://indico.cern.ch/subContributionDisplay.py?subContId=1&contribId=330&sessionId=33&confId=18714

Users (14.20-15.20)
Antonio introduced the discussion about the poor usage of the pre-production service:
http://indico.cern.ch/materialDisplay.py?contribId=271&sessionId=33&materialId=slides&confId=18714
He first showed how the poor usage, apart from the considerable waste of effort, and independently of the recognised ability of PPS sysadmins to "see" problems, also indirectly affects the other key "feature" of the PPS: improving the quality of the gLite middleware.
The four LHC VOs had previously been asked for feedback. Both LHCb and CMS agree on manpower as the main issue: a lot of effort is needed by a VO to maintain and operate two parallel submission infrastructures in two "universes".
LHCb: the size of the PPS "by definition" does not allow problems to be spotted.
Suggestions were presented by LHCb and CMS.
LHCb put the emphasis on the distribution of clients:
- Early distribution: as soon as they are built and module-tested by the developers
- Always backward-compatible, to be tested by the VO against production services
About the services, LHCb would ask for:
- Services available in the production BDII but "flagged" as PPS
- By default not used by other production services
- CEs and SEs to see the same back-end resources as in production

CMS:
- Shares with LHCb the idea of deploying "flagged" PPS services in production
- Proposes a staged deployment to production
- Supports a "task-force" usage model:
  - very focused and on-demand bursts of activity involving a limited number of PPS service instances
  - no strict need for service continuity outside these "peaks"

**** Comments (LHCb and CMS) ****
About the early distribution of the middleware:
Markus: it is already done; maybe the VOs do not know.
Backward compatibility: between production LHC and pre-production LHC the catalogues are different. The LHC VOs want common catalogues and BDII so that it is easier to test.
Mario: the PPS has only 10% of the total resources, so the LHC VOs should put only 10% of their workload in the PPS in order not to overload it. This way they will catch most of the bugs and they will not get escalation problems. Instead of sending 100000 jobs, send 1000. Test FTS in PPS with smaller files.
Markus: many problems seen later by LHCb would have been seen just by sending one job. We have identified a lot of problems. The PPS by default keeps the production services as back-end; we then set up in PPS a randomly populated service of that kind. We will never be able to make the PPS look like production; we can only set up specific endpoints. At least we will have a full list of client tests. We need a process to force them to test the services, e.g. each experiment would have to sign off that the WNs were what they want.
*****************

Patricia Mendez Lorenzo (Alice) presented Alice's position on the usage of the PPS:
http://indico.cern.ch/materialDisplay.py?subContId=2&contribId=271&sessionId=33&materialId=slides&confId=18714
ALICE has a neutral attitude towards the PPS. Like the other VOs, Alice claims not to have the human resources to sustain a permanent duplicate infrastructure based on production and pre-production grids. Alice is prepared to use either production or pre-production for its tests in an opportunistic way.

**** Discussion ****
Some highlights, in no particular order:
Patricia: on the separate BDII question: it is easier for Alice if PPS services are available from the PROD BDII, but they have no requirement on that.
Stephen Burke: "flagging" a service as pre-production in the production BDII is not as straightforward as it seems. It is not just a question of deploying a few services in production to test; presumably a lot of changes would be needed not only in the clients but also in middleware components to comply with that (see the illustrative sketch at the end of these minutes).
Owen Synge (dCache): we don't get proper load testing in PPS.
Nick: we probably need to decide case by case the path to follow for testing each service.
Owen: PPS is being useful in a different way: trying to fix things in the PPS helps the sites that are also in PROD to solve the same problems later in PROD.
Antonio: why not use PPS also for validating operational tools (e.g. SAM, CIC Portal)?
Jeff Templon (NIKHEF) asks for useful things like deployment testing and checking the documentation.
Mario: things actually improved a lot with pre-deployment.
Stephen: job priorities went wrong twice, but they would be difficult to test in PPS.
Antonio: we cannot afford any more idle PPS resources.
Having 20 CEs running all the time is not useful if they are not used.
Constraints proposed:
- Big redundancy is not necessary in PPS; we do (strictly) only what is necessary.
- A good release procedure and good documentation of it are something we need to maintain also in the future.
Stephen: things that are not yet in PPS should not be put in PROD.
Markus: only some premium users are affected by the bugs that pass the PPS undetected because they only appear at a larger scale (PROD). Proposal: a sign-off process, so that the experiments state that a patch is fine with them and cannot complain later.
Markus: I don't want to see the dashboard showing jobs run in PPS and PROD together.

=====================
APPENDIX

Highlights from the discussion during the SA1 session "Evolution of the role of the PPS", Thursday 04 October 2007, 11.30-12.00

Nick summarised the discussion which took place during the meeting and highlighted the key points to be worked out in order to re-organise the pre-production (see slides):
http://indico.cern.ch/materialDisplay.py?contribId=22&sessionId=42&materialId=slides&confId=18714

Work needs to be done to re-organise the PPS:
- shift the emphasis of the PPS as it is implemented today towards testing the release
- usage of automated test suites
- make it easier for the experiments to test
- re-size it down to avoid wasting resources
- allow VOs to test new services in production (in a formalised way)

**** Comments ****
Mario David:
- bugs are still found after certification
- he agrees with the resizing of the PPS
- while testing we don't care about SAM results
- the VOs provide their applications as SAM tests
- the purpose of the PPS is not to test the FTS or WMS in a large environment but to find bugs
- I would be very afraid if testing were not done in the PPS
Owen:
- the multiplicity of this grid (different versions of the clients, etc.) makes it difficult to check things (you could be testing different interactions); test suites cannot cover every possible combination
- there is no alternative to getting the VOs involved in PPS and using it
John Gordon:
- the purpose of the PPS is that the VOs check that everything is right and spot problems
Markus:
- we have to work: instead of reducing the scope we have to shift it to a different one
- the only things that you can catch with the SAM tests are things that occur with a deployment scenario different from production
- for us the evolution proposed looks more like extinction
Stephen:
- we should not give up on the idea of running tests, even if not at the scale of production
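A recurring technical point, both in the users discussion and in this wrap-up, was how clients could tell "flagged" pre-production service instances apart from production ones if both were published in the production BDII. Purely as an illustration of that idea (and of Stephen Burke's caveat that clients and middleware would need changes to honour it), the Python sketch below queries a top-level BDII and skips services carrying a hypothetical "PPS" marker in GlueServiceStatusInfo; no such flag is defined in the deployed GLUE schema, and the BDII host name is a placeholder.

#!/usr/bin/env python3
# Illustrative only: query a top-level BDII for GlueService entries and skip
# those carrying a hypothetical "PPS" marker in GlueServiceStatusInfo.
# Requires the python-ldap package; the BDII host below is a placeholder.
import ldap

BDII_URI = "ldap://bdii.example.org:2170"
BASE_DN = "o=grid"

def first(attrs, name, default=b""):
    """Return the first value of an LDAP attribute as a string."""
    return attrs.get(name, [default])[0].decode()

def production_services():
    conn = ldap.initialize(BDII_URI)
    entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                            "(objectClass=GlueService)",
                            ["GlueServiceEndpoint", "GlueServiceType",
                             "GlueServiceStatusInfo"])
    services = []
    for _dn, attrs in entries:
        # Hypothetical convention: pre-production instances advertise "PPS"
        # in GlueServiceStatusInfo; a production client would skip them.
        if "PPS" in first(attrs, "GlueServiceStatusInfo"):
            continue
        services.append((first(attrs, "GlueServiceType"),
                         first(attrs, "GlueServiceEndpoint")))
    return services

if __name__ == "__main__":
    for service_type, endpoint in production_services():
        print(service_type, endpoint)

Every client and middleware component that consumes the BDII would have to agree on such a convention, which is exactly why the change is less trivial than deploying a few extra service instances.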