Deployment team

Europe/Zurich
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +41 22 76 71400. The phone bridge ID is 353584 with code: 4880.
Attendees: Jeremy (chair+minutes), Duncan, Stephen, Derek, Mingchao, Brian, Jens, Raja, Greig

11:00 Experiment problems/issues

LHCb: Raja reported that LHCb are starting the move away from DIRAC2 for production. It is hoped that Ganga developers will be available next week. DIRAC2 uses the SRMv1 endpoints and is still used for low level free construction. It is hoped to have a DIRAC3 environment ready by the end of August, with the end of September being the absolute deadline - SRMv1 instances are to be switched off on this timescale.

The second item for LHCb to mention was the UK CA changes. There was one issue with a user not being able to move smoothly - this user was the only one in the LHCb VOMS not registered with both old and new CA DNs. Another user did not receive the notice - the CERN spam filter killed the notification. The notification issue prompted a short discussion on communicating with users. There will always be some issues like this, which means straightforward emails to individuals or lists are not enough.

JC noted that Steve L's test page now incorporated the LHCb SAM test results but that the tests themselves had recently stopped running. RN explained that they were being moved to the DIRAC3 framework. JC asked about the software area corruption that had been seen at several UK sites. RN said this had built up over time - sites were not used for a long period - but the problems were being systematically discovered and resolved.

CMS: No report

ATLAS: No report

Other: There were no other VO issues arising.

11:20 ROC update

EGEE SA1 meeting today: http://indico.cern.ch/conferenceDisplay.py?confId=38432
Topics: Admin matters; Update on EGI blueprint (being re-written as a 30 page document); SLA roadmap.
JC explained that the EGI proposal was going to be rewritten as the current presentation was not deemed strong enough. Areas would also be broken out in the new document (i.e. SA1 and its role would become clearer).
WLCG update
*****************
MB was last week - nothing new to report here: http://indico.cern.ch/conferenceDisplay.py?confId=33704. Most of the MB material was covered under the GDB discussion last week. Benchmarking is probably the most relevant and urgent for T2s.

EGEE-WLCG-OSG ops meeting
******************************
Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=38629
- Releases:
  For the PPS:
  2008-07-28: glexec tests in PPS: Service available on several CEs in PPS (list available at: https://pps-private-wiki.egee.cesga.es/gocdb/user1.cgi?inputVal=40 selecting nodes at version >= gLite 3.1 PPS-update31). Still no feedback received from users.
  2008-07-28: release of gLite3.1 PPS Update34 to PPS in preparation. This update will contain:
  * DPM and LFC 1.6.11 (see details in PATCH:1987)
  * dCache 1.8.0-15p5 with new YAIM module for configuration
  For production:
  2008-07-23: release of gLite3.1 Update28 in preparation. This update, to be released on the 29th of July, will contain:
  * glite-CONDOR_utils for lcg-CE
- Discussion (our request) on WMS performance.
  -- 3.1 SL4 found to be more stable than 3.0
  -- several countries use a round robin setup for the host machines
  --> What do we need to do next?
- Current recommended storage versions: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

GC noted that most sites were now pretty well up-to-date, but there were old "additional" DPMs around that should be removed. Manchester now have a working DPM, but nobody has tried a WN distribution for DPM. There is nothing intrinsic to DPM which will prevent it, but a lack of tools for managing such a configuration may become an issue. BD suggested that setting up two SEs at the site would allow some form of replication. The move will leave just a few sites, RALPP and IC, using dCache. They are managing following the latest updates. GC's suggestion that the sites were "content" with dCache was slightly disputed by DR.
DR also commented that although UCL is not functioning smoothly at the moment, it is a small site and should not consume too much attention. JC mostly agreed but added that if a user placed data on a specific site, even a small site, then the expectation would be for good SE service levels. GC noted a few sites were recently impacted by the CA changes: RHUL, UCL, Durham and Cambridge.

Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20080728_EscalationReport_ROCs.html
Two tickets. On 35089 DR mentioned that this was awaiting a change by the fabric team at RAL. He was going to remind them after the meeting. 37185 concerned a CDF request for re-enablement at sites. The original ROC ticket had been broken into child tickets for each affected site; unfortunately it was not clear which of those was now closed. JC was to follow up.

11:35 Hardware purchase advice & sharing (05')
JC talked through the structure of the page set up to share hardware procurement information. He asked everyone to take a look and provide feedback/comments today, as this needed to be shared ASAP since most sites were about to procure new equipment. No feedback was given during the meeting and no new areas were suggested.

11:40 Follow up on quarterly reports & readiness reviews (10')
- Any further feedback on the QRs? There was no additional feedback. DR mentioned that the LondonGrid report was still not final. JC understood but said he would start compiling the overall view from the reports at the end of the week.
- The readiness review documents are here: https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html The plan was to take a quick look at some, but various access problems became apparent. DR received an access denied error. JC got an error message when trying to open the London overview report. JC to investigate.

11:50 Web-page review (05')
Starting with the high-level deployment pages http://www.gridpp.ac.uk/deployment/, JC looked at each main area.
Most of the high level pages are JC's responsibility (Overview, Status, Meetings, etc.), so he will check last updates and modify pages which are out of date. All team members need to check the Contacts page. It was noted that Mingchao did not yet appear.

Looking at the wiki areas starting from http://www.gridpp.ac.uk/wiki/Main_Page
* The T2 coordinators need to check the T2 and site entries - noted that London sites are not directly linked
* Grid services are nearly all T1 pages - DR to review
* Tier-2 support -> Experiment. T2 coordinators and experiment reps to review
* Tier-2 support -> Middleware. Andrew to check data management; Greig Storage (quite up-to-date in most areas); Batch systems - DR since many T1 entries; Update tools - is this used? AF instigated?; Virtualisation - ok; Workarounds - static now since moved on.
* PPS. Static
* VO Support. Presumably this is Sergey's area though some falls back to T2Cs.
* Security - Mingchao to review and update. MM asked about how to update the left margin links. JC suggested contacting AM but SB said he would be able to help -> MM to liaise with SB.
* Monitoring - AF for the links and AE for the Nagios part
* Availability - Static views but JC looking at feeding daily graphs onto a comment page (existing action)
* Hardware - new section
* Service challenge - prompted question about whether this old information should be archived somehow? Decided to leave it where it is - users of the pages can check the date for relevance.
* Deployment team area - some sections like issues log not used. Others ok.

11:55 Actions review (10')
Updates recorded in wiki.

12:05 AOB (05')
- UKI meeting on Thursday
  -- CA move is already on the agenda
  -- Any urgent items for discussion given LHC/experiment status? None suggested!

Meeting closed at 12:00.
Chat window content:
[10:54:58] Mingchao Ma joined EVO
[10:59:13] Derek Ross joined EVO
[11:04:59] Jens Jensen joined EVO
[11:24:41] Stephen Burke joined EVO
[11:28:46] Jens Jensen new status: Away
[11:35:01] Jens Jensen new status: Available
[12:01:58] Brian Davies left EVO
[12:02:02] Stephen Burke left EVO
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
    • 11:20 11:35
      ROC update 15m
      ROC update
      ***************
      EGEE SA1 meeting today: http://indico.cern.ch/conferenceDisplay.py?confId=38432
      Topics: Admin matters; Update on EGI blueprint (being re-written as a 30 page document); SLA roadmap

      WLCG update
      *****************
      MB was last week - nothing new to report here: http://indico.cern.ch/conferenceDisplay.py?confId=33704.

      EGEE-WLCG-OSG ops meeting
      ******************************
      Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=38629
      - Releases:
        For the PPS:
        2008-07-28: glexec tests in PPS: Service available on several CEs in PPS (list available at: https://pps-private-wiki.egee.cesga.es/gocdb/user1.cgi?inputVal=40 selecting nodes at version >= gLite 3.1 PPS-update31). Still no feedback received from users.
        2008-07-28: release of gLite3.1 PPS Update34 to PPS in preparation. This update will contain:
        * DPM and LFC 1.6.11 (see details in PATCH:1987)
        * dCache 1.8.0-15p5 with new YAIM module for configuration
        For production:
        2008-07-23: release of gLite3.1 Update28 in preparation. This update, to be released on the 29th of July, will contain:
        * glite-CONDOR_utils for lcg-CE
      - Discussion (our request) on WMS performance.
        -- 3.1 SL4 found to be more stable than 3.0
        -- several countries use a round robin setup for the host machines
        --> What do we need to do next?
      - Current recommended storage versions: https://twiki.cern.ch/twiki/bin/view/LCG/GSSDCCRCBaseVersions

      Ticket status
      ***************
      https://gus.fzk.de/download/escalationreports/roc/html/20080728_EscalationReport_ROCs.html
    • 11:35 11:40
      Hardware purchase advice & sharing 5m
      - The wiki page is available but needs reviewing - http://www.gridpp.ac.uk/wiki/Guidance_and_recent_purchases - What other areas need to be covered?
    • 11:40 11:50
      Follow up on quarterly reports & readiness reviews 10m
      - Any further feedback on the QRs? - The readiness review documents are here: ???
    • 11:50 11:55
      Web-page review 5m
      - Several deployment areas in need of updating - Which areas need focussed effort now? - Who will check what?
    • 11:55 12:05
      Actions review 10m
    • 12:05 12:10
      AOB 5m
      - UKI meeting on Thursday -- CA move is already on the agenda -- Any urgent items for discussion given LHC/experiment status?
    • 12:10 12:30
      Topics to revisit 20m
      - gstat publishing. Small group being formed.
      - Wiki/web page updates (see for example http://www.gridpp.ac.uk/deployment/contact.html). Admin task!
      - Completion of the GridPP-NGS site status information in http://www.gridpp.ac.uk/wiki/Working_with_NGS
      - Regional Nagios monitoring (ScotGrid have progressed - who else is moving forward with it?). At DTEAM on 1st July agreed on deployment on September timescale - YAIM component may be available then.
      - COD training in August
      - Collecting site queue/fairshare information
      - Reminder for sites to add comments to http://www.gridpp.ac.uk/wiki/SAM_availability:_October_2007_-_May_2008.
      - Look at the Site Readiness Review reports
      - "We need to audit T2 sites to understand how many concurrent transfers each can cope. This requires details of how many servers are available and how the pools are allocated between the VOs."
      - 080630: The first public version of the Operations Automation Strategy (MSA1.1) is now in EDMS at https://edms.cern.ch/document/927171/1