Deployment team

Europe/Zurich
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +41 22 76 71400. The phone bridge ID is 353618 with code 4880.
Minutes
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO:
      - LHCb
      - CMS
      - ATLAS
      -- Spacetoken status: http://wn3.epcc.ed.ac.uk/srm/xml/srm_token_table
      - Other
      -- Do any sites or VOs require FTS on SL4?
    • 11:20 11:30
      ROC update 10m
      ROC update
      ***************
      From the ops meeting:
      1) A problem with Gstat caused many false alarms related to BDII tests in every ROC. The failed BDII tests were caused by a transitory ASGC network outage from 06:05 to 06:25 on 11-Aug-2008.
      2) gLite 3.1 Update 28 has been released. The release contains:
      * glite-CONDOR_utils for the lcg-CE (PATCH:1856)
      * A new version of the gsoap plugin with a vulnerability fix (affecting LB, WMS, UI, WN, VOBOX, CE) (PATCH:1846)
      * Several bug fixes for the WMS and clients (PATCH:1780)
      * A new Short Lived Credential Service (SLCS), allowing users to obtain short-lived personal certificates based on a Shibboleth AAI identity (PATCH:1693)
      * MyProxy version 1.6.1-7 (fixes a build issue related to the globus flavour; already deployed in production) (PATCH:1978)
      * Various improvements to lcg-extra-jobmanagers (CE) (PATCH:1942)
      * A GFAL and lcg_util update with the new function gfal_removedir and several bug fixes
      * An FTS SL4 release (32- and 64-bit). This version has a critical bug and should not be installed; the RPMs have been removed from the repository. The situation arose because the developer only spotted the problem at the time of release, after it had passed certification tests.
      3) Coming soon: gLite 3.1 Update 29 is in preparation. The release contains:
      o DPM & LFC 1.6.11: R3.1/SLC4/i386 (PATCH:1988)
      o DPM & LFC 1.6.11: R3.1/SLC4/x86_64 (PATCH:1987)
      WLCG update
      *****************
      - Daily ops meeting minutes can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings
      - The next GDB is on September 10th (http://indico.cern.ch/conferenceDisplay.py?confId=20233). There may be a pre-GDB.
      Ticket status
      ***************
      https://gus.fzk.de/download/escalationreports/roc/html/20080818_EscalationReport_ROCs.html
    • 11:30 11:35
      SL network tests 5m
      - Results are here: http://pprc.qmul.ac.uk/~lloyd/gridpp/nettest.html
      - Not currently getting a good results history due to LFC/Castor problems
      - Can we conclude anything else?
    • 11:35 11:45
      Site issues 10m
      - GridMap is still not showing a healthy situation: http://gridmap.cern.ch/gm/
      - Today: DOWN = UCL-CENTRAL, RHUL and ECDF
      - Today: Degraded = Tier-1, QMUL, Brunel and IC-HEP
      - What more can be done to help sites?
    • 11:45 11:55
      WN proposal 10m
      - https://twiki.cern.ch/twiki/bin/view/EGEE/ClientDistributionProposal
      - What is the current state of WN versions across the sites?
      Summary of comments so far: While it has been shown that sites do not upgrade promptly, if this new method went ahead there should perhaps be an opt-out mechanism (shared clusters do not always have direct WN access for GridPP now).
      1) Why are the clients not just given to the experiments to install for specific updates? [Risky, as each installation takes about 0.5 GB and smaller VOs would struggle anyway.]
      2) If clusters are full, how would rollback work?
      3) Who will be accountable for cluster availability? Third-party installs leave intermediaries responsible for the install, so the subsequent expectations on site admins would have to be very clear.
      4) Some sites already provide different middleware versions via a relocatable distribution, which provides a rollback option. The issue is advertising the versions.
      5) In the background is the rpm vs. tar installation debate.
      6) It is not likely that sites will allow the framework to install crons and perform the other privileged actions currently required for WN installs.
      7) Local installation allows sysadmins to take account of local scaling issues and needs.
      8) Central installation will make problems harder to diagnose (and it is tantamount to admitting that the current middleware stack is a broken mess of dependencies which makes it hard to maintain/upgrade).
      9) It takes away the decision of whether it is wise to upgrade now, and reliability will be impacted when the cluster changes "under our feet".
      - Quite negative! So, do we have any better suggestions? How would we deal with divergent WN environment requirements from VOs?
    • 11:55 12:05
      Actions review 10m
    • 12:05 12:10
      Security update 5m
      - Latest on the incident
      - Areas to follow up (especially communications)
      -- What is the involvement of central CERTs?
    • 12:10 12:15
      AOB 5m
      - ATLAS Jamboree: http://indico.cern.ch/conferenceDisplay.py?confId=38738
      - What are the challenges for the coming 6-12 months?
      - Next Tuesday's DTEAM => UKI meeting
    • 12:15 12:16
      Topics to revisit 1m
      - gstat publishing. A small group is being formed.
      - Wiki/web page updates (see for example http://www.gridpp.ac.uk/deployment/contact.html). Admin task!
      - Completion of the GridPP-NGS site status information in http://www.gridpp.ac.uk/wiki/Working_with_NGS
      - Regional Nagios monitoring (ScotGrid have progressed; who else is moving forward with it?). At DTEAM on 1st July it was agreed to deploy on a September timescale - a YAIM component may be available then.
      - COD training in August
      - Collecting site queue/fairshare information
      - Reminder for sites to add comments to http://www.gridpp.ac.uk/wiki/SAM_availability:_October_2007_-_May_2008
      - Look at the Site Readiness Review reports: "We need to audit T2 sites to understand how many concurrent transfers each can cope with. This requires details of how many servers are available and how the pools are allocated between the VOs."
      - 080630: The first public version of the Operations Automation Strategy (MSA1.1) is now in EDMS at https://edms.cern.ch/document/927171/1