EVO - GridPP Deployment team meeting
Description
- This is the weekly DTEAM meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area
Minutes
LHCb (Raja)
----
- no simulation, doing reconstruction (7M events)
- RAL issues at the weekend (17/18th) accessing storage (dCache)
- not running production (generation)
CMS
---
- Email from Dave Newbold:
-- Lack of unified RFIO library for CASTOR/DPM; now we are using SL4, previous hacks don't work. What are the plans for CASTOR migration to secure RFIO and/or a unified library? How hard can this be - presumably the library just needs some environment setting to tell it to use CASTOR- or DPM-style RFIO? Right now, all DPM SL4 sites (worldwide) are out of action for CMS.
-- We still have various services (e.g. Tier-1 VObox) that need to go to SL4 ASAP.
-- We await FTSv2 at RAL with bated breath. Next week, we are told?
-- Testing of network throughput, etc, has been put on a new basis in the CMS 'DDT' project. We will be calling upon UK experts to help us get our T1->T2 and T0->T1 links commissioned.
-- https://twiki.cern.ch/twiki/bin/view/CMS/DebuggingDataTransfers
- RAL moving to FTS2 next week - Derek Ross / Matt will confirm, and will check the VO boxes too
- Network testing - data coming from overseas slowly (a few Mb/s) - needs more testing. AE will contact DN directly (Barney too)
May do a face-to-face discussion next week at the GridPP meeting.
- GC (rfio): sent email to the developers last week, awaiting a response. Who would know? Not Jens - GC will contact Steve Traylen.
ATLAS
-----
- lack of space in dCache at RAL
- CASTOR seems OK
- T2s unknown, but it should be possible to restart production
- M4 cosmic runs next week (don't know which UK sites will participate) - GS would like Glasgow to participate. May be behind though. Awaiting an update from Frederic.
- JC reported from the PMB meeting - Dave Newbold said he'd seen new problems with CASTOR - also affecting ATLAS? Not sure - a local meeting is planned for Wednesday
Pheno
-----
- Glasgow banned a user who flooded the queue - the site stayed up
GridFTP / FTS
-------------
- CLOSE_WAIT connections on DPM disk servers (GridFTP)
- exhausted resources caused the disk servers to crash
- Durham has a script to kill off the server at intervals (sketch below)
- Cambridge has the same issue; Brunel and RHUL too
(GC just back from holiday - hasn't checked up fully on the latest)
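The Durham workaround above was only described as killing off the server at intervals; the script itself wasn't shown at the meeting. A minimal sketch of that kind of watchdog in Python, assuming the daemon is managed through an init script called dpm-gsiftp and using an arbitrary CLOSE_WAIT threshold (the actual Durham script may simply restart the daemon unconditionally on a cron schedule):

    #!/usr/bin/env python3
    # Hypothetical cron-driven watchdog along the lines of the Durham workaround:
    # if the GridFTP daemon has accumulated too many CLOSE_WAIT sockets, restart it.
    # The service name and threshold are assumptions, not the actual Durham values.
    import subprocess

    SERVICE = "dpm-gsiftp"   # assumed init-script name for the DPM GridFTP server
    THRESHOLD = 200          # arbitrary limit on CLOSE_WAIT sockets before restarting

    def close_wait_count():
        """Count TCP sockets stuck in CLOSE_WAIT, as reported by netstat."""
        out = subprocess.run(["netstat", "-tan"],
                             capture_output=True, text=True).stdout
        return sum(1 for line in out.splitlines() if "CLOSE_WAIT" in line)

    if __name__ == "__main__":
        if close_wait_count() > THRESHOLD:
            # Restart via the init script; adjust to the site's service manager.
            subprocess.run(["/sbin/service", SERVICE, "restart"])

Run from cron every few minutes; sites seeing the same symptom (Cambridge, Brunel, RHUL) could adapt the threshold and service name to their setup.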
LHCb - VO Box down due to HW failure - timescale? Should be back up (disk swapped out) - Derek will check with Katlin
- dCache - zero disk space published on the LDAP server? Derek will check the info provider started up.
Takeup of resources at larger sites
-- UK delivered 43% of its pledge, but has *deployed* more than it pledged. How do we get utilisation up?
Last year LT2 did a push. Glasgow recently too.
Manchester had 1400 of 1600 slots empty yesterday - how can we fill the larger sites?
-- Issue at QMUL just now - unsure of the cause. Firewall? Doesn't seem to have changed since last September...
--- GS: do jobs get as far as the CE? We saw issues where Steve Lloyd's jobs died at the gatekeeper
--- Jobs seem to run on the CE but fail to pull the main application - a GridFTP issue? Software install jobs fail (they use the CERN RB) - can't run CMS MC code
means the cluster is only running at 10%
JC: Quick calculation - QMUL at 13%, Manchester at 18%, Glasgow at 85% - why such big differences? (see the sketch below)
GS: Glasgow has a local bio user who threw in 400 jobs to fill the cluster
Alessandra: Same comment as GS
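The basis of JC's quick calculation wasn't spelled out at the meeting; a minimal sketch, assuming it is simply occupied job slots as a fraction of total slots, applied to the Manchester figures quoted above (1400 of 1600 slots empty):

    # Illustrative occupancy calculation from the Manchester figures in the minutes.
    # Whether JC's 13/18/85% numbers are an instantaneous snapshot like this or an
    # average over accounting data was not stated.
    total_slots = 1600
    empty_slots = 1400
    occupied = total_slots - empty_slots

    utilisation_pct = 100.0 * occupied / total_slots
    print(f"{utilisation_pct:.1f}%")   # 12.5%, the same order as the 18% quoted for Manchester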
ALICE - filling up other countries' pledges
We need to somehow show that the resource is AVAILABLE even if not used - accounting doesn't show that.
Imperial had a few unwanted power cuts (South Kensington down) - CMS, Biomed need throttling
Oxford - Busy with H1, biomed, atlas
Durham - local pheno users
Glasgow may take Durham pheno users
Otherwise the UK sites are pretty quiet.
Blacklisting - need to un-blacklist quicker perhaps.
ROC updates - Philippa:
Benchmarking - discussion and no solution - more discussion needed... Lots of talks. Manfred (FZK) runs SPECint at sites with the same parameters, but it needs the software licence to be purchased. Should they use SI2k or SI2k6? No decision made.
lcgadmin role - no UK people. Stephen will represent the UK on the SLD working group
UK tickets escalated - GGUS 21476 (RAL) - Outstanding for a while - ATLAS / CASTOR testing. Close as "unsolved"
24038 - a possible solution had been posted (wrong mapping - needed an upgrade)... [Hmm - I may have the wrong ticket in the minutes, as Alessandra's GLUE schema update doesn't seem relevant]
JC: Did T2Cs get an update on open tickets? Yes.
At what stage can we close tickets that are waiting for user confirmation that the issue has been resolved?
PS: Give them a reminder and say you'll close it on, say, Friday.
Don't just close the ticket.
If you close a ticket and they reply, it will reopen anyway.
... discussion about the merits of closing vs keeping open for the user (adding status=solved)
Three of the T2Cs said it's better to close (AF, PG, GS) - the ROC would rather tickets stayed open, as escalations don't work on reopened tickets
Better to close and not take the hit on the metrics.
- Need a DB query rather than Footprints to work out the total open time when a ticket has been reopened.
Maybe a status=stalled could be added for when waiting on middleware / a third party (e.g. RAL waiting for a month before a software fix is available)
[GS: had to leave for another appointment at 12:00]
The "unsolved" status is used on tickets with dependencies - another can of worms [at this point JC moved the meeting on to the next item]
FAIRSHARES
http://www.gridpp.ac.uk/wiki/Current_VO_Fairshares_at_T2/T1
Sounds of "huh? not seen this" - we generally blamed Olivier for creating the pages (May 2006). These are outdated and not helpful.
* needs updating / replacing with new / valid data
* Which VOs have allocations at which sites?
* TD asked at the PMB whether we had completed it. Could be difficult, as fairshares may change at sites.
* there is a table that was completed for Steve Lloyd
* Manchester has fairshares and priorities - the priority kicks in when you want one VO to be above another, i.e. BaBar over LHCb even if they have the same fairshares.
* The return time for a VO job is calculated by..... ? No idea. AF said she'd look into it.
* PG: use diagnose -f to obtain fairshares - it will show what Maui is doing now, not necessarily what's configured. Should specify the configured values.
* Shares may not add up to 100% if "grid" only has part of a cluster.
* e.g. Oxford has a total adding up to 100%, made up of 15, 10, 5% slots for various VOs - these should be normalised back to 100% before adding to the table (see the sketch below).
* Fairshares may not represent the true usage, i.e. Manchester has a poor share for LHCb but because the cluster is quiet, most LHCb jobs are running. (Raja: "we're happy with the Manchester setup :-)")
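On the normalisation point above: a minimal sketch of the rescaling, assuming (purely for illustration) that the grid VOs hold 15%, 10% and 5% of the whole cluster and the rest is local work; the VO names are made up, not Oxford's actual configuration:

    # Hypothetical rescaling of per-VO shares quoted as a fraction of the whole
    # cluster into shares of the grid-visible part only, as discussed above.
    raw_shares = {"atlas": 15.0, "cms": 10.0, "lhcb": 5.0}   # % of the whole cluster

    grid_total = sum(raw_shares.values())                    # here the grid part is 30%
    normalised = {vo: round(100.0 * share / grid_total, 1)
                  for vo, share in raw_shares.items()}

    print(normalised)   # {'atlas': 50.0, 'cms': 33.3, 'lhcb': 16.7}

Normalised this way, the figures for each site sum to 100% and can be compared directly in the fairshares table.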
OPS MEETING
* DAG repo - "we had opposition, but no ROC had very strong opposition to it" - the wording has been changed - Jeremy posted a link to the URL for the wording.
Discussion ensued; transferred to email later.
* we think it is a bad idea, but we won't "not use" it.
JC sent out an updated wording later emphasising the UK reluctance to use it.
* SAM CHANGES
Sent out to a sam-changes list; it only goes out on the EGEE BROADCAST list when about to be implemented.
SL4
Push to move to SL4 even though CMS have issues with it (can't run jobs)
lcg-utils 1.5.x (1.6.x is OK) - one specific version has a problem - Raja repacks lcg-cp within their distribution to redistribute it
They are shipping a new version by the end of the week which should fix it.
PPS updates - see agenda
Wiki page for SGM/PRD accounts - to ease uncertainty about setting them up.
gLite updates now have changelog.
Only necessary to reconfigure the LCG-CE, not all services - Derek? More info once the minutes are out.
SECURITY
Mingchao - how to get the contact details up to date. JC handed over to him.
* Security policies - where can they be found? There are ~10 policies. Which are the most recent, etc.?
* Will aggregate them together with a summary into one page.
* link to the detailed policies.
T2Cs - What info would you like on pages?
-- refer back to T2 sites.
GOC database - the data is not current. How can this be updated if someone leaves?
Frequency of updates - monthly? At DTEAM meetings?
* Now done as a standing item at the T2 technical meetings.