EVO - GridPP Deployment team meeting
Description
- This is the weekly DTEAM meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area
Minutes
LHCb (Raja)
----
- no simulation, doing reconstruction (7M events)
- RAL issues at the weekend (17/18th) accessing storage (dCache)
- not running production (generation)
CMS
---
- Email from Dave Newbold:
-- Lack of unified RFIO library for CASTOR/DPM; now we are using SL4, previous hacks don't work. What are the plans for CASTOR migration to secure RFIO and/or a unified library? How hard can this be - presumably the library just needs some environment setting to tell it to use CASTOR- or DPM-style RFIO? Right now, all DPM SL4 sites (worldwide) are out of action for CMS.
-- We still have various services (e.g. Tier-1 VObox) that need to go to SL4 ASAP.
-- We await FTSv2 at RAL with bated breath. Next week, we are told?
-- Testing of network throughput, etc, has been put on a new basis in the CMS 'DDT' project. We will be calling upon UK experts to help us get our T1->T2 and T0->T1 links commissioned.
-- https://twiki.cern.ch/twiki/bin/view/CMS/DebuggingDataTransfers
- RAL moving to FTS2 next week - Derek Ross / Matt will confirm, and will check the VO boxes too
- Network testing - data coming from overseas slowly (a few Mb/s) - needs more testing. AE will contact DN directly (Barney too)
May do a face-to-face discussion next week at the GridPP meeting.
- GC (rfio): sent email to the developers last week, awaiting a response. Who would know? Not Jens - GC will contact Steve Traylen.
ATLAS
-----
- lack of space in dCache at RAL
- CASTOR seems OK
- T2s unknown, but it should be possible to restart production
- M4 cosmic runs next week (don't know which UK sites will participate) - GS would like Glasgow to participate. May be behind though. Awaiting an update from Frederic.
- JC reported from the PMB meeting - Dave Newbold said he'd seen new problems with CASTOR - also affecting ATLAS? Not sure - a local meeting is planned for Wednesday
Pheno
-----
- Glasgow banned a user who flooded the queue - the site stayed up
GridFTP / FTS
-------------
- CLOSE_WAIT connections on DPM disk servers (GridFTP)
- exhausted resources caused the disk servers to crash
- Durham has a script to kill off the server at intervals (sketch below)
- Cambridge has the same issue; Brunel and RHUL too
(GC just back from holiday - hasn't checked up fully on the latest)
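The Durham workaround above was only described as killing off the server at intervals; the script itself wasn't shown at the meeting. A minimal sketch of that kind of watchdog in Python, assuming the daemon is managed through an init script called dpm-gsiftp and using an arbitrary CLOSE_WAIT threshold (the actual Durham script may simply restart the daemon unconditionally on a cron schedule):

    #!/usr/bin/env python3
    # Hypothetical cron-driven watchdog along the lines of the Durham workaround:
    # if the GridFTP daemon has accumulated too many CLOSE_WAIT sockets, restart it.
    # The service name and threshold are assumptions, not the actual Durham values.
    import subprocess

    SERVICE = "dpm-gsiftp"   # assumed init-script name for the DPM GridFTP server
    THRESHOLD = 200          # arbitrary limit on CLOSE_WAIT sockets before restarting

    def close_wait_count():
        """Count TCP sockets stuck in CLOSE_WAIT, as reported by netstat."""
        out = subprocess.run(["netstat", "-tan"],
                             capture_output=True, text=True).stdout
        return sum(1 for line in out.splitlines() if "CLOSE_WAIT" in line)

    if __name__ == "__main__":
        if close_wait_count() > THRESHOLD:
            # Restart via the init script; adjust to the site's service manager.
            subprocess.run(["/sbin/service", SERVICE, "restart"])

Run from cron every few minutes; sites seeing the same symptom (Cambridge, Brunel, RHUL) could adapt the threshold and service name to their setup.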
LHCb - VO Box down due to HW failure - timescale? Should be back up (disk swapped out) - Derek will check with Katlin
- dCache - zero disk space published on the LDAP server? Derek will check the info provider started up.
Takeup of resources at larger sites
-- UK delivered 43% of its pledge, but has *deployed* more than it pledged. How do we get utilisation up?
Last year LT2 did a push. Glasgow recently too.
Manchester had 1400 of 1600 slots empty yesterday - how can we fill the larger sites?
-- Issue at QMUL just now - unsure of the cause. Firewall? Doesn't seem to have changed since last September...
--- GS: do jobs get as far as the CE? We saw issues where Steve Lloyd's jobs died at the gatekeeper
--- Jobs seem to run on the CE but fail to pull the main application - a GridFTP issue? Software install jobs fail (they use the CERN RB) - can't run CMS MC code
means the cluster is only running at 10%
JC: Quick calculation - QMUL at 13%, Manchester at 18%, Glasgow at 85% - why such big differences? (see the sketch below)
GS: Glasgow has a local bio user who threw in 400 jobs to fill the cluster
Alessandra: Same comment as GS
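The basis of JC's quick calculation wasn't spelled out at the meeting; a minimal sketch, assuming it is simply occupied job slots as a fraction of total slots, applied to the Manchester figures quoted above (1400 of 1600 slots empty):

    # Illustrative occupancy calculation from the Manchester figures in the minutes.
    # Whether JC's 13/18/85% numbers are an instantaneous snapshot like this or an
    # average over accounting data was not stated.
    total_slots = 1600
    empty_slots = 1400
    occupied = total_slots - empty_slots

    utilisation_pct = 100.0 * occupied / total_slots
    print(f"{utilisation_pct:.1f}%")   # 12.5%, the same order as the 18% quoted for Manchester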
ALICE - filling up other countries' pledges
We need to somehow show that the resource is AVAILABLE even if not used - accounting doesn't show that.
Imperial had a few unwanted power cuts (South Kensington down) - CMS, Biomed need throttling
Oxford - Busy with H1, biomed, atlas
Durham - local pheno users
Glasgow may take Durham pheno users
Otherwise the UK sites are pretty quiet.
Blacklisting - need to un-blacklist quicker perhaps.
ROC updates - Philippa:
Benchmarking - discussion and no solution - more discussion needed... Lots of talks. Manfred (FZK) runs SPECint at sites with the same parameters, but it needs the software licence to be purchased. Should they use SI2k or SI2k6? No decision made.
lcgadmin role - no UK people. Stephen will represent the UK on the SLD working group
UK tickets escalated - GGUS 21476 (RAL) - Outstanding for a while - ATLAS / CASTOR testing. Close as "unsolved"
24038 - a possible solution had been posted (wrong mapping - needed an upgrade)... [Hmm - I may have the wrong ticket in the minutes, as Alessandra's GLUE schema update doesn't seem relevant]
JC: Did T2Cs get an update on open tickets? Yes.
At what stage can we close tickets that are waiting for user confirmation that the issue has been resolved?
PS: Give them a reminder and say you'll close it on, say, Friday.
Don't just close the ticket.
If you close a ticket and they reply, it will reopen anyway.
... discussion about the merits of closing vs keeping open for the user (adding status=solved)
Three of the T2Cs said it's better to close (AF, PG, GS) - the ROC would rather tickets stayed open, as escalations don't work on reopened tickets
Better to close and not take the hit on the metrics.
- Need a DB query rather than Footprints to work out the total open time when a ticket has been reopened.
Maybe a status=stalled could be added for when waiting on middleware / a third party (e.g. RAL waiting for a month before a software fix is available)
[GS: had to leave for another appointment at 12:00]
The "unsolved" status is used on tickets with dependencies - another can of worms [at this point JC moved the meeting on to the next item]
FAIRSHARES
http://www.gridpp.ac.uk/wiki/Current_VO_Fairshares_at_T2/T1
Sounds of "huh? not seen this" - we generally blamed Olivier for creating the pages (May 2006). These are outdated and not helpful.
* needs updating / replacing with new / valid data
* Which VOs have allocations at which sites?
* TD asked at the PMB whether we had completed it. Could be difficult, as fairshares may change at sites.
* there is a table that was completed for Steve Lloyd
* Manchester has fairshares and priorities - the priority kicks in when you want one VO to be above another, i.e. BaBar over LHCb even if they have the same fairshares.
* The return time for a VO job is calculated by..... ? No idea. AF said she'd look into it.
* PG: use diagnose -f to obtain fairshares - it will show what Maui is doing now, not necessarily what's configured. Should specify the configured values.
* Shares may not add up to 100% if "grid" only has part of a cluster.
* e.g. Oxford has a total adding up to 100%, made up of 15, 10, 5% slots for various VOs - these should be normalised back to 100% before adding to the table (see the sketch below).
* Fairshares may not represent the true usage, i.e. Manchester has a poor share for LHCb but because the cluster is quiet, most LHCb jobs are running. (Raja: "we're happy with the Manchester setup :-)")
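On the normalisation point above: a minimal sketch of the rescaling, assuming (purely for illustration) that the grid VOs hold 15%, 10% and 5% of the whole cluster and the rest is local work; the VO names are made up, not Oxford's actual configuration:

    # Hypothetical rescaling of per-VO shares quoted as a fraction of the whole
    # cluster into shares of the grid-visible part only, as discussed above.
    raw_shares = {"atlas": 15.0, "cms": 10.0, "lhcb": 5.0}   # % of the whole cluster

    grid_total = sum(raw_shares.values())                    # here the grid part is 30%
    normalised = {vo: round(100.0 * share / grid_total, 1)
                  for vo, share in raw_shares.items()}

    print(normalised)   # {'atlas': 50.0, 'cms': 33.3, 'lhcb': 16.7}

Normalised this way, the figures for each site sum to 100% and can be compared directly in the fairshares table.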
OPS MEETING
* DAG repo - "we had opposition, but no ROC had very strong opposition to it" - the wording has been changed - Jeremy posted a link to the URL for the wording.
Discussion ensued; transferred to email later.
* we think it is a bad idea, but we won't "not use" it.
JC sent out an updated wording later emphasising the UK reluctance to use it.
* SAM CHANGES
Sent out to a sam-changes list; it only goes out on the EGEE BROADCAST list when about to be implemented.
SL4
Push to move to SL4 even though CMS have issues with it (can't run jobs)
lcg-utils 1.5.x (1.6.x is OK) - one specific version has a problem - Raja repacks lcg-cp within their distribution to redistribute it
They are shipping a new version by the end of the week which should fix it.
PPS updates - see agenda
Wiki page for SGM/PRD accounts - to ease uncertainty about setting them up.
gLite updates now have changelog.
Only necessary to reconfigure the LCG-CE, not all services - Derek? More info once the minutes are out.
SECURITY
Mingchao - how to get the contact details up to date. JC handed over to him.
* Security policies - where can they be found? There are ~10 policies. Which are the most recent, etc.?
* Will aggregate them together with a summary into one page.
* link to the detailed policies.
T2Cs - What info would you like on pages?
-- refer back to T2 sites.
GOC database - the data is not current. How can this be updated if someone leaves?
Frequency of updates - monthly? At DTEAM meetings?
* Now done as a standing item at the T2 technical meetings.