EVO - GridPP Deployment team meeting
Description
- This is the weekly DTEAM meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 780409 with code: 4880.
Tuesday 28 April 2009
Present: Brian Davies (minutes), Jeremy Coles, Daniela Bauer, Raja Nandakumar, Sam Skipsey, James Cullen, Doug McNab, Duncan Rand, Gareth Smith, Graeme Stewart, Mingchao Ma, Alessandra Forti, Dave Colling, Mohammed Kashif
• Review of weekly issues by experiment/VO
- LHCb
-- Enabling pilot roles
RAL have deployed. Some T2s have, but most have not. Manchester, QMUL and ECDF (on one CE) have.
Pilots need new pool accounts.
Mail from Roberto Santinelli.
-- New software application tests.
The 10M event production will be pushed back until next week (jobs at T2s).
It will take over 10 days and will go through all good WMSs.
UCL-CCC and Manchester were banned.
Manchester is now passing SAM tests so will be un-banned.
UCL-CCC has a VOMS cert issue.
Timeouts occurring within DIRAC at Liverpool and Birmingham. Still investigating.
Site blacklisting is done via GGUS tickets.
Historical information on blacklisting is in the DIRAC 3 database.
- CMS
Site readiness page shows which sites are ready for usage.
If you fail any critical SAM (ops or CMS) test, you get blacklisted. If you start passing the tests again, you get un-banned. This can be an issue if SAM tests are badly written, i.e. a site config confuses the tests and the problem takes time to be corrected.
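The blacklisting rule described above can be sketched roughly as follows. This is only an illustration in Python: the `SamResult` class, the `is_blacklisted` function and the test names are hypothetical and are not taken from SAM or from any CMS tool.

```python
from dataclasses import dataclass


@dataclass
class SamResult:
    """One SAM test outcome for a site (hypothetical structure)."""
    test_name: str
    critical: bool
    passed: bool


def is_blacklisted(results):
    """Return True if any critical test is currently failing.

    Non-critical failures are ignored. A site is un-banned as soon as
    every critical test passes again.
    """
    return any(r.critical and not r.passed for r in results)


# Hypothetical test names, for illustration only.
failing_site = [
    SamResult("ops-job-submit", critical=True, passed=True),
    SamResult("cms-sw-install", critical=True, passed=False),
    SamResult("cms-optional-check", critical=False, passed=False),
]
recovered_site = [
    SamResult("ops-job-submit", critical=True, passed=True),
    SamResult("cms-sw-install", critical=True, passed=True),
    SamResult("cms-optional-check", critical=False, passed=False),
]
```

Because a single critical failure is enough to ban a site, a badly written test has the same effect as a real site problem, which is the concern raised above.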
- ATLAS
Production in the last week.
QMUL had storage usage issues. RALPPD had storage issues. Liverpool efficiency down. ECDF had storage issues. Plan to continue at a high level of production for the foreseeable future.
STEP09 pre-tests affected QMUL storage; now being worked on.
Current setup is a DPM front end with a Lustre back end. The bottleneck is through DPM.
If they had StoRM (which understands Lustre), things would be better.
STEP09
Could sites please configure the pilot role.
50% prod
25% pilot
25% regular
Ganga is I/O bound on job submission (from HammerCloud tests).
Being worked on. GS will announce further HammerCloud tests on tb-support.
DR: Is the pilot role mandatory for STEP09?
GS: It is highly desirable. Not needed for non-ATLAS data sites (IC, Brunel, Bristol).
MAUI fairshare is based on dedicated PS, so most sites are configured correctly.
Blacklist
ECDF were blacklisted, but are now working.
UCL-CCC are offline, but the software install at UCL is going ahead, so it should work in the next few weeks.
GS to look into why QMUL is not running (they are not blacklisted).
The 2G queue was offline but the 1G queue was online, hence no jobs. The 2G queue is now back on.
Historical blacklisting could be recorded but is not in place at the moment.
PG: Are the almost-full Oxford space tokens an issue?
GS: More info to come out of the CRRB meeting.
- Other
Fusion affected RHUL last week.
- Experiment blacklisted sites: review
-- Experiment descriptions of how blacklisting is handled
- Site performance
-- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
Culham issues. PG will look into it.
• ROC update
- From the EGEE ops meeting:
Some gLite 3.0 services still remain
https://twiki.cern.ch/twiki/bin/view/EGEE/SitesPublishinggLite30
RALPPD, IC-HEP, ECDF
ECDF has one SE and mon box.
Since they have no staff and are on best effort, these will be upgraded when they can.
AF pointed out that RGMA is a pain to upgrade at the moment.
- Running 64-bit WNs
Liverpool have been the only vocal site about not being able to run 64-bit SL5.
DC: Issues with running two sets of RPMs to support 32-bit mode on 64-bit.
The change of YUM to the groupinstall method is causing issues.
Ewan sent this to tb-support.
Action to look at this further.
- COD & ROD
-- Any comments from the contributors of last week?
Comment from Doug: ticket management involvement good.
TPM monitoring by PG, DR and GS this week.
TPM handover ticket was not received.
-- 15th June may be the switchover date. Several documents have been circulated (attached)
From the site reports:
- SAM data was missing this week. Very few site comments - many indicated a good and quiet week.
WLCG update
*****************
- No recent WLCG meetings.
Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20090427_EscalationReport_ROCs.html
45327 - gfal for RHUL - on hold. Will be put into unsolved.
46024 - pheno using data at usage level DN encryption. Good progress.
47073 - ATLAS STs at CAM - in progress.
47342 - ILC T1 SE problem - in progress.
47393 - LHCb stalled at Manchester. Now solved. Bug in NFS kernel server; updated.
47653 - VOMS host update.
47759 - MyProxy at T1. To go on hold until further work is done.
***************
Two sets of slides regarding getting ready for regional ops.
1st set
Checklist for regional model.
One FTE to do work.
Provide a mailing list for RCOD contact. Needs to be set up.
No preference as to the name and host of the list.
Link to GGUS for regional knowledge base.
1st line supporter status
JC to follow up.
2nd set of slides
Rcod Readiness v1.1.pdf
Need to decide the rota.
JC to return to this next week.
Site joining information is now out of date
What needs to be updated? Who should do it?
Does ScotGrid have info on this? DM and GS have not seen it. S has been documenting info for users, which isn't necessarily useful for EGEE, and what is specific to local users. JC to take it offline.
Good idea to have it in one page which becomes part of ROC and NGI.
AOB-
None.