Deployment team

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 780397 with code: 4880.
DTeam Meeting 11.00 – 12.15, Tuesday 21 April 2009
Chair: Pete Gronbech
Present: James Cullen (minutes), Stephen Burke, Derek Ross, Gareth Smith, Dug McNab, Duncan Rand, Raja Nandakumar, Sam Skipsey, Graeme Stewart, Kashif Mohammad, Daniela Bauer, David Colling

**Experiment problems/issues**
Review of weekly issues by experiment/VO
- LHCb -- Prep. for STEP09
Raja - on vacation until yesterday. Quiet over Easter. Problems with ECDF (NFS) and Manchester (jobs stalling). DIRAC problem solved.
- CMS Prep. for STEP09
Dave Colling – quiet over Easter. RAL performance over the past month OK, but some problems due to power outages etc. T2s OK. No Monte Carlo production ongoing. STEP09 prep started.
- ATLAS Prep. for STEP09
Graeme - High ATLAS production rates. Glasgow a victim of their own success: proddisk buffer full, added 3 TB. Debugging issues to brokerage now fixed. Issue with RHUL storage, with a knock-on to IC.
Duncan – DNS issue, fixed now. IC-HEP set offline now. ECDF in test mode. DPM problems. Bad copy of firewall rules. Durham were missing some ATLAS releases – fixed now and running jobs.
Prep for STEP09. Want to load each site with as much user analysis as possible through both WMS and Panda. Sites should enable pilot roles. Fair shares: Prod 50%, Pilot 25%, general 25%. Graeme to set up a wiki page with full info and YAIM configuration.
Tomorrow: UK HammerCloud test – more extreme than normal though. Load sites with as many analysis tests as possible. Start at 10am Wednesday 22 April. Initially run through WMS, but towards the end of the week HammerCloud test through Panda. Panda test does not need Role=Pilot to be set up. Pilot jobs will have Peter's DN. User analysis jobs are not as efficient as production jobs. Is Maui fair share based on wall time or CPU time? For STEP09 want to load sites for 2 weeks.
- Other
-- Have any more sites had problems with Fusion user work?
Duncan - RHUL fusion user job opening lots of files. Ticket solved and user enabled again.
-- Any further feedback on the camont proposal? The T1 has raised a few questions that are being answered via email (will share the summary).
Derek - Andrew Sansum passed on the request to David Jackson. There are concerns over use of Janet and trawling of webpages.
- Experiment blacklisted sites: review
Which sites are currently blacklisted and why? Manchester and ECDF mentioned before. Long-standing issue with UCL not being ready.
Raja – LHCb blacklisted site has an intermittent problem - looking into it.
- Site performance
-- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
-- ECDF has 100% failure recently
-- Cambridge has 50% failure recently
-- UCL and IC also have poor success (around 10%)

**ROC update**
ROC update
***************
- Status of central Nagios: https://gridppnagios.physics.ox.ac.uk/nagios/
- Status of Nagios in each Tier-2. Do Tier-2s have their own Nagios?
LondonGrid – no T2 Nagios.
ScotGrid - yes
NorthGrid - yes
From the EGEE ops meeting: http://indico.cern.ch/conferenceDisplay.py?confId=57117.
- Nothing really new at the meeting this week
- Manchester's accounting problem was escalated due to inactivity on the ticket. Internal escalation (in the ROD model) can be done differently but sites do need to demonstrate that problems are being worked on and provide updates.
Alessandra working on new APEL box. Lots of other site problems caused this delay.
From the site reports:
- Interesting T1 comment this week: CE-host-cert-valid: This is a non-lhc service. The service users are also non-lhc, so this comment did not seem to provide a "reason".
Derek Ross - Didn't respond very quickly as there is no MoU with non-LHC VOs.

WLCG update
*****************
8th April GDB. Duncan's report: http://www.gridpp.ac.uk/wiki/GDB_8th_April_2009. What are the key T2/GridPP areas to follow up?
Duncan - Similar to previous meetings.
Briefly: reports from sites; discussion about EGI; security service challenge results discussed; STEP09; authorisation service; identity management; Steve Traylen gave a monitoring talk; middleware update; monitoring talk.

Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20090420_EscalationReport_ROCs.html
45327: GFAL for biomed at RHUL. Still on hold. Duncan – put ticket status to unsolved?
46024: Pheno would like usage data at user level -> sites asked to enable DN. Dug - Activated at Glasgow and Durham. Tweak in file site-info.def. Not activated in SouthGrid or NorthGrid yet.
47073: ATLAS spacetokens at Cambridge. Needs to be followed up. On hold – waiting for help from Oxford.
47074: ATLASSCRATCHDISK at Oxford. No user response. Close? Pete thinks it is working – waiting for Brian to confirm.
47118: LFC at Liverpool. Fixed but slow. Close? James – check with Alessandra and Liverpool – can we close?
47342: ILC Tier-1 SE problem. Seems stuck. Brian please prompt again. Brian on leave for a couple of weeks – Derek to follow.
47393: LHCb prod jobs stall at MAN-HEP. Waiting for James... James working on. Talking to Vladimir.
47528: Supernemo access to OX-HEP SE. In progress.
47529: Biomed at TCD issue with job cleanup. Comments from dteam??
47530: Supernemo access to MAN-HEP SE. Now fixed so close? If Manchester are happy, set to solved.
47653: VOMS host cert update. With Jens.
47677: RAL T1 SRM ATLAS problem. Marked as solved yesterday. Gareth working on.
Any other issues?

**Quarterly reports**
- First look at T2 reports for Q109 (reports should now be online)
-- Any completion issues?
-- Any urgent matters arising?
NorthGrid not submitted yet – Alessandra?
Graeme – issue. Want to report efforts from ECDF systems team in quarterly report. Working on.

**Benchmarking**
- Following the DB decision we now need to move sites to the new accounting units.
- For this we need to gather site benchmarking data
- How far has each Tier-2 got in pushing this forward?
- What are the plans (which sites have spec2006 available)?
Pete - Which sites have purchased the benchmark? Oxford, Glasgow and Durham so far. Maybe send Mike to ECDF with licence? Benchmarking in Edinburgh in the next week.
Graeme – discrepancy between the work ATLAS is doing at sites and the amount reported in APEL. DB decided that the HEP-SPEC2006 benchmark is needed for all sites. Writing script to parse old Maui logs. Do all sites keep batch system logs older than 3 months? Glasgow keep all.
(This needs to be done very soon)

**Team updates**
- Short update from each team member (1-2mins)
-- Current ongoing work
-- Current issues and concerns
Gareth – no comment
Dug – putting CREAM CE into production for Glasgow local users.
Dave Colling – how do you submit to it? Direct submission and via WMS.
DC – WMS submission flaky.
Kashif – MPI-enabled gLite setup
James – LHCb stalled jobs and ce01 falling over twice in 48 hours.
Stephen – documentation on how to publish information system. Include Steve Traylen's sub-cluster support. Publishing SPEC INT info, space information etc.
Derek – SL5 migration. Top-level BDII robustness
Raja – Perl on lxplus problem with https – related to LHCb blacklist monitoring page
Daniela – moving SE and installing BDII. Accounting problems at LeSC
Duncan – Quarterly report, technical meeting, 15 new disk servers. RHUL - what to do with old cluster. Improve communications within LondonGrid.
Sam – fixing ECDF storage. Moving Greg's old GridPP monitoring stuff. Installing xrootd service on some storage nodes at Glasgow.
Graeme – HEP SPEC. ATLAS side – checking Tier1 OK, Panda running on Oracle at CERN. STEP09 coordinator.
Dave Colling – catching up after holiday.

**AOB**
Derek – FTS monitoring page running slow after power outage. Will move to new machine in May.
Duncan – Panda monitoring has some blanks. Something to do with moving from Brookhaven.
Graeme - Press refresh/update on page to get reliable data.

**Chat Window**
[10:57:32] Stephen Burke joined
[10:57:35] Pete Gronbech joined
[10:59:04] Derek Ross joined
[10:59:21] Gareth Smith joined
[11:00:17] Dug McNab joined
[11:00:23] Duncan Rand joined
[11:00:51] Raja Nandakumar joined
[11:01:02] Sam Skipsey joined
[11:01:06] Graeme Stewart joined
[11:01:12] Mohammad kashif joined
[11:03:00] Sam Skipsey That's being addressed right now at ECDF.
[11:03:12] Daniela Bauer joined
[11:04:59] Derek Ross should be back up now
[11:06:09] Raja Nandakumar Thanks Sam
[11:20:44] Pete Gronbech http://lhcb-project-dirac.web.cern.ch/lhcb-project-dirac/lhcbProdnMask.html
[11:21:46] Duncan Rand http://lhcbweb.pic.es/DIRAC/jobs/SiteSummary/display
[11:27:55] Derek Ross lcgce02
[11:30:27] David Colling joined
[11:32:21] Duncan Rand back in a sec
[11:34:21] Derek Ross Not done at the T1 yet either
[11:35:05] Derek Ross Brian is on leave at the moment
[11:38:56] Graeme Stewart
    svr020:~$ lcg-cr -v -D srmv2 -d srm://t2se01.physics.ox.ac.uk:8446/srm/managerv2?SFN=/dpm/physics.ox.ac.uk/home/atlas/atlasscratchdisk/gs-test -s ATLASSCRATCHDISK file:///etc/group
    Using grid catalog type: lfc
    Using grid catalog : lfc.gridpp.rl.ac.uk
    Using LFN : /grid/atlas/generated/2009-04-21/file-f6680e74-c71f-4dce-bfd6-35a43fd0b277
    SE type: SRMv2
    Using SURL : srm://t2se01.physics.ox.ac.uk:8446/srm/managerv2?SFN=/dpm/physics.ox.ac.uk/home/atlas/atlasscratchdisk/gs-test
    Alias registered in Catalog: lfn:/grid/atlas/generated/2009-04-21/file-f6680e74-c71f-4dce-bfd6-35a43fd0b277
    [SE][GetSpaceTokens] httpg://t2se01.physics.ox.ac.uk:8446/srm/managerv2: dpm_getspacetoken: Unknown user space token description
    lcg_cr: Operation now in progress
[11:39:09] Graeme Stewart Still token problems at Oxford, sorry Pete!
[11:40:34] Graeme Stewart works for Role=production, so I suspect the FQAN on the space token is wrong
[12:08:07] Stephen Burke left
[12:08:07] Graeme Stewart left
[12:08:10] Dug McNab left
[12:08:11] Raja Nandakumar left
[12:08:11] Gareth Smith left
[12:08:11] Mohammad kashif left
[12:08:12] Derek Ross left
[12:08:13] Duncan Rand left
[12:08:14] David Colling left
[12:08:15] Sam Skipsey left
[12:08:22] Daniela Bauer left
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO
      - LHCb -- Prep. for STEP09
      - CMS Prep. for STEP09
      - ATLAS Prep. for STEP09
      - Other
      -- Have any more sites had problems with Fusion user work?
      -- Any further feedback on the camont proposal? The T1 has raised a few questions that are being answered via email (will share the summary).
      - Experiment blacklisted sites: review
      Which sites are currently blacklisted and why?
      - Site performance
      -- http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html
      -- ECDF has 100% failure recently
      -- Cambridge has 50% failure recently
      -- UCL and IC also have poor success (around 10%)
    • 11:20 11:45
      ROC update 25m
      ROC update
      ***************
      - Status of central Nagios: https://gridppnagios.physics.ox.ac.uk/nagios/
      - Status of Nagios in each Tier-2.
      From the EGEE ops meeting: http://indico.cern.ch/conferenceDisplay.py?confId=57117.
      - Nothing really new at the meeting this week
      - Manchester's accounting problem was escalated due to inactivity on the ticket. Internal escalation (in the ROD model) can be done differently but sites do need to demonstrate that problems are being worked on and provide updates.
      From the site reports:
      - Interesting T1 comment this week: CE-host-cert-valid: This is a non-lhc service. The service users are also non-lhc, so this comment did not seem to provide a "reason".

      WLCG update
      *****************
      8th April GDB. Duncan's report: http://www.gridpp.ac.uk/wiki/GDB_8th_April_2009. What are the key T2/GridPP areas to follow up?

      Ticket status
      ***************
      https://gus.fzk.de/download/escalationreports/roc/html/20090420_EscalationReport_ROCs.html
      45327: GFAL for biomed at RHUL. Still on hold.
      46024: Pheno would like usage data at user level -> sites asked to enable DN
      47073: ATLAS spacetokens at Cambridge. Needs to be followed up.
      47074: ATLASSCRATCHDISK at Oxford. No user response. Close?
      47118: LFC at Liverpool. Fixed but slow. Close?
      47342: ILC Tier-1 SE problem. Seems stuck. Brian please prompt again.
      47393: LHCb prod jobs stall at MAN-HEP. Waiting for James...
      47528: Supernemo access to OX-HEP SE. In progress.
      47529: Biomed at TCD issue with job cleanup. Comments from dteam??
      47530: Supernemo access to MAN-HEP SE. Now fixed so close?
      47653: VOMS host cert update. With Jens.
      47677: RAL T1 SRM ATLAS problem. Marked as solved yesterday.
      Any other issues?
    • 11:45 11:50
      Quarterly reports 5m
      - First look at T2 reports for Q109 (reports should now be online)
      -- Any completion issues?
      -- Any urgent matters arising?
    • 11:50 12:00
      Benchmarking 10m
      - Following the DB decision we now need to move sites to the new accounting units.
      - For this we need to gather site benchmarking data
      - How far has each Tier-2 got in pushing this forward?
      - What are the plans (which sites have spec2006 available)?
      (This needs to be done very soon)
    • 12:00 12:10
      Team updates 10m
      - Short update from each team member (1-2mins)
      -- Current ongoing work
      -- Current issues and concerns
    • 12:10 12:15
      AOB 5m