Deployment team & sites

Europe/London
EVO - GridPP Deployment team meeting

Jeremy Coles
Description
- This is the weekly DTEAM meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 1677458 with code: 4880.


 Tuesday 02 March 2010
Brian Davies, Sam Skipsey, Stephen Burke, Duncan Rand, Richard Hellier, Alessandra Forti, Matthew Doidge, Chris Curtis, Wahid Bhimji, Derek Ross, Jeremy Coles, Daniela Bauer, John Bland, Andrew Washbrook, Andrew Lahiff, Stephen Jones, Gianfranco Sciacca, Dug McNab, Rob Harper, Chris Walker, Mohammad Kashif, Raja Nandakumar, Govind Songara, Ewan Mac Mahon, Santanu Das, Winnie Lacesso, Elena Korolkova, Pete Gronbech, Gareth Smith


 11:00        
Experiment problems/issues (20')        

Review of weekly issues by experiment/VO

- LHCb
Waiting for Data.

BHAM epgr02: GGUS #56064.
- CMS
BHAM, Lancs: SW install on CE; shared SW area slow at Lancs (GGUS ticket). Brunel passes SAM tests but production fails. Glasgow fails 1/2 of its production jobs because file transfers to CERN fail. Brunel jobs not picked up. T1 good. lcg-voms.cern.ch bad but voms.cern.ch good.
- ATLAS

- Other
Lancs moved SW area to SL5. Looking into tarball issue. HW issues leading to slow file IO.

- Experiment blacklisted sites

CMS: Quiet, waiting for data. Squid access via IC; looking to use the Glasgow squid for access, and looking to use other T2s.

ATLAS: Quiet, no production, few analysis jobs.

BDII not issue for low production.

Change to run reconstruction jobs at T2s briefly; worked.

Data distribution, close to 10GB on the OPN.

MCDISK filling at T1.

Consistency check issues between ATLAS and SRM being analysed. Information differs between BDII info and lcg-stmd.

CAMONT jobs at Glasgow filled the /tmp area.

pdf2html segfaulting caused a 51GB file in /tmp.

Camont jobs not running.
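
A minimal sketch of how a site might spot an oversized file such as the 51GB one in /tmp noted above. The function name `find_large_files`, the threshold and the scratch-directory demo are illustrative assumptions, not something discussed at the meeting; in practice the root would be /tmp on a worker node and the threshold far larger.

```python
import os
import stat
import tempfile


def find_large_files(root, threshold_bytes):
    """Return (path, size) pairs for regular files under root that are
    at least threshold_bytes in size, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size >= threshold_bytes:
                hits.append((path, info.st_size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Demo on a throwaway directory (hypothetical file names and sizes).
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "small.log"), "wb") as handle:
        handle.write(b"x" * 10)
    with open(os.path.join(scratch, "big.out"), "wb") as handle:
        handle.write(b"x" * 4096)
    demo_hits = [
        (os.path.basename(path), size)
        for path, size in find_large_files(scratch, threshold_bytes=1024)
    ]

print(demo_hits)
```

Run periodically, a check like this would flag a runaway output file before the partition fills.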
 
- Site performance




 11:20        
ROC update (10')        

ROC update
***************
- Update from on-duty
-- As of this week site testing uses Nagios (ROC instance based at CERN)
-- SAM will continue to run this month for a cross-check of availability figures
-- Country Nagios instances will be used from next month

-- What (if any) are the main changes for sites? Email templates, extra checking? What is the correct portal for sites to check results?


From the EGEE ops meeting:
http://indico.cern.ch/conferenceDisplay.py?confId=86972

The release update: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases

- What were the issues here?

64bit? glite 3.1. YAIM core changes for myproxy config.

WLCG update
*****************
- Machine status
-- You can check via http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHC1. Currently: Access - no beam. Dealing with problems in the main quad circuits of S78.

RAL Tier-1 status
***********************
- Current news & issues

FTS: request for MH to upgrade test endpoint. Need to schedule FTS upgrade: request downtime and drain (a few hours). Need to confirm.

CASTOR OK. AT_RISK for ORACLE and LSF change went well. DB resiliency work ongoing.

Wednesday T1 Liaison meeting in wiki.

Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20100301_EscalationReport_ROCs.html

50491 - on hold. CMS transfers IC-RHUL. Probably jumbo frames issue. Opened in July 09********.
53349 - on hold. Bristol. Publishing vast amount of storage. Opened in November******.
53598 - ATLAS T1. On hold. Channel load change request. wait for data to test?****
53834 - on hold. ECDF old CE. Waiting on second (new) CE?**

 11:30        
gridpp.ac.uk DNS problems (10')        

- Background (issue discovered over the weekend)
- Findings to date (post-mortem underway)
- Additional issues noticed
- Future options
- Switching to an alternate BDII

Had been run separately from the T2. Hardware failed with a kernel panic. Have now extended alias lifetime. Moved to T2 DNS servers. Will upgrade web service this week. EM suggested distributing DNS nameservers to separate subnets. Backup moved to separate computing centre, T2, T1? AMcN will investigate.
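
A rough sketch of how the suggestion to spread DNS nameservers across separate subnets could be checked. The host names and addresses are hypothetical (documentation-only ranges, not the gridpp.ac.uk servers), and the /24 prefix length is an assumption about what counts as "the same subnet".

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(nameservers, prefix_len=24):
    """Group nameserver host names by the IPv4 subnet their address falls in."""
    groups = defaultdict(list)
    for name, addr in nameservers.items():
        # strict=False lets us pass a host address and get its enclosing network
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        groups[str(net)].append(name)
    return dict(groups)


# Hypothetical nameservers; addresses are from reserved documentation ranges.
nameservers = {
    "ns1.example.org": "192.0.2.10",
    "ns2.example.org": "192.0.2.20",
    "ns3.example.org": "198.51.100.5",
}

groups = group_by_subnet(nameservers)
# Subnets hosting more than one nameserver are a shared point of failure.
shared = {net: hosts for net, hosts in groups.items() if len(hosts) > 1}
print(shared)
```

Any subnet that appears in `shared` holds more than one nameserver, so a single network or hardware fault there could take several of them out together.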

 11:40        
AOD (08')        

- A lot of email exchanges
- What conclusions were reached?

ATLAS re-ordering of files:

WB: new release 15.6.3_6 and later has re-ordered output files.

WB to get test datasets ASAP. Will then run tests over these datasets.

WB will repeat later.

RFIO access does not appear better.

Local access appears to be the better method.

TTree cache also looked into; choice to be set on a site-by-site basis. Preparing ongoing tests.

EM would like more realistic tests since data access patterns are a concern.


If AOD turns out to be a quick discussion:
Extra topic: Release validation in UK/I
- The need for more early adopter sites
- Overall release deployment strategy

 11:48        
APEL (10')        

APEL is experiencing unusual peaks of activity around times when most sites publish. For some reasons we are still investigating, connections to the service don't get cleaned properly and cause the service to hang until it finally restarts. As a result, many sites don't see their records published on APEL even if their log does not necessarily show any obvious error.

OK sites are Brunel, Imperial, QMUL, Sheffield, Bristol, PPD.

To keep everyone informed of the daily status of APEL, we will maintain the following page as a permanent feature:

http://goc.grid.sinica.edu.tw/gocwiki/ApelStatus

All relevant information and known issues about the service will be put on this page.


To check use this link: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php

1. Select UKI from left column
2. Select recent dates (from Jan 2010 to March 2010)
3. Under Groupings - select show data for SITE
4. Click the refresh button and scroll down the page.
Few sites are up to date.

PG reports that for Oxford: no errors seen but still confused.
 11:58        
Actions (05')        

See http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items

The main site-wide actions are:

1) To continue checking and updating as required the gstat published values that are summarised here: http://gstat-prod.cern.ch/gstat/summary/GRID/GRIDPP/
PG: SouthGrid appears to have disappeared; ticket is open.
- Is the status now taken from Nagios? Many show critical.
AF: YAIM needs correcting. Site BDII gives errors.
2) Deployment of CREAM and SCAS/glexec
PG: New rpm in glite 3.1 release.
- The expectation is that all sites with >1 CE will deploy a CREAM CE by the end of March
- Other sites should consider setting up a VM based secondary CE
- Sites with CREAM are requested to deploy SCAS/glexec
PG reports: SE problems at Oxford
 12:03        
AOB (01')        

- GridPP24 registration is now open: http://www.gridpp.ac.uk/gridpp24/
WB: Storage workshop is on at the beginning of the week.

WB: CREAM deployment: are VOs happy to use it?

M. CMS and LHCb are using CREAM. ATLAS are not.

SD: Issue with Condor, CREAM-CE still not ready.

JC to check who does Condor CREAM-CE staged rollout

No London site has a CREAM-CE; IC and RHUL are in the process of deployment. Not an issue if all sites do not deploy. The worry is that a problem is solved and there is then a large uptake of usage over a short time.

Talk at 24th GB regarding SCAS/GLEXEC issue.

EM: Oxford's CREAM-CE has C.

Production CREAM-CE does not have SCAS and WNs do not have GLEXEC.

LCG_CE is not on SL5

Chat Window:
[11:07:12] Duncan Rand https://gus.fzk.de/ws/ticket_info.php?ticket=56083
[11:07:32] Jeremy Coles We will return to the first agenda item.
[11:08:02] Mohammad kashif https://sam-uki-roc.cern.ch/myegee
[11:10:44] Daniela Bauer Resource Summary on my egee lists ce00 for Imperial - that machine was decommissioned a week ago
[11:11:04] Duncan Rand https://samnag025.cern.ch/nagios/cgi-bin/avail.cgi?host=ce03.esc.qmul.ac.uk&service=org.sam.CE-JobSubmit-ops&show_log_entries
[11:11:59] Dug McNab https://sam-uki-roc.cern.ch/myegee does not support CREAM CE's by the looks of it
[11:12:25] Govind Songara It would nice to have search option on list tabs..
[11:13:20] Stephen Jones Where is the help on using my-egee?
[11:13:31] Alessandra Forti https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics?topic=LCG.SAMProbesMetrics
[11:13:39] Duncan Rand https://sam-uki-roc.cern.ch/nagios/cgi-bin/status.cgi?host=ce03.esc.qmul.ac.uk&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0
[11:27:14] Gareth Smith http://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting
[11:28:04] Gareth Smith Above link is for the Tier1 Experiments liaison Meeting. Within there is a link to my reports (Operational Status and Issues).
[11:29:20] Dug McNab uploading to CERN? - what site as that for,?
[11:29:31] Jeremy Coles Glasgow
[11:29:56] Dug McNab first we have heard about that
[11:30:17] Dug McNab okay will investigate
[11:33:05] Winnie Lacesso It's not bdii, it's some other problem. Working on it.
[11:39:08] Phone Bridge left
[12:02:56] Gareth Smith left
[12:09:16] Duncan Rand got to go
[12:09:20] Duncan Rand left
[12:09:46] Pete Gronbech I used the gap publisher but still have problems according to http://goc-accounting.grid-support.ac.uk/rss/UKI-SOUTHGRID-OX-HEP_Sync.html
[12:10:39] Gianfranco Sciacca Same for both UCL-CENTRAL and UCL-HEP
[12:13:38] Elena Korolkova I hope we published everything to the date in Sheffield
[12:15:24] Dug McNab what are we doing right? 
[12:16:16] Dug McNab i went through them all with Steve T
[12:16:30] Dug McNab painful
[12:17:21] Dug McNab which one?
[12:17:57] Alessandra Forti thought so...
[12:18:02] Pete Gronbech WARNING: t2se01.physics.ox.ac.uk, GlueSACapability has unknown value, WARNING: t2se01.physics.ox.ac.uk, GlueSACapability has unknown value, InstalledOnline or NearlineSize attribute non existing
[12:18:09] Winnie Lacesso Have to go, will look forward to minutes.
[12:18:14] Winnie Lacesso left
[12:18:41] Wahid Bhimji GlueSACapability has unknown value - is what pete is referring to I think
[12:18:47] Dug McNab Steve T raised this bug https://savannah.cern.ch/bugs/?58513 when we went through it.
[12:19:01] Raja Nandakumar I too have to go. Apologies.
[12:19:06] Raja Nandakumar left
[12:20:05] Pete Gronbech This error for ce's "WARNING: t2ce05.physics.ox.ac.uk:2119/jobmanager-lcgpbs-shortfive, GlueCEPolicyAssignedJobSlots has negative or null value, " is fixed by the latest rpm lcg-info-dynamic-pbs-1.0.13-1
[12:20:20] Dug McNab yes that's right
[12:20:42] Dug McNab no
[12:22:22] Dug McNab so yes they run, but they have to run within 24 hours. So atlas will probably not be switching their pilot factories over
[12:25:57] Dug McNab take a look at
[12:25:57] Dug McNab https://twiki.cern.ch/twiki/bin/view/EGEE/BatchSystems
[12:26:18] Dug McNab that is the page for cream and lcg-ce
[12:26:37] Dug McNab Condor integration is maintained by IFAE (PIC) within SA3.
[12:26:51] Dug McNab https://twiki.cern.ch/twiki/bin/view/EGEE/InstallationInstructionsForCondor
[12:27:08] Dug McNab so yes CREAM supports CONDOR
[12:30:16] Dug McNab SGE support is nearly there I believe
[12:30:29] Dug McNab there is an SGE utils package especially for it
[12:30:36] Wahid Bhimji "nearly there" !
[12:30:49] Dug McNab well I actually think it is complete
[12:30:56] Dug McNab but I can't see the certification page
[12:31:42] Santanu Das I know it supports Condor, just as the support was provided for the previous versions of the CEs.
[12:31:50] Dug McNab yes SGE support is there and complete
[12:31:51] Dug McNab https://savannah.cern.ch/patch/?3458
[12:32:08] Santanu Das I'll have another look.
 
