Tuesday 02 March 2010
Brian Davies, Sam Skipsey, Stephen Burke, Duncan Rand, Richard Hellier, Alessandra Forti, Matthew Doidge, Chris Curtis, Wahid Bhimji, Derek Ross, Jeremy Coles, Daniela Bauer, John Bland, Andrew Washbrook, Andrew Lahiff, Stephen Jones, Gianfranco Sciacca, Dug McNab, Rob Harper, Chris Walker, Mohammad Kashif, Raja Nandakumar, Govind Songara, Ewan Mac Mahon, Santanu Das, Winnie Lacesso, Elena Korolkova, Pete Gronbech, Gareth Smith
11:00
Experiment problems/issues (20')
Review of weekly issues by experiment/VO
- LHCb
Waiting for Data.
BHAM epgr02: GGUS #56064.
- CMS
BHAM, Lancs: SW install on CE. Shared SW area slow at Lancs (GGUS ticket). Brunel passes SAM tests but production fails. Glasgow fails 1/2 of production jobs since file transfer to CERN fails. Brunel jobs not picked up. T1 good. lcg-voms.cern.ch bad but voms.cern.ch good.
- ATLAS
- Other
Lancs moved SW area to SL5. Looking into tarball issue. HW issues leading to slow file IO.
- Experiment blacklisted sites
CMS: Quiet, waiting for data. Squid access via IC; looking to use Glasgow squid for access and looking to use other T2s.
ATLAS: Quiet, no production, few analysis jobs.
BDII not issue for low production.
Change to run reconstruction jobs at T2s briefly; worked.
Data distribution, close to 10GB on the OPN.
MCDISK filling at T1.
Consistency check issues between ATLAS and SRM being analysed. Information differs between BDII info and lcg-stmd.
CAMONT jobs at Glasgow filled /tmp area.
pdf2html segfaulting caused a 51GB file in /tmp.
CAMONT jobs not running.
- Site performance
11:20
ROC update (10')
ROC update
***************
- Update from on-duty
-- As of this week site testing uses Nagios (ROC instance based at CERN)
-- SAM will continue to run this month for a cross-check of availability figures
-- Country Nagios instances will be used from next month
-- What (if any) are the main changes for sites? Email templates, extra checking? What is the correct portal for sites to check results?
From the EGEE ops meeting:
http://indico.cern.ch/conferenceDisplay.py?confId=86972
The release update: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
- What were the issues here?
64bit? glite 3.1. YAIM core changes for myproxy config.
WLCG update
*****************
- Machine status
-- You can check via http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHC1. Currently: Access - no beam. Dealing with problems in the main quad circuits of S78.
RAL Tier-1 status
***********************
- Current news & issues
FTS: request for MH to upgrade test endpoint. Need to schedule FTS upgrade: request downtime and drain (a few hours). Need to confirm.
CASTOR OK. AT_RISK for Oracle and LSF change went well. DB resiliency work ongoing.
Wednesday T1 Liaison meeting in wiki.
Ticket status
***************
https://gus.fzk.de/download/escalationreports/roc/html/20100301_EscalationReport_ROCs.html
50491 - on hold. CMS transfers IC-RHUL. Probably jumbo frames issue. Opened in July 09********.
53349 - on hold. Bristol. Publishing vast amount of storage. Opened in November******.
53598 - ATLAS T1. On hold. Channel load change request. wait for data to test?****
53834 - on hold. ECDF old CE. Waiting on second (new) CE?**
11:30
gridpp.ac.uk DNS problems (10')
- Background (issue discovered over the weekend)
- Findings to date (post-mortem underway)
- Additional issues noticed
- Future options
- Switching to an alternate BDII
Had been run separately from T2. Hardware failed with kernel panic. Have now extended alias lifetime. Moved to T2 DNS servers. Will upgrade web service this week. EM suggested distributing DNS nameservers to separate subnets. Backup moved to separate computing centre: T2, T1? AMcN will investigate.
11:40
AOD (08')
- A lot of email exchanges
- What conclusions were reached?
ATLAS re-ordering of files:
WB: new release 15.6.3_6 and later has re-ordered output files.
WB to get test datasets ASAP. Will then run tests over these datasets.
WB will repeat later.
RFIO access does not appear better.
LOCAL access appears to be the better method.
TTree cache also looked into; choice to be set on a site-by-site basis. Preparing ongoing tests.
EM would like more realistic tests since data access patterns are a concern.
If AOD turns out to be a quick discussion:
Extra topic: Release validation in UK/I
- The need for more early adopter sites
- Overall release deployment strategy
11:48
APEL (10')
APEL is experiencing unusual peaks of activity around times when most sites publish. For some reasons we are still investigating, connections to the service don't get cleaned properly and cause the service to hang until it finally restarts. As a result, many sites don't see their records published on APEL even if their log does not necessarily show any obvious error.
OK sites are Brunel, Imperial, QMUL, Sheffield, Bristol, PPD.
To keep everyone informed of the daily status of APEL, we will maintain the following page as a permanent feature:
http://goc.grid.sinica.edu.tw/gocwiki/ApelStatus
All relevant information and known issues about the service will be put on this page.
To check use this link: http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
1. Select UKI from left column
2. Select recent dates (from Jan 2010 to March 2010)
3. Under Groupings - select show data for SITE
4. Click the refresh button and scroll down the page.
Few sites are up to date.
PG reports for Oxford: no errors seen but still confused.
11:58
Actions (05')
See http://www.gridpp.ac.uk/wiki/Deployment_Team_Action_items
The main site-wide actions are:
1) To continue checking and updating as required the gstat published values that are summarised here: http://gstat-prod.cern.ch/gstat/summary/GRID/GRIDPP/
PG: SouthGrid appears to have disappeared; ticket is open.
- Is the status now taken from Nagios? Many show critical.
AF: YAIM needs correcting. Site BDII gives errors.
2) Deployment of CREAM and SCAS/glexec
PG: New rpm in glite 3.1 release.
- The expectation is that all sites with >1 CE will deploy a CREAM CE by the end of March
- Other sites should consider setting up a VM based secondary CE
- Sites with CREAM are requested to deploy SCAS/glexec
PG reports: SE problems at Oxford
12:03
AOB (01')
- GridPP24 registration is now open: http://www.gridpp.ac.uk/gridpp24/
WB: Storage workshop is on at the beginning of the week.
WB: CREAM deployment: are VOs happy to use it?
M. CMS and LHCb are using CREAM. ATLAS are not.
SD: Issue with Condor, CREAM-CE still not ready.
JC to check who does Condor CREAM-CE staged rollout
No London site has a CREAM-CE; IC and RHUL are in the process of deployment. Not an issue if all sites do not deploy. The worry is that once a problem is solved there will be a large uptake of usage over a short time.
Talk at 24th GB regarding SCAS/GLEXEC issue.
EM: Oxford's CREAM-CE has C.
Production CREAM-CE does not have SCAS and WNs do not have GLEXEC.
LCG_CE is not on SL5
Chat Window:
[11:07:12] Duncan Rand https://gus.fzk.de/ws/ticket_info.php?ticket=56083
[11:07:32] Jeremy Coles We will return to the first agenda item.
[11:08:02] Mohammad kashif https://sam-uki-roc.cern.ch/myegee
[11:10:44] Daniela Bauer Resource Summary on my egee lists ce00 for Imperial - that machine was decommissioned a week ago
[11:11:04] Duncan Rand https://samnag025.cern.ch/nagios/cgi-bin/avail.cgi?host=ce03.esc.qmul.ac.uk&service=org.sam.CE-JobSubmit-ops&show_log_entries
[11:11:59] Dug McNab https://sam-uki-roc.cern.ch/myegee does not support CREAM CE's by the looks of it
[11:12:25] Govind Songara It would nice to have search option on list tabs..
[11:13:20] Stephen Jones Where is the help on using my-egee?
[11:13:31] Alessandra Forti https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics?topic=LCG.SAMProbesMetrics
[11:13:39] Duncan Rand https://sam-uki-roc.cern.ch/nagios/cgi-bin/status.cgi?host=ce03.esc.qmul.ac.uk&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0
[11:27:14] Gareth Smith http://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting
[11:28:04] Gareth Smith Above link is for the Tier1 Experiments liaison Meeting. Within there is a link to my reports (Operational Status and Issues).
[11:29:20] Dug McNab uploading to CERN? - what site as that for,?
[11:29:31] Jeremy Coles Glasgow
[11:29:56] Dug McNab first we have heard about that
[11:30:17] Dug McNab okay will investigate
[11:33:05] Winnie Lacesso It's not bdii, it's some other problem. Working on it.
[11:39:08] Phone Bridge left
[12:02:56] Gareth Smith left
[12:09:16] Duncan Rand got to go
[12:09:20] Duncan Rand left
[12:09:46] Pete Gronbech I used the gap publisher ut still have problems according to http://goc-accounting.grid-support.ac.uk/rss/UKI-SOUTHGRID-OX-HEP_Sync.html
[12:10:39] Gianfranco Sciacca Same for both UCL-CENTRAL and UCL-HEP
[12:13:38] Elena Korolkova I hope we published everything to the date in Sheffield
[12:15:24] Dug McNab what are we doing right?
[12:16:16] Dug McNab i went through them all with Steve T
[12:16:30] Dug McNab painful
[12:17:21] Dug McNab which one?
[12:17:57] Alessandra Forti thought so...
[12:18:02] Pete Gronbech WARNING: t2se01.physics.ox.ac.uk, GlueSACapability has unknown value, WARNING: t2se01.physics.ox.ac.uk, GlueSACapability has unknown value, InstalledOnline or NearlineSize attribute non existing
[12:18:09] Winnie Lacesso Have to go, will look forward to minutes.
[12:18:14] Winnie Lacesso left
[12:18:41] Wahid Bhimji GlueSACapability has unknown value - is what pete is referring to I think
[12:18:47] Dug McNab Steve T raised this bug https://savannah.cern.ch/bugs/?58513 when we went through it.
[12:19:01] Raja Nandakumar I too have to go. Apologies.
[12:19:06] Raja Nandakumar left
[12:20:05] Pete Gronbech This error for ce's "WARNING: t2ce05.physics.ox.ac.uk:2119/jobmanager-lcgpbs-shortfive, GlueCEPolicyAssignedJobSlots has negative or null value, " is fixed by the latest rpm lcg-info-dynamic-pbs-1.0.13-1
[12:20:20] Dug McNab yes that's right
[12:20:42] Dug McNab no
[12:22:22] Dug McNab so yes they run, but they have to run within 24 hours. So atlas will probably not be switching their pilot factories over
[12:25:57] Dug McNab take a look at
[12:25:57] Dug McNab https://twiki.cern.ch/twiki/bin/view/EGEE/BatchSystems
[12:26:18] Dug McNab that is the page for cream and lcg-ce
[12:26:37] Dug McNab Condor integration is maintained by IFAE (PIC) within SA3.
[12:26:51] Dug McNab https://twiki.cern.ch/twiki/bin/view/EGEE/InstallationInstructionsForCondor
[12:27:08] Dug McNab so yes CREAM supports CONDOR
[12:30:16] Dug McNab SGE support is nearly there I believe
[12:30:29] Dug McNab there is an SGE utils package especially for it
[12:30:36] Wahid Bhimji "nearly there" !
[12:30:49] Dug McNab well I actually think it is complete
[12:30:56] Dug McNab but I can't see the certification page
[12:31:42] Santanu Das I know it supports Condor, just as the support was provided for the previous versions of the CEs.
[12:31:50] Dug McNab yes SGE support is there and complete
[12:31:51] Dug McNab https://savannah.cern.ch/patch/?3458
[12:32:08] Santanu Das I'll have a another look.