UKI Monthly Operations Meeting (TB-SUPPORT)

Name: UKI Monthly Operations Meeting (TB-SUPPORT)
Start: 2008-07-31T10:30:00+00:00
End: 2008-07-31T13:15:00+00:00
Location: EVO

Thursday 31 Jul 2008, 10:30 → 13:15 GMT

Description

Monthly review and discussion meeting for those involved with GridPP deployment and operations. To join via EVO go to http://evo.caltech.edu. To join by phone call +41 22 76 71400. The phone bridge ID is 451099 and the code: 4880.

Present: Jens (mins), Ewan, Yves, Brian, Duncan, Derek, Matt, Jeremy (chair+mins), John, Elena, Mike, Sam, Winnie, Santanu, Peter, Graeme, Greig, Phil, David, Andrew, Simon

Uptime discussed: RHUL had not installed host certificates; they will install them soon. Cambridge is upgrading and needs Condor information and documentation, as there are huge differences between series. The Cambridge CE does not accept jobs at the moment (it looks like 6.8.5 was used for testing). Storage is mostly OK: some sites failed to update certificates, and there is a problem with the dCache pin manager.

Blackhole nodes: there are many reasons why this happens; eventually new monitoring is written to catch it, for example when SSH to the head node fails or a node disk is full.

Steve's transfer tests were not close to what the experiments do. Test jobs are being updated to better reflect what applications will do. There is an ATLAS test showing default job locations and failure rates. It is curious that jobs attempt to run when a site is in downtime, but the jobs may have been submitted before the site entered downtime. Jeremy will check the current job flow - is FCR consulted? Santanu asked about the Cambridge failures. Graeme suggested looking at the failure links, as the message contained in them gives a clue.

There was a bug in the accounting portal last week that prevented the Gantt chart display of accounting history. This has now been fixed. Lancaster looks to have a problem following the cert updates.

Experiments issues: LHCb is currently reviewing software; DIRAC upgrade. ATLAS (Graeme): Cosmic data (from sub-systems) transferred T0->T1. Rolling functional tests take place each Monday, with tickets raised against problem sites. ATLAS production had problems with the CASTOR RAC upgrade. 10 TeV MC data is ready to go. For T2s, new or additional space tokens are required: 2 TB ATLASPRODDISK; 0.5 TB ATLASDATADISK; ATLASGROUPDISK, with complicated ACLs (there is a script for DPM to set them); ATLASUSERDISK, scratch space cleaned centrally; and ATLASLOCALGROUPDISK (no garbage collection).
Brian will ticket sites with a best guess for the space requirements in each token (this depends on local resources etc.). Panda will switch to PRODDISK soon. One small issue is that a single space token has 5 associated roles; Graeme has a script for DPM sites to use for this setup. Greig reported that Edinburgh PRODDISK is full again and more data is coming in. Graeme will discuss this offline - it is odd.

No CMS news. SuperNEMO enabled.

CA discussion: Tricky external constraints were imposed due to a suspected root key compromise. Jens explained some of the background, involving peer CAs and non-UK grids trusting the UK CA. The constraints were a compromise between the two means of closing a CA: dropping it immediately or letting everything expire. Thus, we have been reissuing certificates under the new hierarchy since November. The only differences in this case were that certificates were re-signed - a new invention (in the grid context) which should make life easier in the future - and, of course, the shorter timescale (see below).

In the UK the main problems were due to communication: people don't read their emails, and the ones that did felt that the mails did not adequately convey a sense of urgency, a consequence of the mails being sent out later than planned. The second problem was the short timescale, due to delays in signing and partly to planning that was not adequate for the task: the core task of signing, which ordinarily would have taken a few hours at most, took a week. One difficulty was to ensure that certificates went to the helpdesks. The final set of problems were incidental. A VOMS bug hit some VOs whose admins had not deployed the bugfix; Jens followed up on behalf of some VOs. Some certificates were re-signed which should have been revoked, which caused some confusion.
A few were left out; this was discovered only on Friday afternoon after Ewan reported a missing Oxford host cert. The cause was possibly a gap in the original SQL query, perhaps corresponding to a single day's certificates.

Jeremy asked if the deadline for switching could have been brought forward by 1-2 weeks (before CRL expiry). Graeme thought a central problem was the lack of warning and that most users were clueless about the impact, made worse by the 48hr changeover window. Stephen remarked that the normal renewal was reasonable. Several in the meeting were happy about the new renewal process (copying the CRT is easier than copying and extracting across all WNs, for example). Jens explained that it would be possible to subscribe to host certs which could be renewed for up to 3-5 yrs. Suggestions on how to improve communications will be discussed.

Further policy changes will be required to bring the UK back into compliance with the changing minimal requirements for Grid CAs. The re-signing, as opposed to normal renewal, was a positive sign of this change. The new requirements will also impose a need for re-identification with RAs every N years (where N is 3 or 5 or so, depending on how the private key is stored).

How to check whether a certificate matches a private key? Jens will send a recipe to tb-support (ACTION).

Purchases are to be documented to enable other sites to benefit from experiences. There are also new benchmarking tests; universities will need to buy licenses but may have some already.

Oxford (and several other sites) are seeing odd connection attempts on specific ports on grid service nodes, with some 300 bytes transferred every 2 hrs or so. Mingchao is looking into it. See the mail to gridpp-storage for further information.

Which version of torque/maui are people running? See responses in chat transcript below, around 12:02.
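On the question above of checking whether a certificate matches its private key, one common approach for RSA keys is to compare the modulus that `openssl` reports for each file. The Python sketch below wraps that comparison. It is not the recipe Jens was actioned to circulate; the function names are illustrative, and it assumes an RSA key that is not passphrase-protected and an `openssl` binary on the PATH.

```python
import subprocess


def parse_modulus(output: str) -> str:
    """Extract the hex modulus from an openssl 'Modulus=...' line."""
    line = output.strip()
    prefix = "Modulus="
    if not line.startswith(prefix):
        raise ValueError("unexpected openssl output: %r" % line[:40])
    return line[len(prefix):].upper()


def openssl_modulus(kind: str, path: str) -> str:
    """Run 'openssl x509' or 'openssl rsa' with -noout -modulus on a file."""
    if kind not in ("x509", "rsa"):
        raise ValueError("kind must be 'x509' or 'rsa'")
    result = subprocess.run(
        ["openssl", kind, "-noout", "-modulus", "-in", path],
        capture_output=True,
        text=True,
        check=True,  # raise if openssl cannot read the file
    )
    return parse_modulus(result.stdout)


def cert_matches_key(cert_path: str, key_path: str) -> bool:
    """True when the certificate and the RSA private key share a modulus."""
    return openssl_modulus("x509", cert_path) == openssl_modulus("rsa", key_path)
```

A site would call, for example, `cert_matches_key("/etc/grid-security/hostcert.pem", "/etc/grid-security/hostkey.pem")`; those paths are the conventional grid host certificate locations and are an assumption here, so adjust them to the local installation.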
[10:27:14] Jeremy Coles joined
[10:27:20] John Bland joined
[10:28:18] Elena Korolkova joined
[10:29:30] Mike Kenyon joined
[10:29:33] Sam Skipsey joined
[10:30:14] Winnie Lacesso joined
[10:31:03] Santanu Das joined
[10:31:19] Peter Love joined
[10:35:11] Graeme Stewart joined
[10:35:21] Santanu Das hang on, probably you can't hear me
[10:35:40] Santanu Das I'm just trying to fix the audio
[10:35:48] Greig Cowan joined
[10:36:05] Alessandra Forti joined
[10:36:09] Simon George joined
[10:36:22] IPPP1 Durham joined
[10:37:02] Chris Brew joined
[10:37:06] Phone Bridge joined
[10:37:19] Simon George '409429' on the agenda is not a valid evo meeting id. Could someone check it please?
[10:39:12] Ewan Mac Mahon left
[10:39:12] Ewan Mac Mahon joined
[10:39:14] Ewan Mac Mahon left
[10:39:33] Jeremy Coles It is 451099
[10:40:22] Brian Davies http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring
[10:43:12] Stephen Burke joined
[10:43:21] Rob Fay joined
[10:43:49] Phone Bridge joined
[10:51:26] Ewan Mac Mahon Essentially you fail the tests, but if you're in downtime no-one minds.
[10:51:37] Derek Ross Downtime should stop COD from opening tickets about failing tests
[10:51:49] Ewan Mac Mahon But if the RB is matching jobs to a downed site it shouldn't.
[10:56:30] Jens Jensen New CASTOR GIP is ready - publishing everything
[10:56:53] Jens Jensen Not in production yet at RAL
[10:58:39] Winnie Lacesso I've lost all sound, has anyone else?
[10:59:00] Jens Jensen I;ve got sound and I'm on a US Panda...
[10:59:20] Andrew Elwell joined
[10:59:38] Andrew Elwell Hi Gang - Sorry I'm late
[11:27:45] Simon George is it just me or has it gone quiet?
[11:28:22] Jeremy Coles Ok for me
[11:28:23] Phone Bridge left
[11:29:15] Phone Bridge joined
[11:30:07] Simon George phone bridge line went dead. I've redailed and can hear you again
[11:31:51] IPPP1 Durham That was Phil Roffe and David Ambrose-Griffith
[11:32:12] Peter Love left
[11:36:55] Jeremy Coles We'll continue for another 15 mins. Is there any other AOB?
[11:40:34] Ewan Mac Mahon Seconded. I didn't need to run the new certs through a browser at all.
[11:52:38] Winnie Lacesso Jeremy 0- so do all sites to run this benchmark need to buy a Spec2006 license??
[11:54:43] Winnie Lacesso Or if we have CPU ENNN & MM GB RAM & someone publishes Spec2006 results for that, can we use (publish) their results for free?
[11:56:13] Andrew Elwell Ewan - Do you run anything like snort or iptables rules to pick these up or just log greppage?
[11:58:24] Graeme Stewart left
[11:59:08] Winnie Lacesso Can anyone check on Rollout if it's UK only or wider than that?
[11:59:45] Andrew Elwell drop all connections from them and see if they complain?
[11:59:51] Andrew Elwell /bofh
[12:00:31] Chris Brew I don't see that IP in globus-gatekeeper.log or catalina.out on dCache at RALPP as a numeric IP. Should I if I've been contacted
[12:02:19] Ewan Mac Mahon Back now - sorry
[12:02:24] Chris Brew glite
[12:02:30] IPPP1 Durham glite
[12:02:31] Rob Fay also glite
[12:02:40] Chris Brew since I think thay've taken steve's fixes
[12:02:41] Derek Ross our own builds for the server, clients are glite
[12:03:54] Winnie Lacesso left
[12:03:55] IPPP1 Durham left
[12:03:56] Ewan Mac Mahon Bye.
[12:03:57] Chris Brew left
[12:03:58] Brian Davies left
[12:03:58] Andrew Elwell left
[12:03:59] Derek Ross left
[12:04:00] Sam Skipsey left
[12:04:02] Phone Bridge left
[12:04:02] Phone Bridge left
[12:04:02] John Bland left
[12:04:03] Elena Korolkova left
[12:04:03] Matthew Doidge left
[12:04:03] Mike Kenyon left
[12:04:05] Duncan Rand left
[12:04:05] Rob Fay left
[12:04:06] Ewan Mac Mahon left
[12:04:06] Stephen Burke left
[12:04:07] Yves Coppens left
[12:04:13] Alessandra Forti left

- 1
  
  Site stability
  
  - Regular look at current monitoring results. This morning's picture:
  -- SAM (http://pprc.qmul.ac.uk/~lloyd/gridpp/ukgrid.html)
  --- RHUL
  --- UCL - known
  --- Cambridge
  --- RAL-PPD
  -- Storage availability
  --- http://www.gridpp.ac.uk/wiki/GridPP_storage_availability_monitoring
  -- ATLAS tests
  --- IC second cluster
  --- RHUL
  --- UCL
  --- Manchester second cluster
  --- Cambridge
  --- RAL (recent)
  -- LHCb tests
  --- IC
  --- QMUL
  --- UCL
  --- Durham
  --- Glasgow
  --- Culham
  --- RAL-PPD
  -- Transfer tests (status)
  --- Current tests are being moved to a new scenario. Present results are therefore less useful: http://pprc.qmul.ac.uk/~lloyd/gridpp/nettest.html
  -- UK wide tests
  --- http://pprc.qmul.ac.uk/~lloyd/gridpp/uktest.html
  --- Increased ATLAS work has led to changes in ordering but overall a similar pattern.
  --- There is a high failure rate at Durham, Cambridge and Brunel
  -- Accounting
  --- http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.php
  --- Charts were missing from production portal last week
  --- EFDA-JET not publishing since June
  --- IC-LeSC several weeks behind
  --- Lancaster?
  --- Oxford?
  -- If time permits a quick look at the main reasons for availability/reliability problems from: http://www.gridpp.ac.uk/wiki/SAM_availability:_October_2007_-_May_2008.
  --- No entries for: UCL-HEP; Lancaster; Manchester; Sheffield; Birmingham; Cambridge; EFDA-JET; RALPP; Tier-1 and csTCDie
  -- B Britton noted this morning that the UK picture does not look very healthy in GridMap: http://gridmap.cern.ch/gm/. It looks like many sites are in maintenance. Is there any attempt within Tier-2s to schedule downtimes with a view towards Tier-2 availability?
- 2
  
  Experiment progress and plans
  
  - Review of what has been happening and what happens next.
  - LHCb
  -- Currently reviewing software installations across sites
  -- SAM tests now on SL pages. Off with move to DIRAC3
  - ATLAS
  --
  - CMS
  --
  - Other VOs
  -- superNEMO have started picking up activity across the sites (for some reason not in APEL)
  -- CDF are wishing to be re-enabled at supporting sites
  -- UKQCD are working with RHUL (Duncan et al.) in the first instance as they have high memory requirement jobs
  -- The gridpp VO should be enabled at all sites by now!
- 3
  
  Update on CA matters
  
  - Additional information surrounding the UK CA certificate changes
  - Opportunity to raise any problems or concerns
  -- Communications are one area to be looked at
- 4
  
  Hardware purchases & middleware upgrades
  
  HARDWARE:
  - All? sites now have their GridPP hardware grants
  - The PMB wants to encourage sharing of procurement information
  - The following page has been set up: http://www.gridpp.ac.uk/wiki/Guidance_and_recent_purchases. Please use it!
  - One area of concern is the use of a new benchmark - SPECall_cpp2006 is made up of 3 apps from specint and 4 from specfp (7 apps); it can be run in 6h, but there are no published values. The proposal is to use the cpp benchmark; a script will be made available.
  - For Pete G's summary see: http://www.gridpp.ac.uk/wiki/GDB-July_2008
  - For the GDB talk see: http://tinyurl.com/5z349j
  MIDDLEWARE:
  - Anything to discuss?
  - Release news can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingGliteReleases
  - PPS release information: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps (includes CREAM and glexec modules)
- 5
  
  AOB
  
  - Recent port scans (discussed on storage list). Check your logs.
  - Please make progress with Nagios!
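
For the "check your logs" item above, one simple way to look for repeated connection attempts is to count how many log lines mention a given source address. The Python sketch below does this with plain pattern matching. The function name, the sample log lines and the addresses (taken from the reserved documentation ranges) are all made up for illustration; real log formats differ between services, so treat this as a starting point only.

```python
import re
from collections import Counter

# Matches dotted-quad IPv4 addresses anywhere in a log line.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def count_source_addresses(lines, suspects):
    """Count how many log lines mention each suspect address."""
    suspects = set(suspects)
    hits = Counter()
    for line in lines:
        # Use a set so an address repeated within one line counts once.
        for addr in set(IPV4.findall(line)):
            if addr in suspects:
                hits[addr] += 1
    return hits


# Hypothetical log lines, not taken from any real service.
sample = [
    "Jul 31 08:00:01 host example-service: connection from 192.0.2.10 port 2811",
    "Jul 31 10:00:02 host example-service: connection from 192.0.2.10 port 2811",
    "Jul 31 10:05:00 host example-service: connection from 198.51.100.7 port 22",
]
print(count_source_addresses(sample, ["192.0.2.10"]))
```

In practice the lines would come from iterating over an open log file, and the suspect list from whatever addresses were reported on the storage list.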