Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the Janet(UK) Community area.
- Direct EVO link: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=eseieIvnv9aiaeIla8Is
- The phone bridge number is +44 131 474 4520 (CERN number +41 22 76 71400). The phone bridge ID is 154732 with code: 4880.

Apologies: Jeremy C
 
Experiments
===========
 
LHCb
----
 
Raja.
Nothing much to say - running a restripping, should ramp up this afternoon.  Should see a few more jobs, but nothing major expected.  No simulation, because of a few bugs; these should be fixed, so more MC jobs, again probably this afternoon.  All sites working fine.
 
 
CMS
---
 
Daniela.
CMS also getting tickets about Squids not updated.  CMS has a tendency to assign these tickets to random people, but Daniela will chase those out to the right people.
 
ATLAS
-----
 
Elena.
 
CERN moved to SL6 today, some problems observed, under investigation.
ATLAS moving to new Squid monitoring.  Email on TB-Support about this (attached to minutes).  For sites that have to do updates, May 13th, Alister will open GGUS tickets about this.
Some discussion on how to limit pile jobs - at the moment done with a memory limit, setting it to 3GB.  (Sam noted that it doesn't block all pile jobs, just the problematic ones)
Further observation on that one - but looks good so far.
 
Problem at Lancaster, 'can't verify credential' - Sheffield also saw this; down to a fetch-crl problem - after yaiming DPM post WebDAV install, the cron job was odd, and didn't work well.
 
OXFORD: GGUS 93817 : User unable to access files because of "no local mapping"
 
Other VOs
---------
 
T2K LFC is not a global LFC (as it probably should have been).
 
Janusz has a Dirac instance up, and has submitted a few test jobs.
 
gridpp.ac.uk VO to be decommissioned - only supported by Imperial at the moment.  To be removed sometime this week.
 
 
Meetings and updates
====================
 
Tier-1
------
 
Outage scheduled for CASTOR tomorrow - switching the database behind the service.  Switching the primary and standby servers, both to test the operational procedure and because the plan is to move the standby database across to the Atlas building.  Moving the servers, to make it possible to physically move the racked kit.
 
Working on the T2K LFC (as mentioned above).
 
WLCG ops coordination
---------------------
 
Alessandra giving a presentation tomorrow on SL6 status.  Asked for RAL status on SL6?  Gareth to remind.
 
Documentation
-------------
 
Kashif wants to remove a KeyDocs page on SAM documentation (to be removed totally).  No objections, so Andy said he'd action that.
 
Monitoring
----------
 
Problem with Nagios yesterday - from Imperial Top BDII, so some intermittent failures in storage tests.  This was a test failure, so shouldn't be reported as a site failure.
 
ROD
---
 
No ROD rota from now onwards.  No handover received (Stuart to resend).  Daniela will cover this week, and Jeremy to establish a new rota.
 
Rollout
-------
 
Daniela wanted to draw attention to a ticket from Kashif.  When you update a CREAM CE to EMI-3, it pulls in the new APEL parser - which is not compatible with the EMI-2 APEL box.  However, you can't run both EMI-2 and EMI-3 APEL boxes, so at the moment it looks like an all or nothing upgrade. GGUS 93805, and documented on https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3
 
Some discussion over whether the EMI-3 CE will pull in TORQUE - seemed to at QMUL, but not at Oxford.
 
Tickets
-------
 
As per https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
 
GGUS 93833: Daniela notes that there's no point giving EGI the hosts behind the aliases, that nothing good can come of it.
GGUS 93493: Sam has eliminated everything that's different with a node that works for biomed.  The mystery deepens.
GGUS 92590: Looks fixed, need to prod LHCb to confirm (or deny).
 
Site roundtable
===============
 
Andy McN for Manchester: Nothing to add.
Andy W for Edinburgh: SL6 CE testing underway.  Once the CE is up, test queue time.
Chris for QMUL: Dan reckons it'll take 6 months to get SL6 up and working.  Downtime for rearrangement this week, to give space for more storage.  Working on Argus.
Daniela for Imperial: Nothing exciting.  Started deploying SL6 test queue - LHCb seem to use it.  If that works, they'll expand it. Raja will email her.
Elena for Sheffield: The fetch-crl problem with WebDAV on DPM.  Xrootd also installed on pool nodes - needs just head node.  Also need to re-do Argus install. SL6 problem due to shared worker nodes, so need to confer before doing much there.
Ewan for Oxford: Should be up and running, and available for testing.  Outstanding question on how to configure it for ATLAS.  Squid upgrade is on the near term todo list.  Should be fairly straightforward.  Mostly normal operations.
Kashif: Upgraded monitoring box at Lancaster to latest release, will update Oxford soon.
Govind for RHUL: Nothing to report.
Steve for Liverpool: SL6, trying to get a baseline in place.  Working on a worker node and torque server.
John for Cambridge: Done squid update - straightforward.  SL6 'moving up priority list' - but not at the top yet.
Mark M for Glasgow:  Aircon and power.  The facilities here are awesome.
Mark S for Birmingham: Had some trivial DPM issues - some services falling over, so cron'd restart.  SL6 wise, letting others trailblaze.
Matt for Lancaster: SL6 problem with shared cluster, so will move the other cluster shortly.
Rod for RALPP: LDAP failure took out site over weekend, now fixed.
 
Alessandra wondered if 4 sites doing SL6 is enough? Sam suggested that it was; although it was noted that 3 of them were tarball rather than RPM WN'd sites.

AOCB
====
 
Stuart asked if it's OK to gap publish with APEL.  Alessandra thought that the backlog was done - Ewan thought that there was a ticket, and we should poke that for an update.  John recalled a broadcast saying to raise a ticket if you wanted to gap publish, which he did, and was told to publish. Someone at Glasgow will poke the ticket - GGUS: 93183
 
 
Chat log
========
 
[11:13:20] Mark Slater I haven't been keeping up recently so can't really comment!
[11:13:26] Daniela Bauer Hang on ...
[11:13:30] Daniela Bauer my microphone...
[11:14:49] Daniela Bauer For some reason I have to restart it at least once after joining the meeting...
[11:16:48] Sam Skipsey We should say, interestingly, that we still get one or two pile jobs now (it looks like a very small number of them are 4GB). Those ones aren't a problem, though.
[11:16:56] Sam Skipsey ...less than 4GB, I mean
[11:17:13] Sam Skipsey (Evo doesn't like my less-than symbol and removed it)
[11:17:30] Alessandra Forti \
[11:17:36] Christopher Walker I'm having a problem with Elena's sound. Is it just me on the phone bridge?
[11:17:46] Sam Skipsey No, she's cutting in and out a little.
[11:17:58] Alessandra Forti I can hear enough though
[11:18:17] Sam Skipsey Yeah, she's understandable.
[11:22:00] Raja Nandakumar Andy Mc Nab is with me (Raja Nandakumar)
[11:22:09] Raja Nandakumar Is someone speaking?
[11:22:25] Ewan Mac Mahon Raja: yes, Alessandra and Chris
[11:23:15] Christopher Walker http://pprc.qmul.ac.uk/~walker/votable.html
[11:23:32] Christopher Walker is my ldap query
[11:24:37] Ewan Mac Mahon I did rather think it was just 'camont'
[11:25:22] Andrew McNab Raja is here on this one now too (audio problems before)
[11:25:30] John Hill "camont" should be kept
[11:26:11] Ewan Mac Mahon Right, so this is basically a VOMS server cleanup, not that the 'real' VO is going away.
[11:34:52] Steve Jones Nothing to report from me.
[11:35:09] Steve Jones Yes - give the link.
[11:35:26] Steve Jones If anyone objects, pipe up. I can't (mic broke)
[11:35:33] Gareth Roy Have to go, UPS maintenance
[11:36:02] Mohammad kashif https://www.gridpp.ac.uk/wiki/SAM_documentation
[11:36:39] Stuart Purdie Nothing to say
[11:38:10] Ewan Mac Mahon So, to be clear - the tests failed to work, but they reported (correctly) that it was a test failure.
[11:38:20] Ewan Mac Mahon They didn't claim that the site's storage was failing.
[11:38:38] Govind Songara Kashif RHUL CE still fails.. can you suggest what wrong
[11:38:42] Ewan Mac Mahon We should keep an eye on it, but I think it /should/ be OK.
[11:40:36] Stuart Purdie So minuted.
[11:41:54] Mohammad kashif https://ggus.eu/ws/ticket_info.php?ticket=93805
[11:42:06] Daniela Bauer thanks
[11:42:16] Daniela Bauer https://www.gridpp.ac.uk/wiki/Staged_rollout_emi3
[11:44:30] Christopher Walker https://ggus.eu/tech/ticket_show.php?ticket=93776
[11:45:12] Ewan Mac Mahon Also, I thought I'd emailed you that, Chris, last Wednesday evening? Did I get spam binned?
[11:45:50] Christopher Walker No, I'd forgotten - and probably just didn't believe you - after all that change caused it to work for me.
[11:46:13] Christopher Walker Can you send me your list of RPMs please.
[11:46:42] Ewan Mac Mahon if you've got that email from Wednesday, it's the attachment
[11:47:27] Ewan Mac Mahon On a whole other thing, there is an option in the SeeVogh setting to turn that 'bing' noise on incoming chat messages off. Just in case anyone was wondering.
[11:50:25] Alessandra Forti http://www.sysadmin.hep.ac.uk/svn/fabric-management/certificates/x509/check-node
[11:51:01] Mohammad kashif Govind, cream2.ppgrid looks OK to me.
[11:51:12] Alessandra Forti this script is still operational and checks the sanity of the default certificates and CRLs
[11:51:33] Alessandra Forti on DPM it doesn't do a complete job though due to the number of certificates used
[11:52:10] Sam Skipsey So, I basically did that via the manual process of crosschecking lists of certs and crls on two nodes.
[11:52:45] Sam Skipsey but I see no differences (and it's not precisely a cert problem - the client appears to open a connection to the gridftp daemon, it just doesn't get any data from it)
[12:02:22] Andrew McNab Hi Daniela - Raja here
[12:02:36] Andrew McNab Could you remind me what is the CE for the sl6 WNs please.
[12:02:41] Andrew McNab Thanks!
[12:04:40] Ewan Mac Mahon That sounds good. Which is to say that it sounds like what we did
[12:10:50] Daniela Bauer Hi Raja
[12:10:52] Daniela Bauer cetest00.grid.hep.ph.ic.ac.uk:8443/cream-sge-grid6.q
[12:11:08] Daniela Bauer And yes, it's a tarball as well
[12:11:33] Matt Doidge The tarball dev isn't to be trusted either :-D
[12:11:49] Ewan Mac Mahon And for completeness, everything on t2ce02.physics.ox.ac.uk is an SL6 system.
[12:12:06] Andrew McNab Thanks Daniela - it seems to pass the SAM tests fine.
[12:12:16] Andrew McNab So should be okay for LHCb
[12:12:41] Ewan Mac Mahon In fact the Oxford system is also all EMI3
[12:13:20] Daniela Bauer @Raja: How about a real job ?
[12:14:10] Ewan Mac Mahon *tumbleweed*
[12:14:53] Alessandra Forti https://ggus.eu/ws/ticket_info.php?ticket=93183
[12:15:01] Matt Doidge Yep, that's the one I just dug up
 

    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest - Tier-1 status - WLCG ops coordination - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - Site updates
    • 11:40 12:00
      Site roundtable 20m
      - Current activities and priorities at each site
    • 12:00 12:05
      Actions & site updates 5m
      https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
    • 12:05 12:06
      AOB 1m