Operations team & Sites

EVO - GridPP Operations team meeting

- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the GridPP Community area.
- The phone bridge number is +44 (0)161 306 6802 (CERN number +41 22 76 71400). The phone bridge ID is 608 0563 with code: 4880.

Apologies: Mark


Last week's minutes


Ewan notes they'll be out soon!










Raja: Nothing to report, all good!  Awaiting the new conditions DB before starting the next round of reprocessing.  A little bit of MC coming around, not much.  Running smoothly.





Duncan: Stuart has now left, and the transfers are still being worked out.  Daniela will be taking over the PhEDEx; Brunel might run one.  Some availability problems at Brunel - HammerCloud results seem to be poor.  A problem with the 'site local config file' stored at CERN.  Software and computing week at CERN for CMS this week.





Alessandra:  Not much about sites - a few tickets open, but all being worked on.  2 for ECDF about SRM, and transfers from FZK - due again to the FZK firewall.  Brian has opened another ticket - should probably consolidate those.  Compiled a report for the availability this morning - only ECDF is below 100%, and that's due to scheduled downtime.  This time we are all green, which is great, and a rare occurrence - last time was February.  EMI upgrade: Cambridge is all green, and RHUL should be too - awaiting confirmation - Govind notes that a part of the old cluster isn't quite all upgraded yet. Manchester is now all EMI, apart from storage.


Birmingham was tested 6 months ago, but doesn't appear to be online. Mark: Tried the UMD WN 6 months ago, had trouble with ATLAS production, so went back to gLite.  He is off to CERN, and when he gets back will do the DPM head node and EMI-2 WN.  Is the EMI-2 WN considered to be all good?

Alessandra: ATLAS yes; other VOs, not official, but no problems seen.  ILC, PHENO, DZero at least work.

Mark: NA62 problem, simple fix once Dan is back, and then will update.  

(It appears that the NA62 problem is the only known case at present, other than one with dCache that Daniela notes).


Other VOs



Chris: T2K: Asking about CVS on TB-Support.  It is on the VO card; suspect that people are re-installing, and missing the odd dependency.

Ewan notes that the current process is basically: sites break, VOs ticket, sites fix.  Which isn't perhaps great, but works.


Possibility of more structured data on the VO card, to allow for more automation of desired RPMs.  Flagged for further discussion at HEPSYSMAN.
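
As a starting point for that discussion, the idea of structured VO card data could be sketched roughly as below. This is only an illustration: the data layout, the function name `packages_for_site` and the OS keys are all assumptions, and the only package listed (`cvs` for t2k) is the one mentioned in these minutes. Packages are keyed per OS because package names are not necessarily the same across SL5, SL6 and Debian.

```python
# Hypothetical structured VO-card data: required packages per VO, keyed by OS.
# The layout and OS keys are assumptions made for illustration only.
VO_REQUIREMENTS = {
    "t2k": {"sl5": ["cvs"], "sl6": ["cvs"], "debian": ["cvs"]},
    "na62": {"sl5": [], "sl6": []},  # no Debian entry recorded for this VO
}


def packages_for_site(supported_vos, os_name, requirements=VO_REQUIREMENTS):
    """Return (sorted package list, VOs with no data for this OS)."""
    needed = set()
    missing = []
    for vo in supported_vos:
        per_os = requirements.get(vo)
        if per_os is None or os_name not in per_os:
            # No structured data: a human still has to read the VO card.
            missing.append(vo)
            continue
        needed.update(per_os[os_name])
    return sorted(needed), missing


if __name__ == "__main__":
    packages, unknown = packages_for_site(["t2k", "na62"], "debian")
    print("install:", packages)
    print("check the VO card by hand for:", unknown)
```

A site could feed the resulting list into whatever configuration management it already uses; the list of VOs without data shows where the VO card would still need to be read by hand.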



UK NGI Update



No presentation, just 2 issues from John.





Using 'Notify sites', and plugging holes in the system.  If the ticket is assigned to a site, then there are emails when durations are exceeded.  If it's Notified, then there aren't.  This went to the CERN co-ordination board, and Notified sites will now get those emails.
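
The change in reminder behaviour can be written out as a small rule. This is a minimal sketch of the logic as described in these minutes, not GGUS code: the function name, its parameters and the example threshold are all hypothetical.

```python
from datetime import timedelta


def should_send_reminder(ticket_age, threshold, assigned_to_site,
                         notified_site, remind_notified=True):
    """Decide whether a site gets a reminder email for an overdue ticket.

    remind_notified=False models the old behaviour (only assigned sites
    were reminded); remind_notified=True models the agreed change.
    """
    if ticket_age <= threshold:
        return False  # duration not exceeded yet, nobody is reminded
    if assigned_to_site:
        return True   # assigned tickets were always reminded
    return notified_site and remind_notified


if __name__ == "__main__":
    overdue = timedelta(days=10)
    limit = timedelta(days=5)  # hypothetical threshold
    print(should_send_reminder(overdue, limit, False, True, remind_notified=False))
    print(should_send_reminder(overdue, limit, False, True))
```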


gLite updates



CSIRT's now handling the issue, given that it's going through the security dashboard.  We don't expect any UK sites to be affected, unless some new tests are added.  (Stuart notes that there's a new dCache test expected to be online in a day or so).



Bulletin updates



VOMRS Validator



Jeremy: Appears to be now understood as not an issue.  John: Agrees, but needs to be solved in the longer term, as it's a fudge at the moment.


WN Tarballs



Overview from Matt: We have tools to build a regular tarball and the ATLAS CVMFS tarball.  However, the tarball is meant to be separate from the main line.  It works for ATLAS grid jobs - but it's not guaranteed for that purpose.  So we have to build one ourselves.  Might put it into CVMFS, or distribute normally - discussion to occur on that point.





Is on Friday.  Pete: Has been arranging things for the last few slots.  John Green, the new security guy, is in for one of the talks, along with a few other people.


(John Gordon wonders if John Green is invited to this meeting - Jeremy to action on that).


WLCG lifecycle process



Presented briefly to the GDB, it's a document that needs feedback.  Core ops team may be expected to contribute.


EMI CREAM CE



Stephen: Rerunning YAIM on CREAM can make it lose jobs.  There is a workaround, but it is not optimal.  There were rumours of an update, but no exact knowledge.  The workaround needs to be applied every time YAIM is run.






Correlated failures of the SRM test, with a "can't connect" message.  Affects all UK sites, on Fri / Sat - Gareth is following up, but no clear cause at this point.  These are the tests run centrally from CERN; the tests run from Oxford are fine.  Appears to be some networking problem close to CERN, which affects the UK quite badly, but is also visible in other clouds.





Do we need the Approved VOs document to set out the software needs for the VOs?

Chris: Probably this data should be in the CIC portal, and not a separate place.


Rollout Status



National overview page updated for WNs. Please check your site information!






perfSONAR service types are defined in GOCDB, so at some point sites will be asked to list them






Unsupported gLite tickets:

Matt: Little too quiet:  Bristol, Cambridge and Brunel?

Jeremy: Cambridge updated this morning.



t2k 86690
t2k missing metrics in Ganglia.  Looks like it's solved,  but Gareth suspects that there's still a problem with historical data.


Durham 68853

SL4 retirement - Mike is aware of what's needed.


Birmingham 88009

Came down to heavily loaded cluster.


Other VOs



Mark: Has updated the front page of the GridPP wiki for the EMI-2 UI, which should be what's needed for a UI install, including the site-info.def stuff.



DPM Status



CERN provided some manpower, along with Taiwan and the UK.  Specific plans on who works on what are not yet set.



Chat window follows:

[11:03:12] Ewan Mac Mahon joined

[11:03:15] John Hill joined

[11:03:19] Gareth Roy Mark is looking after a sick child today

[11:03:23] Ewan Mac Mahon Hello?

[11:03:35] Mohammad kashif joined

[11:03:41] Sam Skipsey joined

[11:04:11] Robert Frank left

[11:04:18] Mark Slater joined

[11:04:37] Ewan Mac Mahon And just in case it's useful; EVO player seems not to work with IcedTea java any more  

[11:04:45] Jeremy Coles Stuart is taking minutes.

[11:05:23] Govind Songara joined

[11:05:37] Wahid Bhimji joined

[11:06:04] Ian Collier joined

[11:06:14] Matt Doidge joined

[11:06:21] Pete Gronbech joined

[11:11:17] Jeremy Coles Also CERN: EMI-2 WN deployed in preprod (~10% of the farm), allowing ATLAS to verify compatibility also with the EOS SRM (BeStMan)

[11:12:58] Ewan Mac Mahon Oxford is now 100% EMI2

[11:13:02] Ewan Mac Mahon (on the WNs)

[11:13:13] Ewan Mac Mahon So far, so (mostly) good.

[11:13:30] Ewan Mac Mahon Only known problem was the t2k one where I'd just neglected to install some stuff.

[11:13:36] Daniela Bauer There is still an issue with dCache though:

[11:13:39] Daniela Bauer https://ggus.eu/tech/ticket_show.php?ticket=87065

[11:13:47] Jeremy Coles T1 checks on small VOs using the EMI-2 WNs that were in test did not show any problems for those that had run. 

[11:13:56] Andrew McNab joined

[11:13:57] Andrew McNab left

[11:13:59] Daniela Bauer I rediscovered that myself when I did a test (non-tarball obviously) install.

[11:14:38] Queen Mary, U London London, U.K. joined

[11:16:03] Ewan Mac Mahon VOMSsnooper 

[11:16:20] Ewan Mac Mahon And more structured metadata in the Ops Portal.

[11:17:42] Brian Davies joined

[11:19:39] Ewan Mac Mahon It's slightly tricky to structure it since strictly speaking the WNs don't need to be running SL at all - they could be SL5, SL6 or Debian, and the package names aren't necessarily the same across them all.

[11:20:51] Jeremy Coles Suggestion to send GGUS reminders for outstanding tickets to "Notified Sites" as well (not only ROCs/NGIs).

[11:21:11] Govind Songara I would appreciate it if someone could comment on my email titled "software are server specs"

[11:22:02] Ewan Mac Mahon Govind - I've been meaning to do that; will try to actually do it shortly.

[11:23:01] Ewan Mac Mahon Short version though is that it doesn't need to be all that fast, but don't make it a non-RAID box on the grounds that you're fairly stuffed if it breaks.

[11:23:07] Ewan Mac Mahon (IMO)

[11:25:45] Govind Songara Thanks Ewan, What about atlas local area, temporary can i move it to non-raid ?

[11:26:42] John Gordon Ticket on GGUS reminders https://savannah.cern.ch/support/index.php?131988

[11:26:55] Ewan Mac Mahon Well, technically no-one else cares if it's RAID or not - the RAID is really for your benefit. As far as the users are concerned, basically any filesystem that gets NFS mounted onto the WNs will do them just fine.

[11:27:02] John Gordon comment #9 contains links to slides and minutes

[11:27:11] Ewan Mac Mahon The point of the RAID is to avoid a single disk failure killing your entire site.

[11:27:43] Ewan Mac Mahon Which I think is worth doing, but you can use a single machine if you don't mind taking the gamble, and if it survives, then you're OK.

[11:27:48] Ewan Mac Mahon If it doesn't, you're not.

[11:28:36] Wahid Bhimji yeah I think it's positive. I think it will likely work for other VOs. I don't think there is any conflict in Simone's goals.

[11:30:25] Ewan Mac Mahon Who's Allison?

[11:31:55] Alessandra Forti Peter can you update the agenda? Also can you add a discussion about script and puppet modules sharing?

[11:32:08] Jeremy Coles Ewan - Alison now deals with APEL. 

[11:32:47] Jeremy Coles Alison Packer - being her full name.

[11:33:00] Jeremy Coles She gave a talk at the last HEPSYSMAN meeting.

[11:34:03] Pete Gronbech I'll add the short talks to the agenda

[11:34:11] Alessandra Forti ta

[11:42:01] Jeremy Coles GGUS 87802

[11:42:58] Jeremy Coles http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html

[11:49:08] Ewan Mac Mahon No.

[12:00:34] Ewan Mac Mahon Not PerfSonar. ATLAS Sonar.

[12:01:00] Mark Slater Certainly at Bham, I won't have chance to look at that again until end of month at the earliest

[12:05:12] Jeremy Coles https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items

[12:07:01] Wahid Bhimji well if they set DMLite="no" then they don't really have it. I have DMLite on my test DPM 

[12:08:11] John Bland does anyone have DMLITE="yes" on a production DPM?

[12:10:25] Wahid Bhimji I think Matt might? The default should be yes if he didn't reset it explicitly to no on his recent DPM 1.8.4 yaim - it seemed to be default yes when I upgraded our disk server this week. (I set it to no as it wanted a bunch of other yaim config that I couldn't be bothered with)

[12:10:39] Mohammad kashif Hi Daniela, EFDA-JET has already upgraded to emi2 creamce, but due to a problem at a few WNs, dteam jobs were failing. It has been fixed now. When will you run your next WN tests?

[12:10:45] Matt Doidge no, its one of the things I switched off

[12:10:56] Matt Doidge during my friday of horror

    • 11:00 AM 11:20 AM
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
      ATLAS analysis availability Oct12
    • 11:20 AM 11:40 AM
      UK NGI - monthly discussion 20m
    • 11:40 AM 12:00 PM
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest - Tier-1 status - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - Site updates
    • 12:00 PM 12:05 PM
      Actions 5m
      To be completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items Completed: https://www.gridpp.ac.uk/wiki/Operations_Team_Completed_Actions
    • 12:05 PM 12:06 PM
      AOB 1m