Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description
- This is the weekly GridPP ops & sites meeting - The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the Janet(UK) Community area. Direct link http://evo.caltech.edu/evoNext/koala.jnlp?meeting=MDMaM82v2nD2Du999sD99D - The phone bridge number is +44 (0)161 306 6802. The phone bridge ID is 1001002 with code: 4880. Apologies: Mark M
Tuesday, 26 November 2013
=================================================================================         
In attendance:
Brian Davies
Alessandra Forti
Andrew McNab
Daniel Traynor
Daniela Bauer
David Crooks
Elena Korolkova
Ewan MacMahon
Jeremy Coles
John Bland
John Hill
Lukasz Kreczko
Matt Doidge
Mohammad Kashif
Pete Gronbech
Raja Nandakumar
Robert Frank
Sam Skipsey
Steve Jones
Gareth Smith
Wahid Bhimji
================================================================================
Experiment problems/issues
Review of weekly issues by experiment/VO

- LHCb
(RN)NTR
MC simulation and user jobs running.
OK over the UK
Edinburgh issues solved.
EFDA-JET issues; keeping an eye on it.
Sheffield issues.
(EK) Problems on WNs started after the SL6 upgrade; not sure how to proceed.

- CMS
NTR
- ATLAS
(EK)
Low job numbers for ATLAS production (in all clouds).
GGUS 98882 for Sussex regarding space usage and DATADISK filling up.
Alastair Dewhurst sent an email to TB-SUPPORT regarding T2 operations and what to do when long downtimes are declared.
Sites should not worry about scheduling draining time into their downtime; ATLAS handles it.

- Other
(CW) CVMFS - NTR.
Six VOs have updated their VO ID cards in the portal.
No one has filled in the wiki for testing. Is the wiki updateable?
UI setup instructions almost ready for release.
Testing gfal2; slow.
WebDAV enabled on the LFC at RAL and the firewall is open.
 
===================================================================================================
11:20         
Meetings & updates (20')       
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
    There is a pre-GDB on Identity Federation in WLCG (agenda). The next GDB is on 11th December.
    EMI-3 WN tarball status (and glexec)?
    There is an LFC outage today (see the downtime announcement).
    The middleware readiness group are setting a time for their meeting. More site admins are needed!
Discussions will surround the items in the twiki.
    There was an email thread last week on ATLAS plans to move jobs/data away from a site going into downtime.
The focus seemed to be on the compute side rather than the storage side of things.
Should sites incorporate draining time into their downtime?
Are failing jobs an issue? Yes for users / no for production jobs?
To be discussed further (since this is not just a UK issue; possibly a GDB discussion).
    A new SAM interface is available for checking.
http://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome
(BD) Consolidation of site names needed. (AM reports LHCb will move to the GOCDB name.)
(AF) History info is missing.
(DC) Feedback from sites requested.
    Glue2 information validation is ongoing. Look to the monitoring summary page for more information.
    There is a workshop on clouds on 28th & 29th November.
    There is an update of the GridPP pledge spreadsheet.
    The final WLCG T2 October ops availability/reliability report is now available.
Discussion regarding site draining and downtime.
==============
WLCG Operations Coordination - Agendas
    There was a virtual WLCG ops coordination meeting on 21st.
    CMS - CRAB users warned that gLite-WMS submission is in decreasing support
    LHCb - LHCb will only build slc6 binaries as of January 2014
    SHA-2 - the experiments have tested a lot and look ready. By Dec 1 the WLCG infrastructure is expected to be
mostly ready.
    WMS decommissioning - Some progress for CMS.
    glexec - the EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tarball WN and do not have
the Perl module Time/HiRes.pm installed (a quick check is sketched after this list). Status is tracked here.
    xrootd deployment - UDP collector (a.k.a. GLED) for detailed monitoring. An additional instance of the collectors
has been enabled at CERN for FAX.
            Site monitoring requirements: for example, SUM tests not representing the real experiment status.
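For tarball WN sites, a minimal sketch of that check, assuming Python and the system perl are available on the worker node:

    # Quick check that the Perl Time::HiRes module needed by the EMI gLExec
    # probe is loadable; "perl -MTime::HiRes -e 1" exits 0 only if it is.
    import subprocess
    import sys

    try:
        rc = subprocess.call(["perl", "-MTime::HiRes", "-e", "1"])
    except OSError:
        print("perl itself is not on the PATH")
        sys.exit(2)

    if rc == 0:
        print("Time::HiRes present - the gLExec probe should not crash here")
    else:
        print("Time::HiRes missing - install perl-Time-HiRes (or equivalent)")
        sys.exit(1)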
==============
Tier-1 - Status Page
(GS)    There was a failure of the Primary OPN link to CERN yesterday morning. The automatic failover didn't work as
the router at the RAL end did not 'see' the break. Fixed by manually dropping that connection. (Primary OPN link to
CERN now fixed - switched back to this just before the meeting today).
    Planned upgrade of firmware in a disk array ongoing this morning. Currently the LFC, ATLAS 3D and FTS2 services are down
for a few hours. (FTS3 unaffected.)
Future intervention in January to re-organise central switching (TBC).
==============
Accounting - UK Grid Metrics HEPSPEC06 Atlas Dashboard HS06
    APEL data lagging at: Brunel, UCL-HEP and Lancaster.
    Please update your HEPSPEC figures in the wiki.
==============
Interoperation - EGI ops agendas
Tuesday 26th November
    Meeting last week: agenda here: https://wiki.egi.eu/wiki/Agenda-19-11-2013
    David will upload notes soon, apologies for the delay in getting them posted: however one item to draw out was
that the GFAL lcg-utils product team is proposing to phase out GFAL/lcg_utils in favour of GFAL2/gfal-utils
(https://svnweb.cern.ch/trac/lcgutil/wiki/MediumTermProposal) - feedback is solicited on this, which was stressed as
being a proposal.
==============
Monitoring - Links MyWLCG
Tuesday 26th November
    As noted by Alessandra, if possible we'd like site feedback on the consolidated monitoring prototype before the
next meeting a week on Friday to report back to the group (with thanks to everyone who has already contributed)
    http://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome
    Some notes towards a wiki page on Graphite can be found here: https://www.gridpp.ac.uk/wiki/MonitoringTools. These are
under development; if there are areas people would find it useful to expand, please let David know.
==============
On-duty - Dashboard ROD rota
Monday 25th November
    Nothing unusual. A steady trickle of transient problems.
    The RAL Tier-1 SHA-2 ticket was finally closed as the relevant machines were decommissioned.
==============
Rollout Status WLCG Baseline
Tuesday 29th Oct: Yesterday the first staged rollout request (for the CREAM CE) in months came through. I've updated
the State of the Nation page.
Tuesday 8th Oct: There were updates to EMI-2 and EMI-3 yesterday, but no new request for staged rollout. There is a problem
with dcap-libs: [GGUS 97805]. References:
    The staged rollout pages (now separated into EMI-1 & 2) and the page listing the deployed versions are extracted from
the BDII, so they should all be reasonably up to date:
    http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout.html
    http://www.hep.ph.ic.ac.uk/~dbauer/grid/staged_rollout_emi2.html
    http://www.hep.ph.ic.ac.uk/~dbauer/grid/state_of_the_nation.html
==============
Services - PerfSonar dashboard | GridPP VOMS
Tuesday 26th November
    The main perfSONAR issues this week affect Manchester and Sussex.
==============
Tickets
Monday 25th November 2013, 15.30 GMT
41 Open UK tickets today.
Information System Tickets:
RALPP, ECDF, Lancaster, Liverpool, UCL, Brunel, RHUL and the Tier-1 all got tickets about their information system (this
is a prelude to information system probes going into the SAM tests).
I asked for some clarification in the Lancaster ticket, as our resource bdiis are up to date and recently reconfigured,
 but as these tickets are super-fresh don't panic about them.
RALPP
https://ggus.eu/ws/ticket_info.php?ticket=99186 (25/11)
Not a reflection on the site (the ticket is 10 minutes old at time of writing), but the subject interested me
"NAGIOS *emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot* failed on heplnv146.pp.rl.ac.uk@UKI-SOUTHGRID-RALPP". Are
glexec failures becoming critical? Assigned (25/11)
Which reminds me, I'll be taking a look at all your (and my own...*whimper*) glexec tickets next week.
https://ggus.eu/ws/ticket_info.php?ticket=98923 (15/11)
Picking on RALPP again, this other (SHA2) nagios ticket got reopened. Looks like you're just not publishing your dcache
version. To the ldifs! Reopened (25/11)
SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=98882 (14/11)
Emyr fixed Sussex's STORM (hang on, I thought Emyr had escaped?) The site's been whitelisted for testing since the 21st,
if things are looking good I suggest closing this ticket. In progress (21/11)
SHEFFIELD
https://ggus.eu/ws/ticket_info.php?ticket=98594 (4/11)
This LHCB ticket, regarding file upload troubles for jobs running at Sheffield post the SL6 upgrade, is looking a bit neglected.
Does anyone else know of any post-SL6 tweaks that they needed to apply (say a cheeky undocumented rpm) to get LHCB to work
after their move to SL6? In Progress (13/11)
cvmfs@RAL tickets
https://ggus.eu/ws/ticket_info.php?ticket=98249 (SNO+)
https://ggus.eu/ws/ticket_info.php?ticket=98122 (cern@school)
Both of these tickets have received their first warning for being in the "waiting for reply" state for too long.
https://ggus.eu/ws/ticket_info.php?ticket=97868 (t2k)
T2K don't have software to put into their stratum 0 yet, but would like to test with a ROOT tarball. No word from Catalin
over this modest testing plan (at least on the ticket, you might be beavering away offline on this). In progress (18/11)
https://ggus.eu/ws/ticket_info.php?ticket=97385 (hyperK)
A similar story here (I think work is just progressing offline, hopefully we haven't entered a nightmarish universe where
 anything not documented in GGUS tickets doesn't happen-yet). In progress (18/11)
GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=96234 (29/7)
WMS support for HyperK at Glasgow. Chris spotted a problem, Dave said he'd get on it on Monday (which unless Dave had a 9
 day weekend was a week ago). Any luck? In progress (15/11)
EFDA-JET
https://ggus.eu/ws/ticket_info.php?ticket=97485 (21/9)
I mentioned this LHCB ticket last week, as this recurring problem has stumped everyone involved. The JET guys have asked LHCB
 for some information to try to help them debug the problem. Waiting for reply (18/11)
I've no doubt missed something, having rushed this out in half the time I usually take, so I'll cover my shoddiness with
my usual line that if I've missed any tickets of interest, please bring them up at the meeting or online.
==============
Tools - MyEGI Nagios
    Regional Nagios updated to release 22. It was a gLite to UMD update and required a fresh installation.
    There have been some internal changes in SAM-Nagios. Test probes are now the responsibility of the product teams. Some test
names have been changed as a result of this reorganization. For example the org.sam.CREAMCE-DirectJobSubmit test has become
emi.cream.CREAMCE-DirectJobSubmit. This does not affect operational activities.
    Please could all site admins look at the services associated with their site and mail Kashif if anything odd is noticed.
Site admins can reschedule tests for their sites, and it would be helpful if most functionalities were tested.
    Also, look at MyEGI, which can be useful and has links to the Dashboard, GStat, the Accounting Portal and GGUS.
==============
Small VO DIRAC instance
(JM) Test DIRAC instance running (for small VOs); working with T2K. LFC/SE support to be added.
Landslides VO (LK) installed. A few sites and users added; they have to be added manually (since DIRAC does not use the info-provider).
(LK) Web interface useful for debugging. Some setup issues, but overcome.
SNO+ added (Matt Mottram added as a user) plus a QMUL user. No jobs submitted yet. JM needs feedback.
DIRAC catalogue rather than the LFC as the long-term solution?
Resources, VOs and users are currently added manually. Should be possible to automate.
(EM) Enabling the gridpp VO would allow site admins to test DIRAC further.
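For reference, a minimal sketch of how a small-VO user would submit a test job to such an instance with the standard DIRAC python API (the job name and /bin/hostname payload are illustrative only; the server endpoint comes from the user's DIRAC client configuration):

    # Submit a trivial test job through the DIRAC API.
    from DIRAC.Core.Base import Script
    Script.parseCommandLine()                 # initialise the DIRAC configuration

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    job = Job()
    job.setName("smallvo-test")
    job.setExecutable("/bin/hostname")        # trivial payload, just to prove submission works
    job.setCPUTime(300)

    result = Dirac().submitJob(job)
    if result["OK"]:
        print("Submitted job %s" % result["Value"])
    else:
        print("Submission failed: %s" % result["Message"])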
==============
VOs - GridPP VOMS VO IDs Approved VO table
    CVMFS progress - but not quite there yet.
    6 VOs (cern@school, gridpp, na62, pheno, sno+, t2k.org) have updated their VO ID card entries and updated the wiki.
        Can people actually test resources and fill in the tables at
https://www.gridpp.ac.uk/wiki/Adoption_of_Backup_GridPP_Voms_Servers#Test_status_-_testing_by_VOs
    Instant UI - progress
        https://www.gridpp.ac.uk/wiki/Main_Page - Installing a UI - needs rationalisation
        Site-info.def etc - 6 variables + voms snooper
    Storage
        gfal2 - GGUS 99043, 99044, 99055, 99067 - not performant, but very interesting functionality
        WebDAV now enabled on the LFC@RAL and the ports are open through the firewall - needs testing (see the sketch below)
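As a starting point for that testing, a rough sketch using the gfal2 python bindings (gfal2-python); the endpoint URL and path below are placeholders, not the real RAL ones:

    # List a directory over WebDAV via gfal2 to confirm the endpoint responds.
    import gfal2

    ctx = gfal2.creat_context()
    url = "davs://lfc.example.ac.uk:443/grid/t2k.org/"   # hypothetical endpoint and path

    try:
        for entry in ctx.listdir(url):
            print(entry)
    except gfal2.GError as err:
        print("WebDAV listing failed: %s" % err)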
=================================================================================================
- Current site priorities & issues
Change freeze next week for Holiday break (Glasgow)
=================================================================================================
Information publishing
Areas for sites to review:

REBUS: http://wlcg-rebus.cern.ch/apps/capacities/sites/
Sites to check figures are correct

Glue2 (see March GDB - https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20130313):
The Guide: http://gridinfo.web.cern.ch/glue/glue-validator-guide
The latest talk: http://indico.egi.eu/indico/conferenceDisplay.py?confId=1781
Midmon tests: http://tinyurl.com/qhzjdh6
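To make the Glue2 check concrete, a minimal sketch of pulling one class of GLUE2 objects from a site BDII so the output can be inspected or fed to glue-validator (the hostname is a placeholder; port 2170 and the o=glue base are the usual conventions):

    # Dump GLUE2 storage share objects from a site BDII.
    import subprocess

    SITE_BDII = "site-bdii.example.ac.uk"    # hypothetical host - use your own site BDII

    cmd = [
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://%s:2170" % SITE_BDII,
        "-b", "o=glue",
        "objectClass=GLUE2StorageShare",     # e.g. check the published storage shares
    ]
    print(subprocess.check_output(cmd).decode())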
=================================================================================================
AOB
- Main ops action to follow-up (https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items) is: All sites to query the IPv6 status
of their respective institutions.
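Not a substitute for asking the institution's networking team, but a quick sketch of checking whether a given service host already resolves over IPv6 (the hostname is a placeholder):

    # Does this host publish an AAAA record?
    import socket

    HOST = "se01.example.ac.uk"   # hypothetical service host

    try:
        addrs = socket.getaddrinfo(HOST, None, socket.AF_INET6)
        print("%s has IPv6 address(es): %s" % (HOST, sorted({a[4][0] for a in addrs})))
    except socket.gaierror:
        print("%s has no AAAA record (or the IPv6 lookup failed)" % HOST)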


=================================================================================================
Chat Window from Meeting:
[10:58:56] Daniel Traynor ok
[11:00:51] Alessandra Forti we should establish a rota
[11:02:18] Jeremy Coles Brian is taking minutes today. We'll try again to pre-book our minute taker!
[11:03:18] Jeremy Coles @Chris - will you be able to take minutes next week please?
[11:04:23] Christopher Walker Probably
[11:04:38] Jeremy Coles Thanks.
[11:04:42] Christopher Walker Definitely not this week - need to leave early
[11:05:35] Matt Doidge Do you/did you have any network tuning in place Elena?
[11:05:51] Jeremy Coles @Ewan - please can I put you down for minutes on 10th December?
[11:06:55] Elena Korolkova No, we didn't do network tuning after switching to sl6
[11:07:07] Matt Doidge DId you have any under SL5?
[11:08:45] Ewan Mac Mahon @Jeremy - the 10th might be a little awkward, I'm mostly not around for the rest of that week, so I'm not sure when they'll get written up. Can I do the week after?
[11:09:02] Jeremy Coles Sure - thanks.
[11:10:21] Christopher Walker I'll be meeting Jeremy Maris shortly. Brian if you can ensure he (or I) have values for what they were publishing, it would be useful.
[11:13:14] Elena Korolkova To Matt: yes, I have wn's on sl5 left for local users and I plan to leave only couple of wn's .Just in case
[11:15:04] Ewan Steele I have spoken with Pheno but I'm not sure if they have done the testing
[11:15:39] Daniela Bauer I was just to lazy (for mice) I have to admit
[11:15:48] Daniela Bauer I was just going to wait for Janusz to start complaining
[11:16:36] Jeremy Coles https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest#
[11:16:54] Christopher Walker https://www.gridpp.ac.uk/wiki/Adoption_of_Backup_GridPP_Voms_Servers#Test_status_-_testing_by_VOs was what I wanted people to fill in
[11:17:51] Christopher Walker I should also have mentioned Kashif's testing proving usefil.
[11:18:18] Daniel Traynor have an engineer coming need to go, priorities for qm - storm performance. site resilience, redundancy and backup improvements
[11:24:52] Ewan Mac Mahon What would be clever would be to see the downtime coming and move the data.
[11:31:33] Steve Jones Ewan: that is a cultural question!
[11:32:05] Steve Jones Is it "OK" to just pull the plug, or is it "bad form"?
[11:32:13] Christopher Walker If a user job runs and they can't get the output, they are annoyed - particularly if they could have run somewhere else
[11:32:37] Christopher Walker Staging output to another site would solve this.
[11:32:43] Steve Jones Then it's bad form. In which case, it should be avoided if reasonable.
[11:32:51] Alessandra Forti it depends on the size of data to move
[11:32:52] Christopher Walker In the context of an extended SE downtime.
[11:33:17] Christopher Walker GDB sounds good.
[11:33:25] Alessandra Forti yes GDB is better
[11:33:44] Ewan Mac Mahon Maybe a failure to stage the output out should be (or is?) considered as a job failure, so the job should be resumbitted elsewhere just as it would be if the WN dies part ways through.
[11:33:49] Jeremy Coles http://wlcg-mon.cern.ch/dashboard/request.py/siteviewhome
[11:34:30] Ewan Mac Mahon Philosophically, I like things that don't rely on planned shutdowns, because if you're resilient against unplanned shutdowns, you're resilient against breakage 'for free'.
[11:35:02] Wahid Bhimji sorry to be very late.
[11:35:07] Elena Korolkova I have updated the savannah twice
[11:35:23] Alessandra Forti yes I know Elena thanks
[11:35:45] Alessandra Forti I meant from the site point of you. I reported the atlas comments at the last meeting
[11:35:57] Alessandra Forti I know they are eventually similar
[11:37:46] Ewan Mac Mahon I just had a look to at Oxford and two links in got: 'dashboard.common.InvalidRequestException: This request of type 'GET' is unknown to the service'
[11:38:00] Ewan Mac Mahon at this URL: http://dashb-ai-548.cern.ch/dashboard/request.py/getWLCGNavigationLink?columnid=181
[11:38:55] Ewan Mac Mahon The path was front page->find UKI-SOUTHGRID-OX-HEP, click it to get to http://wlcg-mon.cern.ch/dashboard/request.py/siteview#currentView=default&search_0=UKI-SOUTHGRID-OX-HEP
[11:39:32] Ewan Mac Mahon Then click on the warning in yellow under 'atlas critical', then it gave the error.
[11:39:36] Alessandra Forti I have to say they have already improved it in the past week
[11:41:48] Alessandra Forti will add to the list it seems it redirects to a machine that shouldn't be public.
[11:43:09] Ewan Mac Mahon Incedentally; are we actually broken for ATLAS? We've got even fewer jobs than yesterday, but I can't see anything positively wrong.
[11:43:28] Ewan Steele we are the same
[11:44:06] John Hill I'm getting WNs trashed, I think by
[11:44:24] Chris Brew be right back
[11:44:30] John Hill ATLAS jobs, so maybe they are backing off to investidate
[11:44:44] John Hill sorry: investigate
[11:44:46] Matt Doidge I have a ticket open with APEL about Lancaster's accounting, sadly the APEL guys are swamped at the moment
[11:45:42] Christopher Walker Can you forward me a link to the GFAL2 proposal please.
[11:46:25] Ewan Steele @Ewan we are only at 36% capacity and we normally have a lot more ATLAS jobs than we do at the moment
[11:46:43] Christopher Walker Got to go. Bye
[11:46:48] Wahid Bhimji Matt - sorry if you mentioned it already - what is the status of your xroot/FAX failing / reconfig
[11:46:50] Alessandra Forti there are not enough tasks at the moment
[11:47:20] Alessandra Forti that's why there are no atlas jobs at sites
[11:47:25] Alessandra Forti it's a general problem
[11:47:27] Ewan Steele ok cheers
[11:48:50] Ewan Mac Mahon Right-o. We're dead quiet on other VOs too, which is a bit sad. Alice picked up the slack over the weekend, and we occasionally see some LHCb.
[11:49:08] Ewan Mac Mahon At the moment I've got about a thousand cores sat around doing nothing, if anyone wants them.
[11:49:49] Ewan Steele Yeah usually Pheno pick up any spare space but aparently not today  
[11:53:09] Elena Korolkova Thanks, Andrew. I'm checking this
[11:53:49] Alessandra Forti https://twiki.cern.ch/twiki/bin/view/LCG/SL6DependencyRPM
[11:55:20] Ewan Mac Mahon Thinking of Glasgow things, I'm still not getting many signs of life from dev013-v6 - I'm assumung that's expected.
[11:55:30] Elena Korolkova Thanks, Alessandra
[11:56:08] David Crooks Ewan: We'll have a poke afterwards and let you know, sorry about that  
[11:57:08] Sam Skipsey Hi Ewan - I have, indeed, already poked at it
[11:57:13] Sam Skipsey It might be healthier now.
[12:00:14] Ewan Mac Mahon I'm getting 'permission denied' even telnetting to the gridftp port.
[12:00:22] Ewan Mac Mahon Have you firewalled it again?
[12:01:08] Ewan Mac Mahon (and I've tried that from a couple of source networks, so I don't _think_ it's me)
[12:05:46] Andrew McNab Yes to enabling the GridPP VO!
[12:08:00] Ewan Mac Mahon In general I suspect a lot more of this can be scritped even if it's not done in standard issue Dirac.
[12:16:18] Jeremy Coles http://wlcg-rebus.cern.ch/apps/capacities/sites/
[12:17:31] Ewan Mac Mahon Is 'nearline' storage tape or disk?
[12:18:29] Jeremy Coles http://indico.egi.eu/indico/conferenceDisplay.py?confId=1781