Operations team & Sites

EVO - GridPP Operations team meeting

- This is the biweekly ops & sites meeting.
- The intention is to run the meeting in EVO: http://evo.caltech.edu/evoGate/. Join the meeting in the Janet(UK) Community area.
- Direct EVO link: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=eseieIvnv9aiaeIla8Is
- The phone bridge number is +44 131 474 4520 (CERN number +41 22 76 71400). The phone bridge ID is 154732 with code: 4880.

Apologies:
Operations team & Sites (26 Feb 2013) - 1st draft To be Updated.
Experiment problems & issues

Meetings & updates

February GDB review

WLCG Ops Coordination update

Actions & site updates


EVO chat:

[10:56:35] Wahid Bhimji hi john - can you remind me how to get curl working with certs in SL6. I have completely forgotten what you said and just hit the problem again
[11:01:36] John Green yes
[11:01:59] John Bland wahid: you need to install libcurl-openssl which installs the openssl version of the curl library (as opposed to the NSS default version)
[11:02:36] John Bland wahid: then force clients to use it with something like LD_LIBRARY_PATH=/opt/shibboleth/lib64
[11:03:13] Jeremy Coles I am taking minutes today as nobody else is available.
[11:10:01] Wahid Bhimji On the site side for forcing single core.. SGE has a -binding qsub flag
[11:10:16] Wahid Bhimji which may be relevant - haven't tried it
[11:10:52] Wahid Bhimji not a transfers on HC tests for Jan
[11:23:57] Alessandra Forti I forgot about that
[11:24:01] Alessandra Forti the kernel thing
[11:28:33] John Green increasing the buffer size is just masking the underlying problem
[11:28:38] Brian Davies yaim used to change them automatically. Not all sites need to change to these values; I will have a look to see which sites are still affected.
[11:28:57] Brian Davies mic not working
[11:29:30] Brian Davies problem sites were also working to dcache some dcache sites,
[11:31:16] Wahid Bhimji The DPM developers should comment - I will poke them again. As Brian says it is not all dcache sites - so it could be a particular dCache version also - or some config option they set
[11:31:21] Brian Davies port range for connections might be an issue.
[11:31:46] Ewan Mac Mahon Fortunately, most of our local connections will likely be OK with large buffer sizes because they're mostly filestager streaming reads, but the principle is sound; there's a problem, we're papering over it.
[11:32:18] Ewan Mac Mahon Notably QMUL doesn't have the DPM patched gridftp, of course.
[11:32:24] Alessandra Forti Manchester sonar before upgrading
[11:32:45] Ewan Mac Mahon Oxford blogged about this too
[11:32:50] Ewan Mac Mahon With a sonar graph
[11:33:02] Brian Davies channels of interest to bnl are:
[11:33:03] Brian Davies http://bourricot.cern.ch/dq2/ftsmon/channel_view/multi/UKI-LT2-UCL-HEP&UKI-LT2-RHUL&UKI-SOUTHGRID-OX-HEP&UKI-SCOTGRID-ECDF&UKI-NORTHGRID-LIV-HEP&UKI-SOUTHGRID-BHAM-HEP&UKI-NORTHGRID-SHEF-HEP/BNL-OSG2&BNL-OSG2&BNL-OSG2&BNL-OSG2&BNL-OSG2&BNL-OSG2&BNL-OSG2/2012-12-01/2013-02-28/168/
[11:36:25] Jeremy Coles https://www.gridpp.ac.uk/wiki/GDB_reports
[11:36:36] Jeremy Coles https://www.gridpp.ac.uk/php/KeyDocs.php?sort=reviewed
[11:37:19] Wahid Bhimji why are only those of interest - CAM seems goot to BNL in fact so sorry I mentioned them (someone else did so I was just repeating)
[11:37:38] Wahid Bhimji I meant seems "good to BNL"
[11:38:13] John Hill We're now using the numbers from fasternet (not the enhanced buffer size).
[11:38:55] Brian Davies other sites all have consistently good rates to BNL, so I am not looking into how to improve.
[11:40:02] Wahid Bhimji ok - so actually John (H) it would be interesting if you have the lower default size and yet have good transfer rates....
[11:40:27] John Bland alessandra, brian, you're still showing up as unmuted in my window
[11:40:42] Alessandra Forti thanks
[11:40:55] Wahid Bhimji Brian - a side point - in the discussion it came up that some sites have different FTS settings
[11:41:05] Wahid Bhimji some seem to be in UKt2s and some in T2Ds
[11:41:30] Wahid Bhimji and the streams are different in the 2 ... (it is not the T2Ds in the T2D one !)
[11:41:47] John Hill Yes I do - I will put my numbers into the Twiki. Prior to 10 days ago we were still using the default yaim numbers
[11:41:53] Brian Davies Issue I meant to raise with hiro (@ BNL) before I went away. Will check up.
[11:42:13] John Hill so even the default fasternet values are probably giving an improvement
[11:42:28] Alessandra Forti they are better than what yaim used to set
[11:42:36] Alessandra Forti yaim now doesn't set anything
[11:43:23] Robert Frank https://wiki.gridpp.ac.uk/wiki/VOMSdeployment2013
[11:43:29] John Hill It did back in glite 3.2 days, and our pool nodes had mainly inherited those values
[11:44:38] Alessandra Forti yes, it did. but the values were never tailored for T2 sites they were adapted from castor CERN
[11:45:19] John Hill Indeed - it was an oversight on my part that I hadn't reviewed the sysctl settings
[11:47:49] Ewan Mac Mahon I'm just going to mention Oxford's one and only outstanding ticket now that Brian's back - https://ggus.eu/ws/ticket_info.php?ticket=90245 - it's been making some progress, but it would be nice to actually finish it at some point.
[11:47:49] Brian Davies ggus 91029, not obvious in the ticket that it is the : in the DN which is the issue.
[11:48:39] Matt Doidge I believe that the problem is with special characters
[11:48:45] Brian Davies in the ticket that is. If this is the conclusion then it should be stated, as this is probably a problem for other regions as well.
[11:49:32] Ewan Mac Mahon There was a VOMSsnooper release - that's tools.
[11:49:56] Brian Davies #ewan, will check on stephane update.
[11:50:26] Ewan Mac Mahon Thanks Brian; it's not exactly urgent, it's more a matter of not wanting to forget about it and let it drag on.
[11:50:40] Wahid Bhimji I think the values DPM used to set may actually have had the higher default... so in fact better in this case. But I can't remember and it's not that relevant now.
[11:50:43] Ewan Mac Mahon I think we're basically there on the substance on the thing.
[11:51:37] Ewan Mac Mahon @Wahid - it might be interesting to find out if the reason that sites like Glasgow had the larger defaults was because they'd preserved the old DPM default, or if they'd actually set them on purpose at some point.
[11:51:47] Ewan Mac Mahon But it's mostly of historical interest.
[11:53:50] Alessandra Forti on KeyDocs monitoring what is the percentage? I missed the meaning of it
[11:55:26] John Bland we don't have anything recently, but previous mentions of ipv6 have elicited little more than chortles from Liverpool CSD
[11:55:54] Alessandra Forti I haven't asked yet
[11:57:01] Ewan Mac Mahon Ours seem to have the view that no-one in Oxford needs an IPv6 connection until OUCS are running some services on IPv6 for them to connect to, and they're not. It's a very Oxford approach
[11:57:52] Ewan Mac Mahon I think part of this whole exercise is to a) find out the state of play and b) start making it look like there's a demand for this so as to motivate everyone's central services to pull their collective fingers out.
[11:59:24] Wahid Bhimji sorry I got cut off for a while - we will use it for sure
[11:59:30] Wahid Bhimji there is the pythonpath issue
[12:00:43] Wahid Bhimji the workaround doesn't work for me ? - but we can follow up
[12:03:42] Wahid Bhimji Do you have the fix in your path - Matt? - lancs
[12:03:54] Wahid Bhimji was still failing the last HC test for the FAX testing
[12:04:22] Wahid Bhimji but the latest one didn't find files so I don't know if it works now (if you changed it in the last week)
[12:09:09] Christopher Walker op
[12:09:16] Ewan Mac Mahon OK; because as far as I know we should be able to just stop using RFIO and move to xroot.
[12:09:35] Wahid Bhimji For sure - I agree.
[12:09:42] Ewan Mac Mahon That's fine - we still have our 'test' queue, so we could always try it on that first if you want.
[12:09:56] Ewan Mac Mahon Or you can just flip the production config - that's good too.
[12:10:11] Ian Collier Have to leave sorry.
[12:10:14] Ewan Mac Mahon (provided you're ready to flip it back if it all goes horribly wrong)
[12:10:24] Wahid Bhimji OK - I can try it on your test queue ... Analy queue seems to work - it's the prod queue where I have a problem
[12:10:39] Wahid Bhimji and I want to change it all - otherwise the benefit of reducing interfaces isn't really there
[12:12:39] Alessandra Forti indeed
[12:13:22] Ewan Mac Mahon Hmm. I'm not a fan of rfio, but I'm not sure we can just drop it when atlas are done with it; we should check what the small VOs are doing.
[12:13:33] Ewan Mac Mahon And give them plenty of notice that it's going.
[12:15:10] John Bland doesn't DPM use RFIO for internal transfers etc? And rfrm is a useful too.
[12:15:25] John Bland tool even
[12:15:49] Sam Skipsey So, the plan is that DPM will eventually retire rfio as its internal transfer mechanism.
[12:16:04] Sam Skipsey This has been the plan ever since the whole DMLite thing was envisaged.
[12:16:59] Alessandra Forti indeed
[12:17:26] Wahid Bhimji Small VOs should be fine - but you are right they need to be notified and given time. The plan is to retire rfio - hopefully there is an equivalent tool - if not we can make one
[12:17:57] Wahid Bhimji re webdav I agree ... sorry to start the same discussion again ! I thought there was a one word thing that was mentioned that webdav could do that xrootd can't
[12:18:18] Wahid Bhimji and I wanted to hear what that was not start a discussion - sorry !
[12:18:36] Jeremy Coles https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items
[12:20:06] Alessandra Forti which one word thing?
[12:21:17] Ewan Mac Mahon It's probably going to have to be a Welsh word - there's no way to cover that sort of ground in English
[12:22:00] Ewan Mac Mahon Er, no.
[12:22:34] Wahid Bhimji now it was just that Jeremy said that in the meeting it was said that xrootd couldn't do everything that webdav could
[12:22:46] Wahid Bhimji then he said the thing that was said was " ------"
[12:23:02] Wahid Bhimji and I thought it was just a one word thing
[12:23:06] Wahid Bhimji anyway bye
[12:31:25] Jeremy Coles undo

There are minutes attached to this event.
    • 11:00 11:20
      Experiment problems/issues 20m
      Review of weekly issues by experiment/VO - LHCb - CMS - ATLAS - Other
      Atlas report
    • 11:20 11:40
      Meetings & updates 20m
      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest - Tier-1 status - Accounting - Documentation - Interoperation - Monitoring - On-duty - Rollout - Security - Services - Tickets - Tools - VOs - Site updates
    • 11:40 11:55
      February GDB review 15m
      Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=197800 Minutes: https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20130213 GridPP Tier-2 notes: https://www.gridpp.ac.uk/wiki/GDB_reports
    • 11:55 12:04
      WLCG Ops Coordination update 9m
    • 12:04 12:10
      Actions & site updates 6m
    • 12:10 12:11
      AOB 1m