Operations team & Sites
EVO - GridPP Operations team meeting
- This is the weekly GridPP ops & sites meeting.
- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial-in numbers.
-- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002.
-- The meeting extension is 109308582. PIN 1234.
Chair: Jeremy
Minutes:
Apologies:
LHCb
-
CMS
* Bristol failing HC tests. It's in the waiting room, i.e. it doesn't get any analysis jobs. Winnie thinks the problem is solved and the ticket can be
* RALPP: a ticket that seems to have been forgotten by CMS. RALPP may want to ask for help. The problem concerns quotas.
* Singularity: installed at RALPP on SL6 nodes.
ATLAS
* RAL FTS problems should be solved by upgrading to the latest version. dq2 tools were upgraded a week ago. IPv6 is disabled until all problems are resolved. Andrew says the IPv6 and FTS problems were unrelated; he'll update the tickets.
* Ticket on VAC. Decision to move only to multicore. Ticket will be closed.
* Problem with QMUL xrootd: direct IO has to be disabled on all the queues.
* Glasgow was hit by ATLAS jobs overloading some dataservers. The file access was changed to direct IO and the problem was solved. The user was overriding the central setup, which they shouldn't do.
Other VOs
* Dirac SAM tests: long delays at RAL. Will look into it.
Bulletin
* HSF community paper open for comments
* perfsonar tests simplified to avoid overloading the network and to reduce confusion. Only one test in one direction.
* Security: with pakiti you should be able to check your own site. Sites should look at it and give feedback if needed. Front page flags only high critical. If you dig into the site page you have more details on the type of failure. Some discussion on how pakiti runs and the possibility to run the client locally on all the nodes.
* Tickets:
* Bristol: the ticket is interesting for its use of elastic search to debug.
* Sussex: problems with cream CE
* Birmingham: wants to go to VAC only but some pieces of the infrastructure, like monitoring, are missing. Jeremy will raise the issue at the EGI MB with some suggestions.
* Manchester
- VAC: to be closed
- Storage: waiting for upgrade
- IPv6: to be acknowledged
* RAL castor: two tickets with some problems and some solutions attempted.
Sites update
* Bristol: downtime to move the machines to another machine room.
* Imperial: upgrade and development of the gridpp dirac server. Submission to GPUs is working. cream has an extra parameter that dirac can't handle. ECDF and xrootd-only access is being looked at. Jeremy asks who has GPUs on the grid. At the moment only Manchester and QMUL, but this may expand in the future. GPUs are currently used only by small experiments, but there might be big changes in the future as Machine Learning ramps up in analysis techniques. There is a lot of ML material for physicists learning how to code, but very little on requirements for sites. There will be a discussion on infrastructure and submission at the HSF/WLCG meeting, so it is important that this is being looked at.
SAM tests and availability: discussion on SAM tests at the next WLCG Ops Coordination. Are they useful for sites? Are they useful for experiments? Or are they only legacy? CMS sites and others are using them for troubleshooting when something is wrong. Sites cannot submit as the experiment, so the minimal approach is considered useful. Another useful thing is to have the SAM tests in the same place with the same API so people don't have to struggle to get results out. People are looking mostly at nagios ETF, but the history on the dashboard is also useful. Jeremy will set up a poll to ask how the SAM tests are used, in view of the next WLCG Ops Coordination meeting discussion.
SOC workshop at the beginning of December. David: If you want to come in person please register this week so we can arrange passes.
Chat
Jeremy Coles: (28/11/2017 11:02)
Alessandra is taking minutes this week. Thank you.
Alessandra Forti: (11:03 AM)
is anyone speaking?
Jeremy Coles: (11:06 AM)
CMS links give access denied for me.
egroup cms-web-access.
Andrew David Lahiff: (11:08 AM)
That's not quite true
The upgrade of FTS was unrelated to the IPv6 problems
Jeremy Coles: (11:15 AM)
https://www.gridpp.ac.uk/gridpp-dirac-sam
Elena Korolkova: (11:16 AM)
@AndrewLahiff: thanks for closing the ticket
Jeremy Coles: (11:17 AM)
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
David Crooks: (11:22 AM)
Daniela Bauer: (11:24 AM)
I get "Cannot connect to the DB! Please, check your connection parameters."
It asked for my certificate. And I made an exception for its own certificate.
It's firefox 57 on Fedora
David Crooks: (11:27 AM)
Thanks
Matt Doidge: (11:35 AM)
https://ggus.eu/?mode=ticket_info&ticket_id=131607
Daniela Bauer: (11:35 AM)
My sound gave up, I need to reconnect.
Jeremy Coles: (11:38 AM)
http://pprc.qmul.ac.uk/~lloyd/gridpp/
raul: (11:39 AM)
Nothing new at Brunel
And I have no microphone today
Jeremy Coles: (11:40 AM)
Hi Raul. What are you working on?
Elena Korolkova: (11:41 AM)
I agree with Daniela: it's better to keep the meetings to an hour
raul: (11:42 AM)
What sort of GPUs are being requested?
Latest GPU servers can be extremely expensive. Are experiments requesting them?
Daniela Bauer: (11:47 AM)
@Raul: I can ask LZ, if you want.
The firealarm just went off (honest). I need to go.
Jeremy Coles: (11:47 AM)
Not being requested actively by the LHC VOs but we can see the demand increasing and need to be ready.
Chris Brew: (11:47 AM)
help
raul: (11:48 AM)
Yes, please Daniela. Anything you could get from LZ or CMS about GPU usage
We do have GPUs, but I wonder if it is what LZ would want
Jeremy Coles: (11:50 AM)
does anyone else have audio working now?
David Crooks: (11:50 AM)
I can hear Alessandra OK
Jeremy Coles: (11:51 AM)
Did I end up speaking over Alessandra/others? My session disconnected.
David Crooks: (11:51 AM)
I don't think so
raul: (11:51 AM)
Audio working now. I had to re-join
David Crooks: (11:52 AM)
We do
Paige Winslowe Lacesso: (11:52 AM)
I glance at them every morning but often not much more than that
David Crooks: (11:52 AM)
check the tests, I mean
Kashif: (11:52 AM)
I wait for ticket to arrive
Ian Loader: (11:52 AM)
We do
John Hill: (11:52 AM)
I look most days, but usually I discover issues via other routes
Govind: (11:52 AM)
not regularly, but I check at least 2-3 times a month, plus if there is any problem
Chris Brew: (11:53 AM)
I generally have a browser slideshow open all the time with the Atlas, CMS and LHCb SAM tests cycling through
Paige Winslowe Lacesso: (11:55 AM)
OOPs I meant nagios tests, not sam test, are what I glance at. I've NO IDEA where the sam tests are?.....
David Crooks: (11:57 AM)
https://indico.cern.ch/event/676160/