Operations team & Sites

Europe/London
EVO - GridPP Operations team meeting

Description

- This is the weekly GridPP ops & sites meeting

- The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6

-- The PIN is 1234. To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial in numbers.

-- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002

-- The meeting extension is 109308582. PIN 1234

Chair: Jeremy

Minutes:

Apologies:

LHCb

-

CMS

* Bristol failing HC tests. It's in the waiting room, i.e. it doesn't get any analysis jobs. Winnie thinks the problem is solved and the ticket can be

* RALPP: a ticket that seems to have been forgotten by CMS. RALPP may want to ask for help. It's a problem about quotas.

* Singularity: installed at RALPP on SL6 nodes.

ATLAS

* RAL FTS problems should be solved by upgrading to the latest version. dq2 tools were upgraded a week ago. IPv6 is disabled until all problems are resolved. Andrew says the IPv6 and FTS problems were unrelated; he'll update the tickets.

* Ticket on VAC. Decision to move only to multicore. Ticket will be closed.

* Problem with QMUL xrootd: direct IO has to be disabled on all the queues.

* Glasgow was hit by ATLAS jobs overloading some dataservers. The file access was changed to direct IO and the problem was solved. The user was overriding the central setup, which they shouldn't do.

Other VOs.

* DIRAC SAM tests: long delays at RAL. Will look into it.

Bulletin

* HSF community paper open for comments

* perfsonar tests simplified to avoid overloading the network and confusion. Only one test in one direction.

* Security: with Pakiti you should be able to check your own site. Sites should look at it and give feedback if needed. The front page flags only high critical issues. If you dig into the site page you get more details on the type of failure. Some discussion on how Pakiti runs and the possibility of running the client locally on all the nodes.

* Tickets:

* Bristol: the ticket is interesting for its use of Elasticsearch to debug.

* Sussex: problems with cream CE

* Birmingham: wants to go to VAC only, but some pieces of the infrastructure, like monitoring, are missing. Jeremy will raise the issue at the EGI MB with some suggestions.

* Manchester

  - VAC: to be closed

  - Storage: waiting for upgrade

  - IPv6: to be acknowledged

* RAL castor: two tickets with some problems and some solutions attempted.

Sites update

* Bristol: downtime to move the machines to another machine room.

* Imperial: upgrade and development of the GridPP DIRAC server. Submission to GPUs is working. CREAM has an extra parameter that DIRAC can't handle. ECDF and xrootd-only access is being looked at. Jeremy asks who has GPUs on the grid. At the moment only Manchester and QMUL, but this may expand in the future. GPUs are currently used only by small experiments, but there might be big changes in the future as Machine Learning ramps up in the analysis techniques. There is a lot of ML material for physicists learning how to code, but very little on requirements for sites. There will be a discussion on infrastructure and submission at the HSF/WLCG meeting, so it is important that this is being looked at.

SAM tests and availability: discussion on SAM tests at the next WLCG Ops Coordination. Are they useful for sites? Are they useful for experiments? Or are they only legacy? CMS sites and others are using them for troubleshooting when something is wrong. Sites cannot submit as the experiment, so the minimal approach is considered useful. Another useful thing is to have the SAM tests in the same place with the same API, so people don't have to struggle to get results out. People are looking mostly at Nagios ETF, but the history on the dashboard is also useful. Jeremy will set up a poll to ask how the SAM tests are used, in view of the next WLCG Ops Coordination meeting discussion.

SOC workshop at the beginning of December. David: If you want to come in person please register this week so we can arrange passes.

Chat

Jeremy Coles: (28/11/2017 11:02)

Alessandra is taking minutes this week. Thank you.

Alessandra Forti: (11:03 AM)

is anyone speaking?

Jeremy Coles: (11:06 AM)

CMS links give access denied for me.

egroup cms-web-access.

Andrew David Lahiff: (11:08 AM)

That's not quite true

The upgrade of FTS was unrelated to the IPv6 problems

Jeremy Coles: (11:15 AM)

https://www.gridpp.ac.uk/gridpp-dirac-sam

Elena Korolkova: (11:16 AM)

@AndrewLahiff: thanks for closing the ticket

Jeremy Coles: (11:17 AM)

https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

David Crooks: (11:22 AM)

https://pakiti.egi.eu

Daniela Bauer: (11:24 AM)

I get "Cannot connect to the DB! Please, check your connection parameters."

It asked for my certificate. And I made an exception for its own certificate.

It's firefox 57 on Fedora

David Crooks: (11:27 AM)

Thanks

Matt Doidge: (11:35 AM)

https://ggus.eu/?mode=ticket_info&ticket_id=131607

Daniela Bauer: (11:35 AM)

My sound gave up, I need to reconnect.

Jeremy Coles: (11:38 AM)

http://pprc.qmul.ac.uk/~lloyd/gridpp/

raul: (11:39 AM)

Nothing new at Brunel

And I have no microphone today

Jeremy Coles: (11:40 AM)

Hi Raul. What are you working on?

Elena Korolkova: (11:41 AM)

I agree with Daniela: it's better to keep the meetings to an hour

raul: (11:42 AM)

What sort of GPUs are being requested?

Latest GPU servers can be extremely expensive. Are experiments requesting them?

Daniela Bauer: (11:47 AM)

@Raul: I can ask LZ, if you want.

The firealarm just went off (honest). I need to go.

Jeremy Coles: (11:47 AM)

Not being requested actively by the LHC VOs but we can see the demand increasing and need to be ready.

Chris Brew: (11:47 AM)

help

raul: (11:48 AM)

Yes, please Daniela. Anything you could get from LZ or CMS about GPU usage

We do have GPUs, but I wonder if it is what LZ would want

Jeremy Coles: (11:50 AM)

does anyone else have audio working now?

David Crooks: (11:50 AM)

I can hear Alessandra OK

Jeremy Coles: (11:51 AM)

Did I end up speaking over Alessandra/others? My session disconnected.

David Crooks: (11:51 AM)

I don't think so

raul: (11:51 AM)

Audio working now. I had to re-join

David Crooks: (11:52 AM)

We do

Paige Winslowe Lacesso: (11:52 AM)

I glance at them every morning but often not much more than that

David Crooks: (11:52 AM)

check the tests, I mean

Kashif: (11:52 AM)

I wait for ticket to arrive

Ian Loader: (11:52 AM)

We do

John Hill: (11:52 AM)

I look most days, but usually I discover issues via other routes

Govind: (11:52 AM)

Not regularly, but I check at least 2-3 times a month, plus if there is any problem

Chris Brew: (11:53 AM)

I generally have a browser slideshow open all the time with the ATLAS, CMS and LHCb SAM tests cycling through

Paige Winslowe Lacesso: (11:55 AM)

OOPs I meant nagios tests, not sam test, are what I glance at. I've NO IDEA where the sam tests are?.....

David Crooks: (11:57 AM)

https://indico.cern.ch/event/676160/

    • 11:00 - 11:01
      Ops meeting minutes 1m
      • This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.

      • The team composition has been changing. If everybody contributes then the task comes around less often.

      • Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.

      • Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't the task will keep coming to them!).

      • Upcoming allocations:

      28th Nov: Alessandra
      5th Dec: Chris?
      12th Dec: TBC
      19th Dec: Vip
      8th Jan??
      15th Jan: TBC

    • 11:01 - 11:20
      Experiment problems/issues 19m

      Review of weekly issues by experiment/VO

      • LHCb

      • CMS
        T1: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T1_UK_RAL
        T2: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel

      Bristol is now in the "Waiting Room", i.e. disabled for analysis jobs.
      Current CMS GGUS tickets:
      https://ggus.eu/?mode=ticket_info&ticket_id=131987 (Bristol)
      https://ggus.eu/?mode=ticket_info&ticket_id=131565 (RALPP - stuck on "Waiting for Reply" from CERN - this might need bringing up on the cop-ops mailing list, as it looks forgotten)
      Singularity according to CMS: https://twiki.cern.ch/twiki/bin/view/CMS/FacilitiesServicesDocumentation#Singularity
      (and a very bad plot to show which sites have Singularity: https://monit-kibana.cern.ch/kibana/app/kibana#/visualize/edit/CMS-glideins-singularity?_g=h@114715d&_a=h@af4d5e4)

      • ATLAS

      • Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.

      • GridPP DIRAC status [Andrew McNab]
        -- https://www.gridpp.ac.uk/gridpp-dirac-sam

    • 11:20 - 11:40
      Meetings & updates 20m

      With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

      • General updates
      • WLCG ops coordination
      • Tier-1 status
      • Storage and data management
      • Tier-2 Evolution
      • Accounting
      • Documentation
      • Interoperation
      • Monitoring
      • On-duty
      • Security
      • Services
      • Tickets
      • Tools
      • VOs
      • Site updates
    • 11:40 - 12:20
      Discussion topics 40m
      • Site updates from those sites missed last week. Includes: RHUL, Brunel, Bristol...
    • 12:20 - 12:25
      Actions & AOB 5m
      • https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items