GridPP Technical Meeting

Europe/London
Virtual Only

Andrew McNab (University of Manchester), David Colling (Imperial College Sci., Tech. & Med. (GB))
Description

Fortnightly meeting for technical topics looking further ahead than the weekly ops meetings on Tuesdays. There are also dedicated storage group meetings on Wednesdays. Each topic can go beyond the nominal 5 minute slot whenever necessary.

GridPP Technical Meeting – 4th March 2016

Attending: Andrew M; Ewan M; John B; Jeremy C (chair+notes); Pete G; Sam S; Daniela B; Duncan R; Terry F; David C +…

 

GridPP DIRAC Service (Daniela)

DIRAC is up and running. Found some hardware to run a DIRAC test server. Pilot certificates are in place for the GridPP and LondonGrid VOs. Daniela will try running different roles with different pilot certificates, so that even at sites without glexec, jobs do not trample one another.

On LSST – lots of jobs have run. Some run for a long time and then fail with incomprehensible error messages; Daniela suspects this is an ARC problem. Many jobs are finishing. She mentioned that the running jobs can now be viewed by anyone with a login and suggested someone should have a look.

Tier-2 Evolution: Jobs in VMs

Vac/Vcycle

Andrew mentioned the Vac 00.21 release, a candidate for Release 1.0. It has machine/job features support and works out the size of logical volumes to give to VMs. It provides better support for ATLAS VMs – specifically it is better at handling many VMs ramping up at the same time. The release is included in the latest Vac-in-a-Box version.
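The machine/job features support mentioned above publishes per-machine and per-job values as small files in directories advertised via the $MACHINEFEATURES and $JOBFEATURES environment variables. A minimal sketch of how a payload might read them, assuming the WLCG machine/job features layout (one file per key; the key names in the comment are examples from that scheme, not guaranteed present on every resource):

```python
import os

def read_features(dir_path):
    """Read a machine/job features directory: each file's name is a key
    and its contents are the value, per the WLCG machine/job features scheme."""
    features = {}
    if not dir_path or not os.path.isdir(dir_path):
        return features
    for name in os.listdir(dir_path):
        path = os.path.join(dir_path, name)
        if os.path.isfile(path):
            with open(path) as f:
                features[name] = f.read().strip()
    return features

# Directories are advertised by the resource; empty dict if not provided
machine = read_features(os.environ.get("MACHINEFEATURES"))
job = read_features(os.environ.get("JOBFEATURES"))
# e.g. machine.get("total_cpu"), job.get("wall_limit_secs")
```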

The most recent new site was Liverpool. Other sites have not yet been approached. The strategy is to pick a volunteer site and start from scratch, aiming to run half of its CPU resources via the Vac route. At present Liverpool is only running LHCb and GridPP VO work, but otherwise good progress. The next step is to get ATLAS work running.

John Bland: Liverpool are slowly ramping up the VAC cpus, but I think Steve's waiting for the ATLAS support to be 100% before committing much more

Access to HTCondor is currently based on passwords, but the developers need to look at a certificate-based approach. There are attempts at CERN to do this, but we will continue to use passwords for now. Once this issue is resolved there will be a rollout to other Vac sites. Once LHCb and ATLAS are running and supported smoothly, the approach will naturally be more sustainable.
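For reference, the password-versus-certificate choice above corresponds to HTCondor's pool authentication settings. A hedged sketch of the two configurations (macro names are from the HTCondor manual of this era; the file paths are examples only):

```
# Pool-password authentication (current approach)
SEC_DEFAULT_AUTHENTICATION         = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD
SEC_PASSWORD_FILE                  = /etc/condor/pool_password

# Certificate-based alternative under discussion (GSI)
# SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
# GSI_DAEMON_DIRECTORY               = /etc/grid-security
```

The operational difference is that the password file must be distributed to every node joining the pool, whereas GSI lets nodes authenticate with host certificates already deployed at grid sites.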

In addition to getting significant resources under Vac, there is a plan to get sites to run just single machines to gain familiarity. So far this has not happened, but it is planned.

Ewan Mac Mahon: (11:14 AM)

The current Oxford status is exactly where it was the last time you heard about it, Andrew. I've just been doing DPM stuff since. The plan for the immediate future is to reboot/reinstall the two ViaB nodes soon to pick up the updates, then take it from there.

Oh, and just to add, that leaves a few 'legacy' vac machines still running, but as soon as the ViaB ones are working the others are expected to be redone as ViaB too.

So ultimately it'll all be ViaB.

Ewan will reboot the Vac-in-a-Box nodes and get them doing something interesting next week.

Andrew further reported that Manchester is now at 600 VMs. They have one rack with normal WNs, one rack Puppet-installed, and one rack using Vac-in-a-Box.

 

ATLAS VMs & News

Working on new-style VMs with CERN people; the aim is a VM supported by the wider community rather than just within GridPP. All ATLAS VMs are almost the same now. Using passwords internally is not scalable; Condor supports GSI and certificates, so it can be made to work.

Ewan – could we fudge this by putting encrypted passwords on the web, protected by certificates? Andrew agrees this is a possibility if we cannot get the certificate approach working. The main issue is configuring things in HTCondor.

 

CMS VMs & News

No report.

 

LHCb VMs & News

Slowly redoing the VM approach to use cloud-init.

 

GridPP DIRAC VMs and other experiments

Andrew again: once the work is done for LHCb, the same changes will be applied to the GridPP VMs, i.e. making the VMs use cloud-init. This is producing material that will go into core DIRAC, which will make it easier for others to deal with VMs. Care is being taken not to hardwire the configuration.

For the GridPP VMs we need to do the multi-VO work. The approach is to configure Vac so that the type of VM is set with an extra option such as ‘LSST’. The goal is to do this in the next couple of weeks and have the SAM monitoring pages use the approach. Then we can check support across the country, including at conventional sites with normal batch systems.
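As a rough illustration of selecting VM types per site: Vac is configured through INI-style files with per-machinetype sections. A sketch of what enabling an ‘LSST’ VM type might look like – the section layout follows Vac's configuration style, but treat the specific option names and values below as assumptions, not the actual GridPP configuration:

```
[settings]
vac_space = vac01.example.ac.uk

# Illustrative machinetype section; option names and URLs are examples only
[machinetype lsst]
target_share = 1
root_image = https://repo.example.ac.uk/images/lsst-vm.iso
user_data = lsst_user_data
```

The point of the extra option is that the site admin only flips which machinetype sections are enabled; the VO-specific contextualisation stays in the user_data template.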

Andrew noted that a WLCG working group is being created to explore the topic of lightweight sites. This is going to be led by Maarten Litmaath. The idea is to look at the broad range of ways sites can be made lightweight. Lawrence’s slides (at the WLCG workshop) also approached this topic. We need to make sure we are plugged into the work.

 

Tier-2 Evolution: Storage

Testing for ATLAS. We want to run automatic tests with HammerCloud, but it is not configured to send tests to sites that do not host the data. Alistair is progressing a change to address this for ATLAS week.

Sam has been talking to the ATLAS ARC people – David Cameron – about testing the ARC cache Rucio implementation. There is a candidate site with reliable distributed storage in Durham to test against; we just need to get Oliver on board!

CMS – we had Oxford as a test site, but with Ewan moving the additional load would not be manageable, so we need to progress with an alternative test site. Duncan mentioned that Glasgow and QMUL already run CMS analysis jobs, and that Oxford reads data remotely from numerous other sites.

Sam mentioned the next task would be scalability testing. It was suggested that sites could give priority to CMS pilots to ramp up loads.

Pete G: Oxford runs relatively many CMS jobs but will have less effort in the coming months, so it would be good to get other sites involved. Daniela directed the ticket to Oxford originally, but there are other sites that could be more involved.

 

It was noted that US Tier-3s only read remotely from US Tier-2s, and the approach seems to work reasonably well. Perhaps we should have a look at this approach in the UK. Daniela warned that the setup may be quite involved, although it may not require a lot of work from the Tier-3s (only the Tier-2s).

Sam: The point of the work is to understand the scalability. CMS is interested in pushing this more than other VOs. Would be nice to push this work in different directions.

Duncan observed that the US approach is based around “Tier-2 overflow” – if one site has a full queue then others take on the work – and he speculated that their Tier-3 approach does something very similar.

Looking at:

http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T3_UK_SGrid_Oxford&applications=All%20Application%20Versions&accesses=remote&sitesSort=3&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=4&series=All&type=aeg

he pointed out the inefficiency of certain remote connections. It would be good to push the UK sites towards a regional testbed. CMS UK may be looking for a European testbed, so we can then help.
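The dashboard link above encodes its whole view as query parameters, so it is easy to see programmatically what is being requested (site, access type, time range). A quick sketch using a trimmed copy of that URL:

```python
from urllib.parse import urlparse, parse_qs

# Trimmed version of the dashboard link quoted in the minutes
url = ("http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/"
       "efficiency_individual?sites=T3_UK_SGrid_Oxford"
       "&accesses=remote&timeRange=lastWeek&granularity=Hourly")

# parse_qs returns a dict mapping each parameter to a list of values
params = parse_qs(urlparse(url).query)
print(params["sites"][0])      # T3_UK_SGrid_Oxford
print(params["accesses"][0])   # remote
print(params["timeRange"][0])  # lastWeek
```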

 

Ewan Mac Mahon: (11:35 AM)

Isn't this exactly the thing that Daniela just said was 'rather involved' in the US?

Depends what is required. This is more than just setting some parameters.

Duncan agreed (with Daniela) to lead on such a UK testbed from Imperial (Action 1)

 

Other updates from storage group

Not much to bring up other than the SNO+ discussion last week, which is relevant because it relates to supporting other VOs better (see the writeup in the Storage Meeting minutes – http://storage.esc.rl.ac.uk/weekly/20160302-minutes.txt).

 

Networking including IPv6

Duncan reported that he is working with Andrea on a box for IPv6 monitoring. It includes some xrootd tests. They are looking into redirectors for ATLAS and CMS; the European redirectors have IPv6 enabled already. FAX testing over IPv6 would probably work if the CERN-based box made use of IPv6.

Nebraska has a dual-stack node in progress. Xrootd should start working over IPv6 soon, as the redirectors are gradually being made IPv6-ready.
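A quick way to check whether a host is dual-stack at the DNS level, as described above, is to see whether it resolves to both IPv4 and IPv6 addresses. A small sketch (any hostname passed in is just an example; the check only covers name resolution, not actual connectivity):

```python
import socket

def address_families(addrinfo):
    """Return the set of address families present in getaddrinfo() results."""
    return {entry[0] for entry in addrinfo}

def is_dual_stack(host):
    """True if the host resolves to both IPv4 (A) and IPv6 (AAAA) addresses."""
    try:
        info = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    families = address_families(info)
    return socket.AF_INET in families and socket.AF_INET6 in families
```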

https://etf-ipv6-dev.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fview_name%3Dallservices

UK sites already in the mesh: QMUL; Brunel; Imperial.

 

Security

No report.

 

HEP Software Foundation

It has been mentioned elsewhere, but Andrew has written up a draft Vacuum Platform technical note for the HSF. This sets out the interfaces between VMs and VM lifecycle managers, in particular the user_data templates, the $JOBOUTPUTS mechanism, and the VacQuery UDP protocol.
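As a rough illustration of the user_data template idea: the lifecycle manager expands per-machine values into a contextualisation template before booting the VM. The ##name## placeholder syntax below is invented for this sketch, not taken from the technical note:

```python
def expand_user_data(template, values):
    """Substitute ##name## placeholders in a user_data template.

    Placeholder syntax here is illustrative only; the actual template
    conventions are defined in the Vacuum Platform technical note.
    """
    out = template
    for name, value in values.items():
        out = out.replace("##" + name + "##", value)
    return out

template = "machinetype: ##machinetype##\nspace: ##vac_space##\n"
print(expand_user_data(template, {"machinetype": "lsst",
                                  "vac_space": "vac01.example.ac.uk"}))
```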

In addition first steps have been taken to get Vac adopted as an EGI Community Platform.

 

AoB

Nothing raised.

 

ACTIONS SUMMARY:

1. Duncan (with Daniela) to lead from Imperial on setting up a CMS UK testbed to explore the remote data access overflow approach akin to that used in the US Tier-2s/3s.

 

 

Chat window contents.

Terry Froy: (04/03/2016 11:05)

Yep, can hear you

John Bland: (11:13 AM)

Liverpool are slowly ramping up the VAC cpus, but I think Steve's waiting for the ATLAS support to be 100% before committing much more

Ewan Mac Mahon: (11:14 AM)

The current Oxford status is exactly where it was the last time you heard about it, Andrew. I've just been doing DPM stuff since. The plan for the immediate future is to reboot/reinstall the two ViaB nodes soon to pick up the updates, then take it from there.

Oh, and just to add, that leaves a few 'legacy' vac machines still running, but as soon as the ViaB ones are working the others are expected to be redone as ViaB too.

So ultimately it'll all be ViaB.

David Crooks: (11:25 AM)

There's a draft agenda up now for the GDB: https://indico.cern.ch/event/394780/

Andrew McNab: (11:26 AM)

I have to go to another meeting now unfortunately

Duncan Rand: (11:32 AM)

http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/efficiency_individual?sites=T3_UK_SGrid_Oxford&applications=All%20Application%20Versions&accesses=remote&sitesSort=3&start=null&end=null&timeRange=lastWeek&granularity=Hourly&generic=0&sortBy=4&series=All&type=aeg

Ewan Mac Mahon: (11:35 AM)

Isn't this exactly the thing that Daniela just said was 'rather involved' in the US?

Duncan Rand: (11:41 AM)

https://etf-ipv6-dev.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fview_name%3Dallservices

 

 

    • 11:00–11:05
      GridPP DIRAC service 5m
      Speaker: Daniela Bauer (Imperial College Sci., Tech. & Med. (GB))
    • 11:05–11:10
      Tier-2 Evolution: Jobs in VMs 5m
      • Vac, Vac-in-a-Box, Vcycle 5m
        Speaker: Andrew McNab (University of Manchester)
      • ATLAS VMs & news 5m
        Speaker: Peter Love (Lancaster University (GB))
      • CMS VMs & news 5m
        Speaker: Andrew David Lahiff (STFC - Rutherford Appleton Lab. (GB))
      • LHCb VMs & news 5m
        Speaker: Andrew McNab (University of Manchester (GB))
      • GridPP DIRAC VMs and other experiments 5m
      • Site updates relating to Vac, Cloud, and VMs 5m
        Anything sites want to report this week
    • 11:10–11:15
      Tier-2 Evolution: Storage 5m
      Speaker: Samuel Cadellin Skipsey
    • 11:15–11:20
      Other updates from the Storage Group 5m
    • 11:20–11:25
      Networking including IPv6 5m
    • 11:25–11:30
      Security 5m
      Speaker: Ian Neilson (STFC RAL (GB))
    • 11:30–11:35
      HEP Software Foundation 5m
      Speaker: Andrew McNab (University of Manchester)
    • 11:35–11:40
      AoB 5m