# EVO - GridPP Operations team meeting

## Description

This is the weekly GridPP ops & sites meeting.

The intention is to run the meeting in Vidyo: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=zXhsqAxVnaT6
- The PIN is 1234.
- To join via phone see http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone for dial-in numbers.
- The London (UK) service is on +44 (0)161 306 6802. Phone bridge ID 1001002.
- The meeting extension is 109308582. PIN 1234.

Chair: Jeremy
Minutes: Daniela
Reserve:
Apologies:

## Links

Agenda: https://indico.cern.ch/event/645087/
Bulletin: https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

## LHCb

Smooth running in the UK, taking data. One concern: Birmingham is not running. Haven't had a chance to look at it. Mark, could it be fairshare?

Mark: Assume it's fairshare, wouldn't worry too much. Are they submitting to Vac? We could remove our old CREAM CE. Running over fairshare on Vac.

Raja: Fine, thanks! Will replace with Vac in monitoring.

Alistair Dewhurst is working with LHCb to push Ceph at RAL.

## CMS

Daniela: No major problems; still struggling with SL7 gfal2 problems, so Imperial is still not really running. It's on the back burner because of the LZ mock data challenge. Otherwise all good.

Jeremy: Bristol was down a bit last week but back now.

## ATLAS

Elena: One ticket for Sheffield. MCore jobs are being killed; we have only one CE, and jobs are killed after 6 hours. Does anyone have a suggestion?

Uploaded slides. HammerCloud is moving to new servers, with a request to check it (3/7 new server). Computing/Software week: will present an overview next time.

## DIRAC

[https://www.gridpp.ac.uk/gridpp-dirac-sam]

Jeremy: DIRAC/SAM tests. Vac cloud at the top. Oxford not resolved, last job in May.

Kashif: It is running other Vac jobs; the problem is with the gridpp VO. We updated the certificate, and probably the jobs stopped running after that, so I need to check.

Daniela: Vac doesn't run it; I lost patience because we need to update the pilot version and can't have it running older software. I have fed this back to Andy.

Jeremy: By email, or in a ticket?

Daniela: He closed the ticket and said he would raise a ticket in JIRA. That was 6 months ago.

Jeremy: Just Oxford?

Daniela: All sites. Manchester, test site?

Jeremy: Andrew away this week. Should sort this out.

CREAM/HTCondor sites: Durham, ECDF and Bristol look like sites whose last job was about a month ago. Any comments? (ECDF OS version?)

Oliver: Just the tests, by the look of it; we have been running other jobs. Will check.

## Meetings and updates

[https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest]

Upgrading HTCondor to 8.6: the default output format has changed and the ARC CE doesn't like that change. Need to add a new line to the configuration (see email).
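
A likely workaround, assuming the 8.6 change in question is condor_q switching to batch-mode output by default (an assumption on my part; see the email for the exact line), is to add `CONDOR_Q_DASH_BATCH_IS_DEFAULT = False` to the HTCondor configuration so that condor_q keeps producing the per-job output the ARC CE scripts parse.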

### T1 update

Main thing is an incident over the weekend: access to the ATLAS SRMs failed badly. Traced to a double-slash problem that causes a CASTOR process to die.
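
Purely as an illustration of what a "double slash" in an SRM path looks like (the hostname and path below are made up, and the actual CASTOR-side behaviour and fix are not described in these minutes), a minimal Python sketch of the kind of duplicate-slash normalisation involved:

```python
import re

def collapse_slashes(surl):
    """Collapse repeated '/' in the path part of an SRM-style URL,
    leaving the '://' scheme separator intact."""
    scheme, sep, rest = surl.partition("://")
    if not sep:
        return re.sub(r"/{2,}", "/", surl)
    host, slash, path = rest.partition("/")
    return scheme + sep + host + slash + re.sub(r"/{2,}", "/", path)

# Hypothetical request carrying stray '//' in the path:
print(collapse_slashes("srm://srm.example.ac.uk/castor/example.ac.uk//atlas//datafile"))
# -> srm://srm.example.ac.uk/castor/example.ac.uk/atlas/datafile
```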

Had planned to update the link to CERN on 14/6. Now looks like Wednesday (28/6).

[Many regular meetings are on hiatus because of other meetings.]

### On duty

Rota ran out.

### Security

Security discussion

### ROD

Kashif: A monitoring question [ROD duty]: the GLUE 2 validity test, what do you think about it? Should we push for this test to be made non-critical? I don't find it useful.

Jeremy: It was originally set up to monitor the GLUE 2 rollout.

Kashif: That's fine, don't make it critical.

### Tickets

There are 30 tickets in the UK, most being handled really well.

Storage accounting tickets from John Gordon, e.g. https://ggus.eu/?mode=ticket_info&ticket_id=129183.

A ticket from Daniela to remove the londongrid VO from the Tier-1; the VO is being decommissioned.

LHCb: alarm ticket, CEs at RAL not responding, raised 20th June, gone critical.

John Kelly: Jobs not arriving on the ARCs? A strange ticket; there have been email discussions, and the last I heard was that Andrew Lahiff can't find any evidence of these jobs. Not sure where it went after that. There was an update from Vladimir on the 23rd about certain problems with timestamps.

Jeremy: -> John, add a comment to the ticket that it's being followed up by Raja; it's an ALARM ticket.

A ticket for 100%IT about configuring an appliance; normally quick, but it was raised a couple of days ago, so will follow up.

Team ticket on Glasgow ARC CEs. Gareth: We need to apply a patch that never came from the devs, or upgrade to the latest version. Should we even do that? We are using Vac, and LHCb run via Vac not ARC. Still under internal discussion. (See Birmingham.)

Alessandra: LSST: We have got the jobs more or less working at RAL, Manchester and Lancaster; some updates were needed (xrootd tools, which are not auto-installed; they will be in CentOS 7 because I asked). The reason I'm saying this is that there is a sketch plan to run a data challenge, though no detailed plans yet. Before the workshop the person running the jobs planned to run 30 TB a day; the previous exercise at NERSC used 16000 cores over a month. Of course, to see if the grid is useful/easy they would like to try something similar. What I want to say is that this is a heads-up: management is in favour and it would be a very good PR exercise. However, it is not going to be that easy because it uses PanDA, but not ATLAS PanDA; it doesn't have all the development from ATLAS, in particular no AAA and no resubmission. So, like data challenges run before, sites need to be on top of it. These last few weeks there were of course the usual failures; we should try to minimise that. Starting to set up all the sites/tools correctly is a step in the right direction. We also need to look at it as a priority.

Gareth: Is there a set of packages that we need to install? 

Alessandra: xrootd-clients, and gfal up to date. We ran a simple test in DIRAC but it wasn't effective: the test didn't fail at RAL, yet jobs did, so we need to iron these things out. Also VOMS configuration. Then beef up the tests run by DIRAC. There is zero monitoring other than talking to the person who runs the jobs.
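
As a sketch of the kind of simple storage check being discussed (this is not the actual DIRAC test; the endpoint URL is hypothetical and the gfal2 Python bindings are assumed to be installed on the UI):

```python
# Minimal storage-endpoint check with the gfal2 Python bindings.
# The SURL below is a placeholder, not a real LSST storage path.
import gfal2

SURL = "srm://se.example.ac.uk/dpm/example.ac.uk/home/lsst/"

ctx = gfal2.creat_context()      # 'creat_context' is the library's spelling
try:
    entries = ctx.listdir(SURL)  # simple namespace listing
    st = ctx.stat(SURL)          # stat the directory as a liveness check
    print("OK: %d entries, mode %o" % (len(entries), st.st_mode))
except gfal2.GError as err:
    print("Storage check failed: %s" % err)
```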

Jeremy: Don't see some of those things on the wiki [https://www.gridpp.ac.uk/wiki/LSST_UK]

Alessandra: will update

Jeremy: Then we can point people at it. Are we not duplicating the EGI card?

Alessandra: They have one, but we are not using it for this. After testing we can officially ask to have that updated. I wouldn't rely on it necessarily; there has been some staff changeover.

Jeremy: The call is for people to support LSST; see that page and update it.

Pete Clarke: For lots of reasons it would be useful if we could say that astronomers did serious production work, at least at LSST sites that want to help (and others who want to help). Thanks to Alessandra/Peter Love for liaising with LSST on this.

Alessandra: Also set the priority for LSST appropriately.

Jeremy: Review LSST sites

## Discussion

### Steve Lloyd tests
 
### Site round table

See [https://www.gridpp.ac.uk/wiki/Category:Sites_Status]

### Durham

Oliver: Batch: ARC/SLURM, no plans to change. Cloud: containers, initial setup working but not in production. IPv6 still stuck, waiting for rDNS from Campus, who promise to look at it over the summer. Have been putting new kit into HS06.

### Lancs

Batch: CREAM/SGE -> ARC/SGE; September? Sticking with SGE for the foreseeable future, for the local users.
Containers: work in progress, a few months ahead.
IPv6 sort of good. With Brian's help found a routing problem that affects IPv4/6; the backup link goes via Carlisle.

Jeremy: Have IPv6 allocations?

Matt: Yes, on PS and on a few nodes. I think they've released the lock on DPM; might have more next week.

Jeremy: HS06?

Matt: Need to update table.

Jeremy: Good to put a date next to the table as a timestamp. Containers - OpenStack? Would be good to check with Peter L.

Matt: Openstack at Datacentred.

Jeremy: Tarball home?

Matt: Backed up the repo to be sure. We were offered EGI AppDB, but I don't think that's appropriate for our needs. Looking at a git repo at CERN to see if that will work. There was a power cut last week which took up the time.

### SHEF

Elena: CREAM/Torque. ARC is in test but doesn't work properly. The local sysadmin who helped has left, so we don't have a local sysadmin; I'm partly doing that work as well. There is a plan to switch to ARC/HTCondor, but I can't say when at the moment. Condor is running on the local cluster; basically it is only ARC that should submit to Condor. HS06: bought new WNs, should update the table. Containers: haven't looked at it yet. IPv6: there is a problem with the University supporting IPv6; I have contacted the network team about that but haven't had a reply yet. Working on it.

Jeremy: allocations?

Elena: Yes, but apparently those addresses stopped working. Something changed, investigating.

### SUSSEX

### BRUNEL

### RALPP

CB: Updated a couple of things. IPv6: since the T1 deployment has gone well, we have been talking to networking; it's not too complex to do ours and we have asked for an allocation. Will request that they set up routing over the next few weeks; it needs a bit of work on the router. Bristol: hosted HTCondor CE, OSG model. HS06: no new kit; will check the table. Site storage updated. Will be looking at containers; no plans for any cloud stuff.

### Steve Lloyd

Jeremy: To raise the question: do people still find the SL tests useful? Are people actively referencing/checking the metrics?

Pete G: A/R?

Jeremy: Might be better to use WLCG reports rather than Steve Lloyd. 

Gareth: Caveat: the SAM tests are broken; the ARC tests come back with -1. It is easier on manpower. The WLCG ones are fine, but the EGI ones, on which we get ticketed, give -1. We could use the ATLAS tests, but what if a site doesn't support ATLAS?

General discussion to be continued next week.

### HEPSYSMAN

Most sites were in attendance

### WLCG workshop

[https://indico.cern.ch/event/645087/contributions/2619199/attachments/1483264/2301179/WLCG_workshop_2017.pdf]

 

## Agenda

### 11:00-11:01 Ops meeting minutes (1m)

- This is a reminder that this is an important task. The minute taker gives access to the discussions for those not present and provides a reference for others to refer back to afterwards.
- The team composition has been changing. If everybody contributes then the task comes around less often.
- Please extract actions from the meeting and add them to our table here: https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items#Action_list.
- Recent allocations: See above link. The page should be updated each week by the minute taker (if they don't, the task will keep coming to them!).
- Upcoming allocations:
  - June 27th: David C
  - July 4th: Elena
  - July 11th:
  - July 18th:
  - July 25th:

### 11:01-11:20 Experiment problems/issues (19m)

Review of weekly issues by experiment/VO:

- LHCb
- CMS: https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
- ATLAS
- Other: Updates should be recorded in https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator.
- GridPP DIRAC status [Andrew McNab]: https://www.gridpp.ac.uk/gridpp-dirac-sam

### 11:20-11:40 Meetings & updates (20m)

With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest

- General updates
- WLCG ops coordination
- Tier-1 status
- Storage and data management
- Tier-2 Evolution
- Accounting
- Documentation
- Interoperation
- Monitoring
- On-duty
- Security
- Services
- Tickets
- Tools
- VOs
- Site updates
### 11:40-12:20 Updates - HEPSYSMAN - WLCG workshop (40m)

- Updating Steve's tests
- Site roundup (missed on the 6th)
  - With reference to tables under https://www.gridpp.ac.uk/wiki/Category:Sites_Status.
  - Durham/Lancaster/Sheffield/Sussex/Brunel
- WLCG workshop summary (perhaps next week if time is tight)
- HEPSYSMAN summary (perhaps next week)

### 12:20-12:25 Actions & AOB (5m)

- https://www.gridpp.ac.uk/wiki/Operations_Team_Action_items