Operations Team & Sites Meeting
Tuesday, 29 April 2014 from 11:00 to 12:30
Agenda: https://indico.cern.ch/event/316555/
Chair: J Coles
Present (apologies for misspellings, captured by hand): L Cornwall, R Frank, R Harper, Rf, A McNab, D Bauer, D Crooks, E Korelkova, E MacMahon, G Roy, J Bland, R Nandhakamur, S Jones, D Traynor, E Steele, Matt RB, M Williams, Qin, J Hill, R Fay, C Brew, M Slater, M Doidge, M Kashif, A Washbrook, Brian@ral, P Gronbech, Raul
Minutes: S Jones
Apologies: Duncan R, Chris W

o 11:00 - 11:20 Experiment problems/issues 20'
Review of weekly issues by experiment/VO
- LHCb - Raja: Nothing much; some MC, and quiet in the immediate future. Dirac changes are to be tested. News will be given at the (WLCG?) workshop.
- CMS - Daniela: IC is suffering due to a single user running 200 jobs that max out the 20 Gb/s network. Investigations and mitigations continue. The rest of the UK is fine.
- ATLAS - Elena: The UK has transitioned to RUCIO; very clean. There has been a drop in ATLAS production jobs. Release 19 will facilitate more multicore jobs; sites are advised to reduce static allocation. Announcements will be monitored via the weekly ATLAS meetings and communicated at the ops meeting.
- Tier-1 status – Nothing to report.
- Other – Nothing to report.

o 11:20 - 11:40 Meetings & updates 20'
With reference to: http://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
- General – A WLCG Workshop will take place starting 7th July. This is a strategy gathering where experiment and grid middleware requirements and developments are discussed and mapped out. Some level of site participation is expected; please email Jeremy if you are interested in attending. An OMB took place: https://indico.egi.eu/indico/conferenceDisplay.py?confId=2162. Extended GOCDB service OUTAGE from 07:00 to 14:00 (BST) on 29th April.
- Accounting – Glasgow delayed due to vital maintenance.
- Documentation – n/a
- Interoperation – n/a
- Monitoring – David Crooks tells us that the 5th is a public holiday in Scotland.
- On-duty – Problems with the dashboard; complaints coming from NGIs. It has been hard to get things fixed because of a shortage of development people. Some problems were fixed last Thursday and notifications are now correct. The mailing list has been sorted.
- Rollout – RHUL: Govind has had trouble upgrading CEs; the machine may be too weak for EMI3. A GGUS ticket has been filed. Kashif: VMs with 4GB are good enough. SUSSEX: Matt: A Lustre crash over Easter slowed progress. New prioritization. BRISTOL: CEs and BDII need doing. Ewan: New prioritization. ECDF: Andrew: Delays with the ECDF systems team; progress expected over the coming week. DURHAM: Moving to ARC; SE and accounting to be moved soon.
- Security – Ewan: Brief update. An incident at one site is ongoing, now in the mop-up phase: a naughty user was running an illicit job and has been blocked via the suspension system. No need for sites to do anything; the system worked.
- Services – Duncan announced an update to the perfSONAR meshes, with a different sort scheme. See the bulletin.
- Tickets –
----------
EMI upgrade tickets: ECDF, Bristol, RHUL, Durham, EFDA-JET, Glasgow, Sussex, UCL and RALPP all have open EMI upgrade tickets. Can everyone with an open ticket please update it this week (preferably by the 1st) if they haven't done so in the last 7 days (or if you have but have made progress since then). It's a lot easier for the person on duty to extend tickets when there are site updates to validate their actions. (RALPP have submitted https://ggus.eu/index.php?mode=ticket_info&ticket_id=104839 in response to an argus problem they were seeing post upgrade.)
UCL have another Nagios error ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=104824
An interesting one: https://ggus.eu/index.php?mode=ticket_info&ticket_id=104937 - Manchester received a ticket from Steve Traylen regarding a large number of connections to the CVMFS stratum 1. Andrew confirms these are Vac machines (unless I've misread something). It looks like the local squid cache was being ignored; Andrew is on the case (see the CVMFS proxy sketch after the AOB item below).
Cheers, Matt
----------
- Tools – n/a
- VOs – n/a
- Site updates – Various EMI upgrade tickets were discussed.

o 11:40 - 12:00 Multicore update & discussion 20'
Alessandra gave us a talk on the progress of the Ops coordination task force. A summary is linked from the agenda (see above). Some highlights from her talk:
In the first phase, all the batch systems were reviewed to check where the problems lie. CMS is not really ready for multicore; it started this week to run multicore jobs more extensively at the Tier-1. ATLAS has problems with its submission patterns, which have been quite disruptive. The use of dynamic partitioning mitigates this somewhat, but the drain phase is very expensive and becomes prohibitive when there are frequent ramp-ups; ATLAS has stopped multicore for the rest of the month anyway. The most successful model without wall times and/or a steady stream of jobs is dynamic partitioning. Jeff Templon has written some scripts to implement this option; the method increases the size of the partitions once there are enough multicore jobs in the queue. The JEDI scheduling system will include wall times; it works for analysis and should work for production soon. The CREAM CE does not have all the right BLAH scripts: scripts are available for SGE, but others need writing. With SLURM and HTCondor there is no problem passing multicore parameters to the batch system (see the HTCondor sketch after the AOB item below). The next step will be to try to use parameters like walltime; this is not ready yet (without the installation of the NIKHEF scripts?). There will be a second round of presentations when implementations of these new conditions emerge. ATLAS wants more sites involved by the end of May and would send HammerCloud tests. Alessandra recommends configuring a couple of nodes to start with, but a production-sized cluster is needed for real work.
A short discussion followed (Jeremy, Steve, Alessandra) on recommendations and documentation, fragmentation, coordination, and the similarities and differences between CMS and ATLAS job profiles.

o 12:00 - 12:05 Dissemination 5'
Tom Whyntie gave a short talk on spin-off firms, technologies, new use cases, clouds and innovations.

o 12:05 - 12:06 AOB 1'
Please get your quarterly reports in.
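On the Manchester CVMFS ticket above: the usual way to keep worker nodes (Vac VMs included) off the stratum 1 is to point the CVMFS client at the site squid. The snippet below is a minimal sketch only, not Manchester's actual configuration; the proxy host, port, cache size and repository list are placeholders.

    # /etc/cvmfs/default.local -- hypothetical site-local CVMFS client settings
    CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch,lhcb.cern.ch
    CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128"    # route all requests via the local squid
    CVMFS_QUOTA_LIMIT=20000                                # local cache limit in MB

If CVMFS_HTTP_PROXY is set to DIRECT, clients talk to the stratum 1 servers themselves, which would match the symptom reported in the ticket.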
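On the multicore discussion above: purely as an illustration (not taken from Alessandra's slides), this is roughly what dynamic partitioning and "passing multicore parameters" look like in HTCondor. The values and the payload name are placeholders, assuming a plain HTCondor pool.

    # Worker-node condor_config fragment: one partitionable slot per machine,
    # so single-core and 8-core jobs can share a node (dynamic partitioning).
    SLOT_TYPE_1 = cpus=100%,mem=100%,disk=100%
    SLOT_TYPE_1_PARTITIONABLE = True
    NUM_SLOTS_TYPE_1 = 1

    # Submit description for an 8-core job: the core count is just another job attribute.
    universe       = vanilla
    executable     = run_mc.sh         # hypothetical payload
    request_cpus   = 8
    request_memory = 16000             # MB, i.e. 2 GB per core
    queue

The piece the talk flagged as incomplete for some batch systems is the CE side: the BLAH scripts that translate a CREAM CE request into directives like these.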
------------ SCREEN DUMP OF CHAT WINDOW --------
Ewan Mac Mahon: (11:10 AM) Send it to an ATLAS site - they have lots of IO bandwidth.....
Steve Jones: (11:12 AM) Redesign the job? Limit instances.
John Bland: (11:13 AM) Send it to the cloud. That solves all problems.
Daniela Bauer: (11:13 AM) http://www.hep.ph.ic.ac.uk/~dbauer/tmp/networking.png The missing bits are a) the downtime over Easter (power was off) and b) the day our monitoring machine crashed (incidentally the day we applied the heartbleed fix across the network, so it is well hidden).
Jeremy Coles: (11:21 AM) https://indico.cern.ch/event/305362/ https://e-grant.egi.eu
Ewan Steele: (11:33 AM) I've seen it
Daniela Bauer: (11:41 AM) Govind's GGUS ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=104881 He's also seen Java out-of-memory errors.
Chris Brew: (11:44 AM) If you're moving completely to ARC then they don't need an APEL node; it publishes directly.
Ewan Steele: (11:44 AM) Thanks for the info - do you have documentation on how to do that?
Chris Brew: (11:45 AM) Bitcoin or another one?
Tom Whyntie: (11:46 AM) I'll hold off the news item
Alessandra Forti: (11:46 AM) yes
Chris Brew: (11:46 AM) The following settings in the [grid-manager] section of arc.conf deal with it:
    jobreport="APEL:http://msg.cro-ngi.hr:6162"
    jobreport_options="urbatch:50,archiving:/var/run/arc/urs,topic:/queue/global.accounting.cpu.central,gocdb_name:UKI-SOUTHGRID-RALPP,use_ssl:true,Network:PROD,benchmark_type:Si2k,benchmark_value:2499.00"
    jobreport_credentials="/etc/grid-security/hostkey.pem /etc/grid-security/hostcert.pem /etc/grid-security/certificates"
    jobreport_publisher="jura"
We do scaling to a nominal 2500 SI2k CPU, then the different nodes have different values around that (2499, 2500 and 2501) so we can see on the portal that all are publishing.
Jeremy Coles: (11:51 AM) Document is on the agenda as https://indico.cern.ch/event/316555/contribution/3/material/0/0.pdf
Ewan Mac Mahon: (12:09 PM) We have ARC/Condor running and still have every intention of progressively moving all the resources over to it.
Steve Jones: (12:11 PM) 2 blog posts from L'pool: http://northgrid-tech.blogspot.co.uk/
Ewan Mac Mahon: (12:16 PM) If people are doing EC2-type VMs for non-CPU-intensive tinkering, we'd need to account that by walltime, but I don't think we have anything set up for that.
Daniel Traynor: (12:16 PM) Once it's up and running I'll give a talk at the Friday meeting.
Ewan Mac Mahon: (12:20 PM) Vac is a lot like 'normal' grid computing in that it assumes that it can run non-interactively. It's not a cloud like EC2 that gives you a box you can just SSH to. We need to be very clear on exactly what people are asking for, and what their use-case is. It's not enough to just be all 'yay, clouds'.
Jeremy Coles: (12:22 PM) https://ggus.eu/index.php?mode=ticket_info&ticket_id=101502