Operations team & Sites
EVO - GridPP Operations team meeting
GridPP Operations 12 May 2015
Present:
Sam Skipsey (minutes)
Brian Davies
Daniela Bauer
David Crooks
Jeremy Coles (chair)
Gareth Roy
Elena Korolkova
Andrew McNab
Gang Qin
John Hill
Federico Melaccio
Rob Fay
Govind Songara
Winnie Lacesso
Dan Traynor
Ewan McMahon
Gordon Stewart
Tom Whyntie
Robert Frank
Oliver Smith
Liam Skinner
Matt Doidge (chair of final section)
Kashif
Lukasz Kreczko
Raul
Duncan Rand
-
LHCb - nothing to report (Raja)
CMS - nothing to report (Daniela)
ATLAS - ATLAS is using the grid fully, using 150 job slots. MC15 ran on multicore; it started with 500 events(?) per job, but this was not efficient, so ATLAS is considering increasing job lengths for multicore. Alessandra raised this question at AC+S week a while ago. ATLAS requests that all sites provide ATLAS multicore resources. There was a Rucio/FTS issue which caused missing files after Rucio submitted jobs to the FTS service; it took a while to fix (the FTS manager was asked to update the service to mitigate the issue). ATLAS have new shifts for Run 2: Computing Run Coordinator.
CVMFS problem.
UCL needs storage cleaned. (Hopefully storage can just be disabled, but there's the issue of LOCALGROUPDISK for political reasons.)
- Elena
Other VOs - some information on progress with LSST (working with Edinburgh, Liverpool and Manchester), preparing test datasets.
LIGO are also starting tests - some work is being done with DIRAC data management. [Some initial teething problems due to the fact that the generic DIRAC documentation is still a bit LHCb-specific - Tom Whyntie is working on the documentation.] They have DIRAC registration and compute set up.
LZ - some progress; someone at Edinburgh has signed up. Issue raised at the PMB - requested to take decisions via Imperial?
VAC - DIRAC status looks "pretty good". All VM-based sites (except Lancaster, who are upgrading) are running today, as are the non-VM sites. Ongoing work with UCL. - Andy McNab
Next steps regarding monitoring for UCL, decommissioning storage:
Daniela - The problem is that the availability/reliability is 0% for UCL. Ben (Waugh) closed all the UCL tickets for "relaunching site" reasons, and this morning, interestingly, the reliability is now up to 4% (having closed all the services).
Andy McNab - the UCL VAC is now working, but this doesn't help with SAM tests, as there's no infrastructure in the EGI Ops testing framework to submit to a "headless site".
Daniela notes that there are improvements that could be made in operational responsiveness - the UCL CE which dragged down availability should, arguably, have been removed (rather than just put into downtime) much sooner.
Ewan notes that this is mostly a historical issue - this happened before we had the agreed plan to move to a headless VAC instance? We should be decommissioning the services that remain, surely?
Alessandra had suggested leaving the storage until we're happy with the rest of the work at UCL and are sure we can work with QMUL as a remote storage endpoint.
Ewan noted that the UCL storage is also unreliable.
Q: from ROD perspective, what do we do next?
Are there other services we need to think about decommissioning?
Ewan - the other services don't really need a notice period. (For the SE, we probably do need it to be up for the duration of the decommissioning process. If it isn't working well enough to do that, then we should tell the users that they have lost their data!)
[Daniela still has to issue a ticket for the low availability alarm, but this can be dealt with.]
Elena noted that LOCALGROUPDISK should still be checked as part of the decommissioning process. (Ewan noted that for a normal decommissioning broadcast, it isn't clear if ATLAS can actually move LOCALGROUPDISK content from UCL to other useful SEs.)
-
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest (see this for details)
****************************General Updates
Tuesday 12th May
There is a pre-GDB on batch systems at CERN this week. Tier-2 participation encouraged.
There will be a GDB on Wednesday 13th.
The next GridPP Technical Discussion meeting is scheduled for next Friday.
LSST preparing test dataset (involves Manchester, Liverpool and Edinburgh).
* Jeremy to check if we have GridPP Tech this Friday or next. [There is confusion.]
Monday 11th May
There was an EGI Operations Management Board (OMB) meeting on 30th April.
Operations updates:
12 service types will be removed from the GOC DB as they are not being used. They are listed in GGUS ticket 113432.
A mailing list, tools-admins at mailman.egi.eu, has been created for ops tools administrator discussion.
EGI OLA period 1 May 2015 - 30 April 2016
Security coordination moves to CERN after SNIC.
Only NGI ARGUS servers should accept Nagios probes. [Some discussion as to the precise interpretation of the statement here. It appears that the instruction is specifically that site ARGUS services should refuse connections from the outside world - see the sketch after this list.]
What HPC facilities are available in NGIs for federating?
Suggestion for common RC suspension process.
EGI conference in Lisbon 18-22 May.
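[For reference, a minimal sketch of how one might confirm from an off-site host that a site ARGUS box refuses outside connections. The hostname and the usual ARGUS ports (8150 PAP, 8152 PDP, 8154 PEPd) are illustrative assumptions, not something stated in the meeting.]
# Sketch: probe a site ARGUS host from OUTSIDE the site; every port should
# come back refused or filtered. Hostname and ports below are illustrative only.
import socket

ARGUS_HOST = "site-argus.example.ac.uk"   # hypothetical site ARGUS host
ARGUS_PORTS = [8150, 8152, 8154]          # usual PAP / PDP / PEPd ports

for port in ARGUS_PORTS:
    try:
        conn = socket.create_connection((ARGUS_HOST, port), timeout=5)
        conn.close()
        print("{0}:{1} is reachable from outside - should be blocked".format(ARGUS_HOST, port))
    except socket.error as err:
        print("{0}:{1} refused/filtered, as expected ({2})".format(ARGUS_HOST, port, err))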
FedCloud
No stable monitoring tests. Proposal to create a new CLOUD-MON_CRITICAL (inc. eu.egi.cloud.APEL-Pub; eu.egi.cloud.OCCI-VM ...).
New sites IN2P3-IRES (FR) and NCG-INGRID-PT (PT). 2 others in process.
EGI to provide capacity to instantiate virtual machines to run the computational tasks (on Earth observation datasets) generated by users of the ESA-funded Terradue work developing the e-Collaboration for Earth Observation (e-CEO) platform.
Auger moving to production on FedCloud.
EGI CSIRT
Concern about effort going into perfSONAR issues (cacti; web interface; Shellshock...)
CRITICAL CVE handling. Want EGI CSIRT hook into site re-certification by NGIs.
Have no way to probe specific WNs. Proposed pakiti client run manually. (More UK feedback given).
EGI-CSIRT got reviewed by TI and certified according to maturity parameters. Looking to run review on sites/NGIs.
UMD support for SL5/SL6
Torque 4.2 is not backward compatible with 2.5.7. Update not recommended. Move to Torque 2.5.13 (patched by SVG) using the AppDB repository with highest priority.
SL5 support aligned with RHEL5: in "Maintenance" until March 31, 2017 ... but >80% of sites are not using it anyway, and some sites are on SL7 and struggling with MW deployment.
Supporting CentOS7 in UMD requires scheduling the end of SL5 support in UMD.
EPEL7/CentOS7: 13 products are ready for EPEL7.
No move-from-SL5 campaign foreseen.
60% of cloud sites base their cloud infrastructure on RHEL-compat distribution. Most of these are Ubuntu.
Proposal: UMD4: September 2015. Decommissioning of SL5: March 2016.
ARGO Central Monitoring
Deploy test central instance in May. Review results in June.
High availability instances deployment in July (Croatia and Greece). Monitor during August.
Switch A/R engine in September.
Decommission NGI instances October 2015 (they can still be run for local alarms).
EGI Strategy Summary
See document. Basically: Expand cloud. Push 'commons' and open platforms.
"Consider open science as a production and dissemination system that needs integrated, easy and fair access to several types of shared resources (physical, digital, intellectual), engaged communities that contribute to the process and collaborates in the management and stewardship of the resources, a suitable governance with rules to allow/exclude access, to resolve conflicts, and finally financial support for the long-term availability".
****************************WLCG Coordination meeting summary.
Thursday 7th May
The agenda. Minutes
News: Alessandra will present the WLCG workshop conclusions at next week's GDB.
Middleware news: UMD 3.12.0 released this week (fixes for ARGUS-PAP and dCache server)
Middleware baselines: dCache 2.6.x removed. New version 2.10.28/ 2.12.8 of dCache. Sites should avoid simultaneous updates.
Middleware issues: a major upgrade of Torque arrived in EPEL (from torque-2.5.7 to torque-4.2.10) which is not compatible with the standard EMI Torque installation. For sites that have upgraded, the patched 2.5.13 version of Torque has been pushed to the EMI third-party repo so that they can downgrade.
T0 & T1 upgrades: FTS 3.2.33 upgraded at CERN & RAL.
T0 news: the batch HTCondor pilot is open for grid submission. Lower-than-usual WLCG availability figures in March for ATLAS and CMS - possible overload.
T1 feedback: NTR
T2 feedback: NTR
OS support in UMD: Plans in EGI for CentOS7 support. 13 products are ready for EPEL7, but in general CentOS7 is not a viable option for sites. The release of UMD4 (supporting EPEL7 and Ubuntu) is foreseen for September 2015 and the decommissioning of SL5 for March 2016. It is likely that some products relevant for WLCG will not be ready for EPEL7 before 2016. The requirement for WLCG is to provide SL6 until the end of Run 2; however, there are already offers of resources on CentOS7, and this is an incentive for experiments to validate their software on it.
ALICE: CASTOR at CERN - some re-reco job instabilities.
ATLAS: running ~full. Considering increasing job lengths for all MCORE jobs. Need sites to provide MCORE resources. A Rucio/FTS issue was discovered - fixed via an update. Tier-0 data and computing workflow fully commissioned.
CMS: CMS production activities continue - Several sites reported network saturation.
[No comment in this meeting as to if anyone knew about this.]
Evaluating the use of selected "strong" Tier-2 sites to add computing capacity for DIGI-RECO. Plan to drop support for the CRC32 checksum in CMS data transfer systems.
LHCb: Various operational issues reported - CASTOR/CERN SRM access problems; other data access issues.
gLExec: ATLAS - 61 out of 94 sites. The RAL, RALPP and TW-FTT issue was due to a bug in the pilot code that showed up at ARC CE + Condor sites.
SHA-2: old VOMS server aliases (lcg-)voms.cern.ch were removed on Tue Apr 28.
RFC proxies: RFC proxy readiness to be followed up per experiment. SAM-Nagios proxy renewal code fix to support RFC proxies.
Machine/Job features: NTR
MW readiness: the 10th meeting was on the 6th (agenda). The WG is making a check-point of goals and priorities. The ARGUS testbed at CERN is set up and ready to start. Pakiti client requested at other test sites.
MC deployment: NTR
IPv6: LHCb: DIRAC was made IPv6-compatible back in November, but testing only started in April. An issue was found at CERN with a Python library (wrong IPv6 address returned). [Raja notes that it actually returned a garbage address, not even an invalid one. The Python executable had been compiled without the IPv6 option enabled.]
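[As an aside, a quick generic way to check whether a given Python build has IPv6 support compiled in and can resolve an IPv6 address - not the exact LHCb diagnostic; the hostname below is illustrative only.]
# Sketch: check IPv6 support of the local Python build. The hostname is
# just an example; substitute the service actually being tested.
import socket

print("Compiled with IPv6 support: {0}".format(socket.has_ipv6))

try:
    # Ask specifically for IPv6 results; a build without IPv6 support (or a
    # host with no AAAA record) will raise socket.gaierror here.
    infos = socket.getaddrinfo("www.example.org", 443, socket.AF_INET6)
    print("IPv6 addresses: {0}".format(sorted(set(info[4][0] for info in infos))))
except socket.gaierror as err:
    print("No usable IPv6 address: {0}".format(err))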
Network/Transfers WG: NTR
HTTP deployment: perfSONAR - Security: NDT 3.7.0.1 was released. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS. Network performance incidents process put in place as was agreed at the last meeting. OSG/Datastore validation progressing well. Publishing results to message bus progressing, development has finalized for esmond2mq prototype. Recent meeting focussed on FTS performance. Next meeting 3rd June. Plan is to focus it on latency ramp up and proximity service.
*************************** Tier 1 status
(Brian)
Tuesday 12th May
Remaining CREAM CEs were turned off last week.
The problems with our primary network router are still being followed up.
We are planning an update to the version of the Oracle database behind Castor. Dates to be finalised.
*************************** Storage and Data Management
DiRAC discussion has happened; will be testing Cambridge -> Durham?
(Ewan noted that we might want to collect data by sticking a perfSONAR box on the DiRAC network.)
Oliver noted that DiRAC are actually already on the same firewall bypass as the grid site.
*************************** Accounting
Brunel, Liv, ECDF - ticket 113473. Message broker issues - memory problem; site thinks the records were sent but they were lost. APEL. Croatia/Greece. WLCG reports late. - Does VAC publish sync records?
*************************** Security
Redacted.
*************************** Tickets
(Matt)
Monday 11th May 2015, 14.10 BST
22 Open UK Tickets this week.
[But many of these were UCL tickets culled as a result of the decommissioning process.]
TIER 1
There are a few tickets at the Tier 1 that are set "In Progress" but haven't received an update yet this month:
108944 (CMS AAA Tests, 30/4)
112721 (Atlas Transfer problems, 16/4)
109694 (SNO+ gfal copy trouble, 15/4)
112866 (CMS job failures, 7/4)
112819 (SNO+ arcsync troubles, 20/4)
[Will be prodded by Brian etc]
Other Tier 1 Tickets (sorry to be picking on you guys!)
111699 (10/2)
ATLAS gLExec HammerCloud test jobs at the Tier 1. It appears to be working, but a batch of test jobs failed because they couldn't find the "mkgltempdir" utility on some nodes ("slot1_5@lcg1742.gridpp.rl.ac.uk" and "slot1_4@lcg1739.gridpp.rl.ac.uk"). In progress (4/5)
113320 (27/4)
Maybe repeating what Daniela is going to say in the CMS update - trouble with CMS data transfers within RAL. It's under investigation, but it looks like the files in question will need to be invalidated - even if it's just to paint a clearer picture. In progress (10/5)
APEL REPUBLISHING
113473
At the last update Brunel, Liverpool, Edinburgh, Birmingham and Oxford still needed to republish. Oxford have their own ticket about it due to complications (113482).
UCL Tickets - Ben is starting to close these; some are going to be marked "unsolved".
GLASGOW
113095 (17/4)
Andrew asks if the timeframe for the move to Condor could be added to this ticket, for the ROD team's information. On Hold (7/4)
100IT
112948 (10/4)
No news on this 100IT ticket for a while. In progress (27/4)
[Ewan noted that 100%IT are generally directly involved with EGI central people. We don't really have a good contact with them, and they're certainly not GridPP.]
************************** AOB
Ewan - for LIGO and the other people testing via the GridPP VO - do they have the possibility of running on the VAC services? We need to check whether the cvmfs configuration on those services is using the old or the new setup (and therefore whether it can see the new repos). Andy McNab noted that the fastest way to check is to log into a VM and see.
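[For reference, a minimal sketch of the kind of check Andy describes, run from inside one of the VMs. The repository names are examples only and would need replacing with whichever repos the GridPP VO actually expects to see.]
# Sketch: from inside a VAC VM, check whether particular CVMFS repositories
# are visible. Repository names below are examples only.
import os

REPOS = ["gridpp.egi.eu", "grid.cern.ch"]   # hypothetical target repos

for repo in REPOS:
    path = os.path.join("/cvmfs", repo)
    try:
        entries = os.listdir(path)   # listing triggers the autofs mount
        print("{0}: visible ({1} top-level entries)".format(repo, len(entries)))
    except OSError as err:
        print("{0}: NOT visible ({1})".format(repo, err))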
HEPSYSMAN Registration is open.
************************** CHAT LOG
Daniela Bauer: (12/05/2015 11:05)
Sorry I have a really dodgy vidyo connection
Tom Whyntie: (11:09 AM)
LIGO seems to be moving too
Matt Doidge: (11:11 AM)
Sorry I'm late, stacking my meetings this morning.
Ewan Mac Mahon: (11:12 AM)
They did seem to have some trouble with the UI; it looked like they'd got the DIRAC client running, but it didn't have lcg-utils.
Tom Whyntie: (11:12 AM)
@Ewan: only because they were trying to run on Ubuntu, and not the CERN VM
Samuel Cadellin Skipsey: (11:13 AM)
That was due to ... as Tom says, the fact that lcg-utils doesn't work on anything but SL.
Tom Whyntie: (11:13 AM)
Once they had the CERN VM there was another authentication issue that we're now trying to tackle...
Ewan Mac Mahon: (11:13 AM)
Indeed, I'm not sure what gave them the idea that that was something that would work at all, but we should probably find out.
Samuel Cadellin Skipsey: (11:13 AM)
The current issue might actually be one with Imperial - it's one of the classic "I don't like this because the hostname doesn't match the reverse lookup" errors that GSI throws.
Ewan Mac Mahon: (11:14 AM)
Also, just for information, Paul Hopkins took up the offer of an account on the Oxford testing UI, so he's got that to play with as well, but we should make sure we follow up on the cernvm front too.
Matt Doidge: (11:14 AM)
Our VAC reinstall is a roundtuit that we should hopefully get roundtu this week.
Samuel Cadellin Skipsey: (11:15 AM)
I have a half-finished CernVM here now to play with (but have other stuff also eating into my time).
It would be good to get them to test multiple endpoints from both.
Tom Whyntie: (11:16 AM)
@Ewan Cool. If PH gets the same error from your Oxford UI that would be very useful to know :-)
@Ewan (Though I'm guessing now it's gone quiet it's working? That's what I'm finding as a general rule...)
Andrew McNab: (11:17 AM)
UCL Vac: LHCb http://lgm.cern.ch/?r=day&cs=&ce=&c=VAC.UKI-LT2-UCL-HEP.uk&h=&tab=m&vn=&hide-hf=false&stack=&show_dhosts=
Samuel Cadellin Skipsey: (11:17 AM)
Which is depressing, as it would be nice to know what is or isn't a problem for them, Tom :D
Tom Whyntie: (11:17 AM)
@Sam ;-)
Ewan Mac Mahon: (11:18 AM)
He logged into it for the first time this morning and he's still logged in, so I think it's still WIP, so I'm not concerned that we haven't heard back yet.
Daniela Bauer: (11:21 AM)
Right, my Vidyo died again and I can't hear anything.
I think as far as the ops portal is concerned the storage is fine and if we keep it in that state for a while the low availability alarm might just go away
I'll try and reconnect
Federico Melaccio: (11:22 AM)
the low availability alarm should go away, as numbers are rising in the plot http://operations-portal.egi.eu/availability/siteAvailabilities/type/Zoomline/site/UKI-LT2-UCL-HEP
Jeremy Coles: (11:26 AM)
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
Ewan Mac Mahon: (11:27 AM)
I think that sounds like a plan. The principle is to get on with decomissioning all the traditional site services, but (particularly with the SE) we need to have a bit of a think about how to do that appropriately.
Hmm. It does indeed say that site argus servers must not be exposed to the internet. Not quite sure I understand why though.
Duncan Rand: (11:42 AM)
Is this a known problem: http://accounting.egi.eu/tier2.php?SubTier2=1.35&query=cpueff&startYear=2015&startMonth=3&endYear=2015&endMonth=3&yRange=SITE&xRange=VO&voGroup=all&chart=GRBAR&scale=LIN&localJobs=onlygridjobs
Ewan Mac Mahon: (11:45 AM)
That's a lot of numbers. What's the problem exactly?
Duncan Rand: (11:45 AM)
ATLAS > 100%
Presumably a multi-core issue.
Ewan Mac Mahon: (11:48 AM)
Is it actually a problem, or is that just how the portal reports multicore? The legend under the chart does denote the cyan as meaning "eff >= 100% (parallel jobs)"
Sort-of sounds deliberate.
Duncan Rand: (11:49 AM)
OK, but the total CPU is greater than total wall-clock. Does that make sense to you?
Ewan Mac Mahon: (11:51 AM)
I think so? If a job does two cores for an hour, that's 1h wall time, 2h CPU time. So if you've actually got a minority of eight-core jobs mixed with mostly single core jobs, getting an overall efficiency of ~150% doesn't sound crazy.
Samuel Cadellin Skipsey: (11:51 AM)
Yes, Ewan has it right.
Wallclock is literally just the total time taken.
Federico Melaccio: (11:54 AM)
I agree
Ewan Mac Mahon: (11:56 AM)
OK, well they should make sure they've dealt with it as per the advisory, which should be a fairly straightforward matter of just doing the updates.
Jeremy Coles: (11:57 AM)
Matt - I need to drop out for another meeting (hopefully the last time I need to do this)... thanks for chairing the end of the meeting.
raul: (11:59 AM)
I believe I republished yesterday. I've got to confirm it yet
Ok. I'll check later this week. thanks.
Tom Whyntie: (12:04 PM)
Thanks, bye