LCG Web>WLCGCommonComputingReadinessChallenges>WLCGOperationsWeb>WLCGOpsCoordination>WLCGOpsMinutes160929 (2018-02-28, MaartenLitmaath)

EditAttachPDF

WLCG Operations Coordination Minutes, September 29th 2016

Highlights

Reminder to install the necessary patches as per the critical EGI Advisory-SVG-2016-11476, deadline to update & restart affected services: 2016-09-29 00:00 UTC!
- Mind that also the Worker Nodes should be patched (while not affected), to avoid false positives in the EGI security monitoring.
The Sept 2016 WLCG MB approved the IPv6 deployment plan. Dual stack availability is mandatory for the Tier0 and Tier1s by April 2017, at least on a testbed. Subsequent deployment deadlines in the relevant detailed report.
A draft questionnaire on potential Lightweight Site alternatives was presented and discussed. The questionnaire is expected to be sent out still before CHEP, so that trends from early results may be included in the corresponding presentation there.

Agenda

https://indico.cern.ch/event/540422/

Attendance

local: Alessandra Forti (chairperson), Maria Dimou (minutes), Alberto Aimar, Maarten Litmaath, Julia Andreeva, Andrea Sciaba, Andrea Manzi, Raja Nandakumar.
remote: Di Qing, Andreas (KIT), Andrew McNab, Shawn McKee, Marian Babik, Javier Sanchez, Dave Mason, Dave Dykstra, Giuseppe Bagliesi, Stephan Lammel, Alessandra Doria, Renaud Vernet, Pepe Flix, Paige Winslowe Lacesso.
Apologies: Nurcan Ozturk (ATLAS report), Ulf Tigerstedt (NDGF report), Maria Alandes (Info Sys TF report).

Operations News

Meetings:
- See you at the WLCG Workshop, co-located with CHEP @ SF: https://indico.cern.ch/event/555063/timetable/#20161008
- Next WLCG Ops Coord meeting will take place on Nov. 3rd https://indico.cern.ch/event/540423/. Theme is FTS. The Steering committee news and impact on the experiments and WLCG Operations.

Middleware News

Baselines/News:
- a new version of the WN bundle has been published to CVMFS ( /cvmfs/grid.cern.ch/emi-wn-3.17.1-1_sl6v1), it includes gfal2 v 2.11 (GGUS:123994)
- New version of edg-mkgridmap released (4.0.4) to fix a problem on SL7/C7, available in EPEL-testing, soon in UMD

Issues:
- critical EGI Advisory-SVG-2016-11476, deadline to update & restart affected services: 2016-09-29 00:00 UTC. Sites have to update their services and also their WNs. The rpms concerned with this vulnerability are included also on the WN, which is not affected, but is the target of the standard EGI security monitoring. The advisory didn't mention the WNs. That will be improved next time. We cannot easily find out if a site applied the patch on its services.
- GGUS:124136: all LHCb transfers failed after FTS upgrade. The issue is described at OTG:0033158 and INC:1143771. A fix has been deployed to allow LHCb transfer to be executed. Had LHCb participated in the MW Readiness effort, this FTS issue would have been discovered during verification, before operation. Raja conveyed the invitation for more active participation to the LHCb computing management.
- GGUS:120586 concerning an issue with glite-ce-* clients with dual stack CEs. No news yet. How much is affecting for LHCb? Raja said that dual stack no more affects LHCb. Nevertheless, this is a bug that CERN has to fix when the next CREAM client version is released.
- EOS instabilities ( check T0 report)
T0 and T1 services
- CERN
  - check T0 report
- IN2P3
  - XRootD proxy servers dedicated to Atlas and CMS upgraded to version 4.4.0-1
- KIT
  - Update of dCache for ATLAS and CMS to 2.13.44 on 28th and 26th of September respectively.
  - xrootd activity for CMS should now be reported to CMS-AAA-EU-COLLECTOR.cern.ch.
- NDGF
  - Major dCache upgrade to v 3.0.0 2 weeks go
  - minor upgrade today to fix a bug on file upload ( check T1 feedback)
- PIC
  - dCache upgrade to v 2.13.42
- RRC-KI
  - dCache upgrade to v 2.10.61
  - EOS upgrade to 0.3.197-1

Tier 0 News

Highlights

The bulk of the computing services is provided as LSF shared instance (70 k cores), HTCondor (26 k cores), a dedicated LSF instance for the ATLAS Tier-0 (12 k cores), and 21.5 k cores as cloud resource for the CMS Tier-0. Limits of LSF become visible (the limit of 5'000 worker nodes has been reached, and there was an occasion where the number of jobs in the system has grown larger than what LSF could reasonably handle). New capacity will be added to HTCondor.
On the request of the experiment validation teams, a configuration of worker nodes under CC7 is being worked on.
The external cloud resources (4 k cores at T-Systems) are running various job types from the experiments.
During September about 5.2 PB have been recorded by CASTOR.
Minor upgrades (maintenance) have been performed on EOS, aligning to 0.3.197. In parallel a transparent background upgrade campaign for the FTS disk server daemons is taking place that will finish by mid October.

Issues

An LSF failure is being followed up. Work is going on to ensure that in case of failover, no "ghost" jobs are created.
A scheduled FTS upgrade (moving the VM to CC7 and upgrading the database) had to be rolled back, since all LHCb transfers were failing (while the other experiments were fine). The issue is being analysed with LHCb, putting the upgrade on hold.
Some instabilities affecting EOSATLAS and EOSCMS have been observed and are being studied.
The incident affecting several services (ATLAS users disappearing from the ATLAS group) was created by a manual error upstream (wrong update in egroup having the master information) which was propagated across many services.

Plans

There is progress on the third link to Wigner; the timescales will hopefully become clear soon.

Tier 1 Feedback

NDGF-T1 (Ulf offline during the meeting): The dCache 3.0.0 snapshot running for 1½ weeks has had a bug where files uploaded with auto-created directories along a path got buggy permissions on the directories newly created (0664). These then caused the file upload to fail, since the user could not access the directory (--x missing). This was mitigated by mass update of the modes during the week, and fixed by an update of dCache on Thursday. Atlas noticed this and reported it for atlasscratchdisk, presumably alice also got hit by it but didn't notice it. No long-lasting problems come out of it since we can fix the buggy directories with a simple database query.

Tier 2 Feedback

Experiments Reports

ALICE

High activity on average
Central services were unavailable from Sep 14 evening to Sep 15 afternoon
- a big network intervention made them unreachable for many hours
- all grid and user activity for ALICE was stopped for that period
- in parallel the File Catalog was moved to a more powerful new machine
CERN: team ticket GGUS:123929 opened Sep 15 late afternoon
- CREAM / LSF was not working for ALICE
- fall-out from OTG:0032902
- converted to ALARM Sep 16 afternoon
- LSF info provider issue got fixed
- job submissions resumed late afternoon

ATLAS

Grid has been running in full capacity with the new Sherpa Monte-Carlo production jobs, as well as with the low priority samples.
group zp issue last week, e-group deletion attempt done by a power user caused 10646 accounts removed from the zp group, followed up in SNOW (INC1136368) and restored.
EOSATLAS service was degraded September 27th night (high latencies between diskservers and the EOSATLAS headnode), stabilized late last night by the EOS team, IT ticket OTG0033142, shortage of disk space at Meyrin, geotagging disabled.
Problem with the xrootd 4.2.3 clients accessing input files on EOS in Athena 21.x releases, not fixed by upgrading to the newer clients 4.3.0 and 4.4.0, problem is related to the choice of the compiler version and how the thread code gets optimized by the compiler, a fix was found on Monday, now we are waiting for an official xrootd clients release. Hopefully a release 4.2.4 will happen before the weekend and 4.4.1 next week.
ATLAS Software and Computing week this week, discussions for a new production system component for better resource provisioning started.

CMS

about four more weeks of proton-proton running then proton-lead
re-reconstruction campaign of early 2016 data began this weekend
tape-resident data deletion campaign going well, tape repacking at first sites started
preparing pileup libraries for next large Monte Carlo campaign

CMS transfer team is asking sites with older xrootd version to upgrade to v4.4.0 and sites using DPM to upgrade to 1.8.11

the networking issues at CERN and Nebraska together with the downtime at Fermilab and a rogue server in Pakistan caused significant xrootd failures lasting into this week
Hammer Cloud outage last week quickly resolved, thanks Andrea!
Tier-0 transfer system was down for a weekend due to a hypervisor reboot and service not automatically starting

LHCb

Mostly simulation and user jobs now on the grid. Data processing jobs not taking too many slots
SARA downtime a little confusing - especially given that the tapes were already moved 2 weeks ago.
CERN FTS issue (GGUS:124136) - solved promptly after alarm ticket opened.
Request sites installing ARC CEs to ensure that the publishing is correct, and by VO for the numbers that matter (running, waiting jobs). Default ARC installation does not currently publish correct numbers. Alessandra said that sites use puppet and this fix isn't included in puppet. HEP-puppet is a comunity effort and nobody has inserted the patch yet. Those not using puppet have to do it by hand and the instructions are in the GridPP doc.

Ongoing Task Forces and Working Groups

Accounting TF

Generation of T1 accounting reports is enabled in the new accounting portal
Automation of validation of data in the EGI accounting portal is enabled.
LHCb put in place API which allows to retrieve accounting data from Dirac. Comparison of DIRAC data with data in the EGI portal is now available
Follow up on sites which show substantial difference of data in the EGI portal and experiment-specific accounting systems
Good progress was demonstrated in work on improvement of T0 accounting

Andrea S. asked about the Pisa problem. Julia said that there were several independent issues from the Pisa numbers provided and also on the APEL side. Things are getting better now.

Information System Evolution

An IS TF meeting took place on 22nd of September. Information sources and main functionality of central CRIC were discussed. Aligment with EGI plans on moving more information to GOCDB was agreed. There is on going progress on the defined actions.
Next IS TF meeting will take place on 10th November. VOfeed structure and integration with CRIC will be discussed.

IPv6 Validation and Deployment TF

See slides.

Andrea S. & Alastair Dewhurst gave a short presentation on the TF progress. The Sept 2016 MB approved the IPv6 deployment plan. Dual stack availability is mandatory for the Tier0 and Tier1s by April 2017, at least on a testbed. By April 2018 dual stack should be available in production for the Tier0 and Tier1s. By the end of Run2 a large number of sites should have migrated their storage to IPv6. Alessandra asked about Tier2s' deadlines. Andrea S. said the plan applies to Tier2s as well but deadlines are not as strict. Most Tier1 sites gave positive commitment to the plan. Some haven't replied yet. He emphasised that several Tier2s already have IPv6, especially the GridPP ones. The TF is now becoming more active and participation is welcome. Maria D. suggested the creation of a dedicated GGUS SU to monitor progress with the deployment.

Machine/Job Features TF

MJF values used in ongoing LHCb fast benchmarking evaluation (see last GDB)
Some local config errors found during this exercise

Monitoring

Status

MONIT portal: http://monit.cern.ch (open to all CERN authenticated users).

Contains all raw FTS, Xrootd, ETF data. Some Job Monitoring data.
Examples of dashboards are available.

Had a couple of months of instability and we needed to work on the infrastructure and resources.
Worked with the ES (ElasticSearch) service in order have a separate ES MONIT instance. Continued to help in the benchmarking of ES resources.

Next Steps:

Will soon add a link from the existing FTS Dashboard to the new MONIT portal with FTS dashboards. The new portal is being tuned and there maybe be glitches (or timeouts if you select longs time ranges...), but all FTS data is available.

We are getting to a phase where we need closer and defined contact with WLCG representative (VOs, sites) to show were we are and work together for the WLCG use cases. Details on the organization being discussed with WLCG Operations.

MW Readiness WG

The agenda of the 2/11 meeting http://indico.cern.ch/e/MW-Readiness_19 is taking shape and the twiki is reachable from there. Maria will prepare the table of jira tickets' status closer to the date, so please, record all progress in jira or email the e-group wlcg-ops-coord-wg-middleware at cern.
WN and UI rpm for EL7 have been prepared ( with the clients/lib available now on EL7) and pushed to UMD preview repo for testing ( MWREADY-135 and MWREADY-128). Looking for sites available for the validation
We'd like to debate at this meeting the future of the WG. It completes 3 years of life in December. Some products are now verified for Readiness "be default" see examples here. Other products and 2/4 experiments never embarked this effort. Participation is declining. It is a good moment to review the continuation/transformation/dissolution of the WG.
This idea was circulated in email on 22/9. Alessandra's feedback is the WG should remain alive even if meetings are not very frequent. Example reason: CentOS7 will require some coordination and it seems you are the bridge with EGI. The MW Readiness jira tickets are useful, e.g. https://its.cern.ch/jira/browse/MWREADY-128 and https://its.cern.ch/jira/browse/MWREADY-135

Network and Transfer Metrics WG

Network session at the WLCG workshop
- Q&A session planned, questions will be sent in advance, we encourage all to participate
- Inder Monga (Director of ESNet) will join the session
LHCOPN/LHCONE workshop was held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
- GEANT reported peaks over 100GBps and growth of over 65% from Q2 2015 to Q2 2016
- ESNet reported that LHCONE traffic has increased 118% in the past year
- Positive feedback received on the LHC Network Evolution talk
pre-GDB on networking focusing on the long-term network evolution planned on January 10th - save the date
Throughput meetings were held on 15th Sept:
- Hendrik Borras (Univ. of Heidelberg) presented early results on the network telemetry based on perfSONAR
perfSONAR 4.0 RC1 was released, RC2 planned in October with final release sometime in November
We are now using a new mailing list wlcg-network-throughput-wg@cernNOSPAMPLEASE.ch - joint mailing list for European and NA throughput meetings
WLCG Network Throughput Support Unit: New cases were reported on IPv6 and are being followed up, see twiki for details.

RFC proxies

VOMS clients 3.0.7 have been released by the VOMS product team
- http://italiangrid.github.io/voms/2016/09/09/voms-release-16-09.html
- being tested for integration into the UMD
a new version of YAIM core for myproxy-init on the UI is available
- temporary location: here
they should appear together in an upcoming UMD update

Some progress on the EGI side. OSG has already switched. Raja asked how easy it is to use them today. Maarten said that one needs to supply an explicit option or environment variable today, but with the UMD update in October they will become the default. For voms-proxy-init the default will simply change, while the UI env. variable today makes myproxy-init upload legacy proxies, but there will be the necessary publicity on what will need to be done to change to the RFC ones. Raja would like to see the switch on the lxplus UI. Maarten confirmed that lxplus would be adjusted as soon as the release is available in the UMD. He reminded that RFC proxies are already in use today (e.g. SAM tests already use them), not enforced but recommended, because legacy proxies have started giving problems in some areas. By the end of the year RFC proxies should have become standard everywhere.

Squid Monitoring and HTTP Proxy Discovery TFs

http://wlcg-wpad.cern.ch/wpad.dat now returns proxies based on the squids registered in CMS SITECONF in addition to ATLAS AGIS. You can test how it would behave for different source IP addresses by adding "?ip=N.N.N.N" to the end of the URL.
When a proxy service is registered in CMS as "loadbalance=proxies", the javascript chooses a random number and selects a different ordering of proxies based on that. See for example http://wlcg-wpad.cern.ch/wpad.dat?ip=138.16.225.14
http://wlcg-squid-monitor.cern.ch/worker-proxies.json contains the combined input to the service.
There will be a talk on this at CHEP.

Traceability and Isolation WG

No report

Theme: Lightweight Site

Presentation by Maarten. Slides on the agenda. Storage not considered for now. Each OSG site (mainly) supports only one LHC experiment. EGI is more complex (e.g. more experiments per site and more supported MW packages). So, we should learn from the OSG sites and concentrate on the EGI ones. A questionnaire is being prepared. Its draft was presented for discussion. During the discussion:

Lukasz asked how to 'enforce' APEL Accounting for HTCondor CE. Maarten will include it in the questionnaire to have people aware of the pending work involved. Julia committed to include it in the Accounting TF.
Raja asked whether the Network requirements should be included in the questionnaire. Maarten and Alessandra said the Lightweight sites typically will be small, up to a few thousand cores at most. If such a site were used for MC simulation only, little bandwidth would be needed. In general the network requirements are driven by data input and output patterns and hence should be discussed in the Data Management Coordination group, also for lightweight sites. The issues met so far with the Tier0 "cloud" experience (T-Systems) were mainly due to the amounts of data input and output by jobs, which were not always commensurate with the given network capacity. Raja thinks some recommendation of prerequisite network conditions should be included. However, a small site with few human and computing resources will not be taken seriously if it asks for a very fast network capacity. Also, the opposite, a very rich local infrastructure with very low connectivity will be very unbalanced. Julia said that this issue will be raised e.g. in the Data Management session at the San Francisco WLCG Workshop.
Maria asked whether everyone knows the DMZ acronym on slide 9 (it stands for DeMilitarised Zone). For the record: https://en.wikipedia.org/wiki/DMZ_%28computing%29
She also asked why to include question 11 Allow remote access to a DMZ for the experiment(s)? altogether given that it doesn't scale well, it requires the manual intervention by a remote expert (as today) and it introduces the possibility to remotely login as root outside the site's firewall. It should be checked with the WLCG security experts. Maarten said this is already being practiced by US-CMS at T3 sites in California and, to a lesser extent, without root access, today via the use of VOboxes by ALICE, LHCb and CMS.
Andrew will send another version of question 10 (to split it into 2). It now says: Could your site supply WNs dedicated to the experiment(s)?
Julia suggested the VOs should also see the questionnaire.
After the meeting, Maria suggested to Maarten to check the site survey https://wlcg-survey.web.cern.ch/ we issued in the autumn of 2014 to make sure no question is forgotten before publishing the questionnaire. For example, ask for the contributors' emails so that we can get back to them. Results https://twiki.cern.ch/twiki/bin/view/LCG/WLCGSiteSurvey

Secondary theme: Tier1 downtime announcements

Devise an algorithm for the announcement to be done earlier if the downtime is likely to last longer.

This will be moved to the next meeting because the initiators in ATLAS and LHCb are absent.

Action list

Creation date	Description	Responsible	Status	Comments
01.09.2016	Collect plans from sites to move to EL7	WLCG Operations	On-going	The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF will stay on SL6 for now but they plan to go directly to EL7 early 2017. Other ATLAS sites e.g. Triumf are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress.

Specific actions for experiments

Creation date	Description	Affected VO	Affected TF	Comments	Deadline	Completion
29.04.2016	Unify HTCondor CE type name in experiments VOfeeds	all	-	Proposal to use HTCONDOR-CE. Still not done for ALICE. Raja will ask the status for LHCb.		Ongoing

Specific actions for sites

Creation date	Description	Affected VO	Affected TF	Comments	Deadline	Completion

AOB

Please check this short video presenting the WLCG Ops portal and contact Maria D. if you wish to record a short video promoting your services. More examples here and more details in https://twiki.cern.ch/ELearning

Raja would like to know why the tape system at SARA will be down for another 14 days given that it has been moved already. There was no NL_T1 representative today. Maria suggested to bring this up at next Monday's 3pm call.

MariaDimou - 2016-09-21

Topic revision: r62 - 2018-02-28 - MaartenLitmaath

LCG Wikis

LCG Service
Coordination

LCG Grid
Deployment

LCG
Apps Area

Public webs

Welcome Guest

- Cern Search
- TWiki Search
- Google Search
LCG All webs

Copyright &© 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback